Learning probabilistic finite automata

Learning probabilistic finite automata Colin de la Higuera University of Nantes

Nantes, November 2013

Acknowledgements

Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini,... The list is necessarily incomplete; apologies to those who have been forgotten.

http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/ Chapters 5 and 16


Outline

1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions

1 PFA Probabilistic finite (state) automata


Practical motivations

(Computational biology, speech recognition, web services, automatic translation, image processing, …)

- A lot of positive data
- Not necessarily any negative data
- No ideal target
- Noise

The grammar induction problem, revisited

- The data consists of positive strings, «generated» following an unknown distribution
- The goal is now to find (learn) this distribution, or the grammar/automaton that is used to generate the strings

Success of the probabilistic models

- n-grams
- Hidden Markov Models
- Probabilistic grammars

[Figure: a DPFA over {a, b}; transitions carry probabilities such as 1/2, 1/3, 2/3 and 3/4, and states carry final probabilities such as 1/4.]

DPFA: Deterministic Probabilistic Finite Automaton

[Figure: the same DPFA with the path reading abab highlighted.]

Pr_A(abab) = 1/2 × 1/2 × 1/3 × 2/3 × 3/4 = 1/24
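Such a computation is easy to script. Below is a minimal sketch; since the slide's figure does not survive in text form, the two-state DPFA used here is a made-up example with the same flavour (fractional probabilities, a single initial state), not the automaton on the slide:

```python
from fractions import Fraction as F

# A hypothetical two-state DPFA (illustrative numbers only).
# delta[state][symbol] = (next_state, probability)
delta = {
    0: {'a': (1, F(1, 2)), 'b': (0, F(1, 4))},
    1: {'a': (0, F(1, 3)), 'b': (1, F(1, 3))},
}
final = {0: F(1, 4), 1: F(1, 3)}   # halting probability of each state
initial = 0                        # a DPFA has a single initial state

def dpfa_prob(w):
    """Probability of string w: product of the transition probabilities
    along the unique path, times the final probability of the last state."""
    p, q = F(1), initial
    for sym in w:
        if sym not in delta[q]:
            return F(0)
        q, t = delta[q][sym]
        p *= t
    return p * final[q]

print(dpfa_prob('ab'))   # 1/2 * 1/3 * 1/3 = 1/18
```

Note that every state is consistent here: its final probability plus its outgoing transition probabilities sum to 1.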

[Figure: a DPFA over {a, b} with numeric probabilities (0.9, 0.1, 0.7, 0.35, 0.3, 0.65).]

[Figure: a PFA over {a, b} with the same fractional probabilities as before; this automaton is not deterministic.]

PFA: Probabilistic Finite (state) Automaton

[Figure: the same automaton with some transitions labelled ε.]

ε-PFA: Probabilistic Finite (state) Automaton with ε-transitions

How useful are these automata?

- They can define a distribution over Σ*
- They do not tell us if a string belongs to a language
- They are good candidates for grammar induction
- There is (was?) not that much written theory

Basic references

- The HMM literature
- Azaria Paz 1973: Introduction to Probabilistic Automata
- Chapter 5 of my book
- Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco
- Grammatical Inference papers

Automata, definitions

Let D be a distribution over Σ*:

0 ≤ Pr_D(w) ≤ 1

Σ_{w∈Σ*} Pr_D(w) = 1

A Probabilistic Finite (state) Automaton is a tuple:

- Q: a set of states
- I_P: Q → [0; 1] (initial probabilities)
- F_P: Q → [0; 1] (final probabilities)
- δ_P: Q × Σ × Q → [0; 1] (transition probabilities)

What does a PFA do?

- It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:
- Pr_A(w) = Σ_{π ∈ paths(w)} Pr(π)
- π = q_{i0} a_{i1} q_{i1} a_{i2} … a_{in} q_{in}
- Pr(π) = I_P(q_{i0}) · F_P(q_{in}) · Π_j δ_P(q_{i(j−1)}, a_{ij}, q_{ij})
- Note that if λ-transitions are allowed the sum may be infinite

[Figure: a non-deterministic PFA with initial weight 0.7, transition probabilities 0.4, 0.1, 0.2, 0.3, 0.35, 0.45 and a final weight 1.]

Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532

- non-deterministic PFA: many initial states (vs. only one initial state)
- λ-PFA: a PFA with λ-transitions and perhaps many initial states
- DPFA: a deterministic PFA

Consistency

A PFA is consistent if
- Pr_A(Σ*) = 1
- ∀x∈Σ*: 0 ≤ Pr_A(x) ≤ 1

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

∀q∈Q: F_P(q) + Σ_{q'∈Q, a∈Σ} δ_P(q, a, q') = 1
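The condition of the theorem is a per-state sum, so it is cheap to verify. A sketch of the check, run on a hypothetical two-state automaton (the numbers are illustrative, not taken from a slide):

```python
from fractions import Fraction as F

def is_consistent(final, delta, states):
    """Check the sufficient condition of the consistency theorem:
    for every (useful) state q, F_P(q) plus the sum of the
    probabilities of q's outgoing transitions equals 1."""
    for q in states:
        total = final.get(q, F(0))
        for (src, sym, dst), p in delta.items():
            if src == q:
                total += p
        if total != 1:
            return False
    return True

# Hypothetical two-state automaton (illustrative numbers only).
delta = {(0, 'a', 1): F(1, 2), (0, 'b', 0): F(1, 4),
         (1, 'a', 0): F(2, 3)}
final = {0: F(1, 4), 1: F(1, 3)}
print(is_consistent(final, delta, [0, 1]))  # True
```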

Equivalence between models

- Equivalence between PFA and HMM…
- But the HMMs usually define distributions over each Σⁿ

A football HMM

[Figure: a three-state HMM; each state emits win/draw/lose (with probabilities 1/2, 1/4, 1/4 in some order per state), and transition probabilities include 1/4, 1/2 and 3/4.]

Equivalence between PFA with λ-transitions and PFA without λ-transitions (cdlh 2003, Hanneforth & cdlh 2009)

- Many initial states can be transformed into one initial state with λ-transitions;
- λ-transitions can be removed in polynomial time;
- Strategy:
  - number the states
  - eliminate first the λ-loops, then the transitions with highest-ranking arrival state

PFA are strictly more powerful than DPFA

- Folk theorem
- And you can't even tell in advance whether you are in a good case or not (see: Denis & Esposito 2004)

Example

[Figure: a PFA over {a} with transition probabilities 1/2, 2/3, 1/3, 1/2.]

This distribution cannot be modelled by a DPFA.

What does a DPFA over Σ = {a} look like?

[Figure: a chain of a-transitions ending in a self-loop: the general shape of a DPFA over a one-letter alphabet.]

And with this architecture you cannot generate the previous one.

Parsing issues

- Computation of the probability of a string or of a set of strings
- Deterministic case:
  - Simple: apply the definitions
  - Technically, rather sum up logs: this is easier, safer and cheaper

[Figure: the earlier DPFA with numeric probabilities (0.9, 0.1, 0.7, 0.35, 0.3, 0.65).]

Pr(aba) = 0.7 × 0.9 × 0.35 × 0 = 0
Pr(abb) = 0.7 × 0.9 × 0.65 × 0.3 = 0.12285

Non-deterministic case

[Figure: the earlier non-deterministic PFA (0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.45, 1).]

Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532

In the literature

- The computation of the probability of a string is by dynamic programming: O(n²m)
- 2 algorithms: Backward and Forward
- If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm

Forward algorithm

- A[i, j] = Pr(q_i | a_1…a_j) (the probability of being in state q_i after having read a_1…a_j)
- A[i, 0] = I_P(q_i)
- A[i, j+1] = Σ_{k ≤ |Q|} A[k, j] · δ_P(q_k, a_{j+1}, q_i)
- Pr(a_1…a_n) = Σ_{k ≤ |Q|} A[k, n] · F_P(q_k)
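The four lines above translate almost directly into code. A minimal sketch, run on a small hypothetical PFA (the slides' figures are only partially recoverable, so the automaton here is illustrative):

```python
def forward_prob(string, states, init, final, delta):
    """Forward algorithm: A[i][j] is the probability of being in
    state i after reading the first j symbols of the string.
    delta maps (q, a, q2) to a transition probability."""
    n = len(string)
    # A[i][0] = I_P(q_i)
    A = {q: [init.get(q, 0.0)] + [0.0] * n for q in states}
    for j, a in enumerate(string):
        for qi in states:
            A[qi][j + 1] = sum(A[qk][j] * delta.get((qk, a, qi), 0.0)
                               for qk in states)
    # Pr(a_1..a_n) = sum_k A[k][n] * F_P(q_k)
    return sum(A[q][n] * final.get(q, 0.0) for q in states)

# Hypothetical non-deterministic PFA: from state 0, reading 'a'
# either loops or moves to the absorbing final state 1.
states = [0, 1]
init = {0: 1.0}
final = {1: 1.0}
delta = {(0, 'a', 0): 0.5, (0, 'a', 1): 0.5}
print(forward_prob('a', states, init, final, delta))  # 0.5
```

With this automaton Pr(aⁿ) = 0.5ⁿ, so the probabilities of all strings sum to 1, as consistency requires.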

2 Distances

What for?

- Estimate the quality of a language model
- Have an indicator of the convergence of learning algorithms
- Construct kernels

2.1 Entropy

- How many bits do we need to correct our model?
- Two distributions over Σ*: D and D'
- Kullback-Leibler divergence (or relative entropy) between D and D':

Σ_{w∈Σ*} Pr_D(w) × |log Pr_D(w) − log Pr_D'(w)|
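For distributions with finite support the sum can be evaluated directly. A sketch, following the slide's formula (absolute value of the log-ratio included); the two dictionaries are illustrative:

```python
import math

def kl_divergence(D, D2):
    """Relative entropy between two distributions given as dicts
    mapping strings to probabilities, per the slide's formula
    (with the absolute value of the log-ratio)."""
    total = 0.0
    for w, p in D.items():
        if p > 0:
            total += p * abs(math.log2(p) - math.log2(D2[w]))
    return total

D  = {'a': 0.5, 'b': 0.5}
D2 = {'a': 0.25, 'b': 0.75}
print(kl_divergence(D, D))   # 0.0: identical distributions
```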

2.2 Perplexity

- The idea is to allow the computation of the divergence, but relative to a test set S
- An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set

( Π_{w∈S} Pr_D(w) )^(−1/|S|)

Problem if some probability is null...
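A sketch of the computation; summing logs, as recommended earlier for parsing, avoids underflow on long test sets (the probability values are illustrative):

```python
import math

def perplexity(probs):
    """Inverse of the geometric mean of the probabilities of the
    test strings: (prod p_w) ** (-1/|S|)."""
    if any(p == 0 for p in probs):
        raise ValueError("null probability: perplexity is infinite")
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))

# 0.5 * 0.125 = 2**-4 over 2 strings: geometric mean 1/4, perplexity 4
print(perplexity([0.5, 0.125]))  # 4.0
```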

Why multiply? (1)

- We are trying to compute the probability of independently drawing the different strings in the set S

Why multiply? (2)

- Suppose we have two predictors for a coin toss:
  - Predictor 1: heads 60%, tails 40%
  - Predictor 2: heads 100%
- The tests are H: 6, T: 4
- Arithmetic mean:
  - P1: 36% + 16% = 0.52
  - P2: 0.6
- Predictor 2 would be the better predictor ;-)
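The slide's numbers can be checked directly: under the arithmetic mean predictor 2 looks better, while multiplying (i.e. scoring the probability of the whole observed sequence) rightly prefers predictor 1:

```python
# H: 6 observed heads, T: 4 observed tails
obs = ['H'] * 6 + ['T'] * 4

p1 = {'H': 0.6, 'T': 0.4}   # predictor 1
p2 = {'H': 1.0, 'T': 0.0}   # predictor 2

def arithmetic_mean(pred):
    return sum(pred[o] for o in obs) / len(obs)

def product(pred):
    prob = 1.0
    for o in obs:
        prob *= pred[o]
    return prob

print(arithmetic_mean(p1), arithmetic_mean(p2))  # 0.52 and 0.6: P2 "wins"
print(product(p1), product(p2))  # ~0.0012 and 0.0: P1 wins
```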

2.3 Distance d2

d2(D, D') = Σ_{w∈Σ*} ( Pr_D(w) − Pr_D'(w) )²

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

3 FFA Frequency Finite (state) Automata


A learning sample

- is a multiset
- Strings appear with a frequency (or multiplicity)
- S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that
- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts.

Example

[Figure: an example DFFA; each state carries a halting frequency (6, 3, 2, 1, …) and each transition a frequency (a: 2, a: 1, b: 5, b: 3, a: 5, b: 4).]

From a DFFA to a DPFA

Frequencies become relative frequencies by dividing by the sum of the exiting frequencies.

[Figure: the same automaton with normalized labels: a: 2/6, a: 1/7, 2/7, 1/6, b: 5/13, b: 3/6, a: 5/13, b: 4/7, …]
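A sketch of the conversion; the automaton encoding and the frequencies below are illustrative, not the slide's figure:

```python
from fractions import Fraction as F

def dffa_to_dpfa(final_freq, trans_freq):
    """Divide every frequency by the total frequency leaving its state
    (halting frequency plus outgoing transition frequencies).
    trans_freq maps (state, symbol) to (next_state, count)."""
    totals = dict(final_freq)
    for (q, a), (q2, c) in trans_freq.items():
        totals[q] = totals.get(q, 0) + c
    final_p = {q: F(c, totals[q]) for q, c in final_freq.items()}
    trans_p = {(q, a): (q2, F(c, totals[q]))
               for (q, a), (q2, c) in trans_freq.items()}
    return final_p, trans_p

# Hypothetical DFFA: state 0 halts 3 times, exits on a twice and b once.
final_freq = {0: 3, 1: 1}
trans_freq = {(0, 'a'): (1, 2), (0, 'b'): (0, 1), (1, 'a'): (0, 6)}
final_p, trans_p = dffa_to_dpfa(final_freq, trans_freq)
print(final_p[0])  # 3/6 = 1/2
```

By construction each state of the resulting DPFA satisfies the consistency condition: its final probability and outgoing probabilities sum to 1.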

From a DFA and a sample to a DFFA

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

[Figure: the DFA annotated with the frequencies obtained by parsing S: a: 2, a: 1, b: 5, b: 3, a: 5, b: 4, …]

Note

- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch (EM) algorithm has been invented for…

The frequency prefix tree acceptor

- The data is a multiset
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed

From the sample to the FTA

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), a tree-shaped FFA rooted at a state of frequency 14, with edge frequencies a: 7, b: 4, a: 6, a: 4, b: 2, b: 1, … and halting frequencies matching the sample.]
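A sketch of the construction, using the sample above; prefixes serve as state names, and `build_fta` is an assumed helper name, not one from the slides:

```python
from collections import defaultdict

def build_fta(multisample):
    """Frequency prefix tree acceptor: every state is a prefix.
    state_freq counts the strings passing through a prefix,
    final_freq counts the strings ending exactly there."""
    state_freq = defaultdict(int)
    final_freq = defaultdict(int)
    trans_freq = defaultdict(int)
    for w, count in multisample.items():
        state_freq[''] += count
        for i in range(len(w)):
            trans_freq[(w[:i], w[i])] += count
            state_freq[w[:i + 1]] += count
        final_freq[w] += count
    return state_freq, final_freq, trans_freq

# The multisample from the slide ('' stands for the empty string λ):
S = {'': 3, 'aaa': 4, 'aaba': 2, 'ababa': 1, 'bb': 3, 'bbaaa': 1}
state_freq, final_freq, trans_freq = build_fta(S)
print(state_freq[''], trans_freq[('', 'a')])  # 14 7
```

The root frequency 14 and the edge frequencies a: 7 and b: 4 match the figure's counts.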

Red, Blue and White states

- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others

[Figure: a prefix tree whose states are coloured Red, Blue and White.]

Same as with DFA and what RPNI does.

Merge and fold

Suppose we decide to merge state b with state a.

[Figure: an FFA with root frequency 100 and transitions a: 26, b: 24 from the root; below, transitions a: 10, b: 6, a: 6, b: 24, a: 4, a: 4, b: 9 over states of frequencies 10, 60, 6, 4, 11, 9.]

Merge and fold

First disconnect state b, and reconnect its incoming transition (b: 24) to state a.

[Figure: the same FFA with the b: 24 transition now entering state a.]

Merge and fold

Then fold: the subtree that was rooted at b is recursively folded into the subtree rooted at a.

[Figure: the FFA during the folding step.]

Merge and fold

After folding:

[Figure: the resulting FFA; the frequencies of the folded subtrees have been added (e.g. b: 30 = 6 + 24, a: 10, …).]

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
while Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} − Red

The real question

- How do we decide if d(A_p, A_q) is small?
- Use a distance…
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance

Deciding if two distributions are similar

- If the two distributions are known, equality can be tested
- The distance (L2 norm) between distributions can be exactly computed
- But what if the two distributions are unknown?

Taking decisions

Suppose we want to merge state b with state a.

[Figure: the FFA before the merge.]

Taking decisions

Yes, if the two distributions induced (by states a and b) are similar.

[Figure: the two subtrees rooted at a and at b, whose induced distributions are compared.]

5 Alergia


Alergia's test

- D1 ≈ D2 if ∀x: Pr_D1(x) ≈ Pr_D2(x)
- Easier to test:
  - Pr_D1(λ) = Pr_D2(λ)
  - ∀a∈Σ: Pr_D1(aΣ*) = Pr_D2(aΣ*)
- And do this recursively!
- Of course, do it on frequencies

Hoeffding bounds

γ ← | f1/n1 − f2/n2 |

γ < ( 1/√n1 + 1/√n2 ) · √( (1/2) · ln(2/α) )

γ indicates if the relative frequencies f1/n1 and f2/n2 are sufficiently close.
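A sketch of the test (the function name `compatible` is an assumption; the formula is the one above). The usage lines reuse the numbers from the worked run later in the deck, with α = 0.05:

```python
import math

def compatible(f1, n1, f2, n2, alpha=0.05):
    """Hoeffding test on relative frequencies: f1/n1 and f2/n2 are
    deemed sufficiently close when
    |f1/n1 - f2/n2| < (1/sqrt(n1) + 1/sqrt(n2)) * sqrt(0.5 * ln(2/alpha))."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * \
            math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound

print(compatible(490, 1000, 128, 257))  # True: lambda and a can be merged
print(compatible(660, 1341, 225, 340))  # False: lambda and b cannot
```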

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

- Parameter α is arbitrarily set to 0.05. We choose 30 as the value for the threshold t0.
- Note that for the Blue states that have a frequency less than the threshold, a special merging operation takes place.

[Figure: FTA(S) for this multisample. The root has frequency 1000 (final frequency 490) with transitions a: 257 and b: 253; under a, the state has final frequency 128 with transitions a: 64 and b: 65; under b, final frequency 170 with a: 57 and b: 26; the rest of the prefix tree carries the remaining counts.]

Can we merge λ and a?

- Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
- 490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257, …
- All tests return true

Merge…

[Figure: state a is merged with the root λ: the a: 257 transition is redirected back to the root, before folding.]

And fold

[Figure: after folding, the root carries an a: 341 loop and final frequency 660; a b: 340 transition leads to a state of final frequency 225; smaller counts (a: 77, b: 38, a: 16, b: 9, …) fill the remaining tree.]

Next merge? λ with b?

[Figure: the current automaton; the Blue state reached by b: 340 is compared with the root λ.]

Can we merge λ and b?

- Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*
- 660/1341 and 225/340 are different (giving γ = 0.162)
- On the other hand, the bound is

( 1/√n1 + 1/√n2 ) · √( (1/2) · ln(2/α) ) = 0.111

so the test rejects the merge.

Promotion

[Figure: the merge was rejected, so the state reached by b is promoted to Red; the automaton is otherwise unchanged.]

Merge

[Figure: the next Blue state is merged with a Red state and the corresponding transition redirected, before folding.]

And fold

[Figure: after folding, a state of frequency 291 now carries a: 95 and b: 49; smaller counts (a: 11, b: 9, a: 2, …) remain below.]

Merge

[Figure: another Blue state is merged, reducing the automaton further (a: 95, b: 49, a: 11, b: 9, …).]

And fold

As a PFA:

[Figure: the final DFFA (transition frequencies a: 354, b: 351, a: 96, b: 49) and the corresponding DPFA obtained by normalization, with transition probabilities a: 0.354, b: 0.351, a: 0.096, b: 0.049.]

Conclusion and logic

- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

6 DSAI and MDI Why not change the criterion?


Criterion for DSAI

- Use a distinguishing string
- Use norm L∞
- Two distributions are different if there is a string with a very different probability
- Such a string is called μ-distinguishable
- The question becomes: is there a string x such that |Pr_{A,q}(x) − Pr_{A,q'}(x)| > μ?

(much more to DSAI)

- D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.
- PAC learnability results, in the case where the targets are acyclic graphs

Criterion for MDI

- MDL-inspired heuristic
- The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
- F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

A PFA/HMM learning competition

Organisation committee:

- Hasan Ibne Akram, Technische Universität München, Germany
- Rémi Eyraud, Aix-Marseille Université, France
- Jeffrey Heinz, University of Delaware, USA
- Colin de la Higuera, University of Nantes, France
- James Scicluna, University of Nantes, France
- Sicco Verwer, Radboud University Nijmegen, The Netherlands

Scientific committee:

- Pieter Adriaans, University of Amsterdam, The Netherlands
- Dana Angluin, Yale University, USA
- Alexander Clark, Royal Holloway University of London, United Kingdom
- Pierre Dupont, Université catholique de Louvain, Belgium
- Ricard Gavaldà, Universitat Politècnica de Catalunya, Spain
- Colin de la Higuera, University of Nantes, France
- Jean-Christophe Janodet, University of Evry, France
- Tim Oates, University of Maryland in Baltimore County, USA
- Jose Oncina, University of Alicante, Spain
- Menno van Zaanen, Tilburg University, The Netherlands

ICGI'12 - Workshop

Timeline

- December 2011: first ideas
- February 2012: website, first baselines and the first data set on-line
- March 2012: first phase (training phase)
- May 20: second phase (competition)
- June 5: first real-world problem available
- July 3: end of the competition
- September 7: special session in ICGI'12

Target generation

- Targets were generated completely at random
- 4 kinds of targets:
  - HMM
  - PDFA
  - PFA
  - Markov chains (used only during the training phase)
- 5 to 75 states
- 4- to 24-letter alphabets
- All initial, symbol and transition probabilities drawn from a Dirichlet distribution

Target generation

- Symbol sparsity: percentage of possible state-symbol pairs selected for the target (between 20% and 80%)
  - A state is randomly selected, then a not-already-taken symbol for this state
  - One transition is generated by selecting a target state
- Transition sparsity: percentage of additional transitions (between 0% and 20%)
  - Selected without replacement from the set of possible transitions
  - Modified to remain uniform over the source state and transition labels

Evaluation score

- A perplexity measure, where Pr_T is the probability in the target and Pr_C is the submitted probability (these probabilities have to be normalized on the test set)
- Equivalent to the Kullback-Leibler divergence
- Independent of a specific model

Real data

- Natural language problem: 10 000 POS sequences (+1 000 unique for test) selected from over 100 000 obtained with the Frog Dutch tagger (11 symbols) on a corpus of Dutch translations of Jules Verne books.
- Discretized sensor signals: 20 000 strings (+1 000 for test) corresponding to windows of length 20 over the fuel usage of trucks, selected from almost 500 000 available windows.
- Evaluation: submissions were compared with the probabilities obtained with a 3-gram trained on the whole data set.

Overall score

- For each problem:
  - 5 points were given to the leader (participant with the smallest perplexity score)
  - 3 points to the second
  - 2 points to the third
  - 1 point to the fourth
- The sum of the points gave the overall ranking.

Train and test sets

- Access only to registered participants
- 51 problems for the training phase
- 48 problems for the competition phase (+2 real-world problems)
- 1 000 strings in each test set
- 20 000 or 100 000 strings in the train sets

Baseline algorithms

- 2 simple baselines in Python:
  - Frequency of the strings in the sets (train + test)
  - Usual 3-gram on the strings of the sets (train + test)
- An implementation of the Baum-Welch algorithm in Python
- An implementation of ALERGIA in OpenFST and Visual Studio
- Good PageRank of this page (no registration needed)

Competition activity

- 724 visits (max: 54 in one day)
- 196 unique visitors
- IPs from 37 countries, 14 countries with 5 or more IPs
- 38 registered participants
- 16 submitted at least one of their solutions
- 2 787 submissions
- 5 participants scored points
- 4 participants ranked first at least one day

Overall results

Rank  Team name          Overall score
1     Shibata-Yoshinaka  212
2     Mans Hulden        124
3     David Llorens      122
4     Raphael Bailly     75
5     Fabio Kepler       14

Overall scores evolution

[Figure: evolution of the overall scores over the course of the competition.]

7 Conclusion and open questions


Appendix: Stern-Brocot trees

Identification of probabilities: if we were able to discover the structure, how do we identify the probabilities?

- By estimation: the edge is used 1501 times out of 3000 passages through the state:

[Figure: a state of frequency 3000 with an a-edge estimated at 1501/3000.]

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» (mediant) operation: the mediant of a/b and c/d is

(a + c) / (b + d)

[Figure: the Stern-Brocot tree, starting from the bounds 0/1 and 1/0; the first levels are 1/1; 1/2, 2/1; 1/3, 2/3, 3/2, 3/1; 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1.]

Idea:

- Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
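A sketch of the search; the descent rule (take the mediant of the current bounds, then narrow towards the target) and the stopping tolerance are natural choices here, not prescribed by the slides:

```python
from fractions import Fraction as F

def stern_brocot_approx(x, tolerance):
    """Walk down the Stern-Brocot tree, taking the mediant of the
    current bounds, until a fraction falls within tolerance of x."""
    lo_n, lo_d = 0, 1   # 0/1, the left bound
    hi_n, hi_d = 1, 0   # 1/0, the formal "infinity" right bound
    while True:
        med = F(lo_n + hi_n, lo_d + hi_d)   # the mediant
        if abs(med - x) <= tolerance:
            return med
        if med < x:
            lo_n, lo_d = med.numerator, med.denominator
        else:
            hi_n, hi_d = med.numerator, med.denominator

# 1501/3000 is just above 1/2; a loose tolerance returns the
# simpler fraction 1/2 rather than the raw estimate.
print(stern_brocot_approx(F(1501, 3000), F(1, 1000)))  # 1/2
```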

Iterated logarithm: with probability 1, for a co-finite number of values of n we have

| c(x)/n − a/b | < √( λ · log log n / n ),   ∀λ > 1
