Learning probabilistic finite automata
Colin de la Higuera, University of Nantes
Nantes, November 2013
Acknowledgements
Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini, ... The list is necessarily incomplete; apologies to those who have been forgotten.
http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/ Chapters 5 and 16
Outline
1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions
1 PFA Probabilistic finite (state) automata
Practical motivations
(Computational biology, speech recognition, web services, automatic translation, image processing, ...)
- A lot of positive data
- Not necessarily any negative data
- No ideal target
- Noise
The grammar induction problem, revisited
- The data consists of positive strings, "generated" following an unknown distribution
- The goal is now to find (learn) this distribution, or the grammar/automaton that is used to generate the strings
Success of the probabilistic models
- n-grams
- Hidden Markov Models
- Probabilistic grammars
[Figure: an automaton over {a, b} with fractional transition and halting probabilities (1/2, 1/3, 1/4, 2/3, 3/4)]
DPFA: Deterministic Probabilistic Finite Automaton
[Figure: the same DPFA, with the path reading abab highlighted]
Pr_A(abab) = 1/2 × 1/2 × 1/3 × 2/3 × 3/4 = 1/24
[Figure: a DPFA over {a, b} with decimal probabilities (0.7, 0.9, 0.1, 0.35, 0.65, 0.3)]
[Figure: a non-deterministic automaton over {a, b} with fractional probabilities]
PFA: Probabilistic Finite (state) Automaton
[Figure: the same automaton with some transitions labelled ε]
ε-PFA: Probabilistic Finite (state) Automaton with ε-transitions
How useful are these automata?
- They can define a distribution over Σ*
- They do not tell us if a string belongs to a language
- They are good candidates for grammar induction
- There is (was?) not that much written theory
Basic references
- The HMM literature
- Azaria Paz 1973: Introduction to Probabilistic Automata
- Chapter 5 of my book
- Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco
- Grammatical inference papers
Automata, definitions
Let D be a distribution over Σ*:
0 ≤ Pr_D(w) ≤ 1
∑_{w∈Σ*} Pr_D(w) = 1
A Probabilistic Finite (state) Automaton is given by
- Q: a set of states
- IP : Q → [0;1] (initial probabilities)
- FP : Q → [0;1] (final/halting probabilities)
- δP : Q × Σ × Q → [0;1] (transition probabilities)
What does a PFA do?
- It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:
- Pr_A(w) = ∑_{π ∈ paths(w)} Pr(π)
- where π = q_{i0} a_{i1} q_{i1} a_{i2} ... a_{in} q_{in}
- and Pr(π) = IP(q_{i0}) · FP(q_{in}) · ∏_j δP(q_{i(j-1)}, a_{ij}, q_{ij})
- Note that if λ-transitions are allowed, the sum may be infinite
[Figure: a non-deterministic PFA over {a, b} with probabilities 0.7, 0.4, 0.45, 0.2, 0.1, 0.3, 0.35, 1]
Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532
- non-deterministic PFA: many initial states / only one initial state
- λ-PFA: a PFA with λ-transitions and perhaps many initial states
- DPFA: a deterministic PFA
Consistency
A PFA is consistent if
- Pr_A(Σ*) = 1
- ∀x ∈ Σ*, 0 ≤ Pr_A(x) ≤ 1
Consistency theorem
A is consistent if every state is useful (accessible and co-accessible) and
∀q ∈ Q, FP(q) + ∑_{q'∈Q, a∈Σ} δP(q, a, q') = 1
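As a sketch, the sum condition of the theorem is easy to check mechanically. The dict-based encoding below (IP and FP map states to probabilities, delta maps (q, a, q') triples to probabilities) is an assumption for illustration, not the slides' notation; state usefulness must be checked separately.

```python
def satisfies_consistency_condition(IP, FP, delta, tol=1e-9):
    """Check that for every state q, FP(q) plus the total probability
    of q's outgoing transitions equals 1 (up to rounding)."""
    for q in FP:
        out = sum(p for (q1, _a, _q2), p in delta.items() if q1 == q)
        if abs(FP[q] + out - 1.0) > tol:
            return False
    # initial probabilities must also sum to 1
    return abs(sum(IP.values()) - 1.0) <= tol
```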
Equivalence between models
- Equivalence between PFA and HMM...
- But HMMs usually define distributions over each Σⁿ
A football HMM
[Figure: an HMM whose states emit win/draw/lose with probabilities such as 1/2, 1/4, 1/4, linked by transition probabilities 1/4, 1/2, 3/4]
Equivalence between PFA with λ-transitions and PFA without λ-transitions (cdlh 2003, Hanneforth & cdlh 2009)
- Many initial states can be transformed into one initial state with λ-transitions;
- λ-transitions can be removed in polynomial time;
- Strategy:
  - number the states
  - eliminate first the λ-loops, then the transitions with the highest-ranking arrival state
PFA are strictly more powerful than DPFA
- Folk theorem
- (and) You can't even tell in advance whether you are in a good case or not (see Denis & Esposito 2004)
Example
[Figure: a two-state PFA over {a} with probabilities 1/2, 2/3, 1/3]
This distribution cannot be modelled by a DPFA
What does a DPFA over Σ = {a} look like?
[Figure: a chain of states reading a...a]
And with this architecture you cannot generate the previous one
Parsing issues
- Computation of the probability of a string, or of a set of strings
- Deterministic case:
  - Simple: apply the definitions
  - Technically, rather sum up logs: this is easier, safer and cheaper
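A minimal sketch of deterministic parsing with summed logs. The encoding (delta[(q, a)] = (next_state, prob), a single initial state q0 with probability 1) is an assumption for illustration, not the slides' notation.

```python
import math

def dpfa_log_prob(w, q0, FP, delta):
    """Log-probability of w in a DPFA: walk the unique path, summing
    logs instead of multiplying probabilities (avoids underflow)."""
    logp = 0.0
    q = q0
    for a in w:
        if (q, a) not in delta:
            return float('-inf')      # no transition: probability 0
        q, p = delta[(q, a)]
        if p == 0.0:
            return float('-inf')
        logp += math.log(p)
    if FP[q] == 0.0:
        return float('-inf')
    return logp + math.log(FP[q])     # add the halting probability
```

Exponentiating the result recovers the string probability when it is needed.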
[Figure: the DPFA with decimal probabilities seen earlier (0.7, 0.9, 0.35, 0.65, 0.3)]
Pr(aba) = 0.7 × 0.9 × 0.35 × 0 = 0
Pr(abb) = 0.7 × 0.9 × 0.65 × 0.3 = 0.12285
Non-deterministic case
[Figure: the non-deterministic PFA seen earlier]
Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532
In the literature
- The computation of the probability of a string is done by dynamic programming: O(n²m)
- 2 algorithms: Backward and Forward
- If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm
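As an illustration (not the slides' code), Viterbi is the Forward recursion with max in place of sum. The encoding, delta[(q, a)] as a list of (next_state, prob) pairs, is an assumption.

```python
def viterbi_prob(w, IP, FP, delta):
    """Probability of the most probable derivation of w in a PFA."""
    A = dict(IP)                      # best score of reaching each state
    for a in w:
        B = {}
        for q, p in A.items():
            for q2, t in delta.get((q, a), []):
                B[q2] = max(B.get(q2, 0.0), p * t)
        A = B
    # keep the best path, weighted by its halting probability
    return max((p * FP[q] for q, p in A.items()), default=0.0)
```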
Forward algorithm
- A[i, j] = Pr(q_i | a_1..a_j) (the probability of being in state q_i after having read a_1..a_j)
- A[i, 0] = IP(q_i)
- A[i, j+1] = ∑_{k ≤ |Q|} A[k, j] · δP(q_k, a_{j+1}, q_i)
- Pr(a_1..a_n) = ∑_{k ≤ |Q|} A[k, n] · FP(q_k)
2 Distances
What for?
- Estimate the quality of a language model
- Have an indicator of the convergence of learning algorithms
- Construct kernels
2.1 Entropy
- How many bits do we need to correct our model?
- Two distributions over Σ*: D and D'
- Kullback-Leibler divergence (or relative entropy) between D and D':
  ∑_{w∈Σ*} Pr_D(w) × (log Pr_D(w) − log Pr_D'(w))
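For intuition, a sketch over distributions with finite support (over Σ* the sum is infinite and must be truncated or computed analytically); the dict encoding is an assumption.

```python
import math

def kl_divergence(D, Dp):
    """sum_w Pr_D(w) (log Pr_D(w) - log Pr_D'(w)) for two distributions
    given as string -> probability dicts."""
    total = 0.0
    for w, p in D.items():
        if p == 0.0:
            continue                  # 0 log 0 = 0 by convention
        q = Dp.get(w, 0.0)
        if q == 0.0:
            return float('inf')       # a word that D' cannot generate
        total += p * (math.log(p) - math.log(q))
    return total
```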
2.2 Perplexity
- The idea is to allow the computation of the divergence, but relative to a test set S
- An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set
( ∏_{w∈S} Pr_D(w) )^(−1/|S|) = 1 / ( ∏_{w∈S} Pr_D(w) )^(1/|S|)
Problem if some probability is null...
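In log space the computation is a one-liner per string; here prob is any callable returning Pr_D(w) (an assumed interface, for illustration).

```python
import math

def perplexity(S, prob):
    """Inverse of the geometric mean of the probabilities of the test set."""
    log_sum = 0.0
    for w in S:
        p = prob(w)
        if p == 0.0:
            return float('inf')       # the null-probability problem
        log_sum += math.log(p)
    return math.exp(-log_sum / len(S))
```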
Why multiply? (1)
- We are trying to compute the probability of independently drawing the different strings in the set S
Why multiply? (2)
- Suppose we have two predictors for a coin toss:
  - Predictor 1: heads 60%, tails 40%
  - Predictor 2: heads 100%
- The test is H: 6, T: 4
- Arithmetic mean:
  - P1: 36% + 16% = 0.52
  - P2: 0.6
- Predictor 2 would be the better predictor ;-)
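Redoing the same test with a geometric mean (a small illustration, not from the slides) shows why multiplying punishes Predictor 2, which assigns probability 0 to tails:

```python
import math

def geometric_mean(probs):
    """Geometric mean of the per-outcome probabilities."""
    if any(p == 0.0 for p in probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# Test sequence: 6 heads, 4 tails
p1 = geometric_mean([0.6] * 6 + [0.4] * 4)   # Predictor 1: about 0.51
p2 = geometric_mean([1.0] * 6 + [0.0] * 4)   # Predictor 2: exactly 0
```

Under the geometric mean Predictor 1 wins comfortably, as it should.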
2.3 Distance d2
d2(D, D') = √( ∑_{w∈Σ*} (Pr_D(w) − Pr_D'(w))² )
- Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002)
- This also means that the equivalence of PFA is in P
3 FFA Frequency Finite (state) Automata
A learning sample
- is a multiset
- Strings appear with a frequency (or multiplicity)
- S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
DFFA
A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, every transition, and for entering the initial state, such that
- the sum of what enters each state is equal to the sum of what exits, and
- the sum of what halts is equal to what starts
Example
[Figure: a DFFA with integer frequencies on states and transitions (a:2, a:1, a:5, b:5, b:3, b:4, and state values 6, 3, 2, 1)]
From a DFFA to a DPFA
Frequencies become relative frequencies by dividing by the sum of the exiting frequencies
[Figure: the previous DFFA normalised, e.g. a:2/6, a:1/7, b:5/13, b:3/6, a:5/13, b:4/7]
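The normalisation step can be sketched as follows, with an assumed dict encoding (final_freq[q] is the halting count, trans_freq[(q, a)] = (next_state, count)):

```python
def dffa_to_dpfa(final_freq, trans_freq):
    """Divide every frequency by the total frequency exiting its state."""
    totals = dict(final_freq)         # halting counts as a starting point
    for (q, _a), (_q2, c) in trans_freq.items():
        totals[q] = totals.get(q, 0) + c
    FP = {q: f / totals[q] for q, f in final_freq.items()}
    delta = {(q, a): (q2, c / totals[q])
             for (q, a), (q2, c) in trans_freq.items()}
    return FP, delta
```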
From a DFA and a sample to a DFFA
S = {λ, aaaa, ab, babb, bbbb, bbbbaa}
[Figure: the DFA with the frequencies obtained by parsing S]
Note
- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch (EM) algorithm has been invented for...
The frequency prefix tree acceptor
- The data is a multiset
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed
From the sample to the FTA
[Figure: FTA(S), the frequency prefix tree acceptor of the sample]
S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
Red, Blue and White states
- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others
[Figure: an automaton whose states are coloured Red, Blue and White]
Same as with DFA and what RPNI does
Merge and fold
Suppose we decide to merge b with state a
[Figure: the FFA before the merge]
Merge and fold
First disconnect the transition into b (b:24) and reconnect it to state a
[Figure: the b:24 transition now points to a; the subtree rooted at the old state remains to be folded]
Merge and fold
Then fold
[Figure: the subtree that was rooted at b is folded into the subtree rooted at a]
Merge and fold
After folding
[Figure: the resulting FFA; the folded frequencies have been added, e.g. b:30 and a:10]
State merging algorithm
A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
while Blue ≠ ∅ do
  choose q from Blue such that Freq(q) ≥ t0
  if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
  Blue = {δ(q, a) : q ∈ Red} − Red
The real question
- How do we decide if d(A_p, A_q) is small?
- Use a distance...
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance
Deciding if two distributions are similar
- If the two distributions are known, equality can be tested
- The distance (L2 norm) between distributions can be computed exactly
- But what if the two distributions are unknown?
Taking decisions
Suppose we want to merge b with state a
[Figure: the FFA before the merge]
Taking decisions
Yes, if the two distributions induced are similar
[Figure: the two sub-automata rooted at a and at b, whose induced distributions are compared]
5 Alergia
Alergia's test
- D1 ≈ D2 if ∀x, Pr_D1(x) ≈ Pr_D2(x)
- Easier to test:
  - Pr_D1(λ) = Pr_D2(λ)
  - ∀a ∈ Σ, Pr_D1(aΣ*) = Pr_D2(aΣ*)
- And do this recursively!
- Of course, do it on frequencies
Hoeffding bounds
γ ← | f1/n1 − f2/n2 |
The frequencies are considered compatible when
γ < ( 1/√n1 + 1/√n2 ) · √( ½ · ln(2/α) )
γ indicates whether the relative frequencies f1/n1 and f2/n2 are sufficiently close
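The test itself fits in a couple of lines; the function below is a sketch of it:

```python
import math

def hoeffding_compatible(f1, n1, f2, n2, alpha=0.05):
    """Alergia's test: accept that f1/n1 and f2/n2 come from the same
    distribution when their gap is below the Hoeffding bound."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) \
            * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound
```

On the run that follows, comparing 490/1000 with 128/257 passes, while 660/1341 against 225/340 fails, the bound being about 0.111.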
A run of Alergia
Our learning multisample:
S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}
- Parameter α is arbitrarily set to 0.05. We choose 30 as the value for the threshold t0.
- Note that for the blue states whose frequency is less than the threshold, a special merging operation takes place
[Figure: FTA(S) built from the 1000 strings of the sample, with frequency-labelled transitions (root frequency 1000, halting 490, a:257, ...)]
Can we merge λ and a?
- Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
  490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257, ...
- All tests return true
Merge...
[Figure: state a is merged with the root λ; the subtree below a remains to be folded]
And fold
[Figure: the automaton after folding; the root now has a loop a:341 and a transition b:340, with 660 of the 1000 strings halting at the root]
Next merge? λ with b?
[Figure: the current automaton; the blue state reached by b has frequency 340]
Can we merge λ and b?
- Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*
- 660/1341 and 225/340 are different (giving γ = 0.162)
- On the other hand,
  ( 1/√n1 + 1/√n2 ) · √( ½ · ln(2/α) ) = 0.111
Promotion
[Figure: the merge is rejected, so the state reached by b is promoted to Red]
Merge
[Figure: the next blue state is merged with a red state]
And fold
[Figure: the automaton after folding]
Merge
[Figure: another merge takes place]
And fold. As a PFA:
[Figure: the final DFFA and, after normalising the frequencies, the corresponding DPFA (e.g. a: 0.354, b: 0.351, a: 0.096, b: 0.049)]
Conclusion and logic
- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties
6 DSAI and MDI Why not change the criterion?
Criterion for DSAI
- Use a distinguishing string
- Use the L∞ norm
- Two distributions are different if there is a string with a very different probability
- Such a string is called μ-distinguishable
- The question becomes: is there a string x such that |Pr_{A,q}(x) − Pr_{A,q'}(x)| > μ?
(much more to DSAI)
- D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.
- PAC learnability results, in the case where the targets are acyclic graphs
Criterion for MDI
- MDL-inspired heuristic
- The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
- F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.
A PFA/HMM learning competition
Organisation committee:
- Hasan Ibne Akram, Technische Universität München, Germany
- Rémi Eyraud, Aix-Marseille Université, France
- Jeffrey Heinz, University of Delaware, USA
- Colin de la Higuera, University of Nantes, France
- James Scicluna, University of Nantes, France
- Sicco Verwer, Radboud University Nijmegen, The Netherlands
Scientific committee:
- Pieter Adriaans, University of Amsterdam, The Netherlands
- Dana Angluin, Yale University, USA
- Alexander Clark, Royal Holloway University of London, United Kingdom
- Pierre Dupont, Université catholique de Louvain, Belgium
- Ricard Gavaldà, Universitat Politécnica de Catalunya, Spain
- Colin de la Higuera, University of Nantes, France
- Jean-Christophe Janodet, University of Evry, France
- Tim Oates, University of Maryland in Baltimore County, USA
- Jose Oncina, University of Alicante, Spain
- Menno van Zaanen, Tilburg University, The Netherlands
ICGI'12 - Workshop
Timeline
- December 2011: first ideas
- February 2012: website, first baselines and the first data set on-line
- March 2012: first phase (training phase)
- May 20: second phase (competition)
- June 5: first real-world problem available
- July 3: end of the competition
- September 7: special session at ICGI'12
Target generation
- Targets were generated completely at random
- 4 kinds of targets: HMM, PDFA, PFA, Markov chains (used only during the training phase)
- 5 to 75 states
- 4- to 24-letter alphabets
- All initial, symbol and transition probabilities drawn from a Dirichlet distribution
Target generation
- Symbol sparsity: percentage of the possible state-symbol pairs selected for the target (between 20% and 80%)
  - A state is randomly selected, then a not-already-taken symbol for this state
  - One transition is generated by selecting a target state
- Transition sparsity: percentage of additional transitions (between 0% and 20%)
  - Selected without replacement from the set of possible transitions
  - Modified to remain uniform over the source states and transition labels
Evaluation score
- A perplexity measure, where Pr_T is the probability in the target and Pr_C is the submitted probability (these probabilities have to be normalised on the test set)
- Equivalent to the Kullback-Leibler divergence
- Independent of a specific model
Real data
- Natural language problem: 10,000 POS sequences (+1,000 unique ones for test) selected from over 100,000 obtained with the Frog Dutch tagger (11 symbols) on a corpus of Dutch translations of Jules Verne books
- Discretized sensor signals: 20,000 strings (+1,000 for test) corresponding to windows of length 20 over the fuel usage of trucks, selected from almost 500,000 available windows
- Evaluation: submissions were compared with the probabilities obtained with a 3-gram trained on the whole data set
Overall score
- For each problem:
  - 5 points were given to the leader (the participant with the smallest perplexity score)
  - 3 points to the second
  - 2 points to the third
  - 1 point to the fourth
- The sum of the points gave the overall ranking
Train and test sets
- Access only to registered participants
- 51 problems for the training phase
- 48 problems for the competition phase (+2 real-world problems)
- 1,000 strings in each test set
- 20,000 or 100,000 strings in the train sets
Baseline algorithms
- 2 simple baselines in Python:
  - Frequency of the strings in the sets (train + test)
  - Usual 3-gram on the strings of the sets (train + test)
- An implementation of the Baum-Welch algorithm in Python
- An implementation of ALERGIA in OpenFST and Visual Studio
- Good page rank of this page (no registration needed)
Competition activity
- 724 visits (max: 54 in one day)
- 196 unique visitors
- IPs from 37 countries, 14 countries with 5 or more IPs
- 38 registered participants
- 16 submitted at least one solution
- 2,787 submissions
- 5 participants scored points
- 4 participants ranked first at least one day
Overall results
Rank  Team name           Overall score
1     Shibata-Yoshinaka   212
2     Mans Hulden         124
3     David Llorens       122
4     Raphael Bailly      75
5     Fabio Kepler        14
Overall scores evolution
[Figure: evolution of the participants' overall scores during the competition]
7 Conclusion and open questions
Appendix: Stern-Brocot trees
Identification of probabilities: if we were able to discover the structure, how do we identify the probabilities?
- By estimation: the edge is used 1501 times out of 3000 passages through the state, giving the estimate 1501/3000
[Figure: a state crossed 3000 times, with an a-edge labelled 1501/3000]
Stern-Brocot trees (Stern 1858, Brocot 1860)
Can be constructed from two simple adjacent fractions by the "mean" (mediant) operation:
a/b ⊕ c/d = (a+c)/(b+d)
96
[Figure: the Stern-Brocot tree built from 0/1 and 1/0: first 1/1, then 1/2 and 2/1, then 1/3, 2/3, 3/2, 3/1, then 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1, ...]
Idea:
Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
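A sketch of this search using Python's fractions module; the stopping rule (a denominator budget) is one possible choice, not prescribed by the slides.

```python
from fractions import Fraction

def stern_brocot_approx(x, max_denominator):
    """Descend the Stern-Brocot tree by mediants, keeping the simplest
    fraction in [0, 1] closest to x found so far."""
    lo, hi = Fraction(0, 1), Fraction(1, 1)   # probabilities live in [0, 1]
    best = lo if abs(lo - x) <= abs(hi - x) else hi
    while True:
        med = Fraction(lo.numerator + hi.numerator,
                       lo.denominator + hi.denominator)
        if med.denominator > max_denominator:
            return best               # denominators only grow deeper down
        if abs(med - x) < abs(best - x):
            best = med
        if x < med:
            hi = med                  # descend into the left subtree
        elif x > med:
            lo = med                  # descend into the right subtree
        else:
            return med                # exact hit
```

For instance, the estimate 1501/3000 with denominators capped at 10 comes back as the simple fraction 1/2.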
Iterated logarithm:
With probability 1, for a cofinite number of values of n we have
| c(x)/n − a/b | < √( λ · log log n / n ),  ∀λ > 1