An Introduction to Factor Graphs
Hans-Andrea Loeliger
MLSB 2008, Copenhagen
Definition
A factor graph represents the factorization of a function of several variables. We use Forney-style factor graphs (Forney, 2001).
Example: f(x_1, x_2, x_3, x_4, x_5) = f_A(x_1, x_2, x_3) · f_B(x_3, x_4, x_5) · f_C(x_4).
[Figure: Forney-style factor graph of this factorization, with nodes f_A, f_B, f_C and edges/half-edges x_1, ..., x_5.]
Rules:
• A node for every factor.
• An edge or half-edge for every variable.
• Node g is connected to edge x iff variable x appears in factor g.
(What if some variable appears in more than two factors?)
Example: Markov Chain
p_{XYZ}(x, y, z) = p_X(x) p_{Y|X}(y|x) p_{Z|Y}(z|y).
[Figure: chain-structured factor graph with nodes p_X, p_{Y|X}, p_{Z|Y} and edges X, Y, Z.]
We will often use capital letters for the variables. (Why?) Further examples will come later.
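As a concrete illustration (not part of the slides), here is a minimal Python sketch of this factorization; the tables for p_X, p_{Y|X}, p_{Z|Y} are made-up toy numbers over binary variables.

```python
import numpy as np

# Toy Markov chain p(x, y, z) = pX(x) * pY|X(y|x) * pZ|Y(z|y), binary variables.
pX = np.array([0.6, 0.4])                     # pX[x]
pY_given_X = np.array([[0.9, 0.1],            # pY_given_X[x, y] = p(y | x)
                       [0.2, 0.8]])
pZ_given_Y = np.array([[0.7, 0.3],            # pZ_given_Y[y, z] = p(z | y)
                       [0.1, 0.9]])

def joint(x, y, z):
    """Evaluate p(x, y, z) directly from the factorization."""
    return pX[x] * pY_given_X[x, y] * pZ_given_Y[y, z]

# The factors define a proper joint distribution: it sums to 1.
total = sum(joint(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1))
print(total)  # 1.0 (up to floating-point error)
```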
Message Passing Algorithms operate by passing messages along the edges of a factor graph:
[Figure: a factor graph with messages (arrows) flowing along its edges.]
A main point of factor graphs (and similar graphical notations):
A Unified View of Historically Different Things
Statistical physics:
- Markov random fields (Ising 1925)
Signal processing:
- linear state-space models and Kalman filtering: Kalman 1960...
- recursive least-squares adaptive filters
- Hidden Markov models: Baum et al. 1966...
- unification: Levy et al. 1996...
Error correcting codes:
- Low-density parity check codes: Gallager 1962; Tanner 1981; MacKay 1996; Luby et al. 1998...
- Convolutional codes and Viterbi decoding: Forney 1973...
- Turbo codes: Berrou et al. 1993...
Machine learning, statistics:
- Bayesian networks: Pearl 1988; Shachter 1988; Lauritzen and Spiegelhalter 1988; Shafer and Shenoy 1990...
Other Notation Systems for Graphical Models
Example: p(u, w, x, y, z) = p(u) p(w) p(x|u, w) p(y|x) p(z|x).
[Figure: the same model drawn in four notations: as a Forney-style factor graph, as an original factor graph [FKLW 1997], as a Bayesian network, and as a Markov random field.]
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...
The Two Basic Problems
1. Marginalization: Compute
\bar{f}_k(x_k) \triangleq \sum_{x_1, \ldots, x_n \text{ except } x_k} f(x_1, \ldots, x_n)
2. Maximization: Compute the "max-marginal"
\hat{f}_k(x_k) \triangleq \max_{x_1, \ldots, x_n \text{ except } x_k} f(x_1, \ldots, x_n),
assuming that f is real-valued and nonnegative and has a maximum.
Note that
\operatorname{argmax} f(x_1, \ldots, x_n) = \big( \operatorname{argmax} \hat{f}_1(x_1), \ldots, \operatorname{argmax} \hat{f}_n(x_n) \big).
For large n, both problems are in general intractable even for x_1, \ldots, x_n \in \{0, 1\}.
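For concreteness, here is a brute-force sketch of both quantities (my own illustration, with an arbitrary random nonnegative f over n = 4 binary variables); the loop over all 2^n configurations is exactly the exponential cost that factor graphs help avoid.

```python
import itertools
import numpy as np

n = 4
rng = np.random.default_rng(0)
f_table = rng.random((2,) * n)          # an arbitrary nonnegative function f(x1, ..., xn)

def marginal(k):
    """Brute-force marginal: sum over all variables except x_k."""
    out = np.zeros(2)
    for x in itertools.product((0, 1), repeat=n):
        out[x[k]] += f_table[x]
    return out

def max_marginal(k):
    """Brute-force max-marginal: maximize over all variables except x_k."""
    out = np.zeros(2)                   # valid starting point since f is nonnegative
    for x in itertools.product((0, 1), repeat=n):
        out[x[k]] = max(out[x[k]], f_table[x])
    return out

print(marginal(2), max_marginal(2))     # both loops visit all 2**n configurations
```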
Factorization Helps
For example, if f(x_1, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n), then
\bar{f}_k(x_k) = \Big( \sum_{x_1} f_1(x_1) \Big) \cdots \Big( \sum_{x_{k-1}} f_{k-1}(x_{k-1}) \Big) \, f_k(x_k) \cdots \Big( \sum_{x_n} f_n(x_n) \Big)
and
\hat{f}_k(x_k) = \Big( \max_{x_1} f_1(x_1) \Big) \cdots \Big( \max_{x_{k-1}} f_{k-1}(x_{k-1}) \Big) \, f_k(x_k) \cdots \Big( \max_{x_n} f_n(x_n) \Big).
Factorization helps also beyond this trivial example.
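A sketch of this fully factored case with made-up toy factors: each marginal is a product of one-dimensional sums (or maxima), so the cost grows linearly in n rather than exponentially.

```python
import numpy as np

# Fully factored f(x1, ..., xn) = f1(x1) * ... * fn(xn), binary variables.
factors = [np.array([0.3, 0.7]), np.array([2.0, 1.0]),
           np.array([0.5, 0.5]), np.array([1.0, 4.0])]

def marginal(k):
    """f_bar_k(x_k) = f_k(x_k) * prod over j != k of (sum_{x_j} f_j(x_j))."""
    return factors[k] * np.prod([f.sum() for j, f in enumerate(factors) if j != k])

def max_marginal(k):
    """f_hat_k(x_k) = f_k(x_k) * prod over j != k of (max_{x_j} f_j(x_j))."""
    return factors[k] * np.prod([f.max() for j, f in enumerate(factors) if j != k])

print(marginal(0), max_marginal(0))     # n separate one-dimensional reductions
```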
Towards the sum-product algorithm:
Computing Marginals: A Generic Example
Assume we wish to compute
\bar{f}_3(x_3) = \sum_{x_1, \ldots, x_7 \text{ except } x_3} f(x_1, \ldots, x_7)
and assume that f can be factored as follows:
f(x_1, \ldots, x_7) = f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3) f_4(x_4) f_5(x_3, x_4, x_5) f_6(x_5, x_6, x_7) f_7(x_7).
[Figure: factor graph of this factorization, with nodes f_1, ..., f_7 and edges X_1, ..., X_7.]
Example cont'd: Closing Boxes by the Distributive Law
[Figure: the same factor graph with dashed boxes closed around the sub-graphs that are summed out.]
\bar{f}_3(x_3) = \Big( \sum_{x_1, x_2} f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3) \Big) \cdot \Big( \sum_{x_4, x_5} f_4(x_4) f_5(x_3, x_4, x_5) \Big( \sum_{x_6, x_7} f_6(x_5, x_6, x_7) f_7(x_7) \Big) \Big)
Example cont'd: Message Passing View
[Figure: the same factor graph; each closed box is summarized by a message along the edge that leaves it.]
\bar{f}_3(x_3) = \underbrace{\Big( \sum_{x_1, x_2} f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3) \Big)}_{\overrightarrow{\mu}_{X_3}(x_3)} \cdot \underbrace{\Big( \sum_{x_4, x_5} f_4(x_4) f_5(x_3, x_4, x_5) \underbrace{\Big( \sum_{x_6, x_7} f_6(x_5, x_6, x_7) f_7(x_7) \Big)}_{\overleftarrow{\mu}_{X_5}(x_5)} \Big)}_{\overleftarrow{\mu}_{X_3}(x_3)}
Example cont'd: Messages Everywhere
[Figure: the same factor graph with a message along every edge.]
With \overrightarrow{\mu}_{X_1}(x_1) \triangleq f_1(x_1), \overrightarrow{\mu}_{X_2}(x_2) \triangleq f_2(x_2), etc., we have
\overrightarrow{\mu}_{X_3}(x_3) = \sum_{x_1, x_2} \overrightarrow{\mu}_{X_1}(x_1) \, \overrightarrow{\mu}_{X_2}(x_2) \, f_3(x_1, x_2, x_3)
\overleftarrow{\mu}_{X_5}(x_5) = \sum_{x_6, x_7} \overrightarrow{\mu}_{X_7}(x_7) \, f_6(x_5, x_6, x_7)
\overleftarrow{\mu}_{X_3}(x_3) = \sum_{x_4, x_5} \overrightarrow{\mu}_{X_4}(x_4) \, \overleftarrow{\mu}_{X_5}(x_5) \, f_5(x_3, x_4, x_5)
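A sketch of these three messages for the example, using made-up numeric tables for f_1, ..., f_7 over binary variables; the product of the forward and backward messages on edge X_3 reproduces \bar{f}_3 computed by brute force.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# Factors of the example: f1(x1) f2(x2) f3(x1,x2,x3) f4(x4) f5(x3,x4,x5) f6(x5,x6,x7) f7(x7).
f1, f2, f4, f7 = (rng.random(2) for _ in range(4))
f3 = rng.random((2, 2, 2))   # f3[x1, x2, x3]
f5 = rng.random((2, 2, 2))   # f5[x3, x4, x5]
f6 = rng.random((2, 2, 2))   # f6[x5, x6, x7]

# The three messages, exactly as on the slide.
mu_fwd_X3 = np.einsum('a,b,abc->c', f1, f2, f3)          # sum over x1, x2
mu_bwd_X5 = np.einsum('g,efg->e', f7, f6)                # sum over x6, x7
mu_bwd_X3 = np.einsum('d,e,cde->c', f4, mu_bwd_X5, f5)   # sum over x4, x5

# Brute-force check: the product of the two messages on X3 equals the marginal.
f_bar_3 = np.zeros(2)
for x1, x2, x3, x4, x5, x6, x7 in itertools.product((0, 1), repeat=7):
    f_bar_3[x3] += (f1[x1] * f2[x2] * f3[x1, x2, x3] * f4[x4]
                    * f5[x3, x4, x5] * f6[x5, x6, x7] * f7[x7])

print(np.allclose(mu_fwd_X3 * mu_bwd_X3, f_bar_3))       # True
```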
The Sum-Product Algorithm (Belief Propagation)
[Figure: a generic node g with incoming edges Y_1, ..., Y_n and outgoing edge X.]
Sum-product message computation rule:
\overrightarrow{\mu}_X(x) = \sum_{y_1, \ldots, y_n} g(x, y_1, \ldots, y_n) \, \overrightarrow{\mu}_{Y_1}(y_1) \cdots \overrightarrow{\mu}_{Y_n}(y_n)
Sum-product theorem: If the factor graph for some function f has no cycles, then
\bar{f}_X(x) = \overrightarrow{\mu}_X(x) \, \overleftarrow{\mu}_X(x).
Towards the max-product algorithm:
Computing Max-Marginals: A Generic Example
Assume we wish to compute
\hat{f}_3(x_3) = \max_{x_1, \ldots, x_7 \text{ except } x_3} f(x_1, \ldots, x_7)
and assume that f can be factored as before:
[Figure: the same factor graph as in the marginalization example, with nodes f_1, ..., f_7 and edges X_1, ..., X_7.]
Example cont'd: Closing Boxes by the Distributive Law
[Figure: the same factor graph with boxes closed around the sub-graphs that are maximized out.]
\hat{f}_3(x_3) = \Big( \max_{x_1, x_2} f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3) \Big) \cdot \Big( \max_{x_4, x_5} f_4(x_4) f_5(x_3, x_4, x_5) \Big( \max_{x_6, x_7} f_6(x_5, x_6, x_7) f_7(x_7) \Big) \Big)
Example cont'd: Message Passing View
[Figure: the same factor graph with messages along the edges X_3 and X_5.]
\hat{f}_3(x_3) = \underbrace{\Big( \max_{x_1, x_2} f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3) \Big)}_{\overrightarrow{\mu}_{X_3}(x_3)} \cdot \underbrace{\Big( \max_{x_4, x_5} f_4(x_4) f_5(x_3, x_4, x_5) \underbrace{\Big( \max_{x_6, x_7} f_6(x_5, x_6, x_7) f_7(x_7) \Big)}_{\overleftarrow{\mu}_{X_5}(x_5)} \Big)}_{\overleftarrow{\mu}_{X_3}(x_3)}
Example cont'd: Messages Everywhere
[Figure: the same factor graph with a message along every edge.]
With \overrightarrow{\mu}_{X_1}(x_1) \triangleq f_1(x_1), \overrightarrow{\mu}_{X_2}(x_2) \triangleq f_2(x_2), etc., we have
\overrightarrow{\mu}_{X_3}(x_3) = \max_{x_1, x_2} \overrightarrow{\mu}_{X_1}(x_1) \, \overrightarrow{\mu}_{X_2}(x_2) \, f_3(x_1, x_2, x_3)
\overleftarrow{\mu}_{X_5}(x_5) = \max_{x_6, x_7} \overrightarrow{\mu}_{X_7}(x_7) \, f_6(x_5, x_6, x_7)
\overleftarrow{\mu}_{X_3}(x_3) = \max_{x_4, x_5} \overrightarrow{\mu}_{X_4}(x_4) \, \overleftarrow{\mu}_{X_5}(x_5) \, f_5(x_3, x_4, x_5)
The Max-Product Algorithm
[Figure: a generic node g with incoming edges Y_1, ..., Y_n and outgoing edge X.]
Max-product message computation rule:
\overrightarrow{\mu}_X(x) = \max_{y_1, \ldots, y_n} g(x, y_1, \ldots, y_n) \, \overrightarrow{\mu}_{Y_1}(y_1) \cdots \overrightarrow{\mu}_{Y_n}(y_n)
Max-product theorem: If the factor graph for some global function f has no cycles, then
\hat{f}_X(x) = \overrightarrow{\mu}_X(x) \, \overleftarrow{\mu}_X(x).
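The following sketch (my own illustration, discrete variables assumed) implements the generic message computation rule for a node g; the same routine yields the sum-product or the max-product message depending on whether a sum or a max is used as the reduction.

```python
import numpy as np

def node_message(g, in_msgs, use_max=False):
    """Outgoing message toward X from a node with table g[x, y1, ..., yn].

    in_msgs[i] is the incoming message on edge Y_{i+1}.
    Sum-product: sum over y1..yn.  Max-product: replace the sum by a max.
    """
    prod = np.array(g, dtype=float)
    for i, msg in enumerate(in_msgs):
        # Multiply g(x, y1, ..., yn) by mu_{Y_{i+1}}(y_{i+1}) along the right axis.
        shape = [1] * prod.ndim
        shape[i + 1] = len(msg)
        prod = prod * np.asarray(msg).reshape(shape)
    axes = tuple(range(1, prod.ndim))
    return prod.max(axis=axes) if use_max else prod.sum(axis=axes)

# Example: a node g(x, y1, y2) with binary variables; message toward X
# given the incoming messages on Y1 and Y2.
rng = np.random.default_rng(2)
g = rng.random((2, 2, 2))                      # g[x, y1, y2]
print(node_message(g, [np.array([0.4, 0.6]), np.array([1.0, 1.0])]))
print(node_message(g, [np.array([0.4, 0.6]), np.array([1.0, 1.0])], use_max=True))
```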
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...
Terminology
[Figure: the example factor graph from the introduction, with nodes f_A, f_B, f_C and edges x_1, ..., x_5.]
Global function f = product of all factors; usually (but not always!) real and nonnegative.
A configuration is an assignment of values to all variables.
The configuration space is the set of all configurations, which is the domain of the global function.
A configuration ω = (x_1, ..., x_5) is valid iff f(ω) ≠ 0.
Invalid Configurations Do Not Affect Marginals
A configuration ω = (x_1, ..., x_n) is valid iff f(ω) ≠ 0.
Recall:
1. Marginalization: Compute
\bar{f}_k(x_k) \triangleq \sum_{x_1, \ldots, x_n \text{ except } x_k} f(x_1, \ldots, x_n)
2. Maximization: Compute the "max-marginal"
\hat{f}_k(x_k) \triangleq \max_{x_1, \ldots, x_n \text{ except } x_k} f(x_1, \ldots, x_n),
assuming that f is real-valued and nonnegative and has a maximum.
Constraints may be expressed by factors that evaluate to 0 if the constraint is violated.
Branching Point = Equality Constraint Factor
[Figure: a branching point where edge X splits into X' and X''; equivalently, an "=" node connecting X, X', and X''.]
The factor
f_{=}(x, x', x'') \triangleq \begin{cases} 1, & \text{if } x = x' = x'' \\ 0, & \text{else} \end{cases}
enforces X = X' = X'' for every valid configuration. More generally,
f_{=}(x, x', x'') \triangleq \delta(x - x') \, \delta(x - x''),
where δ denotes the Kronecker delta for discrete variables and the Dirac delta for continuous variables.
Example: Independent Observations
Let Y_1 and Y_2 be two independent observations of X:
p(x, y_1, y_2) = p(x) p(y_1|x) p(y_2|x).
[Figure: factor graph with node p_X, an "=" node on X, and nodes p_{Y_1|X}, p_{Y_2|X} with half-edges Y_1 and Y_2.]
Literally, the factor graph represents an extended model
p(x, x', x'', y_1, y_2) = p(x) p(y_1|x') p(y_2|x'') f_{=}(x, x', x'')
with the same marginals and max-marginals as p(x, y_1, y_2).
From A Priori to A Posteriori Probability
Example (cont'd): Let Y_1 = y_1 and Y_2 = y_2 be two independent observations of X, i.e.,
p(x, y_1, y_2) = p(x) p(y_1|x) p(y_2|x).
For fixed y_1 and y_2, we have
p(x | y_1, y_2) = p(x, y_1, y_2) / p(y_1, y_2) ∝ p(x, y_1, y_2).
[Figure: the same factor graph, with the half-edges now terminated at the known values y_1 and y_2.]
The factorization is unchanged (except for a scale factor).
Known variables will be denoted by small letters; unknown variables will usually be denoted by capital letters.
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...
Example: Hidden Markov Model
p(x_0, x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n) = p(x_0) \prod_{k=1}^{n} p(x_k | x_{k-1}) \, p(y_k | x_{k-1})
[Figure: chain-structured factor graph: p(x_0) starts the chain; at each step an "=" node on X_{k-1} feeds both the transition factor p(x_k | x_{k-1}) and the output factor p(y_k | x_{k-1}), whose half-edge carries Y_k.]
Sum-product algorithm applied to HMM: Estimation of Current State
p(x_n | y_1, \ldots, y_n) = p(x_n, y_1, \ldots, y_n) / p(y_1, \ldots, y_n)
∝ p(x_n, y_1, \ldots, y_n)
= \sum_{x_0} \cdots \sum_{x_{n-1}} p(x_0, x_1, \ldots, x_n, y_1, y_2, \ldots, y_n)
= \overrightarrow{\mu}_{X_n}(x_n).
[Figure: for n = 2, messages flow forward along the state chain; the edges beyond X_2 (including Y_3) contribute only constant messages (= 1).]
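A sketch of this forward recursion for a small discrete HMM with made-up transition and emission tables (note that, as on the slide, the emission p(y_k | x_{k-1}) is attached to the previous state); normalizing each message, as discussed on the scaling slide below, makes the final forward message directly equal to p(x_n | y_1, ..., y_n).

```python
import numpy as np

# Toy HMM matching the factorization above: p(x0) * prod_k p(x_k|x_{k-1}) p(y_k|x_{k-1}).
p0 = np.array([0.5, 0.5])              # p(x0)
A  = np.array([[0.9, 0.1],             # A[i, j] = p(x_k = j | x_{k-1} = i)
               [0.2, 0.8]])
B  = np.array([[0.7, 0.3],             # B[i, y] = p(y_k = y | x_{k-1} = i)
               [0.1, 0.9]])
y  = [0, 1, 1]                         # observed y_1, ..., y_n

mu = p0.copy()                         # forward message on edge X_0
for yk in y:
    mu = (mu * B[:, yk]) @ A           # absorb p(y_k | x_{k-1}), then sum out x_{k-1}
    mu /= mu.sum()                     # message scaling (see the scaling slide below)
print(mu)                              # = p(x_n | y_1, ..., y_n)
```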
Backward Message in Chain Rule Model
[Figure: node p_{Y|X} attached to edge X (with prior node p_X) and half-edge Y.]
If Y = y is known (observed):
\overleftarrow{\mu}_X(x) = p_{Y|X}(y|x),
the likelihood function.
If Y is unknown:
\overleftarrow{\mu}_X(x) = \sum_{y} p_{Y|X}(y|x) = 1.
Sum-product algorithm applied to HMM: Prediction of Next Output Symbol
p(y_{n+1} | y_1, \ldots, y_n) = p(y_1, \ldots, y_{n+1}) / p(y_1, \ldots, y_n)
∝ p(y_1, \ldots, y_{n+1})
= \sum_{x_0, x_1, \ldots, x_n} p(x_0, x_1, \ldots, x_n, y_1, y_2, \ldots, y_n, y_{n+1})
= \overrightarrow{\mu}_{Y_{n+1}}(y_{n+1}).
[Figure: for n = 2, forward messages flow along the state chain and then down the edge Y_3; the messages from the unobserved future are constant (= 1).]
Sum-product algorithm applied to HMM: Estimation of Time-k State
p(x_k | y_1, y_2, \ldots, y_n) = p(x_k, y_1, y_2, \ldots, y_n) / p(y_1, y_2, \ldots, y_n)
∝ p(x_k, y_1, y_2, \ldots, y_n)
= \sum_{x_0, \ldots, x_n \text{ except } x_k} p(x_0, x_1, \ldots, x_n, y_1, y_2, \ldots, y_n)
= \overrightarrow{\mu}_{X_k}(x_k) \, \overleftarrow{\mu}_{X_k}(x_k).
[Figure: for k = 1, a forward message and a backward message meet on the edge X_1.]
Sum-product algorithm applied to HMM: All States Simultaneously
p(x_k | y_1, \ldots, y_n) for all k:
[Figure: forward and backward messages along the whole chain, combined on every state edge.]
In this application, the sum-product algorithm coincides with the Baum-Welch / BCJR forward-backward algorithm.
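A sketch of the full forward-backward sweep on the same kind of toy HMM (made-up tables, binary states), computing p(x_k | y_1, ..., y_n) for all k at once.

```python
import numpy as np

p0 = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x_k = j | x_{k-1} = i)
B  = np.array([[0.7, 0.3], [0.1, 0.9]])    # p(y_k | x_{k-1} = i)
y  = [0, 1, 1]
n  = len(y)

# Forward messages on edges X_0, ..., X_n (normalized).
mu_fwd = [p0]
for k in range(n):
    m = (mu_fwd[k] * B[:, y[k]]) @ A
    mu_fwd.append(m / m.sum())

# Backward messages on edges X_0, ..., X_n; beyond the data they are constant (= 1).
mu_bwd = [np.ones(2) for _ in range(n + 1)]
for k in range(n - 1, -1, -1):
    m = (A @ mu_bwd[k + 1]) * B[:, y[k]]
    mu_bwd[k] = m / m.sum()

# Posterior state marginals p(x_k | y_1, ..., y_n): product of the two messages, renormalized.
for k in range(n + 1):
    post = mu_fwd[k] * mu_bwd[k]
    print(k, post / post.sum())
```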
Scaling of Messages
In all the examples so far:
• The final result (such as \overrightarrow{\mu}_{X_k}(x_k) \overleftarrow{\mu}_{X_k}(x_k)) equals the desired quantity (such as p(x_k | y_1, \ldots, y_n)) only up to a scale factor.
• The missing scale factor γ may be recovered at the end from the condition
γ \sum_{x_k} \overrightarrow{\mu}_{X_k}(x_k) \overleftarrow{\mu}_{X_k}(x_k) = 1.
• It follows that messages may be scaled freely along the way.
• Such message scaling is often mandatory to avoid numerical problems.
Sum-product algorithm applied to HMM: Probability of the Observation
p(y_1, \ldots, y_n) = \sum_{x_0} \cdots \sum_{x_n} p(x_0, x_1, \ldots, x_n, y_1, y_2, \ldots, y_n)
= \sum_{x_n} \overrightarrow{\mu}_{X_n}(x_n).
This is a number. Scale factors cannot be neglected in this case.
[Figure: for n = 2, the forward messages are summed on the last state edge; the messages from the unobserved future are constant (= 1).]
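A sketch (same toy model) that keeps the scale factors that normalization would otherwise discard; their product equals p(y_1, ..., y_n), which is conveniently accumulated in the log domain.

```python
import numpy as np

p0 = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
y  = [0, 1, 1]

mu, log_p = p0.copy(), 0.0
for yk in y:
    mu = (mu * B[:, yk]) @ A
    scale = mu.sum()                   # the scale factor that normalization would discard
    log_p += np.log(scale)
    mu /= scale
print(np.exp(log_p))                   # = p(y_1, ..., y_n), the sum of the unscaled forward message
```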
Max-product algorithm applied to HMM: MAP Estimate of the State Trajectory
The estimate
(\hat{x}_0, \ldots, \hat{x}_n)_{\text{MAP}} = \operatorname{argmax}_{x_0, \ldots, x_n} p(x_0, \ldots, x_n | y_1, \ldots, y_n) = \operatorname{argmax}_{x_0, \ldots, x_n} p(x_0, \ldots, x_n, y_1, \ldots, y_n)
may be obtained by computing
\hat{p}_k(x_k) \triangleq \max_{x_0, \ldots, x_n \text{ except } x_k} p(x_0, \ldots, x_n, y_1, \ldots, y_n) = \overrightarrow{\mu}_{X_k}(x_k) \, \overleftarrow{\mu}_{X_k}(x_k)
for all k by forward-backward max-product sweeps.
In this example, the max-product algorithm is a time-symmetric version of the Viterbi algorithm with soft output.
Max-product algorithm applied to HMM: MAP Estimate of the State Trajectory cont'd
Computing
\hat{p}_k(x_k) \triangleq \max_{x_0, \ldots, x_n \text{ except } x_k} p(x_0, \ldots, x_n, y_1, \ldots, y_n) = \overrightarrow{\mu}_{X_k}(x_k) \, \overleftarrow{\mu}_{X_k}(x_k)
simultaneously for all k:
[Figure: forward and backward max-product messages along the whole chain.]
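A sketch of the two max-product sweeps on the same toy HMM; taking the componentwise argmax of the max-marginals yields the MAP trajectory provided the maximizer is unique (otherwise a backtracking step, as in the Viterbi algorithm, is needed).

```python
import numpy as np

p0 = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x_k | x_{k-1})
B  = np.array([[0.7, 0.3], [0.1, 0.9]])    # p(y_k | x_{k-1})
y  = [0, 1, 1]
n  = len(y)

# Forward max-product messages (the sum of the forward sweep replaced by a max).
mu_fwd = [p0]
for k in range(n):
    m = ((mu_fwd[k] * B[:, y[k]])[:, None] * A).max(axis=0)
    mu_fwd.append(m / m.max())             # scaling is again allowed

# Backward max-product messages.
mu_bwd = [np.ones(2) for _ in range(n + 1)]
for k in range(n - 1, -1, -1):
    m = (A * mu_bwd[k + 1][None, :]).max(axis=1) * B[:, y[k]]
    mu_bwd[k] = m / m.max()

# Max-marginals and (assuming a unique maximizer) the MAP state trajectory.
x_map = [int(np.argmax(mu_fwd[k] * mu_bwd[k])) for k in range(n + 1)]
print(x_map)
```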
Marginals and Output Edges
Marginals such as \overrightarrow{\mu}_X(x) \overleftarrow{\mu}_X(x) may be viewed as messages out of an "output half edge" \bar{X} (without incoming message):
[Figure: the edge X is replaced by an "=" node with edges X', X'' and a new half-edge \bar{X}.]
\overrightarrow{\mu}_{\bar{X}}(x) = \int_{x'} \int_{x''} \overrightarrow{\mu}_{X'}(x') \, \overleftarrow{\mu}_{X''}(x'') \, \delta(x - x') \, \delta(x - x'') \, dx' \, dx''
= \overrightarrow{\mu}_{X'}(x) \, \overleftarrow{\mu}_{X''}(x)
⇒ Marginals are computed like messages out of "="-nodes.
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...
Factor Graphs with Cycles? Continuous Variables?
Example: Error Correcting Codes (e.g., LDPC Codes)
Codeword x = (x_1, \ldots, x_n) ∈ C ⊂ {0, 1}^n. Received y = (y_1, \ldots, y_n) ∈ R^n.
p(x, y) ∝ p(y|x) \, \delta_C(x).
Code membership function: \delta_C(x) = 1 if x ∈ C, and 0 else.
Parity check function: f_{\oplus}(z_1, \ldots, z_m) = 1 if z_1 + \ldots + z_m ≡ 0 mod 2, and 0 else.
Channel model: p(y|x), e.g., p(y|x) = \prod_{\ell=1}^{n} p(y_\ell | x_\ell).
[Figure: factor graph of an LDPC code: "=" nodes for the code bits X_1, ..., X_n, "⊕" parity check nodes with "random" connections to them, and the channel factors attached to the observations y_1, ..., y_n.]
Decoding by iterative sum-product message passing.
Example: Magnets, Spin Glasses, etc.
Configuration x = (x_1, \ldots, x_n), x_k ∈ {0, 1}.
"Energy" w(x) = \sum_{k} \beta_k x_k + \sum_{\text{neighboring pairs } (k, \ell)} \beta_{k,\ell} (x_k - x_\ell)^2
p(x) ∝ e^{-w(x)} = \prod_{k} e^{-\beta_k x_k} \prod_{\text{neighboring pairs } (k, \ell)} e^{-\beta_{k,\ell} (x_k - x_\ell)^2}
[Figure: grid-structured factor graph with an "=" node for each variable and a pairwise factor between neighboring variables.]
Example: Hidden Markov Model with Parameter(s)
p(x_0, x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n | \theta) = p(x_0) \prod_{k=1}^{n} p(x_k | x_{k-1}, \theta) \, p(y_k | x_{k-1})
[Figure: the HMM factor graph, extended by an edge Θ that branches through "=" nodes into every transition factor p(x_k | x_{k-1}, θ).]
Example: Least-Squares Problems
Minimizing \sum_{k=1}^{n} x_k^2 subject to (linear or nonlinear) constraints is equivalent to maximizing
e^{-\sum_{k=1}^{n} x_k^2} = \prod_{k=1}^{n} e^{-x_k^2}
subject to the given constraints.
[Figure: a constraint node connected to edges X_1, ..., X_n, each of which carries an additional factor e^{-x_k^2}.]
General Linear State Space Model
X_k = A_k X_{k-1} + B_k U_k
Y_k = C_k X_k
[Figure: factor graph of the model: at each step k, the input U_k passes through the matrix node B_k, the previous state X_{k-1} passes through A_k, both enter a "+" node to form X_k, and an "=" node branches X_k through C_k to the output Y_k.]
Gaussian Message Passing in Linear Models
encompasses much of classical signal processing and appears often as a component of more complex problems/algorithms.
Note:
1. Gaussianity of messages is preserved in linear models.
2. Includes Kalman filtering and recursive least-squares algorithms.
3. For Gaussian messages, the sum-product (integral-product) algorithm coincides with the max-product algorithm.
4. For jointly Gaussian random variables, MAP estimation = MMSE estimation = LMMSE estimation.
5. Even if X and Y are not jointly Gaussian, the LMMSE estimate of X from Y = y may be obtained by pretending that X and Y are jointly Gaussian (with their actual means and second-order moments) and forming the corresponding MAP estimate.
See "The factor graph approach to model-based signal processing," Proceedings of the IEEE, June 2007.
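A minimal sketch of Gaussian forward message passing through a scalar instance of this model: the forward message on X_k is a Gaussian with mean m and variance V, and the familiar Kalman predict/update steps are exactly the node-by-node message updates. The observation-noise variance r is my own addition to make the update step generic; it is not drawn in the slide's graph (Y_k = C_k X_k corresponds to r → 0).

```python
import numpy as np

# Scalar instance: X_k = a*X_{k-1} + b*U_k with U_k ~ N(0, 1), observation Y_k = c*X_k
# plus measurement noise of variance r (an assumption added here for illustration).
a, b, c, r = 0.95, 0.5, 1.0, 0.4
y_obs = [1.2, 0.9, 1.5, 1.1]

m, V = 0.0, 10.0                     # forward Gaussian message on X_0: mean and variance
for yk in y_obs:
    # Predict: push the Gaussian message through the A, B and '+' nodes.
    m, V = a * m, a * a * V + b * b
    # Update: multiply by the Gaussian backward message coming from the observation.
    K = V * c / (c * c * V + r)      # Kalman gain
    m, V = m + K * (yk - c * m), (1.0 - K * c) * V
    print(m, V)                      # mean and variance of the forward message on X_k
```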
Continuous Variables: Message Types
The following message types are widely applicable.
• Quantization of messages into discrete bins. Infeasible in higher dimensions.
• Single point: the message µ(x) is replaced by a temporary or final estimate x̂.
• Function value and gradient at a point selected by the receiving node. Allows steepest-ascent (descent) algorithms.
• Gaussians. Works for Kalman filtering.
• Gaussian mixtures.
• List of samples: a pdf can be represented by a list of samples. This data type is the basis of particle filters, but it can be used as a general data type in a graphical model.
• Compound messages: the "product" of other message types.
Particle Methods as Message Passing
Basic idea: represent a probability density f by a list
L = ((x̂^{(1)}, w^{(1)}), \ldots, (x̂^{(L)}, w^{(L)}))
of weighted samples ("particles"):
[Figure: a probability density function f: R → R+ and its representation as a list of particles.]
• Versatile data type for sum-product, max-product, EM, ...
• Not sensitive to dimensionality.
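A sketch of this data type: a weighted-sample ("particle") list representing a density f, here built by importance sampling from a made-up proposal; expectations under f become weighted averages over the list.

```python
import numpy as np

rng = np.random.default_rng(3)
L = 1000

# Particle representation of a standard Gaussian density f, obtained here by
# importance sampling from a wider Gaussian proposal q (standard deviation 2).
x_hat = rng.normal(0.0, 2.0, size=L)                     # samples x_hat^(i) drawn from q
w = np.exp(-0.5 * x_hat**2) / np.exp(-x_hat**2 / 8.0)    # weights w^(i) proportional to f / q
w /= w.sum()

# Expectations under f become weighted averages over the particle list.
print(np.sum(w * x_hat**2))          # estimate of E[X^2] under f, close to 1.0
```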
On Factor Graphs with Cycles
• Generally iterative algorithms.
• For example, alternating maximization
x̂_new = argmax_x f(x, ŷ)  and  ŷ_new = argmax_y f(x̂, y),
using the max-product algorithm in each iteration.
• Iterative sum-product message passing gives excellent results for maximization(!) in some applications (e.g., the decoding of error correcting codes).
• Many other useful algorithms can be formulated in message passing form (e.g., J. Dauwels): gradient ascent, Gibbs sampling, expectation maximization, variational methods, clustering by "affinity propagation" (B. Frey), ...
• Factor graphs make it easy to mix and match different algorithmic techniques.
End of this talk.