An Introduction to Factor Graphs

Hans-Andrea Loeliger

MLSB 2008, Copenhagen

Definition

A factor graph represents the factorization of a function of several variables. We use Forney-style factor graphs (Forney, 2001).

Example:

   f(x_1, x_2, x_3, x_4, x_5) = f_A(x_1, x_2, x_3) \cdot f_B(x_3, x_4, x_5) \cdot f_C(x_4).

[Figure: Forney-style factor graph of this factorization: nodes f_A, f_B, f_C; edge x_3 between f_A and f_B, edge x_4 between f_B and f_C; half-edges x_1, x_2, x_5.]

Rules:
• A node for every factor.
• An edge or half-edge for every variable.
• Node g is connected to edge x iff variable x appears in factor g.

(What if some variable appears in more than 2 factors?)

Example:

Markov Chain

   p_{XYZ}(x, y, z) = p_X(x)\, p_{Y|X}(y|x)\, p_{Z|Y}(z|y).

[Figure: chain factor graph with nodes p_X, p_{Y|X}, p_{Z|Y} and edges X, Y, Z.]

We will often use capital letters for the variables. (Why?) Further examples will come later.

Message Passing Algorithms operate by passing messages along the edges of a factor graph:

[Figure: a factor graph with arrows along every edge indicating the messages passed in both directions.]

A main point of factor graphs (and similar graphical notations):

A Unified View of Historically Different Things

Statistical physics:
- Markov random fields (Ising 1925)

Signal processing:
- linear state-space models and Kalman filtering: Kalman 1960...
- recursive least-squares adaptive filters
- Hidden Markov models: Baum et al. 1966...
- unification: Levy et al. 1996...

Error correcting codes:
- Low-density parity check codes: Gallager 1962; Tanner 1981; MacKay 1996; Luby et al. 1998...
- Convolutional codes and Viterbi decoding: Forney 1973...
- Turbo codes: Berrou et al. 1993...

Machine learning, statistics:
- Bayesian networks: Pearl 1988; Shachter 1988; Lauritzen and Spiegelhalter 1988; Shafer and Shenoy 1990...

Other Notation Systems for Graphical Models

Example: p(u, w, x, y, z) = p(u)\, p(w)\, p(x|u, w)\, p(y|x)\, p(z|x).

[Figure: the same model drawn in four notations over the variables U, W, X, Y, Z: a Forney-style factor graph, the original factor graph [FKLW 1997], a Bayesian network, and a Markov random field.]

Outline

1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

The Two Basic Problems

1. Marginalization: Compute

   \bar{f}_k(x_k) \triangleq \sum_{x_1,\dots,x_n \text{ except } x_k} f(x_1,\dots,x_n)

2. Maximization: Compute the "max-marginal"

   \hat{f}_k(x_k) \triangleq \max_{x_1,\dots,x_n \text{ except } x_k} f(x_1,\dots,x_n)

assuming that f is real-valued and nonnegative and has a maximum. Note that

   \operatorname{argmax}_{x_1,\dots,x_n} f(x_1,\dots,x_n) = \bigl(\operatorname{argmax}_{x_1} \hat{f}_1(x_1),\ \dots,\ \operatorname{argmax}_{x_n} \hat{f}_n(x_n)\bigr).

For large n, both problems are in general intractable even for x_1,\dots,x_n \in \{0,1\}.
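For concreteness, here is a minimal brute-force sketch of the two problems (our own illustration, not from the slides), with binary variables and made-up factor tables for the example factorization from the definition slide:

```python
import itertools
import numpy as np

# Brute-force marginalization and maximization, as a reference point for the
# message-passing computations that follow. The cost grows as 2**n, which is
# exactly why factorization matters.

def marginal(f, n, k):
    """Sum f over all variables except x_k (variables in {0, 1})."""
    out = np.zeros(2)
    for x in itertools.product([0, 1], repeat=n):
        out[x[k]] += f(x)
    return out

def max_marginal(f, n, k):
    """Maximize f over all variables except x_k."""
    out = np.zeros(2)
    for x in itertools.product([0, 1], repeat=n):
        out[x[k]] = max(out[x[k]], f(x))
    return out

# Example: f = fA(x1, x2, x3) * fB(x3, x4, x5) * fC(x4), with made-up tables.
rng = np.random.default_rng(0)
fA, fB = rng.random((2, 2, 2)), rng.random((2, 2, 2))
fC = rng.random(2)
f = lambda x: fA[x[0], x[1], x[2]] * fB[x[2], x[3], x[4]] * fC[x[3]]

print(marginal(f, 5, 2))      # marginal of x3
print(max_marginal(f, 5, 2))  # max-marginal of x3
```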

Factorization Helps

For example, if f(x_1, \dots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n), then

   \bar{f}_k(x_k) = \Bigl(\sum_{x_1} f_1(x_1)\Bigr) \cdots \Bigl(\sum_{x_{k-1}} f_{k-1}(x_{k-1})\Bigr)\, f_k(x_k)\, \cdots \Bigl(\sum_{x_n} f_n(x_n)\Bigr)

and

   \hat{f}_k(x_k) = \Bigl(\max_{x_1} f_1(x_1)\Bigr) \cdots \Bigl(\max_{x_{k-1}} f_{k-1}(x_{k-1})\Bigr)\, f_k(x_k)\, \cdots \Bigl(\max_{x_n} f_n(x_n)\Bigr).

Factorization helps also beyond this trivial example.
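A quick sanity check of the fully factorized case, with hypothetical binary factors (our own illustration): the marginal of x_k is f_k times the product of the other factors' sums, i.e. n one-dimensional sums instead of one n-dimensional sum.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
factors = [rng.random(2) for _ in range(8)]   # f_1, ..., f_8 over {0, 1}
k = 3

# Factorized computation: product of the other factors' sums, times f_k.
prefactor = np.prod([f.sum() for i, f in enumerate(factors) if i != k])
marginal_k = prefactor * factors[k]

# Brute-force check (2**n terms).
brute = np.zeros(2)
for x in itertools.product([0, 1], repeat=len(factors)):
    brute[x[k]] += np.prod([f[xi] for f, xi in zip(factors, x)])
assert np.allclose(marginal_k, brute)
print(marginal_k)
```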

Towards the sum-product algorithm:

Computing Marginals — A Generic Example

Assume we wish to compute

   \bar{f}_3(x_3) = \sum_{x_1,\dots,x_7 \text{ except } x_3} f(x_1,\dots,x_7)

and assume that f can be factored as follows:

   f(x_1,\dots,x_7) = f_1(x_1) f_2(x_2) f_3(x_1,x_2,x_3)\, f_4(x_4) f_5(x_3,x_4,x_5)\, f_6(x_5,x_6,x_7) f_7(x_7).

[Figure: factor graph of this factorization with nodes f_1,\dots,f_7 and edges X_1,\dots,X_7 (X_6 is a half-edge).]

Example cont’d:

Closing Boxes by the Distributive Law

[Figure: the factor graph above with dashed boxes drawn around the sub-graphs whose sums appear below.]

   \bar{f}_3(x_3) = \Bigl(\sum_{x_1,x_2} f_1(x_1) f_2(x_2) f_3(x_1,x_2,x_3)\Bigr) \cdot \Bigl(\sum_{x_4,x_5} f_4(x_4) f_5(x_3,x_4,x_5) \Bigl(\sum_{x_6,x_7} f_6(x_5,x_6,x_7) f_7(x_7)\Bigr)\Bigr)

Example cont'd: Message Passing View

[Figure: the same factor graph with arrows on edges X_3 and X_5 marking the messages defined below.]

   \bar{f}_3(x_3) = \underbrace{\sum_{x_1,x_2} f_1(x_1) f_2(x_2) f_3(x_1,x_2,x_3)}_{\overrightarrow{\mu}_{X_3}(x_3)} \cdot \underbrace{\sum_{x_4,x_5} f_4(x_4) f_5(x_3,x_4,x_5) \underbrace{\sum_{x_6,x_7} f_6(x_5,x_6,x_7) f_7(x_7)}_{\overleftarrow{\mu}_{X_5}(x_5)}}_{\overleftarrow{\mu}_{X_3}(x_3)}

Example cont'd: Messages Everywhere

[Figure: the same factor graph with message arrows on every edge.]

With \overrightarrow{\mu}_{X_1}(x_1) \triangleq f_1(x_1), \overrightarrow{\mu}_{X_2}(x_2) \triangleq f_2(x_2), etc., we have

   \overrightarrow{\mu}_{X_3}(x_3) = \sum_{x_1,x_2} \overrightarrow{\mu}_{X_1}(x_1)\, \overrightarrow{\mu}_{X_2}(x_2)\, f_3(x_1,x_2,x_3)

   \overleftarrow{\mu}_{X_5}(x_5) = \sum_{x_6,x_7} \overrightarrow{\mu}_{X_7}(x_7)\, f_6(x_5,x_6,x_7)

   \overleftarrow{\mu}_{X_3}(x_3) = \sum_{x_4,x_5} \overrightarrow{\mu}_{X_4}(x_4)\, \overleftarrow{\mu}_{X_5}(x_5)\, f_5(x_3,x_4,x_5)

The Sum-Product Algorithm (Belief Propagation)

[Figure: generic node g with incoming edges Y_1, \dots, Y_n and outgoing edge X.]

Sum-product message computation rule:

   \overrightarrow{\mu}_X(x) = \sum_{y_1,\dots,y_n} g(x, y_1, \dots, y_n)\, \overrightarrow{\mu}_{Y_1}(y_1) \cdots \overrightarrow{\mu}_{Y_n}(y_n)

Sum-product theorem: If the factor graph for some function f has no cycles, then

   \bar{f}_X(x) = \overrightarrow{\mu}_X(x)\, \overleftarrow{\mu}_X(x).
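A minimal sketch of these message computations for the seven-variable example, assuming binary variables and made-up factor tables (the einsum index letters a..g stand for x_1..x_7); the brute-force check at the end illustrates the sum-product theorem on this cycle-free graph:

```python
import numpy as np

rng = np.random.default_rng(1)
f1, f2, f4, f7 = (rng.random(2) for _ in range(4))
f3, f5, f6 = (rng.random((2, 2, 2)) for _ in range(3))

# Messages along the half-edges are the single-argument factors themselves.
mu_X1, mu_X2, mu_X4, mu_X7 = f1, f2, f4, f7

# Sum-product rule: multiply the incoming messages into the local factor,
# then sum out the incoming variables.
mu_X3_fwd = np.einsum('a,b,abc->c', mu_X1, mu_X2, f3)      # f3 -> X3
mu_X5_bwd = np.einsum('g,efg->e', mu_X7, f6)               # f6 -> X5
mu_X3_bwd = np.einsum('d,e,cde->c', mu_X4, mu_X5_bwd, f5)  # f5 -> X3

# Sum-product theorem (cycle-free graph): marginal = product of both messages.
marginal_X3 = mu_X3_fwd * mu_X3_bwd

# Brute-force check against the full joint table.
f = np.einsum('a,b,abc,d,cde,efg,g->abcdefg', f1, f2, f3, f4, f5, f6, f7)
assert np.allclose(marginal_X3, f.sum(axis=(0, 1, 3, 4, 5, 6)))
```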

Towards the max-product algorithm:

Computing Max-Marginals — A Generic Example

Assume we wish to compute

   \hat{f}_3(x_3) = \max_{x_1,\dots,x_7 \text{ except } x_3} f(x_1,\dots,x_7)

and assume that f can be factored as follows:

   f(x_1,\dots,x_7) = f_1(x_1) f_2(x_2) f_3(x_1,x_2,x_3)\, f_4(x_4) f_5(x_3,x_4,x_5)\, f_6(x_5,x_6,x_7) f_7(x_7).

[Figure: the same factor graph as before, with nodes f_1,\dots,f_7 and edges X_1,\dots,X_7.]

Example:

Closing Boxes by the Distributive Law

[Figure: the factor graph with dashed boxes around the sub-graphs whose maximizations appear below.]

   \hat{f}_3(x_3) = \Bigl(\max_{x_1,x_2} f_1(x_1) f_2(x_2) f_3(x_1,x_2,x_3)\Bigr) \cdot \Bigl(\max_{x_4,x_5} f_4(x_4) f_5(x_3,x_4,x_5) \Bigl(\max_{x_6,x_7} f_6(x_5,x_6,x_7) f_7(x_7)\Bigr)\Bigr)

Example cont'd: Message Passing View

[Figure: the same factor graph with arrows on edges X_3 and X_5 marking the messages defined below.]

   \hat{f}_3(x_3) = \underbrace{\max_{x_1,x_2} f_1(x_1) f_2(x_2) f_3(x_1,x_2,x_3)}_{\overrightarrow{\mu}_{X_3}(x_3)} \cdot \underbrace{\max_{x_4,x_5} f_4(x_4) f_5(x_3,x_4,x_5) \underbrace{\max_{x_6,x_7} f_6(x_5,x_6,x_7) f_7(x_7)}_{\overleftarrow{\mu}_{X_5}(x_5)}}_{\overleftarrow{\mu}_{X_3}(x_3)}

Example cont'd: Messages Everywhere

[Figure: the same factor graph with max-product message arrows on every edge.]

With \overrightarrow{\mu}_{X_1}(x_1) \triangleq f_1(x_1), \overrightarrow{\mu}_{X_2}(x_2) \triangleq f_2(x_2), etc., we have

   \overrightarrow{\mu}_{X_3}(x_3) = \max_{x_1,x_2} \overrightarrow{\mu}_{X_1}(x_1)\, \overrightarrow{\mu}_{X_2}(x_2)\, f_3(x_1,x_2,x_3)

   \overleftarrow{\mu}_{X_5}(x_5) = \max_{x_6,x_7} \overrightarrow{\mu}_{X_7}(x_7)\, f_6(x_5,x_6,x_7)

   \overleftarrow{\mu}_{X_3}(x_3) = \max_{x_4,x_5} \overrightarrow{\mu}_{X_4}(x_4)\, \overleftarrow{\mu}_{X_5}(x_5)\, f_5(x_3,x_4,x_5)

The Max-Product Algorithm

[Figure: generic node g with incoming edges Y_1, \dots, Y_n and outgoing edge X.]

Max-product message computation rule:

   \overrightarrow{\mu}_X(x) = \max_{y_1,\dots,y_n} g(x, y_1, \dots, y_n)\, \overrightarrow{\mu}_{Y_1}(y_1) \cdots \overrightarrow{\mu}_{Y_n}(y_n)

Max-product theorem: If the factor graph for some global function f has no cycles, then

   \hat{f}_X(x) = \overrightarrow{\mu}_X(x)\, \overleftarrow{\mu}_X(x).
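The same sketch with every sum replaced by a max, again with hypothetical factor tables (our own illustration), showing the max-product theorem on the cycle-free example graph:

```python
import numpy as np

rng = np.random.default_rng(1)
f1, f2, f4, f7 = (rng.random(2) for _ in range(4))
f3, f5, f6 = (rng.random((2, 2, 2)) for _ in range(3))

# Max-product rule: maximize out the incoming variables instead of summing.
mu_X3_fwd = np.max(f1[:, None, None] * f2[None, :, None] * f3, axis=(0, 1))
mu_X5_bwd = np.max(f7[None, None, :] * f6, axis=(1, 2))
mu_X3_bwd = np.max(f4[None, :, None] * mu_X5_bwd[None, None, :] * f5, axis=(1, 2))

# Max-product theorem (cycle-free graph): max-marginal = product of messages.
max_marginal_X3 = mu_X3_fwd * mu_X3_bwd

f = np.einsum('a,b,abc,d,cde,efg,g->abcdefg', f1, f2, f3, f4, f5, f6, f7)
assert np.allclose(max_marginal_X3, f.max(axis=(0, 1, 3, 4, 5, 6)))
```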

Outline

1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

Terminology

[Figure: the factor graph of f(x_1,\dots,x_5) = f_A(x_1,x_2,x_3) \cdot f_B(x_3,x_4,x_5) \cdot f_C(x_4) from the definition slide.]

Global function f = product of all factors; usually (but not always!) real and nonnegative.

A configuration is an assignment of values to all variables.

The configuration space is the set of all configurations, which is the domain of the global function.

A configuration ω = (x_1, \dots, x_5) is valid iff f(ω) ≠ 0.

Invalid Configurations Do Not Affect Marginals

A configuration ω = (x_1, \dots, x_n) is valid iff f(ω) ≠ 0.

Recall:

1. Marginalization: Compute
   \bar{f}_k(x_k) \triangleq \sum_{x_1,\dots,x_n \text{ except } x_k} f(x_1,\dots,x_n)

2. Maximization: Compute the "max-marginal"
   \hat{f}_k(x_k) \triangleq \max_{x_1,\dots,x_n \text{ except } x_k} f(x_1,\dots,x_n)

assuming that f is real-valued and nonnegative and has a maximum.

Constraints may be expressed by factors that evaluate to 0 if the constraint is violated.

Branching Point = Equality Constraint Factor

[Figure: an equality node "=" with edges X, X', X'' is equivalent to a simple branching point on edge X.]

The factor

   f_{=}(x, x', x'') \triangleq \begin{cases} 1, & \text{if } x = x' = x'' \\ 0, & \text{else} \end{cases}

enforces X = X' = X'' for every valid configuration.

More generally:

   f_{=}(x, x', x'') \triangleq \delta(x - x')\, \delta(x - x''),

where δ denotes the Kronecker delta for discrete variables and the Dirac delta for continuous variables.
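A small sketch (our own, with hypothetical incoming messages) showing that the sum-product message out of an equality node reduces to the pointwise product of the incoming messages, which is what makes the branching-point interpretation work:

```python
import numpy as np

# For discrete variables, the sum-product rule applied to the equality factor
# gives: mu_out(x) = sum_{x', x''} f_=(x, x', x'') mu1(x') mu2(x'')
#                  = mu1(x) * mu2(x).
mu1 = np.array([0.2, 0.8])
mu2 = np.array([0.6, 0.4])

f_eq = np.zeros((2, 2, 2))
for x in range(2):
    f_eq[x, x, x] = 1.0          # f_=(x, x', x'') = 1 iff x = x' = x''

mu_out = np.einsum('abc,b,c->a', f_eq, mu1, mu2)
assert np.allclose(mu_out, mu1 * mu2)
print(mu_out)
```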

Example:

Independent Observations

Let Y_1 and Y_2 be two independent observations of X:

   p(x, y_1, y_2) = p(x)\, p(y_1|x)\, p(y_2|x).

[Figure: factor graph with node p_X, an equality node on X, and nodes p_{Y_1|X}, p_{Y_2|X} with half-edges Y_1 and Y_2.]

Literally, the factor graph represents an extended model

   p(x, x', x'', y_1, y_2) = p(x)\, p(y_1|x')\, p(y_2|x'')\, f_{=}(x, x', x'')

with the same marginals and max-marginals as p(x, y_1, y_2).

From A Priori to A Posteriori Probability

Example (cont'd): Let Y_1 = y_1 and Y_2 = y_2 be two independent observations of X, i.e., p(x, y_1, y_2) = p(x) p(y_1|x) p(y_2|x). For fixed y_1 and y_2, we have

   p(x | y_1, y_2) = \frac{p(x, y_1, y_2)}{p(y_1, y_2)} \propto p(x, y_1, y_2).

[Figure: the same factor graph, now with the known values y_1 and y_2 in place of the half-edges Y_1 and Y_2.]

The factorization is unchanged (except for a scale factor).

Known variables will be denoted by small letters; unknown variables will usually be denoted by capital letters.

Outline

1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

Example:

Hidden Markov Model

   p(x_0, x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_n) = p(x_0) \prod_{k=1}^{n} p(x_k | x_{k-1})\, p(y_k | x_{k-1})

[Figure: chain factor graph with state edges X_0, X_1, X_2, \dots; each state branches through an equality node into the transition node p(x_k|x_{k-1}) and the observation node p(y_k|x_{k-1}) with downward half-edge Y_k.]

Sum-product algorithm applied to HMM:

Estimation of Current State

   p(x_n | y_1, \dots, y_n) = \frac{p(x_n, y_1, \dots, y_n)}{p(y_1, \dots, y_n)}
                            \propto p(x_n, y_1, \dots, y_n)
                            = \sum_{x_0} \cdots \sum_{x_{n-1}} p(x_0, x_1, \dots, x_n, y_1, y_2, \dots, y_n)
                            = \overrightarrow{\mu}_{X_n}(x_n).

For n = 2:

[Figure: the HMM factor graph with observed y_1, y_2; forward messages are propagated towards X_2, while the backward messages from the unobserved future are constant (= 1).]
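A minimal sketch of this forward recursion (HMM filtering) for a small discrete HMM; the transition and emission tables below are made up for illustration, and the emissions are attached to x_{k-1} as in the slides' factorization:

```python
import numpy as np

P0 = np.array([0.5, 0.5])              # p(x_0)
A = np.array([[0.9, 0.1],              # A[i, j] = p(x_k = j | x_{k-1} = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],              # B[i, y] = p(y_k = y | x_{k-1} = i)
              [0.1, 0.9]])
y = [0, 1, 1]                          # observed symbols y_1, ..., y_n

mu_fwd = P0.copy()                     # forward message along X_0
for yk in y:
    # Multiply in the local observation factor p(y_k | x_{k-1}) ...
    mu_fwd = mu_fwd * B[:, yk]
    # ... then sum out x_{k-1} against the transition factor p(x_k | x_{k-1}).
    mu_fwd = mu_fwd @ A
    mu_fwd /= mu_fwd.sum()             # optional scaling (see the later slide)

print(mu_fwd)                          # p(x_n | y_1, ..., y_n) after scaling
```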

Backward Message in Chain Rule Model

[Figure: node p_X with edge X into node p_{Y|X}, which has edge Y.]

If Y = y is known (observed):

   \overleftarrow{\mu}_X(x) = p_{Y|X}(y|x),

the likelihood function.

If Y is unknown:

   \overleftarrow{\mu}_X(x) = \sum_{y} p_{Y|X}(y|x) = 1.

Sum-product algorithm applied to HMM:

Prediction of Next Output Symbol

   p(y_{n+1} | y_1, \dots, y_n) = \frac{p(y_1, \dots, y_{n+1})}{p(y_1, \dots, y_n)}
                                \propto p(y_1, \dots, y_{n+1})
                                = \sum_{x_0, x_1, \dots, x_n} p(x_0, x_1, \dots, x_n, y_1, y_2, \dots, y_n, y_{n+1})
                                = \overrightarrow{\mu}_{Y_{n+1}}(y_{n+1}).

For n = 2:

[Figure: the HMM factor graph with observed y_1, y_2; the forward messages are propagated to the half-edge Y_3.]

Sum-product algorithm applied to HMM:

Estimation of Time-k State

   p(x_k | y_1, y_2, \dots, y_n) = \frac{p(x_k, y_1, y_2, \dots, y_n)}{p(y_1, y_2, \dots, y_n)}
                                 \propto p(x_k, y_1, y_2, \dots, y_n)
                                 = \sum_{x_0,\dots,x_n \text{ except } x_k} p(x_0, x_1, \dots, x_n, y_1, y_2, \dots, y_n)
                                 = \overrightarrow{\mu}_{X_k}(x_k)\, \overleftarrow{\mu}_{X_k}(x_k)

For k = 1:

[Figure: the HMM factor graph with observed y_1, y_2, y_3; the forward and backward messages meeting on edge X_1 are combined.]

Sum-product algorithm applied to HMM:

All States Simultaneously

p(x_k | y_1, \dots, y_n) for all k:

[Figure: the HMM factor graph with forward and backward messages on every state edge.]

In this application, the sum-product algorithm coincides with the Baum-Welch / BCJR forward-backward algorithm.

Scaling of Messages

In all the examples so far:

• The final result (such as \overrightarrow{\mu}_{X_k}(x_k)\, \overleftarrow{\mu}_{X_k}(x_k)) equals the desired quantity (such as p(x_k | y_1, \dots, y_n)) only up to a scale factor.

• The missing scale factor γ may be recovered at the end from the condition

   \sum_{x_k} \gamma\, \overrightarrow{\mu}_{X_k}(x_k)\, \overleftarrow{\mu}_{X_k}(x_k) = 1.

• It follows that messages may be scaled freely along the way.

• Such message scaling is often mandatory to avoid numerical problems.

Sum-product algorithm applied to HMM:

Probability of the Observation

   p(y_1, \dots, y_n) = \sum_{x_0} \cdots \sum_{x_n} p(x_0, x_1, \dots, x_n, y_1, y_2, \dots, y_n)
                      = \sum_{x_n} \overrightarrow{\mu}_{X_n}(x_n).

This is a number. Scale factors cannot be neglected in this case.

For n = 2:

[Figure: the HMM factor graph with observed y_1, y_2; the forward message at X_2 is summed out, the backward messages being constant (= 1).]

Max-product algorithm applied to HMM:

MAP Estimate of the State Trajectory

The estimate

   (\hat{x}_0, \dots, \hat{x}_n)_{\mathrm{MAP}} = \operatorname{argmax}_{x_0,\dots,x_n} p(x_0, \dots, x_n | y_1, \dots, y_n)
                                                = \operatorname{argmax}_{x_0,\dots,x_n} p(x_0, \dots, x_n, y_1, \dots, y_n)

may be obtained by computing

   \hat{p}_k(x_k) \triangleq \max_{x_0,\dots,x_n \text{ except } x_k} p(x_0, \dots, x_n, y_1, \dots, y_n)
                  = \overrightarrow{\mu}_{X_k}(x_k)\, \overleftarrow{\mu}_{X_k}(x_k)

for all k by forward-backward max-product sweeps.

In this example, the max-product algorithm is a time-symmetric version of the Viterbi algorithm with soft output.

Max-product algorithm applied to HMM:

MAP Estimate of the State Trajectory cont'd

Computing

   \hat{p}_k(x_k) \triangleq \max_{x_0,\dots,x_n \text{ except } x_k} p(x_0, \dots, x_n, y_1, \dots, y_n)
                  = \overrightarrow{\mu}_{X_k}(x_k)\, \overleftarrow{\mu}_{X_k}(x_k)

simultaneously for all k:

[Figure: the HMM factor graph with forward and backward max-product messages on every state edge.]
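The corresponding max-product sweeps, again with the hypothetical HMM parameters from the earlier sketches; the componentwise argmax of the max-marginals recovers the MAP trajectory when it is unique:

```python
import numpy as np

P0 = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(x_k | x_{k-1})
B = np.array([[0.7, 0.3], [0.1, 0.9]])   # p(y_k | x_{k-1})
y = [0, 1, 1]
n = len(y)

# Forward max-product messages (sums replaced by maximizations).
fwd = [P0]
for k in range(n):
    fwd.append(np.max((fwd[-1] * B[:, y[k]])[:, None] * A, axis=0))

# Backward max-product messages.
bwd = [np.ones(2)]
for k in reversed(range(n)):
    bwd.insert(0, B[:, y[k]] * np.max(A * bwd[0][None, :], axis=1))

# Max-marginals p_hat_k and the componentwise MAP state estimates.
p_hat = np.array([f * b for f, b in zip(fwd, bwd)])
x_map = p_hat.argmax(axis=1)
print(x_map)
```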

Marginals and Output Edges

Marginals such as \overrightarrow{\mu}_X(x)\, \overleftarrow{\mu}_X(x) may be viewed as messages out of an "output half-edge" \bar{X} (without incoming message):

[Figure: an edge X is replaced by an equality node with edges X', X'' and an output half-edge \bar{X}.]

   \overrightarrow{\mu}_{\bar{X}}(x) = \int_{x'} \int_{x''} \overrightarrow{\mu}_{X'}(x')\, \overleftarrow{\mu}_{X''}(x'')\, \delta(x - x')\, \delta(x - x'')\, dx'\, dx''
                                     = \overrightarrow{\mu}_{X'}(x)\, \overleftarrow{\mu}_{X''}(x)

⇒ Marginals are computed like messages out of "="-nodes.

Outline

1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

Factor Graphs with Cycles? Continuous Variables?

[Figure: a factor graph with cycles, with messages on every edge.]

Example:

Error Correcting Codes (e.g., LDPC Codes)

Codeword x = (x_1, \dots, x_n) ∈ C ⊂ \{0, 1\}^n. Received y = (y_1, \dots, y_n) ∈ \mathbb{R}^n.

   p(x, y) ∝ p(y|x)\, \delta_C(x).

Code membership function:

   \delta_C(x) = \begin{cases} 1, & \text{if } x \in C \\ 0, & \text{else} \end{cases}

Parity-check function:

   f_{\oplus}(z_1, \dots, z_m) = \begin{cases} 1, & \text{if } z_1 + \dots + z_m \equiv 0 \pmod 2 \\ 0, & \text{else} \end{cases}

Channel model p(y|x), e.g.,

   p(y|x) = \prod_{\ell=1}^{n} p(y_\ell | x_\ell).

[Figure: Tanner graph of an LDPC code: equality nodes for the code bits X_1, \dots, X_n, parity-check (⊕) nodes with "random" connections to them, and channel-model nodes attached to the received values y_1, \dots, y_n.]

Decoding by iterative sum-product message passing.

Example:

Magnets, Spin Glasses, etc.

Configuration x = (x_1, \dots, x_n), x_k ∈ \{0, 1\}.

"Energy"

   w(x) = \sum_{k} \beta_k x_k + \sum_{\text{neighboring pairs } (k,\ell)} \beta_{k,\ell} (x_k - x_\ell)^2

   p(x) ∝ e^{-w(x)} = \prod_{k} e^{-\beta_k x_k} \prod_{\text{neighboring pairs } (k,\ell)} e^{-\beta_{k,\ell}(x_k - x_\ell)^2}

[Figure: two-dimensional grid factor graph (a Markov random field): an equality node for each variable x_k with its single-variable factor attached, and a pairwise factor between each pair of neighboring variables.]

Example:

Hidden Markov Model with Parameter(s)

   p(x_0, x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_n | \theta) = p(x_0) \prod_{k=1}^{n} p(x_k | x_{k-1}, \theta)\, p(y_k | x_{k-1})

[Figure: the HMM factor graph with an additional edge Θ that branches through equality nodes into every transition node p(x_k | x_{k-1}, θ).]

Example:

Least-Squares Problems

Minimizing \sum_{k=1}^{n} x_k^2 subject to (linear or nonlinear) constraints is equivalent to maximizing

   e^{-\sum_{k=1}^{n} x_k^2} = \prod_{k=1}^{n} e^{-x_k^2}

subject to the given constraints.

[Figure: a "constraints" node with edges X_1, \dots, X_n, each terminated by a node e^{-x_k^2}.]

General Linear State Space Model

   X_k = A_k X_{k-1} + B_k U_k
   Y_k = C_k X_k

[Figure: factor graph of the model: at each step k, the state edge X_{k-1} passes through a matrix-multiplication node A_k and an adder "+" with input B_k U_k, producing X_k; an equality node on X_k feeds C_k, which produces the output Y_k.]

Gaussian Message Passing in Linear Models

encompasses much of classical signal processing and appears often as a component of more complex problems/algorithms.

Note:
1. Gaussianity of messages is preserved in linear models.
2. Includes Kalman filtering and recursive least-squares algorithms.
3. For Gaussian messages, the sum-product (integral-product) algorithm coincides with the max-product algorithm.
4. For jointly Gaussian random variables, MAP estimation = MMSE estimation = LMMSE estimation.
5. Even if X and Y are not jointly Gaussian, the LMMSE estimate of X from Y = y may be obtained by pretending that X and Y are jointly Gaussian (with their actual means and second-order moments) and forming the corresponding MAP estimate.

See "The factor graph approach to model-based signal processing," Proceedings of the IEEE, June 2007.
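A minimal Kalman-filter step written as forward Gaussian message passing. Note the assumption of an added observation-noise term, which the slide's noiseless output equation Y_k = C_k X_k does not contain; all matrices and noise levels are hypothetical placeholders:

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition A_k
B = np.eye(2)                             # input matrix B_k
Q = 0.1 * np.eye(2)                       # covariance of B_k U_k
C = np.array([[1.0, 0.0]])                # output matrix C_k
R = np.array([[0.5]])                     # assumed observation-noise covariance

def kalman_step(m, V, y):
    """Propagate the forward Gaussian message (mean m, covariance V)."""
    # Predict: push the message through the A-node and the "+"-node.
    m_pred = A @ m
    V_pred = A @ V @ A.T + B @ Q @ B.T
    # Update: multiply by the backward (likelihood) message from Y_k = y.
    S = C @ V_pred @ C.T + R
    K = V_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    V_new = (np.eye(len(m)) - K @ C) @ V_pred
    return m_new, V_new

m, V = np.zeros(2), np.eye(2)
for y in [np.array([0.9]), np.array([2.1]), np.array([2.9])]:
    m, V = kalman_step(m, V, y)
print(m)
```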

Continuous Variables: Message Types

The following message types are widely applicable.

• Quantization of messages into discrete bins. Infeasible in higher dimensions.
• Single point: the message µ(x) is replaced by a temporary or final estimate \hat{x}.
• Function value and gradient at a point selected by the receiving node. Allows steepest-ascent (descent) algorithms.
• Gaussians. Works for Kalman filtering.
• Gaussian mixtures.
• List of samples: a pdf can be represented by a list of samples. This data type is the basis of particle filters, but it can be used as a general data type in a graphical model.
• Compound messages: the "product" of other message types.

Particle Methods as Message Passing

Basic idea: represent a probability density f by a list

   L = \bigl((\hat{x}^{(1)}, w^{(1)}),\ \dots,\ (\hat{x}^{(L)}, w^{(L)})\bigr)

of weighted samples ("particles"):

[Figure: a probability density function f : \mathbb{R} \to \mathbb{R}_+ and its representation as a list of particles.]

• Versatile data type for sum-product, max-product, EM, ...
• Not sensitive to dimensionality.
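A minimal sketch of the weighted-sample representation, here obtained by importance sampling from a hypothetical target density; the slides do not prescribe how the particle list is produced, so the proposal and target below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1000

# Importance sampling: draw particles from a proposal q and weight by f/q.
samples = rng.uniform(-5.0, 5.0, size=L)             # proposal q = Uniform(-5, 5)
f = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # target: unit Gaussian
weights = f(samples) / (1.0 / 10.0)                   # unnormalized f/q
weights /= weights.sum()

# Any expectation under f is approximated by a weighted sum over particles.
mean_est = np.sum(weights * samples)
var_est = np.sum(weights * samples**2) - mean_est**2
print(mean_est, var_est)                              # roughly 0 and 1
```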

On Factor Graphs with Cycles

• Generally iterative algorithms.

• For example, alternating maximization

   \hat{x}_{\text{new}} = \operatorname{argmax}_x f(x, \hat{y}) \quad \text{and} \quad \hat{y}_{\text{new}} = \operatorname{argmax}_y f(\hat{x}, y),

  using the max-product algorithm in each iteration.

• Iterative sum-product message passing gives excellent results for maximization(!) in some applications (e.g., the decoding of error correcting codes).

• Many other useful algorithms can be formulated in message-passing form (e.g., J. Dauwels): gradient ascent, Gibbs sampling, expectation maximization, variational methods, clustering by "affinity propagation" (B. Frey), ...

• Factor graphs make it easy to mix and match different algorithmic techniques.

End of this talk.