Expectation-Maximization Algorithm

Outline
1. Motivation and EM View
2. Overview of the EM Algorithm
3. Examples
4. Theoretical Issues in the EM Algorithm
5. Variants of the EM Algorithm

Problem
• Y : p-dimensional random vector with p.d.f. g(y | ψ) on R^p, ψ ∈ Ω.
• Observe the random variable Y, modeled by g(y | ψ).
• The likelihood function for ψ formed from the observed data y:

  L(ψ) = g(y | ψ)

• Estimate ψ by maximum likelihood:

  ψ* = arg max_{ψ ∈ Ω} log L(ψ)

  ∂ log L(ψ)/∂ψ |_{ψ = ψ*} = 0        (1)

♦ Often it is difficult to solve equation (1) directly.

Motivation !

Data sets with Missing Values ♦ Censored and grouped Observation ♦ Truncated Observation

!

Cannot be avoided in many real-world problems

Incomplete Data Model (1)
• Imagine two sample spaces X and Y, and a map h: X → Y.
  ♦ X : complete-data space, with density family { f(x | ψ) }
  ♦ Y : incomplete-data space, with density family { g(y | ψ) }
  ♦ X(y) = { x ∈ X : h(x) = y } : the set of complete-data points mapped to the observed y

Incomplete Data Model (2)
♦ Occurrence of x ∈ X ⇒ occurrence of y = h(x) ∈ Y.
♦ Only y can be observed; observing y reveals only that x lies in X(y).
♦ The relation between the densities f and g is

  g(y | ψ) = ∫_{X(y)} f(x | ψ) dx

♦ In a given problem, Y is fixed by the observations, while X can be chosen.

Outline
1. Motivation and EM View
2. Overview of the EM Algorithm
3. Examples
4. Theoretical Issues in the EM Algorithm
5. Variants of the EM Algorithm

EM Algorithm (1)
• A standard tool in the statistical repertoire.
• Basic idea
  ♦ Associate with the given incomplete-data problem a complete-data problem for which ML estimation is computationally more tractable.
• Notation
  ♦ Y : random vector corresponding to the observed data y, with p.d.f. postulated as g(y; ψ), where ψ is a vector of unknown parameters with parameter space Ω
  ♦ y : observed incomplete data
  ♦ x : augmented complete data
  ♦ z : additional data, referred to as the unobservable or missing data
  ♦ f(x; ψ) : p.d.f. of the random vector X corresponding to the complete-data vector x
  ♦ Complete-data log likelihood: log Lc(ψ) = log f(x; ψ)

EM Algorithm (2)
• The expectation of the complete-data log likelihood:

  Q(ψ | ψ') = E[log f(X | ψ) | y, ψ']

• ψ(0) ∈ Ω : any first approximation to ψ*
• The EM algorithm repeats two steps. On the (k+1)th iteration:
  ♦ E-step. Calculate Q(ψ | ψ(k)).
  ♦ M-step. Choose ψ(k+1) to be any value of ψ ∈ Ω that maximizes Q(ψ | ψ(k)):

  ψ(k+1) = arg max_{ψ ∈ Ω} Q(ψ | ψ(k))
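To make the two-step loop concrete, here is a minimal generic EM driver in Python. It is only a sketch: e_step and m_step are hypothetical callables supplied by the user for a particular model (the multinomial and mixing-proportion examples later in these slides are concrete instances), and the stopping rule on successive parameter changes is one common convention rather than part of the algorithm itself.

```python
import numpy as np

def em(psi0, e_step, m_step, tol=1e-8, max_iter=1000):
    """Generic EM driver (sketch).

    psi0   : initial value psi^(0)
    e_step : callable, e_step(psi_k) -> a representation of Q(. | psi_k),
             e.g. missing-data sufficient statistics filled in with their
             conditional expectations given y and psi_k
    m_step : callable, m_step(q_k) -> psi^(k+1) maximizing Q(psi | psi_k)
    """
    psi = np.asarray(psi0, dtype=float)
    for k in range(max_iter):
        q_k = e_step(psi)                 # E-step: form Q(. | psi^(k))
        psi_new = np.asarray(m_step(q_k), dtype=float)
        if np.max(np.abs(psi_new - psi)) < tol:
            return psi_new, k + 1         # converged (by this stopping rule)
        psi = psi_new
    return psi, max_iter
```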

Properties
• Ideally we would choose ψ(k+1) to maximize the complete-data log likelihood log f(x | ψ), but x is not fully observed.
  ♦ Instead, maximize its expectation given the data y and the current fit ψ(k).
• Some properties
  ♦ The constraint ψ ∈ Ω is naturally incorporated in the M-step.
  ♦ The likelihood g(y | ψ(k)) is non-decreasing in k.
  ♦ Simple and stable.
  ♦ The sequence of likelihood values has a finite limit L*, which can be a local maximum.
• Disadvantage
  ♦ Slow linear convergence

Outline
1. Motivation and EM View
2. Overview of the EM Algorithm
3. Examples
4. Theoretical Issues in the EM Algorithm
5. Variants of the EM Algorithm

Multinomial Example (1)
• The observed data vector of frequencies y = (y1, y2, y3, y4)^T is postulated to arise from a multinomial distribution with four cells, with cell probabilities

  1/2 + ψ/4,  (1 − ψ)/4,  (1 − ψ)/4,  and  ψ/4,   with 0 ≤ ψ ≤ 1.

• Example
  ♦ y = (125, 18, 20, 34)^T, n = 197

Multinomial Example (2)
• The probability function g(y; ψ):

  g(y | ψ) = n! / (y1! y2! y3! y4!) (1/2 + ψ/4)^y1 ((1 − ψ)/4)^y2 ((1 − ψ)/4)^y3 (ψ/4)^y4

• Log likelihood for ψ (up to an additive constant):

  log L(ψ) = y1 log(2 + ψ) + (y2 + y3) log(1 − ψ) + y4 log ψ

  ∂ log L(ψ)/∂ψ = y1/(2 + ψ) − (y2 + y3)/(1 − ψ) + y4/ψ

♦ The likelihood equation can be solved explicitly to find the MLE of ψ; the example is nevertheless a standard illustration of the EM algorithm.
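As a check on the statement above, clearing denominators in the score turns the likelihood equation into the quadratic n ψ² − (y1 − 2(y2 + y3) − y4) ψ − 2 y4 = 0, whose root in [0, 1] is the MLE. A minimal sketch for the data of the example (the variable names are mine):

```python
import numpy as np

# Observed counts from the example: y = (125, 18, 20, 34), n = 197.
y1, y2, y3, y4 = 125, 18, 20, 34
n = y1 + y2 + y3 + y4

# Multiplying the score by psi*(1 - psi)*(2 + psi) and setting it to zero gives
#   n*psi^2 - (y1 - 2*(y2 + y3) - y4)*psi - 2*y4 = 0.
a = n
b = -(y1 - 2 * (y2 + y3) - y4)
c = -2 * y4
psi_hat = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)   # root in [0, 1]
print(psi_hat)   # approximately 0.627
```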

Multinomial Example (3)
• Suppose that the first of the original four multinomial cells could be split into two subcells having probabilities 1/2 and ψ/4, with corresponding counts y11 and y12 (y11 + y12 = y1).
• The MLE of ψ on the basis of this split would be (y12 + y4) / (y12 + y2 + y3 + y4).
• View y as being incomplete.
• Complete-data vector: x = (y11, y12, y2, y3, y4)^T
• The cell probabilities for x:

  1/2,  ψ/4,  (1 − ψ)/4,  (1 − ψ)/4,  and  ψ/4

• y11 and y12 : unobservable (missing) data

Multinomial Example (4)
• Summing f(x | ψ) over all values of x such that y11 + y12 = y1 recovers g(y | ψ):

  g(y | ψ) = Σ_x f(x | ψ)

  f(x | ψ) = C(x) (1/2)^y11 (ψ/4)^y12 ((1 − ψ)/4)^y2 ((1 − ψ)/4)^y3 (ψ/4)^y4

  C(x) = n! / (y11! y12! y2! y3! y4!)

• Complete-data log likelihood (up to an additive constant):

  log Lc(ψ) = (y12 + y4) log ψ + (y2 + y3) log(1 − ψ)

Multinomial Example (5)
• Let ψ(0) be the initial value for ψ.
• E-step:
  ♦ Fill in for the unobservable data by averaging the complete-data log likelihood over its conditional distribution given the observed data y.
  ♦ Conditional expectation of log Lc(ψ) given y:

  Q(ψ | ψ(0)) = E[log Lc(ψ) | y, ψ(0)]

Multinomial Example (6)
♦ Since log Lc(ψ) is linear in y11 and y12, this amounts to replacing y11 and y12 by their conditional expectations given the observed data y.
• Y11 : random variable corresponding to y11
• Given y1, Y11 has a binomial distribution with sample size y1 and probability parameter

  (1/2) / (1/2 + ψ(0)/4)

• Initial conditional expectations of Y11 and Y12 given y1:

  y11(0) = E_{ψ(0)}(Y11 | y1) = y1 (1/2) / (1/2 + ψ(0)/4)

  y12(0) = y1 − y11(0) = y1 (ψ(0)/4) / (1/2 + ψ(0)/4)

Multinomial Example (7)
• M-step
  ♦ Choose ψ(1) to be the value of ψ that maximizes Q(ψ | ψ(0)):

  ψ(1) = (y12(0) + y4) / (y12(0) + y2 + y3 + y4)
       = (y12(0) + y4) / (n − y11(0))

  ♦ In general,

  ψ(k+1) = (y12(k) + y4) / (n − y11(k)),   where
  y11(k) = y1 (1/2) / (1/2 + ψ(k)/4)  and  y12(k) = y1 − y11(k)

Multinomial Example (8)
♦ Successive-difference ratio: r(k) = (ψ(k+1) − ψ(k)) / (ψ(k) − ψ(k−1))
♦ The iteration exhibits linear convergence, with r(k) approaching approximately 0.1328 (see the sketch below).
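Putting the E- and M-step formulas of slides (6) and (7) together gives the short iteration below; the successive-difference ratios r(k) printed at the end settle near the linear rate quoted above. The starting value 0.5 is an arbitrary choice of mine, not taken from the slides.

```python
import numpy as np

y1, y2, y3, y4 = 125, 18, 20, 34       # observed counts from the example
n = y1 + y2 + y3 + y4

def em_update(psi):
    """One EM iteration for the multinomial example."""
    # E-step: conditional expectations of the unobserved split of y1
    y11 = y1 * (1 / 2) / (1 / 2 + psi / 4)
    y12 = y1 - y11
    # M-step: complete-data MLE with y12 filled in
    return (y12 + y4) / (n - y11)

history = [0.5]                        # arbitrary starting value psi^(0)
for k in range(10):
    history.append(em_update(history[-1]))

# r(k) = (psi^(k+1) - psi^(k)) / (psi^(k) - psi^(k-1)); it settles near 0.13.
for k in range(1, len(history) - 1):
    r_k = (history[k + 1] - history[k]) / (history[k] - history[k - 1])
    print(k, history[k + 1], r_k)
```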

Maximum Likelihood Estimator
• Likelihood P(D | M): probability of the data D being generated from model M.
• Learning (induction): the inference process of deriving a parameterized model M = M(w) from data D.

  P(M | D) = P(D | M) P(M) / P(D)

• Maximum likelihood hypothesis:

  h_ML = arg max_{h ∈ H} P(D | h)

• Maximum likelihood estimator:

  θ* = arg max_{θ ∈ H} L(θ) = arg max_{θ ∈ H} P(D | θ)

Estimating Mixing Proportions (1)
• Suppose that the p.d.f. of a random vector W has a g-component mixture form

  f(w | ψ) = Σ_{i=1}^{g} π_i f_i(w)

  ♦ ψ = (π1, …, π_{g−1})^T : the g − 1 unknown mixing proportions (with π_g = 1 − Σ_{i=1}^{g−1} π_i)

Estimation of Mixing Proportions (2)
• Observed random sample obtained from the mixture density:

  y = (w1^T, …, wn^T)^T,   where each w_j is drawn from f(w | ψ) = Σ_{i=1}^{g} π_i f_i(w)

• The component densities f_i are known, so each f_i(w_j) can be evaluated, but the π_i (and hence the terms π_i f_i(w_j)) are unknown.

Estimation of Mixing Proportions (3)
• Log likelihood function for ψ:

  log L(ψ) = Σ_{j=1}^{n} log f(w_j | ψ) = Σ_{j=1}^{n} log( Σ_{i=1}^{g} π_i f_i(w_j) )

♦ Differentiating with respect to π_i (with π_g = 1 − Σ_{i=1}^{g−1} π_i substituted) and equating the result to zero:

  Σ_{j=1}^{n} [ f_i(w_j) / f(w_j | ψ̂) − f_g(w_j) / f(w_j | ψ̂) ] = 0   (i = 1, …, g − 1)

• This does not yield an explicit solution for ψ̂.

Estimation of Mixing Proportions (4)
• To pose this problem as an incomplete-data one:
  ♦ Introduce as the unobservable or missing data the vector z = (z1^T, …, zn^T)^T.
  ♦ z_j : g-dimensional vector of zero-one indicator variables
  ♦ z_ij = (z_j)_i : indicates whether w_j arose from the ith component
• If the z_ij were observed, the MLE of π_i would simply be

  Σ_{j=1}^{n} z_ij / n

• Complete-data vector x:

  x = (y^T, z^T)^T

Estimation of Mixing Proportions (5)
♦ The complete data have a multinomial form: z_j ~ Mult_g(1, (π1, …, πg)), and the complete-data density of (w_j, z_j) is

  f(w_j, z_j | ψ) = {π1 f1(w_j)}^{z_1j} {π2 f2(w_j)}^{z_2j} ⋯ {πg fg(w_j)}^{z_gj}

♦ Complete-data log likelihood for ψ:

  log Lc(ψ) = Σ_{i=1}^{g} Σ_{j=1}^{n} z_ij log π_i + Σ_{i=1}^{g} Σ_{j=1}^{n} z_ij log f_i(w_j)

Estimation of Mixing Proportions (6)
• E-step
  ♦ Calculate the current conditional expectation of Z_ij given the observed data y:

  z_ij(k) = E(Z_ij | y, ψ(k)) = Pr{Z_ij = 1 | y, ψ(k)}

  z_ij(k) = π_i(k) f_i(w_j) / f(w_j | ψ(k)) = π_i(k) f_i(w_j) / Σ_{h=1}^{g} π_h(k) f_h(w_j)

• M-step

  π_i(k+1) = Σ_{j=1}^{n} z_ij(k) / n
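As a concrete sketch of these two steps, the code below estimates the mixing proportions of a two-component mixture of known univariate normal densities. The component densities, the simulated data, and the starting value are illustrative assumptions of mine; only the z_ij(k) and π_i(k+1) formulas come from the slides.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Known component densities f_1, f_2 (illustrative choices); only the
# mixing proportions are treated as unknown.
components = [norm(loc=-2.0, scale=1.0), norm(loc=3.0, scale=1.5)]
true_pi = np.array([0.3, 0.7])

# Simulate n observations w_1, ..., w_n from the mixture.
n = 2000
labels = rng.choice(2, size=n, p=true_pi)
w = np.where(labels == 0,
             rng.normal(-2.0, 1.0, size=n),
             rng.normal(3.0, 1.5, size=n))

f = np.vstack([c.pdf(w) for c in components])   # f_i(w_j), shape (g, n)

pi = np.array([0.5, 0.5])                       # starting value pi^(0)
for k in range(500):
    # E-step: z_ij^(k) = pi_i^(k) f_i(w_j) / sum_h pi_h^(k) f_h(w_j)
    z = pi[:, None] * f
    z /= z.sum(axis=0, keepdims=True)
    # M-step: pi_i^(k+1) = sum_j z_ij^(k) / n
    pi_new = z.sum(axis=1) / n
    if np.max(np.abs(pi_new - pi)) < 1e-10:
        break
    pi = pi_new

print(pi)   # should land near the simulated proportions (0.3, 0.7)
```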

Outline
1. Motivation and EM View
2. Overview of the EM Algorithm
3. Examples
4. Theoretical Issues in the EM Algorithm
5. Variants of the EM Algorithm

Non-Decreasing Likelihood (1)
• The conditional density of X given Y = y:

  k(x | y, ψ) = f(x | ψ) / ∫_{X(y)} f(x | ψ) dx = f(x | ψ) / g(y | ψ),   x ∈ X(y)

• The log likelihood:

  log L(ψ) = log g(y | ψ) = log f(x | ψ) − log k(x | y, ψ) = log Lc(ψ) − log k(x | y, ψ)

• Taking the conditional expectation given y and ψ(k) (the left-hand side does not involve x):

  log L(ψ) = E[log Lc(ψ) | y, ψ(k)] − E[log k(X | y, ψ) | y, ψ(k)]
           = Q(ψ | ψ(k)) − H(ψ | ψ(k))

Non-Decreasing Likelihood (2)
• What we want to show:

  log L(ψ(k+1)) − log L(ψ(k))
    = {Q(ψ(k+1) | ψ(k)) − Q(ψ(k) | ψ(k))} − {H(ψ(k+1) | ψ(k)) − H(ψ(k) | ψ(k))} ≥ 0

♦ The first term is non-negative,

  Q(ψ(k+1) | ψ(k)) − Q(ψ(k) | ψ(k)) ≥ 0,

  because ψ(k+1) is chosen so that Q(ψ(k+1) | ψ(k)) ≥ Q(ψ | ψ(k)) for all ψ ∈ Ω.
♦ The second term satisfies

  H(ψ(k+1) | ψ(k)) − H(ψ(k) | ψ(k)) ≤ 0,

  which is shown on the next slide.

Non-Decreasing Likelihood (3)
♦ By Jensen's inequality and the concavity of the logarithm:

  H(ψ(k+1) | ψ(k)) − H(ψ(k) | ψ(k)) = E[ log{ k(X | y, ψ(k+1)) / k(X | y, ψ(k)) } | y, ψ(k) ]
    ≤ log E[ k(X | y, ψ(k+1)) / k(X | y, ψ(k)) | y, ψ(k) ]
    = log ∫_{X(y)} k(x | y, ψ(k+1)) dx
    = 0

• Hence, for a bounded sequence of likelihood values {L(ψ(k))}, L(ψ(k)) converges monotonically to some limit L*.
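The monotonicity result is easy to verify numerically on the multinomial example: using the observed-data log likelihood from that example (constants dropped) and the EM update derived there, every iteration should leave log L(ψ(k)) no smaller than before. A small self-contained check, with an arbitrary starting value:

```python
import numpy as np

y1, y2, y3, y4 = 125, 18, 20, 34
n = y1 + y2 + y3 + y4

def loglik(psi):
    # Observed-data log likelihood of the multinomial example, up to a constant.
    return y1 * np.log(2 + psi) + (y2 + y3) * np.log(1 - psi) + y4 * np.log(psi)

def em_update(psi):
    y11 = y1 * (1 / 2) / (1 / 2 + psi / 4)      # E-step
    return (y1 - y11 + y4) / (n - y11)          # M-step

psi, prev = 0.1, -np.inf
for k in range(20):
    psi = em_update(psi)
    cur = loglik(psi)
    assert cur >= prev - 1e-12                  # non-decreasing, up to rounding
    prev = cur
    print(k + 1, psi, cur)
```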

Rate of Convergence (1)
• The EM algorithm implicitly defines a mapping ψ → M(ψ):

  ψ(k+1) = M(ψ(k))   (k = 0, 1, 2, …)

• If ψ(k) converges to some point ψ*, then ψ* = M(ψ*).
• In a neighborhood of ψ*,

  ψ(k+1) − ψ* ≈ J(ψ*)(ψ(k) − ψ*),

  where J(ψ) is the Jacobian matrix of M(ψ).
• Thus, near ψ*, the error ψ(k) − ψ* contracts approximately linearly, at a rate governed by J(ψ*).

Rate of Convergence (2)
• Near ψ*, the EM algorithm is essentially a linear iteration with matrix J(ψ*).
  ♦ J(ψ*) governs the rate of convergence.
• Global rate of convergence:

  r = lim_{k→∞} ||ψ(k+1) − ψ*|| / ||ψ(k) − ψ*||

♦ Under certain regularity conditions,

  r = λ_max = the largest eigenvalue of J(ψ*)

• Global speed of convergence: s = 1 − r
  ♦ s is the smallest eigenvalue of S = I_d − J(ψ*)

Outline
1. Motivation and EM View
2. Overview of the EM Algorithm
3. Examples
4. Theoretical Issues in the EM Algorithm
5. Variants of the EM Algorithm

Kinds of EM Variants
1. Generalized EM (GEM) Algorithm
2. Modification for Maximum a Posteriori (MAP) Estimation
3. ECM Algorithm
4. ECME Algorithm
5. MCEM Algorithm

Generalized EM Algorithm (1)
• M-step of the EM algorithm:

  ψ(k+1) = arg max_{ψ ∈ Ω} Q(ψ | ψ(k))

  ♦ That is, for all ψ ∈ Ω,

  Q(ψ(k+1) | ψ(k)) ≥ Q(ψ | ψ(k))

• M-step of the GEM algorithm
  ♦ Choose ψ(k+1) such that

  Q(ψ(k+1) | ψ(k)) ≥ Q(ψ(k) | ψ(k))
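A GEM update only has to increase (or hold) the Q-value, which can be cheaper than a full maximization. The sketch below does this for the Q-function of the multinomial example by taking a single gradient step with step halving; the gradient-step form and the step-size constants are my own illustrative choices, not something prescribed by the slides.

```python
import numpy as np

y1, y2, y3, y4 = 125, 18, 20, 34

def q(psi, psi_k):
    """Q(psi | psi_k) for the multinomial example (constants dropped)."""
    y12 = y1 - y1 * (1 / 2) / (1 / 2 + psi_k / 4)
    return (y12 + y4) * np.log(psi) + (y2 + y3) * np.log(1 - psi)

def dq(psi, psi_k):
    """Derivative of Q(psi | psi_k) with respect to psi."""
    y12 = y1 - y1 * (1 / 2) / (1 / 2 + psi_k / 4)
    return (y12 + y4) / psi - (y2 + y3) / (1 - psi)

def gem_update(psi_k, step=1e-3):
    """GEM M-step: return any psi with Q(psi | psi_k) >= Q(psi_k | psi_k)."""
    grad = dq(psi_k, psi_k)
    while True:
        psi_new = float(np.clip(psi_k + step * grad, 1e-6, 1 - 1e-6))
        if q(psi_new, psi_k) >= q(psi_k, psi_k):
            return psi_new
        step /= 2            # halve the step until Q has not decreased

psi = 0.2
for k in range(50):
    psi = gem_update(psi)
print(psi)   # approaches the MLE (about 0.627), more slowly than full EM
```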

Generalized EM Algorithm (2)
• Monotonicity of the GEM algorithm:

  log L(ψ(k+1)) − log L(ψ(k)) ≥ 0

♦ In the proof for the EM algorithm, the only change is that the inequality

  Q(ψ(k+1) | ψ(k)) − Q(ψ | ψ(k)) ≥ 0   (for all ψ ∈ Ω)

  is replaced by the weaker condition

  Q(ψ(k+1) | ψ(k)) − Q(ψ(k) | ψ(k)) ≥ 0,

  which is exactly what the GEM M-step guarantees and is all the monotonicity argument requires.

For MAP Estimate (1)
• MAP estimate:

  ψ* = arg max_{ψ ∈ Ω} p(ψ | y) = arg max_{ψ ∈ Ω} p(y | ψ) p(ψ)

  ∴ log p(ψ | y) = log p(y | ψ) + log p(ψ) + const = log L(ψ) + log p(ψ) + const

For MAP Estimate (2)
• E-step
  ♦ Calculate

  E[log p(ψ | x) | y, ψ(k)] = Q(ψ | ψ(k)) + log p(ψ)   (up to an additive constant)

• M-step
  ♦ Choose ψ(k+1) to maximize Q(ψ | ψ(k)) + log p(ψ) over ψ ∈ Ω.
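For the multinomial example, placing a conjugate Beta(a, b) prior on ψ (an illustrative assumption of mine; the slides do not specify a prior) gives the penalized objective (y12(k) + y4 + a − 1) log ψ + (y2 + y3 + b − 1) log(1 − ψ) up to a constant, which is maximized in closed form. Only the M-step changes:

```python
import numpy as np

y1, y2, y3, y4 = 125, 18, 20, 34
n = y1 + y2 + y3 + y4
a, b = 2.0, 2.0                  # Beta(a, b) prior on psi (illustrative choice)

def map_em_update(psi):
    # E-step is unchanged: fill in the split of y1 given the current psi.
    y11 = y1 * (1 / 2) / (1 / 2 + psi / 4)
    y12 = y1 - y11
    # M-step now maximizes Q(psi | psi_k) + log p(psi).
    return (y12 + y4 + a - 1) / (n - y11 + a + b - 2)

psi = 0.5
for k in range(50):
    psi = map_em_update(psi)
print(psi)   # posterior mode; reduces to the MLE when a = b = 1 (flat prior)
```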

ECM Algorithm (1)
• Motivation
  ♦ For problems where complete-data ML estimation is rather complicated
  ♦ Replace a complicated M-step with several simpler CM-steps
• CM-steps (Conditional Maximization steps)
  ♦ Each maximizes the conditional expectation of the complete-data log likelihood found in the preceding E-step, subject to constraints on ψ.
  ♦ A CM-step may still require iteration, but the maximization is over a smaller-dimensional space, and so is simpler, faster, and more stable.

ECM Algorithm (2)
• The M-step is replaced by S > 1 CM-steps.
  ♦ ψ(k+s/S) : the value of ψ on the sth CM-step of the (k+1)th iteration
  ♦ Choose ψ(k+s/S) to maximize Q(ψ | ψ(k)) subject to the constraint gs(ψ) = gs(ψ(k+(s−1)/S))
  ♦ C = {gs(ψ), s = 1, …, S} : a set of S preselected (vector) constraint functions
• ψ(k+s/S) satisfies

  Q(ψ(k+s/S) | ψ(k)) ≥ Q(ψ | ψ(k))   for all ψ ∈ Ω_s(ψ(k+(s−1)/S)),

  where Ω_s(ψ(k+(s−1)/S)) ≡ {ψ ∈ Ω : gs(ψ) = gs(ψ(k+(s−1)/S))}.
• The value of ψ on the final CM-step: ψ(k+S/S) = ψ(k+1).

ECM Algorithm (3)
• The ECM algorithm is a GEM algorithm:

  Q(ψ(k+1) | ψ(k)) ≥ Q(ψ(k+(S−1)/S) | ψ(k)) ≥ Q(ψ(k+(S−2)/S) | ψ(k)) ≥ ⋯ ≥ Q(ψ(k) | ψ(k))

• Space-filling condition:

  ∩_{s=1}^{S} G_s(ψ(k)) = {0}   for all k

  ♦ G_s(ψ) : column space of ∇g_s(ψ)
  ♦ Equivalently, the convex hull of all feasible directions allowed by the constraint sets Ω_s(ψ(k+(s−1)/S)) is the whole Euclidean space R^d, so the sequence of constrained maximizations is effectively a maximization over the whole parameter space Ω.

ECME Algorithm
• ECME ("ECM Either") algorithm
  ♦ "Either": with this extension, some or all of the CM-steps of the ECM algorithm are replaced by steps that maximize the corresponding constrained incomplete-data log likelihood log L(ψ), rather than the Q-function.
  ♦ Typically gives faster convergence.
• The CM-steps that act on the Q-function must be performed before those that act on the actual log likelihood.
• There are nevertheless situations where the speed of convergence of the ECME algorithm is less than that of the ECM algorithm.

MCEM Algorithm (1)
• Used when the E-step is complex and does not admit a closed-form solution for Q(ψ | ψ(k)).
  ♦ Execute the E-step by a Monte Carlo process.

  Q(ψ | ψ(k)) = E[log p(ψ | x) | y, ψ(k)] = ∫_Z log p(ψ | z, y) p(z | y, ψ(k)) dz

  ♦ Z : sample space of the missing data z
  ♦ p(z | y, ψ(k)) : conditional density of the missing data Z given y
  ♦ Here p(ψ | x) = p(ψ | z, y); as in the MAP formulation above, log p(ψ | x) equals the complete-data log likelihood plus the log prior, up to a constant.

MCEM Algorithm (2)
• MCEM algorithm
  ♦ Monte Carlo E-step: on the kth iteration, draw z(1k), …, z(Mk) from p(z | y, ψ(k)) and approximate

  Q(ψ | ψ(k)) ≈ (1/M) Σ_{m=1}^{M} log p(ψ | z(mk), y)

  ♦ M-step: maximize this approximate Q(ψ | ψ(k)) over ψ to obtain ψ(k+1).

MCEM Algorithm (3)
• An example: the multinomial example revisited
  ♦ Single unknown parameter ψ
  ♦ Single missing variable z = y12, drawn in the Monte Carlo E-step from

  binomial( y1, (ψ(k)/4) / (1/2 + ψ(k)/4) )

  ♦ Complete-data log likelihood:

  log Lc(ψ) = (y12 + y4) log ψ + (y2 + y3) log(1 − ψ)

  ♦ Q-function in the MCEM algorithm, with z(k) the average of the M draws of y12:

  Q(ψ | ψ(k)) = (z(k) + y4) log ψ + (y2 + y3) log(1 − ψ)

  ψ(k+1) = (z(k) + y4) / (z(k) + y4 + y2 + y3)
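A sketch of this Monte Carlo version for the same data, taking z(k) as the average of M binomial draws; the choices of M, the number of iterations, and the starting value are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
y1, y2, y3, y4 = 125, 18, 20, 34

psi, M = 0.5, 1000
for k in range(30):
    # Monte Carlo E-step: draw y12 from its conditional binomial distribution.
    p = (psi / 4) / (1 / 2 + psi / 4)
    z_bar = rng.binomial(y1, p, size=M).mean()
    # M-step: maximize the Monte Carlo estimate of the Q-function.
    psi = (z_bar + y4) / (z_bar + y4 + y2 + y3)
    print(k + 1, psi)
# The iterates hover around the MLE rather than converging monotonically;
# in practice M is increased as the algorithm moves closer to convergence.
```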

MCEM Algorithm (4)
• Difficulties
  ♦ The monotonicity property is lost.
    - Convergence must be monitored, e.g. by plotting ψ(k) against k.
    - Stabilization of the process indicates convergence.
    - If the process keeps fluctuating, it is terminated or continued with a larger value of M.
  ♦ Choice of M
    - Small values of M in the initial stage
    - Increased as the algorithm moves closer to convergence

References
• A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39, pp. 1-38, 1977.
• G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley-Interscience, 1997.
• G. McLachlan and D. Peel, Finite Mixture Models, Wiley-Interscience, 2000.
• V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods, Wiley-Interscience, 1998.