A Hierarchical Dirichlet Process Model with Multiple Levels of Clustering for Human EEG Seizure Modeling

Supplementary Materials

Drausin Wulsin¹, Shane Jensen², and Brian Litt¹,³
Depts. of Bioengineering¹, Statistics², and Neurology³
University of Pennsylvania

ICML 2012

1 Algorithms for posterior MCMC

Algorithms for sampling the atom indicators, level parameters, and base parameters are given in Algorithms 1, 2, and 3, respectively. We denote $[\cdots]$ as a vector and the operations $[\cdots] \cdot [\cdots]$ and $[\cdots] / [\cdots]$ as element-wise multiplication and division, respectively. A Dirac point mass at index $k$ is written $\delta_k$. We denote sampling a scalar $\theta$ from a multinomial distribution with vector weights $\mathbf{p}$ as $\theta \sim \text{Multi}(\mathbf{p})$, where the weights $\mathbf{p}$ need not be normalized. A $*$ in place of an index indicates that all indices are selected, yielding a vector. The function $\mathbb{1}(\cdot)$ is the indicator function, which returns 1 if its argument is true and 0 if it is false.
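As a concrete illustration of this notation (a minimal sketch of our own, not part of the original algorithms), the following Python snippet realizes a draw $\theta \sim \text{Multi}(\mathbf{p})$ from unnormalized weights together with the element-wise vector operations:

```python
import numpy as np

rng = np.random.default_rng(0)

def multi(p):
    """theta ~ Multi(p): sample an index from possibly unnormalized weights p."""
    p = np.asarray(p, dtype=float)
    return rng.choice(len(p), p=p / p.sum())

a = np.array([0.2, 0.5, 0.3])
b = np.array([4.0, 1.0, 5.0])
print(a * b)          # element-wise multiplication [...] . [...]
print(a / b)          # element-wise division [...] / [...]
print(multi(a * b))   # a draw over indices 0..2 from unnormalized weights
```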

2 Normal model with a conjugate prior

Here we briefly describe the details of our Normal scaled inverse-$\chi^2$ conjugate prior for a univariate Normal and the resulting posterior. For more details, see (Gelman et al., 2004, pgs. 78-80). Our base model is a multivariate Normal with mean $\boldsymbol{\mu}$ and diagonal covariance $\boldsymbol{\sigma}^2$, both in $\mathbb{R}^d$. The diagonal covariance implies that our $d$ dimensions are independent, so the likelihood of $\mathbf{x}$ is simply the product of each dimension $i$'s likelihood of $x_i$, and sampling $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ from the posterior just involves sampling each dimension individually. For notational ease, we will thus assume that $d = 1$ in what follows, so $\mathbf{x} = x$ and $(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) = (\mu, \sigma^2)$. The Normal model likelihood is

$$p(x \mid \mu, \sigma^2) \propto (\sigma^2)^{-1/2} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right) \quad (1)$$

Given $n$ i.i.d. observations $x_1, \ldots, x_n$, their joint likelihood is

$$p(x_1, \ldots, x_n \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) \quad (2)$$
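To make the independence across dimensions concrete, here is a minimal Python sketch (our illustration, not the authors' code) showing that the log-likelihood of a $d$-dimensional $\mathbf{x}$ under a diagonal-covariance Normal is the sum of the $d$ univariate log-likelihoods:

```python
import numpy as np
from scipy.stats import norm

# d = 3 dimensions with diagonal covariance: the dimensions are independent,
# so the joint log-likelihood is the sum of the univariate log-likelihoods.
mu = np.array([0.0, 2.0, -1.0])      # per-dimension means
sigma2 = np.array([1.0, 0.5, 2.0])   # diagonal of the covariance
x = np.array([0.3, 1.8, -0.5])

per_dim = norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2))
print(per_dim.sum())  # log p(x | mu, sigma^2), summed over dimensions
```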

Algorithm 1 Sampling algorithm for the MLC-HDP base- and level-atom indicators
 1: for t = 1, ..., T do                                        // for each patient t
 2:   for j = 1, ..., J_t do                                    // for each seizure j of patient t
 3:     ℓ ← z^(2)_{tj}
 4:     for i = 1, ..., N_{tj} do                               // for each channel i of seizure j of patient t
 5:       π ← α^(3) · [β^(3)_1, ..., β^(3)_K, β^(3)_{K+1}] + [n^(3)_{ℓ1}, ..., n^(3)_{ℓ(K+1)}]   // calc weights over base atoms
 6:       k′ ∼ Multi(π · [f_1(x_{tji}), ..., f_{K+1}(x_{tji})])  // sample channel i's base-atom indicator
 7:       if k′ ≠ z^(3)_{tji} then
 8:         k ← z^(3)_{tji} and z^(3)_{tji} ← k′
 9:         Φ_k ← Φ_k ⊖ x_{tji} and Φ_{k′} ← Φ_{k′} ⊕ x_{tji}    // update base-atom sufficient statistics
10:         n^(3)_{ℓk} ← n^(3)_{ℓk} − 1 and n^(3)_{ℓk′} ← n^(3)_{ℓk′} + 1   // update level-3 atom counts
11:       end if
12:     end for
13:     for ℓ = 1, ..., L^(2) + 1 do                            // calc likelihood of seizure j under each level-2 atom ℓ
14:       π′_ℓ ← α^(3) · [β^(3)_1, ..., β^(3)_K, β^(3)_{K+1}] + [n^(3)_{ℓ1}, ..., n^(3)_{ℓ(K+1)}]
15:       π_ℓ ← π′_ℓ / Σ_k π′_{ℓk}
16:       g_ℓ ← Π_{i=1}^{N_tj} Σ_k π_{ℓk} f_k(x_{tji})          // product of marginal posteriors of the channels in seizure j
17:     end for
18:     l ← z^(1)_t
19:     τ ← α^(2) · [β^(2)_1, ..., β^(2)_{L^(2)+1}] + [n^(2)_{l1}, ..., n^(2)_{l(L^(2)+1)}]   // calc weights over level-2 atoms
20:     ℓ′ ∼ Multi(τ · [g_1, ..., g_{L^(2)+1}])                 // sample seizure j's level-2 atom indicator
21:     if ℓ′ ≠ z^(2)_{tj} then
22:       z^(2)_{tj} ← ℓ′
23:       n^(3)_{ℓ∗} ← n^(3)_{ℓ∗} − Σ_{i=1}^{N_tj} δ_{k=z^(3)_{tji}} and n^(3)_{ℓ′∗} ← n^(3)_{ℓ′∗} + Σ_{i=1}^{N_tj} δ_{k=z^(3)_{tji}}   // update level-3 atom counts
24:       n^(2)_{lℓ} ← n^(2)_{lℓ} − 1 and n^(2)_{lℓ′} ← n^(2)_{lℓ′} + 1   // update level-2 atom counts
25:     end if
26:   end for
27:   for ℓ = 1, ..., L^(2) + 1 do                              // recompute the normalized level-2 atom weights
28:     π′_ℓ ← α^(3) · [β^(3)_1, ..., β^(3)_K, β^(3)_{K+1}] + [n^(3)_{ℓ1}, ..., n^(3)_{ℓ(K+1)}]
29:     π_ℓ ← π′_ℓ / Σ_k π′_{ℓk}
30:   end for
31:   for l = 1, ..., L^(1) + 1 do                              // calc likelihood of patient t under each level-1 atom l
32:     τ′_l ← α^(2) · [β^(2)_1, ..., β^(2)_{L^(2)+1}] + [n^(2)_{l1}, ..., n^(2)_{l(L^(2)+1)}]
33:     τ_l ← τ′_l / Σ_ℓ τ′_{lℓ}
34:     h_l ← Π_{j=1}^{J_t} Π_{i=1}^{N_tj} Σ_{ℓ=1}^{L^(2)+1} τ_{lℓ} π_{ℓ(k=z^(3)_{tji})}   // product of marginal posteriors of the seizures
35:   end for
36:   ρ ← α^(1) · [β^(1)_1, ..., β^(1)_{L^(1)+1}] + [n^(1)_1, ..., n^(1)_{L^(1)+1}]   // calc weights over level-1 atoms
37:   l′ ∼ Multi(ρ · [h_1, ..., h_{L^(1)+1}])                   // sample patient t's level-1 atom indicator
38:   if l′ ≠ z^(1)_t then
39:     z^(1)_t ← l′
40:     n^(2)_{l∗} ← n^(2)_{l∗} − Σ_{j=1}^{J_t} δ_{ℓ=z^(2)_{tj}} and n^(2)_{l′∗} ← n^(2)_{l′∗} + Σ_{j=1}^{J_t} δ_{ℓ=z^(2)_{tj}}   // update level-2 atom counts
41:     n^(1)_l ← n^(1)_l − 1 and n^(1)_{l′} ← n^(1)_{l′} + 1   // update level-1 atom counts
42:   end if
43: end for
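For readers who prefer code, the following is a minimal Python sketch of the channel-level update in lines 4-12 of Algorithm 1. It is our own illustration under assumed data structures (a count matrix n3, a weight vector beta3, and a likelihood function f), not the authors' implementation, and it leaves the sufficient-statistic updates as a comment:

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_channel_atom(x, l, z, n3, beta3, alpha3, f):
    """One Gibbs step for a channel's base-atom indicator (Alg. 1, lines 4-12).

    x      -- feature vector of channel i of seizure j of patient t
    l      -- the seizure's current level-2 atom indicator
    z      -- the channel's current base-atom indicator z^(3)
    n3     -- (L2+1, K+1) array of level-3 atom counts n^(3)
    beta3  -- (K+1,) parent weights beta^(3)
    alpha3 -- concentration parameter alpha^(3)
    f      -- f(k, x): marginal likelihood of x under base atom k
    """
    K1 = len(beta3)
    pi = alpha3 * beta3 + n3[l]                   # weights over base atoms (line 5)
    lik = np.array([f(k, x) for k in range(K1)])  # [f_1(x), ..., f_{K+1}(x)]
    w = pi * lik
    k_new = rng.choice(K1, p=w / w.sum())         # k' ~ Multi(pi . lik) (line 6)
    if k_new != z:                                # lines 7-11
        n3[l, z] -= 1                             # update level-3 atom counts
        n3[l, k_new] += 1
        # the sufficient statistics Phi_z (remove x) and Phi_k_new (add x)
        # would be updated here as well
    return k_new
```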


Algorithm 2 Sampling algorithm for the MLC-HDP level parameters
 1: for v = 3, ..., 1 do                                        // for each level
 2:   if n^(v)_{·(L^(v)+1)} > 0 then                            // if the extra, empty sub-atom has been filled
 3:     n^(v)_{∗(L^(v)+1)} ← 0 and m^(v)_{∗(L^(v)+1)} ← 0       // make new space in the count variables
 4:     ω ∼ Beta(1, α^(v))                                      // sample a new parent weight via stick-breaking
 5:     [β^(v)_{L^(v)}, β^(v)_{L^(v)+1}] ← β^(v)_{L^(v)} · [ω, 1 − ω]
 6:     L^(v) ← L^(v) + 1                                       // increment L^(v)
 7:   end if
 8:   L^(0) ≡ 1
 9:   for q = 1, ..., L^(v−1) do
10:     for r = 1, ..., L^(v) do                                // sample m variables for each sub-atom r of each level-atom q
11:       for s = 1, ..., n^(v)_{qr} do
12:         θ_s ∼ Ber(α^(v) β^(v)_r / (α^(v) β^(v)_r + s))
13:       end for
14:       m^(v)_{qr} ← Σ_s θ_s
15:     end for
16:   end for
17:   [β^(v)_1, ..., β^(v)_{L^(v)+1}] ∼ Dir(m^(v)_{·1}, ..., m^(v)_{·L^(v)}, γ^(v))   // sample parent weights over sub-atoms
18:   (optional) sample hyperparameters α^(v) and γ^(v) as shown in Algorithm 3
19: end for
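The m-variable and parent-weight updates (lines 9-17 of Algorithm 2) translate directly into a few lines of NumPy. The sketch below is our own illustration under assumed inputs (the count matrix n for one level and its current weights beta), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_level_params(n, beta, alpha, gamma):
    """Resample m^(v) and the parent weights beta^(v) (Alg. 2, lines 9-17).

    n     -- (Q, L) count matrix n^(v) over parent atoms q and sub-atoms r
    beta  -- (L,) current sub-atom weights beta^(v)_r used in the Bernoulli draws
    alpha, gamma -- concentrations alpha^(v) and gamma^(v)
    """
    Q, L = n.shape
    m = np.zeros((Q, L), dtype=int)
    for q in range(Q):
        for r in range(L):
            for s in range(1, n[q, r] + 1):
                # theta_s ~ Ber(alpha*beta_r / (alpha*beta_r + s))
                m[q, r] += rng.random() < alpha * beta[r] / (alpha * beta[r] + s)
    # [beta_1, ..., beta_{L+1}] ~ Dir(m_.1, ..., m_.L, gamma)
    conc = np.append(m.sum(axis=0), gamma).astype(float)
    conc = np.maximum(conc, 1e-9)  # guard: Dirichlet parameters must be positive
    return m, rng.dirichlet(conc)
```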

Algorithm 3 Sampling algorithm for the MLC-HDP hyperparameters
 1: we assume that this algorithm has the same scope as line 18 of Algorithm 2
 2: let γ^(v) have a Gamma(a_{γ^(v)}, b_{γ^(v)}) prior and α^(v) have a Gamma(a_{α^(v)}, b_{α^(v)}) prior
 3: for r = 1, ..., L^(v) do                                    // sample γ auxiliary variables
 4:   m̂_r ← 1(m^(v)_{·r} > 0)
 5:   c_r ∼ Ber(m^(v)_{·r} / (γ^(v) + m^(v)_{·r}))
 6:   d_r ∼ Beta(γ^(v) + 1, m^(v)_{·r})
 7: end for
 8: γ^(v) ∼ Gamma(a_{γ^(v)} + Σ_r (m̂_r − c_r), b_{γ^(v)} − Σ_r log d_r)   // sample γ from posterior
 9: for q = 1, ..., L^(v−1) do                                  // sample α auxiliary variables
10:   c_q ∼ Ber(n^(v)_{q·} / (α^(v) + n^(v)_{q·}))
11:   d_q ∼ Beta(α^(v) + 1, n^(v)_{q·})
12: end for
13: α^(v) ∼ Gamma(a_{α^(v)} + m^(v)_{··} − Σ_q c_q, b_{α^(v)} − Σ_q log d_q)   // sample α from posterior
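A minimal Python sketch of Algorithm 3, again our own illustration with assumed count matrices m and n rather than the authors' code. Groups with zero counts are skipped because their auxiliary variables contribute nothing to the posterior (c = 0 and log d = 0), and NumPy's Gamma sampler is parameterized by scale, so we pass the reciprocal of the posterior rate:

```python
import numpy as np

rng = np.random.default_rng(3)

def resample_hyperparams(m, n, alpha, gamma, a_g, b_g, a_a, b_a):
    """One sweep of Algorithm 3 for a single level v.

    m, n -- the m^(v) and n^(v) count matrices (parent atoms q by sub-atoms r)
    alpha, gamma -- current values of alpha^(v) and gamma^(v)
    a_g, b_g, a_a, b_a -- Gamma prior (shape, rate) parameters
    """
    # --- gamma^(v): auxiliary variables over sub-atoms r (lines 3-8)
    m_col = m.sum(axis=0).astype(float)            # m^(v)_{.r}
    pos = m_col > 0                                # empty columns contribute nothing
    c = rng.random(pos.sum()) < m_col[pos] / (gamma + m_col[pos])
    d = rng.beta(gamma + 1.0, m_col[pos])
    gamma = rng.gamma(a_g + pos.sum() - c.sum(),   # shape: a + sum_r (m_hat_r - c_r)
                      1.0 / (b_g - np.log(d).sum()))
    # --- alpha^(v): auxiliary variables over parent atoms q (lines 9-13)
    n_row = n.sum(axis=1).astype(float)            # n^(v)_{q.}
    occ = n_row > 0
    c = rng.random(occ.sum()) < n_row[occ] / (alpha + n_row[occ])
    d = rng.beta(alpha + 1.0, n_row[occ])
    alpha = rng.gamma(a_a + m.sum() - c.sum(),     # shape: a + m^(v)_{..} - sum_q c_q
                      1.0 / (b_a - np.log(d).sum()))
    return alpha, gamma
```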

We assume that both the mean and variance of our Normal distribution are unknown, requiring a Normal scaled inverse-$\chi^2$ conjugate prior, which entails

$$\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2), \qquad \mu \mid \sigma^2 \sim N(\mu_0, \sigma^2/n_0) \quad (3)$$

and has a joint prior density, denoted $N\text{-Inv-}\chi^2(n_0, \mu_0, \nu_0, \sigma_0^2)$, of

$$p(\mu, \sigma^2) \propto (\sigma^2)^{-((\nu_0+1)/2+1)} \exp\left( -\frac{1}{2\sigma^2} \left( \nu_0 \sigma_0^2 + n_0 (\mu_0 - \mu)^2 \right) \right) \quad (4)$$

where $n_0$ is the prior count, $\mu_0$ is the prior location, $\nu_0$ is the prior degrees of freedom, and $\sigma_0^2$ is the prior scale. The posterior density is proportional to the product of the data likelihood and the joint prior and has the form

$$p(\mu, \sigma^2 \mid x_1, \ldots, x_n) \propto (\sigma^2)^{-((\nu_0+n+1)/2+1)} \exp\left( -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n (x_i - \mu)^2 + \nu_0 \sigma_0^2 + n_0 (\mu_0 - \mu)^2 \right) \right) \quad (5)$$

with posterior parameters $\bar{n}$, $\bar{\mu}$, $\bar{\nu}$, and $\bar{\sigma}^2$, which we can efficiently represent using the sufficient statistics (Sudderth, 2006)

$$\bar{n} = n_0 + n \quad (6)$$

$$\bar{n}\,\bar{\mu} = n_0 \mu_0 + \sum_{i=1}^n x_i \quad (7)$$

$$\bar{\nu} = \nu_0 + n \quad (8)$$

$$\bar{\nu}\,\bar{\sigma}^2 = \nu_0 \sigma_0^2 + n_0 \mu_0^2 + \sum_{i=1}^n x_i^2 - \bar{n}\,\bar{\mu}^2 \quad (9)$$

These values can be cached for each base atom and easily updated when a data point is added or removed. In practice, we store only the first three terms of (9) and subtract off the last term when we actually need the sufficient statistics (e.g., for sampling $\mu$ and $\sigma^2$ values from the posterior). In situations where a full covariance matrix is required, the setup is quite similar, except that the Cholesky decomposition of the analogous form of (9) is stored (see Sudderth, 2006 for more details). We sample from the posterior in two steps, first drawing $\sigma^2$ from its marginal posterior distribution

$$\sigma^2 \sim \text{Inv-}\chi^2(\bar{\nu}, \bar{\sigma}^2) \quad (10)$$

or, in practice,

$$\theta \sim \chi^2(\bar{\nu}), \qquad \sigma^2 = \bar{\nu}\,\bar{\sigma}^2 / \theta \quad (11)$$

and then drawing $\mu$ from its conditional posterior distribution

$$\mu \mid \sigma^2 \sim N(\bar{\mu}, \sigma^2/\bar{n}) \quad (12)$$

Integrating over the $\mu$ and $\sigma^2$ parameters of the $N\text{-Inv-}\chi^2$ distribution yields the posterior predictive distribution for a new data point $x_+$, here a Student-$t$ (Gelman et al., 2004, pg. 88),

$$p(x_+ \mid x_1, \ldots, x_n, n_0, \mu_0, \nu_0, \sigma_0^2) = t_{\bar{\nu}}\!\left( x_+ \,\middle|\, \bar{\mu}, \frac{\bar{n}+1}{\bar{n}}\,\bar{\sigma}^2 \right) \quad (13)$$
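To tie equations (6)-(13) together, the following self-contained Python sketch implements the cached-sufficient-statistics bookkeeping described above. It is our own illustration, not the authors' implementation; the class name NormInvChi2 and its interface are assumptions made for this example.

```python
import numpy as np
from scipy.stats import t as student_t

class NormInvChi2:
    """Univariate Normal with a N-Inv-chi^2(n0, mu0, nu0, s0sq) conjugate prior.

    Caches the sums in (6)-(9) so that adding or removing a data point is O(1).
    """

    def __init__(self, n0, mu0, nu0, s0sq):
        self.n0, self.mu0, self.nu0, self.s0sq = n0, mu0, nu0, s0sq
        self.n, self.sum_x, self.sum_x2 = 0, 0.0, 0.0

    def add(self, x):                 # a data point joins this atom
        self.n += 1; self.sum_x += x; self.sum_x2 += x * x

    def remove(self, x):              # a data point leaves this atom
        self.n -= 1; self.sum_x -= x; self.sum_x2 -= x * x

    def posterior(self):
        """Return (n_bar, mu_bar, nu_bar, s_bar_sq) via (6)-(9)."""
        n_bar = self.n0 + self.n                                    # (6)
        mu_bar = (self.n0 * self.mu0 + self.sum_x) / n_bar          # (7)
        nu_bar = self.nu0 + self.n                                  # (8)
        s_bar_sq = (self.nu0 * self.s0sq + self.n0 * self.mu0**2
                    + self.sum_x2 - n_bar * mu_bar**2) / nu_bar     # (9)
        return n_bar, mu_bar, nu_bar, s_bar_sq

    def sample_params(self, rng):
        """Draw (mu, sigma^2) from the posterior via (11)-(12)."""
        n_bar, mu_bar, nu_bar, s_bar_sq = self.posterior()
        theta = rng.chisquare(nu_bar)                               # (11)
        sigma2 = nu_bar * s_bar_sq / theta
        mu = rng.normal(mu_bar, np.sqrt(sigma2 / n_bar))            # (12)
        return mu, sigma2

    def predictive_pdf(self, x_new):
        """Posterior predictive density of x_new, the Student-t in (13)."""
        n_bar, mu_bar, nu_bar, s_bar_sq = self.posterior()
        # scipy's scale is the square root of the variance parameter in (13)
        scale = np.sqrt((n_bar + 1) / n_bar * s_bar_sq)
        return student_t.pdf(x_new, df=nu_bar, loc=mu_bar, scale=scale)

rng = np.random.default_rng(0)
atom = NormInvChi2(n0=1.0, mu0=0.0, nu0=1.0, s0sq=1.0)
for x in rng.normal(2.0, 0.5, size=50):
    atom.add(x)
print(atom.sample_params(rng), atom.predictive_pdf(2.0))
```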

References

Gelman, Andrew, Carlin, John B., Stern, Hal S., and Rubin, Donald B. Bayesian Data Analysis. Chapman & Hall/CRC, 2nd edition, 2004.

Sudderth, Erik B. Graphical Models for Visual Object Recognition and Tracking. PhD thesis, Massachusetts Institute of Technology, 2006.
