Bayesian networks: Inference and learning


CS194-10 Fall 2011 Lecture 22


Outline
♦ Exact inference (briefly)
♦ Approximate inference (rejection sampling, MCMC)
♦ Parameter learning


Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network:

P(B | j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)

[Network: B and E are parents of A; A is the parent of J and M]

Rewrite full joint entries using products of CPT entries:

P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)
            = α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(D) space, O(K^D) time.
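To make the recursion concrete, here is a minimal Python sketch of depth-first enumeration. The list-of-CPTs representation is an assumption of mine, variables are assumed Boolean, and the CPT values are the standard burglary-network numbers from the AIMA textbook (the evaluation tree on the next slide shows only some of them).

```python
# Network in topological order; each CPT maps a tuple of parent values
# to P(variable = True). Values are the standard AIMA burglary numbers.
burglary = [
    ("B", (), {(): 0.001}),
    ("E", (), {(): 0.002}),
    ("A", ("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    ("J", ("A",), {(True,): 0.90, (False,): 0.05}),
    ("M", ("A",), {(True,): 0.70, (False,): 0.01}),
]

def enumerate_all(bn, e):
    """Sum the joint over all variables not fixed in e, depth first."""
    if not bn:
        return 1.0
    (var, parents, cpt), rest = bn[0], bn[1:]
    p_true = cpt[tuple(e[p] for p in parents)]  # parents come earlier in order
    if var in e:  # variable already assigned: just take its probability
        return (p_true if e[var] else 1.0 - p_true) * enumerate_all(rest, e)
    return sum((p_true if val else 1.0 - p_true) *
               enumerate_all(rest, {**e, var: val})
               for val in (True, False))

def enumeration_ask(X, e, bn):
    """P(X | e) by enumeration: O(D) space, O(K^D) time."""
    dist = {val: enumerate_all(bn, {**e, X: val}) for val in (True, False)}
    alpha = sum(dist.values())
    return {val: p / alpha for val, p in dist.items()}

# e.g. enumeration_ask("B", {"J": True, "M": True}, burglary)
```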


Evaluation tree

[Figure: evaluation tree for the B = true branch of the query. Root: P(b) = .001; branching on e: P(e) = .002, P(¬e) = .998; then P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06; leaves: P(j|a) = .90, P(j|¬a) = .05, P(m|a) = .70, P(m|¬a) = .01]

Enumeration is inefficient because of repeated computation: e.g., it computes P(j|a)P(m|a) once for each value of e.

Efficient exact inference

Junction tree and variable elimination algorithms avoid repeated computation (a generalized form of dynamic programming).

♦ Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of exact inference are O(K^L D)

♦ Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete

[Figure: reduction of 3SAT to Bayes-net inference. Root variables A, B, C, D, each with prior 0.5, feed clause nodes "1. A v B v C", "2. C v D v ...", "3. B v C v ..." (the last literal of clauses 2 and 3 is negated in the original figure), whose outputs feed an AND node.]

Inference by stochastic simulation

Idea: replace the sum over hidden-variable assignments with a random sample.
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior


Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from the prior specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, . . . , XD)
  x ← an event with D elements
  for j = 1, . . . , D do
    x[j] ← a random sample from P(Xj | values of parents(Xj) in x)
  return x


Example

[Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → WetGrass]

P(C) = .50

C | P(S|C)      C | P(R|C)
T | .10         T | .80
F | .50         F | .20

S R | P(W|S,R)
T T | .99
T F | .90
F T | .90
F F | .01

Sampling from an empty network contd.

Probability that Prior-Sample generates a particular event:

S_PS(x1 . . . xD) = Π_{j=1..D} P(xj | parents(Xj)) = P(x1 . . . xD)

i.e., the true prior probability.

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x1 . . . xD) be the number of samples generated for the event x1, . . . , xD. Then

lim_{N→∞} P̂(x1, . . . , xD) = lim_{N→∞} N_PS(x1, . . . , xD)/N
                            = S_PS(x1, . . . , xD)
                            = P(x1 . . . xD)

That is, estimates derived from Prior-Sample are consistent.

Shorthand: P̂(x1, . . . , xD) ≈ P(x1 . . . xD)


Rejection sampling

P̂(X|e) is estimated from the samples agreeing with e.

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts for each value of X, initially zero
  for i = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)

E.g., estimate P(Rain | Sprinkler = true) using 100 samples:
27 samples have Sprinkler = true; of these, 8 have Rain = true and 19 have Rain = false.

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure.
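A hedged sketch of Rejection-Sampling in the same style, reusing prior_sample and sprinkler_bn from the earlier sketch:

```python
def rejection_sampling(X, e, bn, n):
    """Estimate P(X | e) from the prior samples that agree with e.
    Fails (division by zero) if no sample agrees, i.e., P(e) is too small."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        x = prior_sample(bn)
        if all(x[var] == val for var, val in e.items()):
            counts[x[X]] += 1
    total = counts[True] + counts[False]
    return {val: c / total for val, c in counts.items()}

# e.g. rejection_sampling("Rain", {"Sprinkler": True}, sprinkler_bn, 100)
```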

Analysis of rejection sampling

P̂(X|e) = α N_PS(X, e)           (algorithm defn.)
        = N_PS(X, e)/N_PS(e)     (normalized by N_PS(e))
        ≈ P(X, e)/P(e)           (property of Prior-Sample)
        = P(X|e)                 (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates.

Problem: hopelessly expensive if P(e) is small; P(e) drops off exponentially with the number of evidence variables!


Approximate inference using MCMC

General idea of Markov chain Monte Carlo:
♦ Sample space Ω, probability π(ω) (e.g., the posterior given e)
♦ We would like to sample directly from π(ω), but it’s hard
♦ Instead, wander around Ω randomly, collecting samples
♦ The random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, . . . , ωt is a Markov chain)
♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π


Gibbs sampling in Bayes nets

Markov chain state ωt = current assignment xt to all variables.
Transition kernel: pick a variable Xj, sample it conditioned on all others.
Markov blanket property: P(Xj | all other variables) = P(Xj | mb(Xj)),
so generate the next state by sampling a variable given its Markov blanket.

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts for each value of X, initially zero
                   Z, the nonevidence variables in bn
                   z, the current state of variables Z, initially random
  for i = 1 to N do
    choose Zj in Z uniformly at random
    set the value of Zj in z by sampling from P(Zj | mb(Zj))
    N[x] ← N[x] + 1 where x is the value of X in z
  return Normalize(N)
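A sketch of Gibbs-Ask under the same assumed network representation as the earlier sampling sketches; mb_distribution implements the Markov blanket formula derived on the "Markov blanket sampling" slide below.

```python
import random

def mb_distribution(bn, var, state):
    """P(var = True | mb(var)) in the given state, computed as
    alpha * P(var | parents(var)) * product over children of P(child | its parents)."""
    cpts = {name: (parents, cpt) for name, parents, cpt in bn}
    children = [name for name, parents, _ in bn if var in parents]
    weight = {}
    for val in (True, False):
        s = dict(state)
        s[var] = val
        parents, cpt = cpts[var]
        p = cpt[tuple(s[q] for q in parents)]
        w = p if val else 1.0 - p
        for child in children:
            cparents, ccpt = cpts[child]
            pc = ccpt[tuple(s[q] for q in cparents)]
            w *= pc if s[child] else 1.0 - pc
        weight[val] = w
    return weight[True] / (weight[True] + weight[False])

def gibbs_ask(X, e, bn, n):
    """Estimate P(X | e) by resampling one nonevidence variable at a time
    from its Markov blanket distribution."""
    counts = {True: 0, False: 0}
    nonevidence = [name for name, _, _ in bn if name not in e]
    state = dict(e)
    for var in nonevidence:  # random initial state
        state[var] = random.random() < 0.5
    for _ in range(n):
        var = random.choice(nonevidence)
        state[var] = random.random() < mb_distribution(bn, var, state)
        counts[state[X]] += 1
    return {val: c / n for val, c in counts.items()}

# e.g. gibbs_ask("Rain", {"Sprinkler": True, "WetGrass": True}, sprinkler_bn, 10000)
```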


The Markov chain

With Sprinkler = true and WetGrass = true fixed by the evidence, there are four states, one per joint assignment to Cloudy and Rain.

[Figure: the four states of the chain, each a copy of the sprinkler network with a different (Cloudy, Rain) assignment, connected by Gibbs transitions.]

MCMC example contd.

Estimate P(Rain | Sprinkler = true, WetGrass = true):
sample Cloudy or Rain given its Markov blanket, repeat; count the number of times Rain is true and false in the samples.

E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false:

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Markov blanket sampling

The Markov blanket of Cloudy is Sprinkler and Rain; the Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.

The probability given the Markov blanket is calculated as follows:

P(x′j | mb(Xj)) = α P(x′j | parents(Xj)) Π_{Zℓ ∈ Children(Xj)} P(zℓ | parents(Zℓ))

E.g., with evidence Sprinkler = true:

φ(¬cloudy, rain → cloudy, rain)
  = 0.5 × α P(cloudy) P(sprinkler | cloudy) P(rain | cloudy)
  = 0.5 × α × 0.5 × 0.1 × 0.8
  = 0.5 × 0.040/(0.040 + 0.050) = 0.2222

(Easy for discrete variables; the continuous case requires mathematical analysis for each combination of distribution types.)

Easily implemented in message-passing parallel systems, brains.

Can converge slowly, especially for near-deterministic models.

Theory for Gibbs sampling

Theorem: the stationary distribution of the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability.

Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the “flow” from z to z′ equals the flow from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π


Detailed balance

Let πt(z) be the probability that the chain is in state z at time t.

Detailed balance condition: “outflow” = “inflow” for each pair of states:

πt(z) φ(z → z′) = πt(z′) φ(z′ → z)   for all z, z′

Detailed balance ⇒ stationarity:

πt+1(z) = Σ_z′ πt(z′) φ(z′ → z)
        = Σ_z′ πt(z) φ(z → z′)    (detailed balance)
        = πt(z) Σ_z′ φ(z → z′)
        = πt(z)

MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π.
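A tiny numeric sanity check of this implication, using a hypothetical 2-state chain (the numbers are mine, chosen to satisfy detailed balance):

```python
# Hypothetical 2-state chain: pi = (0.25, 0.75) and a kernel phi in detailed
# balance with it: 0.25 * 0.6 = 0.75 * 0.2.
pi = [0.25, 0.75]
phi = [[0.4, 0.6],
       [0.2, 0.8]]

# Detailed balance: pi[z] * phi[z -> z'] == pi[z'] * phi[z' -> z]
assert abs(pi[0] * phi[0][1] - pi[1] * phi[1][0]) < 1e-12

# Stationarity follows: one step of the chain leaves pi unchanged.
pi_next = [sum(pi[z] * phi[z][zp] for z in range(2)) for zp in range(2)]
assert all(abs(a - b) < 1e-12 for a, b in zip(pi, pi_next))
```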


Gibbs sampling transition kernel

Probability of choosing variable Zj to sample is 1/(D − |E|).
Let Z̄j be all the other nonevidence variables, i.e., Z − {Zj}; current values are zj and z̄j; e is fixed. The transition probability is

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e) / (D − |E|)

This gives detailed balance with P(z | e):

π(z) φ(z → z′) = P(z | e) P(z′j | z̄j, e) / (D − |E|)
               = P(zj, z̄j | e) P(z′j | z̄j, e) / (D − |E|)
               = P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e) / (D − |E|)   (chain rule)
               = P(zj | z̄j, e) P(z′j, z̄j | e) / (D − |E|)             (chain rule backwards)
               = φ(z′ → z) π(z′) = π(z′) φ(z′ → z)


Summary (inference)

Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by MCMC:
– generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– can handle arbitrary combinations of discrete and continuous variables


Parameter learning: Complete data

θ_jkℓ = P(Xj = k | Parents(Xj) = ℓ)

Let x_j^(i) = value of Xj in example i; assume Boolean for simplicity.

Log likelihood:

L(θ) = Σ_{i=1..N} Σ_{j=1..D} log P(x_j^(i) | parents(Xj)^(i))
     = Σ_{i=1..N} Σ_{j=1..D} [ x_j^(i) log θ_j1ℓ(i) + (1 − x_j^(i)) log(1 − θ_j1ℓ(i)) ]

Setting the derivative to zero,

∂L/∂θ_j1ℓ = N_j1ℓ/θ_j1ℓ − N_j0ℓ/(1 − θ_j1ℓ) = 0

gives

θ_j1ℓ = N_j1ℓ/(N_j0ℓ + N_j1ℓ) = N_j1ℓ/N_jℓ

I.e., learning decomposes completely; the MLE is the observed conditional frequency.
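A minimal sketch of this counting procedure, assuming complete data given as dicts of Boolean values (the representation is mine):

```python
from collections import Counter

def mle_cpt(examples, var, parents):
    """theta_j1l = N_j1l / N_jl: for each parent configuration l, the
    observed frequency of var = True among complete-data examples."""
    n_total = Counter()   # N_jl: examples with parent configuration l
    n_true = Counter()    # N_j1l: of those, examples with var = True
    for ex in examples:   # ex is a dict: variable name -> Boolean value
        key = tuple(ex[p] for p in parents)
        n_total[key] += 1
        n_true[key] += ex[var]
    return {key: n_true[key] / n_total[key] for key in n_total}

# e.g. mle_cpt(data, "Sprinkler", ("Cloudy",)) -> {(True,): ..., (False,): ...}
```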


Example

Red/green wrapper depends probabilistically on flavor:

[Network: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) equal to θ1 for cherry and θ2 for lime]

Likelihood for, e.g., a cherry candy in a green wrapper:

P(F = cherry, W = green | h_{θ,θ1,θ2})
  = P(F = cherry | h_{θ,θ1,θ2}) P(W = green | F = cherry, h_{θ,θ1,θ2})
  = θ · (1 − θ1)

For N candies with c cherry and ℓ lime, of which rc cherry and rℓ lime candies have red wrappers (and gc, gℓ green):

P(X | h_{θ,θ1,θ2}) = θ^c (1 − θ)^ℓ · θ1^{rc} (1 − θ1)^{gc} · θ2^{rℓ} (1 − θ2)^{gℓ}

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]


Example contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0       ⇒  θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒  θ2 = rℓ/(rℓ + gℓ)
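For instance, with hypothetical counts of c = 600 cherry and ℓ = 400 lime candies, of which rc = 540 and rℓ = 100 are red-wrapped, the MLEs would be θ = 600/1000 = 0.6, θ1 = 540/600 = 0.9, and θ2 = 100/400 = 0.25.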


Hidden variables

Why learn models with hidden variables?

[Figure: two networks for diagnosing heart disease. (a) With a hidden HeartDisease variable between the causes (Smoking, Diet, Exercise) and the symptoms: parameter counts 2, 2, 2, 54, 6, 6, 6, i.e., 78 in total. (b) With HeartDisease removed, each symptom depends directly on all three causes: parameter counts 2, 2, 2, 54, 162, 486, i.e., 708 in total.]

Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning


Learning with and without hidden variables

[Plot: average negative log likelihood per case (y axis, 3.0 to 3.5) vs. number of training cases (x axis, 0 to 5000), comparing a 3-3 network and a 3-1-3 network trained with the APN algorithm, neural networks with 1, 3, and 5 hidden nodes, and the target network.]


EM for Bayes nets

For t = 0 to ∞ (until convergence) do
  E step: compute all p_ijkℓ = P(Xj = k, Parents(Xj) = ℓ | e^(i), θ^(t))
  M step: θ_jkℓ^(t+1) = N̂_jkℓ / Σ_k′ N̂_jk′ℓ = Σ_i p_ijkℓ / Σ_i Σ_k′ p_ijk′ℓ

The E step can be any exact or approximate inference algorithm.
With MCMC, one can treat each sample as a complete-data example.
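A minimal EM sketch for a stripped-down version of the candy model on the next slide, with Bag hidden and only Flavor observed (the full model also has Wrapper and Hole); the variable names and the simplification are mine, and degenerate cases (empty clusters) are ignored:

```python
def em_candy(flavors, theta, tf1, tf2, iters=100):
    """EM for a two-bag mixture: theta = P(Bag=1), tf1 = P(cherry | Bag=1),
    tf2 = P(cherry | Bag=2); flavors is a list of Booleans (True = cherry)."""
    for _ in range(iters):
        # E step: w[i] = P(Bag = 1 | flavor_i) under the current parameters
        w = []
        for cherry in flavors:
            p1 = theta * (tf1 if cherry else 1 - tf1)
            p2 = (1 - theta) * (tf2 if cherry else 1 - tf2)
            w.append(p1 / (p1 + p2))
        # M step: expected counts play the role of observed counts
        n1 = sum(w)
        n2 = len(flavors) - n1
        tf1 = sum(wi for wi, cherry in zip(w, flavors) if cherry) / n1
        tf2 = sum(1 - wi for wi, cherry in zip(w, flavors) if cherry) / n2
        theta = n1 / len(flavors)
    return theta, tf1, tf2
```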


Candy example

[Network: hidden Bag → Flavor, Wrapper, Hole, with parameters θ = P(Bag = 1), θF1 = P(F = cherry | Bag = 1), θF2 = P(F = cherry | Bag = 2).]

[Plot: log-likelihood L (y axis, −2020 to −1980) vs. iteration number (x axis, 0 to 120).]

Example: Car insurance

[Figure: the car insurance network. Nodes: Age, SocioEcon, GoodStudent, ExtraCar, Mileage, RiskAversion, VehicleYear, SeniorTrain, MakeModel, DrivingSkill, DrivingHist, Antilock, DrivQuality, Ruggedness, Airbag, CarValue, HomeBase, AntiTheft, Accident, Theft, OwnDamage, Cushioning, MedicalCost, OtherCost, LiabilityCost, OwnCost, PropertyCost.]

Example: Car insurance contd.

[Plot: average negative log likelihood per case (y axis, 1.8 to 3.4) vs. number of training cases (x axis, 0 to 5000), comparing a 12-3 network and the insurance network trained with the APN algorithm, neural networks with 1, 5, and 10 hidden nodes, and the target network.]

Bayesian learning in Bayes nets

Parameters become variables (parents of their previous “owners”):

P(Wrapper = red | Flavor = cherry, Θ1 = θ1, Θ2 = θ2) = θ1

The network is replicated for each example, with the parameter variables shared across all examples:

[Figure: Θ is a parent of Flavor1, Flavor2, Flavor3; Θ1 and Θ2 are parents of Wrapper1, Wrapper2, Wrapper3; each Flavori is a parent of the corresponding Wrapperi. CPTs: P(F = cherry) = θ; P(W = red | F) is θ1 for cherry and θ2 for lime.]

Bayesian learning contd.

Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.

With independent Beta or Dirichlet priors, MAP EM learning just adds pseudocounts a_jkℓ to the expected counts:

M step: θ_jkℓ^(t+1) = (a_jkℓ + N̂_jkℓ) / Σ_k′ (a_jk′ℓ + N̂_jk′ℓ)

Implemented in the EM training mode of Hugin and other Bayes net packages.
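As a one-function sketch (names mine), the MAP M step for a single CPT row:

```python
def map_m_step(pseudocounts, expected_counts):
    """MAP update for one CPT row: theta_k = (a_k + N_k) / sum_k' (a_k' + N_k')."""
    total = sum(a + n for a, n in zip(pseudocounts, expected_counts))
    return [(a + n) / total for a, n in zip(pseudocounts, expected_counts)]

# e.g. a uniform Dirichlet(1, 1) prior with expected counts 3.2 and 6.8:
# map_m_step([1, 1], [3.2, 6.8])  ->  [0.35, 0.65]
```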


Summary (parameter learning)

Complete data: the likelihood factorizes; each parameter θ_jkℓ is learned separately as the observed conditional frequency N_jkℓ/N_jℓ.

Incomplete data: the likelihood is a summation over all values of the hidden variables; EM applies by computing “expected counts”.

Bayesian learning: parameters become variables in a replicated model, with prior distributions defined by hyperparameters; Bayesian learning is then just ordinary inference in that model.

