Bayesian networks: Inference and learning
CS194-10 Fall 2011 Lecture 22
Outline
♦ Exact inference (briefly)
♦ Approximate inference (rejection sampling, MCMC)
♦ Parameter learning
Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network:
P(B | j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)

[Figure: burglary network with nodes B, E, A, J, M]

Rewrite full joint entries using products of CPT entries:
P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)
            = α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(D) space, O(K^D) time
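For concreteness, here is a minimal Python sketch of this enumeration for P(B | j, m) (my own encoding, not part of the slides). It uses the alarm-network CPT values shown in the evaluation tree on the next slide; that slide only shows the B = true rows of the alarm CPT, so the standard textbook values 0.29 and 0.001 are assumed for the B = false rows.

# Enumeration for P(B | j = true, m = true) on the burglary (alarm) network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,     # P(a = true | B, E)
       (False, True): 0.29, (False, False): 0.001}  # assumed values for B = false
P_J = {True: 0.90, False: 0.05}                     # P(j = true | A)
P_M = {True: 0.70, False: 0.01}                     # P(m = true | A)

def enumerate_burglary():
    # alpha * sum_e sum_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
    unnorm = {}
    for b in (True, False):
        total = 0.0
        for e in (True, False):
            inner = 0.0
            for a in (True, False):
                p_a = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
                inner += p_a * P_J[a] * P_M[a]
            total += P_E[e] * inner
        unnorm[b] = P_B[b] * total
    alpha = 1.0 / sum(unnorm.values())
    return {b: alpha * v for b, v in unnorm.items()}

print(enumerate_burglary())   # roughly {True: 0.284, False: 0.716} with the assumed CPT rows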
Evaluation tree

[Figure: depth-first evaluation tree for the query, with branches labelled by the CPT entries used: P(b) = .001, P(e) = .002, P(¬e) = .998, P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06, P(j|a) = .90, P(j|¬a) = .05, P(m|a) = .70, P(m|¬a) = .01]
Enumeration is inefficient: repeated computation, e.g., it computes P(j|a)P(m|a) for each value of e.
Efficient exact inference

Junction tree and variable elimination algorithms avoid repeated computation (a generalized form of dynamic programming)

♦ Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of exact inference are O(K^L D) (with L the maximum number of parents), i.e., linear in the size of the network

♦ Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete

[Figure: reduction of a 3-CNF formula to a Bayes net: root variables A, B, C, D each with prior 0.5, one node per clause (1. A ∨ B ∨ C; 2. C ∨ D ∨ …; 3. B ∨ C ∨ …), and an AND node combining the clauses]
Inference by stochastic simulation

Idea: replace the sum over hidden-variable assignments with a random sample.
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

[Figure: coin flip, probability 0.5]

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from the prior specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, ..., XD)
  x ← an event with D elements
  for j = 1, ..., D do
    x[j] ← a random sample from P(Xj | values of parents(Xj) in x)
  return x
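Below is a minimal Python rendering of Prior-Sample (a sketch using my own dict-based network encoding, not the slides' notation), instantiated with the sprinkler network of the next example.

import random

# Sprinkler network (next slide), in topological order; each CPT maps a tuple of
# parent values to P(variable = true | parents).
NETWORK = [
    ("Cloudy",    [],                    {(): 0.50}),
    ("Sprinkler", ["Cloudy"],            {(True,): 0.10, (False,): 0.50}),
    ("Rain",      ["Cloudy"],            {(True,): 0.80, (False,): 0.20}),
    ("WetGrass",  ["Sprinkler", "Rain"], {(True, True): 0.99, (True, False): 0.90,
                                          (False, True): 0.90, (False, False): 0.01}),
]

def prior_sample(network=NETWORK):
    # Sample one complete event from the prior defined by the network.
    event = {}
    for name, parents, cpt in network:
        p_true = cpt[tuple(event[p] for p in parents)]
        event[name] = random.random() < p_true
    return event

print(prior_sample())   # e.g. {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}

Repeatedly calling prior_sample and counting the fraction of events equal to (true, false, true, true) converges to 0.5 × 0.9 × 0.8 × 0.9 = 0.324, which is the consistency property derived two slides ahead.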
Example

[Figure: sprinkler network. Cloudy is the parent of Sprinkler and Rain; Sprinkler and Rain are the parents of WetGrass]

P(C) = .50

 C   P(S|C)        C   P(R|C)
 T    .10          T    .80
 F    .50          F    .20

 S  R   P(W|S,R)
 T  T     .99
 T  F     .90
 F  T     .90
 F  F     .01
Sampling from an empty network contd.

Probability that Prior-Sample generates a particular event:
S_PS(x1 ... xD) = Π(j=1..D) P(xj | parents(Xj)) = P(x1 ... xD)
i.e., the true prior probability.

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x1 ... xD) be the number of samples generated for event x1, ..., xD. Then we have

lim(N→∞) P̂(x1, ..., xD) = lim(N→∞) N_PS(x1, ..., xD)/N = S_PS(x1, ..., xD) = P(x1 ... xD)

That is, estimates derived from Prior-Sample are consistent.

Shorthand: P̂(x1, ..., xD) ≈ P(x1 ... xD)
Rejection sampling

P̂(X | e) is estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts for each value of X, initially zero
  for i = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)
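A direct Python rendering of Rejection-Sampling (a sketch; it reuses the prior_sample function and NETWORK encoding assumed in the earlier sketch, so those names are not from the slides):

from collections import Counter

def rejection_sampling(query_var, evidence, n_samples, sample_fn=prior_sample):
    # Estimate P(query_var | evidence) from prior samples that agree with the evidence.
    counts = Counter()
    for _ in range(n_samples):
        event = sample_fn()
        if all(event[var] == val for var, val in evidence.items()):
            counts[event[query_var]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("all samples were rejected; try more samples")
    return {value: n / total for value, n in counts.items()}

# E.g. rejection_sampling("Rain", {"Sprinkler": True}, 100) might return roughly
# {True: 0.3, False: 0.7}, matching the worked example below.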
E.g., estimate P(Rain | Sprinkler = true) using 100 samples:
27 samples have Sprinkler = true; of these, 8 have Rain = true and 19 have Rain = false.

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure.
Analysis of rejection sampling

P̂(X | e) = α N_PS(X, e)          (algorithm defn.)
          = N_PS(X, e)/N_PS(e)    (normalized by N_PS(e))
          ≈ P(X, e)/P(e)          (property of Prior-Sample)
          = P(X | e)              (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates.

Problem: hopelessly expensive if P(e) is small
P(e) drops off exponentially with the number of evidence variables!
Approximate inference using MCMC

General idea of Markov chain Monte Carlo:
♦ Sample space Ω, probability π(ω) (e.g., the posterior given e)
♦ Would like to sample directly from π(ω), but it's hard
♦ Instead, wander around Ω randomly, collecting samples
♦ The random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, ..., ωt is a Markov chain)
♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π
Gibbs sampling in Bayes nets

Markov chain state ωt = current assignment xt to all variables

Transition kernel: pick a variable Xj, sample it conditioned on all others

Markov blanket property: P(Xj | all other variables) = P(Xj | mb(Xj)),
so generate the next state by sampling a variable given its Markov blanket

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts for each value of X, initially zero
                   Z, the nonevidence variables in bn
                   z, the current state of variables Z, initially random
  for i = 1 to N do
    choose Zj in Z uniformly at random
    set the value of Zj in z by sampling from P(Zj | mb(Zj))
    N[x] ← N[x] + 1 where x is the value of X in z
  return Normalize(N)
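Here is a small Python sketch of Gibbs-Ask specialized to the running sprinkler query P(Rain | Sprinkler = true, WetGrass = true) discussed on the next slides; the function and variable names are my own.

import random

# Sprinkler-network CPTs: P(variable = true | parents).
P_C = 0.5
P_S = {True: 0.10, False: 0.50}                    # P(Sprinkler = true | Cloudy)
P_R = {True: 0.80, False: 0.20}                    # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}  # P(WetGrass = true | Sprinkler, Rain)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def gibbs_rain(n_samples, sprinkler=True, wetgrass=True):
    # Estimate P(Rain | Sprinkler = sprinkler, WetGrass = wetgrass) by Gibbs sampling.
    cloudy, rain = random.random() < 0.5, random.random() < 0.5   # random initial state
    counts = {True: 0, False: 0}
    for _ in range(n_samples):
        if random.random() < 0.5:
            # resample Cloudy given its Markov blanket (Sprinkler, Rain)
            w_t = P_C * bern(P_S[True], sprinkler) * bern(P_R[True], rain)
            w_f = (1 - P_C) * bern(P_S[False], sprinkler) * bern(P_R[False], rain)
            cloudy = random.random() < w_t / (w_t + w_f)
        else:
            # resample Rain given its Markov blanket (Cloudy, Sprinkler, WetGrass)
            w_t = P_R[cloudy] * bern(P_W[(sprinkler, True)], wetgrass)
            w_f = (1 - P_R[cloudy]) * bern(P_W[(sprinkler, False)], wetgrass)
            rain = random.random() < w_t / (w_t + w_f)
        counts[rain] += 1
    return {v: n / n_samples for v, n in counts.items()}

# gibbs_rain(100_000) approaches the exact posterior, roughly {True: 0.32, False: 0.68};
# the slide's 100-sample run gives <0.31, 0.69>.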
The Markov chain

With Sprinkler = true, WetGrass = true, there are four states, one for each joint assignment to Cloudy and Rain.

[Figure: the four states of the chain, one per assignment to (Cloudy, Rain), with the Gibbs transitions between them]
MCMC example contd.

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat. Count the number of times Rain is true and false in the samples.

E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Markov blanket sampling

Markov blanket of Cloudy is Sprinkler and Rain
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

[Figure: sprinkler network with the Markov blankets indicated]

Probability given the Markov blanket is calculated as follows:
P(x′j | mb(Xj)) = α P(x′j | parents(Xj)) Π(Zℓ ∈ Children(Xj)) P(zℓ | parents(Zℓ))

E.g., with Sprinkler = true as evidence,
φ(¬cloudy, rain → cloudy, rain)
  = 0.5 × α P(cloudy) P(sprinkler | cloudy) P(rain | cloudy)
  = 0.5 × α × 0.5 × 0.1 × 0.8
  = 0.5 × 0.040/(0.040 + 0.050) = 0.2222

(Easy for discrete variables; the continuous case requires mathematical analysis for each combination of distribution types.)

Easily implemented in message-passing parallel systems, brains

Can converge slowly, especially for near-deterministic models
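The arithmetic in the example above can be checked with a few lines of Python (a sketch; variable names are mine):

# Probability of the Gibbs move from (¬cloudy, rain) to (cloudy, rain) with evidence
# Sprinkler = true, when Cloudy is the variable chosen for resampling (probability 0.5).
p_c = 0.5
p_s_given_c, p_s_given_not_c = 0.10, 0.50
p_r_given_c, p_r_given_not_c = 0.80, 0.20

w_cloudy     = p_c * p_s_given_c * p_r_given_c                 # 0.5 * 0.1 * 0.8 = 0.040
w_not_cloudy = (1 - p_c) * p_s_given_not_c * p_r_given_not_c   # 0.5 * 0.5 * 0.2 = 0.050

p_cloudy_given_mb = w_cloudy / (w_cloudy + w_not_cloudy)       # the alpha-normalization
phi = 0.5 * p_cloudy_given_mb
print(round(phi, 4))                                           # 0.2222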
Theory for Gibbs sampling

Theorem: the stationary distribution for the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability

Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the "flow" from z to z′ is the same as from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π
Detailed balance

Let πt(z) be the probability that the chain is in state z at time t

Detailed balance condition: "outflow" = "inflow" for each pair of states:

πt(z) φ(z → z′) = πt(z′) φ(z′ → z)   for all z, z′

Detailed balance ⇒ stationarity:

πt+1(z) = Σz′ πt(z′) φ(z′ → z)
        = Σz′ πt(z) φ(z → z′)
        = πt(z) Σz′ φ(z → z′)
        = πt(z)

MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π
Gibbs sampling transition kernel

Probability of choosing variable Zj to sample is 1/(D − |E|)

Let Z̄j be all the other nonevidence variables, i.e., Z − {Zj}; current values are zj and z̄j; e is fixed. The transition probability is

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e)/(D − |E|)

This gives detailed balance with P(z | e):

π(z) φ(z → z′) = (1/(D − |E|)) P(z | e) P(z′j | z̄j, e)
               = (1/(D − |E|)) P(zj, z̄j | e) P(z′j | z̄j, e)
               = (1/(D − |E|)) P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e)   (chain rule)
               = (1/(D − |E|)) P(zj | z̄j, e) P(z′j, z̄j | e)             (chain rule backwards)
               = φ(z′ → z) π(z′) = π(z′) φ(z′ → z)
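To make the detailed-balance property concrete, the following self-contained sketch (my own construction, not from the slides) computes the exact posterior for the four-state sprinkler chain by enumeration, builds the Gibbs kernel φ, and checks π(z)φ(z → z′) = π(z′)φ(z′ → z) for every pair of states:

from itertools import product

# Sprinkler-network CPTs, P(variable = true | parents).
P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def bern(p, v): return p if v else 1.0 - p

S, W = True, True                                  # evidence: Sprinkler = true, WetGrass = true
states = list(product([True, False], repeat=2))    # states are (cloudy, rain)

# Exact posterior pi(c, r) = P(c, r | s, w) by enumeration.
joint = {(c, r): bern(P_C, c) * bern(P_S[c], S) * bern(P_R[c], r) * bern(P_W[(S, r)], W)
         for c, r in states}
Z = sum(joint.values())
pi = {z: p / Z for z, p in joint.items()}

def conditional(var, other_value):
    # P(var = true | Markov blanket), for var in {"C", "R"}, given the other variable's value.
    if var == "C":
        wt = P_C * bern(P_S[True], S) * bern(P_R[True], other_value)
        wf = (1 - P_C) * bern(P_S[False], S) * bern(P_R[False], other_value)
    else:
        wt = bern(P_R[other_value], True) * bern(P_W[(S, True)], W)
        wf = bern(P_R[other_value], False) * bern(P_W[(S, False)], W)
    return wt / (wt + wf)

def phi(z, z2):
    # Gibbs transition probability from z = (c, r) to z2, choosing C or R with probability 1/2.
    (c, r), (c2, r2) = z, z2
    p = 0.0
    if r == r2:   # move made by resampling Cloudy (includes self-loops)
        p += 0.5 * (conditional("C", r) if c2 else 1 - conditional("C", r))
    if c == c2:   # move made by resampling Rain (includes self-loops)
        p += 0.5 * (conditional("R", c) if r2 else 1 - conditional("R", c))
    return p

for z in states:
    for z2 in states:
        assert abs(pi[z] * phi(z, z2) - pi[z2] * phi(z2, z)) < 1e-12
print("detailed balance holds for all state pairs")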
Summary (inference)

Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by MCMC:
– generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– can handle arbitrary combinations of discrete and continuous variables
Parameter learning: Complete data

θjkℓ = P(Xj = k | Parents(Xj) = ℓ)

Let x(i)j be the value of Xj in example i; assume Boolean variables for simplicity.

Log likelihood:

L(θ) = Σ(i=1..N) Σ(j=1..D) log P(x(i)j | parents(Xj)(i))
     = Σ(i=1..N) Σ(j=1..D) log [ θj1ℓ(i)^x(i)j · (1 − θj1ℓ(i))^(1 − x(i)j) ]

where ℓ(i) is the parent configuration of Xj in example i. Setting

∂L/∂θj1ℓ = Nj1ℓ/θj1ℓ − Nj0ℓ/(1 − θj1ℓ) = 0

gives

θj1ℓ = Nj1ℓ/(Nj0ℓ + Nj1ℓ) = Nj1ℓ/Njℓ

I.e., learning is completely decomposed; MLE = observed conditional frequency.
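In code, the complete-data MLE is just a conditional frequency count per parent configuration; here is a minimal sketch (the helper name and the toy data are mine, not from the slides):

from collections import Counter

def mle_cpt(data, child, parents):
    # MLE for a Boolean node's CPT from complete data:
    # theta_{1,l} = N_{1,l} / N_l, the observed conditional frequency of child = true
    # for each parent configuration l.
    n_true, n_total = Counter(), Counter()
    for example in data:                       # each example is a dict of variable -> bool
        config = tuple(example[p] for p in parents)
        n_total[config] += 1
        n_true[config] += example[child]
    return {config: n_true[config] / n_total[config] for config in n_total}

# Illustrative complete data for the sprinkler network (made-up examples):
data = [
    {"Cloudy": True,  "Sprinkler": False, "Rain": True,  "WetGrass": True},
    {"Cloudy": True,  "Sprinkler": False, "Rain": True,  "WetGrass": True},
    {"Cloudy": False, "Sprinkler": True,  "Rain": False, "WetGrass": True},
    {"Cloudy": False, "Sprinkler": False, "Rain": False, "WetGrass": False},
]
print(mle_cpt(data, "Rain", ["Cloudy"]))   # {(True,): 1.0, (False,): 0.0}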
Example

Red/green wrapper depends probabilistically on flavor.

[Figure: Flavor → Wrapper network, with P(F = cherry) = θ and P(W = red | F): cherry → θ1, lime → θ2]

Likelihood for, e.g., a cherry candy in a green wrapper:
P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1)

N candies: rc red-wrapped and gc green-wrapped cherry candies, rℓ and gℓ likewise for lime, with c = rc + gc cherry and ℓ = rℓ + gℓ lime candies in total:

P(X | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]
Example contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0       ⇒   θ  = c/(c + ℓ)

∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒   θ1 = rc/(rc + gc)

∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒   θ2 = rℓ/(rℓ + gℓ)
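These closed-form MLEs are trivial to compute from counts; a sketch with illustrative (made-up) counts:

# MLEs for the candy example from observed counts (illustrative numbers, not from the slides).
rc, gc = 30, 10                  # red- and green-wrapped cherry candies
rl, gl = 20, 40                  # red- and green-wrapped lime candies
c, lime = rc + gc, rl + gl       # total cherry and lime candies

theta  = c  / (c + lime)         # P(F = cherry)       = 0.4
theta1 = rc / (rc + gc)          # P(W = red | cherry) = 0.75
theta2 = rl / (rl + gl)          # P(W = red | lime)   ≈ 0.333
print(theta, theta1, theta2)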
Hidden variables

Why learn models with hidden variables?

[Figure: (a) network with a hidden HeartDisease variable: Smoking, Diet, Exercise (2 parameters each) → HeartDisease (54) → Symptom 1, Symptom 2, Symptom 3 (6 each), 78 parameters in total; (b) the same domain with HeartDisease removed: the symptom CPTs grow to 54, 162, and 486 parameters, 708 in total]
Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning
Learning with and without hidden variables

[Figure: average negative log likelihood per case (about 3.0 to 3.5) vs. number of training cases (0 to 5000), comparing the target network, the 3-3 and 3-1-3 networks learned with the APN algorithm, and networks with 1, 3, and 5 hidden nodes learned with the NN algorithm]
EM for Bayes nets

For t = 0 to ∞ (until convergence) do
  E step: compute all pijkℓ = P(Xj = k, Parents(Xj) = ℓ | e(i), θ(t))
  M step: θjkℓ(t+1) = N̂jkℓ / Σk′ N̂jk′ℓ = Σi pijkℓ / Σi Σk′ pijk′ℓ

The E step can be any exact or approximate inference algorithm.
With MCMC, one can treat each sample as a complete-data example.
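As a concrete (simplified) sketch of these E and M steps, here is EM for the two-bag candy model of the next slide, with Bag hidden and three Boolean features observed; the encoding, initialization, and toy data are my own assumptions:

from math import prod
import random

FEATURES = ("cherry", "red_wrapper", "hole")

def bern(p, v):
    # P(feature = v) for a Boolean feature with P(true) = p
    return p if v else 1.0 - p

def em_two_bags(data, n_iters=50, seed=0):
    # EM for a two-bag mixture: Bag is hidden, the three features are observed.
    # Returns pi = P(Bag = 1) and theta[b][f] = P(feature f = true | Bag = b).
    rng = random.Random(seed)
    pi = 0.6                                                     # arbitrary asymmetric start
    theta = {b: {f: rng.uniform(0.25, 0.75) for f in FEATURES} for b in (1, 2)}
    for _ in range(n_iters):
        # E step: responsibility r_i = P(Bag = 1 | candy i, current parameters)
        resp = []
        for x in data:
            w1 = pi * prod(bern(theta[1][f], x[f]) for f in FEATURES)
            w2 = (1 - pi) * prod(bern(theta[2][f], x[f]) for f in FEATURES)
            resp.append(w1 / (w1 + w2))
        # M step: re-estimate parameters from expected counts (no smoothing)
        n1 = sum(resp)
        pi = n1 / len(data)
        for f in FEATURES:
            theta[1][f] = sum(r * x[f] for r, x in zip(resp, data)) / n1
            theta[2][f] = sum((1 - r) * x[f] for r, x in zip(resp, data)) / (len(data) - n1)
    return pi, theta

# Made-up data: each candy is a dict of observed Boolean features.
data = [
    {"cherry": True,  "red_wrapper": True,  "hole": True},
    {"cherry": True,  "red_wrapper": True,  "hole": False},
    {"cherry": True,  "red_wrapper": False, "hole": True},
    {"cherry": False, "red_wrapper": False, "hole": False},
    {"cherry": False, "red_wrapper": True,  "hole": False},
    {"cherry": False, "red_wrapper": False, "hole": False},
]
print(em_two_bags(data))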
Candy example

[Figure: candy model with hidden Bag: P(Bag = 1) = θ, P(F = cherry | Bag): bag 1 → θF1, bag 2 → θF2, with Flavor, Wrapper, and Hole as children of Bag; alongside, a plot of log-likelihood L (about −2020 to −1980) vs. EM iteration number (0 to 120)]
Example: Car insurance

[Figure: the car-insurance network, with nodes SocioEcon, Age, GoodStudent, ExtraCar, Mileage, RiskAversion, VehicleYear, SeniorTrain, MakeModel, DrivingSkill, DrivingHist, Antilock, DrivQuality, Ruggedness, Airbag, CarValue, HomeBase, AntiTheft, Accident, Theft, OwnDamage, Cushioning, MedicalCost, OtherCost, LiabilityCost, OwnCost, PropertyCost]
Example: Car insurance contd.

[Figure: average negative log likelihood per case (about 1.8 to 3.4) vs. number of training cases (0 to 5000), comparing the target network, the 12-3 network and the insurance network learned with the APN algorithm, and networks with 1, 5, and 10 hidden nodes learned with the NN algorithm]
Bayesian learning in Bayes nets

Parameters become variables (parents of their previous owners):

P(Wrapper = red | Flavor = cherry, Θ1 = θ1, Θ2 = θ2) = θ1

The network is replicated for each example, with the parameters shared across all examples:

[Figure: Θ is the parent of Flavor1, Flavor2, Flavor3, ...; Θ1 and Θ2 are parents of Wrapper1, Wrapper2, Wrapper3, ...; each Flavori is also the parent of Wrapperi. The single-example CPTs are P(F = cherry) = θ and P(W = red | F): cherry → θ1, lime → θ2]
Bayesian learning contd.

Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.

With independent Beta or Dirichlet priors, MAP EM learning = pseudocounts + expected counts:

M step:  θjkℓ(t+1) = (ajkℓ + N̂jkℓ) / Σk′ (ajk′ℓ + N̂jk′ℓ)
Implemented in EM training mode for Hugin and other Bayes net packages
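A one-function sketch of this MAP M step for a single CPT column (the names and example numbers are mine, not from the slides):

def m_step(expected_counts, pseudocounts=None):
    # M step for one CPT column: expected_counts[k] = N-hat_{jkl} for each value k.
    # With pseudocounts a_{jkl} (e.g., from a Dirichlet prior) this is the MAP update
    # "pseudocounts + expected counts"; without them it is plain ML-EM.
    if pseudocounts is None:
        pseudocounts = {k: 0.0 for k in expected_counts}
    totals = {k: pseudocounts[k] + expected_counts[k] for k in expected_counts}
    z = sum(totals.values())
    return {k: v / z for k, v in totals.items()}

# E.g., expected counts <2.4, 0.6> with one pseudocount per value:
print(m_step({True: 2.4, False: 0.6}, {True: 1.0, False: 1.0}))  # {True: 0.68, False: 0.32}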
Summary (parameter learning)

Complete data: the likelihood factorizes; each parameter θjkℓ is learned separately as the observed conditional frequency Njkℓ/Njℓ

Incomplete data: the likelihood is a summation over all values of the hidden variables; EM can be applied by computing "expected counts"

Bayesian learning: parameters become variables in a replicated model, with their own prior distributions defined by hyperparameters; Bayesian learning is then just ordinary inference in that model