An Introduction to Graphical Models

Loïc Schwaller

October 12, 2015

Abstract Graphical models are both a natural and powerful way to depict intricate dependency structures in multivariate random variables. They come in two flavours, directed and undirected, which are not mutually exclusive. However, some conditional independence structures can only be encoded in one or the other formalism. Among graphical models, Gaussian Graphical Models (often referred to as GGM) are of particular interest because they are easy to manipulate and interpret.

Contents

1 Introduction
2 Graphs
  2.1 Undirected graphs
  2.2 Directed acyclic graphs
3 Markov properties
  3.1 Conditional independence
  3.2 Markov properties on undirected graphs
  3.3 Markov properties on directed acyclic graphs
  3.4 Relation between undirected and directed Markov properties
  3.5 Graphical Model
4 Gaussian Graphical Models
  4.1 Conditional independence & Markov properties
  4.2 Linear structural equation system

References

[1] P. Dawid and S. Lauritzen. "Hyper Markov Laws in the Statistical Analysis of Decomposable Graphical Models". In: The Annals of Statistics 21.3 (1993), pp. 1272–1317.

[2] M. Frydenberg. "The chain graph Markov property". In: Scandinavian Journal of Statistics (1990), pp. 333–353.

[3] J. M. Hammersley and P. Clifford. "Markov fields on finite graphs and lattices". Unpublished manuscript, 1971.

[4] S. Lauritzen. Graphical Models. Oxford University Press, 1996. ISBN: 0-19-852219-3.

[5] S. Lauritzen et al. "Independence properties of directed Markov fields". In: Networks 20.5 (1990), pp. 491–505. DOI: 10.1002/net.3230200503.

[6] J. Moussouris. "Gibbs and Markov Random Systems with Constraints". In: Journal of Statistical Physics 10.1 (1974), pp. 11–33. DOI: 10.1007/BF01011714.

[7] J. Pearl and A. Paz. "Graphoids: a graph-based logic for reasoning about relevance relations". In: Artificial Intelligence II (1987), pp. 357–363.


1 Introduction

A graphical model is a probabilistic model whose conditional (in)dependence structure between random variables is given by a graph. This framework has received a fair amount of attention recently, but the ideas can be traced back as far as the beginning of the twentieth century with Gibbs. Indeed, one of the scientific areas that popularised graphical models is statistical physics. As an example, let us consider a simple model, named after the physicist Ernst Ising, that can be used to describe a large system of magnetic dipoles (or spins). Each spin can be in one of the two states {±1}. Spins are spread on a graph (commonly a lattice) and can only interact with their neighbours. If σ = (σ_1, σ_2, ...) is a state of the system giving assignments for all spins, the energy of σ and the associated Gibbsian distribution are respectively given by

$$H(\sigma) := -J \sum_{i \sim j} \sigma_i \sigma_j - H \sum_i \sigma_i, \qquad p(\sigma) := \frac{1}{Z} \exp(-H(\sigma)).$$
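To make the Gibbs distribution concrete, here is a minimal Python sketch computing H(σ) and the unnormalised weight exp(−H(σ)) for a random configuration; the 4 × 4 grid, the values of J and H, and the free boundary conditions are illustrative assumptions, not part of the text above.

```python
import numpy as np

def ising_energy(sigma, J=1.0, H=0.0):
    """Energy of a spin configuration sigma on a 2D lattice with
    nearest-neighbour interactions (free boundary, an illustrative choice)."""
    interaction = np.sum(sigma[:-1, :] * sigma[1:, :]) + np.sum(sigma[:, :-1] * sigma[:, 1:])
    return -J * interaction - H * np.sum(sigma)

# Unnormalised Gibbs weight exp(-H(sigma)); computing the constant Z
# would require summing over all 2^16 configurations of this 4x4 grid.
rng = np.random.default_rng(0)
sigma = rng.choice([-1, 1], size=(4, 4))
print(ising_energy(sigma), np.exp(-ising_energy(sigma)))
```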

The graphical aspect of this model is rather obvious, since the definition of H depends on the neighbourhood of each spin. This is an example of an undirected graphical model. Such models are also called Markov random fields.

Graphical models also arise naturally, for instance, when designing hierarchical models with sequentially drawn variables. Let us consider a classical Bayesian framework where observations X are drawn according to a distribution with parameters θ, and where θ is itself drawn from a distribution with (hyper)parameters λ. This model can be depicted by the directed graph

λ → θ → X.

Here we have an example of a directed graphical model. The idea is that the graph indicates a way to factorise the joint probability distribution of all variables as a product of conditional probability distributions; here, p(λ, θ, x) = p(λ) p(θ | λ) p(x | θ). For this factorisation to be possible, the graph cannot have any directed cycle: it has to be a directed acyclic graph (DAG). These models are often referred to as Bayesian networks or belief networks. Other classical examples include Markov chains and hidden Markov models (HMM).
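A sketch of how such a hierarchical model is sampled top-down along the graph λ → θ → X; the specific distributions (a Gamma draw for θ, Poisson observations) are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Minimal sketch of the hierarchy lambda -> theta -> X.
rng = np.random.default_rng(42)
lam = 2.0                                # fixed hyperparameter lambda
theta = rng.gamma(shape=lam, scale=1.0)  # theta ~ p(theta | lambda)
X = rng.poisson(lam=theta, size=10)      # X_i ~ p(x | theta), i.i.d. given theta
# With lambda fixed, the joint density factorises along the graph:
# p(theta, x | lambda) = p(theta | lambda) * prod_i p(x_i | theta).
```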

2 Graphs

2.1 Undirected graphs

Let V = {1, ..., p}, p ≥ 2, and let P_2(V) denote the set of subsets of V of size 2.

Definition For E ⊆ P_2(V), G = (V, E) is the undirected graph with vertices V and edges E. A graph G = (V, E) is said to be complete if E = P_2(V). For a subset A ⊂ V, the subgraph of G induced by A is defined as G_A := (A, E_A) with E_A := {{i, j} ∈ E | i ∈ A, j ∈ A}.

Definition Let G = (V, E) be a graph. For α ∈ V, we define the following sets of vertices:

• boundary of α: bd(α) = {β ∈ V \ {α} | {α, β} ∈ E};
• closure of α: cl(α) = bd(α) ∪ {α}.

Path Let G = (V, E) be an undirected graph and α, β be two distinct vertices in V. A path from α to β is a sequence α = γ_0, ..., γ_n = β, n ≥ 1, of distinct vertices such that, for all 1 ≤ i ≤ n, {γ_{i−1}, γ_i} ∈ E.

Separation Let A, B, C be subsets of V. C is said to separate A from B if any path from α ∈ A to β ∈ B intersects C.

Decomposition A pair (A, B) of subsets of V is said to be a decomposition of G if V = A ∪ B, the subgraph induced by G on A ∩ B is complete, and A ∩ B separates A from B. If A and B are both proper subsets of V, the decomposition is said to be proper. A graph G is said to be decomposable if it is either complete or if there exists a proper decomposition of G into two decomposable subgraphs.

Chordal graphs A graph G is chordal if every cycle of four or more vertices has a chord, that is, an edge that is not part of the cycle but connects two vertices of the cycle.

Proposition 1 (Lauritzen, 1996, Prop. 2.5). The following conditions are equivalent for a graph G:

• G is decomposable;
• G is chordal.
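Chordality, and hence decomposability by Proposition 1, can be checked programmatically; here is a small sketch using the networkx library (an illustrative tool choice, not used in the text).

```python
import networkx as nx

# A 4-cycle has no chord, hence is neither chordal nor decomposable.
cycle = nx.cycle_graph(4)            # vertices 0-1-2-3-0
print(nx.is_chordal(cycle))          # False

# Adding the chord {0, 2} splits the cycle into two triangles.
cycle.add_edge(0, 2)
print(nx.is_chordal(cycle))          # True
```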

2.2 Directed acyclic graphs

We consider the same set of vertices V = {1, ..., p}, p ≥ 2.

Directed graph A directed graph is a pair D = (V, E), where V is a set of vertices and the set of edges E ⊆ V × V is a set of ordered pairs of distinct vertices. Notice that we do not allow edges from a vertex to itself.

Path Let D = (V, E) be a directed graph and α, β be two vertices in V. A (directed) path from α to β is a sequence α = γ_0, ..., γ_n = β, n ≥ 1, of distinct vertices such that, for all 1 ≤ i ≤ n, (γ_{i−1}, γ_i) ∈ E. If there is a path from α to β in D, we say that α leads to β and write α ↦ β. A directed cycle is a directed path from a vertex α back to itself.

DAG A directed acyclic graph or DAG is a directed graph with no directed cycles.

Definition Let D = (V, E) be a DAG. For α ∈ V, we define the following sets of vertices:

• parents of α: pa(α) = {β ∈ V | (β, α) ∈ E};
• descendants of α: de(α) = {β ∈ V | α ↦ β};
• non-descendants of α: nd(α) = V \ (de(α) ∪ {α}).

Example [figure omitted: a DAG illustrating the parents, descendants and non-descendants of a given vertex]
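These vertex sets are easy to compute; a small sketch with networkx on an illustrative DAG of our own choosing:

```python
import networkx as nx

# The vertex sets of Section 2.2 on a small illustrative DAG.
D = nx.DiGraph([(1, 3), (2, 4), (3, 4)])
alpha = 3
pa = set(D.predecessors(alpha))      # parents pa(3) = {1}
de = nx.descendants(D, alpha)        # descendants de(3) = {4}
nd = set(D.nodes) - de - {alpha}     # non-descendants nd(3) = {1, 2}
print(pa, de, nd)
```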

3 Markov properties

3.1 Conditional independence

Conditional independence Let X, Y, Z be random variables on a joint probability space (Ω, 𝒜, P). We say that X is conditionally independent of Y given Z, written

X ⊥⊥ Y | Z [P],

if there exists a conditional probability measure for X given Y, Z under P that only depends on Z. When there is no doubt about P, it will be omitted.

Proposition 2. The ternary relation X ⊥⊥ Y | Z has the following properties, where h denotes an arbitrary measurable function on the sample space of X:

(C1) if X ⊥⊥ Y | Z, then Y ⊥⊥ X | Z;
(C2) if X ⊥⊥ Y | Z and U = h(X), then U ⊥⊥ Y | Z;
(C3) if X ⊥⊥ Y | Z and U = h(X), then X ⊥⊥ Y | (Z, U);
(C4) if X ⊥⊥ Y | Z and X ⊥⊥ W | (Y, Z), then X ⊥⊥ (W, Y) | Z.

Remark Relations (C1)–(C4) can be viewed as formal axioms for conditional independence or irrelevance. A semi-graphoid is an algebraic structure satisfying (C1)–(C4) in which X, Y, Z are disjoint subsets of a finite set and U = h(X) is replaced by U ⊆ X (Pearl, 1988).

Let X = (X_1, ..., X_p) be a random vector taking values in a product space 𝒳 = ⊗_{i=1}^p 𝒳_i. For any A ⊂ V, let X_A denote (X_α)_{α∈A}. The same notation is used for any x ∈ 𝒳. In the following, for any subsets A, B, C ⊂ V, we will use the short notation A ⊥⊥ B | C for X_A ⊥⊥ X_B | X_C.

3.2 Markov properties on undirected graphs

Markov properties Let G = (V, E) be an undirected graph. A probability measure π on 𝒳 is said to obey

(P) the pairwise Markov property relative to G, if for any pair (α, β) ∈ V² such that {α, β} ∉ E, α ⊥⊥ β | V \ {α, β};

(L) the local Markov property relative to G, if for any vertex α ∈ V, α ⊥⊥ V \ cl(α) | bd(α);

(G) the global Markov property relative to G, if for any triple (A, B, S) of disjoint subsets of V such that S separates A from B in G, A ⊥⊥ B | S.

The Markov properties are related as described in the proposition below.

Proposition 3 (Pearl and Paz, 1987). For any undirected graph G and any probability distribution π on 𝒳, it holds that (G) ⇒ (L) ⇒ (P). If π has a positive and continuous density with respect to a product measure µ, then (G) ⇔ (L) ⇔ (P).

Example Let X, Y, Z be three random variables with values in {0, 1}. Let X = Y = Z and P(X = 0) = P(X = 1) = 1/2. The probability distribution of (X, Y, Z) satisfies the pairwise Markov property but not the local Markov property relative to the graph on {X, Y, Z} with no edges: pairwise independences such as X ⊥⊥ Y | Z hold trivially (given Z, everything is deterministic), but the local property would require, e.g., X ⊥⊥ (Y, Z), which fails.

Factorisation A probability measure π on 𝒳 is said to factorise over G (or to satisfy (F)) if its density p w.r.t. some product measure on 𝒳 has the form

$$p(x) = \prod_{A \in \mathcal{A}} \psi_A(x_A), \quad \forall x \in \mathcal{X},$$

where the sets A ∈ 𝒜 are complete subsets of G or, equivalently, if

$$p(x) = \prod_{C \in \mathcal{C}} \tilde{\psi}_C(x_C), \quad \forall x \in \mathcal{X},$$

where 𝒞 is the set of maximal cliques of G.

Proposition 4 (Lauritzen, 1996, Prop. 3.8). For any undirected graph G and any probability distribution π on 𝒳, it holds that (F) ⇒ (G) ⇒ (L) ⇒ (P).

Example (Moussouris, 1974) Here is an example of a distribution satisfying (G) but not (F). Consider the uniform distribution on the 8 (out of 16) configurations of {0, 1}⁴ displayed below, where each 2 × 2 square gives the values at the four vertices of the 4-cycle:

1 1   1 0   0 0   0 0   0 0   0 1   1 1   1 1
1 1   1 1   1 1   0 1   0 0   0 0   0 0   1 0

To show that this distribution is globally Markov with respect to the 4-cycle, one only has to show that two opposite vertices are independent conditionally on the other two. But when the values of two diagonally opposite vertices are fixed, so is the value of one of the other two vertices; the global Markov property is therefore satisfied.

We now show that the density does not factorise. Assume that it does. Then p(0, 0, 0, 0) = 1/8 = ψ₁₂(0, 0) ψ₂₃(0, 0) ψ₃₄(0, 0) ψ₄₁(0, 0), so these factors are all strictly positive. Reasoning similarly on all 8 allowed configurations yields that all factors ψ are strictly positive, so every one of the 16 configurations would have positive probability. This contradicts the fact that 8 configurations have probability 0.

Theorem 1 (Hammersley and Clifford, 1971). A probability distribution π with positive and continuous density p with respect to a measure µ satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorises according to G. In that case, (F) ⇔ (G) ⇔ (L) ⇔ (P).

Proposition 5 (Lauritzen, 1996, Prop. 3.19). Let G be a decomposable graph. Then it holds that (F) ⇔ (G).
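As a sanity check on the global Markov property just argued, the following sketch verifies numerically that the opposite vertices X₁ and X₃ are independent given (X₂, X₄); the tuple encoding of the eight configurations is one consistent labelling, since the original figure is not reproduced here.

```python
from itertools import product

# Uniform distribution on the eight allowed configurations of the 4-cycle.
allowed = {(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
           (1, 1, 1, 1), (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1)}
p = {c: (1 / 8 if c in allowed else 0.0) for c in product((0, 1), repeat=4)}

# Check X1 independent of X3 given (X2, X4): for each (x2, x4), the
# conditional law of (X1, X3) must factorise into its marginals.
for x2, x4 in product((0, 1), repeat=2):
    pz = sum(p[(a, x2, b, x4)] for a in (0, 1) for b in (0, 1))
    for x1, x3 in product((0, 1), repeat=2):
        joint = p[(x1, x2, x3, x4)] / pz
        m1 = sum(p[(x1, x2, b, x4)] for b in (0, 1)) / pz
        m3 = sum(p[(a, x2, x3, x4)] for a in (0, 1)) / pz
        assert abs(joint - m1 * m3) < 1e-12
print("X1 and X3 are conditionally independent given (X2, X4)")
```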

Markov faithfulness A distribution π is said to be Markov faithful to a graph G if, for any triple (A, B, S) of disjoint subsets of V, it holds that

A ⊥⊥ B | S ⟺ (S separates A from B in G).

3.3 Markov properties on directed acyclic graphs

Ordering The vertices V of a DAG D are well-ordered if they are linearly ordered in a way that is compatible with D, i.e. so that

α ∈ pa(β) ⇒ α < β, ∀α, β ∈ V.

Example [figure omitted: a well-ordered DAG on vertices 1–7]

d-separation Let us consider the possible directed structures that one can obtain from an undirected structure linking three nodes in a chain. The result is of one of the three types described below:

• tail-to-tail: a ← c → b;
• head-to-tail: a → c → b;
• head-to-head: a → c ← b.

The head-to-head structure is also called a v-structure.

Definition Let A, B, S be subsets of V. S is said to d-separate A and B if for any a ∈ A, b ∈ B and any undirected path γ between a and b, there exists a node u in γ such that either

• u ∈ S, and the edges of γ do not meet head-to-head at u; or
• neither u nor any of its descendants is in S, and the edges of γ meet head-to-head at u.

Such a node u is said to block the path γ from a to b.

Example [figure omitted: three copies of a DAG on vertices a, b, c, e, f, illustrating that c does not d-separate a and b, that f d-separates a and b, and that {c, f} d-separates a and b]
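d-separation can be checked mechanically. Here is a sketch using networkx (the check is exposed as nx.is_d_separator from version 3.3 on, and as nx.d_separated in earlier releases); the collider DAG used here is an illustrative choice, not the graph from the omitted figure.

```python
import networkx as nx

# Collider DAG: a -> c <- b, plus a descendant d of the collider c.
D = nx.DiGraph([("a", "c"), ("b", "c"), ("c", "d")])

# a and b are d-separated by the empty set: the only path a -> c <- b
# is blocked at the head-to-head node c...
print(nx.is_d_separator(D, {"a"}, {"b"}, set()))   # True
# ...but conditioning on c, or on its descendant d, unblocks the path.
print(nx.is_d_separator(D, {"a"}, {"b"}, {"c"}))   # False
print(nx.is_d_separator(D, {"a"}, {"b"}, {"d"}))   # False
```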

Markov properties Let D = (V, E) be a DAG. A probability measure π on 𝒳 is said to obey

$(\vec{O})$ the ordered Markov property relative to the well-ordered DAG D, if α ⊥⊥ {pr(α) \ pa(α)} | pa(α) for all α ∈ V, where pr(α) denotes the predecessors of α, i.e. the vertices before α in the well-ordering;

$(\vec{L})$ the (directed) local Markov property relative to D, if all variables are conditionally independent of their non-descendants given their parents, i.e. α ⊥⊥ {nd(α) \ pa(α)} | pa(α) for all α ∈ V;

$(\vec{G})$ the (directed) global Markov property relative to D, if for any triple (A, B, S) of disjoint subsets of V such that S d-separates A from B in D, A ⊥⊥ B | S.

In the directed case the relationship between the different Markov properties is much simpler than in the undirected case.

Proposition 6 (Lauritzen et al., 1990). It holds for any directed acyclic graph D that $(\vec{G}) \Leftrightarrow (\vec{L}) \Leftrightarrow (\vec{O})$.

Factorisation A probability distribution π on 𝒳 is said to factorise over a DAG D (or to satisfy $(\vec{F})$) if its density p w.r.t. some product measure on 𝒳 has the form

$$p(x) = \prod_{\alpha \in V} p_\alpha(x_\alpha \mid x_{\mathrm{pa}(\alpha)}), \quad \forall x \in \mathcal{X}.$$

Proposition 7 (Lauritzen, 1996, Th. 3.27). Let π be a probability distribution on 𝒳 and assume that it has a density with respect to some product measure on 𝒳. Then $(\vec{F}) \Leftrightarrow (\vec{G}) \Leftrightarrow (\vec{L}) \Leftrightarrow (\vec{O})$.

Markov faithfulness A distribution π is said to be Markov faithful to a DAG D if, for any triple (A, B, S) of disjoint subsets of V, it holds that

A ⊥⊥ B | S ⟺ (S d-separates A from B in D).

3.4 Relation between undirected and directed Markov properties

Markov properties of directed and undirected graphs are different in general. However, there are important connections between directed and undirected factorisations.

Moralisation The moral graph D^M of a DAG D is obtained by adding undirected edges between "unmarried" parents (parents of a common child that are not already joined) and subsequently dropping all directions.

Example [figure omitted: illustration of the moralisation process on a DAG with vertices 1–7]
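Moralisation is also implemented in networkx as nx.moral_graph; a minimal sketch on a small DAG of our own choosing, in which the unmarried parents 1 and 2 of vertex 3 get joined:

```python
import networkx as nx

D = nx.DiGraph([(1, 3), (2, 3), (3, 4)])  # 1 and 2 are unmarried parents of 3
M = nx.moral_graph(D)                     # add the edge {1, 2}, drop directions
print(sorted(M.edges()))                  # expected: [(1, 2), (1, 3), (2, 3), (3, 4)]
```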

Proposition 8. If π factorises over a DAG D, then it factorises over the moral graph D^M.

Proof. This follows from the fact that, in D^M, pa(α) ∪ {α} is complete for any vertex α ∈ V.

Thus, if π satisfies any of the directed Markov properties with respect to D, it satisfies all Markov properties for D^M.

Perfect DAGs The skeleton D̃ of a DAG D is the undirected graph obtained from D by ignoring directions. A DAG D is said to be perfect if D̃ = D^M.

Proposition 9. π factorises over a perfect DAG D if and only if it factorises over its skeleton D̃.

Example Rooted trees with arrows pointing away from the root are perfect DAGs. Thus, for such graphs, the directed and undirected Markov properties are the same.

Proposition 10. Let G be an undirected graph. An orientation of G is a DAG with skeleton G. An undirected graph G can be oriented to form a perfect DAG if and only if G is chordal.

Markov equivalence Two DAGs D and D′ are said to be Markov equivalent if, for any triple (A, B, S) of disjoint subsets of V, it holds that

(S d-separates A from B in D) ⟺ (S d-separates A from B in D′).

A DAG D and an undirected graph G are said to be Markov equivalent if, for any triple (A, B, S) of disjoint subsets of V, it holds that

(S d-separates A from B in D) ⟺ (S separates A from B in G).

Basically, two graphs are Markov equivalent if they induce the same conditional independence restrictions. Markov equivalence is easy to identify, as stated in the following proposition.

Proposition 11 (Frydenberg, 1990). Two DAGs D and D′ are Markov equivalent if and only if they have the same skeleton and the same v-structures. A DAG D and an undirected graph G are Markov equivalent if and only if D is perfect and G = D̃.

Example [figure omitted: a set of Markov-equivalent DAGs, together with a DAG that is not Markov equivalent to them]

3.5 Graphical Model

Definition An undirected graphical model is a pair (G, Π) where G is an undirected graph and Π is a family of distributions for X that satisfy one of the undirected Markov properties. A directed graphical model is a pair (D, Π) where D is a directed acyclic graph and Π is a family of distributions for X that satisfy one of the directed Markov properties.

4 Gaussian Graphical Models

4.1 Conditional independence & Markov properties

We now take 𝒳 = ℝ^p, and X is assumed to be Gaussian-distributed with mean µ and inverse covariance matrix (or precision matrix) Λ:

X ∼ N(µ, Λ⁻¹).


Proposition 12. Let α, β be two distinct vertices of V. Then it holds that

X_α ⊥⊥ X_β | X_{V \ {α, β}} ⟺ Λ_{αβ} = 0.

Let G_Λ be the graph obtained from Λ by putting an edge between two vertices α and β if and only if Λ_{αβ} ≠ 0. Then the distribution of X satisfies the pairwise Markov property with respect to G_Λ. Moreover, since its density is positive and continuous, it also satisfies the local and global Markov properties. Finally, the equivalence in Proposition 12 shows that the distribution of X is Markov faithful to G_Λ.
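Proposition 12 can be checked numerically via partial correlations: the partial correlation of X_α and X_β given all the other variables is −Λ_{αβ}/√(Λ_{αα} Λ_{ββ}), which vanishes exactly when Λ_{αβ} = 0. A sketch with an illustrative precision matrix encoding the chain on three vertices (no edge between 1 and 3):

```python
import numpy as np

# Precision matrix whose zero pattern encodes the chain 1 - 2 - 3:
# Lambda[0, 2] = 0, i.e. no edge between vertices 1 and 3.
Lam = np.array([[ 2.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  2.0]])

# Partial correlation of X_i and X_j given the rest:
# rho_ij = -Lambda_ij / sqrt(Lambda_ii * Lambda_jj).
d = np.sqrt(np.diag(Lam))
partial_corr = -Lam / np.outer(d, d)
print(abs(partial_corr[0, 2]))   # 0.0: X_1 independent of X_3 given X_2
```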

4.2 Linear structural equation system

Definition Let D be a DAG on V. Consider the equation system

$$X_\alpha = \lambda_\alpha' X_{\mathrm{pa}(\alpha)} + \mu_\alpha + U_\alpha, \qquad \forall \alpha \in V,$$

where λ_α is a vector of real coefficients indexed by pa(α) and the {U_α}_{α∈V} are independent random variables with U_α ∼ N(0, σ_α²) for all α ∈ V. Such a system is called a recursive structural equation system.

Proposition 13. A recursive structural equation system given by a DAG D defines a multivariate Gaussian distribution which satisfies the directed Markov property relative to D. It follows from Proposition 8 that this distribution also satisfies the (undirected) Markov property relative to D^M.

Example Consider the following structural equation system:

$$\begin{cases}
X_1 = U_1 \\
X_2 = U_2 \\
X_3 = \lambda_{31} X_1 + U_3 = \lambda_{31} U_1 + U_3 \\
X_4 = \lambda_{42} X_2 + \lambda_{43} X_3 + U_4 = \lambda_{43}\lambda_{31} U_1 + \lambda_{42} U_2 + \lambda_{43} U_3 + U_4
\end{cases}$$

This system is given by the DAG 1 → 3 → 4 ← 2.

The precision matrix Λ associated with the distribution of X is then easily computed:

$$\Lambda = \begin{pmatrix}
\frac{1}{\sigma_1^2} + \frac{\lambda_{31}^2}{\sigma_3^2} & 0 & -\frac{\lambda_{31}}{\sigma_3^2} & 0 \\
0 & \frac{1}{\sigma_2^2} + \frac{\lambda_{42}^2}{\sigma_4^2} & \frac{\lambda_{42}\lambda_{43}}{\sigma_4^2} & -\frac{\lambda_{42}}{\sigma_4^2} \\
-\frac{\lambda_{31}}{\sigma_3^2} & \frac{\lambda_{42}\lambda_{43}}{\sigma_4^2} & \frac{1}{\sigma_3^2} + \frac{\lambda_{43}^2}{\sigma_4^2} & -\frac{\lambda_{43}}{\sigma_4^2} \\
0 & -\frac{\lambda_{42}}{\sigma_4^2} & -\frac{\lambda_{43}}{\sigma_4^2} & \frac{1}{\sigma_4^2}
\end{pmatrix}.$$

We can see that Λ_{23} ≠ 0 because of the head-to-head configuration at node 4. For a DAG D, let 𝒢(D) denote the set of multivariate Gaussian distributions that can be defined by a recursive structural equation system on D. For an undirected graph G, let 𝒢(G) denote the set of multivariate Gaussian distributions that satisfy the (undirected) Markov property relative to G.

Proposition 14. Let D be a DAG on V. It holds that 𝒢(D) ⊆ 𝒢(D^M), with equality when D is perfect.
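As a final sanity check linking Sections 4.1 and 4.2, the following sketch simulates the example system above (with illustrative coefficients λ₃₁ = 0.8, λ₄₂ = −0.5, λ₄₃ = 1.2 and unit noise variances) and recovers the precision matrix from a large sample; the estimated entry Λ₂₃ should be close to λ₄₂λ₄₃/σ₄² = −0.6.

```python
import numpy as np

# Simulate the recursive structural equation system of the example
# (coefficient values are illustrative) and estimate its precision matrix.
rng = np.random.default_rng(1)
n, lam31, lam42, lam43 = 200_000, 0.8, -0.5, 1.2
U = rng.normal(size=(n, 4))          # sigma_alpha = 1 for all alpha here
X1 = U[:, 0]
X2 = U[:, 1]
X3 = lam31 * X1 + U[:, 2]
X4 = lam42 * X2 + lam43 * X3 + U[:, 3]
X = np.column_stack([X1, X2, X3, X4])

Lam = np.linalg.inv(np.cov(X, rowvar=False))
# Lam[1, 2] estimates Lambda_23 = lam42 * lam43 / sigma_4^2 = -0.6,
# nonzero because of the head-to-head configuration at node 4,
# while Lam[0, 1] and Lam[0, 3] should be close to 0.
print(Lam.round(2))
```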
