Learning with Kernels on Graphs, Groups and Manifolds
Risi Kondor, Columbia University, New York, USA

Collaborators
John Lafferty, Guy Lebanon, Mikhail Belkin, Alex Smola, Tony Jebara, Count Laplace (1749-1827)

Batch Learning

Input space: $\mathcal{X}$, e.g. $\mathcal{X} = \mathbb{R}^d$
Output space: $\mathcal{Y}$, e.g. $\mathcal{Y} = \mathbb{R}$ (regression) or $\mathcal{Y} = \{-1, 1\}$ (classification)

Learn $f : \mathcal{X} \to \mathcal{Y}$ from examples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$.

A Naive Approach

Look for $f$ minimizing
$$R_{\mathrm{reg}}[f] = \underbrace{\frac{1}{m}\sum_{i=1}^m \bigl(f(x_i) - y_i\bigr)^2}_{\text{Loss function}} + \underbrace{\Omega[f]}_{\text{Complexity penalty}}$$

Harmonic example:
$$f(x) = \frac{1}{\sqrt{2\pi}^{\,d}} \int \hat{f}(\omega)\, e^{i\omega\cdot x}\, d\omega
\qquad
\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}^{\,d}} \int f(x)\, e^{-i\omega\cdot x}\, dx$$
$$\Omega[f] = \int e^{\|\omega\|^2/2}\, \|\hat{f}(\omega)\|^2\, d\omega$$

A space of functions

The functions $f$ naturally form a linear space $\mathcal{H}$.

Impose an inner product such that $\langle f, f\rangle = \Omega[f]$, e.g.
$$\langle f, f'\rangle = \int e^{\|\omega\|^2/2}\, \hat{f}(\omega)\, \hat{f}'(\omega)\, d\omega$$

If $\mathcal{H}$ is complete, it is said to be a Hilbert space.

Looking for $f \in \mathcal{H}$ minimizing
$$R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^m \bigl(f(x_i) - y_i\bigr)^2 + \langle f, f\rangle$$

Wouldn't it be neat if $\langle f, k_x\rangle = f(x)$? Then
$$R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^m \bigl(\langle f, k_{x_i}\rangle - y_i\bigr)^2 + \langle f, f\rangle$$

$f$ only interacts with the $k_{x_i}$'s, so
$$f \in \mathrm{span}(k_{x_1}, k_{x_2}, \ldots, k_{x_m}).$$

Now plug in $f(x) = \sum_i \alpha_i k_{x_i}(x)$:
$$R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^m \Bigl(\sum_{j=1}^m \alpha_j \langle k_{x_j}, k_{x_i}\rangle - y_i\Bigr)^2 + \sum_{i=1}^m\sum_{j=1}^m \alpha_i \alpha_j \langle k_{x_i}, k_{x_j}\rangle$$

Letting $K_{ij} = \langle k_{x_i}, k_{x_j}\rangle$:
$$R_{\mathrm{reg}} = (K\alpha - y)^\top (K\alpha - y) + \alpha^\top K \alpha$$

To find $f(x) = \sum_i \alpha_i k_{x_i}(x)$ minimizing
$$R_{\mathrm{reg}} = (K\alpha - y)^\top (K\alpha - y) + \alpha^\top K \alpha,$$
set $\partial R / \partial \alpha = 0$:
$$2K(K\alpha - y) + 2K\alpha = 0 \quad\Longrightarrow\quad \alpha = (K + I)^{-1} y$$

So what are $k_x$ and $K$?

Recall $\langle f, k_x\rangle = f(x)$. It is easy to show that for our example
$$k_x(x') = \frac{1}{\sqrt{2\pi}^{\,d}}\, e^{-\|x - x'\|^2/2}$$
and
$$K_{ij} = \langle k_{x_i}, k_{x_j}\rangle = k_{x_i}(x_j) = \frac{1}{\sqrt{2\pi}^{\,d}}\, e^{-\|x_i - x_j\|^2/2}.$$

K is the kernel!!!

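To make the recipe above concrete, here is a minimal sketch of the whole pipeline in NumPy: build the Gaussian kernel matrix from the formula above, solve $\alpha = (K + I)^{-1} y$, and evaluate $f(x) = \sum_i \alpha_i k_{x_i}(x)$. The toy data, the bandwidth, and the absence of the $1/m$ factor are illustrative assumptions, not part of the talk.

```python
# Minimal regularized least squares with the Gaussian kernel of the previous slide.
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """K[i, j] = (2*pi*sigma^2)^(-d/2) * exp(-||x_i - x_j||^2 / (2*sigma^2))."""
    d = X1.shape[1]
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return (2 * np.pi * sigma**2) ** (-d / 2) * np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))              # training inputs (assumed toy data)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)   # noisy targets

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + np.eye(len(X)), y)    # alpha = (K + I)^{-1} y

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = gaussian_kernel(X_test, X) @ alpha       # f(x) = sum_i alpha_i k_{x_i}(x)
```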

Kernel Methods

General regularization network form:
$$f = \arg\min_{f \in \mathcal{H}} \Bigl[\, \underbrace{\frac{1}{m}\sum_{i=1}^m L\bigl(f(x_i), y_i\bigr)}_{\text{Empirical risk}} + \underbrace{\langle f, f\rangle}_{\text{Regularizer}} \,\Bigr]$$

SVM classification: $L = \max\bigl(0,\; 1 - y_i f(x_i)\bigr)$
SVM regression: $L = |\, f(x_i) - y_i \,|_\epsilon$
Gaussian Process MAP: $L = \frac{1}{\sigma_0^2}\bigl(f(x_i) - y_i\bigr)^2$
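For reference, the three losses above written out directly; the value of $\epsilon$ below is an arbitrary choice, not one taken from the talk.

```python
import numpy as np

def hinge_loss(f, y):                        # SVM classification
    return np.maximum(0.0, 1.0 - y * f)

def eps_insensitive_loss(f, y, eps=0.1):     # SVM regression (epsilon-insensitive)
    return np.maximum(0.0, np.abs(f - y) - eps)

def gaussian_map_loss(f, y, sigma0=1.0):     # Gaussian process MAP
    return (f - y) ** 2 / sigma0**2
```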

Conventional Explanation

$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ positive definite function: a similarity measure.

There exists some mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that $K(x, x') = \langle \Phi(x), \Phi(x')\rangle$.

Find some geometric criterion to optimize in $\mathcal{H}$, e.g. maximum margin.

Nonlinear SVMs

[Figure courtesy of B. Schölkopf and A. Smola, © MIT Press]

Connection between two views

• One-to-one correspondence between kernel and regularizer
• The kernel is algorithmic
• The form of the regularizer really explains what's going on, e.g.
$$K(x, x') = \frac{1}{\sqrt{2\pi}^{\,d}}\, e^{-\|x - x'\|^2/(2\sigma^2)} \qquad \text{(smoothing)}$$
$$\Updownarrow$$
$$\langle f, f\rangle = \frac{1}{\sqrt{2\pi}^{\,d}} \int e^{\|\omega\|^2 \sigma^2/2}\, \|\hat{f}(\omega)\|^2\, d\omega \qquad \text{(roughening)}$$

Connection in general

Regularization operator $P : \mathcal{H} \to \mathcal{H}$ (self-adjoint):
$$\Omega[f] = \langle f, f\rangle = \int (Pf)(x)\cdot(Pf)(x)\, dx$$

Kernel operator $K : \mathcal{H} \to \mathcal{H}$:
$$(Kg)(x) = \int K(x, x')\, g(x')\, dx' \qquad\Rightarrow\qquad (K\delta_x)(x') = K(x, x')$$

$$\langle f, k_x\rangle = \langle f, K(x, \cdot)\rangle = \int f(x')\, \bigl(P^2 K \delta_x\bigr)(x')\, dx' = f(x),$$
hence $K = P^{-2}$.

References so far

Girosi, Jones & Poggio: Regularization Theory and Neural Network Architectures (Neural Computation, 1995)
Smola & Schölkopf: From Regularization Operators to Support Vector Kernels (NIPS 1998)
Aronszajn: Theory of Reproducing Kernels (1950)
Kimeldorf & Wahba: Some Results on Tchebycheffian Spline Functions (1971)
...

Link to Diffusion

$$K_\beta(x, x') = \frac{1}{(4\pi\beta)^{d/2}}\, e^{-\|x - x'\|^2/(4\beta)}$$
is the solution to the diffusion equation
$$\frac{\partial}{\partial \beta} K_\beta(x, x') = \Delta\, K_\beta(x, x'),$$
where $\Delta = \frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2} + \ldots + \frac{\partial^2}{\partial x_d^2}$ is the Laplacian.

Generally, we may define a Laplacian operator $\Delta : \mathcal{H} \to \mathcal{H}$ and the kernel operator by
$$\frac{\partial}{\partial \beta} K_\beta = \Delta K_\beta.$$
Formally $K = e^{\beta\Delta}$ and $P = e^{-\beta\Delta/2}$.

How do we generalize to discrete spaces?


1. Graphs


Graphs

Looking for a positive definite $K : V \times V \to \mathbb{R}$, now just a matrix.

Try Random Walks

$$A_{ij} = \begin{cases} 1 & i \sim j \\ 0 & \text{otherwise} \end{cases}$$

$A$ symmetric $\Rightarrow$ even powers are positive definite.

$K = A^2$?  $A^4$?  $A^\infty$?  $K = \alpha_1 A^2 + \alpha_2 A^4 + \ldots$?

Diffusion

Infinite number of infinitesimal steps:
$$K = \lim_{n\to\infty} \Bigl(1 + \frac{\beta}{n} L\Bigr)^{n} = e^{\beta L}$$
$$L_{ij} = \begin{cases} 1 & i \sim j \\ -d_i & i = j \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Laplacian)}$$

Exponential Kernels

$$K_\beta = e^{\beta L} = \lim_{n\to\infty}\Bigl(I + \frac{\beta}{n} L\Bigr)^{n} = I + \beta L + \frac{\beta^2}{2!} L^2 + \frac{\beta^3}{3!} L^3 + \ldots$$

$$\frac{d}{d\beta} K_\beta = L\, K_\beta \qquad\qquad K_0 = I$$

For any symmetric $L$, $K = e^{\beta L}$ is positive definite:
$$e^{\beta L} = \lim_{n\to\infty}\Bigl(I + \frac{\beta}{n} L\Bigr)^{n} = \lim_{n\to\infty}\Bigl(I + \frac{\beta}{2n} L\Bigr)^{2n}$$

Conversely,
$$K = \bigl(K^{1/n}\bigr)^{n} = \lim_{n\to\infty}\Bigl(I + \frac{1}{n} L\Bigr)^{n} = e^{L}$$
for any infinitely divisible (or finite) $K$.

Properties of Diffusion Kernels

• Positive definite
• Analogy with the continuous case
• Local relationships $L$ induce global relationships $K_\beta$ by
  $$\frac{d}{d\beta} K_\beta = L\, K_\beta \qquad K_0 = I$$
  (sketched below)
• Works for undirected weighted graphs with weights $w_{ij} = w_{ji}$
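A minimal sketch of the construction, assuming the sign convention above ($L = A - D$); the example graph is an arbitrary choice.

```python
# Diffusion kernel K_beta = exp(beta * L) for an undirected graph.
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta):
    """A: symmetric 0/1 adjacency matrix (or symmetric edge weights)."""
    L = A - np.diag(A.sum(axis=1))      # L = A - D, so L_ii = -d_i
    return expm(beta * L)

# Path graph 1 - 2 - 3 - 4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
K = diffusion_kernel(A, beta=0.5)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > 0))  # symmetric, positive definite
```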

Complete graphs

$$K_\beta(i, j) = \begin{cases} \dfrac{1 + (n-1)\, e^{-n\beta}}{n} & \text{for } i = j \\[2mm] \dfrac{1 - e^{-n\beta}}{n} & \text{for } i \neq j \end{cases}$$

For $n = 2$: $K_\beta(i, j) \propto (\tanh\beta)^{d(i,j)}$

Closed chains

$$K(i, j) = \frac{1}{n}\sum_{\nu=0}^{n-1} e^{-\omega_\nu \beta} \cos\frac{2\pi\nu(i - j)}{n}$$

Tensor product kernels

$K^{(1)}$ kernel on $\mathcal{X}_1$, $K^{(2)}$ kernel on $\mathcal{X}_2$:
$$K^{(1,2)} = K^{(1)} \otimes K^{(2)} \qquad \text{kernel on } \mathcal{X}_1 \otimes \mathcal{X}_2$$
$$K^{(1,2)}\bigl((x_1, x_2), (x_1', x_2')\bigr) = K^{(1)}(x_1, x_1')\, K^{(2)}(x_2, x_2')$$
$$L^{(1,2)} = L^{(1)} \otimes I^{(2)} + L^{(2)} \otimes I^{(1)}$$
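The product relation can be checked numerically: writing $L^{(1,2)}$ as a Kronecker sum (in one fixed ordering of the factors), its matrix exponential factors into the Kronecker product of the two diffusion kernels. The small example graphs below are arbitrary assumptions.

```python
import numpy as np
from scipy.linalg import expm

def laplacian(A):
    return A - np.diag(A.sum(axis=1))

A1 = np.array([[0, 1], [1, 0]], dtype=float)                    # single edge
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # triangle
L1, L2 = laplacian(A1), laplacian(A2)
beta = 0.7

L12 = np.kron(L1, np.eye(3)) + np.kron(np.eye(2), L2)           # Kronecker sum
K12 = expm(beta * L12)
K_prod = np.kron(expm(beta * L1), expm(beta * L2))
print(np.allclose(K12, K_prod))   # True: tensor product of the two diffusion kernels
```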

Hypercubes, etc.

Hypercube:
$$K(x, x') = (\tanh\beta)^{d(x,x')}$$

Alphabet $\mathcal{A}$:
$$K(x, x') = \left(\frac{1 - e^{-|\mathcal{A}|\beta}}{1 + (|\mathcal{A}| - 1)\, e^{-|\mathcal{A}|\beta}}\right)^{d(x,x')}$$
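The hypercube formula can be verified against the matrix exponential directly; the 3-dimensional cube and the value of $\beta$ below are arbitrary choices, and $d(x, x')$ is the Hamming distance.

```python
import itertools
import numpy as np
from scipy.linalg import expm

d, beta = 3, 0.4
verts = list(itertools.product([0, 1], repeat=d))
A = np.array([[1.0 if sum(a != b for a, b in zip(u, v)) == 1 else 0.0
               for v in verts] for u in verts])          # edges between Hamming neighbors
L = A - np.diag(A.sum(axis=1))
K = expm(beta * L)

ham = np.array([[sum(a != b for a, b in zip(u, v)) for v in verts] for u in verts])
print(np.allclose(K / K[0, 0], np.tanh(beta) ** ham))    # K(x,x') proportional to tanh(beta)^d(x,x')
```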

k-regular trees

$$K(x, x') = K\bigl(d(x, x')\bigr) = \frac{2}{\pi(k-1)} \int_0^\pi e^{-\beta\left(1 - \frac{2\sqrt{k-1}}{k}\cos x\right)} \;\frac{\sin x\, \bigl[(k-1)\sin\bigl((d+1)x\bigr) - \sin\bigl((d-1)x\bigr)\bigr]}{k^2 - 4(k-1)\cos^2 x}\; dx$$

Combinatorial view

$f : V \to \mathbb{R}$, a function on the graph, or a vector $(f_1, f_2, \ldots, f_n)^\top$:
$$f^\top L f = -\sum_{i \sim j} (f_i - f_j)^2$$
i.e. minus the total weight of "edge violations".
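A quick numerical check of the identity, under the $L = A - D$ convention used here; the graph and the vector $f$ are arbitrary assumptions.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = A - np.diag(A.sum(axis=1))
f = np.array([0.3, -1.2, 0.7, 2.0])

edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if A[i, j]]
lhs = f @ L @ f
rhs = -sum((f[i] - f[j]) ** 2 for i, j in edges)
print(np.isclose(lhs, rhs))   # True
```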

Spectral view

Eigenvalues $0 = \lambda_0 \geq \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ and corresponding eigenvectors $v_0, v_1, v_2, \ldots, v_n$.

$v_0 = \text{const}$
$v_1$: "Fiedler vector", the smoothest function on the graph orthogonal to $v_0$

In general $v_i$ minimizes $\dfrac{v^\top L\, v}{v^\top v}$ for $v \perp \mathrm{span}(v_1, v_2, \ldots, v_{i-1})$.

$$L = \sum_i v_i\, \lambda_i\, v_i^\top \qquad\qquad K = e^{\beta L} = \sum_i v_i\, e^{\beta\lambda_i}\, v_i^\top$$

Regularization operator

Recall $K = P^{-2}$. Now simply
$$\langle f, f\rangle = (Pf)^\top (Pf) = f^\top P^2 f \qquad\text{and}\qquad P = K^{-1/2}.$$

$$K = e^{\beta L} = \sum_i v_i\, e^{\beta\lambda_i}\, v_i^\top \qquad\qquad P = e^{-\beta L/2} = \sum_i v_i\, e^{-\beta\lambda_i/2}\, v_i^\top$$

Generalization [Smola & Kondor COLT 2003]

$$K = \sum_i v_i\, r(\lambda_i)\, v_i^\top$$

• Diffusion kernel: $r(\lambda) = \exp(\beta\lambda)$
• p-step random walk kernel: $r(\lambda) = (a + \lambda)^p$
• Regularized Laplacian kernel: $r(\lambda) = (1 - \sigma^2\lambda)^{-1}$
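A sketch of the general spectral construction. The p-step and regularized-Laplacian forms follow the negative-Laplacian convention of this deck and should be treated as assumptions rather than exact expressions from the talk; the diffusion case is as stated above.

```python
import numpy as np

def spectral_kernel(L, r):
    """K = sum_i r(lambda_i) v_i v_i^T for a symmetric L."""
    lam, V = np.linalg.eigh(L)
    return (V * r(lam)) @ V.T

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
L = A - np.diag(A.sum(axis=1))                                       # eigenvalues <= 0

K_diff = spectral_kernel(L, lambda lam: np.exp(0.5 * lam))           # diffusion, beta = 0.5
K_reg  = spectral_kernel(L, lambda lam: 1.0 / (1.0 - 0.5 * lam))     # regularized Laplacian (assumed form)
K_walk = spectral_kernel(L, lambda lam: (3.0 + lam) ** 2)            # 2-step walk, a = 3 (assumed form)
```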

Applications

• Natural graphs: internet, web, social contacts, citations, scientific collaborations, etc.
• Objects with graph-like structure: strings, etc.
• Objects with unknown global structure: set of organic molecules
• Bioinformatics: network of molecular pathways in cells [Vert & Kanehisa NIPS 2002]
• Incorporating unlabelled data

2. Groups


Finite Groups

A finite set $G$ with an operation $G \times G \to G$:

• $x_1 x_2 \in G$ for any $x_1, x_2 \in G$  (closure)
• $x_1(x_2 x_3) = (x_1 x_2) x_3$  (associativity)
• $xe = ex = x$ for any $x \in G$  (identity)
• $x^{-1}x = xx^{-1} = e$  (inverses)

Symmetric groups $S_n$

$x_1 = (12)(3)(4)(5)$   $x_1(\mathrm{ABCDE}) = \mathrm{BACDE}$
$x_2 = (1)(324)(5)$     $x_2(\mathrm{ABCDE}) = \mathrm{ACDBE}$
$x_3 = x_1 x_2$         $x_3(\mathrm{ABCDE}) = x_1\bigl(x_2(\mathrm{ABCDE})\bigr)$

Rankings, orderings, allocation, etc.; a natural sense of distance.

Stationary kernels

On a group: $K(x_1, x_2) = f(x_2 x_1^{-1})$, with $f$ a positive definite function on $G$.

Compare on $\mathbb{R}^d$: $K(x_1, x_2) = f(x_2 - x_1)$, e.g. $K \sim e^{-(x_1 - x_2)^2/2\sigma^2}$.

Bochner's Theorem

$f$ is positive definite and symmetrical on $\mathbb{R}^d$ iff $\hat{f}(\omega) > 0$, where
$$\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}^{\,d}} \int e^{-i\,\omega\cdot x}\, f(x)\, dx.$$

Is there an analog for finite groups?

Representation theory

$\rho : G \to \mathbb{C}^{d\times d}$ with $\rho(x_1 x_2) = \rho(x_1)\,\rho(x_2)$

Equivalence: $\rho_1(x) = t^{-1} \rho_2(x)\, t$ for all $x \in G$

Reducibility:
$$t^{-1} \rho(x)\, t = \begin{pmatrix} \rho_1(x) & 0 \\ 0 & \rho_2(x) \end{pmatrix} \qquad \text{for all } x \in G$$

Irreducible representations of $S_5$

$\rho_{\mathrm{trivial}}(x) \equiv (1)$  (written $\rho_{(5)}$)
$\rho_{\mathrm{sign}}(x) \equiv (\mathrm{sgn}(x))$  (written $\rho_{(1,1,1,1,1)}$)
$\rho_{\mathrm{def.}}(x) \in \mathbb{C}^{5\times 5}$, defined by
$$\begin{pmatrix} e_{x(1)} \\ e_{x(2)} \\ e_{x(3)} \\ e_{x(4)} \\ e_{x(5)} \end{pmatrix} = \rho_{\mathrm{def.}}(x) \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}$$

Irreducible representations of $S_5$

$$t^{-1} \rho_{\mathrm{def.}}(x)\, t = \begin{pmatrix} \rho_{\mathrm{tr.}}(x) & \\ & \rho_{(4,1)}(x) \end{pmatrix}$$

$$\rho_{\mathrm{reg.}} = \rho_{(5)} \oplus 4\rho_{(4,1)} \oplus 5\rho_{(3,2)} \oplus 6\rho_{(3,1,1)} \oplus 5\rho_{(2,2,1)} \oplus 4\rho_{(2,1,1,1)} \oplus \rho_{(1,1,1,1,1)}$$

Fourier transforms on Groups

$$\hat{f}(\rho) = \sum_{x \in G} \rho(x)\, f(x) \qquad\qquad \mathcal{F} : \mathbb{C}^G \to \bigoplus_{\rho \in \mathcal{R}} \mathbb{C}^{d_\rho \times d_\rho}$$

Inversion:
$$f(x) = \frac{1}{|G|} \sum_{\rho \in \mathcal{R}} d_\rho\, \mathrm{trace}\bigl[\hat{f}(\rho)\, \rho(x^{-1})\bigr]$$
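In the abelian special case $G = \mathbb{Z}_n$ every irreducible representation is one-dimensional, $\rho_k(x) = e^{2\pi i k x/n}$, and the transform above reduces to the ordinary DFT. A small illustrative check of the inversion formula (the test function is an arbitrary assumption):

```python
import numpy as np

n = 6
f = np.random.default_rng(1).normal(size=n)        # a function on Z_n

f_hat = np.array([sum(f[x] * np.exp(2j * np.pi * k * x / n) for x in range(n))
                  for k in range(n)])               # f_hat(rho_k) = sum_x rho_k(x) f(x)

f_rec = np.array([sum(f_hat[k] * np.exp(-2j * np.pi * k * x / n) for k in range(n))
                  for x in range(n)]).real / n      # (1/|G|) sum_k d_k tr[f_hat(k) rho_k(x^{-1})]
print(np.allclose(f, f_rec))                        # True
```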

Bochner on Groups

The function $f$ is positive definite on $G$ if and only if the matrices $\hat{f}(\rho)$ are all positive definite.

Conjugacy classes

Conjugacy classes: $x_1 \cong_{\mathrm{Conj.}} x_2$ iff $x_1 = t^{-1} x_2\, t$ for some $t$.

E.g. on $S_n$:
$(\cdot)(\cdot)(\cdot)(\cdot)\ldots$
$(\cdot\,\cdot)(\cdot)(\cdot)(\cdot)\ldots$
$(\cdot\,\cdot\,\cdot)(\cdot)(\cdot)\ldots$
$(\cdot\,\cdot)(\cdot\,\cdot)(\cdot)(\cdot)\ldots$
...

Corollary to Bochner

The function $f$ is positive definite on $G$ and constant on conjugacy classes if and only if
$$f(x) = \sum_{\rho \in \mathcal{R}} c_\rho\, \chi_\rho(x), \qquad c_\rho > 0.$$

Characters: $\chi_\rho(x) = \mathrm{trace}[\rho(x)]$

Diffusion on Cayley graph of S4

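A sketch of how a picture like the one named in the title above can be computed: enumerate $S_4$, build a Cayley graph for a set of generators (adjacent transpositions are assumed here; the original figure may use a different generating set), and exponentiate its Laplacian.

```python
import itertools
import numpy as np
from scipy.linalg import expm

perms = list(itertools.permutations(range(4)))       # the 24 elements of S_4
index = {p: i for i, p in enumerate(perms)}
gens = [(1, 0, 2, 3), (0, 2, 1, 3), (0, 1, 3, 2)]    # adjacent transpositions (12), (23), (34)

A = np.zeros((24, 24))
for p in perms:
    for g in gens:
        q = tuple(p[g[i]] for i in range(4))          # right-multiply p by the generator
        A[index[p], index[q]] = 1.0                   # generators are involutions, so A is symmetric

L = A - np.diag(A.sum(axis=1))
K = expm(0.5 * L)                                     # diffusion kernel on the Cayley graph, beta = 0.5
print(np.allclose(K, K.T), K.shape)                   # symmetric 24 x 24 kernel
```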

3. Manifolds


Riemannian Manifolds

A surface $\mathcal{M}$ with a sense of distance that locally looks like $\mathbb{R}^d$. In a neighborhood of $x$, points can be represented as vectors $x + \delta x$, with
$$d(x, x + \delta x)^2 = (\delta x)^\top G(x)\, \delta x + O(\|\delta x\|^4).$$

Metric tensor: $G(x) \in \mathbb{R}^{d\times d}$, positive definite for all $x \in \mathcal{M}$.

For a path $p : [0, 1] \to \mathbb{R}^d$,
$$\ell(p) = \int_0^1 \Biggl[\, \sum_{i=1}^d \sum_{j=1}^d \dot{p}_i(\gamma)\, G\bigl(p(\gamma)\bigr)_{ij}\, \dot{p}_j(\gamma) \Biggr]^{1/2} d\gamma$$

Laplacian on a Riemannian Manifold

Flat space: $\Delta = \partial_1^2 + \partial_2^2 + \ldots + \partial_d^2$

Manifold:
$$\Delta = \frac{1}{\sqrt{\det G}} \sum_{ij} \partial_i \Bigl(\sqrt{\det G}\; (G^{-1})_{ij}\, \partial_j\Bigr)$$

This gives rise to an operator $\Delta : L_2(\mathcal{M}) \to L_2(\mathcal{M})$ as before.

Diffusion Kernel on $\mathcal{M}$

Solution of $\partial_t K_t = \Delta K_t$ with $K_0 = I$:

1. $K_t(x, x') = K_t(x', x)$
2. $\lim_{t\to 0} K_t(x, x') = \delta_x(x')$
3. $\bigl(\Delta - \frac{\partial}{\partial t}\bigr) K = 0$
4. $K_t(x, x') = \int_{\mathcal{M}} K_{t-s}(x, x'')\, K_s(x'', x')\, dx''$
5. $K_t(x, x') = \sum_{i=0}^{\infty} e^{-\lambda_i t}\, \phi_i(x)\, \phi_i(x')$

Manifold Structures in Data

• Even when data at first seems very high dimensional, it is often constrained to a low dimensional manifold, i.e. it only has a few internal degrees of freedom
• Constraining the kernel to the manifold is expected to help
• The graph Laplacian of a graph sampled from $\mathcal{M}$ approximates the Laplacian of $\mathcal{M}$ [Belkin & Niyogi NIPS 2001] (sketched below)
• A natural use for unlabeled data points is to help construct the kernel [Belkin & Niyogi NIPS 2002]
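A sketch of that idea on synthetic data: sample points from a one-dimensional manifold (a circle in $\mathbb{R}^2$), form an $\epsilon$-neighborhood graph, and use its Laplacian in place of the manifold Laplacian. The sampling, $\epsilon$ and $\beta$ are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 2 * np.pi, 200))
X = np.c_[np.cos(t), np.sin(t)]                      # points sampled from the unit circle

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = ((D < 0.2) & (D > 0)).astype(float)              # epsilon-neighborhood graph
L = A - np.diag(A.sum(axis=1))                       # graph Laplacian approximating the manifold Laplacian

K = expm(0.1 * L)                                    # diffusion kernel built from unlabeled data
```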

The Statistical Manifold [Lafferty & Lebanon NIPS 2002]

For a family $\{\, p(x|\theta) : \theta \in \mathbb{R}^d \,\}$ of statistical models, the Fisher metric is
$$G_{ij}(\theta) = E\bigl(\partial_i \ell_\theta\, \partial_j \ell_\theta\bigr) = \int \bigl(\partial_i \log p(x|\theta)\bigr)\bigl(\partial_j \log p(x|\theta)\bigr)\, p(x|\theta)\, dx$$
or equivalently
$$G_{ij} = 4 \int \Bigl(\partial_i \sqrt{p(x|\theta)}\Bigr)\Bigl(\partial_j \sqrt{p(x|\theta)}\Bigr)\, dx.$$
Locally approximated by the Kullback-Leibler divergence.

The Multinomial

$$p(x|\theta) = \frac{(n+1)!}{x_1!\, x_2! \cdots x_{n+1}!}\; \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_{n+1}^{x_{n+1}}, \qquad \sum_{i=1}^{n+1} \theta_i = 1, \qquad \theta \in \mathbb{P}_d$$

$$G_{ij}(\theta) = \sum_{k=1}^{n+1} \frac{1}{\theta_k}\, (\partial_i \theta_k)(\partial_j \theta_k)$$

Consider the map $T : \mathbb{P}_d \to \mathbb{S}_d$ via $T : \theta \mapsto (\sqrt{\theta_1}, \sqrt{\theta_2}, \ldots, \sqrt{\theta_{n+1}})$. On $\mathbb{S}_d$ the metric becomes the natural metric of the sphere, hence
$$d(\theta, \theta') = 2 \arccos\Biggl(\sum_{i=1}^{n+1} \sqrt{\theta_i\, \theta_i'}\Biggr)$$

Diffusion on the Sphere

$$K_t(x, x') = (4\pi t)^{-n/2} \exp\left(-\frac{d^2(x, x')}{4t}\right) \sum_{i=0}^{N} \psi_i(x, x')\, t^i + O(t^N)$$

$$K_t(\theta, \theta') \approx (4\pi t)^{-n/2} \exp\left(-\frac{1}{t} \arccos^2\Biggl(\sum_{i=1}^{n+1} \sqrt{\theta_i\, \theta_i'}\Biggr)\right)$$

Proposed and applied to text data in [Lafferty & Lebanon NIPS 2002].
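Putting the last two slides together, a sketch of the resulting kernel between two multinomial distributions, keeping only the leading Gaussian term of the expansion above; the example distributions and the value of $t$ are arbitrary assumptions.

```python
import numpy as np

def multinomial_geodesic(theta1, theta2):
    """d(theta, theta') = 2 arccos(sum_i sqrt(theta_i * theta'_i))."""
    return 2.0 * np.arccos(np.clip(np.sum(np.sqrt(theta1 * theta2)), -1.0, 1.0))

def info_diffusion_kernel(theta1, theta2, t=0.1):
    n = len(theta1) - 1                               # dimension of the simplex
    d = multinomial_geodesic(theta1, theta2)
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-d**2 / (4 * t))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(info_diffusion_kernel(p, q))
```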

Conclusions


“The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry.” (Nelson 1968)


Conclusions

• Kernel methods: algorithm + kernel
• The kernel alone encapsulates all knowledge about $\mathcal{X}$
• The Laplacian is a unifying concept in mathematics
• The connection with diffusion is intuitively appealing, but the real justification for exponential kernels lies deep in operator-land
• Exponentially induced kernels lift knowledge of local structure to the global level
• Opens new links to abstract algebra and information geometry

References

Belkin & Niyogi: Laplacian Eigenmaps for Dimensionality Reduction (NIPS 2001)
Kondor & Lafferty: Diffusion Kernels on Graphs and Other Discrete Structures (ICML 2002)
Lafferty & Lebanon: Information Diffusion Kernels (NIPS 2002)
Belkin & Niyogi: Using Manifold Structure for Partially Labelled Classification (NIPS 2002)
Vert & Kanehisa: Graph-driven Feature Extraction from Microarray Data Using Diffusion Kernels and Kernel CCA (NIPS 2002)
Smola & Kondor: Kernels and Regularization on Graphs (COLT 2003)
