Nuno Vasconcelos, ECE Department, UCSD

Classification
a classification problem has two types of variables:
• X - vector of observations (features) in the world
• Y - state (class) of the world

Perceptron
the classifier implements the linear decision rule
  h(x) = sgn[g(x)]   with   g(x) = wᵀx + b
appropriate when the classes are linearly separable
to deal with non-linear separability we introduce a kernel
[figure: separating hyperplane g(x) = 0 with normal w, offset b/||w||, and distance g(x)/||w|| from a point x to the plane]

Kernel summary
1. D not linearly separable in X: apply a feature transformation Φ: X → Z such that dim(Z) >> dim(X)
2. computing Φ(x) too expensive:
   • write your learning algorithm in dot-product form
   • instead of Φ(xi), we only need Φ(xi)ᵀΦ(xj) ∀i,j
3. instead of computing Φ(xi)ᵀΦ(xj) ∀i,j, define the "dot-product kernel"
     K(x,z) = Φ(x)ᵀΦ(z)
   and compute K(xi,xj) ∀i,j directly
   • note: the matrix K = [ ⋯ K(xi,xj) ⋯ ] is called the "kernel" or Gram matrix
4. forget about Φ(x) and use K(x,z) from the start!
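The dot-product form of step 2 can be made concrete with the perceptron itself. The sketch below is a minimal kernel perceptron (a standard variant, not necessarily the exact update rule of earlier lectures; the bias is absorbed into the kernel):

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=200):
    """Perceptron in dot-product form: w = sum_i alpha_i y_i Phi(x_i)
    is never formed; only kernel evaluations k(x_i, x_j) are used."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(epochs):
        for i in range(n):
            g = np.sum(alpha * y * K[:, i])  # g(x_i) without computing Phi
            if y[i] * g <= 0:                # mistake: reinforce example i
                alpha[i] += 1
    return alpha

# XOR is not linearly separable in X but is separable under the
# degree-2 polynomial kernel K(x,z) = (1 + x.z)^2
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., -1., -1.])
k = lambda a, b: (1 + a @ b) ** 2
alpha = kernel_perceptron(X, y, k)
```

Since only K(xi,xj) appears in the update, swapping the kernel swaps the feature space at no extra cost.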

Polynomial kernels
this makes a significant difference when K(x,z) is easier to compute than Φ(x)ᵀΦ(z)
e.g., we have seen that
  K(x,z) = (xᵀz)² = Φ(x)ᵀΦ(z)
with
  Φ: ℜ^d → ℜ^(d²)
  (x1, …, xd)ᵀ → (x1x1, x1x2, …, x1xd, …, xdx1, xdx2, …, xdxd)ᵀ
while K(x,z) has complexity O(d), Φ(x)ᵀΦ(z) is O(d²)
for K(x,z) = (xᵀz)^k we go from O(d) to O(d^k)
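A quick numerical check of the identity above (a sketch; `phi` is the explicit degree-2 feature map):

```python
import numpy as np

def phi(x):
    # explicit feature map: all d^2 products x_i x_j, flattened
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

k_direct = (x @ z) ** 2       # O(d) work
k_feature = phi(x) @ phi(z)   # O(d^2) work
assert np.isclose(k_direct, k_feature)
```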

Question
what is a good dot-product kernel?
• intuitively, a good kernel is one that maximizes the margin γ in range space
• however, nobody knows how to do this effectively
in practice:
• pick a kernel from a library of known kernels
• we talked about
  • the linear kernel K(x,z) = xᵀz
  • the Gaussian family
      K(x,z) = exp(−||x − z||²/σ²)
  • the polynomial family
      K(x,z) = (1 + xᵀz)^k,  k ∈ {1,2,…}

Question
“this problem of mine is really asking for the kernel k’(x,z) = ...”
• how do I know if this is a dot-product kernel?
let’s start by the definition
Definition: a mapping
  k: X × X → ℜ
  (x,y) → k(x,y)
is a dot-product kernel if and only if
  k(x,y) = ⟨Φ(x), Φ(y)⟩
where Φ: X → H, H is a vector space, and ⟨.,.⟩ a dot-product in H
[figure: Φ maps the data from the input space X to the feature space H]

Vector spaces
note that both H and ⟨.,.⟩ can be abstract, not necessarily ℜ^d
Definition: a vector space is a set H where addition and scalar multiplication are defined and satisfy:
1) x + (x’ + x’’) = (x + x’) + x’’
2) x + x’ = x’ + x ∈ H
3) 0 ∈ H, 0 + x = x
4) −x ∈ H, −x + x = 0
5) λx ∈ H
6) 1x = x
7) λ(λ’x) = (λλ’)x
8) λ(x + x’) = λx + λx’
9) (λ + λ’)x = λx + λ’x
the canonical example is ℜ^d with standard vector addition and scalar multiplication
another example is the space of mappings X → ℜ with
  (f + g)(x) = f(x) + g(x)
  (λf)(x) = λf(x)

Bilinear forms
to define a dot-product we first need to recall the notion of a bilinear form
Definition: a bilinear form on a vector space H is a mapping
  Q: H × H → ℜ
  (x,x’) → Q(x,x’)
such that ∀ x, x’, x’’ ∈ H
  i)  Q[(λx + λ’x’), x’’] = λQ(x,x’’) + λ’Q(x’,x’’)
  ii) Q[x’’, (λx + λ’x’)] = λQ(x’’,x) + λ’Q(x’’,x’)
in ℜ^d the canonical bilinear form is Q(x,x’) = xᵀAx’
if Q(x,x’) = Q(x’,x) ∀ x,x’ ∈ H, the form is symmetric

Dot products
Definition: a dot-product on a vector space H is a symmetric bilinear form
  ⟨.,.⟩: H × H → ℜ
  (x,x’) → ⟨x,x’⟩
such that
  i)  ⟨x,x⟩ ≥ 0, ∀ x ∈ H
  ii) ⟨x,x⟩ = 0 if and only if x = 0
note that for the canonical bilinear form in ℜ^d
  ⟨x,x’⟩ = xᵀAx’
this means that A must be positive definite:
  xᵀAx > 0, ∀ x ≠ 0

Positive definite matrices
recall that (e.g. Linear Algebra and Its Applications, Strang)
Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (semi-)positive definite:
  i)   xᵀAx ≥ 0, ∀ x ≠ 0
  ii)  all eigenvalues of A satisfy λi ≥ 0
  iii) all upper-left submatrices Ak have non-negative determinant
  iv)  there is a matrix R with independent rows such that A = RᵀR
upper-left submatrices:
  A1 = a11,   A2 = [ a11 a12 ; a21 a22 ],   A3 = [ a11 a12 a13 ; a21 a22 a23 ; a31 a32 a33 ],  …
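These equivalences are easy to check numerically (a sketch; A is made positive semidefinite by construction iv, and the other conditions are then verified up to round-off):

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(4, 4))
A = R.T @ R  # by condition iv), A is (semi-)positive definite

# ii) all eigenvalues are non-negative (up to round-off)
assert np.all(np.linalg.eigvalsh(A) >= -1e-10)

# iii) every upper-left submatrix A_k has non-negative determinant
for k in range(1, 5):
    assert np.linalg.det(A[:k, :k]) >= -1e-10

# i) x^T A x >= 0 for random x
for _ in range(100):
    x = rng.normal(size=4)
    assert x @ A @ x >= -1e-10
```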

Positive definite matrices
property iv) is particularly interesting
• in ℜ^d, ⟨x,y⟩ = xᵀAy is a dot-product kernel if and only if A is positive definite
• from iv) this holds if and only if there is R such that A = RᵀR
• hence
    ⟨x,y⟩ = xᵀAy = (Rx)ᵀ(Ry) = Φ(x)ᵀΦ(y)
  with
    Φ: ℜ^d → ℜ^d
    x → Rx
i.e. the dot-product kernel k(x,z) = xᵀAz (A positive definite) is the standard dot-product in the range space of the mapping Φ(x) = Rx
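For a strictly positive definite A, one such R is the transposed Cholesky factor; a minimal check (a sketch, with A built to be strictly positive definite):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))
A = M.T @ M + 3 * np.eye(3)   # strictly positive definite by construction

L = np.linalg.cholesky(A)     # A = L L^T, so take R = L^T
R = L.T

x, y = rng.normal(size=3), rng.normal(size=3)
# the kernel x^T A y is an ordinary dot product after mapping x -> Rx
assert np.isclose(x @ A @ y, (R @ x) @ (R @ y))
```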

Note
there are positive semidefinite matrices (xᵀAx ≥ 0) and positive definite matrices (xᵀAx > 0)
we will work with semidefinite matrices but, to simplify, will call them definite
if we really need xᵀAx > 0 we will say “strictly positive definite”

Positive definite kernels
how do we define a positive definite function?
Definition: a function k(x,y) is a positive definite kernel on X × X if ∀ l and ∀ {x1, ..., xl}, xi ∈ X, the Gram matrix
  K = [ ⋯ k(xi,xj) ⋯ ]
is positive definite.
Note: this implies that
• k(x,x) ≥ 0, ∀ x ∈ X
• the 2×2 matrix
    [ k(x,x)  k(x,y) ]
    [ k(y,x)  k(y,y) ]
  is PD ∀ x,y ∈ X   (*)
• etc...
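In code, the definition suggests a sampling check (a sketch with the Gaussian kernel; a finite sample can only support, never prove, positive definiteness):

```python
import numpy as np

def gauss(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))                      # an arbitrary sample from X
K = np.array([[gauss(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)                       # Gram matrix is symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # and positive semidefinite
```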

Positive definite kernels
this proves some simple properties
• a PD kernel is symmetric:
    k(x,y) = k(y,x), ∀ x,y ∈ X
  Proof: since PD means symmetric, (*) implies k(x,y) = k(y,x) ∀ x,y ∈ X
• Cauchy-Schwarz inequality for kernels: if k(x,y) is a PD kernel, then
    k(x,y)² ≤ k(x,x)k(y,y), ∀ x,y ∈ X
  Proof: from (*), and property iii) of PD matrices, the determinant of the 2×2 matrix of (*) is non-negative. This means that
    k(x,x)k(y,y) − k(x,y)² ≥ 0

Positive definite kernels
it is not hard to show that all dot product kernels are PD
Lemma 1: Let k(x,y) be a dot-product kernel. Then k(x,y) is positive definite.
proof:
• k(x,y) dot product kernel implies that
  ∃ Φ and some dot product ⟨.,.⟩ such that k(x,y) = ⟨Φ(x),Φ(y)⟩
• this implies that if:
  • we pick any l, and any sequence {x1, ..., xl},
  • and let K = [ ⋯ k(xi,xj) ⋯ ] be the associated Gram matrix
• then, for ∀ c ≠ 0

Positive definite kernels
  cᵀKc = ∑ij ci cj k(xi,xj)
       = ∑ij ci cj ⟨Φ(xi), Φ(xj)⟩      (k is a dot product kernel)
       = ⟨∑i ci Φ(xi), ∑j cj Φ(xj)⟩    (⟨.,.⟩ is a bilinear form)
       = || ∑i ci Φ(xi) ||²
       ≥ 0                             (from the definition of dot product)
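This computation can be replayed numerically for the degree-2 polynomial kernel, whose feature map Φ is known explicitly (a sketch):

```python
import numpy as np

def phi(x):                     # explicit map for k(x,z) = (x^T z)^2
    return np.outer(x, x).ravel()

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))
K = (X @ X.T) ** 2              # Gram matrix of the degree-2 polynomial kernel
c = rng.normal(size=5)

quad = c @ K @ c                                   # c^T K c
v = sum(ci * phi(xi) for ci, xi in zip(c, X))      # sum_i c_i Phi(x_i)
assert np.isclose(quad, v @ v)                     # = ||sum_i c_i Phi(x_i)||^2
assert quad >= 0
```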

Positive definite kernels
the converse is also true but more difficult to prove
Lemma 2: Let k(x,y), x,y ∈ X, be a positive definite kernel. Then k(x,y) is a dot product kernel.
proof:
• we need to show that there is a transformation Φ, a vector space H = Φ(X), and a dot product ⟨.,.⟩* in H such that
    k(x,y) = ⟨Φ(x),Φ(y)⟩*
• we proceed in three steps:
  1. construct a vector space H
  2. define the dot-product ⟨.,.⟩* on H
  3. show that k(x,y) = ⟨Φ(x),Φ(y)⟩* holds
[figure: Φ maps the data from the input space X to the feature space H]

The vector space H
we define H as the space spanned by linear combinations of k(.,xi)
  H = { f(.) | f(.) = ∑i=1..m αi k(.,xi),  ∀m, ∀xi ∈ X }
notation: by k(.,xi) we mean a function g(y) = k(y,xi) of y; xi is fixed.
homework: check that H is a vector space
• e.g. 2)
    f(.) = ∑i=1..m αi k(.,xi)
    f’(.) = ∑j=1..m’ βj k(.,x’j)
  ⟹ f(.) + f’(.) = f’(.) + f(.) ∈ H
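An element of H can be represented in code by its coefficients and centers (a sketch; the class name `FuncInH` and the choice of Gaussian kernel are illustrative):

```python
import numpy as np

def gauss(y, xi, sigma=1.0):
    return np.exp(-np.sum((np.asarray(y) - np.asarray(xi)) ** 2) / sigma ** 2)

class FuncInH:
    """f(.) = sum_i alpha_i k(., x_i), stored as (coefficients, centers)."""
    def __init__(self, alphas, centers):
        self.alphas, self.centers = list(alphas), list(centers)

    def __call__(self, y):
        return sum(a * gauss(y, xi) for a, xi in zip(self.alphas, self.centers))

    def __add__(self, other):
        # closure under addition: just concatenate the two expansions
        return FuncInH(self.alphas + other.alphas, self.centers + other.centers)

f = FuncInH([1.0, -0.5], [np.zeros(2), np.ones(2)])
g = FuncInH([2.0], [np.array([0.5, 0.5])])
y = np.array([0.3, -0.2])
assert np.isclose((f + g)(y), f(y) + g(y))   # (f+g)(y) = f(y) + g(y)
```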

Example
when we use the Gaussian kernel
  K(.,xi) = exp(−||. − xi||²/σ²)
k(.,xi) is a Gaussian centered on xi with covariance σI, and
  H = { f(.) | f(.) = ∑i=1..m αi exp(−||. − xi||²/σ²),  ∀m, ∀xi }
is the space of all linear combinations of Gaussians
note that these are not mixtures, but close

The operator ⟨.,.⟩*
if f(.) and g(.) ∈ H, with
  f(.) = ∑i=1..m αi k(.,xi)
  g(.) = ∑j=1..m’ βj k(.,x’j)      (**)
we define the operator ⟨.,.⟩* as
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)   (***)

Example
when we use the Gaussian kernel
  K(.,xi) = exp(−||. − xi||²/σ²)
the operator ⟨.,.⟩* is a weighted sum of Gaussian terms
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj exp(−||xi − x’j||²/σ²)
you can look at this as either:
• a dot product in H (still need to prove this)
• a non-linear measure of similarity in X, somewhat related to likelihoods
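A direct implementation of (***) for the Gaussian case (a sketch; the symmetry check previews the proof given later):

```python
import numpy as np

def gauss(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def star(alphas, cf, betas, cg):
    # <f,g>_* = sum_ij alpha_i beta_j k(x_i, x'_j), per (***)
    return sum(a * b * gauss(xi, xj)
               for a, xi in zip(alphas, cf)
               for b, xj in zip(betas, cg))

rng = np.random.default_rng(5)
cf, cg = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
al, be = rng.normal(size=3), rng.normal(size=4)

assert np.isclose(star(al, cf, be, cg), star(be, cg, al, cf))  # <f,g>_* = <g,f>_*
```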

The operator ⟨.,.⟩*
important note: for f(.) and g(.) ∈ H, the operator
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)
has the property
  ⟨k(.,xi), k(.,x’j)⟩* = k(xi,x’j)    (****)
proof: just make
  αi = 1, αk = 0 ∀k ≠ i
  βj = 1, βk = 0 ∀k ≠ j

The operator ⟨.,.⟩*
assume that ⟨.,.⟩* is a dot product in H (proof in a moment)
since
  ⟨k(.,xi), k(.,xj)⟩* = k(xi,xj)
then, clearly
  k(xi,xj) = ⟨Φ(xi), Φ(xj)⟩*
with
  Φ: X → H
  x → k(.,x)
i.e. the kernel is a dot-product on H, which results from the feature transformation Φ
this proves Lemma 2

Example
when we use the Gaussian kernel
  K(x,xi) = exp(−||x − xi||²/σ²)
• the point xi ∈ ℜ^d is mapped into the Gaussian G(x, xi, σI)
• H is the space of all functions that are linear combinations of Gaussians
• this has infinite dimension
• the kernel is a dot product in H, and a non-linear similarity on X
[figure: Φ maps each point of X ⊂ ℜ^d to a Gaussian bump in H]

In summary
to show that k(x,y), x,y ∈ X, positive definite ⇒ k(x,y) is a dot product kernel, we need to show that
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)
is a dot product on
  H = { f(.) | f(.) = ∑i=1..m αi k(.,xi),  ∀m, ∀xi ∈ X }
this reduces to verifying the dot product conditions

The operator ⟨.,.⟩*
1) is ⟨.,.⟩* a bilinear form on H?
by definition of f(.) and g(.) in (**)
  ⟨f,g⟩* = ⟨∑i=1..m αi k(.,xi), ∑j=1..m’ βj k(.,x’j)⟩*
on the other hand,
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)             (from (***))
         = ∑i=1..m ∑j=1..m’ αi βj ⟨k(.,xi), k(.,x’j)⟩*   (from (****))
equating the two expansions of ⟨f,g⟩* gives exactly the definition of bilinearity

The operator ⟨.,.⟩*
2) is ⟨.,.⟩* symmetric?
note that
  ⟨g,f⟩* = ∑i=1..m ∑j=1..m’ αi βj k(x’j,xi) = ⟨f,g⟩*
if and only if k(xi,x’j) = k(x’j,xi) for all xi, x’j
but this follows from the positive definiteness of k(x,y): we have seen that a PD kernel is always symmetric
hence, ⟨.,.⟩* is symmetric

The operator ⟨.,.⟩*
3) is ⟨f,f⟩* ≥ 0, ∀ f ∈ H?
by definition of f(.) in (**)
  ⟨f,f⟩* = ∑i=1..m ∑j=1..m αi αj k(xi,xj) = αᵀKα
where α ∈ ℜ^m and K is the Gram matrix
since k(x,y) is positive definite, K is positive definite by definition and ⟨f,f⟩* ≥ 0   (x)
the only non-trivial part of the proof is to show that ⟨f,f⟩* = 0 ⇒ f = 0
we need two more results

The operator ⟨.,.⟩*
Lemma 3: ⟨.,.⟩* is itself a positive definite kernel on H × H
proof:
• consider any sequence {f1, ..., fm}, fi ∈ H
• then
    ∑ij γi γj ⟨fi,fj⟩* = ⟨∑i γi fi, ∑j γj fj⟩*   (by bilinearity of ⟨.,.⟩*)
                       = ⟨g,g⟩*                  (for g = ∑i γi fi ∈ H)
                       ≥ 0                       (by (x))
• hence the Gram matrix is always PD and the kernel ⟨.,.⟩* is PD

The operator ⟨.,.⟩*
Lemma 4: ∀ f ∈ H, ⟨k(.,x), f⟩* = f(x)
proof:
  ⟨k(.,x), f(.)⟩* = ⟨k(.,x), ∑i αi k(.,xi)⟩*   (by (**))
                  = ∑i αi ⟨k(.,x), k(.,xi)⟩*   (by bilinearity of ⟨.,.⟩*)
                  = ∑i αi k(x,xi)              (by (****))
                  = f(x)

The operator ⟨.,.⟩*
4) we are now ready to prove that ⟨f,f⟩* = 0 ⇒ f = 0
proof:
• since ⟨.,.⟩* is a PD kernel (Lemma 3) we can apply Cauchy-Schwarz
    k(x,y)² ≤ k(x,x)k(y,y), ∀ x,y ∈ X
• using k(.,x) as x and f(.) as y this becomes
    ⟨k(.,x), k(.,x)⟩* ⟨f,f⟩* ≥ (⟨k(.,x), f⟩*)²
• and using Lemma 4
    k(x,x) ⟨f,f⟩* ≥ f(x)²
• from which ⟨f,f⟩* = 0 ⇒ f(x) = 0 ∀x, i.e. f = 0
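The inequality in the last step is easy to sanity-check for the Gaussian kernel (a sketch; finitely many random x, so evidence rather than proof):

```python
import numpy as np

def k(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

rng = np.random.default_rng(6)
centers = rng.normal(size=(4, 2))
alphas = rng.normal(size=4)

def f(x):  # f(.) = sum_i alpha_i k(., x_i)
    return sum(a * k(x, xi) for a, xi in zip(alphas, centers))

ff = sum(a * b * k(xi, xj)               # <f,f>_*
         for a, xi in zip(alphas, centers)
         for b, xj in zip(alphas, centers))

for _ in range(20):
    x = rng.normal(size=2)
    assert k(x, x) * ff >= f(x) ** 2 - 1e-9   # k(x,x) <f,f>_* >= f(x)^2
```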

In summary
we have shown that
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)
is a dot product on
  H = { f(.) | f(.) = ∑i=1..m αi k(.,xi),  ∀m, ∀xi ∈ X }
and this shows that if k(x,y), x,y ∈ X, is a positive definite kernel, then k(x,y) is a dot product kernel.
since we had initially proven the converse, we have the following theorem.

Dot product kernels
Theorem: k(x,y), x,y ∈ X, is a dot-product kernel if and only if it is a positive definite kernel
this is interesting because it allows us to check whether a kernel is a dot product or not!
• check if the Gram matrix is positive definite for all possible sequences {x1, ..., xl}, xi ∈ X
but the proof is much more interesting than this result alone
it actually gives us insight on what the kernel is doing
let’s summarize
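Checking all possible sequences is impossible in code, but sampling Gram matrices gives a useful necessary-condition test (a sketch; one negative eigenvalue is a definitive disproof, while passing is only evidence):

```python
import numpy as np

def min_gram_eig(k, X):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(7)
X = rng.normal(size=(8, 3))

poly = lambda x, z: (1 + x @ z) ** 2       # a known dot-product kernel
dist = lambda x, z: np.sum((x - z) ** 2)   # squared distance: NOT a PD kernel

assert min_gram_eig(poly, X) >= -1e-10     # consistent with being PD
assert min_gram_eig(dist, X) < 0           # a negative eigenvalue disproves PD
```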

Dot product kernels
a dot product kernel k(x,y), x,y ∈ X:
• applies a feature transformation
    Φ: X → H
    x → k(.,x)
• to the vector space
    H = { f(.) | f(.) = ∑i=1..m αi k(.,xi),  ∀m, ∀xi ∈ X }
• where the kernel implements the dot product
    ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)

Dot product kernels
the dot product
  ⟨f,g⟩* = ∑i=1..m ∑j=1..m’ αi βj k(xi,x’j)
has the reproducing property
  ⟨k(.,x), f(.)⟩* = f(x)
you can think of this as an analog to the convolution with a Dirac delta
we will talk about this a lot in the coming lectures
finally, ⟨.,.⟩* is itself a positive definite kernel on H × H
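The reproducing property can be checked directly (a sketch with the Gaussian kernel; k(.,x) is represented as an expansion with a single unit coefficient at center x):

```python
import numpy as np

def k(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def star(al, cf, be, cg):
    # <f,g>_* per (***)
    return sum(a * b * k(xi, xj) for a, xi in zip(al, cf) for b, xj in zip(be, cg))

rng = np.random.default_rng(8)
centers = rng.normal(size=(5, 2))
alphas = rng.normal(size=5)
x = rng.normal(size=2)

f_at_x = sum(a * k(x, xi) for a, xi in zip(alphas, centers))
assert np.isclose(star([1.0], [x], alphas, centers), f_at_x)  # <k(.,x), f>_* = f(x)
```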

A good picture to remember
when we use the Gaussian kernel
  K(x,xi) = exp(−||x − xi||²/σ²)
• the point xi ∈ ℜ^d is mapped into the Gaussian G(x, xi, σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H
• the dot product with one of the Gaussians has the reproducing property
[figure: Φ maps each point of X ⊂ ℜ^d to a Gaussian bump in H]
