Kernel Methods

Sargur Srihari


Topics in Kernel Methods

1. Kernel Methods vs Linear Models/Neural Networks
2. Stored Sample Methods
3. Kernel Functions
4. Dual Representations
5. Constructing Kernels
6. Extension to Symbolic Inputs
7. Fisher Kernel


Kernel Methods vs Linear Models/Neural Networks

•  Linear parametric models for regression and classification have the form y(x, w)
•  During the learning phase we obtain either a maximum-likelihood estimate of w or a posterior distribution over w
•  The training data is then discarded
•  Prediction is based only on the vector w
•  The same is true of neural networks
•  Another class of methods instead uses the training samples, or a subset of them, at prediction time


Memory-Based Methods

•  Training data points are used in the prediction phase
•  Examples of such methods:
   •  Parzen probability density model: a linear combination of kernel functions centered on each training data point (see the sketch after this list)
   •  Nearest-neighbor classification
•  These are memory-based methods
   •  They require a metric to be defined
   •  Fast to train, slow to predict
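
A minimal sketch of the Parzen idea (the Gaussian kernel form, the bandwidth h, and the toy data are illustrative assumptions, not from the slides): the density at a query point is an average of kernels centered on every stored training sample.

```python
import numpy as np

def parzen_density(x, X_train, h=0.5):
    """Parzen estimate of p(x): average of Gaussian kernels centered on each stored training point."""
    d = X_train.shape[1]
    diffs = X_train - x                                   # differences to every stored sample
    norm = (2 * np.pi * h**2) ** (-d / 2)                 # per-kernel Gaussian normalizer
    return norm * np.mean(np.exp(-np.sum(diffs**2, axis=1) / (2 * h**2)))

X_train = np.random.randn(100, 2)                         # stored training samples (toy data)
print(parzen_density(np.array([0.0, 0.0]), X_train))
```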


Kernel Functions

•  Many linear parametric models can be re-cast into equivalent dual representations in which predictions are based on a kernel function evaluated at the training points
•  The kernel function is given by

        k(x, x') = φ(x)^T φ(x')

   where φ(x) is a fixed nonlinear feature-space mapping (basis function)
•  A kernel is a symmetric function of its arguments:

        k(x, x') = k(x', x)

•  The kernel function can be interpreted as the similarity of x and x'
•  The simplest choice is the identity mapping in feature space, φ(x) = x, in which case k(x, x') = x^T x'; this is called the linear kernel (a small sketch follows)
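
A small sketch of the definition (the default feature map argument and the example vectors are illustrative): with the identity mapping φ(x) = x the kernel reduces to the linear kernel x^T x'.

```python
import numpy as np

def kernel(x, x_prime, phi=lambda v: v):
    """k(x, x') = phi(x)^T phi(x'); the default phi is the identity mapping."""
    return phi(x) @ phi(x_prime)

x, x_prime = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(kernel(x, x_prime))                 # 1*3 + 2*(-1) = 1.0
print(kernel(x, x_prime) == x @ x_prime)  # True: identity feature map gives the linear kernel
```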


Kernel Trick (or Kernel Substitution)

•  Formulating an algorithm in terms of inner products allows well-known algorithms to be extended by using the kernel trick
•  Basic idea of the kernel trick:
   •  If an input vector x appears only in the form of scalar products, then we can replace those scalar products with some other choice of kernel
•  Used widely:
   •  in support vector machines
   •  in developing a non-linear variant of PCA
   •  in the kernel Fisher discriminant


Other Forms of Kernel Functions

•  Function of the difference between the arguments:

        k(x, x') = k(x − x')

   Called a stationary kernel, since it is invariant to translations in input space
•  Homogeneous kernels, also known as radial basis functions:

        k(x, x') = k(||x − x'||)

   These depend only on the magnitude of the distance between the arguments
•  For these to be valid kernel functions they must be shown to have the property k(x, x') = φ(x)^T φ(x')
•  Note that the kernel function is a scalar value, while x is an M-dimensional vector


Dual Representation

•  Linear models for regression and classification can be reformulated in terms of a dual representation
   •  in which the kernel function arises naturally
•  Plays an important role in SVMs
•  Consider a linear regression model whose parameters are determined by minimizing the regularized sum-of-squares error function

        J(w) = (1/2) ∑_{n=1}^{N} { w^T φ(x_n) − t_n }² + (λ/2) w^T w

   where w = (w_0, ..., w_{M−1})^T, φ = (φ_0, ..., φ_{M−1})^T is the set of M basis functions (the feature vector), we have N samples {x_1, ..., x_N}, and λ is the regularization coefficient
•  The minimum is obtained by setting the gradient of J(w) with respect to w equal to zero


Solution for w as a Linear Combination of φ(x_n)

•  By setting the derivative of J(w) with respect to w to zero and solving for w, we get

        w = −(1/λ) ∑_{n=1}^{N} { w^T φ(x_n) − t_n } φ(x_n)
          = ∑_{n=1}^{N} a_n φ(x_n)
          = Φ^T a

•  The solution for w is a linear combination of the vectors φ(x_n), whose coefficients are functions of w, where
   •  Φ is the design matrix whose nth row is given by φ(x_n)^T:

        Φ = [ φ_0(x_1)  ...  φ_{M−1}(x_1) ]
            [    ...            ...       ]
            [ φ_0(x_n)  ...  φ_{M−1}(x_n) ]    (an N × M matrix)
            [    ...            ...       ]
            [ φ_0(x_N)  ...  φ_{M−1}(x_N) ]

   •  the vector a = (a_1, ..., a_N)^T is defined by

        a_n = −(1/λ) { w^T φ(x_n) − t_n }


Transformation from w to a

•  Thus we have w = Φ^T a
•  Instead of working with the parameter vector w, we can reformulate the least-squares algorithm in terms of the parameter vector a
   •  giving rise to the dual representation
•  We will see that, although the definition of a still includes w,

        a_n = −(1/λ) { w^T φ(x_n) − t_n }

   it can be eliminated by the use of the kernel function


Gram Matrix and Kernel Function

•  Define the Gram matrix K = ΦΦ^T, an N × N matrix (N × M times M × N), with elements

        K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m)

   where we introduce the kernel function k(x, x') = φ(x)^T φ(x')

        K = [ k(x_1, x_1)  ...  k(x_1, x_N) ]
            [     ...              ...      ]
            [ k(x_N, x_1)  ...  k(x_N, x_N) ]

•  Gram matrix definition: given N vectors, it is the matrix of all inner products
•  Notes:
   •  Φ is N × M and K is N × N
   •  K is a matrix of similarities of pairs of samples (thus it is symmetric); a small sketch follows
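
A small sketch of building K = ΦΦ^T; the one-dimensional polynomial feature map and the sample values are illustrative assumptions.

```python
import numpy as np

X = np.array([0.0, 0.5, 1.0, 2.0])                  # N = 4 one-dimensional samples
Phi = np.stack([np.ones_like(X), X, X**2], axis=1)  # N x M design matrix, phi(x) = (1, x, x^2)
K = Phi @ Phi.T                                     # N x N Gram matrix, K_nm = phi(x_n)^T phi(x_m)

print(K.shape)               # (4, 4): N x M times M x N
print(np.allclose(K, K.T))   # True: K is a symmetric matrix of pairwise similarities
```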


Error Function in Terms of the Gram Matrix of the Kernel

•  The sum-of-squares error function is

        J(w) = (1/2) ∑_{n=1}^{N} { w^T φ(x_n) − t_n }² + (λ/2) w^T w

•  Substituting w = Φ^T a into J(w) gives

        J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a − a^T ΦΦ^T t + (1/2) t^T t + (λ/2) a^T ΦΦ^T a

   where t = (t_1, ..., t_N)^T
•  Written in terms of the Gram matrix, the sum-of-squares error function is

        J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a

•  Solving for a by combining w = Φ^T a and a_n = −(1/λ){ w^T φ(x_n) − t_n } gives

        a = (K + λ I_N)^{−1} t

•  The solution for a is expressed entirely in terms of the kernel k(x, x'); w is then a linear combination of the elements of φ(x), from which we can recover the original formulation in terms of the parameters w


Prediction Function

•  Prediction for a new input x
   •  We can write a = (K + λI_N)^{−1} t by combining w = Φ^T a and a_n = −(1/λ){ w^T φ(x_n) − t_n }
   •  Substituting back into the linear regression model,

        y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λI_N)^{−1} t

     where k(x) has elements k_n(x) = k(x_n, x)
•  The prediction is a linear combination of the target values from the training set (a minimal sketch follows)
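
A minimal sketch of the dual prediction y(x) = k(x)^T (K + λI_N)^{−1} t; the Gaussian kernel, the value of λ, and the toy sine data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def k_gauss(a, b, sigma=0.2):
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

X = np.linspace(0, 1, 20).reshape(-1, 1)                    # training inputs
t = np.sin(2 * np.pi * X).ravel()                           # training targets
lam = 0.1

K = np.array([[k_gauss(xn, xm) for xm in X] for xn in X])   # Gram matrix
dual_a = np.linalg.solve(K + lam * np.eye(len(X)), t)       # a = (K + lambda I_N)^{-1} t

def predict(x_new):
    k_vec = np.array([k_gauss(xn, x_new) for xn in X])      # k(x), with k_n(x) = k(x_n, x)
    return k_vec @ dual_a                                    # linear combination of training targets

print(predict(np.array([0.25])))    # approximates sin(pi/2) = 1, smoothed by the regularizer
```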

Advantage of Dual Representation

•  The solution for a is expressed entirely in terms of the kernel function k(x, x')
•  Once we have a, we can recover w as a linear combination of the elements of φ(x) using w = Φ^T a
•  In the parametric formulation, the solution is w_ML = (Φ^T Φ)^{−1} Φ^T t
•  Instead of inverting an M × M matrix we are inverting an N × N matrix, an apparent disadvantage
•  But the advantage of the dual formulation is that we can work with the kernel function k(x, x') and therefore
   •  avoid working with a feature vector φ(x), and
   •  avoid the problems associated with the very high or infinite dimensionality of φ(x)


Constructing Kernels

•  To exploit kernel substitution, we need valid kernel functions
•  First method:
   •  choose a feature-space mapping φ(x) and use it to find the corresponding kernel
   •  For a one-dimensional input space,

        k(x, x') = φ(x)^T φ(x') = ∑_{i=1}^{M} φ_i(x) φ_i(x')

     where the φ_i(x) are basis functions such as polynomials; for example, for each i we could choose φ_i(x) = x^i


Construction of Kernel Functions from Basis Functions

[Figure: one-dimensional input space. Basis functions φ_i(x) — polynomials, Gaussians, logistic sigmoids — together with the corresponding kernel functions k(x, x') = φ(x)^T φ(x'), plotted as a function of x; the red cross marks x'.]


Second Method: Direct Construction of Kernels

•  The function we choose has to correspond to a scalar product in some (perhaps infinite-dimensional) feature space
•  Consider the kernel function k(x, z) = (x^T z)²
•  In a two-dimensional input space,

        k(x, z) = (x^T z)² = (x_1 z_1 + x_2 z_2)²
                = x_1² z_1² + 2 x_1 z_1 x_2 z_2 + x_2² z_2²
                = (x_1², √2 x_1 x_2, x_2²)(z_1², √2 z_1 z_2, z_2²)^T
                = φ(x)^T φ(z)

•  The feature mapping takes the form φ(x) = (x_1², √2 x_1 x_2, x_2²)
   •  It comprises all second-order terms with a specific weighting
   •  The inner product in feature space requires computing the six feature values and 3 × 3 = 9 multiplications
   •  The kernel function k(x, z) needs only 2 multiplications and a squaring (a sketch of this equivalence follows)
•  By considering (x^T z + c)², we get constant, linear, and second-order terms
•  By considering (x^T z + c)^M, we get all monomials up to order M
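
A small sketch verifying that k(x, z) = (x^T z)² equals φ(x)^T φ(z) for the explicit feature map above; the two example vectors are arbitrary choices.

```python
import numpy as np

def phi(v):
    """Explicit second-order feature map for two-dimensional input."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)        # kernel evaluation: 2 multiplications and a squaring
print(phi(x) @ phi(z))     # same value via the explicit feature-space inner product
```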


Testing Whether a Function is a Valid Kernel

•  We want to test validity without having to construct the function φ(x) explicitly
•  A necessary and sufficient condition for a function k(x, x') to be a valid kernel is that the Gram matrix K, whose elements are given by k(x_n, x_m), is positive semi-definite for all possible choices of the set {x_n}
•  Positive semi-definite is not the same thing as a matrix whose elements are non-negative
•  It means

        z^T K z ≥ 0 for all vectors z with real entries, i.e., ∑_n ∑_m K_nm z_n z_m ≥ 0 for any real numbers z_n, z_m

•  Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional space
•  New kernels can be constructed from simpler kernels as building blocks (an illustrative check follows)
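
An illustrative, and only partial, check of the condition above: build the Gram matrix of a candidate kernel on one random set {x_n} and inspect its eigenvalues. Passing on one set is necessary but not sufficient for validity; the kernel and data here are arbitrary choices.

```python
import numpy as np

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))                              # one arbitrary set {x_n}
K = gram(lambda a, b: np.exp(-np.sum((a - b)**2)), X)         # candidate kernel: Gaussian

eigvals = np.linalg.eigvalsh(K)                               # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)                                # non-negative (up to round-off) => PSD on this set
```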


Techniques for Constructing Kernels

•  Given valid kernels k1(x, x') and k2(x, x'), the following new kernels will also be valid:

1.  k(x, x') = c k1(x, x'), where c > 0 is a constant
2.  k(x, x') = f(x) k1(x, x') f(x'), where f(·) is any function
3.  k(x, x') = q(k1(x, x')), where q(·) is a polynomial with non-negative coefficients
4.  k(x, x') = exp(k1(x, x'))
5.  k(x, x') = k1(x, x') + k2(x, x')
6.  k(x, x') = k1(x, x') k2(x, x')
7.  k(x, x') = k3(φ(x), φ(x')), where φ(x) is a function from x to R^M and k3 is a valid kernel in R^M
8.  k(x, x') = x^T A x', where A is a symmetric positive semi-definite matrix
9.  k(x, x') = ka(xa, xa') + kb(xb, xb'), where xa and xb are variables with x = (xa, xb)
10. k(x, x') = ka(xa, xa') kb(xb, xb'), where ka and kb are valid kernel functions

A small sketch of combining kernels with these rules follows.
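
A small sketch applying a few of the closure rules above (positive scaling, exponentiation, sum, product) to build a new kernel from two valid base kernels; the particular combination and constants are arbitrary.

```python
import numpy as np

def k_linear(x, xp):
    return x @ xp                                             # valid: linear kernel

def k_gauss(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))      # valid: Gaussian kernel

def k_combined(x, xp, c=2.0):
    # rules 1, 4, 5, 6: positive scaling, exp, sum, and product of valid kernels are valid
    return c * k_linear(x, xp) + np.exp(k_linear(x, xp)) * k_gauss(x, xp)

x, xp = np.array([0.5, -1.0]), np.array([1.5, 0.25])
print(k_combined(x, xp))
```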


Kernels Appropriate for Specific Applications

•  Requirements for k(x, x'):
   •  It is symmetric
   •  Its Gram matrix is positive semi-definite
   •  It expresses the appropriate similarity between x and x' for the intended application


Gaussian Kernel

•  A commonly used kernel is

        k(x, x') = exp(−||x − x'||² / 2σ²)

•  It is seen to be a valid kernel by expanding the square

        ||x − x'||² = x^T x + (x')^T x' − 2 x^T x'

   to give

        k(x, x') = exp(−x^T x / 2σ²) exp(x^T x' / σ²) exp(−(x')^T x' / 2σ²)

•  Validity then follows from kernel construction rules 2 and 4, together with the validity of the linear kernel k(x, x') = x^T x' (a sketch verifying the expansion follows)
•  The kernel can be extended to non-Euclidean distances:

        k(x, x') = exp{ (−1/2σ²) [ κ(x, x) + κ(x', x') − 2κ(x, x') ] }
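
A small numerical check of the factorization above; σ and the two vectors are arbitrary choices.

```python
import numpy as np

sigma = 1.5
x, xp = np.array([1.0, -0.5]), np.array([0.3, 2.0])

direct = np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))
factored = (np.exp(-(x @ x) / (2 * sigma**2))
            * np.exp((x @ xp) / sigma**2)
            * np.exp(-(xp @ xp) / (2 * sigma**2)))
print(np.isclose(direct, factored))   # True: the three exponential factors reproduce the Gaussian kernel
```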


Extension of Kernels to Symbolic Inputs

•  An important contribution of the kernel viewpoint: it can handle inputs that are symbolic rather than vectors of real numbers
•  Kernel functions have been defined for graphs, sets, strings, and text documents
•  If A1 and A2 are two subsets of objects, a simple kernel is

        k(A1, A2) = 2^|A1 ∩ A2|

   where |·| indicates the cardinality of the set intersection
•  This is a valid kernel, since it can be shown to correspond to an inner product in a feature space
•  Example: A = {1,2,3,4,5}, A1 = {2,3,4,5}, A2 = {1,2,4,5}, A1 ∩ A2 = {2,4,5}; hence k(A1, A2) = 2³ = 8. What are feature vectors φ(A1) and φ(A2) such that φ(A1)^T φ(A2) = 8? (One possible mapping is sketched below.)
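
A small sketch of the subset kernel on the example sets above. One feature map that answers the question, offered as a plausible construction rather than the slide's own answer: index φ(A) by every subset U of the universal set, with φ_U(A) = 1 iff U ⊆ A; the inner product then counts the subsets common to A1 and A2, i.e. the subsets of A1 ∩ A2, of which there are 2^|A1 ∩ A2|.

```python
from itertools import chain, combinations

universal = {1, 2, 3, 4, 5}
A1, A2 = {2, 3, 4, 5}, {1, 2, 4, 5}

def all_subsets(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def phi(A):
    # one indicator component per subset U of the universal set: 1 iff U is contained in A
    return [1 if U <= A else 0 for U in all_subsets(universal)]

print(2 ** len(A1 & A2))                              # 8, the kernel value from the slide
print(sum(a * b for a, b in zip(phi(A1), phi(A2))))   # 8, via the explicit feature vectors
```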


Combining Discriminative and Generative Models

•  Generative models deal naturally with missing data and, in the case of HMMs, with sequences of varying length
•  Discriminative models such as SVMs often have better performance
•  We can use a generative model to define a kernel and then use that kernel in a discriminative approach


Kernels Based on Generative Models

•  Given a generative model p(x), we can define a kernel by

        k(x, x') = p(x) p(x')

•  This is a valid kernel, since it is an inner product in the one-dimensional feature space defined by the mapping x ↦ p(x)
•  Two inputs x and x' are similar if they both have high probability under the model


Kernel Functions Based on Mixture Densities

•  An extension is to sums of products of different probability distributions:

        k(x, x') = ∑_i p(x|i) p(x'|i) p(i)

   where the p(i) are positive weighting coefficients
•  It is a valid kernel by two of the kernel construction rules: k(x, x') = c k1(x, x') and k(x, x') = k1(x, x') + k2(x, x')
•  Two inputs x and x' will give a large value of k, and hence appear similar, if they both have significant probability under the same components (a toy numerical sketch follows)
•  Taking the limit of an infinite sum,

        k(x, x') = ∫ p(x|z) p(x'|z) p(z) dz

   where z is a continuous latent variable
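
A toy numerical sketch of the finite-mixture kernel above; the two-component Gaussian mixture parameters are made up purely for illustration.

```python
import numpy as np

means, stds, weights = [-1.0, 2.0], [0.5, 1.0], [0.3, 0.7]   # toy mixture: p(x|i) Gaussian, p(i) weights

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_kernel(x, xp):
    # k(x, x') = sum_i p(x|i) p(x'|i) p(i)
    return sum(w * gauss_pdf(x, m, s) * gauss_pdf(xp, m, s)
               for m, s, w in zip(means, stds, weights))

print(mixture_kernel(1.8, 2.2))    # relatively large: both points are probable under the second component
print(mixture_kernel(-1.0, 2.0))   # much smaller: the points are probable only under different components
```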


Kernels for Sequences

•  The data consist of ordered sequences of length L: X = {x_1, ..., x_L}
•  A generative model for sequences is the hidden Markov model (HMM), with hidden states Z = {z_1, ..., z_L}
•  A kernel function for measuring the similarity of two sequences X and X' is

        k(X, X') = ∑_Z p(X|Z) p(X'|Z) p(Z)

•  Both observed sequences are generated by the same hidden sequence Z


Fisher Kernel

•  An alternative technique for using generative models
•  Used in document retrieval, protein sequences, and document recognition
•  Consider a parametric generative model p(x|θ), where θ denotes a vector of parameters
•  Goal: find a kernel that measures the similarity of two vectors x and x' induced by the generative model
•  Define the Fisher score as the gradient with respect to θ:

        g(θ, x) = ∇_θ ln p(x|θ)

   This is a vector of the same dimensionality as θ; more generally, the Fisher score is the gradient of the log-likelihood
•  The Fisher kernel is

        k(x, x') = g(θ, x)^T F^{−1} g(θ, x')

   where F is the Fisher information matrix

        F = E_x[ g(θ, x) g(θ, x)^T ]


Fisher Information Matrix

•  The presence of the Fisher information matrix makes the kernel invariant under a nonlinear re-parametrization of the density model θ → ψ(θ)
•  In practice it is often infeasible to evaluate the Fisher information matrix; instead, use the sample approximation

        F ≈ (1/N) ∑_{n=1}^{N} g(θ, x_n) g(θ, x_n)^T

•  This is the covariance matrix of the Fisher scores, so the Fisher kernel k(x, x') = g(θ, x)^T F^{−1} g(θ, x') corresponds to a whitening of the Fisher scores
•  More simply, we can omit F and use the non-invariant kernel

        k(x, x') = g(θ, x)^T g(θ, x')

A minimal numerical sketch follows.
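
A minimal sketch of the Fisher kernel for the simplest possible generative model, a univariate Gaussian p(x|θ) = N(x|θ, 1) with unknown mean, for which the Fisher score is g(θ, x) = x − θ. The choice of model, θ, and data are illustrative assumptions; F uses the sample approximation above.

```python
import numpy as np

theta = 0.5
rng = np.random.default_rng(0)
X = rng.normal(loc=theta, scale=1.0, size=200)          # samples used to approximate F

def score(x):
    # Fisher score g(theta, x) = d/dtheta ln N(x | theta, 1) = x - theta
    return np.array([x - theta])

F = np.mean([np.outer(score(x), score(x)) for x in X], axis=0)   # sample covariance of Fisher scores
F_inv = np.linalg.inv(F)

def fisher_kernel(x, xp):
    return float(score(x) @ F_inv @ score(xp))

print(fisher_kernel(1.2, 0.8))    # positive: both scores lie on the same side of theta
```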


Sigmoidal Kernel

•  Provides a link between SVMs and neural networks:

        k(x, x') = tanh(a x^T x' + b)

•  Its Gram matrix is in general not positive semi-definite
•  But it is used in practice because it gives SVMs a superficial resemblance to neural networks
•  A Bayesian neural network with an appropriate prior reduces to a Gaussian process
   •  This provides a deeper link between neural networks and kernel methods