Kernel Methods

Sargur Srihari


Topics in Kernel Methods

1. Kernel Methods vs Linear Models/Neural Networks
2. Stored Sample Methods
3. Kernel Functions
4. Dual Representations
5. Constructing Kernels
6. Extension to Symbolic Inputs
7. Fisher Kernel


Kernel Methods vs Linear Models/Neural Networks

•  Linear parametric models for regression and classification have the form y(x, w)
•  During the learning phase we obtain either a maximum-likelihood estimate of w or a posterior distribution over w
•  The training data is then discarded
•  Prediction is based only on the vector w
•  The same is true of neural networks
•  Another class of methods instead uses the training samples, or a subset of them, at prediction time


Memory-Based Methods

•  Training data points are used in the prediction phase
•  Examples of such methods:
   •  Parzen probability density model: a linear combination of kernel functions centered on each training data point (see the sketch after this list)
   •  Nearest-neighbor classification
•  These are memory-based methods
   •  They require a metric to be defined
   •  Fast to train, slow to predict
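
A minimal sketch of the Parzen idea (the Gaussian kernel form, the bandwidth h, and the toy data are illustrative assumptions, not from the slides): the density at a query point is an average of kernels centered on every stored training sample.

```python
import numpy as np

def parzen_density(x, X_train, h=0.5):
    """Parzen estimate of p(x): average of Gaussian kernels centered on each stored training point."""
    d = X_train.shape[1]
    diffs = X_train - x                                   # differences to every stored sample
    norm = (2 * np.pi * h**2) ** (-d / 2)                 # per-kernel Gaussian normalizer
    return norm * np.mean(np.exp(-np.sum(diffs**2, axis=1) / (2 * h**2)))

X_train = np.random.randn(100, 2)                         # stored training samples (toy data)
print(parzen_density(np.array([0.0, 0.0]), X_train))
```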


Kernel Functions

•  Many linear parametric models can be re-cast into equivalent dual representations in which predictions are based on a kernel function evaluated at the training points
•  The kernel function is given by

        k(x, x') = φ(x)^T φ(x')

   where φ(x) is a fixed nonlinear feature-space mapping (basis function)
•  A kernel is a symmetric function of its arguments:

        k(x, x') = k(x', x)

•  The kernel function can be interpreted as the similarity of x and x'
•  The simplest choice is the identity mapping in feature space, φ(x) = x, in which case k(x, x') = x^T x'; this is called the linear kernel (a small sketch follows)
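
A small sketch of the definition (the default feature map argument and the example vectors are illustrative): with the identity mapping φ(x) = x the kernel reduces to the linear kernel x^T x'.

```python
import numpy as np

def kernel(x, x_prime, phi=lambda v: v):
    """k(x, x') = phi(x)^T phi(x'); the default phi is the identity mapping."""
    return phi(x) @ phi(x_prime)

x, x_prime = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(kernel(x, x_prime))                 # 1*3 + 2*(-1) = 1.0
print(kernel(x, x_prime) == x @ x_prime)  # True: identity feature map gives the linear kernel
```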


Kernel Trick (or Kernel Substitution)

•  Formulating an algorithm in terms of inner products allows well-known algorithms to be extended by using the kernel trick
•  Basic idea of the kernel trick:
   •  If an input vector x appears only in the form of scalar products, then we can replace those scalar products with some other choice of kernel
•  Used widely:
   •  in support vector machines
   •  in developing a non-linear variant of PCA
   •  in the kernel Fisher discriminant


Other Forms of Kernel Functions

•  Function of the difference between the arguments:

        k(x, x') = k(x − x')

   Called a stationary kernel, since it is invariant to translations in input space
•  Homogeneous kernels, also known as radial basis functions:

        k(x, x') = k(||x − x'||)

   These depend only on the magnitude of the distance between the arguments
•  For these to be valid kernel functions they must be shown to have the property k(x, x') = φ(x)^T φ(x')
•  Note that the kernel function is a scalar value, while x is an M-dimensional vector


Dual Representation

•  Linear models for regression and classification can be reformulated in terms of a dual representation
   •  in which the kernel function arises naturally
•  Plays an important role in SVMs
•  Consider a linear regression model whose parameters are determined by minimizing the regularized sum-of-squares error function

        J(w) = (1/2) ∑_{n=1}^{N} { w^T φ(x_n) − t_n }² + (λ/2) w^T w

   where w = (w_0, ..., w_{M−1})^T, φ = (φ_0, ..., φ_{M−1})^T is the set of M basis functions (the feature vector), we have N samples {x_1, ..., x_N}, and λ is the regularization coefficient
•  The minimum is obtained by setting the gradient of J(w) with respect to w equal to zero


Solution for w as a Linear Combination of φ(x_n)

•  By setting the derivative of J(w) with respect to w to zero and solving for w, we get

        w = −(1/λ) ∑_{n=1}^{N} { w^T φ(x_n) − t_n } φ(x_n)
          = ∑_{n=1}^{N} a_n φ(x_n)
          = Φ^T a

•  The solution for w is a linear combination of the vectors φ(x_n), whose coefficients are functions of w, where
   •  Φ is the design matrix whose nth row is given by φ(x_n)^T:

        Φ = [ φ_0(x_1)  ...  φ_{M−1}(x_1) ]
            [    ...            ...       ]
            [ φ_0(x_n)  ...  φ_{M−1}(x_n) ]    (an N × M matrix)
            [    ...            ...       ]
            [ φ_0(x_N)  ...  φ_{M−1}(x_N) ]

   •  the vector a = (a_1, ..., a_N)^T is defined by

        a_n = −(1/λ) { w^T φ(x_n) − t_n }


Transformation from w to a

•  Thus we have w = Φ^T a
•  Instead of working with the parameter vector w, we can reformulate the least-squares algorithm in terms of the parameter vector a
   •  giving rise to the dual representation
•  We will see that, although the definition of a still includes w,

        a_n = −(1/λ) { w^T φ(x_n) − t_n }

   it can be eliminated by the use of the kernel function


Gram Matrix and Kernel Function

•  Define the Gram matrix K = ΦΦ^T, an N × N matrix (N × M times M × N), with elements

        K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m)

   where we introduce the kernel function k(x, x') = φ(x)^T φ(x')

        K = [ k(x_1, x_1)  ...  k(x_1, x_N) ]
            [     ...              ...      ]
            [ k(x_N, x_1)  ...  k(x_N, x_N) ]

•  Gram matrix definition: given N vectors, it is the matrix of all inner products
•  Notes:
   •  Φ is N × M and K is N × N
   •  K is a matrix of similarities of pairs of samples (thus it is symmetric); a small sketch follows
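
A small sketch of building K = ΦΦ^T; the one-dimensional polynomial feature map and the sample values are illustrative assumptions.

```python
import numpy as np

X = np.array([0.0, 0.5, 1.0, 2.0])                  # N = 4 one-dimensional samples
Phi = np.stack([np.ones_like(X), X, X**2], axis=1)  # N x M design matrix, phi(x) = (1, x, x^2)
K = Phi @ Phi.T                                     # N x N Gram matrix, K_nm = phi(x_n)^T phi(x_m)

print(K.shape)               # (4, 4): N x M times M x N
print(np.allclose(K, K.T))   # True: K is a symmetric matrix of pairwise similarities
```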


Error Function in Terms of the Gram Matrix of the Kernel

•  The sum-of-squares error function is

        J(w) = (1/2) ∑_{n=1}^{N} { w^T φ(x_n) − t_n }² + (λ/2) w^T w

•  Substituting w = Φ^T a into J(w) gives

        J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a − a^T ΦΦ^T t + (1/2) t^T t + (λ/2) a^T ΦΦ^T a

   where t = (t_1, ..., t_N)^T
•  Written in terms of the Gram matrix, the sum-of-squares error function is

        J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a

•  Solving for a by combining w = Φ^T a and a_n = −(1/λ){ w^T φ(x_n) − t_n } gives

        a = (K + λ I_N)^{−1} t

•  The solution for a is expressed entirely in terms of the kernel k(x, x'); w is then a linear combination of the elements of φ(x), from which we can recover the original formulation in terms of the parameters w


Prediction Function

•  Prediction for a new input x
   •  We can write a = (K + λI_N)^{−1} t by combining w = Φ^T a and a_n = −(1/λ){ w^T φ(x_n) − t_n }
   •  Substituting back into the linear regression model,

        y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λI_N)^{−1} t

     where k(x) has elements k_n(x) = k(x_n, x)
•  The prediction is a linear combination of the target values from the training set (a minimal sketch follows)
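
A minimal sketch of the dual prediction y(x) = k(x)^T (K + λI_N)^{−1} t; the Gaussian kernel, the value of λ, and the toy sine data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def k_gauss(a, b, sigma=0.2):
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

X = np.linspace(0, 1, 20).reshape(-1, 1)                    # training inputs
t = np.sin(2 * np.pi * X).ravel()                           # training targets
lam = 0.1

K = np.array([[k_gauss(xn, xm) for xm in X] for xn in X])   # Gram matrix
dual_a = np.linalg.solve(K + lam * np.eye(len(X)), t)       # a = (K + lambda I_N)^{-1} t

def predict(x_new):
    k_vec = np.array([k_gauss(xn, x_new) for xn in X])      # k(x), with k_n(x) = k(x_n, x)
    return k_vec @ dual_a                                    # linear combination of training targets

print(predict(np.array([0.25])))    # approximates sin(pi/2) = 1, smoothed by the regularizer
```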

Advantage of Dual Representation

•  The solution for a is expressed entirely in terms of the kernel function k(x, x')
•  Once we have a, we can recover w as a linear combination of the elements of φ(x) using w = Φ^T a
•  In the parametric formulation, the solution is w_ML = (Φ^T Φ)^{−1} Φ^T t
•  Instead of inverting an M × M matrix we are inverting an N × N matrix, an apparent disadvantage
•  But the advantage of the dual formulation is that we can work with the kernel function k(x, x') and therefore
   •  avoid working with a feature vector φ(x), and
   •  avoid the problems associated with the very high or infinite dimensionality of φ(x)


Constructing Kernels

•  To exploit kernel substitution, we need valid kernel functions
•  First method:
   •  choose a feature-space mapping φ(x) and use it to find the corresponding kernel
   •  For a one-dimensional input space,

        k(x, x') = φ(x)^T φ(x') = ∑_{i=1}^{M} φ_i(x) φ_i(x')

     where the φ_i(x) are basis functions such as polynomials; for example, for each i we could choose φ_i(x) = x^i


Construction of Kernel Functions from Basis Functions

[Figure: one-dimensional input space. Basis functions φ_i(x) — polynomials, Gaussians, logistic sigmoids — together with the corresponding kernel functions k(x, x') = φ(x)^T φ(x'), plotted as a function of x; the red cross marks x'.]


Second Method: Direct Construction of Kernels

•  The function we choose has to correspond to a scalar product in some (perhaps infinite-dimensional) feature space
•  Consider the kernel function k(x, z) = (x^T z)²
•  In a two-dimensional input space,

        k(x, z) = (x^T z)² = (x_1 z_1 + x_2 z_2)²
                = x_1² z_1² + 2 x_1 z_1 x_2 z_2 + x_2² z_2²
                = (x_1², √2 x_1 x_2, x_2²)(z_1², √2 z_1 z_2, z_2²)^T
                = φ(x)^T φ(z)

•  The feature mapping takes the form φ(x) = (x_1², √2 x_1 x_2, x_2²)
   •  It comprises all second-order terms with a specific weighting
   •  The inner product in feature space requires computing the six feature values and 3 × 3 = 9 multiplications
   •  The kernel function k(x, z) needs only 2 multiplications and a squaring (a sketch of this equivalence follows)
•  By considering (x^T z + c)², we get constant, linear, and second-order terms
•  By considering (x^T z + c)^M, we get all monomials up to order M
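
A small sketch verifying that k(x, z) = (x^T z)² equals φ(x)^T φ(z) for the explicit feature map above; the two example vectors are arbitrary choices.

```python
import numpy as np

def phi(v):
    """Explicit second-order feature map for two-dimensional input."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)        # kernel evaluation: 2 multiplications and a squaring
print(phi(x) @ phi(z))     # same value via the explicit feature-space inner product
```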


Testing Whether a Function is a Valid Kernel

•  We want to test validity without having to construct the function φ(x) explicitly
•  A necessary and sufficient condition for a function k(x, x') to be a valid kernel is that the Gram matrix K, whose elements are given by k(x_n, x_m), is positive semi-definite for all possible choices of the set {x_n}
•  Positive semi-definite is not the same thing as a matrix whose elements are non-negative
•  It means

        z^T K z ≥ 0 for all vectors z with real entries, i.e., ∑_n ∑_m K_nm z_n z_m ≥ 0 for any real numbers z_n, z_m

•  Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional space
•  New kernels can be constructed from simpler kernels as building blocks (an illustrative check follows)
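
An illustrative, and only partial, check of the condition above: build the Gram matrix of a candidate kernel on one random set {x_n} and inspect its eigenvalues. Passing on one set is necessary but not sufficient for validity; the kernel and data here are arbitrary choices.

```python
import numpy as np

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))                              # one arbitrary set {x_n}
K = gram(lambda a, b: np.exp(-np.sum((a - b)**2)), X)         # candidate kernel: Gaussian

eigvals = np.linalg.eigvalsh(K)                               # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)                                # non-negative (up to round-off) => PSD on this set
```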


Techniques for Constructing Kernels

•  Given valid kernels k1(x, x') and k2(x, x'), the following new kernels will also be valid:

1.  k(x, x') = c k1(x, x'), where c > 0 is a constant
2.  k(x, x') = f(x) k1(x, x') f(x'), where f(·) is any function
3.  k(x, x') = q(k1(x, x')), where q(·) is a polynomial with non-negative coefficients
4.  k(x, x') = exp(k1(x, x'))
5.  k(x, x') = k1(x, x') + k2(x, x')
6.  k(x, x') = k1(x, x') k2(x, x')
7.  k(x, x') = k3(φ(x), φ(x')), where φ(x) is a function from x to R^M and k3 is a valid kernel in R^M
8.  k(x, x') = x^T A x', where A is a symmetric positive semi-definite matrix
9.  k(x, x') = ka(xa, xa') + kb(xb, xb'), where xa and xb are variables with x = (xa, xb)
10. k(x, x') = ka(xa, xa') kb(xb, xb'), where ka and kb are valid kernel functions

A small sketch of combining kernels with these rules follows.
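
A small sketch applying a few of the closure rules above (positive scaling, exponentiation, sum, product) to build a new kernel from two valid base kernels; the particular combination and constants are arbitrary.

```python
import numpy as np

def k_linear(x, xp):
    return x @ xp                                             # valid: linear kernel

def k_gauss(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))      # valid: Gaussian kernel

def k_combined(x, xp, c=2.0):
    # rules 1, 4, 5, 6: positive scaling, exp, sum, and product of valid kernels are valid
    return c * k_linear(x, xp) + np.exp(k_linear(x, xp)) * k_gauss(x, xp)

x, xp = np.array([0.5, -1.0]), np.array([1.5, 0.25])
print(k_combined(x, xp))
```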


Kernels Appropriate for Specific Applications

•  Requirements for k(x, x'):
   •  It is symmetric
   •  Its Gram matrix is positive semi-definite
   •  It expresses the appropriate similarity between x and x' for the intended application


Gaussian Kernel

•  A commonly used kernel is

        k(x, x') = exp(−||x − x'||² / 2σ²)

•  It is seen to be a valid kernel by expanding the square

        ||x − x'||² = x^T x + (x')^T x' − 2 x^T x'

   to give

        k(x, x') = exp(−x^T x / 2σ²) exp(x^T x' / σ²) exp(−(x')^T x' / 2σ²)

•  Validity then follows from kernel construction rules 2 and 4, together with the validity of the linear kernel k(x, x') = x^T x' (a sketch verifying the expansion follows)
•  The kernel can be extended to non-Euclidean distances:

        k(x, x') = exp{ (−1/2σ²) [ κ(x, x) + κ(x', x') − 2κ(x, x') ] }
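
A small numerical check of the factorization above; σ and the two vectors are arbitrary choices.

```python
import numpy as np

sigma = 1.5
x, xp = np.array([1.0, -0.5]), np.array([0.3, 2.0])

direct = np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))
factored = (np.exp(-(x @ x) / (2 * sigma**2))
            * np.exp((x @ xp) / sigma**2)
            * np.exp(-(xp @ xp) / (2 * sigma**2)))
print(np.isclose(direct, factored))   # True: the three exponential factors reproduce the Gaussian kernel
```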


Extension of Kernels to Symbolic Inputs

•  An important contribution of the kernel viewpoint: it can handle inputs that are symbolic rather than vectors of real numbers
•  Kernel functions have been defined for graphs, sets, strings, and text documents
•  If A1 and A2 are two subsets of objects, a simple kernel is

        k(A1, A2) = 2^|A1 ∩ A2|

   where |·| indicates the cardinality of the set intersection
•  This is a valid kernel, since it can be shown to correspond to an inner product in a feature space
•  Example: A = {1,2,3,4,5}, A1 = {2,3,4,5}, A2 = {1,2,4,5}, A1 ∩ A2 = {2,4,5}; hence k(A1, A2) = 2³ = 8. What are feature vectors φ(A1) and φ(A2) such that φ(A1)^T φ(A2) = 8? (One possible mapping is sketched below.)
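
A small sketch of the subset kernel on the example sets above. One feature map that answers the question, offered as a plausible construction rather than the slide's own answer: index φ(A) by every subset U of the universal set, with φ_U(A) = 1 iff U ⊆ A; the inner product then counts the subsets common to A1 and A2, i.e. the subsets of A1 ∩ A2, of which there are 2^|A1 ∩ A2|.

```python
from itertools import chain, combinations

universal = {1, 2, 3, 4, 5}
A1, A2 = {2, 3, 4, 5}, {1, 2, 4, 5}

def all_subsets(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def phi(A):
    # one indicator component per subset U of the universal set: 1 iff U is contained in A
    return [1 if U <= A else 0 for U in all_subsets(universal)]

print(2 ** len(A1 & A2))                              # 8, the kernel value from the slide
print(sum(a * b for a, b in zip(phi(A1), phi(A2))))   # 8, via the explicit feature vectors
```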


Combining Discriminative and Generative Models

•  Generative models deal naturally with missing data and, in the case of HMMs, with sequences of varying length
•  Discriminative models such as SVMs often have better performance
•  We can use a generative model to define a kernel and then use that kernel in a discriminative approach


Kernels Based on Generative Models

•  Given a generative model p(x), we can define a kernel by

        k(x, x') = p(x) p(x')

•  This is a valid kernel, since it is an inner product in the one-dimensional feature space defined by the mapping x ↦ p(x)
•  Two inputs x and x' are similar if they both have high probability under the model


Kernel Functions Based on Mixture Densities

•  An extension is to sums of products of different probability distributions:

        k(x, x') = ∑_i p(x|i) p(x'|i) p(i)

   where the p(i) are positive weighting coefficients
•  It is a valid kernel by two of the kernel construction rules: k(x, x') = c k1(x, x') and k(x, x') = k1(x, x') + k2(x, x')
•  Two inputs x and x' will give a large value of k, and hence appear similar, if they both have significant probability under the same components (a toy numerical sketch follows)
•  Taking the limit of an infinite sum,

        k(x, x') = ∫ p(x|z) p(x'|z) p(z) dz

   where z is a continuous latent variable
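
A toy numerical sketch of the finite-mixture kernel above; the two-component Gaussian mixture parameters are made up purely for illustration.

```python
import numpy as np

means, stds, weights = [-1.0, 2.0], [0.5, 1.0], [0.3, 0.7]   # toy mixture: p(x|i) Gaussian, p(i) weights

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_kernel(x, xp):
    # k(x, x') = sum_i p(x|i) p(x'|i) p(i)
    return sum(w * gauss_pdf(x, m, s) * gauss_pdf(xp, m, s)
               for m, s, w in zip(means, stds, weights))

print(mixture_kernel(1.8, 2.2))    # relatively large: both points are probable under the second component
print(mixture_kernel(-1.0, 2.0))   # much smaller: the points are probable only under different components
```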


Kernels for Sequences

•  The data consist of ordered sequences of length L: X = {x_1, ..., x_L}
•  A generative model for sequences is the hidden Markov model (HMM), with hidden states Z = {z_1, ..., z_L}
•  A kernel function for measuring the similarity of two sequences X and X' is

        k(X, X') = ∑_Z p(X|Z) p(X'|Z) p(Z)

•  Both observed sequences are generated by the same hidden sequence Z


Fisher Kernel

•  An alternative technique for using generative models
•  Used in document retrieval, protein sequences, and document recognition
•  Consider a parametric generative model p(x|θ), where θ denotes a vector of parameters
•  Goal: find a kernel that measures the similarity of two vectors x and x' induced by the generative model
•  Define the Fisher score as the gradient with respect to θ:

        g(θ, x) = ∇_θ ln p(x|θ)

   This is a vector of the same dimensionality as θ; more generally, the Fisher score is the gradient of the log-likelihood
•  The Fisher kernel is

        k(x, x') = g(θ, x)^T F^{−1} g(θ, x')

   where F is the Fisher information matrix

        F = E_x[ g(θ, x) g(θ, x)^T ]


Fisher Information Matrix

•  The presence of the Fisher information matrix makes the kernel invariant under a nonlinear re-parametrization of the density model θ → ψ(θ)
•  In practice it is often infeasible to evaluate the Fisher information matrix; instead, use the sample approximation

        F ≈ (1/N) ∑_{n=1}^{N} g(θ, x_n) g(θ, x_n)^T

•  This is the covariance matrix of the Fisher scores, so the Fisher kernel k(x, x') = g(θ, x)^T F^{−1} g(θ, x') corresponds to a whitening of the Fisher scores
•  More simply, we can omit F and use the non-invariant kernel

        k(x, x') = g(θ, x)^T g(θ, x')

A minimal numerical sketch follows.
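
A minimal sketch of the Fisher kernel for the simplest possible generative model, a univariate Gaussian p(x|θ) = N(x|θ, 1) with unknown mean, for which the Fisher score is g(θ, x) = x − θ. The choice of model, θ, and data are illustrative assumptions; F uses the sample approximation above.

```python
import numpy as np

theta = 0.5
rng = np.random.default_rng(0)
X = rng.normal(loc=theta, scale=1.0, size=200)          # samples used to approximate F

def score(x):
    # Fisher score g(theta, x) = d/dtheta ln N(x | theta, 1) = x - theta
    return np.array([x - theta])

F = np.mean([np.outer(score(x), score(x)) for x in X], axis=0)   # sample covariance of Fisher scores
F_inv = np.linalg.inv(F)

def fisher_kernel(x, xp):
    return float(score(x) @ F_inv @ score(xp))

print(fisher_kernel(1.2, 0.8))    # positive: both scores lie on the same side of theta
```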


Sigmoidal Kernel

•  Provides a link between SVMs and neural networks:

        k(x, x') = tanh(a x^T x' + b)

•  Its Gram matrix is in general not positive semi-definite
•  But it is used in practice because it gives SVMs a superficial resemblance to neural networks
•  A Bayesian neural network with an appropriate prior reduces to a Gaussian process
   •  This provides a deeper link between neural networks and kernel methods