Machine Learning
Kernel Methods
Sargur Srihari
Topics in Kernel Methods
1. Kernel Methods vs Linear Models/Neural Networks
2. Stored Sample Methods
3. Kernel Functions
4. Dual Representations
5. Constructing Kernels
6. Extension to Symbolic Inputs
7. Fisher Kernel
Kernel Methods vs Linear Models/Neural Networks
• Linear parametric models for regression and classification have the form y(x, w)
• During the learning phase we obtain either a maximum likelihood estimate of w or a posterior distribution over w
• The training data is then discarded, and prediction is based only on the vector w
• This is true of neural networks as well
• Another class of methods instead retains the training samples, or a subset of them, and uses them during prediction
Memory-Based Methods
• Training data points are used in the prediction phase
• Examples of such methods:
  • Parzen probability density model: a linear combination of kernel functions centered on each training data point
  • Nearest neighbor classification
• These are memory-based methods
  • They require a metric to be defined
  • They are fast to train but slow at prediction
Kernel Functions
• Many linear parametric models can be re-cast into equivalent dual representations in which predictions are based on a kernel function evaluated at the training points
• The kernel function is given by

    k(x, x') = φ(x)^T φ(x')

  where φ(x) is a fixed nonlinear feature space mapping (basis function)
• The kernel is a symmetric function of its arguments: k(x, x') = k(x', x)
• The kernel function can be interpreted as the similarity of x and x'
• The simplest choice is the identity mapping in feature space, φ(x) = x, in which case k(x, x') = x^T x'; this is called the linear kernel (a small numerical illustration follows)
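A minimal numerical sketch of the definition k(x, x') = φ(x)^T φ(x') and its symmetry; the quadratic feature map below is an arbitrary choice made only for illustration:

```python
import numpy as np

def phi(x):
    # Hypothetical fixed feature map on a scalar input (an arbitrary illustrative choice).
    return np.array([1.0, x, x ** 2])

def k(x, xp):
    # Kernel defined as an inner product of feature vectors: k(x, x') = phi(x)^T phi(x')
    return phi(x) @ phi(xp)

x, xp = 0.5, -1.2
print(k(x, xp), k(xp, x))   # symmetry: k(x, x') == k(x', x)
print(k(x, x))              # with phi(x) = x this would reduce to the linear kernel x^T x'
```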
Kernel Trick (or Kernel Substitution)
• Formulating algorithms in terms of inner products allows well-known algorithms to be extended by using the kernel trick
• Basic idea of the kernel trick:
  • If an input vector x appears only in the form of scalar products, then we can replace those scalar products with some other choice of kernel (see the sketch below)
• Used widely:
  • in support vector machines
  • in developing a non-linear variant of PCA
  • in the kernel Fisher discriminant
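A small sketch of the idea: any quantity written purely in terms of scalar products, such as a squared distance, can be kernelized by substituting a kernel for each product. The polynomial kernel below is just an example choice:

```python
import numpy as np

def k(x, z):
    # Some valid kernel; here a polynomial kernel (x^T z + 1)^2 chosen for illustration.
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# Squared distance in the implicit feature space, using only kernel evaluations:
# ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z)
d2 = k(x, x) - 2.0 * k(x, z) + k(z, z)
print(d2)
```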
Other Forms of Kernel Functions
• Function of the difference between arguments:

    k(x, x') = k(x - x')

  • Called a stationary kernel, since it is invariant to translations in input space
• Homogeneous kernels, also known as radial basis functions:

    k(x, x') = k(||x - x'||)

  • These depend only on the magnitude of the distance between the arguments
• For these to be valid kernel functions they must be shown to have the property k(x, x') = φ(x)^T φ(x')
• Note that the kernel function is a scalar value, while x is an M-dimensional vector
Dual Representation
• Linear models for regression and classification can be reformulated in terms of a dual representation, in which the kernel function arises naturally
• The dual representation plays an important role in SVMs
• Consider a linear regression model whose parameters are determined by minimizing the regularized sum-of-squares error function

    J(w) = (1/2) Σ_{n=1}^{N} { w^T φ(x_n) - t_n }^2 + (λ/2) w^T w

  where w = (w_0, .., w_{M-1})^T, φ = (φ_0, .., φ_{M-1})^T is the set of M basis functions (feature vector), {x_1, .., x_N} are the N training samples, and λ is the regularization coefficient
• The minimum is obtained by setting the gradient of J(w) with respect to w equal to zero
Solution for w as a Linear Combination of φ(x_n)
• Setting the derivative of J(w) with respect to w to zero and solving for w gives

    w = -(1/λ) Σ_{n=1}^{N} { w^T φ(x_n) - t_n } φ(x_n)
      = Σ_{n=1}^{N} a_n φ(x_n)
      = Φ^T a

• The solution for w is a linear combination of the vectors φ(x_n), with coefficients that are functions of w, where
  • Φ is the N × M design matrix whose nth row is given by φ(x_n)^T:

        Φ = [ φ_0(x_1)   ...   φ_{M-1}(x_1)
              ...
              φ_0(x_n)   ...   φ_{M-1}(x_n)
              ...
              φ_0(x_N)   ...   φ_{M-1}(x_N) ]

  • the vector a = (a_1, .., a_N)^T is defined by

        a_n = -(1/λ) { w^T φ(x_n) - t_n }
Transformation from w to a
• Thus we have w = Φ^T a
• Instead of working with the parameter vector w, we can reformulate the least-squares algorithm in terms of the parameter vector a, giving rise to the dual representation
• We will see that, although the definition of a still includes w through

    a_n = -(1/λ) { w^T φ(x_n) - t_n }

  w can be eliminated by the use of the kernel function
Gram Matrix and Kernel Function
• Define the Gram matrix K = ΦΦ^T, an N × N matrix (N × M times M × N), with elements

    K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m)

  where we introduce the kernel function k(x, x') = φ(x)^T φ(x')

        K = [ k(x_1, x_1)   ...   k(x_1, x_N)
              ...
              k(x_N, x_1)   ...   k(x_N, x_N) ]

• Gram matrix definition: given N vectors, it is the matrix of all pairwise inner products
• Notes (see the example below):
  • Φ is N × M and K is N × N
  • K is a matrix of similarities of pairs of samples (thus it is symmetric)
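A short sketch of the relationship K = ΦΦ^T; the feature map and data below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # N = 5 samples, 2-dimensional inputs

def phi(x):
    # Hypothetical fixed feature map with M = 3 basis functions (illustrative only).
    return np.array([1.0, x[0] * x[1], x[0] ** 2 + x[1] ** 2])

Phi = np.array([phi(x) for x in X])    # N x M design matrix
K = Phi @ Phi.T                        # N x N Gram matrix, K_nm = phi(x_n)^T phi(x_m)

print(K.shape)                         # (5, 5)
print(np.allclose(K, K.T))             # symmetric, as expected
```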
Error Function in Terms of the Gram Matrix of the Kernel
• The sum-of-squares error function is

    J(w) = (1/2) Σ_{n=1}^{N} { w^T φ(x_n) - t_n }^2 + (λ/2) w^T w

• Substituting w = Φ^T a into J(w) gives

    J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a - a^T ΦΦ^T t + (1/2) t^T t + (λ/2) a^T ΦΦ^T a

  where t = (t_1, .., t_N)^T
• The sum-of-squares error function can therefore be written in terms of the Gram matrix as

    J(a) = (1/2) a^T K K a - a^T K t + (1/2) t^T t + (λ/2) a^T K a

• Solving for a, by combining w = Φ^T a and a_n = -(1/λ){ w^T φ(x_n) - t_n }, gives

    a = (K + λ I_N)^{-1} t

• The solution for a is expressed as a linear combination of the elements of φ(x), with coefficients given entirely in terms of the kernel k(x, x'), from which we can recover the original formulation in terms of the parameters w (a numerical check is sketched below)
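A quick numerical check, with randomly generated Φ, t, and a, that the Gram-matrix form of the error agrees with the original sum-of-squares form when w = Φ^T a:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 6, 3, 0.1
Phi = rng.normal(size=(N, M))          # design matrix
t = rng.normal(size=N)                 # targets
a = rng.normal(size=N)                 # arbitrary dual parameters for the check
w = Phi.T @ a                          # w = Phi^T a
K = Phi @ Phi.T                        # Gram matrix

# Primal error J(w) and its Gram-matrix (dual) form J(a) should agree exactly.
J_w = 0.5 * np.sum((Phi @ w - t) ** 2) + 0.5 * lam * w @ w
J_a = 0.5 * a @ K @ K @ a - a @ K @ t + 0.5 * t @ t + 0.5 * lam * a @ K @ a
print(np.isclose(J_w, J_a))            # True
```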
Prediction Function
• Prediction for a new input x:
  • We obtain a = (K + λ I_N)^{-1} t by combining w = Φ^T a and a_n = -(1/λ){ w^T φ(x_n) - t_n }
  • Substituting back into the linear regression model y(x) = w^T φ(x) gives

      y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λ I_N)^{-1} t

    where k(x) is the vector with elements k_n(x) = k(x_n, x)
• The prediction is a linear combination of the target values from the training set (see the sketch below)
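A minimal kernel ridge regression sketch of the dual solution and prediction formula; the Gaussian kernel, its width, and the toy data are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 1))                  # training inputs
t = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)   # noisy targets
lam = 0.1

def k(x, z):
    # Example kernel choice: Gaussian kernel with sigma = 0.5 (an illustrative assumption).
    return np.exp(-np.sum((x - z) ** 2) / (2 * 0.5 ** 2))

N = len(X)
K = np.array([[k(X[n], X[m]) for m in range(N)] for n in range(N)])
a = np.linalg.solve(K + lam * np.eye(N), t)            # a = (K + lambda I_N)^{-1} t

def predict(x_new):
    k_vec = np.array([k(X[n], x_new) for n in range(N)])   # k_n(x) = k(x_n, x)
    return k_vec @ a                                        # y(x) = k(x)^T (K + lambda I_N)^{-1} t

print(predict(np.array([0.3])))
```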
Advantage of the Dual Representation
• The solution for a is expressed entirely in terms of the kernel function k(x, x')
• Once we have a, we can recover w as a linear combination of the elements of φ(x) using w = Φ^T a
• In the parametric formulation, the solution is w_ML = (Φ^T Φ)^{-1} Φ^T t
• Instead of inverting an M × M matrix we are now inverting an N × N matrix, an apparent disadvantage
• But the advantage of the dual formulation is that we can work with the kernel function k(x, x') and therefore
  • avoid working with a feature vector φ(x), and
  • avoid the problems associated with a feature space of very high or infinite dimensionality
Constructing Kernels
• To exploit kernel substitution we need valid kernel functions
• First method: choose a feature space mapping φ(x) and use it to find the corresponding kernel
• For a one-dimensional input space:

    k(x, x') = φ(x)^T φ(x') = Σ_{i=1}^{M} φ_i(x) φ_i(x')

  where the φ_i(x) are basis functions such as polynomials; for each i we choose φ_i(x) = x^i (see the example below)
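A sketch of this first method for a one-dimensional input, assuming polynomial basis functions φ_i(x) = x^i:

```python
import numpy as np

M = 4  # number of basis functions

def phi(x):
    # Polynomial basis functions phi_i(x) = x^i, i = 1..M, for a scalar input.
    return np.array([x ** i for i in range(1, M + 1)])

def k(x, xp):
    # Kernel obtained from the chosen basis: k(x, x') = sum_i phi_i(x) phi_i(x')
    return np.sum(phi(x) * phi(xp))

# The kernel is just the inner product of the two feature vectors.
print(k(0.5, 2.0), phi(0.5) @ phi(2.0))
```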
Construction of Kernel Functions from Basis Functions
• One-dimensional input space
• [Figure: upper plots show basis functions φ_i(x), using polynomials, Gaussians, and logistic sigmoids; lower plots show the corresponding kernel functions k(x, x') = φ(x)^T φ(x') as a function of x, with x' fixed at the red cross]
Second Method: Direct Construction of Kernels
• The function we choose has to correspond to a scalar product in some (perhaps infinite-dimensional) feature space
• Consider the kernel function k(x, z) = (x^T z)^2
• In a two-dimensional input space:

    k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
            = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
            = (x_1^2, √2 x_1 x_2, x_2^2)(z_1^2, √2 z_1 z_2, z_2^2)^T
            = φ(x)^T φ(z)

• The feature mapping therefore takes the form φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T
  • It comprises all second-order terms, with a specific weighting
• Computing the inner product in feature space requires evaluating six feature values and 3 × 3 = 9 multiplications, whereas the kernel function k(x, z) needs only 2 multiplications and a squaring (verified in the sketch below)
• By considering (x^T z + c)^2 we get constant, linear, and second-order terms
• By considering (x^T z + c)^M we get all monomials (powers of x) up to order M
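A numerical check that the direct kernel (x^T z)^2 agrees with the explicit feature map given above:

```python
import numpy as np

def k(x, z):
    # Direct kernel evaluation: 2 multiplications and a squaring.
    return (x @ z) ** 2

def phi(v):
    # Explicit feature map for the 2-D case: all second-order terms with the sqrt(2) weighting.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(k(x, z), phi(x) @ phi(z))   # both give the same value
```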
Testing Whether a Function is a Valid Kernel
• We want to test validity without having to construct the function φ(x) explicitly
• A necessary and sufficient condition for a function k(x, x') to be a valid kernel is that the Gram matrix K, whose elements are given by k(x_n, x_m), be positive semi-definite for all possible choices of the set {x_n} (a numerical check is sketched below)
• Positive semi-definite is not the same thing as a matrix whose elements are non-negative; it means

    z^T K z ≥ 0 for all real vectors z, i.e., Σ_n Σ_m K_nm z_n z_m ≥ 0 for any real numbers z_n, z_m

• Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional space
• New kernels can be constructed from simpler kernels as building blocks
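A sketch of how the positive semi-definiteness condition can be checked numerically for one particular data set (here a Gaussian kernel on random points; a single check of course does not prove validity for all possible {x_n}):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

K = np.array([[gaussian_kernel(x, z) for z in X] for x in X])

# A valid kernel must give a positive semi-definite Gram matrix:
# all eigenvalues should be >= 0 (up to numerical round-off).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)
```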
Techniques for Constructing Kernels
• Given valid kernels k1(x, x') and k2(x, x'), the following new kernels will also be valid (see the example below):
  1. k(x, x') = c k1(x, x'), where c > 0 is a constant
  2. k(x, x') = f(x) k1(x, x') f(x'), where f(·) is any function
  3. k(x, x') = q(k1(x, x')), where q(·) is a polynomial with non-negative coefficients
  4. k(x, x') = exp(k1(x, x'))
  5. k(x, x') = k1(x, x') + k2(x, x')
  6. k(x, x') = k1(x, x') k2(x, x')
  7. k(x, x') = k3(φ(x), φ(x')), where φ(x) is a function from x to R^M and k3 is a valid kernel in R^M
  8. k(x, x') = x^T A x', where A is a symmetric positive semidefinite matrix
  9. k(x, x') = ka(xa, xa') + kb(xb, xb'), where xa and xb are variables with x = (xa, xb), and ka, kb are valid kernel functions
  10. k(x, x') = ka(xa, xa') kb(xb, xb')
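A brief sketch showing how new kernels can be composed from valid ones following these rules; the base kernels are arbitrary example choices:

```python
import numpy as np

def k1(x, z):
    return x @ z                                # linear kernel (valid)

def k2(x, z):
    return (x @ z + 1.0) ** 2                   # polynomial kernel (valid)

# Combinations guaranteed valid by the construction rules:
def k_scaled(x, z):  return 3.0 * k1(x, z)             # rule 1 (c > 0)
def k_exp(x, z):     return np.exp(k1(x, z))           # rule 4
def k_sum(x, z):     return k1(x, z) + k2(x, z)        # rule 5
def k_prod(x, z):    return k1(x, z) * k2(x, z)        # rule 6

x, z = np.array([1.0, -0.5]), np.array([0.2, 0.7])
print(k_scaled(x, z), k_exp(x, z), k_sum(x, z), k_prod(x, z))
```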
Kernels Appropriate for Specific Applications
• Requirements for k(x, x'):
  • It is symmetric
  • Its Gram matrix is positive semidefinite
  • It expresses the appropriate similarity between x and x' for the intended application
Gaussian Kernel
• A commonly used kernel is

    k(x, x') = exp(-||x - x'||^2 / 2σ^2)

• It is seen to be a valid kernel by expanding the square

    ||x - x'||^2 = x^T x + (x')^T x' - 2 x^T x'

  to give

    k(x, x') = exp(-x^T x / 2σ^2) exp(x^T x' / σ^2) exp(-(x')^T x' / 2σ^2)

  and then using kernel construction rules 2 and 4, together with the validity of the linear kernel k(x, x') = x^T x' (a numerical check of the factorization follows)
• It can be extended to non-Euclidean distances:

    k(x, x') = exp{ -(1/2σ^2) [ κ(x, x) + κ(x', x') - 2κ(x, x') ] }
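A quick numerical confirmation of the factorization used in the validity argument; σ and the inputs below are arbitrary values:

```python
import numpy as np

sigma = 0.8
x = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])

# Direct form of the Gaussian kernel
k_direct = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# Factored form obtained by expanding the square, the basis of the validity argument
k_factored = (np.exp(-x @ x / (2 * sigma ** 2))
              * np.exp(x @ xp / sigma ** 2)
              * np.exp(-xp @ xp / (2 * sigma ** 2)))

print(np.isclose(k_direct, k_factored))   # True
```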
Extension of Kernels to Symbolic Inputs
• An important contribution of the kernel viewpoint: inputs can be symbolic rather than vectors of real numbers
• Kernel functions have been defined for graphs, sets, strings, and text documents
• If A1 and A2 are two subsets of objects, a simple kernel is

    k(A1, A2) = 2^|A1 ∩ A2|

  where |·| indicates the cardinality of the set intersection
• This is a valid kernel since it can be shown to correspond to an inner product in a feature space
• Example: with A = {1, 2, 3, 4, 5}, A1 = {2, 3, 4, 5}, A2 = {1, 2, 4, 5}, we have A1 ∩ A2 = {2, 4, 5}, hence k(A1, A2) = 2^3 = 8
• What are feature vectors φ(A1) and φ(A2) such that φ(A1)^T φ(A2) = 8? (One construction is sketched below)
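One standard construction answering the question above, assuming the objects come from a known finite universal set: index the feature vector by all subsets U of that set and let φ_U(A1) = 1 if U ⊆ A1. The inner product then counts the subsets common to A1 and A2, i.e. the subsets of A1 ∩ A2, which number 2^|A1 ∩ A2|:

```python
from itertools import chain, combinations

def all_subsets(universe):
    # All subsets of the universal set, used to index the feature vector.
    s = sorted(universe)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def phi(S, universe):
    # phi_U(S) = 1 if the subset U is contained in S, else 0.
    return [1 if U <= S else 0 for U in all_subsets(universe)]

A  = {1, 2, 3, 4, 5}
A1 = frozenset({2, 3, 4, 5})
A2 = frozenset({1, 2, 4, 5})

inner = sum(u * v for u, v in zip(phi(A1, A), phi(A2, A)))
print(inner, 2 ** len(A1 & A2))    # both equal 8 for this example
```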
Combining Discriminative and Generative Models
• Generative models can deal naturally with missing data and, in the case of hidden Markov models, with sequences of varying length
• Discriminative models such as SVMs typically have better performance on discrimination tasks
• We can use a generative model to define a kernel, and then use that kernel in a discriminative approach
Kernels Based on Generative Models
• Given a generative model p(x) we can define a kernel by

    k(x, x') = p(x) p(x')

• This is a valid kernel since it is an inner product in the one-dimensional feature space defined by the mapping x → p(x)
• Two inputs x and x' are similar if they both have high probability under the model
Kernel Functions Based on Mixture Densities
• An extension is to sums of products of different probability distributions:

    k(x, x') = Σ_i p(x|i) p(x'|i) p(i)

  where the p(i) are positive weighting coefficients
• This is a valid kernel by two of the kernel construction rules: k(x, x') = c k1(x, x') and k(x, x') = k1(x, x') + k2(x, x')
• Two inputs x and x' will give a large value of k, and hence appear similar, if they both have significant probability under a range of different components (see the sketch below)
• Taking the limit of an infinite sum,

    k(x, x') = ∫ p(x|z) p(x'|z) p(z) dz

  where z is a continuous latent variable
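A small sketch of the mixture-density kernel using a hypothetical univariate Gaussian mixture; the component parameters and weights are arbitrary illustration values:

```python
import numpy as np

# Hypothetical 3-component univariate Gaussian mixture (illustrative values only).
means, sds, weights = [-2.0, 0.0, 2.0], [1.0, 0.5, 1.0], [0.3, 0.4, 0.3]

def gauss_pdf(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

def k(x, xp):
    # k(x, x') = sum_i p(x|i) p(x'|i) p(i)
    return sum(w * gauss_pdf(x, m, s) * gauss_pdf(xp, m, s)
               for m, s, w in zip(means, sds, weights))

print(k(0.1, 0.2))    # inputs likely under the same components -> larger value
print(k(-2.0, 2.0))   # inputs favored by different components -> smaller value
```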
Kernels for Sequences
• The data consist of ordered sequences of length L: X = {x_1, .., x_L}
• A generative model for sequences is the hidden Markov model, with hidden states Z = {z_1, .., z_L}
• A kernel function for measuring the similarity of two sequences X and X' is

    k(X, X') = Σ_Z p(X|Z) p(X'|Z) p(Z)

• Both observed sequences are generated by the same hidden sequence Z
Fisher Kernel
• An alternative technique for using generative models to define kernels
• Used in document retrieval, protein sequences, and document recognition
• Consider a parametric generative model p(x|θ), where θ denotes a vector of parameters
• Goal: find a kernel that measures the similarity of two vectors x and x' induced by the generative model
• Define the Fisher score as the gradient with respect to θ:

    g(θ, x) = ∇_θ ln p(x|θ)

  a vector of the same dimensionality as θ (more generally, the Fisher score is the gradient of the log-likelihood)
• The Fisher kernel is then

    k(x, x') = g(θ, x)^T F^{-1} g(θ, x')

  where F is the Fisher information matrix

    F = E_x[ g(θ, x) g(θ, x)^T ]
Fisher Information Matrix
• The presence of the Fisher information matrix makes the kernel invariant under nonlinear reparametrization of the density model θ → ψ(θ)
• In practice it is often infeasible to evaluate the Fisher information matrix exactly; instead we use the sample approximation

    F ≈ (1/N) Σ_{n=1}^{N} g(θ, x_n) g(θ, x_n)^T

• This is the covariance matrix of the Fisher scores, so the Fisher kernel k(x, x') = g(θ, x)^T F^{-1} g(θ, x') corresponds to a whitening of the Fisher scores (see the sketch below)
• More simply, one can omit F and use the non-invariant kernel

    k(x, x') = g(θ, x)^T g(θ, x')
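A minimal sketch for a simple assumed generative model, a univariate Gaussian with unknown mean μ and known σ, so that the Fisher score is g(μ, x) = (x - μ)/σ^2; F is estimated with the sample approximation above:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.0, 1.5                       # parameters of the assumed model N(x | mu, sigma^2)
X = rng.normal(mu, sigma, size=200)        # sample used for the empirical approximation of F

def fisher_score(x):
    # Fisher score g(theta, x) = d/d mu ln N(x | mu, sigma^2) = (x - mu) / sigma^2
    return np.array([(x - mu) / sigma ** 2])

# Empirical approximation F ~ (1/N) sum_n g(theta, x_n) g(theta, x_n)^T
G = np.array([fisher_score(x) for x in X])         # N x 1 matrix of scores
F = (G.T @ G) / len(X)

def fisher_kernel(x, xp):
    g, gp = fisher_score(x), fisher_score(xp)
    return g @ np.linalg.inv(F) @ gp               # k(x, x') = g^T F^{-1} g'

print(fisher_kernel(0.5, -0.3))
```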
Sigmoidal Kernel
• Provides a link between SVMs and neural networks:

    k(x, x') = tanh(a x^T x' + b)

• Its Gram matrix is in general not positive semidefinite (a quick empirical check is sketched below)
• But it is used in practice because it gives SVMs a superficial resemblance to neural networks
• A Bayesian neural network with an appropriate prior reduces to a Gaussian process
  • This provides a deeper link between neural networks and kernel methods
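One way to see the issue empirically is to compute the Gram matrix of the tanh kernel on some data and inspect its eigenvalues; for many choices of a, b, and data the smallest eigenvalue turns out to be negative. The parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))

def sigmoid_kernel(x, z, a=1.0, b=-1.0):
    # tanh(a x^T z + b); the parameter values are arbitrary illustration choices.
    return np.tanh(a * (x @ z) + b)

K = np.array([[sigmoid_kernel(x, z) for z in X] for x in X])

# Unlike a valid kernel, this Gram matrix may have negative eigenvalues.
print(np.linalg.eigvalsh(K).min())
```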