ECS289: Scalable Machine Learning
Cho-Jui Hsieh
UC Davis

Oct 18, 2016

Outline
Kernel Methods
Solving the dual problem
Kernel approximation

Support Vector Machines (SVM)

SVM is a widely used classifier. Given training data points x_1, ..., x_n, where each x_i ∈ R^d is a feature vector, consider the simple case with two classes: y_i ∈ {+1, −1}.

Goal: find a hyperplane that separates the two classes of data: if y_i = 1, w^T x_i ≥ 1 − ξ_i; if y_i = −1, w^T x_i ≤ −1 + ξ_i.

Support Vector Machines (SVM)

Given training data x_1, ..., x_n ∈ R^d with labels y_i ∈ {+1, −1}.

SVM primal problem (find optimal w ∈ R^d):

    min_{w, ξ}  (1/2) w^T w + C Σ_{i=1}^n ξ_i
    s.t.  y_i (w^T x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n

SVM dual problem (find optimal α ∈ R^n):

    min_α  (1/2) α^T Q α − e^T α
    s.t.  0 ≤ α_i ≤ C,  ∀ i = 1, ..., n

Q: n-by-n kernel matrix, Q_ij = y_i y_j x_i^T x_j
Each α_i corresponds to one training data point.
Primal-dual relationship: w = Σ_i α_i y_i x_i

Non-linearly separable problems

What if the data is not linearly separable?

Solution: map each data point x_i to a higher-dimensional (possibly infinite-dimensional) feature space ϕ(x_i), where the classes are linearly separable.

Support Vector Machines (SVM)

SVM primal problem:

    min_{w, ξ}  (1/2) w^T w + C Σ_{i=1}^n ξ_i
    s.t.  y_i (w^T ϕ(x_i)) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n

The dual problem for SVM:

    min_α  (1/2) α^T Q α − e^T α
    s.t.  0 ≤ α_i ≤ C,  for i = 1, ..., n

where Q_ij = y_i y_j ϕ(x_i)^T ϕ(x_j) and e = [1, ..., 1]^T.
Kernel trick: define K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).
At optimum: w = Σ_{i=1}^n α_i y_i ϕ(x_i)

Various types of kernels

Gaussian kernel: K(x_i, x_j) = e^{−γ ‖x_i − x_j‖²}
Polynomial kernel: K(x_i, x_j) = (γ x_i^T x_j + c)^d
Other kernels for specific problems:
Graph kernels (Vishwanathan et al., “Graph Kernels”. JMLR, 2010)
Pyramid kernel for image matching (Grauman and Darrell, “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features”. ICCV, 2005)

String kernel (Lodhi et al., “Text classification using string kernels”. JMLR, 2002).
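To make the Gaussian kernel concrete, here is a minimal NumPy sketch (not from the slides) that builds the kernel matrix G with G_ij = K(x_i, x_j) for a small, made-up dataset; the value of γ and the toy data are arbitrary.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=0.5):
    """G[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))      # 5 toy samples in R^3 (made up)
G = gaussian_kernel_matrix(X)
print(G.shape, np.allclose(G, G.T))  # (5, 5) True
```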

General Kernelized ERM

L2-regularized empirical risk minimization:

    min_w  (1/2) ‖w‖² + C Σ_{i=1}^n ℓ_i(w^T ϕ(x_i))

x_1, ..., x_n: training samples
ϕ(x_i): nonlinear mapping to a higher-dimensional space

Dual problem:

    min_α  (1/2) α^T Q α + Σ_{i=1}^n ℓ_i^*(−α_i)

where Q ∈ R^{n×n} and Q_ij = ϕ(x_i)^T ϕ(x_j) = K(x_i, x_j).

Kernel Ridge Regression

Given training samples (x_i, y_i), i = 1, ..., n.

    min_w  (1/2) ‖w‖² + (1/2) Σ_{i=1}^n (w^T ϕ(x_i) − y_i)²

Dual problem:

    min_α  α^T Q α + ‖α‖² − 2 α^T y
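Setting the gradient of the dual objective above to zero gives (Q + I)α = y (with unit regularization), so kernel ridge regression can be fit by solving a single n-by-n linear system. Below is a minimal sketch under that reading, using a Gaussian kernel on synthetic data; the helper names, γ, and λ are my own choices, not part of the slides.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and B."""
    d = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

def krr_dual_fit(X, y, lam=1.0, gamma=0.5):
    # Optimum of  min_a a^T Q a + lam*||a||^2 - 2 a^T y  satisfies (Q + lam*I) a = y
    Q = rbf(X, X, gamma)
    return np.linalg.solve(Q + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test, gamma=0.5):
    # f(x) = sum_i alpha_i K(x_i, x)
    return rbf(X_test, X_train, gamma) @ alpha

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = krr_dual_fit(X, y)
print(krr_predict(X, alpha, X[:3]))
```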

Scalability

Challenges for solving kernel SVMs (for a dataset with n samples):
Space: O(n²) for storing the n-by-n kernel matrix (can be reduced in some cases).
Time: O(n³) for computing the exact solution.

Greedy Coordinate Descent for Kernel SVM

Nonlinear SVM

SVM primal problem:

    min_{w, ξ}  (1/2) w^T w + C Σ_{i=1}^n ξ_i
    s.t.  y_i (w^T ϕ(x_i)) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n

w: an infinite-dimensional vector, very hard to solve for directly.

The dual problem for SVM:

    min_α  (1/2) α^T Q α − e^T α
    s.t.  0 ≤ α_i ≤ C,  for i = 1, ..., n

where Q_ij = y_i y_j ϕ(x_i)^T ϕ(x_j) and e = [1, ..., 1]^T.
Kernel trick: define K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).
Example: Gaussian kernel K(x_i, x_j) = e^{−γ ‖x_i − x_j‖²}

Nonlinear SVM

Can we solve the problem by dual coordinate descent?
The vector w = Σ_i y_i α_i ϕ(x_i) may have infinite dimensionality: we cannot maintain w.
The closed-form solution needs O(n) computational time:

    δ* = max(−α_i, min(C − α_i, (1 − (Qα)_i) / Q_ii))

(assuming Q is stored in memory).
Can we improve coordinate descent using the same O(n) time complexity?

Greedy Coordinate Descent

The Greedy Coordinate Descent (GCD) algorithm:
For t = 1, 2, ...
  1. Compute δ_i* := argmin_δ f(α + δ e_i) for all i = 1, ..., n
  2. Find the best coordinate by the criterion i* = argmax_i |δ_i*|
  3. Update α_{i*} ← α_{i*} + δ_{i*}*

Greedy Coordinate Descent

Other variable selection criteria:
The coordinate with the maximum step size: i* = argmax_i |δ_i*|
The coordinate with the maximum objective function reduction: i* = argmax_i ( f(α) − f(α + δ_i* e_i) )
The coordinate with the maximum projected gradient.
...

Greedy Coordinate Descent

How to compute the optimal coordinate? Closed-form solution of the best δ:

    δ_i* = max(−α_i, min(C − α_i, (1 − (Qα)_i) / Q_ii))

Observations:
  1. Computing all δ_i* needs O(n²) time.
  2. If Qα is stored in memory, computing all δ_i* only needs O(n) time.
  3. Maintaining Qα also needs O(n) time after each update.

Greedy Coordinate Descent

Initialize: α, z = Qα
For t = 1, 2, ...
  For all i = 1, ..., n, compute δ_i* = max(−α_i, min(C − α_i, (1 − z_i) / Q_ii))
  Let i* = argmax_i |δ_i*|
  α_{i*} ← α_{i*} + δ_{i*}*
  z ← z + q_{i*} δ_{i*}*   (q_{i*} is the i*-th column of Q)

(This is a simplified version of the Sequential Minimal Optimization (SMO) algorithm proposed in Platt et al., 1998.)
(A similar version is implemented in LIBSVM.)
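A minimal NumPy sketch of this greedy/SMO-style loop, assuming the full matrix Q (with Q_ij = y_i y_j K(x_i, x_j) and Q_ii > 0) fits in memory; the iteration cap and tolerance are arbitrary choices, not part of the slides.

```python
import numpy as np

def greedy_cd_svm_dual(Q, C=1.0, max_iters=1000, tol=1e-8):
    """Greedy coordinate descent for  min_a 0.5 a^T Q a - e^T a,  s.t. 0 <= a_i <= C.
    Maintains z = Q @ alpha so each iteration costs O(n)."""
    n = Q.shape[0]
    alpha = np.zeros(n)
    z = np.zeros(n)                               # z = Q @ alpha
    diag = np.diag(Q)
    for _ in range(max_iters):
        # Closed-form best step for every coordinate, O(n) given z
        delta = np.clip((1.0 - z) / diag, -alpha, C - alpha)
        i = int(np.argmax(np.abs(delta)))         # greedy selection: largest step
        if abs(delta[i]) < tol:
            break                                 # no coordinate can make progress
        alpha[i] += delta[i]
        z += Q[:, i] * delta[i]                   # O(n) maintenance of Q @ alpha
    return alpha
```

Given labels y and a kernel matrix G, one would call this with Q = (y[:, None] * y[None, :]) * G.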

How to solve problems with millions of samples?

Q ∈ R^{n×n} cannot be fully stored.
Keep a fixed amount of memory to “cache” the computed columns of Q.
For each coordinate update:
  If q_i is in memory, directly use it to update.
  If q_i is not in memory:
    1. Kick out the “least recently used” column.
    2. Recompute q_i and store it in memory.
    3. Update α_i.
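A sketch of that caching scheme (not LIBSVM's actual code), using Python's OrderedDict as the LRU structure; here `kernel(X, x)` is assumed to return the vector [K(x_1, x), ..., K(x_n, x)], and the cache size is a placeholder.

```python
from collections import OrderedDict
import numpy as np

class KernelColumnCache:
    """LRU cache for columns q_i of Q; a column is recomputed only on a cache miss."""
    def __init__(self, X, y, kernel, max_cols=100):
        self.X, self.y, self.kernel = X, y, kernel
        self.max_cols = max_cols
        self.cols = OrderedDict()                       # i -> q_i, ordered by recency

    def get_column(self, i):
        if i in self.cols:
            self.cols.move_to_end(i)                    # hit: mark as most recently used
            return self.cols[i]
        if len(self.cols) >= self.max_cols:
            self.cols.popitem(last=False)               # miss on a full cache: evict the LRU column
        q_i = self.y * self.y[i] * self.kernel(self.X, self.X[i])  # q_i[j] = y_j y_i K(x_j, x_i)
        self.cols[i] = q_i
        return q_i
```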

LIBSVM

Implemented in LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/
Other functionalities:
Multi-class classification
Support vector regression
Cross-validation

Kernel Approximation

Kernel Approximation

Kernel methods are hard to scale up because the kernel matrix G is an n-by-n matrix.
Can we approximate the kernel using a low-rank representation?
Two main algorithms:
Nystrom approximation: approximate the kernel matrix
Random features: approximate the kernel function

Kernel Matrix Approximation

We want to form a low-rank approximation of G ∈ R^{n×n}, where G_ij = K(x_i, x_j).
Can we do SVD? No, SVD needs to form the n-by-n matrix: O(n²) space and O(n³) time.
Can we do matrix completion or matrix sketching? Main problem: we need to generalize to new points.

Nystrom Approximation

Nystrom approximation: randomly sample m columns of G:

    G = [ W     G_12 ]
        [ G_21  G_22 ]

W: m-by-m square matrix (observed)
G_21: (n − m)-by-m matrix (observed)
G_12 = G_21^T (observed)
G_22: (n − m)-by-(n − m) matrix (not observed)

Form a kernel approximation based on these mn observed elements:

    G ≈ G̃ = [ W    ] W† [ W  G_12 ]
             [ G_21 ]

W†: pseudo-inverse of W

Nystrom Approximation

Why W†? It exactly recovers the top-left m-by-m block.

The kernel approximation:

    [ W     G_12 ]  ≈  [ W    ] W† [ W  G_12 ]  =  [ W     G_12         ]
    [ G_21  G_22 ]     [ G_21 ]                    [ G_21  G_21 W† G_12 ]

Nystrom Approximation

Algorithm:
  1. Sample m “landmark points” v_1, ..., v_m.
  2. Compute the kernel values between all training data and the landmark points:

         C = [ W    ]  =  [ K(x_1, v_1)  ···  K(x_1, v_m) ]
             [ G_21 ]     [ K(x_2, v_1)  ···  K(x_2, v_m) ]
                          [      ⋮        ⋱        ⋮      ]
                          [ K(x_n, v_1)  ···  K(x_n, v_m) ]

  3. Form the kernel approximation G̃ = C W† C^T (no need to explicitly form the n-by-n matrix).

Time complexity: mn kernel evaluations (O(mnd) time for classical kernels).
How to choose m? Trade-off between accuracy vs. computational time and memory space.

Nystrom Approximation: Training

Solve the dual problem using the low-rank representation. Example: kernel ridge regression

    min_α  α^T G̃ α + λ ‖α‖² − 2 y^T α

Solve the linear system (G̃ + λI) α = y.
Use iterative methods (such as CG) with fast matrix-vector multiplication:

    (G̃ + λI) p = C W† C^T p + λ p

O(nm) time complexity per iteration.
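A sketch of the CG approach using SciPy's LinearOperator, so that (G̃ + λI)p is applied as C (W†(C^T p)) + λp without ever forming the n-by-n matrix; C and W follow the notation above, and λ is an arbitrary choice.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def nystrom_krr_solve(C, W, y, lam=1.0):
    """Solve (C W^+ C^T + lam*I) alpha = y by conjugate gradient.
    Each matrix-vector product costs O(nm) instead of O(n^2)."""
    n = C.shape[0]
    W_pinv = np.linalg.pinv(W)

    def matvec(p):
        return C @ (W_pinv @ (C.T @ p)) + lam * p

    A = LinearOperator((n, n), matvec=matvec)
    alpha, info = cg(A, y)
    return alpha
```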

Nystrom Approximation: Training

Another approach: reduce to linear ERM problems!
Rewrite G̃ = C W† C^T = C (W†)^{1/2} (W†)^{1/2} C^T.
So the approximate kernel can be represented by a linear kernel with feature matrix C (W†)^{1/2}:

    K̃(x_i, x_j) = G̃_ij = u_i^T u_j,   where u_j = (W†)^{1/2} [K(x_j, v_1), ..., K(x_j, v_m)]^T

The problem is equivalent to a linear model with features u_1, ..., u_n ∈ R^m:
Kernel SVM ⇒ linear SVM with m features
Kernel ridge regression ⇒ linear ridge regression with m features
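A minimal sketch of this feature-map view: build C and W from the landmarks, take the matrix square root of W† via an eigendecomposition, and return U = C (W†)^{1/2}, whose rows can be fed to any linear SVM or ridge solver. The kernel helper and the eigenvalue cutoff are assumptions of the sketch.

```python
import numpy as np

def nystrom_features(X, landmarks, kernel):
    """Map each x_j to u_j = (W^+)^{1/2} [K(x_j, v_1), ..., K(x_j, v_m)]^T,
    so that u_i^T u_j approximates K(x_i, x_j)."""
    C = kernel(X, landmarks)                 # n x m kernel values to the landmarks
    W = kernel(landmarks, landmarks)         # m x m
    vals, vecs = np.linalg.eigh(W)           # (W^+)^{1/2} via eigendecomposition
    keep = vals > 1e-12                      # drop (near-)zero eigenvalues
    W_pinv_sqrt = vecs[:, keep] @ np.diag(vals[keep] ** -0.5) @ vecs[:, keep].T
    return C @ W_pinv_sqrt                   # n x m feature matrix U, with U U^T ≈ G
```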

Nystrom Approximation: Prediction

Given the dual solution α, prediction for a test sample x:

    Σ_{i=1}^n α_i K̃(x_i, x)    or    Σ_{i=1}^n α_i K(x_i, x) ?

Need to use the approximate kernel instead of the original kernel!
The approximate kernel gives much better accuracy, because training and testing should use the same kernel.

Generalization bounds:
(Alaoui and Mahoney, “Fast Randomized Kernel Methods With Statistical Guarantees”. 2014.)
(Rudi et al., “Less is More: Nystrom Computational Regularization”. NIPS 2015.)
(Using a small number of landmark points can be viewed as a form of regularization.)

Nystrom Approximation: Prediction

How to define K̃(x, x_i)? Extend the Nystrom approximation to the test point: append the row [K(x, v_1), ..., K(x, v_m)] to C and apply the same formula C W† C^T.

Compute u = (W†)^{1/2} [K(v_1, x), ..., K(v_m, x)]^T (needs m kernel evaluations).
The approximate kernel is then

    K̃(x, x_i) = u^T u_i

Nystrom Approximation: Prediction

Prediction:

    f(x) = Σ_{i=1}^n α_i u^T u_i = u^T ( Σ_{i=1}^n α_i u_i )

(Σ_{i=1}^n α_i u_i) can be pre-computed ⇒ prediction time is m kernel evaluations.
Original kernel method: needs n kernel evaluations for prediction.

Summary:

                     Quality    Training time                       Prediction time
    Original kernel  exact      n kernel evals + kernel training    n kernel evals
    Nystrom (m)      rank m     nm kernel evals + linear training   m kernel evals
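A sketch of that prediction step: precompute s = Σ_i α_i u_i = U^T α once after training, then each test point needs only m kernel evaluations. The names W_pinv_sqrt, kernel, and landmarks follow the earlier sketches and are assumptions.

```python
import numpy as np

def nystrom_predict(x, landmarks, kernel, W_pinv_sqrt, s):
    """f(x) = u(x)^T s, where s = U.T @ alpha is precomputed and
    u(x) = (W^+)^{1/2} [K(v_1, x), ..., K(v_m, x)]^T."""
    k_x = kernel(x[None, :], landmarks).ravel()   # m kernel evaluations
    u_x = W_pinv_sqrt @ k_x
    return float(u_x @ s)

# After training:  U = nystrom_features(X_train, landmarks, kernel);  s = U.T @ alpha
```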

Nystrom Approximation

How to select the landmark points v_1, ..., v_m?
Traditional approach: uniform random sampling from the training data.
Importance sampling (leverage scores):
  Drineas and Mahoney, “On the Nystrom method for approximating a Gram matrix for improved kernel-based learning”. JMLR 2005.
  Gittens and Mahoney, “Revisiting the Nystrom method for improved large-scale machine learning”. ICML 2013.
K-means sampling (performs extremely well): run k-means clustering and set v_1, ..., v_m to be the cluster centers (a short sketch follows after this list).
  Zhang et al., “Improved Nystrom low rank approximation and error analysis”. ICML 2008.
Subspace distance:
  Lim et al., “Double Nystrom method: An efficient and accurate Nystrom scheme for large-scale data sets”. ICML 2015.
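A sketch of the two simplest landmark-selection strategies above, uniform sampling and k-means centers (the latter via scikit-learn); m and the random seeds are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def uniform_landmarks(X, m, seed=0):
    """Pick m training points uniformly at random as landmarks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx]

def kmeans_landmarks(X, m, seed=0):
    """Use the m k-means cluster centers as landmarks (Zhang et al., ICML 2008)."""
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_
```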

Nystrom Approximation (other related papers)

Distributed setting: (Kumar et al., “Ensemble Nystrom method”. NIPS 2009.)
Block diagonal + low-rank approximation: (Si et al., “Memory efficient kernel approximation”. ICML 2014.)
Nystrom method for fast prediction: (Hsieh et al., “Fast prediction for large-scale kernel machines”. NIPS 2014.)
Structured landmark points: (Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016.)

Kernel Approximation (random features)

Random Features

Directly approximate the kernel function (data-independent).
Generate nonlinear “features” ⇒ reduce the problem to linear model training.
Several examples:
Random Fourier features: (Rahimi and Recht, “Random features for large-scale kernel machines”. NIPS 2007.)
Random features for polynomial kernels: (Kar and Karnick, “Random feature maps for dot product kernels”. AISTATS 2012.)

Random Fourier Features

Random features for “shift-invariant kernels”: a continuous kernel is shift-invariant if K(x, y) = k(x − y).

Gaussian kernel: K(x, y) = e^{−‖x − y‖²}
Laplacian kernel: K(x, y) = e^{−‖x − y‖_1}

A shift-invariant kernel (if positive definite) can be written as

    k(x − y) = ∫_{R^d} P(w) e^{j w^T (x − y)} dw

for some probability distribution P(w).

Random Fourier Features

Taking the real part, we have

    k(x − y) = ∫_{R^d} P(w) cos(w^T x − w^T y) dw
             = ∫_{R^d} P(w) ( cos(w^T x) cos(w^T y) + sin(w^T x) sin(w^T y) ) dw
             = E_{w∼P(w)} [ z_w(x)^T z_w(y) ]

where z_w(x)^T = [cos(w^T x), sin(w^T x)].
z_w(x)^T z_w(y) is an unbiased estimator of K(x, y) if w is sampled from P(·).

Random Fourier Features

Sample m vectors w_1, ..., w_m from the distribution P(·).
Generate random features for each sample:

    u_i = [cos(w_1^T x_i), sin(w_1^T x_i), cos(w_2^T x_i), sin(w_2^T x_i), ..., cos(w_m^T x_i), sin(w_m^T x_i)]^T

K(x_i, x_j) ≈ u_i^T u_j (larger m leads to a better approximation).
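A minimal sketch of random Fourier features for the Gaussian kernel K(x, y) = exp(−γ‖x − y‖²), for which P(w) is N(0, 2γI); the 1/√m scaling (not written on the slide) makes u_i^T u_j an average of the m unbiased estimates, so U U^T approximates G. The γ, m, and seed are placeholders.

```python
import numpy as np

def random_fourier_features(X, m, gamma=0.5, seed=0):
    """Random Fourier features approximating K(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, d))  # w_1, ..., w_m ~ P(w) = N(0, 2*gamma*I)
    proj = X @ W.T                                           # proj[i, k] = w_k^T x_i
    # 1/sqrt(m) so that u_i^T u_j = (1/m) sum_k cos(w_k^T (x_i - x_j)) ≈ K(x_i, x_j)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

# Sanity check idea: for large m, U @ U.T should be close to the exact Gaussian kernel matrix.
```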

Random Features

    G ≈ U U^T,   where U = [u_1^T; ...; u_n^T]

This is a rank-2m approximation.
The problem can be reduced to linear classification/regression on u_1, ..., u_n.
Time complexity: O(nmd) time + linear training (Nystrom: O(nmd + m³) time + linear training).
Prediction time: O(md) per sample (Nystrom: O(md) per sample).

Fastfood

Random Fourier features require O(md) time to generate m features.
Main bottleneck: computing w_i^T x for all i = 1, ..., m.
Can we compute this in O(m log d) time?
Fastfood: proposed in (Le et al., “Fastfood Approximating Kernel Expansions in Loglinear Time”. ICML 2013.)
Reduce the time by using “structured random features”.

Fastfood

Tool: fast matrix-vector multiplication for the Hadamard matrix H_d ∈ R^{d×d}; H_d x requires only O(d log d) time.

Hadamard matrix:

    H_1 = [1]
    H_2 = [ 1   1 ]
          [ 1  −1 ]
    H_{2^k} = [ H_{2^{k−1}}   H_{2^{k−1}} ]
              [ H_{2^{k−1}}  −H_{2^{k−1}} ]

H_d x can be computed efficiently by the Fast Hadamard Transform (dynamic programming).

Fastfood

Sample m = d random features by

    W = H_d diag(v_1, v_2, ..., v_d)

so that W x only needs O(d log d) time.
Instead of sampling W (d² elements), we just sample v_1, ..., v_d from N(0, σ).
Each row of W is still N(0, σ) (but the rows are not independent).
Faster computation, but performance is slightly worse than original random features.
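A sketch of this simplified structured construction: W x = H_d (v * x) (elementwise product), computed with an in-place fast Walsh-Hadamard transform in O(d log d) time. It assumes d is a power of two and omits the permutation/scaling matrices of the full Fastfood construction; the small check against an explicit Hadamard matrix is for illustration only.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: returns H_d @ x in O(d log d) time (d a power of two)."""
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):             # butterfly over blocks of size 2h
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fastfood_projection(x, v):
    """Compute W x for W = H_d diag(v) without ever forming the d-by-d matrix W."""
    return fwht(v * x)

d = 8
rng = np.random.default_rng(0)
v = rng.normal(scale=1.0, size=d)                # v_1, ..., v_d ~ N(0, sigma), with sigma = 1 here
x = rng.standard_normal(d)
H = np.array([[(-1) ** bin(i & j).count("1") for j in range(d)] for i in range(d)])
print(np.allclose(fastfood_projection(x, v), H @ (v * x)))   # True
```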

Extensions

Fastfood: structured random features to improve computational speed: (Le et al., “Fastfood Approximating Kernel Expansions in Loglinear Time”. ICML 2013.)

Structured landmark points for Nystrom approximation: (Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016.)

Doubly stochastic gradient with random features: (Dai et al., “Scalable Kernel Methods via Doubly Stochastic Gradients”. NIPS 2014.)

Generalization bound (?)

Coming up
Choose the paper for presentation (due this Sunday, Oct 23).

Questions?