Stochastic Optimization for Machine Learning

Stochastic Optimization for Machine Learning ICML 2010, Haifa, Israel Tutorial by Nati Srebro and Ambuj Tewari Toyota Technological Institute at Chicago

Goals

• Introduce the Stochastic Optimization setup, and its relationship to Statistical Learning and Online Learning
• Understand Stochastic Gradient Descent: formulation, analysis and use in machine learning
• Learn about extensions and generalizations to Gradient Descent and its analysis
• Become familiar with concepts and approaches in Stochastic Optimization, and their Machine Learning counterparts

Main Goal: Machine Learning is Stochastic Optimization

Outline

• Gradient Descent and Stochastic Gradient Descent
  – Including sub-gradient descent
• The Stochastic Optimization setup and the two main approaches:
  – Sample Average Approximation
  – Stochastic Approximation
• Machine Learning as Stochastic Optimization
  – Leading example: L2 regularized linear prediction, as in SVMs
• Connection to Online Learning (break)
• More careful look at Stochastic Gradient Descent
• Generalization to other norms: Mirror Descent
• Faster convergence under special assumptions

Prelude: Gradient Descent

  minw∈W F(w)

Start at some w(0)
Iterate: w(k+1) ← ΠW( w(k) − α(k) ∇F(w(k)) )
where ΠW(w) = arg minv∈W ||v − w||2


Gradient Descent: Analysis
• We will focus on convex, Lipschitz functions.
• G-Lipschitz: |F(v) − F(u)| ≤ G·||u − v||2
• If F is differentiable: ||∇F(w)||2 ≤ G
• What if F is not differentiable? Subgradient!

Subgradient of a Convex Function

• If F(·) is differentiable at w0, the gradient gives a linear lower bound on F(·):
    ∀v  F(v) ≥ F(w0) + ⟨v − w0, g⟩,   g = ∇F(w0)

• In general, a subgradient is any g corresponding to a linear lower bound:
    ∀v  F(v) ≥ F(w0) + ⟨v − w0, g⟩  ⇔  g ∈ ∇F(w0)

• G-Lipschitz: ||g||2 ≤ G for all subgradients g ∈ ∇F(w)

Subgradients: Examples

• F(z) = |z|
    ∇F(z) = {−1} for z < 0,   ∇F(0) = [−1, 1],   ∇F(z) = {+1} for z > 0

• F(z) = [1 − z]+
    ∇F(z) = {−1} for z < 1,   ∇F(1) = [−1, 0],   ∇F(z) = {0} for z > 1

• F(w) = ||w||1
    ∇F(w)[i] = sign(w[i]) if w[i] ≠ 0,   [−1, 1] if w[i] = 0

Prelude II: Sub-Gradient Descent

  minw∈W F(w)

Start at some w(0)
Iterate: Get subgradient g(k) ∈ ∇F(w(k))
         w(k+1) ← ΠW( w(k) − α(k) g(k) )
Stepsize: α(k) = (B/G)/√k
where ΠW(w) = arg minv∈W ||v − w||2

Guarantee on sub-optimality, assuming ||∇F(w)||2 ≤ G and ||w*||2 ≤ B:

  F(w(k)) − F(w*) ≤ O( GB/√k )

i.e. O( G²B²/ε² ) iterations for ε sub-optimality.

This is the best possible using only F(w) and ∇F(w) (if the dimension is unbounded).
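To make the update concrete, here is a minimal Python sketch of projected sub-gradient descent with the α(k) = (B/G)/√k stepsize, assuming W is the Euclidean ball of radius B; the objective, its subgradient oracle, and the toy L1 example are illustrative stand-ins, not part of the tutorial.

```python
import numpy as np

def project_l2_ball(w, B):
    """Euclidean projection onto W = {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

def subgradient_descent(subgrad_F, d, B, G, num_iters):
    """Projected sub-gradient descent with stepsize alpha_k = (B/G)/sqrt(k).

    subgrad_F(w) should return any subgradient of F at w.
    Returns the last iterate; its sub-optimality decays like O(GB/sqrt(k))."""
    w = np.zeros(d)
    for k in range(1, num_iters + 1):
        g = subgrad_F(w)
        alpha = (B / G) / np.sqrt(k)
        w = project_l2_ball(w - alpha * g, B)
    return w

# Toy usage: F(w) = ||w - w_star||_1, which is Lipschitz but not differentiable.
if __name__ == "__main__":
    d, B = 5, 10.0
    w_star = np.linspace(-1, 1, d)
    subgrad = lambda w: np.sign(w - w_star)   # a valid subgradient of the L1 distance
    G = np.sqrt(d)                            # ||sign(.)||_2 <= sqrt(d)
    w_hat = subgradient_descent(subgrad, d, B, G, num_iters=10000)
    print(np.abs(w_hat - w_star).sum())       # prints a small sub-optimality
```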

Stochastic Sub-Gradient Descent

  minw∈W F(w)

Start at some w(0)
Iterate: Get subgradient estimate g(k), s.t. E[g(k)] ∈ ∇F(w(k))
         w(k+1) ← ΠW( w(k) − α(k) g(k) )
Output w̄(k) = (1/k) ∑_{i=1}^k w(i)
Stepsize: α(k) = (B/G)/√k
where ΠW(w) = arg minv∈W ||v − w||2

Guarantee on sub-optimality, assuming ||g(k)||2 ≤ G and ||w*||2 ≤ B:

  E[ F(w̄(k)) − F(w*) ] ≤ O( GB/√k )

i.e. O( G²B²/ε² ) iterations.

Same guarantee as the (best possible) full-gradient guarantee:
# of stochastic iterations = # of full gradient iterations

SGD for Machine Learning

  minw L̂(w) = (1/m) ∑_{i=1}^m loss(w on (xi, yi))

Subgradient estimate: g(k) = ∇w loss( w(k) on (xi, yi) ), with i drawn at random

Example: linear prediction with hinge loss (SVM)

L2-regularized Linear Classification aka Support Vector Machines

[Figure: linear separator with |w| = 1; correctly classified points satisfy y⟨w,x⟩ ≥ M, i.e. margin M]

L2-regularized Linear Classification aka Support Vector Machines

  min_{||w||2 ≤ B} (1/m) ∑_{i=1}^m ℓ(⟨w, xi⟩, yi)
or
  minw (1/m) ∑_{i=1}^m ℓ(⟨w, xi⟩, yi) + (λ/2) ||w||²

  ℓ(⟨w,x⟩, y) = [1 − y⟨w,x⟩]+

[Figure: separator w; correctly classified points satisfy y⟨w,x⟩ ≥ 1; margin M = 1/|w|]

SGD for Machine Learning

  min_{||w||2 ≤ B} L̂(w) = (1/m) ∑_{i=1}^m ℓ(⟨w, xi⟩, yi),    ℓ(⟨w,x⟩, y) = [1 − y⟨w,x⟩]+

Subgradient estimate (example: linear prediction with hinge loss, as in SVMs):

  g(k) = ℓ'(⟨w(k), xi⟩, yi) · xi
       = −yi xi   if yi⟨w(k), xi⟩ < 1
       = 0        otherwise

  ||g(k)||2 ≤ G = supi ||xi||2

Start at some w(0)
Iterate: Draw i ∈ 1..m at random
         If yi⟨w, xi⟩ < 1:  w ← w + α(k) yi xi
         If ||w||2 ≥ B:     w ← B·w / ||w||2
         wsum += w
Output wsum / k
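A minimal Python sketch of the boxed update above (hinge-loss subgradient step, projection onto the ball of radius B, iterate averaging); the stepsize follows the earlier slides' α(k) = (B/G)/√k, and the synthetic separable data and parameter choices are purely illustrative.

```python
import numpy as np

def sgd_svm(X, y, B, num_iters, seed=0):
    """SGD for hinge loss with an L2-norm constraint ||w||_2 <= B, following the
    boxed update: step on a margin-violating example, project back, average iterates."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    G = np.max(np.linalg.norm(X, axis=1))    # ||g^(k)||_2 <= sup_i ||x_i||_2
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for k in range(1, num_iters + 1):
        i = rng.integers(m)                  # draw i in 1..m at random
        alpha = (B / G) / np.sqrt(k)         # alpha^(k) = (B/G)/sqrt(k)
        if y[i] * (X[i] @ w) < 1:            # hinge loss has a nonzero subgradient here
            w = w + alpha * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                         # project onto {||w||_2 <= B}
            w = B * w / norm
        w_sum += w
    return w_sum / num_iters                 # averaged iterate

# Toy usage on a linearly separable synthetic problem (illustration only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    w_true = rng.normal(size=20)
    y = np.sign(X @ w_true)
    w_hat = sgd_svm(X, y, B=10.0, num_iters=20000)
    print("train error:", np.mean(np.sign(X @ w_hat) != y))
```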

Stochastic vs Batch Gradient Descent

  minw L̂(w) = (1/m) ∑_{i=1}^m loss(w on (xi, yi))

Stochastic: for each example (xi, yi) in turn, compute gi = ∇loss(w on (xi, yi)) and immediately update w ← w − gi.

Batch: compute all the gradients g1, …, gm at the current w, then take a single step
  w ← w − ∑ gi,   using   ∇L̂(w) = (1/m) ∑ gi.

Stochastic vs Batch Gradient Descent

• Intuitive argument: if only taking simple gradient steps, better to be stochastic (will return to this later)

• Formal result (assuming ||x||2 ≤ X):
  – Stochastic Gradient Descent runtime:  O( (X²B²/ε²) · d )
  – Batch Gradient Descent runtime:       O( (X²B²/ε²) · m·d )

If only using gradients, and only assuming Lipschitz, this is the optimal runtime.
• Compared with second order methods?
• For specific objectives? With stronger assumptions?

Stochastic Optimization Setting

  minw∈W F(w) = Ez[ f(w, z) ]

based on only stochastic information on F:
– Only access to unbiased estimates of F(w) and ∇F(w)
– No direct access to F(w)

• E.g. when the distribution of z is unknown, and we can only get samples z(i):
  – g(k) = ∇w f(w(k), z(k)) is an unbiased estimator of ∇F(w(k))

• Traditional applications:
  – Optimization under uncertainty
    • Uncertainty about network performance
    • Uncertainty about client demands
    • Uncertainty about system behavior in control problems
  – Complex systems where it is easier to sample than to integrate over z
    • "Monte Carlo" optimization

Machine Learning is Stochastic Optimization

• Up to now: apply stochastic optimization to minimizing the empirical error

• But learning a good predictor is itself a stochastic optimization problem:

    minh L(h) = Ex,y[ loss(h(x), y) ]

  without knowing the true distribution of (x,y), given a sample (x1,y1),…,(xm,ym).
  Special case of stochastic optimization:
  – the optimization variable is the predictor (hypothesis) h
  – the stochastic objective is the generalization error (risk)
  – stochasticity is over the instances we would like to be able to predict

• Vapnik's "General Learning Setting" is generic stochastic optimization:

    minh F(h) = Ez[ f(h, z) ]

  based on an i.i.d. sample z1,…,zm    [Vapnik95]

General Learning: Examples

Minimize F(h) = Ez[ f(h; z) ] based on a sample z1, z2, …, zm

• Supervised learning:  z = (x,y);  h specifies a predictor h: X → Y;
  f( h ; (x,y) ) = loss(h(x), y)
• Unsupervised learning, e.g. k-means clustering:  z = x ∈ Rd;
  h = (μ[1],…,μ[k]) ∈ Rd×k specifies k cluster centers;
  f( (μ[1],…,μ[k]) ; x ) = minj ||μ[j] − x||²
• Density estimation:  h specifies a probability density ph(x);
  f( h ; x ) = −log ph(x)
• Optimization in an uncertain environment, e.g.:
  z = traffic delays on each road segment;
  h = route chosen (indicator over road segments in the route);
  f( h ; z ) = ⟨h, z⟩ = total delay along the route
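A small Python sketch of how each example instantiates f(h; z); the function names and toy inputs are illustrative, not from the tutorial.

```python
import numpy as np

# Each learning problem below is an instance of: minimize F(h) = E_z[ f(h; z) ].
# These are illustrative instantiations of f(h; z), not any particular library's API.

def f_supervised(h, z, loss):
    """Supervised learning: z = (x, y), h is a predictor, f = loss(h(x), y)."""
    x, y = z
    return loss(h(x), y)

def f_kmeans(centers, x):
    """k-means clustering: h = (mu[1], ..., mu[k]), f = min_j ||mu[j] - x||^2."""
    return min(np.sum((mu - x) ** 2) for mu in centers)

def f_density(log_p, x):
    """Density estimation: h specifies a density p_h, f = -log p_h(x)."""
    return -log_p(x)

def f_route(route_indicator, delays):
    """Uncertain environment: h = route indicator, z = delays, f = <h, z>."""
    return float(route_indicator @ delays)

if __name__ == "__main__":
    # k-means example: two centers, one sample point
    centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
    print(f_kmeans(centers, np.array([1.0, 1.0])))                   # 2.0 (closest center is the origin)
    # routing example: route uses segments 0 and 2
    print(f_route(np.array([1, 0, 1]), np.array([3.0, 10.0, 2.0])))  # 5.0 total delay
```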

Stochastic Convex Optimization

• We will focus mostly on stochastic convex optimization:
    minw∈W F(w) = E[ f(w, z) ]
  – W is a convex subset of a normed vector space (e.g. Rd)
  – f(w, z), and so also F(w), is convex in w.

• For supervised learning:
    minw∈W L(w) = E[ loss(⟨w, φ(x,y)⟩, y) ]
  with W a convex subset of a normed vector space and a convex loss.

A non-linear predictor will not yield a convex L(w) with any meaningful-for-prediction loss function (linear in some implicit feature space is OK).

Stochastic Convex Optimization in Machine Learning

    minw∈W L(w) = E[ loss(⟨w, φ(x,y)⟩, y) ]
  with W a convex subset of a normed vector space and a convex loss.

• Can capture different:
  – convex loss functions
  – norms (regularizers)
  – explicit or implicit feature maps

• Including:
  – SVMs (L2 norm with hinge loss)
  – Regularized Logistic Regression
  – CRFs, Structural SVMs (L2 norm with structured convex loss functions)
  – LASSO (L1 norm with squared loss)
  – Group LASSO (group L2,1 or L∞,1 norm)
  – Trace-Norm Regularization (as in MMMF, multi-task learning)

• Does NOT include, e.g.:
  – Non-convex loss (e.g. 0/1 loss)
  – Decision trees, decision lists
  – Formulas (CNF, DNF, and variants)

These are instances of stochastic optimization, but not stochastic convex optimization.

Stochastic Optimization vs Statistical Learning

Stochastic Optimization:
• Focus on computational efficiency
• Generally assumes unlimited sampling
  – as in Monte Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space
  – complexity control through the norm
• Discussion mostly parametric, BUT:
  – most convergence results are dimension-independent
  – methods and analysis applicable also to non-parametric problems
• Mostly convex objectives (or at least convex relaxations)

Statistical Learning:
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes
  – linear predictors, but also combinatorial hypothesis classes
  – generic measures of complexity such as VC-dim, fat shattering, Rademacher
• Parametric (finite-dim) and non-parametric classes
• Non-convex classes and loss functions
  – multi-layer networks
  – sparse and low-rank models
  – combinatorial classes

Two Approaches to Stochastic Optimization

  minw∈W F(w) = E[ f(w, z) ]

• Sample Average Approximation (SAA): [Kleywegt, Shapiro, Homem-de-Mello 2001], [Rubinstein Shapiro 1990], [Plambeck et al 1996]
  – Collect a sample z1,…,zm
  – Minimize F̂(w) = (1/m) ∑_{i=1}^m f(w, zi)
  – In our terminology: Empirical Risk Minimization
  – Analysis typically based on Uniform Concentration

• Stochastic Approximation (SA): [Robbins Monro 1951]
  – Update w(k) based on a weak estimator of F(w(k)), ∇F(w(k)), etc.
    • E.g., based on g(k) = ∇f(w(k), z(k))
  – Simplest method: stochastic gradient descent
  – Similar to the online approach in learning (more on this later)

Stochastic Approximation for Machine Learning

  minw L(w) = E[ ℓ(⟨w,x⟩, y) ],    ||x||2 ≤ X,  |ℓ'| ≤ 1

• Our previous approach was a mixed approach:
  – SAA: collect a sample of size m and minimize the empirical error (with a norm constraint):
      min_{||w||2 ≤ B} L̂(w) = (1/m) ∑_{i=1}^m ℓ(⟨w, xi⟩, yi)
  – Optimize this with SGD, i.e. applying SA to the empirical objective
    • At each SGD iteration, pick a random (x,y) from the empirical sample
  – The SGD guarantee is on empirical sub-optimality:
      L̂(w̄(k)) ≤ L̂(ŵ) + O( √(X²B²/k) )
  – To get a guarantee on L(w̄(k)), need to combine with uniform concentration:
      sup_{||w||≤B} | L(w) − L̂(w) | ≤ O( √(X²B²/m) )

• Pure SA approach:
  – Optimize L(w) directly
    • At each iteration, use an independent sample from the source distribution
  – Same SGD guarantee, but directly on the generalization error:
      L(w̄(k)) ≤ L(w*) + O( √( X² ||w*||²₂ / k ) )

Stochastic Approximation (SGD) for Machine Learning

SGD on Empirical Objective (SA inside SAA):   min_{||w||2 ≤ B} L̂(w)
  Draw (x1,y1),…,(xm,ym) ∼ D
  Start at some w(0)
  Iterate: Draw i = i(k) ∼ Unif(1..m)
           g(k) = ℓ'(⟨w(k), xi⟩, yi) xi
           w(k+1) ← ΠB( w(k) − α(k) g(k) )
  Output w̄(k) = (1/k) ∑_{j=1}^k w(j)

Direct SA Approach:
  Start at some w(0)
  Iterate: Draw (x(k), y(k)) ∼ D
           g(k) = ℓ'(⟨w(k), x(k)⟩, y(k)) x(k)
           w(k+1) ← w(k) − α(k) g(k)
  Output w̄(k) = (1/k) ∑_{j=1}^k w(j)

• SA requires a fresh sample at every iteration, i.e. needs m ≥ k
• If we want more iterations, we are limited by the sample size m
• Do we need k > m iterations?
• And also, recall the earlier question: is SAA with 2nd-order optimization better than SGD?
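A minimal sketch contrasting the two boxes above: SGD on the empirical objective (resample i ∼ Unif(1..m) from a fixed training set, with projection onto ||w||2 ≤ B) versus the pure SA approach (a fresh (x, y) ∼ D at every iteration, no projection). The hinge loss, the synthetic distribution, and the bound used for the stepsize are illustrative assumptions.

```python
import numpy as np

def hinge_subgrad(w, x, y):
    """Subgradient of the hinge loss [1 - y<w,x>]_+ with respect to w."""
    return -y * x if y * (x @ w) < 1 else np.zeros_like(x)

def sgd_on_empirical(X, y, B, k, rng):
    """SA inside SAA: resample i ~ Unif(1..m) from the fixed training set and
    project back onto {||w||_2 <= B}; output the averaged iterate."""
    m, d = X.shape
    G = np.max(np.linalg.norm(X, axis=1))
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, k + 1):
        i = rng.integers(m)
        w = w - (B / G) / np.sqrt(t) * hinge_subgrad(w, X[i], y[i])
        if np.linalg.norm(w) > B:
            w = B * w / np.linalg.norm(w)
        w_sum += w
    return w_sum / k

def sgd_direct_sa(draw_example, d, B, X_bound, k):
    """Pure SA: a fresh (x, y) ~ D at every iteration; no projection is needed."""
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, k + 1):
        x, y = draw_example()
        w = w - (B / X_bound) / np.sqrt(t) * hinge_subgrad(w, x, y)
        w_sum += w
    return w_sum / k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 10
    w_true = np.ones(d)

    def draw_example():
        x = rng.normal(size=d)
        return x, float(np.sign(x @ w_true))

    pairs = [draw_example() for _ in range(500)]   # fixed training set for the mixed approach
    X = np.stack([x for x, _ in pairs])
    y = np.array([label for _, label in pairs])

    w_mixed = sgd_on_empirical(X, y, B=5.0, k=5000, rng=rng)
    w_pure = sgd_direct_sa(draw_example, d, B=5.0, X_bound=3 * np.sqrt(d), k=5000)
    print(np.mean(np.sign(X @ w_mixed) != y), np.mean(np.sign(X @ w_pure) != y))
```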

Stochastic Approximation (Stochastic Gradient Descent) for Machine Learning

SGD on Empirical Objective (SA inside SAA):   min_{||w||2 ≤ B} L̂(w)
  Draw (x1,y1),…,(xm,ym) ∼ D
  Start at some w(0)
  Iterate: Draw i = i(k) ∼ Unif(1..m)
           g(k) = ℓ'(⟨w(k), xi⟩, yi) xi
           w(k+1) ← ΠB( w(k) − α(k) g(k) )
  Output w̄(k) = (1/k) ∑_{j=1}^k w(j)

  L̂(w̄(k)) ≤ L̂(ŵ) + O( √(X²B²/k) )
  sup_{||w||≤B} | L(w) − L̂(w) | ≤ O( √(X²B²/m) )
  ⇒ L(w̄(k)) ≤ L(w*) + O( √(X²B²/k) ) + O( √(X²B²/m) )

Direct SA Approach:
  Start at some w(0)
  Iterate: Draw (x(k), y(k)) ∼ D
           g(k) = ℓ'(⟨w(k), x(k)⟩, y(k)) x(k)
           w(k+1) ← w(k) − α(k) g(k)
  Output w̄(k) = (1/k) ∑_{j=1}^k w(j)

  L(w̄(k)) ≤ L(w*) + O( √(X²B²/k) )

Stepsize: α(k) = (B/X)/√k,   ||w*|| ≤ B

SA vs SAA for L2 Regularized Learning

  L(w) = E[ ℓ(⟨w,x⟩, y) ],    |ℓ'| ≤ 1,  ||x||2 ≤ X

• SA (Single-Pass Stochastic Gradient Descent)
  – fresh sample (x(k), y(k)) at each iteration
  – i.e. a single pass over the data
  – After k iterations:  L(w̄(k)) ≤ L(w*) + O( √( X² ||w*||²₂ / k ) )
  ⇒ to get L(w) ≤ L(w*) + ε:
      sample size m = O( X² ||w*||²₂ / ε² )
      runtime = O(md) = O( X² ||w*||²₂ d / ε² )

• SAA (Empirical Risk Minimization)
  – Sample size to guarantee L(w) ≤ L(w*) + ε:  m = Ω( X² ||w*||²₂ / ε² )
  ⇒ using any method:  runtime ≥ Ω( X² ||w*||²₂ d / ε² )
  – And with a sample of size m, whatever we do, we can't guarantee generalization error better than:
      L(w*) + O( √( X² ||w*||²₂ / m ) )

SA vs SAA for L2 Regularized Learning

  L(w) = E[ ℓ(⟨w,x⟩, y) ],    |ℓ'| ≤ 1,  ||x||2 ≤ X
  ŵ = arg min_{||w||≤B} L̂(w)        w(m) = output of one-pass SGD on m samples

• Summary:
  – Can obtain the familiar SVM generalization guarantee directly from [Nemirovski Yudin 78]:
    with m samples, and after k = m iterations:
      L(w(m)) ≤ L(w*) + O( √( X² ||w*||²₂ / m ) )
  – Even with a limited sample size, can't beat SA (single-pass SGD):
    guarantees best-possible generalization error with optimal runtime*

[Figure: starting from w(0), ERM (SAA) converges to ŵ while SGD (SA) reaches w(m);
 both land within O(√(X²||w*||²₂/m)) of w*; figure adapted from Leon Bottou]

* Up to constant factors
* Without further assumptions (tightness is "worst-case" over the source distribution)

Those pesky constant factors…

  ŵ = arg min_{||w||≤B} L̂(w)        w(m) = output of one-pass SGD on m samples

• The constant factor in the theoretical guarantees we can show for SA is actually a bit better than in the ERM guarantee (two vs. four).

• It is tight, in the worst case, up to a factor of eight.

• But in practice, the ERM does seem to be better...

• Said differently: with a fixed-size sample, after a single SGD pass over the data, we still don't obtain the same generalization performance as the ERM.

Mixed approach: SGD on Empirical Error

[Figure: test misclassification error (roughly 0.052 to 0.058) vs. number of SGD iterations k (0 to 3,000,000), for training set sizes m = 300,000, 400,000, 500,000; Reuters RCV1 data, CCAT task]

[Shalev-Shwartz, Srebro 2008]

Mixed approach: SGD on Empirical Error

[Same figure as above: Reuters RCV1 data, CCAT task]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  ⇒ With a larger training set, can reduce the generalization error faster
  ⇒ A larger training set means less runtime to reach a target generalization error

[Shalev-Shwartz, Srebro 2008]

Outline

• Gradient Descent and Stochastic Gradient Descent
  – Including sub-gradient descent
• The Stochastic Optimization setup and the two main approaches:
  – Sample Average Approximation
  – Stochastic Approximation
• Machine Learning as Stochastic Optimization
  – Leading example: L2 regularized linear prediction, as in SVMs
• Connection to Online Learning (break)
• More careful look at Stochastic Gradient Descent
• Generalization to other norms: Mirror Descent
• Faster convergence under special assumptions

Online Optimization (and Learning)

• Online optimization setup:
  – As in stochastic optimization: fixed and known f(w,z) and domain W
  – z(1), z(2), … presented sequentially by an "adversary"
  – "Learner" responds with w(1), w(2), …

      Adversary:        z(1)        z(2)        z(3)   …
      Learner:    w(1)        w(2)        w(3)

  – Learner's goal: minimize regret versus the best single response in hindsight:

      (1/k) ∑_{j=1}^k f(w(j), z(j)) − inf_{w*∈W} (1/k) ∑_{j=1}^k f(w*, z(j))

  – E.g., investment return:  w[i] = investment in holding i,  z[i] = return on holding i,
    f(w, z) = −⟨w, z⟩
  – Learning:  f(w, (x,y)) = loss(hw(x), y)

Online Gradient Descent [Zinkevich 03]

Start at some w(0)
Iterate: Predict w(k), receive z(k), pay f(w(k), z(k))
         w(k+1) ← ΠW( w(k) − α(k) ∇f(w(k), z(k)) )
Stepsize: α(k) = (B/G)/√k

Assuming ||∇f(w,z)||2 ≤ G and ||w*|| ≤ B:

  (1/k) ∑_{j=1}^k f(w(j), z(j)) − (1/k) ∑_{j=1}^k f(w*, z(j)) ≤ O( GB/√k )
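A minimal sketch of online gradient descent for the investment-style linear loss f(w, z) = −⟨w, z⟩, with W taken (for simplicity) to be the Euclidean ball of radius B so the projection is easy; the random "adversary" and the regret computation are illustrative assumptions.

```python
import numpy as np

def project_l2_ball(w, B):
    n = np.linalg.norm(w)
    return w if n <= B else B * w / n

def online_gradient_descent(z_stream, d, B, G):
    """Online gradient descent [Zinkevich 03] for the linear loss f(w, z) = -<w, z>
    over the (assumed) domain W = {w : ||w||_2 <= B}; tracks the cumulative loss so
    that regret against the best fixed w* can be computed afterwards."""
    w = np.zeros(d)
    losses, zs = [], []
    for k, z in enumerate(z_stream, start=1):
        losses.append(-w @ z)                    # pay f(w^(k), z^(k))
        grad = -z                                # gradient of -<w, z> with respect to w
        alpha = (B / G) / np.sqrt(k)
        w = project_l2_ball(w - alpha * grad, B)
        zs.append(z)
    # best fixed response in hindsight for a linear loss on the L2 ball:
    z_sum = np.sum(zs, axis=0)
    best_fixed_loss = -B * np.linalg.norm(z_sum)  # achieved by w* = B * z_sum / ||z_sum||
    avg_regret = (np.sum(losses) - best_fixed_loss) / len(zs)
    return w, avg_regret

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, B = 4, 1.0
    stream = [rng.uniform(-1, 1, size=d) for _ in range(10000)]   # "adversarial" returns (here random)
    G = np.sqrt(d)                                                # ||z||_2 <= sqrt(d) for entries in [-1, 1]
    w_final, avg_regret = online_gradient_descent(stream, d, B, G)
    print("average regret:", avg_regret)
```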

Online Optimization vs Stochastic Approximation

• In both the Online Setting and Stochastic Approximation:
  – Receive samples sequentially
  – Update w after each sample

• But, in the Online Setting:
  – Objective is empirical regret, i.e. behavior on the observed instances
  – z(k) chosen adversarially (no distribution involved)

• As opposed to Stochastic Approximation:
  – Objective is Ez[f(w,z)], i.e. behavior on "future" samples
  – i.i.d. samples z(k)

• Stochastic Approximation is a computational approach; Online Learning is an analysis setup.
  – E.g. "Follow the leader" is an online algorithm that solves an ERM problem at each iteration.
    It is still fully in the online setting, and is sensible to analyze as such.

Online To Stochastic

• Any online algorithm with a regret guarantee:

    (1/k) ∑_{j=1}^k f(w(j), z(j)) − (1/k) ∑_{j=1}^k f(w*, z(j)) ≤ R(k)

  can be converted to a Stochastic Approximation algorithm by outputting the average of the iterates,
  w̄(k) = (1/k) ∑_{i=1}^k w(i)  [Cesa-Bianchi et al 04]:

    E[ F(w̄(k)) − F(w*) ] ≤ R(k)

  (in fact, even with high confidence rather than only in expectation)

  Online Gradient Descent [Zinkevich 03]  --online2stochastic-->  Stochastic Gradient Descent
  [Cesa-Bianchi et al 04] [Nemirovski Yudin 78]
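A minimal sketch of the conversion itself: run any online update on i.i.d. samples and return the average of the iterates. The online-algorithm interface and the toy quadratic example are assumptions for illustration.

```python
import numpy as np

def online_to_stochastic(online_update, w0, sample_stream):
    """Generic online-to-stochastic conversion [Cesa-Bianchi et al 04]: run any online
    algorithm on i.i.d. samples z^(1), z^(2), ... and output the average of its iterates;
    the stochastic sub-optimality then inherits the regret bound R(k).

    online_update(w, z, k) returns the next iterate after seeing z at round k."""
    w = np.array(w0, dtype=float)
    w_sum = np.zeros_like(w)
    k = 0
    for z in sample_stream:
        k += 1
        w_sum += w                    # average over the iterates w^(1), ..., w^(k)
        w = online_update(w, z, k)
    return w_sum / k

# Example: online gradient descent on f(w, z) = 0.5 * (w - z)^2 becomes plain SGD;
# the averaged output estimates the mean of z (here 3.0).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = (rng.normal(loc=3.0, size=1) for _ in range(20000))
    ogd_step = lambda w, z, k: w - (1.0 / np.sqrt(k)) * (w - z)   # gradient of 0.5*(w-z)^2 is (w-z)
    print(online_to_stochastic(ogd_step, np.zeros(1), stream))     # approximately [3.0]
```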

Break

Outline

• Gradient Descent and Stochastic Gradient Descent
  – Including sub-gradient descent
• The Stochastic Optimization setup and the two main approaches:
  – Sample Average Approximation
  – Stochastic Approximation
• Machine Learning as Stochastic Optimization
  – Leading example: L2 regularized linear prediction, as in SVMs
• Connection to Online Learning (break)
• More careful look at Stochastic Gradient Descent
• Generalization to other norms: Mirror Descent
• Faster convergence under special assumptions

Stochastic Gradient Descent

  minw∈W F(w)

Start at w(0) = 0
Iterate: Get subgradient estimate g(k)
         w(k+1) ← ΠW( w(k) − α(k) g(k) )
Output w̄(k) = (1/k) ∑_{j=1}^k w(j)
where ΠW(w) = arg minv∈W ||v − w||2

Assumptions for analysis:
• F(w) is convex in w
• Independent and unbiased (sub)-gradient estimates: E[g(k)] ∈ ∇F(w(k))
• E[ ||g(k)||²₂ ] ≤ G²
  – Equivalently: supw ||∇F(w)||²₂ + Var[g(k)] ≤ G²
  – Slightly weaker than ||g(k)||2 ≤ G
• We do not need W to be bounded (could be Rd)
  – But the stepsize and convergence guarantee will depend on ||w*||2

Stochastic Gradient Descent: Stepsizes and Convergence

• Main inequality:

    E[ F(w̄(k)) − F(w*) ] ≤ ( ||w* − w(0)||² + ∑_{j=0}^{k−1} (α(j))² E[ ||g(j)||² ] ) / ( 2 ∑_{j=0}^{k−1} α(j) )

  (same as for the Gradient Descent analysis, but in expectation)

• With any α(k) → 0 and ∑_{j=1..k} α(j) → ∞:   lim E[ F(w̄(K)) ] ≤ F(w*)

• Fixed stepsizes:
    α(j) = ||w*||₂ / (G √K)   ⇒   E[ F(w̄(K)) ] ≤ F(w*) + G ||w*||₂ / √K
    α(j) = ε / G²             ⇒   E[ F(w̄(k)) ] − F(w*) ≤ ε   with   k = G² ||w*||²₂ / ε²

• Decaying stepsizes:
    α(k) = ||w*||₂ / (G √k)   ⇒   E[ F(w̄(k)) ] ≤ F(w*) + 4 G ||w*||₂ / √k

• If we don't know G, ||w*||, getting the stepsize wrong is not too bad:
    α(k) = β ||w*||₂ / (G √k)   ⇒   E[ F(w̄(k)) ] ≤ F(w*) + 4 max(β, 1/β) · G ||w*||₂ / √k

Stochastic Gradient Descent: Comments

• Fairly robust to the stepsize:
    α(k) = Θ(1/√k)   ⇒   E[ F(w̄(k)) ] ≤ F(w*) + O( G ||w*||₂ / √k )

• Projections:
  – If minimizing L(w) stochastically, using fresh samples at each iteration, can take W = Rd, and no need to project
  – In the mixed SA/SAA approach (SGD on L̂(w), reusing the sample): must take W = {w | ||w|| ≤ B} to ensure generalization

• Sampling with/without replacement:
  – In the mixed SA/SAA approach, when reusing the sample, the theory is only valid when sampling i.i.d. with replacement.
  – In practice: better to take a random permutation of the data (i.e. sample without replacement). When the permutation is exhausted (finished a pass over the data), take another random permutation, etc. Warning: no theory for this!
  – See Leon Bottou's webpage.

Stochastic Gradient Descent: Comments

• Averaging:
  – As presented, SGD outputs the average over the iterates w(k)
  – Instead of taking the average, the same guarantee holds for a random iterate:
    • When done, pick j ∈ 1..K at random, output w(j)
    • Equivalently, use a random number of iterations (pick the number of iterations at random between 1..K: on average you are fine).
  – Not aware of a guarantee for w(k) with a non-random, predetermined k
    • E.g., it could be that for some problem, taking exactly 7328 SGD iterations would be bad, even in expectation over the sample, but taking a random number of iterations between 1 and 7328 would be fine.
    • My guess: we are missing some theory here…
  – In practice: averaging reduces variance.

• High Confidence Guarantee: with probability at least 1−δ over the samples:

    F(w̄(k)) ≤ F(w*) + O( (G ||w*||₂ + log(1/δ)) / √k )

  e.g. using an online-to-stochastic conversion [Cesa-Bianchi et al 2004]
  – Only for the average! (not for a random iterate)

Other Regularizers

• Discussion so far focused on ||w||2, and L2 regularization
• In particular, the SGD sample complexity depends on ||w||2, and so matches the sample complexity of L2 regularized learning.
• What about other regularizers?
  – E.g. L1, group norms, matrix norms

• Option 1: SAA approach, minimizing:

    min_{||w||reg ≤ B} L̂(w)    or    minw L̂(w) + λ ||w||reg

  perhaps using SGD (runtime might depend on L2, but sample complexity on ||w||reg)

• Option 2: SA approach geared towards other norms…

SGD as a Proximal Method

• Another motivation for (stochastic) gradient descent:

    w(k+1) ← arg minw  F(w(k)) + ⟨g(k), w − w(k)⟩ + (1/(2α)) ||w − w(k)||²₂
           = arg minw  α ⟨g(k), w⟩ + ½ ||w − w(k)||²₂
           = w(k) − α g(k)

  The first two terms are a 1st-order model of F(w) around w(k), based on g(k);
  the last term keeps w close to w(k), since the model is only valid near w(k), so don't go too far.

[Figure: F(w) and its linearization F(w(k)) + ⟨∇F(w(k)), w − w(k)⟩ at w(k)]

Start at w(0) = 0
Iterate: Get subgradient estimate g(k)
         w(k+1) ← arg minw∈W  α(k) ⟨g(k), w⟩ + ½ ||w − w(k)||²₂
Output w̄(k) = (1/k) ∑_{j=1}^k w(j)

Replace ½||w − w(k)||²₂ with another norm?

Bregman Divergences

• For a differentiable, convex R define the Bregman Divergence:
    DR(w, v) = R(w) − R(v) − ⟨∇R(v), w − v⟩

• We will need R that is non-negative and τ-strongly convex w.r.t. our norm of interest ||w||, i.e. s.t.:
    DR(w, v) ≥ (τ/2) ||w − v||²

[Figure: R and its linearization R(v) + ⟨∇R(v), w − v⟩ at v; DR(w,v) is the gap between them at w]

• Examples:
  – R(w) = ½||w||²₂ is 1-strongly convex w.r.t. ||w||₂, with DR(w,v) = ½||w − v||²₂
  – R(w) = ½||w||²p is (p−1)-strongly convex w.r.t. ||w||p, for p > 1
  – R(w) = log(d) + ∑i w[i]·log(w[i]) is 1-strongly convex w.r.t. ||w||₁ on {w ∈ Rd₊ | ||w||₁ ≤ 1}
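Looking ahead to the "replace with another norm?" question and the Mirror Descent item in the outline, here is a minimal sketch of the proximal step with the entropic R above in place of ½||w − w(k)||²₂, which gives the familiar multiplicative (exponentiated-gradient) update on the simplex; the stepsize of order √(log(d)/k) and the toy linear objective are illustrative assumptions, not part of the slides.

```python
import numpy as np

def entropic_mirror_descent_step(w, g, alpha):
    """One mirror-descent step with R(w) = log(d) + sum_i w[i]*log(w[i]), i.e.
    w(k+1) = argmin_{w in simplex} alpha*<g, w> + D_R(w, w(k)),
    which has the closed form of a multiplicative / exponentiated-gradient update."""
    w_new = w * np.exp(-alpha * g)
    return w_new / w_new.sum()            # renormalize onto the probability simplex

def mirror_descent_demo():
    """Illustration on the simplex: minimize F(w) = <c, w> (an expected cost),
    which is minimized by putting all mass on the cheapest coordinate."""
    rng = np.random.default_rng(0)
    d = 50
    c = rng.uniform(0.5, 1.0, size=d)
    c[7] = 0.1                            # coordinate 7 is the cheapest
    w = np.full(d, 1.0 / d)               # start at the uniform distribution
    for k in range(1, 2001):
        g = c + rng.normal(scale=0.1, size=d)   # noisy (unbiased) gradient estimate
        w = entropic_mirror_descent_step(w, g, alpha=np.sqrt(np.log(d) / k))
    return w

if __name__ == "__main__":
    w = mirror_descent_demo()
    print("most mass on coordinate:", int(np.argmax(w)), "with weight", round(float(w[7]), 3))
```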
