Foundations of Machine Learning

Introduction to ML

Mehryar Mohri
Courant Institute and Google Research
[email protected]


Logistics

• Prerequisites: basics in linear algebra, probability, and analysis of algorithms.
• Workload: about 3-4 homework assignments + project (topic of your choice).
• Mailing list: join as soon as possible.


Course Material

• Textbook: Mohri, Rostamizadeh, and Talwalkar, Foundations of Machine Learning.
• Slides: course web page, http://www.cs.nyu.edu/~mohri/ml16


This Lecture

• Basic definitions and concepts.
• Introduction to the problem of learning.
• Probability tools.


Machine Learning

• Definition: computational methods using experience to improve performance.
• Experience: data-driven task, thus statistics, probability, and optimization.
• Computer science: learning algorithms, analysis of complexity, theoretical guarantees.
• Example: use a document's word counts to predict its topic.


Examples of Learning Tasks

• Text: document classification, spam detection.
• Language: NLP tasks (e.g., morphological analysis, POS tagging, context-free parsing, dependency parsing).
• Speech: recognition, synthesis, verification.
• Image: annotation, face recognition, OCR, handwriting recognition.
• Games (e.g., chess, backgammon).
• Unassisted control of vehicles (robots, cars).
• Medical diagnosis, fraud detection, network intrusion.


Some Broad ML Tasks

• Classification: assign a category to each item (e.g., document classification).
• Regression: predict a real value for each item (e.g., prediction of stock values, economic variables).
• Ranking: order items according to some criterion (e.g., relevant web pages returned by a search engine).
• Clustering: partition data into 'homogeneous' regions (e.g., analysis of very large data sets).
• Dimensionality reduction: find a lower-dimensional manifold preserving some properties of the data.


General Objectives of ML

Theoretical questions:

• what can be learned, under what conditions?
• are there learning guarantees?
• analysis of learning algorithms.

Algorithms:

• more efficient and more accurate algorithms.
• deal with large-scale problems.
• handle a variety of different learning problems.


This Course

Theoretical foundations:

• learning guarantees.
• analysis of algorithms.

Algorithms:

• main mathematically well-studied algorithms.
• discussion of their extensions.

Applications:

• illustration of their use.


Topics

• Probability tools, concentration inequalities.
• PAC learning model, Rademacher complexity, VC-dimension, generalization bounds.
• Support vector machines (SVMs), margin bounds, kernel methods.
• Ensemble methods, boosting.
• Logistic regression and conditional maximum entropy models.
• On-line learning, weighted majority algorithm, Perceptron algorithm, mistake bounds.
• Regression, generalization, algorithms.
• Ranking, generalization, algorithms.
• Reinforcement learning, MDPs, bandit problems and algorithms.


Definitions and Terminology

• Example: item, instance of the data used.
• Features: attributes associated to an item, often represented as a vector (e.g., word counts).
• Labels: category (classification) or real value (regression) associated to an item.
• Data:
  - training data (typically labeled).
  - test data (labeled, but labels not seen by the learner).
  - validation data (labeled, used for tuning parameters).


General Learning Scenarios

Settings:

• batch: the learner receives the full (training) sample, which it uses to make predictions for unseen points.
• on-line: the learner receives one sample at a time and makes a prediction for that sample.

Queries:

• active: the learner can request the label of a point.
• passive: the learner receives labeled points.


Standard Batch Scenarios

• Unsupervised learning: no labeled data.
• Supervised learning: uses labeled data for prediction on unseen points.
• Semi-supervised learning: uses labeled and unlabeled data for prediction on unseen points.
• Transduction: uses labeled and unlabeled data for prediction on seen points.


Example - SPAM Detection

• Problem: classify each e-mail message as SPAM or non-SPAM (binary classification problem).
• Potential data: large collection of SPAM and non-SPAM messages (labeled examples).


Learning Stages

[Figure: learning pipeline. Labeled data is split into a training sample, a validation sample, and a test sample; prior knowledge guides the choice of features. The algorithm A(Θ) is trained on the training sample; the validation data is used for parameter selection, yielding A(Θ0); the test sample is used for the final evaluation.]
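A minimal sketch of this pipeline in code. The synthetic data, the regularized least squares model family, and the parameter grid are illustrative assumptions, not part of the slides:

```python
# Sketch of the training / parameter-selection / evaluation pipeline.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labeled data: features x in R^5, labels y in {-1, +1}.
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = np.sign(X @ w_true + 0.5 * rng.normal(size=1000))

# Split into training, validation, and test samples.
X_tr, y_tr = X[:600], y[:600]
X_val, y_val = X[600:800], y[600:800]
X_te, y_te = X[800:], y[800:]

def fit(X, y, lam):
    """Regularized least squares: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def error(w, X, y):
    """Empirical 0-1 error of the linear classifier x -> sign(w . x)."""
    return np.mean(np.sign(X @ w) != y)

# Train A(theta) for each parameter theta; select theta on the validation data.
grid = [0.01, 0.1, 1.0, 10.0]
models = {lam: fit(X_tr, y_tr, lam) for lam in grid}
best_lam = min(grid, key=lambda lam: error(models[lam], X_val, y_val))

# Final evaluation of the selected model A(theta0) on the test sample.
print("selected lambda:", best_lam)
print("test error:", error(models[best_lam], X_te, y_te))
```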

This Lecture

• Basic definitions and concepts.
• Introduction to the problem of learning.
• Probability tools.


Definitions

• Spaces: input space $X$, output space $Y$.
• Loss function: $L \colon Y \times Y \to \mathbb{R}_+$.
  - $L(\hat{y}, y)$: cost of predicting $\hat{y}$ instead of $y$.
  - binary classification: 0-1 loss, $L(y, y') = 1_{y \neq y'}$.
  - regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.
• Hypothesis set: $H \subseteq Y^X$, subset of functions out of which the learner selects its hypothesis.
  - depends on the features.
  - represents prior knowledge about the task.
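A small sketch of the two loss functions defined above; the argument order (prediction first, true label second) is an assumed convention:

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss for classification: 1 if the labels differ, 0 otherwise."""
    return 1.0 if y_pred != y_true else 0.0

def squared_loss(y_pred, y_true):
    """Squared loss for regression: (y' - y)^2."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(1, 0))     # 1.0
print(squared_loss(2.5, 3.0))  # 0.25
```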


Supervised Learning Set-Up

• Training data: sample $S$ of size $m$ drawn i.i.d. from $X \times Y$ according to distribution $D$: $S = ((x_1, y_1), \ldots, (x_m, y_m))$.
• Problem: find a hypothesis $h \in H$ with small generalization error.
  - deterministic case: the output label is a deterministic function of the input, $y = f(x)$.
  - stochastic case: the output is a probabilistic function of the input.


Errors

• Generalization error: for $h \in H$, it is defined by
  $R(h) = \mathbb{E}_{(x,y) \sim D}[L(h(x), y)].$
• Empirical error: for $h \in H$ and sample $S$, it is
  $\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$
• Bayes error:
  $R^\star = \inf_{h \text{ measurable}} R(h).$
  - in the deterministic case, $R^\star = 0$.
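A sketch contrasting the empirical error on a finite sample with a Monte Carlo estimate of the generalization error for a fixed hypothesis. The distribution $D$ and the hypothesis below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m i.i.d. pairs (x, y) with y = 1{x > 0}, labels flipped w.p. 0.1."""
    x = rng.normal(size=m)
    y = (x > 0).astype(int)
    flip = rng.random(m) < 0.1
    return x, np.where(flip, 1 - y, y)

h = lambda x: (x > 0.2).astype(int)       # a fixed (suboptimal) hypothesis

x_s, y_s = sample(50)                     # small sample S
x_big, y_big = sample(1_000_000)          # large fresh sample to approximate R(h)

print("empirical error on S:      ", np.mean(h(x_s) != y_s))
print("generalization error (est):", np.mean(h(x_big) != y_big))
```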


Noise

• In binary classification, for any $x \in X$,
  $\mathrm{noise}(x) = \min\{\Pr[1 \mid x], \Pr[0 \mid x]\}.$
• Observe that $\mathbb{E}[\mathrm{noise}(x)] = R^\star$.
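A toy check of $\mathbb{E}[\mathrm{noise}(x)] = R^\star$ on a small discrete distribution; the joint distribution below is an illustrative assumption:

```python
# x takes 3 values; px[x] = Pr[x]; p1[x] = Pr[y = 1 | x].
px = {0: 0.5, 1: 0.3, 2: 0.2}
p1 = {0: 0.9, 1: 0.4, 2: 0.5}

# The Bayes classifier predicts the more likely label at each x; its error
# at x is exactly noise(x) = min(Pr[1|x], Pr[0|x]).
noise = {x: min(p1[x], 1 - p1[x]) for x in px}
bayes_error = sum(px[x] * noise[x] for x in px)

print("E[noise(x)] =", bayes_error)   # 0.5*0.1 + 0.3*0.4 + 0.2*0.5 = 0.27
```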


Learning ≠ Fitting

Notion of simplicity/complexity. How do we define complexity?


Generalization

Observations:

• the best hypothesis on the sample may not be the best overall.
• generalization is not memorization.
• complex rules (very complex separation surfaces) can be poor predictors.
• trade-off: complexity of the hypothesis set vs. sample size (underfitting/overfitting).


Model Selection

[Figure: error plotted against hypothesis-set complexity. The empirical error decreases with complexity, the penalty term increases, and the generalization error passes through a minimum at an intermediate complexity.]

Empirical Risk Minimization

• Select hypothesis set $H$.
• Find hypothesis $h \in H$ minimizing the empirical error:
  $\widehat{h} = \mathrm{argmin}_{h \in H} \widehat{R}(h).$
  - but $H$ may be too complex.
  - the sample size may not be large enough.
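ERM can be made concrete for a simple finite hypothesis set. The sketch below (threshold classifiers on the real line, synthetic noisy data) is an illustrative assumption, not an example from the slides:

```python
# ERM over a small finite hypothesis set: threshold classifiers
# h_t(x) = 1{x >= t} for a grid of thresholds t.
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = rng.uniform(-1, 1, size=m)
y = (x >= 0.3).astype(int)                      # true threshold at 0.3
y = np.where(rng.random(m) < 0.05, 1 - y, y)    # 5% label noise

thresholds = np.linspace(-1, 1, 41)             # finite hypothesis set H

def empirical_error(t):
    return np.mean((x >= t).astype(int) != y)

# ERM: pick the hypothesis with the smallest empirical error.
t_hat = min(thresholds, key=empirical_error)
print("ERM threshold:", t_hat, " empirical error:", empirical_error(t_hat))
```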


Structural Risk Minimization

(Vapnik, 1995)

Principle: consider an infinite sequence of hypothesis sets ordered by inclusion,
$H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots$

$\widehat{h} = \mathrm{argmin}_{h \in H_n,\, n \in \mathbb{N}} \widehat{R}(h) + \mathrm{penalty}(H_n, m).$

• strong theoretical guarantees.
• typically computationally hard.
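A sketch of SRM over nested finite hypothesis sets $H_1 \subset H_2 \subset \cdots$, where $H_n$ is a grid of $2^n + 1$ threshold classifiers. The penalty $\sqrt{\log(2|H_n|/\delta)/(2m)}$ is the standard Hoeffding + union-bound term for finite classes; using it here as the penalty is an illustrative choice, not the course's prescription:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 300
x = rng.uniform(-1, 1, size=m)
y = (x >= 0.3).astype(int)
y = np.where(rng.random(m) < 0.1, 1 - y, y)
delta = 0.05

def emp_error(t):
    return np.mean((x >= t).astype(int) != y)

best = None
for n in range(1, 8):
    H_n = np.linspace(-1, 1, 2 ** n + 1)                       # nested grids
    penalty = np.sqrt(np.log(2 * len(H_n) / delta) / (2 * m))  # complexity term
    t_n = min(H_n, key=emp_error)                              # ERM within H_n
    objective = emp_error(t_n) + penalty
    if best is None or objective < best[0]:
        best = (objective, n, t_n)

print("SRM choice: H_%d, threshold %.3f, objective %.3f" % (best[1], best[2], best[0]))
```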


General Algorithm Families

• Empirical risk minimization (ERM):
  $\widehat{h} = \mathrm{argmin}_{h \in H} \widehat{R}(h).$
• Structural risk minimization (SRM): $H_n \subseteq H_{n+1}$,
  $\widehat{h} = \mathrm{argmin}_{h \in H_n,\, n \in \mathbb{N}} \widehat{R}(h) + \mathrm{penalty}(H_n, m).$
• Regularization-based algorithms: $\lambda \geq 0$,
  $\widehat{h} = \mathrm{argmin}_{h \in H} \widehat{R}(h) + \lambda \|h\|^2.$
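A sketch of a regularization-based algorithm: squared loss plus $\lambda\|w\|^2$ over linear hypotheses, i.e. ridge regression. The synthetic data and the scaling of $\lambda$ in the closed form are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 100, 10
X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=m)

# Minimize (1/m) * ||Xw - y||^2 + lam * ||w||^2; setting the gradient to zero
# gives the closed form (X^T X + lam*m*I) w = X^T y.
lam = 0.1
w_hat = np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

train_mse = np.mean((X @ w_hat - y) ** 2)
print("regularized objective:", train_mse + lam * np.dot(w_hat, w_hat))
```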


This Lecture

• Basic definitions and concepts.
• Introduction to the problem of learning.
• Probability tools.


Basic Properties

• Union bound: $\Pr[A \vee B] \leq \Pr[A] + \Pr[B]$.
• Inversion: if $\Pr[X \geq \epsilon] \leq f(\epsilon)$, then, for any $\delta > 0$, with probability at least $1 - \delta$, $X \leq f^{-1}(\delta)$.
• Jensen's inequality: if $f$ is convex, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$.
• Expectation: if $X \geq 0$, $\mathbb{E}[X] = \int_0^{+\infty} \Pr[X > t]\, dt$.
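A numerical sanity check of the expectation identity for a nonnegative random variable; the exponential distribution is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=1_000_000)   # nonnegative, E[X] = 2

# Empirical tail Pr[X > t] on a grid, then a Riemann sum for the integral.
ts = np.linspace(0.0, 50.0, 5001)
xs = np.sort(x)
tail = 1.0 - np.searchsorted(xs, ts, side="right") / x.size
integral = np.sum(tail) * (ts[1] - ts[0])

print("E[X] (Monte Carlo)   :", x.mean())
print("integral of Pr[X > t]:", integral)        # both approximately 2
```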


Basic Inequalities

• Markov's inequality: if $X \geq 0$ and $\epsilon > 0$, then
  $\Pr[X \geq \epsilon] \leq \frac{\mathbb{E}[X]}{\epsilon}.$
• Chebyshev's inequality: for any $\epsilon > 0$,
  $\Pr[|X - \mathbb{E}[X]| \geq \epsilon] \leq \frac{\sigma_X^2}{\epsilon^2}.$
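A quick Monte Carlo check of both inequalities; the exponential distribution is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, Var[X] = 1

eps = 3.0
print("Pr[X >= eps]             :", np.mean(x >= eps))
print("Markov bound E[X]/eps    :", x.mean() / eps)

print("Pr[|X - E[X]| >= eps]    :", np.mean(np.abs(x - x.mean()) >= eps))
print("Chebyshev bound var/eps^2:", x.var() / eps ** 2)
```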

Hoeffding's Inequality

Theorem: let $X_1, \ldots, X_m$ be independent random variables with the same expectation $\mu$ and $X_i \in [a, b]$ ($a < b$). Then, for any $\epsilon > 0$, the following inequalities hold:
$\Pr\Big[\mu - \frac{1}{m}\sum_{i=1}^{m} X_i > \epsilon\Big] \leq \exp\Big(-\frac{2m\epsilon^2}{(b-a)^2}\Big),$
$\Pr\Big[\frac{1}{m}\sum_{i=1}^{m} X_i - \mu > \epsilon\Big] \leq \exp\Big(-\frac{2m\epsilon^2}{(b-a)^2}\Big).$
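A simulation comparing the deviation probability of a sample mean with the bound $\exp(-2m\epsilon^2/(b-a)^2)$; Bernoulli(0.5) variables, so $[a, b] = [0, 1]$, are an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(6)
m, eps, trials = 100, 0.1, 100_000
mu = 0.5

means = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
freq = np.mean(means - mu > eps)                 # empirical deviation frequency
bound = np.exp(-2 * m * eps ** 2)                # (b - a)^2 = 1

print("empirical Pr[mean - mu > eps]:", freq)    # around 0.018
print("Hoeffding bound              :", bound)   # around 0.135
```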

McDiarmid's Inequality

(McDiarmid, 1989)

Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,
$\sup_{x_1, \ldots, x_m, x_i'} |f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \leq c_i.$
Then, for all $\epsilon > 0$,
$\Pr\big[|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]| > \epsilon\big] \leq 2 \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$

Appendix


Markov's Inequality

Theorem: let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then, for all $t > 0$,
$\Pr[X \geq t\,\mathbb{E}[X]] \leq \frac{1}{t}.$

Proof:
$\Pr[X \geq t\,\mathbb{E}[X]] = \sum_{x \geq t\,\mathbb{E}[X]} \Pr[X = x]
\leq \sum_{x \geq t\,\mathbb{E}[X]} \frac{x}{t\,\mathbb{E}[X]} \Pr[X = x]
\leq \sum_{x} \frac{x}{t\,\mathbb{E}[X]} \Pr[X = x]
= \mathbb{E}\Big[\frac{X}{t\,\mathbb{E}[X]}\Big] = \frac{1}{t}.$

Chebyshev's Inequality

Theorem: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$. Then, for all $t > 0$,
$\Pr[|X - \mathbb{E}[X]| \geq t\,\sigma_X] \leq \frac{1}{t^2}.$

Proof: observe that
$\Pr[|X - \mathbb{E}[X]| \geq t\,\sigma_X] = \Pr[(X - \mathbb{E}[X])^2 \geq t^2 \sigma_X^2].$
The result follows from Markov's inequality.

Weak Law of Large Numbers

Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for any $\epsilon > 0$,
$\lim_{n \to \infty} \Pr[|\overline{X}_n - \mu| \geq \epsilon] = 0.$

Proof: by independence, $\mathrm{Var}[\overline{X}_n] = \sigma^2/n$, so by Chebyshev's inequality,
$\Pr[|\overline{X}_n - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2} \to 0.$

Hoeffding's Lemma

Theorem: let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \leq X \leq b$, $a < b$. Then, for any $t > 0$,
$\mathbb{E}[e^{tX}] \leq e^{\frac{t^2 (b-a)^2}{8}}.$

Proof: by convexity of $x \mapsto e^{tx}$, for all $a \leq x \leq b$,
$e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$
Thus, using $\mathbb{E}[X] = 0$,
$\mathbb{E}[e^{tX}] \leq \mathbb{E}\Big[\frac{b - X}{b - a} e^{ta} + \frac{X - a}{b - a} e^{tb}\Big] = \frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb} = e^{\phi(t)},$
with
$\phi(t) = \log\Big(\frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} + \frac{-a}{b - a} e^{t(b-a)}\Big).$
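A numerical sanity check of the lemma for a bounded, zero-mean random variable; the two-point distribution below is an illustrative choice:

```python
import numpy as np

a, b, p = -1.0, 3.0, 0.75                  # X = a w.p. p, X = b w.p. 1 - p
assert abs(p * a + (1 - p) * b) < 1e-12    # E[X] = 0

for t in [0.1, 0.5, 1.0]:
    mgf = p * np.exp(t * a) + (1 - p) * np.exp(t * b)   # E[e^(tX)]
    bound = np.exp(t ** 2 * (b - a) ** 2 / 8)
    print(f"t = {t}: E[e^(tX)] = {mgf:.4f} <= bound = {bound:.4f}")
```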

Taking the derivative gives:
$\phi'(t) = a - \frac{a\, e^{t(b-a)}}{\frac{b}{b-a} - \frac{a}{b-a} e^{t(b-a)}} = a - \frac{a}{(1 - \alpha)e^{-t(b-a)} + \alpha},$
with $\alpha = \frac{-a}{b-a}$. Note that $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore,
$\phi''(t) = \frac{\alpha(1 - \alpha)\, e^{-t(b-a)}\, (b-a)^2}{[(1 - \alpha)e^{-t(b-a)} + \alpha]^2} = u(1 - u)(b - a)^2 \leq \frac{(b-a)^2}{4},$
where $u = \frac{\alpha}{(1 - \alpha)e^{-t(b-a)} + \alpha}$. There exists $0 \leq \theta \leq t$ such that:
$\phi(t) = \phi(0) + t\,\phi'(0) + \frac{t^2}{2}\phi''(\theta) \leq t^2 \frac{(b-a)^2}{8}.$

Hoeffding's Theorem

Theorem: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $S_m = \sum_{i=1}^{m} X_i$ and any $\epsilon > 0$, the following inequalities hold:
$\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},$
$\Pr[S_m - \mathbb{E}[S_m] \leq -\epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}.$

Proof: the proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality and select $t$ to minimize
$\Pr[X \geq \epsilon] = \Pr[e^{tX} \geq e^{t\epsilon}] \leq \frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}}.$

Using this scheme and the independence of the random variables gives
$\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon] \leq e^{-t\epsilon}\, \mathbb{E}\big[e^{t(S_m - \mathbb{E}[S_m])}\big]
= e^{-t\epsilon} \prod_{i=1}^{m} \mathbb{E}\big[e^{t(X_i - \mathbb{E}[X_i])}\big]$
(by the lemma applied to $X_i - \mathbb{E}[X_i]$)
$\leq e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8}
= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8}
\leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},$
choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.

The second inequality is proved in a similar way.

Hoeffding's Inequality

Corollary: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$\Pr[\widehat{R}(h) - R(h) \geq \epsilon] \leq e^{-2m\epsilon^2},$
$\Pr[\widehat{R}(h) - R(h) \leq -\epsilon] \leq e^{-2m\epsilon^2}.$

Proof: follows directly from Hoeffding's theorem. Combining these one-sided inequalities yields
$\Pr\big[|\widehat{R}(h) - R(h)| \geq \epsilon\big] \leq 2 e^{-2m\epsilon^2}.$
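Applying the inversion property from the Basic Properties slide to the combined bound: setting $2e^{-2m\epsilon^2} = \delta$ gives, with probability at least $1 - \delta$, $|\widehat{R}(h) - R(h)| \leq \sqrt{\log(2/\delta)/(2m)}$. A small sketch of this inversion (the helper names are illustrative):

```python
import math

def deviation_bound(m, delta):
    """Deviation epsilon guaranteed with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 m eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(deviation_bound(m=1000, delta=0.05))   # about 0.043
print(sample_size(eps=0.01, delta=0.05))     # 18445
```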

Chernoff's Inequality

Theorem: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$\Pr[\widehat{R}(h) \geq (1 + \epsilon) R(h)] \leq e^{-m R(h)\, \epsilon^2 / 3},$
$\Pr[\widehat{R}(h) \leq (1 - \epsilon) R(h)] \leq e^{-m R(h)\, \epsilon^2 / 2}.$

Proof: based on Chernoff's bounding technique.

McDiarmid's Inequality

(McDiarmid, 1989)

Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,
$\sup_{x_1, \ldots, x_m, x_i'} |f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \leq c_i.$
Then, for all $\epsilon > 0$,
$\Pr\big[|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]| > \epsilon\big] \leq 2 \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$

Comments:

• Proof: uses Hoeffding's lemma.
• Hoeffding's inequality is a special case of McDiarmid's with
  $f(x_1, \ldots, x_m) = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $c_i = \frac{|b_i - a_i|}{m}.$


Jensen's Inequality

Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$

Proof: follows from the definition of convexity, the continuity of convex functions, and the density of finite distributions.
