Supervised learning. Linear and logistic regression

CS 2710 Foundations of AI Lecture 22

Supervised learning. Linear and logistic regression. Milos Hauskrecht [email protected] 5329 Sennott Square


Supervised learning

Data: $D = \{D_1, D_2, \dots, D_n\}$, a set of n examples $D_i = \langle \mathbf{x}_i, y_i \rangle$
• $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$ is an input vector of size d
• $y_i$ is the desired output (given by a teacher)

Objective: learn the mapping $f: X \to Y$ such that $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Regression: Y is continuous
  Example: earnings, product orders → company stock price
• Classification: Y is discrete
  Example: handwritten digit → digit label

Linear regression.

Linear regression

• Function $f: X \to Y$ is a linear combination of the input components:

$$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$$

• $w_0, w_1, \dots, w_d$ - parameters (weights); $w_0$ is the bias term

[Figure: the linear unit. A constant input 1 (the bias term) and the input vector $(x_1, x_2, \dots, x_d)$ are weighted by $w_0, w_1, w_2, \dots, w_d$ and summed to produce the output $f(\mathbf{x}, \mathbf{w})$.]

Linear regression

• Shorter (vector) definition of the model
  – Include the bias constant in the input vector: $\mathbf{x} = (1, x_1, x_2, \dots, x_d)$

$$f(\mathbf{x}) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x}$$

• $w_0, w_1, \dots, w_d$ - parameters (weights)

[Figure: the same linear unit with the bias folded into the input: inputs $(1, x_1, x_2, \dots, x_d)$, weights $(w_0, w_1, w_2, \dots, w_d)$, output $f(\mathbf{x}, \mathbf{w})$.]
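A minimal sketch of this vector form in NumPy (our own illustration, not from the lecture; the weight and input values are arbitrary): fold the bias constant 1 into the input and compute $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ as a dot product.

import numpy as np

def linear_predict(x, w):
    """f(x) = w^T x, where x = (1, x1, ..., xd) already contains the bias constant."""
    return w @ x

# Example (arbitrary numbers, for illustration only)
w = np.array([0.5, 2.0, -1.0])   # (w0, w1, w2)
x = np.array([1.0, 3.0, 4.0])    # (1, x1, x2)
print(linear_predict(x, w))      # 0.5 + 2*3 - 1*4 = 2.5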

Linear regression. Error.

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Error function – measures how much our predictions deviate from the desired answers

Mean-squared error:

$$J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2$$

• Learning: we want to find the weights minimizing the error!
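A matching sketch (ours, on made-up numbers) of the mean-squared error for a fixed weight vector:

import numpy as np

def mean_squared_error(X, y, w):
    """J_n = (1/n) * sum_i (y_i - w^T x_i)^2, with the bias column already in X."""
    residuals = y - X @ w
    return np.mean(residuals ** 2)

# Example (arbitrary data, for illustration only)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])     # rows are (1, x1)
y = np.array([2.1, 3.9, 6.2])
print(mean_squared_error(X, y, np.array([0.0, 2.0])))  # 0.02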

Linear regression. Example

• 1-dimensional input: $\mathbf{x} = (x_1)$

[Figure: data points and the fitted regression line for a 1-dimensional input.]

Linear regression. Example.

• 2-dimensional input: $\mathbf{x} = (x_1, x_2)$

[Figure: data points and the fitted regression plane for a 2-dimensional input.]

Linear regression. Optimization.

• We want the weights minimizing the error:

$$J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2 = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$

• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0:

$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$$

• Vector of derivatives:

$$\mathrm{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$

Solving linear regression

$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$$

By rearranging the terms we get a system of linear equations with d+1 unknowns:

$$A\mathbf{w} = \mathbf{b}$$

$$w_0 \sum_{i=1}^{n} x_{i,0} \cdot 1 + w_1 \sum_{i=1}^{n} x_{i,1} \cdot 1 + \dots + w_j \sum_{i=1}^{n} x_{i,j} \cdot 1 + \dots + w_d \sum_{i=1}^{n} x_{i,d} \cdot 1 = \sum_{i=1}^{n} y_i \cdot 1$$

$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,1} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,1} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,1} = \sum_{i=1}^{n} y_i x_{i,1}$$

$$\vdots$$

$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$$

Solving linear regression

• The optimal set of weights satisfies:

$$\nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$

This leads to a system of linear equations (SLE) with d+1 unknowns of the form $A\mathbf{w} = \mathbf{b}$:

$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$$

Solution to the SLE:

$$\mathbf{w} = A^{-1} \mathbf{b}$$

• matrix inversion
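A minimal NumPy sketch of this closed-form solution (our own illustration; the synthetic data, the design-matrix name Xb, and the noise level are assumptions, not part of the lecture). It builds A and b from the data and solves the system, using np.linalg.solve rather than an explicit inverse for numerical stability.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                 # synthetic inputs (assumed for illustration)
Xb = np.hstack([np.ones((n, 1)), X])        # fold the bias in: x = (1, x1, ..., xd)
true_w = np.array([3.0, 2.0, -1.0, 0.5])    # (w0, w1, w2, w3), chosen arbitrarily
y = Xb @ true_w + 0.1 * rng.normal(size=n)  # noisy targets

A = Xb.T @ Xb                               # A[j, k] = sum_i x_{i,j} * x_{i,k}
b = Xb.T @ y                                # b[j]    = sum_i y_i * x_{i,j}
w = np.linalg.solve(A, b)                   # solves A w = b (more stable than A^{-1} b)
print(w)                                    # approx [3.0, 2.0, -1.0, 0.5]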

Gradient descent solution

Goal: the weight optimization in the linear regression model

$$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

An alternative to the SLE solution:
• Gradient descent

Idea:
– Adjust the weights in the direction that improves the error
– The gradient tells us what the right direction is

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$$

$\alpha > 0$ - a learning rate (scales the gradient changes)

Gradient descent method

• Descend using the gradient information

[Figure: the error surface $Error(\mathbf{w})$ plotted against $\mathbf{w}$; the gradient $\nabla_{\mathbf{w}} Error(\mathbf{w})\,|_{\mathbf{w}^*}$ at the current point $\mathbf{w}^*$ gives the direction of the descent.]

• Change the value of w according to the gradient:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$$

Gradient descent method

[Figure: the error curve $Error(w)$ with the derivative $\frac{\partial}{\partial w} Error(\mathbf{w})\,|_{\mathbf{w}^*}$ evaluated at the current point $\mathbf{w}^*$.]

• New value of the parameter:

$$w_j \leftarrow w_j^* - \alpha \frac{\partial}{\partial w_j} Error(\mathbf{w})\,|_{\mathbf{w}^*} \quad \text{for all } j$$

$\alpha > 0$ - a learning rate (scales the gradient changes)

Gradient descent method

• Iteratively converge to the optimum of the error function

[Figure: the error curve $Error(\mathbf{w})$ with a sequence of iterates $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \mathbf{w}^{(3)}, \dots$ approaching the minimum.]
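A minimal batch gradient-descent sketch for the linear regression error above (our own illustration; the fixed learning rate, iteration count, and synthetic data are assumptions rather than the lecture's choices).

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Minimize J_n = (1/n) * sum_i (y_i - w^T x_i)^2 by batch gradient descent.
    X already contains the bias column of ones."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iterations):
        grad = -(2.0 / n) * X.T @ (y - X @ w)   # gradient of J_n with respect to w
        w = w - alpha * grad                    # w <- w - alpha * grad J_n(w)
    return w

# Example on synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.05 * rng.normal(size=200)
print(batch_gradient_descent(X, y))             # approx [1.0, 2.0, -3.0]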

Online gradient algorithm

• The error function is defined for the whole dataset D:

$$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

• Error for a single sample $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

• Online gradient method: changes the weights after every sample

$$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$$

• vector form:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error_i(\mathbf{w})$$

$\alpha > 0$ - a learning rate that depends on the number of updates

Online gradient method

Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

On-line error: $J_{online} = Error_i(\mathbf{w}) = \frac{1}{2}(y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

On-line algorithm: generates a sequence of online updates; the (i)-th update step uses $D_i = \langle \mathbf{x}_i, y_i \rangle$

j-th weight:

$$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i) \frac{\partial}{\partial w_j} Error_i(\mathbf{w})\,|_{\mathbf{w}^{(i-1)}}$$

$$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$$

Annealed learning rate: $\alpha(i) \approx \frac{1}{i}$ - gradually rescales the changes in the weights

Online regression algorithm

Online-linear-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \dots, w_d)$
  for i = 1 : number of iterations do
    select a data point $D_i = (\mathbf{x}_i, y_i)$ from D
    set $\alpha = 1/i$
    update the weight vector: $\mathbf{w} \leftarrow \mathbf{w} + \alpha (y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$

• Advantages: very easy to implement; works on continuous data streams; adapts to changes in the model over time
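A runnable sketch of this procedure (our own illustration; cycling through the data in order, the iteration count, and the synthetic dataset are assumptions, not part of the lecture).

import numpy as np

def online_linear_regression(X, y, num_iterations):
    """Online gradient updates for f(x) = w^T x; X already contains the bias column."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(1, num_iterations + 1):
        k = (i - 1) % n                           # select a data point (cycling through D)
        alpha = 1.0 / i                           # annealed learning rate alpha(i) = 1/i
        w = w + alpha * (y[k] - w @ X[k]) * X[k]  # w <- w + alpha * (y_i - f(x_i, w)) * x_i
    return w

# Example on synthetic data (assumed for illustration)
rng = np.random.default_rng(2)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 1))])
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.normal(size=500)
print(online_linear_regression(X, y, num_iterations=5000))   # roughly [0.5, 2.0]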

On-line learning. Example

[Figure: snapshots of the on-line linear regression fit on a 1-dimensional dataset after increasing numbers of online updates (panels labeled 1, 2, 3, ...).]

Logistic regression


Binary classification

• Two classes: $Y = \{0, 1\}$
• Our goal is to learn how to correctly classify two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn $f: X \to \{0, 1\}$
• First step: we need to devise a model of the function f
• Inspiration: the neuron (nerve cell)

Neuron

• the neuron (nerve cell) and its activities

[Figure: a drawing of a biological neuron.]

Neuron-based binary classification model

[Figure: a threshold unit. A constant input 1 (bias) and inputs $x^{(1)}, \dots, x^{(k)}$ are weighted by $w_0, w_1, w_2, \dots, w_k$ and summed to give $z$; a threshold function applied to $z$ produces the binary output $y$.]

Neuron-based binary classification

• Function we want to learn: $f: X \to \{0, 1\}$

[Figure: the same unit with the input vector $\mathbf{x} = (x^{(1)}, x^{(2)}, \dots, x^{(k)})$, a bias term $w_0$, weights $w_1, \dots, w_k$, the linear combination $z$, and a threshold function producing the output $y$.]
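A tiny sketch of the threshold unit in these diagrams (our own illustration; the weights are arbitrary). The logistic regression model introduced next replaces this hard threshold with a smooth one.

import numpy as np

def threshold_unit(x, w):
    """Output 1 if w^T x >= 0, else 0; x has the bias constant 1 prepended."""
    return 1 if w @ x >= 0.0 else 0

# Example (arbitrary weights, for illustration only)
w = np.array([-1.0, 2.0, 0.5])                       # (w0, w1, w2)
print(threshold_unit(np.array([1.0, 0.2, 0.3]), w))  # z = -0.45 -> class 0
print(threshold_unit(np.array([1.0, 0.8, 0.9]), w))  # z =  1.05 -> class 1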

Logistic regression model

• A function model with smooth switching:

$$f(\mathbf{x}) = g(w_0 + w_1 x^{(1)} + \dots + w_k x^{(k)})$$

where $\mathbf{w}$ are the parameters of the model and $g(z)$ is the logistic function

$$g(z) = 1 / (1 + e^{-z})$$

[Figure: the same unit as before, but with the threshold replaced by the logistic function. The bias term 1 and the input vector $(x^{(1)}, \dots, x^{(k)})$ are weighted by $(w_0, w_1, \dots, w_k)$ to give $z$, and the logistic function maps $z$ to the output $f(\mathbf{x}) \in [0, 1]$.]

Logistic function

$$g(z) = \frac{1}{1 + e^{-z}}$$

[Figure: the logistic (sigmoid) curve, rising smoothly from 0 to 1 as z goes from -20 to 20.]

• also referred to as the sigmoid function
• replaces the ideal threshold function with smooth switching
• takes a real number and outputs a number in the interval [0, 1]
• the output of logistic regression has a probabilistic interpretation:

$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$$ - the probability of class 1
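A short sketch of the logistic function and the resulting model output (our own illustration; the weights are arbitrary).

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w):
    """f(x) = p(y = 1 | x) = g(w^T x); x includes the bias constant 1."""
    return sigmoid(w @ x)

# Example (arbitrary weights, for illustration only)
w = np.array([-1.0, 2.0, 0.5])
print(predict_proba(np.array([1.0, 0.8, 0.9]), w))   # approx 0.74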

Logistic regression. Decision boundary

Classification with the logistic regression model:
  If $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}) \ge 0.5$ then choose class 1
  Else choose class 0

The logistic regression model defines a linear decision boundary.

Example:

[Figure: a 2-dimensional dataset with two classes separated by a linear decision boundary.]
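The rule amounts to thresholding the linear score at 0, since g is monotone and g(0) = 0.5. A small sketch (ours, with arbitrary weights):

import numpy as np

def classify(x, w):
    """Choose class 1 if p(y=1|x) = g(w^T x) >= 0.5, i.e. if w^T x >= 0; else class 0."""
    return 1 if w @ x >= 0.0 else 0

# Example (arbitrary weights, for illustration only); the boundary is 0.5 + x1 - x2 = 0
w = np.array([0.5, 1.0, -1.0])
print(classify(np.array([1.0, 1.0, 0.5]), w))   # class 1
print(classify(np.array([1.0, 0.0, 2.0]), w))   # class 0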

Logistic regression. Parameter optimization.

• We construct a probabilistic version of the error function based on the likelihood of the data:

$$L(D, \mathbf{w}) = P(D \mid \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w})$$

(assumes independent samples $(\mathbf{x}_i, y_i)$)

• the likelihood measures the goodness of fit (the opposite of the error)
• Log-likelihood trick: $l(D, \mathbf{w}) = \log L(D, \mathbf{w})$
• the error is the opposite of the log-likelihood: $Error(D, \mathbf{w}) = -l(D, \mathbf{w})$
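A sketch of this error, i.e. the negative log-likelihood, for the logistic model (our own illustration on made-up data; it uses the per-example form $y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$ derived on the following slides).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(X, y, w):
    """Error(D, w) = -l(D, w) for logistic regression; X includes the bias column, y holds 0/1 labels."""
    mu = sigmoid(X @ w)                                   # mu_i = p(y_i = 1 | x_i, w)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Example (arbitrary data and weights, for illustration only)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(negative_log_likelihood(X, y, np.array([0.0, 1.0])))   # approx 0.91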

Logistic regression: parameter learning

• The error function decomposes into online error components:

$$Error(D, \mathbf{w}) = \sum_{i=1}^{n} Error_i(D_i, \mathbf{w}) = -\sum_{i=1}^{n} l_i(D_i, \mathbf{w})$$

• Derivatives of the online error component for the LR model (in terms of the weights):

$$\frac{\partial}{\partial w_0} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))$$

$$\frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))\, x_{i,j}$$

Logistic regression: parameter learning.

• Assume $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Let $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then

$$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$

• Find the weights w that maximize the likelihood of the outputs
  – The optimal weights are the same for both the likelihood and the log-likelihood

$$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \log \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$

$$= \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) = \sum_{i=1}^{n} -J_{online}(D_i, \mathbf{w})$$

Logistic regression: parameter learning

• Log-likelihood:

$$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$$

• Derivatives of the log-likelihood (nonlinear in the weights!):

$$-\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} -x_{i,j}\, (y_i - g(z_i))$$

$$\nabla_{\mathbf{w}} [-l(D, \mathbf{w})] = \sum_{i=1}^{n} -\mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} -\mathbf{x}_i (y_i - f(\mathbf{w}, \mathbf{x}_i))$$

• Gradient descent:

$$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} [-l(D, \mathbf{w})]\,|_{\mathbf{w}^{(k-1)}}$$

$$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} [y_i - f(\mathbf{w}^{(k-1)}, \mathbf{x}_i)]\, \mathbf{x}_i$$
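A minimal batch implementation of this gradient update (our own sketch; averaging the gradient over the n examples, the fixed learning rate, and the synthetic data are our assumptions; the slide's update uses the plain sum with an annealed rate $\alpha(k)$).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, alpha=0.1, iterations=2000):
    """Gradient descent on -l(D, w) (equivalently, ascent on the log-likelihood).
    X includes the bias column; y holds 0/1 labels."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iterations):
        grad = X.T @ (y - sigmoid(X @ w)) / n    # averaged gradient of the log-likelihood
        w = w + alpha * grad                     # move uphill on l(D, w)
    return w

# Example on synthetic data (assumed for illustration)
rng = np.random.default_rng(3)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
y = (X @ np.array([-0.5, 2.0, -1.0]) + 0.5 * rng.normal(size=300) > 0).astype(float)
print(logistic_regression_gd(X, y))   # weights roughly proportional to (-0.5, 2.0, -1.0)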

Derivation of the gradient

• Log-likelihood:

$$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$$

• Derivatives of the log-likelihood (chain rule through $z_i = \mathbf{w}^T \mathbf{x}_i$):

$$\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$$

where $\frac{\partial z_i}{\partial w_j} = x_{i,j}$ and the derivative of the logistic function is $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$, so

$$\frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] = \frac{y_i}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} + \frac{-(1 - y_i)}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$$

$$= y_i (1 - g(z_i)) + (1 - y_i)(-g(z_i)) = y_i - g(z_i)$$

Putting the pieces together:

$$\nabla_{\mathbf{w}} [-l(D, \mathbf{w})] = \sum_{i=1}^{n} -\mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} -\mathbf{x}_i (y_i - f(\mathbf{w}, \mathbf{x}_i))$$
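A quick numerical sanity check (our own, not part of the slides) of the logistic-derivative identity $g'(z) = g(z)(1 - g(z))$ used above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare a finite-difference estimate of dg/dz with g(z) * (1 - g(z))
eps = 1e-6
for z in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    analytic = sigmoid(z) * (1 - sigmoid(z))
    print(z, numeric, analytic)      # the two values agree to about six decimal places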

Logistic regression. Online gradient.

• We want to optimize the online error
• On-line gradient update for the j-th weight and the i-th step:

$$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i) \frac{\partial}{\partial w_j} \left[ Error_i(D_i, \mathbf{w})\,|_{\mathbf{w}^{(i-1)}} \right]$$

• (i)-th update for logistic regression with $D_i = \langle \mathbf{x}_i, y_i \rangle$, j-th weight:

$$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i) \left( y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}) \right) x_{i,j}$$

$\alpha$ - annealed learning rate (depends on the number of updates)

The same update rule as used in the linear regression!

Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
  initialize weights $w_0, w_1, w_2, \dots, w_k$
  for i = 1 : number of iterations do
    select a data point $d = \langle \mathbf{x}, y \rangle$ from D
    set $\alpha = 1/i$
    update the weights (in parallel):
      $w_0 = w_0 + \alpha [y - f(\mathbf{x}, \mathbf{w})]$
      $w_j = w_j + \alpha [y - f(\mathbf{x}, \mathbf{w})] x_j$
  end for
  return weights
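A runnable sketch of this algorithm (our own illustration; cycling through the data in order and the synthetic, linearly separable dataset are assumptions, not part of the lecture).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(X, y, num_iterations):
    """Online updates w <- w + alpha * (y - g(w^T x)) * x; X includes the bias column."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(1, num_iterations + 1):
        k = (i - 1) % n                                   # select a data point from D
        alpha = 1.0 / i                                   # annealed learning rate
        w = w + alpha * (y[k] - sigmoid(w @ X[k])) * X[k]
    return w

# Example on synthetic data (assumed for illustration)
rng = np.random.default_rng(4)
X = np.hstack([np.ones((400, 1)), rng.normal(size=(400, 2))])
y = (X @ np.array([0.0, 1.5, -1.5]) > 0).astype(float)
print(online_logistic_regression(X, y, num_iterations=4000))  # roughly aligned with (0, 1.5, -1.5)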

Online algorithm. Example.

[Figure: an example run of the online logistic regression algorithm, shown over three slides as the fit changes with successive updates.]

Online updates

Linear regression: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

Logistic regression: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$

[Figure: the two units side by side. Both weight the inputs $(1, x_1, \dots, x_d)$ by $(w_0, w_1, \dots, w_d)$; the linear unit outputs the sum $f(\mathbf{x})$ directly, while the logistic unit passes the sum $z$ through the logistic function to output $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$.]

On-line gradient update - the same for both models:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha (y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$$

Limitations of basic linear units

Linear regression: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

Logistic regression: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$

[Figure: the same two units as on the previous slide.]

• Linear regression: the function is linear in the inputs!
• Logistic regression: the decision boundary is linear!