CS 2710 Foundations of AI. Lecture 22
Supervised learning. Linear and logistic regression.
Milos Hauskrecht
[email protected], 5329 Sennott Square
Supervised learning

• Data: D = {D_1, D_2, ..., D_n}, a set of n examples, where D_i = <x_i, y_i>
  – x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) is an input vector of size d
  – y_i is the desired output (given by a teacher)
• Objective: learn the mapping f : X → Y such that y_i ≈ f(x_i) for all i = 1, ..., n
• Regression: Y is continuous
  – Examples: earnings, product orders, company stock price
• Classification: Y is discrete
  – Example: handwritten digit → digit label
Linear regression.
Linear regression

• Function f : X → Y is a linear combination of the input components:

    f(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d = w_0 + Σ_{j=1..d} w_j x_j

  where w_0, w_1, ..., w_d are the parameters (weights); w_0 is the bias term.

[Diagram: a linear unit. The inputs x_1, ..., x_d and a constant bias input 1 are weighted by w_1, ..., w_d and w_0 and summed to produce f(x, w).]
Linear regression

• Shorter (vector) definition of the model: include the bias constant in the input vector

    x = (1, x_1, x_2, ..., x_d)

    f(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d = w^T x

  where w_0, w_1, ..., w_d are the parameters (weights).

[Diagram: the same linear unit, with the bias input x_0 = 1 now part of the input vector.]
Linear regression. Error.

• Data: D_i = <x_i, y_i>
• Function: x_i → f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, ..., n
• Error function: measures how much our predictions deviate from the desired answers
• Mean-squared error:

    J_n = (1/n) Σ_{i=1..n} (y_i − f(x_i))^2

• Learning: we want to find the weights minimizing the error!
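To make the error function concrete, here is a minimal sketch in Python with NumPy (the slides do not prescribe a language; the data and weights below are invented for illustration):

    import numpy as np

    # Invented 1-D dataset: each row of X is x_i = (1, x_{i,1}), bias included.
    rng = np.random.default_rng(0)
    n = 20
    X = np.c_[np.ones(n), rng.uniform(-1.5, 2.0, size=n)]
    w_true = np.array([3.0, 5.0])                   # made-up weights
    y = X @ w_true + rng.normal(scale=2.0, size=n)  # noisy targets

    def mse(w, X, y):
        """Mean-squared error J_n = (1/n) * sum_i (y_i - w^T x_i)^2."""
        return np.mean((y - X @ w) ** 2)

    print(mse(w_true, X, y))       # small: close to the noise variance
    print(mse(np.zeros(2), X, y))  # much larger: a poor fit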
Linear regression. Example.

• 1-dimensional input: x = (x_1)

[Plot: 1-D training data with the fitted regression line.]
Linear regression. Example.

• 2-dimensional input: x = (x_1, x_2)

[Plot: 2-D training data with the fitted regression plane.]
Linear regression. Optimization.

• We want the weights minimizing the error:

    J_n = (1/n) Σ_{i=1..n} (y_i − f(x_i))^2 = (1/n) Σ_{i=1..n} (y_i − w^T x_i)^2

• For the optimal set of parameters, the derivative of the error with respect to each parameter must be 0:

    ∂J_n(w)/∂w_j = −(2/n) Σ_{i=1..n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − ... − w_d x_{i,d}) x_{i,j} = 0

• Vector of derivatives:

    grad_w(J_n(w)) = ∇_w(J_n(w)) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
Solving linear regression

    ∂J_n(w)/∂w_j = −(2/n) Σ_{i=1..n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − ... − w_d x_{i,d}) x_{i,j} = 0

By rearranging the terms we get a system of linear equations A w = b with d+1 unknowns (all sums run over i = 1, ..., n):

    w_0 Σ_i x_{i,0}·1 + w_1 Σ_i x_{i,1}·1 + ... + w_j Σ_i x_{i,j}·1 + ... + w_d Σ_i x_{i,d}·1 = Σ_i y_i·1
    w_0 Σ_i x_{i,0} x_{i,1} + w_1 Σ_i x_{i,1} x_{i,1} + ... + w_j Σ_i x_{i,j} x_{i,1} + ... + w_d Σ_i x_{i,d} x_{i,1} = Σ_i y_i x_{i,1}
    ...
    w_0 Σ_i x_{i,0} x_{i,j} + w_1 Σ_i x_{i,1} x_{i,j} + ... + w_j Σ_i x_{i,j} x_{i,j} + ... + w_d Σ_i x_{i,d} x_{i,j} = Σ_i y_i x_{i,j}
Solving linear regression

• The optimal set of weights satisfies:

    ∇_w(J_n(w)) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0

• This leads to a system of linear equations (SLE) A w = b with d+1 unknowns, whose j-th equation is:

    w_0 Σ_i x_{i,0} x_{i,j} + w_1 Σ_i x_{i,1} x_{i,j} + ... + w_j Σ_i x_{i,j} x_{i,j} + ... + w_d Σ_i x_{i,d} x_{i,j} = Σ_i y_i x_{i,j}

• Solution to the SLE, via matrix inversion:

    w = A⁻¹ b
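As a concrete illustration (not part of the original slides), the NumPy sketch below builds A and b from a design matrix X whose rows are the inputs x_i (with the bias component x_{i,0} = 1) and solves A w = b; solving the system directly is numerically preferable to forming A⁻¹:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 50, 2
    X = np.c_[np.ones(n), rng.normal(size=(n, d))]  # rows x_i, x_{i,0} = 1
    w_true = np.array([1.0, -2.0, 0.5])             # made-up weights
    y = X @ w_true + rng.normal(scale=0.1, size=n)

    # Normal equations: A = sum_i x_i x_i^T and b = sum_i y_i x_i.
    A = X.T @ X
    b = X.T @ y
    w = np.linalg.solve(A, b)  # preferable to w = np.linalg.inv(A) @ b
    print(w)                   # close to w_true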
Gradient descent solution

Goal: weight optimization in the linear regression model

    J_n = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))^2

An alternative to the SLE solution: gradient descent.
Idea:
• Adjust the weights in the direction that reduces the error
• The gradient tells us the right direction:

    w ← w − α ∇_w Error(w)

where α > 0 is a learning rate (scales the gradient changes).
Gradient descent method

• Descend using the gradient information

[Plot: Error(w) as a function of w; the gradient ∇_w Error(w)|_{w*} at a point w* gives the direction of the descent.]

• Change the value of w according to the gradient:

    w ← w − α ∇_w Error(w)
Gradient descent method

[Plot: Error(w) with the derivative ∂Error(w)/∂w |_{w*} at the point w*.]

• New value of the parameter, for all j:

    w_j ← w_j* − α ∂Error(w)/∂w_j |_{w*}

where α > 0 is a learning rate (scales the gradient changes).
Gradient descent method

• Iteratively converge to the optimum of the error function

[Plot: successive iterates w(0), w(1), w(2), w(3) descending toward the minimum of Error(w).]
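The update rule above turns into a complete batch gradient descent loop; the following sketch (assumptions: NumPy, a constant learning rate, and invented data) applies it to the linear regression error J_n:

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, num_steps=500):
        """Repeat w <- w - alpha * grad_w J_n(w) for the MSE error.

        X is the (n, d+1) design matrix whose rows include the bias
        component x_{i,0} = 1; alpha > 0 is the learning rate.
        """
        n, k = X.shape
        w = np.zeros(k)
        for _ in range(num_steps):
            grad = -(2.0 / n) * X.T @ (y - X @ w)  # gradient from the slides
            w = w - alpha * grad
        return w

    rng = np.random.default_rng(2)
    X = np.c_[np.ones(30), rng.normal(size=30)]
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)
    print(gradient_descent(X, y))  # close to [2.0, -1.0]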
Online gradient algorithm

• The error function is defined for the whole dataset D:

    J_n = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))^2

• Error for a single sample D_i = <x_i, y_i>:

    J_online = Error_i(w) = (1/2) (y_i − f(x_i, w))^2

• Online gradient method: changes the weights after every sample

    w_j ← w_j − α ∂Error_i(w)/∂w_j

• Vector form:

    w ← w − α ∇_w Error_i(w)

where α > 0 is a learning rate that depends on the number of updates.
Online gradient method

• Linear model: f(x) = w^T x
• On-line error: J_online = Error_i(w) = (1/2) (y_i − f(x_i, w))^2
• The on-line algorithm generates a sequence of online updates; the (i)-th update step, with D_i = <x_i, y_i>, for the j-th weight:

    w_j^(i) ← w_j^(i−1) − α(i) ∂Error_i(w)/∂w_j |_{w^(i−1)}
            = w_j^(i−1) + α(i) (y_i − f(x_i, w^(i−1))) x_{i,j}

• Annealed learning rate: α(i) ≈ 1/i gradually rescales the changes in the weights
Online regression algorithm

Online-linear-regression (D, number of iterations)
    initialize weights w = (w_0, w_1, w_2, ..., w_d)
    for i = 1 ... number of iterations do
        select a data point D_i = (x_i, y_i) from D
        set α = 1/i
        update the weight vector: w ← w + α (y_i − f(x_i, w)) x_i
    end for
    return weights w

• Advantages: very easy to implement, handles continuous data streams, adapts to changes in the model over time
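A direct Python transcription of this pseudocode might look as follows (a sketch; the cycling data selection and the function name are my choices):

    import numpy as np

    def online_linear_regression(D, num_iterations):
        """Online gradient updates for linear regression.

        D: sequence of (x, y) pairs; each x is a NumPy array that
        already includes the bias component x_0 = 1.
        """
        w = np.zeros(len(D[0][0]))           # initialize weights
        for i in range(1, num_iterations + 1):
            x, y = D[(i - 1) % len(D)]       # select a data point
            alpha = 1.0 / i                  # annealed learning rate
            w = w + alpha * (y - w @ x) * x  # w <- w + a (y - f(x, w)) x
        return w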
On-line learning. Example.

[Plots: successive snapshots of the fit produced by the online algorithm as more data points are processed.]
Logistic regression
Binary classification

• Two classes: Y = {0, 1}
• Our goal is to learn how to correctly classify two types of examples
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn f : X → {0, 1}
• First step: we need to devise a model of the function f
• Inspiration: the neuron (nerve cell)
Neuron

• A neuron (nerve cell) and its activities

[Figure: a biological neuron and its activities.]
Neuron-based binary classification model

[Diagram: inputs x, weighted by w_1, ..., w_k, plus a constant bias input 1 weighted by w_0, are summed into z and passed through a threshold function to produce the output y.]
Neuron-based binary classification

• Function we want to learn: f : X → {0, 1}

[Diagram: input vector (x(1), x(2), ..., x(k)) with bias term 1, weighted by w_1, ..., w_k and w_0, summed into z and passed through a threshold function to give y.]
Logistic regression model

• A function model with smooth switching:

    f(x) = g(w_0 + w_1 x(1) + ... + w_k x(k))

  where w are the parameters of the model and g(z) is the logistic function

    g(z) = 1 / (1 + e^(−z))

[Diagram: the same unit with the threshold replaced by the logistic function; the output is f(x) ∈ [0, 1].]
Logistic function

    g(z) = 1 / (1 + e^(−z))

[Plot: the sigmoid curve, rising smoothly from 0 to 1 around z = 0.]

• also referred to as the sigmoid function
• replaces the ideal threshold function with smooth switching
• takes a real number and outputs a number in the interval [0, 1]
• the output of logistic regression has a probabilistic interpretation:

    f(x) = p(y = 1 | x), the probability of class 1
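In code, the logistic function and the resulting class-1 probability are one-liners; a minimal NumPy sketch (the names are mine):

    import numpy as np

    def sigmoid(z):
        """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, x):
        """f(x) = p(y = 1 | x) = g(w^T x); x includes the bias input 1."""
        return sigmoid(w @ x)

    print(sigmoid(np.array([-20.0, 0.0, 20.0])))  # approx. [0, 0.5, 1]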
Logistic regression. Decision boundary.

• Classification with the logistic regression model:
  – If f(x) = p(y = 1 | x) ≥ 0.5 then choose class 1
  – Else choose class 0
• The logistic regression model defines a linear decision boundary

[Plot: example of a linear decision boundary separating the two classes in a 2-D input space.]
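Since g(z) ≥ 0.5 exactly when z ≥ 0, the rule above reduces to testing the sign of w^T x, which is why the boundary is linear; a tiny sketch (assuming NumPy arrays with the bias component included in x):

    import numpy as np

    def classify(w, x):
        """Choose class 1 iff p(y = 1 | x) >= 0.5.

        g(w^T x) >= 0.5 exactly when w^T x >= 0, so the decision
        boundary w^T x = 0 is linear in x.
        """
        return 1 if w @ x >= 0.0 else 0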
Logistic regression. Parameter optimization.

• We construct a probabilistic version of the error function based on the likelihood of the data (assuming independent samples (x_i, y_i)):

    L(D, w) = P(D | w) = ∏_{i=1..n} P(y = y_i | x_i, w)

• The likelihood measures the goodness of fit (the opposite of the error)
• Log-likelihood trick: l(D, w) = log L(D, w)
• The error is the negative log-likelihood: Error(D, w) = −l(D, w)
Logistic regression: parameter learning

• The error function decomposes into online error components:

    Error(D, w) = Σ_{i=1..n} Error_i(D_i, w) = −Σ_{i=1..n} l_i(D_i, w)

• Derivatives of the online error component for the LR model, in terms of the weights:

    ∂Error_i(D_i, w)/∂w_0 = −(y_i − f(x_i, w))
    ∂Error_i(D_i, w)/∂w_j = −(y_i − f(x_i, w)) x_{i,j}
Logistic regression: parameter learning

• Assume D_i = <x_i, y_i> and let μ_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Then:

    L(D, w) = ∏_{i=1..n} P(y = y_i | x_i, w) = ∏_{i=1..n} μ_i^{y_i} (1 − μ_i)^{1−y_i}

• Find the weights w that maximize the likelihood of the outputs
  – The optimal weights are the same for the likelihood and the log-likelihood:

    l(D, w) = log ∏_{i=1..n} μ_i^{y_i} (1 − μ_i)^{1−y_i}
            = Σ_{i=1..n} log [ μ_i^{y_i} (1 − μ_i)^{1−y_i} ]
            = Σ_{i=1..n} [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]
            = Σ_{i=1..n} −J_online(D_i, w)
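For reference, the negative log-likelihood error can be computed directly from this expression; a minimal sketch (assuming NumPy; the clipping guard against log(0) is my addition):

    import numpy as np

    def neg_log_likelihood(w, X, y):
        """Error(D, w) = -sum_i [ y_i log mu_i + (1-y_i) log(1-mu_i) ],
        with mu_i = g(w^T x_i); rows of X are the inputs x_i."""
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))
        mu = np.clip(mu, 1e-12, 1.0 - 1e-12)  # avoid log(0)
        return -np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))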
Logistic regression: parameter learning

• Log-likelihood:

    l(D, w) = Σ_{i=1..n} [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

• Derivatives of the log-likelihood. The model is nonlinear in the weights, so setting the gradient to zero gives no closed-form solution; still, the gradient itself takes a simple form:

    −∂l(D, w)/∂w_j = Σ_{i=1..n} −x_{i,j} (y_i − g(z_i))

    ∇_w[−l(D, w)] = Σ_{i=1..n} −x_i (y_i − g(w^T x_i)) = Σ_{i=1..n} −x_i (y_i − f(w, x_i))

• Gradient descent:

    w^(k) ← w^(k−1) − α(k) ∇_w[−l(D, w)] |_{w^(k−1)}
    w^(k) ← w^(k−1) + α(k) Σ_{i=1..n} [y_i − f(w^(k−1), x_i)] x_i
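Put together, the batch gradient descent update for logistic regression is only a few lines; a sketch (assuming NumPy, a constant learning rate in place of the annealed α(k), and rows of X that include the bias component):

    import numpy as np

    def logistic_gradient_descent(X, y, alpha=0.05, num_steps=1000):
        """Minimize -l(D, w) by repeating
        w <- w + alpha * sum_i (y_i - f(w, x_i)) x_i."""
        w = np.zeros(X.shape[1])
        for _ in range(num_steps):
            f = 1.0 / (1.0 + np.exp(-(X @ w)))  # f(w, x_i) = g(w^T x_i)
            w = w + alpha * X.T @ (y - f)
        return w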
Derivation of the gradient

• Log-likelihood:

    l(D, w) = Σ_{i=1..n} [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

• Derivatives of the log-likelihood, via the chain rule through z_i = w^T x_i:

    ∂l(D, w)/∂w_j = Σ_{i=1..n} ∂/∂z_i [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ] · ∂z_i/∂w_j

  where ∂z_i/∂w_j = x_{i,j} and the derivative of the logistic function is

    ∂g(z_i)/∂z_i = g(z_i)(1 − g(z_i))

• Hence:

    ∂/∂z_i [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]
      = y_i (1/g(z_i)) ∂g(z_i)/∂z_i + (1 − y_i) (−1/(1 − g(z_i))) ∂g(z_i)/∂z_i
      = y_i (1 − g(z_i)) + (1 − y_i)(−g(z_i))
      = y_i − g(z_i)

    ∇_w l(D, w) = Σ_{i=1..n} x_i (y_i − g(w^T x_i)) = Σ_{i=1..n} x_i (y_i − f(w, x_i))
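The key identity ∂g/∂z = g(z)(1 − g(z)) is easy to sanity-check numerically; a small sketch using a central finite difference (my choice of check, not from the slides):

    import numpy as np

    g = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = np.linspace(-5.0, 5.0, 11)
    eps = 1e-6
    numeric = (g(z + eps) - g(z - eps)) / (2 * eps)  # finite difference
    analytic = g(z) * (1.0 - g(z))                   # g'(z) = g(1 - g)
    print(np.max(np.abs(numeric - analytic)))        # tiny, e.g. ~1e-11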
Logistic regression. Online gradient.

• We want to optimize the online error
• On-line gradient update for the j-th weight at the i-th step:

    w_j^(i) ← w_j^(i−1) − α(i) ∂Error_i(D_i, w)/∂w_j |_{w^(i−1)}

• The (i)-th update for logistic regression, with D_i = <x_i, y_i>:

    w_j^(i) ← w_j^(i−1) + α(i) (y_i − f(x_i, w^(i−1))) x_{i,j}

  where α is an annealed learning rate (it depends on the number of updates).
• This is the same update rule as used in linear regression!
Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
    initialize weights w = (w_0, w_1, w_2, ..., w_k)
    for i = 1 ... number of iterations do
        select a data point d = <x, y> from D
        set α = 1/i
        update the weights (in parallel):
            w_0 ← w_0 + α [y − f(x, w)]
            w_j ← w_j + α [y − f(x, w)] x_j
    end for
    return weights w
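A Python transcription of this pseudocode (a sketch; as before, the cycling data selection is my choice, and the parallel per-weight updates become one vector update):

    import numpy as np

    def online_logistic_regression(D, num_iterations):
        """Online gradient updates for logistic regression.

        D: sequence of (x, y) pairs with y in {0, 1}; each x includes
        the bias component x_0 = 1.
        """
        w = np.zeros(len(D[0][0]))
        for i in range(1, num_iterations + 1):
            x, y = D[(i - 1) % len(D)]          # select a data point
            alpha = 1.0 / i                     # annealed learning rate
            f = 1.0 / (1.0 + np.exp(-(w @ x)))  # f(x, w) = g(w^T x)
            w = w + alpha * (y - f) * x         # same form as the linear case
        return w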
Online algorithm. Example.

[Figures, spread over three slides: successive snapshots of the model learned by the online logistic regression algorithm as more data points are processed.]
Online updates

• Linear regression: f(x) = w^T x
• Logistic regression: f(x) = p(y = 1 | x, w) = g(w^T x)

[Diagrams: the linear unit (left) outputs f(x) = w^T x directly; the logistic unit (right) passes z = w^T x through the sigmoid to output p(y = 1 | x).]

• The on-line gradient update is the same for both models:

    w ← w + α (y − f(x, w)) x
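To underline the point, a single update function can serve both models, differing only in the prediction function f that is plugged in (a sketch; the names are mine):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    linear_f = lambda w, x: w @ x              # linear regression
    logistic_f = lambda w, x: sigmoid(w @ x)   # logistic regression

    def online_update(w, x, y, alpha, f):
        """One on-line gradient step: w <- w + alpha (y - f(x, w)) x.

        Identical for both models; only the prediction f differs.
        """
        return w + alpha * (y - f(w, x)) * x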
Limitations of basic linear units

• Linear regression: f(x) = w^T x
• Logistic regression: f(x) = p(y = 1 | x, w) = g(w^T x)

[Diagrams: the same two units as on the previous slide.]

• Linear regression: the function is linear in the inputs!
• Logistic regression: the decision boundary is linear!