CS545: Classification with Logistic Regression
Chuck Anderson
Department of Computer Science, Colorado State University
Fall, 2009
Outline
- Logistic Regression
  - Why?
  - Setup
  - Derivation
Masking
Recall that a linear model used for classification can result in masking. We discussed fixing this by using differently shaped membership functions, other than linear. Our first approach was to use generative models (Gaussian distributions) to model the data from each class. Using Bayes' theorem, we derived QDA and LDA.
Masking
Remember this picture?
[Figure: linear fits to the class indicator variables, with target values (1,0,0), (0,1,0), and (0,0,1), plotted against x.]
The problem was that the green line for Class 2 was too low. In fact, all of the lines are too low in the middle of the x range. Maybe we can reduce the masking effect by
- requiring the function values to be between 0 and 1, and
- requiring them to sum to 1 for every value of x.
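The masking effect described above is easy to reproduce. The slides contain no code, so the following is only an illustrative sketch: the three Gaussian clusters along x are made-up data, and the least-squares fit to indicator targets stands in for the linear model discussed above. The middle class's fitted line stays flat near 1/3 and is almost never the maximum.

```python
import numpy as np

# Three 1-D classes along x: Class 1 low, Class 2 in the middle, Class 3 high.
rng = np.random.default_rng(0)
x = np.hstack([rng.normal(-5, 1, 50), rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
labels = np.repeat([0, 1, 2], 50)

# Indicator (one-hot) targets, one column per class.
T = np.eye(3)[labels]

# Least-squares linear fit of each indicator column on inputs [1, x].
X = np.column_stack([np.ones_like(x), x])
W, *_ = np.linalg.lstsq(X, T, rcond=None)

# Classify by the highest fitted line; the middle class is (almost) never chosen.
predicted = np.argmax(X @ W, axis=1)
print(np.sum(predicted == 1))   # few or zero of the 50 middle-class samples
```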
Logistic Regression Setup
We can satisfy those two requirements by directly representing p(C = k \mid x) as

p(C = k \mid x) = \frac{f(x, \beta_k)}{\sum_{m=1}^{K} f(x, \beta_m)}

where we haven't discussed the form of f yet, but \beta_k represents the parameters of f that we will tune to fit the training data (later). This expression is certainly between 0 and 1 for any x. Now we have p(C = k \mid x) expressed directly, as opposed to the previous generative approach of first modeling p(x \mid C = k) and using Bayes' theorem to get p(C = k \mid x).
Let's give the above expression another name:

g(x, \beta_k) = p(C = k \mid x) = \frac{f(x, \beta_k)}{\sum_{m=1}^{K} f(x, \beta_m)}

Now let's deal with our requirement that the sum must equal 1:

1 = \sum_{k=1}^{K} p(C = k \mid x) = \sum_{k=1}^{K} g(x, \beta_k)

However, this constraint overdetermines the g(x, \beta_k). If 1 = a + b + c must be true, then given values for a and b, c is already determined, as c = 1 - a - b. Another way to say this is that we can set c to any value, and values for a and b can still be found that satisfy the above equation. For example, 1 = (a - c/2) + (b - c/2) + 2c.
So, let's just set the final f(x, \beta_K) to be 1. Now

g(x, \beta_k) =
\begin{cases}
\dfrac{f(x, \beta_k)}{1 + \sum_{m=1}^{K-1} f(x, \beta_m)}, & k < K \\[1ex]
\dfrac{1}{1 + \sum_{m=1}^{K-1} f(x, \beta_m)}, & k = K
\end{cases}
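The two requirements can be checked numerically. The form of f has not been chosen yet, so this minimal sketch just plugs in arbitrary positive f values for the first K − 1 classes and pins the final one at 1, as above (the helper name `g` and the example values are made up):

```python
import numpy as np

def g(f_values):
    """Class memberships with the final f fixed at 1.
    f_values: positive values f(x, beta_1), ..., f(x, beta_{K-1})."""
    f = np.append(np.asarray(f_values, dtype=float), 1.0)  # f_K = 1
    return f / f.sum()   # denominator is 1 + sum of the first K-1 values

probs = g([2.0, 0.5])       # K = 3 classes
print(probs, probs.sum())   # three values in (0, 1) that sum to 1
```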
Derivation
Whatever we choose for f, we must make a plan for optimizing its parameters \beta. How? Let's maximize the likelihood of the data. So, what is the likelihood of training data consisting of samples \{x_1, x_2, \ldots, x_N\} and class indicator variables

\begin{pmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,K} \\
t_{2,1} & t_{2,2} & \cdots & t_{2,K} \\
\vdots  &         &        & \vdots  \\
t_{N,1} & t_{N,2} & \cdots & t_{N,K}
\end{pmatrix}

with every value t_{n,k} being 0 or 1, and each row of this matrix containing a single 1? (We can also express \{x_1, x_2, \ldots, x_N\} as an N \times D matrix, but we will be using single samples x_n more often in the following.)
Data Likelihood
The likelihood is just the product, over samples n, of p(C = \text{class of } n\text{th sample} \mid x_n). A common way to express this product, using those handy indicator variables, is

L(\beta) = \prod_{n=1}^{N} \prod_{k=1}^{K-1} p(C = k \mid x_n)^{t_{n,k}}

Why does the second product stop with K - 1? Say we have three classes (K = 3) and training sample n is from Class 2. Then the inner product is

p(C = 1 \mid x_n)^{t_{n,1}} \, p(C = 2 \mid x_n)^{t_{n,2}} \, p(C = 3 \mid x_n)^{t_{n,3}} = p(C = 1 \mid x_n)^0 \, p(C = 2 \mid x_n)^1 \, p(C = 3 \mid x_n)^0 = p(C = 2 \mid x_n)

This shows how the indicator variables as exponents select the correct terms to be included in the product.
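The indicator-as-exponent trick, following the three-class example above, can be checked directly. A minimal sketch (the probability matrix `P` is made-up data): raising each probability to its 0/1 indicator and taking the full product gives exactly the product of each sample's true-class probability.

```python
import numpy as np

# Made-up class probabilities p(C = k | x_n) for N = 4 samples, K = 3 classes.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5],
              [0.6, 0.3, 0.1]])
T = np.eye(3)[[0, 1, 2, 0]]   # indicator variables: true classes 1, 2, 3, 1

# Product with indicators as exponents: entries with t = 0 become 1 ...
likelihood = np.prod(P ** T)
# ... so it equals multiplying each sample's probability for its true class.
print(likelihood, 0.7 * 0.8 * 0.5 * 0.6)
```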
Maximizing the Data Likelihood
So, we want to find \beta that maximizes the data likelihood

L(\beta) = \prod_{n=1}^{N} \prod_{k=1}^{K-1} p(C = k \mid x_n)^{t_{n,k}}

How shall we proceed? Right: find the derivative with respect to each component of \beta, or the gradient with respect to \beta. But there is a mess of products in this. So work with the logarithm, \log L(\beta), which we will call l(\beta):

l(\beta) = \log L(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \log p(C = k \mid x_n)
Gradient Descent
Unfortunately, the gradient of l(\beta) with respect to \beta is not linear in \beta, so we cannot simply set it equal to zero and solve for \beta. Instead, we do gradient descent (here really gradient ascent, since we are maximizing l(\beta)):
- Initialize \beta to some value.
- Make a small change to \beta in the direction of the gradient of l(\beta) with respect to \beta (that is, \nabla_\beta l(\beta)):

  \beta \leftarrow \beta + \alpha \nabla_\beta l(\beta)

  where \alpha is a constant that affects the step size.
- Repeat the above step until l(\beta) seems to be at a maximum.
Remember that \beta is a matrix of parameters with, let's say, columns corresponding to the values required for each f, of which there are K - 1. We can work on the update formula and \nabla_\beta l(\beta) one column at a time,

\beta_k \leftarrow \beta_k + \alpha \nabla_{\beta_k} l(\beta)

and combine them at the end:

\beta \leftarrow \beta + \alpha \left( \nabla_{\beta_1} l(\beta), \nabla_{\beta_2} l(\beta), \ldots, \nabla_{\beta_{K-1}} l(\beta) \right)
Remembering that \frac{\partial \log h(x)}{\partial x} = \frac{1}{h(x)} \frac{\partial h(x)}{\partial x} and that p(C = k \mid x_n) = g(x_n, \beta_k),

l(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \log p(C = k \mid x_n) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \log g(x_n, \beta_k)

\nabla_{\beta_j} l(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} \frac{t_{n,k}}{g(x_n, \beta_k)} \nabla_{\beta_j} g(x_n, \beta_k)

It would be super nice if \nabla_{\beta_j} g(x_n, \beta_k) includes the factor g(x_n, \beta_k), so that it will cancel with the g(x_n, \beta_k) in the denominator.
We can get this by defining

f(x_n, \beta_k) = e^{\beta_k^T x_n}

so

g(x_n, \beta_k) = \frac{e^{\beta_k^T x_n}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n}}

Now we can work on \nabla_{\beta_j} g(x_n, \beta_k):

\nabla_{\beta_j} g(x_n, \beta_k) = \nabla_{\beta_j} \left[ \left( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n} \right)^{-1} e^{\beta_k^T x_n} \right]
= -\left( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n} \right)^{-2} e^{\beta_j^T x_n} x_n \, e^{\beta_k^T x_n} + \left( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n} \right)^{-1} \delta_{jk} \, e^{\beta_k^T x_n} x_n
= -g(x_n, \beta_j) \, g(x_n, \beta_k) \, x_n + \delta_{jk} \, g(x_n, \beta_k) \, x_n
= g(x_n, \beta_k) \left( \delta_{jk} - g(x_n, \beta_j) \right) x_n

where \delta_{jk} = 1 if j = k, and 0 otherwise.
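The gradient identity above can be sanity-checked numerically by comparing it against finite differences. A minimal sketch, with made-up random beta and x (the dimensions D = 3 and K − 1 = 2 and the helper `g` are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, Km1 = 3, 2                      # D inputs, K - 1 parameter vectors
beta = rng.normal(size=(D, Km1))   # one column per beta_k
x = rng.normal(size=D)

def g(beta, k):
    """g(x, beta_k) = exp(beta_k' x) / (1 + sum_m exp(beta_m' x))."""
    e = np.exp(beta.T @ x)
    return e[k] / (1.0 + e.sum())

# Analytic gradient from the slide: g(x, beta_k) (delta_jk - g(x, beta_j)) x
k, j = 0, 1
analytic = g(beta, k) * ((j == k) - g(beta, j)) * x

# Finite-difference approximation, one component of beta_j at a time.
eps = 1e-6
numeric = np.empty(D)
for i in range(D):
    b = beta.copy()
    b[i, j] += eps
    numeric[i] = (g(b, k) - g(beta, k)) / eps

print(np.max(np.abs(analytic - numeric)))   # small: the two agree
```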
Now

\nabla_{\beta_j} l(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} \frac{t_{n,k}}{g(x_n, \beta_k)} \nabla_{\beta_j} g(x_n, \beta_k)
= \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \left( \delta_{jk} - g(x_n, \beta_j) \right) x_n
= \sum_{n=1}^{N} \left( t_{n,j} - g(x_n, \beta_j) \sum_{k=1}^{K-1} t_{n,k} \right) x_n
= \sum_{n=1}^{N} \left( t_{n,j} - g(x_n, \beta_j) \right) x_n

which results in this update rule for \beta_j:

\beta_j \leftarrow \beta_j + \alpha \sum_{n=1}^{N} \left( t_{n,j} - g(x_n, \beta_j) \right) x_n

How do we do this in R?
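The course asks about R, but as a language-neutral sketch, here is the update rule above in Python/NumPy. Everything beyond the update rule itself is an illustrative assumption: the three-cluster data, the step size alpha, and the fixed iteration count are made up, and the constant input of 1 absorbs the intercept into beta.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D data with three classes, as in the masking discussion.
x = np.hstack([rng.normal(-5, 1, 50), rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
labels = np.repeat([0, 1, 2], 50)
N, K = x.size, 3

X = np.column_stack([np.ones(N), x])   # constant input of 1 absorbs the intercept
T = np.eye(K)[labels]                  # indicator variables, N x K
beta = np.zeros((2, K - 1))            # one column per class 1..K-1; class K is the reference

def g_all(X, beta):
    """g(x_n, beta_k) for all n and all K classes; the last column is the reference class."""
    e = np.exp(X @ beta)                          # N x (K-1)
    denom = 1.0 + e.sum(axis=1, keepdims=True)
    return np.hstack([e, np.ones((X.shape[0], 1))]) / denom

alpha = 0.001
for step in range(5000):
    G = g_all(X, beta)
    # beta_j <- beta_j + alpha * sum_n (t_nj - g(x_n, beta_j)) x_n, for j = 1..K-1
    beta += alpha * X.T @ (T[:, :K - 1] - G[:, :K - 1])

accuracy = np.mean(np.argmax(g_all(X, beta), axis=1) == labels)
print(accuracy)   # high: the middle class is no longer masked
```

Unlike the least-squares fit to indicator variables, the maximum-likelihood fit lets the middle class win over an interval of x, so the masking effect disappears.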