CS545: Classification with Logistic Regression


Chuck Anderson
Department of Computer Science
Colorado State University

Fall, 2009


Outline

- Logistic Regression
  - Why?
  - Setup
  - Derivation

Masking

Recall that a linear model used for classification can result in masking.

We discussed fixing this by using membership functions with shapes other than linear. Our first approach was to use generative models (Gaussian distributions) to model the data from each class. Using Bayes' theorem, we derived QDA and LDA.

Remember this picture?

[Figure: the three linear indicator-variable functions, labeled (0,0,1), (0,1,0), and (1,0,0), plotted against x.]

The problem was that the green line for Class 2 was too low. In fact, all lines are too low in the middle of the x range. Maybe we can reduce the masking effect by

- requiring the function values to be between 0 and 1, and
- requiring them to sum to 1 for every value of x.

Logistic Regression Setup

We can satisfy those two requirements by directly representing p(C = k | x) as

$$p(C = k \mid x) = \frac{f(x, \beta_k)}{\sum_{m=1}^{K} f(x, \beta_m)}$$

where we haven't discussed the form of f yet, but β_k represents the parameters of f that we will tune to fit the training data (later).

This is certainly an expression that is between 0 and 1 for any x. And now we have p(C = k | x) expressed directly, as opposed to the previous generative approach of first modeling p(x | C = k) and using Bayes' theorem to get p(C = k | x).

Let's give the above expression another name:

$$g(x, \beta_k) = p(C = k \mid x) = \frac{f(x, \beta_k)}{\sum_{m=1}^{K} f(x, \beta_m)}$$

Now let's deal with our requirement that the sum must equal 1:

$$1 = \sum_{k=1}^{K} p(C = k \mid x) = \sum_{k=1}^{K} g(x, \beta_k)$$

However, this constraint overdetermines the g(x, β_k). If

$$1 = a + b + c$$

must be true, then given values for a and b, c is already determined, as c = 1 - a - b. Another way to say this is that we can set c to any value, and values for a and b can still be found that satisfy the above equation. For example,

$$1 = (a - c/2) + (b - c/2) + 2c$$

So, let's just set the final f(x, β_K) to be 1. Now

$$g(x, \beta_k) =
\begin{cases}
\dfrac{f(x, \beta_k)}{1 + \sum_{m=1}^{K-1} f(x, \beta_m)}, & k < K \\[2ex]
\dfrac{1}{1 + \sum_{m=1}^{K-1} f(x, \beta_m)}, & k = K
\end{cases}$$
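As a quick numerical sanity check (not from the slides; the f values below are arbitrary), we can verify in R that fixing f(x, β_K) = 1 yields g values that are each between 0 and 1 and sum to 1:

    # Hypothetical values of f(x, beta_1) and f(x, beta_2) for K = 3;
    # f(x, beta_3) is fixed at 1.
    f <- c(2.0, 0.5)
    g <- c(f, 1) / (1 + sum(f))
    g        # 0.5714 0.1429 0.2857 -- each between 0 and 1
    sum(g)   # 1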

Derivation

Whatever we choose for f, we must make a plan for optimizing its parameters β. How?

Let's maximize the likelihood of the data. So, what is the likelihood of training data consisting of samples {x_1, x_2, ..., x_N} and class indicator variables

$$\begin{pmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,K} \\
t_{2,1} & t_{2,2} & \cdots & t_{2,K} \\
\vdots  &         &        & \vdots  \\
t_{N,1} & t_{N,2} & \cdots & t_{N,K}
\end{pmatrix}$$

with every value t_{n,k} being 0 or 1, and each row of this matrix containing a single 1? (We can also express {x_1, x_2, ..., x_N} as an N × D matrix, but we will be using single samples x_n more often in the following.)
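As an aside, here is one way (a sketch, with made-up labels) to build such an indicator matrix in R from a vector of class labels in 1..K:

    classes <- c(2, 1, 3, 2)     # hypothetical labels for N = 4 samples
    K <- 3
    Tind <- diag(K)[classes, ]   # row n has a single 1, in column classes[n]
    Tind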

Data Likelihood

The likelihood is just the product, over all samples n, of the p(C = class of nth sample | x_n) values. A common way to express this product, using those handy indicator variables, is

$$L(\beta) = \prod_{n=1}^{N} \prod_{k=1}^{K-1} p(C = k \mid x_n)^{t_{n,k}}$$

Why does the second product stop at K - 1?

Say we have three classes (K = 3) and training sample n is from Class 2. Then the inner product is

$$p(C = 1 \mid x_n)^{t_{n,1}} \; p(C = 2 \mid x_n)^{t_{n,2}} \; p(C = 3 \mid x_n)^{t_{n,3}}
= p(C = 1 \mid x_n)^0 \; p(C = 2 \mid x_n)^1 \; p(C = 3 \mid x_n)^0
= p(C = 2 \mid x_n)$$

This shows how the indicator variables as exponents select the correct terms to be included in the product.

Maximizing the Data Likelihood

So, we want to find β that maximizes the data likelihood. How shall we proceed?

$$L(\beta) = \prod_{n=1}^{N} \prod_{k=1}^{K-1} p(C = k \mid x_n)^{t_{n,k}}$$

Right. Find the derivative with respect to each component of β, or the gradient with respect to β. But there is a mess of products in this. So...

Right. Work with the logarithm, log L(β), which we will call l(β):

$$l(\beta) = \log L(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \log p(C = k \mid x_n)$$
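A side note not on the slides: besides untangling the derivative, the logarithm also avoids floating-point underflow, since a product of many probabilities shrinks toward zero very fast. A small R illustration with made-up values:

    p <- rep(0.9, 500)   # 500 hypothetical per-sample probabilities
    prod(p)              # about 1.3e-23; more samples would underflow to 0
    sum(log(p))          # about -52.7, numerically stable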

Gradient Descent

Unfortunately, the gradient of l(β) with respect to β is not linear in β, so we cannot simply set the result equal to zero and solve for β. Instead, we do gradient descent (strictly speaking, gradient ascent, since we are maximizing l(β)):

- Initialize β to some value.
- Make a small change to β in the direction of the gradient of l(β) with respect to β (that is, ∇_β l(β)):

  $$\beta \leftarrow \beta + \alpha \nabla_\beta l(\beta)$$

  where α is a constant that affects the step size.
- Repeat the above step until l(β) seems to be at a maximum.
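Here is that recipe as a minimal, runnable R sketch on a toy concave function, l(β) = -(β - 3)^2 (the function and its gradient are invented purely to make the loop concrete):

    grad_l <- function(beta) -2 * (beta - 3)  # gradient of -(beta - 3)^2
    beta  <- 0                                # initialize beta to some value
    alpha <- 0.1                              # constant affecting step size
    for (step in 1:100)
      beta <- beta + alpha * grad_l(beta)     # small step up the gradient
    beta                                      # close to 3, the maximizer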

Remember that β is a matrix of parameters, with, let's say, columns corresponding to the values required for each f, of which there are K - 1. We can work on the update formula and ∇_β l(β) one column at a time,

$$\beta_k \leftarrow \beta_k + \alpha \nabla_{\beta_k} l(\beta)$$

and combine them at the end:

$$\beta \leftarrow \beta + \alpha \left( \nabla_{\beta_1} l(\beta), \nabla_{\beta_2} l(\beta), \ldots, \nabla_{\beta_{K-1}} l(\beta) \right)$$

Remembering that

$$\frac{\partial \log h(x)}{\partial x} = \frac{1}{h(x)} \frac{\partial h(x)}{\partial x}$$

and that p(C = k | x_n) = g(x_n, β_k), we have

$$l(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \log p(C = k \mid x_n) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \log g(x_n, \beta_k)$$

$$\nabla_{\beta_j} l(\beta) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} \frac{t_{n,k}}{g(x_n, \beta_k)} \nabla_{\beta_j} g(x_n, \beta_k)$$

It would be super nice if ∇_{β_j} g(x_n, β_k) included the factor g(x_n, β_k), so that it would cancel with the g(x_n, β_k) in the denominator.

We can get this by defining

$$f(x_n, \beta_k) = e^{\beta_k^T x_n}$$

so that

$$g(x_n, \beta_k) = \frac{e^{\beta_k^T x_n}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n}}$$

Now we can work on ∇_{β_j} g(x_n, β_k):

$$\begin{aligned}
\nabla_{\beta_j} g(x_n, \beta_k)
&= \nabla_{\beta_j} \left( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n} \right)^{-1} e^{\beta_k^T x_n} \\
&= -\left( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n} \right)^{-2} e^{\beta_j^T x_n} x_n \, e^{\beta_k^T x_n}
 + \left( 1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n} \right)^{-1} \delta_{jk} \, e^{\beta_k^T x_n} x_n \\
&= -\frac{e^{\beta_j^T x_n}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n}} \,
   \frac{e^{\beta_k^T x_n}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n}} \, x_n
 + \delta_{jk} \, \frac{e^{\beta_k^T x_n}}{1 + \sum_{m=1}^{K-1} e^{\beta_m^T x_n}} \, x_n \\
&= -g(x_n, \beta_j) \, g(x_n, \beta_k) \, x_n + \delta_{jk} \, g(x_n, \beta_k) \, x_n \\
&= g(x_n, \beta_k) \left( \delta_{jk} - g(x_n, \beta_j) \right) x_n
\end{aligned}$$

where δ_{jk} = 1 if j = k, and 0 otherwise.
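This identity is easy to check numerically against a finite-difference approximation. A sketch in R, with made-up values of x_n and β for K = 3 classes and D = 2 inputs:

    # g(x, beta) for a single sample x; beta is D x (K-1).
    g <- function(x, beta) {
      f <- exp(as.vector(t(beta) %*% x))
      f / (1 + sum(f))
    }

    x    <- c(1, 2)
    beta <- matrix(c(0.1, -0.2, 0.3, 0.0), nrow = 2)
    j <- 1; k <- 2; eps <- 1e-6

    # Central differences of g(x, beta_k) with respect to beta_j.
    numgrad <- sapply(1:2, function(i) {
      bp <- beta; bp[i, j] <- bp[i, j] + eps
      bm <- beta; bm[i, j] <- bm[i, j] - eps
      (g(x, bp)[k] - g(x, bm)[k]) / (2 * eps)
    })

    gx <- g(x, beta)
    angrad <- gx[k] * ((j == k) - gx[j]) * x  # the formula just derived
    max(abs(numgrad - angrad))                # effectively 0: they agree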

Now

$$\begin{aligned}
\nabla_{\beta_j} l(\beta)
&= \sum_{n=1}^{N} \sum_{k=1}^{K-1} \frac{t_{n,k}}{g(x_n, \beta_k)} \nabla_{\beta_j} g(x_n, \beta_k) \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K-1} t_{n,k} \left( \delta_{jk} - g(x_n, \beta_j) \right) x_n \\
&= \sum_{n=1}^{N} \left( \sum_{k=1}^{K-1} t_{n,k} \delta_{jk} - g(x_n, \beta_j) \sum_{k=1}^{K-1} t_{n,k} \right) x_n \\
&= \sum_{n=1}^{N} \left( t_{n,j} - g(x_n, \beta_j) \right) x_n
\end{aligned}$$

which results in this update rule for β_j:

$$\beta_j \leftarrow \beta_j + \alpha \sum_{n=1}^{N} \left( t_{n,j} - g(x_n, \beta_j) \right) x_n$$

How do we do this in R?
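Here is one possible answer, offered as a sketch rather than the course's official solution (the function names and defaults are my own). X is the N × D matrix of samples, Tind the N × K indicator matrix, and β is stored as a D × (K - 1) matrix so that updating all columns is a single matrix expression:

    # g(x_n, beta_k) for all samples n and all k < K at once.
    # Note: exp() can overflow for large beta^T x; acceptable for a sketch.
    g.all <- function(X, beta) {
      f <- exp(X %*% beta)   # N x (K-1) matrix of e^{beta_k^T x_n}
      f / (1 + rowSums(f))   # divide row n by 1 + sum_m e^{beta_m^T x_n}
    }

    logistic.train <- function(X, Tind, alpha = 0.001, steps = 1000) {
      K <- ncol(Tind)
      beta <- matrix(0, ncol(X), K - 1)    # initialize beta to zero
      Tshort <- Tind[, -K, drop = FALSE]   # t_{n,k} for k = 1, ..., K-1
      for (step in 1:steps) {
        # The update rule above, for all columns j at once: column j of
        # t(X) %*% (Tshort - G) is sum_n (t_{n,j} - g(x_n, beta_j)) x_n.
        beta <- beta + alpha * t(X) %*% (Tshort - g.all(X, beta))
      }
      beta
    }

    logistic.classify <- function(X, beta) {
      G <- g.all(X, beta)
      P <- cbind(G, 1 - rowSums(G))   # append p(C = K | x_n)
      max.col(P)                      # class with the largest p(C = k | x_n)
    }

Training and classification would then look like beta <- logistic.train(X, Tind) followed by logistic.classify(X, beta).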