Lecture 6: Bayesian Logistic Regression

CSE 788.04: Topics in Machine Learning

Lecture Date: April 11th, 2012

Lecture 6: Bayesian Logistic Regression

Lecturer: Brian Kulis

Scribe: Ziqi Huang

1 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X), in the case where Y is discrete-valued and X = <X_1, ..., X_n> is any vector containing discrete or continuous variables. For a two-class classification problem, the posterior probability of Y can be written as follows:

P(Y = 1|X) = \frac{1}{1 + \exp(-\omega_0 - \sum_{i=1}^n \omega_i X_i)} = \sigma(\omega^T X)    (1)

and

P(Y = 0|X) = \frac{\exp(-\omega_0 - \sum_{i=1}^n \omega_i X_i)}{1 + \exp(-\omega_0 - \sum_{i=1}^n \omega_i X_i)} = 1 - \sigma(\omega^T X),    (2)

where \sigma(\cdot) is the logistic sigmoid function defined by

\sigma(a) = \frac{1}{1 + \exp(-a)},    (3)

which is plotted in Figure 1. (Note we are implicitly redefining the data X to include an extra dimension holding a constant 1, as in linear regression, and then redefining \omega appropriately.)

Figure 1: Plot of the logistic sigmoid function.

The term sigmoid means S-shaped. This type of function is sometimes also called a squashing function because it maps the whole real axis into a finite interval. It satisfies the following symmetry property:

\sigma(-a) = 1 - \sigma(a).    (4)
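The following is a minimal NumPy sketch (not part of the original notes) of equations (1)-(4): it evaluates the sigmoid and the two class posteriors for a weight vector ω and an input X augmented with a leading 1. All names are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, equation (3)."""
    return 1.0 / (1.0 + np.exp(-a))

def class_posteriors(w, x):
    """Return (P(Y=1|X), P(Y=0|X)) for an augmented input x, equations (1)-(2)."""
    a = w @ x                       # omega^T X, with x[0] == 1 for the bias term
    p1 = sigmoid(a)
    return p1, 1.0 - p1             # symmetry property, equation (4)

# Example: 2-dimensional input augmented with a leading 1.
w = np.array([-0.5, 2.0, -1.0])     # [omega_0, omega_1, omega_2]
x = np.array([1.0, 0.3, 1.2])
print(class_posteriors(w, x))
```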

Interestingly, the parametric form of P (Y |X) used by Logistic Regression is precisely the form implied by the assumption of a Gaussian Naive Bayes classifier.

1.1 Form of P(Y|X) for Gaussian Naive Bayes Classifier

We derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier. Consider a GNB based on the following modeling assumptions:

• Y is Boolean, governed by a Bernoulli distribution with parameter π = P(Y = 1).
• X = <X_1, ..., X_n>, where each X_i is a continuous random variable.
• For each X_i, P(X_i|Y = y_k) is a Gaussian distribution of the form N(µ_{ik}, σ_i) (in many cases, this will simply be N(µ_k, σ)).
• For all i and j ≠ i, X_i and X_j are conditionally independent given Y.

Note here we are assuming the standard deviations σ_i may vary from attribute to attribute, but do not depend on Y. We now derive the parametric form of P(Y|X) that follows from this set of GNB assumptions. In general, Bayes rule allows us to write

P(Y = 1|X) = \frac{P(Y = 1)P(X|Y = 1)}{P(Y = 1)P(X|Y = 1) + P(Y = 0)P(X|Y = 0)}.    (5)

Dividing both the numerator and denominator by the numerator yields:

P(Y = 1|X) = \frac{1}{1 + \frac{P(Y = 0)P(X|Y = 0)}{P(Y = 1)P(X|Y = 1)}}    (6)

           = \frac{1}{1 + \exp\left(\ln \frac{P(Y = 0)P(X|Y = 0)}{P(Y = 1)P(X|Y = 1)}\right)}    (7)

           = \frac{1}{1 + \exp\left(\ln \frac{P(Y = 0)}{P(Y = 1)} + \sum_i \ln \frac{P(X_i|Y = 0)}{P(X_i|Y = 1)}\right)}    (8)

           = \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i|Y = 0)}{P(X_i|Y = 1)}\right)}    (9)

Step (8) uses the conditional-independence assumption to factor P(X|Y) into \prod_i P(X_i|Y), and the final step expresses P(Y = 0) and P(Y = 1) in terms of the Bernoulli parameter π.


Now consider just the summation in the denominator of equation (9). Given our assumption that P(X_i|Y = y_k) is Gaussian, we can expand this term as follows:

\sum_i \ln \frac{P(X_i|Y = 0)}{P(X_i|Y = 1)} = \sum_i \ln \frac{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i - \mu_{i1})^2}{2\sigma_i^2}\right)}    (10)

= \sum_i \ln \exp\left(\frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}\right)    (11)

= \sum_i \left(\frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}\right)    (12)

= \sum_i \left(\frac{(X_i^2 - 2X_i\mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2X_i\mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2}\right)    (13)

= \sum_i \left(\frac{2X_i(\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)    (14)

= \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)    (15)

Note this expression is a linear weighted sum of the X_i's. Substituting expression (15) back into equation (9), we have

P(Y = 1|X) = \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)\right)}    (16)

Or equivalently,

P(Y = 1|X) = \frac{1}{1 + \exp(\omega_0 + \sum_{i=1}^n \omega_i X_i)}    (17)

where the weights \omega_1, ..., \omega_n are given by

\omega_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}    (18)

and

\omega_0 = \ln \frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}.    (19)

Absorbing the sign into the weights (that is, redefining \omega \leftarrow -\omega) puts this back in the form of equations (1) and (2):

P(Y = 1|X) = \sigma(\omega^T X)    (20)

and also

P(Y = 0|X) = 1 - \sigma(\omega^T X).    (21)
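As an illustration of equations (18) and (19), here is a small NumPy sketch (my own, not from the notes) that fits per-class Gaussian parameters and converts them into logistic-regression weights; the function and variable names, and the pooled-variance estimator, are illustrative assumptions.

```python
import numpy as np

def gnb_to_logistic_weights(X, Y):
    """Map fitted Gaussian Naive Bayes parameters to logistic weights, eqs (18)-(19).

    X: (N, n) array of continuous features, Y: (N,) array of 0/1 labels.
    Returns (w0, w) such that P(Y=1|x) = 1 / (1 + exp(w0 + w @ x)), as in eq (17).
    """
    pi = Y.mean()                               # estimate of P(Y = 1)
    mu1 = X[Y == 1].mean(axis=0)                # mu_{i1}
    mu0 = X[Y == 0].mean(axis=0)                # mu_{i0}
    n1, n0 = (Y == 1).sum(), (Y == 0).sum()
    var = (n1 * X[Y == 1].var(axis=0) + n0 * X[Y == 0].var(axis=0)) / (n1 + n0)  # pooled sigma_i^2
    w = (mu0 - mu1) / var                                                        # eq (18)
    w0 = np.log((1 - pi) / pi) + np.sum((mu1**2 - mu0**2) / (2 * var))           # eq (19)
    return w0, w

# Example usage with synthetic data.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=200)
X = rng.normal(loc=Y[:, None] * 1.5, scale=1.0, size=(200, 3))
w0, w = gnb_to_logistic_weights(X, Y)
p1 = 1.0 / (1.0 + np.exp(w0 + X @ w))           # eq (17)
```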

To summarize, the logistic form arises naturally from a generative model. However, since a generative model typically has more parameters than the corresponding logistic regression model, one often prefers to work directly with the logistic regression model and estimate the parameters W. This is a discriminative approach to classification, since we directly model the probabilities over the class labels.

2 Estimating Parameters for Logistic Regression

One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood. We choose parameters

W \leftarrow \arg\max_W \prod_i P(Y_i|X_i, W)

where W = <\omega_0, \omega_1, ..., \omega_n> is the vector of parameters to be estimated, Y_i denotes the observed value of Y in the ith training example, and X_i denotes the observed value of X in the ith training example. Equivalently, we can work with the log of the conditional likelihood:

W \leftarrow \arg\max_W \sum_i \ln P(Y_i|X_i, W)

Writing out the log conditional likelihood,

\sum_i \ln P(Y_i|X_i, W) = \sum_{i=1}^n \left[ Y_i \ln P(Y_i = 1|X_i, W) + (1 - Y_i) \ln P(Y_i = 0|X_i, W) \right]    (22)

= \sum_{i=1}^n \left[ Y_i \ln \sigma(\omega^T X_i) + (1 - Y_i) \ln(1 - \sigma(\omega^T X_i)) \right]    (23)

= \sum_{i=1}^n \left[ Y_i \ln \frac{\sigma(\omega^T X_i)}{1 - \sigma(\omega^T X_i)} + \ln(1 - \sigma(\omega^T X_i)) \right]    (24)

As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form

E(W) = -\ln p(Y'|W)    (25)

     = -\sum_{n=1}^N \left[ y'_n \ln y_n + (1 - y'_n) \ln(1 - y_n) \right],    (26)

where y_n = \sigma(W^T X_n) and y'_n denotes the observed label of the nth training example.
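To make the cross-entropy error (26) concrete, here is a short NumPy sketch (mine, not the scribe's); the clipping constant is an assumption added for numerical safety.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(W, X, y_obs, eps=1e-12):
    """E(W) from equation (26). X: (N, d) augmented inputs, y_obs: (N,) labels in {0, 1}."""
    y = sigmoid(X @ W)                 # predicted probabilities y_n
    y = np.clip(y, eps, 1 - eps)       # guard the logarithms numerically
    return -np.sum(y_obs * np.log(y) + (1 - y_obs) * np.log(1 - y))
```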

Unfortunately, there is no closed-form solution for the W that maximizes this likelihood. Note also that maximum likelihood can severely overfit when the training data are linearly separable: the magnitude of W is driven to infinity as the sigmoid saturates. This singularity can be avoided by inclusion of a prior and finding a MAP solution for W, or equivalently by adding a regularization term to the error function.

2.1 Iterative Reweighted Least Squares

For linear regression models, the maximum likelihood solution under a Gaussian noise assumption has a closed form. For logistic regression there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the departure from a quadratic form is not substantial. To be precise, the error function is convex, as we shall see shortly, and hence has a unique minimum. Furthermore, the error function can be minimized by an efficient iterative technique based on the Newton-Raphson optimization scheme, which uses a local quadratic approximation to the log likelihood function. The update takes the form

W^{(new)} = W^{(old)} - [\nabla^2 E(W^{(old)})]^{-1} \nabla E(W^{(old)}).    (27)

Let us first apply this update to the linear regression model with the sum-of-squares error function. For that error function,

\nabla E(W) = \sum_{n=1}^N (W^T X_n - y'_n) X_n    (28)

            = X^T X W - X^T Y'    (29)

\nabla^2 E(W) = \sum_{n=1}^N X_n X_n^T    (30)

              = X^T X.    (31)

Plugging into equation (27), we can derive

W^{(new)} = W^{(old)} - (X^T X)^{-1} \{X^T X W^{(old)} - X^T Y'\}    (32)

          = (X^T X)^{-1} X^T Y',    (33)

which is the standard least-squares solution, recovered in a single Newton-Raphson step because the error function is exactly quadratic.
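A quick NumPy check (my own sketch, with illustrative names) that one Newton step of equations (32)-(33) lands on the least-squares solution regardless of the starting point:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                    # design matrix
Y_prime = rng.normal(size=50)                   # targets
W_old = rng.normal(size=3)                      # arbitrary starting point

grad = X.T @ X @ W_old - X.T @ Y_prime          # eq (29)
hess = X.T @ X                                  # eq (31)
W_new = W_old - np.linalg.solve(hess, grad)     # eq (32)

# Matches the least-squares solution (X^T X)^{-1} X^T Y' of eq (33).
assert np.allclose(W_new, np.linalg.lstsq(X, Y_prime, rcond=None)[0])
```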

Now let us apply the Newton-Raphson update to the cross-entropy error function (26) for the logistic regression model. Its gradient and Hessian are

\nabla E(W) = \sum_{n=1}^N (y_n - y'_n) x_n    (34)

            = X^T (Y - Y')    (35)

H = \nabla\nabla E(W) = \sum_{n=1}^N y_n (1 - y_n) x_n x_n^T    (36)

                      = X^T R X    (37)

where R is the N × N diagonal matrix with elements R_{nn} = y_n(1 - y_n) and Y is the vector of predictions y_n = \sigma(W^T x_n). Then we can derive

W^{(new)} = W^{(old)} - (X^T R X)^{-1} X^T (Y - Y')    (38)

          = (X^T R X)^{-1} \{X^T R X W^{(old)} - X^T (Y - Y')\}    (39)

          = (X^T R X)^{-1} X^T R z    (40)

where z = X W^{(old)} - R^{-1}(Y - Y'). Each update is thus a weighted least-squares solve with weights R and effective targets z; since R depends on the current W, the procedure is called iterative reweighted least squares (IRLS).
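Below is a compact IRLS sketch implementing the Newton updates of equations (34)-(38); this is my own illustrative code, not the scribe's, and the small ridge term added to the Hessian is an assumption to keep the linear solve numerically stable.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(X, y_obs, n_iters=25, jitter=1e-8):
    """Fit logistic regression by Newton-Raphson / IRLS, equations (34)-(40).

    X: (N, d) augmented design matrix (first column of ones for the bias).
    y_obs: (N,) labels in {0, 1}.
    """
    W = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = sigmoid(X @ W)                        # predictions y_n
        grad = X.T @ (y - y_obs)                  # eq (35)
        R = y * (1 - y)                           # diagonal entries of R
        H = X.T @ (R[:, None] * X)                # eq (37), X^T R X
        H += jitter * np.eye(X.shape[1])          # assumption: tiny ridge for stability
        W = W - np.linalg.solve(H, grad)          # eq (38)
    return W
```

For linearly separable data the unregularized updates can diverge (the singularity mentioned above), which is one more argument for the regularized objective of Section 2.2.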

2.2 Regularization in Logistic Regression

Overfitting the training data is a problem that can arise in Logistic Regression, especially when the data is very high-dimensional and the training data is sparse. One approach to reducing overfitting is regularization, in which we maximize a penalized log likelihood that penalizes large values of W:

W \leftarrow \arg\max_W \sum_i \ln P(Y_i|X_i, W) - \frac{\lambda}{2} \|W\|^2,    (41)

which adds a penalty proportional to the squared magnitude of W. Here λ is a constant that determines the strength of this penalty term.
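A hedged sketch of the penalized objective (41) and its gradient, again with names of my own choosing; a gradient-based optimizer or a modified IRLS step could use these directly.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def penalized_log_likelihood(W, X, y_obs, lam):
    """Objective of equation (41): log conditional likelihood minus (lambda/2)||W||^2."""
    y = np.clip(sigmoid(X @ W), 1e-12, 1 - 1e-12)
    ll = np.sum(y_obs * np.log(y) + (1 - y_obs) * np.log(1 - y))
    return ll - 0.5 * lam * np.dot(W, W)

def penalized_gradient(W, X, y_obs, lam):
    """Gradient of (41) with respect to W (to be ascended, or negated for a minimizer)."""
    y = sigmoid(X @ W)
    return X.T @ (y_obs - y) - lam * W
```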

3 The Bayesian Setting

We now make the following assumptions:

• P(W) ∝ N(µ_0, σ_0^2), i.e., a Gaussian prior on the weights.
• P(Y|X, W) ∝ σ(W^T X).
• P(W|Y, X) ∝ P(Y|X, W)P(W) ∝ σ(W^T X)P(W), i.e., the posterior is proportional to the sigmoid likelihood times the Gaussian prior.

Then for a new data point X_new we can derive the predictive distribution:

P(Y_new|X, Y¯, X_new) = \int P(Y_new|W, X_new) P(W|Y¯, X) dW    (42)

Since P(Y_new|W, X_new) is a logistic sigmoid in W and the posterior P(W|Y¯, X) is proportional to a product of sigmoid likelihood terms and a Gaussian prior, there is no closed form for P(Y_new|X, Y¯, X_new). There are several approaches to approximating the predictive distribution: the Laplace approximation, variational methods, and Monte Carlo sampling are three of the main ones. Below we focus on the Laplace approximation.
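As one illustration of the Monte Carlo route mentioned above, the following rough sketch (entirely my own; the step size, sample count, and unit-variance prior are arbitrary assumptions) draws posterior samples of W with a random-walk Metropolis sampler and averages σ(WᵀX_new) to estimate (42).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_posterior(W, X, y_obs, sigma0=1.0):
    """log P(W|Y,X) up to a constant: sigmoid likelihood plus Gaussian prior."""
    y = np.clip(sigmoid(X @ W), 1e-12, 1 - 1e-12)
    log_lik = np.sum(y_obs * np.log(y) + (1 - y_obs) * np.log(1 - y))
    log_prior = -0.5 * np.dot(W, W) / sigma0**2
    return log_lik + log_prior

def mc_predictive(X, y_obs, x_new, n_samples=5000, step=0.1, seed=0):
    """Estimate P(Y_new = 1 | data, x_new), eq (42), by random-walk Metropolis over W."""
    rng = np.random.default_rng(seed)
    W = np.zeros(X.shape[1])
    logp = log_posterior(W, X, y_obs)
    preds = []
    for _ in range(n_samples):
        W_prop = W + step * rng.normal(size=W.shape)
        logp_prop = log_posterior(W_prop, X, y_obs)
        if np.log(rng.uniform()) < logp_prop - logp:    # Metropolis accept/reject
            W, logp = W_prop, logp_prop
        preds.append(sigmoid(W @ x_new))                # sigma(W^T x_new) for this sample
    return np.mean(preds)                               # Monte Carlo estimate of eq (42)
```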

4 The Laplace Approximation

In this section we introduce a framework called the Laplace approximation, which aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Suppose we wish to evaluate an integral of the form \int \exp(-N f(x)) dx that has no closed form; equivalently, we want a Gaussian approximation g(x) centred on a mode of the density proportional to \exp(-N f(x)). The first step is to find a minimum x_0 of f(x), in other words a point such that f'(x_0) = 0 (with f''(x_0) > 0). A Gaussian distribution has the property that its logarithm is a quadratic function of the variables. We therefore consider a Taylor expansion of f(x) about x_0:

f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2.    (43)

Since f'(x_0) = 0,

f(x) \approx f(x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2.    (44)

Therefore,

\int \exp(-N f(x)) dx \approx \int \exp\left(-N\left(f(x_0) + \frac{|f''(x_0)|}{2!}(x - x_0)^2\right)\right) dx    (45)

= \exp(-N f(x_0)) \int \exp\left(-\frac{N |f''(x_0)|}{2!}(x - x_0)^2\right) dx    (46)

= \exp(-N f(x_0)) \sqrt{\frac{2\pi}{N |f''(x_0)|}}.    (47)

So we obtain an approximate closed form for the integral. Note that the approximation is accurate to order O(1/N). Finally, given a distribution p(x), the Laplace approximation replaces it by a Gaussian with mean x_0 and precision -(\ln p)''(x_0) (equal to N f''(x_0) in the notation above), where x_0 is a mode of p.
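Here is a small 1-D illustration of the recipe above (my own sketch; the Gamma-shaped unnormalized density is an arbitrary choice): find the mode, take the negative second derivative of the log density there as the precision, and compare the Laplace estimate of the normalizing constant against a numerical sum.

```python
import numpy as np

def log_p(x):
    """Unnormalized log density: p(x) ∝ x^2 exp(-x), whose true normalizer is Gamma(3) = 2."""
    return 2.0 * np.log(x) - x

# Mode: d/dx log p = 2/x - 1 = 0  =>  x0 = 2 (found analytically here).
x0 = 2.0
# Precision A = -(log p)''(x0) = 2 / x0^2.
A = 2.0 / x0**2

# Laplace estimate of the normalizing constant Z = \int p(x) dx, cf. eq (47).
Z_laplace = np.exp(log_p(x0)) * np.sqrt(2 * np.pi / A)

# Compare against a simple Riemann sum on a fine grid.
xs = np.linspace(1e-6, 40, 200000)
Z_numeric = np.sum(np.exp(log_p(xs))) * (xs[1] - xs[0])

print(Z_laplace, Z_numeric)   # Gaussian approximation vs. (near-)exact value of 2
```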

4.1 Example: P(W|Y) ∝ P(Y|W, X)P(W)

As we introduced before, the predictive distribution is

P(Y_new|X, Y¯, X_new) = \int P(Y_new|W, X_new) P(W|Y¯, X) dW    (48)

                      = \int \sigma(W^T X) P(W|Y¯, X) dW    (49)

(writing X for the new input X_new from here on), and we cannot obtain a closed-form solution. We will use the Laplace approximation to approximate the posterior P(W|Y¯, X) by a Gaussian, and then approximate the sigmoid by a probit function Φ(λa) (the cumulative Gaussian) to get an approximate solution for the predictive distribution. Because the posterior over W is approximated as a Gaussian and a = W^T X is a linear function of W, the marginal distribution of a will also be Gaussian. Since

\sigma(W^T X) = \int \delta(a - W^T X)\sigma(a) da,    (50)

we can write

\int \sigma(W^T X) P(W|Y¯, X) dW = \int \sigma(a) p(a) da    (51)

where p(a) = \int \delta(a - W^T X) P(W|Y¯, X) dW. Approximating \sigma(a) by the probit \Phi(\lambda a) with \lambda^2 = \pi/8, the remaining Gaussian integral has a closed form, and we obtain

\int \sigma(a) N(a|\mu, \sigma^2) da \approx \sigma(\kappa(\sigma^2)\mu),    (52)

where \kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}.

The mean of p(a) is

\mu_a = \int p(a)\, a \, da    (53)

      = \int P(W|Y¯, X)\, W^T X \, dW    (54)

      = W_{MAP}^T X,    (55)

and its variance is

\sigma_a^2 = \int p(a)(a^2 - \mu_a^2) da    (56)

           = \int P(W|Y¯, X)\left((W^T X)^2 - (m_N^T X)^2\right) dW    (57)

           = X^T S_N X,    (58)

where m_N = W_{MAP} and S_N are the mean and covariance of the Gaussian (Laplace) approximation to the posterior P(W|Y¯, X). Combining (52), (55), and (58), the approximate predictive distribution is P(Y_new = 1|X, Y¯, X_new) ≈ \sigma(\kappa(\sigma_a^2)\mu_a).
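To tie the pieces together, here is a hedged sketch (my own; W_MAP and S_N are assumed to come from a Laplace fit, e.g. the regularized IRLS solution and the inverse Hessian of the negative log posterior) of the approximate predictive probability σ(κ(σ_a²)µ_a).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_predictive(x_new, W_map, S_N):
    """Approximate P(Y_new = 1 | x_new) using equations (52), (55), and (58).

    W_map: mode of the posterior (the MAP solution).
    S_N:   covariance of the Laplace approximation, i.e., the inverse Hessian of the
           negative log posterior evaluated at W_map.
    """
    mu_a = W_map @ x_new                                # eq (55)
    var_a = x_new @ S_N @ x_new                         # eq (58)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)    # kappa(sigma_a^2) from eq (52)
    return sigmoid(kappa * mu_a)
```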