CSE 788.04: Topics in Machine Learning
Lecture Date: April 11th, 2012
Lecture 6: Bayesian Logistic Regression
Lecturer: Brian Kulis
Scribe: Ziqi Huang

1 Logistic Regression
Logistic regression is an approach to learning functions of the form f : X → Y, or P(Y|X), in the case where Y is discrete-valued and X = <X_1, ..., X_n> is any vector containing discrete or continuous variables. For a two-class classification problem, the posterior probability of Y can be written as

    P(Y = 1|X) = \frac{1}{1 + \exp(-\omega_0 - \sum_{i=1}^n \omega_i X_i)} = \sigma(\omega^T X)    (1)

and

    P(Y = 0|X) = \frac{\exp(-\omega_0 - \sum_{i=1}^n \omega_i X_i)}{1 + \exp(-\omega_0 - \sum_{i=1}^n \omega_i X_i)} = 1 - \sigma(\omega^T X)    (2)

where σ(·) is the logistic sigmoid function defined by

    \sigma(a) = \frac{1}{1 + \exp(-a)}    (3)
which is plotted in Figure 1. (Note we are implicitly redefining the data X to add an extra dimension holding a 1, as in linear regression, and then redefining ω appropriately.)

Figure 1: Plot of the logistic sigmoid function.

The term sigmoid means S-shaped. This type of function is sometimes also called a squashing function because it maps the whole real axis into a finite interval. It satisfies the symmetry property

    \sigma(-a) = 1 - \sigma(a)    (4)
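The sigmoid and its symmetry property (4) are easy to check numerically. Below is a minimal sketch in Python (the `sigmoid` helper is illustrative, not part of the notes), written so it stays stable for large |a|:

```python
import math

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a)), computed stably."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # For large negative a, exp(-a) would overflow; use the equivalent form.
    z = math.exp(a)
    return z / (1.0 + z)

# Symmetry property (4): sigma(-a) = 1 - sigma(a)
for a in [-800.0, -2.5, 0.0, 2.5, 800.0]:
    assert abs(sigmoid(-a) - (1.0 - sigmoid(a))) < 1e-12
```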
Interestingly, the parametric form of P (Y |X) used by Logistic Regression is precisely the form implied by the assumption of a Gaussian Naive Bayes classifier.
1.1 Form of P(Y|X) for Gaussian Naive Bayes Classifier
We derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier. Consider a GNB based on the following modeling assumptions:

• Y is Boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1).
• X = <X_1, ..., X_n>, where each X_i is a continuous random variable.
• For each X_i, P(X_i|Y = y_k) is a Gaussian distribution of the form N(µ_{ik}, σ_i) (in many cases, this will simply be N(µ_k, σ)).
• For all i and j ≠ i, X_i and X_j are conditionally independent given Y.

Note here we are assuming the standard deviations σ_i vary from attribute to attribute, but do not depend on Y. We now derive the parametric form of P(Y|X) that follows from this set of GNB assumptions. In general, Bayes rule allows us to write

    P(Y = 1|X) = \frac{P(Y=1)P(X|Y=1)}{P(Y=1)P(X|Y=1) + P(Y=0)P(X|Y=0)}    (5)

Dividing both the numerator and denominator by the numerator yields:

    P(Y = 1|X) = \frac{1}{1 + \frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)}}    (6)

    = \frac{1}{1 + \exp\left(\ln \frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)}\right)}    (7)

    = \frac{1}{1 + \exp\left(\ln \frac{P(Y=0)}{P(Y=1)} + \sum_i \ln \frac{P(X_i|Y=0)}{P(X_i|Y=1)}\right)}    (8)

    = \frac{1}{1 + \exp\left(\ln \frac{1-\pi}{\pi} + \sum_i \ln \frac{P(X_i|Y=0)}{P(X_i|Y=1)}\right)}    (9)

Note that step (8) uses the conditional independence assumption, and the final step expresses P(Y=0) and P(Y=1) in terms of the Bernoulli parameter π.
Now consider just the summation in the denominator of equation (9). Given our assumption that P(X_i|Y = y_k) is Gaussian, we can expand this term as follows:

    \sum_i \ln \frac{P(X_i|Y=0)}{P(X_i|Y=1)} = \sum_i \ln \frac{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i-\mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i-\mu_{i1})^2}{2\sigma_i^2}\right)}    (10)

    = \sum_i \ln \exp\left(\frac{(X_i-\mu_{i1})^2 - (X_i-\mu_{i0})^2}{2\sigma_i^2}\right)    (11)

    = \sum_i \frac{(X_i-\mu_{i1})^2 - (X_i-\mu_{i0})^2}{2\sigma_i^2}    (12)

    = \sum_i \frac{(X_i^2 - 2X_i\mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2X_i\mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2}    (13)

    = \sum_i \frac{2X_i(\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}    (14)

    = \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)    (15)
Note this expression is a linear weighted sum of the X_i's. Substituting expression (15) back into equation (9), we have

    P(Y = 1|X) = \frac{1}{1 + \exp\left(\ln\frac{1-\pi}{\pi} + \sum_i \left(\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}\right)\right)}    (16)
Or equivalently,

    P(Y = 1|X) = \frac{1}{1 + \exp(\omega_0 + \sum_{i=1}^n \omega_i X_i)}    (17)

where the weights ω_1, ..., ω_n are given by

    \omega_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}    (18)

and

    \omega_0 = \ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}    (19)

Flipping the sign of every weight (ω ← −ω) puts this in the form of equation (1), so we can derive

    P(Y = 1|X) = \sigma(\omega^T X)    (20)

and also we have

    P(Y = 0|X) = 1 - \sigma(\omega^T X)    (21)
To summarize, the logistic form arises naturally from a generative model. However, since a generative model often has more parameters than the logistic regression model, one often prefers to work directly with the logistic regression model to find the parameters W. This is a discriminative approach to classification, since we directly model the probabilities over the class labels.
2 Estimating Parameters for Logistic Regression
One reasonable approach to training logistic regression is to choose parameter values that maximize the conditional data likelihood. We choose parameters

    W \leftarrow \arg\max_W \prod_i P(Y_i | X_i, W)

where W = <\omega_0, \omega_1, ..., \omega_n> is the vector of parameters to be estimated, and Y_i and X_i denote the observed values of Y and X in the ith training example. Equivalently, we can work with the log of the conditional likelihood:

    W \leftarrow \arg\max_W \sum_i \ln P(Y_i | X_i, W)
This log conditional likelihood can be written

    \sum_i \ln P(Y_i|X_i, W) = \sum_{i=1}^n \Big[ Y_i \ln P(Y_i = 1|X_i, W) + (1 - Y_i) \ln P(Y_i = 0|X_i, W) \Big]    (22)

    = \sum_{i=1}^n \Big[ Y_i \ln \sigma(\omega^T X_i) + (1 - Y_i) \ln (1 - \sigma(\omega^T X_i)) \Big]    (23)

    = \sum_{i=1}^n \Big[ Y_i \ln \frac{\sigma(\omega^T X_i)}{1 - \sigma(\omega^T X_i)} + \ln (1 - \sigma(\omega^T X_i)) \Big]    (24)
As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function

    E(W) = -\ln p(Y'|W)    (25)

    = -\sum_{n=1}^N \Big[ y'_n \ln y_n + (1 - y'_n) \ln (1 - y_n) \Big]    (26)

where y'_n is the target label of the nth training example and y_n = \sigma(\omega^T X_n) is the corresponding prediction.
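The error function (26) is a one-liner in code; this sketch (our illustrative naming, with targets and predictions passed as plain lists) evaluates it directly:

```python
import math

def cross_entropy_error(targets, preds):
    """E(W) = -sum_n [ t_n ln y_n + (1 - t_n) ln (1 - y_n) ], equation (26).

    targets holds the labels y'_n in {0, 1}; preds holds y_n in (0, 1)."""
    return -sum(t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
                for t, y in zip(targets, preds))
```

Note that E(W) shrinks toward 0 as the predictions approach the targets.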
Unfortunately, there is no closed-form solution for maximizing the likelihood with respect to W. Moreover, when the training data are linearly separable, the maximum likelihood solution is singular: the likelihood can always be increased by scaling up W, so its magnitude diverges to infinity. The singularity can be avoided by inclusion of a prior and finding a MAP solution for W, or equivalently by adding a regularization term to the error function.
2.1 Iterative reweighted least squares
In the case of the linear regression models, the maximum likelihood solution, under the assumption of a Gaussian noise model, has a closed form. For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the departure from a quadratic form is not substantial. To be precise, the error function is convex, as we shall see shortly, and hence has a unique minimum. Furthermore, the error function can be minimized by an efficient iterative technique based on the Newton-Raphson optimization scheme, which uses a local quadratic approximation to the log likelihood function:

    W^{(new)} = W^{(old)} - [\nabla^2 E(W^{(old)})]^{-1} \nabla E(W^{(old)})    (27)
Let us first apply this update to the linear regression model with the sum-of-squares error function. Then we can derive

    \nabla E(W) = \sum_{n=1}^N (W^T X_n - y'_n) X_n    (28)

    = X^T X W - X^T Y'    (29)
and

    \nabla^2 E(W) = \sum_{n=1}^N X_n X_n^T    (30)

    = X^T X    (31)

Plugging into equation (27), we can derive

    W^{(new)} = W^{(old)} - (X^T X)^{-1} \{ X^T X W^{(old)} - X^T Y' \}    (32)

    = (X^T X)^{-1} X^T Y'    (33)

which is the standard least-squares solution; since the error function here is exactly quadratic, Newton-Raphson reaches it in a single step.
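The one-step behavior of (32)-(33) can be verified directly. A small pure-Python sketch (our illustrative naming; written for two parameters so the 2x2 normal equations can be solved by Cramer's rule):

```python
def newton_step_linear(X, Y, W_old):
    """One Newton step (32) for the sum-of-squares error:

        W_new = W_old - (X^T X)^{-1} (X^T X W_old - X^T Y').

    X is a list of rows [x0, x1]; Y is the list of targets."""
    # Form X^T X and X^T Y
    A = [[sum(x[j] * x[k] for x in X) for k in range(2)] for j in range(2)]
    b = [sum(x[j] * y for x, y in zip(X, Y)) for j in range(2)]
    # Gradient (29): X^T X W_old - X^T Y'
    g = [A[j][0] * W_old[0] + A[j][1] * W_old[1] - b[j] for j in range(2)]
    # Solve A s = g by Cramer's rule, then take the step
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    s = [(g[0] * A[1][1] - g[1] * A[0][1]) / det,
         (A[0][0] * g[1] - A[1][0] * g[0]) / det]
    return [W_old[0] - s[0], W_old[1] - s[1]]
```

Starting from any W^{(old)}, a single step lands on the same least-squares solution (33).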
Now let us apply the Newton-Raphson update to the cross-entropy error function (26) for the logistic regression model:

    \nabla E(W) = \sum_{n=1}^N (y_n - y'_n) X_n    (34)

    = X^T (Y - Y')    (35)

    H = \nabla\nabla E(W) = \sum_{n=1}^N y_n (1 - y_n) X_n X_n^T    (36)

    = X^T R X    (37)

where R is the N × N diagonal matrix with elements R_{nn} = y_n(1 - y_n). Then we can derive

    W^{(new)} = W^{(old)} - (X^T R X)^{-1} X^T (Y - Y')    (38)

    = (X^T R X)^{-1} \{ X^T R X W^{(old)} - X^T (Y - Y') \}    (39)

    = (X^T R X)^{-1} X^T R z    (40)

where z = X W^{(old)} - R^{-1} (Y - Y').
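The update (38) can be sketched in a few lines of Python (illustrative naming; two parameters so the 2x2 Hessian solve stays explicit; `t` plays the role of the targets Y' and `y` the predictions):

```python
import math

def sigmoid(a):
    """Stable logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a)) if a >= 0 else math.exp(a) / (1.0 + math.exp(a))

def irls(X, t, n_iter=25):
    """Iterative reweighted least squares: repeat the Newton update (38),

        W_new = W_old - (X^T R X)^{-1} X^T (y - t),   R_nn = y_n (1 - y_n)."""
    w = [0.0, 0.0]
    for _ in range(n_iter):
        y = [sigmoid(w[0] * x[0] + w[1] * x[1]) for x in X]
        # Gradient (35): X^T (y - t)
        g = [sum((yn - tn) * x[j] for yn, tn, x in zip(y, t, X)) for j in range(2)]
        # Hessian (37): X^T R X
        H = [[sum(yn * (1.0 - yn) * x[j] * x[k] for yn, x in zip(y, X))
              for k in range(2)] for j in range(2)]
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        s = [(g[0] * H[1][1] - g[1] * H[0][1]) / det,
             (H[0][0] * g[1] - H[1][0] * g[0]) / det]
        w = [w[0] - s[0], w[1] - s[1]]
    return w
```

On non-separable data the iterates converge quickly to the maximum likelihood solution, where the gradient (35) vanishes; on separable data the weights diverge, which is the singularity noted earlier.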
2.2 Regularization in Logistic Regression
Overfitting the training data is a problem that can arise in logistic regression, especially when the data are very high-dimensional and the training set is sparse. One approach to reducing overfitting is regularization, in which we create a modified, penalized log likelihood function that penalizes large values of W:

    W \leftarrow \arg\max_W \sum_i \ln P(Y_i|X_i, W) - \frac{\lambda}{2} \|W\|^2    (41)

which adds a penalty proportional to the squared magnitude of W. Here λ is a constant that determines the strength of this penalty term.
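The penalized objective (41) is easy to evaluate; a minimal sketch (our naming), which uses the algebraic identity t ln σ(a) + (1−t) ln(1−σ(a)) = t·a − ln(1 + e^a) for numerical convenience:

```python
import math

def penalized_log_likelihood(w, X, t, lam):
    """Objective (41): sum_i ln P(t_i | x_i, w) - (lam / 2) * ||w||^2."""
    ll = 0.0
    for x, ti in zip(X, t):
        a = sum(wi * xi for wi, xi in zip(w, x))
        # t*a - ln(1 + e^a), guarding against overflow for large a
        ll += ti * a - (math.log1p(math.exp(a)) if a < 30.0 else a)
    return ll - 0.5 * lam * sum(wi * wi for wi in w)
```

For fixed nonzero w, increasing λ lowers the objective, pulling the maximizer toward smaller weights.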
3 The Bayesian Setting

Now we make the following assumptions:

• P(W) ∝ N(µ_0, σ_0^2)
• P(Y|X, W) ∝ σ(W^T X)
• P(W|Y, X) ∝ P(Y|X, W) P(W) ∝ σ(W^T X) P(W)

Then for a new data point X_new, we can derive the predictive distribution:

    P(Y_{new} | \bar{X}, \bar{Y}, X_{new}) = \int P(Y_{new} | W, X_{new}) P(W | \bar{Y}, \bar{X}) dW    (42)

Since P(Y_new | W, X_new) involves the logistic sigmoid and P(W | Ȳ, X̄) is proportional to a Gaussian prior times sigmoid likelihoods, there is no closed form for P(Y_new | X̄, Ȳ, X_new). There are several approaches to approximating the predictive distribution: the Laplace approximation, variational methods, and Monte Carlo sampling are three of the main ones. Below we focus on the Laplace approximation.
4 The Laplace Approximation

In this section, we introduce a framework called the Laplace approximation, which aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. We assume there is a function f(x) for which \int \exp(-N f(x)) dx has no closed form. In the Laplace method the goal is to find a Gaussian approximation centred on a mode of the integrand. The first step is to find a mode of f(x), in other words a point x_0 such that f'(x_0) = 0. A Gaussian distribution has the property that its logarithm is a quadratic function of the variables. We therefore consider a Taylor expansion of f(x) about x_0:

    f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2    (43)

Since f'(x_0) = 0,

    f(x) \approx f(x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2    (44)

Therefore,

    \int \exp(-N f(x)) dx = \int \exp\Big(-N \Big[ f(x_0) + \frac{|f''(x_0)|}{2!}(x - x_0)^2 \Big]\Big) dx    (45)

    = \exp(-N f(x_0)) \int \exp\Big(-\frac{N |f''(x_0)|}{2!}(x - x_0)^2\Big) dx    (46)

    = \exp(-N f(x_0)) \sqrt{\frac{2\pi}{N |f''(x_0)|}}    (47)

So we get an approximate closed form for the integral, accurate to order O(1/N). Finally, given a distribution p(x), using the Laplace approximation we form a Gaussian approximation with mean x_0 and precision -(\ln p)''(x_0), where x_0 is a mode of p.
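A quick numerical check of (47) (illustrative Python; f(x) = cosh(x) − 1, with mode x_0 = 0 and f''(0) = 1, is just an example function, and the trapezoid rule stands in for the exact integral):

```python
import math

def laplace_integral(f, d2f, x0, N):
    """Closed-form approximation (47):

        int exp(-N f(x)) dx  ~  exp(-N f(x0)) * sqrt(2 pi / (N |f''(x0)|))

    where x0 is a mode of the integrand, i.e. f'(x0) = 0."""
    return math.exp(-N * f(x0)) * math.sqrt(2.0 * math.pi / (N * abs(d2f(x0))))

def trapezoid(g, a, b, n=20000):
    """Simple trapezoid rule, used here as a ground-truth quadrature."""
    h = (b - a) / n
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))
```

For N = 100 the two answers agree to roughly one part in a thousand, consistent with the O(1/N) accuracy noted above.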
4.1 Example: P(W|Y) ∝ P(Y|W, X) P(W)

As we introduced before, for the predictive distribution

    P(Y_{new} | \bar{X}, \bar{Y}, X_{new}) = \int P(Y_{new} | W, X_{new}) P(W | \bar{Y}, \bar{X}) dW    (48)

    = \int \sigma(W^T X) P(W | \bar{Y}, \bar{X}) dW    (49)

we cannot get a closed form solution. We will use the Laplace approximation to approximate the posterior P(W | Ȳ, X̄) as a Gaussian, and we will further approximate the logistic sigmoid by a probit function Φ(λa) to get an approximate solution for the predictive distribution. Because P(W | Ȳ, X̄) is approximated as a Gaussian, the marginal distribution of a = W^T X will also be Gaussian. Since

    \sigma(W^T X) = \int \delta(a - W^T X) \sigma(a) da    (50)

we can derive

    \int \sigma(W^T X) P(W | \bar{Y}, \bar{X}) dW = \int \sigma(a) P(a) da    (51)

where P(a) = \int \delta(a - W^T X) P(W | \bar{Y}, \bar{X}) dW is Gaussian, say N(a | \mu_a, \sigma_a^2). Using the probit approximation \sigma(a) \approx \Phi(\lambda a) with \lambda^2 = \pi/8, the convolution has the approximate closed form

    \int \sigma(a) N(a | \mu_a, \sigma_a^2) da \approx \sigma(\kappa(\sigma_a^2) \mu_a)    (52)

where \kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}. The mean of a is

    \mu_a = \int p(a) \, a \, da    (53)

    = \int P(W | \bar{Y}, \bar{X}) \, W^T X \, dW    (54)

    = W_{MAP}^T X    (55)

and also we can derive the variance

    \sigma_a^2 = \int p(a) (a^2 - \mu_a^2) da    (56)

    = \int P(W | \bar{Y}, \bar{X}) \big( (W^T X)^2 - (W_{MAP}^T X)^2 \big) dW    (57)

    = X^T S_N X    (58)

where W_MAP (also written m_N) is the mode of the posterior and S_N is the covariance of its Laplace approximation.
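The quality of the probit-based approximation (52) is easy to check numerically. A sketch (our naming; the quadrature routine stands in for the exact convolution):

```python
import math

def sigmoid(a):
    """Stable logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a)) if a >= 0 else math.exp(a) / (1.0 + math.exp(a))

def predictive(mu_a, var_a):
    """Approximate predictive probability, equation (52):

        int sigma(a) N(a | mu_a, var_a) da  ~  sigma(kappa(var_a) * mu_a)

    with kappa(var) = (1 + pi * var / 8)^(-1/2)."""
    kappa = (1.0 + math.pi * var_a / 8.0) ** -0.5
    return sigmoid(kappa * mu_a)

def predictive_quadrature(mu_a, var_a, n=20000):
    """Direct numerical evaluation of the same convolution, for comparison."""
    s = math.sqrt(var_a)
    lo, hi = mu_a - 8.0 * s, mu_a + 8.0 * s
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        a = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * sigmoid(a) * math.exp(-(a - mu_a) ** 2 / (2.0 * var_a))
    return total * h / math.sqrt(2.0 * math.pi * var_a)
```

Across a range of (µ_a, σ_a²) values the two agree to within a couple of percent, which is typically good enough for classification decisions.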