BIOS 312: Modern Regression Analysis

James C (Chris) Slaughter
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
biostat.mc.vanderbilt.edu/CourseBios312

Copyright 2009-2012 JC Slaughter

All Rights Reserved. Updated February 17, 2012

Contents

8 Parameter Estimation for Linear and Logistic Regression
  8.1 Simple Linear Regression
    8.1.1 Formal statement of model
    8.1.2 Important Properties
    8.1.3 Example: Hours and Bids
  8.2 Least Squares Estimation of Simple Linear Regression Parameters
    8.2.1 Estimating β0 and β1
    8.2.2 Estimating σ² (non-robust)
  8.3 Maximum Likelihood Estimation of Simple Linear Regression Parameters
    8.3.1 Statement of the Model
    8.3.2 Example: Estimation by Maximum Likelihood
    8.3.3 Method of Maximum Likelihood for Simple Linear Regression
  8.4 Bayesian Methods for Simple Linear Regression
    8.4.1 A quick review
    8.4.2 Simple Linear Regression
  8.5 Logistic Model Estimation
    8.5.1 Likelihood Function
    8.5.2 Maximum Likelihood Estimation

Chapter 8

Parameter Estimation for Linear and Logistic Regression

8.1 Simple Linear Regression

8.1.1 Formal statement of model

· In classical simple linear regression, there is only one predictor variable and the regression function is linear

Yi = β0 + β1 xi + εi

– Yi is the value of the response variable from the ith trial
– β0 and β1 are the intercept and slope parameters, respectively
– xi is a fixed, known constant, namely the value of the predictor variable for the ith trial
– εi is a random error term with mean E[εi] = 0 and variance V[εi] = σ²; εi and εj are uncorrelated so that their covariance is zero (i.e., Cov[εi, εj] = 0 for all j ≠ i)
– i = 1, . . . , n

· Using matrix notation, the model can be written as Y = Xβ + ε, where

– Y = (Y1, Y2, . . . , Yn)′ is the n × 1 vector of responses
– X is the n × 2 design matrix whose ith row is (1, xi)
– β = (β0, β1)′ is the 2 × 1 vector of regression parameters
– ε = (ε1, ε2, . . . , εn)′ is the n × 1 vector of error terms
  ∗ E[ε] = 0, where 0 is the n × 1 vector of zeros
  ∗ V[ε] = σ²I, where I is the n × n identity matrix
  ∗ E[Y] = Xβ
– I use standard notation to indicate matrices/vectors and scalars
  ∗ Boldface indicates a vector or matrix (Y, X, β, ε, 0, I)
  ∗ Normal typeface indicates a scalar (Yi, xi, β0, β1, εi, 0, 1)
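· As a quick illustration of the matrix form, here is a minimal NumPy sketch that builds X, β, and ε and recovers Y = Xβ + ε. The values of x, β0, β1, and σ are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n trials with a single predictor x
n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta = np.array([9.5, 2.1])            # (beta0, beta1), illustrative values
sigma = 1.0

# Design matrix X (n x 2): a column of ones and the predictor
X = np.column_stack([np.ones(n), x])

# Error vector epsilon with E[eps] = 0 and V[eps] = sigma^2 * I
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# The model in matrix form: Y = X beta + eps, so E[Y] = X beta
Y = X @ beta + eps
print("E[Y] =", X @ beta)
print("Y    =", Y)
```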

8.1.2 Important Properties

· The response Yi in the ith trial is the sum of two components: (1) the signal β0 + β1 xi and (2) the random noise εi. Hence Yi is a random variable

· Since E[εi] = 0, it follows that E[Yi] = E[β0 + β1 xi + εi] = E[β0 + β1 xi] + E[εi] = β0 + β1 xi

· In classical regression, the error terms εi are assumed to have constant variance σ². Therefore, V[Yi] = V[β0 + β1 xi + εi] = V[β0 + β1 xi] + V[εi] = σ²

· The error terms are assumed to be uncorrelated. Hence, the outcome in any one trial has no effect on the error term from any other trial. Since the error terms are uncorrelated, so are the responses Yi and Yj .

· In summary, the simple linear regression model implies that all responses Yi come from probability distributions whose means are β0 + β1 xi and whose variances are σ 2 , the same for all levels of X. Further, any two responses Yi and Yj are uncorrelated.

8.1.3 Example: Hours and Bids

· A consultant is studying the relationship between the amount of time required to prepare a bid and the number of bids received. Suppose the following regression model is appropriate

Yi = β0 + β1 xi + εi

· Yi is the time in hours and xi is the number of bids received


· This model is illustrated graphically in Figure 1.6 (courtesy of Applied Linear Statistical Models)
  – E[Yi | xi] = 9.5 + 2.1 × xi
  – E[Y1 | x1 = 45] = 9.5 + 2.1 × 45 = 104
    ∗ The observed value at x1 = 45 is Y1 = 108. Thus, ε1 = 108 − 104 = 4

· The figure also displays the probability distribution of Y for two values of x. Note that the distributions are assumed to be identical, except for the mean, for all values of X. This is the case for classical linear regression (non-robust), which assumes the “strong null”.
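· A one-line arithmetic check of the fitted mean and residual quoted above, using the illustrative values taken from the figure:

```python
# Fitted mean and residual for the Hours and Bids illustration
beta0, beta1 = 9.5, 2.1
x1, y1 = 45, 108

mean_y1 = beta0 + beta1 * x1      # 9.5 + 2.1 * 45 = 104.0
eps1 = y1 - mean_y1               # 108 - 104 = 4.0
print(mean_y1, eps1)
```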

8.2 Least Squares Estimation of Simple Linear Regression Parameters

8.2.1 Estimating β0 and β1

· The method of least squares considers the deviation of Yi from its expected value, Yi − (β0 + β1 xi)

· In particular, least squares estimation considers the sum of the n squared deviations

Q = Σᵢ₌₁ⁿ (Yi − β0 − β1 xi)²

· The least squares estimator minimizes Q with respect to β0 and β1

· How to find estimators that minimize Q
  – Numerical search (used for other models)
  – Analytical solution (calculus). Take partial derivatives and set them equal to zero to find the minimum.

    ∂Q/∂β0 = −2 Σᵢ₌₁ⁿ (Yi − β0 − β1 xi) = 0
    ∂Q/∂β1 = −2 Σᵢ₌₁ⁿ xi (Yi − β0 − β1 xi) = 0

· Using the partial derivatives, we can derive the normal equations

Σ Yi = nβ0 + β1 Σ xi
Σ xi Yi = β0 Σ xi + β1 Σ xi²

· The normal equations can be solved simultaneously for β0 and β1 to give the point estimates β̂0 and β̂1

β̂1 = Σ (xi − x̄)(Yi − Ȳ) / Σ (xi − x̄)²

β̂0 = (1/n)(Σ Yi − β̂1 Σ xi) = Ȳ − β̂1 x̄
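· A minimal sketch of these closed-form least squares estimates. The x and Y values are synthetic, since the course data are not reproduced here:

```python
import numpy as np

# Synthetic example data (hypothetical, for illustration only)
x = np.array([20.0, 25.0, 30.0, 35.0, 45.0, 50.0])
Y = np.array([55.0, 62.0, 70.0, 82.0, 104.0, 112.0])

# Closed-form least squares estimates from the normal equations
xbar, Ybar = x.mean(), Y.mean()
beta1_hat = np.sum((x - xbar) * (Y - Ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = Ybar - beta1_hat * xbar
print(beta0_hat, beta1_hat)

# The same estimates via the matrix normal equations, solving (X'X) b = X'Y
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.solve(X.T @ X, X.T @ Y))
```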

· Taking second partial derivatives will verify that the least squares estimators minimize Q

· Gauss-Markov Theorem
  – Under the conditions stated in 8.1.1, the least squares estimators β̂0 and β̂1 are unbiased and have minimum variance among all unbiased linear estimators.

8.2.2 Estimating σ² (non-robust)

· Remember how we usually estimate σ²

s² = Σ (Yi − Ȳ)² / (n − 1)

– The numerator is the sum of squared deviations from the mean
– The denominator is the sample size minus 1 (one parameter, Ȳ, is being used to estimate the mean)

· Simple regression setting: E[Yi] = β0 + β1 xi

s²y|x = Σ (Yi − β̂0 − β̂1 xi)² / (n − 2)

– Subtract 2 because we are using two parameters (β̂0 and β̂1) to estimate the mean. If we use p parameters to estimate the mean, then use n − p in the denominator.
– s²y|x is called the “Mean Squared Error” (MSE)


– sy|x is called the “Root MSE”
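· Continuing the least squares sketch above (and reusing its variable names, which are assumptions of that sketch), s²y|x and the Root MSE follow directly from the residuals, with n − 2 in the denominator because two parameters estimate the mean:

```python
import numpy as np

def mse_and_root_mse(x, Y, beta0_hat, beta1_hat):
    """Mean squared error s^2_{y|x} and Root MSE for simple linear regression."""
    resid = Y - beta0_hat - beta1_hat * x   # deviations from the fitted mean
    n = len(Y)
    s2 = np.sum(resid ** 2) / (n - 2)       # n - 2: two mean parameters estimated
    return s2, np.sqrt(s2)

# Example usage with the earlier sketch's objects:
# s2, root_mse = mse_and_root_mse(x, Y, beta0_hat, beta1_hat)
```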

8.3 Maximum Likelihood Estimation of Simple Linear Regression Parameters

8.3.1 Statement of the Model

· In classical simple linear regression, there is only one predictor variable and the regression function is linear

Yi = β0 + β1 xi + εi

– Yi is the value of the response variable from the ith trial
– β0 and β1 are the intercept and slope parameters, respectively
– xi is a fixed, known constant, namely the value of the predictor variable for the ith trial
– εi are independent N(0, σ²) (i.i.d.)
– i = 1, . . . , n

· Note that the model assumes the εi are Normally distributed. Hence, Yi ∼ N(β0 + β1 xi, σ²)

8.3.2 Example: Estimation by Maximum Likelihood

· Consider a single Normal population whose standard deviation is known to be σ = 10. A random sample of n = 3 observations from the population gives Y1 = 250, Y2 = 265, and Y3 = 259.


· Question: Are these three observations more consistent with a N(230, 10²) population or a N(259, 10²) population?

[Figure: Normal density functions with means µ = 230 and µ = 259 (σ = 10), with the observations Y1, Y2, Y3 marked on the horizontal axis.]

· Recall that the Normal density function is

f(x; µ, σ²) = (2πσ²)^(−1/2) exp(−(1/2)(x − µ)²/σ²)

· The method of maximum likelihood uses the density of the probability distribution at Yi to measure the strength of agreement of observation Yi with that density. Larger values indicate more consistency.

· We can find the densities for Y1 = 250, denoted f1(x; µ), for the two cases of µ as follows:

µ = 230: f1 = (2π × 10²)^(−1/2) exp(−(1/2)(250 − 230)²/10²) = 0.005399
µ = 259: f1 = (2π × 10²)^(−1/2) exp(−(1/2)(250 − 259)²/10²) = 0.026609

· Computing the densities for all three sample values for the two cases of µ gives

           µ = 230     µ = 259
f1(x; µ)   0.005399    0.026609
f2(x; µ)   0.000087    0.033322
f3(x; µ)   0.000595    0.039894

· The method of maximum likelihood uses the product of the densities to measure the consistency of the sample values with the parametric distribution.

L(µ) = Πᵢ₌₁ⁿ fi(x; µ)

· For our simple example

L(µ = 230) = 0.005399 × 0.000087 × 0.000595 = 0.279 × 10⁻⁹
L(µ = 259) = 0.026609 × 0.033322 × 0.039894 = 0.0000353

· And the Likelihood Ratio is

L(µ = 259) / L(µ = 230) = 126,881

· The Likelihood Ratio measures strength of evidence. Using the scale of Royall (1997):
  – 3 to 8 (or 1/8 to 1/3) is moderate evidence
  – 8 to 200 (or 1/200 to 1/8) is strong evidence
  – > 200 (or < 1/200) is convincing evidence

· Note that in this sample, the MLE is the sample mean, Ȳ = 258
  – L(µ = 258) = 0.0000359
  – L(µ = 258) / L(µ = 259) = 1.014
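· The densities, likelihoods, and likelihood ratio in this example can be reproduced directly; a short sketch using scipy's Normal density (any small numerical differences from the table come from rounding the displayed densities):

```python
import numpy as np
from scipy.stats import norm

Y = np.array([250.0, 265.0, 259.0])
sigma = 10.0

# Density of each observation under the two candidate means
for mu in (230.0, 259.0):
    print(mu, norm.pdf(Y, loc=mu, scale=sigma))

# Likelihood = product of the densities; likelihood ratio compares the two means
L230 = np.prod(norm.pdf(Y, loc=230.0, scale=sigma))
L259 = np.prod(norm.pdf(Y, loc=259.0, scale=sigma))
print(L259 / L230)          # on the order of 1e5: convincing evidence for mu = 259

# The MLE of mu is the sample mean
print(Y.mean())             # 258.0
```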

8.3.3 Method of Maximum Likelihood for Simple Linear Regression

· The concepts from the previous example carry over directly to maximum likelihood estimation for linear regression

· Outcomes are assumed to be independent and Normally distributed: Yi ∼ N(β0 + β1 xi, σ²)

· For each i, the density function is

f(Yi; β0, β1, σ²) = (2πσ²)^(−1/2) exp(−(1/2)((Yi − β0 − β1 xi)/σ)²)

· And the likelihood function is given by

L(β0, β1, σ²) = Πᵢ₌₁ⁿ f(Yi; β0, β1, σ²)

· Maximum likelihood estimates are found analytically using partial derivatives
  – Partial differentiation of the logarithm of the likelihood function is much easier
  – Details are left for a homework assignment
  – After simplification, you will obtain

    Σᵢ₌₁ⁿ (Yi − β̂0 − β̂1 xi) = 0
    Σᵢ₌₁ⁿ xi (Yi − β̂0 − β̂1 xi) = 0
    (1/n) Σᵢ₌₁ⁿ (Yi − β̂0 − β̂1 xi)² = σ̂²

· Note that the MLE of the variance, σ̂², differs from the MSE, s²y|x, the unbiased estimator of the variance

s²y|x = (n / (n − 2)) σ̂²

· Differences will be small when n is large
  – Statisticians prefer unbiased estimates (Uniform Minimum Variance Unbiased Estimators in particular), so the MSE is used in practice
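· A short sketch contrasting the MLE σ̂², which divides by n, with the unbiased MSE, which divides by n − 2. The data are the same hypothetical values used in the earlier least squares sketch:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([20.0, 25.0, 30.0, 35.0, 45.0, 50.0])
Y = np.array([55.0, 62.0, 70.0, 82.0, 104.0, 112.0])
n = len(Y)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least squares = ML estimates of beta
resid = Y - X @ beta_hat

sigma2_mle = np.sum(resid ** 2) / n        # MLE of the variance
mse = np.sum(resid ** 2) / (n - 2)         # unbiased estimator (MSE)
print(sigma2_mle, mse, mse * (n - 2) / n)  # last value equals sigma2_mle
```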

8.4 Bayesian Methods for Simple Linear Regression

8.4.1 A quick review

· The goal of all Bayesian analyses is to develop an appropriate model to answer the scientific questions and to summarize the posterior distribution, p(θ|y), in appropriate ways
  – In the simple linear regression case, θ = (β0, β1, σ²)

· In order to make probability statements about θ given the data (y), we must specify a joint probability distribution for θ and y

p(θ, y) = p(θ) p(y|θ)


– p(θ, y) is the joint distribution of θ and y
– p(θ) is the prior distribution of θ
– p(y|θ) is the sampling distribution of the data given θ

· Bayes’ rule states that we can express the joint distribution of θ and y as the product of marginal and conditional distributions

p(θ, y) = p(y|θ) × p(θ) = p(θ|y) × p(y)

· Using Bayes’ rule, we can then express the posterior distribution of θ as

p(θ|y) = p(θ, y) / p(y) = [p(y|θ) × p(θ)] / p(y) ∝ p(y|θ) × p(θ)

· The preceding expression captures the core of Bayesian analysis
  – The posterior distribution of the parameter(s) θ given the data is proportional to the product of
    ∗ The likelihood: p(y|θ)
    ∗ The prior: p(θ)

· Conjugate Priors
  – If the posterior distribution p(θ|y) is in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood

    Family                                   Prior
    Binomial(n, θ)                           θ ∼ Beta(α, λ)
    Poisson(θ)                               θ ∼ Gamma(δ0, γ0)
    Normal(µ, σ²), σ² known                  µ ∼ Normal(µ0, σ0²)
    Normal(µ, σ²), µ known, τ = 1/σ²         τ ∼ Gamma(δ0, γ0)
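· As a small illustration of conjugacy from the first row of the table, a Beta prior combined with a Binomial likelihood yields a Beta posterior. The prior parameters and data below are hypothetical:

```python
# Beta-Binomial conjugate update: Beta(alpha, lam) prior, Binomial(n, theta) likelihood
alpha, lam = 2.0, 2.0     # hypothetical prior parameters
y, n = 7, 20              # hypothetical data: 7 successes in 20 trials

# Posterior is Beta(alpha + y, lam + n - y) -- same family as the prior
post_alpha = alpha + y
post_lam = lam + n - y
post_mean = post_alpha / (post_alpha + post_lam)
print(post_alpha, post_lam, post_mean)
```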

8.4.2 Simple Linear Regression

· Note about notation
  – For convenience when doing derivations, Bayesians often use τ = 1/σ². Since τ is the inverse of the variance, it is called the “precision parameter”

· The likelihood for simple linear regression is given by

L(β0, β1, τ) = Πᵢ₌₁ⁿ f(Yi; β0, β1, τ)
             = Πᵢ₌₁ⁿ (2π)^(−1/2) τ^(1/2) exp(−(τ/2)(Yi − β0 − β1 xi)²)
             = (2π)^(−n/2) τ^(n/2) exp(−(τ/2) Σᵢ₌₁ⁿ (Yi − β0 − β1 xi)²)

· In the ordinary linear regression model with homoscedasticity, the likelihood in matrix notation can be specified as

L(β, τ) = f(Y | β, τ, X)
        = (2π)^(−n/2) τ^(n/2) exp(−(τ/2)(Y − Xβ)′(Y − Xβ))
        ∝ τ^(n/2) exp(−(τ/2)(Y − Xβ)′(Y − Xβ))

so that Y ∼ N(Xβ, τ⁻¹I), where

– I is the identity matrix (rank n)
– X is the design matrix containing covariates (rank k)
– β is a k × 1 vector of parameters
– τ = 1/σ² is the common precision term

· A convenient non-informative prior distribution is uniform on (β, log σ)

p(β, τ) ∝ τ⁻¹

· With this choice of prior, we can calculate the posterior distribution of β given τ and Y

p(β|τ, Y) ∝ L(β, τ) p(β, τ)
β | τ, Y ∼ N(β̂, τ⁻¹ Vβ)

– where
  β̂ = (X′X)⁻¹X′Y
  Vβ = (X′X)⁻¹

· We can also calculate the posterior distribution of τ given Y

τ | Y ∼ Gamma((n − k)/2, ((n − k)/2) s²)

– where s² = (1/(n − k)) (Y − Xβ̂)′(Y − Xβ̂)

· Note that in classical linear regression, the standard non-Bayesian estimates of β and σ² are β̂ and s² as just defined


· The frequentist standard error estimate is found by setting τ⁻¹ = s²

· Many other choices of prior distributions are possible.
  – How do we choose the “best” prior? The most convincing results are robust to prior assumptions.
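· A minimal sketch of these posterior calculations under the non-informative prior, using the same hypothetical data as before. It assumes the Gamma above is parameterized by shape and rate, so the NumPy scale argument is the reciprocal of the rate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data
x = np.array([20.0, 25.0, 30.0, 35.0, 45.0, 50.0])
Y = np.array([55.0, 62.0, 70.0, 82.0, 104.0, 112.0])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

# Posterior pieces under p(beta, tau) proportional to 1/tau
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # posterior mean of beta given tau
V_beta = np.linalg.inv(X.T @ X)
resid = Y - X @ beta_hat
s2 = resid @ resid / (n - k)

# Draw tau from Gamma((n-k)/2, rate = (n-k) s^2 / 2), then beta | tau, Y from a Normal
tau = rng.gamma(shape=(n - k) / 2, scale=2.0 / ((n - k) * s2), size=2000)
betas = np.array([rng.multivariate_normal(beta_hat, V_beta / t) for t in tau])
print(beta_hat, betas.mean(axis=0))               # posterior draws center near beta_hat
```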

8.5 Logistic Model Estimation

· The method of maximum likelihood is well suited to deal with binary Yi
· As with linear regression, we need to first develop a joint probability function of the sampled observations

8.5.1 Likelihood Function

– Each Yi is a Bernoulli random variable

  Pr(Yi = 1) = πi
  Pr(Yi = 0) = 1 − πi

– We can represent the probability distribution as follows

  fi(Yi) = πi^Yi (1 − πi)^(1−Yi),  Yi = 0, 1;  i = 1, . . . , n

– Each of the Yi is independent, so the likelihood function is given by

  g(Y1, . . . , Yn) = Πᵢ₌₁ⁿ fi(Yi) = Πᵢ₌₁ⁿ πi^Yi (1 − πi)^(1−Yi)


· It is easier to find maximum likelihood estimates by working with the log-likelihood

loge(g(Y1, . . . , Yn)) = loge Πᵢ₌₁ⁿ πi^Yi (1 − πi)^(1−Yi)
                       = Σᵢ₌₁ⁿ [Yi × loge(πi / (1 − πi))] + Σᵢ₌₁ⁿ loge(1 − πi)

· We then express the log-likelihood in terms of the regression coefficients we wish to estimate. In simple logistic regression, we have

loge(πi / (1 − πi)) = β0 + β1 Xi

and

1 − πi = 1 / (1 + e^(β0 + β1 Xi))

· Hence, the log-likelihood for simple logistic regression can be expressed as follows

loge L(β0, β1) = Σᵢ₌₁ⁿ Yi (β0 + β1 Xi) − Σᵢ₌₁ⁿ loge(1 + e^(β0 + β1 Xi))

8.5.2 Maximum Likelihood Estimation

· The maximum likelihood estimates of β0 and β1 in the simple logistic regression model are those values of β0 and β1 that maximize the log-likelihood function

· No closed-form solution exists for the values of β0 and β1

· Numerical searches are therefore required to find the maximum likelihood estimates of β0 and β1
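· Since no closed-form solution exists, the log-likelihood above can be maximized numerically. A sketch using scipy.optimize; the data are synthetic and the true coefficients (−1.0, 0.5) are chosen only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Synthetic data from a logistic model with (beta0, beta1) = (-1.0, 0.5)
n = 200
X = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * X)))
Y = rng.binomial(1, p)

def neg_loglik(beta):
    """Negative of log L(b0, b1) = sum Yi*(b0 + b1*Xi) - sum log(1 + exp(b0 + b1*Xi))."""
    eta = beta[0] + beta[1] * X
    return -(np.sum(Y * eta) - np.sum(np.log1p(np.exp(eta))))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)    # numerical MLEs of (beta0, beta1)
```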