CHAPTER 5
ST744, D. Zhang
Logistic Regression
I The logistic regression model
• Data $(x_i, Y_i)$:
  $x_i$ = covariate (independent or explanatory variable), continuous
  $Y_i$ = response
  $Y_i \mid x_i \sim \mathrm{Bin}(n_i, \pi(x_i))$
  $E(Y_i/n_i \mid x_i) = \pi(x_i)$
  $\mathrm{var}(Y_i/n_i \mid x_i) = \dfrac{\pi(x_i)\{1 - \pi(x_i)\}}{n_i}$ ⇒ No over-dispersion!
• Model
  $$\mathrm{logit}(\pi(x)) = \log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x = \eta(x)$$
⇒
  $$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \frac{e^{\eta(x)}}{1 + e^{\eta(x)}}$$

I.1 Interpretation of α and β:
• Interpretation of α:
  $$\alpha = \log\frac{\pi(0)}{1 - \pi(0)},$$
  the log-odds of success (or disease) for the population defined by x = 0.
• Interpretation of β: $\mathrm{logit}\,\pi(x+1) - \mathrm{logit}\,\pi(x) = \beta$, so
  $$\beta = \log\frac{\pi(x+1)/\{1 - \pi(x+1)\}}{\pi(x)/\{1 - \pi(x)\}}$$
  = log odds-ratio associated with a one-unit increase of x
  ⇒
  $$e^{\beta} = \frac{\pi(x+1)/\{1 - \pi(x+1)\}}{\pi(x)/\{1 - \pi(x)\}}$$
  = odds-ratio (of disease) associated with a one-unit increase of x.
• Disease (success) probability for the population defined by $x_0$:
  $$\pi(x_0) = \frac{e^{\alpha + \beta x_0}}{1 + e^{\alpha + \beta x_0}} = \frac{e^{\eta(x_0)}}{1 + e^{\eta(x_0)}}$$
  We could define a new covariate $x^* = x - x_0$. Then the disease probability $\pi^*(x^*)$ in terms of $x^*$ is given by
  $$\mathrm{logit}\,\pi^*(x^*) = \alpha^* + \beta x^*, \qquad \alpha^* = \alpha + \beta x_0,$$
  and $\pi^*(x^*) = \pi(x)$ ⇒ $\pi^*(0) = \pi(x_0)$ and
  $$\pi^*(0) = \frac{e^{\alpha^*}}{1 + e^{\alpha^*}}.$$
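As a numerical illustration of I.1, the following Python sketch checks that the odds ratio for a one-unit increase in x equals $e^{\beta}$ at any x, and that recentring at $x_0$ gives the intercept $\alpha^* = \alpha + \beta x_0$. The values of `alpha`, `beta` and `x0` are hypothetical and chosen only for illustration.

```python
import math

def expit(eta):
    """Inverse logit: pi = e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def odds(p):
    """Odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical parameter values (not taken from the notes).
alpha, beta, x0 = -3.0, 0.4, 5.0

def pi(x):
    """Success probability under logit(pi(x)) = alpha + beta * x."""
    return expit(alpha + beta * x)

# Odds ratio for a one-unit increase in x, evaluated at several x values.
odds_ratios = [odds(pi(x + 1)) / odds(pi(x)) for x in (0.0, 2.5, 10.0)]

# Recentred intercept: the log-odds at x = x0.
alpha_star = alpha + beta * x0

print("exp(beta)     =", math.exp(beta))
print("odds ratios   =", odds_ratios)
print("alpha*        =", alpha_star)
print("logit(pi(x0)) =", logit(pi(x0)))
```

The odds ratios printed for the three x values should all agree with `exp(beta)`, which is the sense in which β does not depend on where the one-unit increase starts.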
I.2 Alternative interpretation
• Simple algebra shows that the slope of π(x) at x is
  $$\pi'(x) = \beta\,\pi(x)\{1 - \pi(x)\},$$
  which can be approximately interpreted as the change in disease (success) probability when x increases by one unit from x to x + 1.
  For example, when $x_0 = -\alpha/\beta$, $\alpha + \beta x_0 = 0$ ⇒ $\pi(x_0) = 0.5$ ⇒ $\pi'(x_0) = \beta/4$
  ⇒ Disease (success) probability increases (if β > 0) by approximately β/4 additively when x increases by one unit from $x_0$ to $x_0 + 1$,
  or
  Disease (success) probability increases (if β > 0) from 0.5 to approximately 0.75 (0.5 + 1/4) additively when x increases from $x_0 = -\alpha/\beta$ to $x_0 + 1/\beta$.
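To get a feel for how good the slope approximation is, here is a minimal Python sketch that compares the linear approximation with the exact change in π(x) near $x_0 = -\alpha/\beta$. The values of `alpha` and `beta` are hypothetical.

```python
import math

def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameter values (not taken from the notes).
alpha, beta = -2.0, 0.5

def pi(x):
    return expit(alpha + beta * x)

def slope(x):
    """Derivative of pi(x): beta * pi(x) * (1 - pi(x))."""
    return beta * pi(x) * (1.0 - pi(x))

x0 = -alpha / beta                      # point where pi(x0) = 0.5

approx_change = slope(x0)               # linear approximation, equals beta / 4
exact_change = pi(x0 + 1.0) - pi(x0)    # actual change over one unit of x

# Moving 1/beta units from x0: the approximation gives 0.75,
# the exact probability is e / (1 + e).
approx_prob = 0.5 + slope(x0) / beta
exact_prob = pi(x0 + 1.0 / beta)

print("one-unit change: approx", approx_change, "exact", exact_change)
print("after 1/beta   : approx", approx_prob, "exact", exact_prob)
```

The slope is a tangent-line approximation, so it is most accurate for small steps in x; over a step of 1/β the exact probability is e/(1 + e), somewhat below 0.75.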
I.3 Empirical Check of the Logistic Model
• Suppose $n_i$ is reasonably large; then $p_i = y_i/n_i$ will be a good estimate of $\pi_i$. If $\mathrm{logit}(\pi_i) = \alpha + \beta x_i$ is a good model, the plot of $p_i$ vs. $x_i$ will look like a logistic curve. However, this is not easy to judge by eye.
• Better to plot $\mathrm{logit}(p_i)$ vs. $x_i$. If the logistic model is good, this plot should show a roughly linear trend.
• $p_i$ may be 0 or 1, in which case $\mathrm{logit}(p_i)$ is undefined. Add 0.5 to the numbers of successes and failures and recalculate the sample proportion $\tilde{p}_i$. Or, equivalently, calculate
  $$\mathrm{logit}(\tilde{p}_i) = \log\frac{y_i + 0.5}{n_i - y_i + 0.5}$$
  and plot $\mathrm{logit}(\tilde{p}_i)$ vs. $x_i$. A linear plot indicates the model is reasonable.
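The adjusted empirical logit can be sketched in a few lines of Python. The grouped data below are hypothetical and only show that the adjustment keeps the logit finite when $y_i = 0$ or $y_i = n_i$.

```python
import math

def empirical_logit(y, n):
    """Empirical logit with 0.5 added to the success and failure counts."""
    return math.log((y + 0.5) / (n - y + 0.5))

# Hypothetical grouped data (x_i, y_i, n_i); not taken from the notes.
groups = [(22.5, 0, 3), (25.0, 2, 5), (27.5, 4, 6), (30.0, 4, 4)]

for x, y, n in groups:
    p_tilde = (y + 0.5) / (n + 1.0)     # adjusted sample proportion
    print(x, round(p_tilde, 3), round(empirical_logit(y, n), 3))
```

Plotting the last column against x is the empirical check described above.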
I.4 Example: Horseshoe crab data
• Define the binary response $Y_i$ for female crab i as
  $$Y_i = \begin{cases} 1 & \text{if crab } i \text{ has at least one satellite} \\ 0 & \text{otherwise} \end{cases}$$
  Define $\pi(x_i) = P[Y_i = 1 \mid x_i]$, where $x_i$ is the carapace width of crab i.
• First we would like to check whether $\mathrm{logit}\,\pi(x_i) = \alpha + \beta x_i$ is reasonable.
• SAS program and output:

    data crab;
      input color spine width satell weight;
      weight=weight/1000;
      color=color-1;
      have_sat=(satell>0);
      /* only the first two data records are shown here */
      datalines;
    3 3 28.3 8 3050
    4 3 22.5 0 1550
    ;

    proc sort data=crab;
      by width;
    run;

    /* number of crabs with satellites (have_sat) at each distinct width */
    proc summary data=crab noprint;
      var have_sat;
      by width;
      output out=crab2 sum=have_sat;
    run;

    /* empirical logit with 0.5 added to successes and failures */
    data crab2;
      set crab2;
      ni = _FREQ_;
      logitpi = log((have_sat + 0.5)/(ni - have_sat + 0.5));
    run;

    title "Empirical logit vs. width";
    proc plot;
      plot logitpi*width;
    run;
Plot of logitpi*width. Legend: A = 1 obs, B = 2 obs, etc.
[Character plot: empirical logit logitpi on the vertical axis (ticks at -2, 0, 2, 4) against width on the horizontal axis (20.0 to 35.0).]
• The above plot indicates that the logistic model may be reasonable.
• We can use Proc GenMod or Proc Logistic to fit $\mathrm{logit}\,\pi(x_i) = \alpha + \beta x_i$. Here we use Proc Logistic:

    title "Logistic fit to the probability of having satellites";
    proc logistic data=crab descending;
      model have_sat=width;
    run;
• Note: We need the descending option here because the response $Y_i$ is binary (coded 0/1) and we want to model $P[Y_i = 1 \mid x_i]$. Without it, SAS models $P[Y_i = 0 \mid x_i]$.
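The two choices of modeled event differ only in sign. A short derivation, using the model $\mathrm{logit}\,\pi(x_i) = \alpha + \beta x_i$ with $\pi(x_i) = P[Y_i = 1 \mid x_i]$:

```latex
\mathrm{logit}\, P[Y_i = 0 \mid x_i]
  = \log\frac{1 - \pi(x_i)}{\pi(x_i)}
  = -\log\frac{\pi(x_i)}{1 - \pi(x_i)}
  = -\alpha - \beta x_i .
```

So a fit without descending would report estimates of −α and −β, with the same standard errors and test statistics.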
    Logistic fit to the probability of having satellites

    The LOGISTIC Procedure

    Model Information
    Data Set                      WORK.CRAB
    Response Variable             have_sat
    Number of Response Levels     2
    Model                         binary logit
    Optimization Technique        Fisher's scoring

    Number of Observations Read   173
    Number of Observations Used   173

    Response Profile
    Ordered Value   have_sat   Total Frequency
    1               1          111
    2               0           62

    Probability modeled is have_sat=1.

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
    Criterion   Intercept Only   Intercept and Covariates
    AIC         227.759          198.453
    SC          230.912          204.759
    -2 Log L    225.759          194.453

    Testing Global Null Hypothesis: BETA=0
    Test               Chi-Square   DF   Pr > ChiSq
    Likelihood Ratio   31.3059      1
    Score              27.8752      1
    Wald               23.8872      1
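The fit statistics in this output are related by simple arithmetic, which the following Python sketch checks. It assumes the usual definitions AIC = −2 Log L + 2p and SC = −2 Log L + p·log(n), with p the number of parameters and n = 173 observations; these definitions are an assumption here, not something stated in the notes.

```python
import math

# Values copied from the SAS output above.
n_obs = 173
neg2logl_null, neg2logl_full = 225.759, 194.453
p_null, p_full = 1, 2       # parameters: intercept only vs. intercept + width

def aic(neg2logl, p):
    """Akaike information criterion: -2 Log L + 2p."""
    return neg2logl + 2 * p

def sc(neg2logl, p, n):
    """Schwarz criterion: -2 Log L + p * log(n)."""
    return neg2logl + p * math.log(n)

# Likelihood ratio chi-square for H0: beta = 0 (1 df).
lr_stat = neg2logl_null - neg2logl_full

print("AIC:", round(aic(neg2logl_null, p_null), 3), round(aic(neg2logl_full, p_full), 3))
print("SC :", round(sc(neg2logl_null, p_null, n_obs), 3), round(sc(neg2logl_full, p_full, n_obs), 3))
print("LR :", round(lr_stat, 4))
```

The likelihood ratio statistic is the drop in −2 Log L when width is added, which is why it can be read directly off the Model Fit Statistics table.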
    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept   1    -1.0736    0.2629           16.6705
    race        1     0.0555    0.2886            0.0370
    azt         1    -0.7195    0.2790            6.6507
    Chi-Square   DF   Pr > ChiSq
    1.3910       1    0.2382
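A p-value for a chi-square statistic with 1 degree of freedom can be reproduced from the normal tail, since a chi-square variable with 1 df is the square of a standard normal. A minimal Python sketch, with `chisq1_pvalue` a hypothetical helper name:

```python
import math

def chisq1_pvalue(x):
    """P(X > x) for X ~ chi-square with 1 df, i.e. P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2.0))

# Statistic reported in the output above.
p = chisq1_pvalue(1.3910)
print(round(p, 4))
```

The result should agree with the reported Pr > ChiSq up to rounding.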
    Step 1. Effect race*azt entered:

    Model Fit Statistics
    Criterion   Intercept Only   Intercept and Covariates
    -2 Log L    342.118          333.768

    Analysis of Maximum Likelihood Estimates
    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept   1    -1.2763    0.3265           15.2823
    race        1     0.3476    0.3875            0.8044
    azt         1    -0.2771    0.4655            0.3542
    race*azt    1    -0.6878    0.5852            1.3811
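Each Wald chi-square in this table is the squared ratio of the estimate to its standard error. The Python sketch below recomputes the column from the printed estimates and standard errors; small discrepancies are expected because the printed values are rounded.

```python
# (estimate, standard error, reported Wald chi-square) from the table above.
rows = {
    "Intercept": (-1.2763, 0.3265, 15.2823),
    "race":      (0.3476, 0.3875, 0.8044),
    "azt":       (-0.2771, 0.4655, 0.3542),
    "race*azt":  (-0.6878, 0.5852, 1.3811),
}

def wald_chisq(estimate, se):
    """Wald chi-square (1 df): squared ratio of estimate to standard error."""
    return (estimate / se) ** 2

for name, (est, se, reported) in rows.items():
    print(f"{name:10s} computed={wald_chisq(est, se):8.4f} reported={reported:8.4f}")
```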
    ... 0 then satbin=1; else satbin=0;
      w_265 = width - 26.5;
      c1 = (color=1); c2 = (color=2); c3 = (color=3); c4 = (color=4);
      s1 = (spine=1); s2 = (spine=2);
      datalines;
    * data is given in other program;
    ;

    title "Logistic model with width and three dummies for color";
    proc logistic descending;
      model satbin = width c1 c2 c3;
    run;
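Because the model statement includes c1, c2 and c3 but not c4, color 4 acts as the reference level: each color coefficient is a difference in log-odds from color 4 at the same width. A minimal Python sketch of this coding, with hypothetical coefficient values:

```python
def color_dummies(color):
    """Indicators (c1, c2, c3) for color levels 1-3; level 4 is the reference."""
    return tuple(int(color == level) for level in (1, 2, 3))

def linear_predictor(width, color, intercept, b_width, b_color):
    """eta = intercept + b_width * width + sum over k of b_color[k] * c_k."""
    c = color_dummies(color)
    return intercept + b_width * width + sum(b * ck for b, ck in zip(b_color, c))

# Hypothetical coefficients, for illustration only (not the fitted values).
intercept, b_width, b_color = -10.0, 0.4, (1.2, 1.0, 0.8)

for color in (1, 2, 3, 4):
    eta = linear_predictor(26.5, color, intercept, b_width, b_color)
    print(color, color_dummies(color), round(eta, 4))
```

For color 4 all three indicators are zero, so its linear predictor involves only the intercept and the width term.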
    Logistic model with width and three dummies for color

    The LOGISTIC Procedure

    Probability modeled is satbin=1.

    Model Fit Statistics
    Criterion   Intercept Only   Intercept and Covariates
    -2 Log L    225.759          187.457

    Analysis of Maximum Likelihood Estimates
    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept   1    -12.7151   2.7618           21.1965
    width       1      0.4680   0.1055           19.6573
    c1          1      1.3299   0.8525            2.4335
    c2          1      1.4023   0.5484            6.5380
    c3          1      1.1061   0.5921            3.4901