CHAPTER 5

ST744, D. Zhang

5 Logistic Regression

I The logistic regression model

• Data $(x_i, Y_i)$:

  $x_i$ = covariate (independent or explanatory variable), continuous

  $Y_i$ = response

  $Y_i \mid x_i \sim \mathrm{Bin}(n_i, \pi(x_i))$

  $E(Y_i/n_i \mid x_i) = \pi(x_i)$

  $\mathrm{var}(Y_i/n_i \mid x_i) = \dfrac{\pi(x_i)\{1 - \pi(x_i)\}}{n_i}$ ⇒ No over-dispersion!

• Model

  $\mathrm{logit}(\pi(x)) = \log \dfrac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x = \eta(x)$
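The model above can be sketched numerically as follows. This is a minimal illustration, not part of the course programs; the function names and the values α = −1.0, β = 0.5 are hypothetical and chosen only to show how the logit, its inverse, and the binomial variance of Y/n fit together.

```python
import math

def logit(p):
    """Log-odds log(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def pi_of_x(x, alpha, beta):
    """Success probability pi(x) under logit(pi(x)) = alpha + beta * x."""
    return inv_logit(alpha + beta * x)

def var_of_proportion(x, n, alpha, beta):
    """Binomial variance of Y/n: pi(x) * (1 - pi(x)) / n (no over-dispersion)."""
    p = pi_of_x(x, alpha, beta)
    return p * (1.0 - p) / n

# Hypothetical parameter values, for illustration only.
alpha, beta = -1.0, 0.5
p = pi_of_x(2.0, alpha, beta)      # eta(2) = -1.0 + 0.5 * 2 = 0, so pi = 0.5
eta_back = logit(p)                # recovers eta(2) = 0
v = var_of_proportion(2.0, 10, alpha, beta)   # 0.5 * 0.5 / 10
```

With these illustrative values, η(2) = 0, so π(2) = 0.5 and the variance of Y/n with n = 10 is 0.025.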


⇒ $\pi(x) = \dfrac{e^{\alpha+\beta x}}{1 + e^{\alpha+\beta x}} = \dfrac{e^{\eta(x)}}{1 + e^{\eta(x)}}$

I.1 Interpretation of α and β:

• Interpretation of α:

  $\alpha = \log \dfrac{\pi(0)}{1 - \pi(0)}$,

  the log-odds of success (or disease) for the population defined by x = 0.

• Interpretation of β:

  $\mathrm{logit}\,\pi(x+1) - \mathrm{logit}\,\pi(x) = \beta$

  ⇒ $\beta = \log \dfrac{\pi(x+1)/\{1 - \pi(x+1)\}}{\pi(x)/\{1 - \pi(x)\}}$ = log odds-ratio associated with a one-unit increase in x

  ⇒ $e^{\beta} = \dfrac{\pi(x+1)/\{1 - \pi(x+1)\}}{\pi(x)/\{1 - \pi(x)\}}$ = odds-ratio (of disease) associated with a one-unit increase in x.

• Disease (success) probability for the population defined by $x_0$:

  $\pi(x_0) = \dfrac{e^{\alpha+\beta x_0}}{1 + e^{\alpha+\beta x_0}} = \dfrac{e^{\eta(x_0)}}{1 + e^{\eta(x_0)}}$

  We could define a new covariate $x^* = x - x_0$. Then the disease probability $\pi^*(x^*)$ in terms of $x^*$ is given by

  $\mathrm{logit}\,\pi^*(x^*) = \alpha^* + \beta x^*$, with $\alpha^* = \alpha + \beta x_0$, and $\pi^*(x^*) = \pi(x)$

  ⇒ $\pi^*(0) = \pi(x_0)$ and

  $\pi^*(0) = \dfrac{e^{\alpha^*}}{1 + e^{\alpha^*}}$.
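The two interpretations above can be checked numerically. The sketch below uses hypothetical values (α = −2.0, β = 0.3, x0 = 5.0) that do not come from any data set in these notes. It confirms that the odds ratio for a one-unit increase equals e^β at every x, and that recentering the covariate at x0 gives an intercept α* = α + βx0 whose inverse logit is π(x0).

```python
import math

def pi_of_x(x, alpha, beta):
    """Success probability under logit(pi(x)) = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(p):
    """Odds p / (1 - p) of a probability p."""
    return p / (1.0 - p)

# Hypothetical parameter values, for illustration only.
alpha, beta = -2.0, 0.3

# Odds ratio for a one-unit increase in x, evaluated at several x values.
odds_ratios = [
    odds(pi_of_x(x + 1.0, alpha, beta)) / odds(pi_of_x(x, alpha, beta))
    for x in (-3.0, 0.0, 4.5)
]

# Recentering: x* = x - x0 gives the intercept alpha* = alpha + beta * x0.
x0 = 5.0
alpha_star = alpha + beta * x0
pi_star_at_zero = math.exp(alpha_star) / (1.0 + math.exp(alpha_star))
```

Every entry of `odds_ratios` equals `math.exp(beta)` up to rounding, and `pi_star_at_zero` matches `pi_of_x(x0, alpha, beta)`.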


I.2 Alternative interpretation

• Simple algebra shows that the slope of π(x) at x is

  $\pi'(x) = \beta\,\pi(x)\{1 - \pi(x)\}$,

  which can be interpreted as the approximate change in disease (success) probability when x increases by one unit, from x to x + 1.

  For example, when $x_0 = -\alpha/\beta$, $\alpha + \beta x_0 = 0$ ⇒ $\pi(x_0) = 0.5$ ⇒ $\pi'(x_0) = \beta/4$

  ⇒ Disease (success) probability increases (if β > 0) by approximately β/4 additively when x increases by one unit from $x_0$ to $x_0 + 1$,

  or

  disease (success) probability increases (if β > 0) from 0.5 to approximately 0.75 (0.5 + 1/4) when x increases from $x_0 = -\alpha/\beta$ to $x_0 + 1/\beta$.
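Because the slope-based statements are linear approximations, it can help to compare them with the exact change in π(x). The sketch below uses hypothetical values α = −3.0 and β = 0.5 (so x0 = −α/β = 6), which are not taken from any data set in these notes.

```python
import math

def pi_of_x(x, alpha, beta):
    """Success probability under logit(pi(x)) = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def slope(x, alpha, beta):
    """Derivative of pi(x): beta * pi(x) * (1 - pi(x))."""
    p = pi_of_x(x, alpha, beta)
    return beta * p * (1.0 - p)

# Hypothetical parameter values, for illustration only.
alpha, beta = -3.0, 0.5
x0 = -alpha / beta                      # point where pi(x0) = 0.5

approx_change = slope(x0, alpha, beta)  # beta / 4
exact_change = pi_of_x(x0 + 1.0, alpha, beta) - pi_of_x(x0, alpha, beta)

# Moving 1/beta units from x0: the linear approximation gives 0.5 + 1/4 = 0.75,
# while the exact value is e / (1 + e).
exact_at_one_over_beta = pi_of_x(x0 + 1.0 / beta, alpha, beta)
```

Here the approximate one-unit change is β/4 = 0.125, while the exact change is about 0.1225. Over the longer step of 1/β, the exact probability is about 0.731 rather than 0.75, so the approximation is best for small steps.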


I.3 Empirical Check of the Logistic Model

• Suppose $n_i$ is reasonably large; then $p_i = y_i/n_i$ is a good estimate of $\pi_i$. If $\mathrm{logit}(\pi_i) = \alpha + \beta x_i$ is a good model, the plot of $p_i$ vs. $x_i$ will look like a logistic curve. However, this is not easy to judge by eye.

• It is better to plot $\mathrm{logit}(p_i)$ vs. $x_i$. If the logistic model is good, this plot should show a roughly linear trend.

• $p_i$ may be 0 or 1, in which case $\mathrm{logit}(p_i)$ is undefined. Add 0.5 to the numbers of successes and failures and recalculate the sample proportion $\tilde{p}_i$. Equivalently, calculate

  $\mathrm{logit}(\tilde{p}_i) = \log \dfrac{y_i + 0.5}{n_i - y_i + 0.5}$

  and plot $\mathrm{logit}(\tilde{p}_i)$ vs. $x_i$. A linear plot indicates the model is reasonable.
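The adjusted empirical logit is a one-line computation. A minimal sketch follows; the function name `empirical_logit` and the counts in the usage lines are hypothetical.

```python
import math

def empirical_logit(y, n):
    """Empirical logit log((y + 0.5) / (n - y + 0.5)).

    y is the number of successes out of n trials. Adding 0.5 to both the
    success and failure counts keeps the value finite when y = 0 or y = n.
    """
    if not 0 <= y <= n:
        raise ValueError("need 0 <= y <= n")
    return math.log((y + 0.5) / (n - y + 0.5))

# Hypothetical counts: the unadjusted logit would be undefined in both cases.
low = empirical_logit(0, 5)    # log(0.5 / 5.5)
high = empirical_logit(5, 5)   # log(5.5 / 0.5)
```

The adjustment is symmetric in successes and failures, so `high` equals `-low`.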


I.4 Example: Horseshoe crab data

• Define the binary response $Y_i$ for female crab i as

  $Y_i = \begin{cases} 1 & \text{if crab } i \text{ has at least one satellite} \\ 0 & \text{otherwise} \end{cases}$

  Define $\pi(x_i) = P[Y_i = 1 \mid x_i]$, where $x_i$ is the carapace width of crab i.

• We would first like to check whether $\mathrm{logit}\,\pi(x_i) = \alpha + \beta x_i$ is reasonable.


• SAS program and output:

  data crab;
    input color spine width satell weight;
    weight = weight/1000;
    color = color - 1;
    have_sat = (satell > 0);      /* 1 if at least one satellite, 0 otherwise */
    datalines;
  3 3 28.3 8 3050
  4 3 22.5 0 1550
  ;

  proc sort data=crab;
    by width;
  run;

  /* number of crabs with satellites at each distinct width */
  proc summary data=crab noprint;
    var have_sat;
    by width;
    output out=crab2 sum=have_sat;
  run;

  data crab2;
    set crab2;
    ni = _FREQ_;                  /* number of crabs at this width */
    logitpi = log((have_sat + 0.5)/(ni - have_sat + 0.5));   /* empirical logit */
  run;

  title "Empirical logit vs. width";
  proc plot;
    plot logitpi*width;
  run;
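For readers who want to trace the grouping logic outside SAS, the steps of the program (define have_sat, total it by width, compute the empirical logit) can be sketched in Python. This is only an illustration: the function name is hypothetical, and the input is limited to the two data rows shown in the SAS program, not the full data set.

```python
import math
from collections import defaultdict

# (width, satell) pairs; only the two rows shown in the SAS program are used.
rows = [(28.3, 8), (22.5, 0)]

def empirical_logits_by_width(rows):
    """Return (width, ni, logitpi) tuples sorted by width.

    Mirrors the SAS steps: have_sat = (satell > 0), summed within each
    width, then logitpi = log((have_sat + 0.5) / (ni - have_sat + 0.5)).
    """
    counts = defaultdict(lambda: [0, 0])   # width -> [sum of have_sat, ni]
    for width, satell in rows:
        counts[width][0] += 1 if satell > 0 else 0
        counts[width][1] += 1
    table = []
    for width in sorted(counts):
        have_sat, ni = counts[width]
        logitpi = math.log((have_sat + 0.5) / (ni - have_sat + 0.5))
        table.append((width, ni, logitpi))
    return table

table = empirical_logits_by_width(rows)
```

With one crab at each width, the empirical logits are log(0.5/1.5) at width 22.5 and log(1.5/0.5) at width 28.3.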

[Plot of logitpi*width: empirical logit logitpi (vertical axis, about −2 to 4) against carapace width (horizontal axis, 20.0 to 35.0). Legend: A = 1 obs, B = 2 obs, etc.]

• The above plot indicates that the logistic model may be reasonable.


• We can use Proc GenMod or Proc Logistic to fit $\mathrm{logit}\,\pi(x_i) = \alpha + \beta x_i$. Here we use Proc Logistic:

  title "Logistic fit to the probability of having satellites";
  proc logistic data=crab descending;
    model have_sat=width;
  run;

• Note: Here we need to use descending since the response variable Yi is binary and we want to model P [Yi = 1|xi ]. Otherwise, SAS models P [Yi = 0|xi ].
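A short derivation, assuming the same logistic model, shows what changes if descending is omitted: modeling $P[Y_i = 0 \mid x_i]$ instead of $P[Y_i = 1 \mid x_i]$ reverses the sign of both coefficients and leaves the fitted probabilities unchanged.

```latex
\begin{aligned}
\operatorname{logit} P[Y_i = 0 \mid x_i]
  &= \log \frac{1 - \pi(x_i)}{\pi(x_i)} \\
  &= -\operatorname{logit} \pi(x_i) \\
  &= -\alpha - \beta x_i .
\end{aligned}
```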

Logistic fit to the probability of having satellites

The LOGISTIC Procedure

Model Information

  Data Set                       WORK.CRAB
  Response Variable              have_sat
  Number of Response Levels      2
  Model                          binary logit
  Optimization Technique         Fisher's scoring

  Number of Observations Read    173
  Number of Observations Used    173

Response Profile

  Ordered Value    have_sat    Total Frequency
        1             1             111
        2             0              62

Probability modeled is have_sat=1.

Model Convergence Status

  Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

  Criterion    Intercept Only    Intercept and Covariates
  AIC             227.759               198.453
  SC              230.912               204.759
  -2 Log L        225.759               194.453

Testing Global Null Hypothesis: BETA=0

  Test                Chi-Square    DF    Pr > ChiSq
  Likelihood Ratio       31.3059     1
  Score                  27.8752     1
  Wald                   23.8872     1
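The fit statistics and the likelihood ratio test in this output are tied together by simple arithmetic, which the sketch below reproduces from the printed −2 Log L values. It assumes the usual definitions AIC = −2 log L + 2p and SC = −2 log L + p log(n), with p the number of parameters and n = 173 observations. The function names are hypothetical.

```python
import math

n_obs = 173                  # Number of Observations Used
neg2logl_null = 225.759      # -2 Log L, intercept only (1 parameter)
neg2logl_full = 194.453      # -2 Log L, intercept and width (2 parameters)

def aic(neg2logl, n_params):
    """Akaike information criterion: -2 log L + 2 p."""
    return neg2logl + 2 * n_params

def sc(neg2logl, n_params, n):
    """Schwarz criterion: -2 log L + p log(n)."""
    return neg2logl + n_params * math.log(n)

aic_null, aic_full = aic(neg2logl_null, 1), aic(neg2logl_full, 2)
sc_null, sc_full = sc(neg2logl_null, 1, n_obs), sc(neg2logl_full, 2, n_obs)

# Likelihood ratio statistic for H0: beta = 0 (1 degree of freedom).
lr_stat = neg2logl_null - neg2logl_full

# Upper-tail probability of a chi-square with 1 df: erfc(sqrt(x / 2)).
lr_p_value = math.erfc(math.sqrt(lr_stat / 2.0))
```

These agree with the printed AIC and SC values and with the likelihood ratio chi-square of 31.3059, up to rounding of the printed −2 Log L values. The resulting p-value is far below 0.0001.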

  Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
  Intercept     1     -1.0736        0.2629            16.6705
  race          1      0.0555        0.2886             0.0370
  azt           1     -0.7195        0.2790             6.6507

  Chi-Square    DF    Pr > ChiSq
      1.3910     1        0.2382

Step 1. Effect race*azt entered:

Model Fit Statistics

  Criterion    Intercept Only    Intercept and Covariates
  -2 Log L        342.118               333.768

Analysis of Maximum Likelihood Estimates

  Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
  Intercept     1     -1.2763        0.3265            15.2823
  race          1      0.3476        0.3875             0.8044
  azt           1     -0.2771        0.4655             0.3542
  race*azt      1     -0.6878        0.5852             1.3811
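Each Wald chi-square in this table is the squared ratio of an estimate to its standard error, referred to a chi-square distribution with 1 degree of freedom. The sketch below recomputes them from the printed estimates and standard errors. The helper names are hypothetical, and small differences from the printed values are expected because the inputs are rounded to four decimals.

```python
import math

# (estimate, standard error) pairs copied from the Step 1 table.
estimates = {
    "Intercept": (-1.2763, 0.3265),
    "race": (0.3476, 0.3875),
    "azt": (-0.2771, 0.4655),
    "race*azt": (-0.6878, 0.5852),
}

def wald_chi_square(estimate, std_error):
    """Wald statistic (estimate / standard error)^2, 1 degree of freedom."""
    return (estimate / std_error) ** 2

def chi_square_1df_p_value(chi2):
    """Upper-tail probability of a chi-square with 1 df: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(chi2 / 2.0))

wald = {name: wald_chi_square(est, se) for name, (est, se) in estimates.items()}
p_values = {name: chi_square_1df_p_value(w) for name, w in wald.items()}
```

The recomputed statistics match the Wald Chi-Square column to about two decimal places.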

  if satell > 0 then satbin=1; else satbin=0;
  w_265 = width - 26.5;
  c1 = (color=1); c2 = (color=2); c3 = (color=3); c4 = (color=4);
  s1 = (spine=1); s2 = (spine=2);
  datalines;
  * data is given in other program;
  ;

  title "Logistic model with width and three dummies for color";
  proc logistic descending;
    model satbin = width c1 c2 c3;
  run;

Logistic model with width and three dummies for color

The LOGISTIC Procedure

Probability modeled is satbin=1.

Model Fit Statistics

  Criterion    Intercept Only    Intercept and Covariates
  -2 Log L        225.759               187.457

Analysis of Maximum Likelihood Estimates

  Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
  Intercept     1    -12.7151        2.7618            21.1965
  width         1      0.4680        0.1055            19.6573
  c1            1      1.3299        0.8525             2.4335
  c2            1      1.4023        0.5484             6.5380
  c3            1      1.1061        0.5921             3.4901
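To see what these estimates imply, the fitted linear predictor can be turned into a probability of having at least one satellite. The sketch below is illustrative only: the function name is hypothetical, and it relies on c4 being omitted from the model statement, so color 4 is the reference level. The width 26.5 is used because the program centers width at that value in w_265.

```python
import math

# Estimates copied from the Analysis of Maximum Likelihood Estimates table.
coef = {
    "Intercept": -12.7151,
    "width": 0.4680,
    "c1": 1.3299,
    "c2": 1.4023,
    "c3": 1.1061,
}

def prob_satellite(width, color):
    """Fitted P[satbin = 1] for a crab with the given width and color (1 to 4).

    Color 4 is the reference level because c4 is not in the model.
    """
    eta = coef["Intercept"] + coef["width"] * width
    if color in (1, 2, 3):
        eta += coef["c%d" % color]
    elif color != 4:
        raise ValueError("color must be 1, 2, 3 or 4")
    return 1.0 / (1.0 + math.exp(-eta))

p_color4 = prob_satellite(26.5, 4)   # reference color
p_color1 = prob_satellite(26.5, 1)

# Odds ratio for a one-unit increase in width, holding color fixed.
odds_ratio_width = math.exp(coef["width"])
```

At width 26.5, the fitted probability is about 0.42 for color 4 and about 0.73 for color 1. The estimated odds of having a satellite are multiplied by about 1.6 for each additional unit of width.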