Logistic Regression

Review: For a 2 × 2 contingency table, both the response and predictor variables are binary.

Today: Suppose we have
  Binary response variable: $Y = 1$ if a trait is present, $Y = 0$ if a trait is absent
  Continuous predictor variable: $X$

Statistical Objective: Estimate the probability that a trait is present given the value of $X$:

$$\pi = \Pr(Y = 1 \mid X)$$
Example: Coronary Heart Disease and Age

Objective: Predict the probability of coronary heart disease as a function of age.

Procedure: A sample of 100 people between 20 and 69 years old was obtained. Each person was examined for the presence or absence of coronary heart disease.

Here $Y = 1$ if heart disease is present, $Y = 0$ if heart disease is absent, and $X$ = age.
Notation: Let $\pi = \Pr(Y = 1 \mid X)$ denote the probability of heart disease given age.

Naive Model:

$$\pi = \beta_0 + \beta_1 X$$

Properties:
  For $\beta_1 > 0$, the estimated probability that the trait is present increases with increasing values of $X$.
  For $\beta_1 < 0$, the estimated probability that the trait is present decreases with increasing values of $X$.
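Problem:
  For $\beta_0 > 0$ and $\beta_1 > 0$:
    For $X < -\beta_0/\beta_1$, we have $\pi < 0$.
    For $X > (1 - \beta_0)/\beta_1$, we have $\pi > 1$.
  For $\beta_0 > 0$ and $\beta_1 < 0$:
    For $X < (1 - \beta_0)/\beta_1$, we have $\pi > 1$.
    For $X > -\beta_0/\beta_1$, we have $\pi < 0$.

Therefore, the requirement that all probabilities must lie between 0 and 1 is violated.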
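Logistic Regression Model:

Logit Link Function:

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$$

A back-transformation gives

$$\pi = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$$

which is bounded between 0 and 1 as required.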
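The contrast between the naive model and the logit link can be checked numerically. The sketch below is illustrative only: the function names and the coefficient values (0.2 and 0.02) are made up for the demonstration and do not come from the heart disease data.

```python
import math

def linear_probability(x, b0, b1):
    """Naive model: pi = b0 + b1 * x, which can fall outside [0, 1]."""
    return b0 + b1 * x

def logistic_probability(x, b0, b1):
    """Logistic model: pi = exp(eta) / (1 + exp(eta)) with eta = b0 + b1 * x."""
    eta = b0 + b1 * x
    # Use the algebraically equivalent form that avoids overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

b0, b1 = 0.2, 0.02  # hypothetical coefficients, chosen only for illustration
for x in (-20, 0, 20, 60):
    print(x, round(linear_probability(x, b0, b1), 3),
          round(logistic_probability(x, b0, b1), 3))
```

With these values the naive model returns a negative "probability" at X = -20 and a value above 1 at X = 60, while the logistic curve stays strictly between 0 and 1.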
Properties:
  For $\beta_1 > 0$, the estimated probability that the trait is present increases with increasing values of $X$.
  For $\beta_1 < 0$, the estimated probability that the trait is present decreases with increasing values of $X$.
Interpretation of $\beta_1$

Note:

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$$

is the log odds that a trait is present in an observation with value $X$.
  If $\beta_1 > 0$, then the odds that the trait is present increase with increasing $X$.
  If $\beta_1 < 0$, then the odds that the trait is present decrease with increasing $X$.

Therefore

$$\frac{\pi}{1 - \pi} = \exp(\beta_0 + \beta_1 X)$$

can be interpreted as the odds that a trait is present given the value of $X$.

Consider a pair of observations taking values $X$ and $X + 1$, respectively. Then the odds ratio for the trait in the second observation over that in the first is

$$\frac{\exp(\beta_0 + \beta_1 (X + 1))}{\exp(\beta_0 + \beta_1 X)} = \exp(\beta_1)$$

Thus, for every unit increase in $X$, the odds that the trait is present will increase by a multiplicative factor of $\exp(\beta_1)$.
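The claim that the odds ratio for a unit increase does not depend on the starting value of X can be verified directly. This is a small sketch with hypothetical coefficients (b0 = -5.0, b1 = 0.1); the function name `odds` is an illustrative choice.

```python
import math

def odds(x, b0, b1):
    """Odds that the trait is present: exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * x)

b0, b1 = -5.0, 0.1  # hypothetical values, for illustration only

# Odds ratio for X + 1 versus X, evaluated at several starting values of X.
ratios = [odds(x + 1, b0, b1) / odds(x, b0, b1) for x in (20, 40, 60)]
print(ratios, math.exp(b1))
```

Every ratio agrees with exp(b1) up to floating-point error, whatever the starting value of X.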
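Parameter Estimation

Notation:
  $Y_i = 1$ if the trait is present in observation $i$; $Y_i = 0$ if the trait is absent in observation $i$
  $X_i$ = explanatory variable for observation $i$
  $\pi_i = \Pr(Y_i = 1 \mid X_i)$ = probability that the trait is present in observation $i$ given the value of the explanatory variable $X_i$

We can write

$$\Pr(Y_i = y_i \mid X_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

Thus:

$$\Pr(Y_i = 1 \mid X_i) = \pi_i^{1} (1 - \pi_i)^{1 - 1} = \pi_i (1 - \pi_i)^{0} = \pi_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$

and

$$\Pr(Y_i = 0 \mid X_i) = \pi_i^{0} (1 - \pi_i)^{1 - 0} = 1 - \pi_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 X_i)}$$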
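Assumptions:
  1. The probability that the trait is present in observation $i$ is given by

     $$\pi_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$

  2. The data $Y_1, Y_2, \ldots, Y_n$ are independently distributed.

The joint distribution of the data is:

$$\Pr(y) = \Pr(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = \prod_{i=1}^{n} \frac{\exp\{y_i (\beta_0 + \beta_1 X_i)\}}{1 + \exp(\beta_0 + \beta_1 X_i)}$$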
Instead of thinking of the above as a function of the data, we can think of it as a function of the parameters 0 , 1 .
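Definition: The function

$$L(\beta_0, \beta_1) = \prod_{i=1}^{n} \frac{\exp\{y_i (\beta_0 + \beta_1 X_i)\}}{1 + \exp(\beta_0 + \beta_1 X_i)}$$

is called the likelihood function.

Definition: The maximum likelihood estimator $(\hat\beta_0, \hat\beta_1)$ is obtained by finding the $(\beta_0, \beta_1)$ that maximizes $L(\beta_0, \beta_1)$.

Notes: Finding the $(\beta_0, \beta_1)$ that maximizes $L(\beta_0, \beta_1)$ is equivalent to finding the $(\beta_0, \beta_1)$ that maximizes the log likelihood:

$$l(\beta_0, \beta_1) = \log L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i (\beta_0 + \beta_1 X_i) - \log\{1 + \exp(\beta_0 + \beta_1 X_i)\} \right]$$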
In general, there is no closed-form expression for the maximum likelihood estimator. Iterative algorithms must be used to solve this maximization problem.
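One common iterative algorithm is Newton-Raphson. The sketch below is a minimal version, assuming NumPy is available; the function name `fit_logistic` and the ten data points are invented for illustration and are not the heart disease sample. It is not necessarily the algorithm used by any particular statistics package.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25, tol=1e-10):
    """Maximize l(b0, b1) by Newton-Raphson; returns (beta, log likelihood)."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])  # intercept, slope
    beta = np.zeros(2)                                       # start at (0, 0)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))        # fitted probabilities
        gradient = X.T @ (y - pi)                     # first derivatives of l
        hessian = -(X.T * (pi * (1.0 - pi))) @ X      # second derivatives of l
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                            # Newton update
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return beta, loglik

# Hypothetical data: ages and presence (1) or absence (0) of a trait.
x = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

beta, loglik = fit_logistic(x, y)
print(beta, loglik)
```

At the maximum, both first derivatives of the log likelihood are zero, which gives a simple check that the iteration has converged.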
Example: Coronary Heart Disease and Age

Fitted Model:

$$\widehat{\Pr}(\text{Heart Disease}) = \frac{\exp(-5.3095 + 0.1109 X)}{1 + \exp(-5.3095 + 0.1109 X)}$$

For 30 year-olds, the estimated probability of heart disease is

$$\frac{\exp(-5.3095 + 0.1109 \times 30)}{1 + \exp(-5.3095 + 0.1109 \times 30)} = \frac{0.13772}{1 + 0.13772} = 0.12105$$

or about 12%.

For 60 year-olds, the estimated probability of heart disease is

$$\frac{\exp(-5.3095 + 0.1109 \times 60)}{1 + \exp(-5.3095 + 0.1109 \times 60)} = \frac{3.8363}{1 + 3.8363} = 0.79323$$

or about 79%.
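The two predicted probabilities can be reproduced from the fitted coefficients. In this short sketch the function name `pr_heart_disease` is an illustrative choice; the coefficients are the ones reported in the fitted model.

```python
import math

B0, B1 = -5.3095, 0.1109  # fitted intercept and slope from the example

def pr_heart_disease(age):
    """Estimated probability of heart disease at a given age."""
    odds = math.exp(B0 + B1 * age)
    return odds / (1.0 + odds)

for age in (30, 60):
    print(age, round(pr_heart_disease(age), 5))
```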
Hypothesis Testing

Consider testing the null hypothesis $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$.

Wald's Test: Equivalent approaches

Textbook: Compute the test statistic

$$z = \frac{\hat\beta_1}{SE(\hat\beta_1)}$$

Under the null hypothesis, $z$ is approximately normally distributed with mean zero and variance one. Reject $H_0$ at level $\alpha$ if $|z| > z_{1 - \alpha/2}$.

SAS:

$$X^2 = \left(\frac{\hat\beta_1}{SE(\hat\beta_1)}\right)^2$$

Under the null hypothesis, $X^2$ is approximately chi-square distributed with 1 d.f. Reject $H_0$ at level $\alpha$ if $X^2 > \chi^2_1(1 - \alpha)$.
Example: Coronary Heart Disease and Age

Consider the null hypothesis that the probability of contracting heart disease does not depend on age. Our test statistic is

$$X^2 = \left(\frac{0.1109}{0.0241}\right)^2 = 21.25$$

Since $X^2 = 21.25 > 3.84 = \chi^2_1(0.95)$, we can reject the null hypothesis at the 0.05 level.

Conclusion: The probability of contracting heart disease increases with increasing age ($X^2 = 21.25$; d.f. = 1; $p < 0.0001$). We estimate that the odds of contracting coronary heart disease increase by a multiplicative factor of $\exp(\hat\beta_1) = \exp(0.1109) = 1.117$ per year. For example, for every 10 years, the odds increase by a factor of $1.117^{10} = 3.02$.
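Wald's test can be carried out with the standard library alone, since for 1 d.f. the chi-square tail probability equals a two-sided normal tail probability. This is a sketch; the function name `wald_test` is hypothetical. Because it starts from the rounded estimate and standard error, it gives a statistic of about 21.2 rather than the reported 21.25, which was presumably computed from unrounded values.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, X^2, p-value) for H0: coefficient = 0."""
    z = estimate / std_error
    chi_sq = z * z
    # For 1 d.f., P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return z, chi_sq, p_value

z, chi_sq, p_value = wald_test(0.1109, 0.0241)
odds_factor_per_year = math.exp(0.1109)         # change in odds per year of age
odds_factor_per_decade = math.exp(10 * 0.1109)  # change in odds per 10 years

print(round(chi_sq, 2), p_value)
print(round(odds_factor_per_year, 3), round(odds_factor_per_decade, 2))
```

The 10-year factor computed as exp(10 × 0.1109) differs slightly in the second decimal from the value obtained by raising the rounded 1.117 to the 10th power.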
Confidence Intervals

An approximate $(1 - \alpha) \times 100\%$ confidence interval for $\beta_1$ is given by

$$\hat\beta_1 \pm z_{1 - \alpha/2} \, SE(\hat\beta_1)$$

Example: Coronary Heart Disease and Age

A 95% confidence interval for $\beta_1$ is given by

$$0.1109 \pm 1.96 \times 0.0241 = (0.063664, 0.158136)$$

A 95% confidence interval for the factor by which the odds of contracting coronary heart disease increase per year of age is given by

$$(\exp(0.063664), \exp(0.158136)) = (1.066, 1.171)$$

A 95% confidence interval for the odds ratio of contracting coronary heart disease for people 10 years apart in age is given by

$$(1.066^{10}, 1.171^{10}) = (1.895, 4.848)$$
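The interval arithmetic can be sketched as follows. The helper name `wald_ci` is hypothetical, and 1.96 is the usual normal quantile for a 95% interval.

```python
import math

def wald_ci(estimate, std_error, z=1.96):
    """Approximate confidence interval: estimate +/- z * SE."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

lo, hi = wald_ci(0.1109, 0.0241)

# Exponentiate the endpoints to get intervals on the odds scale.
per_year = (math.exp(lo), math.exp(hi))
per_decade = (math.exp(10 * lo), math.exp(10 * hi))

print(lo, hi)
print(per_year, per_decade)
```

Exponentiating 10 times the unrounded endpoints gives roughly (1.89, 4.86) for the 10-year odds ratio; the small difference from the figures obtained by raising the rounded endpoints to the 10th power is a rounding effect.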
Multiple Explanatory Variables

Case Study 20.1.1. Donner Party

Background: In 1846, the Donner party (Donner and Reed families) left Springfield, IL for California in covered wagons. After reaching Ft. Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.

Data: Three variables
  $Y_i = 1$ if person $i$ survived; $Y_i = 0$ if person $i$ died
  $X_{i1}$ = age of person $i$
  $X_{i2} = 1$ if person $i$ is male; $X_{i2} = 0$ if person $i$ is female

Objective: After taking into account age, are women more likely to survive harsh conditions than men?

Naive Approach: Analyze the contingency table

                Survivorship
Sex         Survived    Died    Total
Female            10       5       15
Male              10      20       30
Total             20      25       45

Conclusion:

Problem: The above analysis assumes that the survivorship probability does not depend on age.
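Notation:

Response: Binary variable
  $Y_i = 1$ if the trait is present in observation $i$; $Y_i = 0$ if the trait is absent in observation $i$

Explanatory Variables:
  $X_{i1}$ = first explanatory variable for observation $i$
  $X_{i2}$ = second explanatory variable for observation $i$
  ...
  $X_{ip}$ = $p$th explanatory variable for observation $i$

Consider more general logistic regression models of the form

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$

or equivalently

$$\pi_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})}$$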
Interpretation of $\beta_j$

Note:

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

is the log odds that a trait is present in an observation with values $X_1, X_2, \ldots, X_p$ for the explanatory variables.
  If $\beta_j > 0$, then the odds that the trait is present increase with increasing $X_j$.
  If $\beta_j < 0$, then the odds that the trait is present decrease with increasing $X_j$.

Therefore

$$\frac{\pi}{1 - \pi} = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$

can be interpreted as the odds that a trait is present given the values of $X_1, X_2, \ldots, X_p$.

Consider a pair of observations taking values $X_j$ and $X_j + 1$, respectively, for variable $j$, and the same fixed values $X_k$ for all $k \neq j$. Then the odds ratio for the trait in the second observation over that in the first is $\exp(\beta_j)$.

Thus, for every unit increase in $X_j$, with the other variables held fixed, the odds that the trait is present will increase by a multiplicative factor of $\exp(\beta_j)$.
Likelihood Ratio Test

Now consider the general problem of testing the null hypothesis that the probability of the trait does not depend on the value of the $p$th variable. This is accomplished by comparing two models:

Reduced Model:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}$$

which contains all variables except the variable of interest $p$.

Full Model:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \beta_p X_{ip}$$

which contains all variables including the variable of interest $p$.
Notes:
  The fit of each model to the data is assessed by the likelihood or, equivalently, the log likelihood.
  The larger the log likelihood, the better the model fits the data.
  Adding parameters to a model cannot decrease the maximized log likelihood. So, the full model will always have a log likelihood at least as large as that of the reduced model.

Question: Is the log likelihood for the full model significantly better than the log likelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable $p$.
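Notation:
  Let $l_{\text{reduced}} = l(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_{p-1})$ denote the log likelihood for the reduced model.
  Let $l_{\text{full}} = l(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)$ denote the log likelihood for the full model.

The likelihood ratio test statistic is:

$$-2 \log \Lambda = 2 \left[ l(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p) - l(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_{p-1}) \right]$$

Under $H_0: \beta_p = 0$, the likelihood ratio test statistic is approximately chi-square distributed with d.f. equal to the difference in the number of parameters estimated under the two models. Here, d.f. $= (p + 1) - p = 1$.

Reject $H_0$ at level $\alpha$ if $-2 \log \Lambda > \chi^2_1(1 - \alpha)$.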
Example: Donner Party

To test for the effect of sex, compare:

Reduced Model:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1}$$

where $X_{i1}$ denotes age of subject $i$.

Full Model:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}$$

where $X_{i1}$ denotes age, and $X_{i2}$ denotes sex (1 for male, 0 for female) of subject $i$.

Model                   -2 log likelihood
Reduced (age)           56.291
Full (age and sex)      51.256

The likelihood ratio test statistic is

$$-2 \log \Lambda = 56.291 - 51.256 = 5.035$$

Since $-2 \log \Lambda = 5.035 > 5.02 = \chi^2_1(0.975)$, we can reject the null hypothesis that sex has no effect on the survivorship probability at the 0.025 level.
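The test statistic and its p-value can be computed from the two reported -2 log likelihood values. This sketch handles only the 1 d.f. case, using the fact that a chi-square tail probability with 1 d.f. is a two-sided normal tail probability; the function name is an illustrative choice.

```python
import math

def likelihood_ratio_test(neg2ll_reduced, neg2ll_full):
    """Likelihood ratio test for a single added parameter (1 d.f.)."""
    stat = neg2ll_reduced - neg2ll_full
    # For 1 d.f., P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = likelihood_ratio_test(56.291, 51.256)
print(round(stat, 3), round(p_value, 4))
```

The p-value falls just under 0.025, in agreement with the comparison against the 0.975 quantile of the chi-square distribution.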
Fitted Model:

$$\widehat{\Pr}(\text{survival}) = \frac{\exp(3.2304 - 0.0782 X_1 - 1.5973 X_2)}{1 + \exp(3.2304 - 0.0782 X_1 - 1.5973 X_2)}$$

Conclusions:

After taking into account the effects of age, women had higher survival probabilities than men ($-2 \log \Lambda = 5.035$; d.f. = 1; $p = 0.025$). The odds that a male survives are estimated to be $\exp(-1.5973) = 0.202$ times the odds that a female survives.

Moreover, by Wald's test, the survivorship probability decreases with increasing age ($X^2 = 4.47$; d.f. = 1; $p = 0.0345$). The odds of surviving decline by a multiplicative factor of $\exp(-0.0782) = 0.925$ per year of age.
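Computing Survivorship Probabilities

Fitted Model:

$$\hat\pi = \frac{\exp(3.2304 - 0.0782 X_1 - 1.5973 X_2)}{1 + \exp(3.2304 - 0.0782 X_1 - 1.5973 X_2)}$$

For Males: We have $X_2 = 1$, so that

$$\hat\pi = \frac{\exp(1.6331 - 0.0782 X_1)}{1 + \exp(1.6331 - 0.0782 X_1)}$$

So, for a 24 year old male, the survivorship probability is

$$\frac{\exp(1.6331 - 0.0782 \times 24)}{1 + \exp(1.6331 - 0.0782 \times 24)} = \frac{0.7837}{1 + 0.7837} = 0.439$$

For Females: We have $X_2 = 0$, so that

$$\hat\pi = \frac{\exp(3.2304 - 0.0782 X_1)}{1 + \exp(3.2304 - 0.0782 X_1)}$$

So, for a 24 year old female, the survivorship probability is

$$\frac{\exp(3.2304 - 0.0782 \times 24)}{1 + \exp(3.2304 - 0.0782 \times 24)} = \frac{3.871}{1 + 3.871} = 0.795$$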
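The survivorship probabilities follow directly from the fitted coefficients. In this sketch the function and constant names are illustrative. Evaluated at age 24 with the rounded coefficients, it gives about 0.439 for a male and about 0.795 for a female, and the ratio of the two sets of odds reproduces exp(-1.5973), about 0.202.

```python
import math

B0, B_AGE, B_MALE = 3.2304, -0.0782, -1.5973  # fitted coefficients

def pr_survival(age, male):
    """Estimated survival probability; male is 1 for males, 0 for females."""
    eta = B0 + B_AGE * age + B_MALE * male
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

p_male = pr_survival(24, 1)
p_female = pr_survival(24, 0)
print(round(p_male, 3), round(p_female, 3))
print(round(odds(p_male) / odds(p_female), 3))
```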
Confidence Intervals:

Age: 95% confidence interval for $\beta_1$:

$$-0.0782 \pm 1.96 \times 0.0373 = (-0.151308, -0.005092)$$

A 95% confidence interval for the factor by which the odds of surviving change per year of age:

$$(\exp(-0.151308), \exp(-0.005092)) = (0.860, 0.995)$$

Sex: 95% confidence interval for $\beta_2$:

$$-1.5973 \pm 1.96 \times 0.7555 = (-3.07808, -0.11652)$$

A 95% confidence interval for the odds of a male surviving over the odds of a female surviving:

$$(\exp(-3.07808), \exp(-0.11652)) = (0.046, 0.890)$$
Independence between Age and Sex

The above analyses assume that the effect of sex on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction, compare the following models:

Reduced Model:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}$$

Full Model:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2}$$

Here, $X_{i1}$ denotes age, and $X_{i2}$ denotes sex (1 for male, 0 for female) of subject $i$.
Model                            -2 log likelihood
Reduced (age, sex)               51.256
Full (age, sex, interaction)     ____________

The likelihood ratio test statistic is $-2 \log \Lambda =$ ____________

Conclusion:
Fitted Model:

$$\hat\pi = \frac{\exp(7.2450 - 0.1940 X_1 - 6.9267 X_2 + 0.1616 X_1 X_2)}{1 + \exp(7.2450 - 0.1940 X_1 - 6.9267 X_2 + 0.1616 X_1 X_2)}$$

For Males: We have $X_2 = 1$, so that

$$\hat\pi = \frac{\exp(7.2450 - 0.1940 X_1 - 6.9267 \times 1 + 0.1616 X_1 \times 1)}{1 + \exp(7.2450 - 0.1940 X_1 - 6.9267 \times 1 + 0.1616 X_1 \times 1)}$$

$$= \frac{\exp\{(7.2450 - 6.9267) + (-0.1940 + 0.1616) X_1\}}{1 + \exp\{(7.2450 - 6.9267) + (-0.1940 + 0.1616) X_1\}} = \frac{\exp(0.3183 - 0.0324 X_1)}{1 + \exp(0.3183 - 0.0324 X_1)}$$

For a 24 year old male, the survivorship probability is

$$\frac{\exp(0.3183 - 0.0324 \times 24)}{1 + \exp(0.3183 - 0.0324 \times 24)} = \frac{0.6317}{1 + 0.6317} = 0.387$$

The odds of surviving decline by a factor of $\exp(-0.0324) = 0.968$ per year of age.
For Females: We have $X_2 = 0$, so that

$$\hat\pi = \frac{\exp(7.2450 - 0.1940 X_1 - 6.9267 \times 0 + 0.1616 X_1 \times 0)}{1 + \exp(7.2450 - 0.1940 X_1 - 6.9267 \times 0 + 0.1616 X_1 \times 0)} = \frac{\exp(7.2450 - 0.1940 X_1)}{1 + \exp(7.2450 - 0.1940 X_1)}$$

For a 24 year old female, the survivorship probability is

$$\frac{\exp(7.2450 - 0.1940 \times 24)}{1 + \exp(7.2450 - 0.1940 \times 24)} = \frac{13.316}{1 + 13.316} = 0.930$$

The odds of surviving decline by a factor of $\exp(-0.1940) = 0.824$ per year of age.

Note that the odds of survival decline more rapidly for females than for males.
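With an interaction term, the age slope differs by sex, so both the survival probability and the per-year odds factor have to be computed separately for males and females. The sketch below uses the fitted coefficients from the interaction model; the function names are illustrative.

```python
import math

# Fitted coefficients: intercept, age, sex (male = 1), age-by-sex interaction.
B0, B_AGE, B_MALE, B_INT = 7.2450, -0.1940, -6.9267, 0.1616

def pr_survival(age, male):
    """Estimated survival probability under the interaction model."""
    eta = B0 + B_AGE * age + B_MALE * male + B_INT * age * male
    return 1.0 / (1.0 + math.exp(-eta))

def age_odds_factor(male):
    """Multiplicative change in the odds of survival per year of age."""
    return math.exp(B_AGE + B_INT * male)

for label, male in (("male", 1), ("female", 0)):
    print(label, round(pr_survival(24, male), 3), round(age_odds_factor(male), 3))
```

The female factor is further from 1 than the male factor, which is the sense in which the odds decline more rapidly with age for females.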
Model Building

Case Study 20.1.2. Bird Keeping and Lung Cancer

Objective: Determine if the keeping of pet birds increases the risk of lung cancer.

Procedure: Retrospective study:
  49 lung cancer patients 65 years or younger, identified from 4 hospitals in The Hague.
  98 controls from the general population with a similar age structure.
The following variables were measured:
  Lung Cancer: $Y = 1$ if lung cancer is present; $Y = 0$ if lung cancer is absent
  Whether or not pet birds were kept: $X = 1$ if birds were kept; $X = 0$ if birds were not kept
  Sex: $X = 1$ if female; $X = 0$ if male
  Socioeconomic Status: $X = 1$ if high socioeconomic status; $X = 0$ otherwise
  Age
  Years Smoking
  Number of Cigarettes per Day
Note: For this retrospective study, lung cancer cases were over-represented. Therefore: The maximum likelihood estimator over-estimates the intercept term. Therefore, the logistic regression will over-estimate the probability that a randomly selected individual from the population will have lung cancer. However, the maximum likelihood estimators for the slope terms will be approximately unbiased. Therefore, we can obtain an unbiased test for the hypothesis that bird keeping is associated with a higher risk of lung cancer. Since they do not depend on the intercept term, we can obtain unbiased estimates of odds ratios; i.e., the ratio of the odds of having lung cancer if you keep birds over the odds of having lung cancer if you do not keep birds.
Naive Approaches: Consider the contingency table:

                     Cancer
Bird Keeping     No    Yes    Total
No               64     16       80
Yes              34     33       67
Total            98     49      147

Conclusion:

Problem: This analysis assumes that the risk of cancer does not depend on smoking habits, age, sex, socioeconomic status, or any other variable.
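For reference, the unadjusted odds ratio implied by the contingency table can be computed directly from its four cells. This is a sketch of the naive calculation only; the function name is illustrative, and the result ignores every other variable, which is exactly the limitation noted above.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds of disease among the exposed divided by odds among the unexposed."""
    return (exposed_cases / exposed_controls) / (unexposed_cases / unexposed_controls)

# Bird keepers: 33 with cancer, 34 without. Non-keepers: 16 with cancer, 64 without.
crude_or = odds_ratio(33, 34, 16, 64)
print(round(crude_or, 2))
```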
A multiple logistic regression yields:

Parameter         Estimate     S.E.    Wald's $X^2$   p-value
Sex                 0.5486   0.5303       1.07        0.3010
Socioeconomic       0.1137   0.4675       0.06        0.8078
Age                -0.0403   0.0357       1.28        0.2584
Years Smoking       0.0742   0.0269       7.61        0.0058
Cigarettes/Day      0.0211   0.0259       0.66        0.4168
Bird Keeping        1.3589   0.4113      10.92        0.0010

The results suggest that after taking into account sex, socioeconomic status, age, years smoking and cigarettes per day, bird keeping increases the risk of lung cancer ($X^2 = 10.92$; d.f. = 1; $p = 0.001$).
Problem: The model may be over-parameterized (includes too many explanatory variables in the analysis).

Over-Parameterized Models:
  Poor predictions for new data;
  Higher standard errors for model parameters;
  More complex interpretation of results.

Parsimonious Models:
  Improved predictions for new data;
  Lower standard errors for model parameters;
  Simpler interpretation of results.
Strategy:

Stage 1: Build the best fitting parsimonious model selected from a subset of explanatory variables, excluding the variable you wish to test. This will form the reduced model.

Stage 2: Add the variable of interest to the model selected in Stage 1 to form the full model. Test the significance of the variable of interest.

Example: Lung Cancer and Bird Keeping

Stage 1: Build the best fitting parsimonious model, selecting a subset of sex, socioeconomic status, age, years smoking and cigarettes per day as explanatory variables. This will form the reduced model.

Stage 2: Add bird keeping to the model selected in Stage 1 to form the full model. Test the significance of bird keeping.
Building Parsimonious Models

Potential explanatory variables to be included in the analysis should be selected based on knowledge of the problem at hand. This may be based on:
  a review of the literature;
  discussion with colleagues;
  prior experience.

In the present example, years of smoking and cigarettes per day may be selected based on the well-known association between smoking and lung cancer. Socioeconomic status might also be considered, on the notion that those of high socioeconomic status may be exposed to better environments.

If possible, rank the variables from most important to least important based on disciplinary knowledge. Enter the variables in rank order.

In the absence of any other knowledge, a stepwise procedure may be applied.
Stepwise Model Selection Procedure:

Step 1: Fit all one-variable models. Select the one-variable model with the largest log likelihood. Test the significance of the variable selected in Step 1.
  If significant, go to Step 2.
  If not significant, no variables are included in the model. Stop.

Step 2: Fit all two-variable models that include the variable selected in Step 1. Select the two-variable model with the largest log likelihood. Test the significance of the variable added in Step 2.
  If significant, go to Step 3.
  If not significant, select the model from Step 1 and stop.

Later steps proceed in the same way, adding one variable at a time until the added variable is not significant.
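The stepwise procedure can be sketched in code as a loop. The version below is one possible reading, not a reference implementation: it assumes NumPy, fits each candidate model by Newton-Raphson, uses a 1 d.f. likelihood ratio test at an assumed level of 0.05 as the significance test, and treats every variable as a single numeric column. The function names and the small data set are made up for illustration.

```python
import math
import numpy as np

def neg2_loglik(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson; return -2 log likelihood.

    X must already contain a column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - pi)
        information = (X.T * (pi * (1.0 - pi))) @ X
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = X @ beta
    return -2.0 * np.sum(y * eta - np.log1p(np.exp(eta)))

def forward_stepwise(columns, y, alpha=0.05):
    """columns maps variable name -> 1-D array; returns names in order of entry."""
    selected = []
    design = np.ones((len(y), 1))            # start from the intercept-only model
    current = neg2_loglik(design, y)
    remaining = dict(columns)
    while remaining:
        # Fit every model that adds one more variable to the current model.
        fits = {name: neg2_loglik(np.column_stack([design, col]), y)
                for name, col in remaining.items()}
        best = min(fits, key=fits.get)       # smallest -2 log L = largest log L
        stat = current - fits[best]          # likelihood ratio statistic, 1 d.f.
        p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
        if p_value >= alpha:
            break                            # not significant: keep current model
        selected.append(best)
        design = np.column_stack([design, remaining.pop(best)])
        current = fits[best]
    return selected

# Made-up data: x1 is related to y, x2 is not.
x1 = np.array([1, 2, 3, 4, 5, 6] * 4, dtype=float)
x2 = np.array([0] * 12 + [1] * 12, dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1] * 2, dtype=float)

print(forward_stepwise({"x1": x1, "x2": x2}, y))
```

On this toy data the loop enters x1 at the first step and stops at the second, because adding x2 does not improve the log likelihood.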
Example: Lung Cancer and Bird Keeping

Stage 1: Obtain best fitting parsimonious model excluding bird keeping.

Step 1: Years smoking is entered into the model first. This model has a -2 log likelihood of 173.167. The likelihood ratio test indicates that years smoking is statistically significant ($-2 \log \Lambda = 187.135 - 173.167 = 13.968$; d.f. = 1; $p = 0.0002$).

Step 2: Age is entered into the model second. This model has a -2 log likelihood of 168.829. The likelihood ratio test indicates that age is statistically significant ($-2 \log \Lambda = 173.167 - 168.135 = 5.032$; d.f. = 1; $p = 0.025$).

Step 3: No additional variables enter into the model significantly.
Fitted Model:

Parameter        Estimate    S.E.        Wald's $X^2$   p-value
Age              ________    ________    ________       _______
Years Smoking    ________    ________    ________       _______

Notes:

Effect of years smoking:

Age effect:
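Stage 2: Fit the reduced model:

$$\pi = \frac{\exp(\beta_0 + \beta_1\,\text{smoking})}{1 + \exp(\beta_0 + \beta_1\,\text{smoking})}$$

and the full model:

$$\pi = \frac{\exp(\beta_0 + \beta_1\,\text{smoking} + \beta_2\,\text{bird keeping})}{1 + \exp(\beta_0 + \beta_1\,\text{smoking} + \beta_2\,\text{bird keeping})}$$

Model                            -2 log likelihood
Reduced (smoking)                ____________
Full (smoking, bird keeping)     ____________

The likelihood ratio test statistic is: $-2 \log \Lambda =$ ____________

Conclusion: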
Note: Since this is a retrospective study, the intercept term is over-estimated, and so the fitted model should not be reported.

Report the estimated parameters:

Parameter        Estimate    S.E.        Wald's $X^2$   p-value
Years Smoking    ________    ________    ________       _______
Bird Keeping     ________    ________    ________       _______

Conclusions:

Effect of years smoking:

Effect of bird keeping:

The estimated odds that a bird keeper contracts lung cancer are ________ times the odds that a non-bird keeper contracts lung cancer. The estimated odds of contracting lung cancer increase by a factor of ________ per year of smoking.