Logistic Regression Review: For a 2  2 contingency table, both the response and predictor variables are binary. Today: Suppose we have  Binary response variable: Y  1 if a trait is present  0 if a trait is absent  Continuous predictor variable: X Statistical Objective: Estimate the probability that a trait is present given the value of X PrY  1|X 

1

Example: Coronary Heart Disease and Age

Objective: Predict the probability of coronary heart disease as a function of age.

Procedure: A sample of 100 people between 20 and 69 years old was obtained. Each person was examined for the presence or absence of coronary heart disease. Here
$Y = 1$ if heart disease is present, $Y = 0$ if heart disease is absent, and $X = \text{age}$.

Notation: Let $\pi = \Pr(Y = 1 \mid X)$ denote the probability of heart disease given age.

Naive Model: $\pi = \beta_0 + \beta_1 X$

Properties:
- For $\beta_1 > 0$, the estimated probability that the trait is present increases with increasing values of $X$.
- For $\beta_1 < 0$, the estimated probability that the trait is present decreases with increasing values of $X$.

Problem:
- For $\beta_1 > 0$:
  - For $X < -\beta_0/\beta_1$, we have $\pi < 0$.
  - For $X > (1 - \beta_0)/\beta_1$, we have $\pi > 1$.
- For $\beta_1 < 0$:
  - For $X < (1 - \beta_0)/\beta_1$, we have $\pi > 1$.
  - For $X > -\beta_0/\beta_1$, we have $\pi < 0$.

Therefore, the requirement that all probabilities must lie between 0 and 1 is violated.

Logistic Regression Model: Logit Link Function
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$$
A back-transformation gives
$$\pi = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$$
which is bounded between 0 and 1, as required.
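
The bounding property of the back-transformation is easy to check numerically. The sketch below (a hypothetical illustration, not from the original notes, with made-up coefficients) evaluates the naive linear model and the logistic back-transformation over deliberately extreme values of $X$; only the latter stays inside (0, 1).

```python
import numpy as np

# Hypothetical coefficients chosen only for illustration.
beta0, beta1 = -5.0, 0.1

x = np.linspace(-100, 200, 7)          # deliberately extreme X values
linear = beta0 + beta1 * x             # naive model: pi = beta0 + beta1*X
logistic = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))

print("naive model range:   ", linear.min(), "to", linear.max())     # escapes [0, 1]
print("logistic model range:", logistic.min(), "to", logistic.max()) # stays in (0, 1)
```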

Properties:
- For $\beta_1 > 0$, the estimated probability that the trait is present increases with increasing values of $X$.
- For $\beta_1 < 0$, the estimated probability that the trait is present decreases with increasing values of $X$.

Interpretation of $\beta_1$

Note:
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$$
is the log odds that a trait is present in an observation with value $X$.
- If $\beta_1 > 0$, then the odds that the trait is present increase with increasing $X$.
- If $\beta_1 < 0$, then the odds that the trait is present decrease with increasing $X$.

Therefore $\exp(\beta_0 + \beta_1 X)$ can be interpreted as the odds that a trait is present given the value of $X$.

Consider a pair of observations taking values $X$ and $X + 1$, respectively. Then the odds ratio for the trait in the second observation over that in the first is
$$\frac{\exp(\beta_0 + \beta_1 (X + 1))}{\exp(\beta_0 + \beta_1 X)} = \exp(\beta_1)$$
Thus, for every unit increase in $X$, the odds that the trait is present increase by a multiplicative factor of $\exp(\beta_1)$.
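
As a quick numeric check (a hypothetical sketch, with coefficients made up for illustration), the ratio of fitted odds at $X + 1$ versus $X$ reduces to $\exp(\beta_1)$ no matter which $X$ we start from:

```python
import numpy as np

beta0, beta1 = -5.0, 0.1   # hypothetical coefficients, for illustration only

def odds(x):
    """Fitted odds that the trait is present at covariate value x."""
    return np.exp(beta0 + beta1 * x)

for x in (20.0, 45.0, 60.0):
    print(x, odds(x + 1) / odds(x))   # always equals exp(beta1)
print(np.exp(beta1))
```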

Parameter Estimation

Notation:
- $Y_i = 1$ if the trait is present in observation $i$; $Y_i = 0$ if the trait is absent in observation $i$
- $X_i$ = explanatory variable for observation $i$
- $\pi_i = \Pr(Y_i = 1 \mid X_i)$ = probability that the trait is present in observation $i$ given the value of the explanatory variable $X_i$

We can write
$$\Pr(Y_i = y_i \mid X_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$
Thus:
$$\Pr(Y_i = 1 \mid X_i) = \pi_i^{1} (1 - \pi_i)^{1-1} = \pi_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$
and
$$\Pr(Y_i = 0 \mid X_i) = \pi_i^{0} (1 - \pi_i)^{1-0} = 1 - \pi_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 X_i)}$$

Assumptions:
1. The probability that the trait is present in observation $i$ is given by
$$\pi_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$
2. The data $Y_1, Y_2, \ldots, Y_n$ are independently distributed.

The joint distribution of the data is:
$$\Pr(\mathbf{y}) = \Pr(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = \prod_{i=1}^{n} \frac{\exp\left(y_i(\beta_0 + \beta_1 X_i)\right)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$

Instead of thinking of the above as a function of the data, we can think of it as a function of the parameters $(\beta_0, \beta_1)$.

Definition: The function
$$L(\beta_0, \beta_1) = \prod_{i=1}^{n} \frac{\exp\left(y_i(\beta_0 + \beta_1 X_i)\right)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$
is called the likelihood function.

Definition: The maximum likelihood estimator for $(\beta_0, \beta_1)$ is obtained by finding $(\hat{\beta}_0, \hat{\beta}_1)$ that maximizes $L(\beta_0, \beta_1)$.

Notes:
- Finding $(\hat{\beta}_0, \hat{\beta}_1)$ that maximizes $L(\beta_0, \beta_1)$ is equivalent to finding $(\hat{\beta}_0, \hat{\beta}_1)$ that maximizes the log likelihood:
$$\ell(\beta_0, \beta_1) = \log L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i(\beta_0 + \beta_1 X_i) - \log\left(1 + \exp(\beta_0 + \beta_1 X_i)\right) \right]$$
- In general, there is no closed-form expression for the maximum likelihood estimator. Iterative algorithms must be used to solve this maximization problem.
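
Since no closed-form solution exists, the log likelihood above is maximized numerically. A minimal sketch (using simulated data and a general-purpose optimizer rather than the iteratively reweighted least squares routine most packages use; all names and values here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data, for illustration only.
n = 200
x = rng.uniform(20, 70, size=n)
true_b0, true_b1 = -5.3, 0.11
p = np.exp(true_b0 + true_b1 * x) / (1 + np.exp(true_b0 + true_b1 * x))
y = rng.binomial(1, p)

def neg_log_likelihood(beta):
    b0, b1 = beta
    eta = b0 + b1 * x
    # l(b0, b1) = sum_i [ y_i*eta_i - log(1 + exp(eta_i)) ]
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

fit = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print("MLE (beta0, beta1):", fit.x)
```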

Example: Coronary Heart Disease and Age

Fitted Model:
$$\widehat{\Pr}(\text{Heart Disease}) = \hat{\pi} = \frac{\exp(-5.3095 + 0.1109X)}{1 + \exp(-5.3095 + 0.1109X)}$$
For 30 year-olds, the estimated probability of heart disease is

For 60 year-olds, the estimated probability of heart disease is

Example: Coronary Heart Disease and Age

Fitted Model:
$$\widehat{\Pr}(\text{Heart Disease}) = \hat{\pi} = \frac{\exp(-5.3095 + 0.1109X)}{1 + \exp(-5.3095 + 0.1109X)}$$
For 30 year-olds, the estimated probability of heart disease is
$$\hat{\pi} = \frac{\exp(-5.3095 + 0.1109 \times 30)}{1 + \exp(-5.3095 + 0.1109 \times 30)} = \frac{0.13772}{1 + 0.13772} = 0.12105$$
or about 12%.

For 60 year-olds, the estimated probability of heart disease is
$$\hat{\pi} = \frac{\exp(-5.3095 + 0.1109 \times 60)}{1 + \exp(-5.3095 + 0.1109 \times 60)} = \frac{3.8363}{1 + 3.8363} = 0.79323$$
or about 79%.
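
A short sketch reproducing these two predicted probabilities from the reported coefficients (the helper function is purely illustrative):

```python
import numpy as np

def predicted_prob(x, b0=-5.3095, b1=0.1109):
    """Estimated Pr(heart disease) at age x under the fitted logistic model."""
    eta = b0 + b1 * x
    return np.exp(eta) / (1 + np.exp(eta))

print(round(predicted_prob(30), 5))   # ~0.12105, about 12%
print(round(predicted_prob(60), 5))   # ~0.79323, about 79%
```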

Hypothesis Testing

Consider testing the null hypothesis $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$.

Wald's Test: Equivalent approaches
- Textbook: Compute the test statistic
$$z = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$
Under the null hypothesis, $z$ is approximately normally distributed with mean zero and variance one. Reject $H_0$ at level $\alpha$ if $|z| > z(1 - \alpha/2)$.
- SAS: Compute the test statistic
$$X^2 = \left(\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\right)^2$$
Under the null hypothesis, $X^2$ is approximately chi-square distributed with 1 d.f. Reject $H_0$ at level $\alpha$ if
$$X^2 > \chi^2_1(1 - \alpha)$$

Example: Coronary Heart Disease and Age

Consider the null hypothesis that the probability of contracting heart disease does not depend on age. Our test statistic is
$$X^2 = \left(\frac{0.1109}{0.0241}\right)^2 = 21.25$$
Since
$$X^2 = 21.25 > 3.84 = \chi^2_1(0.95)$$
we can reject the null hypothesis at the $\alpha = 0.05$ level.

Conclusion: The probability of contracting heart disease increases with increasing age ($X^2 = 21.25$; d.f. $= 1$; $p < 0.0001$). We estimate that the odds of contracting coronary heart disease increase by a multiplicative factor of $\exp(\hat{\beta}_1) = $ ____________ per year. For example, for every 10 years, the odds increase by a factor of $1.117^{10} = 3.02$.
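
A small sketch of Wald's test computed from the reported estimate and standard error (scipy is used only for the reference chi-square quantile and p-value):

```python
from scipy.stats import chi2

beta1_hat, se = 0.1109, 0.0241       # reported estimate and standard error
x2 = (beta1_hat / se) ** 2           # Wald chi-square statistic, 1 d.f.

critical = chi2.ppf(0.95, df=1)      # ~3.84
p_value = chi2.sf(x2, df=1)

print(f"X^2 = {x2:.2f}, critical value = {critical:.2f}, p = {p_value:.2g}")
```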

Confidence Intervals

An approximate $(1 - \alpha) \times 100\%$ confidence interval for $\beta_1$ is given by
$$\hat{\beta}_1 \pm z(1 - \alpha/2) \times SE(\hat{\beta}_1)$$

Example: Coronary Heart Disease and Age

A 95% confidence interval for $\beta_1$ is given by
$$0.1109 \pm 1.96 \times 0.0241 = (0.063664, 0.158136)$$
A 95% confidence interval for the annual rate at which the odds of contracting coronary heart disease increase over time is given by
$$\left(\exp(0.063664), \exp(0.158136)\right) = (1.066, 1.171)$$
A 95% confidence interval for the odds ratio of contracting coronary heart disease for people 10 years apart in age is given by
$$\left(1.066^{10}, 1.171^{10}\right) = (1.895, 4.848)$$
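
The interval arithmetic above can be reproduced directly from the reported estimate and standard error (a small illustrative sketch):

```python
import numpy as np

beta1_hat, se, z = 0.1109, 0.0241, 1.96   # estimate, standard error, z(0.975)

lo, hi = beta1_hat - z * se, beta1_hat + z * se
print("beta1:      ", (round(lo, 6), round(hi, 6)))                          # (0.063664, 0.158136)
print("odds/year:  ", (round(np.exp(lo), 3), round(np.exp(hi), 3)))          # ~(1.066, 1.171)
print("odds/decade:", (round(np.exp(lo)**10, 3), round(np.exp(hi)**10, 3)))  # ~(1.9, 4.85)
```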

Multiple Explanatory Variables

Case Study 20.1.1. Donner Party

Background: In 1846, the Donner party (Donner and Reed families) left Springfield, IL for California in covered wagons. After reaching Ft. Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.

Data: Three variables
- $Y_i = 1$ if person $i$ survived; $Y_i = 0$ if person $i$ died
- $X_{1i}$ = age of person $i$
- $X_{2i} = 1$ if person $i$ is male; $X_{2i} = 0$ if person $i$ is female

Objective: After taking into account age, are women more likely to survive harsh conditions than men?

Naive Approach: Analyze the contingency table

                 Survivorship
Sex        Survived   Died   Total
Female           10      5      15
Male             10     20      30
Total            20     25      45

Conclusion:

Problem: The above analysis assumes that the survivorship probability does not depend on age.
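
For reference, a sketch of the naive contingency-table analysis (a chi-square test of independence plus the sample odds ratio), using scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

#                 survived  died
table = np.array([[10,        5],    # female
                  [10,       20]])   # male

chi2_stat, p, dof, expected = chi2_contingency(table)
odds_ratio = (10 / 5) / (10 / 20)    # female odds of survival over male odds

print(f"chi-square = {chi2_stat:.2f}, d.f. = {dof}, p = {p:.3f}")
print(f"sample odds ratio (female vs male) = {odds_ratio:.1f}")
```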

Notation:
- Response: Binary variable $Y_i = 1$ if the trait is present in observation $i$; $Y_i = 0$ if the trait is absent in observation $i$
- Explanatory Variables:
  $X_{i1}$ = first explanatory variable for observation $i$
  $X_{i2}$ = second explanatory variable for observation $i$
  ...
  $X_{ip}$ = $p$th explanatory variable for observation $i$

Consider more general logistic regression models of the form
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$
or equivalently
$$\pi_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})}$$

Interpretation of $\beta_j$

Note:
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
is the log odds that a trait is present in an observation with values $X_1, X_2, \ldots, X_p$ for the explanatory variables.
- If $\beta_j > 0$, then the odds that the trait is present increase with increasing $X_j$.
- If $\beta_j < 0$, then the odds that the trait is present decrease with increasing $X_j$.

Therefore $\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$ can be interpreted as the odds that a trait is present given the values of $X_1, X_2, \ldots, X_p$.

Consider a pair of observations taking values $X_j$ and $X_j + 1$, respectively, for variable $j$, and fixed values $X_k$ for $k \neq j$. Then the odds ratio for the trait in the second observation over that in the first is $\exp(\beta_j)$. Thus, for every unit increase in $X_j$, the odds that the trait is present increase by a multiplicative factor of $\exp(\beta_j)$.

Likelihood Ratio Test

Now consider the general problem of testing whether the probability that the trait is present depends on the value of the $p$th variable. This is accomplished by comparing two models:
- Reduced Model:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}$$
which contains all variables except the variable of interest, $p$.
- Full Model:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \beta_p X_{ip}$$
which contains all variables, including the variable of interest, $p$.

Notes:
- The fit of each model to the data is assessed by the likelihood or, equivalently, the log likelihood. The larger the log likelihood, the better the model fits the data.
- Adding parameters to a model can only result in an increase in the log likelihood. So, the full model will always have a larger log likelihood than the reduced model.
- Question: Is the log likelihood for the full model significantly better than the log likelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable $p$.

Notation:
- Let $\ell(\text{reduced}) = \ell(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{p-1})$ denote the log likelihood for the reduced model.
- Let $\ell(\text{full}) = \ell(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ denote the log likelihood for the full model.

The likelihood ratio test statistic is:
$$2\log\Lambda = 2\left[\ell(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) - \ell(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{p-1})\right]$$
Under $H_0: \beta_p = 0$, the likelihood ratio test statistic is approximately chi-square distributed with d.f. equal to the difference in the number of parameters estimated under the two models. Here, d.f. $= (p + 1) - p = 1$. Reject $H_0$ at level $\alpha$ if
$$2\log\Lambda > \chi^2_1(1 - \alpha)$$

Example: Donner Party

To test for the effect of sex, compare:
- Reduced Model:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1}$$
where $X_{i1}$ denotes the age of subject $i$.
- Full Model:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}$$
where $X_{i1}$ denotes age, and $X_{i2}$ denotes sex (1 for male, 0 for female) of subject $i$.

Model                  -2 log likelihood
Reduced (age)          56.291
Full (age and sex)     51.256

The likelihood ratio test statistic is
$$2\log\Lambda = 56.291 - 51.256 = 5.035$$
Since
$$2\log\Lambda = 5.035 > 5.02 = \chi^2_1(0.975)$$
we can reject the null hypothesis that sex has no effect on the survivorship probability at the $\alpha = 0.025$ level.
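
A sketch of how this comparison might be carried out with statsmodels (assuming a hypothetical data frame `donner` with columns `survived`, `age`, and `male`; the file and column names are illustrative, not part of the original notes):

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical file with one row per person: survived (0/1), age, male (0/1).
donner = pd.read_csv("donner.csv")

reduced = smf.logit("survived ~ age", data=donner).fit(disp=0)
full = smf.logit("survived ~ age + male", data=donner).fit(disp=0)

# 2 log(Lambda) = 2 * [l(full) - l(reduced)], i.e. the drop in -2 log likelihood.
lrt = 2 * (full.llf - reduced.llf)
print(f"2 log Lambda = {lrt:.3f}, p = {chi2.sf(lrt, df=1):.4f}")
```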

Fitted Model:
$$\hat{\pi} = \frac{\exp(\_\_\_\_ + \_\_\_\_\,X_1 + \_\_\_\_\,X_2)}{1 + \exp(\_\_\_\_ + \_\_\_\_\,X_1 + \_\_\_\_\,X_2)}$$

Conclusions:
- Sex Effect:

  The odds that a male survives are estimated to be $\exp(\hat{\beta}_2) = $ ____________ times the odds that a female survives.
- Age Effect:

  The odds of surviving decline by a multiplicative factor of $\exp(\hat{\beta}_1) = $ ____________ per year of age.

Fitted Model:
$$\widehat{\Pr}(\text{survival}) = \hat{\pi} = \frac{\exp(3.2304 - 0.0782X_1 - 1.5973X_2)}{1 + \exp(3.2304 - 0.0782X_1 - 1.5973X_2)}$$

Conclusions:
- After taking into account the effects of age, women had higher survival probabilities than men ($2\log\Lambda = 5.035$; d.f. $= 1$; $p < 0.025$).
- The odds that a male survives are estimated to be $\exp(-1.5973) = 0.202$ times the odds that a female survives.
- Moreover, by Wald's test, the survivorship probability decreases with increasing age ($X^2 = 4.47$; d.f. $= 1$; $p = 0.0345$).
- The odds of surviving decline by a multiplicative factor of $\exp(-0.0782) = 0.925$ per year of age.

Computing Survivorship Probabilities

Fitted Model:
$$\hat{\pi} = \frac{\exp(3.2304 - 0.0782X_1 - 1.5973X_2)}{1 + \exp(3.2304 - 0.0782X_1 - 1.5973X_2)}$$
- For Males: We have $X_2 = 1$, so that
$$\hat{\pi} = \frac{\exp(1.6331 - 0.0782X_1)}{1 + \exp(1.6331 - 0.0782X_1)}$$
So, for a 24 year old male, the survivorship probability is $\hat{\pi} = $ ____________
- For Females: We have $X_2 = 0$, so that the survivorship probability for a 24 year old female is
$$\hat{\pi} = \frac{\exp(3.2304 - 0.0782X_1)}{1 + \exp(3.2304 - 0.0782X_1)} = \text{____________}$$

Computing Survivorship Probabilities

Fitted Model:
$$\hat{\pi} = \frac{\exp(3.2304 - 0.0782X_1 - 1.5973X_2)}{1 + \exp(3.2304 - 0.0782X_1 - 1.5973X_2)}$$
- For Males: We have $X_2 = 1$, so that
$$\hat{\pi} = \frac{\exp(1.6331 - 0.0782X_1)}{1 + \exp(1.6331 - 0.0782X_1)}$$
So, for a 24 year old male, the survivorship probability is
$$\hat{\pi} = \frac{\exp(1.6331 - 0.0782 \times 24)}{1 + \exp(1.6331 - 0.0782 \times 24)} = \frac{0.7837}{1 + 0.7837} = 0.439$$
- For Females: We have $X_2 = 0$, so that, for a 24 year old female,
$$\hat{\pi} = \frac{\exp(3.2304 - 0.0782 \times 24)}{1 + \exp(3.2304 - 0.0782 \times 24)} = \frac{3.872}{1 + 3.872} = 0.795$$
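
A sketch reproducing these predictions from the reported coefficients (the helper function is illustrative):

```python
import numpy as np

def survival_prob(age, male):
    """Estimated survival probability under the fitted Donner Party model."""
    eta = 3.2304 - 0.0782 * age - 1.5973 * male
    return np.exp(eta) / (1 + np.exp(eta))

print(round(survival_prob(24, male=1), 3))   # ~0.44 for a 24 year old male
print(round(survival_prob(24, male=0), 3))   # ~0.79 for a 24 year old female
```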

Confidence Intervals:
- Age: 95% confidence interval for $\beta_1$:

  A 95% confidence interval for the annual decline in the odds of surviving:

- Sex: 95% confidence interval for $\beta_2$:

  A 95% confidence interval for the odds of a male surviving over the odds of a female surviving:

Confidence Intervals:
- Age: 95% confidence interval for $\beta_1$:
$$-0.0782 \pm 1.96 \times 0.0373 = (-0.151308, -0.005092)$$
  A 95% confidence interval for the annual decline in the odds of surviving:
$$\left(\exp(-0.151308), \exp(-0.005092)\right) = (0.860, 0.995)$$
- Sex: 95% confidence interval for $\beta_2$:
$$-1.5973 \pm 1.96 \times 0.7555 = (-3.07808, -0.11652)$$
  A 95% confidence interval for the odds of a male surviving over the odds of a female surviving:
$$\left(\exp(-3.07808), \exp(-0.11652)\right) = (0.046, 0.890)$$

Independence between Age and Sex

The above analyses assume that the effect of sex on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction, compare the following models:
- Reduced Model:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}$$
where $X_{i1}$ denotes age, and $X_{i2}$ denotes sex (1 for male, 0 for female) of subject $i$.
- Full Model:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2}$$
Here, $X_{i1}$ denotes age, and $X_{i2}$ denotes sex (1 for male, 0 for female) of subject $i$.

Model                         -2 log likelihood
Reduced (age, sex)            ____________
Full (age, sex, interaction)  ____________

The likelihood ratio test statistic is
$$2\log\Lambda = \text{____________}$$
Conclusion:

Fitted Model: exp7. 2450  0. 1940X 1 6. 9267X 2 0. 1616X 1 X 2    1 exp7. 2450  0. 1940X 1 6. 9267X 2 0. 1616X 1 X 2 

 For Males: We have X 2  1, so that exp7. 2450  0. 1940X 1 6. 9267  1  0. 1616X 1 1    1 exp7. 2450  0. 1940X 1 6. 9267  1  0. 1616X 1 1  exp7. 2450  6. 9267 0. 1940  0. 1616 X 1   1 exp7. 2450  6. 9267 0. 1940  0. 1616 X 1  exp0. 3183  0. 0324X 1   1 exp0. 3183  0. 0324X 1  For a 24 year old male, the survivorship probability is exp0. 3183  0. 0324  24    1  exp0. 3183  0. 0324  24   0. 6317  0. 387 1  0. 6317 The odds of surviving decline by a factor of exp0. 0324   0. 968 per year of age.

39

 For Females: We have X 2  0, so that exp7. 2450  0. 1940X 1 6. 9267  0  0. 1616X 1 0    1 exp7. 2450  0. 1940X 1 6. 9267  0  0. 1616X 1 0  exp7. 2450  0. 1940X 1   1 exp7. 2450  0. 1940X 1  For a 24 year old female, the survivorship probability is exp7. 2450  0. 1940  24    1  exp7. 2450  0. 1940  24   13. 316  0. 930 1  13. 316 The odds of surviving decline by a factor of exp0. 1940   0. 824 Note that the odds of survival decline more rapidly for females than for males.

40
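
The sex-specific intercepts and slopes implied by the interaction model can be derived directly from the reported coefficients; a small illustrative sketch:

```python
import numpy as np

# Reported coefficients from the interaction model: intercept, age, male, age*male.
b0, b_age, b_male, b_int = 7.2450, -0.1940, -6.9267, 0.1616

def survival_prob(age, male):
    """Estimated survival probability under the fitted interaction model."""
    eta = b0 + b_age * age + b_male * male + b_int * age * male
    return np.exp(eta) / (1 + np.exp(eta))

# Intercept and age slope on the logit scale, by sex.
print("male curve:  ", b0 + b_male, b_age + b_int)   # ~0.3183 and ~-0.0324
print("female curve:", b0, b_age)                    # 7.2450 and -0.1940
print(round(survival_prob(24, 1), 3), round(survival_prob(24, 0), 3))  # ~0.387, ~0.930
```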

Model Building

Case Study 20.1.2. Bird Keeping and Lung Cancer

Objective: Determine if the keeping of pet birds increases the risk of lung cancer.

Procedure: Retrospective Study:
- 49 lung cancer patients 65 years or younger identified from 4 hospitals in The Hague.
- 98 controls from the general population with a similar age structure.

The following variables were measured:
- Lung Cancer: $Y = 1$ if lung cancer is present; $Y = 0$ if lung cancer is absent
- Whether or not pet birds were kept: $X = 1$ if birds were kept; $X = 0$ if birds were not kept
- Sex: $X = 1$ if female; $X = 0$ if male
- Socioeconomic Status: $X = 1$ if high socioeconomic status; $X = 0$ otherwise
- Age
- Years Smoking
- Number of Cigarettes per Day

Note: For this retrospective study, lung cancer cases were over-represented. Therefore:
- The maximum likelihood estimator over-estimates the intercept term.
- Therefore, the logistic regression will over-estimate the probability that a randomly selected individual from the population will have lung cancer.
- However, the maximum likelihood estimators for the slope terms will be approximately unbiased.
- Therefore, we can obtain an unbiased test of the hypothesis that bird keeping is associated with a higher risk of lung cancer.
- Since they do not depend on the intercept term, we can obtain unbiased estimates of odds ratios; i.e., the ratio of the odds of having lung cancer if you keep birds over the odds of having lung cancer if you do not keep birds.

Naive Approaches:
- Consider the contingency table:

                      Cancer
  Bird Keeping    No    Yes   Total
  No              64     16      80
  Yes             34     33      67
  Total           98     49     147

  Conclusion:

Problem: This analysis assumes that the risk of cancer does not depend on smoking habits, age, sex, socioeconomic status, or any other variable.

- A multiple logistic regression yields:

  Parameter        Estimate    S.E.      Wald's X^2   p-value
  Sex              ________    ________  ________     ________
  Socioeconomic    ________    ________  ________     ________
  Age              ________    ________  ________     ________
  Years Smoking    ________    ________  ________     ________
  Cigarettes/Day   ________    ________  ________     ________
  Bird Keeping     ________    ________  ________     ________

  Conclusion:

- A multiple logistic regression yields:

  Parameter        Estimate    S.E.      Wald's X^2   p-value
  Sex               0.5486     0.5303     1.07        0.3010
  Socioeconomic     0.1137     0.4675     0.06        0.8078
  Age              -0.0403     0.0357     1.28        0.2584
  Years Smoking     0.0742     0.0269     7.61        0.0058
  Cigarettes/Day    0.0211     0.0259     0.66        0.4168
  Bird Keeping      1.3589     0.4113    10.92        0.0010

  The results suggest that after taking into account sex, socioeconomic status, age, years smoking, and cigarettes per day, bird keeping increases the risk of lung cancer ($X^2 = 10.92$; d.f. $= 1$; $p = 0.001$).

Problem: The model may be over-parameterized (includes too many explanatory variables in the analysis).

Problem: The model may be over-parameterized (includes too many explanatory variables in the analysis).

Over-Parameterized Models:
- Poor predictions for new data;
- Higher standard errors for model parameters;
- More complex interpretation of results.

Parsimonious Models:
- Improved predictions for new data;
- Lower standard errors for model parameters;
- Simpler interpretation of results.

Strategy:

Stage 1: Build the best-fitting parsimonious model selected from a subset of explanatory variables, excluding the variable you wish to test. This will form the reduced model.

Stage 2: Add the variable of interest to the model selected in Stage 1 to form the full model. Test the significance of the variable of interest.

Example: Lung Cancer and Bird Keeping

Stage 1: Build the best-fitting parsimonious model, selecting a subset of sex, socioeconomic status, age, years smoking, and cigarettes per day as explanatory variables. This will form the reduced model.

Stage 2: Add bird keeping to the model selected in Stage 1 to form the full model. Test the significance of bird keeping.

Building Parsimonious Models
- Potential explanatory variables to be included in the analysis should be selected based on knowledge of the problem at hand. This may be based on:
  - a review of the literature;
  - discussion with colleagues;
  - prior experience.

  In the present example, years of smoking and cigarettes per day may be selected based on the well-known association between smoking and lung cancer. Socioeconomic status might also be considered based on the notion that those of high socioeconomic status may be exposed to better environments.
- If possible, rank the variables from most important to least important based on disciplinary knowledge. Enter the variables in rank order.
- In the absence of any other knowledge, a stepwise procedure may be applied.

Stepwise Model Selection Procedure (a code sketch follows below):

Step 1: Fit all one-variable models. Select the one-variable model with the largest log likelihood. Test the significance of the variable selected in Step 1.
- If significant, go to Step 2.
- If not significant, no variables are included in the model. Stop.

Step 2: Fit all two-variable models that include the variable selected in Step 1. Select the two-variable model with the largest log likelihood. Test the significance of the variable added in Step 2.
- If significant, go to Step 3.
- If not significant, select the model from Step 1 and stop.

Subsequent steps continue in the same manner until no additional variable enters the model significantly.
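
A minimal sketch of this forward selection loop using likelihood ratio tests (assuming a hypothetical data frame `birds` with a binary `cancer` column and candidate predictor columns; the file, column names, and 0.05 cutoff are illustrative assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

birds = pd.read_csv("birds.csv")   # hypothetical file
candidates = ["sex", "ses", "age", "years_smoked", "cigs_per_day"]  # hypothetical names
selected = []

while candidates:
    # Fit the current model and every one-term extension of it.
    base_formula = "cancer ~ " + (" + ".join(selected) if selected else "1")
    base = smf.logit(base_formula, data=birds).fit(disp=0)
    fits = {v: smf.logit(base_formula + " + " + v, data=birds).fit(disp=0)
            for v in candidates}
    best = max(fits, key=lambda v: fits[v].llf)   # largest log likelihood

    lrt = 2 * (fits[best].llf - base.llf)         # likelihood ratio statistic, 1 d.f.
    if chi2.sf(lrt, df=1) < 0.05:                 # significant: keep it and continue
        selected.append(best)
        candidates.remove(best)
    else:                                         # not significant: stop
        break

print("selected variables:", selected)
```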

Example: Lung Cancer and Bird Keeping

Stage 1: Obtain the best-fitting parsimonious model excluding bird keeping.

Step 1: Years smoking is entered into the model first.

Model            -2 log likelihood
Intercept only   ____________
Years            ____________

The likelihood ratio test statistic is
$$2\log\Lambda = \text{____________}$$
Conclusion:

Step 2: Age is entered into the model second.

Model         -2 log likelihood
Years         ____________
Years + Age   ____________

The likelihood ratio test statistic is
$$2\log\Lambda = \text{____________}$$
Conclusion:

Step 3: No additional variables enter into the model significantly.

Example: Lung Cancer and Bird Keeping

Stage 1: Obtain the best-fitting parsimonious model excluding bird keeping.

Step 1: Years smoking is entered into the model first. This model has a -2 log likelihood of 173.167. The likelihood ratio test indicates that years smoking is statistically significant ($2\log\Lambda = 187.135 - 173.167 = 13.968$; d.f. $= 1$; $p = 0.0002$).

Step 2: Age is entered into the model second. This model has a -2 log likelihood of 168.829. The likelihood ratio test indicates that age is statistically significant ($2\log\Lambda = 173.167 - 168.135 = 5.032$; d.f. $= 1$; $p = 0.025$).

Step 3: No additional variables enter into the model significantly.

Fitted Model:

Parameter        Estimate    S.E.      Wald's X^2   p-value
Age              ________    ________  ________     _______
Years Smoking    ________    ________  ________     _______

Notes:
- Effect of years smoking:

- Age effect:

Stage 2: Fit the reduced model:
$$\pi = \frac{\exp(\beta_0 + \beta_1\,\text{smoking})}{1 + \exp(\beta_0 + \beta_1\,\text{smoking})}$$
and the full model:
$$\pi = \frac{\exp(\beta_0 + \beta_1\,\text{smoking} + \beta_2\,\text{bird keeping})}{1 + \exp(\beta_0 + \beta_1\,\text{smoking} + \beta_2\,\text{bird keeping})}$$

Model                          -2 log likelihood
Reduced (smoking)              ____________
Full (smoking, bird keeping)   ____________

The likelihood ratio test statistic is:
$$2\log\Lambda = \text{____________}$$
Conclusion:

Note: Since this is a retrospective study, the intercept term is over-estimated, and so the fitted model should not be reported. Report the estimated parameters:

Parameter        Estimate    S.E.      Wald's X^2   p-value
Years Smoking    ________    ________  ________     _______
Bird Keeping     ________    ________  ________     _______

Conclusions:
- Effect of years smoking:

- Effect of bird keeping:

  - The estimated odds that a bird keeper contracts lung cancer are ________ times the odds that a non-bird keeper contracts cancer.
  - The estimated odds of contracting lung cancer increase by a factor of ________ per year.
