Lecture 18: Multiple Logistic Regression
Mulugeta Gebregziabher, Ph.D.
BMTRY 701/755: Biostatistical Methods II
Spring 2007
Department of Biostatistics, Bioinformatics and Epidemiology
Medical University of South Carolina
Lecture 18: Multiple Logistic Regression – p. 1/48
Topics to be covered
• Review
  1. Purpose of empirical models: Association vs Prediction
  2. Design of observational studies: cross-sectional, prospective, case-control
  3. Randomization, Stratification and Matching
• Multiple logistic regression
  1. The model
  2. Estimation and Interpretation of Parameters
  3. Confounding and Interaction
  4. Effects of omitted variables
  5. Model Fitting Strategies
  6. Goodness of Fit and Model Diagnostics
• Matching (group and individual)
• Conditional vs Unconditional analysis
• Methods III: Advanced Regression Methods
Review: Purpose of empirical models
Empirical models are models fitted to provide succinct descriptions of relationships observed in data. They can take different forms; here we focus on regression models, which have wide applicability.
• They are data-driven models that provide a range of possible relationships between variables, often specified by mathematical convenience and a preference for simplicity.
• If the model fits well, inferences are possible about the nature of relationships between variables in the ranges where they are observed (NO extrapolation).
• Examples: association studies in epidemiology and prediction studies in clinical or policy-making research.
Association Studies
• Interest centers on which variables (variables of interest and adjustment variables) are in the model and on the size and sign of their coefficients.
• The predicted value for each observation, or the model fit, is not of interest per se.

Example 1. After adjusting for appropriate covariates, is broccoli intake associated with colorectal adenomatous polyps?
logit(Pr(polyps)) = β0 + β1 energyintake + ... + βk Broccoliintake

Example 2. After adjusting for age, is heart disease (HD) associated with hypertension?
logit(Pr(HD)) = β0 + β1 Age + β2 hypertension
Prediction Studies
• Interest centers on accurately estimating or predicting the response for a given combination of predictors.
• The focus is less on which predictor variables make this possible or what their coefficients are (model fit is important).

Example 1. A multiple logistic regression model for screening diabetes (Tabaei and Herman (2002), Diabetes Care, 25, 1999-2003):
logit(Pr(Diabetes)) = β0 + β1 Age + β2 Plasmaglucose + β3 Postprandialtime + β4 Female + β5 BMI
Estimates: β̂0 = −10.038, β̂1 = 0.033, β̂2 = 0.031, β̂3 = 0.250, β̂4 = 0.562, β̂5 = 0.035
They used a cutoff of 20% to predict previously undiagnosed diabetes, with sensitivity = 65% and specificity = 96%.
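As a rough illustration of how such a screening rule could be applied, the sketch below plugs the reported estimates into the inverse logit and compares the result with the 20% cutoff. The slide does not state the units of the predictors, so the assumed units (age in years, plasma glucose in mg/dl, postprandial time in hours, BMI in kg/m²), the input values, the function name, and the choice of ">=" at the cutoff are all hypothetical.

```python
import math

# Coefficient estimates as reported on the slide (Tabaei and Herman, 2002).
BETA = {
    "intercept": -10.038,
    "age": 0.033,
    "plasma_glucose": 0.031,
    "postprandial_time": 0.250,
    "female": 0.562,
    "bmi": 0.035,
}

def predicted_probability(age, plasma_glucose, postprandial_time, female, bmi):
    """Inverse logit of the linear predictor built from BETA."""
    eta = (
        BETA["intercept"]
        + BETA["age"] * age
        + BETA["plasma_glucose"] * plasma_glucose
        + BETA["postprandial_time"] * postprandial_time
        + BETA["female"] * female
        + BETA["bmi"] * bmi
    )
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical subject; the units are assumptions, not taken from the slide.
p = predicted_probability(age=55, plasma_glucose=140, postprandial_time=2,
                          female=1, bmi=30)

# Assumed decision rule: flag the subject when the probability reaches 20%.
screen_positive = p >= 0.20
print(round(p, 3), screen_positive)
```

The cutoff trades sensitivity against specificity: lowering it flags more subjects, which raises sensitivity and lowers specificity.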
Review: Designs for observational studies
We discuss three important designs whose analysis relies heavily on logistic regression. Define X to denote an exposure or treatment and D to be an outcome indicator (disease, death, etc.). Example: for a binary X and D,

CROSS-SECTIONAL DESIGN: randomly select n from a population of N records

          X=1    X=0    Total
D=1       n11    n01    n.1
D=0       n10    n00    n.0
Total     n1.    n0.    n (fixed)
Review: Designs for observational studies
PROSPECTIVE DESIGN: randomly select n1. from N1 with X = 1 and n0. from N0 with X = 0

          X=1            X=0            Total
D=1       n11            n01            n.1
D=0       n10            n00            n.0
Total     n1. (fixed)    n0. (fixed)    n

CASE-CONTROL DESIGN: randomly select n.1 from N1 cases and n.0 from N0 controls

          X=1    X=0    Total
D=1       n11    n01    n.1 (fixed)
D=0       n10    n00    n.0 (fixed)
Total     n1.    n0.    n
Review: Example
Consider a hypothetical study of the association between maternal age and birth weight using data from 1000 hospital delivery records. We can use any of the three designs discussed above. Let X = I(maternal age [...]

[...]

Pr > ChiSq: 0.0006 0.0006 0.0008
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Exp(Est)
Intercept   1    -0.5522    0.2190           6.3590            0.0117       0.576
condom      1    -0.1281    0.3801           0.1136            0.7361       0.880
partners    1     1.1889    0.3800           9.7910            0.0018       3.284
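The Wald chi-square, its p-value, and Exp(Est) in the output above can be recovered from the estimate and standard error alone. A minimal sketch, using only the reported estimates and standard errors (the helper name `wald_summary` is made up for illustration); small differences in the last digits are expected because the inputs are rounded to four decimals.

```python
import math

# (estimate, standard error) pairs copied from the PROC LOGISTIC output.
fit = {
    "Intercept": (-0.5522, 0.2190),
    "condom": (-0.1281, 0.3801),
    "partners": (1.1889, 0.3800),
}

def wald_summary(estimate, std_error):
    """Return (Wald chi-square, p-value, exp(estimate)) for one parameter."""
    z = estimate / std_error
    wald_chisq = z * z
    # Upper tail of a chi-square with 1 df equals the two-sided normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return wald_chisq, p_value, math.exp(estimate)

summary = {name: wald_summary(b, se) for name, (b, se) in fit.items()}
for name, (chisq, p_value, exp_est) in summary.items():
    print(f"{name:10s} {chisq:8.4f} {p_value:8.4f} {exp_est:6.3f}")
```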
Estimation and Interpretation of Parameters
• Estimation is done by maximum likelihood using the Newton-Raphson iterative algorithm (there is a closed-form solution when p = 1 and the predictor is binary).
• Interpretation of logit(Pr(D = 1)) = β0 + β1 X1 + ... + βp Xp:
  1. β0 is the log-odds when X1 = ... = Xp = 0
  2. β1 is the log-odds ratio comparing levels of X1, such as X1 = 1 vs X1 = 0, or for a unit change in X1, given that X2, ..., Xp are held constant
• In our example: logit(Pr(std = 1)) = −0.5522 − 0.1281 Condom + 1.1889 Partners
• Homework: Write the interpretation of the coefficients of condom use and number of partners.
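To make the estimation remark concrete, the sketch below runs Newton-Raphson for a model with a single binary predictor and compares the result with the closed-form solution from the 2x2 table. The counts are hypothetical, the function names are invented for illustration, and the cell labels follow the layout of the design tables (n11 is X = 1, D = 1; n01 is X = 0, D = 1).

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def newton_raphson_binary_x(n11, n01, n10, n00, tol=1e-10, max_iter=25):
    """ML fit of logit(Pr(D=1)) = b0 + b1*X for one binary X (grouped data)."""
    # Each group is (x, number with D=1, group total).
    groups = [(0, n01, n01 + n00), (1, n11, n11 + n10)]
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, events, total in groups:
            p = expit(b0 + b1 * x)
            resid = events - total * p      # contribution to the score
            w = total * p * (1.0 - p)       # contribution to the information
            u0 += resid
            u1 += x * resid
            i00 += w
            i01 += x * w
            i11 += x * x * w
        # Solve the 2x2 system: information * step = score.
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Hypothetical counts, not data from the lecture.
n11, n01, n10, n00 = 30, 20, 70, 80
b0_nr, b1_nr = newton_raphson_binary_x(n11, n01, n10, n00)

# Closed form: log-odds in the X=0 group and the log odds ratio.
b0_cf = math.log(n01 / n00)
b1_cf = math.log((n11 * n00) / (n10 * n01))
print(b0_nr, b0_cf)
print(b1_nr, b1_cf)
```

With more than one predictor, or a continuous one, no such closed form is available in general and the iteration is needed.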
Confounding and Interaction
The first step in multiple logistic regression is to test any a priori hypothesis of an interaction effect, followed by confounding. These are the two ways an extraneous variable may affect the relationship between outcome and exposure.

Interaction: exists when the relationship between two variables is different for different levels of a third variable. It is also called effect modification. For example, in the Nurses Health Study: is the association between breast cancer and oral contraceptive use different in women of age 30-39 and women of age 40-55?

Confounding: exists when the estimated relationship of interest changes when we add a third variable. For example, in the STD and condom use study, is the association between STD and condom use over-estimated because of the relationship between STD and number of partners?

In general, the basic questions to consider (Breslow and Day, Vol I, 1980) are:
• the degree of association between risk for disease and the factors under study
• the extent to which the observed association may result from bias, confounding and/or chance
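One common way to make the distinction concrete is to add a product term to the logit model. The symbols here are assumptions rather than the lecture's notation: X is the exposure and Z is a binary third variable.

```latex
\begin{align*}
\operatorname{logit}\bigl(\Pr(D = 1)\bigr) &= \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ \\
\mathrm{OR}_{X \mid Z = 0} &= e^{\beta_1}, \qquad
\mathrm{OR}_{X \mid Z = 1} = e^{\beta_1 + \beta_3}
\end{align*}
```

Under this sketch, interaction corresponds to β3 ≠ 0, so the odds ratio for X differs across levels of Z. When β3 = 0, confounding by Z would instead show up as a change in the estimate of β1 when Z is dropped from the model.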
Effect Modification: Example 1
Consider the Nurses Health Study: the association between breast cancer and oral contraceptive use in women of age 30-39 and women of age 40-55.

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   105.6737     3    [...]

[...]

[...] 45mph

Odds Ratio Estimates

Effect     Point Estimate   95% Wald Confidence Limits
seatbelt   2.786            0.861    9.009
speed [...]

[...]

Chi-Square   DF   Pr > ChiSq
18.3672      6    0.0054

Analysis of Effects Eligible for Entry

Effect   DF   Score Chi-Square   Pr > ChiSq
AGE      1     0.0000            1.0000
LWT      1     2.4562            0.1171
SMOKE    1     5.2581            0.0218
PTD      1    14.8178            0.0001
HT       1     0.0697            0.7918
UI       1     5.2242            0.0223
Model fitting strategies: Example

Step 1. Effect PTD entered:

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.4917    0.2609           32.6944           [...]
PTD         1     1.9436    0.5494           12.5157           [...]

Analysis of Effects Eligible for Entry

Effect   DF   Score Chi-Square   Pr > ChiSq
AGE      1    0.1297             0.7187
LWT      1    1.0145             0.3138
SMOKE    1    1.8865             0.1696
HT       1    0.0094             0.9229
UI       1    1.2965             0.2548

Step 2. Effect SMOKE entered:

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.7458    0.3358           27.0256           [...]
SMOKE       1     0.6458    0.4745            1.8529           [...]
PTD         1     1.7389    0.5685            9.3560           [...]

Analysis of Effects Eligible for Entry

Effect   DF   Score Chi-Square   Pr > ChiSq
AGE      1    0.0793             0.7783
LWT      1    0.9037             0.3418
HT       1    0.0042             0.9483
UI       1    1.2472             0.2641

Step 3. Effect UI entered:

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.8489    0.3551           27.1042           [...]
SMOKE       1     0.6418    0.4772            1.8088           [...]
PTD         1     1.5463    0.5926            6.8084           [...]
UI          1     0.6139    0.5536            1.2296           [...]
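The order in which effects enter is consistent with a forward-selection rule that, at each step, picks the candidate with the largest score chi-square. The sketch below applies that rule to the "Effects Eligible for Entry" values and recomputes the 1-df p-values. The significance level required for entry is not shown in the output, so no entry threshold is applied here; the function names are hypothetical.

```python
import math

def chisq1_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def best_candidate(score_chisq):
    """Return (effect, chi-square, p-value) for the largest score statistic."""
    effect = max(score_chisq, key=score_chisq.get)
    chisq = score_chisq[effect]
    return effect, chisq, chisq1_pvalue(chisq)

# Score chi-squares (each with 1 df) copied from the output.
step0 = {"AGE": 0.0000, "LWT": 2.4562, "SMOKE": 5.2581,
         "PTD": 14.8178, "HT": 0.0697, "UI": 5.2242}
after_step1 = {"AGE": 0.1297, "LWT": 1.0145, "SMOKE": 1.8865,
               "HT": 0.0094, "UI": 1.2965}
after_step2 = {"AGE": 0.0793, "LWT": 0.9037, "HT": 0.0042, "UI": 1.2472}

picks = [best_candidate(table) for table in (step0, after_step1, after_step2)]
for effect, chisq, p_value in picks:
    print(effect, chisq, round(p_value, 4))
```

Note that the p-values of the later entries (SMOKE, UI) are well above 0.05, so the entry criterion used in the lecture's run must have been fairly liberal.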