Lecture 18: Multiple Logistic Regression – p. 1/48

Topics to be covered

• Review
  1. Purpose of empirical models: Association vs Prediction
  2. Design of observational studies: cross-sectional, prospective, case-control
  3. Randomization, Stratification and Matching
• Multiple logistic regression
  1. The model
  2. Estimation and Interpretation of Parameters
  3. Confounding and Interaction
  4. Effects of omitted variables
  5. Model Fitting Strategies
  6. Goodness of Fit and Model Diagnostics
• Matching (group and individual)
• Conditional vs Unconditional analysis
• Methods III: Advanced Regression Methods


Review: Purpose of empirical models

Empirical models are models fitted to provide succinct descriptions of relationships observed in data. They can take different forms; here we focus on regression models, which have wide applicability.

• They are data-driven models that describe a range of possible relationships between variables, often specified by mathematical convenience and a preference for simplicity.
• If the model fits well, inferences are possible about the nature of the relationships between variables over the ranges where they are observed (NO extrapolation).
• Examples: association studies in epidemiology and prediction studies in clinical or policy-making research.


Association Studies

• Interest centers on which variables (variables of interest and adjustment variables) are in the model, and on the size and sign of their coefficients
• The predicted value for each observation, and overall model fit, are not of interest per se

Example 1. After adjusting for appropriate covariates, is broccoli intake associated with colorectal adenomatous polyps?

logit(Pr(polyps)) = β0 + β1 Energyintake + ... + βk Broccoliintake

Example 2. After adjusting for age, is heart disease (HD) associated with hypertension?

logit(Pr(HD)) = β0 + β1 Age + β2 Hypertension


Prediction Studies

• Interest centers on being able to accurately estimate or predict the response for a given combination of predictors
• The focus is not so much on which predictor variables allow us to do this, or on what their coefficients are (model fit is important)

Example 1. A multiple logistic regression model for screening for diabetes (Tabaei and Herman (2002), Diabetes Care, 25, 1999-2003):

logit(Pr(Diabetes)) = β0 + β1 Age + β2 Plasmaglucose + β3 Postprandialtime + β4 Female + β5 BMI

Estimates: β̂0 = −10.038, β̂1 = 0.033, β̂2 = 0.031, β̂3 = 0.250, β̂4 = 0.562, β̂5 = 0.035

They used a cutoff of 20% to predict previously undiagnosed diabetes, with sensitivity = 65% and specificity = 96%.
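As a sketch of how such a screening rule is applied, we can plug the published coefficient estimates into the inverse-logit and compare against the 20% cutoff. Only the coefficients come from the slide; the covariate values below are hypothetical.

```python
import math

# Published coefficient estimates (Tabaei and Herman, 2002)
coef = {"intercept": -10.038, "age": 0.033, "plasma_glucose": 0.031,
        "postprandial_time": 0.250, "female": 0.562, "bmi": 0.035}

def predicted_prob(age, plasma_glucose, postprandial_time, female, bmi):
    """Inverse-logit of the linear predictor: estimated Pr(Diabetes | X)."""
    lp = (coef["intercept"] + coef["age"] * age
          + coef["plasma_glucose"] * plasma_glucose
          + coef["postprandial_time"] * postprandial_time
          + coef["female"] * female + coef["bmi"] * bmi)
    return 1.0 / (1.0 + math.exp(-lp))

# A hypothetical 55-year-old woman, glucose 160 mg/dl, 3 h postprandial, BMI 31
p = predicted_prob(55, 160, 3, 1, 31)   # ~ 0.30
screen_positive = p >= 0.20             # the paper's 20% cutoff
```

Any subject whose predicted probability exceeds 0.20 would be flagged for follow-up diagnostic testing.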


Review: Designs for observational studies

We discuss three important designs whose analysis makes heavy use of logistic regression. Let X denote an exposure or treatment and D an outcome indicator (disease, death, etc.). For binary X and D:

CROSS-SECTIONAL DESIGN: randomly select n from a population of N records

         X=1    X=0    Total
D=1      n11    n01    n.1
D=0      n10    n00    n.0
Total    n1.    n0.    n (fixed)


Review: Designs for observational studies

PROSPECTIVE DESIGN: randomly select n1. from N1 with X = 1 and n0. from N0 with X = 0

         X=1           X=0           Total
D=1      n11           n01           n.1
D=0      n10           n00           n.0
Total    n1. (fixed)   n0. (fixed)   n

CASE-CONTROL DESIGN: randomly select n.1 from N1 cases and n.0 from N0 controls

         X=1    X=0    Total
D=1      n11    n01    n.1 (fixed)
D=0      n10    n00    n.0 (fixed)
Total    n1.    n0.    n
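A useful property of the 2x2 layout above is that the sample odds ratio is computed the same way under all three designs (it is the estimable association measure even in the case-control design, where disease margins are fixed). A minimal sketch, with hypothetical counts:

```python
# Sample odds ratio for the generic 2x2 table above.
# The same estimator applies under the cross-sectional, prospective,
# and case-control designs.
def odds_ratio(n11, n10, n01, n00):
    """OR-hat = (n11 * n00) / (n10 * n01), using the cell labels above."""
    return (n11 * n00) / (n10 * n01)

# Hypothetical counts for illustration
or_hat = odds_ratio(n11=30, n10=70, n01=10, n00=90)   # (30*90)/(70*10) ~ 3.86
```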


Review: Example

Consider a hypothetical study of the association between maternal age and birth weight using data from 1000 hospital delivery records. We can use any of the three designs discussed above. Let X = I(maternal age < …).

Testing Global Null Hypothesis: BETA=0
Test                Pr > ChiSq
Likelihood Ratio        0.0006
Score                   0.0006
Wald                    0.0008

The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

                           Standard        Wald
Parameter  DF  Estimate      Error    Chi-Square   Pr > ChiSq   Exp(Est)
Intercept   1   -0.5522     0.2190        6.3590       0.0117      0.576
condom      1   -0.1281     0.3801        0.1136       0.7361      0.880
partners    1    1.1889     0.3800        9.7910       0.0018      3.284


Estimation and Interpretation of Parameters

• Estimation is done by maximum likelihood using the Newton–Raphson iterative algorithm (a closed-form solution exists only for p = 1 with binary X)
• Interpretation of logit(Pr(D = 1)) = β0 + β1 X1 + ... + βp Xp:
  1. β0 is the log-odds when X1 = ... = Xp = 0
  2. β1 is the log-odds ratio comparing levels of X1 (e.g., X1 = 1 vs X1 = 0, or a one-unit change in X1) with X2, ..., Xp held constant
• In our example: logit(Pr(std = 1)) = −0.5522 − 0.1281 Condom + 1.1889 Partners
• Homework: write out the interpretation of the coefficients of condom use and number of partners
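The Newton–Raphson iteration mentioned above can be sketched in a few lines: repeatedly solve the score equations using the observed information matrix. This is a minimal illustration on simulated data (not the STD study's data), with coefficient values chosen near the fitted ones above.

```python
import numpy as np

# Newton-Raphson for logistic-regression MLEs, on simulated data.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n),                 # intercept
                     rng.integers(0, 2, n),      # binary covariate 1
                     rng.integers(0, 2, n)])     # binary covariate 2
beta_true = np.array([-0.5, -0.1, 1.2])          # chosen near the slide's fit
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

beta = np.zeros(3)
for _ in range(25):                              # Newton-Raphson steps
    p = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p)                        # gradient of log-likelihood
    w = p * (1 - p)
    info = X.T @ (X * w[:, None])                # observed information matrix
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:             # converged
        break
# beta holds the MLEs; exp(beta[1:]) are the adjusted odds ratios
```

With p = 1 and a binary covariate the iteration is unnecessary, since the MLE is the log of the sample odds ratio from the 2x2 table.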


Confounding and Interaction

The first step in multiple logistic regression is to test any a priori hypothesis of an interaction effect, followed by a confounding effect. These are the two ways an extraneous variable may affect the relationship between outcome and exposure.

Interaction: exists when the relationship between two variables is different for different levels of a third variable. It is also called effect modification. For example, in the Nurses Health Study: is the association between breast cancer and oral contraceptive use different in women of age 30-39 and women of age 40-55?

Confounding: exists when the estimated relationship of interest changes when we add a third variable. For example, in the STD and condom use study, is the association between STD and condom use over-estimated because of the relationship between STD and number of partners?

In general, the basic questions to consider (Breslow and Day, Vol. I, 1980) are:

• the degree of association between risk for disease and the factors under study
• the extent to which the observed association may result from bias, confounding and/or chance
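A minimal numerical illustration of confounding, with made-up counts: within each stratum of the third variable the exposure-disease odds ratio is exactly 1, yet collapsing over strata produces a crude odds ratio far from 1.

```python
# Confounding with fabricated counts. Tuple order in each stratum:
# (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases)
def odds_ratio(n11, n10, n01, n00):
    return (n11 * n00) / (n10 * n01)

s1 = (10, 90, 10, 90)   # stratum 1: OR = 1, low exposure prevalence
s2 = (5, 5, 50, 50)     # stratum 2: OR = 1, high baseline risk
crude = tuple(a + b for a, b in zip(s1, s2))   # collapse over strata

or_s1 = odds_ratio(*s1)        # 1.0
or_s2 = odds_ratio(*s2)        # 1.0
or_crude = odds_ratio(*crude)  # ~ 0.37: the third variable confounds
```

Adjusting for the stratifying variable (e.g., by including it in the logistic model) recovers the null association seen within strata.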


Effect Modification: Example 1

Consider the Nurses Health Study: the association between breast cancer and oral contraceptive use in women of age 30-39 and women of age 40-55.

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      105.6737    3       …

(speed > 45mph)
Odds Ratio Estimates
Effect      Point Estimate   95% Wald Confidence Limits
seatbelt             2.786        0.861          9.009
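The printed Wald limits come from exponentiating β̂ ± 1.96·SE on the log-odds scale. The coefficient and standard error below are NOT taken from the output; they are back-calculated from the printed OR and limits, purely to illustrate the arithmetic.

```python
import math

# 95% Wald confidence limits for an odds ratio: exp(b +/- 1.96*se).
# b and se here are assumptions back-derived from the printed
# seatbelt OR 2.786 (0.861, 9.009).
b, se = math.log(2.786), 0.5991

or_hat = math.exp(b)               # 2.786
lower = math.exp(b - 1.96 * se)    # ~ 0.861
upper = math.exp(b + 1.96 * se)    # ~ 9.01
```

The interval straddles 1, consistent with the seatbelt effect in this stratum not reaching significance despite the large point estimate.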

Residual Chi-Square Test
Chi-Square   DF   Pr > ChiSq
   18.3672    6       0.0054

Analysis of Effects Eligible for Entry
                   Score
Effect   DF   Chi-Square   Pr > ChiSq
AGE       1       0.0000       1.0000
LWT       1       2.4562       0.1171
SMOKE     1       5.2581       0.0218
PTD       1      14.8178       0.0001
HT        1       0.0697       0.7918
UI        1       5.2242       0.0223


Model fitting strategies: Example

Step 1. Effect PTD entered:

Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter  DF  Estimate      Error    Chi-Square   Pr > ChiSq
Intercept   1   -1.4917     0.2609       32.6944       …
PTD         1    1.9436     0.5494       12.5157       …

Analysis of Effects Eligible for Entry
                   Score
Effect   DF   Chi-Square   Pr > ChiSq
AGE       1       0.1297       0.7187
LWT       1       1.0145       0.3138
SMOKE     1       1.8865       0.1696
HT        1       0.0094       0.9229
UI        1       1.2965       0.2548

Step 2. Effect SMOKE entered:

Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter  DF  Estimate      Error    Chi-Square   Pr > ChiSq
Intercept   1   -1.7458     0.3358       27.0256       …
SMOKE       1    0.6458     0.4745        1.8529       …
PTD         1    1.7389     0.5685        9.3560       …

Analysis of Effects Eligible for Entry
                   Score
Effect   DF   Chi-Square   Pr > ChiSq
AGE       1       0.0793       0.7783
LWT       1       0.9037       0.3418
HT        1       0.0042       0.9483
UI        1       1.2472       0.2641

Step 3. Effect UI entered:

Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter  DF  Estimate      Error    Chi-Square   Pr > ChiSq
Intercept   1   -1.8489     0.3551       27.1042       …
SMOKE       1    0.6418     0.4772        1.8088       …
PTD         1    1.5463     0.5926        6.8084       …
UI          1    0.6139     0.5536        1.2296       …
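Each Wald chi-square in these tables is just (estimate / standard error)², compared to a chi-square distribution with 1 df. A quick check against the PTD line of Step 1:

```python
import math

# Wald chi-square and its 1-df p-value from an estimate and its SE,
# reproducing the PTD line of Step 1 above.
est, se = 1.9436, 0.5494
wald = (est / se) ** 2                       # ~ 12.52, matching 12.5157
p_value = math.erfc(math.sqrt(wald / 2))     # P(chi-square_1 > wald), ~ 0.0004
```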