Building and Applying Logistic Regression

Today's topics:
1. Introduction
2. Model selection strategies
3. Diagnostics
4. Conditional independence
5. Other GLMs for binary data

Sections skipped: 6.3.4-6.3.7, 6.5, 6.7.2-6.7.9

Introduction

In general there are two competing goals:
1. The model should be complex enough to fit the data well.
2. The model should be simple to interpret, smoothing the data rather than overfitting it.

When a study is designed to answer certain questions, those questions govern the choice of model terms. Confirmatory analysis then checks and compares a relatively small number of models. For exploratory studies, a search among a large number of models usually ends in conclusions about the dependence structure of the data.

Issues in Multivariate analysis

As in ordinary regression, it is useful to look first at the relationship between each predictor and the response. For logistic regression these univariate effects can be examined through graphics or small contingency tables. An unbalanced outcome, with a small number of subjects in one category, limits the number of predictor variables the model can support; one guideline suggests at least 10 outcomes of each type for every predictor variable. Features like spurious correlation and the suppression effect (Simpson's paradox) also exist in GLMs and thus in logistic regression. As in ordinary regression, multicollinearity may occur: it then seems that no single predictor is valuable once the other predictors are also in the model. Often the predictors together give a significant effect, while no single predictor is significant.
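
A minimal sketch of this kind of preliminary screening, using an entirely made-up data frame (the variables smoker and age_group are illustrative, not from the text): a small contingency table per predictor, plus the rough 10-outcomes-per-predictor guideline.

```python
# Univariate screening and the events-per-predictor guideline (hypothetical data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.integers(0, 2, 120),
                   "smoker": rng.choice(["yes", "no"], 120),
                   "age_group": rng.choice(["<40", "40+"], 120)})

# A small contingency table (row proportions) per candidate predictor.
for col in ["smoker", "age_group"]:
    print(pd.crosstab(df[col], df["y"], normalize="index"), "\n")

# Rough guideline: at least 10 outcomes of the rarer type per predictor variable.
n_rare = min(df["y"].sum(), (1 - df["y"]).sum())
print("suggested maximum number of predictors:", n_rare // 10)
```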

Notation

Models will be symbolized by their highest-order terms. This is standard in log-linear analysis and is also useful here. As an example, let Y be the dependent variable with three predictors A, B, and C. The model with all main effects is denoted (A + B + C), and the model with an interaction between A and C plus a main effect of B is denoted (A ∗ C + B).
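
This symbolic notation corresponds closely to Wilkinson-style model formulas. The sketch below (made-up factors; patsy is assumed only to show how each formula expands into design-matrix columns) is illustrative, not part of the text.

```python
# Expanding the two example models into design-matrix columns with patsy.
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2", "a1", "a2"],
                   "B": ["b1", "b2", "b1", "b2", "b1", "b2"],
                   "C": ["c1", "c1", "c2", "c2", "c2", "c1"]})

print(dmatrix("A + B + C", df).design_info.column_names)   # (A + B + C)
print(dmatrix("A * C + B", df).design_info.column_names)   # (A * C + B)
```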

Stepwise procedures

1. Forward selection adds terms sequentially until further additions do not improve the model fit. At each stage it selects the term giving the greatest improvement in fit, measured by the minimum P-value.
2. Backward elimination begins with a complex model and sequentially deletes terms. At each stage it deletes the term that has the least damaging effect on the model, again measured by P-value. (Implemented in SPSS; see the sketch after this list.)
3. The P-values are usually associated with the deviance-difference test, G^2(M0 | M1).
4. In both approaches, a multi-category variable should always be added or removed as a whole.
5. Neither approach necessarily leads to a sensible model.
6. Never select a model on the basis of statistics alone!
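
A minimal sketch of backward elimination driven by deviance-difference (likelihood-ratio) tests. The function name, the 0.05 threshold, and the pandas/statsmodels setup are illustrative assumptions, not from the text.

```python
# Backward elimination for a logistic regression, dropping at each stage the
# term whose removal hurts the fit least (largest P-value of G^2(M0 | M1)).
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

def backward_eliminate(df, response, terms, alpha=0.05):
    terms = list(terms)
    while len(terms) > 1:
        full = smf.glm(f"{response} ~ " + " + ".join(terms),
                       data=df, family=sm.families.Binomial()).fit()
        best_p, best_term = -1.0, None
        for t in terms:                       # deviance increase when t is dropped
            rest = [u for u in terms if u != t]
            reduced = smf.glm(f"{response} ~ " + " + ".join(rest),
                              data=df, family=sm.families.Binomial()).fit()
            g2 = reduced.deviance - full.deviance
            p = stats.chi2.sf(g2, full.df_model - reduced.df_model)
            if p > best_p:
                best_p, best_term = p, t
        if best_p > alpha:                    # least damaging term is not significant
            terms.remove(best_term)
        else:
            break
    return terms

# Example call (hypothetical data frame): backward_eliminate(df, "y", ["A", "B", "C"])
```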

Model Selection

1. Any model is a simplification of reality. Why then fit a model at all?
   (a) A simple model has the advantage of parsimony.
   (b) If a model has relatively little bias, describing reality well, it tends to provide more accurate estimates of the quantities of interest.
2. Besides significance tests, there are other quantities that can guide model selection. One important example is the AIC statistic, which weighs model fit against parsimony:

   AIC = -2(maximized log likelihood - number of parameters)

   For comparing models fitted to the same data this is equivalent, up to a constant, to

   AIC = G^2 - 2 df,

   where df is the residual degrees of freedom of the model. The model with the lowest AIC should be selected (see the sketch below).
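
A minimal sketch, with simulated data and made-up candidate formulas, of comparing logistic regression models by AIC in statsmodels; none of the variable names come from the text.

```python
# Compare candidate logistic regression models for the same data by AIC.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.normal(size=150),
                   "B": rng.normal(size=150),
                   "C": rng.choice(["c1", "c2"], 150)})
eta = 0.5 + 1.0 * df["A"] - 0.5 * df["B"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

formulas = ["y ~ A + B + C", "y ~ A * C + B", "y ~ A + B"]
fits = {f: smf.glm(f, data=df, family=sm.families.Binomial()).fit() for f in formulas}
for f, res in fits.items():
    print(f, "AIC =", round(res.aic, 2))
best = min(fits, key=lambda f: fits[f].aic)   # model with the lowest AIC
print("selected:", best)
```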

Causal hypotheses

If available, theory should be incorporated into model selection. Often a time ordering of the effects suggests possible causal relationships. Split the table into smaller parts and fit a logistic regression to each of those tables separately, as in path models. From the expected values of each of these models, expected values for the more complicated model can be computed (but this is not in the text).

Diagnostics-residuals

1. Pearson residual:

$$ e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\widehat{\mathrm{var}}(Y_i)}} = \frac{y_i - \hat{\mu}_i}{\sqrt{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}} $$

2. Deviance residual: the deviance residual for observation i is √d_i × sign(y_i − μ̂_i), where

$$ d_i = 2\left[\, y_i \log\frac{y_i}{n_i \hat{\pi}_i} + (n_i - y_i) \log\frac{n_i - y_i}{n_i - n_i \hat{\pi}_i} \,\right] $$

Both tend to be less variable than N(0, 1) and can be standardized. Plots of explanatory variables against the residuals can detect lack of fit of the model. Alternatively, plots of explanatory variables against the fitted and observed proportions can detect lack of fit. When some n_i are very small, the analysis of residuals is not helpful.
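
A minimal sketch, assuming statsmodels and an entirely hypothetical grouped data set: a binomial GLM is fitted and the Pearson and deviance residuals defined above are extracted.

```python
# Pearson and deviance residuals for a grouped-data logistic regression.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "successes": [2, 5, 9, 14, 18],
                   "n": [20, 20, 20, 20, 20]})
df["failures"] = df["n"] - df["successes"]

endog = df[["successes", "failures"]]        # two-column response for grouped data
exog = sm.add_constant(df[["x"]])
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()

print(fit.resid_pearson)    # (y_i - mu_i) / sqrt(n_i * pi_i * (1 - pi_i))
print(fit.resid_deviance)   # sign(y_i - mu_i) * sqrt(d_i)
```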

Diagnostics-influence

Influence measures for each observation:
1. For each parameter of the model, the change in its estimate when an observation is deleted, divided by its standard error, is called Dfbeta.
2. A measure of the change in a joint confidence interval for the parameters produced by deleting an observation (c).
3. The change in the X^2 or G^2 statistic when an observation is deleted.
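
A minimal sketch of measure 1, Dfbeta, computed by brute force (refitting with each observation deleted); the small grouped data set is made up, and production code would typically use a dedicated influence routine instead of a loop.

```python
# Dfbeta: standardized change in each estimate when one observation is deleted.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "successes": [2, 5, 9, 14, 18],
                   "n": [20, 20, 20, 20, 20]})
df["failures"] = df["n"] - df["successes"]
endog = df[["successes", "failures"]]
exog = sm.add_constant(df[["x"]])
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()

for i in range(len(df)):
    mask = np.arange(len(df)) != i
    refit = sm.GLM(endog[mask], exog[mask], family=sm.families.Binomial()).fit()
    dfbeta = (fit.params - refit.params) / fit.bse   # change per SE on deletion
    print(i, dfbeta.round(3).to_dict())
```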

Diagnostics-R^2

R^2 is the measure of predictive power in ordinary regression. The correlation between the observed and fitted values, r(y_i, μ̂_i), can also be used in GLMs as a measure of predictive power. Other measures:

1. The proportional reduction in squared error:

$$ 1 - \frac{\sum_i (y_i - \hat{\pi}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

2. A measure depending on the maximized log likelihoods of the fitted model (L_M), the null model (L_0), and the saturated model (L_S):

$$ \frac{L_M - L_0}{L_S - L_0} $$

It equals 0 when the model provides no improvement over the null model, and it equals 1 when it is as good as the saturated model. An adaptation for grouped data based on deviance statistics is

$$ D = \frac{G^2(0) - G^2(M)}{G^2(0)} $$
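
A minimal sketch with simulated data (ungrouped, for brevity): the proportional reduction in squared error and the deviance-based measure computed from a statsmodels fit.

```python
# Two summary measures of predictive power for a fitted logistic regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()

pi_hat = fit.fittedvalues                                 # estimated probabilities
r2_sqerr = 1 - np.sum((y - pi_hat) ** 2) / np.sum((y - y.mean()) ** 2)
D = (fit.null_deviance - fit.deviance) / fit.null_deviance  # [G^2(0) - G^2(M)] / G^2(0)
print(round(r2_sqerr, 3), round(D, 3))
```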

Diagnostics-Classification tables

A classification table cross-classifies the observed versus the predicted binary outcome, from which we can compute sensitivity and specificity (see class 2). The predicted outcome is ŷ_i = 1 if π̂_i > π_0, usually with π_0 = 0.5. Limitations of this classification table are that it collapses the continuous predicted probabilities π̂_i into binary values and that the choice of cut-off π_0 is rather arbitrary. The receiver operating characteristic (ROC) curve plots sensitivity as a function of (1 - specificity) for all possible values of π_0. The larger the area under the curve, the better the predictions. The area under the curve is identical to the concordance index c, which estimates the probability that the predictions and the outcomes are concordant, i.e. that observations with larger y also have larger π̂. A value c = 0.5 means that the prediction is no better than a random guess and corresponds to a ROC curve that is a straight line.
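
A minimal sketch with made-up fitted probabilities; scikit-learn is assumed only as a convenience for the classification table and the area under the ROC curve.

```python
# Classification table at cut-off pi_0 = 0.5 and the concordance index c.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
pi_hat = np.array([0.20, 0.40, 0.70, 0.30, 0.60, 0.90, 0.55, 0.35])

y_pred = (pi_hat > 0.5).astype(int)          # predicted outcome at pi_0 = 0.5
print(confusion_matrix(y, y_pred))           # rows: observed, columns: predicted
print(roc_auc_score(y, pi_hat))              # area under the ROC curve = c
```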

Conditional independence

We are often interested in the relationship between a treatment variable X and an outcome Y after controlling for a possible confounding variable Z. Here the 2 × 2 × K case will be discussed. Let π_ik = P(Y = 1 | X = i, Z = k) and consider the model

logit(π_ik) = α + β x_i + β_k^Z,

where x_i = 1 indicates treatment, x_i = 0 indicates control, and β_k^Z is the effect of level k of Z. This model assumes that the XY relationship is the same at each level of the confounding variable Z (the common odds ratio is exp(β)).

We can test whether there is a treatment effect, H0: β = 0, with the Wald statistic (β̂/SE)^2 or with the likelihood-ratio statistic comparing this model to the reduced model

logit(π_ik) = α + β_k^Z.

(The score statistic for this test is the Cochran-Mantel-Haenszel statistic.)
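
A minimal sketch with simulated data (all variable names and true parameter values are made up): the model above is fitted with statsmodels and H0: β = 0 is tested with the Wald and likelihood-ratio statistics.

```python
# Test for a treatment effect after controlling for strata Z (K = 3 levels).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
n = 400
z = rng.integers(1, 4, size=n)                 # confounder with K = 3 levels
x = rng.integers(0, 2, size=n)                 # treatment (1) vs control (0)
eta = -0.5 + 0.8 * x + 0.4 * (z - 2)           # common XY log odds ratio 0.8
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
df = pd.DataFrame({"y": y, "x": x, "z": z})

full = smf.glm("y ~ x + C(z)", data=df, family=sm.families.Binomial()).fit()
reduced = smf.glm("y ~ C(z)", data=df, family=sm.families.Binomial()).fit()

wald = (full.params["x"] / full.bse["x"]) ** 2     # (beta_hat / SE)^2
lr = reduced.deviance - full.deviance              # deviance-difference statistic
print(wald, lr, stats.chi2.sf(lr, df=1))
```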

Probit Model

Compared to the logistic regression model, the probit (regression) model uses the inverse of the standard normal cumulative distribution function Φ as the link function (instead of the logit link):

Φ⁻¹(π(x)) = α + βx.

Since 68% of the normal density falls within one standard deviation of the mean, 1/|β| is the distance between the x-value where π(x) = 0.16 (or π(x) = 0.84) and the x-value where π(x) = 0.5. When both the logistic and the probit model fit well, the estimates from the logistic regression are about 1.7 times those from the probit model.
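
A minimal sketch with simulated data: the same binary data are fitted with a logit and a probit model, and the ratio of the slope estimates is inspected (it is typically close to 1.7).

```python
# Logit versus probit slopes on the same simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.0 * x))))
X = sm.add_constant(x)

logit_fit = sm.Logit(y, X).fit(disp=0)
probit_fit = sm.Probit(y, X).fit(disp=0)
print(logit_fit.params[1] / probit_fit.params[1])   # typically close to 1.7
```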

Complementary log-log model

The probit and logit links are symmetric: π(x) approaches 0 at the same rate as it approaches 1. Sometimes the data are asymmetric, in the sense that 1 is approached faster at one end than 0 is at the other end. Then the model with link

log[-log(1 - π(x))] = α + βx

might be a better option. This link is called the complementary log-log link; π(x) approaches 0 fairly slowly but approaches 1 quite sharply. Interpretation: for x2 - x1 = 1, the complement probability at x2 equals the complement probability at x1 raised to the power exp(β), i.e. 1 - π(x2) = [1 - π(x1)]^exp(β). A related link is the log-log link,

log[-log(π(x))] = α + βx,

which approaches 0 sharply and 1 slowly.
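
A minimal sketch with data simulated from a true complementary log-log model (all values made up): the model is fitted with statsmodels and the power interpretation of exp(β) is checked numerically.

```python
# Fit a complementary log-log model and verify 1 - pi(x+1) = (1 - pi(x)) ** exp(beta).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=500)
pi_true = 1 - np.exp(-np.exp(-0.2 + 0.9 * x))        # true cloglog model
y = rng.binomial(1, pi_true)

fit = sm.GLM(y, sm.add_constant(x),
             family=sm.families.Binomial(link=sm.families.links.CLogLog())).fit()
a, b = fit.params

def pi_hat(t):
    return 1 - np.exp(-np.exp(a + b * t))

print(1 - pi_hat(1.0))                   # complement probability at x = 1
print((1 - pi_hat(0.0)) ** np.exp(b))    # same value via the power interpretation
```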
