Author: Garry Sims
Logistic Regression

The data below are from a study conducted by Milicer and Szczotka on pre-teen and teenage girls in Warsaw. The subjects were classified into 25 age categories. The number of girls in each group (sample size) and the number that reached menarche (# RM) at the time of the study were recorded. The age for a group corresponds to the midpoint for the age interval.

Sample size   # RM   Age          Sample size   # RM   Age
376           0      9.21         106           67     13.33
200           0      10.21        105           81     13.58
93            0      10.58        117           88     13.83
120           2      10.83        98            79     14.08
90            2      11.08        97            90     14.33
88            5      11.33        120           113    14.58
105           10     11.58        102           95     14.83
111           17     11.83        122           117    15.08
100           16     12.08        111           107    15.33
93            29     12.33        94            92     15.58
100           39     12.58        114           112    15.83
108           51     12.83        1049          1049   17.58
99            47     13.08

The researchers were curious about how the proportion of girls that reached menarche (p̂ = # RM / sample size) varied with age. One could perform a test of homogeneity by arranging the data as a 2 by 25 contingency table with columns indexed by age and two rows: ROW1 = # RM and ROW2 = # that have not RM = sample size − # RM. A more powerful approach treats these as regression data, using the proportion of girls reaching menarche as the response and age as a predictor.

A plot of the observed proportion p̂ of girls that have reached menarche (phat in the plot) shows that the proportion increases as age increases, but that the relationship is nonlinear. The observed proportions, which are bounded between zero and one, have a lazy S-shape (a sigmoidal function) when plotted against age. The change in the observed proportions for a given change in age is much smaller when the proportion is near 0 or 1 than when the proportion is near 1/2. This phenomenon is common with regression data where the response is a proportion.

The trend is nonlinear, so linear regression is inappropriate. A sensible alternative might be to transform the response or the predictor to achieve near linearity. A common transformation of response proportions following a sigmoidal curve is to the logit scale, log{p̂/(1 − p̂)}. This transformation is the basis for the logistic regression model. The natural logarithm (base e) is traditionally used in logistic regression.

The logit transformation is undefined when p̂ = 0 or p̂ = 1. To overcome this problem, researchers often plot the empirical logits, defined by log{(p̂ + .5/n)/(1 − p̂ + .5/n)}, where n is the sample size or the number of observations on which p̂ is based. A plot of the empirical logits against age is roughly linear, which supports a logistic transformation for the response.
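As an aside (not part of the original SAS analysis), the empirical logit is easy to compute in any language. A small Python sketch, using a few of the menarche groups:

```python
import math

# (sample size n, number reaching menarche y) for a few of the age groups
groups = [(376, 0), (105, 10), (100, 39), (114, 112)]

def empirical_logit(n, y):
    """log{(phat + .5/n) / (1 - phat + .5/n)}; defined even when phat is 0 or 1."""
    phat = y / n
    return math.log((phat + 0.5 / n) / (1 - phat + 0.5 / n))

for n, y in groups:
    print(n, y, round(empirical_logit(n, y), 3))
```

Note that the .5/n adjustment keeps the logit finite for the groups with p̂ = 0 (e.g. the 376 girls aged 9.21) and p̂ = 1.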

The two plots above were generated by the SAS code:

data d1;
  input ntot nres age @@;
  phat = nres/ntot;
  emplog = log( (phat+.5/ntot)/(1-phat+.5/ntot) );
datalines;
376 0 9.21     200 0 10.21    93 0 10.58     120 2 10.83
90 2 11.08     88 5 11.33     105 10 11.58   111 17 11.83
100 16 12.08   93 29 12.33    100 39 12.58   108 51 12.83
99 47 13.08    106 67 13.33   105 81 13.58   117 88 13.83
98 79 14.08    97 90 14.33    120 113 14.58  102 95 14.83
122 117 15.08  111 107 15.33  94 92 15.58    114 112 15.83
1049 1049 17.58
;
symbol1 value=x interpol=none width=2;
proc gplot;
  plot (phat emplog)*age;
run;

The Simple Logistic Regression Model

The simple logistic regression model expresses the population proportion p of individuals with a given attribute (called a success) as a function of a single predictor variable X. The model assumes that p is related to X through

    log{p/(1 − p)} = β0 + β1 X

or, equivalently, as

    p = exp(β0 + β1 X) / {1 + exp(β0 + β1 X)}.
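The two forms are algebraically equivalent, and the equivalence is easy to check numerically. A minimal Python sketch (the coefficient values here are illustrative, chosen to be of roughly the magnitude fitted later for the menarche data):

```python
import math

def logit(p):
    """Log-odds of p."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of the logit: maps a linear predictor z back to a probability."""
    return math.exp(z) / (1 + math.exp(z))

b0, b1 = -21.23, 1.63  # illustrative values, not exact estimates
for x in (11.0, 13.0, 15.0):
    p = inv_logit(b0 + b1 * x)
    # applying the logit to p recovers the linear predictor
    assert abs(logit(p) - (b0 + b1 * x)) < 1e-9
    print(x, round(p, 4))
```

The printed probabilities trace out the sigmoidal curve: near 0 at low x, near 1 at high x.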

The logistic regression model is a binary response model, where the response for each case falls into one of 2 exclusive and exhaustive categories: success (cases with the attribute of interest) and failure (cases without the attribute of interest). The odds of success are p/(1 − p). For example, the odds of success are 1 (or 1 to 1) when p = 1/2. The odds of success are 9 (or 9 to 1) when p = .9. The logistic model assumes that the log-odds of success is linearly related to X. Graphs of the logistic model relating p to X are given below. The sign of the slope refers to the sign of β1.

I should (but usually will not) write p = p(X) to emphasize that p is the proportion of all individuals with score X that have the attribute of interest. In the menarche data, p = p(X) is the population proportion of girls at age X that have reached menarche.

The data in a logistic regression problem are often given in summarized or aggregate form:

    X     n     y
    X1    n1    y1
    X2    n2    y2
    .     .     .
    Xm    nm    ym

where yi is the number of individuals with the attribute of interest among ni randomly selected or representative individuals with predictor variable value Xi. Multiple X columns

[Figure: the logistic model plotted on the probability scale (sigmoidal curves bounded by 0 and 1) and on the logit scale (straight lines in the log-odds), for positive, zero, and negative slopes.]

are used when the model includes multiple effects. For raw data on individual cases, yi = 1 or 0, depending on whether the case at Xi is a success or failure. The sample size column n is omitted with raw data.

For logistic regression, a plot of the sample proportions p̂i = yi/ni against Xi should be roughly sigmoidal, and a plot of the empirical logits against Xi should be roughly linear. If not, then some other model is probably appropriate. I find the second plot easier to calibrate, but neither plot is very informative when the sample sizes are small, say 1 or 2. (Why? Consider the data plot in the leukemia analysis to come.) A plot of the empirical logits against age in the menarche data suggests that the logistic regression model is reasonable.

The power of the logistic model versus the contingency table analysis discussed earlier is that the model gives estimates of the population proportion reaching menarche at all ages within the observed age range. The observed proportions allow you to estimate only the population proportions at the observed ages.

There are a variety of other binary response models that are used in practice. The probit regression model or the complementary log-log regression model might be appropriate when the logistic model does not fit the data.
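The raw-versus-summarized distinction above is mechanical: raw 0/1 cases can always be collapsed to the (X, n, y) layout. A short Python sketch with hypothetical data (not from the notes):

```python
from collections import defaultdict

# raw data: one (x, outcome) pair per case, with outcome 1 = success
raw = [(1.0, 1), (1.0, 0), (1.0, 1), (2.0, 0), (2.0, 1), (2.0, 1), (2.0, 1)]

# aggregate to the summarized (X, n, y) layout described above
counts = defaultdict(lambda: [0, 0])   # x -> [n, y]
for x, outcome in raw:
    counts[x][0] += 1          # group sample size n
    counts[x][1] += outcome    # number of successes y

grouped = [(x, n, y) for x, (n, y) in sorted(counts.items())]
print(grouped)   # [(1.0, 3, 2), (2.0, 4, 3)]
```

Either layout gives the same fitted coefficients, though the deviance goodness-of-fit test is only meaningful for grouped data with reasonably large ni.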


Maximum Likelihood Estimation for Logistic Regression Model

There are two unknown population parameters in the logistic regression model

    log{p/(1 − p)} = β0 + β1 X.

A simple way to estimate β0 and β1 is by least squares (LS), using the empirical logits as responses and the Xi's as the predictor values. Although this approach is straightforward, there are better inferential methods than standard LS regression, which assumes normally distributed responses with constant variance. A deficiency of LS inference for the logistic model is that the observed counts yi have a Binomial distribution under random sampling. The Binomial distribution is a discrete probability model associated with counting the number of successes in a fixed size sample, and other equivalent experiments such as counting the number of heads in repeated flips of a coin. The distribution of the empirical logits depends on the yi's, so they are not normal (but are approximately normal in large samples), and they are extremely skewed when the sample sizes ni are small. The response variability depends on the population proportions, and is not roughly constant when the observed proportions or the sample sizes vary appreciably across groups.

The differences in variability among the empirical logits can be accounted for using weighted least squares (WLS) when the sample sizes are large. However, an alternative approach called maximum likelihood uses the exact Binomial distribution of the responses yi to generate optimal estimates of the regression coefficients. Software for maximum likelihood estimation is widely available, so LS and WLS methods are not really needed.

In maximum likelihood estimation (MLE), the regression coefficients are estimated iteratively by minimizing the deviance function (also called the likelihood ratio chi-squared statistic)

    D = 2 Σ_{i=1}^{m} { yi log( yi / (ni pi) ) + (ni − yi) log( (ni − yi) / (ni − ni pi) ) }

over all possible values of β0 and β1, where the pi's satisfy the logistic model

    log{pi/(1 − pi)} = β0 + β1 Xi.
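To make the iterative idea concrete, here is a bare-bones Python sketch that fits β0 and β1 by Newton-Raphson on the Binomial log-likelihood (equivalent to minimizing the deviance). The data are made up for illustration; this is not the algorithm genmod uses internally, though genmod's iteratively reweighted least squares is closely related:

```python
import math

# made-up grouped data: (x_i, n_i, y_i)
data = [(1.0, 20, 2), (2.0, 20, 6), (3.0, 20, 11), (4.0, 20, 16), (5.0, 20, 19)]

def newton_logistic(data, iters=25):
    """Fit log{p/(1-p)} = b0 + b1*x by Newton-Raphson on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1 - p)       # binomial variance at the current fit
            g0 += y - n * p           # score (gradient of log-likelihood)
            g1 += (y - n * p) * x
            h00 += w                  # observed information (negative Hessian)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: information^{-1} * score
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = newton_logistic(data)
print(round(b0, 3), round(b1, 3))
```

With these data the slope comes out positive and the intercept negative, matching the rising trend in the observed proportions 2/20 through 19/20.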


Equivalently, the MLEs are obtained by maximizing the log-likelihood function (not given here). The ML method also gives standard errors and significance tests for the regression estimates and extends to models with multiple predictors.

The deviance is an analog of the residual sums of squares in linear regression. The choices for β0 and β1 that minimize the deviance are the parameter values that make the observed and fitted proportions as close together as possible in a “likelihood sense”.

Suppose that b0 and b1 are the MLEs of β0 and β1. The deviance evaluated at the MLEs,

    D = 2 Σ_{i=1}^{m} { yi log( yi / (ni p̃i) ) + (ni − yi) log( (ni − yi) / (ni − ni p̃i) ) },

where the fitted probabilities p̃i satisfy

    log{p̃i/(1 − p̃i)} = b0 + b1 Xi,

is used to test the adequacy of the model. The deviance is small when the data fit the model, that is, when the observed and fitted proportions are close together. Large values of D occur when one or more of the observed and fitted proportions are far apart, which suggests that the model is inappropriate.

If the logistic model holds, then D has a chi-squared distribution with m − r degrees of freedom, where m is the number of groups and r (here 2) is the number of estimated regression parameters. A p-value for the deviance is given by the area under the chi-squared curve to the right of D. A small p-value indicates that the data do not fit the model. There are a few caveats to this asymptotic result that I will mention.

Alternatively, the fit of the model can be evaluated using the chi-squared approximation to the Pearson X² statistic:

    X² = Σ_{i=1}^{m} { (yi − ni p̃i)² / (ni p̃i) + ((ni − yi) − ni(1 − p̃i))² / (ni(1 − p̃i)) }
       = Σ_{i=1}^{m} (yi − ni p̃i)² / (ni p̃i (1 − p̃i)).
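Both fit statistics are simple sums over the groups. A small Python sketch computing D and X² for hypothetical fitted probabilities (not the menarche fit):

```python
import math

# (n_i, y_i, fitted p_i) for a few hypothetical groups
fit = [(50, 5, 0.12), (50, 20, 0.38), (50, 40, 0.79)]

def deviance(fit):
    D = 0.0
    for n, y, p in fit:
        # each log term is taken as 0 when its count is 0 (y log(y/np) -> 0 as y -> 0)
        if y > 0:
            D += y * math.log(y / (n * p))
        if n - y > 0:
            D += (n - y) * math.log((n - y) / (n - n * p))
    return 2 * D

def pearson_x2(fit):
    # collapsed form: sum of (y - np)^2 / (np(1-p))
    return sum((y - n * p) ** 2 / (n * p * (1 - p)) for n, y, p in fit)

print(round(deviance(fit), 3), round(pearson_x2(fit), 3))
```

Both statistics are exactly zero when every fitted proportion equals its observed proportion, and both grow as observed and fitted proportions move apart.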

Fitting the Logistic Model by Maximum Likelihood

The logistic regression model can be fitted in SAS using either the logistic procedure (specifically for logistic regression) or with the genmod procedure, which fits generalized linear models, a family that includes logistic regression and other binary response models as special cases. I will focus on genmod but will note that the logistic procedure has model selection methods (forward, backward, best subsets, etc.) and diagnostic tools (e.g. Cook's distance) not covered by genmod. The structure of genmod mimics glm (for normal theory regression) in that factors and interactions can be directly entered into the model. The following analysis for the menarche data illustrates some basic genmod features. The online help gives more details.

The model statement specifies the response in the form of a ratio of two variables from the data set d1. The numerator nres is the number of successes at a given age, the single effect in the model referenced to the right of the equal sign on the model statement. The denominator ntot is the group sample size. This is the natural way to specify the model when the data are in summarized form versus data on individual responses (more on this later). Do not use the observed proportions computed in the data step as the “response” in the model statement.

The dist=bin link=logit obstats noscale options on the model statement (following the slash) imply that a binary (or binomial) response model with a logit link (i.e. logistic regression) is to be fitted to the data. Noscale implies that a dispersion parameter is not estimated. The obstats option together with the ods output statement creates a data set named r that contains a variety of case summary statistics (residuals, fitted probabilities). I used the contents procedure to show you the labels of the variables contained in r. The type3 option provides standard LR tests for individual effects (i.e. the difference in deviances for the models without and with the given effect). The output format is similar to that used in SAS's glm procedure.
proc genmod data=d1;
  model nres/ntot = age / dist=bin link=logit noscale obstats type3;
  ods output obstats = r;
proc contents data = r;
data e;
  merge d1 r;
  keep age phat Pred ResChi ResDev;
proc print data = e;


run;

                     The GENMOD Procedure

                      Model Information
    Data Set                      WORK.D1
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    nres
    Response Variable (Trials)    ntot

    Number of Observations Read   25
    Number of Observations Used   25
    Number of Events              2308
    Number of Trials              3918

           Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              23    26.7035      1.1610
    Scaled Deviance       23    26.7035      1.1610
    Pearson Chi-Square    23    21.8699      0.9509
    Scaled Pearson X2     23    21.8699      0.9509
    Log Likelihood              -819.6524

    Algorithm converged.

                  Analysis Of Parameter Estimates
                             Standard    Wald 95%            Chi-
    Parameter  DF  Estimate  Error       Confidence Limits   Square
    Intercept  1   -21.2264  0.7707      -22.7369  -19.7159  758.57
    age        1     1.6320  0.0590        1.5164    1.7475  766.32
    Scale      0     1.0000  0.0000        1.0000    1.0000
    NOTE: The scale parameter was held fixed.

[proc contents listing of the variables in data set r omitted.]

The ratios of D and X² divided by the df are both less than 1, indicating that there are no gross deficiencies with the model. Recall that the Analysis Of Parameter Estimates table gives p-values for testing the hypothesis that the regression coefficients are zero for each predictor in the model. The two predictors are LWBC and IAG, so the small p-values indicate that LWBC and IAG are important predictors of survival in this model. The LR test p-values in the LR table lead to the same conclusion. If either predictor were insignificant, I would consider refitting the model omitting the least significant effect, as in regression.

Given that the model fits reasonably well, a test of H0: β2 = 0 might be a primary interest here. This checks whether the regression lines are identical for the two AG levels, which is a test for whether AG affects the survival probability, after taking LWBC into account. This test is rejected at any of the usual significance levels, suggesting that the AG level affects the survival probability (assuming a very specific model).

A plot of the predicted survival probabilities as a function of LWBC is given with the SAS output, using different colors to identify the two IAG groups. The model indicates that the probability of surviving at least one year from the time of diagnosis is a decreasing function of LWBC. For a given LWBC the survival probability is greater for AG+ patients


than for AG- patients. This tendency is consistent with the observed proportions, which show little information about the exact form of the trend.

The estimated (ML) survival probabilities satisfy

    log{p̃/(1 − p̃)} = 5.54 − 1.11 LWBC + 2.52 IAG.

For AG- individuals with IAG = 0, this reduces to

    log{p̃/(1 − p̃)} = 5.54 − 1.11 LWBC,

or equivalently,

    p̃ = exp(5.54 − 1.11 LWBC) / {1 + exp(5.54 − 1.11 LWBC)}.

For AG+ individuals with IAG = 1,

    log{p̃/(1 − p̃)} = 5.54 − 1.11 LWBC + 2.52(1) = 8.06 − 1.11 LWBC,

or

    p̃ = exp(8.06 − 1.11 LWBC) / {1 + exp(8.06 − 1.11 LWBC)}.

The equivalence between logit and probability scales is that log(p/(1 − p)) = z implies p = exp(z)/{1 + exp(z)}, regardless of how many effects define z.

Using the logit scale, the difference between AG+ and AG- individuals in the estimated log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is the estimated IAG regression coefficient:

    (8.06 − 1.11 LWBC) − (5.54 − 1.11 LWBC) = 2.52.

Using properties of exponential functions, the odds that an AG+ patient lives at least one year are exp(2.52) = 12.42 times larger than the odds that an AG- patient lives at least one year, regardless of LWBC. This statistical summary is called the adjusted odds ratio for the IAG effect. A CI for the adjusted odds ratio is obtained by exponentiating the CI for the corresponding regression effect given in the parameter estimates table: exp(.38) = 1.47 to exp(4.66) = 105.35.
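The odds-ratio arithmetic above is easy to verify. A quick Python check using the rounded estimates quoted in the text (small discrepancies from the quoted 12.42 and 1.47 arise because the text exponentiates unrounded estimates):

```python
import math

b_iag = 2.52                 # estimated IAG coefficient (rounded)
ci_lo, ci_hi = 0.38, 4.66    # Wald 95% CI for the coefficient

# adjusted odds ratio and its CI: exponentiate the coefficient and the CI endpoints
odds_ratio = math.exp(b_iag)
ci = (math.exp(ci_lo), math.exp(ci_hi))
print(round(odds_ratio, 2), tuple(round(v, 2) for v in ci))
```

The wide interval (roughly 1.5 to 105) reflects how little information a small data set carries about an odds ratio, even when the effect is clearly nonzero.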

Similarly, the estimated odds ratio of .33 = exp(−1.11) for LWBC implies that the odds of surviving at least one year are reduced by a factor of 3 for each unit increase of LWBC.

Although the equal slopes model appears to fit well, a more general model might fit better. A natural generalization here would be to add an interaction, or product term, IAG ∗ LWBC to the model. The logistic model with an IAG effect and the IAG ∗ LWBC interaction is equivalent to fitting separate logistic regression lines to the two AG groups. This interaction model provides an easy way to test whether the slopes are equal across AG levels. I will note that the interaction term is not needed here.

Data from a Budworm Experiment

The purpose of this experiment was to assess the resistance of tobacco budworm larvae to cypermethrin. Batches of 20 pyrethroid-resistant moths (i.e. moths that have grown resistant to the pesticide pyrethroid) of each sex were exposed to a range of doses of cypermethrin two days after emergence from pupation. The number of moths which were either knocked down (uncoordinated movements) or dead was recorded 72 hours after treatment. We will model the probability of uncoordinated movement, or death, as a function of sex and dose.

The SAS program below inputs the number of responders for each combination of sex and dose. The sample size of 20 for each sex-dose combination is defined directly in the data step, rather than entered with the data. A plot of the observed proportion of responders against the log(dose) (log-dose is traditionally used in bioassay settings such as this) shows that increased doses are more effective, and that males tend to be more affected than females.

The first genmod proc fits a logistic model with main effects for sex (a factor) and log(dose) (a predictor) and a sex-by-log(dose) interaction. The model statement specifies the response in the number of events divided by sample size format as used in the menarche data. The class statement identifies sex as a factor.

You should probably read the ANCOVA (analysis of covariance) notes from my Data Analysis II website if you are not familiar with how factors are treated in regression models. Factors are typically used with qualitative effects such as sex, ethnicity and so on. The (main) effect for a factor with c levels (sex has c = 2 levels) is captured through c indicator (binary,

or 0-1) variables, one for each level of the factor. However, perfect collinearity results from including all c binary predictors in a model with an intercept. Without loss of generality, one of the binary predictors can be removed from the model. The group corresponding to the omitted binary predictor is called the baseline group. Interactions between a factor and a predictor (or another factor) correspond to adding one df effect for each product of the binary variables with the other feature in the interaction. If a factor has c > 2 levels, the SAS output will include estimates of the regression coefficients for the c − 1 single df effects, while the LR table will include a test of significance for the entire effect (i.e. a single c − 1 df test on the importance of the effect).

By default, SAS uses the category of the class variable with the highest alphanumeric value as the baseline group. Thus, in the budworm analysis, males (M) are the baseline group, and the sex effect corresponds to a single binary predictor with levels 1 for females and 0 for males. The interaction between ld and sex corresponds to a predictor that is the product of ld and the binary sex effect.

Letting pj be the success probability (i.e. probability of death or uncoordinated movement) for batch j, and defining sexj = 1 if batch j consists of females and 0 if batch j consists of males (baseline), the fitted model is equivalent to:

    log{pj/(1 − pj)} = β0 + β1 sexj + β2 ldj + β3 sexj ∗ ldj,

where ldj is the log-dose given to batch j. The model implies that the log-odds of a response is linearly related to log(dose), but that males and females have distinct lines. To see this, note that the model reduces to

    log{pj/(1 − pj)} = β0 + β2 ldj

for males (sexj = 0) and

    log{pj/(1 − pj)} = β0 + β1 + (β2 + β3) ldj

for females. Thus, β0 and β2 give the intercept and slope for the baseline group, males, and β1 and β3 give the difference in intercepts and slopes between females and males.
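The reduction to two separate lines can be checked mechanically. A small Python sketch with arbitrary placeholder coefficients (not the fitted budworm values):

```python
def logit_line(sex, ld, b0, b1, b2, b3):
    """Linear predictor of the interaction model: b0 + b1*sex + b2*ld + b3*sex*ld."""
    return b0 + b1 * sex + b2 * ld + b3 * sex * ld

b0, b1, b2, b3 = 1.0, -0.5, 2.0, 0.25   # arbitrary illustrative values

for ld in (0.0, 1.0, 2.0):
    male = logit_line(0, ld, b0, b1, b2, b3)
    female = logit_line(1, ld, b0, b1, b2, b3)
    # males follow b0 + b2*ld; females follow (b0 + b1) + (b2 + b3)*ld
    assert abs(male - (b0 + b2 * ld)) < 1e-12
    assert abs(female - ((b0 + b1) + (b2 + b3) * ld)) < 1e-12
```

Testing β3 = 0 therefore tests whether the two sexes share a common slope in log-dose.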

It is important to note that the choice of a baseline group for a factor impacts the interpretation of the parameters, but does not impact the interpretation of the model, provided the model follows the hierarchy principle. For example, how would the parameters and model change if we ran the same program but had defined sex to have 2 levels: M for males and W for females? We could have treated IAG as a factor in the leukemia analysis (even though it was coded numerically), but SAS would have treated IAG=1 as the baseline group. Treating IAG as a predictor (not a class variable) essentially leads to IAG=0 as the baseline group or level. You should read the online help for the class statement in genmod to learn how to change the default choice for the baseline group.

options ls=79 nodate;
data d1;
  input sex $ dose nresp @@;
  ntot = 20;
  phat = nresp/ntot;
  ld = log(dose);
cards;
m 1 1    m 2 4    m 4 9    m 8 13   m 16 18   m 32 20
f 1 0    f 2 2    f 4 6    f 8 10   f 16 12   f 32 16
;
symbol1 value=x interpol=none width=2;
symbol2 value=y interpol=join width=2;

proc gplot;
  plot phat*ld=sex;
run;

Model with unequal slopes
-------------------------
proc genmod data=d1;
  classes sex;
  model nresp/ntot = sex ld sex*ld / dist=bin link=logit noscale type3;
run;

                      Model Information
    Data Set                      WORK.D1
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    nresp
    Response Variable (Trials)    ntot

    Number of Observations Read   12
    Number of Observations Used   12
    Number of Events              111
    Number of Trials              240

             Class Level Information
    Class      Levels    Values
    sex        2         f m

           Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              8     4.9937       0.6242
    Scaled Deviance       8     4.9937       0.6242
    Pearson Chi-Square    8     3.5047       0.4381
    Scaled Pearson X2     8     3.5047       0.4381
    Log Likelihood              -105.7388

    Algorithm converged.

                   Analysis Of Parameter Estimates
                              Standard   Wald 95%            Chi-
    Parameter      DF  Estimate  Error   Confidence Limits   Square  Pr > ChiSq
    Intercept      1   -2.8186   0.5480  -3.8926  -1.7445    26.46
    sex        f   1   -0.1750   0.7783  -1.7004   1.3505     0.05
    sex        m   0    0.0000   0.0000   0.0000   0.0000      .
    ld             1    1.8163   0.3059   1.2166   2.4159    35.24
