Logistic Regression in SPSS
UCDHSC Center for Nursing Research (updated 5/20/06)
Start with “regression” in the “analyze” menu. In this case, we’re going to choose “binary logistic” for logistic regression (it’s “binary” because there are two possible outcomes for the dependent variable, e.g., yes or no).


This dialog box will appear. In this example, we’ll try to predict abstinence from alcohol and drugs up to 6 months after the end of treatment (“abstain”) as our dependent variable. “Abstain” is already the correct type of data for logistic regression: it’s coded as zero (meaning the person has relapsed) or one (meaning the person has remained abstinent). If the DV originally had more than two levels, we would have had to convert it to two levels first (using the “recode” command in the “transform” menu; a sketch of the syntax appears below). We will try to predict “abstain” from a group of predictor variables: years of education (educ, which is I/R-level), employment status (employst, which is N-level: full-time vs. part-time vs. unemployed), number of outpatient sessions of treatment completed (opcontac, which is I/R-level), discharge status (dcstatus, which is N-level: planned termination vs. dropped out), whether the person participated in Alcoholics Anonymous groups (aa, which is N-level: yes or no), and whether the person participated in Aftercare groups following the end of treatment (aftercar, which is N-level: yes or no).
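If you ever do need that recoding step, syntax along these lines would work. This is just a hypothetical sketch (the variable names outcome5 and abstain2 are made up for illustration; our “abstain” variable is already binary):

  * Collapse a hypothetical 5-level outcome into a binary DV:
    levels 1-3 become 0 (relapsed), levels 4-5 become 1 (abstinent).
  RECODE outcome5 (1 thru 3=0) (4 thru 5=1) INTO abstain2.
  EXECUTE.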

Enter your DV and IVs here, from the list above (don’t forget “aftercar”). Leave the drop-down menu set on “Enter” for now.

You will notice that some of the N-level variables are marked with “Cat,” which means the system has recognized them as categorical predictors, but some are not. Click on the Categorical button to manually tell the system about the other N-level predictors.


SPSS didn’t recognize some of our N-level predictors (employst and dcstatus) as being N-level, so we have to tell the program manually that these predictors are categorical. Do this by moving them from the list of “covariates” to the list of “categorical covariates.” (Don’t convert I/R variables to categorical variables if they don’t need to be, though; logistic regression can handle both types of predictors.) That’s all you need to do on this screen. Click “continue” to go on.


Now click the “Options” button, and select the checkbox for the “Hosmer-Lemeshow goodness-of-fit test.” This will allow you to see a p-value associated with the -2 log L statistic in the output.
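For reference, the whole setup so far can also be run from a syntax window. Here is a sketch of roughly what the Paste button produces for this dialog (subcommand details can vary a little between SPSS versions):

  LOGISTIC REGRESSION VARIABLES abstain
    /METHOD=ENTER educ employst opcontac dcstatus aa aftercar
    /CONTRAST (employst)=Indicator
    /CONTRAST (dcstatus)=Indicator
    /CONTRAST (aa)=Indicator
    /CONTRAST (aftercar)=Indicator
    /PRINT=GOODFIT
    /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

The /CONTRAST subcommands are what the “categorical covariates” list produces (indicator coding is the default), and /PRINT=GOODFIT requests the Hosmer-Lemeshow test we just checked off.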


Here’s the output from our first run of the data:

Logistic Regression

Case Processing Summary

  Unweighted Cases(a)                        N    Percent
  Selected Cases    Included in Analysis    44       55.0
                    Missing Cases           36       45.0
                    Total                   80      100.0
  Unselected Cases                           0         .0
  Total                                     80      100.0

  a. If weight is in effect, see classification table for the total number of cases.

This table just tells you how many people’s data were included in the analysis, and whether there was any missing data.

Dependent Variable Encoding

  Original Value    Internal Value
  0                 0
  100               1

This tells you how the system represented the DV as “yes” (1) or “no” (0). Originally, our DV was coded as 0 or 100, but SPSS recognized what we were trying to do.

Categorical Variables Codings

                                        Parameter coding
                      Frequency     (1)      (2)      (3)      (4)
  EMPLOYST    .00          2      1.000     .000     .000     .000
              1.00        36       .000    1.000     .000     .000
              2.00         1       .000     .000    1.000     .000
              3.00         3       .000     .000     .000    1.000
              6.00         2       .000     .000     .000     .000
  AFTERCAR    T           19      1.000
              F           25       .000
  AA          T           22      1.000
              F           22       .000
  DCSTATUS    .00         21      1.000
              1.00        23       .000

This table just shows you how logistic regression is able to handle N-level predictors (remember that the usual multiple regression procedure can’t take anything other than I/R-level predictors). It does this by dummy-coding the levels of each N-level predictor as 0 or 1: a predictor with k levels becomes k - 1 dummy variables, with the leftover level serving as the reference category. So the five-level employst becomes the four predictors EMPLOYST(1) through EMPLOYST(4), while a two-level predictor like aa needs only the single dummy AA(1), which is 1 for “T” and 0 for “F.”
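You never have to create these dummies yourself (the indicator coding happens internally), but as an illustration, the equivalent manual step for a two-level string predictor would be something like this hypothetical snippet:

  * What indicator coding does internally for a two-level
    predictor: the logical test (aa = 'T') evaluates to 1 or 0.
  COMPUTE aa1 = (aa = 'T').
  EXECUTE.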

Block 0: Beginning Block

“Block 0” shows you how well you can predict the DV with no predictors in the model. Basically, you just make the same prediction for everyone. Even though there are two choices (yes vs. no) for the DV, you can actually get better than 50% correct just by predicting the same thing for everyone. In this case, if we predict that everyone stayed abstinent up to 6 months, we’ll be correct for 30 of the 44 cases, or 68.2% of the time. In this example, 68.2% is the base rate: it tells you how often you’ll be correct without any statistical prediction at all (just guessing the same thing for everyone).

Classification Table(a,b)

                                    Predicted
                                  ABSTAIN       Percentage
  Observed                       0      100      Correct
  Step 0   ABSTAIN     0         0       14          .0
                       100       0       30       100.0
           Overall Percentage                      68.2

  a. Constant is included in the model.
  b. The cut value is .500

Variables in the Equation

                       B     S.E.    Wald    df    Sig.   Exp(B)
  Step 0  Constant   .762    .324   5.545     1    .019    2.143

This table just shows you what predictor variables were used, and has a p-value that tells you whether the model works better than chance. In this case, there are no predictors (everybody gets the same prediction: that they’re abstinent), but the predictions made are still significantly better than chance (which would give you only 50% correct). This goes to show that the baseline you’re comparing your results to isn’t necessarily a 50/50 split; it depends on the base rate, i.e., on whether the original scores were distributed evenly (50/50) or not. [more information from the printout has been omitted here]
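As a quick aside, here is where the constant’s numbers come from (everything below follows from the classification table above): the observed odds of abstinence are 30/14, or about 2.143, which is exactly the Exp(B) shown for the constant; and B is the natural log of those odds, ln(2.143) = .762.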


Block 1: Method = Enter

Now we move on to “Block 1,” which shows you what the predictions are like when the predictor variables are included. In this case, we used the “Enter” method for the order of entry, which puts in all of the predictors as a group, on the first step. (Right after this, we’ll use a stepwise method, which enters one variable at a time and picks out just the few predictors that work best.)

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
  Step 1  Step      22.876       9    .006
          Block     22.876       9    .006
          Model     22.876       9    .006

Here we have a p-value that tells us whether all of the variables together provide a good prediction or not. In this case, the chi-square tells us that the model is a good fit for the data. (Here, chi-square is a “goodness of fit” statistic, like we’re used to seeing; you want the p-value to be less than .05.)

Model Summary

          -2 Log        Cox & Snell    Nagelkerke
          likelihood    R Square       R Square
  Step 1    32.167         .405           .568

Next, we have an R2 statistic that (as usual) tells us what percentage of the variability in the DV can be accounted for by the predictors at this step. Because all of the predictors were put in at once (using the “Enter” method), this is a multiple R2 that tells you the percentage of variability accounted for by all of the predictors together. Use the Nagelkerke R2: the Cox & Snell R2 can never quite reach 1 (mathematically, its maximum is below 1), so the Nagelkerke version rescales it to run all the way from 0 to 1. Note that this table also gives you a -2 log L statistic. -2 log L is a “badness of fit” statistic, so we want it to be small to show a good fit between the model and the data. A good rule of thumb is that if -2 log L is less than 100, it’s a good fit, and if it’s less than 20, it’s a very good fit. The p-value associated with the -2 log L is what’s provided by the Hosmer and Lemeshow test in the next output:

Hosmer and Lemeshow Test

          Chi-square    df    Sig.
  Step 1    15.533       6    .016

Here we see a significant p-value (.016), which means the model is not quite a perfect fit for the data. The Hosmer and Lemeshow test should be nonsignificant to show a good fit (remember that we want the -2 log L to be small; a perfect fit would mean a likelihood of 1, i.e., a -2 log L of 0). So this result potentially contradicts the omnibus chi-square test results shown above, and it means that we might want to look at additional predictors, or remove some of these predictors, to see if we can improve the model.
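For reference, here is the math behind the two R2 statistics in the Model Summary above (a sketch, writing L0 for the likelihood of the model with no predictors, LM for the likelihood of the fitted model, and n for the sample size):

  Cox & Snell R2  =  1 - (L0 / LM)^(2/n)
  Nagelkerke R2   =  [Cox & Snell R2] / [1 - L0^(2/n)]

Because likelihoods here are at most 1, the Cox & Snell value can never exceed 1 - L0^(2/n), which is itself less than 1; the Nagelkerke statistic divides by that maximum so that a perfect model scores exactly 1.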

Classification Table(a)

                                    Predicted
                                  ABSTAIN       Percentage
  Observed                       0      100      Correct
  Step 1   ABSTAIN     0         9        5        64.3
                       100       4       26        86.7
           Overall Percentage                      79.5

  a. The cut value is .500

Here’s the classification table again. Note that by adding all of these predictors, we have increased our overall accuracy (or “hit rate”) to 79.5% (we now classify 9 + 26 = 35 of the 44 cases correctly). As mentioned in the lecture notes, you can sometimes see an increase in the hit rate (more accurate classification) without a corresponding increase in “goodness of fit” (as shown by the chi-square and -2 log L statistics); you can have an improvement in goodness of fit without an improvement in the hit rate; or (as seen here) both can improve at the same time.

Variables in the Equation

                        B          S.E.       Wald    df    Sig.      Exp(B)
  Step 1(a)
    EDUC               .042        .187       .050     1    .822       1.043
    EMPLOYST                                 3.590     4    .464
    EMPLOYST(1)       2.170   36904.882       .000     1   1.000       8.758
    EMPLOYST(2)     -19.752   26065.960       .000     1    .999        .000
    EMPLOYST(3)       -.904   47905.210       .000     1   1.000        .405
    EMPLOYST(4)     -16.548   26065.960       .000     1    .999        .000
    OPCONTAC          -.040        .026      2.318     1    .128        .961
    DCSTATUS(1)      -4.051       1.620      6.249     1    .012        .017
    AA(1)             2.492       1.092      5.205     1    .023      12.081
    AFTERCAR(1)        .749       1.241       .364     1    .546       2.115
    Constant         22.293   26065.960       .000     1    .999   4804090264.574

  a. Variable(s) entered on step 1: EDUC, EMPLOYST, OPCONTAC, DCSTATUS, AA, AFTERCAR.

Finally, the printout shows you which variables were entered at each step. In this case, all of the predictors were entered at once. Also note the way the table breaks each categorical variable out into its separate dummy-coded (“1” or “0”) levels in order to use them as predictors in the logistic regression.


Now let’s go back to the main dialog box for logistic regression, and try the stepwise procedure.

Everything that we set up before is still selected in the dialog box. Note how the variables that we told SPSS were categorical (N-level) now have “(Cat)” after their names in this list. Now, select “Forward: Conditional” instead of “Enter” in the drop-down menu. You’ll notice that the logistic regression drop-down menu in SPSS has different choices from the ones we saw in the same menu for multiple regression. I don’t know why; they’re essentially telling the program to do the same thing. In this case, “Forward: Conditional” is how SPSS refers to forward stepwise regression.
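In syntax, the only change from the earlier sketch is the /METHOD subcommand, which becomes FSTEP(COND) for “Forward: Conditional” (again, treat this as an approximation of what Paste produces rather than version-exact syntax):

  LOGISTIC REGRESSION VARIABLES abstain
    /METHOD=FSTEP(COND) educ employst opcontac dcstatus aa aftercar
    /CONTRAST (employst)=Indicator
    /CONTRAST (dcstatus)=Indicator
    /CONTRAST (aa)=Indicator
    /CONTRAST (aftercar)=Indicator
    /PRINT=GOODFIT
    /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

The PIN(.05) and POUT(.10) criteria are what the stepwise procedure uses to decide when variables enter and leave the model.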


Here’s the output from the stepwise version of the procedure:

Logistic Regression [steps skipped—you’ll still see the same “Block 0” results here]

Block 1: Method = Forward Stepwise (Conditional)

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
  Step 1  Step      12.771       1    .000
          Block     12.771       1    .000
          Model     12.771       1    .000

This table again shows you that the model is a good fit for the data, this time using only the single best predictor in step 1. However, notice that the -2 log L in the next table is somewhat larger, which tells you that it’s not quite as good a fit as when we used all of the predictors the last time we tried this. Also, the R2 is smaller, which is another way to look at goodness of fit.

Model Summary

          -2 Log        Cox & Snell    Nagelkerke
          likelihood    R Square       R Square
  Step 1    42.272         .252           .353

Hosmer and Lemeshow Test

          Chi-square    df    Sig.
  Step 1      .000       0      .
  Step 2     1.411       2    .494

However, the Hosmer and Lemeshow test is now nonsignificant, which means we’ve improved on our previous model: this one is more parsimonious (just the constant and a single predictor), and with fewer predictors in the model the test’s requirements for the -2 log L are less strict than they were before.

Classification Table(a)

                                    Predicted
                                  ABSTAIN       Percentage
  Observed                       0      100      Correct
  Step 1   ABSTAIN     0        12        2        85.7
                       100       9       21        70.0
           Overall Percentage                      75.0

  a. The cut value is .500

Here are the classification results. We got 75.0% accuracy with just the single best predictor (12 + 21 = 33 of the 44 cases classified correctly), compared to 68.2% with no predictors in the model, or 79.5% when we included all of them.

Variables in the Equation

                       B      S.E.     Wald    df    Sig.    Exp(B)
  Step 1(a)
    DCSTATUS(1)    -2.639     .861    9.385     1    .002      .071
    Constant        2.351     .740   10.096     1    .001    10.500

  a. Variable(s) entered on step 1: DCSTATUS.

Here are the actual variables used. In this case, “DCStatus” (discharge status; i.e., did the person complete treatment or drop out prematurely?) was the single best predictor of abstinence up to 6 months. (We know this because it’s the predictor selected by SPSS for “step 1” in the stepwise procedure.) This table also gives you a beta-weight for the predictor, and a beta-zero (constant) term. Referring to the notes to see how a logistic regression equation is set up, we can then write the equation for this logistic regression curve, using DCStatus as the predictor:

  Odds = e^(-2.639*DCStatus + 2.351)
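To see what the equation predicts, we can just plug in the two possible values of the dummy variable (which discharge group is coded 1 is shown in the Categorical Variables Codings table above). For a person with DCStatus(1) = 0, Odds = e^2.351, about 10.5 (the Exp(B) of the constant), which converts to a probability of abstinence of Odds/(1 + Odds) = 10.5/11.5, about .91. For a person with DCStatus(1) = 1, Odds = e^(-2.639 + 2.351) = e^(-.288), about .75, for a probability of .75/1.75, about .43. Only the second group falls below the .500 cut value, so everyone in that group is classified as relapsed, which matches the classification table above.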

Model if Term Removed(a)

                     Model Log     Change in -2          Sig. of
  Variable           Likelihood    Log Likelihood   df   the Change
  Step 1  DCSTATUS    -28.159         14.045         1     .000

  a. Based on conditional parameter estimates

This table shows you the change in the -2 log L that results from adding this predictor. In this case, adding the predictor reduced the -2 log L by 14.045 points (without the term, the model log likelihood would be -28.159, i.e., a -2 log L of 56.318, and 56.318 - 42.272 gives the 14.045 shown), which was a statistically significant change. Analyzing this change as each new predictor is added is how SPSS determines when to stop adding predictors in the stepwise procedure.

Variables not in the Equation

                                 Score    df    Sig.
  Step 1  Variables
            EDUC                  .251     1    .616
            EMPLOYST             1.419     4    .841
            EMPLOYST(1)           .209     1    .648
            EMPLOYST(2)          1.404     1    .236
            EMPLOYST(3)           .100     1    .752
            EMPLOYST(4)           .810     1    .368
            OPCONTAC              .099     1    .753
            AA(1)                3.025     1    .082
            AFTERCAR(1)          1.203     1    .273
          Overall Statistics     9.259     8    .321


The final list just shows you the variables that weren’t used during this step. In this case, SPSS stopped after 1 step, which means that adding more predictors wouldn’t give us a better fit between the model and the data. You might have stepwise procedures that go on for 2, 3, or even more steps. Sometimes all of the predictors still get added (just one at a time) in a stepwise procedure, which would tell you that you need all of them in order to get the best possible fit. One final note: You can also select “Backward: Conditional” from the drop-down menu to get a backward stepwise regression. As we discussed with multiple regression, this is an alternative procedure that tests the impact on the model of removing variables one at a time. Usually, it comes down to the same answer as the forward stepwise procedure.
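In the syntax sketch shown earlier, the backward version would just swap the method keyword, e.g.:

  /METHOD=BSTEP(COND) educ employst opcontac dcstatus aa aftercar

(Again, this is an approximation; the rest of the command stays the same.)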
