STA6938 - Logistic Regression Model
Topic 4 - Model Assessment

Outlines:
1. Goodness-of-Fit
2. Prediction Accuracy
3. Model Diagnostics


1. Goodness-of-Fit

What is a Goodness-of-Fit statistic? In many statistical inference procedures we have used the Chi-square distribution based on the likelihood ratio, Score, or Wald test statistics. The global Chi-square test addresses the question "Is this model better than nothing?" The answer "yes" suggests acceptance of the model. After we arrive at the final model, we may want to ask the question "Is there a better model than this one?" Now the answer "yes" will lead to rejection of the model. A goodness-of-fit statistic is a test statistic that addresses this second question.

Let's consider a general situation: we have n i.i.d. observations (y_i, x_1i, x_2i, ..., x_pi), i = 1, 2, ..., n.

y_i : binary outcome variable
x_i = (x_1i, x_2i, ..., x_pi) : covariate vector

Covariate Pattern: a set of values describing one possible combination of the p covariates in the data.

Saturated Model: the maximal (perfect) model allowed to fit the data.

For example, if the data have J covariate patterns, then in the jth pattern (j = 1, 2, ..., J), suppose we have m_j observations in total and among them y_j events. The probability of event for this particular covariate pattern is then simply estimated by

    π̂_j = y_j / m_j.

a. Pearson Chi-Square Statistic

For the jth covariate pattern, let

    π̂_j = e^{ĝ(x_j)} / (1 + e^{ĝ(x_j)})

denote the estimated probability of event based on the logistic regression model. Then the estimated number of events associated with this covariate pattern is

    ŷ_j = m_j π̂_j = m_j e^{ĝ(x_j)} / (1 + e^{ĝ(x_j)}).

Pearson Residual (a weighted residual):

    r(y_j, π̂_j) = (y_j − ŷ_j) / sqrt( m_j π̂_j (1 − π̂_j) ) = (y_j − m_j π̂_j) / sqrt( m_j π̂_j (1 − π̂_j) )

Pearson Chi-Square Statistic:

    X² = Σ_{j=1}^{J} r(y_j, π̂_j)²
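As an illustration, the Pearson residuals and X² can be computed directly from the (y_j, m_j, π̂_j) triples. The following Python sketch is not part of the original notes; the function name and the toy data are hypothetical:

```python
import math

def pearson_chi_square(y, m, pihat):
    """Pearson chi-square statistic over J covariate patterns.

    y[j]     : observed number of events in pattern j
    m[j]     : number of observations in pattern j
    pihat[j] : model-estimated event probability for pattern j
    """
    x2 = 0.0
    for yj, mj, pj in zip(y, m, pihat):
        # Pearson residual: (observed - expected) / sqrt(binomial variance)
        r = (yj - mj * pj) / math.sqrt(mj * pj * (1 - pj))
        x2 += r ** 2
    return x2

# toy illustration with made-up data for 3 covariate patterns
X2 = pearson_chi_square([4, 8, 12], [20, 20, 20], [0.25, 0.40, 0.55])
```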


b. Deviance

Deviance Residual:

    d(y_j, π̂_j) = ± { 2 [ y_j ln( y_j / (m_j π̂_j) ) + (m_j − y_j) ln( (m_j − y_j) / (m_j (1 − π̂_j)) ) ] }^{1/2}

where the sign, + or −, follows the sign of y_j − m_j π̂_j.

Deviance:

    D = Σ_{j=1}^{J} d(y_j, π̂_j)²
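The deviance residuals and D can be sketched in Python in the same style (illustrative code, not from the notes; patterns with y_j = 0 or y_j = m_j simply drop the corresponding log term):

```python
import math

def deviance(y, m, pihat):
    """Deviance D = sum of squared deviance residuals over J covariate patterns."""
    d_total = 0.0
    for yj, mj, pj in zip(y, m, pihat):
        term = 0.0
        if yj > 0:
            term += yj * math.log(yj / (mj * pj))
        if yj < mj:
            term += (mj - yj) * math.log((mj - yj) / (mj * (1 - pj)))
        # sign of the residual follows the sign of y_j - m_j * pihat_j
        d = math.copysign(math.sqrt(max(2 * term, 0.0)), yj - mj * pj)
        d_total += d ** 2
    return d_total
```

A quick sanity check: under a perfect fit (π̂_j = y_j / m_j for every pattern) the deviance is zero.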

Asymptotic Results: If J is fixed, then as n → ∞ (m_j → ∞, j = 1, 2, ..., J), we have

    X² →_d χ²_{J−p−1}
    D  →_d χ²_{J−p−1}

Note: Both X² and D can be used as summary measures of goodness-of-fit only if the number of covariate patterns is fixed. What if J ≈ n? This happens when the covariates involve numerical variables. The Chi-square distributions described above are then no longer valid.


c. The Hosmer-Lemeshow Test

The Hosmer-Lemeshow test is a practical procedure to assess goodness-of-fit when the number of covariate patterns increases with the sample size.

Properties of the Hosmer-Lemeshow test:
• Easy to compute
• Performs soundly in simulation studies
• Has not been rigorously proved yet

Hosmer-Lemeshow Test Procedure:
For each subject, estimate the probability of the event. Sort the estimated probabilities in ascending order. Group the data according to the percentiles of the estimated probabilities:
• Use g = 10 groups
• The first group contains the n_1 = n/10 subjects whose estimated probabilities are less than or equal to the first decile of the estimated probabilities
• The qth group contains the subjects whose estimated probabilities are between the (q−1)th and qth deciles of the estimated probabilities

For the kth group:
• Let c_k denote the number of covariate patterns in the kth decile.
• Count the number of events (responses) among the c_k covariate patterns:

    o_k = Σ_{j=1}^{c_k} y_j


• Compute the average estimated probability in the kth decile:

    π̄_k = ( Σ_{j=1}^{c_k} m_j π̂_j ) / n_k

Construct the Hosmer-Lemeshow test statistic:

    Ĉ = Σ_{k=1}^{g} (o_k − n_k π̄_k)² / ( n_k π̄_k (1 − π̄_k) )

Empirically, Ĉ ~_d χ²_{g−2} as n → ∞.
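The procedure above can be sketched in Python for individual-level data, where each subject forms its own covariate pattern (so m_j = 1 and the average estimated probability in a decile is just the mean of the p̂'s). The function is illustrative, not the SAS implementation:

```python
def hosmer_lemeshow(y, phat, g=10):
    """Hosmer-Lemeshow statistic: group subjects by quantiles of the
    estimated probability, then compare observed and expected events.

    y    : list of 0/1 outcomes
    phat : list of estimated event probabilities
    """
    pairs = sorted(zip(phat, y))                 # ascending by estimated probability
    n = len(pairs)
    c_hat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        nk = len(group)
        ok = sum(yi for _, yi in group)          # observed events in group k
        pbar = sum(pi for pi, _ in group) / nk   # average estimated probability
        c_hat += (ok - nk * pbar) ** 2 / (nk * pbar * (1 - pbar))
    return c_hat  # compare to a chi-square with g - 2 degrees of freedom
```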

d. Generalized R²-value

The R²-value, referred to as the coefficient of determination in ordinary linear regression, is a widely accepted measure of predictive power. How do we generalize this measure to logistic regression?

Generalized R²-value:

    R² = 1 − exp( −L²/n )

where L² is the likelihood ratio chi-square test statistic for testing the global significance of the logistic regression model.

Why do we call this measure the generalized R²-value? For the ordinary linear regression model, this value is identical to the R² based on the F-statistic.
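A one-line Python sketch of this formula (illustrative; the value 48.07 used below is not from the notes but simply back-computed from the R-Square of 0.0802 that SAS reports for n = 575 subjects in Example 4.1):

```python
import math

def generalized_r2(l2, n):
    """Generalized R-square, R^2 = 1 - exp(-L^2 / n), where l2 is the
    likelihood-ratio chi-square for the global significance of the model
    and n is the sample size."""
    return 1.0 - math.exp(-l2 / n)

# hypothetical back-computed illustration: L^2 of about 48.07 with n = 575
# reproduces the reported R-Square of roughly 0.0802
r2 = generalized_r2(48.07, 575)
```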


Example 4.1 Evaluate the model fitted in Topic 3 for the UIS data set.

proc logistic data=LOGISTIC.UIS Descending;
  class IVHX /PARAM=REFERENCE Descending;
  model DFREE=IVHX AGE NDRUGTX TREAT RACE SITE RACE*SITE AGE*NDRUGTX /lackfit rsq;
run;

The LOGISTIC Procedure

R-Square                0.0802
Max-rescaled R-Square   0.1180

Partition for the Hosmer and Lemeshow Test

                  DFREE = 1             DFREE = 0
Group   Total   Observed  Expected   Observed  Expected
  1       58        4        4.05        54      53.95
  2       58        8        7.11        50      50.89
  3       59        8        9.62        51      49.38
  4       58       14       11.44        44      46.56
  5       58       13       13.26        45      44.74
  6       58       11       15.38        47      42.62
  7       58       18       17.50        40      40.50
  8       58       20       19.86        38      38.14
  9       59       27       23.60        32      35.40
 10       51       24       25.17        27      25.83

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    3.8251     8        0.8725

The result shows that this model does not appear to present a lack of fit.


Let's consider a smaller model:

proc logistic data=LOGISTIC.UIS Descending;
  class IVHX /PARAM=REFERENCE Descending;
  model DFREE=IVHX AGE NDRUGTX TREAT RACE SITE /lackfit rsq;
run;

The LOGISTIC Procedure

R-Square                0.0582
Max-rescaled R-Square   0.0857

Partition for the Hosmer and Lemeshow Test

                  DFREE = 1             DFREE = 0
Group   Total   Observed  Expected   Observed  Expected
  1       58        6        5.66        52      52.34
  2       58        9        8.52        49      49.48
  3       58        6       10.11        52      47.89
  4       60       10       12.13        50      47.87
  5       59       15       13.69        44      45.31
  6       60       16       15.91        44      44.09
  7       58       22       17.16        36      40.84
  8       59       22       19.48        37      39.52
  9       58       20       22.41        38      35.59
 10       47       21       21.93        26      25.07

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    5.6350     8        0.6880

This reduced model does not appear to present a lack of fit either. However, our model fitting procedure clearly indicated that the model with interaction terms is significantly better. What's wrong? Nothing is wrong with the model fitting; rather, this is evidence that the Hosmer-Lemeshow test procedure has low power.


2. Prediction Accuracy

Model fitting leads to the estimated probability of event, not directly to the outcome. To obtain the derived dichotomous variable we must define a cut-point, c, and compare each estimated probability to c. The prediction of the outcome is given by

    ŷ_i = 1[p̂_i ≥ c] = 1 if p̂_i ≥ c, 0 otherwise.

In SAS, the default cut-point is c = 0.5.

Classification Table

To compare the predicted events with the actually observed events, we create a table. For binary logistic regression, this is a 2 by 2 table of predicted by observed responses.

Example 4.2 Classification table generated from the model fitted in Example 4.1.

proc logistic data=LOGISTIC.UIS Descending;
  class IVHX /PARAM=REFERENCE Descending;
  model DFREE=IVHX AGE NDRUGTX TREAT RACE SITE RACE*SITE AGE*NDRUGTX;
  output out=probs predicted=phat;
run;
proc print data=probs;
  title 'Drug-Free Data';
run;
data classification1;
  set probs;
  predict=(phat>=0.5);
run;
proc freq data=classification1;
  tables dfree*predict/ norow nocol nopercent;
run;


Drug-Free Data

The FREQ Procedure

Table of DFREE by predict
DFREE (Remained Drug Free for 12 months)

Frequency |    0  |    1  | Total
----------+-------+-------+------
    0     |  423  |    5  |  428
----------+-------+-------+------
    1     |  135  |   12  |  147
----------+-------+-------+------
Total        558      17     575

Prediction accuracy = (423 + 12)/575 = 75.7%

Is the prediction accuracy a good measure of model fit? It may look like it, but it is not.

Reasons:
• Scale problem: the model outcome is measured on a continuum, but the predicted outcome is binary.
• Sensitivity issue: the prediction accuracy depends heavily on the distribution of the estimated probabilities rather than on the superiority of one model over another.

For example, suppose 100 subjects each had an estimated probability of 0.51. Obviously, all of these subjects will be classified as having the event. If the model is well-calibrated (the estimated probabilities reflect the true event experience in the data), then among those 100 subjects there are actually about 49 who do not have the event, so the misclassification rate is 49%, even though the model is perfect for these 100 subjects.


Conclusion: One cannot compare models based on 2 by 2 classification tables, since these measures are completely confounded by the distribution of the estimated probabilities in the samples from which they are generated.

Various Measures of Prediction Accuracy

Though the prediction accuracy is not a good measure of the goodness of model fit, it is certainly a useful measure of the predictive power of the fitted model.

a. Sensitivity and Specificity

Sensitivity: the ratio of the number of correctly classified events to the total number of events.
Specificity: the ratio of the number of correctly classified nonevents to the total number of nonevents.

For Example 4.2,

    sensitivity = 12/147 = 8.16%
    specificity = 423/428 = 98.83%

Note: Both sensitivity and specificity are very sensitive to the cut-point used in classification. One cannot improve both by changing the cut-point.

    Prediction accuracy = p × sensitivity + (1 − p) × specificity,

where p is the probability of event.
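Using the counts from the classification table in Example 4.2, these measures can be computed as follows (an illustrative Python sketch; the function name and argument labels are made up):

```python
def classification_measures(tp, fn, tn, fp):
    """Sensitivity, specificity, and overall accuracy from a 2x2
    classification table.

    tp : events correctly predicted as events
    fn : events predicted as nonevents
    tn : nonevents correctly predicted as nonevents
    fp : nonevents predicted as events
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# counts from the classification table in Example 4.2
sens, spec, acc = classification_measures(tp=12, fn=135, tn=423, fp=5)
# sens is about 8.16%, spec about 98.83%, acc about 75.7%, matching the text
```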

b. ROC Curve

The ROC (receiver operating characteristic) curve originates from signal detection theory, where it describes how a receiver detects the existence of a signal in the presence of noise. It plots the probability of detecting a true signal (sensitivity) against the probability of detecting a false signal (1 − specificity) over the entire range of possible cut-points.

The area under the ROC curve, which ranges from zero to one, provides a measure of the model's ability to discriminate between those subjects who experience the event and those who do not.

Example 4.3 Generate the ROC curve for the model fitted in Example 4.1.

proc logistic data=LOGISTIC.UIS Descending;
  class IVHX /PARAM=REFERENCE Descending;
  model DFREE=IVHX AGE NDRUGTX TREAT RACE SITE RACE*SITE AGE*NDRUGTX /outroc=rocdata;
run;
proc print data=rocdata;
run;
goptions cback=white colors=(black) border;
axis1 length=2.5in;
axis2 order=(0 to 1 by 0.1) length=2.5in;
proc gplot data=rocdata;
  symbol1 i=join v=none;
  title1;
  title2 'UIS ROC CURVE';
  plot _sensit_*_1mspec_ /haxis=axis1 vaxis=axis2;
run;


The area under this curve provides a measure of discrimination: it is the likelihood that a subject who remains drug free will have a higher Pr(y = 1) than a subject who returns to drug use.

General Rule:
• ROC = 0.5: no discrimination (just like flipping a coin)
• 0.7 ≤ ROC
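The probability interpretation above suggests a direct, if naive (O(n²) pairs), way to compute the area under the ROC curve without plotting anything: count the proportion of (event, nonevent) pairs in which the event subject has the higher estimated probability, with ties counted as one half. This Python sketch is illustrative and is not how SAS computes the statistic:

```python
def auc_mann_whitney(y, phat):
    """Area under the ROC curve via its pairwise interpretation:
    the probability that a randomly chosen event subject has a higher
    estimated probability than a randomly chosen nonevent subject."""
    events = [p for p, yi in zip(phat, y) if yi == 1]
    nonevents = [p for p, yi in zip(phat, y) if yi == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0      # concordant pair
            elif pe == pn:
                concordant += 0.5      # tied pair counts one half
    return concordant / (len(events) * len(nonevents))
```

For a model that perfectly separates events from nonevents this returns 1.0; for probabilities unrelated to the outcome it hovers around 0.5, matching the "no discrimination" benchmark above.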