Model Selection in Minitab

Author: Chrystal Turner
Variable Selection

The basic procedure is obtained by Stat ⇒ Regression ⇒ Best Subsets …

Necessary input:

The response variable is specified in the Responses: box. The pool of explanatory variables from which subsets are to be evaluated is entered in the Free predictors: box. No other specifications are necessary, though some may be desired, as follows.

Commonly useful options:

If some explanatory variables are to be kept in all models, enter them in the Predictors in all models: box in the main Best Subsets Regression window, not in the Free predictors: box.

The default output lists the two best models of each size. To change this number (I prefer to see at least 4), click the Options button and enter the desired value in the Models of each size to print: box.

To restrict the minimum or maximum size of the models to be evaluated, click the Options button and enter the desired value(s) in the appropriate box under Free Predictor(s) in Each Model.
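To make concrete what a best-subsets table contains, here is a minimal Python sketch that fits every subset of a small predictor pool by ordinary least squares and keeps the best few models of each size. It is an illustration only, not Minitab's algorithm (brute-force enumeration is assumed here for simplicity). The data are synthetic, and the names fit_sse, best_subsets, and n_best are hypothetical.

```python
from itertools import combinations
from math import sqrt

import numpy as np


def fit_sse(X, y):
    """Error sum of squares of an OLS fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subsets(X, y, names, n_best=2):
    """Return the n_best models of each size, ranked by R-Sq within size."""
    n, k = X.shape
    ssto = float(((y - y.mean()) ** 2).sum())
    mse_full = fit_sse(X, y) / (n - k - 1)  # MSE of the model with all predictors
    table = []
    for size in range(1, k + 1):
        fits = []
        for cols in combinations(range(k), size):
            sse = fit_sse(X[:, list(cols)], y)
            p = size + 1  # number of parameters, including the intercept
            fits.append({
                "vars": tuple(names[c] for c in cols),
                "r2": 100 * (1 - sse / ssto),
                "r2_adj": 100 * (1 - (sse / (n - p)) / (ssto / (n - 1))),
                "cp": sse / mse_full - (n - 2 * p),
                "s": sqrt(sse / (n - p)),
            })
        fits.sort(key=lambda row: row["r2"], reverse=True)
        table.extend(fits[:n_best])
    return table


# Synthetic data (an assumption for illustration): only x1 and x3 matter.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 4))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

table = best_subsets(X, y, ["x1", "x2", "x3", "x4"])
for row in table:
    print(len(row["vars"]), round(row["r2"], 1), round(row["r2_adj"], 1),
          round(row["cp"], 1), round(row["s"], 5), row["vars"])
```

The columns printed correspond to Vars, R-Sq, R-Sq(adj), Mallows C-p, and S. With this definition of Cp, the model containing all predictors always has Cp equal to its own p.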

Output

On the next page is an excerpt of the output from the analysis specified in the preceding screen shots. Each line in the table represents a particular model; as requested, four models of each size are reported. The variables in a particular model are indicated by the Xs in the columns under the variable names (which read downwards). For example, the first line is for the best one-variable model, with only TotPersInc. This model has R² = 44.4, R²a = 44.3, Cp = 357.5, and s (= √MSE) = 0.37095. As another example, the best three-variable model is shown in the row starting 3 62.1 … This model contains the variables TotPop, PctPoverty, and PerCapInc.

By both the Cp and the R²a (equivalently, s) criteria, the best model in this example is the first 8-variable model, with all the explanatory variables except LandArea and PctBach. The best two 9-variable models, which add one or the other of the variables not in the preceding model, are nearly as good by both criteria.

Using the Cp criterion can be facilitated by plotting the Cp values against the number of variables in the model. This can be done by cutting and pasting the output table into another Minitab worksheet (or Excel, etc.). The reference line Cp = p can be added by creating two new columns, one with the values 0 and the maximum number of variables, and the other with the corresponding p (i.e., 1 and the maximum number of variables plus 1). After making the scatterplot of Cp, these columns can be used to add a Calculated line… to the scatterplot. The other criteria can be plotted similarly. Examples are on the next page.


Best Subsets Regression: logPhys versus LandArea, TotPop, ...

Response is logPhys

Vars  R-Sq  R-Sq(adj)  Mallows C-p        S
   1  44.4       44.3        357.5  0.37095
   1  42.0       41.8        392.1  0.37896
   1  24.9       24.8        635.2  0.43101
   1  20.3       20.1        701.7  0.44418
   2  54.8       54.6        210.3  0.33465
   2  54.7       54.5        212.6  0.33526
   2  54.0       53.8        222.5  0.33781
   2  52.5       52.3        243.3  0.34311
   3  62.1       61.8        108.6  0.30690
   3  60.3       60.1        134.0  0.31403
   …
   7  69.4       68.9         12.7  0.27709
   7  69.0       68.5         18.4  0.27890
   7  68.7       68.2         22.3  0.28013
   7  68.4       67.9         27.3  0.28169
   8  69.9       69.3          7.6  0.27517
   8  69.7       69.1         10.8  0.27618
   8  69.5       68.9         13.9  0.27716
   8  69.1       68.6         18.3  0.27857
   9  69.9       69.3          9.2  0.27536
   9  69.9       69.3          9.4  0.27540
   9  69.7       69.1         12.4  0.27636
   9  69.1       68.5         20.3  0.27889
  10  69.9       69.2         11.0  0.27560

[Predictor indicator columns, in order: LandArea, TotPop, Pct18to34, PctOver65, PctHSGrad, PctBach, PctPoverty, PctUnemp, PerCapInc, TotPersInc. An X in a column marks a variable included in that row's model; the row-by-row X pattern is not legible in this copy.]

[Figure: three scatterplots of Cp, s (= square-root of MSE), and adjusted R^2, each plotted against the number of variables (= p-1).]
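The comparison of Cp with p can also be done numerically. The following Python sketch takes the best model of each size from the output excerpt (sizes 4 to 6 are elided there and are left out), computes p = Vars + 1, and lists the gap Cp - p. It assumes no predictors are forced into the models; the names best_by_size and cp_summary are hypothetical.

```python
# Best model of each size from the output excerpt: (Vars, Mallows Cp).
# Sizes 4 to 6 are elided in the excerpt and are deliberately omitted.
best_by_size = [(1, 357.5), (2, 210.3), (3, 108.6),
                (7, 12.7), (8, 7.6), (9, 9.2), (10, 11.0)]


def cp_summary(rows):
    """Attach p = Vars + 1 (for the intercept) and the gap Cp - p to each row."""
    return [(v, v + 1, cp, round(cp - (v + 1), 1)) for v, cp in rows]


summary = cp_summary(best_by_size)
for n_vars, p, cp, gap in summary:
    print(f"Vars={n_vars:2d}  p={p:2d}  Cp={cp:6.1f}  Cp-p={gap:6.1f}")

# Model with the smallest Cp among those listed
smallest_cp = min(summary, key=lambda row: row[2])
print("Smallest Cp:", smallest_cp)
```

Among the listed rows the smallest Cp belongs to the 8-variable model, in agreement with the discussion of the output.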

When one or more variables are forced to be in all models, the Vars column in the best-subsets output specifies how many of the free predictors are included; it does not count the predictors forced into the models. In such a case the number of parameters, p, will be the sum of the number of free predictors included (as given in the Vars column), the number of variables forced in, plus 1 for the intercept. On the next page is an excerpt of the output resulting from moving TotPop from the Free predictors: box to the Predictors in all models: box (bolding added to show these points).


Best Subsets Regression: logPhys versus LandArea, Pct18to34, ...

Response is logPhys
The following variables are included in all models: TotPop

Vars  R-Sq  R-Sq(adj)  Mallows C-p        S
   1  54.8       54.6        210.3  0.33465
   1  54.7       54.5        212.6  0.33526
   2  62.1       61.8        108.6  0.30690
   2  58.2       57.9        164.8  0.32246

[Predictor indicator columns, in order: LandArea, Pct18to34, PctOver65, PctHSGrad, PctBach, PctPoverty, PctUnemp, PerCapInc, TotPersInc. An X in a column marks a free predictor included in that row's model; the row-by-row X pattern is not legible in this copy.]
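As a small worked illustration of the parameter count when predictors are forced in, the following Python sketch applies p = (free predictors) + (forced predictors) + 1 to the four rows of the output with TotPop forced into every model. The function name num_parameters is hypothetical.

```python
def num_parameters(free_vars, forced_vars):
    """p = free predictors (the Vars column) + forced predictors + 1 for the intercept."""
    return free_vars + forced_vars + 1


# Rows of the output with TotPop forced in: (Vars, Mallows Cp)
rows = [(1, 210.3), (1, 212.6), (2, 108.6), (2, 164.8)]
for n_free, cp in rows:
    p = num_parameters(n_free, forced_vars=1)
    print(f"Vars={n_free}  p={p}  Cp={cp}")
```

So a row with Vars = 2 in this output describes a three-predictor model with p = 4, which is the value to compare with Cp.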

Other Criteria: PRESS

The only other model-selection criterion available in Minitab is PRESSp. It can be obtained for only one model at a time, using the usual regression procedure, Stat ⇒ Regression ⇒ Regression …

Click on the Options button, then check the box for PRESS and predicted R-square under the Display part of the options window. (Predicted R² is the fraction of the variation in the response variable “explained” by the leave-one-out predictions used to calculate PRESS. These two statistics are essentially redundant, but the predicted R² can be compared directly to the regular R² for the model, to judge whether the latter accurately reflects the predictive value of the model or is inflated by over-fitting.)

Output

The PRESS and predicted R² statistics are printed just below the regular R²:

Regression Analysis: logPhys versus TotPop, Pct18to34, ...

The regression equation is
logPhys = - 0.988 + 0.000001 TotPop + 0.0234 Pct18to34 + 0.0270 PctOver65
          + 0.00833 PctHSGrad + 0.0483 PctPoverty - 0.0255 PctUnemp
          + 0.000086 PerCapInc - 0.000046 TotPersInc

Predictor          Coef     SE Coef      T      P
…
TotPersInc  -0.00004566  0.00000922  -4.95  0.000

S = 0.275165   R-Sq = 69.9%   R-Sq(adj) = 69.3%
PRESS = 44.2863   R-Sq(pred) = 59.14%

Analysis of Variance

Source           DF        SS      MS       F      P
Regression        8   75.7403  9.4675  125.04  0.000
Residual Error  431   32.6336  0.0757
Total           439  108.3739

In this example the PRESS statistic (44.2863) is reasonably close to SSE (32.6336), and the predicted R² (59.14%) is reasonably close to the regular R² (69.9%). These findings indicate that the model is at least not substantially overfit.
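The relationship between PRESS, leave-one-out prediction, and predicted R² can be sketched in Python. The example below uses a single predictor and a tiny made-up data set (both are assumptions for illustration); it computes PRESS once with the leverage shortcut, sum of (e_i / (1 - h_ii))², and once by literally refitting with each observation left out, then applies predicted R² = 1 - PRESS/SSTO to the PRESS and total sum of squares printed in the output. The function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def press_shortcut(x, y):
    """PRESS from a single fit, using residuals and leverages."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1 / n + (xi - xbar) ** 2 / sxx  # leverage of this observation
        total += ((yi - b0 - b1 * xi) / (1 - h)) ** 2
    return total


def press_loo(x, y):
    """PRESS by refitting the line with each observation left out in turn."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - b0 - b1 * x[i]) ** 2
    return total


def predicted_r2(press, ssto):
    """Predicted R-squared as a fraction: 1 - PRESS / SSTO."""
    return 1 - press / ssto


# Made-up data, only to show that the two PRESS calculations agree
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
print(press_shortcut(x, y), press_loo(x, y))

# PRESS and Total SS taken from the Minitab output
print(round(100 * predicted_r2(44.2863, 108.3739), 2))  # 59.14
```

The last line reproduces the R-Sq(pred) value printed by Minitab from the PRESS and Total SS figures in the same output.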


Validation

Internal validation, using PRESS and predicted R², is done as explained above.

MSPR with New Data

The predicted values for the observations in the validation data set, based on the selected model fit to the model-building data set, can be calculated in two ways. First, the fitted model equation can be entered into the Calculator, creating a column of predicted values in a worksheet containing only the validation data set. Alternatively, the validation data can be added (as new columns) to the model-building data set, and Prediction intervals for new observations: can be requested in the Options window of the regression procedure, with the regression being done on the original data set. For example, if nTotPop is the Total Population variable in the new data set, and so on for the other variables, the following would be used:

The predicted values will be stored in a new column, named PFIT1. Once the column of predicted values is created, the MSPR can be calculated in the Calculator, using an expression like SSQ(nLgPhys - PFIT1) / N(PFIT1), where nLgPhys is the column of observed values of the response variable in the new data set, PFIT1 is the column of predicted values, SSQ is a function computing the sum of squared values for an entire column (or in this case, of squared differences between two columns), and N is a function returning the number of non-missing values in a column. The result will be a column with only one entry, which is the MSPR.
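For readers who want to check the Calculator result outside Minitab, the same quantity can be sketched in Python. The function mspr and the numbers below are hypothetical; missing values are represented by None, and, as a simplifying assumption, only rows where both the observed and the predicted value are present are counted.

```python
def mspr(observed, predicted):
    """Mean squared prediction error over rows where both values are present."""
    pairs = [(o, p) for o, p in zip(observed, predicted)
             if o is not None and p is not None]
    return sum((o - p) ** 2 for o, p in pairs) / len(pairs)


# Hypothetical observed responses and predictions for a validation set
n_lg_phys = [1.0, 2.0, 3.0, 4.0]
pfit1 = [1.5, 2.0, 2.5, 5.0]
print(mspr(n_lg_phys, pfit1))  # 0.375
```

The numerator plays the role of SSQ(nLgPhys - PFIT1) and the denominator the role of N(PFIT1).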


Data Splitting

Various methods can be used to split a data set into model-building and validation subsets. A relatively easy method is to first create a column distinguishing which observations are to go into which subset, then use Data ⇒ Split Worksheet … to divide the data set based on the column just created. For example, to separate odd and even observations, the calculation shown in the window on the next page will create a column (c21) containing 1s for all observations with odd IDNum and 0s for all even observations. (The function MOD(x,y) returns the remainder after dividing x by y, and the entire expression is a logical comparison, evaluating to 1 (true) when the equality is true, and 0 (false) otherwise.) This column can then be used to split the data, as in the following window (invoked by Data ⇒ Split Worksheet …).
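The odd/even indicator column can be mimicked in a few lines of Python. The Calculator expression itself is not reproduced in this text, so the form MOD(IDNum, 2) = 1 is an assumption based on the description; the ten-row id_num list is hypothetical.

```python
# Hypothetical IDNum column with ten observations
id_num = list(range(1, 11))

# Analogue of the assumed Calculator expression MOD(IDNum, 2) = 1:
# 1 for odd IDNum, 0 for even IDNum (this plays the role of column c21)
c21 = [int(i % 2 == 1) for i in id_num]

# Analogue of Split Worksheet: one subset per value of the indicator
model_building = [i for i, flag in zip(id_num, c21) if flag == 1]
validation = [i for i, flag in zip(id_num, c21) if flag == 0]

print(c21)
print(model_building, validation)
```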

To simply select a random subset of observations, an easier method uses Calc ⇒ Random Data ⇒ Sample From Columns…
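A random split can be sketched in Python as follows. This is only an analogue of sampling rows without replacement, not a description of what the Minitab dialog does internally; the function random_split, the seed, and the choice of 220 validation rows are assumptions (440 is the number of observations implied by the Total DF of 439 in the regression output).

```python
import random


def random_split(n_rows, n_validation, seed=0):
    """Pick n_validation row indices at random; the rest form the model-building set."""
    rng = random.Random(seed)
    validation = sorted(rng.sample(range(n_rows), n_validation))
    chosen = set(validation)
    building = [i for i in range(n_rows) if i not in chosen]
    return building, validation


building, validation = random_split(n_rows=440, n_validation=220)
print(len(building), len(validation))
```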

Using either of the preceding methods to create separate subsets, the MSPR can be calculated as described under MSPR with New Data above, either using the Calculator to compute predicted values for the validation data set, or copying the validation data (as new columns) into the model-building subset and using the Prediction intervals for new observations: method.

Yet another approach to data splitting is to not actually divide the data set, but to separate the response variable into different columns for the model-building and validation subsets. For instance, creating a new column as logPhys / c21 (with c21 the 1/0 column created above) will produce missing values for all even-numbered observations (for which c21 = 0). This new column, of observed values for the model-building data subset and missing values for the validation data subset, can then be used as the response variable in the regression procedure, and “fits” can be stored (click the Storage button in the main regression window, and check the box for Fits). “Fits” will be stored for all observations, including those with missing values of the response variable. The “fits” for the validation observations can then be separated from those for the model-building observations by copying the “fits” column to another column but selecting only observations in which the subsetting column (c21 here) equals 0. MSPR can then be calculated using this new column of predicted values for the validation observations.
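The missing-response approach can be sketched in Python with a single predictor. The data are made up, None stands in for Minitab's missing value, and fit_line is a hypothetical helper; the point is only the sequence of steps: blank out the validation responses, fit on the remaining rows, compute fits for every row, and average the squared errors over the validation rows.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


# Made-up data: predictor, response, and the 1/0 subsetting column (like c21)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.0, 5.5, 7.0, 8.5, 11.0, 13.5, 15.0, 16.5]
c21 = [1, 0, 1, 0, 1, 0, 1, 0]

# Analogue of logPhys / c21: response kept where c21 = 1, missing where c21 = 0
y_build = [yi if flag == 1 else None for yi, flag in zip(y, c21)]

# The regression uses only the rows with a non-missing response
x_used = [xi for xi, yi in zip(x, y_build) if yi is not None]
y_used = [yi for yi in y_build if yi is not None]
b0, b1 = fit_line(x_used, y_used)

# "Fits" are computed for every row, including the validation rows
fits = [b0 + b1 * xi for xi in x]

# Keep only the validation rows (c21 = 0) and compute MSPR
sq_errors = [(yi - fi) ** 2 for yi, fi, flag in zip(y, fits, c21) if flag == 0]
mspr = sum(sq_errors) / len(sq_errors)
print(b0, b1, mspr)
```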
