Week 12 Lecture: Remedial Measures (Chapter 11)

Weighted Least Squares

We discussed in earlier lectures how transformations can be used to reduce or eliminate heteroscedasticity (non-constant variance). Transformations, however, can alter the regression relationship between the dependent and independent variables. A technique called weighted least squares (WLS) can be used instead of transformations to correct for unequal variances while maintaining the regression relationship between the dependent and independent variables. This technique recognizes that different errors can have different variances (e.g., young trees usually vary little in height, while old trees usually vary considerably in height). If heteroscedasticity is present and you fit the regular MLR model, your estimators will still be unbiased, but they will no longer have minimum variance. To achieve minimum variance, you need to account for the unequal variances. Kutner et al. describe how to do this under three conditions: 1) the error variances are known, 2) only the relative magnitudes of the error variances are known, and 3) the error variances are unknown. They correctly point out that the first two situations rarely occur, so I will not discuss them in detail (see pages 422-424 for more details). They describe a process using iteratively reweighted least squares (IRLS) to estimate the weights to use in a regression model with heteroscedasticity (we will revisit IRLS later in this lecture to deal with outliers). IRLS is certainly one technique that works well, and I refer you to their discussion (page 424). In this lecture, I will discuss another procedure for estimating the regression coefficients when the exact error variances are unknown. First, though, I will describe the weighted MLR model.


The Weighted Linear Regression Model

In matrix notation, recall the linear regression model

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and, for the weighted model, $\sigma^2(\boldsymbol{\varepsilon}) = \mathbf{W}^{-1}$.

We will assume that the variances in this model are unequal; thus, heteroscedasticity is present in our regression model. The variance-covariance matrix of the errors bears some explanation. Recall that the variance-covariance matrix for the ordinary least squares model (constant error variance) is:

$$\sigma^2(\boldsymbol{\varepsilon}) =
\begin{bmatrix}
\sigma^2 & 0 & \cdots & 0\\
0 & \sigma^2 & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & \sigma^2
\end{bmatrix} = \sigma^2\mathbf{I}.$$

When heteroscedasticity is present, this matrix is more complex because the variances are different for each error term:

$$\sigma^2(\boldsymbol{\varepsilon}) =
\begin{bmatrix}
\sigma_1^2 & 0 & \cdots & 0\\
0 & \sigma_2^2 & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & \sigma_n^2
\end{bmatrix}.$$

In weighted least squares, we use a weight matrix $\mathbf{W}$ to correct the heteroscedasticity:

$$\mathbf{W} =
\begin{bmatrix}
w_1 & 0 & \cdots & 0\\
0 & w_2 & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & w_n
\end{bmatrix},$$


where $w_i = 1/\sigma_i^2$; that is, each weight is the reciprocal of (inversely proportional to) the corresponding error variance. $\mathbf{W}^{-1}$ is then simply the variance-covariance matrix of the error terms for the regression model with heteroscedasticity.

Now, we will define a new diagonal matrix with the square roots of the $w_i$ along the main diagonal:

$$\mathbf{W}^{1/2} =
\begin{bmatrix}
\sqrt{w_1} & 0 & \cdots & 0\\
0 & \sqrt{w_2} & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & \sqrt{w_n}
\end{bmatrix}.$$

This matrix is symmetric, $\mathbf{W}^{1/2}\mathbf{W}^{1/2} = \mathbf{W}$, and $\mathbf{W}^{-1/2}\mathbf{W}^{-1/2} = \mathbf{W}^{-1}$ (where $\mathbf{W}^{-1/2}$ is the inverse of $\mathbf{W}^{1/2}$).

When we premultiply our MLR model with heteroscedasticity by $\mathbf{W}^{1/2}$, we obtain:

$$\mathbf{Y}_w = \mathbf{X}_w\boldsymbol{\beta} + \boldsymbol{\varepsilon}_w,$$

where $\mathbf{Y}_w = \mathbf{W}^{1/2}\mathbf{Y}$, $\mathbf{X}_w = \mathbf{W}^{1/2}\mathbf{X}$, and $\boldsymbol{\varepsilon}_w = \mathbf{W}^{1/2}\boldsymbol{\varepsilon}$.

Now we can find:

$$E(\boldsymbol{\varepsilon}_w) = \mathbf{W}^{1/2}E(\boldsymbol{\varepsilon}) = \mathbf{W}^{1/2}\mathbf{0} = \mathbf{0}$$

(see equation 5.45 on page 196), and

$$\sigma^2(\boldsymbol{\varepsilon}_w) = \mathbf{W}^{1/2}\sigma^2(\boldsymbol{\varepsilon})\mathbf{W}^{1/2} = \mathbf{W}^{1/2}\mathbf{W}^{-1}\mathbf{W}^{1/2} = \mathbf{W}^{1/2}\mathbf{W}^{-1/2}\mathbf{W}^{-1/2}\mathbf{W}^{1/2} = \mathbf{I}.$$


Thus, the premultiplied regression model has error terms with mean 0 and constant variance ($\sigma_i^2 \equiv 1$), so we can apply the standard regression procedures to this weighted model. We can now define the vector of estimated regression coefficients for the weighted model:

$$\mathbf{b}_w = \left[(\mathbf{W}^{1/2}\mathbf{X})'\mathbf{W}^{1/2}\mathbf{X}\right]^{-1}(\mathbf{W}^{1/2}\mathbf{X})'\mathbf{W}^{1/2}\mathbf{Y}
= (\mathbf{X}'\mathbf{W}^{1/2}\mathbf{W}^{1/2}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}^{1/2}\mathbf{W}^{1/2}\mathbf{Y}
= (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y}.$$
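In SAS, this is the estimator you get from PROC REG when you supply a WEIGHT statement. A minimal sketch follows; the dataset name mydata and the variables y, x, and w are placeholders (not from any of the example programs), and w is assumed to already hold the weight for each observation:

   proc reg data=mydata;      /* mydata is a hypothetical dataset                    */
      weight w;               /* w holds the weights w_i = 1/sigma_i^2, so PROC REG  */
                              /* computes b_w = (X'WX)^(-1) X'WY                     */
      model y = x;
   run;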

Furnival’s Index of Fit

Now that we have defined the weighted least squares model, we can use Furnival's Index of Fit (Furnival 1961) to find the best weight to use in the regression model when the error variances are unknown. This index is a modified likelihood criterion that reflects both the size of the residuals and possible departures from normality and homoscedasticity. It can be used to compare any number of models in which the dependent variable, Y', represents a different transformation of the original Y variable. Furnival's Index of Fit (FI) is defined as:

$$FI = \left[f'(Y)\right]^{-1}\sqrt{MSE},$$

where:

$f'(Y)$ = the first derivative of the weighted (transformed) dependent variable with respect to Y, and

$\left[f'(Y)\right]^{-1}$ is evaluated as $\bar{z}$, the geometric mean of the $z_i = \left[f'(Y_i)\right]^{-1}$:

$$\bar{z} = \exp\!\left[\frac{\sum_{i=1}^{n}\ln z_i}{n}\right].$$
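A quick check on this definition: if the dependent variable is not transformed at all, then $f'(Y) = 1$ for every observation, its geometric mean is 1, and

$$FI = 1 \cdot \sqrt{MSE} = RMSE,$$

so for an untransformed, unweighted fit FI is simply the root mean square error. You will see this in the first row of the table in the example below.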


The idea is to fit the model with several different weights and then choose the "best" model as the one with the smallest FI value. You need to run enough weighted regressions to ensure that you bracket the smallest FI.

Example

We will use n = 50 ponderosa pine trees from the Black Hills of South Dakota to illustrate weighted linear regression. In this example, we will build a regression model to predict the cubic foot volume of the ponderosa pine trees from diameter (in inches) squared times height (in feet). Note: the raw data are not presented (refer to the SAS program, “WeightedOLSExample.sas”).

We will fit the following SLR model to the data: CFV = b0 + b1*D2H. This model appears to be appropriate; i.e., there is a linear relationship between D2H and CFV:

[Figure: Cubic foot volume over D2H for ponderosa pine in the Black Hills of South Dakota (CFV, 0-70 ft3, on the vertical axis; D2H, 0-30,000, on the horizontal axis).]

Fitted Model with Fit Statistics:

CFV = -1.677 + 0.00267*D2H,   RMSE = 2.25 ft3,   R2 = 0.97

Regression Coefficient Statistics:

Variable     Coefficient    Standard Error    t-statistic    p-value
Intercept    -1.67700       0.60656           -2.76641       0.00802
D2H          0.0026722      0.000062988       42.42769       < 0.0001

Residual Plot:

[Figure: Residual plot for the unweighted model (residuals plotted against D2H).]

Clearly, there is non-constant variance present in the unweighted model (the other assumptions were met, but are not presented). We will now apply the weights

$$w_i = \frac{1}{\left(D^2H\right)_i^{k}}$$

to correct this problem. In this type of volume equation, the inverse of D2H raised to different powers k is known to correct non-constant variance. So, the formula for FI that we will use in this example is:

$$FI = \exp\!\left[\frac{P\sum_{i=1}^{n}\ln\left(D^2H\right)_i}{n}\right]\sqrt{MSE},$$

where P is a scalar equal to one-half of the power k used in the weight variable. Notice that the first derivative of the weighted dependent variable is

$$f'(Y) = f'\!\left(\frac{CFV}{\left(D^2H\right)^{k}}\right) = \frac{1}{\left(D^2H\right)^{k}}.$$

Thus, the inverse of this first derivative is

$$\left[f'(Y)\right]^{-1} = \left(\frac{1}{\left(D^2H\right)^{k}}\right)^{-1} = \left(D^2H\right)^{k}.$$

We then want the geometric mean of this value to find FI. When we substitute it into the formula for the geometric mean, we can move the power k in front of the summation because of the rules of logarithms. Note, however, that we need to divide the power k by 2 to get P, because the weight enters the model through W^(1/2) (as we showed earlier when we derived the weighted linear regression model), so only half of k acts on the dependent variable. With this formula, I can now make multiple runs using various weights, starting with k = 0.5 (square root) and increasing in increments of 0.5, until I find the smallest FI, which identifies the "best" model. In SAS, I will need to create a weight variable for 1/(D2H)^k that can be used with the WEIGHT statement in PROC REG, as sketched below.
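Here is a rough sketch of what one of these runs looks like. This is not the posted WeightedOLSExample.sas program; the dataset name pines and the variable names cfv and d2h are placeholders, and only the k = 1 run is shown:

   data pines_w;
      set pines;                 /* hypothetical dataset containing cfv and d2h          */
      w = 1/(d2h**1);            /* weight 1/(D2H)^k for k = 1; change the exponent to   */
                                 /* 0.5, 1.5, 2, ... for the other candidate weights     */
   run;

   proc reg data=pines_w;
      weight w;                  /* weighted least squares with weight 1/(D2H)^k         */
      model cfv = d2h;           /* CFV = b0 + b1*D2H                                    */
   run;

FI for each run is then exp(P*Σln(D2H)/n)*√MSE, with P = k/2, computed from the MSE that PROC REG reports and the sum of the natural logarithms of D2H.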

The following table contains the FI values for the various weights used in the regression model. In the table, X = D2H, Z = P*Σln(D2H)/n (so that FI = exp(Z)*√MSE), and b0, s(b0), b1, s(b1) are the estimated coefficients and their standard errors:

WEIGHT     P      Z          MSE          FI         b0         s(b0)     b1        s(b1)
NONE       --     0.00000    5.06221      2.24994    -1.678     0.60656   0.00267   0.00006299
1/X^0.5    0.25   2.21069    0.04962      2.03198    -1.59434   0.529     0.00266   0.0000645
1/X^1      0.5    4.42138    0.0005434    1.93973    -1.56067   0.48715   0.00266   0.00006969
1/X^1.5    0.75   6.63207    6.72E-06     1.96769    -1.57258   0.47831   0.00266   0.00007936
1/X^2      1      8.84276    9.301E-08    2.11171    -1.62724   0.49828   0.00267   0.00009407

Σln(D2H) = 442.138103,   n = 50

Based on these results, the "best" model is the one with the power k = 1.0 on the weight:

CFV = -1.56067 + 0.00266*D2H,   R-squared = 0.97,   RMSE = 0.02331 (on the weighted scale)

I will show the calculation for our "best" model to illustrate how to find FI. In this model, MSE = 0.0005434, n = 50, one-half of the power of the weight is P = 0.5, and the sum of the natural logarithms of D2H is 442.138103. Plugging these values into our formula gives

$$FI = \exp\!\left[\frac{0.5 \times 442.138103}{50}\right]\sqrt{0.0005434} = 1.93973.$$

As you can see, Furnival's Index of Fit is a simple, easy-to-obtain value that identifies the "best" model from among a group of candidate models. The really nice feature of FI is that any model form can be compared, because FI uses the derivative of the transformed dependent variable to put all of the fits back on a common scale.

Thus, you could, for example, compare several different model forms for predicting the volume of these ponderosa pine trees, such as the ones above, a nonlinear model (CFV = b0*D^b1*H^b2), and a log-log model (ln(CFV) = b0 + b1*ln(D) + b2*ln(H)), and use FI to pick the "best" one.
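For instance, for the log-log model the transformed dependent variable is ln(CFV), so

$$f'(Y) = \frac{d\,\ln(CFV)}{d\,CFV} = \frac{1}{CFV}, \qquad \left[f'(Y)\right]^{-1} = CFV,$$

and FI for that model is the geometric mean of the observed CFV values times the square root of that model's MSE, which puts the log-scale fit back on the cubic-foot scale so it can be compared directly with the weighted fits above.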

Ridge Regression

Ridge regression is a remedial measure for serious multicollinearity. You can use this technique when other remedial measures fail to correct a multicollinearity problem (e.g., dropping one or more correlated predictor variables, or using OLS regression with principal components as the predictors rather than the individual predictors). Ridge regression uses least squares estimators that are slightly biased. The idea is that a biased estimator with a small standard error is often preferable to an unbiased estimator with a large standard error. Ridge regression uses the correlation transformation introduced in Chapter 7, along with a biasing constant, to obtain the ridge estimators for the transformed model (see Kutner et al., pages 433-434, for derivations).

The biasing constant is most commonly chosen based on the ridge trace and the VIFs. The ridge trace is a simultaneous plot of the values of the transformed regression coefficients for different values of the biasing constant. As the value of the biasing constant is increased, the ridge trace will stabilize, eventually moving toward zero. The VIFs will also decrease and eventually stabilize as the biasing constant increases. So, the "best" biasing constant is the one at which the VIFs and the ridge trace have stabilized. The RIDGE= option in PROC REG will fit the ridge regression over a list of biasing constants that you supply; you then choose the constant by examining the ridge trace and the VIFs.


Example

I will use Kutner et al.'s body fat example (page 434) to show how a ridge regression works. The body fat model has severe multicollinearity (the VIFs for the three predictors all exceed 100), so it will serve as a good example for ridge regression. The following PROC REG code was used with their data (refer to the SAS program, "RidgeRegressionExample.sas"):

   proc reg data=test outvif outest=b ridge=0 to 0.03 by .002;
      model Y = X1 X2 X3 / noprint;
      plot / ridgeplot nomodel legend=legend2 nostat
             vref=0 lvref=1 cvref=blue cframe=ligr;
   run;

Here’s the ridge trace plot for the body fat example:

They decided that a biasing constant of c = 0.02 was best, because the ridge trace had stabilized and the VIFs were near 1 (see text page 434, Table 11.3 for the values).

Here's the new regression model transformed back to the original variables:

Ridge (c = 0.02):   Y = -7.403 + 0.555*X1 + 0.368*X2 - 0.192*X3

versus the original OLS fit (see page 258, Table 7.2d):   Y = 117.08 + 4.334*X1 - 2.857*X2 - 2.186*X3

You can see that ridge regression produced a very different model from the original one; Kutner et al. discuss the differences on pages 434-435. Note: my ridge regression equation is slightly different from the one in the book (page 435) because of rounding differences between my SAS run and their calculations. SAS automatically transforms the model back to the original variables. You can find all of the ridge estimators and VIFs for each biasing constant in the OUTEST dataset (OUTEST = b in this example); I did not reproduce that output in these lecture notes.
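If you want to examine that output yourself, a minimal sketch is:

   /* b is the OUTEST= dataset created by the ridge run above; it contains */
   /* the ridge estimators and VIFs for each value of the biasing constant */
   proc print data=b;
   run;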

Robust Regression

In Chapter 10, we discussed how to detect influential outliers. Outliers can affect least squares estimators, sometimes severely when they are influential. So, how do we handle influential outliers? If you determine that there was no measurement or recording error, then you must seriously consider whether or not to eliminate the outlying observation(s) from the dataset. If you are building a model to predict general trends, you may be justified in eliminating outlying cases, because you may not be interested in modeling extreme cases. However, if you decide to retain the outlier(s), then you can use robust regression to dampen the influence of the outlying case(s).


Kutner et al. describe several robust regression methods, but they focus on the method of iteratively reweighted least squares (IRLS). IRLS uses weighted least squares to dampen the influence of outliers. The weights are based on the residuals, which measure how far each observation is from its predicted value. Kutner et al. describe two common weighting functions, Huber and bisquare; either one should be applied to scaled residuals. They mention several ways to scale the residuals, such as the median absolute deviation (MAD). I refer you to the text beginning on page 439 for all the details; Kutner et al. also provide an excellent example beginning on page 441.

SAS (version 9.1 and later) makes it easy to perform robust regression using PROC ROBUSTREG. SAS provides four robust estimation methods: M estimation, introduced by Huber (1973); Least Trimmed Squares (LTS) estimation; S estimation; and MM estimation, introduced by Yohai (1987). The latter three methods are considered high breakdown value estimation techniques, and they are detailed in the SAS online help for PROC ROBUSTREG. The M estimation method offers your choice of ten weighting functions, including the bisquare and Huber weighting functions, and MAD residual scaling (the default); the other residual scaling methods include Huber's, Tukey's, and fixed-constant scaling. We will work an example with PROC ROBUSTREG to show how IRLS can dampen the effect of influential outliers using M estimation with the bisquare and Huber weighting functions and median residual scaling.
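As a rough sketch of the syntax (this is not the posted IRLSExample.sas program; the dataset name pines and the variables cfv and d2h are placeholders, and I am leaving the residual scaling at its MAD default rather than spelling out a scale option):

   /* M estimation with the bisquare weight function; swap in          */
   /* WEIGHTFUNCTION=HUBER for Huber weights, or METHOD=MM for         */
   /* MM estimation                                                    */
   proc robustreg data=pines method=m(weightfunction=bisquare);
      model cfv = d2h;
      output out=robust_fit predicted=pred_cfv;   /* robust fitted values for plotting */
   run;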

Example

I will again revisit the Black Hills ponderosa pine data to show how to perform IRLS in SAS. I created four outliers to illustrate how IRLS can dampen the influence of severe outliers in the final least squares model. PROC ROBUSTREG performs robust regression using several estimation methods; for our example, we use M and MM estimation. For M estimation, I used the bisquare and Huber weighting functions with median residual scaling. Refer to the example program on the webpage, "IRLSExample.sas". After running the SAS program, I made the following chart in Excel to show how IRLS dampened the effects of the severe outliers (I did not print the coefficients or fit statistics):

[Figure: Robust vs. OLS regression for Black Hills ponderosa pine — cubic-foot volume plotted against D2H, showing the observations and the fitted OLS, ROBUST_bisquare, ROBUST_Huber, and ROBUST_MM lines.]

As you can see, IRLS did improve the fit by dampening the effects of the outliers, although there is only a slight difference between the OLS and IRLS regression lines. Note also that the MM and bisquare robust regression lines are coincident.

Nonparametric Regression

Nonparametric regression is useful if you want to relax the assumptions of the normal error regression model. It is especially useful when you do not know the nature of the response function. Kutner et al. discuss nonparametric regression beginning on page 425 and describe one of the most common nonparametric regression techniques, the loess method. SAS has a procedure to perform this kind of nonparametric regression, PROC LOESS. You should know that nonparametric regression techniques do not provide an analytical expression for the fitted model, but rather only a plot of the response surface. This can be a problem for models with many predictor variables, because they are hard to represent graphically. Kutner et al. provide a good example of nonparametric regression beginning on page 450, with the output graphs on page 452. You can see that writing down an analytical solution for the variably shaped regression line is impossible; the fit can only be described case by case (local fitting). At this point, I am not going to attempt an example. The book also describes a procedure for examining the data in segments, called regression trees (page 453). This is a powerful way to use regression analysis in exploratory studies. Hopefully, we will have time to work some examples in class.
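Although I am not working a loess example here, just to show the shape of the syntax, a minimal PROC LOESS call looks something like this (the dataset and variable names are placeholders, and the smoothing parameter is an arbitrary choice):

   proc loess data=mydata;
      model y = x / smooth=0.5;   /* local (loess) fit with smoothing parameter 0.5 */
   run;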
