## Multiple Regression: SPSS Output

Author: Shauna Morton
Multiple Regression Model:

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q + \varepsilon$$

Multiple regression relates a response (dependent, output) variable y to a set of explanatory (independent, input, predictor) variables $x_1, x_2, \dots, x_q$. It is a technique for modeling the relationship between one response variable and several predictor variables.

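$$y = \mu_{y \mid x_1, x_2, x_3, \dots, x_q} + \varepsilon$$

$$y = \underbrace{\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q}_{\text{deterministic component}} + \underbrace{\varepsilon}_{\text{random component}}$$

The parameters $\alpha, \beta_1, \beta_2, \beta_3, \dots, \beta_q$ in the model can all be estimated by the least squares estimators $\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \dots, \hat{\beta}_q$, which minimize

$$\sum_i e_i^2 = \sum_i \left[ y_i - (\alpha + \beta_1 x_{1i} + \dots + \beta_q x_{qi}) \right]^2$$

The Least Squares Regression Equation:

$$\hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \dots + \hat{\beta}_q x_q$$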
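The least squares criterion can be illustrated with a small sketch. The code below is not how SPSS computes its estimates; it is a minimal pure-Python illustration that solves the normal equations by Gaussian elimination. The function names and the tiny noise-free data set are made up for the example.

```python
def fit_least_squares(rows, y):
    """Return [alpha, beta_1, ..., beta_q] minimizing the sum of squared residuals."""
    # Design matrix with a leading 1 for the intercept alpha.
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def sum_squared_residuals(coefs, rows, y):
    """Sum of e_i^2 for the given intercept and slopes."""
    return sum((yi - (coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)))) ** 2
               for r, yi in zip(rows, y))


# Hypothetical, noise-free data generated from y = 2 + 0.5*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 0.5 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_least_squares(rows, y)
print(coefs)
```

Because the illustrative data contain no random error, the fitted coefficients recover the generating values and the sum of squared residuals is essentially zero; with real data the residual sum would be positive.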

Example: Study weight (y) using age ($x_1$) and height ($x_2$).

Data: Age (months), height (inches), and weight (pounds) were recorded for a group of school children.

[Figure: scatter plots of Weight versus Age and Weight versus Height]

Scatter plots above show that both age and height are linearly related to weight.

Model: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, with weight y, age $x_1$, and height $x_2$

SPSS output

Model Summary

| Model | R | R Square | Std. Error of the Estimate |
|---|---|---|---|
| 1 | .794 (a) | .630 | 11.868 |

a. Predictors: (Constant), Height, Age

ANOVA (b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 56233.254 | 2 | 28116.627 | 199.610 | .000 (a) |
| Residual | 32960.761 | 234 | 140.858 | | |
| Total | 89194.015 | 236 | | | |

a. Predictors: (Constant), Height, Age
b. Dependent Variable: Weight

Test for significance of the model: p-value = .000 < .05, so H0 is rejected.

• H0: Model is insignificant (all $\beta_i$ are zero).
• Ha: Model is significant (some $\beta_i$ are not zero).

Coefficient of determination (R Square): the percentage of variability in the response variable (Weight) that can be explained by the predictor variables (Age, Height) through the model. Here R Square = .630, so the model accounts for about 63% of the variability in Weight.
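The summary statistics in the SPSS output can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The sketch below only uses the numbers printed in that table; the variable names are chosen for illustration.

```python
import math

# Values copied from the ANOVA table (dependent variable: Weight).
ss_regression = 56233.254
ss_residual = 32960.761
df_regression = 2
df_residual = 234

ss_total = ss_regression + ss_residual            # total sum of squares
r_square = ss_regression / ss_total               # coefficient of determination
ms_regression = ss_regression / df_regression     # mean square, regression
ms_residual = ss_residual / df_residual           # mean square, residual
f_statistic = ms_regression / ms_residual         # overall F test statistic
std_error_estimate = math.sqrt(ms_residual)       # Std. Error of the Estimate
r = math.sqrt(r_square)                           # multiple correlation R

print(f"R Square = {r_square:.3f}, F = {f_statistic:.2f}, "
      f"Std. Error = {std_error_estimate:.3f}")
```

The results agree with the Model Summary and ANOVA tables up to rounding, which shows that R Square, F, and the standard error of the estimate are all derived from the same decomposition of the total sum of squares.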

Model estimation: SPSS output

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | -127.820 | 12.099 | | -10.565 | .000 | | |
| Age | .240 | .055 | .228 | 4.360 | .000 | .579 | 1.727 |
| Height | 3.090 | .257 | .627 | 12.008 | .000 | .579 | 1.727 |

B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Tolerance and VIF are the collinearity statistics.

a. Dependent Variable: Weight

Tests for Regression Coefficients

• H0: $\alpha = 0$ vs. Ha: $\alpha \neq 0$, p-value = .000 < .05
• H0: $\beta_1 = 0$ vs. Ha: $\beta_1 \neq 0$, p-value = .000 < .05
• H0: $\beta_2 = 0$ vs. Ha: $\beta_2 \neq 0$, p-value = .000 < .05

Collinearity statistics: If the VIF (Variance Inflation Factor) is greater than 10, there is a multicollinearity problem. (Some authors use a stricter rule that the VIF should be less than 4.)

Least squares regression equation:

$$\hat{y} = -127.82 + .24 x_1 + 3.09 x_2$$

The average weight of children who are 144 months old and 55 inches tall would be: -127.82 + .24(144) + 3.09(55) = 76.69 lbs (estimated by the model).
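The prediction above is a direct evaluation of the fitted equation. As a small sketch, with the coefficients copied from the SPSS coefficient table (rounded as in the equation) and a function name invented for illustration:

```python
def predict_weight(age_months, height_inches):
    """Estimated average weight (lbs) from the fitted equation."""
    # Coefficients as rounded in the least squares regression equation.
    return -127.82 + 0.24 * age_months + 3.09 * height_inches


estimate = predict_weight(144, 55)
print(f"{estimate:.2f}")  # 76.69
```

Note that the intercept enters with a negative sign; dropping the sign would give an estimate above 330 lbs, which is clearly not a plausible weight for a school child.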

How to interpret $\alpha$, $\beta_1$ and $\beta_2$?

Model: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, where y: Weight, $x_1$: Age, $x_2$: Height

• $\alpha$ is the constant or the y-intercept in the model. It is the average response when both predictor variables are 0.
• $\beta_1$ is the rate of change of expected (average) weight per unit change of age, adjusted for the height variable.
• $\beta_2$ is the rate of change of expected (average) weight per unit change of height, adjusted for the age variable.

Other possible models (y: Weight, $x_1$: Age, $x_2$: Height):

• $y = \alpha + \beta_1 x_1 + \varepsilon$
• $y = \alpha + \beta_2 x_2 + \varepsilon$

With interaction term (non-additive), where $x_1 x_2$ is the interaction term:

• $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
• $y = \alpha + \beta_1 x_1 + \beta_3 x_1 x_2 + \varepsilon$
• $y = \alpha + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
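To see why a model with an interaction term is called non-additive, it can help to compare how the mean response changes with age in the two kinds of model. The short derivation below is a sketch using the same symbols as the models above, with $\mu$ denoting the mean of y for given $x_1$ and $x_2$.

$$\text{Additive: } \mu = \alpha + \beta_1 x_1 + \beta_2 x_2 \;\Rightarrow\; \frac{\partial \mu}{\partial x_1} = \beta_1$$

$$\text{With interaction: } \mu = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \;\Rightarrow\; \frac{\partial \mu}{\partial x_1} = \beta_1 + \beta_3 x_2$$

In the additive model the effect of age is the same at every height, whereas with the interaction term the effect of age depends on the height at which it is evaluated.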

Coefficient Estimation with Interaction Between Age and Height

Model: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$, with weight y, age $x_1$, and height $x_2$

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | 66.996 | 106.189 | | .631 | .529 | | |
| Age | -.973 | .660 | -.923 | -1.476 | .141 | .004 | 250.009 |
| Height | -3.13E-02 | 1.710 | -.006 | -.018 | .985 | .013 | 77.016 |
| INTAG_HT | 1.936E-02 | .010 | 1.636 | 1.847 | .066 | .002 | 501.996 |

a. Dependent Variable: Weight

• High VIF implies very serious collinearity.
• Interaction should not be used in the model.

For boys:

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | -113.713 | 15.590 | | -7.294 | .000 | | |
| Age | .308 | .084 | .289 | 3.672 | .000 | .443 | 2.259 |
| Height | 2.681 | .368 | .574 | 7.283 | .000 | .443 | 2.259 |

a. Dependent Variable: Weight

• Is there a serious collinearity?
• Write the weight prediction equation using age and height as predictor variables.
• Find the average weight for boys that are 144 months old and 55 inches tall.
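One way to work through the questions for the boys' model is sketched below. It assumes the unstandardized coefficients from the boys' table and applies the VIF > 10 rule of thumb quoted earlier; the function and variable names are illustrative only.

```python
# Unstandardized coefficients (B) from the boys' coefficient table.
BOYS_CONSTANT = -113.713
BOYS_AGE = 0.308
BOYS_HEIGHT = 2.681
BOYS_VIF = {"Age": 2.259, "Height": 2.259}


def predict_weight_boys(age_months, height_inches):
    """Estimated average weight (lbs) for boys from the fitted equation."""
    return BOYS_CONSTANT + BOYS_AGE * age_months + BOYS_HEIGHT * height_inches


# Rule of thumb from the text: VIF above 10 signals a multicollinearity problem.
serious_collinearity = any(v > 10 for v in BOYS_VIF.values())
estimate_boys = predict_weight_boys(144, 55)

print(serious_collinearity)      # False
print(f"{estimate_boys:.3f}")    # 78.094
```

The same steps apply to any other fitted equation of this form once its coefficients are substituted.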

For girls:

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | -150.597 | 20.767 | | -7.252 | .000 | | |
| Age | .191 | .076 | .186 | 2.524 | .013 | .704 | 1.420 |
| Height | 3.604 | .408 | .650 | 8.838 | .000 | .704 | 1.420 |

a. Dependent Variable: Weight

• Is there a serious collinearity?
• Write the weight prediction equation using age and height as predictor variables.
• Find the average weight for girls that are 144 months old and 55 inches tall.

Indicator Variables

Indicator variables are binary variables that take only two possible values, 0 and 1, and can be used for including categorical variables in the model.

Example: Male: 1, Female: 0

Group Statistics (Weight)

| Gender | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| Male | 126 | 103.448 | 19.968 | 1.779 |
| Female | 111 | 98.878 | 18.616 | 1.767 |

One Binary Independent Variable Model (a model for the two independent samples situation under the equal variances condition):

$$y = \alpha + \beta_1 x_1 + \varepsilon$$

where y: Weight, $x_1$: Gender ($x_1 = 0$ for female, $x_1 = 1$ for male)

• When $x_1 = 0$: $y = \alpha + \varepsilon$
• When $x_1 = 1$: $y = \alpha + \beta_1 + \varepsilon$

The difference of the means of the two categories is $\beta_1$.

The two independent samples t-test can be modeled with a simple linear regression model.

SPSS output for the two independent samples t-test comparing the mean weight between males and females:

Independent Samples Test (Weight)

| | Levene's Test F | Levene's Test Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference, Lower | 95% CI of the Difference, Upper |
|---|---|---|---|---|---|---|---|---|---|
| Equal variances assumed | .630 | .428 | 1.815 | 235 | .071 | 4.570 | 2.518 | -.392 | 9.532 |
| Equal variances not assumed | | | 1.823 | 234.233 | .070 | 4.570 | 2.507 | -.370 | 9.510 |

SPSS output for linear regression with gender as predictor:

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | 98.878 | 1.836 | | 53.846 | .000 | | |
| Gender | 4.570 | 2.518 | .118 | 1.815 | .071 | 1.000 | 1.000 |

a. Dependent Variable: Weight
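The agreement between the t-test and the regression on the gender indicator can be checked by hand. The sketch below recomputes the equal-variances (pooled) t statistic from the Group Statistics values; variable names are illustrative, and small differences from the SPSS output are due to rounding of the inputs.

```python
import math

# Group statistics for Weight, copied from the SPSS Group Statistics table.
n_male, mean_male, sd_male = 126, 103.448, 19.968
n_female, mean_female, sd_female = 111, 98.878, 18.616

mean_difference = mean_male - mean_female
df = n_male + n_female - 2

# Pooled variance under the equal-variances assumption.
pooled_variance = ((n_male - 1) * sd_male ** 2
                   + (n_female - 1) * sd_female ** 2) / df
std_error_difference = math.sqrt(pooled_variance * (1 / n_male + 1 / n_female))
t_statistic = mean_difference / std_error_difference

# In the regression y = alpha + beta_1 * x1 with x1 = 1 for male:
intercept = mean_female          # alpha is the mean of the x1 = 0 group
slope = mean_difference          # beta_1 is the difference of the two means

print(f"t = {t_statistic:.3f}, df = {df}, SE = {std_error_difference:.3f}")
```

The intercept and slope match the (Constant) and Gender rows of the regression output, and the t statistic and standard error match both the Gender row and the "Equal variances assumed" row of the t-test.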

Gender and Age as Predictor Variables

Model: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, with y weight, $x_1$ age, and $x_2$ gender ($x_2 = 0$ female, $x_2 = 1$ male)

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | -11.181 | 8.778 | | -1.274 | .204 | | |
| Age | .669 | .053 | .634 | 12.705 | .000 | 1.000 | 1.000 |
| Gender | 4.539 | 1.942 | .117 | 2.338 | .020 | 1.000 | 1.000 |

a. Dependent Variable: Weight

[Figure: scatter plot of Weight versus Age, with points marked by Gender (Male, Female)]

Age and Gender are both significant variables for predicting weight.

There is a significant difference in average weight between genders after adjusting for the age variable.

Age, Height, & Gender as Predictors

Model: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$, with y weight, $x_1$ age, $x_2$ height, and $x_3$ gender ($x_3 = 0$ female, $x_3 = 1$ male)

[Figure: three-dimensional scatter plot of Weight against Age and Height, with points marked by Gender (Male, Female)]

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | -128.209 | 12.264 | | -10.454 | .000 | | |
| Age | .238 | .056 | .226 | 4.250 | .000 | .562 | 1.780 |
| Height | 3.105 | .267 | .630 | 11.621 | .000 | .539 | 1.854 |
| Gender | -.338 | 1.604 | -.009 | -.210 | .834 | .932 | 1.073 |

a. Dependent Variable: Weight

The Gender variable becomes insignificant with the Age and Height variables in the model.

When the difference in average weight between genders is adjusted for the age and height variables, the difference is statistically insignificant.

How to include a categorical variable in the model?

The proper way to include a categorical variable is to use indicator variables. A categorical variable with k categories requires k - 1 indicator variables.

Example: "Race" variable: White = 1, Black = 2, Hispanic = 3. Two indicator variables will be needed.

Common Mistake: Using the internally coded values of a categorical explanatory variable directly in the linear regression calculation.

Example: "Race": White = 1, Black = 2, Hispanic = 3. Use of indicator variables $x_1$ and $x_2$ for the Race variable:

• $x_1 = 1$ represents "White", otherwise $x_1 = 0$,
• $x_2 = 1$ represents "Black", otherwise $x_2 = 0$,
• $x_1 = 0$ and $x_2 = 0$ represents "Hispanic".

[Diagram: "Race" and "Number of hours of exercise per week" as predictors of "Body Fat Percentage"]

Model:

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

where y is body fat percentage, $x_1$ and $x_2$ are the Race indicator variables, and $x_3$ is the number of hours of exercise per week.

Interpretation of the model:

• Race: White ($x_1 = 1$ and $x_2 = 0$): $y = \alpha + \beta_1 + \beta_3 x_3 + \varepsilon$
• Race: Black ($x_1 = 0$ and $x_2 = 1$): $y = \alpha + \beta_2 + \beta_3 x_3 + \varepsilon$
• Race: Hispanic ($x_1 = 0$ and $x_2 = 0$): $y = \alpha + \beta_3 x_3 + \varepsilon$
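The indicator coding for Race can be sketched in a few lines. The function names below are hypothetical; the coefficients 20, 2.1, 1.3 and -0.1 are the ones from the least squares equation that these slides quote for the body fat example, with "Hispanic" as the reference category.

```python
RACE_LEVELS = ("White", "Black", "Hispanic")  # "Hispanic" is the reference level


def race_indicators(race):
    """Map a race category to the k - 1 = 2 indicator variables (x1, x2)."""
    if race not in RACE_LEVELS:
        raise ValueError(f"unknown category: {race!r}")
    x1 = 1 if race == "White" else 0
    x2 = 1 if race == "Black" else 0
    return x1, x2


def predict_body_fat(race, exercise_hours):
    """Estimated average body fat percentage from the quoted fitted equation."""
    x1, x2 = race_indicators(race)
    return 20 + 2.1 * x1 + 1.3 * x2 - 0.1 * exercise_hours


for level in RACE_LEVELS:
    print(level, race_indicators(level), round(predict_body_fat(level, 10), 1))
```

At 10 hours of exercise per week this gives 21.1 for White, 20.3 for Black and 19.0 for Hispanic. Entering the internal codes 1, 2, 3 directly as a single predictor would instead force the difference between consecutive categories to be equal, which is the common mistake described above.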

Suppose that the least squares regression equation for the model above is

$$\hat{y} = 20 + 2.1 x_1 + 1.3 x_2 - .1 x_3$$

• Estimate the average body fat for a White person who exercises 10 hours per week: 20 + 2.1(1) + 1.3(0) - .1(10) = 21.1
• Estimate the average body fat for a Black person who exercises 10 hours per week: 20 + 2.1(0) + 1.3(1) - .1(10) = 20.3
• Estimate the average body fat for a Hispanic person who exercises 10 hours per week: 20 + 2.1(0) + 1.3(0) - .1(10) = 19.0

Example: Study female life expectancy using percentage of urbanization and birth rate.

[Figure: scatter plots of Female life expectancy 1992 versus Births per 1000 population, 1992 and versus Percent urban, 1992]

Model: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, with y life expectancy, $x_1$ birth rate, and $x_2$ percent urbanized

Model Summary

| Model | R | R Square | Std. Error of the Estimate |
|---|---|---|---|
| 1 | .904 (a) | .817 | 4.89 |

a. Predictors: (Constant), Births per 1000 population, 1992, Percent urban, 1992

ANOVA (b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 12577.056 | 2 | 6288.528 | 262.595 | .000 (a) |
| Residual | 2825.820 | 118 | 23.948 | | |
| Total | 15402.876 | 120 | | | |

a. Predictors: (Constant), Births per 1000 population, 1992, Percent urban, 1992
b. Dependent Variable: Female life expectancy 1992

Test for significance of the model: p-value = .000 < .05

• H0: Model is insignificant (all $\beta_i$ are zero).
• Ha: Model is significant (some $\beta_i$ are not zero).

Coefficient of determination: the percentage of variability in the response variable (female life expectancy) that can be explained by the predictor variables (birth rate, percentage of urbanization) through the model.

Model estimation: (SPSS output)

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | 76.216 | 2.431 | | 31.350 | .000 | | |
| Births per 1000 population, 1992 | -.555 | .045 | -.648 | -12.196 | .000 | .551 | 1.814 |
| Percent urban, 1992 | .154 | .025 | .331 | 6.238 | .000 | .551 | 1.814 |

a. Dependent Variable: Female life expectancy 1992

Tests for Regression Coefficients

• H0: $\alpha = 0$ vs. Ha: $\alpha \neq 0$, p-value = .000 < .05
• H0: $\beta_1 = 0$ vs. Ha: $\beta_1 \neq 0$, p-value = .000 < .05
• H0: $\beta_2 = 0$ vs. Ha: $\beta_2 \neq 0$, p-value = .000 < .05

Collinearity statistics: If the VIF (Variance Inflation Factor) is greater than 10, there is a multicollinearity problem. (Some authors use a stricter rule that the VIF should be less than 4.)

Least squares regression equation for estimating the average response value:

$$\hat{y} = 76.216 - .555 x_1 + .154 x_2$$

The average female life expectancy for countries whose birth rate per 1000 is 30 and whose percentage of urbanization is 40 would be 76.216 - 0.555(30) + 0.154(40) = 65.726.
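The fitted equation can also be used to read off what a coefficient means in practice. The sketch below, with an illustrative function name and the coefficients from the table above, reproduces the estimate and then compares two hypothetical countries that differ only in birth rate.

```python
def predict_life_expectancy(birth_rate, percent_urban):
    """Estimated average female life expectancy from the fitted equation."""
    return 76.216 - 0.555 * birth_rate + 0.154 * percent_urban


estimate = predict_life_expectancy(30, 40)
print(f"{estimate:.3f}")  # 65.726

# Effect of 10 more births per 1000 population, holding urbanization at 40%.
change = predict_life_expectancy(40, 40) - predict_life_expectancy(30, 40)
print(f"{change:.2f}")  # -5.55
```

The change of -5.55 years is ten times the birth rate coefficient and does not depend on the urbanization level chosen, which is what "adjusted for the other variable" means in an additive model.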

Female Life Expectancy

Response variable: Female life expectancy. Explanatory variables: Birth Rate, Urbanization, Phones, Doctors, and GDP.

Which variables are significant factors for female life expectancy in the model?

[Figure: Multiple Scatter Plot Before Transformation (scatter plot matrix of Female life expectancy, Births per 1000 population, Percent urban 1992, Phones per 100 people, Doctors per 10,000, GDP per capita)]

[Figure: Multiple Scatter Plot After ln(x) Transformation on Phones, Doctors, GDP (scatter plot matrix of Female life expectancy, Births per 1000 population, Percent urban 1992, Natural log of phones, Natural log of doctors, Natural log of GDP)]

Model Summary (b)

| Model | R | R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|
| 1 | .934 (a) | .873 | 4.08 | 2.103 |

a. Predictors: (Constant), Natural log of GDP, Percent urban, 1992, Births per 1000 population, 1992, Natural log of doctors per 10000, Natural log of phones per 100 people
b. Dependent Variable: Female life expectancy 1992

ANOVA (b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 12123.330 | 5 | 2424.666 | 145.342 | .000 (a) |
| Residual | 1768.348 | 106 | 16.683 | | |
| Total | 13891.679 | 111 | | | |

a. Predictors: (Constant), Natural log of GDP, Percent urban, 1992, Births per 1000 population, 1992, Natural log of doctors per 10000, Natural log of phones per 100 people
b. Dependent Variable: Female life expectancy 1992

Multicollinearity

Coefficients (a)

| Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | 77.448 | 5.829 | | 13.287 | .000 | | |
| Births per 1000 population, 1992 | -.272 | .058 | -.319 | -4.659 | .000 | .256 | 3.903 |
| Percent urban, 1992 | 1.937E-02 | .031 | .043 | .629 | .531 | .263 | 3.805 |
| Natural log of phones per 100 people | 3.175 | .679 | .552 | 4.675 | .000 | .086 | 11.590 |
| Natural log of doctors per 10000 | 1.894 | .593 | .262 | 3.194 | .002 | .178 | 5.611 |
| Natural log of GDP | -1.390 | .784 | -.190 | -1.772 | .079 | .105 | 9.543 |

a. Dependent Variable: Female life expectancy 1992

Tolerance measures the strength of the linear relation between the independent variables. It is better for it to be higher than 0.1. VIF is the reciprocal of Tolerance.

Stepwise Selection

ANOVA (d)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 11159.884 | 1 | 11159.884 | 449.370 | .000 (a) |
| | Residual | 2731.795 | 110 | 24.834 | | |
| | Total | 13891.679 | 111 | | | |
| 2 | Regression | 11830.842 | 2 | 5915.421 | 312.873 | .000 (b) |
| | Residual | 2060.836 | 109 | 18.907 | | |
| | Total | 13891.679 | 111 | | | |
| 3 | Regression | 12069.502 | 3 | 4023.167 | 238.452 | .000 (c) |
| | Residual | 1822.177 | 108 | 16.872 | | |
| | Total | 13891.679 | 111 | | | |

a. Predictors: (Constant), Natural log of phones per 100 people
b. Predictors: (Constant), Natural log of phones per 100 people, Births per 1000 population, 1992
c. Predictors: (Constant), Natural log of phones per 100 people, Births per 1000 population, 1992, Natural log of doctors per 10000
d. Dependent Variable: Female life expectancy 1992
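Returning to the collinearity statistics of the five-predictor model, the reciprocal relation between Tolerance and VIF can be checked directly. The sketch below uses the Tolerance values printed in that coefficient table and applies the VIF > 10 rule of thumb; the dictionary and variable names are illustrative. For background, the tolerance of a predictor is usually defined as 1 minus the R Square obtained when that predictor is regressed on the other predictors, although the slides do not spell this out.

```python
# Tolerance values from the coefficient table of the five-predictor model.
tolerance = {
    "Births per 1000 population, 1992": 0.256,
    "Percent urban, 1992": 0.263,
    "Natural log of phones per 100 people": 0.086,
    "Natural log of doctors per 10000": 0.178,
    "Natural log of GDP": 0.105,
}

# VIF is the reciprocal of Tolerance.
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

# Rule of thumb from the text: VIF above 10 (Tolerance below 0.1) is a problem.
flagged = [name for name, value in vif.items() if value > 10]

for name, value in vif.items():
    print(f"{name}: VIF = {value:.2f}")
print("Flagged:", flagged)
```

The recomputed values differ slightly from the VIF column (for example about 11.6 rather than 11.590 for phones) because the printed Tolerance values are rounded to three decimals. Only the natural log of phones exceeds the threshold, with the natural log of GDP close behind.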

What are the variables that are significantly related to female life expectancy?

Coefficients (a)

| Model | | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 60.284 | .562 | | 107.184 | .000 | | |
| | Natural log of phones per 100 people | 5.161 | .243 | .896 | 21.198 | .000 | 1.000 | 1.000 |
| 2 | (Constant) | 72.566 | 2.119 | | 34.239 | .000 | | |
| | Natural log of phones per 100 people | 3.352 | .370 | .582 | 9.048 | .000 | .329 | 3.042 |
| | Births per 1000 population, 1992 | -.327 | .055 | -.383 | -5.957 | .000 | .329 | 3.042 |
| 3 | (Constant) | 68.176 | 2.317 | | 29.418 | .000 | | |
| | Natural log of phones per 100 people | 2.386 | .434 | .414 | 5.496 | .000 | .214 | 4.682 |
| | Births per 1000 population, 1992 | -.246 | .056 | -.288 | -4.364 | .000 | .280 | 3.576 |
| | Natural log of doctors per 10000 | 2.054 | .546 | .284 | 3.761 | .000 | .213 | 4.706 |

a. Dependent Variable: Female life expectancy 1992

Use of regression analysis

• Description (model, system, relation): Relation between life expectancy & birth rate, GDP, …; relation between salary & rank, years of service, …
• Control: Died too young, underpaid, overpaid, …
• Prediction: Life expectancy, salary for newcomers, future salary, …
• Variable screening (important factors): Significant factors for life expectancy, significant factors for salary.

Construction of regression models

1. Hypothesize the form of the model for $\mu_{y \mid x_1, x_2, x_3, \dots, x_q}$
   - Selecting predictor variables.
   - Deciding the functional form of the regression equation.
   - Defining the scope of the model (design range).
2. Collect the sample data (observations, experiments).
3. Use the sample to estimate the unknown parameters in the model.
4. Understand the distribution of the random error.
5. Model diagnostics, residual analysis.
6. Apply the model in decision making.
7. Review the model with new data.

What is a linear model?

Examples of linear models:

• $y = \beta_0 + \beta_1 x + \varepsilon$
• $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
• $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
• $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon$
• $y = \beta_0 + \beta_1 \ln(x) + \varepsilon$
• $y = \beta_0 + \beta_1 e^x + \varepsilon$

A model is linear if it is linear in terms of its parameters.
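"Linear in terms of its parameters" can be made concrete with a short sketch. The code below takes the second-order model with interaction from the list above, treats the nonlinear terms as fixed regressors, and checks that the mean response is a linear function of the parameter vector. The parameter values and function names are hypothetical and chosen only for illustration.

```python
def features(x1, x2):
    """Regressors of the second-order model: 1, x1, x2, x1*x2, x1^2, x2^2."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


def mean_response(betas, x1, x2):
    """beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x1*x2 + beta_4*x1^2 + beta_5*x2^2."""
    return sum(b * f for b, f in zip(betas, features(x1, x2)))


# Two hypothetical parameter vectors (beta_0, ..., beta_5).
betas_a = [1.0, 0.5, -2.0, 0.3, 0.1, 0.05]
betas_b = [0.2, -1.0, 0.4, 0.0, 0.7, -0.3]

# Linearity in the parameters: f(2a + 3b) equals 2 f(a) + 3 f(b) at any fixed x.
combined = [2 * a + 3 * b for a, b in zip(betas_a, betas_b)]
lhs = mean_response(combined, 4.0, 7.0)
rhs = 2 * mean_response(betas_a, 4.0, 7.0) + 3 * mean_response(betas_b, 4.0, 7.0)
print(abs(lhs - rhs) < 1e-9)  # True
```

The response is a curved surface in $x_1$ and $x_2$, yet it is a plain weighted sum of the parameters, which is why such models can still be estimated by ordinary least squares.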