Multiple Regression

Multiple Regression Model:

y = β0 + β1 x1 + β2 x2 + ... + βq xq + ε

A technique for relating a response (dependent, output) variable y to a set of explanatory (independent, input, predictor) variables x1, x2, ..., xq — that is, for modeling the relationship between one response variable and several predictor variables.

The model has a deterministic component, μ_{y | x1, x2, x3, ..., xq} = β0 + β1 x1 + β2 x2 + ... + βq xq, and a random component ε.

The parameters β0, β1, β2, β3, ..., βq in the model can all be estimated by least-squares estimators β̂0, β̂1, β̂2, β̂3, ..., β̂q, which minimize the sum of squared errors

Σ_i e_i² = Σ_i [y_i − (β0 + β1 x1i + ... + βq xqi)]²

The Least-Squares Regression Equation:

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3 + ... + β̂q xq
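The least-squares fit above can be sketched numerically. This is a minimal illustration with made-up toy data (not the class data): the design matrix gets a column of ones for the intercept, and `numpy.linalg.lstsq` returns the coefficient estimates that minimize the sum of squared errors.

```python
import numpy as np

# Hypothetical toy data: weight (y) modeled from age (x1) and height (x2).
age = np.array([140.0, 150.0, 160.0, 170.0, 180.0])   # x1
height = np.array([50.0, 53.0, 55.0, 58.0, 60.0])     # x2
weight = np.array([70.0, 80.0, 88.0, 100.0, 110.0])   # y

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(age), age, height])

# Least-squares estimates (b0-hat, b1-hat, b2-hat).
b, *_ = np.linalg.lstsq(X, weight, rcond=None)

y_hat = X @ b                 # fitted values from the regression equation
residuals = weight - y_hat    # e_i = y_i - y-hat_i
sse = residuals @ residuals   # the minimized sum of squared errors
```

A defining property of the least-squares fit is that the residuals are orthogonal to every column of the design matrix, which is what makes the SSE a minimum.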
Example: Study weight (y) using age (x1) and height (x2).

Data: Age (months), height (inches), and weight (pounds) were recorded for a group of school children.

[Scatter plots: Weight vs. Age and Weight vs. Height]
Scatter plots above show that both age and height are linearly related to weight.
Model: y = β0 + β1 x1 + β2 x2 + ε, with weight y, age x1, and height x2
SPSS output:

Model Summary
Model 1: R = .794, R Square = .630, Adjusted R Square = .627, Std. Error of the Estimate = 11.868
a. Predictors: (Constant), Height, Age

ANOVA
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   56233.254        2     28116.627     199.610   .000
Residual     32960.761        234   140.858
Total        89194.015        236
a. Predictors: (Constant), Height, Age   b. Dependent Variable: Weight

Test for significance of the model: p-value = .000 < .05
H0: Model is insignificant (the βi's are all zero). Ha: Model is significant (some βi's are not zero).

Coefficient of determination (R Square): the percentage of variability in the response variable (Weight) that can be described by the predictor variables (Age, Height) through the model.
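The entries of the ANOVA table above fit together arithmetically; this short check reproduces R Square, Adjusted R Square, and the F statistic from the sums of squares and degrees of freedom in the table.

```python
# Figures taken from the ANOVA table above.
ss_regression = 56233.254
ss_residual = 32960.761
ss_total = ss_regression + ss_residual    # 89194.015
df_reg, df_res = 2, 234
n = df_reg + df_res + 1                   # 237 children in the sample

r_square = ss_regression / ss_total                          # R^2 = SSR / SST
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_res         # adjusts for q = 2 predictors
f_stat = (ss_regression / df_reg) / (ss_residual / df_res)   # F = MSR / MSE
```

Rounded, these reproduce the .630, .627, and 199.610 reported by SPSS.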
Multiple Regression Model estimation: SPSS output

Coefficients
Model 1      B          Std. Error   Beta    t         Sig.    Tolerance   VIF
(Constant)   -127.820   12.099               -10.565   .000
Age          .240       .055         .228    4.360     .000    .579        1.727
Height       3.090      .257         .627    12.008    .000    .579        1.727
a. Dependent Variable: Weight
Tests for Regression Coefficients:
H0: β0 = 0 vs. Ha: β0 ≠ 0, p-value = .000 < .05
H0: β1 = 0 vs. Ha: β1 ≠ 0, p-value = .000 < .05
H0: β2 = 0 vs. Ha: β2 ≠ 0, p-value = .000 < .05

Least-squares regression equation:

ŷ = −127.82 + .24 x1 + 3.09 x2

Collinearity statistics: if the VIF (Variance Inflation Factor) is greater than 10, there is a multicollinearity problem. (Some suggest the VIF should be less than 4.)

The average weight of children who are 144 months old and 55 inches tall, as estimated by the model, would be −127.82 + .24(144) + 3.09(55) = 76.69 lbs.
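The prediction above is just the fitted equation evaluated at (x1, x2) = (144, 55); wrapping it in a small helper (the function name here is illustrative, not from the slides) makes the arithmetic explicit.

```python
def predict_weight(age_months, height_inches):
    # The least-squares regression equation from the SPSS output above.
    return -127.82 + 0.24 * age_months + 3.09 * height_inches

# Average weight for a child 144 months old and 55 inches tall:
estimate = predict_weight(144, 55)   # -127.82 + 34.56 + 169.95 = 76.69 lbs
```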
How to interpret β0, β1 and β2?
Model: y = β0 + β1 x1 + β2 x2 + ε, where y: Weight, x1: Age, x2: Height

β0 is the constant, or the y-intercept, in the model. It is the average response when both predictor variables are 0.
β1 is the rate of change of the expected (average) weight per unit change of age, adjusted for the height variable.
β2 is the rate of change of the expected (average) weight per unit change of height, adjusted for the age variable.

Other possible models: (y: Weight, x1: Age, x2: Height)
• y = β0 + β1 x1 + ε
• y = β0 + β2 x2 + ε

With an interaction term (non-additive):
• y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε
• y = β0 + β1 x1 + β3 x1 x2 + ε
• y = β0 + β2 x2 + β3 x1 x2 + ε
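In the non-additive models above, the interaction term is simply the elementwise product of the two predictors added as an extra column of the design matrix. A minimal sketch with toy values (not the class data):

```python
import numpy as np

# Toy predictor values for illustration.
age = np.array([100.0, 120.0, 140.0, 160.0])     # x1
height = np.array([48.0, 52.0, 56.0, 60.0])      # x2

# Interaction term: x1 * x2, elementwise.
interaction = age * height

# Design matrix for y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + e.
X = np.column_stack([np.ones_like(age), age, height, interaction])
```

Because the product column is strongly correlated with its parent columns, adding it often inflates the VIFs, which is exactly what the SPSS output below shows.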
Coefficient Estimation with Interaction Between Age and Height
Model: y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε, with weight y, age x1, and height x2

Coefficients
Model 1      B           Std. Error   Beta     t        Sig.    Tolerance   VIF
(Constant)   66.996      106.189               .631     .529
Age          -.973       .660         -.923    -1.476   .141    .004        250.009
Height       -3.13E-02   1.710        -.006    -.018    .985    .013        77.016
INTAG_HT     1.936E-02   .010         1.636    1.847    .066    .002        501.996
a. Dependent Variable: Weight

• The high VIFs imply very serious collinearity.
• The interaction term should not be used in the model.

For boys:

Coefficients
Model 1      B          Std. Error   Beta    t        Sig.    Tolerance   VIF
(Constant)   -113.713   15.590               -7.294   .000
Age          .308       .084         .289    3.672    .000    .443        2.259
Height       2.681      .368         .574    7.283    .000    .443        2.259
a. Dependent Variable: Weight

• Is there serious collinearity?
• Write the weight prediction equation using age and height as predictor variables.
• Find the average weight for boys that are 144 months old and 55 inches tall.
For girls:

Coefficients
Model 1      B          Std. Error   Beta    t        Sig.    Tolerance   VIF
(Constant)   -150.597   20.767               -7.252   .000
Age          .191       .076         .186    2.524    .013    .704        1.420
Height       3.604      .408         .650    8.838    .000    .704        1.420
a. Dependent Variable: Weight

• Is there serious collinearity?
• Write the weight prediction equation using age and height as predictor variables.
• Find the average weight for girls that are 144 months old and 55 inches tall.

Indicator Variables

Indicator variables are binary variables that take only two possible values, 0 and 1, and can be used for including categorical variables in the model. Example: Gender, coded Male: 1, Female: 0.

Group Statistics
Weight   Gender   N     Mean      Std. Deviation   Std. Error Mean
         Male     126   103.448   19.968           1.779
         Female   111   98.878    18.616           1.767

One Binary Independent Variable Model: (a model for the two-independent-samples situation under the equal-variances condition)
A two-independent-samples t-test can be modeled with a simple linear regression model.

Model: y = β0 + β1 x1 + ε, where y: Weight, x1: Gender (x1 = 0 for female, x1 = 1 for male)

SPSS output for the two-independent-samples t-test comparing mean weight between males and females:

Independent Samples Test (Weight)
                              Levene's F   Sig.   t       df        Sig. (2-tailed)   Mean Diff.   Std. Error Diff.   95% CI Lower   95% CI Upper
Equal variances assumed       .428         .630   1.815   235       .071              4.570        2.518              -.392          9.532
Equal variances not assumed                       1.823   234.233   .070              4.570        2.507              -.370          9.510

SPSS output for linear regression with gender as the predictor:

Coefficients
Model 1      B        Std. Error   Beta    t        Sig.    Tolerance   VIF
(Constant)   98.878   1.836                53.846   .000
Gender       4.570    2.518        .118    1.815    .071    1.000       1.000
a. Dependent Variable: Weight

When x1 = 0: μ_y = β0. When x1 = 1: μ_y = β0 + β1.
The difference of the means of the two categories is β1.
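The coincidence between the two outputs (intercept 98.878 = female mean, slope 4.570 = mean difference, t = 1.815 in both) is a general fact about regressing on a 0/1 dummy. A sketch with toy data (not the slide data) showing that the fitted intercept is the reference-group mean and the slope is the between-group mean difference:

```python
import numpy as np

# Toy samples for illustration only.
male = np.array([100.0, 105.0, 110.0, 101.0])
female = np.array([95.0, 100.0, 99.0])

y = np.concatenate([male, female])
gender = np.concatenate([np.ones(male.size), np.zeros(female.size)])  # 1 = male, 0 = female

# Simple regression of weight on the gender indicator.
X = np.column_stack([np.ones_like(y), gender])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
# b0 is the female (reference group) mean; b1 is the male-minus-female difference.
```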
Gender and Age as Predictor Variables
Model: y = β0 + β1 x1 + β2 x2 + ε, with y weight, x1 age, and x2 gender (x2 = 0 female, x2 = 1 male)

Coefficients
Model 1      B         Std. Error   Beta    t        Sig.    Tolerance   VIF
(Constant)   -11.181   8.778                -1.274   .204
Age          .669      .053         .634    12.705   .000    1.000       1.000
Gender       4.539     1.942        .117    2.338    .020    1.000       1.000
a. Dependent Variable: Weight

Age and Gender are both significant variables for predicting weight. There is a significant difference in average weight between genders when adjusted for the age variable.

[Scatter plot: Weight vs. Age, by Gender (Male/Female)]
Age, Height, & Gender as Predictors
Model: y = β0 + β1 x1 + β2 x2 + β3 x3 + ε, with y weight, x1 age, x2 height, and x3 gender (x3 = 0 female, x3 = 1 male)

[3-D scatter plot: Weight vs. Age and Height, by Gender (Male/Female)]
Coefficients
Model 1      B          Std. Error   Beta    t         Sig.    Tolerance   VIF
(Constant)   -128.209   12.264               -10.454   .000
Age          .238       .056         .226    4.250     .000    .562        1.780
Height       3.105      .267         .630    11.621    .000    .539        1.854
Gender       -.338      1.604        -.009   -.210     .834    .932        1.073
a. Dependent Variable: Weight

The Gender variable becomes insignificant with the Age and Height variables in the model. When comparing the difference in average weights between genders, adjusted for the age and height variables, the difference is statistically insignificant.

How to include a categorical variable in the model? The proper way is to use indicator variables. For a categorical variable with k categories, one should set up k − 1 indicator variables. Example: "Race" variable with White = 1, Black = 2, Hispanic = 3 — two indicator variables will be needed.

Common mistake: using the internally coded values of a categorical explanatory variable directly in the linear regression calculation.
Example: "Race": White = 1, Black = 2, Hispanic = 3.
Use indicator variables x1 and x2 for the Race variable:
• x1 = 1 represents "White", otherwise x1 = 0,
• x2 = 1 represents "Black", otherwise x2 = 0,
• x1 = 0 and x2 = 0 represents "Hispanic".

Model: y = β0 + β1 x1 + β2 x2 + β3 x3 + ε, where y is "Body Fat Percentage", x1 and x2 code "Race", and x3 is "Number of hours of exercise per week".

Interpretation of the model:
Race: White      x1 = 1 and x2 = 0,   μ_y = β0 + β1 + β3 x3
Race: Black      x1 = 0 and x2 = 1,   μ_y = β0 + β2 + β3 x3
Race: Hispanic   x1 = 0 and x2 = 0,   μ_y = β0 + β3 x3
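The k − 1 coding above can be sketched as a small helper (the function name is illustrative): a 3-category variable gets 2 indicators, with "Hispanic" as the reference level whose effect is absorbed into β0.

```python
def race_indicators(race):
    # k = 3 categories -> k - 1 = 2 indicator variables;
    # "Hispanic" is the reference level (x1 = 0, x2 = 0).
    x1 = 1 if race == "White" else 0
    x2 = 1 if race == "Black" else 0
    return (x1, x2)

coded = [race_indicators(r) for r in ["White", "Black", "Hispanic"]]
```

Using the raw codes 1, 2, 3 directly would instead force the "Black minus White" and "Hispanic minus Black" effects to be equal — the common mistake noted above.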
Suppose that the least-squares regression equation for the model above is

ŷ = 20 + 2.1 x1 + 1.3 x2 − .1 x3

Estimate the average body fat for a white person who exercises 10 hours per week: 20 + 2.1(1) + 1.3(0) − .1(10) = 21.1
Estimate the average body fat for a black person who exercises 10 hours per week: 20 + 2.1(0) + 1.3(1) − .1(10) = 20.3
Estimate the average body fat for a hispanic person who exercises 10 hours per week: 20 + 2.1(0) + 1.3(0) − .1(10) = 19.0

Example: Study female life expectancy using percentage of urbanization and birth rate.

[Scatter plots: Female life expectancy 1992 vs. Births per 1000 population, 1992, and vs. Percent urban, 1992]

Model: y = β0 + β1 x1 + β2 x2 + ε, with y life expectancy, x1 birth rate, and x2 percent urbanized

Model Summary
Model 1: R = .904, R Square = .817, Adjusted R Square = .813, Std. Error of the Estimate = 4.89
a. Predictors: (Constant), Births per 1000 population, 1992, Percent urban, 1992

ANOVA
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   12577.056        2     6288.528      262.595   .000
Residual     2825.820         118   23.948
Total        15402.876        120
a. Predictors: (Constant), Births per 1000 population, 1992, Percent urban, 1992   b. Dependent Variable: Female life expectancy 1992

Test for significance of the model: p-value = .000 < .05
H0: Model is insignificant (the βi's are all zero). Ha: Model is significant (some βi's are not zero).

Coefficient of determination (R Square): the percentage of variability in the response variable (female life expectancy) that can be described by the predictor variables (birth rate, percentage of urbanization) through the model.
Model estimation: (SPSS output)

Coefficients
Model 1                            B        Std. Error   Beta    t         Sig.    Tolerance   VIF
(Constant)                         76.216   2.431                31.350    .000
Births per 1000 population, 1992   -.555    .045         -.648   -12.196   .000    .551        1.814
Percent urban, 1992                .154     .025         .331    6.238     .000    .551        1.814
a. Dependent Variable: Female life expectancy 1992

Least-squares regression equation for estimating the average response value:

ŷ = 76.216 − .555 x1 + .154 x2

Tests for Regression Coefficients:
H0: β0 = 0 vs. Ha: β0 ≠ 0, p-value = .000 < .05
H0: β1 = 0 vs. Ha: β1 ≠ 0, p-value = .000 < .05
H0: β2 = 0 vs. Ha: β2 ≠ 0, p-value = .000 < .05

Collinearity statistics: if the VIF (Variance Inflation Factor) is greater than 10, there is a multicollinearity problem. (Some suggest the VIF should be less than 4.)

The average female life expectancy for countries whose birth rate per 1000 is 30 and whose percentage of urbanization is 40 would be 76.216 − .555(30) + .154(40) = 65.726.
Female Life Expectancy

Response variable: Female life expectancy.
Explanatory variables: Birth rate, Urbanization, Phones, Doctors, and GDP.

Which variables are significant factors for female life expectancy in the model?

[Multiple scatter plot before transformation: Female life expectancy 1992, Births per 1000 population, Percent urban, 1992, Phones per 100 people, Doctors per 10,000 people, GDP per capita]
[Multiple scatter plot after ln(x) transformation on Phones, Doctors, GDP]

Model Summary
Model 1: R = .934, R Square = .873, Adjusted R Square = .867, Std. Error of the Estimate = 4.08, Durbin-Watson = 2.103
a. Predictors: (Constant), Natural log of GDP, Percent urban, 1992, Births per 1000 population, 1992, Natural log of doctors per 10000, Natural log of phones per 100 people
b. Dependent Variable: Female life expectancy 1992

ANOVA
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   12123.330        5     2424.666      145.342   .000
Residual     1768.348         106   16.683
Total        13891.679        111
a. Predictors: (Constant), Natural log of GDP, Percent urban, 1992, Births per 1000 population, 1992, Natural log of doctors per 10000, Natural log of phones per 100 people
b. Dependent Variable: Female life expectancy 1992

Multicollinearity

Coefficients
Model 1                                B           Std. Error   Beta    t        Sig.    Tolerance   VIF
(Constant)                             77.448      5.829                13.287   .000
Births per 1000 population, 1992       -.272       .058         -.319   -4.659   .000    .256        3.903
Percent urban, 1992                    1.937E-02   .031         .043    .629     .531    .263        3.805
Natural log of phones per 100 people   3.175       .679         .552    4.675    .000    .086        11.590
Natural log of doctors per 10000       1.894       .593         .262    3.194    .002    .178        5.611
Natural log of GDP                     -1.390      .784         -.190   -1.772   .079    .105        9.543
a. Dependent Variable: Female life expectancy 1992

Tolerance measures the strength of the linear relation between the independent variables. It should preferably be higher than 0.1. VIF is the reciprocal of Tolerance.

Stepwise Selection

ANOVA
Model          Sum of Squares   df    Mean Square   F         Sig.
1  Regression  11159.884        1     11159.884     449.370   .000
   Residual    2731.795         110   24.834
   Total       13891.679        111
2  Regression  11830.842        2     5915.421      312.873   .000
   Residual    2060.836         109   18.907
   Total       13891.679        111
3  Regression  12069.502        3     4023.167      238.452   .000
   Residual    1822.177         108   16.872
   Total       13891.679        111
a. Predictors: (Constant), Natural log of phones per 100 people
b. Predictors: (Constant), Natural log of phones per 100 people, Births per 1000 population, 1992
c. Predictors: (Constant), Natural log of phones per 100 people, Births per 1000 population, 1992, Natural log of doctors per 10000
d. Dependent Variable: Female life expectancy 1992
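The Tolerance/VIF diagnostics above can be sketched directly: regress one predictor on the remaining predictors, take the R² of that auxiliary regression, and then Tolerance = 1 − R² and VIF = 1 / Tolerance. This toy example (simulated data, not the life-expectancy data) builds a deliberately collinear pair of predictors:

```python
import numpy as np

# Simulate two collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)   # x2 is mostly a copy of x1

# Auxiliary regression: x1 on (intercept, x2).
X = np.column_stack([np.ones(x2.size), x2])
b, *_ = np.linalg.lstsq(X, x1, rcond=None)
resid = x1 - X @ b
r2 = 1 - (resid @ resid) / ((x1 - x1.mean()) @ (x1 - x1.mean()))

tolerance = 1 - r2      # low tolerance = strong linear relation among predictors
vif = 1 / tolerance     # large VIF flags multicollinearity
```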
What are the variables that are significantly related to female life expectancy?

Coefficients (stepwise selection)
Model                                    B        Std. Error   Beta    t         Sig.    Tolerance   VIF
1  (Constant)                            60.284   .562                 107.184   .000
   Natural log of phones per 100 people  5.161    .243         .896    21.198    .000    1.000       1.000
2  (Constant)                            72.566   2.119                34.239    .000
   Natural log of phones per 100 people  3.352    .370         .582    9.048     .000    .329        3.042
   Births per 1000 population, 1992      -.327    .055         -.383   -5.957    .000    .329        3.042
3  (Constant)                            68.176   2.317                29.418    .000
   Natural log of phones per 100 people  2.386    .434         .414    5.496     .000    .214        4.682
   Births per 1000 population, 1992      -.246    .056         -.288   -4.364    .000    .280        3.576
   Natural log of doctors per 10000      2.054    .546         .284    3.761     .000    .213        4.706
a. Dependent Variable: Female life expectancy 1992

Use of regression analysis

• Description (model, system, relation): relation between life expectancy & birth rate, GDP, ...; relation between salary & rank, years of service, ...
• Control: died too young, underpaid, overpaid, ...
• Prediction: life expectancy, salary for newcomers, future salary, ...
• Variable screening (important factors): significant factors for life expectancy; significant factors for salary.
Construction of regression models

1. Hypothesize the form of the model for μ_{y | x1, x2, x3, ..., xq}:
   – Selecting predictor variables.
   – Deciding the functional form of the regression equation.
   – Defining the scope of the model (design range).
2. Collect the sample data (observations, experiments).
3. Use the sample to estimate the unknown parameters in the model.
4. Understand the distribution of the random error.
5. Model diagnostics, residual analysis.
6. Apply the model in decision making.
7. Review the model with new data.

What is a linear model?

Examples of linear models:
• y = β0 + β1 x + ε
• y = β0 + β1 x + β2 x² + ε
• y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε
• y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + β4 x1² + β5 x2² + ε
• y = β0 + β1 ln(x) + ε
• y = β0 + β1 eˣ + ε

A model is linear if it is linear in terms of its parameters.
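The point that "linear" means linear in the parameters can be demonstrated in a few lines: a quadratic curve in x is still a linear model, so ordinary least squares recovers its coefficients exactly (noiseless toy data for illustration):

```python
import numpy as np

# y = b0 + b1*x + b2*x^2 is linear in (b0, b1, b2) even though it is curved in x.
x = np.linspace(0.0, 4.0, 20)
y = 1.0 + 2.0 * x + 0.5 * x**2       # true coefficients: 1.0, 2.0, 0.5 (no noise)

# Treat x and x^2 as two separate predictor columns.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

By contrast, a model like y = β0 · e^(β1 x) + ε is nonlinear in β1 and cannot be fit this way.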