Chapter 2: Simple Linear Regression


• 2.1 The Model
• 2.2-2.3 Parameter Estimation
• 2.4 Properties of Estimators
• 2.5 Inference
• 2.6 Prediction
• 2.7 Analysis of Variance
• 2.8 Regression Through the Origin
• 2.9 Related Models


2.1 The Model

• The measurement y (response) changes in a linear fashion with the setting of the variable x (predictor):

  y = β0 + β1x + ε
      (linear relation + noise)

◦ The linear relation β0 + β1x is deterministic (non-random).
◦ The noise or error ε is random.

• Noise accounts for the variability of the observations about the straight line. No noise ⇒ relation is deterministic. Increased noise ⇒ increased variability.


• Experiment with this simulation program, simple.sim:

> source("simple.sim.R")
> simple.sim(sigma=.01)
> simple.sim(sigma=.1)
> simple.sim(sigma=1)
> simple.sim(sigma=10)

Simulation Examples

[Figure: four panels of simulated (x, y) data, one for each of sigma = 0.01, 0.1, 1, 10; the scatter of y about the line grows with sigma.]

The Setup

• Assumptions:
  1. E[y|x] = β0 + β1x.
  2. Var(y|x) = Var(β0 + β1x + ε|x) = σ².

• Data: Suppose data y1, y2, . . . , yn are obtained at settings x1, x2, . . . , xn, respectively. Then the model on the data is

  yi = β0 + β1xi + εi,  εi i.i.d. N(0, σ²),  so that E[yi|xi] = β0 + β1xi.

Either
  1. the x's are fixed values, measured without error (controlled experiment), OR
  2. the analysis is conditional on the observed values of x (observational study).

2.2-2.3 Parameter Estimation, Fitted Values and Residuals

1. Maximum Likelihood Estimation: distributional assumptions are required.

2. Least Squares Estimation: distributional assumptions are not required.

2.2.1 Maximum Likelihood Estimation

◦ The normal assumption is required:

  f(yi|xi) = (1/(√(2π)σ)) e^{−(yi − β0 − β1xi)²/(2σ²)}

◦ Likelihood:

  L(β0, β1, σ) = ∏_{i=1}^n f(yi|xi) ∝ (1/σⁿ) e^{−(1/(2σ²)) Σ_{i=1}^n (yi − β0 − β1xi)²}

◦ Maximize with respect to β0, β1, and σ².
◦ (β0, β1): equivalent to minimizing

  Σ_{i=1}^n (yi − β0 − β1xi)²

  (i.e. least squares).
◦ σ²: the MLE is SSE/n, which is biased.
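◦ This equivalence is easy to check numerically. The following sketch (toy data, for illustration only) minimizes the negative log-likelihood directly with optim and compares the answer with lm's least-squares fit:

> # Sketch: the normal-likelihood MLE of (b0, b1) agrees with least squares.
> set.seed(1)
> x <- 1:10
> y <- 2 + 0.5*x + rnorm(10)
> negloglik <- function(par)   # par = (b0, b1, log(sigma))
+   -sum(dnorm(y, mean=par[1] + par[2]*x, sd=exp(par[3]), log=TRUE))
> optim(c(0, 0, 0), negloglik)$par[1:2]   # MLE of (b0, b1) ...
> coef(lm(y ~ x))                         # ... matches the least-squares estimates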

2.2.2 Least Squares Estimation

◦ Assumptions:
  1. E[εi] = 0
  2. Var(εi) = σ²
  3. the εi's are independent.

• Note that normality is not required.

Method

• Minimize

  S(β0, β1) = Σ_{i=1}^n (yi − β0 − β1xi)²

with respect to the parameters, or regression coefficients, β0 and β1; the minimizers are β̂0 and β̂1.

◦ Justification: we want the fitted line to pass as close to all of the points as possible.

◦ Aim: small residuals (observed minus fitted response values):

  ei = yi − β̂0 − β̂1xi

Look at the following plots:

> source("roller2.plot")
> roller2.plot(a=14, b=0)
> roller2.plot(a=2, b=2)
> roller2.plot(a=12, b=1)
> roller2.plot(a=-2, b=2.67)

[Figure: four panels plotting depression in lawn (mm) against roller weight (t), showing fitted values, data values, and positive/negative residuals for the lines a=14, b=0; a=2, b=2; a=12, b=1; a=−2, b=2.67.]

◦ The first three lines above do not pass as close to the plotted points as the fourth, even though the sum of the residuals is about the same in all four cases.
◦ Negative residuals cancel out positive residuals.

The Key: minimize squared residuals

> source("roller3.plot.R")
> roller3.plot(14,0); roller3.plot(2,2); roller3.plot(12,1)
> roller3.plot(a=-2, b=2.67)   # small SS

[Figure: the same four panels of depression in lawn (mm) against roller weight (t), now displaying squared residuals; the line a=−2, b=2.67 gives the smallest sum of squares.]

Unbiased estimate of σ²

  σ̂² = (1/(n − # parameters estimated)) Σ_{i=1}^n ei²
     = (1/(n − 2)) Σ_{i=1}^n ei²

* n observations ⇒ n degrees of freedom
* 2 degrees of freedom are required to estimate the parameters
* the residuals retain n − 2 degrees of freedom
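◦ In R the same computation can be done directly; a quick sketch with toy data:

> # sigma^2 estimate by hand: SSE/(n-2), compared with lm's value.
> x <- 1:10; y <- 2 + 0.5*x + rnorm(10)
> fit <- lm(y ~ x)
> sum(resid(fit)^2) / df.residual(fit)   # SSE/(n-2), by hand
> summary(fit)$sigma^2                   # the same quantity from summary()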

Alternative Viewpoint

* y = (y1, y2, . . . , yn) is a vector in n-dimensional space (n degrees of freedom).
* The fitted values ŷi = β̂0 + β̂1xi also form a vector in n-dimensional space:

  ŷ = (ŷ1, ŷ2, . . . , ŷn)  (2 degrees of freedom)

* Least squares seeks to minimize the distance between y and ŷ.
* The distance between n-dimensional vectors u and v is the square root of

  Σ_{i=1}^n (ui − vi)²

* Thus, the squared distance between y and ŷ is

  Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n ei²

Regression Coefficient Estimators

• The minimizers of

  S(β0, β1) = Σ_{i=1}^n (yi − β0 − β1xi)²

are

  β̂0 = ȳ − β̂1x̄  and  β̂1 = Sxy/Sxx

where

  Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ)  and  Sxx = Σ_{i=1}^n (xi − x̄)²

HomeMade R Estimators

> ls.est
> predict(roller.lm, newdata=data.frame(weight
> predict(roller.lm, newdata=data.frame(weight
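◦ A sketch of what a home-made estimator such as ls.est might look like, following the formulas above (the function body is an assumption, not the original code; the roller data frame with columns weight and depression is assumed in the commented example):

> ls.est <- function(x, y) {
+   Sxx <- sum((x - mean(x))^2)
+   Sxy <- sum((x - mean(x)) * (y - mean(y)))
+   b1  <- Sxy/Sxx                 # slope: Sxy/Sxx
+   b0  <- mean(y) - b1*mean(x)    # intercept: ybar - b1*xbar
+   c(b0=b0, b1=b1)
+ }
> # e.g. with the roller data:
> # ls.est(roller$weight, roller$depression)
> # coef(lm(depression ~ weight, data=roller))   # should agree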

2.7 Analysis of Variance

◦ Test H0 : β1 = 0. Under the alternative, E[MSR] = σ² + β1²Sxx > σ².
◦ A reasonable test is

  F0 = MSR/MSE ∼ F1,n−2 under H0

◦ Large F0 ⇒ evidence against H0.
◦ Note t²ν = F1,ν, so this is really the same test as

  t0² = (β̂1/√(MSE/Sxx))² = β̂1²Sxx/MSE = MSR/MSE

The ANOVA table

Source  df    SS             MS          F
Reg.    1     β̂1²Sxx        β̂1²Sxx      MSR/MSE
Error   n−2   Syy − β̂1²Sxx  SSE/(n−2)
Total   n−1   Syy

roller data example:

> anova(roller.lm)   # R code
Analysis of Variance Table

Response: depression
          Df Sum Sq Mean Sq F value Pr(>F)
weight     1    658     658    14.5 0.0052
Residuals  8    363      45

(Recall that the t-statistic for testing β1 = 0 had been 3.81 = √14.5.)

◦ Ex. Write an R function to compute these ANOVA quantities.
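◦ One way to do the exercise (a sketch, using only the formulas from the table):

> # Sketch: ANOVA quantities for simple linear regression, from scratch.
> slr.anova <- function(x, y) {
+   n   <- length(y)
+   Sxx <- sum((x - mean(x))^2)
+   Syy <- sum((y - mean(y))^2)
+   b1  <- sum((x - mean(x))*(y - mean(y))) / Sxx
+   SSR <- b1^2 * Sxx            # regression SS, 1 df
+   SSE <- Syy - SSR             # error SS, n-2 df
+   F0  <- SSR / (SSE/(n - 2))   # MSR/MSE
+   c(SSR=SSR, SSE=SSE, F=F0, p=1 - pf(F0, 1, n - 2))
+ }
> # slr.anova(roller$weight, roller$depression)   # compare with anova(roller.lm)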

Confidence Interval for σ²

◦ SSE/σ² ∼ χ²n−2, so

  P(χ²n−2,1−α/2 ≤ SSE/σ² ≤ χ²n−2,α/2) = 1 − α

so

  P(SSE/χ²n−2,α/2 ≤ σ² ≤ SSE/χ²n−2,1−α/2) = 1 − α

e.g. roller data: SSE = 363

  χ²8,.975 = 2.18  (R code: qchisq(.025, 8))
  χ²8,.025 = 17.5  (R code: qchisq(.975, 8))

  (363/17.5, 363/2.18) = (20.7, 166.5)
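◦ The whole interval in one line of R (a sketch using the numbers above):

> SSE <- 363
> SSE / qchisq(c(.975, .025), df=8)   # gives (20.7, 166.5)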

2.7.1 R² - Coefficient of Determination

◦ R² is the fraction of the response variability explained by the regression:

  R² = SSR/Syy

◦ 0 ≤ R² ≤ 1. Values near 1 imply that most of the variability is explained by the regression.

◦ roller data: SSR = 658 and Syy = 1021, so

  R² = 658/1021 = .644

R output

> summary(roller.lm)
...
Multiple R-Squared: 0.644, ...

◦ Ex. Write an R function which computes R².
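◦ One possible answer (a sketch):

> # Sketch: R^2 = SSR/Syy = 1 - SSE/Syy for a simple linear regression fit.
> rsq <- function(x, y) {
+   fit <- lm(y ~ x)
+   1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
+ }
> # rsq(roller$weight, roller$depression)   # 0.644 for the roller data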

◦ Another interpretation:

  E[R²] ≈ E[SSR]/E[Syy] = (β1²Sxx + σ²)/((n − 1)σ² + β1²Sxx)
        ≈ (β1²·Sxx/(n − 1)) / (σ² + β1²·Sxx/(n − 1))

for large n. (Note: this differs from the textbook.)

Properties of R²

• Thus, R² increases as
  1. Sxx increases (the x's become more spread out)
  2. σ² decreases

◦ Cautions
  1. R² does not measure the magnitude of the regression slope.
  2. R² does not measure the appropriateness of the linear model.
  3. A large value of R² does not imply that the regression model will be an accurate predictor.

Hazards of Regression

• Extrapolation: predicting y values outside the range of observed x values. There is no guarantee that a future response will behave in the same linear manner outside the observed range.

e.g. Consider an experiment with a spring. The spring is stretched to several different lengths x (in cm) and the restoring force F (in Newtons) is measured:

  x:  3    4    5    6
  F:  5.1  6.2  7.9  9.5

> spring.lm <- lm(F ~ x - 1)   # model through the origin (assumed: only a slope appears below)
> summary(spring.lm)
Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   1.5884     0.0232    68.6  6.8e-06

The fitted model relating F to x is F̂ = 1.58x. Can we predict the restoring force for the spring if it has been extended to a length of 15 cm?
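◦ R will happily compute such a prediction; the hazard is that nothing in the output warns that x = 15 lies far outside the 3-6 cm range of the data. (A sketch, assuming the spring.lm fit above:)

> predict(spring.lm, newdata=data.frame(x=15))   # 1.5884 * 15 = 23.8 N
> # The spring may have been stretched past its linear (Hookean) range
> # at 15 cm, so this number could be seriously misleading.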

• High leverage observations: x values at the extremes of the range have more influence on the slope of the regression than observations near the middle of the range.

• Outliers can distort the regression line. Outliers may be incorrectly recorded OR may be an indication that the linear relation or constant variance assumption is incorrect.

• A regression relationship does not mean that there is a cause-and-effect relationship. e.g. The following data give the number of lawyers and the number of homicides in a given year for a number of towns:

  no. lawyers:   1  2  7  10  12  14  15  18
  no. homicides: 0  0  2   5   6   6   7   8

Note that the number of homicides increases with the number of lawyers. Does this mean that in order to reduce the number of homicides, one should reduce the number of lawyers?

• Beware of nonsense relationships. e.g. It is possible to show that the area of some lakes in Manitoba is related to elevation. Do you think there is a real reason for this? Or is the apparent relation just a result of chance?

2.8 Regression Through the Origin

◦ Intercept = 0:

  yi = β1xi + εi

◦ Maximum likelihood and least squares ⇒ minimize

  Σ_{i=1}^n (yi − β1xi)²

which gives

  β̂1 = Σ xiyi / Σ xi²,  ei = yi − β̂1xi,  SSE = Σ ei²

◦ Maximum likelihood ⇒

  σ̂² = SSE/n

◦ Unbiased estimator:

  σ̂² = MSE = SSE/(n − 1)

◦ Properties of β̂1:

  E[β̂1] = β1,  Var(β̂1) = σ²/Σ xi²

◦ 1 − α C.I. for β1:

  β̂1 ± tn−1,α/2 √(MSE/Σ xi²)

◦ 1 − α C.I. for E[y|x0]:

  ŷ0 ± tn−1,α/2 √(MSE·x0²/Σ xi²)

since Var(β̂1x0) = σ²x0²/Σ xi².

◦ 1 − α P.I. for y, given x0:

  ŷ0 ± tn−1,α/2 √(MSE(1 + x0²/Σ xi²))

◦ R code:

> roller.lm <- lm(depression ~ weight - 1)   # through the origin (assumed: no intercept in the output)
> summary(roller.lm)
Coefficients:
       Estimate Std. Error t value Pr(>|t|)
weight    2.392      0.299    7.99  2.2e-05

Residual standard error: 6.43 on 9 degrees of freedom
Multiple R-Squared: 0.876, Adjusted R-squared: 0.863
F-statistic: 63.9 on 1 and 9 DF, p-value: 2.23e-005
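◦ The hand formulas above can be checked against this output (a sketch; the roller data frame with columns weight and depression is assumed):

> x <- roller$weight; y <- roller$depression
> b1  <- sum(x*y)/sum(x^2)            # beta1.hat = sum(x*y)/sum(x^2); 2.392
> e   <- y - b1*x
> MSE <- sum(e^2)/(length(y) - 1)     # n - 1 df through the origin
> c(b1=b1, se=sqrt(MSE/sum(x^2)))     # should match the summary above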

> predict(roller.lm, newdata=data.frame(weight

2.9 Related Models

Testing H0 : ρ = 0
◦ With r the sample correlation, t0 = r√(n − 2)/√(1 − r²) has a tn−2 distribution under H0; reject H0 when P(|t| > |t0|) is small.

Testing H0 : ρ = ρ0

◦ For large n,

  Z = (1/2) log((1 + r)/(1 − r))

has an approximate normal distribution with mean

  μZ = (1/2) log((1 + ρ)/(1 − ρ))

and variance

  σZ² = 1/(n − 3)

◦ Thus,

  Z0 = [log((1 + r)/(1 − r)) − log((1 + ρ0)/(1 − ρ0))] / (2√(1/(n − 3)))

has an approximate standard normal distribution when the null hypothesis is true.
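◦ A sketch of this test as an R function:

> # Sketch: Fisher z test of H0: rho = rho0.
> fisher.z.test <- function(r, n, rho0=0) {
+   Z0 <- (log((1 + r)/(1 - r)) - log((1 + rho0)/(1 - rho0))) /
+         (2*sqrt(1/(n - 3)))
+   c(Z0=Z0, p.value=2*pnorm(-abs(Z0)))   # two-sided p-value
+ }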

Confidence Interval for ρ

• Confidence interval for (1/2) log((1 + ρ)/(1 − ρ)):

  Z ± zα/2 √(1/(n − 3))

• Find the endpoints (l, u) of this confidence interval and solve for ρ:

  ((e^{2l} − 1)/(e^{2l} + 1), (e^{2u} − 1)/(e^{2u} + 1))

R code for fossum example

• Find a 95% confidence interval for the correlation between total length and head length:

> source("fossum.R")
> attach(fossum)
> n