## 09. Regression line

Author: Diane Wilcox
9/16/09

Regression: FPP 10 (kind of)

Regression line

- The correlation coefficient is a nice numerical summary of two quantitative variables.
- It indicates the direction and strength of the association.
- But does it quantify the association?
- It would be of interest to do this for:
  - predictions
  - understanding phenomena

Regression line

- Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables.
- If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot.
- This line represents a mathematical model. Later we will make the mathematical model a statistical one.

Slope intercept form review

Regression line

- Slope intercept form notation: $y = mx + b$
- Regression form notation: $\hat{y} = a + bx$

Regression

Price of Homes Based on Square Feet: Price = -90.2458 + 0.1598 SQFT, r = 0.8718945


Which line is best

Different people might draw different lines by eye on a scatterplot:

- Price = -90.2458 + 0.1598 SQFT (red)
- Price = -300 + 0.3 SQFT (blue)
- Price = 0 + 0.1 SQFT (green)

Which model to use

- What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one?
- What are some ways that we can numerically rank the different models (i.e. the different lines)?
- This will come later in the course.
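As a rough preview of what "numerically ranking" lines could look like, the sketch below scores the three candidate lines by their sum of squared residuals. That criterion is an assumption here (it is a common choice, but the slides defer the actual method to later in the course), and the five homes are invented for illustration, not the course data. Prices are taken to be in thousands of dollars.

```python
# Candidate lines from the slide: (intercept, slope) for price as a function of sqft.
LINES = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}

# Hypothetical homes, invented for illustration only (not the course data).
SQFT = [1000, 1500, 2000, 2500, 3000]
PRICE = [70, 150, 230, 310, 390]  # assumed to be in $1000s


def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def rank_lines(lines, xs, ys):
    """Return (name, score) pairs ordered from smallest score (best) to largest."""
    scores = {
        name: sum_squared_residuals(a, b, xs, ys) for name, (a, b) in lines.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, score in rank_lines(LINES, SQFT, PRICE):
        print(f"{name}: {score:.1f}")
```

A smaller score means the line passes closer to the points overall. With different data the ordering of the three lines could change.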

Slope interpretation

$\hat{y} = a + bx$

- The slope, b, of a regression line is almost always important for interpreting the data. The slope is the rate of change: the mean amount of change in y-hat when x increases by 1.

Intercept interpretation

$\hat{y} = a + bx$

- The intercept, a, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.

Slope interpretation

Price of Homes Based on Square Feet: Price = -90.2458 + 0.1598 SQFT, r = 0.8718945

For every 1 sqft increase in the size of a home, the house price increases by \$159.80 on average. (Price is evidently measured in thousands of dollars, so a slope of 0.1598 corresponds to \$159.80.)

Intercept interpretation

Price of Homes Based on Square Feet: Price = -90.2458 + 0.1598 SQFT, r = 0.8718945

If the sqft of a home were 0, the predicted house price would be -\$90,245.80. This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.
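A short worked check of both interpretations, using the fitted coefficients quoted above. It assumes Price is in thousands of dollars, which the dollar figures imply but the slides do not state outright.

$$
\begin{aligned}
\hat{y}(x+1) - \hat{y}(x) &= \bigl[a + b(x+1)\bigr] - \bigl[a + bx\bigr] = b = 0.1598 \ \text{(thousand dollars)} = \$159.80, \\
\hat{y}(0) &= a + b \cdot 0 = a = -90.2458 \ \text{(thousand dollars)} = -\$90{,}245.80.
\end{aligned}
$$

The first line holds for any starting x, which is why the slope can be read as a per-square-foot change without naming a particular home size.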


Prediction

Price of Homes Based on Square Feet: Price = -90.2458 + 0.1598 SQFT, r = 0.8718945

For a 3500 sqft home we would predict the selling price to be

price = -90.2458 + 0.1598 × 3500 = 469.0542, i.e. \$469,054.20

OECD data: Income and unemployment in the U.S.

- What is the relationship between households’ disposable income and the nation’s unemployment rate?
- Data from the U.S., 1980 to 1998 (data provided by the economics department at Duke)

Disposable income vs unemployment rates

Disposable income and unemployment rates regression output

Does regression fit data well?

- A regression line is reasonable if:
  - the association between the two variables is indeed linear
  - the points are randomly scattered around the line
- Income/unemployment rate data: well described by the regression line.
- Regression of AIDS rates per 1000 people on GNP per capita: the line is too low for GNP values near zero and too high for big GNP values. We shouldn’t use the line for predictions.


Birth and death rates in 74 countries

Changing the response variable

- When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line.
- With monetary variables, typically this can be accomplished by taking logarithms.
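A minimal sketch of the log-transform idea, under stated assumptions: the data are invented (an exact square-root curve, so they are not linear on the raw scale), and `fit_line` is a hypothetical helper that uses the ordinary least-squares formulas. The line is fitted on the log scale and predictions are exponentiated to return to the original units.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data invented for illustration: y = 2 * sqrt(x), curved on the raw scale.
X = [100.0, 400.0, 2500.0, 10000.0, 40000.0]
Y = [2.0 * math.sqrt(x) for x in X]

# Fit on the log scale, where this particular relationship is a straight line.
LOG_A, LOG_B = fit_line([math.log(x) for x in X], [math.log(y) for y in Y])


def predict_original_scale(x):
    """Predict log(y) from log(x), then exponentiate to get back to the units of y."""
    return math.exp(LOG_A + LOG_B * math.log(x))


if __name__ == "__main__":
    print(f"log-scale slope: {LOG_B:.3f}")
    print(f"prediction at x = 900: {predict_original_scale(900.0):.2f}")
```

Real data will not sit exactly on a line after taking logs, so the transformed scatter plot still needs to be checked before trusting the fit.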

Regression of log(AIDS) on log(GNP)

- Much better fit
- Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS.

Facts about regression

- The distinction between explanatory and response variable is essential in regression.
- If you have a slope computed using x as the explanatory variable and y as the response variable, you can’t “back solve” to get predictions of x given y.
- If you want to predict x given y, then you must find the intercept and slope with y being the explanatory variable and x being the response.
- There is a close relationship between the correlation coefficient and the slope of a regression line:

  $b = r \, \dfrac{SD_y}{SD_x}$

  - They have the same sign.
  - They are proportional to each other.
- The intercept has no relationship with the correlation coefficient, but here is the formula:

  $a = \bar{y} - b\bar{x}$

Extrapolation

- Predicting y at values of x beyond the range of x in the data is called extrapolation.
- This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values.
- Extrapolated predictions can be absolutely wrong.
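The slope and intercept formulas, the warning about back-solving, and the extrapolation caution can be sketched together as follows. The data are invented for illustration, the helper names are hypothetical, and the standard deviations use divisor n (an assumption; any consistent divisor gives the same slope).

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation with divisor n (an assumption of this sketch)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """Correlation coefficient r: the average product of standard units."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


def regression_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, with b = r*SDy/SDx and a = mean(y) - b*mean(x)."""
    b = correlation(xs, ys) * sd(ys) / sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b


def is_extrapolation(x_new, xs):
    """True when x_new lies outside the range of x values used to fit the line."""
    return x_new < min(xs) or x_new > max(xs)


# Hypothetical data, invented purely for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]

if __name__ == "__main__":
    a, b = regression_line(X, Y)          # predict y from x
    a_rev, b_rev = regression_line(Y, X)  # predict x from y: a separate fit
    print(f"y on x: y-hat = {a:.2f} + {b:.2f} x")
    print(f"x on y: x-hat = {a_rev:.2f} + {b_rev:.2f} y")
    print(f"back-solved slope 1/b = {1 / b:.2f}, fitted slope = {b_rev:.2f}")
    print(f"x = 10 is extrapolation: {is_extrapolation(10.0, X)}")
```

The two slopes multiply to r squared rather than to 1, which is why solving the y-on-x line for x does not reproduce the x-on-y line unless the correlation is exactly 1 or -1.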


Extrapolation

- Diamond price and carat
- The explanatory variable is measured in carats and the response variable is in dollars.
- The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4.
- Predict the price of the Hope Diamond:

  $\hat{y} = 48.88 + 2430.77 \times 45.52$ = \$110,697.53
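The Hope Diamond calculation can be reproduced with a few lines, using the coefficients and the 45.52 carat figure from the slide. The 0.4 carat cutoff is the slide's approximate "stops being linear" point. Treating it as a hard threshold, and the function names themselves, are assumptions of this sketch.

```python
# Fitted line quoted on the slide: price (dollars) = 48.88 + 2430.77 * carats.
INTERCEPT = 48.88
SLOPE = 2430.77

# Approximate carat size beyond which the slide says the relationship is no longer linear.
LINEAR_UP_TO_CARATS = 0.4

HOPE_DIAMOND_CARATS = 45.52


def predict_price(carats):
    """Price in dollars predicted by the straight-line fit."""
    return INTERCEPT + SLOPE * carats


def is_beyond_linear_range(carats):
    """True when the carat size is past the point where the line describes the data."""
    return carats > LINEAR_UP_TO_CARATS


if __name__ == "__main__":
    price = predict_price(HOPE_DIAMOND_CARATS)
    print(f"predicted price: ${price:,.2f}")
    print(f"beyond linear range: {is_beyond_linear_range(HOPE_DIAMOND_CARATS)}")
```

The arithmetic is correct, but the prediction is for a stone more than a hundred times larger than the sizes where the line fits, so the number should not be trusted.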

Extrapolation

- Green line is a linear fit with only diamonds less than 0.4 carats.
- Blue line is a linear fit with all carat sizes.
- Red curve is a quadratic fit.

Lurking variable

- A variable not being considered could be driving the relationship.
- In practice this is a difficult issue to tackle, especially when everything seems OK.

Influential point

- An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept.
- applet

Causality

- On its own, regression only quantifies an association between x and y.
- It does not prove causality.
- Under a carefully designed experiment (or in some cases observational studies), regression can be used to show causality.
