Regression
FPP 10 (kind of)

Regression line

The correlation coefficient is a nice numerical summary of two quantitative variables. It indicates the direction and strength of association.

But does it quantify the association? It would be of interest to do this for
Predictions
Understanding phenomena

Regression line

Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables.

If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot.

This line represents a mathematical model. Later we will make the mathematical model a statistical one.

Slope intercept form review

Regression line

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

Regression

Slope intercept form notation
$y = mx + b$

Regression form notation
$\hat{y} = a + bx$


9/16/09

Which line is best

Different people might draw different lines by eye on a scatterplot.

What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one?

What are some ways that we can numerically rank the different models (i.e. the different lines)?

This will come later in the course.

Which model to use

Price = -90.2458 + 0.1598 SQFT (red)
Price = -300 + 0.3 SQFT (blue)
Price = 0 + 0.1 SQFT (green)

Slope interpretation

$\hat{y} = a + bx$

The slope, b, of a regression line is almost always important for interpreting the data.

The slope is the rate of change, the mean amount of change in y-hat when x increases by 1.

Intercept interpretation

$\hat{y} = a + bx$

The intercept, a, of the regression line is the value of y-hat when x = 0.

Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.

Slope interpretation

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

For every 1 sqft increase in the size of a home, the house price increases by $159.80 on average (price is measured in thousands of dollars, so a slope of 0.1598 corresponds to $159.80).

Intercept interpretation

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

If the sqft of a home were 0, the house price would be -$90,245.80 on average.

This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.
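Both interpretations for the home-price line can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the deck's dollar figures imply, that Price is recorded in thousands of dollars, and the names `INTERCEPT`, `SLOPE`, and `predicted_price` are illustrative rather than taken from the lecture.

```python
# Coefficients of the fitted line from the slides.
INTERCEPT = -90.2458   # a, in thousands of dollars (assumed unit)
SLOPE = 0.1598         # b, in thousands of dollars per square foot


def predicted_price(sqft):
    """Return y-hat = a + b*x, in thousands of dollars."""
    return INTERCEPT + SLOPE * sqft


# Slope: the change in y-hat when x increases by 1 is b, whatever the starting size.
change_per_sqft = predicted_price(2001) - predicted_price(2000)
slope_in_dollars = SLOPE * 1000

# Intercept: the value of y-hat at x = 0, which is not a realistic home size.
intercept_in_dollars = predicted_price(0) * 1000

print(f"change in y-hat per extra sqft: {change_per_sqft:.4f} thousand dollars")
print(f"slope in dollars per sqft: {slope_in_dollars:.2f}")
print(f"intercept in dollars: {intercept_in_dollars:.2f}")
```

The 2000 sqft starting point is arbitrary; any other size gives the same change per square foot because the model is a straight line.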


OECD data: Income and unemployment in the U.S.

What is the relationship between households’ disposable income and the nation’s unemployment rate?

Data from the U.S. 1980 to 1998 (data provided by the economics department at Duke)

Prediction

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

For a 3500 sqft home we would predict the selling price to be
price = -90.2458 + 0.1598 * 3500 = 469.0542 (thousands of dollars)
price = $469,054.20
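The home-price prediction for a 3500 sqft home can be reproduced directly from the fitted line. In this sketch `predict_price_dollars` is a hypothetical helper name, and the factor of 1000 assumes Price is recorded in thousands of dollars, which is what the deck's dollar figure implies.

```python
def predict_price_dollars(sqft, intercept=-90.2458, slope=0.1598):
    """Predicted selling price in dollars for a home of the given size.

    The line gives the price in thousands of dollars (assumed unit),
    so the result is multiplied by 1000.
    """
    return (intercept + slope * sqft) * 1000.0


# Prediction for the 3500 sqft home used on the slide.
print(f"${predict_price_dollars(3500):,.2f}")  # prints $469,054.20
```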

Disposable income vs unemployment rates

Disposable income and unemployment rates regression output

Does regression fit data well?

A regression line is reasonable if
the association between two variables is indeed linear
the points are randomly scattered around the line

Income/unemployment rate data are well described by the regression line.

Regression of AIDS rates per 1000 people on GNP per capita

Line is too low for GDP values near zero and too high for big GDP values.

We shouldn’t use the line for predictions.
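One informal way to check whether points are "randomly scattered around the line" is to look at the residuals (observed y minus predicted y-hat) in different parts of the x range. The sketch below uses synthetic data, not the income or AIDS data from the lecture, and the helper `mean_residual_by_third` is a made-up name for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (NOT from the lecture): one linear relationship, one curved.
x = rng.uniform(0.0, 10.0, size=150)
y_linear = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=150)
y_curved = 0.5 * x**2 + rng.normal(0.0, 1.0, size=150)


def mean_residual_by_third(x, y):
    """Fit a least-squares line, then average the residuals within the
    lowest, middle, and highest third of the x values."""
    slope, intercept = np.polyfit(x, y, 1)      # polyfit returns [slope, intercept]
    residuals = y - (intercept + slope * x)
    ordered = residuals[np.argsort(x)]          # residuals ordered by increasing x
    return [float(part.mean()) for part in np.array_split(ordered, 3)]


linear_pattern = mean_residual_by_third(x, y_linear)
curved_pattern = mean_residual_by_third(x, y_curved)

print("linear data:", [round(v, 2) for v in linear_pattern])
print("curved data:", [round(v, 2) for v in curved_pattern])
```

For the linear data the three averages should all sit near zero. For the curved data the line is systematically too low at both ends and too high in the middle, which is the kind of pattern that signals a poor fit.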


Birth and death rates in 74 countries

Changing the response variable

When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line.

With monetary variables, typically this can be accomplished by taking logarithms.
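A minimal sketch of the log transformation idea, using synthetic data rather than the lecture's data. It assumes both variables are positive so that logarithms exist; the power-law form used to generate the data is an illustrative choice, and `predict_y` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (NOT from the lecture): x plays the role of a monetary
# variable, and y follows a curved (power-law) relationship with x.
x = rng.uniform(100.0, 20000.0, size=200)
y = 50.0 * x**-0.5 * np.exp(rng.normal(0.0, 0.2, size=200))

# Straight-line fit on the original scale.
r_raw = np.corrcoef(x, y)[0, 1]

# Straight-line fit after taking logarithms of both variables.
log_x, log_y = np.log(x), np.log(y)
b_log, a_log = np.polyfit(log_x, log_y, 1)   # [slope, intercept]
r_log = np.corrcoef(log_x, log_y)[0, 1]


def predict_y(x_new):
    """Predict log(y) from log(x), then exponentiate to get back to y."""
    return np.exp(a_log + b_log * np.log(x_new))


print(f"r on original scale: {r_raw:.3f}")
print(f"r on log-log scale:  {r_log:.3f}")
print(f"predicted y at x = 1000: {predict_y(1000.0):.3f}")
```

The correlation on the log-log scale should be noticeably stronger for data like these, which is the sense in which the transformed line fits better.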

Regression of log(AIDS) on log(GNP)

Much better fit.

Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS.

Facts about regression

The distinction between explanatory and response variable is essential in regression.

If you have a slope computed using x as the explanatory and y as the response variable, you can’t “back solve” to get predictions of x given y.

If you want to predict x given a y, then you must find the intercept and slope with y being the explanatory variable and x being the response.
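The difference between "back solving" and refitting with the roles swapped can be seen numerically. The data below are synthetic and the specific numbers (means, noise level, the value `y0 = 80`) are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative only).
x = rng.normal(50.0, 10.0, size=500)
y = 20.0 + 0.8 * x + rng.normal(0.0, 12.0, size=500)

# Regression of y on x: x is explanatory, y is the response.
b_yx, a_yx = np.polyfit(x, y, 1)

# Regression of x on y: y is explanatory, x is the response.
b_xy, a_xy = np.polyfit(y, x, 1)

y0 = 80.0

# "Back solving" the y-on-x line for x (not a valid way to predict x).
x_backsolved = (y0 - a_yx) / b_yx

# Prediction of x from the line fitted with y as the explanatory variable.
x_from_refit = a_xy + b_xy * y0

print(f"back-solved x:      {x_backsolved:.2f}")
print(f"x from x-on-y line: {x_from_refit:.2f}")
```

Unless the points lie exactly on a line, the two answers differ, so the regression has to be refit with y as the explanatory variable.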

Facts about regression

There is a close relationship between the correlation coefficient and the slope of a regression line:

$b = r \frac{SD_y}{SD_x}$

They have the same sign.

They are proportional to each other.

The intercept has no such direct relationship with the correlation coefficient, but here is the formula:

$a = \bar{y} - b\bar{x}$

Warnings about regression

Predicting y at values of x beyond the range of x in the data is called extrapolation.

This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values.

Extrapolated predictions can be absolutely wrong.
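The slope and intercept formulas from the facts about regression, b = r · SDy / SDx and a = (mean of y) − b · (mean of x), can be checked against a least-squares fit. This sketch uses synthetic data and NumPy's `polyfit` as the reference fit; the data-generating numbers are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (illustrative only).
x = rng.uniform(0.0, 10.0, size=100)
y = 3.0 - 1.5 * x + rng.normal(0.0, 2.0, size=100)

# Slope from the correlation coefficient and the two standard deviations.
r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()

# Intercept from the means and the slope.
a = y.mean() - b * x.mean()

# Reference: least-squares line from polyfit, which returns [slope, intercept].
b_ls, a_ls = np.polyfit(x, y, 1)

print(f"slope:     formula {b:.4f}, least squares {b_ls:.4f}")
print(f"intercept: formula {a:.4f}, least squares {a_ls:.4f}")
print(f"r = {r:.4f} has the same sign as the slope: {np.sign(r) == np.sign(b)}")
```

Both standard deviations use the same divisor here, so the ratio SDy / SDx is the same whether the n or the n − 1 convention is used.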


Extrapolation

Diamond price and carat

The explanatory variable is measured in carats and the response variable is dollars.

The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4.

Predict the price of the Hope Diamond:

$\hat{y} = 48.88 + 2430.77(45.52) = 110{,}697.53$ (dollars)
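The Hope Diamond calculation is a useful extrapolation example to reproduce. The coefficients and the 45.52 carat size come from the slide; the 0.4 carat threshold is the point after which the slides say the relationship stops being linear. The function names are hypothetical, and the slides do not say which of the fitted lines produced these coefficients.

```python
INTERCEPT = 48.88          # dollars
SLOPE = 2430.77            # dollars per carat
HOPE_DIAMOND_CARATS = 45.52
LINEAR_UP_TO_CARATS = 0.4  # approximate limit of the linear pattern, per the slides


def predict_price(carats):
    """Predicted price in dollars from the straight-line model."""
    return INTERCEPT + SLOPE * carats


def beyond_linear_range(carats, limit=LINEAR_UP_TO_CARATS):
    """True when the prediction relies on the line outside the range
    where the relationship was observed to be linear."""
    return carats > limit


price = predict_price(HOPE_DIAMOND_CARATS)
print(f"predicted price: ${price:,.2f}")
print(f"extrapolating beyond the linear range: {beyond_linear_range(HOPE_DIAMOND_CARATS)}")
```

The arithmetic matches the slide, but the flag makes the warning explicit: at more than 100 times the size where the linear pattern ends, the number should not be trusted.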

Extrapolation

Green line is a linear fit with only diamonds less than 0.4 carats.

Blue line is a linear fit with all carat sizes.

Red curve is a quadratic fit.

Lurking variable

A variable not being considered could be driving the relationship.

In practice this is a difficult issue to tackle, especially when everything seems OK.

Causality

On its own, regression only quantifies an association between x and y. It does not prove causality.

Under a carefully designed experiment (or in some cases observational studies), regression can be used to show causality.

Influential point

An outlier in either the x or y direction which, if removed, would markedly change the value of the slope and y-intercept.
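The definition of an influential point can be illustrated by fitting the line with and without a single outlying observation. The eleven data points below are made up for this sketch; the outlier at x = 30 is placed far from the rest in the x direction.

```python
import numpy as np

# Made-up data: ten points close to the line y = 2 + x ...
x = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1])
y = 2.0 + 1.0 * x + noise

# ... plus one point far away in the x direction that does not follow the pattern.
x_all = np.append(x, 30.0)
y_all = np.append(y, 5.0)

# Least-squares fits with and without the outlying point.
slope_with, intercept_with = np.polyfit(x_all, y_all, 1)
slope_without, intercept_without = np.polyfit(x, y, 1)

print(f"with the point:    slope {slope_with:.3f}, intercept {intercept_with:.3f}")
print(f"without the point: slope {slope_without:.3f}, intercept {intercept_without:.3f}")
```

Removing the single point changes the slope from nearly flat to about 1 and shifts the intercept by several units, which is the behavior the definition describes.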
