What do we do with the data?
$y_i$ = revenue of the $i$th issue, $x_i$ = pages of advertising in the $i$th issue, $i = 1, \dots, n$, with sample size $n = 37$.
Primary Research Question: 1. How do pages of advertising relate to revenue?
Exploratory Results
$r = 0.82$, $\mathrm{Cov}(X, Y) = 65.83$
[Figure: scatterplot of Revenue vs. Pages of Advertising]
When exploring the scatterplot, assess:
1. Form: linear?
2. Direction: positive or negative?
3. Strength
4. Outliers
SLR Model Fit
$\hat{y} = 4.09 + 1.67 \times (\text{Pages})$, $\hat{\sigma} = 7.432$
[Figure: scatterplot of Revenue vs. Pages of Advertising with the fitted regression line]
Is this model any good? Or, does it explain the structure in the data well?
SLR Model Fit
Measures of Goodness of Fit: 1. $R^2$

Total Sum of Squares: $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Sum of Squared Errors: $\mathrm{SSE} = \sum_{i=1}^{n} \bigl(y_i - \overbrace{\hat{y}_i}^{\hat{\beta}_0 + \hat{\beta}_1 x_i}\bigr)^2$

Sum of Squares from Regression: $\mathrm{SSR} = \sum_{i=1}^{n} \bigl(\overbrace{\hat{y}_i}^{\hat{\beta}_0 + \hat{\beta}_1 x_i} - \bar{y}\bigr)^2$

$\underbrace{\mathrm{SST}}_{\text{Total Error}} = \underbrace{\mathrm{SSE}}_{\text{Error After Regression}} + \underbrace{\mathrm{SSR}}_{\text{Error Taken Away by Regression}}$
SLR Model Fit
Measures of Goodness of Fit: 1. $R^2$, with $R^2 \in [0, 1]$:

$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 0.6719$
Interpretation: $R^2$ is the percent of variation in revenue that is explained by pages of advertising. Intuition: the percent "better off" you are at predicting revenue when you include information from pages of advertising. Issue: $R^2$ only says how well your model explains the data; it says nothing about how well your model predicts.
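As a minimal sketch of how these quantities fit together, the sums of squares and $R^2$ can be computed by hand. The data below are made-up toy values, not the actual magazine data (which are not reproduced on the slides):

```python
# Sketch: fit the least-squares line by hand and verify the
# SST = SSE + SSR decomposition on toy data.
xs = [5, 8, 10, 14, 17, 20, 23, 25]
ys = [6, 12, 15, 22, 28, 35, 40, 45]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

# Least-squares estimates for simple linear regression
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]

sst = sum((y - ybar) ** 2 for y in ys)             # total error
sse = sum((y - h) ** 2 for y, h in zip(ys, yhat))  # error after regression
ssr = sum((h - ybar) ** 2 for h in yhat)           # error taken away

r2 = ssr / sst  # equivalently 1 - sse / sst
print(round(r2, 4))
```

Note that $R^2$ falls out of the decomposition directly: since $\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR}$, the two forms $\mathrm{SSR}/\mathrm{SST}$ and $1 - \mathrm{SSE}/\mathrm{SST}$ are identical.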
SLR Model Fit
Measures of Goodness of Fit: 2. Predictive accuracy via cross validation
i. Randomly remove p% of your data (called a test set)
ii. Fit the model to the remaining (1 − p)% of your data (called a training set)
iii. Use the fit to predict the held-out test data.

$\text{Predictive Bias} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\hat{y}_i - y_i)$

$\text{Predictive Mean Square Error (PMSE)} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\hat{y}_i - y_i)^2, \qquad \mathrm{RPMSE} = \sqrt{\mathrm{PMSE}}$
SLR Model Fit
Interpreting Cross Validation Metrics

$\text{Predictive Bias} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\hat{y}_i - y_i)$

Bias: systematic errors in estimation. For example, if bias > 0 then predictions are too high, and if bias = 0 then predictions are just right.

$\mathrm{RPMSE} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\hat{y}_i - y_i)^2}$

Root predictive mean square error: how far off your predictions are, on average.
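Steps i-iii above can be sketched as a single random train/test split. All data below are simulated toy values, and the split fraction p = 25% is an arbitrary illustrative choice:

```python
import random

# Sketch of cross validation: hold out 25% as a test set, fit SLR on
# the remaining 75%, then score the held-out predictions.
random.seed(42)
xs = list(range(1, 21))
ys = [2.0 * x + 3.0 + random.gauss(0, 1) for x in xs]  # toy linear data

idx = list(range(len(xs)))
random.shuffle(idx)
n_test = int(0.25 * len(xs))
test_idx, train_idx = idx[:n_test], idx[n_test:]

xtr = [xs[i] for i in train_idx]; ytr = [ys[i] for i in train_idx]
xte = [xs[i] for i in test_idx];  yte = [ys[i] for i in test_idx]

# Fit the line using the training set only
xbar = sum(xtr) / len(xtr); ybar = sum(ytr) / len(ytr)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xtr, ytr)) / \
     sum((x - xbar) ** 2 for x in xtr)
b0 = ybar - b1 * xbar

# Score predictions on the held-out test set
preds = [b0 + b1 * x for x in xte]
bias = sum(p - y for p, y in zip(preds, yte)) / n_test
rpmse = (sum((p - y) ** 2 for p, y in zip(preds, yte)) / n_test) ** 0.5
print(bias, rpmse)
```

In practice the split is usually repeated many times and the metrics averaged, so a single lucky or unlucky split doesn't drive the conclusion.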
SLR Model Fit
$\hat{y} = 4.09 + 1.67 \times (\text{Pages})$, $\hat{\sigma} = 7.432$
[Figure: scatterplot of Revenue vs. Pages of Advertising with the fitted regression line]
Are our assumptions OK?
1. Linear – maybe
2. Independent – maybe
3. Normal – maybe
4. Equal Variance – no
Assessing Model Assumptions
Tools for Assessing Model Assumptions: 1. Residuals vs. Fitted Values Scatterplot

$\hat{\epsilon}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$

$\Rightarrow$ The residuals should be independent of $\hat{y}_i$ (no pattern) with constant variance (if there is a pattern, they are likely dependent).

[Figure: residuals vs. fitted values scatterplot]
Assessing Model Assumptions
Tools for Assessing Model Assumptions: 1. Residuals vs. Fitted Values Scatterplot
What is "close enough" to homoskedastic (equal variance)? The Breusch-Pagan test (mathematical details are beyond the prerequisites for this course):
$H_0$: Data are homoskedastic
$H_A$: Data are heteroskedastic
$p$-value: 0.01
Warning: the Breusch-Pagan test is highly sensitive, so always check against the fitted values vs. residuals plot.
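Although the mathematical details are beyond this course, the core idea of the Breusch-Pagan test is simple enough to sketch: regress the squared residuals on $x$ and compute the LM statistic $n \cdot R^2$ from that auxiliary regression. This is a from-scratch illustration on simulated toy data, not the course's actual software workflow:

```python
import random

# Sketch of the Breusch-Pagan idea on toy data whose noise grows with x.
# Under H0 (homoskedasticity) LM is approximately chi-squared with 1 df
# here, so LM > 3.84 rejects at the 5% level.
random.seed(1)
xs = [float(x) for x in range(1, 41)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.2 * x) for x in xs]

def slr(x, y):
    """Least-squares fit; returns (intercept, slope, R^2)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
         sum((a - xb) ** 2 for a in x)
    b0 = yb - b1 * xb
    yhat = [b0 + b1 * a for a in x]
    sst = sum((b - yb) ** 2 for b in y)
    sse = sum((b - h) ** 2 for b, h in zip(y, yhat))
    return b0, b1, 1 - sse / sst

b0, b1, _ = slr(xs, ys)
e2 = [(y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)]  # squared residuals
_, _, r2_aux = slr(xs, e2)   # auxiliary regression of e^2 on x
lm = len(xs) * r2_aux        # Breusch-Pagan LM statistic
print(round(lm, 2))
```

In practice you would let statistical software compute the test and its $p$-value rather than hand-rolling it like this.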
Assessing Model Assumptions
Tools for Assessing Model Assumptions: 2. Histogram (density) of standardized residuals should be normal: $\hat{\epsilon}_i / \mathrm{SE}(\hat{\epsilon}_i) \sim N(0, 1)$ (in theory)
[Figure: histogram of standardized residuals]
Do we have outliers?
Assessing Model Assumptions
Tools for Assessing Model Assumptions: 3. Normal Quantile-Quantile (QQ) Plot: $\hat{\epsilon}_i / \mathrm{SE}(\hat{\epsilon}_i) \sim N(0, 1)$ (in theory)
i. Sort $\hat{\epsilon}_i/\mathrm{SE}(\hat{\epsilon}_i)$ so that $\hat{\epsilon}_{(1)}/\mathrm{SE}(\hat{\epsilon}_{(1)}) < \cdots < \hat{\epsilon}_{(n)}/\mathrm{SE}(\hat{\epsilon}_{(n)})$
ii. Find $z_{(1)}, \dots, z_{(n)}$ so that $\mathrm{Prob}(z < z_{(i)}) \approx i/n$ from the normal distribution
iii. $\hat{\epsilon}_{(i)}/\mathrm{SE}(\hat{\epsilon}_{(i)}) \approx z_{(i)}$ if the normal assumption holds
[Figure: normal QQ plot of sample quantiles vs. theoretical quantiles]
Here's that outlier again!
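Steps i-iii above can be sketched directly. The residuals below are simulated stand-ins, and $(i - 0.5)/n$ is a common continuity-corrected version of the $i/n$ rule (it avoids asking for the quantile at probability 1):

```python
import random
from statistics import NormalDist, mean, stdev

# Sketch of QQ-plot construction: sort the standardized residuals and
# pair each with the matching normal quantile via the inverse CDF.
random.seed(7)
resid = [random.gauss(0, 1) for _ in range(30)]  # stand-in residuals
std_resid = sorted((r - mean(resid)) / stdev(resid) for r in resid)

n = len(std_resid)
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# If normality holds, the paired points fall near the line y = x
pairs = list(zip(theoretical, std_resid))
print(pairs[0], pairs[-1])
```

Plotting `theoretical` against `std_resid` reproduces the QQ plot; points that drift far from the 45-degree line (especially in the tails) are the outlier candidates.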
Outliers in SLR
Two questions:
1. Should we worry about outliers?

$\hat{\beta}_1 = \mathrm{Corr}(Y, X) \frac{s_y}{s_x}$

Correlation is sensitive to outliers, so our regression line will be sensitive as well.
2. How do we identify outliers?
i. Graphical – histogram/QQ plot of standardized residuals
ii. Cook's Distance:

$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{2 \hat{\sigma}^2}$

$\hat{y}_j$: prediction of the $j$th point using all data. $\hat{y}_{j(i)}$: prediction of the $j$th point using all data EXCEPT the $i$th point.
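The definition above can be computed directly by refitting the line without each point in turn. This sketch uses toy data with one deliberately planted outlier (the 2 in the denominator is the number of regression parameters):

```python
# Sketch of Cook's distance via its definition: for each i, refit
# without point i and compare predictions at every x_j.
xs = [2, 4, 6, 8, 10, 12, 14, 16]
ys = [5, 9, 13, 17, 21, 25, 29, 80]  # last point is a planted outlier

def fit(x, y):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
         sum((a - xb) ** 2 for a in x)
    return yb - b1 * xb, b1

b0, b1 = fit(xs, ys)
yhat = [b0 + b1 * x for x in xs]
n, p = len(xs), 2  # p = number of regression parameters
sigma2 = sum((y - h) ** 2 for y, h in zip(ys, yhat)) / (n - p)

cooks = []
for i in range(n):
    c0, c1 = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])  # leave out i
    d = sum((yhat[j] - (c0 + c1 * xs[j])) ** 2 for j in range(n)) / (p * sigma2)
    cooks.append(d)

cutoff = 4 / n  # rule of thumb from the next slide
flagged = [i for i, d in enumerate(cooks) if d > cutoff]
print(flagged)
```

Real software computes $D_i$ from leverage formulas without literally refitting $n$ times, but the brute-force version matches the definition on the slide.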
Outliers in SLR
Cook's Distance: use a cutoff (rule of thumb) of $4/n$ to flag points as "influential" or "outliers".
[Figure: scatterplot of Revenue vs. Pages of Advertising with the most influential points labeled: $D_1 = 0.414$, $D_2 = 0.395$, $D_8 = 0.409$]
Assessing Model Assumptions
Tools for Assessing Model Assumptions: 2. What is "close enough" to normal? The Kolmogorov-Smirnov (KS) test (mathematical details are beyond the prerequisites for this class):
$H_0$: Data come from a normal distribution
$H_A$: Data don't come from a normal distribution
$p$-value: 0.2689
Dealing with Violations of the SLR Assumptions
Based on the above diagnostics we know:
1. The linearity assumption is a bit sketchy
2. Homoskedasticity is certainly an issue
3. Normality is OK, with the exception of a few outliers.
So, what do we do?
1. Change your assumptions (hard but preferred)
2. Transformations

Idea: $t_Y(y_i) = \beta_0 + \beta_1 t_X(x_i) + \epsilon_i$, where $t_Y(y_i)$ is a transformation of $y$ and $t_X(x_i)$ is a transformation of $x$.

Example: $\log(y_i) = \beta_0 + \beta_1 \sqrt{x_i} + \epsilon_i$, i.e. $t_Y(y_i) = \log(y_i)$ and $t_X(x_i) = \sqrt{x_i}$.
Transformations

| Name | Transformation | Fixes | Issues |
| --- | --- | --- | --- |
| Log | $\ln(y_i)$ | Nonlinearity, heteroskedasticity | Only if positive |
| Square root | $\sqrt{y_i}$ | Nonlinearity, heteroskedasticity | Only if positive |
| Ratio | $y_i / x_i$ | Heteroskedasticity | Reverses relationship |
| Power | $y_i^{\lambda}$, $\lambda \in (-1, 1)$ | Heteroskedasticity | Hard to interpret |
| Box-Cox | $(y_i^{\lambda} - 1)/\lambda$ if $\lambda \neq 0$; $\ln(y_i)$ if $\lambda = 0$ | Non-normality | Impossible to interpret |
With Box-Cox transformations, $\lambda$ can be treated as a parameter and estimated using least squares OR maximum likelihood.
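As a sketch of that idea, the Box-Cox transformation can be implemented piecewise and $\lambda$ chosen by a crude grid search that minimizes the SSE of a straight-line fit to the transformed $y$. (Real software maximizes the Box-Cox log-likelihood instead; the toy data below are exactly exponential, so the log case $\lambda = 0$ should win.)

```python
import math

def box_cox(y, lam):
    """Box-Cox transformation of a single positive value y."""
    if abs(lam) < 1e-12:
        return math.log(y)         # lambda = 0 case
    return (y ** lam - 1) / lam    # lambda != 0 case

def sse_for(lam, xs, ys):
    """SSE of a least-squares line fit to box_cox(y, lam) vs x."""
    ty = [box_cox(y, lam) for y in ys]
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ty) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(xs, ty)) / \
         sum((a - xb) ** 2 for a in xs)
    b0 = yb - b1 * xb
    return sum((t - (b0 + b1 * x)) ** 2 for x, t in zip(xs, ty))

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(0.4 * x) for x in xs]     # exactly exponential toy data

grid = [l / 10 for l in range(-10, 11)]  # candidate lambdas in [-1, 1]
best = min(grid, key=lambda l: sse_for(l, xs, ys))
print(best)
```

Minimizing SSE over a grid is the "least squares" route mentioned above; the maximum-likelihood route additionally accounts for the Jacobian of the transformation.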
Transformations
[Figure: scatterplots of the advertising data under candidate transformations, pairing Revenue, ln(Revenue), and sqrt(Revenue) against Pages, ln(Pages), and sqrt(Pages)]
Transformations
Issues with transformations:
1. With the exception of Box-Cox, you're guessing, so your choice of transformation is subjective.
2. They change the interpretation of the parameters.
• You need to back-transform in order to produce anything interpretable.
3. They change the standard errors of the parameters.
4. It is not always easy to keep track of all your assumptions.
Advertisement Example
$\ln(y_i) = \beta_0 + \beta_1 \ln(x_i) + \epsilon_i$
$\Rightarrow \widehat{\ln(y)} = -1.05 + 1.48 \times \ln(x)$
$\Rightarrow \hat{y} = \exp\{-1.05 + 1.48 \times \ln(x)\}$
Cook's distance values are better now.
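Back-transforming the log-log fit to the revenue scale is where the "need to back-transform" caveat shows up. A small sketch, using the slide's fitted coefficients (and noting that $e^{\beta_0 + \beta_1 \ln x} = e^{\beta_0} x^{\beta_1}$, i.e. a log-log model fits a power curve):

```python
import math

# Sketch: back-transform log-log model predictions to the revenue scale.
# Coefficients are the slide's fitted values for the magazine example.
b0, b1 = -1.05, 1.48

def predict_revenue(pages):
    """Prediction on the original scale: exp of the fitted log-revenue."""
    return math.exp(b0 + b1 * math.log(pages))

def predict_revenue_power(pages):
    """Algebraically identical power-function form."""
    return math.exp(b0) * pages ** b1

for pages in (5, 10, 20):
    print(pages, round(predict_revenue(pages), 2))
```

One subtlety worth remembering: exponentiating the fitted mean of $\ln(y)$ gives the median of $y$ on the original scale, not its mean, which is part of why transformed-model parameters are harder to interpret.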
[Figure: diagnostics for the log-log model — scatterplot of Revenue vs. Pages of Advertising with the back-transformed fit, histogram of standardized residuals, and normal QQ plot]
End of Advertisement Analysis (see webpage for R and SAS code)