Econometrics I — Lecture 2: Statistics
Mohammad Vesal
Graduate School of Management and Economics, Sharif University of Technology
44716, Fall 1395
Outline
• Preliminary definitions
• Point estimators
• Approaches to parameter estimation
• Interval estimation and confidence intervals
• Hypothesis testing

Reference: Wooldridge, Appendix C; Stock and Watson, Ch. 3; Greene, Appendix C (partly)
Definitions
• Population, sample, statistical inference
• Learning: estimation and hypothesis testing
• Example: returns to education
  - Impractical and costly to survey the whole population
  - Point estimate: 7.5 percent
  - Interval estimate: 5.6–9.4 percent
• Example: does neighborhood watch affect crime? → hypothesis testing
Definitions (2)
• Identify the population of interest and then the relation of interest.
  - The relations involve (features of) probability distributions.
• Y is a random variable on a population with pdf f(y; θ). Draw a sample from the population to learn what θ is.
• {Y1, ..., Yn} is a random sample from f(y; θ) if Y1, ..., Yn are independent with common pdf f(y; θ) (i.i.d.).
  - Once the sample is drawn we have a set of numbers {y1, ..., yn}.
Example
• We want to know the average returns to education in Iran.
• Y: wage, X: years of schooling
• We take a random sample of n individuals in Iran and ask about their wages and years of schooling.
• Before we fill out the questionnaire: {(X1, Y1), ..., (Xn, Yn)}; after, we have data: {(x1, y1), ..., (xn, yn)}.
• Joint pdf in the whole population: f_{X,Y}(x, y; θ), where θ is a set of parameters.
• Aim: learn something about θ from the sample.
• We might be interested in a subset of parameters: f_{Y|X}(y | X = x; β)
  - A more specific example: E[Y | X = x] = β0 + β1 x
Outline
• Preliminary definitions
• Point estimators
  - Finite sample properties
  - Asymptotic (large sample) properties
• Approaches to parameter estimation
• Interval estimation and confidence intervals
• Hypothesis testing
What is an estimator?
• Random sample {Y1, ..., Yn} from f(y; θ).
• (Point) estimator: a rule (function) that relates the observed values of the sample to a value for the parameter(s) of interest (θ):
  W = h(Y1, ..., Yn)
  - W is a random variable itself.
  - Given an actual sample {y1, ..., yn} we can calculate a point estimate for θ: w = h(y1, ..., yn).
• Sometimes I use θ̂ when talking about an estimator for θ.
• With a new sample, we get a new estimate for the parameter.
Example
• Say we are interested in knowing the population mean for a random variable Y ∼ N(µ, σ²).
• We draw a random sample {Y1, ..., Yn}.
• What are the potential estimators for µ? Anything like µ̂ = h(Y1, ..., Yn).
  - Natural candidate is the sample mean: µ̂ = Ȳ = (1/n) Σ Yi.
  - Weird candidate: use the first draw only: µ̂1 = Y1.
• How do we pick among the many possible estimators?
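A quick simulation (a sketch, not from the slides; the distribution and parameter values are illustrative) makes the contrast between the two candidates concrete: both are centered at µ, but the sample mean is far less dispersed.

```python
import numpy as np

# Compare the sample mean with the "first draw only" estimator
# for mu when Y ~ N(mu, sigma^2), over many repeated samples.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 50, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
mu_hat = samples.mean(axis=1)   # sample mean, one estimate per replication
mu_hat1 = samples[:, 0]         # first observation only

# Both are centered near mu = 2.0, but the sample mean's variance
# is roughly sigma^2/n while the first draw's is sigma^2.
print(mu_hat.mean(), mu_hat1.mean())
print(mu_hat.var(), mu_hat1.var())
```

This previews the efficiency comparison a few slides ahead: Var(µ̂) = σ²/n against Var(µ̂1) = σ².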
Finite sample properties of estimators
• Define criteria for a good estimator:
  - Unbiasedness
  - Relative efficiency
Unbiasedness
• An estimator W of θ is an unbiased estimator if E(W) = θ for all possible values of θ.
  - Interpretation: this is about the mean of the distribution of W, NOT the value you get for an observed sample (w).
• Define bias: Bias(W) ≡ E(W) − θ
• Is µ̂ = Ȳ an unbiased estimator of the population mean µ? What about µ̂1 = Y1?
• Some poor estimators are unbiased.
• Exercise: Is S² = (1/n) Σ (Yi − µ)² an unbiased estimator for σ²?
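A simulation sketch related to the exercise (the parameter values are illustrative): with the known mean µ replaced by the sample mean Ȳ, dividing by n gives a downward-biased estimator of σ², which is why the later slides divide by n − 1.

```python
import numpy as np

# Bias check: variance estimators built around Ybar, dividing by n vs n-1.
rng = np.random.default_rng(1)
sigma2, n, reps = 4.0, 10, 100_000

y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ybar = y.mean(axis=1, keepdims=True)

s2_n = ((y - ybar) ** 2).sum(axis=1) / n          # divide by n
s2_n1 = ((y - ybar) ** 2).sum(axis=1) / (n - 1)   # divide by n - 1

# E[s2_n] = (n-1)/n * sigma^2 = 3.6, while E[s2_n1] = sigma^2 = 4.0
print(s2_n.mean(), s2_n1.mean())
```

The bias factor (n − 1)/n vanishes as n grows, which anticipates the distinction between unbiasedness and consistency below.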
Relative efficiency
• Why only consider the mean of the distribution of W? The sampling variance of estimators, Var(W), could be important too!
• Relative efficiency: if W1 and W2 are two unbiased estimators of θ, W1 is efficient relative to W2 when Var(W1) ≤ Var(W2) for all θ, with strict inequality for at least one θ.
• Example: Var(µ̂) vs. Var(µ̂1):
  Var(µ̂) = σ²/n < Var(µ̂1) = σ²  for n > 1
Unbiasedness vs. relative efficiency
Source: Wooldridge (2013)
Comparing two estimators
• What if one estimator is biased and the other is not? How do we compare them?
• Mean squared error (MSE):
  MSE(W) = E[(W − θ)²] = Var(W) + (Bias(W))²
Asymptotic properties
• How do the properties of estimators change as the sample size increases?
• How large is large?
Consistency
• Wn, an estimator of θ based on {Y1, ..., Yn}, is consistent if
  ∀ε > 0, Pr(|Wn − θ| > ε) → 0 as n → ∞
• Sometimes we write the above condition as plim Wn = θ.
  - If I require the estimator to be close enough to the parameter, I can find a large enough sample size that does this. Or: as the sample size increases, the probability of being far from θ drops to zero.
• Unbiased estimators are not necessarily consistent, but those with shrinking variance are:
  - If E(Wn) = θ and Var(Wn) → 0 as n → ∞, then Wn is consistent.
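The definition can be watched at work in a small simulation (a sketch with illustrative values): for a fixed ε, the probability that the sample mean lands more than ε from µ shrinks as n grows.

```python
import numpy as np

# Consistency of the sample mean: Pr(|Ybar_n - mu| > eps) falls with n.
rng = np.random.default_rng(2)
mu, eps, reps = 0.0, 0.1, 5_000

probs = {}
for n in (10, 100, 1000):
    ybar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    probs[n] = np.mean(np.abs(ybar - mu) > eps)

print(probs)  # the probabilities fall toward zero as n increases
```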
LLN and CLT
• Law of Large Numbers (LLN): if Y1, ..., Yn are i.i.d. with mean µ, then plim(Ȳn) = µ.
• Asymptotic normality
• Central Limit Theorem (CLT): if Y1, ..., Yn is a random sample with mean µ and variance σ², then Zn = (Ȳn − µ)/(σ/√n) has an asymptotic standard normal distribution.
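A CLT sketch (illustrative, not from the slides): even when the parent distribution is strongly skewed, the standardized sample mean Zn behaves approximately like a standard normal for moderate n. Here the parent is exponential with rate 1, so µ = σ = 1.

```python
import numpy as np

# CLT demonstration with a decidedly non-normal parent distribution.
rng = np.random.default_rng(3)
n, reps = 500, 10_000

y = rng.exponential(1.0, size=(reps, n))
z = (y.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))  # (Ybar - mu)/(sigma/sqrt(n))

# z should look standard normal: mean ~ 0, variance ~ 1,
# and about 95% of draws inside +/- 1.96.
print(z.mean(), z.var(), np.mean(np.abs(z) < 1.96))
```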
Approaches to parameter estimation
• Are there systematic methods for derivation of good estimators?
• We discuss three approaches briefly:
  - Method of moments
  - Maximum likelihood
  - Least squares
• But during the course we mostly rely on least squares.
Method of moments
• Remember the following moments for a random variable with a given distribution:
  µ = E(Y),  σ² = E[(Y − µ)²]
• A random sample with n observations is {Y1, ..., Yn}.
• A natural way to estimate the parameters µ and σ² is to replace the moment conditions with their sample counterparts:
  µ̂ = Ȳ = (1/n) Σ Yi,   σ̂² = (1/(n − 1)) Σ (Yi − Ȳ)²
• A more general interpretation:
  - We know some random variables must satisfy a few conditions in the population. We use the sample counterparts of these conditions to formulate method of moments estimators.
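In code, the method of moments for this case is just the slide's two sample counterparts evaluated on data (a sketch with illustrative parameter values):

```python
import numpy as np

# Method-of-moments estimates of mu and sigma^2 from one simulated sample.
rng = np.random.default_rng(4)
mu, sigma2, n = 5.0, 9.0, 10_000
y = rng.normal(mu, np.sqrt(sigma2), size=n)

mu_hat = y.mean()                                 # sample counterpart of E(Y)
sigma2_hat = ((y - mu_hat) ** 2).sum() / (n - 1)  # counterpart of E[(Y - mu)^2]

print(mu_hat, sigma2_hat)  # close to 5.0 and 9.0
```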
Maximum likelihood
• For a sample {Y1, ..., Yn} we can define the likelihood function as follows:
  L(θ; Y1, ..., Yn) = f(Y1, ..., Yn; θ)
  where f(·) is the joint pdf for the random variables Yi with a vector of unknown parameters θ.
• Maximum likelihood (ML) suggests a good estimator for θ maximizes the likelihood function.
• Intuition: L(θ; y1, ..., yn) gives the probability of observing a given realization for the sample as a function of θ. ML says a good estimator picks values for θ such that the probability of observing the current sample is maximized.
• Under random sampling and assuming a pdf for Y this could be a fruitful method.
• Example: random sample and Y ∼ N(µ, σ²). Find the ML estimators for µ and σ².
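For the normal example, the ML estimators turn out to have closed forms: µ̂ = Ȳ and σ̂² = (1/n) Σ (Yi − Ȳ)². A sketch (illustrative values) checks numerically that these closed forms do beat nearby parameter values in log-likelihood:

```python
import numpy as np

# Closed-form normal MLEs, verified against the log-likelihood they maximize.
rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.5, size=1_000)
n = y.size

def loglik(mu, sigma2):
    """Normal log-likelihood of the sample at (mu, sigma2)."""
    return -n / 2 * np.log(2 * np.pi * sigma2) - ((y - mu) ** 2).sum() / (2 * sigma2)

mu_mle = y.mean()
sigma2_mle = ((y - y.mean()) ** 2).sum() / n

# The closed forms should dominate nearby candidate values.
print(loglik(mu_mle, sigma2_mle) > loglik(mu_mle + 0.1, sigma2_mle))
print(loglik(mu_mle, sigma2_mle) > loglik(mu_mle, sigma2_mle + 0.1))
```

Note the 1/n (not 1/(n − 1)) in σ̂²: the ML variance estimator is biased in finite samples but consistent, tying back to the earlier discussion.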
Least squares
• Consider the following decomposition:
  Yi = E[Yi | Xi = xi] + ui
  where ui = Yi − E[Yi | Xi = xi].
• Also consider a functional form for the conditional expectation:
  E[Yi | Xi = xi] = g(xi; θ)
• One way to assess the goodness of g(·) is to see how far we are on average from the observed values for Y:
  l(θ) = Σ (Yi − g(xi; θ))²
• The LS estimator is θ̂ = argmin l(θ).
• Example: assume g(xi; β0, β1) = β0 + β1 xi; find the LS estimators for β0 and β1.
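For the linear example, minimizing l(β0, β1) gives the familiar closed forms β̂1 = sample cov(x, y) / sample var(x) and β̂0 = ȳ − β̂1 x̄. A sketch with simulated data (illustrative parameter values):

```python
import numpy as np

# Least-squares estimates for the linear example, via the closed forms.
rng = np.random.default_rng(6)
n, beta0, beta1 = 1_000, 1.0, 0.5
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, 1, size=n)

b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_hat = y.mean() - b1_hat * x.mean()

print(b0_hat, b1_hat)  # close to 1.0 and 0.5
```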
Outline
• Preliminary definitions
• Point estimators
• Approaches to parameter estimation
• Interval estimation and confidence intervals
• Hypothesis testing
Point vs. interval estimation
• A point estimate doesn't provide information about how close the estimate is likely to be to the parameter of interest.
• Point estimate for returns to education: 7 percent
  - We cannot know for certain how close this is to the population parameter. But we can make probabilistic claims.
• Interval estimate: 3–11 percent
  - Build a confidence interval.
Interval estimate for the mean
• Random sample {Y1, ..., Yn} from a population with N(µ, σ²).
• Use the sample mean to estimate µ:
  Ȳ ∼ N(µ, σ²/n)  ⇒  (Ȳ − µ)/(σ/√n) ∼ N(0, 1)
• From the standard normal distribution we know
  Pr(−1.96 < (Ȳ − µ)/(σ/√n) < +1.96) = 0.95
• which suggests that with 95 percent probability the interval
  [Ȳ − 1.96 σ/√n, Ȳ + 1.96 σ/√n]
  contains µ.
• Interval estimate for µ: [ȳ − 1.96 σ/√n, ȳ + 1.96 σ/√n].
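The coverage claim is about the random interval, and a simulation sketch (illustrative values) shows it directly: over many repeated samples, the interval captures µ about 95 percent of the time.

```python
import numpy as np

# Coverage of the known-sigma 95% interval over repeated normal samples.
rng = np.random.default_rng(7)
mu, sigma, n, reps = 3.0, 2.0, 25, 20_000

ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)   # half-width of the interval
covered = np.mean((ybar - half < mu) & (mu < ybar + half))

print(covered)  # close to 0.95
```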
100(1 − α)% confidence interval
• We considered 95 percent confidence intervals, but the concept is more general.
• Let cα/2 denote the 100(1 − α/2) percentile of the standard normal distribution.
• The 100(1 − α)% confidence interval around the mean is obtained as
  [ȳ − cα/2 σ/√n, ȳ + cα/2 σ/√n]
• Notice σ/√n is the standard deviation of Ȳ.
• Numerical example: σ = 1, n = 100, ȳ = 0.4
  - 95% conf. int. (c2.5 = 1.96): [0.4 − 1.96 × 1/10, 0.4 + 1.96 × 1/10] ⇒ [0.204, 0.596]
  - 99% conf. int. (c0.5 = 2.575): [0.4 − 2.575 × 1/10, 0.4 + 2.575 × 1/10] ⇒ [0.1425, 0.6575]
Meaning of a confidence interval
• Is it right to say: the probability that µ is in the calculated confidence interval is 95 percent?
  - No! For 95 percent of all random samples, the confidence interval contains µ!
Confidence interval with unknown variance
• If Y ∼ N(µ, σ²) with known σ, then (Ȳ − µ)/(σ/√n) ∼ N(0, 1).
• But if σ is unknown then we need to estimate it, e.g. with
  S = √[(1/(n − 1)) Σ (Yi − Ȳ)²]
  Therefore, (Ȳ − µ)/(S/√n) ∼ t_{n−1}.
• Let c be the 97.5th percentile of the t_{n−1} distribution; then
  [Ȳ − c S/√n, Ȳ + c S/√n]
  contains µ with 95 percent probability.
• Once again S/√n is a point estimator for the standard deviation of Ȳ. This is usually referred to as the standard error of the point estimate.
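A sketch of the unknown-σ interval on one simulated sample (illustrative values; the critical value 2.093 for t with 19 degrees of freedom is taken from a standard t table):

```python
import numpy as np

# 95% CI for mu with sigma unknown: estimate S and use a t critical value.
rng = np.random.default_rng(8)
y = rng.normal(10.0, 3.0, size=20)
n = y.size

ybar = y.mean()
s = np.sqrt(((y - ybar) ** 2).sum() / (n - 1))  # S, the sample std dev
se = s / np.sqrt(n)                             # standard error of Ybar
c = 2.093                                       # 97.5th pctile of t_19

ci = (ybar - c * se, ybar + c * se)
print(ci)
```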
Confidence interval with non-normal distribution
• If Y ∼ (µ, σ²) but the distribution is non-normal, then Ȳ is still an estimator for µ, but what is the distribution of Ȳ?
• CLT: if Yi ∼ (µ, σ²) are i.i.d. random variables, then (Ȳ − µ)/(σ/√n) is standard normal when n is large.
  - We can use S as an estimator for σ and the theorem still holds.
  - We build asymptotic confidence intervals.
Racial discrimination — 1988, Washington D.C.
• What is the extent of racial discrimination in hiring?
• Study: 5 pairs of black and white applicants (identical in all other aspects).
  - Observe if applicants receive a job offer.
  - Object of interest: θB − θW, where θr indicates the probability of receiving a job offer for race r.
• Bi = 1 if the black person gets an offer from employer i
• Wi = 1 if the white person gets an offer from employer i
• B̄ and W̄ are unbiased estimators for θB and θW.
  - Define Yi = Bi − Wi; then E[Ȳ] = θB − θW. Is Yi normal?
• From the data we learn: ȳ = 0.224 − 0.357 = −0.133
• 95% asymptotic conf. int.:
  [−0.133 − 1.96 × 0.482/√241, −0.133 + 1.96 × 0.482/√241] → [−0.194, −0.072]
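The slide's interval can be recomputed directly from the reported summary numbers (ȳ = −0.133, s = 0.482, n = 241):

```python
import numpy as np

# Asymptotic 95% CI for theta_B - theta_W from the reported summaries.
ybar, s, n = -0.133, 0.482, 241
se = s / np.sqrt(n)                       # standard error of ybar
ci = (ybar - 1.96 * se, ybar + 1.96 * se)

print(ci)  # roughly (-0.194, -0.072)
```

Zero lies well outside the interval, which is the evidence of discrimination in this example.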
Outline
• Preliminary definitions
• Point estimators
• Approaches to parameter estimation
• Interval estimation and confidence intervals
• Hypothesis testing
Why formulate a hypothesis test?
• In estimation, we looked at the magnitude of an effect of interest.
• Sometimes we may want to answer a question, e.g.
  - Did subsidy reform increase inflation?
  - Does strict labor regulation increase wages?
• Statistical significance vs. practical significance
Example 1 — returns to construction
• Someone tells you the average return to construction of residential buildings in Tehran is r = 30 percent.
• You want to test if this is a valid claim.
• Collect data on cost and revenue of 100 construction projects and calculate r̄ = 0.20.
  - Is this enough to claim the average return is not 0.3? We need to assess the strength of the evidence.
  - Is finding r̄ = 0.1 stronger evidence against the claim?
  - Is finding r̄ = 0.2 in a sample of 10,000 projects stronger evidence against the claim?
• Given our sample, what is the probability that r = 0.3?
• Null hypothesis H0: r = 0.3; alternative hypothesis H1: r ≠ 0.3
Types of mistakes
• Type I and type II errors:
  - Type I: reject H0 when it is true.
  - Type II: fail to reject H0 when it is false.
• Reality vs. statistical rejection
Source: Introduction to hypothesis testing
Significance and power
• Significance level (size) of a test: probability of type I error
  α = Pr(Reject H0 | H0)
  - This shows how concerned you are about falsely rejecting H0.
• Power of a test: 1 − probability of type II error
  π(r) = Pr(Reject H0 | r) = 1 − Pr(do NOT reject H0 | r)
  - This is a function of the actual value of the parameter!
• Routine: we pick a significance level (α), then try to maximize power π(r).
Test statistic
• A test statistic (T) is a random variable built from the random sample (like an estimator!).
  - For a given draw we calculate the value of this random variable and denote it by t.
• Given a test statistic we need a rejection rule to decide when to reject H0 in favor of H1.
  - Simple rule: compare t to a critical value c.
  - Values of t that lead to rejection of H0 are called the rejection region.
  - p-value: given the calculated t, what is the smallest significance level at which H0 would be rejected?
  - Choosing α (significance level) will pin down c.
Example 1 — cont.
• H0: r = 0.3 vs. H1: r ≠ 0.3
• Since this is a hypothesis about the mean, let's use the sample average to form a test statistic:
  T = (Ȳ − r)/(σ/√n)
  - If the Yi are i.i.d. N(r, σ²) with known σ, then T ∼ N(0, 1).
• Pick the significance level: α = 5%
  - What is the critical value c? α = Pr(|T| > c | H0) ⇒ c = zα/2 = 1.96
  - Reject H0 if |t| > 1.96 (rejection region).
Example 1 — with numbers
• Say σ = 1, and for a given sample (n = 100) we calculated ȳ = 0.2. Do we reject H0: r = 0.3 in favor of H1: r ≠ 0.3?
  t = (0.2 − 0.3)/(1/√100) = −1 ⇒ |t| < 1.96 ⇒ we don't reject H0.
• What if the sample size was n = 10000?
  |t| = |−10| > 1.96 ⇒ reject H0.
• What if n = 100 but ȳ = 0.1?
  |t| = |−2| > 1.96 ⇒ reject H0.
Rejection region
How to calculate power
• So far we focused on type I error.
• How do we calculate type II error? Alternatively, how do we calculate power (1 − type II error)?
• Power: what is the probability we reject H0 if the true value of the parameter is r1?
  π(r1) = Pr(Reject H0 | r = r1) = Pr(|T| > c | r = r1)
        = Φ(−cα/2 − (r1 − r0)/(σ/√n)) + 1 − Φ(cα/2 − (r1 − r0)/(σ/√n))
• Graphical calculation is nice.
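The power formula can be evaluated directly for Example 1's test (a sketch; Φ is built from the standard-library error function):

```python
import numpy as np
from math import erf, sqrt

# Power function pi(r1) for the two-sided test of H0: r = r0.
def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(r1, r0=0.3, sigma=1.0, n=100, c=1.96):
    d = (r1 - r0) / (sigma / np.sqrt(n))  # shift of T under r = r1
    return Phi(-c - d) + 1.0 - Phi(c - d)

print(power(0.3))            # at r1 = r0, power equals the size: 0.05
print(power(0.2))            # truths farther from r0 are easier to detect
print(power(0.2, n=10_000))  # larger n pushes power toward 1
```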
Graphical representation of power function
Some desirable features of a test
• Unbiased test: power is greater than or equal to the significance level of the test for all values of the parameter.
• Test consistency: power goes to one as the sample size goes to infinity.
• Do you think our test statistic in Example 1 delivers an unbiased and consistent test?
  - Remember Ȳ was a consistent estimator for r; it seems the test based on it should also be a consistent test!
• We can also pick among various test (statistic)s based on the properties of their power functions.
Unknown σ / non-normal distributions
• If σ is unknown, then we use S = √[(1/(n − 1)) Σ (Yi − Ȳ)²] instead; this means
  T = (Ȳ − r)/(S/√n) ∼ t_{n−1}
  - Must choose c based on percentiles of t_{n−1}, but if n is large this won't be that different from the standard normal.
• If Yi ∼ D(µ, σ²) and we can derive the distribution of T, then use percentiles of f(t).
• If Yi ∼ (µ, σ²) but the distribution is unknown, then we need the CLT to build an asymptotic test:
  T = (Ȳ − r)/(S/√n) ∼ N(0, 1) as n → ∞
• The principles remain the same!
Confidence intervals vs. hypothesis testing
• Confidence intervals are just the complement of the rejection region.
  - If a value of the parameter (r0) falls in the 95% confidence interval around the estimated mean, then we will not reject H0: r = r0 against H1: r ≠ r0.
• Note many values fall in the confidence interval, and therefore many nulls won't be rejected!! That's why we don't say we accept H0.
  - Reject H0 OR fail to reject H0; "accept H0" is nonsense!
Summary
• In this topic we learned:
  - What a (point) estimator is and what desirable finite-sample and asymptotic properties it may possess.
  - What interval estimation is and how we build confidence intervals around a point estimate.
  - How to formulate a hypothesis and conduct a hypothesis test.