Treatment Effects

An Average Treatment Effect (ATE) is a special case of an average partial effect: specifically, it is the average partial effect for a binary explanatory variable.

The approach was pioneered by Rubin (1974), who introduced the concept in a counterfactual framework.

Generally, most estimators of the ATE fall into one of two categories: those based on (strong) ignorability, and those based on instrumental variables (IV).

Treatment effects begin with a counterfactual, in which each agent has an outcome both with and without treatment.


To model this, we let a random draw from the population be denoted by the triple (yi1, yi0, di):

yi1 denotes the outcome with treatment; yi0 denotes the outcome without treatment; di is the treatment indicator.

The problem is that we only observe yi = di yi1 + (1 − di) yi0.

We are interested in several different measures of the effect of treatment:

1. ATE = E[yi1 − yi0]
2. ATOT = E[yi1 − yi0 | di = 1]
3. CATE = E[yi1 − yi0 | xi]
4. CATOT = E[yi1 − yi0 | di = 1, xi]

How do we estimate these parameters? It depends on the assumptions of the model:


1. Randomized Treatment (rare in the social sciences, where there is usually self-selection).

Here we assume di ⊥ (yi1, yi0).

Note that in this case ATE = ATOT, because E[yi | di = 1] = E[yi1 | di = 1] = E[yi1], and E[yi | di = 0] = E[yi0].


The general relationship between ATE and ATOT: write

yi0 = μ0 + εi0   (1)
yi1 = μ1 + εi1   (2)

Difference the two equations and condition on di = 1 to get

ATOT = ATE + E[εi1 − εi0 | di = 1]   (3)

2. (Strong) Ignorability (selection on observables)

Here we assume di ⊥ (yi1, yi0) conditional on regressors xi. In this case CATE = CATOT:

E[yi | di = 1, xi] = E[yi1 | xi]
E[yi | di = 0, xi] = E[yi0 | xi]


Matching Methods Based on Propensity Scores

We define the propensity score as

p(xi) = P(di = 1 | xi)

One can show that under the ignorability condition,

ATE = EX[ (di − p(xi)) yi / (p(xi)(1 − p(xi))) ]   (4)

To see why, note that the numerator can be expanded as

di yi1 (1 − p(xi)) − p(xi)(1 − di) yi0

Now condition on (xi, di) and take expectations (with m1(xi) = E[yi1 | xi] and m0(xi) = E[yi0 | xi]):

di m1(xi) − p(xi) di m1(xi) − p(xi)(1 − di) m0(xi)

Now take the expectation conditional on xi alone (so E[di | xi] = p(xi)):

p(xi)(1 − p(xi)) (m1(xi) − m0(xi))

Dividing by p(xi)(1 − p(xi)) leaves m1(xi) − m0(xi), whose expectation over xi is the ATE.
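
The identity above suggests a plug-in estimator: replace p(xi) with an estimate and take the sample average. Below is a minimal numpy sketch of that estimator; the data-generating process (one binary covariate, normal errors) is invented purely for illustration.

    # Sketch of the weighting formula above:
    # ATE = E[(d - p(x)) y / (p(x) (1 - p(x)))].
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.binomial(1, 0.5, n)              # one binary covariate
    p = np.where(x == 1, 0.7, 0.3)           # true propensity score p(x)
    d = rng.binomial(1, p)                   # treatment, ignorable given x
    y1 = 2.0 + x + rng.normal(size=n)        # outcome with treatment
    y0 = 1.0 + x + rng.normal(size=n)        # outcome without treatment
    y = d * y1 + (1 - d) * y0                # observed outcome

    # estimate p(x) by treatment frequencies within each covariate cell
    p_hat = np.array([d[x == 0].mean(), d[x == 1].mean()])[x]
    ate_hat = np.mean((d - p_hat) * y / (p_hat * (1 - p_hat)))
    print(ate_hat)                           # close to the true ATE of 1.0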


Instrumental Variables

Since what we have here is a dummy endogenous variable, why not just do IV? The usual conditions for standard IV to identify the ATE require εi1 = εi0 (a common treatment effect), which is unrealistic.

Can interpret IV in the LATE framework. Relax the (unrealistic) strong ignorability condition by introducing binary instruments:

Here we will assume the presence of a binary instrumental variable zi, from which we can define the potential treatment variables di0 and di1, denoting treatment when the instrument equals 0 or 1, respectively:


di = (1 − zi) di0 + zi di1

Therefore,

yi = yi0 + di0(yi1 − yi0) + zi(di1 − di0)(yi1 − yi0)

Make the following assumption (independence): zi ⊥ (yi0, yi1, di0, di1). Then

E[yi | zi = 1] − E[yi | zi = 0] = E[(di1 − di0)(yi1 − yi0)]

which equals

E[yi1 − yi0 | di1 − di0 = 1] P(di1 − di0 = 1) − E[yi1 − yi0 | di1 − di0 = −1] P(di1 − di0 = −1)

Now make the following additional assumption (monotonicity) di1 ≥ di0


In which case

E[yi | zi = 1] − E[yi | zi = 0] = E[yi1 − yi0 | di1 − di0 = 1] P(di1 − di0 = 1)

We define the subgroup of the population that satisfies di1 − di0 = 1 as the compliers, and we define the LATE parameter as

LATE = E[yi1 − yi0 | di1 − di0 = 1]

i.e. the ATE for compliers.


Note that under the monotonicity assumption,

P(di1 − di0 = 1) = E[di1 − di0]   (5)
= E[di1] − E[di0]   (6)
= E[di | zi = 1] − E[di | zi = 0]   (7)
= P(di = 1 | zi = 1) − P(di = 1 | zi = 0)   (8)

Therefore, we have the following result:

LATE = (E[yi | zi = 1] − E[yi | zi = 0]) / (E[di | zi = 1] − E[di | zi = 0])

and note that the r.h.s. corresponds to the IV estimator for the regression of yi on di with zi as instrument.
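
The result above is just a ratio of two differences in means. Here is a minimal numpy sketch; the compliance shares and effect sizes are invented for illustration, and monotonicity (di1 ≥ di0) holds by construction.

    # Wald / LATE estimator: difference in mean outcomes over
    # difference in treatment rates across the two instrument arms.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    z = rng.binomial(1, 0.5, n)                                   # binary instrument
    typ = rng.choice(["complier", "always", "never"], n, p=[0.4, 0.3, 0.3])
    d0 = (typ == "always").astype(int)                            # treatment if z = 0
    d1 = ((typ == "always") | (typ == "complier")).astype(int)    # treatment if z = 1
    d = (1 - z) * d0 + z * d1
    y0 = rng.normal(size=n)
    y1 = y0 + np.where(typ == "complier", 2.0, 0.5)               # effect 2 for compliers
    y = d * y1 + (1 - d) * y0

    late_hat = (y[z == 1].mean() - y[z == 0].mean()) / \
               (d[z == 1].mean() - d[z == 0].mean())
    print(late_hat)                                               # close to 2.0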


On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects
Jinyong Hahn, Econometrica, March 1998, Vol. 66, No. 2, pp. 315–331

Presented in the Applied Microeconometrics Lunch Group
Alvin Murphy, September 26, 2006


Overview

- This paper examines the role of the propensity score in estimating treatment effects.
- It is primarily concerned with Average Treatment Effects (ATE) and Average Treatment Effects on the Treated (ATT).
- Hahn develops semiparametric variance lower bounds and then provides estimators that achieve these bounds.
- The contribution of the paper lies in its collection of central results regarding the role of the propensity score in achieving these bounds.


Introduction 1

- As we have seen over the past few weeks, the core difficulty in estimating treatment effects is a missing counterfactual: for each individual, we only observe one of the two potential outcomes.
- Some notation:
  - Let Di denote a dummy variable, where Di = 1 if individual i received the treatment.
  - Let Y0i and Y1i denote the potential outcomes when Di = 0 and Di = 1.
  - Yi ≡ Di Y1i + (1 − Di) Y0i.
  - Let Xi denote other covariates.
- We only observe (Y, D, X).


Introduction 2

- The two primary parameters of interest are average treatment effects, β ≡ E[Y1i − Y0i], and average treatment effects on the treated, γ ≡ E[Y1i − Y0i | Di = 1].
- It is well known that estimating β or γ without controlling in some way for the selection problem will lead to biased estimates.
- Given certain assumptions, conditioning on the propensity score eliminates this bias.


Introduction: Rosenbaum, Rubin, and the Propensity Score

- Define the propensity score p(x) ≡ P[Di = 1 | Xi = x].
- Assume:
  - Xi is such that Di is ignorable given Xi, i.e. Di ⊥ (Y0i, Y1i) | Xi.
  - 0 < P[Di = 1 | Xi] < 1 for all Xi.
- Then Di ⊥ (Y0i, Y1i) | p(Xi) (Rosenbaum & Rubin 1983, 1984).


Introduction: Nonparametric Estimators of β

- The implication of the ignorability assumption is that E[Yji | p(Xi)] = E[Yi | Di = j, p(Xi)] for j = 0, 1.
- Hence,

β = E{ E[Yi | Di = 1, p(Xi)] − E[Yi | Di = 0, p(Xi)] }
β = E{ E[Yi | Di = 1, Xi] − E[Yi | Di = 0, Xi] }

- This suggests an estimator of β may be constructed as a sample average of

Ê[Yi | Di = 1, p(Xi)] − Ê[Yi | Di = 0, p(Xi)]  or  Ê[Yi | Di = 1, Xi] − Ê[Yi | Di = 0, Xi]


Outline

- Examine the efficient estimation of β and γ under the ignorability assumption.
- Examine the role of the propensity score from an efficiency point of view.
- Hahn calculates semiparametric efficiency bounds, and estimators that achieve these bounds are constructed.
- He shows the propensity score is unnecessary for the estimation of β, but knowledge of the propensity score does decrease the asymptotic variance bound for γ.
- Even in this case, projection on the propensity score is not necessary to achieve the lower bound.
- In some cases, conditioning on the propensity score could even result in a loss of efficiency.


Efficiency Bounds 1

The dataset consists of (Di, Yi, Xi) for i = 1, . . . , n.

Theorem 1: Assume (Y0i, Y1i) ⊥ Di | Xi. The asymptotic variance bounds for β and γ are, respectively,

E[ σ1²(Xi)/p(Xi) + σ0²(Xi)/(1 − p(Xi)) + (β(Xi) − β)² ]

and

E[ p(Xi)σ1²(Xi)/p² + p(Xi)²σ0²(Xi)/(p²(1 − p(Xi))) + (β(Xi) − γ)² p(Xi)/p² ]   (1)

where
- βj(Xi) = E[Yji | Xi] for j = 0, 1, and β(Xi) = β1(Xi) − β0(Xi)
- σj²(Xi) = var(Yji | Xi)
- p = E[p(Xi)]

Efficiency Bounds 2

Theorem 2: Assume (Y0i, Y1i) ⊥ Di | Xi, and furthermore assume that the propensity score p(Xi) is known. The asymptotic variance bounds for β and γ are, respectively,

E[ σ1²(Xi)/p(Xi) + σ0²(Xi)/(1 − p(Xi)) + (β(Xi) − β)² ]

and

E[ p(Xi)σ1²(Xi)/p² + p(Xi)²σ0²(Xi)/(p²(1 − p(Xi))) + (β(Xi) − γ)² p(Xi)²/p² ]

Efficiency Bounds 3

Result 1: The propensity score does not play any role in the estimation of β: knowledge of the propensity score does not decrease the variance bound.

Result 2: Knowledge of the propensity score reduces the asymptotic variance bound for γ by

E[ (β(Xi) − γ)² p(Xi)(1 − p(Xi))/p² ]   (2)

(2) can be interpreted as the marginal value of the propensity score.

Efficiency Bounds 4

Theorem 3: Assume (Y0i, Y1i) ⊥ Di | Xi, and furthermore assume that the propensity score p(Xi) is equal to some unknown constant p. The asymptotic variance bound for β = γ is

E[ σ1²(Xi)/p + σ0²(Xi)/(1 − p) + (β(Xi) − β)² ]

Now consider the variance bounds in Theorem 1 for the case where p(Xi) = p. The bound for β equals

E[ σ1²(Xi)/p + σ0²(Xi)/(1 − p) + (β(Xi) − β)² ]   (3)

and the bound for γ equals

E[ σ1²(Xi)/p + σ0²(Xi)/(1 − p) + (β(Xi) − γ)²/p ]   (4)

Efficiency Bounds 5

- The bound for β is, unsurprisingly, the same.
- The bound for γ is lower.
- The marginal value of knowing that assignment to treatment is random is given by

E[ ((1 − p)/p) (β(Xi) − β)² ]

- This marginal value equals the marginal value calculated above (the benefit of knowing p(Xi)) when p(Xi) = p.
- Hahn suggests that this implies the marginal value of knowledge of the propensity score consists entirely of the marginal value of dimension reduction.


Efficient Estimation 1

- Given knowledge of the variance bounds under different assumptions, which estimators will achieve these bounds?
- The estimators constructed are based on the relevant sample averages from an augmented data set.
- A dataset can be augmented to fill in the missing values of Y1i and Y0i by a nonparametric imputation method based on the projection on Xi.
- Even when the propensity score is known, it is shown that projecting on the propensity score is not necessary for the estimator of γ to achieve the lower bound.
- Conditioning on the propensity score may reduce efficiency if the treatment is randomly assigned.


Efficient Estimation 2

- If both Y1i and Y0i were always observed, then a consistent estimator of β could easily be formed by the sample average of the difference Y1i − Y0i.
- A consistent estimator of γ would be the sample average of the difference Y1i − Y0i over observations with Di = 1.
- The first step, therefore, is to nonparametrically impute the missing values of Y1i and Y0i using their conditional expectation given Xi.
- These conditional expectations are only identified under the assumption of the ignorability of Di given Xi, as can be seen from

E[Di Yi | Xi] = E[Di Y1i | Xi] = E[Di | Xi] E[Y1i | Xi] = E[Di | Xi] β1(Xi)

so β1(Xi) = E[Di Yi | Xi]/E[Di | Xi].


Efficient Estimation 3 - Completing the Data

Given nonparametric estimators Ê[(1 − Di)Yi | Xi], Ê[Di Yi | Xi], and Ê[Di | Xi], we can fill in the missing values of Y1i and Y0i via

β̂1(Xi) ≡ Ê[Di Yi | Xi]/Ê[Di | Xi]  and  β̂0(Xi) ≡ Ê[(1 − Di)Yi | Xi]/(1 − Ê[Di | Xi])

We can then form the "complete" data set (Ŷ1i, Ŷ0i, Di, Xi), where

Ŷ1i ≡ Di Yi + (1 − Di)β̂1(Xi)  and  Ŷ0i ≡ (1 − Di)Yi + Di β̂0(Xi)


Efficient Estimation 4 - Estimating β and γ

Two alternative estimators of β and γ are now easily constructed:

β̂ = (1/n) Σi (Ŷ1i − Ŷ0i)  and  γ̂ = Σi Di(Ŷ1i − Ŷ0i) / Σi Di

Recall that βj(Xi) = E[Yji | Xi] and β(Xi) = β1(Xi) − β0(Xi). Since we have already constructed β̂1(Xi) and β̂0(Xi), we can also estimate β and γ as

β̃ = (1/n) Σi (β̂1(Xi) − β̂0(Xi))  and  γ̃ = Σi Di(β̂1(Xi) − β̂0(Xi)) / Σi Di
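
A compact numpy sketch of these four estimators, using the cell-mean first stage of Theorem 5 below for a discrete Xi; the data-generating process is invented for illustration.

    # Hahn-style imputation estimators with a cell-mean first stage.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    X = rng.integers(0, 3, n)                    # X with known finite support {0, 1, 2}
    D = rng.binomial(1, np.array([0.3, 0.5, 0.7])[X])
    Y1 = 1.0 + X + rng.normal(size=n)            # true ATE = 1.0
    Y0 = X + rng.normal(size=n)
    Y = D * Y1 + (1 - D) * Y0

    beta1 = np.empty(n)
    beta0 = np.empty(n)
    for x in np.unique(X):                       # first stage: cell means
        m = X == x
        p_hat = D[m].mean()
        beta1[m] = (D[m] * Y[m]).mean() / p_hat
        beta0[m] = ((1 - D[m]) * Y[m]).mean() / (1 - p_hat)

    Y1hat = D * Y + (1 - D) * beta1              # "complete" the data
    Y0hat = (1 - D) * Y + D * beta0
    beta_hat = (Y1hat - Y0hat).mean()                     # ATE
    gamma_hat = (D * (Y1hat - Y0hat)).sum() / D.sum()     # ATT
    beta_tilde = (beta1 - beta0).mean()                   # projection versions
    gamma_tilde = (D * (beta1 - beta0)).sum() / D.sum()
    print(beta_hat, beta_tilde, gamma_hat, gamma_tilde)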


Efficient Estimation 5 - Asymptotic Variances

- Using results from Newey (1994), it can be shown that the asymptotic variances of √n(β̂ − β) and √n(β̃ − β) are equal to each other and equal to the asymptotic variance bound derived in Theorem 1.
- Similarly, it can be shown that the asymptotic variances of √n(γ̂ − γ) and √n(γ̃ − γ) are equal to each other and equal to the asymptotic variance bound derived in Theorem 1.
- Proposition 4 formally states the above two results but does not provide any information regarding the first-stage nonparametric regression estimation.
- Theorems 5 and 6 (and their respective discussions) provide some guidance and necessary conditions on the first-stage estimation for the overall estimators to be efficient.


Efficient Estimation 6 - First Stage Estimation

Theorem 5: Assume (Y0i, Y1i) ⊥ Di | Xi, and furthermore assume that Xi has known finite support. Then β̂ and β̃ are efficient semiparametric estimators for β, and γ̂ and γ̃ are efficient semiparametric estimators for γ.

First-stage estimators are simply constructed as cell averages:

Ê[Di Yi | Xi = x] = Σi Di Yi · 1(Xi = x) / Σi 1(Xi = x)
Ê[(1 − Di)Yi | Xi = x] = Σi (1 − Di)Yi · 1(Xi = x) / Σi 1(Xi = x)
Ê[Di | Xi = x] = Σi Di · 1(Xi = x) / Σi 1(Xi = x)


Efficient Estimation 7 - First Stage Estimation

Theorem 6: Assume (Y0i, Y1i) ⊥ Di | Xi, and furthermore assume that multiple other conditions hold (see paper). Then β̂ and β̃ are efficient semiparametric estimators for β, and γ̂ and γ̃ are efficient semiparametric estimators for γ.

First-stage estimators are formed by series estimators. For example, an estimator Ê[Yi | Xi = x] would be built from

p^k(x) = (p1k(x), . . . , pkk(x))′
y = (Y1, . . . , Yn)′
p^k = [p^k(X1), . . . , p^k(Xn)]′
Ê[Yi | Xi = x] = p^k(x)′π̂,  π̂ = (p^k′ p^k)⁻¹ p^k′ y
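
A minimal sketch of such a series first stage for a scalar x, using a power-series basis and OLS; the basis order K is a tuning choice and the example data are invented.

    # Series (power-series) regression first stage: project Y on
    # (1, x, x^2, ..., x^(K-1)) and use fitted values as E_hat[Y | X = x].
    import numpy as np

    def series_fit(y, x, K):
        basis = np.vander(x, K, increasing=True)          # n-by-K regressor matrix p^k
        pi_hat, *_ = np.linalg.lstsq(basis, y, rcond=None)
        return basis @ pi_hat                             # fitted conditional means

    rng = np.random.default_rng(3)
    x = rng.uniform(-1, 1, 5_000)
    y = np.sin(2 * x) + rng.normal(scale=0.3, size=5_000)
    print(series_fit(y, x, K=6)[:5])

For vector-valued Xi, the basis would be built from the multi-index power series described in the Hirano, Imbens, and Ridder slides later in these notes.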


Efficient Estimation 8 - Is Imputation Necessary?

- Imputation is (seemingly) unavoidable even for experimental data.
- Consider Robinson's partially linear semiparametric regression model (Robinson 1988): regress Yi − Ê[Yi | Xi] on Di − Ê[Di | Xi].
- The plim of the resulting estimator βsl is given by

E[(Yi − E[Yi | Xi])(Di − E[Di | Xi])] / E[(Di − E[Di | Xi])²] = β

- The asymptotic variance of βsl can be calculated using the "machinery" of Newey (1994) and is shown to be larger than that of Hahn's β̂.


Efficient Estimation 9 - γ and the Propensity Score

- It was previously shown that the propensity score is unnecessary for the estimation of β.
- This does not hold for γ, as knowledge of the propensity score reduces the asymptotic variance bound.
- It is probably unrealistic to assume the propensity score is known. However, many papers nonparametrically estimate the propensity score to exploit the dimension-reduction feature.
- Hahn argues that even if the propensity score is known, it is not necessary to project onto the propensity score.

Proposition 7: Assume (Y0i, Y1i) ⊥ Di | Xi, and furthermore assume that the propensity score p(·) is known. Then the following estimator is an efficient estimator of γ:

γ̂ = [ (1/n) Σi p(Xi) ( Ê[Di Yi | Xi]/Ê[Di | Xi] − Ê[(1 − Di)Yi | Xi]/(1 − Ê[Di | Xi]) ) ] / [ (1/n) Σi p(Xi) ]


Efficient Estimation 10 - Dangers of the Propensity Score

- Could projection on the propensity score even be harmful for the estimation of β = γ (i.e. the experimental data case)?
- β̃, which is an efficient estimator of β with or without knowledge of the propensity score, is still efficient for β; note that the estimator for γ developed above in Proposition 7 reduces to β̃ when the propensity score is constant.
- We don't want to use γ̃, as it is only efficient when the propensity score is unknown.
- If we condition on the propensity score when it is constant, we just get the marginal expectation. Therefore we consider the difference in sample averages as an estimator; call this estimator β̂ols.
- However, var(β̂ols) − var(β̂) ≥ 0.


Conclusions and Summary

- Interesting development of asymptotic variance lower bounds.
- The bound for β (ATE) is unaffected when the propensity score is known.
- The bound for γ (ATT) is reduced when the propensity score is known.
- Uses imputation of missing values to construct estimators that attain the lower bounds.
- Is imputation really the only way to achieve the lower bounds? Find out next week!
- Overall, the paper provides little encouragement for fans of propensity score estimators. It is worth noting that projecting on the propensity score can reduce efficiency in the experimental-data case.


Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score

Hirano, Imbens, and Ridder (2003)
Presented by Jason Blevins
Applied Microeconometrics Reading Group, Duke University
October 5, 2006

Context

Previous Work:

- As shown by Rosenbaum and Rubin (1983, 1985), the unconfoundedness assumption implies that adjusting for differences in p(x) removes all bias associated with x.
- Hahn (1998) shows that while this removes all bias, it is not necessarily as efficient as conditioning on the covariates.
- Rosenbaum (1987), Rubin and Thomas (1996), and Robins, Rotnitzky, and Zhao (1995): there can be efficiency gains from using parametric estimates of the propensity score rather than the true propensity score.

Main Finding:

- Estimators for τ, τwate, and τtreated are presented which weight observations by the inverse of nonparametric estimates of p(x). If the estimator for p(x) is sufficiently flexible, this leads to a fully efficient estimator.

Outline

- Model, Objectives, and Assumptions
- Other Approaches: Matching, Estimation Using the Propensity Score
- Previous Results: Hahn (1998)
- Missing Data Example
- Three Efficient Estimators:
  - Population Average Treatment Effect
  - Weighted Average Treatment Effect
  - Average Treatment Effect for the Treated

Model

- Population: (T, X, Y(0), Y(1))
- Missing data: Y ≡ T·Y(1) + (1 − T)·Y(0)
- Random sample: {(Ti, Xi, Yi)}, i = 1, . . . , N
- Treatment indicator: Ti ∈ {0, 1}
- Vector of covariates: Xi
- Outcomes: Yi(0), Yi(1)

Assumption 1 (Unconfoundedness): T ⊥ (Y(0), Y(1)) | X

Quantities of Interest

- Population ATE: τ = E[Y(1) − Y(0)]
- Weighted ATE: τwate = ∫ E[Y(1) − Y(0) | X = x] g(x) dF(x) / ∫ g(x) dF(x)
- ATE on the Treated: τtreated = E[Y(1) − Y(0) | T = 1]
- Propensity score: p(x) = P(T = 1 | X = x)
- The ATE on the treated arises when the weight is g(x) = p(x).

Problem: we only observe either Yi(0) or Yi(1), never both. Straightforward nonparametric estimators are all infeasible!

Estimation by Matching

The unconfoundedness assumption implies that

τ(x) ≡ E[Y(1) − Y(0) | X = x]
     = E[Y(1) | X = x] − E[Y(0) | X = x]
     = E[Y(1) | T = 1, X = x] − E[Y(0) | T = 0, X = x]
     = E[Y | T = 1, X = x] − E[Y | T = 0, X = x]

since

E[Y | T = 1, X = x] = E[T·Y(1) + (1 − T)·Y(0) | T = 1, X = x] = E[Y(1) | T = 1, X = x]

and then

τ = ∫ τ(x) dF(x)

Estimation Using the Propensity Score

Rosenbaum and Rubin (1983, 1985) show that the unconfoundedness assumption T ⊥ (Y(0), Y(1)) | X implies T ⊥ (Y(0), Y(1)) | p(X).

Unconfoundedness gives:

E[TY | X = x] = E[T·Y(1) | X = x] = E[T | X = x] E[Y(1) | X = x]

so

E[Y(1) | X = x] = E[TY | X = x]/E[T | X = x] = E[TY | X = x]/p(x)

This suggests using a sample average to nonparametrically estimate

τ = E[τ(x)] = E[ E[Y(1) − Y(0) | X = x] ]

Previous Results: Hahn (1998)

- Semiparametric efficiency bounds and estimators for τ and τtreated.
- Knowing p(x) does not affect the bound for τ.
- Knowing p(x) decreases the bound for τtreated.
- In general, conditioning only on p(x) and not on the covariates does not lead to an efficient estimator (experimental data case).
- Efficient estimator for τ, regardless of whether p(x) is known:
  - Nonparametrically estimate E[YT | X = x], E[Y(1 − T) | X = x], and p(x).
  - Impute values for Yi(1) and Yi(0) using

    Ŷi(1) = Ê[YT | Xi]/p̂(Xi)  and  Ŷi(0) = Ê[Y(1 − T) | Xi]/(1 − p̂(Xi))

Example: Missing Data with Binary Covariates

- Want to estimate θ0 ≡ E[Y] given a random sample {(Ti, Xi, TiYi)}, i = 1, . . . , N.
- Ti and Xi are observed for everyone; Yi is observed only if Ti = 1.
- Unconfoundedness: T ⊥ Y | X
- Propensity score: p(x) = E[T | X = x] = P[T = 1 | X = x]; suppose p(x) = 1/2.
- Binary covariate: x ∈ {0, 1}, Ntx = #{i : Ti = t, Xi = x}

Example: True Weights Estimator

Normalized variance bound for θ0:

Vbound = 2 E[V(Y | X)] + V(E[Y | X])

True weights estimator:

θ̂tw = (1/N) Σi Yi Ti/p(Xi) = (1/N) Σi Yi Ti/(1/2)

Vtw = 2 E[V(Y | X)] + V(E[Y | X]) + E[ E[Y | X]² ]

Inefficient unless E[Y | X] = 0.

Example: Estimated Weights Estimator

Estimated propensity score:

p̂(x) = N10/(N00 + N10) if x = 0,  N11/(N01 + N11) if x = 1,  where Ntx = #{i : Ti = t, Xi = x}

Estimated weights estimator:

θ̂ew = (1/N) Σi Yi Ti/p̂(Xi)

Vew = 2 E[V(Y | X)] + V(E[Y | X]) = Vbound

Fully efficient!
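
A small Monte Carlo sketch of this comparison: with p(x) = 1/2, the estimated-weights estimator has smaller variance than the true-weights estimator whenever E[Y | X] ≠ 0. The data-generating process is invented for illustration.

    # True weights vs. estimated weights in the missing-data example.
    import numpy as np

    rng = np.random.default_rng(4)
    reps, n = 2_000, 500
    tw = np.empty(reps)
    ew = np.empty(reps)
    for r in range(reps):
        x = rng.binomial(1, 0.5, n)
        t = rng.binomial(1, 0.5, n)                 # p(x) = 1/2 in both cells
        y = 1.0 + 2.0 * x + rng.normal(size=n)      # E[Y|X] != 0
        tw[r] = np.mean(y * t / 0.5)                # weight by the true p
        p_hat = np.array([t[x == 0].mean(), t[x == 1].mean()])[x]
        ew[r] = np.mean(y * t / p_hat)              # weight by cell frequencies
    print(tw.var(), ew.var())                       # estimated weights: smaller variance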

GMM Interpretation: True Weights Estimator

The true weights estimator θ̂tw is a GMM estimator with moment function

ψ1(y, t, x, θ) = yt/p(x) − θ = yt/(1/2) − θ

corresponding to the restriction E{ E[YT | X] − E[Y | X] E[T | X] } = 0 implied by unconfoundedness, so that E[YT/p(X)] = E[Y] = θ.

It ignores the information about T, so it is not necessarily efficient.

GMM Interpretation: Estimated Weights Estimator

The propensity score provides additional information:

E{ E[T | X] − p(X) } = E[T − 1/2] = 0

With a binary covariate, we have

ψ2(y, t, x, θ) = ( x(t − 1/2), (1 − x)(t − 1/2) )′

GMM with moment conditions ψ1 and ψ2 is fully efficient and corresponds to the estimated weights estimator.

An Estimator for τ

Quantities of interest:

τ ≡ E[Y(1) − Y(0)],  p(x) ≡ P[T = 1 | X = x]

Conditional moments:

μt(x) ≡ E[Y(t) | X = x],  σt²(x) ≡ V(Y(t) | X = x)

τ satisfies E[ψ(Y, T, X, τ, p(X))] = 0, where

ψ(y, t, x, τ, p(x)) = yt/p(x) − y(1 − t)/(1 − p(x)) − τ

Given an estimate p̂ of p,

τ̂ = (1/N) Σi [ Yi Ti/p̂(Xi) − Yi(1 − Ti)/(1 − p̂(Xi)) ]

Series Logit Estimator

- Vector of functions: RK(x) = (r1K(x), r2K(x), . . . , rKK(x))′
- Multi-index: λ = (λ1, . . . , λr)′, λj ∈ ℕ, r = dim(x)
- Norm: |λ| = Σj λj
- Sequence of distinct multi-indices: {λ(k)}k with |λ(k)| ≤ |λ(k + 1)|
- Power series elements: x^λ = ∏j xj^λj
- Take the sequence {rkK(x)}k where rkK(x) = x^λ(k)

Series Logit Estimator

Example (r = 3):

λ(1) = (0, 0, 0), λ(2) = (1, 0, 0), λ(3) = (0, 1, 0), λ(4) = (0, 0, 1), λ(5) = (2, 0, 0), . . .

R1 = (1), R2 = (1, x1)′, R3 = (1, x1, x2)′, R4 = (1, x1, x2, x3)′, R5 = (1, x1, x2, x3, x1²)′, . . .

Series Logit Estimator

Logistic CDF: L(a) = e^a/(1 + e^a)

The series logit estimator of p(x) is p̂(x) = L(RK(x)′π̂K), with

π̂K = argmax over π of Σi [ Ti ln L(RK(Xi)′π) + (1 − Ti) ln(1 − L(RK(Xi)′π)) ]
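
A minimal sketch of the series logit estimator for a scalar x, fitting the logit MLE on a power-series basis by Newton's method; K and the step count are illustrative tuning choices.

    # Series logit: p_hat(x) = L(R_K(x)' pi_hat), pi_hat from the logit MLE.
    import numpy as np

    def series_logit(t, x, K, steps=25):
        R = np.vander(x, K, increasing=True)          # R_K(x): power-series terms
        pi = np.zeros(K)
        for _ in range(steps):                        # Newton steps on the log-likelihood
            p = 1.0 / (1.0 + np.exp(-R @ pi))         # L(a) = e^a / (1 + e^a)
            grad = R.T @ (t - p)
            hess = R.T @ (R * (p * (1 - p))[:, None])
            pi += np.linalg.solve(hess, grad)
        return 1.0 / (1.0 + np.exp(-R @ pi))

    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, 10_000)
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + x))))
    p_hat = series_logit(t, x, K=4)
    print(p_hat[:5])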

More Assumptions

Assumption 2 (Distribution of X):
i. The support of X is a compact subset of R^r.
ii. The density of X is bounded and bounded away from 0.

Assumption 3 (Distribution of (Y(0), Y(1))):
i. E[Y(0)²] < ∞ and E[Y(1)²] < ∞.
ii. μ0(x) and μ1(x) are continuously differentiable.

Assumption 4 (Selection probability):
i. p(x) is continuously differentiable of order s with s ≥ 7r.
ii. 0 < p(x) < 1.

Assumption 5: The series logit estimator of p(x) uses a power series with K = N^ν for some 1/(4(s/r − 1)) < ν < 1/9.

Asymptotic Properties of τ̂

Theorem 1. Suppose Assumptions 1–5 hold. Then:

i. τ̂ →p τ.
ii. √N(τ̂ − τ) →d N(0, V) with

V = E[ (τ(X) − τ)² + σ1²(X)/p(X) + σ0²(X)/(1 − p(X)) ]

iii. τ̂ reaches the semiparametric efficiency bound.

Asymptotic Properties of τ̂

τ̂ is asymptotically linear:

τ̂ = τ + (1/N) Σi [ ψ(Yi, Ti, Xi; τ, p(Xi)) + α(Ti, Xi) ] + op(1/√N)

where

α(t, x) = −( μ1(x)/p(x) + μ0(x)/(1 − p(x)) ) (t − p(x))

and so

V = E[ (ψ + α)² ]

The known-weights estimator is asymptotically linear with influence function ψ alone. A consistent estimator for V is found using another series logit estimator.

An Efficient Estimator for τwate

τwate = ∫ E[Y(1) − Y(0) | X = x] g(x) dF(x) / ∫ g(x) dF(x)

- By choosing the weighting function g appropriately, we can obtain average treatment effects for a subpopulation defined by X.
- Note that g = p yields τtreated.

ψ(y, t, x, τwate, p(x)) = g(x) [ yt/p(x) − y(1 − t)/(1 − p(x)) − τwate ]

τ̂wate = Σi g(Xi) [ Yi Ti/p̂(Xi) − Yi(1 − Ti)/(1 − p̂(Xi)) ] / Σi g(Xi)
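
This estimator is one line of numpy once p̂ is in hand. A sketch with an invented data-generating process; g ≡ 1 recovers τ̂ and g = p targets the treated.

    # Weighted-ATE estimator: weight the IPW contrasts by g(X_i).
    import numpy as np

    def wate(y, t, p_hat, g):
        psi = y * t / p_hat - y * (1 - t) / (1 - p_hat)
        return np.sum(g * psi) / np.sum(g)

    rng = np.random.default_rng(6)
    n = 50_000
    x = rng.binomial(1, 0.5, n)
    p = np.where(x == 1, 0.6, 0.4)
    t = rng.binomial(1, p)
    y = t * (1.0 + x) + (1 - t) * x + rng.normal(size=n)
    print(wate(y, t, p, np.ones(n)))   # g = 1: population ATE (about 1.0)
    print(wate(y, t, p, p))            # g = p: ATE on the treated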

Asymptotic Properties of τ̂wate

Theorem 3. Suppose Assumptions 1–5 hold, |g(x)| is bounded, and E[g(X)] > 0. Then:

i. τ̂wate →p τwate.
ii. √N(τ̂wate − τwate) →d N(0, V) with

V = (1/E[g(X)]²) E[ g(X)²(τ(X) − τwate)² + g(X)²σ1²(X)/p(X) + g(X)²σ0²(X)/(1 − p(X)) ]

iii. V̂ is consistent for V.

Theorem 4. τ̂wate reaches the semiparametric efficiency bound for τwate.

An Estimator for τtreated with p Known

Take g(x) = p(x) and apply the estimator for τwate:

ψ(y, t, x, τtreated, p(x)) = p(x) [ yt/p(x) − y(1 − t)/(1 − p(x)) − τtreated ]

The estimator τ̂treated is the solution to

0 = Σi p(Xi) [ Yi Ti/p̂(Xi) − Yi(1 − Ti)/(1 − p̂(Xi)) − τtreated ]

- Notice that p is used as the weighting function while p̂ weights the observations.
- Hahn (1998) showed that knowing p reduces the variance bound for τtreated.
- From Theorems 3 and 4, this estimator is √N-consistent, asymptotically normal, and efficient.

An Estimator for τtreated with p Unknown

If p is unknown, the efficiency bound for τtreated is higher. We need a new estimator, since τ̂treated used p. Let τ̂te be the solution to

0 = Σi p̂(Xi) [ Yi Ti/p̂(Xi) − Yi(1 − Ti)/(1 − p̂(Xi)) − τte ]
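
The two treated-effect estimators differ only in which score plays the weighting role: the known p when available, p̂ otherwise. A sketch with an invented data-generating process (constant effect 2.0, so τtreated = 2.0):

    # tau_treated (p known): weight by the true p, invert by p_hat.
    # tau_te (p unknown): weight and invert by p_hat.
    import numpy as np

    def att_weighted(y, t, p_inv, g):
        psi = y * t / p_inv - y * (1 - t) / (1 - p_inv)
        return np.sum(g * psi) / np.sum(g)

    rng = np.random.default_rng(7)
    n = 100_000
    x = rng.binomial(1, 0.5, n)
    p = np.where(x == 1, 0.7, 0.3)                      # known propensity score
    t = rng.binomial(1, p)
    y = t * (2.0 + x) + (1 - t) * x + rng.normal(size=n)
    p_hat = np.array([t[x == 0].mean(), t[x == 1].mean()])[x]
    print(att_weighted(y, t, p_hat, g=p))       # p known
    print(att_weighted(y, t, p_hat, g=p_hat))   # p unknown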

Asymptotic Properties of τ̂te

Theorem 5. Suppose Assumptions 1–5 hold. Then:

i. τ̂te →p τtreated.
ii. √N(τ̂te − τtreated) →d N(0, V) with

V = (1/E[p(X)]²) E[ p(X)(τ(X) − τtreated)² + p(X)σ1²(X) + p(X)²σ0²(X)/(1 − p(X)) ]

iii. τ̂te reaches the semiparametric efficiency bound for estimation of τtreated when the propensity score is not known.

Conclusion

Results:

- Hahn (1998) showed that conditioning on the true propensity score does not, in general, yield an efficient estimator.
- Weighting by the true propensity score does not yield an efficient estimator; however, weighting by the estimated propensity score does.
- The proposed estimators require nonparametric estimation of fewer functions than other efficient estimators.

Open Questions:

- Finite sample properties
- Computational properties

The Mystery of Propensity Score Matching

October 28, 2008

• Rosenbaum and Rubin, the "Central Role" of the propensity score.
• D ∈ {0, 1}. A balancing score is a function b(X) such that D ⊥ X | b(X).
• b(X) = X is obviously a balancing score.
• p(X) = P(D = 1 | X) is also a balancing score.
• p(X) is the coarsest balancing score. In other words, for any balancing score b(X), there must be some function g(·) such that p(X) = g(b(X)).
• Proof: By assumption, f(D = 1 | X, b(X)) = f(D = 1 | b(X)). The LHS is p(X). The RHS is some function of b(X).
• Conditional independence (CI) assumption: Yi1, Yi0 ⊥ Di | Xi. Also called unconfoundedness, strong ignorability, etc.
• Under (CI), you can match on any balancing score b(Xi): Yi1, Yi0 ⊥ Di | b(Xi), because

E(Y1 − Y0 | b(X)) = E(Y1 | D = 1, b(X)) − E(Y0 | D = 0, b(X)) = E(Y | D = 1, b(X)) − E(Y | D = 0, b(X))

• But why do you want to match on b(X), or p(X)?

• Does the balancing score really help with estimating ATE and ATT?
• Only if you know p(X), or b(X).
• Consider ATE. Method 1, no p(X):

E(Y1 − Y0) = EX(E(Y1 − Y0 | X)) = EX(E(Y | D = 1, X) − E(Y | D = 0, X))

This can be estimated by

(1/n) Σi [ Ê(Y | D = 1, Xi) − Ê(Y | D = 0, Xi) ]

• Method 2, match on p(X) (see the sketch below):

E(Y1 − Y0) = Ep(X)(E(Y1 − Y0 | p(X))) = Ep(X)(E(Y | D = 1, p(X)) − E(Y | D = 0, p(X)))

First estimate p̂(X), then

(1/n) Σi [ Ê(Y | D = 1, p̂(Xi)) − Ê(Y | D = 0, p̂(Xi)) ]
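
A minimal sketch of Method 2 using one-nearest-neighbor matching on the estimated score; the matching rule and the data-generating process are illustrative choices, not from the slides.

    # Match each unit to the closest opposite-arm unit in p_hat, then
    # average the implied treatment effects over all units (ATE).
    import numpy as np

    def match_ate(y, d, p_hat):
        treated = np.flatnonzero(d == 1)
        control = np.flatnonzero(d == 0)
        effects = np.empty(len(y))
        for i in range(len(y)):
            pool = control if d[i] == 1 else treated
            j = pool[np.argmin(np.abs(p_hat[pool] - p_hat[i]))]
            effects[i] = y[i] - y[j] if d[i] == 1 else y[j] - y[i]
        return effects.mean()

    rng = np.random.default_rng(8)
    n = 2_000
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-x))
    d = rng.binomial(1, p)
    y = d * (1.0 + x) + (1 - d) * x + rng.normal(scale=0.5, size=n)
    p_hat = p                       # stand-in for a first-stage estimate of p(X)
    print(match_ate(y, d, p_hat))   # about 1.0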

• Does this improve efficiency because the conditional mean is only estimated on one dimension?
• No. Estimating p̂(X) is still a multi-dimensional problem.
• Same story for ATT.

• Similarly for ATT, Method 1:

E(Y1 − Y0 | D = 1) = EX|D=1(E(Y1 − Y0 | X, D = 1)) = EX|D=1(E(Y | D = 1, X) − E(Y | D = 0, X))

This is estimated by

Σi Di [ Ê(Y | D = 1, Xi) − Ê(Y | D = 0, Xi) ] / Σi Di

• Method 2:

E(Y1 − Y0 | D = 1) = Ep(X)|D=1(E(Y1 − Y0 | p(X), D = 1)) = Ep(X)|D=1(E(Y | D = 1, p(X)) − E(Y | D = 0, p(X)))

This is estimated by

Σi Di [ Ê(Y | D = 1, p̂(Xi)) − Ê(Y | D = 0, p̂(Xi)) ] / Σi Di

• Inverse propensity weighting is different from propensity matching.
• For ATE, one can show that (with p ≡ P(D = 1))

E(Y1 − Y0) = E[ Y1 · p/p(X) | D = 1 ] − E[ Y0 · (1 − p)/(1 − p(X)) | D = 0 ]

This can be estimated by

(1/n1) Σ over {i : Di = 1} of Y1i p̂/p̂(Xi) − (1/n0) Σ over {i : Di = 0} of Y0i (1 − p̂)/(1 − p̂(Xi))

• Inverse propensity (probability) weighting for ATT:

E(Y1 − Y0 | D = 1) = E(Y1 | D = 1) − E[ Y0 · p(X)(1 − p)/(p(1 − p(X))) | D = 0 ]

This can be estimated by

(1/n1) Σ over {i : Di = 1} of Y1i − (1/n0) Σ over {i : Di = 0} of Y0i p̂(Xi)(1 − p̂)/(p̂(1 − p̂(Xi)))

• Mystery of inverse probability weighting for ATT: even if you know p(X), using a combination of p̂(X) and p(X) can improve efficiency.
• But how do you guess the most efficient combination? Ask Hirano, Imbens, and Ridder (2003).
• Mystery unresolved:
• Why do we ever need p̂(X) when we don't know p(X)?
• Why should we ever use inverse probability weighting?

[The final deck in these notes is unrecoverable due to font-encoding corruption. The surviving fragments identify it as a presentation of Abadie, Angrist, and Imbens (2002), "Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings," Econometrica 70(1), covering quantile treatment effects for compliers: effects on the quantiles and median of the earnings distribution rather than on the mean.]
