## Propensity Score Matching Methods

Author: Daisy Rice
Propensity Score Matching Methods Michael Oakes Division of Epidemiology & Community Health School of Public Health University of Minnesota Spring 2011


Part I. “Theory”

Outline
1. Review of Core Ideas
2. Confounding
3. Multiple Regression
4. Propensity Score Methods
5. Issues & Assumptions
6. Review
7. Questions

1. Core Ideas

More formally, T has a causal effect on Y for person i if

$$Y_i^{T=0} \neq Y_i^{T=1}$$

But we can only observe one of these outcomes for any i. At the population level, we use probabilities and, assuming exchangeability,

$$\Pr[Y^{T=0} = 1] \neq \Pr[Y^{T=1} = 1]$$

Ideal Experiment Randomize a bunch of folks to two different conditions/environments and observe their outcomes at later time. Assume no problems with: measurement, attrition, contamination, interference, etc

Ideal Experiment The process of allocating subjects to conditions (i.e., randomization) is independent of the outcome variable. In other words, treatment assignment mechanism does not depend on potential outcomes. We can identify effects because Tx and Cx groups are exchangeable: we’ve a defensible counterfactual substitute.

Analysis of Experimental Data
Simple!
• Bivariate regression (ie, t-test)
• Non-parametric test (eg, permutation test)

Even if ethical, randomized experiments are very expensive and difficult to conduct.

Most of our work must rely on observational study designs.

Observational designs pose many many many problems for causal inference.

Central problem of Obs Study Selection or differences between Tx and Cx groups unrelated to the Tx (ie, confounding) It’s an identification problem: • Is observed effect attributable to Tx or differences between groups?

2. Confounding
• A mixing of effects
• Lack of exchangeability between Tx and Cx groups
• Imbalance of background characteristics between groups that affects outcomes

Confounding Surgeons who do high-risk surgeries have higher patient mortality rates than family docs who treat sore knees. • Are the surgeons less skilled or are the groups being treated different?

Absent randomization the central design and analytic task is to remove influence of confounding and regain exchangeable groups. How? A. Restrict B. Match C. Adjust

A. Restriction Design study to collect or keep only data on exchangeable subjects. Delete other data. • Works well, but how to define exchangeable? • What values of what variable(s) should you limit analysis to? Age, smoking status, SES, health-risk score, ???

B. Match Match Tx and Cx subjects on observed confounding variable (eg, age). This yields conditional exchangeability. • Works well, but how precise should “age” be? Year, year+month, age_cat? • Can only match on a few (eg, 1-3) variables before “curse of dimensionality”

C. Adjust Use multiple regression to “control for” or “adjust for” potential confounders. Can simultaneously adjust for many potential confounders.

3. Multiple Regression If the differences between groups is large, the average value applied to each group with adjustment may represent “no man’s land”, a place where no actual observations exist. Given this scenario, the interpretation of the estimate becomes speculative rather than soundly based. Heroic modeling assumptions are required. William Cochran (1957)

Analysis of Experimental Data

$$Y = \alpha + \beta_1 T + \varepsilon, \qquad \hat{\beta}_1 \Rightarrow \Delta = \text{average causal effect}$$

T is the (0,1) treatment indicator which, for large samples, is independent of background characteristics by study design (ie, randomization).
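A minimal sketch of why the bivariate regression and the t-test answer the same question: with a (0,1) treatment indicator, the least-squares slope equals the difference in group means. The six outcomes below are made up for illustration, and `ols_slope` is a hypothetical helper, not part of the course materials.

```python
from statistics import mean

def ols_slope(t, y):
    """Slope from the bivariate least-squares regression of y on t."""
    t_bar, y_bar = mean(t), mean(y)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    return sxy / sxx

# Hypothetical outcomes for six subjects; T = 1 is treated, T = 0 is control.
T = [1, 1, 1, 0, 0, 0]
Y = [12.0, 15.0, 18.0, 9.0, 10.0, 14.0]

treated_mean = mean(y for t, y in zip(T, Y) if t == 1)
control_mean = mean(y for t, y in zip(T, Y) if t == 0)
diff_in_means = treated_mean - control_mean

beta1_hat = ols_slope(T, Y)
print(diff_in_means, beta1_hat)  # the two numbers agree
```

Under randomization this common value estimates the average causal effect; without randomization the same arithmetic still runs, but the number no longer has that interpretation.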

Absent Randomization

$$Y = \alpha + \beta_1 T + \beta Z + \varepsilon$$

Covariates, Z, serve to adjust groups for confounding…

Absent Randomization
Unless specification of the model, including Z, is perfect, bias results:

$$\hat{\beta}_1 = \Delta + \text{BIAS}$$

Simple (mean-centered) regression model

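$$y_i \mid x_i = \beta_1 x_i + e_i \tag{1}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} \tag{2}$$

Substitute (1) into (2) and get

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} x_i e_i}{\sum_{i=1}^{n} x_i^2} \tag{3}$$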

If x and e are correlated (due to confounding), the far right-hand term of equation (3) will not go to zero as the sample size approaches infinity. As a result, our treatment effect estimate will be biased:

$$\hat{\beta}_1 \neq \beta_1$$
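The point can be checked with a small simulation. This is only a sketch under invented assumptions: an unmeasured common cause `u` feeds both x and e, the true slope is set to 2, and all variances are 1, so the far right-hand term of equation (3) should settle near cov(x, e)/var(x) = 0.5 rather than zero.

```python
import random

random.seed(2011)
n = 20000
beta1 = 2.0  # true effect, chosen only for illustration

x, e = [], []
for _ in range(n):
    u = random.gauss(0, 1)            # unmeasured common cause (the confounder)
    x.append(u + random.gauss(0, 1))  # exposure depends on u
    e.append(u + random.gauss(0, 1))  # error term also depends on u

x_bar = sum(x) / n
xc = [xi - x_bar for xi in x]         # mean-center x, as the model assumes
y = [beta1 * xi + ei for xi, ei in zip(xc, e)]

sum_x2 = sum(xi * xi for xi in xc)
beta1_hat = sum(xi * yi for xi, yi in zip(xc, y)) / sum_x2   # equation (2)
bias_term = sum(xi * ei for xi, ei in zip(xc, e)) / sum_x2   # last term of (3)

print(round(beta1_hat, 2), round(bias_term, 2))
```

Raising `n` does not help: the estimate stays about half a unit above the true slope, which is the sense in which the bias does not vanish with sample size.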

What goes in Z?
• Small p-values in bivariate models?
• 10% rule?
• Stepwise procedures?
• Everything?
• Try stuff (e.g., interactions) until I get “good” p-value for main effect?
• Theory?
• Too many Z’s can overcome data (dimensionality)

In regular regression, unless you specify the elements of Z in advance, you end up capitalizing on chance, and your standard errors and corresponding p-values are too small… so conclusions are often wrong. Regression screening of random data will yield statistically significant “effects” when there are none. Avoid this!
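A rough illustration of screening on random data, assuming 100 subjects, an outcome that is pure noise, and 200 candidate covariates that are also pure noise. Each covariate gets its own bivariate regression, and we count how many clear the usual p < 0.05 bar (critical t of about 1.98 with 98 degrees of freedom). All names and sizes here are invented for the sketch.

```python
import math
import random

random.seed(1957)
n, k = 100, 200  # subjects, candidate covariates (all pure noise)
y = [random.gauss(0, 1) for _ in range(n)]

def t_stat(x, y):
    """t statistic for the slope in a bivariate regression of y on x."""
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((m - 2) / (1 - r * r))

n_significant = 0
for _ in range(k):
    z = [random.gauss(0, 1) for _ in range(n)]
    if abs(t_stat(z, y)) > 1.984:  # two-sided 5% critical value, 98 df
        n_significant += 1

print(n_significant, "of", k, "noise covariates pass the p < 0.05 screen")
```

Roughly 5% of the candidates should pass by chance alone, so a model built from the survivors will report “effects” that do not exist.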

4. Propensity Score Methods An approach to confounder control that better mimics the experimental approach. Introduced by Rosenbaum and Rubin in 1983

Propensity Score, p(z) In the analysis of treatment effects, suppose that we have a binary treatment T, an outcome Y, and background variables Z. The propensity score is defined as the conditional probability of treatment given background variables:

$$p(z) \equiv \Pr(T = 1 \mid Z = z)$$

Propensity Score, p(z) English: Propensity score is defined as the probability of being treated given a subject’s background characteristics.

Ignorable TAM Let Y(0) and Y(1) denote the potential outcomes under control and treatment, respectively. Then treatment assignment is (conditionally) unconfounded if treatment is independent of potential outcomes conditional on Z. This is an assumption!

Ignorable TAM

$$\mathrm{TAM} \perp Y(0), Y(1) \mid Z$$

$$\mathrm{TAM} \perp Y(0), Y(1) \mid p(z) \quad \text{(propensity score)}$$

As with randomized experiments, the expected result of an independent TAM is balance in confounders between treated and untreated groups, thus yielding perfect (?) counterfactual substitutes.

Simplified Tasks
(1) Set outcome variable (Y) aside
(2) Model treatment/exposure (0,1) with logistic regression or perhaps better models
(3) Calculate/estimate predicted value of exposure from model; this is the propensity score!
(4) Use propensity score in analysis to estimate treatment effect

Model for Treatment? We want to first model the probability of being treated, not the outcome (eg, health).

Prob(Tx)= (things that predict getting treated)

Use logistic regression Logistic regression is like “regular” regression but used when the outcome variable (eg, Y) is not continuous but dichotomous (0,1). It yields predicted probabilities.

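Regular regression

$$Y = \alpha + \beta_1 T + \varepsilon$$

Logistic regression

$$\operatorname{logit}(Y) = \alpha + \beta_1 T$$

$$\operatorname{logit}(T) = \alpha + \beta Z, \qquad \Pr(T = 1 \mid Z) = \frac{\exp(\alpha + \beta Z)}{1 + \exp(\alpha + \beta Z)}$$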

So what goes in Z? Use all potential predictors of Tx, except those that are outcomes of Tx.

You can “play” with specification since you’ve set outcome Y aside. Greatly reduced threat of capitalizing on chance.

The propensity score is nothing more than the predicted probability of being treated, which comes directly from the logistic regression model. Each observation in your data will have a propensity score variable with range 0-1. Some observations may have been treated (T=1) with a low propensity score of 0.01, while others were not treated (T=0) with a high propensity score of 0.90.
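To make “predicted probability from the logistic model” concrete, here is a small sketch that fits logit(T) = α + βZ with a single covariate by Newton-Raphson and then computes each subject's propensity score. The data (Z as age in decades) and the helper `fit_logistic` are hypothetical; in practice a statistical package does this step.

```python
import math

def fit_logistic(z, t, iterations=25):
    """Fit logit Pr(T=1|Z) = a + b*Z by Newton-Raphson; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]
        # Gradient of the log-likelihood.
        g_a = sum(ti - pi for ti, pi in zip(t, p))
        g_b = sum((ti - pi) * zi for ti, pi, zi in zip(t, p, z))
        # Entries of the 2x2 information matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h_aa = sum(w)
        h_ab = sum(wi * zi for wi, zi in zip(w, z))
        h_bb = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: information matrix inverse times gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical data: Z is age in decades, T is the (0,1) treatment indicator.
Z = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
T = [1, 1, 0, 1, 0, 1, 0, 0]

alpha, beta = fit_logistic(Z, T)
pscore = [1.0 / (1.0 + math.exp(-(alpha + beta * z))) for z in Z]
for z, t, p in zip(Z, T, pscore):
    print(f"Z={z:.1f}  T={t}  pscore={p:.3f}")
```

Note that the outcome Y never appears: only treatment and background variables are used, which is what lets you revise the specification without peeking at the effect estimate.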

Uses of propensity scores
(1) Use as “regular” covariate in regression model
(2) Stratify/classify data by range of p-score and estimate effects within and then average
(3) Match those actually treated and those not actually treated on their p-score
(4) Use as a weight in more sophisticated models

Propensity score matching
• Match each treated person to their counterfactual, which is a nontreated person with a propensity score similar to the treated person’s.
• Calculate the difference, d, in the observed outcome, Y, between each index person and their matched counterfactual.
• Calculate the average d across all observations.

$$\bar{d} = \mathrm{ATT} \approx \mathrm{ACE} = \mathrm{ATE}$$

Matching methods
• Nearest Neighbor: match treated to counterfactual with closest p-score
• Nearest Neighbor within Caliper (I like best): match treated with closest within range (ie, caliper)
• Kernel, Local Linear, Mahalanobis, Optimal

How wide a caliper? The suggestion is no more than 25% of the standard deviation of your observed propensity scores.

$$\varepsilon \le 0.25\,\sigma_{\text{pscore}}$$
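A sketch of nearest-neighbor matching within a caliper, using the 25% rule for the width. The (pscore, outcome) pairs are invented, and `match_within_caliper` is a hypothetical greedy 1:1 matcher without replacement; it is one simple variant, not necessarily what any given package implements.

```python
from statistics import mean, stdev

def match_within_caliper(treated, controls, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: lists of (pscore, outcome) tuples.
    Returns (treated_outcome, control_outcome) pairs for matched subjects.
    """
    available = list(controls)
    pairs = []
    for p_t, y_t in treated:
        if not available:
            break
        nearest = min(available, key=lambda c: abs(c[0] - p_t))
        if abs(nearest[0] - p_t) <= caliper:
            pairs.append((y_t, nearest[1]))
            available.remove(nearest)  # each control is used at most once
    return pairs

# Hypothetical (pscore, outcome) data.
treated = [(0.80, 20.0), (0.65, 18.0), (0.50, 15.0), (0.97, 25.0)]
controls = [(0.78, 16.0), (0.62, 15.0), (0.45, 14.0),
            (0.30, 10.0), (0.20, 9.0), (0.10, 8.0)]

all_scores = [p for p, _ in treated + controls]
caliper = 0.25 * stdev(all_scores)  # the 25% rule

pairs = match_within_caliper(treated, controls, caliper)
att = mean(y_t - y_c for y_t, y_c in pairs)
print(len(pairs), "matched pairs; ATT estimate =", round(att, 3))
```

The treated subject with a score of 0.97 finds no control within the caliper and drops out, which is the bias/precision tradeoff of a tighter caliper. Greedy matching also depends on the order in which treated subjects are processed.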

[Figure: ATT rate ratio (y-axis, 0.0 to 8.0) by caliper width (x-axis, .0001 to .05)]

How do you know matching worked? Assess balance between treatment and control group prior to treatment.

If balance then ignorable TAM, by assumption!

Estimate Difference in Balance Estimate the difference in means b/w Tx and Cx groups for covariate X before and after matching. Best to standardize the differences so that you can compare them and assess the decrease in differences, which would be an increase in balance wrt that covariate.

Before matching

$$d_X = \frac{|M_{Xt} - M_{Xp}|}{S_X}$$

• $d_X$ is the absolute standardized difference in covariate means
• $M_{Xt}$ is the mean of variable X for the treatment group
• $M_{Xp}$ is the mean of variable X for the potential control group
• $S_X$ is the pooled standard deviation of the groups

Guo & Fraser, p. 157

After matching

$$d_{Xm} = \frac{|M_{Xt} - M_{Xc}|}{S_X}$$

• $d_{Xm}$ is the absolute standardized difference in covariate means after matching
• $M_{Xc}$ is the mean of variable X for the control group after matching

Balance Assessment

$$\text{Absolute reduction in imbalance due to matching} = d_X - d_{Xm}$$

$$\%\ \text{reduction in imbalance due to matching} = 100 \left[ \frac{d_X - d_{Xm}}{d_X} \right]$$
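A worked example of the balance formulas, with made-up ages. One assumption to flag: the pooled standard deviation is computed here as the square root of the average of the two pre-matching group variances, and the same value is reused after matching; check your source for the exact definition it intends.

```python
import math
from statistics import mean, variance

def std_diff(x_treated, x_comparison, s_x):
    """Absolute standardized difference in means for covariate X."""
    return abs(mean(x_treated) - mean(x_comparison)) / s_x

# Hypothetical ages.
age_treated = [40, 44, 48]
age_potential_controls = [42, 46, 50, 58, 64]  # before matching
age_matched_controls = [42, 46, 50]            # after matching

# Assumed pooling rule: average the two pre-matching group variances.
s_x = math.sqrt((variance(age_treated) + variance(age_potential_controls)) / 2)

d_x = std_diff(age_treated, age_potential_controls, s_x)   # before matching
d_xm = std_diff(age_treated, age_matched_controls, s_x)    # after matching

abs_reduction = d_x - d_xm
pct_reduction = 100 * (d_x - d_xm) / d_x
print(round(d_x, 3), round(d_xm, 3), round(pct_reduction, 1))
```

Here the raw gap in mean age shrinks from 8 years to 2 years, so matching removes 75% of the imbalance on this covariate.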

Steps to Propensity Score Matching
1. Fit logistic regression of treatment (not outcome!)
2. Estimate propensity score for each person
3. Match across exposure on estimated p-score
4. Throw away off-support observations
5. Assess balance between groups
6. Re-estimate pscore if balance not obtained
7. Estimate causal model (eg, t-test or other methods) with outcome
8. Bootstrap standard error of causal effect

Guo & Fraser, Fig 5.1
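Step 8 can be sketched as follows, under a simplifying assumption: we resample the matched-pair differences only. A fuller version would re-estimate the propensity score and redo the matching inside every resample. The differences below are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(1983)

# Hypothetical matched-pair differences (treated outcome minus matched control).
d = [4.0, 3.0, 1.0, 5.0, 2.0, 6.0, 0.0, 3.0]
att = mean(d)

k = 1000  # number of bootstrap resamples
boot_means = sorted(
    mean(random.choices(d, k=len(d)))  # resample pairs with replacement
    for _ in range(k)
)

se = stdev(boot_means)                  # bootstrap standard error
lower = boot_means[int(0.025 * k)]      # percentile 95% interval
upper = boot_means[int(0.975 * k) - 1]
print(att, round(se, 3), (lower, upper))
```

Treat the interval as approximate: with nearest-neighbor matching the simple bootstrap is a convenience, and the method chosen for standard errors is itself one of the open issues listed later.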

Off-support?
• Rubin: If complete separation in propensity score? “You can say nothing about causal effects.”
• Rosenbaum: Sharply distinct treatments that could happen to anyone.

If your substitute (ie, comparison group) does not reflect the treatment group, then all inferences are based on (off-support) model assumptions.
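One simple way to flag off-support subjects, shown as a sketch: call a treated subject off support when their propensity score falls outside the range of scores observed among the untreated. This min/max rule is only one convention (a caliper is another), and the scores and the helper `split_by_support` are hypothetical.

```python
def split_by_support(treated_scores, control_scores):
    """Split treated p-scores into on-support and off-support lists,
    using the range of control p-scores as the region of common support."""
    lo, hi = min(control_scores), max(control_scores)
    on = [p for p in treated_scores if lo <= p <= hi]
    off = [p for p in treated_scores if not (lo <= p <= hi)]
    return on, off

# Hypothetical propensity scores.
treated_scores = [0.35, 0.60, 0.85, 0.97, 0.99]
control_scores = [0.05, 0.20, 0.40, 0.55, 0.90]

on_support, off_support = split_by_support(treated_scores, control_scores)
print("on support:", on_support)
print("off support:", off_support)
```

Dropping the off-support subjects changes who the estimate describes: the effect then applies only to treated people who have a plausible untreated comparison.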

[Figures (two slides): estimated probability of exposure (y-axis, 0.0 to 1.0) by number of observed subjects (x-axis), shown back to back for the actually exposed and the actually unexposed]

[Figure: Propensity of Am Ind living in high-poverty Mpls neighborhoods. Propensity score (.025 to .975) by number of infants, shown back to back for 40-100% poverty and < 5% poverty neighborhoods]

Cutting Edge?
• Estimation of propensity scores (CART?)
• Estimation of “bounds” around estimated treatment effects
• Propensity score model calibration
  • Validation sample
  • Qualitative info
• SE estimation
• Imputation for missing covariates
• Inducing confounding by misspecifying Z

5. Issues & Assumptions
1. Exposure model specification
2. Matching algorithm (nearest, greedy, etc)
3. Unobservables!
4. Missing values of covariates
5. Matching with replacement (ie, imputation)
6. Precision/bias tradeoff with respect to “support”
7. Which treatment effect estimator (ATT, ITT, ACE…)
8. Clustering
9. SUTVA violations

Because they require us to think about the ideal experiment we would have liked to have conducted, propensity score methods are a better tool than multiple regression. Setting aside the outcome variable, Y, until it’s time to assess differences between observed outcomes and counterfactual substitutes, is an invaluable addition to the practice of applied research.

Remember…
• We impose a causal model on the world/phenomena
• It’s a cognitive thing… a belief subject to scientific scrutiny
• We are bombarded with “data” and so must select some for consideration
• Humans are excellent at confirmation bias… finding “data” to support our belief; we struggle with data that undermines our beliefs.

Part II. Application
• ATE, ACE, ATT, TOT
• A Typical Analysis
• Propensity Score Methods
• Issues & Assumptions
• Review

1. ATE, ACE, ATT, TOT What “effect” do we wish to estimate?

Ideally, we’d like the treatment effect for each individual in our study. If we could observe every person and their counterfactual we could just take the average across all persons as an estimate of delta.

$$\tau_i = Y_i(1) - Y_i(0), \qquad \bar{\tau} \Rightarrow \Delta$$

But of course we cannot calculate a causal effect for a particular person. We must move up to the population level.

The average treatment effect (ATE) is the average of the outcome under treatment minus the average of the outcome under control, taken over all subjects. ATE is the same as the average causal effect (ACE).

$$\mathrm{ATE} = \mathrm{ACE} = E[Y(1) - Y(0)]$$

Often desired and easily estimable in RCTs

The average treatment effect on the treated (ATT) is the mean difference between the outcomes of those actually treated or exposed and their counterfactuals. ATT is the same as the treatment effect on the treated (TOT).

$$\mathrm{ATT} = \mathrm{TOT} = E[Y(1) - Y(0) \mid T = 1]$$

Often better for observational designs

When we randomize, the treatment assignment mechanism (TAM) is independent of the outcomes and all subjects have a positive probability of being treated. Accordingly, ATT = ATE:

$$\mathrm{ATT} = \mathrm{ATE} = E(Y \mid T = 1) - E(Y \mid T = 0) = E[Y(1) - Y(0)]$$

This is because randomization produces exchangeable groups (ie, balance) and yields excellent counterfactual substitutes. In other words, the control group serves to substitute for the unobservable counterfactuals of the treatment group, at least with large samples.

But when we do not randomize, the treatment assignment mechanism (TAM) is rarely independent of the outcomes and some subjects may have a zero probability of being treated. Accordingly,

$$\mathrm{ATT} \neq \mathrm{ATE}$$

Thus we are often better off comparing treated subjects to the best counterfactuals we can find for them. The best ones are the non-treated subjects that have the same probability of being treated as the treated subjects being studied (or the set of them). This theoretically satisfies the critical assumption about independence of the treatment assignment mechanism:

$$\mathrm{TAM} \perp Y(0), Y(1) \mid p(z)$$
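A toy illustration of why ATT and ATE can differ without randomization. It pretends we can see both potential outcomes for six subjects, which is only possible with invented data like these. Those who benefit most are the ones who end up treated.

```python
from statistics import mean

# Hypothetical subjects: (Y0, Y1, T). Both potential outcomes are shown only
# because the data are made up; in real data one of them is always missing.
subjects = [
    (10, 16, 1), (12, 18, 1), (11, 15, 1),  # treated: large effects
    (9, 10, 0), (8, 9, 0), (10, 10, 0),     # untreated: small effects
]

ate = mean(y1 - y0 for y0, y1, t in subjects)
att = mean(y1 - y0 for y0, y1, t in subjects if t == 1)

# What we can actually compute: observed treated mean minus observed control mean.
naive = (mean(y1 for y0, y1, t in subjects if t == 1)
         - mean(y0 for y0, y1, t in subjects if t == 0))

# Gap in untreated outcomes between the groups (lack of exchangeability).
selection_bias = (mean(y0 for y0, y1, t in subjects if t == 1)
                  - mean(y0 for y0, y1, t in subjects if t == 0))

print(ate, round(att, 3), round(naive, 3), selection_bias)
```

In this example ATE = 3, ATT is about 5.33, and the naive comparison gives about 7.33, which is the ATT plus a selection term of 2. Matching on the propensity score aims to remove that selection term by finding untreated subjects who resemble the treated ones.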

Incredibly, neither of the desired effect estimates (ATT, ATE) is (easily) identifiable through regression modeling in an observational study.

2. Typical Analysis

Let’s use Stata

What is Stata? Stata is a full-featured statistical programming language for Windows, Macintosh, Unix and Linux. It can be considered a “stat package,” like SAS, SPSS, or R.

Data and program for this entire module are available on my website.

Data: tcws_pscore_demo.dta
Program: pscore demo.do

Data are based on my Twin Cities Walking Study (TCWS) but have been altered for pedagogical purposes. The data are fictitious!

[Figure: histogram of total walking METS from diaries (x-axis, 0 to 15000; y-axis, percent), annotated “Was 358!”]

t-test of totwalk (Y) by walkable (X)

Bivariate Regression: totwalk (Y) on walkable (X)

Assess differences in averages of suspected confounders across treatment (walkable) and control (not walkable) conditions. Use a t-test (better methods available).

Summary of t-tests (means)

| Potential Confounder | Treatment (Walkable) | Control (Not-Walkable) | Difference | p-value |
|---|---|---|---|---|
| Age | 44.07 | 52.02 | 7.95 | 0.000 |
| Tenure | 14.28 | 13.13 | -1.15 | 0.369 |
| % Males | 34.24 | 31.76 | -2.47 | 0.622 |
| % White | 78.19 | 68.82 | -9.37 | 0.044 |
| % Married | 31.91 | 48.82 | 16.91 | 0.001 |
| % College degree | 81.38 | 48.82 | -32.56 | 0.000 |
| HH Income Scale | 6.63 | 5.21 | -1.42 | 0.000 |

Multiple Regression

Keep variables with small p-values

3. Propensity Score Analysis

Nearest Neighbor Matching within caliper

Assess Balance

Pscore ATT Estimate

Assess Support

There are no matches for 12 of the “treated” subjects

[Figure: distribution of propensity scores (x-axis, 0 to 1) for three groups: Untreated; Treated: On support; Treated: Off support]

Bootstrap, k=1000

[Figure: Observed Value versus Pscore Predicted Counterfactual, with the Effect For Subject #4 marked]

Unmeasured Confounding?

Rosenbaum Bounds A method of sensitivity analysis that uses a parameter “gamma” to measure the degree of departure from random assignment of treatment.

Imagine that 2 subjects with the same observed characteristics differ in the odds of receiving the treatment by at most a factor of “gamma”.

In a randomized trial, randomization of the treatment ensures that gamma = 1. In an observational study, if gamma = 2 and two subjects are identical on matched covariates then one might be twice as likely as the other to receive the treatment because they differ in terms of an unobserved covariate.
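For matched pairs analyzed with a sign test, the worst case under a given gamma can be sketched directly: the chance that the treated member of a pair has the higher outcome is at most gamma / (1 + gamma) when there is truly no effect, so the largest possible one-sided p-value is a binomial tail at that probability. The pair counts below are hypothetical, and the sign test is used here only because it is the simplest case.

```python
from math import comb

def upper_bound_p(n_pairs, n_treated_higher, gamma):
    """Largest one-sided sign-test p-value consistent with hidden bias gamma."""
    p_plus = gamma / (1.0 + gamma)
    return sum(
        comb(n_pairs, k) * p_plus ** k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_treated_higher, n_pairs + 1)
    )

# Hypothetical result: 20 matched pairs (no ties), treated higher in 16.
n_pairs, n_treated_higher = 20, 16

bounds = {g: upper_bound_p(n_pairs, n_treated_higher, g)
          for g in (1.0, 1.5, 2.0, 3.0)}
for g, p in bounds.items():
    print(f"gamma = {g}: upper bound on p-value = {p:.4f}")
```

At gamma = 1 this is the ordinary sign test. The question to ask is how large gamma must get before the bound crosses 0.05: a finding that survives only for gamma near 1 is fragile to unmeasured confounding.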

Let’s Compare!

| Method | Effect | L95%CI | U95%CI |
|---|---|---|---|
| T-test | 4,498 | 3,849 | 5,147 |
| Bivariate Regression | 4,498 | 3,337 | 5,158 |
| (method label missing in source) | 2,215 | 1,815 | 2,616 |
| (method label missing in source) | 2,206 | 1,815 | 2,597 |
| Pscore: 6 vars, NN match, caliper = 0.03 | 2,405 | 1,605 | 3,205 |
| Pscore: 6 vars, NN match, caliper = 0.03, BS=1k | 2,393 | 1,787 | 2,913 |
| Pscore: 4 vars, NN match, caliper = 0.05 | 2,618 | 1,781 | 3,455 |
| Pscore: 4 vars, NN match, caliper = 0.001 | 3,231 | 2,173 | 4,289 |
| Pscore: 4 vars, NN match, caliper = 0.9 | 2,618 | 1,781 | 3,455 |
| Pscore: 6 vars, Kernel match | 2,730 | 2,068 | 3,392 |


4. Issues & Assumptions

• Counterfactuals are unobservable
• Black box mechanisms
• ATE or ATT?
• Pscore matching should balance confounders
• Must assume no unobservables
• Enforce “support”
• Missing values are problematic
• No predictors that are consequence of outcome
• Pscores are mere estimates from one sample

Can also be done in SAS, R, SPSS, and other programs

Simplified Tasks
(1) Set outcome variable (Y) aside
(2) Model treatment/exposure (0,1) with logistic regression or perhaps better models
(3) Check/Assess balance
(4) Once balance is maximized, use propensity score in analysis to estimate ATT
(5) Estimate potential impact of unobservables

Once again… with gusto! Because they require us to think about the ideal experiment we would have liked to have conducted, propensity score methods are a better tool than multiple regression. Setting aside the outcome variable, Y, until it’s time to assess differences between observed outcomes and counterfactual substitutes, is an invaluable addition to the practice of applied research.