The Bootstrap. Robert M. Haralick. Computer Science, Graduate Center City University of New York


Author: Isabel Clarke


Motivation Binomial Distribution The Bootstrap Variance Bias


Outline

1. Motivation
2. Binomial Distribution
3. The Bootstrap
4. Variance
5. Bias

The Study

On January 27, 1987, The New York Times summarized a controlled, randomized, double-blind study showing that the risk of heart attack could be reduced by taking aspirin.

- Half the subjects were randomly assigned to take aspirin; half, a placebo.
- Subjects and physicians were blinded to the assignments.
- Tablets were given every other day.

The Data

|                              | Aspirin Group | Placebo Group |
|------------------------------|---------------|---------------|
| Number Having Heart Attacks  | 104           | 189           |
| Number With No Heart Attacks | 10933         | 10845         |
| Total Number                 | 11037         | 11034         |


The Estimated Rates

Given that a person was an aspirin taker, the fraction of people that had a heart attack was 104/11037. Given that a person was a placebo taker, the fraction of people that had a heart attack was 189/11034.

The Estimated Rates

Given that a person had a heart attack, the fraction of people that took aspirin was $104/293 = .3549$.

Given that a person had a heart attack, the fraction of people that took placebo was $189/293 = .6451$.

The Estimation

The ratio of the two rates:

$$\hat{\theta} = \frac{104/11037}{189/11034} = .5501$$
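As a quick check of the arithmetic, the two rates and their ratio can be computed directly from the counts in the data table. This is only a sketch; the variable names are illustrative and do not come from the slides.

```python
# Counts taken from the aspirin study table.
aspirin_attacks, aspirin_total = 104, 11037
placebo_attacks, placebo_total = 189, 11034

# Estimated heart attack rate in each group.
rate_aspirin = aspirin_attacks / aspirin_total
rate_placebo = placebo_attacks / placebo_total

# Estimated ratio of the two rates (theta-hat).
theta_hat = rate_aspirin / rate_placebo

print(f"aspirin rate = {rate_aspirin:.6f}")
print(f"placebo rate = {rate_placebo:.6f}")
print(f"theta_hat    = {theta_hat:.4f}")
```

Because the two groups are almost the same size, the ratio of rates is very close to the ratio of raw counts, 104/189.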


The Conclusion

Aspirin takers only have 55% as many heart attacks as placebo-takers.

The Problem

We are not interested in $\hat{\theta}$, the estimated ratio in the sample. We are interested in $\theta$, the true ratio, the ratio in the general population.

Repetition

Suppose we were to repeat the study $N$ times. We would obtain the estimates $\hat{\theta}_1, \ldots, \hat{\theta}_N$. The estimates would almost certainly not all equal 55%. The cause of the variation is sampling error. How much would they vary?

The Bernoulli Trial

Definition
A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, typically called "success" and "failure."

Random Variable Notation

Let $X$ be a random variable taking the value 0 or 1.

- 0 is the value of failure.
- 1 is the value of success.
- The probability that $X$ takes the value 1 is $p$.
- The probability that $X$ takes the value 0 is $q = 1 - p$.
- A Bernoulli trial produces an observation of $X$.

Bernoulli Process

Definition
A Bernoulli process consists of repeatedly performing independent but identical Bernoulli trials.

The Joint Distribution

Let $x_1, \ldots, x_N$ be the observations of $N$ independent Bernoulli trials, each trial having success probability $p$ and failure probability $q = 1 - p$.

$$\mathrm{Prob}(x_1, \ldots, x_N) = \prod_{n=1}^{N} \begin{cases} p & \text{if } x_n = 1 \\ q & \text{if } x_n = 0 \end{cases} = \prod_{n=1}^{N} p^{x_n} q^{1 - x_n}$$

The Joint Distribution

Define the random variable $S$ by

$$S = \sum_{n=1}^{N} x_n$$

The distribution of $S$ depends on $N$ and $p$.

$$\mathrm{Prob}(x_1, \ldots, x_N) = \prod_{n=1}^{N} \begin{cases} p & \text{if } x_n = 1 \\ q & \text{if } x_n = 0 \end{cases} = p^{S} q^{N - S}$$

The Binomial Distribution

Definition
A random variable $S$ has the binomial distribution with parameters $p$ and $N$, denoted $B(p, N)$, if and only if

$$\mathrm{Prob}(S = s) = \binom{N}{s} p^{s} q^{N - s}$$

The Binomial Distribution

$$\mathrm{Prob}(a \le S \le b) = \sum_{k=a}^{b} \binom{N}{k} p^{k} q^{N - k}$$

$$E[S] = Np$$

$$V[S] = Npq$$
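The sum and the two moments can be checked numerically for a small case. The sketch below assumes illustrative values $N = 20$ and $p = 0.3$ (not taken from the slides); the function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Prob(S = k) for S with the binomial distribution B(p, n)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_interval_prob(a, b, n, p):
    """Prob(a <= S <= b), summing the probability mass from a to b inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(a, b + 1))

# Illustrative parameters (an assumption made for this example).
n, p = 20, 0.3

# Mean and variance computed directly from the probability mass function.
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
variance = sum((k - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

print(f"Prob(4 <= S <= 8) = {binom_interval_prob(4, 8, n, p):.4f}")
print(f"mean = {mean:.4f}, Np = {n * p:.4f}")
print(f"variance = {variance:.4f}, Npq = {n * p * (1 - p):.4f}")
```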

Estimating p

$$\hat{p} = \frac{S}{N}$$

$$E[\hat{p}] = E\left[\frac{S}{N}\right] = \frac{E[S]}{N} = \frac{Np}{N} = p$$

Variance of Estimate

$$V[\hat{p}] = V\left[\frac{S}{N}\right] = \frac{V[S]}{N^2} = \frac{Npq}{N^2} = \frac{pq}{N}$$

Left Sided Confidence Interval

Let $S$ have $B(p, N)$.

Definition
$[0, b]$ is the left sided $p_0$ confidence interval for $S$ if and only if

$$\mathrm{Prob}(0 \le S \le b) = p_0$$

Right Sided Confidence Interval

Let $S$ have $B(p, N)$.

Definition
$[a, N]$ is the right sided $p_0$ confidence interval for $S$ if and only if

$$\mathrm{Prob}(a \le S \le N) = p_0$$

Central Confidence Interval

Let $S$ have $B(p, N)$.

Definition
$[a, b]$ is the $p_0$ central confidence interval for $S$ if and only if

$$\mathrm{Prob}(a \le S \le b) = p_0$$

where

- $a$ is an integer minimizing $|\mathrm{Prob}(0 \le S \le a) - (1 - p_0)/2|$
- $b$ is an integer minimizing $|\mathrm{Prob}(b \le S \le N) - (1 - p_0)/2|$
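The two minimizations can be carried out by a direct search over the integers $0, \ldots, N$. The sketch below assumes each tail should carry probability as close as possible to $(1 - p_0)/2$, and uses the illustrative values $N = 100$, $p = 0.5$, $p_0 = 0.95$, which are not from the slides.

```python
import math

def binom_pmf(k, n, p):
    """Prob(S = k) for S with the binomial distribution B(p, n)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def central_interval(n, p, p0):
    """Return (a, b, coverage) with each tail probability closest to (1 - p0)/2."""
    tail = (1 - p0) / 2
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    # lower[k] = Prob(0 <= S <= k), upper[k] = Prob(k <= S <= n)
    lower = [sum(pmf[: k + 1]) for k in range(n + 1)]
    upper = [sum(pmf[k:]) for k in range(n + 1)]
    a = min(range(n + 1), key=lambda k: abs(lower[k] - tail))
    b = min(range(n + 1), key=lambda k: abs(upper[k] - tail))
    coverage = sum(pmf[a : b + 1])  # Prob(a <= S <= b)
    return a, b, coverage

a, b, coverage = central_interval(n=100, p=0.5, p0=0.95)
print(f"[a, b] = [{a}, {b}], Prob(a <= S <= b) = {coverage:.4f}")
```

Because $S$ is discrete, the printed coverage will generally differ somewhat from the requested $p_0$.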

Confidence Interval Around Mean

$$\begin{aligned}
\mathrm{Prob}(|S - Np| \le k) &= \mathrm{Prob}(S - Np \le k \text{ and } -(S - Np) \le k) \\
&= \mathrm{Prob}(S - k \le Np \text{ and } S + k \ge Np) \\
&= \mathrm{Prob}((S - k)/N \le p \text{ and } (S + k)/N \ge p) \\
&= \mathrm{Prob}((S - k)/N \le p \le (S + k)/N)
\end{aligned}$$

Confidence Interval for p

The binomial distribution is a discrete distribution. Generally it is not possible to construct a confidence interval for $p$ with an exactly specified confidence coefficient.

The central $\alpha$ confidence interval $[p_L, p_U]$ for $p$ satisfies

$$\mathrm{Prob}(p_L \le p \le p_U) = \alpha$$

where

$$\mathrm{Prob}(0 \le p \le p_L) = (1 - \alpha)/2$$

$$\mathrm{Prob}(p_U \le p \le 1) = (1 - \alpha)/2$$

Approximate Central Confidence Interval

An approximate $\alpha$ central confidence interval $[p_L, p_U]$ can be obtained by approximately solving

$$\sum_{j=S}^{N} \binom{N}{j} p_L^{j} (1 - p_L)^{N - j} = (1 - \alpha)/2$$

$$\sum_{j=0}^{S} \binom{N}{j} p_U^{j} (1 - p_U)^{N - j} = (1 - \alpha)/2$$
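One way to solve the two equations approximately is bisection, since each binomial sum is monotone in its probability argument. The sketch below is one possible approach, not necessarily the method the author had in mind. It assumes $0 < S < N$, evaluates the sums in log space so that large $N$ does not overflow, and rewrites the first sum as one minus a lower sum. All function names are hypothetical. Intervals defined by this pair of equations are commonly called Clopper-Pearson intervals.

```python
import math

def binom_cdf(s, n, p):
    """Prob(0 <= S <= s) for S with B(p, n), with each term computed in log space."""
    total = 0.0
    for j in range(0, s + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_term)
    return total

def bisect_root(f, lo, hi, iters=60):
    """Root of a monotone function f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def central_interval_for_p(s, n, alpha):
    """Approximate central alpha confidence interval [p_L, p_U]; requires 0 < s < n."""
    tail = (1 - alpha) / 2
    eps = 1e-12
    # sum_{j=s}^{n} ... = 1 - Prob(0 <= S <= s - 1), which increases with p.
    p_lower = bisect_root(lambda p: (1 - binom_cdf(s - 1, n, p)) - tail, eps, 1 - eps)
    # sum_{j=0}^{s} ... = Prob(0 <= S <= s), which decreases with p.
    p_upper = bisect_root(lambda p: binom_cdf(s, n, p) - tail, eps, 1 - eps)
    return p_lower, p_upper

# Aspirin group from the study: 104 heart attacks among 11037 subjects.
p_lower, p_upper = central_interval_for_p(104, 11037, 0.95)
print(f"p_hat = {104 / 11037:.5f}, interval = [{p_lower:.5f}, {p_upper:.5f}]")
```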

The Beta Function

Definition
The Beta function $B(m, n)$ is defined by the definite integral

$$B(m, n) = \int_{0}^{1} x^{m-1} (1 - x)^{n-1} \, dx = \frac{(m - 1)!\,(n - 1)!}{(m + n - 1)!}$$

where the factorial form holds for positive integers $m$ and $n$.

Incomplete Beta Integral

Definition
The incomplete Beta integral, normalized by the Beta function, is defined by

$$I_p(m, n) = \frac{1}{B(m, n)} \int_{0}^{p} x^{m-1} (1 - x)^{n-1} \, dx$$

The incomplete Beta integral is related to the binomial sums by

$$I_p(k, N - k + 1) = \sum_{j=k}^{N} \binom{N}{j} p^{j} (1 - p)^{N - j} = \mathrm{Prob}(k \le S \le N)$$
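The relation to the binomial sums can be checked numerically. The sketch below assumes the integral is divided by $B(m, n)$ (the regularized form, which is what makes the identity hold) and approximates it with Simpson's rule. The values $N = 10$, $k = 3$, $p = 0.3$ are illustrative assumptions.

```python
import math

def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def regularized_incomplete_beta(p, m, n):
    """Integral of x^(m-1) (1-x)^(n-1) over [0, p], divided by B(m, n); integer m, n."""
    beta_mn = math.factorial(m - 1) * math.factorial(n - 1) / math.factorial(m + n - 1)
    integral = simpson(lambda x: x ** (m - 1) * (1 - x) ** (n - 1), 0.0, p)
    return integral / beta_mn

def binom_upper_tail(k, n_trials, p):
    """Prob(k <= S <= N) for S with B(p, N)."""
    return sum(math.comb(n_trials, j) * p**j * (1 - p) ** (n_trials - j)
               for j in range(k, n_trials + 1))

# Illustrative values (an assumption made for this example).
N, k, p = 10, 3, 0.3
lhs = regularized_incomplete_beta(p, k, N - k + 1)
rhs = binom_upper_tail(k, N, p)
print(f"I_p(k, N-k+1) = {lhs:.10f}")
print(f"binomial tail = {rhs:.10f}")
```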

Confidence Interval

Observe $S$ and estimate the $\alpha$ central confidence interval $[p_L, p_U]$ for $p$ by

$$I_{p_L}(S, N - S + 1) = (1 - \alpha)/2$$

$$1 - I_{p_U}(S + 1, N - S) = (1 - \alpha)/2$$

The Data

|                              | Aspirin Group | Placebo Group |
|------------------------------|---------------|---------------|
| Number Having Heart Attacks  | 104           | 189           |
| Number With No Heart Attacks | 10933         | 10845         |
| Total Number                 | 11037         | 11034         |

Aspirin Binomial

- Let $N_A$ be the number of people who were given aspirin.
- Let $S_A$ be the number of people who were given aspirin and had a heart attack.
- Let $p_A$ be the probability that a person who takes aspirin will have a heart attack.
- $S_A$ has $B(p_A, N_A)$.

Placebo Binomial

- Let $N_B$ be the number of people who were given placebos.
- Let $S_B$ be the number of people who were given placebos and had a heart attack.
- Let $p_B$ be the probability that a person who takes placebos will have a heart attack.
- $S_B$ has $B(p_B, N_B)$.

The Ratio

$$\theta = \frac{p_A}{p_B}$$

The Estimated Ratio

$$\hat{p}_A = \frac{S_A}{N_A}, \qquad \hat{p}_B = \frac{S_B}{N_B}$$

$$\hat{\theta} = \frac{\hat{p}_A}{\hat{p}_B} = \frac{104/11037}{189/11034} = .5501$$

Problem

- What is the variance of $\hat{\theta}$?
- What is the $\alpha$ confidence interval of $\theta$?


Confidence Interval For The Ratio

Prob(.43 ≤ θ ≤ .70) = .95

The Bootstrap

The bootstrap is a data-based simulation method for statistical estimation, answering questions like:

- What is the variance of $\hat{\theta}$?
- What is the $\alpha$ confidence interval of $\theta$?

Derivation

The use of the term bootstrap derives from the phrase "to pull oneself up by one's bootstraps," widely thought to be based on one of the eighteenth-century Adventures of Baron Munchausen by Rudolf Erich Raspe. The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.

How It Works

Create two populations:

- Population A: 104 ones, 10933 zeros
- Population B: 189 ones, 10845 zeros

How It Works

Repeat $N$ times:

- Draw with replacement a sample of 11037 items from Population A.
- Draw with replacement a sample of 11034 items from Population B.
- Form the estimate $\hat{\theta}_n^*$, $n = 1, \ldots, N$.

Then:

- Calculate the sample variance.
- Calculate the order statistics $\hat{\theta}^*_{(1)}, \ldots, \hat{\theta}^*_{(N)}$.
- Form the $\alpha$ confidence interval by $[\hat{\theta}^*_{(M)}, \hat{\theta}^*_{(N-M)}]$ where $\frac{N - 2M}{N} = \alpha$.
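The procedure above can be sketched with the standard library alone. The number of repetitions (400), the seed, and the function name are assumptions chosen to keep the example fast and repeatable; a real analysis would likely use more repetitions.

```python
import random
import statistics

def bootstrap_ratio(n_reps=400, alpha=0.95, seed=0):
    """Bootstrap the ratio of heart attack rates; returns (variance, lower, upper)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population_a = [1] * 104 + [0] * 10933  # aspirin group
    population_b = [1] * 189 + [0] * 10845  # placebo group

    estimates = []
    for _ in range(n_reps):
        # Draw with replacement, keeping each sample the size of its population.
        sample_a = rng.choices(population_a, k=len(population_a))
        sample_b = rng.choices(population_b, k=len(population_b))
        rate_a = sum(sample_a) / len(sample_a)
        rate_b = sum(sample_b) / len(sample_b)
        estimates.append(rate_a / rate_b)

    variance = statistics.variance(estimates)  # sample variance, divisor n_reps - 1

    # Order statistics and the interval [theta*_(M), theta*_(N-M)].
    estimates.sort()
    m = round(n_reps * (1 - alpha) / 2)  # chosen so that (N - 2M)/N = alpha; needs m >= 1
    lower = estimates[m - 1]             # theta*_(M), converted to 0-based indexing
    upper = estimates[n_reps - m - 1]    # theta*_(N-M)
    return variance, lower, upper

variance, lower, upper = bootstrap_ratio()
print(f"bootstrap variance = {variance:.5f}")
print(f"95% interval = [{lower:.3f}, {upper:.3f}]")
```

Each bootstrap sample only needs the count of ones, so a faster variant could draw that count from a binomial distribution instead of materializing every sample.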

The Bootstrap Sample

Definition
Let $X = \langle x_1, \ldots, x_N \rangle$ be the given random sample of $N$ values. A bootstrap sample is a random sample of $N$ items independently sampled with replacement from $X$.

Bootstrap Summary

Let $X = \langle x_1, \ldots, x_N \rangle$ be a random sample. Suppose we are interested in the statistic $T(X)$ and we want to estimate the variance of $T(X)$.

- Form the bootstrap samples $X_1^*, \ldots, X_M^*$.
- Compute $t_m = T(X_m^*)$, $m = 1, \ldots, M$.
- Compute $\bar{t} = \frac{1}{M} \sum_{m=1}^{M} t_m$.

$$\hat{V}(T(X)) = \frac{1}{M - 1} \sum_{m=1}^{M} (t_m - \bar{t})^2$$
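The summary translates almost line for line into a generic routine. The sketch below assumes the statistic is passed in as a function; the data values are made up for illustration, and the sample mean is used as $T$ only because its bootstrap variance is easy to sanity-check.

```python
import random
import statistics

def bootstrap_variance(sample, statistic, n_boot=2000, seed=0):
    """Estimate the variance of statistic(sample) from n_boot bootstrap samples."""
    rng = random.Random(seed)
    n = len(sample)
    # t_m = T(X*_m): evaluate the statistic on each bootstrap sample.
    t = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    t_bar = sum(t) / n_boot
    return sum((t_m - t_bar) ** 2 for t_m in t) / (n_boot - 1)

# Made-up data, used only to exercise the function.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
v_hat = bootstrap_variance(data, statistics.mean)

# For the sample mean, the bootstrap variance should be close to
# (population-style variance of the data) / N.
reference = statistics.pvariance(data) / len(data)
print(f"bootstrap variance = {v_hat:.4f}, reference = {reference:.4f}")
```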

The Correlation Coefficient

Definition
Let $X$ and $Y$ be two random variables. The correlation $\rho$ between $X$ and $Y$ is defined by

$$\rho = E\left[ \frac{(X - \mu_x)}{\sigma_x} \frac{(Y - \mu_y)}{\sigma_y} \right]$$

The Estimated Correlation Coefficient

Let $(x_1, y_1), \ldots, (x_N, y_N)$ be a random sample from a bivariate population. The estimated correlation coefficient $\hat{\rho}$ is calculated by

$$\hat{\rho} = \frac{1}{N} \sum_{n=1}^{N} \frac{(x_n - \hat{\mu}_x)}{\hat{\sigma}_x} \frac{(y_n - \hat{\mu}_y)}{\hat{\sigma}_y}$$

The Variance of the Estimated Correlation Coefficient

If the random variables $(X, Y)$ are bivariate normal, then, approximately for large $N$,

$$V[\hat{\rho}] \approx \frac{1}{N} (1 - \rho^2)^2$$

But what if the joint distribution of $(X, Y)$ is not normal?

Correlation Coefficient Variance by Bootstrap

- Form the bootstrap samples $Z_1^*, \ldots, Z_M^*$.
- Compute $\hat{\rho}_m = \hat{\rho}(Z_m^*)$, $m = 1, \ldots, M$.
- Compute $\bar{\rho} = \frac{1}{M} \sum_{m=1}^{M} \hat{\rho}_m$.

$$\hat{V}(\hat{\rho}) = \frac{1}{M - 1} \sum_{m=1}^{M} (\hat{\rho}_m - \bar{\rho})^2$$
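A minimal sketch of these steps follows. It assumes each bootstrap sample $Z_m^*$ is formed by resampling the $(x_n, y_n)$ pairs as units, so that the pairing is preserved. The synthetic data and all names are illustrative assumptions.

```python
import math
import random

def correlation(pairs):
    """Estimated correlation, using 1/N means and 1/N standard deviations."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n * sd_x * sd_y)

def bootstrap_correlation_variance(pairs, n_boot=1000, seed=0):
    """Bootstrap estimate of the variance of the estimated correlation."""
    rng = random.Random(seed)
    n = len(pairs)
    # Resample whole (x, y) pairs so each bootstrap sample keeps the pairing.
    rhos = [correlation(rng.choices(pairs, k=n)) for _ in range(n_boot)]
    rho_bar = sum(rhos) / n_boot
    return sum((r - rho_bar) ** 2 for r in rhos) / (n_boot - 1)

# Synthetic data for illustration: y depends linearly on x plus noise.
data_rng = random.Random(1)
pairs = []
for _ in range(50):
    x = data_rng.gauss(0, 1)
    pairs.append((x, 0.6 * x + 0.8 * data_rng.gauss(0, 1)))

rho_hat = correlation(pairs)
v_hat = bootstrap_correlation_variance(pairs)
print(f"rho_hat = {rho_hat:.3f}, bootstrap variance = {v_hat:.5f}")
```

Nothing in the routine relies on normality, which is the point of using the bootstrap here.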

Eigenvalue Ratio

Let $\hat{\Sigma}$ be the estimated covariance matrix. Let $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \ldots \ge \hat{\lambda}_K$ be the eigenvalues of $\hat{\Sigma}$.

$$\hat{r} = \frac{\hat{\lambda}_1}{\sum_{k=1}^{K} \hat{\lambda}_k}$$

is the estimated fraction of the variance accounted for by the first principal component. What is the variance of $\hat{r}$?

The Distribution Function

Definition
Let $X$ be a real valued random variable. The function $F(x) = \mathrm{Prob}(X \le x)$ is called the Distribution Function of $X$.

The Empiric Distribution Function

Definition
Let $x_1, \ldots, x_N$ be a random sample from a given population. The function

$$\hat{F}(x) = \frac{\#\{n \mid x_n \le x\}}{N}$$

is called the Empiric Distribution Function of $X$ based on sample $x_1, \ldots, x_N$.
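The definition amounts to counting how many sample values are at most $x$. A minimal sketch, with a hypothetical function name and a tiny made-up sample:

```python
import bisect

def empirical_distribution_function(sample):
    """Return F_hat, where F_hat(x) = #{n : x_n <= x} / N."""
    ordered = sorted(sample)
    n = len(ordered)

    def f_hat(x):
        # bisect_right gives the number of sorted values that are <= x.
        return bisect.bisect_right(ordered, x) / n

    return f_hat

# Made-up sample with a repeated value.
f_hat = empirical_distribution_function([3.0, 1.0, 2.0, 2.0])
for x in (0.5, 1.0, 2.0, 3.0):
    print(f"F_hat({x}) = {f_hat(x)}")
```

Drawing a bootstrap sample from the data is the same as drawing $N$ independent values from the distribution described by $\hat{F}$.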

Population Parameter

All the information about the population is contained in the distribution function $F$.

$$\theta = T_F$$

Plug-in Principle

The plug-in principle is a simple method of estimating parameters from a sample.

Definition
The plug-in estimate of a parameter $\theta = T_F$ is defined by

$$\hat{\theta} = T_{\hat{F}}$$

Bias

Definition
The bias of an estimator $\hat{\theta}$ is defined by the difference $E[\hat{\theta}] - \theta$.

Bias and Standard Error

The bootstrap can be used to study the bias and standard error of the plug-in estimate $\hat{\theta} = T_{\hat{F}}$.

Estimates of Bias

- Let $X = \langle x_1, \ldots, x_N \rangle$ be the given random sample from a population whose distribution function is $F$.
- Let $X_1^*, \ldots, X_M^*$ be $M$ independent bootstrap samples.
- Let $T(X)$ be the estimate based on sample $X$.
- Let $T_F$ be the true value of the quantity in distribution $F$.
- Let $b = E[T(X)] - T_F$ be the bias.

Then:

- Calculate $\hat{t}_m = T(X_m^*)$, $m = 1, \ldots, M$.
- Calculate $\bar{t} = \frac{1}{M} \sum_{m=1}^{M} \hat{t}_m$.
- Calculate $T_{\hat{F}}$.

$$\hat{b} = \bar{t} - T_{\hat{F}}$$
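The bias estimate can be sketched in a few lines. The example below assumes $T$ is the plug-in variance (divisor $N$), chosen because its bias is known to be negative, and assumes that for a plug-in statistic $T_{\hat{F}}$ is simply $T$ evaluated on the original sample. The data values and names are made up for illustration.

```python
import random

def plugin_variance(sample):
    """Plug-in variance estimate: average squared deviation with divisor N."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

def bootstrap_bias(sample, statistic, n_boot=4000, seed=0):
    """Bootstrap bias estimate: b_hat = t_bar - T evaluated on the original sample."""
    rng = random.Random(seed)
    n = len(sample)
    # t_hat_m = T(X*_m) for each bootstrap sample.
    t = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    t_bar = sum(t) / n_boot
    # Assumption: for a plug-in statistic, T at F_hat equals statistic(sample).
    return t_bar - statistic(sample)

# Made-up data, used only to exercise the function.
data = [4.1, 5.3, 2.8, 6.0, 3.7, 4.9, 5.5, 3.2, 4.4, 6.3,
        2.5, 5.1, 4.7, 3.9, 5.8, 4.0, 3.4, 5.0, 4.6, 4.2]

t_plugin = plugin_variance(data)
b_hat = bootstrap_bias(data, plugin_variance)
print(f"plug-in variance = {t_plugin:.4f}")
print(f"bootstrap bias estimate = {b_hat:.4f}")
print(f"-T/N for comparison     = {-t_plugin / len(data):.4f}")
```

For this statistic the bootstrap bias estimate should land near $-T(X)/N$, which is why that value is printed for comparison.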