Motivation Binomial Distribution The Bootstrap Variance Bias
The Bootstrap
Robert M. Haralick
Computer Science, Graduate Center, City University of New York
Outline
1. Motivation
2. Binomial Distribution
3. The Bootstrap
4. Variance
5. Bias
The Study
On January 27, 1987, The New York Times summarized a controlled, randomized, double-blind study showing that the risk of heart attack could be reduced by taking aspirin.
Half the subjects were randomly assigned to take aspirin; half took a placebo.
Subjects and physicians were blinded to the assignments.
Tablets were given every other day.
The Data
                               Aspirin Group   Placebo Group
Number Having Heart Attacks              104             189
Number With No Heart Attacks           10933           10845
Total Number                           11037           11034
The Estimated Rates
Given that a person was an aspirin taker, the fraction of people that had a heart attack was 104/11037. Given that a person was a placebo taker, the fraction of people that had a heart attack was 189/11034.
The Estimated Rates
Given that a person had a heart attack, the fraction of people that took aspirin was 104/293 = .3549.
Given that a person had a heart attack, the fraction of people that took placebo was 189/293 = .6451.
The Estimation
The Ratio of the Two Rates:
$$\hat{\theta} = \frac{104/11037}{189/11034} = .55011$$
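The arithmetic behind the two rates and their ratio can be reproduced directly from the counts in the data table. The following is a minimal sketch; the function and variable names are illustrative and do not come from the slides.

```python
# Counts taken from the data table of the aspirin study.
aspirin_attacks, aspirin_total = 104, 11037
placebo_attacks, placebo_total = 189, 11034


def rate(events: int, total: int) -> float:
    """Fraction of subjects in a group who had a heart attack."""
    return events / total


p_hat_aspirin = rate(aspirin_attacks, aspirin_total)
p_hat_placebo = rate(placebo_attacks, placebo_total)

# Estimated ratio of the two rates (theta-hat in the slides).
theta_hat = p_hat_aspirin / p_hat_placebo

print(f"aspirin rate = {p_hat_aspirin:.5f}")
print(f"placebo rate = {p_hat_placebo:.5f}")
print(f"ratio        = {theta_hat:.5f}")
```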
The Conclusion
Aspirin takers only have 55% as many heart attacks as placebo-takers.
The Problem
We are not interested in $\hat{\theta}$, the estimated ratio in the sample.
We are interested in $\theta$, the true ratio: the ratio in the general population.
Repetition
Suppose we were to repeat the study N times. We would obtain the estimates $\hat{\theta}_1, \ldots, \hat{\theta}_N$.
The estimates would certainly not all equal 55%.
The cause of the variation is sampling error.
How much would they vary?
The Bernoulli Trial
Definition A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, typically called "success" and "failure."
Random Variable Notation
Let X be a random variable taking the value 0 or 1.
0 is the value of failure.
1 is the value of success.
The probability that X takes the value 1 is p.
The probability that X takes the value 0 is q = 1 − p.
A Bernoulli trial produces an observation of X.
Bernoulli Process
Definition A Bernoulli process consists of repeatedly performing independent but identical Bernoulli trials.
The Joint Distribution
Let $x_1, \ldots, x_N$ be the observations of N independent Bernoulli trials, each trial having success probability p and failure probability q = 1 − p.
$$\mathrm{Prob}(x_1, \ldots, x_N) = \prod_{n=1}^{N} \begin{cases} p & \text{if } x_n = 1 \\ q & \text{if } x_n = 0 \end{cases} = \prod_{n=1}^{N} p^{x_n} q^{1 - x_n}$$
The Joint Distribution
Define the random variable S by
$$S = \sum_{n=1}^{N} x_n$$
The distribution of S depends on N and p.
$$\mathrm{Prob}(x_1, \ldots, x_N) = \prod_{n=1}^{N} \begin{cases} p & \text{if } x_n = 1 \\ q & \text{if } x_n = 0 \end{cases} = p^{S} q^{N - S}$$
The Binomial Distribution
Definition A random variable S has the binomial distribution with parameters p and N, denoted B(p, N), if and only if
$$\mathrm{Prob}(S = s) = \binom{N}{s} p^{s} q^{N - s}$$
The Binomial Distribution
$$\mathrm{Prob}(a \le S \le b) = \sum_{k=a}^{b} \binom{N}{k} p^{k} q^{N - k}$$
$$E[S] = Np \qquad V[S] = Npq$$
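The mean and variance formulas can be checked numerically by summing over the binomial probabilities. This is a small sketch with illustrative values of N and p chosen only for the example; the function names are hypothetical.

```python
from math import comb


def binomial_pmf(s: int, n: int, p: float) -> float:
    """Prob(S = s) for S distributed as B(p, n)."""
    return comb(n, s) * p**s * (1 - p) ** (n - s)


def binomial_interval_prob(a: int, b: int, n: int, p: float) -> float:
    """Prob(a <= S <= b), computed as a sum of pmf terms."""
    return sum(binomial_pmf(k, n, p) for k in range(a, b + 1))


# Illustrative parameters (not from the slides).
n, p = 20, 0.3

# Mean and variance computed directly from the definition of expectation.
mean = sum(s * binomial_pmf(s, n, p) for s in range(n + 1))
variance = sum((s - mean) ** 2 * binomial_pmf(s, n, p) for s in range(n + 1))

print(mean, n * p)                 # both should agree with Np
print(variance, n * p * (1 - p))   # both should agree with Npq
```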
Estimating p
$$\hat{p} = \frac{S}{N}$$
$$E[\hat{p}] = E\left[\frac{S}{N}\right] = \frac{E[S]}{N} = \frac{Np}{N} = p$$
Variance of Estimate
$$V[\hat{p}] = V\left[\frac{S}{N}\right] = \frac{V[S]}{N^2} = \frac{Npq}{N^2} = \frac{pq}{N}$$
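As a worked instance, the estimate of p and an estimate of its variance can be computed for the aspirin group. The sketch below plugs the estimate of p into pq/N in place of the unknown p; that substitution is an assumption of this example rather than a step stated on the slide.

```python
from math import sqrt


def estimate_p_and_variance(successes: int, trials: int) -> tuple[float, float]:
    """Return (p_hat, estimated variance of p_hat).

    The variance pq/N is evaluated at p = p_hat because the true p is unknown.
    """
    p_hat = successes / trials
    variance_hat = p_hat * (1 - p_hat) / trials
    return p_hat, variance_hat


# Aspirin group counts from the data table.
p_hat_a, var_a = estimate_p_and_variance(104, 11037)

print(f"p_hat          = {p_hat_a:.6f}")
print(f"variance       = {var_a:.3e}")
print(f"standard error = {sqrt(var_a):.6f}")
```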
Left Sided Confidence Interval
Let S have B(p, N). Definition [0, b] is the left sided p0 confidence interval for S if and only if Prob(0 ≤ S ≤ b) = p0
Right Sided Confidence Interval
Let S have B(p, N). Definition [a, N] is the right sided p0 confidence interval for S if and only if Prob(a ≤ S ≤ N) = p0
Central Confidence Interval
Let S have B(p, N).
Definition [a, b] is the p0 central confidence interval for S if and only if Prob(a ≤ S ≤ b) = p0, where
a is an integer minimizing |Prob(0 ≤ S ≤ a) − (1 − p0)/2|
b is an integer minimizing |Prob(b ≤ S ≤ N) − (1 − p0)/2|
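The search for the endpoints a and b can be sketched as a scan over the cumulative binomial sums, taking each tail target to be (1 − p0)/2. Because S is discrete, the coverage of the resulting interval generally only approximates p0. The parameter values and function names below are illustrative.

```python
from math import comb


def binomial_pmf(s: int, n: int, p: float) -> float:
    """Prob(S = s) for S distributed as B(p, n)."""
    return comb(n, s) * p**s * (1 - p) ** (n - s)


def central_interval(n: int, p: float, p0: float) -> tuple[int, int, float]:
    """Return (a, b, coverage) for the central interval of S."""
    pmf = [binomial_pmf(s, n, p) for s in range(n + 1)]
    tail = (1 - p0) / 2

    # lower[a] = Prob(0 <= S <= a)
    lower, running = [], 0.0
    for mass in pmf:
        running += mass
        lower.append(running)

    # upper[b] = Prob(b <= S <= n)
    upper, running = [0.0] * (n + 1), 0.0
    for s in range(n, -1, -1):
        running += pmf[s]
        upper[s] = running

    # Pick the integers whose tail probabilities are closest to the target.
    a = min(range(n + 1), key=lambda s: abs(lower[s] - tail))
    b = min(range(n + 1), key=lambda s: abs(upper[s] - tail))
    coverage = sum(pmf[a : b + 1])
    return a, b, coverage


# Illustrative parameters (not from the slides).
a, b, coverage = central_interval(n=50, p=0.3, p0=0.95)
print(a, b, coverage)
```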
Confidence Interval Around Mean
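$$\begin{aligned}
\mathrm{Prob}(|S - Np| \le k) &= \mathrm{Prob}(S - Np \le k \text{ and } -(S - Np) \le k) \\
&= \mathrm{Prob}(S - k \le Np \text{ and } S + k \ge Np) \\
&= \mathrm{Prob}((S - k)/N \le p \text{ and } (S + k)/N \ge p) \\
&= \mathrm{Prob}((S - k)/N \le p \le (S + k)/N)
\end{aligned}$$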
Confidence Interval for p
The binomial distribution is a discrete distribution. Generally it is not possible to construct a confidence interval for p with an exactly specified confidence coefficient.
The central α confidence interval [pL, pU] for p satisfies Prob(pL ≤ p ≤ pU) = α, where
Prob(0 ≤ p ≤ pL) = (1 − α)/2
Prob(pU ≤ p ≤ 1) = (1 − α)/2
Approximate Central Confidence Interval
An approximate α central confidence interval [pL, pU] can be obtained by approximately solving
$$\sum_{j=S}^{N} \binom{N}{j} p_L^{j} (1 - p_L)^{N - j} = (1 - \alpha)/2$$
$$\sum_{j=0}^{S} \binom{N}{j} p_U^{j} (1 - p_U)^{N - j} = (1 - \alpha)/2$$
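One way to solve the two tail equations numerically is bisection, since each binomial tail sum is monotone in p. The sketch below is an illustration under that approach; evaluating the terms in log space with math.lgamma is an implementation choice made here to avoid overflow for large N, and all function names are hypothetical.

```python
from math import exp, lgamma, log


def log_pmf(j: int, n: int, p: float) -> float:
    """Logarithm of the binomial probability Prob(S = j)."""
    log_choose = lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
    return log_choose + j * log(p) + (n - j) * log(1 - p)


def lower_tail(s: int, n: int, p: float) -> float:
    """Prob(0 <= S <= s) for S distributed as B(p, n)."""
    return sum(exp(log_pmf(j, n, p)) for j in range(s + 1))


def bisect_increasing(f, lo: float, hi: float, iterations: int = 100) -> float:
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def central_interval_for_p(s: int, n: int, alpha: float) -> tuple[float, float]:
    """Solve the two tail equations for (p_L, p_U)."""
    tail = (1 - alpha) / 2
    eps = 1e-12  # keep p strictly inside (0, 1) so the logarithms are defined
    # Prob(S >= s) = 1 - Prob(S <= s - 1) increases with p.
    p_lower = 0.0 if s == 0 else bisect_increasing(
        lambda p: (1 - lower_tail(s - 1, n, p)) - tail, eps, 1 - eps)
    # Prob(S <= s) decreases with p, so its negation increases.
    p_upper = 1.0 if s == n else bisect_increasing(
        lambda p: tail - lower_tail(s, n, p), eps, 1 - eps)
    return p_lower, p_upper


# Aspirin group: S = 104 heart attacks among N = 11037 subjects.
p_lower, p_upper = central_interval_for_p(104, 11037, alpha=0.95)
print(p_lower, p_upper)
```

The upper tail is computed as one minus a lower tail so that only about S terms are summed per evaluation, which keeps the bisection cheap even when N is large.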
The Beta Function
Definition The Beta function B(m, n) is defined by the definite integral
$$B(m, n) = \int_{0}^{1} x^{m-1} (1 - x)^{n-1} \, dx = \frac{(m - 1)!\,(n - 1)!}{(m + n - 1)!}$$
where the factorial form holds for positive integers m and n.
Incomplete Beta Integral
Definition The incomplete Beta Integral, normalized by the Beta function, is defined by
$$I_p(m, n) = \frac{1}{B(m, n)} \int_{0}^{p} x^{m-1} (1 - x)^{n-1} \, dx$$
The incomplete Beta Integral is related to the binomial sums by
$$I_p(k, N - k + 1) = \sum_{j=k}^{N} \binom{N}{j} p^{j} (1 - p)^{N - j} = \mathrm{Prob}(k \le S \le N)$$
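The relation between the incomplete Beta integral and the binomial upper tail can be checked numerically. The identity holds when the integral is normalized by B(m, n), which is what this sketch assumes. The integral is approximated with composite Simpson's rule; the parameter values and function names are illustrative.

```python
from math import comb, factorial


def beta_function(m: int, n: int) -> float:
    """B(m, n) for positive integers, using the factorial form."""
    return factorial(m - 1) * factorial(n - 1) / factorial(m + n - 1)


def incomplete_beta(p: float, m: int, n: int, steps: int = 2000) -> float:
    """Normalized incomplete Beta integral I_p(m, n) by Simpson's rule.

    `steps` must be even.
    """
    def integrand(x: float) -> float:
        return x ** (m - 1) * (1 - x) ** (n - 1)

    h = p / steps
    total = integrand(0.0) + integrand(p)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * integrand(i * h)
    return total * h / 3 / beta_function(m, n)


def binomial_upper_tail(k: int, n_trials: int, p: float) -> float:
    """Prob(k <= S <= N) for S distributed as B(p, N)."""
    return sum(comb(n_trials, j) * p**j * (1 - p) ** (n_trials - j)
               for j in range(k, n_trials + 1))


# Illustrative values (not from the slides).
n_trials, k, p = 10, 4, 0.3
left = incomplete_beta(p, k, n_trials - k + 1)
right = binomial_upper_tail(k, n_trials, p)
print(left, right)  # the two values should agree closely
```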
Confidence Interval
Observe S and estimate the α central confidence interval [pL, pU] for p by solving
$$I_{p_L}(S, N - S + 1) = (1 - \alpha)/2$$
$$1 - I_{p_U}(S + 1, N - S) = (1 - \alpha)/2$$
The Data
                               Aspirin Group   Placebo Group
Number Having Heart Attacks              104             189
Number With No Heart Attacks           10933           10845
Total Number                           11037           11034
Aspirin Binomial
Let $N_A$ be the number of people who were given aspirin.
Let $S_A$ be the number of people who were given aspirin and had a heart attack.
Let $p_A$ be the probability that a person who takes aspirin will have a heart attack.
Then $S_A$ has $B(p_A, N_A)$.
Placebo Binomial
Let $N_B$ be the number of people who were given placebos.
Let $S_B$ be the number of people who were given placebos and had a heart attack.
Let $p_B$ be the probability that a person who takes placebos will have a heart attack.
Then $S_B$ has $B(p_B, N_B)$.
The Ratio
$$\theta = \frac{p_A}{p_B}$$
The Estimated Ratio
$$\hat{p}_A = \frac{S_A}{N_A} \qquad \hat{p}_B = \frac{S_B}{N_B}$$
$$\hat{\theta} = \frac{\hat{p}_A}{\hat{p}_B} = \frac{104/11037}{189/11034} = .55011$$
Problem
What is the variance of $\hat{\theta}$?
What is the α confidence interval of $\theta$?
Confidence Interval For The Ratio
Prob(.43 ≤ θ ≤ .70) = .95
The Bootstrap
The bootstrap is a data-based simulation method for statistical estimation, answering questions like:
What is the variance of $\hat{\theta}$?
What is the α confidence interval of $\theta$?
Derivation
The use of the term bootstrap derives from the phrase "to pull oneself up by one's bootstraps," widely thought to be based on one of the eighteenth-century Adventures of Baron Munchausen by Rudolf Erich Raspe. The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.
How It Works
Create two populations:
Population A: 104 ones, 10933 zeros
Population B: 189 ones, 10845 zeros
How It Works
Repeat N times:
  Draw with replacement a sample of 11037 items from Population A
  Draw with replacement a sample of 11034 items from Population B
  Form the estimates $\hat{\theta}^*_n$, $n = 1, \ldots, N$
Calculate the sample variance of the estimates
Calculate the order statistics $\hat{\theta}^*_{(1)}, \ldots, \hat{\theta}^*_{(N)}$
Form the α confidence interval by $[\hat{\theta}^*_{(M)}, \hat{\theta}^*_{(N-M)}]$ where
$$\frac{N - 2M}{N} = \alpha$$
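The resampling procedure above can be sketched in a few lines using only the standard library. The function name, the fixed random seed, the number of replicates, and the rounding used to pick M are all choices made for this illustration.

```python
import random


def bootstrap_ratio(num_replicates: int, alpha: float = 0.95, seed: int = 0):
    """Bootstrap the ratio of heart-attack rates (aspirin over placebo).

    Returns (sample variance of the replicates, (lower, upper) interval).
    """
    rng = random.Random(seed)
    population_a = [1] * 104 + [0] * 10933   # aspirin group
    population_b = [1] * 189 + [0] * 10845   # placebo group

    estimates = []
    for _ in range(num_replicates):
        # Draw with replacement, keeping each group at its original size.
        sample_a = rng.choices(population_a, k=len(population_a))
        sample_b = rng.choices(population_b, k=len(population_b))
        rate_a = sum(sample_a) / len(sample_a)
        rate_b = sum(sample_b) / len(sample_b)
        estimates.append(rate_a / rate_b)

    mean = sum(estimates) / num_replicates
    variance = sum((t - mean) ** 2 for t in estimates) / (num_replicates - 1)

    # Order statistics; M is chosen so that (N - 2M) / N is close to alpha.
    ordered = sorted(estimates)
    m = max(1, round(num_replicates * (1 - alpha) / 2))
    interval = (ordered[m - 1], ordered[num_replicates - m - 1])
    return variance, interval


variance, (lower, upper) = bootstrap_ratio(num_replicates=500)
print(f"variance = {variance:.5f}")
print(f"{lower:.3f} <= theta <= {upper:.3f}")
```

More replicates give a smoother interval at the cost of run time; 500 is used here only to keep the example quick.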
The Bootstrap Sample
Definition Let $X = \langle x_1, \ldots, x_N \rangle$ be the given random sample of N values. A bootstrap sample is a random sample of N items independently sampled with replacement from X.
Bootstrap Summary
Let $X = \langle x_1, \ldots, x_N \rangle$ be a random sample. Suppose we are interested in the statistic T(X) and we want to estimate the variance of T(X).
Form the bootstrap samples $X^*_1, \ldots, X^*_M$
Compute $t_m = T(X^*_m)$, $m = 1, \ldots, M$
Compute $\bar{t} = \frac{1}{M} \sum_{m=1}^{M} t_m$
$$\hat{V}(T(X)) = \frac{1}{M - 1} \sum_{m=1}^{M} (t_m - \bar{t})^2$$
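The summary translates into a short generic routine that accepts any statistic T as a function. The sketch below uses the sample mean as T and a small made-up data set purely for illustration; the function names and the seed are hypothetical.

```python
import random
from typing import Callable, Sequence


def bootstrap_variance(sample: Sequence[float],
                       statistic: Callable[[Sequence[float]], float],
                       num_bootstrap: int = 1000,
                       seed: int = 0) -> float:
    """Estimate the variance of statistic(sample) by the bootstrap."""
    rng = random.Random(seed)
    n = len(sample)
    # t_m = T(X*_m) for each bootstrap sample X*_m.
    values = [statistic(rng.choices(sample, k=n)) for _ in range(num_bootstrap)]
    t_bar = sum(values) / num_bootstrap
    return sum((t - t_bar) ** 2 for t in values) / (num_bootstrap - 1)


def sample_mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


# Made-up data, used only to exercise the routine.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
var_of_mean = bootstrap_variance(data, sample_mean)
print(var_of_mean)
```

For the sample mean, the result should land near the plug-in variance of the data divided by N, which gives a quick sanity check on the routine.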
The Correlation Coefficient
Definition Let X and Y be two random variables. The correlation ρ between X and Y is defined by
$$\rho = E\left[\frac{(X - \mu_x)}{\sigma_x} \frac{(Y - \mu_y)}{\sigma_y}\right]$$
The Estimated Correlation Coefficient
Let $(x_1, y_1), \ldots, (x_N, y_N)$ be a random sample from a bivariate population. The estimated correlation coefficient $\hat{\rho}$ is calculated by
$$\hat{\rho} = \frac{1}{N} \sum_{n=1}^{N} \frac{(x_n - \hat{\mu}_x)}{\hat{\sigma}_x} \frac{(y_n - \hat{\mu}_y)}{\hat{\sigma}_y}$$
The Variance of the Estimated Correlation Coefficient
If the random variables (X, Y) are bivariate normal, then for large N
$$V[\hat{\rho}] \approx \frac{1}{N} (1 - \rho^2)^2$$
But what if the joint distribution of (X , Y ) is not normal?
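Correlation Coefficient Variance by Bootstrap
Form the bootstrap samples $Z^*_1, \ldots, Z^*_M$
Compute $\hat{\rho}_m = \rho(Z^*_m)$, $m = 1, \ldots, M$
Compute $\bar{\rho} = \frac{1}{M} \sum_{m=1}^{M} \hat{\rho}_m$
$$\hat{V}(\hat{\rho}) = \frac{1}{M - 1} \sum_{m=1}^{M} (\hat{\rho}_m - \bar{\rho})^2$$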
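For the correlation coefficient, each bootstrap sample must resample the (x, y) pairs together so that the dependence between the two coordinates is preserved. The following sketch uses synthetic data generated inside the example; the data, seeds, and function names are assumptions made for illustration.

```python
import random
from math import sqrt


def correlation(pairs):
    """Estimated correlation coefficient with 1/N normalization."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return cross / (n * sd_x * sd_y)


def bootstrap_correlation_variance(pairs, num_bootstrap=1000, seed=0):
    """Bootstrap estimate of the variance of the estimated correlation."""
    rng = random.Random(seed)
    n = len(pairs)
    # Resample whole pairs, never x and y separately.
    values = [correlation(rng.choices(pairs, k=n)) for _ in range(num_bootstrap)]
    rho_bar = sum(values) / num_bootstrap
    return sum((r - rho_bar) ** 2 for r in values) / (num_bootstrap - 1)


# Synthetic bivariate data (illustrative only, not from the slides).
data_rng = random.Random(1)
pairs = []
for _ in range(40):
    x = data_rng.gauss(0, 1)
    y = 0.6 * x + 0.8 * data_rng.gauss(0, 1)
    pairs.append((x, y))

rho_hat = correlation(pairs)
var_rho = bootstrap_correlation_variance(pairs)
print(rho_hat, var_rho)
```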
Eigenvalue Ratio
Let $\hat{\Sigma}$ be the estimated covariance matrix. Let $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \ldots \ge \hat{\lambda}_K$ be the eigenvalues of $\hat{\Sigma}$.
$$\hat{r} = \frac{\hat{\lambda}_1}{\sum_{k=1}^{K} \hat{\lambda}_k}$$
is the estimated fraction of the variance accounted for by the first principal component.
What is the variance of $\hat{r}$?
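The bootstrap answers this question the same way as for any other statistic. To stay within the standard library, the sketch below restricts itself to K = 2, where the eigenvalues of a symmetric 2 by 2 covariance matrix have a closed form; that restriction, the synthetic data, and the function names are assumptions of this example.

```python
import random
from math import sqrt


def leading_eigenvalue_fraction(points):
    """Fraction of total variance on the first principal component (K = 2)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the estimated covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / n
    c = sum((y - mean_y) ** 2 for _, y in points) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Largest eigenvalue of a symmetric 2x2 matrix; the trace a + c is the
    # sum of both eigenvalues.
    largest = ((a + c) + sqrt((a - c) ** 2 + 4 * b * b)) / 2
    return largest / (a + c)


def bootstrap_variance_of_fraction(points, num_bootstrap=1000, seed=0):
    """Bootstrap estimate of the variance of the eigenvalue fraction."""
    rng = random.Random(seed)
    n = len(points)
    values = [leading_eigenvalue_fraction(rng.choices(points, k=n))
              for _ in range(num_bootstrap)]
    r_bar = sum(values) / num_bootstrap
    return sum((r - r_bar) ** 2 for r in values) / (num_bootstrap - 1)


# Synthetic two-dimensional data (illustrative only).
data_rng = random.Random(2)
points = []
for _ in range(40):
    x = data_rng.gauss(0, 1)
    y = 0.6 * x + 0.8 * data_rng.gauss(0, 1)
    points.append((x, y))

r_hat = leading_eigenvalue_fraction(points)
var_r = bootstrap_variance_of_fraction(points)
print(r_hat, var_r)
```

For K greater than 2, the same loop applies with a numerical eigenvalue solver in place of the closed form.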
The Distribution Function
Definition Let X be a real valued random variable. The function F (x) = Prob(X ≤ x) is called the Distribution Function of X
The Empiric Distribution Function
Definition Let $x_1, \ldots, x_N$ be a random sample from a given population. The function
$$\hat{F}(x) = \frac{\#\{n \mid x_n \le x\}}{N}$$
is called the Empiric Distribution Function of X based on the sample $x_1, \ldots, x_N$.
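The definition amounts to counting how many sample values fall at or below x. A minimal sketch, with a hypothetical function name and a tiny made-up sample:

```python
def empirical_distribution_function(sample):
    """Return a function x -> fraction of sample values that are <= x."""
    n = len(sample)

    def f_hat(x):
        return sum(1 for value in sample if value <= x) / n

    return f_hat


# Made-up sample, used only to exercise the function.
sample = [3, 1, 4, 1, 5]
f_hat = empirical_distribution_function(sample)

for x in (0, 1, 4, 10):
    print(x, f_hat(x))
```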
Population Parameter
All the information about the population is contained in the distribution function F, so a population parameter can be written as
$$\theta = T_F$$
Plug-in Principle
The plug-in principle is a simple method of estimating parameters from a sample.
Definition The plug-in estimate of a parameter $\theta = T_F$ is defined by
$$\hat{\theta} = T_{\hat{F}}$$
Bias
Definition The bias of an estimator $\hat{\theta}$ is defined by the difference $E[\hat{\theta}] - \theta$.
Bias and Standard Error
The bootstrap can be used to study the bias and standard error of the plug-in estimate $\hat{\theta} = T_{\hat{F}}$.
Estimates of Bias
Let $X = \langle x_1, \ldots, x_N \rangle$ be the given random sample from a population whose distribution function is F.
Let $X^*_1, \ldots, X^*_M$ be M independent bootstrap samples.
Let T(X) be the estimate based on sample X.
Let $T_F$ be the true value of the quantity in distribution F.
Let $b = E[T(X)] - T_F$ be the bias.
Calculate $\hat{t}_m = T(X^*_m)$, $m = 1, \ldots, M$
Calculate $\bar{t} = \frac{1}{M} \sum_{m=1}^{M} \hat{t}_m$
Calculate $T_{\hat{F}}$
$$\hat{b} = \bar{t} - T_{\hat{F}}$$
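The bias estimate can be sketched for a statistic whose bias is known in advance. The example below uses the plug-in variance (dividing by N), which underestimates the true variance by roughly a factor of 1/N, so the bootstrap bias estimate should come out negative. The data set, the seed, and the function names are illustrative assumptions.

```python
import random


def plug_in_variance(xs):
    """Plug-in estimate of the variance: divides by N, not N - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def bootstrap_bias(sample, statistic, num_bootstrap=2000, seed=0):
    """Bootstrap estimate of bias: mean of the replicates minus T(F-hat)."""
    rng = random.Random(seed)
    n = len(sample)
    values = [statistic(rng.choices(sample, k=n)) for _ in range(num_bootstrap)]
    t_bar = sum(values) / num_bootstrap
    # For a plug-in statistic, T(F-hat) is the statistic on the original sample.
    return t_bar - statistic(sample)


# Made-up data, used only to exercise the routine.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
bias_hat = bootstrap_bias(data, plug_in_variance)
print(bias_hat)
```

Subtracting the estimated bias from the original estimate gives a bias-corrected value, although the correction adds variability of its own.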