'

$

Chapter 11 Simulation studies in statistics

&

% 1

'

$

Introduction • What is a simulation study, and why do one? • Simulations for properties of estimators • Simulations for properties of hypothesis tests

&

% 2

'

$

What is a simulation study Simulation • A numerical technique for conducting experiments on the computer. • It involves random sampling from probability distributions.

&

% 3

'

$

Rationale: (In statistics) • Properties of statistical methods must be established so that the methods may be used with confidence. • Exact analytical derivations of properties are rarely possible. • Large sample approximations to properties are often possible. • But what happens in the finite sample size case? • And what happens when assumptions are violated?

&

% 4

'

$

Usual issues • Is an estimator biased in finite samples? Is it still consistent under departures from assumptions? What is its sampling variance? • How does it compare to competing estimators on the basis of bias, precision, etc.? • Does a procedure for constructing a confidence interval for a parameter achieve the advertised nominal level of coverage?

&

% 5

'

$

Usual issues • Does a hypothesis testing procedure attain the advertised level of significance or size? • If it does, what power is possible against different alternatives to the null hypothesis? Do different test procedures deliver different power? How to answer these questions in the absence of analytical results?

&

% 6

'

$

Monte Carlo simulation approximation • An estimator or test statistic has a true sampling distribution under a particular set of conditions (finite sample size, true distribution of the data, etc.) • Ideally, we could want to know this true sampling distribution in order to address the issues on the previous slide. • But derivation of the true sampling distribution is not tractable. • Hence we approximate the sampling distribution of an estimator or a test statistic under a particular set of conditions.

&

% 7

'

$

How to approximate A typical Monte Carlo simulation involves the following: • Generate S independent data sets under the conditions of interest • Compute the numerical value of the estimator/test statistic T from each data set. Hence we have T1 , T2 , · · · , Ts , · · · , TS

&

% 8

'

$

How to approximate • If S is large enough, summary statistics across T1 , T2 , · · · , Ts , · · · , TS should be good approximations to the true properties of the estimator/test statistic under the conditions of interest. An example: Consider an estimator for a parameter θ: – Ts is the value of T from the s-th data set, s=1,2,· · · ,S. – The mean of all Ts ’s is an estimate of the true mean of the sampling distribution of the estimator.

&

% 9

'

$

Simulation for properties of estimators A simple example: • Compare three estimators for the mean µ of a distribution based on a random sample Y1 , Y2 , · · · , Yn • The three estimators are: – Sample Mean, T (1) – Sample Median, T (2) – Sample 10% trimmed mean,T (3)

&

% 10

'

$

Simulation procedures For a particular choice of µ, n and true underlying distribution • Generate independent draws Y1 , Y2 , · · · , Yn from the distribution • Compute T (1) , T (2) , T (3) • Repeat S times. Hence we have (1)

(1)

(1)

(2)

(2)

(2)

(3)

(3)

(3)

T1 , T 2 , · · · , T S

T1 , T 2 , · · · , T S

T1 , T 2 , · · · , T S

&

% 11

'

$

Simulation procedures Compute for k = 1, 2, 3 mean ˆ =

S 1 ∑ (k) Ts = T¯(k) S s=1

ˆ bias

= T¯(k) − µ v u S u 1 ∑ 2 (k) t (k) ˆ ¯ SD = (Ts − T ) S − 1 s=1

ˆ MSE =

S 1 ∑ ¯(k) ˆ 2 ˆ 2 + bias (T − µ)2 ≈ SD S s=1

&

% 12

'

$

A simulation study • To compare 3 estimators for location µ through a simulation study • The 3 estimators are – Sample Mean, – Sample Median, – Sample 10% trimmed mean • Underlying distribution: N(0,1) • Sample size: 15 • Simulation size: 1000

&

% 13

'

$

A simulation study: R code > # To compare 3 estimators for location,mu, through simulation study > # The 3 estimators are sample mean, median, and 10% trimmed mean. > S=1000 # Simulation size (No.of samples) > n=15 # Sample size > mu=1 # Mean of the underlying normal distribution > sd=1 # Standard deviation of the underlying normal distribution

&

% 14

'

$

A simulation study: R code > meax=numeric(S) # A vector contains all the sample means > medx=numeric(S) # A vector contains all the sample medians > trmx=numeric(S) # A vector contains all the sample 10% trimmed means > stdx=numeric(S) # A vector contains all the sample standard deviation > set.seeds(1234) # Set the seed number

&

% 15

'

$

A simulation study: R code > for (i in 1:S){ + x=rnorm(n,mu,sd) # Generate a random sample of size n + mean[i]=mean(x) # Compute the mean for the i-th sample,i.e.T∧(1) i + medx[i]=median(x)# Compute the median for the i-th sample,i.e.T∧(2) i + trmx[i]=mean(x,trim=0.1) +# Compute the 10% trimmed mean for the i-th sample, i.e.T∧(3) i + stdx[i]=sd(x)# Compute the standard deviation for the i-th sample }

&

% 16

'

$

A simulation study: R code > # Compute the mean of “S” sample means, medians, and trimmed means > simumean=apply(cbind(meax,medx,trmx),2,mean) > # “2” in the second argument asks the computer to fine “means” for all the columns in the matrix given in argument 1. > # Hence “simumean” is a 3×1 vector consists of the average of the “S” means, medians, and the 10% trimmed means.

&

% 17

'

$

A simulation study: R code > # Compute the standard deviation of the “S” sample means, medians, and 10% trimmed means > simustd=apply(cbind(meax,medx,trmx),2,sd) > # Compute the bias > simubias=simumean-rep(mu,3) > # Compute the MSE (Mean Square Error) > simumse=simubias∧2+simustd∧2

&

% 18

'

$

A simulation study: R code > ests=c(“Sample mean”,“Sample median”,“Sample 10% trimmed mean”) # column heading in the output > names=c(“True value”,“No. of simu”,“”MC Mean,“MC Std Deviation”,“MC Bias”,“MC MSE”) # row heading in the output > sumdat=rbind(rep(mu,3),rep(S,3),simumean,simustd,simbias,simumse) > > dimnames(sumdat)=list(names,ests) > round(sumdat,4)

&

% 19

'

$

A simulation study: R code

&

% 20

'

$

Checking coverage probability of confidence interval • Usual 95% confidence interval for µ based on sample mean is given by s s (Y¯ − t0.025,n−1 √ , Y¯ + t0.025,n−1 √ ) n n • Does the interval achieve the nominal level of coverage 95%?

&

% 21

'

$

Checking coverage probability of confidence interval > #Check the coverage probability of confidence interval > #We make use of the data obtained from the above simulation study > t05=qt(0.975,n-1) #Get t {0.025,14} >coverage=sum((meaxt05*stdx/sqrt(n)=mu))/S > #The above statement is equivalent to > # d=0 > # for(i in 1:S){ > #d=d+meax[i]t05*stdx[i]/sqrt(n)=mu))} > #coverage=d/S >coverage [1]0.945 &

% 22

'

$

Simulation for properties of hypothesis tests Example: Consider the size and power of the usual t-test for the mean • Test H0 : µ = µ0 against H0 : µ ̸= µ0 • To evaluate whether size/level of test achieves advertised α – Generate data under H0 : µ = µ0 and calculate the proportion of rejections of H0 • Approximation the true probability of rejecting H0 when it is true – Proportion should be approximately equal to α

&

% 23

'

$

Simulation for properties of hypothesis tests • To evaluate power – Generate data under some alternative µ ̸= µ0 (let say, µ = µ1 ) and calculate the proportion of rejections of H0 • Approximate the true probability of rejecting H0 when the alternative is true

&

% 24

'

$

Checking the size of t-test: R code > #Check the size of t-test >ttests=(meax-mu0)/(stdx/sqrt(n)) >size=sum(abs(ttests)>t05)/S >size [1]0.045

&

% 25

'

$

Checking the power of t-test: R code > #Check the power of t-test when d=1*sd > Assume the samples generated from N (µ1 , σ) where µ1 = 1, σ = 1. > Test H0 : µ = µ0 vs H1 : µ = µ1 = µ0 + σ where µ0 = 0. >ttests=(meax-mu0)/(stdx/sqrt(n)) >power=sum(abs(ttests)>t05)/S >power [1]0.954

&

% 26

'

$ Simulation study in SAS

Simulation to study the size of t-test *Generate “ns” normal random samples; * “mcrep” (M.C. Replications) is the index for the s-th random sample; data simu; seed=123; ns=1000;n=25;mu=0;sigma=1; do mcrep=1 to ns; do i=1 to n; x=rannor(seed); output; end; keep mcrep x; end; run; &

% 27

'

$

Simulation study in SAS proc sort data=simu; by mcrep; run; *Perform the t-test for each sample. Output the p-value “probt” in the SAS dataset “outtset”; proc univariate data=simu noprint mu0=0; by mcrep; var x; output out=outtest probt=p; run;

&

% 28

'

$

Simulation study in SAS * Count how many samples have “probt”