3. Statistical Inference

Author: Dwain Austin
3.1 Introduction

It is usually impractical to observe all the values taken by a variable in a given population in order to describe its distribution. Hence, we observe a sample of n individuals. It is assumed that these n observations are independent and are taken from the distribution of the variable in the population as a whole. When the sample is chosen at random from the appropriate sampling frame (see Chapter 1), this is a reasonable assumption.


Parameters

We will refer to such a sample as a simple sample from the population. A parameter is a value describing the distribution of a variable (e.g. the population mean, µ, the population standard deviation, σ). In general, these parameters are unknown.


Statistics

From a sample we can calculate various statistics, e.g.
i) the sample mean, $\bar{X}$,
ii) the sample variance, $s^2$,
iii) the sample proportion, $\hat{p}$ (e.g. the proportion of people in a sample wishing to vote for the Labour party).


Using these statistics, we wish to say something about the appropriate parameters of a population (distribution):
i) the population (theoretical) mean, $\mu$,
ii) the population variance, $\sigma^2$,
iii) the population proportion, $p$.


Sampling errors

Example 2.4.3 illustrates that when we use the sample mean, $\bar{x}$, to estimate the population mean, $\mu$, there will be a random error which depends on the sample observed, e.g.

Sampling error for mean = $|\bar{x} - \mu|$

We do not know what value the sampling error takes, but we can say something about its distribution for samples of a fixed size. As the sample size increases, the distribution of the sample mean becomes more concentrated around the population mean. Hence, estimation becomes (on average) more accurate.
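The concentration of the sample mean around the population mean can be illustrated by simulation. The following is a minimal sketch, assuming a normal population with mean 174 and standard deviation 12 (illustrative values, not taken from Example 2.4.3); the function name `mean_abs_sampling_error` is hypothetical.

```python
import random
import statistics

def mean_abs_sampling_error(n, mu=174.0, sigma=12.0, reps=500, seed=0):
    """Average |x_bar - mu| over `reps` simulated samples of size n.

    mu and sigma are assumed, illustrative population parameters.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

# The typical sampling error shrinks as the sample size grows.
for n in (10, 100, 1000):
    print(n, round(mean_abs_sampling_error(n), 3))
```

The printed errors should decrease as n increases, in line with the statement that estimation becomes (on average) more accurate.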


Confidence intervals

A statistic used to estimate the value of a parameter is called an estimator, e.g. the sample mean is used to estimate the population mean. We consider the following two problems.

1. Given an estimator of a parameter (e.g. the sample mean as an estimator of the population mean), can we define an interval such that the appropriate parameter (the population mean) is very likely to belong to that interval? This is done by interval estimation (using confidence intervals).


Statistical testing

2. Suppose we have a hypothesis regarding a parameter of the population (e.g. 12% of the population want to vote for the Green Party). How do we decide whether this is a realistic hypothesis or not? This is the goal of hypothesis testing.


3.1.1 The distribution of the sample mean and sample proportion

Before addressing these goals, we consider a few results concerning two important statistics:
1. the sample mean, $\bar{X}$,
2. the sample proportion, $\hat{p}$.

The distribution of the sample mean

Let $X_i$ denote the value of the i-th observation from the sample. Since the $X_i$ all come from the distribution of the variable in the population, we have

$$E(X_i) = \mu; \qquad \mathrm{Var}(X_i) = \sigma^2.$$

It follows that

$$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) = n\mu.$$

Since the $X_i$ are independent,

$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\sigma^2.$$


It follows from the central limit theorem that

$$\sum_{i=1}^{n} X_i \sim_{\text{approx}} N(n\mu, n\sigma^2).$$

Dividing by n (the expected value is divided by $n$ and the variance by $n^2$),

$$\bar{X} \sim_{\text{approx}} N\left(\mu, \frac{\sigma^2}{n}\right).$$
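The step of dividing by n can be written out explicitly. The following uses only the results for the sum of the $X_i$ and the standard rules $E(aY) = aE(Y)$ and $\mathrm{Var}(aY) = a^2\mathrm{Var}(Y)$:

```latex
E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
           = \frac{1}{n}\, n\mu = \mu,
\qquad
\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\,\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right)
                      = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.
```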

The standard error of the sample mean

The standard deviation of the sample mean is called the standard error of the sample mean and is denoted $S.E.(\bar{X})$, where

$$S.E.(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$

It can be seen from this distribution that, given that a random sample is taken from the population in question, the expected value of the sample mean is the population mean (there is no systematic error in estimation). Also, the dispersion of the sample mean decreases as the sample size increases. Since we do not know $\sigma$, we estimate the standard error of the sample mean using

$$S.E.(\bar{X}) \approx \frac{s}{\sqrt{n}}.$$
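As a small illustration of the formula $s/\sqrt{n}$, the sketch below computes the estimated standard error for a sample standard deviation of 12 (the value that appears later in Example 3.2.1). The function name `standard_error_mean` is a hypothetical helper.

```python
import math

def standard_error_mean(s, n):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Quadrupling the sample size halves the standard error.
se_100 = standard_error_mean(12, 100)
se_400 = standard_error_mean(12, 400)
print(se_100, se_400)
```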

Systematic errors in estimation

It should be noted that systematic errors may occur when, e.g., the sampling frame is inappropriate. For example, if we use the average height of a sample of Irish students to estimate the average height of all Irish adults, we will tend to overestimate the mean height of Irish adults.


The distribution of the sample proportion

Let $p$ be the proportion of a population showing a given trait and $Y$ be the number of individuals in a sample exhibiting that trait. Then $Y \sim \mathrm{Bin}(n, p)$. For large n, $Y \sim_{\text{approx}} N(np, np[1-p])$. It follows that the distribution of the sample proportion $\hat{p} = Y/n$ is

$$\hat{p} \sim_{\text{approx}} N\left(p, \frac{p[1-p]}{n}\right).$$

Hence, for large samples, the distributions of both the sample mean and the sample proportion are approximately normal.

Standard error of the sample proportion

The standard error of the sample proportion is its standard deviation, i.e.

$$S.E.(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.$$

Since we do not know the population proportion $p$, we may estimate the standard error of the sample proportion using

$$S.E.(\hat{p}) \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

The standard error of the sample proportion is approximately the average error made when using the sample proportion to estimate the population proportion. Similarly, the standard error of the sample mean is approximately the average error made when using the sample mean to estimate the population mean.
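The estimated standard error of the sample proportion can be computed directly from the formula. The sketch below uses the figures that appear later in Example 3.2.4 (150 out of 1000); the function name `standard_error_proportion` is a hypothetical helper.

```python
import math

def standard_error_proportion(p_hat, n):
    """Estimated standard error of the sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

se = standard_error_proportion(150 / 1000, 1000)
print(round(se, 4))
```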

3.2 Confidence Intervals

3.2.1 For the population mean µ, large sample size (n ≥ 30)

Suppose we take a large number of samples of a variable, each with n observations. Since n is large, the sample means will have an approximately normal (bell-shaped) distribution. From the tables for the normal distribution, $P(Z > 1.96) = 0.025$. By symmetry, $P(|Z| > 1.96) = 0.05$. It follows that 95% of such samples will have a sample mean less than 1.96 standard errors from the population mean.


Confidence intervals for the population mean with a large sample

It follows that if the population variance is known, then

$$\bar{X} \pm 1.96\,S.E.(\bar{X})$$

is a 95% confidence interval for the population mean. The term $1.96\,S.E.(\bar{X})$ is called the radius of the confidence interval. This means that if we took a large number of such samples, approximately 95% of the confidence intervals calculated from these samples would contain the population mean.
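This repeated-sampling interpretation can be checked by simulation. The following is a rough sketch, assuming a normal population with known mean 174 and standard deviation 12 (illustrative values only); the function name `coverage` is hypothetical.

```python
import math
import random
import statistics

def coverage(n=50, mu=174.0, sigma=12.0, reps=2000, z=1.96, seed=1):
    """Fraction of simulated intervals x_bar +/- z * sigma / sqrt(n) containing mu.

    mu and sigma are assumed population parameters (sigma treated as known).
    """
    rng = random.Random(seed)
    radius = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(statistics.fmean(sample) - mu) <= radius:
            hits += 1
    return hits / reps

print(coverage())
```

The printed fraction should be close to 0.95, although it varies with the seed and the number of repetitions.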

Similarly, 99% of the sample means lie less than 2.576 standard errors from the population mean. Hence, a 99% confidence interval for the population mean is given by

$$\bar{X} \pm 2.576\,S.E.(\bar{X}).$$

This means that if we took a large number of such samples, approximately 99% of the confidence intervals calculated from these samples would contain the population mean. The confidence levels used here are 95% and 99%. These are the most commonly used confidence levels.

The confidence level will be denoted as $100(1-\alpha)\%$. For large sample sizes, the approximation

$$S.E.(\bar{X}) \approx \frac{s}{\sqrt{n}}$$

will be reasonably good. Hence, we can use the following as an approximate $100(1-\alpha)\%$ confidence interval for the population mean:

$$\bar{X} \pm \frac{s\,Z_{\alpha/2}}{\sqrt{n}} \approx \bar{X} \pm Z_{\alpha/2}\,S.E.(\bar{X}),$$

where $Z_q$ satisfies $P(Z > Z_q) = q$ and $Z \sim N(0, 1)$. $Z_{\alpha/2}$ is called the critical value.
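As an alternative to reading the critical value from tables, it can be computed from the inverse of the standard normal distribution function. The sketch below uses `NormalDist` from Python's standard `statistics` module; the function name `z_critical` is a hypothetical helper.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return Z_{alpha/2}, the value satisfying P(Z > Z_{alpha/2}) = alpha / 2."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.05, 0.01):
    print(alpha, round(z_critical(alpha), 3))
```

For α = 0.05 and α = 0.01 the results agree with the tabulated values 1.96 and 2.576 to the precision of the tables.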

Confidence intervals for the population mean with a small sample

In this case $s$ is no longer a good approximation of the population standard deviation. In order to reflect the increased uncertainty, we take our critical values from the Student t distribution with n − 1 degrees of freedom (see Table 7 in the script). Assume that we have n observations from a normal distribution. Then

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1); \qquad T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1},$$

where $t_{n-1}$ denotes the Student distribution with n − 1 degrees of freedom.

The Student distribution

The Student distribution is very similar to the standard normal distribution, but has a greater variance. It is symmetric about 0. As the number of degrees of freedom tends to infinity, the Student distribution tends to the standard normal distribution. This follows from the definition of the Student distribution (see previous slide), since $s$ becomes an increasingly accurate estimate of $\sigma$ as n grows.

Assuming that the observations come from a distribution which is similar to the normal distribution, the following is a $100(1-\alpha)\%$ confidence interval for the population mean:

$$\bar{X} \pm \frac{s\,t_{n-1,\alpha/2}}{\sqrt{n}},$$

where $t_{n-1,q}$ satisfies $P(T > t_{n-1,q}) = q$ when $T$ has a Student distribution with n − 1 degrees of freedom. This can be written in the form

$$\bar{X} \pm t_{n-1,\alpha/2}\,S.E.(\bar{X}),$$

where $S.E.(\bar{X})$ is our approximation of the standard error of the sample mean, $s/\sqrt{n}$.

Table for the Student distribution

In Table 7 the number of degrees of freedom corresponds to the row and α/2 corresponds to the column. For example, $t_{6,0.005} = 3.707$. As $n \to \infty$, the $t_{n-1}$ distribution tends to the standard normal distribution, so

$$Z_{\alpha/2} = t_{\infty,\alpha/2}.$$

Hence, for large samples (n > 30) we can read the appropriate critical values from the bottom row of Table 7.


Example 3.2.1

Suppose the mean and variance of the height of 100 Irish students are 174 cm and 144 cm², respectively. Calculate 95% and 99% confidence intervals for the average height of all Irish students.

For a 95% confidence interval, $100(1-\alpha) = 95 \Rightarrow \alpha = 0.05$. Since the sample size is large, the 95% confidence interval is given by

$$\bar{X} \pm Z_{\alpha/2}\,S.E.(\bar{X}) = \bar{X} \pm t_{\infty,0.025}\frac{s}{\sqrt{n}} = 174 \pm \frac{1.96 \times \sqrt{144}}{\sqrt{100}} = 174 \pm 2.35 = [171.65, 176.35].$$

For a 99% confidence interval, $100(1-\alpha) = 99 \Rightarrow \alpha = 0.01$. Since the sample size is large, the 99% confidence interval is given by

$$\bar{X} \pm Z_{\alpha/2}\,S.E.(\bar{X}) = \bar{X} \pm t_{\infty,0.005}\frac{s}{\sqrt{n}} = 174 \pm \frac{2.576 \times \sqrt{144}}{\sqrt{100}} = 174 \pm 3.09 = [170.91, 177.09].$$
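The arithmetic of Example 3.2.1 can be reproduced with a few lines of code. This is a minimal sketch; the function name `large_sample_ci` is hypothetical, and the critical values 1.96 and 2.576 are the tabulated ones used above.

```python
import math

def large_sample_ci(x_bar, s, n, z):
    """Approximate confidence interval x_bar +/- z * s / sqrt(n)."""
    radius = z * s / math.sqrt(n)
    return x_bar - radius, x_bar + radius

s = math.sqrt(144)  # sample standard deviation from the sample variance
lo95, hi95 = large_sample_ci(174, s, 100, 1.96)
lo99, hi99 = large_sample_ci(174, s, 100, 2.576)
print(lo95, hi95)
print(lo99, hi99)
```

Passing the critical value as an argument keeps the function usable for any confidence level.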

Estimating the population mean to a given accuracy

It should be noted that the higher the confidence level, the wider the confidence interval (at a higher confidence level, a confidence interval must be more likely to contain the population mean). We can also use these formulae to calculate the number of observations required to estimate the population mean to within δ with a given probability (normally 0.95 or 0.99). It is assumed that the required sample size is large and that we have a reasonable estimate of the population variance (i.e. we have an initial sample of reasonably large size). We require that the radius of the confidence interval is less than or equal to δ, i.e.

$$t_{\infty,\alpha/2}\,S.E.(\bar{X}) \le \delta.$$

Example 3.2.2

Calculate the sample size needed to estimate the mean height of all Irish students to within 1 cm with a probability of a) 0.95, b) 0.99. Here, we use the data from Example 3.2.1. The sample variance was 144 cm².

We require that the radius of the confidence interval is bounded above by 1 cm. In case a) the confidence level is 95% (i.e. α = 0.05). The radius of the confidence interval is

$$t_{\infty,\alpha/2}\,S.E.(\bar{X}) = \frac{s\,t_{\infty,0.025}}{\sqrt{n}} \le 1.$$

Hence,

$$\sqrt{144} \times 1.96 \le \sqrt{n} \;\Rightarrow\; 144 \times 1.96^2 \le n \;\Rightarrow\; 553.19 \le n.$$

Thus, at least 554 observations are required to estimate the population mean to within 1 cm with a probability of 95%.

In case b) the confidence level is 99% (i.e. α = 0.01). The radius of the confidence interval is

$$t_{\infty,\alpha/2}\,S.E.(\bar{X}) = \frac{s\,t_{\infty,0.005}}{\sqrt{n}} \le 1.$$

Hence,

$$\sqrt{144} \times 2.576 \le \sqrt{n} \;\Rightarrow\; 144 \times 2.576^2 \le n \;\Rightarrow\; 955.55 \le n.$$

Thus, at least 956 observations are required to estimate the population mean to within 1 cm with a probability of 99%.
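Rearranging the inequality for the radius gives $n \ge (t_{\infty,\alpha/2}\,s/\delta)^2$, rounded up to the next integer. The sketch below applies this to both cases of Example 3.2.2; the function name `sample_size_for_mean` is a hypothetical helper.

```python
import math

def sample_size_for_mean(s, delta, z):
    """Smallest n such that z * s / sqrt(n) <= delta."""
    return math.ceil((z * s / delta) ** 2)

s = math.sqrt(144)
print(sample_size_for_mean(s, 1, 1.96))   # case a), 95%
print(sample_size_for_mean(s, 1, 2.576))  # case b), 99%
```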

Assumptions underlying the calculation of a confidence interval

It should be noted that these calculations assume that the sample mean has a normal distribution. If the sample is large (n > 30), then this is a reasonable assumption. However, if the sample size is small, this assumption is only reasonable when the observations come from a distribution which is similar to the normal distribution.


Example 3.2.3

The masses of a sample of 25 Irish students were taken. The sample mean was 74 kg and the sample variance 121 kg². Calculate an approximate 90% confidence interval for the mean mass of all Irish students.

In this case we have a small sample. The confidence level is 90%, thus α = 0.1. The confidence interval is given by

$$\bar{X} \pm \frac{t_{n-1,\alpha/2}\,s}{\sqrt{n}} = 74 \pm \frac{t_{24,0.05}\sqrt{121}}{\sqrt{25}}.$$

From Table 7, $t_{24,0.05} = 1.711$. Hence, the confidence interval is given by

$$74 \pm \frac{1.711 \times \sqrt{121}}{\sqrt{25}} = 74 \pm 3.76 = [70.24, 77.76].$$
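To see how the Student critical value reflects the increased uncertainty, the radius from Example 3.2.3 can be compared with the radius that the normal critical value would give. This is only an illustrative sketch: the value 1.711 is the one read from Table 7 above, the normal critical value is computed with `statistics.NormalDist`, and the function name `ci_radius` is hypothetical.

```python
import math
from statistics import NormalDist

def ci_radius(critical_value, s, n):
    """Radius of the interval: critical value times the estimated standard error."""
    return critical_value * s / math.sqrt(n)

s, n = math.sqrt(121), 25
t_radius = ci_radius(1.711, s, n)    # t_{24,0.05} as read from Table 7
z_crit = NormalDist().inv_cdf(0.95)  # Z_{0.05}, for comparison only
z_radius = ci_radius(z_crit, s, n)
print(round(t_radius, 2), round(z_radius, 2))
```

The interval based on the Student distribution is the wider of the two.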


Assumptions used in this calculation

The distribution of mass is slightly right skewed (i.e. not normal). It follows that for small samples the sample mean will not have exactly a normal distribution. However, the approximation here will be reasonable, since the sample size is not very small and the distribution of mass is not highly skewed. Hence, the confidence level will be approximately 90%.


3.2.3 Confidence intervals for the population proportion

In this case we only consider large samples (n > 30). The standard error of the sample proportion is

$$S.E.(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.$$

It can be seen that the standard error of the sample proportion depends on the population proportion, which is unknown.


Estimation of the standard error of the sample proportion

We may estimate the standard error in 2 ways.

1. Using the sample proportion:

$$S.E.(\hat{p}) \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

2. Using a conservative estimate of the standard error (i.e. we use the maximum possible standard error for the given sample size). Since $0 \le p \le 1$, it follows that $p(1-p) \le \frac{1}{4}$. Hence,

$$S.E.(\hat{p}) \le \sqrt{\frac{1}{4n}} = \frac{1}{2\sqrt{n}}.$$
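The bound on $p(1-p)$ can be verified by completing the square, which also shows that the maximum is attained at $p = \frac{1}{2}$:

```latex
p(1-p) = \frac{1}{4} - \left(p - \frac{1}{2}\right)^2 \le \frac{1}{4},
\qquad\text{hence}\qquad
S.E.(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} \le \sqrt{\frac{1}{4n}} = \frac{1}{2\sqrt{n}}.
```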


Confidence interval for a population proportion

A $100(1-\alpha)\%$ confidence interval for the population proportion $p$ is given by

$$\hat{p} \pm Z_{\alpha/2}\,S.E.(\hat{p}) = \hat{p} \pm t_{\infty,\alpha/2}\,S.E.(\hat{p}).$$

Note that this formula is analogous to the formula for a confidence interval for the population mean for large samples.

When we simply wish to calculate a confidence interval, we use the first approximation for the standard error of the sample proportion. Hence,

$$\hat{p} \pm t_{\infty,\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

is an approximate $100(1-\alpha)\%$ confidence interval for the population proportion.

The conservative approximation of the standard error of the sample proportion is used when we wish to determine the sample size required to estimate a population proportion to a required accuracy δ. In this case the maximum radius of the confidence interval for the population proportion is given by

$$R_{\max} = \frac{t_{\infty,\alpha/2}}{2\sqrt{n}}.$$

The advantage of this approach is that such a sample size is sufficient regardless of what the population proportion is.


It should be noted that when p is close to 0 or 1, then the confidence interval may not be accurate (in this case the normal approximation to the binomial tends to be inaccurate).


Example 3.2.4

In a survey 150 out of 1000 voters questioned said that they would vote Labour. i) Calculate a 95% confidence interval for the proportion of the population that would vote Labour. ii) What sample size is required to measure any population proportion to an accuracy of ±3% with a probability of 95%?

i) The confidence level is 95%, hence α = 0.05. The appropriate confidence interval is

$$\hat{p} \pm t_{\infty,\alpha/2}\,S.E.(\hat{p}) \approx \hat{p} \pm t_{\infty,\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

where $\hat{p}$ is the sample proportion, $\hat{p} = 150/1000 = 0.15$, and $t_{\infty,0.025} = 1.96$.

Hence, an approximate 95% confidence interval for the proportion of the population that would vote Labour is

$$0.15 \pm 1.96\sqrt{\frac{0.15(1-0.15)}{1000}} = 0.15 \pm 0.022 = [0.128, 0.172].$$
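The same calculation can be sketched in code, using the first approximation for the standard error. The function name `proportion_ci` is a hypothetical helper, and 1.96 is the tabulated critical value used above.

```python
import math

def proportion_ci(successes, n, z):
    """Approximate confidence interval for a population proportion."""
    p_hat = successes / n
    radius = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - radius, p_hat + radius

lo, hi = proportion_ci(150, 1000, 1.96)
print(round(lo, 3), round(hi, 3))
```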

ii) The maximum radius of the confidence interval is given by

$$R_{\max} = \frac{t_{\infty,\alpha/2}}{2\sqrt{n}}.$$

Since the confidence level is 95%, α = 0.05. We need to find n such that $R_{\max} \le 0.03$. Hence,

$$\frac{t_{\infty,0.025}}{2\sqrt{n}} \le 0.03 \;\Rightarrow\; \frac{1.96}{2 \times 0.03} \le \sqrt{n} \;\Rightarrow\; 32.667 \le \sqrt{n} \;\Rightarrow\; n \ge 32.667^2 = 1067.11.$$

Since n must be an integer, the sample size must be at least 1068.
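The conservative sample size follows from rearranging $R_{\max} \le \delta$ to $n \ge (t_{\infty,\alpha/2} / (2\delta))^2$ and rounding up. The sketch below is one way to compute it; the function name `sample_size_for_proportion` is a hypothetical helper.

```python
import math

def sample_size_for_proportion(delta, z):
    """Smallest n such that z / (2 * sqrt(n)) <= delta, whatever the value of p."""
    return math.ceil((z / (2 * delta)) ** 2)

print(sample_size_for_proportion(0.03, 1.96))
```

Because the bound uses the maximum possible standard error, the resulting sample size is sufficient regardless of the population proportion.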

Public opinion polls

Note that in most public opinion polls the error is given as ±3%. Normally such polls use a sample of just over 1000 individuals. The maximum radius of a 95% confidence interval is 0.03 in this case.
