Confidence, hypothesis testing, and significance. ESS 116 Lecture 6


Lecture 5 - review
• Population: the actual properties of the real world
• Sample: set of values imperfectly representing the population
• Parameters: refer to the population (e.g., μ and σ)
• Statistics: refer to the sample (e.g., x̄ and s)
• Accuracy: quality of being close to the true value
• Precision: number of significant digits in a numerical value (measurement or calculation)

Lecture 5 - review
• Sample visualization
– Frequency Table
– Cumulative Frequency
– Histogram

• Rules for a good histogram (see the sketch below)
– number of bins ≈ √(number of data values)
– a histogram takes either a number of bins, or a list of bin edges
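For reference, a minimal MATLAB sketch of the two calling styles (the sample x and the bin choices are made up for illustration):

x = 300 + 15*randn(1000, 1);       % hypothetical sample of 1000 values
figure; histogram(x, 30);          % option 1: give a number of bins
figure; histogram(x, 220:10:360);  % option 2: give a list of bin edges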

What you need to know
• Central Tendency:
– Mean (average)
– Median (50% higher, 50% lower)
– Mode(s) (peak value(s))

• Dispersion:
– Range (max − min)
– Standard deviation (root-mean-square distance to the mean)
– Variance (square of Std Dev)

• Shape:
– Skewness (positive: tail to the right; negative: tail to the left)

Know how they relate to visual features on a histogram

[Figure: Histogram vs. relative histogram vs. probability density. Four panels over Temperature (K), 220 to 360 K:
a) Histogram of 1000 Temperature values (y-axis: Frequency, unitless: count);
b) Relative histogram (y-axis: Discrete probability, unitless: count/all counts);
c) Discrete probability density, N=1000 (y-axis: Discrete probability density, K⁻¹);
d) Continuous probability density function, PDF (y-axis: Probability density, K⁻¹).]

• In a long sequence of trials, the relative frequencies of outcomes settle down to values which are regarded as their probabilities.

• The smooth PDFs are what we think govern the natural population of all possible values.
• Therefore we are especially interested in the PDFs defined by the parameters of the population.

Calculating probability from a PDF.
• Probabilities correspond to the integrals of PDFs.
• A single value of a continuous variable has zero probability, so we have to specify an interval on the PDF curve for which to calculate the probability.
• Mathematically, we integrate the area under the PDF curve bounded by the specified interval.

Probability Density Functions
• Histograms: empirical frequency distribution of our sample.
• A relative histogram with an infinitely small bin size (and infinitely many values) produces a Probability Density Function (PDF).
• The probability that x is between x1 and x2 is:

P(x1 < x < x2) = ∫ f(x) dx, integrated from x1 to x2
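For a normal population this integral can be evaluated from the CDF; a minimal MATLAB sketch, with made-up values for mu, sigma, and the interval:

mu = 300; sigma = 15;                                  % hypothetical population
x1 = 280; x2 = 320;                                    % interval of interest
p = normcdf(x2, mu, sigma) - normcdf(x1, mu, sigma);   % P(x1 < x < x2)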

• Examples of theoretical distributions:
– Normal distribution (2 parameters: μ and σ)
– Z distribution (0 parameters)
– Student's t distribution (1 parameter: ν, the degrees of freedom)

MATLAB theoretical distributions
• Normal (μ, σ)
– Given x0, find p0: >> p0 = normcdf(x0,mu,sigma);
– Given p0, find x0: >> x0 = norminv(p0,mu,sigma);
• Z-distribution: >> p0 = normcdf(x0); >> x0 = norminv(p0);
• t-distribution (V = degrees of freedom): >> p0 = tcdf(x0,V); >> x0 = tinv(p0,V);
• χ²-distribution: >> p0 = chi2cdf(x0,V); >> x0 = chi2inv(p0,V);

p0 = P(x < x0), e.g.: 0.88 = P(x < 1.17) for the Z-distribution.

i>Clicker question
ESS 116 grades follow a Normal distribution of mean 800 with a standard deviation of 100. What is the probability of having a grade below 500?
A. 1 – normcdf(500,800,100);
B. 1 – norminv(500,800,100);
C. normcdf(500,800,100);
D. norminv(500,800,100);
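One way to check the options is simply to evaluate them (requires the Statistics Toolbox). A grade of 500 lies three standard deviations below the mean, so the probability should be tiny:

p = normcdf(500, 800, 100)   % ~0.0013, i.e. P(grade < 500); option C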

How to answer questions about probability... in pictures.

The Normal PDF (µ = mean, σ = std):

f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
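As a sanity check, this formula can be evaluated directly and compared against MATLAB's built-in normpdf (the values of mu, sigma, and x are made up):

mu = 300; sigma = 15;
x = 250:0.5:350;
f = (1./(sigma*sqrt(2*pi))) .* exp(-(x - mu).^2 ./ (2*sigma^2));
max(abs(f - normpdf(x, mu, sigma)))   % effectively zero: the two agree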



Consider a colorful analysis of the global spatial pattern of climate change.

Method: Analyzing the difference in means of two temperature data samples (recent vs. historical) as a measure of climate change. a) Estimate the recent climate change in surface temperature at all locations by subtracting the surface temperature averaged over the first seven years from the surface temperature averaged over the last seven years. The answer should be a 2D lat/lon matrix. Visualize it as a color map of the temperature change with superimposed coastlines (see the sketch below).
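A minimal MATLAB sketch of this recipe, assuming a hypothetical lat x lon x year array T of annual-mean surface temperature with coordinate vectors lat and lon, and coastline data from the Mapping Toolbox:

nyr = size(T, 3);
Tfirst = mean(T(:, :, 1:7), 3);         % average over the first seven years
Tlast  = mean(T(:, :, nyr-6:nyr), 3);   % average over the last seven years
dT = Tlast - Tfirst;                    % 2D lat/lon map of the change

figure;
contourf(lon, lat, dT, 20, 'LineStyle', 'none');
colorbar; title('Change in mean surface temperature (K)');
hold on;
load coastlines                         % coastlat/coastlon (Mapping Toolbox)
plot(coastlon, coastlat, 'k');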

The method yields an answer everywhere. But what is our confidence in the answer?

Where is climate change significant at high confidence as measured by a difference in mean temperatures?

Imagine these are the data samples at a specific location. Is the difference in mean between these 2 groups systematic, or just due to chance?

Slide from: Ruth Rosenholtz, MIT OpenCourseWare

What about this difference in the mean?

Slide from: Ruth Rosenholtz, MIT OpenCourseWare

What about this difference in the mean?

Slide from: Ruth Rosenholtz, MIT OpenCourseWare

Factors affecting our confidence in the answer

• Natural variability is different in different parts of the world.

• There are only so many data samples. • Surely this must affect our confidence in the climate change answer...

• How do we quantify our confidence in the difference between the means of two data samples?

• Due to variability within the sample.
• Due to the number of data points in the sample.

Hypothesis testing A statistical method to determine the probability that a given hypothesis is true.

Hypothesis testing It is very hard to prove that something is true: because you would have to prove it in all cases. BUT It is much easier to prove something is false: because you only need to find 1 case in which it is not true.

Hypothesis testing is based on this idea:

If you can disprove that something is false with some confidence level... (i.e. by showing the probability of being false is very small) THEN The opposite is very likely true.

Hypothesis - typical example. I have two data samples, Control & Test:
• Control: e.g. temperatures from 1979-1989.
• Test: e.g. temperatures from 1997-2007.

Is the mean value of the test population statistically different from the control population?

Difference probably didn’t arise due to chance alone.

A key purpose of statistics.

As researchers, we need a principled way of analyzing data, to protect us from inventing elaborate explanations for effects in data that could have occurred predominantly due to chance.

Slide content from: Ruth Rosenholtz, MIT OpenCourseWare

Hypothesis testing
You have two data samples. Mean value of sample 1 is m1 and of sample 2 is m2. Suppose that analysis indicates m1 ≠ m2 for the samples (the hypothesis).

Is the same true for the populations? It is very hard to prove that something is true: because you would have to prove it in all cases. BUT It is much easier to prove something is false: because you only need to show it is false in 1 case, i.e. that m1 = m2 (the null hypothesis) is false.

Hypothesis testing In statistics we use two mutually exclusive hypotheses to generalize the whole population.

Null hypothesis Ho: m1 = m2

Population means are not statistically different.

Alternative hypothesis Ha: m1 ≠ m2

Population means are statistically different.

An investigator may either reject or not reject the null hypothesis (Ho) with some degree of confidence.

Hypothesis testing Null hypothesis Ho: m1 = m2

Population means are not statistically different.

Alternative hypothesis Ha: m1 ≠ m2

Population means are statistically different.

We cannot prove Ho... but we can show that Ho can be rejected with high confidence. → Hence Ha is accepted, i.e. likely to be true.

That is, in this example I cannot prove that m1 ≠ m2, but I can show that m1 = m2 is unlikely to be true → reject Ho.

An investigator may either reject or not reject the null hypothesis (Ho) with some degree of confidence.

Ok, but how do we describe “confidence” or “significance” in the difference between means?

Is the difference in mean between these 2 groups systematic, or just due to chance?

Slide from: Ruth Rosenholtz, MIT OpenCourseWare

What about this difference in the mean?

Slide from: Ruth Rosenholtz, MIT OpenCourseWare

What about this difference in the mean?

Slide from: Ruth Rosenholtz, MIT OpenCourseWare

Standard Error (SE)
• Variability of X is measured by the standard deviation.
• There might be a “gap” between the sample mean and the population mean μ.
• Standard Error: variability in the sample mean:

SE = σ / √n (σ: population standard deviation; n: sample size)

• SE decreases as the sample size increases (the sample mean becomes more precise); see the sketch below.
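A minimal sketch of the SE computation in MATLAB, using the sample standard deviation as an estimate of σ (the sample x is made up):

x = 300 + 15*randn(50, 1);      % hypothetical sample
SE = std(x) / sqrt(numel(x));   % standard error of the sample mean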

Significant differences in means - intuitions.

• Occurs when the difference in means is large compared to the spread (e.g. variance s² or standard deviation s) of the data sample.

• tstat ~ (m1 − m2) / s measures this (m1, m2: the sample means).

This is the test statistic at the heart of a “t-test”.

Also depends on the number of samples.



With more samples, we’re willing to say a difference is significant even if the variance is a bit larger compared to the difference in the means.



tstat ~ (m1 − m2) / SE includes the sample-size effect.
Standard Error (SE) ~ s / sqrt(n), where n = # of data points in the sample.

Slide content from: Ruth Rosenholtz, MIT OpenCourseWare

t-tests
• In general, we'll compute from our data some tstat of the form:

tstat ~ (m1 − m2) / SE

• tstat is a measure of how reliable a difference we're seeing between the two means.

• If this number is “big enough” we'll say that there is a significant difference between the two sample mean values.

Slide content from: Ruth Rosenholtz, MIT OpenCourseWare

t-tests

t ~ (difference between sample means) / (normal variability within sample, i.e. SE)

• If t is large, the difference between groups is much bigger than the normal variability within groups.

• Therefore, the means of the two samples are significantly different from each other.

• If t is small, the difference between groups is much smaller than the normal variability within groups.

• Therefore, the means of the two samples are not significantly different from each other.

Slide content from: Ruth Rosenholtz, MIT OpenCourseWare

Hypothesis testing
• The classical way to make statistical comparisons is to prepare a statement about a fact for which it is possible to calculate its probability of occurrence.
• This statement is the null hypothesis and its counterpart is the alternative hypothesis.
• The null hypothesis is traditionally written as H0 and the alternative hypothesis as H1.
• A statistical test measures the experimental strength of evidence against the null hypothesis.
• Curiously, depending on the risks at stake, the null hypothesis is often the reverse of what the experimenter actually believes, for tactical reasons.

Examples of Hypotheses
• Let μ1 and μ2 be the means of 2 samples.
• We want to investigate the likelihood that their means are the same:
– Null Hypothesis: H0: μ1 = μ2
– Alternative Hypothesis: H1: μ1 ≠ μ2
– The Alternative Hypothesis could also be: H1: μ1 > μ2
• The first example of H1 is said to be two-sided or two-tailed (it includes both μ1 > μ2 and μ1 < μ2).
• The second is said to be one-sided or one-tailed.
• The number of sides has implications for how to formulate the test.

Possible outcomes

               | H0 is correct                              | H0 is incorrect
H0 is accepted | Correct decision (probability: 1−α)        | Type II error, missed detection (probability: β)
H0 is rejected | Type I error, false alarm (probability: α) | Correct decision (probability: 1−β)

• Level of significance: the probability α of committing a Type I error; α is set before performing the test.
• In a two-sided test, α is split between the two tails.
• Often, H0 and α are designed with the intention of rejecting H0, thus risking a Type I error and avoiding the unbounded Type II error. The more likely this is, the more power the test has. Power is 1 − β.

t-tests
• We would like to set a threshold, tcrit, such that

tstat > tcrit

means the difference we see between the means is unlikely to have occurred by chance (and thus there's likely to be a real systematic difference between the two conditions).

• A special class of statistical probability distribution functions is used to determine tcrit.

• We need to know the degrees of freedom (df) of a dataset: typically df = N − 1, where N = # of samples in the dataset.

Slide content from: Ruth Rosenholtz, MIT OpenCourseWare

The “t-distribution” reflects the probability that a t value could take on a particular value, purely by chance, including the effects of sampling.

[Figure: the t distribution, P(t) vs. t over −5 to 5, with curves for DOF = 10, 50, and 1e10.]

The t distribution approaches the standard normal PDF for high DOF (large # of samples). In MATLAB, the function “tpdf” can be used to calculate the t-distribution.
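A minimal sketch that reproduces this figure with tpdf (requires the Statistics Toolbox):

t = -5:0.01:5;
figure; hold on;
plot(t, tpdf(t, 10));         % DOF = 10
plot(t, tpdf(t, 50));         % DOF = 50
plot(t, normpdf(t), 'k--');   % standard normal limit (DOF -> infinity)
xlabel('t'); ylabel('P(t)');
legend('DOF = 10', 'DOF = 50', 'standard normal');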

The t distribution is actually a family of PDFs that have slightly different shapes depending on the number of degrees of freedom (N − 1); it approaches the standard normal PDF for high DOF (large # of samples).

Thresholds derived from the t-distribution can be used to tell if a particular t value is significant at a specific confidence level. Null hypothesis Ho: m1 = m2

Population means are not statistically different.

Alternative hypothesis Ha: m1 ≠ m2

Population means are statistically different.

Thresholds derived from the t-distribution can be used to tell if a particular t value is significant at a specific confidence level.

To accept the alternative hypothesis at 95% confidence... we must show there is only a 5% probability the null hypothesis is true, which justifies rejecting it.

Ho: m1 = m2. Another way of saying the null hypothesis: Ho: t ~ m1 − m2 = 0. Is tstat distinguishable from zero?

Thresholds derived from the t-distribution can be used to tell if a particular tstat value is statistically distinguishable from zero.

Example (90% confidence): there is only a 5% probability of finding t values lower than the left threshold purely by chance, and only a 5% probability of finding t values higher than the right threshold purely by chance; there is a 90% chance of finding t values between them by chance.

[Figure: area under the t distribution corresponding to 90% probability, with 5% in each tail.]

This is an example of a “two-tailed” t-test at 90% confidence.

Confidence intervals come from the t-distribution. Confidence intervals provide statistical limits for your mean values at a given degree of statistical confidence.

The confidence interval is calculated as:

CI = [Mean − Tcritical · SE, Mean + Tcritical · SE], where SE = STDEV / SQRT(Sample Size)

NOTE: Confidence intervals are descriptive and should not be used for determining statistical significance.
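A minimal MATLAB sketch of a 95% confidence interval for a sample mean (the sample x is made up):

x = 300 + 15*randn(25, 1);               % hypothetical sample
n = numel(x);
SE = std(x) / sqrt(n);                   % standard error
tc = tinv(0.975, n - 1);                 % critical t, two-tailed, 95% confidence
CI = [mean(x) - tc*SE, mean(x) + tc*SE]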

[Figures: two examples of sample means with confidence-interval error bars. In one, the confidence intervals suggest that the population means do not overlap; in the other, they suggest that the population means overlap.]

(More on CIs in Lecture 10: regression analysis.)

How to calculate tcrit in MATLAB? Use the “tinv” command.* (*Requires the Statistics Toolbox!)

Example, for the t-distribution with 50 degrees of freedom at 90% confidence (two-tailed, so 5% in each tail; the tail probability is often called α):

tcrit = tinv(0.95, 50) = 1.6759 (right tail)
tcrit = tinv(0.05, 50) = −1.6759 (left tail)

[Figure: area under the t distribution corresponding to 90% probability; there is a 90% chance of finding t values between ±1.6759 by chance. tcrit = ±1.6759 defines the 90% confidence level (at least, for this t-distribution).]

Aside: One- vs. two-tailed t-tests imply a different value of tcrit for the same confidence level.

Two-tailed t-test: H0: M1 = M2; Ha: M1 ≠ M2 (i.e. M1 > M2 or M2 > M1). At 95% confidence, the thresholds are tinv(0.025, df) and tinv(0.975, df).

One-tailed t-test: H0: M1 ≤ M2; Ha: M1 > M2. At 95% confidence, the threshold is tinv(0.95, df).

We will use the two-tailed test in HW6.

Summary: How to t-test.

• Compute tstat and df from your data samples.
• Decide upon a level of confidence (significance); 99% and 95% are typical, i.e. the probability of the null hypothesis (the p value) is 0.01 or 0.05.
• From this, decide if the test is one-tailed or two-tailed and use the “tinv” command to find tcrit.
• Compare tstat to tcrit.
• If |tstat| > |tcrit|, “the difference is significant”: there's likely an actual difference between the two means.
• If not, the difference is “not significant.”

Slide content from: Ruth Rosenholtz, MIT OpenCourseWare
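Putting the recipe together, a minimal MATLAB sketch of a two-tailed, unpaired t-test at 95% confidence (both samples are made up; the Statistics Toolbox function ttest2 packages the same test):

x1 = 15 + 2*randn(30, 1);               % control sample
x2 = 16 + 2*randn(25, 1);               % test sample
n1 = numel(x1); n2 = numel(x2);
SE = sqrt(var(x1)/n1 + var(x2)/n2);     % standard error of the difference
tstat = (mean(x2) - mean(x1)) / SE;
df = n1 + n2 - 2;
tcrit = tinv(0.975, df);                % two-tailed, alpha = 0.05
significant = abs(tstat) > abs(tcrit)   % true -> reject Ho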

Where is climate change significant at high confidence, as measured by a difference in mean temperatures?

You can use the t-test to blank out locations where the answer is not significant (see the sketch below).
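Jumping ahead, a minimal sketch of that masking step, assuming hypothetical 2D lat/lon arrays dT and tstat, matching degrees of freedom df, and the coordinate vectors lat and lon from the earlier sketch:

tcrit = tinv(0.995, df);                % 99% confidence, two-tailed
dT_masked = dT;
dT_masked(abs(tstat) < tcrit) = NaN;    % blank out non-significant locations
figure; contourf(lon, lat, dT_masked, 20, 'LineStyle', 'none'); colorbar;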

Setup to the problem... Null hypothesis Ho: m1 = m2

Population means are not statistically different.

Alternative hypothesis Ha: m1 ≠ m2

Population means are statistically different.

• Q: Can we accept Ha at 99% confidence? • A: only if < 1% probability Ho is true by chance. • To find probability of Ho happening by chance, we compare the calculated t-value (tstat) to its corresponding t-distribution.

In other words, if our data produce a t-value less than X or greater than Y that would be a situation where Ho is occurring by chance at less than 1% probability and so we could reject the null hypothesis.

Can reject the null hypothesis if |t| > |X|.

[Figure: t-distribution with critical values marked at t = X and t = Y.]

It remains to find the critical values t = X and t = Y. Actually, since Y = −X (the t-distribution is symmetric), we only need to find X.

More information on how “tinv” works: tinv(p, df) returns the t value below which the area under the t-distribution (integrated over all t values to its left) equals the specified probability p.

For a 99% confidence situation (two-tailed estimate), X is the value where the integrated probability under the t-distribution to the left is equal to 0.5% (0.005): 1% total probability of Ho, with 0.5% in each tail.

Can reject the null hypothesis if |t| > |X|.

[Figure: t-distribution with 0.5% tails at t = X and t = −X.]

X = tinv(0.005, df)

For a 70% confidence situation (two-tailed estimate), X is the value where the integrated probability under the t-distribution to the left is equal to 15% (0.15): 30% total probability of Ho, with 15% in each tail and 70% in the middle.

Can reject the null hypothesis if |t| > |X|.

[Figure: t-distribution with 15% tails at t = X and t = −X, and 70% in between.]

X = tinv(0.15, df)

A bit more practical guidance

• For a difference in means using the unpaired two-tailed t-test:

tstat = (m2 − m1) / sqrt(σ2²/n2 + σ1²/n1)

df = n2 + n1 − 2
