Lecture 3: Quantitative Variable

Quantitative Variable
• A quantitative variable takes values that have numerical meaning. For example, educ = 14 means that a person has 14 years of schooling.
• Do not confuse a quantitative variable with a numerical categorical variable. For the latter, the numbers are arbitrary: you can use a different set of numbers to categorize the observations.


Descriptive Statistics
For a quantitative variable, we want to report the following (each is computed in the R sketch after this list):
• the sample mean, which measures the central location
• the median, which is a more robust measure of central location
• the sample variance or standard deviation, which measures the dispersion (spread)
• the sample skewness, which is positive if there are extremely large values (a long right tail)
• the sample kurtosis, which exceeds three if the tails are fatter than those of the normal distribution (so extreme values occur with higher probability than under the normal distribution)
• the minimum and maximum, which can be used to detect outliers or typos
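A minimal R sketch of these summaries, using invented data. It assumes the moments add-on package (not mentioned in the lecture) for skewness and kurtosis; its kurtosis() equals three for the normal distribution, matching the convention above.

library(moments)  # assumed package providing skewness() and kurtosis()

set.seed(1)
x <- rnorm(100, mean = 2, sd = 1.5)  # placeholder sample

mean(x)         # sample mean: central location
median(x)       # median: robust central location
var(x)          # sample variance
sd(x)           # sample standard deviation
skewness(x)     # positive => long right tail
kurtosis(x)     # > 3 => fatter tails than the normal distribution
min(x); max(x)  # useful for spotting outliers or typos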


Histogram
A histogram is a sample estimate of the population distribution. The shape of the histogram is sensitive to the number of bins. For example, using 7 bins, Figure 1 below plots the density (on the vertical axis) of the annualized quarterly GDP growth rate.

Figure 1: Histogram of GDP Growth Rate (figure omitted; the vertical axis shows density, from 0.00 to 0.12)
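A sketch of how such a figure could be drawn in R; the gdp_growth series here is simulated, not the lecture's data. Note that hist() treats breaks = 7 as a suggestion, so the actual number of bins may differ slightly.

set.seed(1)
gdp_growth <- rnorm(200, mean = 3, sd = 3.5)  # placeholder growth series
hist(gdp_growth, breaks = 7, freq = FALSE,    # freq = FALSE puts density on the vertical axis
     main = "Histogram of GDP Growth Rate")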


Percentile
• For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.
• Percentiles can be used to answer questions like "how big is big". For instance, we may define a good applicant as one whose SAT score is above the 90th percentile. That means at least 90 percent of applicants have SAT scores below hers.
• The 1st quartile is the 25th percentile, the median is the 50th percentile, and the 3rd quartile is the 75th percentile.
• The critical value used for hypothesis testing is a special percentile. For example, the critical value 1.96 satisfies P(−1.96 < Z < 1.96) = 0.95, or equivalently, P(Z < 1.96) = 0.975, where Z denotes a standard normal random variable. (See the R lines after this list.)
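Sample percentiles and the normal critical value in R, with invented data:

set.seed(1)
x <- rnorm(500)  # placeholder sample
quantile(x, probs = c(0.25, 0.50, 0.75, 0.90))  # 1st quartile, median, 3rd quartile, 90th percentile
qnorm(0.975)  # 97.5th percentile of the standard normal, approximately 1.96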


Boxplot
The median and the 1st and 3rd quartiles can be read off a boxplot. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data (from Wikipedia). For example,

Figure 2: Boxplot of GDP Growth (figure omitted; the axis runs from −10 to 15)
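A sketch of how Figure 2 could be drawn, reusing the same kind of simulated growth series as in the Figure 1 sketch:

set.seed(1)
gdp_growth <- rnorm(200, mean = 3, sd = 3.5)  # placeholder growth series
boxplot(gdp_growth, horizontal = TRUE, main = "Boxplot of GDP Growth")
# The box edges mark the 1st and 3rd quartiles; the line inside the box is the median.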


Stationarity
It is worth noting that the mean value, variance, histogram, etc., make sense only for stationary data. We need to transform nonstationary data into stationary data before reporting those statistics. For example, the level of GDP trends upward over time and is nonstationary, while its growth rate is roughly stable and can be treated as stationary.
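An assumed illustration (the lecture does not specify the transformation): log-differencing turns a random-walk level, which is nonstationary, into a growth rate that is roughly stationary.

set.seed(1)
log_gdp    <- cumsum(rnorm(200, mean = 0.0075, sd = 0.009))  # random walk: nonstationary log level
gdp_growth <- 400 * diff(log_gdp)  # annualized quarterly growth rate: roughly stationary
mean(gdp_growth); sd(gdp_growth)   # these summaries are now meaningful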


One-sample T Test
• The central limit theorem states that for an i.i.d. sample with mean µ and variance σ², as the sample size grows, the sampling distribution of the sample average approaches a normal distribution:

\[
\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n} \;\xrightarrow{d}\; N\!\left(\mu, \frac{\sigma^2}{n}\right) \tag{1}
\]

• Notice that as n → ∞, the variance of x̄ approaches zero, meaning that the sample mean converges to the constant µ. This result is called the law of large numbers.
• The t statistic is just the standardized sample mean:

\[
t \equiv \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \tag{2}
\]

where σ/√n is the standard error. We reject the null hypothesis H0 : µ = µ0 if the t statistic exceeds 1.96 in absolute value.
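A minimal sketch of the test in R with invented data; in practice t.test() replaces σ with the sample standard deviation, so it uses the t distribution rather than the exact N(0, 1) in (2).

set.seed(1)
x <- rnorm(50, mean = 0.3)  # placeholder sample
t.test(x, mu = 0)  # tests H0: µ = 0; reports the t statistic, p value, and 95% confidence interval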

Hypothesis Testing
There are three equivalent ways to draw a conclusion from a t test:
1. The t statistic is greater than the critical value in absolute value (for a two-sided alternative hypothesis).
2. The p value is less than, say, 0.05.
3. The hypothesized value under the null hypothesis lies outside the confidence interval.
Basically, the null hypothesis is rejected when we end up in the rejection region, i.e., the tail part of the sampling distribution of the test statistic. Values in the tail are unlikely if the null hypothesis is true, so observing them is evidence against the null hypothesis. We use either the critical value (a percentile) or the p value to define the tail.
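The three decision rules can be read off a single t.test() result; this continues the invented example above (the object name res is ours, not the lecture's).

set.seed(1)
x <- rnorm(50, mean = 0.3)
res <- t.test(x, mu = 0)
abs(res$statistic) > qt(0.975, df = res$parameter)  # 1. |t| exceeds the critical value
res$p.value < 0.05                                  # 2. p value below 0.05
res$conf.int[1] > 0 || res$conf.int[2] < 0          # 3. hypothesized value 0 outside the CI
# All three comparisons return the same TRUE/FALSE decision.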


IID Sample
Many statistics assume the sample is IID (independent and identically distributed). When this assumption fails, the statistics can yield misleading results. Violations include:
1. The observations are dependent. This can happen with time series data.
2. The observations are not identically distributed. For example, observations may have varying variances (heteroscedasticity).
3. The data may come from different populations. For instance, the population mean of wages for male workers may differ from that for female workers. A two-sample t test can reveal that.
The bottom line is: always remember to verify the IID sample assumption.


Unpaired Two-sample T Test of Equal Means
• We need a binary (indicator) variable that defines the two (sub)samples.
• The null hypothesis is that the mean value of variable y is the same across the two samples.
• For example, y is wage and the binary variable is gender. Under the null hypothesis, males and females earn the same average wage.
• The R function for an unpaired two-sample t test of equal means is

t.test(y ~ binary)

• If the variance is the same across the two samples (homoscedasticity), we can run a dummy variable regression

summary(lm(y ~ binary))

The t value of the binary regressor is the two-sample t test; a simulated check follows below.
• The test becomes a two-sample t test of equal proportions when y is also binary.
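A self-contained simulated check (our own example, not the lecture's data) that the dummy-variable regression reproduces the equal-variance two-sample t test; note that t.test()'s default is the Welch test, so var.equal = TRUE is needed for the match.

set.seed(1)
binary <- rep(c(0, 1), each = 50)               # e.g., a gender indicator
y      <- 10 + 2 * binary + rnorm(100, sd = 3)  # e.g., wage
t.test(y ~ binary, var.equal = TRUE)            # pooled two-sample t test
summary(lm(y ~ binary))                         # the t value on "binary" matches the test above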

Paired Two-sample T Test of Equal Means
• For example, we may measure the body temperature of each patient twice, before and after taking the medicine.
• The null hypothesis is that the medicine is ineffective, i.e., on average, the body temperature does not change.
• We just apply the one-sample t test to the difference in body temperatures, setting µ0 = 0, as sketched below.
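A minimal sketch with invented temperatures; the two t.test() calls are equivalent.

set.seed(1)
before <- rnorm(20, mean = 38.5, sd = 0.4)          # placeholder pre-treatment temperatures
after  <- before - rnorm(20, mean = 0.6, sd = 0.3)  # placeholder post-treatment temperatures
t.test(after - before, mu = 0)        # one-sample t test on the differences
t.test(after, before, paired = TRUE)  # built-in paired version, same result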
