Chapter 3 – Descriptive Statistics
• Definition – simple measures that describe the distribution of a variable
• We typically try to describe
  – Central tendency
  – Spread or variation
• Two types of data
  – Discrete
  – Continuous
Discrete Data
• Finite or countable number of alternatives
• Examples:
  – Categorical variables
    • Excellent/good/fair/poor
    • Grades (A, B, C, D, F)
  – Integers (0, 1, 2, …)
    • Number of cigarettes smoked per day
    • Lost work days
    • Doctor visits
    • SAT scores
Continuous Data
• Outcomes are infinitely divisible, hence an uncountable number of alternatives
• Examples
  – Weights and heights
  – Temperature
  – Grade point averages
• Many discrete variables can be treated as continuous
  – SAT scores
  – Hourly wages
How do we describe a data series?
• n observations
• $x_i$ is the value for the i-th observation
• $(x_1, x_2, x_3, \dots, x_n)$ is the set of data
Problem?
• If n is small, it is easy to describe all outcomes: just print out the data
  – Wins among NFC East teams
  – Grades from a small class
• For large data sets (big n), you need a way of describing the data in a “user friendly” fashion
Describing Data: Discrete Variables
• Frequencies in a table
• Graphically illustrate frequencies in a histogram
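As a minimal sketch of the "frequencies in a table" idea, the snippet below tabulates a small discrete sample and prints a crude text histogram. The `visits` list is hypothetical illustration data, not NHIS data, and `frequency_table` is a helper name introduced here.

```python
from collections import Counter


def frequency_table(values):
    """Return (outcome, count, fraction) rows, sorted by outcome."""
    counts = Counter(values)
    n = len(values)
    return [(outcome, count, count / n) for outcome, count in sorted(counts.items())]


# Hypothetical annual doctor visits for 10 people (illustration only).
visits = [0, 1, 1, 2, 0, 3, 1, 5, 2, 0]

for outcome, count, fraction in frequency_table(visits):
    # One '#' per observation gives a rough histogram bar.
    print(f"{outcome:>3} {count:>3} {fraction:5.2f} {'#' * count}")
```

The fractions sum to one by construction, which is a quick check on any published frequency table.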
Annual Dr. Visits, Adults, 21-64, 1994 National Health Interview Survey
Visits   Fraction of sample
0        0.216
1        0.217
2        0.148
3        0.090
4        0.079
5+       0.250
[Figure: Distribution of Annual Doctor Visits, 21-64, 1994 NHIS. x-axis: annual visits (0-30); y-axis: fraction of sample (0.00-0.25)]
• Heights of males and females
  – Measured in inches
  – Self-reported
• From the National Health Interview Survey
• Essentially a continuous variable, reported as a discrete outcome
[Figure: Adult Female Heights, 1995 NHIS. x-axis: inches (48-78); y-axis: fraction of sample (0.00-0.20)]
[Figure: Adult Male Heights, 1995 NHIS. x-axis: inches (53-85); y-axis: fraction of sample (0.00-0.14)]
Distribution of Grades, ECON 321
Grade   Fall 2002   Spring 2003   Spring 2004
A       13.4%       33.3%         19.1%
B       39.3%       18.9%         31.8%
C       39.3%       36.6%         39.1%
D       4.0%        8.1%          9.1%
F       4.0%        3.1%          1.8%
Class Characteristics
By sex          % of students
Male            75%
Female          25%
By residency    % of students
In-state        83%
Out-of-state    16%
Class Characteristics (continued)
By class year   % of students
Freshman        3%
Sophomore       27%
Junior          50%
Senior          18%
By GPA          % of students
Below 2.0       6%
2.0 – 2.49      27%
2.50 – 2.99     25%
3.0 – 3.49      11%
3.5 and above   13%
No grades yet   14%
Continuous data
• Difficult to describe in tabular form
• Can make data discrete by rounding observations, then graphing
• Describe data by measures of central tendency and variation
Example: Weekly Earnings
• Integer values
  – Already rounded to nearest dollar
  – $0 - $2000/week
• Many different values, so difficult to graph
• Do you want to consider someone who makes $396/week different from someone who makes $400/week?
Example: Weekly Earnings
• Sample
  – Full-time workers, males, 18-64
  – Earnings in June 2003
  – Rounded to nearest $1
  – 7300 workers
• Possible values $100 - $2885
• Over 1000 ‘unique’ weekly wages
[Figure: Weekly Earnings, Full-Time Males, 18-64. x-axis: income ($100-$2788); y-axis: percent of observations (0.0-4.0)]
• You can make the graphs ‘smoother’ by taking some noise out of the data and rounding observations to the nearest $10, $25 or $50 value
[Figure: Weekly Earnings, Rounded to nearest $50. x-axis: weekly earnings ($0-$2900); y-axis: percent of sample (0-10)]
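The rounding step can be sketched as follows. This is only an illustration: the `wages` list is hypothetical (not the CPS sample), and `round_to_nearest` is a helper name introduced here. It rounds halves upward rather than using Python's built-in `round`, which sends exact halves to the nearest even integer.

```python
import math
from collections import Counter


def round_to_nearest(value, step):
    """Round a non-negative value to the nearest multiple of step (halves round up)."""
    return step * math.floor(value / step + 0.5)


# Hypothetical weekly wages in dollars (illustration only).
wages = [396, 400, 412, 437, 455, 520, 574, 580]

# Count observations per $50 bin, then convert counts to percent of sample.
binned = Counter(round_to_nearest(w, 50) for w in wages)
percent = {b: 100 * c / len(wages) for b, c in sorted(binned.items())}
print(percent)
```

With a $50 step, the $396 and $400 earners land in the same bin, which is the point of the question raised earlier about treating them as different.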
Measures of Central Tendency
• Mode: most frequent outcome
• Median: 50th percentile value; the value where half of the outcomes lie above and half below
• Mean: arithmetic average
  – $\bar{X} = (1/n)(X_1 + X_2 + X_3 + \dots + X_n)$
  – $\bar{X} = (1/n)\sum_i X_i$
[Figure: Adult Female Heights, 1995 NHIS. x-axis: inches (48-78); y-axis: fraction of sample (0.00-0.20)]
[Figure: Adult Male Heights, 1995 NHIS. x-axis: inches (53-85); y-axis: fraction of sample (0.00-0.14)]
[Figure: Distribution of Male Weights (rounded to nearest 5 lbs). x-axis: pounds (90-345); y-axis: percent (0-10)]
Distributions of Adult Heights, 1995 NHIS (inches)
Group     Mean   Median   Mode
Males     70     70       72
Females   64     64       64
[Figure: Distribution of Female Weights (rounded to nearest 5 lbs). x-axis: pounds (90-345); y-axis: percent (0-10)]
Distributions of Adult Weights, 1995 NHIS (pounds)
Group     Mean   Median   Mode
Males     182    180      180
Females   148    140      140
Comparing self-reported weight to actual weight (18-64)
Group     1996 self-reported   1999 measured by MD   % difference
Males     189                  184                   -2.6%
Females   149                  165                   10.7%
Simple Example
• 15 observations
• (1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)
  – Mode = 5
  – Median = 4
  – Mean = 55/15 = 3.67
Measures of Central Tendency: Previous Examples
Sample            Mode   Median   Mean
Dr. visits        1      2        5.4
Weekly earnings   $400   $660     $814
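The 15-observation example can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are chosen here for illustration.

```python
import statistics

# The 15 observations from the simple example.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

mode = statistics.mode(sample)      # most frequent outcome
median = statistics.median(sample)  # middle value of the sorted data
mean = statistics.mean(sample)      # sum of the values divided by n

print(mode, median, round(mean, 2))  # 5 4 3.67
```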
Problems with the mode
• Poor measure if the distribution is flat
  – (1,1,2,2,3,3,4,4,5,5)
• Not a great measure if the data are continuous
  – Can get the mode if you round the data
  – But the value can change due to rounding
• Mode is useful when the data are discrete and there are few alternatives
Problems with Mean and Median
• Mean: uses the entire distribution, so its value changes when there are large “outliers”
• Median: “outliers” do not alter its value
• Example: income in the US
  – Most people earn a “modest” income
  – A few people earn extremely high incomes, which pulls up the mean
  – Mean > Median
Problems with Mean and Median (continued)
• Weekly earnings, 1996, full-time adult males
  – N = 21,004
  – Mean = $737
  – Median = $577
• Suppose you include Michael Jordan, who at the time was paid $36 million/year, or $692,308/week, by the Bulls
  – Mean = $770
  – Median = $577
Problems with median
• May mask important changes in the distribution
• Suppose earnings of the lowest 25% are cut in half
  – Mean will fall but the median will stay the same
  – Median = $577
  – Mean = $708
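The Michael Jordan arithmetic can be reproduced from the summary figures alone, since the new mean is the old total plus the new observation, divided by n + 1. The sketch below uses only the numbers quoted above; `mean_after_adding` is a helper name introduced here.

```python
def mean_after_adding(mean, n, new_value):
    """Mean of a sample after one more observation is appended."""
    return (mean * n + new_value) / (n + 1)


n = 21_004                         # workers in the 1996 sample
mean_before = 737.0                # mean weekly earnings, dollars
jordan_weekly = 36_000_000 / 52    # $36 million/year as a weekly figure

mean_after = mean_after_adding(mean_before, n, jordan_weekly)
print(round(jordan_weekly), round(mean_after))  # 692308 770
```

One observation out of 21,005 moves the mean by about $33. The median barely responds because it depends only on the rank order of the middle observations.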
Example – What do the Mean and Median Measure?
• Current Population Survey (CPS)
  – Monthly household-based survey
  – 65,000 households, 160,000 people
  – Basic labor market data such as the monthly unemployment rate
  – March survey
    • Annual demographic file
    • Earnings/income/insurance status for the previous year
March CPS (continued)
• 3 levels of data
  – Households (people not in group quarters)
  – Families (2 or more related people living together)
  – Individuals
• Detailed income questions
  – Earnings/business profits/unearned income/social security/transfer payments
March CPS (continued)
• Earnings are added up over all categories
• March 2001: data for all of 2000
• Household income (all sources)
  – Median $42,024
  – Mean $57,014
  – Mode $0
Quantiles of Distribution
• Different percentiles of the distribution
• Useful for describing large amounts of data
• Median is the 50th percentile
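One simple way to compute a percentile is the nearest-rank rule: sort the data and take the observation at rank ceil(p/100 × n). The sketch below assumes that rule. It is only one of several conventions in use, and statistical packages often interpolate between observations instead. `percentile_nearest_rank` is a name introduced here.

```python
import math


def percentile_nearest_rank(values, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank rule."""
    if not values or not 0 < p <= 100:
        raise ValueError("need non-empty data and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank in the sorted data
    return ordered[rank - 1]


# The 15-observation sample used earlier in the chapter.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

for p in (25, 50, 75):
    print(p, percentile_nearest_rank(sample, p))
```

For this sample the 50th percentile is 4, which matches the median reported for the same data.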
Quantiles of Household Income, 2000
Percentile   Income
5th          $6,906
10th         $10,572
25th         $21,521
50th         $42,024
75th         $73,000
90th         $112,040
95th         $147,297
99th         $354,422
Quantiles of Household Wealth, 1998
Percentile   Wealth
5th          0
20th         $5,000
50th         $86,100
Mean         $395,500
90th         $257,000
95th         $475,000
99th         $3.3 mil
Measures of Dispersion
• Two distributions may have nearly the same central tendency yet be very different
  – Example: average annual daily highs
    • Richmond: 69.0
    • Santa Barbara: 69.5
• How do we characterize the dispersion of outcomes?
[Figure: Daily Max Temperatures 1950-1998, Santa Barbara, CA. x-axis: temperature (12-103); y-axis: fraction of days (0-0.06)]
[Figure: Daily Max Temperatures 1950-1998, Richmond, VA. x-axis: temperature (12-103); y-axis: fraction of days (0-0.06)]
Measures of Dispersion
• Range: max – min
• Interquartile range (IQR): 75th – 25th percentile
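A minimal sketch of the range and IQR follows, using two hypothetical sets of daily highs (not the Richmond or Santa Barbara data) that share the same mean. Quartile conventions differ across textbooks and software. This sketch assumes the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates between observations, and `spread` is a helper name introduced here.

```python
import statistics


def spread(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1


# Hypothetical daily highs for two cities; both have a mean of 70.
city_a = [55, 60, 65, 70, 75, 80, 85]
city_b = [64, 66, 68, 70, 72, 74, 76]

print(statistics.mean(city_a), spread(city_a))
print(statistics.mean(city_b), spread(city_b))
```

Both samples have the same mean, yet the first has a range and IQR two and a half times as large as the second.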
Measures of Dispersion
• Mean squared deviation = $(1/n)\sum_i (X_i - \bar{X})^2$
• Variance = $s^2 = [1/(n-1)]\sum_i (X_i - \bar{X})^2$
• Standard deviation = $s$ = square root of the variance
Example: Variance of (1,2,2,3,3,3,4,4,4,4)
$X_i$   $(X_i - \bar{X})$   $(X_i - \bar{X})^2$
1       -2                  4
2       -1                  1
2       -1                  1
3       0                   0
3       0                   0
3       0                   0
4       1                   1
4       1                   1
4       1                   1
4       1                   1
• $\sum_i X_i = 30$, so $\bar{X} = 3$
• $\sum_i (X_i - \bar{X})^2 = 10$
• $s^2 = \sum_i (X_i - \bar{X})^2 / 9 = 1.11$
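The variance example can be reproduced step by step and cross-checked against the standard library, whose `statistics.variance` uses the same n − 1 divisor. A small sketch, with variable names chosen here:

```python
import math
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
n = len(data)

mean = sum(data) / n                            # 30 / 10 = 3
squared_devs = [(x - mean) ** 2 for x in data]  # these sum to 10

mean_squared_dev = sum(squared_devs) / n        # divides by n
variance = sum(squared_devs) / (n - 1)          # divides by n - 1
std_dev = math.sqrt(variance)

# The library's sample variance should agree with the hand calculation.
assert math.isclose(variance, statistics.variance(data))
print(mean, mean_squared_dev, round(variance, 2), round(std_dev, 2))
```

The two dispersion measures differ only in the divisor, so the gap between them shrinks as n grows.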
Measures of Dispersion: Average Daily Highs
City            Mean   Range (min/max)   IQR (25th/75th)   S
Richmond        69.0   12/105            55/84             17.5
Santa Barbara   69.5   47/109            64/74             7.2
Some Sample Values
Variable                       $\bar{X}$   S
Dr. visits, adults 21-64       5.4         163
Weekly earnings, adult males   $814        $565
Daily high in Richmond         69.0        17.2
Daily high in Santa Barb.      69.5        7.2
Some Sample Values
Variable        $\bar{X}$   S
Male height     70          3.2
Female height   64          3.9
Male weight     182         33.5
Female weight   148         33.5
Linear Transformations of Data
• In some situations, we find it useful to change the scale of variables
  – Real vs. nominal dollars
  – Convert ¥ to $ to £
  – English vs. metric
  – Fahrenheit to Celsius
• General linear transformation: $Y_i = a + bX_i$
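What Happens to Mean?
• $\bar{Y} = (1/n)\sum_i Y_i$
• $\bar{Y} = (1/n)\sum_i (a + bX_i)$
• $\bar{Y} = (1/n)\sum_i a + (1/n)\sum_i bX_i$
• $\bar{Y} = (1/n)(na) + (1/n)b\sum_i X_i$
• $\bar{Y} = a + b(1/n)\sum_i X_i$
• $\bar{Y} = a + b\bar{X}$
What Happens to Variance
• $s_y^2 = [1/(n-1)]\sum_i (Y_i - \bar{Y})^2$
• $= [1/(n-1)]\sum_i (a + bX_i - a - b\bar{X})^2$
• $= [1/(n-1)]\sum_i (bX_i - b\bar{X})^2$
• $= [1/(n-1)]\sum_i b^2 (X_i - \bar{X})^2$
• $= b^2 [1/(n-1)]\sum_i (X_i - \bar{X})^2 = b^2 s_x^2$
• $s_y = |b| s_x$ (equal to $b s_x$ when $b > 0$)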
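The two results can be checked numerically on any sample. The sketch below uses a hypothetical list of Fahrenheit readings and two arbitrary choices of (a, b), one with negative b to show that the standard deviation scales by the absolute value of b. `transform` is a helper name introduced here.

```python
import math
import statistics


def transform(values, a, b):
    """Apply the linear transformation y = a + b * x to every observation."""
    return [a + b * x for x in values]


# Hypothetical Fahrenheit readings (illustration only).
x = [50.0, 59.0, 68.0, 77.0, 86.0]

for a, b in [(-17.78, 0.556), (10.0, -2.0)]:
    y = transform(x, a, b)
    # The mean shifts by a and scales by b.
    assert math.isclose(statistics.mean(y), a + b * statistics.mean(x))
    # The standard deviation ignores a and scales by |b|.
    assert math.isclose(statistics.stdev(y), abs(b) * statistics.stdev(x))

print("mean and standard deviation rules hold for both transformations")
```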
Recap
• Given a sample of $X_i$'s with
  – Mean $\bar{X}$
  – Standard deviation $s_x$
• Linear transformation
  – $Y_i = a + bX_i$
• The sample of $Y_i$'s has
  – Mean $\bar{Y} = a + b\bar{X}$
  – Standard deviation $s_y = |b| s_x$
Example – Converting Fahrenheit to Celsius
• Average daily high in Santa Barbara, Fahrenheit
  – $\bar{f} = 69.5$
  – $s_f^2 = 51.12$
  – $s_f = 7.15$
• To convert F to C, use the equation
  – $c_i = -17.78 + 0.556 f_i$
  – When $f_i = 32$, $c_i = 0$
  – When $f_i = 212$, $c_i = 100$
Distribution of Temperatures, Santa Barbara
• $\bar{c} = -17.78 + 0.556\bar{f} = -17.78 + 0.556(69.5) = 20.86$
• $s_c = 0.556 s_f = 0.556(7.15) = 3.98$
Example: Pounds to kilograms
• 1 kg equals about 2.2 pounds, so K ≈ (1/2.2)P; more precisely, 1 pound = 0.4536 kg
• $Y = a + bX = 0.4536X$ (here $a = 0$)
• Sample values for males in the US
  – $\bar{X} = 182$
  – $s_x = 33.5$
• What would males weigh in England (in kilograms)?
  – $\bar{Y} = a + b\bar{X} = 0.4536\bar{X} = 82.55$
  – $s_y = b s_x = 0.4536 s_x = 15.2$
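Because the rules need only the mean and standard deviation, both conversions can be reproduced from the summary values without the underlying data. A minimal sketch, with `transform_summary` as a helper name introduced here:

```python
def transform_summary(mean_x, sd_x, a, b):
    """Mean and standard deviation of y = a + b * x, from the summary of x."""
    return a + b * mean_x, abs(b) * sd_x


# Santa Barbara daily highs: Fahrenheit to Celsius.
mean_c, sd_c = transform_summary(69.5, 7.15, a=-17.78, b=0.556)

# US male weights: pounds to kilograms.
mean_kg, sd_kg = transform_summary(182, 33.5, a=0.0, b=0.4536)

print(round(mean_c, 2), round(sd_c, 2))
print(round(mean_kg, 2), round(sd_kg, 1))
```

With b = 0.4536 the mean weight comes out near 82.56 kg rather than the 82.55 quoted above. The gap presumably reflects rounding of the conversion factor.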
Example: How to Use Descriptive Statistics
The Changing Income Distribution
Stylized facts
• Income distribution has become more unequal over time
  – Independent of how income is measured (family, household or individual)
  – Growth at the top of the distribution has been greater than at the bottom
• Therefore, growth in the mean will be very different from growth in the median
[Figure: Mean and Median HH Income. x-axis: year (1975-1999); y-axis: real 2000 $ ($30,000-$60,000); series: mean, median]
[Figure: Real Household Incomes Over Time. x-axis: year (1975-1999); y-axis: income ($10,000-$150,000); series: 20th, 40th, 60th, 80th and 95th percentiles]
% Change in Household Income 1975-2000
Measure     % change
Median      26%
Mean        46%
20th %ile   27%
40th %ile   23%
60th %ile   29%
80th %ile   41%
95th %ile   56%
[Figure: % Difference: 95th and 20th Percentile HH Income. x-axis: year (1967-2000); y-axis: % difference (400%-750%)]