Author: Kerrie Walters
Chapter 3 – Descriptive Statistics

• Definition – simple measures that describe the distribution of a variable
• We typically try to describe
  – Central tendency
  – Spread or variation
• Two types of data
  – discrete
  – continuous

Discrete Data

• Finite or countable number of alternatives
• Examples:
  – Categorical variables
    • Excellent/good/fair/poor
    • Grades (A, B, C, D, F)
  – Integers (0, 1, 2, …)
    • Number of cigarettes smoked per day
    • Lost work days
    • Doctor visits
    • SAT scores

Continuous Data

• Outcomes are infinitely divisible, and hence there is an uncountable number of alternatives
• Examples
  – Weights and heights
  – Temperature
  – Grade point averages
• Many discrete variables can be treated as continuous
  – SAT scores
  – Hourly wages

How do we describe a data series?

• n observations
• xi is the value for the i'th observation
• (x1, x2, x3, …, xn) is the set of data

Problem?

• If n is small, it is easy to describe all outcomes: just print out the data
  – Wins among NFC East Teams
  – Grades from a small class
• For large data sets (big n), you need a way of describing the data in a "user friendly" fashion

Describing Data: Discrete Variables

• Frequencies in a table
• Graphically illustrate frequency in a histogram

Annual Dr. Visits, Adults, 21-64, 1994 National Health Interview Survey

Visits    Fraction of sample
0         0.216
1         0.217
2         0.148
3         0.090
4         0.079
5+        0.250

[Figure: Distribution of Annual Doctor Visits, 21-64, 1994 NHIS. x-axis: annual visits (0 to 30); y-axis: fraction of sample (0.00 to 0.25)]
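A frequency table like the one above can be built directly from raw responses. The following is a minimal sketch; the `visits` list is made-up illustration data, not the NHIS sample, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical raw data: annual doctor visits for ten respondents (not NHIS data).
visits = [0, 1, 1, 2, 0, 3, 5, 1, 0, 2]

counts = Counter(visits)
n = len(visits)

# Fraction of the sample at each observed value, in ascending order of the value.
fractions = {value: counts[value] / n for value in sorted(counts)}

print("Visits  Fraction of sample")
for value, fraction in fractions.items():
    print(f"{value:<7} {fraction:.3f}")
```

Plotting these fractions against the number of visits gives the histogram described on the slide.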

• Heights of males and females
  – Measured in inches
  – Self-reported
• From National Health Interview Survey
• Essentially a continuous variable, reported as a discrete outcome

[Figure: Adult Female Heights, 1995 NHIS. x-axis: inches (48 to 78); y-axis: fraction of sample (0.00 to 0.20)]

[Figure: Adult Male Heights, 1995 NHIS. x-axis: inches (53 to 85); y-axis: fraction of sample (0.00 to 0.14)]

Distribution of Grades, ECON 321

Grade    Fall 2002    Spring 2003    Spring 2004
A        13.4%        33.3%          19.1%
B        39.3%        18.9%          31.8%
C        39.3%        36.6%          39.1%
D        4.0%         8.1%           9.1%
F        4.0%         3.1%           1.8%

Class Characteristics

By sex          % of students
Male            75%
Female          25%

Class Characteristics

By residency    % of students
In-state        83%
Out-of-state    16%

Class Characteristics By age

                % of students
Freshman        3%
Sophomore       27%
Junior          50%
Senior          18%

Class Characteristics By age

                % of students
Below 2.0       6%
2.0 – 2.49      27%
2.50 – 2.99     25%
3.0 – 3.49      11%
3.5 and above   13%
No grades yet   14%

Continuous data

• Difficult to describe in a tabular form
• Can make data discrete by rounding observations, then graphing
• Describe data by measures of central tendency and variation

Example: Weekly Earnings

• Integer value
  – Already rounded to nearest dollar
  – $0 - $2000/week
• Many different values, so difficult to graph
• Do you want to consider someone who makes $396/week different from someone who makes $400/week?

Example: Weekly Earnings

• Sample
  – Full-time workers, males, 18-64
  – Earnings in June 2003
  – Rounded to nearest $1
  – 7300 workers
• Possible values $100 - $2885
• Over 1000 'unique' weekly wages

[Figure: Weekly Earnings, Full-Time Males, 18-64. x-axis: income ($100 to $2788); y-axis: percent of observations (0.0 to 4.0)]

• You can make the graphs 'smoother' by taking some noise out of the data and rounding observations to the nearest $10, $25 or $50 value

[Figure: Weekly Earnings, Rounded to nearest $50. x-axis: weekly earnings ($0 to $2900); y-axis: percent of sample (0 to 10)]
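The rounding step can be sketched as follows. The helper `round_to_nearest` and the `wages` list are hypothetical, chosen only for illustration, and ties are assumed to round up.

```python
import math
from collections import Counter

def round_to_nearest(value, step):
    """Round value to the nearest multiple of step (ties round up)."""
    return step * math.floor(value / step + 0.5)

# Hypothetical weekly wages in dollars (not the sample used in the slides).
wages = [396, 400, 412, 437, 455, 610, 640, 1210]

# Collapse the wages into $50 bins, then compute the percent of the sample in each bin.
rounded = [round_to_nearest(w, 50) for w in wages]
percent = {v: 100 * c / len(rounded) for v, c in sorted(Counter(rounded).items())}

for value, pct in percent.items():
    print(f"${value:<6} {pct:.1f}%")
```

After rounding, the workers earning $396 and $400 fall in the same $400 bin, so the histogram has far fewer distinct bars.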

Measures of Central Tendency

• Mode: most frequent outcome
• Median: 50th percentile value, the value where half of the outcomes lie above (and half below)
• Mean: arithmetic average
  \( \bar{X} = \frac{1}{n}(X_1 + X_2 + X_3 + \dots + X_n) = \frac{1}{n} \sum_i X_i \)

[Figure: Adult Female Heights, 1995 NHIS. x-axis: inches (48 to 78); y-axis: fraction of sample (0.00 to 0.20)]

[Figure: Adult Male Heights, 1995 NHIS. x-axis: inches (53 to 85); y-axis: fraction of sample (0.00 to 0.14)]

[Figure: Distribution of Male Weights (rounded to nearest 5 lbs). x-axis: pounds (90 to 345); y-axis: percent (0 to 10)]

Distributions of Adult Heights, 1995 NHIS

Group      Mean    Median    Mode
Males      70      70        72
Females    64      64        64

[Figure: Distribution of Female Weights (rounded to nearest 5 lbs). x-axis: pounds (90 to 345); y-axis: percent (0 to 10)]

Distributions of Adult Weights, 1995 NHIS

Group      Mean    Median    Mode
Males      182     180       180
Females    148     140       140

Comparing self-reported weight to actual weight (18-64)

Group      1996 Self-reported    1999 Measured by MD    % difference
Males      189                   184                    -2.6%
Females    149                   165                    10.7%
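The "% difference" column appears to be the measured value minus the self-reported value, divided by the self-reported value. That reading is an assumption, but it reproduces both entries in the table. A minimal sketch, with a hypothetical helper name:

```python
def percent_difference(reported, measured):
    """Percent by which the measured value differs from the self-reported one."""
    return 100 * (measured - reported) / reported

# Values taken from the table: (self-reported, measured by MD), in pounds.
males = percent_difference(189, 184)
females = percent_difference(149, 165)

print(f"Males: {males:.1f}%  Females: {females:.1f}%")
```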

Simple Example

• 15 observations
• (1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)
  – Mode = 5
  – Median = 4
  – Mean = 55/15 = 3.67

Measure of Central Tendency: Previous Examples

Sample             Mode    Median    Mean
Dr. visits         1       2         5.4
Weekly earnings    $400    $660      $814
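The three measures for the 15-observation sample can be checked with Python's standard `statistics` module. This is only an illustrative sketch; the slides do not prescribe any software.

```python
import statistics

# The 15 observations from the simple example.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

mode = statistics.mode(sample)      # most frequent outcome
median = statistics.median(sample)  # middle (8th) value of the sorted data
mean = statistics.mean(sample)      # 55 / 15

print(f"Mode = {mode}, Median = {median}, Mean = {mean:.2f}")
```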

Problems with the mode

• Poor measure if the distribution is flat
  – (1,1,2,2,3,3,4,4,5,5)
• Not a great measure if data is continuous
  – Can get the mode if you round the data
  – But the value can change due to rounding
• Mode is useful when the data is discrete and there are few alternatives

Problems with Mean and Median

• Mean: uses the whole distribution, so its value changes due to large "outliers"
• Median: "outliers" do not alter its value
• Example: income in US
  – Most people earn a "modest" income
  – Few people earn extremely high incomes, which impacts the mean
  – Mean > Median

Problems with Mean and Median (continued)

• Weekly earnings 1996, full-time adult males
  – N = 21,004
  – Mean = $737
  – Median = $577
• Suppose you include Michael Jordan, who at the time was earning $36 million/year or $692,308/week from the Bulls
  – Mean = $770
  – Median = $577

Problems with median

• May mask important changes in the distribution
• Suppose earnings of the lowest 25% are cut in half
  – Mean will fall but median will stay the same
  – Median = $577
  – Mean = $708
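The effect of a single outlier on the mean can be reproduced from the summary figures alone, since adding one observation x to a sample of size n with mean m gives a new mean of (n·m + x)/(n + 1). The sketch below uses the N, mean and salary quoted above; the six-worker sample at the end is made up for illustration.

```python
import statistics

n = 21_004                       # sample size from the slide
mean = 737                       # weekly mean in dollars
jordan_weekly = 36_000_000 / 52  # $36 million per year as a weekly amount

# Mean after adding one extreme observation to the sample.
new_mean = (n * mean + jordan_weekly) / (n + 1)
print(f"Weekly salary: ${jordan_weekly:,.0f}, new mean: ${new_mean:.0f}")

# Hypothetical small sample showing that the median need not move at all.
workers = [300, 450, 577, 577, 700, 900]
median_before = statistics.median(workers)
median_after = statistics.median(workers + [jordan_weekly])
print(median_before, median_after)
```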

Example – What do the Mean and Median Measure?

• Current Population Survey (CPS)
  – Monthly household-based survey
  – 65,000 households, 160,000 people
  – Basic labor market data such as the monthly unemployment rate
  – March survey
    • Annual demographic file
    • Earnings/income/insurance status previous year

March CPS (continued)

• 3 levels of data
  – Households (people not in group quarters)
  – Families (2 or more related people living together)
  – Individuals
• Detailed income questions
  – Earnings/business profits/unearned income/social security/transfer payments

March CPS (continued)

• Earnings are added up over all categories
• March 2001: data for all of 2000
• Household income (all sources)
  – Median $42,024
  – Mean $57,014
  – Mode $0

Quantiles of Distribution

• Different percentiles of distribution
• Useful for describing large amounts of data
• Median is the 50th percentile
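Percentiles can be computed with `statistics.quantiles` from the Python standard library. The sketch below uses made-up incomes, not CPS data, and relies on the function's default interpolation convention; other software may use a slightly different convention and give slightly different cut points.

```python
import statistics

# Hypothetical household incomes in thousands of dollars (not CPS data).
incomes = [12, 18, 21, 25, 30, 34, 42, 55, 73, 112, 350]

# n=100 returns 99 cut points; percentiles[k - 1] is the k-th percentile.
percentiles = statistics.quantiles(incomes, n=100)

p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]
print(f"25th: {p25}, 50th: {p50}, 75th: {p75}")

# The 50th percentile coincides with the median.
print(p50 == statistics.median(incomes))
```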

Quantiles of Household Income, 2000

• 5th %     $6,906
• 10th %    $10,572
• 25th %    $21,521
• 50th %    $42,024
• 75th %    $73,000
• 90th %    $112,040
• 95th %    $147,297
• 99th %    $354,422

Quantiles of Household Wealth, 1998

• 5th %     $0
• 20th %    $5,000
• 50th %    $86,100
• Mean      $395,500
• 90th %    $257,000
• 95th %    $475,000
• 99th %    $3.3 mil

Measures of Dispersion

• Two measures of central tendency may be the same with very different distributions
  – Example: average annual daily highs
    • Richmond: 69.0
    • Santa Barbara: 69.5
• How do we characterize the dispersion of outcomes?

[Figure: Daily Max Temperatures 1950-1998, Santa Barbara, CA. x-axis: temperature (12 to 103); y-axis: fraction of days (0 to 0.06)]

[Figure: Daily Max Temperatures 1950-1998, Richmond, VA. x-axis: temperature (12 to 103); y-axis: fraction of days (0 to 0.06)]

Measures of Dispersion

• Range: max – min
• Interquartile range (IQR): 75th – 25th percentile
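Both measures are easy to compute for the 15-observation sample from the earlier simple example. This is a sketch using the standard `statistics` module; its default quartile convention is an assumption here, and other conventions can give slightly different quartiles on small samples.

```python
import statistics

# The 15 observations from the earlier simple example.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Range: largest observation minus smallest.
data_range = max(sample) - min(sample)

# Quartiles: the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

print(f"Range = {data_range}, quartiles = {q1}, {q2}, {q3}, IQR = {iqr}")
```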

Measures of Dispersion

• Mean squared deviation = \( \frac{1}{n} \sum_i (X_i - \bar{X})^2 \)
• Variance = \( s^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2 \)
• Standard deviation = s = square root of variance

Example: Variance of (1,2,2,3,3,3,4,4,4,4)

Xi    (Xi - X̄)    (Xi - X̄)^2
1     -2           4
2     -1           1
2     -1           1
3     0            0
3     0            0
3     0            0
4     1            1
4     1            1
4     1            1
4     1            1

\( \sum_i X_i = 30 \), so \( \bar{X} = 3 \)
\( \sum_i (X_i - \bar{X})^2 = 10 \)
\( s^2 = \sum_i (X_i - \bar{X})^2 / 9 = 1.11 \)
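The worked example can be reproduced step by step, following the formulas above. The sketch also cross-checks the results against `statistics.variance` and `statistics.stdev`, which use the same n − 1 divisor.

```python
import math
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
n = len(data)

mean = sum(data) / n                            # 30 / 10
squared_devs = [(x - mean) ** 2 for x in data]  # (Xi - mean)^2 for each observation
sum_sq = sum(squared_devs)

mean_sq_dev = sum_sq / n      # divides by n
variance = sum_sq / (n - 1)   # sample variance, divides by n - 1
std_dev = math.sqrt(variance)

print(f"mean = {mean}, sum of squares = {sum_sq}")
print(f"mean squared deviation = {mean_sq_dev}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```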

Measures of Dispersion: Average Daily Highs

City             Mean    Range (min/max)    IQR (25th/75th)    S
Richmond         69.0    12/105             55/84              17.5
Santa Barbara    69.5    47/109             64/74              7.2

Some Sample Values

Variable                        Mean    S
Dr. visits, adults 21-64        5.4     163
Weekly earnings, adult males    $814    $565
Daily high in Richmond          69.0    17.2
Daily high in Santa Barb.       69.5    7.2

Some Sample Values

Variable         Mean    S
Male height      70      3.2
Female height    64      3.9
Male weight      182     33.5
Female weight    148     33.5

Linear Transformations of Data

• In some situations, we find it useful to change the scale of variables
  – Real vs. nominal dollars
  – Convert the ¥ to $ to £
  – English vs. metric
  – Fahrenheit to Celsius
• General linear transformation: \( Y_i = a + bX_i \)

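What Happens to Mean?

\[
\begin{aligned}
\bar{Y} &= \frac{1}{n} \sum_i Y_i \\
        &= \frac{1}{n} \sum_i (a + bX_i) \\
        &= \frac{1}{n} \sum_i a + \frac{1}{n} \sum_i bX_i \\
        &= \frac{1}{n}(na) + \frac{1}{n}\, b \sum_i X_i \\
        &= a + b\,\frac{1}{n} \sum_i X_i \\
        &= a + b\bar{X}
\end{aligned}
\]

What Happens to Variance?

\[
\begin{aligned}
s_y^2 &= \frac{1}{n-1} \sum_i (Y_i - \bar{Y})^2 \\
      &= \frac{1}{n-1} \sum_i (a + bX_i - a - b\bar{X})^2 \\
      &= \frac{1}{n-1} \sum_i (bX_i - b\bar{X})^2 \\
      &= \frac{1}{n-1} \sum_i b^2 (X_i - \bar{X})^2 \\
      &= b^2\, \frac{1}{n-1} \sum_i (X_i - \bar{X})^2 = b^2 s_x^2
\end{aligned}
\]

• \( s_y = |b|\, s_x \), which equals \( b\, s_x \) when b is positive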

Recap

• Given a sample of Xi's with
  – Mean \( \bar{X} \)
  – Standard deviation \( s_x \)
• Linear transformation
  – \( Y_i = a + bX_i \)
• The sample of Yi's has
  – Mean \( \bar{Y} = a + b\bar{X} \)
  – Standard deviation \( s_y = |b|\, s_x \) (equal to \( b\, s_x \) when b is positive)

Example – Converting Fahrenheit to Celsius

• Average daily high in Santa Barbara, Fahrenheit
  – \( \bar{f} = 69.5 \)
  – \( s_f^2 = 51.12 \)
  – \( s_f = 7.15 \)
• To convert F to C, use the equation
  – \( c_i = -17.78 + 0.556 f_i \)
  – When \( f_i = 32 \), \( c_i = 0 \)
  – When \( f_i = 212 \), \( c_i = 100 \)

Distribution of Temperatures, Santa Barbara

• \( \bar{c} = -17.78 + 0.556 \bar{f} = -17.78 + 0.556(69.5) = 20.86 \)
• \( s_c = 0.556\, s_f = 0.556(7.15) = 3.98 \)

Example: Pounds to kilograms

• 1 kilogram equals about 2.2 pounds (2.2046 to four decimals), so K = (1/2.2046)P
• Y = a + bX = 0.4536X (here a = 0)
• Sample values for males in US
  – \( \bar{X} = 182 \)
  – \( s_x = 33.5 \)
• What would males weigh in England (in kilograms)?
  – \( \bar{Y} = a + b\bar{X} = 0.4536(182) = 82.56 \)
  – \( s_y = b\, s_x = 0.4536(33.5) = 15.2 \)
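The two conversion examples follow the same rule, so they can share one small helper. The sketch below is illustrative only: `transform_summary` is a hypothetical name, `abs(b)` is used so the rule also holds for a negative slope, and the `temps_f` list is made-up data used to confirm that transforming the summary statistics matches transforming every observation.

```python
import math
import statistics

def transform_summary(mean_x, sd_x, a, b):
    """Mean and standard deviation of Y = a + b*X, given those of X."""
    return a + b * mean_x, abs(b) * sd_x

# Santa Barbara daily highs: Fahrenheit to Celsius.
mean_c, sd_c = transform_summary(69.5, 7.15, a=-17.78, b=0.556)

# US male weights: pounds to kilograms.
mean_kg, sd_kg = transform_summary(182, 33.5, a=0.0, b=0.4536)

print(f"Celsius: mean = {mean_c:.2f}, s = {sd_c:.2f}")
print(f"Kilograms: mean = {mean_kg:.2f}, s = {sd_kg:.1f}")

# Hypothetical raw temperatures: convert each one, then summarize.
temps_f = [58, 64, 69, 72, 75, 81]
temps_c = [-17.78 + 0.556 * f for f in temps_f]
direct = (statistics.mean(temps_c), statistics.stdev(temps_c))
via_rule = transform_summary(
    statistics.mean(temps_f), statistics.stdev(temps_f), a=-17.78, b=0.556
)
print(math.isclose(direct[0], via_rule[0]), math.isclose(direct[1], via_rule[1]))
```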

Example: How to Use Descriptive Statistics: The Changing Income Distribution

Stylized facts

• Income distribution has become more unequal over time
  – Independent of how income is measured (family, household or individual)
  – Growth at the top of the distribution has been greater than at the bottom
• Therefore, growth in the mean will be very different from the growth in the median

[Figure: Mean and Median HH Income. x-axis: year (1975 to 1999); y-axis: real 2000 $ ($30,000 to $60,000); series: Mean, Median]

[Figure: Real Household Incomes Over Time. x-axis: year (1975 to 1999); y-axis: income ($10,000 to $150,000); series: 20th, 40th, 60th, 80th, 95th percentiles]

% Change in Household Income 1975-2000

• Median       26%
• Mean         46%
• 20th %ile    27%
• 40th %ile    23%
• 60th %ile    29%
• 80th %ile    41%
• 95th %ile    56%

[Figure: % Difference: 95th and 20th Percentile HH Income. x-axis: year (1967 to 2000); y-axis: % difference (400% to 750%)]
