Chapter 4 Displaying and Summarizing Quantitative Data

Author: Lora Mills
Chapter 4

ST 101

Reiland

Displaying and Summarizing Quantitative Data

Chapter Objectives: At the end of this chapter you should be able to:
1) Create appropriate displays to graphically depict quantitative data (frequency tables, histograms, stem-and-leaf displays, dotplots, timeplots; the use of software will be emphasized).
2) Describe the important features of the distribution of a quantitative variable: shape, center, spread, and any unusual features such as outliers, gaps, or clusters.

Throughout the course we will emphasize the paradigm "Think, Show, Tell". The above objectives fit into this paradigm as follows: "Think" about what graphical display is appropriate for the data at hand; create the display to "show" the data (objective 1). "Tell" what characteristics of the data are conveyed by the graphical display (objective 2).

Reading Assignment: Text: Chapter 4.

Histograms

A histogram shows three general types of information:
• It provides a visual indication of where the approximate center of the data is.
• We can gain an understanding of the degree of spread, or variation, in the data.
• We can observe the shape of the distribution.

Construction of a histogram (automate!):
i) identify the smallest and largest measurements in the data set
ii) divide the interval between the smallest and largest measurements into between 5 and 20 subintervals (called bins in Excel)
iii) count the number of data values that are in each bin (the bins and the count in each bin give the distribution of the quantitative variable)
iv) plot the bin counts as bars over the bins; the height of the bar over a bin indicates the count for that bin

EXAMPLE: (Number of daily employee absences from a large corporation; 106 days)
106 obs.; approx. # of classes =

146 144 140 140 138 140 148 140 129 153 143
141 140 140 143 136 148 142 139 143 148 143
139 138 141 143 138 140 133 158 148 144 148
140 139 143 149 144 140 140 135 138 138 141
145 147 134 136 136 139 141 132 149 150 145
141 139 146 141 145 139 145 148 146 148 141
142 141 134 143 143 144 148 142 141 138
131 137 142 143 137 138 139 145 142 145
142 141 133 141 142 146 136 145 144 145
140 132 149 140 146 153 141 121 137 142
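The construction steps i) to iv) can be sketched in a few lines of Python. This is only an illustration, not the software used in the course: the function name bin_counts, the bin width of 7, and the starting edge of 118.5 are assumptions chosen so that no data value falls exactly on a bin edge.

```python
# Employee absence data (106 days) from the example above.
absences = [
    146, 144, 140, 140, 138, 140, 148, 140, 129, 153, 143,
    141, 140, 140, 143, 136, 148, 142, 139, 143, 148, 143,
    139, 138, 141, 143, 138, 140, 133, 158, 148, 144, 148,
    140, 139, 143, 149, 144, 140, 140, 135, 138, 138, 141,
    145, 147, 134, 136, 136, 139, 141, 132, 149, 150, 145,
    141, 139, 146, 141, 145, 139, 145, 148, 146, 148, 141,
    142, 141, 134, 143, 143, 144, 148, 142, 141, 138,
    131, 137, 142, 143, 137, 138, 139, 145, 142, 145,
    142, 141, 133, 141, 142, 146, 136, 145, 144, 145,
    140, 132, 149, 140, 146, 153, 141, 121, 137, 142,
]


def bin_counts(values, start, width):
    """Return (lower_edge, upper_edge, count) for consecutive equal-width bins.

    Assumes start is no larger than the smallest value (hypothetical helper).
    """
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1  # step iii: count values per bin
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Assumed layout: width 7, first edge 118.5 (steps i and ii).
bins = bin_counts(absences, start=118.5, width=7)

# Step iv: a crude text histogram, one star per observation.
for lower, upper, count in bins:
    print(f"{lower:6.1f} to {upper:6.1f} | {'*' * count} ({count})")
```

Changing start or width changes the picture, which is one reason to try several bin choices before describing the shape.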

[Figure: "Histogram of Employee Absences"; x-axis: Absences from Work (bin edges 125.5, 132.5, 139.5, 146.5, 153.5, 160.5); y-axis: Frequency]

[Figure: StatCrunch histogram]

Heights of students in ST101

[Figure: Excel histogram, "Student Heights ST 101"; x-axis: Height (inches), bins 59 to 75 and "More"; y-axis: Frequency]

[Figure: DataDesk histogram]

Stem-and-Leaf Displays

Partition each number in the data set into a "stem" and a "leaf".

Constructing a stem-and-leaf display:
i) determine the stem and leaf you want to use (5 to 20 stems);
ii) write the stems in a column with the smallest stem at the top; include all stems in the range of the data, even those without leaves;
iii) include only 1 digit in the leaves; drop digits after the first digit or round off;
iv) record the leaf for each measurement in the row corresponding to its stem; ordering the leaves in a row is optional, but it makes the display more informative.

EXAMPLE: Below is a list of the number of home runs that Roger Maris hit during his 10 years in the American League. Make a stemplot of the data.
8 13 14 16 23 26 28 33 39 61

EXAMPLE: Number of touchdown passes thrown by each of the 31 teams in the NFL during the 2000 season.
37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6

STEMS ARE 10'S DIGIT
stem | leaf
   3 | 7
   3 | 233
   2 | 889
   2 | 001112223
   1 | 56888899
   1 | 22444
   0 | 69
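The construction rules for a stem-and-leaf display can also be automated. The sketch below is a minimal illustration for whole-number data with the tens digit as stem and the ones digit as leaf; the function name stem_and_leaf is an assumption, and unlike the NFL display above it does not split each stem into two rows.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to a string of sorted leaves (ones digits).

    Stems with no leaves are kept, as rule ii) requires.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    stems = range(min(leaves), max(leaves) + 1)
    return {s: "".join(str(d) for d in leaves[s]) for s in stems}


# Roger Maris home runs, from the first example.
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

# NFL touchdown passes, 2000 season, from the second example.
nfl = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
       19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

for stem, leaf_string in stem_and_leaf(maris).items():
    print(f"{stem} | {leaf_string}")
```

For the Maris data the empty stems 4 and 5 make the gap before the 61 home run season easy to see.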

EXAMPLE: Nielsen ratings for week of Aug. 8 - Aug. 14, 2005.

Rank  Program                        Network  Time     Day  Rating  Share  Households
1     CSI                            CBS      9:00PM   Thu  9.3     16     10,225,000
2     WITHOUT A TRACE                CBS      10:01PM  Thu  8       14     8,742,000
3     CSI: MIAMI                     CBS      10:00PM  Mon  7.9     13     8,668,000
4     60 MINUTES                     CBS      7:00PM   Sun  7.6     14     8,368,000
5     TWO AND A HALF MEN 930P        CBS      9:30PM   Mon  7       11     7,659,000
6     TWO AND A HALF MEN             CBS      9:00PM   Mon  6.9     11     7,540,000
7     EXTREME MAKEOVER:HM ED-8P      ABC      8:00PM   Sun  6.8     12     7,423,000
8     NCIS                           CBS      8:00PM   Tue  6.5     12     7,175,000
9     AFC-NFC HALL OF FAME GAME(S)   ABC      8:08PM   Mon  6.2     11     6,846,000
10    LAW AND ORDER:CRIM INTENT      NBC      9:00PM   Sun  6       10     6,625,000
11    AFC-NFC HALL-FME SHOWCASE(S)   ABC      8:00PM   Mon  5.9     11     6,478,000
12    EVERYBODY LOVES RAYMOND        CBS      8:30PM   Mon  5.8     10     6,390,000
13    LAW AND ORDER:SVU              NBC      10:00PM  Tue  5.8     10     6,409,000
14    MT&R: UNFORGET MOMNTS TV(S)    NBC      8:30PM   Wed  5.8     10     6,331,000
15    COLD CASE                      CBS      8:00PM   Sun  5.6     10     6,174,000
16    CSI: NY                        CBS      10:00PM  Wed  5.6     10     6,104,000
17    LAW AND ORDER                  NBC      10:00PM  Wed  5.6     10     6,121,000
18    BIG BROTHER 6-TUE              CBS      9:00PM   Tue  5.5     9      5,981,000
19    CROSSING JORDAN                NBC      10:00PM  Sun  5.5     9      6,065,000
20    DATELINE FRI                   NBC      8:00PM   Fri  5.1     10     5,592,000

*There are an estimated 105.5 million television households in the USA. A single ratings point represents 1%, or 1,055,000 households for the 2005-06 season. Share is the percentage of television sets in use tuned to a specific program.

Stem-and-leaf for Shares (stems are 10's)
0   | 9 9
1*  | 0 0 0 0 0 0 0 0 1 1 1 1
1t  | 2 2 3
1f  | 4 4
1s  | 6
1** |

Stem-and-Leaf for Rating (stems are 1's)
5* | 1
5. | 556668889
6* | 02
6. | 589
7* | 0
7. | 69
8* | 0
8. |
9* | 3

EXAMPLE: (beginning of class pulses)

BPULSE   Unit = 1.000000   n = 138.   missing = 0.

  #   Stem | Leaves
 ---  ---- | -----------------------
  .    4*  |
  3    4.  | 588
  9    5*  | 001233444
 10    5.  | 5556788899
 23    6*  | 00011111122233333344444
 23    6.  | 55556666667777788888888
 16    7*  | 0000011222233444
 23    7.  | 55555666666777888888999
 10    8*  | 0000112224
 10    8.  | 5555667789
  4    9*  | 0012
  2    9.  | 58
  4   10*  | 0223
  .   10.  |
  1   11*  | 1

Advantages of stem-and-leaf displays:
i) each measurement is displayed
ii) the measurements appear in ascending order
iii) relatively simple (if the data set is not too large)

Disadvantage:
i) the display becomes unwieldy for large data sets

EXAMPLE Population of 185 US cities with between 100,000 and 500,000 residents.

Since a stem and leaf plot shows only two-place accuracy, we had to round the numbers to the nearest 10,000. For example the largest number (493,559) was rounded to 490,000 and then plotted with a stem of 4 and a leaf of 9. The fourth highest number (463,201) was rounded to 460,000 and plotted with a stem of 4 and a leaf of 6. Thus, the stems represent units of 100,000 and the leaves represent units of 10,000. Notice that each stem value is split into five parts: 0-1, 2-3, 4-5, 6-7, and 8-9.

Dotplots

A dotplot is a simple display: it places a dot along an axis for each case in the data. It is similar to a stem-and-leaf display.

[Figure: dotplot of Kentucky Derby winning times, plotting each race as its own dot]

Timeplots

[Figure: timeplot, "Winning Times in Olympic 100m Dash"; x-axis: year; y-axis: winning time (9 to 13)]

[Figure: Histogram; x-axis: Bin (9.84, 10.38, 10.92, 11.46, More); y-axis: Frequency]

The Shape of a Distribution

skewness

skewed to the right (positively skewed)

[Figure: histogram, "2006 Baseball Salaries"; x-axis: 2006 Salary ($1,000's); y-axis: Frequency]

skewed to the left (negatively skewed)

[Figure: "Histogram of Exam Scores"; x-axis: Exam Scores (20 to 100); y-axis: Frequency]

symmetric

[Figure: histogram, "Bank Customers: 10:00-11:00 am"; x-axis: Number of Customers; y-axis: Frequency]

outliers

[Figure: histogram, "200 m Races 20.2 secs or less (approx. 700)"; x-axis: TIMES; y-axis: Frequency; labeled outliers: Usain Bolt 2008 (19.30), Michael Johnson 1996 (19.32)]

BIMODAL DISTRIBUTIONS (two peaks) (frequently results from measurements on two populations, such as heights of male and female adults).

[Figure: Histogram; x-axis: Bin (51 to 73.5, More); y-axis: Frequency]

Describing Distributions Numerically

Section Objectives: At the end of this section you should be able to:
1) Calculate appropriate numerical summaries of quantitative data to describe center (median, mean, quartiles) and spread (range, interquartile range, standard deviation) [the use of software will be emphasized!]
2) Describe the characteristics of various numerical summaries, with emphasis on the effects of outliers.
3) Interpret the values of the numerical summaries for a particular data set.
4) Match graphical displays of quantitative data to the values of the summary statistics.
5) Apply graphical and numerical procedures to compare 2 or more sets of data.

Throughout the course we will emphasize the paradigm "Think, Show, Tell". The above objectives fit into this paradigm as follows: "Think" about what numerical summaries of center and spread are appropriate for the data at hand; calculate the values of the numerical summaries to "show" the center and spread. "Tell" what characteristics of the data are conveyed by the values of the numerical summaries.

Finding Center and Spread

We would like to numerically summarize two characteristics of quantitative data:
i) center
ii) spread

Finding the center: the median

median: the value that falls in the middle when the data are arranged in order of magnitude

Calculating the Median
Given a set of n data values arranged in order of magnitude:
Median = the middle value, if n is odd
Median = the mean of the two middle values, if n is even

Graphically, the median splits the histogram of the data into two halves of equal area.

EXAMPLES:
1) Below is a list of the home runs hit by Babe Ruth in each of his seasons as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
median =

2) student pulse rates - ordered values:
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
median =

3) Year 2002 baseball salaries: n = 805
median = $900,000; maximum = $25,000,000 (Alex Rodriguez); minimum = $200,000

4) Median fan age: MLB: 45; NFL: 43; NBA: 41; NHL: 39 (Scarborough Research)
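The rule for calculating the median (middle value when n is odd, mean of the two middle values when n is even) can be checked with a short Python sketch. The function below is a hand-rolled illustration of that rule; Python's built-in statistics.median follows the same definition.

```python
def median(values):
    """Median by the odd/even rule: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # n odd: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of the two middle values


# Babe Ruth's home runs as a Yankee (example 1 above), n = 15.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(median(ruth))
```

Because the function sorts first, the data can be entered in any order.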

Measuring spread: home on the range

range = max - min

EXAMPLE: Year 2002 baseball salaries: range = $25,000,000 - $200,000 = $24,800,000

Disadvantage of the range: it is too crude and sensitive; a single extreme value can make the range very large.

Measuring spread: the interquartile range (IQR)

Focus on the middle of the data instead of the extremes: find the range of the middle half of the data.
i) divide the data in half at the median
ii) now divide both halves in half again, cutting the data into quarters

1/4 of the data lies below the lower quartile
half the data lies between the quartiles (the interquartile range)
1/4 of the data lies above the upper quartile

IQR = upper quartile - lower quartile

Quartiles are NOT well-defined; different software packages give different answers.

FINDING QUARTILES BY HAND
when n is odd, include the overall median in both halves
when n is even, do NOT include the overall median in either half

EXAMPLES:
1) odd number of observations in data set
Below is a list of the home runs hit by Babe Ruth in each of his seasons as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
ordered values: 22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
median = 46
lower half (including median): 22 25 34 35 41 41 46 46
Q1 = lower quartile = (35 + 41)/2 = 38
upper half (including median): 46 46 47 49 54 54 59 60
Q3 = upper quartile = (49 + 54)/2 = 51.5
IQR = 51.5 - 38 = 13.5
software
Excel: Q1 = 38; Q3 = 51.5; IQR =
DataDesk: Q1 = 36.5; Q3 = 52.75; IQR = 16.25

2) even number of observations in data set
ten "distance of hometown from NCSU campus" values:
300 500 65 180 200 120 270 10 100 10
ordered values: 10 10 65 100 120 180 200 270 300 500
median = (120 + 180)/2 = 150
lower half: 10 10 65 100 120
Q1 = lower quartile = 65
upper half: 180 200 270 300 500
Q3 = upper quartile = 270
IQR = 270 - 65 = 205
software
Excel: Q1 = 73.75; Q3 = 252.5; IQR = 252.5 - 73.75 = 178.75
DataDesk: Q1 = 65; Q3 = 270; IQR =

3) median, quartiles from stem-and-leaf plot: class beginning pulse rates

BPULSE   Unit = 1.000000   n = 138.   missing = 0.

  #   Stem | Leaves
 ---  ---- | -----------------------
  .    4*  |
  3    4.  | 588
  9    5*  | 001233444
 10    5.  | 5556788899
 23    6*  | 00011111122233333344444
 23    6.  | 55556666667777788888888
 16    7*  | 0000011222233444
 23    7.  | 55555666666777888888999
 10    8*  | 0000112224
 10    8.  | 5555667789
  4    9*  | 0012
  2    9.  | 58
  4   10*  | 0223
  .   10.  |
  1   11*  | 1

median =
lower quartile =
upper quartile =

5-Number Summary
minimum   Q1   median   Q3   maximum

5-number summary for the above 138 student pulses:
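As a rough check on the by-hand quartile rule (include the overall median in both halves when n is odd, in neither half when n is even), the sketch below computes a 5-number summary for the two small data sets used earlier. The function names are assumptions made for illustration. For comparison it also calls Python's statistics.quantiles with method="inclusive", which for these two data sets happens to reproduce the Excel values quoted in the examples; other packages may use other rules.

```python
from statistics import quantiles


def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """(min, Q1, median, Q3, max) using the by-hand quartile rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
distances = [300, 500, 65, 180, 200, 120, 270, 10, 100, 10]

print(five_number_summary(ruth))        # by-hand rule, n odd
print(five_number_summary(distances))   # by-hand rule, n even

# A different (interpolating) quartile rule, for comparison only.
print(quantiles(distances, n=4, method="inclusive"))
```

The two rules agree on the Ruth data but disagree on the distance data, which is the point of the warning that quartiles are not well-defined.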

Summarizing Symmetric Distributions

EXAMPLE (body temperature of 93 adults)
median = 98.2 degrees Fahrenheit; mean = 98.12 degrees Fahrenheit

Finding the center: the mean

The median is determined by counting the data; it doesn't care how large or how small the data values are (except the middle one or two data values). Often we do care about the actual data values; we would like a measure of center that uses each data value.

NOTATION
\( y \)         represents an observation in a data set
\( n \)         number of observations in the data set
\( \bar{y} \)   denotes the sample mean

Consider any set of data values represented by y's; then

\[ \bar{y} = \frac{\sum y}{n} = \frac{\text{sum of } y\text{'s}}{n} \]

IMPORTANT: the mean is an appropriate measure of the middle only when the shape is approximately symmetric and there are no outliers.

Connection to histogram
A histogram balances when supported at the mean.
median = 57.7 years; mean = 55.26 years

Mean or median? It makes a difference (sometimes)

EXAMPLE: 2004 major league baseball salaries
n = 826; mean \( \bar{y} \) = $2,482,530; median = $787,500; min = $300,000; max = $21,726,881
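The gap between the mean and the median in the salary example comes from a few very large values. The sketch below illustrates the same effect on a small scale using Babe Ruth's home run data from earlier in the chapter; the added value of 600 is purely hypothetical and is there only to play the role of an extreme outlier.

```python
from statistics import mean, median

# Babe Ruth's home runs as a Yankee (real data from the earlier example).
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Hypothetical: append one absurdly large value to act as an outlier.
with_outlier = ruth + [600]

print("original:     mean =", round(mean(ruth), 2), " median =", median(ruth))
print("with outlier: mean =", round(mean(with_outlier), 2), " median =", median(with_outlier))
```

The median stays at 46 while the mean is pulled far toward the extreme value, which is why the median is the better summary of center for skewed data such as salaries.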

[Figure: histogram, "2004 Major League Baseball Salaries"; x-axis: Salary ($100,000's); y-axis: Frequency]

[Figure: timeplot, "Mean, Median, and Maximum Baseball Salaries", 1976 to 2002; x-axis: Year; left axis: Mean, Median Salary; right axis: Maximum Salary]

Finding spread: the standard deviation

IQR: uses only Q1 and Q3 to measure spread
standard deviation: takes into account how far each observation is from the mean

\[ \sum (y - \bar{y}) = \; ? \]

variance:
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]
units: square gallons, square dollars

standard deviation:
\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \]
automate this calculation!

IMPORTANT:
1) The standard deviation is an appropriate measure of spread only when the shape is approximately symmetric and there are no outliers.
2) Always (always!) report a spread along with any summary of the center.

EXAMPLE
1 3 5 9
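For the four values 1, 3, 5, 9 the calculation can be followed step by step. The sketch below mirrors the variance and standard deviation formulas (dividing by n - 1); the variable names are chosen here for illustration only.

```python
from math import sqrt

data = [1, 3, 5, 9]
n = len(data)

y_bar = sum(data) / n                       # sample mean: 18 / 4 = 4.5
deviations = [y - y_bar for y in data]      # -3.5, -1.5, 0.5, 4.5
squared = [d ** 2 for d in deviations]      # 12.25, 2.25, 0.25, 20.25

variance = sum(squared) / (n - 1)           # 35 / 3
s = sqrt(variance)                          # sample standard deviation

print("sum of deviations:", sum(deviations))
print("variance:", round(variance, 3))
print("standard deviation:", round(s, 3))
```

The deviations from the mean always add up to zero, which is why they are squared before averaging.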

Thinking about the standard deviation:
1) Note that s is always nonnegative, that is, \( s \ge 0 \). When does s = 0?
2) The larger the value of s, the greater the spread of the data. Given two data sets, the standard deviation is useful as a relative measure of spread.
3) The standard deviation is the most commonly used measure of risk in many areas such as finance, business, education, social sciences, etc.
4) Why divide by n - 1 instead of n when computing the sample standard deviation?
i) to drive you crazy.
ii) dividing by n to find the standard deviation of a small group would underestimate the variability present in the larger groups they represent.
iii) the above formula for s includes the sample mean \( \bar{y} \). Since \( \sum_{i=1}^{n} (y_i - \bar{y}) = 0 \), only n - 1 of the data values are free to vary.
example:

Reporting shape, center, and spread of quantitative data
1) When telling about a quantitative variable, always report the shape, along with a center and a spread.
2) If the shape is skewed, report the median and IQR; the mean and standard deviation are sensitive to outliers (you can include the mean and standard deviation, but you should point out why the mean and median differ).
3) If the shape is symmetric, report the mean and standard deviation.
4) If there are obvious outliers and you are reporting the mean and standard deviation, report them with the outliers included and with the outliers removed (the median and IQR will not be affected by the outliers).

SUMMARY
We can now summarize distributions of quantitative variables numerically.
• The 5-number summary displays the min, Q1, median, Q3, and max.
• Measures of center include the mean and median.
• Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric distributions and skewed distributions.
We can also display distributions with boxplots.
• While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.
• Boxplots are useful for comparing groups.