Organizing and Summarizing Data

Author: Octavia Owen
Learning Objectives:
1. Organize qualitative data using a) frequency and relative frequency table, b) bar graph, c) pie graph, and d) Pareto graph.
2. Organize quantitative data for
   1. Discrete data, using a) frequency and relative frequency table, b) bar graph, c) pie graph, and d) Pareto graph.
   2. Continuous data, using a) histogram, b) stem-and-leaf plot, c) time series plot.

Data Presentation
• Qualitative data: summary table, bar chart, pie chart.
• Quantitative data: stem-and-leaf display, dot chart, time series plot, frequency distribution, histogram, box plot.

Organizing Qualitative or Categorical Data
• A statistical table can be used to display the data as a data distribution. It consists of the class, the class frequency, and the relative frequency or percentage.
• For qualitative data, three measurements are available for the list of categories:
– the frequency, or number of measurements in each category
– the relative frequency, or proportion = frequency / total number of observations
– the percentage = 100 × relative frequency
• A pie chart is the familiar circular graph that shows how the measurements are distributed among the categories.
• A bar chart shows the same distribution of measurements in categories, with the height of each bar measuring how often a particular category was observed.
• Pareto chart: a bar chart in which the bars are ordered from largest to smallest.

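A total of 400 individuals are surveyed to rate the school quality. The data are summarized below:

Rating   Frequency   Relative Frequency   Percentage
A        35
B        260
C        93
D        12

Complete the Relative Frequency and Percentage columns, then draw a pie chart, a bar chart, and a Pareto chart. In the Pareto chart the ratings appear in the order B, C, A, D.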
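The relative frequency and percentage columns follow directly from the definitions above. The sketch below is one way to fill them in and to find the Pareto ordering; the variable names are illustrative and not part of the original slides.

```python
# Frequencies from the school-quality survey table.
freq = {"A": 35, "B": 260, "C": 93, "D": 12}

n = sum(freq.values())  # total number of observations

# Relative frequency = frequency / total; percentage = 100 * relative frequency.
rel_freq = {rating: count / n for rating, count in freq.items()}
percent = {rating: 100 * rf for rating, rf in rel_freq.items()}

# A Pareto chart orders the categories from largest to smallest frequency.
pareto_order = sorted(freq, key=freq.get, reverse=True)

print("Rating  Freq  RelFreq  Percent")
for rating in pareto_order:
    print(f"{rating:<6}  {freq[rating]:>4}  {rel_freq[rating]:.4f}   {percent[rating]:.2f}%")
```

Printing the rows in Pareto order gives the same left-to-right order the bars would have in the Pareto chart.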

Exercise: A set of ten students is selected, and measurements are recorded as in the following table: [similar exam questions]

Student   GPA   Gender   Year   Major         Number of Credit Hours Enrolled
1         2.0   F        1      Psychology    16
2         2.3   F        2      Mathematics   15
3         2.9   M        2      English       17
4         2.7   M        1      English       15
5         2.6   F        3      Business      14
6         3.2   F        3      Computer      16
7         2.7   F        1      Chemistry     14
8         3.5   M        4      Chemistry     15
9         2.1   M        3      Business      12
10        2.7   F        3      Sociology     16

• What variables can be described using a pie chart or a bar chart?
• Construct a bar chart and a Pareto chart for the variable Year.

Answer
Variables that can be described by a bar chart or pie chart must be qualitative or discrete variables. Gender and Major are qualitative variables. Year and Number of Credit Hours Enrolled are discrete. GPA is a continuous variable. NOTE: The student ID is not a characteristic for describing students.
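For the second question, the frequencies of Year can be tallied before drawing anything. This is a minimal sketch that counts the Year values copied from the exercise table and prints a text version of the bar chart and the Pareto chart; the text rendering is only an illustration, not the chart format the slides intend.

```python
from collections import Counter

# Year values for students 1-10, copied from the exercise table.
year = [1, 2, 2, 1, 3, 3, 1, 4, 3, 3]

counts = Counter(year)

# Bar chart: categories in their natural order.
bar_order = sorted(counts.items())

# Pareto chart: categories ordered from largest to smallest frequency.
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

print("Bar chart")
for y, c in bar_order:
    print(f"Year {y} | {'#' * c} ({c})")

print("Pareto chart")
for y, c in pareto:
    print(f"Year {y} | {'#' * c} ({c})")
```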

Organizing Quantitative Data: Popular Plots
• If the variable can take only a finite or countable number of values, it is a discrete variable.
– For a discrete variable, a bar chart, pie chart, or Pareto chart can be used to describe the variable, as we did for qualitative variables.
• A variable that can assume an infinite number of values corresponding to points on a line interval is called continuous.
– The stem-and-leaf plot and the histogram are two common graphs for displaying continuous data.
– A time series plot displays the data along the time domain to show trends or patterns over time.

• Dotplots: A dotplot shows each measurement as a point along the x-axis, stacking points that duplicate existing values.
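To make the stacking idea concrete, here is a rough text dotplot, using the credit-hour values from the student exercise in this section. It is only a sketch: the columns list the distinct values in order and are not spaced to scale, and the function name `dotplot` is made up for this illustration.

```python
from collections import Counter

def dotplot(values):
    """Return a text dotplot: one column per distinct value, one '*' per observation."""
    counts = Counter(values)
    xs = sorted(counts)
    height = max(counts.values())
    lines = []
    # Build the rows from the top of the tallest stack down to the axis.
    for level in range(height, 0, -1):
        row = " ".join(" * " if counts[x] >= level else "   " for x in xs)
        lines.append(row.rstrip())
    # Axis labels, one per distinct value.
    lines.append(" ".join(f"{x:>3}" for x in xs))
    return "\n".join(lines)

credit_hours = [16, 15, 17, 15, 14, 16, 14, 15, 12, 16]
plot = dotplot(credit_hours)
print(plot)
```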

Stem and leaf plots: This plot presents a graphical display of the data using the actual numerical values of each data point.
Constructing a Stem and Leaf Plot:
1. Divide each measurement into two parts: the stem and the leaf.
2. List the stems in a column, with a vertical line to their right.
3. For each measurement, record the leaf portion in the same row as its matching stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding so that the reader can recreate the actual measurements if necessary.

Example
The following table lists the prices (in dollars) of 19 different brands of walking shoes. Construct a stem and leaf plot to display the distribution of the data.

90 65 75 70 70 68 70
70 60 68 70 74 65 75
70 40 70 95 65

The price 74 is represented by the stem 7 and the leaf 4. The price is recovered as 74 × (leaf unit) = 74 × 1 = 74.

Solution
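One way to check a hand-drawn plot is to build the stems and leaves programmatically. The sketch below follows the five construction steps listed earlier, assuming whole-dollar prices with the tens digit as the stem and the units digit as the leaf (leaf unit = 1). The helper name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

prices = [90, 65, 75, 70, 70, 68, 70, 70, 60, 68,
          70, 74, 65, 75, 70, 40, 70, 95, 65]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with the tens as stem and the units digit as leaf."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # e.g. 74 -> stem 7, leaf 4
        rows[stem].append(leaf)
    # Keep empty stems so that gaps in the data remain visible.
    return {s: sorted(rows.get(s, [])) for s in range(min(rows), max(rows) + 1)}

plot = stem_and_leaf(prices)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 7 | 4 represents 74 dollars (leaf unit = 1)")
```

Keeping the empty stems 5 and 8 in the display shows the gaps between 40 and the 60s, and between the 70s and the 90s.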


Interpreting Graphs with a Critical Eye:
• What to look for as you describe the data:
– scales: the measurement unit, such as $, inches, etc.
– location: where the center of the data is
– shape: the shape of the frequency distribution
– outliers: unusual data values, such as a distance of 6000 miles from home compared with the rest of the data
• Distributions are often described by their shapes:
– symmetric
– skewed to the right (long tail goes right)
– skewed to the left (long tail goes left)
– unimodal, bimodal, multimodal (one peak, two peaks, many peaks)

Identify the Shape of a Distribution
Examine the three dotplots generated by Minitab and shown in the following figure. Describe these distributions in terms of their locations and shapes.

[Figure: Character dotplots and the corresponding distribution shapes: symmetric, skewed to the right, skewed to the left.]

• Skewed to the right: Most values are small. Only a few are much larger. The long tail is on the right side.
• Skewed to the left: Most values are large. Only a few are much smaller. The long tail is on the left side.

Similar exam questions

Exercise
Determine the shape of the distribution of each of the following variables:
1. Score on a very easy test
2. Score on a very difficult test
3. Entry-level salary for college graduates
4. Adult height

Answer
1. Very easy test: skewed to the left (most scores are high; only a few are low).
2. Very difficult test: skewed to the right (most scores are low; only a few are high).
3. Entry-level salary: likely to be skewed to the right, since most salaries would be lower than $50,000 and a few could be quite high.
4. Adult height: this typically has a symmetric distribution.

Relative Frequency Histograms
What is it? A relative frequency histogram for a quantitative data set is a graph that describes the relative frequency (or frequency) of a variable, for example, distance from home. The possible values of the variable are divided into a few groups (classes, or intervals), and each class is represented by a rectangle whose height is the proportion, or relative frequency, of observations in that class.
• On the x-axis: the classes (or groups) of the variable.
• On the y-axis: the relative frequency or frequency of observations within each class.

Histogram for Continuous Data
Why do we want to do this? A histogram summarizes the data values of a variable in a graph that shows the distribution of the variable. It helps us quickly see where the majority of the data values are, whether there are any very unusual data values, whether those unusual values are on the high end or the low end, whether the data values are far apart or close together, and so on.
Is this different from a bar or pie graph? Yes, it is different. A bar or pie graph is for categorical or discrete variables. A histogram is for continuous variables.

How to construct a histogram?
By hand (in case you do not have technology), construct a relative frequency histogram for a continuous variable as follows:
1. Choose the number of classes, usually between 5 and 15.
2. Calculate the approximate class width by dividing the range (largest value minus smallest value) by the number of classes.
3. Round the approximate class width up to a convenient number.
4. Locate the class boundaries. If the data are discrete, assign one or more integers to a class. If the data are continuous, use the method of left inclusion: include the left class boundary point but not the right boundary point in the class.
– NOTE: Different software may use different methods. Some may use right inclusion. Some may add an additional decimal place to the class boundary.
5. Construct a statistical table containing the classes, their boundaries, and their relative frequencies.
6. Construct the histogram like a bar graph.
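Steps 2 to 4 can be sketched in a few lines. The function below is an illustration only: the `step` parameter, which stands for the "convenient number" the width is rounded up to, is an assumption introduced here, and the choice to start the first class at a multiple of `step` is one reasonable convention among several. It is tried on the walking-shoe prices used in this section, whose smallest and largest values are 40 and 95.

```python
import math

def class_boundaries(smallest, largest, k, step=10):
    """Return (width, classes) for left-inclusive classes [left, right).

    step is a hypothetical 'convenient number' granularity: the class width
    is rounded UP to a multiple of step, never rounded off or truncated.
    """
    approx_width = (largest - smallest) / k            # step 2
    width = math.ceil(approx_width / step) * step      # step 3: round up
    left = math.floor(smallest / step) * step          # step 4: first boundary
    classes = []
    # Keep adding classes until the largest value falls inside one of them.
    while left <= largest:
        classes.append((left, left + width))
        left += width
    return width, classes

width, classes = class_boundaries(smallest=40, largest=95, k=6)
print("class width:", width)
for left, right in classes:
    print(f"[{left},{right})")
```

Because the width is rounded up, the number of classes produced can differ from the requested `k` for other data sets; with these values it comes out to exactly six.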

Example: Constructing a Histogram by Hand
The following table lists the prices (in dollars) of 19 different brands of walking shoes. Construct a relative frequency histogram to display the distribution of the data.

90 70 70 70 75 70 65 68 60 74 70 95 75 70 68 65 40 65 70

Solution
1. Determine the number of classes: for example, use k = 6 classes.
2. Range = 95 - 40 = 55.
3. Class width = 55/6 ≈ 9.17, rounded up to 10. (Round the width up to a convenient number; do not round off or truncate.)
4. Use left inclusion to determine the class boundaries: [40,50), [50,60), [60,70), [70,80), [80,90), [90,100).
5. Construct a relative frequency table. First count the number of observations in each class; this is the frequency, f_i. The relative frequency is rf_i = f_i / n, where n is the total number of data points.
6. Draw a two-dimensional graph with the class boundaries of the variable on the x-axis and the relative frequency on the y-axis, and draw a rectangle for each class with the relative frequency as its height.
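Step 5 is a counting exercise, so it is easy to verify by machine. The sketch below counts the shoe prices in each left-inclusive class and reports the relative frequencies as exact fractions; it is a check on the hand calculation, not a replacement for it.

```python
from fractions import Fraction

prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74,
          70, 95, 75, 70, 68, 65, 40, 65, 70]

# Left-inclusive classes [left, right) from step 4.
classes = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

n = len(prices)

# Frequency: number of observations with left <= price < right.
freq = [sum(1 for p in prices if left <= p < right) for left, right in classes]

# Relative frequency: frequency divided by the total number of data points.
rel_freq = [Fraction(f, n) for f in freq]

print("Class      Frequency  Relative Frequency")
for (left, right), f, rf in zip(classes, freq, rel_freq):
    print(f"[{left},{right})".ljust(11) + f"{f:<11}{rf}")
```

Note that a price of exactly 70 is counted in [70,80), not in [60,70); that is what left inclusion means.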

Activity: Complete the construction of the histogram

Relative Frequency Table

Group      Frequency   Relative Frequency
[40,50)    1           1/19
[50,60)    0           0
[60,70)    6           6/19
[70,80)    10          10/19
[80,90)    0           0
[90,100)   2           2/19

[Figure: Histogram of ShoePrice, constructed using Minitab. X-axis: ShoePrice (40 to 100); Y-axis: Frequency.]

NOTE: We need to know how a relative frequency and a histogram are constructed. The construction of a histogram, however, can be easily done by computer software.


Using Minitab to create the Default Histogram for the Shoes Price Data
Go to Minitab. In the Worksheet window, enter the prices of the 19 pairs of shoes and name the column Price:

90 70 70 70 75 70 65 68 60 74 70 95 75 70 68 65 40 65 70

Save your data set: File, Save Worksheet As, and name it Shoes Price on your desktop. Go to the Graph menu, choose Histogram, select Simple, select the variable ‘Price’, OK.

[Figure: Histogram of ShoePrice, using default options. X-axis: ShoePrice (40 to 100); Y-axis: Frequency.]

Change the Number of Intervals in Minitab
As you can see, the default histogram has its own number of classes. You can change this number to display a different histogram for the same data. For the Shoes Price example, the following steps change the number of classes from 7 to 4: Click inside the histogram graph so that the bars are highlighted. Right-click on the bars and choose ‘Edit Bars’. Go to the Binning menu, change the number of intervals to 4, and click OK.
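The effect of the number of intervals can also be seen without Minitab. The sketch below bins the same shoe prices into k equal-width, left-inclusive intervals over the fixed range 40 to 100 and prints a text histogram for two choices of k. This is an assumption-laden illustration: the fixed range and the function `bin_counts` are choices made here, and Minitab's own binning rule may place the intervals differently.

```python
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74,
          70, 95, 75, 70, 68, 65, 40, 65, 70]

def bin_counts(values, k, low, high):
    """Count values in k equal-width, left-inclusive intervals covering [low, high)."""
    width = (high - low) / k
    counts = [0] * k
    for v in values:
        if not low <= v < high:
            raise ValueError(f"{v} is outside [{low}, {high})")
        counts[int((v - low) // width)] += 1
    return counts

low, high = 40, 100
for k in (6, 4):
    counts = bin_counts(prices, k, low, high)
    width = (high - low) / k
    print(f"k = {k}")
    for i, c in enumerate(counts):
        left = low + i * width
        print(f"  [{left:g}, {left + width:g})".ljust(14) + "#" * c)
```

With six intervals the empty classes are visible; with four, the gaps disappear and the picture looks smoother, which is why the choice of k deserves a second look.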

Histogram of Distance Data: Distance from Home for CMU Students (sample size = 56)
For the Distance data used in your Activity #1, can you construct the histogram shown in the figure?

[Figure: Histogram of distance. X-axis: distance (0 to 800), excluding an extreme distance of 6000 miles; Y-axis: Frequency.]

A histogram shows the distribution of the Distance variable. Several properties can be noticed:

The majority of students are from within 250 miles, with a few very far away. The distribution of distance is strongly skewed to the right (where the long tail is).

The Distance data are grouped into k equal-width intervals; in this case Minitab chooses k = 9 classes. The x-axis shows the intervals of distance: Minitab chooses [-50, 49] as the first interval, [50, 149] as the second, and so on. The y-axis shows the frequency, that is, the number of students whose distance falls within each interval. For example, there are 25 distances between 50 and 149 miles. Each rectangle represents one interval and its frequency.

The following data represent the closing value of the Dow Jones Industrial Average for the years 1980 - 2001.


Time Series Plot


What did you learn from this chapter?
• Graphical displays for qualitative or categorical data: bar chart, pie chart, Pareto chart.
• Graphical displays for discrete quantitative data: bar chart, pie chart, Pareto chart.
• Graphical displays for continuous quantitative data: stem-and-leaf plot, dot plot, histogram, time series plot.
• The shape of a distribution: skewed to the left, symmetric, skewed to the right.
• Outliers, rare cases.
• Real-time activities for illustration: How far are you from home? Does one minute of exercise increase your pulse rate dramatically?