Methods for Describing Data


Author: Peter Roger Warner

I. Describing Qualitative Data

(i) Frequency Table or Frequency Distribution is the organization or summarization of raw data in a table with two columns:
• classes: each class is one of the categories into which unit characteristics or data values are classified, and
• frequencies: each class frequency is the number of units or data values falling in the class.

Important Note: The terms “Class” and “Frequency” are generic only. Use the name of the variable for class and number of units for frequency. Also, you must include a title describing the frequency table. In short, your frequency table must be self-explanatory.

Example. Frequency distribution of race of 5000 students

Race        Number of Students    Percentage of Students
Black       1000                  20.00
White       2000                  40.00
Hispanic    1500                  30.00
Other        500                  10.00
Total       5000                 100.00

What percentage of students are not white? What percentage of students are black or Hispanic?

II. Describing Quantitative Data

(i) Ungrouped Frequency Distribution for Discrete Data is the organization or summarization of raw data in the form of a table with two columns:
• classes: the distinct values (in order of magnitude) of the discrete variable under consideration, and
• frequencies: each class frequency is the number of times each value of the discrete variable is repeated in the data set.
A third column that shows the percent of each class frequency is also recommended.

Example. Frequency distribution of the number of courses taken by 30 students

Number of Courses    Number of Students
3                     4
4                    18
5                     6
6                     2

1. How many students enrolled in at least four courses? (ans. 26)
2. What percentage of students enrolled in four courses? (ans. 60%)
3. What percentage of students enrolled in at most five courses? (ans. 93.33%)

(ii) Grouped Frequency Distribution for Continuous Data is the organization or summarization of raw data in the form of a table with two columns:
• class intervals (preferably with boundaries): obtained by dividing the range of the data into several (about 5 to 15) intervals, preferably of equal width, in such a way that no data value belongs to two different intervals;
• frequencies: the number of data values that fall in a class interval is the corresponding class frequency.

Example. Frequency distribution of EPA mileage ratings on 100 cars (Data is in Table 2.2, page 38; H = 44.9, L = 30.0)

Mileage Ratings    Number of Cars
29.99 − 32.99       6
32.99 − 35.99      16
35.99 − 38.99      58
38.99 − 41.99      18
41.99 − 44.99       2

1. How many cars are rated under 33 miles per gallon?
2. What percentage of cars have a mileage rating of 36 or more?
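The three course-enrollment answers above (26, 60% and 93.33%) can be checked with a short Python sketch. The list `courses` is rebuilt from the frequency table of the number of courses, and the helper name `frequency_table` is an illustrative choice, not something defined in these notes.

```python
from collections import Counter

# Number of courses for each of the 30 students, rebuilt from the
# frequency table: 3 -> 4 students, 4 -> 18, 5 -> 6, 6 -> 2.
courses = [3] * 4 + [4] * 18 + [5] * 6 + [6] * 2


def frequency_table(values):
    """Return rows of (value, frequency, percent), ordered by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], 100 * counts[v] / n) for v in sorted(counts)]


table = frequency_table(courses)
for value, freq, pct in table:
    print(f"{value:>3} {freq:>4} {pct:7.2f}")

n = len(courses)
# "At least four" and "at most five" are sums of class frequencies.
at_least_four = sum(f for v, f, _ in table if v >= 4)
pct_exactly_four = 100 * sum(f for v, f, _ in table if v == 4) / n
pct_at_most_five = 100 * sum(f for v, f, _ in table if v <= 5) / n

print(at_least_four, round(pct_exactly_four, 2), round(pct_at_most_five, 2))
```

The same helper works for a qualitative variable such as class rank, since it only counts repeated values.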

(iii) Frequency Histogram for a grouped frequency distribution is a graph that displays class frequencies along the Y-axis and class boundaries along the X-axis. Appropriate adjustments along the Y-axis are necessary to display a grouped frequency distribution with unequal class widths. For an ungrouped frequency distribution, first create class boundaries (as explained in class) and then construct the histogram.

Chapter 2: Practice Exercise 1

On a given day, a researcher gathered some information from each of a group of students sitting at the University Center. Data for 30 students are shown below.

Class Rank (R)  No. of Courses (X)  GPA (Y)    Class Rank  No. of Courses  GPA
Jr              4                   3.76       So          5               3.36
Sr              5                   4.00       Fr          5               2.43
Jr              4                   3.66       Jr          4               1.52
Fr              4                   1.98       Fr          3               3.29
Jr              4                   3.54       Sr          4               2.69
Jr              4                   3.33       So          5               3.84
Sr              4                   2.97       So          6               3.74
Sr              4                   3.25       Jr          4               2.62
Fr              4                   3.36       Sr          3               3.41
Sr              6                   2.40       Fr          4               2.28
So              3                   1.51       So          5               3.46
So              4                   1.11       Jr          4               1.33
So              4                   2.85       Jr          3               3.22
Sr              4                   3.88       Sr          4               2.17
Jr              4                   3.17       Fr          5               3.47

Construct a frequency table for each variable. Display each table using a graph.

Distribution of Class Rank

Class Rank    # of Students
FR            6
SO            7
JR            9
SR            8

Distribution of Number of Courses

# of courses    # of Students
3                4
4               18
5                6
6                2

(mean = 4.20, std dev = 0.7611)

Frequency distribution of GPA

GPA              # of Students
1.105 − 1.685     4
1.685 − 2.265     1
2.265 − 2.845     6
2.845 − 3.425    10
3.425 − 4.005     9
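Summary statistics for a grouped table are usually approximated by treating every value in a class as if it sat at the class midpoint. The sketch below applies that convention (an assumption of this example, since the notes do not spell out the method) to the GPA table above; the variable names are illustrative.

```python
from math import sqrt

# Class boundaries and frequencies copied from the grouped GPA table.
boundaries = [1.105, 1.685, 2.265, 2.845, 3.425, 4.005]
frequencies = [4, 1, 6, 10, 9]

# The midpoint of each class interval stands in for every value in it.
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sample variance with divisor n - 1, weighted by class frequency.
grouped_var = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)
grouped_sd = sqrt(grouped_var)

print(round(grouped_mean, 4), round(grouped_sd, 4))
```

Because the midpoints only approximate the raw values, the grouped results generally differ slightly from those computed on the individual observations.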

(From the above grouped distribution, mean = 2.9223, std dev = 0.7689; using all thirty individual GPAs, mean = 2.92, std dev = 0.8158)

Cross-table - Summarizing Bivariate Attribute Data

Data sets with two attribute variables are often summarized in cross-tables with one row for each category of the first attribute variable and one column for each category of the second. The number or the proportion of subjects that belong to each row-column combination is written in the corresponding cell of the table.

Example: Consider the data set collected from 18 students.

Class Rank    Employment Status    Class Rank    Employment Status
Jr            Full-time            Fr            Full-time
Sr            Part-time            Sr            Part-time
Jr            Full-time            So            Full-time
Fr            Unemployed           So            Part-time
Jr            Part-time            So            Unemployed
Fr            Part-time            Fr            Part-time
Jr            Full-time            Sr            Part-time
Sr            Full-time            Jr            Unemployed
Sr            Unemployed           So            Part-time

Distribution of students according to class rank and employment status

Class Rank    Full-time    Part-time    Unemployed
Fr            1            2            1
So            1            2            1
Jr            3            1            1
Sr            1            3            1
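The cross-table above can be rebuilt from the 18 raw records, and its row and column totals give the percentages one usually asks of such a table: the share of a row category, the share of a column category, and the two conditional shares. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# The 18 (class rank, employment status) records from the example,
# left block first, then right block.
ranks = ["Jr", "Sr", "Jr", "Fr", "Jr", "Fr", "Jr", "Sr", "Sr",
         "Fr", "Sr", "So", "So", "So", "Fr", "Sr", "Jr", "So"]
status = ["Full-time", "Part-time", "Full-time", "Unemployed", "Part-time",
          "Part-time", "Full-time", "Full-time", "Unemployed",
          "Full-time", "Part-time", "Full-time", "Part-time", "Unemployed",
          "Part-time", "Part-time", "Unemployed", "Part-time"]

cells = Counter(zip(ranks, status))   # one count per row-column pair
row_totals = Counter(ranks)           # totals by class rank
col_totals = Counter(status)          # totals by employment status
n = len(ranks)

columns = ["Full-time", "Part-time", "Unemployed"]
for r in ["Fr", "So", "Jr", "Sr"]:
    print(r, [cells[(r, c)] for c in columns])

# Marginal shares use the grand total as the denominator.
pct_sr = 100 * row_totals["Sr"] / n
pct_unemployed = 100 * col_totals["Unemployed"] / n
# Conditional shares use a row total or a column total instead.
pct_of_sr_unemployed = 100 * cells[("Sr", "Unemployed")] / row_totals["Sr"]
pct_of_unemployed_sr = 100 * cells[("Sr", "Unemployed")] / col_totals["Unemployed"]
```

The last two quantities share a numerator but have different denominators, which is why "percentage of Sr who are unemployed" and "percentage of unemployed who are Sr" differ.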

One may be interested in answering the following types of questions.
1. What percentage of the survey is Sr? (27.78%)
2. What percentage of the survey is unemployed? (22.22%)
3. What percentage of Sr is unemployed? (20%)
4. What percentage of unemployed is Sr? (25%)

Summarizing quantitative data by categories of a qualitative variable.

Simple Table: Data on one attribute and one quantitative variable are often summarized in a simple table with a column for the attribute and a second column for summary results of the quantitative variable. (Note: This is not a frequency table.) The first column shows the different categories of the attribute, and the second column shows summary results (e.g., mean, median) of the quantitative variable corresponding to each category of the attribute.

Example: Average age of students by class rank.

Class Rank    Average Age
Fr            20
So            21.2
Jr            23
Sr            24.5

The above summary results can be displayed using a bar chart with class rank along the horizontal axis and average age along the vertical axis.

Summarizing bivariate quantitative data.
• Scatter diagram to summarize and display bivariate quantitative variables.

Summary measurements of Quantitative Data

Measures of Location or Central Tendency

Mean. If $x_1, x_2, \ldots, x_n$ are used to denote the observations of a variable X in the data set, then the mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{\sum x}{n}.$$

Median is the middle number in the ordered data set. An ordered data set is one in which observations are arranged in ascending (or descending) order of magnitude. If the total number n of observations is even, then the median is the mean of the middle two numbers. Write M to denote the median. The median is less sensitive than the mean to extremely large or small measurements.

Mode is the observation (or observations) in the data set that occurs more often than the other numbers in the data set. A data set may have no mode, a unique mode, or more than one mode. The mode is an appropriate summary measure for categorical data. For example, if answers for favorite colors are red, blue, green, purple, it is impossible to add up those values to find the mean or to order the colors to find the median. The remaining alternative for describing this set of data is the mode. Then you can say that most people prefer red (or whatever color occurs most).

Measures of Variability or Dispersion or Spread

Range = Highest observation (H) − Lowest observation (L).

Variance. If $x_1, x_2, \ldots, x_n$ are used to denote the observations in the data set, then the variance is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}.$$

Standard Deviation = $\sqrt{s^2}$
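As a check on the definitions above, the following Python sketch computes the mean, median, mode, range, variance and standard deviation for the number of courses taken by the 30 students in Practice Exercise 1. It evaluates the variance with both the definitional form and the shortcut form to show that they agree; the variable names are illustrative.

```python
import statistics
from math import sqrt

# Number of courses (X) for the 30 students in Practice Exercise 1.
x = [4, 5, 4, 4, 4, 4, 4, 4, 4, 6, 3, 4, 4, 4, 4,
     5, 5, 4, 3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5]
n = len(x)

mean = sum(x) / n
median = statistics.median(x)     # averages the two middle values when n is even
modes = statistics.multimode(x)   # list of every most-frequent value
data_range = max(x) - min(x)      # H - L

# Definitional form: squared deviations from the mean over n - 1.
var_definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)
# Shortcut form: (sum of x^2 - (sum of x)^2 / n) over n - 1.
var_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

sd = sqrt(var_definition)
print(round(mean, 2), median, modes, data_range, round(sd, 4))
```

The rounded mean and standard deviation agree with the values quoted under the number-of-courses table (4.20 and 0.7611).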

Coefficient of Variation = $\left(\frac{s}{\bar{x}}\right) 100\%$

Numerical Measures of Relative Position or Relative Standing

Percentile: Percentiles are numbers that divide an ordered data set into 100 equal parts. The p-th percentile is the value $x_p$ such that at most p% of all values are smaller than $x_p$ and at most (100 − p)% are larger than $x_p$. With the n observations arranged in ascending order, $x_p$ is equal to the average of the $\left(\frac{np}{100}\right)$th and $\left(\frac{np}{100} + 1\right)$th ordered observations if $\frac{np}{100}$ is an integer; otherwise $x_p$ is equal to the $(k + 1)$th ordered observation, where k is the largest integer not exceeding $\frac{np}{100}$.

z-score (also known as standardized score): This is defined as

$$z = \frac{\text{observed measurement} - \text{mean}}{\text{standard deviation}} = \frac{x - \bar{x}}{s} \quad \text{or} \quad \frac{x - \mu}{\sigma}$$

depending on whether x is a sample or a population measurement.

Note: The z-score represents the distance between a given data value $x_0$ and the mean of all data values, expressed in standard deviations. For example, if the z-score for $x_0$ is 2, then $x_0$ is 2 standard deviations above the mean. All observations with absolute z-scores greater than three are considered outliers.
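The percentile rule and the z-score can be written as two small functions. The sketch below applies them to the 30 GPA values from Practice Exercise 1, counting positions from the smallest observation. The function names are illustrative, and `p` is restricted to whole numbers from 1 to 99 so that the integer test on np/100 can be done exactly.

```python
from math import sqrt


def percentile(values, p):
    """p-th percentile (p a whole number from 1 to 99) by the rule above."""
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)  # k = floor(np/100)
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise take the (k+1)-th ordered value (index k, counting from 0).
    return ordered[k]


def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviations."""
    return (x - mean) / sd


# GPA (Y) for the 30 students in Practice Exercise 1.
gpa = [3.76, 4.00, 3.66, 1.98, 3.54, 3.33, 2.97, 3.25, 3.36, 2.40,
       1.51, 1.11, 2.85, 3.88, 3.17, 3.36, 2.43, 1.52, 3.29, 2.69,
       3.84, 3.74, 2.62, 3.41, 2.28, 3.46, 1.33, 3.22, 2.17, 3.47]

q1, med, q3 = percentile(gpa, 25), percentile(gpa, 50), percentile(gpa, 75)

n = len(gpa)
mean = sum(gpa) / n
sd = sqrt(sum((g - mean) ** 2 for g in gpa) / (n - 1))
z_top = z_score(4.00, mean, sd)   # z-score of the highest GPA

print(q1, round(med, 3), q3, round(z_top, 2))
```

Here n = 30, so np/100 is 7.5 for the 25th percentile (take the 8th ordered value) and 15 for the 50th (average the 15th and 16th). Statistical software often uses other interpolation rules, so its quartiles may differ slightly from these.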

Boxplot for quantitative data - Detecting outliers and extreme observations (**** This section on boxplot is intended for students in Honors Section Only ****)

A box plot displays the following summary results of quantitative variable(s):
(i) Lower adjacent value = smallest observation greater than or equal to Q1 − 1.5 × IQR (= lower inner fence),
(ii) Q1 = first quartile,
(iii) median,
(iv) Q3 = third quartile,
(v) Upper adjacent value = largest observation less than or equal to Q3 + 1.5 × IQR (= upper inner fence),
(vi) outlying and extreme observations: Observations that are unusually large or unusually small relative to other observations in a data set are outliers. All observations falling outside the interval [Q1 − 1.5(Q3 − Q1), Q3 + 1.5(Q3 − Q1)] are considered outliers, and those falling outside [Q1 − 3(Q3 − Q1), Q3 + 3(Q3 − Q1)] are considered extreme outlying observations.

The upper and lower quartiles of the data are portrayed by the top and bottom of a rectangle, and the median is portrayed by a horizontal line segment within the rectangle. The mean is marked by a dot or a plus sign. The vertical lines (or “whiskers”) above and below the rectangle extend to the adjacent values.

• Box plots focus on the following important features of a data set:
→ typical or central value, measured by the mean
→ position of data, measured by the median and the first and third quartiles
→ spread or variability, measured by the IQR
→ shape (symmetry or skewness): the relative distances of the upper and lower quartiles from the median give information about the shape of the distribution of the data. If one distance is much bigger than the other, the distribution is skewed. For a symmetric distribution, the median cuts the box in half, the whiskers are about the same length, and outside values (if any) are symmetrically placed beyond the lower and upper whiskers.
→ outlying and extreme data points and the behavior of the tails: outlying and extreme values (observations beyond the adjacent values) are graphed individually. These values portray behavior in the extreme tails of the distribution, providing further information about spread and shape.

• Box plots are often drawn side-by-side to compare two or more quantitative variables.

Example. The Environmental Protection Agency (EPA) performs extensive tests on all new car models to determine their mileage ratings.
Suppose that the following 100 measurements (listed in ascending order, reading across each row) represent the results of such tests on a certain new car model:

30.0 31.8 32.5 32.7 32.9 32.9 33.1 33.2 33.6 33.8
33.9 33.9 34.0 34.2 34.4 34.5 34.8 34.8 35.0 35.1
35.2 35.3 35.5 35.6 35.6 35.7 35.8 35.9 35.9 36.0
36.1 36.2 36.3 36.3 36.4 36.4 36.5 36.5 36.6 36.6
36.7 36.7 36.7 36.8 36.8 36.8 36.9 36.9 36.9 37.0
37.0 37.0 37.0 37.1 37.1 37.1 37.2 37.2 37.3 37.3
37.4 37.4 37.5 37.6 37.6 37.7 37.7 37.8 37.9 37.9
38.0 38.1 38.2 38.2 38.3 38.4 38.5 38.6 38.7 38.8
39.0 39.0 39.3 39.4 39.5 39.7 39.8 39.9 40.0 40.1
40.2 40.3 40.5 40.5 40.7 41.0 41.0 41.2 42.1 44.9
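The quantities a box plot needs can be computed directly from the 100 mileage measurements. The sketch below uses the percentile rule given earlier in these notes for the quartiles, then derives the fences, adjacent values and outliers. The function and variable names are illustrative, and the code only produces the numbers; drawing the plot is left to a plotting tool.

```python
# The 100 EPA mileage measurements from the example.
mileage = [
    30.0, 31.8, 32.5, 32.7, 32.9, 32.9, 33.1, 33.2, 33.6, 33.8,
    33.9, 33.9, 34.0, 34.2, 34.4, 34.5, 34.8, 34.8, 35.0, 35.1,
    35.2, 35.3, 35.5, 35.6, 35.6, 35.7, 35.8, 35.9, 35.9, 36.0,
    36.1, 36.2, 36.3, 36.3, 36.4, 36.4, 36.5, 36.5, 36.6, 36.6,
    36.7, 36.7, 36.7, 36.8, 36.8, 36.8, 36.9, 36.9, 36.9, 37.0,
    37.0, 37.0, 37.0, 37.1, 37.1, 37.1, 37.2, 37.2, 37.3, 37.3,
    37.4, 37.4, 37.5, 37.6, 37.6, 37.7, 37.7, 37.8, 37.9, 37.9,
    38.0, 38.1, 38.2, 38.2, 38.3, 38.4, 38.5, 38.6, 38.7, 38.8,
    39.0, 39.0, 39.3, 39.4, 39.5, 39.7, 39.8, 39.9, 40.0, 40.1,
    40.2, 40.3, 40.5, 40.5, 40.7, 41.0, 41.0, 41.2, 42.1, 44.9,
]


def percentile(values, p):
    """p-th percentile (p a whole number from 1 to 99), notes' rule."""
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]


mean = sum(mileage) / len(mileage)
q1, median, q3 = (percentile(mileage, p) for p in (25, 50, 75))
iqr = q3 - q1

# Inner fences at 1.5 IQR, outer fences at 3 IQR beyond the quartiles.
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the adjacent values, the most extreme observations
# that still lie inside the inner fences.
lower_adjacent = min(v for v in mileage if v >= lower_inner)
upper_adjacent = max(v for v in mileage if v <= upper_inner)

outliers = [v for v in mileage if v < lower_inner or v > upper_inner]
extremes = [v for v in mileage if v < lower_outer or v > upper_outer]

print(lower_adjacent, upper_adjacent, outliers, extremes)
```

With these data the whiskers end at 31.8 and 42.1, the observations 30.0 and 44.9 are plotted individually as outliers, and no observation lies beyond the outer fences.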

Some summary results are mean = 36.994, median = 37.00, Q1 = 35.65, Q3 = 38.35, L = 30.00, H = 44.90, 1.5(Q3 − Q1) = 4.05, 3(Q3 − Q1) = 8.1. Summarize the data in a box plot.

Interpreting the Mean and Standard Deviation

Empirical Rule: For approximately mound-shaped distributions of data, the following statements can be made.
1. Approximately 68% of the measurements will fall within 1 standard deviation of the mean.
2. Approximately 95% of the measurements will fall within 2 standard deviations of the mean.
3. Approximately 99.7% of the measurements will fall within 3 standard deviations of the mean.

Chebyshev’s Rule. For k > 1, at least $\left(1 - \frac{1}{k^2}\right) 100\%$ of the observations will fall within k standard deviations of the mean.
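Chebyshev's Rule gives a guaranteed lower bound for any data set, while the Empirical Rule gives approximate figures for mound-shaped data only. The sketch below evaluates the bound for a few values of k and compares it with the observed share of values within k standard deviations, using the number-of-courses data from Practice Exercise 1 as a convenient test set. The function names are illustrative.

```python
import statistics


def chebyshev_min_fraction(k):
    """Lower bound on the fraction of observations within k sd of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is stated for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k sample sd of the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)


# Number of courses for the 30 students in Practice Exercise 1.
courses = [4, 5, 4, 4, 4, 4, 4, 4, 4, 6, 3, 4, 4, 4, 4,
           5, 5, 4, 3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5]

for k in (1.5, 2, 3):
    bound = chebyshev_min_fraction(k)
    observed = fraction_within(courses, k)
    print(f"k={k}: at least {bound:.4f}, observed {observed:.4f}")
```

For k = 2 the bound is 0.75, so at most 25% of the observations can lie more than two standard deviations from the mean, whatever the shape of the distribution.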

Example. A small computing center has found that the number of jobs submitted per day to its computers has a distribution that is bell-shaped and symmetric, with a mean of 83 jobs and a standard deviation of 10. On what proportion of days does the number of jobs submitted exceed 93? (approximately 16%)

Example. A small computing center has found that the number of jobs submitted per day to its computers has a distribution that is approximately mound-shaped, with a mean of 83 jobs and a standard deviation of 10. On what proportion of days does the number of jobs fall between 73 and 93? (approximately 68%)

Example. Assume that the average SAT score of first-year UCF students is 1175 with a standard deviation of 120. If the distribution of their SAT scores is approximately bell-shaped and symmetric, what percentage of students scored between 1055 and 1415? Find the minimum score of a student who scored among the top 2.5% of students. (approximately 81.5%; 1415)
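The answers to the three examples above follow from the Empirical Rule percentages together with symmetry: half of each central percentage lies on either side of the mean, and the remainder is split equally between the two tails. A minimal sketch of that arithmetic follows; the helper names are illustrative, and the percentages are the approximate Empirical Rule figures, not exact normal probabilities.

```python
# Approximate percent of measurements within k standard deviations
# of the mean, from the Empirical Rule.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}


def pct_between_mean_and(k):
    """Percent between the mean and k sd on one side (by symmetry)."""
    return WITHIN[k] / 2


def pct_beyond(k):
    """Percent more than k sd from the mean in one tail (by symmetry)."""
    return (100 - WITHIN[k]) / 2


# Jobs per day: mean 83, sd 10, so 93 = mean + 1 sd and 73 = mean - 1 sd.
pct_above_93 = pct_beyond(1)
pct_73_to_93 = WITHIN[1]

# SAT scores: mean 1175, sd 120, so 1055 = mean - 1 sd and 1415 = mean + 2 sd.
sat_mean, sat_sd = 1175, 120
pct_1055_to_1415 = pct_between_mean_and(1) + pct_between_mean_and(2)

# The top 2.5% is the upper tail beyond 2 sd, so the cutoff is mean + 2 sd.
top_tail_pct = pct_beyond(2)
top_cutoff = sat_mean + 2 * sat_sd

print(pct_above_93, pct_73_to_93, pct_1055_to_1415, top_tail_pct, top_cutoff)
```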

Example. Solar energy is considered by many to be the energy of the future. A recent survey was taken to compare the cost of solar energy to the cost of gas or electric energy. Results of the survey revealed that the distribution of the amount of the monthly utility bill of a 3-bedroom house using gas or electric energy had a mean of $125 and a standard deviation of $10. What percentage of homes will have a monthly utility bill of less than $105 or more than $145? (at most 25%)

Chapters 1 and 2 Review

Key-words: Population and sample, quantitative and qualitative variable, descriptive and inferential statistics, parameter and statistic, census and survey, frequency tables and simple tables, bar diagram and histogram, measures of location (mean, median and mode), measures of dispersion (range, standard deviation, variance, coefficient of variation) and measures of relative position (percentiles, quartiles, z-score), outliers, box plot.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. (answers are given at the end)

1) A sample of high school teenagers reported that 92% of those sampled are interested in pursuing a college education. This statement is a result of a
A) quantitative variable B) statistical inference C) descriptive statistic

2) A published analysis recently stated “Based on a sample of 250 newly hired truck drivers, there is evidence to indicate that, on average, independent truck drivers are overpaid relative to company-hired truck drivers.” This statement is an example of
A) descriptive statistics B) a random sample C) a conclusion D) inferential statistics

4) Which of the following measures would allow you to compare a student’s combined SAT score (taken from a population with mean = 600 and standard deviation of 110) and a student’s ACT score (taken from a population with mean = 22 and standard deviation of 1.4)?
A) the percentiles of both scores B) the z-scores of both scores C) both a and b D) neither a nor b

5) From past figures, it is predicted that 47% of the registered voters in California will vote in the June primary. Does this statement describe descriptive or inferential statistics?
A) descriptive statistics B) inferential statistics

6) The average age of students in a statistics class is 19 years. Does this statement describe descriptive or inferential statistics?
A) inferential statistics B) descriptive statistics

SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

7) Parking at a large university has become a very big problem. University administrators are interested in determining the average parking time (e.g. the time it takes a student to find a parking spot) of its students. An administrator followed 110 students and carefully recorded their parking times. Identify the sample used.

8) Which is used more often, a sample or a population? Why?

9) The ages of professors in the biology department at a private university are 57, 68, 66, 64, and 44. Calculate the sample variance of these ages.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

10) A study published in 1990 attempted to estimate the proportion of Florida residents who were willing to spend more tax dollars on protecting the Florida beaches from environmental disasters. Forty-three hundred Florida residents were surveyed. Which of the following describes the variable of interest in the study?
A) the response to the question “Do you use the beach?”
B) the 4300 Florida residents surveyed
C) being willing to spend more tax dollars on protecting the Florida beaches from environmental disasters
D) the response to the question “Do you live along the beach?”

11) Parking at a large university has become a very big problem. University administrators are interested in determining the average parking time (e.g. the time it takes a student to find a parking spot) of its students. An administrator followed 260 students and carefully recorded their parking times. Identify the inference of interest to the university administration.
A) the generalization of the average time it takes the university administrators to find a parking spot
B) the generalization of the average time it takes a student to find a parking spot
C) the generalization of the average number of parking spots available for students
D) the generalization of the average amount of money that should be charged for student parking passes as a result of limited parking spots

12) The amount of television viewed by today’s youth is of primary concern to Parents Against Watching Television (PAWT). 290 parents of elementary school-aged children were asked to estimate the number of hours per week that their child watched television. The mean and the standard deviation for their responses were 12 and 4, respectively. Identify the type of data collected by PAWT.
A) quantitative data B) qualitative data

13) A radio station claims that the amount of advertising per hour of broadcast time has an average of 12 minutes and a standard deviation equal to 2.2 minutes. You listen to the radio station for 1 hour, at a randomly selected time, and carefully observe that the amount of advertising time is equal to 13 minutes. Calculate the z-score for this amount of advertising time.
A) z = -0.45 B) z = 0.45 C) z = 0.90 D) z = 2.2

14. Identify the correct answer in each case.
• In a data set with 5 distinct numbers, the value of mode is (a) zero (b) one (c) there is no mode (d) five.
• Which is not a measure of central tendency or location? (a) Mean (b) Range (c) Mode (d) Median.
• Which is not a measure of relative position? (a) Median (b) Quartile (c) Standard deviation (d) Percentile
• Which is not displayed in a box plot? (a) Median (b) Mode (c) Outliers (d) Quartiles

15. Identify each of the following as examples of attribute/qualitative (A) or quantitative (Q) variables.
• The amount of flu vaccine in a syringe
• Method of payments (cash, check, etc.) for books bought by UCF students.
• The amount of carbon monoxide produced per gallon of unleaded gas.

16. “TRUE” / “FALSE” questions.
• Both 2 and 3 are modes of the data set 2, 3, 2, 3, 3, 2.
• When we take the information contained in the sample and make statements or predictions about all of the information in the population, we are utilizing the technique that is known as descriptive statistics.
• Median of a data set is also equal to one-half of the range of the data set.
• 50th percentile of a data set is also equal to Range/2.
• If your score on an exam (that has 100 true-false questions) corresponds to the 75th percentile, then you obtained 75 correct answers out of 100 questions.
• If a data set has seven distinct observations, then the median of the data set is four.
• If a data set has all negative numbers, then its standard deviation is also negative.
• In a symmetric distribution, we expect the values of the mean, median, and mode to differ greatly from one another.
• In skewed distributions, the mean is the best measure of the center of the distribution since it is least affected by extreme observations.

(Answers: 1(C); 2(D); 4(C); 5(B); 6(B); 7: parking time of 110 students; 8: sample; 9: $s^2$ = 95.20; 10(C); 11(B); 12(A); 13(B); 14(C, B, C, B); 15(Q, A, Q); 16(F, F, F, F, F, F, F, F, F))
