Statistics and Data Analysis

Author: Elmer Caldwell
Assignment 1

IOMS Department

Statistics and Data Analysis Professor William Greene Phone: 212.998.0876 Office: KMC 7-90 Home page: http://people.stern.nyu.edu/wgreene Email: [email protected] Course web page: http://people.stern.nyu.edu/wgreene/Statistics/Outline.htm

Assignment 1 Solutions Notes: (1) The data sets for this problem set (and for the other problem sets for this course) are all stored on the home page for this course. You can find links to all of them on the course outline, at the bottom with the links to the problem sets themselves. (2) In the exercises below (and in the other problem sets), the initials HOG refer to the textbook Basic Statistical Ideas for Managers, by Hildebrand, Ott and Gray.

Part I. Describing Data 1. Consider the following values: 20 11 14 12 17 14 10 23 15 11 17 10 18 18 13 18

Find the mean, median, and mode for these data. SOLUTION: There are 16 values, and these 16 values have a total of 241, so the mean (or average) is 241 ÷ 16 = 15.0625. If you sort the values into ascending order, you get 10 10 11 11 12 13 14 14 15 17 17 18 18 18 20 23
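As a check on the arithmetic in this solution, here is a minimal sketch using Python's standard statistics module (Python is not part of the original assignment, which uses Minitab; the variable names are illustrative). It recomputes the total, the mean, the sorted order, the two middle values, the median, and the mode.

```python
from statistics import mean, median, multimode

# The 16 values from Problem 1.
values = [20, 11, 14, 12, 17, 14, 10, 23, 15, 11, 17, 10, 18, 18, 13, 18]

ordered = sorted(values)
n = len(ordered)

# With an even count, the median is the average of the two middle values
# (the 8th and 9th of the 16 sorted values).
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])

print("n      =", n)
print("total  =", sum(values))
print("mean   =", mean(values))
print("sorted =", ordered)
print("middle =", middle_pair)
print("median =", median(values))
print("mode   =", multimode(values))  # list of the most frequent value(s)
```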

The median of the 16 values falls between the 8th and 9th values in the sorted list, 14 and 15, so we give 14.5 as the median. The value 18 occurs three times, more often than any other value, so we report 18 as the mode.

2. This is Exercise HOG 2.1, page 23. The data are available on the course website as HOGEx0201.mpj. An automobile manufacturer routinely keeps records on the number of finished (passing all inspections) cars produced per eight-hour shift. The data for the last 28 shifts are 366 390 324 385 380 375 384 383 375 339 360 386 387 384 379 386 374 366 377 385 381 359 363 371 379 385 367 364

a. What is the average number of finished cars per shift based on these data? b. Construct a histogram for these data.

c. The data above are the results from observing 28 eight-hour shifts. You are about to observe a 29th. What would be a good guess of how many will be observed? d. Suppose the 29th shift was expected to be a very productive one, with large output. What would be a good guess of the number of finished cars on a very good day?

SOLUTION: a. The average is 373.4 finished cars per shift. b. There are many ways to make a histogram, including by hand. Here’s the default from Minitab:

Minitab selected intervals of width 10 and then centered them on the labels. It might look nicer to have the bars between cutpoints:

This was achieved by clicking on the horizontal scale, then on the binning tab, selecting Cutpoint, and choosing Midpoint/cutpoint positions as 320:400/10. (This is an instance of the Minitab convention start:end/step.) If you drew this by hand, you might have selected a different strategy, but the general appearance should be similar.

We’re asked for an explanation of this shape. We’re going to have to come up with an extra-statistical argument, meaning a statement that goes beyond what the data are telling us directly. One such statement is that the factory regularly produces automobiles in the 350-400 range, but every now and then something goes very wrong, with the production dropping by about 40 to 50 cars. The data tell us directly that most of the time the production is in the range 350-400, but now and then the production drops to 320-340. The data do not tell us the reason why this drop occurs; any such explanation goes beyond the statistical information.

c. The mean of 373.4 would be a good guess. The median of 378 would be a good guess as well.

d. The maximum of 390 might be tempting. But it is hard to expect a randomly chosen day to be as good as the best day in a sample, especially as the number of observations gets large. It would be good to move back from the extreme point. A slightly lower value such as 386 or 387 might be a better guess.
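The figures used in the Problem 2 solution can be reproduced with a short script. This is a sketch in Python rather than Minitab, and it assumes that a value equal to a cutpoint is counted in the bin that starts at that cutpoint; Minitab's own boundary handling may differ. It computes the mean and median, counts the observations between the cutpoints 320, 330, ..., 400, and lists the four largest shifts, which are the candidates discussed in part d.

```python
from statistics import mean, median

# Finished cars per shift for the 28 shifts in Problem 2.
shifts = [366, 390, 324, 385, 380, 375, 384, 383, 375, 339, 360, 386, 387, 384,
          379, 386, 374, 366, 377, 385, 381, 359, 363, 371, 379, 385, 367, 364]

# Cutpoints 320, 330, ..., 400, matching the 320:400/10 setting.
START, END, STEP = 320, 400, 10
cutpoints = list(range(START, END + STEP, STEP))

# Assumption: each bin is closed on the left, so 360 falls in 360-370.
counts = [0] * (len(cutpoints) - 1)
for x in shifts:
    counts[(x - START) // STEP] += 1

shift_mean = mean(shifts)
shift_median = median(shifts)
top_values = sorted(shifts)[-4:]  # the four most productive shifts

print(f"mean   = {shift_mean:.1f}")
print(f"median = {shift_median}")
print(f"top 4  = {top_values}")

# A plain-text histogram: one asterisk per shift.
for lo, count in zip(cutpoints, counts):
    print(f"{lo}-{lo + STEP}: {'*' * count}")
```

The text histogram shows the same long left tail as the Minitab plot: most shifts fall between 360 and 390, with a few isolated low values.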

3. Which of the two samples in each set has the higher standard deviation? You can tell by looking at the data. It is not necessary to do any computation to answer this question. Explain your reasoning for each answer.

Set 1 Sample A: 16, 16, 16, 16, 16 Sample B: 15, 16, 16, 16, 16
Set 2 Sample A: 20, 25, 25, 25, 30 Sample B: 15, 25, 25, 25, 35
Set 3 Sample A: 20, 20, 30, 40, 40 Sample B: 20, 25, 30, 35, 40

SOLUTION:
Set 1. The standard deviation of Sample A is 0.0, because every value is the same. It is greater than 0 for Sample B, so B has the larger standard deviation.
Set 2. Sample B is the same as A in the middle, but the leftmost observation is smaller and the rightmost observation is larger than in A. So Sample B has the larger standard deviation.
Set 3. Observations 1, 3 and 5 are the same in both samples, and the means are the same, 30. In A, observations 2 and 4 contribute (20 − 30)² + (40 − 30)² = 200 to the sum of squared deviations. In B, observations 2 and 4 contribute (25 − 30)² + (35 − 30)² = 50, so A has the larger standard deviation.

4. This is Exercise HOG 2.23, page 42. The data file is HOG-Ex0222.mpj on the course outline. Data on 60 telephone operators in terms of number of call requests processed in a workday were analyzed using Minitab.

Descriptive Statistics: Cleared

Variable   N    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Cleared   60  794.23     4.42  34.25   601.00  789.00  799.00  807.75   844.00

Data Display

Cleared
797 794 796 807 820 601 787 794 804 807
817 801 817 792 800 813 805 801 786 785
817 811 798 808 796 793 835 797 808 789
762 787 788 844 842 719 800 802 790 829
804 771 792 763 811 794 779 784 837 805
803 739 804 797 807 805 790 724 789 817

a. Calculate the “mean plus-or-minus 1 standard deviation” interval used in the Empirical Rule discussed in class. b. Of the 60 scores in “cleared,” 51 fall within the 1 standard deviation interval. How does this result compare with the theoretical value of the Empirical Rule?

SOLUTION: The interval x ± s is here 794.23 ± 34.25, or 759.98 to 828.48. The fraction falling within this interval is 51/60 = 85%. This is much larger than the roughly 68% that the Empirical Rule predicts, so something must have inflated the standard deviation. The lonely single value 601 is likely the problem. If you set aside this single value, you’ll find that the standard deviation of the remaining 59 values is 23.21, a huge reduction from the overall standard deviation of 34.25. In general, we do not recommend that you set aside values just because they are weird.
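The Problem 4 calculation can be checked directly from the 60 displayed values. The sketch below uses Python's standard statistics module instead of Minitab, and the variable names are illustrative. One assumption to note: `quantiles` is called with its default "exclusive" method, which happens to agree with the quartiles in the Minitab output for these data. The interval is built from the mean and standard deviation rounded to two decimals, as in the solution.

```python
from statistics import mean, stdev, quantiles

# The 60 "Cleared" values from the Data Display.
cleared = [797, 794, 796, 807, 820, 601, 787, 794, 804, 807,
           817, 801, 817, 792, 800, 813, 805, 801, 786, 785,
           817, 811, 798, 808, 796, 793, 835, 797, 808, 789,
           762, 787, 788, 844, 842, 719, 800, 802, 790, 829,
           804, 771, 792, 763, 811, 794, 779, 784, 837, 805,
           803, 739, 804, 797, 807, 805, 790, 724, 789, 817]

n = len(cleared)
m = round(mean(cleared), 2)
s = round(stdev(cleared), 2)          # sample standard deviation (n - 1)
q1, med, q3 = quantiles(cleared, n=4)  # default method="exclusive"

# Mean plus-or-minus one standard deviation.
low, high = m - s, m + s
inside = sum(low <= x <= high for x in cleared)

# Same standard deviation with the single low value 601 set aside.
rest = [x for x in cleared if x != 601]
s_rest = stdev(rest)

print(f"n = {n}, mean = {m}, sd = {s}")
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}")
print(f"interval: {low:.2f} to {high:.2f}")
print(f"inside: {inside} of {n} = {inside / n:.0%}")
print(f"sd without 601: {s_rest:.2f} (n = {len(rest)})")
```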


5. (Application) The data file WHO-HealthStudy.mpj (a Minitab project file) contains a famous data set. These data were used in the World Health Organization’s 2000 comparison of the health care systems in 191 countries (nearly the entire world) that was widely discussed in the popular press, including on the front page of the New York Times. If you’ve seen Michael Moore’s movie Sicko, or seen the trailer, there is a point at which he takes out a study on a clipboard and shows you how the United States ranked 37th in the world in “health care.” These are part of the data that were used to do the study. The extract of the data file in the Minitab project contains 12 data columns, as shown below for the first few countries.

The variables in the file are
DALE = disability adjusted life expectancy
EDUC = average years of education
GINI = a measure of income inequality (low numbers are bad)
POPDEN = the population density, people per square kilometer
GDPC = per capita gross domestic product (country income)
GEFF = World Bank measure of the effectiveness of the government
VOICE = World Bank indicator of how democratic the country is
OECD = an indicator of whether the country is in the OECD. (The OECD is the Organisation for Economic Co-operation and Development. Notwithstanding its lofty title, it is mainly a group of the world’s wealthiest countries.)
COMP = an equally weighted average of survey results on five objectives (Health, Health distribution, Responsiveness, Responsiveness in Distribution, Fairness in financing)
EFFICIENCY = estimated overall efficiency (WHO/Paper 30, based on COMP)
PCHEXP = per capita health care expenditure (public)
PUBSHARE = proportion of total health expenditure paid for by the government

a. Let’s compare the incomes of the 30 OECD countries with the incomes of the 161 other countries. A box plot will be useful. Use Graph -> Boxplot (with groups), then graph variable GDPC with group variable OECD. Note that OECD = 1 is the OECD and OECD = 0 is the other countries. What do you find?
b. Present a description of the variables DALE and GDPC using the tools discussed in class. (Descriptive statistics will include means and standard deviations and medians. Graphical tools include histograms and box plots.)
c. Does higher income buy higher life expectancy? Produce a scatter plot of DALE (on the Y axis) against GDPC (on the X axis). What do you find?
d. Does education produce higher income? Produce a scatter plot of EDUC (on the X axis) against GDPC. What do you find? What conclusion do you draw?
e. Do higher levels of education appear to be associated with higher life expectancy?

SOLUTION a. The figure tells the story.
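For readers working outside Minitab, the numbers behind a grouped box plot like the one in part a can be computed with a few lines of Python. This is only a sketch under stated assumptions: the file name WHO-HealthStudy.csv is hypothetical (it assumes the Minitab worksheet has been exported to CSV with the column names listed above), and the helper names `five_number`, `summarize_by_group`, and `load_rows` are illustrative. Quartiles use the default method of `statistics.quantiles`, which may differ slightly from the values Minitab draws.

```python
import csv
from statistics import quantiles


def five_number(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least two values."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, med, q3, ordered[-1]


def summarize_by_group(rows, value_key, group_key):
    """Group rows (dicts) by group_key and summarize value_key in each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(float(row[value_key]))
    return {group: five_number(vals) for group, vals in groups.items()}


def load_rows(path):
    """Read a CSV export of the worksheet into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical usage, assuming a CSV export exists under this name:
# rows = load_rows("WHO-HealthStudy.csv")
# for group, summary in sorted(summarize_by_group(rows, "GDPC", "OECD").items()):
#     print("OECD =", group, summary)
```

Comparing the two tuples (OECD = 0 and OECD = 1) side by side gives the same information the grouped box plot displays graphically.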

b. The full set of descriptive statistics is given below.

The two variables have different scales, so we cannot plot them on the same figure (at least not reasonably). Here is the box plot of the two together, for example.

[Figure: Boxplot of DALE, GDPC. Vertical axis "Data", 0 to 30000.]

Here is a possibly informative histogram of DALE.

The boxplot of DALE provides only five benchmarks (minimum, Q1, median, Q3, maximum). We had these numbers from Descriptive Statistics, and the boxplot makes them into a visual display:

[Figure: Boxplot of DALE. Vertical axis "DALE", 20 to 80.]

Minitab provides a nice summary for a variable, using Stat -> Basic Statistics -> Graphical Summary.

[Figure: Summary for GDPC. Horizontal axis 0 to 30000. The statistics panel begins with the Anderson-Darling Normality Test (A-Squared, P-Value).]