Math 17: Intro Stats Final Exam December 19, 2010

Author: Ethan Anderson
Name:__________________________

Directions: Before you leave, you must turn in both this exam sheet and any statistical tables. If you do not, you will receive a significant grade reduction. You are allowed to use a calculator and a two-sided sheet of notes for this exam. All cell phones, PDAs, iPods, laptops, etc., should be turned off and put out of sight. You may not discuss the exam with anyone but me. In total, this exam is worth 200 points. You have the entire period to complete this exam.

Part I – Multiple Choice: There is only ONE correct response per question. Each question is worth 5 points. There are a total of 10 questions for a combined total of 50 points. Clearly circle or write your answer in front of each question.

Part I             Total
Score
Possible Points    50

Part II – Free Response: You must show all work in order to receive full credit. Each question is worth a different number of points, and this value is noted in the table below. There are a total of 8 questions with multiple parts for a combined total of 150 points.

Part II            1    2    3    4    5    6    7    8    Total
Score
Possible Points    12   20   20   18   25   18   12   25   150

Here is my suggestion: Read all questions before beginning and try to complete the ones you know best first. GOOD LUCK!!!

PART I: MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

1. Which of the following variables is a binomial random variable?
A) the time it takes a randomly selected student to complete a multiple choice exam
B) the number of textbooks a randomly selected student bought this semester
C) the number of women taller than 63 inches in a random sample of 10 women
D) the number of CDs a randomly selected person owns
E) the hair colors found in a random sample of 10 students

2. Which statement is not true about confidence intervals?
A) A confidence interval is an interval of values computed from sample data that is likely to include the true population parameter value.
B) An approximate formula for a 95% confidence interval is sample estimate ± margin of error.
C) A confidence interval between 20% and 40% means that the population proportion definitely lies between 20% and 40%.
D) A 99% confidence interval procedure has a higher probability of producing intervals that will include the population parameter than a 95% confidence interval procedure.
E) Confidence intervals are (by definition) statistical inference procedures.

3. True or False: The p-value is the probability that the null hypothesis is true.
A) True
B) False

4. Which of the following pairs of variables, as defined, could not be used to calculate a chi-square statistic?
A) Resting Heart Rate (beats/min) and Actual Age (yrs)
B) Whether or Not the Student Has a Job and Year in School (Fr, So, Jr, Se)
C) Gender and Opinion About Capital Punishment
D) Age Group (under 20 yrs, 21-29 yrs, 30-39 yrs, etc.) and Favorite Type of Music
E) None of the above

5. A statistics professor wants to know how her students feel about an introductory statistics course and decides to administer a survey to a random sample of students taking the course. If she randomly selects a class rank (freshmen, sophomores, juniors, or seniors) and surveys every student in that class rank, what sampling method does she use?
A) simple random sampling
B) stratified random sampling
C) cluster sampling
D) sampling with replacement
E) systematic sampling

6. Which statement is true about both p̂ and μ?
A) They are both parameters
B) They are both statistics
C) They are both symbols pertaining to means
D) μ is a statistic and p̂ is a parameter
E) μ is a parameter and p̂ is a statistic

7. A short quiz has one true/false question and two multiple choice questions with five choices. A student guesses at each question. Assuming the choices are all equally likely, what is the probability that the student gets all three correct (assume independence)?
A) 1/3
B) 1/5
C) 1/10
D) 1/25
E) 1/50

8. Suppose you have to cross a train track on your commute to school. The probability that you will have to wait for a train is .2. If you don't have to wait for the train, the commute takes 15 min, but if you have to wait it takes 20 min. What is the correct set-up for calculating the expected time it takes you to commute to school?
A) Expected Commute Time = 15(.2) + 20(.8)
B) Expected Commute Time = 20(.2) + 15(.8)
C) Expected Commute Time = 20(.2) - 15(.8)
D) Expected Commute Time = 20(15) + (.2)(.8)
E) None of the above

9. In a survey, students are asked how many hours they study in a typical week. A five-number summary of the responses is: 2, 9, 14, 20, 60. Which interval describes the number of hours spent studying in a typical week for about 25% of the students sampled?
A) 2 to 14
B) 2 to 60
C) 9 to 20
D) 14 to 60
E) 20 to 60

10. Which of the following correlation values indicates the strongest linear relationship between two quantitative variables?
A) r = –0.65
B) r = –0.30
C) r = 0.00
D) r = 0.11
E) r = 0.60

PART II: FREE RESPONSE. Write the word or phrase that best completes each statement or answers the question.

President Marx, in his last year of presidency, would like to know more about students at Amherst and has formed a committee of twenty-eight to analyze a campus-wide online survey recently done via SurveyMonkey.com. The sample consists of responses from n = 173 randomly selected students and is believed to be representative of the student body at Amherst. The Independence Assumption is satisfied, so do NOT worry about checking the randomization condition and the 10% condition in Q1 – Q8. (You DO need to check other conditions when doing inference below.)

1. Three committee members, Andrea, Natasha, and Brian (Safstrom), are particularly interested in the lives of athletes on the Amherst campus. They notice that among the 173 students in the survey, 53 participate in varsity sports.

A. Based on the survey results, construct and interpret a 95% confidence interval for the proportion of Amherst students who are varsity athletes.

B. Explain the meaning of “95% confidence” in part A.
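As a way to check the arithmetic in part A, here is a minimal sketch of a one-proportion z-interval, assuming the usual formula p̂ ± z*·sqrt(p̂(1 − p̂)/n) is the intended method. The function name `proportion_ci` is hypothetical and not part of the exam.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """One-proportion z-interval (hypothetical helper, not from the exam)."""
    # Success/failure condition: at least 10 successes and 10 failures.
    if successes < 10 or n - successes < 10:
        raise ValueError("success/failure condition not met")
    p_hat = successes / n
    # Critical value z* for the requested confidence level.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 53 varsity athletes among the 173 surveyed students.
low, high = proportion_ci(53, 173)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The interval still has to be interpreted in context, which the code does not do.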


2. Another three committee members, Christina, Julio-C, and Brian (Smith), also want to know more about Amherst athletes. They are curious whether there is an association between students' ethnicity and whether they are varsity athletes.

A. Name the appropriate analysis to perform (be specific) and state appropriate hypotheses.

Analysis:

Null:

Alternative:

B. The two-way table below summarizes the results from the survey. Each cell is laid out as Observed (Expected). Use the observed counts to answer the questions in this part.

Ethnicity \ VarsityAthlete    YES            NO             Total
African American (A.A.)       5  ( 6.43 )    16 ( 14.57 )   21
Asia-Pacific (A.P.)           3  ( 6.13 )    17 ( 13.87 )   20
Caucasian                     42 ( 30.64 )   58 (       )   100
Hispanic                      2  ( 5.82 )    17 ( 13.18 )   19
Others                        1  (       )   12 (       )   13
Total                         53             120            173

What is the conditional distribution (in %) of ethnicity for varsity athlete students?

A.A. ______   A.P. ______   Caucasian ______   Hispanic ______   Others ______

For non-varsity-athlete students?

A.A. ______   A.P. ______   Caucasian ______   Hispanic ______   Others ______

C. Write one sentence or two to describe differences in the conditional distributions above. Based on those differences, do you think varsity athlete status is independent of ethnicity? (Do not do any inference yet.)


D. Some expected counts for inference are missing in the table. Compute and fill in those missing expected counts in the parentheses.

E. How many degrees of freedom?

F. Assume the assumptions for the test are met. The test statistic works out to be 15.6748 with a P-value equal to .003488. State your complete conclusion in context.

G. Compute the chi-square component of the cell {Caucasian\ YES}. Does this cell arouse your suspicion? What additional information do you get from this cell?
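The expected counts, chi-square components, and test statistic in parts D through G can be cross-checked with a short sketch. It assumes the standard formulas expected = (row total × column total) / grand total and component = (observed − expected)² / expected. The upper-tail helper uses the closed form for the chi-square distribution that holds only for even degrees of freedom. All names are hypothetical.

```python
from math import exp

# Observed counts from the two-way table: ethnicity -> (varsity YES, varsity NO).
observed = {
    "African American": (5, 16),
    "Asia-Pacific": (3, 17),
    "Caucasian": (42, 58),
    "Hispanic": (2, 17),
    "Others": (1, 12),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

expected = {}
components = {}
for group, row in observed.items():
    row_total = sum(row)
    # Expected count under independence for each cell in this row.
    expected[group] = tuple(row_total * c / grand_total for c in col_totals)
    components[group] = tuple(
        (o - e) ** 2 / e for o, e in zip(row, expected[group])
    )

chi_square = sum(sum(cells) for cells in components.values())
df = (len(observed) - 1) * (len(col_totals) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


p_value = chi2_upper_tail_even_df(chi_square, df)
print(f"chi-square = {chi_square:.4f}, df = {df}, P-value = {p_value:.6f}")
print(f"Caucasian/YES component = {components['Caucasian'][0]:.3f}")
```

The printed statistic and P-value should agree with the values quoted in part F up to rounding.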

3. Unsurprisingly, many committee members would like to know how well Amherst students perform in academia. Four of them, Dan, Victor, Randy, and Ophelia, use Rcmdr to obtain the descriptive statistics for the variable, GPA. The output is shown below.

       mean       sd          0%    25%   50%   75%      100%
GPA    3.553437   0.2965505   2.4   3.4   3.6   3.7625   4

A. What is the shape of the corresponding histogram? (symmetric/ right-skewed/ left-skewed.)


B. Are there any outliers? Explain.

C. When describing center and spread, which set of summary statistics should we use? Mean and standard deviation, or median and IQR? Why?

Three committee members, Christopher, Allen, and Arlen, decide to get a whole picture of students' academic performance, so they call the Registrar and obtain all students' GPAs. They are informed that a Normal model with an average of 3.57 and a standard deviation of 0.15 is appropriate for Amherst students' GPAs.

D. What percent of Amherst students achieve a GPA greater than 3.90?

E. What is the probability that the mean GPA of 28 randomly selected students will be less than 3.48?
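Parts D and E can be checked numerically with the Normal model stated above. This sketch assumes the sampling distribution of the mean of 28 GPAs is Normal with standard deviation σ/√n. The variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 3.57, 0.15  # Normal model for individual GPAs, as given

# Part D: proportion of students with a GPA above 3.90.
gpa = NormalDist(mu=MU, sigma=SIGMA)
p_above_390 = 1 - gpa.cdf(3.90)

# Part E: the mean of n = 28 GPAs has standard deviation sigma / sqrt(n).
n = 28
mean_gpa = NormalDist(mu=MU, sigma=SIGMA / sqrt(n))
p_mean_below_348 = mean_gpa.cdf(3.48)

print(f"P(GPA > 3.90) = {p_above_390:.4f}")
print(f"P(mean of {n} GPAs < 3.48) = {p_mean_below_348:.5f}")
```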


4. Claire, Tian, and Chris, another three committee members who are interested in students' academic success, would like to check if there is a difference between the average GPA of Varsity & Club athletes ('VarsClub') and the average GPA of Intramural & Non-athletes ('IntramNon').

A. Indicate what inference procedure you would use to investigate this question. Why?

B. Write appropriate hypotheses.

C. Assume the assumptions for the test are met. Partial Rcmdr output is given below. Use the output to complete the test procedure at a .10 significance level.

data: GPA2 by Sport2
t = 1.0405, df = 151.606, p-value = --------
alternative hypothesis: true difference in means (‘IntramNon’ – ‘VarsClub’) is not equal to 0
sample estimates:
mean in group IntramNon    mean in group VarsClub
3.579211                   3.530119

P-value: Conclusion:

D. Given your conclusion above, which type of error (Type I or Type II) could be made? Explain this type of error in context.


5. Due to their concern about sleep deprivation among college students, Kate, Melissa and Camille take a closer look at the variable Sleep (the number of hours per night students sleep) and Study (the number of hours per week students study). They are interested in knowing whether the variable Sleep is a significant predictor of the variable Study. The plots attached below are the scatterplot, the residuals plot, and a histogram of the residuals (in order), with the regression analysis for the data. Use this information to analyze the association between hours sleeping per night and hours studying per week.

[Figure: scatterplot of Study (vertical axis) against Sleep (horizontal axis)]

[Figure: residuals plot of residuals.RegModel (vertical axis) against fitted.RegModel (horizontal axis)]

[Figure: histogram of Project$residuals.RegModel (vertical axis: frequency)]

Call: lm(formula = Study ~ Sleep, data = Project)

Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   43.9907    6.3760       6.899     9.73e-11 ***
Sleep         -3.1994    0.9055       -3.533    0.000528 ***

Residual standard error: 11.69 on 171 degrees of freedom
Multiple R-squared: 0.06804
F-statistic: 12.48 on 1 and 171 DF, p-value: 0.0005278

A. What is the value of the correlation coefficient?

B. Based on the output, what is the equation of the regression line?

C. For a student who sleeps 7 hours per night and studies 20 hours per week, what is his/her residual?

D. Is there an association between hours sleeping per night and hours studying per week? Write appropriate hypotheses, check and explain whether the assumptions for regression are satisfied, provide the corresponding test statistic and P-value, and then state your conclusion about the association.

E. Create a 95% confidence interval for the true slope and explain in context what your interval means.

F. Is the variable Sleep a good predictor of the variable Study? Explain and use statistics to support your answer.
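The numerical pieces of this question (correlation, fitted value, residual, slope t statistic, and slope interval) follow directly from the regression output. The sketch below assumes the correlation takes the sign of the slope and uses a critical value t* ≈ 1.974 for 171 degrees of freedom. That value is an approximation read from a t-table, not something printed in the output, and all names are hypothetical.

```python
from math import copysign, sqrt

# Values copied from the lm() output for Study ~ Sleep.
INTERCEPT, SLOPE, SLOPE_SE = 43.9907, -3.1994, 0.9055
R_SQUARED = 0.06804

# Assumption: approximate 95% critical value for t with 171 df, from a t-table.
T_STAR_95 = 1.974

# Part A: r is the square root of R-squared, signed like the slope.
r = copysign(sqrt(R_SQUARED), SLOPE)


def predict_study(sleep_hours):
    """Predicted weekly study hours from the fitted line."""
    return INTERCEPT + SLOPE * sleep_hours


# Part C: residual = observed - predicted for Sleep = 7, Study = 20.
residual = 20 - predict_study(7)

# Part D: t statistic for the slope (should match the output's t value).
t_stat = SLOPE / SLOPE_SE

# Part E: confidence interval for the true slope.
slope_ci = (SLOPE - T_STAR_95 * SLOPE_SE, SLOPE + T_STAR_95 * SLOPE_SE)

print(f"r = {r:.4f}, residual = {residual:.4f}, t = {t_stat:.3f}")
print(f"95% CI for slope: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
```

The condition checks in part D still depend on reading the three plots, which the code cannot replace.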


6. Speaking of sleeping, three committee members, Brendan, James, and Danny, recently heard of an interesting claim that college students tend to sleep more in a day than they work out in a week. They are not sure if this is the case at Amherst and would like to see if the survey supports this claim.

A. Explain why this data is an example of paired data.

B. Write appropriate hypotheses (in words and in symbols).

C. Assuming all assumptions are satisfied to perform the test, use the partial Rcmdr output below to complete the test procedure. Make sure to state your conclusion in context.

data: Project$Sleep and Project$WorkOut
t = -2.3136, df = 172, p-value = 0.989
alternative hypothesis: true difference in means (‘Sleep – WorkOut’) is greater than 0

D. Danny further notices that the average number of hours sleeping is 6.9728 hours/night, while the average amount of time students exercise is 8.0578 hours/week. He can't help but wonder: maybe Amherst College students work out more in a week than they sleep in a day? Use Mean('Workout-Sleep') = 1.0850 and St. Dev('Workout-Sleep') = 6.168 to create a 90% confidence interval for the difference in the paired means. Does your interval provide evidence to support Danny's assertion? Explain.
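The interval in part D is a paired t-interval, mean difference ± t*·s/√n. The sketch below assumes a critical value t* ≈ 1.654 for 172 degrees of freedom at 90% confidence, an approximation read from a t-table rather than a value given on the exam. It also recomputes the t statistic, which should match the magnitude in the part C output with the opposite sign, because the differences here are taken as Workout − Sleep.

```python
from math import sqrt

# Summary statistics for the paired differences 'Workout-Sleep', as given.
MEAN_DIFF, SD_DIFF, N = 1.0850, 6.168, 173

# Assumption: approximate 90% critical value for t with 172 df, from a t-table.
T_STAR_90 = 1.654

std_error = SD_DIFF / sqrt(N)

# t statistic for the mean paired difference.
t_stat = MEAN_DIFF / std_error

margin = T_STAR_90 * std_error
paired_ci = (MEAN_DIFF - margin, MEAN_DIFF + margin)

print(f"SE = {std_error:.4f}, t = {t_stat:.4f}")
print(f"90% CI for mean(Workout - Sleep): ({paired_ci[0]:.3f}, {paired_ci[1]:.3f})")
```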


7. On a different issue, Haley, Colin, and Saumitra are investigating the possibility of ROTC's return and would like to know if more females than males oppose the reinstitution of ROTC (Reserve Officers' Training Corps) on campus. The results are shown in the table below.

ROTC \ Gender    Female   Male
NO/oppose        78       40
YES/support      18       37

A. Test an appropriate hypothesis and state your conclusion.

B. Explain what your P-value means in the context of the problem.
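One plausible reading of this question is a one-sided two-proportion z-test comparing the proportion of females who oppose with the proportion of males who oppose. The sketch below assumes that reading and the pooled standard error. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test with alternative p1 > p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # Upper-tail P-value for the one-sided alternative.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Females: 78 oppose out of 78 + 18; males: 40 oppose out of 40 + 37.
z, p_value = two_proportion_z(78, 78 + 18, 40, 40 + 37)
print(f"z = {z:.3f}, one-sided P-value = {p_value:.2e}")
```

A chi-square test of homogeneity on the same table would be a two-sided alternative to this approach.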


8. Amherst College is well known for its diverse student body, and three committee members, Amal, Emmanuel, and Crystal, would like to know if students' first spoken language affects the total number of languages they speak. They categorize the student body into four groups according to their first spoken language: English, Spanish, Chinese, and Others. Perform an ANOVA to look for differences in the mean number of languages spoken for students with different first spoken languages. Use an α-level of .01.

A. State appropriate hypotheses.

B. What assumptions need to hold in order for the ANOVA to be valid?

Assuming the assumptions hold, use the partial output from R below to complete the test and multiple comparisons (if appropriate).

                Df    Sum Sq   Mean Sq   F value   Pr(>F)
FirstLanguage   3     30.205   10.068    ???       2.244e-10 ***
Residuals       169   92.500   0.547

Simultaneous Confidence Intervals
Multiple Comparisons of Means: Tukey Contrasts
95% family-wise confidence level

Linear Hypotheses:
                         Estimate   lwr       upr
English - Chinese == 0   -0.8750    -1.8317   0.0817
Others - Chinese == 0    0.1250     -0.9301   1.1801
Spanish - Chinese == 0   0.5000     -0.6342   1.6342
Others - English == 0    1.0000     0.5026    1.4974
Spanish - English == 0   1.3750     0.7265    2.0235
Spanish - Others == 0    0.3750     -0.4114   1.1614


C. What is the missing value of the test statistic?

D. What is the P-value of the hypothesis test?

E. What is your conclusion?

F. What are the results of multiple comparisons?

G. The normality assumption is in fact questionable due to very small sample sizes of Chinese and Spanish (see below), so it's probably better to run a non-parametric test to verify the above conclusion. Name a nonparametric alternative to this test procedure.

          mean    sd          n
Chinese   2.500   0.5773503   4
English   1.625   0.7181464   144
Others    2.625   0.9574271   16
Spanish   3.000   0.7071068   9
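The missing F value in part C and the reading of the Tukey intervals in part F can be sketched as follows, assuming F = MS(FirstLanguage) / MS(Residuals) and treating a pairwise difference as significant when its interval excludes 0. The intervals in the output are at a 95% family-wise level, so they do not correspond exactly to the α = .01 level used for the overall test. Variable names are hypothetical.

```python
# ANOVA table entries copied from the R output.
SS_TREATMENT, DF_TREATMENT = 30.205, 3
SS_ERROR, DF_ERROR = 92.500, 169

ms_treatment = SS_TREATMENT / DF_TREATMENT
ms_error = SS_ERROR / DF_ERROR
f_stat = ms_treatment / ms_error

# Tukey simultaneous intervals: contrast -> (lower, upper).
tukey_intervals = {
    "English - Chinese": (-1.8317, 0.0817),
    "Others - Chinese": (-0.9301, 1.1801),
    "Spanish - Chinese": (-0.6342, 1.6342),
    "Others - English": (0.5026, 1.4974),
    "Spanish - English": (0.7265, 2.0235),
    "Spanish - Others": (-0.4114, 1.1614),
}

# A contrast is flagged when its interval lies entirely on one side of 0.
significant = [
    name for name, (lwr, upr) in tukey_intervals.items() if lwr > 0 or upr < 0
]

print(f"MS treatment = {ms_treatment:.3f}, MS error = {ms_error:.3f}")
print(f"F = {f_stat:.2f}")
print("Intervals excluding 0:", significant)
```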
