Practice for Final - Stats 12 Fall 2008 Ryan Rosario - Sections 1AB

1. In a 2007 data mining project performed by your TA, 4900 UCLA Facebook profiles and friend lists were downloaded to a hard disk using a Python crawler. The crawler starts by choosing a random UCLA Facebook profile, downloads it to disk, and then accesses the list of that user’s UCLA Facebook friends. The crawler then visits the profiles of all of those friends, moves on to their friends, and so on. One aim was to determine whether third party (non-Facebook developed) wall applications would bias an analysis of standard Wall posts. The number of profiles containing each active third party Wall box is listed below.

Application Name         Profiles
SuperWall                559
FunWall                  206
SuperWall and FunWall    84
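The crawl described in the problem statement can be pictured with a minimal sketch like the one below. It is not the TA's actual crawler: the toy friend graph, the function name crawl, and the max_profiles cap are all hypothetical, and "downloading" is reduced to recording the profile id.

```python
import random
from collections import deque


def crawl(friends, max_profiles, seed=0):
    """Breadth-first crawl over a toy friend graph (hypothetical data).

    friends: dict mapping a profile id to a list of friend ids.
    Returns the profile ids in the order they were "downloaded".
    """
    rng = random.Random(seed)
    start = rng.choice(sorted(friends))  # random starting profile
    downloaded = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(downloaded) < max_profiles:
        current = queue.popleft()
        # Visit every friend of the current profile, then their friends, etc.
        for friend in friends.get(current, []):
            if friend not in seen:
                seen.add(friend)
                downloaded.append(friend)
                queue.append(friend)
                if len(downloaded) >= max_profiles:
                    break
    return downloaded


# Toy graph: four connected profiles and one with no friends (assumed data).
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b"],
    "e": [],
}
sample = crawl(toy_graph, max_profiles=4900)
```

In this sketch, the profiles collected depend entirely on which profile the crawl starts from and on who is linked to whom.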

(a) Draw an appropriate Venn diagram to represent this context. Label the diagram with probabilities. Let F represent FunWall and let S represent SuperWall.

(b) Find the probability that a randomly selected UCLA Facebook profile in this sample does not use either of these third party wall applications.


(c) Suppose my ultimate goal was to determine the proportion of all Facebook users that do not have either of these wall applications active. Suppose I use a confidence interval. Provide two reasons why using a confidence interval for this analysis is not valid. You do not need to know anything about Facebook to answer this question. Hint: Reread this entire problem very carefully, and read the bolded statements.

(d) Consider the following table that breaks down the number of UCLA Facebook users in the sample that displayed their Wall and those that have their Walls hidden (hidden to the crawler) by gender.

Sex             Wall Displayed    Wall Hidden    Row Total
Male            2164              188            2352
Female          1835              713            2548
Column Total    3999              901            4900

i. Among those with hidden Walls, what is the probability that the user is female?

ii. Are gender and Wall privacy independent? Why or why not? Use a probabilistic argument!


(e) Suppose the crawler flips a coin to determine whether or not to download the current profile to disk. If the coin shows heads then the profile is downloaded to disk, otherwise we just skip over it without saving any of its data. If it downloads the profile to disk, it will visit that profile’s friends with probability 0.75. If it does not download the profile to disk, it will visit that profile’s friends with probability 0.1.

i. Draw the tree or contingency table that corresponds with the situation described above. Annotate all events and probabilities for each branch.

ii. Find the probability that the crawler will visit the profiles of the current user’s friends.

iii. Suppose we know that the crawler will visit the current user’s friends. What is the probability that the current user’s profile was downloaded to disk?


Reach for the STARs. In 1998, Gray Davis approved the Standardized Testing and Reporting Program, which mandated the use of the Stanford Achievement Test, Ninth Edition as the sole norm-referenced measure of educational outcomes in the state. In 2003, the California Department of Education approved replacing Stanford 9 with another test, the California Achievement Tests, Sixth Edition, but only required 3rd and 7th graders to take the test. Everybody else had to take a different battery of tests referred to as the Content Standards Test. In this problem, we will analyze a couple of facets of this decision using the material we have studied this quarter.

2. During the research phase leading to this decision, a sample of school districts was selected (and paid) to administer both Stanford 9 and CAT/6 to all of their students. The table below displays 10 pairs of scaled scores for the Language subtest of each battery. Scaled scores take into account the difficulty of items on the test as a way of normalizing among different forms. Scaled scores on Stanford 9 range from 200 to 999. Scaled scores on CAT/6 Language range from 0 to 999.

Stanford 9      200  310  450  520  600  650  732  810  900  960
CAT/6 Survey     50  125  500  600  670  690  780  820  950  999

(a) Make a scatterplot of these test scores. Clearly denote what x and y represent.


(b) Compute the sample correlation coefficient r between Stanford 9 Language and CAT/6 Language. To maximize partial credit, make sure to show all of your work, including s_x, s_y, x̄, and ȳ, as well as the formula and the numbers you have plugged in!

(c) For brevity, the sample in part a only has size 10. The actual correlation coefficient is r = 0.89. Compute the regression model equation using the following statistics: x̄ = 500, ȳ = 500, s_x = 200, s_y = 165. To maximize partial credit, show all steps including your computation of b̂_0 and b̂_1.

(d) Interpret the slope and intercept of your regression model. Comment on anything strange you may notice, what may have caused it, and how you may be able to resolve it.


(e) Using your regression model from the previous parts, predict the CAT/6 Language scaled score for a student who received a 610 on Stanford 9 Language. Suppose the student’s true CAT/6 Language score is 570. Compute the residual. Make a comment about the model’s prediction.

(f) Compute r² and interpret it.

(g) Suppose the psychometrician that did this study forgot to include a pairing of scores. This particular student scored 700 on Stanford 9 Language and 360 on CAT/6 Language. Select the correct statement. r will:

increase

decrease

remain about the same

3. In addition to national percentiles, test publishers may choose to report other metrics such as stanines and grade equivalents. Stanines express test performance on a scale from 1 to 9 with scores of 1, 2, and 3 representing “below average,” 4, 5, and 6 representing “average” and 7, 8, and 9 representing “above average.” Each stanine is 0.5 standard deviations in width, except for the first and ninth, which are larger (see the diagram below). Grade equivalents, on the other hand, are decimals ranging from 0.0 to 12.9 in 0.1 increments that express scaled scores as an approximate grade level and time of year. The digit before the decimal (0 to 12) represents grade level and the digit after the decimal (0 to 9) represents the month of the school year, assuming a 10-month school year. Grade equivalents allow educational administrators to gauge approximate grade level improvement over time. Suppose on CAT/6 Reading the mean scaled score is 500 with standard deviation 150.
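As a rough illustration of the two scales just described, the sketch below maps a scaled score to a stanine and splits a grade equivalent into grade and month. It assumes the usual convention that stanine 5 is centered on the mean, so the boundaries sit at z = ±0.25, ±0.75, ±1.25, and ±1.75, and that a score exactly on a boundary goes to the higher stanine. The function names and the example mean of 100 with standard deviation 10 are hypothetical and are not the CAT/6 values.

```python
from bisect import bisect_right

# Assumed z-score boundaries between stanines 1|2, 2|3, ..., 8|9
# (stanine 5 centered at z = 0, each inner stanine 0.5 SD wide).
CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(score, mean, sd):
    """Return the stanine (1 to 9) for a scaled score."""
    z = (score - mean) / sd
    # Count how many boundaries lie at or below z; that many stanines are passed.
    return bisect_right(CUTS, z) + 1


def split_grade_equivalent(ge):
    """Split a grade equivalent such as 3.7 into (grade, month)."""
    tenths = round(ge * 10)  # work in whole tenths to avoid float noise
    return divmod(tenths, 10)


# Hypothetical test with mean 100 and standard deviation 10.
at_mean = stanine(100, mean=100, sd=10)      # z = 0.0
below = stanine(92, mean=100, sd=10)         # z = -0.8
far_above = stanine(120, mean=100, sd=10)    # z = 2.0
grade, month = split_grade_equivalent(3.7)
```

Working in whole tenths keeps the grade and month digits exact even though values like 3.7 are not represented exactly in floating point.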

For parts a, b, and c, use the following graphic (carefully) to help answer the following questions.

[Figure: normal curve divided into stanines 1 through 9. Horizontal axis: z-score, from −3 to 3. Vertical axis: density, from 0.0 to 0.4.]

(a) On the graphic above, annotate each stanine with its area (percentage of scores falling in that stanine). Hint: Remember the normal distribution is symmetric!


(b) Suppose identical twins (with identical intelligence) Ted and Ned both take this test and receive scaled scores of 605 and 614 respectively. Compute Ted and Ned’s stanines.

Ted’s stanine:

Ned’s stanine:

(c) Based on your answer to part b, describe one criticism of the use of stanine scores.

(d) Convert Ted’s score to a grade equivalent. Assume that Ted is a third grader and is tested during the seventh month of instruction (about March). An average third grader tested in March should receive a grade equivalent of 3.7. Assume the standard deviation to be 0.2 (2 months).


4. The Mad Tea Party. When Alice walks through the looking glass in her quest to find out why the White Rabbit is in such a hurry, she stumbles upon something that will change her life forever - she meets the Mad Hatter and March Hare.

Mad Hatter: Statistics prove, prove that you’ve one birthday!
March Hare: Imagine, just one birthday every year.
Mad Hatter: Ah, but there are 364 UNbirthdays!
March Hare: Precisely why we’re gathered here to cheer!

Apparently all that tea caused them to forget about leap year! Assume that a year consists of only 365 days.

(a) In a tea party consisting of Alice, Mad Hatter, March Hare, Cheshire Cat and Dormouse, find the probability that none of them has a birthday today (they are all celebrating an unbirthday).

(b) In the same tea party, find the probability that at least one of them has a birthday and spoils the whole celebration.

(c) Suppose we hold a larger tea party. Find the probability that the 8th person to arrive is the first to spoil the party.


(d) Now suppose all of Wonderland (population 5,000) is invited: rabbits, flowers, brooms, people etc. What is the probability that at least 5 attendees spoil the festivities?

The rest of the song for your entertainment:

Alice: Well then today’s my unbirthday too!
Mad Hatter: It is???
March Hare: What a small world this is!
Mad Hatter: In THAT case...
Both: A very merry unbirthday to...
Alice: To me?
March Hare: To you!
Mad Hatter: a very merry unbirthday
Alice: for me?
Mad Hatter: for you!
March Hare: now blow the candle out my dear and make your wish come true! hoo hoo!
Both: A very merry unbirthday to you!!!!
Dormouse: Twinkle, twinkle little bat. How I wonder what you’re at. Up above the world you fly, like a tea tray in the sky!


5. To Be or Not to Be. Problems involving authenticity of historical and literary documents such as the Federalist Papers, Shakespeare’s tragedies and Ronald Reagan’s speeches have been analyzed using statistical methods. On average, Shakespeare writes 1,500 words per act. A researcher counts the number of words in each act of Hamlet, and finds that there are, on average, 1,600 per act with a standard deviation of 125, and there are 5 acts in Hamlet. The researcher found this odd and decided to perform a hypothesis test to determine whether or not Hamlet was unusually long. Test at 5%.

(a) Set up the proper null and alternate hypotheses.

(b) Calculate the test statistic and draw the appropriate picture.

(c) State your decision.

(d) What do you conclude?


6. Carl Icahn Strikes Back. Time-Warner, in its effort to take control over the world, has suggested a pay-per-megabyte policy that requires users to pay for the amount of bandwidth they use on a monthly basis. Suppose Time-Warner wants to classify users that transfer (download) an average of more than 20 GB per month into its most expensive per-megabyte pricing plan. Ryan gets placed in this category because of his most recent usage and is not happy. Over the past 48 months, Ryan’s router reports that he downloads an average of 20.5 GB per month with a standard deviation of 3 GB. Is it statistically reasonable to place Ryan in this group (assuming they value statistics)?

(a) Construct the proper null and alternate hypotheses.

(b) Calculate the appropriate test statistic, draw a picture, find the p-value and state your decision (reject/fail to reject).

(c) Make your conclusion in the context of the problem.

(d) Let pval represent the p-value. The p-value indicates that

a) the probability that the null hypothesis is true is pval.
b) the probability we reject the null hypothesis when the null hypothesis is true is pval.
c) the probability that we observe a sample mean as extreme or more extreme than the one observed is pval, if H0 is true.
d) the probability of making a Type II error is pval.


(e) Test the same hypothesis above using a confidence interval. Clearly state your confidence level, find the confidence interval and make your conclusion in terms of the confidence interval.

(f) Gigabytes downloaded per month follows a normal distribution with mean 10.0 and standard deviation 5.0. Find the probability that the mean gigabytes downloaded per month for a random sample of 100 Time-Warner customers from around the country is between 9.5 and 10.5 gigabytes.

(g) Sophisticated statistical analyses are being developed to classify Internet traffic without having to actually peek at what is being transmitted. Suppose Time-Warner already uses this technology and can actively profile BitTorrent users. They find that a sample of 41 BitTorrent users download an average of 11.0 GB per month with a standard deviation of 6.0 GB. On the other hand, they find that a sample of 60 normal users download an average of 9 GB per month with a standard deviation of 5 GB. The motive is that if the analyst can find a significant difference between the two groups of users, then Time-Warner can place both of these groups of users into different pricing brackets.

i. Construct the null and alternate hypotheses.


ii. Calculate the appropriate test statistic and draw the corresponding picture.

iii. State your decision (reject/do not reject).

iv. Estimate the p-value.

v. Make your conclusion in the context of the stated problem.

vi. Construct a 95% confidence interval for the difference between the two means.


vii. Interpret your confidence interval in the context of the problem.
