The Standard Normal Distribution

Author: Marcia Goodwin
Elementary Statistics, Spring 2012

Chapter 6: The Standard Normal Distribution

Goal: To become familiar with how to use Excel 2007/2010 for the Normal Distribution.

Instructions: There are four Stat Tools in Excel that deal with the Normal Distribution: NORM.S.DIST, NORM.S.INV, NORM.DIST and NORM.INV. The first two are used when we are working with the Standard Normal Distribution. Open Excel and click on the Stat button in the Quick Access Bar. Scroll down until you see NORM.S.DIST. (In Excel 2007 the names are spelled without the periods: NORMSDIST, NORMSINV, NORMDIST and NORMINV.) Select that tool. Here is what you should see:

Enter 1.50 for z and true for Cumulative. Midway down the tool screen on the right, you’ll see the answer. It should read 0.93319. Try it. This is the area under the curve and to the left of z:


The second tool that deals with the Standard Normal Distribution is NORM.S.INV.

Enter 0.93319 for Probability. You should see the value 1.49999 returned, which is 1.50 up to rounding error. As you can see, NORM.S.INV is the inverse of NORM.S.DIST. It’s used when you have a probability and you want the corresponding z-score.
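For readers without Excel at hand, the same round trip can be sketched in Python. This is only an illustration, assuming Python 3.8 or later, where the standard library class `statistics.NormalDist` plays the role of the two Excel tools; the variable names are made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0 and standard deviation 1
# (these are the NormalDist defaults).
std_normal = NormalDist()

# Counterpart of NORM.S.DIST(1.50, TRUE): area to the left of z = 1.50.
area = std_normal.cdf(1.50)

# Counterpart of NORM.S.INV(area): recover the z-score from the area.
z = std_normal.inv_cdf(area)

print(round(area, 5))  # 0.93319
print(round(z, 4))     # 1.5
```

Because the full-precision area is fed back in, the z-score comes back as 1.5 rather than 1.49999.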


The next two tools are used when working with a generic Normal Distribution. The first is NORM.DIST:

This tool works in a manner very similar to NORM.S.DIST, except that it’s generic, so we also have to input the mean and standard deviation of the distribution we are working with. Enter the following values:

X: 140
Mean: 100
Standard_dev: 15
Cumulative: TRUE

The value 0.9962 should be returned. As an example, IQ has a normal distribution with mean 100 and standard deviation 15. Therefore, 99.62% of the population has an IQ below 140.


The last NORM tool is NORM.INV, the generic form of NORM.S.INV:

As an example, enter the following values:

Probability: 0.9962
Mean: 100
Standard_dev: 15

The value returned should be 140.040, the difference from 140 being due to rounding error. The result can be interpreted as follows. Given a normal distribution with mean 100 and standard deviation 15, below what value does 99.62% of the population lie? The answer is 140.
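The two generic tools can be mirrored the same way. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist` with the IQ mean and standard deviation from the text; it is an illustration, not part of the Excel exercise.

```python
from statistics import NormalDist

# IQ model from the text: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Counterpart of NORM.DIST(140, 100, 15, TRUE): share of the population below 140.
p_below_140 = iq.cdf(140)

# Counterpart of NORM.INV(0.9962, 100, 15): the score below which 99.62% lie.
score = iq.inv_cdf(0.9962)

print(round(p_below_140, 4))  # 0.9962
print(round(score, 1))        # 140.0
```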

Goal: To become familiar with how to describe a data set.

Reading: Triola, Chapters 2 and 3.

Instructions: Open the Excel spreadsheet, Test Scores Data Set.

Data

All data sets, whether they consist of discrete or continuous data, have a mean, a mode, and a median. These are the measures of center. They also have a standard deviation, which is a measure of how much the data varies relative to the mean of the data set.

The mean can be thought of as the balance point of the data. It is easily calculated by summing all the data in the set and then dividing by the size of the set. The following is the formula for calculating the mean:

$$\bar{x} = \frac{\sum x}{n}$$

The mode is the data item that appears most frequently in the list when the data is discrete. Later I will discuss the concept of the mode when applied to continuous data.

The median is the point halfway through the data. For example, let’s suppose that your data set consists of 51 rolls of the dice. If you sort the list from low value to high value, the datum corresponding to the 26th item in the list would be your median. If the value 7 is the 26th item in the list, then that’s your median. On the other hand, if there are an even number of items in your data list, say 50 for example, then the median would be the average of the 25th item and the 26th. For example, if the 25th data item had the value 6 and the 26th item had the value 7, then the median would be 6.5.

If you place your data into an Excel spreadsheet, then you can use the Stat Tools AVERAGE, MEDIAN, and MODE to find these values.

The standard deviation is a measure of variation. Two data sets can have the exact same mean, but if the data in one is more spread out from the mean, then that one will have a greater standard deviation. The standard deviation of a data set is given by the formula:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
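The measures of center and the standard deviation can be checked on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the ten dice totals are made up for illustration, and the standard deviation uses the sample form (dividing by n - 1).

```python
import math
import statistics

# Hypothetical data: ten totals from rolling a pair of dice (made-up values).
rolls = [2, 7, 7, 5, 9, 7, 6, 8, 11, 8]

mean = statistics.mean(rolls)      # sum of the data divided by its size
median = statistics.median(rolls)  # average of the 5th and 6th sorted values
mode = statistics.mode(rolls)      # most frequent value
s = statistics.stdev(rolls)        # sample standard deviation (divides by n - 1)

# The same standard deviation computed directly from the formula.
n = len(rolls)
s_by_hand = math.sqrt(sum((x - mean) ** 2 for x in rolls) / (n - 1))

print(mean, median, mode)  # 7 7.0 7
print(round(s, 3))         # 2.404
```

With an even number of items, `median` averages the two middle values, just as described for the 50-roll case.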

For example, the standard deviation of a typical set of 50 rolls of the dice might be 2.680. If your data is in an Excel spreadsheet, you can use the Stat Tool STDEV.S to calculate the standard deviation.

You can get a “picture” of your data by creating a histogram from it. First, you have to select the data range for the “buckets” or “bins”. For example, if your data consists of test scores, the bins might be 0-9, 10-19, 20-29, etc. As you go through the test scores, every time you encounter a grade that falls within the range of one of the bins, you add one to that bin. For example, out of a class of 100 students, if eight students scored 70, three scored 72, five scored 75, and four scored 78, a total of 20 would be added to the bin “70-79”. Sometimes the bin label might be just a single number, as in the dice rolls. There the bins would be 2, 3, 4, ..., 12.

You can construct the histogram by hand, or you can use the Histogram tool in Excel. Using the bins along the x-axis and the quantities in the bins along the y-axis, you can create a graph and get a “picture” of your data. One of two shapes will usually emerge, symmetric or skewed. In a symmetric distribution, the left-hand side more or less mirrors the right-hand side. In a skewed distribution, one half of the distribution tends to stretch out in a long tail.
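The counting step behind a histogram can be sketched in a few lines of Python. The helper `bin_label` is a hypothetical name introduced here, and the three scores outside the 70s are made up so that more than one bin appears.

```python
from collections import Counter

# Scores from the example in the text (8 + 3 + 5 + 4 = 20 in the 70s),
# plus three made-up scores that land in other bins.
scores = [70] * 8 + [72] * 3 + [75] * 5 + [78] * 4 + [65, 88, 91]

def bin_label(score, width=10):
    """Return the label of the bin a score falls into, e.g. 72 -> '70-79'."""
    low = (score // width) * width
    return f"{low}-{low + width - 1}"

# Every score adds one to the count of its bin.
counts = Counter(bin_label(s) for s in scores)

for label in sorted(counts):
    print(label, counts[label])
```

The count for the bin "70-79" comes out as 20, matching the tally in the text.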

[Figure: two example histograms, one symmetric and one skewed]

Population vs Sample

The population is the whole set of data you are examining. Some examples are the heights of all males in the United States, or the weights of all freshmen on some campus, or the set of all sets of 50 rolls of the dice. Think a bit about that last one for a moment. It often isn’t practical to actually measure or count the entire population. The whole power behind Statistics lies in the fact that we can take a small sample from the population, and based on that relatively small sample, make various predictions about the entire population.

There is just one catch. The sample has to be truly representative of the population. The science of achieving this is actually quite complicated. You want to try to collect a simple random sample. This is a sample that is selected in such a manner that every possible sample of the same size has the same chance of being chosen. Now, read that several times until it makes sense. I did warn you that sampling theory is complicated.

The mean and standard deviation of a population are called parameters, and are denoted µ and σ, respectively. The corresponding measures of a sample are called statistics, and are denoted x̄ and s, respectively. Another popular measure is called the proportion, denoted p for the population and p̂ for the sample. It occurs when taking polls or measuring percentages. For example, we might be interested in the percentage of people who like Starbucks coffee. In this course, the measures mean, standard deviation and proportion are the ones that we will work with most of the time.
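The difference between a parameter and a statistic can be seen in a small simulation. This sketch uses the standard library function `random.sample`, which draws without replacement so that every subset of the requested size is equally likely. The population itself is made up: 10,000 simulated heights with an assumed mean of 70 inches and standard deviation of 2.8 inches.

```python
import random
import statistics

random.seed(2012)  # arbitrary seed so the run is repeatable

# Hypothetical population: 10,000 heights in inches (simulated, not real data).
population = [random.gauss(70.0, 2.8) for _ in range(10_000)]

# Simple random sample of size 50, drawn without replacement.
sample = random.sample(population, 50)

mu = statistics.fmean(population)  # population mean: a parameter
x_bar = statistics.fmean(sample)   # sample mean: a statistic

print(f"population mean = {mu:.2f}, sample mean = {x_bar:.2f}")
```

The two means will usually be close but not equal, which is the gap the rest of the course sets out to quantify.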


Histograms

The principles behind histogramming have been explained above. Here we will focus on how to histogram data in Excel. In column A, you list the data, for example, the results of giving a test to 100 students (see the Test Data set). In column B, you list the upper limit of each bin. Continuing with the Test Data set, if you want the bins to count the scores falling into 0-9, 10-19, 20-29, etc., you would list the numbers 9, 19, 29, etc. in column B. See the figure below. Next, you want to bring up the Histogram Tool. Select Data Analysis under the Data Tab. Then select Histogram from the list in the menu. You should see the following:

The easiest way to fill out the Input Range is to select all of the data in Column A, and then do the same for the Bin Range, using the values in Column B. Next, select the Output Range radio button. Now, there’s a bug here, so be careful. Click inside the edit box so that the cursor is blinking there. Then, just select some cell from the spreadsheet where you would like your data display to begin. Finally, check off Chart Output at the bottom and hit OK. If you are working with the rice Test Data set, you should see the following:


There are 50 entries in Column A and not all are shown here. Also, the width of the bins should be the same. However, you can see here that an exception was made for the score of 100. Also, take a good look at the histogram. Would you say that it is symmetrical or skewed? The following values are the mean, mode and median found using Excel. See where these values fall along the x-axis of the chart above.

Mean:   75.64
Mode:   76
Median: 76

One test of a symmetrical distribution is that the mean, mode and median are very close in value. What do these values have to say about the data set? If you had selected “skewed” as the answer in the previous paragraph and symmetrical in this one, how would you reconcile the two answers?

Goal: To become familiar with the Normal Probability Distribution.

Reading: Triola, Chapter 6, Sections 1-5

Why the normal curve is so common

The normal curve results from a stochastic process in which the random variable is a measurable characteristic, such as height or weight, that can be affected by many little errors of measurement. For example, if you take a metal ruler to measure some object, think about all the little variables that could affect the measurement. The ruler itself might not be that accurate. In the manufacturing of the ruler, many variables might have been a little off. The ambient temperature will affect the reading. The precision of the ruler will affect the reading. The list goes on and on. A good rule of thumb is that when you must use a device to measure something (length, weight, time, etc.), the results will most likely be distributed according to the normal curve.

The Standard Normal Distribution is a continuous probability distribution that has a bell-shaped graph. The mean is equal to 0, i.e. µ = 0, and the standard deviation is 1, i.e. σ = 1. As in any continuous probability distribution, the area under the curve is equal to 1.0.

Using Areas to Find Probabilities Given a z-score, we can find the area under the curve to the left of the score using the Excel tool, NORM.S.DIST: For example, if z = 0.76, then the area under the curve to the left of z equals 0.7764:

NORM.S.DIST(0.76,TRUE) = 0.7764

It is also true that P(z < 0.76) = 0.7764.

To find the area between two z-scores, you’ll need to use NORM.S.DIST twice, once for each score: For example, to find the probability of -2.0 < z < 1.5

Area = NORM.S.DIST(1.5,TRUE) - NORM.S.DIST(-2.0,TRUE) = 0.9332 - 0.0228 = 0.9104
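The subtraction of two left-hand areas can be verified outside Excel. This is a minimal sketch assuming Python 3.8 or later, with `statistics.NormalDist` standing in for NORM.S.DIST.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area to the left of each z-score, as NORM.S.DIST(z, TRUE) would return.
left_of_upper = std_normal.cdf(1.5)
left_of_lower = std_normal.cdf(-2.0)

# The area between the two scores is the difference of the two left-hand areas.
area_between = left_of_upper - left_of_lower

print(round(area_between, 4))  # 0.9104
```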

Given an area, we can also find the corresponding z-score using NORM.S.INV:

The z-score equals NORM.S.INV(.9732) = 1.930. Another way to interpret this result is that 97.32% of the population lies below the z value of 1.930.

Percentiles

What z-score corresponds to the 95th percentile?

z = NORM.S.INV(0.95) = 1.645
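Both inverse look-ups, the area of 0.9732 and the 95th percentile, can be reproduced with the `inv_cdf` method of `statistics.NormalDist`. This is a sketch assuming Python 3.8 or later, with `inv_cdf` standing in for NORM.S.INV.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# z-score with 97.32% of the area to its left.
z_9732 = std_normal.inv_cdf(0.9732)

# The 95th percentile is the z-score with 95% of the area to its left.
z_95 = std_normal.inv_cdf(0.95)

print(round(z_9732, 2))  # 1.93
print(round(z_95, 3))    # 1.645
```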


Computing a z-score from a raw score

z-scores are used with the Standard Normal Curve, but this is a theoretical curve. Real world phenomena don’t have a mean of zero and a standard deviation of 1.0. Therefore, we have to be able to convert real world values to z-scores. We use the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Examples

Let’s say that we’re working with IQs. The mean IQ, µ, is 100 and the standard deviation, σ, is 15. Therefore, if you wanted to find the percentage of the population that has an IQ below 80, you would first convert the real world value, 80, to a z-score and then use NORM.S.DIST:

$$z = \frac{80 - 100}{15} = -1.3333$$

P(x < 80) = NORM.S.DIST(-1.3333,TRUE) = 0.0912

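Suppose we wanted to find the percentage of the population that has an IQ between 80 and 110. We would first convert 80 and 110 to z-scores, -1.3333 and 0.6667 respectively. Then,

P(80 < x < 110) = P(-1.3333 < z < 0.6667)
= NORM.S.DIST(0.6667,TRUE) - NORM.S.DIST(-1.3333,TRUE)
= 0.7475 - 0.0912
= 0.6563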
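The conversion to z-scores and the resulting probabilities can be sketched in Python as follows. The helper name `z_score` is introduced here for illustration, and the last two lines repeat the calculation on the raw scores to show that both routes agree.

```python
from statistics import NormalDist

mu, sigma = 100, 15  # IQ mean and standard deviation from the text

def z_score(x, mu, sigma):
    """Convert a raw score x to a z-score."""
    return (x - mu) / sigma

z_low = z_score(80, mu, sigma)    # about -1.3333
z_high = z_score(110, mu, sigma)  # about 0.6667

std_normal = NormalDist()
p_below_80 = std_normal.cdf(z_low)
p_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Same probability computed on the raw scores, without z-scores.
iq = NormalDist(mu, sigma)
p_between_direct = iq.cdf(110) - iq.cdf(80)

print(round(p_below_80, 3), round(p_between, 3))  # 0.091 0.656
```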

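However, there is a faster way to do this in Excel without first going through the calculation of a z-score. We can use the Excel tool NORM.DIST directly on the raw score, supplying the mean and standard deviation, as in NORM.DIST(80,100,15,TRUE) for the percentage of the population with an IQ below 80.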


Here’s another example where this time we will be using NORM.INV, the inverse of NORM.DIST. Let’s say that we want to design a door such that 98% of males can pass through without having to duck. Let’s assume that µ = 70.0 in and σ = 2.8 in for the population of U.S. males. Using NORM.INV with Probability 0.98, Mean 70.0 and Standard_dev 2.8, we get:

The doorway would have to be a minimum of 75.75 inches high. Notice that we used a probability of .98 when we wanted 98%. Remember, the two are equivalent.
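The doorway height can be double-checked with a short Python sketch, assuming Python 3.8 or later and using `statistics.NormalDist.inv_cdf` in place of NORM.INV.

```python
from statistics import NormalDist

# Heights of U.S. males as assumed in the text: mean 70.0 in, standard deviation 2.8 in.
heights = NormalDist(mu=70.0, sigma=2.8)

# Height below which 98% of the population falls (probability 0.98, not 98).
door_height = heights.inv_cdf(0.98)

print(round(door_height, 2))  # 75.75
```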

Estimators

When we are studying data from a population, there are several things we would like to know about the population. First, we would like to know the shape of the population probability distribution. This was discussed above. Then we would like to know the population mean, µ, and standard deviation, σ, or we might want to know the proportion, p, of people who answer yes to some question. These three measures are called parameters. We use sample statistics to estimate these parameters, as shown in the following table.

Population Parameter    Estimated by Sample Statistic
µ                       x̄
σ²                      s²
p                       p̂

Notice that we are using the variance, σ², instead of the standard deviation. That’s because we want estimators that are unbiased, but this is a minor point. Once we have estimated σ², we will use the square root.
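The three estimators can be computed from a sample as sketched below. The sample values and poll answers are made up for illustration; `statistics.variance` divides by n - 1, which is the unbiased form referred to in the text.

```python
import math
import statistics

# Hypothetical sample of five measurements (made-up values).
sample = [4, 8, 6, 5, 7]

x_bar = statistics.mean(sample)    # estimates the population mean
s2 = statistics.variance(sample)   # estimates the population variance (divides by n - 1)
s = math.sqrt(s2)                  # square root taken after the variance is estimated

# Hypothetical poll answers: True means "yes" (made-up values).
answers = [True, False, True, True, False, True, False, True, True, True]
p_hat = sum(answers) / len(answers)  # estimates the population proportion

print(x_bar, s2, round(s, 3), p_hat)  # 6 2.5 1.581 0.7
```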

Just having an estimate of, say, the population mean is good, but it’s not enough. We know that our sample mean is most likely not exactly equal to the population mean, but we don’t know how far off we could be. It would be ideal if we could bracket x̄ in an interval and be able to say that we are 95% confident that the population mean falls somewhere in the interval. We will not calculate the confidence interval in this chapter, but instead lay the groundwork for understanding how it is calculated.


Sampling Distribution

Imagine that we collected a sample of size n = 100 and calculated the sample mean, x̄, and standard deviation, s. Now imagine that we do this 999 more times, each sample being of size 100. We will end up with 1000 different x̄ and 1000 different s. Let’s just focus on the means. We can treat those 1000 means as if they were a sample collected from the population of all possible sample means. Now, read that three more times. Remember, we are now looking at the mean of the means.

The probability distribution of these sample means is called a sampling distribution of the means. If we histogram the means, we will get a picture that looks very much like a normal curve. This will be true regardless of whether the original underlying population was normally distributed. That’s important to remember. The original population does not have to be normally distributed. However, it can’t be too crazy either. As long as it has no more than a reasonable skew to it, the sampling distribution will be approximately normal in shape.

All of this is true of the sampling distribution of the proportions as well. However, the situation with the variances is different. There the sampling distribution is a skewed distribution that we call the Chi Square distribution. The following is an illustration of the three different sampling distributions.

Central Limit Theorem

All of this is very well, but what does it have to do with finding the population mean, µ, and the population standard deviation, σ? Well, it turns out that as we keep repeating the process of collecting samples, the following happens:

Let N be the number of times we collected a sample. (Note this is not little n, which is the size of each sample.) As N approaches infinity, we get the following:

1. The mean of the means, which we will call $\bar{\bar{x}}$, approaches the mean of the population of sample means, which we call $\mu_{\bar{x}}$, and, now here’s the cool part, $\mu_{\bar{x}} = \mu$.

2. The standard deviation of the means, which we will call $s_{\bar{x}}$, approaches the standard deviation of the population of means, which we call $\sigma_{\bar{x}}$, and, now here’s the really cool part, $\sigma_{\bar{x}} = \sigma / \sqrt{n}$. The standard deviation $\sigma_{\bar{x}}$ is smaller by a factor of $\sqrt{n}$ than σ, the standard deviation of the underlying population. But is that really so strange? Don’t averages, well, average things out, so you would expect a smaller variation among the means than among the raw data.

3. Finally, as N approaches infinity, regardless of the shape of the underlying population, as long as it’s not too wild, the histogram of the means approaches a normal curve, provided the sample size n is reasonably large.

These results form what is possibly the most important theorem in statistics, the Central Limit Theorem. So go back, and take another look at it. It is the theoretical basis for all of the work we are going to do for the rest of the semester.
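The first two results can be tried out in a small simulation. The sketch below is only an illustration under stated assumptions: the underlying population is taken to be exponential with mean 1 and standard deviation 1 (a deliberately skewed choice made for this example), each sample has size n = 100, and N = 1000 samples are collected.

```python
import random
import statistics

random.seed(6)  # arbitrary seed so the run is repeatable

n = 100    # size of each sample (little n)
N = 1000   # number of samples collected (big N)

# Assumed skewed population: exponential with mean 1 and standard deviation 1.
mu, sigma = 1.0, 1.0

sample_means = []
for _ in range(N):
    sample = [random.expovariate(1.0 / mu) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of means: {mean_of_means:.3f}  (population mean: {mu})")
print(f"sd of means:   {sd_of_means:.3f}  (sigma / sqrt(n): {sigma / n ** 0.5:.3f})")
```

The mean of the means should land near 1.0 and the standard deviation of the means near 0.1, even though the population itself is far from bell-shaped.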

Application

Let’s say that we’re going to take a ferry boat from Vallejo to San Francisco. The ferry seats 115 people and has a maximum capacity of 20,000 lbs. To test the safety of the boat, we’ll use the average weight, 169 lbs, and standard deviation, 28 lbs, of U.S. males, because men usually weigh more than women. Now, note that if some random group of 115 people show up, and their average weight is 174 lbs or greater, we’re going to be in trouble because the load is going to exceed the capacity of the boat.

Now, it is quite likely that any one person showing up could weigh more than 174 lbs. Using NORM.S.DIST, we find this probability to be 0.429 (don’t forget to find the probability of the complement and then subtract the answer from 1.0). Now that’s a pretty high probability, but it’s just one person. The next male getting on the boat might weigh only 165 lbs, easily offsetting the weight of the first passenger.

The real question we want to answer is what’s the probability that a group of 115 people are going to show up whose average weight is 174 lbs or greater, because then we are going to be in trouble. We use the NORM.DIST tool just like we did for the weight of one person, except now we recognize that we are dealing with averages, hence a different population, the population of averages. The standard deviation of this population is $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 28 / \sqrt{115} = 2.611$. Using this value for the standard deviation in the NORM.DIST tool, we find that the probability of 115 people showing up whose average weight is 174 lbs or greater is 0.0277. Now this is less than 0.05, so technically it counts as a rare event. However, if there were an almost 3% chance that the ferry would capsize, would you get on? Thank god we know statistics, eh?
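The two ferry probabilities can be reproduced with a short Python sketch. It assumes Python 3.8 or later and uses `statistics.NormalDist` in place of the Excel tools; the variable names are chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 169, 28  # mean and standard deviation of U.S. male weights (lbs), from the text
n = 115              # number of passengers
limit = 174          # average weight (lbs) at which the load reaches capacity

# One person: probability of weighing 174 lbs or more (complement of the left area).
p_one = 1 - NormalDist(mu, sigma).cdf(limit)

# Group of 115: the averages have standard deviation sigma / sqrt(n).
sd_of_means = sigma / sqrt(n)
p_group = 1 - NormalDist(mu, sd_of_means).cdf(limit)

print(round(p_one, 3))        # 0.429
print(round(sd_of_means, 3))  # 2.611
print(round(p_group, 3))      # 0.028
```

The group probability comes out just under 0.028, in line with the 0.0277 quoted in the text.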
