Probability and the Normal Curve

MSc Module 6: Introduction to Quantitative Research Methods
Kenneth Benoit

February 17, 2010

Basic rules of probability

- Probability is the cornerstone of decision-making in quantitative research: in particular, it tells us how to judge evidence given a specific hypothesis
- The specific research question is: how likely was it that I obtained this sample of data, given my hypothesis?
- Probability lies on the unit interval [0, 1], even though in common parlance we may refer to probabilities as percentages
- Probability axioms:
  1. For any event E: 0 ≤ Pr(E) ≤ 1
  2. The probability that something occurs is 1: Pr(Ω) = 1
  3. Any countable sequence of pairwise disjoint events E1, E2, ... satisfies Pr(E1 ∪ E2 ∪ ···) = Σi Pr(Ei)
- For instance: for a coin toss, Pr(heads) = 0.5 and Pr(tails) = 0.5
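A quick check of the axioms in R (a minimal sketch; the object name p is mine):

p <- c(heads = 0.5, tails = 0.5)  # probabilities for a fair coin
all(p >= 0 & p <= 1)              # axiom 1: each probability lies in [0, 1]
sum(p)                            # axiom 2: Pr(Omega) = 1
p[["heads"]] + p[["tails"]]       # axiom 3: disjoint events, probabilities add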

Computing probabilities

- Probability refers to the relative likelihood of occurrence of any given outcome or event
- Alternatively, the probability associated with an event is the number of times that event can occur, relative to the total number of times any event can occur:

  Pr(given outcome) = (number of times the outcome can occur) / (total number of times any outcome can occur)

- Example: if a classroom contains 20 Democrats and 10 Republicans, then the probability that a randomly selected student will be a Democrat is 20/(20 + 10) = .667
- Another example: the probability of picking the ace of spades in a single draw from a deck of cards is 1/52; the probability of drawing any ace is 4/52
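The classroom example is easy to check by simulation in R (a minimal sketch; the object names are mine):

classroom <- c(rep("D", 20), rep("R", 10))          # 20 Democrats, 10 Republicans
draws <- sample(classroom, 100000, replace = TRUE)  # many random selections
mean(draws == "D")                                  # should be close to 20/30 = .667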

Probability distributions

- A probability distribution is directly analogous to a frequency distribution, except that it is based on theory (probability theory) rather than on what is observed in sample data
- In a probability distribution, we determine the possible values of the outcomes, and compute the probability associated with each outcome
- Example: the number of heads in two tosses of a fair coin:

  Outcome (heads)   0     1     2
  Probability       .25   .50   .25
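In R, these theoretical probabilities come directly from the binomial distribution (this mirrors the rbinom() simulations used below):

dbinom(0:2, size = 2, prob = 0.5)  # Pr(0, 1, 2 heads) = 0.25 0.50 0.25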

Probability versus frequency distributions I

The difference is that a frequency distribution depends on a sample. Example: flip two coins 10 times (Table 5.3 LF&F): > table(rbinom(10,2,.5))/10 0 1 2 0.3 0.6 0.1

I

If we increase the coin flips to 1000, we get: > table(rbinom(1000,2,.5))/1000 0 1 2 0.266 0.496 0.238

I

A probability distribution is like a frequency distribution where N=∞

I

Note: Your exact numbers will differ!

[Figure: Empirical distribution of heads in 50 tosses of 2 coins; barplot of proportions (0.0 to 0.5) for outcomes 0, 1, 2]

[Figure: Empirical distribution of heads in 1,000 tosses of 2 coins; barplot of proportions (0.0 to 0.4) for outcomes 0, 1, 2]
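The plotting code on the slides was truncated ("barplot(tab"); a plausible reconstruction for the first panel is:

# fixing the levels ensures all three outcomes appear even if one never occurs
tab <- table(factor(rbinom(50, 2, .5), levels = 0:2))/50
barplot(tab, main = "Empirical distribution of heads in 50 tosses of 2 coins")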

Mean and variance of a probability distribution

- Remember the formula for a mean: \bar{X} = \frac{\sum X}{N}
- We can already compute the mean of any given frequency distribution
- Returning to the two coin flip example:

  > # Means of frequency distributions of 2 coin tosses
  > mean(rbinom(10, 2, .5))
  [1] 0.9
  > mean(rbinom(100, 2, .5))
  [1] 1.04
  > mean(rbinom(1000, 2, .5))
  [1] 0.961
  > mean(rbinom(10000, 2, .5))
  [1] 0.9987

- If we performed this test an infinite number of times, we would expect the average to be 1.0. This is why we call the mean of a probability distribution an expected value
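The expected value can also be computed directly from the theoretical probabilities, as E(X) = Σ x · Pr(x); a one-line check in R:

sum(0:2 * dbinom(0:2, 2, .5))  # 0*.25 + 1*.50 + 2*.25 = 1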

Greeks and Romans

- For sample statistics, we use Roman characters
- For population parameters, we use Greek characters

  Quantity             Sample notation   Population notation
  Mean                 X̄                 µ
  Standard deviation   s                 σ
  Variance             s²                σ²

The normal curve

- The single most important probability distribution in all of statistics
- Features:
  - symmetrical
  - continuous
  - unimodal
  - follows a specific mathematical form involving two parameters

[Figure: the standard normal density, dnorm(x), plotted for x from -4 to 4]
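The curve on the slide can be reproduced in one line of R:

curve(dnorm(x), from = -4, to = 4)  # standard normal density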

The area under the normal curve

- Remember that the normal distribution describes a (continuous) probability distribution; empirical distributions (may) only approximate it
- To use the normal distribution in solving problems, we calculate probabilities in a probability distribution that comes from integrating the normal curve
- The normal distribution has the density:

  f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

  where
  - σ > 0 is the standard deviation
  - µ is the expected value
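We can verify that R's dnorm() implements exactly this density (a minimal check; the values of x, mu, and sigma are arbitrary choices of mine):

x <- 1.5; mu <- 0; sigma <- 1
1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2/(2*sigma^2))  # density from the formula
dnorm(x, mean = mu, sd = sigma)                      # same value: 0.1295176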

The area under the normal curve (cont.)

- Typically we consider the area relative to standard deviations from the mean
- A constant proportion of the total area under the normal curve will lie between the mean and any given distance measured in units of σ
- For instance, the area under the normal curve between the mean and the point 1σ above the mean always contains 34.13% of cases
- The same σ distance below the mean contains the identical proportion of cases
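The 34.13% figure is easy to confirm in R:

pnorm(1) - pnorm(0)  # area between the mean and 1 sd above it: 0.3413447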

Area under the normal curve in σ distances

[Figure: areas under the normal curve at σ distances from the mean; not reproduced here]

Determining exact areas under the normal curve

- To determine the probability that a random variable X with a normal distribution is less than or equal to x, we evaluate the cumulative distribution function of the normal probability distribution at x
- The cdf of the normal distribution is:

  \Phi_{\mu,\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}} \, dt

- This equation (and the integral in the cdf) makes it possible to determine the total area under the curve for any given distance from the mean µ
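In R, pnorm() is this cdf; we can check it against numerical integration of the density using base R's integrate():

integrate(dnorm, -Inf, 1.4)$value  # numerically integrate the density: 0.9192433
pnorm(1.4)                         # the cdf evaluated directly: 0.9192433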

Determining exact areas under the normal curve

- For instance, what if we wanted to know the percentage of cases lying beyond 1.4σ units above the mean?
- We can do this in R, or use Table A from Appendix C of LF&F
- In R:

  > 1 - pnorm(1.4)
  [1] 0.08075666
  > round((1 - pnorm(1.4))*100, 2)
  [1] 8.08

“Standardized” scores

- We can transform any distribution into a set of standard deviations from the mean; this is called a z score or standardized score
- The z score measures the direction and degree to which any given raw score deviates from the mean of a distribution, on a scale of σ units
- Formula for computing a z score:

  z_i = \frac{X_i - \mu}{\sigma}

- To convert raw scores into scores that can be assessed using the normal curve, we convert them into σ units (or “z scores”) by standardizing them
- The normal curve can be used in conjunction with z scores and Table A (or pnorm) to determine the probability of obtaining any raw score in a distribution

Example

- Assume we have a normal distribution of hourly wages
  - the mean wage is $10, and the standard deviation is $1.5
  - we wish to find the probability associated with an hourly wage of $12
- What is the probability of obtaining a score between the mean of $10 and the value of $12?

Example continued

Steps:
1. Standardize the raw score of $12: z = (12 − 10)/1.5 = 1.33
2. Use Table A to find the total frequency under the normal curve between z = 1.33 and the mean. This is p = .4082

Alternatively, we could have used R:

> pnorm(1.33) - pnorm(0)
[1] 0.4082409

Standard normal curve: a special version of the normal curve with µ = 0, σ² = 1

Example variation

- What if we had wanted to find the probability that a wage might be greater than the observed value (in this case, $12)?
- In this case, we would integrate from 1.33 to +∞
- In R, this is easy:

  > 1 - pnorm(1.33)
  [1] 0.09175914

Example variation 2

- We could also have obtained the probability that a wage would be either less than $8 or more than $12
- The transformation would then yield −1.33 and 1.33
- In R, the area between these two z scores is:

  > pnorm(1.33) - pnorm(-1.33)
  [1] 0.8164817

- This means that 1 − .8165 = .1835 of the area lies below $8 or above $12

[Figure: standard normal density, dnorm(x), with the x-axis marked at −1.33, 0.00, and 1.33]

Standardizing scores in R

> x <- c(1, 4, 5, 7, 14, 0, 21)
> (stdx <- (x - mean(x)) / sd(x))
[1] -0.8518410 -0.4543152 -0.3218066 -0.0567894  0.8707708 -0.9843496  1.7983310
> scale(x)
           [,1]
[1,] -0.8518410
[2,] -0.4543152
[3,] -0.3218066
[4,] -0.0567894
[5,]  0.8707708
[6,] -0.9843496
[7,]  1.7983310
attr(,"scaled:center")
[1] 7.428571
attr(,"scaled:scale")
[1] 7.54668

Methods for determining normality in samples

- Visual inspection of the kernel density
- Some tests also exist, e.g. the Anderson-Darling test, the Kolmogorov-Smirnov test, and the Pearson χ² test
- Q-Q plot to visualize normality
  - a “quantile-quantile” plot: plots the ranked sample values against a similar number of ranked quantiles from a normal distribution
  - normality shows up as a straight-line correspondence
  - In R: qqnorm()
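A minimal sketch of the first two checks in R (the simulated data x is my own illustration):

x <- rnorm(200, mean = 10, sd = 1.5)  # illustrative sample; here truly normal
plot(density(x))                      # visual inspection of the kernel density
# K-S test against a fitted normal (strictly, estimating the parameters from
# the same data makes the test conservative)
ks.test(x, "pnorm", mean(x), sd(x))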

Normal Q-Q plot

> load("dail2002.Rdata")
> attach(dail2002)
> qqnorm(spend_total)
> qqline(spend_total)

[Figure: Normal Q-Q Plot of spend_total, with Theoretical Quantiles (−3 to 3) on the x-axis and Sample Quantiles (0 to 50,000) on the y-axis]

Central Limit Theorem

- States that as the sample size N increases, the distribution of the sample means will be normally distributed with mean µ and variance σ²/N, no matter what the shape of the original distribution
- The second fundamental theorem of probability (the law of large numbers is the first)
- This allows us to make probability statements about sample means, against a theoretical (probability) distribution of means that we might have drawn, since that probability distribution is normally distributed
- We could illustrate this with the spend_total example from the dail2002.Rdata dataset, as sketched below
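A minimal sketch of that illustration (it assumes dail2002.Rdata has been loaded as on the Q-Q plot slide; the number and size of the samples are my choices):

# draw 1,000 samples of size 50 from spend_total and keep each sample mean
means <- replicate(1000, mean(sample(spend_total, 50, replace = TRUE)))
hist(means)                    # roughly bell-shaped even if spend_total is skewed
qqnorm(means); qqline(means)   # the sample means line up with normal quantiles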
