Statistics for Decision-Making in Business


1st Edition Milos Podmanik

Foreword: What is This Book Good For?

You're probably thinking to yourself, “Who does this guy think he is by trying to write his own book?” The answer is both satisfying and deceiving to those who expect the traditional math course with the traditional instructor. I write this course manual to most closely match my personal teaching philosophy. What might that be? Well, I firmly believe that math education focuses too much on processes, templates, and repetitive, mundane computational skills. Are these of any importance? To some extent, yes. For the most part, however, students fail to make connections from math to the real world and vice versa. We tend to teach students how to “do” and not how to “think.” As a result, I believe it is far more important to promote a deep level of understanding, engagement, and connections to the planet we live on. After all, do you really want to become a calculator? If your answer is “yes,” then this will come as a major disappointment: a computer could calculate faster and more accurately than you decades ago! Not to mention, computers will only continue to get faster and better than you at computing. Here's the good news: computers don't understand why they're doing what they're doing! They are simply computing machines. It takes (and most likely will always take) a rational, deep-thinking human being to provide a contextual and meaningful analysis of the inputs and outputs of a numerical process. And that, my friends, is what this book is all about.

Statistics for Decision-Making in Business

© Milos Podmanik

Page 2

A Note to Students

This book is far from perfect. In fact, it will never be perfect. There is, however, a lot of blood, sweat, and tears put into this book (paper cuts hurt!). I spent much of my 2012 winter break thinking, writing, and rewriting the contents of this book to make it feel “right” for both you and me. As such, I don't believe it's that much to ask for you to read the book. What's my point? … Read this book!


Table of Contents

Chapter 1: Fundamentals of Statistics
1.1 Data and Their Uses
1.2 Descriptive VS. Inferential Statistics
1.3 Statistics in Excel

Chapter 2: Visual Representations of Data
2.1 Visualizing Categorical Data
2.2 Visualizing Quantitative Data
2.3 Descriptive Statistics – Center and Position
2.4 Descriptive Statistics – Variability

Chapter 3: Probability and Decision-Making
3.1 The Idea of Probability
3.2 Joint Probability
3.3 Probability of Unions
3.4 Conditional Probability
3.5 Combinations and Permutations
3.6 Expected Value

Chapter 4: Discrete Probability Distributions
4.1 The Binomial Distribution

Chapter 5: Continuous Probability Distributions
5.1 The Ideas Behind the Continuous Distribution
5.2 The Normal Distribution

Chapter 6: Sampling Distributions and Estimation
6.1 Sampling Distribution for x̄
6.2 Confidence Interval for x̄
6.3 Confidence Interval for p̂

Chapter 7: Hypothesis Testing
7.1 The Concept Behind Hypothesis Testing

Appendices
Appendix A: Answers to Select Problems

Chapter 1 Fundamentals of Statistics

1.1 Data and Their Uses

Our lives are filled with information. While at one point we didn't have enough data in the world, now we have so much of it that computers need to be revamped continually in order to keep up with it. Facebook records rich information about hundreds of millions of users. Studies are revealing new conclusions that allow us to make decisions about choosing the right type of treatment for medical conditions. Scientific data is establishing the strong correlation between humans' interaction with the planet and changes in climate. The power of data is limitless. However, because the media's statistical expertise regularly fails, the results of studies are often miscommunicated because they are not understood. In order to fully extract the meaning of data, we must understand how to analyze them. We must be accurate and precise in what we measure and how we measure it.

1.1.1 Three Good Reasons to Study Statistics

In no particular order, these are:

1. To be informed
2. To be able to make good decisions based on data and to understand current issues
3. To be able to evaluate decisions that affect the operations of a business and our personal lives

1. To be informed

What does it mean to be informed? To be informed we should be able to understand and interpret tables, charts, and graphs. We should be able to make sense of the conclusions of others' research based on their numerical results. Moreover, we should have insight into the gathering, summarization, and analysis of data, and so we should always approach numerical results with a slight bit of doubt. In other words, we ideally want to adopt the attitude of "doubt until there is enough evidence to trust." Let's take a look at some examples of where statistics have helped inform society.

Examples:
- Does it matter how long children are bottle-fed? An experiment was run to determine differences in iron deficiency and the length of time that a child is bottle-fed.
- In 2005, Medicare candidates faced a decision of which prescription medication plan to choose. A program called PlanFinder was made available online to compare available options. But, are senior citizens online?
- A study in 2005 attempted to answer the question, are students ruder today than in the past? A survey was conducted.
- Is domestic violence common? A study in 2005 interviewed about 24,000 women to attempt to answer this question.
- What factors are involved in student achievement in school? Is study time the most important factor in answering this question? A study concluded that things such as prioritizing student achievement and encouraging teacher collaboration may have some impact.
- Do the accounts receivable reported by a business accurately reflect the true accounts receivable? The IRS randomly audits businesses to try to answer this question.
- A stock's share value change has fluctuated between -1.2% and 8.9% over the last year. What predictions should an investor make about the stock over the coming year in order to decide whether to purchase?
- CVS Pharmacy sells 5 lb. bags of 100% Pure Cane Granulated Sugar. As a quality control measure, the company would like to know the amount of variability in the true weight of sugar placed into each of the bags.

2. Making Good Decisions

How can we ever be sure that the results we're seeing or reading are truly the ones we should believe? Although it is assumed that those who talk about data understand statistics, you'd be surprised how poor some of their conclusions are. We'll definitely see why by the time this course is over. You'll learn how to summarize data, how to analyze it, and, most importantly, how not to draw conclusions about it. The title "Making Good Decisions" should not be new to you, hopefully.

3. Evaluating Decisions that Affect Our Lives

Are you satisfied that the Food and Drug Administration (FDA) has allowed a new patent for the drug Zoloft, which is now also useful for Social Anxiety Disorder (in addition to depression), but which has undergone no additional research to prove the claim? Do you know why you're paying $720 for car insurance every six months, while your roommate is paying only $450? If a mammogram comes back positive for breast cancer, is there any chance that this is a false positive? Should you be surprised that no Hispanic applicants were hired by a company if three applicants were to be selected, when 15 were Caucasian and 5 were Hispanic? Is there a reason to suspect inequality? It should not surprise you that these can be answered with probability and statistics.
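As a preview of how probability can speak to the hiring question above, here is a minimal sketch. It assumes, purely for illustration, that all 20 applicants are equally qualified and that every group of three is equally likely to be chosen; neither assumption is stated in the text, and the variable names are made up for this example.

```python
from math import comb

# Applicant pool from the hiring question in the text.
caucasian = 15
hispanic = 5
hires = 3

# Assumption (not from the text): every group of 3 applicants is equally likely.
total_groups = comb(caucasian + hispanic, hires)   # ways to pick 3 of the 20
all_caucasian_groups = comb(caucasian, hires)      # ways to pick 3 of the 15

p_no_hispanic = all_caucasian_groups / total_groups
print(f"P(no Hispanic applicant hired) = {p_no_hispanic:.3f}")
```

Under these assumptions the outcome would occur by chance roughly 40% of the time, so on its own it would not be strong evidence of inequality. Counting methods like this one are developed in Chapter 3.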

1.1.2 Types of Data

In order to be able to reach the goals mentioned above, we need to have some sort of information on which to base our decisions. We call this information data. Data comes in two main categories: quantitative and qualitative/categorical. Quantitative variables, as the title implies, deal with numerical quantities. For example, the average revenue of a Whole Foods Market store is considered a quantitative variable, since the measurement is a number. Qualitative variables, on the other hand, deal with qualities. For example, the type of television that a customer is likely to purchase is considered a qualitative variable, since its value will be, for instance, plasma, LED, LCD, etc.

1.1.3 Not All Quantitative Variables Are As They Appear!

Just because a variable is stated as a numerical value doesn't mean that it can be treated as a numerical value. A variable must be classified according to its scale of measurement.

For instance, suppose you are to test three marketing tactics on customers. You call these tactics Tactics 1, 2, and 3, respectively. These tactics have numerical values, but the numbers do not have any ordering significance. That is, tactic 1 is not necessarily better than tactic 3. These numbers serve simply as names for the values of the variable and cannot be numerically compared. We call this a variable of nominal scale.

Suppose that a business magazine reports the top three new businesses in the city each month. That is, we have businesses 1, 2, and 3, where 1 is considered the best of the three, 2 the second best, and 3 the third best. In this case, we can talk about 1 being better than 2 and 3, and 3 being worse than 1 and 2. This type of variable has the properties of a nominal scaled variable, but also has the property of order. We call this a variable of ordinal scale.

In another example, consider the variable IQ. Suppose two people have IQs of 100 and 120. Based on this information, we can say that the person with 120 has a higher IQ. However, we can also say that the second person has an IQ that is 20 points higher than the first person's. We couldn't really say this for the example above. In addition to being nominal (a person can be identified by their value) and ordinal (we can rank the scores), we can also talk about the differences in scores. This type of variable is of interval scale.

The most powerful type of variable is one that has all of the above properties, but whose ratio between two values is also meaningful and whose value of zero means a complete absence of the characteristic. While IQ is of an interval scale, it does not make much sense to say that the person with the 120 IQ is 20% (120/100 = 1.2) smarter than the person with the 100 IQ. Certainly we cannot say that a person with 0 IQ has no intelligence at all (this person is probably not even alive!). Consider, however, the salaries of different employees. One employee makes $100,000 and another makes $120,000. We can definitely say that the second person makes 20% more than the first person, and we can also say that a value of $0 would indicate that a person makes no money at all (total absence of that variable). This variable is of ratio scale.
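One way to see the difference between the interval and ratio scales is to ask what happens when the scale is relabeled. The sketch below uses the IQ scores and salaries from the text. The shift of 50 points is a hypothetical relabeling invented only for illustration: because the zero of an interval scale is arbitrary, moving it leaves differences alone but changes ratios. A ratio scale has a true zero, so only its unit can change, and ratios survive that.

```python
# Interval scale: IQ scores from the text. The zero point is arbitrary,
# so imagine (hypothetically) the same scale with its zero moved by 50 points.
iq_a, iq_b = 100, 120
shift = 50  # hypothetical relabeling, for illustration only

diff_original = iq_b - iq_a
diff_shifted = (iq_b + shift) - (iq_a + shift)
ratio_original = iq_b / iq_a
ratio_shifted = (iq_b + shift) / (iq_a + shift)

# Ratio scale: salaries from the text. Zero dollars is a true zero,
# so only the unit can change (dollars to cents), not the zero point.
salary_a, salary_b = 100_000, 120_000
ratio_dollars = salary_b / salary_a
ratio_cents = (salary_b * 100) / (salary_a * 100)

print(f"IQ difference: {diff_original} before the shift, {diff_shifted} after")
print(f"IQ ratio: {ratio_original:.3f} before the shift, {ratio_shifted:.3f} after")
print(f"Salary ratio: {ratio_dollars:.3f} in dollars, {ratio_cents:.3f} in cents")
```

The difference of 20 points is unchanged by the shift while the ratio is not, which is why "20% smarter" carries no meaning. The salary ratio is the same in either unit.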

1.1.4 How We Obtain Data

The first question we have after knowing a bit about data is, how do we get it?

Existing Data

In some instances, this data already exists and is available to the researcher. For instance, one can easily go online and find existing data on the U.S. public. We can view things like the average credit card debt per person by state, pounds of grains produced in the United States since 1950, etc. This data is usually available through a number of websites, such as:

- U.S. Statistical Abstract (U.S. Census) - http://www.census.gov/compendia/statab/
- Federal Reserve Board - http://www.federalreserve.org
- Office of Management and Budget - http://www.whitehouse.gov/omb
- Department of Commerce - http://www.doc.gov
- Bureau of Labor Statistics - http://www.bls.gov
- FedStats - http://www.fedstats.gov/

There are literally thousands of other repositories for existing data. Sometimes a little bit of research unveils a plethora of results. If a company is doing a study of its clients, it may already have a myriad of existing internal data.

Conducting a Study to Obtain Data

We hear a lot of things coming from our failing media sources. Data is blindly reported, while the method of data collection is ignored. Why do you think there are so many conflicting conclusions reached? One week coffee is linked to cancer, while the next it fights cancer. Which is it?

Many times, observational studies are conducted. There is no experimenter manipulation in this type of study. For example, a zoologist might study elephant eating patterns in various climates to determine whether climate has an effect on caloric intake (the response variable, which is what is measured). He probably cannot manipulate the climate (the predictor variable, which serves to predict responses) in which the elephant lives, for many reasons, not the least of which is the difficulty of transporting such an animal. Not to mention, there are startling ethical concerns with such an action! He probably cannot dictate how much food is in the environment, either. Certainly, he can get an accurate reading of the elephant's food intake by following the animal for several days. At the end of the day, the zoologist is merely observing what happens. His conclusions are limited.

An experiment, on the other hand, is a type of study in which the experimenter is able to control and manipulate most, if not all, environmental factors. If the experimenter is studying the effects of caffeine on math test scores, for instance, he might have a control group of students to whom he gives no coffee and another, experimental group, to which he gives coffee with 60 mg of caffeine. He then measures each group on test score performance (% of total correct).

Suppose the experimental group does poorly compared to the control group. Can we be sure that it was due to the caffeine? As long as test conditions were the same in each group, yes. If, however, there was something different between the two groups in addition to the presence/absence of caffeine, then the results are not so clear. What if, for instance, they played music with the control group and none with the experimental group? How do we know better performance in the control group wasn't an effect of soothing music calming the nerves? It could even have been a combination of no caffeine and music.

Punchline: In an experiment, we manipulate one factor and hold all other conditions constant.

Most of the time it is desirable to run an experiment. The number one reason for this is that we can usually collect evidence that leads to a cause-and-effect relationship, assuming the experiment is conducted properly. In an observational study it is impossible to do this, as there are many confounding variables, or variables that might be related to both the explanatory (predictor) variable and the response variable.

Consider this classic example: a researcher counts the number of crimes committed in a city and then the number of churches in that city. She does this for quite a few cities. It is found that there is a positive relationship between the number of crimes committed and the number of churches. That is, as crime increases, so does the number of churches. What gives? Do these people just repent more often for their guilty consciences?

It may not come as a large shock that we're dealing with potentially many confounding variables. The simplest one is population. As a city's population increases, more crime is committed and more churches are needed. This is but one possible explanation.

Example 1: An educational researcher finds that there is a strong relationship between the number of hours a student studies and his/her grade point average (GPA). List a few possible confounding variables.

SOLUTION: There is no guarantee that studying more causes a higher GPA. There are many factors that might contribute to a higher GPA:

- More sleep
- Less stress (maybe due to lack of job)
- Less television viewing
- Better study environment
- More support from family/friends

Issues in Planning a Study

There are many. Let's consider the following scenario to help illustrate a few.

Scenario: Suppose we want to test whether or not a newly designed Freud circular saw blade runs at a lower temperature, and hence causes fewer burn marks in the wood, than the old blade at 7200 revolutions per minute (RPM). Can we just run the cuts, take the temperatures, and compare? I think you know the answer to this.

First off, we face many extraneous factors, or variables that are not of interest in the current study but that are thought to affect the response variable. Examples? The person doing the cutting with each blade (same or not?). The type of wood being cut (is one pine and the other oak?). The type of saw (low-power Craftsman, or professional Jet?). In order to avoid having these types of factors affect our measurement, we must control them. We can do this by having the same person do the cutting, cutting boards of exactly the same type, and using the same saw for both tests.

Secondly, is it sufficient to cut just one board using each blade? Definitely not. We must expect that there will be some variation, or variability, in the temperatures we measure. That is, if I run the cut with the old blade four times, I may read temperatures of 205°, 202°, 209°, and 219°. This difference among the measurements is called variability. Thus, to take the variability into account, we must take several replications, or repeated measurements. Then, we would likely use the mean, or average, of the replications.
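The four replications above can be summarized in a few lines. This is only a sketch of the arithmetic, using the readings quoted in the text; the unit of temperature is not stated there, so none is assumed here.

```python
from statistics import mean

# The four replications with the old blade, as read in the text.
old_blade_temps = [205, 202, 209, 219]

# The mean is the single value we would report for the old blade.
average_temp = mean(old_blade_temps)

# Largest minus smallest reading: a first look at the variability.
spread = max(old_blade_temps) - min(old_blade_temps)

print(f"Mean of replications: {average_temp}")
print(f"Largest minus smallest reading: {spread}")
```

The mean works out to 208.75, and the readings differ by as much as 17 degrees, which is why a single cut per blade would not be a trustworthy comparison.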


Although far from last, we will consider here one more important concept. You might not think anything of it at first, but do you suppose that it's a good idea to use just two saw blades for the experiment, one old and one new? What if we happened to get a faulty blade out of the batch? If we run 4 replications with each blade, we might consider having 4 of the old blades and 4 of the new blades.

If you have a total of 8 sheets of wood to be cut, is it okay to cut the first 4 with the old blade and the last 4 with the new blade? Surprisingly, the answer is "no." Why not? Suppose the sheets were delivered freshly cut, and still moist. Well, moisture is subject to gravity, and so the bottom four boards might be more moist than the top four. Thus, we must randomize each board to one of the two types of saw blades. In other words, we randomly assign each board to a blade. We will not consider this any further at this point.

Homework Problems - 1.1

1. Classify each of the following variables as nominal, ordinal, interval, or ratio scale. Justify your answer.
   a. Favorite flavor of ice cream
   b. Temperature (°F)
   c. Accounts Receivable Balance
   d. Ranking of Presidential Candidates According to Preference

2. Based on a study of 2121 children between the ages of one and four, researchers at the Medical College of Wisconsin concluded that there was an association between iron deficiency and the length of time that a child is bottle-fed (Milwaukee Journal Sentinel, November 26, 2005).
   a. How many elements does this dataset contain?
   b. Is the variable categorical or quantitative? Explain.

3. The student senate at a university with 15,000 students is interested in the proportion of students who favor a change in the grading system to allow for plus and minus grades (e.g., B+, B, B-, rather than just B). Two hundred students are interviewed to determine their attitude toward this proposed change.
   a. How many elements does this dataset contain?
   b. Is the variable categorical or quantitative? Explain.

4. An article titled “Guard Your Kids Against Allergies: Get Them a Pet” (San Luis Obispo Tribune, August 28, 2002) described a study that led researchers to conclude that “babies raised with two or more animals were about half as likely to have allergies by the time they turned six.”
   a. Is this study an observational study or an experiment? Explain.
   b. Describe a potential confounding variable that illustrates why it is unreasonable to conclude that being raised with two or more animals is the cause of the observed lower allergy rate.


5. The article “Television's Value to Kids: It's All in How They Use It” (Seattle Times, July 6, 2005) described a study in which researchers analyzed standardized test results and television viewing habits of 1700 children. They found that children who averaged more than two hours of television viewing per day when they were younger than 3 tended to score lower on measures of reading ability and short-term memory.
   a. Is the study described an observational study or an experiment?
   b. Is it reasonable to conclude that watching two or more hours of television is the cause of lower reading scores? Explain.

6. “More than half of California's doctors say they are so frustrated with managed care they will quit, retire early, or leave the state within three years.” This conclusion from an article titled “Doctors Feeling Pessimistic, Study Finds” (San Luis Obispo Tribune, July 15, 2001) was based on a mail survey conducted by the California Medical Association. Surveys were mailed to 19,000 California doctors, and 2000 completed surveys were returned.
   a. Is this study an observational study or an experiment? Explain.
   b. Describe any concerns you have regarding the conclusion drawn.

1.2 Descriptive VS. Inferential Statistics

1.2.1 The Purpose of Statistics and “Statistics”

Statistics is a branch of mathematics that deals with the analysis of data. This is often confusing to some people, since the lower-case version of this word, statistic, actually means a piece of data. So, we have statistics, which are the data themselves, and we have Statistics, which deals with the analysis of statistics. Confusing, huh? We generally use the word statistics loosely to mean “data.”

A statistician is a special type of mathematician who deals with the analysis of data. Many people confuse the statistician with a person who simply has many statistics memorized. While some certainly may, most do not.

Needless to say, our purpose in the field of Statistics is to understand data. Depending on one's goal, statistics may be used simply to describe an obtained set of data or to extrapolate from the data to describe something much larger. These two goals are called, respectively, descriptive and inferential statistics.

1.2.2 Descriptive Statistics Suppose you work in the accounting department and have collected the following data on revenues earned from new and existing customers over the past day:


Account Type    Revenue ($)
New             $5,296
Old             $2,230
Old             $7,643
Old             $3,897
Old             $9,590
Old             $2,689
Old             $5,890
Old             $9,561
New             $3,643
New             $8,861
Old             $3,946

Your goal is to summarize the data in some meaningful way(s). Descriptive statistics is the method of describing or summarizing data. How could this be done? We first consider the types of variables we have present:

- Account type: Categorical
  - New, Old
- Revenue: Quantitative
  - Range from $2,230 to $9,590

With categorical variables, we cannot mathematically manipulate the observed values, or observations (here we have “New” and “Old” for observations). We can only provide descriptions of the values. We can provide the relative frequency of these values. A relative frequency is the ratio of the number of observations of a given value to the total number of observations. Here, we could summarize by saying:

Account Type    Relative Frequency
New             3/11 ≈ 27%
Old             8/11 ≈ 73%

This allows us to conclude that 27% of the sales came from new clients while 73% came from existing clients. This is very valuable information! This information demonstrates that the company has grown over the course of this one day. We could present these two descriptive statistics to management by either providing the raw percentages, or by some visual display, such as a pie chart or a bar graph. A pie chart shows the ratios (or all parts of one whole) of the categorical variable and thus the entire circle represents 100% of all account types (100% of the categorical variable values):

[Pie chart: Account Type. New 27%, Old 73%]

This literally shows the “ingredients” of the pie. A corresponding bar graph might be:

[Bar graph: Account Type. Vertical axis: Frequency (0 to 9). Horizontal axis: Type (New, Old)]
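The counts and relative frequencies behind the two charts can be reproduced with a short sketch. The list below simply repeats the Account Type column of the revenue table in the order given; the variable names are chosen for this example only.

```python
from collections import Counter

# Account types in the order listed in the revenue table.
account_types = ["New", "Old", "Old", "Old", "Old", "Old",
                 "Old", "Old", "New", "New", "Old"]

counts = Counter(account_types)   # frequency of each value (bar graph heights)
total = len(account_types)        # total number of observations

# Relative frequency = observations of a value / total observations.
relative_frequency = {kind: count / total for kind, count in counts.items()}

for kind, share in relative_frequency.items():
    print(f"{kind}: {counts[kind]} of {total} = {share:.0%}")
```

The frequencies (3 and 8) are what a bar graph displays, while the relative frequencies (27% and 73%) are the slices of the pie chart.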

In a similar way, we could describe Revenue, the quantitative variable. Typically, quantitative variables are described by:

- Central tendency: a measure of the “typical” or center-most observation. Examples are the mean (average), the median (the value that is literally the middle number), and the mode (the most frequently occurring number, which is typically not used, since data sets usually do not have one).
- Variability: a measure of how spread out the data values are. A number of possible measures exist, including (but not limited to) the range, the interquartile range, and the standard deviation.


For the present time, we'll proceed to describe one of each of the above descriptive statistics. The rest will be discussed in later sections. Since we're most used to finding a simple average, or the mean, we will do that here. Recall that the mean can be found by summing the observations and dividing by the number of observations:

\[ \bar{x} = \frac{5{,}296 + 2{,}230 + \cdots + 3{,}946}{11} = \frac{63{,}246}{11} \approx 5{,}750 \]

Recall that when we find an average, we are placing all values into a common “pot.” We then divide the pot into equal parts. That is to say, if each customer had spent the same amount of money on each purchase, each would have spent about $5,750. We like to think of this as a measure of the center value. Spending less than this amount puts a customer below the average and spending more puts the customer above the average.

Mean (Simple Average)

The mean, or simple average, of a quantitative variable is expressed as:

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

where \(x_1, x_2, \ldots, x_n\) are the observations and \(n\) is the number of observations. This value represents the amount allocated to each observation, if each observation were to receive an equal share of the total. We think of this as the “center” value.

In conjunction with measures that summarize the center, it is critical to focus also on how spread out the data is. One such measure is the range. The range is simply the difference between the maximum and minimum values in the dataset. In this instance, we have:

Minimum: $2,230
Maximum: $9,590

The difference is:

\[ \$9{,}590 - \$2{,}230 = \$7{,}360 \]

Thus, the range of the dataset is $7,360. This tells us that the amount spent varied by as much as $7,360 from customer to customer.

Range

Range, a measure of the variability (or spread) of a dataset, is measured by taking the difference between the largest and smallest observed values. That is,

\[ \text{Range} = \text{maximum value} - \text{minimum value} \]
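The mean and range of the eleven revenue figures can be checked with a few lines of code. This is a minimal sketch using the values from the revenue table; the variable names are illustrative.

```python
from statistics import mean

# Revenue ($) for the eleven accounts, in the order of the revenue table.
revenues = [5296, 2230, 7643, 3897, 9590, 2689, 5890, 9561, 3643, 8861, 3946]

# Mean: the total "pot" divided equally among the observations.
average_revenue = mean(revenues)

# Range: largest observation minus smallest observation.
revenue_range = max(revenues) - min(revenues)

print(f"Total revenue: ${sum(revenues):,}")
print(f"Mean revenue:  ${average_revenue:,.0f}")
print(f"Range:         ${revenue_range:,}")
```

The unrounded mean is about $5,749.64, which the text rounds to $5,750.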

Example 1: For the example considered above, summarize the center and spread of revenue by account type. Describe any information revealed by splitting up the data in this fashion.

SOLUTION: We are being asked to look at values specific to the account type. Thus, we will have two means and two ranges.

For “New” accounts:

Account Type    Revenue ($)
New             $5,296
New             $3,643
New             $8,861

For “Old” accounts:

Account Type    Revenue ($)
Old             $2,230
Old             $7,643
Old             $3,897
Old             $9,590
Old             $2,689
Old             $5,890
Old             $9,561
Old             $3,946

We summarize this information in a table:

Row Labels     Average of Revenue ($)    Max of Revenue ($)    Min of Revenue ($)    Range
New            5933                      8861                  3643                  5218
Old            5681                      9590                  2230                  7360
Grand Total    5750                      9590                  2230                  7360

We see that both types of customers tend to have about the same average purchase amount. However, it appears that the amount spent by old customers is prone to more fluctuation than that of new customers. This might be due simply to the fact that there are only three new customers.

Technology Note: All of the information above was generated using Microsoft Excel.
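The summary table was produced in Excel. For readers who prefer code, the same split-by-account-type summary can be sketched in Python. This is one possible approach, not the method used to build the table, and the names `sales`, `by_type`, and `summary` are chosen only for this example.

```python
from collections import defaultdict
from statistics import mean

# (account type, revenue in $) pairs from the revenue table.
sales = [("New", 5296), ("Old", 2230), ("Old", 7643), ("Old", 3897),
         ("Old", 9590), ("Old", 2689), ("Old", 5890), ("Old", 9561),
         ("New", 3643), ("New", 8861), ("Old", 3946)]

# Group the revenues by account type.
by_type = defaultdict(list)
for account_type, revenue in sales:
    by_type[account_type].append(revenue)

# Center (mean) and spread (range) for each group.
summary = {}
for account_type, values in by_type.items():
    summary[account_type] = {
        "mean": round(mean(values)),
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
    }

for account_type, stats in summary.items():
    print(account_type, stats)
```

The rounded means (5933 and 5681) and the ranges (5218 and 7360) agree with the table in the solution to Example 1.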

1.2.3 Inferential Statistics

Descriptive statistics is a great way to describe what you have, but how can we describe data that we do not have? Let's consider an example. You are the manager of the production branch at Healthy Heart Organic Foods. Due to recent workload increases, you are concerned that your employees' team morale has decreased. You have 864 employees working in your department. You would like to conduct a survey, but you do not have the means to investigate the data in every one of the surveys. Certainly, you could pay your assistant overtime to analyze them for you, but that would be costly in both his time and your payroll. Instead, you decide to randomly survey 50 of the employees in your department in order to get an idea of the overall morale.

This process of collecting data on a smaller portion of the whole in order to generalize to the whole is known as statistical inference. This branch of statistics is called inferential statistics.

It is of utmost importance to draw appropriate conclusions when reporting the findings of any study, whether a survey or an experiment. For example, if we find that rats die after ingestion of 20 mg of caffeine, does that mean caffeine will kill a human, as well? This brings up the worthwhile discussion of a population versus the sample. Let's consider the figure below:

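The manager's plan to randomly survey 50 of the 864 employees can be sketched with a random number generator. The ID numbers 1 through 864 and the seed are assumptions made for this illustration; the text does not say how employees would be labeled.

```python
import random

population_size = 864   # employees in the department
sample_size = 50        # employees to survey

# Hypothetical ID numbers 1..864, one per employee.
employee_ids = list(range(1, population_size + 1))

# A fixed seed (an arbitrary choice) makes the draw repeatable.
rng = random.Random(2013)

# Draw 50 distinct employees; each employee has the same chance of selection.
sample_ids = sorted(rng.sample(employee_ids, sample_size))

print(sample_ids)
```

Because `random.sample` draws without replacement, no employee can be selected twice, and every group of 50 employees is equally likely to be chosen.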

First off, a researcher must decide who his target population is. That is, is he trying to describe all people in the United States? All Asian children between the ages of 2 and 5? All elk in Minnesota? The population is the set of all people, creatures, things, etc., that we wish to describe.

It is often quite time-consuming and costly to conduct a study based on whole populations. Even presidential polls rarely involve more than a couple thousand participants. Through one of a variety of processes, only a select number of elements of the target population will be selected. This select number is referred to as the sample.

The process of selecting a sample from the population that we will consider is simple random sampling (SRS). This process helps to ensure that any differences that we notice among sample elements are entirely due to chance and, importantly, that every element in the target population has an equal chance of being in the sample. Simple random sampling can be done by many means. You've probably heard of the random process of drawing a name from a box to declare the winner of a raffle. A more sophisticated approach uses a random number generator on a computer, wherein every element of the population is assigned a whole number. Then, a series of random numbers is drawn by the computer and the corresponding elements are selected to be in the sample.

We can see in the illustration above that our goal is then to make inferences about the population based on our observations of the sample. Just as you might hear from Gallup, “55% of voters plan on voting for Candidate X,” we try to make generalizations about the target population.

As another example, consider a lighting company that is hoping to manufacture a light bulb with a new type of filament. As with any light bulb, a consumer would want to know how long the light bulb is expected to last. Unfortunately, not every light bulb will last as long as every other light bulb. This means that an average will have to be taken. To add to this, it is not possible to test every single light bulb to determine how long it will last. So, the company decides to randomly test 200 bulbs that come through the assembly line. They hope to use this sample, since it is random and is assumed to be representative of all light bulbs, to estimate the true average lifespan of a light bulb with this new filament. Here is an overview of their inferential statistics process:


(SOURCE: Essentials of Modern Business Statistics, 4th Edition, Anderson, et al.)

Though it might seem simple enough to conclude that the average light bulb survives for 76 hours, we have to take into account the variability in the lifetimes. That is to say, we need some way to produce a reasonable interval for the true average, since it is the entire population we are looking to describe. A discussion of this inference process is left for future sections.

Homework Problems - 1.2

1. Over its first week in the Box Office (12/14/2012 to 12/20/2012), the movie The Hobbit: An Unexpected Journey grossed the following amounts, in millions of dollars (no particular order):

6.9   9.2   1.6   1.9   1.9   1.6   4.9

(SOURCE: www.the-numbers.com)

a. Calculate the mean.
b. Explain the real-world meaning of the mean.
c. Calculate the range.
d. Explain the real-world meaning of the range.
e. Provide a brief written report (summary) to the producers of the film on how the film is doing and the stability of gross revenues.

2. A marketing firm conducts a focus group with eighteen randomly selected college students to determine their preference for a variety of clothing lines.
a. Describe the sample.
b. Describe the population.
c. What variables might the marketing firm want to measure?
d. Is the firm's goal to conduct descriptive or inferential statistics?

3. In a quality control process, 250 packages of cheese are randomly selected from an assembly line. Each package of cheese will be described as either “pass” or “fail,” depending on whether or not it passes the inspection.
a. Describe the sample.
b. Describe the population.
c. Quality control will fail if more than 1% of the packages fail. How many packages must pass?

4. Two datasets have a range of 30. Describe how it is possible that one dataset is considered to be more spread out than the other dataset.

5. One hundred randomly selected CGCC students are surveyed and asked, “Do you believe that racism is an issue in the college setting?” The survey makers would like to generalize to college students. What is wrong with their study?


1.3 Statistics in Excel

When conducting an analysis of realistic amounts of data, it is tiresome, mundane, and even infeasible to carry out computations by hand. Microsoft Excel is a powerful and accessible piece of software that does all of this for us. As such, we seek to better understand how it works in this section. All images below come from the most recent version of Microsoft Excel. Excel is spreadsheet-based software. This means that each entry, or cell, holds one piece of information and is part of a larger grid of cells. A cell may contain numerical or textual information.

1.3.1 Sum(), Average(), Min(), and Max()

Eventually, you will learn to make beautiful spreadsheets, but we are now only concerned with some basic features. Let's begin by entering the following accounting data from Section 1.2:

Account Type   Revenue ($)
New            $5,296
Old            $2,230
Old            $7,643
Old            $3,897
Old            $9,590
Old            $2,689
Old            $5,890
Old            $9,561
New            $3,643
New            $8,861
Old            $3,946

We can choose any cell we want to begin entering data. Let's choose cell A1 to type in the header. This cell reference means that we are looking at column A and row 1. We will enter our second column's label into cell B1. We will list the data vertically, as shown in the table above. After clicking on a cell and typing in each entry, simply press ENTER or TAB to move to the next cell. Do not press ESC, or the data you are typing will be cancelled.

In order to see the entire labels in cells A1 and B1, we can expand the column by placing the cursor between the grey-shaded labels for columns A and B, clicking, holding, and dragging the window to an appropriate size.


We can make it a bit more presentable by centering and by bolding the labels.

Excel is extremely useful because it allows us to create formulas based on the values of existing cells or cell ranges (a collection of one or more cells). A formula can either act on a provided value or on a provided set of cells. For example, suppose we want to add up the total revenue. We want the result to appear in cell D3. To initiate a formula, we must begin with = in the desired formula cell. Thus, we could click cell D3 and type out the entire addition by hand, one revenue value after another:

This, however, would defeat the purpose of having entered all the data in already! So, we will use the built-in sum function. To use this, we type:

= sum(B2:B12)

This tells Excel to sum up the range of values from B2 to B12. The colon indicates that we want the full range and not just the two cells B2 and B12. If we had wanted to sum only cells B2 and B12 (nothing in between), then we would have replaced the colon with a comma.

NOTE: Excel is not case-sensitive when it comes to formulas. You can type SUM or Sum or even sUm and Excel will recognize what you are asking it to do. The same holds for most text matching in Excel: when counting categorical data, “New” and “new” are treated as the same entry, although the spelling must otherwise match exactly. We get:

(NOTE: It is highly recommended that you label your spreadsheet values. Before or after inserting the sum into D3, it is a good idea to label that cell's content, perhaps in cell C3 as shown above. This will be very helpful when your spreadsheet is loaded with information.)

To get the proper formatting, highlight cell D3 and select “Currency” from the Number column in the Home tab. This formatting only applies to the selected cell(s). To find the average revenue, we would simply type the following into the desired cell (we'll use D4):

= average(B2:B12)

For measures such as the range, Excel does not have a built-in range function. Excel does, however, have functions to locate the maximum and minimum values in a range of cells. Into cell D5, we will type:

= max(B2:B12) - min(B2:B12)

This will find the maximum value from B2 to B12 and subtract away the minimum from B2 to B12, giving us precisely the range. If it is desirable to see the max or the min, you can choose a cell and simply type in the max portion or the min portion without doing the subtraction, as shown below:

Suppose that this company expects to earn (roughly) the same revenue of $63,246 each day over the next 30-day month. To get the month's revenue, we would like to multiply this amount by 30. To do this, we would simply type into our desired output cell:

= 30*D3

NOTE: To indicate multiplication in Excel formulas, you must use the multiplication sign (*). Using parentheses alone to indicate multiplication will produce an error.
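For readers who would like to double-check the spreadsheet results outside Excel, here is a minimal sketch in Python. Python is not used in this book, so treat it purely as an optional cross-check. The variable names are made up for illustration, and the revenue values are copied from the table in Section 1.3.1.

```python
# Revenue values copied from the Account Type / Revenue table in Section 1.3.1.
revenues = [5296, 2230, 7643, 3897, 9590, 2689, 5890, 9561, 3643, 8861, 3946]

total = sum(revenues)                        # same job as = sum(B2:B12)
average = total / len(revenues)              # same job as = average(B2:B12)
value_range = max(revenues) - min(revenues)  # same job as = max(B2:B12) - min(B2:B12)
monthly = 30 * total                         # same job as = 30*D3

print(f"Total revenue:   ${total:,}")
print(f"Average revenue: ${average:,.2f}")
print(f"Range:           ${value_range:,}")
print(f"30-day estimate: ${monthly:,}")
```

The total agrees with the $63,246 used above. The other three numbers are simply what the same arithmetic gives and should match cells D4, D5, and the 30-day output cell.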

There are literally hundreds of functions available through Excel. A very useful way to learn how to do new things in Excel is to Google what you are trying to accomplish. For example, if I wanted to find the standard deviation of revenues, I might search Google for “standard deviation in Excel.” Thousands of results are bound to pop up. Why stop there… try YouTube for many useful videos.

1.3.2 Countif()

It is nice to know that Excel has formulas to operate on quantities, but it could still be devastating to have to count categorical values by hand. The countif() function is useful for such a task. This function works as follows: you provide a range of cells for the function to evaluate. You then provide a condition that it should search for, and it counts the number of such instances. Suppose we want to count the number of new accounts. The account types are in cells A2 to A12, so we would enter:

= countif(A2:A12, "New")

NOTE: we separate the cell range from the condition with a comma. After the comma, we type in quotation marks the word it is to search for. Note that spelling does matter here, since we need to tell Excel exactly what to search for (countif() does, however, ignore the difference between upper- and lowercase letters).

We get:

We can do the same for Old. A neat little trick is to modify our formula. Let's say that we want to minimize the number of areas in our spreadsheet that we would need to change if, say, we began calling “New” accounts “NB” for “New Business.” We would need to change all the account type names, as well as the search criteria in the formula. To make this easier, we can tell our formula to search for something that is already typed into an existing cell. Since C10 contains the actual word we want to search for, we will simply put C10 after the comma instead of the word “New.”

= countif(A2:A12, C10)

This tells Excel what cells to count, and it tells it what cell to find the search criteria in. We still get the same result. A word of caution: if you modify the entry in C10, your result in D10 will change accordingly (or it might produce an error).

Homework Problems - 1.3

1. A new policy prohibiting the sending of personal emails is enforced by a telemarketing company. A climate survey was then conducted to ask a number of randomly selected employees whether they agree with the policy and how long they have been with the company. The results are below:

Agrees w/Policy Change?   Years at Company
Y                         4
Y                         8
N                         3
Y                         10
N                         3
N                         3
N                         6
Y                         3
Y                         5
N                         8
N                         1
Y                         8
Y                         10
Y                         5
Y                         8
Y                         3
N                         8
N                         8
Y                         9

a. Determine the mean number of years this sample has been with the company.
b. Determine the minimum and maximum number of years a person from this sample has been with the company.
c. Determine the combined overall number of years this sample has been with the company.
d. Determine the frequency with which people within this sample agreed and disagreed with the policy change.
e. Calculate the mean, the minimum and maximum, and the range for each of the two groups (agree and disagree).
f. Describe any patterns that emerged when considering the two groups separately.


Chapter 2 Visual Representations of Data

2.1 Visualizing Categorical Data

When summarizing data, it goes without saying that there are appropriate and inappropriate ways to display the data. For example, if you collected a person's age and income, you might be interested in studying income as a function of age. In this case, you probably would not want to build a pie chart, since you're studying quantitative variables (two of them, at that). In the previous chapter, the main types of categorical data visualizations were mentioned: bar graphs and pie charts. Our aim here is simply to summarize and to show how to use them in conjunction with Excel. We'll create three types of representations:

- Pie Chart
- Frequency Bar Graph – Vertical axis keeps track of the number of instances of each observation
- Relative Frequency Bar Graph – Vertical axis keeps track of the proportion of instances of each observation (decimal or percentage, typically)

2.1.1 Creating a Pie Chart Using Excel

Suppose a hotel owner asks 20 randomly selected recent guests to respond to the following statement regarding their experiences at the new hotel lounge: “The dining experience in Harlan’s Hotel Lounge is worth revisiting.” Respondents circle one of the following letter combinations:

- SD - Strongly Disagree
- D - Disagree
- A - Agree
- SA - Strongly Agree

The resulting data is shown below:

Participant   1    2    3    4    5    6    7    8    9    10
Opinion       D    A    SD   A    SD   SA   A    A    A    A

Participant   11   12   13   14   15   16   17   18   19   20
Opinion       A    A    D    A    A    A    A    A    A    A

To represent the data to his shareholders, his marketing team constructs the visual representations listed above. Since the participant number is not important, it is okay to ignore that line of the dataset. Our focus is on the Opinion row. This is a categorical variable, so we'll begin by counting the number of SD, D, A, and SA responses by using Excel's countif() function. Further, we'll calculate the relative frequency of each response by dividing the number of responses for each category by the total number of observations, which we tally below all the individual frequencies:

One new trick worth mentioning is Excel's ability to recognize patterns in our formulas. Let's say that we typed in our countif() formula for SD in G7 as follows.

We now have to enter a formula for the three remaining opinions. This can get time-consuming. So, we attempt to copy cell G7 and paste it in G8:

This does work! Note that, since we shifted the formula down one row, F7 turned into F8. That is, the search criteria is now being “pulled” from F8, the cell corresponding to an opinion of 'D'. However, we have one problem: the counting region also shifted from D6:D25 to D7:D26. We don't want that! To tell Excel that we still want the counting region to be D6:D25 and to not change when we copy our formula, we “lock” the rows and columns by putting a dollar sign ($) before the column letter and before the row number (that is, we write $D$6:$D$25), as shown below:

(HINT: If you place your cursor on a cell name in the formula and press F4 on your keyboard, you will notice the dollar signs toggle for you.)

Notice that F7 contains no dollar signs, so as to indicate to Excel that we wish for the criteria cell to adjust down one row (still in column F) as we move down one row. We can now copy-paste the formula down the remaining cells:

In G11, we would like the sum of the frequencies, so we type:

= sum(G7:G10)

We know from the data that this value is correct! To get the relative frequencies, we want to divide each frequency by the constant 20. For instance, the relative frequency of 'SD' would be 2/20 = 0.1. Instead of telling Excel to divide 2 by 20, we will type the following formula into H7:

= G7/$G$11

Note that we lock cell G11 so that, when we copy this formula to the remaining cells, we continue to divide by 20, the value in G11.

It is neat to note that we can copy the formula all the way down to H11, since it will simply take 20 and divide it by 20, indicating that the total is 1 or 100% of the data.
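As an optional cross-check on the frequency table, the tallying can also be sketched in Python. This is not part of the book's Excel workflow, and the variable names are illustrative only. The opinions are copied, in participant order, from the dataset above.

```python
from collections import Counter

# Opinions of the 20 guests, in participant order (from the dataset above).
opinions = ["D", "A", "SD", "A", "SD", "SA", "A", "A", "A", "A",
            "A", "A", "D", "A", "A", "A", "A", "A", "A", "A"]

counts = Counter(opinions)      # plays the role of countif() for each category
total = sum(counts.values())    # plays the role of = sum(G7:G10)

for category in ["SD", "D", "A", "SA"]:
    frequency = counts[category]
    relative = frequency / total    # plays the role of = G7/$G$11
    print(f"{category:>2}  {frequency:>2}  {relative:.2f}")
```

Dividing every count by the same total mirrors what the locked reference $G$11 does when the formula is copied down the column.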


We are now prepared to construct visuals. To build a pie chart, we can simply highlight the four opinions and the corresponding frequencies (click and drag from cell F7 to G10), select the Insert tab, click on Pie in the Charts column, and select the desired pie chart. We'll select the first one.

Alternatively, it is possible to insert a blank pie chart and to then select the data afterwards. The above process saves a couple of steps. Now we would like to label the chart. It would be nice to see a title and the percentages for each of the slices. To do this, select the chart and click on Design in the Chart Tools tab that appears.

In the Chart Layouts column, we can select the style of chart most appropriate to our needs. For demonstration purposes, the first option will be shown below:


To add a suitable title, click “Chart Title” and overwrite it with an appropriate name. If the pie chart becomes distorted or labels are moved undesirably, the chart box can be adjusted by dragging out its corners.

There are many options when it comes to formatting graphs and charts. This will be left for exploration. Note also that many online sources, such as YouTube, offer tutorials on professional formatting within Excel.

2.1.2 Creating a Bar Graph Using Excel

Depending on what one would like to emphasize, a bar graph may be suitable to meet that need. We can create either a frequency bar graph or a relative frequency bar graph, depending on whether we want to display the number of times an observation appears or the percentage of observations resulting in each of the possible variable values. Using our example from above, since the frequencies are in the column adjacent to the opinion values, we can simply highlight all observations and frequencies, select the Insert tab, and, in the Charts column, select the first 2-D Column graph from Column. Be careful not to select the Total row.

[Figure: frequency column chart of opinions SD, D, A, and SA; vertical axis from 0 to 16; legend entry “Series1”]

Since there is only one variable here, we can click on “Series1” in the legend and press DELETE. This will free up some space.

[Figure: the same column chart with the legend removed]

With the graph selected, choose the Layout tab that appears in the Chart Tools area.

You can label the graph by selecting appropriate options from “Chart Title” and “Axis Titles” on the left side of the selected tab.

[Figure: bar graph titled “Guest Opinions of Harlan's Lounge”; vertical axis: Frequency (0 to 16); horizontal axis: Opinion (SD, D, A, SA)]

In the relative frequency bar graph, we wish only to change the measurement on the vertical axis. We want to draw the proportions from the third column of our data.

We can update our current bar graph to reflect this. If you do not want to lose the information in your frequency bar graph, you can copy the graph and paste it beside the existing graph. This will allow us to modify the data that is being drawn in.


Select the copied graph. In Chart Tools, select the Design tab. From there, click on Select Data.

Select the “Edit” option above the “Legend Entries” box.

Beside the “Series values” box, click the icon. This will now allow you to select the values of the dependent variable. Click and drag to select all the relative frequencies, except the total frequency. Then press the icon to close the dialogue box. After relabeling the vertical axis, you should now see:

[Figure: bar graph titled “Guest Opinions of Harlan's Lounge”; vertical axis: Relative Frequency (0 to 0.8); horizontal axis: Opinion (SD, D, A, SA)]

We notice that both graphs look nearly identical. This is due to the fact that the relative frequencies are proportional to the frequencies (they are the frequencies multiplied by 1/20!).

2.1.3 Conclusions

The owner of the hotel can reasonably conclude that 80% of his recent guests enjoyed the lounge (enough to consider revisiting!). He can conclude that 20% of his guests either did not care for it or absolutely hated it! If he is interested in additional repeat visitors, perhaps he might like to determine how to make the experience better for those who seem to be highly dissatisfied. Are these descriptive measures demonstrative of the entire population of visitors? To a greater or lesser extent – perhaps.

Homework Problems - 2.1

1. The following dataset represents the meat selection made by individuals at a dinner banquet. Attendees selected from beef (B), chicken (C), veal (V), or pork (P).

B   C   B   C   V   B   C
C   C   P   P   B   B   C

a. Is this data categorical or quantitative?
b. Create a table that shows the frequency and relative frequency for each of the choices. Use Excel.
c. Create a frequency bar graph. Label all axes.
d. Create a relative frequency bar graph. Label all axes.
e. Create a pie chart. Label all slices.
f. Write a brief report (summary) describing the meal preferences of these attendees. Describe any general trends. Use specific data and make appropriate conclusions.

2. The following data represents per capita meat consumption (pounds per person) in 2009 for a variety of meats (SOURCE: U.S. Statistical Abstract, Table 217).

Meat              Pounds per Person
Beef              58.1
Veal              0.3
Lamb and mutton   0.7
Pork              46.6
Chicken           56.0
Turkey            13.3

a. Using Excel, find the mean and range of the data.
b. Explain the real-world meaning of the mean you found.
c. Explain the real-world meaning of the range you found.
d. What conclusions can be made about the center and spread of per-capita meat consumption?

3. On opening day, the owners of Green Heart Restaurant invited 29 food critics to be a part of the culinary experience. Each critic gave a grade of A (Best), B, C, D, or F (Worst) to reflect the quality of the overall dining experience. The scores are shown below:

A   B   B   A   C   B   C   B   B
D   C   B   B   A   A   C   C   C
C   B   A   D   C   C   B   B   B
A   B

a. Generate a relative frequency bar chart.
b. Generate a pie chart.
c. What should the owners take away from the experiences of the critics?

4. Consider the scenario in problem 1.
a. What is the sample?
b. What is the population of interest?
c. What other variable(s) might be of interest to the data analyst to better study attendees' eating preferences?

5. Consider the scenario in problem 3.
a. What is the sample?
b. What is the population of interest?
c. What other variable(s) might be of interest to the data analyst to better study the target demographic?


6. Suppose you are the owner of an accounting firm. You would like to better understand the employment of the residents within ten miles of your firm. a. What variables would you collect? Which are quantitative and which are qualitative? b. What is the population of interest? c. How would you go about collecting data for this study? Be specific.


2.2 Visualizing Quantitative Data

To make an assessment of how efficient the technical support department is in helping customers solve software issues, management keeps track of the length of each phone call taking place over the day. They find the following:

Length of Call (min.)
1    2    13   4    12   4    10   6    6    9    4    3
4    0    12   6    4    4    13   15   0    4    10   4
10   7    2    10   8    4    7    0    4    4    4

Since this data is quantitative, the visual displays discussed so far are not appropriate. However, management would still like to visualize the 35 observations. One quick, by-hand technique to visualize how the times are distributed would be a dot plot: a simple number line, with any repeats stacked above others. Given the presence of great technology, we will use Excel to create a histogram, which is a graph similar to a bar graph (it can show either frequency or relative frequency). The difference is that, instead of having nominal categories on the horizontal axis, we will create numerical categories. For example, we could simply create a tick mark for each observation value present in the table and then display the number of times it appears. Often, with small amounts of data, the graph may appear spread out. In this case, we might decide to create a bar representing, say, all calls that fall between 0 and 3 minutes. Let's demonstrate both:

[Figure: histogram titled “Call Times”; vertical axis: Frequency (0 to 14); horizontal axis: Length (min.), with one bar for each observed value: 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 15]

We clearly see that most calls are between about 4 and 10 minutes (a 4-minute call is most frequent – the mode). Alternatively, we might choose to create equal-width categories. Let's say we have categories that show the times as 3-minute blocks:

[Figure: histogram titled “Call Times”; vertical axis: Frequency (0 to 14); horizontal axis: Length (min.), with bins 0-2, 3-5, 6-8, 9-11, 12-15]

Beautiful! Now it is clearer how call times are distributed. This visualization is a bit simpler than the one above, as it groups times into more manageable categories. Note that the bars are touching. This is what distinguishes a histogram from a bar graph: we want to emphasize that times are continuous and that every time length between 0 and 15 is accounted for (even fractions of a minute, potentially). We can make these categories as wide or narrow as we'd like. We call these categories bins. Think about this as you would about sorting recycling materials into one of several bins.
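To make the idea of bins concrete, here is a minimal sketch in Python that sorts the 35 call times into the five bins used in the graph above. Python is not part of this book's toolkit, so this is only an optional illustration. The bin boundaries are read off the axis labels, and the variable names are made up.

```python
# Call lengths in minutes, copied from the table at the start of this section.
call_times = [1, 2, 13, 4, 12, 4, 10, 6, 6, 9, 4, 3, 4, 0, 12, 6, 4, 4,
              13, 15, 0, 4, 10, 4, 10, 7, 2, 10, 8, 4, 7, 0, 4, 4, 4]

# Bin boundaries matching the labels 0-2, 3-5, 6-8, 9-11, 12-15 (both ends included).
bins = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 15)]

bin_counts = {}
for low, high in bins:
    label = f"{low}-{high}"
    # Count how many call times land in this bin.
    bin_counts[label] = sum(1 for t in call_times if low <= t <= high)

# Print a rough text histogram, one row per bin.
for label, count in bin_counts.items():
    print(f"{label:>5}: {'#' * count} ({count})")
```

Every call lands in exactly one bin, so the bin counts add back up to the 35 observations, just as every recyclable ends up in exactly one container.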

2.2.1 Creating a Frequency Histogram Using Excel

The most time-consuming part of building a histogram by hand is organizing the data and counting the number of observations. Excel does this quite easily via the use of a pivot table. A pivot table is a “live” table whose values can be formatted in many different ways. We must first begin with the dataset in Excel as a raw column or row of data:

To insert a pivot table, highlight the entire set of data, including the data label. Click on the Insert tab and choose the PivotTable option from the Tables column. A data prompt should appear with the table range already appearing in the box:


You can either choose to have Excel place the table within the same worksheet, or you can have it create a new one. This choice is up to you. If you choose “Existing Worksheet” you will have to specify a cell to paste it to. Choose a cell that is out of the way of any existing data so that it doesn't “bump” into it if the pivot table becomes quite large. You should now see something similar to the table below:

When highlighted, a “PivotTable Field List” window should appear to the right of your screen with the name(s) of the variable(s) in the “Choose fields to add to report” box.

This generic template will now allow us to construct a table. From the PivotTable Field List window, we will drag the Times variable into the Row Labels box. This will create a series of rows with each of the observations appearing, only once. Thus, we will not have to see repeats!


If we had additional variables, the row labels can be any variable desired. For each of these rows, we would like to see a frequency count. This is where the “Values” box comes in handy. Drag the Times variable into the “Values” box:

The values of time are, by default, the sums of the times for each of the row labels. This is not what we want. We want “Count of Times.” To change the type of value, click the arrow on the “Sum of Times” button. Choose “Value Field Settings.” Change the “Summarize value field by” option to “Count” and close the dialogue box:

We can double-check that these values are correct by noting that the Grand Total is 35, the same as the number of observations. We would like a histogram to show the “Row Labels” along the horizontal axis and the “Count of Times” along the vertical axis. To do this, select the pivot table and choose the Options tab from the PivotTable Tools menu.


Select PivotChart and select the first graphing option:


We make a few adjustments: delete the legend, re-label the chart title, and remove the two grey boxes. Now that a graph has been inserted, a PivotChart Tools menu appears when the graph is highlighted. This is very similar to inserting a regular graph. Select Layout to add axis labels. To remove the grey boxes, right-click either box and select “Hide All Field Buttons on Chart.”

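[Figure: column chart titled “Histogram of Call Times”; vertical axis: Frequency (0 to 14); horizontal axis: Times (min.), with one bar for each observed value: 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 15]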

To make the gaps between bars disappear, select the graph and choose the eighth graph option from the Design tab in the PivotChart Tools menu shown below (NOTE: this option will automatically put in axis labels):


To make solid black lines appear as the outlines for each bar, change the bar styles from “Chart Styles.”

[Figure: “Histogram of Call Times” with touching, black-outlined bars; vertical axis: Frequency (0 to 14); horizontal axis: Times (min.), values 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 15]

We now would like to adjust the bin widths. Doing this is simple! Select the pivot table. From the Options tab under the PivotTable Tools menu, choose “Group Selection” from the Group column. In the dialogue box that appears, the “Starting at” and “Ending at” boxes should reflect the smallest and largest values of the variable. You can adjust these to be wider or narrower, if you choose to show less than the full dataset. In the “By:” box, put the width of the classes. In this case, we chose 3. Press “OK” and you should then see the updated pivot table and graph!

To change frequency to relative frequency, we must now change “Count of Times” in the “Values” box of the “PivotTable Field List.” Click on “Count of Times” and select “Value Field Settings.” Within the dialogue box, choose the “Show Value As” tab and choose values to show as “% of Grand Total.” Press “OK.” Adjust the vertical axis label accordingly.
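The “% of Grand Total” setting does nothing more than divide each bin's count by the total number of observations. As an optional illustration outside Excel, here is a small Python sketch of that arithmetic. The bin counts below were tallied by hand from the 35 call times listed at the start of this section, not read from the book's figure, and the names are illustrative.

```python
# Frequency of calls in each 3-minute bin, tallied from the 35 call times in this section.
bin_counts = {"0-2": 6, "3-5": 13, "6-8": 6, "9-11": 5, "12-15": 5}

grand_total = sum(bin_counts.values())   # the pivot table's Grand Total

# Each count divided by the grand total, which is what "% of Grand Total" reports.
relative = {label: count / grand_total for label, count in bin_counts.items()}

for label, share in relative.items():
    print(f"{label:>5}: {share:.2%}")
```

The shares always add up to 100%, which is a quick way to check a relative frequency table.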

[Figure: histogram titled “Histogram of Call Times”; vertical axis: Relative Frequency (0.00% to 40.00%); horizontal axis: Times (min.), with bins 0-2, 3-5, 6-8, 9-11, 12-15]

Homework Problems - 2.2

1. An instructor grades a math test and produces the following histogram:

[Figure: histogram titled “Histogram of Test Percentages”; vertical axis: Frequency (0 to 10); horizontal axis: Percentage Earned, with bins 60-64, 65-69, 70-74, 75-79, 85-90]

a. What can the instructor conclude about the fairness of the test?
b. What appears to be the mean score, based on the histogram?
c. What is the approximate range of scores, and why is it only possible to approximate this from the given information?

2. A cashier at a mall retail clothing outlet asked customers their age for an anonymous survey. The ages he collected can be found below:

31   34   30   30   31   27   33   36
33   30   29   28   20   32   24   30
32   30   30   22   31   38   28   31
25   24   25   31   25   24   36   32
24   31   31   32   31   31   28   31
33   20   32   32   52   31   27   30

a. Using Excel, find the mean and range of the data.
b. Explain the real-world meaning of the mean you found.
c. Explain the real-world meaning of the range you found.
d. Create a relative frequency histogram for age. Leave your bin width as 1 year.
e. Create a relative frequency histogram for age with bin width 5 years.
f. Describe any trends in the age of shoppers at this store.
g. Based on your answer to e), which age group(s) can be omitted from the company's marketing tactics, in an effort to focus only on the regular shoppers?

3. The total number of people (in millions) working in all of the various industries in the United States in 2010 is given in the table below: 2.206 7.134 9.115 12.53

0.731 5.88 6.138 2.966

9.077 1.253 32.062 9.564

14.081 3.149 13.155 6.769

8.789 9.35 18.907 6.102

5.293 6.605 6.249 0.667

3.805 2.745 9.406 6.983

15.934 15.253 3.252

(SOURCE: U.S. Statistical Abstract, Table 619) Statistics for Decision-Making in Business

© Milos Podmanik

Page 54

a. b. c. d.

Using Excel, find the mean and range of the data. Explain the real-world meaning of the mean you found. Explain the real-world meaning of the range you found. Create a relative frequency histogram for age. Leave your bin width as 2 million people. e. Create a relative frequency histogram for age with bin width 5 million people. f. The federal government regularly publishes reports on employment across the many industries. Using the information you have gathered, generate a brief report detailing your findings, including any trends in employment. 4. A resort chain that wishes to expand is constantly searching for new sites to add properties that will be profitable. A good place to start is by considering climates. Suppose Starwood Hotels and Resorts Worldwide obtains the following data from the U.S. Census Bureau on highest temperatures ever recorded in various cities in the United States: 112 109 114 118 110 120 114

100 112 114 117 121 113 115

128 100 105 118 113 120

120 118 109 125 120 117

134 117 107 106 119 107

114 116 112 110 111 110

106 118 115 122 104 118

110 121 115 108 111 112

(SOURCE: U.S. Statistical Abstract, Table 391)

a. Using Excel, find the mean and range of the data.
b. Explain the real-world meaning of the mean you found.
c. Explain the real-world meaning of the range you found.
d. Create a relative frequency histogram of the data. Leave your bin width as 5 degrees.
e. Create a relative frequency histogram of the data with bin width 10 degrees.
f. What percentage of states can be eliminated from consideration if the company will not take any risks with states that have had a record high over 115 °F?
g. Summarize the distribution of high temperatures in the U.S.

2.3 Descriptive Statistics – Center and Position

Histograms provide us with a great visualization of the overall distribution of values. A distribution describes the layout of the values of a quantitative or categorical variable. To further describe the differences between two similar distributions, it is helpful to use statistics that describe center, location, and spread.

2.3.1 Mean and Median

To make peace with some regularly occurring notation in statistics, we will use \( \Sigma \) (“sigma”) to mean “the sum of.” For instance, let's say that we have a set of \( n \) variable values. To distinguish each of these “\( x \)'s” we'll use subscripts, denoting them:

\[ x_1, x_2, x_3, \ldots, x_n \]

Then, to indicate that we want to sum these values across all subscripts, we would write:

\[ \sum x_i \]

which means, “sum up all \( x \) values in the dataset,” or

\[ \sum x_i = x_1 + x_2 + \cdots + x_n \]

Using this new notation, we already know how to calculate the mean:

Mean – x-bar notation
The mean value, or average, of a dataset containing \( n \) values can be written as:

\[ \bar{x} = \frac{\sum x_i}{n} \]

The symbol \( \bar{x} \) is used to denote the mean of a sample and can be read as “x-bar.”

A common point of confusion for students is the difference between the subscript \( i \) and the denominator \( n \). Many people think that the subscript should be \( n \) to match the number of elements in the dataset. However, \( x_n \) specifically refers to the very last value in the dataset. We treat \( i \) as an index that goes across all subscripts from 1 all the way up to and including \( n \). To account for this discrepancy, mathematicians usually write where the index should start below the sigma and the maximum value above the sigma. For example, if there are 3 values in the dataset, we would write the mean as:

\[ \bar{x} = \frac{\sum_{i=1}^{3} x_i}{3} = \frac{x_1 + x_2 + x_3}{3} \]

As you can see, the sigma notation can quickly become convoluted, and so we typically just write \( \sum x \) to indicate the sum of all \( x \)-values.

Median
The median value of a dataset is the value that represents the physical center of the data set. To locate the median, organize the data values from smallest to largest. Then:

If there is an odd number of values in the data set, the center value can be located by counting in

\[ \frac{n + 1}{2} \]

positions from the smallest value, including the smallest value. Alternatively, one can count in an equal number of values from the left and right endpoints to locate the center value.

If there is an even number of values in the data set, average the two middle-most values together. The locations of the two middle-most values are

\[ \frac{n}{2} \quad \text{and} \quad \frac{n}{2} + 1 \]

positions from the smallest value, including the smallest value. Once again, these values can be found by counting from the left and the right endpoints of the dataset.

Example 1: Find the mean and median salaries for the company represented by the following dataset (in thousands). Explain which measure better reflects the overall company demographic.

SOLUTION: We first find the mean:

\[ \bar{x} = \frac{\sum x}{n} = 148.2 \]

This means that, on average, employees earn $148,200 per year.

To find the median, we begin by listing the salaries in ascending order:

The two middle values are 48 and 50 (each has four values between it and the nearer end of the list). These represent the 10/2 = 5th and 10/2 + 1 = 6th values in the dataset. To find the median, we average them together to get

\[ \frac{48 + 50}{2} = 49 \]

The median salary is $49,000 per employee per year. The median is clearly the more representative measure. The mean takes into account all values, including the outlier, or “extreme” salary of $1.1 million per year. The median is not influenced by extreme outliers.

To find the mean and median salaries in Excel we use the functions average() and median(). The parameter for both functions is the cell range corresponding to the dataset.
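The same comparison can be sketched outside of Excel. The short Python example below is only an illustration: the salary list is hypothetical (it is not the dataset from Example 1), and the helper name `median_by_position` is made up here. It applies the counting rule for the median described above and shows how a single extreme salary pulls the mean upward while leaving the median alone.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars (NOT the Example 1 dataset).
salaries = [22, 25, 31, 38, 44, 52, 57, 63, 70, 900]


def median_by_position(values):
    """Median via the counting rule: sort, then pick the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value in position (n + 1) / 2, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values in positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


x_bar = sum(salaries) / len(salaries)    # the sum of all x-values divided by n
middle = median_by_position(salaries)

print("mean:  ", x_bar)     # pulled upward by the 900 outlier
print("median:", middle)    # unaffected by how extreme the outlier is
```

Python's built-in `statistics.mean` and `statistics.median` play the role of Excel's average() and median() and should agree with the hand-rolled versions.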


2.3.2 Percentile

Another useful tool for describing the location of data points is a percentile.

Percentile
The \( p \)th percentile is a value such that \( p \) percent of the values in a dataset (of \( n \) values) are less than or equal to this value.

To find the location of this value, that is, the index, \( i \), first arrange the data in ascending order. The index can be calculated by:

\[ i = \left( \frac{p}{100} \right) n \]

That is, find \( p \) percent of the number of observations. Round up if the index is a decimal, and take the average of the values in positions \( i \) and \( i + 1 \) if the calculated value of \( i \) is an integer. One of these two actions will be taken.

Example 2: Find the 50th percentile for the salaries in Example 1. Interpret the real-world meaning of this value.

The values, in ascending order, are:

We take \( i = \left( \frac{50}{100} \right) 10 = 5 \). Since this is an integer, we average together the values in positions 5 and 6, giving us a value of 49. This means that 50% of employees represented in this dataset make $49,000 or less.
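The index rule can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the salary list is hypothetical (not the Example 1 data), the function name `percentile` is made up here, \( p \) is taken to be a whole number, and the handling of the 0th and 100th percentiles (returning the smallest and largest values) is an assumption, since the rule above does not cover those edge cases.

```python
def percentile(values, p):
    """p-th percentile using the index rule i = (p / 100) * n.

    Assumes p is a whole number from 0 to 100.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Split i = p * n / 100 into its whole part and a remainder,
    # so that "is i an integer?" is decided without rounding error.
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        # Decimal index: round up. Position whole + 1 (counting from 1)
        # is list index `whole` (counting from 0).
        return ordered[whole]
    if whole == 0:
        return ordered[0]     # assumption: 0th percentile is the minimum
    if whole == n:
        return ordered[-1]    # assumption: 100th percentile is the maximum
    # Integer index: average the values in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


# Hypothetical salaries in thousands of dollars (NOT the Example 1 dataset).
salaries = [22, 25, 31, 38, 44, 52, 57, 63, 70, 900]

print(percentile(salaries, 25))   # i = 2.5, round up to position 3
print(percentile(salaries, 50))   # i = 5, average positions 5 and 6
print(percentile(salaries, 75))   # i = 7.5, round up to position 8
```

Other software (including Excel) may interpolate between positions instead, so results can differ slightly from this rule.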


Not surprisingly, the 50th percentile is actually the median of the dataset! This is always true. In Excel, we can use the Percentile() function. The set-up of this function's parameters is:

=percentile(cell range, p/100)

Thus, for this dataset, we would have:

2.3.3 Quartiles

Oftentimes, data analysts like to think about data in terms of quartiles, or quarters. There are 4 quartiles, and they can be represented as follows:

Quartile 1 = 25th Percentile
Quartile 2 = 50th Percentile
Quartile 3 = 75th Percentile
Quartile 4 = 100th Percentile


2.3.4 Rank

What if, on the other hand, an employee wants to know where his salary ranks (he knows the value of his salary, but not its percentile)? This requires reverse-engineering the idea of a percentile. Without the use of any mathematical formulas, we would need to count the number of values that are less than or equal to the salary in question. To make this easier, we can use Excel's Rank() function. The parameters we will use are as follows:

= rank(value, cell range, 1)

This will return the number of values that are less than or equal to the value in question (provided the value is not tied with another). If we changed the parameter of 1 to a 0, Excel would instead rank the values from largest to smallest, treating rankings as being similar to the ranks of, say, runners in a race.

We will then need to divide the output of rank(value, cell range, 1) by the number of observations in the dataset. To make the counting process more automated, we can divide it by the output of the count() function. This function simply counts the number of entries in the specified range, and has the following parameter:

= count(cell range)

Let's say the employee making $24,000 would like to know his salary's rank. To calculate, we would type the following:

Giving us:


Thus, his salary is in the 30th percentile. This means that 30% of people represented in this dataset make $24,000 or less.

Another approach would be to use the “Rank and Percentile” tool in an Excel add-in called Analysis ToolPak. This method shows the ranks and percentiles of all values in the dataset and is only useful for relatively small, manageable datasets. The Analysis ToolPak will be important later on, so we'll describe its installation here.
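Before turning to the ToolPak, the count-and-divide calculation described for rank() and count() can be sketched in Python. The function name `percent_at_or_below` and the salary list are hypothetical and chosen only for illustration (this is not the Example 1 dataset).

```python
def percent_at_or_below(values, x):
    """Percentage of values that are less than or equal to x."""
    count_at_or_below = sum(1 for v in values if v <= x)   # plays the role of rank(x, range, 1)
    return 100 * count_at_or_below / len(values)           # divided by count(range)


# Hypothetical salaries in thousands of dollars (NOT the Example 1 dataset).
salaries = [22, 25, 31, 38, 44, 52, 57, 63, 70, 900]

# Three of the ten salaries (22, 25, 31) are at or below 31.
print(percent_at_or_below(salaries, 31))   # 30.0
```

Counting directly, rather than ranking, also gives a sensible answer when the value in question is tied with other values.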

2.3.5 Analysis ToolPak

To install the Analysis ToolPak, select the File tab within Excel. Then select Options from the ribbon that appears. Select the Add-Ins option. Click Analysis ToolPak and press Go.


Check the “Analysis ToolPak” and “Analysis ToolPak – VBA” features from the pop-up window and press OK.

You now have the ToolPak installed.


To use the “Rank and Percentile” tool, select the Data tab. Choose Data Analysis from the Analysis column. Pick “Rank and Percentile” from the pop-up window and press OK.

Select the input range:

You can either specify an output range, or have Excel create a new worksheet with the results. This is up to your preferences. Check “Labels in First Row” and be sure that the data label has been selected. The results are shown below:


You'll immediately notice that a salary of $24,000 is shown as being in the 22.2nd percentile, which does not agree with our calculation. Software packages use different techniques to conduct this calculation, and a common convention does not exist. Here, the 22.2% figure is consistent with counting only the 2 salaries strictly below $24,000 and dividing by \( n - 1 = 9 \), since \( 2/9 \approx 22.2\% \). Fortunately, both results are in the same “ballpark.”

Homework Problems - 2.3

1. Suppose your instructor releases scores on a recent project. The scores are as follows:

83 95 64 99

89 92 30 85

76 80 80 80

41 84 79 82

92 77 78 70

85 78 70 69

76 81 75 71

71 75 81 70

a. Generate a relative frequency histogram and comment on any interesting observations of the distribution.
b. Compare the mean and median. What causes them to be different in this particular way?
c. What score would be required in order to be in the 80th percentile?
d. In what percentile is a person who scores 71% on this project?

2. In order to make way for new products, a grocery store chain would like to determine whether the Lunch Pack or Family Pack of Flaxem Crackers generates more revenue. The following two datasets show the revenue generated by each over a 10-month period:

Lunch:  450 500 510 550 550 290 330 310 400 300
Family: 500 600 400 200 600 200 310 600 350 430

a. Compare the mean and median of each dataset. What can be said about the middle-most revenues?
b. Find all four quartiles for each dataset. Use this information to make an argument for why this grocer should hang on to the Family Pack.
c. For each of the datasets, determine the top 10% of revenues that can be expected.
d. Find the range of the data. Comment on how this might influence the grocer's decision.

3. Suppose that Budget Car Rentals assesses a variety of new 2012 and 2013 sedans for its new line of rental cars. It finds the following information on city and highway fuel efficiencies (mpg) for eight vehicles in consideration:

Year  Make     Model           City  Highway
2012  Toyota   Prius Hyb.      51    49
2013  Ford     Fusion Hyb.     47    47
2013  Ford     C-Max Hyb.      44    41
2012  Honda    Insight         41    44
2012  Toyota   Camry LE Hyb.   43    39
2012  Toyota   Camry XLE Hyb.  40    38
2012  Hyundai  Sonata Hyb.     34    39
2012  VW       Passat          31    43

(SOURCE: www.fueleconomy.gov)

a. Find the mean and median fuel efficiency for city and highway mileages of the vehicles being considered. Comment on any differences between the two values.
b. What is the rank percentage of a vehicle that has 43 city mpg?
c. If the company makes its choice based on the top 15% of city and highway for the vehicles being considered, what will be the minimum city and highway mileages they should consider?
d. Make a recommendation for which vehicle(s) should be purchased, if any.

2.4 Descriptive Statistics – Variability

The measure of center is always a good start. But what does a sample mean not tell us? It fails to describe how far apart the data are from one another. In other words, we need to assess the variability, or variance, of the numbers we have collected. The simplest way we might go about describing the variability is by simply looking at the range of the data, such that:

Range = largest observation - smallest observation

However, this still does not help us identify how spread out the data are. For example, suppose we find our range to be 110 units (see dataset below). This might seem rather daunting at first, but what if all of the other values were clumped between 0 and 10, and there existed an outlier of 110? Clearly, the range is often determined by outliers alone.

0 1 3 10 8 7 4 110

2.4.1 Standard Deviation

To create a better measure of variability that takes all data points into account, just like the mean does, statisticians established the standard deviation. As the title implies, this is a standard tool that measures the average deviation (that is, by how much each value deviates) from the mean. This requires us to find the deviation for every point in our dataset,

\[ x - \bar{x} \]

Let's demonstrate with the above dataset:

Value   x - x̄
0       -17.875
1       -16.875
3       -14.875
10      -7.875
8       -9.875
7       -10.875
4       -13.875
110     92.125

Mean: 17.875

The values that fall below the mean produce negative deviations, and the one above the mean has a positive deviation. To find an average deviation, we would ideally add them. However, observe that the sum of the deviations is 0! This is true of any dataset, since the mean represents the “balance” of the dataset. Due to mathematical concerns that we won't state here, mathematicians decided to square these values, since squaring converts all signed numbers into positive values.

Value   x - x̄      (x - x̄)²
0       -17.875    319.5156
1       -16.875    284.7656
3       -14.875    221.2656
10      -7.875     62.01563
8       -9.875     97.51563
7       -10.875    118.2656
4       -13.875    192.5156
110     92.125     8487.016

Mean: 17.875
Sum of squared deviations: 9782.88

Great, now they can be summed up to give 9782.88! Thus, we have found the following:

\[ \sum (x - \bar{x})^2 = 9782.88 \]

One would think that dividing by 8 would now be appropriate to find the average. Due to mathematical properties that are beyond the scope of this course, the division will be by 7, which is \( n - 1 \). Thus:

\[ \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{9782.88}{7} \approx 1397.55 \]

This value that we have found is called the variance. NOTE: The division by \( n - 1 \) has to do with the fact that we are often dealing with a sample in inferential statistics and hope to make conclusions about a population.

Sample Variance
The variance of a sample, an uninterpretable measure of variability denoted by \( s^2 \), can be found by the following formula:

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

To make all of these calculations more meaningful (to have a true average), we should probably “unsquare” the value that we have. When we do this, we get the sample standard deviation:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{1397.55} \approx 37.38 \]

This is what we can think of as the average deviation of each point from the mean. It is clearly high for this dataset. What is causing it? The outlier of 110!

Conclusion: On average, values in the dataset deviate from the mean by about 37 units.

Sample Standard Deviation
The standard deviation of a sample, denoted \( s \), is given by the following formula:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Note that this is simply the square root of the variance. In Excel, the standard deviation can be calculated simply by using the function below:

= stdev(cell range)

Example 1: A river with mild current is known to have an average depth of 3 feet with a standard deviation of 3 feet. The bottom is not visible. Is the river safe to cross by foot? Also, what is the variance?

SOLUTION: Since there is a standard deviation of 3 feet, we can conclude that, on average, the river depth deviates by 3 feet from the mean. It would not be unusual to encounter a part of the river with a depth of 6 or more feet. Therefore, the river should not be crossed by foot. Since the standard deviation is the square root of the variance, the variance is the square of the standard deviation. That is,

\[ s^2 = 3^2 = 9 \]

Thus, the variance is 9. The variance does not have a valuable interpretation.
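The step-by-step calculation for the eight-value dataset in this section (deviations, squared deviations, division by \( n - 1 \), square root) can be retraced in a few lines of Python. This is only a sketch for checking the arithmetic; the variable names are arbitrary, and `statistics.variance` and `statistics.stdev` are used as an independent cross-check because they also divide by \( n - 1 \).

```python
import math
from statistics import stdev, variance

data = [0, 1, 3, 10, 8, 7, 4, 110]   # dataset from Section 2.4

n = len(data)
x_bar = sum(data) / n                          # mean: 17.875
deviations = [x - x_bar for x in data]         # x - x_bar for each value
squared = [d ** 2 for d in deviations]         # squaring removes the signs

sum_of_squares = sum(squared)                  # numerator of the variance
sample_variance = sum_of_squares / (n - 1)     # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)         # "unsquare" to get s

print("sum of deviations:", sum(deviations))   # balances out to 0
print("sum of squares:   ", round(sum_of_squares, 2))
print("variance:         ", round(sample_variance, 2))
print("std deviation:    ", round(sample_sd, 2))
```

Dropping the value 110 from `data` and rerunning is a quick way to see how much of the standard deviation is due to the single outlier.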

2.4.2 How Do We Interpret the Value We Get?

Think about this: \( n \) is a fixed value for our sample (8 for the dataset above). The only thing that could make \( s^2 \) large or small is the numerator. Thus, if the deviations are large (a bad thing!), then the squared deviations will be large, and so the sum of squares will be large. This implies a large standard deviation.

If the deviations are small (good thing!), then the squared deviations will be small, and so the sum of squares will be small. This implies a small standard deviation. So, a large standard deviation means that there is a lot of variability, or that the values are vastly different from one another. A small standard deviation means the values in the data set are quite alike. In the near future, you'll see why it is important to have a small standard deviation. In general, as the variance and standard deviation get larger, our ability to make precise statements about the population quickly evaporates. We will be using variance and standard deviation consistently for the rest of the semester. It is important to get comfortable with them.

2.4.3 Do Population Variances and Standard Deviations Fall into Play?

Indeed they do. Do you think that we can find them? Usually not! The population variance requires the use of the population mean, \( \mu \). How do we get \( \mu \)? We take the average of all the values in the entire population. Since we typically don't know this value, we also typically don't know the population variance, so certainly we don't know the population standard deviation (since it's the square root of the population variance). The table below summarizes the notations we need to recognize:

             Variance          Standard Deviation
Sample       \( s^2 \)         \( s \)
Population   \( \sigma^2 \)    \( \sigma \)

The population parameter, \( \sigma \), is the lowercase Greek letter “sigma.” (This is as opposed to the sample statistic, \( s \).)

2.4.4 Interquartile Range

The standard deviation, much like the mean, is easily skewed by excessively small or large values. We noticed this in the first example in this section. Using the idea of medians and percentiles is a safe bet for outlier-proofing our spread estimates. The interquartile range is the difference between the 3rd quartile and the 1st quartile. Remember, these are simply the 75th and 25th percentiles, respectively. The difference is the width of the interval that contains the middle 50% of the dataset:

\[ \text{IQR} = Q_3 - Q_1 \]

This gives us a nice measure of how spread out the data is about the median.

Example 2: Consider the following home prices and find both the standard deviation and the interquartile range. Describe what conclusions can be drawn from these values.

Values (thous. $): 95 875 96 89 87 88 93 91

SOLUTION: Using Excel, we find the following:

The standard deviation indicates that home prices, on average, vary by $277,100 from the mean value. However, we see from the interquartile range that the middle 50% of homes only vary by $6,500. The standard deviation is being skewed by the home that is priced at $875,000. The interquartile range tells us that the majority of home values stay pretty close to the median value. Additionally, we see that most home values are between $88,000 and $96,000.
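The two figures quoted in this example can be checked with a short Python sketch. One assumption is worth flagging: quartiles are computed here with `statistics.quantiles(..., method="inclusive")`, which interpolates between neighboring values and reproduces the $6,500 figure. The round-up-or-average index rule from Section 2.3.2 would instead give quartiles of 88.5 and 95.5 and a slightly different interquartile range of 7, another reminder that software packages do not all use the same convention.

```python
from statistics import quantiles, stdev

# Home prices from Example 2, in thousands of dollars.
prices = [95, 875, 96, 89, 87, 88, 93, 91]

s = stdev(prices)   # sample standard deviation (divides by n - 1)

# Cut points that split the sorted data into four parts.
# "inclusive" interpolation is an assumption made for this sketch.
q1, q2, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

print("standard deviation:", round(s, 1))
print("Q1, Q3:", q1, q3)
print("interquartile range:", iqr)
```

The standard deviation is dominated by the single $875,000 home, while the interquartile range ignores it entirely.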

2.4.5 Descriptive Statistics: Analysis ToolPak in Excel

To generate most of the features we have discussed up until now, we turn to Excel's Analysis ToolPak for a more automated approach. Let's consider the house data above:

Values (thous. $): 95 875 96 89 87 88 93 91

Access the Data Analysis tool from the Data tab in Excel. Select “Descriptive Statistics” from the menu and select the data from the spreadsheet containing the data.

Be sure that you check “Summary Statistics.”


We can immediately see the mean and the median of the dataset. Additionally, we see the standard deviation, variance, range, min/max, sum of the values, and the number of values in the dataset, among other measures that we will ignore for now. We see, as expected, that the dataset does not have a mode, or most frequently occurring value.


2.4.6 Shapes of Distributions

Now that we have a basis for measuring data in terms of its center and spread, we turn back to making connections with the visual shape of the distribution. There are many different shapes that we encounter for distributions. Let's discuss a few. First, note that the following do not look like the rectangular histograms from earlier on. These are smoothed out forms of what we experienced earlier. They are often used to describe the general shape of a distribution. And, of course, they are much easier to sketch.

A histogram is said to be (a) unimodal if it has a single peak, (b) bimodal if it has two peaks, and (c) multimodal if it has more than two peaks.

If we follow the curves from left to right, we begin at the lower tail, move over the peak(s), and arrive back down to what is called the upper tail. A unimodal histogram is said to be symmetric, if we are able to draw a line down the center such that the left side of the line is a mirror image of the right side. Consider the following unimodal symmetric histograms:

A unimodal histogram that is not symmetric is said to be skewed. If the upper tail of the histogram stretches out much farther than the lower tail, then the distribution of values is positively (right) skewed. On the other hand, if the lower tail is much longer than the upper tail, the histogram is negatively (left) skewed. Can you identify the following unimodal histograms as positively or negatively skewed?


Lastly, a normal curve is the most desired type, due to its (in general) nice properties. A normal curve occurs quite frequently. It has a bell shape and is sometimes called the Gaussian curve. Here are examples of normal curves:

2.4.7 Skewness

Excel also produces a nice measure that allows us to make conclusions about the general shape of the distribution. This measure is called skewness. If the skewness measure is:

Positive, then the distribution is skewed right
Negative, then the distribution is skewed left
Zero, then the distribution is symmetric

The farther from 0 that the skewness measure is, the more skewed in the respective direction the distribution will be. Consider the following data showing the number of televisions owned by randomly sampled individuals in a big city:


3 4 4 2 2 0 0 4 1 0

4 0 3 4 2 0 1 3 2 2

3 4 3 2 0 3 4 2 0 0

2 4 0 4 2 1 4 4 3 4

3 4 4 0 1 0 2 2 0 4

2 3 2 3 1 3 1 4 2 3

1 1 1 4 3 4 2 3 3 4

1 0 2 3 2 3 0 3 2 1

0 1 4 3 2 3 2 3 0 0

Using Excel, we produce descriptive statistics using the Analysis ToolPak:

We notice that the Skewness measure is positive: 0.51. This means the dataset is slightly skewed to the right:


[Figure: Histogram of TV's Owned. Horizontal axis: Number of TV's; vertical axis: Frequency.]
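For readers curious about what a skewness measure computes, the sketch below implements one common sample skewness formula, \( \frac{n}{(n-1)(n-2)} \sum \left( \frac{x - \bar{x}}{s} \right)^3 \). That this adjusted formula is the one behind Excel's output is an assumption made here, not something stated in this text, and the function name `sample_skewness` is made up for the illustration. The example reuses the eight-value dataset from Section 2.4, whose large outlier of 110 should produce a clearly positive value.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - x_bar) / s) ** 3)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)   # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


right_tailed = [0, 1, 3, 10, 8, 7, 4, 110]   # dataset from Section 2.4
symmetric = [1, 2, 3, 4, 5]                  # small made-up symmetric dataset
left_tailed = [-x for x in right_tailed]     # mirror image of the first dataset

print(sample_skewness(right_tailed))   # positive: long upper tail
print(sample_skewness(symmetric))      # zero: mirror image about the mean
print(sample_skewness(left_tailed))    # negative: long lower tail
```

Cubing the standardized deviations keeps their signs, which is why a long upper tail pushes the measure positive and a long lower tail pushes it negative.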

2.4.8 Outlier Detection

After analyzing a dataset, how do we assess likely values for data and deem other values as outliers? One approach is to determine how many standard deviations above the mean (positive value) or below the mean (negative value) a given data value is. For instance, suppose we have a dataset with mean 20 and standard deviation 3. We have an observation of 14. In terms of units, this value is 6 units below the mean. Thus, it has a deviation of -6. This deviation tells us that the data value in question is 2 standard deviations below the mean, since:

\[ \frac{14 - 20}{3} = \frac{-6}{3} = -2 \]

This measure is often called a z-score. Let's recap:

z-Score
A \( z \)-score tells us the number of standard deviations a data point, \( x \), is from its mean, \( \bar{x} \). Mathematically,

\[ z = \frac{x - \bar{x}}{s} \]

The idea of a \( z \)-score is quite helpful, in that it tells us how far a value is from the mean, relative to the size of the standard deviation (the average spread). If \( z \) is very close to 0, then the value is not far from the mean. If \( z \) is very large in magnitude, the value is very far from the mean. A very useful theorem established by the Russian mathematician Pafnuty Lvovich Chebyshev allows us to determine how large is very large:

Chebyshev's Theorem
For any \( k > 1 \), at least \( \left( 1 - \frac{1}{k^2} \right) \) of the data values must be within \( k \) standard deviations of the mean (to the left and the right). This works for any and all distributions.

Example 3: A data value is 3 standard deviations above the mean. Is this an extreme value?

SOLUTION: Chebyshev's Theorem states that at least

\[ 1 - \frac{1}{3^2} = \frac{8}{9} \approx 89\% \]

of all data points in this distribution will lie between -3 and +3 standard deviations from the mean. At most about 11% of values can lie more than 3 standard deviations from the mean in either direction, so there is, at most, an 11% chance of observing something higher than +3 standard deviations. This data value is fairly unlikely and might be considered a mild outlier.
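Both calculations in this section are short enough to sketch in Python. The function names below are made up for the illustration; the numbers come from the worked examples above (mean 20, standard deviation 3, observation 14, and \( k = 3 \)).

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem is only informative for k > 1")
    return 1 - 1 / k ** 2


print(z_score(14, 20, 3))            # -2.0: two standard deviations below the mean
print(chebyshev_lower_bound(2))      # 0.75: at least 75% within 2 standard deviations
print(chebyshev_lower_bound(3))      # about 0.89: at least 89% within 3
print(1 - chebyshev_lower_bound(3))  # about 0.11: at most 11% beyond 3
```

Because the bound holds for any distribution, it is conservative; for bell-shaped data the share of values within 3 standard deviations is typically much higher than 89%.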

Homework Problems - 2.4 1. The Connecticut Agricultural Experiment Station conducted a study of the calorie content of different types of beer. The calorie content (calories per 100 mL) for 26 brands of light beer are: 29 23

28 32

33 31

31 32

30 19

33 40

30 22

28 34

27 31

41 42

39 35

31 29

a. Find the standard deviation. Explain the real-world meaning of this value.
b. Find the interquartile range. Explain the real-world meaning of this value.
c. Find the skewness. What type of shape does this distribution have?

2. The UNICEF report “Progress for Children” (April, 2005) included the accompanying data on the percentage of primary-school-age children who were enrolled in school for 23 countries in Central Africa.

29 43

58.3 58.4

34.6 61.9

35.5 40.9

45.4 73.9

38.6 34.8

63.8 74.4

53.9 97.4

61.9 61

69.9 66.7

43 79.6

85 98.9

63.4

a. Find the range, standard deviation, and interquartile range. Explain what these three values tell us about the shape of the distribution.
b. Explain the real-world meaning of the standard deviation and the interquartile range.
c. Produce descriptive statistics for this dataset with the Analysis ToolPak in Excel.
d. Is the distribution skewed? If so, in which direction?
e. Create a relative frequency histogram. Describe any trends in the data.
f. Is an observation of 79.6 an outlier? Use Chebyshev's Theorem to justify your answer.

3. The article “Determination of Most Representative Subdivision” (Journal of Energy Engineering [1993]: 43-55) gave data on various characteristics of subdivisions that could be used in deciding whether to provide electrical power using overhead lines or underground lines. Data on the variable x = total length of streets within a subdivision (in feet) are as follows:

1280 360 3350 450 1850 3150

5320 3330 540 2250 2460 1890

4390 3380 3870 2320 5850 510

2100 340 1250 2400 2700 240

1240 1000 2400 3150 2730 396

3060 960 960 5700 1670 1419

4770 1320 1120 5220 100 2109

1050 530 2120 500 5770

a. Find the range, standard deviation, and interquartile range. Explain what these three values tell us about the shape of the distribution.
b. Explain the real-world meaning of the standard deviation and the interquartile range.
c. Produce descriptive statistics for this dataset with the Analysis ToolPak in Excel.
d. Is the distribution skewed? If so, in which direction?
e. Find the z-score for the observation 79.6. Explain what your answer means in real-world terms.
f. Create a relative frequency histogram. Is an observation of 79.6 an outlier? Use Chebyshev's Theorem to justify your answer.

4. Using the five class intervals 100 to 120, 120 to 140, . . ., 180 to 200, devise a frequency distribution based on 70 observations whose histogram could be described as follows:

a. symmetric
b. bimodal
c. positively (right) skewed
d. negatively (left) skewed

5. The Highway Loss Data Institute publishes data on repair costs resulting from a 5-mph crash test of a car moving forward into a flat barrier. The following table gives data for 10 midsize luxury cars tested in October 2002:

Model             Repair Cost
Audi A6           0
BMW 328i          0
Cadillac Catera   900
Jaguar X          1254
Lexus ES300       234
Lexus IS300       979
Mercedes C320     707
Saab 9-5          670
Volvo S60         769
Volvo S80         4194

a. Using Analysis ToolPak in Excel, generate all descriptive statistics. Discuss the best measure of center and the best measure of spread based on what you see. Justify why these measures were selected.
b. Find the z-score for the observation 4194. Explain what your answer means in real-world terms.
c. Is $4,194 considered an extreme outlier? Also use Chebyshev's Theorem to numerically reinforce your answer.

6. Cost-to-charge ratios were reported for the 10 hospitals in California with the lowest ratios (San Luis Obispo Tribune, December 15, 2002). The 10 cost-to-charge values were

8.81 10.26 10.2 12.66 12.86 12.96 13.04 13.14 14.7 14.84

Discuss relevant descriptive statistics and a relative frequency distribution. Use your information to make a conclusion about the state of hospitals in California.

7. The technical report “Ozone Season Emissions by State” (U.S. Environmental Protection Agency, 2002) gave the following nitrous oxide emissions (in thousands of tons) for 16 states in the continental United States:

76 0

22 89

40 136

7 39

30 92

5 40

6 13

136 27

72 1

33 63

Generate a brief report about the distribution of nitrous oxide emissions in the sampled states. Use descriptive measures and visuals to justify your answer.


Chapter 3 Probability and Decision Theory

When you stop, I mean really stop, and think about how often you think in terms of probabilities, I am confident you'll find you use them more often than not. Do you ever decide to get to work by taking one route as opposed to another? Would you find yourself making health decisions based on your doctor's advice instead of the advice you might receive from a ten-year-old child? Have you ever purchased a birthday gift for someone after deep contemplation of what that person might like? Do you trust one news network over another? What are your decisions based on in these situations?

Whether or not you're willing to give in to your inner nerd, you should admit that you think in terms of chances and likelihood. I imagine that you do have a preferred route. I think that you do trust an expert's medical opinion. I believe that you do make a gift purchase after considering what you think the recipient enjoys. I should think there are some networks that you trust more than others.

In this chapter, we'll explore the nature of probabilistic thinking. You'll also notice the phrase “Decision Theory” in the title. Instead of focusing on the trite probability questions involving situations that we don't ever encounter, we'll concern ourselves with real-world situations where probabilistic reasoning will help us make a decision.


3.1 The Idea of Probability

In this section, we'll address what probability is (and isn't).

Example 1: A weather report by the National Weather Service (NWS) stated on July 31, 2011 that, overnight, there was a 50% chance of precipitation in the 85225 zip code in which Chandler-Gilbert Community College is located. What does this mean?

(SOURCE: www.crh.noaa.gov/)

SOLUTION: This is actually quite a loaded statement. One might want to say that, out of 100 times, it will rain 50 times. This is a very misleading approach for a couple of different reasons. First off, what is meant by “times”? We are only concerned with one time: overnight on July 31, 2011. A probability is actually a measure of how likely something is to occur in the long-run. That is, if something were to be repeated in trials over and over again then, theoretically, the specified outcome would occur a certain percentage of the time. Importantly, it must be noted that the conditions under which we are measuring a probability must be in place in order for the probability to be a valid measure. In our case, NWS states that, under the exact same environmental conditions taking place throughout the night of July 31, 2011, it would be expected to rain 50% of the time.

The graph below shows a hypothetical scenario in which there is a 50% chance of precipitation under the set of conditions that occurred on the above night. Notice that it rained on the initial day and so immediately the proportion (or probability) of rainy days is 100%. As the same conditions occur on different days, sometimes it rains and sometimes it does not. That said, any given day with these conditions has a 50% chance of precipitation. We notice that the proportion is quite unstable at first, jumping from 100% down to nearly 40%. However, as many days with this same set of conditions pass (in the long-run), we notice that the proportion becomes more stable and approaches the theoretical probability of 50%.

[Figure: Proportion of Rainy Days Under July 31, 2011 Overnight Conditions. Vertical axis: Proportion of Rainy Days (0 to 1.2). Horizontal axis: Day with Specific Conditions (0 to 140).]

Graph: Based on a random simulation involving the true probability of a 50% chance of precipitation and what occurs in the long-run.
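The long-run behavior described above can be reproduced with a short simulation. The following is only a sketch: the function name `running_proportions`, the seed, and the number of simulated days are illustrative assumptions, and the 50% chance is taken from the NWS example.

```python
import random

def running_proportions(p, n_days, seed=0):
    """Simulate n_days with rain probability p; return the running proportion of rainy days."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    rainy = 0
    proportions = []
    for day in range(1, n_days + 1):
        if rng.random() < p:   # it rains on this simulated day
            rainy += 1
        proportions.append(rainy / day)
    return proportions

props = running_proportions(p=0.5, n_days=10_000)
for day in (1, 10, 100, 1_000, 10_000):
    print(f"after {day:>6} days: proportion rainy = {props[day - 1]:.3f}")
```

The first few proportions jump around, while the later ones settle near 0.5, which is the same pattern the graph illustrates.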

As an interesting note, NWS sends sophisticated helium “balloons” up into the air to measure properties such as wind speed and direction, humidity, and barometric pressure. Physics, based on theories of fluid mechanics, is then used to make the prediction.

Among the many misconceptions about probability that we could state, there is one other major one: that if the probability that it rains is said to be very small and yet it rains, then the probability must be wrong. This is incorrect. Probability is a measure of uncertainty. As in the case of meteorology, the predictions are scientific and are based upon prior data. Just because it has only rained, say, 10% of the time on days like today, this is not to say that it won't rain. In fact, it very well might!

The moral of the story is that probability talks about likelihood. Only in the instance of 0% and 100% probabilities is anything guaranteed. If there are situations in which something either never happens or always happens, then we're probably not concerned about understanding probabilities.

Probability
Probability is a measure of uncertainty, typically expressed as a number between 0 (0%) and 1 (100%), that describes how likely it is that an event will or will not occur under a specified set of conditions in the long run.

Measuring Probabilities

While probability is considerably more complicated than we'll let on, the basic idea is that a probability can be calculated by considering the number of times some event occurs relative to the total number of “trials,” or observable situations. In simpler terms, it is the number of “successes” out of the total number of trials.

Calculating Probability
The probability that event $E$ occurs, denoted $P(E)$, is the ratio (or fraction) of successes divided by the number of trials. Mathematically, we write the number of times $E$ occurs by $n(E)$ and the total number of trials as $n(S)$. That is,

$$P(E) = \frac{n(E)}{n(S)}$$

This formula works when all elements in the sample space are equiprobable, that is, each individual outcome in the sample space has the same probability of occurring as any other outcome. As a note, the $n(\cdot)$ notation stands for “the number of ways” the event in parentheses can occur. The $S$ in the denominator stands for sample space, or the total number of things/situations/trials being considered in the experiment.

Example 2: In a 2009 study of high-fructose corn syrup (HFCS), a corn-based sweetener used in a wide variety of foods, beverages, and condiments, 20 samples of HFCS were analyzed. Of those, nine were found by researchers to contain mercury. Based on the results of this study, find the probability that a random sample of HFCS contains mercury and explain what this result means.

SOURCE: http://www.washingtonpost.com/wpdyn/content/article/2009/01/26/AR2009012601831.html

SOLUTION: The event in this scenario is that mercury is found. Out of the total 20 trials, nine of them contained mercury. Therefore,

$$P(\text{mercury}) = \frac{9}{20} = 0.45$$

This means that if samples of HFCS were drawn randomly and repeatedly, about 45% of all samples would, in the long run, contain traces of mercury. This does not guarantee that exactly 45 samples out of 100 will contain mercury.
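As a rough illustration of both points, the sketch below computes the ratio of successes to trials and then asks how likely it is to see exactly 45 mercury-containing samples out of 100. The helper name `empirical_probability` is hypothetical, and the second calculation assumes that 0.45 is the true probability and that samples are independent, which the study itself does not establish.

```python
from fractions import Fraction
from math import comb

def empirical_probability(successes, trials):
    """Ratio of successes to trials, kept as an exact fraction."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    return Fraction(successes, trials)

p_mercury = empirical_probability(9, 20)
print(p_mercury, float(p_mercury))  # 9/20 0.45

# Chance of exactly 45 "successes" in 100 independent samples (binomial formula),
# under the assumption that each sample contains mercury with probability 0.45.
p = float(p_mercury)
p_exactly_45 = comb(100, 45) * p**45 * (1 - p)**55
print(f"P(exactly 45 of 100) = {p_exactly_45:.3f}")
```

Even when 45% is the long-run rate, landing on exactly 45 out of 100 is itself a fairly unlikely outcome, which is why the probability should not be read as a guarantee.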

Example 3: In July 2011, temperatures in Gilbert, Arizona were above 100°F every day (SOURCE: www.weather.com). Based on this data, a researcher concludes that the probability of above-100°F temperatures in Arizona is 100%. Comment on his findings.

SOLUTION: Since temperatures in July 2011 were above 100°F on all 31 days of the month, it is fair to make the experimental observation that approximately 100% of all days in July have temperatures exceeding 100°F in the long run (there have been July days in the past when temperatures were below 100°F); however, because we know that temperatures are periodic, that is, they go from low to high and back to low over the course of a year, 100% is not a good estimate for temperatures in Arizona in general (temperatures are essentially never above 100°F in January!).

This example truly stresses the importance of critical thinking when using probabilities. Probabilities are often used and abused in the media, education, and politics, just to name a few. We want to make sure that we are as specific as possible.

It will often be considerably helpful to display probabilities in tabular form, that is, through the use of tables. This type of table is called a contingency table. It helps not only to organize data, but also to see the big picture at the same time. Let's consider an example.

Example 4: In a 1950 study that considered 1,418 hospital patients in London, half with and half without lung cancer, and whether or not they smoked over the course of their lives, the following was found:

                Lung Cancer?
Smoker?         Yes      No
Yes             688     650
No               21      59

Assuming this data can be used as a representation of the entire population of London residents, analyze the data by discussing the following:
a. What is the probability that a randomly selected participant within this study develops lung cancer?
b. Provided that a person was a smoker, what is the probability that he has lung cancer?
c. Provided that a person was not a smoker, what is the probability that he has lung cancer?
d. Given that a person has lung cancer, what is the probability that he smokes?

SOLUTION: When answering these questions, it is fairly useful to fully organize the data by providing all totals:

                        Lung Cancer?
Smoker?                 Yes      No     Smoker TOTALS
Yes                     688     650     1,338
No                       21      59        80
Lung Cancer TOTALS:     709     709     1,418

a. Since there is a total of 1,418 individuals being considered and, of those, 709 developed lung cancer,

$$P(\text{lung cancer}) = \frac{709}{1418} = 0.50$$

We must be careful in using this probability as it doesn't really reveal anything about the link between lung cancer and smoking, since 709 patients with lung cancer and 709 without lung cancer were chosen to participate in the study to begin with. This is a probability that was fixed by the researchers.

b. There is a total of 1,338 individuals in the study who smoke (we are limited to the smokers only, per the way the question is stated). Of those individuals, 688 have lung cancer. Reading the vertical bar as “given,”

$$P(\text{lung cancer} \mid \text{smoker}) = \frac{688}{1338} \approx 0.514$$

Slightly over half of the patients who are smokers developed lung cancer. This number is frighteningly large. Before we jump the gun in assuming that smoking is the culprit here, we should probably consider what happens with nonsmokers.

c. There is a total of 80 nonsmokers in the group. Of them, 21 developed lung cancer.

$$P(\text{lung cancer} \mid \text{nonsmoker}) = \frac{21}{80} \approx 0.263$$

Slightly more than one-fourth of nonsmokers developed lung cancer. This number appears to be considerably less severe than for the smokers. We speculate (but did not prove) that smoking increases the likelihood that one will develop lung cancer.

d. There are 709 patients with lung cancer. Of these, 688 smoke.

$$P(\text{smoker} \mid \text{lung cancer}) = \frac{688}{709} \approx 0.970$$

Are we confident in accusing a lung cancer patient of being a smoker? According to this data, perhaps. The moral of the story is: analyze the situation from a variety of lenses. What appears to be true might be an illusion of what we see immediately! Sometimes, however, it is about what the naked eye does not detect. This is what makes good analysts.
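The four probabilities in Example 4 all come from the same counting idea applied to different parts of the contingency table. A minimal sketch of that idea follows; the dictionary layout and the helper names (`probability`, `smoker`, `nonsmoker`, `has_cancer`) are illustrative assumptions, while the counts are the ones given in the example.

```python
from fractions import Fraction

# Cell counts from Example 4, keyed by (smoking status, lung cancer status).
counts = {
    ("smoker", "cancer"): 688,
    ("smoker", "no cancer"): 650,
    ("nonsmoker", "cancer"): 21,
    ("nonsmoker", "no cancer"): 59,
}

def probability(event, given=None):
    """Return P(event), or P(event given a condition), by counting patients."""
    # Restrict attention to the cells that satisfy the condition, if there is one.
    pool = {cell: n for cell, n in counts.items() if given is None or given(cell)}
    favorable = sum(n for cell, n in pool.items() if event(cell))
    return Fraction(favorable, sum(pool.values()))

def smoker(cell):
    return cell[0] == "smoker"

def nonsmoker(cell):
    return cell[0] == "nonsmoker"

def has_cancer(cell):
    return cell[1] == "cancer"

print("a.", float(probability(has_cancer)))
print("b.", float(probability(has_cancer, given=smoker)))
print("c.", float(probability(has_cancer, given=nonsmoker)))
print("d.", float(probability(smoker, given=has_cancer)))
```

Changing the `given` condition changes the denominator, which is exactly why parts b, c, and d give such different answers from the same table.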

Homework Problems - 3.1

1. A classmate of yours was absent when this section was discussed. Explain to her what a probability is in your own words.

2. In a study performed by Cambridge University in the United Kingdom, it was found that, “One out of three people is overwhelmed by the latest breakthroughs in technology.” (SOURCE: http://www.gev.com/2011/07/study-one-out-of-three-people-overwhelmedby-technology/). Primarily, individuals are overwhelmed by how much information is available through the use of social networks and smartphones, to name just two. Explain what is meant by this and explain in terms of probabilistic reasoning.

3. In a 2007 survey conducted by DDB Worldwide, an internationally known advertising company, the following question was asked of a random group of 217 participants: “Is consistency in branding becoming any more or less important?” The following table displays the results:

Response            Number of respondents
More important      143
Less important       74

Find the probability that a respondent believes that consistency in branding is:
a. More important, then explain what this means.
b. Less important, then explain what this means.

4. The probability that a visit to a primary care physician's (PCP) office results in neither lab work nor referral to a specialist is 35%. Of those coming to a PCP's office, 30% are referred to specialists and 40% require lab work. Determine the probability that a visit to a PCP's office results in both lab work and referral to a specialist. (Video Solution)

5. A public health researcher examines the medical records of a group of 937 men who died in 1999 and discovers that 210 of the men died from causes related to heart disease. Moreover, 312 of the 937 men had at least one parent who suffered from heart disease, and, of these 312 men, 102 died from causes related to heart disease. Determine the probability that a man randomly selected from this group died of causes related to heart disease, provided that neither of his parents suffered from heart disease. (PROBLEM SOURCE: SOA/CAS Exam P Sample Questions, Page 5) (Video Solution)

3.2 Joint Probability

In the previous section, we began computing probability using some fairly basic ideas. In calculating probabilities, we made a huge assumption: that the number we found represents what will occur in the long run. For instance, if we conduct a study and find that out of 100 people, 94 respond positively to a new energy drink, can we conclude the drink is effective in providing added energy? The answer to this question is humbling: it depends upon how the data was collected.

Suppose the participants are all college students who tend to consume a large amount of caffeine as it is. Would it be fair for the advertisement to say, “There is a 94% chance that this energy drink will energize you”? Not necessarily, since the result only appeared to be valid in a sample of college students. This means that the population from which the sample was taken must be specified. In this case, the population is the set of all college students and the sample is the 100 students who were selected. Thus, perhaps the advertisement should say, “Are you a college student? If so, there is a 94% chance that this energy drink will energize you.” That is, provided that this sample was a random sample and not a group of college students hand-picked from the respective population.

Okay, so you have a data sample collected from a specific population and your goal is to now talk about probabilities.

Example 1: Imagine that you work for a marketing agency and your goal is to determine the effectiveness of two different branding approaches to a new line of clothing. The first approach involves establishing a group of Facebook followers by offering discounts on clothing as an incentive for becoming a friend of the company. The company hypothesizes that, by seeing the company logo on their Facebook accounts each week, followers will gain a strong familiarity and comfort level with the company's product.

The second approach involves hiring Hollywood actors to endorse the product at film festivals and celebrity appearances. The company then tracks the degree of success of the branding tactic by measuring the number of retail outlets that agree to stock the product based on the branding used. They find that, of the 6 companies exposed to Tactic 1 (T1), 5 agreed to stock the product. Of the 7 companies exposed to Tactic 2 (T2), 5 agreed to stock the product.

Because of the amount of resources involved in selling the product to retail stores, a single marketing analyst can only reach out to about 15 businesses per month; however, if successfully sold, the result is a high level of profit for the clothing company, which, in turn, means you might get that raise after all.

SOLUTION: Let's start with a simpler question, and first consider T1. We find that the probability of a successful sale is:

$$P(\text{T1 success}) = \frac{5}{6} \approx 0.83$$

To keep the counting that follows simple, we work with the rounder figure of 80%. This means that we should expect about 80% of all companies to sell the clothing line, in the long run.

Suppose that a marketing analyst is to offer T1 to two different companies. He would like to know: what is the probability that both companies agree to sell the product? Is the answer 80%? Unfortunately, no. There is an 80% chance that each company agrees to sell the clothing line. We should expect the probability that both sign on to be less. We know that about 8 out of 10 times, Company 1 (C1) will sign on and that 8 out of 10 times Company 2 (C2) will sign on. Let's compare the possibilities by using a tabular approach:

                            Company 2 Choices
                      Y  Y  Y  Y  Y  Y  Y  Y  N  N
                   Y  .  .  .  .  .  .  .  .  .  .
                   Y  .  .  .  .  .  .  .  .  .  .
                   Y  .  .  .  .  .  .  .  .  .  .
                   Y  .  .  .  .  .  .  .  .  .  .
Company 1          Y  .  .  .  .  .  .  .  .  .  .
Choices            Y  .  .  .  .  .  .  .  .  .  .
                   Y  .  .  .  .  .  .  .  .  .  .
                   Y  .  .  .  .  .  .  .  .  .  .
                   N  .  .  .  .  .  .  .  .  .  .
                   N  .  .  .  .  .  .  .  .  .  .

Each cell in the table represents a particular combination of C1's choice and C2's choice. So, the 1-1 entry (remember, this means first row, first column) of the table is the situation in which it does indeed turn out that C1 and C2 agree to sell the clothing line.

The question was: what is the probability that both sign on? Since the definition of probability is the ratio of the number of ways the event can occur divided by the total number of possible outcomes, let's do a bit of counting by highlighting important features of the table:

                            Company 2 Choices
                      Y  Y  Y  Y  Y  Y  Y  Y  N  N
                   Y  X  X  X  X  X  X  X  X  .  .
                   Y  X  X  X  X  X  X  X  X  .  .
                   Y  X  X  X  X  X  X  X  X  .  .
                   Y  X  X  X  X  X  X  X  X  .  .
Company 1          Y  X  X  X  X  X  X  X  X  .  .
Choices            Y  X  X  X  X  X  X  X  X  .  .
                   Y  X  X  X  X  X  X  X  X  .  .
                   Y  X  X  X  X  X  X  X  X  .  .
                   N  .  .  .  .  .  .  .  .  .  .
                   N  .  .  .  .  .  .  .  .  .  .

The shaded region (marked X) represents the number of ways in which we can get both companies to sign on. This region is 8 × 8, which creates 64 possibilities. The total number of possibilities is simply the total number of cells in the table. Since the table is 10 × 10, we have 100 possibilities. So,

$$P(\text{C1 and C2}) = \frac{64}{100} = 0.64$$

This is, as speculated, less than the probability that a single company signs on. Let's consider what we really did here:

Notice that

$$P(\text{C1 and C2}) = \frac{64}{100} = \frac{8 \cdot 8}{10 \cdot 10} = \frac{8}{10} \cdot \frac{8}{10} = P(\text{C1}) \cdot P(\text{C2})$$

Or, in short,

$$P(\text{C1 and C2}) = P(\text{C1}) \cdot P(\text{C2})$$
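The counting-table argument for the two companies can be checked mechanically. The sketch below builds the 10 × 10 table of choices, counts the cells where both companies say yes, and compares the result with the product of the individual probabilities; the variable names are illustrative, and the 8-out-of-10 figure is the one used in the example.

```python
from fractions import Fraction
from itertools import product

# Each company signs on in 8 of 10 equally likely "slots" (Y) and declines in 2 (N).
company_1 = ["Y"] * 8 + ["N"] * 2
company_2 = ["Y"] * 8 + ["N"] * 2

# Every cell of the table is one (C1 choice, C2 choice) combination.
cells = list(product(company_1, company_2))
both_yes = sum(1 for c1, c2 in cells if c1 == "Y" and c2 == "Y")

p_counting = Fraction(both_yes, len(cells))       # count the cells, then divide
p_product = Fraction(8, 10) * Fraction(8, 10)     # multiply the two probabilities

print(both_yes, len(cells))      # 64 100
print(p_counting == p_product)   # True
```

The agreement between the two calculations relies on the two companies deciding independently of one another.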

Example 3: The idea of red-light cameras has been disputed quite often in Arizona and all across the United States. While unable to find any specific details, the author will assume that red-light runners have about a 70% chance of being caught by a red-light camera on any given instance. Suppose that on a given day, two cars run through an intersection during separate red lights, setting off the camera. What is the probability that both drivers are caught?

SOLUTION: We can fairly assume that the first driver being caught and the second driver being caught (calling these events $D_1$ and $D_2$, respectively) constitute events that do not affect one another. Thus,

$$P(D_1 \text{ and } D_2) = P(D_1) \cdot P(D_2) = (0.70)(0.70) = 0.49$$

There is a 49% chance that both drivers are caught. This is about the likelihood of getting heads on the toss of a coin.

Example 5: In a crop of corn, the Food & Drug Administration (FDA) finds that two of the 20 bushels of corn are potentially contaminated with E. coli. Supposing that two bushels have already gone out for shipment to county marketplaces, how likely is it that both of the contaminated bushels have gone out?

SOLUTION: The question asks about the probability that both have been shipped, that is, that the first bushel shipped is contaminated and the second bushel shipped is contaminated. We will refer to these events as simply $C_1$ and $C_2$. We will first write the “and” probability in the form of dependent events and will then determine whether or not a dependency exists (see Independence Property box above).

$$P(C_1 \text{ and } C_2) = P(C_1) \cdot P(C_2 \mid C_1)$$

We know that $P(C_1) = \frac{2}{20}$. Now, since the first shipment “removes” one of the two contaminated bushels and one bushel out of the 20 available, the probability of shipping a second contaminated bushel is slightly changed to:

$$P(C_2 \mid C_1) = \frac{1}{19}$$

Thus, the events are indeed dependent, and so the probability becomes:

$$P(C_1 \text{ and } C_2) = \frac{2}{20} \cdot \frac{1}{19} = \frac{2}{380} \approx 0.005$$

There is less than a 1% chance that both contaminated bushels went out. Does this outcome satisfy the farm producing these bushels of corn? Thinking in more detail, the main concern is actually in regards to one or more (at least one) contaminated bushel going out!

In order to address how to find this, it is useful to think about the following, perhaps obvious, characteristics.

Basic Properties of Probability (Kolmogorov Axioms)
1) A particular event is guaranteed not to occur, is guaranteed to occur, or lies somewhere between these extremes.
2) In a given situation, or sample space, the likelihood of something occurring (however small or insignificant) is guaranteed.
3) The summed probabilities of all the possible events in a situation constitute the entirety, or the whole, of all possibilities.

Mathematically, suppose that a sample space consists of n non-overlapping events $E_1, E_2, \ldots, E_n$. Then, the above verbal rules translate into:
1) For any arbitrary event between events 1 and n, let's call this event $E_i$, then: $0 \le P(E_i) \le 1$
2) Using $S$ to denote the sample space, $P(S) = 1$
3) Summing the probabilities gives 100% of all possible outcomes: $P(E_1) + P(E_2) + \cdots + P(E_n) = P(S) = 1$

These basic properties are often referred to as the Kolmogorov axioms, named after the mathematician Andrey Kolmogorov. An axiom can be thought of as a necessary assumption. For instance, when physicists develop new concepts in physics, they assume that gravity follows certain properties. Thus, they have gravity axioms. The Kolmogorov axioms are extremely important in probability and the development of new ideas.
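One practical use of these properties is as a sanity check on any list of probabilities. The sketch below is a hypothetical helper (`satisfies_axioms` is not from the text) that assumes the listed outcomes do not overlap and together cover the whole sample space; exact fractions are used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction

def satisfies_axioms(probabilities):
    """Check that each probability lies in [0, 1] and that they sum to exactly 1."""
    values = list(probabilities.values())
    each_in_range = all(0 <= p <= 1 for p in values)
    sums_to_one = sum(values) == 1
    return each_in_range and sums_to_one

fair_coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
overcounted = {"heads": Fraction(7, 10), "tails": Fraction(4, 10)}  # sums to 11/10

print(satisfies_axioms(fair_coin))    # True
print(satisfies_axioms(overcounted))  # False
```

A total above 1, as in the second case, signals that something has been counted more than once or that a probability was entered incorrectly.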


In fact, recall Example 5 dealing with the contaminated corn crop. What are all the possibilities for shipping out two bushels from the total of 20? Let's list them out:

• 0 contaminated bushels and 2 uncontaminated bushels ship (call it $E_0$)
• 1 contaminated and 1 uncontaminated bushel ship (call it $E_1$)
• 2 contaminated bushels ship (call it $E_2$)

Are there any others? Not unless there is a possibility we have not considered. Since two bushels are guaranteed to go out, the outcome must fall into one of the three categories listed. Let's calculate the probability for each of these by hand, writing $C_k$ and $U_k$ for the events that the k-th bushel shipped is contaminated or uncontaminated, respectively:

• $P(E_0) = P(U_1 \text{ and } U_2) = P(U_1) \cdot P(U_2 \mid U_1) = \frac{18}{20} \cdot \frac{17}{19} = \frac{306}{380} \approx 0.805$

• $P(E_1)$: there are two possibilities; either the first is contaminated and the second is not, or vice versa. We must consider both outcomes below:
  o $P(C_1 \text{ and } U_2) = P(C_1) \cdot P(U_2 \mid C_1) = \frac{2}{20} \cdot \frac{18}{19} = \frac{36}{380} \approx 0.095$
  o $P(U_1 \text{ and } C_2) = P(U_1) \cdot P(C_2 \mid U_1) = \frac{18}{20} \cdot \frac{2}{19} = \frac{36}{380} \approx 0.095$
  These two possibilities give 9.5% + 9.5% = 19% of the sample space.

• $P(E_2) = \frac{2}{380} \approx 0.005$ (from previous calculation)

(NOTE: Importantly, summing these three probabilities gives 1, as stated in the axioms!)
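The three probabilities can also be confirmed by brute-force counting rather than by the sequential multiplication used above. The sketch below assumes every pair of bushels is equally likely to be the pair that shipped, lists all such unordered pairs, and tallies how many contaminated bushels each pair contains; the labels "C" and "U" and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

bushels = ["C"] * 2 + ["U"] * 18  # 2 contaminated, 18 uncontaminated

# All unordered pairs of distinct bushels that could have shipped.
pairs = list(combinations(range(len(bushels)), 2))

# For each pair, count how many of its two bushels are contaminated.
tally = Counter(sum(bushels[i] == "C" for i in pair) for pair in pairs)

distribution = {k: Fraction(tally[k], len(pairs)) for k in (0, 1, 2)}
for k, p in distribution.items():
    print(f"{k} contaminated: {p} (about {float(p):.3f})")

print("total:", sum(distribution.values()))  # total: 1
```

Counting unordered pairs gives the same fractions as the ordered calculation once they are reduced, for example 306/380 = 153/190.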

We can now see that the situation in which there is at least one contaminated bushel will occur 19% + 0.5% = 19.5% of the time. This is much higher than when we concerned ourselves with both going out! This is quite a frightening situation!

Needless to say, this was a lot of work; however, we can use the axioms to reduce the amount of work we commit ourselves to. According to axioms 2 and 3, with $E_0$, $E_1$, and $E_2$ denoting the events that 0, 1, or 2 contaminated bushels ship:

$$P(E_0) + P(E_1) + P(E_2) = 1$$

Our earlier statement involved wanting to know the likelihood that at least one contaminated bushel went out. That only involves $E_1$ and $E_2$! Solving for the sum of these two probabilities:

$$P(E_1) + P(E_2) = 1 - P(E_0)$$

That is,

$$P(\text{at least one contaminated}) = 1 - P(E_0) = 1 - 0.805 = 0.195$$

This is the same number we achieved taking the long route! We only had to find the probability of shipping 0 contaminated bushels, which is a little bit of work as compared to a lot of work!

Probability of At Least One…
Given any number of events involving quantities, the probability of at least one in quantity is 1 minus the probability of 0 in quantity. That is:

$$P(\text{at least one}) = 1 - P(\text{none})$$

Mathematically, let subscripts $0, 1, 2, \ldots, n$ represent quantity, where corresponding events are denoted $E_0, E_1, E_2, \ldots, E_n$. Then,

$$P(E_1) + P(E_2) + \cdots + P(E_n) = 1 - P(E_0)$$
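As a quick check of the shortcut, the sketch below applies "1 minus the probability of none" to the corn example and compares it with the sum obtained the long way. The function name `prob_at_least_one` is an illustrative assumption; the fractions are those from the example.

```python
from fractions import Fraction

def prob_at_least_one(p_none):
    """Complement rule: P(at least one) = 1 - P(none)."""
    if not 0 <= p_none <= 1:
        raise ValueError("a probability must lie between 0 and 1")
    return 1 - p_none

# P(no contaminated bushel ships): both shipped bushels are uncontaminated.
p_none = Fraction(18, 20) * Fraction(17, 19)

shortcut = prob_at_least_one(p_none)
long_route = Fraction(36, 380) + Fraction(36, 380) + Fraction(2, 380)

print(shortcut, float(shortcut))
print(shortcut == long_route)  # True
```

Only one probability had to be computed for the shortcut, which becomes a real saving when many quantities are possible.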

Homework Problems - 3.2

1. In 2009 the H1N1 virus, commonly referred to as the “Swine Flu,” reportedly infected an estimated 10% of New Yorkers (SOURCE: http://www.reuters.com/article/2009/08/30/us-flu-newyork-idUSTRE57T26Y20090830). Suppose that an emergency room in New York City has two individuals with flu-like symptoms. (Video Solution)
a. What condition(s) do you believe would make it appropriate to assume independence in this situation?
b. By using the tabular approach and assuming independence, find the probability that both people have the H1N1 virus.
c. By using the “and” rule, verify that you get the same answer that you found in Part b.
d. Find the probability that neither of these individuals has the H1N1 virus.
e. Find the probability that at least one of them has the H1N1 virus.
f. Exposure to flu germs for even a short period of time can significantly increase one's chances of catching the flu. Suppose that if one is exposed to an individual with the flu virus, their chance of becoming infected is 15 percentage points higher than normal. Find the probability that both individuals have the flu virus.

2. Many fire stations handle emergency calls for medical assistance as well as calls requesting firefighting equipment. A particular station says that the probability that an incoming call is for medical assistance is .85. This can be expressed as P(call is for medical assistance) = .85.
a. Give a relative frequency interpretation of the given probability. That is, interpret what the number .85 means based on the definition of probability.
b. What is the probability that a call is not for medical assistance?
c. Assuming that successive calls are independent of one another (i.e., knowing that one call is for medical assistance doesn't influence our assessment of the probability that the next call will be for medical assistance), calculate the probability that both of two successive calls will be for medical assistance.
d. Still assuming independence, calculate the probability that for two successive calls, the first is for medical assistance and the second is not for medical assistance.
e. Still assuming independence, calculate the probability that exactly one of the next two calls will be for medical assistance. (Hint: There are two different possibilities that you should consider. The one call for medical assistance might be the first call, or it might be the second call.)
f. Do you think it is reasonable to assume that the requests made in successive calls are independent? Explain.

3. "N.Y. Lottery Numbers Come Up 9-1-1 on 9/11" was the headline of an article that appeared in the San Francisco Chronicle (September 13, 2002). More than 5600 people had selected the sequence 9-1-1 on that date, many more than is typical for that sequence. A professor at the University of Buffalo is quoted as saying, "I'm a bit surprised, but I wouldn't characterize it as bizarre. It's randomness. Every number has the same chance of coming up. People tend to read into these things. I'm sure that whatever numbers come up tonight, they will have some special meaning to someone, somewhere." The New York state lottery uses balls numbered 0-9 circulating in 3 separate bins. To select the winning sequence, one ball is chosen at random from each bin. What is the probability that the sequence 9-1-1 would be the one selected on any particular day?

4. On August 8, 2011, the Dow Jones Industrial Average fell 635 points (5.5%) to 10,810 points, representing the 6th worst point loss ever experienced. On that day, President Obama's approval ratings also suffered tremendously; only 22% of the nation's voters “Strongly Approve” of how he is performing in the presidential role (SOURCE: http://www.rasmussenreports.com/public_content/politics/obama_administration/daily_presidential_tracking_poll). Suppose presidential hopeful Randall Terry (Democrat) speaks at a rally shortly thereafter and assumes that his approval rating as a candidate will likely closely mirror President Obama's. Suppose there are 40 swing voters (voters that are “on the fence” about who to vote for). (Video Solution)
a. What is the probability that all 40 voters will strongly approve of Terry's plan?
b. What is the probability that none of the 40 voters will strongly approve of Terry's plan?
c. What is the probability that at least one voter will approve of Terry's plan?

5. The following case study is reported in the article "Parking Tickets and Missing Women," which appears in an early edition of the book Statistics: A Guide to the Unknown. In a Swedish trial on a charge of overtime parking, a police officer testified that he had noted the position of the two air valves on the tires of a parked car: to the closest hour, one valve was at the 1 o'clock position and the other was at the 6 o'clock position. After the allowable time for parking in that zone had passed, the policeman returned, noted the valves were in the same position, and ticketed the car. The owner of the car claimed that he had left the parking place in time and had returned later. The valves just happened by chance to be in the same positions. An "expert" witness computed the probability of this occurring as (1/12)(1/12) = 1/144.
a. What reasoning did the expert use to arrive at the probability of 1/144?
b. Can you spot the error(s) in the reasoning that leads to the stated probability of 1/144? What effect does this error(s) have on the probability of occurrence? Do you think that 1/144 is larger or smaller than the correct probability of occurrence?

6. Jeanie is a bit forgetful, and if she doesn't make a "to do" list, the probability that she forgets something she is supposed to do is .1. Tomorrow she intends to run three errands, and she fails to write them on her list.
a. What is the probability that Jeanie forgets all three errands? What assumptions did you make to calculate this probability?
b. What is the probability that Jeanie remembers at least one of the three errands?
c. What is the probability that Jeanie remembers the first errand but not the second or third?

7. One of the myths most commonly believed by students on multiple choice exams is that, as long as they always use letter 'C' as their guess, they increase their chances of guessing correctly. This, of course, is absurd, since there is not usually a set pattern used by instructors in pairing correct answers with certain letters (certainly not for me, anyhow). Suppose that a multiple-choice quiz has two problems on it and that the student has no idea how to answer them, so he guesses. Each problem has letters A-E corresponding to the answers to choose from. Using counting techniques discussed in class, find and explain how you found the following: (Video Solution)
a. What is the probability that both guesses are correct?
b. What is the probability that both guesses are incorrect?
c. What is the probability that he receives a 50% on the test?
d. How likely is it that he gets at least one problem correct?
e. What is the probability that he receives a 90% on the exam (assume no partial credit is possible)?
f. How did the idea of “counting tables” allow you to answer these questions without having to do additional work for each subsequent table?

3.3 Probability of Unions

Imagine that you toss a fair, two-sided quarter. You let it land and take a look at the side facing up. What is the probability that you see heads or tails (assume the toss will be ignored if it happens to land on its side)? You can probably see fairly quickly that the outcome desired is guaranteed; when a coin is tossed, it will result in one of two outcomes: heads or tails. If someone in a bet were to tell you that he will win if the toss of a coin results in heads or tails, then you could probably tell him, “Congratulations!”

Adding to our intuition (no pun intended), we will write the situation in the form of a mathematical probability. The sample space will have two outcomes: $H$ (heads) and $T$ (tails).

Then,

$$P(H \text{ or } T) = 1$$

Since we know that

$$P(H) = 0.5 \quad \text{and} \quad P(T) = 0.5$$

So, we can gladly write:

$$P(H \text{ or } T) = P(H) + P(T) = 0.5 + 0.5 = 1$$

Simple enough! We feel pretty satisfied and so we hope to tackle another problem: Example 1: A large company offers a self-insured health insurance policy to its employees to help them reduce premium and copay costs. Using its historical data from the last two years, the company analyst considers the risk status of the employees (low or high) based on preexisting conditions, and the type of claim filed (physical health or mental health). He finds that 70% of employees have filed a mental health claim and that 40% of employees have been categorized as high risk. Further, he finds that 20% of employees are low risk and have filed a physical health claim. The company only insures the first claim. All claims thereafter are paid for by a third-party insurer.


For reporting purposes, he would like to find the probability that a randomly selected employee (or an employee that is to be hired in the future) is high risk or will file a mental health claim. As he is writing his report, he reaches a speed bump. Letting High denote the event that an employee is high risk and Mental the event that an employee files a mental health claim,

$$P(\text{High or Mental}) = P(\text{High}) + P(\text{Mental}) = 0.40 + 0.70 = 1.10$$

He quickly realizes that this probability is invalid because a probability cannot be greater than 1, or 100%. What happened?

SOLUTION: We first organize his data into a table to help us better see what is happening:

Claim\Risk    Low     High    Total
Physical      .20
Mental                        .70
Total                 .40

The probabilities outside of the boxes represent totals for mental health claims and for high risk claims. The probability in the 1-1 entry of the table represents the probability of being low risk and filing a physical health claim. Since we know that this data represents all of those who have filed claims, we know that 100% have filed one type or the other. Additionally, each employee considered falls into one of the two risk categories. So we fill in more details:

Claim\Risk    Low     High    Total
Physical      .20             .30
Mental                        .70
Total         .60     .40

We can also proceed to fill in the boxes in the table, since each person falls into exactly one of the four positions (low physical, low mental, high physical, high mental):

Claim\Risk    Low     High    Total
Physical      .20     .10     .30
Mental        .40     .30     .70
Total         .60     .40

Now, the analyst added the second row total to the second column total, as highlighted (in brackets) in the table below:

Claim\Risk    Low     High    Total
Physical      .20     .10     .30
Mental        .40     .30     [.70]
Total         .60     [.40]

The problem seems to be that the .40 and the .70 both include the probability of High Risk and Mental Claim! In other words, it is being counted twice, hence the end probability that is greater than 1. Instead, let's add up the individual box probabilities as highlighted (in brackets) in the table below:

Claim\Risk    Low     High    Total
Physical      .20     [.10]   .30
Mental        [.40]   [.30]   .70
Total         .60     .40

We find that $P(\text{High or Mental}) = 0.10 + 0.40 + 0.30 = 0.80$, which is a number that rests between 0% and 100%. We conclude that, in fact, there is an 80% chance that a claim-filing employee is high risk or files a mental claim (or both!).

While this does not seem like a huge amount of work, suppose that we instead had three types of claims and three different statuses. It would probably be convenient to have some sort of mathematical approach to the solution. Let's go back to the table in which the double-count occurred:

Claim\Risk    Low     High    Total
Physical      .20     .10     .30
Mental        .40     .30     .70
Total         .60     .40

We are free to add the two probabilities, $P(\text{High})$ and $P(\text{Mental})$, but we must be sure to take out the .30 one time, so that it is single-counted and not double-counted:

$$P(\text{High or Mental}) = 0.40 + 0.70 - 0.30 = 0.80$$

This is the same answer as before! Notice what we really did:

$$P(\text{High or Mental}) = P(\text{High}) + P(\text{Mental}) - P(\text{High and Mental})$$

Regardless of the context/application of the probability, this issue can be resolved as shown.

Probability of One Event “Or” the Other

Given two events, $A$ and $B$, the probability that one or the other occurs is the sum of the individual probabilities with the double-count removed once. Mathematically,

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$

Typically, $\cup$ is used (called a union) to replace the word “or”, making the above equation

$$P(A \cup B) = P(A) + P(B) - P(A \text{ and } B)$$
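To see the “Or” rule at work numerically, here is a minimal Python sketch using the claim probabilities from Example 1. The function name `prob_or` is just an illustrative choice, not anything defined in this book.

```python
def prob_or(p_a, p_b, p_a_and_b):
    """P(A or B) = P(A) + P(B) - P(A and B): add, then remove the double-count once."""
    return p_a + p_b - p_a_and_b

# Example 1 values: P(Mental) = .70, P(High) = .40, P(Mental and High) = .30
p_mental_or_high = prob_or(0.70, 0.40, 0.30)
print(round(p_mental_or_high, 2))  # 0.8

# Cross-check: add the three individual cells of the table instead
cells = [0.10, 0.40, 0.30]  # High & Physical, Low & Mental, High & Mental
print(round(sum(cells), 2))  # 0.8
```

Both routes land on the same 80%, which is the whole point of subtracting the overlap exactly once.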

At the beginning of this section, we addressed a coin-tossing problem that involved the summation of the probability of heads and the probability of tails. Let’s see why we could get away with not subtracting away the double-count. We use the “Or” probability set-up, with $H$ for heads and $T$ for tails:

$$P(H \text{ or } T) = P(H) + P(T) - P(H \text{ and } T)$$

We already know the first two probabilities on the right-hand side, but what is the third probability value? Let’s analyze its meaning:

$$P(H \text{ and } T)$$

Of course, it is impossible to get both heads and tails in one toss of a coin! Any impossible outcome has a probability of 0%. That is:

$$P(H \text{ and } T) = 0$$

So,

$$P(H \text{ or } T) = .5 + .5 - 0 = 1$$

We simply “lucked-out” when this problem worked out according to our intuition. In general, you need only remember the “Or” probability formula, for the reasons given, to solve any problem involving the occurrence of one outcome or another.

Example 2: It is often interesting to note how political preference (Democrat or Republican) varies within a married couple. Suppose that in a survey of 160 couples it is found that 60 of the couples agree on a preference to vote Democrat and 40 are such that the husband votes Democrat and the wife votes Republican. The total number of wives that vote Democrat is 90. What is the probability that the couple has a husband or a wife that is Republican?


SOLUTION: We first arrange this information in a table:

Husband\Wife   Democrat   Republican   Total
Democrat       60         40
Republican
Total          90                      160

Note that the bottom-right corner represents the table total. We know that the number of husbands voting Democrat is $60 + 40 = 100$. This means that the number of husbands voting Republican is $160 - 100 = 60$. Additionally, we conclude that the number of couples where the husband votes Republican and the wife votes Democrat is $90 - 60 = 30$. We fill this information in:

Husband\Wife   Democrat   Republican   Total
Democrat       60         40           100
Republican     30                      60
Total          90                      160

This allows us to fill in the remaining details in the table:

Husband\Wife   Democrat   Republican   Total
Democrat       60         40           100
Republican     30         30           60
Total          90         70           160

We convert the counts into proportions by dividing each cell entry by the total number of couples, 160:

Husband\Wife   Democrat   Republican   Total
Democrat       .375       .25          .625
Republican     .1875      .1875        .375
Total          .5625      .4375

Let

$H$ = the husband votes Republican
$W$ = the wife votes Republican

So,

$$P(H \text{ or } W) = P(H) + P(W) - P(H \text{ and } W) = .375 + .4375 - .1875 = .625$$

We find that there is a 62.5% chance that in a couple either the husband votes Republican, the wife votes Republican, or both vote Republican.
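The same answer can be reached straight from the counts, without converting to decimals first. The sketch below assumes the completed table from the solution (60, 40, 30, 30); the dictionary layout and variable names are illustrative choices only.

```python
from fractions import Fraction

# Counts from the completed table, keyed by (husband's vote, wife's vote)
counts = {
    ("D", "D"): 60, ("D", "R"): 40,
    ("R", "D"): 30, ("R", "R"): 30,
}
total = sum(counts.values())  # 160 couples

# Marginal and joint probabilities, kept as exact fractions
p_husband_r = Fraction(sum(n for (h, w), n in counts.items() if h == "R"), total)
p_wife_r = Fraction(sum(n for (h, w), n in counts.items() if w == "R"), total)
p_both_r = Fraction(counts[("R", "R")], total)

# "Or" rule: add the two totals, remove the double-counted cell once
p_either_r = p_husband_r + p_wife_r - p_both_r
print(float(p_either_r))  # 0.625
```

Using exact fractions avoids any rounding along the way, so the result can be compared directly with the table.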

At this point you might be wondering why we don’t simply draw out the table and ignore the mathematical formulas. When possible, tables are extremely useful, but they might not always be available. Consider the following example.

Example 3: Testing has determined that a particular ballistic missile has an 80% chance of hitting its intended target. Suppose that an enemy jet approaches a military base and so two missiles are fired at the incoming jet. What is the probability that this threat is eliminated?

SOLUTION: This is the probability that one or both missiles hit the target. We only have one probability, so filling out a table would not be possible. Let

$M_1$ = the first missile hits the target
$M_2$ = the second missile hits the target

We want to know

$$P(M_1 \text{ or } M_2) = P(M_1) + P(M_2) - P(M_1 \text{ and } M_2)$$

We already know the first two probabilities on the right-hand side (.80), but we are not given information on $P(M_1 \text{ and } M_2)$. We can fairly assume that the outcome of one missile has no (or very minimal) impact on the outcome of another missile, and so we assume the events are independent. This allows us to write:

$$P(M_1 \text{ and } M_2) = P(M_1) \cdot P(M_2) = (.80)(.80) = .64$$

And so,

$$P(M_1 \text{ or } M_2) = .80 + .80 - .64 = .96$$

We conclude that there is a 96% chance that the enemy jet is eliminated.
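As a quick sanity check, the short Python sketch below recomputes the missile probability two ways: with the “Or” rule, and by subtracting the chance that both missiles miss from 1. The second route is not used in the solution above; it is included only as a cross-check, and it relies on the same independence assumption.

```python
p_hit = 0.80

# "Or" rule, with P(both hit) found by multiplying (independence assumed)
p_both_hit = p_hit * p_hit
p_at_least_one = p_hit + p_hit - p_both_hit

# Cross-check: 1 minus the probability that both missiles miss
p_complement = 1 - (1 - p_hit) ** 2

print(round(p_at_least_one, 2), round(p_complement, 2))  # 0.96 0.96
```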


Homework Problems - 3.3

1. A gaming investor is considering becoming a financial partner in a new casino. In deciding to go in on the deal, he reviews gaming revenues for previous years. From experience and industry research, he decides that the gaming industry tends to be successful when total gross revenues for card rooms are above $1 million or when gross revenues for lotteries are above $20 billion. Between 2000 and 2009, he found that 50% of the time, both sectors have been successful and that 0% of the time only card tables were successful (and lotteries were not). Lotteries were unsuccessful 30% of the time (SOURCE: 2011 U.S. Statistical Abstract, Table 1258). What is the probability that the investor’s conditions will be met? In your professional opinion, is it likely that he will decide to become a partner in the proposal? (Video Solution)

2. A researcher conducts a study on a total of 600 cats to determine whether or not they tend to be adaptive to danger and whether or not their time to respond to those dangers is fast enough to avoid harm. The animals were exposed to non-harmful stimuli to assist in answering the researcher’s questions. In his report he details that, “207 non-adaptive cats were studied and, of them, 180 were found to have response times that were simply not fast enough. By comparison, a total of 300 cats were both adaptive and had response times that were fast enough.” How likely is it that a cat is adaptive to environmental physical dangers or has a response time that is fast enough? (Video Solution)

3. In the March 3, 2011 episode of the Dr. Oz Show entitled “Dangerous Doctors: Is Your MD Hazardous to Your Health?” Dr. Oz mentioned that 20% of the time doctors order scans to protect themselves from a lawsuit. Dr. Oz also said, “Up to 1/3 of all tests and treatments are entirely unnecessary.” (Video Solution)
   a. Two patients are given orders for scans from a particular doctor. What is the probability that one patient or the other was given scans to protect the doctor against a lawsuit?
   b. One patient is given orders for two different tests/treatments. What is the probability that one or both of them was/were unnecessary?
   c. A patient is prescribed a scan and a blood test. What is the probability that an unnecessary prescription was made, through the patient’s eyes?

4. In all of his Fall 2010 classes, Milos discovered that 44% of his students earned a ‘B’ or better on their homework average. He also discovered that 50% of his students had a ‘B’ or better homework average or a ‘B’ or better overall grade in the class (SOURCE: Milos’ Fall 2010 Grade Spreadsheet). If 30% of all his students received a ‘B’ or better homework average and a ‘B’ or better class grade, what percentage of his students earned a ‘B’ or better in the class? (Video Solution)

5. In all of his Fall 2010 classes, Milos discovered that, of all students that earned a ‘C’ or better homework average, 87% earned a ‘C’ or better final class grade. 70% of all students in his classes earned a ‘C’ or better homework average or earned a ‘C’ or better final class grade (SOURCE: Milos’ Fall 2010 Grade Spreadsheet), while only 49% earned a ‘C’ or better on homework and as a final class grade (some still did well in the class, but maybe failed to turn in homework). What is the probability that a randomly selected student in his class earned a ‘C’ or better final class grade? (Video Solution)


3.4 Conditional Probability

In many cases, a probability depends on what we already know. For instance, would we believe that the likelihood of a car accident changes, provided that the roads are slick from snow? We would probably agree that the likelihood increases if we already know the road conditions.

Suppose a fair, two-sided coin is tossed. You are told that the outcome is not a head. What is the likelihood that the outcome is tails? The answer is probably obvious… if you know the outcome was not heads, and the only two possibilities are heads and tails, then there is a 100% chance the outcome is tails. This is a conditional probability. That is, if

$H$ = the outcome is heads
$T$ = the outcome is tails

Further, to indicate that the outcome is not one of the above, we often put a bar on top of the event name:

$\bar{H}$ = the outcome is not heads
$\bar{T}$ = the outcome is not tails

Then, $P(H) = .5$. However, given that we know the outcome was not tails, the probability of heads jumped to 1. We might write:

$$P(H \text{ given } \bar{T}) = 1$$

Instead of using the word “given” we often use a vertical line (called a “pipe”), |. That is,

$$P(H \mid \bar{T}) = 1$$

Conditional Probability

The conditional probability of event $A$ provided that $B$ already occurred is written as

$$P(A \mid B)$$

and implies that the likelihood of $A$ may be different, knowing that $B$ already took place.

Example 1: Due to wars at sea, shipwrecks, and other such disasters, there are roughly 3,000,000 sunken vessels in all of the seas in the world! Suppose an area of the ocean is mapped out due to the historic ships that have wrecked in that area. There is speculation that, of the estimated 20 ships in that region, 11 are original pirate ships. Given that a pirate ship is the first of the 20 recovered, what is the probability that the next one found will also be a pirate ship?

SOLUTION: We would like to find the probability that a pirate ship is found, given that one pirate ship has already been removed. If one ship is removed, there are 19 ships left. Since the ship removed was a pirate ship, there are only 10 pirate ships remaining. That is,

$$P(\text{pirate ship} \mid \text{pirate ship removed}) = \frac{10}{19} \approx .526$$

Note that this is different than

$$P(\text{pirate ship})$$

Why? This probability has no condition placed on it. It assumes the very basic information: 20 ships, 11 pirate ships. So,

$$P(\text{pirate ship}) = \frac{11}{20} = .55$$

The conditional probability, in this case, is different than the unconditional probability.

Example 2: Determine whether or not the following situations represent independent or dependent events $A$ and $B$.

a) $A$: It rains in Chandler today; $B$: There is a car accident in Chandler
b) $A$: The Arizona Cardinals make it to the playoffs; $B$: Subway runs out of whole wheat bread
c) $A$: Dow Jones Industrial reports an enormous loss; $B$: Microsoft stocks plummet
d) $P(A) = .75$ and $P(A \mid B) = .75$
e) $P(A \mid B) = .3$, with $P(A)$ some larger value
f) $P(A) = .75$, together with values for $P(B)$ and $P(A \text{ and } B)$

SOLUTION:
a) Dependent; rain likely increases the likelihood of accidents.
b) Independent; these events probably don’t have any impact on one another.
c) Dependent; Microsoft is part of the Dow Jones Industrial and so there is a strong relationship between the two.
d) Independent; we see that the likelihood of $A$ does not change given that $B$ has occurred. It is still .75.
e) Dependent; the likelihood of $A$ does change given that $B$ has occurred. It drops to .3.
f) If the product of the two given probabilities, $P(A) \cdot P(B)$, equals the probability of $A$ and $B$, then the events are independent, as this would mean that $P(A \mid B)$ is .75, which is the same as $P(A)$. We see that the product does match, and so we conclude that the events are independent.

Example 3: An aircraft radar system detects 30 aircraft in a 100-mile radius. Of these, 18 are ally planes, 6 are cargo planes, and 6 are enemy planes. Given that a plane approaching the radar is ruled out as being an enemy plane, what is the probability that it is a cargo plane?

SOLUTION: First off, define:

$C$ = the plane is a cargo plane
$E$ = the plane is an enemy plane

We want to know

$$P(C \mid \bar{E})$$

Since it is not an enemy plane, it must be one of the remaining 24 aircraft. Of those, 6 are cargo planes, so

$$P(C \mid \bar{E}) = \frac{6}{24} = .25$$
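The radar example finds a conditional probability by shrinking the sample space and counting what is left. Here is a minimal Python sketch of that idea; the helper name `conditional_by_counting` and the dictionary of aircraft are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

aircraft = {"ally": 18, "cargo": 6, "enemy": 6}

def conditional_by_counting(counts, target, excluded):
    """P(target | not excluded), found by removing the excluded group first."""
    remaining = sum(n for kind, n in counts.items() if kind != excluded)
    return Fraction(counts[target], remaining)

# 6 cargo planes out of the 24 aircraft that are not enemy planes
p = conditional_by_counting(aircraft, "cargo", "enemy")
print(p, float(p))  # 1/4 0.25
```

The same helper reproduces the pirate ship reasoning if you first subtract the recovered ship from the counts.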

Example 4: Suppose that Company 1 (C1) and Company 2 (C2) are competitors in the clothing business. In fact, they both have locations within Chandler Fashion Center Mall. Given previous business experience, the marketing analyst knows that each company has an 80% chance of agreeing to sell a particular line of clothing; however, if C1 agrees to sell the clothing line, C2 wants to stay competitive and so definitely purchases the clothing line. How is the probability that both will agree affected by this new knowledge?

SOLUTION: In this situation, the decision of C2 is dependent (conditional) upon the decision of C1. Consider a table in which C2’s choices will reflect the decision of C1.

[Table: Company 1 Choices (rows) against Company 2 Choices When C1 Agrees (columns); every Company 2 entry is Y]

$$P(C2 \mid C1) = 1$$

The difference is that C2’s decisions are all to agree, provided that C1 has agreed. If C1 does not agree, then we’re not really sure how C2 will act, but we don’t really care, since the probability we are in search of is when both companies agree! Here we have:

$$P(C1 \text{ and } C2) = P(C1) \cdot P(C2 \mid C1) = \frac{4}{5} \cdot \frac{5}{5} = \frac{4}{5}$$

We could just as well have written,

$$P(C1 \text{ and } C2) = (.80)(1.00) = .80$$

so as to be using the decimal form instead of the tabular fractions.

If you look back at the reasoning here, you’ll notice that we have bolded the word “dependent.” In previous sections, we didn’t have to worry about dependency, since we assumed that the choices of C1 and C2 were independent, that is, one outcome did not affect the other, and vice versa. How do we know whether events are dependent or independent? Oftentimes this is based upon some knowledge of the situation or, perhaps, our intuition. Let’s set up the important ideas here and then we’ll look at a few examples of dependence versus independence.

Probability of Two Events Occurring Simultaneously

Given two events, $A$ and $B$:

If $A$ and $B$ are independent events, then

$$P(A \text{ and } B) = P(A) \cdot P(B)$$
$$P(A \cap B) = P(A) \cdot P(B)$$

where $\cap$ is a symbol to represent the word “and”. We use this in mathematics often.

And if $A$ and $B$ are dependent events, then

$$P(A \text{ and } B) = P(A) \cdot P(B \text{ given } A)$$
$$P(A \cap B) = P(A) \cdot P(B \text{ given } A)$$

Or, as it is often written

$$P(A \cap B) = P(A) \cdot P(B \mid A)$$

In either instance, the end result involves multiplication.

NOTE: $A$ and $B$ are generic names and thus can be attached to an event in an arbitrary order.

As an interesting note, we can make the following conclusion:

Independence Property

Given two events, $A$ and $B$, if $P(B \mid A) = P(B)$, then $B$ does not depend on $A$, and so the dependence formula reduces to:

$$P(A \cap B) = P(A) \cdot P(B \mid A) = P(A) \cdot P(B)$$

This result is important, because it allows you to only have to remember the “and” rule for dependent events. If the next event does not depend on the prior event, then the end probability is just a product of the two individual probabilities. Though the ideas presented above might at first seem confusing, you’ll notice that the idea of joint probabilities has not changed. The only new caution is to take care to acknowledge whether the events are independent or not. We’ll consider a few more examples below.

Example 5: The probability that a resistor and capacitor both fail in a portable electronic device in the fifth year of use is 0.95%. The probability that the resistor fails is 1.22% and the probability that the capacitor fails is 1%. Are the events independent? If they are not independent, what is the probability that the capacitor fails given that the resistor fails?

SOLUTION: Let

$R$ = the resistor fails
$C$ = the capacitor fails

If the two events are independent, then the product of unconditional probabilities should give us the provided joint probability. We have that,

$$P(R) = .0122 \qquad P(C) = .01$$

If they are independent events, then

$$P(R \cap C) = P(R) \cdot P(C) = (.0122)(.01) = .000122$$

However, the joint probability under independence is 0.0122%, not 0.95%. Thus,

$$P(R \cap C) = P(R) \cdot P(C \mid R)$$

That is, the probability that the capacitor fails is dependent upon the resistor failing. Filling in what we know:

$$.0095 = (.0122) \cdot P(C \mid R)$$

Dividing gives,

$$P(C \mid R) = \frac{.0095}{.0122} \approx .779$$

Thus, there is a 77.9% chance the capacitor fails if the resistor fails. The resistor is an integral part in this device. The likelihood of the capacitor failing increases if the resistor fails first.
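The two steps of Example 5 (test for independence, then divide to get the conditional probability) can be sketched in a few lines of Python. The variable names and the comparison tolerance are illustrative assumptions.

```python
p_joint = 0.0095      # P(R and C): resistor and capacitor both fail
p_resistor = 0.0122   # P(R)
p_capacitor = 0.01    # P(C)

# Step 1: under independence the joint probability would equal this product
product = p_resistor * p_capacitor
independent = abs(product - p_joint) < 1e-9

# Step 2: rearrange P(R and C) = P(R) * P(C | R) to solve for P(C | R)
p_c_given_r = p_joint / p_resistor

print(independent)            # False
print(round(p_c_given_r, 3))  # 0.779
```

The product (.000122) is nowhere near the observed joint probability (.0095), which is what flags the dependence.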

The above example brings up a useful result.

Calculating the Conditional Probability of A given B

Since

$$P(A \cap B) = P(B) \cdot P(A \mid B)$$

We have that,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Example 6: In a demographic study of a small town, it is found that 5% of the adult residents are unemployed and living at or below poverty level. A total of 8% are unemployed. What is the probability that a person in this town is living at or below the poverty level, given that they are unemployed? Interpret the meaning of your answer.

SOLUTION: Letting $L$ = a person lives at or below the poverty level and $U$ = a person is unemployed, we would like to know $P(L \mid U)$. We have that $P(L \cap U) = .05$ and $P(U) = .08$. Thus:

$$P(L \mid U) = \frac{P(L \cap U)}{P(U)} = \frac{.05}{.08} = .625$$

This says that, if a person is unemployed, there is a 62.5% chance they are living at or below the poverty level. We would probably expect this figure to be quite high.


Conditional probability is quite useful when used in the correct way. The counterintuitive problem below will allow us to shed light on how important it really is to think about dependencies.

Example 7: As part of a narcotics checkpoint, officers randomly search freight trucks for shipments of illegal drugs. The officers search a small number of crates in the trucks that are chosen for random inspection. Suppose that, unbeknownst to the officers, there are two trucks ahead, one of which contains one crate with illegal drugs. This truck has a total of 8 crates, while the truck without drugs has a total of 5 crates. One of the two trucks will be randomly chosen. What is the probability that the officers find the drugs?

SOLUTION: At first, it is tempting to say that the probability is $\frac{1}{13}$; however, this is not accurate. The probability that the officers find the crate with drugs is dependent on them choosing the correct truck first! Let

$T$ = the correct truck is chosen
$C$ = the correct crate is chosen

Two things must happen: they must choose the correct truck and they must choose the correct crate. Randomly choosing one of the two trucks is equiprobable, $P(T) = \frac{1}{2}$. If the correct truck is chosen, then the probability of choosing the correct crate is $\frac{1}{8}$, that is, $P(C \mid T) = \frac{1}{8}$. So,

$$P(T \cap C) = P(T) \cdot P(C \mid T) = \frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16}$$

Why is it not valid to say 1/13? It might appear that probability is simply pulling a “fast one” on our intuition. A simple way to think about it is as follows: there is not just one random process here. If all the crates were in the same truck, there would indeed be a 1/13 chance that we’d get the right crate. However, there are two random processes here. If you don’t choose the correct truck, then choosing the correct crate is impossible. The likelihood of the second random process leading to the correct crate is indeed deeply affected by the outcome of the first random process!

Example 8: Reconsider Example 7. Let’s say that the second truck had two crates with shipments of drugs. As before, one of the two trucks will be randomly chosen. What is the probability that the officers find the drugs?

SOLUTION: This can happen in one of two ways:

• the truck with 8 crates ($T_1$) is selected and the one correct crate is chosen, OR
• the truck with 5 crates ($T_2$) is selected and one of the two correct crates is chosen

We will first create a small tree diagram showing the possible outcomes.

The beauty of this diagram is that it displays the conditional probabilities on the right “stems” of the tree for each initial choosing of the truck. The probability that drugs are found would thus be:

• Truck 1: $\frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16} = .0625$
• Truck 2: $\frac{1}{2} \cdot \frac{2}{5} = \frac{1}{5} = .20$

Since these are distinct outcomes and cannot both occur (there is no overlap in the events), it is okay to add them:

$$.0625 + .20 = .2625$$

Thus, there is about a 26% chance that drugs are found between the two trucks. Again, note that the probability is not simply $\frac{3}{13}$, as our intuition might falsely lead us to believe.

To formalize the tree above, let

$T_1$ = the truck with 8 crates is chosen
$T_2$ = the truck with 5 crates is chosen
$D$ = drugs are found

$$P(T_1) = \frac{1}{2} \qquad P(D \mid T_1) = \frac{1}{8}$$
$$P(T_2) = \frac{1}{2} \qquad P(D \mid T_2) = \frac{2}{5}$$

Since only one truck will be chosen, the probability of finding drugs in T1 and T2 is 0.

$$P(T_1 \cap D) = P(T_1) \cdot P(D \mid T_1) = \frac{1}{2} \cdot \frac{1}{8} = .0625$$
$$P(T_2 \cap D) = P(T_2) \cdot P(D \mid T_2) = \frac{1}{2} \cdot \frac{2}{5} = .20$$

Summing these together yields $.2625$, as with the tree diagram.
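The branch-by-branch arithmetic of the tree can be written out as a short Python sketch. It uses only the numbers stated in Example 8 (each truck chosen with probability 1/2, one drug crate among 8 in the first truck, two among 5 in the second); the dictionary labels are illustrative.

```python
from fractions import Fraction

# For each truck: (P(truck is chosen), P(a drug crate is picked | truck chosen))
trucks = {
    "T1 (8 crates, 1 with drugs)": (Fraction(1, 2), Fraction(1, 8)),
    "T2 (5 crates, 2 with drugs)": (Fraction(1, 2), Fraction(2, 5)),
}

# Multiply along each branch, then add the branches (they cannot both occur)
p_found = sum(p_truck * p_crate for p_truck, p_crate in trucks.values())
print(p_found, float(p_found))  # 21/80 0.2625
```

Adding a third truck to the dictionary extends the same calculation, provided the truck probabilities still sum to 1.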

Homework Problems - 3.4

1. A deck of standard playing cards has 52 cards. There are four suits: clubs, diamonds, hearts, and spades. There are two colors of cards – red and black. Diamonds and hearts are red, and clubs and spades are black. The cards are labeled A (Ace), 2-10, J (Jack), Q (Queen), and K (King). To better visualize, consider the illustration below:

Suppose you are given various conditions and that you must determine the probability of the specified draw on the next card. Use the card descriptions above to find the probability that: (Video Solution)
   a. Given that one Jack is removed, a Jack is drawn
   b. Given that all red cards are removed, a black card is drawn
   c. Given that a red Queen is removed, a red Queen is drawn
   d. Given that all red Queens are removed, a black Queen is drawn
   e. Given that all Kings are removed, a red card is drawn


   f. All numerical red cards are removed, a king is drawn
   g. A red king is removed, a black king is drawn

2. An auto insurance company finds that there is an 18% chance that a teenager gets into a car accident between ages 16 and 19. There is a 34% chance that a teenager gets a traffic ticket during this same age range. They find that the chance of getting into a car accident and getting a traffic ticket (not necessarily because of the accident) is 10%. (Video Solution)
   a. Based on the probabilities provided, are the two events independent? Perform a calculation to justify your answer.
   b. Given that a teenager gets into an accident, what is the probability that he gets a traffic ticket?
   c. Why did the probability change in this way, as compared to the unconditional probability of getting a traffic ticket?
   d. Given that a teenager gets a traffic ticket, what is the probability that he gets into an accident?
   e. Explain, in practical terms, what your answer in d) means.

3. Let , , and be events in a sample space. Do the following: a) explain whether or not the events are independent or dependent, and b) answer the questions below regarding these events with the information provided. Assume the first event listed in each probability statement occurs first (e.g. ( ) means occurs first). (Video Solution)
   ( ) ( ) ( ) ( ) ( ) ( )
   a. ( )
   b. ( )
   c. ( )

4. Gregor Mendel was a monk who, in 1865, suggested a theory of inheritance based on the science of genetics. He identified heterozygous individuals for flower color that had two alleles (one r = recessive white color allele and one R = dominant red color allele). When these individuals were mated, ¾ of the offspring were observed to have red flowers and ¼ had white flowers. The table summarizes this mating; each parent gives one of its alleles to form the gene of the offspring.

                 Parent 2
                 r      R
   Parent 1  r   rr     rR
             R   Rr     RR

We assume that each parent is equally likely to give either of the two alleles and that, if either one or two of the alleles in a pair is dominant (R), the offspring will have red flowers. (Problem source: Mathematical Statistics with Applications, 6th Ed., Wackerly, et al.) (Video Solution)
   a. What is the probability that an offspring has one recessive allele, given that the offspring has red flowers?
   b. What is the probability that an offspring has one dominant allele, given that the offspring has white flowers?
   c. What is the probability that an offspring has white flowers, given that it has one recessive allele?
   d. What is the probability that an offspring has white flowers, given that it has one dominant allele?
   e. What is the probability that an offspring has red flowers, given that it has one dominant allele?

5. There are 5 candidates for 2 town council positions. Three of them are for the removal of a landfill just outside of the city limits. The same candidate cannot fill both seats. (Video Solution)
   a. What is the probability that one randomly chosen candidate in the group is for the removal of the landfill?
   b. Given that one of the positions is filled with a candidate in favor of the landfill removal, what is the probability that the second candidate chosen is also in favor?
   c. What is the probability that two candidates in favor of the landfill removal are chosen?
   d. What is the probability that only one seat is filled by a candidate in favor of the landfill removal?
   e. What is the probability that at least one seat is filled by a candidate in favor of the landfill removal?


3.5 Combinations and Permutations

Recall from Section 3.2 the problem faced by a corn growing business: the FDA determines that two of the 20 bushels are potentially contaminated with E. coli. Two bushels had been shipped out and the question was: what is the probability that both bushels that were shipped to the local grocer were uncontaminated? We wrote the simultaneous probability as

$$P(\text{1st uncontaminated and 2nd uncontaminated}) = P(\text{1st uncontaminated}) \cdot P(\text{2nd uncontaminated} \mid \text{1st uncontaminated})$$

Due to the fact that one of the uncontaminated bushels was removed from the “pool”, there was now only a 17/19 chance that the second uncontaminated bushel would be pulled. In short, we wrote:

$$P(\text{both uncontaminated}) = \frac{18}{20} \cdot \frac{17}{19} = \frac{18 \cdot 17}{20 \cdot 19}$$

We notice that the numerator and denominator both have a product of two sequential numbers. Had they shipped, say, four bushels, the probability that all four were uncontaminated would be:

$$\frac{18 \cdot 17 \cdot 16 \cdot 15}{20 \cdot 19 \cdot 18 \cdot 17}$$

As you might imagine, this pattern continues. How painful, though, would it be to have to multiply eight or nine probabilities of this nature together? You could certainly do it, but you might think, “It sure would be nice to take advantage of this pattern!” Well, we’re in luck! Let’s define an important term:

A factorial is a descending product of whole numbers down to 1, beginning at a specified whole number. To start with a generic whole number, $n$, we denote this product by $n!$, and write:

$$n! = n(n-1)(n-2) \cdots (2)(1)$$

Example 1: Find  .

SOLUTION: By definition of factorial, we write

This definition is great, but it still does not resolve our crisis: how do we multiply only a specific number of sequential whole numbers? Here’s a little trick: write the factorial out, then divide out the factors that are not needed. For us, this means:

$$18 \cdot 17 = \frac{18 \cdot 17 \cdot 16 \cdot 15 \cdots 2 \cdot 1}{16 \cdot 15 \cdots 2 \cdot 1}$$

But this is the same thing as:

$$18 \cdot 17 = \frac{18!}{16!}$$

In a similar way, we can write the denominator of our probability as:

$$20 \cdot 19 = \frac{20!}{18!}$$

Before we push this too far and get ourselves into a trap, let’s consider a different example with a smaller sample space. Suppose that there are only 3 bushels of corn and that only one is contaminated with E. coli. Again, let’s say that two are shipped out. Then,

$$P(\text{both uncontaminated}) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}$$

If you recall the tabular approach to thinking about this, we might show the possibilities for uncontaminated bushels, U1 and U2, and the way in which they can appear:

                    2nd Bushel
                    U1          U2
   1st Bushel  U1   (U1, U1)    (U1, U2)
               U2   (U2, U1)    (U2, U2)

We know that the pairs (U1, U1) and (U2, U2) for the 1st and 2nd bushels cannot be possible, since that particular bushel is removed from the population. So, we denote that in the table by blacking-out those cells (marked ■):

                    2nd Bushel
                    U1          U2
   1st Bushel  U1   ■           (U1, U2)
               U2   (U2, U1)    ■

Perfect! So we see the remaining two possibilities, right? Well, actually, is there a difference between (U2, U1) and (U1, U2)? Not unless those two bushels are actually different than one another! So, blacking out either one of these pairs leaves:

                    2nd Bushel
                    U1          U2
   1st Bushel  U1   ■           ■
               U2   (U2, U1)    ■

One possibility! You might be wondering why we’re bothering with this if we’ve already found the probability. This is a good thing to wonder. Recall that a probability is the number of ways an event can happen divided by the total number of outcomes. To be consistent with this definition, we really should be putting 1 in the numerator. Does that mean we miscomputed the probability? Not in this particular example, but it can happen. To make our denominator consistent, let’s look at the total number of possibilities for selecting bushels, adding in the contaminated bushel, C:

                    2nd Bushel
                    U1          U2          C
   1st Bushel  U1   (U1, U1)    (U1, U2)    (U1, C)
               U2   (U2, U1)    (U2, U2)    (U2, C)
               C    (C, U1)     (C, U2)     (C, C)

Again, it is not possible to select the same bushel twice, so we black-out the diagonals:

                    2nd Bushel
                    U1          U2          C
   1st Bushel  U1   ■           (U1, U2)    (U1, C)
               U2   (U2, U1)    ■           (U2, C)
               C    (C, U1)     (C, U2)     ■

Are we done? Not unless we feel that (U2, U1) is different than (U1, U2). We notice that the three cells to the right of our blacked out diagonal are duplicates of those to the left. Thus we can cross them out, as well:

                    2nd Bushel
                    U1          U2          C
   1st Bushel  U1   ■           ■           ■
               U2   (U2, U1)    ■           ■
               C    (C, U1)     (C, U2)     ■

This leaves us with three possibilities. So, our probability should be:

$$P(\text{both uncontaminated}) = \frac{1}{3}$$

Wait! This is the same as our earlier calculation of

$$P(\text{both uncontaminated}) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}$$
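The blacked-out tables can be reproduced by letting Python list the pairs. In this sketch, `itertools.permutations` plays the role of the table with only the diagonal removed (order kept), and `itertools.combinations` plays the role of the table with duplicates crossed out as well (order eliminated). The labels U1, U2, and C follow the tables above.

```python
from itertools import combinations, permutations

bushels = ["U1", "U2", "C"]  # two uncontaminated bushels, one contaminated

ordered = list(permutations(bushels, 2))    # order kept: 6 pairs
unordered = list(combinations(bushels, 2))  # order eliminated: 3 pairs

# Keep only the pairs that contain no contaminated bushel
clean_ordered = [pair for pair in ordered if "C" not in pair]      # 2 pairs
clean_unordered = [pair for pair in unordered if "C" not in pair]  # 1 pair

print(len(clean_ordered), len(ordered))      # 2 6
print(len(clean_unordered), len(unordered))  # 1 3
```

Both ratios, 2/6 and 1/3, reduce to the same probability, which matches what the tables show for this small example.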

Since we get the same answer, one might think that it must not matter which approach we take. Many times, it doesn’t; however, “many” is not satisfying enough, since this leaves us prone to mistakes under different circumstances. Let’s analyze the full situation two different ways. We found that if we don’t eliminate order differences, then we can write the probability as:

$$\frac{18 \cdot 17}{20 \cdot 19} = \frac{18!/16!}{20!/18!}$$

If we did (correctly) eliminate order differences, notice that we cut the number of possibilities in half, that is, divided by 2. You’ll notice that 2 is the same thing as $2!$. So, let’s divide out the number of duplicates from top and bottom:

$$\frac{18!/16!}{2!} = \frac{18!}{2! \cdot 16!}$$

And

$$\frac{20!/18!}{2!} = \frac{20!}{2! \cdot 18!}$$

Which, in its final state gives:

$$\frac{\dfrac{18!}{2! \cdot 16!}}{\dfrac{20!}{2! \cdot 18!}}$$

This does look rather complicated, but remember that it follows from some fairly simple things that we have built up on. Also notice that both the top fraction and the bottom fraction have $2!$. Ah, yes! So that’s why the order-not-eliminated and order-eliminated answers are the same:

$$\frac{\dfrac{18!}{2! \cdot 16!}}{\dfrac{20!}{2! \cdot 18!}} = \frac{18!/16!}{20!/18!}$$

While this works out beautifully in this example, it is not always true, and so we must take care to observe whether order difference is important. We will see examples later where this difference will come into play, but those situations are a bit more advanced. Let’s simplify this horrid notation a bit. Suppose that there are a total of $n$ items and $r$ of those are to be drawn.

Permutation – Order Does Matter

If order is not to be eliminated (in cases where order is important), then the number of ways to select $r$ things from the given $n$ is called a permutation and is denoted:

$${}_{n}P_{r} = \frac{n!}{(n-r)!}$$

NOTE: $(n-r)! \neq n! - r!$, that is, factorial is not distributable!! Subtract first, then use factorial.

For our numerator, we had selected 2 uncontaminated bushels from a total of 18 uncontaminated bushels. According to our new notation, this can be written as:

$${}_{18}P_{2} = \frac{18!}{(18-2)!} = \frac{18!}{16!}$$

And this is precisely what we have written for the numerator! For our denominator, we had selected 2 (general) bushels from a total of 20 (general) bushels, since we want to know the total number of ways 2 objects can come out of 20.

$${}_{20}P_{2} = \frac{20!}{(20-2)!} = \frac{20!}{18!}$$

And this is precisely what we have written for the denominator! In simplified notation,

$$P(\text{both uncontaminated}) = \frac{{}_{18}P_{2}}{{}_{20}P_{2}}$$
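If you would like to check the permutation formula away from the calculator, here is a small Python sketch. The helper `n_p_r` is an illustrative name that applies the factorial definition directly; `math.perm` is Python's built-in equivalent (available in Python 3.8 and later) and is used only as a cross-check.

```python
import math
from fractions import Fraction

def n_p_r(n, r):
    """Permutations from the definition: n! / (n - r)!  (subtract first, then factorial)."""
    return math.factorial(n) // math.factorial(n - r)

# The definition agrees with the built-in and with the plain product 18 * 17
assert n_p_r(18, 2) == math.perm(18, 2) == 18 * 17

# Probability that both shipped bushels are uncontaminated
p_both_clean = Fraction(n_p_r(18, 2), n_p_r(20, 2))
print(p_both_clean, round(float(p_both_clean), 3))  # 153/190 0.805
```

Because Python integers do not overflow, the numerator and denominator can be computed separately here, unlike on a handheld calculator.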

Calculator Clinic – Using Permutations

To evaluate a permutation,

1. First enter $n$ in your home screen.
2. Go to MATH and move to the left to the PRB tab.
3. Select 2: nPr. This will return you to your home screen.
4. Enter $r$ and press ENTER.

TIP: Sometimes the value of the numerator or denominator is so large that the calculator throws an overflow error. It is advisable to enter the entire probability in, numerator and denominator, to avoid this potential problem.

Let’s now consider the case where it is important to eliminate order differences.

Combination – Order Does NOT Matter (Eliminated)

If order is to be eliminated (in cases where order is not important), then the number of ways to select $r$ things from the given $n$ is called a combination and is denoted:

$${}_{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$

NOTE: $(n-r)! \neq n! - r!$, that is, factorial is not distributable!! Subtract first, then use factorial. Additionally, the factorial of a product is not the product of factorials, that is, $(r \cdot (n-r))! \neq r!\,(n-r)!$.

For our numerator, we had selected 2 uncontaminated bushels from a total of 18 uncontaminated bushels, eliminating the number of repeats, which was 2, or $2!$. According to our new notation, this can be written as:

$${}_{18}C_{2} = \frac{18!}{2!\,(18-2)!}$$

And this is precisely what we have written for the numerator! For our denominator, we had selected 2 (general) bushels from a total of 20 (general) bushels, since we want to know the total number of ways 2 objects can come out of 20, order aside.

$${}_{20}C_{2} = \frac{20!}{2!\,(20-2)!}$$

And this is precisely what we have written for the denominator! In simplified notation,

$$P(\text{both uncontaminated}) = \frac{{}_{18}C_{2}}{{}_{20}C_{2}}$$

Calculator Clinic – Using Combinations
Follow the steps for finding permutations, but in Step 3, use 3: nCr instead.

Example 2: Every week, Cori stops at Chipotle Mexican Grill for lunch with his colleagues. Each time, he drops a business card into the fishbowl for a chance to win lunch for his entire office. After the seventh visit, Cori begins to wonder about his chances of winning. He estimates that there are approximately 40 cards in the bowl. If two were to be drawn, what is the probability Cori wins both draws?

SOLUTION: We first think about what it is that we need to know. Per the question asked, the event is that the first and second cards drawn are both Cori's. This event occurs when the 2 cards drawn both come out of the 7 he has put in thus far. Since the order in which his two cards are drawn doesn't matter (as the prize is the same), we would like to know the value of

\[ {}_7C_2 = 21 \]

The sample space is simply the total number of outcomes. Two cards will be drawn from the stack of 40, and since order doesn't matter, there are

\[ {}_{40}C_2 = 780 \]

possible outcomes. Thus, the probability of this event is

\[ \frac{{}_7C_2}{{}_{40}C_2} = \frac{21}{780} \approx 0.027 \]

There is about a 3% chance that both of the cards drawn are Cori's.
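The counting formulas above can be checked with a few lines of Python. This is only a sketch: the helper names `permutations_count` and `combinations_count` are made up for illustration, and the built-in `math.perm` and `math.comb` (Python 3.8 or later) are used as a cross-check.

```python
from math import comb, factorial, perm


def permutations_count(n, r):
    """Ordered selections of r items from n: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)


def combinations_count(n, r):
    """Unordered selections of r items from n: n! / (r! (n - r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))


# The standard library provides the same counts directly.
assert permutations_count(18, 2) == perm(18, 2)
assert combinations_count(40, 2) == comb(40, 2)

# Example 2: both drawn cards are among Cori's 7, out of 40 in the bowl.
favorable = combinations_count(7, 2)
total = combinations_count(40, 2)
p_cori = favorable / total
print(favorable, total, round(p_cori, 3))
```

Note that the subtraction happens inside `factorial(n - r)`, which mirrors the "subtract first, then use factorial" rule.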

Example 3: Probability is often used in police investigations to help determine probable cause. Suppose that in a gang-related report it was stated that three gang members were spotted. In an interrogation room, 20 gang members are suspects, three of whom are certain to have committed the crime. A detective has a suspicion that the three came from a gang of which 5 of its members are present. Just by chance, how likely is it that the three members came from the gang he believes to be behind the crime? Does this give him what you might consider "probable cause" to pursue the group?

SOLUTION: The event is that the three criminals come from a group of five particular gang members. There are

\[ {}_5C_3 = 10 \]

ways in which this can happen. The total number of three-criminal groups that can be formed out of the 20 suspects is

\[ {}_{20}C_3 = 1140 \]

This means,

\[ \frac{{}_5C_3}{{}_{20}C_3} = \frac{10}{1140} \approx 0.009 \]

There is only a 0.9% chance that the three gang members all come from the presumed gang. The detective should consider more evidence to narrow down the search results before making assumptions.

Example 4: A business creates a new system to keep track of client relations, such that information about the client and any particular orders placed can be accessed by a non-repeating, four-character code of letters and digits. For instance, KA23 and AK23 are possible codes. Any code containing only letters will be reserved for large clients. How many such codes of non-repeating letters can they make available and, assuming all such codes will eventually be used up, what percentage of the company's clients will be considered large clients?

SOLUTION: There are 26 letters in the alphabet and, of those, four will comprise a single, large-client code. There are \({}_{26}P_4 = 26 \cdot 25 \cdot 24 \cdot 23 = 358{,}800\) different codes without the same letters being repeated, but where order does matter. In order to know what percentage (or probability) of the total number of possible codes this represents, we need to compute the total number of codes that can be formed, where no letter or number is repeated, but where order does matter. This is precisely what permutations are for. Since there are 26 letters and 10 numbers, a total of 36 different "symbols" can be selected from. The number of permutations is \({}_{36}P_4 = 36 \cdot 35 \cdot 34 \cdot 33 = 1{,}413{,}720\) total different codes¹ without the same letters or numbers being repeated, but where order does matter. So, the percentage/probability, then, is:

\[ \frac{{}_{26}P_4}{{}_{36}P_4} = \frac{358{,}800}{1{,}413{,}720} \approx 0.254 \]

We conclude that 25% of all clients (the large clients) will have completely alphabetical codes.
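As a quick check of Example 4, the two permutation counts and their ratio can be computed with Python's built-in `math.perm`. The variable names are illustrative only.

```python
from math import perm

# Four-character codes with no repeated symbol, where order matters.
letters_only_codes = perm(26, 4)   # letters A-Z only
all_codes = perm(36, 4)            # 26 letters plus 10 digits

share_large_clients = letters_only_codes / all_codes
print(letters_only_codes, all_codes, round(share_large_clients, 3))
```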

¹ Notice that the increase in the number of possibilities after increasing the size of the sample space is not proportional to the increase amount. The growth is much faster than linear.

Example 5: In Example 4, it was necessary that letters and numbers were not to be repeated. Recalculate the number of large-client codes and the percentage of them by assuming that numbers and letters actually can be repeated.

SOLUTION: Recall that a permutation or a combination is intended to handle situations in which repeats are not allowed. Recall from the beginning of this section that to find the number of ways in which two bushels of corn could be selected from a crop of 20 (and after one is selected, the sample space reduces in size), we wrote:

\[ 20 \cdot 19 \]

In this situation, we are allowing repeats. For the number of ways to form a 4-letter code, we have 26 possibilities for each character. That is 26 for the first, the second, the third, and the fourth. Crossing all of these possibilities gives:

\[ 26 \cdot 26 \cdot 26 \cdot 26 = 26^4 = 456{,}976 \]

We expect this to be larger than in the previous example, since we are allowing repeats. Similarly, the number of letter/number codes that are possible can be calculated by noting that, in general, each piece of the code has 36 possibilities. So,

\[ 36 \cdot 36 \cdot 36 \cdot 36 = 36^4 = 1{,}679{,}616 \]

The percentage/probability is

\[ \frac{26^4}{36^4} = \frac{456{,}976}{1{,}679{,}616} \approx 0.272 \]

The percentage changes to 27%: about 27% of all codes will contain only letters.
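When repeats are allowed, each position is an independent choice, so the count is a simple power. The sketch below uses a hypothetical helper, `codes_with_repeats`, and confirms the rule on a tiny alphabet by listing every code with `itertools.product`.

```python
from itertools import product


def codes_with_repeats(num_symbols, length):
    """Number of codes when every position may use any symbol."""
    return num_symbols ** length


# Brute-force confirmation on a small case: 3 symbols, 2 positions.
small_listing = list(product("ABC", repeat=2))
assert len(small_listing) == codes_with_repeats(3, 2)

# Example 5: four-character codes with repeats allowed.
letters_only_codes = codes_with_repeats(26, 4)
all_codes = codes_with_repeats(36, 4)
share_letters_only = letters_only_codes / all_codes
print(letters_only_codes, all_codes, round(share_letters_only, 3))
```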

Moral of the Story with Counting
Counting correctly comes down to determining some key pieces of information:

1. Are repeats/replacements allowed? If yes, permutations/combinations are likely the incorrect approach.
2. Does order matter? If yes, permutations should be used. If no, combinations should be used.

You Might Be Wondering:
You might be wondering why we must divide by \(r!\) to remove all repeats. This was probably somewhat obvious when working with two objects. Say there are 5 objects to select from. One is now gone, so for the second selection there are only 4. We proceed to cross out everything along and to the right of the diagonal, since those entries are either not possible or are repeats.

[Table: columns Object 1 to Object 4; rows Object 1 to Object 5]

We have essentially multiplied the first five possibilities by the next number of possibilities, which is only 4 (this is accounted for by crossing out the diagonal, since this subtracts out five possibilities to give \(5 \cdot 4 = 20\)), and then divided that result by 2, since half of the table is a repeat. That is,

\[ \frac{5 \cdot 4}{2} = 10 \]

What happens when we select a third object? We extend the above table as a multiple of 3, since there are three objects left. Each table represents a pairing with one of the three remaining objects, as shown in the upper-left corner:

[Tables OBJECT 1, OBJECT 2, and OBJECT 3: each has columns Object 1 to Object 4 and rows Object 1 to Object 5]

In the first table, we can cross out the first column (and first row, if it were there), since it is not possible to select object 1 for a third time. In the second table, we can cross out the second column/row, and in the third table we can cross out the third column/row, for the same reason as in table 1.

[The three tables repeated, with these columns crossed out]

Also notice that the second column of table 1 and the last three rows of table 2 are the same: (1, 2, 3), (1, 2, 4), and (1, 2, 5). For a similar reason, the third column of table 1 can be crossed out, since it is a repeat of what we have in column 1 of table 3.

[The three tables repeated, with these entries crossed out]

Nothing else in table 1 can be eliminated, since (1, 4, 5) cannot be found in either of the two remaining tables (this is a unique characteristic of the bottom, right-most entry). In table 2, we will try to eliminate any entries that can be found in table 3. These eliminations will involve any entries that contain Object 3. We can do so with the (2, 1, 3) entry and the third column:

[The three tables repeated, with these entries crossed out]

Now, notice that we have 10 white spots left. This happens to be exactly one-third of what we had after we tripled the table. That is,

\[ \underbrace{\frac{5 \cdot 4}{2} \cdot 3}_{\text{tripled table}} \cdot \frac{1}{3} = 10 \]

Which can be simplified to,

\[ \frac{5 \cdot 4 \cdot 3}{3 \cdot 2 \cdot 1} = \frac{5!}{3!\,(5-3)!} = {}_5C_3 \]
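The crossing-out argument can also be checked by brute force. The sketch below lists every ordered selection of 3 objects out of 5 with `itertools.permutations`, then groups selections that contain the same objects. The names `ordered` and `unordered` are illustrative only.

```python
from itertools import permutations
from math import factorial

objects = [1, 2, 3, 4, 5]

# Every ordered way to pick 3 different objects: 5 * 4 * 3 of them.
ordered = list(permutations(objects, 3))

# Treat selections with the same objects as one group, ignoring order.
unordered = {frozenset(selection) for selection in ordered}

# Each group of 3 objects appears in 3! different orders.
repeats_per_group = factorial(3)
print(len(ordered), repeats_per_group, len(unordered))
```

Dividing the number of ordered selections by the number of orderings of each group gives the number of distinct groups, which is the role played by the factorial in the denominator of the combination formula.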

Selecting \(r\) items allows this process to repeat, ad nauseam, any number of times. Mathematicians discovered that this tabular process could be reduced into the formula we use. The same holds in the general case (where we allow \(r\) to be any value between 0 and the number of items we have to choose from), which tends to be discussed in more theoretical mathematics courses such as Discrete Mathematical Structures (our MAT227).

Homework Problems - 3.5
1. If possible, give an imaginary (but realistic) scenario for each of the following. If not possible, state why.
   a. b. c. d. e.
2. Your classmate was absent when permutations and combinations were covered. Explain when he should and when he should not use permutations and combinations. (Video Solution)
3. A police officer has been brought before the court on accusations of racial profiling. This occurs when a person of a particular race has been pulled over or detained by the police due to his race. The officer stopped 2 vehicles out of 10 that passed through a freeway tollbooth. Both of the suspects were Asian and there were a total of 3 Asian drivers in the 10. (Video Solution)
   a. In how many ways could 2 drivers have been selected from the 10?
   b. In how many ways could 2 Asian drivers have been selected from the 3?
   c. How likely is it that the 2 selected drivers would both have been Asian if the stops were truly random?
4. In the United States, 20 out of the 50 states spend more than 50% of their state park and recreation areas revenue on keeping the state park operable (SOURCE: 2012 U.S. Statistical Abstract). Suppose a survey of 10 states is to be conducted next year to see if anything has changed. (Video Solution)
   a. In how many ways can 10 states be selected for the survey?
   b. In how many ways can 10 states be drawn so that all 10 are operating on more than 50% of their state park and recreation areas revenue?
   c. What is the probability that all 10 of the states drawn are operating on more than 50% of their state park and recreation areas revenue?
5. Ten pieces of furniture are to be arranged in a long row in a furniture store. In how many ways can all 10 be arranged? (Video Solution)
6. At Chandler-Gilbert Community College high-school math competitions, students enter into a raffle to win various prizes, including a graphing calculator. There are typically around 200 students. Suppose there are 5 different types of calculators to be given out and that the best is saved for last. (Video Solution)
   a. In how many ways can the prizes be distributed among the 200 students?
   b. Suppose a school has 5 attendees. In how many ways can all 5 students from this school win a calculator?
   c. What is the probability that all 5 students from this school win a calculator?
7. A frequent concern of cautious consumers is the idea of the last four digits of a credit card number being displayed on receipts. Suppose a consumer has a Visa, which has a total of 16 digits, each of which can be between 0 and 9. For the sake of simplicity, suppose any combination is possible. A customer left the following receipt lying around and is now concerned about his identity: (Video Solution)
   a. First, how many different credit-card numbers are possible with 16 digits?
   b. How many different credit-card numbers can be arranged with 6781 as the last four digits?
   c. On any one guess by a potential thief, what is the probability that he correctly guesses this person's credit card number?

3.6 Expected Value

Imagine that you are an insurance salesperson with many years of experience. A new client has requested that your business provide him with auto insurance. He is 20 years old and has never been in an accident before. Considering age alone, you look at industry data and find that, as recently as 2008, there was about a 15% chance that someone his age would get into an accident (SOURCE: U.S. Statistical Abstract, Table 1113). Using your own expertise you find that, of your 20-year-old clients, the typical accident payment for his particular make and model of vehicle is about $3,200. He brings forward a quote from another insurance agency for a $100/month premium with no deductible (nothing to pay when an accident does occur except the running premium). The question is, do you insure him?

Let's look at the possibilities in a tabular form. Since there's a 15% chance the driver will get into an accident, there is an 85% chance he won't (since it either does happen or it doesn't). If there is no accident, then the insurance company receives $1200 for the entire year. If an accident does occur, the insurer pays out $3200 (hence a negative effect), but still receives the year's premiums. Thus, the net difference is $2000, which the insurer is responsible for.

Action        Likelihood    Monetary Value to Insurer
Accident      15%           -$2000
No Accident   85%           +$1200

If we now consider 100 years, it is expected that in 15 of those years there would be an accident and in 85 of them there would be no accident, assuming the constant probability. That means the insurer would pay $2000 a total of 15 times and receive $1200 a total of 85 times. Let's consider the net difference:

\[ 85(1200) - 15(2000) = 102{,}000 - 30{,}000 = 72{,}000 \]

This amount looks very good! In fact, on average, the company received \(72{,}000 / 100 = 720\) dollars per year. This customer is definitely profitable to the company, in the long-run. Of course, we know that an accident could occur the first year, in which case a $2000 loss would be incurred right away. Notice what we really did here. We took the sum of the amounts and divided by 100:

\[ \frac{85(1200) + 15(-2000)}{100} = 720 \]

By properties of a common denominator we can write:

\[ \frac{85}{100}(1200) + \frac{15}{100}(-2000) = 0.85(1200) + 0.15(-2000) = 720 \]

In reality, we multiplied each monetary value by its respective probability. This idea is known as expected value, since it is what we expect to happen in the long-run.

Expected Value and Random Variable
Expected value is the expected, or average, quantity that should occur in the long-run, provided that each quantity occurs with a certain probability. Suppose there are \(n\) quantities, \(x_1, x_2, \ldots, x_n\), each of which occurs with a certain probability, \(p_1, p_2, \ldots, p_n\), respectively. Then the expected value, denoted \(E[X]\), is

\[ E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_n p_n \]

A capital letter, \(X\), is used to denote what is called a discrete random variable, a variable that takes on one of \(n\) (a natural number of) values with a certain probability. This variable is defined by what it measures in the given situation. Importantly, \(p_1 + p_2 + \cdots + p_n = 1\), that is, we must account for 100% of all possible outcomes in order for the expected value to be meaningful.
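The definition translates almost word for word into code. The following is a minimal sketch, assuming a hypothetical helper named `expected_value`. It refuses to compute anything unless the probabilities account for 100% of the outcomes, and it is applied here to the insurance figures from the start of this section.

```python
from math import isclose


def expected_value(values, probabilities):
    """Sum of each value times its probability."""
    if len(values) != len(probabilities):
        raise ValueError("each value needs exactly one probability")
    if not isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probabilities))


# Insurer's view: +1200 with no accident, -2000 with an accident.
insurance_ev = expected_value([1200, -2000], [0.85, 0.15])
print(round(insurance_ev, 2))
```

The check on the probabilities is a design choice for this sketch: leaving out an outcome is an easy mistake to make by hand, and the result would then be meaningless.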


An expected value is actually not something terribly new. To see this more explicitly, suppose a student earns three test scores: 95%, 80%, and 85%. Then the average percentage is:

\[ \frac{95 + 80 + 85}{3} \approx 86.7 \]

Observe that we can use properties of fractions to separate the sum as follows:

\[ \frac{1}{3}(95) + \frac{1}{3}(80) + \frac{1}{3}(85) \approx 86.7 \]

While one-third in this situation is not a probability (since the scores have already been earned), each score is "weighted" as one-third of the overall class grade.

Example 1:

A company sells consumer electronics, such as televisions, stereos, and computers. For each product, the company offers the consumer a warranty that protects against any problems that might occur within the first two years, with the exception of accidental damage and theft. For a particular television that runs $1200, it offers a 2-year warranty for $175. The company determines that 3% of these televisions malfunction each year. Is the company offering the warranty at a profitable price? Explain your answer and define the random variable.

SOLUTION: We should determine what will happen, on average. We first see that the warranty is a 2-year warranty and the defect rate is for one year. If 3% malfunction each year, then 6% of all televisions are expected to malfunction within the first two years. This means that the company will make $175 with a 94% probability and will lose $1200 - $175 = $1025 with a 6% probability, since it will still receive the payment, but will have to either replace the product or offer a credit to the consumer. Letting \(X\) = the amount the company gains on one warranty, the expected amount to be gained, or \(E[X]\), is

\[ E[X] = 0.94(175) + 0.06(-1025) = 164.50 - 61.50 = 103 \]

This means that, after selling this product for a while, it should earn an average of $103 from each consumer that purchases the warranty. This is a profitable outcome.
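The same calculation can be written as a small function so that the warranty price is easy to vary. This is a sketch only: the function name and its parameters are hypothetical, and it keeps the example's simplifying assumption that a failure always costs the full replacement price.

```python
from math import isclose


def expected_warranty_profit(price, replacement_cost, failure_probability):
    """Expected gain per warranty: keep the price, or keep it and replace."""
    no_failure = (1 - failure_probability) * price
    failure = failure_probability * (price - replacement_cost)
    return no_failure + failure


# Example 1: $175 warranty, $1200 television, 6% two-year failure rate.
profit_at_175 = expected_warranty_profit(175, 1200, 0.06)
print(round(profit_at_175, 2))

# Under these assumptions the expected profit is price - 0.06 * 1200,
# so a $72 warranty would only break even on average.
profit_at_72 = expected_warranty_profit(72, 1200, 0.06)
```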

Example 2:

The Arizona Lottery has a number of different lottery games that a person can play. One in particular is Fantasy 5. The rules of the game are simple: pay $1 per ticket and select five numbers between 1 and 41. Five numbers are then selected at random. If you correctly selected two or more of these numbers, then you are considered a winner. The following table describes the likelihood of winning:

(SOURCE: www.arizonalottery.com) The estimated jackpot for the Wednesday, August 17, 2011 lottery was $54,000. Is the game in your favor? Why or why not?

SOLUTION: We must first consider the fact that these prizes do not take into account that $1 was lost to purchase the ticket; we should subtract $1 from each of the prizes. Additionally, we note that the probabilities do not add to 1:

\[ \frac{1}{749{,}398} + \frac{1}{4163} + \frac{1}{119} + \frac{1}{11} \approx 0.0996 \]

The remainder of the time, it is simply the case that $1 is lost:

\[ 1 - 0.0996 = 0.9004 \]

We rebuild the table to show all of the values and probabilities:

x          53,999      499      4       0      -1
P(X = x)   1/749,398   1/4163   1/119   1/11   9,004/10,000

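Where \(X\) = the net amount won on one ticket, in dollars. The expected value is:

\[ E[X] = 53{,}999\left(\frac{1}{749{,}398}\right) + 499\left(\frac{1}{4163}\right) + 4\left(\frac{1}{119}\right) + 0\left(\frac{1}{11}\right) + (-1)\left(\frac{9{,}004}{10{,}000}\right) \approx -0.67 \]

This means that if one were to play time after time, taking into consideration the small likelihood of winning occasionally, one would be expected to lose, on average, $0.67 per ticket. The game is not in your favor.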
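The lottery calculation is a good candidate for exact fractions, since the odds are given as "1 in so many". The sketch below uses the net values and odds from the rebuilt table; the variable names are illustrative, and the probability of losing is computed as whatever is left over rather than the rounded 9,004/10,000.

```python
from fractions import Fraction

# Net winnings (prize minus the $1 ticket) and the "1 in N" odds for each.
net_values = [53_999, 499, 4, 0]
one_in = [749_398, 4_163, 119, 11]

win_probabilities = [Fraction(1, n) for n in one_in]

# Everything not covered by a winning outcome is a $1 loss.
lose_probability = 1 - sum(win_probabilities)

expected_net = sum(x * p for x, p in zip(net_values, win_probabilities))
expected_net += -1 * lose_probability

print(round(float(lose_probability), 4), round(float(expected_net), 2))
```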

Notice that we represented the outcomes by using a table, in which we listed the outcomes, or the individual \(x\) values, along with the probability that each occurs, \(P(X = x)\). This is one way in which to display a probability distribution, or how all probabilities are distributed among the various outcomes.

Example 3:

A fair, six-sided die is tossed repeatedly. The number of dots that are facing up after each throw is recorded. Define the random variable, find its probability distribution, and find and interpret the expected value of the random variable.

SOLUTION: We define the random variable,

\[ X = \text{the number of dots facing up after a toss} \]

The different values that \(X\) can take on are \(1, 2, 3, 4, 5, 6\), since we know there are six sides. Since this is a fair die, each of these six outcomes has an equally likely chance of appearing, so \(P(X = x) = \frac{1}{6}\), for all values, \(x\), of \(X\). Our probability distribution is thus,

x          1     2     3     4     5     6
P(X = x)   1/6   1/6   1/6   1/6   1/6   1/6

The expected value is the sum of the products of each outcome value and its associated probability.

\[ E[X] = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right) = \frac{21}{6} = 3.5 \]

The average value of a die that is repeatedly tossed will be 3.5. If we were to conduct a simulation we would probably see something similar to what we saw in the introductory section of this chapter:

[Figure: Average Die Roll Outcome (vertical axis, 0 to 6) plotted against Number of Times Die Has Been Tossed (horizontal axis, 0 to 140)]
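A simulation of this kind can be sketched in a few lines of Python. The function name `running_averages` and the fixed seed are choices made for this illustration (the seed only makes the run repeatable); this is not the procedure used to produce the figure.

```python
import random


def running_averages(num_rolls, seed=0):
    """Average of all rolls so far, recorded after every roll."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for roll_number in range(1, num_rolls + 1):
        total += rng.randint(1, 6)  # one toss of a fair six-sided die
        averages.append(total / roll_number)
    return averages


theoretical_mean = sum(range(1, 7)) / 6

averages = running_averages(140)
print(theoretical_mean, round(averages[9], 2), round(averages[-1], 2))
```

Early averages can sit well away from the theoretical mean, because a handful of rolls carries a lot of weight. With many more rolls the running average settles close to it.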

As time passes, we see that the average roll becomes more stable and seems to be approaching 3.5, as we have shown mathematically.

Example 4:

In hopes of understanding the directions in which married couples are naturally inclined to walk at an outdoor mall in Arizona, a marketing group conducts a study. It is the experience of the mall that men and women tend to walk in different directions once they park (and catch up later). The first question is: how many individuals within a couple can they expect to start their walk through a street that has one or more clothing stores?

SOLUTION: We first note that there are three paths out of five with one or more clothing stores. We assume there are two people per couple and that each takes a different initial route. The random variable we are interested in is:

\[ X = \text{the number of individuals in a couple who take a route with a clothing store} \]

The random variable can take on values \(0, 1, 2\), since it is possible that neither of them takes a clothing store route, only one does, or both do. We need to find the probability for each of the three events. \(X = 0\) individuals taking a route with a clothing store would occur when, from the three clothing store routes, none are selected, and both routes without clothing stores are selected. We then must compare this to the number of ways two routes can be chosen from five. That is,

\[ P(X = 0) = \frac{({}_3C_0)({}_2C_2)}{({}_5C_2)} = \frac{(1)(1)}{10} = \frac{1}{10} \]

Similarly, for \(X = 1\), we want to know how many ways one clothing-store route and one non-clothing-store route can be selected. That is,

\[ P(X = 1) = \frac{({}_3C_1)({}_2C_1)}{({}_5C_2)} = \frac{(3)(2)}{10} = \frac{6}{10} \]

For \(X = 2\),

\[ P(X = 2) = \frac{({}_3C_2)({}_2C_0)}{({}_5C_2)} = \frac{(3)(1)}{10} = \frac{3}{10} \]

Our probability distribution is:

x          0      1      2
P(X = x)   1/10   6/10   3/10

We can see that the probabilities sum to 1, which helps to imply that we have accounted for all possibilities. The number of individuals expected to take a clothing store route is an expected value of this distribution,

\[ E[X] = 0\left(\frac{1}{10}\right) + 1\left(\frac{6}{10}\right) + 2\left(\frac{3}{10}\right) = 1.2 \]

Thus, it can be expected that, on average, at least one person from the couple will walk along a route that contains a clothing store.
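The three probabilities follow one pattern: choose some clothing-store routes, choose the rest from the other routes, and divide by all ways to choose two routes. A sketch of that pattern, with a hypothetical function name and exact fractions so the results can be compared with the table:

```python
from fractions import Fraction
from math import comb


def route_distribution(store_routes=3, other_routes=2, people=2):
    """P(X = k) for k people on clothing-store routes, all routes distinct."""
    total_ways = comb(store_routes + other_routes, people)
    return {
        k: Fraction(comb(store_routes, k) * comb(other_routes, people - k),
                    total_ways)
        for k in range(people + 1)
    }


distribution = route_distribution()
expected_people = sum(k * p for k, p in distribution.items())

for k, p in distribution.items():
    print(k, p)
print(float(expected_people))
```

`Fraction` reduces automatically, so 6/10 is displayed as 3/5.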

One additional way to represent a probability distribution is by using a probability histogram. A histogram looks similar to a bar graph, except that it has a numerical horizontal axis and measures the probability along the vertical axis. Additionally, the bars touch in order to show continuity, where applicable. For the above situation, we would expect to see:

[Figure: probability histogram titled "Clothing Store Route Probabilities"; horizontal axis: Number of Individuals (0, 1, 2); vertical axis: Probability (0 to 0.7)]

This is a convenient visual way to view the distribution of probabilities. It is clear to us that it is quite unlikely that neither of the individuals in the couple will walk a route with a clothing store.

Homework Problems - 3.6
1. While working in downtown Phoenix, the author tracked the number of minutes that the Blue Line bus going through downtown Phoenix, AZ was late in arriving at a specific bus stop. He discovered the following: (Video Solution)

   Minutes late, x   On time   1      2      3      4
   P(X = x)          0.53      0.25   0.18   0.03   0.01

   a. Construct a probability histogram.
   b. What does the probability histogram reveal?
   c. Find and interpret the expected value of the random variable.
   (SOURCE: Author's data)
2. A Geico auto insurance policy for a 21-year-old Chandler male driver of a 2012 BMW M5 with no previous tickets has a semi-annual premium of $312.41. In the instance of an accident, there is a $1,000 deductible that the policyholder must pay before insurance will cover the damages (SOURCE: www.geico.com). The vehicle costs about $115,000 to replace. From past experience, suppose Geico knows there is a 2.5% chance (annually) that this situation will result in an accident. Find the expected payout for Geico and comment on its profitability in a situation like this. (Video Solution)
3. An insurance policy pays $100 per day for up to 3 days of hospitalization and $50 per day for each day of hospitalization thereafter. (Video Solution) The number of days of hospitalization, \(X\), is a random variable with probability given by the function

   \[ P(X = k) = \begin{cases} \dfrac{6-k}{15}, & k = 1, 2, 3, 4, 5 \\ 0, & \text{otherwise} \end{cases} \]

   a. Define the random variable.
   b. Give the probability distribution for \(X\) by using a probability histogram.
   c. What does the probability histogram tell you about hospitalization?
   d. Determine the expected payment for hospitalization under this policy.
   (SOURCE: Society of Actuaries (SOA), Spring 2003 Exam P, #36)
4. You work on a dairy farm and are in charge of quality control for eggs. Your primary concern is that broken eggs do not go out. You know from past experience that about 25% of the outgoing boxes contain one or more broken eggs (based on complaints). If a local restaurant purchases 4 boxes of eggs from you, what is the expected number of boxes with broken eggs that this vendor should receive? (Video Solution)
5. At a major seafood restaurant, shrimp fettuccini is a popular dish. The company is considering adding a family-sized fettuccini dish, but would first like to make sure that it will be a profitable endeavor. The company randomly surveys customers who purchase the original $14.99 dish and finds that 15% would purchase the larger family dish. What should they charge for the family-sized dish so that average revenue from shrimp fettuccini will be $17.00? (Video Solution)

Chapter 4 Discrete Probability Distributions

It might seem paradoxical to say that uncertainty occurs in certain ways, but the truth is that it does, provided certain assumptions are satisfied. As we build a probability distribution, whether in the form of a table or histogram, we can oftentimes save ourselves a lot of labor by focusing on the type of experiment that lies before us. The purpose of this chapter is to (hopefully) simplify some of our efforts.

4.1 The Binomial Distribution

4.1.1 Why Probability Distributions Are Useful

Suppose a friend of yours, let's call him Kyle, tells you that his brother is 6-feet, 9-inches tall. You are most likely wide-eyed and surprised by what he just told you. Why is this? You likely have some idea of how tall people generally are. You would probably consider a height of 6-feet, 9-inches to be uncommon in the environment you're used to. In fact, you might even go as far as to call this height an outlier, or a value that falls outside the usual data range. How can you be absolutely sure that this height is uncommon? What if you live in a region that tends to have shorter people? The statistician would say that it would be nice to see a probability distribution associated with heights of all people living in the region, state, country, or continent on which you live. She would argue that, if you are trying to describe the people in the U.S. based on people living in Arizona, you are drawing from a biased sample. While we will not discuss continuous random variables here (variables that can take on any number in a specified range), we will show a theoretical distribution for heights in the U.S. below:

For men, we see that the most frequently occurring height is near 70 inches (5-feet, 10-inches). It is very uncommon to have someone who is 81 inches tall (6-feet, 9-inches). This type of information allows us to conclude that your friend's brother is indeed very tall. You might be wondering how we know that the shapes of the distributions should look like bells. This is based on the data collection process. It is not unusual in nature for distributions to have a heavily loaded center with lower frequencies out towards the left and right tails. While the histogram of all heights might not have a perfect bell shape as we indicate, having this shape allows us to use mathematics to model the curve. Although many variables do take on a continuous set of values, we will begin with discrete random variables, as these are slightly simpler to describe.

4.1.2 The Binomial Distribution

When we talk about any variable that can take on a finite (as opposed to infinite) number of possibilities, we are dealing with a discrete random variable. Specifically, a binomial random variable is one that takes on one of two possible values, as indicated by the prefix "bi." We will simply refer to the outcome as either a "success" or a "failure." Consider this example: let's say that you and a friend are tossing a coin (since this is one of the most exciting things to do). Your friend tosses 9 heads out of 10 tosses. Curious about this, you begin to analyze the results: how likely is it that this type of event could take place? By letting \(H\) and \(T\) represent the events that a head/tail is facing up on a coin toss, respectively, we know that one possible way in which this can happen is:

\[ HHHHHHHHHT \]

The probability of this particular sequence of 9 heads and 1 tail is:

\[ \left(\frac{1}{2}\right)^{9}\left(\frac{1}{2}\right)^{1} = \left(\frac{1}{2}\right)^{10} \approx 0.000977 \]

This is definitely a small probability, but it is not the only way in which this can happen. The tail can occur first, second, third, fourth, etc., with heads all around it. Another one would be:

\[ HHHHHHHHTH \]

The probability of this sequence is the same: 9 heads, 1 tail. This is okay, since the probability of tossing a certain sequence does not affect the probability of getting a head or tail on the next toss. So,

\[ \left(\frac{1}{2}\right)^{8}\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = \left(\frac{1}{2}\right)^{9}\left(\frac{1}{2}\right)^{1} = \left(\frac{1}{2}\right)^{10} \approx 0.000977 \]

Not surprisingly, there are 8 more places for the tail to have appeared. We'll summarize in the table below:

Arrangement of 9 H, 1 T   Probability
(THHHHHHHHH)              (1/2)^10
(HTHHHHHHHH)              (1/2)^10
(HHTHHHHHHH)              (1/2)^10
(HHHTHHHHHH)              (1/2)^10
(HHHHTHHHHH)              (1/2)^10
(HHHHHTHHHH)              (1/2)^10
(HHHHHHTHHH)              (1/2)^10
(HHHHHHHTHH)              (1/2)^10
(HHHHHHHHTH)              (1/2)^10
(HHHHHHHHHT)              (1/2)^10

Since these are 10 distinct ways of getting this outcome, each with probability 0.000977 (that is, each takes up 0.0977% of the entire sample space), the probability of getting 9 heads and 1 tail is:

\[ 10(0.000977) \approx 0.00977 \]

As suspected, this particular event is not very likely. What if we complicated the problem a little more and asked, what would be the probability of having two tails mixed up in 10 total tosses? This gets more complicated, since the two tails could occur one after another, two tosses apart, three tosses apart, etc. To simplify our lives, it can be shown that the total number of ways in which a binary "success" can occur is found by computing the following combination:

\[ \binom{\text{number of trials}}{\text{number of successes}} \]

So, we had 10 trials and wanted to know the number of ways in which 9 heads (successes) can be included in the mix. We have:

\[ \binom{10}{9} = {}_{10}C_9 = 10 \]

Then, we simply need to find the probability of just one of those arrangements and multiply it by the number of different arrangements. Since we defined a head resulting as a success, then, what we just calculated was:

\[ \binom{10}{9}\left(\frac{1}{2}\right)^{9}\left(\frac{1}{2}\right)^{10-9} \]

At first glance, it might seem a little confusing that the second exponent is the number of trials less the number of successes. Why is this? Suppose there are 10 trials and you want 6 successes. This necessarily means that the other 4 trials would result in failures. This is precisely \(10 - 6 = 4\), or the number of trials less the number of successes. Let's make this formula easier to consider. First off, let's define some variables: Let

\(n\) = the number of trials
\(x\) = the number of successes
\(p\) = the probability of a success on any one trial

Now, in any event, success and failure make up the whole sample space. That is:

\[ S = \{\text{success}, \text{failure}\} \]

Since they make up the sample space,

\[ P(\text{success}) + P(\text{failure}) = 1 \]

So,

\[ P(\text{failure}) = 1 - P(\text{success}) = 1 - p \]

We rewrite our formula with the above defined components:

\[ \binom{n}{x} p^{x} (1-p)^{n-x} \]

This is known as the binomial probability density function, or binomial pdf. To make this more clear, we first define a random variable, \(X\). In the case of a binomial experiment (one in which there are two possible outcomes for each trial), \(X\) takes its values from the set listing all possible numbers of successes that can be achieved (between 0 and the number of trials). For example, if \(X\) = the number of heads in 10 coin tosses, then \(X \in \{0, 1, 2, \ldots, 10\}\). That is, between 0 and 10 heads can possibly be achieved in 10 tosses of the coin (though not all have the same probability). To indicate a binomial pdf calculation, we often write: the probability that \(X\) takes on \(x\) successes is

\[ P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x} \]

We summarize a binomial pdf below, along with the necessary assumptions to use this.

Binomial Probability Density Function (pdf)
If the following assumptions are met:
1) An experiment is carried out with \( n \) trials,
2) Each trial can result in only one of two possible values: a success or a failure,
3) The probability of a success in each trial is \( p \) (it is always the same), and
4) Each trial is independent of all other trials (the outcome of one trial in no way affects the outcome of any other trial),

then the experiment is a binomial experiment and the probability of \( x \) successes can be calculated by

\[ P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x} \]
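For readers who would like to check such calculations with software, here is a small sketch of the binomial formula in Python. The function name `binomial_pmf` and its parameter names are our own choices for illustration; `math.comb` supplies the number of arrangements.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials,
    each with the same success probability p."""
    if not 0 <= x <= n:
        return 0.0   # impossible counts get probability 0
    # (number of arrangements) * P(success)^x * P(failure)^(n - x)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# 9 heads in 10 tosses of a fair coin
print(round(binomial_pmf(9, 10, 0.5), 4))
```

The function assumes the four conditions listed above hold; it cannot check independence or a constant success probability for you.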

Example 1: A fair two-sided coin is tossed 10 times. The goal is to get 8 heads.

a) In how many different ways can this event occur?
b) Verify that all assumptions are met to conduct a binomial experiment.
c) What is the probability of this event?

SOLUTION:
a) Since there are 10 trials and 8 successes desired, there are:

\[ \binom{10}{8} = 45 \text{ ways} \]

b) 1) There are \( n = 10 \) trials
2) Each outcome is either a head (success) or a tail (failure)
3) The probability of success on any trial is \( p = 0.5 \)
4) One toss does not influence the outcome of any other toss

Thus, all assumptions have been met.

c)
\[ P(X = 8) = \binom{10}{8} (0.5)^{8} (0.5)^{2} = 45 (0.5)^{10} \approx 0.044 \]

Thus, there is about a 4.4% chance of tossing 8 heads in 10 tosses.

The fact that the probability of getting 8 heads in 10 tosses is higher than getting 9 heads in 10 tosses should not surprise us. Getting 9 heads is a rather extreme request. Getting 8 heads, while still extreme, is a bit more likely. Let's now build the probability distribution histogram for \( X \). We first display the probabilities in a table below by applying the binomial pdf:

Successes    Probability
0            0.001
1            0.010
2            0.044
3            0.117
4            0.205
5            0.246
6            0.205
7            0.117
8            0.044
9            0.010
10           0.001

Does this match our expectations? The table indicates that getting 5 heads has the highest likelihood of all 11 possible events. Even more importantly, the probability of getting between 4 and 6 heads in 10 tosses is \( 0.205 + 0.246 + 0.205 = 0.656 \). Getting very few or very many successes is very unlikely. This data is displayed in the histogram below:

[Histogram: Tossing X Heads in 10 Tosses. Horizontal axis: Successes; vertical axis: Probability.]

This further validates our argument above. Additionally, note that the probabilities of all events sum to 1. This is necessary and important in describing the distribution.

Sum of Success Probabilities in a Binomial Experiment
With \( n \) trials in a binomial experiment, the probabilities of 0 up to \( n \) successes must together constitute the sample space and hence sum to 1.

That is,

\[ P(X = 0) + P(X = 1) + P(X = 2) + \cdots + P(X = n) = 1 \]
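As a quick check on the coin-toss table, the sketch below rebuilds the whole distribution for 10 tosses of a fair coin, confirms that the probabilities add to 1, and adds up the middle values. The names `distribution`, `total`, and `middle` are illustrative only.

```python
from math import comb

n, p = 10, 0.5   # 10 tosses of a fair coin

# Probability of each possible number of heads, 0 through 10.
distribution = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

for x, prob in distribution.items():
    print(f"{x:>2}  {prob:.3f}")

total = sum(distribution.values())                 # all probabilities together
middle = sum(distribution[x] for x in (4, 5, 6))   # P(4 <= X <= 6)
print(round(total, 10), round(middle, 3))
```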

Example 2: A fair, 6-sided die is rolled 8 times. The goal is to roll a 1 or a 2 four times during the experiment.

a) Is this a binomial experiment?
b) In how many different ways can this event occur?
c) What is the probability of this event?

SOLUTION:
a) A success is classified as rolling a 1 or a 2. A failure is classified as rolling a 3, 4, 5, or 6. Thus, \( p = 2/6 = 1/3 \). There are \( n = 8 \) trials and the probability of a success is always \( 1/3 \), since the 8 outcomes are independent. Thus, this is indeed a binomial experiment.

b) It is possible to have a success occur in \( \binom{8}{4} = 70 \) different ways.

c) Let \( X \) be the number of successes possible. Then \( X \in \{0, 1, 2, \ldots, 8\} \).

\[ P(X = 4) = \binom{8}{4} \left(\tfrac{1}{3}\right)^{4} \left(1 - \tfrac{1}{3}\right)^{8 - 4} = \binom{8}{4} \left(\tfrac{1}{3}\right)^{4} \left(\tfrac{2}{3}\right)^{4} \approx 0.171 \]

There is about a 17% chance of getting a 1 or 2 on four out of 8 die rolls.
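Because the success probability in Example 2 is a fraction, the calculation can also be carried out exactly. The sketch below uses Python's `fractions.Fraction` so that no rounding occurs until the final step; the variable names are our own.

```python
from fractions import Fraction
from math import comb

n = 8                # die rolls
p = Fraction(2, 6)   # success: rolling a 1 or a 2
x = 4                # desired number of successes

ways = comb(n, x)                                # number of arrangements
probability = ways * p**x * (1 - p) ** (n - x)   # exact binomial probability

print(ways)
print(probability)          # exact fraction
print(float(probability))   # decimal form
```

Working with exact fractions is a design choice for illustration; ordinary decimals give the same answer to the precision used in this book.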

A question that follows from Example 2 is: what does the distribution look like? Let's develop the distribution in tabular form first. To do this, we calculate binomial probabilities for each of the 9 possible outcomes (anywhere between 0 and 8 successes possible).

Successes    Probability
0            0.039
1            0.156
2            0.273
3            0.273
4            0.171
5            0.068
6            0.017
7            0.002
8            0.000

We see clearly that the number of successes with the highest probability is 2 or 3. The histogram follows:

[Histogram: Rolling a 1 or 2 in 8 Die Rolls. Horizontal axis: Successes; vertical axis: Probability.]

Notice that this distribution is not symmetric. It is said to be skewed to the right, since the distribution has its probabilities heavily concentrated towards the left and so has a tail to the right (hence the name).

Distribution Types
There are three types of single-peaked (called unimodal) distributions: skewed left, symmetric, and skewed right.

1.1.3 Expected Value

Expected Value of a Binomial Random Variable
It can be shown that the expected value of \( X \), or the average number of successes we expect to see, given that \( X \) is a binomial random variable with \( n \) trials and success probability \( p \), is:

\[ E[X] = np \]
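The shortcut for the expected value can be compared against the general definition of an expected value, the sum of each value times its probability. The following sketch does that comparison for the two experiments used earlier in this section; the function name `expected_value` is illustrative.

```python
from math import comb

def expected_value(n: int, p: float) -> float:
    """Expected number of successes, computed from the definition
    (sum of x * P(X = x)) rather than from the shortcut n * p."""
    return sum(x * comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1))

# Definition versus shortcut for the coin and die experiments
print(expected_value(10, 0.5), 10 * 0.5)
print(expected_value(8, 1 / 3), 8 * (1 / 3))
```

Both columns agree up to floating-point rounding, which is a numerical check of the shortcut, not a proof of it.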

Example 3: Pristine Air Conditioning uses a digital phonebook to call homeowners in a large city regarding a $55.99 A/C maintenance special. In an hour, a telemarketer can make about 10 calls. If the probability that a randomly called homeowner signs up for the maintenance special is 0.40,

a. What is the probability that a telemarketer gets at least 80% of his hourly customers to sign up?
b. Represent this probability in a histogram.
c. Find and explain the expected value of the random variable.

SOLUTION:
a) We first need to determine whether or not this is a binomial experiment. The probability of success is 0.40 on every one of 10 trials. Strictly speaking, removing one potential customer from the pool reduces the size of the callable population, but we assume the population is large enough that this does not significantly change the probability of success. We conclude that this is a binomial experiment. Thus, let \( X \) = the number of called homeowners that accept the offer. We want to know the probability of getting business from 8, 9, or all 10 of the called individuals. We want:

\[ P(X \geq 8) = P(X = 8) + P(X = 9) + P(X = 10) \]

because each of these accounts for disjoint pieces of the sample space. With \( n = 10 \) and \( p = 0.40 \), we have:

\[ P(X \geq 8) = \binom{10}{8} (0.40)^{8} (0.60)^{2} + \binom{10}{9} (0.40)^{9} (0.60)^{1} + \binom{10}{10} (0.40)^{10} (0.60)^{0} \approx 0.0106 + 0.0016 + 0.0001 = 0.0123 \]

Thus, there is only about a 1.23% chance that the A/C company gets the business of 80% or more of the homeowners called.

b) The histogram is below. The probability we are looking at is the sum of probabilities after 7 successes:

c) The expected value is \( E[X] = 10(0.40) = 4 \). Thus, we expect that each hour 4 out of 10 homeowners accept the maintenance offer.
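The “at least” calculation in part (a) of Example 3 is a sum of several binomial probabilities, which is tedious by hand but short in code. The sketch below is one way to do it; the helper name `prob_at_least` is our own.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for a binomial random variable, found by adding the
    disjoint probabilities P(X = k), P(X = k + 1), ..., P(X = n)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

# At least 8 sign-ups out of 10 calls, each succeeding with probability 0.40
print(round(prob_at_least(8, 10, 0.40), 4))
```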

Homework Problems – 4.1

1. Determine whether or not each of the following experiments represents a binomial experiment. (Video Solution)
a. A die is rolled 20 times and the number of 6's is counted.
b. A die is rolled until ten 6's show up.
c. In a stream with 1,500 fish, 700 are Rainbow Trout. A total of 20 fish are caught and the number of Rainbow Trout is counted.
d. About 10% of the U.S. population is suspected to have a form of bacteria. A sample of 100 people is drawn from the population and the number of people with the strain of bacteria is counted.
e. A brand of LED light bulb has a 0.5% chance of going out prior to the advertised life of 30,000 hours. In the testing phase, 850 bulbs are sampled for quality assurance. The number of bulbs that don't die prior to the 30,000 hour life is counted.

2. Suppose the outcome of random variable \( X \) is conducted with trials each with independent probability of success, . (Video Solution)
a. Is this a binomial experiment?
b. What is the probability that
c. What is the probability that
d. What is the probability that
e. What is the probability that
f. What is the probability that
g. What is \( E[X] \)? Does it coincide with the value of \( X \) that has the highest resulting probability?

3. In preparing for a New Year's Eve celebration, police look at past records for arrests due to driving under the influence (DUI). In the U.S., 10.5% of arrests made are for DUI (SOURCE: U.S. Statistical Abstract, Table 324). If it is expected that each police officer makes 10 arrests, what is the probability that all arrests result in DUIs? (Video Solution)

4. Pancreatic cancer is a vicious killer. The 5-year survival rate between 2001 and 2007 was only 5.9%, meaning that the majority of people with pancreatic cancer die within 5 years of contracting the cancer. In a group of 25 patients, 5 survive beyond 5 years. How likely is such an event? Assume that the survival of one person is independent of another person. (SOURCE: U.S. Statistical Abstract, Table 182). (Video Solution)

5. A new herbal drink blend is being compared to an older blend via a blind taste-test comparison. Four judges will taste each of the two drinks and will state their preference. It is anticipated that both blends are equally impressive. (Video Solution)
a. Find the probability distribution for the number of judges that vote in favor of the new blend.
b. Construct a probability histogram.
c. What is the probability that at least two of the judges prefer the new blend?
d. What is the expected value of this distribution and what is its real-world meaning?

6. Goranson and Hall (1980) explain that the probability of detecting a crack in an airplane wing is the product of \( p_1 \), the probability of inspecting a plane with a wing crack; \( p_2 \), the probability of inspecting the detail in which the crack is located; and \( p_3 \), the probability of detecting the damage. (Problem Source: Mathematical Statistics with Applications, 6th Ed., Wackerly, et al.) (Video Solution)
a. What assumptions justify the multiplication of these probabilities?
b. Suppose and for a certain fleet of planes. If three planes are inspected from this fleet, find the probability that a wing crack will be detected on at least one of them.
c. Find the probability distribution for the number of planes in this fleet with detected wing cracks.
d. Construct a probability histogram.
e. What is the expected value of this distribution and what is its real-world meaning?

Chapter 5 Continuous Probability Distributions

Up until this point, we have only considered distributions that have discrete values: non-negative integers. There are many variables, however, that are continuous in nature. In fact, almost every variable you studied in algebra and calculus was continuous! Take, for example, heights of NBA basketball players, hourly wage, response time of a database server, temperature, depth of a lake, the value of a share of Intel stock, and the lifespan of a car engine, to name just a very few. These are all variables that can take on infinitely many values, even within a limited range. For example, the response time of a database could be anywhere between 0 seconds and 1 second. It could be 0.01 seconds, 0.00001 seconds, or 0.98727495 seconds.

5.1 The Ideas Behind the Continuous Distribution

5.1.1 Conceptual Approach to Continuous Distributions

Think back to a discrete distribution. The probability of a particular value was found by observing the height of the relative frequency bar. While relative frequency represents the percentage of observations found to have the value specified, it can also be thought of as a probability, if we feel that it accurately models predictions that we might use it for. Consider the example below showing the number of children in a classroom of 30 that are likely to have the flu.

[Bar chart: Number of Children with Flu in a Class. Horizontal axis: Number of Children w/Flu (0 to 5); vertical axis: Probability (Relative Frequency).]

For instance, we see that the probability that any 2 children in a classroom have the flu is 0.2. Let's call this random variable

\( X \) = # of children in a classroom of 30 that have the flu.

Then, we will write the probability that any 2 children have the flu as:

\[ P(X = 2) \]

This reads, “the probability that the number of children that have the flu is 2.” The output of this statement is:

\[ P(X = 2) = 0.2 \]

What would it mean to ask: What is \( P(X \leq 2) \)?

This is asking us to find the probability that 2 or fewer children have the flu. In other words, what is the probability that 0, 1, or 2 children have the flu? To answer this, we simply add the bar heights corresponding to \( X = 0 \), \( X = 1 \), and \( X = 2 \):

\[ P(X \leq 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.74 \]

Thus, there is a 74% chance that 2 or fewer children in a class of 30 children have the flu. With continuous distributions, we cannot simply read the “height of the bar!” For instance, consider the following continuous probability distribution that shows the likelihood of various wait times in line at a fast-food restaurant:

[Graph: Time Spent Waiting in Line. Horizontal axis: Minutes (0 to 5); vertical axis: Probability.]

In this case, \( X \) = minutes spent waiting in line is a continuous random variable. The reason is that a person doesn't wait a whole number of minutes! It is perfectly okay for a person to wait 1.42 minutes, for example. In this example, suppose we wish to find \( P(X = 2.5) \), that is, the probability that the wait time is 2-and-a-half minutes. At first glance, we might simply decide to locate 2.5 minutes and assess the probability output. We would find:

\[ P(X = 2.5) = 0.2 \]

If this were the case, wouldn't it be the case that all wait times have a probability of 0.2? Based on the graph, of course. This, however, would be a logical pitfall: if there are infinitely many different wait times between 0 and 5 minutes, then the sum of all probabilities would be a sum of infinitely many 0.2's, which is far more than 1. In other words, it is only possible for the wait times to have individual probabilities of 0.2 if the times were discrete. When we deal with continuous random variables, we should actually consider the vertical axis to be density instead of probability. In and of itself, density is not a meaningful value; however, in conjunction with what we will mention next, it will prove to be useful. Without going into too much detail, a density function is designed in such a way that the area under the function is 1, or 100%. Let's reconsider the above graph:

[Graph: Time Spent Waiting in Line. Horizontal axis: Minutes (0 to 5); vertical axis: Density.]

We notice that the density is constant at 0.2 for wait times between 0 and 5 minutes. The region underneath the blue line is rectangular.

To find the area of a rectangle, we must simply take

\[ \text{base} \times \text{height} = 5 \times 0.2 = 1 \]

And so we are able to confirm that the total area of 1, or 100%, represents all possible wait times this particular store has experienced.

As you might guess, if we wish to find the probability of a range of values, we would simply find the area under the density function between those two values of time.

One question does remain, however: what is the probability that the wait time is exactly 2.5 minutes? The answer might not come as too much of a surprise: the probability is 0! The probability of a single value in a continuous distribution is 0, since there are infinitely many possible values. Thus, 2.5 represents 1 of infinitely many values. Loosely speaking, take \( 1/\infty \) and you get 0! We can only find the probability of a non-zero range of values for a continuous random variable!

Continuous Random Variables
A continuous random variable is a random variable that has infinitely many possible values within a range of real numbers. As a result, the probability that a continuous random variable takes on any one specific value is 0.

Probability Density Function (PDF)
The PDF of a continuous random variable is a non-negative function such that the total area between the function and the horizontal axis is 1. The function's input values are the values of the random variable, while the output values are densities. Densities are individually meaningless values designed so that the total area equals 1.

Reconsider the above wait-times example:

[Graph: Time Spent Waiting in Line. Horizontal axis: Minutes (0 to 5); vertical axis: Density.]

Suppose we wish to find \( P(2.5 \leq X \leq 3.5) \), that is, the probability that the waiting time is between 2.5 and 3.5 minutes. To find this, we simply find the area under the PDF between 2.5 and 3.5 minutes.

The area of the rectangular region is:

\[ (3.5 - 2.5)(0.2) = 0.2 \]

Thus,

\[ P(2.5 \leq X \leq 3.5) = 0.2 \]

We can expect to wait between 2.5 and 3.5 minutes with a 20% chance. Thus, on approximately one in five visits, our wait time will be somewhere within this interval. Similarly, suppose we wish to know:

\[ P(0.3 \leq X \leq 4.4) \]

This is the probability that the wait time is between 0.3 and 4.4 minutes.

The area of this region is:

\[ (4.4 - 0.3)(0.2) = 0.82 \]

Thus, there is an 82% chance that the wait time is between 0.3 and 4.4 minutes.
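The “area of a rectangle” idea translates directly into a few lines of code. The sketch below assumes the wait-time model used in this section (a constant density between 0 and 5 minutes); the function name `uniform_probability` and its default limits are our own choices.

```python
def uniform_probability(lo: float, hi: float, a: float = 0.0, b: float = 5.0) -> float:
    """P(lo <= X <= hi) for X uniformly distributed on [a, b],
    computed as the area of a rectangle: width * density."""
    density = 1 / (b - a)                   # constant height of the PDF
    left, right = max(lo, a), min(hi, b)    # ignore any part outside [a, b]
    return max(right - left, 0.0) * density

print(round(uniform_probability(2.5, 3.5), 2))   # wait between 2.5 and 3.5 minutes
print(round(uniform_probability(0.3, 4.4), 2))   # wait between 0.3 and 4.4 minutes
print(uniform_probability(2.5, 2.5))             # a single point has zero width
```

The last line illustrates why a single exact value has probability 0: a rectangle with no width has no area.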

5.1.2 Uniform Distribution

Continuous Uniform Distribution
When the PDF of a random variable is a constant, we call this a uniform distribution. That is, values of the random variable are uniformly distributed. The PDF of a random variable, \( X \), whose values are in the interval \( [a, b] \) is:

\[ f(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases} \]

The expected value of this random variable is:

\[ E[X] = \frac{a + b}{2} \]

The variance of this random variable is:

\[ Var(X) = \frac{(b - a)^2}{12} \]

Resulting in a standard deviation of:

\[ \sigma = \sqrt{\frac{(b - a)^2}{12}} \]

Example 1: The amount of revenue that a farmers market generates on a given Saturday is uniformly distributed between $5,000 and $22,000.

a. Find the PDF for this random variable.
b. Find the probability that between $6,000 and $8,000 is generated.
c. Find the expected value of this random variable and explain its real-world meaning.
d. Find the standard deviation of this random variable and explain its real-world meaning.

SOLUTION:
a. The lower limit is \( a = 5000 \) and the upper limit is \( b = 22000 \). Thus,

\[ f(x) = \frac{1}{22000 - 5000} = \frac{1}{17000} \approx 0.0000588 \]

This constant function is only valid for values between 5000 and 22000. It is valued as 0 everywhere else.

[Graph: Revenue PDF. Horizontal axis: Revenue ($), from 5000 to 22000; vertical axis: Density.]

b. We want \( P(6000 \leq X \leq 8000) \). The probability will be the length times the width. We get:

\[ P(6000 \leq X \leq 8000) = (8000 - 6000) \left( \frac{1}{17000} \right) \approx 0.118 \]

There is about a 12% chance that revenue earned will fall between $6,000 and $8,000.

c. The expected value will be:

\[ E[X] = \frac{5000 + 22000}{2} = 13500 \]

This is a simple average. Thus, on average, the farmers market will make $13,500 on a given Saturday.

d. The standard deviation will be:

\[ \sigma = \sqrt{\frac{(22000 - 5000)^2}{12}} \approx 4907 \]

On average, revenue will vary by about $4,907 less or more than the mean.
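The farmers market calculations can be reproduced in a few lines. This is only a sketch under the example's assumption that revenue is uniform between $5,000 and $22,000; the variable names are illustrative.

```python
from math import sqrt

a, b = 5_000, 22_000   # lower and upper revenue limits, in dollars

density = 1 / (b - a)                        # height of the constant PDF
prob_6k_to_8k = (8_000 - 6_000) * density    # rectangle area over [6000, 8000]
mean = (a + b) / 2                           # expected value
std_dev = sqrt((b - a) ** 2 / 12)            # standard deviation

print(round(prob_6k_to_8k, 3), mean, round(std_dev, 2))
```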

5.1.3 Other Distributions

Without going into detail here, continuous random variables have PDFs with area between the function and the horizontal axis equal to 1. Clearly, densities cannot be negative, as it is not possible to have negative probabilities. As an example, a distribution might look like this:

[Graph: a triangular density that peaks above a value of 1. Horizontal axis: Random Variable Values (0 to 2); vertical axis: Density.]

Practically speaking, it appears to be most probable that the random variable will take on a value around 1. It is less likely that the random variable will take on values close to 0 or close to 2. This might be handy in situations where such behavior is desired. Notice that the area is also 1. If you divide the triangle into 2 and use the area of a triangle formula \( \tfrac{1}{2}(\text{base})(\text{height}) \), then the sum of the two triangular areas is:

\[ \tfrac{1}{2}(1)(1) + \tfrac{1}{2}(1)(1) = 1 \]
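For densities that are not flat, areas can still be approximated numerically by adding up many thin rectangles. The sketch below assumes the triangular density rises in a straight line from 0 to a peak of 1 above the value 1 and falls back to 0 at the value 2; that shape is our reading of the graph, and the function names are our own.

```python
def triangular_density(x: float) -> float:
    """Assumed density: rises linearly from 0 at x = 0 to 1 at x = 1,
    then falls linearly back to 0 at x = 2."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

def area(f, lo: float, hi: float, steps: int = 10_000) -> float:
    """Approximate the area under f between lo and hi by summing
    thin rectangles whose heights are taken at their midpoints."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

print(round(area(triangular_density, 0, 2), 6))       # total area
print(round(area(triangular_density, 0.5, 1.5), 6))   # P(0.5 <= X <= 1.5)
```

The same `area` helper works for any density, which is the practical appeal of the area interpretation of probability.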

In this next section, we will focus our attention on the most commonly used continuous random variable: the normally distributed random variable.

Homework Problems – 5.1

The first two questions below involve discrete random variables. The aim of these questions is to get you thinking in terms of the probabilities of ranges of values.

1. A pizza shop sells pizzas in four different sizes. The 1000 most recent orders for a single pizza gave the following proportions for the various sizes:

With \( X \) denoting the size of a pizza in a single-pizza order, the given table is an approximation to the population distribution of \( X \).
a. Construct a probability (relative frequency) histogram to represent the approximate distribution of this variable.
b. Approximate ( ).
c. Approximate ( ).
d. Find the expected value of \( X \). What does this value mean?
e. What is the approximate probability that \( X \) is within 2 in. of this expected (mean) value?

2. Airlines sometimes overbook flights. Suppose that for a plane with 100 seats, an airline takes 110 reservations. Define the variable \( X \) as the number of people who actually show up for a sold-out flight. From past experience, the population distribution of \( X \) is given in the following table:

a. What is the probability that the airline can accommodate everyone who shows up for the flight?
b. What is the probability that not all passengers can be accommodated?

3. A particular professor never dismisses class early. Let \( X \) denote the amount of time past the hour (in minutes) that elapses before the professor dismisses class. Suppose that the density curve shown in the following figure is an appropriate model for the probability distribution of \( X \):

[Figure: density curve. Horizontal axis ticks at 2, 4, 6, 8, and 10; vertical axis ticks at 0.05, 0.10, 0.15, and 0.20.]

a. Find the probability density function (PDF) for this random variable.
b. What is the probability that at most 5 minutes elapse before dismissal?
c. Find ( ). Explain what your answer means.
d. Find the expected value of this distribution and explain its real-world meaning.
e. Find the standard deviation of this distribution and explain its real-world meaning.
f. What is the probability that the instructor lets out class within one standard deviation of the average overtime?

4. A delivery service charges a special rate for any package that weighs less than 1 lb. Let \( X \) denote the weight of a randomly selected parcel that qualifies for this special rate. The probability distribution of \( X \) is specified by the following density curve:

[Figure: density curve. Horizontal axis: \( x \); vertical axis: Density.]

Use the fact that the figure can be broken up into the area of a rectangle and the area of a triangle, where area of a triangle = \( \tfrac{1}{2}(\text{base})(\text{height}) \) and the area of a rectangle = \( (\text{base})(\text{height}) \).
a. What is the probability that a randomly selected package of this type weighs at most 0.5 lb.?
b. What is the probability that a randomly selected package of this type weighs between 0.25 lb. and 0.5 lb.?
c. What is the probability that a randomly selected package of this type weighs at least 0.75 lb.?
d. The probability is defined on the interval \( 0 \leq x \leq 1 \). Verify that the area under the curve in this region is 1.

5. A plumbing service is able to respond to off-site emergency calls uniformly between 15 and 45 minutes.
a. Find the PDF for this random variable, \( X \).
b. Find ( )
c. Find ( )
d. Why are both of the above probabilities the same?
e. Find ( ).
f. Find and interpret the real-world meaning of the expected value.
g. Find and interpret the real-world meaning of the standard deviation.
h. What is the probability that the service responds within 1.5 standard deviations of the expected time?


5.2 The Normal Distribution

5.2.1 The Normal Distribution As a Natural Phenomenon

The normal distribution (pictured above), much like the uniform distribution, is a continuous distribution. In fact, this distribution is defined for all real numbers. The curve runs from \( -\infty \) to \( \infty \). However, as you might observe, the most likely values occur close to where the density function peaks. Values that occur in either one of the “tails” are highly unlikely and, as it appears, the density function is very close to the horizontal axis as it extends farther to the left and to the right.

Why do we use this distribution? Much like certain famous constants appear in many natural places, many random variables tend to be normally distributed. That is to say, the bulk of values tend to occur near the mean and median (both of which are located directly in the center of the distribution, since it is perfectly symmetric). For instance, heights of individuals in the United States (roughly) follow a normal distribution: there are many people whose heights are near average. There are fewer extremely short and extremely tall people in the United States. Thus, we would say that the bulk of people are “normal” with respect to their heights.

While certainly not all random variables are normally distributed, many are. Weights, IQ, and new-vehicle gas mileages (to name just a very few) are variables that have been known to follow a normal distribution. As we will later see, any distribution can “become” a normal distribution. This is a beautiful phenomenon that allows us to make some important conclusions (more on this idea in a later section).

As before, the overall area under the normal curve is 1 (50% on either side of the mean/median, as in the image). To find the area, we would need to use some rather unusual shapes in order to apply the same methodology as before. The idea of an integral in calculus would actually allow us to find the area exactly; however, the normal curve is modeled by the following pdf:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
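Although the normal pdf looks intimidating, a computer can evaluate it and approximate the area under it with thin rectangles. The sketch below uses the standard normal density formula, with \( \mu \) as the mean and \( \sigma \) as the standard deviation; the function names and the choice to stop at 8 standard deviations from the mean are our own assumptions.

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def area(mu: float, sigma: float, lo: float, hi: float, steps: int = 20_000) -> float:
    """Approximate the area under the normal curve on [lo, hi] by summing
    thin rectangles whose heights are taken at their midpoints."""
    width = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps)) * width

# The tails beyond 8 standard deviations contribute a negligible amount of area.
print(round(area(0, 1, -8, 8), 6))   # total area under the curve
print(round(area(0, 1, -8, 0), 6))   # area to the left of the mean
```

This brute-force approach is only meant to show that the area really is 1 and splits evenly around the mean; tables and software use faster methods.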

As you can see, this is a difficult function to work with. Historically, tables have been developed with calculated areas, as the calculus was once quite difficult to do. In order to do this, it was often necessary to first convert the desired range of values to \( z \)-scores. Since every normal distribution has a different mean and standard deviation, it would be impossible to create a table for every possible combination. Instead, since each normal distribution is of the same shape, it made sense to create just one table that represented a mean of 0 and a standard deviation of 1. That is, we can think about every distribution in terms of the number of standard deviations each score is from the mean. The mean is 0 standard deviations away from the mean (it is the mean!) and each unit represents 1 standard deviation. We can think about any distribution this way!

Normal Distribution Expected Value and Variance
A normal probability distribution can be modeled by the function

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

where the expected value is \( \mu \), defined as a standard mean,

\[ \mu = \frac{\sum x}{N} \]

and the variance is \( \sigma^2 \), defined as a standard variance,

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]

IMPORTANT NOTE: \( \mu \) and \( \sigma^2 \) represent the population mean and variance. \( N \) represents the population size. Recall that the sample variance has a divisor of \( n - 1 \), so that it is an unbiased estimator of the population variance.

Below is an example of what a typical table would look like. We call this a standard normal table, since it requires that values between which we would like to know areas are “standardized.” This means they are converted to \( z \)-scores prior to using the table:

As we notice, this table only shows positive \( z \)-scores. A similar table exists for negative \( z \)-scores, that is, for values that are less than the mean. The image tells us that each of the entries in the center of the table corresponds to the area to the left of the \( z \)-score we would look up.

1. In an Arizona town, suppose the heights of adult males are normally distributed with a mean of \( \mu \) inches and a variance of \( \sigma^2 \) (so the standard deviation, \( \sigma \), is the square root of this value). What is the probability that a male is shorter than 72 inches (6 feet tall)?

SOLUTION: We wish to find \( P(X < 72) \), where \( X \) is the height of a randomly selected adult male. We wish to know the area under the normal curve to the left of 72 inches.

We first convert the value of 72 to a \( z \)-score:

\[ z = \frac{72 - \mu}{\sigma} \approx 1.14 \]

We round to two decimal places, since the standard normal table can handle up to two decimal places. Any additional decimal places would not make a substantial difference. We locate \( z = 1.14 \) by first locating 1.1 along the rows and 0.04 along the columns (since 1.1 + 0.04 = 1.14).

The value we find is 0.8729. This means that \( P(X < 72) \approx 0.8729 \). There is an 87.29% chance that a randomly selected individual will be less than 72 inches in height.

What if we wanted to know an area to the right, such as \( P(X > 72) \)? The table does not provide these values. However, if we know that \( P(X < 72) = 0.8729 \), then the probability of a height greater than 72 must be the remaining area, \( 1 - 0.8729 = 0.1271 \).
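The table lookup for a \( z \)-score of 1.14 can be reproduced with Python's built-in `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value. This is a sketch for checking table values, not a replacement for understanding them; the variable names are our own.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

z = 1.14                               # z-score for a height of 72 inches
area_left = standard_normal.cdf(z)     # the quantity the table reports
area_right = 1 - area_left             # remaining area under the curve

print(round(area_left, 4), round(area_right, 4))
```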

Similarly, if we wish to find the area between two points, we must get creative. Suppose we wish to know the probability that a height falls between two values. We first need to convert both endpoints to \( z \)-scores:

\( z = 0.57 \) and \( z = 1.00 \)

We can easily find that the probability of a \( z \)-score less than 0.57 is: 0.7157

The probability of a \( z \)-score less than 1.00 is: 0.8413

The area between them is the difference in their areas:

\[ 0.8413 - 0.7157 = 0.1256 \]
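The “difference of two left-hand areas” idea can be written as a short helper. The sketch below assumes the two \( z \)-scores used above, 0.57 and 1.00; the function name `area_between` is illustrative, and small differences from table-based answers are due to rounding in the table.

```python
from statistics import NormalDist

def area_between(z_low: float, z_high: float) -> float:
    """Area under the standard normal curve between two z-scores,
    found as the difference of the two areas to the left."""
    z = NormalDist()   # mean 0, standard deviation 1
    return z.cdf(z_high) - z.cdf(z_low)

print(round(area_between(0.57, 1.00), 4))
```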

As technology progresses, there is much less need for by-hand computations of the sort above. Instead, let us use the web applet from which the above pdfs came: http://www.rossmanchance.com/applets/NormalCalcs/NormalCalculations.html

As you can see, we enter the mean and standard deviation in the first section. If we would like to plot two functions over one another, we could check the box and enter a second mean and standard deviation.

In the second section, we can check up to two boxes, in the event that we would like to find an area between two points. We can either enter values as \( z \)-scores or as raw data values (\( X \)). To find the probability of a value greater than the one entered, we click the grey box to change the direction of the inequality. The probability of such an event is displayed in the “prob” box. If we have two values entered and both boxes checked, then the “probability between” these two values is displayed. Isn't this much more intuitive and convenient than using tables?

NOTE: One limitation of the above applet is that values rounded to two decimal places require a bit of finagling.
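The same kinds of questions the applet answers (areas below, between, and above raw values, and the value that cuts off a given percentage) can also be explored in Python. In the sketch below, the mean of 100 and standard deviation of 15 are hypothetical values chosen only for illustration; they are not taken from any problem in this book.

```python
from statistics import NormalDist

# Hypothetical mean and standard deviation, for illustration only.
scores = NormalDist(mu=100, sigma=15)

below_115 = scores.cdf(115)                    # P(X < 115)
between = scores.cdf(115) - scores.cdf(85)     # P(85 < X < 115)
above_130 = 1 - scores.cdf(130)                # P(X > 130)
cutoff_20 = scores.inv_cdf(0.20)               # value with 20% of the area to its left

print(round(below_115, 4), round(between, 4), round(above_130, 4), round(cutoff_20, 1))
```

Notice that no conversion to \( z \)-scores is needed here: `NormalDist` works with raw values directly once it knows the mean and standard deviation.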

Homework Problems – 5.2

Use the applet mentioned in this section to complete these exercises. You are not required to use the standard normal table.

1. In the United States, IQs are normally distributed with and .
a. What is the probability that a person has an IQ lower than 130?
b. What is the probability that a person has an IQ between 80 and 110?
c. What is the probability that a person has an IQ between 50 and 70?
d. What is the probability that a person has an IQ above 120?

2. In the UK, birth weights are approximately normally distributed with lbs. and lbs. (SOURCE: http://www.healthknowledge.org.uk).
a. Find and explain the real-world meaning of ( ).
b. Find and explain the real-world meaning of ( ).
c. Find and explain the real-world meaning of ( ).
d. Find and explain the real-world meaning of ( ).
e. What weight is such that 20% of infants weigh less than this amount? (HINT: You can still use the calculator applet.)

3. In recent years, Scholastic Aptitude Test (SAT) scores for all college-bound seniors in the United States were such that points and points (SOURCE: http://www.collegeboard.com).
a. 50% of students scored less than how many points?
b. 50% of students scored more than how many points?
c. In order to be in the top 10% of SAT-takers, what score would one have to achieve?
d. What score do the lowest 10% score below?
e. The middle 50% of students scored between what two values?

4. Sketch a normal distribution with and . Label the mean, \( \pm 1 \) standard deviations, \( \pm 2 \) standard deviations, and \( \pm 3 \) standard deviations.
a. Determine the probability that an observation falls within each of these standard deviation ranges.
b. The Empirical Rule describes the probability of scores within 1, 2, and 3 standard deviations of the mean. Do a web search on this topic and compare it to your answer in the above part. Are the results the same?

5. Suppose a distribution is such that and .
a. What would happen to the distribution if \( \mu \) was changed to 60?
b. What would happen to the distribution if \( \sigma \) was changed to 10? There are two effects to describe. Discuss why it makes practical sense that these two things should happen to the curve.
c. What would happen to the distribution if \( \sigma \) was changed to 2? There are two effects to describe. Discuss why it makes practical sense that these two things should happen to the curve.
d. Describe the effects, in general, of \( \mu \) and \( \sigma \) on the shape and location of a normal distribution.


Chapter 6 Sampling Distributions and Estimation

When it is only our dataset that is of interest, we use descriptive statistics. This is precisely the trouble we have been up to so far! Oftentimes, however, we cannot collect all elements in the population. Take, for example, a poll to gauge Americans' opinion of a candidate in office. Certainly, you cannot survey all voting-age adults. This is easily resolved with a manageable random sample, but is further complicated by the following idea: sampling variability!

We will work to answer the following question: How do we estimate true population parameters using a random sample, all the while taking into account the fact that our sample statistic varies from sample to sample?

This is the purpose of inferential statistics and is a very important aspect of understanding the structure of an underlying population. With many advances in statistics, it is possible to make precise claims about our population.

6.1 Sampling Distribution for $\bar{x}$

6.1.1 What is a Sampling Distribution?

The cold, hard truth is that, when working with statistical inference, we likely have no idea what the underlying probability distribution for the population looks like. If we did, then we wouldn't have to draw a random sample and would be nearly done with this course. Since we don't, we can't in good conscience assume that the distribution is normal. So, why spend time studying such a distribution? We will soon experience why.

Let's start with an example that is concrete. Suppose we roll a die. Without too much effort, we can produce the probability distribution for the population of all possible outcomes. Here it is:

[Figure: Probability Distribution for Single Die Roll. Horizontal axis: Die Value (1 to 6). Vertical axis: Probability (0.00 to 0.18).]

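In words, the probability of getting any one face value on a die roll is 1/6, or about 0.17. The distribution is uniform. If we found the expected value (the average), we would get:

$$E[X] = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = 3.5$$

(NOTE: This is the same as $\frac{1+2+3+4+5+6}{6}$, since each event is equally likely)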

The variance of this population requires us to use the population variance formula (remember, division by $n - 1$ occurs if we are dealing with a sample, so that we have an unbiased estimate for the population variance). That is:

$$Var[X] = \frac{\sum (x - \mu)^2}{N}$$

Using Excel, we enter the values 1, 2, 3, 4, 5, 6 in cells A2:A7 and apply the formula =VAR.P(A2:A7), which gives 2.916666667.
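The Excel result can be cross-checked outside of a spreadsheet. The following is a minimal Python sketch using only the standard library; the variable names are chosen here for illustration and do not come from the text. `pvariance` divides by the population size, which is the same convention as VAR.P.

```python
from statistics import fmean, pvariance, pstdev

faces = [1, 2, 3, 4, 5, 6]   # the whole population of die values

mu = fmean(faces)            # expected value of a single roll
var = pvariance(faces)       # population variance (divides by N, like VAR.P)
sigma = pstdev(faces)        # population standard deviation

print(mu, round(var, 9), round(sigma, 3))
```

The printed variance should agree with the spreadsheet value, and its square root is the population standard deviation of a single roll.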

Thus, the standard deviation would be $\sqrt{2.9167} \approx 1.708$, meaning that, on average, we would expect the die value to deviate by 1.708, or nearly 2 units, from the average (1.5 to 5.5, which is pretty much 1 to 6). Thus, we have that:

$$\mu = 3.5, \qquad \sigma \approx 1.708$$

In reality, keep in mind that we would often not know much about our population. We get the luxury of studying something we can fully explain. This is all in an effort to better understand sampling distributions.

Suppose we conducted an experiment of rolling the die 10 times. For one random sequence, we might obtain the following result:

4  3  4  3  1
6  4  1  4  2

Not surprisingly, we get a fairly even spread of values from 1 to 6. If we are to compute the average, we would obtain 3.2. That is, if all rolls had come up as the same number while producing the same total, each roll would have been 3.2.

Suppose we asked 19 other people to roll a die 10 times and to then report back to us the mean. Here is what we might find (based on a computer simulation of rolls):

20 Means

3.1  3.3  2.4  3.5  2.7
2.9  2.9  3.6  3.0  4.7
3.6  3.2  3.9  2.8  3.2
3.3  3.9  3.3  3.5  3.1

First off, we notice there is sampling variability. Not every person obtained the same average outcome from his or her 10 tosses. This is expected, since the process is a random one. The distribution of these means is called a sampling distribution.

Sampling Distribution
The distribution of sample statistics (such as $\bar{x}$) computed from repeated sampling is called a sampling distribution.
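The experiment of 20 people each rolling a die 10 times can be imitated on a computer. The following is a minimal Python sketch, not the simulation used for this book: the function name `mean_of_rolls` and the seed are made up for illustration, and the means it prints will differ from the 20 listed above.

```python
import random
from statistics import fmean

def mean_of_rolls(n_rolls, rng):
    """Roll a fair six-sided die n_rolls times and return the average roll."""
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return fmean(rolls)

rng = random.Random(2012)  # arbitrary seed, used only to make the run repeatable

# 20 "people", each reporting the mean of 10 rolls
twenty_means = [mean_of_rolls(10, rng) for _ in range(20)]
print([round(m, 1) for m in twenty_means])
```

Changing the seed gives a different set of 20 means, which is exactly the sampling variability described above.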

6.1.2 The Central Limit Theorem

We do notice that the means tend to gravitate towards 3.5. Some, as expected, deviate from this value. Let us now consider a histogram for this sampling distribution of sample means:

[Figure: Sampling Distribution of x-bar. Histogram of the 20 sample means, with bins of width 0.25 running from 2.4 to above 5.15 and frequencies from 0 to 6.]

This is quite interesting: we have obtained a distribution (of means) that appears somewhat bell-shaped. Suppose now that we had a total of 1000 people roll a die 10 times each and then compute the sample mean. Here is what a simulation of this process would look like:

[Figure: Sampling Distribution of x-bar. Histogram of 1,000 sample means of 10 rolls each, with bins of width 0.1 running from 1.7 to above 5.2 and frequencies from 0 to 100.]

Wow! Our distribution of means for 1000 individuals, each running an experiment of 10 rolls, produces something remarkably like a normal distribution. Additionally, it appears that the mean of this distribution is around 3.5!

Let's try this again, but now, let's say that 1000 individuals each roll a die 20 times, and each individual computes a sample mean. This simulated event would produce the following distribution of die-roll averages:

[Figure: Sampling Distribution of x-bar. Histogram of 1,000 sample means of 20 rolls each, with bins of width 0.1 running from 2.2 to above 4.7 and frequencies from 0 to 120.]

The distribution looks a bit more normal. Upon closer inspection, we also see that the variability of these averages is smaller. That is:

Approximate Range for Means of 10 Tosses: 2.1 to 5.2
Approximate Range for Means of 20 Tosses: 2.5 to 4.6

We notice that increasing the sample size ($n$) has decreased the sampling distribution's variability. In fact, the standard deviation for the distribution of means computed from 10 and 20 tosses is about 0.52 and 0.38, respectively.

Let's do one more experiment. Let's say that 1000 individuals each roll a die 30 times, and each individual computes the mean of his/her rolls. The sampling distribution of means would look like this (based on simulation):

[Figure: Sampling Distribution of x-bar. Histogram of 1,000 sample means of 30 rolls each, with bins of width 0.1 running from 2.4 to above 4.9 and frequencies from 0 to 140.]

Again, we notice the bell-curved shape and the decreased range of means (about 2.6 to 4.4)! Let's summarize:

Distribution Type                                  Distribution Mean   Distribution Standard Deviation
Original Die Values (UNIFORM)                      3.5                 1.7
Sampling Distribution of 10-Roll Means (NORMAL)    3.5                 0.52
Sampling Distribution of 20-Roll Means (NORMAL)    3.5                 0.38
Sampling Distribution of 30-Roll Means (NORMAL)    3.5                 0.32

We can very easily see that the expected value of the sampling distribution is the same as $\mu$, the expected value of the population distribution. That is:

$$E[\bar{x}] = \mu$$

But what is the relationship between the standard deviations of the means and the standard deviation of the population of die roll values?! This is not so clear. Statisticians, after much research, found that the standard deviation of each sampling distribution is related to the sample size in the following way:

$$SD[\bar{x}] = \frac{\sigma}{\sqrt{n}}$$

For example, for our samples of size 10,

$$SD[\bar{x}] = \frac{1.708}{\sqrt{10}} \approx 0.54$$

That is very close to the 0.52 we obtained! Similarly, for our samples of size 20,

$$SD[\bar{x}] = \frac{1.708}{\sqrt{20}} \approx 0.38$$

This one happens to be fairly spot-on! And finally, for our samples of size 30,

$$SD[\bar{x}] = \frac{1.708}{\sqrt{30}} \approx 0.31$$

This is again very close to our obtained 0.32! The reason for these small differences is simply randomness, and estimates can be improved more (if desired) by increasing the number of "individuals rolling the die."

What we have observed here is formally known as the Central Limit Theorem.

Central Limit Theorem
Regardless of the distribution of a random variable, $X$, if we take repeated random samples of size $n \ge 30$ from this distribution and compute the mean, $\bar{x}$, for each sample, then the following will hold:
1.) The distribution of $\bar{x}$ will be approximately normal
2.) $E[\bar{x}] = \mu$
3.) $SD[\bar{x}] = \frac{\sigma}{\sqrt{n}}$

(NOTE: A sample size of at least 30 is a rule-of-thumb and can vary slightly depending on the severity of skews and abnormalities in the distribution. For even severely skewed distributions, the approximate shape is typically normal.)
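The comparison between simulated and theoretical standard deviations can be repeated with a short simulation. What follows is a rough Python sketch under stated assumptions: the helper name `simulate_means`, the seed, and the choice of 1,000 "people" per sample size are illustrative, and the simulated values will not match the book's figures digit for digit.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

SIGMA = pstdev([1, 2, 3, 4, 5, 6])  # population SD of one die roll, about 1.708

def simulate_means(n_rolls, n_people, rng):
    """Each of n_people rolls a fair die n_rolls times and reports the mean."""
    return [fmean([rng.randint(1, 6) for _ in range(n_rolls)])
            for _ in range(n_people)]

rng = random.Random(0)  # arbitrary seed, for repeatability only
results = {}
for n in (10, 20, 30):
    means = simulate_means(n, 1000, rng)
    # (mean of the means, SD of the means, theoretical sigma / sqrt(n))
    results[n] = (fmean(means), pstdev(means), SIGMA / sqrt(n))
    print(n, [round(value, 2) for value in results[n]])
```

For each sample size, the first number should sit near 3.5 and the second should sit near the third, which is what parts 2.) and 3.) of the theorem predict.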

6.1.3 Why the Central Limit Theorem?


The Central Limit Theorem (CLT) has some very powerful, but subtle results. First of all, we do not need to understand the shape of the underlying distribution from which we are sampling. This is an amazing result in and of itself, since we usually have little to no information about the population itself (again, if we did, we wouldn't be wasting our time with any of this!). Secondly, since the resulting sampling distribution is approximately normally distributed, we can proceed to calculate probabilities using the normal distribution. This is also great, since we already have the background in that process!

Example 1: After experimentation, researchers believe that the mean lifespan of a strain of bacteria is $\mu$ days with a standard deviation of $\sigma$ days. Due to the complexity of the bacteria, the shape of the distribution of bacteria lifespans is unknown. A sample of 60 bacteria strains is collected.
a. Does the CLT apply here?
b. Calculate the probability that the sample mean lifespan, $\bar{x}$, is less than 3 days.

SOLUTION:
a. Since the sample size is 60, we should be safe in assuming that the sampling distribution of all means is normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{60}}$.
b. We want $P(\bar{x} < 3)$, which we find using our probability calculator. Given the very small level of variability in the sampling distribution of lifespan means, we would consider the probability of observing an average smaller than 3 to be effectively 0.
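For readers who prefer to see the standardization written out, the probability in part b can be expressed as a z-score calculation. This is only a sketch of that one step, with $\mu$ and $\sigma$ standing for the example's stated population mean and standard deviation and $Z$ denoting a standard normal variable:

$$P(\bar{x} < 3) = P\left(Z < \frac{3 - \mu}{\sigma/\sqrt{60}}\right)$$

Because the denominator shrinks as the sample size grows, even a modest gap between 3 and $\mu$ becomes many standard deviations of $\bar{x}$, which is why the probability can end up so close to 0.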

6.1.4 Limitations of the CLT


One major oversight in our excitement with this idea is the notion that we would actually know the true population mean, $\mu$, and the true population standard deviation, $\sigma$. If we have limited information about our population, then we certainly would not know these values. In the next parts of this chapter, we will learn how to use our sample to make these predictions about the population. Though similar in conceptual nature, it is not as straightforward as replacing $\mu$ with $\bar{x}$ and $\sigma$ with $s$.

Homework Problems – 6.1

1. In your own words, what does the Central Limit Theorem tell us?
2. In your own words, why is the Central Limit Theorem a very powerful practical result?
3. A sample of size 36 is taken from a population distribution of unknown shape, though the mean is believed to be 100 with a standard deviation of 18. What is the probability that the sample mean is:
   a. Greater than 102?
   b. Less than 98?
   c. Between 95 and 105?
   d. Between what two values will the middle 90% of means be?
4. A stained glass company produces panes of glass with a mean thickness of 0.42 inches and a standard deviation of 0.04 inches, if produced properly. Suppose a random sample of windows reveals a sample mean of 0.43.
   a. What is the probability of this average, or a larger average?
   b. Given the probability you have computed, what can be said about recent production standards?
5. Promote Marketing has a research team that researches new marketing tactics to propose to potential clients. A group of 40 clients has been invited to a conference to be put on by the marketing firm. The research team usually generates an average of $\mu$ in revenues for each member of the team, with standard deviation $\sigma$.
   a. What will be the shape of the distribution of $\bar{x}$? How do you know?
   b. What is the probability that average sales will exceed $420,000 for this particular event?
   c. How would your answer change if 100 clients were to show up?
   d. If the team (300 people) has an average revenue that is in the 90th percentile of revenues, they will earn 4 days of paid vacation. What average sales would be required for this?
6. A computer simulation reveals that a distribution of average incomes in a sample of 500 has a standard deviation of $130. What is the standard deviation for the population of all incomes? Interpret the result you get in real-world terms.
7. Use the Excel Sampling Distribution Applet to address this problem. In a population, it is found that 30% of homes have 5 rooms, 40% have 4 rooms, and 30% have 3 rooms. You can set this up in our applet by having a "die" with 10 values: three 5's, four 4's, and three 3's.
   a. What is the average number of rooms a home has in this population? What is the standard deviation in the number of rooms in this population?
   b. Now, suppose you take a sample of size 30 from this population. What shape will the distribution have and how do you know?
   c. Take 1,000 random samples each of size $n$ and compute the 1,000 sample means. According to the applet, what is the average of the average rooms in the sample? What is the standard deviation in the average number of rooms in a house? Compare these two results to what the Central Limit Theorem says we should come up with. That is, find $E[\bar{x}]$ and $SD[\bar{x}]$.
   d. Take 1,000 random samples each of size $n$ and compute the 1,000 sample means. According to the applet, what is the average of the average rooms in the sample? What is the standard deviation in the average number of rooms in a house? Compare these two results to what the Central Limit Theorem says we should come up with. That is, find $E[\bar{x}]$ and $SD[\bar{x}]$.
   e. Take 1,000 random samples each of size $n$ and compute the 1,000 sample means. According to the applet, what is the average of the average rooms in the sample? What is the standard deviation in the average number of rooms in a house? Compare these two results to what the Central Limit Theorem says we should come up with. That is, find $E[\bar{x}]$ and $SD[\bar{x}]$.
   f. Why do the values in the population have the highest standard deviation when compared with the distribution of means in the last three parts?
   g. What is the probability that, in a sample of 100 homes, the average number of rooms is greater than 5?
   h. Explain in practical terms why the standard deviation of any $\bar{x}$ distribution decreases as the sample size increases.

6.2 Confidence Interval for $\bar{x}$

6.2.1 Confidence Interval for $\bar{x}$ Using Sampling Distributions

As discussed previously, our ultimate goal is to make inferences about the population parameter $\mu$. Again, keep in mind that this is the only reason why we are spending time on this! Otherwise, we would have completed our semester early!

When we generate our sampling distribution for $\bar{x}$, we see very vividly that our sample means are subject to sampling variability, depending on which "die values" are "rolled" for each individual sample of size $n$. Thus, we should be very skeptical of concluding that any single $\bar{x}$ is representative of the true population mean. However, if we have many, many "individuals roll the die," we should get a fairly reasonable understanding of a range of values for the true value of $\mu$.

Let's consider an example. Suppose we want to better understand a population of ages of people in a town.

 1   3  29  31  19
 1  19  25  32  20
18  20  29  31  22
22  32  24  31  21
25  20  23  35  20
27  25  29  33  20
30  29  29  30  19
18  32  26  32  22
21  33  27  31  22
 2  40   1  33   9

$\mu = 23.46$, $\sigma = 9.250319$

But, wait! Let's pretend that we actually don't have access to the entire population of values (yes, we clearly see them in the table above, but we normally do not have that luxury). Due to limited time and money, you are only able to sample 30 of these values. After taking a random sample, here is what you have chosen:

32  20  25
31  25  27
31  29  30
35  32  18
19  33  21
20  19  33
22  19  30
21  19  32
20  18  31
20  22  33

$\bar{x} = 25.56667$, $s = 5.870342$

Again, at this point, we would have no way of telling how close we are to the actual mean of 23.46. To get a good estimate of $\mu$, we will come up with a confidence interval. A confidence interval is a range of values such that there is a stated probability (such as 95%) that the true population mean, $\mu$, is between those values. How do we calculate this? Here is our motivation for what is to come.

There are two ways to think about inferential statistics:
1) Use theoretical results and make conclusions using them.
2) Build a sampling distribution for the statistic of choice ($\bar{x}$ or $\hat{p}$) using the Bootstrap Method and make conclusions using this empirical data.

We will draw parallels between the two regularly. Here is the basic idea of Bootstrap Sampling:
1) From the population, take a random sample, preferably of size 30 or greater. The larger the random sample, the more power we have in making inferences about the population.
2) If this is a truly representative sample, then we can think of it as a "mini" population that acts and behaves according to the population as a whole. This is a key ingredient!
3) We cannot use this sample to calculate the corresponding parameter because of sampling variability. However, if this sample behaves like the population, then we can resample from it and get an idea of the overall variability. That is, draw a sample of the same sample size from this "mini" population, but do so with replacement. This is the same idea as rolling a die a fixed number of times: we are sampling with replacement from the population 1, 2, 3, 4, 5, 6. What will this do? It will account for sampling variability, if repeated.
4) Calculate the statistic from this sample and record it.
5) Repeat steps 3) and 4) 1,000 to 10,000 times.

We now have a sampling distribution and can make estimates about the true population parameter. And, guess what this distribution will look like? You guessed it: it will be approximately normal, by the Central Limit Theorem. Below is a diagrammatic representation of steps 1) through 5):

[Diagram: the Population yields one Random Sample of size $n$, which is then resampled with replacement to produce Sample 1, Sample 2, Sample 3, Sample 4, ..., Sample 10,000.]

Some of the assumptions we make are indeed dangerous. For example, do we really have a mini population? If the answer is “no,” then theoretical results are equally worthless since they, too, assume that the sample is representative.
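The bootstrap steps 1) through 5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `bootstrap_means` and `percentile` and the seed are invented here, the data are the 30 sampled ages from the example, and the percentile helper uses linear interpolation between sorted values, so its results need not match the book's simulator exactly.

```python
import random
from statistics import fmean

# The 30 sampled ages from the example (our "mini" population)
ages = [32, 20, 25, 31, 25, 27, 31, 29, 30, 35, 32, 18, 19, 33, 21,
        20, 19, 33, 22, 19, 30, 21, 19, 32, 20, 18, 31, 20, 22, 33]

def bootstrap_means(data, n_resamples, rng):
    """Resample data with replacement (same size as data) and record each mean."""
    n = len(data)
    return [fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    ordered = sorted(values)
    position = (p / 100) * (len(ordered) - 1)
    below = int(position)
    above = min(below + 1, len(ordered) - 1)
    return ordered[below] + (position - below) * (ordered[above] - ordered[below])

rng = random.Random(1)  # arbitrary seed, for repeatability only
means = bootstrap_means(ages, 1000, rng)

# Middle 95% of the bootstrap means
lower, upper = percentile(means, 2.5), percentile(means, 97.5)
print(round(lower, 2), round(upper, 2))
```

Swapping 2.5 and 97.5 for 0.5 and 99.5 gives the middle 99% instead, and raising the number of resamples makes the endpoints less sensitive to the seed.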


Now, back to our example. If we have truly collected a random sample, then we should be able to think about the sample as a small population. If this is a small population, then we should be able to sample from it. We will draw random samples of size $n = 30$ from the small "population," which is also of size $n = 30$. Sounds strange, but we will sample with replacement, so it is possible to resample the same value multiple times. We will draw 1,000 samples of size $n = 30$ from this "population" and, as you might have figured, we will calculate the mean of each and build the sampling distribution for $\bar{x}$.

[Figure: Sampling Distribution of x-bar. Histogram of the 1,000 bootstrap sample means, with bins of width 0.5 running from about 22.27 to above 29.77 and frequencies from 0 to 200.]

As we should expect based on the CLT, the distribution of these 1,000 means is approximately normal. Let's suppose that we want to have an interval within which there is a 95% probability that the true population mean, $\mu$, lies. This is the same as looking for the middle 95% of means!

Thus, we need to find the lower and upper limits for this interval by finding the 2.5th percentile and the 97.5th percentile. In Excel, we can do this by using the percentile() function. We get:

Upper (97.5th percentile): 27.50
Lower (2.5th percentile): 23.60

Thus, we can say that we are 95% confident that the true population mean is between 23.6 years and 27.5 years. In other words, there is a 95% probability that we have “trapped” the population mean between our lower and upper limit. Said one other way, 95% of all sample means, when the variability from sample to sample is taken into account, are between these lower and upper limits. If this is representative of the population, then we should believe that 95% of the time, we will have means between these two values. What if we wanted to be 99% certain? We would need to find lower and upper limits so that there is only 1% in the tails:

Thus, we would like 0.01/2 = 0.005 (or 0.5%) in each of the two tails. To find the lower and upper limits, we would need to find the 0.5th percentile (0.005) and the 99.5th percentile (1 - 0.005 = 0.995). We get:

Upper (99.5th percentile): 28.17
Lower (0.5th percentile): 22.83

Thus, we are 99% confident that the true population mean age, $\mu$, is between 22.83 years and 28.17 years. In other words, there is a 99% probability that the true mean age is between 22.83 and 28.17 years. If we want to be more confident, we need to expand our interval of values!

Note that only one of our confidence intervals (the 99% interval) has captured the true mean of 23.46 within its range. This is very likely, since our confidence percentage is very high. BUT, keep in mind that we never know what the true mean is! Thus, we cannot say that it would have been better to stick with the wider 99% interval. After all, there is a 1% chance we might have made an error.

The level of confidence that we desire depends on the situation and the interval width we are willing to tolerate. More confidence means wider possibilities. In general, we never know whether or not we have captured the true mean in our interval. On the upside, there is a probability associated with it!

As a final note, it is interesting that we actually missed the true mean in our 95% confidence interval, since there is only a 5% chance of error. Keep in mind, however, that this interval was based on simulation. It is based on 1,000 samples, and it may have been better to increase the number of samples.

6.2.2 Confidence Interval for $\bar{x}$ Using Theoretical Results: When $\mu$ and $\sigma$ are Unknown

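In the previous section, we found that the sampling distribution of $\bar{x}$ with $n \ge 30$ is approximately normal with $E[\bar{x}] = \mu$ and $SD[\bar{x}] = \frac{\sigma}{\sqrt{n}}$. As a bit of notation, if a random variable has a normal distribution with a given mean and standard deviation, we would write:

$$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

This reads, "$x$-bar is normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$."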

This, however, assumes that we know something that we probably don't: the population mean and standard deviation! As you might guess, we will use $\bar{x}$ and $\frac{s}{\sqrt{n}}$ to approximate these. This poses a problem: we are introducing more error. In order to account for this, the normal distribution is not appropriate. When using these approximations, we must use the theoretical Student's $t$-distribution. This distribution looks much like the normal distribution, but is determined by sample size, not the mean and standard deviation. Below is a comparison of the $t$-distribution with the standard normal distribution for one choice of sample size $n$.

We see that the standard deviation of the $t$-distribution (in red) is just slightly larger than that of the standard normal (in blue): it is about 1.0339. So, as sample size gets greater, the $t$-distribution begins to look more like a standard normal. BUT, look at the one below where sample size is 10:

The variability is nearly 14% greater. As we mentioned, this distribution's shape relies on the sample size. The relationship is called the degrees of freedom and can be calculated as $df = n - 1$; that is, degrees of freedom is equal to one less than the sample size. So, in our previous example, we had a sample size of 30, so

$$df = 30 - 1 = 29$$
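As a side check on the "nearly 14%" figure, one can use the standard formula for the standard deviation of a $t$-distribution, valid when $df > 2$. This formula is a well-known result that is not stated in the text, so treat it as supplementary:

$$SD[t_{df}] = \sqrt{\frac{df}{df - 2}}, \qquad \text{so for } n = 10,\ df = 9: \quad \sqrt{\frac{9}{7}} \approx 1.134$$

A standard normal has standard deviation 1, so a value of about 1.134 is roughly 13% to 14% more spread, in line with the comparison above.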

In a probability calculator, we would enter 29 for the degrees of freedom:


This will work much like the standard normal distribution. It, too, is displayed in units of standard deviations. That is, the mean is 0 standard deviations away from the mean. We want to know the number of standard deviations to the left and to the right of the mean we need to travel in order to "trap" 95% of the distribution. We use the calculator:

Thus, we would expect 95% of sample means to be within 2.045 standard deviations of the mean. In other words:

$$\bar{x} \pm 2.045 \cdot \frac{s}{\sqrt{n}}$$

Or:

$$25.567 \pm 2.045 \cdot \frac{5.870}{\sqrt{30}}$$

The lower limit is:

$$25.567 - 2.045 \cdot \frac{5.870}{\sqrt{30}} \approx 23.4$$

And the upper limit is:

$$25.567 + 2.045 \cdot \frac{5.870}{\sqrt{30}} \approx 27.8$$

Thus, we are 95% confident that the true average age in this town is between 23.4 and 27.8. Notice that this is not very much different than our simulated confidence interval of 23.6 to 27.5. So, which is more precise? This is arguable, but it is difficult to argue with empirical data. Personally, I prefer the bootstrap confidence interval we ran earlier. My reasoning is that a distribution of means is asymptotically normal, meaning that, under infinitely many sampled units, the distribution would be exactly normal. This is very theoretical and not always valid. For now, we will compare both.

For the 99% confidence interval, theory produces the following:

We would now simply adjust the number of standard deviations to 2.756:

$$25.567 \pm 2.756 \cdot \frac{5.870}{\sqrt{30}}$$

Lower limit:

$$25.567 - 2.756 \cdot \frac{5.870}{\sqrt{30}} \approx 22.6$$

Upper limit:

$$25.567 + 2.756 \cdot \frac{5.870}{\sqrt{30}} \approx 28.5$$

Similarly, there is a 99% chance that the population mean age is between 22.6 and 28.5. Compare this to our empirical result above of 22.8 to 28.2. We are, again, very close.

Homework Problems – 6.2

1. Describe, in your own words, what a bootstrap distribution is and why we would want to use one. Be sure to mention the logical process behind building one, as well as the assumptions we are making when we do so.
2. What is a confidence interval? Explain in your own words.
3. The following is a random sample of 10 labor costs associated with farming for civilian consumers (in billions of dollars) since 1970.

   Labor Costs (bill. $)
   229.9
   303.7
   137.9
   58.3
   81.5
   196.6
   36.6
   168.4
   122.9
   347.4

   (SOURCE: Data randomly sampled from U.S. Statistical Abstract, Table 847)
   a. Does the Central Limit Theorem apply for this data? Why or why not?
   b. Using a bootstrap distribution, calculate a 95% confidence interval for $\mu$, the true population average labor cost.
   c. In a complete sentence, interpret the real-world meaning of this interval.
   d. Using the bootstrap distribution and percentiles, how likely is it that a sample of labor costs has a mean greater than $190,000,000,000?
4. In Arizona, primarily the Phoenix Metropolitan area, the issue of red-light cameras used to catch red-light runners and speeders was a prominent one for much of the early 2000's. Many studies were carried out over this period of debate to determine whether or not they were effective, and whether or not they used taxpayer money appropriately. Suppose the following data was collected on the revenue generated by randomly sampled red-light cameras across the valley. The goal is to have, on average, each camera generate $750 and no less than $640 per day.

   883  872  832
   522  840  840
   590  536  676
   779  892  555
   887  880  884
   615  588  617
   690  547  517
   771  770  586
   843  687  505
   509  842  552

   a. Can the state be 95% confident that the desired average is possible?
   b. Generate a 99% confidence interval for $\mu$, the population average daily revenue per camera. Explain in a complete sentence what this means.
   c. Is the CLT valid in this problem? Explain.
   d. Using the assumption that the distribution of $\bar{x}$ is normally distributed, calculate a theoretical 95% confidence interval for $\mu$ (you will need to use $\frac{s}{\sqrt{n}}$ to estimate the standard deviation of the $\bar{x}$'s and $\bar{x}$ to estimate $\mu$).
   e. In reality, anytime we estimate parameters, like you did above in part d), we actually shouldn't assume a normal distribution. Instead, we should assume what is known as a $t$-distribution, which is symmetrical, though it has more variability to account for the uncertainty in our estimates. Watch this brief informative video: http://www.youtube.com/watch?v=yV-0ReCXW64

      Pull up the following applet: http://www.stat.tamu.edu/~west/applets/tdemo.html. You can type in the percentile corresponding to the means you want to consider. $df$ stands for "degrees of freedom" and can be calculated by taking the sample size minus 1 ($df = n - 1$). (From the video, we know that, if the sample size is really, really big, then the normal distribution and $t$-distribution become indistinguishable.)

      The output of this applet will give you the number of standard deviations your endpoints will be on either side of the mean. For example, you will find that a 99% confidence interval for a sample of size 100 has endpoints that are 2.626 standard deviations from the mean (left and right). Let's say your sample mean is $\bar{x} = 40$ and your standard deviation is $s = 5$. Then, the confidence interval will be an interval around the sample mean. That is, one standard deviation is $\frac{s}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5$ (remember, the standard deviation of means requires that we take the standard deviation among individual $x$'s and divide by the square root of the sample size). So, 2.626 standard deviations would be 2.626(0.5) = 1.313 units away from the mean. The endpoints would be 40 - 1.313 and 40 + 1.313, or 38.687 to 41.313. Formulaically, we found:

$$\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$

where $t$ is the number of standard deviations to the endpoints for a confidence interval with total area $\alpha$ in the tails.

Using this "crash course" in theoretical confidence interval-finding, compute the 95% confidence interval using these ideas. Do you get a similar result? How close?
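For reference, the theoretical interval worked out in Section 6.2.2 for the town-age sample can be reproduced with a few lines of Python. This is a sketch, not the book's calculator: the function name `t_interval` is made up for illustration, and the critical values 2.045 and 2.756 are typed in from the text rather than computed.

```python
from math import sqrt
from statistics import fmean, stdev

def t_interval(data, t_critical):
    """Return (lower, upper) for x-bar +/- t * s / sqrt(n)."""
    n = len(data)
    x_bar = fmean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    margin = t_critical * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# The 30 sampled ages from Section 6.2.1
ages = [32, 20, 25, 31, 25, 27, 31, 29, 30, 35, 32, 18, 19, 33, 21,
        20, 19, 33, 22, 19, 30, 21, 19, 32, 20, 18, 31, 20, 22, 33]

low95, high95 = t_interval(ages, 2.045)  # 95% interval, df = 29
low99, high99 = t_interval(ages, 2.756)  # 99% interval, df = 29
print(round(low95, 1), round(high95, 1))
print(round(low99, 1), round(high99, 1))
```

Rounded to one decimal place, the two intervals should agree with the 23.4 to 27.8 and 22.6 to 28.5 found in Section 6.2.2. The same function works for any data set once the matching critical value has been looked up.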

6.3 Confidence Interval for $\hat{\pi}$

6.3.1 Confidence Interval for $\hat{\pi}$ Using Sampling Distributions

Suppose that it is of interest to estimate the proportion of recent customers who say they would come back and shop at your store. You take a sample and determine that, of 30 people, 20 said they would and 10 said they wouldn’t. You would like to make an inference about the population of all of your customers. In your sample, you know that

$\hat{\pi} = \frac{20}{30} \approx 0.667$

is the proportion of customers who say they will come back and purchase from you again. You are looking to find a confidence interval for the population proportion, based on $\hat{\pi}$. How do we do that with the simulator if we have no data? In reality, we do have data. We just have to make it numerical. In fact, 20/30 is an average: it is the average of 30 responses, if we let

$x = \begin{cases} 1 & \text{if the customer says they would come back} \\ 0 & \text{if the customer says they would not} \end{cases}$

So, we have a set of twenty 1’s and ten 0’s. We enter these into our simulator.

We run the bootstrap sample on these 1‟s and 0‟s 1,000 times. We will get a variety of sample proportions:


We see that this distribution is approximately normal. No surprise there!

[Figure: Sampling Distribution of p-hat. Histogram of the bootstrap sample proportions, in bins of width 0.05 from about 0.43 to 0.98; vertical axis: frequency, 0 to 350.]

We calculate the 2.5- and 97.5-percentiles to get the middle 95% of sample proportions generated in the bootstrap sample:

Percentile       (As %)    Results
Percentile 1:    97.5      0.833
Percentile 2:    2.5       0.500

Thus, we are 95% confident that the proportion of the population of customers that will shop at your store again is between 0.50 and 0.83. This is quite a wide interval! At least you know what to expect with 95% confidence!

DULY CAUTIONED: The assumptions here are the same as for bootstrapping with $\bar{x}$: a random sample is drawn from the population and is representative of the population. If not, the sample is worthless, in any case.
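For readers without the simulator, the same bootstrap can be sketched in Python. This is a rough stand-in, not the simulator’s actual procedure: the function name, the fixed seed, and the simple way the percentiles are read off the sorted list are all assumptions made for illustration.

```python
import random

def bootstrap_proportion_ci(data, confidence=0.95, resamples=1000, seed=0):
    """Percentile bootstrap interval for the proportion of 1's in 0/1 data."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    n = len(data)
    proportions = sorted(
        sum(rng.choices(data, k=n)) / n  # resample with replacement, then average
        for _ in range(resamples)
    )
    tail = (1 - confidence) / 2
    lower = proportions[round(tail * resamples)]            # about the 2.5th percentile
    upper = proportions[round((1 - tail) * resamples) - 1]  # about the 97.5th percentile
    return lower, upper

responses = [1] * 20 + [0] * 10   # twenty "would come back", ten "would not"
low, high = bootstrap_proportion_ci(responses)
print(low, high)
```

Because the resampling is random, a different seed will shift the endpoints slightly, just as repeated runs of the simulator would.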

6.3.2 Confidence Interval for $\hat{\pi}$ Using Theoretical Results

Without providing the intuition for this method, we will simply state the results for the CLT pertaining to the sampling distribution of $\hat{\pi}$:

Central Limit Theorem for $\hat{\pi}$
The sampling distribution of $\hat{\pi}$ (which is really just an average of 0’s and 1’s) is approximately normal as long as $n\hat{\pi}$ and $n(1 - \hat{\pi})$ are both large enough (similar idea as for the standard CLT; a common rule of thumb is at least 10 each), with

$E[\hat{\pi}] = \pi$ and $SD[\hat{\pi}] = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}$

NOTE: the margin of error reported in polls is based on this standard deviation (typically about two standard deviations, for 95% confidence).

The results above state that:
1. The average proportion of the sampling distribution is the true population proportion.
2. The standard deviation of proportions of the sampling distribution is the calculation above.

These hold AS LONG AS $n\hat{\pi}$ and $n(1 - \hat{\pi})$ are large enough. In our example, $n\hat{\pi} = 20$ and $n(1 - \hat{\pi}) = 10$, so both conditions are met. We can now proceed:

Here, we get to use the standard normal distribution to calculate the number of standard deviations corresponding to the desired interval. So, we know that:

$\hat{\pi} = \frac{20}{30} \approx 0.6667$

$SD[\hat{\pi}] = \sqrt{\frac{0.6667(1 - 0.6667)}{30}} \approx 0.0861$

The number of standard deviations corresponding to the middle 95% of a standard normal distribution is calculated below:

Thus, these endpoints are approximately 1.96 standard deviations away from the mean. So, our confidence interval would be:

$\hat{\pi} \pm 1.96\sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}$

In our case:

Lower limit: $0.6667 - 1.96(0.0861) \approx 0.498$

Upper limit: $0.6667 + 1.96(0.0861) \approx 0.835$

These limits are nearly identical to the simulation values!
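The theoretical interval is easy to check with a short Python sketch. The function name `proportion_ci` is invented for this illustration, and the multiplier 1.96 is passed in as a given rather than computed from the normal distribution.

```python
import math

def proportion_ci(successes, n, z_star=1.96):
    """Normal-approximation interval: p_hat ± z_star * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    standard_deviation = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * standard_deviation
    return p_hat - margin, p_hat + margin

# The store example: 20 of 30 customers say they would come back.
lower, upper = proportion_ci(20, 30)
print(round(lower, 3), round(upper, 3))  # 0.498 0.835
```

Compare these endpoints with the bootstrap percentiles of 0.500 and 0.833 found in the previous section.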


Homework Problems – 6.3

1. In a sample of 55 students from Arizona State University taking a political science class, 30 say they would be interested in taking another political science class. The university is interested in determining the proportion of all its students that are interested in taking another political science class.
a. What is the population of interest in this study?
b. Construct a 90% bootstrap confidence interval for $\pi$, the true proportion.
c. Interpret the real-world meaning of your confidence interval.

2. A software company takes a random sample of recent orders and finds that, of the 250 sampled, 42 resulted in the return of a piece of purchased software.
a. What is the population of interest in this study?
b. Construct a 99% bootstrap confidence interval for $\pi$, the true proportion.
c. Interpret the real-world meaning of your confidence interval.

3. A batch of apples was inspected prior to shipment for any defects. Each apple was marked as either pass (P), re-inspect (R) or fail (F). The following results were reported.

F P P P P
P P R P P
P R P P P
P P R P F
P R P P P
P R F P R
P P R P P
P R R R P
R P P P P
R P P P R

a. What is the population of interest in this study?
b. Construct a 95% bootstrap confidence interval for $\pi$, the true proportion of passing apples.
c. Interpret the real-world meaning of your confidence interval.
d. Using the CLT for $\hat{\pi}$’s, construct a 95% confidence interval (see blue box in this section). How does it compare to the bootstrap confidence interval?

Chapter 7 Hypothesis Testing

We are often faced with uncertainty. Specifically, we often want to know whether one product is better than the other, whether one group outperforms another in some type of task, or how one manufacturing process compares to another, among many other things. How can we ever know? The first step would be to conduct a study and collect data. The data must then be compared.


But how do we do so if there exists variability from one sample to the next? This chapter will address this question.

7.1 The Concept Behind Hypothesis Testing

So, you have a research question… what now? The answer might at first seem obvious: let’s run a study. The question, however, needs some special treatment before anything else happens, especially if the study comes at a significant cost. For instance, suppose we’re interested in determining whether pesticides damage the soil in which we grow the majority of our food. This is a loaded curiosity. We first need to fully define how it is that we would conduct such a study. For instance, will we be comparing two regions, one that has been sprayed with pesticides and one that hasn’t been sprayed? What is it, exactly, that we will measure in order to determine the level of soil damage?

First and foremost, we need to formulate a hypothesis, or a belief about what it is that we expect to see. For example:

Our hypothesis is that pesticides inflict serious damage on sprayed soils.

Great, so we know what we believe. Did we just state what we wanted to happen? Probably not. We’ll usually formulate a hypothesis based on some existing observations. Perhaps we’re seeing that plants aren’t producing as many edibles as previously thought. Or, maybe we’re finding rising levels of cancers. (By the way, all of the above are becoming prominent public concerns in the U.S. and beyond.) So, based on these observations, we’re forming an educated belief on the effect of pesticides.

The next critical question: How will we measure “soil damage?” This can be a controversial question and may lack a consensus answer. Will it be measured by the quantities of beneficial microbes present in the soil? By the soil’s pH level? By the amount of nitrogen it contains? However we choose to measure “soil damage,” we want to be sure that we are being accurate. That is, we need to be sure that we are actually measuring what we say we’re measuring. This sounds infantile, but it happens all the time that researchers say they’re measuring something that they’re not actually measuring.

So, suppose we do some research and conclude that we will test for soil damage by determining the weight of vegetables harvested from these plants and comparing the average weight per plant for the experimental group (some determined quantity of pesticides sprayed) with that of healthy, unsprayed plants. We find that healthy plants produce about 30 lbs. of some vegetable across their seasonal life span. Will the average plant yield for plants sprayed with pesticides be lower?

Since this is a mathematical question, we would want to formulate our hypothesis into mathematical statements. Since we are dealing with an average in this scenario, the statistical symbol often used to represent the average plant yield for the entire population of this particular vegetable is the Greek letter mu, $\mu$. Now, our experimental hypothesis is that pesticides damage the soil, measured by the pounds of vegetables yielded from these plants. If that is the case, we would expect to see a yield of less than 30 lbs. of vegetables per plant. That is, our hypothesis is that

$\mu < 30$

Since this is the experimental hypothesis, we do not yet have evidence to conclude that it is true. Thus, we should assume that there is no difference between the yields of pesticide-sprayed and non-sprayed plants. Thus, we begin by assuming that:

$\mu = 30$

This second hypothesis is called the null hypothesis, that is, the hypothesis that is assumed until there is sufficient evidence otherwise. Symbolically, this hypothesis is written $H_0$ and is typically read as “null hypothesis,” or “h-naught.” The hypothesis that we believe is called the alternative hypothesis, and is written $H_a$, or “h-ay.”

To write these two hypotheses, we would write:

$H_0: \mu = 30$
$H_a: \mu < 30$

When evidence is insufficient, we say “Based on sample data, we fail to reject $H_0$ in favor of $H_a$.”

When evidence is sufficient to conclude that the average is really below 30, we say “Based on sample evidence, we reject $H_0$ in favor of $H_a$.”

We are cautious to make these conclusions based on sample data. Certainly, we may have obtained an oddball sample that doesn’t represent the population. Let’s practice writing some hypotheses. First off, let’s make note of the variety of population characteristics, called population parameters, that we can seek to describe in a study.

Population Parameters
In a study, we seek to gain information about the target population. There are a number of population parameters, the actual values for the population, that we can test. Two common ones are:
1) Population average, denoted by Greek mu (“mew”), $\mu$
2) Population percentage, denoted by Greek pi (“pie”), $\pi$

Unfortunately, we do not know the true values for $\mu$ and $\pi$ and realistically cannot, unless we sample the entire population. We can only estimate them based on the sample we collect. The values we collect from the sample are sample statistics and are estimators for the respective population parameters. These estimators for the values above, respectively, are notated:
1) $\hat{\mu}$ (“mew-hat”)
2) $\hat{\pi}$ (“pie-hat”)

Example 1: Because of variation in the manufacturing process, tennis balls produced by a particular machine do not have identical diameters. Let $\mu$ denote the true average diameter for tennis balls currently being produced. Suppose that the machine was initially calibrated to achieve the design specification $\mu = 3$ in. However, the manufacturer is now concerned that the diameters no longer conform to this specification. If sample evidence suggests that the true average diameter for tennis balls is not 3 inches, the production process will have to be halted while the machine is recalibrated. Because stopping the production is costly, the manufacturer wants to be quite sure that the true average diameter is not 3 inches before undertaking recalibration. What are the competing hypotheses?

SOLUTION: Under the original assumption, $\mu = 3$. The researcher wants to test whether $\mu \neq 3$. So:

$H_0: \mu = 3$
$H_a: \mu \neq 3$

Example 2: A long-used chemical in a particular carpet-cleaning product has been known to successfully remove dark stains 70% of the time. After extensive research, the product's formula is modified. The head of production must decide whether or not to sell the new product. Write null and alternative hypotheses for conducting an experiment that might help him decide.

SOLUTION: Under original specifications, the proportion of time the product works is $\pi = 0.70$. He is concerned that $\pi < 0.70$. If it is truly less effective, then he will not sell the new product. That is,

$H_0: \pi = 0.70$
$H_a: \pi < 0.70$

Example 3: Many older homes have electrical systems that use fuses rather than circuit breakers. A manufacturer of 40-amp fuses wants to make sure that the mean amperage at which its fuses burn out is in fact 40. If the mean amperage is lower than 40, customers will complain because the fuses require replacement too often. If the mean amperage is higher than 40, the manufacturer might be liable for damage to an electrical system as a result of fuse malfunction. To verify the mean amperage of the fuses, a random sample of fuses is selected and tested. If a hypothesis test is performed using the resulting data, what null and alternative hypotheses would be of interest to the manufacturer?

SOLUTION: The fuse is designed and assumed to be 40 amps. That is, on average, $\mu = 40$. He wants to make sure it is not the case that $\mu \neq 40$. So,

$H_0: \mu = 40$
$H_a: \mu \neq 40$

So Your Average IS Different!

In our pesticide experiment, our target population is all plants of this particular variety. Thus, we will take a random sample of plants from the pesticide group. Once we have that, we will find the sample mean, which is called a sample statistic. That is, we can’t possibly keep track of all the plants in the population, so we will use the mean of the sample to help us describe the entire population. Usually, this sample statistic is written as $\hat{\mu}$ (“mew-hat”).

Suppose that you find, from the pesticide group, a value of $\hat{\mu}$ below 30 lbs. The claim has been proven, right? Maybe, maybe not. We must remember that this is just one random sample from all plants. Certainly, this sample average is lower, but can it not just be due to random variation that we’re seeing a difference? After all, not all no-pesticide plants will produce exactly 30 lbs. of the vegetable. What if we collect a sample and $\hat{\mu}$ turns out lower still?


Without some sort of analysis, we might be tempted to say this is sufficiently lower. However, we need some formal way to determine: When is “low” low enough? Or, more generally:

The Big Question
When making conclusions about the population based on sample data, we must first ask the question: When do we conclude that an “extreme” is extreme enough to reject $H_0$?

As you might guess, there is probability involved. That is, if the probability of observing what we have just seen, or something more extreme, is small “enough,” then we will reject $H_0$ and conclude that $H_a$ might be a more valid conclusion.

Punchline: We shouldn’t reject the null hypothesis unless the probability of seeing something as extreme or more extreme is very small.

What Happens If I Reject $H_0$ When the Data Provides Insufficient Evidence?

Imagine a medical test to determine whether or not you have some disease. Let’s call this disease Disease X. As for having the condition, you have one of two possibilities: you have it or you don’t. As for the test, it will either say that you have it or you don’t. Now, realistically, we know that there is no way to be omniscient and really know whether or not you have the condition. However, let’s imagine that we are all-knowing and can judge the validity of the test. There are four possibilities:

1) The test is positive, and you do have X (accurate)
2) The test is positive, and you don’t have X (inaccurate)
3) The test is negative, and you do have X (inaccurate)
4) The test is negative, and you don’t have X (accurate)

It is evident that possibilities 2) and 3) represent scenarios where there is an inaccurate result. That is, it would be invalid for the test to tell you that you have the condition when, in fact, you don’t. It would also be invalid for the test to tell you that you don’t have the condition when, in fact, you do. Conversely, we do want the test to tell us positive when we do have the condition and negative when we don’t.

Medical researchers usually give these four instances names, as summarized in the following table:

                        Truth: Have                      Truth: Don’t Have
Test Says Positive      True Positive                    False Positive (Type II Error)
Test Says Negative      False Negative (Type I Error)    True Negative

As can be seen, the “true” cells represent accurate results and the “false” cells represent inaccurate results. As a patient, you would probably be quite upset (devastated, even) if you received false results for a terrible condition, such as X!

In a hypothesis test, we are up against the same dilemma: our test result can be either positive or negative. The truth may or may not be accurately represented. Let’s modify our table slightly to represent the hypothesis test scenario:

                                  Truth: $H_0$ True                Truth: $H_0$ False
Conclusion: Don’t Reject $H_0$    True Positive                    False Positive (Type II Error)
Conclusion: Reject $H_0$          False Negative (Type I Error)    True Negative

In reality, we shouldn’t reject $H_0$ (make it appear false) when it is true. If we do, we have a false negative on our hands. Similarly, we shouldn’t fail to reject $H_0$ (make it appear true) when it is false. These are labeled Type I and Type II errors, respectively. (Here “positive” means the test sides with $H_0$; many other texts instead call a Type I error a false positive.)

How Do We Avoid Erroneous Conclusions?

Unfortunately, we are not omniscient. Thus, we can never be sure that our conclusions are accurate. If we knew, there would be no testing necessary! On the flipside, we can decide how large an error rate we are willing to tolerate. Earlier, we mentioned that we will reject $H_0$ when the probability of observing something as extreme or more extreme than what we have observed is “small.” This value of small fully determines our probability of a Type I error. As researchers, it is our duty to set this value. This probability of a Type I error is called the criterion, or alpha-level, and is denoted with the Greek letter alpha, $\alpha$.

Criterion/Alpha-Level
Our chosen risk of a Type I error is called the criterion or alpha-level, and is denoted by $\alpha$. Typical values for $\alpha$ are:

$\alpha = 0.01$, $\alpha = 0.05$, $\alpha = 0.10$

That is, rarely will we choose a very small or considerably large alpha-level.

Suppose that we reject $H_0$ when the probability of observing something as extreme or more extreme than what we have observed is 5% (or smaller). We have that $\alpha = 0.05$. This means that, when $H_0$ is true, there is still a 5% chance that we observe a value (sample mean, sample proportion, etc.) extreme enough to reject it. That is, there is a 5% chance that we falsely reject the null hypothesis when it is true. Probabilistically,

$P(\text{Type I Error}) = P(\text{Reject } H_0 \mid H_0 \text{ is true}) = \alpha = 0.05$

To visualize this, consider the table below. Recall that a conditional probability statement limits us to the event after the “pipe,” |, and then asks the question, “What percentage of the time can we expect the event to occur, out of the times the specified condition occurs?” The modified table below shows that.

                                  Truth: $H_0$ True
Conclusion: Don’t Reject $H_0$    True Positive: 95%
Conclusion: Reject $H_0$          False Negative (Type I Error): 5%
Total                             100%
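The 95%/5% split in this column can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: it borrows the pesticide setting with $\mu = 30$, but the standard deviation of 5 lbs., the sample size of 25, and the function name are hypothetical. It rejects $H_0$ whenever the sample mean falls more than 1.645 standard deviations (of means) below 30, which is the cutoff that leaves 5% in the lower tail of a normal distribution.

```python
import math
import random

def simulated_type1_rate(mu0=30.0, sigma=5.0, n=25, z_cutoff=-1.645,
                         trials=20_000, seed=1):
    """Fraction of samples that reject H0 even though H0 (mu = mu0) is true."""
    rng = random.Random(seed)
    standard_error = sigma / math.sqrt(n)   # standard deviation of sample means
    rejections = 0
    for _ in range(trials):
        # Draw one sample from a population where H0 really is true.
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / standard_error
        if z < z_cutoff:                    # "extreme enough" in the direction of H_a
            rejections += 1
    return rejections / trials

rate = simulated_type1_rate()
print(rate)   # should land close to 0.05
```

Every rejection counted here is a Type I error, since the samples were generated with $H_0$ true.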

At this point we might wonder: why shouldn’t we set $\alpha$ extremely small so that we minimize the Type I error risk?

Good question. Imagine that your alpha is 0.0001. This means you will reject $H_0$ only 0.01% of the time (1 out of 10,000 times) when it is true. Certainly, your risk of a Type I error is extremely small. However, your decision criterion, the numerical figure that we later calculate to decide whether or not to reject, will be extremely stringent and difficult to achieve. If this is the case, then you almost never reject the null hypothesis! Okay, so if you very rarely reject the null hypothesis, then you are also potentially committing another kind of error: not rejecting the null hypothesis, even though it may be false. That is, you increase the likelihood of a Type II error. Recall that

$P(\text{Type II Error}) = P(\text{Fail to reject } H_0 \mid H_0 \text{ is false})$

We can see here that rarely rejecting $H_0$ means potentially failing to reject it even when it should be rejected! Unfortunately, there is no free lunch in hypothesis testing.

                                  Truth: $H_a$ True
Conclusion: Don’t Reject $H_0$    False Negative (Type II Error)
Conclusion: Reject $H_0$          True Positive

Though we cannot yet easily provide numerical support for this claim (which certainly makes sense), we will make the following preliminary conclusion:

Type II Error
The probability of a Type II error, denoted $\beta$, is inversely related to $\alpha$, the probability of a Type I error. That is, with everything else held fixed, decreasing $\alpha$ will increase $\beta$.
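A rough numerical illustration of this trade-off is possible with exact binomial probabilities. Everything specific in the sketch below is an assumption chosen for illustration: a test of $H_0: \pi = 0.30$ against $H_a: \pi < 0.30$ with $n = 60$, a true proportion of 0.20 under the alternative, and the helper names themselves.

```python
from math import comb

def binom_cdf(c, n, p):
    """P(X <= c) when X counts successes in n trials with success chance p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def rejection_cutoff(n, p0, alpha):
    """Largest count c with P(X <= c | p0) <= alpha; we reject H0 when X <= c."""
    c = -1
    while binom_cdf(c + 1, n, p0) <= alpha:
        c += 1
    return c

def type2_error(n, p0, p_true, alpha):
    """beta = P(fail to reject H0) when the true proportion is p_true."""
    cutoff = rejection_cutoff(n, p0, alpha)
    return 1 - binom_cdf(cutoff, n, p_true)   # P(X > cutoff) under p_true

for alpha in (0.10, 0.05, 0.01, 0.0001):
    beta = type2_error(n=60, p0=0.30, p_true=0.20, alpha=alpha)
    print(f"alpha = {alpha:<7} beta = {beta:.3f}")
```

As $\alpha$ shrinks, the cutoff for rejecting moves farther into the tail, so a sample drawn under the alternative is less likely to cross it and $\beta$ grows.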

Important Caution
Students are often confused that the probability of rejecting $H_0$ when $H_0$ is true and the probability of failing to reject $H_0$ when $H_0$ is true sum to 1. After all, these two possibilities are only two of the four possible results in a test decision. However, keep in mind that these are the percentages of time we reject and fail to reject out of all the times that $H_0$ is true! This is out of only one column total, not the entire sample space.


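The important caution brings up the following idea: if

$P(\text{Type I Error}) = P(\text{Reject } H_0 \mid H_0 \text{ true}) = \alpha$,

then

$P(\text{Fail to reject } H_0 \mid H_0 \text{ true}) = 1 - \alpha$.

Similarly, if

$P(\text{Type II Error}) = P(\text{Fail to reject } H_0 \mid H_0 \text{ false}) = \beta$,

then

$P(\text{Reject } H_0 \mid H_0 \text{ false}) = 1 - \beta$.

The probability that we reject the null hypothesis when it is false is referred to as the power of the test. We summarize these in the table below:

                                  Truth: $H_0$ True    Truth: $H_0$ False
Conclusion: Don’t Reject $H_0$    $1 - \alpha$         $\beta$
Conclusion: Reject $H_0$          $\alpha$             $1 - \beta$ (power)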

Example 4: The college dropout rate for a particular county is known to be 30%. The educational board of a city within the county believes its dropout rate is significantly lower. The board follows 60 students and, of them, 15 drop out. The board wants to run a statistical hypothesis test with $\alpha = 0.05$ to determine whether its belief is true. Describe the hypothesis test by giving:
a. Competing hypotheses
b. A decision rule for rejecting $H_0$
c. A decision criterion rule
d. A generic conclusion statement

SOLUTION:
a.) Under the null hypothesis, $\pi = 0.30$. We want to test to see if $\pi < 0.30$. Thus:

$H_0: \pi = 0.30$
$H_a: \pi < 0.30$

b.) We will reject $H_0$ if the probability of observing something as extreme or more extreme than 15 out of 60 dropouts ($\hat{\pi} = 0.25$) under the assumption of the null hypothesis is less than or equal to 0.05. That is:

$P(\hat{\pi} \le 0.25 \mid \pi = 0.30) \le 0.05$

c.) We will reject $H_0$ if the observed number of dropouts is smaller than some cutoff value. That is, it might be the case that the number of dropouts would have to be smaller than, say, 13 in order for us to reject the null hypothesis.

d.) Based on sample evidence, we (choose from below)
a. Reject $H_0$ in favor of $H_a$.
b. Fail to reject $H_0$. We do not accept $H_0$ as true, but we don’t have evidence to conclude otherwise.
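Parts b.) and c.) can be made concrete with a short computation. The sketch below is one possible way to carry them out, assuming that the number of dropouts among the 60 students follows a binomial distribution when $H_0$ is true; the helper name `binom_cdf` and the variable names are made up for this illustration.

```python
from math import comb

def binom_cdf(c, n, p):
    """P(X <= c) when X counts dropouts among n students, each with chance p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

n, pi_0, alpha = 60, 0.30, 0.05
observed = 15

# Decision rule (part b): chance of 15 or fewer dropouts if H0 is true.
p_value = binom_cdf(observed, n, pi_0)

# Decision criterion (part c): largest count whose lower-tail chance is <= alpha.
cutoff = max(c for c in range(n + 1) if binom_cdf(c, n, pi_0) <= alpha)

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"P(X <= {observed}) = {p_value:.3f}, cutoff = {cutoff}, decision: {decision}")
```

Note that the cutoff depends only on $n$, $\pi$ under $H_0$, and $\alpha$, so it can be fixed before any students are followed.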

As we see from the above example, our hypothesis test needs to have a structured layout. We need to know ahead of time what we’ll do. It is tempting, but we cannot determine our rejection criterion based on what the sample data tells us! In practice, you can carry this type of philosophy, but you increase the error rate.

Consider, for example, the scenario wherein you take an exam for a biology class. You get the results back and look at what you missed. You say, “Oh, of course I should have put that! I knew that!” If you told that to the instructor, she may say, “Sorry, you didn’t demonstrate that on the exam.” We expect this response. Why? Because it is the test that helps to determine our level of understanding! It is not the other way around. If the instructor allowed you to change your answer, then the test wouldn’t really be demonstrating what you knew at the time of the test.

A hypothesis test is quite analogous. We carry one out because we have a hunch. Always think back to this statement: If you dig long enough in your data, you will find something! This casts the digging process in a negative light, since digging through the data after the fact does not answer the question the test was designed to decide. In fact, it creates a high likelihood that we are observing a coincidence and not a solid finding at all! Thus, we substantially increase the probability of error!


Structure of a Hypothesis Test
The following should be included in all hypothesis tests:
1. A statement of competing hypotheses ($H_0$ vs. $H_a$)
2. A decision rule for rejecting $H_0$ (based on $\alpha$)
3. A decision criterion rule (the physical value of the random variable that represents the required “extremeness” of our observed sample value)
4. A conclusion statement (what the sample data tells you to conclude)

As an important note: we never say, “accept $H_0$ as true.” Instead, we remain accurate and say that there is simply not enough evidence to reject it. Think about this as “innocent” vs. “not guilty.” Just because a court cannot prove that someone is guilty, they don’t say that he is innocent. Instead, they give the verdict of “not guilty.”

Homework Problems – 7.1

1. In your own words, explain the difference between the null and alternative hypotheses. Also, explain how to identify each in a research study.
2. Explain why we assume that the null hypothesis is true before testing a hypothesis.
3. It is believed that 7% ($\pi = 0.07$) of an organic corn crop is lost to insect infestations. An organic farmer has devised a system that may result in less insect destruction. He would like to test this idea with a hypothesis test. Write the competing hypotheses.
4. A high school statistics class typically gets an average of scores out of 5 on an Advanced Placement (AP) exam. Over the recent several years, he has found that his students’ scores were higher. He would like to test this hypothesis. Write the competing hypotheses.
5. A snack dispenser has a failure rate of over a 5-year span. After changes to the machine, the manufacturer would like to know whether or not this has changed. Write competing hypotheses.
6. What does it mean to say that when describing a Type I error?
7. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test for the scenario in question 3, assuming and that he finds that only 52 out of 1000 bushels of his crop are lost to insect infestations.
8. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test for the scenario in question 4, assuming and that he finds his students have been averaging $\bar{x}$ on the test.
9. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test for the scenario in question 5, assuming and that she finds the failure rate is 16 out of 1000 machines.
10. In real-world terms, describe what Type I and II errors would mean for each of questions 3, 4, and 5.
11. Why does the risk of a Type II error increase as we decrease $\alpha$?


APPENDIX A Answers to Select Problems

1.1 Data and Their Uses
1.
a. Nominal; ice cream names cannot be ordered, in general.
b. Interval; temperatures have order and the differences in temperature can be reasonably discussed. For example, to talk about a difference is meaningful.
c. Ratio; absolute 0 exists since there can be no balance at all. Additionally, it makes sense to talk about ratios. For instance, accounts receivable balances can be, say, 20% higher this month as compared to last.
d. Ordinal; there is an ordering, though we can’t talk about the number 1 candidate as being 2 better than the number 3 candidate. This is because the difference of 1 might not necessarily be the same from 1 to 2 as it would be from 2 to 3. Maybe candidate 3 is a far third.
2.
a. 2,121 elements in the sample
b. Length of time is a quantitative variable, since it is a numerical measure.
3.
a. 15,000 elements in the sample
b. A proportion is a quantitative variable, since it is a ratio.
4.
a. Observational; the number of animals a family has is not being assigned. Instead, families are simply being asked about how many animals they have.
b. The study might have considered families with horses. People with horses likely live on the outskirts of a big city, perhaps being exposed to less pollen. Also, maybe more families have pets because their children do not seem to have allergies to them.
5.
a. Observational; the researchers are looking at preexisting habits. They are not attempting to alter the habits to determine what effect doing so might have on measures of reading ability and short-term memory.
b. No; perhaps those who watch more television also have other habits that lead them to scoring poorly on such assessments.
6.
a. Observational; the opinions of the doctors are not being altered in any way.
b. There is a nonresponse bias since not all participants responded. Thus, it might be the case that those with the strongest opinions decided to come forward, whereas the other 17,000 who didn’t respond might have influenced the poll in a different way.

1.2 Descriptive VS. Inferential Statistics
1.
a. $4 million/day
b. If all days had the same gross revenue, $4 million would be earned each day.
c. $7.6 million
d. The amount of gross revenue earned on a given day varies by as much as $7.6 million from that of another day.
e. The film has generated an average of $4 million/day. There is much instability in this average in that the actual gross revenue has varied from $1.6 million to $9.2 million, a range of $7.6 million. It is dangerous to place too many bets on what might happen next, due to the extreme variability in revenues.
2.
a. 18 randomly selected college students
b. All college students
c. Answers vary; spending on clothing, style preference, etc.
d. Inferential; they wish to make conclusions about the population of all college students
3.
a. 250 packages of cheese selected
b. All packages of cheese produced by the company
c. 248 or more must pass
4. Consider the following two datasets with a range of 30:
0, 1, 2, 2, 3, 2, 28, 29, 30
0, 1, 2, 3, 4, 3, 4, 2, 1, 30
While both have a range of 30, the first dataset has most of its data towards the outer ends of the dataset. In the second dataset, there appears to be tightly spaced data, followed by one outlier of 30. The second dataset is, overall, less spread out.
5. The researchers are trying to use CGCC students as representative of the population of all college students. This presents a bias, in that CGCC probably does not accurately represent all college students.

2.4 Descriptive Statistics – Variability
1.
a. Standard deviation = 5.9; on average, beers in this sample are within 5.9 calories of the average calorie content.

b. Q3 – Q1 = 4.75. The middle 50% of beer calories in this sample have a range of 4.75 calories. Specifically, they range from 29 calories (first quartile) to 33.75 calories (third quartile).
c. The skewness value is 0.14. This means the distribution is slightly skewed to the right.
2.
a. Range = 64.3; Interquartile Range = 27.7 (71.9 – 44.2); Standard Deviation = 18.7. The difference between the highest and lowest percentage is 64.3%, telling us that the percentage of school enrollees varies greatly across Central Africa. However, this does not ensure that there is not a single outlier creating this wide spread. The interquartile range is 27.7%, telling us that the middle 50% of percentages span from 44.2% to 71.9%, still a considerable spread. The standard deviation verifies that percentages are quite variable, since, on average, the percentage of school enrollees varies by 18.7% points about the mean.
b. The interquartile range is 27.7%, telling us that the middle 50% of percentages span from 44.2% to 71.9%, still a considerable spread. The standard deviation verifies that percentages are quite variable, since, on average, the percentage of school enrollees varies by 18.7% points about the mean.
c.

Enrollment
Mean                  60.9
Standard Error        3.9
Median                61.9
Mode                  61.9
Standard Deviation    18.7
Sample Variance       351.2
Kurtosis              -0.4
Skewness              0.4
Range                 64.3
Minimum               34.6
Maximum               98.9
Sum                   1401.2
Count                 23.0

d. Yes, it is skewed to the right, since the skewness value is 0.4, a positive value.
e.


[Figure: relative frequency histogram "Percent Enrolled"; horizontal axis: Percentage; vertical axis: Relative Frequency, 0% to 35%]

The majority of people in Central Africa are not enrolled in school, since it is predominantly the case that fewer than 50% of people in each nation attend school.
f. We know that $\bar{x} = 60.9$ and $s = 18.7$. A percentage of 79.6% is $(79.6 - 60.9)/18.7 = 1.0$ standard deviation from the mean. We would expect that at least ( ) of all enrollment percentages would be within one standard deviation of the mean. This is considered to be a very normal percentage (it is still within the “average” spread).
3. a. The range is 5750, which tells us that there is a difference of 5,750 feet from the shortest street to the longest street. The interquartile range is 2170, telling us that the middle 50% of all street lengths range from 980 feet to 3,150 feet. The standard deviation is 1634, telling us that, on average, a street varies by 1,634 feet from the mean street length.
b. The interquartile range is 2170, telling us that the middle 50% of all street lengths range from 980 feet to 3,150 feet. The standard deviation is 1634, telling us that, on average, a street varies by 1,634 feet from the mean street length.
c.


Street Lengths
Mean                  2231.4
Standard Error        238.4
Median                2100.0
Mode                  960.0
Standard Deviation    1634.1
Sample Variance       2670328.9
Kurtosis              -0.2
Skewness              0.8
Range                 5750.0
Minimum               100.0
Maximum               5850.0
Sum                   104874.0
Count                 47.0

d. The distribution is strongly skewed to the right. e.

This means that a street length of 79.6 feet would be about 1.3 standard deviations below the mean. f.

[Figure: relative frequency histogram "Street Length"; horizontal axis: Feet, with bins 100-1099, 1100-2099, 2100-3099, 3100-4099, 4100-5099, 5100-6099; vertical axis: Relative Frequency, 0% to 35%]

By C.T., at least $1 - 1/1.3^2 \approx 41\%$ of all street lengths in the sample are guaranteed to fall within 1.3 standard deviations of the mean. This is not unusual.
4. Answers vary:


Symmetric: [Figure: histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]

Bimodal (two peaks): [Figure: histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]

Right Skewed: [Figure: histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]

Left Skewed: [Figure: histogram over the bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, 180 to 200]

5. a.


Repair Cost
Mean                  971
Standard Error        382
Median                738
Mode
Standard Deviation    1,207
Sample Variance       1,455,875
Kurtosis              7
Skewness              2
Range                 4,194
Minimum               0
Maximum               4,194
Sum                   9,707
Count                 10

Due to the great variability in repair costs, it would be most appropriate to use the median as the measure of center. It also reflects the fact that most repair costs, if there are any, tend to be between $600 and $1000. Since the standard deviation describes movement about the mean, it is not appropriate to use it in combination with a median. Thus, we should probably use the interquartile range to describe the middle 50% of repair costs.
b. The repair cost of $4,194 is nearly 3 standard deviations (about 2.7) above the mean. This means that it is an outlier cost.
c. According to C.T., at least $1 - 1/2.7^2 \approx 86\%$ of the data in this data set should be within 2.7 standard deviations of the mean. Thus, there is at most a 14% chance that we have a score outside of 2.7 standard deviations of the mean. This tells us that a repair cost of $4,194 is fairly unusual.
6.


CC Ratios
Mean                  12.35
Standard Error        0.62
Median                12.91
Mode                  #N/A
Standard Deviation    1.97
Sample Variance       3.90
Kurtosis              -0.50
Skewness              -0.60
Range                 6.03
Minimum               8.81
Maximum               14.84
Sum                   123.47
Count                 10.00

There do not appear to be extreme outliers, since the mean and median are close. However, based on the mean being smaller than the median, and the skewness value being negative, there is a slight left-skew to the distribution. The standard deviation tells us that, on average, CC ratios are within 1.97 of the mean. We verify these notions by considering the histogram:

[Figure: relative frequency histogram "CC Ratio Distribution"; horizontal axis: CC Ratio; vertical axis: rel freq, 0% to 45%]

We should also be careful to note that there is not very much data available, which is why we don't distinctly see a skew. 7.


Nitrous Oxide (thous. Tons)
Mean                  46.35
Standard Error        9.395205
Median                36
Mode                  40
Standard Deviation    42.01663
Sample Variance       1765.397
Kurtosis              0.09474
Skewness              0.949789
Range                 136
Minimum               0
Maximum               136
Sum                   927
Count                 20

[Figure: relative frequency histogram "Nitrous Oxide Distribution"; horizontal axis: Nitrous Oxide (thous. Tons); vertical axis: rel freq, 0% to 30%]

The distribution of nitrous oxide emissions is skewed to the right, indicating that most states have relatively low emissions, whereas fewer states have relatively high emissions. We note that the median is a good measure of center, indicating that 36 thousand tons is the 50th percentile. There are two outliers of 136 thousand tons. For this value, $z = (136 - 46.35)/42.02 \approx 2.1$, indicating that at least $1 - 1/2.1^2 \approx 77\%$ of all values in the data set are within 2.1 standard deviations of the mean. Thus, 136 can be considered a mild outlier.
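The C.T. (Chebyshev's Theorem) checks in problems 3, 5, and 7 above all follow one pattern: express the value as a number of standard deviations from the mean, then bound the share of data within that distance by 1 - 1/k^2. The following is a minimal Python sketch of that pattern, using the summary statistics reported in the tables above. The function names are illustrative and are not part of the book's spreadsheet workflow.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


def chebyshev_lower_bound(k):
    """Minimum share of any data set within k standard deviations of the mean."""
    if k <= 1:
        return 0.0  # the bound is uninformative for k <= 1
    return 1 - 1 / k**2


# (label, value, mean, standard deviation) taken from the answers above
cases = [
    ("street length of 79.6 feet", 79.6, 2231.4, 1634.1),
    ("repair cost of $4,194", 4194, 971, 1207),
    ("nitrous oxide, 136 thousand tons", 136, 46.35, 42.01663),
]

for label, value, mean, sd in cases:
    # Round to one decimal, as the answers above do
    k = round(abs(z_score(value, mean, sd)), 1)
    bound = chebyshev_lower_bound(k)
    print(f"{label}: k = {k}; at least {bound:.0%} of data within k SDs")
```

Because the bound holds for any distribution shape, it is deliberately conservative; it only says how much data must lie within k standard deviations, not how much actually does.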

3.2 Joint Probability


1. See Video Solution
2. a. About 85% of all the past calls were for medical assistance.
b. P(call is not for medical assistance) = 1 – 0.85 = 0.15.
c. P(two successive calls are both for medical assistance) = (0.85)(0.85) = 0.7225.
d. P(first call is for medical assistance and second call is not for medical assistance) = (0.85)(0.15) = 0.1275.
e. P(exactly one of two calls is for medical assistance) = P(first call is for medical assistance and the second is not) + P(first call is not for medical assistance but the second is) = (0.85)(0.15) + (0.15)(0.85) = 0.255.
f. Probably not. There are likely to be several calls related to the same event (several reports of the same accident or fire) that would be received close together in time.
3. (“ ” “ ” “ ”) . / . / . /
4. See Video Solution
5. a. The "expert" assumed that the positions of the two valves were independent.
b. The positions of the two valves are not independent but rather dependent. The effect of the error makes the probability much smaller. The actual probability is ( ) compared to ( ).
6. a. Assuming that whether Jeanie forgets to do one of her “to do” list items is independent of whether or not she forgets any other of her “to do” list items, the probability that she forgets all three errands = (0.1)(0.1)(0.1) = 0.001.
b. ( )
c. P(remembers the first errand, but not the second or the third) = (0.9)(0.1)(0.1) = 0.009.

5.1 The Ideas Behind the Continuous Distribution

1. a.


[Figure: "Pizza Size Distribution"; horizontal axis: Size (inches), at 12, 14, 16, and 18; vertical axis: Probability, 0 to 0.6]

b. ( )
c. ( )
d. ( ) inches per pizza, on average.
e. ( ) (doesn't include the 12-inch pizza!)
2. a. ( )
b. ( )
3. a. ( ), so ( ) for ( )
b. ( )
c. ( )
d. ( ); on average, the professor dismisses class 5 minutes after the hour.
e. ( ); on average, the amount of time by which the professor dismisses the class after the hour varies by 2.9 minutes about the mean.
f. ( )
4. a. ( )


b. ( )
c. ( )
d. ( )
5. ( ), so ( ) for ( )
a. ( )
b. ( )
c. ( )
d. Both ( ) ( ) because, in a continuous distribution, the probability that the variable equals any single exact value is 0.
e. ( )
f. ( ); the average response time is 26 minutes.
g. ( ); on average, wait times deviate from the mean wait time by 4.6 minutes.
h. ( ). Thus, we want ( ).
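The means and standard deviations quoted in problems 3 and 5 are consistent with uniform distributions, for which the mean is (a + b)/2 and the standard deviation is (b - a)/sqrt(12). The sketch below is a rough check under assumptions that are not stated in the surviving text: that both distributions are uniform, with endpoints of 0 to 10 minutes (problem 3) and 18 to 34 minutes (problem 5). These endpoints were back-solved from the reported answers and the helper names are illustrative.

```python
from math import sqrt


def uniform_mean(a, b):
    """Mean of a continuous uniform distribution on [a, b]."""
    return (a + b) / 2


def uniform_sd(a, b):
    """Standard deviation of a continuous uniform distribution on [a, b]."""
    return (b - a) / sqrt(12)


def uniform_prob(a, b, low, high):
    """P(low <= X <= high) for X uniform on [a, b]: overlap width times height."""
    low, high = max(low, a), min(high, b)
    return max(high - low, 0) / (b - a)


# Assumed endpoints (hypothetical, inferred from the reported mean and SD)
dismissal = (0, 10)   # minutes after the hour, problem 3
response = (18, 34)   # minutes, problem 5

print(uniform_mean(*dismissal), round(uniform_sd(*dismissal), 1))
print(uniform_mean(*response), round(uniform_sd(*response), 1))

# A single exact value has zero width, hence zero probability
print(uniform_prob(*dismissal, 3, 3))
```

The last line illustrates the point made in problem 5d: in a continuous distribution, including or excluding an endpoint does not change a probability.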

5.2 The Normal Distribution

1. a.

b.


c.

d.


2. a. The long-run proportion of all children born in the U.K. expected to weigh more than 10 lbs. is 0.0186.

b. The long-run proportion of all children born in the U.K. expected to weigh at most 10 lbs. is 0.9814.


c. The long-run proportion of all children born in the U.K. expected to weigh between 5 and 6.5 lbs. is 0.1837.

d. The long-run proportion of all children born in the U.K. expected to weigh between 1 and 2 lbs. is 0.0000.


e. 20% of all children are expected to be born weighing less than 6.5 lbs.

6. In a recent year, Scholastic Aptitude Test (SAT) scores for all college-bound seniors in the United States were such that ( ) points and ( ) points (SOURCE: http://www.collegeboard.com).
a. 50% of students scored less than how many points?
b. 50% of students scored more than how many points?
c. In order to be in the top 10% of SAT-takers, what score would one have to achieve?
d. What score do the lowest 10% score below?
e. The middle 50% of students scored between what two values?


3. a. 50% of students score less than 1518 on the test.

b. By complementary probability, 50% of students should score more than 1518. c. You would have to score about 1913 points.

d. About 1123.


e. The middle 50% score between about 1310 and 1726.
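The answers to problems 2 and 3 come from a probability calculator, and the same lookups can be sketched with Python's standard library. The parameters below are assumptions, since the problem statements did not survive here: a mean of 7.5 lbs and standard deviation of 1.2 lbs for U.K. birth weights, and a mean of 1518 points and standard deviation of 308 points for SAT scores. They were chosen because they reproduce the answers above, not because the book states them.

```python
from statistics import NormalDist

# Assumed parameters (hypothetical; back-solved from the answers above)
birth_weight = NormalDist(mu=7.5, sigma=1.2)   # pounds
sat = NormalDist(mu=1518, sigma=308)           # points

# Problem 2: value -> probability, using the cumulative distribution
p_over_10 = 1 - birth_weight.cdf(10)
p_at_most_10 = birth_weight.cdf(10)
p_5_to_6_5 = birth_weight.cdf(6.5) - birth_weight.cdf(5)

# Problem 3: probability -> value, using the inverse cumulative distribution
sat_median = sat.inv_cdf(0.50)
sat_top_10 = sat.inv_cdf(0.90)       # cutoff for the top 10%
sat_bottom_10 = sat.inv_cdf(0.10)    # cutoff for the lowest 10%
sat_middle_50 = (sat.inv_cdf(0.25), sat.inv_cdf(0.75))

print(round(p_over_10, 4), round(p_at_most_10, 4), round(p_5_to_6_5, 4))
print(round(sat_median), round(sat_top_10), round(sat_bottom_10))
print(tuple(round(score) for score in sat_middle_50))
```

The two directions mirror the two kinds of question: problem 2 starts from a weight and asks for an area, while problem 3 starts from an area and asks for a score.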

4. a.


b. The Empirical Rule is a summary of what we have done above. It is a nice rule-of-thumb.
5. a. The distribution would maintain its exact shape, though it would be shifted 10 units to the right.
b. The distribution would become wider and have a lower peak. This must happen to make sure the area is still 1 when the distribution becomes wider.
c. The distribution would become narrower and have a higher peak. If a distribution becomes narrower, its height must increase to maintain an area of 1.
d. The mean, $\mu$, determines where the distribution is centered without altering its shape. The standard deviation, $\sigma$, will make a distribution wide and low-peaked if it is large, and narrow and high-peaked if it is small.

6.1 The Sampling Distribution for $\bar{x}$

1. Answers vary
2. Answers vary – emphasis on the ability to have a population distribution with any unknown shape.
3. a. 0.2525
b. 0.2514
c. 0.9044
d. 95.1 and 104.9
4. a. 0.0272


b. This might indicate that the production process is outside of the norm. This type of average is unlikely in a sample of size ( ) where ( ). The company should investigate why the average thickness of its glass samples is so high.
5. a. It should be approximately normal, regardless of the distribution of revenues.
b. 0.3869
c. The standard deviation of means would change from $6,957 to $4,400. This would change ( ). This makes sense, since the distribution of means is less spread out, and so there will be fewer mean sales amounts beyond $420,000.
d. $421,255.50; if the team averages more than this amount for each team member, then they will receive the paid vacation days.
6. We know that $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, so ( ), so ( ). A person's income varies, on average, by about $2,906.89 from the population average of incomes.
7. a. ( ) rooms and ( ) rooms (NOTE: be sure to use stdev.p(), since this is a population standard deviation we want)
b. It should be approximately normal based on the Central Limit Theorem; the sample size of 30 satisfies the minimum required sample size to meet normality assumptions.
c. Answers will vary slightly due to sampling variability of the simulation process; ( ). We see that $\mu_{\bar{x}} \approx \mu$, as expected. We also see that $\sigma_{\bar{x}} \approx \sigma/\sqrt{n}$, which is what we obtained via simulation.
d. Answers will vary slightly due to sampling variability of the simulation process; ( ). We see that $\mu_{\bar{x}} \approx \mu$, as expected. We also see that $\sigma_{\bar{x}} \approx \sigma/\sqrt{n}$, which is what we obtained via simulation.
e. Answers will vary slightly due to sampling variability of the simulation process; ( ). We see that $\mu_{\bar{x}} \approx \mu$, as expected. We also see that $\sigma_{\bar{x}} \approx \sigma/\sqrt{n}$, which is what we obtained via simulation.
f. The population standard deviation can be thought of as the standard deviation of the distribution of means from a sample of size $n = 1$. That is, $\sigma_{\bar{x}} = \sigma/\sqrt{1} = \sigma$. Since it is the smallest possible sample size, it will have the highest degree of variability.
g. 0.000 or about 0% chance
h. As with tossing a coin repeatedly, when something is repeated over and over again, the amount of variation in the outcomes becomes relatively small. That is, any mild outliers get averaged into a large sample of typical values, and their effect is dispersed. In small samples, the opposite holds – deviant values are highly corrosive to the sample mean.

6.2 Confidence Interval for $\bar{x}$

1. Answers vary

2. Answers vary
3. a. No, the sample size is 10, which is less than the minimum required (30).
b. ( )
c. We are 95% confident that the population average labor cost is between $109.6 billion and $227.6 billion.
d. About 0.213
4. a. Yes, since they can be 95% confident that the average revenue per camera will be between $654.51 and $752.44.
b. No, since they can be 99% confident that the average revenue per camera will be between $637.42 and $768.01, which includes the possibility of the average being lower than $640.
c. Yes, the sample size is 30, which is the minimum required sample size for the CLT results to be applied.
d. We know that $\mu_{\bar{x}} = \mu$, which we are estimating by $\bar{x}$. That is, we are assuming the sample mean is the population mean for the basis of our interval. Here, $\bar{x}$ = ( ). Similarly, $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. We are using $s$ to estimate $\sigma$. Thus, our estimate of $\sigma_{\bar{x}}$ is $s/\sqrt{n}$ = ( ). Using our probability calculator, we find:

Our 95% confidence interval would be 652.1 to 755.1, which is close to our bootstrap confidence interval. It is a bit wider than we would like.

e. Here we have that ( ). We have 5% to split between the tails. Thus, 2.5% in each tail. We find that ( ) (same number of standard deviations from the mean to each tail, since the distribution is symmetric):

We have that $\bar{x}$ = ( ) and ( ). So our interval will be $\bar{x} \pm$ ( ), where ( ).

Thus, our interval is: ( )

Or ( )

This is a bit wider, accounting for the extra variability in estimating $\mu$ and $\sigma$.
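The two intervals in problem 4d and 4e differ only in the critical value that multiplies the estimated standard error. The sketch below illustrates this under assumptions that are not stated in the surviving text: a sample mean of 703.6 and an estimated standard error of 26.28 (both back-solved from the reported interval of 652.1 to 755.1), and a t critical value of 2.045 for 29 degrees of freedom (from n = 30, taken from a standard t table). The helper name is illustrative.

```python
from statistics import NormalDist


def interval(center, std_error, critical_value):
    """Return (lower, upper) for center +/- critical_value * std_error."""
    margin = critical_value * std_error
    return center - margin, center + margin


# Assumed inputs (hypothetical; back-solved from the interval 652.1 to 755.1)
x_bar = 703.6
std_error = 26.28          # estimate of s / sqrt(n)

# 95% confidence leaves 2.5% in each tail
z_critical = NormalDist().inv_cdf(0.975)
t_critical = 2.045         # assumed table value for 29 degrees of freedom

z_low, z_high = interval(x_bar, std_error, z_critical)
t_low, t_high = interval(x_bar, std_error, t_critical)

print(f"normal-based: {z_low:.1f} to {z_high:.1f}")
print(f"t-based:      {t_low:.1f} to {t_high:.1f}")
```

Since the t critical value exceeds the normal one for any finite sample, the t-based interval is always the wider of the two, which matches the closing remark of the answer.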

6.3 Confidence Interval for $\hat{p}$

7.1 The Concept Behind Hypothesis Testing

1. The null hypothesis is assumed to be true and is usually based on what has been observed before. The alternative hypothesis is what we would like to test, which is something that would challenge past observations or assumptions about a population. 2. We assume it is true because it is based on past observations or research. For example, if the Census Bureau finds that 35% of Americans enjoy hypothesis testing, then this is typically based on some fairly extensive research. If a researcher believes this rate is greater in his community, then he can test his alternative hypothesis. 3.

4.

5.


6. This is the probability that we reject the null hypothesis when it is, in fact, true. That is, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) = 0.05$. This allows us to be 95% confident that we fail to reject $H_0$ when it is true, a correct decision.
7. 1) Hypotheses:

2) Decision Rule: We will reject the null hypothesis when the likelihood of observing something as small or smaller than 52 out of 1000 bushels is no larger than a 1% probability, under the assumption of the null hypothesis. That is, ( )
3) We will reject $H_0$ if the observed value of $\hat{p}$ is smaller than some cutoff value of $\hat{p}$.
4) Based on the sample evidence, we will either:
a. Reject $H_0$ in favor of $H_a$: less than 7% of insect-related crop destruction for the farmer's new method.
b. Fail to reject $H_0$. We do not have sufficient evidence to conclude that the farmer's new method is better than his old method.
8. 1) Hypotheses:

2) Decision Rule: We will reject the null hypothesis when the likelihood of observing something as large or larger than $\bar{x}$ = ( ) is no larger than a 5% probability, under the assumption of the null hypothesis. That is, ( )
3) We will reject $H_0$ if the observed average of $\bar{x}$ is larger than some cutoff value of $\bar{x}$.
4) Based on the sample evidence, we will either:
a. Reject $H_0$ in favor of $H_a$: ( ) out of 5 questions are answered correctly by his students (as of recent observations).
b. Fail to reject $H_0$. We do not have sufficient evidence to conclude that the instructor's more recent students do better on the AP exam than his former students.
9. 1) Hypotheses:


2) Decision Rule: We will reject the null hypothesis when the likelihood of observing something as small/large or smaller/larger than 16 out of 1000 machines is no larger than a 1% probability, under the assumption of the null hypothesis. That is, ( )
3) We will reject $H_0$ if the observed value of $\hat{p}$ is smaller or larger than some cutoff values of $\hat{p}$. That is, if it is smaller than some value, say ( ), or larger than some value, say ( ), then we will reject $H_0$. Remember, we set up a hypothesis first, then do the test. Even though 16 is larger than 15 out of 1000, we did not know this to begin with. We are still testing whether or not this value is significantly different and do not care about the direction of the difference.
4) Based on the sample evidence, we will either:
a) Reject $H_0$ in favor of $H_a$: a proportion other than 15 out of 1000 machines fail. That is, either a significantly smaller number of them fail, or a significantly greater number of them fail.
b) Fail to reject $H_0$. We do not have sufficient evidence to conclude that new machines fail more or less when compared to the old machine.
10. 1) Type I: We conclude the farmer's method reduces crop destruction, when in fact there is no difference; Type II: We conclude the farmer's method is no different than the old method, when in fact there is less than 7% crop destruction with his new method.
2) Type I: We conclude the instructor's students perform better than his former students, when in fact there is no difference; Type II: We conclude that his new students perform just as well as his former students, when in fact they do better.
3) Type I: We conclude that the new machines fail more or less than the former machines, when in fact there is no difference; Type II: We conclude that there is no difference between the failure rates of the new and old machines, when in fact there is a significant difference.
11. Setting more stringent conditions upon the rejection process means we will reject less often. If we reject less often, then there is an elevated likelihood that we may fail to reject, when in fact we should. This is precisely what a Type II error is.
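The decision rule in problem 7 can be sketched numerically: compute the probability, under the null hypothesis, of seeing 52 or fewer destroyed bushels out of 1000, and compare it with the 1% threshold. The sketch below assumes a null proportion of 7% (taken from the Type II description in problem 10) and uses an exact binomial model, which is one reasonable choice rather than necessarily the method the book intends. The function name is illustrative.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n = 1000          # bushels sampled
observed = 52     # bushels destroyed under the new method
p_null = 0.07     # assumed null proportion (old method), from problem 10
alpha = 0.01      # maximum tolerated Type I error probability

# Lower-tail probability of a result at least as small as the one observed
p_value = binomial_cdf(observed, n, p_null)
reject_null = p_value <= alpha

print(f"P(X <= {observed} | p = {p_null}) = {p_value:.4f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

Raising alpha in this sketch makes rejection easier and lowering it makes rejection harder, which is the trade-off between Type I and Type II errors described in problem 11.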
