CHAPTER 7 ANSWERS. Section 7.1 Statistical Literacy and Critical Thinking

Author: Jordan Golden
1. A correlation exists between two variables when higher values of one variable tend to be associated with higher values of the other, or when higher values of one variable tend to be associated with lower values of the other. The term “correlation” in statistics refers specifically to a statistic calculated from a set of paired data. It always lies between –1 and 1 and indicates how closely the paired data come to falling on a straight line. This is much more specific than its meaning in general usage.

2. No. With 100 additional pairs of data, the conclusion could change so that there is a correlation.

3. The points fall very close to a straight line that goes upward to the right.

4. Answers may vary, but both of these variables are likely to increase as the cost of living index increases with inflation.

5. This statement is not sensible. There might be some other factor that caused the stork population and the number of births to increase at the same time. One such factor might be a growing population that resulted in an increase in the number of births and an increase in the number of housing structures that provide good nesting locations for the storks. It is not likely that there is a direct cause-effect relationship between the two variables.

6. This statement is not sensible. The fact that one of the two variables shows a positive effect has nothing to do with the presence or absence of a positive correlation between the two variables.

7. This statement is not sensible. The strength of a correlation is a measure of how close the data points fall to a straight line, not of how large the sample is.

8. This statement is not sensible because r must always fall between –1 and +1 inclusive. A value of 1.2 for r is not possible.

Concepts and Applications

9. These variables will be positively correlated because taller women tend to weigh more.

10. These variables will be positively correlated because more time spent studying tends to result in higher grades.

11. These variables will be negatively correlated because heavier cars tend to get lower mileage.

12. The variables are not correlated.

13. These variables will be positively correlated because running the marathon faster (in less time) will result in a lower finish order, the lowest time finishing first and the highest time finishing last.

14. These variables will have a negative correlation because the temperature decreases as the altitude increases.

15. These variables are not correlated.

16. These variables are negatively correlated because the prize money is greater for those who shoot lower scores.

17. There is a strong positive correlation, with the correlation coefficient approximately 0.8. Much of this correlation is due to the fact that a large fraction of the grain produced is used to feed livestock.

18. An estimate of the correlation coefficient is r = 0.7 to r = 0.8. Forecasts are reasonably accurate for two days in the future. Results should be similar for other two-week periods, although the values of the temperatures could be different. Generally, the accuracy of two-day forecasts does not depend on the time of year, so other two-week periods of data should show a similar pattern.

Copyright © 2012 Pearson Education, Inc. Publishing as Addison-Wesley

CHAPTER 7, CORRELATION AND CAUSALITY

19. a) [Scatter diagram; horizontal axis: Speed Limit]
b) There is a moderate positive correlation; the computed value is r = 0.59.
c) It is difficult to argue with the phrase “no guarantee”. However, four of the five lowest death rates are associated with a 55 mph speed limit. With the exception of Britain, the higher speed limits are generally associated with higher death rates. Death rates are also influenced by factors besides speed limits.
See the end of this chapter’s solutions for the details of computing r for this exercise.
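For readers who want to check a value of r by machine, here is a minimal Python sketch of the usual product-moment formula for the correlation coefficient. The function name `pearson_r` and the paired data are assumptions made for illustration; the data are not the speed-limit table from this exercise.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data (product-moment formula)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data, NOT the table used in the exercise.
limits = [55, 55, 60, 65, 70, 75]
rates = [3.0, 3.3, 4.1, 3.9, 4.8, 5.2]

r = pearson_r(limits, rates)
print(round(r, 3))
```

Whatever data are supplied, the result lies between –1 and +1, and swapping the two lists leaves r unchanged.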

20. a) [Scatter diagram; horizontal axis: Birth Rate]
b) There is a moderate positive correlation (r = 0.50 exactly).
c) The birth rate does give some indication of population growth rate, but other factors, such as immigration and death rates, also affect population growth.

SECTION 7.1, SEEKING CORRELATION

21. a) [Scatter diagram: “Most Valuable Players (2000-2006)”; horizontal axis: Batting Average]
b) There is a slight negative correlation (r = -0.20).
c) No. First, the correlation is negative, suggesting that a higher average is associated with fewer home runs. Moreover, the negative correlation is probably mostly the result of the data for Ichiro Suzuki, who had only 8 home runs with a high average. Finally, all of the players had relatively high batting averages in comparison with the rest of the players in major league baseball. It is quite possible that if we had the data for all players, the data would look quite different. Correlations are frequently low when the range of one of the variables is restricted.

22. a) [Scatter diagram: “Movie Receipts and Attendance”; horizontal axis: Total Receipts ($Billions); vertical axis: Total Attendance (Billions)]
b) There is a strong positive correlation between attendance and receipts. The actual value of r is 0.96.
c) The bottom line is total receipts, which almost doubled between 1990 and 2002. Attendance has also increased, but by a little less than 40%. The increased attendance must also be encouraging to movie executives, especially since it occurred while prices were also increasing.

23. a) [Scatter diagram; horizontal axis: Household Income]
b) There is a strong negative correlation between income and the number of TV hours per week (computed value r = -0.86).
c) Families with more income have more opportunities to do other things. It may also be that some people have higher incomes because they spend more time earning money, leaving less time to watch TV. Obviously, just watching less TV will not increase your income, but if you use that time for income-producing activities, your income could increase.

24. a) [Scatter diagram; horizontal axis: Mean January High Temperature]
b) There is a weak negative correlation (computed value r = -0.30). This is probably due to the one data point for Bombay in the lower right part of the graph. The rest of the data have a U-shaped configuration, suggesting little or no correlation.
c) Mean high temperature and precipitation are fairly uncorrelated. Note that the outlier at (0.1,88) affects the correlation coefficient significantly, making it more negative than it would be if that point were removed from the data.

25. a) [Scatter diagram; horizontal axis: Total Sales (Billions)]
b) There is a strong correlation between sales and earnings (computed value r = 0.94). The strong correlation in this case is highly affected by the Wal-Mart data, although the general trend for the rest of the data still shows a positive correlation.
c) Higher sales do not necessarily translate into higher earnings. Some companies may have larger expenses, driving earnings down.

26. a) [Scatter diagram; horizontal axis: Mean Daily Calories]
b) There is a strong negative correlation between the mean daily calories and the infant mortality rate (r = -0.94).
c) The mean daily calories is an indicator of both the health and wealth of a nation, so it is not surprising that lower infant mortality is associated with higher caloric intake.

27. True. Since correlation measures how close the points are to a straight line, the correlation should be the same if the variables are interchanged.

28. True. Since correlation measures how close the points are to a straight line, the correlation should be the same if we change the units of measurement to something that is directly proportional to the original units (such as yards to feet or pounds to ounces). This statement would not be true if we changed the units by taking square roots or squaring the data values.

SECTION 7.2, INTERPRETING CORRELATIONS

Section 7.2 Statistical Literacy and Critical Thinking

1. If there is a significant correlation between two variables, it means that the two variables increase or decrease together in a straight-line relationship. It does not mean that one variable has a direct effect on the other variable.

2. There are at least two things wrong with this statement. First, consider how the study must have been done. To establish any connection between putting infants to sleep in the supine position and the numbers of SIDS deaths, one would have to obtain information on the practice of a large number of parents in putting their infants to bed. Even if that were done, both variables would have been observed (as opposed to conducting an experiment in which some parents were required to put their infants to sleep in the supine position and others were not allowed to do so), in which case no evidence can be obtained that one variable causes the other. In fact, no experiment was performed and no data on the practices of parents were collected. What did happen was that the SIDS death rate decreased during a time when pediatricians advised the supine sleeping position. Whether parents followed that advice would be important to establishing a relationship between the practice and the SIDS death rate, but it cannot establish that such a relationship is a cause-effect relationship.

3. Outliers are values that are very far away from almost all of the other values in a data set. Outliers might make it appear that there is a significant correlation that is not real, or they might mask a real correlation.

4. A scatter diagram is a graph of paired data. Each point represents one pair of data values. One axis is used for one variable and the other axis is used for the other variable. A scatter diagram helps us to visualize relationships between the two variables.

5. The statement is not sensible. Although it might seem to make sense that increasing a car's weight causes the fuel consumption rate (in miles per gallon) to decrease, we cannot draw that conclusion based solely on the correlation. The correlation does not imply a cause-and-effect relationship.

6. The statement is not sensible. Although a correlation cannot be the basis for concluding that exercise causes better health, there is other evidence to suggest that it does.

7. The statement is not sensible. Instead of concluding that drinking causes car crashes, we can only conclude that there is a linear relationship between drinking and car crashes. A correlation does not imply causality.

8. The statement is not sensible. Even though the additional pair of data is only one pair among 21, its effect can be very strong. The value of the correlation coefficient could change substantially.

Concepts and Applications

9. There is a positive correlation between the number of registered handguns and the crime rate. This correlation is probably due to a common underlying cause. Many crimes are committed with unregistered handguns, some with no handguns. It is possible that a rising crime rate leads people to purchase handguns for protection (and register them).

10. There is a negative correlation that is probably due to a direct cause. As people run, they burn calories, and as they run further, they burn more calories. Keeping in mind that food consumed is also a factor in weight gain or loss, it is also possible that those who weigh less can run further since they have less weight to carry with them. Thus the question is not as simple as it might first appear. The truth may be somewhere in between. Some people weigh less because they run, and some may run more because their lower weight makes running enjoyable.

11. There is a positive correlation that is due to a direct cause. Toll fees along the Massachusetts Turnpike are based on the distance traveled.

12. There is a positive correlation that is due to a direct cause. With more and more vehicles sharing the same roads, congestion is more likely to occur, and drivers must wait longer.

13. This is a positive correlation that is probably due to a common cause, such as the general increase in the number of cars and traffic. That is, the increase in vehicles requires that some intersections get new lights to control the traffic and, at the same time, the increase in vehicles is in part responsible for the increased number of crashes.

14. There is a positive correlation between the distance and the speed. Astronomers can explain the correlation with a direct cause.

15. There is a negative correlation that is probably due to a direct cause. As gas prices increase, people can’t afford to drive as much or as far, so they cut costs by not driving longer distances.

16. There is a negative correlation between the incidence of melanoma and latitude. Since melanoma can be caused by too much exposure to the sun’s rays, and latitude decreases as you approach the equator, this makes some sense, since the warmer temperatures probably also mean that people have more skin exposed. There have been several studies of this phenomenon. One concludes that the negative correlation holds only for non-Hispanic whites. Another concludes that the negative correlation disappears entirely, even for whites, when other possible explanatory variables are taken into account.

17. a) The point (0.4, 1.0) is an outlier since it lies far from the rest of the data points. Without the outlier, there is no linear relationship between the variables, so the correlation is zero.
b) With the outlier included, there is a negative correlation (actual value is r = -0.57).

18. a) The point (0.5, 1.0) is an outlier. Without the outlier, there is a strong positive correlation, probably 0.95 or larger (actual value is r = 0.9894).
b) With the outlier, the correlation is still positive, but much less than without it (actual value is r = 0.851).

19. a) [Scatter diagram; horizontal axis: Weight (lbs)]
The actual correlation coefficient is r = 0.92, which is significant at the 0.01 level, so there is a very strong correlation between weight and shoe size.
b) If you look at just the first five points, there does not appear to be a strong linear relationship between weight and shoe size. The same is true if you look only at the last five points. It appears that the apparent correlation found in part (a) is more a result of gender than of weight.

20. a) [Scatter diagram; horizontal axis: January High]
It would appear that there is a strong negative correlation (actual r = -0.87).
b) There does not appear to be any correlation within each of the two groups of five cities. The correlation in part (a) is due to the fact that summer and winter are reversed in the Northern and Southern Hemispheres.

21. a) The actual correlation coefficient is r = 0.77. This is significant at the 0.01 level, indicating a strong correlation.
b) The 16 points to the left correspond to relatively affluent countries, such as Sweden, with low birth rates and low death rates. The remaining points on the right correspond to relatively poor countries, such as Uganda, with higher birth rates and higher death rates. There appears to be a negative correlation between the variables for the wealthier countries and a positive correlation for the poorer countries.
c) You could confirm the correlations by dividing the data into two groups and finding the correlation in each group. You can confirm the location of the data points for individual countries such as Sweden and Uganda by entering “birth rates” in Google and choosing one of the web sites that shows birth and death rates for different countries. One such site is http://encarta.msn.com/media_701500528/Birth_and_Death_Rates_by_Country_or_Region.html. There you will find that the birth and death rates in 2002 for Sweden are 9.8 and 10.6, while for Uganda, they are 47.2 and 17.5. A scatter diagram for all 223 countries listed on this site is shown below. The correlation coefficient for all of these data is r = 0.41.
[Scatter diagram; horizontal axis: Birth Rate; vertical axis: Death rate]

22. a) [Scatter diagram; horizontal axis: Reading Time (hrs/week)]
The actual correlation coefficient is r = 0.007, indicating no correlation.
b) The data could be divided into a group of points with an upward trend (rows 1, 3, 5, 7, 9 of the table) and another group with a downward trend. It’s plausible that the first group corresponds to the book readers and the second to the comic readers. Other divisions of the data are also possible.
c) If the conjecture is correct, then book reading time is strongly positively correlated with test scores (r = 0.97) and comic reading time is strongly negatively correlated with test scores (r = -0.98).

SECTION 7.3, BEST-FIT LINES AND PREDICTION

Section 7.3 Statistical Literacy and Critical Thinking

1. A best-fit line (or regression line) is a line on a scatter diagram that lies closer to the data points than any other possible line (according to the standard statistical measure of closeness). That is, if you find the difference between each data y value and the corresponding y value for the point directly above or below it on the line, square that difference, and then total the squares of the differences, the best-fit line has the lowest total possible. A best-fit line is useful for predicting the value of the y variable given some value of the other variable (x variable) within the range of the x values in the data set.

2. Multiple regression involves determination of the best-fit equation that describes the relationship between one variable and two or more other variables.

3. r² is the square of the correlation coefficient, and it is the proportion of the variation in a variable that is explained by that variable’s linear relationship with another variable (as expressed by the best-fit line for the two variables).

4. R² is the coefficient of determination for a regression equation that relates one variable to at least two other variables. (Actually, R² is the same as r² if there is only one other variable.) In multiple regression, it is a measure of how well the best-fit equation actually fits the sample data. If R² is close to 1, it means that the regression equation fits the data very well. If R² is close to zero, it means that the regression equation is virtually useless for describing the relationship between the variables.

5. This statement does not make sense. The paired data represent the same height using different scales, so the data values should fall exactly on a straight line (except for possible rounding). The correlation coefficient should be 1 or very close to 1, and therefore r² should also be very close to 1.

6. This statement does not make sense. Since r² is obtained by squaring r, it must always be positive or zero, never negative.

7. This statement does not make sense. A woman 120 inches tall would be 10 feet tall, but that is well beyond the scope of the data, and it does not make sense to make predictions outside the scope of the data.

8. This statement makes sense and it is a reasonable interpretation of the value r² = 0.926. Physically, an interpretation might be that the longer the eruption, the longer it takes to build up enough pressure to force the next eruption.

Concepts and Applications

9. a) Here is one way to add a best-fit line to the scatter diagram. Enter the Color and Price data from Table 7.1 into an Excel spreadsheet with Color in the left-hand column. Highlight the two sets of data (including column headings) and click on the graph icon at the top of the spreadsheet. Select the (XY) scatter option and use the top subtype option. Click on Next twice. Using the tabs, you can now add titles to your graph, remove the grid lines (just uncheck them), and remove the legend (uncheck it). Click on Finish. The graph should appear on the spreadsheet. Now right-click on any one of the data points in the graph and click on Add Trendline. Select the Linear option and click OK. The result is shown below.
[Scatter diagram with best-fit line: “Diamond Price and Color”; horizontal axis: Color; vertical axis: Price]
b) You can use Excel to find the correlation as well. Assuming that you have placed the data in columns A and B, rows 2 through 24, in any new cell enter the expression =correl(a2:a24,b2:b24) and press Enter. The actual r value will appear as -0.16323; squaring this, we find that r² = 0.026645, so about 3% of the variation in price can be explained by the best-fit line.
c) The best-fit line should not be used to make predictions since r² is so close to zero.

10. a) Use a ruler to obtain the approximate life expectancies and infant mortality rates from Figure 7.4. You can use Excel as in Exercise 9 to recreate the scatter diagram and add a best-fit line. The result is
[Scatter diagram with best-fit line; horizontal axis: Life Expectancy (Years)]
b) Actual r = -0.90; r² = 0.81. About 81% of the variation in infant mortality can be explained by the best-fit line.
c) The best-fit line could be used to make predictions of infant mortality given the life expectancy.

11. a) [Scatter diagram with best-fit line; horizontal axis: No. of Farms (Millions)]
b) Actual r = -0.99; r² = 0.98. About 98% of the variation in farm size can be explained by the best-fit line.
c) The best-fit line could be used to make predictions within the range of the number of farms included in the data. Because there does appear to be a slight curvature to the points, predicting outside that range should not be done.

12. a) [Scatter diagram: “Same-Day Forecast”; horizontal axis: Forecast Temperature; vertical axis: Actual Temperature]
[Scatter diagram: “3-Day Forecast”; horizontal axis: Forecast Temperature; vertical axis: Actual Temperature]
b) We estimate r to be about 0.8 in the first graph and 0.65 in the second. Then r² is 0.64 for the first graph and 0.42 for the second.
c) For the same-day forecast, the actual correlation coefficient is r = 0.80; r² = 0.64; about 64% of the variation in actual temperatures is explained by the best-fit line. For the three-day forecast, the actual correlation coefficient is r = 0.65; r² = 0.42; about 42% of the variation in actual temperature can be explained by the best-fit line.
d) The best-fit line is more reliable for the same-day predictions than for the three-day predictions.

In Exercises 13-20, we will show the actual best-fit lines in part (a). Your line may be different since you will have tried to draw the line by eye. Similarly, in part (b), we will show the actual values of r². Your estimates may be different.

13. a) [Scatter diagram with best-fit line; horizontal axis: Speed Limit]
b) The correlation is moderately positive. The actual value of r is 0.587, so r² = 0.345. Thus, about 35% of the variation in the death rate can be accounted for by the linear relationship with the speed limit.
c) (75,6.1) and (70,3.5) are both possible outliers, the first because it is away from most of the data points and the latter because the death rate is lower than might be expected considering the rest of the data. Because one point is above the best-fit line and one is below, the net effect of the two points is probably to cancel each other out, since both points will “pull” the line toward themselves.
d) No. It would be difficult to conclude that the model is necessarily linear outside the range of the data. Higher speed limits might cause the death rate to increase faster than indicated by the line, while lower speed limits might cause the death rate to decrease faster than indicated by the line. Furthermore, the value of r is too small to consider predictions based on the best-fit line to be reliable.
See the end of this chapter’s solutions for the details of the computation of the best-fit line for this exercise.

14. a) [Scatter diagram with best-fit line; horizontal axis: Birth Rate]
b) The actual r = 0.50; r² = 0.25; 25% of the variation can be accounted for by the best-fit line.
c) (21.0,17.9) is a possible outlier. The best-fit line would have a steeper slope if that point were removed. (16.3, 50.1) is also a possible outlier. The best-fit line would have a lower slope if that point were removed.
d) Predictions based on the best-fit line are not reliable, especially outside the range of the data.

15. a) [Scatter diagram with best-fit line: “Most Valuable Players (2000-2006)”; horizontal axis: Batting Average]
b) The correlation between batting average and number of home runs is very weak. The actual value of r is -0.204; r² = 0.041. Only about 4% of the variation in home runs can be accounted for by a linear relationship with batting average for these most valuable players.
c) There is one outlier at (0.350, 8). The effect of this outlier is to pull the right end of the line down. If this point were removed, the correlation would become –0.03 and the best-fit line would be almost horizontal.
d) This best-fit line is worthless for predicting the number of home runs from the batting average for these most valuable players. It definitely should not be used for prediction of home runs outside the range of these data.

16. a) [Scatter diagram with best-fit line: “Movie Receipts and Attendance”; horizontal axis: Total Receipts ($Billions); vertical axis: Total Attendance (Billions)]
b) The actual r = 0.96; r² = 0.92; 92% of the variation in attendance can be accounted for by the best-fit line.
c) No outliers.
d) Predictions based on the best-fit line should be reliable within the range of the data shown. However, as always, prediction of attendance values for receipts outside the range of the data points is not a good practice.
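The best-fit lines in these exercises come from the least-squares criterion, and the repeated warning about predicting outside the range of the data can be built into the calculation. The Python sketch below shows one way to do both; the function names `best_fit_line` and `predict` and the receipts and attendance values are assumptions made for illustration, not the data or method used to produce the answers above.

```python
def best_fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through the mean point
    return slope, intercept

def predict(x, xs, slope, intercept):
    """Predict y from x, refusing to go outside the observed range of x."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x is outside the range of the data")
    return slope * x + intercept

# Hypothetical receipts ($ billions) and attendance (billions), NOT real data.
receipts = [5.0, 5.5, 6.2, 7.0, 8.1, 9.3]
attendance = [1.20, 1.25, 1.33, 1.42, 1.50, 1.62]

slope, intercept = best_fit_line(receipts, attendance)
estimate = predict(7.5, receipts, slope, intercept)
print(f"y = {slope:.4f}x + {intercept:.4f}; estimate at x = 7.5: {estimate:.3f}")
```

Raising an error for out-of-range x values is a design choice made here to mirror the advice in part (d) of these exercises; a gentler version could simply print a warning.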

17. a) [Scatter diagram with best-fit line; horizontal axis: Household Income]
b) The correlation is fairly strong. The actual value of r is –0.86 and r² = 0.74. About 74% of the variation in TV hours can be attributed to the linear relationship with household income.
c) The data point in the upper left of the diagram is a possible outlier since it is clearly out of line with the other four data points. The effect of this point is to pull the left end of the best-fit line upwards. With this point removed, the correlation would be –0.99.
d) No. Predicting outside the range of the data points is seldom a good idea. Even though the correlation is fairly strong for these data, remember that the two endpoints are based on assumptions that $25,000 and $70,000 can adequately represent the two end income classes. Having only five data points, one of which is an outlier, is another good reason for not using these data for predictions based on the best-fit line.

18. a) [Scatter diagram with best-fit line; horizontal axis: Mean January High Temperature]
b) The correlation is quite weak. The actual value of r is –0.30 and r² = 0.09. About 9% of the variation in the mean January precipitation can be accounted for by a linear relationship with the mean January high temperature.
c) The point in the lower right of the diagram is clearly an outlier. It has the effect of pulling the right end of the best-fit line downward. With that point deleted, the value of r changes to +0.26, so the effect of that point is quite substantial.
d) Even without the outlier, it does not appear that the data points follow a straight line. The scatter plot is more U-shaped. Therefore, without even considering the value of the correlation coefficient, the linear model is not appropriate and should not be used for predicting precipitation based on high temperatures, either within or outside of the range of the data points.

19. a) [Scatter diagram with best-fit line; horizontal axis: Total Sales (Billions)]
b) The correlation is very strong. The actual value of r is 0.941 and r² = 0.886. About 89% of the variation in profits can be accounted for by a linear relationship with total sales.
c) The data point in the upper right of the diagram for Walmart is clearly an outlier, lying far away from the rest of the data points. Because that point is close to being in line with the other points, deleting it does not change the line very much. However, the value of r drops to about 0.69.
d) Due to the lack of data between Walmart and the other points, the linear model should not be used for predictions either within or outside the range of data values. In fact, there may not be any data points possible to the right of Walmart.

20. a) [Scatter diagram with best-fit line; horizontal axis: Mean Daily Calories]
b) The correlation is very strong. The actual value of r is -0.940 and r² = 0.884. About 88% of the variation in infant mortality rate can be attributed to the linear relationship with mean daily calories.
c) There do not appear to be any outliers.
d) No. We do not know that the linear model is valid outside the range of the data points. It would appear that for countries with a mean daily calorie intake greater than 4000 calories, the infant mortality rate would be projected to be negative. Since that is not possible, making projections outside the range of the data makes no sense.

SECTION 7.4, THE SEARCH FOR CAUSALITY

Section 7.4 Statistical Literacy and Critical Thinking

1. A correlation between two variables (1) could be the result of a coincidence; (2) could be due to a common cause; or (3) could be due to a direct influence of one of the variables on the other.

2. We do not want to rule out the explanation of a direct influence of one variable on the other variable because that is the explanation that we would like to establish.

3. A confounding variable is a variable that is not included in the analysis, but it affects the variables that are included in the analysis. Failure to include or account for a confounding variable might cause a researcher to miss an underlying causality by considering only data showing no correlation.

4. Finding a correlation between two variables means finding a statistical association or relationship, and this can happen with or without a cause-effect relation between the variables. Establishing causality between two variables is finding that one of the variables has a direct influence on the other variable.

5. The statement is not sensible. We cannot conclude that a correlation implies causality, regardless of the value of the correlation coefficient.

6. This statement makes sense. If causality exists, there is normally a fairly strong correlation. A value of r = 0.013 is a very weak correlation, which rules out causality for all practical purposes.

7. The statement is not sensible. Even though coincidence is ruled out, there might be a common underlying cause that could be a possible explanation.

8. This statement is not sensible. A sample of five subjects is so small that the correlation might not be significant.

Concepts and Applications

9    The causal connection is valid. As the speed of the club head increases, more force is applied to the golf ball, so it travels farther.
10   The causal connection is not valid despite the fact that some people spend large amounts of money on magnets, believing that wearing a magnet can cure a wide variety of health problems.
11   This causal connection is valid. Alcohol is a depressant to the central nervous system, and it has several effects that include slower reaction times. This is one important reason why drinking and driving is so dangerous.
12   The causal connection is valid. As altitude increases, a person takes in fewer oxygen molecules with each breath, and the brain does not function as well.
13   Guideline 1. Guidelines 2 and 5. Guidelines 3 and 5. The headaches are associated with work days in some way. The headaches are not associated with Coke or possibly with caffeine. The headaches are possibly the result of bad ventilation in the building.
14   Susceptibility involves other factors than just smoking and varies among individuals. Also, some smokers die of other causes first.
15   Smoking can only increase the risk already present.
16   The study compared the life expectancy of conductors with that of all American males, including those who die as infants and children. Conductors don’t usually become conductors until they are middle-aged, say at least 30 years old. So they are not a representative sample of the population (they are older on average) and should be expected to have a higher average life span than that of all males.
17   This was an observational study. Later childbearing reflects an underlying cause. While it’s possible that the conclusions are correct, there are other possible explanations for the findings. For example, it’s also possible that the younger women lived during a time when having babies after age forty was less likely (by choice). It is still possible for them to live to be 100.
18   Since this was an observational study, it is not possible to establish cause and effect. The people who live near the high-voltage lines may all be exposed to some other common cause of cancer in the same area; for example, radium in the soil or pollutants in the water sources or air. Any experiment to isolate the cause (for example, removing the high-voltage power lines) will require many years to be conclusive.
19   Availability is not itself a cause. Social, economic, or personal conditions may also cause individuals to use the available weapons.
20   The vasectomies do not cause prostate cancer; it’s the visits to the doctor that increase the chance of detecting cancer.

Chapter 7 Review Exercises

1    The correlation between tar and nicotine is very strong. The value of r² is 0.925, so 92.5% of the variation in nicotine can be explained by a linear relationship between tar and nicotine.
2    The correlation between carbon monoxide and nicotine is very strong. The value of r² is 0.826, so 82.6% of the variation in nicotine can be explained by a linear relationship between carbon monoxide and nicotine.
3    The correlation between tar and carbon monoxide is very strong. The value of r² is 0.958, so 95.8% of the variation in carbon monoxide can be explained by a linear relationship between carbon monoxide and tar.

4    [Scatter plot: Nicotine (vertical axis, 0 to 1.6) versus Tar (horizontal axis, 0 to 20).] Because the pattern of the points is close to a straight line, the scatter plot indicates that there is a strong correlation between the tar and nicotine content of the cigarettes.
5    Data would consist of the death rate in a neighborhood and the distance between the neighborhood and power lines for many neighborhoods. It may be possible to establish a correlation between power lines and leukemia deaths, but it would be very difficult to establish a causal relationship. The main problem with the claim is that it is based on observational data, not data from an experiment. It is, therefore, not possible to establish a causal relationship between proximity to power lines and leukemia.
6    The points on the scatter diagram lie on a straight line with negative slope (falling to the right).
7    Correlation alone never implies causation, and, in this case, certainly more trips to the dentist do not cause higher incomes. Households with more disposable income can afford more trips to the dentist or can afford dental insurance that covers the costs of the trips.
8    Variables affecting the value of a home might include its location, size, age, condition, and lot size. Location is often cited as the most important factor, with considerations including nearness to schools, shopping, churches, medical facilities, jobs, etc. The age of the previous owner would be unrelated to the value.
9    The data values that were collected were uncorrelated. It’s still possible that the variables represented by the data values are related in some nonlinear way, i.e., the scatter plot forms a curve instead of a straight line.
10   The scatter diagram does not show any obvious evidence of a linear relationship between the variables, so the correlation coefficient should be close to zero. (The actual value of r is –0.056.)

Chapter 7 Quiz

1    Every possible correlation coefficient must lie between the values of –1 and 1.
2    The variables in a and b are not likely to be correlated, while those in c, d, and e are likely to be correlated.
3    Only statement c can be used to describe the relationship between the two variables. Statement d is not valid regardless of the strength of the correlation. Statement e is not valid since it is possible, but not ensured, that one of the variables is the direct cause of the other variable.
4    There is a strong negative correlation between x and y in the diagram, probably in the neighborhood of –0.9. (The actual value of r is –0.934.)
5    Yes. From Table 7.3, we see that a correlation based on 7 data points is significant at the 0.01 level if it is greater than 0.875 or less than –0.875.
6    False. It may happen that one of the variables is the direct cause of the other, but it is not guaranteed.
7    False. A scatter diagram is any diagram in which the values of paired data are plotted, regardless of the presence or absence of any pattern.
8    True.
9    False. The year 2015 is well outside the range of years for which the data were collected. Tax rates, deductions, and exemptions could change prior to 2015, making any such predictions invalid.
10   True. The value of r could be either 0.3 or –0.3.
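The significance check described in Quiz answer 5 can be sketched in a few lines of Python. This is only an illustration: the single critical value 0.875 (7 data points, 0.01 level) is the one quoted from Table 7.3, the function name `is_significant` is hypothetical, and applying the check to the r = –0.934 quoted in answer 4 is an assumption about which correlation the question refers to.

```python
# Critical value quoted from Table 7.3 for 7 data points at the 0.01 level.
CRITICAL_R_N7_LEVEL_001 = 0.875


def is_significant(r, critical_value):
    """Return True when r is greater than +critical_value or less than -critical_value."""
    return abs(r) > critical_value


# Assumption: the correlation being tested is the r quoted in Quiz answer 4.
r_observed = -0.934
print(is_significant(r_observed, CRITICAL_R_N7_LEVEL_001))
```

A weaker correlation such as r = 0.3 would not pass the same check with 7 data points.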


Example of the Computations of r and the Best-fit Line

We provide here the details needed for the computation of r and the best-fit line for Exercise 19 of Section 7.1 and Exercise 13 of Section 7.3. Both exercises use the same data.

Speed Limit   Death Rate
     x             y            x²          y²          xy
    55            3.0          3025        9.00       165.0
    55            3.3          3025       10.89       181.5
    55            3.4          3025       11.56       187.0
    70            3.5          4900       12.25       245.0
    55            4.1          3025       16.81       225.5
    60            4.3          3600       18.49       258.0
    55            4.7          3025       22.09       258.5
    65            4.9          4225       24.01       318.5
    60            5.1          3600       26.01       306.0
    75            6.1          5625       37.21       457.5
Totals:
   605           42.4         37075      188.32      2602.5

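For these data, the number of data points is n = 10. The totals for each column are shown at the bottom of the column. Thus

\[ \sum x = 605; \quad \sum y = 42.4; \quad \sum x^2 = 37075; \quad \sum y^2 = 188.32; \quad \sum (x \times y) = 2602.5. \]

The formula for r is given at the end of Section 7.1. Substituting the above values for the various sums, we have

\[ r = \frac{n \times \sum (x \times y) - \left(\sum x\right) \times \left(\sum y\right)}{\sqrt{n \times \left(\sum x^2\right) - \left(\sum x\right)^2} \times \sqrt{n \times \left(\sum y^2\right) - \left(\sum y\right)^2}} = \frac{10 \times 2602.5 - 605 \times 42.4}{\sqrt{10 \times 37075 - 605^2} \times \sqrt{10 \times 188.32 - 42.4^2}} = 0.59, \]

so that

\[ r^2 = (0.59)^2 \approx 0.35. \]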

To find the best-fit line, we use the equations found at the end of Section 7.3. Substituting the various totals from the table above into those equations, we obtain:

\[ \text{slope} = m = \frac{n \times \sum (x \times y) - \left(\sum x\right) \times \left(\sum y\right)}{n \times \left(\sum x^2\right) - \left(\sum x\right)^2} = \frac{10 \times 2602.5 - 605 \times 42.4}{10 \times 37075 - 605^2} = 0.07894 \]

\[ y\text{-intercept} = b = \frac{\sum y}{n} - m \times \frac{\sum x}{n} = \frac{42.4}{10} - 0.07894 \times \frac{605}{10} = -0.536 \]

Thus the equation of the best-fit line is y = -0.536 + 0.07894x where y is the death rate and x is the speed limit. To plot the line, we need to find two points on the line. Since any two points will suffice, we will choose x = 55 and x = 75. For x = 55, we have y = -0.536 + 0.07894(55) = 3.81, and for x = 75, we have y = -0.536 + 0.07894(75) = 5.38. The best-fit line can now be drawn by connecting the points (55,3.81) and (75,5.38).
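As a check on the arithmetic, the same computation can be sketched in Python. This is a minimal illustration rather than part of the original solution: the data are taken from the table above, the formulas are the ones quoted from Sections 7.1 and 7.3, and the function and variable names (`column_totals`, `correlation`, `best_fit_line`, `speed_limit`, `death_rate`) are hypothetical.

```python
import math

# Data from the table above: speed limit (x) and death rate (y).
speed_limit = [55, 55, 55, 70, 55, 60, 55, 65, 60, 75]
death_rate = [3.0, 3.3, 3.4, 3.5, 4.1, 4.3, 4.7, 4.9, 5.1, 6.1]


def column_totals(xs, ys):
    """Return n and the five column totals used in the formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return n, sum_x, sum_y, sum_x2, sum_y2, sum_xy


def correlation(xs, ys):
    """Correlation coefficient r computed from the column totals."""
    n, sx, sy, sx2, sy2, sxy = column_totals(xs, ys)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt(n * sx2 - sx ** 2) * math.sqrt(n * sy2 - sy ** 2)
    return numerator / denominator


def best_fit_line(xs, ys):
    """Slope m and y-intercept b of the best-fit line y = b + m*x."""
    n, sx, sy, sx2, _, sxy = column_totals(xs, ys)
    m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    b = sy / n - m * sx / n
    return m, b


r = correlation(speed_limit, death_rate)
m, b = best_fit_line(speed_limit, death_rate)
print(f"r = {r:.2f}")
print(f"y = {b:.3f} + {m:.5f}x")
for x in (55, 75):
    # The two points used to draw the line.
    print(f"x = {x}: y = {b + m * x:.2f}")
```

Rounded to the precision used above, the printed values should agree with r = 0.59, the line y = -0.536 + 0.07894x, and the points (55, 3.81) and (75, 5.38).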
