Further Mathematics Bivariate Data Summary

Representing Bivariate Data

Back-to-Back Stem Plot
A back-to-back stem plot is used to display bivariate data involving a numerical variable and a categorical variable with two categories.

Example

The data can be displayed as a back-to-back stem plot. From the distribution we see that the Labor distribution is symmetric, and therefore its mean and median are very close, whereas the Liberal distribution is negatively skewed. Since the Liberal distribution is skewed, the median is a better indicator of the centre of that distribution than the mean. It can be seen that the Liberal party volunteers handed out many more “how to vote” cards than the Labor party volunteers.

Parallel Boxplots
When we want to display the relationship between a numerical variable and a categorical variable with two or more categories, parallel boxplots can be used.

Example

The 5-number summary of each class is determined.


Four boxplots are drawn. Notice that a single scale is used. Based on the medians, 7B did best (median 77.5), followed by 7C (median 69.5), then 7D (median 65) and finally 7A (median 61.5).
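The five-number summaries behind boxplots can also be computed in software. The sketch below is a minimal Python illustration using the “median of each half” quartile rule; the marks are hypothetical placeholders (chosen so the median comes out at 77.5, like 7B above), not the actual class data.

```python
# A minimal sketch of a five-number summary in Python, using the
# "median of each half" quartile rule. The marks are hypothetical placeholders
# (chosen so the median comes out at 77.5, like 7B above), not the class data.
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    lower = values[:mid]                       # lower half (median excluded if n is odd)
    upper = values[mid + 1:] if n % 2 else values[mid:]
    return (values[0],
            statistics.median(lower),          # Q1
            statistics.median(values),         # median
            statistics.median(upper),          # Q3
            values[-1])

marks_7b = [58, 63, 70, 74, 77, 78, 81, 85, 90, 94]   # hypothetical marks for one class
print(five_number_summary(marks_7b))                   # (58, 70, 77.5, 85, 94)
```

Parallel boxplots on one common scale can then be drawn, for example with matplotlib’s plt.boxplot applied to a list of such mark lists, one per class.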

Two-way Frequency Tables and Segmented Bar Charts
When we are examining the relationship between two categorical variables, a two-way frequency table can be used.

Example
67 primary and 47 secondary school students were asked about their attitude to the number of school holidays that should be given: whether there should be fewer, the same number or more school holidays. The results of the survey can be recorded in a two-way frequency table and a percentage two-way table, as shown below.

The data can also be represented in a segmented bar chart based on the data in the second table. Clearly, secondary students were much keener on having more holidays than were primary students.
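The percentage table is what the segmented bar chart is built from. The sketch below shows the conversion in Python; the individual counts are hypothetical (the original table is not reproduced here), with only the column totals of 67 primary and 47 secondary students taken from the example.

```python
# A minimal sketch of converting a two-way frequency table to column percentages,
# the form used for the segmented bar chart. The individual counts are hypothetical;
# only the column totals (67 primary, 47 secondary) come from the example.
counts = {
    "Fewer": {"Primary": 5,  "Secondary": 2},
    "Same":  {"Primary": 30, "Secondary": 9},
    "More":  {"Primary": 32, "Secondary": 36},
}

totals = {school: sum(row[school] for row in counts.values())
          for school in ("Primary", "Secondary")}              # 67 and 47

for attitude, row in counts.items():
    percentages = {school: round(100 * row[school] / totals[school], 1)
                   for school in row}
    print(attitude, percentages)
```

Each column of percentages sums to 100 (up to rounding), and those percentages give the segment heights of the bar for that school level.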


Dependent and Independent Variables
In a relationship involving two variables, if the values of one variable depend on the values of the other variable, then the former is referred to as the dependent variable and the latter as the independent variable. When the relationship between two variables is being examined, it is important to know which of the two variables depends on the other. Most often we can make a judgement about this, although sometimes it may not be possible. For example, where the ages of company employees are compared with their annual salaries, you might reasonably expect that the annual salary of an employee would depend on the person’s age. In this case, the age of the employee is the independent variable and the salary of the employee is the dependent variable. In a scatterplot, we always place the independent variable on the x-axis and the dependent variable on the y-axis.

Scatterplots


Example

There is a moderate, negative, linear relationship between the two variables.

Pearson’s Correlation Coefficient (r)

Note that outliers can have a marked effect on the value of r, so that it no longer accurately reflects the strength of the relationship.



The Coefficient of Determination (r²)
The coefficient of determination is useful when we have two variables that have a linear relationship. It tells us the percentage of the variation in one variable that can be explained by the variation in the other variable.

Example
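(The worked example itself is not reproduced in this summary.) As an illustration, the Python sketch below computes r from its definition for some hypothetical data, shows how a single outlier changes it, and reports r² as the proportion of variation explained.

```python
# A sketch, on hypothetical data, of Pearson's r computed from its definition,
# the effect of a single outlier, and r^2 as the proportion of variation explained.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]        # hypothetical, close to linear
r = pearson_r(x, y)
print(round(r, 4), round(r ** 2, 4))        # r near 1, so nearly all variation explained

# A single outlier drags r (and hence r^2) down sharply.
print(round(pearson_r(x + [7], y + [2.0]), 4))
```

For the original six points r is close to 1, so r² is close to 1 (almost all of the variation in y is explained by the variation in x); the added outlier drags r down to roughly 0.4.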


Linear Regression
If two variables have a moderate or strong association (positive or negative), we can find the equation of the line of best fit for the data and use it to make predictions. The general process of fitting lines or curves to data is called regression, and the fitted line is called a regression line.

Lines of Best Fit by Eye
The simplest method is to plot the data on a scatterplot and place a line over the plot by eye so that it best represents the pattern in the data values. This method will often give only the approximate location of the regression line.

The 3-Median Method



Step 5. Find the equation of the line (general form y = mx + c, where m is the gradient and c is the y-intercept). The gradient of the 3-median line is the gradient of the line that passes through the upper and lower median points, (x_U, y_U) and (x_L, y_L). Use the formula:

Gradient = rise/run = (y_U − y_L) / (x_U − x_L)

The y-intercept c can be read from the graph if the scale on the axes begins at zero. Otherwise, take a point on the final line and determine c by substitution.
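Steps 1–4 (ordering the data by x, splitting it into three groups and finding the median point of each group) are not reproduced in this summary, so the sketch below is a hedged Python version of the whole procedure. It takes the slope from the outer median points, as in Step 5, and sets the intercept by averaging over all three median points, which is the convention used by median-median regression on the calculator; the data at the end are hypothetical.

```python
# A hedged Python sketch of the 3-median line. The slope comes from the lower and
# upper median points, as in Step 5 above. Steps 1-4 are not reproduced in this
# summary, so the grouping and intercept adjustment below follow the usual
# median-median convention: outer groups of equal size, and an intercept
# averaged over all three median points.
import statistics

def median_point(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return statistics.median(xs), statistics.median(ys)

def three_median_line(x, y):
    data = sorted(zip(x, y))                 # order the points by x
    n = len(data)
    k, remainder = divmod(n, 3)
    if remainder == 0:
        sizes = (k, k, k)
    elif remainder == 1:
        sizes = (k, k + 1, k)                # extra point goes to the middle group
    else:
        sizes = (k + 1, k, k + 1)            # extra points go to the outer groups
    lower = data[:sizes[0]]
    middle = data[sizes[0]:sizes[0] + sizes[1]]
    upper = data[sizes[0] + sizes[1]:]
    (xl, yl), (xm, ym), (xu, yu) = (median_point(lower),
                                    median_point(middle),
                                    median_point(upper))
    m = (yu - yl) / (xu - xl)                          # gradient from the outer medians
    c = ((yl + ym + yu) - m * (xl + xm + xu)) / 3      # intercept from all three medians
    return m, c

# Hypothetical data, purely to show the calculation.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 3, 5, 6, 7, 9, 10, 12, 13]
m, c = three_median_line(x, y)
print(f"m = {m:.2f}, c = {c:.2f}")           # m = 1.50, c = -0.17
```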

Using the Calculator to Determine the Equation of the 3-Median Line
The following table gives the winning high jump heights for consecutive Olympic Games from 1956 to 2000.

1. Enter the data into Lists and Spreadsheets View.
2. Press the Home button and enter Data and Statistics View.
3. Press the TAB key and make year the independent variable (x value).
4. Press the TAB key and select height to make it the dependent variable (y value). You should see the scatter plot shown below.
5. Press Menu, Analysis, Regression and Show Median-Median. You will see the plot with the 3-median regression line as below.
6. The equation of the 3-median regression line is given as y = 0.006094x − 9.7751. This equation can be written as:
   Winning Height = 0.006094 × Year − 9.7751
7. Using this rule we can predict the winning jump in 2004 by substituting Year = 2004 into the equation:
   Winning Height = 0.006094 × 2004 − 9.7751 = 2.43728
   The regression model predicts the winning jump to be 2.44 metres in 2004. It is interesting to note that the actual winning jump in 2004 was 2.36 m, which shows that extrapolating can be unreliable.

Interpolation occurs when the value substituted into the regression equation is within the bounds of the given data. Interpolation is reliable. Predicting the winning jump for the year 1978 in the above example is interpolation.

Extrapolation occurs when the value substituted into the regression equation is outside the bounds of the given data. Extrapolation is unreliable. Predicting the winning jump for the year 2004 in the above example is extrapolation.


The Least Squares Regression Method
The least squares regression method finds the line that minimises the sum of the squares of the vertical distances from the data points to the line. We normally use CAS to generate the equation of the least squares regression line. It is given in the form y = mx + b on the calculator.

Example:

You would expect the number of skiers to depend on the depth of snow. The independent variable is the depth of snow and the dependent variable is the number of skiers. Create a scatterplot of the data. This can be done on the calculator.

1. In Lists and Spreadsheet View, enter the data in the table.


2. Hit the Home button and go to Data and Statistics View.

3. Tab to the horizontal axis and select the independent variable depth, then tab to the vertical axis and select the dependent variable skiers. The scatterplot will form.

4. It can be seen that there is a strong, positive, linear correlation between the depth of snow and the number of skiers. There is evidence to suggest that as the depth of snow increases, the number of skiers increases.

5. Next find r, the coefficient of correlation, and the coefficient of determination, r². Hit Ctrl + Left Arrow to return to Lists and Spreadsheet View. Hit Menu, Statistics, Stat Calculations, Linear Regression (mx + b). Hit the Click button and select depth from the drop-down list for X List. Hit Tab and select skiers from the drop-down list for Y List.


There is no need to enter data into the other boxes. Tab to OK and hit Enter.

The coefficient of correlation is r = 0.88402. This indicates that there is a strong, positive correlation between the depth of snow and the number of skiers. The coefficient of determination is r² = 0.781492. We can say that 78% of the variation in the number of skiers can be explained by the variation in the depth of snow.

The output also gives us the line of best fit, the least squares regression equation:

y = 186.418x + 28.3373

We can write this more clearly as:

Number of skiers = 186.418 × depth of snow (in m) + 28.3373

6. The equation of the least squares regression line can also be determined in Data and Statistics View. Hit Ctrl + Right Arrow to return to your scatterplot. Hit Menu, Analyse, Regression, Show Linear (mx + b).

The least squares regression equation is y = 186.418x + 28.3373.


Interpreting the Slope (Gradient) and y-intercept
The slope or gradient of 186.418 indicates that the number of skiers increases by about 186 for every 1 metre increase in the depth of snow. The y-intercept of 28.3373 indicates that if the depth of snow were 0, there would be about 28 skiers attending the resort.

Using the Least Squares Regression Equation to Make Predictions
Suppose we want to estimate the number of skiers when the depth of snow is 3 m. Using

Number of skiers = 186.418 × depth of snow + 28.3373
Number of skiers = 186.418 × 3 + 28.3373 = 587.5913

That is, 588 skiers. This result is reliable because we have interpolated: 3 m lies within the bounds of the depths of snow given in the table, that is, between 0.5 m and 3.6 m.

Suppose we want to estimate the number of skiers when the depth of snow is 4 m.

Number of skiers = 186.418 × 4 + 28.3373 = 774

That is, 774 skiers. The result is unreliable because we have extrapolated: 4 m lies outside the bounds of the depths of snow given in the table, that is, outside the range 0.5 m to 3.6 m.
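As a minimal sketch, the Python snippet below wraps the regression equation above in a small helper (the name predict_skiers is just for illustration) and flags whether a prediction is interpolation or extrapolation; the coefficients and the 0.5 m to 3.6 m range of depths come from the worked example.

```python
# A minimal sketch of using the regression equation above for prediction, with a
# check for extrapolation. The coefficients and the 0.5 m to 3.6 m range of snow
# depths come from the worked example; the helper name is just for illustration.
SLOPE, INTERCEPT = 186.418, 28.3373
DEPTH_MIN, DEPTH_MAX = 0.5, 3.6              # range of depths in the data

def predict_skiers(depth_m):
    estimate = SLOPE * depth_m + INTERCEPT
    kind = "interpolation" if DEPTH_MIN <= depth_m <= DEPTH_MAX else "extrapolation (unreliable)"
    return round(estimate), kind

print(predict_skiers(3))   # (588, 'interpolation')
print(predict_skiers(4))   # (774, 'extrapolation (unreliable)')
```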


Calculating the Least Squares Regression Line Manually
If you are given Pearson’s correlation coefficient r, the means of the independent and dependent variables, and the standard deviations of the independent and dependent variables for a set of data, the least squares regression equation can be calculated.

If the least squares regression equation is y = mx + c, then the gradient is

m = r × s_y / s_x, where

r is the coefficient of correlation,
s_x is the standard deviation of the independent variable, and
s_y is the standard deviation of the dependent variable.

The least squares regression line always passes through the point (x̄, ȳ), so

ȳ = m x̄ + c, where

x̄ is the mean of the independent variable and
ȳ is the mean of the dependent variable.

This equation can be rearranged to give c = ȳ − m x̄, so c can be found by substitution.

Example:
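(The worked example itself is not reproduced in this summary.) As an illustration, the short Python sketch below applies the two formulas to some hypothetical summary statistics.

```python
# A minimal sketch of the manual calculation m = r * sy / sx and c = ybar - m * xbar.
# The summary statistics are hypothetical placeholders, not values from the example.
r = 0.92                    # Pearson's correlation coefficient
x_bar, s_x = 10.0, 2.5      # mean and standard deviation of the independent variable
y_bar, s_y = 50.0, 8.0      # mean and standard deviation of the dependent variable

m = r * s_y / s_x           # gradient
c = y_bar - m * x_bar       # y-intercept, since the line passes through (x_bar, y_bar)
print(f"y = {m:.3f}x + {c:.3f}")   # y = 2.944x + 20.560
```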



Residual Analysis
There is no guarantee that a linear regression model will be appropriate for a bivariate data set. One way to check that a regression line is viable is to look at the differences between the actual y values and the y values predicted by the regression equation. These differences are referred to as residual values, because they are the bits left over. The residual values can be determined using the formula:

residual value = y (actual value) − y (predicted value)

A residual is the vertical distance between a data point and the regression line.


We can create a plot of the residuals very easily using CAS. For example, suppose the values given are those in the table below:

x: 2    3    4    5    6
y: 4.8  5.4  6.0  6.5  6.9

1. Enter the data into Lists and Spreadsheets View.
2. Press the Home button and enter Data and Statistics View.
3. Form the scatter plot, ensuring the x values are along the x-axis and the y values are along the y-axis.
4. Click Menu, Analyse, Regression, Show Linear (mx + b). This will insert the regression line. On first inspection it appears a good fit.
5. Tab to the y-axis and click stat.resid. The residual plot appears.

Notice that the points do not appear to be distributed randomly about zero: a pattern, roughly the shape of a parabola, is evident. This means that the linear regression model has not accounted for this pattern in the relationship, so although the coefficient of determination is high (r² = 0.9933), other models may need to be tried. Also, if you move the mouse pointer over a point, its coordinates are given.
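The same residuals can be reproduced in Python. The sketch below fits the least squares line to the table above with numpy and prints the residuals; the alternating pattern of their signs is the curve seen in the residual plot.

```python
# A sketch reproducing the residuals for the table above with numpy. The fitted
# line and r^2 should match the calculator output (r^2 of about 0.9933).
import numpy as np

x = np.array([2, 3, 4, 5, 6])
y = np.array([4.8, 5.4, 6.0, 6.5, 6.9])

m, b = np.polyfit(x, y, 1)           # least squares slope and intercept
residuals = y - (m * x + b)          # actual minus predicted

r = np.corrcoef(x, y)[0, 1]
print(f"y = {m:.2f}x + {b:.2f}, r^2 = {r**2:.4f}")
print(np.round(residuals, 3))        # signs run -, +, +, +, -: a curved pattern
```

A residual plot is then just these residuals plotted against x, for example with matplotlib’s plt.scatter(x, residuals), which corresponds to the stat.resid plot on the calculator.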


Transforming to Linearity
We have seen that there is no guarantee that a linear regression model (straight-line model) will be appropriate for a bivariate data set, even when the coefficient of determination is high. It may be that a non-linear model, such as a quadratic, logarithmic or reciprocal model, is better suited to the data. One approach is to “straighten out” the curve by transforming one of the variables, fit a linear model to the transformed data, and then rewrite it as the appropriate non-linear model in the original variables.


Example
Air is trapped in a syringe, and then weights are added to the plunger, which increases the pressure exerted on the trapped air. For each weight, the pressure and resulting volume are recorded.

Volume (mL):    29.5   25.1   22.5   20     18.2   16.7   15.5   14.2   13.4
Pressure (kPa): 98.6   111.3  124    136.7  149.4  162.1  174.8  187.5  200.2

1. Using CAS, enter the data from the table into Lists and Spreadsheets View:

2. Press the Home button and enter Data and Statistics View, then create the scatter plot, making volume the independent (x) variable and pressure the dependent (y) variable. On inspection it appears that the plot is curved, not linear.

3. Press Menu, Analyse, Regression, Show Linear (mx + b) to insert the least squares regression line.


4. Tab to the y-axis and select stat.resid to create a residual plot.

Notice that the residual points form a curved pattern about the x-axis. This suggests a linear model is not the best model for the data, even though the coefficient of determination is high (r² = 0.9425). Also, if the mouse is moved over a point, its coordinates are given.

5. In order to “straighten” the plot we will use the reciprocal transformation 1/x; that is, we will reciprocate the volume values. Press Ctrl + Left Arrow to return to Lists and Spreadsheets View. Give column C the heading recvol (short for reciprocal volume). In the grey area just below the new heading in column C, enter the formula 1/vol and press Enter. Column C should automatically fill with the reciprocal values.


6. Press the Home button and enter Data and Statistics View. Plot the pressure (y values) against recvol (x values) to form a new scatter plot of the reciprocal data. Also insert the least squares regression line. Notice the line is a very close fit.

Remember that we have plotted y (pressure) against 1/x (1/volume). The calculator reports the model in the form y = mx + b, where the symbol x refers to the values used for the independent variable, which we recognise as 1/x due to the transformation.

7. So the transformed equation can be written as:

y = 2499/x + 12.6

or

Pressure = 2499/Volume + 12.6

Using this formula we can find the pressure if a volume is given. For example, if the volume is 19 mL:

Pressure = 2499/19 + 12.6 = 144.13

The pressure is 144.13 kPa. This result is reliable because we have interpolated: the value 19 mL lies within the range of volumes given in the table.

If the volume is 5 mL:

Pressure = 2499/5 + 12.6 = 512.4

The pressure is 512.4 kPa. This result is unreliable because we have extrapolated: the value 5 mL lies outside the range of volumes given in the table.
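As a closing sketch, the Python snippet below repeats the whole transformation on the syringe data: it fits a straight line to pressure against 1/volume with numpy and then makes the two predictions above, flagging which one is extrapolation. The fitted coefficients should come out close to the 2499 and 12.6 quoted above; the helper name predict_pressure is just for illustration.

```python
# A closing sketch of the whole transformation on the syringe data: fit a straight
# line to pressure against 1/volume with numpy, then repeat the two predictions
# above. The fitted coefficients should come out close to the quoted 2499 and 12.6;
# the helper name predict_pressure is just for illustration.
import numpy as np

volume = np.array([29.5, 25.1, 22.5, 20, 18.2, 16.7, 15.5, 14.2, 13.4])            # mL
pressure = np.array([98.6, 111.3, 124, 136.7, 149.4, 162.1, 174.8, 187.5, 200.2])  # kPa

recvol = 1 / volume                        # the reciprocal (transformed) x values
m, b = np.polyfit(recvol, pressure, 1)     # least squares fit to the straightened data
print(f"Pressure = {m:.0f} / Volume + {b:.1f}")

def predict_pressure(vol_ml):
    estimate = m / vol_ml + b
    within = volume.min() <= vol_ml <= volume.max()
    return round(estimate, 1), "interpolation" if within else "extrapolation (unreliable)"

print(predict_pressure(19))   # about 144 kPa: interpolation
print(predict_pressure(5))    # about 512 kPa: extrapolation, so unreliable
```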
