Lab 18: Curve Fitting - The Least Squares Regression Model

Lab 18: Curve Fitting - The Least Squares Regression Model In the From Data to Models lab, we found a function whose graph represents the data by usi...
1 downloads 1 Views 56KB Size
Lab 18: Curve Fitting - The Least Squares Regression Model

In the From Data to Models lab, we found a function whose graph represents the data by using the sliders to find function with the smallest sum of the absolute values of the differences between the data & -values and the predicted values, V &À Statisticians do not use this method, but prefer the least squares regression model. The least squares regression equation is found by minimizing the sum of the squares of the distances of the points on the line (or curve) from the actual data. In this lab, we explore what happens when this approach is used. The Least Squares Regression Model The technique for finding the function that best fits the data is called the least squares regression technique. The idea is to minimize the sum of the squares of the differences between &: the value of the data, &, and the predicted values, called V :: ~

 ²& 

~



cV &  ³

We obtain V & ~  ²%³ by substituting the % value in the function we have chosen to represent the data. For the Motor Vehicle Production data, let us suppose that there is a linear function, & ~ % b , such that the line best approximates the data. Then replacing V & in the formula V above, we have :: ~

 ²& 

~



c ²% b ³³

The line that best fits the data is the one that minimizes :: . If we fit the data with a linear function, then the % and & coordinates of the data are known, and the values of  and  in & ~ % b  are unknown. That is, the desired function is a function of two variables,  and . V The minimum value of this function and the corresponding values of  and  are found using partial derivatives. In this lab, you are not asked to computer the partial derivatives, but rather to see what is happening as we try various values of  and . Your goal is to get as close to the minimum point as possible. Open the Linear Regression Tool. Choose Motor Vehicle Productivity (linear) from the list. In the upper right window is the plot of the data points. Sliders for  and  appear below the graph, and the corresponding value of :: appears as well. In the left window is the graph of ' ~ :: calculated with the data pairs (% Á & ³ substituted. To view the level curves in a lower left window, click on Á  plane.

1. Fitting a line using least squares linear regression 1.1 Motor Vehicle Productivity Index The opening graph represents the data for Motor Vehicle Productivity Index. (The data first appeared in the From Data to Models lab.) Using the sliders on the right, find a line which closely approximates the data. In the window on the upper right of the screen, the line appears with line segments representing the distance from the data point to the line & ~ % b À In the left window, you will see the red dot representing the point ( Á  Á '³ on the graph of ' ~ :: . Change the values of  and  until you get as close as possible to the minimum point on the graph. On the lower graph, you can see the red dot on the level curve as well. Record your information. & ~ % b  ~ _____________________________ :: ~ _____________________

Once you have found what appears to be the best solution, click on solution to see the actual values of  and  that minimize ::À Choose quit to return to the main menu. In each linear modeling problem, the method of finding the least squares regression equation always leads to a sum of squares function. If there are two independent variables, and , this function is second degree and the graph has a paraboloidal shape with a minimum point. The computer program used by your calculator and also by EXCEL and other spreadsheets finds the best fit which minimizes the sum of the squares. Your instructor may want you to use your calculator or EXCEL to complete the problems below. At the end of the lab is a brief introduction to using EXCEL to find the line of best fit. In the problems below, you will attempt to find the parameters which yield the best fit using the sliders, where by "best fit" we mean the values of the parameters which minimize ::À 1.2 Direct Mail Expenditures Choose the Direct Mail Expenditures from the list in the main menu. Use the sliders to try to find the best fit, as above. Again, the data points are shown and the line segments representing the distance from the line & ~ % b  to the data points. :: appears as well as the values of  and À After you have found what you think is the best equation, record your results: & ~ % b  ~ _____________________________ :: ~ _____________________

Click on the solution and compare your function with the one actually found by the tool using the partial derivatives to minimize ' , the error.

If you have a calculator with regression, use it to find the line of best fit. Record the result: & ~ _______________________

Compare that result with your estimation. How did you do?

The least squares technique may also be used for finding polynomials of higher degree that fit a data set. Unfortunately, the techniques given in applied calculus courses generally only involve finding the minimum when the function has two independent variables. If we wish to fit a quadratic polynomial, & ~ % b %b , there are three parameters to find: Á Á and , which would involve minimizing a function of three variables. Fortunately, Excel and other spreadsheets can find the polynomial best fits. We can, however, use the linear equation model on exponential data if we perform a little work first. 2. Fitting exponential functions using regression Suppose the data appears to fit a function of the form & ~ *! . Using logarithms, we can rewrite this equation as ln & ~ ln * b ! Notice that this equation is actually in the form of a linear equation if we let @ ~ ln & and ln * ~ À That is, we have the linear equation @ ~  b !

For example, & ~ ! can be rewritten as ln & ~   b ! or @ ~   b !À Below are the graphs of these two functions.

& ~ !

@ ~   b !

2.1 National Health Expenditures From the main menu, choose the data set for National Health Expenditures. In the right window is the graph of the original data. In the left window, the graph of the logarithm

data appears. The sliders will affect both the exponential curve and the corresponding linear curve. Below on the left is the graph of the level curves. Using the sliders, try to find the function which best fits this data. Notice that as you get a better fit to the original data, you are also finding a better fit to the linearized data on the left. When you think you have the best fit, list the values:  ~ _______

ln ²*³ ~ _______

and

So the linear equation representing the data in the upper left window is Y ~ % b  ~ _____________________________ :: ~ _____________________

In order to obtain the equation in the proper form, use your calculator to evaluate the * from the value of  ¢  ~ ln * ~ ________

We obtain * by finding the antilogarithm, or: * ~  ~ __________

Now write the appropriate function: & ~ *! ~ ______________________

Test your function by evaluating & for the values of ! given and compare with the actual data values. !

&

0

143

10

346

15

592

20

1063

25

1710

30

2566

V&

2.2 There is one more example in the menu, National Health Expenditures (%GNP). If you need more practice, try the remaining example, which is a linear regression problem. Compare the result with the results you found in the From Data to Models Lab or by using your calculator or Excel.

One of the major questions involving the use of mathematical models is to see for what period of time the model is reasonably accurate. Obtain updated data on one of these examples. Use our model to predict the latest values and compare with the actual value. What do you observe?

3. The Strategy for Finding a Regression Equation You might wish to review the characteristics of the types of functions before beginning this lab. Notice that each family of functions has distinct characteristics. In order to fit a model (function) to a set of data, you first must decide on which type of function will be appropriate. To do this, first plot the points on a scattergram and look at the trend of the points. 1. Does the collection of points appear to be roughly close to some line? If so, a linear model will be best. 2. Does the collection of points appear to increase (or decrease) rapidly? Consider the rate of change between consecutive values. If this increases (decreases) slowly, consider a polynomial model. If it changes at a rapidly increasing rate, consider an exponential model. 3. If the data has some natural limiting value, consider using a logistic curve. 4. Why are the above recommendations made?

4. Further exploration Below is a description of using EXCEL to obtain the least squares regression equation for a set of data. 4.1 In this example, we have used the Internet Users data found in the From Data to Models Lab. For this data, an exponential function will be the best fit. EXCEL can find linear or polynomial coefficients. To find the constants in an exponential function, we have to employ the method above, and convert the equation & ~ *! to its logarithmic equivalent @ ~ ln & or @ ~ ln * b ! Begin by typing the data into EXCEL. Make a column of % values and a column of & values. Use the built-in function "ln( )' to obtain the logarithm of each y value. (For example,, type "+ln(5)" to have EXCEL calculate the natural logarithm of 5.) Click on Tools, followed by Data Analysis. Form the list, choose "Regression". You are asked for the Y values. Enter the column of ln & (or @ ) values. Next, enter the column of ? values. Click OK. You will obtain the following result: SUMMARY OUTPUT Regression Statistics Multiple R 0.953204144 R Square 0.90859814 Adjusted R Square 0.878130853 Standard Error 0.070307212 Observations 5 ANOVA df Regression Residual Total

SS 1 3 4

MS F 0.147413691 0.014829312 0.162243003

Coefficients Intercept

4.229125613

X Variable 1

0.025272663

Significance F 0.147413691 29.82208912 0.004943104

Standard Error t Stat Upper 95% 0.055186363 76.63352638 4.404753415 0.004627879 5.460960458 0.040000653

0.012066263

P-value Lower 95.0% 4.8972E-06 4.053497811 0.012066263 0.010544674

Lower 95% Upper 95.0% 4.053497811 4.404753415 0.010544674 0.040000653

There is more information that you know what to do with! In a statistics course, you will learn what the other information tells you. For now, we have to pick out the information of interest. At the bottom of the table, "Intercept" gives you the value of À So, in our problem,  ~ À   We know that  ~  * , so * ~  ~ 68.65717 Finally, the X Variable gives you the coefficient of %, which is   ~ À  

 Rounding to four decimal places, we can now write the equation of the function that best fits the data: & ~ *! ~ 68.6572 À  !

4.2 Olympic Torch In 1996, Centennial Olympic Games were held in Atlanta, Georgia. Prior to the beginning of the games, the Olympic Torch was carried across the country. Most of the time, the torch was carried by a runner, but some of the time it was carried by a cyclist. The total distances covered by each of these groups through the Philadelphia, Pennsylvania region, are given in the tables below. At time, due to traffic and weather conditions, the torch bearers were behind schedule. By analyzing the data, we want to determine if they could have gotten back on schedule by making modifications in the distances covered, for example by substituting a cyclist for part of the distance originally scheduled for the runner, by increasing the runners' speeds, or by increasing the cyclists' speeds. To analyze this data, you may use your calculator, EXCEL, or formulas that have been developed to find the values of  and  that provide the minimum value of ::À The formulas for  and  appear below. ~



 % & c² % ³² & ³  % c² % ³ 

~



  

~

 

~ 





~

~







c  ~c & c %

_ where & is the mean (average) of the & values, and c % is the mean (average) of the % values (or in our case, ! values). We can apply these formulas to analyze the data below. Total distance covered by the runners in approximately 1 hour elapsed time (min.) 0 3 7 18 24 49 52 57 58 59 total distance (k) 0 .5 1.1 2.6 3.4 7.0 7.5 8.1 8.3 8.4 Total distance covered by the cyclists in approximately 1 hour elapsed time (min.) 0 1 5 7 12 33 38 46 50 59 total distance (k) 0 .8 4.8 5.2 5.7 12.0 15.0 21.0 23.5 26.4 First, plot the points representing distance vs. time for the runners on the graph. In this case, we suppose that the runners travel at a constant rate, indicating that the derivative of the function is a constant. This is true only of a linear function. So, we choose to model the data with a linear function. Using a linear equation as the least squares fit, substitute the % and & value pairs in the formula. 1. Find the function representing the distance traveled by the runners.

2. Next find the line of best fit for the distance covered by the cyclists.

3. Use the two equations to estimate the speed of the of the runners and the speed of the cyclists. (Remember: & Z gives you the slope of a curve, and for a linear function, & Z ~ À)

4. Determine difference in the speeds of the cyclists and the runners. There are landmarks along the route, and places designated where the torch will be passed from one carrier to the next. If the progress of the torch was 5 minutes off schedule, how would you suggest making up the time?

4.3 Viscosity The viscosity of a liquid is its resistance to flow. It can be obtained by observing the time it takes for a steel ball of fixed radius to fall through a column of the liquid. The Society of Automotive Engineers (SAE) has established a numbering system which indicates the viscosity of motor oils. Higher numbers indicate greater viscosity. SAE10w40 represents a viscosity of 10 at  * and a viscosity of 40 at 99 * . Below is a table of values indicating the viscosity of SAE10w30 at various temperatures. 

*  

 

#  !& À À À À

Plot the points on the graph at the left, and determine which type of function fits the information best. Select this function type and find the regression equation for the data.