COMPARING REGRESSION TREES WITH NEURAL NETWORKS IN AEROBIC FITNESS APPROXIMATION

Satu Tamminen*, Perttu Laurinen, Juha Röning**
Machine Vision and Media Processing Group, Infotech Oulu, University of Oulu, P.O. Box 4500, FIN-90401 Oulu, Finland
* [email protected], ** [email protected]

Abstract. A method for aerobic fitness measurement is proposed. The data used in this research consist of accurate measurements of maximum oxygen uptake as reference values and physical features, including R-R intervals of the heart beat measured at rest. The physical system of the human being is highly nonlinear, and traditional linear regression therefore cannot model the data accurately. Effective methods are needed to capture this nonlinear behaviour in a model. Regression trees and neural networks are considered as candidates for the task. Neural networks are powerful approximators, but it is difficult to explain how they arrive at their model, whereas regression trees are easy to visualize and their structure is more comprehensible. Both methods were used in fitness approximation, and a comparison of the results is presented.

Keywords: neural networks, regression trees, aerobic fitness.

1. Introduction
An individual's aerobic fitness can be approximated using certain physical features. Weight, height, age, sex and heart rate, for example, serve moderately well to characterize the maximum oxygen uptake, which, in turn, can be regarded as a measure of fitness. The physical system of the human being is highly nonlinear in nature, and the method used for fitness approximation should therefore be able to recognize nonlinear relations between the variables. There is usually no a priori knowledge available about the nature of the nonlinearity, which is why the system cannot be linearized with special transformations.

Nonlinearity is a natural characteristic of neural networks when nonlinear activation functions are used. The method is nonparametric, and therefore useful when the form of the nonlinearity in the data is not known. Neural networks are thus a common solution to this problem. On the other hand, the data can be divided into partly linear subgroups that can be analyzed independently with linear methods; regression trees can be used for this purpose. Approximations were done here with both methods, and the results were compared. Neural networks are effective, but the procedure is difficult to characterize or understand, and they are often the only method considered, whereas regression trees are structurally very simple and easy to visualize. Neural networks have been used for nonlinear approximation in several areas, but often uncritically, without investigating whether better methods are available. The idea behind regression trees is quite old, and regression trees have been used in many applications, but no published comparison of these two methods was found.

2. Aerobic fitness
The most important determinant of an individual's capacity to perform prolonged muscular exercise is maximum aerobic capacity, which reflects the working efficiency of the respiratory and circulatory organs. Thus, one of the methods most commonly used to approximate an individual's aerobic fitness is the measurement of maximum oxygen uptake (MAXL, l/min, or MAXML, ml/kg/min) (Niemelä 1983; Tulppo et al. 1996; Väinämö et al. 1996). If aerobic fitness needs to be measured accurately, it can be done in a special clinic, where clinical methods can be used to measure maximal oxygen uptake. These direct measurements are expensive and time-consuming, however. Another alternative is a less accurate indirect fitness test performed by running or walking under a work load. An inexpensive and reasonably accurate system for measuring aerobic fitness is clearly needed, and because both of these tests are physically stressful, it would be desirable to develop a method that can be performed at rest.

Since the relations of several features to aerobic fitness are nonlinear, good results cannot be obtained with a plain linear regression model. Nonlinear regression does not guarantee an optimal solution, however, and it is laborious to apply: the functional expression has to be written explicitly, good initial values have to be found for the parameters, and the derivatives of the model with respect to the parameters may have to be specified.

An expert can draw conclusions about the fitness of an individual by using physical features, such as height, weight, age and sex, for approximation purposes. There is also a lot of information available on the resting heart rate, which correlates moderately well with aerobic power (Kenney 1985; Väinämö et al. 1996). A hypothesis can be proposed that aerobic fitness can be estimated on the basis of physical features and statistical features calculated from R-R intervals.

Regression trees are binary decision trees. The tree is constructed by splitting the entire data set into subsets using all the independent variables. The goal is to produce terminal nodes that are as homogeneous as possible with respect to the target variable. Regression trees can be notably accurate in the case of nonlinear problems, but traditional regression will probably work better for linear data (Breiman et al. 1984).

The present material was obtained from the Merikoski Institute of Health and Rehabilitation, Oulu, Finland, and comprised 237 sets of R-R interval measurements, physical features, e.g. age, height, weight and sex, and accurate oxygen uptake measurements (with a bicycle ergometer) performed on adult men and women aged 15 to 65 years. All the subjects were healthy and none of them were on medication. The series of R-R intervals were passed through a filter to eliminate artifacts, and different statistical features, e.g. mean, variance, maximum and minimum, were calculated. For approximation purposes, 12 physical and statistical features that correlate well with maximum oxygen uptake were selected. The proportion of female subjects was small, and the estimated results for females in new data sets will therefore not be very reliable.
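The statistical feature extraction described above can be sketched as follows. This is a minimal illustration, not the study's actual preprocessing: the paper names only four of the statistical features (mean, variance, maximum, minimum), so only those are computed here, and the R-R series below is hypothetical.

```python
import statistics

def rr_features(rr_intervals):
    """Compute simple statistical features from a filtered series of
    R-R intervals (seconds between successive heart beats).

    A sketch only: the paper selected 12 physical and statistical
    features in total, but lists only these four statistical ones."""
    return {
        "mean": statistics.mean(rr_intervals),
        "variance": statistics.variance(rr_intervals),
        "max": max(rr_intervals),
        "min": min(rr_intervals),
    }

# Hypothetical filtered R-R series for one subject
rr = [0.85, 0.88, 0.91, 0.87, 0.90, 1.02, 0.95]
feats = rr_features(rr)
```

Features such as these, together with the physical variables, would then form the 12-dimensional input vector used for approximation.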

3. Regression trees
Linear regression can be used when the measured features have a linear correlation with the dependent variable. The regression can be formulated as

    y_i = f(x_i, θ) + ε_i,    (1)

where i = 1, ..., n and θ = (θ_1, ..., θ_k)^T is the vector of the parameters to be estimated. The errors ε_i, i = 1, ..., n, are i.i.d. with a mean equal to zero and an unknown variance σ². The function f can be linear or nonlinear (Ratkowsky 1983).

For every node t,

    R(t) = (1/N) Σ_{x_i ∈ t} (y_i − ȳ(t))²

is the within-node sum of squares; in other words, the total squared deviation of the y_i in t from their node average ȳ(t). The regression tree is formed by iteratively splitting the nodes so that the decrease in R(T) is maximized, where R(T) is the sum of the within-node sums of squares over all terminal nodes of the tree T.

After the tree has been constructed, it can be visualized by showing how the data space is divided. Every division contains a rule, which allows the relations between the variables to be examined: each division is based on a decision to split the subdata into two groups by a certain continuous or categorical variable. It is therefore easy to see which variables are the most important in view of the homogeneity of the data space.

The size of the tree can be restricted with various limitations. The trade-off between bias and variance depends on the size: if the tree is too small, the model cannot fit the data properly, and if the tree is too large, the model generalizes poorly. Stopping rules are used to control the size of the tree being built; the maximum depth of the tree and the minimum number of subjects per parent or child node can be defined.

The C&RT method in Answer Tree 1.0 (SPSS) is the method of Breiman et al. (1984), which generates binary decision trees. It provides a versatile illustration of the results, but modeling within the terminal nodes is laborious. CUBIST by Ross Quinlan is much easier to use, but no publications on the method are available, and the procedure it uses is therefore unknown.
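The splitting criterion can be sketched in code. The following is a simplified, hypothetical implementation for a single continuous predictor; real C&RT also handles categorical variables, searches over all predictors, and applies stopping rules. The example data are invented for illustration.

```python
def sse(ys):
    """Within-node sum of squares: total squared deviation from the node mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Find the threshold on one continuous variable that maximizes the
    decrease in the within-node sum of squares (the C&RT criterion).
    Returns (threshold, decrease)."""
    parent = sse(ys)
    best = (None, 0.0)
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between observed values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        decrease = parent - sse(left) - sse(right)
        if decrease > best[1]:
            best = (t, decrease)
    return best

# Illustrative data: age vs. a fitness score (hypothetical values)
ages = [20, 22, 24, 30, 40, 50]
scores = [55, 58, 54, 45, 40, 35]
threshold, decrease = best_split(ages, scores)
```

Applied recursively over all predictors, this search produces the binary tree structure; in the study, the first such split fell at age 24.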

4. Neural networks
Nonlinearity is a natural characteristic of neural networks when nonlinear activation functions are used. The method is nonparametric and therefore useful when the form of the nonlinearity in the data is not known. Supervised methods are useful for approximation and modeling (Haykin 1994). The 3-layer MLP (multilayer perceptron with one hidden layer) is a model of the form

    y = f_2(W_2 f_1(W_1 x)),    (2)

which is fitted to a set of pairs (x, y) comprising the training set. The transfer function f_1 in the hidden layer is usually nonlinear, differentiable and increasing; sigmoidal functions are often used. When performing regression with neural networks, nonlinearity is introduced into the model by means of the transfer functions. The function f_2 in the output layer may be linear or nonlinear, and W_1 and W_2 are estimated weight matrices (Murtagh 1994).

The initial values of the parameters in the network affect both the result and the stopping criterion. In this application, the numbers of neurons and epochs were selected randomly from restricted ranges, and thousands of combinations were tried. Minimization of the root mean square (RMS) error on the test set was used as the criterion when selecting the best network.

For the MLP, visualization of the modeling results and of the relations between the variables is not natural. Self-organizing maps can be used for this purpose (Lipponen et al. 1998): they provide a two-dimensional representation of the data, which can be examined separately in view of each variable.
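The model form of Eq. (2) can be sketched as a plain forward pass. This shows only the structure of the model, with a sigmoidal f_1 and a linear f_2; the Levenberg-Marquardt training used in the study is not reproduced here, and the weights in any real use would come from that fitting procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp_forward(W1, W2, x):
    """One-hidden-layer MLP of Eq. (2): y = f_2(W_2 f_1(W_1 x)),
    with a sigmoidal hidden transfer function f_1 and a linear output f_2.
    Bias terms are omitted for brevity."""
    hidden = [sigmoid(z) for z in matvec(W1, x)]  # f_1(W_1 x)
    return matvec(W2, hidden)                     # linear f_2
```

With estimated weight matrices W_1 and W_2, a subject's feature vector x is mapped directly to a predicted maximum oxygen uptake.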

5. Results
A Sun Ultra 2 workstation and MATLAB 4.2 were used to construct the neural networks. An MLP network with the Levenberg-Marquardt algorithm and one hidden layer was used to approximate maximum oxygen uptake. A PC (Pentium Pro 200) and Answer Tree 1.0 (SPSS 8.0.1) were used to build the regression trees, and the results were compared with those from CUBIST 1.05. The data of the 237 subjects were divided randomly into a training set and a test set, the latter containing 38 subjects.

The results of both methods were analyzed using the correlations between the target and predicted values. The average error percentages were calculated, and the results were analyzed visually with scatter plots, which clearly showed the problem areas in the models. In the scatter plots, the target values are plotted against the predicted values for every model; ideally the data points should form a narrow linear pattern.

The regression trees were first built with SPSS. All 12 variables were used, and the minimum sizes of the parent and child nodes were set to 20 subjects. Subjects aged 24 or under were assigned to the first terminal node. The older subjects were divided according to their maximum R-R interval values: the second terminal node contains male subjects with max R-R 1.156 or under, and the third contains female subjects that comply with the same rules. The fourth terminal node contains subjects with max R-R over 1.156 and body mass index 23.7 or under, and the fifth contains subjects that otherwise comply with the same rules but have BMI over 23.7 (Fig. 1).

Sex is usually considered one of the most important features, but it does not dominate in this tree. The explanation is the small number of females. The young subjects of the data set have a very high average fitness level, which tends to minimize the differences between males and females; this does not hold in the general population, though.
BMI and max R-R are natural separators: the predicted variable is the maximum oxygen uptake divided by weight, so heavier people have lower MAXML, and people with a higher max R-R have a lower heart rate at rest, which is a sign of a good fitness level.

Fig. 1. Regression tree from C&RT

The tree regression model worked better than the linear regression model, where generalization was a major problem, along with the less accurate results. The correlation between the target and predicted values was 0.91 for the training set and 0.81 for the test set (see Fig. 2a and Table 1). There were some differences between the rules: rule 5 worked best and rule 4 least well. Females with a low fitness level got very poor approximations in rules 1 and 5, and these impaired the overall result.

CUBIST was also used, and its regression tree had the same structure except that the cut-off point for BMI was 22.7 instead of 23.7. The models in the terminal nodes were quite different, however. Because no publications on the operation of CUBIST are available, the cause of the difference is unknown. Compared to C&RT, though, CUBIST was much easier to use because of its automated modeling phase. Otherwise, the problems were roughly similar in the two methods. When the number of data points is small, regression trees suffer from there not being enough data for separate models in the terminal nodes.

The performance of the neural network was the best. The structure was quite simple, with 6 hidden neurons, and the network was trained for 22 epochs. The correlation between the target and estimated values was 0.95 for the training set and 0.93 for the test set (Fig. 2c). The results were least adequate for the average fitness level, but there was no significant difference between male and female subjects.

When regression trees were used, the approximation results for female subjects with an average fitness level were poor. The reason for this is the small number of female subjects. Some test cases of female subjects fell into rules 1, 4 and 5, and female subjects with a high fitness level got good results, because their features were closer to those of the male subjects. A decision was therefore made to exclude the female subjects from the data, and the regression tree approach was tested with CUBIST. The program allowed only one split, made at age 25, which was too rough for this set of nonlinear data. A neural network approximation was therefore performed instead. When only male subjects were used in the fitness model, the results improved, especially for the test set. The correlation between the target and estimated values was 0.94 for the training set and 0.96 for the test set (Fig. 2d). The improvement was due to the more homogeneous data set. The network had only 4 hidden neurons, and it was trained for 20 epochs.
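The two summary measures used throughout the comparison can be sketched as follows. The correlation is the ordinary Pearson correlation between target and predicted values; the average error percentage is assumed here to be the mean absolute error relative to the target value, since the paper does not define it precisely.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def avg_error_pct(target, pred):
    """Mean absolute error as a percentage of the target value.
    An assumed reading of the paper's 'average error percentage'."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(target, pred)) / len(target)
```

Both measures would be computed separately on the training and test sets, as in Table 1.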

Fig. 2. Approximation results (a) with C&RT, (b) with CUBIST, (c) with the neural network for the whole data set, and (d) with the neural network for males only (the training set is marked with '+' and the test set with 'o'). The target values are on the x-axis and the predicted values of the model on the y-axis.

Table 1: Approximation results

Method        correlation (training)  error % (training)  correlation (test)  error % (test)
C&RT          0.91                    9.99%               0.81                16.38%
CUBIST        0.89                    11.37%              0.84                14.09%
NN            0.95                    7.48%               0.93                9.03%
NN for males  0.94                    8.13%               0.96                7.10%

6. Conclusions
The findings showed that aerobic fitness can be approximated on the basis of the subject's physical features and heart rate measured at rest with both neural networks and regression trees. The best results are better than those obtained with indirect measurements, and the proposed method involves no work load: the whole measurement can be performed at rest.

Neural networks provided the best average error percentages, especially on the test set. The error of the neural networks did not grow significantly on the test set, and generalization was therefore good. All the methods had the same problem areas in approximation: the results for the subjects with an average fitness level had the most variability, which can be seen in Fig. 2, where the scatter plots are wider in the middle region.

Regression trees are easier to understand and visualize. CUBIST is much easier to use than C&RT, but more information about the procedure it uses is needed. The relations between the variables in the regression trees were interesting. Sex has usually been considered one of the most important variables in fitness approximation, but here it dominated in only one subcategory; this is clearly due to the small number of female subjects in the data set. Another interesting variable is age: in the regression tree, subjects aged 24 years or under constituted a separate group, and their average MAXML was considerably high. When new data are introduced into the model, the performance will be poor for subjects with a low fitness level.

The results from a regression tree visualize all the interesting relations right away. This is not possible with neural networks: there is no inner structure exposing the primary relations that could point the investigation to certain areas. One possible solution is to map the data into a two-dimensional space with a SOM and to visualize the space separately for each variable.

Although the results from the neural network are better, its use is not self-evident. It may be useful to start with regression trees to find out the basic structure of the data, and then check whether the results can be improved with a neural network.

References
Breiman L., Friedman J.H., Olshen R.A., Stone C.J.: Classification and Regression Trees. Wadsworth Inc., USA (1984)
Haykin S.: Neural Networks. Macmillan College Publishing Company, Inc., USA (1994)
Kenney W.L.: Parasympathetic Control of Resting Heart Rate: Relationship to Aerobic Power. Medicine and Science in Sports and Exercise 17 (1985) 451-455
Lipponen S., Mäkikallio T., Tulppo M., Röning J.: Finding Structure in Fitness Data. International Conference on the Practical Application of Knowledge Discovery and Data Mining (PADD'98), UK (1998)
Murtagh F.: Neural Networks and Related 'Massively Parallel' Methods for Statistics: A Short Overview. International Statistical Review 62, 3 (1994) 275-288
Niemelä K.: Role of a Progressive Bicycle Exercise Test in Evaluating the Functional Capacity. Acta Universitatis Ouluensis, Finland (1983)
Quinlan R.: http://www.rulequest.com/
Ratkowsky D.A.: Nonlinear Regression Modeling, a Unified Practical Approach. Marcel Dekker, Inc., USA (1983)
Tulppo M.P., Mäkikallio T.H., Takala T.E.S., Seppänen T., Huikuri H.V.: Quantitative Beat-to-beat Analysis of Heart Rate Dynamics During Exercise. American Journal of Physiology 271 (1996) H244-H252
Väinämö K., Röning J., Nissilä S., Mäkikallio T., Tulppo M.: Artificial Neural Networks for Aerobic Fitness Approximation. International Conference on Neural Networks (ICNN '96), USA (1996)
