On learning to predict Web traffic

Decision Support Systems 35 (2003) 213 – 229 www.elsevier.com/locate/dsw

Selwyn Piramuthu*
Decision and Information Sciences, University of Florida, 351 STZ, Gainesville, FL 32611-7169, USA

Abstract

The ease of collecting data about customers through the Internet has facilitated the process of developing large repositories of data. These data can and do contain patterns that are useful for the decision maker. Knowledge discovery and data mining methods have been widely used to extract these patterns. It is acknowledged that about 80% of the resources in a majority of data mining applications are spent on cleaning and preprocessing the data. However, there have been relatively few studies on preprocessing data used as input in these data mining systems. In this study, we present a feature selection method based on the Hausdorff distance measure, and evaluate its effectiveness in preprocessing input data for inducing decision trees. Message traffic data from a Web site are used to illustrate performance of the proposed method. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Data mining; Feature selection; Web traffic

1. Introduction

The ease and availability of Internet access, especially the World Wide Web, facilitates communication among users and developers of Web sites. This has also led to consideration of resource allocation issues among designers, developers and maintainers of these systems. Specifically, making resources available to the users of these sites without loss of data, delays in response, or an excess of error messages is critical in these applications. One of the means to handle such issues is Web traffic analysis, which is especially important in systems where bursty traffic is expected. In this paper, we use machine learning methods, specifically decision trees, to analyze and predict Web traffic. Decision trees have been

* Tel.: +1-352-392-8882; fax: +1-352-392-5438. E-mail address: [email protected] (S. Piramuthu).

shown to have excellent properties for classification/prediction applications. We propose and develop a method for improving the classification performance of a system using decision trees. We then illustrate the performance of the system using a Web traffic analysis application.

Recent advances in computing technology in terms of speed and cost, together with access to tremendous amounts of computing power and the ability to process huge amounts of data in reasonable time, have spurred increased interest in data mining applications. Machine learning has been one of the methods used in most of these data mining applications. It is widely recognized that around 80% of the resources in data mining applications are spent on cleaning and preprocessing the data. The actual mining or extraction of patterns from the data requires the data to be clean since input data are the primary, if not the only, source of knowledge in these systems. Cleaning and preprocessing data involves several steps including procedures for handling incomplete, noisy or missing


data; sampling of appropriate data; feature selection; feature construction; and formatting the data as per the representational requirements of the techniques used to extract knowledge from these data. Invariably, and for the most part unknowingly, irrelevant as well as redundant variables are introduced along with relevant variables to better represent the domain in these applications. A relevant variable is neither irrelevant nor redundant to the target concept of interest [18]. Whereas an irrelevant feature (variable) does not affect the description of the target concept in any way, a redundant feature adds nothing new to the description of the target concept while possibly adding more noise than useful information in concept learning.

Feature selection is the problem of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept [20]. Feature selection is of paramount importance for any learning algorithm: when it is done poorly (i.e., a poor set of features is selected), it may lead to problems associated with incomplete information, noisy or irrelevant features, or a poor set/mix of features, among others. The learning algorithm is also slowed down unnecessarily by the higher dimensionality of the feature space, while experiencing lower classification accuracies due to learning irrelevant information. The ultimate objective of feature selection is to obtain a feature space with (1) low dimensionality, (2) retention of sufficient information, (3) enhanced separability in feature space for examples in different categories, achieved by removing effects due to noisy features, and (4) comparability of features among examples in the same category [28].

Although seemingly trivial, the importance of feature selection cannot be overstated. Consider, for example, a data mining situation where the concept to be learned is to classify customers as good or bad credit risks. The data for this application could include several variables such as social security number, assets, liabilities, past credit history, number of years with current employer, salary and frequency of credit evaluation requests. Here, regardless of the other variables included in the data, the social security number can uniquely determine a customer's creditworthiness. The knowledge learned using only the social security number as predictor has extremely poor generalizability when applied to new customers. Clearly, in this case, to avoid such a problem, we can

exclude social security numbers from the input data. However, it is not always clear-cut which variables could result in such spurious patterns, and a similar problem could exist among one or more other variables in the data. Feature selection methods can be used in such situations to cull out problematic features before the data enter the pattern extraction stage of data mining systems. The use of appropriate input data can and does result in improvements in performance. This study explores this idea of effectively utilizing input data. A feature selection method using the Hausdorff distance measure is presented and evaluated. The Hausdorff distance measure is widely used in computer vision and graphics applications due to its excellent discriminant properties; it has, however, not received much attention in the feature selection literature. The proposed method is illustrated using three real-world data sets. Two of these data sets are from published sources and are used to evaluate the proposed method. We then analyze message traffic at a Web site using the proposed method. Thus, this paper addresses two issues: (1) showing how data preprocessing can help improve learning performance, and (2) Web traffic prediction.

This paper is organized as follows. The rationale for the use of knowledge discovery and data mining methods for analyzing network traffic data is provided in Section 2. Section 3 provides a brief overview of recent developments in feature selection methods. The proposed Hausdorff distance-based feature selection method is presented in Section 4. A small example is used in Section 5 to explain how and why the Hausdorff distance measure is appropriate for feature selection. Two real-world data sets are then used to evaluate the proposed method; the results are provided in Section 6. This is followed by an illustration of the proposed method using message traffic data from a Web site in Section 7. Section 8 concludes the paper with a brief discussion of this study.

2. Web traffic analysis

The data networking world enables corporations to profoundly transform the way they conduct business. Through its vast reach, the Internet helps companies


communicate better with employees, customers, suppliers and distributors. Corporations that use this network can lower business costs and forge closer relationships with partners and customers. In an era of heightened competition, those companies with the best data communications infrastructure have a clear advantage in the marketplace. The data networking industry provides a robust foundation on which business applications can be built. Data traffic arises in a wide variety of settings: cross-platform Web browsers such as Netscape Navigator and Microsoft Internet Explorer run on PCs, Macintoshes and UNIX workstations, as well as on alternative computing devices such as palmtops, televisions and cellular phones. The ubiquity of computer networks lets users access data and services from most locations worldwide. Employees telecommuting from home can obtain corporate data just as easily as they can while at the office.

An increasingly popular activity today is investing over the Internet. This includes the use of on-line brokerages, researching companies using both the companies' own Web sites and other Web sites, and discussing the performance of these companies with others through chatrooms, bulletin boards and emails, among others. As the use of the Internet grows, the availability of information about companies is expected to grow. Not all of that information will be useful or even correct. Also, the type of information available has changed. Traditional theory links several different factors to a company's stock price. Macroeconomic factors such as the growth in GDP, interest rates and expectations of inflation are often cited as the cause of movements in stock prices. Microeconomic factors such as a company's profitability or earnings per share also drive movements in stock prices. Additionally, nonfinancial factors play a role: at a high level, this is called "herd mentality," while at the individual investor level, it tends to be recognized as a "hot tip." In either case, the spread of this type of informal, nonfinancial information affects investor behavior. And with the emergence of the Internet, the flow of this information has increased rapidly. The impact of the exchange of such information on changes in stock price can be explored using knowledge discovery and data mining techniques. Assuming the existence of useful and actionable relationships,


they can be used to predict investor behavior for the purpose of making stock investing more profitable, or for identifying potentially illegal transfers of information about companies before that information is released to the general public.

Data mining techniques can also be used to provide useful information that improves the decision-making process for IS personnel who are involved in the maintenance and development of systems that facilitate communication and interaction among the users of these systems. Examples of such objectives include (1) reducing downtime, (2) providing more storage space and (3) providing cost savings. These objectives can be achieved, among other means, by being able to predict bursts in message traffic. This information is especially crucial when the burstiness of message traffic deviates by several orders of magnitude from the norm. In this study, we consider the problem of predicting the intensity of message traffic in a chatroom environment. By being able to predict this, the system manager can allocate resources appropriately to reduce delay, error messages and loss of service for the users involved.

The past decade has witnessed a surge in Web traffic analysis applications, which monitor large volumes of Web site visitors and track their navigation through a site (e.g., Refs. [5,14,19,38]). The techniques used help system administrators evaluate the load on corporate Web sites and aid in capacity planning for bandwidth and servers. By managing traffic better, loads that cause crashes can be reduced or eliminated. Messages can be routed and prioritized based on message type, as well as the identity of the user. In this study, we are interested in being able to predict message traffic; we are not interested in the load balancing or capacity planning that follow message traffic analysis at later stages of planning, development and maintenance.

In general, real-world Web traffic analysis applications are associated with huge volumes of data. In spite of the advances in knowledge discovery methods and computing performance, there is still a need for careful preprocessing of the input data used in these analyses, because of the deleterious effect of data that are redundant, irrelevant, noisy or unnecessary for the application of interest. For preprocessing input data, feature selection methods have been shown to be beneficial for selecting appropriate attributes. Feature selection can, thus, be used in concert with other


preprocessing methods to enhance the knowledge discovery process, resulting in better performance.

3. Recent developments in feature selection

Feature selection is the problem of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept [20]. A goal of feature selection is to avoid selecting more or fewer features than is necessary. If too few features are selected, there is a good chance that the information content of this set of features is low. On the other hand, if too many (irrelevant) features are selected, the effects of the noise present in (most real-world) data may overshadow the information present. Hence, this is a tradeoff which must be addressed by any feature selection method.

The marginal benefit resulting from the presence of a feature in a given set plays an important role: a given feature might provide more information when present with certain other feature(s) than when considered by itself. Cover [11], Elashoff et al. [15] and Toussaint [43], among others, have shown the importance of selecting features as a set, rather than selecting the best individual features to form the (supposedly) best set. They have shown that the best individual features do not necessarily constitute the best set of features. However, in most real-world situations, it is not known what the best set of features is, nor the number (n) of features in such a set. Currently, there is no means to obtain the value of n, which depends partially on the objective of interest. Even assuming that n is known, it is extremely difficult to obtain the best set of n features, since not all n of these features may be present in the data comprising the available set of features.

There exists a vast amount of literature on feature selection. Researchers have attempted feature selection through varied means, such as statistical (e.g., Ref. [21]), geometrical (e.g., Ref. [16]), information-theoretic (e.g., Ref. [9]), neuro-fuzzy (e.g., Ref. [7]), Receiver Operating Characteristic (ROC) curve [10], discretization [26] and mathematical programming (e.g., Ref. [8]) approaches, among others.

In statistical analyses, forward and backward stepwise multiple regression (SMR) are widely used to select features, with forward SMR being used more often due to the smaller amount of calculations

involved. The output here is the smallest subset of features resulting in an R2 (coefficient of determination) value that explains a significantly large amount of the variance. In forward SMR, the analysis proceeds by adding features to a subset until the addition of a new feature no longer results in a significant (usually at the 0.05 level) increment in explained variance (R2 value). In backward SMR, the full set of features is used to start with, while seeking to eliminate features with the smallest contribution to R2.

Malki and Moghaddamjoo [27] apply the K–L (Karhunen–Loève) transform on the training examples to obtain the initial training vectors. Training is started in the direction of the major eigenvectors of the correlation matrix of the training examples, and the remaining components are gradually included in their order of significance. The authors generated training examples from a synthetic noisy image and compared the results obtained using the proposed method to those of the standard backpropagation algorithm. The proposed method converged faster than standard backpropagation with comparable classification performance.

Siedlecki and Sklansky [40] use genetic algorithms for feature selection by encoding the initial set of n features as an n-element bit string, with 1 and 0 representing the presence and absence, respectively, of features in the set. They used classification accuracy as the fitness function for the genetic algorithm while selecting features, and obtained good neural network results compared to branch and bound and sequential search [41] algorithms. They used synthetic data, as well as digitized infrared imagery of real scenes, with classification accuracy as the objective function. Yang and Honavar [44] report a similar study. However, Hopkins et al. [17] later show that classification accuracy may be a poor fitness measure when searching to reduce the dimension of the feature set.

Using Rough Sets theory [32], PRESET [30] determines the degree of dependency (γ) of sets of attributes for selecting binary features. Features leading to a minimal preset decision tree, which is the one with minimal length of all paths from root to leaves, are selected. Kohavi and Frasca [23] use best-first search, stopping after a predetermined number of nonimproving node expansions. They suggest that it may be beneficial to use a feature subset that is not a reduct, which has the property that a feature cannot be


removed from it without changing the independence property of the features. A table-majority inducer was used with good results.

The wrapper method [22] searches for a good feature subset using the induction algorithm as a black box. The feature selection algorithm exists as a wrapper around the induction algorithm: the induction algorithm is run on data sets with subsets of features, and the subset of features with the highest estimated value of a performance criterion is chosen. The induction algorithm is then used to evaluate the data set with the chosen features on an independent test set. Yuan et al. [45] develop a two-phase method combining wrapper and filter approaches.

Almuallim and Dietterich [3] introduce the MIN-FEATURES bias (if two functions are consistent with the training examples, prefer the function that involves fewer input features) to select features in the FOCUS algorithm. They used synthetic data to study the performance of the FOCUS, ID3 and FRINGE algorithms using sample complexity, coverage and classification accuracy as performance criteria. They increased the number of irrelevant features and showed that FOCUS performed consistently better. The IDG algorithm [16] uses the positions of examples in the instance space to select features for decision trees. It limits its attention to boundaries separating examples belonging to different classes, while rewarding (penalizing) rules that separate examples from different (same) classes. Eight data sets are used to compare the performance (percent accuracy, number of nodes in the decision tree, time) of decision trees constructed using the proposed algorithm with ID3 [36]. Decision trees generated using the proposed algorithm had better accuracy, whereas those generated with ID3 had fewer nodes and took more than an order of magnitude less time.

Based on the positions of instances in instance space, the Relief algorithm [20] selects features that are statistically relevant to the target concept, using a relevancy threshold that is selected by the user. Relief is noise-tolerant and is unaffected by feature interaction. The complexity of Relief is O(pn), where n and p are the number of instances and the number of features, respectively. Relief was studied using two 2-class problems with good results, compared to FOCUS [3] and heuristic search [13]. Kononenko [25] extended Relief to deal with noisy, incomplete and multiclass data sets.


Milne [29] used neural networks to measure the contribution of individual input features to the output of the neural network. A new measure of an input feature's contribution to the output is proposed and evaluated using data mapping species occurrence in a forest. Using a scatter plot of contribution to output, subsets of features were removed and the remaining feature sets were used as input to neural networks. Setiono and Liu [39] present a similar study using neural networks to select features. Battiti [6] developed MIFS, which uses mutual information to evaluate the information content of each individual feature with respect to the output class. The features thus selected were used as input to neural networks. The author shows that the proposed method is better than feature selection methods that use linear dependence measures (e.g., correlations, as in principal components analysis). Al-Ani and Deriche [2] extend this work by considering trade-offs between computational costs and combined feature selection. Koller and Sahami [24] use cross-entropy to minimize the amount of predictive information lost during feature selection. Piramuthu and Shaw [34] use C4.5 [37] to select features used as input in neural networks. Their results showed improvements over plain backpropagation, both in terms of classification accuracy and the time taken by the neural networks to converge.

The most popular feature selection methods in the machine learning literature are variations of sequential forward search (SFS) and sequential backward search (SBS), as described in Devijver and Kittler [13], and their variants (e.g., Ref. [35]). SFS (SBS) obtains a chain of nested subsets of features by adding (subtracting) the locally best (worst) feature in the set. These methods are particular cases of the more general 'plus l-take away r' method [41]. Results from previous studies indicate that the performance of forward and backward searches is comparable. In terms of computing resources, forward search has the advantage, since fewer features are evaluated at each iteration, compared to backward search, where the process begins using all the features.
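To make the sequential search idea concrete, the sketch below implements a plain sequential forward search in the wrapper style: at each step it adds the locally best feature as judged by cross-validated accuracy of a decision tree. This is not code from the paper; the scikit-learn tree stands in for C4.5, and the function and data names are illustrative.

```python
# A minimal sketch of sequential forward search (SFS), assuming scikit-learn
# is available; a decision tree stands in for the C4.5 inducer used in the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def sequential_forward_search(X, y, n_features):
    """Greedily add the feature whose inclusion yields the best CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                                  X[:, cols], y, cv=5).mean()
            scores.append((acc, f))
        best_acc, best_f = max(scores)          # locally best feature
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Example usage with synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # only features 0 and 3 matter
print(sequential_forward_search(X, y, n_features=3))
```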

4. Feature selection and Hausdorff distance

In this section, the proposed feature selection with Hausdorff distance (FSH) method


is presented after a brief introduction to the Hausdorff distance measure.

4.1. Hausdorff distance

The Hausdorff distance (e.g., Ref. [31]) is a measure of the similarity, with respect to their position in metric space, of two nonempty compact sets A and B. It measures the extent to which each point in a set is located relative to those in another set. Let X1 = {x11, x12, . . ., x1m} and X2 = {x21, x22, . . ., x2n} be two finite point sets and d a distance over this space. Here, d can be any distance, including the 1-norm,¹ the Euclidean norm, as well as the simple difference between corresponding coordinates in each dimension, among others. The Hausdorff distance is defined as follows:

$$\forall x_1 \in X_1,\quad D(x_1, X_2) = \min_{x_2 \in X_2} \{ d(x_1, x_2) \} \tag{1}$$

$$h(X_1, X_2) = \max_{x_1 \in X_1} \{ D(x_1, X_2) \} \tag{2}$$

$$H(X_1, X_2) = \max\{ h(X_1, X_2),\ h(X_2, X_1) \} \tag{3}$$

Here, h(X1, X2) is the directed Hausdorff distance from X1 to X2. It identifies the point x* ∈ X1 that is farthest (using a prespecified norm) from any point in X2 and measures the distance from x* to its nearest neighbor in X2. Essentially, h(X1, X2) ranks each point in X1 based on its distance from the nearest point in X2 and then uses the distance of the largest-ranked such point (x*, the point in X1 farthest away from X2). If h(X1, X2) = δ, then each point in X1 has at least one point in X2 within a neighborhood of radius δ. For smaller values of δ, X1 is nearly included in X2. Hence, h(X1, X2) is a measure of the inclusion of X1 in X2. The Hausdorff distance itself, H(X1, X2), is the maximum of the directed Hausdorff distances h(X1, X2) and h(X2, X1). H(X1, X2) can be calculated in O(mn) time for two point sets of size m and n, respectively. Alt et al. [4] improve this to O((m + n)log(m + n)).

¹ The 1-norm between two points A(x1, y1) and B(x2, y2) is defined as d(A, B) = |x1 − x2| + |y1 − y2|.
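For readers who want to experiment with the measure, the following sketch computes the directed distance of Eq. (2) and the symmetric distance of Eq. (3) for finite point sets, using the Euclidean norm. The function names are illustrative; an equivalent routine is also available as scipy.spatial.distance.directed_hausdorff.

```python
# Minimal sketch of Eqs. (1)-(3): directed and symmetric Hausdorff distance
# between two finite point sets, using the Euclidean norm. Function names are
# illustrative, not from the paper.
import numpy as np

def directed_hausdorff(X1, X2):
    """h(X1, X2) = max over x1 in X1 of the distance to its nearest x2 in X2."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    # Pairwise distances: D[i, j] = d(x1_i, x2_j)
    D = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return D.min(axis=1).max()

def hausdorff(X1, X2):
    """H(X1, X2) = max of the two directed distances."""
    return max(directed_hausdorff(X1, X2), directed_hausdorff(X2, X1))

# Small check: two unit squares offset by 3 along the first axis
A = [(0, 0), (0, 1), (1, 0), (1, 1)]
B = [(3, 0), (3, 1), (4, 0), (4, 1)]
print(hausdorff(A, B))   # 3.0
```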

The Hausdorff distance H is a metric over the set of all closed, bounded sets [11]. Being a true distance, it also obeys the properties of identity, symmetry and the triangle inequality. In the context of classification, it follows that the description of a concept is identical only to its own description, that the order in which different concepts are compared does not matter, and that the descriptions of two very different concepts cannot both be similar to some third concept.

4.2. Feature selection with Hausdorff distance

The algorithm FSH assumes that the data set includes k variables. Step 1 calculates the Hausdorff distance Hi(X1, X2) between examples belonging to classes 1 and 2 (assuming a binary concept learning problem), individually for each of the k variables in the data set. This is followed by sorting the Hi values, with the corresponding variables noted as (s1, . . ., sk), in step 2. This is followed by evaluation of the feature set by inducing decision trees. We use C4.5 [37], a well-known method, for generating decision trees. The trees are generated iteratively as more variables are added to the data set, in descending order of their corresponding Hi values (i.e., the most discriminating variables first). The quality of the decision trees thus generated (e.g., classification accuracy on heretofore unseen examples) is evaluated. The algorithm stops when a prespecified stopping criterion (classification accuracy) is reached, and the variables corresponding to this decision tree are returned as the selected set.

Algorithm FSH (feature selection with Hausdorff distance)

Variables in data: v1, v2, . . ., vk; S = set of selected input variables = ∅

(1) Set i = 1; while i < k + 1, do
    (a) Calculate Hi(X1, X2) for vi.
    (b) i = i + 1.
    (c) Go to 1(a).
(2) Sort Hi(., .), in ascending order, with the corresponding features (s1, . . ., sk).
(3) Set j = k; until stopping criterion is met, do
    (a) S = S + sj.
    (b) Induce decision tree with input S.
    (c) Evaluate quality of decision tree.
    (d) j = j − 1.
    (e) Go to 3(a).
(4) Return final set of features.
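The following is a minimal sketch of the FSH loop under a few assumptions: the per-feature Hausdorff distance is computed with the hausdorff() helper sketched in Section 4.1, a scikit-learn decision tree stands in for C4.5, and a target holdout accuracy serves as the stopping criterion. Names and thresholds are illustrative, not from the paper.

```python
# Sketch of Algorithm FSH, assuming the hausdorff() helper defined earlier.
# A scikit-learn decision tree stands in for C4.5; the stopping criterion is a
# target accuracy on a holdout set. Names and thresholds are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fsh(X_train, y_train, X_test, y_test, target_accuracy=0.85):
    X1 = X_train[y_train == 0]          # examples with label 0
    X2 = X_train[y_train == 1]          # examples with label 1
    k = X_train.shape[1]
    # Step 1: per-feature Hausdorff distance between the two classes
    H = [hausdorff(X1[:, [i]], X2[:, [i]]) for i in range(k)]
    # Step 2: order features by H (most separating feature considered first)
    order = np.argsort(H)[::-1]
    # Step 3: grow the feature set until the stopping criterion is met
    S = []
    for j in order:
        S.append(j)
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[:, S], y_train)
        acc = accuracy_score(y_test, tree.predict(X_test[:, S]))
        if acc >= target_accuracy:
            break
    # Step 4: return the selected features (and the last induced tree)
    return S, tree
```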


5. An example

Consider the data given in Fig. 1. These data points are chosen to facilitate explanation of the Hausdorff distance measure and its appropriateness as a criterion for feature selection. There are two independent variables V1 and V2, and a dependent variable Y. The examples from the two classes (Y = + and −) are clearly separable. The Hausdorff distance values, H(+, −), for V1 and V2 using the 1-norm as the distance measure are 3 and 0, respectively. Given a choice between variables V1 and V2, based on the Hausdorff distance measure, we would choose the variable that corresponds to the highest H(+, −) value. Here, we would choose V1 as the variable that distinguishes examples from the two classes since HV1(+, −) > HV2(+, −). Intuitively, if we project the data points on the V1 axis, the four data points belonging to the class '−' would be on V1 = 1 and 2, and the four points belonging to the class '+' would be on V1 = 4 and 5. Here, given just the V1 axis, we can separate the examples belonging to the '−' class from those belonging to the '+' class. Similarly, if we project the data points on the V2 axis, the four data points belonging to the class '−' would be on V2 = 1 and 2. The four points belonging to the class '+' would


also be on V2 = 1 and 2. Here, given just the V2 axis, we cannot separate the examples belonging to the '−' and '+' classes. HV2(+, −) = 0 signifies perfect overlap of the examples from the two classes in dimension V2, whereas our goal is to separate examples belonging to the two classes.
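The two H values quoted above can be checked numerically with the hausdorff() helper from Section 4.1. Since Fig. 1 is not reproduced here, the coordinates below are one arrangement consistent with the description in the text, and are therefore illustrative rather than the paper's exact data points.

```python
# Reproducing the example's per-feature Hausdorff distances with the
# hausdorff() helper sketched earlier. The points are one arrangement
# consistent with the text's description of Fig. 1 (illustrative only).
import numpy as np

minus = np.array([[1, 1], [1, 2], [2, 1], [2, 2]])   # class '-': V1 in {1,2}, V2 in {1,2}
plus  = np.array([[4, 1], [4, 2], [5, 1], [5, 2]])   # class '+': V1 in {4,5}, V2 in {1,2}

for i, name in enumerate(["V1", "V2"]):
    H = hausdorff(minus[:, [i]], plus[:, [i]])
    print(name, H)        # V1 -> 3.0, V2 -> 0.0
```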

6. Experimental results

Using two financial credit scoring data sets with different characteristics, one on loan default prediction and the other on bank failure prediction, we illustrate the performance of the proposed feature selection method. To facilitate comparison with results from previous studies using these data sets (e.g., Refs. [1,42]), we follow the same split of training and testing (holdout) samples as those studies.

6.1. Loan default data

This data set has been used in previous studies (e.g., Ref. [1]) to classify a set of firms into those that would default and those that would not default on loan payments. The source of this data is the Index of Corporate Events in the 1973–1975 issues of Disclosure Journal.

Fig. 1. An example.


Sixteen defaulted firms were matched with 16 nondefaulted firms to obtain data for the study. Another set of 16 examples, all belonging to the nondefault case, were used as the holdout set in line with previous studies using this data set. There are 18 variables in this data set: (1) net income/total assets, (2) net income/sales, (3) total debt/total assets, (4) cash flow/total debt, (5) long-term debt/net worth, (6) current assets/current liabilities, (7) quick assets/sales, (8) quick assets/current liabilities, (9) working capital/sales, (10) cash at year-end/total debt, (11) earnings trend, (12) sales trend, (13) current ratio trend, (14) trend of L.T.D./N.W., (15) trend of W.C./sales, (16) trend of N.I./T.A., (17) trend of N.I./sales and (18) trend of cash flow/T.D. For a detailed description of this data set, the reader is referred to Ref. [1].

In order to compare FSH with a comparable method, the nonlinear sequential forward search method with Parzen and hyperspheric kernel is used. The nonlinear sequential forward search method with the Parzen measure using a hyperspheric kernel has been shown in previous studies (e.g., Ref. [33]) to result in good performance compared to several other interclass, as well as probabilistic distance-based, feature selection methods for induced decision trees. The number of variables chosen by the nonlinear method was guided by the number of variables chosen by the FSH method, for comparison purposes.

Table 1 provides results from decision trees generated, using C4.5 [37], after preprocessing input through the feature selection methods. In Table 1, 'none' corresponds to the case where no preprocessing (here, feature selection) was done. The classification accuracy on heretofore unseen testing (holdout) examples is of primary interest; the number of input variables, as well as the size of the decision trees generated, are also important though to a lesser degree. The classification accuracy of the decision tree after preprocessing through FSH is the same as that generated without any preprocessing. However, the same accuracy was obtained with fewer (10 compared to 18) features, as well as a smaller (7 compared to 15) decision tree. Clearly, the ability to learn to describe a concept with fewer features, as well as a smaller decision tree, is preferable in terms of Occam's Razor, as well as the resources necessary to gather, store, maintain, analyze and interpret results. Although the nonlinear method resulted in a smaller tree, its classification accuracy on holdout examples is not as good as that of the other two methods.

6.2. Bank failure prediction data

This data set was used in the Tam and Kiang [42] study. Texas banks that failed during 1985–1987 were the primary source of data. Data from a year prior to their failure were used. Data from 59 failed banks were matched with 59 nonfailed banks, which were comparable in terms of asset size, number of branches, age and charter status. Tam and Kiang also used holdout samples. The 1-year prior case consists of 44 banks, 22 of which belong to failed and the other 22 to nonfailed banks. The 2-year prior case consists of 40 banks, 20 of which belong to failed and the remaining 20 to nonfailed banks. The data describe each of these banks in terms of 19 financial ratios. For a detailed overview of the data set, the reader is referred to Tam and Kiang [42].

Tables 2 and 3 provide results from decision trees generated after preprocessing input through the feature selection methods using the bank failure prediction data. Unlike with the loan default data set, the classification accuracy of the decision tree after

Table 1
Results using loan default data

Feature selection method | Input variables (for C4.5) | Tree size (for C4.5) | Training accuracy (%) | Testing accuracy (%)
None | x1, . . ., x18 | 15 | 96.9 | 87.5
FSH | x2, x5, x6, x7, x9, x10, x13, x16, x17, x18 | 7 | 84.4 | 87.5
Nonlinear | x1, x2, x3, x4, x5, x6, x7, x8, x13, x15 | 5 | 78.1 | 81.2


Table 2
Bank failure prediction data (1 year prior)

Feature selection method | Input variables (for C4.5) | Tree size (for C4.5) | Training accuracy (%) | Testing accuracy (%)
None | x1, . . ., x19 | 29 | 99.2 | 79.5
FSH | x1, x5, x6, x7, x8, x9, x15 | 9 | 84.7 | 86.4
Nonlinear | x1, x2, x3, x4, x5, x10, x17 | 21 | 91.5 | 81.8

preprocessing through FSH is slightly better than that with the nonlinear feature selection method, as well as that generated without any preprocessing. The size of the decision tree is also significantly smaller in the case of FSH compared to the other two methods. This is a slightly larger data set compared to the loan default data set, with more examples (118 training and 44 and 40 holdout examples in the bank failure prediction data, compared to 32 training and 16 testing examples in the loan default data set). The performance of the nonlinear method is also better than that using no preprocessing at all, in terms of tree size, using relatively fewer input variables.

7. Predicting Web traffic

7.1. Data collection

Before extracting useful patterns from data, the data must be collected and preprocessed. This section describes the process we followed to generate the business rules for managing network traffic. The data used in this study are from Silicon Investor. To begin with, we reviewed and cross-examined six different companies and the corresponding chat messages that were posted from 7/7/98 to 7/7/99. The following is a list of the steps we followed in collecting the data.

7.1.1. Step 1: Identify the source of data
The data were collected from Silicon Investor's Web site (http://www.techstocks.com). Specifically, the data comprise the data-log of their chat site.

7.1.2. Step 2: Select target companies
Once we knew the companies that have established chatrooms, we selected six different companies: Apple Computers, Compaq Computers, Hasbro, Network Appliance, Seagate and Western Digital. Each had a chatroom with a varying number of messages.

7.1.3. Step 3: Collect and process data from the Web site
This was the most laborious part of the data collection exercise. Each chatroom could display a listing of individual messages, 100 at a time, with a two-line summary that included the name of the person sending the message and its time and date. These blocks of 100 messages were copied into a text editor and stored as large text files, covering the period from the beginning of July 1998 through early July 1999. The number of messages varied widely, from 130 for Network Appliance to 35,000 for Compaq.

Table 3
Bank failure prediction data (2 years prior)

Feature selection method | Input variables (for C4.5) | Tree size (for C4.5) | Training accuracy (%) | Testing accuracy (%)
None | x1, . . ., x19 | 27 | 94.1 | 72.5
FSH | x6, x7, x8, x9, x16, x17 | 13 | 80.5 | 77.5
Nonlinear | x2, x3, x5, x10, x16, x19 | 23 | 88.1 | 60.0


7.1.4. Step 4: Collect financial data
The data on each company's stock were also obtained. These include the open, close, high and low prices for each company from the beginning of July 1998 through early July 1999. The closing values of four stock indices were also collected from their respective Web sites: the New York Stock Exchange (NYSE) Composite Index from http://www.nyse.com, and the NASDAQ Composite, NASDAQ 100 and NASDAQ Computer indices from http://www.nasdaq-amex.com. These indices were chosen because four of the stocks are listed on the NYSE while the other two are on NASDAQ.

7.2. Data processing

The message data and financial data are then processed to make them suitable for extracting patterns. The number of chatroom messages was computed by company by date. The processing continued by counting the number of unique message authors by company by date. This element tried to account for chatty participants who may use the chatroom facility of the Web site to send messages unrelated to the stock or company to other users. The number of unique authors may be a better indicator of the breadth of Web site use than simply the gross count of messages. The time each message was sent was also processed. All records were categorized in three different ways: relative to 12:00 noon (A.M. versus P.M.); relative to the trading hours of the NYSE (before, during and after the trading hours of 9:30 A.M. to 4:00 P.M.); and relative to peak Internet usage hours (11:00 A.M. to 4:00 P.M.). All times are in Eastern Standard Time, so there was no need to adjust the data for time zone differences. For each category, the number of messages falling into each choice was stored as a unique element in the database. This resulted in a set of variables about the number and timing of messages per day.
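A minimal sketch of this aggregation step is shown below, assuming the raw messages are available as a table with company, author and timestamp columns; the column and function names are illustrative, not from the paper.

```python
# Sketch of the per-company, per-date message aggregation described above,
# assuming a raw message table with columns 'company', 'author' and a
# datetime column 'sent' (column names are illustrative, not from the paper).
import pandas as pd

def daily_message_features(messages: pd.DataFrame) -> pd.DataFrame:
    msgs = messages.copy()
    msgs["date"] = msgs["sent"].dt.date
    msgs["hour"] = msgs["sent"].dt.hour + msgs["sent"].dt.minute / 60.0
    # Time-of-day buckets used in the paper: noon, NYSE trading hours, peak Internet hours
    msgs["pm"] = msgs["hour"] >= 12
    msgs["during_nyse"] = msgs["hour"].between(9.5, 16.0)
    msgs["peak_internet"] = msgs["hour"].between(11.0, 16.0)
    grouped = msgs.groupby(["company", "date"])
    return pd.DataFrame({
        "total_messages": grouped.size(),
        "unique_authors": grouped["author"].nunique(),
        "pm_messages": grouped["pm"].sum(),
        "during_nyse_messages": grouped["during_nyse"].sum(),
        "peak_internet_messages": grouped["peak_internet"].sum(),
    }).reset_index()
```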

The data elements in the starting database are listed below:

Date
Company Name
Daily Open, Close, High and Low Stock Price
Daily Trading Volume for each Stock
Daily NASDAQ Composite Index Closing Value
Daily NASDAQ 100 Index Closing Value
Daily NASDAQ Computer Index Closing Value
Daily NYSE Composite Index Closing Value
Total Count of Messages by Company
Count of Unique Message Authors by Company
Count of Messages written before 12:00 noon and after 12:00 noon by Company
Count of Messages written before, during and after trading hours of the NYSE by Company
Count of Messages written during peak Internet usage hours and otherwise by Company

The message count data were further processed into discrete categories. Discretization was done simply because C4.5, the decision tree generator, works better with discretized data. For the data set used here, discretization also helps in normalizing data corresponding to the different companies. The total message count and count of unique authors variables were categorized based on their dispersion. Standard deviations of message count were computed for each company. Then, each record had a category variable created that put it into a High, Medium or Low category based on how many standard deviations the given instance was from the mean of the data. The average values and standard deviations for each company's message count data are provided in Table 4.

The financial data were processed as well. The first adjustment to the data was to compute the difference in the price variables. The change in closing price was computed for each stock, as well as for each stock price index. It was at this point that the first assumption was made.

Table 4
Change in daily closing stock price and message count

Company | Change in daily closing stock price: mean (S.D.) | Daily message count: mean (S.D.)
Apple | 0.029 (1.256) | 27.524 (27.827)
Compaq | −0.031 (1.069) | 101.347 (75.500)
Hasbro | 0.005 (0.643) | 1.900 (1.779)
Network Appliance | 0.100 (1.951) | 3.872 (3.912)
Seagate | 0.010 (1.082) | 7.211 (6.964)
Western Digital | −0.014 (0.555) | 5.007 (6.447)


The stock price information existed for each trading day, which excludes all holidays and weekends. But the chatroom messages did not stop when trading stopped. So, the first assumption is that each nontrading day would retain the closing stock price and volume from the most recent trading day. In most cases, this meant that Saturdays and Sundays had the closing price and volume of the previous Friday. Holidays were given the same treatment.

The next preprocessing step was to convert the stock price changes into a discrete variable taking a limited number of values. The change in an individual stock's price was converted by taking the daily change in closing price and comparing it to the standard deviation of the change for that company. The discrete values were then computed to be: Low, the change was less than 1 standard deviation; Medium, the change was from 1 to less than 2 standard deviations; High, the change was 2 or more standard deviations. Implicit in this discretization process is that the mean of the change in stock price is zero. Therefore, we find that 2/3 of our observations are categorized as Low, 28% are Medium and 5% are High, following a normal distribution. Table 4 shows the means and standard deviations of each company's change in stock price. It would seem from observation that this implicit assumption is not unrealistic.

The last step in preprocessing the data was to discretize the four index data series, using the same approach. The standard deviation of the daily change in each index was calculated and a category assigned, which is again Low, Medium or High. For both the individual stock price changes and the changes in the indices, this had the effect of reducing the number of records rated as having a High change to 5% of the total, or about 100 records. As a result, we found that we could not effectively explore the more meaningful distinction where not only a Low, Medium or High category is assigned, but also a direction (Up, Down or No Change). When investing, there is a large difference in impact and suggested responses between a large increase in stock price and a large decrease in stock price.
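A minimal sketch of this discretization rule is given below, assuming a per-company daily price table; the same transformation applies to the message counts and the four index series. Column names are illustrative, not from the paper.

```python
# Sketch of the Low/Medium/High discretization described above: a daily change
# is labeled by how many (per-company) standard deviations it is from zero.
# Column names ('company', 'date', 'close') are illustrative, not from the paper.
import numpy as np
import pandas as pd

def discretize_changes(prices: pd.DataFrame) -> pd.DataFrame:
    out = prices.sort_values(["company", "date"]).copy()
    out["change"] = out.groupby("company")["close"].diff()
    sd = out.groupby("company")["change"].transform("std")
    z = (out["change"].abs() / sd).fillna(0.0)
    out["change_cat"] = np.select([z < 1.0, z < 2.0], ["Low", "Medium"], default="High")
    return out
```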


7.2.1. Other data issues

Impact of the assumption. The impact of the assumption that nontrading days would retain the value for stock price and trading volume from the last trading day comes from the different treatment of closing price and trading volume in our analysis. There needed to be some way to relate weekend and holiday chatroom activity to changes in stock price, but there were no changes in stock price on nontrading days. So, one of three options was available, each with a downside:

(1) Drop the weekends and holidays altogether from the analysis. This would have cost us more than two out of every seven records and also eliminated the possibility of detecting any impact from nontrading day messages.
(2) Keep the previous trading day's price, but set volume to zero. This is technically more accurate, but adds a very strong link between volume and change in stock price. If Friday's closing price is carried over to Saturday and Sunday, then the change in price for these 2 days will be zero. Because the volume will also always be zero, a very strong link is forged that overwhelms all other influences in the data. Tests using data where nontrading day volume was set to zero showed this to be the case.
(3) Use the last trading day's values for nontrading days. This approach also included the binary weekend variable for nontrading days. The problematic issue here is that Friday's volume is being associated with the change in price for Friday, as well as with 2 days' worth of zero change. This would tend to understate the importance of trading volume in the determination of stock price changes. But given that the focus was on the influence of chatroom messages, this was deemed acceptable.

Lags in the data. There is considerable room for lags to be present in the data. Lags could exist because, as we hoped to see, messages could precede stock price changes if there is insider trading or if information is being passed prior to stock movements. This type of lag could also be indicative of conversations occurring a day or two before the scheduled announcement of quarterly performance measures. In the other direction, lags may reflect the increase in interest in a stock after one or more days of continuing price rises or declines. The importance of lags in the data was explored using correlation analysis and linear regression. The correlation analysis calculated Pearson's correlation coefficients between the change in stock price and message count volume, including four lags of each. In addition,


trading volume was included. The correlation coefficients are given in Appendix A. In general, the correlations are extremely low in almost all cases. The only notable exception is the relationship between trading volume and the current count of total messages and unique authors; in both cases, the correlation lies between 0.77 and 0.79. Although still quite small, one other relationship can be seen: there is a weak, though significant, correlation between the count of messages and authors 2 days prior and the current daily change in stock price. Regressions designed to explore this relationship proved disappointing, as all R2 values were less than 0.05.
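The lag analysis can be sketched as follows, assuming the per-company daily table built earlier (with a price change, trading volume and the message and author counts); lagged copies of the count variables are created with a grouped shift and all pairwise Pearson correlations are computed. Names are illustrative, not from the paper.

```python
# Sketch of the lag analysis described above: Pearson correlations between the
# daily change in stock price and the (lagged) message/author counts.
# Column names are illustrative, not from the paper.
import pandas as pd

def lagged_correlations(daily: pd.DataFrame, max_lag: int = 4) -> pd.DataFrame:
    df = daily.sort_values(["company", "date"]).copy()
    for lag in range(1, max_lag + 1):
        df[f"messages_lag{lag}"] = df.groupby("company")["total_messages"].shift(lag)
        df[f"authors_lag{lag}"] = df.groupby("company")["unique_authors"].shift(lag)
    cols = (["price_change", "trading_volume", "total_messages", "unique_authors"]
            + [c for c in df.columns if "lag" in c])
    return df[cols].corr(method="pearson")
```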

The model used to predict behavior plays a key role in the value of the results. We use the total count of messages as the dependent variable, assuming that it can act as a proxy for network traffic. If we included the other message count variables as predictors, we would guarantee that they explain most of the Low, Medium and High volumes of messages because they are transformations of the total message count. Referring back to Appendix A, the correlation coefficient between the total message count and unique author count variables is high (0.921). Although not presented in Appendix A, the correlation coefficients for the other message count variables are also high. Therefore, any analysis performed with these variables included would swamp all other influences simply because these variables are derived from the total count variable.

The independent variables used are (x1, . . ., x13): daily change in stock price, direction of daily change in stock price, range of movement of daily stock price, change in NYSE Composite Index value, direction of change in NYSE Composite Index value, change in NASDAQ Composite Index value, direction of change in NASDAQ Composite Index value, change in NASDAQ 100 Index value, direction of change in NASDAQ 100 Index value, change in NASDAQ Computer Index value, direction of change in NASDAQ Computer Index value, weekend binary and daily trading volume. The weekend binary takes on values 0 or 1 depending on whether the day is a day when the stock markets are closed or open. The direction and magnitude variables are discretized: the direction variables take values up, no change or down, and the change variables take values low, medium and high.

Table 5 provides results from decision trees generated before and after preprocessing input through the feature selection method presented in this paper. We used 10-fold cross-validation to determine the sets of training and testing data. The averages from these 10-fold cross-validation runs are provided along with the standard deviation values in parentheses. The classification accuracy on training data for the case using preprocessing is worse than that using no preprocessing. This is not of concern, since it could signify that the decision trees generated did not over-fit the input data. If indeed this were true, the generalizability of the generated trees can be evaluated using heretofore unseen examples (i.e., the testing data set). The classification accuracy of the decision tree on the testing data set after preprocessing through FSH is slightly better than that generated without any preprocessing. The number of variables used (here, 3) in the decision trees after preprocessing is significantly fewer than the number (here, 13) used without preprocessing. Moreover, the size of the decision tree is also significantly smaller in the case with preprocessing using FSH compared to the case without any preprocessing.

Table 5
Results with and without feature selection (means over 10-fold cross-validation runs, standard deviations in parentheses)

Feature selection method | Input variables (for C4.5) | Tree size (for C4.5) | Training accuracy (%) | Testing accuracy (%)
None | x1, . . ., x13 | 115.3 (32.65) | 85.7 (19.77) | 78.5 (5.317)
Nonlinear | x1, x4, x8 | 26.4 (2.27) | 80.34 (6.93) | 79.7 (2.91)
FSH | x1, x10, x13 | 13.4 (4.88) | 81.2 (8.46) | 80.2 (2.87)


To summarize, preprocessing resulted in a more compact knowledge representation for comparable classification performance.

For a manager of a network, the rules to take away from this analysis are as follows: (1) when trading volume is low, the number of messages will be low; (2) large changes in the indices, especially the NASDAQ Computer Index, lead to large changes in message volume; and (3) if stock price changes are 2 points or more (in either direction), message volume increases, while small changes in stock prices lead to low message volumes. Knowing these, the Web site planner and administrator can plan for the necessary bandwidth and allocate the load appropriately.

The assumption about how weekends are handled does affect the results, as does the industry makeup of the stocks chosen for the analysis. The NASDAQ Computer Index proved so strong a factor because five of the six companies were computer companies and the Silicon Investor Web site focuses on these types of companies. If the focus were on stocks of automobile or financial service companies, the results would have come out differently. But we can probably generalize that changes in major stock indices will influence the price of individual stocks, as well as the volume of activity on networks.

8. Discussion

We developed a framework to analyze data generated through the Internet, specifically through the Web. Given the proliferation of Internet sites with Web pages, the availability and ease of capturing useful data about the visitors to these sites, and the need to get the most information about these visitors to better serve their needs and be competitive in the marketplace, there is a definite need to analyze the data that are generated at these Web sites. Common characteristics of these data include (1) their availability in huge amounts, and (2) the presence of information that is irrelevant to the decision maker analyzing the data in most of the components that make up these data. In situations where the traffic is bursty, analysis of the


situation is critical for allocating resources (bandwidth, machines, etc.) appropriately. Data collected in these environments are likely to include variables that contain more noise than useful information. These variables could render the results of analyses invalid. One way to address these issues is to preprocess the data to remove the components that do not contain useful information for describing the concept of interest to the decision maker.

We developed a feature selection method based on the Hausdorff distance measure and evaluated its effectiveness in selecting features for inducing decision trees. This method was compared with the case where no feature selection was used, as well as with a nonlinear method. In terms of classification accuracy on previously unseen examples, FSH performed slightly better than both the case with no feature selection and the nonlinear method, while producing smaller decision trees. The results also show that induced decision trees are sensitive to the input data used. By selecting appropriate features through preprocessing, the performance of induced decision trees can be improved without much effort, since most of these preprocessing techniques are not time/computing intensive. This is true for any learning algorithm because the complexity of the data used directly affects the learning algorithm's performance. Feature selection, when used along with any learning system, can help improve the performance of these systems even further with minimal additional effort. By selecting useful features from the data set, we are essentially reducing the number of features needed for learning tasks. This in turn translates to reductions in data gathering costs, as well as in the storage and maintenance costs associated with features that are not necessarily useful for the decision problem of interest.

Acknowledgements

I would like to thank Bruno Berszoner and Po Yin Tse for providing the source data and for descriptive statistics on the data. I would also like to thank the editor and the reviewers for their thorough review, which helped improve the content and presentation of this paper.


Appendix A. Correlation results for daily stock change and count variables, including lags of each, across all companies

The correlation matrix is reported in two blocks of columns. The rows, in order, are: change in stock price (current, −1 day, −2 days, −3 days, −4 days); total message count (current, −1 day, −2 days, −3 days, −4 days); unique author count (current, −1 day, −2 days, −3 days, −4 days); and trading volume. Each of the data lines that follow corresponds to one column and lists 16 pairs of values, one pair per row in the order above (see the note after the first block for the meaning of each pair).

First block of columns, in order: change in daily closing stock price (current, −1 day, −2 days, −3 days, −4 days) and total count of chatroom messages (current, −1 day).

1.000 0.000 0.233 0.000 0.149 0.000 0.167 0.000 0.090 0.000  0.026 0.242  0.001 0.980 0.060 0.007  0.011 0.618  0.012 0.600  0.014 0.537  0.008 0.701 0.056 0.010 0.002 0.944  0.011 0.601  0.046 0.035

0.233 0.000 1.000 0.000 0.234 0.000 0.151 0.000 0.167 0.000  0.014 0.517  0.026 0.239  0.001 0.975 0.059 0.007  0.011 0.613  0.018 0.403  0.014 0.532  0.009 0.695 0.056 0.010 0.001 0.953  0.009 0.673

0.149 0.000 0.234 0.000 1.000 0.000 0.238 0.000 0.151 0.000  0.023 0.293  0.015 0.505  0.026 0.232  0.001 0.963 0.059 0.007  0.016 0.460  0.019 0.390  0.014 0.517  0.009 0.688 0.056 0.011  0.008 0.699

0.167 0.000 0.151 0.000 0.238 0.000 1.000 0.000 0.239 0.000  0.013 0.548  0.022 0.308  0.014 0.528  0.026 0.240 0.000 0.992  0.001 0.970  0.015 0.494  0.018 0.422  0.014 0.519  0.008 0.726  0.011 0.613

0.090 0.000 0.167 0.000 0.151 0.000 0.239 0.000 1.000 0.000 0.031 0.156  0.013 0.547  0.022 0.308  0.014 0.527  0.026 0.239 0.026 0.241  0.001 0.970  0.015 0.494  0.018 0.421  0.014 0.518 0.020 0.355

 0.026 0.242  0.014 0.517  0.023 0.293  0.013 0.548 0.031 0.156 1.000 0.000  0.009 0.667  0.164 0.000  0.123 0.000  0.164 0.000 0.921 0.000 0.152 0.000  0.200 0.000  0.139 0.000  0.200 0.000 0.778 0.000

 0.001 0.980  0.026 0.239  0.015 0.505  0.022 0.308  0.013 0.547  0.009 0.667 1.000 0.000  0.009 0.666  0.164 0.000  0.123 0.000  0.071 0.001 0.921 0.000 0.152 0.000  0.200 0.000  0.139 0.000  0.013 0.567

The first number reported is the Pearson correlation coefficient; the second is the probability of accepting the null hypothesis that |R| = 0 (i.e., the p-value), with n = 2085, given this result.

Second block of columns, in order: total count of chatroom messages (−2 days, −3 days, −4 days), count of unique message authors (current, −1 day, −2 days, −3 days, −4 days) and trading volume.

0.060 0.007  0.001 0.975  0.026 0.232  0.014 0.528  0.022 0.308  0.164 0.000  0.009 0.666 1.000 0.000  0.009 0.666  0.164 0.000  0.198 0.000  0.071 0.001 0.921 0.000 0.152 0.000  0.200 0.000  0.200 0.000

 0.011 0.618 0.059 0.007  0.001 0.963  0.026 0.240  0.014 0.527  0.123 0.000  0.164 0.000  0.009 0.666 1.000 0.000  0.009 0.666  0.136 0.000  0.198 0.000  0.071 0.001 0.921 0.000 0.152 0.000  0.086 0.000

 0.012 0.600  0.011 0.613 0.059 0.007 0.000 0.992  0.026 0.239  0.164 0.000  0.123 0.000  0.164 0.000  0.009 0.666 1.000 0.000  0.195 0.000  0.136 0.000  0.198 0.000  0.071 0.001 0.921 0.000  0.157 0.000

 0.014 0.537  0.018 0.403  0.016 0.460  0.001 0.970 0.026 0.241 0.921 0.000  0.071 0.001  0.198 0.000  0.136 0.000  0.195 0.000 1.000 0.000 0.085 0.000  0.248 0.000  0.160 0.000  0.246 0.000 0.777 0.000

 0.008 0.701  0.014 0.532  0.019 0.390  0.015 0.494  0.001 0.970 0.152 0.000 0.921 0.000  0.071 0.001  0.198 0.000  0.136 0.000 0.085 0.000 1.000 0.000 0.085 0.000  0.248 0.000  0.160 0.000 0.144 0.000

0.056 0.010  0.009 0.695  0.014 0.517  0.018 0.422  0.015 0.494  0.200 0.000 0.152 0.000 0.921 0.000  0.071 0.001  0.198 0.000  0.248 0.000 0.085 0.000 1.000 0.000 0.085 0.000  0.248 0.000  0.247 0.000

0.002 0.944 0.056 0.010  0.009 0.688  0.014 0.519  0.018 0.421  0.139 0.000  0.200 0.000 0.152 0.000 0.921 0.000  0.071 0.001  0.160 0.000  0.248 0.000 0.085 0.000 1.000 0.000 0.085 0.000  0.106 0.000

 0.011 0.601 0.001 0.953 0.056 0.011  0.008 0.726  0.014 0.518  0.200 0.000  0.139 0.000  0.200 0.000 0.152 0.000 0.921 0.000  0.246 0.000  0.160 0.000  0.248 0.000 0.085 0.000 1.000 0.000  0.190 0.000

 0.046 0.035  0.009 0.673  0.008 0.699  0.011 0.613 0.020 0.355 0.778 0.000  0.013 0.567  0.200 0.000  0.086 0.000  0.157 0.000 0.777 0.000 0.144 0.000  0.247 0.000  0.106 0.000  0.190 0.000 1.000 0.000
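For readers who wish to see how a lagged correlation table of this form can be generated from raw daily series, the sketch below is a minimal illustration, not the code used for this paper. The DataFrame and its column names (price_change, msg_count, author_count, volume) are hypothetical. The two-sided p-value returned by scipy's pearsonr corresponds to the usual t test for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 degrees of freedom, consistent with the large sample (n = 2085) reported above.

```python
# A minimal sketch (not the paper's original code) of a lagged Pearson
# correlation table. Column names "price_change", "msg_count",
# "author_count" and "volume" are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr


def lagged_correlation_table(df: pd.DataFrame, max_lag: int = 4) -> pd.DataFrame:
    """Return a matrix of 'r (p)' strings for every variable/lag combination."""
    # Build one column per variable/lag combination, e.g. "msg_count_lag2".
    cols = {}
    for name in df.columns:
        for k in range(max_lag + 1):
            cols[f"{name}_lag{k}"] = df[name].shift(k)
    lagged = pd.DataFrame(cols)

    labels = list(lagged.columns)
    table = pd.DataFrame(index=labels, columns=labels, dtype=object)
    for a in labels:
        for b in labels:
            # Align the two series and drop rows lost to shifting.
            xy = pd.concat([lagged[a].rename("x"), lagged[b].rename("y")],
                           axis=1).dropna()
            r, p = pearsonr(xy["x"], xy["y"])  # Pearson r and two-sided p-value
            table.loc[a, b] = f"{r:.3f} ({p:.3f})"
    return table


# Example usage with a hypothetical daily DataFrame:
# daily = pd.DataFrame({"price_change": ..., "msg_count": ...,
#                       "author_count": ..., "volume": ...})
# print(lagged_correlation_table(daily))
```

The pairwise dropna means each correlation uses only the days for which both series are observed after shifting, which mirrors the loss of the first few observations when lagged variables are formed.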


References

[1] A.R. Abdel-Khalik, K.M. El-Sheshai, Information choice and utilization in an experiment on default prediction, Journal of Accounting Research (Autumn 1980) 325–342.
[2] A. Al-Ani, M. Deriche, An optimal feature selection technique using the concept of mutual information, Proceedings of the International Symposium on Signal Processing and Its Applications (ISSPA), Kuala Lumpur, IEEE Computer Society Press, Piscataway, NJ, August 13–16, 2001, pp. 477–480.
[3] H.M. Almuallim, T.G. Dietterich, Learning with many irrelevant features, Proceedings of the Ninth National Conference on Artificial Intelligence, The AAAI Press, Menlo Park, CA, 1991, pp. 547–552.
[4] H. Alt, B. Behrends, J. Bloemer, Approximate matching of polygonal shapes (extended abstract), Proceedings of the Seventh Annual Symposium on Computational Geometry, 1991, pp. 186–193.
[5] S. Basu, A. Mukherjee, S. Klivansky, Time series models for Internet traffic, Tech. Report GIT-CC-95-27, College of Computing, Georgia Institute of Technology, 1995.
[6] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks 5 (4) (1994) 537–550.
[7] J.M. Benitez, J.L. Castro, C.J. Mantas, F. Rojas, A neuro-fuzzy approach for feature selection, Proceedings of the IFSA World Congress and 20th NAFIPS International Conference, vol. 2, IEEE Computer Society Press, Piscataway, NJ, 2001, pp. 1003–1008.
[8] P.S. Bradley, O.L. Mangasarian, W.N. Street, Feature selection in mathematical programming, INFORMS Journal on Computing 10 (2) (1998).
[9] B. Chambless, D. Scarborough, Information-theoretic feature selection for a neural behavioral model, Proceedings of the International Joint Conference on Neural Networks (IJCNN-01), vol. 2, IEEE Computer Society Press, Piscataway, NJ, 2001, pp. 1443–1448.
[10] F.M. Coetzee, E. Glover, S. Lawrence, C.L. Giles, Feature selection in Web applications by ROC inflections and powerset pruning, Proceedings of the Symposium on Applications and the Internet, IEEE Computer Society Press, Piscataway, NJ, 2001, pp. 5–14.
[11] T.M. Cover, The best two independent measurements are not the two best, IEEE Transactions on Systems, Man, and Cybernetics SMC-4 (1) (1974) 116–117.
[12] A. Csaszar, General Topology, Adam Hilger, Bristol, 1978.
[13] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, Upper Saddle River, NJ, 1982.
[14] P.A. Dinda, D.R. O'Hallaron, An evaluation of linear models for host load prediction, Proceedings of the 8th IEEE Symposium on High-Performance Distributed Computing (HPDC-8), Redondo Beach, CA, IEEE Computer Society Press, Piscataway, NJ, 1999. Available only online at: http://dlib.computer.org/conferen/hpdc/0287/pdf/02870010.pdf.
[15] J.D. Elashoff, R.M. Elashoff, G.E. Goldman, On the choice of variables in classification problems with dichotomous variables, Biometrika 54 (1967) 668–670.
[16] T. Elomaa, E. Ukkonen, A geometric approach to feature selection, Proceedings of the European Conference on Machine Learning, Springer, Heidelberg, 1994, pp. 351–354.
[17] C. Hopkins, T. Routen, T. Watson, Problems with using genetic algorithms for neural network feature selection, 11th European Conference on Artificial Intelligence, Wiley and Sons, Indianapolis, IN, 1994, pp. 221–225.
[18] G.H. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, Proceedings of the Eleventh International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1994, pp. 121–129.
[19] J. Kim, D.J. Lilja, A network status predictor to support dynamic scheduling in network-based computing systems, Proceedings of the 13th International Parallel Processing Symposium, IEEE Computer Society Press, Piscataway, NJ, 1998, pp. 145–151.
[20] K. Kira, L.A. Rendell, A practical approach to feature selection, Proceedings of the Ninth International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1992, pp. 249–256.
[21] J. Kittler, Mathematical methods of feature selection in pattern recognition, International Journal of Man–Machine Studies 7 (1975) 609–637.
[22] R. Kohavi, Wrappers for performance enhancement and oblivious decision graphs, PhD dissertation, Computer Science Department, Stanford University, 1995.
[23] R. Kohavi, B. Frasca, Useful feature subsets and Rough Sets reducts, Third International Workshop on Rough Sets and Soft Computing (RSSC 94), Simulation Councils, San Diego, CA, 1994, pp. 320–323.
[24] D. Koller, M. Sahami, Toward optimal feature selection, Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, CA, 1996, pp. 284–292.
[25] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, Proceedings of the European Conference on Machine Learning, Springer, Heidelberg, 1994, pp. 171–182.
[26] H. Liu, R. Setiono, Feature selection via discretization, IEEE Transactions on Knowledge and Data Engineering 9 (4) (1997) 642–645.
[27] H.A. Malki, A. Moghaddamjoo, Using the Karhunen–Loève transformation in the back-propagation training algorithm, IEEE Transactions on Neural Networks 2 (1) (1991) 162–165.
[28] W.S. Meisel, Computer-Oriented Approaches to Pattern Recognition, Academic Press, New York, 1972.
[29] L. Milne, Feature selection using neural networks with contribution measures, AI'95, Canberra, World Scientific Publishing Company, Singapore, November 1995, pp. 215–221.
[30] M. Modrzejewski, Feature selection using Rough Sets theory, European Conference on Machine Learning, Springer, Heidelberg, 1993, pp. 213–226.
[31] S.B. Nadler Jr., Hyperspaces of Sets, Marcel Dekker, New York, 1978.
[32] Z. Pawlak, Rough Sets, International Journal of Computer and Information Sciences 11 (5) (1982) 341–356.
[33] S. Piramuthu, Evaluating feature selection methods for learning in data mining applications, Proceedings of HICSS-31 (1998) V:294–V:301.
[34] S. Piramuthu, M.J. Shaw, On using decision tree as feature selector for feed-forward neural networks, International Symposium on Integrating Knowledge and Neural Heuristics, The AAAI Press, Menlo Park, CA, 1994, pp. 67–74.
[35] P. Pudil, F.J. Ferri, J. Novovicova, J. Kittler, Floating search methods for feature selection with nonmonotonic criterion functions, IEEE 12th International Conference on Pattern Recognition, vol. II, IEEE Computer Society Press, Piscataway, NJ, 1994, pp. 279–283.
[36] J.R. Quinlan, Simplifying decision trees, International Journal of Man–Machine Studies 27 (1987) 221–234.
[37] J.R. Quinlan, Decision trees and decision making, IEEE Transactions on Systems, Man and Cybernetics 20 (2) (1990) 339–346.
[38] Z. Sahinoglu, S. Tekinay, On multimedia networks: self-similar traffic and network performance, IEEE Communications Magazine (January 1999) 48–52.
[39] R. Setiono, H. Liu, Neural network feature selector, IEEE Transactions on Neural Networks 8 (3) (1997) 654–662.
[40] W. Siedlecki, J. Sklansky, A note on genetic algorithms for large-scale feature selection, Pattern Recognition Letters 10 (5) (1989) 335–347.
[41] S.D. Stearns, On selecting features for pattern classifiers, Third International Conference on Pattern Recognition, IEEE Computer Society Press, Piscataway, NJ, 1976, pp. 71–75.
[42] K.Y. Tam, M.Y. Kiang, Managerial applications of neural networks: the case of bank failure predictions, Management Science 38 (7) (1992) 926–947.
[43] G.T. Toussaint, Note on optimal selection of independent binary-valued features for pattern recognition, IEEE Transactions on Information Theory IT-17 (1971) 618.
[44] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, Proceedings of the Genetic Programming Conference, GP'97, Morgan Kaufmann, San Francisco, CA, 1997, pp. 380–385.
[45] H. Yuan, S.-S. Tseng, W. Gangshan, Z. Fuyan, A two-phase feature selection method using both filter and wrapper, Proceedings of the IEEE Conference on Systems, Man, and Cybernetics, vol. 2, IEEE Computer Society Press, Piscataway, NJ, 1999, pp. 132–136.

Selwyn Piramuthu is an Associate Professor in the Decision and Information Sciences Department at the University of Florida. His research interests are in pattern recognition and its application in workflow systems, supply chain management, financial credit-risk analysis, and manufacturing scheduling.
