Analytics and Machine Learning

Author: Austen Boone

Executive Summary

What is machine learning, and why should we know about it? In today's world, most people are expected to have some idea of how to answer this question. Machine learning is the practice of making a machine think and take decisions like a human being. Machines outmatch human beings in computation capacity by a huge margin, and the need of the day is to tap this vast computational power to reduce direct human involvement in routine matters. This has been an area of tremendous research in the recent past: people from all disciplines of science have felt the need to use machines for a variety of purposes. The broad aspects can be divided into clustering, classification, and prediction analysis. The need for such techniques can be as varied as clustering the incomes of people in a city on the basis of education, or predicting the time of the next meteor shower, so an algorithm that works across all these situations is a stellar achievement.

We already have algorithms that perform machine learning on a sound mathematical footing. However, they should be applied with a significant amount of knowledge about the field in question. This is where data science and analytics step in. Using these, we may view such a problem in complete abstraction, and the problem at hand can then be written in terms of an abstract decision-theoretic approach. It may happen (as is the case with many computer science problems) that this reduces the problem at hand to another problem whose solution is already known.

Pure data science, or analytics, depends heavily on abstract mathematical structure. This is both its greatest strength and its greatest weakness. Data science is not meant to be useful only to a handful of people with rich mathematical backgrounds; it is a subject that can support all branches of science, and much of its application lies in applied real-world fields such as finance and business analysis. We should not expect every practitioner to know the underlying complications of analytics in full depth. Rather, he or she should be able to extract the required information from analytics and interpret it properly for his or her field of use. Machine learning, in this case, serves as a great tool of interpretation.

The diagram above shows the ever-increasing size of the analytics market. To keep up with this intense competition, we must know how to harness machine power to do the job of analyzing data for us. It is also interesting to note the varied applications of analytics identified in the diagram; fortunately, we have dedicated machine learning methods for each of them.

Prediction

We need to predict a response variable based on a large number of predictor variables, with no prior information about which of these variables are actually useful. Typically we look for a very small subset of predictors: this reduces the number of columns in the dataset by a huge proportion and makes both the data and the prediction much easier to visualize. The mathematics behind the analytics that deals with this issue does not have a very simple development; the decision tree approach comes to the rescue. If the range of the response variable is known, the prediction problem can be viewed as a classification problem, where we predict which of the given categories the response will fall into, given the data. The decision tree solves this problem with great ease. It is based on the principle that, at each step, the tree checks the proportion of each category in the training set. If we find a clear separation in the explanatory variables between two different response categories, we can be sure that these conditions suffice to determine the value of the response variable, and no more predictor variables are required. This task is performed within the tree itself, and no knowledge of model selection is required on the part of the practitioner.
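To make this concrete, here is a minimal sketch of a decision-tree classifier in scikit-learn. The synthetic dataset, the depth limit, and every parameter value are illustrative assumptions, not details from the example that follows.

```python
# Minimal decision-tree sketch on synthetic data: many predictors,
# few of them informative, mirroring the situation described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree keeps the selected predictors easy to read off.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, max_depth=2))  # the splits name the few useful variables
```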

The following example deals with such a prediction problem in the context of sales analysis at a retail chain. A collection of 40 variables is used to predict the number of people who will respond to a new sales campaign. The governing authorities of the chain want to identify a very small number of predictors that must be kept in mind. The same problem can be viewed either as a regression problem or as a decision tree model (which uses machine learning). By a decision tree, the result is:

Explanation of the variables appearing in the tree:

FRE: number of visits in the last 12 months in which the customer purchased at least one item
TMONSPEND: amount spent in the last 3 calendar months
CLASSES: number of product categories purchased in the last 12 months
STORES: number of stores (pertaining to different franchises) the customer shopped at in the last 12 months
LTP12FRE: number of purchase visits prior to the last 12 months
SMONSPEND: amount spent in the last 6 calendar months

By regression analysis, if a logistic model is tried, we find all of the above variables, along with the number of coupons used, the time since the last visit, and individual shopping habits, to be significant. The decision tree approach thus makes the same prediction as the regression method, but with a greater degree of clarity and interpretability. Naturally, this approach should be more favorable to practitioners wanting to do some exploratory analysis, and it makes the course of action very clear-cut for the client. The greatest advantage of such algorithms is that they can be run entirely by a computer: the machine learns, by itself, from the training set and applies what it learns for prediction.

One key feature of any regression analysis is that whenever a new variable is added to a set of predictors, the value of R², a measure of the prediction power of the model, increases. This is not desirable, since we surely do not expect a variable with no connection to the situation at hand to increase our prediction power. A decision tree, on the other hand, does not suffer from this: it shows clearly which variables are redundant. A sketch of this contrast follows below.

Next, we may consider testing whether some variable can be removed from a set of predictors. Most analytic tests depend heavily on a specific distributional assumption about how the data were generated. While these tests are surely the best if the model holds true, most of them may fail if it does not. Any machine learning approach scores heavily on this front: it treats the empirical distribution of the observed (training) dataset as the actual distribution and makes inference based on it. This is a very realistic assumption and applies to a wide variety of situations.
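The following sketch, on purely synthetic data with assumed coefficients, illustrates that contrast: a linear fit's R² never drops when an unrelated column is appended, while a decision tree simply assigns that column negligible importance.

```python
# R^2 versus tree importance on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # three genuine predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)
noise = rng.normal(size=(500, 1))                # a column unrelated to y
X_big = np.hstack([X, noise])

r2_base = LinearRegression().fit(X, y).score(X, y)
r2_more = LinearRegression().fit(X_big, y).score(X_big, y)
print(f"R^2 without the noise column: {r2_base:.4f}, with it: {r2_more:.4f}")

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_big, y)
print("Tree importances:", tree.feature_importances_.round(3))  # last entry ~0
```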

Clustering

Let us now look at identifying clusters among observed data. In many cases, for example when finding patterns of expenditure among a specific class of people, we may observe that the data cloud is a combination of disjoint parts of the population. The most fundamental way to view this problem, from a machine learning point of view, is as identifying a few points in p-dimensional space that together form a close approximation to the observed data points. For simplicity, we may assume that the number of clusters is known. The most basic machine learning algorithm built for this purpose is the k-means algorithm. It takes a collection of k points as a random start and keeps iterating, reducing the combined distance of all the points from the identified cluster centers, until convergence. This method is very fast and can be performed by an ordinary computer in a matter of seconds, even when the number of data points to be clustered is huge. As an example, let us try to find clusters among weekly Dow Jones industrial indices over a period of time, considering only the variables Open, Close, High, Low, and Percent Change in Weekly Price.
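A run of this kind might look as follows in scikit-learn; the file name dow_jones_weekly.csv and the column names are assumptions standing in for whatever form the actual data takes.

```python
# k-means on weekly index data; file and column names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("dow_jones_weekly.csv")
features = df[["open", "high", "low", "close", "percent_change_price"]]

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
print(pd.DataFrame(km.cluster_centers_, columns=features.columns))  # cluster means
print("Total within-cluster sum of squares:", km.inertia_)
```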

Results by k-means, with 5 clusters. Cluster means:

Cluster    Open        High        Low         Close       Percent Change Price
1          162.8271    165.7625    160.7092    163.7338     0.591788
2           80.23996    81.70971    78.7023     80.36272    0.169758
3           37.66396    38.36294    37.08877    37.81026    0.394361
4           58.37521    58.96347    57.63405    58.24958   -0.56732
5           42.85344    43.91844    41.58969    42.66313   -0.5612

The following diagram shows the within-cluster distances for this dataset:

It is clear that the clustering has identified 5 clear-cut clusters in the data. We can now treat each of these clusters separately for further analysis.

Pattern Recognition

Machine learning algorithms can be used to find patterns in data quite generally. The most popular example is image segmentation: using clustering, we can divide an image into useful segments, one of which may be the object of interest in our study. This can be used to detect smiles in an image, and also in biological studies with X-ray or CT scan images. In business, pattern recognition is used to analyze trends in time series data. There may be patterns that are practically impossible for a human to find in a mammoth dataset, but a machine learning algorithm can detect them in a matter of seconds.
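As a sketch of clustering-based segmentation, the following groups an image's pixels by colour with k-means and extracts one segment. The file names and the choice of three clusters are assumptions.

```python
# Segment an image by clustering its pixel colours with k-means.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("scan.png").convert("RGB"), dtype=float) / 255.0
h, w, _ = img.shape
pixels = img.reshape(-1, 3)                      # one row per pixel

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pixels)
segments = labels.reshape(h, w)                  # a cluster label per pixel

# Keep only one segment (cluster 0 here) for further study.
mask = (segments == 0).astype(np.uint8) * 255
Image.fromarray(mask).save("segment0.png")
```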

More Applications

Machine learning has gifted us a completely new paradigm: that of artificial intelligence. We can now train computers to think! One such application is sentiment analysis. Consider, for instance, a movie rating website, where we treat each rating as an independent observation carrying information about the quality of the movie. Standard analytic methods are well designed to handle the numerical information, but we lose a great deal of information contained in the comments that accompany the reviews. The reviewer shares his or her own perspective on the movie in such a comment; a human reader would be able to judge the strength of the reviewer's feelings towards the movie, and we want to train a computer to do the same. One possible approach is as follows. Our aim is to extract a numerical score, or a feature vector, which, when treated by analytic methods, gives us additional information. The review is built from words, each carrying either a good or a bad connotation about the movie. The first step is to build a large collection of words with positive or negative scores, depending on their strength and sense. We can then use something like a string matching algorithm to count how many words of each kind appear in the review, and thereby estimate a 'comment score' (a toy version is sketched below).

A very similar application is website analysis. If a website has a lot of content, with appropriate 'tags', we are interested in recognizing the patterns in the different tags over time. For a buy/sell website, these tags may be prices, popularity, customer ratings, and so on. We want to be able to sort all the contents of such a website by each of these tags. This uncovers many underlying patterns, for example the behavior of consumers on a particular basket of commodities. There may be prevalent notions, such as that customers only want to reduce cost when buying an item; the machine learning approach provides ready estimates of actual consumer behavior and enables the selling company to set its policy accordingly.
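A toy version of the comment score might look as follows. The lexicon entries and their scores are invented for illustration; a real system would use a far larger, curated word list.

```python
# Toy comment score: sum the signed scores of lexicon words found in a review.
LEXICON = {"brilliant": 2, "good": 1, "enjoyable": 1,
           "dull": -1, "bad": -1, "awful": -2}

def comment_score(review: str) -> int:
    """Total score of lexicon words appearing in the review text."""
    words = (w.strip(".,!?") for w in review.lower().split())
    return sum(LEXICON.get(w, 0) for w in words)

print(comment_score("A brilliant, enjoyable film."))   # 3
print(comment_score("Dull plot and awful acting."))    # -3
```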

On a networking site like Facebook or Twitter, these ML algorithms can be used to pick out the trends of the moment, or to check whether a particular piece of content matches the site's guidelines. The entire Facebook News Feed uses machine learning to predict which news to show, according to each user's past personal choices.

Words of Caution

So what does the future have in store for us? With each day, we seem to unearth a new benefit in allowing machines to have a greater impact on our lives. Nevertheless, it would not be wise to do so without judging the possible pitfalls with care. In the rest of this dissertation, we concentrate on how analytics can save the day when machine learning algorithms commit a faux pas. Think of the classification problem which, as shown earlier, is elegantly solved by the decision tree. If we delve deeper into the way a decision tree works, we see that it assigns the majority label of the response variable at each step.

Suppose we use such a classifier to predict whether or not an individual has cancer, based on a series of explanatory variables. The number of non-cancer individuals can be expected to be far higher than the number of cancer patients, so a decision tree trained on such data is predominantly biased towards the majority class, i.e. non-cancer. This defeats the sole purpose of the prediction:

              True Non-cancer   True Cancer   Class Precision
Pred. NC      25925             754           97.17%
Pred. Cancer  4                 50            92.59%
Class Recall  99.98%            6.22%

Analytic Corrections

It is very clear that the cost of declaring a cancer patient fit far exceeds the cost of classifying a fit individual as a patient. We now try to incorporate this asymmetric cost into the decision tree. By default, the tree is built with a symmetric entropy criterion (a measure of disorder, or departure from purity), which attains its maximum when the classes are evenly split and falls off symmetrically on both sides. Introducing the asymmetric cost skews this criterion towards the side where the cost is higher and forces the decision tree to make fewer errors of the high-cost type. The algorithm may now be poorer in terms of overall accuracy, but it identifies most of the true cancer patients correctly. Applying this correction improves the table to:

              True NC   True Cancer   Class Precision
Pred. NC      15248     6             99.96%
Pred. Cancer  10681     798           6.95%
Class Recall  58.81%    99.50%
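One common way to realize this asymmetric-cost idea in practice, sketched below on synthetic data, is through class weights rather than a hand-skewed entropy; the 20:1 cost ratio and the 3% disease prevalence are assumptions for illustration.

```python
# Contrast a plain decision tree with a cost-weighted one on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Roughly 3% of cases are positive ("cancer", label 1).
X, y = make_classification(n_samples=20000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(class_weight={0: 1, 1: 20},  # missed positives cost 20x
                                  random_state=0).fit(X_tr, y_tr)

print("Plain tree:\n", confusion_matrix(y_te, plain.predict(X_te)))
print("Cost-weighted tree:\n", confusion_matrix(y_te, weighted.predict(X_te)))
```

Weighting the rare class rescales the impurity calculation, which is the same lever the text describes: overall accuracy drops, but recall on the rare, high-cost class rises.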

Thus, blindly using ML algorithms may often lead us to indecisive, or sometimes plainly wrong, results. In such cases, we must use human intervention to check where exactly the algorithm may have gone wrong, and transform the problem accordingly. In the near future we may actually reach complete artificial intelligence; nevertheless, it would not be wise to depend on it entirely. Any machine learning algorithm necessarily builds on past experience, but Nature never fails to surprise us. Only a human being can adapt to a completely new paradigm, one which may have no past precedent. A ML algorithm will fail miserably in such cases.