E6893 Big Data Analytics Lecture 4: Big Data Analytics — Clustering and Classification Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Distinguished Researcher and Chief Scientist, Graph Computing
September 29th, 2016
E6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms
© 2016 CY Lin, Columbia University
Review — Key Components of Mahout
Machine Learning example: using SVM to recognize a Toyota Camry
Non-ML rules:
Rule 1. Symbol has something like a bull’s head.
Rule 2. Big black portion in front of the car.
Rule 3. …?
ML — Support Vector Machine Feature Space
Positive SVs
Negative SVs
Machine Learning example: using SVM to recognize a Toyota Camry ML — Support Vector Machine Positive SVs
P(Camry) > 0.95
Feature Space Negative SVs
Clustering
Clustering — on feature plane
Clustering example
Steps on clustering
K-means clustering
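The k-means loop described on the following slides can be sketched in plain Python. This is a toy single-process sketch, not Mahout's MapReduce implementation; the sample points and the naive choice of the first k points as seeds are illustrative only.

```python
def kmeans(points, k, iterations=20):
    """Toy k-means: assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    centers = list(points[:k])  # naive seeding with the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [tuple(sum(dim) / len(members) for dim in zip(*members))
                   if members else c
                   for c, members in zip(centers, clusters)]
    return centers, clusters

points = [(1, 1), (1.5, 2), (8, 8), (8.5, 9), (0.5, 1.5), (9, 8.5)]
centers, clusters = kmeans(points, k=2)  # converges to one center per blob
```

Mahout instead seeds the centers with a dedicated job (randomly sampled points), which is what the next slide's "initial cluster centers" step refers to.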
Making initial cluster centers
Parameters to the Mahout k-means clustering algorithm
HelloWorld clustering scenario
HelloWorld Clustering scenario - II
HelloWorld Clustering scenario - III
HelloWorld clustering scenario result
Testing different distance measures
Manhattan and Cosine distances
Tanimoto distance and weighted distance
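The distance measures compared on these slides follow directly from their definitions. A plain-Python sketch (Mahout's own implementations are Java classes not shown here):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate differences ("city block" distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors; length-invariant
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a))
                        * math.sqrt(sum(y * y for y in b)))

def tanimoto_distance(a, b):
    # blends angle and relative magnitude: 1 - dot / (|a|^2 + |b|^2 - dot)
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)
```

Cosine ignores vector length, which suits text; Tanimoto penalizes magnitude differences too; a weighted distance simply multiplies each per-dimension difference by a per-dimension weight before summing.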
Results comparison
Data preparation in Mahout — vectors
Vectorization example (dimensions: 0 = weight, 1 = color, 2 = size)
Mahout code to create vectors for the apple example
Mahout code to create vectors for the apple example — II
Vectorization of text
Vector Space Model: Term Frequency (TF)
Stop Words:
Stemming:
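Putting the pieces above together, term-frequency vectorization with stop-word removal looks like the sketch below. The stop-word list here is a toy subset for illustration; real analyzers such as Lucene's use much longer lists and add stemming.

```python
from collections import Counter

STOP_WORDS = {"it", "was", "the", "of", "a", "an"}  # toy list for illustration

def term_frequencies(text):
    """Lowercase, strip punctuation, drop stop words, count the rest."""
    tokens = (t.strip(".,!?").lower() for t in text.split())
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

tf = term_frequencies("It was the best of times, it was the worst of times.")
```

The resulting counter is the sparse TF vector for the document: each surviving term is a dimension, its count the coordinate value.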
Most Popular Stemming algorithms
Term Frequency — Inverse Document Frequency (TF-IDF)
A word's weight is reduced if the word occurs frequently across all the documents in the dataset: w(t, d) = tf(t, d) · log(N / df(t)), or a smoothed variant such as tf(t, d) · log(1 + N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t.
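A sketch of the standard tf × log(N/df) weighting on toy documents (invented here, not the Reuters data used later):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists.  Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()  # in how many documents does each term occur?
    for doc in docs:
        df.update(set(doc))
    return [{t: count * math.log(n / df[t]) for t, count in Counter(doc).items()}
            for doc in docs]

docs = [["big", "data", "analytics"],
        ["big", "data", "mahout"],
        ["graph", "data"]]
weights = tf_idf(docs)
# "data" occurs in every document, so its weight is log(3/3) = 0 everywhere
```

Terms that appear everywhere ("data" here) carry no discriminative weight; terms unique to one document ("analytics", "graph") get the full log(N) boost.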
n-gram
It was the best of times, it was the worst of times.
==> bigrams
Mahout provides a log-likelihood ratio test to reduce the dimensionality of the n-grams.
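Bigram generation itself is a one-line sliding window. A sketch (Mahout's n-gram job additionally applies the log-likelihood ratio filter to keep only significant collocations):

```python
def ngrams(tokens, n=2):
    """Slide a window of length n over the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("it was the best of times".split())
```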
Examples — using a news corpus
Reuters-21578 dataset: 22 files; each contains 1,000 documents, except the last one. http://www.daviddlewis.com/resources/testcollections/reuters21578/
Extraction code:
Mahout dictionary-based vectorizer
Mahout dictionary-based vectorizer — II
Mahout dictionary-based vectorizer — III
Outputs & Steps
1. Tokenize the documents using the Lucene StandardAnalyzer.
2. Generate n-grams.
3. Convert the tokenized documents into vectors using TF.
4. Count DF, then create TF-IDF vectors.
A practical setting of flags
Normalization
A long document may appear similar to all the other documents simply because its vector is large. ==> Normalization can help.
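A p-norm normalization sketch (p = 2 gives the usual unit-length scaling; treating Mahout's normalization option as a p-norm is the standard reading, but the exact flag semantics are not reproduced here):

```python
def normalize(vec, p=2.0):
    """Scale a vector so its p-norm is 1, leaving the zero vector alone."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec] if norm else list(vec)

unit = normalize([3.0, 4.0])  # 2-norm was 5, so this becomes [0.6, 0.8]
```

After normalization, two documents with the same word proportions compare as equally similar regardless of their lengths.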
Clustering methods provided by Mahout
K-means clustering
Hadoop k-means clustering jobs
K-means clustering running as a MapReduce job
Hadoop k-means clustering code
The output
Canopy clustering to estimate the number of clusters
Specify what size clusters to look for, and the algorithm will estimate how many clusters of approximately that size the data contains. It uses two distance thresholds, T1 > T2; any point within T2 of an existing canopy center can never become the center of a new canopy.
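The center-selection rule can be sketched in a few lines on toy 1-D data. A full canopy pass would also record the loose T1-members of each canopy; that part is omitted here since only the center count matters for seeding k-means.

```python
def canopy_centers(points, t2, dist):
    """Greedy single pass: a point within t2 of an existing canopy
    center can never become the center of a new canopy."""
    centers = []
    for p in points:
        if all(dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers

centers = canopy_centers([0.0, 1.0, 2.0, 10.0, 11.0, 20.0],
                         t2=3.0, dist=lambda a, b: abs(a - b))
# three well-separated groups -> three canopy centers
```

The number of centers produced is then passed to k-means as k, and the centers themselves make good initial seeds.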
Running canopy clustering
Created fewer than 50 centroids.
News clustering code
News clustering example —> finding related articles
News clustering code — II
News clustering code — III
Other clustering algorithms
Hierarchical clustering
Different clustering approaches
When to use Mahout for classification?
The advantage of using Mahout for classification
Classification — definition
How does a classification system work?
Key terminology for classification
Input and Output of a classification model
Four types of values for predictor variables
Sample data that illustrates all four value types
Supervised vs. Unsupervised Learning
Work flow in a typical classification project
Classification Example 1 — Color-Fill
Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. The target variable is the “color-fill” label.
Target leak
• A target leak is a bug that unintentionally includes information about the target variable among the predictor variables.
• This should not be confused with intentionally including the target variable in the record of a training example.
• Target leaks can seriously affect the accuracy of the classification system.
Classification Example 2 — Color-Fill (another feature)
Mahout classification algorithms
Mahout classification algorithms include:
• Naive Bayesian
• Complementary Naive Bayesian
• Stochastic Gradient Descent (SGD)
• Random Forest
Comparing two types of Mahout Scalable algorithms
Step-by-step simple classification example
1. The data and the challenge
2. Training a model to find color-fill: preliminary thinking
3. Choosing a learning algorithm to train the model
4. Improving the performance of the classifier
Classification Example 3
What may be a good predictor?
Choose algorithm via Mahout
Stochastic Gradient Descent (SGD)
Characteristics of SGD
Support Vector Machine (SVM)
Maximize the margin around the decision boundary; remember only the “support vectors”.
Nonlinear kernels
Naive Bayes
Training set:
Classifier using Gaussian distribution assumptions:
Test Set:
==> female
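A minimal Gaussian naive Bayes sketch matching the slide's setup: per class, fit a Gaussian to each feature on the training set, then pick the class with the highest summed log-likelihood. The one-feature data here is invented for illustration, not the slide's training set.

```python
import math

def gaussian_log_pdf(x, mean, var):
    # log of the normal density N(mean, var) evaluated at x
    return -((x - mean) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def train(samples):
    """samples: {label: list of feature tuples} -> per-label (mean, var) per feature."""
    model = {}
    for label, rows in samples.items():
        params = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
            params.append((mean, var))
        model[label] = params
    return model

def classify(model, x):
    """Pick the label maximizing the summed per-feature log-likelihood
    (uniform class priors assumed)."""
    return max(model, key=lambda label: sum(
        gaussian_log_pdf(v, m, var) for v, (m, var) in zip(x, model[label])))

model = train({"male":   [(6.0,), (5.9,), (5.6,)],
               "female": [(5.0,), (5.4,), (5.2,)]})
```

The "naive" part is the independence assumption: each feature contributes its log-likelihood separately, so the per-class score is just a sum.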
Random Forest
Random forest uses a modified tree-learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.
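The feature-subset trick in isolation, as a sketch (the √m subset size is a common default choice, not a detail taken from Mahout's code):

```python
import math
import random

def candidate_features(n_features, rng):
    """At each candidate split, consider only ~sqrt(m) randomly chosen
    features; this decorrelates the trees in the forest."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(7)
subset = candidate_features(16, rng)  # 4 distinct feature indices out of 16
```

Because each tree sees different feature subsets at its splits, the trees make different mistakes, and averaging their votes reduces variance.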
E6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms
© 2016 CY Lin, Columbia University
Choosing a learning algorithm to train the model
• One low-overhead classification method is the stochastic gradient descent (SGD) algorithm for logistic regression.
• This algorithm is sequential, but it is fast.
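The online update at the heart of SGD logistic regression, sketched on a toy linearly separable problem. Mahout's online logistic regression adds learning-rate schedules and regularization that this sketch omits.

```python
import math

def sgd_logistic(data, lr=0.5, epochs=200):
    """One gradient step per example: w += lr * (y - p) * x."""
    w = [0.0] * (len(data[0][0]) + 1)  # bias + one weight per feature
    for _ in range(epochs):
        for x, y in data:
            p = predict(w, x)
            g = y - p  # gradient of the log-likelihood for this example
            w[0] += lr * g
            for i, xi in enumerate(x):
                w[i + 1] += lr * g * xi
    return w

def predict(w, x):
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# the label depends only on the first feature; the second is noise
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
w = sgd_logistic(data)
```

Each example updates the weights immediately, which is why the algorithm is sequential yet needs only one lightweight pass per example.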
The donut.csv data file in Example 3
Build a model using Mahout
Trainlogistic program
Evaluate the model
AUC (0 ~ 1):
1 — perfect
0 — perfectly wrong
0.5 — random
Confusion matrix
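Both metrics are simple to state: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, and the confusion matrix just counts (actual, predicted) pairs. A sketch (not the evaluator Mahout ships):

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives/negatives for a binary classifier."""
    m = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        m[key] += 1
    return m

def auc(y_true, scores):
    """Probability a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise definition makes the endpoints obvious: every positive outranked (AUC = 1), every positive outranked by every negative (AUC = 0), and coin-flip scores averaging to 0.5.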
Questions?