Big Data Analytics Clustering and Classification

Author: Magdalen Fowler
E6893 Big Data Analytics Lecture 4: Big Data Analytics — Clustering and Classification Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Distinguished Researcher and Chief Scientist, Graph Computing

September 29th, 2016

E6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms

© 2016 CY Lin, Columbia University

Review — Key Components of Mahout


Machine Learning example: using SVM to recognize a Toyota Camry

Non-ML
Rule 1. Symbol has something like a bull’s head.
Rule 2. Big black portion in front of car.
Rule 3. …..????

ML: Support Vector Machine

[Figure: feature space showing positive SVs and negative SVs]


Machine Learning example: using SVM to recognize a Toyota Camry

ML: Support Vector Machine

[Figure: feature space showing positive SVs and negative SVs; P_Camry > 0.95]


Clustering


Clustering — on feature plane


Clustering example


Steps on clustering


K-means clustering
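As a rough illustration of the k-means iteration (not the Mahout implementation), the sketch below alternates between assigning each point to its nearest center and recomputing each center as the mean of its points. The function names, the sample points, and the choice of the first two points as initial centers are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centers, max_iter=10):
    """Assign points to the nearest center, then recompute the centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters


# Illustrative 2-D points (an assumption, not data recovered from the slides).
points = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3),
          (8, 8), (9, 8), (8, 9), (9, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (2, 1)])
```

On these points the loop should settle on one center near (1.8, 1.8) and one near (8.5, 8.5), even though both initial centers start in the same corner.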


Making initial cluster centers


Parameters to Mahout k-means clustering algorithm


HelloWorld clustering scenario


HelloWorld Clustering scenario - II


HelloWorld Clustering scenario - III


HelloWorld clustering scenario result


Testing different distance measures


Manhattan and Cosine distances


Tanimoto distance and weighted distance
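The distance measures named on these slides can be written out in a few lines. The following is a minimal sketch using the common textbook definitions; the function names are chosen for this example, and Mahout's own distance-measure classes may differ in detail.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def tanimoto_distance(a, b):
    """1 minus the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    return 1.0 - dot / (sq_a + sq_b - dot)


def weighted_euclidean(a, b, weights):
    """Euclidean distance with a per-dimension weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))
```

Cosine distance ignores vector length and compares direction only, whereas Tanimoto distance takes both the angle and the relative magnitudes into account.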


Results comparison


Data preparation in Mahout — vectors


Vectorization example

Vector dimensions: 0: weight, 1: color, 2: size


Mahout code to create vectors of the apple example


Mahout code to create vectors of the apple example (II)


Vectorization of text

Vector Space Model: Term Frequency (TF)

Stop Words:

Stemming:


Most Popular Stemming algorithms


Term Frequency — Inverse Document Frequency (TF-IDF)

The weight of a word is reduced more the more frequently it is used across all the documents in the dataset.
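To make the weighting concrete, here is a minimal sketch that assumes one common variant, W = TF × log(N / DF), where N is the number of documents and DF is the number of documents containing the term. The slide's own formulas did not survive extraction, so this variant, the function name `tf_idf`, and the toy documents are assumptions; Mahout's exact weighting may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights


# Toy corpus (hypothetical): "the" occurs in every document.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
weights = tf_idf(docs)
```

Under this variant a term that appears in every document, such as "the" here, gets weight log(1) = 0, while a term unique to one document keeps the largest weight.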


n-gram

It was the best of times. It was the worst of times.

==> bigram

Mahout provides a log-likelihood test to reduce the dimensions of n-grams.
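Generating n-grams from a sentence amounts to sliding a window of n tokens over the text. The sketch below is a simplified illustration: the regular-expression tokenizer and the function name `ngrams` are assumptions made here, not Mahout's tokenization, and the log-likelihood pruning step is not shown.

```python
import re


def ngrams(text, n=2):
    """Lowercase the text, split it into word tokens, and return all n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("It was the best of times. It was the worst of times.")
```

Because every adjacent pair becomes a feature, the number of distinct n-grams grows quickly with corpus size, which is why a pruning test is useful.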


Examples — using a news corpus

Reuters-21578 dataset: 22 files, each containing 1000 documents except the last one.
http://www.daviddlewis.com/resources/testcollections/reuters21578/

Extraction code:


Mahout dictionary-based vectorizer


Mahout dictionary-based vectorizer — II


Mahout dictionary-based vectorizer — III


Outputs & Steps

1. Tokenization using Lucene StandardAnalyzer
2. n-gram generation step
3. Converts the tokenized documents into vectors using TF
4. Counts DF and then creates TF-IDF


A practical setting of flags


normalization

Some documents may appear similar to all the other documents simply because they are large. ==> Normalization can help.
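One way to remove the effect of document length is to divide each vector by its p-norm, so that long and short documents end up on the same scale. The following is a minimal sketch; the function name and the default p = 2 are assumptions for illustration.

```python
def normalize(vec, p=2):
    """Scale a vector so that its p-norm equals 1 (zero vectors are returned unchanged)."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec] if norm else list(vec)


short_doc = normalize([1, 2, 0])
long_doc = normalize([10, 20, 0])  # same word proportions, ten times longer
```

After normalization the two example vectors are practically identical, so the longer document no longer dominates distance computations.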


Clustering methods provided by Mahout


K-means clustering


Hadoop k-means clustering jobs


K-means clustering running as MapReduce job


Hadoop k-means clustering code


The output


Canopy clustering to estimate the number of clusters

Tell the algorithm what size of clusters to look for, and it will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This method prevents all points close to an already existing canopy from being the center of a new canopy.
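The two-threshold idea can be sketched as follows, assuming a loose threshold `t1` and a tight threshold `t2` with t1 > t2. Points within t1 of a canopy center join that canopy, and points within t2 are also removed from the list of candidate centers. This is a simplified single-machine illustration with names chosen for this example, not the Mahout job.

```python
import math


def canopy(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)  # next remaining point becomes a center
        members = [center]
        still_candidates = []
        for p in candidates:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)  # loosely close: belongs to this canopy
            if d >= t2:
                still_candidates.append(p)  # not tightly close: may still start a canopy
        candidates = still_candidates
        canopies.append((center, members))
    return canopies


# Illustrative points and thresholds (assumptions for this example).
points = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3),
          (8, 8), (9, 8), (8, 9), (9, 9)]
result = canopy(points, t1=6, t2=3)
```

The number of canopies returned can then serve as an estimate of k, and the canopy centers as initial centers for k-means.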


Running canopy clustering

Created fewer than 50 centroids.


News clustering code


News clustering example ==> finding related articles


News clustering code — II


News clustering code — III


Other clustering algorithms

Hierarchical clustering


Different clustering approaches


When to use Mahout for classification?


The advantage of using Mahout for classification


Classification — definition


How does a classification system work?


Key terminology for classification


Input and Output of a classification model


Four types of values for predictor variables


Sample data that illustrates all four value types


Supervised vs. Unsupervised Learning


Workflow in a typical classification project


Classification Example 1 — Color-Fill

Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. The target variable is the “color-fill” label.

Target leak

• A target leak is a bug that involves unintentionally providing data about the target variable in the section of the predictor variables.
• Don’t confuse this with intentionally including the target variable in the record of a training example.
• Target leaks can seriously affect the accuracy of the classification system.


Classification Example 2 — Color-Fill (another feature)


Mahout classification algorithms

Mahout classification algorithms include:
• Naive Bayesian
• Complementary Naive Bayesian
• Stochastic Gradient Descent (SGD)
• Random Forest


Comparing two types of Mahout Scalable algorithms


Step-by-step simple classification example

1. The data and the challenge
2. Training a model to find color-fill: preliminary thinking
3. Choosing a learning algorithm to train the model
4. Improving performance of the classifier


Classification Example 3


What may be a good predictor?


Choose algorithm via Mahout


Stochastic Gradient Descent (SGD)


Characteristics of SGD


Support Vector Machine (SVM)

maximize boundary distances; remembering “support vectors”

nonlinear kernels

Naive Bayes

Training set:

Classifier using Gaussian distribution assumptions:

Test Set:

==> female
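A Gaussian naive Bayes classifier of this kind can be sketched in a few lines. The slide's training and test tables did not survive extraction, so the height and weight values below are invented purely for illustration, and the function names are likewise assumptions. Each class is modelled by a prior plus an independent Gaussian per feature, and the predicted class is the one with the highest log-posterior.

```python
import math
from statistics import mean, variance


def fit(samples):
    """samples: {label: [feature tuples]} -> {label: (prior, [(mean, var), ...])}."""
    total = sum(len(rows) for rows in samples.values())
    model = {}
    for label, rows in samples.items():
        columns = list(zip(*rows))
        stats = [(mean(col), variance(col)) for col in columns]
        model[label] = (len(rows) / total, stats)
    return model


def log_gaussian(x, mu, var):
    """Log of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, x):
    """Return the label with the highest log prior + summed log likelihood."""
    def score(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, mu, var) for v, (mu, var) in zip(x, stats))
    return max(model, key=score)


# Hypothetical (height cm, weight kg) samples, NOT the data from the slide.
training = {
    "female": [(160, 55), (165, 60), (158, 52), (170, 63)],
    "male": [(178, 80), (183, 85), (175, 77), (180, 90)],
}
model = fit(training)
label = predict(model, (162, 57))
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together.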


Random Forest

Random forest uses a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.
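The feature-subsampling step can be illustrated on its own. The sketch below searches for the best split of one tree node but only among a randomly drawn subset of feature indices. The Gini impurity criterion, the function names, and the toy data are assumptions for this example; a full random forest would additionally build many such trees on bootstrap samples.

```python
import random


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels, n_features, rng):
    """Best (impurity, feature, threshold) among a random subset of features."""
    candidates = rng.sample(range(len(rows[0])), n_features)
    best = None
    for f in candidates:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # this threshold does not separate anything
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is constant.
rows = [(1, 5), (2, 5), (8, 5), (9, 5)]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels, n_features=2, rng=random.Random(0))
```

Restricting each split to a random feature subset makes the individual trees less correlated with one another, which is what makes averaging them effective.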


Choosing a learning algorithm to train the model

• One low-overhead classification method is the stochastic gradient descent (SGD) algorithm for logistic regression.
• This algorithm is sequential, but it’s fast.
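The sequential nature of SGD is easiest to see in code: the weights are updated after every single training example rather than after a pass over the whole dataset. The following is a minimal sketch of logistic regression trained this way; the learning rate, epoch count, function names, and toy data are assumptions for illustration and are unrelated to Mahout's implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(rows, labels, epochs=200, lr=0.1, seed=0):
    """Fit logistic-regression weights (bias first) with per-example updates."""
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a random order each epoch
        for i in order:
            x = (1.0,) + tuple(rows[i])  # prepend the constant bias input
            error = labels[i] - sigmoid(sum(w * v for w, v in zip(weights, x)))
            # Gradient step on the log-likelihood for this one example.
            weights = [w + lr * error * v for w, v in zip(weights, x)]
    return weights


def predict(weights, row):
    """Estimated probability that the row belongs to class 1."""
    x = (1.0,) + tuple(row)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


# Toy 1-D data (hypothetical): small values are class 0, large values class 1.
rows = [(0.0,), (1.0,), (3.0,), (4.0,)]
labels = [0, 0, 1, 1]
weights = train_sgd(rows, labels)
```

Because each update touches only one example, memory use stays constant and the model can be trained on a stream of data.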


The donut.csv data file in Example 3


Build a model using Mahout


Trainlogistic program


Evaluate the model

AUC (0 ~ 1):
1 = perfect
0 = perfectly wrong
0.5 = random

Confusion matrix
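Both evaluation measures can be computed directly from model scores and true labels. The sketch below uses the pairwise interpretation of AUC (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counting one half) and a confusion matrix at a fixed threshold. The function names and the 0.5 threshold are assumptions for this example.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(scores, labels, threshold=0.5):
    """Counts of true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts
```

AUC is threshold-independent, whereas the confusion matrix describes performance at one particular decision threshold, so the two are usually reported together.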


Questions?
