Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 8 — ©Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab ...

Author: Meryl Melton


Data Mining: Concepts and Techniques. Slides for Textbook, Chapter 8. ©Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, Simon Fraser University. Ari Visa, Institute of Signal Processing, Tampere University of Technology. September 12, 2013


Chapter 8. Cluster Analysis

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary


General Applications of Clustering

- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - document classification
  - cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


What Is Good Clustering?

- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.


Requirements of Clustering in Data Mining

- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability


Data Structures

- Data matrix (two modes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Dissimilarity matrix (one mode)

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Measure the Quality of Clustering

- Dissimilarity/similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.


Types of Data in Cluster Analysis

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types


Interval-valued variables

- Standardize data
  - Calculate the mean absolute deviation:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

    where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

  - Calculate the standardized measurement (z-score):

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

- Using mean absolute deviation is more robust than using standard deviation
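The standardization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `standardize` and the sample data are hypothetical, and the choice to raise an error when every value is identical is ours, since the slides do not say how to handle that case.

```python
def standardize(values):
    """Return z-scores for one variable f, using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                         # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        # Assumption: identical values leave the z-score undefined.
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - m_f) / s_f for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up sample data
print(standardize(heights_cm))
```

Because the deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so its z-score stays comparatively large and the outlier remains detectable.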


Similarity and Dissimilarity Between Objects

- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Some popular ones include the Minkowski distance:

$$
d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}
$$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is Manhattan distance:

$$
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$


Similarity and Dissimilarity Between Objects (Cont.)

- If q = 2, d is Euclidean distance:

$$
d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
$$

- Properties
  - $d(i,j) \ge 0$
  - $d(i,i) = 0$
  - $d(i,j) = d(j,i)$
  - $d(i,j) \le d(i,k) + d(k,j)$
- One can also use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
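As a small sketch of the Minkowski family, the Python below computes the distance for any positive integer q and derives the Manhattan and Euclidean cases from it. The function names and the sample points are hypothetical, chosen only for illustration.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


def manhattan(i, j):
    return minkowski(i, j, 1)   # q = 1


def euclidean(i, j):
    return minkowski(i, j, 2)   # q = 2


x, y, z = (1.0, 2.0), (4.0, 6.0), (0.0, 0.0)   # made-up sample objects
print(manhattan(x, y), euclidean(x, y))
# Triangle inequality: d(x, y) <= d(x, z) + d(z, y)
print(euclidean(x, y) <= euclidean(x, z) + euclidean(z, y))
```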


Binary Variables

- A contingency table for binary data

                    Object j
                    1       0       sum
  Object i    1     a       b       a+b
              0     c       d       c+d
              sum   a+c     b+d     p

- Simple matching coefficient (invariant, if the binary variable is symmetric):

$$
d(i,j) = \frac{b + c}{a + b + c + d}
$$

- Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$
d(i,j) = \frac{b + c}{a + b + c}
$$


Dissimilarity between Binary Variables

Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0

$$
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$
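The example can be checked with a short Python sketch. The helper names are hypothetical. Only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) are encoded, with Y and P mapped to 1 and N mapped to 0. Returning 0.0 when two objects share no 1-valued attribute at all is our assumption, since the slides do not cover that case.

```python
def binary_counts(x, y):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables (0/0 matches ignored)."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: no positive value in either object
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(jaccard(jack, mary), 2))
print(round(jaccard(jack, jim), 2))
print(round(jaccard(jim, mary), 2))
```

The three printed values correspond to d(jack, mary), d(jack, jim) and d(jim, mary) in that order.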


Nominal Variables

- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: Simple matching
  - m: # of matches, p: total # of variables

$$
d(i,j) = \frac{p - m}{p}
$$

- Method 2: use a large number of binary variables
  - creating a new binary variable for each of the M nominal states
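Both methods are easy to sketch in Python. The function names and the sample objects below are hypothetical. `nominal_dissimilarity` implements simple matching as (p - m) / p, and `one_hot` shows one way to create a binary variable per nominal state.

```python
def nominal_dissimilarity(i, j):
    """Method 1: simple matching over p nominal variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable for each of the M nominal states."""
    return [1 if value == s else 0 for s in states]


obj_i = ("red", "small", "round")    # made-up objects
obj_j = ("red", "large", "round")
print(nominal_dissimilarity(obj_i, obj_j))
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```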


Ordinal Variables

- An ordinal variable can be discrete or continuous
- order is important, e.g., rank
- Can be treated like interval-scaled
  - replacing $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

  - compute the dissimilarity using methods for interval-scaled variables
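A minimal Python sketch of the rank mapping follows. The function name and the medal example are hypothetical, and the caller is assumed to supply the states in increasing order.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M_f - 1)."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    # Ranks start at 1 for the lowest state.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


medals = ["bronze", "gold", "silver"]        # made-up observations
order = ["bronze", "silver", "gold"]         # lowest to highest
print(ordinal_to_interval(medals, order))
```

The resulting values can then be fed to any interval-scaled distance, such as Manhattan or Euclidean distance.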


Ratio-Scaled Variables

- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
- Methods:
  - treat them like interval-scaled variables: not a good choice! (why?)
  - apply logarithmic transformation: $y_{if} = \log(x_{if})$
  - treat them as continuous ordinal data and treat their rank as interval-scaled
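To illustrate the logarithmic transformation, the sketch below generates values of the form Ae^{Bt} and shows that their logarithms are evenly spaced. The function name and the constants A = 2 and B = 0.5 are hypothetical, chosen only for the demonstration.

```python
import math


def log_transform(values):
    """Apply y = log(x) to a ratio-scaled variable (x must be positive)."""
    if any(x <= 0 for x in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(x) for x in values]


# Made-up measurements following A * e^(B t) with A = 2, B = 0.5
raw = [2.0 * math.exp(0.5 * t) for t in range(4)]
logged = log_transform(raw)

# Gaps between raw values grow with t; gaps between logs stay near B.
print([raw[k + 1] - raw[k] for k in range(3)])
print([logged[k + 1] - logged[k] for k in range(3)])
```

On the raw scale, the distance between consecutive measurements keeps growing, which distorts interval-scaled distances. After the transformation the spacing is constant.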


Variables of Mixed Types 



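- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
- One may use a weighted formula to combine their effects:

$$
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

  where the indicator $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary, and 1 otherwise
  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled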
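The weighted formula can be sketched as follows. Several points here are assumptions, not taken from the slides: the function name, the `kinds` labels, and the `ranges` argument are hypothetical. The indicator weight is assumed to follow the textbook convention, meaning it is 0 when a value is missing (`None`) or when an asymmetric binary variable is 0 in both objects, and 1 otherwise. The normalized distance for interval variables is assumed to be the absolute difference divided by the variable's range. Ordinal and ratio-scaled variables are assumed to have been converted beforehand and are passed in as interval variables.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities over variables of mixed types.

    kinds[f] is "binary", "asymmetric_binary", "nominal" or "interval".
    ranges[f] is max - min of variable f (used for "interval" only).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue                      # weight 0: missing value
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue                      # weight 0: joint absence
        if kind == "interval":
            d_f = abs(xf - yf) / ranges[f]
        else:
            d_f = 0.0 if xf == yf else 1.0
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables between the two objects")
    return numerator / denominator


kinds = ("nominal", "asymmetric_binary", "asymmetric_binary", "interval")
ranges = (None, None, None, 40.0)
patient_a = ("M", 1, 0, 30.0)   # made-up objects
patient_b = ("F", 1, 0, 50.0)
print(mixed_dissimilarity(patient_a, patient_b, kinds, ranges))
```

In this example the third variable is skipped because both objects have the value 0, so the result averages over the three remaining variables.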


Major Clustering Approaches

- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model


Partitioning Algorithms: Basic Concept

- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67): Each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignment changes.
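The four steps can be sketched in plain Python as below. This is an illustrative sketch, not a reference implementation. The function names, the `max_iter` and `seed` parameters, the shuffled round-robin initial partition, the use of squared Euclidean distance, and the rule that an emptied cluster keeps its previous centroid are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def kmeans(points, k, max_iter=100, seed=0):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    # Step 1: partition the objects into k nonempty subsets.
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an emptied cluster keeps its centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: repeat from Step 2; stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up points
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

Like any k-means run, the result depends on the initial partition, so different seeds may converge to different local optima on less well-separated data.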


The K-Means Clustering Method

Example

[Figure: k-means example shown as scatter plots; both axes of each plot run from 0 to 10.]

Comments on the K-Means Method 

Strength 





Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t