AUTOMATIC CLUSTER DETECTION


Slide 1

Peter Brezany
Institut für Softwarewissenschaft, Universität Wien

Introduction The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

Slide 2

Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Cluster analysis is an important human activity. Early in childhood, one learns how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. Remark: In marketing terms, subdividing the population according to variables already known to be good discriminators is called “segmentation”.


Introduction (2) Clustering is an example of unsupervised learning. Unlike classification, it does not rely on predefined classes and class-labeled training examples.

Slide 3

We are searching for groups of records - the clusters - that are similar to one another, in the expectation that similar records represent similar customers, suppliers, or products that will behave in similar ways. For this reason, clustering is a form of learning by observation, rather than learning by examples. Automatic cluster detection is rarely used in isolation because finding clusters is not an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean.

Some Typical Applications of Clustering In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.

Slide 4

In biology: deriving plant and animal taxonomies, categorizing genes with similar functionality.
In earth observation: the identification of areas of similar land use.
In automobile insurance: the identification of groups of policy holders with a high average claim cost.
In city planning: the identification of groups of houses according to house type, value, and geographical location.
Classification of Web documents.


Example: Star Light, Star Bright Early in the twentieth century, astronomers trying to understand the relationship between the luminosity (brightness) of stars and their temperatures made scatter plots like the one in the figure on the next slide.

Slide 5

The stars plotted by the astronomers Hertzsprung and Russell fall into three clusters. We now understand that these three clusters represent stars in very different phases in the stellar life cycle. The relationship between luminosity and temperature is consistent within each cluster, but the relationship is different in each cluster because a fundamentally different process is generating the heat and light. Our problem has only two dimensions. If all problems had so few dimensions, there would be no need for automatic cluster detection algorithms. As the number of dimensions (independent variables) increases, our ability to visualize clusters and our intuition about the distance between points quickly break down.

Clusters of Stars

Slide 6

[Figure: the Hertzsprung-Russell diagram. The vertical axis shows luminosity (Sun = 1) on a logarithmic scale from 10^-4 to 10^6; the horizontal axis shows temperature in degrees Kelvin, from 40,000 down to 2,500. Three clusters of stars are visible: the Main Sequence, the Red Giants, and the White Dwarfs.]


Fitting the Troops The U.S. Army commissioned a study on how to design the uniforms of female soldiers. The Army's goal is to reduce the number of different uniform sizes that have to be kept in inventory while still providing each soldier with well-fitting khakis. The existing "non-army" classification system for women's clothing contains far too many sizes to be practical. Slide 7

Susan Ashdown and Beatrix Paal, researchers at Cornell University, designed a new set of sizes that fit particular body types. The database they mined contained more than 100 measurements for each of nearly 3,000 women. The clustering technique employed in this case was the k-means algorithm, with the following steps:



1. Choose the number of clusters (uniform sizes, in our example) that you want to form. That number is the k in k-means.

2. k seeds are chosen to be the initial guess at the centroids of the clusters. Each seed is just a particular combination of values for each measurement - each seed might record the actual measurements of one of the women in the sample, but that is not a requirement.

3. Each record in the database is given a preliminary cluster assignment based on the seed to which it is closest.

Slide 8

4. The centroids (or means) of the new clusters are calculated and the whole process starts over, with the new centroids taking on the role of the seeds.

5. Since the new centroids will not be in the same place as the original seeds, some of the records will be moved from the first cluster to which they were assigned to another one.

6. After a few iterations, this motion stops and the centroid of each cluster contains the measurements that define one of the new uniform sizes.
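A minimal sketch of this procedure in Python is shown below; it assumes scikit-learn is available and uses randomly generated stand-in data in place of the real Cornell measurement database (the array shape and the choice of five sizes are purely illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical stand-in for the Cornell database: ~3,000 women, 100 measurements each.
    rng = np.random.default_rng(0)
    measurements = rng.normal(loc=100.0, scale=15.0, size=(3000, 100))

    # Step 1: choose the number of clusters (uniform sizes) -- the k in k-means.
    k = 5

    # Steps 2-6: the library picks k seeds, assigns each record to the nearest seed,
    # recomputes the centroids, and repeats until the assignments stop changing.
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(measurements)

    sizes = kmeans.cluster_centers_   # each centroid defines one candidate uniform size
    labels = kmeans.labels_           # the size assigned to each woman
    print(sizes.shape, np.bincount(labels))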


The K-Means Method First published by J. B. MacQueen in 1967. It has many variations now.



Slide 9

Step 1: Select k data points to be the seeds. MacQueen's algorithm simply takes the first k records. Each of the seeds is an embryonic cluster with only one element. In our example, k = 3.





Step 2: Assign each record to the cluster whose centroid is nearest. In the figure on the next slide, we have done the first two steps. In this example, drawing the boundaries between the clusters is easy: given two points, X and Y, all points that are equidistant from X and Y fall along a line that passes through the midpoint of the line segment joining X and Y and is perpendicular to it. In our figure, the initial seeds are joined by dashed lines and the cluster boundaries constructed from them are solid lines.

The K-Means Method (2)

The initial seeds determine the initial cluster boundaries.

Slide 10

[Figure: three initial seeds (Seed 1, Seed 2, Seed 3) plotted against dimensions X1 and X2. Dashed lines join the seeds, and the solid perpendicular bisectors of those lines form the initial cluster boundaries.]


The K-Means Method (3)

Slide 11

Step 3: Calculate the centroids of the new clusters. This is simply a matter of averaging the positions of each point in the cluster along each dimension. If there are 200 records assigned to a cluster and we are clustering based on four fields from those records, then geometrically we have 200 points in a 4-dimensional space. The location of each point is described by a vector of the values of the four fields, of the form (X1, X2, X3, X4). The value of X1 for the new centroid is the mean of all 200 X1's, and similarly for X2, X3, and X4.

In the next figure, the new centroids are marked with a cross. The arrows show the motion from the position of the original seeds to the new centroids of the clusters formed from those seeds.

Calculating the Centroids of the New Clusters

Slide 12

[Figure: the same X1-X2 plot; each new centroid is marked with a cross, and arrows show the motion from the original seeds to the new centroids of the clusters formed from those seeds.]
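As a small illustration of the averaging in Step 3, the following sketch assumes 200 records clustered on four numeric fields (the data is randomly generated purely for illustration):

    import numpy as np

    # 200 points in a 4-dimensional space: one row per record, one column per field.
    rng = np.random.default_rng(1)
    cluster_points = rng.normal(size=(200, 4))

    # The new centroid: the mean of all 200 values of X1, and similarly for X2, X3, X4.
    centroid = cluster_points.mean(axis=0)
    print(centroid)   # a vector of the form (X1, X2, X3, X4)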


The K-Means Method (4)

Slide 13

Step 4: Once the new clusters have been found, each point is once again assigned to the cluster with the closest centroid. The next figure shows the new cluster boundaries - formed, as before, by drawing lines equidistant between each pair of centroids. The point with the box around it, which was originally assigned to cluster number 2, has now been assigned to cluster number 1. The process of assigning points to clusters and then recalculating centroids continues until the cluster boundaries stop changing.

Calculating the Centroids of the New Clusters (2)

Slide 14

[Figure: the X1-X2 plot after reassignment. New cluster boundaries are drawn equidistant between each pair of centroids; the boxed point, originally assigned to cluster 2, now belongs to cluster 1.]


The Distance between Two Points Each field in a record becomes one element in a vector describing a point in space. The distance between 2 points is used as the measure of association.

Slide 15

If two points are close in distance, the corresponding records are considered similar. The most common metric is the Euclidean distance. To find the Euclidean distance between X and Y, we first find the differences between the corresponding elements of X and Y (the distance along each axis) and square them. The distance is the square root of the sum of the squared differences.

Types of Data in Cluster Analysis We describe the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Data matrix (or object-by-variable structure): This represents n objects, such as persons, described by p variables (also called measurements or attributes), such as age, height, weight, gender, race, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p variables):

Slide 16

    x11  ...  x1f  ...  x1p
    ...       ...       ...
    xi1  ...  xif  ...  xip
    ...       ...       ...
    xn1  ...  xnf  ...  xnp


Types of Data in Cluster Analysis (2) Dissimilarity (Unähnlichkeit) matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

Slide 17

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    ...      ...      ...
    d(n,1)   d(n,2)   ...   ...   0

where d(i,j) is the measured difference or dissimilarity between objects i and j. In general, d(i,j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i,j) = d(j,i) and d(i,i) = 0, the matrix has the form shown above. Measures of dissimilarity are discussed throughout the next slides.

Types of Data in Cluster Analysis (3) Many clustering algorithms operate on a dissimilarity matrix.

Slide 18

If the data is presented in the form of a data matrix, it can be transformed into a dissimilarity matrix before applying such clustering algorithms. “How can dissimilarity, d(i,j), be assessed?” The next slides discuss how object dissimilarity can be computed for objects described by different types of variables.
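A minimal sketch of such a transformation, assuming Euclidean distance as the dissimilarity measure (the function name and the toy data are illustrative, not part of any particular library):

    import numpy as np

    def dissimilarity_matrix(data):
        """Turn an n-by-p data matrix into an n-by-n matrix of Euclidean distances."""
        n = data.shape[0]
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i):
                d[i, j] = d[j, i] = np.sqrt(np.sum((data[i] - data[j]) ** 2))
        return d

    data = np.array([[1.0, 2.0], [2.0, 4.0], [5.0, 1.0]])   # 3 objects, 2 variables
    print(dissimilarity_matrix(data))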


Interval-Scaled (Numerical) Variables Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples: weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. The measurement unit used can affect the clustering analysis. Therefore, data should be standardized, so that all variables have equal weight. Slide 19

In some applications, the user may intentionally want to give more weight to a certain set of variables than to others. E.g., when clustering basketball player candidates, we may prefer to give more weight to the variable height.



Standardization - given measurements for a variable f, this can be performed as follows:

1. Calculate the mean absolute deviation, s_f, that is,

   s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|),

   where x_1f, ..., x_nf are n measurements of f, and m_f is the mean value of f, that is, m_f = (1/n) (x_1f + x_2f + ... + x_nf).

2. Calculate the standardized measurement, or z-score:

   z_if = (x_if - m_f) / s_f.
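A short sketch of this standardization in Python, using the mean absolute deviation exactly as defined above (the sample values are made up):

    import numpy as np

    def standardize(x):
        """Compute z-scores using the mean absolute deviation s_f instead of the standard deviation."""
        x = np.asarray(x, dtype=float)
        m_f = x.mean()                    # mean value of the variable f
        s_f = np.abs(x - m_f).mean()      # mean absolute deviation
        return (x - m_f) / s_f            # z_if = (x_if - m_f) / s_f

    heights_cm = [160, 172, 158, 181, 169]   # illustrative measurements of one variable
    print(standardize(heights_cm))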

After standardization, or without standardization in certain applications, the dissimilarity between two objects can be computed. The most popular measure is Euclidean distance:

   d(i,j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 ),

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects. Another well-known metric is Manhattan (or city block) distance:

   d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|.

Slide 20

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:

1. d(i,j) >= 0: Distance is a nonnegative number.

2. d(i,i) = 0: The distance of an object to itself is 0.

3. d(i,j) = d(j,i): Distance is a symmetric function.

4. d(i,j) <= d(i,h) + d(h,j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangle inequality).

Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as


Slide 21


   d(i,j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q),

where q is a positive integer. It represents the Manhattan distance when q = 1, and the Euclidean distance when q = 2. If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

   d(i,j) = sqrt( w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + ... + w_p |x_ip - x_jp|^2 ).

Weighting can also be applied to the Manhattan and Minkowski distances.
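The three metrics can be collected into one sketch function, with the weighted Euclidean distance falling out as a special case (a minimal illustration, not a library API):

    import numpy as np

    def minkowski(x, y, q=2, weights=None):
        """Minkowski distance: q = 1 gives Manhattan, q = 2 gives Euclidean.
        Optional per-variable weights give the weighted variants."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
        return np.sum(w * np.abs(x - y) ** q) ** (1.0 / q)

    x, y = [1.0, 3.0, 2.0], [4.0, 1.0, 2.0]
    print(minkowski(x, y, q=1))                        # Manhattan distance
    print(minkowski(x, y, q=2))                        # Euclidean distance
    print(minkowski(x, y, q=2, weights=[2, 1, 1]))     # weighted Euclidean distance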

Binary Variables A binary variable has only two states: 0 or 1 (0 means the variable is absent, 1 means it is present). If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table below, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p = q + r + s + t.

Slide 22

    ---------------------------------------------------------
                              Object j
                       1         0         Sum
    ---------------------------------------------------------
                1      q         r         q + r
    Object i    0      s         t         s + t
                Sum    q + s     r + t     p
    ---------------------------------------------------------

A binary variable is symmetric if both of its states are equally valuable and carry the same weight - e.g., the attribute gender having the states male and female.


Similarity that is based on symmetric binary variables is called invariant similarity:

   d(i,j) = (r + s) / (q + r + s + t).

A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). The similarity based on such variables is called noninvariant similarity.

Slide 23

   d(i,j) = (r + s) / (q + r + s)

This is called the Jaccard coefficient. The number of negative matches, t, is considered unimportant and thus is ignored in the computation.

Example: Dissimilarity between binary variables. The table below represents a patient record table, where name is an object-id, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.

    =================================================================
    name    gender   fever   cough   test-1   test-2   test-3   test-4
    -----------------------------------------------------------------
    Jack    M        Y       N       P        N        N        N
    Mary    F        Y       N       P        N        P        N
    Jim     M        Y       Y       N        N        N        N
    ...     ...      ...     ...     ...      ...      ...      ...
    =================================================================

Slide 24

For the asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between each pair of the three patients, Jack, Mary, and Jim, is computed based only on the asymmetric variables, using the Jaccard coefficient:

   d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

These measurements suggest that Jim and Mary are unlikely to have a similar disease since they have the highest dissimilarity value among the three pairs.
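The same computation can be written as a short sketch in Python; the patient values are copied from the table above and the helper name is ours:

    def jaccard_dissimilarity(a, b):
        """Asymmetric binary dissimilarity d = (r + s) / (q + r + s); negative matches t are ignored."""
        q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
        r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
        s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
        return (r + s) / (q + r + s)

    # Asymmetric attributes (fever, cough, test-1 .. test-4), coded Y/P = 1 and N = 0.
    jack = [1, 0, 1, 0, 0, 0]
    mary = [1, 0, 1, 0, 1, 0]
    jim  = [1, 1, 0, 0, 0, 0]

    print(round(jaccard_dissimilarity(jack, mary), 2))   # 0.33
    print(round(jaccard_dissimilarity(jack, jim), 2))    # 0.67
    print(round(jaccard_dissimilarity(jim, mary), 2))    # 0.75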


Nominal Variables A nominal (categorical) variable is a generalization of the binary variable in that it can take on more than two states. E.g., map color is a nominal variable that may have, say, 5 states: red, yellow, green, pink, and blue.

Slide 25

Let the number of states of a nominal variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M (however, there is no specific ordering).

The dissimilarity between two objects i and j described by nominal variables can be computed using a simple matching approach:

   d(i,j) = (p - m) / p,

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
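A small sketch of this simple matching computation (the objects and their states are made up for illustration):

    def nominal_dissimilarity(i, j):
        """d(i, j) = (p - m) / p for two objects described by nominal variables."""
        p = len(i)                                     # total number of variables
        m = sum(1 for a, b in zip(i, j) if a == b)     # number of matching states
        return (p - m) / p

    obj1 = ["red", "small", "round"]
    obj2 = ["red", "large", "round"]
    print(nominal_dissimilarity(obj1, obj2))   # 1/3: one of three variables differs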

PARTITIONING CLUSTERING METHODS

Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.

The clusters are formed to optimize an objective partitioning criterion, often called a similarity function, such as distance, so that objects within a cluster are “similar”, whereas the objects of different clusters are “dissimilar” in terms of the database attributes.

Slide 26

The k-means algorithm The k-means procedure is summarized in the figure on the next slide. The computing process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as

   E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2,

where E is the sum of the squared error for all objects in the database, p is the point in space representing a given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional).
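A compact sketch of evaluating this criterion for a given clustering (the cluster contents are illustrative):

    import numpy as np

    def squared_error(clusters):
        """E = sum over clusters C_i, and over points p in C_i, of |p - m_i|^2."""
        total = 0.0
        for points in clusters:
            points = np.asarray(points, dtype=float)
            m_i = points.mean(axis=0)                  # mean of cluster C_i
            total += np.sum((points - m_i) ** 2)       # squared distances to the mean
        return total

    clusters = [[[1, 1], [2, 1]], [[8, 9], [9, 8], [8, 8]]]
    print(squared_error(clusters))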


The k-Means Algorithm

Slide 27

Algorithm: k-means. The k-means algorithm for partitioning based on the mean value of the objects in the cluster.

Input: The number of clusters k and a database containing n objects.

Output: A set of k clusters that minimizes the squared-error criterion.

Method:
(1) arbitrarily choose k objects as the initial cluster centers;
(2) repeat
(3)    (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4)    update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
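A straightforward sketch of the method above in Python; the seed selection, the convergence test, and the toy data are illustrative choices, not part of the original algorithm statement:

    import numpy as np

    def k_means(data, k, max_iter=100, seed=0):
        """k-means following the pseudocode: choose k centers, then alternate
        (re)assignment and mean updates until no assignment changes."""
        rng = np.random.default_rng(seed)
        centers = data[rng.choice(len(data), size=k, replace=False)]     # step (1)
        labels = np.full(len(data), -1)
        for _ in range(max_iter):                                        # step (2): repeat
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)                            # step (3): (re)assign
            if np.array_equal(new_labels, labels):                       # step (5): until no change
                break
            labels = new_labels
            for i in range(k):                                           # step (4): update the means
                if np.any(labels == i):
                    centers[i] = data[labels == i].mean(axis=0)
        return centers, labels

    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0.0, 5.0, 10.0)])
    centers, labels = k_means(data, k=3)
    print(centers)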

The k-Medoids Method The k-means algorithm is very sensitive to outliers. How might the algorithm be modified to diminish such sensitivity? Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster.

Slide 28

The basic strategy: (1) a representative object (the medoid) for each cluster is arbitrarily found; (2) each remaining object is clustered with the medoid to which it is the most similar; (3) one of the medoids is iteratively replaced by one of the nonmedoids as long as the quality of the resulting clustering is improved. The clustering quality is estimated using a cost function that measures the average dissimilarity between an object and the medoid of its cluster. To determine whether a nonmedoid object, O_random, is a good replacement for a current medoid, O_j, the following four cases are examined for each of the nonmedoid objects, p.


The k-Medoids Method (2)

Case 1: p currently belongs to medoid O_j. If O_j is replaced by O_random as a medoid and p is closest to one of the other medoids O_i (i != j), then p is reassigned to O_i.

Case 2: p currently belongs to medoid O_j. If O_j is replaced by O_random as a medoid and p is closest to O_random, then p is reassigned to O_random.

Slide 29

Case 3: p currently belongs to medoid O_i (i != j). If O_j is replaced by O_random as a medoid and p is still closest to O_i, then the assignment does not change.

Case 4: p currently belongs to medoid O_i (i != j). If O_j is replaced by O_random as a medoid and p is closest to O_random, then p is reassigned to O_random.

The next figure illustrates the four cases.

Four cases of the cost function

Slide 30

[Figure: four panels illustrating the cost-function cases when medoid Oj is replaced by Orandom, with a legend distinguishing data objects, cluster centers, and assignments before and after swapping. Panel 1: p is reassigned to Oi. Panel 2: p is reassigned to Orandom. Panel 3: no change. Panel 4: p is reassigned to Orandom.]


The k-Medoids Algorithm

Slide 31

Algorithm: k-medoids. A typical k-medoids algorithm for partitioning based on medoid or central objects.

Input: The number of clusters k and a database containing n objects.

Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.

Method:
(1) arbitrarily choose k objects as the initial medoids;
(2) repeat
(3)    assign each object to the cluster with the nearest medoid;
(4)    randomly select a nonmedoid object, O_random;
(5)    compute the total cost, S, of swapping O_j with O_random;
(6)    if S < 0 then swap O_j with O_random to form the new set of k medoids;
(7) until no change;
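The following is a simplified sketch of this procedure. It follows the swap-based outline above but, for brevity, tries every nonmedoid candidate instead of a random one and uses Euclidean distance; those simplifications are ours, not part of the original algorithm statement:

    import numpy as np

    def total_cost(data, medoid_idx):
        """Sum of distances from every object to its nearest medoid."""
        dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
        return dists.min(axis=1).sum()

    def k_medoids(data, k, seed=0, max_iter=100):
        rng = np.random.default_rng(seed)
        medoids = list(rng.choice(len(data), size=k, replace=False))    # (1) initial medoids
        for _ in range(max_iter):                                       # (2) repeat
            best_cost = total_cost(data, medoids)
            improved = False
            for j in range(k):                                          # consider replacing each medoid
                for o_random in range(len(data)):                       # (4) candidate nonmedoid
                    if o_random in medoids:
                        continue
                    candidate = medoids.copy()
                    candidate[j] = o_random
                    cost = total_cost(data, candidate)                  # (5) total cost of the swap
                    if cost < best_cost:                                # (6) keep cost-reducing swaps
                        medoids, best_cost, improved = candidate, cost, True
            if not improved:                                            # (7) until no change
                break
        labels = np.linalg.norm(data[:, None, :] - data[medoids][None, :, :], axis=2).argmin(axis=1)
        return medoids, labels

    rng = np.random.default_rng(2)
    data = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in (0.0, 6.0)])
    print(k_medoids(data, k=2)[0])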

Partitioning Methods for Large Databases PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It works effectively for small data sets.

Slide 32

A sampling-based method, called CLARA (Clustering LARge Applications), can be used. The idea: instead of taking the whole data set, a small portion of the actual data is chosen as a representative of the data. Medoids are then chosen from this sample using PAM.

Drawback: CLARA cannot find the best clustering if any sampled medoid is not among the best medoids. For example, if an object O_i is one of the medoids in the best k medoids but is not selected during sampling, CLARA will never find the best clustering.
