The Distance Macro. SAS Release 8.2. December 4, 2003. SAS Institute Inc.

The Distance Macro SAS Release 8.2 December 4, 2003 SAS Institute Inc. Preface Disclaimer: THIS INFORMATION IS PROVIDED BY SAS  INSTITUTE INC. A...

Author: Philippa Malone

The Distance Macro SAS

Release 8.2 December 4, 2003

SAS Institute Inc.

Preface

Disclaimer: THIS INFORMATION IS PROVIDED BY SAS INSTITUTE INC. AS A SERVICE TO ITS USERS. IT IS PROVIDED "AS IS". THERE ARE NO WARRANTIES, EXPRESSED OR IMPLIED, AS TO MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE REGARDING THE ACCURACY OF THE MATERIALS OR CODE CONTAINED HEREIN.

Requirements To Run Your DISTANCE Macro

The following statement in your DISTANCE macro needs to be changed to refer to the file containing XMACRO on your system:

   %inc 'xmacro.sas';

The XMACRO macros from the SAS/STAT sample library in release 8.02 or later are required. The first macro in this file, %xmacinc, checks whether XMACRO has been included and, if not, attempts to include it. It is advisable to modify the %inc statements within the XMACRO file so that they refer to the appropriate location on your system.

No products other than base SAS software are required to use the %DISTANCE macro unless STD=AGK(p) or STD=L(p) is specified, in which case SAS/STAT software is required. Use of the STD= argument requires the %STDIZE macro.

The DISTANCE Macro

Contents

OVERVIEW
   Levels of Measurement
   Symmetric vs. Asymmetric Nominal Variables
   Standardization
GETTING STARTED
   Computing Distance Measures
   Computing Similarity/Dissimilarity Measures
%DISTANCE INVOCATION
   %DISTANCE Arguments
DEBUGGING INFORMATION
LIMITATION
DETAILS
   Proximity Measures
REFERENCES
SUBJECT INDEX
SYNTAX INDEX

Overview

The %DISTANCE macro computes various measures of distance, dissimilarity, or similarity between the observations (rows) of a SAS data set. These proximity measures are stored as a lower triangular matrix or a square matrix in an output data set (depending on the specification of the SHAPE= argument) that can then be used as input to the CLUSTER, MDS, or MODECLUS procedures. The input data set may contain numeric or character variables or both, depending on which proximity measure is used.

The number of rows and columns in the output matrix equals the number of observations in the input data set. If there are BY groups, an output matrix is computed for each BY group with the size determined by the maximum number of observations in any BY group.

Levels of Measurement

Measurement of some attribute of a set of things is the process of assigning numbers or other symbols to the things in such a way that properties of the numbers or symbols reflect properties of the attribute being measured. There are different levels of measurement that involve different properties (relations and operations) of the numbers or symbols. Associated with each level of measurement is a set of transformations of the measurements that preserve the relevant properties; these transformations are called permissible transformations. A particular way of assigning numbers or symbols to measure something is called a scale of measurement.

The most commonly discussed levels of measurement are as follows:

Nominal

Two things are assigned the same symbol if they have the same value of the attribute. Permissible transformations are any one-to-one or many-to-one transformation, although a many-to-one transformation loses information.

Ordinal

Things are assigned numbers such that the order of the numbers reflects an order relation defined on the attribute. Two things x and y with attribute values a(x) and a(y) are assigned numbers m(x) and m(y) such that if m(x) > m(y), then a(x) > a(y). Permissible transformations are any monotone increasing transformation, although a transformation that is not strictly increasing loses information.

Interval

Things are assigned numbers such that differences between the numbers reflect differences of the attribute. If m(x) − m(y) > m(u) − m(v), then a(x) − a(y) > a(u) − a(v). Permissible transformations are any affine transformation t(m) = c ∗ m + d, where c and d are constants; another way of saying this is that the origin and unit of measurement are arbitrary.

Log-interval

Things are assigned numbers such that ratios between the numbers reflect ratios of the attribute. If m(x)/m(y) > m(u)/m(v), then a(x)/a(y) > a(u)/a(v). Permissible transformations are any power transformation t(m) = c ∗ m^d, where c and d are constants.

Ratio

Things are assigned numbers such that differences and ratios between the numbers reflect differences and ratios of the attribute. Permissible transformations are any linear (similarity) transformation t(m) = c ∗ m, where c is a constant; another way of saying this is that the unit of measurement is arbitrary.

Absolute

Things are assigned numbers such that all properties of the numbers reflect analogous properties of the attribute. The only permissible transformation is the identity transformation.
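The idea of a permissible transformation can be checked numerically. The following is a minimal Python sketch, independent of the macro itself, using hypothetical temperature readings: an affine transformation (Celsius to Fahrenheit) keeps the relative size of differences, which is what an interval scale requires, but changes ratios of the raw values.

```python
# Hypothetical interval-scale readings (temperatures in Celsius).
celsius = [10.0, 20.0, 40.0]

def affine(m, c=9 / 5, d=32.0):
    """Permissible transformation for an interval scale: t(m) = c*m + d."""
    return c * m + d

fahrenheit = [affine(m) for m in celsius]

# Ratios of differences are unchanged by an affine transformation ...
diff_ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
diff_ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])

# ... but ratios of the values themselves are not, so "twice as warm"
# is not a meaningful statement on an interval scale.
value_ratio_c = celsius[1] / celsius[0]
value_ratio_f = fahrenheit[1] / fahrenheit[0]

print(diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f)
```

Only a transformation of the form t(m) = c ∗ m, the permissible class for ratio scales, would leave the second pair of ratios equal as well.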

The proximity measures provided in the DISTANCE macro accept four types of variable lists that correspond to four levels of measurement: nominal, ordinal, interval, and ratio. Specify ANOMINAL= or NOMINAL= for nominal variables (see "Symmetric vs. Asymmetric Nominal Variables" later in this chapter), ORDINAL= for ordinal variables, INTERVAL= for interval variables, and RATIO= for ratio variables.

Ordinal variables are transformed to interval variables before standardization. This is done by replacing the data with their rank scores as computed by PROC RANK and by assuming that the classes of an ordinal variable are spaced equally along the interval scale. There are other approaches to transforming an ordinal variable to an interval variable; refer to Anderberg (1973) for alternatives.
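To make the rank-score step concrete, here is a small Python sketch of replacing ordinal codes by ranks. It is only an illustration: the data are hypothetical, and assigning tied values the mean of their ranks is an assumption about how the ranks are computed (it matches the usual default of PROC RANK, but the macro's exact settings are not stated in this section).

```python
def rank_scores(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

severity = [3, 1, 2, 2, 5]  # hypothetical ordinal codes
print(rank_scores(severity))
```

The resulting ranks are then treated like any other interval variable, which is where the equal-spacing assumption enters.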

Symmetric vs. Asymmetric Nominal Variables

Symmetric Binary Variable

A binary variable contains two possible outcomes: 1 (positive/present) or 0 (negative/absent). If there is no preference as to which outcome should be coded as 0 and which as 1, the binary variable is called “symmetric”. For example, the binary variable “is evergreen” for a plant has the possible states “loses leaves in winter” and “does not lose leaves in winter”. Both are equally valuable and carry the same weight when a proximity measure is computed. Commonly used measures that accept symmetric binary variables include the Simple Matching, Hamann, Rogers and Tanimoto, Sokal and Sneath 1, and Sokal and Sneath 3 coefficients.

Asymmetric Binary Variable

If the outcomes of a binary variable are not equally important, the binary variable is called “asymmetric”. An example of such a variable is the presence or absence of a relatively rare attribute, such as “is color blind” for a human being. While we say that two people who are color blind have something in common, it is not certain that two people who are not color blind have something in common. The most important outcome is usually coded as 1 (present) and the other is coded as 0 (absent). The agreement of two 1’s (a present-present match or a positive match) carries significantly more weight than the agreement of two 0’s (an absent-absent match or a negative match). Usually, the negative match is treated as totally irrelevant to the comparison between two data units. Commonly used measures that accept asymmetric binary variables include the Jaccard, Dice, Russell and Rao, Binary Lance and Williams nonmetric, and Kulczynski coefficients.

Extension of Binary Variable

When nominal variables are employed, the comparison of one data unit with another can only be in terms of whether the data units score the same or different on the variables. If a variable is defined as an asymmetric nominal variable and two data units score the same but fall into the absent category, the absent-absent match should be excluded from the computation of the proximity measure.
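The practical difference between the symmetric and asymmetric treatments shows up when two data units share mostly absences. The Python sketch below compares the simple matching coefficient (absent-absent matches count) with the Jaccard coefficient (absent-absent matches are excluded) on hypothetical rare-trait data. The function names are illustrative, and the formulas are the standard textbook definitions rather than code taken from the macro.

```python
def match_counts(x, y):
    """Count present-present matches, absent-absent matches, and mismatches."""
    positive = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    negative = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    mismatches = len(x) - positive - negative
    return positive, negative, mismatches

def simple_matching(x, y):
    """Symmetric treatment: both kinds of match count as agreement."""
    positive, negative, mismatches = match_counts(x, y)
    return (positive + negative) / (positive + negative + mismatches)

def jaccard(x, y):
    """Asymmetric treatment: absent-absent matches are ignored.

    Undefined (division by zero) when neither unit has any attribute present.
    """
    positive, _, mismatches = match_counts(x, y)
    return positive / (positive + mismatches)

# Two hypothetical people scored on eight rare traits; they share no trait.
person_a = [1, 0, 0, 0, 0, 0, 0, 0]
person_b = [0, 0, 0, 0, 0, 0, 0, 1]

print(simple_matching(person_a, person_b))  # high: six shared absences
print(jaccard(person_a, person_b))          # zero: no shared presence
```

With the symmetric coefficient the two people look 75% alike purely because of traits neither has, which is the situation the asymmetric measures are designed to avoid.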

Standardization

Variable standardization is not required in the DISTANCE macro except for METHOD=GOWER|DGOWER; however, it is recommended that you standardize the variables before computing proximity measures, since variables with large variances tend to have more effect on the proximity measures than those with small variances. The STDIZE macro (available in release 6.09 of the SAS/STAT software) provides both parametric and nonparametric methods for standardizing variables. You can also standardize variables directly through %DISTANCE; in fact, it provides a convenient way to standardize ratio and interval (ordinal) variables with different methods at the same time (refer to the STDRATIO= and STDINTER= arguments later in this chapter).
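The effect of unequal variances can be seen in a few lines. This Python sketch uses hypothetical height and income values and a z-score standardization (mean 0, sample standard deviation 1); it is an illustration of the reasoning above, not a reproduction of what %STDIZE does internally.

```python
import math
import statistics

# Hypothetical data: height in metres and income in dollars for three people.
height = [1.60, 1.75, 1.90]
income = [30000.0, 30500.0, 31000.0]

def euclid(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def zscore(column):
    """Standardize a column to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # divisor n-1
    return [(v - mean) / sd for v in column]

raw = list(zip(height, income))
std = list(zip(zscore(height), zscore(income)))

print(euclid(raw[0], raw[2]))  # almost entirely the income difference
print(euclid(std[0], std[2]))  # both variables contribute equally
```

On the raw data the height difference is invisible next to the income difference; after standardization each variable contributes the same amount to the distance.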

Getting Started

This section shows how to use the %DISTANCE macro to compute a variety of similarity and distance/dissimilarity measures. The first example illustrates how to compute Euclidean and size distances; the second example introduces some commonly used similarity and dissimilarity measures for nominal level data (see "Levels of Measurement" earlier in this chapter for definitions), such as the Jaccard and simple matching coefficients.

Computing Distance Measures

The first example illustrates how to generate distance measures for interval data. The following data set contains information about mammals' teeth. Mammals have four kinds of teeth: incisors, canines, premolars, and molars. The data set below gives the number of teeth of each kind on one side of the top and bottom jaws for ten mammals.

   data teeth;
      title 'Mammals'' Teeth';
      input mammal $ 1-16 @21 (v1-v8) (1.);
      label v1='Top incisors'
            v2='Bottom incisors'
            v3='Top canines'
            v4='Bottom canines'
            v5='Top premolars'
            v6='Bottom premolars'
            v7='Top molars'
            v8='Bottom molars';
      cards;
   Armadillo           00000088
   Mouse               11000033
   Beaver              11002133
   Groundhog           11002133
   Rabbit              21003233
   Moose               04003333
   Mole                32103333
   Wolf                33114423
   Raccoon             33114432
   Jaguar              33113211
   ;

Since all eight variables are measured in the same units, rescaling the data is not strictly required. However, if the variables are not standardized, the fact that the canine counts have somewhat smaller variances than the other variables might have a slight effect on the analysis.

The most commonly used distance measure is Euclidean distance, which generally accepts ratio, interval, and ordinal level data (see "Levels of Measurement" earlier in this chapter for definitions). Since the number of teeth is an interval level of measurement, Euclidean distance might be an appropriate distance measure between each pair of mammals. Other measures of distance for interval data are available in %DISTANCE as well. For example, size difference is used by taxonomists as a component of “phenetic resemblance,” namely to measure the differences in size of organisms.

The following two analyses invoke %DISTANCE. The only difference between the two invocations is the specification of METHOD=. The first analysis uses METHOD=EUCLID and the second uses METHOD=SIZE. To scale the interval variables V1 through V8, STD=STD standardizes each variable to a mean of 0 and a standard deviation of 1. Notice that you need to include %STDIZE before the invocation of %DISTANCE in order to use the STD= argument.

Also, SHAPE=SQUARE requests that a square distance matrix be printed instead of the default lower triangular distance matrix. The following statements produce Figure 1 and Figure 2:

   options ls=78 ps=60;
   %include 'stdize.sas';
   %include 'distance.sas';

   /*- METHOD=EUCLID -*/
   title2 'METHOD=EUCLID';
   %distance(data=teeth,
             id=mammal,
             options=print nomiss,
             shape=square,
             method=euclid,
             std=std,
             var=v1-v8
            )

   /*- METHOD=SIZE -*/
   title2 'METHOD=SIZE';
   %distance(data=teeth,
             id=mammal,
             options=print nomiss,
             shape=square,
             method=size,
             std=std,
             var=v1-v8
            )

The statements and arguments have the following meanings:

%include 'stdize.sas';     Change this statement to refer to the file containing %STDIZE on your system.
%include 'distance.sas';   Change this statement to refer to the file containing %DISTANCE on your system.
data=teeth                 Input data set.
id=mammal                  Used to generate names for the distance variables.
options=print nomiss       PRINT prints the distance matrix. NOMISS indicates that there are no missing data in the input data and thus speeds up the analysis.
shape=square               Outputs a square distance matrix.
method=euclid              Uses the Euclidean distance measure; this is the default.
method=size                Uses the size distance measure.
std=std                    Standardizes the data to mean=0 and std=1.
var=v1-v8                  Selects variables v1-v8 to compute distances.

                        Mammals' Teeth

mammal       Armadillo     Mouse    Beaver  Groundhog    Rabbit
Armadillo      0.00000   4.05525   4.34184    4.34184   4.95602
Mouse          4.05525   0.00000   1.55130    1.55130   2.61543
Beaver         4.34184   1.55130   0.00000    0.00000   1.25596
Groundhog      4.34184   1.55130   0.00000    0.00000   1.25596
Rabbit         4.95602   2.61543   1.25596    1.25596   0.00000
Moose          5.77229   3.81139   2.89200    2.89200   2.90507
Mole           5.96370   3.91911   3.03255    3.03255   2.33288
Wolf           7.26567   5.32308   4.39495    4.39495   3.69910
Raccoon        7.26567   5.32308   4.39495    4.39495   3.69910
Jaguar         7.43075   4.65184   4.04680    4.04680   3.67730

mammal           Moose      Mole      Wolf    Raccoon    Jaguar
Armadillo      5.77229   5.96370   7.26567    7.26567   7.43075
Mouse          3.81139   3.91911   5.32308    5.32308   4.65184
Beaver         2.89200   3.03255   4.39495    4.39495   4.04680
Groundhog      2.89200   3.03255   4.39495    4.39495   4.04680
Rabbit         2.90507   2.33288   3.69910    3.69910   3.67730
Moose          0.00000   3.45120   3.95318    3.95318   4.15534
Mole           3.45120   0.00000   2.47647    2.47647   2.78786
Wolf           3.95318   2.47647   0.00000    0.77981   1.95177
Raccoon        3.95318   2.47647   0.77981    0.00000   1.95177
Jaguar         4.15534   2.78786   1.95177    1.95177   0.00000

Figure 1. Distance Matrix of Mammals' Teeth Data Using the Euclidean Measure

                        Mammals' Teeth

mammal       Armadillo     Mouse    Beaver  Groundhog    Rabbit
Armadillo      0.00000   1.39229   0.66058    0.66058   0.10632
Mouse          1.39229   0.00000   0.73171    0.73171   1.49861
Beaver         0.66058   0.73171   0.00000    0.00000   0.76690
Groundhog      0.66058   0.73171   0.00000    0.00000   0.76690
Rabbit         0.10632   1.49861   0.76690    0.76690   0.00000
Moose          0.60290   1.99519   1.26348    1.26348   0.49657
Mole           1.58539   2.97768   2.24597    2.24597   1.47907
Wolf           2.88158   4.27387   3.54216    3.54216   2.77526
Raccoon        2.88158   4.27387   3.54216    3.54216   2.77526
Jaguar         1.57511   2.96740   2.23569    2.23569   1.46879

mammal           Moose      Mole      Wolf    Raccoon    Jaguar
Armadillo      0.60290   1.58539   2.88158    2.88158   1.57511
Mouse          1.99519   2.97768   4.27387    4.27387   2.96740
Beaver         1.26348   2.24597   3.54216    3.54216   2.23569
Groundhog      1.26348   2.24597   3.54216    3.54216   2.23569
Rabbit         0.49657   1.47907   2.77526    2.77526   1.46879
Moose          0.00000   0.98249   2.27868    2.27868   0.97221
Mole           0.98249   0.00000   1.29619    1.29619   0.01028
Wolf           2.27868   1.29619   0.00000    0.00000   1.30647
Raccoon        2.27868   1.29619   0.00000    0.00000   1.30647
Jaguar         0.97221   0.01028   1.30647    1.30647   0.00000

Figure 2. Distance Matrix of Mammals' Teeth Data Using the Size Measure
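As a cross-check on the figures, individual entries can be recomputed by hand. The Python sketch below standardizes the teeth data to mean 0 and sample standard deviation 1 (divisor n-1, which is an assumption about the default used by STD=STD) and then computes two measures. The size distance formula used here, the absolute sum of the standardized differences divided by the square root of the number of variables, is not stated in this section; it is an assumption that happens to reproduce the values in Figure 2.

```python
import math
import statistics

teeth = {
    "Armadillo": "00000088", "Mouse": "11000033", "Beaver": "11002133",
    "Groundhog": "11002133", "Rabbit": "21003233", "Moose": "04003333",
    "Mole": "32103333", "Wolf": "33114423", "Raccoon": "33114432",
    "Jaguar": "33113211",
}
rows = {name: [int(ch) for ch in digits] for name, digits in teeth.items()}

# Column-wise mean and sample standard deviation over the ten mammals.
columns = list(zip(*rows.values()))
means = [statistics.mean(col) for col in columns]
sds = [statistics.stdev(col) for col in columns]

def standardized(name):
    """Return the standardized v1-v8 values for one mammal."""
    return [(x - m) / s for x, m, s in zip(rows[name], means, sds)]

def euclid(a, b):
    """Euclidean distance between two mammals on standardized variables."""
    x, y = standardized(a), standardized(b)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def size(a, b):
    """Size distance (assumed form): |sum of differences| / sqrt(number of variables)."""
    x, y = standardized(a), standardized(b)
    return abs(sum(xi - yi for xi, yi in zip(x, y))) / math.sqrt(len(x))

print(round(euclid("Mouse", "Beaver"), 5))
print(round(euclid("Wolf", "Raccoon"), 5))
print(round(size("Mouse", "Beaver"), 5))
```

Wolf and Raccoon differ only by swapping their top and bottom molar counts, so their Euclidean distance is small but nonzero while their size distance is zero: the two differences cancel in the sum.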

Computing Similarity/Dissimilarity Measures

The second example illustrates how to generate similarity/dissimilarity measures for binary data. This example was documented in “Computing a Distance Matrix” of Chapter 18, “The Cluster Procedures,” in the SAS/STAT User’s Guide, Version 6, First Edition. In this example, the observations are states. Binary-valued variables correspond to various grounds for divorce and indicate whether the grounds for divorce apply in each of the states: “1” indicates the presence of the ground and “0” indicates its absence.

   options ls=78 ps=60;
   %include 'distance.sas';

Change the %include statement to refer to the file containing %DISTANCE on your system.

   data divorce;
      title 'Grounds for Divorce';
      input state $15.
            (incompat cruelty desertn non_supp alcohol
             felony impotenc insanity separate) (1.) @@;
      if mod(_n_,2) then input +4 @@;
      else input;
      cards;
   ALABAMA        111111111    ALASKA         111011110
   ARIZONA        100000000    ARKANSAS       011111111
   CALIFORNIA     100000010    COLORADO       100000000
   CONNECTICUT    111111011    DELAWARE       100000001
   FLORIDA        100000010    GEORGIA        111011110
   HAWAII         100000001    IDAHO          111111011
   ILLINOIS       011011100    INDIANA        100001110
   IOWA           100000000    KANSAS         111011110
   KENTUCKY       100000000    LOUISIANA      000001001
   MAINE          111110110    MARYLAND       011001111
   MASSACHUSETTS  111111101    MICHIGAN       100000000
   MINNESOTA      100000000    MISSISSIPPI    111011110
   MISSOURI       100000000    MONTANA        100000000
   NEBRASKA       100000000    NEVADA         100000011
   NEW HAMPSHIRE  111111100    NEW JERSEY     011011011
   NEW MEXICO     111000000    NEW YORK       011001001
   NORTH CAROLINA 000000111    NORTH DAKOTA   111111110
   OHIO           111011101    OKLAHOMA       111111110
   OREGON         100000000    PENNSYLVANIA   011001110
   RHODE ISLAND   111111101    SOUTH CAROLINA 011010001
   SOUTH DAKOTA   011111000    TENNESSEE      111111100
   TEXAS          111001011    UTAH           011111110
   VERMONT        011101011    VIRGINIA       010001001
   WASHINGTON     100000001    WEST VIRGINIA  111011011
   WISCONSIN      100000001    WYOMING        100000011
   ;

The Jaccard coefficient is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. Details of the Jaccard coefficient are given in “Dissimilarity Measures” later in this chapter. You can compute the Jaccard dissimilarity coefficients by invoking the %DISTANCE macro as follows:

   /*- Compute Jaccard Dissimilarity (Djaccard) matrix -*/
   %distance(data=divorce,
             id=state,
             options=nomiss,
             out=distjacc,
             shape=square,
             method=djaccard,
             var=incompat--separate
            )

The arguments have the following meanings:

id=state                 Used to generate names for the distance variables.
options=nomiss           Does not print the distance matrix. Indicates that there are no missing data in the input data and thus speeds up the analysis.
out=distjacc             Output data set containing the dissimilarity measures.
shape=square             Outputs a square dissimilarity matrix.
method=djaccard          Uses the Jaccard dissimilarity.
var=incompat--separate   Selects variables incompat through separate to compute the dissimilarity.

The PRINT procedure can be used to print the output data set, distjacc. Since the dimension of the output dissimilarity matrix is 50 by 50, only partial output is printed. The following statements produce Figure 3:

   proc print data=distjacc(obs=6);
      title2 'First 6 states';
      title3 'Djaccard Dissimilarity Matrix';
      var alabama--colorado;
      id state;
   run;

                          Grounds for Divorce
                            First 6 states
                     Djaccard Dissimilarity Matrix

state         Alabama    Alaska   Arizona  Arkansas  California  Colorado
Alabama       0.00000   0.22222   0.88889   0.11111     0.77778   0.88889
Alaska        0.22222   0.00000   0.85714   0.33333     0.71429   0.85714
Arizona       0.88889   0.85714   0.00000   1.00000     0.50000   0.00000
Arkansas      0.11111   0.33333   1.00000   0.00000     0.88889   1.00000
California    0.77778   0.71429   0.50000   0.88889     0.00000   0.50000
Colorado      0.88889   0.85714   0.00000   1.00000     0.50000   0.00000

Figure 3. Output from PROC PRINT
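The entries of Figure 3 follow directly from the definition of the Jaccard coefficient given above. The following Python sketch recomputes the dissimilarity (one minus the Jaccard coefficient) for the first six states; it is an independent check written for illustration, not the macro's code, and the function name is hypothetical.

```python
# Grounds-for-divorce indicators for the first six states, as in the data step.
grounds = {
    "ALABAMA": "111111111",
    "ALASKA": "111011110",
    "ARIZONA": "100000000",
    "ARKANSAS": "011111111",
    "CALIFORNIA": "100000010",
    "COLORADO": "100000000",
}

def djaccard(x, y):
    """One minus (variables coded 1 for both) / (variables coded 1 for either or both)."""
    both = sum(1 for a, b in zip(x, y) if a == "1" and b == "1")
    either = sum(1 for a, b in zip(x, y) if a == "1" or b == "1")
    return 1 - both / either

for state, row in grounds.items():
    values = [djaccard(row, other) for other in grounds.values()]
    print(f"{state:<11}" + " ".join(f"{v:8.5f}" for v in values))
```

Arizona and Colorado have identical indicators, so their dissimilarity is 0, while Arizona and Arkansas share no ground at all and reach the maximum of 1.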

The %DISTANCE macro provides a much easier way to compute the Jaccard dissimilarity coefficient than the approach described in the SAS/STAT User’s Guide. If a subset of variables is desirable for a further investigation, simply change the specification in the VAR= argument. Also, if missing values exist, %DISTANCE provides a variety of estimations to replace them (see “Missing values” later in this chapter.)

%DISTANCE Invocation

The %DISTANCE macro is invoked as follows:

   %distance ( data=, var=, varwgt=, absent=,
               anominal=, ano=, anowgt=,
               nominal=, nom=, nomwgt=,
               ordinal=, ord=, ordwgt=,
               interval=, int=, intwgt=,
               ratio=, rat=, ratwgt=,
               out=, prefix=, shape=, method=, undef=, missing=,
               std=, stdinter=, stdratio=, vardef=,
               id=, copy=, by=, freq=, weight=, options= )

The %DISTANCE macro name is required. All arguments in the macro are optional and may appear in any order.

Note: If no variable lists are specified, all the numeric variables in the DATA= data set are included in the analysis by default and treated as VAR=_NUMERIC_.

%DISTANCE Arguments

The following arguments may be listed within parentheses in any order, separated by commas:

DATA=

SAS data set to analyze. The default is _LAST_. Most data set options may be used. WHERE processing is not supported. Compressed data sets, tape data sets, and views are not supported.

VAR=

List of variables from which distances are to be computed. The usual forms of abbreviated lists (e.g., X1-X100, ABC--XYZ, ABC:) may be used. The variables may be numeric, character, or mixed depending on the METHOD= argument. Variable names should not begin with an underscore. VAR= should not be specified when any one of the following arguments is specified: ANOMINAL=, NOMINAL=, ORDINAL=, INTERVAL=, or RATIO=. The default list for VAR= contains all of the numeric variables.

ID=

A single variable to be copied to the OUT= data set and used to generate names for the distance variables as in the TRANSPOSE procedure. If you specify both ID= and BY=, the ID variable must have the same values in the same order in each BY group. Also, the ID values must be valid SAS names. These restrictions may be removed in later versions.

COPY=

List of additional variables to be copied to the OUT= data set. The usual forms of abbreviated lists (e.g., X1-X100, ABC--XYZ, ABC:) may be used.

BY=

List of variables for BY groups. Abbreviated variable lists (e.g., X1-X100, ABC--XYZ, ABC:) may NOT be used.

OUT=

The output data set containing the BY variables, ID variable, computed distance variables, COPY variables, FREQ variable, and WEIGHT variables. The default is _TEMP_ if SHAPE=SQUARE; otherwise, the default is _DATA_. Data set options may NOT be used with the OUT= data set.

PREFIX=

A prefix to be used to generate names for the distance variables in the OUT= data set as in the TRANSPOSE procedure. Do not use quotes. The default is DIST.

SHAPE=

Shape of the proximity matrix to be stored in the OUT= data set and to be printed if OPTIONS=PRINT. The available shapes are:

TRIANGLE or TRI   Stored as a lower triangular matrix. This is the default.
SQUARE or SQR     Stored as a square matrix.

METHOD=

The method for computing distance/dissimilarity or similarity measures. The value of the METHOD= argument is the name of a macro to compute the distance/dissimilarity or similarity measure between two vectors. The user can write additional macros to implement other distance/dissimilarity or similarity measures besides those listed below. For use in PROC CLUSTER, the distance/dissimilarity type of measures should be used; for example, METHOD=EUCLID or METHOD=DGOWER. The default method is EUCLID.

The following five tables outline the methods available for the METHOD= argument. These tables are classified by the types of variables accepted by each method. There are five columns in each table. The first column contains the methods. The second column contains the range of the coefficient. The third column contains the type of coefficient: sim (similarity) or dis (distance/dissimilarity). The fourth column contains the variable type assumed for the VAR= list. The last column lists the types of variables accepted by the corresponding method. The entries in columns four and five are abbreviated as follows:

R   Ratio (must be numeric)
I   Interval (must be numeric)
O   Ordinal (must be numeric)
N   Nominal (may be numeric or character)
A   Asymmetric nominal (may be numeric or character)

See “Similarity and Distance/Dissimilarity Measure” later in this chapter for formulas and descriptions.

Table 1 lists methods accepting ratio, interval, ordinal, nominal, and asymmetric nominal variables. METHOD=GOWER|DGOWER always implies standardization. Assuming all the numeric (ordinal, interval, and ratio) variables are standardized by their corresponding default methods, the range for both methods in the second column of this table is between 0 and 1, inclusive. To find out the default methods of standardization for METHOD=GOWER|DGOWER, see STD=, STDINTER=, and STDRATIO= in this section.

Table 1. Methods accepting all types of variables:

method    range     type of       type assumed        accepted
                    coefficient   for VAR= list       list type
GOWER     0 to 1    sim           I                   R,I,O,N,A
DGOWER    0 to 1    dis           I                   R,I,O,N,A

GOWER     Gower's similarity
DGOWER    1-Gower

Table 2 lists methods accepting ratio, interval, and ordinal variables.

Table 2. Methods accepting ratio, interval, and ordinal variables:

method        range      type of       type assumed        accepted
                         coefficient   for VAR= list       list type
EUCLID        ≥0         dis           I                   R,I,O
SQEUCLID      ≥0         dis           I                   R,I,O
SIZE          ≥0         dis           I                   R,I,O
SHAPE         ≥0         dis           I                   R,I,O
COV           ≥0         sim           I                   R,I,O
CORR          -1 to 1    sim           I                   R,I,O
DCORR         0 to 2     dis           I                   R,I,O
SQCORR        0 to 1     sim           I                   R,I,O
DSQCORR       0 to 1     dis           I                   R,I,O
L(p)          ≥0         dis           I                   R,I,O
CITY          ≥0         dis           I                   R,I,O
CHEBYCHE      ≥0         dis           I                   R,I,O
POWER(p, r)   ≥0         dis           I                   R,I,O

EUCLID        Euclidean distance
SQEUCLID      Squared Euclidean distance
SIZE          Size distance
SHAPE         Shape distance
COV           Covariance
CORR          Correlation
DCORR         Correlation transformed to Euclidean distance
SQCORR        Squared correlation
DSQCORR       One minus squared correlation
L(p)          Minkowski Lp distance, where p is a positive numeric value
CITY          L1, city-block, or Manhattan distance
CHEBYCHE      L∞ (Chebychev) distance
POWER(p, r)   Generalized Euclidean distance, where p is a positive numeric value and r is a non-negative numeric value. The distance between two observations is the rth root of the sum of the absolute differences to the pth power between the values for the observations.
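The description of POWER(p, r) covers several of the other methods in Table 2 as special cases. The Python sketch below implements the verbal definition literally and derives the special cases from it. The function names are illustrative, and the sketch requires r > 0 because an rth root is undefined for r = 0; how the macro itself treats r = 0 is not stated here.

```python
def power_distance(x, y, p, r):
    """rth root of the sum of absolute differences raised to the pth power."""
    if p <= 0 or r <= 0:
        raise ValueError("this sketch requires p > 0 and r > 0")
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / r)

def minkowski(x, y, p):
    """L(p): the special case of POWER(p, r) with r = p."""
    return power_distance(x, y, p, p)

def city(x, y):
    """L1 (city-block) distance: L(p) with p = 1."""
    return minkowski(x, y, 1)

def chebychev(x, y):
    """L-infinity distance: the largest absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 2))          # Euclidean distance
print(power_distance(x, y, 2, 1))  # squared Euclidean distance
print(city(x, y))
print(chebychev(x, y))
```

Seen this way, EUCLID corresponds to p = r = 2, SQEUCLID to p = 2 and r = 1, and CITY to p = r = 1, while CHEBYCHE is the limit of L(p) as p grows large.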

Table 3 lists methods that accept ratio variables only. Notice that in the second column of this table, all ranges are non-negative. This is because the ratio variables are assumed to be positive.

Table 3. Methods accepting ratio variables:

method      range     type of       type assumed        accepted
                      coefficient   for VAR= list       list type
SIMRATIO    0 to 1    sim           R                   R
DISRATIO    0 to 1    dis           R                   R
NONMETRI    0 to 1    sim           R                   R
CANBERRA    0 to 1    dis           R                   R
COSINE      0 to 1    sim           R                   R
DOT         ≥0        sim           R                   R
OVERLAP     ≥0        sim           R                   R
DOVERLAP    ≥0        dis           R                   R
CHISQ       ≥0        dis           R                   R
CHI         ≥0        dis           R                   R
PHISQ       ≥0        sim           R                   R
PHI         ≥0        sim           R                   R

SIMRATIO    Similarity ratio (if variables are binary, this is the Jaccard coefficient)
DISRATIO    One minus similarity ratio
NONMETRI    Lance and Williams nonmetric coefficient
CANBERRA    Canberra metric distance coefficient
COSINE      Cosine
DOT         Dot (inner) product
OVERLAP     Overlap similarity
DOVERLAP    Overlap dissimilarity
CHISQ       Chi-squared
CHI         Square root of chi-squared
PHISQ       Phi-squared
PHI         Square root of phi-squared
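The remark that the similarity ratio reduces to the Jaccard coefficient for binary variables can be verified numerically. The formula used in this Python sketch, the sum of cross-products divided by the sum of cross-products plus the sum of squared differences, is an assumption: it is a common definition of the similarity ratio, but the macro's own formula appears only in the Details section, which is not reproduced here.

```python
def simratio(x, y):
    """Similarity ratio (assumed form): sum(x*y) / (sum(x*y) + sum((x-y)**2))."""
    cross = sum(a * b for a, b in zip(x, y))
    squared_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return cross / (cross + squared_diff)

def jaccard(x, y):
    """Variables coded 1 for both, divided by variables coded 1 for either or both."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either

# Binary indicators for ALABAMA and ALASKA from the divorce example.
alabama = [1, 1, 1, 1, 1, 1, 1, 1, 1]
alaska = [1, 1, 1, 0, 1, 1, 1, 1, 0]

print(simratio(alabama, alaska), jaccard(alabama, alaska))

# The same function also accepts non-binary ratio data (hypothetical counts).
print(simratio([2, 0, 3], [1, 1, 3]))
```

For 0/1 data the cross-product sum counts the present-present matches and the squared-difference sum counts the mismatches, which is why the two coefficients coincide.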

Table 4 lists methods that accept nominal variables only.

Table 4. Methods accepting symmetric nominal variables:

method      range      type of       type assumed        accepted
                       coefficient   for VAR= list       list type
HAMMING     0 to v     dis           N                   N
MATCH       0 to 1     sim           N                   N
DMATCH      0 to 1     dis           N                   N
DSQMATCH    0 to 1     dis           N                   N
HAMANN      -1 to 1    sim           N                   N
RT          0 to 1     sim           N                   N
SS1         0 to 1     sim           N                   N
SS3         0 to 1     sim           N                   N

HAMMING     Hamming distance
MATCH       Simple matching coefficient
DMATCH      Simple matching coefficient transformed to Euclidean distance
DSQMATCH    Simple matching coefficient transformed to squared Euclidean distance
HAMANN      Hamann coefficient
RT          Rogers and Tanimoto
SS1         Sokal and Sneath 1
SS3         Sokal and Sneath 3

Table 5 lists methods that distinguish the presence of an attribute from its absence. Use ABSENT= to designate the list of values to be considered as indicating absence.

Table 5. Methods accepting asymmetric nominal variables:

method      range     type of       type assumed        accepted
                      coefficient   for VAR= list       list type
JACCARD     0 to 1    sim           A                   R,A
DJACCARD    0 to 1    dis           A                   R,A
DICE        0 to 1    sim           A                   A
RR          0 to 1    sim           A                   A
BLWNM       0 to 1    dis           A                   A
K1          ≥0        sim           A                   A

JACCARD     Jaccard similarity coefficient
DJACCARD    Jaccard dissimilarity coefficient
DICE        Dice coefficient
RR          Russell and Rao
BLWNM       Binary Lance and Williams nonmetric, or Bray-Curtis coefficient
K1          Kulczynski 1

ABSENT=

List of values to be used as absence values in an irrelevant absent-absent match for all of the variables specified in the ANOMINAL= list. An absence value for a variable consists of combinations of all legal SAS characters and quoted by either single quote (’) or double quote ("). For instance, both ’.’ and ’999’ are legal values for ABSENT= list. An empty list or lack of ABSENT= requests the default. The default of an absence value for a character variable is ’NONE’ (notice that a blank value is treated as a missing value), and the default of an absence value for a numeric type of variable is ’0’. ANOMINAL= ANO=

List of variables to be treated as asymmetric nominal variables. The usual forms of abbreviated lists (e.g., X1-X100, ABC–XYZ, ABC:) may be used. The variables may be numeric, character or mixed. Variable names should not begin with an underscore. If both ANOMINAL= and ANO= are used, variables in the ANO= list will be concatenated to ANOMINAL= list. Do not use VAR= when ANOMINAL= is specified. NOMINAL= NOM=

List of variables to be treated as symmetric nominal variables. The usual forms of abbreviated lists (e.g., X1-X100, ABC–XYZ, ABC:) may be used. The variables may be numeric, character or mixed. Variable names should not begin with an underscore. If both NOMINAL= and NOM= are used, variables in the NOM= list will be concatenated to NOMINAL= list. Do not use VAR= when NOMINAL= is specified. ORDINAL= ORD=

List of variables to be treated as ordinal variables. The usual forms of abbreviated lists (e.g., X1-X100, ABC–XYZ, ABC:) may be used. Only numeric variables are allowed. Variable names should not begin with an underscore. If both ORDINAL= and ORD= are used, variables in the ORD= list will be concatenated to ORDINAL= list.

17

18

The DISTANCE Macro Do not use VAR= when ORDINAL= is specified. The data in the ORDINAL= list will be replaced by their corresponding ranks before the standardization. Since PROC RANK (used by the %DISTANCE macro for computing ranks) does not accept FREQ and WEIGHT statements, FREQ= and WEIGHT= are ignored in this case. After being replaced by ranks, ordinal variables are treated as interval variables.

INTERVAL= INT=

List of variables to be treated as interval variables. The usual forms of abbreviated lists (e.g., X1-X100, ABC--XYZ, ABC:) may be used. Only numeric variables are allowed. Variable names should not begin with an underscore. If both INTERVAL= and INT= are used, the variables in the INT= list are appended to the INTERVAL= list. Do not use VAR= when INTERVAL= is specified.

RATIO= RAT=

List of variables to be treated as ratio variables. The usual forms of abbreviated lists (e.g., X1-X100, ABC--XYZ, ABC:) may be used. Only numeric variables are allowed. Variable names should not begin with an underscore. If both RATIO= and RAT= are used, the variables in the RAT= list are appended to the RATIO= list. Do not use VAR= when RATIO= is specified.

STD=

Method for standardizing variables for the VAR= list. See the %STDIZE macro for details. By default, variables are not standardized unless METHOD=GOWER|DGOWER. Standardization is mandatory when METHOD=GOWER|DGOWER, according to the following rules:
• When METHOD=GOWER, variables are standardized by STD=RANGE, and whatever is specified in STD= is ignored.
• When METHOD=DGOWER, variables are standardized by STD=RANGE by default.
Notice that when STD=MAXABS is used for standardizing a ratio variable, a variable should be designated as ratio only if it is nonnegative.

Table 6 lists the available methods of standardization in %STDIZE.

Table 6. Available Methods of Standardization

Method       Scale                             Location
MEAN         1                                 mean
MEDIAN       1                                 median
SUM          sum                               0
EUCLEN       Euclidean length                  0
USTD         standard deviation about origin   0
STD          standard deviation                mean
RANGE        range                             minimum
MIDRANGE     range/2                           midrange
MAXABS       maximum absolute value            0
IQR          interquartile range               median
MAD          median abs. dev. from median      median
ABW(c)       biweight A-estimate               biweight 1-step M-estimate
AHUBER(c)    Huber A-estimate                  Huber 1-step M-estimate
AGK(p)       AGK estimate (ACECLUS)            median
SPACING(p)   minimum spacing                   mid minimum-spacing
L(p)         L(p)                              L(p)
IN(ds)       read from data set                read from data set
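To make the Scale and Location columns of Table 6 concrete, here is a minimal Python sketch for a few of the methods. It assumes the standardized value is (value - location) / scale, which is how the table reads but is not spelled out in this section; `location_scale` and `standardize` are illustrative names, not part of %STDIZE.

```python
import statistics


def location_scale(values, method):
    """Return (location, scale) for a few of the Table 6 methods."""
    lo, hi = min(values), max(values)
    if method == "MEAN":
        return statistics.mean(values), 1.0
    if method == "STD":
        # Sample standard deviation (divisor n - 1), matching VARDEF=DF.
        return statistics.mean(values), statistics.stdev(values)
    if method == "RANGE":
        return lo, hi - lo
    if method == "MIDRANGE":
        return (lo + hi) / 2, (hi - lo) / 2
    if method == "MAXABS":
        return 0.0, max(abs(v) for v in values)
    raise ValueError(f"method not covered by this sketch: {method}")


def standardize(values, method):
    """Assumed form of the result: (value - location) / scale."""
    location, scale = location_scale(values, method)
    return [(v - location) / scale for v in values]


print(standardize([2, 4, 10], "RANGE"))   # values mapped onto [0, 1]
print(standardize([2, 4, 10], "MAXABS"))  # nonnegative data stay in [0, 1]
```

RANGE maps every variable onto [0, 1], which is what the GOWER method relies on. MAXABS does the same only for nonnegative data, hence the advice about ratio variables.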

STDINTER=

Method for standardizing ordinal and interval variables for the ORDINAL= and INTERVAL= lists. See the %STDIZE macro for details. By default, variables are not standardized unless METHOD=GOWER|DGOWER. Standardization is mandatory when METHOD=GOWER|DGOWER, according to the following rules:
• When METHOD=GOWER, variables are standardized by STDINTER=RANGE, and whatever is specified in STDINTER= is ignored.
• When METHOD=DGOWER, variables are standardized by STDINTER=RANGE by default.
Do not specify both STD= and STDINTER= at the same time.

STDRATIO=

Method for standardizing ratio variables for the RATIO= list. See the %STDIZE macro for details. By default, variables are not standardized unless METHOD=GOWER|DGOWER is specified. Standardization is mandatory when METHOD=GOWER|DGOWER, according to the following rules:
• When METHOD=GOWER, variables are standardized by STDRATIO=MAXABS by default. Using other methods of standardization does not guarantee a GOWER coefficient > 0.
• When METHOD=DGOWER, variables are standardized by STDRATIO=MAXABS by default.
Notice that when STDRATIO=MAXABS is used for standardizing a ratio variable, a variable should be designated as ratio only if it is nonnegative. Do not specify both STD= and STDRATIO= at the same time.

FREQ=

A single numeric frequency variable used as in PROC UNIVARIATE. FREQ= applies only when STD= is used and only when ORDINAL= is not specified.

WEIGHT=

A single numeric weight variable used as in PROC UNIVARIATE. WEIGHT= works only for STD=MEAN, SUM, EUCLEN, STD, AGK, or L(p), and only when ORDINAL= is not specified.

UNDEF=

Numeric constant or missing value with which to replace undefined distances, for example, when an observation has all missing values or when a divisor is zero.

MISSING=

Method or numeric value for replacing missing values. Use MISSING= when you want to replace missing values by something other than the location measure associated with the STD=, STDINTER=, or STDRATIO= option, which is what the REPLACE option replaces them by. The usual methods include MEAN, MEDIAN, and MIDRANGE. Any of the values for the STD=, STDINTER=, or STDRATIO= argument can also be specified for MISSING=, and the corresponding location measure is used to replace missing values. If a numeric value is given, the specified value replaces missing values. If standardization is performed, the replacement is done after standardizing the data.

VARDEF=

The divisor to be used in the calculation of similarity and distance/dissimilarity measures and for standardizing variables (see WEIGHT=). The default value is VARDEF=DF. Other available values are N, WDF, and WEIGHT (or WGT).

ANOWGT=

List of positive numbers used as weights for the variables in the ANOMINAL= list. Weights in the list are separated by blanks. The number of weights in the list must be the same as the number of variables in the ANOMINAL= list. The default value is 1 for each variable in the ANOMINAL= list.

NOMWGT=

List of positive numbers used as weights for the variables in the NOMINAL= list. Weights in the list are separated by blanks. The number of weights in the list must be the same as the number of variables in the NOMINAL= list. The default value is 1 for each variable in the NOMINAL= list.

ORDWGT=

List of positive numbers used as weights for the variables in the ORDINAL= list. Weights in the list are separated by blanks. The number of weights in the list must be the same as the number of variables in the ORDINAL= list. The default value is 1 for each variable in the ORDINAL= list.

INTWGT=

List of positive numbers used as weights for the variables in the INTERVAL= list. Weights in the list are separated by blanks. The number of weights in the list must be the same as the number of variables in the INTERVAL= list. The default value is 1 for each variable in the INTERVAL= list.

RATWGT=

List of positive numbers used as weights for the variables in the RATIO= list. Weights in the list are separated by blanks. The number of weights in the list must be the same as the number of variables in the RATIO= list. The default value is 1 for each variable in the RATIO= list.

VARWGT=

List of positive numbers used as weights for the variables in the VAR= list. Weights in the list are separated by blanks. The number of weights in the list must be the same as the number of variables in the VAR= list.

OPTIONS=

List of additional options separated by blanks:

PRINT

Print the distance matrices.

NOMISS

Generate missing distances for observations with missing values in the variables selected through the ANOMINAL=, NOMINAL=, ORDINAL=, INTERVAL=, and RATIO= lists. In general, missing values are tolerated, but some methods may not work properly with them. This option may increase efficiency considerably, depending on the method. Notice that PROC CLUSTER does not accept distance matrices with missing values.

REPLACE

Replace missing data by zero in the standardized data (which corresponds to the location measure before standardizing). To replace missing data by something else, see the MISSING= argument. OPTIONS=REPLACE implies standardization. The following rules are used to standardize variables:
• When METHOD=GOWER, STD(INTER)=RANGE is the mandatory method of standardization for interval variables, and the default method of standardization for ratio variables is STDRATIO=MAXABS.
• When METHOD=DGOWER, STD(INTER)=RANGE and STDRATIO=MAXABS are the default methods of standardization for interval variables and ratio variables, respectively.
• When METHOD= is anything other than GOWER|DGOWER, STD(INTER)=MEAN is the default method of standardization for both interval and ratio variables.
You may not specify both REPLACE and REPONLY.

REPONLY

Replace missing data by the location measure specified by the STD=, STDINTER=, STDRATIO=, or MISSING= arguments, but do not standardize the data. If MISSING= is specified, missing values are replaced by the location measure specified by MISSING=. If MISSING= is absent, the following rules are used to replace missing values:
• When METHOD=GOWER, the location measure specified by STD(INTER)=RANGE is used to replace missing values for interval variables, and the location measure specified by STDRATIO=MEAN is used as a default estimate to replace missing values for ratio variables.
• When METHOD=DGOWER, the location measure specified by STD(INTER)=RANGE and the location measure specified by STDRATIO=MEAN are used by default to replace missing values for interval variables and for ratio variables, respectively.
• When METHOD= is anything other than GOWER|DGOWER, the location measure specified by STD(INTER)=MEAN and the location measure specified by STDRATIO=MEAN are used by default to replace missing values for interval variables and for ratio variables, respectively.
You may not specify both REPLACE and REPONLY.
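The difference between REPLACE and REPONLY can be sketched in Python for a single variable. The sketch uses RANGE standardization (location = minimum, scale = range), marks missing values with `None`, and assumes the standardized value is (value - location) / scale. The function names are hypothetical and this is not the macro's code.

```python
def range_location_scale(values):
    """Location (minimum) and scale (range) from the non-missing values."""
    present = [v for v in values if v is not None]
    return min(present), max(present) - min(present)


def replace_option(values):
    """REPLACE-like behavior: standardize, then put 0 where data were missing."""
    loc, scale = range_location_scale(values)
    return [0.0 if v is None else (v - loc) / scale for v in values]


def reponly_option(values):
    """REPONLY-like behavior: fill missing with the location, do not standardize."""
    loc, _ = range_location_scale(values)
    return [loc if v is None else v for v in values]


data = [2, None, 10, 6]
print(replace_option(data))
print(reponly_option(data))
```

A zero in the standardized data corresponds to the location measure in the original units, which is why the two options fill in "the same" value on different scales.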

Debugging Information

The following statements may be useful for diagnosing errors:

%let _notes_=1;             Prints SAS notes for all steps
%let _echo_=1;              Prints the arguments to the DISTANCE macro
%let _echo_=2;              Prints the arguments to the DISTANCE macro after defaults have been set
options mprint;             Prints SAS code generated by the macro language
options mlogic symbolgen;   Prints lots of macro debugging info

You can suppress argument checking (and thereby speed up the macro at the risk of getting inscrutable error messages if you make a mistake) by using the statement:

%let _check_=0;

Limitation

Due to the limitation on the length of a macro variable (8), the maximum number of variables is restricted to 999.

Details

Proximity Measures

The following notation is used in this section:

$v$

the number of variables or the dimensionality

$x_j$

data for observation $x$ and the $j$th variable, where $j = 1$ to $v$

$y_j$

data for observation $y$ and the $j$th variable, where $j = 1$ to $v$

$w_j$

weight for the $j$th variable from the WEIGHTS= option in the VAR statement. $w_j = 0$ when either $x_j$ or $y_j$ is missing.

$W$

the sum of total weights. A weight is added to this sum whether or not the corresponding value is missing.

$\bar{x}$

mean for observation $x$:
$$\bar{x} = \sum_{j=1}^{v} w_j x_j \Big/ \sum_{j=1}^{v} w_j$$

$\bar{y}$

mean for observation $y$:
$$\bar{y} = \sum_{j=1}^{v} w_j y_j \Big/ \sum_{j=1}^{v} w_j$$

$d(x, y)$

the distance or dissimilarity between observations $x$ and $y$

$s(x, y)$

the similarity between observations $x$ and $y$

The factor $W / \sum_{j=1}^{v} w_j$ is used to adjust some of the proximity measures for missing values.
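The role of the weights and of the missing-value adjustment factor can be seen in a short Python sketch. Missing values are represented by `None`; both helper names are hypothetical.

```python
def effective_weights(x, y, weights):
    """w_j, set to 0 whenever x_j or y_j is missing (None)."""
    return [0.0 if (a is None or b is None) else w
            for a, b, w in zip(x, y, weights)]


def missing_adjustment(x, y, weights):
    """The factor W / sum(w_j); W counts every weight, missing or not."""
    w = effective_weights(x, y, weights)
    return sum(weights) / sum(w)


x = [1, None, 3, 4]
y = [2, 5, None, 4]
print(effective_weights(x, y, [1, 1, 1, 1]))
print(missing_adjustment(x, y, [1, 1, 1, 1]))
```

With half of the pairs unusable the factor is 2, so a sum over the remaining pairs is scaled up as if all variables had contributed.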

Methods Accepting All Measurement Levels

GOWER

Gower's similarity
$$s_1(x, y) = \sum_{j=1}^{v} w_j \delta_{x,y}^{j} d_{x,y}^{j} \Big/ \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$$

To compute $\delta_{x,y}^{j}$:
for a nominal, ordinal, interval, or ratio variable,
$\delta_{x,y}^{j} = 1$;
for an asymmetric nominal variable,
$\delta_{x,y}^{j} = 1$, if either $x_j$ or $y_j$ is present
$\delta_{x,y}^{j} = 0$, if both $x_j$ and $y_j$ are absent

To compute $d_{x,y}^{j}$:
for a nominal or asymmetric nominal variable,
$d_{x,y}^{j} = 1$, if $x_j = y_j$
$d_{x,y}^{j} = 0$, if $x_j \ne y_j$;
for an ordinal (where data are replaced by corresponding rank scores), interval, or ratio variable,
$d_{x,y}^{j} = 1 - |x_j - y_j|$

DGOWER

1 minus Gower
$$d_2(x, y) = 1 - s_1(x, y)$$
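The GOWER and DGOWER rules can be sketched in Python for one pair of observations. The sketch assumes numeric variables have already been standardized so that $|x_j - y_j| \le 1$, uses `None` for missing values, and uses the hypothetical labels `"nominal"`, `"anominal"`, and `"numeric"` for the variable types. It illustrates the formula and is not the macro's implementation.

```python
def gower_similarity(x, y, kinds, weights=None, absent=(0, "NONE")):
    """Gower similarity s1 for one pair of observations.

    kinds[j] is 'nominal', 'anominal' (asymmetric nominal), or 'numeric'
    (ordinal ranks, interval, or ratio, already standardized).
    """
    if weights is None:
        weights = [1.0] * len(x)
    num = den = 0.0
    for xj, yj, kind, wj in zip(x, y, kinds, weights):
        if xj is None or yj is None:
            continue  # w_j = 0 for a pair with a missing value
        if kind == "anominal" and xj in absent and yj in absent:
            continue  # delta = 0: an absent-absent match is irrelevant
        if kind in ("nominal", "anominal"):
            d = 1.0 if xj == yj else 0.0
        else:
            d = 1.0 - abs(xj - yj)
        num += wj * d
        den += wj
    return num / den if den > 0 else None  # undefined when nothing is comparable


x = ["red", 0, 0.25, 1]
y = ["red", 0, 0.75, 0]
kinds = ["nominal", "anominal", "numeric", "anominal"]
s1 = gower_similarity(x, y, kinds)
print(s1, 1 - s1)  # GOWER and DGOWER
```

The second variable is an absent-absent match and drops out of both sums, so the similarity is averaged over three variables rather than four.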

Methods Accepting Ratio, Interval, and Ordinal Variables

EUCLID

Euclidean distance
$$d_3(x, y) = \sqrt{\Big(\sum_{j=1}^{v} w_j (x_j - y_j)^2\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)}$$

SQEUCLID

Squared Euclidean distance
$$d_4(x, y) = \Big(\sum_{j=1}^{v} w_j (x_j - y_j)^2\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)$$

SIZE

Size distance
$$d_5(x, y) = \Big|\sum_{j=1}^{v} w_j (x_j - y_j)\Big| \sqrt{W} \Big/ \Big(\sum_{j=1}^{v} w_j\Big)$$

SHAPE

Shape distance
$$d_6(x, y) = \sqrt{\Big(\sum_{j=1}^{v} w_j [(x_j - \bar{x}) - (y_j - \bar{y})]^2\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)}$$
Note: squared shape distance plus squared size distance equals squared Euclidean distance.
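The Euclidean, size, and shape distances, and the identity relating them, can be checked numerically with a small Python sketch. Missing values are `None`, all function names are illustrative, and the size distance uses the factor $\sqrt{W} / \sum_j w_j$, the reading of the formula that makes the stated identity hold.

```python
import math


def _pairs(x, y, weights):
    """Non-missing (w_j, x_j, y_j) triples plus W and sum(w_j)."""
    kept = [(w, a, b) for a, b, w in zip(x, y, weights)
            if a is not None and b is not None]
    return kept, sum(weights), sum(w for w, _, _ in kept)


def sqeuclid(x, y, weights):
    kept, total_w, sum_w = _pairs(x, y, weights)
    return sum(w * (a - b) ** 2 for w, a, b in kept) * total_w / sum_w


def euclid(x, y, weights):
    return math.sqrt(sqeuclid(x, y, weights))


def size(x, y, weights):
    kept, total_w, sum_w = _pairs(x, y, weights)
    return abs(sum(w * (a - b) for w, a, b in kept)) * math.sqrt(total_w) / sum_w


def shape(x, y, weights):
    kept, total_w, sum_w = _pairs(x, y, weights)
    xbar = sum(w * a for w, a, _ in kept) / sum_w
    ybar = sum(w * b for w, _, b in kept) / sum_w
    ss = sum(w * ((a - xbar) - (b - ybar)) ** 2 for w, a, b in kept)
    return math.sqrt(ss * total_w / sum_w)


x, y, w = [1, 2, 3], [2, 4, 9], [1, 1, 1]
print(sqeuclid(x, y, w), size(x, y, w) ** 2 + shape(x, y, w) ** 2)
```

Both printed numbers agree up to rounding, which is the identity stated in the note under SHAPE.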

COV

Covariance similarity coefficient
$$s_7(x, y) = \sum_{j=1}^{v} w_j (x_j - \bar{x})(y_j - \bar{y}) \Big/ \mathit{vardiv}$$
where
$\mathit{vardiv} = v$ if VARDEF=N
$\mathit{vardiv} = v - 1$ if VARDEF=DF
$\mathit{vardiv} = \sum_{j=1}^{v} w_j$ if VARDEF=WEIGHT
$\mathit{vardiv} = \sum_{j=1}^{v} w_j - 1$ if VARDEF=WDF

CORR

Correlation similarity coefficient
$$s_8(x, y) = \frac{\sum_{j=1}^{v} w_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{v} w_j (x_j - \bar{x})^2 \sum_{j=1}^{v} w_j (y_j - \bar{y})^2}}$$

DCORR

Correlation transformed to Euclidean distance as sqrt(1-CORR)
$$d_9(x, y) = \sqrt{1 - s_8(x, y)}$$

SQCORR

Squared correlation
$$s_{10}(x, y) = \frac{\big[\sum_{j=1}^{v} w_j (x_j - \bar{x})(y_j - \bar{y})\big]^2}{\sum_{j=1}^{v} w_j (x_j - \bar{x})^2 \sum_{j=1}^{v} w_j (y_j - \bar{y})^2}$$

DSQCORR

Squared correlation transformed to squared Euclidean distance as (1-SQCORR)
$$d_{11}(x, y) = 1 - s_{10}(x, y)$$
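The covariance and correlation family (COV, CORR, DCORR, SQCORR, DSQCORR) shares the same three weighted sums, as the following Python sketch shows. It assumes complete data (no missing values) and uses illustrative function names. The `max(0.0, ...)` guard only protects against rounding error and is not part of the formula.

```python
import math


def _moments(x, y, weights):
    """Weighted cross-product and sums of squares about the observation means."""
    sum_w = sum(weights)
    xbar = sum(w * a for w, a in zip(weights, x)) / sum_w
    ybar = sum(w * b for w, b in zip(weights, y)) / sum_w
    sxy = sum(w * (a - xbar) * (b - ybar) for w, a, b in zip(weights, x, y))
    sxx = sum(w * (a - xbar) ** 2 for w, a in zip(weights, x))
    syy = sum(w * (b - ybar) ** 2 for w, b in zip(weights, y))
    return sxy, sxx, syy


def cov(x, y, weights, vardef="DF"):
    sxy, _, _ = _moments(x, y, weights)
    v, sum_w = len(x), sum(weights)
    vardiv = {"N": v, "DF": v - 1, "WEIGHT": sum_w, "WDF": sum_w - 1}[vardef]
    return sxy / vardiv


def corr(x, y, weights):
    sxy, sxx, syy = _moments(x, y, weights)
    return sxy / math.sqrt(sxx * syy)


def dcorr(x, y, weights):
    return math.sqrt(max(0.0, 1.0 - corr(x, y, weights)))


def sqcorr(x, y, weights):
    sxy, sxx, syy = _moments(x, y, weights)
    return sxy ** 2 / (sxx * syy)


def dsqcorr(x, y, weights):
    return 1.0 - sqcorr(x, y, weights)


w = [1, 1, 1]
print(corr([1, 2, 3], [2, 4, 6], w), dcorr([1, 2, 3], [3, 2, 1], w))
```

Note that the means here are taken across variables within one observation, not across observations, so the measures compare observation profiles.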

L(p)

Minkowski ($L_p$) distance, where $p$ is a positive numeric value
$$d_{12}(x, y) = \Big[\Big(\sum_{j=1}^{v} w_j |x_j - y_j|^p\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)\Big]^{1/p}$$

CITYBLOCK

$L_1$
$$d_{13}(x, y) = \Big(\sum_{j=1}^{v} w_j |x_j - y_j|\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)$$

CHEBYCHEV

$L_\infty$
$$d_{14}(x, y) = \max_{j=1}^{v} w_j |x_j - y_j|$$

POWER(p, r)

Generalized Euclidean distance, where $p$ is a non-negative numeric value and $r$ is a positive numeric value. The distance between two observations is the $r$th root of the sum of the absolute differences to the $p$th power between the values for the observations:
$$d_{15}(x, y) = \Big[\Big(\sum_{j=1}^{v} w_j |x_j - y_j|^p\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)\Big]^{1/r}$$
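Since L(p) and CITYBLOCK are special cases of POWER(p, r), one function covers all three, as this Python sketch shows. Missing values are `None`, and the function names are illustrative rather than taken from the macro.

```python
def _kept(x, y, weights):
    """Non-missing (w_j, |x_j - y_j|) pairs and the factor W / sum(w_j)."""
    kept = [(w, abs(a - b)) for a, b, w in zip(x, y, weights)
            if a is not None and b is not None]
    factor = sum(weights) / sum(w for w, _ in kept)
    return kept, factor


def power_distance(x, y, weights, p, r):
    """POWER(p, r): r-th root of the adjusted sum of p-th powers."""
    kept, factor = _kept(x, y, weights)
    return (sum(w * diff ** p for w, diff in kept) * factor) ** (1.0 / r)


def minkowski(x, y, weights, p):
    """L(p) is POWER(p, p)."""
    return power_distance(x, y, weights, p, p)


def cityblock(x, y, weights):
    """CITYBLOCK is L(1)."""
    return minkowski(x, y, weights, 1)


def chebychev(x, y, weights):
    """CHEBYCHEV: largest weighted absolute difference (no adjustment factor)."""
    kept, _ = _kept(x, y, weights)
    return max(w * diff for w, diff in kept)


x, y, w = [0, 0], [3, 4], [1, 1]
print(minkowski(x, y, w, 2), cityblock(x, y, w), chebychev(x, y, w))
```

POWER(2, 1) reproduces the squared Euclidean distance, and POWER(2, 2) the Euclidean distance.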

Methods Accepting Ratio Variables

SIMRATIO

Similarity ratio
$$s_{16}(x, y) = \frac{\sum_{j=1}^{v} w_j x_j y_j}{\sum_{j=1}^{v} w_j x_j y_j + \sum_{j=1}^{v} w_j (x_j - y_j)^2}$$

DISRATIO

one minus similarity ratio
$$d_{17}(x, y) = 1 - s_{16}(x, y)$$

NONMETRIC

Lance-Williams nonmetric coefficient
$$d_{18}(x, y) = \frac{\sum_{j=1}^{v} w_j |x_j - y_j|}{\sum_{j=1}^{v} w_j (x_j + y_j)}$$

CANBERRA

Canberra metric coefficient
$$d_{19}(x, y) = \sum_{j=1}^{v} \Big(\frac{w_j |x_j - y_j|}{w_j (x_j + y_j)}\Big)$$

COSINE

Cosine
$$s_{20}(x, y) = \frac{\sum_{j=1}^{v} w_j x_j y_j}{\sqrt{\sum_{j=1}^{v} w_j x_j^2 \sum_{j=1}^{v} w_j y_j^2}}$$
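The ratio-variable measures from SIMRATIO through COSINE can be written directly from their formulas. This Python sketch assumes complete, nonnegative data and positive weights, with at least one nonzero value per pair so that no denominator is zero. The function names are illustrative.

```python
import math


def simratio(x, y, weights):
    sxy = sum(w * a * b for w, a, b in zip(weights, x, y))
    sdd = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return sxy / (sxy + sdd)


def disratio(x, y, weights):
    return 1.0 - simratio(x, y, weights)


def nonmetric(x, y, weights):
    num = sum(w * abs(a - b) for w, a, b in zip(weights, x, y))
    den = sum(w * (a + b) for w, a, b in zip(weights, x, y))
    return num / den


def canberra(x, y, weights):
    # The weight appears in numerator and denominator of every term.
    return sum(w * abs(a - b) / (w * (a + b)) for w, a, b in zip(weights, x, y))


def cosine(x, y, weights):
    sxy = sum(w * a * b for w, a, b in zip(weights, x, y))
    sxx = sum(w * a * a for w, a in zip(weights, x))
    syy = sum(w * b * b for w, b in zip(weights, y))
    return sxy / math.sqrt(sxx * syy)


x, y, w = [1, 2], [2, 2], [1, 1]
print(simratio(x, y, w), nonmetric(x, y, w), canberra(x, y, w), cosine(x, y, w))
```

NONMETRIC divides one total by another, whereas CANBERRA sums per-variable ratios, so small-valued variables influence CANBERRA much more strongly.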

DOT

Dot (inner) product
$$s_{21}(x, y) = \sum_{j=1}^{v} w_j x_j y_j \Big/ \sum_{j=1}^{v} w_j$$

OVERLAP

Sum of the minimum values
$$s_{22}(x, y) = \sum_{j=1}^{v} w_j [\min(x_j, y_j)]$$

DOVERLAP

The maximum of the sum of the $x$ and the sum of the $y$, minus overlap
$$d_{23}(x, y) = \max\Big(\sum_{j=1}^{v} w_j x_j, \sum_{j=1}^{v} w_j y_j\Big) - s_{22}(x, y)$$
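A brief Python sketch of DOT, OVERLAP, and DOVERLAP follows, assuming complete data. The function names are illustrative.

```python
def dot(x, y, weights):
    """Weighted inner product divided by the sum of weights."""
    return sum(w * a * b for w, a, b in zip(weights, x, y)) / sum(weights)


def overlap(x, y, weights):
    """Weighted sum of the variable-wise minimum values."""
    return sum(w * min(a, b) for w, a, b in zip(weights, x, y))


def doverlap(x, y, weights):
    """Larger of the two weighted totals, minus the overlap."""
    total_x = sum(w * a for w, a in zip(weights, x))
    total_y = sum(w * b for w, b in zip(weights, y))
    return max(total_x, total_y) - overlap(x, y, weights)


x, y, w = [1, 2, 3], [2, 2, 1], [1, 1, 1]
print(dot(x, y, w), overlap(x, y, w), doverlap(x, y, w))
```

For nonnegative data DOVERLAP is zero exactly when the two observations are identical, because then the overlap equals both totals.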

CHISQ

chi-squared
If the data represent frequency counts, the chi-squared dissimilarity between two sets of frequencies can be computed. A 2 by $v$ contingency table illustrates how the chi-squared dissimilarity is computed:

Observation   Var 1   Var 2   ...   Var v   Row sum
X             x_1     x_2     ...   x_v     r_x
Y             y_1     y_2     ...   y_v     r_y
Column sum    c_1     c_2     ...   c_v     T

where
$r_x = \sum_{j=1}^{v} w_j x_j$
$r_y = \sum_{j=1}^{v} w_j y_j$
$c_j = w_j (x_j + y_j)$
$T = r_x + r_y = \sum_{j=1}^{v} c_j$

The chi-squared measure is computed as follows:
$$d_{24}(x, y) = \Big(\sum_{j=1}^{v} \frac{(w_j x_j - E(x_j))^2}{E(x_j)} + \sum_{j=1}^{v} \frac{(w_j y_j - E(y_j))^2}{E(y_j)}\Big) W \Big/ \Big(\sum_{j=1}^{v} w_j\Big)$$
where for $j = 1, 2, \ldots, v$
$E(x_j) = r_x c_j / T$
$E(y_j) = r_y c_j / T$

CHI

Square root of chi-squared
$$d_{25}(x, y) = \sqrt{d_{24}(x, y)}$$

PHISQ

phi-squared
This is the CHISQ dissimilarity normalized by the sum of weights:
$$d_{26}(x, y) = d_{24}(x, y) \Big/ \Big(\sum_{j=1}^{v} w_j\Big)$$

PHI

Square root of phi-squared
$$d_{27}(x, y) = \sqrt{d_{26}(x, y)}$$
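The chi-squared family can be sketched in Python by building the 2 by $v$ table explicitly. The sketch assumes complete data (so the factor $W / \sum_j w_j$ equals 1) and a positive column sum for every variable. It takes CHI and PHI as the square roots of CHISQ and PHISQ respectively, and the function names are illustrative.

```python
import math


def chisq(x, y, weights):
    wx = [w * a for w, a in zip(weights, x)]
    wy = [w * b for w, b in zip(weights, y)]
    rx, ry = sum(wx), sum(wy)             # row sums
    cols = [a + b for a, b in zip(wx, wy)]  # column sums c_j
    total = rx + ry                        # grand total T
    stat = 0.0
    for a, b, c in zip(wx, wy, cols):
        ex = rx * c / total  # expected value E(x_j)
        ey = ry * c / total  # expected value E(y_j)
        stat += (a - ex) ** 2 / ex + (b - ey) ** 2 / ey
    return stat


def chi(x, y, weights):
    return math.sqrt(chisq(x, y, weights))


def phisq(x, y, weights):
    return chisq(x, y, weights) / sum(weights)


def phi(x, y, weights):
    return math.sqrt(phisq(x, y, weights))


x, y, w = [10, 20], [20, 10], [1, 1]
print(chisq(x, y, w), chi(x, y, w), phisq(x, y, w), phi(x, y, w))
```

Two observations with proportional frequency profiles have a chi-squared dissimilarity of zero, regardless of their totals.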

Methods Accepting Symmetric Nominal Variables

The following notation is used for computing $d_{28}(x, y)$ to $s_{35}(x, y)$. Notice that only the non-missing pairs are discussed below; all the pairs with at least one missing value are excluded from any of the computations in the following section because $w_j = 0$ if either $x_j$ or $y_j$ is missing.

$M$

non-missing matches
$M = \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$, where
$\delta_{x,y}^{j} = 1$, if $x_j = y_j$
$\delta_{x,y}^{j} = 0$, otherwise

$X$

non-missing mismatches
$X = \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$, where
$\delta_{x,y}^{j} = 1$, if $x_j \ne y_j$
$\delta_{x,y}^{j} = 0$, otherwise

$N$

total non-missing pairs
$N = \sum_{j=1}^{v} w_j$

HAMMING

Hamming distance
$$d_{28}(x, y) = X$$

MATCH

Simple matching coefficient
$$s_{29}(x, y) = M/N$$

DMATCH

Simple matching coefficient transformed to Euclidean distance
$$d_{30}(x, y) = \sqrt{1 - M/N} = \sqrt{X/N}$$

DSQMATCH

Simple matching coefficient transformed to squared Euclidean distance
$$d_{31}(x, y) = 1 - M/N = X/N$$

HAMANN

Hamann coefficient
$$s_{32}(x, y) = (M - X)/N$$

RT

Roger and Tanimoto
$$s_{33}(x, y) = M/(M + 2X)$$

SS1

Sokal and Sneath 1
$$s_{34}(x, y) = 2M/(2M + X)$$

SS3

Sokal and Sneath 3. The coefficient between an observation and itself is always indeterminate (missing) since there is no mismatch.
$$s_{35}(x, y) = M/X$$
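All of the symmetric nominal measures are functions of the three counts $M$, $X$, and $N$, as this Python sketch shows. Missing values are `None`, the dictionary keys mirror the METHOD= names, and the function names are illustrative. The sketch assumes at least one non-missing pair.

```python
import math


def match_counts(x, y, weights):
    """Weighted matches M, mismatches X, and total N over non-missing pairs."""
    m = mismatches = 0.0
    for a, b, w in zip(x, y, weights):
        if a is None or b is None:
            continue  # w_j = 0 for a pair with a missing value
        if a == b:
            m += w
        else:
            mismatches += w
    return m, mismatches, m + mismatches


def symmetric_nominal_measures(x, y, weights):
    m, xm, n = match_counts(x, y, weights)
    return {
        "HAMMING": xm,
        "MATCH": m / n,
        "DMATCH": math.sqrt(xm / n),
        "DSQMATCH": xm / n,
        "HAMANN": (m - xm) / n,
        "RT": m / (m + 2 * xm),
        "SS1": 2 * m / (2 * m + xm),
        "SS3": m / xm if xm > 0 else None,  # indeterminate without mismatches
    }


x = ["a", "b", "c", "d", None]
y = ["a", "b", "x", "y", "z"]
print(symmetric_nominal_measures(x, y, [1, 1, 1, 1, 1]))
```

Comparing an observation with itself gives `None` for SS3, which corresponds to the indeterminate (missing) value mentioned above.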

The following notation is used for computing $s_{36}(x, y)$ to $d_{41}(x, y)$. Notice that only the non-missing pairs are discussed below; all the pairs with at least one missing value are excluded from any of the computations in the following section because $w_j = 0$ if either $x_j$ or $y_j$ is missing. Also, the observed non-missing data of an asymmetric binary variable can have only two outcomes: presence or absence. Therefore, the notation $PX$ (present mismatches) always has a value of zero for an asymmetric binary variable. The following methods distinguish between the presence and absence of attributes.

$X$

mismatches with at least one present
$X = \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$, where
$\delta_{x,y}^{j} = 1$, if $x_j \ne y_j$ and not both $x_j$ and $y_j$ are absent
$\delta_{x,y}^{j} = 0$, otherwise

$PM$

present matches
$PM = \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$, where
$\delta_{x,y}^{j} = 1$, if $x_j = y_j$ and both $x_j$ and $y_j$ are present
$\delta_{x,y}^{j} = 0$, otherwise

$PX$

present mismatches
$PX = \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$, where
$\delta_{x,y}^{j} = 1$, if $x_j \ne y_j$ and both $x_j$ and $y_j$ are present
$\delta_{x,y}^{j} = 0$, otherwise

$PP$

both present $= PM + PX$

$P$

at least one present $= PM + X$

$PAX$

present-absent mismatches
$PAX = \sum_{j=1}^{v} w_j \delta_{x,y}^{j}$, where
$\delta_{x,y}^{j} = 1$, if $x_j \ne y_j$ and either $x_j$ is present and $y_j$ is absent or $x_j$ is absent and $y_j$ is present
$\delta_{x,y}^{j} = 0$, otherwise

$N$

total non-missing pairs
$N = \sum_{j=1}^{v} w_j$

Methods Accepting Asymmetric Nominal and Ratio Variables

JACCARD

Jaccard similarity coefficient
The JACCARD method is equivalent to the SIMRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (SIMRATIO) and the coefficient from the asymmetric nominal variables.
$$s_{36}(x, y) = s_{16}(x, y) + PM/P$$

DJACCARD

Jaccard dissimilarity coefficient
The DJACCARD method is equivalent to the DISRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (DISRATIO) and the coefficient from the asymmetric nominal variables.
$$d_{37}(x, y) = d_{17}(x, y) + X/P$$

Methods Accepting Asymmetric Nominal Variables

DICE

Dice coefficient or Czekanowski/Sorensen similarity coefficient
$$s_{38}(x, y) = 2PM/(P + PM)$$

RR

Russell and Rao. This is the binary equivalent of the dot product coefficient.
$$s_{39}(x, y) = PM/N$$

BLWNM BRAYCURTIS

Binary Lance and Williams, also known as Bray and Curtis coefficient
$$d_{40}(x, y) = X/(PAX + 2PP)$$

K1

Kulcynski 1. The coefficient between an observation and itself is always indeterminate (missing) since there is no mismatch.
$$d_{41}(x, y) = PM/X$$
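The presence/absence counts and the measures built from them can be sketched in Python. The sketch covers only the asymmetric nominal part of JACCARD and DJACCARD (that is, $PM/P$ and $X/P$). It treats any value listed in the hypothetical `absent` parameter as an absence, uses `None` for missing values, and returns `None` where a denominator is zero. All names are illustrative.

```python
def presence_counts(x, y, weights, absent=(0, "NONE")):
    """Weighted counts PM, PX, PAX, X, P, PP, N over non-missing pairs."""
    pm = px = pax = n = 0.0
    for a, b, w in zip(x, y, weights):
        if a is None or b is None:
            continue  # w_j = 0 for a pair with a missing value
        n += w
        a_present, b_present = a not in absent, b not in absent
        if a_present and b_present:
            if a == b:
                pm += w   # present match
            else:
                px += w   # present mismatch
        elif a_present or b_present:
            pax += w      # present-absent mismatch
    x_mis = px + pax      # mismatches with at least one present
    return {"PM": pm, "PX": px, "PAX": pax, "X": x_mis,
            "P": pm + x_mis, "PP": pm + px, "N": n}


def _ratio(num, den):
    return num / den if den > 0 else None  # indeterminate when den is 0


def asymmetric_nominal_measures(x, y, weights, absent=(0, "NONE")):
    c = presence_counts(x, y, weights, absent)
    return {
        "JACCARD": _ratio(c["PM"], c["P"]),
        "DJACCARD": _ratio(c["X"], c["P"]),
        "DICE": _ratio(2 * c["PM"], c["P"] + c["PM"]),
        "RR": _ratio(c["PM"], c["N"]),
        "BLWNM": _ratio(c["X"], c["PAX"] + 2 * c["PP"]),
        "K1": _ratio(c["PM"], c["X"]),
    }


x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]
print(asymmetric_nominal_measures(x, y, [1, 1, 1, 1, 1]))
```

The absent-absent pair in the fourth position counts toward $N$ (and therefore lowers RR) but does not enter $P$, so it leaves JACCARD and DICE unchanged.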

References

SAS Institute Inc. (1990), SAS Guide to Macro Processing, Version 6, Second Edition, Cary, NC: SAS Institute Inc.

Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press.

Jambu, M. and Lebeaux, M.-O., Cluster Analysis and Data Analysis, Amsterdam: North-Holland Publishing Company.

Legendre, L. and Legendre, P. (1983), Numerical Ecology, New York: Elsevier Scientific Pub. Co.

Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data, New York: John Wiley and Sons, Inc.

Sneath, P.H.A. and Sokal, R.R. (1973), Numerical Taxonomy, San Francisco: Freeman.

