Introduction

In Principal Components and Factor Analysis, looked for similarities among variables in order to create components/factors combining variables.
Multidimensional Scaling
Aim now: look for similarities among observations. Create picture in small number of dimensions to express similarity/dissimilarity by distance as accurately as possible. Next chapter: classify observations into groups; those within group “similar”, those in different groups “dissimilar”. 2 kinds of MDS: “metric” uses actual distance; “non-metric” preserves rank order of distances.
Metric MDS
Classical example of metric MDS: wish to understand spatial relationship of cities when we only have distances between them. Ontario, road distances:

                Tor  Ham  Ott  Bar   NB  Sud  Kin   NF
  Toronto
  Hamilton       70
  Ottawa        400  470
  Barrie         90  145  430
  North Bay     345  395  365  250
  Sudbury       390  440  490  300  125
  Kingston      260  330  175  330  460  580
  Niagara Falls 130   70  530  200  455  500  390
  London        185  125  570  240  495  540  430  190

Hard to tell how cities might lie relative to each other.
One idea: figure out mean distance from each city to all the others:

  City           Mean distance
  Toronto        234
  Barrie         248
  Hamilton       256
  Niagara Falls  308
  North Bay      361
  London         347
  Kingston       369
  Sudbury        421
  Ottawa         429
Toronto apparently most "central", Sudbury and Ottawa most "remote".

Could: choose location for Toronto, put Barrie at right distance, try to put Hamilton at right distance from both. But then may not be able to place Niagara Falls at right distance from these, etc.

Will be "errors" in 2 dimensions because road distances not direct. Need to find locations for points minimizing errors in some way.

Possible solution on next page:

[Map of fitted coordinates, axes ans[,1] and ans[,2]: Sudbury and North Bay at top; Barrie and Toronto in the middle; London, Niagara Falls and Hamilton below; Kingston and Ottawa at the bottom.]
General fact: if only have distances, can only reproduce map up to rotation and possibly reflection. Any rotated/reflected map will reproduce distances just as well. Note also strange position of London on our map: straight line from Toronto via Hamilton, Niagara Falls, whereas in reality London to west of these. Explanation: Lake Ontario – roads go around, so road distances larger than direct distances.
How it works

Begin with n × n matrix of distances D = [d_ij] between objects i and j, with d_ii = 0. Aim to end with matrix X of coordinates for each object (X could be n × 2 if coordinates in 2 dimensions).

Want relationship between D and X. Easier to work other way around: if have coordinates X, how to find distances D? Assume coordinates have mean 0, X is n × p.

Squared Euclidean distance between objects r and s is

  d_{rs}^2 = \sum_{i=1}^p (x_{ri} - x_{si})^2 = \sum_{i=1}^p (x_{ri}^2 - 2 x_{ri} x_{si} + x_{si}^2).

In general, (j,k) element of matrix product CD is \sum_i c_{ji} d_{ik}, and (j,k) element of A' is a_{kj}. Hence (j,k) element of XX' is \sum_i x_{ji} x_{ki}.

Terms in sum above therefore elements of B = XX', specifically

  d_{rs}^2 = b_{rr} + b_{ss} - 2 b_{rs}.
That is, given X, calculate B = XX', get D from B. But want to do it other way around: get B from D, then find X with B = XX'.

Assumed that coordinates have mean 0, ie. \sum_{j=1}^n x_{ji} = 0 for all i. Hence row and column sums of B are zero (proof?).

Using this, and letting T = \sum_r b_{rr}:

  \sum_r d_{rs}^2 = \sum_r (b_{rr} + b_{ss} - 2 b_{rs}) = T + n b_{ss}

  \sum_s d_{rs}^2 = \sum_s (b_{rr} + b_{ss} - 2 b_{rs}) = n b_{rr} + T

  \sum_{r,s} d_{rs}^2 = \sum_{r,s} (b_{rr} + b_{ss} - 2 b_{rs}) = 2 n T

Hence can solve for B in terms of D:

  b_{rs} = -\frac{1}{2} (d_{rs}^2 - d_{r.}^2 - d_{.s}^2 + d_{..}^2);

dots mean to average row/column/overall of matrix of squared distances.

Last step is to find X such that B = XX'. Recall: can write B = L A L' where A diagonal matrix of eigenvalues, L orthogonal matrix of eigenvectors. Hence X = L A^{1/2} solves the problem. That is, j-th coordinate values are j-th eigenvector times square root of j-th eigenvalue.
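The derivation above is only a few lines of linear algebra. A sketch in Python/NumPy for illustration (the notes use SAS PROC IML for this; the helper name `classical_mds` is made up), where double-centring the squared distances gives B and the eigendecomposition gives X:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS: distance matrix D -> n x k coordinates.

    Follows the derivation in the notes: b_rs = -1/2 (d^2_rs - row mean
    - column mean + overall mean), then X = L A^{1/2} from B = L A L'.
    """
    D2 = np.asarray(D, dtype=float) ** 2
    n = D2.shape[0]
    # double-centring matrix J subtracts row/column/overall means
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # eigendecomposition of symmetric B; sort eigenvalues descending
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # keep the k largest dimensions (clip tiny negatives from rounding)
    return evecs[:, :k] * np.sqrt(np.maximum(evals[:k], 0))
```

If the input distances really are Euclidean, the recovered configuration reproduces them exactly, up to rotation and reflection.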
A small example

Suppose we have 3 objects, with d_{12}^2 = 4, d_{23}^2 = 6, d_{13}^2 = 8.

Matrix B then

  B = \frac{1}{3} \begin{pmatrix} 6 & -1 & -5 \\ -1 & 4 & -3 \\ -5 & -3 & 8 \end{pmatrix}

Eigenvalues 4.15, 1.85, 0. (Can represent exactly in 2 dimensions.)

1st eigenvector (−0.58, −0.21, 0.79), 2nd (0.58, −0.79, 0.21). Times by √4.15, √1.85 to get coords:

  Object  Coord 1  Coord 2
  1        -1.18     0.78
  2        -0.43    -1.07
  3         1.61     0.29

Squared distances between objects exactly reproduced.
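The small example can be checked numerically. A sketch in Python/NumPy (an illustration, not part of the notes), using the same B:

```python
import numpy as np

# B = XX' for the 3-object example, from b_rs = -1/2(d^2_rs - means)
B = np.array([[ 6.0, -1.0, -5.0],
              [-1.0,  4.0, -3.0],
              [-5.0, -3.0,  8.0]]) / 3

evals, evecs = np.linalg.eigh(B)          # eigh returns ascending order
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
# eigenvalues approx 4.15, 1.85, 0 (third ~ 0: exactly 2-dimensional)

# coordinates: eigenvectors times square roots of eigenvalues
X = evecs[:, :2] * np.sqrt(evals[:2])

# squared distances between the 3 objects, from the coordinates
D2 = ((X[:, None] - X[None, :]) ** 2).sum(axis=2)
# reproduces d^2_12 = 4, d^2_23 = 6, d^2_13 = 8 exactly
```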
Metric MDS in practice

In general with n objects, have n eigenvalues, coordinates in R^n. Choose coordinate directions with largest eigenvalues, as in principal components. If 2 dimensions represent data well, map in 2 dimensions preserves distances.

Do in SAS: use PROC IML to reproduce matrix stuff, PROC MDS to use the canned procedure. PROC MDS code:

options ls=65 ps=40;

data cities;
  infile "cities.dat";
  input name $ toronto hamilton ottawa barrie northbay sudbury
        kingston niagaraf london;

proc mds level=absolute out=coords outres=dist pineigval;
  id name;

proc print data=dist;
  var _row_ _col_ data distance residual;

proc print data=coords;

proc plot data=coords;
  plot dim2 * dim1 = '*' $ _name_;
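Outside SAS, the same fit can be sketched in Python with scikit-learn's MDS on the precomputed road distances (an illustration under the assumption that scikit-learn is available; it uses the SMACOF algorithm rather than the eigendecomposition, playing the role of LEVEL=ABSOLUTE):

```python
import numpy as np
from sklearn.manifold import MDS

cities = ["Toronto", "Hamilton", "Ottawa", "Barrie", "NorthBay",
          "Sudbury", "Kingston", "NiagaraF", "London"]

# lower triangle of the road-distance matrix from the notes
tri = [[70],
       [400, 470],
       [90, 145, 430],
       [345, 395, 365, 250],
       [390, 440, 490, 300, 125],
       [260, 330, 175, 330, 460, 580],
       [130, 70, 530, 200, 455, 500, 390],
       [185, 125, 570, 240, 495, 540, 430, 190]]

n = len(cities)
D = np.zeros((n, n))
for i, row in enumerate(tri, start=1):
    D[i, :i] = row
D = D + D.T                      # symmetric, zero diagonal

# metric MDS on the precomputed distances
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)    # one row of (x, y) per city
```

The map is determined only up to rotation/reflection, so check it by relative positions (Barrie close to Toronto, Sudbury far), not by absolute coordinates.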
First 2 eigenvalues much bigger than others – 2 dimensions good. Eigenvalues:

  301792.5  232055.4  17473.68  5930.31  4319.64  -0.00  -1180.00  -7280.59  -23349.8

Compare observed and predicted distances (selected):

  OBS  _ROW_     _COL_      DATA  DISTANCE  RESIDUAL
  24   NiagaraF  Ottawa      530   497.905   32.0955
  28   NiagaraF  Kingston    390   346.720   43.2796
  31   London    Ottawa      570   597.215  -27.2149
  34   London    Sudbury     540   511.944   28.0561
  35   London    Kingston    430   462.797  -32.7968
  36   London    NiagaraF    190   122.569   67.4308

Largest residuals with Niagara Falls and London (seem most misplaced on map, most affected by Lake Ontario). SAS map needs rotation and reflection:

[SAS character plot, Dimension 2 vs Dimension 1: KINGSTON and OTTAWA at top, NIAGARAF, HAMILTON, TORONTO, BARRIE and LONDON in the middle band, NORTHBAY and SUDBURY at the bottom.]
Notes about metric MDS

• If distances actually Euclidean distances calculated from data matrix X, then map obtained from metric MDS identical to plot of 1st 2 principal components.

• As check on appropriateness of model, can plot observed and predicted distances. Should lie close to line through origin with slope 1. If not, then distances not Euclidean. Can fit models accounting for this by changing LEVEL=ABSOLUTE in SAS:
  – if line passes through origin with some other slope, use LEVEL=RATIO to estimate slope.
  – if on line but not through origin, use LEVEL=INTERVAL to estimate intercept and slope.
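The LEVEL=RATIO / LEVEL=INTERVAL idea is just an ordinary straight-line fit of observed dissimilarity against fitted map distance. A small sketch with synthetic numbers (the values below are made up for illustration, not from the notes):

```python
import numpy as np

# hypothetical fitted map distances, and dissimilarities that are a
# linear transform of them (slope 2, intercept 5): INTERVAL situation
map_dist = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
dissim = 2.0 * map_dist + 5.0

# LEVEL=ABSOLUTE assumes slope 1, intercept 0; a fitted line reveals
# whether RATIO (intercept 0, other slope) or INTERVAL is needed
slope, intercept = np.polyfit(map_dist, dissim, 1)
# here the fit recovers slope 2, intercept 5
```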
Non-metric MDS

To do metric MDS, need to assume that distances meaningful as numbers. Reasonable for physical distances, maybe also for variables measured precisely. In many interesting problems, not willing to assume that distances Euclidean (even approximately) – indeed, quantity measured may be "dissimilarity" rather than distance. Might simply wish to create map in which map distances and dissimilarities in same order smallest to largest, without trying to match up numbers. To do this, need non-metric MDS.

Measuring dissimilarity

By relaxing assumption of Euclidean distance, can define dissimilarity in different ways suitable to problem. Example: animals lion, giraffe, cow, sheep, human. How to describe similarities, differences? Use criteria eg. as in table:

                   Lion  Giraffe  Cow  Sheep  Human
  Has tail          Y      Y       Y    Y      N
  Wild animal       Y      Y       N    N      N
  Has long neck     N      Y       N    N      N
  Farm animal       N      N       Y    Y      N
  Eats meat         Y      N       N    N      Y
  Walks on 4 legs   Y      Y       Y    Y      N
Could measure dissimilarity for 2 animals:

• as fraction of different answers ("simple matching coeff", below)
• as fraction of differences ignoring N, N ("Jaccard coeff")

           Lion  Giraffe  Cow   Sheep
  Giraffe  0.33
  Cow      0.50   0.50
  Sheep    0.50   0.50    0
  Human    0.50   0.83    0.67  0.67

Another idea: pairwise comparisons. Example pages 218–219 of text: compare 10 cars by taking all \binom{10}{2} = 45 pairs, then ranking the pairs from 1 (most similar) to 45 (least similar).

These kinds of dissimilarity ordinal: only important thing whether one dissimilarity bigger than another, not how much bigger. Want mapping method that only relies on order, not size of numbers.
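The two coefficients can be computed directly from the Y/N table. A Python sketch (the 0/1 coding and function names are illustrative assumptions):

```python
import numpy as np

animals = ["lion", "giraffe", "cow", "sheep", "human"]
# criteria rows: has tail, wild, long neck, farm, eats meat, 4 legs
traits = np.array([[1, 1, 1, 1, 0],
                   [1, 1, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 1, 0],
                   [1, 0, 0, 0, 1],
                   [1, 1, 1, 1, 0]]).T    # one row per animal

def simple_matching(a, b):
    # fraction of criteria with different answers
    return np.mean(a != b)

def jaccard_dissim(a, b):
    # fraction different, ignoring criteria where both answer N
    keep = (a == 1) | (b == 1)
    return np.mean(a[keep] != b[keep])

# lion vs giraffe: differ on long neck and eats meat, 2 of 6 = 0.33
lion_giraffe = simple_matching(traits[0], traits[1])
```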
How it works

Aim is for close correspondence between order of distances on map and dissimilarities in data. How to measure closeness?

Use animal data as example; suppose current 2D map coords as shown:

  Animal   Lion  Giraffe  Cow  Sheep  Human
  x          1      3      5     6     10
  y          2      5      8     6      7

Calculate distances d_ij from current map, list with ordered dissimilarities δ_ij:

  Animal pair     δ_ij   d_ij   d̂_ij
  cow-sheep       0.00    2.2    2.2
  lion-giraffe    0.33    3.6    3.4
  giraffe-sheep   0.50    3.2    3.4
  lion-cow        0.50    5.0    5.0
  lion-sheep      0.50    6.4    5.0
  giraffe-cow     0.50    3.6    5.0
  lion-human      0.50   10.3    6.5
  sheep-human     0.67    4.1    6.5
  cow-human       0.67    5.1    6.5
  giraffe-human   0.83    7.3    7.3
Pool-adjacent-violators and Stress

d̂_ij based on d_ij, but in same order as δ_ij. Method: start with d̂_ij = d_ij. Find first pair in wrong order; average. Go back up list, keep averaging until order correct. Continue down list until end.

Here, combined 2nd and 3rd, then 5th and 6th, then 7th, 8th, 9th.

Measure fit using stress:

  \text{stress} = \sqrt{ \frac{\sum (d_{ij} - \hat d_{ij})^2}{\sum d_{ij}^2} }

If d_ij already in order, stress 0; otherwise larger. Denominator reflects idea that multiplying coords by constant should not change fit. In example, stress is \sqrt{26.16/309.16} = 0.29.

Aim to find coords making stress minimum. Iterative process. Mathematical idea: can find partial derivatives of stress wrt coords ("gradients"); search along gradient directions to find coords reducing stress; repeat until stress cannot be decreased.

Solution not unique; solution, and even its stress, may depend on starting coordinates.
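Pool-adjacent-violators and the stress formula can be sketched in Python (an illustration; the d values are the d_ij column of the animal table, listed in δ_ij order):

```python
import numpy as np

def pava(d):
    """Pool adjacent violators: make d non-decreasing by averaging
    any run of values that is in the wrong order."""
    blocks = []                       # list of [sum, count] blocks
    for v in d:
        blocks.append([v, 1])
        # pool back up the list while the order is still wrong
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)       # each block replaced by its mean
    return np.array(out)

def stress(d, dhat):
    return np.sqrt(np.sum((d - dhat) ** 2) / np.sum(d ** 2))

# map distances in order of the dissimilarities (animal example)
d = np.array([2.2, 3.6, 3.2, 5.0, 6.4, 3.6, 10.3, 4.1, 5.1, 7.3])
dhat = pava(d)
# dhat: 2.2, 3.4, 3.4, 5.0, 5.0, 5.0, 6.5, 6.5, 6.5, 7.3
# stress(d, dhat) is sqrt(26.16/309.16), about 0.29 as in the notes
```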
Non-metric MDS in SAS

Easy: change to LEVEL=ORDINAL (or omit). Animal data input:

data animals;
  input name $ lion giraffe cow sheep human;
  datalines;
lion    0   .   .   .   .
giraffe .33 0   .   .   .
cow     .50 .50 0   .   .
sheep   .50 .50 0   0   .
human   .50 .83 .67 .67 0
;

proc mds out=coords outres=dist;
proc plot data=coords;
  plot dim2 * dim1 = '*' $ _name_;

[SAS character plot, Dimension 2 vs Dimension 1: LION and GIRAFFE at top, HUMAN in the middle, SHEEP and COW together at the bottom.]

Plot seems to map data: cow and sheep very close, human distant. Stress value from table of iterative process is 0.00109 (very small):

                       Badness-    Change in   Convergence Measures
  Iteration  Type      of-Fit      Criterion   Monotone   Gradient
                       Criterion
  -----------------------------------------------------------------
  0          Initial   0.01057     .           .          .
  1          Monotone  0.00229     0.00828     0.00863    0.87387
  2          Gau-New   0.00111     0.00118     .          0.10054
  3          Gau-New   0.00110     0.0000111   .          0.08851
  4          Gau-New   0.00109     7.03144E-6  .          0.05315
  5          Gau-New   0.00109     2.33664E-6  .          0.02708
  6          Gau-New   0.00109     5.94251E-7  .          0.01320
  7          Gau-New   0.00109     1.40532E-7  .          0.00637
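Outside SAS, a non-metric fit of the same matrix can be sketched with scikit-learn (an assumption, not part of the notes; metric=False gives the ordinal fit, and the random_state is arbitrary):

```python
import numpy as np
from sklearn.manifold import MDS

animals = ["lion", "giraffe", "cow", "sheep", "human"]
# full symmetric dissimilarity matrix from the notes
delta = np.array([[0.00, 0.33, 0.50, 0.50, 0.50],
                  [0.33, 0.00, 0.50, 0.50, 0.83],
                  [0.50, 0.50, 0.00, 0.00, 0.67],
                  [0.50, 0.50, 0.00, 0.00, 0.67],
                  [0.50, 0.83, 0.67, 0.67, 0.00]])

nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
coords = nmds.fit_transform(delta)   # one (dim1, dim2) row per animal
```

As in the SAS plot, cow and sheep (dissimilarity 0) should land essentially on top of each other, with human far from the rest.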
Data look 2-dimensional, suggesting that fitting to 1D won't work so well. In SAS code, add DIMENSION=1 to PROC MDS line, get stress of 16%:

  OBS  _ROW_    _COL_    DATA  DISTANCE  FITDATA
  1    GIRAFFE  LION     0.33  1.68076   1.07627
  2    COW      LION     0.50  0.86761   1.07627
  3    COW      GIRAFFE  0.50  0.81315   1.07627
  4    SHEEP    LION     0.50  0.86759   1.07627
  5    SHEEP    GIRAFFE  0.50  0.81317   1.07627
  6    SHEEP    COW      0.00  0.00002   0.00005
  7    HUMAN    LION     0.50  1.26343   1.07627
  8    HUMAN    GIRAFFE  0.83  2.94419   3.02246
  9    HUMAN    COW      0.67  2.13104   2.18548
  10   HUMAN    SHEEP    0.67  2.13102   2.18548
Agreement between map distance d_ij and fitted distance d̂_ij not very good, especially giraffe-lion.

Fitting a 3rd dimension reduces stress effectively to 0, but 2D stress already very small, so little gain for extra complication. (As in principal components, stop adding dimensions when stress effectively no longer getting smaller.)

Individual differences scaling

Often, especially in social science studies, have dissimilarity matrices from several different people. Want to combine these to get overall picture, and also understand how individuals differ from overall picture.

Individual differences scaling assumes that each individual's distances accounted for by same dimensions, but with different weights.

Example: breakfast foods, from text. 4 individuals, 15 breakfast foods, 15 × 15 matrix of dissimilarities from each person. Glue matrices together rowwise, with something distinguishing individuals.
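The INDSCAL model itself is easy to state: individual k sees distances computed from the common coordinates X but with personal dimension weights w_k. A sketch with made-up points and weights (not the breakfast data):

```python
import numpy as np

def indscal_distance(X, w):
    """Distance matrix implied by common coordinates X (n x p)
    and one individual's non-negative dimension weights w (length p):
    d_rs = sqrt( sum_i w_i (x_ri - x_si)^2 )."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((w * diff ** 2).sum(axis=2))

# common configuration in 2 dimensions (made-up points)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],    # separated from point 0 along dimension 1
              [0.0, 1.0]])   # separated from point 0 along dimension 2

# individual A weights both dimensions equally; B stresses dimension 1
d_equal = indscal_distance(X, np.array([1.0, 1.0]))
d_dim1 = indscal_distance(X, np.array([4.0, 1.0]))
```

For individual B, objects separated along dimension 1 look further apart than they do for A, which is how the fitted per-individual coefficients (DIAGCOEF below) should be read.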
SAS code: in breakfast data set, first number in row is ID for individual. Read in in obvious way:

data brkfst;
  infile "breakfast.txt";
  input id toastpp buttoast mufmarg jellydo cintoast blbmufmg
        rollbut toastmar toastjel toastmg cinbun danish glazedon
        coffeeck crnmuf;

proc mds model=indscal out=results;
  matrix id;
proc print;
  var id _type_ _name_ dim1 dim2;
proc plot;
  plot dim2*dim1 $ _name_;

Variables denote kinds of breakfast food. Results data set: coordinates for foods, but also individuals.
  OBS   ID   _TYPE_     _NAME_     DIM1      DIM2
   1     .   CRITERION            0.18316    .
   2     .   CONFIG     TOASTPP  -0.61996   1.55139
   3     .   CONFIG     BUTTOAST  1.24559   0.68637
   4     .   CONFIG     MUFMARG   1.31496  -0.66150
   5     .   CONFIG     JELLYDO  -1.45202   0.38822
  ...
  12     .   CONFIG     CINBUN   -1.10600  -0.91385
  13     .   CONFIG     DANISH   -1.05318  -0.46032
  14     .   CONFIG     GLAZEDON -1.32599   0.05092
  15     .   CONFIG     COFFEECK -0.87477  -1.12777
  16     .   CONFIG     CRNMUF    0.34146  -1.37079
  17    42   DIAGCOEF             0.94939   1.04817
  18    71   DIAGCOEF             1.34905   0.42434
  19   101   DIAGCOEF             1.31443   0.52181
  20   112   DIAGCOEF             1.04825   0.94930
Stress 18%, lower would be better. Coordinates for individuals describe how individuals weight dimensions. Indivs 42, 112 (female) weight 2 dimensions equally, but indivs 71, 101 (male) weight dimension 1 higher.

Plot (next) suggests dimension 1 "sweetness"; dimension 2 could be "bread". Ie. males here distinguish more in terms of sweetness than anything else, while females take broader view.

[SAS character plot of food configuration, Dimension 2 vs Dimension 1: TOASTPP, CINTOAST, TOASTMAR, TOASTJEL at top; BUTTOAST, JELLYDO, TOASTMG above middle; GLAZEDON, DANISH, MUFMARG in middle; CINBUN, BLBMUFMG, ROLLBUT, COFFEECK, CRNMUF toward bottom.]