Multidimensional Scaling

Introduction In Principal Components and Factor Analysis, look for similarities among variables in order to create components/factors combining variab...
31 downloads 0 Views 102KB Size
Introduction In Principal Components and Factor Analysis, look for similarities among variables in order to create components/factors combining variables.

Multidimensional Scaling

Aim now: look for similarities among observations. Create picture in small number of dimensions to express similarity/dissimilarity by distance as accurately as possible. Next chapter: classify observations into groups; those within group “similar”, those in different groups “dissimilar”. 2 kinds of MDS: “metric” uses actual distance; “non-metric” preserves rank order of distances.

130

Metric MDS

131

Hard to tell how cities might lie relative to each other.

Classical example of metric MDS: wish to understand spatial relationship of cities when we only have distances between them. Ontario, road distances: Toronto 70 Hamilton 400 470 Ottawa 90 145 430 Barrie 345 395 365 250 North Bay 390 440 490 300 125 Sudbury 260 330 175 330 460 580 Kingston 130 70 530 200 455 500 390 Niagara Falls 185 125 570 240 495 540 430 190 London

132

One idea: figure out mean distance from each city to all the others: City

Mean distance

Toronto

234

Barrie

248

Hamilton

256

Niagara Falls

308

North Bay

361

London

347

Kingston

369

Sudbury

421

Ottawa

429

133

Sudbury 200

Toronto apparently most “central”, Sudbury and Ottawa most “remote”.

NorthBay

100

Could: choose location for Toronto, put Barrie at right distance, try to

Barrie

put Hamilton at right distance from both. But then may not be able

Will be “errors” in 2 dimensions because road distances not direct. Need to find locations for points minimizing errors in some way.

0

ans[,2]

−200

Possible solution on next page:

Toronto

−100

to place Niagara Falls at right distance from these, etc.

London NiagaraFalls Hamilton

Ottawa −300

Kingston −200

134

−100

0

100

200

ans[,1] 135

General fact: if only have distances, can only reproduce map up to rotation and possibly reflection. Any rotated/reflected map will reproduce distances just as well. Note also strange position of London on our map: straight line from Toronto via Hamilton, Niagara Falls, whereas in reality London to west of these. Explanation: Lake Ontario – roads go around, so road distances larger than direct distances.

136

137

Squared Euclidean distance between objects r and s is

d2rs

How it works

=

p X i=1 p

= [dij ] between objects i and j , with dii = 0. Aim to end with matrix X of coordinates for each object (X could be n × 2 if coordinates in 2 dimensions).

=

Begin with n × n matrix of distances D

Want relationship between D and X . Easier to work other way around: if have coordinates X , how to find distances D ? Assume coordinates have mean 0, X is n × p.

X i=1

(xri − xsi )2 x2ri − 2xri xsi + x2si

In general, (j, k) element of matrix product CD is

P

i cji dik 0

and

(j, k) element of A is akj . Hence (j, k) element of XX is P i xji xki . 0

Terms in sum above therefore elements of B

= XX 0 , specifically

d2rs = brr + bss − 2brs . 138

That is, given X , calculate B

= XX 0 , get D from B . But want to do it other way around: get B from D , then find X with B = XX 0 . Pn Assumed that coordinates have mean 0, ie. j=1 xji = 0 for all i. Hence row and column sums of B are zero (proof?). P Using this, and letting T = r brr : X X d2rs = {brr + bss − 2brs } = T + nbss r

X

r

d2rs

=

s

X r,s

X s

d2rs =

X r,s

{brr + bss − 2brs } = nbrr + T {brr + bss − 2brs } = 2nT

140

139

Hence can solve for B in terms of D :

1 brs = − (d2rs − d2r. − d2.s + d2.. ); 2 dots mean to average row/column/overall of matrix of squared distances. Last step is to find X such that B

= XX 0 . Recall: can write B = LAL0 where A diagonal matrix of eigenvalues, L orthogonal matrix of eigenvectors. Hence X = LA1/2 solves the problem. That is, j -th coordinate values are j -th eigenvector times sq root of j -th eigenvalue.

141

A small example Suppose we have 3 objects, with

d212

=

4, d223

=

6, d213

= 8.

Times by



4.15,



1.85 to get coords:

Matrix B then

B=



6

−1 −5

1 −1.18



 1 −1 4 −3 .  3 −5 −3 8

Eigenvalues 4.15, 1.85, 0. (Can represent exactly in 2 dimensions.)

0.78

2 −0.43 −1.07

3

1.67

0.29

Squared distances between objects exactly reproduced.

1st eigenvector (−0.58, −0.21, 0.79), 2nd (0.58, −0.79, 0.21).

142

143

ptions ls=65 ps=40;

Metric MDS in practice In general with n objects, have n eigenvalues, coordinates in R n .

data cities; infile "cities.dat"; input name $ toronto hamilton ottawa barrie northbay sudbury kingston niagaraf london;

Choose coordinate directions with largest eigenvalues, as in principal components. If 2 dimensions represents data well, map in

proc mds level=absolute out=coords outres=dist pineigval; id name;

2 dimensions preserves distances. Do in SAS: use PROC IML to reproduce matrix stuff, PROC MDS to

proc print data=dist; var _row_ _col_ data distance residual;

use the canned procedure.

proc print data=coords;

PROC MDS code:

proc plot data=coords; plot dim2 * dim1 = ’*’ $ _name_;

144

145

First 2 eigenvalues much bigger than others – 2 dimensions good. Eigenvalues col0 row0 row1 row2 row3 row4 row5 row6 row7 row8

301792.5 232055.4 17473.68 5930.31 4319.64 -0.00 -1180.00 -7280.59 -23349.8

Compare observed and predicted distances (selected): OBS 24 28 31 34 35 36

_ROW_ NiagaraF NiagaraF London London London London

_COL_ Ottawa Kingston Ottawa Sudbury Kingston NiagaraF

D i m e n s i o n 2

148

DISTANCE 497.905 346.720 597.215 511.944 462.797 122.569

RESIDUAL 32.0955 43.2796 -27.2149 28.0561 -32.7968 67.4308

Largest residuals with Niagara Falls and London (seem most misplaced on map, most affected by Lake Ontario). SAS map needs rotation and reflection:

146

400 + | | | * KINGSTON OTTAWA * 200 + | | | * NIAGARAF 0 + HAMILTON * * TORONTO | * BARRIE | * LONDON | NORTHBAY * -200 + | * SUDBURY | | -400 + | --+--------+--------+--------+--------+--------+--------+-300 -200 -100 0 100 200 300

DATA 530 390 570 540 430 190

147

Notes about metric MDS

• If distances actually Euclidean distances calculated from data matrix X , then map obtained from metric MDS identical to plot of 1st 2 principal components.

• As check on appropriateness of model, can plot observed and

predicted distances. Should lie close to line through origin with slope 1. If not, then distances not Euclidean. Can fit models accounting for this by changing LEVEL=ABSOLUTE in SAS: – if line passes through origin with some other slope, use

LEVEL=RATIO to estimate slope. – if on line but not through origin, use LEVEL=INTERVAL to estimate intercept and slope. 149

Measuring dissimilarity

Non-metric MDS To do metric MDS, need to assume that distances meaningful as numbers. Reasonable for physical distances, maybe also for

By relaxing assumption of Euclidean distance, can define dissimilarity in different ways suitable to problem. Example: animals lion, giraffe, cow, sheep, human. How to describe

variables measured precisely. In many interesting problems, not willing to assume that distances

similarities, differences? Use criteria eg. as in table: Lion

Euclidean (even approximately) – indeed, quantity measured may be “dissimilarity” rather than distance. Might simply wish to create map in which map distances and dissimilarities in same order smallest to largest, without trying to match up numbers. To do this, need non-metric MDS.

150

Giraffe

Cow

Sheep

Human

Has tail

Y

Y

Y

Y

N

Wild animal

Y

Y

N

N

N

Has long neck

N

Y

N

N

N

Farm animal

N

N

Y

Y

N

Eats meat

Y

N

N

N

Y

Walks on 4 legs

Y

Y

Y

Y

N

151

Could measure dissimilarity for 2 animals:

• as fraction of different answers (“simple matching coeff”, below)

Another idea: pairwise comparisons. Example pages 218–219 of

• as fraction of differences ignoring N, N (“Jaccard coeff”)

text: compare 10 cars by taking all

Lion

10 2



= 45 pairs, then ranking

the pairs from 1 (most similar) to 45 (least similar). These kinds of dissimilarity ordinal: only important thing whether

0.33

Giraffe

0.50

0.50

Cow

0.50

0.50

0

Sheep

0.50

0.83

0.67

0.67

one dissimilarity bigger than another, not how much bigger. Want mapping method that only relies on order, not size of numbers. Human

152

153

How it works Animal pair

δij

dij

dˆij

Aim is for close correspondence between order of distances on map

cow-sheep

0.00

2.2

2.2

and dissimilarities in data. How to measure closeness?

lion-giraffe

0.33

3.6

3.4

giraffe-sheep

0.50

3.2

3.4

lion-cow

0.50

5.0

5.0

Use animal data as example; suppose current 2D map coords as shown:

lion-sheep

0.50

6.4

5.0

Lion

Giraffe

Cow

Sheep

Human

giraffe-cow

0.50

3.6

5.0

x

1

3

5

6

10

lion-human

0.50

10.3

6.5

y

2

5

8

6

7

sheep-human

0.67

4.1

6.5

cow-human

0.67

5.1

6.5

giraffe-human

0.83

7.3

7.3

Animal

Calculate distances dij from current map, list with ordered dissimilarities δij :

154

Pool-adjacent-violators and Stress

dˆij based on dij , but in same order as δij . Method: start with dˆij

= dij . Find first pair in wrong order; average.

Go back up list, keep averaging until order correct. Continue down

155

If dij already in order, stress 0; otherwise larger. Denominator reflects idea that multiplying coords by constant should not change fit. In example, stress is

p 26.16/309.16 = 0.29.

list until end.

Aim to find coords making stress minimum. Iterative process.

Here, combined 2nd and 3rd, then 5th and 6th, then 7th, 8th, 9th.

Mathematical idea: can find partial derivatives of stress wrt coords (“gradients”); search along gradient directions to find coords

Measure fit using stress:

reducing stress; repeat until stress cannot be decreased.

sP

(dij − dˆij )2 P 2 dij 156

Solution not unique; solution, and even its stress, may depend on starting coordinates.

157

Non-metric MDS in SAS Easy: change to LEVEL=ORDINAL (or omit). Animal data input: data animals; input name $ lion datalines; lion 0 . . . . giraffe .33 0 . . cow .50 .50 0 . . sheep .50 .50 0 0 human .50 .83 .67 ;

giraffe cow sheep human;

. . .67 0

D i m e n s i o n 2

proc mds out=coords outres=dist; proc plot data=coords; plot dim2 * dim1 = ’*’ $ _name_;

1 + * LION * GIRAFFE | | | | | | | | 0 + * HUMAN | | | | | | | | -1 + SHEEP * COW --+----------+----------+----------+----------+----------+--3 -2 -1 0 1 2

158

Plot seems to map data: cow and sheep very close, human distant. Stress value from table of iterative process: BadnessConvergence Measures of-Fit Change in --------------------Iteration Type Criterion Criterion Monotone Gradient --------------------------------------------------------------0 Initial 0.01057 . . . 1 Monotone 0.00229 0.00828 0.00863 0.87387 2 Gau-New 0.00111 0.00118 . 0.10054 3 Gau-New 0.00110 0.0000111 . 0.08851 4 Gau-New 0.00109 7.03144E-6 . 0.05315 5 Gau-New 0.00109 2.33664E-6 . 0.02708 6 Gau-New 0.00109 5.94251E-7 . 0.01320 7 Gau-New 0.00109 1.40532E-7 . 0.00637

is 0.00109 (very small).

160

159

Data look 2-dimensional, suggesting that fitting to 1D won’t work so well. In SAS code, add DIMENSION=1 to PROC MDS line, get stress of 16%: OBS 1 2 3 4 5 6 7 8 9 10

_ROW_

_COL_

DATA

DISTANCE

FITDATA

GIRAFFE COW COW SHEEP SHEEP SHEEP HUMAN HUMAN HUMAN HUMAN

LION LION GIRAFFE LION GIRAFFE COW LION GIRAFFE COW SHEEP

0.33 0.50 0.50 0.50 0.50 0.00 0.50 0.83 0.67 0.67

1.68076 0.86761 0.81315 0.86759 0.81317 0.00002 1.26343 2.94419 2.13104 2.13102

1.07627 1.07627 1.07627 1.07627 1.07627 0.00005 1.07627 3.02246 2.18548 2.18548

161

Individual differences scaling Often, especially in social science studies, have dissimilarity Agreement between map distance dij and fitted distance dˆij not

matrices from several different people.

very good, especially giraffe-lion.

Want to combine these to get overall picture, and also understand

Fitting a 3rd dimension reduces stress effectively to 0, but 2D stress already very small, so little gain for extra complication.

how individuals differ from overall picture. Individual differences scaling assumes that each individual’s distances accounted for by same dimensions, but with different

(As in principal components, stop adding dimensions when stress

weights.

effectively no longer getting smaller.) Example: breakfast foods, from text. 4 individuals, 15 breakfast foods, 15 × 15 matrix of dissimilarities from each person. Glue

matrices together rowwise, with something distinguishing individuals. 162

163

SAS code: In breakfast data set, first number in row is ID for individual. Read in in obvious way: data brkfst; infile "breakfast.txt"; input id toastpp buttoast mufmarg jellydo cintoast blbmufmg rollbut toastmar toastjel toastmg cinbun danish glazedon coffeeck crnmuf;

proc mds model=indscal out=results; matrix id; proc print; var id _type_ _name_ dim1 dim2; proc plot; plot dim2*dim1 $ _name_;

Variables denote kinds of breakfast food. Results data set: coordinates for foods, but also individuals.

164

165

OBS 1 2 3 4 5 ... 12 13 14 15 16 17 18 19 20

ID

_TYPE_ . . . . .

. . . . . 42 71 101 112

CRITERION CONFIG CONFIG CONFIG CONFIG CONFIG CONFIG CONFIG CONFIG CONFIG DIAGCOEF DIAGCOEF DIAGCOEF DIAGCOEF

_NAME_

TOASTPP BUTTOAST MUFMARG JELLYDO CINBUN DANISH GLAZEDON COFFEECK CRNMUF

DIM1

DIM2

0.18316 -0.61996 1.24559 1.31496 -1.45202

. 1.55139 0.68637 -0.66150 0.38822

-1.10600 -1.05318 -1.32599 -0.87477 0.34146 0.94939 1.34905 1.31443 1.04825

-0.91385 -0.46032 0.05092 -1.12777 -1.37079 1.04817 0.42434 0.52181 0.94930

166

D i m e n s i o n 2

| 2 + | | > TOASTPP | CINTOAST < > TOASTMAR 1 + TOASTJEL < > > | > BUTTOAST | > JELLYDO >2 TOASTMG | 0 + > GLAZEDON | | > DANISH | > MUFMARG -1 + > CINBUN BLBMUFMG > ROLLBUT | > COFFEECK ˆ > CRNMUF | | -2 + | --+-------------+-------------+-------------+-------------+-2 -1 0 1 2 168

Stress 18%, lower would be better. Coordinates for individuals describe how individuals weight dimensions. Indivs 42, 112 (female) weight 2 dimensions equally, but indivs 71, 101 (male) weight dimension 1 higher. Plot (next) suggests dimension 1 “sweetness”; dimension 2 could be “bread”. Ie. males here distinguish more in terms of sweetness than anything else, while females take broader view.

167

Suggest Documents