CS 277, Data Mining Exploratory Data Analysis

Author: Roderick Grant

Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine


Outline
Assignment 1: Questions?

Today’s Lecture: Exploratory Data Analysis
– Analyzing single variables
– Analyzing pairs of variables
– Higher-dimensional visualization techniques

Next Lecture: Clustering and Dimension Reduction
– Dimension reduction methods
– Clustering methods

Padhraic Smyth, UC Irvine: CS 277, Winter 2014


What are the main Data Mining Techniques?
• Descriptive Methods
– Exploratory Data Analysis, Visualization
– Dimension reduction (principal components, factor models, topic models)
– Clustering
– Pattern and Anomaly Detection
– …and more
• Predictive Modeling
– Classification
– Ranking
– Regression
– Matrix completion (recommender systems)
– …and more

The Typical Data Mining Process
• Problem Definition
– Defining the Goal
– Understanding the Problem Domain
• Data Definition
– Defining and Understanding Features
– Creating Training and Test Data
• Data Exploration
– Exploratory Data Analysis
• Data Mining
– Running Data Mining Algorithms
– Evaluating Results/Models
• Model Deployment
– System Implementation and Testing
– Evaluation “in the field”
• Model in Operations
– Model Monitoring
– Model Updating

Exploratory Data Analysis: Single Variables


Summary Statistics
– Mean: “center of data”
– Mode: location of highest data density
– Variance: “spread of data”
– Skew: indication of non-symmetry
– Range: max - min
– Median: 50% of values below, 50% above
– Quantiles: e.g., values such that 25%, 50%, 75% are smaller

Note that some of these statistics can be misleading
E.g., mean for data with 2 clusters may be in a region with zero data
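A minimal Python sketch of the warning above. The data values are made up for illustration (two tight clusters near 0 and near 10) and are not from the lecture. The mean and median both land in a region that contains no data at all.

```python
import statistics

# Hypothetical data: two tight clusters, one near 0 and one near 10.
data = [0.1, 0.2, 0.0, -0.1, -0.2, 9.8, 9.9, 10.0, 10.1, 10.2]

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.pvariance(data)          # population variance ("spread")
data_range = max(data) - min(data)             # range: max - min
q1, q2, q3 = statistics.quantiles(data, n=4)   # 25%, 50%, 75% quantiles

# How many points actually lie within 2 units of the mean?
near_mean = sum(1 for x in data if abs(x - mean) < 2.0)

print("mean:", mean, "median:", median)
print("variance:", variance, "range:", data_range)
print("quartiles:", q1, q2, q3)
print("points within 2 units of the mean:", near_mean)
```

Plotting the data (e.g., as a histogram) would reveal the two clusters immediately, which none of the single-number summaries does on its own.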


Histogram of Unimodal Data
1000 data points simulated from a Normal distribution, mean 10, variance 1, 30 bins
[Figure: histogram of the simulated data]

Histograms: Unimodal Data
100 data points from a Normal, mean 10, variance 1, with 5, 10, 30 bins
[Figure: three histograms of the same 100 data points, using 5, 10, and 30 bins]
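The effect of the bin count can be reproduced with a short sketch. The `histogram` helper, the seed, and the equal-width binning rule are assumptions made for illustration; the lecture does not say how its figures were produced.

```python
import random

def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

random.seed(0)  # arbitrary seed, for repeatability only
sample = [random.gauss(10.0, 1.0) for _ in range(100)]  # mean 10, variance 1

for n_bins in (5, 10, 30):
    print(n_bins, "bins:", histogram(sample, n_bins))
```

With only 100 points, the 30-bin version tends to look ragged, while the 5-bin version hides most of the shape.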

Histogram of Multimodal Data
15000 data points simulated from a mixture of 3 Normal distributions, 300 bins
[Figures: histogram of the simulated data, followed by several histogram panels of the same data]

Skewed Data
5000 data points simulated from an exponential distribution, 100 bins
[Figure: histogram of the simulated data]

Another Skewed Data Set
10000 data points simulated from a mixture of 2 exponentials, 100 bins
[Figure: histogram of the simulated data]

Same Skewed Data after taking Logs (base 10)
10000 data points simulated from a mixture of 2 exponentials, 100 bins
[Figure: histogram of the log-transformed data]
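A hedged sketch of the log transform on skewed data. The mixture used here (half the points from an exponential with mean 1, half from an exponential with mean 20) and the seed are assumptions; the slides do not give the parameters of their mixture.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed

# Assumed mixture: Exponential with mean 1 or mean 20, chosen with equal probability.
raw = [
    random.expovariate(1.0) if random.random() < 0.5 else random.expovariate(1.0 / 20.0)
    for _ in range(10000)
]

# Log base 10, guarding against a (practically impossible) exact zero.
logged = [math.log10(x) for x in raw if x > 0]

mean_raw, median_raw = statistics.mean(raw), statistics.median(raw)
mean_log, median_log = statistics.mean(logged), statistics.median(logged)

print("raw:    mean = %.2f, median = %.2f" % (mean_raw, median_raw))
print("log10:  mean = %.2f, median = %.2f" % (mean_log, median_log))
```

On the raw scale the long right tail pulls the mean far above the median; after taking logs the two are much closer, and the two mixture components are easier to tell apart in a histogram.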

What will the mean or median tell us about this data?

[Figure: histogram]

Issues with Histograms
• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• Can smooth histogram using a variety of techniques
– E.g., kernel density estimation, which avoids bins
– but requires some notion of “scale”
• Histograms effectively only work with 1 variable at a time
– Difficult to extend to 2 dimensions, not possible for >2
– So histograms tell us nothing about the relationships among variables
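The kernel density estimation idea mentioned above can be sketched in a few lines. This version uses a Gaussian kernel, which is one common choice and an assumption here; the data values and the two bandwidths are also illustrative. The bandwidth plays the role of the “scale”.

```python
import math

def gaussian_kde(x, data, bandwidth):
    """Kernel density estimate at x: average of Gaussian bumps centered on the data."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

# Hypothetical small sample.
data = [8.1, 9.0, 9.4, 9.8, 10.0, 10.1, 10.5, 11.2, 12.0]

step = 0.01
grid = [i * step for i in range(2001)]                # evaluation points from 0 to 20
smooth = [gaussian_kde(x, data, 1.0) for x in grid]   # large scale: one smooth bump
bumpy = [gaussian_kde(x, data, 0.1) for x in grid]    # small scale: a spike per point

# Both estimates are densities, so each should integrate to about 1.
area_smooth = sum(smooth) * step
area_bumpy = sum(bumpy) * step
print("areas:", round(area_smooth, 4), round(area_bumpy, 4))
print("peak heights:", round(max(smooth), 3), round(max(bumpy), 3))
```

No bin boundaries are involved, but the choice of bandwidth changes the picture just as the choice of bin count does for a histogram.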


US Zipcode Data: Population by Zipcode
[Figure: three histograms of population by Zipcode, with panels labeled “K = 50”, “K = 500”, and “K = 50” (the last restricted to populations from 0 to 5000)]

Histogram with Outliers
Pima Indians Diabetes Data, From UC Irvine Machine Learning Repository
[Figure: histogram of Diastolic Blood Pressure (x-axis) against Number of Individuals (y-axis); annotation: “blood pressure = 0 ?”]

Box Plots: Pima Indians Diabetes Data
Two side-by-side box-plots of individuals from the Pima Indians Diabetes Data Set
[Figure: box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals]

Reading a box plot:
– Box = middle 50% of data, from Q1 to Q3
– Q2 (median) is marked inside the box
– Upper and lower whiskers extend up to 1.5 x (Q3 - Q1) beyond the box
– All data points outside the whiskers are plotted individually
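A small sketch of the box plot quantities, assuming the common convention that the whisker limits sit 1.5 x (Q3 - Q1) beyond the box. The values are hypothetical, not taken from the Pima data, and the "inclusive" quartile method is just one of several conventions in use.

```python
import statistics

# Hypothetical measurements, including a suspicious 0.0 and one very large value.
values = [0.0, 22.1, 24.5, 26.0, 27.3, 29.8, 31.2, 33.5, 35.0, 38.4, 67.1]

q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                          # height of the box (middle 50% of data)
lower_limit = q1 - 1.5 * iqr           # whiskers may not extend below this
upper_limit = q3 + 1.5 * iqr           # ... or above this

# Points beyond the whisker limits are the ones plotted individually.
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print("Q1, median, Q3:", q1, q2, q3)
print("whisker limits:", lower_limit, upper_limit)
print("plotted individually:", outliers)
```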

Box Plots: Pima Indians Diabetes Data
[Figure: side-by-side box plots (healthy versus diabetic) for Diastolic Blood Pressure, 2-hour Serum Insulin, Plasma Glucose Concentration, and Body Mass Index]

Exploratory Data Analysis

Analyzing more than 1 variable at a time…


Relationships between Pairs of Variables
• Say we have a variable Y we want to predict and many variables X that we could use to predict Y
• In exploratory data analysis we may be interested in quickly finding out if a particular X variable is potentially useful at predicting Y
• Options?
– Linear correlation
– Scatter plot: plot Y values versus X values

Linear Dependence between Pairs of Variables
• Covariance and correlation measure linear dependence
• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \big(x(i) - \bar{x}\big)\big(y(i) - \bar{y}\big)$$

• The covariance is a measure of how X and Y vary together.
– it will be large and positive if large values of X are associated with large values of Y and small X → small Y
• (Linear) Correlation = scaled covariance, varies between -1 and 1

$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \big(x(i) - \bar{x}\big)\big(y(i) - \bar{y}\big)}{\left( \sum_{i=1}^{n} \big(x(i) - \bar{x}\big)^2 \, \sum_{i=1}^{n} \big(y(i) - \bar{y}\big)^2 \right)^{1/2}}$$

Correlation coefficient
• Covariance depends on ranges of X and Y
• Standardize by dividing by the standard deviations of X and Y
• Linear correlation coefficient is defined as:

$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \big(x(i) - \bar{x}\big)\big(y(i) - \bar{y}\big)}{\left( \sum_{i=1}^{n} \big(x(i) - \bar{x}\big)^2 \, \sum_{i=1}^{n} \big(y(i) - \bar{y}\big)^2 \right)^{1/2}}$$
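The sample covariance and linear correlation coefficient defined in this part of the lecture translate directly into code. The function names and the example data below are illustrative choices. The 1/n factors cancel in the correlation, so dividing the covariance by the square root of the two variances gives the same value as the ratio of sums.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the 1/n normalization."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Linear correlation: covariance scaled to lie between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
doubled = [2.0 * x for x in xs]        # exact linear dependence on xs

centered = [-2.0, -1.0, 0.0, 1.0, 2.0]
squared = [x * x for x in centered]    # exact, but non-linear, dependence

print("cov(xs, doubled)       =", covariance(xs, doubled))
print("corr(xs, doubled)      =", correlation(xs, doubled))
print("corr(centered, squared) =", correlation(centered, squared))
```

The last line shows a variable that is completely determined by another and yet has zero linear correlation with it.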

Data Set on Housing Prices in Boston (widely used data set in research on regression models)

No.  Variable  Description
1    CRIM      per capita crime rate by town
2    ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
3    INDUS     proportion of non-retail business acres per town
4    NOX       Nitrogen oxide concentration (parts per 10 million)
5    RM        average number of rooms per dwelling
6    AGE       proportion of owner-occupied units built prior to 1940
7    DIS       weighted distances to five Boston employment centres
8    RAD       index of accessibility to radial highways
9    TAX       full-value property-tax rate per $10,000
10   PTRATIO   pupil-teacher ratio by town
11   MEDV      Median value of owner-occupied homes in $1000's

Matrix of Pairwise Linear Correlations
Data on characteristics of Boston housing
[Figure: matrix of pairwise linear correlations, color scale from -1 through 0 to +1. Variables: Crime Rate, Percentage of large residential lots, Industry, Nitrogen oxide, Average # rooms, Proportion of old houses, Distance to employment centers, Highway accessibility, Property tax rate, Student-teacher ratio, Median house value]

Dangers of searching for correlations in high-dimensional data
Simulated 50 random Gaussian/normal data vectors, each with 100 variables
Results in a 50 x 100 data matrix
[Figure: histogram of the correlation coefficients for all (100 choose 2) = 4950 pairs of variables]
Even if data are entirely random (no dependence) there is a very high probability some variables will appear dependent just by chance.
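The simulation described on this slide can be sketched as follows. The seed and the helper name `standardize` are assumptions; the exact numbers will differ from the lecture's figure, but the qualitative result should not.

```python
import math
import random

random.seed(0)  # arbitrary seed
n_rows, n_vars = 50, 100

# 50 x 100 matrix of independent standard normal values (no real dependence).
data = [[random.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_rows)]

def standardize(column):
    """Center a column and scale it to unit length, so dot products are correlations."""
    mean = sum(column) / len(column)
    centered = [v - mean for v in column]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

columns = [standardize([row[j] for row in data]) for j in range(n_vars)]

# Correlation coefficient for every pair of variables: 100 choose 2 = 4950 pairs.
correlations = [
    sum(a * b for a, b in zip(columns[i], columns[j]))
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
]

largest = max(abs(r) for r in correlations)
print("number of pairs:", len(correlations))
print("largest |correlation| among purely random variables: %.2f" % largest)
```

With only 50 rows, some of the 4950 pairs show correlations that would look meaningful if inspected in isolation.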


Examples of X-Y plots and linear correlation values

[Figures: example X-Y plots with their linear correlation values, including panels labeled “Linear Dependence” and “Non-Linear Dependence”]

Lack of linear correlation does not imply lack of dependence

[Figure: four scatter plots of Y values versus X values, labeled Data Set 1, Data Set 2, Data Set 3, and Data Set 4]

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, 27(1), pp. 17-21.

Guess the Linear Correlation Values for each Data Set
[Figure: the same four scatter plots, Data Set 1 to Data Set 4]

Actual Correlation Values
[Figure: the same four scatter plots, each annotated “Correlation = 0.82”]

Summary Statistics for each Data Set

              N    Mean of X   Mean of Y   Intercept   Slope   Correlation
Data Set 1    11   9.0         7.5         3           0.5     0.82
Data Set 2    11   9.0         7.5         3           0.5     0.82
Data Set 3    11   9.0         7.5         3           0.5     0.82
Data Set 4    11   9.0         7.5         3           0.5     0.82
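The near-identical summary statistics can be checked directly. The values below are the four data sets as commonly reproduced from Anscombe (1973); the helper `fit_and_correlate` is an illustrative name, and the fit is ordinary least squares.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared X for data sets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def fit_and_correlate(xs, ys):
    """Return (slope, intercept, correlation) for a least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept, sxy / math.sqrt(sxx * syy)

results = [fit_and_correlate(xs, ys)
           for xs, ys in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]

for k, (slope, intercept, corr) in enumerate(results, start=1):
    print("Data Set %d: slope = %.2f, intercept = %.2f, correlation = %.2f"
          % (k, slope, intercept, corr))
```

The numbers agree to two decimal places even though the four scatter plots look nothing alike.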

Conclusions so far?
• Summary statistics are useful… up to a point
• Linear correlation measures can be misleading
• There really is no substitute for plotting/visualizing the data

Scatter Plots
• Plot the value of one variable against the other
• Simple…but can be very informative, can reveal more than summary statistics
• For example, we can…
– See if variables are dependent on each other (beyond linear dependence)
– Detect if outliers are present
– Can color-code to overlay group information (e.g., color points by class label for classification problems)

[Figure: scatter plot of MEDIAN HOUSEHOLD INCOME versus MEDIAN PERCAPITA INCOME, units = dollars (from US Zip code data: each point = 1 Zip code)]

Constant Variance versus Changing Variance
[Figure: two scatter plots]
– Constant variance: variation in Y does not depend on X
– Changing variance: variation in Y changes with the value of X, e.g., Y = annual tax paid, X = income

Problems with Scatter Plots of Large Data
[Figure: scatter plot of 96,000 bank loan applicants]
– appears: later apps older; reality: downward slope (more apps, more variance)
– scatter plot degrades into black smudge ...

Contour Plots (based on local density) can help
recall: (same 96,000 bank loan apps as before)
[Figure: contour plot with annotations “unimodal”, “skewed ↑”, and “skewed →”]
shows variance(y) ↑ with x ↑ is indeed due to horizontal skew in density

Scatter-Plot Matrices
Pima Indians Diabetes data
[Figure: scatter-plot matrix]

Another Scatter-Plot Matrix

For interactive visualization the concept of “linked plots” is generally useful, i.e., clicking on 1 or more points in 1 window and having these same points highlighted in other windows


Using Color to Show Group Information in Scatter Plots
Iris classification data set, 3 classes
Figure from www.originlab.com

Another Example with Grouping by Color
Figure from hci.stanford.edu

Outlier Detection
• Definition of an outlier?
– No precise definition
– Generally… “A data point that is significantly different from the rest of the data”
– But how do we define “significantly different”? (many answers to this…)
– Typically assumed to mean that the point was measured in error, or is not a true measurement in some sense

[Figures: “Outliers in 1 dimension”; “Outlier in 2 dimensions” (Y values versus X values)]

Example 1: The Effect of Outliers on Regression
[Figure: Data Set 3 (Y values versus X values), with two panels: “Least Squares Fit with the Outlier” and “Least Squares Fit without the Outlier”]

Example 2: Least Squares Fitting is Sensitive to Outliers
[Figure: least squares fit to data containing one outlier; annotations: “16² cost for this one datum” and “Heavy penalty for large errors”]

Slide courtesy of Alex Ihler

More Robust Cost Functions for Training Regression Models
– Mean squared error (MSE)
– Mean absolute error (MAE)
– Something else entirely, e.g., the cost shown as the blue line in the figure

Slide courtesy of Alex Ihler

L1 is more Robust to Outliers than L2
[Figure: four fitted lines: L2 on the original data, L1 on the original data, L2 on the data with an outlier, L1 on the data with an outlier]

Slide courtesy of Alex Ihler
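A deliberately simplified sketch of why L1 is more robust than L2. Instead of fitting a regression line as in the slides, it fits a single constant c to made-up values, using the L2 cost (sum of squared errors) and the L1 cost (sum of absolute errors), and finds each minimizer by brute-force search over a grid. The data, the grid, and the constant model are all assumptions for illustration.

```python
import statistics

def l2_cost(c, values):
    """Sum of squared errors: large errors are penalized heavily."""
    return sum((v - c) ** 2 for v in values)

def l1_cost(c, values):
    """Sum of absolute errors: penalty grows only linearly."""
    return sum(abs(v - c) for v in values)

clean = [4.8, 4.9, 5.0, 5.1, 5.2]
with_outlier = clean + [50.0]          # one wildly wrong measurement

# Brute-force search over candidate constants 0.00, 0.01, ..., 60.00.
candidates = [i * 0.01 for i in range(6001)]
best_l2 = min(candidates, key=lambda c: l2_cost(c, with_outlier))
best_l1 = min(candidates, key=lambda c: l1_cost(c, with_outlier))

print("L2 minimizer: %.2f (mean   = %.2f)" % (best_l2, statistics.mean(with_outlier)))
print("L1 minimizer: %.2f (median = %.2f)" % (best_l1, statistics.median(with_outlier)))
```

The L2 minimizer is the mean and is dragged far from the bulk of the data by the single outlier; the L1 minimizer is a median and barely moves.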

Detection of Outliers in Multiple Dimensions
[Figure: scatter plot of Y values versus X values with one blue point lying away from the main trend]
• Detecting “multi-dimensional outliers” is generally difficult
• In the example above, the blue point will not look like an outlier if we were to plot 1-dimensional histograms of Y or X – it only stands out in the 2d plot
• Now consider the same situation but in 3 or more dimensions
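One way to make this concrete is the squared Mahalanobis distance, which takes the correlation between X and Y into account. This is a swapped-in illustration rather than a method named in the lecture, and the points below are made up: nine points close to the line y = x plus one suspect point at (3, 7).

```python
import math

# Hypothetical data; the last point is the suspected 2-d outlier.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 4.2), (5.0, 4.8),
          (6.0, 6.1), (7.0, 7.0), (8.0, 7.9), (9.0, 9.1),
          (3.0, 7.0)]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
sxx = sum((x - mx) ** 2 for x, _ in points) / n            # variance of X
syy = sum((y - my) ** 2 for _, y in points) / n            # variance of Y
sxy = sum((x - mx) * (y - my) for x, y in points) / n      # covariance
det = sxx * syy - sxy ** 2                                 # determinant of 2x2 covariance

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance from the mean, using the inverse 2x2 covariance."""
    dx, dy = x - mx, y - my
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

sx_out, sy_out = points[-1]
z_x = abs(sx_out - mx) / math.sqrt(sxx)   # 1-d view of the suspect point along X
z_y = abs(sy_out - my) / math.sqrt(syy)   # 1-d view of the suspect point along Y
distances = [mahalanobis_sq(x, y) for x, y in points]

print("suspect point z-scores in 1-d: %.2f (X), %.2f (Y)" % (z_x, z_y))
print("squared Mahalanobis distances:", [round(d, 2) for d in distances])
```

In each single dimension the suspect point is unremarkable, yet its 2-d distance is by far the largest. As the slides caution, this kind of check becomes less reliable in high dimensions or when several outliers distort the covariance estimate.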


Some Advice on Outliers
• Use visualization (e.g., in 1d, 2d) to spot obvious outliers
• Use domain knowledge and known constraints
– E.g., Age in years should be between 0 and 120
• Use the model itself to help detect outliers
– E.g., in regression, data points with errors much larger than the others may be outliers
• Use robust techniques that are not overly sensitive to outliers
– E.g., median is more robust than the mean, L1 is more robust than L2, etc
• Automated outlier detection algorithms? …not always useful
– E.g., fit probability density model to N-1 points and determine how likely the Nth point is
– May not work well in high dimensions and/or if there are multiple outliers
• In general: for large data sets you can probably assume that outliers are present and proceed with caution…

Multivariate Visualization
• Multivariate -> multiple variables
• 2 variables: scatter plots, etc
• 3 variables:
– 3-dimensional plots
– Look impressive, but often not that useful
– Can be cognitively challenging to interpret
– Alternatives: overlay color-coding (e.g., categorical data) on 2d scatter plot
• 4 variables:
– 3d with color or time
– Can be effective in certain situations, but tricky
• Higher dimensions
– Generally difficult
– Scatter plots, icon plots, parallel coordinates: all have weaknesses
– Alternative: “map” data to lower dimensions, e.g., PCA or multidimensional scaling
– Main problem: high-dimensional structure may not be apparent in low-dimensional views

Using Icons to Encode Information, e.g., Star Plots
Each star represents a single observation. Star plots are used to examine the relative values for a single data point.
The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
Useful for small data sets with up to 10 or so variables.

Variables in the example:
1 Price
2 Mileage (MPG)
3 1978 Repair Record (1 = Worst, 5 = Best)
4 1977 Repair Record (1 = Worst, 5 = Best)
5 Headroom
6 Rear Seat Room
7 Trunk Space
8 Weight
9 Length

Limitations?
– Small data sets, small dimensions
– Ordering of variables may affect perception

Another Example of Icon Plots
Figure from statsoft.com

Combining Scatter Plots and Icon Plots
Figure from statsoft.com

Chernoff Faces
• Variable values associated with facial characteristic parameters, e.g., head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening
• Limitations?
– Only up to 10 or so dimensions
– Overemphasizes certain variables because of our perceptual biases

Parallel Coordinates Method
Epileptic Seizure Data
[Figure: parallel coordinates plot in which 1 (of n) cases is “brushed”, drawn with a darker line to stand out from the n-1 other cases]
Interactive “brushing” is useful for seeing distinctions

More elaborate parallel coordinates example (from E. Wegman, 1999).
12,000 bank customers with 8 variables
Additional “dependent” variable is profit (green for positive, red for negative)

Interactive “Grand Tour” Techniques
• “Grand Tour” idea
– Cycle continuously through multiple projections of the data
– Cycles through all possible projections (depending on time constraints)
– Projections can be 1, 2, or 3d typically (often 2d)
– Can link with scatter plot matrices (see following example)
– Asimov (1985)
• Example on following 2 slides
– 7 dimensional physics data, color-coded by group, shown with (a) Standard scatter matrix (b) 2 static snapshots of grand tour

Visualization of an email network using 2-dimensional graph drawing or “embedding”

Data from 500 researchers at Hewlett-Packard over approximately 1 year.
Various structural elements of the network are apparent.

Exploratory Data Analysis

Visualizing Time-Series Data


Time-Series Data: Example 1
Historical data on millions of miles flown by UK airline passengers
…note a number of different systematic effects
[Figure: time-series plot with the following annotations]
– steady growth trend
– Summer peaks
– Summer “double peaks” (favor early or late)
– New Year bumps

Time-Series Data: Example 2
Data from study on weight measurements over time of children in Scotland

Experimental Study: More milk -> better health?
20,000 children: 5k raw milk, 5k pasteurized milk, 10k control (no supplement)

[Figure: mean weight vs mean age for 10k control group]

Would expect smooth weight growth plot. Plot shows an unexpected pattern (steps), not apparent from raw data table.

Why do the children appear to grow in spurts?

Time-Series Data: Example 3 (Google Trends)
Search Query = whiskey

Time-Series Data: Example 4 (Google Trends)
Search Query = NSA

Spatial Distribution of the Same Data (Google Trends)
Search Query = whiskey

Summary on Exploration/Visualization
• Always useful and worthwhile to visualize data
– human visual system is excellent at pattern recognition
– gives us a general idea of how data is distributed, e.g., extreme skew
– detect “obvious outliers” and errors in the data
– gain a general understanding of low-dimensional properties
• Many different visualization techniques
• Limitations
– generally only useful up to 3 or 4 dimensions
– massive data: only so many pixels on a screen - but subsampling is useful