Introduction to Data Mining (in Astronomy) ADASS 2007 Tutorial Sabine McConnell Department of Computer Science/Studies Trent University

Outline
• Introduction
• The Data
• Classification
• Clustering
• Evaluation of Results
• Increasing the Accuracy
• Some Issues and Concerns
• Weka
• References

What is data mining?
• "The non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data." (Piatetsky-Shapiro)
• "The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." (Hand)

Data mining is a combination of:
• machine learning
• statistics
• databases
• visualization
• application domain


(Some) Applications of data-mining techniques
• Science: bioinformatics, drug discovery, astronomy
• Government: law enforcement, income tax, anti-terror
• Business: market-basket analysis, targeted marketing
• Engineering: satellite navigation

Data mining in astronomy
• classification of stars, galaxies, and planetary nebulae, based on both images and spectral parameters
• star/galaxy separation
• forecasting of sunspots and of geomagnetic storms from the solar wind
• forecasting of seeing
• gravitational-wave signal detection
• antimatter search in cosmic rays
• selection of quasar candidates
• detection of expanding HI shells
• ....many more

The data-mining process
(Knowledge discovery in databases)

View of the Dataset (= Matrix)

object ID   _RAJ2000    _DEJ2000   distance  flags  x size  y size  U-B   error   Bar?  class
134633      00 03 09.1  +21 57 34  398       A      1629    1654    14.4  low     no    Irr
3555432     00 03 48.8  +07 28 45  113       D      939     1332    14    medium  yes   Spiral
3432223     00 03 58.6  +20 45 07  835       A      1713    2219    12.7  low     no    Ell
124123      00 05 53.0  +22 32 14  398       A      1092    1400    0     low     no    Irr
333456      00 06 21.4  +17 26 03  398       A      1121    1419    15.1  low     no    Irr
3355478     00 07 16.7  +27 42 31  398       A      1343    1810    13.4  high    no    Spiral
875         00 07 16.1  +08 18 03  879       A      1095    1281    14.6  medium  yes   Spiral
33378       00 08 10.7  +27 00 15  578       A      1154    1493    14.4  high    no    Irr
569433      00 08 20.5  +40 37 54  398       A      1661    1683    0     low     no    Irr
3321347     00 09 54.3  +25 55 28  778       A      1961    2180    12.5  low     no    Spiral
5464648     00 10 47.7  +33 21 18  79        B      929     1359    13.5  high    no    Ell
454345476   00 12 49.9  +77 47 44  398       A      1393    1671    0     low     no    Irr
4646788     00 13 27.5  +17 29 16  398       A      1141    1573    14.2  medium  yes   Spiral

• levels of measurement (nominal, ordinal, interval, ratio)
• numeric vs. categorical


Data Preparation

Preprocessing and Algorithms
• Neural networks like data to be scaled
• Decision trees do not care about scaling, but work better with discrete attributes that have small numbers of possible values
• Neural networks can handle irrelevant or redundant attributes, while these may lead to large decision trees
• Neural networks do not like noisy data, especially for small datasets, while decision trees do not care much about noise
• Nearest-neighbour approaches can handle noise if a certain parameter is adjusted
• Distance-based approaches do not work well if the attributes are not equally weighted, and typically work with numerical data only
• Expectation-Maximization approaches can deal with missing data, but k-means techniques require substitution of missing values
• ...

A Comparison of Neural Network Algorithms and Preprocessing Methods for Star-Galaxy Discrimination, D. Bazell and Y. Peng, Astrophysical Journal Supplement Series 116:47-55, May 1998

Data Preparation Issues
• transformation of attribute types
• selection of attributes
• transformation of attributes
• normalization of attribute values
• sampling
• missing values

Data preparation: transformation of attribute types • categorical to numeric • numeric to categorical


Transformation: categorical to numeric
• map to circle, sphere, or hypersphere
  – may work if categories are ordinal (e.g. days of the week)
  – usually produces poor results otherwise

• map to generalized tetrahedron
  – to uniquely represent k possible attribute values, we need k new attributes
  – example: an attribute with three possible values (e.g. circle, square, triangle) maps to three new attributes, with the values (1,0,0) for circle, (0,1,0) for square, and (0,0,1) for triangle
  – works for both ordinal and nominal data (a minimal sketch follows below)
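For illustration, a minimal Python sketch of this 1-of-k mapping; the attribute values (circle, square, triangle) are simply the example above, and the function name is made up.

def one_of_k_encode(values):
    """Map each categorical value to a unit vector with k components."""
    categories = sorted(set(values))                  # fix an ordering of the k categories
    index = {cat: i for i, cat in enumerate(categories)}
    k = len(categories)
    encoded = []
    for v in values:
        vec = [0] * k
        vec[index[v]] = 1                             # exactly one component is 1
        encoded.append(vec)
    return categories, encoded

shapes = ["circle", "square", "triangle", "circle"]
cats, enc = one_of_k_encode(shapes)
print(cats)   # ['circle', 'square', 'triangle']
print(enc)    # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]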

Data preparation: attribute selection
• Remove irrelevant or redundant attributes to reduce the dimensionality of the dataset
• Preserve the probability distribution of the classes present in the data as much as possible
  – forward selection: start with the empty set, add attributes one at a time
  – backward elimination: start with the full set, remove attributes one at a time
  – reduce search time by combining the two methods
  – alternative: use the attributes in the upper levels of a decision tree, provided there are class labels in the data
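A possible sketch of the greedy forward-selection search in Python; the `evaluate` scoring function is a hypothetical stand-in for whatever criterion is used (cross-validated accuracy of a classifier in a wrapper setting, or a class-separability measure in a filter setting).

def forward_selection(attributes, evaluate, max_attributes=None):
    """Start with the empty set and repeatedly add the attribute that most
    improves the score returned by `evaluate`."""
    selected, remaining = [], list(attributes)
    best_score = float("-inf")
    while remaining and (max_attributes is None or len(selected) < max_attributes):
        score, attr = max(((evaluate(selected + [a]), a) for a in remaining),
                          key=lambda pair: pair[0])
        if score <= best_score:          # no remaining attribute improves the score
            break
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected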

Transformation: numeric to categorical
• some data-mining algorithms require the data to be categorical
• we may have to transform continuous attributes into categorical attributes: discretization
• or transform continuous and discrete data into binary data: binarization
• we also have to distinguish between unsupervised (no use of class information) and supervised (use of class information) discretization methods
  – unsupervised: equal-width or equal-frequency binning, k-means, visual inspection
  – supervised: use some measure of impurity of the bins
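As an illustration, a minimal equal-width binning sketch in Python/numpy; the U-B values and the choice of three bins are made up for the example.

import numpy as np

def equal_width_bins(values, n_bins):
    """Unsupervised discretization: split the value range into n_bins
    intervals of equal width and return the bin label of each value."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1])        # labels 0 .. n_bins-1

u_minus_b = [14.4, 14.0, 12.7, 0.0, 15.1, 13.4]    # hypothetical U-B values
print(equal_width_bins(u_minus_b, 3))              # -> [2 2 2 0 2 2]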

Data preparation: transformation of attributes
• Two popular methods:
  – wavelet transforms
  – principal component analysis
• express the data in terms of new attributes
• reduce the number of attributes by truncating
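A minimal principal-component-analysis sketch using numpy's SVD; the example matrix and the choice of two retained components are illustrative assumptions.

import numpy as np

def pca_reduce(X, n_components):
    """Project each row of X onto the first n_components principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # centre each attribute
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                  # truncated representation

# e.g. keep two new attributes derived from the columns (x size, y size, U-B):
X = np.array([[1629, 1654, 14.4],
              [939, 1332, 14.0],
              [1713, 2219, 12.7],
              [1092, 1400, 0.0]])
print(pca_reduce(X, 2).shape)                        # (4, 2)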


Data preparation: normalization
• min-max normalization
• z-score normalization (standardization)
• normalization by decimal scaling
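Minimal Python/numpy sketches of the three normalizations; the distance values below reuse the example dataset shown earlier.

import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Standardize values to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    """Divide by the smallest power of ten that brings all |values| below 1."""
    x = np.asarray(x, dtype=float)
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / 10 ** j

distances = [398, 113, 835, 398, 879]   # the "distance" column of the example dataset
print(min_max(distances))
print(z_score(distances))
print(decimal_scaling(distances))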

Data preparation: sampling
• Reduce the number of objects (rows) in the dataset
  – simple random sample without replacement
  – simple random sample with replacement
  – cluster sample
  – stratified sample: preserves the original distribution of the classes
  – undersampling/oversampling: produce equal distributions of the classes
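A rough Python sketch of simple random and stratified sampling; the 2/3 fraction and the object/class lists in the usage comment are assumptions for illustration.

import random

def simple_random_sample(data, n, with_replacement=False):
    """Draw n objects from the dataset, with or without replacement."""
    if with_replacement:
        return [random.choice(data) for _ in range(n)]
    return random.sample(data, n)

def stratified_sample(data, labels, fraction):
    """Draw the same fraction from every class, preserving the class distribution."""
    by_class = {}
    for obj, cls in zip(data, labels):
        by_class.setdefault(cls, []).append(obj)
    sample = []
    for cls, members in by_class.items():
        k = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, k))
    return sample

# e.g. keep roughly 2/3 of the objects from each class (Irr, Spiral, Ell):
# subset = stratified_sample(objects, classes, 2 / 3)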

Data preparation: missing values
• data may be missing completely at random, missing at random, or not missing at random (censored)
• depending on why the data are missing, we can use:
  – casewise data deletion
  – mean substitution
  – regression
  – hot-deck methods
  – maximum-likelihood methods
  – multiple imputation
  – ...
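A minimal sketch of mean substitution for a numeric attribute, assuming missing entries are encoded as NaN; the values are made up for the example.

import numpy as np

def mean_substitution(x):
    """Replace missing values (NaN) in a numeric attribute by the attribute mean."""
    x = np.asarray(x, dtype=float)
    mean = np.nanmean(x)                     # mean over the observed values only
    return np.where(np.isnan(x), mean, x)

u_minus_b = np.array([14.4, np.nan, 12.7, np.nan, 15.1])
print(mean_substitution(u_minus_b))          # NaNs replaced by the observed mean, ~14.07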


Data-mining categories

Building the model
• Classification
• Clustering
• Visualization
• Association Rule Mining
• Summarization
• Outlier detection
• Deviation detection
• ...

Models vs. Patterns
• Models:
  – large-scale description of the data
  – describe/predict/summarize the most common cases
• Patterns:
  – small scale
  – local models
  – association rules, outliers
  – often the most interesting objects

Predictive vs. Descriptive Techniques
• Data-mining techniques can be either
  – predictive (supervised)
  – descriptive (unsupervised)
• predictive: predict a (discrete) class attribute based on the other attribute values; this is like learning from a teacher → classification
• descriptive: discover the structure of the data without prior knowledge of class labels → clustering
• evolving area: semi-supervised learning (combines predictive and descriptive methods)


example: Automated morphological classification of APM galaxies by supervised artificial neural networks, Naim et al., MNRAS 275, 567-590 (1995)
• 830 galaxy images (diameter limited) from the APM Equatorial Catalogue of Galaxies
• 24 input parameters, including ellipticity, surface brightness, bulge size, arm number, length, and intensity
• output: Revised Hubble Type of the galaxy
• galaxies classified by 6 human experts according to the Revised Hubble System, and by a supervised neural network
• result: the rms error of the network classifications with respect to the mean expert classifications (1.8 Revised Hubble Types) is comparable to the rms dispersion between the experts

Predictive data mining: classification
(Learn a model to predict future target variables)
• Given a set of points from known classes, what is the class of a new point? Is the new point a star or a galaxy?
[figure: scatter plot of galaxies and stars, with a new, unclassified point]

Predictive data mining: classification (Decision Trees)

Decision Trees: choosing a splitting criterion

[figure: 2-D dataset split at y = 2, x = 2, x = 4, and x = 5, with the corresponding decision tree of tests y > 2?, x > 5?, x > 4?, x > 2?]

The tree corresponds to the rules:

if y > 2 then
    if x > 5 then blue
    else if x > 4 then red
    else blue
else
    if x > 2 then red
    else blue

Impurity measures for a node t in a problem with c classes:

entropy(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t)

gini(t) = 1 - \sum_{i=0}^{c-1} [p(i|t)]^2

classification_error(t) = 1 - \max_i p(i|t)


Decision tree: measuring the impurity of a node

Goal: a large decrease in impurity I after the split:

gain = I(parent) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)

(called the information gain when entropy is used as the impurity measure)

Example: the parent node contains 3 objects of class A and 4 of class B; a candidate split sends 2 A's and 1 B to the left child and 1 A and 3 B's to the right child.

              class A   class B
parent          3/7       4/7
left child      2/3       1/3
right child     1/4       3/4

gini(parent) = 1 - (3/7)^2 - (4/7)^2 ≈ 0.49
gini(left child) = 1 - (2/3)^2 - (1/3)^2 ≈ 0.44
gini(right child) = 1 - (1/4)^2 - (3/4)^2 = 0.375

gain = gini(parent) - (3/7) gini(left child) - (4/7) gini(right child) ≈ 0.49 - 0.19 - 0.21 ≈ 0.09

Repeat for all possible splits and choose the best split.
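The worked example above can be reproduced with a short Python sketch; the class counts (3 A / 4 B in the parent, split into 2/1 and 1/3) are the ones from the example.

def gini(counts):
    """Gini impurity of a node, given the class counts in that node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, children_counts):
    """Decrease in Gini impurity achieved by a split."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child) for child in children_counts)
    return gini(parent_counts) - weighted

parent = [3, 4]                       # 3 objects of class A, 4 of class B
children = [[2, 1], [1, 3]]           # left and right child after the split
print(gini(parent))                   # ~0.49
print(gini_gain(parent, children))    # ~0.09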

Characteristics of Decision Trees
• decision boundaries are typically axis-parallel
• can handle both numeric and nominal attributes
• for nominal attributes, decision trees tend to favor the selection of attributes with larger numbers of possible values as splitting criteria
• the runtime is dominated by the sorting of numeric attributes; classification is therefore fairly fast in typical settings
• can easily be converted to (possibly suboptimal) rule sets
• pruning is recommended to reduce tree complexity; the pruning strategy is more important than the choice of splitting criterion
• robust to noise

Decision Trees: extensions
• oblique decision trees: allow test conditions that involve multiple attributes
• regression trees: the value assigned to a datum is the average of the values in the node
• random forests: build multiple decision trees that include a random factor when choosing the attributes to split on


example: Decision Trees for Automated Identification of Cosmic-Ray Hits in Hubble Space Telescope Images, Salzberg et al., Publications of the Astronomical Society of the Pacific 107:279-288, March 1995

• oblique decision tree, starting at random locations for the hyperplanes
• overcomes local maxima by perturbing the hyperplanes and restarting the search at a new location
• compares results from 5 different decision trees
• reduction of the feature set; use of decision trees to confirm the labeling
• over 95% accuracy for single, unpaired images

Predictive data mining: classification (Neural Networks)
[figure: feed-forward network; data → input layer → hidden layers → output layer]
– more complex decision borders
– more accurate
– may overfit the data

Neural Networks: Backpropagation
randomly initialize the weights
for each sample do
  1. present the sample to the input nodes
  2. propagate the data through the layers, using the weights and activation functions
  3. calculate the results at the output nodes
  4. determine the error at the output nodes
  5. propagate the error backwards to adjust the weights
repeat until a stopping criterion is satisfied
(see the numerical sketch after the list of extensions below)

Neural Networks: Extensions
• Madalines
• Adaptive Multilayer Networks
• Prediction Networks
• Winner-Take-All Networks
• Counterpropagation Networks
• Learning Vector Quantizers
• Principal Component Analysis Networks
• Hopfield Networks
• ...
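To make the backpropagation loop concrete, here is a minimal one-hidden-layer network in Python/numpy trained by gradient descent; the toy data, layer sizes, learning rate, and stopping rule are illustrative assumptions, not the tutorial's own implementation.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 2 inputs, 1 binary output (hypothetical star/galaxy labels)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# randomly initialize weights (2 inputs -> 4 hidden -> 1 output)
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
eta = 0.5                                      # learning rate

for epoch in range(1000):                      # repeat until stopping criterion satisfied
    # forward pass: propagate the data through the layers
    h = sigmoid(X @ W1 + b1)                   # hidden-layer activations
    out = sigmoid(h @ W2 + b2)                 # output layer
    # error at the output nodes (squared error with sigmoid derivative)
    d_out = (out - y) * out * (1 - out)
    # propagate the error backwards to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # adjust the weights
    W2 -= eta * h.T @ d_out / len(X); b2 -= eta * d_out.mean(axis=0)
    W1 -= eta * X.T @ d_h / len(X);   b1 -= eta * d_h.mean(axis=0)

print("training accuracy:", ((out > 0.5) == (y > 0.5)).mean())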


Applications of Neural Networks in Astronomy
• star/galaxy separation
• spectral and morphological classification of galaxies
• spectral classification of stars
• determining the number of binary stars in a cluster
• reducing input dimensionality
• classification of planetary nebulae
• prediction of solar flux and sunspots
• classification of asteroid spectra
• adaptive optics
• spacecraft control
• interpolation of the HI distribution in Perseus
• classification of white dwarfs
• detection and classification of CCD defects
• search for antimatter
• ...

Characteristics of Artificial Neural Networks
• slow
• poor interpretability of results
• able to approximate any target function
• can learn to ignore irrelevant or redundant attributes
• easy to parallelize
• may converge to a local minimum because of greedy optimization, but convergence to the global optimum can be achieved through simulated annealing
• choice of network structure is non-trivial and time-consuming
• sensitive to noise (a validation set may help here)

example: The use of Neural Networks to probe the structure of the nearby universe, d'Abrusco et al., in the proceedings of the Astronomical Data Analysis IV workshop, Marseille, 2006
• supervised neural network applied to SDSS data
• training data: spectroscopic, containing 449 370 galaxies
• training data divided into training, validation, and test sets
• output: distance estimates for roughly 30 million galaxies distributed over 8 000 sq. deg.
• provides a list of candidate AGN and QSOs

Lazy learners: Nearest-neighbour techniques
• lazy learners do not build a model
• when a new datum is to be classified, it is assigned the majority class of its neighbours
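A minimal nearest-neighbour sketch in Python; the Euclidean distance, k = 3, and the (x size, y size) star/galaxy examples are assumptions made for illustration.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k nearest training objects."""
    X_train = np.asarray(X_train, dtype=float)
    distances = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]            # indices of the k closest objects
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# hypothetical (x size, y size) measurements with known labels, plus a new object
X_train = [[1629, 1654], [939, 1332], [1713, 2219], [1092, 1400]]
y_train = ["galaxy", "star", "galaxy", "star"]
print(knn_classify(X_train, y_train, [1500, 1600], k=3))   # majority class of the 3 nearest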


Characteristics of Nearest-Neighbour Algorithms
• slow
• does not work well with noisy data
• does not provide the user with a model
• new data can easily be incorporated because of the lack of a model
• easy to parallelize
• may not work well if the attributes are not equally relevant
• decision boundaries are piece-wise linear

Descriptive data mining: clustering
Goal: find clusters of similar objects (find groups of similar galaxies)
– which algorithm should I use?
– when are objects similar?

Difference between predictive and descriptive approaches
• lack of class labels in the descriptive case: we need to establish a correspondence between clusters and real-life types of objects
• for predictive approaches, it is easier to see whether there is agreement with human experts
• evaluation of descriptive approaches is much harder
• descriptive approaches avoid the bias that may be introduced by existing class labels, but introduce bias of their own (choice of distance measure, algorithm, and number of clusters)

Overview of Clustering Techniques
• major distinction: partitioning-based vs. hierarchical methods (fixed number vs. variable number of clusters)
[figure: partition-based clustering vs. hierarchical clustering (produces a dendrogram)]
• hierarchical methods are further divided into agglomerative and divisive clustering
  – agglomerative methods initially assign each sample to a separate cluster, then merge the closest clusters in successive steps
  – divisive methods start with one cluster containing all data, then repeatedly split the cluster(s) until each sample belongs to a separate cluster


Distance measures for objects
• Manhattan distance
• Euclidean distance
• Squared Euclidean distance
• Chebychev distance
• Hamming distance
• Percent disagreement
• ...

Distance measures for clusters
• minimum distance (single linkage, nearest-neighbour):
  d_{min} = \min |p - p'|, where p and p' are from different clusters
• maximum distance (complete linkage, farthest-neighbour):
  d_{max} = \max |p - p'|, where p and p' are from different clusters
• mean distance:
  d_{mean} = |m_i - m_j|, where m_i and m_j are the cluster centers
• average distance:
  d_{average} = \frac{1}{n_i n_j} \sum_p \sum_{p'} |p - p'|, where p and p' are from different clusters

The choice of distance measure for clusters will determine the cluster shape!

Descriptive data mining: clustering (K-means)
(numerical data only)
1) Randomly pick k cluster centers
2) Assign every object to its nearest cluster center
3) Move each cluster center to the mean of its assigned objects
4) Repeat steps 2 and 3 until a stopping criterion is satisfied
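A minimal Python/numpy sketch of these four steps; the convergence test, the iteration limit, and the random seed are illustrative choices.

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: pick k random centers, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iter):
        # step 2: assign every object to its nearest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: move each center to the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                      # stopping criterion
            break
        centers = new_centers
    return labels, centers

# e.g. group objects by (x size, y size, U-B):
# labels, centers = k_means(measurements, k=3)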

K-means algorithm
[figure sequence illustrating the algorithm on a 2-D point set]
Step 1: randomly choose k cluster centers
Step 2: assign each point to the closest cluster center
Step 3: move the cluster centers to represent the means of the clusters
Step 4: reassign the points to the closest cluster center
Step 5: move the cluster centers
Step 6: reassign points and move cluster centers again, or terminate?

Characteristics of k-means
• requires a user-specified number of clusters
• often converges to a local optimum
• does not perform well in the presence of outliers and noise
• is only useful when the mean of a cluster is defined, and is therefore most often used with numerical data only
• biased towards spherical clusters
• cannot handle missing data

Other clustering approaches
• EM
• k-medoids
• model-based
• grid-based
• density-based
• ...

Evaluating the model


How can we evaluate (predictive and descriptive) models?

Evaluation methods
• holdout method: use training and test sets
• stratified holdout: preserve the class distribution
• repeated holdout
• k-fold cross-validation
• leave-one-out cross-validation
• bootstrap: the 0.632 bootstrap sample

Training and test sets
• split the available data into two sets
• one set is used to build the model
• the other set is used to evaluate the model
• typical split: 2/3 of the data as training set, the rest as test set
• does not work well for noisy data and small datasets
• if a validation set is needed as well, the data available for training is reduced even further
• if the test set is not a representative sample of the training set, the accuracy of the model may be underestimated

Cross-Validation
• split the data into k folds
• use k-1 folds for training, 1 fold for testing
• repeat k times so that each fold is used for testing once
• repeat the whole process x times and average the results
• a typical value for both x and k is 10
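A rough sketch of k-fold cross-validation in Python; `train_and_score` is a hypothetical callable standing in for whatever model-building and scoring code is used.

import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_objects, train_and_score, k=10):
    """Average the score over k folds; each fold is used for testing exactly once."""
    folds = k_fold_indices(n_objects, k)
    scores = []
    for test_idx in folds:
        train_idx = [i for fold in folds if fold is not test_idx for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k

# `train_and_score(train_idx, test_idx)` should build the model on the training
# indices and return its accuracy on the test indices; repeating the whole
# procedure with different seeds and averaging gives the "repeat x times" variant.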


Increasing the accuracy
• boosting
• bagging
• randomization
• ensembles

Bias-Variance Decomposition
• the classification error is the sum of bias, variance, and the Bayes error rate: e_c = bias + variance + e_B
• bias: measures how close the classifier will be, on average, to the function to be learned
• variance: measures how much the estimates of the classifier vary with changes in the dataset
• Bayes error rate: the minimum error rate, associated with the Bayes optimal classifier

Increasing accuracy: bagging
• reduces variance
• sample with replacement to create multiple datasets
• build a classifier on each of the datasets to produce multiple models
• combine the individual models into an overall model (a minimal sketch follows after the boosting slide below)

Increasing accuracy: boosting
• builds multiple models from the dataset
• each datum is associated with a weight
• weights are adjusted over time:
  – decrease the weight for data that are easy to classify
  – increase the weight for data that are hard to classify
  – build another model
• the final model is constructed from all models, weighted by a score
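A rough sketch of bagging in Python; `train` and `predict` are hypothetical stand-ins for any base learner (for example a decision-tree builder), and majority voting is assumed for combining the models.

import random
from collections import Counter

def bagging_fit(data, labels, train, n_models=10, seed=0):
    """Train n_models base models, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    n = len(data)
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]       # sample with replacement
        models.append(train([data[i] for i in idx], [labels[i] for i in idx]))
    return models

def bagging_predict(models, predict, x):
    """Combine the individual models by majority vote."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]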


Increasing accuracy: ensembles
• generalizes the idea of bagging
• build multiple models that can vary in:
  – the input data
  – initial parameters: starting points, number of clusters, ...
  – learning algorithms
• can be very powerful if the learning algorithms are weak learners (the result changes substantially with changes in the dataset)

Many more data-mining techniques...
• association rules
• sequence mining
• random forests
• Support Vector Machines
• Naïve Bayes
• genetic algorithms
• ...

Data Mining with Genetic Algorithms: Fitting a Galactic Model to an All-Sky Survey, Larsen and Humphreys, AJ, 125:1958-1979, April 2003
• genetic algorithms: survival of the fittest
  – a fitness function to evaluate the population
  – change the population over time: random mutations, crossover
  – evaluate the population at each timestep; only the fittest survive
• derives global parameters for a Galactic model
• magnitude-limited star counts from the APS catalog
• produces model counts for multi-directional data

Step-by-step guide: data preparation
• size of the dataset:
  – number of attributes
  – number of samples per class
• transform attributes if necessary
• normalize/standardize the data
• select attributes
• reduce dimensionality if possible (PCA for sparse data, DWT for data with large numbers of attributes)


Step-by-step guide: build the model
• descriptive techniques:
  – visualization
  – k-means algorithm
  – EM algorithm
• predictive approaches:
  – decision trees
  – neural networks
• combine both through semi-supervised learning

Step-by-step guide: evaluate the model
• 10-fold cross-validation
• never evaluate the model on the training data
• be careful when comparing models derived with different techniques

(Some) Data-mining concerns
• curse of dimensionality
• local minima
• existing classifications
• distributed nature of the data
• how can we describe the models in general terms?
• can we standardize the process somehow?
• privacy issues
• missing values
• normalization issues
• multiple measurements
• noisy data
• error bars
• cost of the models?
• ...

Crisp-DM
• Cross Industry Standard Process for Data Mining
• http://www.crisp-dm.org/
• describes commonly used approaches, mainly from a business perspective
• non-proprietary, documented, industry- and tool-independent model
• describes best practices and structures of the data-mining process, similar to our model


Predictive Model Markup Language (PMML)
• XML-based language
• define and share statistical and data-mining models amongst applications (e.g. DB2, SAS, SPSS, ...)

Distributed Data Mining • Meta-learning • Collective Data Mining Framework • Data partitions/Ensembles

Curse of Dimensionality
• the number of samples needed increases with the dimensionality of the data
• data-mining algorithms often scale more than linearly with the number of attributes

Weka Machine Learning Workbench
• available (no cost) at http://www.cs.waikato.ac.nz/ml/weka/


Weka interfaces

arff format

@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'

Commercial Data-Mining Software
• Clementine
• Enterprise Miner
• Insightful Miner
• Intelligent Miner
• Microsoft SQL Server 2005
• MineSet
• Oracle Data Mining
• CART
• ...


References
• Introduction to Data Mining, P. Tan, M. Steinbach, and V. Kumar, Addison Wesley, 2006
• Data Mining: Practical Machine Learning Tools and Techniques, I. Witten and E. Frank, Morgan Kaufmann, 2005
• Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2006

References • http://people.trentu.ca/sabinemcconnell/ • www.kdnuggets.com • http://www.twocrows.com/glossary.htm
