Cluster Analysis of Massive Datasets in Astronomy

Woncheol Jang∗

March 7, 2006

∗ Institute of Statistics and Decision Sciences, Duke University, Durham, NC 27708 USA

Abstract

Clusters of galaxies are a useful proxy for tracing the mass distribution of the universe. By measuring the mass of clusters of galaxies at different scales, one can follow the evolution of the mass distribution (Martínez and Saar, 2002). It can be shown that finding clusters of galaxies is equivalent to finding density contour clusters (Hartigan, 1975): connected components of the level set $S_c \equiv \{f > c\}$, where $f$ is a probability density function. Cuevas et al. (2000, 2001) proposed a nonparametric method for density contour clusters; they find density contour clusters by means of a minimal spanning tree. While their algorithm is conceptually simple, it requires intensive computation for large datasets. We propose a more efficient clustering method based on their algorithm using the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering on large astronomical sky survey data.

Key Words: Density contour cluster; level set; clustering; Fast Fourier Transform.

1 Introduction

In the social and physical sciences, clustering often plays an important role in data analysis. For example, clusters of galaxies are a useful proxy for tracing the mass distribution of the universe. By measuring the mass of clusters of galaxies at different scales, one can follow the evolution of the mass distribution (Martínez and Saar, 2002).

In most cases, the objectives of clustering are to find the locations and the number of clusters. Although these two problems are separate, it is tempting to solve both of them simultaneously. The usual tools for clustering are similarities or distances between objects. One popular approach is model-based clustering (Fraley and Raftery, 2002; McLachlan and Peel, 2000). It is based on the assumption that the data are generated from a mixture distribution with G components, each component belonging to a parametric family such as the normal. The parameters and the number of clusters are estimated from the data, and each observation can be assigned to a cluster with a probability of originating from that cluster. While model-based clustering provides a way to estimate the number of clusters and the membership of each observation, the results are often sensitive to the assumed parametric family; see Stuetzle (2003), who also pointed out that model-based clustering is only suitable for ellipsoidal clusters.

An alternative is the nonparametric approach, based on the assertion that a cluster is associated with a mode carrying high probability over its neighborhood. The goal of this approach is to find the modes and assign each observation to a cluster. Hartigan (1975) captured this concept by introducing density contour clusters: clusters are connected components of the level set $S_c \equiv \{f > c\}$. Indeed, it can be shown that finding clusters of galaxies is equivalent to finding density contour clusters.

In this paper, we present a fast clustering algorithm for Hartigan's density contour clusters based on Cuevas et al. (2000, 2001), which we will refer to as the CFF algorithm. They suggested using unions of balls centered at data points to estimate the connected components of the level set and provided an algorithm to extract the connected components of the estimated level set. While the CFF algorithm is conceptually simple, it requires massive amounts of computation for large datasets. We propose a more efficient clustering method based on this algorithm. Instead of using data points, we use grid points as the centers of the balls. As a result, the Fast Fourier Transform (FFT) can be employed to reduce the computational cost for large datasets.

The rest of this paper is organized as follows. In the following section, we review the current status of level set inference. We then introduce the original CFF algorithm and our algorithm, a modified version of the CFF algorithm. The method is applied to a study of galaxy clustering on large astronomical datasets in Section 4. Finally, in Section 5, we discuss possible extensions of our method.

2 Level Set Estimation

Suppose that $Y_1, \ldots, Y_n$ are independent observations from an unknown density $f$ on $\mathbb{R}^d$. From Hartigan's point of view, density contour clustering is equivalent to estimating the level set $S_c = \{f > c\}$. Here $c$ is a constant, often suggested by the situation under study. A naive estimator for the level set is the plug-in estimator $\hat S_c \equiv \{\hat f > c\}$; see Cuevas and Fraiman (1997) and references given therein. Here $\hat f$ is a kernel density estimator:
$$\hat f(y) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{y_1 - Y_{i1}}{h}, \ldots, \frac{y_d - Y_{id}}{h}\right),$$
where $y = (y_1, \ldots, y_d)$, $h = h_n$ is a sequence of bandwidths satisfying $h_n \to 0$ and $nh_n^d \to \infty$, and $K$ is a multivariate kernel function satisfying
$$\int K(y)\,dy = 1, \qquad \int y\,K(y)\,dy = 0, \qquad \int yy^T K(y)\,dy = \mu_2(K) \cdot I.$$
Here $\mu_2 = \int y_j^2 K(y)\,dy$ is independent of $j$ (Wand and Jones, 1995).

Cuevas and Fraiman (1997) studied asymptotic properties of plug-in estimators of the type $\{\hat f > c_n\}$ with $c_n \to 0$ for support estimation and established consistency and convergence rates. From a different point of view, the level set can be used to develop tools for statistical quality control and outlier detection: if an observation falls outside the level set, one can declare the process out of control. It is then of interest to know the asymptotic behavior of the probability of misclassification; Baíllo et al. (2001) obtained convergence rates for this type of probability.

It may not be easy to construct the plug-in estimator in practice because of its complicated geometrical structure (Cuevas et al., 2000). A simpler alternative is a finite union of balls:
$$\tilde S_c^1 = \bigcup_{i=1}^{k_n} B(Y_{(i)}, \epsilon_n),$$
where the $Y_{(i)}$ are those observations (in the original sample $Y_1, \ldots, Y_n$) belonging to $\hat S_c$, $B(Y_{(i)}, \epsilon_n)$ is a closed ball centered at $Y_{(i)}$ with radius $\epsilon_n$, and $k_n$ is the number of such $Y_{(i)}$. Note that $k_n$ is random. $\tilde S_c^1$ can be viewed as a histogram version of the plug-in estimator, because the plug-in estimator is itself a finite union of sets $Y_{(i)} + h S_c(K)$ when $S_c(K) = \{K > c\}$ is bounded. Estimators of this type were originally studied by Devroye and Wise (1980), with applications to statistical quality control.

Two set metrics have been used in the literature for asymptotic inference: the distance in measure $d_\mu(S, T)$ (analogous to the $L_1$ metric in density estimation) and the Hausdorff metric $d_H$ (analogous to the supremum norm in density estimation):
$$d_\mu(S, T) \equiv \mu(T \,\triangle\, S), \qquad d_H(T, S) \equiv \inf\{\epsilon > 0 : T \subset S^{\epsilon},\ S \subset T^{\epsilon}\},$$
where $\triangle$ is the symmetric difference, $\mu$ is the Lebesgue measure, and $S^{\epsilon}$ is the union of all open balls of radius $\epsilon$ centered at points of $S$. Devroye and Wise (1980) first proved $d_\mu$-consistency of $\tilde S_c^1$ under conditions on $\epsilon_n$ similar to those imposed on the bandwidth in kernel smoothing, and Korostelev and Tsybakov (1993) obtained $d_\mu$ convergence rates of $\tilde S_c^1$ when the level set is a domain with a piecewise Lipschitz boundary. Consistency and convergence rates with respect to $d_H$ can be found in Cuevas and Rodríguez-Casal (2004) and Cuevas and Fraiman (1997).
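As a concrete illustration, the plug-in estimator can be approximated on a grid with standard kernel-smoothing software. The sketch below is a minimal example in R, assuming two-dimensional data in a matrix `Y`, a common bandwidth `h`, and a level `c0`; it uses the binned (FFT-based) estimator `bkde2D` from the KernSmooth package and simply thresholds the density estimate at `c0`. The function name and the example values are illustrative, not part of the method described above.

```r
## Minimal sketch of the plug-in level set estimator {fhat > c0} on a grid.
## Assumes: Y is an n x 2 matrix, h a common bandwidth, c0 the level.
library(KernSmooth)

plugin_level_set <- function(Y, h, c0, gridsize = c(100, 100)) {
  kde <- bkde2D(Y, bandwidth = c(h, h), gridsize = gridsize)  # FFT-based binned KDE
  in_set <- kde$fhat > c0        # logical grid: points where the estimate exceeds the level
  list(x1 = kde$x1, x2 = kde$x2, fhat = kde$fhat, in_set = in_set)
}

## Example usage on simulated two-component data
set.seed(1)
Y <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 4), ncol = 2))
est <- plugin_level_set(Y, h = 0.4, c0 = 0.02)
contour(est$x1, est$x2, est$fhat, levels = 0.02)   # boundary of the estimated level set
```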

3 Clustering Algorithm

3.1 CFF Algorithm

To find clusters, one must estimate the level set and extract the connected components of the estimated level set. In the machine learning literature, finding the connected components of the estimated level set can be viewed as a constraint satisfaction problem (Russell and Norvig, 2002). While best-first greedy search is often used for this type of search problem, finding the optimal solution may not be feasible. A possible alternative is heuristic search, but it can require a great deal of memory; for very large search spaces this approach may run out of memory (personal communication with Andrew Moore), since it requires visiting every data point. In contrast, the CFF algorithm only needs to visit the observations belonging to the level set.

The key idea of the CFF algorithm is first to find the subset of data belonging to the level set and then to form clusters by agglomerating these data points. In short, the CFF algorithm consists of two key steps.

Step 1 Among the original data, find the observations $Y_{(i)}$ which belong to the estimated level set $\hat S_c$.

Step 2 Identify the connected components of the estimated level set as unions of open balls centered at the $Y_{(i)}$ with radius $\epsilon_n$. In effect, two points $Y_{(i)}$ belong to the same cluster if they can be joined by a path consisting of a finite number of edges, each of length smaller than $2\epsilon_n$.

Cuevas et al. (2000) also suggested an alternative version of their procedure for small $k_n$, which can occur when the sample size is small or $c$ is relatively large. The key idea of the alternative procedure is to replace the $Y_{(i)}$ with smoothed bootstrap observations $Z_{(i)}$ drawn from $\hat f$ that belong to the estimated level set. Since the CFF algorithm is nonparametric, it outperforms clustering algorithms such as mixture models and hierarchical single-linkage clustering in noisy-background settings such as the astronomical sky survey data in Figure 2 (Wong and Moore, 2002).
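Step 2 can be illustrated with single-linkage hierarchical clustering: cutting the single-linkage dendrogram at height $2\epsilon_n$ recovers the connected components of the union of balls. The sketch below is a minimal illustration in R (not the authors' implementation), assuming `Y_level` is a matrix of the observations $Y_{(i)}$ already found in Step 1 and `eps` is the radius $\epsilon_n$. The exact pairwise-distance computation is quadratic in $k_n$, which is what motivates the faster variants discussed in the next section.

```r
## Minimal sketch of Step 2 of the CFF algorithm (illustration only):
## agglomerate the level-set points Y_level by joining any pair closer than 2 * eps.
cff_step2 <- function(Y_level, eps) {
  d <- dist(Y_level)                      # pairwise Euclidean distances
  hc <- hclust(d, method = "single")      # single-linkage dendrogram
  labels <- cutree(hc, h = 2 * eps)       # cut at 2*eps: connected components
  labels                                  # cluster label for each level-set point
}
```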

3.2 CFF Algorithm with FFT

The CFF algorithm is conceptually simple, but it requires massive computation for large datasets. Even in the first step, density estimates must be computed at every observation; in high dimensions the task can be daunting even with today's computing power. There have been substantial developments in the machine learning community for this type of problem, namely high-dimensional density estimation with large datasets. For example, Moore (1999) gave an algorithm to fit EM-based mixture models with kd-trees, and Gray and Moore (2003) presented an algorithm for fast kernel density estimation for high-dimensional massive datasets. While these methods approximate density estimates very quickly by cutting off the search early rather than computing exact densities, issues remain in the second step.

The second step is equivalent to finding a minimum spanning tree, and Wong and Moore (2002) proposed an alternative implementation based on the GeoMS2 algorithm (Narasimhan et al., 2000). Although Wong and Moore improved on the CFF algorithm, their algorithm did not address the choice of $\epsilon_n$, the extra smoothing parameter that the CFF algorithm requires as input. While Cuevas et al. (2000) suggested a few empirical rules for choosing $\epsilon_n$, those rules still require intensive computation and may not be practical for large datasets.

To save computing cost and provide a convenient choice of $\epsilon_n$, we propose a modified version of the CFF algorithm. The key idea is to replace the data points with grid points $t_1, \ldots, t_m$. In other words, we estimate $\hat S_c$ with
$$\tilde S_c^2 \equiv \bigcup_{i=1}^{k_m} B(t_{(i)}, \epsilon_m),$$
where the $t_{(i)}$ are the equally spaced grid points belonging to $\hat S_c$, $k_m$ is the total number of grid points belonging to $\hat S_c$, and $\epsilon_m$ is the grid spacing. For the choice of the number of grid points, Wand (1994) provided guidelines for the multivariate case. In the two-dimensional case, Table 1 in Wand (1994) shows that $32^2$ grid points with linear binning achieve almost the same accuracy as 10,000 data points. Hall and Wand (1996) also addressed the minimum grid size needed to achieve a given degree of asymptotic efficiency. Our empirical rule is to choose the nearest integer to $n^{1/d} k^{-1}$ as the number of grid points in each dimension, where $k$ is any number between two and three, so that the total number of grid points is $(n^{1/d} k^{-1})^d = n k^{-d}$.

Using the grid spacing as the radius of the balls, one can apply the Fast Fourier Transform (FFT) to compute the density estimates at the grid points and thereby speed up the computations. The FFT requires only $O(m \log m)$ operations to compute the density estimates at all grid points, whereas other methods usually require $O(n \log n)$; note that we choose $m < n$.

To implement our algorithm, we use the following steps, as described in Cuevas et al. (2000). Let $T$ be the number of connected components and set its initial value to 0.

Step 1 Compute $\hat f$ at every grid point using the FFT and find the subset of grid points $\{t_{(i)} : t_{(i)} \in \hat S_c\}$.

Step 2 Choose a grid point from this set and call it $t_{(1)}$. Compute the distance $r_1 = \|t_{(1)} - t_{(2)}\|$, where $t_{(2)}$ is the grid point nearest to $t_{(1)}$.

Step 3a If $r_1 > 2\epsilon_m$, the ball $B(t_{(1)}, \epsilon_m)$ is a connected component of $\hat S_c$. Put $T = T + 1$ and repeat Step 2 with any grid point in $\hat S_c$ other than $t_{(1)}$.

Step 3b If $r_1 \le 2\epsilon_m$, compute $r_2 = \min\{\|t_{(3)} - t_{(1)}\|, \|t_{(3)} - t_{(2)}\|\}$, where $t_{(3)}$ is the grid point closest to the set $\{t_{(1)}, t_{(2)}\}$.

Step 4a If $r_2 > 2\epsilon_m$, put $T = T + 1$ and repeat Step 2 with any grid point in $\hat S_c$ other than $t_{(1)}$ and $t_{(2)}$.

Step 4b If $r_2 \le 2\epsilon_m$, compute, by recurrence,
$$r_K = \min\{\|t_{(K+1)} - t_{(i)}\| : i = 1, \ldots, K\},$$
where $t_{(K+1)}$ is the grid point closest to the set $\{t_{(1)}, \ldots, t_{(K)}\}$. Repeat until a distance $r_K > 2\epsilon_m$ is found; then put $T = T + 1$ and return to Step 2.

Step 5 Repeat Steps 2-4 until every grid point has been considered. The total number of clusters, i.e., of connected components of $\hat S_c$, is then $T$.
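A minimal sketch of this grid-based procedure in R is given below, under assumptions that are not part of the original implementation: the data are two-dimensional, the FFT-based binned density estimate is computed with `bkde2D` from the KernSmooth package, the grid spacing plays the role of $\epsilon_m$, and the agglomeration of grid points is carried out as a flood fill over adjacent grid cells (including diagonals), a simplification of the $2\epsilon_m$ rule above. The function name `cff_grid_clusters` and the rule-of-thumb grid size are illustrative only.

```r
## Sketch of the modified CFF algorithm on a grid (illustration only).
## X: n x 2 data matrix; bandwidth: length-2 bandwidth vector; c0: level.
library(KernSmooth)

cff_grid_clusters <- function(X, bandwidth, c0, k = 2.5) {
  n <- nrow(X); d <- ncol(X)
  g <- max(round(n^(1 / d) / k), 51)        # empirical rule: about n^(1/d)/k points per dimension
  kde <- bkde2D(X, bandwidth = bandwidth, gridsize = c(g, g))  # Step 1: FFT-based binned KDE
  mask <- kde$fhat > c0                     # grid points belonging to the estimated level set
  labels <- matrix(0L, nrow(mask), ncol(mask))
  T_count <- 0L
  for (i in seq_len(nrow(mask))) {
    for (j in seq_len(ncol(mask))) {
      if (mask[i, j] && labels[i, j] == 0L) {
        T_count <- T_count + 1L             # a new connected component
        queue <- list(c(i, j))              # flood fill over neighbouring grid cells
        while (length(queue) > 0) {
          p <- queue[[1]]; queue <- queue[-1]
          if (p[1] < 1 || p[2] < 1 || p[1] > nrow(mask) || p[2] > ncol(mask)) next
          if (!mask[p[1], p[2]] || labels[p[1], p[2]] != 0L) next
          labels[p[1], p[2]] <- T_count
          for (di in -1:1) for (dj in -1:1) {
            if (di != 0 || dj != 0) queue <- c(queue, list(c(p[1] + di, p[2] + dj)))
          }
        }
      }
    }
  }
  list(x1 = kde$x1, x2 = kde$x2, labels = labels, n_clusters = T_count)
}
```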

In contrast to the original CFF algorithm, our algorithm is computationally efficient for massive datasets:

1. In Step 1, our algorithm requires only $O(m \log m)$ operations, while the original algorithm needs at least $O(n \log n)$. Note that $m < n$.

2. In Step 2, the usual Euclidean minimum spanning tree methods require $O(k_n \log k_n)$ operations, while our agglomeration step needs only $O(k_m \log k_m)$, where $k_m$ is usually smaller than $k_n$ since $m < n$.

Another advantage of our method is that one can use existing, popular R/S-Plus libraries such as KernSmooth to compute density estimates at the grid points. Although it is possible in principle to compute density estimates at the data points with $O(n \log n)$ operations, no current R/S-Plus library provides such an efficient computation.

The idea of using a fixed grid has also been used by Chaudhuri et al. (1999) in the context of set estimation. They define a set estimator, which they call the s-shape, with applications to digital images, and show the consistency of their estimator when the data are generated from a continuous distribution. For small or moderate sample sizes (when $m > n$), one may not gain computational efficiency from our algorithm, but one still obtains the same effect as the alternative bootstrap approach proposed by Cuevas et al. (2000).

4 Case Study

Considering the relatively short history of statistics as a discipline, the interaction between statistics and astronomy is longer than one might imagine. Important statistical concepts and methods, such as least squares, were in fact developed by astronomers (Babu and Djorgovski, 2004). From the mid-nineteenth century, however, the relationship weakened as astronomers focused more on astrophysics while statisticians turned to applications in agriculture and the biological sciences. Over the last two decades, the advent of new massive sky surveys has begun to restore the connection between the two fields; a series of Statistical Challenges in Modern Astronomy conferences hosted at Penn State University addresses a wide range of statistical issues in astronomy with modern statistical methodology. One of the main statistical issues in astronomy is clustering astronomical sky survey data. In this section, we introduce the scientific background of this subject and apply our method to large astronomical sky survey data.

4.1 Scientific Background

Traditionally, astronomical sky surveys were carried out by small groups of cosmologists spending sleepless nights at the telescope. Today's technology, such as digital imaging cameras, is opening a new era for cosmologists. With the arrival of huge digitized datasets, some fundamental questions in cosmology have been revisited, for example: (1) How did the universe begin? (2) How old is the universe? (3) Is the universe still expanding? (4) What is the eventual fate of the universe? Whereas the Big Bang model has been very successful in answering the first question, the remaining questions involve the mass distribution of the universe.

The key assumption in modern cosmology is that the evolution of the mass distribution of the universe is sensitive to cosmological parameters. Indeed, the mass distribution of the universe can be described by a surprisingly simple model from the seminal work of Press and Schechter (1974). Their model provides an analytic form for a kind of cumulative distribution function of the mass of clusters of galaxies at different scales, expressed through cosmological parameters. Since most matter cannot be observed directly, clusters of galaxies are a useful tool for following the mass distribution of the universe. Furthermore, because of the finite velocity of light, the farther away an object is, the further in the past we observe it. Therefore, by measuring the mass of clusters of galaxies at different scales (times), one can learn the history of the universe.

To estimate the cosmological parameters, one can consider a goodness-of-fit type of test statistic and calculate confidence intervals for the cosmological parameters by matching the Press-Schechter model to the mass of clusters of galaxies (Jang, 2003). This is beyond the scope of this paper and remains future work. In short, galaxy clustering is a crucial step in estimating those cosmological parameters, and those estimates would lead to answers for the remaining questions. A summary of the typical astronomical sky survey data analysis steps is given in Figure 1.

4.2 Statistical Models

Let $X_1, X_2, \ldots, X_n$ be the positions of galaxies, where $X_i = (X_{1i}, X_{2i}, Z_i)$. The first two components, right ascension (RA) and declination (DEC), are the longitude and latitude with respect to the Earth. The third component, the redshift, is related to distance and is defined by
$$z \equiv \frac{\lambda_o - \lambda_e}{\lambda_e},$$
where $\lambda_o$ is the observed wavelength of a spectral line and $\lambda_e$ is the wavelength of the same spectral line measured in a laboratory. Using an argument similar to that for Doppler shifts of sound waves, we can estimate the distance from the redshift.

We assume that $X_i$ is a realization of a Poisson process with intensity measure $\Lambda_t(C) = \int_C \lambda(x)\,dx$, the mean number of galaxies inside $C$ at time $t$. Here $\lambda(x)$ is the intensity function.
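As a small aside on the redshift-distance relation mentioned above, the sketch below computes the low-redshift (Doppler-like) approximation $d \approx cz/H_0$ in R. The value of the Hubble constant is an assumed illustrative value, and the approximation holds only for small $z$.

```r
## Low-redshift distance approximation d ~ c * z / H0 (illustration only).
redshift_to_distance <- function(z, H0 = 70) {   # H0 in km/s/Mpc (assumed value)
  c_kms <- 299792.458                            # speed of light in km/s
  c_kms * z / H0                                 # approximate distance in Mpc
}

redshift_to_distance(c(0.10, 0.125))             # rough distances for the slice used in Section 4.4
```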

[Figure 1: Typical analysis procedure for astronomical sky survey data: galaxy sky survey data → galaxy clustering → fit cosmological model on the mass of the clusters → estimating large-scale structure.]

Cosmologists assume that the mean number of galaxies inside a region $C$ is directly proportional to the total mass inside the region. Hence the intensity measure $\Lambda_t(C)$, the mean number of galaxies inside $C$ at time $t$, is
$$\Lambda_t(C) \propto \int_C \rho_t(x)\,dx.$$
Here $\rho_t(x)$ is the mass density function at time $t$, i.e.,
$$\int_A \rho_t(x)\,dx \equiv \text{total mass in a region } A.$$
The mass density is often expressed in terms of the density parameter $\Omega$, defined by $\Omega = \frac{8\pi G}{3H^2 c^2}\,\rho$, where $G$ is Newton's constant of gravitation, $H$ is the Hubble constant, and $c$ is the speed of light. The density parameter is directly related to the spatial curvature: space is negatively curved ("open") for $\Omega < 1$, flat for $\Omega = 1$, and positively curved ("closed") for $\Omega > 1$. Observations to date have provided evidence in favor of an approximately flat universe, $\Omega = 1$. We may decompose the density parameter into a sum of contributions from different sources of energy: the density parameter for matter, $\Omega_M$, and for the cosmological constant, $\Omega_\Lambda$. It is believed that $\Omega_M$ is close to 0.3 and $\Omega_\Lambda$ is close to 0.7.

It is believed that in the early universe quantum fluctuations were frozen in by a sudden exponential inflation, so that regions of large normalized mass density became virialized objects, i.e., clusters. To become a virialized object, a mass density function must satisfy the geometric condition
$$C = \{\, x : \rho(x \mid z) > \delta_c \,\}, \tag{1}$$
where $\delta_c$ is a complicated nonlinear function of the redshift $z$ and $\Omega_M$ (Reichart et al., 1999). Given that $\rho$ is a kind of probability density function, it is clear from condition (1) that galaxy clustering is equivalent to estimating a level set.

4.3 Data: Mock 2dF Catalogue

With the development of modern instruments, the astronomical sky survey data collection procedure is much different than it used to be. For example, the Sloan Digital Sky Survey (SDSS) uses a 2.5-meter telescope at Apache Point, New Mexico, with a wide-field CCD imaging camera that images the sky in five broad photometric bands covering the wavelength range accessible to CCDs from the ground. Although several real astronomical sky surveys are publicly available, including the SDSS, we use simulated data, the Mock 2dF catalogue, for our data analysis. Unlike real data, simulated data come with known cosmological parameters, which can later be used to measure how accurate our estimates are.

The Mock 2dF catalogue was built to support the development of faster algorithms for dealing with the very large numbers of galaxies involved and the development of new statistics (Cole et al., 1998). All Mock 2dF catalogues mimic the 2dF catalogue, which was constructed using the 2dF instrument built by the Anglo-Australian Observatory. The 2dF catalogue measured redshifts for 250,000 galaxies selected from the APM survey, a projected catalogue built with the Automatic Plate Measuring (APM) machine.

Figure 2 shows a two-dimensional projection from the Mock 2dF catalogue. Here each observation represents a galaxy. While the majority of galaxies belong to one of the clusters, the rest can be considered noise.

[Figure 2: Mock 2dF catalogue.]

We use a subset of the Mock 2dF catalogue with density parameter for matter $\Omega_M = 0.3$ and cosmological constant $\Omega_\Lambda = 0.7$ for the data analysis. It contains 202,882 galaxies, and each galaxy has four attributes: RA, DEC, redshift, and apparent magnitude. Here the apparent magnitude is the brightness of the object and can be used to calculate its mass, since mass follows light.

4.4 Results

Our main goal is to find the mass distribution of clusters as a function of time, or redshift $z$. We use the following steps; a brief usage sketch is given at the end of this section.

1. Given $z$, estimate $\rho$ with a nonparametric density estimator.

2. Assign each galaxy to a cluster with our clustering algorithm.

3. Add up the absolute magnitudes of the galaxies in each cluster and use the sums as estimates of the masses of the clusters.

4. Repeat Steps 1-3 over different $z$ and compute the mass distribution of clusters as a function of $z$.

For the first step, the data were divided into 10 slices of equal redshift width, and a bivariate kernel density estimator was then fitted to estimate the joint distribution of RA and DEC given redshift. Figure 3 (a) shows a slice of the 2dF data with $0.10 < z < 0.125$, which contains 33,157 galaxies, and Figure 3 (b) presents a contour plot of the density estimates. To keep the original scale of the data, a spherically symmetric kernel was used, meaning that the bandwidth matrix is a constant times the identity matrix. To choose the smoothing bandwidth, we used the cross-validation selector based on the results in Jang (2006).

To implement our algorithm, the first step is to compute density estimates at each grid point with the FFT. We used the R library KernSmooth, developed by Matt Wand, to compute the density estimates. Figure 3 (c) shows the grid points belonging to the estimated level set $\{\hat f > \delta_c\}$. For the second step, we wrote a C program interfaced with R. We used a 412 by 71 grid for this particular example; the same number of grid points (29,252) and grid size are used for all datasets. The main reason for this choice is that cosmologists want to keep the original physical ratio of RA to DEC, and we want at least 50 grid points in each dimension. The clustering results are given in Figure 3 (d). Here each color represents a different cluster; 1,945 clusters were found among the 33,157 galaxies. In astronomical sky surveys, cosmologists are most interested in the number and size of clusters, so that they can compute the mass distribution of the clusters over time. Given a range of redshift, cosmologists agree with our findings, and fitting the cosmological model to the mass distribution of clusters is an ongoing research project.
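For concreteness, a brief usage sketch of Steps 1-3 for a single redshift slice is given below, reusing the illustrative `cff_grid_clusters` function from Section 3.2. All object names (`galaxies`, columns `RA`, `DEC`, `z`, `abs_mag`) and the bandwidth and level values are assumptions for illustration, not the values used in the analysis above.

```r
## Usage sketch of Steps 1-3 for one redshift slice (illustration only).
## Assumes a data frame `galaxies` with columns RA, DEC, z and abs_mag,
## and the cff_grid_clusters() sketch from Section 3.2.
slice <- subset(galaxies, z > 0.10 & z < 0.125)
X <- as.matrix(slice[, c("RA", "DEC")])

fit <- cff_grid_clusters(X, bandwidth = c(1, 1), c0 = 0.01)   # assumed bandwidth and level

## Map each galaxy to the label of the grid cell containing it (0 = background)
i_idx <- findInterval(X[, 1], fit$x1)
j_idx <- findInterval(X[, 2], fit$x2)
slice$cluster <- fit$labels[cbind(pmax(i_idx, 1), pmax(j_idx, 1))]

## Step 3: cluster "masses" as sums of absolute magnitudes within each cluster
cluster_mass <- tapply(slice$abs_mag, slice$cluster, sum)
```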

5 Discussion

The explosion of data in scientific problems provides a growing opportunity to apply nonparametric methods. Our algorithm improves on the original CFF algorithm in terms of computational cost by using the FFT. We also address the issue of choosing the extra smoothing parameter $\epsilon_n$ and provide a practical rule for the choice of the grid size.

Constructing confidence sets for clusters would address the uncertainty of the clustering results. While there is a substantial literature on making confidence statements about a curve $f$ in the context of nonparametric regression and nonparametric density estimation, most of it produces pointwise confidence bands for $f$. It is therefore not easy to construct confidence statements about features of $f$, such as density contour clusters, from such bands. Jang et al. (2004) provide a method to construct uniform confidence sets for densities and density contour clusters; however, implementing this method in practice is still challenging. It may also be interesting to combine our method with other methods developed in the machine learning community, such as that of Gray and Moore (2003).

References

Babu, G. B. and Djorgovski, S. G. (2004). Some statistical and computational challenges, and opportunities in astronomy. Statistical Science 18 322–332.

Baíllo, A., Cuesta-Albertos, J. A. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Statistics and Probability Letters 53 27–35.

Chaudhuri, A. R., Basu, A., Bhandari, S. and Chaudhuri, B. (1999). An efficient approach to consistent set estimation. Sankhyā, Series B 61 496–513.

Cole, S., Hatton, S., Weinberg, D. H. and Frenk, C. S. (1998). Mock 2dF and SDSS galaxy redshift surveys. Monthly Notices of the Royal Astronomical Society 300 945–966.

Cuevas, A., Febrero, M. and Fraiman, R. (2000). Estimating the number of clusters. The Canadian Journal of Statistics 28 367–382.

Cuevas, A., Febrero, M. and Fraiman, R. (2001). Cluster analysis: a further approach based on density estimation. Computational Statistics & Data Analysis 36 441–459.

Cuevas, A. and Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics 25 2300–2312.

Cuevas, A. and Rodríguez-Casal, A. (2004). On boundary estimation. Advances in Applied Probability 36 340–354.

Devroye, L. and Wise, G. (1980). Detection of abnormal behavior via nonparametric estimation of the support. SIAM Journal on Applied Mathematics 38 480–488.

Fraley, C. and Raftery, A. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97 611–631.

Gray, A. and Moore, A. (2003). Rapid evaluation of multiple density models. In Artificial Intelligence and Statistics.

Hall, P. and Wand, M. (1996). On the accuracy of binned kernel density estimators. Journal of Multivariate Analysis 56 165–184.

Hartigan, J. (1975). Clustering Algorithms. Wiley, New York.

Jang, W. (2003). Nonparametric density estimation and galaxy clustering. In Statistical Challenges in Astronomy 443–445. Springer, New York.

Jang, W. (2006). Nonparametric density estimation and clustering in astronomical sky surveys. Computational Statistics & Data Analysis 50 760–774.

Jang, W., Genovese, C. and Wasserman, L. (2004). Nonparametric confidence sets for densities. Tech. Rep. 795, Department of Statistics, Carnegie Mellon University.

Korostelev, A. and Tsybakov, A. (1993). Minimax Theory of Image Reconstruction. Springer, New York.

Martínez, V. and Saar, E. (2002). Statistics of the Galaxy Distribution. Chapman and Hall, London.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.

Moore, A. (1999). Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 543–549.

Narasimhan, G., Zhu, J. and Zachariasen, M. (2000). Experiments with computing geometric minimum spanning trees. In Proceedings of ALENEX'00 183–196. Lecture Notes in Computer Science, Springer-Verlag, New York.

Press, W. H. and Schechter, P. (1974). Formation of galaxies and clusters of galaxies by self-similar gravitational condensation. Astrophysical Journal 187 425–438.

Reichart, D., Nichol, R., Castander, F., Burke, D., Romer, A. K., Holden, B., Collins, C. and Ulmer, M. (1999). A deficit of high-redshift, high-luminosity X-ray clusters: evidence for a high value of $\Omega_M$? Astrophysical Journal 518 521–532.

Russell, S. J. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach, 2nd edn. Prentice Hall, Upper Saddle River.

Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20 25–47.

Wand, M. (1994). Fast computation of multivariate kernel estimators. Journal of Computational and Graphical Statistics 3 433–445.

Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall, London.

Wong, W.-K. and Moore, A. (2002). Efficient algorithms for non-parametric clustering with clutter. In Computing Science and Statistics 34 541–553.


[Figure 3: Subset of the Mock 2dF catalogue. (a) Mock 2dF catalogue with 0.1 < z < 0.125, plotted in RA and DEC; (b) contour plot of the kernel density estimate.]