HCAC: Semi-supervised Hierarchical Clustering Using Confidence-Based Active Learning

Bruno M. Nogueira1, Alípio M. Jorge2, and Solange O. Rezende1

1 Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science, University of São Paulo, Brazil
{brunomn,solange}@icmc.usp.br
2 LIAAD-INESCTEC, FCUP, University of Porto, Portugal
[email protected]

Abstract. Despite its importance, hierarchical clustering has been little explored in semi-supervised settings. In this paper, we address the problem of semi-supervised hierarchical clustering by using an active learning solution with cluster-level constraints. This active learning approach is based on a new concept of merge confidence in agglomerative clustering. When there is low confidence in a cluster merge, the user is queried and provides a cluster-level constraint. The proposed method is compared with an unsupervised algorithm (average-link) and two state-of-the-art semi-supervised algorithms (pairwise constraints and Constrained Complete-Link). Results show that our algorithm tends to be better than the two semi-supervised algorithms and can achieve a significant improvement when compared to the unsupervised algorithm. Our approach is particularly useful when the number of clusters is high, which is the case in many real problems.

1 Introduction

Semi-supervised clustering has been widely explored in recent years. Instead of finding groups guided only by an objective function, as in unsupervised clustering, semi-supervised versions try to improve clustering results by employing external knowledge in the clustering process. The external knowledge is conveyed in the form of constraints. These constraints can be directly derived from the original data (using partially labelled data) or provided by a user, trying to adapt the clustering results to his/her expectations [8].
Constraints in semi-supervised clustering processes affect a small part of the dataset, as the supervision of large amounts of data is expensive [4]. It is therefore very important to optimize the usage of external knowledge, obtaining the largest amount of useful information from the smallest number of constraints. In this sense, semi-supervised clustering algorithms must deal with two crucial issues: how to add information to the clustering process and in which cases the user should provide information.
To ensure efficacy in information addition, the characteristics of the semi-supervised clustering algorithm are very important. Mainly, three aspects can be observed: (1) the type of the constraints (e.g., pairwise constraints [28, 26], initial seeds [2] or feedback [8]); (2) the level of the constraints (instance-level [28], cluster-level [14] or instance-cluster-level [13]); and (3) how the algorithm deals with these constraints (constraint-based [28], distance-based [19] or hybrid [4]).


Active learning algorithms [23] can be used to choose proper cases in which to add information. In semi-supervised clustering, these algorithms can be used to detect instances or clusters for which the addition of constraints can help the clustering process to obtain an improved solution. Active learning algorithms have been successfully used to select pairs of instances and elicit pairwise constraints from the user [12, 26, 27]. Active-based solutions are also used in some algorithms to choose better initial seeds [3].
In the literature, few works deal specifically with semi-supervised hierarchical clustering. Most of the approaches treat the semi-supervised problem as a flat-clustering problem. More specifically, neither the appropriate addition of information nor the selection of good cases to add constraints is fairly explored in the hierarchical clustering context. Moreover, most of the studies are carried out with two categories only (binary datasets). Thus, the behaviour of most of the methods is not measured in domains with more than two clusters, which is the case in many real-world problems.
In this work we propose HCAC (Hierarchical Confidence-Based Active Clustering), an effective method for better exploiting external knowledge during the hierarchical clustering process. This method improves the hierarchical clustering process by querying the user when it seems more appropriate. HCAC applies two ideas which have not been extensively exploited before. The first one is the kind of query: the user, when requested, chooses the next pair of clusters to be merged among a pool of pre-selected pairs. The second idea is to use the confidence in unsupervised cluster merging decisions to determine when it is appropriate to query the user. The combination of these two ideas makes HCAC especially efficient when dealing with more than two clusters.
This paper is organized as follows. In the next section, we present some related work on hierarchical semi-supervised clustering and active clustering. In Section 3, we present the HCAC algorithm. Then, in Section 4, we present experimental evaluations. Finally, in Section 5, we present some conclusions and point out future work.

2 Related Work

There is little work on semi-supervised hierarchical clustering. ISAAC, one of the first proposals to add background knowledge to hierarchical clustering processes [25], uses a declarative approach. It is a conceptual clustering method which generates probabilistic concept hierarchies. The authors modified ISAAC to allow the user to introduce a set of classification rules expressed in first-order logic. Clusters containing objects covered by different rules are not merged, which guarantees that a cluster completely satisfying each rule will be formed.
In [18], pairwise constraints (must-link and cannot-link) are used in a semi-supervised clustering algorithm based on the complete-link algorithm (see [16]) - the Constrained Complete-Link (CCL) method. These constraints were introduced in [28], where the authors proposed the use of instance-level must-link and cannot-link pairwise constraints to indicate whether two instances belong to the same cluster or not. Due to their simplicity and good results, pairwise constraints have been widely explored [6, 30, 4, 9, 11].


In CCL, constraint insertion has two phases: imposition and propagation. During the imposition, constraints are added to pairs of examples by modifying the distance between elements. If two points xi and xj have a must-link constraint, their distance is set to zero. Otherwise, if they have a cannot-link constraint, their distance is set to the maximum distance in the distance matrix plus one. In the propagation, the algorithm considers that if an example xk is near an example xi, and xi has a must-link or a cannot-link constraint with xj, then xk is also near to or far from xj, respectively. The propagation of must-link constraints is done by calculating a new distance between xk and xj through a modification of the Floyd-Warshall algorithm. Cannot-links, on the other hand, are implicitly propagated by the complete-link algorithm.
Pairwise constraints are also used in [17]. In that work, the authors propose the use of these constraints at the first level of a hierarchical clustering algorithm in order to generate the initial clusters. The constraints are not propagated to posterior levels, as the algorithm aims to generate stable dendrograms.
Using labelled examples, an algorithm based on complete-link is proposed in [7]. This algorithm learns a distance threshold x above which no more cluster merges are performed. To learn this threshold, a small set of labelled objects is clustered, several threshold values are tested, and the value with the best evaluation measure is chosen to cluster the entire dataset. In [1], pairwise constraints are generated from labelled examples in a post-processing step. After an unsupervised clustering process, the algorithm uses these labelled examples to generate must-link and cannot-link constraints between pairs of objects, which are then used to decide whether to merge or split the resulting clusters. Labelled data is also used in [5], where a semi-supervised density-based hierarchical algorithm is proposed. The labelled data are used to generate an initial hierarchy, which is later expanded, and unlabelled data are assigned to the most consistent clusters.
An analysis of constraints in hierarchical clustering is done in [10]. In this work, the authors analysed the use of pairwise constraints and cluster-level constraints (minimum and maximum intra-cluster distances). The authors proved that the combination of these constraints is computationally viable in hierarchical clustering, unlike flat clustering, where this combination is an NP-complete problem.
In [21], a semi-supervised approach based on penalty scores is introduced for agglomerative clustering. Penalties are added when a cluster merge violates cannot-link constraints; must-links are considered hard constraints and cannot be broken. The penalty factor of a cluster merge is calculated by multiplying a positive constant by the number of cannot-link constraints involving elements from both clusters, and this factor is added to the distance between the clusters.
Among the works mentioned above, [18] uses an active learning algorithm to insert constraints in hierarchical clustering. The algorithm is allowed to ask m pairwise questions. It first performs an unsupervised complete-link clustering in order to learn a distance α from which it is expected to need no more than m questions to cluster properly. The clustering then restarts in an unsupervised way until it reaches a merge of distance α. From that point on, the user is asked whether the roots of the next proposed merge belong together and, depending on the answer, the constraints are propagated as explained before.
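For concreteness, the imposition and must-link propagation steps just described could be sketched as follows. This is an illustrative Python sketch under names of our own choosing, not the reference CCL implementation, and the exact way CCL restricts the shortest-path update may differ.

    import numpy as np

    def ccl_impose(D, must_links, cannot_links):
        """Sketch of CCL-style constraint imposition on a symmetric distance
        matrix D; must_links and cannot_links are lists of index pairs."""
        D = D.astype(float).copy()
        n = D.shape[0]
        # Imposition of must-links: constrained points become identical.
        for i, j in must_links:
            D[i, j] = D[j, i] = 0.0
        # Must-link propagation: an all-pairs shortest-path pass (Floyd-Warshall)
        # also pulls points that are close to a constrained point towards its
        # must-linked partner.
        for k in range(n):
            D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
        # Imposition of cannot-links: constrained points are pushed beyond every
        # observed distance; complete-link propagates this separation implicitly.
        far = D.max() + 1.0
        for i, j in cannot_links:
            D[i, j] = D[j, i] = far
        return D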


In [26], the authors propose an active constraint selection algorithm based on a k-nearest neighbour graph. The main idea is to estimate the utility of constraints before starting the clustering process. Based on the k-nearest neighbour graph, the algorithm selects a pool of candidate constraints composed of pairs of elements whose edge weight is below a predefined threshold. It then ranks the candidate constraints according to their ability to separate clusters.
It is possible to see that semi-supervised hierarchical clustering is still less explored than semi-supervised flat clustering. Few works exploit cluster-level constraints in hierarchical clustering, which could be interesting as they can carry more information than instance-level ones. Moreover, there are few active learning approaches for these algorithms. Motivated by this lack of efforts to improve hierarchical clustering, in this paper we present the HCAC method (Hierarchical Confidence-Based Active Clustering). This active hierarchical clustering method is based on cluster-level constraints and a new concept of cluster merge confidence. In the next section we discuss this method.

3 HCAC: A Confidence-Based Active Clustering Method

HCAC (Hierarchical Confidence-Based Active Clustering - pronounced h-cac) is a new semi-supervised clustering method based on agglomerative hierarchical clustering. The idea of confidence used in this method was briefly introduced by the authors in [22]; here we explain it in depth and test it in hierarchical clustering. HCAC uses cluster-level constraints which are provided by a human supervisor along the iterations of an agglomerative hierarchical clustering algorithm. In the next section, we identify the kind of situation that motivated us to create this method and our approach to detect it. Then, we explain our approach to deal with these situations by adding cluster-level constraints.

3.1 Confidence-Based Active Clustering

In unsupervised agglomerative hierarchical clustering, the nearest pair of elements (in this work, the term elements may refer both to single examples and to clusters of examples) in a given step is selected to be merged. However, this approach may sometimes cluster objects that represent different concepts not fully captured by the distance function.

Fig. 1. Cluster border problem

This occurs near cluster borders, as in Figure 1. In this figure, we have two underlying clusters (dashed circles), corresponding to two different concepts. In an unsupervised approach, despite representing different concepts, the pair of elements in the rectangle would be the first to be merged, as they are the nearest. However, there are better merge options involving each of these two elements, since they are also close to other elements that belong to the same concept. Motivated by this kind of situation, we introduce the concept of confidence of a merge. The confidence of a merge is related to the distance between the elements of the proposed merge and other elements near them.


If a pair of elements is close to each other but far from all other elements, the confidence of merging these two elements is high, since apparently there is no good alternative. However, if they are also close to other elements, it might be advisable to ask a user to check whether there is a better merge.
Formally, a confidence value can be calculated as follows. Considering a distance function dist(·,·) between elements to merge, the natural merge is between the nearest pair of elements a and b, where dist(a, b) = d_{a,b} = min{dist(x, y) : x ≠ y}. The confidence C of this merge is the difference between d_{a,b} and d_{e,f}, where d_{e,f} = min{dist(x, y) : x ≠ y, {x, y} ≠ {a, b}, x ∈ {a, b} ∨ y ∈ {a, b}}, i.e., the distance of the best alternative merge involving a or b.
Merges having low confidence values are taken as points where the algorithm is more likely to make incorrect decisions (misclusterings). So, HCAC detects low confidence merges and queries the human to check whether a better alternative merge exists. In practical terms, low confidence merges are those whose confidence is below a predefined threshold: the higher the threshold value, the more user interaction is requested. In this work, we also propose a calibration procedure to estimate this threshold with respect to the amount of tolerated interaction. This is done through an unsupervised execution of the hierarchical clustering algorithm, in a spirit similar to [18]. At each step of this unsupervised execution, the confidence value is calculated. At the end of this procedure, an adequate threshold value is selected according to the desired number of human interactions. This procedure is described in Algorithm 1.

Algorithm 1. Threshold calibration procedure
  Input: n: number of elements in the dataset; dist(.,.): distance function; q: desired number of human interactions
  Output: confT: confidence threshold value
  Initialize vector C with n - 1 positions;
  for k = 1 : n - 1 (each merge of the unsupervised run) do
    minDist_k = d_{i,j} = min{dist(x, y) : x ≠ y};
    secMinDist_k = d_{r,s} = min{dist(x, y) : x ≠ y, {x, y} ≠ {i, j}, x ∈ {i, j} ∨ y ∈ {i, j}};
    C_k = secMinDist_k - minDist_k;
  end
  Order vector C;
  confT = C[q];
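A minimal Python sketch of this confidence computation and of the calibration procedure of Algorithm 1 could look as follows; the helper names are ours, and the confidences passed to the calibration are assumed to have been recorded along an unsupervised run of the clustering algorithm, one value per merge.

    import numpy as np

    def merge_confidence(D):
        """Confidence of the next merge for a symmetric distance matrix D
        over the current elements (illustrative sketch of Section 3.1).
        Returns (i, j, confidence): the closest pair and the gap between its
        distance and the best alternative merge involving i or j."""
        n = D.shape[0]  # assumes at least three elements remain
        pairs = [(D[x, y], x, y) for x in range(n) for y in range(x + 1, n)]
        d_best, i, j = min(pairs)
        # best alternative merge that still involves i or j
        d_alt = min(d for d, x, y in pairs
                    if (x, y) != (i, j) and (x in (i, j) or y in (i, j)))
        return i, j, d_alt - d_best

    def calibrate_threshold(confidences, q):
        """Algorithm 1 in spirit: given the confidences recorded along an
        unsupervised run, return the threshold below which roughly q merges
        would have triggered a query."""
        return sorted(confidences)[q - 1]

For a tolerated amount of interaction of, say, 10% of the merges, q would simply be set to 0.1·(n - 1), rounded to an integer.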

With a calibrated threshold, we have a criterion for when to make queries. In the next section, we explain how the user can interact with HCAC in order to guide the clustering process.

3.2 Cluster-Level Constraints

When a low confidence merge is spotted, the user is queried for additional information. The response comes in the form of a constraint. In general, constraints can be stated at instance level or at cluster level, where whole subclusters are considered instead of single instances. In our proposal, we use cluster-level constraints.


Cluster-level constraints can obviously convey more information than instance-level ones, which can reduce the number of user interventions. Instance-level queries, however, can be more easily resolved by the human.
In HCAC, a cluster-level query is posed to acquire a cluster-level constraint when a low confidence merge is detected. For that, a pool of pairs of clusters is presented to the user, who chooses the pair that corresponds to the best merge. The pool contains the c nearest pairs of clusters, where c is given a priori. The generation of this pool is described in Algorithm 2. It starts by finding the best unsupervised merge (the two nearest clusters i and j). After that, the c - 1 best unsupervised merges involving i or j are included. This assembling procedure has a cost that is linear in the number of elements (O(n), where n is the number of elements).

Algorithm 2. Procedure for assembling the pool of cluster pairs
  Input: n: number of elements in the dataset; dist(.,.): distance function; c: size of the pool of clusters
  Output: P_k: pool of pairs of clusters on the k-th iteration
  Initialize vector P with c positions;
  P[1] = (i, j) = argmin_{x,y} {dist(x, y) : x ≠ y};
  for l = 2 : c do
    P[l] = (r, s) such that (r, s) ∉ P and dist(r, s) = min{dist(x, y) : x ≠ y, (x, y) ∉ P, x ∈ {i, j} ∨ y ∈ {i, j}};
  end
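A possible rendering of Algorithm 2 in Python is sketched below. For clarity it sorts all candidate pairs, which costs O(n² log n); the linear-time behaviour stated above would instead come from a single scan that keeps only the best pairs involving i or j.

    def assemble_pool(D, c):
        """Illustrative sketch of Algorithm 2: pool of up to c candidate
        merges built around the closest pair, given the current symmetric
        distance matrix D between clusters."""
        n = D.shape[0]
        pairs = sorted((D[x, y], x, y) for x in range(n) for y in range(x + 1, n))
        _, i, j = pairs[0]
        pool = [(i, j)]
        # the c - 1 next-best merges that involve cluster i or cluster j
        for _, x, y in pairs[1:]:
            if len(pool) == c:
                break
            if x in (i, j) or y in (i, j):
                pool.append((x, y))
        return pool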

The higher the value of c, the more options the user has, and the better the chances of finding a good choice. However, a large number of cluster pairs may demand excessive human effort. Moreover, dealing with a pool of clusters may not be trivial. This drawback can be mitigated with good summarizing cluster representations, such as wordclouds (for textual datasets) or parallel coordinates [15] (for non-textual datasets).
The adoption of the active confidence-based approach tries to optimize the user's intervention. Moreover, the adoption of this kind of cluster-level constraint and this new kind of query tends to generate clusters with high purity, as it helps to better determine the cluster boundaries. This makes HCAC especially useful when dealing with datasets with a large number of clusters.
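Putting the pieces together, one possible reading of a complete HCAC run on top of average-link agglomeration is sketched below. It reuses merge_confidence and assemble_pool from the sketches above, and ask_user stands for the cluster-level query; the names and the bookkeeping details are illustrative assumptions, not taken from an actual implementation.

    import numpy as np

    def hcac(D, conf_threshold, pool_size, ask_user):
        """Illustrative sketch of an HCAC run (average-link agglomeration)."""
        D = D.astype(float).copy()
        clusters = [[k] for k in range(D.shape[0])]
        merges = []
        while len(clusters) > 1:
            if len(clusters) == 2:
                i, j = 0, 1                        # only one possible merge left
            else:
                i, j, conf = merge_confidence(D)
                if conf < conf_threshold:          # low-confidence merge: query the user
                    i, j = ask_user(assemble_pool(D, pool_size))
            merges.append((clusters[i][:], clusters[j][:]))
            # average-link (UPGMA) update of the distances to the merged cluster
            ni, nj = len(clusters[i]), len(clusters[j])
            new_row = (ni * D[i, :] + nj * D[j, :]) / (ni + nj)
            D[i, :] = new_row
            D[:, i] = new_row
            D[i, i] = 0.0
            clusters[i].extend(clusters[j])
            D = np.delete(np.delete(D, j, axis=0), j, axis=1)
            del clusters[j]
        return merges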

4 Experimental Evaluation

To evaluate HCAC, we have carried out two sets of experiments. The first one used 22 artificially generated bi-dimensional datasets, varying the number of clusters in each dataset from 2 to 100 (datasets available at http://sites.labic.icmc.usp.br/bmnogueira/artificial.html). All datasets are perfectly balanced, with 30 examples in each cluster. Each cluster is formed by the combination of two normal distributions (one for the x-axis and the other for the y-axis), separated by a constant distance, and the clusters are therefore well shaped.


The main objective of this experiment is to see how the method's performance varies according to the number of clusters in the dataset.
In the second set of experiments, we have assessed the performance of our method on 31 real-world datasets from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) and from the MULAN repository (http://mulan.sourceforge.net/datasets.html). These datasets are approximately balanced and have labelled instances, which enables the objective evaluation of clustering results. A brief description of these datasets is given in Table 1. The evaluation methodology applied to these datasets and the obtained results are presented in the following sections.

Table 1. Description of the real-world datasets used in the experiments. MULAN datasets are highlighted with the symbol '*'.

  Dataset                   # Examples  # Classes    Dataset              # Examples  # Classes
  Balance                        625        3        MFeat                    2000       10
  Breast Cancer Wisconsin        683        2        Musk                      476        2
  Breast Tissue                  106        6        Pima                      768        2
  Cardiotocography              2126       10        Scene*                   2417       15
  Ecoli                          336        8        Secom                    1151        2
  Emotions*                      593       27        Sonar                     208        2
  Glass                          214        6        Soybean                   266       15
  Haberman                       306        2        Spectf                    267        2
  Image Segmentation             210        7        Statlog Satellite        4435        7
  Ionosphere                     351        2        Transfusion               748        2
  Iris                           150        3        Vehicle                   846        4
  Isolet                        1559       26        Vertebral Column          310        3
  Libras                         360       15        Vowel                     990       10
  Lung Cancer                     27        3        Wine                      178        3
  Madelon                        600        2        Zoo                       101        7
  Mammographic Masses            830        2

4.1 Evaluation Methodology

We have compared HCAC with three standards: an unsupervised algorithm, used as a baseline; a semi-supervised algorithm using must-link and cannot-link pairwise constraints [28]; and the active constrained hierarchical clustering process proposed in [18] (Constrained Complete-Link - CCL), which also uses cluster-level constraints along the clustering process. The CCL algorithm uses a complete-link strategy to perform the cannot-link propagation, while the other two approaches use the average-link strategy [16]. The comparison with the baseline unsupervised algorithm is done to assess the ability of the semi-supervised algorithms to exploit user-provided information.
We simulated the human interaction in the semi-supervised algorithms by using the labels provided with the datasets. The idea is to automatically answer the queries using a sensible criterion that models the user's behaviour, as sketched below. In HCAC, for the cluster-level queries, the criterion for choosing the best cluster merge is entropy [24]: among the pairs in the pool, the one whose merge yields the lowest entropy value is selected. For the algorithm using pairwise constraints, we randomly pick pairs of instances before the clustering process starts. As suggested in [10], if the elements belong to the same class, a must-link constraint is added and the distance between the pair is set to zero; otherwise, a cannot-link constraint is added and the distance is set to infinity.

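A sketch of this simulated supervision is given below; the helper names are hypothetical and only illustrate the evaluation protocol (entropy of the class labels of the would-be merged cluster for the HCAC queries, and random labelled pairs for the pairwise constraints).

    import math
    import random
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a multiset of class labels."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def answer_cluster_query(pool, members, y):
        """Simulated user for HCAC: among the candidate pairs in the pool, pick
        the merge whose union of true labels has the lowest entropy.
        members[k] lists the element ids in cluster k; y[e] is the label of e."""
        return min(pool, key=lambda p: entropy([y[e] for e in members[p[0]] + members[p[1]]]))

    def random_pairwise_constraints(y, m, rng=random):
        """Simulated pairwise supervision: m random pairs become must-links
        (same class) or cannot-links (different classes)."""
        n = len(y)
        must, cannot = [], []
        for _ in range(m):
            i, j = rng.sample(range(n), 2)
            (must if y[i] == y[j] else cannot).append((i, j))
        return must, cannot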


Finally, for the CCL algorithm, it was established that the roots of the next proposed merge are merged if they present an entropy equal to or lower than 0.2.
We have tried different numbers of human interventions in the clustering process (number of pairwise queries or cluster-level queries). We have varied the number of desired interventions over 1%, 5%, 10%, 20%, ..., 100% of the number of merges in the agglomerative clustering process (which is equal to the number of instances in the dataset minus one). In the case of the HCAC algorithm, we have also tested two different numbers of pairs of elements in the pool: 5 and 10. In a real application, the usage of 10 pairs in the pool may not be a viable configuration, since it would demand too much effort from the user. However, we decided to include this configuration in order to analyse how the size of the pool impacts HCAC's performance.
In the evaluation, we used 10-fold cross-validation. For each dataset, in each experiment configuration, the algorithms were applied 10 times, always leaving one fold out of the dataset. Each resulting clustering was evaluated through the FScore measure [20], which is very adequate for hierarchical clustering. The FScore for a class K_i is the maximum value of FScore obtained at any cluster C_j of the hierarchy, which can be calculated according to Equation 1:

    F(K_i, C_j) = \frac{2 \cdot R(K_i, C_j) \cdot P(K_i, C_j)}{R(K_i, C_j) + P(K_i, C_j)}        (1)

where R(K_i, C_j) is the recall for the class K_i in the cluster C_j, defined as n_{ij} divided by the size of C_j (n_{ij} is the number of elements in C_j that belong to K_i), and P(K_i, C_j) is the precision, defined as n_{ij} divided by the size of K_i. The FScore value for a clustering is calculated as the weighted average of the FScore for each class, as shown in Equation 2:

    FScore = \sum_{i=1}^{c} \frac{n_i}{n} \, F(K_i)        (2)

where c is the number of classes, n_i is the number of elements of class K_i, n is the total number of elements, and F(K_i) is the FScore of class K_i defined above.
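Under these definitions, the FScore of one clustering result could be computed as in the following illustrative sketch, where clusters is assumed to hold every cluster of the hierarchy, at all levels.

    def fscore(classes, clusters):
        """Hierarchical FScore (Equations 1 and 2), following the recall and
        precision definitions given in the text.
        classes:  dict mapping a class id to the set of its element ids
        clusters: list of sets, one per cluster of the hierarchy"""
        n = sum(len(members) for members in classes.values())
        total = 0.0
        for k_members in classes.values():
            best = 0.0
            for c_members in clusters:
                n_ij = len(k_members & c_members)
                if n_ij == 0:
                    continue
                recall = n_ij / len(c_members)       # n_ij / |C_j|, as in the text
                precision = n_ij / len(k_members)    # n_ij / |K_i|
                best = max(best, 2 * recall * precision / (recall + precision))
            total += (len(k_members) / n) * best
        return total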

The final FScore value for a given dataset is the average of the FScore values over the clustering results of the folds. The non-parametric Wilcoxon test [29] was used to detect statistical significance in the differences of the algorithms' performance, considering an α of 0.05. The test was applied to compare the HCAC algorithm against each of the other algorithms.

4.2 Results

The statistical comparison of the results of the first set of experiments, using artificial datasets, is shown in Table 2. It can be easily noticed that HCAC statistically outperforms all other compared algorithms in most of the configurations. An FScore comparison for some of these artificial datasets is presented in Figure 2. In this figure, it can be observed that the algorithms' performance decays as the number of clusters increases. It can also be noticed that this decay is stronger for the methods that use no constraints or instance-level constraints (pairwise and average-link). So, algorithms that employ cluster-level constraints (HCAC and CCL) tend to perform much better than the other methods, especially when the number of clusters is high. In particular, HCAC tends to outperform all other algorithms when the number of clusters in the dataset is greater than three.


Table 2. Results of the statistical comparisons in the artificial datasets. The symbol ▲ indicates that HCAC wins with statistical significance; △ indicates that HCAC wins with no statistical significance; ▽ indicates that HCAC loses with no statistical significance. Each symbol is followed by the number of datasets in which HCAC performs better and worse than the compared algorithm.

                          5 Pairs                              10 Pairs
   %    Pairwise     CCL          Average      Pairwise     CCL          Average
   1    △ 13 - 4    ▲ 22 - 0     △ 12 - 5     ▲ 14 - 4    ▲ 22 - 0     ▲ 12 - 6
   5    ▲ 19 - 3    ▲ 22 - 0     ▲ 16 - 5     ▲ 19 - 3    ▲ 22 - 0     ▲ 19 - 3
  10    ▲ 20 - 2    ▲ 22 - 0     ▲ 21 - 1     ▲ 21 - 1    ▲ 22 - 0     ▲ 19 - 2
  20    ▲ 21 - 1    ▲ 22 - 0     ▲ 21 - 1     ▲ 22 - 0    ▲ 21 - 1     ▲ 19 - 3
  30    ▲ 21 - 1    ▲ 20 - 2     ▲ 22 - 0     ▲ 22 - 0    ▲ 21 - 1     ▲ 21 - 0
  40    ▲ 22 - 0    ▲ 20 - 2     ▲ 22 - 0     ▲ 22 - 0    ▲ 21 - 1     ▲ 22 - 0
  50    ▲ 21 - 1    ▲ 20 - 2     ▲ 22 - 0     ▲ 21 - 1    ▲ 21 - 1     ▲ 22 - 0
  60    ▲ 20 - 2    ▲ 18 - 4     ▲ 21 - 0     ▲ 21 - 1    ▲ 21 - 1     ▲ 22 - 0
  70    ▲ 22 - 0    ▲ 19 - 3     ▲ 22 - 0     ▲ 22 - 0    ▲ 20 - 2     ▲ 21 - 1
  80    ▲ 21 - 1    ▲ 18 - 4     ▲ 22 - 0     ▲ 22 - 0    ▲ 22 - 0     ▲ 22 - 0
  90    ▲ 22 - 0    ▲ 22 - 0     ▲ 22 - 0     ▲ 22 - 0    ▲ 22 - 0     ▲ 22 - 0
 100    ▲ 22 - 0    ▲ 22 - 0     ▲ 22 - 0     ▲ 22 - 0    ▲ 22 - 0     ▲ 22 - 0

This tendency can be explained by the nature of the constraints. In general, the more clusters a dataset has, the more complex it is and the more information is needed to correctly delimit them. With pairwise constraints, the user indicates whether two instances do or do not belong to the same cluster. Our proposed cluster-level constraints, on the other hand, indicate that two groups of instances must be merged. So, with a cluster-level constraint, the number of instances influenced and the quantity of information added are higher. Moreover, our active learning approach tends to request the user's intervention at points that can be regarded as cluster borders. The more clusters a dataset has, the more border regions are present and the higher are the chances of misclusterings.

Fig. 2. Results for the artificial datasets. On the perimeter we have the level of human intervention (in %). On the radius we have the F-score.


Fig. 3. Comparison of HCAC and semi-supervised approaches. On the X axis we have the number of clusters in the dataset. On the Y axis, we have the HCAC victory rate.

In order to highlight this variation in behaviour as the number of clusters increases, Figure 3 compares HCAC with the pairwise-constrained and CCL approaches according to the number of clusters in the dataset. On the horizontal axis we have the number of clusters. For each number of clusters, we calculated the victory rate of the HCAC algorithm over the compared algorithm: the proportion of cases in which HCAC presents a higher FScore than the compared algorithm, with respect to the total number of comparisons, considering all datasets with the same number of clusters and all user intervention percentages. Two victory rate lines were then plotted, one for each experiment configuration (5 and 10 pairs). According to the results, the HCAC algorithm tends to have an advantage over pairwise constraints (rate above 0.5) in datasets with a large number of clusters.
The results of the statistical comparisons of all of the real-world dataset experiments can be observed in Table 3, and the FScores for these datasets are shown in Figure 4. As the clusters are not as well shaped as those in the artificial datasets, the performance of all methods decreases considerably compared to the artificial datasets. HCAC, however, tends to outperform CCL and the baseline algorithm in most of the comparisons. We can also see that there is a non-significant improvement in HCAC's performance when more pairs are presented to the user. This improvement was expected, since with more pairs HCAC is able to exploit extra information; however, increasing the number of pairs in the pool has a cognitive cost. Comparing the HCAC method with the unsupervised algorithm, a clear advantage only appears from about 30%-40% of user interventions. The results in Figure 4 show that with fewer interventions the performance of HCAC and average-link are very similar, with non-significant wins and losses. Also, the performance of HCAC is very similar to the pairwise constrained approach, alternating wins and losses. This indicates that, in a general way, the quality of the information added is very similar in both approaches. In the comparison with CCL, HCAC tends to outperform this algorithm in all comparisons, presenting statistically significantly better performance even with just a few user interventions.


Fig. 4. Results for real-world datasets (number of clusters in parentheses). On the perimeter we have the level of human intervention (in %). On the radius we have the F-score.


Table 3. Results of the statistical comparisons on all of the real-world datasets (▲: HCAC wins with statistical significance; △: HCAC wins with no statistical significance; ▽: HCAC loses with no statistical significance; =: equal number of wins and losses).

                          5 Pairs                                 10 Pairs
   %    Pairwise      CCL          Average       Pairwise      CCL          Average
   1    △ 15 - 10    ▲ 23 - 8     △  9 - 7      △ 16 - 9     ▲ 23 - 8     △ 11 - 7
   5    △ 16 - 14    ▲ 21 - 9     ▽ 11 - 14     ▽ 12 - 18    ▲ 19 - 11    △ 13 - 11
  10    △ 15 - 14    ▲ 21 - 9     ▲ 18 - 8      = 15 - 15    ▲ 22 - 7     △ 16 - 10
  20    = 15 - 15    ▲ 20 - 10    △ 15 - 12     ▽ 14 - 16    △ 16 - 14    = 13 - 13
  30    = 15 - 15    ▲ 25 - 5     △ 16 - 13     ▽ 14 - 16    △ 21 - 9     ▲ 20 - 9
  40    ▽ 13 - 17    ▲ 22 - 8     ▲ 23 - 6      = 15 - 15    ▲ 23 - 7     ▲ 23 - 6
  50    ▽ 14 - 16    ▲ 24 - 6     ▲ 22 - 8      △ 17 - 13    △ 20 - 10    ▲ 26 - 4
  60    △ 19 - 11    ▲ 23 - 7     ▲ 26 - 4      △ 18 - 12    △ 18 - 12    ▲ 27 - 3
  70    △ 16 - 14    ▲ 22 - 8     ▲ 26 - 4      ▲ 20 - 10    △ 19 - 11    ▲ 29 - 1
  80    △ 19 - 11    ▲ 23 - 7     ▲ 30 - 0      ▲ 21 - 9     ▲ 23 - 7     ▲ 29 - 1
  90    ▲ 20 - 10    ▲ 23 - 7     ▲ 29 - 1      ▲ 26 - 4     ▲ 26 - 4     ▲ 30 - 0
 100    ▲ 29 - 1     ▲ 25 - 5     ▲ 30 - 0      ▲ 30 - 0     ▲ 29 - 1     ▲ 30 - 0

In these real-world dataset experiments, it can be noticed that HCAC does not achieve a better performance than the other algorithms in some comparisons. These results, especially when comparing with the pairwise constrained algorithm, are highly influenced by the binary datasets. As seen in Figure 4, in binary datasets the performance of the algorithms that use no constraints or pairwise constraints is very similar to the HCAC performance. In binary datasets, as there are fewer cluster borders than in datasets with more clusters, it is easier to correctly delimit cluster boundaries. So, less information needs to be inserted, which makes instance-level constraints efficient in this context.
In order to compare the performance of the algorithms in the presence of more cluster borders in real-world datasets, we have also carried out a statistical comparison of the algorithms' performance on datasets with more than two clusters. In this comparison, we have used the 20 real-world datasets that contain three or more clusters. The results of this comparison can be seen in Table 4. It is possible to observe that the performance of HCAC tends to be better than that of all the other algorithms when the number of clusters in the dataset is greater than two. As shown in Figure 3, HCAC tends to present a winning rate above 0.5 against all of the other algorithms in almost all of the non-binary datasets. In this figure, the unexpected result for the 10-cluster datasets is influenced by the results of the Cardiotocography dataset, in which all algorithms achieve the optimal solution.

Table 4. Results of statistical comparisons on real-world datasets with more than two clusters (symbols as in Table 3).

                          5 Pairs                              10 Pairs
   %    Pairwise     CCL          Average      Pairwise     CCL          Average
   1    △  9 - 6    ▲ 17 - 3     △  7 - 4     △  8 - 7    ▲ 17 - 3     △  8 - 4
   5    △ 10 - 9    ▲ 14 - 5     △  9 - 8     △ 10 - 9    ▲ 15 - 4     △ 11 - 6
  10    △ 10 - 9    ▲ 14 - 5     ▲ 16 - 2     △ 10 - 9    ▲ 15 - 4     △ 11 - 7
  20    △ 12 - 7    ▲ 12 - 7     △ 13 - 5     △ 12 - 7    △ 11 - 8     △ 12 - 6
  30    △ 11 - 8    △ 14 - 5     ▲ 12 - 6     △ 11 - 8    △ 13 - 6     ▲ 14 - 4
  40    △ 11 - 8    ▲ 14 - 5     ▲ 17 - 1     △ 11 - 8    △ 14 - 5     ▲ 16 - 2
  50    △ 11 - 8    ▲ 15 - 4     ▲ 17 - 2     △ 14 - 5    △ 15 - 4     ▲ 18 - 1
  60    △ 15 - 4    ▲ 15 - 4     ▲ 18 - 1     ▲ 14 - 5    △ 14 - 5     ▲ 18 - 1
  70    ▲ 13 - 6    ▲ 15 - 4     ▲ 17 - 2     ▲ 15 - 4    △ 15 - 4     ▲ 18 - 1
  80    ▲ 15 - 4    ▲ 15 - 4     ▲ 19 - 0     ▲ 15 - 4    ▲ 15 - 4     ▲ 19 - 0
  90    ▲ 16 - 3    ▲ 16 - 3     ▲ 19 - 0     ▲ 18 - 1    ▲ 17 - 2     ▲ 19 - 0
 100    ▲ 18 - 1    ▲ 14 - 5     ▲ 19 - 0     ▲ 19 - 0    ▲ 18 - 1     ▲ 19 - 0


5 Conclusions

In this work, we presented HCAC, an active semi-supervised hierarchical clustering method. HCAC uses cluster-level constraints with which the user can indicate a pair of clusters to be merged. It also uses a new active learning process based on the concept of merge confidence. We have also devised a method for determining an adequate confidence threshold given a maximum amount of user effort.
When dealing with well shaped clusters, HCAC outperformed all other algorithms in most of the comparisons. In real-world datasets, with cluster boundaries that are hard to detect, HCAC also presented a good performance. When compared to the pairwise constrained approach, HCAC showed a slight advantage using pools of 5 and 10 pairs. While the pairwise constrained approach has the advantage of dealing with instances, which are more intuitive, the HCAC cluster-level constraints can convey more information, which can reduce the number of human interventions. Also, HCAC has the advantage of pre-selecting a pool of clusters for the user in linear time, thus efficiently reducing the number of pairs to be analysed by the user. HCAC also outperformed another active semi-supervised method with cluster-level constraints (CCL) in most of the comparisons. Moreover, HCAC presented a good performance when compared to the unsupervised algorithm. These results indicate that it is worthwhile to exploit the concept of confidence.
Empirical results also indicate that HCAC is particularly useful on datasets with a large number of clusters. This characteristic is due to the cluster-level nature of the constraints, as well as to the active learning approach, which helps to delimit cluster borders. In real-world datasets, HCAC presented better performance on datasets with more than three clusters, which is the case in many real applications.
The application of HCAC has the limitation of requiring an adequate description of the groups when presenting the pairs of elements to the user; a poor description may lead the user to incorrect decisions. We are investigating adequate ways to formulate cluster-level queries so that the user can provide constraints with minimal cognitive effort, such as parallel coordinates [15] for non-textual datasets or wordclouds for textual ones.
In future work, we intend to improve the performance of the HCAC method by exploiting constraint propagation. With the simple approach used in this work, HCAC presented significant results, achieving statistically significant improvements in comparison to state-of-the-art clustering algorithms (both unsupervised and semi-supervised). This shows that the information inserted by HCAC into the clustering process is relevant and that our method can successfully detect points at which to ask for the user's intervention. By using constraint propagation, we believe that we can minimize the number of constraints and achieve better results by propagating the added information to other instances besides the ones involved in constraints. Furthermore, we intend to measure the performance of the algorithm when dealing with textual datasets and compare it with active pairwise constrained approaches for this kind of dataset.


Acknowledgments. This work is part-funded by the ERDF - European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness), by the Portuguese Funds through the FCT (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-022701 and by the EU project ePolicy, FP7-ICT-2011-7, grant agreement 288147. We also acknowledge the funding provided by EBWII (EU), CAPES and FAPESP - Project 2011/19850-9 (Brazil).

References

[1] Bade, K., Hermkes, M., Nürnberger, A.: User Oriented Hierarchical Information Organization and Retrieval. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 518–526. Springer, Heidelberg (2007)
[2] Basu, S., Banerjee, A., Mooney, R.J.: Semi-supervised clustering by seeding. In: ICML 2002: Proceedings of the Nineteenth International Conference on Machine Learning, pp. 27–34. Morgan Kaufmann Publishers Inc., San Francisco (2002)
[3] Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise constrained clustering. In: SDM 2004: Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 333–344. SIAM, Philadelphia (2004)
[4] Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: ICML 2004: Proceedings of the 21st International Conference on Machine Learning, pp. 81–88. ACM, New York (2004)
[5] Böhm, C., Plant, C.: Hissclu: a hierarchical density-based method for semi-supervised clustering. In: EDBT 2008: Proceedings of the 11th International Conference on Extending Database Technology, pp. 440–451. ACM, New York (2008)
[6] Cohn, D., Caruana, R., Mccallum, A.: Semi-supervised clustering with user feedback - technical report tr2003-1892. Technical report, Cornell University (2003)
[7] Daniels, K., Giraud-Carrier, C.: Learning the threshold in hierarchical agglomerative clustering. In: ICMLA 2006: Proceedings of the 5th International Conference on Machine Learning and Applications, pp. 270–278. IEEE, Washington, DC (2006)
[8] Dasgupta, S., Ng, V.: Which clustering do you want? Inducing your ideal clustering with minimal feedback. Journal of Artificial Intelligence Research 39, 581–632 (2010)
[9] Davidson, I., Ravi, S.S.: Clustering with constraints: Feasibility issues and the k-means algorithm. In: SDM 2005: Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 138–149. SIAM, Philadelphia (2005)
[10] Davidson, I., Ravi, S.S.: Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Mining and Knowledge Discovery 18(2), 257–282 (2009)
[11] Domeniconi, C., Peng, J., Yan, B.: Composite kernels for semi-supervised clustering. Knowledge and Information Systems 24(1), 1–18 (2010)
[12] Huang, R., Lam, W.: An active learning framework for semi-supervised document clustering with language modeling. Data and Knowledge Engineering 68(1), 49–67 (2009)
[13] Huang, Y., Mitchell, T.M.: Text clustering with extended user feedback. In: SIGIR 2006: Proceedings of the 29th ACM Conference on Research and Development in Information Retrieval, pp. 413–420. ACM, New York (2006)
[14] Huang, Y., Mitchell, T.M.: Exploring hierarchical user feedback in email clustering. In: EMAIL 2008: Proceedings of the Workshop on Enhanced Messaging - AAAI 2008, pp. 36–41. AAAI Press (2008)
[15] Inselberg, A.: Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. Springer, Secaucus (2009)
[16] Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Prentice-Hall, Inc., Upper Saddle River (1988)


[17] Kestler, H.A., Kraus, J.M., Palm, G., Schwenker, F.: On the Effects of Constraints in Semi-supervised Hierarchical Clustering. In: Schwenker, F., Marinai, S. (eds.) ANNPR 2006. LNCS (LNAI), vol. 4087, pp. 57–66. Springer, Heidelberg (2006)
[18] Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In: ICML 2002: Proceedings of the 19th International Conference on Machine Learning, pp. 307–314. Morgan Kaufmann Publishers, San Francisco (2002)
[19] Kumar, N., Kummamuru, K., Paranjpe, D.: Semi-supervised clustering with metric learning using relative comparisons. In: ICDM 2005: Proceedings of the 5th IEEE International Conference on Data Mining, pp. 693–696. IEEE, Washington, DC (2005)
[20] Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: KDD 1999: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22. ACM, New York (1999)
[21] Miyamoto, S., Terami, A.: Constrained agglomerative hierarchical clustering algorithms with penalties. In: FUZZ 2011: 2011 IEEE International Conference on Fuzzy Systems, pp. 422–427 (2011)
[22] Nogueira, B., Jorge, A., Rezende, S.: Hierarchical confidence-based active clustering. In: SAC 2012: Proceedings of the 27th ACM Symposium on Applied Computing, pp. 535–536. ACM, New York (2012)
[23] Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009)
[24] Shannon, C.: A mathematical theory of communication. ACM SIGMOBILE Mobile Comp. and Communications Rev. 5, 3–55 (2001)
[25] Talavera, L., Béjar, J.: Integrating Declarative Knowledge in Hierarchical Clustering Tasks. In: Hand, D.J., Kok, J.N., Berthold, M. (eds.) IDA 1999. LNCS, vol. 1642, pp. 211–222. Springer, Heidelberg (1999)
[26] Vu, V.-V., Labroche, N., Bouchon-Meunier, B.: Boosting clustering by active constraint selection. In: ECAI 2010: Proceedings of the 19th European Conference on Artificial Intelligence, pp. 297–302. IOS Press, Amsterdam (2010)
[27] Vu, V.-V., Labroche, N., Bouchon-Meunier, B.: Improving constrained clustering with active query selection. Pattern Recognition 45(4), 1749–1758 (2012)
[28] Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: ICML 2000: Proceedings of the 17th International Conference on Machine Learning, pp. 1103–1110. Morgan Kaufmann Publishers Inc., San Francisco (2000)
[29] Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945)
[30] Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. In: Advances in Neural Information Processing Systems, vol. 15, pp. 505–512. MIT Press, Cambridge (2003)