LogDet Divergence based Metric Learning using Triplet Labels


Jiangyuan Mei ([email protected])
Harbin Institute of Technology, Box 3015, Yikuang Street #2, Harbin, Heilongjiang Province, P.R. China

Meizhu Liu ([email protected])
Siemens Corporate Research, Princeton, NJ 08540, USA

Hamid Reza Karimi ([email protected])
University of Agder, N-4898 Grimstad, Norway

Huijun Gao ([email protected])
Harbin Institute of Technology, Box 3015, Yikuang Street #2, Harbin, Heilongjiang Province, P.R. China

Abstract

Metric learning is fundamental to many learning algorithms and plays a significant role in a wide range of applications. In this paper, we present a LogDet divergence based metric learning approach to learn a Mahalanobis distance over the input space of the instances. In the proposed model, the most natural constraints, triplets, are used as the labels of the training samples. Meanwhile, in order to avoid the overfitting problem, the model uses the LogDet divergence to regularize the learned Mahalanobis matrix to stay as close as possible to a given matrix. Besides, a cyclic iterative algorithm is presented to solve the objective function and accelerate the metric learning process. Furthermore, this paper constructs a novel dynamic triplet building strategy to guarantee that the most useful triplets are used in every training cycle. Experiments on benchmark data sets demonstrate that the proposed model achieves an improved performance when compared with the state-of-the-art methods.

1. Introduction

In practice, measuring the divergence among objects plays a critical role in various computer vision and pattern recognition applications, such as image retrieval, shape detection, face recognition, object
tracking, clustering and classification (Bi et al., 2011; Ong & Bowden, 2004; Cao et al., 2010; Jiang et al., 2011). In order to measure how similar or related instances are, more than one feature is observed for each instance. These features may have different relevance to the category of the instance, and thus the weights of these features should be chosen carefully. In other words, an appropriate distance or similarity metric over the input space of the instances should be learned to measure the divergence among instances. Among various metrics, the Mahalanobis distance parameterized by a Positive Semi-Definite (PSD) matrix M has considerable advantages over other metrics. Firstly, the Mahalanobis distance is scale invariant, which means that rescaling the distance has no effect on the performance of classification or clustering. Secondly, the metric takes into account the correlations among different features, since the off-diagonal elements of the Mahalanobis matrix are in general nonzero, which helps build a more accurate relationship among instances. For these reasons, the Mahalanobis distance has become the most popular metric (Bar-Hillel et al., 2003; Xing et al., 2002; Xiang et al., 2008). A good metric learning algorithm should be able to emphasize relevant dimensions while reducing the influence of non-informative dimensions (Liu & Vemuri, 2012). Therefore, special attention should be paid to three properties when learning a Mahalanobis matrix. The first property is that the learning algorithm should be global, that is to say, all the useful samples should be used for training as far as possible. However, in most cases, only a part of the whole sample set can be used in the training process due to the limitation of algorithmic efficiency. In practical applications,
inappropriately selecting part of the whole samples often results in overfitting. Thus, how to select efficient training samples should be carefully considered in the metric learning algorithm. The second property is that the labels of the training samples should be as weak as possible. In most real-world applications, it is hard to obtain strict labels for the training samples, so weaker labels are more practical. Classical supervised metric learning methods are usually classified into two types (Niu et al., 2012): (a) supervised learning algorithms with class labels (Sugiyama, 2007); (b) supervised learning algorithms with data pair labels which indicate the similarity or dissimilarity of the pairs (Weinberger et al., 2006; Davis et al., 2007); the data pair labels are weaker than the class labels. In the literature (Liu & Vemuri, 2012; Bi et al., 2011; Liu et al., 2011), an even weaker representation, which is commonly used in information retrieval, has been introduced into metric learning algorithms. In these works, the triplets (i, j, k), which indicate that the i-th instance is more similar to the j-th instance than to the k-th instance, are used to train the Mahalanobis matrix. The work (Bi et al., 2011) pointed out that the proximity triplets can be derived from other formats of constraints, but not vice versa. Thus, they are the weakest representation as well as the most natural constraint for learning a metric. The third property is that the algorithmic efficiency should not be too low. That is to say, metric learning algorithms should be scalable with respect to the size of the training set. In this paper, considering the above three properties, a novel and practical metric learning model is proposed. In order to weaken the training labels and make them easier to obtain in real applications, triplet labels are utilized to train the Mahalanobis matrix. At the same time, the LogDet divergence is used to measure the distance between the objective Mahalanobis matrix M and a given Mahalanobis matrix M_0 to make sure that these two matrices are as close as possible. This regularization can guarantee the stability of the metric learning process and effectively avoid the overfitting problem. Then, a cyclic iterative algorithm is presented to solve the proposed metric learning model and simultaneously improve the efficiency of the learning process. Besides, a novel triplet building strategy is presented in this work to allow as many useful triplets as possible to be used for training. These triplets are updated at the beginning of every cycle to make sure that the most useful triplets are used to train the Mahalanobis matrix. Furthermore, several experiments are conducted to compare the proposed algorithm with existing state-of-the-art metric learning algorithms to demonstrate its performance. Meanwhile, the relationships between performance and some parameters are also illustrated in the experiments. These relationships will help to select appropriate parameters to achieve the best performance in real applications.

2. Related Work

There are many metric learning algorithms in the literature. In (Weinberger et al., 2006), Large Margin Nearest Neighbor (LMNN) is proposed, and the metric is learned by maintaining consistency in the data's neighborhood and keeping a large margin at the boundaries of different categories. In (Shen et al., 2009), the proposed BoostMetric algorithm improves the LMNN method by using the loss function to derive an AdaBoost-like optimization procedure. Besides, triplets are used to train the Mahalanobis matrix, and the Mahalanobis matrix is composed as a positive linear combination of trace-one rank-one matrices, which are regarded as the weak learners. Inspired by these two methods, the MetricBoost algorithm is proposed in (Bi et al., 2011). A bipartite strategy is employed in MetricBoost to greatly improve computational efficiency by decomposing proximity relationships over triplets into pair-wise constraints. Considering the overfitting problem, in (Liu & Vemuri, 2012), a Doubly Regularized Metric Learning (DRML) approach is put forward to improve the robustness of the MetricBoost algorithm. The weights across the training samples are smoothed to guarantee the stability of the solution. At the same time, the obtained rank-one matrices are regularized to discard redundancy. These three methods all achieve high performance on small data sets. However, one of their weak points is that the execution efficiency is relatively low when the dimension of the training data becomes large. In order to represent the global property of the training data, a large number of triplets should be selected, and these three methods all optimize the solution by using these triplets at one time, which leads to low execution efficiency. Another strategy to regularize the metric is to minimize the divergence between the objective matrix and a given matrix. In (Davis et al., 2007), the Information-Theoretic Metric Learning (ITML) method is expressed as a particular Bregman optimization, and the distance function is chosen as the LogDet divergence. This method is fast and scalable because it does not require any eigenvalue computations or semi-definite programming. However, the constraints of the ITML model are strict: the distances among all similar pairs should be smaller than a given threshold while dissimilar pairs'
distances should be larger than another fixed threshold. Hence, one of the weaknesses of the ITML method is that it cannot always obtain the most appropriate metric. Research on how to learn a metric accurately, robustly and efficiently is therefore worthwhile, promising and challenging.

3. Problem Formulation

Given a dataset {x_i}, with x_i ∈ ℜ^D, i = 1, 2, ..., N, the Mahalanobis distance parameterized by M between x_i and x_j is expressed as

    d_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j).                    (1)
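As a concrete illustration (not part of the original paper; the function name and the use of NumPy are our own assumptions), Equ. 1 can be evaluated as follows:

    import numpy as np

    def mahalanobis_dist(x_i, x_j, M):
        # Squared Mahalanobis distance d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j).
        diff = np.asarray(x_i) - np.asarray(x_j)
        return float(diff @ M @ diff)

    # With M chosen as the identity matrix, this reduces to the squared Euclidean distance.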

The purpose of metric learning is to learn a Mahalanobis distance which represents the relationship of data in the same or different categories. In detail, if x_i and x_j are instances in the same category while x_k is in another category, the objective is to find a matrix M ∈ ℜ^{D×D} which ensures that the distance between x_i and x_j is smaller than that between x_i and x_k, expressed as

    d_M(x_i, x_j) − d_M(x_i, x_k) ≤ −ρ,                    (2)

where ρ > 0, (i, j, k) ∈ T and T represents the triplet set. It is worth pointing out that there might be more than one Mahalanobis matrix M which satisfies all these constraints. If the metric learning model is very complex, the obtained matrix M would jump among several feasible solutions. In order to guarantee the stability of the solution procedure, the authors of (Davis et al., 2007) proposed a new regularization method, which regularizes the objective Mahalanobis matrix M to be as close as possible to a given Mahalanobis matrix M_0:

    arg min_M  D_φ(M, M_0).                    (3)

The function D_φ(·) denotes the Bregman matrix divergence (Kulis et al., 2006) between two matrices, defined as

    D_φ(M, M_0) = φ(M) − φ(M_0) − tr(∇φ(M_0)^T (M − M_0)),                    (4)

where tr(·) stands for the trace of a matrix. It is worth noting that the given Mahalanobis matrix M_0 is not a ground truth. In fact, it only helps to guarantee that the obtained M does not diverge too far from a faithful PSD matrix M_0. Thus M_0 is usually chosen as the identity matrix I (Davis et al., 2007). In Equ. 4, the differentiable function φ(M) plays a critical role in deciding the properties of the Bregman matrix divergence. When the differentiable function φ(M) is chosen as the Burg entropy of the eigenvalues, φ(M) = −log(det(M)), the corresponding Bregman matrix divergence is called the LogDet divergence, denoted as

    D_ld(M, M_0) = tr(M M_0^{-1}) − log det(M M_0^{-1}) − n.                    (5)

There are two advantages to using the LogDet divergence to measure the distance between two matrices M and M_0. On one hand, the works (Dhillon, 2007; Davis et al., 2007) demonstrate that the LogDet divergence between covariance matrices is equivalent to the Kullback-Leibler divergence between the corresponding multivariate Gaussian distributions. On the other hand, the LogDet divergence between two matrices M and M_0 does not change when both matrices undergo the same invertible linear transformation S, i.e. D_ld(M, M_0) = D_ld(S^T M S, S^T M_0 S). This property will be very useful in solving the model below. Considering the above two advantages, the metric learning model in this paper adopts the LogDet divergence to regularize the objective Mahalanobis matrix. Thus, the proposed metric learning model is

    arg min_M  D_ld(M, M_0)
    s.t.       d_M(x_i, x_j) − d_M(x_i, x_k) ≤ −ρ,  (i, j, k) ∈ T,  ρ > 0.                    (6)

In this formulation, the triplets (i, j, k), which represent proximity relationships, are used as constraints. Although the work (Davis et al., 2007) declared that the ITML method could easily be adapted to handle triplet constraints, it is worth pointing out that the ITML method would obtain conservative results in this situation. In other words, the proposed method can achieve more accurate results than ITML when the constraints are given in the form of proximity relationships.
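For concreteness, the following sketch (illustrative only, assuming NumPy and our own function names) evaluates the LogDet divergence of Equ. 5, checks a single triplet constraint from Equ. 6, and numerically verifies the invariance D_ld(S^T M S, S^T M_0 S) = D_ld(M, M_0) that is used in the next section:

    import numpy as np

    def logdet_div(M, M0):
        # LogDet divergence D_ld(M, M0) = tr(M M0^-1) - log det(M M0^-1) - n, Equ. 5.
        n = M.shape[0]
        P = M @ np.linalg.inv(M0)
        return np.trace(P) - np.log(np.linalg.det(P)) - n

    def triplet_satisfied(M, x_i, x_j, x_k, rho):
        # Constraint of Equ. 6: d_M(x_i, x_j) - d_M(x_i, x_k) <= -rho.
        d_ij = (x_i - x_j) @ M @ (x_i - x_j)
        d_ik = (x_i - x_k) @ M @ (x_i - x_k)
        return d_ij - d_ik <= -rho

    # Invariance under an invertible linear transformation S (illustrative check).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    M, M0, S = A @ A.T + np.eye(4), np.eye(4), rng.standard_normal((4, 4))
    assert np.isclose(logdet_div(M, M0), logdet_div(S.T @ M @ S, S.T @ M0 @ S))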

4. Algorithm

The proposed metric learning model has a simple formalism; however, the number of constraints depends on the capacity of the triplet set T. Every triplet contains three instances, which means the size of T is cubic in the capacity of the training dataset {x_i}. Therefore, in this section, an efficient algorithm to seek a feasible solution is proposed. In (Kulis et al., 2006), a cyclic projections method is proposed to seek an optimized kernel matrix under linear inequality constraints. A similar strategy is used to solve the model in Equ. 6.

Lemma 4.1 Suppose that X = [x_1, x_2, ..., x_n]. If K = X^T M X and K_0 = X^T M_0 X, the metric learning model in Equ. 6 can be converted into

    arg min_K  D_ld(K, K_0)
    s.t.       tr(K A_ijk) ≤ −ρ,  K ≥ 0,  ρ > 0,                    (7)

where A_ijk = (e_i − e_j)(e_i − e_j)^T − (e_i − e_k)(e_i − e_k)^T.

Proof. Firstly, as mentioned above, an invertible linear transformation applied to both matrices does not change their LogDet divergence. Thus,

    D_ld(K, K_0) = D_ld(X^T M X, X^T M_0 X) = D_ld(M, M_0).

Secondly, since K = X^T M X, the Mahalanobis distance in Equ. 1 can be rewritten as

    d_M(x_i, x_j) = tr(K (e_i − e_j)(e_i − e_j)^T).

Hence the constraints in Equ. 2 are expressed as

    tr(K ((e_i − e_j)(e_i − e_j)^T − (e_i − e_k)(e_i − e_k)^T)) ≤ −ρ.

Therefore, M is feasible for the metric learning model in Equ. 6 if and only if K is feasible for the model in Equ. 7.

The problem described in Equ. 7 is a convex minimization problem with linear inequality constraints. Convex minimization problems can be solved by various methods, such as the subgradient projection method and interior-point methods. In the proposed metric learning model, the number of inequality constraints is very large, so it is better to use a progressive strategy that makes use of these constraints one at a time; the strategy is summarized as follows. Firstly, the objective function f(K) is defined as f(K) = D_ld(K, K_0). Then, a unique α is sought by solving the following equation system

    ∇f(K_{t+1}) = ∇f(K_t) + α A_ijk^T,
    tr(K_{t+1} A_ijk) ≤ −ρ.                    (8)

The gradient of the objective function f(K) is ∇f(K) = K_0^{-T} − K^{-T}, and K = K^T. Thus, Equ. 8 is simplified to

    K_{t+1}^{-1} = K_t^{-1} − α A_ijk^T,
    tr(K_{t+1} A_ijk) ≤ −ρ.                    (9)

In some situations the matrix K is not full rank and the inverse of K does not exist, so the pseudoinverse is used to substitute for the inverse in these cases. Meanwhile, for the sake of convenience, two symbols v_ij and v_ik are defined to represent (e_i − e_j) and (e_i − e_k). Thus, Equ. 9 is rewritten as

    K_{t+1} = (K_t^† − α v_ij v_ij^T + α v_ik v_ik^T)^†,
    tr(K_{t+1} (v_ij v_ij^T − v_ik v_ik^T)) ≤ −ρ.                    (10)

Lemma 4.2 The K_{t+1} in Equ. 10 can be updated as

    Γ_t = K_t + α K_t v_ij v_ij^T K_t / (1 − α v_ij^T K_t v_ij),
    K_{t+1} = Γ_t − α Γ_t v_ik v_ik^T Γ_t / (1 + α v_ik^T Γ_t v_ik),                    (11)

where Γ_t = (K_t^† − α v_ij v_ij^T)^†.

Proof. First of all, we prove that the pseudoinverse of (K_t^† − α v_ij v_ij^T) is equivalent to K_t + α K_t v_ij v_ij^T K_t / (1 − α v_ij^T K_t v_ij); an efficient way is to demonstrate that their product is the identity matrix I:

    (K_t^† − α v_ij v_ij^T)(K_t + α K_t v_ij v_ij^T K_t / (1 − α v_ij^T K_t v_ij))
    = K_t^† K_t − α v_ij v_ij^T K_t + α K_t^† K_t v_ij v_ij^T K_t / (1 − α v_ij^T K_t v_ij) − α^2 v_ij v_ij^T K_t v_ij v_ij^T K_t / (1 − α v_ij^T K_t v_ij)
    = I − α v_ij v_ij^T K_t + α v_ij (1 − α v_ij^T K_t v_ij) v_ij^T K_t / (1 − α v_ij^T K_t v_ij)
    = I − α v_ij v_ij^T K_t + α v_ij v_ij^T K_t = I.

It deserves noting that α v_ij^T K_t v_ij is a scalar, so the factor (1 − α v_ij^T K_t v_ij) can be factored out in the process. In the following step, the second equation in Equ. 11 is considered. The demonstration is similar to the above step, and it is easy to prove that (K_t^† − α v_ij v_ij^T + α v_ik v_ik^T)(Γ_t − α Γ_t v_ik v_ik^T Γ_t / (1 + α v_ik^T Γ_t v_ik)) = I.

4.1. Solving α

In Equ. 11, the only unknown variable is α, and solving for α is a necessary step to update K_t. Besides Equ. 11, the second inequality in Equ. 10 is another constraint on α, and the K_{t+1} in Equ. 10 can be replaced by the analytical expression in Equ. 11. To simplify the derivation, we suppose that p_1 = v_ij^T K_t v_ij, p_2 = v_ik^T K_t v_ik and p_3 = v_ik^T K_t v_ij = v_ij^T K_t v_ik. Thus tr(K_t v_ij v_ij^T) = p_1, tr(K_t v_ik v_ik^T) = p_2, and

    tr(K_{t+1} (v_ij v_ij^T − v_ik v_ik^T)) = (2α(p_1 p_2 − p_3^2) + (p_1 − p_2)) / (α^2(p_3^2 − p_1 p_2) + α(p_2 − p_1) + 1) ≤ −ρ.                    (12)

For convenience, an objective function g(α) is given as

    g(α) = ρ − (2αA + B) / (α^2 A + αB + 1),                    (13)
where

    A = p_3^2 − p_1 p_2,
    B = p_2 − p_1.

And Equ. 12 is converted into searching for an α which satisfies the constraint g(α) ≤ 0. In general, g(α) may have several singular points, and these singular points influence the existence of a solution. The possible cases of singular points can be distinguished by using the discriminant

    Δ_1 = B^2 − 4A.                    (14)

Another efficient method to judge the existence of a solution of g(α) ≤ 0 is to consider the case g(α) = 0 and analyze its solutions. The equation can be simplified to a quadratic equation in the variable α, expressed as

    aα^2 + bα + c = 0,                    (15)

where

    a = Aρ,  b = Bρ − 2A,  c = ρ − B.

The possible distribution of its roots can be distinguished by using another discriminant

    Δ_2 = b^2 − 4ac.                    (16)

First of all, if there is a singular point in g(α), there must exist an α which satisfies g(α) ≤ 0. This can be proved by demonstrating that Δ_1 ≥ 0 is a sufficient but not necessary condition for Δ_2 ≥ 0. Therefore, α can be selected between the root of α_0^2 A + α_0 B + 1 = 0 and the corresponding root of Equ. 15. Then, when it comes to the case Δ_2 < 0: if ρ ≤ 4A/(4A − B^2), the values of α between the two roots of Equ. 15 all satisfy g(α) ≤ 0, and one feasible solution is α = −B/(2A). If ρ > 4A/(4A − B^2), there is no feasible solution to g(α) ≤ 0; however, α = −B/(2A) can still guarantee that g(α) attains its approximate minimum value. Furthermore, in other special cases, such as a = 0, α is set to 0 and K_{t+1} is not updated for the sake of safety. In short, α can be chosen by using the following equations:

    α = (α_1 + α_2)/2,  if Δ_1 ≥ 0;
        −B/(2A),        if Δ_1 < 0;
        0,              otherwise,                    (17)

where α_1 = (−B + sgn(B)·√Δ_1)/(2A), α_2 = (−b + sgn(B)·√Δ_2)/(2a), and sgn(·) represents the sign function. The sign function is used to make sure that the selected root is the one closest to zero.

In the metric learning process, when given a triplet (i, j, k), if the α obtained by solving Equ. 17 is used to train the Mahalanobis matrix M, it can guarantee that M has a good performance on this triplet. However, in the next iteration, M will jump to another value to make sure it has a good performance on the next triplet. In fact, thousands of triplets are utilized to train the Mahalanobis matrix M, and the solution for M would be unstable if the obtained α were used to update it directly. In order to avoid this problem, the iteration factor is regularized as

    α' = ξα.                    (18)

Obviously, when ξ = 0, the Mahalanobis matrix M is not updated in this iteration. When ξ is chosen to be a very small value, the corresponding triplet is learned to a small degree. In other words, each triplet has a weight in the metric learning process. Therefore, as all the parameters in the metric learning model in Equ. 6 can be solved in analytical form, the projection is processed via the updating formulation

    γ = M_t + α' M_t (x_i − x_j)(x_i − x_j)^T M_t / (1 − α'(x_i − x_j)^T M_t (x_i − x_j)),
    M_{t+1} = γ − α' γ (x_i − x_k)(x_i − x_k)^T γ / (1 + α'(x_i − x_k)^T γ (x_i − x_k)),                    (19)

where γ = (X^T)^† Γ X^†.
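To make the per-triplet projection concrete, the following sketch (our illustration of one reading of Equs. 12–19, not code released with the paper; the degenerate-case guards are our own additions) performs a single update of M for one triplet: it computes p_1, p_2, p_3, the quantities A and B, chooses α according to Equ. 17, relaxes it via Equ. 18, and applies the two rank-one updates of Equ. 19:

    import numpy as np

    def update_metric(M, x_i, x_j, x_k, rho, xi):
        # One cyclic-projection step for the triplet (i, j, k); returns the updated M.
        u, w = x_i - x_j, x_i - x_k
        p1, p2, p3 = u @ M @ u, w @ M @ w, u @ M @ w
        A, B = p3 ** 2 - p1 * p2, p2 - p1
        a, b, c = A * rho, B * rho - 2.0 * A, rho - B
        d1, d2 = B ** 2 - 4.0 * A, b ** 2 - 4.0 * a * c       # discriminants (14) and (16)
        if abs(A) < 1e-12 or abs(a) < 1e-12:
            alpha = 0.0                                        # special case: skip the update
        elif d1 >= 0.0 and d2 >= 0.0:                          # Equ. 17, first branch
            sgn = np.sign(B) if B != 0.0 else 1.0
            a1 = (-B + sgn * np.sqrt(d1)) / (2.0 * A)
            a2 = (-b + sgn * np.sqrt(d2)) / (2.0 * a)
            alpha = 0.5 * (a1 + a2)
        else:                                                  # Equ. 17, second branch
            alpha = -B / (2.0 * A)
        alpha *= xi                                            # relaxation alpha' = xi * alpha, Equ. 18
        Mu = M @ u                                             # first rank-one update of Equ. 19
        gamma = M + alpha * np.outer(Mu, Mu) / (1.0 - alpha * (u @ Mu))
        gw = gamma @ w                                         # second rank-one update of Equ. 19
        return gamma - alpha * np.outer(gw, gw) / (1.0 + alpha * (w @ gw))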

4.2. Building Triplets

Building triplets is another important step in the proposed metric learning algorithm. It is impossible to use all the triplets as training samples because the size of the full triplet set is too large. A general triplet building method (Liu & Vemuri, 2012) is to select the triplets which lie at the boundaries of different classes so as to achieve a good classification performance. In other words, for each x_i, the instances {x_j} which have the largest distance in the same category and the instances {x_k} which have the nearest distance in a different category are selected, and these {(i, j, k)} are used to build the triplets. The previous works (Bi et al., 2011; Liu & Vemuri, 2012) regard building triplets as a preprocessing step: they build the triplets once and do not change them during the metric learning process. However, as the Mahalanobis matrix is updated during learning, some of these triplets no longer play critical roles while some unselected triplets become important. Thus, in this paper, a dynamic triplet building strategy is presented: firstly, we measure the performance of the learned M on the training dataset after
every metric learning cycle; secondly, the triplets are selected based on this measurement. In order to measure the performance of the learned M on the training dataset Γ, two matrices are defined in this section. The first one contains the Mahalanobis distances of all sample pairs {(i, j)} in Γ, given as

    D_M = {d_M(x_i, x_j)},  x_i, x_j ∈ Γ.                    (20)

The second one is the similarity matrix S, which gives the relationship information of the sample pairs, expressed as

    S = {s(x_i, x_j)},  x_i, x_j ∈ Γ,                    (21)

where s(x_i, x_j) = 1 if x_i and x_j belong to the same category, and s(x_i, x_j) = 0 otherwise. We choose the i-th row of D_M and denote it as D_M^i = {d_M(x_i, x_j)}, j = 1, 2, ..., N, with the corresponding S^i = {s(x_i, x_j)}, j = 1, 2, ..., N. If we sort the vector D_M^i in ascending order, the result is denoted D̄_M^i, and the mapping from D_M^i to D̄_M^i is

    h_i(D_M^i) = D̄_M^i.                    (22)

As said above, a smaller d_M(x_i, x_j) means a smaller distance between x_i and x_j, so in the ideal case all the foremost entries of D̄_M^i should correspond to instances in the same category as x_i. In other words, if we apply the same mapping h_i to S^i, the obtained S̄^i = h_i(S^i) should be in descending order: in the ideal case all the foremost elements are 1 and the others are 0. However, in the real situation, S̄^i is partly out of order, and some 0 entries appear in front of 1 entries after applying S̄^i = h_i(S^i). If we want to reorder the elements of S̄^i into the ideal order, we can achieve it by repeatedly exchanging disordered 0 and 1 entries. Suppose t_i is the total number of exchanges needed to reorder S̄^i into the ideal order, and t_T is the number of exchanges for the case of total disorder; then the degree of disorder is calculated as

    δ(D_M^i) = t_i / t_T,                    (23)

and the degree of disorder over all the samples is expressed as

    δ(D_M) = Σ_{i=1}^{N} δ(D_M^i).                    (24)

This index reflects the classification accuracy on the training dataset to a certain degree, and it can be used to judge whether the iteration has converged. After measuring the performance of the obtained M on the training dataset, the next step is to select the triplets based on this measurement. In this step, the triplet generating strategy is similar to the method presented in (Liu & Vemuri, 2012). The difference lies in that the M_t obtained in the previous cycle is used to measure the distance between x_i and x_j, x_k. Then, the number of triplets selected from the i-th row is determined by the ratio between δ(D_M^i) and δ(D_M), that is

    N_i = N_T δ(D_M^i) / δ(D_M),

where N_T is the total number of triplets. In this way, if the i-th row has a high disorder degree under M_t, it will receive special attention when learning M_{t+1} in the next cycle. To sum up, the proposed metric learning algorithm is given as Algorithm 1.

Algorithm 1 LogDet Divergence based Metric Learning
Input: Γ: input samples {x_i}; M_0: a given Mahalanobis matrix; ρ: a distance threshold; ξ: the relaxation coefficient.
Output: M: output Mahalanobis matrix.
Initialization: Initialize Cycle = 0, t = 0 and M_t = M_0; build N_T triplets {(x_i, x_j, x_k)} based on D_{M_0} by using the method in Sec. 4.2.
repeat
  Cycle = Cycle + 1; iter = 0;
  repeat
    iter = iter + 1; t = t + 1;
    Pick a triplet (i, j, k) ∈ T;
    p_1 = (x_i − x_j)^T M_t (x_i − x_j);
    p_2 = (x_i − x_k)^T M_t (x_i − x_k);
    p_3 = (x_i − x_j)^T M_t (x_i − x_k);
    Calculate A, B, a, b, c and Δ_1, Δ_2 by using the equations in Sec. 4.1;
    Compute α and α' by using Equ. 17 and Equ. 18;
    Update the Mahalanobis matrix M_{t+1} by using Equ. 19;
  until iter = N
  Compute δ(D_{M_t}) and all the δ(D_{M_t}^i);
  Build new N_T triplets {(i, j, k)} based on D_{M_t} by using the method in Sec. 4.2;
until δ(D_{M_t}) converges or Cycle = Cycle_max
Return M
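As an illustrative sketch (not the authors' released code; the helper names and the inversion-counting interpretation of the exchange count t_i are our own), the disorder degree of Equs. 23–24 and the per-row triplet budget N_i of Sec. 4.2 can be computed as follows:

    import numpy as np

    def disorder_degree(D_row, S_row):
        # delta(D_M^i): exchanges needed to sort the reordered similarity row,
        # normalized by the exchanges needed in the totally disordered case.
        order = np.argsort(D_row)                  # ascending distances, the mapping h_i
        s = np.asarray(S_row)[order].astype(int)   # similarity labels reordered by distance
        n_same = int(s.sum())
        zeros_before = np.cumsum(1 - s)            # zeros seen up to each position
        t_i = int(np.sum(zeros_before[s == 1]))    # number of (0 before 1) inversions
        t_total = n_same * (len(s) - n_same)       # total disorder: every 0 precedes every 1
        return t_i / t_total if t_total > 0 else 0.0

    def triplet_budget(deltas, n_triplets):
        # N_i = N_T * delta(D_M^i) / delta(D_M), where delta(D_M) = sum_i delta(D_M^i).
        deltas = np.asarray(deltas, dtype=float)
        return np.round(n_triplets * deltas / deltas.sum()).astype(int)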

5. Experiments

In this section, we conduct experiments on several public domain datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) to present the superiority of the proposed metric learning method and the relationship between its performance and its parameters. In these experiments, the performance index is chosen as the classification accuracy of k-Nearest Neighbor. The results illustrate that the proposed method outperforms the state-of-the-art methods.

5.1. Comparison with the State-of-the-Art Methods in Classification

We compare the proposed method with several baseline classification methods and state-of-the-art metric learning algorithms, including the Euclidean distance, the L1-norm distance, the Mahalanobis distance, ITML (Davis et al., 2007), COP (Xing et al., 2002), BoostMetric (Shen et al., 2009) and MetricBoost (Bi et al., 2011). The performance of all these algorithms is evaluated using 5-fold cross validation, and the final results are the averages over 10 runs. The experiments are conducted on six datasets from the UCI machine learning repository: "Iris", "Wine", "Seeds", "WDBC", "Sonar" and "Heart". In the proposed method, the parameter Cycle is chosen as 8, and the relaxation coefficient is set as ξ = 10^-4. Besides, the total quantity of triplets is set as N_T = 10^4 to guarantee that enough triplets are utilized to learn the Mahalanobis matrix. The parameter ρ has little impact on the classification accuracy, and it is set as the 25th percentile of the observed distribution of distances between pairs of points within the dataset. Besides, the variable k in these comparison experiments is chosen as 1. Testing results are summarized in Table 1, and they reveal that the proposed method achieves the best accuracy across all datasets. After careful observation and analysis, the good performance of the proposed method can be explained as follows. On one hand, compared with methods such as BoostMetric and MetricBoost, the proposed method uses a given matrix M_0 to make sure that the obtained M is a PSD matrix or an approximate PSD matrix, which guarantees the stability of the training process. On the other hand, the proposed method utilizes the most useful triplets as the labels of the training data, which helps the proposed method achieve an improved performance compared with methods such as COP and ITML.
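The threshold ρ described above can be set from the empirical distribution of pairwise distances; the sketch below is an assumption on our part (the paper does not state under which matrix the distances are measured, so the matrix is left as a parameter) and computes the 25th percentile of the pairwise squared Mahalanobis distances:

    import numpy as np

    def choose_rho(X, M, percentile=25):
        # rho: a low percentile of the pairwise squared Mahalanobis distances in X.
        n = X.shape[0]
        dists = [(X[i] - X[j]) @ M @ (X[i] - X[j])
                 for i in range(n) for j in range(i + 1, n)]
        return float(np.percentile(dists, percentile))

    # Example usage: rho = choose_rho(X_train, np.eye(X_train.shape[1]))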


Figure 1. The relationship between classification accuracy and the parameter Cycle. (a) The experiment results based on dataset “Wine”; (b) The experiment results based on dataset “Seeds”.

5.2. The Relationship between Performance and Parameters

In the proposed algorithm, several parameters may influence the performance. The first one is the parameter Cycle, and Fig. 1 illustrates its effect using the experimental results on the datasets "Wine" and "Seeds". The results show that more cycles lead to higher classification accuracy. Meanwhile, it is worth pointing out that more cycles mean more iterations, so the running time becomes long, reducing the efficiency of the metric learning. Furthermore, the results in Fig. 1 also illustrate that the classification accuracy converges once Cycle increases to a certain value. Thus, a recommended value of Cycle is 8~10. Another parameter which may affect the performance of metric learning is the relaxation coefficient ξ. As mentioned above, the relaxation coefficient ξ is used to avoid the overfitting problem. On one hand, if ξ is too large, the Mahalanobis matrix will not be stable in the metric learning process, since every triplet has a significant influence on the change of the Mahalanobis matrix; thus the obtained Mahalanobis matrix will have a low classification accuracy. On the other hand, if the coefficient ξ is very small, every iteration has too little influence on the change of the Mahalanobis matrix, and the metric learning process becomes very slow; then the obtained Mahalanobis matrix will not have a good performance because the learning process is insufficient. Fig. 2 gives the experimental results on the datasets "Heart" and "WDBC". The results demonstrate that the metric learning method has the best performance when ξ = 10^-4. If ξ = 10^-6, the classification accuracy of the proposed method is close to that of the Euclidean distance. If ξ = 10^-1, the classification accuracy is very low. In fact, the selection of ξ is related to the quantity of the triplets N_T, with ξ ≈ 1/N_T. A larger N_T will lead to better performance at the expense of lower efficiency. Thus, in most cases, a recommended value of N_T is 10^3 to 10^5.

Figure 2. The relationship between classification accuracy and the relaxation coefficient ξ. (a) The experiment results based on dataset "Heart"; (b) The experiment results based on dataset "WDBC".

Table 1. Classification accuracy of different metric learning methods on 6 selected datasets in the UCI repository

dataset   Euclidean   L1-norm   Mahalanobis   COP      BoostMetric   MetricBoost   ITML     proposed
Iris      0.9600      0.9467    0.8600        0.9533   0.8800        0.9667        0.9600   0.9800
Wine      0.7360      0.7865    0.9157        0.9494   0.8034        0.6180        0.7360   0.9775
Seeds     0.9000      0.9000    0.9286        0.9000   0.8143        0.6000        0.8952   0.9571
WDBC      0.9139      0.9279    0.8910        0.9139   0.8735        0.8084        0.9315   0.9438
Sonar     0.8221      0.8269    0.8173        0.8221   0.7981        0.6731        0.8365   0.8510
Heart     0.5889      0.6296    0.7185        0.5889   0.5519        0.7815        0.7593   0.8370

6. Conclusion

In this paper, a LogDet divergence based metric learning approach is proposed. First of all, triplets are used as the constraints in the model, while the LogDet divergence is introduced to regularize the Mahalanobis matrix. Then, a cyclic algorithm is presented to solve the proposed metric learning model and improve the efficiency of the learning process. Furthermore, a novel dynamic triplet building strategy is presented in this work, which guarantees that the most useful triplets are used in the training process. Experiments on benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art metric learning methods. Meanwhile, the relationship between performance and parameters is also discussed in the experiments. One drawback of this framework is its limited scalability: the size of D_M grows with the number of samples, so the proposed method is hard to apply to datasets of large size. Further research therefore needs to be carried out on optimizing the algorithm.

References

Bar-Hillel, Aharon, Hertz, Tomer, Shental, Noam, and Weinshall, Daphna. Learning distance functions using equivalence relations. In Proceedings of the International Conference on Machine Learning, volume 20, pp. 11, 2003.

Bi, Jinbo, Wu, Dijia, Lu, Le, Liu, Meizhu, Tao, Yimo, and Wolf, M. AdaBoost on low-rank PSD matrices for metric learning. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 2617–2624. IEEE, 2011.

Cao, Zhimin, Yin, Qi, Tang, Xiaoou, and Sun, Jian. Face recognition with learning-based descriptor. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2707–2714. IEEE, 2010.

Davis, Jason V, Kulis, Brian, Jain, Prateek, Sra, Suvrit, and Dhillon, Inderjit S. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 209–216. ACM, 2007.

Davis, Jason V and Dhillon, Inderjit S. Differential entropic clustering of multivariate Gaussians. Advances in Neural Information Processing Systems, 19:337. The MIT Press, 2007.

Jiang, Nan, Liu, Wenyu, and Wu, Ying. Adaptive and discriminative metric differential tracking. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1161–1168. IEEE, 2011.

Kulis, Brian, Sustik, Mátyás, and Dhillon, Inderjit. Learning low-rank kernel matrices. In Proceedings of the 23rd International Conference on Machine Learning, pp. 505–512. ACM, 2006.

Liu, Meizhu and Vemuri, Baba. A robust and efficient doubly regularized metric learning approach. In Computer Vision–ECCV 2012, pp. 646–659, 2012.


Liu, Meizhu, Lu, Le, Bi, Jinbo, Raykar, Vikas, Wolf, Matthias, and Salganicoff, Marcos. Robust large scale prone-supine polyp matching using local features: a metric learning approach. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2011, pp. 75–82, 2011.

Niu, Gang, Dai, Bo, Yamada, Makoto, and Sugiyama, Masashi. Information-theoretic semi-supervised metric learning via entropy regularization. arXiv preprint arXiv:1206.4614, 2012.

Ong, Eng-Jon and Bowden, Richard. A boosted classifier tree for hand shape detection. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pp. 889–894. IEEE, 2004.

Shen, Chunhua, Kim, Junae, Wang, Lei, and Hengel, Anton van den. Positive semidefinite metric learning with boosting. arXiv preprint arXiv:0910.2279, 2009.

Sugiyama, Masashi. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. The Journal of Machine Learning Research, 8:1027–1061, 2007.

Weinberger, Kilian Q, Blitzer, John, and Saul, Lawrence K. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (NIPS), 2006.

Xiang, Shiming, Nie, Feiping, and Zhang, Changshui. Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition, 41(12):3600–3612, 2008.

Xing, Eric P, Ng, Andrew Y, Jordan, Michael I, and Russell, Stuart. Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems, 15:505–512, 2002.
