
Heterogeneous Multi-task Metric Learning across Multiple Domains

Yong Luo, Yonggang Wen, Senior Member, IEEE, and Dacheng Tao, Fellow, IEEE

Abstract—Distance metric learning (DML) plays a crucial role in diverse machine learning algorithms and applications. When the labeled information in the target domain is limited, transfer metric learning (TML) helps to learn the metric by leveraging the sufficient information from other related domains. Multi-task metric learning (MTML), which can be regarded as a special case of TML, performs transfer across all related domains. Current TML tools usually assume that the same feature representation is exploited for different domains. However, in real-world applications, data may be drawn from heterogeneous domains. Heterogeneous transfer learning approaches can be adopted to remedy this drawback by deriving a metric from the learned transformation across different domains. But they are often limited in that only two domains can be handled. To appropriately handle multiple domains, we develop a novel heterogeneous multi-task metric learning (HMTML) framework. In HMTML, the metrics of all different domains are learned together. The transformations derived from the metrics are utilized to induce a common subspace, and the high-order covariance among the predictive structures of these domains is maximized in this subspace. There do exist a few heterogeneous transfer learning approaches that deal with multiple domains, but the high-order statistics (correlation information), which can only be exploited by simultaneously examining all domains, is ignored in these approaches. Compared with them, the proposed HMTML can effectively explore such high-order information, thus obtaining more reliable feature transformations and metrics. The effectiveness of our method is validated by extensive experiments on text categorization, scene classification, and social image annotation.

Index Terms—Heterogeneous domain, distance metric learning, multi-task, tensor, high-order statistics

I. INTRODUCTION

THE objective of distance metric learning (DML) is to find a measure to appropriately estimate the distance or similarity between data. DML is critical in various research fields, e.g., k-means clustering, k-nearest neighbor (kNN) classification, the sophisticated support vector machine (SVM) [1], [2], and learning to rank [3]. For instance, kNN based models can be comparable or superior to other well-designed

Y. Luo and Y. Wen are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, e-mail: [email protected], [email protected].

D. Tao is with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, 81 Broadway Street, Ultimo, NSW 2007, Australia, e-mail: [email protected].

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

classifiers by learning proper distance metrics [4], [5]. Besides, it is demonstrated in [1] that learning the Mahalanobis metric for the RBF kernel in SVM without model selection consistently outperforms the traditional SVM-RBF, in which the hyperparameter is determined by cross validation.

Recently, some transfer metric learning (TML) [6], [7] methods have been proposed for DML in the case that the labeled data provided in the target domain (the domain of interest) are insufficient, while the labeled information in certain related, but different, source domains is abundant. In this scenario, the data distributions of the target and source domains may differ a lot, hence the traditional DML algorithms usually do not perform well. Indeed, TML [6], [7] tries to reduce the distribution gap and help learn the target metric by utilizing the labeled data from the source domains. In particular, multi-task metric learning (MTML) [7] aims to simultaneously improve the metric learning of all different (source and target) domains, since the labeled data for each of them are assumed to be scarce.

There is a main drawback in most of the current TML algorithms, i.e., samples of related domains are assumed to have the same feature dimension or share the same feature space. This assumption, however, is invalid in many applications. A typical example is the classification of multilingual documents. In this application, the feature representation of a document written in one language is different from that written in other languages since different vocabularies are utilized. Also, in some face recognition and object recognition applications, images may be collected under different environmental conditions, hence their representations have different dimensions. Besides, in natural image classification and multimedia retrieval, we often extract different kinds of features (such as global wavelet texture and local SIFT [8]) to represent the samples, which can also have different modalities (such as text, audio, and image). Each feature space or modality can be regarded as a domain.

In the literature, various heterogeneous transfer learning [9], [10], [11] algorithms have been developed to deal with heterogeneous domains. In these approaches, a widely adopted strategy is to reduce the difference of heterogeneous domains [11] by transforming their representations into a common subspace. A metric can be derived from the transformation learned for each domain. Although they have achieved success in some applications, these approaches suffer from a major limitation: only two domains (one source and one target domain) can be handled. In practice, however, the number of domains is usually more than two. For instance, the news articles in the Reuters multilingual collection are written in five different languages. Besides, it is common to utilize various kinds of features (such as global, local, and biologically inspired) in visual analysis-based applications.

To remedy these drawbacks, we propose a novel heterogeneous multi-task metric learning (HMTML) framework, which is able to handle an arbitrary number of domains. In the proposed HMTML, the metrics of all domains are learned in a single optimization problem, where the empirical losses w.r.t. the metric for each domain are minimized. Meanwhile, we derive feature transformations from the metrics and use them to project the predictive structures of different domains into a common subspace. In this paper, all different domains are assumed to have the same application [10], [12], such as article classification with the same categories. Consequently, the predictive structures, which are parameterized by the weight vectors of classifiers, should be close to each other in the subspace. Then a connection between different metrics is built by minimizing the divergence between the transformed predictive structures. Compared with learning the metrics of different domains separately, more reliable metrics can be obtained since the different domains help each other in our method. This is particularly important for those domains with little label information.

There do exist a few approaches [10], [12] that can learn transformations and derive metrics for more than two domains. These approaches, however, only explore the statistics (correlation information) between pairs of representations in either a one-vs-one [10] or a centralized [12] way, while the high-order statistics, which can only be obtained by examining all domains simultaneously, are ignored. Our method is more advantageous than these approaches in that we directly analyze the covariance tensor over the prediction weights of all domains. This encodes the high-order correlation information in the learned transformations, and thus we can hopefully achieve better performance. Experiments are conducted extensively on three popular applications, i.e., text categorization, scene classification, and social image annotation. We compare the proposed HMTML with not only the Euclidean (EU) baseline and representative DML (single domain) algorithms [4], [13], [14], but also two representative heterogeneous transfer learning approaches [10], [12] that can deal with more than two domains. The experimental results demonstrate the superiority of our method.

This paper differs from our conference works [15], [16] significantly since: 1) in [15], we need large amounts of unlabeled corresponding data to build the domain connection. In this paper, we utilize the predictive structures to bridge different domains and thus only limited labeled data are required. Therefore, the critical regularization term that enables knowledge transfer is quite different from that in [15]; 2) in [16], we aim to learn features for different domains, instead of the distance metrics in this paper. The optimization problem is thus quite different since we directly optimize w.r.t. the distance metrics in this paper.

The rest of the paper is structured as follows. Some closely related works are summarized in Section II. In Section III, we first give an overall description of the proposed HMTML, and then present its formulation together with some analysis. Experimental results are reported and analyzed in Section IV, and we conclude this paper in Section V.

II. RELATED WORK

A. Distance metric learning

The goal of distance metric learning (DML) [17] is to learn an appropriate distance function over the input space, so that the relationships between data are appropriately reflected. Most conventional metric learning methods, which are often called "Mahalanobis metric learning", can be regarded as learning a linear transformation of the input data [18], [19]. The first work of Mahalanobis metric learning was done by Xing et al. [20], where a constrained convex optimization problem with no regularization was proposed. Some other representative algorithms include neighborhood component analysis (NCA) [21], large margin nearest neighbors (LMNN) [4], information theoretic metric learning (ITML) [13], and regularized distance metric learning (RDML) [14].

To capture the nonlinear structure in the data, one can extend the linear metric learning methods to learn nonlinear transformations by adopting the kernel trick or the "Kernel PCA (KPCA)" trick [22]. Alternatively, neural networks can be utilized to learn arbitrarily complex nonlinear mappings for metric learning [23]. Some other representative nonlinear metric learning approaches include gradient-boosted LMNN (GB-LMNN) [24], Hamming distance metric learning (HDML) [25], and support vector metric learning (SVML) [1].

Recently, transfer metric learning (TML) has attracted intensive attention as a way to tackle the labeled data deficiency issue in the target domain [26], [7] or in all given related domains [27], [28], [7]. The latter is often called multi-task metric learning (MTML), and is the focus of this paper. The authors of [27] propose mt-LMNN, which is an extension of LMNN to the multi-task setting [29]. This is realized by representing the metric of each task as the sum of a shared Mahalanobis metric and a task-specific metric. The level of information to be shared across tasks is controlled by the trade-off hyper-parameters of the metric regularizations. Yang et al. [28] generalize mt-LMNN by utilizing von Neumann divergence based regularization to preserve the geometry between the learned metrics. An implicit assumption of these methods is that the data samples of different domains lie in the same feature space. Thus these approaches cannot handle heterogeneous features. To remedy this drawback, we propose heterogeneous MTML (HMTML), inspired by heterogeneous transfer learning.

B. Heterogeneous transfer learning

Developments in transfer learning [30] across heterogeneous feature spaces can be grouped into two categories: heterogeneous domain adaptation (HDA) [10], [11] and heterogeneous multi-task learning (HMTL) [12]. In HDA, there is usually only one target domain that has limited labeled data, and our aim is to utilize sufficient labeled data from related source domains to help the learning in the target domain. Whereas in HMTL, the labeled data in all domains are scarce, and thus we treat different domains equally and make them help each other.

Most HDA methods only consider two domains, i.e., one source and one target domain. The main idea in these methods is to either map the heterogeneous data into a common feature space by learning a feature mapping for each domain [9], [31], or map the data from the source domain to the target domain by learning an asymmetric transformation [32], [11]. The former is equivalent to Mahalanobis metric learning since each learned mapping can be used to derive a metric directly. Wang and Mahadevan [10] presented an HDA method based on manifold alignment (DAMA). This method manages several domains, and the feature mappings are learned by utilizing the labels shared by all domains to align each pair of manifolds. Compared with HDA, there are much fewer works on HMTL. One representative approach is multi-task discriminant analysis (MTDA) [12], which extends linear discriminant analysis (LDA) to learn multiple tasks simultaneously. In MTDA, a common intermediate structure is assumed to be shared by the learned latent representations of different domains. MTDA can also deal with more than two domains, but is limited in that only the pairwise correlations (between each latent representation and the shared representation) are exploited. Therefore, the high-order correlations between all domains are ignored in both DAMA and MTDA. We develop the following tensor based heterogeneous multi-task metric learning framework to rectify this shortcoming.

III. HETEROGENEOUS MULTI-TASK METRIC LEARNING

In DAMA [10] and MTDA [12], a linear transformation is learned for each domain and only pairwise correlation information is explored. In contrast to them, tensor based heterogeneous MTML (HMTML) is developed in this paper to exploit the high-order tensor correlations for metric learning. Fig. 1 is an overview of the proposed method. Here, we take multilingual text classification as an example to illustrate the main idea. The number of labeled samples for each of the $M$ heterogeneous domains (e.g., "English", "Italian", and "German") is assumed to be small. For the $m$'th domain, empirical losses w.r.t. the metric $A_m$ are minimized on the labeled set $\mathcal{D}_m$. The obtained metrics may be unreliable due to the labeled data deficiency issue in each domain. Hence we let the different domains help each other to learn improved metrics by sharing information across them. This is performed by first constructing multiple prediction (such as binary classification) problems in the $m$'th domain, and a set of prediction weight vectors $\{\mathbf{w}_m^p\}_{p=1}^P$ are obtained by training the problems on $\mathcal{D}_m$. Here, $\mathbf{w}_m^p$ is the weight vector of the $p$'th base classifier, and we assume each of the $P$ base classifiers is linear, i.e., $f_p(\mathbf{x}_m) = (\mathbf{w}_m^p)^T \mathbf{x}_m$. We follow [11] to generate the $P$ prediction problems, where the well-known Error Correcting Output Codes (ECOC) scheme is adopted. Then we decompose the metric $A_m$ as $A_m = U_m U_m^T$, and the obtained $U_m$ is used to transform the weight vectors into a common subspace, i.e., $\{\mathbf{v}_m^p = U_m^T \mathbf{w}_m^p\}_{p=1}^P$. As illustrated in Section I, the application is the same for different domains. Therefore, after being transformed into the common subspace, the weight vectors should be close to each other, i.e., all $\mathbf{v}_1^p, \mathbf{v}_2^p, \ldots, \mathbf{v}_M^p$ should be similar. Finally, by maximizing the tensor based high-order covariance between all $\mathbf{v}_m^p$, we obtain improved $U_m^*$, where additional information from other domains is involved. Then a more reliable metric can be obtained since $A_m^* = U_m^*(U_m^*)^T$.

In short, ECOC generates a binary "codebook" to encode each class as a binary string, and trains multiple binary classifiers according to the "codebook". The test sample is classified using all the trained classifiers and the results are concatenated to obtain a binary string, which is decoded for final prediction [33]. In this paper, the classification weights of all the trained binary classifiers are used to build domain connections. It should be noted that the learned weights may not be reliable given only limited labeled samples. Fortunately, we can construct sufficient base classifiers using the ECOC scheme, and thus robust transformations can be obtained even if some learned base classifiers are inaccurate or incorrect [11]. The reasons why the decomposition $U_m$ of the metric $A_m$ can be used to transform the parameters of classifiers are mainly based on two points: 1) Mahalanobis metric learning can be formulated as learning a linear transformation [18]. In the literature of distance metric learning (DML), most methods focus on learning the Mahalanobis distance, which is often denoted as

$$d_A(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T A (\mathbf{x}_i - \mathbf{x}_j),$$

where $A$ is the metric, which is a positive semi-definite matrix and can be factorized as $A = UU^T$. By applying some simple algebraic manipulations, we have $d_A(\mathbf{x}_i, \mathbf{x}_j) = \|U^T\mathbf{x}_i - U^T\mathbf{x}_j\|_2^2$; 2) according to the multi-task learning methods presented in [34], [35], the transformation learned in the parameter space can also be interpreted as defining a mapping in the feature space. Therefore, the linear transformation $U_m$ can be used to transform the parameters of classifiers. The technical details of the proposed method are given below, and we start by briefing the frequently used notations and concepts of multilinear algebra in this paper.
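For illustration, the following minimal NumPy sketch (a toy example of ours, not part of our implementation) verifies numerically that the metric distance under the factorization $A = UU^T$ equals the squared Euclidean distance after the linear mapping $U^T$.

```python
import numpy as np

# Toy check: a Mahalanobis metric factorized as A = U U^T turns the metric
# distance into a Euclidean distance in the transformed space U^T x.
rng = np.random.default_rng(0)
d, r = 6, 3                       # input dimension and number of common factors
U = rng.random((d, r))            # nonnegative factor, matching the paper's constraint
A = U @ U.T                       # positive semi-definite metric

x_i, x_j = rng.normal(size=d), rng.normal(size=d)
delta = x_i - x_j

d_metric = delta @ A @ delta                 # (x_i - x_j)^T A (x_i - x_j)
d_mapped = np.sum((U.T @ delta) ** 2)        # ||U^T x_i - U^T x_j||_2^2

assert np.isclose(d_metric, d_mapped)
print(d_metric, d_mapped)
```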

A. Notations

If $\mathcal{A}$ is an $M$-th order tensor of size $I_1 \times I_2 \times \ldots \times I_M$, and $U$ is a $J_m \times I_m$ matrix, then the $m$-mode product of $\mathcal{A}$ and $U$ is signified as $\mathcal{B} = \mathcal{A} \times_m U$, which is also an $M$-th order tensor, of size $I_1 \times \ldots \times I_{m-1} \times J_m \times I_{m+1} \times \ldots \times I_M$, with entries

$$\mathcal{B}(i_1, \ldots, i_{m-1}, j_m, i_{m+1}, \ldots, i_M) = \sum_{i_m=1}^{I_m} \mathcal{A}(i_1, i_2, \ldots, i_M)\, U(j_m, i_m). \qquad (1)$$

The product of $\mathcal{A}$ and a set of matrices $\{U_m \in \mathbb{R}^{J_m \times I_m}\}_{m=1}^M$ is given by

$$\mathcal{B} = \mathcal{A} \times_1 U_1 \times_2 U_2 \ldots \times_M U_M. \qquad (2)$$

The mode-$m$ matricization of $\mathcal{A}$ is a matrix $A_{(m)}$ of size $I_m \times (I_1 \ldots I_{m-1} I_{m+1} \ldots I_M)$. We can regard the $m$-mode multiplication $\mathcal{B} = \mathcal{A} \times_m U$ as matrix multiplication in the form of $B_{(m)} = U A_{(m)}$.


Fig. 1. System diagram of the proposed heterogeneous multi-task metric learning. Given limited labeled samples for each of the $M$ domains (e.g., English, Italian, and German), learning the metrics $A_1, A_2, \ldots, A_M$ separately may be unreliable. To allow all different domains to help each other, we first learn the prediction weights $W_m = [\mathbf{w}_m^1, \mathbf{w}_m^2, \ldots, \mathbf{w}_m^P]$, $m = 1, 2, \ldots, M$. Considering that the metric $A_m$ can be decomposed as $A_m = U_m U_m^T$, we transform all $W_m$ into a common subspace using $U_m$ and obtain $V_m = U_m^T W_m$, where $V_m = [\mathbf{v}_m^1, \mathbf{v}_m^2, \ldots, \mathbf{v}_m^P]$ with each $\mathbf{v}_m^p = U_m^T \mathbf{w}_m^p$. By simultaneously minimizing the empirical losses w.r.t. each $A_m$ and the high-order divergence between all $V_m$, we obtain reliable $U_m^*$ and also the metric $A_m^* = U_m^*(U_m^*)^T$.

Let $\mathbf{u}$ be an $I_m$-vector. The contracted $m$-mode product of $\mathcal{A}$ and $\mathbf{u}$, denoted as $\mathcal{B} = \mathcal{A}\, \bar{\times}_m\, \mathbf{u}$, is an $(M-1)$-th order tensor of size $I_1 \times \ldots \times I_{m-1} \times I_{m+1} \times \ldots \times I_M$. Its elements are calculated by

$$\mathcal{B}(i_1, \ldots, i_{m-1}, i_{m+1}, \ldots, i_M) = \sum_{i_m=1}^{I_m} \mathcal{A}(i_1, i_2, \ldots, i_M)\, \mathbf{u}(i_m). \qquad (3)$$

Finally, the Frobenius norm of the tensor $\mathcal{A}$ is calculated as

$$\|\mathcal{A}\|_F^2 = \langle \mathcal{A}, \mathcal{A} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \ldots \sum_{i_M=1}^{I_M} \mathcal{A}(i_1, i_2, \ldots, i_M)^2. \qquad (4)$$
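The following NumPy sketch illustrates these operations (the helper names are ours; modes are indexed from zero here, whereas the text indexes them from one).

```python
import numpy as np

def mode_unfold(T, m):
    """Mode-m matricization T_(m): move axis m to the front and flatten the rest."""
    return np.moveaxis(T, m, 0).reshape(T.shape[m], -1)

def mode_product(T, U, m):
    """m-mode product T x_m U: contract axis m of T with the columns of U."""
    return np.moveaxis(np.tensordot(U, T, axes=(1, m)), 0, m)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5, 6))        # a 3rd-order tensor
U = rng.normal(size=(3, 5))           # acts on mode 1 (size 5 -> 3)

B = mode_product(A, U, 1)
# The matricization identity B_(m) = U A_(m) from the text:
assert np.allclose(mode_unfold(B, 1), U @ mode_unfold(A, 1))
# Frobenius norm of a tensor, Eq. (4):
assert np.isclose(np.linalg.norm(A), np.sqrt((A ** 2).sum()))
```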

B. Problem formulation

Suppose there are $M$ heterogeneous domains, and the labeled training set for the $m$'th domain is $\mathcal{D}_m = \{(\mathbf{x}_{mn}, y_{mn})\}_{n=1}^{N_m}$, where $\mathbf{x}_{mn} \in \mathbb{R}^{d_m}$ and its corresponding class label $y_{mn} \in \{1, 2, \ldots, C\}$. The label set is the same for the different heterogeneous domains [10], [12]. Then we have the following general formulation for the proposed HMTML:

$$\min_{\{A_m\}_{m=1}^M} F(\{A_m\}) = \sum_{m=1}^M \Psi(A_m) + \gamma R(A_1, A_2, \ldots, A_M), \quad \text{s.t. } A_m \succeq 0,\ m = 1, 2, \ldots, M, \qquad (5)$$

where $\Psi(A_m) = \frac{2}{N_m(N_m-1)} \sum_{i<j} L(A_m; \mathbf{x}_{mi}, \mathbf{x}_{mj}, y_{mij})$ is the empirical loss w.r.t. $A_m$ in the $m$'th domain, $y_{mij} = \pm 1$ indicates whether $\mathbf{x}_{mi}$ and $\mathbf{x}_{mj}$ are similar/dissimilar to each other, and $R(A_1, A_2, \ldots, A_M)$ is a carefully chosen regularizer that enables knowledge transfer across different domains. Here, $y_{mij}$ is obtained according to whether $\mathbf{x}_{mi}$ and $\mathbf{x}_{mj}$ belong to the same class or not. Following [14], we choose $L(A_m; \mathbf{x}_{mi}, \mathbf{x}_{mj}, y_{mij}) = g\big(y_{mij}[1 - \|\mathbf{x}_{mi} - \mathbf{x}_{mj}\|_{A_m}^2]\big)$ and adopt the generalized log loss (GL-loss) [36] for $g$, i.e., $g(z) = \frac{1}{\rho}\log(1 + \exp(-\rho z))$. Here, $\|\mathbf{x}_{mi} - \mathbf{x}_{mj}\|_{A_m}^2 = (\mathbf{x}_{mi} - \mathbf{x}_{mj})^T A_m (\mathbf{x}_{mi} - \mathbf{x}_{mj})$. The GL-loss is a smooth version of the hinge loss, and the smoothness is controlled by the hyper-parameter $\rho$, which is set as 3 in this paper. For notational simplicity, we denote $\mathbf{x}_{mi}$, $\mathbf{x}_{mj}$, and $y_{mij}$ as $\mathbf{x}_{mk}^1$, $\mathbf{x}_{mk}^2$, and $y_{mk}$ respectively, where $k = 1, 2, \ldots, N'_m = \frac{N_m(N_m-1)}{2}$. We also set $\boldsymbol{\delta}_{mk} = \mathbf{x}_{mk}^1 - \mathbf{x}_{mk}^2$ so that $\|\mathbf{x}_{mk}^1 - \mathbf{x}_{mk}^2\|_{A_m}^2 = \boldsymbol{\delta}_{mk}^T A_m \boldsymbol{\delta}_{mk}$, and the loss term becomes $\Psi(A_m) = \frac{1}{N'_m}\sum_{k=1}^{N'_m} g\big(y_{mk}(1 - \boldsymbol{\delta}_{mk}^T A_m \boldsymbol{\delta}_{mk})\big)$.
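The loss terms above can be made concrete with the small sketch below (the helper functions and toy data are illustrative only, not our actual implementation).

```python
import numpy as np

def gl_loss(z, rho=3.0):
    # g(z) = (1/rho) * log(1 + exp(-rho z)), a smooth surrogate of the hinge loss
    return np.logaddexp(0.0, -rho * z) / rho

def empirical_loss(A, X, y, rho=3.0):
    """Psi(A_m): average GL-loss over all labeled pairs of one domain."""
    N, losses = X.shape[0], []
    for i in range(N):
        for j in range(i + 1, N):
            delta = X[i] - X[j]
            y_ij = 1.0 if y[i] == y[j] else -1.0       # similar / dissimilar pair
            sq_dist = delta @ A @ delta                 # ||x_i - x_j||_A^2
            losses.append(gl_loss(y_ij * (1.0 - sq_dist), rho))
    return np.mean(losses)                              # (1/N'_m) sum_k g(...)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
y = rng.integers(0, 3, size=12)
print(empirical_loss(np.eye(5), X, y))                  # start from the Euclidean metric
```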

To transfer information across different domains, $P$ binary classification problems are constructed and a set of classifiers $\{\mathbf{w}_m^p\}_{p=1}^P$ are learned for each of the $M$ domains by utilizing the labeled training data. This process can be carried out in an offline manner and thus does not influence the computational cost of the subsequent metric learning. Considering that $A_m = U_m U_m^T$ due to the positive semi-definite property of the metric, we propose to use $U_m$ to transform $\mathbf{w}_m^p$, which leads to $\mathbf{v}_m^p = U_m^T \mathbf{w}_m^p$. Then the divergence of all $\mathbf{v}_1^p, \mathbf{v}_2^p, \ldots, \mathbf{v}_M^p$ is minimized. In the following, we first derive the formulation for $M = 2$ (two domains), and then generalize it for $M > 2$.

When $M = 2$, we have the following formulation:

$$\min_{U_1, U_2}\ \frac{1}{P}\sum_{p=1}^P \|\mathbf{v}_1^p - \mathbf{v}_2^p\|_2^2 + \sum_{m=1}^2 \gamma_m \|U_m\|_1, \quad \text{s.t. } U_1, U_2 \geq 0, \qquad (6)$$

where $U_m \in \mathbb{R}^{d_m \times r}$ is the transformation of the $m$'th domain, and $r$ is the number of common factors shared by the different domains. The trade-off hyper-parameters satisfy $\gamma_m > 0$, the $l_1$-norm is $\|U_m\|_1 = \sum_{i=1}^{d_m}\sum_{j=1}^{r} |u_{mij}|$, and $U_1, U_2 \geq 0$ indicates that all entries of $U_m$ are non-negative.


The transformation $U_m$ is restricted to be sparse since the features in different domains usually have sparse correspondences. The non-negative constraints can narrow the hypothesis space and improve the interpretability of the results.
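For two domains, the coupling and sparsity terms of (6) can be evaluated as in the following sketch (the helper name and toy sizes are ours).

```python
import numpy as np

def two_domain_regularizer(U1, U2, W1, W2, gamma1=0.1, gamma2=0.1):
    """Coupling term of (6): mean squared distance of transformed weights plus l1 penalties."""
    V1, V2 = U1.T @ W1, U2.T @ W2                       # v_m^p = U_m^T w_m^p, stacked column-wise
    divergence = np.mean(np.sum((V1 - V2) ** 2, axis=0))  # (1/P) sum_p ||v_1^p - v_2^p||_2^2
    sparsity = gamma1 * np.abs(U1).sum() + gamma2 * np.abs(U2).sum()
    return divergence + sparsity

rng = np.random.default_rng(0)
d1, d2, r, P = 8, 6, 3, 10
U1, U2 = rng.random((d1, r)), rng.random((d2, r))       # nonnegative transformations
W1, W2 = rng.normal(size=(d1, P)), rng.normal(size=(d2, P))
print(two_domain_regularizer(U1, U2, W1, W2))
```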

To generalize (6) to more than two domains, we first rewrite the coupling term as

$$\min_{U_1, U_2}\ \frac{1}{P}\sum_{p=1}^P \|\mathbf{w}_1^p - G\,\mathbf{w}_2^p\|_2^2 + \sum_{m=1}^2 \gamma_m \|U_m\|_1, \quad \text{s.t. } U_1, U_2 \geq 0, \qquad (7)$$

where $G = U_1 E_r U_2^T$, and $E_r$ is an identity matrix of size $r$. By using the tensor notation, we have $G = E_r \times_1 U_1 \times_2 U_2$, so the counterpart of (7) for $M > 2$ is given by

$$\min_{\{U_m\}_{m=1}^M}\ \frac{1}{P}\sum_{p=1}^P \big\|\mathbf{w}_1^p - \mathcal{G}\,\bar{\times}_2 (\mathbf{w}_2^p)^T \ldots \bar{\times}_M (\mathbf{w}_M^p)^T\big\|_2^2 + \sum_{m=1}^M \gamma_m \|U_m\|_1, \quad \text{s.t. } U_m \geq 0,\ m = 1, \ldots, M, \qquad (8)$$

where $\mathcal{G} = \mathcal{E}_r \times_1 U_1 \times_2 U_2 \ldots \times_M U_M$ is a transformation tensor, and $\mathcal{E}_r \in \mathbb{R}^{r \times r \times \ldots \times r}$ is an identity tensor (the diagonal elements are 1, and all other entries are 0). A specific optimization problem for HMTML can be obtained by choosing the regularizer in (5) as the objective of (8), i.e.,

$$\min_{\{U_m\}_{m=1}^M} F(\{U_m\}) = \sum_{m=1}^M \frac{1}{N'_m}\sum_{k=1}^{N'_m} g\big(y_{mk}(1 - \boldsymbol{\delta}_{mk}^T U_m U_m^T \boldsymbol{\delta}_{mk})\big) + \frac{\gamma}{P}\sum_{p=1}^P \big\|\mathbf{w}_1^p - \mathcal{G}\,\bar{\times}_2 (\mathbf{w}_2^p)^T \ldots \bar{\times}_M (\mathbf{w}_M^p)^T\big\|_2^2 + \sum_{m=1}^M \gamma_m \|U_m\|_1, \quad \text{s.t. } U_m \geq 0,\ m = 1, 2, \ldots, M. \qquad (9)$$

Since (9) is inconvenient to optimize directly, we reformulate it using the following theorem.

Theorem 1: If $\|\mathbf{w}_m^p\|_2^2 = 1$, $p = 1, \ldots, P$; $m = 1, \ldots, M$, then we have

$$\big\|\mathbf{w}_1^p - \mathcal{G}\,\bar{\times}_2 (\mathbf{w}_2^p)^T \ldots \bar{\times}_M (\mathbf{w}_M^p)^T\big\|_2^2 = \big\|\mathbf{w}_1^p \circ \mathbf{w}_2^p \circ \ldots \circ \mathbf{w}_M^p - \mathcal{G}\big\|_F^2, \qquad (10)$$

where $\circ$ is the outer product. We leave the proof in the supplementary material.

By substituting (10) into (9) and replacing $\mathcal{G}$ with $\mathcal{E}_r \times_1 U_1 \times_2 U_2 \ldots \times_M U_M$, we obtain the following reformulation of (9):

$$\min_{\{U_m\}_{m=1}^M} F(\{U_m\}) = \sum_{m=1}^M \frac{1}{N'_m}\sum_{k=1}^{N'_m} g\big(y_{mk}(1 - \boldsymbol{\delta}_{mk}^T U_m U_m^T \boldsymbol{\delta}_{mk})\big) + \frac{\gamma}{P}\sum_{p=1}^P \big\|\mathcal{W}^p - \mathcal{E}_r \times_1 U_1 \times_2 U_2 \ldots \times_M U_M\big\|_F^2 + \sum_{m=1}^M \gamma_m \|U_m\|_1, \quad \text{s.t. } U_m \geq 0,\ m = 1, 2, \ldots, M, \qquad (11)$$

where $\mathcal{W}^p = \mathbf{w}_1^p \circ \mathbf{w}_2^p \circ \ldots \circ \mathbf{w}_M^p$ is the covariance tensor of the prediction weights of all different domains. It is intuitive that a latent subspace shared by all domains can be found by minimizing the second term in (11). In this subspace, the representations of different domains are close to each other and knowledge is transferred. Hence different domains can help each other to learn an improved transformation $U_m$, and also the distance metric $A_m$.
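To make the second term of (11) concrete, the following sketch ($M = 3$, toy dimensions of our choosing) forms the rank-one covariance tensors $\mathcal{W}^p$ and evaluates their average squared Frobenius distance to $\mathcal{E}_r \times_1 U_1 \times_2 U_2 \times_3 U_3$.

```python
import numpy as np

# For the identity tensor E_r, the product E_r x_1 U_1 x_2 U_2 x_3 U_3 has
# entries sum_j U1[i1, j] U2[i2, j] U3[i3, j].
rng = np.random.default_rng(0)
dims, r, P = [7, 5, 6], 3, 4
U = [rng.random((d, r)) for d in dims]                  # one transformation per domain
W = [rng.normal(size=(d, P)) for d in dims]             # base-classifier weights per domain

G = np.einsum('aj,bj,cj->abc', U[0], U[1], U[2])        # E_r x_1 U_1 x_2 U_2 x_3 U_3

reg = 0.0
for p in range(P):
    Wp = np.einsum('a,b,c->abc', W[0][:, p], W[1][:, p], W[2][:, p])  # rank-1 covariance tensor
    reg += np.sum((Wp - G) ** 2)                        # ||W^p - G||_F^2
print(reg / P)
```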

C. Optimization algorithm

Problem (11) can be solved using an alternating optimization strategy. That is, only one variable $U_m$ is updated at a time while all the other $U_{m'}$, $m' \neq m$, are fixed. This updating procedure is conducted iteratively for each variable. According to [37], we have

$$\mathcal{G} = \mathcal{E}_r \times_1 U_1 \times_2 U_2 \ldots \times_M U_M = \mathcal{B} \times_m U_m,$$

where $\mathcal{B} = \mathcal{E}_r \times_1 U_1 \ldots \times_{m-1} U_{m-1} \times_{m+1} U_{m+1} \ldots \times_M U_M$. By applying the matricization property of the tensor-matrix product, we have $G_{(m)} = U_m B_{(m)}$. Besides, it is easy to verify that $\|\mathcal{W}^p - \mathcal{G}\|_F^2 = \|W_{(m)}^p - G_{(m)}\|_F^2$. Therefore, the sub-problem of (11) w.r.t. $U_m$ becomes

$$\min_{U_m} F(U_m) = \Phi(U_m) + \Omega(U_m), \quad \text{s.t. } U_m \geq 0, \qquad (12)$$

where $\Phi(U_m) = \frac{1}{N'_m}\sum_{k=1}^{N'_m} g\big(y_{mk}(1 - \boldsymbol{\delta}_{mk}^T U_m U_m^T \boldsymbol{\delta}_{mk})\big) + \frac{\gamma}{P}\sum_{p=1}^P \|W_{(m)}^p - U_m B_{(m)}\|_F^2$, and $\Omega(U_m) = \gamma_m \|U_m\|_1$.

We propose to solve problem (12) efficiently by utilizing the projected gradient method (PGM) presented in [38]. However, the term in $\Omega(U_m)$ is non-differentiable, so we first smooth it according to [39]. For notational clarity, we omit the subscript $m$ in the following derivation. According to [39], the smoothed version of the $l_1$-norm $l(u_{ij}) = |u_{ij}|$ can be given by

$$l_\sigma(u_{ij}) = \max_{Q \in \mathcal{Q}}\ \langle u_{ij}, q_{ij} \rangle - \frac{\sigma}{2} q_{ij}^2, \qquad (13)$$

where $\mathcal{Q} = \{Q : -1 \leq q_{ij} \leq 1,\ Q \in \mathbb{R}^{d \times r}\}$ and $\sigma$ is the smoothing hyper-parameter, which we set as 0.5 empirically in our implementation according to the comprehensive study of the smoothed $l_1$-norm in [40]. By setting the derivative of the objective function of (13) w.r.t. $q_{ij}$ to zero and then projecting $q_{ij}$ onto $\mathcal{Q}$, we obtain the following solution:

$$q_{ij} = \mathrm{median}\left\{\frac{u_{ij}}{\sigma}, -1, 1\right\}. \qquad (14)$$

By substituting the solution (14) back into (13), we have the piece-wise approximation of $l$, i.e.,

$$l_\sigma(u_{ij}) = \begin{cases} -u_{ij} - \frac{\sigma}{2}, & u_{ij} < -\sigma; \\ u_{ij} - \frac{\sigma}{2}, & u_{ij} > \sigma; \\ \frac{u_{ij}^2}{2\sigma}, & \text{otherwise.} \end{cases} \qquad (15)$$

To utilize the PGM for optimization, we have to compute the gradient of the smoothed $l_1$-norm to determine the descent direction. We summarize the result in the following theorem.

Theorem 2: The gradient of the smoothed $l(U) = \|U\|_1 = \sum_{i=1}^d \sum_{j=1}^r l(u_{ij})$ is

$$\frac{\partial l_\sigma(U)}{\partial U} = Q, \qquad (16)$$

where $Q$ is the matrix defined in (13).

In addition, we summarize the gradient of $\Phi(U)$ as follows.

Theorem 3: The gradient of $\Phi(U)$ w.r.t. $U$ is

$$\frac{\partial \Phi(U)}{\partial U} = \frac{1}{N'}\sum_k \frac{2 y_k (\boldsymbol{\delta}_k \boldsymbol{\delta}_k^T) U}{1 + \exp(\rho z_k)} + \frac{2\gamma}{P}\sum_p \big(U B B^T - W^p B^T\big), \qquad (17)$$

where $z_k = y_k(1 - \boldsymbol{\delta}_k^T U U^T \boldsymbol{\delta}_k)$.

We leave the proofs of both theorems in the supplementary material. Therefore, the gradient of the smoothed $F(U_m)$ is

$$\frac{\partial F_\sigma(U_m)}{\partial U_m} = \frac{1}{N'_m}\sum_k \frac{2 y_{mk} (\boldsymbol{\delta}_{mk} \boldsymbol{\delta}_{mk}^T) U_m}{1 + \exp(\rho z_{mk})} + \frac{2\gamma}{P}\sum_p \big(U_m B_{(m)} B_{(m)}^T - W_{(m)}^p B_{(m)}^T\big) + \gamma_m Q_m. \qquad (18)$$
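A compact sketch of evaluating (18) for one domain is given below (variable names and the toy driver are ours; the inputs are assumed to be already matricized).

```python
import numpy as np

def grad_F_sigma(U, Delta, y, B_m, Wp_m, gamma, gamma_m, rho=3.0, sigma=0.5):
    """Gradient of the smoothed objective for one domain, following Eq. (18)."""
    # loss part: (1/N') sum_k 2 y_k (delta_k delta_k^T) U / (1 + exp(rho z_k))
    z = y * (1.0 - np.einsum('kd,de,ke->k', Delta, U @ U.T, Delta))
    coeff = 2.0 * y / (1.0 + np.exp(np.clip(rho * z, -50, 50)))   # clipped for stability
    grad_loss = (Delta.T * coeff) @ (Delta @ U) / len(y)
    # tensor-regularizer part: (2 gamma / P) sum_p (U B B^T - W^p B^T)
    P = len(Wp_m)
    grad_reg = (2.0 * gamma / P) * (P * U @ (B_m @ B_m.T) - sum(Wp_m) @ B_m.T)
    # smoothed l1 part: gamma_m * Q with Q = median{U/sigma, -1, 1} (entrywise clip)
    return grad_loss + grad_reg + gamma_m * np.clip(U / sigma, -1.0, 1.0)

rng = np.random.default_rng(0)
d, r, q, P, Np = 6, 3, 10, 4, 20
U = rng.random((d, r))
Delta = 0.3 * rng.normal(size=(Np, d))
y = rng.choice([-1.0, 1.0], size=Np)
B_m = rng.normal(size=(r, q))
Wp_m = [rng.normal(size=(d, q)) for _ in range(P)]
print(grad_F_sigma(U, Delta, y, B_m, Wp_m, gamma=0.1, gamma_m=0.01).shape)
```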

Based on the obtained gradient, we apply the improved PGM presented in [38] to minimize the smoothed primal $F_\sigma(U_m)$, i.e.,

$$U_m^{t+1} = \pi\big[U_m^t - \mu_t \nabla F_\sigma(U_m^t)\big], \qquad (19)$$

where the operator $\pi[x]$ projects all the negative entries of $x$ to zero, and $\mu_t$ is the step size, which must satisfy the following condition:

$$F_\sigma(U_m^{t+1}) - F_\sigma(U_m^t) \leq \kappa\, \big\langle \nabla F_\sigma(U_m^t),\, U_m^{t+1} - U_m^t \big\rangle, \qquad (20)$$

where the hyper-parameter $\kappa$ is chosen to be 0.01 following [38]. The step size can be determined using Algorithm 1, adapted from [38] (Algorithm 4 therein), and the convergence of the algorithm is guaranteed according to [38]. The stopping criterion we utilize here is $|F_\sigma(U_m^{t+1}) - F_\sigma(U_m^t)| / |F_\sigma(U_m^{t+1}) - F_\sigma(U_m^0)| < \varepsilon$, where the initialization $U_m^0$ is set as the result of the previous iteration in the alternating optimization over all $\{U_m\}_{m=1}^M$.

Finally, the solutions of (11) are obtained by alternately updating each $U_m$ using Algorithm 1 until the stopping criterion $|\mathrm{OBJ}_{k+1} - \mathrm{OBJ}_k| / |\mathrm{OBJ}_k| < \varepsilon$ is reached, where $\mathrm{OBJ}_k$ is the objective value of (11) in the $k$'th iteration. Because the objective value of (12) decreases at each iteration of the alternating procedure, we have $F(U_m^{k+1}, \{U_{m'}^k\}_{m' \neq m}) \leq F(U_m^k, \{U_{m'}^k\}_{m' \neq m})$.

Algorithm 1 The improved projected gradient method for solving $U_m$.
Input: sample pair differences and corresponding labels $\{\boldsymbol{\delta}_{mk}, y_{mk}\}_{k=1}^{N'_m}$; $B_{(m)}$ and $W_{(m)}^p$, $p = 1, \ldots, P$.
Hyper-parameters: $\gamma$ and $\gamma_m$.
Output: $U_m$.
1: Set $\mu_0 = 1$ and $\beta = 0.1$. Initialize $U_m^0$ as the result of the previous iteration in the alternating optimization over all $\{U_m\}_{m=1}^M$.
2: For $t = 1, 2, \ldots$
3:   (a) Assign $\mu_t \leftarrow \mu_{t-1}$;
4:   (b) If $\mu_t$ satisfies (20), repeatedly increase it by $\mu_t \leftarrow \mu_t/\beta$ until either $\mu_t$ does not satisfy (20) or $U_m(\mu_t/\beta) = U_m(\mu_t)$. Else, repeatedly decrease $\mu_t \leftarrow \mu_t \beta$ until $\mu_t$ satisfies (20);
5:   (c) Update $U_m^{t+1} = \pi\big[U_m^t - \mu_t \nabla F_\sigma(U_m^t)\big]$.
6: Until convergence

This indicates that $F(U_m^{k+1}) \leq F(U_m^k)$. Therefore, the convergence of the proposed HMTML algorithm is guaranteed. Once the solutions $\{U_m^*\}_{m=1}^M$ have been obtained, we can conduct subsequent learning, such as multi-class classification in each domain using the learned metric $A_m^* = U_m^*(U_m^*)^T$. More implementation details can be found in the supplementary material.
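As a condensed illustration of the update (19) and the step-size condition (20), the following sketch implements a simplified variant of Algorithm 1 (it only shrinks the step size, whereas Algorithm 1 also tries to enlarge it); the toy objective is ours, not from the paper.

```python
import numpy as np

def pgm_step(U, F, grad_F, mu=1.0, beta=0.1, kappa=0.01, max_tries=20):
    """One projected-gradient update with a backtracking step size."""
    g = grad_F(U)
    for _ in range(max_tries):
        U_new = np.maximum(U - mu * g, 0.0)             # pi[.]: project onto U >= 0
        # sufficient-decrease condition (20)
        if F(U_new) - F(U) <= kappa * np.sum(g * (U_new - U)):
            return U_new, mu
        mu *= beta                                       # shrink the step and retry
    return U, mu

# toy usage: minimize ||U - T||_F^2 over nonnegative U (solution is max(T, 0))
rng = np.random.default_rng(0)
T = rng.normal(size=(5, 3))
F = lambda U: np.sum((U - T) ** 2)
grad_F = lambda U: 2.0 * (U - T)
U = np.zeros((5, 3))
for _ in range(50):
    U, _ = pgm_step(U, F, grad_F)
print(np.allclose(U, np.maximum(T, 0.0), atol=1e-3))
```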

D. Complexity analysis

To analyze the time complexity of the proposed HMTML algorithm, we first present the computational cost of optimizing each $U_m$ as described in Algorithm 1. In each iteration of Algorithm 1, we must first determine the descent direction according to the gradient calculated using (18). Then an appropriate step size is obtained by exhaustively checking whether the condition (20) is satisfied. In each check, we need to calculate the updated objective value $F_\sigma(U_m^{t+1})$. The main cost of the objective value calculation is spent on $\sum_{p=1}^P \|W_{(m)}^p - U_m B_{(m)}\|_F^2$, which can be rewritten as $\mathrm{tr}\big(P\, U_m^T U_m B_{(m)} B_{(m)}^T - 2\sum_{p=1}^P (W_{(m)}^p B_{(m)}^T) U_m^T\big)$, where the constant part $\mathrm{tr}\big(\sum_{p=1}^P (W_{(m)}^p)^T W_{(m)}^p\big)$ is omitted. To accelerate the computation, we pre-calculate $B_{(m)} B_{(m)}^T$ and $\big(\sum_p W_{(m)}^p\big) B_{(m)}^T$, which are independent of $U_m$; their time costs are $O(r^2 \prod_{m' \neq m} d_{m'})$ and $O(r \prod_{m=1}^M d_m)$ respectively. After the pre-calculation, the time complexities of calculating $(U_m^T U_m)(B_{(m)} B_{(m)}^T)$ and $\sum_{p=1}^P (W_{(m)}^p B_{(m)}^T) U_m^T$ become $O(r^2 d_m + r^{2.807})$ and $O(r^2 d_m)$ respectively, where the complexity $O(r^{2.807})$ comes from the multiplication of two square matrices of size $r$ using the Strassen algorithm [41]. It is easy to derive that the computational cost of the remaining parts of the objective function is $O(r d_m N'_m)$. Hence, the time cost of calculating the objective value is $O(r d_m N'_m + r^2 d_m + r^{2.807})$ after the pre-calculation.

Similarly, the main cost of the gradient calculation is spent on $\sum_p (U_m B_{(m)} B_{(m)}^T - W_{(m)}^p B_{(m)}^T)$, and after pre-calculating $B_{(m)} B_{(m)}^T$ and $(\sum_p W_{(m)}^p) B_{(m)}^T$, we can derive that the time cost of calculating the gradient is $O(r d_m N'_m + r^2 d_m)$. Therefore, the computational cost of optimizing $U_m$ is $O\big[r \prod_{m' \neq m} d_{m'} (r + d_m) + T_2 \big((r d_m N'_m + r^2 d_m) + T_1 (r d_m N'_m + r^2 d_m + r^{2.807})\big)\big]$, where $T_1$ is the number of checks needed to find the step size, and $T_2$ is the number of iterations for reaching the stop criterion. Considering that the optimal rank $r \ll d_m$, we can simplify the cost as $O\big[r \prod_{m=1}^M d_m + T_2 T_1 (r d_m N'_m + r^2 d_m)\big]$. Finally, supposing the number of iterations for alternately updating all $\{U_m\}_{m=1}^M$ is $\Gamma$, we obtain the time complexity of the proposed HMTML, i.e., $O\big(\Gamma M [r \prod_{m=1}^M d_m + T_2 T_1 (r d_m N'_m + r^2 d_m)]\big)$, where $N'_m$ and $d_m$ here denote the average sample number and feature dimension over all domains respectively. This is linear w.r.t. $M$ and $\prod_{m=1}^M d_m$, and quadratic in $r$ and $N_m$ ($N'_m$ is quadratic w.r.t. $N_m$). Besides, it is common that $\Gamma < 10$, $T_2 < 20$, and $T_1 < 50$. Thus the complexity is not very high.
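The pre-computation trick described above can be illustrated with the following sketch (dimensions and names are illustrative), which verifies that the cached products reproduce the direct evaluation of the quadratic term.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, q, P = 50, 5, 40, 8
U = rng.random((d, r))
B = rng.normal(size=(r, q))                    # B_(m)
Wp = [rng.normal(size=(d, q)) for _ in range(P)]

BBt = B @ B.T                                  # cached once: r x r
SWBt = sum(Wp) @ B.T                           # cached once: d x r
const = sum(np.sum(W ** 2) for W in Wp)        # constant term, also cached

def quad_term(U):
    # P * tr(U^T U BBt) - 2 tr(SWBt^T U) + const; no product involving the large size q
    return P * np.trace(U.T @ U @ BBt) - 2.0 * np.trace(SWBt.T @ U) + const

direct = sum(np.sum((W - U @ B) ** 2) for W in Wp)
assert np.isclose(quad_term(U), direct)
print(direct)
```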

IV. EXPERIMENTS

In this section, the effectiveness and efficiency of the proposed HMTML are verified with experiments in various applications, i.e., document categorization, scene classification, and image annotation. In the following, we first present the datasets to be used and the experimental setups.

A. Datasets and features

Document categorization: we adopt the Reuters multilingual collection (RMLC) [42] dataset with six populous categories. The news articles in this dataset are translated into five languages, and TF-IDF features are provided to represent each document. In our experiments, three languages (i.e., English, Spanish, and Italian) are chosen and each language is regarded as one domain. The provided representations are preprocessed by applying principal component analysis (PCA), after which 20% of the energy is preserved. The resulting dimensions of the document representations are 245, 213, and 107 respectively for the three different domains. The data preprocessing is mainly conducted to: 1) find the high-level patterns which are comparable for transfer; 2) avoid overfitting; 3) speed up the computation during the experiments. The main reasons for preserving only 20% of the energy are: 1) the amount of energy being preserved does not influence the effectiveness comparison between the proposed approach and the other methods; 2) it may lead to overfitting when the feature dimension of each domain is high. There are 18758, 24039, and 12342 instances in the three different domains respectively. In each domain, the sizes of the training and test sets are the same.

Scene classification: We employ a popular natural scene dataset, Scene-15 [43]. The dataset consists of 4,585 images that belong to 15 categories. The features we use are the popular global feature GIST [44], local binary patterns (LBP) [45], and the pyramid histogram of oriented gradients (PHOG) [46], whose dimensions are 20, 59, and 40 respectively. A detailed description of how to extract these features can be found in [47]. Since the images usually lie in a nonlinear feature space, we preprocess the different features using kernel PCA (KPCA) [48]. The resulting dimensions are the same as those of the original features.

Image annotation: The challenging NUS-WIDE (NUS) dataset [49], which consists of 269,648 natural images, is adopted. We conducted experiments on a subset of 16,519 images with 12 animal concepts: bear, bird, cat, cow, dog, elk, fish, fox, horse, tiger, whale, and zebra. In this subset, three kinds of features are extracted for each image: a 500-D bag of visual words [43] based on SIFT descriptors [8], a 144-D color auto-correlogram, and a 128-D wavelet texture. We refer to [49] for a detailed description of the features. Similar to scene classification, KPCA is adopted for preprocessing, and the resulting dimensions are all 100.

In both scene classification and image annotation, we treat each feature space as a domain. In each domain, half of the samples are used for training, and the rest for testing.

B. Experimental setup

The methods included for comparison are:

• EU: directly using the simple Euclidean metric and the original feature representations to compute the distance between samples.

• LMNN [4]: the large margin nearest neighbor DML algorithm presented in [4]. Distance metrics for different domains are learned separately. The number of local target neighbors is chosen from the set {1, 2, . . . , 15}.

• ITML [13]: the information-theoretic metric learning algorithm presented in [13]. The trade-off hyper-parameter is tuned over the set {10^i | i = −5, −4, . . . , 4}.

• RDML [14]: an efficient and competitive distance metric learning algorithm presented in [14]. We tune the trade-off hyper-parameter over the set {10^i | i = −5, −4, . . . , 4}.

• DAMA [10]: constructing mappings U_m to link multiple heterogeneous domains using manifold alignment. The hyper-parameter is determined according to the strategy presented in [10].

• MTDA [12]: applying supervised dimension reduction to heterogeneous representations (domains) simultaneously, where a multi-task extension of linear discriminant analysis is developed. The learned transformation U_m = W_m H consists of a domain-specific part W_m of size d_m × d′ and a common part H of size d′ × r shared by all domains. Here, d′ > r is the intermediate dimensionality. As presented in [12], the shared structure H represents "some characteristics of the application itself in the common lower-dimensional space". The hyper-parameter d′ is set as 100 since the model is not very sensitive to this hyper-parameter according to [12].

• HMTML: the proposed heterogeneous multi-task metric learning method. We adopt linear SVMs to obtain the weight vectors of base classifiers for knowledge transfer. The hyper-parameters γ_m are set to the same value since the different features have been normalized. Both γ and γ_m are optimized over the set {10^i | i = −5, −4, . . . , 4}. It is not hard to determine the value of the hyper-parameter P, which is the number of base classifiers. According to [50], [51], the code length is suggested to be 15 log C if we use the sparse random design coding technique in ECOC. The experimental results in [11] show that higher classification accuracy can be achieved with an increasing P, but when there are too many base classifiers, the accuracy may decrease. The optimal P is achieved around (slightly larger than) 15 log C, so we empirically set P = 10⌈1.5 log C⌉ in our method (a short sketch of this choice is given after this list).
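The following sketch illustrates the base-classifier setup described in the HMTML bullet above (the helper is hypothetical, not the authors' code, and the natural logarithm is assumed since the paper does not state the log base): it picks P and draws a sparse random ECOC coding matrix whose columns define the P binary problems.

```python
import numpy as np

def ecoc_code(C, rng, zero_prob=0.5):
    """Sparse random ECOC coding matrix with P = 10 * ceil(1.5 * log C) columns."""
    P = 10 * int(np.ceil(1.5 * np.log(C)))    # log base is an assumption
    # entries in {-1, 0, +1}: 0 means the class is ignored in that binary problem
    code = rng.choice([-1, 0, 1], size=(C, P),
                      p=[(1 - zero_prob) / 2, zero_prob, (1 - zero_prob) / 2])
    # make sure every binary problem involves both a positive and a negative class
    for p in range(P):
        if not ((code[:, p] > 0).any() and (code[:, p] < 0).any()):
            code[rng.choice(C, 2, replace=False), p] = [1, -1]
    return code

rng = np.random.default_rng(0)
print(ecoc_code(C=6, rng=rng).shape)          # six categories, as in RMLC
```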

The single domain metric learning algorithms (LMNN, ITML, and RDML) only utilize the limited labeled samples in each domain, and do not make use of any additional information from other domains. In DAMA and MTDA, after learning $U_m$, we derive the metric for each domain as $A_m = U_m U_m^T$.

The task in each domain is to perform multi-class classification. The penalty hyper-parameters of the base SVM classifiers are empirically set to 1. For all compared methods, any type of classifier can be utilized for final classification in each domain after learning the distance metrics for all domains. In the distance metric learning literature, it is common to use the nearest neighbour (1NN) classifier to directly evaluate the performance of the learned metric [6], [7], [52], so we mainly adopt it in this paper. Some results of using the more sophisticated SVM classifiers can be found in the supplementary material.

In the KPCA preprocessing, we adopt the Gaussian kernel, i.e., $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\big(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\omega^2)\big)$, where the hyper-parameter $\omega$ is empirically set as the mean of the distances between all training sample pairs.
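For illustration, the following sketch (our own helper; any standard KPCA implementation could then consume the resulting kernel matrix) computes the Gaussian kernel with ω set to the mean pairwise distance.

```python
import numpy as np

def gaussian_kernel(X, omega=None):
    """Gaussian kernel matrix; omega defaults to the mean pairwise distance."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if omega is None:
        dists = np.sqrt(sq_dists)
        omega = dists[np.triu_indices_from(dists, k=1)].mean()   # mean pairwise distance
    return np.exp(-sq_dists / (2.0 * omega ** 2)), omega

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
K, omega = gaussian_kernel(X)
print(K.shape, round(omega, 3))
```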

In all experiments given below, we adopt the classification accuracy and the macroF1 [53] score to evaluate the performance. The average performance over all domains is calculated for comparison. The experiments are run ten times by randomly choosing different sets of labeled instances. Both the mean values and standard deviations are reported.

If unspecified, the hyper-parameters are determined using leave-one-out cross validation on the labeled set. For example, when the number of labeled samples is 5 for each class in each domain, one sample is chosen for model evaluation and the remaining four samples are used for model training. This leads to C test examples and 4C training examples in each domain, where C is the number of classes. This procedure is repeated five times by regarding each of the five labeled samples as the test example in turn. The average classification accuracies of all domains from all five runs are averaged for hyper-parameter determination. The choice of an optimal number of common factors for multiple heterogeneous features is still an open problem [54], and we do not study it in this paper. To this end, for DAMA, MTDA, and the proposed HMTML, we do not tune the hyper-parameter r, which is the number of common factors (or the dimensionality of the common subspace) used to explain the original data of all domains. The performance comparisons are performed on a set of varied r = {1, 2, 5, 8, 10, 20, 30, 50, 80, 100}. The maximum value of r is 20 in scene classification since the lowest feature dimension is 20.

C. Overall classification results

Document categorization: In the training set, the number of labeled instances for each category is chosen randomly as 5, 10, or 15. This is used to evaluate the performance of all compared approaches w.r.t. the number of labeled samples. The results of selecting more labeled samples are given in the supplementary material. Fig. 2 and Fig. 3 show the accuracies and macroF1 values respectively for different r. The performance of all compared approaches at their best numbers (of common factors) can be found in Table I. From these results, we can draw several conclusions: 1) when the number of labeled samples increases, the performance of all approaches becomes better; 2) although only a limited number of labeled samples is given in each domain, ITML and RDML can greatly improve the performance. This indicates that distance metric learning (DML) is an effective tool in this application; 3) the performance of all heterogeneous transfer learning methods (DAMA, MTDA, and HMTML) is much better than that of the single-domain DML algorithms (LMNN, ITML, and RDML). This indicates that it is useful to leverage information from other domains in DML. In addition, it can be seen from the results that the optimal number r is often smaller than 30, hence we may only need 30 factors to distinguish the different categories in this dataset; 4) although effective and discriminative, the performance of MTDA is heavily dependent on the label information. Therefore, when the labeled samples are scarce, MTDA is worse than DAMA. The topology preservation in DAMA is helpful when the labeled samples are insufficient; 5) the proposed HMTML also relies heavily on the label information to learn base classifiers, which are used to connect different domains. If the labeled samples are scarce, some learned classifiers may be unreliable. However, owing to the task construction strategy presented in [11], robust transformations can be obtained even if we have learned some incorrect or inaccurate classifiers. Therefore, our method outperforms DAMA even when the labeled instances are scarce; 6) overall, the developed HMTML is superior to both DAMA and MTDA at most numbers (of common factors). This can be interpreted as the expressive ability of the factors learned by our method being stronger than that of the other compared methods. The main reason is that the high-order statistics of all domains are examined, whereas only the pairwise correlations are exploited in DAMA and MTDA; 7) consistent performance has been achieved under the different criteria (accuracy and macroF1 score). Specifically, compared with the competitive MTDA, the improvements of the proposed method are considerable. For example, when 5, 10, and 15 labeled samples are utilized, the relative improvements are 10.0%, 7.1%, and 3.1% respectively under the macroF1 criterion.

In addition, we conduct experiments on more domains to verify that the proposed method is able to handle an arbitrary number of heterogeneous domains. See the supplementary material for the results.

Scene classification: The number of labeled examples for each category varies in the set {4, 6, 8}. The performance w.r.t. the number r is shown in Fig. 4 and Fig. 5, and the average performance is summarized in Table II. We can see from the results that: 1) when the number of labeled examples is 4, both LMNN and RDML are only comparable to directly using the Euclidean distance (EU), and ITML and DAMA are only slightly better than the EU baseline.


Fig. 2. Average accuracy of all domains vs. number of the common factors on the RMLC dataset.

Fig. 3. Average macroF1 score of all domains vs. number of the common factors on the RMLC dataset.

Fig. 4. Average accuracy of all domains vs. number of the common factors on the Scene-15 dataset.

Fig. 5. Average macroF1 score of all domains vs. number of the common factors on the Scene-15 dataset.


TABLE I
AVERAGE ACCURACY AND MACROF1 SCORE OF ALL DOMAINS OF THE COMPARED METHODS AT THEIR BEST NUMBERS (OF COMMON FACTORS) ON THE RMLC DATASET. IN EACH DOMAIN, THE NUMBER OF LABELED TRAINING SAMPLES FOR EACH CATEGORY VARIES FROM 5 TO 15.

                      Accuracy                                        MacroF1
Methods   5            10           15             5            10           15
EU        0.382±0.023  0.442±0.012  0.491±0.012    0.369±0.030  0.423±0.012  0.479±0.022
LMNN      0.412±0.020  0.461±0.017  0.514±0.017    0.400±0.029  0.454±0.021  0.495±0.015
ITML      0.489±0.038  0.561±0.032  0.585±0.026    0.486±0.028  0.556±0.024  0.580±0.024
RDML      0.431±0.014  0.500±0.029  0.565±0.013    0.412±0.019  0.466±0.023  0.550±0.017
DAMA      0.544±0.021  0.599±0.029  0.602±0.009    0.533±0.013  0.583±0.027  0.589±0.012
MTDA      0.486±0.031  0.598±0.015  0.666±0.008    0.493±0.023  0.585±0.016  0.647±0.012
HMTML     0.559±0.005  0.645±0.013  0.686±0.003    0.543±0.015  0.626±0.010  0.666±0.007

TABLE II
AVERAGE ACCURACY AND MACROF1 SCORE OF ALL DOMAINS OF THE COMPARED METHODS AT THEIR BEST NUMBERS (OF COMMON FACTORS) ON THE SCENE-15 DATASET. IN EACH DOMAIN, THE NUMBER OF LABELED TRAINING EXAMPLES FOR EACH CATEGORY VARIES FROM 4 TO 8.

                      Accuracy                                        MacroF1
Methods   4            6            8              4            6            8
EU        0.321±0.018  0.337±0.012  0.349±0.010    0.309±0.017  0.326±0.012  0.337±0.011
LMNN      0.326±0.014  0.360±0.006  0.392±0.011    0.310±0.010  0.342±0.005  0.377±0.011
ITML      0.336±0.017  0.375±0.009  0.389±0.011    0.324±0.017  0.365±0.009  0.379±0.011
RDML      0.322±0.005  0.341±0.004  0.358±0.005    0.308±0.009  0.332±0.008  0.347±0.007
DAMA      0.328±0.009  0.377±0.016  0.411±0.010    0.320±0.007  0.358±0.012  0.391±0.009
MTDA      0.350±0.008  0.381±0.006  0.394±0.008    0.325±0.011  0.351±0.006  0.366±0.008
HMTML     0.366±0.008  0.406±0.011  0.427±0.004    0.334±0.009  0.381±0.012  0.401±0.004

This may be because the learned metrics are linear, while the images usually lie in a nonlinear feature space. Although the KPCA preprocessing can help to exploit the nonlinearity to some extent, it does not optimize w.r.t. the metric; 2) when compared with the EU baseline, the heterogeneous transfer learning approaches do not improve as much as in document categorization. This can be explained by the fact that the different domains correspond to different types of representations in scene classification, which is much more challenging than the classification of multilingual documents, where the same kind of feature (TF-IDF) with various vocabularies is adopted. In this application, the statistical properties of the various types of visual representations greatly differ from each other. Therefore, the common factors contained among these different domains are difficult to discover by only exploring the pair-wise correlations between them. However, much better performance can be obtained by the presented HMTML since the correlations of all domains are exploited simultaneously, especially when a small number of labeled samples is given. Thus the superiority of our method in alleviating the labeled data deficiency issue is verified.

Image annotation: The number of labeled instances for each concept varies in the set {4, 6, 8}. We show the annotation accuracies and MacroF1 scores of the compared methods in Fig. 6 and Fig. 7, respectively, and summarize the results at their best numbers (of common factors) in Table III. It can be observed from the results that: 1) none of the single-domain metric learning algorithms perform well. This is mainly due to the challenging nature of the setting, in which each kind of feature is regarded as a domain. Besides, in this application, the different animal concepts are hard to distinguish since they have high inter-class similarity (e.g., "cat" is similar to "tiger") or large intra-class variability (such as "bird"). More labeled instances are necessary for these algorithms to work, and we refer to the supplementary material for a demonstration; 2) DAMA is comparable to the EU baseline, and MTDA only obtains satisfactory accuracies when enough (e.g., 8) labeled instances are provided, whereas the proposed HMTML still outperforms the other approaches in most cases, and the tendencies of the macroF1 score curves are the same as those of accuracy.
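Throughout the experiments, MacroF1 denotes the macro-averaged F1 score, i.e., the unweighted mean of the per-category F1 scores (cf. [53]). A minimal sketch of this computation is given below; the function and variable names are ours and only illustrate the convention, not the exact evaluation code used in the experiments.

import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    # Unweighted mean of per-class F1 scores (MacroF1).
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
        f1_scores.append(f1)
    return float(np.mean(f1_scores))

# Example: 3 categories, predictions from a kNN classifier on a learned metric.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(macro_f1(y_true, y_pred, num_classes=3))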

D. Further empirical study

In this section, we conduct some further empirical studies on the RMLC dataset. More results can be found in the supplementary material.

1) An investigation of the performance for the individual domains: Fig. 8 compares all the different approaches on the individual domains. From the results, we can conclude that: 1) RDML improves the performance in each domain, and the improvements are similar across domains, since the domains do not communicate with each other. In contrast, the transfer learning approaches improve much more than RDML in the domains whose original representations contain less discriminative information, such as IT (Italian) and EN (English). These results validate the successful transfer of knowledge among different domains, which is indeed helpful in metric learning; 2) in the well-performing SP (Spanish) domain, the results of DAMA and MTDA are almost the same as those of RDML, and sometimes even a little worse. This phenomenon


Fig. 6. Average accuracy of all domains vs. number of the common factors on the NUS animal subset. (Panels: 4, 6, and 8 labeled samples per animal concept; x-axis: number of common factors in {1, 5, 10, 20, 30, 50, 80, 100}; legend: EU, LMNN, ITML, RDML, DAMA, MTDA, HMTML.)

Fig. 7. Average macroF1 score of all domains vs. number of the common factors on the NUS animal subset. (Panels: 4, 6, and 8 labeled samples per animal concept; same axes and legend as Fig. 6.)

TABLE III
AVERAGE ACCURACY AND MACROF1 SCORE OF ALL DOMAINS OF THE COMPARED METHODS AT THEIR BEST NUMBERS (OF COMMON FACTORS) ON THE NUS ANIMAL SUBSET. IN EACH DOMAIN, THE NUMBER OF LABELED TRAINING INSTANCES FOR EACH CONCEPT VARIES FROM 4 TO 8.

         |             Accuracy                  |             MacroF1
Methods  | 4           6           8             | 4           6           8
EU       | 0.161±0.011 0.172±0.008 0.177±0.009   | 0.153±0.013 0.167±0.009 0.175±0.007
LMNN     | 0.169±0.006 0.176±0.001 0.185±0.004   | 0.161±0.008 0.173±0.004 0.177±0.005
ITML     | 0.172±0.012 0.180±0.008 0.188±0.006   | 0.171±0.012 0.177±0.009 0.188±0.007
RDML     | 0.164±0.009 0.172±0.009 0.184±0.005   | 0.156±0.005 0.160±0.011 0.176±0.006
DAMA     | 0.165±0.008 0.171±0.012 0.179±0.003   | 0.168±0.006 0.171±0.004 0.175±0.005
MTDA     | 0.164±0.019 0.178±0.009 0.188±0.010   | 0.157±0.022 0.177±0.011 0.191±0.010
HMTML    | 0.184±0.010 0.191±0.004 0.195±0.004   | 0.178±0.010 0.189±0.003 0.198±0.007

indicates that in DAMA and MTDA, the discriminative domain benefits little from the relatively non-discriminative domains, while the proposed HMTML still achieves significant improvements in this case. This demonstrates that the high-order correlation information among all domains is well discovered, and that exploring this kind of information is much better than only exploring the correlations between pairs of domains (as in DAMA and MTDA).

2) A self-comparison analysis: To see how the different design choices affect the performance, we conduct a self-comparison of the proposed HMTML. In particular, we compare HMTML with the following sub-models:

• HMTML (loss=0): dropping the log-loss term in (11); this amounts to solving just the optimization problem described in (8).

• HMTML (reg=0): setting the hyper-parameter γ = 0 in (11). The sparsity of the factors is still encouraged, whereas the base classifiers are not used.

Fig. 8. Individual accuracy and macroF1 score of each domain of the compared methods at their best numbers (of common factors) on the RMLC dataset. (10 labeled instances for each category; EN: English, IT: Italian, SP: Spanish, AVG: average; legend: EU, LMNN, ITML, RDML, DAMA, MTDA, HMTML.)


Fig. 9. A self-comparison of the proposed method on the RMLC dataset. (Panels: accuracy and macroF1 score with 10 labeled samples per category; x-axis: number of common factors in {1, 5, 10, 20, 30, 50, 80, 100}; legend: HMTML (loss=0), HMTML (reg=0), HMTML (F-norm), HMTML (no constr.), HMTML.)

• HMTML (F-norm): replacing the l1-norm with the Frobenius norm in (11). The optimization becomes easier since the gradient of the Frobenius norm can be directly obtained without smoothing.

• HMTML (no constr.): dropping the non-negative constraints in (11). That is, the operator π[x] in (19) is not performed in the optimization. (A small sketch contrasting the last two variants is given below.)
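To make the last two variants concrete, the following sketch contrasts the closed-form gradient of a Frobenius-norm regularizer with the gradient of a smoothed l1-norm (Nesterov-style smoothing [39]), and shows an entry-wise non-negative projection step. The smoothing parameter mu, the step size, and the function names are our own illustrative choices, not the exact quantities defined in (11) and (19).

import numpy as np

def grad_frobenius(U):
    # d/dU (1/2)*||U||_F^2 = U: available in closed form, no smoothing needed.
    return U

def grad_smoothed_l1(U, mu=1e-3):
    # Gradient of a smoothed l1-norm: each entry's (sub)gradient is clipped to [-1, 1].
    return np.clip(U / mu, -1.0, 1.0)

def project_nonnegative(U):
    # Entry-wise projection pi[x] = max(x, 0), applied after each gradient step;
    # this is the step dropped in HMTML (no constr.).
    return np.maximum(U, 0.0)

# One illustrative projected-gradient step on a random factor matrix.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 3))
step = 0.1
U = project_nonnegative(U - step * grad_smoothed_l1(U))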

The results are shown in Fig. 9, where 10 labeled instances are selected for each category. From the results, we can see that solving only the proposed tensor-based divergence minimization problem (8) achieves satisfactory performance when the number of common factors is large enough. Both the accuracy and the macroF1 score at the best number of common factors are higher than those of HMTML (reg=0), which jointly minimizes the empirical losses of all domains without minimizing their divergence in the subspace. HMTML (reg=0) is better when the number of common factors is small, and its performance is steadier. Therefore, both the log-loss and the divergence minimization terms are indispensable in the proposed model. HMTML (F-norm) is worse than the proposed HMTML that adopts the l1-norm, which demonstrates the benefit of enforcing sparsity in metric learning, as suggested in the literature [55], [56], [57]. HMTML (no constr.) is worse than HMTML at most dimensionalities, although their best performances are comparable. This may be because the non-negative constraints help to narrow the hypothesis space and thus control the model complexity.

3) Sensitivity analysis w.r.t. different initializations and optimization orders: We can only obtain a local minimum since the proposed formulation (11) is not jointly convex with respect to all the parameters. However, the proposed method achieves satisfactory performance using only random initializations. Fig. 10(a) shows that the model is insensitive to different initializations of the parameter set Um, which demonstrates the effectiveness of the obtained locally optimal solution.

In addition, Fig. 10(b) shows that the proposed model is also insensitive to different optimization orders of the parameter set Um; that is, the order in which the Um are optimized does not affect the performance.
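A minimal sketch of the kind of check reported in Fig. 10(a)-(b) is given below: the alternating (block-coordinate) update loop is rerun with several random initializations and with the per-domain update order permuted. Here update_single_domain and evaluate_accuracy are hypothetical placeholders for the actual update rule derived from (11) and the kNN evaluation, respectively.

import numpy as np

def sensitivity_check(init_shapes, update_single_domain, evaluate_accuracy,
                      num_trials=10, num_epochs=20, seed=0):
    # Rerun alternating optimization with random inits and shuffled update orders.
    rng = np.random.default_rng(seed)
    scores = []
    for trial in range(num_trials):
        # Fresh random initialization of each domain's factor matrix U_m.
        factors = [rng.standard_normal(shape) for shape in init_shapes]
        for epoch in range(num_epochs):
            # Permute the order in which the domains are updated.
            for m in rng.permutation(len(factors)):
                factors[m] = update_single_domain(m, factors)
        scores.append(evaluate_accuracy(factors))
    return np.mean(scores), np.std(scores)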

4) Empirical analysis of the computational complexity: The training time of the different approaches can be found in Fig. 10(c). All the experiments are performed on a computer with a 2.9 GHz Intel Xeon CPU (8 cores), running Matlab R2014b. The results indicate that the proposed HMTML is comparable to ITML when r (the number of common factors) is small, and slower than all the other approaches when many common factors are required. The cost of HMTML is not strictly quadratic w.r.t. r because the number of iterations needed for the algorithm to terminate changes with r. Since the optimal r is usually not very large, the time cost of the proposed method is tolerable. It should be noted that the test times of the different methods are almost the same because the sizes of the obtained metrics are the same.

V. CONCLUSION

A novel approach for heterogeneous multi-task metric learning (HMTML) is proposed in this paper. In HMTML, we calculate the prediction weight covariance tensor of multiple heterogeneous domains, and then discover the high-order statistics among these different domains by analyzing the tensor. The proposed HMTML can successfully transfer knowledge between different domains by finding a common subspace, making the domains help each other in metric learning. In this subspace, the high-order correlations among all domains are exploited. An efficient optimization algorithm is developed for finding the solutions. It is demonstrated empirically that exploiting the high-order correlation information achieves better performance than the pairwise relationship exploration in traditional approaches.
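As an illustration of the covariance-tensor idea, the sketch below builds a third-order covariance tensor from the prediction weight vectors of three domains by summing their outer products over the shared categories. The dimensions, the stacking of the weights, and the normalization are illustrative assumptions for exposition, not the exact construction used in (8).

import numpy as np

# Prediction weight matrices, one per domain: W[m] has one weight vector per category.
rng = np.random.default_rng(0)
dims = [50, 80, 120]          # feature dimensions of three heterogeneous domains (assumed)
num_categories = 10
W = [rng.standard_normal((d, num_categories)) for d in dims]

# High-order (3rd-order) covariance tensor: sum over categories of the outer
# product of the three domains' weight vectors for that category.
C = np.zeros(dims)
for c in range(num_categories):
    C += np.einsum('i,j,k->ijk', W[0][:, c], W[1][:, c], W[2][:, c])
C /= num_categories
print(C.shape)  # (50, 80, 120)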

It can be concluded from the experimental results on three challenging datasets that: 1) when the labeled data is scarce, learning the metric for each individual domain separately may deteriorate the performance. This issue can be alleviated by jointly learning the metrics of multiple heterogeneous domains, which is in line with the multi-task learning literature; 2) the knowledge shared by different heterogeneous domains can be exploited by heterogeneous transfer learning approaches, and each domain can benefit from such knowledge if the discovered common factors are appropriate. We show that it is particularly helpful to utilize the high-order statistics (correlation information) in discovering such factors.

The proposed algorithm is particularly suitable for the case where we want to improve the performance of multiple heterogeneous domains and the labeled samples are scarce for each of them. Because the learned transformation is linear for each domain, the proposed method is very effective for text-analysis-based applications (such as document categorization), where the structure of the data distribution is usually linear. For image data, we suggest adding a kernel PCA (KPCA) preprocessing step, which may help improve the performance to some extent.
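A minimal sketch of such a preprocessing step, using scikit-learn's KernelPCA as a stand-in for the KPCA of [48], is shown below; the kernel choice, the number of components, and the variable names are illustrative assumptions rather than the settings used in our experiments.

import numpy as np
from sklearn.decomposition import KernelPCA

# Nonlinear preprocessing of one image domain before (linear) metric learning.
rng = np.random.default_rng(0)
X_image = rng.standard_normal((200, 512))   # e.g., 200 images with 512-D visual features

kpca = KernelPCA(n_components=100, kernel='rbf', gamma=1e-3)
X_kpca = kpca.fit_transform(X_image)        # nonlinear features fed to the metric learner
print(X_kpca.shape)                         # (200, 100)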

The relatively higher computational cost may be the main drawback of the proposed HMTML compared with other heterogeneous transfer learning approaches. In the future, we will develop more efficient algorithms for the optimization.


Fig. 10. Empirical analysis: (a) Performance in accuracy vs. randomly initialized Um; the proposed model is insensitive to different initializations. (b) Performance in accuracy vs. the order of optimizing Um; the proposed model is insensitive to different optimization orders. (c) Training time cost vs. number of common factors on the RMLC dataset (10 labeled instances for each category). (Panels (a) and (b): curves for 5, 10, and 15 labeled samples per category; panel (c) legend: LMNN, ITML, RDML, DAMA, MTDA, HMTML.)

For example, parallel computing techniques will be introduced to accelerate the optimization. In addition, to deal with complicated domains, we intend to learn nonlinear metrics by extending the presented HMTML.

REFERENCES

[1] Z. Xu, K. Q. Weinberger, and O. Chapelle, "Distance metric learning for kernel machines," arXiv preprint arXiv:1208.3422v2, 2013.
[2] P. Bouboulis, S. Theodoridis, C. Mavroforakis, and L. Evaggelatou-Dalla, "Complex support vector machines for regression and quaternary classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6, pp. 1260–1274, 2015.
[3] B. McFee and G. R. Lanckriet, "Metric learning to rank," in International Conference on Machine Learning, 2010, pp. 775–782.
[4] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Advances in Neural Information Processing Systems, 2005, pp. 1473–1480.
[5] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, "Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation," in IEEE International Conference on Computer Vision, 2009, pp. 309–316.
[6] Z. Zha, T. Mei, M. Wang, Z. Wang, and X. Hua, "Robust distance metric learning with auxiliary knowledge," in International Joint Conference on Artificial Intelligence, 2009, pp. 1327–1332.
[7] Y. Zhang and D. Y. Yeung, "Transfer metric learning with semi-supervised extension," ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 3, pp. 54:1–54:28, 2012.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[9] X. Shi, Q. Liu, W. Fan, P. S. Yu, and R. Zhu, "Transfer learning on heterogenous feature spaces via spectral transformation," in International Conference on Data Mining, 2010, pp. 1049–1054.
[10] C. Wang and S. Mahadevan, "Heterogeneous domain adaptation using manifold alignment," in International Joint Conference on Artificial Intelligence, 2011, pp. 1541–1546.
[11] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan, "Heterogeneous domain adaptation for multiple classes," in International Conference on Artificial Intelligence and Statistics, 2014, pp. 1095–1103.
[12] Y. Zhang and D. Y. Yeung, "Multi-task learning in heterogeneous feature spaces," in AAAI Conference on Artificial Intelligence, 2011, pp. 574–579.
[13] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in International Conference on Machine Learning, 2007, pp. 209–216.
[14] R. Jin, S. Wang, and Y. Zhou, "Regularized distance metric learning: Theory and algorithm," in Advances in Neural Information Processing Systems, 2009, pp. 862–870.
[15] Y. Luo, Y. Wen, and D. Tao, "On combining side information and unlabeled data for heterogeneous multi-task metric learning," in International Joint Conference on Artificial Intelligence, 2016, pp. 1809–1815.
[16] Y. Luo, D. Tao, and Y. Wen, "Exploiting high-order information in heterogeneous multi-task feature learning," in International Joint Conference on Artificial Intelligence, 2016, pp. 2443–2449.
[17] X. Xu, W. Li, and D. Xu, "Distance metric learning using privileged information for face verification and person re-identification," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3150–3162, 2015.
[18] B. Kulis, "Metric learning: A survey," Foundations and Trends in Machine Learning, vol. 5, no. 4, pp. 287–364, 2012.
[19] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," arXiv preprint arXiv:1306.6709v4, 2014.
[20] E. P. Xing, M. I. Jordan, S. Russell, and A. Ng, "Distance metric learning with application to clustering with side-information," in Advances in Neural Information Processing Systems, 2002, pp. 505–512.
[21] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Advances in Neural Information Processing Systems, 2004, pp. 513–520.
[22] R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul, "A new kernelization framework for Mahalanobis distance learning algorithms," Neurocomputing, vol. 73, no. 10, pp. 1570–1579, 2010.
[23] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 539–546.
[24] D. Kedem, S. Tyree, F. Sha, G. R. Lanckriet, and K. Q. Weinberger, "Non-linear metric learning," in Advances in Neural Information Processing Systems, 2012, pp. 2573–2581.
[25] M. Norouzi, D. M. Blei, and R. R. Salakhutdinov, "Hamming distance metric learning," in Advances in Neural Information Processing Systems, 2012, pp. 1061–1069.
[26] B. Geng, D. Tao, and C. Xu, "DAML: Domain adaptation metric learning," IEEE Transactions on Image Processing, vol. 20, no. 10, pp. 2980–2989, 2011.
[27] S. Parameswaran and K. Q. Weinberger, "Large margin multi-task metric learning," in Advances in Neural Information Processing Systems, 2010, pp. 1867–1875.
[28] P. Yang, K. Huang, and C. Liu, "Geometry preserving multi-task metric learning," Machine Learning, vol. 92, no. 1, pp. 133–175, 2013.
[29] C. Li, M. Georgiopoulos, and G. C. Anagnostopoulos, "Multitask classification hypothesis space with improved generalization bounds," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 7, pp. 1468–1479, 2015.
[30] L. Shao, F. Zhu, and X. Li, "Transfer learning for visual categorization: A survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1019–1034, 2015.
[31] L. Duan, D. Xu, and I. Tsang, "Learning with augmented features for heterogeneous domain adaptation," in International Conference on Machine Learning, 2012, pp. 711–718.
[32] B. Kulis, K. Saenko, and T. Darrell, "What you saw is not what you get: Domain adaptation using asymmetric kernel transforms," in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1785–1792.

[33] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research, vol. 2, pp. 263–286, 1995.

[34] R. K. Ando and T. Zhang, "A framework for learning predictive structures from multiple tasks and unlabeled data," Journal of Machine Learning Research, vol. 6, pp. 1817–1853, 2005.
[35] A. Quattoni, M. Collins, and T. Darrell, "Learning visual representations using images with captions," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[36] R. S. Cabral, F. Torre, J. P. Costeira, and A. Bernardino, "Matrix completion for multi-label image classification," in Advances in Neural Information Processing Systems, 2011, pp. 190–198.
[37] L. De Lathauwer, B. De Moor, and J. Vandewalle, "On the best rank-1 and rank-(r1, r2, ..., rn) approximation of higher-order tensors," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000.
[38] C. J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
[39] Y. Nesterov, "Smooth minimization of non-smooth functions," Mathematical Programming, vol. 103, no. 1, pp. 127–152, 2005.
[40] T. Zhou, D. Tao, and X. Wu, "NESVM: A fast gradient method for support vector machines," in International Conference on Data Mining, 2010, pp. 679–688.
[41] W. Miller, "Computational complexity and numerical stability," SIAM Journal on Computing, vol. 4, no. 2, pp. 97–107, 1975.
[42] M. Amini, N. Usunier, and C. Goutte, "Learning from multiple partially observed views - an application to multilingual text categorization," in Advances in Neural Information Processing Systems, 2009, pp. 28–36.
[43] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169–2178.
[44] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[45] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[46] A. Bosch, A. Zisserman, and X. Munoz, "Image classification using random forests and ferns," in International Conference on Computer Vision, 2007, pp. 1–8.
[47] D. Dai, T. Kroeger, R. Timofte, and L. Van Gool, "Metric imitation by manifold transfer for efficient vision applications," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3527–3536.
[48] B. Scholkopf, A. Smola, and K. R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[49] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in International Conference on Image and Video Retrieval, 2009.
[50] E. L. Allwein, R. E. Schapire, and Y. Singer, "Reducing multiclass to binary: A unifying approach for margin classifiers," Journal of Machine Learning Research, vol. 1, pp. 113–141, 2000.
[51] S. Escalera, O. Pujol, and P. Radeva, "Separability of ternary codes for sparse designs of error-correcting output codes," Pattern Recognition Letters, vol. 30, no. 3, pp. 285–297, 2009.
[52] G. Qi, C. C. Aggarwal, and T. S. Huang, "Transfer learning of distance metrics by cross-domain metric sampling across heterogeneous spaces," in SIAM International Conference on Data Mining, 2012, pp. 528–539.
[53] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing and Management, vol. 45, no. 4, pp. 427–437, 2009.
[54] D. P. Foster, R. Johnson, and T. Zhang, "Multi-view dimensionality reduction via canonical correlation analysis," TTI-Chicago, Tech. Rep. TR-2009-5, 2008.
[55] G. Qi, J. Tang, Z. Zha, T. S. Chua, and H. Zhang, "An efficient sparse metric learning in high-dimensional space via L1-penalized log-determinant regularization," in International Conference on Machine Learning, 2009, pp. 841–848.
[56] W. Liu, S. Ma, D. Tao, J. Liu, and P. Liu, "Semi-supervised sparse metric learning using alternating linearization optimization," in ACM International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1139–1148.
[57] K. Huang, Y. Ying, and C. Campbell, "Generalized sparse metric learning with relative comparisons," Knowledge and Information Systems, vol. 28, no. 1, pp. 25–45, 2011.