A SUBMISSION TO IEEE TRANSACTIONS ON IMAGE …€¦ · A. Multilabel Image Classiﬁcation...

1057-7149 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2016.2577382, IEEETransactions on Image Processing

A SUBMISSION TO IEEE TRANSACTIONS ON IMAGE PROCESSING 1

Correlated Logistic Model with Elastic NetRegularization for Multilabel Image Classification

Qiang Li, Bo Xie, Jane You, Wei Bian, Member, IEEE, and Dacheng Tao, Fellow, IEEE

Abstract—In this paper, we present correlated logistic model(CorrLog) for multilabel image classification. CorrLog extendsconventional Logistic Regression model into multilabel cases, viaexplicitly modelling the pairwise correlation between labels. Inaddition, we propose to learn model parameters of CorrLogwith Elastic Net regularization, which helps exploit the sparsityin feature selection and label correlations and thus furtherboost the performance of multilabel classification. CorrLog canbe efficiently learned, though approximately, by regularizedmaximum pseudo likelihood estimation (MPLE), and it enjoysa satisfying generalization bound that is independent of thenumber of labels. CorrLog performs competitively for multil-abel image classification on benchmark datasets MULAN scene,MIT outdoor scene, PASCAL VOC 2007 and PASCAL VOC2012, compared to the state-of-the-art multilabel classificationalgorithms.

I. INTRODUCTION

Multilabel classification (MLC) extends conventional singlelabel classification (SLC) by allowing an instance to beassigned to multiple labels from a label set. It occurs naturallyfrom a wide range of practical problems, such as documentcategorization, image classification, music annotation, web-page classification and bioinformatics applications, where eachinstance can be simultaneously described by several classlabels out of a candidate label set. MLC is also closely relatedto many other research areas, such as subspace learning [1],nonnegative matrix factorization [2], multi-view learning [3]and multi-task learning [4]. Because of its great generality andwide applications, MLC has received increasing attentions inrecent years from machine learning, data mining, to computervision communities, and developed rapidly with both algorith-mic and theoretical achievements [5], [6], [7], [8], [9], [10].

This research was supported in part by Australian Research Council ProjectsFT-130101457, DP-140102164 and LE-140100061. The funding support fromthe Hong Kong government under its General Research Fund (GRF) scheme(Ref. no. 152202/14E) and the Hong Kong Polytechnic University CentralResearch Grant is greatly appreciated.

Q. Li is with the Centre for Quantum Computation and Intelligent Systems,Faculty of Engineering and Information Technology, University of TechnologySydney, 81 Broadway, Ultimo, NSW 2007, Australia, and also with Depart-ment of Computing, The Hong Kong Polytechnic University, Hung Hom,Kowloon, Hong Kong (e-mail: [email protected]).

W. Bian and D. Tao are with the Centre for Quantum Computation andIntelligent Systems, Faculty of Engineering and Information Technology,University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia(e-mail: [email protected], [email protected]).

B. Xie is with College of Computing, Georgia Institute of Technology,Atlanta, GA 30345, USA (email: [email protected]).

J. You is with Department of Computing, The Hong Kong Poly-technic University, Hung Hom, Kowloon, Hong Kong (e-mail: [email protected]).

The key feature of MLC that makes it distinct from SLCis label correlation, without which classifiers can be trainedindependently for each individual label and MLC degeneratesto SLC. The correlation between different labels can beverified by calculating the statistics, e.g., χ2 test and Pearson’scorrelation coefficient, of their distributions. According to [11],there are two types of label correlations (or dependence), i.e.,the conditional correlations and the unconditional correlations,wherein the former describes the label correlations conditionedon a given instance while the latter summarizes the globallabel correlations of only label distribution by marginalizingout the instance. From a classification point of view, modellingof label conditional correlations is preferable since they aredirectly related to prediction; however, proper utilization of un-conditional correlations is also helpful, but in an average sensebecause of the marginalization. Accordingly, quite a number ofMLC algorithms have been proposed in the past a few years,by exploiting either of the two types of label correlations,1 andbelow, we give a brief review of the representative ones. Asit is a very big literature, we cannot cover all the algorithms.The recent surveys [8], [9] contain many references omittedfrom this paper.• By exploiting unconditional label correlations: A large

class of MLC algorithms that utilize unconditional labelcorrelations are built upon label transformation. The keyidea is to find new representation for the label vector(one dimension corresponds to an individual label), sothat the transformed labels or responses are uncorrelatedand thus can be predicted independently. Original labelvector needs to be recovered after the prediction. MLCalgorithms using label transformation include [12] whichutilizes low-dimensional embedding and [13] and [14]which use random projections. Another strategy of usingunconditional label correlations, e.g., used in the stackingmethod [6] and the “Curds” & “Whey” procedure [15],is first to predict each individual label independently andcorrect/adjust the prediction by proper post-processing.Algorithms are also proposed based on co-occurrenceor structure information extracted from the label set,which include random k-label sets (RAKEL) [16], prunedproblem transformation (PPT) [17], hierarchical binaryrelevance (HBR) [18] and hierarchy of multilabel classi-fiers (HOMER) [8]. Regression-based models, includingreduced-rank regression and multitask learning, can also

1Studies on MLC, from different perspectives rather than label correlations,also exit in the literature, e.g., by defining different loss functions, dimensionreduction and classifier ensemble methods, but are not in the scope of thispaper.




be used for MLC, with an interpretation of utilizingunconditional label correlations [11].

• By exploiting conditional label correlations: MLC algo-rithms in this category are diverse and often developedby specific heuristics. For example, multilabel K-nearestneighbour (MLkNN) [5] extends KNN to the multilabelsituation, which applies maximum a posterior (MAP)label prediction by obtaining the prior label distributionwithin the K nearest neighbours of an instance. Instance-based logistic regression (IBLR) [6] is also a localizedalgorithm, which modifies logistic regression by usinglabel information from the neighbourhood as features.Classifier chain (CC) [19], as well as its ensemble andprobabilistic variants [20], incorporate label correlationsinto a chain of binary classifiers, where the predictionof a label uses previous labels as features. Channelcoding based MLC techniques such as principal labelspace transformation (PLST) [21] and maximum marginoutput coding (MMOC) [22] proposed to select codes thatexploits conditional label correlations. Graphical models,e.g., conditional random fields (CRFs) [23], are alsoapplied to MLC, which provides a richer framework tohandle conditional label correlations.

A. Multilabel Image ClassificationMultilabel image classification belongs to the generic scope

of MLC, but handles the specific problem of predicting thepresence or absence of multiple object categories in an image.Like many related high-level vision tasks such as objectrecognition [24], [25], visual tracking [26], image annotation[27], [28], [29] and scene classification [30], [31], [32],multilabel image classification [33], [34], [35], [36], [37],[38] is very challenging due to large intra-class variation. Ingeneral, the variation is caused by viewpoint, scale, occlusion,illumination, semantic context, etc.

On the one hand, many effective image representationschemes have been developed to handle this high-level visiontask. Most of the classical approaches derive from handcraftedimage features, such as GIST [39], dense SIFT [40], VLAD[41], and object bank [42]. In contrast, the very recent deeplearning techniques have also been developed for image fea-ture learning, such as deep CNN features [43], [44]. Thesetechniques are more powerful than classical methods whenlearning from a very large amount of unlabeled images.

On the other hand, label correlations have also been exploit-ed to significantly improve image classification performance.Most of the current multilabel image classification algorithmsare motivated by considering label correlations conditioned onimage features, thus intrinsically falls into the CRFs frame-work. For example, probabilistic label enhancement model(PLEM) [45] designed to exploit image label co-occurrencepairs based on a maximum spanning tree construction anda piecewise procedure is utilized to train the pairwise CRFsmodel. More recently, clique generating machine (CGM) [46]proposed to learn the image label graph structure and param-eters by iteratively activating a set of cliques. It also belongsto the CRFs framework, but the labels are not constrained tobe all connected which may result in isolated cliques.

TABLE ISUMMARY OF IMPORTANT NOTATIONS THROUGHOUT THIS PAPER.

Notation Description

D = {x(l),y(l)} training dataset with n examples, 1 ≤ l ≤ n

Dk modified training data set by replacing the k-th example ofD with an independent example

D\k modified training data set by discarding thek-th example of D

L(Θ)negative log pseudo likelihood over trainingdataset Dk

Lr(Θ)regularized negative log pseudo likelihoodover training dataset D\k

Ren(Θ;λ1, λ2, ε)elastic net regularization with weights λ1, λ2and parameter ε

Θ = {β,α} model parameters of CorrLog

Θ = {β, α} empirical learned model parameters by max-imum pseudo likelihood estimation over D

Θk = {βk, αk} empirical learned model parameters over Dk

Θ\k = {β\k, α\k} empirical learned model parameters over D\k

R(Θ)empirical error of the empirical model Θ overtraining set D

R(Θ) generalization error of the empirical model Θ

B. Motivation and Organization

Correlated logistic model (CorrLog) provides a more princi-pled way to handle conditional label correlations, and enjoysseveral favourable properties: 1) built upon independent lo-gistic regressions (ILRs), it offers an explicit way to modelthe pairwise (second order) label correlations; 2) by using thepseudo likelihood technique, the parameters of CorrLog can belearned approximately with a computational complexity linearwith respect to label number; 3) the learning of CorrLog isstable, and the empirically learned model enjoys a general-ization error bound that is independent of label number. Inaddition, the results presented in this paper extend our previousstudy [47] in following aspects: 1) we introduce elastic netregularization to CorrLog, which facilitates the utilization ofthe sparsity in both feature selection and label correlations; 2)a learning algorithm for CorrLog based on soft thresholdingis derived to handle the nonsmoothness of the elastic netregularization; 3) the proof of generalization bound is alsoextended for the new regularization; 4) we apply CorrLog tomultilabel image classification, and achieve competitive resultswith the state-of-the-art methods of this area.

The rest of this paper is organized as follows. SectionII introduces the model CorrLog with elastic net regulariza-tion. Section III presents algorithms for learning CorrLog byregularized maximum pseudo likelihood estimation, and forprediction with CorrLog by message passing. A generalizationanalysis of CorrLog based on the concept of algorithm stabilityis presented in Section IV. Section V to Section VII reportresults of empirical evaluations, including experiments onsynthetic dataset and on benchmark multilabel image classifi-cation datasets.




II. CORRELATED LOGISTIC MODEL

We study the problem of learning a joint prediction y =d(x) : X 7→ Y , where the instance space X = {x : ‖x‖ ≤1,x ∈ RD} and the label space Y = {−1, 1}m. By assumingthe conditional independence among labels, we can modelMLC by a set of independent logistic regressions (ILRs).Specifically, the conditional probability plr(y|x) of ILRs isgiven by

plr(y|x) =m∏i=1

plr(yi|x)

=m∏i=1

exp(yiβ

Ti x)

exp(βTi x

)+ exp

(−βTi x

) , (1)

where βi ∈ RD is the coefficients for the i-th logisticregression (LR) in ILRs. For the convenience of expression,the bias of the standard LR is omitted here, which is equivalentto augmenting the feature of x with a constant.

Clearly, ILRs (1) enjoys several merits, such as, it can belearned efficiently, in particular with a linear computationalcomplexity with respect to label number m, and its proba-bilistic formulation inherently helps deal with the imbalanceof positive and negative examples for each label, which is acommon problem encountered by MLC. However, it ignoresentirely the potential correlation among labels and thus tendsto under-fit the true posterior p0(y|x), especially when thelabel number m is large.

A. Correlated Logistic Regressions

CorrLog tries to extend ILRs with as small effort as possi-ble, so that the correlation among labels is explicitly modelledwhile the advantages of ILRs can be also preserved. To achievethis, we propose to augment (1) with a simple function q(y)and reformulate the posterior probability as

p(y|x) ∝ plr(y|x)q(y). (2)

As long as q(y) cannot be decomposed into independentproduct terms for individual labels, it introduces label cor-relations into p(y|x). It is worth noticing that we assumedq(y) to be independent of x. Therefore, (2) models labelcorrelations in an average sense. This is similar to the conceptof “marginal correlations” in MLC [11]. However, they areintrinsically different, because (2) integrate the correlation intothe posterior probability, which directly aims at prediction.In addition, the idea used in (2) for correlation modelling isalso distinct from the “Curds and Whey” procedure in [15]which corrects outputs of multivariate linear regression byreconsidering their correlations to the true responses.

In this paper, we choose q(y) to be the following quadraticform,

q(y) = exp

∑i<j

αijyiyj

. (3)

It means that yi and yj are positively correlated given αij > 0and negatively correlated given αij < 0. It is also possible todefine αij as functions of x, but this will drastically increase

the number of model parameters, e.g., by O(m2D) if linearfunctions are used.

By substituting (3) into (2), we obtain the conditionalprobability for CorrLog

p(y|x; Θ) ∝ exp

m∑i=1

yiβTi x +

∑i<j

αijyiyj

, (4)

where the model parameter Θ = {β,α} contains β =[β1, ..., βm] and α = [α12, ..., α(m−1)m]T . It can be seenthat CorrLog is a simple modification of (1), by using aquadratic term to adjust the joint prediction, so that hiddenlabel correlations can be exploited. In addition, CorrLog isclosely related to popular statistical models for joint modellingof binary variables. For example, conditional on x, (4) isexactly an Ising model [48] for y. It can also be treatedas a special instance of CRFs [23], by defining featuresφi(x,y) = yix and ψij(y) = yiyj . Moreover, classicalmodel multivariate probit (MP) [49] also models pairwisecorrelations in y. However, it utilizes Gaussian latent variablesfor correlation modelling, which is essentially different fromCorrLog.

B. Elastic Net Regularization

Given a set of training data D = {x(l),y(l) : 1 ≤ l ≤ n},CorrLog can be learned by regularized maximum log likeli-hood estimation (MLE), i.e.,

Θ = arg minΘL(Θ) +R(Θ), (5)

where L(Θ) is the negative log likelihood

L(Θ) = − 1

n

n∑l=1

log p(y(l)|x(l); Θ), (6)

and R(Θ) is a properly chosen regularization.A possible choice for R(Θ) is the `2 regularizer,

R2(Θ;λ1, λ2) = λ1

m∑i=1

‖βi‖22 + λ2

∑i<j

|αij |2, (7)

with λ1, λ2 > 0 being the weighting parameters. The `2regularization enjoys the merits of computational flexibilityand learning stability. However, it is unable to exploit anysparsity that can be possessed by the problem at hand. Forexample, for MLC, it is likely that the prediction of eachlabel yi only depends on a subset of the D features of x,which implies the sparsity of βi. Besides, α can also besparse since not all labels in y are correlated to each other. `1regularizer is another choice for R(Θ), especially regardingmodel sparsity. Nevertheless, it has been noticed by severalstudies that `1 regularized algorithms are inherently unstable,that is, a slight change of the training data set can leadto substantially different prediction models. Based on aboveconsideration, we propose to use the elastic net regularizer[50], which is a combination of `2 and `1 regularizers and




inherits their individual advantages, i.e., learning stability andmodel sparsity,

Ren(Θ;λ1, λ2, ε) = λ1

m∑i=1

(‖βi‖22 + ε‖βi‖1)

+ λ2

∑i<j

(|αij |2 + ε|αij |), (8)

where ε ≥ 0 controls the trade-off between the `1 regulariza-tion and the `2 regularization, and large ε encourages a highlevel of sparsity.

III. ALGORITHMS

In this section, we derive algorithms for learning and predic-tion with CorrLog. The exponentially large size of the labelspace Y = {−1, 1}m makes exact algorithms for CorrLogcomputationally intractable, since the conditional probability(4) needs to be normalized by the partition function

A(Θ) =∑y∈Y

exp

m∑i=1

yiβTi x +

∑i<j

αijyiyj

, (9)

which is a summation over an exponential number of terms.Thus, we turn to approximate learning and prediction algo-rithms, by exploiting the pseudo likelihood and the messagepassing techniques.

A. Approximate Learning via Pseudo Likelihood

Maximum pseudo likelihood estimation (MPLE) [51] pro-vides an alternative approach for estimating model parameters,especially when the partition function of the likelihood cannotbe evaluated efficiently. It was developed in the field ofspatial dependence analysis and has been widely applied tothe estimation of various statistical models, from the Isingmodel [48] to the CRFs [52]. Here, we apply MPLE to thelearning of parameter Θ in CorrLog.

The pseudo likelihood of the model over m jointly distribut-ed random variables is defined as the product of the conditionalprobability of each individual random variables conditionedon all the rest ones. For CorrLog (4), its pseudo likelihood isgiven by

p(y|x; Θ) =m∏i=1

p(yi|y−i,x; Θ), (10)

where y−i = [y1, ...,yi−1,yi+1, ...,ym] and the conditionalprobability p(yi|y−i,x; Θ) can be directly obtained from (4),

p(yi|y−i,x; Θ) =

1

1 + exp{−2yi

(βTi x +

∑mj=i+1 αijyj +

∑i−1j=1 αjiyj

)} .(11)

Accordingly, the negative log pseudo likelihood over thetraining data D is given by

L(Θ) = − 1

n

n∑l=1

m∑i=1

log p(y(l)i |y

(l)−i,x

(l); Θ). (12)

To this end, the optimal model parameter Θ = {β, α} ofCorrLog can be learned approximately by the elastic netregularized MPLE,

Θ = arg minΘLr(Θ)

= arg minΘL(Θ) +Ren(Θ;λ1, λ2, ε). (13)

where λ1, λ2 and ε are tuning parameters.A First-Order Method by Soft Thresholding: Problem (13)is a convex optimization problem, thanks to the convexity ofthe logarithmic loss function and the elastic net regularization,and thus a unique optimal solution. However, the elastic netregularization is non-smooth due to the `1 norm regularizer,which makes direct gradient based algorithm inapplicable. Themain idea of our algorithm for solving (13) is to divide theobjective function into smooth and non-smooth parts, and thenapply the soft thresholding technique to deal with the non-smoothness.

Denoting by Js(Θ) the smooth part of Lr(Θ), i.e.,

Js(Θ) = L(Θ) + λ1

m∑i=1

‖βi‖22 + λ2

∑i<j

|αij |2, (14)

its gradient ∇Js at the k-th iteration Θ(k) = {β(k),α(k)} isgiven by

∇Jsβi(Θ(k)) = 1

n

n∑l=1

ξlix(l) + 2λ1β

(k)i

∇Jsαij(Θ(k)) = 1

n

n∑l=1

(ξliy

(l)j + ξljy

(l)i

)+ 2λ2α

(k)ij

(15)withξli =

−2y(l)i

1 + exp{

2y(l)i

(β(k)Ti x(l) +

∑mj=i+1 α

(k)ij y

(l)j +

∑i−1j=1 α

(k)ji y

(l)j

)} .(16)

Then, a surrogate J(Θ) of the objective function Lr(Θ) in(13) can be obtained by using ∇Js(Θ(k)), i.e.,

J(Θ; Θ(k)) = Js(Θ(k))

+

m∑i=1

〈∇Jsβi(Θ(k)), βi − β(k)

i 〉+1

2η‖βi − β(k)

i ‖22 + λ1ε‖βi‖1

+∑i<j

〈∇Jsαij(Θ(k)), αij − α(k)

ij 〉+1

2η(αij − α(k)

ij )2 + λ2ε|αij |.

(17)

The parameter η in (17) servers a similar role to the variableupdating step size in gradient descent methods, and it isset such that 1/η is larger than the Lipschitz constant of∇Js(Θ(k)). For such η, it can be shown that J(Θ) ≥ Lr(Θ)and J(Θ(k)) = Lr(Θ(k)). Therefore, the update of Θ can berealized by the minimization

Θ(k+1) = arg minΘ

J(Θ; Θ(k)), (18)

which is solved by the soft thresholding function S(·), i.e.,{β

(k+1)i = S(β

(k)i − η∇Jsβi(Θ

(k));λ1ε)

α(k+1)ij = S(α

(k)ij − η∇Jsαij (Θ(k));λ2ε),

(19)




Algorithm 1 Learning CorrLog by Maximum Pseudo Likeli-hood Estimation with Elastic Net Regularization

Input: Training data D, initialization β(0) = 0, α(0) =0, and learning rate η, where 1/η is set larger than theLipschitz constant of ∇Js(Θ) (17).

Output: Model parameters Θ = (β(t), α(t)).

repeatCalculating the gradient of JS(Θ) at Θ(k) = (β(k),α(k))by using (15);Updating Θ(k+1) = (β(k+1),α(k+1)) by using softthresholding (19);k = k + 1

until Converged

where

S(u; ρ) =

u− 0.5ρ, if u > 0.5ρu+ 0.5ρ, if u < −0.5ρ0, otherwise.

(20)

Iteratively applying (19) until convergence provides a first-order method for solving (13). Algorithm 1 presents thepseudo code for this procedure.Remark 1 From the above derivation, especially equations(15) and (19), the computational complexity of our learningalgorithm is linear with respect to the label number m. There-fore, learning CorrLog is no more expensive than learningm independent logistic regressions, which makes CorrLogscalable to the case of large label numbers.Remark 2 It is possible to further speed up the learningalgorithm. In particular, Algorithm 1 can be modified to havethe optimal convergence rate in the sense of Nemirovskyand Yudin [53], i.e., O(1/k2) wherein k is the number ofiterations. However, its convergence is usually as slow as instandard gradient descent methods. Actually, we only needto replace the current variable Θ(k) in the surrogate (17)by a weighted combination of the variables from previousiterations. As such modification is a direct application of thefast iterative shrinkage thresholding, [54], we do not presentthe details here but leave readers to the reference.

B. Joint Prediction by Message Passing

For MLC, as the labels are not independent in general, theprediction task is actually a joint maximum a posterior (MAP)estimation over p(y|x). In the case of CorrLog, suppose themodel parameter Θ is learned by the regularized MPLE fromthe last subsection, the prediction of y for a new instance xcan be obtained by

y = arg maxy∈Y

p(y|x; Θ)

= arg maxy∈Y

exp

m∑i=1

yiβTi x +

∑i<j

αijyiyj

. (21)

We use the belief propagation (BP) to solve (21) [55].Specifically, we run the max-product algorithm with uniformlyinitialized messages and an early stopping criterion with 50

iterations. Since the graphical model defined by α in (21) hasloops, we cannot guarantee the convergence of the algorithm.However, we found that it works well on all experiments inthis paper.

IV. GENERALIZATION ANALYSIS

An important issue in designing a machine learning algo-rithm is generalization, i.e., how the algorithm will perform onthe test data compared to on the training data. In the section,we present a generalization analysis for CorrLog, by using theconcept of algorithmic stability [56]. Our analysis follows twosteps. First, we show that the learning of CorrLog by MPLEis stable, i.e., the learned model parameter Θ does not varymuch given a slight change of the training data set D. Then, weprove that the generalization error of CorrLog can be boundedby the empirical error, plus a term related to the stability butindependent of the label number m.

A. The Stability of MPLE

The stability of a learning algorithm indicates how muchthe learned model changes according to a small change ofthe training data set. Denote by Dk a modified training dataset the same with D but replacing the k-th training example(x(k),y(k)) by another independent example (x′,y′). SupposeΘ and Θk are the model parameters learned by MPLE (13) onD and Dk, respectively. We intend to show that the differencebetween these two models, defined as

‖Θk − Θ‖ ,m∑i=1

‖βk

i − βi‖+∑i<j

|αkij − αij |, ∀ 1 ≤ k ≤ n,

(22)is bounded by an order of O(1/n), so that the learning isstable for large n.

First, we need the following auxiliary model Θ\k =

{β\k, α\k} learned on D\k, which is the same with D but

without the k-th example

Θ\k = arg minΘL\k(Θ) +Ren(Θ;λ1, λ2, ε), (23)

where

L\k(Θ) = − 1

n

∑l 6=k

m∑i=1

log p(y(l)i |y

(l)−i,x

(l); Θ). (24)

The following Lemma provides an upper bound of thedifference Lr(Θ\k)− Lr(Θ).

Lemma 1. Given Lr(·) and Θ defined in (13), and Θ\k

defined in (23), it holds for ∀1 ≤ k ≤ n,

Lr(Θ\k)− Lr(Θ) ≤

1

n

(m∑i=1

log p(y(k)i |y

(k)−i ,x

(k); Θ\k)−m∑i=1

log p(y(k)i |y

(k)−i ,x

(k); Θ)

)(25)

Proof. Denote by RHS the righthand side of (25), we have

RHS =(Lr(Θ\k)− L\kr (Θ\k)

)−(Lr(Θ)− L\kr (Θ)

).




Furthermore, the definition of Θ\k implies L\kr (Θ\k) ≤L\kr (Θ). Combining these two we have (25). This completesthe proof.

Next, we show a lower bound of the difference Lr(Θ\k)−Lr(Θ).

Lemma 2. Given Lr(·) and Θ defined in (13), and Θ\k

defined in (23), it holds for ∀1 ≤ k ≤ n,

Lr(Θ\k)− Lr(Θ) ≥ λ1‖β\k− β‖2 + λ2‖α\k − α‖2. (26)

Proof. We define the following function

f(Θ) = Lr(Θ)− λ1‖β − β‖2 − λ2‖α− α‖2.

Then, for (26), it is sufficient to show that f(Θ\k) ≥ f(Θ).By using (13), we have

f(Θ) = L(Θ) + 2λ1

m∑i=1

βTi βi + 2λ2

∑i<j

αijαij

+ λ1εm∑i=1

‖βi‖1 + λ2ε∑i<j

|αij |. (27)

It is straightforward to verify that f(Θ) and Lr(Θ) in (13)have the same subgradient at Θ, i.e.,

∂f(Θ) = ∂Lr(Θ). (28)

Since Θ minimizes Lr(Θ), we have 0 ∈ ∂Lr(Θ) and thus0 ∈ ∂f(Θ), which implies Θ also minimizes f(Θ). Thereforef(Θ) ≤ f(Θ\k).

In addition, by checking the Lipschitz continuous propertyof log p(yi|y−i,x; Θ), we have the following Lemma 3.

Lemma 3. Given Θ defined in (13) and Θ\k defined in (23),it holds for ∀ (x,y) ∈ X × Y and ∀1 ≤ k ≤ n∣∣ m∑

i=1

log p(yi|y−i,x; Θ)−m∑i=1

log p(yi|y−i,x; Θ\k)∣∣

≤ 2m∑i=1

‖βi − β\ki ‖+ 4∑i<j

|αij − α\kij |. (29)

Proof. First, we have

‖∂ log p(yi|y−i,x; Θ)/∂βi‖ ≤ 2‖x‖ ≤ 2,

and

|∂ log p(yi|y−i,x; Θ)/∂αij | ≤ 4|yiyj | = 4.

That is log p(yi|y−i,x; Θ) is Lipschitz continuous with re-spect to βi and αij , with constant 2 and 4, respectively.Therefore, (29) holds.

By combining the above three Lemmas, we have the fol-lowing Theorem 1 that shows the stability of CorrLog.

Theorem 1. Given model parameters Θ = {β, α} and Θk =

{βk, αk} learned on training datasets D and Dk, respectively,

both by (13), it holds thatm∑i=1

‖βk

i − βi‖+∑i<j

|αkij − αij | ≤16

min(λ1, λ2)n. (30)

Proof. By combining (25), (26) and (29), we have

‖β\k− β‖2 + ‖α\k − α‖2 ≤

4

min(λ1, λ2)n

m∑i=1

‖βi − β\ki ‖+∑i<j

|αij − α\kij |

. (31)

Further, by using

‖β\k− β‖2 + ‖α\k − α‖2 ≥

1

2

m∑i=1


|αij − α\kij |

2

(32)

we havem∑i=1


|αij − α\kij | ≤8

min(λ1, λ2)n(33)

Since Dk and D\k differ from each other with only the k-thtraining example, the same argument gives

m∑i=1

‖βki − β\ki ‖+

∑i<j

|αkij − α\kij | ≤

8

min(λ1, λ2)n. (34)

Then, (30) is obtained immediately. This completes the proof.

B. Generalization Bound

We first define a loss function to measure the generalizationerror. Considering that CorrLog predicts labels by MAP esti-mation, we define the loss function by using the log probability

`(x,y; Θ) =

1, f(x,y,Θ) < 01− f(x,y,Θ)/γ, 0 ≤ f(x,y,Θ) < γ0, f(x,y,Θ) ≥ γ,

(35)where the constant γ > 0 and

f(x,y,Θ) = log p(y|x; Θ)−maxy′ 6=y

log p(y′|x; Θ)

=

m∑i=1

yiβTi x +

∑i<j

αijyiyj

−max

y′ 6=y

m∑i=1

y′iβTi x +

∑i<j

αijy′iy′j

. (36)

The loss function (35) is defined analogously to the lossfunction used in binary classification, where f(x,y,Θ) isreplaced with the margin ywTx if a linear classifier w isused. Besides, (35) gives a 0 loss only if all dimensions of yare correctly predicted, which emphasizes the joint predictionin MLC. By using this loss function, the generalization errorand the empirical error are given by

R(Θ) = Exy`(x,y; Θ), (37)

and

R(Θ) =1

n

n∑l=1

`(x(l),y(l); Θ). (38)




According to [56], an exponential bound exists for R(Θ)if CorrLog has a uniform stability with respect to the lossfunction (35). The following Theorem 2 shows this conditionholds.

Theorem 2. Given model parameters Θ = {β, α} and Θk =

{βk, αk} learned on training datasets D and Dk, respectively,

both by (13), it holds for ∀(x,y) ∈ X × Y ,

|`(x,y; Θ)− `(x,y; Θk)| ≤ 32

γmin(λ1, λ2)n. (39)

Proof. First, we have the following inequality from (35)

γ|`(x,y; Θ)− `(x,y; Θk)| ≤ |f(x,y, Θ)− f(x,y, Θk)|(40)

Then, by introducing notation

A(x,y,β,α) =m∑i=1

yiβTi x +

∑i<j

αijyiyj , (41)

and rewriting

f(x,y,Θ) = A(x,y,β,α)−maxy′ 6=y

A(x,y′,β,α), (42)

we have

γ|`(x,y; Θ)− `(x,y; Θk)| ≤∣∣A(x,y, β, α)−A(x,y, β

k, αk)

∣∣+ |max

y′ 6=yA(x,y′, β, α)−max

y′ 6=yA(x,y′, β

k, αk)|. (43)

Due to the fact that for any functions h1(u) and h2(u) it holds2

|maxu

h1(u)−maxu

h2(u)| ≤ maxu|h1(u)− h2(u)|, (44)

we have

γ|`(x,y; Θ)− `(x,y; Θk)|

≤∣∣A(x,y, β, α)−A(x,y, β

k, αk)

∣∣+ max

y′ 6=y

∣∣A(x,y′, β, α)−A(x,y′, βk, αk)

∣∣≤ 2 max

y

m∑i=1

|yi(βi − βki )Tx|+∑i<j

|(αij − αkij)yiyj |

≤ 2

m∑i=1

‖βi − βki ‖+ 2∑i<j

|αij − αkij |

. (45)

Then, the proof is completed by applying Theorem 1.

Now, we are ready to present the main theorem on thegeneralization ability of CorrLog.

Theorem 3. Given the model parameter Θ learned by (13),with i.i.d. training data D = {(x(l),y(l)) ∈ X × Y, l =1, 2, ..., n} and regularization parameters λ1, λ2, it holds withat least probability 1− δ,

R(Θ) ≤ R(Θ) +32

γmin(λ1, λ2)n

+

(64

γmin(λ1, λ2)+ 1

)√log 1/δ

2n. (46)

2Suppose u?1 and u?2 maximize h1(u) and h2(u) respectively, and withoutloss of generality h1(u?1) ≥ h2(u?2), we have |h1(u?1) − h2(u?2)| =h1(u?1)− h2(u?2) ≤ h1(u?1)− h2(u?1) ≤ maxu |h1(u)− h2(u)|.

Proof. Given Theorem 2, the generalization bound (46) isa direct result of Theorem 12 in [56] (Please refer to thereference for details).

Remark 3 A notable observation from Theorem 3 is that thegeneralization bound (46) of CorrLog is independent of thelabel number m. Therefore, CorrLog is preferable for MLCwith a large number of labels, for which the generalizationerror still can be bounded with high confidence.Remark 4 While the learning of CorrLog (13) utilizes theelastic net regularization Ren(Θ;λ1, λ2, ε), where ε is theweighting parameter on the `1 regularization to encouragesparsity, the generalization bound (46) is independent of theparameter ε. The reason is that `1 regularization does notlead to stable learning algorithms [57], and only the `2regularization in Ren(Θ;λ1, λ2, ε) contributes to the stabilityof CorrLog.

V. TOY EXAMPLE

We design a simple toy example to illustrate the capacityof CorrLog on label correlation modelling. In particular, weshow that when ILRs fail drastically due to ignoring the labelcorrelations (under-fitting), CorrLog performs well. Considera two-label classification problem on a 2-D plane, where eachinstance x is sampled uniformly from the unit disc ‖x‖ ≤ 1and the corresponding labels y = [y1,y2] are defined by

y1 = sign(ηT1 x) and y2 = OR(y1, sign(ηT2 x)

),

where η1 = (1, 1,−0.5), η2 = (−1, 1,−0.5) and the aug-mented feature is x = [xT , 1]T . The sign(·) function takesvalue 1 or −1, and the OR(·, ·) operation outputs 1 if eitherof its input is 1. The definition of y2 makes the two labelscorrelated. We generate 1000 random examples according toabove setting and split them into training and test sets, eachof which contains 500 examples. During training, we set theparameter ε of the elastic net regularization to 0, i.e., weactually used an `2 regularization, this is because in thisexample the model is not sparse in terms of both featureselection and label correlation. In addition, as the number ofthe training examples is sufficiently large for this problem,we suppose there is no over-fitting and tune the regularizationparameters for both ILRs and CorrLog by minimizing the 0-1loss on the training set.

Figure 1 shows that true labels of test data, the predictionsof ILRs and the predictions of CorrLog, where different labelsare marked by different colors. In (a), the disc is divided intothree regions, −/−, −/+ and +/+, where the two blackboundaries are specified by η1 and η2, respectively. In (b),the first boundary η1 is properly learned by ILRs, while thesecond one is learned wrongly. This is because the secondlabel is highly correlated to the first label, but ILRs ignoressuch correlation. As a result, ILRs wrongly predicted theimpossible case of +/−. The misclassification rate measuredby 0-1 loss is 0.197. In contrast, CorrLog predicts correctlabels for most instances with a 0-1 loss 0.068. Besides, itis interesting to note that the correlation between the twolabels are “asymmetric”, for the first label is not affectedby the second. This asymmetry contributes the most to the




−1 −0.5 0 0.5 1

−1

−0.5

0

0.5

1

Truth

−/−−/++/+

(a)

−1 −0.5 0 0.5 1

−1

−0.5

0

0.5

1

ILRs

−/−+/−−/++/+

(b)

−1 −0.5 0 0.5 1

−1

−0.5

0

0.5

1

CorrLog

−/−−/++/+

(c)

Fig. 1. A two-label toy example: (a) true labels of test data; (b) predictions given by ILRs; (c) predictions given by CorrLog. The dash and solid blackboundaries are specified by η1 and η2. In the legend, “+” and “-” stand for positive and negative labels, respectively, e.g., “−/+” means y1 = −1 andy2 = 1, and so on.

misclassification of CorrLog, because the previous definitionimplies that only symmetric correlations are modelled inCorrLog.

VI. MULTILABEL IMAGE CLASSIFICATION

In this section, we apply the proposed CorrLog to multil-abel image classification. In particular, four multilabel imagedatasets are used in this paper, including MULAN scene(MULANscene)3, MIT outdoor scene (MITscene) [39], PAS-CAL VOC 2007 (PASCAL07) [58] and PASCAL VOC 2012(PASCAL12) [59]. MULAN scene dataset contains 2047 im-ages with 6 labels, and each image is represented by 294features. MIT outdoor scene dataset contains 2688 images in 8categories. To make it suitable for multilabel experiment, wetransformed each category label with several tags accordingto the image contents of that category4. PASCAL VOC 2007dataset consists of 9963 images with 20 labels. For PASCALVOC 2012, we use the available train-validation subset whichcontains 11540 images. In addition, two kinds of features areadopted to represent the last three datasets, i.e., the PHOW(a variant of dense SIFT descriptors extracted at multiplescales) features [40] and deep CNN (convolutional neuralnetwork) features [43], [44]. Summary of the basic informationof the datasets is illustrated in Table II. To extract PHOWfeatures, we use the VLFeat implementation [60]. For deepCNN features, we use the ’imagenet-vgg-f’ model pretrainedon ImageNet database [44] which is available in MatConvNetmatlab toolbox [61].

A. A Warming-Up Qualitative ExperimentAs an extension to our previous work on CorrLog, this

paper utilizes elastic net to inherit individual advantages of `23http://mulan.sourceforge.net/4The 8 categories are coast, forest, highway, insidecity, mountain, open-

country, street, and tallbuildings. The 8 binary tags are building, grass, cement-road, dirt-road, mountain, sea, sky, and tree. The transformation follows,C1→ (B6, B7), C2→ (B4, B8), C3→ (B3, B7), C4→ (B1), C5→(B5, B7), C6 → (B2, B4, B7), C7 → (B1, B3, B7), C8 → (B1, B7).For example, coast (C1) is tagged with sea (B6) and sky (B7).

TABLE IIDATASETS SUMMARY. #IMAGES STANDS FOR THE NUMBER OF ALL

IMAGES, #FEATURES STANDS FOR THE DIMENSION OF THE FEATURES,AND #LABELS STANDS FOR THE NUMBER OF LABELS.

Datasets #images #features #labelsMULANscene 2047 294 6

MITscene-PHOW 2688 3600 8MITscene-CNN 2688 4096 8

PASCAL07-PHOW 9963 3600 20PASCAL07-CNN 9963 4096 20

PASCAL12-PHOW 11540 3600 20PASCAL12-CNN 11540 4096 20

and `1 regularization. To build up the intuition, we employMITscene with PHOW features to visualize the differencebetween `2 and elastic net regularization. Table III presentsthe learned CorrLog label graphs using these two types ofregularization respectively. In the label graph, the color of eachedge represents the correlation strength between two certainlabels. We have also listed 8 representative example images,one for each category, and their binary tags for completeness.

According to the comparison, one can see that elastic netregularization results in a sparse label graph due to its `1component, while `2 regularization can only lead to a fully-connected label graph. In addition, the learned label correla-tions in elastic net case are more reasonable than that of `2. Forexample, in the `2 label graph, dirt-road and mountain haveweekly positive correlation (according to the link betweenthem), though they seldom co-occur on the images in thedatasets, while in the elastic net graph, their correlation iscorrected as negative. It has to be confessed that elasticnet regularization also discarded some reasonable correlationssuch as cement-road and building. This phenomenon is a directresult of the compromise between learning stability and modelsparsity. We shall mention that those reasonable correlationscan be maintained by decreasing λ1, λ2 or ε, though moreunreasonable connections will also be maintained. Thus, ap-plying weak sparsity may impair the model performance. Asa result, it is important to choose a good level of sparsity




TABLE IIILEARNED CORRLOG LABEL GRAPH ON MITSCENE USING `2 OR ELASTIC NET REGULARIZATION.

MITscene Images and Tagscoast forest highway insidecity mountain opencountry street tallbuilding

seasky

dirt-roadtree

cement-roadsky

buildingmountain

sky

grassdirt-road

sky

buildingcement-road

sky

buildingsky

Learned CorrLog Label Graph`2 regularization Elastic net regularization

−0.034843 0.043017

building

grass

cement−roaddirt−road

mountain

sea

sky

tree

building

grass

cement−roaddirt−road

mountain

sea

sky

tree

to achieve a compromise. In our experiments, CorrLog withelastic net regularization generally outperforms that with `2regularization, which confirms our motivation that appropriatelevel of sparsity in feature selection and label correlations helpboost the performance of MLC. In the following presentation,we will use CorrLog with elastic net regularization in allexperimental comparisons. To benefit following research, ourcode is available upon request.

B. Quantitative Experimental Setting

In this subsection, we present further comparisons betweenCorrLog and other MLC methods. First, to demonstrate theeffectiveness of utilizing label correlation, we first compareCorrLog’s performance with ILRs. Moreover, four state-of-the-art MLC methods - instance-based learning by logistic re-gression (IBLR) [6], multilabel k-nearest neighbour (MLkNN)[5], classifier chains (CC) [19] and maximum margin outputcoding (MMOC) [22] were also employed for comparisonstudy. Note that ILRs can be regarded as the basic baseline andother methods represent state-of-the-arts. In our experiments,LIBlinear [62] `2-regularized logistic regression is employedto build binary classifiers for ILRs. As for other methods,

we use publicly available codes in MEKA5 or the authors’homepages.

We used six different measures to evaluate the performance.These include different loss functions (Hamming loss andzero-one loss) and other popular measures (accuracy, F1 score,Macro-F1 and Micro-F1). The details of these evaluationmeasures can be found in [63], [16], [19], [20]. The parametersfor CorrLog are fixed across all experiments as λ1 = 0.001,λ2 = 0.001 and ε = 1. On each dataset, all the methods arecompared by 5-fold cross validation. The mean and standarddeviation are reported for each criterion. In addition, pairedt-tests at 0.05 significance level is applied to evaluate thestatistical significance of performance difference.

C. Quantitative Results and Discussions

Tables IV, V, VI and VII summarized the experimentalresults on MULANscene, MITscene, PASCAL07 and PAS-CAL12 of all six algorithms evaluated by the six measures. Bycomparing the results of CorrLog and ILRs, we can clearly seethe improvements obtained by exploiting label correlations forMLC. Except the Hamming loss, CorrLog greatly outperforms

5http://meka.sourceforge.net/




TABLE IVMULANSCENE PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER ∗/~ INDICATES WHETHER CORRLOG IS STATISTICALLY

SUPERIOR/INFERIOR TO THE COMPARED METHOD (USING PAIRED T-TEST AT 0.05 SIGNIFICANCE LEVEL).

Datasets Methods MeasuresHamming loss 0-1 loss Accuracy F1-Score Macro-F1 Micro-F1

MULANscene

CorrLog 0.095±0.007 0.341±0.020 0.710±0.018 0.728±0.017 0.745±0.016 0.734±0.017ILRs 0.117±0.006 ∗ 0.495±0.022 ∗ 0.592±0.016 ∗ 0.622±0.014 ∗ 0.677±0.016 ∗ 0.669±0.014 ∗IBLR 0.085±0.004 ~ 0.358±0.016 0.677±0.018 ∗ 0.689±0.019 ∗ 0.747±0.010 0.738±0.014

MLkNN 0.086±0.003 0.374±0.015 ∗ 0.668±0.018 ∗ 0.682±0.019 ∗ 0.742±0.013 0.734±0.012CC 0.104±0.005 ∗ 0.346±0.015 0.696±0.015 ∗ 0.710±0.015 ∗ 0.716±0.018 ∗ 0.706±0.014 ∗

MMOC 0.126±0.017 ∗ 0.401±0.046 ∗ 0.629±0.049 ∗ 0.639±0.050 ∗ 0.680±0.031 ∗ 0.638±0.049 ∗

TABLE VMITSCENE PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER ∗/~ INDICATES WHETHER CORRLOG IS STATISTICALLY



MITscene-PHOW

CorrLog 0.045±0.006 0.196±0.017 0.884±0.012 0.914±0.010 0.883±0.017 0.915±0.011ILRs 0.071±0.002 ∗ 0.358±0.015 ∗ 0.825±0.007 ∗ 0.877±0.005 ∗ 0.833±0.007 ∗ 0.872±0.003 ∗IBLR 0.060±0.003 ∗ 0.243±0.021 ∗ 0.845±0.012 ∗ 0.879±0.008 ∗ 0.848±0.009 ∗ 0.886±0.006 ∗

MLkNN 0.069±0.002 ∗ 0.326±0.022 ∗ 0.810±0.009 ∗ 0.857±0.006 ∗ 0.827±0.009 ∗ 0.869±0.004 ∗CC 0.047±0.005 0.200±0.021 0.883±0.012 0.913±0.008 0.883±0.015 0.913±0.009

MMOC 0.062±0.010 ∗ 0.274±0.035 ∗ 0.845±0.017 ∗ 0.885±0.014 ∗ 0.846±0.024 ∗ 0.885±0.017 ∗

MITscene-CNN

CorrLog 0.017±0.004 0.088±0.015 0.953±0.008 0.966±0.006 0.957±0.011 0.968±0.006ILRs 0.020±0.002 ∗ 0.102±0.015 ∗ 0.947±0.006 ∗ 0.962±0.004 ∗ 0.951±0.007 ∗ 0.963±0.005 ∗IBLR 0.022±0.001 ∗ 0.090±0.009 0.944±0.004 0.957±0.003 ∗ 0.944±0.004 ∗ 0.958±0.003 ∗

MLkNN 0.024±0.002 ∗ 0.104±0.005 ∗ 0.939±0.003 ∗ 0.954±0.003 ∗ 0.941±0.002 ∗ 0.955±0.004 ∗CC 0.021±0.003 ∗ 0.075±0.008 ~ 0.951±0.005 0.962±0.004 ∗ 0.948±0.007 ∗ 0.961±0.005 ∗

MMOC 0.018±0.002 0.062±0.005 ~ 0.959±0.003 ~ 0.967±0.003 0.955±0.005 0.967±0.004

ILRs on all datasets. Especially, the reduction of zero-oneloss is significant on all four datasets with different type offeatures. This confirms the value of correlation modellingto joint prediction. However, it should be noticed that theimprovement of CorrLog over ILRs is less significant when theperformance is measured by Hamming loss. This is becauseHamming loss treats the prediction of each label individually.

In addition, CorrLog is more effective in exploiting labelcorrelations than other four state-of-the-art MLC algorithms.For MULANscene dataset, CorrLog achieved comparable re-sults with IBLR and both of them outperformed other methods.For MITscene dataset, both PHOW and CNN features are veryeffective representations and boost the classification results. Asa consequence, the performance of CorrLog and the four MLCalgorithms are very close to each other. It is worth noting that,the MMOC method is time-consuming in the training stage,though it achieved the best performance on this dataset. Asfor both PASCAL07 and PASCAL12 datasets, CNN featuresperform significantly better than PHOW features. CorrLog ob-tained much better results than the competing MLC schemes,except for the Hamming loss and zero-one loss. Note that theCorrLog also performs competitively with PLEM and CGM,according to the results reported in [46].

D. Complexity Analysis and Execution Time

Table VIII summarizes the algorithm computational com-plexity of all MLC methods. The training computationalcost of both CorrLog and ILRs are linear to the number oflabels, while CorrLog causes more testing computational cost

than ILRs due to the iterative belief propagation algorithm.In contrast, the training complexity of CC and MMOC arepolynomial to the number of labels. The two instance-basedmethods, MLkNN and IBLR, are relatively computational inboth train and test stages due to the involvement of instance-based searching of nearest neighbours. In particular, trainingMLkNN requires estimating the prior label distribution fromtraining data which needs the consideration of all k nearestneighbours of all training samples. Testing a given samplein MLkNN consists of finding its k-nearest neighbours andapplying maximum a posterior (MAP) inference. Differentfrom MLkNN, IBLR constructs logistic regression models byadopting labels of k-nearest neighbours as features.

To evaluate the practical efficiency, Table IX presents theexecution time (train and test phase) of all comparison algo-rithms under Matlab environment. A Linux server equippedwith Intel Xeon CPU (8 cores @ 3.4 GHz) and 32 GBmemory is used for conducting all the experiments. CorrLog isimplemented in Matlab language, while ILRs is implementedbased on LIBlinear’s mex functions. MMOC is evaluated usingthe authors’ Matlab code which also builds upon LIBlinear. Asfor IBLR, MLkNN and CC, the MEKA Java library is calledvia a Matlab wrapper. Based on the comparison results, thefollowing observations can be made: 1) the execution time islargely consistent with the complexity analysis, though theremaybe some unavoidable computational differences betweenMatlab scripts, mex functions and Java codes; 2) CorrLog’strain phase is very efficient and its test phase is also compara-ble with ILRs, CC and MMOC; 3) CorrLog is more efficient




TABLE VIPASCAL07 PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER ∗/~ INDICATES WHETHER CORRLOG IS STATISTICALLY



PASCAL07-PHOW

CorrLog 0.068±0.001 0.776±0.007 0.370±0.010 0.423±0.012 0.367±0.011 0.480±0.008ILRs 0.093±0.001 ∗ 0.878±0.007 ∗ 0.294±0.008 ∗ 0.360±0.009 ∗ 0.332±0.008 ∗ 0.404±0.007 ∗IBLR 0.066±0.001 ~ 0.832±0.003 ∗ 0.270±0.005 ∗ 0.308±0.006 ∗ 0.258±0.007 ∗ 0.408±0.009 ∗

MLkNN 0.066±0.001 ~ 0.839±0.006 ∗ 0.256±0.007 ∗ 0.291±0.008 ∗ 0.235±0.006 ∗ 0.392±0.007 ∗CC 0.091±0.000 ∗ 0.845±0.010 ∗ 0.318±0.005 ∗ 0.379±0.003 ∗ 0.348±0.004 ∗ 0.417±0.001 ∗

MMOC 0.065±0.001 ~ 0.850±0.003 ∗ 0.259±0.009 ∗ 0.299±0.011 ∗ 0.206±0.007 ∗ 0.392±0.012 ∗

PASCAL07-CNN


MLkNN 0.043±0.001 ∗ 0.557±0.010 ∗ 0.585±0.014 ∗ 0.635±0.015 ∗ 0.613±0.006 ∗ 0.668±0.011 ∗CC 0.051±0.001 ∗ 0.586±0.008 ∗ 0.602±0.008 ∗ 0.668±0.008 ∗ 0.635±0.009 ∗ 0.669±0.008 ∗

MMOC 0.037±0.000 ~ 0.512±0.008 0.634±0.009 ∗ 0.684±0.009 ∗ 0.663±0.005 ∗ 0.719±0.004 ∗

TABLE VIIPASCAL12 PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER ∗/~ INDICATES WHETHER CORRLOG IS STATISTICALLY



PASCAL12-PHOW

CorrLog 0.070±0.001 0.790±0.009 0.344±0.009 0.393±0.010 0.369±0.014 0.449±0.006ILRs 0.100±0.001 ∗ 0.891±0.009 ∗ 0.269±0.007 ∗ 0.333±0.008 ∗ 0.324±0.008 ∗ 0.370±0.005 ∗IBLR 0.068±0.001 ~ 0.869±0.009 ∗ 0.219±0.005 ∗ 0.252±0.003 ∗ 0.253±0.007 ∗ 0.345±0.005 ∗

MLkNN 0.069±0.001 ~ 0.883±0.008 ∗ 0.191±0.006 ∗ 0.218±0.005 ∗ 0.213±0.007 ∗ 0.306±0.006 ∗CC 0.097±0.001 ∗ 0.862±0.012 ∗ 0.291±0.010 ∗ 0.350±0.010 ∗ 0.340±0.007 ∗ 0.380±0.006 ∗

MMOC 0.067±0.001 ~ 0.865±0.003 ∗ 0.227±0.005 ∗ 0.262±0.007 ∗ 0.200±0.007 ∗ 0.346±0.004 ∗

PASCAL12-CNN


MLkNN 0.045±0.002 ∗ 0.575±0.012 ∗ 0.566±0.015 ∗ 0.616±0.017 ∗ 0.604±0.011 ∗ 0.645±0.013 ∗CC 0.055±0.001 ∗ 0.615±0.010 ∗ 0.579±0.009 ∗ 0.647±0.010 ∗ 0.623±0.005 ∗ 0.643±0.007 ∗

MMOC 0.039±0.001 ~ 0.525±0.005 0.619±0.006 ∗ 0.669±0.007 ∗ 0.659±0.004 ∗ 0.699±0.005 ∗

TABLE VIIICOMPUTATIONAL COMPLEXITY ANALYSIS. RECALL THAT n STANDS FORTHE NUMBER OF TRAIN IMAGES, D STANDS FOR THE DIMENSION OF THEFEATURES, AND m STANDS FOR THE NUMBER OF LABELS. NOTE THAT C

IS THE ITERATION NUMBER OF THE MAX-PRODUCT ALGORITHM INCORRLOG, AND K IS THE NUMBER OF NEAREST NEIGHBOURS IN

MLKNN AND IBLR.

Methods Train Test per imageCorrLog O(nDm) O(Dm+ Cm2)

ILRs O(nDm) O(Dm)IBLR O(Kn2Dm+ nDm) O(KnDm+Dm)

MLkNN O(Kn2Dm) O(KnDm)CC O(nDm+ nm2) O(Dm+m2)

MMOC O(nm3 + nDm2 + n4) O(m3)

than IBLR and MLkNN in both train and test stages.

VII. CONCLUSION

We have proposed a new MLC algorithm CorrLog andapplied it to multilabel image classification. Built upon IRLs,CorrLog explicitly models the pairwise correlation between la-bels, and thus improves the effectiveness for MLC. Besides, byusing the elastic net regularization, CorrLog is able to exploitthe sparsity in both feature selection and label correlations, andthus further boost the performance of MLC. Theoretically, we

have shown that the generalization error of CorrLog is upperbounded and is independent of the number of labels. Thissuggests the generalization bound holds with high confidenceeven when the number of labels is large. Evaluations on fourbenchmark multilabel image datasets confirm the effectivenessof CorrLog for multilabel image classification and show itscompetitiveness with the state-of-the-arts.

REFERENCES

[1] W. Bian and D. Tao, “Asymptotic generalization bound of fishers lineardiscriminant analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,no. 12, pp. 2325–2337, 2014.

[2] T. Liu and D. Tao, “On the performance of manhattan nonnegativematrix factorizatio,” IEEE Trans. Neural Netw. Learn., vol. PP, no. 99,pp. 1–1, 2016.

[3] C. Xu, D. Tao, and C. Xu, “Multi-view intact space learning,” IEEETrans. Pattern Anal. Mach. Intell., vol. 37, no. 12, pp. 2531–2544, 2015.

[4] T. Liu, D. Tao, S. Mingli, and M. Stephen, “Algorithm-dependentgeneralization bounds for multi-task learning,” IEEE Trans. PatternAnal. Mach. Intell., vol. PP, no. 99, pp. 1–1, 2016.

[5] M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning approach tomulti-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048,2007.

[6] W. Cheng and E. Hullermeier, “Combining instance-based learning andlogistic regression for multilabel classification,” Mach. Learn., vol. 76,no. 2-3, pp. 211–225, 2009.

[7] D. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label predictionvia compressed sensing.” in Proc. Adv. Neural Inf. Process. Syst., 2009,vol. 22, pp. 772–780.




TABLE IXAVERAGE EXECUTION TIME COMPARISON ON ALL DATASETS.

MULANscene MITscene-PHOW MITscene-CNN PASCAL07-PHOW PASCAL07-CNN PASCAL12-PHOW PASCAL12-CNNTrain Test Train Test Train Test Train Test Train Test Train Test Train Test

CorrLog 0.09 1.74 2.80 2.12 2.46 2.08 8.94 10.68 8.35 11.08 9.67 12.58 8.62 13.06ILRs 2.54 0.02 39.50 0.37 7.50 0.15 872.77 4.79 122.73 1.56 1183.45 5.59 161.71 1.83IBLR 12.01 2.63 218.98 53.31 215.28 52.19 3132.18 779.15 2833.94 688.53 4142.75 1034.86 3824.06 947.90

MLkNN 10.29 2.36 188.08 45.87 176.52 42.78 2507.51 628.61 2232.19 551.26 3442.21 863.17 3020.29 779.50CC 5.48 0.06 40.71 0.55 26.64 0.65 74315.65 7.48 8746.82 8.15 137818.99 8.38 15926.62 9.57

MMOC 851.98 0.51 2952.77 0.70 2162.13 0.48 86714.47 33.08 38403.54 17.75 97856.16 31.43 45541.01 20.66

[8] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,”in Data Mining and Knowledge Discovery Handbook. Springer, 2010,pp. 667–685.

[9] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algo-rithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837,2014.

[10] J. Petterson and T. S. Caetano, “Reverse multi-label learning,” in Proc.Adv. Neural Inf. Process. Syst., 2010, pp. 1912–1920.

[11] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hullermeier, “Onlabel dependence in multi-label classification,” in Proc. Int. Conf. Mach.Learn. Workshop on Learning from Multi-label Data, 2010, pp. 5–13.

[12] S. Yu, K. Yu, V. Tresp, and H.-P. Kriegel, “Multi-output regularizedfeature projection,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 12, pp.1600–1613, 2006.

[13] D. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label predictionvia compressed sensing.” in Proc. Adv. Neural Inf. Process. Syst., vol. 22,2009, pp. 772–780.

[14] T. Zhou, D. Tao, and X. Wu, “Compressed labeling on distilled labelsetsfor multi-label learning,” Mach. Learn., vol. 88, no. 1-2, pp. 69–126,2012.

[15] L. Breiman and J. H. Friedman, “Predicting multivariate responsesin multiple linear regression (with discussion),” Journal of the RoyalStatistical Society: Series B (Statistical Methodology), vol. 59, no. 1,pp. 3–54, 1997.

[16] G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemblemethod for multilabel classification,” in Proc. Eur. Conf. Mach. Learn.Springer, 2007, pp. 406–417.

[17] J. Read, “A Pruned Problem Transformation Method for Multi-labelclassification,” in Proc. New Zealand Computer Science Research Stu-dent Conference, 2008, pp. 143–150.

[18] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni, “Incremental algorithmsfor hierarchical classification,” J. Mach. Learn. Res., vol. 7, pp. 31–54,2006.

[19] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains formulti-label classification,” Mach. Learn., vol. 85, no. 3, pp. 333–359,2011.

[20] W. Cheng, E. Hullermeier, and K. J. Dembczynski, “Bayes optimalmultilabel classification via probabilistic classifier chains,” in Proc. Int.Conf. Mach. Learn., 2010, pp. 279–286.

[21] F. Tai and H.-T. Lin, “Multilabel classification with principal label spacetransformation,” Neural Computation, vol. 24, no. 9, pp. 2508–2542,2012.

[22] Y. Zhang and J. G. Schneider, “Maximum margin output coding,” inProc. Int. Conf. Mach. Learn. ACM, 2012, pp. 1575–1582.

[23] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional randomfields: Probabilistic models for segmenting and labeling sequence data,”in Proc. Int. Conf. Mach. Learn. ACM, 2001, pp. 282–289.

[24] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and objectrecognition using shape contexts,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 24, no. 4, pp. 509–522, 2002.

[25] G. Wang, D. Forsyth, and D. Hoiem, “Improved object categorizationand detection using comparative object similarity,” IEEE Trans. PatternAnal. Mach. Intell., vol. 35, no. 10, pp. 2442–2453, 2013.

[26] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang, “Video track-ing using learned hierarchical features,” IEEE Trans. Image Process.,vol. 24, no. 4, pp. 1424–1435, 2015.

[27] S. Feng, R. Manmatha, and V. Lavrenko, “Multiple bernoulli relevancemodels for image and video annotation,” in Proc. IEEE Conf. Comput.Vis. Pattern Recognit., vol. 2. IEEE, 2004, pp. II–1002.

[28] J. Fan, Y. Shen, C. Yang, and N. Zhou, “Structured max-margin learningfor inter-related classifier training and multilabel image annotation,”IEEE Trans. Image Process., vol. 20, no. 3, pp. 837–854, 2011.

[29] Q. Mao, I. W.-H. Tsang, and S. Gao, “Objective-guided image anno-tation,” IEEE Trans. Image Process., vol. 22, no. 4, pp. 1585–1597,2013.

[30] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-labelscene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771,2004.

[31] D. Song and D. Tao, “Biologically inspired feature manifold for sceneclassification,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 174–184,2010.

[32] Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, and X. Jiang, “Learningdiscriminative and shareable features for scene classification,” in Proc.Eur. Conf. Comput. Vis. Springer, 2014, pp. 552–568.

[33] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, and Y. Wen, “Multiview vector-valued manifold regularization for multilabel image classification,” IEEETrans. Neural Netw. Learn., vol. 24, no. 5, pp. 709–722, 2013.

[34] Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, “Manifoldregularized multitask learning for semi-supervised multilabel imageclassification,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 523–536, 2013.

[35] F. Sun, J. Tang, H. Li, G.-J. Qi, and T. S. Huang, “Multi-label imagecategorization with sparse factor representation,” IEEE Trans. ImageProcess., vol. 23, no. 3, pp. 1028–1037, 2014.

[36] B. Zhang, Y. Wang, and F. Chen, “Multilabel image classification viahigh-order label correlation driven active learning,” IEEE Trans. ImageProcess., vol. 23, no. 3, pp. 1430–1441, 2014.

[37] Y. Luo, T. Liu, D. Tao, and C. Xu, “Multiview matrix completion formultilabel image classification,” IEEE Trans. Image Process., vol. 24,no. 8, pp. 2355–2368, 2015.

[38] C. Xu, T. Liu, D. Tao, and C. Xu, “Local rademacher complexity formulti-label learning,” IEEE Trans. Image Process., vol. 25, no. 3, pp.1495–1507, 2016.

[39] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holisticrepresentation of the spatial envelope,” Int. J. Comput. Vis., vol. 42,no. 3, pp. 145–175, 2001.

[40] A. Bosch, A. Zisserman, and X. Munoz, “Image classification usingrandom forests and ferns,” in Proc. IEEE Int. Conf. Comput. Vis. IEEE,2007, pp. 1–8.

[41] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating localdescriptors into a compact image representation,” in Proc. IEEE Conf.Comput. Vis. Pattern Recognit. IEEE, 2010, pp. 3304–3311.

[42] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, “Object bank: A high-level image representation for scene classification & semantic featuresparsification,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1378–1386.

[43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Proc. Adv. Neural Inf.Process. Syst., 2012, pp. 1097–1105.

[44] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return ofthe devil in the details: Delving deep into convolutional nets,” in BritishMachine Vision Conference, 2014.

[45] X. Li, F. Zhao, and Y. Guo, “Multi-label image classification with aprobabilistic label enhancement model,” Proc. Conf. Uncertain. Artif.Intell., 2014.

[46] M. Tan, Q. Shi, A. van den Hengel, C. Shen, J. Gao, F. Hu, and Z. Zhang,“Learning graph structure for multi-label image classification via cliquegeneration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015,pp. 4100–4109.




[47] W. Bian, B. Xie, and D. Tao, “Corrlog: Correlated logistic models forjoint prediction of multiple labels,” in Proc. Int. Conf. Artif. Intell. Stat.,2012, pp. 109–117.

[48] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, “High-dimensionalising model selection using l 1 -regularized logistic regression,” Annalsof Statistics, vol. 38, pp. 1287–1319, 2010.

[49] J. Ashford and R. Sowden, “Multi-variate probit analysis,” Biometrics,vol. 26, pp. 535–546, 1970.

[50] H. Zou and T. Hastie, “Regularization and variable selection via theelastic net,” Journal of the Royal Statistical Society: Series B (StatisticalMethodology), vol. 67, no. 2, pp. 301–320, 2005.

[51] J. Besag, “Statistical Analysis of Non-Lattice Data,” Journal of the RoyalStatistical Society: Series D (The Statistician), vol. 24, no. 3, pp. 179–195, 1975.

[52] C. Sutton and A. McCallum, “Piecewise pseudolikelihood for efficienttraining of conditional random fields,” in Proc. Int. Conf. Mach. Learn.ACM, 2007, pp. 863–870.

[53] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and MethodEfficiency in Optimization, ser. Wiley-Interscience Series in DiscreteMathematics. New York, USA: John Wiley & Sons, 1983.

[54] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems,” SIAM J. Img. Sci., vol. 2, no. 1,pp. 183–202, 2009.

[55] C. M. Bishop, Pattern recognition and machine learning. Springer,2006.

[56] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach.Learn. Res., vol. 2, pp. 499–526, 2002.

[57] H. Xu, C. Caramanis, and S. Mannor, “Sparse algorithms are not stable:A no-free-lunch theorem,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 34, no. 1, pp. 187–193, 2012.

[58] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser-man, “The pascal visual object classes (voc) challenge,” Int. J. Comput.Vis., vol. 88, no. 2, pp. 303–338, 2010.

[59] ——, “The pascal visual object classes challenge 2012 (voc2012)results,” 2012.

[60] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library ofcomputer vision algorithms,” http://www.vlfeat.org/, 2008.

[61] A. Vedaldi and K. Lenc, “Matconvnet – convolutional neural networksfor matlab,” in Proc. ACM Int. Conf. Multimedia.

[62] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin,“Liblinear: A library for large linear classification,” J. Mach. Learn.Res., vol. 9, pp. 1871–1874, 2008.

[63] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Dzeroski, “An extensiveexperimental comparison of methods for multi-label learning,” PatternRecognit., vol. 45, no. 9, pp. 3084–3104, 2012.

Qiang Li received the BEng degree in Electronicsand Information Engineering, and MEng degree inSignals and Information Processing, from HuazhongUniversity of Science and Technology (HUST),Wuhan, China, in 2010 and 2013, respectively. Heis currently pursuing PhD degree in University ofTechnology Sydney, Australia. His research interestsinclude probabilistic inference, probabilistic graphi-cal models, image processing, video surveillance andcomputer vision.

Bo Xie is currently a PhD student in Georgia Insti-tute of Technology. He received the B.Eng degreein telecommunications engineering from Beijing U-niversity of Posts and Telecommunications.

Jane You is currently a full-professor in the Depart-ment of Computing at the Hong Kong PolytechnicUniversity. Prof. You obtained her BEng. in Elec-tronic Engineering from Xian Jiaotong University in1986 and Ph.D in Computer Science from La TrobeUniversity, Australia in 1992. She was a lecturer atthe University of South Australia and senior lecturer(tenured) at Griffith University from 1993 till 2002.Prof. You was awarded French Foreign MinistryInternational Postdoctoral Fellowship in 1993 andworked at Universite Paris XI during 1993-94. Prof.

Jane You has conducted research in the fields of image processing, medicalimaging, computer-aided detection/diagnosis, pattern recognition. So far, shehas more than 280 research papers published. Prof. You is a team memberof three successful US patents and three awards including Hong KongGovernment Industrial Awards. Her work on retinal imaging led to a US patent(2015), the first runner-up for Innovation Award of Excellence (Hong Kong,2015), a Special Prize and Gold Medal with Jurys Commendation at the 39thInternational Exhibition of Inventions of Geneva (2011) and the second placeof SPIE Medical Imaging2009 Retinopathy Online Challenge (ROC2009)).Prof. You is an associate editor of Pattern Recognition and other journals.

Wei Bian (M’14) received the B.Eng. degree in elec-tronic engineering and the B.Sc. degree in appliedmathematics in 2005, the M.Eng. degree in electron-ic engineering in 2007, all from the Harbin instituteof Technology, Harbin, China, and the PhD degreein Computer Science in 2012 from the University ofTechnology, Sydney, Australia. His research interestsare pattern recognition and machine learning.

Dacheng Tao (F15) is Professor of Computer Sci-ence with the Centre for Quantum Computation &Intelligent Systems, and the Faculty of Engineeringand Information Technology in the University ofTechnology Sydney. He mainly applies statisticsand mathematics to data analytics problems andhis research interests spread across computer vision,data science, image processing, machine learning,and video surveillance. His research results haveexpounded in one monograph and 200+ publicationsat prestigious journals and prominent conferences,

such as IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR,ICCV, ECCV, AISTATS, ICDM; and ACM SIGKDD, with several best paperawards, such as the best theory/algorithm paper runner up award in IEEEICDM07, the best student paper award in IEEE ICDM13, and the 2014 ICDM10-year highest-impact paper award. He received the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award and the 2015 UTS Vice-Chancellors Medal for Exceptional Research. He is a Fellow of the IEEE,OSA, IAPR and SPIE.

A SUBMISSION TO IEEE TRANSACTIONS ON IMAGE …€¦ · A. Multilabel Image Classiﬁcation...

Documents

Transcript of A SUBMISSION TO IEEE TRANSACTIONS ON IMAGE …€¦ · A. Multilabel Image Classiﬁcation...