
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges

Yuxin Peng, Xin Huang, and Yunzhen Zhao

Abstract—Multimedia retrieval plays an indispensable role in big data utilization. Past efforts mainly focused on single-media retrieval. However, the requirements of users are highly flexible, such as retrieving relevant audio clips with an image query. So challenges stemming from the “media gap”, which means that representations of different media types are inconsistent, have attracted increasing attention. Cross-media retrieval is designed for the scenarios where the queries and retrieval results are of different media types. As a relatively new research topic, its concepts, methodologies and benchmarks are still not clear in the literature. To address these issues, we review more than 100 references, give an overview including the concepts, methodologies, major challenges and open issues, and build up the benchmarks including datasets and experimental results. Researchers can directly adopt the benchmarks to promptly evaluate their proposed methods. This will help them to focus on algorithm design, rather than the time-consuming construction of compared methods and results. It is noted that we have constructed a new dataset, XMedia, which is the first publicly available dataset with up to five media types (text, image, video, audio and 3D model). We believe this overview will attract more researchers to focus on cross-media retrieval and be helpful to them.

Index Terms—Cross-media retrieval, overview, concepts, methodologies, benchmarks, challenges.

I. Introduction

WITH the rapid growth of multimedia data such as text, image, video, audio and 3D model, cross-media retrieval is becoming increasingly attractive, through which users can get results with various media types by submitting one query of any media type. For instance, on a visit to the Golden Gate Bridge, users can submit a photo of it, and retrieve relevant results including text descriptions, images, videos, audio clips and 3D models.

The research of multimedia retrieval has lasted for several decades [1]. However, past efforts generally focused on single-media retrieval, where the queries and retrieval results belong to the same media type. Beyond the case of single-media retrieval, some methods have been proposed to deal with more than one media type. Such methods aim to combine multiple media types together in a retrieval process as [2], [3], but the queries and retrieval results must share the same media combination. For example, users can retrieve image/text pairs by an image/text pair. Although these methods involve multiple media types, they are not designed for performing retrieval across different media types, and cross-media similarities cannot be directly measured, such as the similarity between an image and an audio clip. Nowadays, as digital media content is generated and found everywhere, requirements of users are highly flexible, such as retrieving relevant audio clips with an image query. Such a retrieval paradigm is called cross-media retrieval, which has been drawing extensive interest. It is more useful and flexible than single-media retrieval because users can retrieve whatever they want by submitting whatever they have [4].

This work was supported by National Natural Science Foundation of China under Grants 61371128 and 61532005.

The authors are with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China. Corresponding author: Yuxin Peng (e-mail: [email protected]).

The key challenge of cross-media retrieval is the issue of “media gap”, which means that representations of different media types are inconsistent and lie in different feature spaces, so it is extremely challenging to measure similarities among them. Many methods have been proposed to address this issue by analyzing the rich correlations contained in cross-media data. For example, current mainstream methods are designed to learn an intermediate common space for features of different media types, and measure the similarities among them in this common space; these are called common space learning methods. Meanwhile, cross-media similarity measurement methods are proposed to directly compute the cross-media similarities by analyzing the known data relationships, without obtaining an explicit common space. A brief illustration of cross-media retrieval is shown in Figure 1. Most of the existing methods are designed for retrieval of only two media types (mainly image and text), but cross-media retrieval emphasizes the diversity of media types. Hence, there still remains the problem of incorporating other media types, such as video, audio and 3D model, into a unified framework.

As our research on cross-media retrieval has lasted for several years [5]–[12], we find that some key issues on concepts, methodologies and benchmarks are still not clear in the literature. To address these problems, we review more than 100 references and aim to:

• Summarize existing works and methodologies to present an overview, which will facilitate the research of cross-media retrieval.

• Build up the benchmarks, including datasets and experimental results. This will help researchers to focus on algorithm design, rather than the time-consuming construction of compared methods and results, since they can directly adopt the benchmarks to promptly evaluate their proposed methods.

• Provide a new dataset XMedia for comprehensive evaluations of cross-media retrieval. It is the first publicly available dataset consisting of up to five media types (text, image, video, audio and 3D model).

• Present the main challenges and open issues, which are important and meaningful for the further research directions of cross-media retrieval.

Copyright © 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].


Fig. 1: A brief illustration of cross-media retrieval, which shows two major kinds of methods, namely common space learning and cross-media similarity measurement methods.


The rest of this paper is organized as follows: Section II presents the definition of cross-media retrieval. Sections III, IV and V introduce the representative works of common space learning, cross-media similarity measurement and other methods, which are shown in Table I. Section VI summarizes the widely-used datasets for cross-media retrieval, and Section VII presents the experimental results on these datasets. Section VIII presents the open issues and challenges, and finally Section IX concludes this paper.

II. Definition of Cross-media Retrieval

For clarity, we take two media types X and Y as examples to give the formal definition of cross-media retrieval. The training data is denoted as $D_{tr} = \{X_{tr}, Y_{tr}\}$, in which $X_{tr} = \{x_p\}_{p=1}^{n_{tr}}$, where $n_{tr}$ denotes the number of media instances for training, and $x_p$ denotes the $p$-th media instance. Similarly, we denote $Y_{tr} = \{y_p\}_{p=1}^{n_{tr}}$. There exist co-existence relationships between $x_p$ and $y_p$, which mean that instances of different media types exist together to describe relevant semantics. Semantic category labels for the training data can be provided and denoted as $\{c^X_p\}_{p=1}^{n_{tr}}$ and $\{c^Y_p\}_{p=1}^{n_{tr}}$, which indicate the semantic categories that media instances belong to. The testing data is denoted as $D_{te} = \{X_{te}, Y_{te}\}$, in which $X_{te} = \{x_q\}_{q=1}^{n_{te}}$ and $Y_{te} = \{y_q\}_{q=1}^{n_{te}}$. The goal is to compute cross-media similarities $sim(x_a, y_b)$, and retrieve relevant instances of different media types in the testing data for one query of any media type. Unsupervised methods take the setting that all training data is unlabeled, semi-supervised methods take the setting that only a subset of the training data is labeled, while fully supervised methods take the setting that all of the training data is labeled.
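To make the retrieval protocol above concrete, the sketch below ranks testing instances of one media type for a query of another, given some cross-media similarity function. The cosine similarity over an already-learned common space, the feature dimensions and the random data are illustrative assumptions, not part of any specific method surveyed here.

```python
import numpy as np

def retrieve(query_feat, candidate_feats, sim_fn, top_k=10):
    """Rank candidate instances of another media type for one query."""
    scores = sim_fn(query_feat, candidate_feats)
    ranking = np.argsort(-scores)          # descending similarity
    return ranking[:top_k], scores[ranking[:top_k]]

# Example sim_fn for features already projected into a common space
# (only meaningful for common space learning methods; illustrative assumption).
def cosine_sim(x, Y):
    x = x / (np.linalg.norm(x) + 1e-12)
    Y = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    return Y @ x

# Toy usage with random "projected" features
rng = np.random.default_rng(0)
image_query = rng.standard_normal(128)              # one image query in the common space
text_candidates = rng.standard_normal((500, 128))   # testing texts in the common space
top_idx, top_scores = retrieve(image_query, text_candidates, cosine_sim)
```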

Some works involve analyzing the correlation between different media types, mainly image and text, but they are quite different from cross-media retrieval. For example, image annotation methods such as [13] aim to obtain the probabilities that tags are assigned to images, while in cross-media retrieval, text refers to sentences or paragraph descriptions rather than only tags. Image/video caption methods such as [14], [15] are mainly designed for generating text descriptions of images/videos, while cross-media retrieval aims to find the most relevant texts in the existing data for a given image/video and vice versa. Another important difference between them is that image/video caption only focuses on image/video and text, and is not easy to extend to other media types, while cross-media retrieval is the retrieval across all media types such as text, image, video, audio and 3D model. In addition, there are some transfer learning works involving different media types such as [16], but transfer learning and cross-media retrieval differ in two aspects: (1) Transfer learning is a learning framework with a broad coverage of methods and applications, which allows the domains, tasks, and distributions used in training and testing to be different [17]. However, cross-media retrieval is a specific information retrieval task across different media types, and its characteristic challenge and focus is the issue of “media gap”. (2) “Transfer learning aims to extract the knowledge from one or more source tasks and applies the knowledge to a target task” [17], and there exist distinct source and target domains. But different media types are treated equally in cross-media retrieval, and there are usually no distinct source and target domains, or source and target tasks.

TABLE I: Representative works of cross-media retrieval. U is for unsupervised methods, S is for semi-supervised methods, F is for fully supervised methods, and R is for methods that involve relevance feedback and cannot be easily categorized by supervision settings.

Common space learning
  Traditional statistical correlation analysis: [18](U) [19](U) [22](U) [23](F) [24](U) [25](U,F) [26](F) [27](F) [28](F) [29](F) [30](U)
  DNN-based: [12](F) [24](U) [33](U) [34](U) [35](U) [36](U) [37](U) [38](U) [39](U) [40](F) [41](U) [42](U) [43](F)
  Cross-media graph regularization: [7](F) [10](S) [11](S) [49](F) [50](S) [51](U)
  Metric learning: [7](F) [54](F)
  Learning to rank: [56](U) [57](F) [58](F) [59](F)
  Dictionary learning: [61](U) [62](S,F) [64](F)
  Cross-media hashing: [67](U) [68](U) [49](F) [69](U) [70](F) [71](F) [72](F) [73](U) [74](F) [75](U) [76](U) [77](F) [78](F) [79](U) [80](U) [81](F) [82](U) [83](U)
  Others: [20](U) [84](F) [85](F) [86](F) [87](F) [89](S)
Cross-media similarity measurement
  Graph-based: [4](R) [6](S) [90](R) [91](R) [92](R) [93](R) [94](R) [95](R)
  Neighbor analysis: [2](U) [5](F) [8](U,F)
Others
  Relevance feedback analysis: [4](R) [90](R) [93](R) [95](R)
  Multimodal topic model: [97](U) [98](U) [99](U) [100](F)

III. Common Space Learning

Common space learning based methods are currently the mainstream in cross-media retrieval. They follow the idea that data sharing the same semantics has latent correlations, which makes it possible to construct a common space. Taking the Golden Gate Bridge as an example, all of the text descriptions, images, videos, audio clips and 3D models about it describe similar semantics. Consequently, they can be close to each other in a common high-level semantic space. These methods aim to learn such a common space, and explicitly project data of different media types into this space for similarity measurement.

We mainly introduce seven categories of existing methods as subsections A-G. Among them, (A) traditional statistical correlation analysis methods are the basic paradigm and foundation of common space learning methods, which mainly learn linear projection matrices for the common space by optimizing statistical values. The other categories are classified according to their characteristics on different aspects:

• On basic model, (B) DNN-based methods take deep neural network as the basic model and aim to make use of its strong abstraction ability for cross-media correlation learning.

• On correlation modeling, (C) cross-media graph regularization methods adopt the graph model to represent the complex cross-media correlations, (D) metric learning methods view the cross-media correlations as a set of similar/dissimilar constraints, and (E) learning to rank methods focus on cross-media ranking information as their optimization objective.

• On property of common space, (F) dictionary learning methods generate dictionaries and the learned common space is for sparse coefficients of cross-media data, and (G) cross-media hashing methods aim to learn a common Hamming space to accelerate retrieval.

Because these categories are classified according to different aspects, there exist a few overlaps among these categories. For example, the work of [7] can be classified as both a metric learning and graph regularization method.

A. Traditional Statistical Correlation Analysis Methods

Traditional statistical correlation analysis methods are the basic paradigm and foundation of common space learning methods, which mainly learn linear projection matrices by optimizing statistical values. Canonical correlation analysis (CCA) [18] is one of the most representative works, as introduced in [19]. Cross-media data is often organized as sets of paired data of different media types, such as image/text pairs. CCA is a possible solution for such a case: it learns a subspace that maximizes the pairwise correlations between two sets of heterogeneous data. As an early classical work, CCA is also used in some recent works such as [20], [21]. CCA and its variants, such as [22]–[26], are the most popular baseline methods for cross-media retrieval.

CCA itself is unsupervised and does not use semantic category labels, but researchers have also attempted to extend CCA to incorporate semantic information. Rasiwasia et al. [27] propose to first apply CCA to get the common space of image and text, and then achieve semantic abstraction by logistic regression. Costa et al. [28] further verify the effectiveness of combining CCA with semantic category labels. GMA [29], a supervised extension of CCA, also obtains an improvement in accuracy. Multi-view CCA [25] is proposed to take high-level semantics as the third view of CCA, and multi-label CCA [26] is designed to deal with the scenarios where cross-media data has multiple labels. These methods achieve considerable progress, which indicates that semantic information is helpful to improve the accuracy of cross-media retrieval. Besides CCA, there are also alternative methods of traditional statistical correlation analysis. For example, cross-modal factor analysis (CFA) [30] is proposed to minimize the Frobenius norm between pairwise data in the common space. As the basic paradigms of cross-media common space learning, these methods are relatively efficient for training and easy to implement. However, it is difficult to fully model the complex correlations of real-world cross-media data only by linear projections. In addition, most of these methods can only model two media types, but cross-media retrieval usually involves more than two media types.
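As an illustration of this paradigm, the sketch below projects image and text features into a CCA subspace and matches them by cosine similarity. It uses scikit-learn's CCA as a stand-in for the methods above; the feature dimensions and random data are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Paired training data: 1,000 image/text pairs (feature dimensions are assumptions).
img_train = rng.standard_normal((1000, 128))   # e.g., visual features
txt_train = rng.standard_normal((1000, 64))    # e.g., text features

# Learn linear projections that maximize pairwise correlation.
cca = CCA(n_components=10)
cca.fit(img_train, txt_train)

# Project testing data of both media types into the shared subspace.
img_test = rng.standard_normal((200, 128))
txt_test = rng.standard_normal((200, 64))
img_c, txt_c = cca.transform(img_test, txt_test)

# Cross-media retrieval: rank texts for each image query by cosine similarity.
img_c /= np.linalg.norm(img_c, axis=1, keepdims=True) + 1e-12
txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True) + 1e-12
sim = img_c @ txt_c.T                  # (n_image_queries, n_text_candidates)
ranking = np.argsort(-sim, axis=1)     # best-matching texts first
```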

B. DNN-based Methods

With the great advance of deep learning, deep neural network (DNN) has shown its potential in different multimedia applications such as object recognition [31] and text generation [32]. With considerable power of learning non-linear relationships, DNN is also used to perform common space learning for data of different media types. Ngiam et al. [33] apply an extension of the restricted Boltzmann machine (RBM) to common space learning and propose the bimodal deep autoencoder, in which inputs of two different media types pass through a shared code layer, in order to learn cross-media correlations as well as preserve reconstruction information. Following this idea, some similar deep architectures have been proposed and achieve progress in cross-media retrieval. For example, Srivastava et al. [34] adopt two separate deep Boltzmann machines (DBM) to model the distributions over the features of different media types, and the two models are combined by an additional layer on top of them as the joint representation layer, which can learn the common space by computing the joint distribution.

There are also some attempts to combine DNN with CCA as deep canonical correlation analysis (DCCA) [24], [35]. DCCA can be viewed as a non-linear extension of CCA, and is used to learn complex non-linear transformations for two media types. Different from the previous works [33], [34], which build one network with a shared layer for different media types, there are two separate subnetworks in DCCA, and the total correlation is maximized by correlation constraints between the code layers. Feng et al. [36] propose three architectures for common space learning: correspondence autoencoder, correspondence cross-modal autoencoder and correspondence full-modal autoencoder. All of them have similar architectures consisting of two subnetworks coupled at the code layers, and jointly consider the reconstruction errors and the correlation loss. Some works also consist of two autoencoders, such as the independent component multimodal autoencoder (ICMAE) [37] and deep canonically correlated autoencoders (DCCAE) [38]. ICMAE focuses on attribute discovery by learning the shared representation across visual and textual modalities, and DCCAE is optimized by integrating reconstruction errors and canonical correlations. Peng et al. [12] propose cross-media multiple deep networks (CMDN), a hierarchical architecture with multiple deep networks. CMDN jointly preserves the intra-media and inter-media information to generate two kinds of complementary separate representations for each media type, and then hierarchically combines them to learn the common space via a stacked learning style, which improves the retrieval accuracy. In addition, in the work of [39], clicks of users are exploited as side information for cross-media common space learning. A large part of the aforementioned methods are non-convolutional and take hand-crafted features as inputs, as in [12], [36]. Wei et al. [40] propose deep-SM to exploit convolutional neural network (CNN) features with deep semantic matching, which demonstrates the power of CNN features in cross-media retrieval. He et al. [41] propose a deep and bidirectional representation learning model, with two convolution-based networks modeling the matched and unmatched image/text pairs simultaneously for training.

The deep architectures used in cross-media retrieval mainly follow two ways. The first can be viewed as one network, where inputs of different media types pass through the same shared layer [33], [34], while the second consists of subnetworks coupled by correlation constraints at the code layers [36], [42]. These methods take DNN as the basic model, so they have the advantage of strong abstraction ability for dealing with complex cross-media correlations. However, training data usually plays a key role in the performance of a DNN model, and large-scale labeled cross-media datasets are much harder to collect than single-media datasets. It is noted that most of the above works have the limitation of taking only two media types as inputs, although there exist some recent works for more than two kinds of inputs, such as [43] which takes five input types. Jointly learning a common space for more than two media types can improve the flexibility of cross-media retrieval, which is an important challenge for future research.
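The second, coupled-subnetwork style can be sketched as follows. This is a minimal illustration rather than a reimplementation of DCCA or the correspondence autoencoder: it couples two media-specific subnetworks with a simple negative-cosine correlation term at the code layers, and the layer sizes, batch and data are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Subnet(nn.Module):
    """One media-specific subnetwork projecting features to a shared code layer."""
    def __init__(self, in_dim, code_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim)
        )
    def forward(self, x):
        return self.net(x)

img_net, txt_net = Subnet(in_dim=128), Subnet(in_dim=64)
params = list(img_net.parameters()) + list(txt_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Toy paired batch (features are illustrative placeholders).
img_feat = torch.randn(32, 128)
txt_feat = torch.randn(32, 64)

for step in range(100):
    img_code = img_net(img_feat)
    txt_code = txt_net(txt_feat)
    # Correlation-style constraint at the code layers:
    # pull paired codes together by maximizing their cosine similarity.
    loss = 1.0 - F.cosine_similarity(img_code, txt_code, dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```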

Except for the above works, other deep architectures have also been designed for multimedia applications such as image/video caption and text-to-image synthesis [14], [15], [44], [45]. For example, recurrent neural network (RNN) and long short-term memory (LSTM) [14], [15] have been applied to image/video caption, which can generate text descriptions for visual content. Generative adversarial networks (GANs) [46] are proposed by Goodfellow et al., which estimate generative models via an adversarial process by simultaneously training two models: a generative model and a discriminative model. The basic idea of GANs is to set up a game between two players, and pit two adversaries against each other. Each player is represented by a differentiable function, typically implemented as a deep neural network according to [47]. Reed et al. [44] develop a GANs formulation to convert visual concepts from characters to pixels. Later, they propose the generative adversarial what-where network (GAWWN) [45] to synthesize images given the locations of the content to draw. These methods are not directly designed for cross-media retrieval, but their ideas and models are valuable to it.

C. Cross-media Graph Regularization Methods

Graph regularization [48] is widely used in semi-supervised learning, which considers the semi-supervised learning problem from the view of labeling a partially labeled graph. The edge weights denote the affinities of data in the graph, and the aim is to predict the labels of unlabeled vertices. Graph regularization can enrich the training set and make the solution smooth.
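A hedged sketch of the core computation is given below: building a k-nearest-neighbor affinity graph over projected data and evaluating a graph Laplacian smoothness term that can be added to a common-space learning objective. The graph construction details (k, Gaussian weights) are illustrative assumptions rather than those of any specific paper.

```python
import numpy as np

def knn_affinity(Z, k=5, sigma=1.0):
    """Build a symmetric k-NN affinity matrix with Gaussian edge weights."""
    n = Z.shape[0]
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # skip self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                              # symmetrize

def laplacian_regularizer(Z, W):
    """Smoothness term trace(Z^T L Z) = (1/2) * sum_ij W_ij * ||z_i - z_j||^2."""
    L = np.diag(W.sum(1)) - W                              # graph Laplacian
    return np.trace(Z.T @ L @ Z)

# Toy usage: projected representations of (labeled + unlabeled) instances.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 32))
reg = laplacian_regularizer(Z, knn_affinity(Z))
```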

Zhai et al. [7] propose joint graph regularized heterogeneous metric learning (JGRHML). They incorporate graph regularization into the cross-media retrieval problem, using data in the learned metric space to construct the joint graph regularization term. They then propose the joint representation learning (JRL) method [10], with the ability of jointly considering correlations and semantic information in a unified framework for up to five media types. Specifically, they construct a separate graph for each media type, in which the edge weights denote the affinities of labeled and unlabeled data of the same media type. With graph regularization, JRL enriches the training set and learns a projection matrix for each media type jointly. As JRL separately constructs different graphs for different media types, Peng et al. [11] further propose to construct a unified hypergraph for all media types in the common space, so that different media types can boost each other. Another important improvement of [11] is to utilize fine-grained information by media instance segmentation, which helps to exploit multi-level correlations of cross-media data. Graph regularization is also an important part of some recent works such as [49]–[51], where a cross-media graph regularization term is used to preserve the intra-media and inter-media similarity relationships.

Graph regularization is effective for cross-media correlation learning because it can describe various correlations of cross-media data, such as semantic relevance, intra-media similarities and inter-media similarities. Besides, graph regularization can naturally model more than two media types in a unified framework [11]. However, the graph construction process usually leads to high time and space complexity, especially in real-world scenarios with large-scale cross-media data.

D. Metric Learning Methods

Metric learning methods are designed to learn transformations of input features from given similar/dissimilar information to achieve better metric results, and are widely used in single-media retrieval [52], [53]. It is natural to view cross-media data as an extension of multi-view single-media data, so researchers attempt to apply metric learning to cross-media retrieval directly. Intuitively, we can learn two transformations for two media types, and let similar instances be close and dissimilar instances be apart [54]. JGRHML [7] is a representative work of cross-media metric learning, which has also been discussed in Section III-C. Besides the similar/dissimilar information, JGRHML introduces a joint graph regularization term for the metric learning. Different media types are complementary in the joint graph regularization, and optimizing them jointly can make the solution smooth. Metric learning preserves the semantic similar/dissimilar information during common space learning, which is important for semantic retrieval of cross-media data. However, the main limitation of existing metric learning methods for cross-media retrieval, such as [7], [54], is that they depend on supervision information, and are not applicable when supervision information is unavailable.
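The intuition of pulling similar cross-media pairs together and pushing dissimilar ones apart can be sketched with a contrastive-style objective over two linear transformations. This is a generic illustration under assumed feature dimensions and random data, not the JGRHML formulation.

```python
import torch
import torch.nn as nn

# Two linear transformations, one per media type (dimensions are assumptions).
proj_img = nn.Linear(128, 64, bias=False)
proj_txt = nn.Linear(64, 64, bias=False)
optimizer = torch.optim.SGD(
    list(proj_img.parameters()) + list(proj_txt.parameters()), lr=0.1)

def contrastive_loss(zi, zt, same_label, margin=1.0):
    """Pull similar cross-media pairs together, push dissimilar ones beyond a margin."""
    dist = (zi - zt).pow(2).sum(1).sqrt()
    pos = same_label * dist.pow(2)
    neg = (1 - same_label) * torch.clamp(margin - dist, min=0).pow(2)
    return (pos + neg).mean()

# Toy batch: cross-media pairs with a 0/1 flag for "same semantic category".
img_feat, txt_feat = torch.randn(64, 128), torch.randn(64, 64)
same_label = torch.randint(0, 2, (64,)).float()

for step in range(100):
    loss = contrastive_loss(proj_img(img_feat), proj_txt(txt_feat), same_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```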

E. Learning to Rank Methods

Learning to rank methods take ranking information as training data, and directly optimize the ranking of retrieved results instead of the similarities between pairwise data. Early works of learning to rank focus on single-media retrieval, but some works such as [55] indicate that they can be extended to cross-language retrieval. In the work of [56], a discriminative model is proposed to learn mappings from the image space to the text space, but only uni-directional ranking (text→image) is involved. For bi-directional ranking (specifically, text→image and image→text ranking information), Wu et al. [57] propose the bi-directional cross-media semantic representation model (Bi-CMSRM) to optimize the bi-directional listwise ranking loss. To incorporate fine-grained information, Jiang et al. [58] first project visual objects and text words into a local common space, and then project them into the global common space in a compositional way with ranking information. In addition, Wu et al. [59] take a conditional random field for shared topic learning, and then perform latent joint representation learning with a ranking function. Learning to rank is designed to directly benefit the final retrieval performance, and can serve as the optimization objective for cross-media retrieval. Existing methods mainly involve only two media types, such as [56], [57], [59], and when the number of media types increases, there remains the problem of how to incorporate the ranking information of more than two media types into a unified framework.
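A common way to encode such bi-directional ranking information is a triplet objective in both retrieval directions (image→text and text→image). The sketch below is a generic illustration of that idea with assumed, already-projected and normalized features; it is not the listwise loss of Bi-CMSRM.

```python
import torch
import torch.nn.functional as F

def bidirectional_triplet_loss(img, txt, margin=0.2):
    """Triplet ranking loss in both retrieval directions.

    img, txt: L2-normalized common-space features of matched pairs, shape (n, d);
              row i of img is paired with row i of txt.
    """
    sim = img @ txt.t()                      # (n, n) similarity matrix
    pos = sim.diag().unsqueeze(1)            # similarities of matched pairs
    cost_i2t = F.relu(margin + sim - pos)    # image -> text: non-matching texts as negatives
    cost_t2i = F.relu(margin + sim - pos.t())  # text -> image: non-matching images as negatives
    n = sim.size(0)
    mask = 1 - torch.eye(n)                  # ignore the matched (diagonal) pairs
    return ((cost_i2t + cost_t2i) * mask).sum() / (2 * n * (n - 1))

# Toy usage
img = F.normalize(torch.randn(32, 64), dim=1)
txt = F.normalize(torch.randn(32, 64), dim=1)
loss = bidirectional_triplet_loss(img, txt)
```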

F. Dictionary Learning Methods

Dictionary learning methods hold the view that data consists of two parts: dictionaries and sparse coefficients. This idea can also be incorporated into cross-media retrieval: decomposing data into a media-specific part for each media type, and a common part for cross-modal correlations. Monaci et al. [60] propose to learn multi-modal dictionaries for recovering meaningful synchronous patterns from audio and visual signals. The key idea of this method is to learn joint audio-visual dictionaries, so as to find the temporal correlations across different modalities. However, since it only takes synchronous temporal signals as inputs, it is not a cross-media retrieval method. Jia et al. [61] propose to learn one dictionary for each modality, while the weights of these dictionaries are the same. In this work, data is clearly decomposed into two parts: the private dictionaries and the shared coefficients. Zhu et al. [62] propose cross-modality submodular dictionary learning (CmSDL), which learns a modality-adaptive dictionary pair and an isomorphic space for cross-media representation.

Coupled dictionary learning [63] is an effective way to jointly construct the private dictionaries for two views. Zhuang et al. [64] propose to extend single-media coupled dictionary learning to cross-media retrieval, assuming that there exist linear mappings among the sparse coefficients of different media types. Data of one media type can be mapped into the space of another media type via these sparse coefficient mappings. In conclusion, dictionary learning methods model the cross-media retrieval problem in a factorization way, and the common space is for sparse coefficients. Based on this idea, they take different views, such as a unique sparse coefficient for all media types [61] or a set of projections among sparse coefficients of different media types [64]. It is easier to capture cross-media correlations from the sparse coefficients of different media types due to the high sparsity. However, it is a challenge to solve the optimization problem, with its massive calculation, of dictionary learning on large-scale cross-media data.
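The factorization view can be sketched as follows: learn one dictionary per media type, represent each instance by its sparse codes, and relate the two code spaces with a linear mapping estimated from paired data. This is a simplified illustration using scikit-learn's dictionary learning, not the formulation of [61] or [64]; all dimensions and data are assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
img_train = rng.standard_normal((300, 128))   # paired image features (assumed dims)
txt_train = rng.standard_normal((300, 64))    # paired text features

# One dictionary (and set of sparse codes) per media type.
dict_img = DictionaryLearning(n_components=50, transform_algorithm='lasso_lars', random_state=0)
dict_txt = DictionaryLearning(n_components=50, transform_algorithm='lasso_lars', random_state=0)
A_img = dict_img.fit_transform(img_train)     # sparse coefficients of images, (300, 50)
A_txt = dict_txt.fit_transform(txt_train)     # sparse coefficients of texts, (300, 50)

# Linear mapping between the two coefficient spaces, estimated by least squares.
M, *_ = np.linalg.lstsq(A_img, A_txt, rcond=None)

# Retrieval: map an image query's codes into the text coefficient space and rank texts.
img_query = rng.standard_normal((1, 128))
q_codes = dict_img.transform(img_query) @ M
scores = (q_codes @ A_txt.T).ravel()
ranking = np.argsort(-scores)
```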

G. Cross-media Hashing Methods

Nowadays, the amount of multimedia data is growing dramatically, which requires highly efficient retrieval systems. Hashing methods are designed for accelerating the retrieval process, and are widely used in various retrieval applications. However, most of them only involve a single media type such as image [65]. For example, Tang et al. [66] propose to learn image hashing functions with discriminative information of the local neighborhood structure, and exploit the neighbors of samples in the original space to improve the retrieval accuracy. Cross-media hashing aims to generate hash codes for more than one media type, and project cross-media data into a common Hamming space. There have been some works extending single-media hashing to the retrieval of data with multiple views or information sources, such as [67], [68]. They are not designed specifically for cross-media retrieval, but these methods and ideas can be easily applied to it. For example, to model multiple information sources, Zhang et al. [68] propose composite hashing with multiple information sources (CHMIS), and the idea is to preserve both the similarities in the original spaces and the correlations between multiple information sources. Similarly, the idea of preserving both inter-media and intra-media similarities is a key principle in some later works of cross-media retrieval [49], [69]–[71]. For instance, Wu et al. [49] first apply a hypergraph to model intra-media and inter-media similarities, and then learn multi-modal dictionaries for generating hashing codes. The discriminative capability of hash codes has also been taken into consideration [72], which helps to learn the hash codes under the supervised condition. More recently, Long et al. [73] propose to learn the common space projection and composite quantizers in a seamless scheme, while most existing works view continuous common space learning and binary code generation as two separate stages.

Besides, cross-media hashing methods are various, such as [74]–[83]. The models vary from eigen-decomposition and boosting [74] to the probabilistic generative model [71], and even deep architectures [78], [82]. The aforementioned cross-media hashing methods mainly consider similar factors such as inter-media similarities, intra-media similarities and semantic discriminative capability. It is noted that cross-media hashing methods are learning-based, because they learn from the cross-media correlations to bridge the “media gap”. Cross-media hashing has the advantage of retrieval efficiency due to the short binary hash codes, which can benefit retrieval on large-scale datasets in the real world. However, existing works such as [78]–[83] only involve the retrieval between data of two media types (mainly image and text). As for the experiments, some small-scale datasets such as Wikipedia (2,866 image/text pairs) and Pascal Sentence (1,000 image/text pairs) are used to evaluate the accuracy of hashing [70], [75], [76], [78], [83]. Nevertheless, the efficiency advantage of hashing cannot be effectively validated on such small-scale datasets.
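A minimal sketch of the common Hamming space idea: project features of each media type with media-specific matrices, binarize by sign, and retrieve by Hamming distance. Real cross-media hashing methods learn these projections from cross-media correlations, so the random projections below are purely an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits = 32

# Media-specific projections into a common Hamming space (random here; learned in practice).
P_img = rng.standard_normal((128, n_bits))
P_txt = rng.standard_normal((64, n_bits))

def hash_codes(X, P):
    """Binarize projected features into {0, 1} hash codes."""
    return (X @ P > 0).astype(np.uint8)

img_db = hash_codes(rng.standard_normal((10000, 128)), P_img)   # image database codes
txt_query = hash_codes(rng.standard_normal((1, 64)), P_txt)     # one text query code

# Retrieval by Hamming distance (number of differing bits).
hamming = (img_db != txt_query).sum(axis=1)
top10 = np.argsort(hamming)[:10]
```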

H. Other Methods

There are still some methods that cannot be easily categorized into the aforementioned categories. They also follow the idea of projecting heterogeneous data into a common space, so that the similarities among them can be directly measured. For example, Zhang et al. [84] propose a two-step method, which first projects the data into a high-dimensional common space, and then maps data from the high-dimensional space into a low-dimensional common space, according to the intra-class distance and inter-class distance. Kang et al. [85] propose local group based consistent feature learning (LGCFL) to deal with unpaired data. In this method, the common space can be learned according to the semantic category labels, instead of strictly paired data as in CCA.

Most existing methods only learn one projection for each media type, such as [10], [18], [27]. This can be further explained in two main aspects. The first is learning only one global projection for each media type, which may lead to inflexibility for large-scale and complex data. Instead, Hua et al. [86] propose to learn a set of local projections, and analyze the hierarchy structure to exploit the semantic correlations of data tags. The second is using the same projections for all retrieval tasks (such as image→text and text→image retrieval). Instead, Wei et al. [87] propose to learn different projection matrices for image→text retrieval and text→image retrieval. The idea of training different models for different tasks is also presented in works such as [20]. However, this idea has the limitation that as the number of retrieval tasks increases, the number of projection matrices to be learned will also increase.

Following the attempts of discovering subspaces and manifolds that provide the common low-dimensional representations of two different high-dimensional datasets [88] (for example, two image datasets from different viewpoints), some works such as [89] also extend manifold alignment for cross-media retrieval. These methods take the intuition that high-dimensional data has low-dimensional manifold structures, and aim to find the common space projections for different media types by aligning their underlying manifold representations.

IV. Cross-media Similarity Measurement

Cross-media similarity measurement methods aim to measure similarities of heterogeneous data directly, without explicitly projecting media instances from their separate spaces to a common space. In the absence of a common space, cross-media similarities cannot be computed directly by distance measures or normal classifiers. An intuitive way is to use the known media instances and correlations in datasets as the basis to bridge the “media gap”.

Existing methods for cross-media similarity measurement usually take the idea of using edges in graphs to represent the relationships among media instances and multimedia documents (MMDs). According to the different focuses of the methods, we further classify them into two categories as subsections: (A) graph-based methods focus on the construction of graphs, and (B) neighbor analysis methods mainly consider how to exploit the neighbor relationships of data for similarity measurement. These two categories have overlaps in algorithm process, because the neighbor relationships may be analyzed in a constructed graph.

A. Graph-based Methods

The basic idea of graph-based methods is to view cross-media data as vertices in one or more graphs, where the edges are constructed from the correlations of cross-media data. Single-media content similarities, co-existence relationships and semantic category labels can be jointly used for graph construction. Through a process such as similarity propagation [90] or constraint fusion [91], the retrieval results can be obtained. These methods often focus on situations where the relevance in MMDs is available, which contain data of multiple media types with the same semantics [92]. The co-existence relationships of data in MMDs provide important hints to bridge different media types. For example, a graph indicating the similarities of MMDs plays an important role in [4], [93], and the cross-media retrieval is based on MMD affinities in this graph.
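The similarity-propagation idea can be sketched as an iterative diffusion over an MMD affinity graph: starting from initial query-to-vertex similarities, scores are repeatedly spread along edges until convergence. The normalization, damping factor and random graph below are generic assumptions, not the exact scheme of [4], [90] or [93].

```python
import numpy as np

def propagate_similarity(W, s0, alpha=0.8, n_iter=50):
    """Diffuse initial similarity scores s0 over an affinity graph W.

    W:     symmetric non-negative affinity matrix among vertices (e.g., MMDs)
    s0:    initial similarities of the query to each vertex
    alpha: damping factor balancing propagation against the initial scores
    """
    D = np.diag(1.0 / np.maximum(W.sum(1), 1e-12))
    S = D @ W                                   # row-normalized transition matrix
    s = s0.copy()
    for _ in range(n_iter):
        s = alpha * (S @ s) + (1 - alpha) * s0  # propagate, then anchor to the query
    return s

# Toy usage: 100 vertices, a random sparse affinity graph, one query.
rng = np.random.default_rng(0)
W = (rng.random((100, 100)) > 0.95).astype(float)
W = np.maximum(W, W.T)
s0 = rng.random(100)
scores = propagate_similarity(W, s0)
ranking = np.argsort(-scores)
```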

Tong et al. [91] construct an independent graph for each media type. These graphs are further combined by linear fusion or sequential fusion, and then the similarity measurement of cross-media data is conducted. Different from [91], Zhuang et al. [90] construct a uniform cross-media correlation graph, which integrates all media types. The edge weights are determined by the similarities of single-media data and co-existence relationships. Besides, the links among MMDs on web pages have also been taken into account in the work of [94]. Yang et al. [95] propose a two-level graph construction strategy. They first construct two types of graphs: one graph for each media type, and the other for all MMDs. Then the characteristics of media instances propagate along the MMD semantic graph, and the MMD semantic space is constructed to perform cross-media retrieval. While existing methods mostly consider only the positive correlations in similarity propagation, Zhai et al. [6] propose to propagate both the positive and negative correlations among data of different media types in graphs, and improve the retrieval accuracy.

The key idea of graph-based similarity measurement methods is to construct one or more graphs, and represent cross-media correlations at the level of media instances or MMDs. It is helpful to incorporate various types of correlation information by graph construction. However, graph-based methods are time and space consuming due to the process of graph construction. In addition, existing works are often devoted to scenarios where the relevance in MMDs is available, and relevance feedback usually acts as a key factor in works such as [4], [90], [93]. On the one hand, when the above relevance is unavailable, it is difficult to perform cross-media retrieval, especially when the queries are out of the datasets. On the other hand, in real-world applications, the relationships of MMDs are usually noisy and incomplete, which is also a key challenge for these methods.

B. Neighbor Analysis Methods

Generally speaking, neighbor analysis methods are usually based on graph construction because the neighbors may be analyzed in a given graph [90], [95]. In this paper, graph-based methods mainly involve the process of graph construction, while neighbor analysis methods focus on using the neighbor relationships for similarity measurement.

Clinchant et al. [2] introduce a multimedia fusion strategy named transmedia fusion for cross-media retrieval. For instance, given a dataset containing image/text pairs, users may retrieve relevant texts with image queries. For one image query, its nearest neighbors are retrieved according to single-media content similarities, and then the text descriptions of these nearest neighbors are regarded as the relevant texts. Zhai et al. [5] propose to compute cross-media similarities by the probabilities of two media instances belonging to the same semantic category, which are calculated by analyzing the homogeneous nearest neighbors of each media instance. Ma et al. [8] propose to compute cross-media similarities from the perspective of clusters. In their work, a clustering algorithm is first applied to each media type, and then the similarities among clusters are obtained according to the data co-existence relationships. The queries are assigned to clusters with different weights according to the single-media content similarities, and then retrieval results can be obtained by computing the similarities among clusters.
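The transmedia-fusion idea can be sketched with a simple nearest-neighbor step: find images in the paired dataset that are close to the image query, then use the texts that co-exist with those neighbors as expanded queries. The use of scikit-learn's NearestNeighbors, the averaging step and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Paired dataset: image i co-exists with text i (features are placeholders).
img_db = rng.standard_normal((1000, 128))
txt_db = rng.standard_normal((1000, 300))

# Step 1: nearest neighbors of the image query within the image modality.
knn = NearestNeighbors(n_neighbors=5).fit(img_db)
img_query = rng.standard_normal((1, 128))
_, nbr_idx = knn.kneighbors(img_query)

# Step 2: the texts paired with those neighbors act as an expanded (text) query.
expanded_txt_query = txt_db[nbr_idx[0]].mean(axis=0)

# Step 3: rank all texts against the expanded query by cosine similarity.
txt_norm = txt_db / (np.linalg.norm(txt_db, axis=1, keepdims=True) + 1e-12)
q_norm = expanded_txt_query / (np.linalg.norm(expanded_txt_query) + 1e-12)
ranking = np.argsort(-(txt_norm @ q_norm))
```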

Neighbor analysis methods find the nearest neighbors of the queries in the datasets to get the retrieval results. These neighbors can be used as expanded queries, and serve as bridges for dealing with queries out of the datasets. In addition, some methods such as [5] do not rely on MMDs, so they are flexible. However, because neighbor analysis methods may actually be based on graph construction, they have the same problem of high time and space complexity. It is also difficult to ensure the relevance of the neighbors, so the performance is not stable.

V. Other Methods for Cross-media Retrieval

Besides common space learning and cross-media similarity measurement methods, we introduce two categories of other cross-media methods as subsections: (A) relevance feedback analysis is an auxiliary method for providing more information on user intent to promote the performance of retrieval; (B) multimodal topic models view cross-media data at the topic level, and the cross-media similarities are usually obtained by computing conditional probabilities.

A. Relevance Feedback Analysis

To bridge the vast “media gap”, relevance feedback (RF) is beneficial for providing more accurate information and improving the retrieval accuracy. It is worth noting that RF is widely used in cross-media similarity measurement, and its effectiveness has been validated in some works [4], [93], [95]. RF includes two types: short-term feedback and long-term feedback. Short-term feedback only involves RF information provided by the current user, while long-term feedback takes RF information provided by all users into account. For short-term feedback, in the works of [4], [95], when the queries are out of the datasets, the system shows the nearest neighbors of the queries in the dataset, and users label them as positive or negative samples. Then the similarities are refined according to the feedback. For long-term feedback, Yang et al. [93] propose to convert long-term feedback information into pairwise similar/dissimilar constraints to refine the vector representation of data. Zhuang et al. [90] exploit both long-term and short-term feedback. For long-term feedback, they first investigate the global structure of all feedback and then refine the uniform cross-media correlation graph. For short-term feedback, they simply use the positive samples as expanded queries. RF is an auxiliary technique to improve the retrieval accuracy in an interactive way, but at the cost of human labor.
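A minimal sketch of short-term feedback as query expansion: user-labeled positive and negative neighbors shift the query vector before re-ranking (a Rocchio-style update). The weighting scheme is an illustrative assumption, not the refinement actually used in [4], [90] or [95].

```python
import numpy as np

def expand_query(query, positives, negatives, beta=0.75, gamma=0.25):
    """Rocchio-style query expansion from user-labeled feedback samples.

    query:     original query feature vector
    positives: features of samples the user marked as relevant, shape (p, d)
    negatives: features of samples the user marked as irrelevant, shape (n, d)
    """
    new_q = query.copy()
    if len(positives):
        new_q = new_q + beta * positives.mean(axis=0)
    if len(negatives):
        new_q = new_q - gamma * negatives.mean(axis=0)
    return new_q

# Toy usage: refine the query and re-rank candidates by cosine similarity.
rng = np.random.default_rng(0)
query = rng.standard_normal(64)
candidates = rng.standard_normal((500, 64))
pos, neg = candidates[:3], candidates[3:5]          # pretend user feedback
q = expand_query(query, pos, neg)
sims = candidates @ q / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(q) + 1e-12)
ranking = np.argsort(-sims)
```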

B. Multimodal Topic Model

Inspired by topic models such as latent Dirichlet allocation (LDA) [96] in text processing, researchers have extended topic models to multimodal retrieval. These models are often designed for applications such as image annotation, involving images and their corresponding tags. Correspondence LDA (Corr-LDA) [97] is a classical multimodal extension of LDA for image annotation. Specifically, it first generates image region descriptions and then generates the caption. However, it takes the strict assumption that each image topic must have a corresponding text topic. To address this problem, topic-regression multi-modal LDA (tr-mmLDA) [98] uses two separate topic models for image and text, and finally applies a regression module to correlate the two hidden topic sets. Nevertheless, it still takes the strong assumption that each word in the text has a visual interpretation. To further make the topic models flexible and perform cross-media retrieval, Jia et al. [99] propose the multi-modal document random field (MDRF) method, which can be viewed as a Markov random field over LDA topic models. Wang et al. [100] propose a downstream supervised topic model, and build a joint cross-modal probabilistic graphical model to discover mutually consistent semantic topics. Multimodal topic models aim to analyze cross-media correlations at the topic level. However, these existing methods often take strong assumptions on the distribution of cross-media topics, such as the existence of the same topic proportions or pairwise topic correspondences between different media types, which are not satisfied in real-world applications.

VI. Cross-media Retrieval Datasets

Datasets are important for the evaluation of cross-media retrieval methods. We study all references of this paper and summarize the usage frequencies of several popular datasets in Table II. It shows that the Wikipedia and NUS-WIDE datasets are the most widely-used cross-media retrieval datasets. The Pascal VOC datasets are a series of important datasets for cross-media retrieval, and also the basis of the Pascal Sentence dataset. Pascal VOC 2007 is the most popular one among the Pascal VOC datasets. In addition, the XMedia dataset is the first cross-media dataset that contains up to five media types. We first introduce the Wikipedia and XMedia datasets, which are specifically designed for cross-media retrieval, then the NUS-WIDE and Pascal VOC 2007 datasets. Besides, we also introduce a large-scale click-based dataset, Clickture.

Wikipedia Dataset. The Wikipedia dataset [27] is the most widely-used dataset for cross-media retrieval. It is based on the “featured articles” in Wikipedia, which is a continually updated article collection. There are 29 categories in “featured articles” in total, but only the 10 most populated categories are actually considered. Each article is split into several sections according to its section headings, and the dataset is finally generated as a set of 2,866 image/text pairs. The Wikipedia dataset is an important benchmark dataset for cross-media retrieval. However, it is small-scale and only involves two media types (image and text). The categories in this dataset are of high-level semantics that are difficult to distinguish, such as warfare and history, leading to confusion in retrieval evaluation. On the one hand, there are some semantic overlaps among these categories; for example, a war (which should belong to the warfare category) is usually also a historical event (which should belong to the history category). On the other hand, even data belonging to the same category may differ greatly in semantics.

XMedia Dataset. For comprehensive and fair evaluation, we have constructed a new cross-media dataset XMedia. We choose 20 categories such as insect, bird, wind, dog, tiger, explosion and elephant. These categories are specific objects that can avoid the confusions and overlaps. For each category, we collect the data of five media types: 250 texts, 250 images, 25 videos, 50 audio clips and 25 3D models, so there are 600 media instances for each category and the total number of media instances is 12,000. All of the media instances are crawled from famous websites: Wikipedia, Flickr, YouTube, 3D Warehouse and the Princeton 3D model search engine. The XMedia dataset is the first cross-media dataset with up to five media types (text, image, video, audio and 3D model), and has been used in our works [7], [10], [11] to evaluate the effectiveness of cross-media retrieval. The XMedia dataset is publicly available and can be accessed through the link: http://www.icst.pku.edu.cn/mipl/XMedia.

NUS-WIDE Dataset. The NUS-WIDE dataset [101] is a web image dataset including images and their associated tags. The images and tags are all randomly crawled from Flickr through its public API. With duplicated images removed, there are 269,648 images in the NUS-WIDE dataset covering 81 concepts. A total of 425,059 unique tags are originally associated with these images. However, to further improve the quality of the tags, those tags that appear no more than 100 times or do not exist in WordNet [102] are removed. So finally 5,018 unique tags are included in this dataset.
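
As an illustration of the tag-filtering rule described above (this is not the official NUS-WIDE preprocessing code): a tag is kept only if it occurs more than 100 times and has an entry in WordNet. The sketch assumes `all_tags` is a hypothetical list of raw tag occurrences and that NLTK's WordNet corpus has been downloaded (nltk.download('wordnet')).

from collections import Counter
from nltk.corpus import wordnet as wn

def filter_tags(all_tags, min_count=100):
    # Keep tags that appear more than min_count times and exist in WordNet.
    counts = Counter(all_tags)
    kept = [t for t, c in counts.items()
            if c > min_count and len(wn.synsets(t)) > 0]
    return sorted(kept)

# Example usage (hypothetical input):
# vocabulary = filter_tags(all_tags)   # expected to yield the reduced tag vocabulary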

Pascal VOC 2007 Dataset. The Pascal visual object classes (VOC) challenge [103] is a benchmark in visual object category detection and recognition. Pascal VOC 2007 is the most popular Pascal VOC dataset, which consists of 9,963 images divided into 20 categories. The image annotations serve as the text for cross-media retrieval, and are defined over a vocabulary of 804 keywords.

Clickture Dataset. The Clickture dataset [104] is a large-scale click-based image dataset, which is collected from one year of click-through data of a commercial image search engine. The full Clickture dataset consists of 40 million images and 73.6 million text queries. It also has a subset, Clickture-Lite, with 1.0 million images and 11.7 million text queries. Following recent works such as [39], [105], we take Clickture-Lite for experimental evaluation. The training set consists of 23.1 million query-image-click triads, where “click” is an integer indicating the relevance between the image and the query, and the testing set has 79,926 query-image pairs generated from 1,000 text queries.

Among the above datasets, the Wikipedia and XMedia datasets are specifically designed for cross-media retrieval. The NUS-WIDE and Pascal VOC 2007 datasets are image/tag datasets, which were initially designed for the evaluation of other applications such as image annotation and classification. These two datasets provide only tags as text, instead of the sentence or paragraph descriptions found in the Wikipedia and XMedia datasets. The Clickture dataset is the largest among these datasets, but it provides no category labels as supervision information.

VII. Experiments

A. Feature Extraction and Dataset Split

This subsection presents the feature extraction strategy and the split of training/testing sets in the experiments. For the Wikipedia, XMedia and Clickture datasets, we take the same strategy as [27] to generate both text and image representations, and the representations of video, audio and 3D model are the same as [10]. In detail, texts are represented by the histograms of a 10-topic LDA model, and images are represented by the bag-of-visual-words (BoVW) histograms of a SIFT codebook with 128 codewords. Videos are segmented into several video shots first, and then 128-dimensional BoVW histogram features are extracted for the video keyframes. Audio clips are represented by 29-dimensional MFCC features, and 3D models are represented by the concatenated 4,700-dimensional vectors of a LightField descriptor set. For the NUS-WIDE dataset, we use the 1,000-dimensional word frequency features for texts, and the 500-dimensional BoVW features for images, both provided by Chua et al. [101]. For the Pascal VOC 2007 dataset, we use publicly available features for the experiments, the same as [85], where 399-dimensional word frequency features are used for texts and 512-dimensional GIST features are used for images. The above feature extraction strategy is adopted for all compared methods in the experiments except DCMIT [35], because its architecture contains networks taking the original images and texts as inputs. However, DCMIT does not involve corresponding networks for video, audio and 3D model, so for these three media types we use the same extracted features as all the other compared methods.

TABLE II: The usage frequencies of several widely-used datasets in the references on cross-media retrieval.

Dataset    Wikipedia  NUS-WIDE  Pascal VOC 2007  Pascal Sentence  MIR-Flickr  Corel  LabelMe  XMedia
Frequency  38         30        9                9                7           6      5        3

For the Wikipedia dataset, 2,173 image/text pairs are used for training and 693 image/text pairs are used for testing. For the XMedia dataset, the ratio of training and testing sets is 4:1 for all five media types, so we have a training set of 9,600 instances and a testing set of 2,400 instances. For the NUS-WIDE dataset, we select image/text pairs that exclusively belong to one of the 10 largest categories from valid URLs. As a result, the size of the training set is 58,620 and the testing set has a total size of 38,955. The Pascal VOC 2007 dataset is split into a training set with 5,011 image/text pairs and a testing set with 4,952 image/text pairs. Images with only one object are selected for the experiments, and finally there are 2,808 image/text pairs in the training set and 2,841 image/text pairs in the testing set. For the Clickture dataset, there are 11.7 million distinct queries and 1.0 million unique images for training, and 79,926 query-image pairs generated from 1,000 queries for testing.
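
To make the BoVW image representation described in this subsection concrete, here is a minimal sketch of 128-codeword BoVW feature extraction. It is an illustrative pipeline under the assumption that OpenCV's SIFT implementation is available (cv2.SIFT_create in OpenCV 4.4 or later), not the exact feature extraction code used in [27].

import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path):
    # Extract 128-dimensional SIFT descriptors from a grayscale image.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_codebook(training_image_paths, n_words=128):
    # Cluster all training descriptors into a 128-word visual codebook.
    all_desc = np.vstack([sift_descriptors(p) for p in training_image_paths])
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_desc)

def bovw_histogram(image_path, codebook):
    desc = sift_descriptors(image_path)
    words = codebook.predict(desc)                       # assign each descriptor to a codeword
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                   # L1-normalized 128-d BoVW feature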

B. Evaluation Metrics and Compared Methods

Two retrieval tasks are conducted for objective evaluation of cross-media retrieval:
• Multi-modality cross-media retrieval. By submitting a query example of any media type, all media types will be retrieved.
• Bi-modality cross-media retrieval. By submitting a query example of any media type, the other media type will be retrieved.

On all datasets except the Clickture dataset, both tasks are performed, and the retrieval results are evaluated by mean average precision (MAP) scores, which are widely adopted in information retrieval. The MAP score for a set of queries is the mean of the average precision (AP) of each query.
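
For reference, the following is a minimal sketch of the MAP computation described above, assuming each query's ranked result list has been reduced to binary relevance labels (1 if a retrieved item shares the query's category, else 0). This is an illustrative implementation, not the authors' evaluation script.

import numpy as np

def average_precision(relevance):
    # relevance: binary labels ordered by decreasing similarity to the query.
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(all_relevance_lists):
    # MAP is the mean of the per-query average precision values.
    return float(np.mean([average_precision(r) for r in all_relevance_lists]))

# Example: two queries with ranked binary relevance judgments.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))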

Besides, we also adopt precision-recall curves (PR curves) and running time for comprehensive evaluations. Due to the length limitation of this paper, we show the PR curves and running time on our website: http://www.icst.pku.edu.cn/mipl/XMedia. The Clickture dataset does not provide category labels for evaluation with MAP and PR curves. Instead, it consists of many text queries, and for each text query there are multiple images along with the relevance between the images and the query, which serves as uni-directional ground truth. Following [39], [105], we conduct the text-based image retrieval task for each text query, and take the discounted cumulative gain of the top 25 results (DCG@25) as the evaluation metric. The compared methods in the experiments include: BITR [20], CCA [18], CCA+SMN [27], CFA [30], CMCP [6], DCMIT [35], HSNN [5], JGRHML [7], JRL [10], LGCFL [85], ml-CCA [26], mv-CCA [25] and S2UPG [11]. All these methods are evaluated on the Wikipedia, XMedia, NUS-WIDE and Pascal VOC 2007 datasets. However, because the Clickture dataset provides no category labels for supervised training, only the unsupervised methods (BITR, CCA, CFA, DCMIT) are evaluated on this dataset.
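
Similarly, a minimal sketch of DCG@25 under the standard graded-relevance formulation is given below; the exact relevance grades and any normalization constant follow the Clickture benchmark protocol [104], [105], which is not reproduced here.

import numpy as np

def dcg_at_k(relevance_grades, k=25):
    # relevance_grades: graded relevance of retrieved images, ranked by score.
    rel = np.asarray(relevance_grades, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))      # log2(i+1) for rank i = 1..k
    return float(((2.0 ** rel - 1.0) / discounts).sum())

# Example: top results graded on a 0-3 scale for one text query.
print(dcg_at_k([3, 2, 0, 3, 1]))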

C. Experimental Results

Table III shows the MAP scores of multi-modality cross-media retrieval. We observe that the methods exploiting semantic information, such as CCA+SMN, HSNN, LGCFL, ml-CCA, mv-CCA and JGRHML, achieve better results than CCA, CFA and BITR, which only consider pairwise correlations. DCMIT performs better than CCA due to the use of DNN. CMCP and JRL achieve better results, for the reason that CMCP considers not only the positive but also the negative correlations among different media types, and JRL incorporates sparse and semi-supervised regularizations to enrich the training set as well as make the solution smooth. S2UPG achieves the best results because it adopts media patches to model fine-grained correlations, and the unified hypergraph can jointly model data from all media types, so as to fully exploit the correlations among them.
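
As a point of reference for the pairwise-correlation baselines discussed above, the sketch below shows how a CCA-style common space can be used for retrieval: paired image and text features are projected into a shared space and ranked by cosine similarity. This is an illustrative baseline with hypothetical feature matrices, not the exact experimental configuration of the compared methods.

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)
X_img = rng.randn(500, 128)    # e.g., BoVW image features (hypothetical)
X_txt = rng.randn(500, 10)     # e.g., LDA text features paired with X_img (hypothetical)

cca = CCA(n_components=10).fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)     # common-space embeddings

# Image query -> text retrieval: rank all texts for the first image.
scores = cosine_similarity(Z_img[:1], Z_txt)[0]
ranked_text_indices = np.argsort(-scores)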

Table IV shows the MAP scores of bi-modality cross-media retrieval. Generally speaking, CMCP, HSNN, JGRHML, JRL and S2UPG get much better results than other methods such as BITR, CCA and CCA+SMN. The trends among them are different from the results on multi-modality cross-media retrieval. For example, the results of CMCP, JGRHML and JRL are close to each other on bi-modality cross-media retrieval, while JRL clearly outperforms CMCP and JGRHML on multi-modality cross-media retrieval. S2UPG still achieves the best results, because the fine-grained information of different media types can be modeled into one unified hypergraph to make them boost each other. It is noted that because the Clickture dataset provides no category labels for supervised training, we perform unsupervised methods to verify their effectiveness, and the results are shown in Table V. The overall trends among different methods on the Clickture dataset are similar to those on the other datasets.

TABLE III: The MAP scores of multi-modality cross-media retrieval.

Dataset          Task        BITR   CCA    CCA+SMN  CFA    CMCP   DCMIT  HSNN   JGRHML  JRL    LGCFL  ml-CCA  mv-CCA  S2UPG
Wikipedia        Image→All   0.167  0.191  0.211    0.185  0.255  0.221  0.245  0.263   0.273  0.213  0.209   0.200   0.274
Wikipedia        Text→All    0.258  0.381  0.414    0.400  0.462  0.408  0.434  0.468   0.476  0.441  0.418   0.444   0.503
XMedia           Image→All   0.079  0.130  0.143    0.137  0.218  0.261  0.184  0.169   0.252  0.063  0.139   0.131   0.311
XMedia           Text→All    0.065  0.130  0.141    0.136  0.196  0.146  0.178  0.144   0.199  0.068  0.142   0.152   0.254
XMedia           Video→All   0.071  0.079  0.116    0.107  0.153  0.078  0.127  0.134   0.152  0.078  0.111   0.082   0.190
XMedia           Audio→All   0.072  0.104  0.125    0.117  0.202  0.100  0.187  0.131   0.197  0.078  0.133   0.105   0.227
XMedia           3D→All      0.061  0.070  0.082    0.111  0.219  0.070  0.160  0.203   0.181  0.086  0.154   0.069   0.291
NUS-WIDE         Image→All   0.247  0.288  0.310    0.288  0.328  0.307  0.361  0.314   0.409  0.355  0.322   0.301   0.473
NUS-WIDE         Text→All    0.248  0.330  0.364    0.329  0.449  0.380  0.433  0.419   0.485  0.422  0.349   0.419   0.584
Pascal VOC 2007  Image→All   0.088  0.116  0.179    0.150  0.258  0.185  0.273  0.150   0.329  0.270  0.265   0.194   0.345
Pascal VOC 2007  Text→All    0.091  0.239  0.405    0.191  0.387  0.283  0.407  0.282   0.594  0.664  0.621   0.618   0.631
Average                      0.132  0.187  0.226    0.196  0.284  0.222  0.272  0.243   0.322  0.249  0.260   0.247   0.371

TABLE IV: The MAP scores of bi-modality cross-media retrieval.

Dataset          Task         BITR   CCA    CCA+SMN  CFA    CMCP   DCMIT  HSNN   JGRHML  JRL    LGCFL  ml-CCA  mv-CCA  S2UPG
Wikipedia        Image→Text   0.222  0.249  0.277    0.246  0.326  0.277  0.321  0.329   0.339  0.274  0.269   0.271   0.377
Wikipedia        Text→Image   0.171  0.196  0.226    0.195  0.251  0.250  0.251  0.256   0.250  0.224  0.211   0.209   0.286
XMedia           Image→Text   0.070  0.119  0.141    0.127  0.201  0.150  0.176  0.176   0.195  0.089  0.134   0.149   0.270
XMedia           Image→Video  0.089  0.098  0.197    0.174  0.203  0.092  0.182  0.239   0.225  0.097  0.137   0.158   0.264
XMedia           Image→Audio  0.085  0.103  0.162    0.129  0.219  0.087  0.225  0.198   0.234  0.121  0.148   0.135   0.265
XMedia           Image→3D     0.108  0.117  0.299    0.159  0.338  0.092  0.242  0.347   0.300  0.148  0.188   0.165   0.394
XMedia           Text→Image   0.061  0.114  0.138    0.126  0.217  0.195  0.185  0.190   0.213  0.081  0.134   0.137   0.275
XMedia           Text→Video   0.081  0.110  0.127    0.137  0.174  0.110  0.158  0.201   0.160  0.074  0.121   0.118   0.242
XMedia           Text→Audio   0.079  0.127  0.155    0.136  0.213  0.127  0.215  0.183   0.209  0.083  0.160   0.155   0.242
XMedia           Text→3D      0.101  0.160  0.187    0.180  0.319  0.091  0.243  0.279   0.250  0.112  0.214   0.115   0.338
XMedia           Video→Image  0.059  0.065  0.157    0.126  0.188  0.058  0.143  0.192   0.196  0.082  0.120   0.112   0.225
XMedia           Video→Text   0.065  0.078  0.090    0.102  0.141  0.078  0.118  0.134   0.123  0.072  0.090   0.090   0.193
XMedia           Video→Audio  0.074  0.093  0.137    0.117  0.155  0.093  0.166  0.139   0.152  0.110  0.128   0.096   0.168
XMedia           Video→3D     0.106  0.134  0.101    0.178  0.238  0.098  0.176  0.253   0.195  0.130  0.173   0.156   0.267
XMedia           Audio→Image  0.057  0.078  0.129    0.109  0.233  0.065  0.213  0.177   0.243  0.091  0.123   0.104   0.274
XMedia           Audio→Text   0.061  0.106  0.122    0.110  0.195  0.106  0.182  0.142   0.173  0.100  0.125   0.122   0.244
XMedia           Audio→Video  0.096  0.114  0.142    0.151  0.172  0.114  0.169  0.181   0.158  0.100  0.136   0.096   0.207
XMedia           Audio→3D     0.118  0.164  0.269    0.147  0.343  0.102  0.284  0.318   0.222  0.170  0.256   0.153   0.363
XMedia           3D→Image     0.057  0.073  0.248    0.115  0.264  0.057  0.169  0.296   0.230  0.089  0.144   0.083   0.345
XMedia           3D→Text      0.048  0.104  0.135    0.109  0.198  0.062  0.161  0.183   0.165  0.089  0.147   0.075   0.275
XMedia           3D→Video     0.097  0.123  0.101    0.186  0.199  0.097  0.147  0.242   0.151  0.165  0.166   0.126   0.276
XMedia           3D→Audio     0.080  0.153  0.214    0.147  0.255  0.085  0.259  0.318   0.185  0.101  0.205   0.106   0.329
NUS-WIDE         Image→Text   0.279  0.250  0.365    0.251  0.368  0.330  0.404  0.451   0.466  0.388  0.345   0.345   0.525
NUS-WIDE         Text→Image   0.237  0.227  0.347    0.247  0.327  0.309  0.332  0.434   0.383  0.362  0.317   0.308   0.477
Pascal VOC 2007  Image→Text   0.091  0.134  0.254    0.213  0.300  0.236  0.323  0.172   0.397  0.398  0.381   0.341   0.423
Pascal VOC 2007  Text→Image   0.095  0.123  0.201    0.196  0.265  0.249  0.266  0.170   0.263  0.325  0.303   0.261   0.326
Average                       0.103  0.131  0.189    0.158  0.242  0.139  0.220  0.238   0.234  0.157  0.188   0.161   0.303

TABLE V: The DCG@25 scores on the Clickture dataset.

Dataset    BITR   CCA    CFA    DCMIT
Clickture  0.474  0.484  0.486  0.492

We also conduct experiments on the Wikipedia, XMedia and Clickture datasets with BoW features for texts and CNN features for images and videos, to show the performance with different features. We use the 4,096-dimensional CNN features extracted by the fc7 layer of AlexNet, and the 3,000-dimensional BoW text features. Due to the page limitation, we only present the average of all MAP scores for the multi-modality and bi-modality cross-media retrieval tasks on the Wikipedia dataset in Table VI; the detailed results, along with results on other datasets, can be found on our website: http://www.icst.pku.edu.cn/mipl/XMedia. Table VI shows that features have significant impacts on the retrieval accuracy. Generally speaking, CNN features significantly improve the performance of most compared methods, while the performance of BoW features is not stable.
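
For readers who wish to reproduce comparable CNN features, the following is a hedged sketch of extracting 4,096-dimensional fc7 activations from a pretrained AlexNet. It assumes a recent torchvision release (accepting weights="DEFAULT") and is not the exact feature extractor used in our experiments.

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

alexnet = models.alexnet(weights="DEFAULT").eval()
# Keep the classifier layers up to (and including) the second 4,096-unit Linear layer (fc7).
fc7 = nn.Sequential(*list(alexnet.classifier.children())[:5])

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc7_feature(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    conv = torch.flatten(alexnet.avgpool(alexnet.features(x)), 1)  # 9,216-d convolutional output
    return fc7(conv).squeeze(0)                                    # 4,096-d fc7 feature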

VIII. Challenges and Open Issues

Dataset Construction and Benchmark Standardization. Datasets are very important for experimental evaluation, but as discussed in Section VI, nowadays there are only a few datasets publicly available for cross-media retrieval. Existing datasets still have shortcomings in their size, the number of media types, the rationality of categories, etc. For example, the sizes of the Wikipedia and XMedia datasets are small, and the Wikipedia dataset consists of only two media types (image and text). To construct high-quality datasets, specific problems should be considered, such as: What categories should be included in the datasets? How many media types should be involved? How large should the dataset be? These questions are important for evaluation on the datasets. For instance, as discussed in Section VI, the high-level semantic categories of the Wikipedia dataset may lead to semantic overlaps and confusions, which limits the objectivity of evaluation.

To address the above problems, we are constructing a new dataset named XMediaNet, which consists of five media types (text, image, video, audio and 3D model). We select 200 categories from WordNet [102] to ensure the category hierarchy. These categories consist of two main parts: animals and artifacts. There are 48 kinds of animals such as elephant, owl, bee and frog, as well as 152 kinds of artifacts such as violin, airplane, shotgun and camera. The total number of media instances will exceed 100,000, and they are crawled from famous websites such as Wikipedia, Flickr, YouTube, Findsounds, Freesound and Yobi3D. Once the dataset is ready, we will release it on our website: http://www.icst.pku.edu.cn/mipl/XMedia. We will also provide the experimental results on widely-used datasets, and encourage researchers to submit their results for building up a continuously updated benchmark (as the website of the LFW face dataset [106] at http://vis-www.cs.umass.edu/lfw, and the website of the ImageNet dataset [107] at http://www.image-net.org). Researchers can directly adopt the experimental results to evaluate their own methods, which will help them focus on algorithm design rather than the time-consuming compared methods and results, and thus greatly facilitate the development of cross-media retrieval.

TABLE VI: Average of all MAP scores for multi-modality and bi-modality cross-media retrieval with different features on the Wikipedia dataset.

Image  Text  BITR   CCA    CCA+SMN  CFA    CMCP   HSNN   JGRHML  JRL    LGCFL  ml-CCA  mv-CCA  S2UPG
BoVW   LDA   0.205  0.254  0.282    0.257  0.324  0.313  0.329   0.335  0.288  0.277   0.281   0.360
BoVW   BoW   0.118  0.125  0.129    0.222  0.345  0.342  0.258   0.357  0.320  0.301   0.193   0.357
CNN    LDA   0.272  0.193  0.197    0.378  0.447  0.431  0.428   0.452  0.420  0.363   0.210   0.459

Improvement of Accuracy and Efficiency. Effective yet efficient methods are still required for cross-media retrieval. First, the accuracy needs to be improved. On the one hand, existing methods still have potential to be improved. For example, graph-based methods for cross-media similarity measurement may use more context information, such as link relationships, for effective graph construction. On the other hand, the discriminative power of single-media features is also important. For instance, in the experiments of Section VII, state-of-the-art methods generally adopt low-dimensional features (e.g., 128-dimensional BoVW histogram features for images and 10-dimensional LDA features for texts, as in [10], [11], [27]). As discussed in Section VII-C, when more discriminative features are adopted, such as CNN features for images, the retrieval accuracy improves. Second, efficiency is also an important factor for evaluations and applications. Cross-media retrieval datasets are still small-scale and limited in the number of media types up to now. Although there have been some hashing methods for cross-media retrieval, such as [69]–[71], the issue of efficiency has not received enough attention. In the future, with the release of our large-scale XMediaNet dataset, it will be more convenient for researchers to evaluate the efficiency of their methods, which will facilitate the development of practical applications of cross-media retrieval.
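
To illustrate why hashing improves efficiency (this is a generic sketch, not a specific method from [69]–[71]): once all media types are mapped to short binary codes in a shared Hamming space, retrieval reduces to XOR plus popcount over packed bits.

import numpy as np

def pack_codes(bits):
    # bits: (n, code_length) array of {0,1} -> packed uint8 codes.
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_rank(query_bits, database_bits):
    q = pack_codes(query_bits[None, :])
    db = pack_codes(database_bits)
    # Popcount of the XOR gives the Hamming distance to every database code.
    distances = np.unpackbits(np.bitwise_xor(db, q), axis=1).sum(axis=1)
    return np.argsort(distances)

# Example with hypothetical 64-bit codes for 10,000 items of any media type.
rng = np.random.RandomState(0)
database = rng.randint(0, 2, size=(10000, 64))
query = rng.randint(0, 2, size=64)
print(hamming_rank(query, database)[:10])   # indices of the 10 nearest codes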

Applications of Deep Neural Network. DNN is designed to simulate the neuronal structure of the human brain, which can naturally deal with the correlations of different media types, so it is worth exploiting DNN for bridging the “media gap”. Actually, there have been some attempts (such as the aforementioned methods [34], [37] in Section III-B), but they are relatively straightforward applications of DNN, which mostly take single-media features as raw inputs and perform common space learning for them by extending existing models such as autoencoders. Although DNN-based methods have achieved considerable progress on cross-media retrieval [12], there is still potential for further improvement. The applications of DNN remain research hotspots in cross-media retrieval, as is the case with single-media retrieval. On the one hand, existing methods mainly take single-media features as inputs, so they heavily depend on the effectiveness of the features. Research efforts may be devoted to designing end-to-end architectures for cross-media retrieval, which take the original media instances as inputs (e.g., the original images and audio clips) and directly produce the retrieval results with DNN. Some special networks for specific media types (e.g., R-CNN for object region detection [58]) could also be incorporated into the unified framework of cross-media retrieval, as sketched below. On the other hand, most of the existing methods are designed for only two media types. In future works, researchers could focus on jointly analyzing more than two media types, which will make the applications of DNN in cross-media retrieval more flexible and effective.
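
As a minimal illustration of the common-space DNN architectures discussed above (an illustrative sketch rather than a method from the surveyed literature), each media type gets its own encoder branch, and paired instances are aligned in the shared space with a contrastive-style objective; all dimensions and hyper-parameters below are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceNet(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=3000, common_dim=256):
        super().__init__()
        # One encoder branch per media type, mapping into a shared space.
        self.img_branch = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(),
                                        nn.Linear(1024, common_dim))
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(),
                                        nn.Linear(1024, common_dim))

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_branch(img_feat), dim=1)
        z_txt = F.normalize(self.txt_branch(txt_feat), dim=1)
        return z_img, z_txt

def pairwise_contrastive_loss(z_img, z_txt, temperature=0.1):
    # The i-th image and i-th text form a positive pair; all others are negatives.
    logits = z_img @ z_txt.t() / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example training step on a hypothetical mini-batch of paired features.
model = CommonSpaceNet()
img_batch, txt_batch = torch.randn(32, 4096), torch.randn(32, 3000)
loss = pairwise_contrastive_loss(*model(img_batch, txt_batch))
loss.backward()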

Exploitation of Context Correlation Information. The main challenge of cross-media retrieval is still the heterogeneous forms of different media types. Existing methods attempt to bridge the “media gap”, but only achieve limited improvement, and the retrieval results are not accurate when dealing with real-world cross-media data. Cross-media correlations often come with context information. For example, if an image and an audio clip are from two web pages with a link relationship, they are likely to be relevant to each other. Many existing methods (e.g., CCA, CFA and JRL) only take the co-existence relationships and semantic category labels as training information, but ignore rich context information. Actually, cross-media data on the Internet usually does not exist in isolation, and carries important context information such as link relationships. Such context information is relatively accurate, and provides important hints to improve the accuracy of cross-media retrieval. Web data is also usually divergent, so it is important to exploit context information for complex practical applications. We believe that researchers will pay more attention to rich context information to boost the performance of cross-media retrieval in future works.
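
As one possible way to exploit such context, the sketch below builds a graph over media instances whose edges encode link relationships and propagates relevance from a query with a random-walk-with-restart scheme. The graph construction here is hypothetical and is not taken from any surveyed method.

import numpy as np

def propagate_from_query(adjacency, query_index, restart=0.3, iterations=50):
    # adjacency: (n, n) nonnegative edge weights (e.g., 1 for a web-page link relationship).
    A = adjacency / np.maximum(adjacency.sum(axis=1, keepdims=True), 1e-12)
    scores = np.zeros(adjacency.shape[0])
    scores[query_index] = 1.0
    seed = scores.copy()
    for _ in range(iterations):
        # Random walk with restart: diffuse along edges, then jump back to the query.
        scores = (1 - restart) * A.T @ scores + restart * seed
    return np.argsort(-scores)   # media instances ranked by propagated relevance

# Example: 5 instances of mixed media; edges 0-1, 1-3 and 2-4 come from page links.
adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 3), (2, 4)]:
    adj[i, j] = adj[j, i] = 1.0
print(propagate_from_query(adj, query_index=0))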

Practical Applications of Cross-media Retrieval. With the constant improvement of both effectiveness and efficiency, practical applications of cross-media retrieval will become possible. These applications can provide more flexible and convenient ways to retrieve from large-scale cross-media data, and users will be willing to adopt a cross-media search engine that is capable of retrieving various media types, such as text, image, video, audio and 3D model, with one query of any media type. In addition, other possible application scenarios include enterprises that deal with cross-media data, such as TV stations, media corporations, digital libraries and publishing companies. Both the Internet and relevant enterprises will have huge requirements for cross-media retrieval.

IX. Conclusion

Cross-media retrieval is an important research topic which aims to deal with the “media gap” for performing retrieval across different media types. This paper has reviewed more than 100 references to present an overview of cross-media retrieval, to build up the evaluation benchmarks, as well as to facilitate the relevant research. Existing methods have been introduced, mainly including the common space learning and cross-media similarity measurement methods. Common space learning methods explicitly learn a common space for different media types to perform retrieval, while cross-media similarity measurement methods directly measure cross-media similarities without a common space. The widely-used cross-media retrieval datasets have also been introduced, including the Wikipedia, XMedia, NUS-WIDE, Pascal VOC 2007 and Clickture datasets. Among these datasets, XMedia, which we have constructed, is the first dataset with five media types for comprehensive and fair evaluation. We are further constructing a new dataset, XMediaNet, with five media types and more than 100,000 instances. The cross-media benchmarks, such as datasets, compared methods, evaluation metrics and experimental results, have been given, and we have established a continuously updated website to present them. Based on the discussed aspects, the main challenges and open issues for future works have also been presented. We hope these could attract more researchers to focus on cross-media retrieval, and promote the relevant research and applications.

References

[1] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), vol. 2, no. 1, pp. 1–19, 2006.
[2] S. Clinchant, J. Ah-Pine, and G. Csurka, “Semantic combination of textual and visual information in multimedia retrieval,” in ACM International Conference on Multimedia Retrieval (ICMR), no. 44, 2011.
[3] Y. Liu, W.-L. Zhao, C.-W. Ngo, C.-S. Xu, and H.-Q. Lu, “Coherent bag-of audio words model for efficient large-scale video copy detection,” in ACM International Conference on Image and Video Retrieval (CIVR), 2010, pp. 89–96.
[4] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang, “Ranking with local regression and global alignment for cross media retrieval,” in ACM International Conference on Multimedia (ACM MM), 2009, pp. 175–184.
[5] X. Zhai, Y. Peng, and J. Xiao, “Effective heterogeneous similarity measure with nearest neighbors for cross-media retrieval,” in International Conference on MultiMedia Modeling (MMM), 2012, pp. 312–322.
[6] X. Zhai, Y. Peng, and J. Xiao, “Cross-modality correlation propagation for cross-media retrieval,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 2337–2340.
[7] X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for cross-media retrieval,” in AAAI Conference on Artificial Intelligence (AAAI), 2013, pp. 1198–1204.
[8] D. Ma, X. Zhai, and Y. Peng, “Cross-media retrieval by cluster-based correlation analysis,” in IEEE International Conference on Image Processing (ICIP), 2013, pp. 3986–3990.
[9] X. Zhai, Y. Peng, and J. Xiao, “Cross-media retrieval by intra-media and inter-media correlation mining,” Multimedia Systems, vol. 19, no. 5, pp. 395–406, 2013.
[10] X. Zhai, Y. Peng, and J. Xiao, “Learning cross-media joint representation with sparse and semi-supervised regularization,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 24, no. 6, pp. 965–978, 2014.
[11] Y. Peng, X. Zhai, Y. Zhao, and X. Huang, “Semi-supervised cross-media feature learning with unified patch graph regularization,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 26, no. 3, pp. 583–596, 2016.
[12] Y. Peng, X. Huang, and J. Qi, “Cross-media shared representation by hierarchical learning with multiple deep networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 3846–3853.
[13] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2003, pp. 119–126.
[14] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” 2014, arXiv:1412.6632.
[15] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156–3164.
[16] J. Tang, X. Shu, Z. Li, G. Qi, and J. Wang, “Generalized deep transfer networks for knowledge propagation in heterogeneous domains,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), vol. 12, no. 4s, pp. 68:1–68:22, 2016.
[17] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 22, no. 10, pp. 1345–1359, 2010.
[18] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
[19] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[20] Y. Verma and C. V. Jawahar, “Im2text and text2im: Associating images and texts for cross-modal retrieval,” in British Machine Vision Conference (BMVC), 2014.
[21] B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Associating neural word embeddings with deep image representations using Fisher vectors,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4437–4446.
[22] S. Akaho, “A kernel method for canonical correlation analysis,” 2006, arXiv: cs/0609071.
[23] N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, “Cluster canonical correlation analysis,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2014, pp. 823–831.
[24] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning (ICML), 2010, pp. 3408–3415.
[25] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision (IJCV), vol. 106, no. 2, pp. 210–233, 2014.
[26] V. Ranjan, N. Rasiwasia, and C. V. Jawahar, “Multi-label cross-modal retrieval,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4094–4102.
[27] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in ACM International Conference on Multimedia (ACM MM), 2010, pp. 251–260.
[28] J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in cross-modal multimedia retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 36, no. 3, pp. 521–535, 2014.
[29] A. Sharma, A. Kumar, H. Daume III, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2160–2167.
[30] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia content processing through cross-modal association,” in ACM International Conference on Multimedia (ACM MM), 2003, pp. 604–611.
[31] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2121–2129.
[32] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in International Conference on Machine Learning (ICML), 2014, pp. 595–603.
[33] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning (ICML), 2011, pp. 689–696.
[34] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 2222–2230.
[35] F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3441–3450.

[36] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 7–16.
[37] H. Zhang, Y. Yang, H. Luan, S. Yang, and T.-S. Chua, “Start from scratch: Towards automatically identifying, modeling, and naming visual attributes,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 187–196.
[38] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning (ICML), 2015, pp. 1083–1092.
[39] F. Wu, X. Lu, J. Song, S. Yan, Z. M. Zhang, Y. Rui, and Y. Zhuang, “Learning of multimodal representations with random walks on the click graph,” IEEE Transactions on Image Processing (TIP), vol. 25, no. 2, pp. 630–642, 2016.
[40] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, “Cross-modal retrieval with CNN visual features: A new baseline,” IEEE Transactions on Cybernetics (TCYB), vol. 47, no. 2, pp. 449–460, 2017.
[41] Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, “Cross-modal retrieval via deep and bidirectional representation learning,” IEEE Transactions on Multimedia (TMM), vol. 18, no. 7, pp. 1363–1377, 2016.
[42] W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, “Effective multimodal retrieval based on stacked autoencoders,” in International Conference on Very Large Data Bases (VLDB), 2014, pp. 649–660.
[43] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, “Learning aligned cross-modal representations from weakly aligned data,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2940–2949.
[44] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning (ICML), 2016, pp. 1060–1069.
[45] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 217–225.
[46] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[47] I. J. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” 2017, arXiv: 1701.00160.
[48] M. Belkin, I. Matveeva, and P. Niyogi, “Regularization and semi-supervised learning on large graphs,” in Learning Theory, Springer Berlin Heidelberg, 2004, pp. 624–638.
[49] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multi-modal hashing,” IEEE Transactions on Image Processing (TIP), vol. 16, no. 2, pp. 427–439, 2014.
[50] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, no. 10, pp. 2010–2023, 2015.
[51] J. Liang, Z. Li, D. Cao, R. He, and J. Wang, “Self-paced cross-modal subspace matching,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016, pp. 569–578.
[52] N. Quadrianto and C. H. Lampert, “Learning multi-view neighborhood preserving projections,” in International Conference on Machine Learning (ICML), 2011, pp. 425–432.
[53] B. McFee and G. Lanckriet, “Metric learning to rank,” in International Conference on Machine Learning (ICML), 2010, pp. 775–782.
[54] W. Wu, J. Xu, and H. Li, “Learning similarity function between objects in heterogeneous spaces,” Microsoft Research Technical Report, 2010.
[55] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger, “Learning to rank with (a lot of) word features,” Information Retrieval, vol. 13, no. 3, pp. 291–314, 2010.
[56] D. Grangier and S. Bengio, “A discriminative kernel-based approach to rank images from text queries,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 8, pp. 1371–1384, 2008.
[57] F. Wu, X. Lu, Z. Zhang, S. Yan, Y. Rui, and Y. Zhuang, “Cross-media semantic representation via bi-directional learning to rank,” in ACM International Conference on Multimedia (ACM MM), 2013, pp. 877–886.
[58] X. Jiang, F. Wu, X. Li, Z. Zhao, W. Lu, S. Tang, and Y. Zhuang, “Deep compositional cross-modal learning to rank via local-global alignment,” in ACM International Conference on Multimedia (ACM MM), 2015, pp. 69–78.
[59] F. Wu, X. Jiang, X. Li, S. Tang, W. Lu, Z. Zhang, and Y. Zhuang, “Cross-modal learning to rank via latent joint representation,” IEEE Transactions on Image Processing (TIP), vol. 24, no. 5, pp. 1497–1509, 2015.
[60] G. Monaci, P. Jost, P. Vandergheynst, B. Mailhe, S. Lesage, and R. Gribonval, “Learning multi-modal dictionaries,” IEEE Transactions on Image Processing (TIP), vol. 16, no. 9, pp. 2272–2283, 2007.
[61] Y. Jia, M. Salzmann, and T. Darrell, “Factorized latent spaces with structured sparsity,” in Advances in Neural Information Processing Systems (NIPS), 2010, pp. 982–990.
[62] F. Zhu, L. Shao, and M. Yu, “Cross-modality submodular dictionary learning for information retrieval,” in ACM International Conference on Information and Knowledge Management (CIKM), 2014, pp. 1479–1488.
[63] S. Wang, L. Zhang, Y. Liang, and Q. Pan, “Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2216–2223.
[64] Y. Zhuang, Y. Wang, F. Wu, Y. Zhang, and W. Lu, “Supervised coupled dictionary learning with group structures for multi-modal retrieval,” in AAAI Conference on Artificial Intelligence (AAAI), 2013, pp. 1070–1076.
[65] J. Wang, W. Liu, S. Kumar, and S. Chang, “Learning to hash for indexing big data - A survey,” Proceedings of the IEEE, vol. 104, no. 1, pp. 34–57, 2016.
[66] J. Tang, Z. Li, M. Wang, and R. Zhao, “Neighborhood discriminant hashing for large-scale image retrieval,” IEEE Transactions on Image Processing (TIP), vol. 24, no. 9, pp. 2827–2840, 2015.
[67] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in International Joint Conference on Artificial Intelligence (IJCAI), 2011, pp. 1360–1365.
[68] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2011, pp. 225–234.
[69] Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multimodal data,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1376–1384.
[70] Y. Hu, Z. Jin, H. Ren, D. Cai, and X. He, “Iterative multi-view hashing for cross media indexing,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 527–536.
[71] Y. Zhen and D.-Y. Yeung, “A probabilistic model for multimodal hash function learning,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2012, pp. 940–948.
[72] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2014, pp. 395–404.
[73] M. Long, Y. Cao, J. Wang, and P. S. Yu, “Composite correlation quantization for efficient multimodal retrieval,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016, pp. 579–588.
[74] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3594–3601.
[75] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis, “Predictable dual-view hashing,” in International Conference on Machine Learning (ICML), 2013, pp. 1328–1336.
[76] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2083–2090.
[77] D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao, “Parametric local multimodal hashing for cross-view similarity search,” in International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 2754–2760.
[78] Y. Zhuang, Z. Yu, W. Wang, F. Wu, S. Tang, and J. Shao, “Cross-media hashing with neural networks,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 901–904.
[79] Q. Wang, L. Si, and B. Shen, “Learning to hash on partial multi-modal data,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 3904–3910.
[80] D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodal hashing for cross-media retrieval,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 3890–3896.

[81] H. Liu, R. Ji, Y. Wu, and G. Hua, “Supervised matrix factorization for cross-modality hashing,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 1767–1773.
[82] D. Wang, P. Cui, M. Ou, and W. Zhu, “Deep multimodal hashing with orthogonal regularization,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 2291–2297.
[83] J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for cross-modal similarity search,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2014, pp. 415–424.
[84] L. Zhang, Y. Zhao, Z. Zhu, S. Wei, and X. Wu, “Mining semantically consistent patterns for cross-view data,” IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 26, pp. 2745–2758, 2014.
[85] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Transactions on Multimedia (TMM), vol. 17, no. 3, pp. 370–381, 2015.
[86] Y. Hua, S. Wang, S. Liu, Q. Huang, and A. Cai, “TINA: Cross-modal correlation learning by adaptive hierarchical semantic aggregation,” in IEEE International Conference on Data Mining (ICDM), 2014, pp. 190–199.
[87] Y. Wei, Y. Zhao, Z. Zhu, S. Wei, Y. Xiao, J. Feng, and S. Yan, “Modality-dependent cross-media retrieval,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 7, no. 4, pp. 57:1–57:13, 2016.
[88] J. H. Ham, D. D. Lee, and L. K. Saul, “Semisupervised alignment of manifolds,” in International Conference on Uncertainty in Artificial Intelligence (UAI), vol. 10, 2005, pp. 120–127.
[89] X. Mao, B. Lin, D. Cai, X. He, and J. Pei, “Parallel field alignment for cross media retrieval,” in ACM International Conference on Multimedia (ACM MM), 2013, pp. 897–906.
[90] Y. Zhuang, Y. Yang, and F. Wu, “Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval,” IEEE Transactions on Multimedia (TMM), vol. 10, no. 2, pp. 221–229, 2008.
[91] H. Tong, J. He, M. Li, C. Zhang, and W.-Y. Ma, “Graph based multi-modality learning,” in ACM International Conference on Multimedia (ACM MM), 2005, pp. 862–871.
[92] Y. Yang, F. Wu, D. Xu, Y. Zhuang, and L. Chia, “Cross-media retrieval using query dependent search methods,” Pattern Recognition, vol. 43, no. 8, pp. 2927–2936, 2010.
[93] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, “A multimedia retrieval framework based on semi-supervised ranking and relevance feedback,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 34, no. 4, pp. 723–742, 2012.
[94] Y. Zhuang, H. Shan, and F. Wu, “An approach for cross-media retrieval with cross-reference graph and PageRank,” in IEEE International Conference on Multi-Media Modelling (MMM), 2006, pp. 161–168.
[95] Y. Yang, Y. Zhuang, F. Wu, and Y. Pan, “Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval,” IEEE Transactions on Multimedia (TMM), vol. 10, no. 3, pp. 437–446, 2008.
[96] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research (JMLR), vol. 3, pp. 993–1022, 2003.
[97] D. M. Blei and M. I. Jordan, “Modeling annotated data,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2003, pp. 127–134.
[98] D. Putthividhya, H. T. Attias, and S. S. Nagarajan, “Topic regression multi-modal latent Dirichlet allocation for image annotation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3408–3415.
[99] Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2407–2414.
[100] Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multi-modal mutual topic reinforce modeling for cross-media retrieval,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 307–316.
[101] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” in ACM International Conference on Image and Video Retrieval (CIVR), 2009, no. 48.
[102] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[103] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision (IJCV), vol. 88, no. 2, pp. 303–338, 2010.
[104] X. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Wang, Y. Rui, and J. Li, “Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines,” in ACM International Conference on Multimedia (ACM MM), 2013, pp. 243–252.
[105] Y. Pan, T. Yao, X. Tian, H. Li, and C. Ngo, “Click-through-based subspace learning for image search,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 233–236.
[106] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” University of Massachusetts, Amherst, Tech. Rep. 07-49, 2007.
[107] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.

Yuxin Peng is a professor at the Institute of Computer Science and Technology (ICST), Peking University, and the chief scientist of the 863 Program (National Hi-Tech Research and Development Program of China). He received the Ph.D. degree in computer application technology from Peking University in Jul. 2003. After that, he worked as an assistant professor in ICST, Peking University. He was promoted to associate professor and professor at Peking University in Aug. 2005 and Aug. 2010, respectively. In 2006, he was authorized by the “Program for New Star in Science and Technology of Beijing” and the “Program for New Century Excellent Talents in University (NCET)”. He has published over 100 papers in refereed international journals and conference proceedings, including IJCV, TIP, TCSVT, TMM, PR, ACM MM, ICCV, CVPR, IJCAI, AAAI, etc. He led his team to participate in TRECVID (TREC Video Retrieval Evaluation) many times. In TRECVID 2009, his team won four first places on 4 sub-tasks of the High-Level Feature Extraction (HLFE) task and the Search task. In TRECVID 2012, his team gained four first places on all 4 sub-tasks of the Instance Search (INS) task and the Known-Item Search (KIS) task. In TRECVID 2014, his team gained the first place in the Interactive Instance Search task. His team also gained both first places in the INS task of TRECVID 2015 and 2016. Besides, he won the first prize of the Beijing Science and Technology Award for Technological Invention in 2016 (ranking first). He has applied for 34 patents, and obtained 15 of them. His current research interests mainly include cross-media analysis and reasoning, image and video analysis and retrieval, and computer vision.

Xin Huang received the B.S. degree in computer science and technology from Peking University in Jul. 2014. He is currently pursuing the Ph.D. degree at the Institute of Computer Science and Technology (ICST), Peking University. His research interests include cross-media analysis and reasoning, and machine learning.

Yunzhen Zhao received the B.S. degree in mathematical science from Peking University in Jul. 2014. He is currently pursuing the M.S. degree at the Institute of Computer Science and Technology (ICST), Peking University. His current research interests include cross-media retrieval and machine learning.