
Learning to Hash for Indexing Big Data - A Survey

Jun Wang, Member, IEEE, Wei Liu, Member, IEEE, Sanjiv Kumar, Member, IEEE, and Shih-Fu Chang, Fellow, IEEE

Abstract—The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, the performance of randomized hashing has been shown to be insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in the development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing deep learning models. Finally, we discuss the future directions and trends of research in this area.

Index Terms—Learning to hash, approximate nearest neighbor search, unsupervised learning, semi-supervised learning, supervised learning, deep learning.

I. Introduction

The advent of the Internet has resulted in massive information overloading in recent decades. Nowadays, the World Wide Web has over 366 million accessible websites, containing more than 1 trillion webpages¹. For instance, Twitter receives over 100 million tweets per day, and Yahoo! exchanges over 3 billion messages per day. Besides the overwhelming textual data, the photo sharing website Flickr has more than 5 billion images available, where images are still being uploaded at the rate of over 3,000

Jun Wang is with the Institute of Data Science and Technology at Alibaba Group, Seattle, WA 98101, USA. Phone: +1 917 945–2619, e-mail: [email protected].

Wei Liu is with IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA. Phone: +1 917 945–1274, e-mail: [email protected].

Sanjiv Kumar is with Google Research, New York, NY 10011, USA. Phone: +1 212 865–2214, e-mail: [email protected].

Shih-Fu Chang is with the Departments of Electrical Engineering and Computer Science, Columbia University, New York, NY 10027, USA. Phone: +1 212 854–6894, fax: +1 212 932–9421, e-mail: [email protected].

¹ The number of webpages is estimated based on the number of indexed links by Google in 2008.

images per minute. Another rich media sharing website, YouTube, receives more than 100 hours of video uploaded per minute. Due to the dramatic increase in the size of the data, modern information technology infrastructure has to deal with such gigantic databases. In fact, compared to the cost of storage, searching for relevant content in massive databases turns out to be an even more challenging task. In particular, searching for rich media data, such as audio, images, and videos, remains a major challenge since there exist major gaps between available solutions and practical needs in both accuracy and computational costs. Besides the widely used text-based commercial search engines such as Google and Bing, content-based image retrieval (CBIR) has attracted substantial attention in the past decade [1]. Instead of relying on textual keyword based indexing structures, CBIR requires efficiently indexing media content in order to directly respond to visual queries.

Searching for similar data samples in a given database essentially relates to the fundamental problem of nearest neighbor search [2]. Exhaustively comparing a query point $\mathbf{q}$ with each sample in a database $\mathcal{X}$ is infeasible because the linear time complexity $O(|\mathcal{X}|)$ tends to be expensive in realistic large-scale settings. Besides the scalability issue, most practical large-scale applications also suffer from the curse of dimensionality [3], since data under modern analytics usually contains thousands or even tens of thousands of dimensions, e.g., in documents and images. Therefore, beyond the infeasibility of the computational cost for exhaustive search, the storage constraint originating from loading original data into memory also becomes a critical bottleneck. Note that retrieving a set of Approximate Nearest Neighbors (ANNs) is often sufficient for many practical applications. Hence, a fast and effective indexing method achieving sublinear ($o(|\mathcal{X}|)$), logarithmic ($O(\log |\mathcal{X}|)$), or even constant ($O(1)$) query time is desired for ANN search. Tree-based indexing approaches, such as the KD tree [4], ball tree [5], metric tree [6], and vantage point tree [7], have been popular during the past several decades. However, tree-based approaches require significant storage costs (sometimes more than the data itself). In addition, the performance of tree-based indexing methods dramatically degrades when handling high-dimensional data [8]. More recently, product quantization techniques have been proposed to encode high-dimensional data vectors via subspace decomposition for efficient ANN search [9][10].

Unlike the recursive partitioning used by tree-based indexing methods, hashing methods repeatedly partition the entire dataset and derive a single hash 'bit'² from each partitioning.

² Depending on the type of the hash function used, each hash may return either an integer or simply a binary bit. In this survey we primarily focus on binary hashing techniques as they are used most commonly due to their computational and storage efficiency.


In binary partitioning based hashing, input data is mapped to a discrete code space called Hamming space, where each sample is represented by a binary code. Specifically, given $N$ $D$-dimensional vectors $\mathbf{X} \in \mathbb{R}^{D \times N}$, the goal of hashing is to derive suitable $K$-bit binary codes $\mathbf{Y} \in \mathbb{B}^{K \times N}$. To generate $\mathbf{Y}$, $K$ binary hash functions $\{h_k : \mathbb{R}^D \mapsto \mathbb{B}\}_{k=1}^{K}$ are needed. Note that hashing-based ANN search techniques can lead to substantially reduced storage as they usually store only compact binary codes. For instance, 80 million tiny images ($32 \times 32$ pixels, double type) cost around 600 GB [11], but can be compressed into 64-bit binary codes requiring only 600 MB! In many cases, hash codes are organized into a hash table for inverse table lookup, as shown in Figure 1. One advantage of hashing-based indexing is that hash table lookup takes only constant query time. In fact, in many cases, the alternative of finding the nearest neighbors in the code space by explicitly computing the Hamming distance to all the database items can also be done very efficiently.

Hashing methods have been intensively studied and widely used in many different fields, including computer graphics, computational geometry, telecommunication, and computer vision, for several decades [12]. Among these methods, the randomized scheme of Locality-Sensitive Hashing (LSH) is one of the most popular choices [13]. A key ingredient in the LSH family of techniques is a hash function that, with high probability, returns the same bit for nearby data points in the original metric space. LSH provides interesting asymptotic theoretical properties leading to performance guarantees. However, LSH based randomized techniques suffer from several crucial drawbacks. First, to achieve the desired search precision, LSH often needs to use long hash codes, which reduces the recall. Multiple hash tables are used to alleviate this issue, but this dramatically increases the storage cost as well as the query time. Second, the theoretical guarantees of LSH only apply to certain metrics such as $\ell_p$ ($p \in (0,2]$) and Jaccard [14]. However, returning ANNs in such metric spaces may not lead to good search performance when semantic similarity is represented in a complex way instead of a simple distance or similarity metric. This discrepancy between semantic and metric spaces has been recognized in the computer vision and machine learning communities, namely as the semantic gap [15].

To tackle the aforementioned issues, many hashing methods have been proposed recently that leverage machine learning techniques to produce more effective hash codes [16]. The goal of learning to hash is to learn data-dependent and task-specific hash functions that yield compact binary codes achieving good search accuracy [17]. In order to achieve this goal, sophisticated machine learning tools and algorithms have been adapted to the procedure of hash function design, including the boosting algorithm [18], distance metric learning [19], asymmetric binary embedding [20], kernel methods [21][22], compressed sensing [23], maximum margin learning [24], sequential learning [25], clustering analysis [26], semi-supervised learning [14], supervised learning [27][22], graph learning [28], and so on. For instance, in the specific application of image search, the similarity (or distance) between image pairs is usually not defined via a simple metric. Ideally, one would like to provide pairs of images that contain "similar" or "dissimilar" images. From such pairwise labeled information, a good hashing mechanism should be able to generate hash codes which preserve the semantic consistency, i.e., semantically similar images should have similar codes. Both the supervised and semi-supervised learning paradigms have been explored using such pairwise semantic relationships to learn semantically relevant hash functions [11][29][30][31]. In this paper, we survey important representative hashing approaches and also discuss future research directions.

The remainder of this article is organized as follows. In Section II, we present necessary background information, prior randomized hashing methods, and the motivations for studying hashing. Section III gives a high-level overview of emerging learning-based hashing methods. In Section IV, we survey several popular methods that fall into the learning to hash framework. In addition, Section V describes the recent development of using neural networks to perform deep learning of hash codes. Section VI discusses advanced hashing techniques and large-scale applications of hashing. Several open issues and future directions are described in Section VII.

II. Notations and Background

In this section, we will first present the notations, as summarized in Table I. Then we will briefly introduce the conceptual paradigm of hashing-based ANN search. Finally, we will present some background information on hashing methods, including the introduction of two well-known randomized hashing techniques.

A. Notations

Given a sample point $\mathbf{x} \in \mathbb{R}^D$, one can employ a set of hash functions $\mathcal{H} = \{h_1, \cdots, h_K\}$ to compute a $K$-bit binary code $\mathbf{y} = \{y_1, \cdots, y_K\}$ for $\mathbf{x}$ as

$$\mathbf{y} = \{h_1(\mathbf{x}), \cdots, h_k(\mathbf{x}), \cdots, h_K(\mathbf{x})\}, \quad (1)$$

where the $k$th bit is computed as $y_k = h_k(\mathbf{x})$. The hash function performs the mapping $h_k : \mathbb{R}^D \to \mathbb{B}$. Such a binary encoding process can also be viewed as mapping the original data point to a binary valued space, namely Hamming space:

$$\mathcal{H} : \mathbf{x} \to \{h_1(\mathbf{x}), \cdots, h_K(\mathbf{x})\}. \quad (2)$$

Given a set of hash functions, we can map all the items in the database $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N} \in \mathbb{R}^{D \times N}$ to the corresponding binary codes as

$$\mathbf{Y} = \mathcal{H}(\mathbf{X}) = \{h_1(\mathbf{X}), h_2(\mathbf{X}), \cdots, h_K(\mathbf{X})\},$$

where the hash codes of the data $\mathbf{X}$ are $\mathbf{Y} \in \mathbb{B}^{K \times N}$.
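To make the mapping of Eqs. (1)-(2) concrete, the following minimal sketch (illustrative only; the random linear projections, dimensions, and names are assumptions of this example, not part of the survey) generates $K$ hash functions of the simplest linear form and maps a data matrix to its binary codes.

```python
import numpy as np

def generate_hash_functions(D, K, seed=0):
    # sample K random linear hash functions h_k(x) = sgn(w_k^T x)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((K, D))      # row k holds the projection w_k

def hash_codes(W, X):
    # map columns of X (D x N) to K-bit codes in {-1, +1}, Eqs. (1)-(2)
    return np.where(W @ X >= 0, 1, -1)

# toy example: N = 1000 points in D = 64 dimensions, K = 16 bits
X = np.random.randn(64, 1000)
W = generate_hash_functions(D=64, K=16)
Y = hash_codes(W, X)                        # Y has shape (16, 1000)
```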


TABLE I
Summary of Notations

Symbol | Definition
$N$ | number of data points
$D$ | dimensionality of data points
$K$ | number of hash bits
$i, j$ | indices of data points
$k$ | index of a hash function
$\mathbf{x}_i \in \mathbb{R}^D, \mathbf{x}_j \in \mathbb{R}^D$ | the $i$th and $j$th data points
$S_i, S_j$ | the $i$th and $j$th sets
$\mathbf{q}_i \in \mathbb{R}^D$ | a query point
$\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ | data matrix with points as columns
$\mathbf{y}_i \in \{1,-1\}^K$ or $\mathbf{y}_i \in \{0,1\}^K$ | hash code of data point $\mathbf{x}_i$
$\mathbf{y}_{k\cdot} \in \mathbb{B}^{N \times 1}$ | the $k$th hash bit of the $N$ data points
$\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_N] \in \mathbb{B}^{K \times N}$ | hash codes of data $\mathbf{X}$
$\theta_{ij}$ | angle between data points $\mathbf{x}_i$ and $\mathbf{x}_j$
$h_k : \mathbb{R}^D \to \{1,-1\}$ | the $k$th hash function
$\mathcal{H} = [h_1, \cdots, h_K] : \mathbb{R}^D \to \{1,-1\}^K$ | $K$ hash functions
$J(S_i, S_j)$ | Jaccard similarity between sets $S_i$ and $S_j$
$J(\mathbf{x}_i, \mathbf{x}_j)$ | Jaccard similarity between vectors $\mathbf{x}_i$ and $\mathbf{x}_j$
$d_H(\mathbf{y}_i, \mathbf{y}_j)$ | Hamming distance between $\mathbf{y}_i$ and $\mathbf{y}_j$
$d_{WH}(\mathbf{y}_i, \mathbf{y}_j)$ | weighted Hamming distance between $\mathbf{y}_i$ and $\mathbf{y}_j$
$(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}$ | a pair of similar points
$(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}$ | a pair of dissimilar points
$(\mathbf{q}_i, \mathbf{x}_i^{+}, \mathbf{x}_i^{-})$ | a ranking triplet
$S_{ij} = \mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j)$ | similarity between data points $\mathbf{x}_i$ and $\mathbf{x}_j$
$\mathbf{S} \in \mathbb{R}^{N \times N}$ | similarity matrix of data $\mathbf{X}$
$P_{\mathbf{w}}$ | a hyperplane with normal vector $\mathbf{w}$

After computing the binary codes, one can perform ANN search in Hamming space with significantly reduced computation. The Hamming distance between two binary codes $\mathbf{y}_i$ and $\mathbf{y}_j$ is defined as

$$d_H(\mathbf{y}_i, \mathbf{y}_j) = |\mathbf{y}_i - \mathbf{y}_j| = \sum_{k=1}^{K} |h_k(\mathbf{x}_i) - h_k(\mathbf{x}_j)|, \quad (3)$$

where $\mathbf{y}_i = [h_1(\mathbf{x}_i), \cdots, h_k(\mathbf{x}_i), \cdots, h_K(\mathbf{x}_i)]$ and $\mathbf{y}_j = [h_1(\mathbf{x}_j), \cdots, h_k(\mathbf{x}_j), \cdots, h_K(\mathbf{x}_j)]$. Note that the Hamming distance can be calculated efficiently as a bitwise logic operation. Thus, even conducting exhaustive search in the Hamming space can be significantly faster than doing the same in the original space. Furthermore, by designing a certain indexing structure, ANN search with hashing methods can be made even more efficient. Below we describe the pipeline of a typical hashing-based ANN search system.

B. Pipeline of Hashing-based ANN Search

There are three basic steps in ANN search using hashing techniques: designing hash functions, generating hash codes and indexing the database items, and online querying using hash codes. These steps are described in detail below.

B.1 Designing Hash Functions

There exist a number of ways of designing hash functions. Randomized hashing approaches often use random projections or permutations. The emerging learning to hash framework exploits the data distribution and often various levels of supervised information to determine optimal parameters of the hash functions. The supervised information includes pointwise labels, pairwise relationships, and ranking orders. Due to their efficiency, the most commonly used hash functions take the form of a generalized linear projection:

$$h_k(\mathbf{x}) = \mathrm{sgn}\left(f(\mathbf{w}_k^\top \mathbf{x} + b_k)\right). \quad (4)$$

Here $f(\cdot)$ is a prespecified function which can possibly be nonlinear. The parameters to be determined are $\{\mathbf{w}_k, b_k\}_{k=1}^{K}$, representing the projection vector $\mathbf{w}_k$ and the corresponding intercept $b_k$. During the training procedure, the data $\mathbf{X}$, sometimes along with supervised information, is used to estimate these parameters. In addition, different choices of $f(\cdot)$ yield different properties of the hash functions, leading to a wide range of hashing approaches. For example, LSH keeps $f(\cdot)$ as an identity function, while shift-invariant kernel-based hashing and spectral hashing choose $f(\cdot)$ to be a shifted cosine or sinusoidal function [32][33]. Note that the hash functions given by (4) generate codes $h_k(\mathbf{x}) \in \{-1, 1\}$.


Fig. 1: An illustration of linear projection based binary hashing, indexing, and hash table construction for fast inverse lookup.

One can easily convert these codes into binary codes taking values in $\{0,1\}$ as

$$y_k = \frac{1}{2}\left(1 + h_k(\mathbf{x})\right). \quad (5)$$

Without loss of generality, in this survey we will use the term hash codes to refer to either the $\{0,1\}$ or the $\{-1,1\}$ form, which should be clear from the context.

B.2 Indexing Using Hash Tables

With a learned hash function, one can compute the binary codes $\mathbf{Y}$ for all the items in a database. For $K$ hash functions, the codes for the entire database cost only $NK/8$ bytes. Assuming the original data is stored in double-precision floating-point format, the original storage costs $8ND$ bytes. Since massive datasets are often associated with thousands of dimensions, the computed hash codes reduce the storage cost by hundreds or even thousands of times.

In practice, the hash codes of the database are organized as an inverse lookup, resulting in a hash table or a hash map. For a set of $K$ binary hash functions, one can have at most $2^K$ entries in the hash table. Each entry, called a hash bucket, is indexed by a $K$-bit hash code. In the hash table, one keeps only those buckets that contain at least one database item. Figure 1 shows an example of using binary hash functions to index the data and construct a hash table. Thus, a hash table can be seen as an inverse-lookup table, which can return all the database items corresponding to a certain code in constant time. This procedure is key to the speedup achieved by many hashing based ANN search techniques. Since most of the $2^K$ possible buckets are typically empty, creating an inverse lookup can also be a very efficient way of storing the codes when multiple database items end up with the same code.

B.3 Online Querying with Hashing

During the querying procedure, the goal is to find the nearest database items to a given query. The query is first converted into a code using the same hash functions that mapped the database items to codes. One way to find the nearest neighbors of the query is by computing the Hamming distance between the query code and all the database codes.

Fig. 2: The procedure of inverse lookup in a hash table, where $\mathbf{q}$ is the query mapped to the 4-bit hash code "0100" and the returned approximate nearest neighbors within Hamming radius 1 are $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_n$.

Note that the Hamming distance can be rapidly computed using a logical XOR operation between binary codes:

$$d_H(\mathbf{y}_i, \mathbf{y}_j) = \mathbf{y}_i \oplus \mathbf{y}_j. \quad (6)$$

On modern computer architectures, this is achieved efficiently by running an xor instruction followed by popcount. With the computed Hamming distance between the query and each database item, one can perform an exhaustive scan to extract the approximate nearest neighbors of the query. Although this is much faster than exhaustive search in the original feature space, the time complexity is still linear. An alternative way of searching for the neighbors is to use the inverse lookup in the hash table and return the data points within a small Hamming distance $r$ of the query. Specifically, given a query point $\mathbf{q}$ and its corresponding hash code $\mathbf{y}_q = \mathcal{H}(\mathbf{q})$, all the database points $\mathbf{y}$ whose hash codes fall within the Hamming ball of radius $r$ centered at $\mathbf{y}_q$, i.e., $d_H(\mathbf{y}, \mathcal{H}(\mathbf{q})) \leq r$, are returned. As shown in Figure 2, for a $K$-bit binary code, a total of $\sum_{l=0}^{r} \binom{K}{l}$ possible codes lie within Hamming radius $r$. Thus one needs to search $O(K^r)$ buckets in the hash table. The union of all the items falling into the corresponding hash buckets is returned as the search result. The inverse lookup in a hash table has constant time complexity independent of the database size $N$. In practice, a small value of $r$ ($r = 1, 2$ is commonly used) is chosen to avoid the exponential growth in the number of possible code combinations that need to be searched.

C. Randomized Hashing Methods

Randomized hashing, e.g., the locality sensitive hash family, has been a popular choice due to its simplicity. In addition, it has interesting proximity preserving properties. A binary hash function $h(\cdot)$ from the LSH family is chosen such that the probability of two points having the same bit is proportional to their (normalized) similarity, i.e.,

$$P\{h(\mathbf{x}_i) = h(\mathbf{x}_j)\} = \mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j). \quad (7)$$

Here $\mathrm{sim}(\cdot,\cdot)$ represents the similarity between a pair of points in the input space, e.g., cosine similarity or Jaccard similarity [34]. In this section, we briefly review two categories of randomized hashing methods, i.e., random projection based and random permutation based approaches.


Fig. 3: An illustration of the random hyperplane partitioning based hashing method.

C.1 Random Projection Based Hashing

As a representative member of the LSH family, random projection based hash (RPH) functions have been widely used in different applications. The key ingredient of RPH is to map nearby points in the original space to the same hash bucket with high probability. This equivalently preserves the locality of the original space in the Hamming space. Typical examples of RPH functions consist of a random projection $\mathbf{w}$ and a random shift $b$:

$$h_k(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}_k^\top \mathbf{x} + b_k). \quad (8)$$

The random vector $\mathbf{w}$ is constructed by sampling each of its components randomly from a standard Gaussian distribution for the cosine distance [34]. It is easy to show that the collision probability, i.e., the probability of two samples $\mathbf{x}_i, \mathbf{x}_j$ falling into the same hash bucket, is determined by the angle $\theta_{ij}$ between these two sample vectors, as shown in Figure 3. One can show that

$$\Pr\left[h_k(\mathbf{x}_i) = h_k(\mathbf{x}_j)\right] = 1 - \frac{\theta_{ij}}{\pi} = 1 - \frac{1}{\pi}\cos^{-1}\frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\|\|\mathbf{x}_j\|}. \quad (9)$$

The above collision probability gives the asymptotic theoretical guarantees for approximating the cosine similarity defined in the original space. However, long hash codes are required to achieve sufficient discrimination for high precision. This significantly reduces the recall if hash table based inverse lookup is used for search. In order to balance the tradeoff between precision and recall, one has to construct multiple hash tables with long hash codes, which increases both storage and computation costs. In particular, with hash codes of length $K$, it is required to construct a sufficient number of hash tables to ensure the desired performance bound [35]. Given $l$ $K$-bit tables, the collision probability is given as:

$$P\{\mathcal{H}(\mathbf{x}_i) = \mathcal{H}(\mathbf{x}_j)\} \propto l \cdot \left[1 - \frac{1}{\pi}\cos^{-1}\frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\|\|\mathbf{x}_j\|}\right]^K. \quad (10)$$

To balance search precision and recall, the length of the hash codes should be long enough to reduce false collisions (i.e., non-neighbor samples falling into the same bucket). Meanwhile, the number of hash tables $l$ should be sufficiently large to increase the recall. However, this is inefficient due to the extra storage cost and longer query time.
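As a quick empirical check of Eq. (9), one can draw many random Gaussian projections and compare the observed collision rate with $1 - \theta_{ij}/\pi$; the dimensions and sample counts below are arbitrary assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(50), rng.standard_normal(50)
theta = np.arccos(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

num_functions = 100_000
W = rng.standard_normal((num_functions, 50))        # rows are random projections w_k
collisions = np.mean(np.sign(W @ xi) == np.sign(W @ xj))

print("empirical P[h(xi) = h(xj)] :", collisions)
print("theoretical 1 - theta/pi   :", 1 - theta / np.pi)
```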

To overcome these drawbacks, many practical systems adopt various strategies to reduce the storage overhead and to improve the efficiency. For instance, a self-tuning indexing technique called LSH forest was proposed in [36], which aims at improving the performance without additional storage and query overhead. In [37][38], a technique called MultiProbe LSH was developed to reduce the number of required hash tables by intelligently probing multiple buckets in each hash table. In [39], nonlinear randomized Hadamard transforms were explored to speed up LSH based ANN search for the Euclidean distance. In [40], BayesLSH was proposed to combine Bayesian inference with LSH in a principled manner, with probabilistic guarantees on the quality of the search results in terms of accuracy and recall. However, random projection based hash functions ignore the specific properties of a given data set, and thus the generated binary codes are data-independent, which leads to less effective performance compared to the learning based methods to be discussed later.

In the machine learning and data mining communities, recent methods tend to leverage data-dependent and task-specific information to improve the efficiency of random projection based hash functions [16]. For example, incorporating kernel learning with LSH can help generalize ANN search from a standard metric space to a wide range of similarity functions [41][42]. Furthermore, metric learning has been combined with randomized LSH functions to exploit a set of pairwise similarity and dissimilarity constraints [19]. Other variants of locality sensitive hashing techniques include super-bit LSH [43], boosted LSH [18], as well as non-metric LSH [44].

C.2 Random Permutation based Hashing

Another well-known paradigm from the LSH family is min-wise independent permutation hashing (MinHash), which has been widely used for approximating the Jaccard similarity between sets or vectors. Jaccard is a popular choice for measuring the similarity between documents or images. A typical application is to index documents and then identify near-duplicate samples from a corpus of documents [45][46]. The Jaccard similarity between two sets $S_i$ and $S_j$ is defined as $J(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$. A collection of sets $\{S_i\}_{i=1}^{N}$ can be represented as a characteristic matrix $\mathbf{C} \in \mathbb{B}^{M \times N}$, where $M$ is the cardinality of the universal set $S_1 \cup \cdots \cup S_N$. Here the rows of $\mathbf{C}$ represent the elements of the universal set and the columns correspond to the sets. The element $c_{di} = 1$ indicates that the $d$th element is a member of the $i$th set, and $c_{di} = 0$ otherwise. Assume a random permutation $\pi_k(\cdot)$ that assigns the index of the $d$th element as $\pi_k(d) \in \{1, \cdots, D\}$. It is easy to see that the random permutation satisfies two properties: $\pi_k(d) \neq \pi_k(l)$ and $\Pr[\pi_k(d) > \pi_k(l)] = 0.5$. A random permutation based min-hash signature of a set $S_i$ is defined as the minimum index of the non-zero elements after performing the permutation $\pi_k$:

$$h_k(S_i) = \min_{d \in \{1, \cdots, D\},\, c_{\pi_k(d) i} = 1} \pi_k(d). \quad (11)$$

Note that such a hash function has the property that the probability of two sets having the same MinHash value equals the Jaccard similarity between them [47]:

$$\Pr[h_k(S_i) = h_k(S_j)] = J(S_i, S_j). \quad (12)$$
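A compact sketch of Eqs. (11)-(12): explicit random permutations give MinHash signatures whose agreement rate estimates the Jaccard similarity. The sets, universe size, and signature length below are arbitrary; production implementations typically replace explicit permutations with random hash functions.

```python
import numpy as np

def minhash_signature(item_set, universe_size, num_hashes=64, seed=0):
    # MinHash signature of a set: for each random permutation pi_k,
    # record the smallest permuted index among the set's elements, Eq. (11)
    rng = np.random.default_rng(seed)
    signature = []
    for _ in range(num_hashes):
        perm = rng.permutation(universe_size)
        signature.append(min(perm[d] for d in item_set))
    return np.array(signature)

S_i = {0, 3, 5, 9}
S_j = {0, 3, 7, 9, 11}
sig_i = minhash_signature(S_i, universe_size=16)   # same seed => same permutations
sig_j = minhash_signature(S_j, universe_size=16)

estimated_jaccard = np.mean(sig_i == sig_j)        # Pr[h_k(S_i) = h_k(S_j)], Eq. (12)
true_jaccard = len(S_i & S_j) / len(S_i | S_j)     # |intersection| / |union|
print(estimated_jaccard, true_jaccard)
```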

The definition of the Jaccard similarity can be extended to two vectors $\mathbf{x}_i = \{x_{i1}, \cdots, x_{id}, \cdots, x_{iD}\}$ and $\mathbf{x}_j = \{x_{j1}, \cdots, x_{jd}, \cdots, x_{jD}\}$ as

$$J(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{d=1}^{D} \min(x_{id}, x_{jd})}{\sum_{d=1}^{D} \max(x_{id}, x_{jd})}.$$

Similar min-hash functions can be defined for such vectors, and the collision probability property shown in Eq. (12) still holds [48]. Compared to the random projection based LSH family, min-hash functions generate non-binary hash values that can potentially be extended to continuous cases. In practice, the min-hash scheme has shown powerful performance for high-dimensional and sparse vectors like the bag-of-words representation of documents or feature histograms of images. In a large scale evaluation conducted by Google Inc., the min-hash approach outperforms other competing methods for the application of webpage duplicate detection [49]. In addition, the min-hash scheme has also been applied to Google news personalization [50] and near duplicate image detection [51][52]. Some recent efforts have been made to further improve the min-hash technique, including b-bit minwise hashing [53][54], the one permutation approach [55], geometric min-hashing [56], and a fast computing technique for image data [57].

III. Categories of Learning Based Hashing Methods

Among the three key steps in hashing-based ANN search, the design of improved data-dependent hash functions has been the focus of the learning to hash paradigm. Since the proposal of LSH in [58], many new hashing techniques have been developed. Note that most of the emerging hashing methods focus on improving the search performance using a single hash table. The reason is that these techniques expect to learn compact discriminative codes such that searching within a small Hamming ball of the query, or even an exhaustive scan in Hamming space, is both fast and accurate. Hence, in the following, we primarily focus on various techniques and algorithms for designing a single hash table. In particular, we provide different perspectives, such as the learning paradigm and hash function characteristics, to categorize the hashing approaches developed recently. It is worth mentioning that a few recent studies have shown that exploiting the power of multiple hash tables can sometimes generate superior performance. In order to improve precision as well as recall, Xu et al. developed multiple complementary hash tables that are sequentially learned using a boosting-style algorithm [31]. Also, in cases when the code length is not very large and the number of database points is large, exhaustive scan in Hamming space can be done much faster by using multi-table indexing, as shown by Norouzi et al. [59].

A. Data-Dependent vs. Data-Independent

Based on whether the design of hash functions requires analysis of a given dataset, there are two high-level categories of hashing techniques: data-independent and data-dependent. As one of the most popular data-independent approaches, random projection has been used widely for designing data-independent hashing techniques such as LSH and SIKH mentioned earlier. LSH is arguably the most popular hashing method and has been applied to a variety of problem domains, including information retrieval and computer vision. In both LSH and SIKH, the projection vector $\mathbf{w}$ and intercept $b$, as defined in Eq. (4), are randomly sampled from certain distributions. Although these methods have strict performance guarantees, they are less efficient since the hash functions are not specifically designed for a certain dataset or search task. Based on the random projection scheme, there have been several efforts to improve the performance of the LSH method [37][39][41].

Realizing the limitation of data-independent hashing approaches, many recent methods use data and possibly some form of supervision to design more efficient hash functions. Based on the level of supervision, the data-dependent methods can be further categorized into three subgroups, as described below.

B. Unsupervised, Supervised, and Semi-Supervised

Many emerging hashing techniques are designed by exploiting various machine learning paradigms, ranging from unsupervised and supervised to semi-supervised settings. For instance, unsupervised hashing methods attempt to integrate data properties, such as distributions and manifold structures, to design compact hash codes with improved accuracy. Representative unsupervised methods include spectral hashing [32], graph hashing [28], manifold hashing [60], iterative quantization hashing [61], kernelized locality sensitive hashing [41][21], isotropic hashing [62], angular quantization hashing [63], and spherical hashing [64]. Among these approaches, spectral hashing explores the data distribution, and graph hashing utilizes the underlying manifold structure of the data captured by a graph representation. In addition, supervised learning paradigms ranging from kernel learning to metric learning to deep learning have been exploited to learn binary codes, and many supervised hashing methods have been proposed recently [19][22][65][66][67][68]. Finally, the semi-supervised learning paradigm has been employed to design hash functions using both labeled and unlabeled data. For instance, Wang et al. proposed a regularized objective to achieve accurate yet balanced hash codes to avoid overfitting [14].


Fig. 4: An illustration of different levels of supervised information: (a) pairwise labels; (b) a triplet $\mathrm{sim}(\mathbf{q}, \mathbf{x}^{+}) > \mathrm{sim}(\mathbf{q}, \mathbf{x}^{-})$; and (c) a distance based rank list $(\mathbf{x}_4, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ for a query point $\mathbf{q}$.

In [69][70], the authors proposed to exploit metric learning and locality sensitive hashing to achieve fast similarity based search. Since the labeled data is used for deriving an optimal metric distance while the hash function design uses no supervision, the proposed hashing technique can be regarded as a semi-supervised approach.

C. Pointwise, Pairwise, Triplet-wise and Listwise

Based on the level of supervision, the supervised or semi-supervised hashing methods can be further grouped into several subcategories, including pointwise, pairwise, triplet-wise, and listwise approaches. For example, a few existing approaches utilize instance-level semantic attributes or labels to design the hash functions [71][18][72]. Additionally, learning methods based on pairwise supervision have been extensively studied, and many hashing techniques have been proposed [19][22][65][66][14][69][70][73]. As demonstrated in Figure 4(a), the pair $(\mathbf{x}_2, \mathbf{x}_3)$ contains similar points, and the other two pairs $(\mathbf{x}_1, \mathbf{x}_2)$ and $(\mathbf{x}_1, \mathbf{x}_3)$ contain dissimilar points. Such relations are considered in the learning procedure to preserve the pairwise label information in the learned Hamming space. Since the ranking information is not fully utilized, the performance of pairwise supervision based methods could be sub-optimal for nearest neighbor search. More recently, triplet ranking, which encodes the pairwise proximity comparison among three data points, has been exploited to design hash codes [74][65][75]. As shown in Figure 4(b), the point $\mathbf{x}^{+}$ is more similar to the query point $\mathbf{q}$ than the point $\mathbf{x}^{-}$. Such triplet ranking information, i.e., $\mathrm{sim}(\mathbf{q}, \mathbf{x}^{+}) > \mathrm{sim}(\mathbf{q}, \mathbf{x}^{-})$, is expected to be encoded in the learned binary hash codes. Finally, listwise information indicates the rank order of a set of points with respect to the query point. In Figure 4(c), for the query point $\mathbf{q}$, the rank list $(\mathbf{x}_4, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ shows the ranking order of their similarities to $\mathbf{q}$, where $\mathbf{x}_4$ is the nearest point and $\mathbf{x}_3$ is the farthest one. By converting rank lists to a triplet tensor matrix, listwise hashing is designed to preserve the ranking in the Hamming space [76].

Fig. 5: Comparison of hash bits generated using (a) PCA hashing and (b) spectral hashing.

D. Linear vs. Nonlinear

Based on the form of the function $f(\cdot)$ in Eq. (4), hash functions can also be categorized into two groups: linear and nonlinear. Due to their computational efficiency, linear functions tend to be more popular; they include the random projection based LSH methods. The learning based methods derive optimal projections by optimizing different types of objectives. For instance, PCA hashing performs principal component analysis on the data to derive large-variance projections [63][77][78], as shown in Figure 5(a). In the same league, supervised methods have used Linear Discriminant Analysis to design more discriminative hash codes [79][80]. Semi-supervised hashing methods estimate the projections that have minimum empirical loss on pairwise labels while partitioning the unlabeled data in a balanced way [14]. Techniques that use the variance of the projections as the underlying objective also tend to use orthogonality constraints for computational ease. However, these constraints lead to a significant drawback, since the variance of most real-world data decays rapidly, with most of the variance contained in only the top few directions. Thus, in order to generate more bits in the code, one is forced to use progressively lower-variance directions due to the orthogonality constraints. The binary codes derived from these low-variance projections tend to have significantly lower performance. Two types of solutions, based on relaxation of the orthogonality constraints or on random/learned rotation of the data, have been proposed in the literature to address these issues [14][61]. Isotropic hashing is proposed to derive projections with equal variances and has been shown to be superior to anisotropic-variance based projections [62]. Instead of performing one-shot learning, sequential projection learning derives correlated projections with the goal of correcting the errors made by previous hash bits [25]. Finally, to reduce the computational complexity of a full projection, circulant binary embedding was recently proposed to significantly speed up the encoding process using circulant convolution [81].
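A minimal sketch of the PCA hashing scheme mentioned above: center the data, project onto the top-$K$ principal directions, and binarize by sign. Zero thresholding and the toy data are assumptions of this sketch, not prescriptions from the survey.

```python
import numpy as np

def pca_hashing(X, K):
    # X: (D x N) data matrix with points as columns
    Xc = X - X.mean(axis=1, keepdims=True)           # center the data
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    W = U[:, :K]                                     # top-K principal directions
    return np.where(W.T @ Xc >= 0, 1, -1), W         # (K x N) codes and projections

X = np.random.randn(64, 2000)
codes, W = pca_hashing(X, K=8)
```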

Despite its simplicity, linear hashing often suffers from insufficient discriminative power. Thus, nonlinear methods have been developed to overcome this limitation. For instance, spectral hashing first extracts the principal projections of the data, and then partitions the projected data by a sinusoidal function (nonlinear) with a specific angular frequency. Essentially, it prefers to partition projections with large spread and small spatial frequency, so that the large-variance projections can be reused. As illustrated in Figure 5(b), the first principal component can be reused in spectral hashing to divide the data into four parts while being encoded with only one bit. In addition, shift-invariant kernel-based hashing chooses $f(\cdot)$ to be a shifted cosine function and samples the projection vector in the same way as standard LSH does [33]. Another category of nonlinear hashing techniques employs kernel functions [21][22][28][82]. Anchor graph hashing, proposed by Liu et al. [28], uses a kernel function to measure the similarity of each point to a set of anchors, resulting in nonlinear hashing. Kernelized LSH uses a sparse set of data points to compute a kernel matrix and performs random projection in the kernel space to compute binary codes [21]. Based on a similar kernel representation, Kulis and Darrell propose learning hash functions by explicitly minimizing the reconstruction error between the kernel space and the Hamming space [27]. Liu et al. apply a kernel representation but optimize the hash functions by exploiting the equivalence between optimizing the code inner products and the Hamming distances to achieve scale invariance [22].

E. Single-Shot Learning vs. Multiple-Shot Learning

For learning based hashing methods, one first formulates an objective function reflecting the desired characteristics of the hash codes. In a single-shot learning paradigm, the optimal solution is derived by optimizing the objective function in a single shot, and the $K$ hash functions are learned simultaneously. In contrast, a multiple-shot learning procedure considers a global objective, but optimizes each hash function while accounting for the bias generated by the previous hash functions. Such a procedure sequentially trains hash functions one bit at a time [83][25][84]. Multiple-shot hash function learning is often used in supervised or semi-supervised settings, since the given label information can be used to assess the quality of the hash functions learned in previous steps. For instance, sequential projection based hashing aims to incorporate the bit correlations by iteratively updating the pairwise label matrix, where higher weights are imposed on point pairs violated by the previous hash functions [25]. In the complementary projection learning approach [84], the authors present a sequential learning procedure to obtain a series of hash functions that cross the sparse data regions, as well as generate balanced hash buckets. Column generation hashing learns the best hash function during each iteration and updates the weights of the hash functions accordingly. Other interesting learning ideas include two-step learning methods which treat hash bit learning and hash function learning separately [85][86].

F. Non-Weighted vs. Weighted Hashing

Given the Hamming embedding defined in Eq. (2), traditional hashing based indexing schemes map the original data into a non-weighted Hamming space, where each bit contributes equally. Given such a mapping, the Hamming distance is calculated by counting the number of differing bits. However, it is easy to observe that different bits often behave differently [14][32]. In general, for linear projection based hashing methods, the binary code generated from a large-variance projection tends to perform better due to its superior discriminative power. Hence, to improve the discrimination among hash codes, techniques have been designed to learn a weighted Hamming embedding

$$\mathcal{H} : \mathbf{x} \to \{\alpha_1 h_1(\mathbf{x}), \cdots, \alpha_K h_K(\mathbf{x})\}, \quad (13)$$

so that the conventional Hamming distance is replaced by a weighted version

$$d_{WH}(\mathbf{y}_i, \mathbf{y}_j) = \sum_{k=1}^{K} \alpha_k |h_k(\mathbf{x}_i) - h_k(\mathbf{x}_j)|. \quad (14)$$
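A small example of the weighted Hamming distance in Eq. (14), written for $\{0,1\}$ codes so that each differing bit contributes its weight $\alpha_k$; the codes and weights below are made up purely for illustration.

```python
import numpy as np

def weighted_hamming(yi, yj, alphas):
    # weighted Hamming distance of Eq. (14): sum_k alpha_k * |h_k(x_i) - h_k(x_j)|
    return float(np.sum(alphas * np.abs(yi - yj)))

yi = np.array([1, 0, 1, 1])
yj = np.array([1, 1, 1, 0])
alphas = np.array([0.9, 0.5, 0.3, 0.1])   # per-bit weights, e.g. learned by boosting
print(weighted_hamming(yi, yj, alphas))   # 0.5 + 0.1 = 0.6
```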

One of the representative approaches is Boosted Similarity Sensitive Coding (BSSC) [18]. By learning the hash functions and the corresponding weights $\{\alpha_1, \cdots, \alpha_K\}$ jointly, the objective is to lower the collision probability of non-neighbor pairs $(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}$ while improving the collision probability of neighboring pairs $(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}$. If one treats each hash function as a decision stump, the straightforward way of learning the weights is to directly apply the adaptive boosting algorithm [87], as described in [18]. In [88], a boosting-style method called BoostMap is proposed to map data points to weighted binary vectors that can leverage both metric and semantic similarity measures. Other weighted hashing methods include designing specific bit-level weighting schemes to improve the search accuracy [74][89][90][91][92]. In addition, a recent work on designing a unified bit selection framework can be regarded as a special case of the weighted hashing approach, where the weights of the hash bits are binary [93]. Another effective hash code ranking method is query-sensitive hashing, which explores the raw features of the query sample and learns query-specific weights of the hash bits to achieve accurate $\epsilon$-nearest neighbor search [94].

IV. Methodology Review and Analysis

In this section, we focus on reviewing several representative hashing methods that explore various machine learning techniques to design data-specific indexing schemes. The techniques consist of unsupervised, semi-supervised, as well as supervised approaches, including spectral hashing, anchor graph hashing, angular quantization, binary reconstructive embedding based hashing, metric learning based hashing, semi-supervised hashing, column generation hashing, and ranking supervised hashing. Table II summarizes the surveyed hashing techniques, as well as their technical merits.

Note that this section mainly focuses on describing the intuition and formulation of each method, as well as discussing their pros and cons. The performance of each individual method highly depends on the practical setting, including the learning parameters and the dataset itself. In general, nonlinear and supervised techniques tend to generate better performance than linear and unsupervised methods, while being more computationally costly [14][19][21][22][27].


TABLE II
A Summary of the Surveyed Hashing Techniques in This Article

Method | Hash Function / Objective Function | Parameters | Learning Paradigm | Supervision
Spectral Hashing | $\mathrm{sgn}(\cos(\alpha \mathbf{w}^\top \mathbf{x}))$ | $\mathbf{w}, \alpha$ | unsupervised | NA
Anchor Graph Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x})$ | $\mathbf{w}$ | unsupervised | NA
Angular Quantization | $(\mathbf{b}, \mathbf{w}) = \arg\max \sum_i \frac{\mathbf{b}_i^\top}{\|\mathbf{b}_i\|_2}\mathbf{w}^\top \mathbf{x}_i$ | $\mathbf{b}, \mathbf{w}$ | unsupervised | NA
Binary Reconstructive Embedding | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{K}(\mathbf{x}))$ | $\mathbf{w}$ | unsupervised/supervised | pairwise distance
Metric Learning Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{G}^\top \mathbf{x})$ | $\mathbf{G}, \mathbf{w}$ | supervised | pairwise similarity
Semi-Supervised Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x})$ | $\mathbf{w}$ | semi-supervised | pairwise similarity
Column Generation Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b)$ | $\mathbf{w}$ | supervised | triplet
Listwise Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b)$ | $\mathbf{w}$ | supervised | ranking list
Circulant Binary Embedding | $\mathrm{sgn}(\mathrm{circ}(\mathbf{r}) \cdot \mathbf{x})$ | $\mathbf{r}$ | unsupervised/supervised | pairwise similarity

A. Spectral Hashing

In the formulation of spectral hashing, the desired properties include keeping neighbors in the input space as neighbors in the Hamming space, and requiring the codes to be balanced and uncorrelated [32]. Hence, the objective of spectral hashing is formulated as:

$$\min \sum_{ij} \frac{1}{2} A_{ij} \|\mathbf{y}_i - \mathbf{y}_j\|^2 = \frac{1}{2}\mathrm{tr}\left(\mathbf{Y}^\top \mathbf{L} \mathbf{Y}\right) \quad (15)$$
$$\text{subject to: } \mathbf{Y} \in \{-1,1\}^{N \times K}, \quad \mathbf{1}^\top \mathbf{y}_{k\cdot} = 0,\ k = 1, \cdots, K, \quad \mathbf{Y}^\top \mathbf{Y} = n\mathbf{I}_{K \times K},$$

where $\mathbf{A} = \{A_{ij}\}_{i,j=1}^{N}$ is a pairwise similarity matrix and the Laplacian matrix is calculated as $\mathbf{L} = \mathrm{diag}(\mathbf{A}\mathbf{1}) - \mathbf{A}$. The constraint $\mathbf{1}^\top \mathbf{y}_{k\cdot} = 0$ ensures that each hash bit $\mathbf{y}_{k\cdot}$ yields a balanced partitioning of the data, and the constraint $\mathbf{Y}^\top \mathbf{Y} = n\mathbf{I}_{K \times K}$ imposes orthogonality between hash bits to minimize the redundancy.

The direct solution of the above optimization is nontrivial even for a single bit, since it is essentially a balanced graph partition problem, which is NP-hard. The orthogonality constraints for $K$-bit balanced partitioning make the above problem even harder. Motivated by the well-known spectral graph analysis [95], the authors suggest minimizing the cost function with relaxed constraints. In particular, under the assumption of a uniform data distribution, the spectral solution can be efficiently computed using 1D Laplacian eigenfunctions [32]. The final solution of spectral hashing amounts to applying a sinusoidal function with a pre-computed angular frequency to partition the data along PCA directions. Note that the projections are computed from the data but learned in an unsupervised manner. Like most orthogonal projection based hashing methods, spectral hashing suffers from the low-quality binary codes produced by low-variance projections. Hence, a "kernel trick" is used to alleviate the degraded performance when using long hash codes [96]. Moreover, the assumption of a uniform data distribution usually hardly holds for real-world data.

B. Anchor Graph Hashing

Following a similar objective as spectral hashing, anchor graph hashing was designed to solve the problem from a different perspective, without the assumption of a uniform distribution [28]. Note that the critical bottleneck in solving Eq. (15) is the cost of building a pairwise similarity graph $\mathbf{A}$, the computation of the associated graph Laplacian, and the solution of the corresponding eigen-system, which has at least quadratic complexity. The key idea is to use a small set of $M$ ($M \ll N$) anchor points to approximate the graph structure represented by the matrix $\mathbf{A}$, such that the similarity between any pair of points can be approximated using point-to-anchor similarities [97]. In particular, the truncated point-to-anchor similarity matrix $\mathbf{Z} \in \mathbb{R}^{N \times M}$ gives the similarities between the $N$ database points and the $M$ anchor points. Thus, the approximated similarity matrix $\mathbf{A}$ can be calculated as $\mathbf{A} = \mathbf{Z}\mathbf{\Lambda}\mathbf{Z}^\top$, where $\mathbf{\Lambda} = \mathrm{diag}(\mathbf{Z}\mathbf{1})$ is the degree matrix of the anchor graph $\mathbf{Z}$. Based on such an approximation, instead of solving the eigen-system of the matrix $\mathbf{A} = \mathbf{Z}\mathbf{\Lambda}\mathbf{Z}^\top$, one can alternatively solve a much smaller eigen-system with the $M \times M$ matrix $\mathbf{\Lambda}^{1/2}\mathbf{Z}^\top\mathbf{Z}\mathbf{\Lambda}^{1/2}$. The final binary codes can be obtained by applying the sign function to the spectral embedding as

$$\mathbf{Y} = \mathrm{sgn}\left(\mathbf{Z}\mathbf{\Lambda}^{1/2}\mathbf{V}\mathbf{\Sigma}^{1/2}\right). \quad (16)$$

Here the matrices $\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_k, \cdots, \mathbf{v}_K] \in \mathbb{R}^{M \times K}$ and $\mathbf{\Sigma} = \mathrm{diag}(\sigma_1, \cdots, \sigma_k, \cdots, \sigma_K) \in \mathbb{R}^{K \times K}$ collect the eigenvector-eigenvalue pairs $\{\mathbf{v}_k, \sigma_k\}$ [28]. Figure 6 shows the two-bit partitioning of a synthetic dataset with nonlinear structure using different hashing methods, including spectral hashing, exact graph hashing, and anchor graph hashing. Note that since spectral hashing computes the two smoothest pseudo graph Laplacian eigenfunctions instead of performing a real spectral embedding, it cannot handle this type of nonlinear data structure. The exact graph hashing method first constructs an exact neighborhood graph, e.g., a kNN graph, and then performs partitioning with spectral techniques to solve the optimization problem in Eq. (15).


Fig. 6: Comparison of partitioning a two-moon dataset by the first two hash bits using different methods: (a) the first bit using spectral hashing; (b) the first bit using exact graph hashing; (c) the first bit using anchor graph hashing; (d) the second bit using spectral hashing; (e) the second bit using exact graph hashing; (f) the second bit using anchor graph hashing.

The anchor graph hashing achieves a good separation (by the first bit) of the nonlinear manifold and a balanced partitioning, and even performs better than the exact graph hashing, which loses the property of balanced partitioning for the second bit. The anchor graph hashing approach was recently further improved by leveraging a discrete optimization technique to directly solve for the binary hash codes without any relaxation [98].
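A rough sketch of the anchor graph pipeline: build truncated point-to-anchor affinities $\mathbf{Z}$, solve the small $M \times M$ eigen-system, and take signs of the resulting embedding. The kernel width, the number of nearest anchors, and the degree normalization used below follow common anchor graph hashing implementations and are assumptions of this sketch rather than a literal transcription of Eq. (16).

```python
import numpy as np

def anchor_graph_codes(X, anchors, K, s=3, sigma=1.0):
    # X: (D x N) data, anchors: (D x M); returns (K x N) codes in {-1, +1}
    N, M = X.shape[1], anchors.shape[1]
    d2 = ((X[:, :, None] - anchors[:, None, :]) ** 2).sum(axis=0)     # (N, M) squared distances
    Z = np.zeros((N, M))
    for i in range(N):
        nn = np.argsort(d2[i])[:s]                   # s nearest anchors of point i
        w = np.exp(-d2[i, nn] / (2 * sigma ** 2))
        Z[i, nn] = w / w.sum()                       # rows of Z sum to one
    lam = Z.sum(axis=0)                              # anchor degrees
    Zn = Z / np.sqrt(lam)                            # degree-normalized affinities
    sigma_vals, V = np.linalg.eigh(Zn.T @ Zn)        # small M x M eigen-system
    idx = np.argsort(sigma_vals)[::-1][1:K + 1]      # drop the trivial top eigenvector
    embedding = Zn @ V[:, idx]
    return np.where(embedding >= 0, 1, -1).T

X = np.random.randn(16, 500)
anchors = X[:, np.random.choice(500, 32, replace=False)]   # e.g. random or k-means anchors
Y = anchor_graph_codes(X, anchors, K=8)
```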

C. Angular Quantization Based Hashing

Since similarity is often measured by the cosine of the angle between pairs of samples, angular quantization has been proposed to map non-negative feature vectors onto the vertex of the binary hypercube with the smallest angle [63]. In such a setting, the vertices of the hypercube are treated as quantization landmarks that grow exponentially with the data dimensionality $D$. As shown in Figure 7, the nearest binary vertex $\mathbf{b}$ in the hypercube to a data point $\mathbf{x}$ is given by

$$\mathbf{b}^{*} = \arg\max_{\mathbf{b}} \frac{\mathbf{b}^\top \mathbf{x}}{\|\mathbf{b}\|_2} \quad \text{subject to: } \mathbf{b} \in \{0,1\}^D. \quad (17)$$

Although this is an integer programming problem, its global maximum can be found with a complexity of $O(D \log D)$. The optimal binary vertex is then used as the binary hash code of the data point, i.e., $\mathbf{y} = \mathbf{b}^{*}$. Based on this angular quantization framework, a data-dependent extension is designed to learn a rotation matrix $\mathbf{R} \in \mathbb{R}^{D \times D}$ that aligns the projected data $\mathbf{R}^\top\mathbf{x}$ to the binary vertices without changing the similarity between point pairs.

Fig. 7. Illustration of the angular quantization based hashing method [63]. The binary code of a data point x is assigned as the nearest binary vertex in the hypercube, which is b_4 = [0 1 1]^⊤ in the illustrated example [63].

The objective is formulated as

(b_i*, R*) = argmax_{b_i, R}  Σ_i (b_i^⊤ / ‖b_i‖_2) R^⊤x_i   (18)
subject to: b_i ∈ {0,1}^D,  R^⊤R = I_{D×D}.

Note that the above formulation still generates a D-bit binary code for each data point, while compact codes are often desired in many real-world applications [14]. To generate a K-bit code, a projection matrix S ∈ R^{D×K} with orthogonal columns can be used to replace the rotation matrix R in the above objective, with additional normalization, as discussed in [63]. Finally, the optimal binary codes and the projection/rotation matrix are learned using an alternating optimization scheme.
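The O(D log D) solution of Eq. (17) mentioned above exploits the fact that, for a non-negative vector x and a fixed number k of ones, the best vertex places its ones on the k largest coordinates, so only D candidates need to be scored. The following sketch (with an illustrative function name, not code from [63]) implements that idea:

```python
import numpy as np

def nearest_binary_vertex(x):
    """Solve Eq. (17): argmax_b  b^T x / ||b||_2  over b in {0,1}^D,
    for a non-negative vector x.  Runs in O(D log D) due to the sort."""
    D = x.shape[0]
    order = np.argsort(-x)                   # coordinate indices, largest value first
    prefix = np.cumsum(x[order])             # prefix[k-1] = sum of the k largest entries
    scores = prefix / np.sqrt(np.arange(1, D + 1))   # objective for vertices with k ones
    k_best = int(np.argmax(scores)) + 1      # optimal number of ones
    b = np.zeros(D, dtype=np.int8)
    b[order[:k_best]] = 1
    return b
```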

D. Binary Reconstructive Embedding

Instead of using data-independent random projections as in LSH or principal components as in SH, Kulis and Darrell [27] proposed data-dependent and bit-correlated hash functions of the form

h_k(x) = sgn( Σ_{q=1}^{s} W_{kq} κ(x_{kq}, x) ),   (19)

where the sample set {x_{kq}}, q = 1, ..., s, is the training data for learning the hash function h_k, κ(·,·) is a kernel function, and W is a weight matrix.

Based on the above formulation, a method called Binary Reconstructive Embedding (BRE) was designed to minimize a cost function measuring the difference between the original metric distance and the reconstructed distance in the Hamming space. The Euclidean metric d_M and the binary reconstruction distance d_R are defined as:

d_M(x_i, x_j) = (1/2) ‖x_i − x_j‖^2,   (20)
d_R(x_i, x_j) = (1/K) Σ_{k=1}^{K} ( h_k(x_i) − h_k(x_j) )^2.


Fig. 8. Illustration of the hashing method based on metric learning. The left shows the partitioning using the standard LSH method and the right shows the partitioning of the metric learning based LSH method (modified from the original figure in [19]).

The objective is to minimize the following reconstruction error to derive the optimal W:

W* = argmin_W  Σ_{(x_i, x_j)∈N} [ d_M(x_i, x_j) − d_R(x_i, x_j) ]^2,   (21)

where the set of sample pairs N is the training data. Optimizing the above objective function is difficult due to the non-differentiability of the sgn(·) function. Instead, a coordinate-descent algorithm was applied to iteratively update the hash functions to a local optimum. This hashing method can easily be extended to a supervised scenario by setting pairs with the same label to have zero distance and pairs with different labels to have a large distance. However, since the binary reconstruction distance d_R is bounded in [0, 1] while the metric distance d_M has no upper bound, the minimization problem in Eq. (21) is only meaningful when the input data is appropriately normalized. In practice, the original data point x is often mapped to a hypersphere with unit length so that 0 ≤ d_M ≤ 1. This normalization removes the scale of the data points, which is often not negligible in practical applications of nearest neighbor search. In addition, the Hamming distance based objective is hard to optimize due to its nonconvex and nonsmooth properties. Hence, Liu et al. proposed to utilize the equivalence between code inner products and Hamming distances to design supervised and kernel-based hash functions [22]. The objective is to make the inner products of the hash codes consistent with the given pairwise supervision. This strategy of optimizing the hash code inner product in KSH, rather than the Hamming distance as done in BRE, pays off nicely and leads to major performance gains in similarity-based retrieval, as consistently confirmed in extensive experiments reported in [22] and recent studies [99].
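As an illustration of how d_M and d_R are compared in BRE, the following hedged sketch evaluates the reconstruction objective of Eq. (21) for a given weight matrix; the coordinate-descent updates of [27] are not reproduced, and the names (pairs, anchors, kernel) are assumptions made for the sketch.

```python
import numpy as np

def bre_loss(X, pairs, W, anchors, kernel):
    """Evaluate Eq. (21) for hash functions h_k(x) = sgn(sum_q W[k, q] * kappa(x_kq, x)).

    X: (N, D) data, pairs: list of (i, j) index pairs forming the training set N,
    W: (K, s) weights, anchors: (s, D) sampled training points, kernel: callable."""
    Kmat = kernel(X, anchors)                 # (N, s) kernel values kappa(x, x_q)
    H = (Kmat @ W.T > 0).astype(float)        # codes in {0, 1}, so d_R in Eq. (20) stays in [0, 1]
    K_bits = W.shape[0]
    loss = 0.0
    for i, j in pairs:
        d_m = 0.5 * np.sum((X[i] - X[j]) ** 2)            # Eq. (20): metric distance
        d_r = np.sum((H[i] - H[j]) ** 2) / K_bits         # Eq. (20): reconstruction distance
        loss += (d_m - d_r) ** 2
    return loss

# e.g. an RBF kernel for the sketch:
# kernel = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / 2.0)
```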

E. Metric Learning based Hashing

The key idea of metric learning based hashing methods is to learn a parameterized Mahalanobis metric using pairwise label information. Such learned metrics are then employed in standard random projection based hash functions [19]. The goal is to preserve the pairwise relationship in the binary code space, where similar data pairs are more likely to collide in the same hash bucket and dissimilar pairs are less likely to share the same hash codes, as illustrated in Figure 8.

Fig. 9. Illustration of one-bit partitioning of different linear projection based hashing methods: (a) unsupervised hashing; (b) supervised hashing; and (c) semi-supervised hashing. The similar point pairs are indicated by the green rectangles and the dissimilar point pairs by the red triangles.

The parameterized inner product is defined as

sim(x_i, x_j) = x_i^⊤ M x_j,

where M is a positive-definite d×d matrix to be learned from the labeled data. Note that this similarity measure corresponds to the parameterized squared Mahalanobis distance d_M. Assume that M can be factorized as M = G^⊤G. Then the parameterized squared Mahalanobis distance can be written as

d_M(x_i, x_j) = (x_i − x_j)^⊤ M (x_i − x_j) = (Gx_i − Gx_j)^⊤(Gx_i − Gx_j).   (22)

Based on the above equation, the distance d_M(x_i, x_j) can be interpreted as the Euclidean distance between the projected data points Gx_i and Gx_j. Note that the matrix M can be learned through various metric learning methods such as information-theoretic metric learning [100]. To accommodate the learned distance metric, the randomized hash function is given as

h_k(x) = sgn(w_k^⊤ G x).   (23)

It is easy to see that the above hash function generates hash codes which preserve the parameterized similarity measure in the Hamming space. Figure 8 demonstrates the difference between standard random projection based LSH and metric learning based LSH, where it is easy to see that the learned metric helps assign the same hash bit to similar sample pairs. Accordingly, the collision probability is given as

Pr[ h_k(x_i) = h_k(x_j) ] = 1 − (1/π) cos^{-1}( x_i^⊤G^⊤Gx_j / (‖Gx_i‖‖Gx_j‖) ).   (24)

Realizing that pairwise constraints often become available incrementally, Jain et al. developed an efficient online locality-sensitive hashing method with gradually learned distance metrics [70].
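A minimal sketch of generating the metric-aware hash bits of Eq. (23) is given below, assuming a learned positive-definite matrix M is already available (e.g., from information-theoretic metric learning [100]); G is recovered from the factorization M = G^⊤G via a Cholesky decomposition. Function and variable names are illustrative.

```python
import numpy as np

def metric_lsh_codes(X, M, K, rng=np.random.default_rng(0)):
    """Hash bits h_k(x) = sgn(w_k^T G x) with M = G^T G (Eqs. 22-23).

    X: (N, d) data, M: (d, d) learned positive-definite metric, K: number of bits."""
    L = np.linalg.cholesky(M)          # M = L L^T, so setting G = L^T gives G^T G = M
    G = L.T
    d = X.shape[1]
    W = rng.standard_normal((K, d))    # random Gaussian hyperplanes w_k
    projected = X @ G.T                # each row is (G x)^T
    return np.sign(projected @ W.T)    # (N, K) codes in {-1, +1}
```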


F. Semi-Supervised Hashing

Supervised hashing techniques have been shown to be superior to unsupervised approaches since they leverage supervision information to design task-specific hash codes. However, in a typical large-scale setting, the human annotation process is often costly and the labels can be noisy and sparse, which can easily lead to overfitting.

Considering a small set of pairwise labels and a large amount of unlabeled data, semi-supervised hashing aims at designing hash functions with minimum empirical loss while maintaining maximum entropy over the entire dataset. Specifically, assume the pairwise labels are given as two types of sets, M and C. A pair of data points (x_i, x_j) ∈ M indicates that x_i and x_j are similar, and (x_i, x_j) ∈ C means that x_i and x_j are dissimilar. The empirical accuracy on the labeled data for a family of hash functions H = [h_1, ..., h_K] is then given as

J(H) = Σ_k [ Σ_{(x_i, x_j)∈M} h_k(x_i)h_k(x_j) − Σ_{(x_i, x_j)∈C} h_k(x_i)h_k(x_j) ].   (25)

If we define a matrix S ∈ R^{l×l} incorporating the pairwise labeled information from X_l as

S_ij =  1   if (x_i, x_j) ∈ M,
       −1   if (x_i, x_j) ∈ C,
        0   otherwise,   (26)

the above empirical accuracy can be written in a compact matrix form after dropping the sgn(·) function:

J(H) = (1/2) tr( W^⊤X_l S X_l^⊤W ).   (27)

However, only considering the empirical accuracy during the design of the hash functions can lead to undesired results. As illustrated in Figure 9(b), although such a hash bit partitions the data with zero error over the pairwise labeled data, it results in an imbalanced separation of the unlabeled data and is thus less informative. Therefore, an information theoretic regularization is suggested to maximize the entropy of each hash bit. After relaxation, the final objective is formed as

W* = argmax_W  (1/2) tr( W^⊤X_l S X_l^⊤W ) + (η/2) tr( W^⊤XX^⊤W ),   (28)

where the first term represents the empirical accuracy and the second term encourages partitioning along projection directions with large variance. The coefficient η weighs the contribution of these two components. The above objective can be solved using various optimization strategies, resulting in orthogonal or correlated binary codes, as described in [14][25]. Figure 9 compares one-bit linear partitions obtained with different learning paradigms, where the semi-supervised method tends to produce a balanced yet accurate data separation. Finally, Xu et al. employ a similar semi-supervised formulation to sequentially learn multiple complementary hash tables to further improve the performance [31].
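Under the relaxation discussed above, one common way to obtain W in Eq. (28) is via the eigenvectors of the "adjusted covariance" matrix X_l S X_l^⊤ + η X X^⊤. The sketch below shows this orthogonal-code variant only; it is a simplified reading of [14], not its full sequential or correlated-code solutions, and the parameter name eta is illustrative.

```python
import numpy as np

def ssh_projections(X, X_l, S, K, eta=1.0):
    """Relaxed solution of Eq. (28): columns of W are the top-K eigenvectors
    of  X_l S X_l^T + eta * X X^T  (orthogonal-code variant of SSH [14]).

    X: (d, n) all data as columns, X_l: (d, l) labeled data, S: (l, l) label matrix."""
    M_adj = X_l @ S @ X_l.T + eta * (X @ X.T)
    M_adj = 0.5 * (M_adj + M_adj.T)           # symmetrize for numerical safety
    vals, vecs = np.linalg.eigh(M_adj)
    W = vecs[:, np.argsort(vals)[::-1][:K]]   # top-K eigenvectors
    return W                                   # hash codes: y = sgn(W^T x)
```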

G. Column Generation Hashing

Beyond pairwise relationships, complex supervision such as ranking triplets and ranking lists has been exploited to learn hash functions with a ranking-preserving property. In many real applications such as image retrieval and recommendation systems, it is often easier to obtain relative comparisons than instance-wise or pairwise labels. In general, such relative comparison information is given in triplet form. Formally, a set of triplets is represented as

E = { (q_i, x_i^+, x_i^−) | sim(q_i, x_i^+) > sim(q_i, x_i^−) },

where the function sim(·) could be an unknown similarity measure. Hence, the triplet (q_i, x_i^+, x_i^−) indicates that the sample point x_i^+ is more semantically similar, or closer, to the query point q_i than the point x_i^−, as demonstrated in Figure 4(b).

As one of the representative methods falling into this category, column generation hashing explores the large-margin framework to leverage this type of proximity comparison information to design weighted hash functions [74]. In particular, the relative comparison sim(q_i, x_i^+) > sim(q_i, x_i^−) is to be preserved in a weighted Hamming space as d_WH(q_i, x_i^+) < d_WH(q_i, x_i^−), where d_WH is the weighted Hamming distance defined in Eq. 14. To impose a large margin, the constraint d_WH(q_i, x_i^+) < d_WH(q_i, x_i^−) should be satisfied as well as possible. Thus, a typical large-margin objective with ℓ_1 norm regularization can be formulated as

argmin_{w, ξ}  Σ_{i=1}^{|E|} ξ_i + C‖w‖_1   (29)
subject to: w ⪰ 0, ξ ⪰ 0;
d_WH(q_i, x_i^−) − d_WH(q_i, x_i^+) ≥ 1 − ξ_i, ∀i,

where w is the vector of nonnegative bit weights used in the weighted Hamming distance. To solve the above optimization problem, the authors proposed using a column generation technique to learn the hash functions and the associated bit weights iteratively. In each iteration, the best hash function is generated and the weight vector is updated accordingly. In addition, different loss functions with other regularization terms such as ℓ_∞ are also suggested as alternatives in the above formulation.
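To illustrate the quantities appearing in Eq. (29), the following sketch computes the weighted Hamming distances and the slack (hinge) values implied by the triplet margin constraints for a fixed set of hash codes and bit weights; the actual column generation loop of [74], which adds one hash function per iteration and re-solves for w, is not reproduced, and the function names are illustrative.

```python
import numpy as np

def weighted_hamming(w, code_a, code_b):
    """Weighted Hamming distance d_WH (cf. Eq. 14) for codes in {-1, +1}."""
    return np.sum(w * (code_a != code_b))

def triplet_hinge_losses(w, codes_q, codes_pos, codes_neg):
    """Slack values xi_i implied by the margin constraints of Eq. (29):
    d_WH(q, x^-) - d_WH(q, x^+) >= 1 - xi_i."""
    losses = []
    for q, p, n in zip(codes_q, codes_pos, codes_neg):
        margin = weighted_hamming(w, q, n) - weighted_hamming(w, q, p)
        losses.append(max(0.0, 1.0 - margin))
    return np.array(losses)
```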

H. Ranking Supervised Hashing

Different from other methods that explore the triplet relationship [74][75][65], the ranking supervised hashing method attempts to preserve the ranking order of a set of database points with respect to the query point [76]. Assume that the training dataset X = {x_n} has N points with x_n ∈ R^D. In addition, a query set is given as Q = {q_m}, with q_m ∈ R^D, m = 1, ..., M. For any specific query point q_m, we can derive a ranking list over X, which can be written as a vector r(q_m, X) = (r_1^m, ..., r_n^m, ..., r_N^m). Each element r_n^m falls into the integer range [1, N], and no two elements share the same value in the exact ranking case.


Fig. 10. The conceptual diagram of the Rank Supervised Hashing method. The top left component demonstrates the procedure of deriving the ground truth ranking list r using the semantic relevance or feature similarity/distance, and then converting it to a triplet matrix S(q) for a given query q. The bottom left component describes the estimation of the relaxed ranking triplet matrix from the binary hash codes. The right component shows the objective of minimizing the inconsistency between the two ranking triplet matrices.

If r_i^m < r_j^m (i, j = 1, ..., N), it indicates that sample x_i has higher rank than x_j, which means x_i is more relevant or similar to q_m than x_j. To represent such a discrete ranking list, a ranking triplet matrix S ∈ R^{N×N} is defined as

S(q_m; x_i, x_j) =  1   if r_i^m < r_j^m,
                   −1   if r_i^m > r_j^m,
                    0   if r_i^m = r_j^m.   (30)

Hence, for a set of query points Q = {q_m}, we can derive a triplet tensor, i.e., a set of triplet matrices

S = { S(q_m) } ∈ R^{M×N×N}.

In particular, the elements of the triplet tensor are defined as S_{mij} = S(q_m)(i, j) = S(q_m; x_i, x_j), m = 1, ..., M, i, j = 1, ..., N. The objective is to preserve the ranking lists in the mapped Hamming space. In other words, if S(q_m; x_i, x_j) = 1, we tend to ensure d_H(q_m, x_i) < d_H(q_m, x_j), and otherwise d_H(q_m, x_i) > d_H(q_m, x_j). Assuming the hash codes take values in {−1, 1}, such a ranking order is equivalent to a similarity measurement using the inner products of the binary codes, i.e.,

d_H(q_m, x_i) < d_H(q_m, x_j)  ⟺  H(q_m)^⊤H(x_i) > H(q_m)^⊤H(x_j).

Then the empirical loss function L_H over the ranking lists can be represented as

L_H = − Σ_m Σ_{i,j} H(q_m)^⊤ [ H(x_i) − H(x_j) ] S_{mij}.

Assuming linear hash functions are used, the final objective is formed as the following constrained quadratic problem

W* = argmin_W L_H = argmax_W tr(WW^⊤B)   (31)
s.t. W^⊤W = I,

where the constant matrix B is computed as B = Σ_m p_m q_m^⊤ with p_m = Σ_{i,j} [x_i − x_j] S_{mij}. The orthogonality constraint is utilized to minimize the redundancy between different hash bits. Intuitively, the above formulation aims to preserve the ranking lists in the Hamming space, as shown in the conceptual diagram in Figure 10. The augmented Lagrangian multiplier method was introduced to derive feasible solutions for the above constrained problem, as discussed in [76].
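A simple way to see the structure of Eq. (31) is the spectral relaxation below: it assembles the constant matrix B from the triplet tensor and takes the top-K eigenvectors of its symmetric part as an orthogonal W. This is only a sketch under that relaxation; the original work [76] instead solves the constrained problem with the augmented Lagrangian method, and the function name is illustrative.

```python
import numpy as np

def rank_hash_projections(X, Q, S, K):
    """Spectral sketch for Eq. (31).

    X: (N, D) database points, Q: (M, D) query points, S: (M, N, N) ranking triplet tensor."""
    M, N, _ = S.shape
    D = X.shape[1]
    B = np.zeros((D, D))
    for m in range(M):
        # p_m = sum_{i,j} (x_i - x_j) S_mij ; coefficient of x_i is (row sum - column sum)
        weights = S[m].sum(axis=1) - S[m].sum(axis=0)
        p_m = X.T @ weights
        B += np.outer(p_m, Q[m])                  # B = sum_m p_m q_m^T
    B_sym = 0.5 * (B + B.T)
    vals, vecs = np.linalg.eigh(B_sym)
    W = vecs[:, np.argsort(vals)[::-1][:K]]       # orthogonal W maximizing tr(W^T B W)
    return W                                       # hash codes: sgn(W^T x)
```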

I. Circulant Binary Embedding

Realizing that most current hashing techniques rely on linear projections, which can suffer from very high computational and storage costs for high-dimensional data, circulant binary embedding was recently developed to handle this challenge using circulant projections [81]. Briefly, given a vector r = (r_0, ..., r_{d−1}), we can generate its corresponding circulant matrix R = circ(r) [101]. The binary embedding with the circulant projection is then defined as

h(x) = sgn(Rx) = sgn(circ(r)·x).   (32)

Since the circulant projection circ(r)x is equivalent to the circular convolution r ⊛ x, the linear projection can be computed using the fast Fourier transform as

circ(r)x = r ⊛ x = F^{-1}( F(r) ∘ F(x) ).   (33)

Thus, the time complexity is reduced from O(d^2) to O(d log d). Finally, one can either randomly select the circulant vector r or design a specific one using supervised learning methods.
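The FFT computation in Eq. (33) is straightforward to sketch in NumPy; the snippet below follows Eqs. (32)-(33) directly and omits the random sign-flipping and learning components used in the full method of [81].

```python
import numpy as np

def circulant_binary_embedding(x, r):
    """h(x) = sgn(circ(r) x) computed via FFT (Eqs. 32-33), in O(d log d) time."""
    projection = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real   # circ(r) x = r (*) x
    return np.where(projection >= 0, 1, -1)

# usage sketch:
# d = 512; rng = np.random.default_rng(0)
# x, r = rng.standard_normal(d), rng.standard_normal(d)
# code = circulant_binary_embedding(x, r)
```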

V. Deep Learning for Hashing

During the past decade (since around 2006), Deep Learning [102], also known as Deep Neural Networks, has drawn increasing attention and research effort in a variety of artificial intelligence areas including speech recognition, computer vision, machine learning, text mining, etc. Since one main purpose of deep learning is to learn robust and powerful feature representations for complex data, it is very natural to leverage deep learning for exploring compact hash codes, which can be regarded as binary representations of data. In this section, we briefly introduce several recently proposed hashing methods that employ deep learning. In Table III, we compare eight deep learning based hashing methods in terms of four key characteristics that can be used to differentiate the approaches.

The earliest work in deep learning based hashing may be Semantic Hashing [103]. This method builds a deep generative model to discover hidden binary units (i.e., latent topic features) which can model the input text data (i.e., word-count vectors).


Such a deep model is built as a stack of Restricted Boltzmann Machines (RBMs) [104]. After learning a multi-layer RBM through pre-training and fine-tuning on a collection of documents, the hash code of any document is acquired by simply thresholding the output of the deepest layer. Such hash codes provided by the deep RBM were shown to preserve the semantically similar relationships among input documents in the code space, in which each hash code (or hash key) is used as a memory address to locate the corresponding documents. In this way, semantically similar documents are mapped to adjacent memory addresses, thereby enabling efficient search via hash table lookup. To enhance the performance of deep RBMs, a supervised version was proposed in [66], which borrows the idea of nonlinear Neighbourhood Component Analysis (NCA) embedding [105]. The supervised information stems from given neighbor/nonneighbor relationships between training examples. Then, the objective function of NCA is optimized on top of a deep RBM, making the deep RBM yield discriminative hash codes. Note that supervised deep RBMs can be applied to broad data domains other than text data. In [66], supervised deep RBMs using a Gaussian distribution to model the visible units in the first layer were successfully applied to handle massive image data.

A recent work named Sparse Similarity-Preserving Hashing [99] tried to address the low recall issue pertaining to relatively long hash codes, which affects most previous hashing techniques. The idea is to enforce sparsity in the hash codes learned from training examples with pairwise supervised information, that is, similar and dissimilar pairs of examples (also known as side information in the machine learning literature). The relaxed hash functions, actually nonlinear embedding functions, are learned by training a Tailored Feed-Forward Neural Network. Within this architecture, two ISTA-type networks [110] that share the same set of parameters and conduct fast approximations of sparse coding are coupled in the training phase. Since each output of the neural network is continuous albeit sparse, a hyperbolic tangent function is applied to the output followed by a thresholding operation, leading to the final binary hash code. In [99], an extension to hashing multimodal data, e.g., web images with textual tags, was also presented.

Another work named Deep Hashing [106] developed a deep neural network to learn a hierarchy of multiple nonlinear transformations that map original images to compact binary hash codes, and hence supports large-scale image retrieval with the learned binary image representation. The deep hashing model is established under three constraints which are imposed on the top layer of the deep neural network: 1) the reconstruction error between an original real-valued image feature vector and the resulting binary code is minimized, 2) each bit of the binary codes is balanced, and 3) all bits are independent of each other. Similar constraints have been adopted in prior unsupervised hashing or binary coding methods such as Iterative Quantization (ITQ) [111]. A supervised version called Supervised Deep Hashing³ was also presented in [106], where a discriminative term incorporating pairwise supervised information is added to the objective function of the deep hashing model. The authors of [106] showed the superiority of the supervised deep hashing model over its unsupervised counterpart. Both of them produce hash codes by thresholding the output of the top layer of the neural network, where all activation functions are hyperbolic tangent functions.

cluding Sparse Similarity-Preserving Hashing, Deep Hash-ing and Supervised Deep Hashing, did not include a pre-training stage during the training of the deep neural net-works. Instead, the hash codes are learned from scratchusing a set of training data. However, the absence ofpre-training may make the generated hash codes less effec-tive. Specifically, the Sparse Similarity-Preserving Hashingmethod is found to be inferior to the state-of-the-art su-pervised hashing method Kernel-Based Supervised Hash-ing (KSH) [22] in terms of search accuracy on some im-age datasets [99]; the Deep Hashing method and its su-pervised version are slightly better than ITQ and its su-pervised version CCA+ITQ, respectively [111][106]. Notethat KSH, ITQ and CCA+ITQ exploit relatively shallowlearning frameworks.Almost all existing hashing techniques including the

aforementioned ones relying on deep neural networks takea vector of hand-crafted visual features extracted from animage as input. Therefore, the quality of produced hashcodes heavily depends on the quality of hand-crafted fea-tures. To remove this barrier, a recent method called Con-

volutional Neural Network Hashing [107] was developed tointegrate image feature learning and hash value learninginto a joint learning model. Given pairwise supervised in-formation, this model consists of a stage of learning ap-proximate hash codes and a stage of training a deep Con-

volutional Neural Network (CNN) [112] that outputs con-tinuous hash values. Such hash values can be generatedby activation functions like sigmoid, hyperbolic tangentor softmax, and then quantized into binary hash codesthrough appropriate thresholding. Thanks to the power ofCNNs, the joint model is capable of simultaneously learn-ing image features and hash values, directly working onraw image pixels. The deployed CNN is composed of threeconvolution-pooling layers that involve rectified linear ac-tivation, max pooling, and local contrast normalization, astandard fully-connected layer, and an output layer withsoftmax activation functions.Also based on CNNs, a latest method called as Deep Se-

mantic Ranking Hashing [108] was presented to learn hashvalues such that multilevel semantic similarities amongmulti-labeled images are preserved. Like the Convolu-tional Neural Network Hashing method, this method takesimage pixels as input and trains a deep CNN, by whichimage feature representations and hash values are jointlylearned. The deployed CNN consists of five convolution-

³ It is essentially semi-supervised as abundant unlabeled examples are used for training the deep neural network.


TABLE III
The characteristics of eight recently proposed deep learning based hashing methods.

Deep Learning based Hashing Method          | Data Domain    | Learning Paradigm | Learning Features? | Hierarchy of Deep Neural Networks
Semantic Hashing [103]                      | text           | unsupervised      | no                 | 4
Restricted Boltzmann Machine [66]           | text and image | supervised        | no                 | 4 and 5
Tailored Feed-Forward Neural Network [99]   | text and image | supervised        | no                 | 6
Deep Hashing [106]                          | image          | unsupervised      | no                 | 3
Supervised Deep Hashing [106]               | image          | supervised        | no                 | 3
Convolutional Neural Network Hashing [107]  | image          | supervised        | yes                | 5
Deep Semantic Ranking Hashing [108]         | image          | supervised        | yes                | 8
Deep Neural Network Hashing [109]           | image          | supervised        | yes                | 10

The deployed CNN consists of five convolution-pooling layers, two fully-connected layers, and a hash layer (i.e., the output layer). The key hash layer is connected to both fully-connected layers, and its function expression is

h(x) = 2σ( w^⊤[f_1(x); f_2(x)] ) − 1,

in which x represents an input image, h(x) represents the vector of hash values for image x, f_1(x) and f_2(x) respectively denote the feature representations from the outputs of the first and second fully-connected layers, w is the weight vector, and σ(·) is the logistic function. The Deep Semantic Ranking Hashing method leverages listwise supervised information to train the CNN, which stems from a collection of image triplets that encode the multilevel similarities, i.e., the first image in each triplet is more similar to the second one than to the third one. The hash code of image x is finally obtained by thresholding the output h(x) of the hash layer at zero.
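As a small illustration of the hash layer just described, the sketch below computes h(x) = 2σ(w^⊤[f_1(x); f_2(x)]) − 1 and thresholds it at zero, assuming the fully-connected activations f_1 and f_2 have already been extracted from the CNN; treating w as a matrix with one row per hash bit is an assumption of this sketch, not a detail stated in [108].

```python
import numpy as np

def dsrh_hash_layer(f1, f2, w):
    """Relaxed hash values h(x) = 2*sigmoid(w [f1; f2]) - 1, thresholded at zero.

    f1: (d1,) and f2: (d2,) fully-connected activations, w: (K, d1+d2) hash-layer weights."""
    f = np.concatenate([f1, f2])                   # [f1(x); f2(x)]
    h = 2.0 / (1.0 + np.exp(-(w @ f))) - 1.0       # relaxed hash values in (-1, 1)
    code = np.where(h >= 0, 1, 0)                  # final binary code (could also use +/-1)
    return code, h
```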

The above Convolutional Neural Network Hashing method [107] requires separately learning approximate hash codes to guide the subsequent learning of the image representation and finer hash values. A more recent method called Deep Neural Network Hashing [109] goes further, in that the image representation and hash values are learned in one stage, so that representation learning and hash learning are tightly coupled and benefit each other. Similar to the Deep Semantic Ranking Hashing method [108], the Deep Neural Network Hashing method incorporates listwise supervised information to train a deep CNN, giving rise to the currently deepest architecture for supervised hashing. The pipeline of the deep hashing architecture includes three building blocks: 1) a triplet of images (the first image is more similar to the second one than to the third one) which are fed to the CNN, and upon which a triplet ranking loss is designed to characterize the listwise supervised information; 2) a shared sub-network with a stack of eight convolution layers to generate the intermediate image features; and 3) a divide-and-encode module that divides the intermediate image features into multiple channels, each of which is encoded into a single hash bit. Within the divide-and-encode module, there are one fully-connected layer and one hash layer. The former uses sigmoid activation, while the latter uses a piecewise thresholding scheme to produce nearly discrete hash values. Eventually, the hash code of any image is yielded by thresholding the output of the hash layer at 0.5. In [109], the Deep Neural Network Hashing method was shown to surpass the Convolutional Neural Network Hashing method as well as several shallow learning based supervised hashing methods in terms of image search accuracy.

Finally, a few observations about the deep learning based hashing methods introduced in this section are worth mentioning.
1. The majority of these methods did not report the time of hash code generation. In real-world search scenarios, the speed of generating hashes should be substantially fast. There might be concern about the hashing speed of the deep neural network driven approaches, especially those involving image feature learning, which may take much longer to hash an image than shallow learning driven approaches like ITQ and KSH.
2. Instead of employing deep neural networks to seek hash codes, another interesting problem is to design a proper hashing technique to accelerate deep neural network training or save memory space. The latest work [113] presented a hashing trick named HashedNets, which shrinks the storage costs of neural networks significantly while mostly preserving the generalization performance in image classification tasks.

VI. Advanced Methods and Related Applications

In this section, we further extend the survey scope to cover a few more advanced hashing methods that are developed for specific settings and applications, such as point-to-hyperplane hashing, subspace hashing, and multimodality hashing.

A. Hyperplane Hashing

Distinct from the previously surveyed conventional hashing techniques, all of which address the problem of fast point-to-point nearest neighbor search (see Figure 12(a)), a new scenario, “point-to-hyperplane” hashing, emerges to tackle fast point-to-hyperplane nearest neighbor search (see Figure 12(b)), where the query is a hyperplane instead of a data point.

⁴ The illustration figure is from http://vision.cs.utexas.edu/projects/activehash/


Fig. 11. Active learning framework with a hashing-based fast query selection strategy.⁴

Such a new scenario requires hashing the hyperplane query to nearby database points, which is difficult to accomplish because point-to-hyperplane distances are quite different from routine point-to-point distances in terms of the computation mechanism. Despite the bulk of research on point-to-point hashing, this special hashing paradigm is rarely touched. For convenience, we refer to point-to-hyperplane hashing as Hyperplane Hashing.

Hyperplane hashing is actually fairly important for many machine learning applications such as large-scale active learning with SVMs [114]. In SVM-based active learning [115], the well proven sample selection strategy is to search the unlabeled sample pool to identify the sample closest to the current hyperplane decision boundary, thus providing the most useful information for improving the learning model. When scaling such active learning to gigantic databases, exhaustive search for the point nearest to the hyperplane is not efficient enough for the online sample selection requirement. Hence, novel hashing methods that can principally handle hyperplane queries are called for. A conceptual diagram of using hyperplane hashing to scale up the active learning process is shown in Figure 11.

We demonstrate the geometric relationship between a data point x and a hyperplane P_w with normal vector w in Figure 13(a). Given a hyperplane query P_w and a set of points X, the target nearest neighbor is

x* = argmin_{x∈X} D(x, P_w),

where D(x, P_w) = |w^⊤x| / ‖w‖ is the point-to-hyperplane distance. The existing hyperplane hashing methods [116][117] all attempt to minimize a slightly modified “distance” |w^⊤x| / (‖w‖‖x‖), i.e., the sine of the point-to-hyperplane angle α_{x,w} = |θ_{x,w} − π/2|. Note that θ_{x,w} ∈ [0, π] is the angle between x and w. The angle measure α_{x,w} ∈ [0, π/2] between a database point and a hyperplane query is thus what is reflected in the design of the hash functions.

Fig. 12. Two distinct nearest neighbor search problems. (a) Point-to-point search: the blue solid circle represents a point query and the red circle represents the found nearest neighbor point. (b) Point-to-hyperplane search: the blue plane denotes a hyperplane query P_w with w being its normal vector, and the red circle denotes the found nearest neighbor point.

As shown in Figure 13(b), the goal of hyperplane hashing is to hash a hyperplane query P_w and the desired neighbors (e.g., x_1, x_2) with narrow α_{x,w} into the same or nearby hash buckets, while avoiding returning the undesired nonneighbors (e.g., x_3, x_4) with wide α_{x,w}. Because α_{x,w} = |θ_{x,w} − π/2|, the point-to-hyperplane search problem can be equivalently transformed into a specific point-to-point search problem in which the query is the hyperplane normal w and the desired nearest neighbor to the raw query P_w is the one whose angle θ_{x,w} from w is closest to π/2, i.e., the point most closely perpendicular to w (we write “perpendicular to w” as ⊥ w for brevity). This is very different from traditional point-to-point nearest neighbor search, which returns the point most similar to the query point. In the following, several existing hyperplane hashing methods are briefly discussed.

Jain et al. [116] devised two different families of randomized hash functions to attack the hyperplane hashing problem. The first one is the Angle-Hyperplane Hash (AH-Hash) family A, of which one instance function is

h_A(z) = [sgn(u^⊤z), sgn(v^⊤z)]     if z is a database point,
         [sgn(u^⊤z), sgn(−v^⊤z)]    if z is a hyperplane normal,   (34)

where z ∈ R^d represents an input vector, and u and v are both drawn independently from a standard d-variate Gaussian, i.e., u, v ∼ N(0, I_{d×d}). Note that h_A is a two-bit hash function, which leads to the following probability of collision for a hyperplane normal w and a database point x:

Pr[ h_A(w) = h_A(x) ] = 1/4 − α_{x,w}^2 / π^2.   (35)

This probability monotonically decreases as the point-to-hyperplane angle α_{x,w} increases, ensuring angle-sensitive hashing.

The second family proposed by Jain et al. is the Embedding-Hyperplane Hash (EH-Hash) function family E, of which one instance is


Fig. 13. The hyperplane hashing problem. (a) Point-to-hyperplane distance D(x, P_w) and point-to-hyperplane angle α_{x,w}. (b) Neighbors (x_1, x_2) and nonneighbors (x_3, x_4) of the hyperplane query P_w; the ideal neighbors are the points ⊥ w.

h_E(z) = sgn( U^⊤V(zz^⊤) )      if z is a database point,
         sgn( −U^⊤V(zz^⊤) )     if z is a hyperplane normal,   (36)

where V(A) returns the vectorial concatenation of matrix A, and U ∼ N(0, I_{d²×d²}). The EH hash function h_E yields hash bits in an embedded space R^{d²} resulting from vectorizing the rank-one matrices zz^⊤ and −zz^⊤. Compared with h_A, h_E gives a higher probability of collision:

Pr[ h_E(w) = h_E(x) ] = (1/π) cos^{-1}( sin^2(α_{x,w}) ),   (37)

which also bears the angle-sensitive hashing property. However, it is much more expensive to compute than AH-Hash.

More recently, Liu et al. [117] designed a randomized function family with bilinear hash functions, Bilinear-Hyperplane Hash (BH-Hash):

B = { h_B(z) = sgn(u^⊤zz^⊤v),  u, v i.i.d. ∼ N(0, I_{d×d}) }.   (38)

As a core finding, Liu et al. proved in [117] that the probability of collision for a hyperplane query P_w and a database point x under h_B is

Pr[ h_B(P_w) = h_B(x) ] = 1/2 − 2α_{x,w}^2 / π^2.   (39)

Specifically, h_B(P_w) is prescribed to be −h_B(w). Eq. (39) endows h_B with the angle-sensitive hashing property. Importantly, the collision probability given by the BH hash function h_B is always twice that of the AH hash function h_A, and is also greater than that of the EH hash function h_E. As illustrated in Figure 14, for any fixed r, BH-Hash accomplishes the highest probability of collision, which indicates that BH-Hash has a better angle-sensitive property.

Fig. 14. Comparison of the collision probabilities of the three randomized hyperplane hashing schemes (AH-Hash, EH-Hash, and BH-Hash), plotted as p_1 (probability of collision) versus r (squared point-to-hyperplane angle).

In terms of formulation, the bilinear hash function h_B is correlated with yet different from the linear hash functions h_A and h_E: (1) h_B produces a single hash bit, which is the product of the two hash bits produced by h_A; (2) h_B may be viewed algebraically as a rank-one special case of h_E, if we write u^⊤zz^⊤v = tr(zz^⊤vu^⊤) and U^⊤V(zz^⊤) = tr(zz^⊤U); (3) h_B takes a universal form, while both h_A and h_E treat a query and a database item in distinct manners. The computation time of h_B is Θ(2d), which is the same as that of h_A and one order of magnitude faster than the Θ(2d²) of h_E. Liu et al. further improved the performance of h_B by learning the bilinear projection directions u, v in h_B from the data. Gong et al. extended the bilinear formulation to the conventional point-to-point hashing scheme by designing compact binary codes for high-dimensional visual descriptors [118].
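A sketch of one BH-Hash bit (Eq. 38), together with the query-side sign flip h_B(P_w) = −h_B(w) noted above, is given below; drawing many independent (u, v) pairs yields a multi-bit code, and the function names are illustrative.

```python
import numpy as np

def bh_hash_bit(z, u, v):
    """One BH-Hash bit (Eq. 38): sgn(u^T z z^T v) = sgn((u^T z)(z^T v))."""
    return 1 if (u @ z) * (z @ v) >= 0 else -1

def bh_hash_query_bit(w, u, v):
    """For a hyperplane query P_w the bit is flipped: h_B(P_w) = -h_B(w)."""
    return -bh_hash_bit(w, u, v)

# usage sketch: a query and a database point collide on this bit when the bits match
# d = 64; rng = np.random.default_rng(0)
# u, v = rng.standard_normal(d), rng.standard_normal(d)
# x, w = rng.standard_normal(d), rng.standard_normal(d)
# collide = bh_hash_query_bit(w, u, v) == bh_hash_bit(x, u, v)
```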

B. Subspace Hashing

Beyond the aforementioned conventional hashing techniques, which tackle searching in a database of vectors, subspace hashing [119], which has rarely been explored in the literature, attempts to efficiently search through a large database of subspaces. Subspace representations are very common in many computer vision, pattern recognition, and statistical learning problems, such as subspace representations of image patches, image sets, video clips, etc. For example, face images of the same subject with fixed poses but different illuminations are often assumed to reside near linear subspaces. A common use scenario is to use a single face image to find the subspace (and the corresponding subject ID) closest to the query image [120]. Given a query in the form of a vector or a subspace, searching for the nearest subspace in a subspace database is frequently encountered in a variety of practical applications including example-based image synthesis, scene classification, speaker recognition, face recognition, and motion-based action recognition [120].

However, hashing and searching for subspaces both differ from the schemes used in traditional vector hashing and the latest hyperplane hashing. [119] presented a general framework for the problem of Approximate Nearest Subspace (ANS) search, which uniformly deals with the cases where the query is a vector or a subspace, where query and database elements are subspaces of fixed dimension, where query and database elements are subspaces of different dimension, and where database elements are subspaces of varying dimension.


The critical technique exploited by [119] is two-step: 1) a simple mapping that maps both query and database elements to “points” in a new vector space, and 2) approximate nearest neighbor search using conventional vector hashing algorithms in the new space. Consequently, the main contribution of [119] is reducing the difficult subspace hashing problem to a regular vector hashing task. [119] used LSH for the vector hashing task. While simple, the hashing technique (mapping + LSH) of [119] perhaps suffers from the high dimensionality of the constructed new vector space.

More recently, [120] exclusively addressed the point-to-subspace query, where the query is a vector and the database items are subspaces of arbitrary dimension. [120] proposed a rigorously faster hashing technique than that of [119]. Its hash function can hash D-dimensional vectors (D is the ambient dimension of the query) or D×r-dimensional subspaces (r is arbitrary) in linear time complexity O(D), which is computationally more efficient than the hash functions devised in [119]. [120] further proved the search time under the O(D) hashes to be sublinear in the database size.

Based on the nice findings of [120], we would like to achieve faster hashing for the subspace-to-subspace query by means of crafted novel hash functions that handle subspaces of varying dimension. Both theoretical and practical explorations in this direction will be beneficial to the hashing area.

C. MultiModality Hashing

Note that the majority of the hash learning methods are designed to construct the Hamming embedding for a single modality or representation. Some recent advanced methods have been proposed to design hash functions for more complex settings, such as when the data are represented by multimodal features or when the data are formed in a heterogeneous way [121]. Such hashing methods are closely related to applications in social networks, where multimodality and heterogeneity are often observed. Below we survey several representative methods that were proposed recently.

Realizing that data items like webpages can be described from multiple information sources, composite hashing was recently proposed to design a hashing scheme using several information sources [122]. Besides the intuitive way of concatenating multiple features to derive hash functions, the authors also presented an iterative weighting scheme and formulated a convex combination of multiple features. The objective is to ensure consistency between the semantic similarity and the Hamming similarity of the data. Finally, a joint optimization strategy is employed to learn the importance of individual feature types and the hash functions. Co-regularized hashing was proposed to investigate hash learning across multi-modality data in a supervised setting, where similar and dissimilar pairs of intra-modality points are given as supervision information [123]. One typical setting is to index images and text jointly to preserve the semantic relations between image and text. The authors formulate their objective as a boosted co-regularization framework whose cost component is a weighted sum of the intra-modality and inter-modality losses. The learning of the hash functions is performed via a boosting procedure so that the bias introduced by previous hash functions can be sequentially minimized. Dual-view hashing attempts to derive a hidden common Hamming embedding of data from two views, while maintaining the predictability of the binary codes [124]. A probabilistic model called multimodal latent binary embedding was recently presented to derive binary latent factors in a common Hamming space for indexing multimodal data [125]. Other closely related hashing methods include the design of multiple feature hashing for near-duplicate detection [126], submodular hashing for video indexing [127], and Probabilistic Attributed Hashing for integrating low-level features and semantic attributes [128].

D. Applications with Hashing

Indexing massive multimedia data, such as images and videos, is a natural application for learning based hashing. Especially, due to the well-known semantic gap, supervised and semi-supervised hashing methods have been extensively studied for image search and retrieval [69][29][41][129][130][131] and mobile product search [132]. Other closely related computer vision applications include image patch matching [133], image classification [118], face recognition [134][135], pose estimation [71], object tracking [136], and duplicate detection [126][137][52][51][138]. In addition, this emerging hash learning framework can be exploited for some general machine learning and data mining tasks, including cross-modality data fusion [139], large scale optimization [140], large scale classification and regression [141], collaborative filtering [142], and recommendation [143]. For indexing video sequences, a straightforward method is to independently compute binary codes for each key frame and use the set of hash codes as the video index. More recently, Ye et al. proposed a structure learning framework to derive a video hashing technique that incorporates both temporal and spatial structure information [144]. In addition, advanced hashing methods have also been developed for document search and retrieval. For instance, Wang et al. proposed to leverage both tag information and semantic topic modeling to achieve more accurate hash codes [145]. Li et al. designed a two-stage unsupervised hashing framework for fast document retrieval [146].

Hashing techniques have also been applied in the active learning framework to cope with big data applications. Without performing an exhaustive test on all the data points, hyperplane hashing can significantly speed up the interactive training sample selection procedure [114][117][116]. In addition, a two-stage hashing scheme has been developed to achieve fast query pair selection for large scale active learning to rank [147].


VII. Open Issues and Future Directions

Despite the tremendous progress in developing a large array of hashing techniques, several major issues remain open. First, unlike the locality sensitive hashing family, most of the learning based hashing techniques lack theoretical guarantees on the quality of the returned neighbors. Although several recent techniques have presented theoretical analyses of the collision probability, they are mostly based on randomized hash functions [116][117][19]. Hence, it is highly desirable to further investigate such theoretical properties. Second, compact hash codes have mostly been studied for large scale retrieval problems. Due to their compact form, hash codes also have great potential in many other large scale data modeling tasks such as efficient nonlinear kernel SVM classifiers [148] and rapid kernel approximation [149]. A bigger question is: instead of using the original data, can one directly use compact codes to do generic unsupervised or supervised learning without affecting the accuracy? To achieve this, theoretically sound and practical methods need to be devised. This will make efficient large-scale learning possible with limited resources, for instance on mobile devices. Third, most of the current hashing techniques are designed for given feature representations that tend to suffer from the semantic gap. One possible future direction is to integrate representation learning with binary code learning using advanced learning schemes such as deep neural networks. Finally, since heterogeneity has been an important characteristic of big data applications, one of the future trends will be to design efficient hashing approaches that can leverage heterogeneous features and multi-modal data to improve the overall indexing quality. Along those lines, developing new hashing techniques for composite distance measures, i.e., those based on combinations of different distances acting on different types of features, will be of great interest.

References

[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Computing Surveys, vol. 40, no. 2, pp. 1–60, 2008.

[2] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-neighbor methods in learning and vision: theory and practice. MIT Press, 2006.

[3] R. Bellman, Dynamic programming, ser. Rand Corporation research study. Princeton University Press, 1957.

[4] J. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, p. 517, 1975.

[5] S. Omohundro, “Efficient algorithms with neural network behavior,” Complex Systems, vol. 1, no. 2, pp. 273–347, 1987.

[6] J. Uhlmann, “Satisfying general proximity/similarity queries with metric trees,” Information Processing Letters, vol. 40, no. 4, pp. 175–179, 1991.

[7] P. Yianilos, “Data structures and algorithms for nearest neighbor search in general metric spaces,” in Proc. of the fourth annual ACM-SIAM Symposium on Discrete Algorithms, 1993, pp. 311–321.

[8] P. Indyk, “Nearest-neighbor searching in high dimensions,” in Handbook of discrete and computational geometry, J. E. Goodman and J. O’Rourke, Eds. Boca Raton, FL: CRC Press LLC, 2004.

[9] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.

[10] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization for approximate nearest neighbor search,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 2946–2953.

[11] A. Torralba, R. Fergus, and W. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[12] D. Knuth, Art of Computer Programming, Volume 3: Sorting and searching. Addison-Wesley Professional, 1997.

[13] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proc. of 25th International Conference on Very Large Data Bases, 1999, pp. 518–529.

[14] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for large scale search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[15] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.

[16] L. Cayton and S. Dasgupta, “A learning framework for nearest neighbor search,” in Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. Cambridge, MA: MIT Press, 2008, pp. 233–240.

[17] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer, “Compact hashing with joint optimization of search accuracy and time,” in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, pp. 753–760.

[18] G. Shakhnarovich, “Learning task-specific similarity,” Ph.D. dissertation, Massachusetts Institute of Technology, 2005.

[19] B. Kulis, P. Jain, and K. Grauman, “Fast similarity search for learned metrics,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143–2157, 2009.

[20] A. Gordo and F. Perronnin, “Asymmetric distances for binary embeddings,” in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, pp. 729–736.

[21] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[22] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2074–2081.

[23] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, “Compressed hashing,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 446–451.

[24] A. Joly and O. Buisson, “Random maximum margin hashing,” in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, pp. 873–880.

[25] J. Wang, S. Kumar, and S.-F. Chang, “Sequential projection learning for hashing with compact codes,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1127–1134.

[26] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 2938–2945.

[27] B. Kulis and T. Darrell, “Learning to Hash with Binary Reconstructive Embeddings,” in Proc. of Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, vol. 20, pp. 1042–1050.

[28] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in Proc. of ICML, Bellevue, Washington, USA, 2011.

[29] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for scalable image retrieval,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 3424–3431.

[30] M. Norouzi and D. Fleet, “Minimal loss hashing for compact binary codes,” in Proceedings of the 27th International Conference on Machine Learning, 2011.

[31] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approximate nearest neighbor search,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 1631–1638.

[32] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. of Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. MIT Press, 2008, vol. 21, pp. 1753–1760.

[33] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds. MIT Press, 2009, pp. 1509–1517.

[34] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002, pp. 380–388.

[35] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual Symposium on Computational Geometry, 2004, pp. 253–262.

[36] M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” in Proceedings of the 14th international conference on World Wide Web, Chiba, Japan, 2005, pp. 651–660.

[37] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: efficient indexing for high-dimensional similarity search,” in Proceedings of the 33rd international conference on Very large data bases, 2007, pp. 950–961.

[38] W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li, “Modeling LSH for performance tuning,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008, pp. 669–678.

[39] A. Dasgupta, R. Kumar, and T. Sarlos, “Fast locality-sensitive hashing,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 1073–1081.

[40] V. Satuluri and S. Parthasarathy, “Bayesian locality sensitive hashing for fast similarity search,” Proceedings of the VLDB Endowment, vol. 5, no. 5, pp. 430–441, 2012.

[41] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in IEEE International Conference on Computer Vision, Kyoto, Japan, 2009, pp. 2130–2137.

[42] Y. Mu, J. Shen, and S. Yan, “Weakly-supervised hashing in kernel space,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 3344–3351.

[43] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-sensitive hashing,” in Advances in Neural Information Processing Systems 25, 2012, pp. 108–116.

[44] Y. Mu and S. Yan, “Non-metric locality-sensitive hashing,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010, pp. 539–544.

[45] A. Broder, “On the resemblance and containment of documents,” in Compression and Complexity of Sequences Proceedings, 1997, pp. 21–29.

[46] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, 1998, pp. 327–336.

[47] A. Rajaraman and J. D. Ullman, Mining of massive datasets. Cambridge University Press, 2012.

[48] S. Ioffe, “Improved consistent sampling, weighted minhash and l1 sketching,” in IEEE International Conference on Data Mining. IEEE, 2010, pp. 246–255.

[49] M. Henzinger, “Finding near-duplicate web pages: a large-scale evaluation of algorithms,” in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006, pp. 284–291.

[50] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 271–280.

[51] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-hash and tf-idf weighting,” in Proceedings of the British Machine Vision Conference, vol. 810, 2008, pp. 812–815.

[52] D. C. Lee, Q. Ke, and M. Isard, “Partition min-hash for partial duplicate image discovery,” in European Conference on Computer Vision, Heraklion, Crete, Greece, 2012, pp. 648–662.

[53] P. Li and C. Konig, “b-bit minwise hashing,” in Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 671–680.

[54] P. Li, A. Konig, and W. Gui, “b-bit minwise hashing for estimating three-way similarities,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 1387–1395.

[55] P. Li, A. Owen, and C.-H. Zhang, “One permutation hashing,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 3122–3130.

[56] O. Chum, M. Perdoch, and J. Matas, “Geometric min-hashing: Finding a (thick) needle in a haystack,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, Florida, USA, 2009, pp. 17–24.

[57] O. Chum and J. Matas, “Fast computation of min-hash signatures for image collections,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 3077–3084.

[58] P. Indyk and R. Motwani, “Approximate nearest neighbors:towards removing the curse of dimensionality,” in Proc. of 30thACM Symposium on Theory of Computing, 1998, pp. 604–613.

[59] M. Norouzi, A. Punjani, and D. J. Fleet, “Fast search in ham-ming space with multi-index hashing,” in IEEE Conferenceon Computer Vision and Pattern Recognition, Providence, RI,USA, 2012, pp. 3108–3115.

[60] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang,“Inductive hashing on manifolds,” in Proceedings of the IEEEInternational Conference on Computer Vision and PatternRecognition, Portland, Oregon, USA, 2013, pp. 1562–1569.

[61] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, pp. 817–824.

[62] W. Kong and W.-J. Li, “Isotropic hashing,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1655–1663.

[63] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, “Angular quantization based binary codes for fast similarity search,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1205–1213.

[64] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, “Spherical hashing,” in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2957–2964.

[65] M. Norouzi, D. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1070–1078.

[66] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, 2008, pp. 1–8.

[67] L. Fan, “Supervised binary hash code learning with Jensen Shannon divergence,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.

[68] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, USA, 2015.

[69] P. Jain, B. Kulis, and K. Grauman, “Fast image search for learned metrics,” in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, 2008, pp. 1–8.

[70] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Advances in Neural Information Processing Systems, 2008, pp. 761–768.

[71] G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with parameter-sensitive hashing,” in IEEE International Conference on Computer Vision, Nice, France, 2003, pp. 750–757.

[72] M. Rastegari, A. Farhadi, and D. A. Forsyth, “Attribute discovery via predictable discriminative binary codes,” in European Conference on Computer Vision, Florence, Italy, 2012, pp. 876–889.

[73] M. Ou, P. Cui, F. Wang, J. Wang, and W. Zhu, “Non-transitive hashing with latent similarity components,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 895–904.

[74] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. Dick, “Learning hash functions using column generation,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 142–150.

[75] J. Wang, J. Wang, N. Yu, and S. Li, “Order preserving hashing for approximate nearest neighbor search,” in Proceedings of the 21st ACM international conference on Multimedia, 2013, pp. 133–142.

[76] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang, “Learning hash codes with listwise supervision,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.

[77] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descriptors into a compact image representation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3304–3311.

[78] J. He, S. Kumar, and S.-F. Chang, “On the difficulty of nearest neighbor search,” in Proceedings of the 29th International Conference on Machine Learning, 2012.

[79] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “LDAHash: Improved matching with smaller descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 66–78, 2012.

[80] T. Trzcinski and V. Lepetit, “Efficient discriminative projections for compact binary descriptors,” in European Conference on Computer Vision, Florence, Italy, 2012, pp. 228–242.

[81] F. Yu, S. Kumar, Y. Gong, and S.-F. Chang, “Circulant binary embedding,” in Proc. of ICML, Beijing, China, 2014, pp. 946–954.

[82] J. He, W. Liu, and S.-F. Chang, “Scalable similarity search with optimized kernel hashing,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1129–1138.

[83] R.-S. Lin, D. A. Ross, and J. Yagnik, “Spec hashing: Similarity preserving algorithm for entropy-based coding,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 848–854.

[84] Z. Jin, Y. Hu, Y. Lin, D. Zhang, S. Lin, D. Cai, and X. Li, “Complementary projection hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.

[85] G. Lin, C. Shen, A. van den Hengel, and D. Suter, “A general two-step approach to learning-based hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.

[86] D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast similarity search,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010, pp. 18–25.

[87] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Computational Learning Theory, 1995, pp. 23–37.

[88] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios, “Boostmap: An embedding method for efficient nearest neighbor retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 89–104, Jan. 2008.

[89] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, “Query-adaptive image search with hash codes,” IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 442–453, 2013.

[90] Y.-G. Jiang, J. Wang, and S.-F. Chang, “Lost in binarization: Query-adaptive ranking for similar image search with compact codes,” in Proceedings of ACM International Conference on Multimedia Retrieval, 2011.

[91] Q. Wang, D. Zhang, and L. Si, “Weighted hashing for fast large scale similarity search,” in Proceedings of the 22nd ACM Conference on Information and Knowledge Management, 2013, pp. 1185–1188.

[92] L. Zhang, Y. Zhang, X. Gu, and Q. Tian, “Binary code ranking with weighted hamming distance,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 1586–159.

[93] X. Liu, J. He, B. Lang, and S.-F. Chang, “Hash bit selection: a unified solution for selection problems in hashing,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 1570–1577.

[94] X. Zhang, L. Zhang, and H.-Y. Shum, “Qsrank: Query-sensitive hash code ranking for efficient ε-neighbor search,” in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2058–2065.

[95] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nystrom method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 214–225, 2004.

[96] Y. Weiss, R. Fergus, and A. Torralba, “Multidimensional spectral hashing,” in European Conference on Computer Vision, Florence, Italy, 2012, pp. 340–353.

[97] W. Liu, J. He, and S.-F. Chang, “Large graph construction for scalable semi-supervised learning,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 679–686.

[98] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, “Discrete graph hashing,” in Advances in Neural Information Processing Systems, 2014, pp. 3419–3427.

[99] J. Masci, A. Bronstein, M. Bronstein, P. Sprechmann, and G. Sapiro, “Sparse similarity-preserving hashing,” in Proc. International Conference on Learning Representations, 2014.

[100] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 209–216.

[101] R. M. Gray, Toeplitz and Circulant Matrices: A Review. now publishers inc, 2006.

[102] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.

[103] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.

[104] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[105] R. Salakhutdinov and G. Hinton, “Learning a nonlinear embedding by preserving class neighbourhood structure,” in Proc. International Conference on Artificial Intelligence and Statistics, 2007, pp. 412–419.

[106] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[107] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Proc. AAAI Conference on Artificial Intelligence, 2014, pp. 2156–2162.

[108] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[109] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[110] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. International Conference on Machine Learning, 2010, pp. 399–406.

[111] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2013.

[112] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1106–1114.

[113] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proc. International Conference on Machine Learning, 2015, pp. 2285–2294.

[114] S. Vijayanarasimhan, P. Jain, and K. Grauman, “Hashing hyperplane queries to near points with applications to large-scale active learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[115] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.

[116] P. Jain, S. Vijayanarasimhan, and K. Grauman, “Hashing hyperplane queries to near points with applications to large-scale active learning,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds. MIT Press, 2010, pp. 928–936.

[117] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang, “Compact hyperplane hashing with bilinear functions,” in Proceedings of the 29th International Conference on Machine Learning, 2012.

[118] Y. Gong, S. Kumar, H. Rowley, and S. Lazebnik, “Learning binary codes for high-dimensional data using bilinear projections,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 484–491.

[119] R. Basri, T. Hassner, and L. Zelnik-Manor, “Approximate nearest subspace search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 266–278, 2011.

[120] X. Wang, S. Atev, J. Wright, and G. Lerman, “Fast subspace search via grassmannian based hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.

[121] S. Kim, Y. Kang, and S. Choi, “Sequential spectral learning to hash with multiple representations,” in European Conference on Computer Vision, Florence, Italy, 2012, pp. 538–551.

[122] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011, pp. 225–234.

[123] Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multi-modal data,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. MIT Press, 2012, pp. 1385–1393.

[124] M. Rastegari, J. Choi, S. Fakhraei, H. Daumé III, and L. Davis, “Predictable dual-view hashing,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1328–1336.

[125] Y. Zhen and D.-Y. Yeung, “A probabilistic model for multimodal hash function learning,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 940–948.

[126] J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong, “Multiple feature hashing for real-time large scale near-duplicate video retrieval,” in Proceedings of the 19th ACM international conference on Multimedia, 2011, pp. 423–432.

[127] L. Cao, Z. Li, Y. Mu, and S.-F. Chang, “Submodular video hashing: a unified framework towards video pooling and indexing,” in Proceedings of the 20th ACM international conference on Multimedia, 2012, pp. 299–308.

[128] M. Ou, P. Cui, J. Wang, F. Wang, and W. Zhu, “Probabilistic attributed hashing,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2894–2900.

[129] K. Grauman and R. Fergus, “Learning binary hash codes for large-scale image search,” in Machine Learning for Computer Vision. Springer, 2013, pp. 49–87.

[130] K. Grauman, “Efficiently searching for similar images,” Commun. ACM, vol. 53, no. 6, 2010.

[131] W. Kong, W.-J. Li, and M. Guo, “Manhattan hashing for large-scale image retrieval,” in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012, pp. 45–54.

[132] J. He, J. Feng, X. Liu, T. Cheng, T.-H. Lin, H. Chung, and S.-F. Chang, “Mobile product search with bag of hash bits and boundary reranking,” in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 3005–3012.

[133] S. Korman and S. Avidan, “Coherency sensitive hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 1607–1614.

[134] Q. Shi, H. Li, and C. Shen, “Rapid face recognition using hashing,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 2753–2760.

[135] Q. Dai, J. Li, J. Wang, Y. Chen, and Y.-G. Jiang, “Optimal Bayesian hashing for efficient face recognition,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015, pp. 3430–3437.

[136] X. Li, C. Shen, A. Dick, and A. van den Hengel, “Learning compact binary codes for visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 2419–2426.

[137] J. Yuan, G. Gravier, S. Campion, X. Liu, and H. Jegou, “Efficient mining of repetitions in large-scale TV streams with product quantization hashing,” in European Conference on Computer Vision, Florence, Italy, 2012, pp. 271–280.

[138] G. S. Manku, A. Jain, and A. Das Sarma, “Detecting near-duplicates for web crawling,” in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 141–150.

[139] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 3594–3601.

[140] Y. Mu, J. Wright, and S.-F. Chang, “Accelerated large scale optimization by concomitant hashing,” in European Conference on Computer Vision, Florence, Italy, 2012, pp. 414–427.

[141] P. Li, A. Shrivastava, J. L. Moore, and A. C. Konig, “Hashing algorithms for large-scale learning,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., 2011, pp. 2672–2680.

[142] K. Zhou and H. Zha, “Learning binary codes for collaborative filtering,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 498–506.

[143] M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang, “Comparing apples to oranges: A scalable solution with heterogeneous hashing,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 230–238.

[144] G. Ye, D. Liu, J. Wang, and S.-F. Chang, “Large-scale video hashing via structure learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.

[145] Q. Wang, D. Zhang, and L. Si, “Semantic hashing using tags and topic modeling,” in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2013, pp. 213–222.

[146] H. Li, W. Liu, and H. Ji, “Two-stage hashing for fast document retrieval,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, USA, 2014.

[147] B. Qian, X. Wang, J. Wang, Weifeng Zhi, H. Li, and I. Davidson, “Fast pairwise query selection for large-scale active learning to rank,” in Proceedings of the IEEE International Conference on Data Mining, 2013.

[148] Y. Mu, G. Hua, W. Fan, and S.-F. Chang, “Hash-svm: Scalable kernel machines for large-scale visual classification,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014, pp. 446–451.

[149] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan, “Hash kernels for structured data,” The Journal of Machine Learning Research, vol. 10, pp. 2615–2637, 2009.