
Learning to Hash for Big Data

Wu-Jun Li

LAMDA Group, National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University

[email protected]

May 09, 2015


Outline

1 Introduction
   Problem Definition
   Existing Methods

2 Scalable Graph Hashing with Feature Transformation
   Motivation
   Model and Learning
   Experiment

3 Conclusion

4 References


Introduction


Nearest Neighbor Search (Retrieval)

Given a query point q, return the points closest (most similar) to q in the database (e.g., images).

Underlies many machine learning, data mining, and information retrieval problems.

Challenges in big data applications:

Curse of dimensionality

Storage cost

Query speed


Similarity Preserving Hashing

Goal: map similar points in the original space to binary codes that are close in Hamming distance, and dissimilar points to codes that are far apart.


Reduce Dimensionality and Storage Cost

Representing each point with a short binary code, instead of a high-dimensional real-valued vector, reduces both the dimensionality and the storage cost.


Querying

Hamming distance:

||01101110, 00101101||_H = 3
||11011, 01011||_H = 1
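Hamming distance is just the number of differing bit positions; a two-line check of the examples above (illustrative only; real systems pack codes into machine words and use XOR plus popcount):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("01101110", "00101101"))  # 3
print(hamming("11011", "01011"))        # 1
# For codes packed into integers: bin(x ^ y).count("1")
```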


Fast Query Speed

By using a hashing-based index, we can achieve constant or sub-linear search time complexity.

Exhaustive search is also acceptable, because the distance computation cost is now cheap.


Hash Function Learning

Easy or hard?

Hard: a discrete optimization problem.

Easy by approximation: two stages of hash function learning (sketched in code below):

Projection stage (dimensionality reduction): given a point x, each projected dimension i is associated with a real-valued projection function f_i(x) (e.g., f_i(x) = w_i^T x).

Quantization stage: turn the real values into binary codes.
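A minimal sketch of the two-stage pipeline in NumPy, with a random Gaussian projection standing in for the learned projection functions (the matrix W here is illustrative, not the output of any particular learning method):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 384, 64                      # input dimensionality, code length

# Projection stage: c real-valued functions f_i(x) = w_i^T x
W = rng.standard_normal((c, d))

def encode(X):
    """Two-stage hashing: real-valued projection, then sign quantization."""
    projections = X @ W.T                                      # stage 1: reduce to c real values
    return np.where(projections >= 0, 1, -1).astype(np.int8)   # stage 2: quantize

codes = encode(rng.standard_normal((5, d)))  # each row: a c-bit code in {-1, +1}
```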

However, there exist essential differences between metric learning (dimensionality reduction) and learning to hash; simply adapting traditional metric learning is not enough.


Data-Independent Methods

The hash function family is defined independently of the training dataset:

Locality-sensitive hashing (LSH) (Gionis et al., 1999; Andoni and Indyk, 2008) and its extensions (Datar et al., 2004; Kulis and Grauman, 2009; Kulis et al., 2009).

SIKH: shift-invariant kernel hashing (Raginsky and Lazebnik, 2009).

Hash functions: random projections.
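For concreteness, a single hash function from the p-stable family of Datar et al. (2004), h(x) = floor((w·x + b)/r), sketched with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 384, 4.0                    # input dimensionality, quantization width

w = rng.standard_normal(d)         # entries from a 2-stable (Gaussian) distribution
b = rng.uniform(0, r)              # random offset in [0, r)

def h(x):
    """One p-stable LSH function: nearby points collide with high probability."""
    return int(np.floor((w @ x + b) / r))

x = rng.standard_normal(d)
print(h(x), h(x + 0.01 * rng.standard_normal(d)))  # likely the same bucket
```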


Data-Dependent Methods

Hash functions are learned from a given training dataset (learning to hash).

Relatively short codes

Seminal papers: (Salakhutdinov and Hinton, 2007, 2009; Torralba et al., 2008; Weiss et al., 2008)

Two categories:

Unimodal:
  Supervised methods: given the labels y_i or triplets (x_i, x_j, x_k)
  Unsupervised methods

Multimodal:
  Supervised methods
  Unsupervised methods


(Unimodal) Unsupervised Methods

No labels to denote the categories of the training points.

PCAH: principal component analysis.

SH: eigenfunctions computed from the data similarity graph (Weiss et al., 2008).

ITQ: orthogonal rotation matrix to refine the initial projection matrix learned by PCA (Gong and Lazebnik, 2011).

AGH: graph-based hashing (Liu et al., 2011).

IsoHash: projected dimensions with isotropic variances (Kong and Li, 2012b).

DGH: discrete graph hashing (Liu et al., 2014).

etc.


(Unimodal) Supervised (semi-supervised) Methods

Class labels or pairwise constraints:

SSH: semi-supervised hashing exploits both labeled data and unlabeled data for hash function learning (Wang et al., 2010a,b).

MLH: minimal loss hashing based on the latent structural SVM framework (Norouzi and Fleet, 2011).

KSH: kernel-based supervised hashing (Liu et al., 2012).

LDAHash: linear discriminant analysis based hashing (Strecha et al., 2012).

LFH: supervised hashing with latent factor models (Zhang et al., 2014).

etc.

Triplet-based methods:

Hamming distance metric learning (HDML) (Norouzi et al., 2012)

Column generation based hashing (CGHash) (Li et al., 2013)


Multimodal Methods

Multi-source hashing

Cross-modal hashing


Multi-Source Hashing

Aims to learn better codes than unimodal hashing by leveraging auxiliary views.

Assumes that all views are provided for a query, which is typically not feasible in many multimedia applications.

Multiple feature hashing (Song et al., 2011)

Composite hashing (Zhang et al., 2011)

etc.


Cross-Modal Hashing

Given an image or text query, return images or texts similar to it.

Cross View Hashing (CVH) (Kumar and Udupa, 2011)

Multimodal Latent Binary Embedding (MLBE) (Zhen and Yeung, 2012a)

Co-Regularized Hashing (CRH) (Zhen and Yeung, 2012b)

Inter-Media Hashing (IMH) (Song et al., 2013)

Relation-aware Heterogeneous Hashing (RaHH) (Ou et al., 2013)

Semantic Correlation Maximization (SCM) (Zhang and Li, 2014)

etc.


Our Contribution

Unsupervised hashing:
(1) Isotropic hashing [NIPS 2012]
(2) Scalable graph hashing with feature transformation [IJCAI 2015]

Supervised hashing [SIGIR 2014]: supervised hashing with latent factor models

Multimodal hashing [AAAI 2014]: large-scale supervised multimodal hashing with semantic correlation maximization

Multiple-bit quantization:
(1) Double-bit quantization [AAAI 2012]
(2) Manhattan quantization [SIGIR 2012]

Survey: "Learning to Hash for Big Data: Current Status and Future Trends", Chinese Science Bulletin, 2015


Scalable Graph Hashing with Feature Transformation


(Unsupervised) Graph Hashing

Guide the hash code learning procedure by directly exploiting the pairwise similarity (neighborhood structure).

\min \sum_{ij} S_{ij} \|b_i - b_j\|^2 = \mathrm{tr}(B^T L B)

subject to: b_i \in \{-1, 1\}^c, \quad \sum_i b_i = 0, \quad \frac{1}{n} \sum_i b_i b_i^T = I
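A quick numerical check of the Laplacian identity behind this objective, with L = D - S (summing over all ordered pairs yields twice tr(B^T L B); the constant does not affect the minimizer). All names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, c = 6, 4
S = rng.uniform(size=(n, n)); S = (S + S.T) / 2    # toy symmetric similarity matrix
B = rng.choice([-1.0, 1.0], size=(n, c))           # toy binary codes

lhs = sum(S[i, j] * np.sum((B[i] - B[j]) ** 2)
          for i in range(n) for j in range(n))
L = np.diag(S.sum(axis=1)) - S                     # graph Laplacian L = D - S
assert np.isclose(lhs, 2 * np.trace(B.T @ L @ B))  # identity up to the constant 2
```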

Graph hashing should be expected to achieve better performance than non-graph-based methods if the learning algorithms are effective.


Motivation

However, it is difficult to design effective algorithms because both the memory and time complexity are at least O(n^2).

SH (Weiss et al., 2008): uses an eigenfunction solution of the 1-D Laplacian under a uniform-distribution assumption.

BRE (Kulis and Darrell, 2009): subsamples a small subset for training.

AGH (Liu et al., 2011) and DGH (Liu et al., 2014): use an anchor graph to approximate the similarity graph.


Contribution

How can we utilize the whole graph while simultaneously avoiding O(n^2) complexity?

Scalable graph hashing (SGH):

A feature transformation method (Shrivastava and Li, 2014) to effectively approximate the whole graph without explicitly computing it.

A sequential method for bit-wise complementary learning.

Linear complexity.

Outperforms the state of the art in terms of both accuracy and scalability.


Notation

X = \{x_1, \dots, x_n\}^T \in R^{n \times d}: n data points.

b_i \in \{+1, -1\}^c: binary code of point x_i.

b_i = [h_1(x_i), \dots, h_c(x_i)]^T, where h_k(\cdot) denotes a hash function.

Pairwise similarity metric: S_{ij} = e^{-\|x_i - x_j\|_F^2 / \rho} \in (0, 1].


Objective Function

\min \sum_{ij} \left( \tilde{S}_{ij} - \frac{1}{c} b_i^T b_j \right)^2

\tilde{S}_{ij} = 2 S_{ij} - 1 \in (-1, 1].

Hash function: h_k(x) = \mathrm{sgn}\left( \sum_{j=1}^{m} W_{kj}\, \phi(x, x_j) + \mathrm{bias} \right)

\forall x_i, map it into the Hamming space as b_i = [h_1(x_i), \cdots, h_c(x_i)]^T.

\forall x, define:

K(x) = \left[ \phi(x, x_1) - \sum_{i=1}^{n} \phi(x_i, x_1)/n, \;\dots,\; \phi(x, x_m) - \sum_{i=1}^{n} \phi(x_i, x_m)/n \right]

Objective function with the parameter W \in R^{c \times m}:

\min_W \left\| c \tilde{S} - \mathrm{sgn}(K(X) W^T)\, \mathrm{sgn}(K(X) W^T)^T \right\|_F^2
\quad \text{s.t.}\quad W K(X)^T K(X) W^T = I
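The kernel \phi is left abstract on the slide; a common choice in kernel-based hashing, and an assumption here, is a Gaussian kernel over m basis points sampled from the training set. A sketch of the centered kernel features K(X):

```python
import numpy as np

def kernel_features(X, anchors, sigma):
    """Centered kernel features K(x), assuming a Gaussian kernel phi.

    X: (n, d) data; anchors: (m, d) basis points sampled from X.
    Each column is centered by subtracting sum_i phi(x_i, x_j) / n.
    """
    sq = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (n, m) squared distances
    Phi = np.exp(-sq / (2 * sigma ** 2))                       # phi(x_i, x_j)
    return Phi - Phi.mean(axis=0, keepdims=True)

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 16))
anchors = X[rng.choice(len(X), size=32, replace=False)]  # m = 32 basis points
K = kernel_features(X, anchors, sigma=1.0)               # shape (1000, 32)
```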


Feature Transformation

\forall x, define P(x) and Q(x):

P(x) = \left[ \sqrt{\tfrac{2(e^2 - 1)}{e \rho}}\, e^{-\|x\|_F^2 / \rho}\, x;\;\; \sqrt{\tfrac{e^2 + 1}{e}}\, e^{-\|x\|_F^2 / \rho};\;\; 1 \right]

Q(x) = \left[ \sqrt{\tfrac{2(e^2 - 1)}{e \rho}}\, e^{-\|x\|_F^2 / \rho}\, x;\;\; \sqrt{\tfrac{e^2 + 1}{e}}\, e^{-\|x\|_F^2 / \rho};\;\; -1 \right]

\forall x_i, x_j \in X:

P(x_i)^T Q(x_j) = 2 \left[ \frac{e^2 - 1}{2e} \cdot \frac{2 x_i^T x_j}{\rho} + \frac{e^2 + 1}{2e} \right] e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2}{\rho}} - 1

\approx 2\, e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2 - 2 x_i^T x_j}{\rho}} - 1 = 2\, e^{-\frac{\|x_i - x_j\|_F^2}{\rho}} - 1 = \tilde{S}_{ij}


Feature Transformation (cont.)

Here we use the approximation \frac{e^2 - 1}{2e}\, t + \frac{e^2 + 1}{2e} \approx e^t for t \in [-1, 1] (the chord of e^t through t = \pm 1).

We assume -1 \le \frac{2}{\rho} x_i^T x_j \le 1. It is easy to prove that \rho = 2 \max\{\|x_i\|_F^2\}_{i=1}^{n} makes -1 \le \frac{2}{\rho} x_i^T x_j \le 1.

Then we have \tilde{S} \approx P(X)^T Q(X).
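A direct NumPy translation of P and Q (rows here are P(x_i)^T and Q(x_i)^T), followed by a check that P(X) Q(X)^T tracks S-tilde; the helper names are illustrative:

```python
import numpy as np

def transform(X, rho, sign):
    """Map each row x to P(x) (sign=+1) or Q(x) (sign=-1); returns (n, d+2)."""
    e = np.e
    decay = np.exp(-(X ** 2).sum(axis=1) / rho)              # e^{-||x||^2 / rho}
    first = np.sqrt(2 * (e ** 2 - 1) / (e * rho)) * X * decay[:, None]
    second = np.sqrt((e ** 2 + 1) / e) * decay
    return np.hstack([first, second[:, None],
                      np.full((len(X), 1), float(sign))])

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 8))
rho = 2 * (X ** 2).sum(axis=1).max()      # guarantees |2 x_i^T x_j / rho| <= 1

P, Q = transform(X, rho, +1), transform(X, rho, -1)
S_tilde = 2 * np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / rho) - 1
print(np.abs(P @ Q.T - S_tilde).max())    # bounded error; exact at 2x_i^Tx_j/rho = +/-1
```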


Sequential Learning Strategy

Direct relaxation may lead to poor performance.

We adopt a sequential learning strategy in a bit-wise complementary manner.

Residual definition:

R_t = c\tilde{S} - \sum_{i=1}^{t-1} \mathrm{sgn}(K(X) w_i)\, \mathrm{sgn}(K(X) w_i)^T, \qquad R_1 = c\tilde{S}

Objective function:

\min_{w_t} \left\| R_t - \mathrm{sgn}(K(X) w_t)\, \mathrm{sgn}(K(X) w_t)^T \right\|_F^2
\quad \text{s.t.}\quad w_t^T K(X)^T K(X) w_t = 1

By relaxation (dropping the sgn), we get:

\min_{w_t} -\mathrm{tr}\left( w_t^T K(X)^T R_t K(X) w_t \right)
\quad \text{s.t.}\quad w_t^T K(X)^T K(X) w_t = 1


Sequential Learning Strategy (cont.)

Then we obtain a generalized eigenvalue problem:

K(X)^T R_t K(X)\, w_t = \lambda\, K(X)^T K(X)\, w_t

Define A_t = K(X)^T R_t K(X) \in R^{m \times m}; then:

A_t = A_{t-1} - K(X)^T \mathrm{sgn}(K(X) w_{t-1})\, \mathrm{sgn}(K(X) w_{t-1})^T K(X)

Key component:

A_1 = c\, K(X)^T \tilde{S} K(X) \approx c\, [K(X)^T P(X)^T]\, [Q(X) K(X)]

where \tilde{S} \approx P(X)^T Q(X) is substituted and the products are reassociated, so the n \times n matrix \tilde{S} is never formed.
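The reassociated product in code: both factors involve only m and d + 2, so the n x n matrix S-tilde is never materialized (illustrative helper, with K, P, Q as in the earlier sketches):

```python
import numpy as np

def compute_A1(K, P, Q, c):
    """A_1 = c K^T S_tilde K via S_tilde ~ P Q^T; cost O(n m (d + 2)).

    K: (n, m) kernel features; P, Q: (n, d+2) transformed features.
    """
    A1 = c * (K.T @ P) @ (Q.T @ K)   # (m, d+2) @ (d+2, m): no n x n product
    return (A1 + A1.T) / 2           # symmetrize for the eigensolver
```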


Sequential Learning Strategy (cont.)

Adopting the residual matrix

R_t = c\tilde{S} - \sum_{i=1, i \ne t}^{c} \mathrm{sgn}(K(X) w_i)\, \mathrm{sgn}(K(X) w_i)^T,

we can relearn all of W = \{w_i\}_{i=1}^{c} for multiple rounds.

This can further improve the accuracy.

We continue for one more round to get a good tradeoff between accuracy and speed.


Sequential Learning Algorithm
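A compact, hedged sketch of one sequential pass, assuming K, P, Q are computed as in the earlier sketches; it re-solves a full m x m generalized eigenproblem per bit and omits details such as the bias term, so it is illustrative rather than the paper's exact algorithm:

```python
import numpy as np
from scipy.linalg import eigh

def sgh_train(K, P, Q, c, eps=1e-6):
    """One sequential pass over the c bits; returns W whose rows are the w_t."""
    m = K.shape[1]
    B = K.T @ K + eps * np.eye(m)    # constraint matrix; small ridge for stability
    A = c * (K.T @ P) @ (Q.T @ K)    # A_1 = c K^T S_tilde K via the factorization
    A = (A + A.T) / 2                # symmetrize for the eigensolver
    W = np.zeros((c, m))
    for t in range(c):
        _, vecs = eigh(A, B)         # generalized eigenproblem A w = lambda B w
        w = vecs[:, -1]              # eigenvector of the largest eigenvalue
        W[t] = w
        v = K.T @ np.sign(K @ w)     # K(X)^T sgn(K(X) w_t)
        A -= np.outer(v, v)          # residual update: A_{t+1} = A_t - v v^T
    return W
```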


Complexity Analysis

Initialization:

P(X) and Q(X): O(dn)

Kernel initialization: O(dmn + mn)

A_1 and Z: O((d + 2)mn)

Main procedure: O(c(mn + m^2) + m^3)

Typically, m, d, and c are much smaller than n. Hence the time complexity is O(n), and the storage complexity is also O(n).


Datasets

TINY-1M

One million images from the set of 80M tiny images.

Each tiny image is represented by a 384-dim GIST descriptor.

MIRFLICKR-1M

One million Flickr images.

512-dim features are extracted from each image.


Evaluation Protocols and Baselines

Evaluation Protocols:

Ground truth: top 2% nearest neighbors in the training set in terms of Euclidean distance

Hamming ranking: Top-K precision

Training time for scalability
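Under this protocol, Top-K precision can be computed as in the following sketch (codes in {-1, +1}; true_neighbors holds precomputed sets of top-2% Euclidean neighbors; names are illustrative):

```python
import numpy as np

def topk_precision(query_codes, base_codes, true_neighbors, K=1000):
    """Mean Top-K precision under Hamming ranking."""
    precisions = []
    for q, truth in zip(query_codes, true_neighbors):
        # For codes in {-1, +1}: Hamming distance = (c - <b_i, b_j>) / 2
        ham = (base_codes.shape[1] - base_codes @ q) / 2
        retrieved = np.argsort(ham)[:K]
        precisions.append(len(set(retrieved) & truth) / K)
    return float(np.mean(precisions))
```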

Baselines:

LSH (Andoni and Indyk, 2008)

PCAH (Gong and Lazebnik, 2011)

ITQ (Gong and Lazebnik, 2011)

AGH (Liu et al., 2011)

DGH (Liu et al., 2014): DGH-I and DGH-R


Top-1k Precision

Top-1k precision @TINY-1M:

Method   32 bits   64 bits   96 bits   128 bits   256 bits
SGH      0.4697    0.5742    0.6299    0.6737     0.7357
ITQ      0.4289    0.4782    0.4947    0.4986     0.5003
AGH      0.3973    0.4402    0.4577    0.4654     0.4767
DGH-I    0.3974    0.4536    0.4737    0.4874     0.4969
DGH-R    0.3793    0.4554    0.4871    0.4989     0.5276
PCAH     0.2457    0.2203    0.2000    0.1836     0.1421
LSH      0.2507    0.3575    0.4122    0.4529     0.5212

Top-1k precision @MIRFLICKR-1M:

Method   32 bits   64 bits   96 bits   128 bits   256 bits
SGH      0.4919    0.6041    0.6677    0.6985     0.7584
ITQ      0.5177    0.5776    0.5999    0.6096     0.6228
AGH      0.4299    0.4741    0.4911    0.4998     0.5060
DGH-I    0.4299    0.4806    0.5001    0.5111     0.5253
DGH-R    0.4121    0.4776    0.5054    0.5196     0.5428
PCAH     0.2720    0.2384    0.2141    0.1950     0.1508
LSH      0.2597    0.3995    0.4660    0.5160     0.6072


Top-k Precision @TINY-1M

[Figure: Top-k precision on TINY-1M at (a) 64 bits and (b) 128 bits; precision vs. number of returned samples (200-1000) for SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH.]


Top-k Precision @MIRFLICKR-1M

[Figure: Top-k precision on MIRFLICKR-1M at (a) 64 bits and (b) 128 bits; precision vs. number of returned samples (0-1000) for SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH.]


Training Time (in seconds)

Training time @TINY-1M (t1 = 1438.60):

Method   32 bits        64 bits        96 bits        128 bits        256 bits
SGH      34.49          52.37          71.53          89.65           164.23
ITQ      31.72          60.62          89.01          149.18          322.06
AGH      18.60 + t1     19.40 + t1     20.08 + t1     22.48 + t1      25.09 + t1
DGH-I    187.57 + t1    296.99 + t1    518.57 + t1    924.08 + t1     1838.30 + t1
DGH-R    217.06 + t1    360.18 + t1    615.74 + t1    1089.10 + t1    2300.10 + t1
PCAH     4.29           4.54           4.75           5.85            6.49
LSH      1.68           1.77           1.84           2.55            3.76

Training time @MIRFLICKR-1M (t2 = 1564.86):

Method   32 bits        64 bits        96 bits        128 bits        256 bits
SGH      41.51          59.02          74.86          97.25           168.35
ITQ      36.17          64.61          89.50          132.71          285.10
AGH      17.99 + t2     18.80 + t2     20.30 + t2     19.87 + t2      21.60 + t2
DGH-I    85.81 + t2     143.68 + t2    215.41 + t2    352.73 + t2     739.56 + t2
DGH-R    116.25 + t2    206.24 + t2    308.32 + t2    517.97 + t2     1199.44 + t2
PCAH     7.65           7.90           8.47           9.23            10.42
LSH      2.44           2.43           2.71           3.38            4.21


Sensitivity to Parameter ρ

[Figure: sensitivity of performance to the parameter ρ on (a) TINY-1M and (b) MIRFLICKR-1M.]


Sensitivity to Parameter m

[Figure: sensitivity of performance to the parameter m on (a) TINY-1M and (b) MIRFLICKR-1M.]


Conclusion

Hashing can significantly improve retrieval speed and reduce storage cost.

It has become a hot research topic in big learning, with wide applications.

We have proposed a series of hashing methods, including unsupervised, supervised, and multimodal methods. Furthermore, some quantization strategies (Kong and Li, 2012a; Kong et al., 2012) have also been designed.

In particular, the details of SGH with feature transformation (Jiang and Li, 2015) were introduced in this talk.


Future Trends

Discrete optimization and learning

Scalable training for supervised hashing

Quantization

New applications

Survey: "Learning to Hash for Big Data: Current Status and Future Trends", Chinese Science Bulletin, 2015 (Li and Zhou, 2015)


Q & A

Thanks!

Questions?

Code available at http://cs.nju.edu.cn/lwj


References

A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.

M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry, 2004.

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of International Conference on Very Large Data Bases, 1999.

Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of Computer Vision and Pattern Recognition, 2011.

Q.-Y. Jiang and W.-J. Li. Scalable graph hashing with feature transformation. In IJCAI, 2015.

W. Kong and W.-J. Li. Double-bit quantization for hashing. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012a.


W. Kong and W.-J. Li. Isotropic hashing. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012b.

W. Kong, W.-J. Li, and M. Guo. Manhattan hashing for large-scale image retrieval. In The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.

B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of Neural Information Processing Systems, 2009.

B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proceedings of International Conference on Computer Vision, 2009.

B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2143–2157, 2009.

S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, pages 1360–1365, 2011.


W.-J. Li and Z.-H. Zhou. Learning to hash for big data: current status and future trends. Chinese Science Bulletin, 60:485–490, 2015.

X. Li, G. Lin, C. Shen, A. van den Hengel, and A. R. Dick. Learning hash functions using column generation. In ICML, 2013.

W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In Proceedings of International Conference on Machine Learning, 2011.

W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.

W. Liu, C. Mu, S. Kumar, and S. Chang. Discrete graph hashing. In Proceedings of Neural Information Processing Systems, 2014.

M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of International Conference on Machine Learning, 2011.

M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, pages 1070–1078, 2012.

M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang. Comparing apples to oranges: a scalable solution with heterogeneous hashing. In KDD, pages 230–238, 2013.


M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Proceedings of Neural Information Processing Systems, 2009.

R. Salakhutdinov and G. Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.

R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.

A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, 2014.

J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In ACM Multimedia, pages 423–432, 2011.

J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD Conference, pages 785–796, 2013.

C. Strecha, A. A. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 34(1):66–78, 2012.


A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proceedings of Computer Vision and Pattern Recognition, 2008.

J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In Proceedings of International Conference on Machine Learning, 2010a.

J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale image retrieval. In Proceedings of Computer Vision and Pattern Recognition, 2010b.

Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Neural Information Processing Systems, 2008.

D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.

D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In Proceedings of International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011.


P. Zhang, W. Zhang, W.-J. Li, and M. Guo. Supervised hashing with latent factor models. In SIGIR, 2014.

Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In KDD, pages 940–948, 2012a.

Y. Zhen and D.-Y. Yeung. Co-regularized hashing for multimodal data. In NIPS, pages 1385–1393, 2012b.
