20140702 xu jiaming hashinglearning - lite
1
Learning to Hash for Large-Scale Search
Xu Jiaming
Chinese Academy of Sciences
2014-07-04 @CUHK
2
Motivation
Similarity-based search is popular in many applications
– Image/video search and retrieval: find the most similar images/videos
– Audio search: find similar songs
– Product search: find shoes with similar style but different color
– Patient search: find patients with similar diagnostic status
Two key components:
– Similarity/distance measure
– Indexing scheme
Whittlesearch (Kovashka et al. 2013)
- 2013 CIKM Tutorial by Jun Wang
3
A Conceptual Diagram for Hashing Based Image Search System
Indexing and Search
Image Database
Similarity Search & Retrieval
Hash Function Design
Visual Search Applications
Reranking Refinement
Designing compact yet accurate hash codes is critical to making the search effective
- 2013 CIKM Tutorial by Jun Wang
4
Outline
Background (data-independent)
– Locality Sensitive Hashing [1999-VLDB, 2006-FOCS, 2008-Communications]
– SimHash [2002-STOC, 2007-WWW]
Learning to Hash (data-dependent)
– Unsupervised vs. Supervised: STH [2010-SIGIR] vs. SHK [2012-CVPR]
– One-Step vs. Two-Step: ITQ [2011-CVPR, 2013-TPAMI] vs. TSH [2013-ICCV]
Others (data-dependent)
– Smart Hashing Update for Fast Response [2013-IJCAI]
– Two-Stage Hashing [2014-ACL]
– Semantic Hashing with Topics and Tags [2013-SIGIR]
– Dual-View Hashing [2013-ICML]
– Multiple View Hashing [2011-SIGIR]
LSH in MapReduce
6
LSH [1999-VLDB, 2006-FOCS, 2008-Communications]
Locality Sensitive Hashing (LSH)
[Figure: random hash functions assign each database item a 0 or 1; the query receives the code 101]
- 2013 CIKM Tutorial by Jun Wang
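The idea on this slide can be sketched in a few lines. This is a minimal illustration, assuming the sign-of-random-projection variant of LSH (one bit per random hyperplane); the function names, the toy vectors and the choice of 3 bits are hypothetical and are not taken from the slides.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals (data-independent)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_code(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return "".join(
        "1" if sum(w * v for w, v in zip(plane, x)) >= 0 else "0"
        for plane in hyperplanes
    )

def build_table(items, hyperplanes):
    """Index database items by their hash code (one bucket per code)."""
    table = {}
    for name, vec in items.items():
        table.setdefault(lsh_code(vec, hyperplanes), []).append(name)
    return table

planes = make_hyperplanes(num_bits=3, dim=2)
database = {"a": [1.0, 0.9], "b": [0.9, 1.0], "c": [-1.0, -0.8]}
table = build_table(database, planes)

# The query points in the same direction as "a", so it shares a's bucket.
query = [2.0, 1.8]
candidates = table.get(lsh_code(query, planes), [])
```

Only the items in the query's bucket are compared against the query, which is what makes the lookup cheap.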
7
SimHash [2002-STOC, 2007-WWW]
Step1: Compute TF-IDF
Text → Observed Features with weights W1, W2, ……, Wn
Step2: Hash Function
W1 → 100110
W2 → 110000
……
Wn → 001001
Step3: Signature (bit 1 gives +W, bit 0 gives -W)
 W1  -W1  -W1   W1   W1  -W1
 W2   W2  -W2  -W2  -W2  -W2
……
-Wn  -Wn   Wn  -Wn  -Wn   Wn
Step4: Sum (column by column)
13, 108, -22, -5, -32, 55
Step5: Generate Fingerprint (positive sum gives 1, otherwise 0)
1, 1, 0, 0, 0, 1
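The five steps can be sketched as follows. This is an illustrative sketch, not the implementation from [4] or [5]: MD5 stands in for the per-feature hash function (so at most 128 bits are available), the weights are assumed to be precomputed TF-IDF values, and all names are hypothetical.

```python
import hashlib

def feature_bits(feature, num_bits):
    """Step 2: hash a feature to num_bits bits (MD5 is an arbitrary choice)."""
    digest = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
    return [(digest >> i) & 1 for i in range(num_bits)]

def signed_row(bits, weight):
    """Step 3: bit 1 contributes +weight, bit 0 contributes -weight."""
    return [weight if bit else -weight for bit in bits]

def fingerprint_from_sums(sums):
    """Step 5: positive column sums become 1, the rest become 0."""
    return [1 if s > 0 else 0 for s in sums]

def simhash(weights, num_bits=64):
    """Steps 2-5 for a dict mapping feature -> weight (step 1 output)."""
    sums = [0.0] * num_bits
    for feature, weight in weights.items():
        row = signed_row(feature_bits(feature, num_bits), weight)
        sums = [s + r for s, r in zip(sums, row)]  # step 4: column-wise sum
    return fingerprint_from_sums(sums)

def hamming(a, b):
    """Number of positions where two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))

# Two toy documents that share their heaviest features.
doc_a = {"hashing": 2.0, "search": 1.5, "image": 0.5}
doc_b = {"hashing": 2.0, "search": 1.5, "video": 0.5}
distance = hamming(simhash(doc_a), simhash(doc_b))
```

Near-duplicate documents are then detected by a small Hamming distance between fingerprints.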
9
STH [2010-SIGIR]
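Self Taught Hashing (STH)
$$\min \sum_{ij} S_{ij} \| y_i - y_j \|^2 \quad \text{s.t.} \quad y_i \in \{-1, 1\}^k, \quad \sum_i y_i = 0, \quad \frac{1}{n} \sum_i y_i y_i^T = I$$
Laplacian Eigenmap
$$\min \operatorname{trace}\left( Y^T (D - W) Y \right) \quad \text{s.t.} \quad Y(i, j) \in \{-1, 1\}, \quad Y^T \mathbf{1} = 0, \quad Y^T Y = I$$
Unsupervised Learning
Supervised Learning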
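A rough sketch of the two stages, under several assumptions: a dense Gaussian similarity matrix plays the role of $W$, the binary constraint is relaxed so that the codes come from the eigenvectors of $D - W$ with the smallest nonzero eigenvalues, each bit is thresholded at its median, and a least-squares linear predictor stands in for the per-bit classifier of the supervised stage. The function names and toy data are hypothetical.

```python
import numpy as np

def sth_fit(X, k, sigma=1.0):
    """Unsupervised stage (relaxed Laplacian Eigenmap), then a linear hash function."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))      # assumed Gaussian similarity
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # eigh returns ascending eigenvalues; column 0 is the constant vector,
    # which the constraint Y^T 1 = 0 excludes.
    _, vecs = np.linalg.eigh(D - W)
    Y = vecs[:, 1:k + 1]
    codes = (Y > np.median(Y, axis=0)).astype(int)   # median threshold per bit
    # Supervised stage: least squares is a stand-in for a per-bit classifier.
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, 2.0 * codes - 1.0, rcond=None)
    return codes, coef

def sth_predict(X_new, coef):
    """Hash unseen points with the learned linear predictors."""
    A = np.hstack([X_new, np.ones((len(X_new), 1))])
    return (A @ coef > 0).astype(int)

# Two small clusters; a single bit should separate them.
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2], [0.1, -0.2],
              [3.0, 0.0], [3.2, 0.1], [2.9, 0.2], [3.1, -0.2]])
codes, coef = sth_fit(X, k=1)
new_codes = sth_predict(np.array([[0.05, 0.0], [3.05, 0.0]]), coef)
```

The second stage is what lets the method hash items that were not part of the training set.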
10
SHK [2012-CVPR]
Supervised Hashing with Kernels
Pairwise similarity
Code inner product approximates pairwise similarity
- 2013 CIKM Tutorial by Jun Wang
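The reason the code inner product is a convenient target can be checked directly: for two codes over {-1, +1} of length r, the inner product equals r minus twice the Hamming distance. The sketch below only illustrates that identity; it is not the kernel-based training procedure of [7], and the names and example codes are hypothetical.

```python
def inner_product(a, b):
    """Inner product of two codes over {-1, +1}."""
    return sum(x * y for x, y in zip(a, b))

def hamming(a, b):
    """Number of differing bits."""
    return sum(x != y for x, y in zip(a, b))

def code_similarity(a, b):
    """Normalized inner product, in [-1, 1]; comparable to a pairwise label."""
    return inner_product(a, b) / len(a)

a = [1, -1, 1, 1]
b = [1, 1, 1, -1]
r = len(a)

# Identity: <a, b> = r - 2 * Hamming(a, b)
identity_holds = inner_product(a, b) == r - 2 * hamming(a, b)
```

Fitting the normalized inner product to a similarity label of +1 (similar) or -1 (dissimilar) therefore pushes the Hamming distance toward 0 or r, respectively.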
12
ITQ [2011-CVPR, 2013-TPAMI]
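Iterative Quantization
Apply PCA for dimensionality reduction: find $W$ to maximize
$$\frac{1}{n} \operatorname{tr}(W^T X^T X W) \quad \text{s.t.} \quad W^T W = I$$
Keep the top $c$ eigenvectors of the data covariance matrix $X^T X$ to obtain $W$; the projected data is $V = XW$.
Note that if $W$ is an optimal solution, then $\tilde{W} = WR$ is also optimal for any orthogonal $c \times c$ matrix $R$.
Key idea: find $R$ to minimize the quantization loss
$$Q(B, R) = \| B - VR \|_F^2 = \| B \|_F^2 + \| V \|_F^2 - 2 \operatorname{tr}(B R^T V^T)$$
$\| B \|_F^2 = nc$ and $V$ is fixed, so this is equivalent to maximizing $\operatorname{tr}(B R^T V^T)$.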
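A compact sketch of the alternating procedure, assuming NumPy: fix the rotation and set the codes to the signs of the rotated data, then fix the codes and update the rotation via the SVD of $B^T V$ (an orthogonal Procrustes step). The function name, the iteration count and the random test data are illustrative choices, not values from the slides.

```python
import numpy as np

def itq(X, c, n_iter=50, seed=0):
    """PCA to c dimensions, then alternate between codes B and rotation R."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                        # zero-center the data
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)    # ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:c]] # top-c eigenvectors
    V = X @ W                                     # projected data
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random orthogonal start
    losses = []
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)       # fix R, update B
        U, _, Vt = np.linalg.svd(B.T @ V)         # fix B, update R
        R = Vt.T @ U.T
        losses.append(float(np.linalg.norm(B - V @ R) ** 2))
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, W, R, losses

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
B, W, R, losses = itq(X, c=4)
```

Each of the two updates can only lower the quantization loss, so the recorded `losses` form a non-increasing sequence.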
13
TSH [2013-ICCV]
Two-Step Hashing
15
SHU [2013-IJCAI]
Smart Hashing Update
1. Consistency-based Selection;
2. Similarity-based Selection.
$$\mathrm{Diff}(k, j) = \min\{ \mathrm{num}(k, j, -1), \mathrm{num}(k, j, 1) \}$$
$$\min_{H_l \in \{-1, 1\}^{l \times r}} Q = \left\| \frac{1}{r} H_l H_l^T - S \right\|_F^2$$
$$\min_{k \in \{1, 2, \dots, r\}} R^k = \left\| r S - H_{r-1}^k (H_{r-1}^k)^T \right\|_F^2$$
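The consistency-based criterion can be illustrated with a small sketch. It assumes that num(k, j, v) counts the samples of class j whose k-th bit equals v, and that the bits with the largest total Diff across classes are the least consistent ones and hence the candidates for updating; both readings are assumptions about [11], and the function names and toy codes are hypothetical. Bits are 0-indexed here.

```python
def num(codes, labels, k, j, value):
    """Count samples of class j whose k-th bit equals value (assumed meaning)."""
    return sum(
        1 for code, label in zip(codes, labels)
        if label == j and code[k] == value
    )

def diff(codes, labels, k, j):
    """Diff(k, j): size of the minority bit value within class j for bit k."""
    return min(num(codes, labels, k, j, -1), num(codes, labels, k, j, 1))

def least_consistent_bits(codes, labels, t):
    """Rank bits by total Diff over all classes and return the top t."""
    classes = sorted(set(labels))
    r = len(codes[0])
    totals = [sum(diff(codes, labels, k, j) for j in classes) for k in range(r)]
    ranked = sorted(range(r), key=lambda k: totals[k], reverse=True)
    return ranked[:t]

codes = [[1, 1], [1, -1], [1, 1], [-1, -1]]
labels = [0, 0, 1, 1]
selected = least_consistent_bits(codes, labels, t=1)
```

In this toy set, bit 1 disagrees within both classes while bit 0 disagrees only within class 1, so bit 1 is selected.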
16
TSH [2014-ACL]
Two-Stage Hashing
LSH for neighbor candidate pruning; ITQ for effective re-ranking.
LSH captures term similarity; ITQ captures topic similarity.
Advantages:
– High hash lookup success rate is attained by the LSH stage;
– High search precision due to the ITQ re-ranking stage;
– Scan only a small portion of an entire dataset;
– Integrate two similarity measures.
17
SHTTM [2013-SIGIR]
Semantic Hashing Using Tags and Topic Modeling
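Hash Code Learning
Minimize over $Y, U$ the sum of the following terms, subject to $Y \in \{-1, 1\}^{k \times n}$, $Y \mathbf{1} = 0$:
– Tag Consistency: $\| T - U^T Y \|_F^2 + C \| U \|^2$
– Similarity Preservation: $\| Y - \theta \|^2$
Hash Function Learning
$$y_j = f(x_j) = W x_j$$
$$W^* = \arg\min_W \sum_{j=1}^{n} \| y_j - W x_j \|^2 + \lambda \| W \|^2$$
$$W^* = Y X^T (X X^T + \lambda I)^{-1}$$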
18
DVH [2013-ICML]
Predictable Dual-View Hashing
The goal is to find two sets of hyperplanes that map the visual and textual spaces into a common subspace.
CCA
Multi-SVM
19
MVH [2011-SIGIR]
Composite Hashing with Multiple Information Sources
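Overall Objective
$J(\mathbf{Y}, \mathbf{W}, \boldsymbol{\alpha})$ combines the terms $J_S(\mathbf{Y})$ and $J_C(\mathbf{Y}, \mathbf{W})$ with weights $C_1$ and $C_2$. The terms sum over the $M$ information sources $k = 1, \dots, M$ and involve $\operatorname{tr}(\mathbf{Y} \mathbf{L}^{(k)} \mathbf{Y}^T)$, the projections $\mathbf{W}^{(k)}$ and the features $\mathbf{X}^{(k)}$; see [15] for the exact form.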
21
LSH in MapReduce – Key Idea
22
LSH in MapReduce – First Round of MapReduce
23
LSH in MapReduce – Second Round of MapReduce
24
References
[1]. Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing[C]//VLDB. 1999, 99: 518-529.
[2]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions[C]//Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006: 459-468.
[3]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions[J]. Communications of the ACM, 2008, 51(1): 117.
[4]. Charikar M S. Similarity estimation techniques from rounding algorithms[C]//Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002: 380-388.
[5]. Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web crawling[C]//Proceedings of the 16th international conference on World Wide Web. ACM, 2007: 141-150.
[6]. Zhang D, Wang J, Cai D, et al. Self-taught hashing for fast similarity search[C]//Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2010: 18-25.
[7]. Liu W, Wang J, Ji R, et al. Supervised hashing with kernels[C]//Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2074-2081.
[8]. Gong Y, Lazebnik S. Iterative quantization: A procrustean approach to learning binary codes[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011: 817-824.
[9]. Gong Y, Lazebnik S, Gordo A, et al. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval[J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013, 35(12): 2916-2929.
[10]. Lin G, Shen C, Suter D, et al. A general two-step approach to learning-based hashing[C]//Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013: 2552-2559.
[11]. Yang Q, Huang L K, Zheng W S, et al. Smart hashing update for fast response[C]//Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 2013: 1855-1861.
[12]. Li H, Liu W, Ji H. Two-stage hashing for fast document retrieval[C]//ACL. 2014.
[13]. Wang Q, Zhang D, Si L. Semantic hashing using tags and topic modeling[C]//Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013: 213-222.
[14]. Rastegari M, Choi J, Fakhraei S, et al. Predictable Dual-View Hashing[C]//Proceedings of The 30th International Conference on Machine Learning. 2013: 1328-1336.
[15]. Zhang D, Wang F, Si L. Composite hashing with multiple information sources[C]//Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 2011: 225-234.
[16]. Szmit, Radosław. "Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data." Language Processing and Intelligent Information Systems. Springer Berlin Heidelberg, 2013. 171-178.
[17]. Blog: Location Sensitive Hashing in Map Reduce: http://horicky.blogspot.hk/2012/09/location-sensitive-hashing-in-map-reduce.html
[18]. Likelike Project: https://github.com/takahi-i/likelike
[19]. Jun Wang. Learning to Hash for Large-Scale Search. 2013 CIKM Tutorial.
27
Discussions and Questions?
Thank you!
2014-07-04