Spectral Approaches to Nearest Neighbor Search
arXiv:1408.0751
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni, Ravi Kannan
Les Houches, January 2015
Nearest Neighbor Search (NNS)
Preprocess: a set of points $P$ in $\mathbb{R}^d$
Query: given a query point $q$, report a point $p^* \in P$ with the smallest distance to $q$
Motivation
Generic setup:
  Points model objects (e.g. images)
  Distance models (dis)similarity measure
Application areas: machine learning (k-NN rule), signal processing, vector quantization, bioinformatics, etc.
Distance can be: Hamming, Euclidean, edit distance, earth-mover distance, …
Example (Hamming distance between bit strings):
  q  = 000000011100010100000100010100011111
  p* = 000000001100000100000100110100111111
Curse of Dimensionality
All exact algorithms degrade rapidly with the dimension $d$:

Algorithm                 | Query time         | Space
Full indexing             | $d^{O(1)} \log n$  | $n^{O(d)}$ (Voronoi diagram size)
No indexing – linear scan | $O(n \cdot d)$     | $O(n \cdot d)$
Approximate NNS
Given a query point $q$, report $p' \in P$ s.t. $\|p' - q\| \le c \, \|p^* - q\|$
  $c$: approximation factor
  randomized: return such $p'$ with probability $\ge 90\%$
Heuristic perspective: gives a set of candidates (hopefully small)
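A minimal baseline in NumPy to make the definition concrete (the function name and toy data are illustrative, not from the talk): an exact linear scan pays the $O(n \cdot d)$ query time from the table above, and any point within factor $c$ of the returned distance is a valid $c$-approximate answer.

```python
import numpy as np

def nearest_neighbor(P: np.ndarray, q: np.ndarray) -> int:
    """Exact nearest neighbor of q among the rows of P, by O(n*d) linear scan."""
    return int(np.argmin(np.linalg.norm(P - q, axis=1)))

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 64))   # n = 1000 points in d = 64 dimensions
q = rng.normal(size=64)
i = nearest_neighbor(P, q)
# Any p' with ||p' - q|| <= c * ||P[i] - q|| is a valid c-approximate answer.
print(i, np.linalg.norm(P[i] - q))
```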
NNS algorithms
It’s all about space partitions!
Low-dimensional: [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], [Arya-Fonseca-Mount’11], …
High-dimensional: [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ’01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [Andoni-Indyk’06], [Andoni-Indyk-Nguyen-Razenshteyn’14], [Andoni-Razenshteyn’15]
High-dimensional: Locality-Sensitive Hashing
Crucial use of random projections
Johnson-Lindenstrauss Lemma: project to a random subspace of dimension $O(\epsilon^{-2} \log n)$ for $1+\epsilon$ approximation
Runtime: $n^{1/c}$ for $c$-approximation
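A sketch of the random-projection step behind these bounds (the constant 8 and the function name are illustrative assumptions): the JL lemma guarantees that such a Gaussian projection to $m = O(\epsilon^{-2} \log n)$ dimensions preserves all pairwise distances up to a $1+\epsilon$ factor with high probability.

```python
import numpy as np

def jl_project(P: np.ndarray, eps: float, rng=None) -> np.ndarray:
    """Randomly project the rows of P down to m ~ eps^-2 log n dimensions."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = P.shape
    m = int(np.ceil(8 * np.log(n) / eps**2))   # target dimension; constant is illustrative
    G = rng.normal(size=(d, m)) / np.sqrt(m)   # scaled Gaussian projection matrix
    return P @ G
```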
Practice: Data-aware partitions
Optimize the partition to your dataset:
  PCA-tree [Sproull’91, McNames’01, Verma-Kpotufe-Dasgupta’09]
  randomized kd-trees [Silpa-Anan-Hartley’08, Muja-Lowe’09]
  spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus’08, Wang-Kumar-Chang’09, Salakhutdinov-Hinton’09, Yagnik-Strelow-Ross-Lin’11]
Practice vs Theory
Data-aware projections often outperform (vanilla) random-projection methods
But no guarantees (correctness or performance)
JL is generally optimal [Alon’03, Jayram-Woodruff’11], even for some NNS setups! [Andoni-Indyk-Patrascu’06]
Why do data-aware projections outperform random projections?
Is there an algorithmic framework to study this phenomenon?
Our model
“Low-dimensional signal + large noise” inside a high-dimensional space
Signal: $P \subset U$ for a subspace $U \subset \mathbb{R}^d$ of dimension $k \ll d$
Data: each point is perturbed by full-dimensional Gaussian noise $N(0, \sigma^2 I_d)$
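A minimal sketch of how such an instance can be generated (parameter names and defaults are illustrative assumptions, not the paper’s):

```python
import numpy as np

def sample_instance(n=500, d=200, k=10, sigma=None, rng=None):
    """Points in a random k-dim subspace U of R^d, plus full-dim Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    if sigma is None:
        sigma = 1 / np.sqrt(d)                    # noise scale; see the next slide
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis of the signal subspace
    signal = rng.normal(size=(n, k)) @ U.T        # points lying exactly in U
    noise = sigma * rng.normal(size=(n, d))       # full-dimensional Gaussian noise
    return signal + noise, U
```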
Model properties
Data points in $P$ have at least unit norm
Query $q$ s.t.:
  $\|q - p^*\| \le 1$ for the “nearest neighbor” $p^*$
  $\|q - p\| \ge 1 + \epsilon$ for everybody else
Noise entries $N(0, 1/d)$, up to a factor $\mathrm{poly}(\epsilon^{-1} k \log n)$
Claim: the exact nearest neighbor is still the same
Noise is large:
  has magnitude $\approx 1$, comparable to the signal
  any $k$ dimensions of the noise capture only a sub-constant fraction of its mass
  JL would not work: after noise, the gap is very close to 1
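A quick simulation of the claim (my construction, with $\sigma = 1/\sqrt{d}$ so each point receives noise of total norm $\approx 1$): the noise norms concentrate and the cross terms are small, so the exact nearest neighbor is typically unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 20000, 5
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
P = rng.normal(size=(n, k)) @ U.T                     # signal points inside U
u = rng.normal(size=d)
q = P[0] + 0.5 * u / np.linalg.norm(u)                # query within distance 1 of p* = P[0]
noise = rng.normal(scale=1/np.sqrt(d), size=(n, d))   # per-point noise of norm ~ 1
before = np.argmin(np.linalg.norm(P - q, axis=1))
after = np.argmin(np.linalg.norm(P + noise - q, axis=1))
print(before, after)   # typically equal: the exact nearest neighbor survives the noise
```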
Algorithms via PCA
Find the “signal subspace” $U$? Then we can project everything to $U$ and solve NNS there
Use Principal Component Analysis (PCA)?
  extract top direction(s) from the SVD
  e.g., the $k$-dimensional space $V$ that minimizes $\sum_{p \in P} d(p, V)^2$
If PCA removes the noise “perfectly”, we are done: can reduce to $k$-dimensional NNS
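For concreteness, a minimal sketch of this step: the top-$k$ right singular vectors of the data matrix span the $k$-dimensional subspace minimizing $\sum_{p} d(p, V)^2$ (centering, which the talk returns to later, is omitted here).

```python
import numpy as np

def top_k_pca_subspace(P: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal (d, k) basis of the k-dim subspace minimizing sum_p d(p, V)^2."""
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    return Vt[:k].T   # columns = top-k right singular vectors

# Coordinates of the points inside the subspace: P @ top_k_pca_subspace(P, k)
```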
Best we can hope for:
  the dataset contains a “worst-case” $k$-dimensional instance
  ⇒ reduction from dimension $d$ to $k$
NNS performance as if we were in $k$ dimensions, for the full model?
Spoiler: Yes
PCA under noise fails
Does PCA find the “signal subspace” $U$ under noise?
No!
  PCA minimizes $\sum_{p \in P} d(p, V)^2$: good only on “average”, not “worst-case”
  weak signal directions can be overpowered by noise directions
  a typical noise direction contributes $\approx \sigma^2 n$ to this objective
1st Algorithm: intuition
Extract “well-captured points”:
  points with signal mostly inside the top PCA space
  should work for a large fraction of points
Iterate on the rest
Iterative PCA
To make this work:
  Nearly no noise in $\tilde U$: ensuring $\tilde U$ is close to $U$
    $\tilde U$ is determined by heavy-enough spectral directions (its dimension may be less than $k$)
  Capture only points whose signal is fully inside $\tilde U$
    well-captured: distance to $\tilde U$ is explained by noise only

• Find the top PCA subspace $\tilde U$
• $C$ = points well-captured by $\tilde U$
• Build an NNS data structure on $\{$points of $C$ projected onto $\tilde U\}$
• Iterate on the remaining points, $P \setminus C$
Query: query each NNS data structure separately
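A minimal sketch of the loop above (the fixed dimension k and the capture threshold are my simplifications; the talk chooses the subspace by a singular-value cutoff and the threshold by the noise level):

```python
import numpy as np

def iterative_pca(P: np.ndarray, k: int, thresh: float):
    """Split P into groups, each well-captured by its own top-k PCA subspace.
    Returns (indices, basis) pairs; an NNS structure would be built per group
    on the projected points."""
    idx = np.arange(len(P))
    groups = []
    while len(idx) > 0:
        _, _, Vt = np.linalg.svd(P[idx], full_matrices=False)
        V = Vt[:k].T                                  # top-k PCA subspace
        resid = np.linalg.norm(P[idx] - P[idx] @ V @ V.T, axis=1)
        captured = resid <= thresh                    # distance to subspace ~ noise only
        if not captured.any():                        # guard so the sketch terminates
            captured[:] = True
        groups.append((idx[captured], V))
        idx = idx[~captured]                          # iterate on the rest
    return groups
```

A query would then probe each group’s NNS structure separately, as the slide describes.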
Simpler model
Assume: small noise: $\tilde p = p + \alpha_p$ with $\|\alpha_p\| \le \epsilon/16$,
where $\alpha_p$ can be even adversarial
Algorithm: $p$ is well-captured if $d(\tilde p, \tilde U) \le \epsilon/8$
Claim 1: if $p^*$ is captured by $\tilde U$, we will find it in the NNS
  for any captured $p$: projecting onto $\tilde U$ moves $\tilde p$ by at most $\epsilon/8$, on top of noise at most $\epsilon/16$, so all distances to $q$ are distorted by $O(\epsilon)$ and the $1+\epsilon$ gap survives
Claim 2: the number of iterations is $O(\log n)$
  $\sum_p d(\tilde p, \tilde U)^2 \le \sum_p d(\tilde p, U)^2 \le n (\epsilon/16)^2$, so by Markov $d(\tilde p, \tilde U) > \epsilon/8$ for at most a 1/4-fraction of points, hence a constant fraction is captured in each iteration

• Find the top-$k$ PCA subspace $\tilde U$
• $C$ = points well-captured by $\tilde U$
• Build NNS on $\{$points of $C$ projected onto $\tilde U\}$
• Iterate on the remaining points, $P \setminus C$
Query: query each NNS separately
Analysis of general model
Need to use the randomness of the noise
Want to say that the “signal” is stronger than the “noise” (on average)
Use random matrix theory:
  the noise matrix $T$ is random with entries $N(0, 1/d)$:
    all singular values satisfy $\sigma_i(T) \lesssim \sqrt{n/d} + 1$
  the signal matrix $S$ has rank $\le k$ and (Frobenius-norm)$^2 \ge n$:
    important directions have singular value $\gtrsim \sqrt{n/k}$
    can ignore directions with smaller singular values
Important signal directions are stronger than the noise!
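A numerical sanity check of these two facts (sizes are arbitrary): the largest singular value of an $n \times d$ Gaussian matrix with entries $N(0, 1/d)$ is about $\sqrt{n/d} + 1$, while a rank-$k$ signal whose rows have norm $\ge 1$ has top singular value $\ge \sqrt{n/k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 500, 5

T = rng.normal(scale=1/np.sqrt(d), size=(n, d))          # noise matrix
print(np.linalg.svd(T, compute_uv=False)[0], np.sqrt(n/d) + 1)

U, _ = np.linalg.qr(rng.normal(size=(d, k)))
S = rng.normal(size=(n, k)) @ U.T                         # rank-k signal
S /= np.minimum(np.linalg.norm(S, axis=1, keepdims=True), 1.0)  # force row norms >= 1
print(np.linalg.svd(S, compute_uv=False)[0], np.sqrt(n/k))
```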
Closeness of subspaces?
Trickier than singular values
  The top singular vector is not stable under perturbation!
  Only stable if the second singular value is much smaller
How to even define “closeness” of subspaces?
To the rescue: Wedin’s sin-theta theorem
Wedin’s sin-theta theorem
Developed by [Davis-Kahan’70], [Wedin’72]
Theorem: Consider $\tilde S = S + T$, where
  $\tilde U$ is the top-$k$ singular subspace of $\tilde S$, and
  $U$ is the (rank-$\le k$) subspace containing $S$.
Then: $\sin \theta(U, \tilde U) \le \|T\| / \sigma_k(\tilde S)$
Another way to see why we need to take directions with sufficiently heavy singular values
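A small numerical illustration (my construction): the angle between subspaces is computed via principal angles, and the bound can be checked directly.

```python
import numpy as np

def sin_theta(A: np.ndarray, B: np.ndarray) -> float:
    """Sine of the largest principal angle between the column spans of the
    orthonormal (d, k) bases A and B: top singular value of (I - A A^T) B."""
    return float(np.linalg.svd(B - A @ (A.T @ B), compute_uv=False)[0])

rng = np.random.default_rng(0)
d, k, n = 200, 5, 1000
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
S = rng.normal(size=(n, k)) @ U.T                      # rank-k signal inside U
T = rng.normal(scale=1/np.sqrt(d), size=(n, d))        # noise
_, sv, Vt = np.linalg.svd(S + T, full_matrices=False)
U_tilde = Vt[:k].T                                     # top-k subspace of S + T
print(sin_theta(U, U_tilde), "<=", np.linalg.svd(T, compute_uv=False)[0] / sv[k-1])
```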
Additional issue: Conditioning
After an iteration, the noise is not random anymore!
  non-captured points might be “biased” by the capturing criterion
Fix: estimate the top PCA subspace from a small sample of the data
Might be needed purely for the analysis, but it does not sound like a bad idea in practice either
Performance of Iterative PCA
Can prove a bound on the number of iterations: a constant fraction of points is captured in each, as in Claim 2
In each iteration, we have NNS in a ($\le k$)-dimensional space
Overall query time: (number of iterations) × (query time of a $k$-dimensional NNS), plus the projections
Reduced to a small number of instances of $k$-dimensional NNS!
2nd Algorithm: PCA-tree
Closer to algorithms used in practice:
• Find the top PCA direction
• Partition the points into slabs of width $\approx \epsilon/k$
• Snap points to the slab hyperplanes
• Recurse on each slab
Query: follow all tree paths that may contain $p^*$
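A minimal sketch of the construction (slab width, leaf size, and the termination guard are illustrative choices; bookkeeping of original point indices is omitted):

```python
import numpy as np

def build_pca_tree(pts: np.ndarray, width: float, leaf_size: int = 16):
    """pts: (m, d) points at this node. Returns a nested ('node'/'leaf', ...) tree."""
    mu = pts.mean(axis=0)
    if len(pts) <= leaf_size or np.allclose(pts, mu):   # guard: nothing left to split
        return ("leaf", pts)
    _, _, Vt = np.linalg.svd(pts - mu, full_matrices=False)
    v = Vt[0]                                  # top (centered) PCA direction
    t = (pts - mu) @ v                         # coordinates along v
    slabs = np.floor(t / width).astype(int)    # partition into slabs of the given width
    children = {}
    for s in np.unique(slabs):
        mask = slabs == s
        center = (s + 0.5) * width             # snap points onto the slab's mid-hyperplane
        snapped = pts[mask] - np.outer(t[mask] - center, v)
        children[s] = build_pca_tree(snapped, width, leaf_size)
    return ("node", mu, v, children)
```

A query would descend into every slab whose interval can still contain a point within the current distance bound, i.e. follow all tree paths that may contain $p^*$.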
2 algorithmic modifications
Centering:
  Need to use centered PCA (subtract the average)
  Otherwise errors from the perturbations accumulate
Sparsification:
  Need to sparsify the set of points in each node of the tree
  Otherwise we can get a “dense” cluster: not enough variance in the signal, lots of noise
Analysis
An “extreme” version of the Iterative PCA algorithm: just use the top PCA direction, which is guaranteed to have signal!
Main lemma: the tree depth is $O(k)$
  because each discovered direction is close to the signal subspace $U$
  snapping is like orthogonalizing with respect to each discovered direction; one cannot have too many such directions
Query runtime: roughly $(1/\epsilon)^{O(k)}$, from following all relevant paths in a depth-$O(k)$ tree
Overall performs like a $k$-dimensional NNS!
Wrap-up
Here:
  Model: “low-dimensional signal + large noise”
  Like NNS in a low-dimensional space, via the “right” adaptation of PCA
Immediate questions:
  Other, less-structured signal/noise models?
  Algorithms with runtime dependent on the spectrum?
Broader question: analysis that explains empirical success?
  Why do data-aware projections outperform random projections?
  Is there an algorithmic framework to study this phenomenon?