
Fast Nonparametric Machine Learning Algorithms for High-dimensional Massive Data and Applications

Ting Liu

Carnegie Mellon University

February, 2005

Ph.D. Thesis Proposal

Thesis Committee

• Andrew Moore (Chair)

• Martial Hebert

• Jeff Schneider

• Trevor Darrell (MIT)

Thesis Proposal

Goal: to make nonparametric methods tractable for high-dimensional, massive datasets

• Nonparametric methods:
  – K-nearest-neighbor (K-NN)
  – Kernel density estimation
  – SVM evaluation phase
  – and more …

My thesis


High-dim, massive data

Why K-NN?

• It is simple
  – goes back as early as [Fix-Hodges 1951]
  – [Cover-Hart 1967] justifies k-NN theoretically
• It is easy to implement
  – sanity check for other (more complicated) algorithms
  – similar insights for other nonparametric algorithms
• It is useful in many applications
  – text categorization
  – drug activity detection
  – multimedia, computer vision
  – and more…

Application: Video Segmentation

Task: Shot transition detection

• Cut

• Gradual transition (fades, dissolves …)

Technically [Qi-Hauptmann-Liu 2003]

Pipeline: Video frames → Color histogram → Pair-wise similarity features → Classification (normal: 0, cut: 1, gradual: 2)

Data: 4 hours of MPEG-1 video (420,970 frames)

K-NN
• very slow
• good performance

We want a fast k-NN classification method.

Application: Near-duplicate Detection and Sub-image Retrieval

Copyrighted Image Database

Algorithm Overview [Yan-Rahul 2004]

• Train: 12,100 copyrighted images, stored as 12,100,000 patches for search
• Query: each image gives 1000 patches, so 1000 k-NN searches per query
• Transformation: DoG + PCA-SIFT
• Each image: 1000 patches
• Each patch: 36-dim

We want a fast k-NN search method.

K-NN Methods

K-NN search
• Exact K-NN
  – Naïve
  – Spatial tree: SR-tree, Kd-tree, Metric-tree
• Approximate K-NN
  – Random sample, PCA, LSH
  – Spill-tree

K-NN classification
• KNS2 (2-class)
• KNS3 (2-class)
• IOC (multi-class)

Diagram annotations: "My work", "slow"

K-NN Methods

K-NN search
• Exact K-NN
  – Spatial tree: SR-tree, Kd-tree, Metric-tree
• Approximate K-NN
  – Spill-tree

K-NN classification
• KNS2 (2-class)
• KNS3 (2-class)
• IOC (multi-class)

Problems with Exact K-NN Search: Efficiency

• Slow with huge datasets in high dimensions
• Complexity of algorithms
  – Naïve (linear scan): O(dN) per query
  – Advanced: O(d log N) ~ O(dN) (spatial data structure to avoid searching all points)
    • SR-tree [Katayama-Satoh 1997]
    • Kd-tree [Friedman-Bentley-Finkel 1977]
    • Metric-tree (ball-tree) [Uhlmann 1991, Omohundro 1991]
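The naïve baseline is easy to write down. The sketch below is illustrative only (the function name and the sample points are assumptions, not taken from the slides); it shows where the O(dN) per-query cost comes from.

```python
import heapq
import math

def knn_linear_scan(points, query, k):
    """Return the k points closest to `query` as (distance, index) pairs.

    Every one of the N stored points is compared against the query, and each
    comparison touches all d coordinates, hence O(dN) work per query.
    """
    distances = ((math.dist(p, query), i) for i, p in enumerate(points))
    return heapq.nsmallest(k, distances)

# Toy data, purely for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(knn_linear_scan(points, (0.9, 0.1), k=2))
```

The spatial trees listed above try to avoid most of these N distance computations.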

Metric-tree: an Example

A set of points in R^2

Build a metric-tree [Uhlmann 1991, Omohundro 1991]

Figure labels: points P1 and P2, and the splitting boundary L
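One common way to build such a tree is sketched below. The pivot heuristic (farthest point from the centroid, then farthest point from that one), the `leaf_size` parameter, and all names are assumptions made for illustration; the slides only show P1, P2 and the boundary L.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence

Point = Sequence[float]

@dataclass
class Node:
    center: Point                          # centroid of the points owned by this node
    radius: float                          # max distance from center to an owned point
    points: Optional[List[Point]] = None   # set only for leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def build_metric_tree(points, leaf_size=2):
    center = centroid(points)
    radius = max(math.dist(center, p) for p in points)
    if len(points) <= leaf_size:
        return Node(center, radius, points=list(points))
    # Assumed pivot heuristic: p1 is farthest from the centroid,
    # p2 is farthest from p1.
    p1 = max(points, key=lambda p: math.dist(center, p))
    p2 = max(points, key=lambda p: math.dist(p1, p))
    # L is the hyperplane midway between p1 and p2, so each point goes
    # to the child whose pivot is nearer.
    left = [p for p in points if math.dist(p, p1) <= math.dist(p, p2)]
    right = [p for p in points if math.dist(p, p1) > math.dist(p, p2)]
    if not left or not right:              # all points coincide; stop splitting
        return Node(center, radius, points=list(points))
    return Node(center, radius,
                left=build_metric_tree(left, leaf_size),
                right=build_metric_tree(right, leaf_size))
```

Each node keeps a center and a radius, which is what later allows whole balls to be pruned.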

Metric-tree Data Structure [Uhlmann 1991, Omohundro 1991]

Figure: a metric-tree and its internal data structure (labels: P1, P2)

Metric-tree: the Triangle Inequality

• Let q be any query point
• Let x be a point inside ball B
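For reference, the two bounds this setup gives can be written out as below. This is the standard triangle-inequality argument for a ball; the symbols c_B and r_B (center and radius of B) are notation assumed here, not taken from the slides.

```latex
\|q - c_B\| - r_B \;\le\; \|x - q\| \;\le\; \|q - c_B\| + r_B
\qquad \text{for every } x \in B
```

The left-hand bound is the useful one for search: if it already exceeds the distance to the current k-th candidate, no point in B can improve the answer.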

Metric-tree Based K-NN Search

• Depth-first search
• Pruning using the triangle inequality
• Significant speed-up when d is small: O(d log N)
• Little speed-up when d is large: O(dN)
• “Curse of dimensionality”
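A minimal sketch of this search, under assumptions: the compact builder, the tuple node layout, and the nearer-child-first visiting order are illustrative choices, not details given on the slides.

```python
import heapq
import math

def build(points, leaf_size=4):
    """Compact ball-tree builder: returns (center, radius, points, left, right)."""
    center = tuple(sum(c) / len(points) for c in zip(*points))
    radius = max(math.dist(center, p) for p in points)
    if len(points) > leaf_size:
        p1 = max(points, key=lambda p: math.dist(center, p))
        p2 = max(points, key=lambda p: math.dist(p1, p))
        left = [p for p in points if math.dist(p, p1) <= math.dist(p, p2)]
        right = [p for p in points if math.dist(p, p1) > math.dist(p, p2)]
        if left and right:
            return (center, radius, None,
                    build(left, leaf_size), build(right, leaf_size))
    return (center, radius, list(points), None, None)

def knn_search(node, q, k, best=None):
    """Depth-first k-NN search. `best` is a max-heap of (-distance, point)."""
    if best is None:
        best = []
    center, radius, points, left, right = node
    # Triangle inequality: no point in this ball is closer than this bound.
    lower_bound = max(math.dist(q, center) - radius, 0.0)
    if len(best) == k and lower_bound >= -best[0][0]:
        return best                        # prune the whole subtree
    if points is not None:                 # leaf: scan its points
        for p in points:
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
        return best
    # Visit the nearer child first so the pruning bound tightens early.
    for child in sorted((left, right), key=lambda c: math.dist(q, c[0])):
        knn_search(child, q, k, best)
    return best
```

In high dimensions the lower bound is rarely large enough to prune anything, which is the behavior the last two bullets describe.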


My Work (part 1): Fast K-NN Classification Based on Metric-tree

Idea: Do classification without finding the k-NNs

• KNS2: Fast k-NN classification for skewed 2-class
• KNS3: Fast k-NN classification for 2-class
• IOC: Fast k-NN classification for multi-class

KNS2: Fast K-NN Classification for Skewed 2-class

Assumptions:

(1) 2 classes: pos. / neg.

(2) pos. class much less frequent than neg. class

Example: video segmentation

(~10,000 shot transitions, ~400,000 normal frames)

Q: How many of the k-NN are from pos. class?

How Many of the K-NN are From pos. Class?

• Step 1 --- Find positive

Find the k closest pos. points

Example: k = 3

di: distance of the i’th closest pos. point to q (figure shows q with d1, d2, d3)

Fewer pos. points → easy to compute

How Many of the K-NN are From pos. Class?

• Step 2 --- Count negative

ci: Num. of neg. points within di

Example: k = 3

c1 = 1, c2 = 5, c3 = 8

How Many of the K-NN are From pos. Class?

• Step 2 --- Lowerbound negative

ci: Num. of neg. points within di

Example: k = 3

Estimate: c1 ≥ 3? c2 ≥ 2? c3 ≥ 1?

Idea: lowerbound each ci instead of computing it

Metric-tree: the Triangle Inequality

• Let q be any query point
• Let x be a point inside ball B

How Many of the K-NN are From pos. Class?

• Step 2 --- Estimate negative

ci: Num. of neg. points within di

Example: k = 3

Estimate: c1 ≥ 3? c2 ≥ 2? c3 ≥ 1?

Lower bounds as the search proceeds (node labels and point counts as shown in the figures):
• Node A; count 20: c1 ≥ 0, c2 ≥ 0, c3 ≥ 0
• Nodes B, C; counts 12, 8: c1 ≥ 0, c2 ≥ 0, c3 ≥ 12
• Nodes D, E; counts 5, 7: c1 ≥ 0, c2 ≥ 5, c3 ≥ 12
• Nodes E, F; counts 4, 7: c1 ≥ 4, c2 ≥ 5, c3 ≥ 12

We are done! Return 0

KNS2: the Algorithm

Build two metric-trees (Pos_tree / Neg_tree)
Search Pos_tree, find k pos. NNs
Search Neg_tree
  repeat
    pick a node from Neg_tree
    refine C = {c1, c2, …, ck}
    if ci ≥ k-i+1, remove ci from C
  end repeat
Let k’ = size(C) after the search
return k’
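The counting logic can be checked in isolation. The sketch below is not KNS2 itself: it replaces the Neg_tree refinement with a plain scan over the negative points (an assumption made to keep the example short) and keeps only the removal rule ci ≥ k-i+1 with early termination. Function and variable names are illustrative.

```python
import math

def count_pos_in_knn(pos, neg, q, k):
    """Return how many of q's k nearest neighbors are positive.

    Step 1 finds the distances d_1 <= ... <= d_k to the k closest positives.
    Step 2 counts c_i, the negatives strictly closer than d_i, and drops
    candidate i as soon as c_i >= k - i + 1, since the i'th positive can
    then no longer be among the k nearest neighbors.
    """
    d = sorted(math.dist(q, p) for p in pos)[:k]
    c = [0] * len(d)
    alive = set(range(len(d)))             # 0-based candidates still in C
    for x in neg:
        if not alive:
            break                          # every candidate ruled out: answer is 0
        dist = math.dist(q, x)
        for i in list(alive):
            if dist < d[i]:
                c[i] += 1
                if c[i] >= k - i:          # 0-based form of c_i >= k - i + 1
                    alive.discard(i)
    return len(alive)
```

In the tree-based version the counts are lower bounds refined node by node, so a whole node of negatives can rule out a candidate without visiting its points.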

Experimental Results (KNS2)

Dataset     Dimension (d)   Data Size (N)
ds1         10              26,733
Letter      16              20,000
Video       45              420,970
J_Lee       100             181,395
Blanc_Mel   100             186,414
ds2         1.1×10^6        88,358

CPU Time Speedup Over Naïve K-NN (k = 9)

KNS2: 3x – 60x speed-up over naïve

[Bar chart: speedups of Naive, Metric-tree, and KNS2 on ds1 (d=10), Letter (d=16), Video (d=45), J_Lee (d=100), Blanc_Mel (d=100), ds2 (d=1.1M); vertical axis from 0 to 70]

K-NN Methods

K-NN search
• Exact K-NN
• Approximate K-NN
  – Spill-tree

K-NN classification
• KNS2 (2-class)
• KNS3 (2-class)
• IOC (multi-class)

My Work (Part 2): a New Metric-tree Based Approximate NN Search

• “I’m Feeling Lucky” search
• Spill-tree

Why is Metric-tree Slow?

Empirically…

• takes 10% of the time finding the NN
• takes 90% of the time backtracking

“I’m Feeling Lucky” Search

• Algorithm: simple
  – Descends a metric-tree without backtracking
  – Returns the first point hit in a leaf node
• Complexity: super fast
  – O(log N) per query
• Accuracy: quite low
  – Liable to make mistakes when q is near the decision boundary
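The descent without backtracking can be sketched as follows. The dictionary node layout, the pivot heuristic, and the choice to return the closest point of the leaf (which is "the first point hit" when leaves hold a single point) are assumptions for illustration.

```python
import math

def build(points, leaf_size=1):
    """Metric-tree sketch: internal nodes keep two pivots, leaves keep points."""
    if len(points) > leaf_size:
        center = tuple(sum(c) / len(points) for c in zip(*points))
        p1 = max(points, key=lambda p: math.dist(center, p))
        p2 = max(points, key=lambda p: math.dist(p1, p))
        left = [p for p in points if math.dist(p, p1) <= math.dist(p, p2)]
        right = [p for p in points if math.dist(p, p1) > math.dist(p, p2)]
        if left and right:
            return {"p1": p1, "p2": p2,
                    "left": build(left, leaf_size),
                    "right": build(right, leaf_size)}
    return {"points": list(points)}

def lucky_search(node, q):
    """Descend to a single leaf, never backtracking, and return its best point."""
    while "points" not in node:
        nearer_left = math.dist(q, node["p1"]) <= math.dist(q, node["p2"])
        node = node["left"] if nearer_left else node["right"]
    return min(node["points"], key=lambda p: math.dist(q, p))
```

A query that coincides with a stored point follows the same path that point took during construction, so it is always found; a query just across a boundary from its true neighbor is sent down the wrong branch and never recovers.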

Spill-tree: adding redundancy to help “I’m-Feeling-Lucky” search

Spill-tree

• A variant of metric-tree

• The children of a node can “spill over” onto each other, and contain shared data-points

A Spill-tree Data Structure

Figure labels: p1 and p2; boundaries LL, L, and LR; the overlapping buffer between LL and LR; the overlapping buffer size

• Spill-tree: both children own points between LL and LR
• Metric-tree: each child only owns points to one side of L

A Spill-tree Data Structure

Advantage of Spill-tree
– higher accuracy
– makes a mistake only when the true NN is far away

A Spill-tree Data Structure

Problem with spill-tree
– uncontrolled depth
  – O(log N) when …
  – … when …
  – empirically, …, where … is the expected dist. of a point to its NN

Hybrid Spill-tree Search

• Balance threshold ρ = 70% (empirically): if either child of a node v contains more than ρ of the total points, then split v in the conventional way.
• Overlapping node: “I’m Feeling Lucky” search
• Non-overlapping node: backtracking search
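The overlapping split and the balance-threshold fallback can be sketched together. The function name, the buffer half-width `tau`, and the convention of measuring it from the midplane L along the p1-to-p2 direction are assumptions for illustration.

```python
import math

def spill_split(points, p1, p2, tau, rho=0.7):
    """Split `points` for a spill-tree node with pivots p1 and p2.

    Returns (left, right, overlapping). Points whose signed distance to the
    midplane L is within tau go to both children. If either child would own
    more than a fraction rho of the points, fall back to a conventional
    metric-tree split with no sharing.
    """
    length = math.dist(p1, p2)
    direction = [(b - a) / length for a, b in zip(p1, p2)]
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]

    def offset(p):
        # Signed distance from L: negative on the p1 side, positive on the p2 side.
        return sum((x - m) * u for x, m, u in zip(p, mid, direction))

    left = [p for p in points if offset(p) <= tau]     # everything on the p1 side of LR
    right = [p for p in points if offset(p) >= -tau]   # everything on the p2 side of LL
    if max(len(left), len(right)) > rho * len(points):
        left = [p for p in points if offset(p) <= 0]
        right = [p for p in points if offset(p) > 0]
        return left, right, False
    return left, right, True
```

The returned flag is what a hybrid search would store on the node: descend without backtracking through overlapping nodes, and backtrack as usual through the others.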

Further Efficiency Improvement by Random Projection

Intuition: random projection approximately preserves distance.
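A quick numerical check of this intuition, as a sketch: the Gaussian projection matrix, the dimensions (500 down to 100), and the uniform test vectors are assumptions chosen for illustration, not the setup used in the experiments.

```python
import math
import random

def random_projection_matrix(d, d_proj, rng):
    """d_proj x d Gaussian matrix, scaled so squared lengths are preserved in expectation."""
    scale = 1.0 / math.sqrt(d_proj)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(d_proj)]

def project(matrix, x):
    """Multiply the projection matrix by the vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

rng = random.Random(0)
d, d_proj = 500, 100
R = random_projection_matrix(d, d_proj, rng)
a = [rng.random() for _ in range(d)]
b = [rng.random() for _ in range(d)]

# Ratio of projected distance to original distance; values near 1 mean
# the projection kept the distance roughly intact.
ratio = math.dist(project(R, a), project(R, b)) / math.dist(a, b)
print(f"projected / original distance: {ratio:.3f}")
```

Building and searching the tree in the d_proj-dimensional space then costs less per distance computation, at the price of some distortion.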

Experiments for Spill-tree

Dataset      Num. Data (N)   Num. Dim (d)
Aerial       275,465         60
Corel_hist   20,000          64
Corel_uci    68,040          64
Disk         40,000          1024
Galaxy       40,000          3838

Comparison Methods

• Naïve k-NN

• Metric-tree

• Locality Sensitive Hashing (LSH)

• Spill-tree

Spill-tree vs. Metric-tree

The CPU time (s) speed-up of Spill-tree over metric-tree

Spill-tree enjoys a 3.3- to 706-fold speed-up over metric-tree

Spill-tree vs. LSH

The CPU time (s) of Spill-tree and its speedup (in parentheses) over LSH

Spill-tree enjoys a 2.5- to 31-fold speed-up over LSH


My Contribution

• T. Liu, A. W. Moore, A. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions, NIPS 2003.
• Y. Qi, A. Hauptmann, T. Liu. Supervised Classification for Video Shot Segmentation, ICME 2003.
• T. Liu, K. Yang, A. W. Moore. The IOC algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data, KDD 2004.
• T. Liu, A. W. Moore, A. Gray, K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms, NIPS 2004.

Related Work

• [Uhlmann 1991, Omohundro 1991] Propose the idea of Metric-tree (Ball-tree)
• [Omachi-Aso, 1997] Similar idea of KNS2 for NN classification
• [Gionis-Indyk-Motwani, 1999] A practical approximate NN method: LSH
• [Arya-Fu, 2003] Expected-case complexity of approximate NN searching
• [Yan-Rahul, 2004] Near-duplicate Detection and Sub-image Retrieval
• [Indyk, 1998] Approximate NN under L∞ norm

Future Work

• Improve my previous work
  – Self-tuning spill-tree
  – Theoretical analysis of spill-tree
• Explore new related area
  – Dual-tree
• Applications in the real world

Future Work (1): Self-Tuning Spill-tree

• Two key factors of spill-tree
  – random projection dimension d’
  – overlapping size

Benefits of Automatic Parameter Tuning

• Avoid tedious hand-tuning

• Gain more insights into the approx. NN

Future Work (2): Theoretical Analysis

• Spill-tree + “I’m feeling lucky” search
  – good performance in practice
  – no theoretical guarantee

Idea: when the number of points is large enough, “I’m feeling lucky” search finds the true NN with high probability (w.h.p.)


Idea: with overlapping buffer, the probability of successfully finding the true NN can be increased

Future Work (3): Dual-Tree Search

• N-body problem [Gray-Moore, 2001]
  – NN classification
  – Kernel density estimation
  – Outlier detection
  – Two-point correlation
• Require pair-wise comparison of all N points
  – Naïve solution: O(N^2)
  – Advanced solution based on metric-tree
    • Single-tree: only build trees on training data
    • Dual-tree: build trees on both training and query data

Metric-tree: the Triangle Inequality

• Let q be a point inside query node Q
• Let x be a point inside training node B

Pruning Opportunity [Gray-Moore 2001]

A, B: nodes from training set

Q: node from test set

Figure labels: node centers OA, OB, OQ; distances Dmax(Q, B) and Dmin(Q, A)

Prune A when Dmin(Q, A) > Dmax(Q, B)

A can’t be pruned in this case

But, this is too pessimistic!
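For ball-shaped nodes, the two distances in the figure can be written in closed form. This is a hedged reconstruction for nearest-neighbor search: O_A, O_B, O_Q are the node centers from the figure, while the radii r_A, r_B, r_Q are notation assumed here (r_Q does not appear on the slides).

```latex
D_{\min}(Q, A) = \max\bigl(\|O_Q - O_A\| - r_Q - r_A,\; 0\bigr), \qquad
D_{\max}(Q, B) = \|O_Q - O_B\| + r_Q + r_B

\text{prune } A \text{ for all } q \in Q \quad \text{if} \quad D_{\min}(Q, A) > D_{\max}(Q, B)
```

The test is conservative because it compares the worst case over all of Q against the best case over all of Q, even though no single query point attains both.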

More Pruning Opportunity

Prune A when …

Hyperbola H determined by OA, OB, rA+rB

A can be pruned in this case

Challenge: to compute this efficiently

Future Work (4): Applications

• Multimedia --- video segmentation
  – shot-based segmentation
  – story-based segmentation

• Image retrieval --- near-duplicate detection

• Computer vision --- object recognition

Time Line

• Now – Apr., 2005
  – Dual-tree (design and implementation)
  – Testing on real-world datasets
• May – Aug., 2005
  – Improving spill-tree algorithm
  – Theoretical analysis
• Sept. – Dec., 2005
  – Applications of new k-NN algorithm
• Jan. – Mar., 2006
  – Write up final thesis

Thank you!

Questions