Performance Tuning on Multicore Systems for Feature Matching within Image Collections
Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung Leung and Minyi Guo*
Department of Computer Science, University of Otago, New Zealand
* Department of Computer Science, Shanghai Jiao Tong University, China
Contents
• Motivation
• Our work
• Evaluation
• Conclusion
Similarity Search
• Definition:
– To preprocess a database of N objects so that, given a query object, one can effectively determine its nearest neighbors in the database.
• Applications:
– Pattern recognition, chemical similarity analysis, statistical classification, etc.
The problem – KNN Search
• K Nearest Neighbor Search:
– Feature: an array of D elements
• $f = [e_1, e_2, \ldots, e_D]$
– Feature Space: a set of features
• $F_s = \{f_1, f_2, \ldots, f_N\}$
– Feature Similarity: Euclidean distance
• $d(f_i, f_j) = \sqrt{\sum_{m=1}^{D} (f_i^m - f_j^m)^2}$
– Search: given a query feature $f_q$, find the k features in $F_s$ that have the shortest distances to $f_q$.
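The search defined above can be illustrated with a minimal brute-force sketch. It is not the implementation used in the talk; the function name `knn_search` and the toy 2-element features are assumptions made only for this example.

```python
import heapq
import math

def knn_search(feature_space, query, k):
    """Return the k (distance, index) pairs closest to query, nearest first."""
    # Euclidean distance from the query to every feature in the space.
    scored = ((math.dist(f, query), i) for i, f in enumerate(feature_space))
    return heapq.nsmallest(k, scored)

# Hypothetical 2-element features (real SIFT descriptors have D = 128).
Fs = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0), (6.0, 8.0)]
print(knn_search(Fs, (0.0, 0.0), k=2))
```

Every query scans all N features, which is why this exact search becomes expensive on large image collections.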
Our Case Study
• Feature Matching: a fundamental problem in many computer vision tasks
– Use the SIFT algorithm to generate features for each image;
– Use a k-Nearest Neighbors (k-NN) algorithm to find similar features between images.
Challenges
• Very time-consuming:
– Datasets are becoming larger:
• hundreds or thousands of images;
– Image resolution is increasing:
• 2300×1500 pixels, or higher.
• New platforms: HPC is moving into the multi-/many-core era:
– AMD 16-core and 64-core machines.
Motivation
• Performance evaluation:
– Find out common problems that may limit the performance of feature matching on multi-/many-core platforms.
• Performance tuning:
– Find general methods to solve the identified problems.
Contents
• Motivation
• Our work
• Evaluation
• Conclusion
Data Distribution
[Figure: number of images and total number of features per feature size range (x-axis ticks 10000 to 40000). Bar labels: images 26, 26, 26, 3; features 181124, 420008, 660949, 146180.]
Data Size
[Figure: data size, kd-tree size, and total size (MB) per image id (0 to 80).]
Problems
• Unbalanced workload:
– Levels of parallelism;
– Scheduling policy.
• Poor last-level cache utilization:
– Memory architecture.
Levels of parallelism
[Figure: diagram of the levels of parallelism. Labels: Reference Images, Query Images, Features; Level_1, Level_2, Level_3, Level_4, Level_1&2; search algorithms: Linear, KD-tree, Kmeans, LSH, Others.]
Scheduling policy
• OpenMP scheduling policy:
– Static: the scheduler assigns an equal number of tasks to each thread (not used);
– Dynamic: when a thread finishes its current task, it takes new tasks from the global task queue;
– Guided: the chunk size is adjusted dynamically as tasks are requested from the task queue.
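To see why the choice of policy matters for an unbalanced workload, the following is a small simulation sketch, not OpenMP itself. It models only the static and dynamic policies; the function names and the per-image costs are hypothetical.

```python
import heapq

def static_makespan(costs, threads):
    """Static policy: tasks are split up front into equal-sized contiguous blocks."""
    block = max(1, -(-len(costs) // threads))  # ceiling division
    loads = [sum(costs[i:i + block]) for i in range(0, len(costs), block)]
    return max(loads, default=0)

def dynamic_makespan(costs, threads, chunk=1):
    """Dynamic policy: whichever thread becomes idle first takes the next chunk."""
    finish = [0.0] * threads  # min-heap of per-thread finish times
    for i in range(0, len(costs), chunk):
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + sum(costs[i:i + chunk]))
    return max(finish)

# Hypothetical per-image costs: a few large images followed by many small ones.
costs = [8, 8, 8, 8] + [1] * 12
print(static_makespan(costs, 4), dynamic_makespan(costs, 4))
```

In this toy case the static split gives all the expensive tasks to one thread, while the dynamic policy with a chunk size of 1 spreads them evenly. A larger chunk size reduces queue traffic but can bring the imbalance back.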
Memory architecture
• More cores share the memory and last-level cache:
– Memory bandwidth:
• AMD 16-core: 12.8 GB/s
• AMD 64-core: 25.6 GB/s
– Last-level cache:
• AMD 16-core: 6 MB
• AMD 64-core: 16 MB
• Large images may not fit in cache and will cause many memory accesses, which leads to hitting the memory wall.
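A rough back-of-the-envelope check of the cache pressure can be sketched as follows. The 128-element descriptor is the standard SIFT size; the 4 bytes per element and the 25,000-feature image are assumptions chosen for illustration, and the kd-tree and other data are ignored.

```python
def descriptor_bytes(num_features, dims=128, bytes_per_element=4):
    """Memory needed to store the raw descriptors of one image."""
    return num_features * dims * bytes_per_element

# Last-level cache sizes quoted on the slide (6 MB and 16 MB).
L3_BYTES = {"AMD16": 6 * 1024 * 1024, "AMD64": 16 * 1024 * 1024}

size = descriptor_bytes(25_000)  # hypothetical image with 25,000 features
for name, cache in L3_BYTES.items():
    print(name, size, "fits" if size <= cache else "does not fit")
```

Under these assumptions a single large image already exceeds the 6 MB cache, and on either machine the cache is shared by several cores working on different images at once.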
Divide-and-Merge
• We propose Divide-and-Merge:
– The whole feature space is split into several smaller sub-spaces;
– Each sub-space is searched independently;
– Their results are merged.
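The three steps can be sketched as below. The slides do not say how the space is split or how results are merged, so the contiguous split, the exact search inside each sub-space, and the top-k merge are all assumptions, as are the function names.

```python
import heapq
import math

def search_subspace(sub, offset, query, k):
    """Exact k-NN inside one sub-space; indices refer to the whole feature space."""
    scored = ((math.dist(f, query), offset + i) for i, f in enumerate(sub))
    return heapq.nsmallest(k, scored)

def divide_and_merge(feature_space, query, k, parts):
    """Split the space into contiguous sub-spaces, search each, merge the results."""
    size = max(1, -(-len(feature_space) // parts))  # ceiling division
    partial = [
        search_subspace(feature_space[start:start + size], start, query, k)
        for start in range(0, len(feature_space), size)
    ]
    # Merge: the global top k must be among the per-sub-space top k candidates.
    return heapq.nsmallest(k, (hit for hits in partial for hit in hits))

# Hypothetical 2-element features, for illustration only.
Fs = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0), (6.0, 8.0)]
print(divide_and_merge(Fs, (0.0, 0.0), k=2, parts=2))
```

Each sub-space search touches a smaller working set and is independent of the others, so the searches could be handed to separate threads. With an approximate search inside each sub-space, the merged result is not guaranteed to match a search over the whole space.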
Time complexity
• Accurate algorithms:
– Brute force:
– Apply DM:
• Approximate algorithms:
– Randomized KD-Tree:
– Apply DM:
Contents
• Motivation
• Our work
• Evaluation
• Conclusion
Hardware and Software configuration
• AMD 16-core (AMD16):
– CPU: AMD Opteron Processor 8380, 4 cores × 4 @ 2.5 GHz
– Cache: L1: 128 KB, L2: 512 KB, L3: 6144 KB
– Memory: 16 GiB, DDR2 800 MHz, 12.8 GB/s
– OS: Ubuntu 12.04.1
– Compiler: g++-4.4
• AMD 64-core (AMD64):
– CPU: AMD Opteron Processor 6276, 8 cores × 8 @ 2.3 GHz
– Cache: L1: 48 KB, L2: 1000 KB, L3: 16384 KB
– Memory: 64 GiB, DDR3 1333 MHz, 21.32 GB/s
– OS: Ubuntu 12.04.1
– Compiler: g++-4.4
Environment: OpenCV + OpenMP, one of the setups most frequently used by computer vision researchers to utilize parallel platforms.
Levels of parallelism
[Figure: Scalability (y-axis 0 to 12) for x-axis values 1 to 16. Series: Level_1, Level_2, Level_3, Level_1&2.]
Scheduling policy (on level_1&2)
[Figure: Scalability (y-axis 0 to 12) for x-axis values 1 to 16. Series: d1, d2, d4, guided.]
Scheduling policy (on level_3)
[Figure: Scalability (y-axis 0 to 14) for x-axis values 1 to 16. Series: d1, d2, d4, guided.]
Memory architecture
1. Original Execution
2. Apply Divide-and-Merge
Evaluation on Manawatu Dataset
[Figure, two panels, x-axis values 1 to 64. Panel 1: Scalability (y-axis 0 to 50). Panel 2: Speedup (y-axis 0 to 25). Series in both panels: Level_3, Level_3_DM, Level_1&2, Level_1&2_DM.]
Evaluation on Manawatu Dataset
[Figure, two panels, x-axis values 1 to 64. Panel 1: Scalability (y-axis 0 to 50). Panel 2: Speedup (y-axis 0 to 14). Series in both panels: Level_3, Level_3_DM, Level_1&2, Level_1&2_DM.]
Contents
• Motivation
• Our work
• Evaluation
• Conclusion
Conclusion
• We have shown that performance tuning is demanding on modern multicore systems.
• We have comprehensively evaluated the impact of three factors that influence large-scale image feature matching: levels of parallelism, scheduling policy, and memory architecture.
• We have proposed a Divide-and-Merge algorithm that can greatly improve the speedup and scalability of feature matching algorithms on multicore machines.