Performance Tuning on Multicore Systems for Feature Matching within Image Collections

Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung Leung and Minyi Guo*

Department of Computer Science, University of Otago, New Zealand

* Department of Computer Science, Shanghai Jiao Tong University, China

Contents

• Motivation
• Our work
• Evaluation
• Conclusion

Similarity Search

• Definition:
  – To preprocess a database of N objects so that, given a query object, one can effectively determine its nearest neighbors in the database.
• Applications:
  – Pattern recognition, chemical similarity analysis, statistical classification, etc.

The problem – KNN Search

• K Nearest Neighbor Search:
  – Feature: an array of D elements
    • $f = [e_1, e_2, \dots, e_D]$
  – Feature Space: a set of features
    • $F_s = \{f_1, f_2, \dots, f_N\}$
  – Feature Similarity: Euclidean distance
    • $d(f_i, f_j) = \sqrt{\sum_{m=1}^{D} (f_i^m - f_j^m)^2}$
  – Search: given a query feature $f_q$, find the k features in $F_s$ that have the shortest distances to $f_q$.
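The search defined above can be sketched as a linear (brute-force) scan. This is a minimal illustration, not the implementation used in the talk: the names `Feature`, `Neighbor`, `squaredDistance` and `knnSearch` are hypothetical, and the sketch compares squared distances because dropping the square root does not change which features are nearest.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Feature = std::vector<float>;              // one feature: D elements
using Neighbor = std::pair<float, std::size_t>;  // (squared distance, index in Fs)

// Squared Euclidean distance between two features of equal length.
float squaredDistance(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t m = 0; m < a.size(); ++m) {
        const float diff = a[m] - b[m];
        sum += diff * diff;
    }
    return sum;
}

// Linear scan: return the k features of `space` closest to `query`,
// nearest first. If k exceeds the number of features, all are returned.
std::vector<Neighbor> knnSearch(const std::vector<Feature>& space,
                                const Feature& query, std::size_t k) {
    std::vector<Neighbor> all;
    all.reserve(space.size());
    for (std::size_t i = 0; i < space.size(); ++i) {
        all.emplace_back(squaredDistance(space[i], query), i);
    }
    k = std::min(k, all.size());
    // Only the first k entries need to be in sorted order.
    std::partial_sort(all.begin(),
                      all.begin() + static_cast<std::ptrdiff_t>(k),
                      all.end());
    all.resize(k);
    return all;
}
```

Each query costs one distance computation per feature in the space, which is why the scan becomes expensive as the collection grows.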

Our Case Study

• Feature Matching: a fundamental problem in many computer vision tasks
  – Use the SIFT algorithm to generate features for each image;
  – Use a k-Nearest Neighbors (k-NN) algorithm to find similar features between images.

Challenges

• Very time-consuming:
  – Datasets become larger:
    • hundreds or thousands of images;
  – Image resolution increases:
    • 2300×1500 pixels, or higher.
• New platforms: HPC is moving into the multi-/many-core era:
  – AMD 16-core and 64-core machines.

Motivation

• Performance evaluation:
  – Identify common problems that may limit the performance of feature matching on multi-/many-core platforms.
• Performance tuning:
  – Find general methods to solve the identified problems.

Data Distribution

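[Figure: number of images and total number of features per feature size range. X-axis: feature size range (ticks at 10000, 20000, 30000, 40000); left y-axis: number of images (0 to 30); right y-axis: total number of features (0 to 700000). Bar labels in plot order: images 26, 26, 26, 3; features 181124, 420008, 660949, 146180.]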

Data Size

[Figure: data size, kd-tree size and total size per image. X-axis: image id (0 to 80); y-axis: size in MB (0 to 45).]

Problems

• Unbalanced workload:
  – Levels of parallelism;
  – Scheduling policy.
• Poor last-level cache utilization:
  – Memory architecture.

Levels of parallelism

[Diagram: levels of parallelism across reference images, query images and features. Labels: Level_1, Level_2, Level_3, Level_4, and the combined Level_1&2. Search algorithms shown: Linear, KD-tree, Kmeans, LSH, others.]
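The difference between parallelising one loop level and a combined level can be sketched with OpenMP. The mapping used here is an assumption, not something the slide text states: Level_1 is taken to be the loop over reference images, Level_2 the loop over query images, and Level_3 the loop over a query image's features. `Image`, `matchPair`, `matchLevel1` and `matchLevel12` are hypothetical names, and the toy kernel merely counts identical integer "features".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<int>;  // hypothetical: an image is a list of toy 1-D features

// Hypothetical per-pair kernel: counts query features that also occur in
// the reference image. The outer loop is the assumed Level_3.
int matchPair(const Image& reference, const Image& query) {
    int matches = 0;
    for (int q : query) {
        for (int r : reference) {
            if (q == r) { ++matches; break; }
        }
    }
    return matches;
}

// Assumed Level_1: one parallel task per reference image (n tasks).
std::vector<int> matchLevel1(const std::vector<Image>& images) {
    const int n = static_cast<int>(images.size());
    std::vector<int> result(static_cast<std::size_t>(n) * n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            result[static_cast<std::size_t>(i) * n + j] = matchPair(images[i], images[j]);
        }
    }
    return result;
}

// Assumed Level_1&2: both image loops collapsed into n*n parallel tasks,
// which gives the scheduler finer-grained work to balance.
std::vector<int> matchLevel12(const std::vector<Image>& images) {
    const int n = static_cast<int>(images.size());
    std::vector<int> result(static_cast<std::size_t>(n) * n, 0);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            result[static_cast<std::size_t>(i) * n + j] = matchPair(images[i], images[j]);
        }
    }
    return result;
}
```

Both variants compute the same result; they differ only in how many independent tasks are exposed to the threads. Compiled without OpenMP support, the pragmas are ignored and the loops run serially.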

Scheduling policy

• OpenMP scheduling policy:
  – Static: the scheduler assigns an equal number of tasks to each thread (not used);
  – Dynamic: when a thread finishes its current task, it takes new tasks from the global task queue;
  – Guided: the chunk size is adjusted dynamically as tasks are requested from the task queue.
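The three policies above correspond to the `schedule` clause of an OpenMP loop. The sketch below is illustrative only: `processTask` is a hypothetical stand-in for matching one image pair, with cost proportional to `size`, and the chunk size of 2 in the dynamic variant is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical task whose running time grows with `size`, standing in for
// image pairs with very different feature counts.
long processTask(int size) {
    long acc = 0;
    for (int i = 1; i <= size; ++i) acc += i % 7;
    return acc;
}

// Static: iterations are divided evenly among threads up front.
long runStatic(const std::vector<int>& sizes) {
    long total = 0;
    const int n = static_cast<int>(sizes.size());
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int t = 0; t < n; ++t) total += processTask(sizes[t]);
    return total;
}

// Dynamic: an idle thread fetches the next chunk (here 2 iterations).
long runDynamic(const std::vector<int>& sizes) {
    long total = 0;
    const int n = static_cast<int>(sizes.size());
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:total)
    for (int t = 0; t < n; ++t) total += processTask(sizes[t]);
    return total;
}

// Guided: like dynamic, but chunks start large and shrink as work runs out.
long runGuided(const std::vector<int>& sizes) {
    long total = 0;
    const int n = static_cast<int>(sizes.size());
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int t = 0; t < n; ++t) total += processTask(sizes[t]);
    return total;
}
```

All three return the same total; only the assignment of iterations to threads differs. With uneven task sizes, a static split can leave some threads idle while others are still working through their large tasks.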

Memory architecture

• More cores share the memory and last-level cache:
  – Memory bandwidth:
    • AMD 16-core: 12.8 GB/s
    • AMD 64-core: 25.6 GB/s
  – Last-level cache:
    • AMD 16-core: 6 MB
    • AMD 64-core: 16 MB
• Large images may not fit in cache and cause many memory accesses, which leads to hitting the memory wall.
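A rough footprint estimate shows why large images can overflow the last-level cache. The sketch assumes standard SIFT descriptors of 128 elements stored as 32-bit floats (the layout OpenCV uses) and counts only the raw descriptors, ignoring any index such as a kd-tree; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: one SIFT descriptor is 128 floats, i.e. 512 bytes.
constexpr std::size_t kSiftDims = 128;
constexpr std::size_t kBytesPerFeature = kSiftDims * sizeof(float);

// Last-level (L3) cache sizes taken from the hardware configuration.
constexpr std::size_t kL3Amd16 = 6144 * 1024;   // 6 MB
constexpr std::size_t kL3Amd64 = 16384 * 1024;  // 16 MB

// Bytes occupied by the raw descriptors of one image.
constexpr std::size_t descriptorBytes(std::size_t numFeatures) {
    return numFeatures * kBytesPerFeature;
}

// True if one image's descriptors fit in a cache of `cacheBytes`.
constexpr bool fitsInCache(std::size_t numFeatures, std::size_t cacheBytes) {
    return descriptorBytes(numFeatures) <= cacheBytes;
}
```

Under these assumptions, an image with 10000 features needs about 5 MB and just fits in the 6 MB cache, while one with 20000 features needs about 10 MB and does not. The cache is also shared between cores, so the share available to each thread is smaller still.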

Divide-and-Merge

• We propose Divide-and-Merge:
  – Split the whole feature space into several smaller sub-spaces;
  – Search each sub-space independently;
  – Merge their results.
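The three steps above can be sketched for an exact linear search. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature space is split into `parts` contiguous ranges, each range is scanned by brute force, and the per-range candidates are merged by keeping the k smallest distances. All names (`searchRange`, `divideAndMergeSearch`, `parts`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Feature = std::vector<float>;
using Neighbor = std::pair<float, std::size_t>;  // (squared distance, global index)

float squaredDistance(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t m = 0; m < a.size(); ++m) {
        const float diff = a[m] - b[m];
        sum += diff * diff;
    }
    return sum;
}

// Keep only the k nearest candidates, nearest first.
void keepNearest(std::vector<Neighbor>& candidates, std::size_t k) {
    k = std::min(k, candidates.size());
    std::partial_sort(candidates.begin(),
                      candidates.begin() + static_cast<std::ptrdiff_t>(k),
                      candidates.end());
    candidates.resize(k);
}

// Search one sub-space, the index range [begin, end) of `space`.
std::vector<Neighbor> searchRange(const std::vector<Feature>& space,
                                  std::size_t begin, std::size_t end,
                                  const Feature& query, std::size_t k) {
    std::vector<Neighbor> found;
    for (std::size_t i = begin; i < end; ++i) {
        found.emplace_back(squaredDistance(space[i], query), i);
    }
    keepNearest(found, k);
    return found;
}

std::vector<Neighbor> divideAndMergeSearch(const std::vector<Feature>& space,
                                           const Feature& query,
                                           std::size_t k, int parts) {
    assert(parts > 0);
    const std::size_t n = space.size();
    const std::size_t count = static_cast<std::size_t>(parts);
    std::vector<std::vector<Neighbor>> partial(count);
    // Divide and search: sub-spaces are independent, so they can run in parallel.
    #pragma omp parallel for
    for (int p = 0; p < parts; ++p) {
        const std::size_t idx = static_cast<std::size_t>(p);
        partial[idx] = searchRange(space, idx * n / count,
                                   (idx + 1) * n / count, query, k);
    }
    // Merge: the global k nearest are among the per-sub-space k nearest.
    std::vector<Neighbor> merged;
    for (const auto& part : partial) {
        merged.insert(merged.end(), part.begin(), part.end());
    }
    keepNearest(merged, k);
    return merged;
}
```

For an exact search the merged result equals that of a single scan over the whole space, because any feature among the global k nearest must also be among the k nearest of its own sub-space. For approximate methods the result may differ from an unsplit search.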

Divide-and-Merge

Time complexity

• Accurate algorithms:
  – Brute force:
  – Apply DM:
• Approximate algorithms:
  – Randomized KD-Tree:
  – Apply DM:

Hardware and Software configuration

Name | CPU | Cache | Memory | OS | Compiler
AMD 16-core (AMD16) | AMD Opteron Processor 8380, 4 cores × 4 @ 2.5 GHz | L1: 128 KB, L2: 512 KB, L3: 6144 KB | 16 GiB, DDR2 800 MHz, 12.8 GB/s | Ubuntu 12.04.1 | g++-4.4
AMD 64-core (AMD64) | AMD Opteron Processor 6276, 8 cores × 8 @ 2.3 GHz | L1: 48 KB, L2: 1000 KB, L3: 16384 KB | 64 GiB, DDR3 1333 MHz, 21.32 GB/s | Ubuntu 12.04.1 | g++-4.4

Environment: OpenCV + OpenMP, one of the setups most frequently used by computer vision researchers to exploit parallel platforms.

Levels of parallelism

[Figure: scalability of Level_1, Level_2, Level_3 and Level_1&2. X-axis: 1 to 16; y-axis: scalability (0 to 12).]

Scheduling policy (on Level_1&2)

[Figure: scalability of the d1, d2, d4 and guided schedules. X-axis: 1 to 16; y-axis: scalability (0 to 12).]

Scheduling policy (on Level_3)

[Figure: scalability of the d1, d2, d4 and guided schedules. X-axis: 1 to 16; y-axis: scalability (0 to 14).]

Memory architecture

1. Original Execution

2. Apply Divide-and-Merge

Evaluation on Manawatu Dataset

[Figure: two panels comparing Level_3, Level_3_DM, Level_1&2 and Level_1&2_DM. Scalability panel: x-axis 1 to 64, y-axis 0 to 50. Speedup panel: x-axis 1 to 64, y-axis 0 to 25.]

Evaluation on Manawatu Dataset

[Figure: two panels. Scalability panel: x-axis 1 to 64, y-axis 0 to 50. Speedup panel: x-axis 1 to 64, y-axis 0 to 14.]

Conclusion

• We have shown that performance tuning is demanding on modern multicore systems.
• We have comprehensively evaluated the impact of three factors that influence large-scale image feature matching: levels of parallelism, scheduling policy and memory architecture.
• We have proposed a Divide-and-Merge algorithm that can greatly improve the speedup and scalability of feature matching algorithms on multicore machines.
