Evaluation of Multi-core Architectures for Image Processing Algorithms Masters Thesis Presentation by Trupti Patil July 22, 2009

Page 1

Evaluation of Multi-core Architectures for Image Processing Algorithms

Masters Thesis Presentation by Trupti Patil

July 22, 2009

Page 2

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Experimental Results
• Conclusion

Page 3

Motivation

Fast processing response is a major requirement in many image processing applications.

Image processing algorithms can be computationally expensive.

Data needs to be processed in parallel, and the processing optimized for real-time execution.

Recently introduced massively parallel computer architectures promise significant acceleration.

Some of these architectures have not been actively explored yet.

Page 4

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Experimental Results
• Conclusion

Page 5

Contribution & scope of the thesis

This thesis adapts and optimizes three image processing and computer vision algorithms for four multi-core architectures.

The timings obtained are compared against available corresponding previous work (intra-class) and across architecture types (inter-class).

Conclusions are drawn from the results.

Page 6

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Implementation
• Conclusion

Page 7

Background

• Need for parallelization
• SIMD optimization
• The need for faster execution time
• Related work
    • Canny edge detection on CellBE [Gupta et al.] and on GPU [Luo et al.]
    • KLT tracking implementation on GPU [Sinha et al., Zach et al.]

Page 8

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Implementation
• Experimental Results
• Conclusion

Page 9

Hardware & Software Platforms

Architecture                      Hardware Platform          Software Platform
NetBurst Microarchitecture        Intel Pentium 4 HT         Linux (Ubuntu), Intel C++ Compiler 11.1
Core Microarchitecture            Intel Core 2 Duo Mobile    Linux (Ubuntu), Intel C++ Compiler 11.1
Cell Broadband Engine (CBE)       Sony PlayStation 3         Linux (Fedora), Cell SDK 3.1
Graphics Processing Unit (GPU)    Nvidia GeForce 8 Series    Linux (Fedora), CUDA 2.1

Page 10

Intel NetBurst & Core Microarchitectures

NetBurst:
• Can execute legacy IA-32 and SIMD applications at a higher clock rate.
• Hyper-Threading (HT) allows simultaneous multithreading, with two logical processors on each physical processor.
• Supports up to SSE3.

Core:
• Improved performance per watt.
• SSSE3 support for more effective utilization of the XMM registers.
• Supports SSE4.
• Scales up to quad-core.

Page 11

Cell Broadband Engine (CBE)

[Figure: Structural diagram of the Cell Broadband Engine. The EIB connects eight SPEs, the PPE, main memory, the graphics device, and I/O devices. The PPE contains the PPU, L1 instruction and data caches, and an L2 cache; each SPE contains an SPU, a Memory Flow Controller (MFC), and a Local Store (LS).]

Page 12

Cell processor overview

• One Power-based PPE, with VMX
    • 32/32 kB I/D L1, and 512 kB L2
    • Dual-issue, in-order PPU, 2 HW threads
• Eight SPEs, with up to 16x SIMD
    • Dual-issue, in-order SPU
    • 128 registers (128b wide)
    • 256 kB local store (LS)
    • 2x 16B/cycle DMA, 16 outstanding requests
• Element Interconnect Bus (EIB)
    • 4 rings, 16B wide (at 1:2 clock)
    • 96B/cycle peak, 16B/cycle to memory
    • 2x 16B/cycle BIF and I/O
• External communication
    • Dual XDR memory controller (MIC)
    • Two configurable bus interfaces (BIC): classical I/O interface and SMP coherent interface

Page 13

Graphics Processing Unit (GPU)

[Figure: Data flow in GPU. Stages shown: Application, Vertex Processor, Assemble & Rasterize, Fragment Processor, Framebuffer Operations, and Frame Buffer, together with Textures.]

Page 14

Nvidia GeForce 8 Series GPU

Graphics pipeline in NVIDIA GeForce 8 Series GPU

Page 15

Compute Unified Device Architecture (CUDA)

• Computing engine in Nvidia GPUs
• Turns the GPU into a compute device that acts as a highly multithreaded coprocessor
• Provides both a low-level and a higher-level API
• Has several advantages over GPU programming through graphics APIs (e.g., OpenGL)

Page 16

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Experimental Results
• Conclusion

Page 17

Algorithm 1: Gaussian Smoothing

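• Gaussian smoothing filters an image with a Gaussian kernel.
• Removes small-scale texture and noise for a given spatial extent.
• 1-D Gaussian kernel, written as: $G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / (2\sigma^2)}$
• 2-D Gaussian kernel: $G(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2) / (2\sigma^2)}$
• Separable: $G(x, y) = G(x)\,G(y)$, so 2-D smoothing can be performed as two 1-D passes.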
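The separability of the Gaussian kernel can be illustrated with a small sketch. It is plain Python rather than the SIMD, Cell, or CUDA code evaluated in the thesis, and the function names, the clamped border handling, and the default `sigma` and `radius` are all assumptions made for illustration.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Sampled, normalized 1-D Gaussian of width 2 * radius + 1."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(image, kernel):
    """Filter every row with the (symmetric) kernel, clamping at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        out.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                for k in range(-radius, radius + 1))
            for x in range(width)
        ])
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_smooth(image, sigma=1.0, radius=2):
    """Two 1-D passes (rows, then columns) instead of one 2-D convolution."""
    kernel = gaussian_kernel_1d(sigma, radius)
    horizontal = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))
```

For a kernel of width w, the two 1-D passes cost about 2w multiply-adds per pixel instead of w squared for the full 2-D kernel, which is why separability matters for real-time smoothing.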

Page 18

Gaussian Smoothing (example)

Page 19

Algorithm 2: Canny Edge Detection

Edge detection is a common operation in image processing.

Edges are discontinuities in image gray levels and have strong intensity contrast.

Canny edge detection is an optimal edge-detection algorithm.

It is illustrated on the next slide with an example.
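The idea that edges are gray-level discontinuities can be sketched with a gradient-magnitude computation. This covers only the gradient stage that precedes Canny's non-maximum suppression and hysteresis thresholding; the central-difference operator and the function name `gradient_magnitude` are assumptions, not necessarily the operator used in the thesis.

```python
import math

def gradient_magnitude(image):
    """Central-difference gradient magnitude; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    mag = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            mag[y][x] = math.hypot(gx, gy)
    return mag

# A vertical step edge: dark left half, bright right half.
step = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = gradient_magnitude(step)
```

The response is non-zero only in the two columns that straddle the step, which is the strong intensity contrast an edge detector looks for.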

Page 20

Canny Edge Detection (example)

Page 21

Algorithm 3: KLT Tracking

• First proposed by Lucas and Kanade; extended by Tomasi and Kanade, and by Shi and Tomasi.

• First, determine which feature(s) to track through feature selection.

• Second, track the selected feature(s) across the image sequence.

• Rests on three assumptions: temporal persistence, spatial coherence, and brightness constancy.
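The two KLT stages can be sketched for a single feature window. Under brightness constancy and small motion, the displacement d of a window satisfies the 2x2 linear system G d = e, where G accumulates gradient products and e accumulates the frame difference weighted by the gradient; the smaller eigenvalue of G is the usual selection score. The sketch below is a single, non-iterative step in plain Python, with hypothetical function names and an assumed 5x5 window; it is not the thesis implementation.

```python
import math

def window_sums(I, J, cx, cy, half):
    """Accumulate the gradient matrix G and mismatch vector e over a window."""
    gxx = gxy = gyy = ex = ey = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            # Central differences, averaged over both frames.
            gx = ((I[y][x + 1] - I[y][x - 1]) + (J[y][x + 1] - J[y][x - 1])) / 4.0
            gy = ((I[y + 1][x] - I[y - 1][x]) + (J[y + 1][x] - J[y - 1][x])) / 4.0
            diff = I[y][x] - J[y][x]
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
            ex += diff * gx
            ey += diff * gy
    return gxx, gxy, gyy, ex, ey

def min_eigenvalue(gxx, gxy, gyy):
    """Smaller eigenvalue of G; large values mark well-textured windows."""
    return ((gxx + gyy) - math.sqrt((gxx - gyy) ** 2 + 4.0 * gxy * gxy)) / 2.0

def track_step(I, J, cx, cy, half=2):
    """Solve G d = e for the displacement d = (dx, dy) from frame I to frame J."""
    gxx, gxy, gyy, ex, ey = window_sums(I, J, cx, cy, half)
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture to track")
    return ((gyy * ex - gxy * ey) / det, (gxx * ey - gxy * ex) / det)
```

A practical tracker would iterate this step with sub-pixel interpolation and usually an image pyramid; the single step is shown only to make the roles of the three assumptions concrete.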

Page 22

Algorithm 3: KLT Tracking

Page 23

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Results
• Conclusion

Page 24

Gaussian Smoothing: Results

[Figure: smoothing results for the Lenna and Mandrill test images]

Page 25

Results: Gaussian Smoothing

Page 26

Canny edge detection: Results

[Figure: edge detection results for the Lenna and Mandrill test images]

Page 27

Results: Canny edge detection

Page 28

Results: Canny Edge Detection

Comparison with other implementations on Cell

Comparison with other implementations on GPU

Page 29

Results: KLT Tracking

Page 30

Results: KLT Tracking

Comparison with other implementations on GPU

Comparison with other implementations on GPU: no known implementations yet.

Page 31

Overview

• Motivation
• Contribution & scope
• Background
• Platforms
• Algorithms
• Results
• Conclusion & Extension

Page 32

Conclusion & Future work

• The GPU is still ahead of the other architectures and is the most suited to image processing applications.
• Further optimizing the PS3 implementation could narrow the gap between its timings and the GPU's.
• Future work could provide:
    • Support for faster color Canny.
    • Support for kernel widths larger than 5.
    • Better management of thread alignment on the GPU when dimensions are not a multiple of 16.
    • Intel Xeon and Larrabee as additional candidate architectures.

Page 33

Questions?

Page 34

Additional Slides

Page 35

CBE Architecture

• Contains a traditional microprocessor, the PowerPC Processor Element (PPE), which controls tasks.
    • 64-bit PPC with a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 512 KB L2 cache.
• The PPE controls 8 synergistic processor elements (SPEs) operating as SIMD units.
    • Each SPE has an SPU and a memory flow controller (MFC), and handles data-intensive tasks.
    • The SPU is a RISC core with 128 128-bit SIMD registers and a 256 KB local store (LS).
• The PPE, SPEs, MIC, and BIC are connected by the Element Interconnect Bus (EIB) for data movement: a ring bus consisting of four 16-byte channels providing a sustained bandwidth of 204.8 GB/s.
• The MIC connection to Rambus XDR memory and the BIC interface to I/O devices, connected via FlexIO, provide 25.6 GB/s of data bandwidth.

Page 36

CBE: What makes it fast?

• Huge inter-SPE bandwidth
    • 205 GB/s sustained output
• Fast main memory
    • 25.6 GB/s bandwidth for Rambus XDR memory
• Predictable DMA latency and throughput
    • DMA traffic has negligible impact on SPE local store bandwidth
    • Easy to overlap data movement with computation
• High-performance, low-power SPE cores

Page 37

Nvidia GeForce (Continued)

• The GPU has K multiprocessors (MPs)

• Each MP has L scalar processors (SPs)

• Each MP processes blocks in batches

• A block is processed by only one MP

• Each block is split into SIMD groups of threads (warps)

• A warp is executed physically in parallel

• A scheduler switches between warps

• A warp contains threads with consecutive, increasing thread IDs

• The warp size is currently 32 threads
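The way a block is split into warps can be sketched by emulating the ID arithmetic on the host. The sketch assumes a 2-D block whose threads are linearized as `ty * dim_x + tx`, the usual CUDA ordering, and uses the warp size of 32 stated above; `warps_in_block` is a hypothetical helper, not a CUDA API.

```python
WARP_SIZE = 32  # warp size stated on the slide

def warps_in_block(block_dim):
    """Group the threads of one block into warps of consecutive linear IDs."""
    dim_x, dim_y = block_dim
    linear_ids = [ty * dim_x + tx for ty in range(dim_y) for tx in range(dim_x)]
    return [linear_ids[i:i + WARP_SIZE]
            for i in range(0, len(linear_ids), WARP_SIZE)]

full = warps_in_block((16, 6))    # 96 threads: three full warps
partial = warps_in_block((6, 8))  # 48 threads: one full warp, one half-filled
```

A block whose thread count is not a multiple of 32 leaves its last warp partly empty, so the idle lanes still occupy scheduling slots.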

Page 38

CUDA: Programming model

[Figure: Block (2,1) of a grid, containing threads labelled from Thread(0,0) to Thread(5,7) and split into Warp 1 and Warp 2.]

• A grid consists of thread blocks

• Each thread executes the kernel

• Grid and block dimensions are specified by the application; the maximum is limited by GPU memory

• 1-, 2-, or 3-D grid layout

• Thread and block IDs are unique
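How an application might choose grid dimensions and map each thread to a pixel can be sketched by emulating the index arithmetic in Python. The 16x16 block size, the one-thread-per-pixel mapping, and all function names are assumptions for illustration; in a CUDA kernel the same expression is written with `blockIdx`, `blockDim`, and `threadIdx`.

```python
def grid_size(width, height, block=(16, 16)):
    """Number of blocks per axis needed to cover the image (ceiling division)."""
    return ((width + block[0] - 1) // block[0],
            (height + block[1] - 1) // block[1])

def pixel_for_thread(block_idx, thread_idx, block=(16, 16)):
    """Mirror of the usual CUDA expression blockIdx * blockDim + threadIdx."""
    return (block_idx[0] * block[0] + thread_idx[0],
            block_idx[1] * block[1] + thread_idx[1])

def covered_pixels(width, height, block=(16, 16)):
    """Every in-bounds pixel touched when each thread handles one pixel."""
    gx, gy = grid_size(width, height, block)
    pixels = set()
    for by in range(gy):
        for bx in range(gx):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    x, y = pixel_for_thread((bx, by), (tx, ty), block)
                    if x < width and y < height:  # guard for partial blocks
                        pixels.add((x, y))
    return pixels
```

Because block and thread IDs are unique, each pixel is assigned to exactly one thread; the bounds guard handles image sizes that are not a multiple of the block size.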

Page 39

CUDA: Memory model

• Shared memory (R/W): for sharing data within a block

• Texture memory: spatially cached

• Constant memory: 64 KB, cached

• Global memory: not cached; accesses should be coalesced

• Explicit GPU memory allocation and de-allocation

• Slow copying between CPU and GPU memory
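One common way to keep global-memory accesses coalescable is to pad each image row so that it starts on an aligned boundary. The sketch below assumes an alignment of 16 elements, matching the half-warp granularity of GeForce 8 class hardware, and uses hypothetical helper names; on the device, the CUDA runtime call `cudaMallocPitch` serves a similar purpose.

```python
HALF_WARP = 16  # assumed coalescing granularity, in elements

def padded_pitch(width, alignment=HALF_WARP):
    """Round the row length up to the next multiple of `alignment` elements."""
    return ((width + alignment - 1) // alignment) * alignment

def element_offset(x, y, pitch):
    """Linear offset of pixel (x, y) in a row-major buffer with padded rows."""
    return y * pitch + x

def row_starts_aligned(height, pitch, alignment=HALF_WARP):
    """True when every row begins on an aligned element boundary."""
    return all(element_offset(0, y, pitch) % alignment == 0
               for y in range(height))
```

For a 500-pixel-wide image, `padded_pitch(500)` gives 512, so every row starts aligned at the cost of 12 unused elements per row.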