Christian Steinle, University of Heidelberg, Institute of Computer Engineering1 L1 Hough Tracking on...

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

1

L1 Hough Tracking on a Sony Playstation III

Christian Steinle, Andreas Kugel, Reinhard MännerComputer Engineering, University of Heidelberg

• Contents

– Motivation for the Playstation implementation– Algorithm adaption– Job restrictions– Algorithm implementation– Status


2

• Hough Space: 1 dimension for 1 parameter of a track– bending 1/Pz, angle (Px/Pz) and Py/Pz)– Hough Space dimensions: 95 x 31 (cells) x 191 (layers)

– Detector slice with constant angle corresponds to one 2D Hough histogram

– Detector slices are overlapping (multiple scattering)

Z

X

Y

Py/Pz

Px/P z

1 /Pz

Motivation for the Playstation implementation


3


• L1 Hough Tracking algorithm development in HDL for FPGA HDL descriptions for all critical functional units

• Synthesis result of a single histogram layer implemented with registers shows the requirement of about 30.000 logic cells

50% of FPGA (XC4VLX60: 59,904) registers used for layer

• Conclusion: Develop a multi-chip solution


4


• Conclusion: Use the CellBE of a Sony Playstation III as cheap and

flexible rapid prototyping system

Multi-chip algorithm Multi-core platform


5

Algorithm adaption

• Conclusion: Two parallelism levels:

multiple processors work on input hit data in parallel vector capable ALUs process multiple layers in parallel

Creation of input data packages as so-called jobs


6

Job restrictions

1. Balance workload for the processors

• Number of histogram layers: 191• Number of processors for a Sony Playstation III: 6

– Cell BE: 8 but 1 disabled and 1 reserved for OS

yersForJobnumberOfLa32:n1Restrictio


7

Job restrictions

2. Optimal use of the processor‘s vector capable ALU

• Each histogram cell contains a 5 bit signature (1bit/detector)

• The vector capable ALU is 128 bit 128 bit / 5 bit / signature = 25 signatures in parallel

• Speed gain by using an implicite type instead of bit manipulation

unsigned char (8 bit) is ALU and memory applicable Each histogram cell contains actually 3 unused bits 128 bit / 8 bit / signature = 16 signatures in parallel

• Random access requires the composing of the processing ALU vector with 16 elements from the memory


8

Job restrictions

• Hough transformed Hit contains γmin and γmax

Hit is inserted into γmax - γmin + 1 consecutive layers

• Speed gain by optimizing the histogram memory structure while using the consecutive histogram layer entries for a single hit

Direct memory access on up to 16 consecutive layers in parallel

Each hit in a job is systolically processed just once in parallel A single hit can contribute to more than one job (γmax - γmin >

16, γmax of hit > γmax of actual job) yersForJobnumberOfLa16:n2Restrictio


9

Job restrictions

3. Optimal use of the processor‘s local storage

• Local storage contains program code, static data, heap and stack

• Conclusion: Used memory for job = codingTable + histogram + input data Available memory = stackPointer – heapPointer -

securityRegion ForJobusedMemoryemoryavailableM:n3Restrictio


10

Algorithm implementation

• Configuration part on the PPU


11


• Configuration part on the SPU


12


• Different versions of the job creation are possible

– CODEVERSION 1:• Create all jobs with a loop over all hits before the SPU processing

– CODEVERSION 2:• Create all jobs with a loop over all jobs before the SPU processing

– CODEVERSION 3:• Create the jobs parallely pipelined with the SPU processing

– CODEVERSION 4:• Create the jobs for dedicated SPUs when they are ready


13


• Processing part one on the PPU


14


• Processing part two on the PPU


15


• Processing part three on the PPU


16


• Different versions of the histogram memory usage

– MEMORYVERSION 1:• Use vector type for memory and computation

– MEMORYVERSION 2:• Use smallest type for a single histogram cell in the memory and

build vector type for computation

– MEMORYVERSION 3:• Use smallest type for histogram cells in parallel layers and build

vector type for computation


17


• Proccessing part on the SPU


18

Status

• Algorithm adaption is implemented

• Hough processing has to be implemented:– Actual step in scalar version: peakfinding2D– Missing steps in scalar version: peakfinding3D– Actual step in vector version: diagonalization– Missing steps in vector version: peakfinding2D, peakfinding3D

• Measure the timing

• Compare the timing with– the timing of the C++ framework implementation– the estimated timing of the FGPA implementation

Christian Steinle, University of Heidelberg, Institute of Computer Engineering1 L1 Hough Tracking on...

Documents

Transcript of Christian Steinle, University of Heidelberg, Institute of Computer Engineering1 L1 Hough Tracking on...