Partitioning of a Signal Detection Algorithm to a ...

13
Partitioning of a Signal Dete Algorithm to a Heterogene Multicomputing Platform Partitioning of a Signal Dete Partitioning of a Signal Dete Algorithm to a Heterogene Algorithm to a Heterogene Multicomputing Multicomputing Platform Platform Michael Vinskus Mercury Computer Systems, Inc. High Performance Embedded Computing (HPEC) Conference September 23, 2003

Transcript of Partitioning of a Signal Detection Algorithm to a ...

Page 1: Partitioning of a Signal Detection Algorithm to a ...

Partitioning of a Signal Detection Algorithm to a Heterogeneous

Multicomputing Platform

Partitioning of a Signal Detection Partitioning of a Signal Detection Algorithm to a Heterogeneous Algorithm to a Heterogeneous

Multicomputing Multicomputing PlatformPlatformMichael Vinskus

Mercury Computer Systems, Inc.

High Performance Embedded Computing (HPEC) ConferenceSeptember 23, 2003

Page 2: Partitioning of a Signal Detection Algorithm to a ...

PeakDetection

New EnergyAlarm

DelayMemory

Channelizer16K point real

4:1 Overlap

ADC100 MHz

14-Bit

DemodulatorBank

Tasking

? Typical acquisition systems place the delay memory after the analog to digital converter. For demodulation, a digital down converter(DDC) is used to heterodyne and filter the delayed data stream

? When the number of signals to demodulate becomes very large (>100), the typical DDC-based approach becomes cumbersome. FFT processing can allow the

? The goal of the acquisition system is to detect the presence of new signals in the environment in a timely manner so that they may be identified and exploited? A homogeneous system based on general purpose processors can be used to solve

the problem, but advances in FPGAs allow for a higher processing density, especially in the channelizer

? Current FPGAs can offer over a 10x computational density improvement over genepurpose processors for certain applications. Unfortunately, most communication fabrics have not scaled at the same rate

How can the I/O issue be managed?By partitioning across multiple processing elements!

Page 3: Partitioning of a Signal Detection Algorithm to a ...

Parameters:16384 point real DFT4:1 overlap and windowing (P = 4)16384 input sample maximum latency requirementOutput 8192 bins, complex, 24 bits per component (1200 MB/s)

? The channelizer decomposes the input sample stream into frequency channels by performing overlapped and windowed, short time Fourier transforms on the input data stream

ADC

100 MHz14 Bit

800MB/s

1200MB/s

200MB/s

4:1 Data Overlap and Windowing16384

4096

0

3

1

2

16384 point real FFT

??

?

?1

0

)()(N

n

kNWnxkX

The channelizer throughput is greater than most interconnect fabrics can support

Page 4: Partitioning of a Signal Detection Algorithm to a ...

Peak Findingand

Measurement

PeakAssociation

New EnergyAlarm

Log Magnitude

? ?)(log10 mX k

Bin ThresholdCalculation

Delay Memory

Alarm Report

Fc = 45.325 MHzBW = 30.21 kHzTup = 15.000

? The new energy alarm detects the presence of “new” signals and performs some rudimentary external measurements

? Partitioning strategies? Commutation? Time domain broadcast? Frequency division

? High-input bandwidth and computational requirements necessitate partitioning

Page 5: Partitioning of a Signal Detection Algorithm to a ...

Each processing partition generates the entire frequency sweep for a range of time slices

Each partition processes a contiguous segment of the input data stream By partitioning the problem in this manner, extra overhead to handle the segment boundaries is incurred

Shared

New Data

IPC

? ?P

PL 1?

IPC

IPC

? In this example, P-1 old input data blocks need to be received and P-1 blocks need to be sent

In this case, the channelizer is split into four partitions

Also, the system latency increases due to time expansion nature of the

Page 6: Partitioning of a Signal Detection Algorithm to a ...

To meet the one block latency requirement, only 1/4 of a block of new data is passed to each partition; 3/4 of the data comes from the other partitions. This results in the same data block being transferred an extra 3 times!

For this example, the input bandwidth increases from 200 MB/s to 800 MB/s. We’re going the wrong way!

Throughput Overhead for Time Commutation

50

100

150

200

250

300

350

Per

cen

t

Page 7: Partitioning of a Signal Detection Algorithm to a ...

New Data

SavedData

Each partition receives the entire data streamNo extra I/O overhead is incurredPartitioning does not add latency

No communication between channelizer partitionsStill need to send all of the channelizer output data to the detector processorsIf there are a sufficient number of detector processing elements allocated and they are located correctly in the fabric, then I/O bottlenecks can be avoided

Page 8: Partitioning of a Signal Detection Algorithm to a ...

200200200

AD

C

200 100150

200 200 20

0

200

250 250

200200

50 50

100/100

100

Det

ecto

r

Det

ecto

r

Det

ecto

r

150100

Ch

ann

eliz

er

Crossbar

CrossbarCrossbar

Ch

ann

eliz

er

Unu

sed

200 100150

200 200

250

250

200

200

50 50

100

Det

ecto

r

Det

ecto

r

150100

Ch

ann

eliz

er

Crossbar

CrossbarCrossbar

Ch

ann

eliz

er

Each fabric connection is a 266 MB/s half duplex link to the crossbar. Typical performance is around 250 MB/sTo accommodate the channelizer output rate, the output stream musplit across multiple connectionsThis complicates the data flow and fully utilizes the fabric I/O capacity in many places

Page 9: Partitioning of a Signal Detection Algorithm to a ...

? ?PNpk p 1??

Partition p

OutputInput? p kkG :1?

? ? pkNG)(ng

? ?1?pkX

? ?pp kkX :1?? ?1: 1 ??? ?pp kNkNX

Partition generates DFTpoints:

IPC

IPC

? Each node generates a subset of the frequency bins for all time slices

? Each partition needs all of the time series data for each sweep

Able to pipeline the channelizer and detection processing with minimal inter-

Page 10: Partitioning of a Signal Detection Algorithm to a ...

Each partition performs the front part of the FFT computations. This is inefficient, but the I/O issues are simplerUniform communication between partitions at partition boundaries only? Single element communication between channelizer partitions? Similar communication between detector partitions!

So why the unusual allocation of the frequency bins?

Partition p

Real DFT)(ng ? ??10log PeakDetection

Partition p-1

Page 11: Partitioning of a Signal Detection Algorithm to a ...

GR(k)

GI(k)

(k)

(N-k)

(k)

(N-k)

-

- ???

???

Nk

22

sin?

???

???

Nk

22sin ?

???

???

Nk

22cos ?

GR(N-k)

--

-

? ? ? ? ? ? ? ? ? ?? ?? ?kNXkXWjkNXkXkG kN ???????? *

2*

21

for k = 0, 1, ..., N

The algorithm exploits the efficient computation of the DFT of a2N-point, real-valued, input sequence with an N-point complex transformSince the input data is real, only one half of the output spectrneeds to be computed

By processing a range of X(k), X(N-k) frequency pairs, twice as many output points can be calculated with only two extra additions!

Page 12: Partitioning of a Signal Detection Algorithm to a ...

Unlike the time partitioning case, most of the I/O movement occurs locally in the fabric

200Inter-partition Comms

Ch

ann

eliz

er

200

200 200 20

0

200

Crossbar

250 20033+50 50

Ch

ann

eliz

er

AD

C

Det

ecto

r

Det

ecto

r

Det

ecto

r

200

Crossbar

Crossbar

Ch

ann

eliz

er

200

200

Crossbar

250 20033+50 50

Ch

ann

eliz

er

Unu

sed

Det

ecto

r

200

Crossbar

Crossbar

250

250

Redundant computations are performed in each channelizer partition to reduce the system I/O requirements

Exchanges processing capacity for reduced I/O complexity

Page 13: Partitioning of a Signal Detection Algorithm to a ...

The detection algorithms are different than the single antenna case. Typically, eigenspace methods are employed

The channelizer is similar between the single and multi-antenna casesDetection processing requires the time series data for each frequency binLog magnitude computation is not requiredEigenvalues and eigenvectors are computed for each bin, across all antennas

As shown in the previous example, frequency domain partitioning has advantages when I/O bound conditions occur

Eigen-analysis

Detector

FrequencyChannelizerA/D

FrequencyChannelizerA/D

FrequencyChannelizerA/D

? Adds an extra data dimension to the problem

? Typical antenna arrays consist of 4 - 8 antenna elements