Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by John K. Antonio University of Oklahoma Second Annual Review September 23, 1999


Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP

Presented by John K. Antonio

University of Oklahoma

Second Annual Review, September 23, 1999


Outline

• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed


Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP

University of Oklahoma: John K. Antonio and Sudarshan K. Dhall

Applications
• SAR
• STAP

Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration

New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic “tuning” of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system

Schedule (timeline marks: Jun 97 Start, Jun 98, Jun 99, Jun 00, Dec 00 End)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies

Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications


Outline

• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed


Personnel (Program Management Status)

• John K. Antonio, Principal Investigator
• Ph.D., Texas A&M University
• Professor/Director of CS, University of Oklahoma
• Over 70 publications in HPC and related areas
• PI or co-PI of 17 contracts/grants totaling over $2.1M


Personnel (Program Management Status)

• Sudarshan K. Dhall, Co-Principal Investigator
• Ph.D., University of Illinois
• Professor of CS, University of Oklahoma
• Over 80 publications, 2 books, 3rd underway
• PI or co-PI of grants and contracts totaling about $1M


Personnel (Program Management Status)

• Jack West, Research Scholar
  Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator; STAP Implementation

• Jeff Muehring, Research Scholar
  Optimal GPP/DSP/FPGA Configuration Techniques for SAR; SAR Implementation
  Intern at IBM/Houston, 8/99 to 1/00
  Research Scholar at OU, 1/00 to 7/00


Personnel (Program Management Status)

• Hongping Li, Research Assistant, Ph.D. Student
  Calibration of Power Prediction Simulator, System Interfacing, SAR Implementation

• Sirirut Vanichayobon, Research Assistant, Ph.D. Student
  FPGA-Based Linear Equation Solver for STAP, System Interfacing, STAP Implementation

• Seok-Hyun Ko, Research Assistant, M.S. Student
  Power Simulator Enhancements


Personnel (Program Management Status)

• Tim Osmulski, Research Assistant, M.S. Student
  Power Prediction Simulator for FPGAs
  Graduated May 1998

• Nikhil Gupta, Research Assistant, M.S. Student
  Algorithms for STAP Weight Calculation; Mapping Inner Product Computations onto FPGAs
  Graduated August 1998


Personnel (Program Management Status)

• Brian Veale, Research Assistant, M.S. Student
  Space and Power Study for High-Performance Integer and Floating Point Reconfigurable Architectures
  Graduated August 1999


Contacts, Partners, Vendors, and Other Communications (Program Management Status)

• DARPA: José Muñoz
• Rome Lab: Ralph Kohler
• MIT Lincoln Lab: David Martinez, Jim Ward
• MITRE: Richard Games
• Northrop Grumman: Marc Campbell
• Synplicity, Inc.: Madelyn Miller
• Xilinx: Jason Feinsmith
• Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
• ISI: Milissa Benincasa, David Coker
• Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms


Equipment Status (Program Management Status)

Mercury
• 20 Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• MC/OS, Cross Assembler, Toolkit
• MPI-Pro for MC/OS
• 9U VME RACE Board
• 1 SHARC Daughtercard (2 CNs, 8 MB/CN, 3 SHARCs/CN) = 6 SHARCs
• 3 SHARC Daughtercards (2 CNs, 16 MB/CN, 3 SHARCs/CN) = 18 SHARCs
• 4 PowerPC Daughtercards (2 CNs, 16 MB/CN, 1 PPC/CN) = 8 PPCs
• RIN-T Input Card
• ROUT-T Output Card

Annapolis Micro Systems
• 4 PCI WILDONE Cards (Xilinx 4028/4036)
• 4 PCI WILDFORCE Array Cards (5 Xilinx 4085s)
• Interfacing Cables

Other Vendors
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)


Schedule of Milestones (Program Management Status)

Timeline marks: June 1997, Mar. 1998, Dec. 1998, Sept. 1999, June 2000, Dec. 2000

Key: each milestone belongs to the GPP/DSP Sub-System, the FPGA Sub-System, or the GPP/DSP/FPGA System, and is marked as either Research/Design or Implement/Test.

Milestones
• Design STAP Iterative Weight Solver for FPGA
• Inter-GPP/DSP Comm. Simulator for STAP
• Optimal GPP/DSP Config. for SAR
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• Implement STAP Iterative Weight Solver on FPGA
• Optimal GPP/DSP Config. for STAP
• Implement SAR Linear Filtering on FPGA
• Optimal GPP/DSP/FPGA Config. for SAR/STAP
• GPP/DSP and FPGA Subsystem Design, Integration and Testing
• Optimal GPP/DSP/FPGA Config. for SAR
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
• Implement SAR on GPP/DSP
• Design SAR Linear Filtering for FPGA
• Implement STAP on GPP/DSP
• Implement SAR on GPP/DSP/FPGA Platform
• Optimal GPP/DSP/FPGA Config. for STAP
• Implement STAP on GPP/DSP/FPGA Platform
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator


Budget Summary (Program Management Status)

                 Current    Balance on   Projected Expenses   Projected Expenses
                 Budget     8/1/99       8/99-7/00            8/00-12/00
Personnel        246,223    108,635      154,024              52,123
Fringes           72,117     36,051       27,712               9,340
Consulting        40,000     37,000            0                   0
Expenses           9,781      6,261       10,000               5,069
Travel            17,545      4,889       12,000               7,372
Equipment        217,670     42,652       42,652                   0
Indirect Cost    181,262     90,632       87,317              31,674
Total            784,598    326,120      333,705             105,578


Outline

• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed


Highlights from Year 1

• Optimal Configuration of Compute Nodes for SAR Processing

• Network Simulator

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

• FPGA Power Prediction Simulator


Optimal Configuration of Compute Nodes for SAR Processing

(Highlights from Year 1)

• Motivation and SAR Basics

• Parallelization of SAR Processing

• The Optimal Configuration Problem
  - Formulation
  - Numerical Results

• Conclusions


Nominal UAV Payload

“Predator”


Footprint of Aerial Side-Looking SAR

[Figure: SAR footprint; labels: Targets, Azimuth, Velocity, Range, Footprint]


Offset Overlapping Beams

[Figure: offset overlapping beams; labels: v, Real Azimuth Resolution, Rs]


Synthetic Beams

[Figure: synthetic beams; labels: Azimuth, v, R, Rs, Compressed Resolution]


Optimal Configuration of Compute Nodes for SAR Processing

(Highlights from Year 1)

• Motivation and SAR Basics

• Parallelization of SAR Processing

• The Optimal Configuration Problem
  - Formulation
  - Numerical Results

• Conclusions


Parallelization of SAR Processing

[Figure: Range Processing (shown across 3 range processors) on a range samples by pulse number array, a Distributed Corner-Turn, and Azimuth Processing (shown across 4 azimuth processors) on a pulse number by range samples array; dimensions are labeled Kr and Sa]

where Sa is the azimuth section length and Kr is the range reference kernel size

Reference: T. Einstein, “Realtime Synthetic Aperture Radar Processing on the RACE Multicomputer,” App. Note 203.0, Mercury Computing Sys, 1996.


Sectioned Convolution

[Figure: sectioned convolution; labels: Kernel, Section, Overlap, Discard, FFT size]

• Large Overlap/Section ratio ⇒ Small azimuth memory, large number of azimuth processors
• Small Overlap/Section ratio ⇒ Large azimuth memory, small number of azimuth processors

Reference: T. Einstein, “Realtime Synthetic Aperture Radar Processing on the RACE Multicomputer,” App. Note 203.0, Mercury Computing Sys, 1996.
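The sectioned convolution on this slide can be illustrated with a small sketch. It assumes the standard overlap-save method: each FFT-sized block overlaps the previous one by (kernel length - 1) samples, and that many leading outputs of each block are discarded. The function names, the tiny recursive FFT, and the test data are illustrative assumptions, not taken from the presentation or the cited application note.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two), unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences via the FFT."""
    n = len(a)
    spectrum = [p * q for p, q in zip(fft(a), fft(b))]
    return [v / n for v in fft(spectrum, inverse=True)]


def sectioned_convolve(signal, kernel, fft_size):
    """Overlap-save convolution; returns the first len(signal) outputs."""
    k = len(kernel)
    section = fft_size - (k - 1)  # new (non-overlapping) samples per block
    assert section > 0 and fft_size & (fft_size - 1) == 0
    padded_kernel = list(kernel) + [0.0] * (fft_size - k)
    data = [0.0] * (k - 1) + list(signal)  # zero history for the first block
    out = []
    pos = 0
    while pos < len(signal):
        block = data[pos:pos + fft_size]
        block += [0.0] * (fft_size - len(block))
        y = circular_convolve(block, padded_kernel)
        out.extend(v.real for v in y[k - 1:])  # discard the wrapped outputs
        pos += section
    return out[:len(signal)]


def direct_convolve(signal, kernel):
    """Reference: direct linear convolution, truncated to len(signal)."""
    return [
        sum(kernel[j] * signal[m - j] for j in range(len(kernel)) if m - j >= 0)
        for m in range(len(signal))
    ]
```

With a fixed FFT size, a longer kernel leaves fewer new samples per block, which is the overlap-to-section trade-off the slide describes.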


System Parameters

• radar-dependent: R (range), Rs (range swath), and λ (wavelength)

• application-dependent: δ (desired resolution) and v (platform velocity)

• processor-dependent: αr and αa (non-fast-convolution range and azimuth loading) and γ (fast convolution throughput)

• software-dependent: Sa (azimuth convolution section length), Fa (azimuth FFT length), and Fr (range FFT length)


Derivations for Memory and Processor Requirements

[Equations: processor requirements for range and azimuth processing (Pr, Pa) and memory requirements for range and azimuth processing (Mr, Ma), expressed in terms of the system parameters v, δ, R, Rs, λ, αr, αa, γ, Sa, Fa, and Fr]


Optimal Configuration of Compute Nodes for SAR Processing

(Highlights from Year 1)

• Motivation and SAR Basics

• Parallelization of SAR Processing

• The Optimal Configuration Problem
  - Formulation
  - Numerical Results

• Conclusions


Optimal Configuration Formulation

• Objective: Determine configurations for the CNs, number of CNs of each configuration, and section size, to satisfy processor and memory requirements and minimize power consumption

• Notation and Definitions:
  - CN Configuration: Specifies the daughtercard type and number of range and azimuth CEs (per configured CN)
  - X, Y: The two possible CN configurations
  - XT, YT: Daughtercard type for each CN configuration


Optimal Configuration Formulation

• Notation and Definitions:
  - Xr, Yr: Number of range processors per CN (for each configuration)
  - Xa, Ya: Number of azimuth processors per CN (for each configuration)
  - NX, NY: Number of CNs of configurations X and Y
  - ΠCN(•): Power per CN as a function of daughtercard type
  - MCN(•): Memory per CN as a function of daughtercard type
  - PCN(•): Processors per CN as a function of daughtercard type


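Optimal Configuration Formulation

Minimize:

\[
Z = N_X \, \Pi_{CN}(X_T) + N_Y \, \Pi_{CN}(Y_T)
\]

Subject to:

\begin{align*}
P_r &\le N_X X_r + N_Y Y_r \\
P_a(S_a) &\le N_X X_a + N_Y Y_a \\
M_{CN}(X_T) &\ge X_r \frac{M_r}{P_r} + X_a \frac{M_a(S_a)}{P_a(S_a)} \\
M_{CN}(Y_T) &\ge Y_r \frac{M_r}{P_r} + Y_a \frac{M_a(S_a)}{P_a(S_a)} \\
X_r + X_a &\le P_{CN}(X_T) \\
Y_r + Y_a &\le P_{CN}(Y_T) \\
F_a = 2^k &\ge S_a + K_a, \quad k = 1, 2, \ldots \\
N_X, N_Y, X_r, X_a, Y_r, Y_a &\ge 0, \quad S_a \ge 1
\end{align*}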
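To make the formulation concrete, the following is a minimal brute-force sketch of the search, not the mathematical programming method used in the project. It assumes one fixed section size, so that the processor and memory requirements (`P_r`, `P_a`, `M_r`, `M_a`) are plain numbers; an outer loop over section sizes would complete the search. The function name, the `cards` table layout, the `max_cns` bound, and every numeric value are hypothetical.

```python
from itertools import product


def min_power_configuration(P_r, P_a, M_r, M_a, cards, max_cns=16):
    """Exhaustive search for the minimum-power pair of CN configurations.

    P_r, P_a: required range / azimuth processors (fixed section size).
    M_r, M_a: total range / azimuth memory required.
    cards: {type: (power_per_CN, memory_per_CN, processors_per_CN)},
           standing in for the Pi_CN, M_CN, and P_CN functions.
    """
    def layouts(card_type):
        # All (range, azimuth) processor splits that fit the card's
        # processor count and per-CN memory.
        _, memory, procs = cards[card_type]
        for n_range in range(procs + 1):
            for n_azimuth in range(procs + 1 - n_range):
                needed = n_range * M_r / P_r + n_azimuth * M_a / P_a
                if needed <= memory:
                    yield n_range, n_azimuth

    best = None
    for x_t, y_t in product(cards, repeat=2):
        for (x_r, x_a), (y_r, y_a) in product(layouts(x_t), layouts(y_t)):
            for n_x in range(max_cns + 1):
                for n_y in range(max_cns + 1 - n_x):
                    if n_x * x_r + n_y * y_r < P_r:
                        continue  # not enough range processors
                    if n_x * x_a + n_y * y_a < P_a:
                        continue  # not enough azimuth processors
                    z = n_x * cards[x_t][0] + n_y * cards[y_t][0]
                    if best is None or z < best["Z"]:
                        best = {"Z": z, "N_X": n_x, "N_Y": n_y,
                                "X": (x_t, x_r, x_a), "Y": (y_t, y_r, y_a)}
    return best


# Hypothetical card table: type -> (watts per CN, MB per CN, processors per CN)
example_cards = {1: (12.0, 16.0, 3), 2: (9.0, 16.0, 1)}
example = min_power_configuration(P_r=4, P_a=6, M_r=20.0, M_a=60.0,
                                  cards=example_cards)
```

The result records each configuration as (daughtercard type, range processors, azimuth processors), mirroring the XT Xr Xa YT Yr Ya labels used in the numerical results.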


Optimal Configuration of Compute Nodes for SAR Processing

(Highlights from Year 1)

• Motivation and SAR Basics

• Parallelization of SAR Processing

• The Optimal Configuration Problem
  - Formulation
  - Numerical Results

• Conclusions


Minimum Power


Azimuth FFT Size


Optimal Azimuth Section Size


Optimal Ratio of Kernel Size to Section Size


Percentage of Power Usage by Card Type 1


Optimal CN Configurations

[Figure: regions of the Resolution (0.5 to 2) versus Velocity (50 to 400) plane, each labeled with its optimal CN configuration as the six values XT Xr Xa YT Yr Ya; labels shown include 1 1 2 2 0 1, 1 2 1 2 0 2, 1 3 0 2 0 2, 1 3 0 2 1 1, 2 0 2 2 1 1, 1 1 2 2 1 1, 2 1 1 2 2 0, and 1 1 2 2 0 2]


Optimal Configuration of Compute Nodes for SAR Processing

(Highlights from Year 1)

• Motivation and SAR Basics

• Parallelization of SAR Processing

• The Optimal Configuration Problem
  - Formulation
  - Numerical Results

• Conclusions


Conclusions

• A method for optimally configuring CN-based parallel systems for SAR processing was introduced.

• The method provides detailed HW and SW design and implementation information about how to best utilize system resources for given values of application parameters.

• The numerical studies show that the optimal ratio of daughtercard types can be relatively constant over regions of the application parameter space.

• For a fixed hardware configuration, the CNs can be re-configured (via software re-configuration) to achieve optimal power consumption over specified regions.


Highlights from Year 1

• Optimal Configuration of Compute Nodes for SAR Processing

• Network Simulator

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

• FPGA Power Prediction Simulator


Network Simulator (Highlights from Year 1)

• Parallel STAP: The Motivation behind the Network Simulator

• Overview of the Network Simulator

• Numerical Studies

• Conclusions


STAP Processing Flow

[Figure: STAP processing flow on the data cube (axes: Pulses, Channels, Range); blocks: Input Data, Pulse Compress, Rotate, Doppler Filter, Rotate, QR Decomposition, Steering Vectors, Weights, Beamform, Beam Outputs]


Sub-Cube Bar Partitioning Methodology

1. Partition STAP data cube over a 2-D process set.
2. Process the contiguous dimension.
3. Re-partition the data cube before processing the next dimension.
4. Rotate the newly distributed data to make the next dimension sequential in memory.
5. Repeat steps 1 through 4 before each processing phase.
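The partitioning step of this methodology can be sketched as index bookkeeping. The example below assumes a data cube of 200 range gates, 22 pulses, and 16 channels mapped onto a 3 x 4 process set, with channels always split across rows; the helper names and the nearly-equal block split are illustrative assumptions rather than the project's implementation.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal index ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def partition(shape, whole_dim, row_dim, col_dim, rows, cols):
    """Return {(row, col): {dim: index range}} for a rows x cols process set.

    The whole dimension is kept intact on every process so that it can be
    processed without communication; the other two are split over the grid.
    """
    assert {whole_dim, row_dim, col_dim} == set(shape)
    row_parts = split(shape[row_dim], rows)
    col_parts = split(shape[col_dim], cols)
    return {
        (r, c): {whole_dim: range(shape[whole_dim]),
                 row_dim: row_parts[r],
                 col_dim: col_parts[c]}
        for r in range(rows) for c in range(cols)
    }


cube = {"range": 200, "pulses": 22, "channels": 16}
# Pulse compression works along range, so range is kept whole.
pulse_compression = partition(cube, "range", "channels", "pulses", 3, 4)
# Doppler filtering works along pulses, so pulses are kept whole.
doppler_filtering = partition(cube, "pulses", "channels", "range", 3, 4)
```

Comparing the two dictionaries for the same process shows which indices it must give up and which it must receive when the cube is re-partitioned between phases.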


STAP Data Cube Partitioning Examples

[Figure: Pulse Compression Partitioning with range dimension whole; the channels by pulses face of the data cube is mapped onto a 3 x 4 process set (processes 1 to 12)]

[Figure: Doppler Filtering Partitioning with pulses dimension whole; the channels by range face of the data cube is mapped onto a 3 x 4 process set (processes 1 to 12)]


STAP Data Cube Repartitioning

• Re-partitioning involves exchanging data with the next whole dimension.

• Interprocessor communication is required between processors in the same row.

[Figure: data cube (axes: Pulses, Range, Channels) on a 3 x 4 process set (processes 1 to 12), first with the range dimension contiguous and then with the pulse dimension contiguous]
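The volume of data exchanged during re-partitioning can be estimated with a short sketch. It assumes channels are split across the rows of the process set, pulses are split across columns before the exchange (range whole), and range is split across columns after it (pulses whole), so that data moves only within a row. The function names, the nearly-equal split, and the cube sizes are illustrative assumptions.

```python
def split_sizes(n, parts):
    """Sizes of `parts` contiguous, nearly equal pieces of n items."""
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def repartition_volumes(n_range, n_pulses, n_channels, rows, cols):
    """Elements moved from process (row, src) to process (row, dst).

    Before: each process holds all range gates for its pulses and channels.
    After: each process holds all pulses for its range gates and channels.
    Entries with src == dst are data that stays on the same process.
    """
    channel_sizes = split_sizes(n_channels, rows)
    pulse_sizes = split_sizes(n_pulses, cols)  # owned before the exchange
    range_sizes = split_sizes(n_range, cols)   # owned after the exchange
    volumes = {}
    for row in range(rows):
        for src in range(cols):
            for dst in range(cols):
                volumes[(row, src, dst)] = (
                    channel_sizes[row] * pulse_sizes[src] * range_sizes[dst]
                )
    return volumes


volumes = repartition_volumes(n_range=200, n_pulses=22, n_channels=16,
                              rows=3, cols=4)
moved = sum(v for (row, src, dst), v in volumes.items() if src != dst)
```

A table like `volumes` is one plausible form for the per-process message sizes that a network simulation of this exchange would need as input.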


STAP Data Cube Repartitioning: Data Re-distribution Mapping

[Figure: Required Data Transfers between the Pulse Compression partitioning and the Doppler Filtering partitioning (processes 1 to 12; axes: Pulses, Channel, Range), shown with the Network Interconnection Configuration of CNs attached to 6-port crossbars; labels: IPC]


Network Simulator (Highlights from Year 1)

• Parallel STAP: The Motivation behind the Network Simulator

• Overview of the Network Simulator

• Numerical Studies

• Conclusions


Some RACE Network Features

1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.

2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)

3. A packet with higher priority preempts (suspends) a lower priority packet (active or inactive) to gain control of a crossbar port.
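As a quick worked example of what these figures imply, the sketch below computes the peak link bandwidth and the time to move one full packet. It assumes an idealized link that transfers one 32-bit word per clock cycle with no setup, arbitration, or preemption overhead; that assumption and the function names are not from the slides.

```python
CLOCK_HZ = 40_000_000      # 40 MHz network clock
DATA_PATH_BITS = 32        # 32-bit data paths
PACKET_BYTES = 2048        # maximum packet size


def peak_bandwidth_bytes_per_s(clock_hz=CLOCK_HZ, path_bits=DATA_PATH_BITS):
    """Peak bandwidth if one data-path word moves on every clock cycle."""
    return clock_hz * (path_bits // 8)


def packet_transfer_time_us(packet_bytes=PACKET_BYTES, clock_hz=CLOCK_HZ,
                            path_bits=DATA_PATH_BITS):
    """Idealized time in microseconds to stream one packet over one link."""
    cycles = packet_bytes // (path_bits // 8)
    return cycles / clock_hz * 1e6


bandwidth = peak_bandwidth_bytes_per_s()   # 160,000,000 bytes per second
packet_time = packet_transfer_time_us()    # about 12.8 microseconds
```

Under these assumptions a full 2048-byte packet occupies a link for 512 clock cycles, which gives a sense of the time granularity at which preemption can matter.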


RACE Network Interconnect: Fat-Tree Topology

[Figure: fat tree of 6-port crossbars with CNs at the leaves, showing the Message Path from a Message Source CN to a Message Destination CN]


Standard Crossbar Priority Arbitration Algorithm Table

Transaction Status:
           Port E Involved,          Port E Involved,          Port E Not Involved
           Active                    Not Yet Active
Hardware
Priority   Entry Port   Exit Port    Entry Port   Exit Port    Entry Port   Exit Port
7          F            A,B,C,D,E    F            A,B,C,D,E    F            A,B,C,D
6          E            F            E            F            A,B,C,D*     A,B,C,D*
5          A,B,C,D      F            A,B,C,D      F            A,B,C,D      F
4          E            A,B,C,D      E            A,B,C,D      -            -
3          *A,B,C,D     *A,B,C,D,E   A,B,C,D*     A,B,C,D*     -            -
2          -            -            A,B,C,D      E            -            -
1          -            -            -            -            -            -

* Peer Kill Rules Apply


Network Class Details

Network Methods
• Dynamic Network Construction
• Dynamic Routing Table Creation
• Dynamic CN and CE Message Traffic Generation
• Simulates Packet Traffic

Clock
• Based on Network Clock Frequency (factor of 5)
• Data Transfer Rate Equates to Effective Network Bandwidth

Random Scan
• Generates Pseudo-Random CN Scan Ordering

Compute Node
• Processor Information
• Outgoing and Received Message Queues
• Outgoing and Received Packet Stack

[Figure: a Network object containing Crossbar, Link, and Compute Node objects]


Crossbar Class Details

Crossbar
• Two Parent Port Connections
• Four Child Port Connections
• Internal Switch Connections
• Four CN Connections for Terminal Crossbars

Crossbar Methods
• Implements Hardware Priority Arbitration
  - Top-Level Algorithm
  - Standard Algorithm
• Query Port Status
• Routes Packets to Next Location
• Allocates and Frees Internal Port Connections and Connected Link Objects
• Transmits Packet Data

Link
• Connects Crossbar Objects
• Link Status: Occupied or Free


Compute Node Class Details

Compute Node
• Processor Information
• Outgoing and Received Message Queues
• Outgoing and Received Packet Stack
• Packets are self-routing

Compute Node Methods
• Manages Outgoing and Received Message Queues
• Manages Outgoing and Received Packet Stack
• Explodes the Top Outgoing Message into Packets of Size 2048 or Less
• Handles DMA Chaining of Packets
• Establishes Path Through Network and Transmits Packet Data

[Figure: the top message of the Outgoing Message Queue (Message 1, Message 2, Message 3, ...) is exploded into the Packet Stack (Packet 1, Packet 2, Packet 3, Packet 4, ...)]
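The "explode" step described for the compute node can be sketched as follows. This is a minimal illustration that tracks only message and packet sizes in bytes; the function name, the use of a deque for the message queue and a list for the packet stack, and the example sizes are assumptions, not details of the actual simulator.

```python
from collections import deque

MAX_PACKET_BYTES = 2048


def explode(message_bytes, max_packet=MAX_PACKET_BYTES):
    """Split a message length into packet lengths of at most max_packet."""
    full, rest = divmod(message_bytes, max_packet)
    return [max_packet] * full + ([rest] if rest else [])


# Outgoing message queue holding three message sizes in bytes (hypothetical).
outgoing_messages = deque([5000, 2048, 100])
packet_stack = []

# Explode the top outgoing message and push its packets so that the first
# packet ends up on top of the stack.
top_message = outgoing_messages.popleft()
packets = explode(top_message)
packet_stack.extend(reversed(packets))
```

Here the 5000-byte message becomes two full 2048-byte packets and one 904-byte packet, and the next call to `packet_stack.pop()` yields the first of them.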


Simulator UML Sequence Diagram

[Figure: UML sequence diagram with participants User (actor), Data Cube, Process Set, Network, Crossbar, CN, and Clock. Inputs shown: R:200, P:22, C:16; CEs:48; X:6, Y:8; Routing:F; CN Traffic, Phase 1; DMA:Y. Steps shown: Build Messages, Message Matrices, X, Y, Mapping Matrices, Pass 1, Pass 2, Increment Simulation Clock, Connection/Data Transfer, Clean Up (* iterative process). Results shown: Messages, Time; Simulation Time = 2 ms]


Packet UML Statechart: Simulation Pass 1 and Pass 2

[Figure: statechart of the Simulation Pass Subsystem with packet states Start Up, Ready, Active, Blocked, Suspended, Waiting for Kill, and Completed; transitions are marked Pass 1 or Pass 2]


Network Simulator (Highlights from Year 1)

• Parallel STAP: The Motivation behind the Network Simulator

• Overview of the Network Simulator

• Numerical Studies

• Conclusions


Process Set Performance Metric: Communication Phase 1

[Figure: Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); count versus time (ms), with series CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), and CN 12 (4x9)]


Process Set Performance Metric: Communication Phase 2

[Figure: Process Set - Phase 2 (CN:12, R:200, P:22, C:16, Routing:F); count versus time (ms), with series CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), and CN 12 (4x9)]


Message Traffic Performance Metric: Communication Phase 1

[Figure: Message Traffic - Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF). Count (0 to 9) versus Time (ms, 2 to 2.5). Series: CN Traffic, CE Traffic.]

Page 60: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Message Traffic Performance Metric: Communication Phase 2

[Figure: Message Traffic - Phase 2 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF). Count (0 to 8) versus Time (ms, 10 to 25). Series: CN Traffic, CE Traffic.]

Page 61: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

DMA Chaining Performance Metric: Communication Phase 1

[Figure: DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F). Count (0 to 9) versus Time (ms, 14 to 22). Series: Chaining, No Chaining.]

Page 62: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

DMA Chaining Performance Metric: Communication Phase 2

[Figure: DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F). Count (0 to 8) versus Time (ms, 21 to 27). Series: Chaining, No Chaining.]

Page 63: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Network Simulator (Highlights from Year 1)

• Parallel STAP: The Motivation behind the Network Simulator

• Overview of the Network Simulator

• Numerical Studies

• Conclusions

Page 64: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Conclusions

1. Designed and implemented a platform-independent simulator.

2. The simulator demonstrates that the process set, the CN or CE message traffic, DMA chaining, adaptive routing, and the scheduling of messages all affect performance.

3. Allows users to experiment with possible current and future configurations.

4. The communication pattern is implemented for STAP but may be used for other applications with a phased communication pattern.

Page 65: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Highlights from Year 1

• Optimal Configuration of Compute Nodes for SAR Processing

• Network Simulator

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

• FPGA Power Prediction Simulator

Page 66: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Highlights from Year 1)

• Overview of STAP Weight Calculation

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

• Conclusions

Page 67: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

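Typical STAP Processing Flow

[Figure: block diagram of the STAP processing chain. Blocks: Input Data, Pulse Compress, Doppler Filter, Weight Computation (with Covariance Matrix and Steering Vector), Weight Application, Threshold Detection, Target Decision. Data cubes are labeled with the axes pulses, range, and Doppler. Percentage labels shown in the figure: 8%, 91.5%, 0.5%.]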

Page 68: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Kth-Order Doppler Factored STAP

Space-Time Adaptive Processing

• Effective partially adaptive STAP technique

• The architecture consists of

  • Doppler processing across all pulse repetition intervals

  • Adaptive filtering across

    • all channels and

    • K adjacent Doppler bins

Page 69: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Kth-Order Doppler Factored STAP

$$x(k,r) : \hat{N} \times 1 = 3N \times 1$$

$$\psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r)$$

[Figure: data matrix needed for calculating the covariance matrix for the kth Doppler bin and bth range segment using Kth-order Doppler factored STAP with K = 3. Axes: N channels; Doppler bins (k - 1), k, (k + 1); bth range segment (with L_r cells).]

Page 70: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Matrix-Based Derivation of $\psi(k,b)$

$$X(k,b) : \hat{N} \times L_r = 3N \times L_r$$

$$\psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) = \frac{1}{L_r} X(k,b)\, X^H(k,b)$$

The Weight Equation:

$$\psi(k,b)\, w(k,b) = s$$

Page 71: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

STAP Weight Calculation

Using the QR-decomposition method to solve the weight equation $\psi w = s$:

$$\psi(k,b)\, w(k,b) = s$$

$$\frac{1}{L_r} X(k,b)\, X^H(k,b)\, w(k,b) = s$$

Take the QR decomposition:

$$X^T(k,b) = QR$$

$$\frac{1}{L_r} R^T Q^T Q^* R^* w(k,b) = s$$

$$\frac{1}{L_r} R^T R^* w(k,b) = s$$

Note that $R^T = [\, R_1^T \;\; 0 \,]$, so

$$R_1^T R_1^* w(k,b) = L_r s$$

Let $R_1^* w(k,b) = p$.

Solve $R_1^T p = L_r s$ for $p$ using forward elimination.

Solve $R_1^* w(k,b) = p$ for $w(k,b)$ using backward substitution.
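The QR-based procedure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are real-valued (so the conjugate is a no-op), the triangular factor is computed with modified Gram-Schmidt, and the function names (`qr_r1`, `forward_elimination`, `backward_substitution`, `qr_weights`) are hypothetical, not taken from the project's code.

```python
def qr_r1(m):
    """Return R1 (n x n, upper triangular) from the thin QR of m (rows x n).

    Assumption: modified Gram-Schmidt on a full-column-rank real matrix.
    """
    rows, n = len(m), len(m[0])
    cols = [[m[i][j] for i in range(rows)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = sum(v * v for v in cols[j]) ** 0.5
        q = [v / r[j][j] for v in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def forward_elimination(lower, rhs):
    """Solve lower * p = rhs for a lower-triangular matrix."""
    n = len(rhs)
    p = [0.0] * n
    for i in range(n):
        p[i] = (rhs[i] - sum(lower[i][j] * p[j] for j in range(i))) / lower[i][i]
    return p


def backward_substitution(upper, rhs):
    """Solve upper * w = rhs for an upper-triangular matrix."""
    n = len(rhs)
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(upper[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (rhs[i] - tail) / upper[i][i]
    return w


def qr_weights(x, s):
    """Solve (1/L_r) X X^T w = s without forming the covariance matrix.

    x is the n x L_r data matrix (real-valued here), s the steering vector.
    """
    n, l_r = len(x), len(x[0])
    x_t = [[x[i][j] for i in range(n)] for j in range(l_r)]  # X^T, L_r x n
    r1 = qr_r1(x_t)
    r1_t = [[r1[j][i] for j in range(n)] for i in range(n)]
    p = forward_elimination(r1_t, [l_r * v for v in s])      # R1^T p = L_r s
    return backward_substitution(r1, p)                      # R1 w = p


# Small illustrative data set (hypothetical values): 2 rows, L_r = 4 columns.
x_example = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
w_example = qr_weights(x_example, [1.0, 0.0])
```

Working on the factor of the data matrix rather than on the covariance matrix itself is the point of the method: the covariance matrix is never formed explicitly.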

Page 72: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

STAP Weight Calculation

Using the Conjugate Gradient method to solve the weight equation $\psi w = s$:

Initialization: choose $w_0$, set $d_0 = -g_0 = s - \psi w_0$.

Iteration:

$$w_{i+1} = w_i - \frac{g_i^T d_i}{d_i^T \psi\, d_i}\, d_i$$

$$g_{i+1} = \psi\, w_{i+1} - s$$

$$d_{i+1} = -g_{i+1} + \frac{g_{i+1}^T \psi\, d_i}{d_i^T \psi\, d_i}\, d_i$$
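A minimal Python sketch of the conjugate gradient iteration may help make the update order concrete. It assumes a real, symmetric, positive-definite matrix stored as a list of rows; the function names and the stopping tolerance are illustrative choices, not the authors' implementation.

```python
def mat_vec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(psi, s, w0, tol=1e-10, max_iter=100):
    """Solve psi w = s for a real symmetric positive-definite psi."""
    w = list(w0)
    g = [a - b for a, b in zip(mat_vec(psi, w), s)]  # g_0 = psi w_0 - s
    d = [-x for x in g]                              # d_0 = -g_0
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 <= tol:                  # residual small enough
            break
        psi_d = mat_vec(psi, d)
        denom = dot(d, psi_d)                        # d_i^T psi d_i
        alpha = -dot(g, d) / denom
        w = [wi + alpha * di for wi, di in zip(w, d)]
        g = [a - b for a, b in zip(mat_vec(psi, w), s)]
        beta = dot(g, psi_d) / denom
        d = [-gi + beta * di for gi, di in zip(g, d)]
    return w


# Hypothetical 2 x 2 example: the exact solution is [1/11, 7/11].
w_example = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```

Loosening `tol` stops the loop earlier, which is the accuracy-versus-operation-count tradeoff that the tolerance axis in the numerical studies explores.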

Page 73: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Numerical Studies

[Figure: two panels, Lr = 125 and Lr = 250. Flop Count (10^8 to 10^10) versus Tolerance (10^-8 to 10^-1). Series: CG, QR.]

Page 74: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Highlights from Year 1)

• Overview of STAP Weight Calculation

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

• Conclusions

Page 75: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Inner Product Co-Processor: Design 1

• Multiply-Accumulate Pipe

• Reads two block floating point operands per cycle

• Performs two operations per cycle

• Performs exponent normalization prior to accumulation

• 2 N-vectors reduced to a constant number of partial sums

[Figure: host processor connected over the interconnection bus to the FPGA board. On the board, input buffers supply operands a and b to a multiplier (X), followed by a 1's comp/register stage driven by the sign of a·b, a normalizing unit, an adder (+), and an output register. Datapath width: sign + 16-bit mantissa.]

Page 76: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Inner Product Co-Processor: Design 2

• Multiply-Add Reduction Pipe

• Reads four operands per cycle

• Performs three operations per cycle

• No normalization required

• 2 N-vectors reduced to N/2 partial sums

• Basic Tradeoff: First design has lower throughput, but can perform more work

[Figure: host processor connected over the interconnection bus to the FPGA board. On the board, buffers supply data for the first and second multipliers (X, X), followed by 1's comp/register stages driven by sign a and sign b and an adder (+). Datapath width: sign + 16-bit mantissa. Annotations: "2 ff", "Unit clocked here".]

Page 77: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Highlights from Year 1)

• Overview of STAP Weight Calculation

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

• Conclusions

Page 78: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Two Orders of Magnitude Experiment

[Figure: four histograms with Frequency on the vertical axis. Data Histogram (data values 0 to 110); Exponent Histogram (exponents 119 to 145); Accuracy Histogram, Design 1 (accuracy 0.999893 to 0.999927); Accuracy Histogram, Design 2 (accuracy 0.99399 to 0.99999).]

Page 79: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Five Orders of Magnitude Experiment

[Figure: four histograms with Frequency on the vertical axis. Data Value Histogram (data values 0 to 103008); Exponent Histogram (exponents 119 to 143); Accuracy Histogram, Design 1 (accuracy 0.999912 to 0.999998); Accuracy Histogram, Design 2 (accuracy 0.00000 to 0.99998).]

Page 80: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

“Outlier” Experiment

[Figure: four histograms with Frequency on the vertical axis. Data Value Histogram (data values 0.0009 to 1000.0000, with the outlier marked); Exponent Histogram (exponents 114 to 138); Accuracy Histogram, Design 1 (accuracy 0.593067 to 0.78369); Accuracy Histogram, Design 2 (accuracy 0.00 to 0.92).]

Page 81: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

(Highlights from Year 1)

• Overview of STAP Weight Calculation

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

• Conclusions

Page 82: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Conclusions

• CG weight solver provides a tradeoff between accuracy and required FLOPs (compared to QR weight solver)

• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2

• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges

• Block floating point accuracy breaks down when there are a few large outliers in the data set
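The sensitivity of block floating point to outliers can be illustrated with a small Python sketch. It is not a model of either FPGA design: it simply assumes one shared exponent per vector, a sign plus 16-bit mantissa per element, truncation toward zero, and exact integer accumulation. All names are hypothetical.

```python
import math

MANTISSA_BITS = 16  # sign + 16-bit mantissa, as on the design slides


def to_block_fp(values):
    """Quantize a vector to integer mantissas sharing one block exponent."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    exponent = math.frexp(peak)[1]            # peak < 2**exponent
    scale = 2.0 ** (MANTISSA_BITS - exponent)
    # int() truncates toward zero; small elements can lose all their bits.
    return [int(v * scale) for v in values], exponent


def block_fp_inner_product(a, b):
    """Inner product computed on the quantized mantissas."""
    ma, ea = to_block_fp(a)
    mb, eb = to_block_fp(b)
    acc = sum(x * y for x, y in zip(ma, mb))  # exact integer accumulation
    return acc * 2.0 ** (ea + eb - 2 * MANTISSA_BITS)


# Well-scaled data: every element keeps most of its mantissa bits.
well_scaled = block_fp_inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# One large outlier sets the block exponent; the small elements are
# quantized to zero, so their contribution to the sum is lost entirely.
with_outlier = block_fp_inner_product([1.0e6, 1.0e-3, 1.0e-3], [0.0, 1.0, 1.0])
```

In the second call the exact inner product is 0.002, but the shared exponent chosen for the outlier leaves no mantissa bits for the small elements.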

Page 83: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Highlights from Year 1

• Optimal Configuration of Compute Nodes for SAR Processing

• Network Simulator

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

• FPGA Power Prediction Simulator

Page 84: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Power Prediction Simulator

(Highlights from Year 1)

• CMOS Power Consumption and Past Research

• Design and Implementation of the Power Prediction Simulator

• Conclusions and Demo

Page 85: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Power Dissipation in CMOS

• Leakage Current

• Transient Current: dependent on signal activity

• Dynamic Capacitance Charging Current: most important for CMOS; dependent on clock frequency and on signal activity

Page 86: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

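Power Equations

Equivalent model of a transistor’s gate...

$$v_c(t) = V\left(1 - e^{-t/RC}\right)$$

$$v_R(t) = V e^{-t/RC}$$

$$p_R(t) = \frac{V^2 e^{-2t/RC}}{R}$$

$$p_{avg} = \frac{1}{\tau}\int_0^{\tau} \frac{V^2 e^{-2t/RC}}{R}\, dt = \frac{CV^2}{2\tau}\int_0^{\tau} \frac{2}{RC}\, e^{-2t/RC}\, dt$$

$$p_{avg} = \frac{CV^2}{2\tau}\left(1 - e^{-2\tau/RC}\right) \approx \frac{CV^2}{2\tau}$$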
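As a quick numerical check of the approximation, the sketch below integrates the resistor power over one period and compares it with the closed form. The component values are arbitrary illustrative assumptions (they are not from the slides), chosen so that the period is much longer than the RC time constant.

```python
import math


def average_power_numeric(v, r, c, tau, steps=100000):
    """Average of p_R(t) = V^2 exp(-2t/RC) / R over [0, tau], midpoint rule."""
    dt = tau / steps
    total = sum(
        v * v * math.exp(-2.0 * (i + 0.5) * dt / (r * c)) / r
        for i in range(steps)
    )
    return total * dt / tau


def average_power_closed_form(v, r, c, tau):
    """Closed form: (C V^2 / (2 tau)) * (1 - exp(-2 tau / RC))."""
    return c * v * v / (2.0 * tau) * (1.0 - math.exp(-2.0 * tau / (r * c)))


# Hypothetical values: 3.3 V, 1 kOhm, 1 pF, 20 ns period (tau = 20 RC).
V, R, C, TAU = 3.3, 1.0e3, 1.0e-12, 20.0e-9
numeric = average_power_numeric(V, R, C, TAU)
closed = average_power_closed_form(V, R, C, TAU)
approx = C * V * V / (2.0 * TAU)  # the C V^2 / (2 tau) approximation
```

When tau is many RC time constants long, the exponential term is negligible and the three values agree closely, which is why the resistance drops out of the final expression.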

Page 87: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Probabilistic Modeling

p(s): the probability that signal s attains a logical value of true at any given clock cycle.

A(s): the probability that signal s transitions at any given clock cycle.

Signal   p(s)   A(s)
clock    0.50   1.0
x1       0.88   0.10
x2       0.29   0.17
x3       0.69   0.27
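The two definitions, p(s) and A(s), can be estimated from a sampled waveform. The sketch below is a minimal illustration that assumes one logic sample per clock cycle; the function name and the sample waveform are hypothetical.

```python
def signal_statistics(bits):
    """Estimate p(s) and A(s) from a list of per-cycle logic values (0 or 1).

    p is the fraction of cycles in which the signal is true; A is the
    fraction of cycle boundaries at which the signal changes value.
    """
    p = sum(bits) / len(bits)
    transitions = sum(1 for prev, cur in zip(bits, bits[1:]) if prev != cur)
    a = transitions / (len(bits) - 1)
    return p, a


# Hypothetical waveform: mostly high, with one short low pulse.
p_est, a_est = signal_statistics([1, 1, 1, 0, 1, 1, 1, 1, 1, 1])
```

A signal that is high most of the time but rarely changes has a large p and a small A, which is why both quantities are tracked separately.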

Page 88: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Probabilistic Modeling

[Figure: logic network with inputs x1, x2, x3 and output y, shown with example waveforms for each signal.]

Signal       p      A
x1(t)        0.88   0.10
x2(t)        0.29   0.17
x3(t)        0.69   0.27
x1x2(t)      0.83   0.17
x1x2x3(t)    0.10   0.13

Calculation of average power:

$$P_{avg} = \frac{1}{2} V^2 \sum_{g \,\in\, \text{all gates}} C_g A_g$$

Page 89: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Probabilistic Equations

Signal probability transformations...*

[Equation: p(y) expressed in terms of the probabilities of the product terms π_i, i = 1, ..., k, of the function f. The full expression is not recoverable from the extracted text.]

Signal activity transformations...†

[Equation: A(y) expressed as a sum over input states X of P(X) multiplied by terms built from f(X) ⊕ f(X; z_i), f(X) ⊕ f(X; z_i, z_j), and f(X) ⊕ f(X; z_i, z_j, z_k), summed over distinct indices from 1 to n. The full expression is not recoverable from the extracted text.]

* Probabilistic Treatment of General Combinatorial Networks
† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching

Page 90: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Power Prediction Simulator

(Highlights from Year 1)

• CMOS Power Consumption and Past Research

• Design and Implementation of the Power Prediction Simulator

• Conclusions and Demo

Page 91: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Design

FPGA internal structure design...

[Figure: FPGA internal structure with CLB, IOB, and BUF elements.]

Page 92: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Routing Fabric Design

Example routings...

Xilinx 4000 series routing fabric is very intricate.

Xilinx synthesis tools use shortest path routing where possible.

The distance the signal travels is the metric considered in this model.

Page 93: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Signal Design

Fields: Symbolic Probability, Numeric Probability, Numeric Activity, Signal Reference, Manhattan Distance

[Figure: two CLBs illustrating a Local Signal (L) and a Remote Signal (R).]

Page 94: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Routing Example

[Figure: routing example with six LUTs and an interconnection block; labels: 4, 4.]

Page 95: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Routing Signal Connections

[Figure: the routing example with its six LUTs, with each signal connection marked as remote (R) or local (L).]

Page 96: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

FPGA Power Prediction Simulator

(Highlights from Year 1)

• CMOS Power Consumption and Past Research

• Design and Implementation of the Power Prediction Simulator

• Conclusions

Page 97: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Conclusions

• Designed and Implemented power prediction simulator for Xilinx 4000 series FPGAs.

• Inputs to simulator:

  • Place & Route bit stream (from Xilinx Tool)

  • Activity and Probability factors for pin signals

• Simulator calculates probabilities and activities for all internal signals

• Tool outputs power consumption of FPGA chip

• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods

Page 98: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

• Program Overview and Introduction (Quad Chart)

• Program Management Status

• Highlights from Year 1

• Highlights from Year 2

• Work to be Completed

Outline

Page 99: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Highlights from Year 2

• Efforts to Calibrate the FPGA Power Prediction Simulator

• Comparison of Integer and Floating Point Computations on FPGAs

• Architecture of Prototype System for SAR and STAP Processing

• Integration of Reconfigurable Computing into SAR

• Configuration Technique for STAP

Page 100: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Basic Approach to Calibration

• N x N array of CLBs (configurable logic blocks)

• Programmable interconnect

• Let S denote the set of all internal signals for a configuration and S_i denote all signals of length i

• Let A_i denote the sum of activities for all signals of length i

• 2N + 1 distinct capacitances (C) dependent on signal length

$$P_{avg} = \frac{1}{2} V^2 f \sum_{s \in S} C_{d(s)} A_s$$

$$P_{avg} = \frac{1}{2} V^2 f \left( C_0 \sum_{s \in S_0} A_s + C_1 \sum_{s \in S_1} A_s + \cdots + C_{2N} \sum_{s \in S_{2N}} A_s \right)$$

Page 101: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Basic Approach to Calibration

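$$\frac{1}{2} V^2 f \begin{bmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,2N} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,2N} \\ \vdots & \vdots & \ddots & \vdots \\ A_{2N,0} & A_{2N,1} & \cdots & A_{2N,2N} \end{bmatrix} \begin{bmatrix} C_0 \\ C_1 \\ \vdots \\ C_{2N} \end{bmatrix} = \begin{bmatrix} P_0 \\ P_1 \\ \vdots \\ P_{2N} \end{bmatrix}$$

• For the j-th design/data set combination:

  • let P_j denote the measured power

  • let A_{j,k} denote the aggregate activity of all signals of length k

• For each design/data set combination, the simulator provides the values for one row of the above matrix

• Given 2N + 1 measured values for P_j, the unknown capacitance values are then determined. This is how the simulator is calibrated.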
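The calibration step amounts to solving a square linear system for the capacitances. The following Python sketch shows the idea on a toy case with N = 1 (three unknowns). It assumes the activity matrix is non-singular and uses plain Gaussian elimination with partial pivoting; the function names and all numeric values are hypothetical, not measured data.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def calibrate_capacitances(activity, power, v, f):
    """Recover C_0 .. C_2N from (1/2) V^2 f * activity * C = power.

    activity[j][k]: aggregate activity of length-k signals in combination j
    power[j]: measured power for design/data set combination j
    """
    scale = 0.5 * v * v * f
    return solve_linear(activity, [p / scale for p in power])


# Toy example with N = 1: synthesize "measurements" from known capacitances.
V, F = 3.3, 50.0e6
true_c = [1.0e-12, 2.0e-12, 3.0e-12]
activity = [[1.0, 2.0, 3.0], [2.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
power = [0.5 * V * V * F * sum(a * c for a, c in zip(row, true_c))
         for row in activity]
recovered_c = calibrate_capacitances(activity, power, V, F)
```

With noisy measurements, or with more design/data set combinations than unknowns, a least-squares fit would be the natural alternative to an exact solve; that is a design choice beyond what the slide states.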

Page 102: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Efforts to Calibrate the Simulator

• For the Xilinx 4036 family of parts, N = 36

• Generated a total of 73 (= 2N + 1) design/data set combinations

• Created a utility for generating data sets with specified statistics

• Created a utility for computing statistics associated with a given data set

• Attempts at Measuring Consumed Power

  • Heat

  • Current

Page 103: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Heat Measurement Approach

Page 104: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Heat Measurement Approach (continued)

Page 105: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Current Measurement Approach

Page 106: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Highlights from Year 2

• Efforts to Calibrate the FPGA Power Prediction Simulator

• Comparison of Integer and Floating Point Computations on FPGAs

• Architecture of Prototype System for SAR and STAP Processing

• Integration of Reconfigurable Computing into SAR

• Configuration Technique for STAP

Page 107: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Comparison of Integer and Floating Point Computations on FPGAs

(Highlights from Year 2)

• Integer Pipelined Multiplier

• Floating Point Pipelined Multiplier

• Floating Point Pipelined Adder

• Comparison of Two Inner-Product Designs

• Conclusions

Page 108: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Array-Based Integer Multiplier

[Figure: the partial products b0A through b11A feed a chain of carry-save adders, CSA 0 through CSA 9. The sum and carry outputs of the last stage feed a Propagate Adder.]
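The carry-save structure of the array multiplier can be mimicked in software. The sketch below is a behavioral illustration only (it says nothing about pipelining or FPGA resource use): each carry-save step reduces three numbers to a sum word and a carry word without propagating carries, and a single ordinary addition at the end plays the role of the propagate adder. Function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """Reduce three integers to (sum, carry) with no carry propagation."""
    s = x ^ y ^ z                             # bitwise sum of the three inputs
    c = ((x & y) | (x & z) | (y & z)) << 1    # majority bits, shifted left
    return s, c


def array_multiply(a, b, width=12):
    """Multiply two unsigned `width`-bit integers (width >= 3)."""
    # Partial product i is A shifted left by i when bit b_i is set.
    partial = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    s, c = carry_save_add(partial[0], partial[1], partial[2])
    for pp in partial[3:]:
        s, c = carry_save_add(s, c, pp)
    return s + c                              # final propagate adder


product = array_multiply(1234, 567)
```

For 12-bit operands this uses ten carry-save steps (three partial products in the first, one more in each of the remaining nine), which matches the CSA 0 through CSA 9 labels in the figure.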

Page 109: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Carry-Save Adders in a 5-bit Multiplier

[Figure: the partial-product bits b0a0 through b4a4 feed rows of half adders and full adders that form CSA 0, CSA 1, and CSA 2, followed by a Propagate Adder.]

Page 110: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

[Figure: the sum and carry outputs of CSA 9 feed the Propagate Adder, drawn as one half adder and eleven full adders, which produces the upper 13 bits of the product.]

Page 111: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

• The Wild-One system runs at a maximum speed of 50 MHz

• The 4036xla has more routing resources than the 4028ex

• Table shows maximum achieved clock rate as a function of the number of pipelined stages employed

Pipelining Results for Array-Based Integer Multiplier

# of stages   Speed (MHz), 4028ex   Speed (MHz), 4036xla
1             14                    28
2             19                    25
3             21                    N/A
4             22                    27
5             29                    28
6             39                    28
7             22                    29
8             33                    50

Page 112: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Comparison of Integer and Floating Point Computations on FPGAs

(Highlights from Year 2)

• Integer Pipelined Multiplier

• Floating Point Pipelined Multiplier

• Floating Point Pipelined Adder

• Comparison of Two Inner-Product Designs

• Conclusions

Page 113: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

16-bit Floating-Point Format

• The floating point format chosen is a 16-bit format supported by the ADSP-2106x family of SHARC DSP processors

• The exponent is represented in excess-7 notation

• Range: ±1.5625×10^-2 to ±2.559375×10^2

Short Word Floating-Point Format

Bit 15: sign (s); bits 14 to 11: exponent (e3 ... e0); bits 10 to 0: fraction (f10 ... f0), with the mantissa interpreted as 1.f
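A short Python decoder makes the bit layout and the stated range concrete. It assumes the field positions given on this slide (sign in bit 15, exponent in bits 14 to 11, fraction in bits 10 to 0) and treats an exponent field of 0 as zero and of 15 as infinity, following the special cases used in the multiplier design; the function name is hypothetical.

```python
def decode_short_float(word):
    """Decode a 16-bit word: 1 sign bit, 4-bit excess-7 exponent, 11-bit fraction."""
    sign = -1.0 if (word >> 15) & 1 else 1.0
    exponent = (word >> 11) & 0xF
    fraction = word & 0x7FF
    if exponent == 0:            # assumption: exponent field 0 encodes zero
        return sign * 0.0
    if exponent == 15:           # assumption: exponent field 15 encodes infinity
        return sign * float("inf")
    # Hidden leading 1: mantissa = 1.f, scaled by 2**(exponent - 7)
    return sign * (1.0 + fraction / 2048.0) * 2.0 ** (exponent - 7)


smallest = decode_short_float(0x0800)   # exponent 1, fraction 0    -> 0.015625
largest = decode_short_float(0x77FF)    # exponent 14, fraction max -> 255.9375
one = decode_short_float(0x3800)        # exponent 7, fraction 0    -> 1.0
```

The smallest and largest finite magnitudes produced this way, 1.5625×10^-2 and 2.559375×10^2, agree with the range quoted on the slide.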

Page 114: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Floating Point Multiplier

[Figure: floating point multiplier datapath. Inputs: signs s1 and s2, exponents e1 and e2 (bits e(3) to e(0)), mantissas 1.m1 and 1.m2. The mantissas feed a 12-bit array-based multiplier that produces the upper 13 bits of the product; the exponents feed an excess-7 adder with an exponent adjust select and unf/ovf outputs. Output fields: sign (1 bit), exponent (4 bits), mantissa (11 bits).]

Mantissa selection:

• If the msb = 1, take the bits msb-1…msb-11

• If the msb = 0, take the bits msb-2…msb-11

Exponent handling:

• If underflow = 1, set exponent = 0

• If overflow = 1, set exponent = 15 (representing infinity)

• If e1 or e2 = 0, set exponent = 0

• If e1 or e2 = 15, set exponent = 15

Page 115: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Comparison of Integer and Floating Point Computations on FPGAs

(Highlights from Year 2)

• Integer Pipelined Multiplier

• Floating Point Pipelined Multiplier

• Floating Point Pipelined Adder

• Comparison of Two Inner-Product Designs

• Conclusions

Page 116: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Floating Point Adder

[Figure: pipelined floating point adder. Inputs: s1, s2, e1, e2, 1.m1, 1.m2. Stages shown, separated by registers: Check for Absolute Zero and Infinity and Add Phantom Bit; Compare Exponents by Subtraction (difference, pos./neg.); Choose Exponent; Align Mantissas; Add/Subtract Mantissas; Normalize Mantissa and Adjust Exponent. Outputs: sign, exponent, mantissa.]

Page 117: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Comparison of Integer and Floating Point Computations on FPGAs

(Highlights from Year 2)

• Integer Pipelined Multiplier

• Floating Point Pipelined Multiplier

• Floating Point Pipelined Adder

• Comparison of Two Inner-Product Designs

• Conclusions

Page 118: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Inner Product Co-processor Designs

[Figure: two block diagrams. Multiply-Add Scheme: Input Buffer, two Pipeline Multipliers, Pipeline Adder, Output Buffer. Multiply-Accumulate Scheme: Input Buffer, Pipeline Multiplier, Pipeline Adder, Output Buffer.]

Page 119: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Performance

Co-Processor Type             Speed (MHz)   # of CLBs   # of Flip-Flops   # of 3-Input LUTs   # of 4-Input LUTs   Equivalent Gate Count   Estimated Power Consumption
Integer Multiply-Accumulate   50            622         720               180                 794                 10076                   N/A
Integer Multiply-Add          43            1013        1148              423                 1421                16809                   415
F.P. Multiply-Accumulate      38            437         414               154                 742                 8072                    454
F.P. Multiply-Add             34            716         654               254                 1082                11766                   390

Notes:

1. Integer co-processors implemented with 16-bit integer multipliers and 32-bit integer adders

2. The estimated power consumption is calculated from the power simulator based on simplified (non-calibrated) constants:

$$\text{Estimated Power} = \frac{1}{2}\left( \sum_{s \in S_0} A_s + \sum_{s \in S_1} A_s + \cdots + (2N-1) \sum_{s \in S_{2N-1}} A_s + 2N \sum_{s \in S_{2N}} A_s \right)$$
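The non-calibrated estimate can be written as a short weighted sum. The sketch below assumes one reading of the slide's formula, in which signals of length 0 and length 1 both carry weight 1 and signals of length i (i ≥ 2) carry weight i; the function name and the input values are hypothetical.

```python
def estimated_power(activity_by_length):
    """Non-calibrated power estimate from per-length activity sums.

    activity_by_length[i] is the summed activity of all signals of length i.
    Assumed weights: 1 for lengths 0 and 1, and i for every length i >= 2.
    """
    return 0.5 * sum(max(1, i) * a for i, a in enumerate(activity_by_length))


# Hypothetical activity sums for signal lengths 0, 1, and 2.
example = estimated_power([2.0, 1.0, 0.5])
```

Because the weights grow with interconnection length, a design with a few long, active signals can score worse than one with many short ones.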

Page 120: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

F.P. Multiply-Add vs F.P. Multiply-Accumulate Non-Weighted Activity Values

[Figure: Activity Value (0 to 3.5) versus Interconnection Length (0 to 46). Series: Multiply-Add, Multiply-Accumulate.]

Page 121: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

F.P. Multiply-Add vs F.P. Multiply-Accumulate Linearly-Weighted Activity Values

[Figure: Weighted Activity (0 to 70) versus Interconnection Length (0 to 46). Series: Multiply-Add, Multiply-Accumulate.]

Page 122: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Comparison of Integer and Floating Point Computations on FPGAs

(Highlights from Year 2)

• Integer Pipelined Multiplier

• Floating Point Pipelined Multiplier

• Floating Point Pipelined Adder

• Comparison of Two Inner-Product Designs

• Conclusions

Page 123: Optimal Configuration of Combined GPP/DSP/FPGA …antonio/pubs/p-ann_rev99acs.pdf · Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP Presented by ... Pulse

Conclusions

• Developed libraries of efficient integer and floating point pipelined multipliers and adders

• Discovered that increasing the degree of pipelining increases required hardware

• Discovered that increasing the degree of pipelining generally increases maximum clock rate

• 16-bit F.P. inner-product designs require less hardware than integer inner-product designs, which employ 16-bit multiplier(s) and a 32-bit adder

• Multiply-accumulate designs consume more power (estimated) than multiply-add designs due to the requirement for long feedback paths

• Developed a 50-page User’s Manual for the Annapolis System


Highlights from Year 2

• Efforts to Calibrate the FPGA Power Prediction Simulator

• Comparison of Integer and Floating Point Computations on FPGAs

• Architecture of Prototype System for SAR and STAP Processing

• Integration of Reconfigurable Computing into SAR

• Configuration Technique for STAP


Architecture of Prototype System

[Figure: block diagram of the prototype. Labeled blocks: Data Source; Reconfigurable Subsystem (Annapolis System with PEs, in a PC, PCI), shown twice; DSP/GPP Subsystem (Mercury System with CNs, VME, SPARC); Data Sink. Link labels: 120 MB/sec, Custom.]


SAR Processing Flow

[Figure: SAR processing flow. Labeled stages: Range Compression, Data Transfer, Azimuth Processing. Data axes: Azimuth, Range.]


STAP Processing Flow

[Figure: STAP processing flow. Labeled stages: Range Compression, Doppler Filtering, Weight Computation, with two Data Transfer steps. Data axes: Doppler, Channel, Range.]


Refer to Poster for Physical View of Architecture


Highlights from Year 2

• Efforts to Calibrate the FPGA Power Prediction Simulator

• Comparison of Integer and Floating Point Computations on FPGAs

• Architecture of Prototype System for SAR and STAP Processing

• Integration of Reconfigurable Computing into SAR

• Configuration Technique for STAP


Integration of Reconfigurable Computing into SAR

(Highlights from Year 2)

• The SAR Benchmark

• Comparison of Two FIR Filter Designs

• Including FPGAs in the SAR Optimization Formulation


The SAR Benchmark

• Retrieved Benchmark from http://www.rl.af.mil/programs/hpcbench/

• Developed under the ARPA/Tri-Service Rapid Prototyping of Application Specific Signal Processors (RASSP) program

• Two main programs

– Synthetic SAR data generator (400 lines of code)

– Serial SAR processor (1600 lines of code)

• The SAR algorithm is stripmap mode; it currently processes 4 frames of HH polarization data


Integration of Reconfigurable Computing into SAR

(Highlights from Year 2)

• The SAR Benchmark

• Comparison of Two FIR Filter Designs

• Including FPGAs in the SAR Optimization Formulation


Comparison of Two FIR Filter Designs

[Figure: block diagrams of two four-tap FIR filter structures built from registers (D Q), constant-coefficient multipliers (×k0 to ×k3), and adders, with n-bit data paths.]

Serial-Multiply/Parallel Add

• Ease of routing

• Poor modularity

Parallel-Multiply/Serial Add

• Poor routing

• Good modularity


Comparison of Two FIR Filter Designs

• Both designs implemented using fixed-point complex data (16-bit fixed-point real and imaginary components)

• Both designs make use of constant coefficient multipliers (from core generator)

• Four tap serial-multiply/parallel-add filter fit onto one 4036xla part

• Three tap parallel-multiply/serial-add filter fit onto one 4036xla part (insufficient routing resources for four taps)

• Four tap parallel-multiply/serial-add filter implemented across two parts on one board (one 4036 and one 4013)
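As a behavioral illustration of the two filter structures compared above, the following Python sketch models each one at the sample level. It rests on an assumption the slides do not state: that serial-multiply/parallel-add corresponds to the textbook direct form (samples shift through a delay line and all tap products are summed together) and that parallel-multiply/serial-add corresponds to the transposed form (the input is broadcast to every multiplier and partial sums pass through a chain of registers). The function names and the sample data are hypothetical, and the model ignores word length, pipelining, and routing.

```python
def fir_direct(samples, coeffs):
    """Direct form: delay line on the input, all tap products summed at once."""
    delay = [0] * len(coeffs)            # delay[0] holds the newest sample
    out = []
    for x in samples:
        delay = [x] + delay[:-1]         # shift the new sample into the delay line
        out.append(sum(k * d for k, d in zip(coeffs, delay)))
    return out


def fir_transposed(samples, coeffs):
    """Transposed form: input broadcast to every multiplier, serial adder chain."""
    n = len(coeffs)
    regs = [0] * (n - 1)                 # registers between the chained adders
    out = []
    for x in samples:
        products = [k * x for k in coeffs]
        out.append(products[0] + (regs[0] if regs else 0))
        # Each register latches the next product plus the register behind it.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < n - 1 else 0)
                for i in range(n - 1)]
    return out


# Hypothetical complex samples with integer components and four integer taps.
samples = [complex(1, -2), complex(3, 4), complex(-5, 6), complex(7, 0), complex(0, -1)]
coeffs = [2, -1, 3, 1]
same = fir_direct(samples, coeffs) == fir_transposed(samples, coeffs)
```

Both functions compute the same output sequence, so the choice between the structures is a matter of hardware cost (routing versus modularity), which is what the slides compare.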


Integration of Reconfigurable Computing into SAR

(Highlights from Year 2)

• The SAR Benchmark

• Comparison of Two FIR Filter Designs

• Including FPGAs in the SAR Optimization Formulation


Including FPGAs in the SAR Optimization Formulation

• Power estimates must be determined for a range of kernel sizes for both filter designs

• Hybrid designs may exist for multi-chip implementations that yield desired features of both modularity and routability

• Binary optimization variable defines whether entry-FPGA or DSP/GPP subsystems perform range compression

• Real optimization variable defines fraction of azimuth processing divided among GPP/DSP and exit-FPGA subsystems


Highlights from Year 2

• Efforts to Calibrate the FPGA Power Prediction Simulator

• Comparison of Integer and Floating Point Computations on FPGAs

• Architecture of Prototype System for SAR and STAP Processing

• Integration of Reconfigurable Computing into SAR

• Configuration Technique for STAP


Configuration Technique for STAP

• Incorporate New Features into the Network Simulator

• Testing and Calibration of the Network Simulator

• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer

• Optimization Problem

• Computational Investigation


NEW FEATURES FOR THE NETWORK SIMULATOR

• Incorporate Software Overhead Times in the Simulation Model

– Currently, the simulator performs hardware switch-level modeling (i.e., packet-level simulation at the crossbar level).

– Modify the Network Simulator to include software overhead times for two communication protocols.

– Empirical analysis will be utilized to capture software overhead times for the communication protocols.

• Provide Additional Timing Information from Simulation Runs

– Currently, the simulator outputs completion times after each corner turn of the STAP data cube.

– Modify the Network Simulator to output message queue completion times for each Compute Node (CN) sending messages.

– Message queue completion times will become vital input into the optimization algorithm.

• Add PowerPC Compute Node Configuration to the Simulator


INCORPORATE SOFTWARE OVERHEAD TIMES

• Communication Time for a Message:

$$T_C = T_{O(\mathrm{Software})} + T_{O(\mathrm{Hardware})} + \frac{M}{B}$$

where:

$T_C$ = Completion Time

$T_{O(\mathrm{Software})}$ = Software Overhead Time

$T_{O(\mathrm{Hardware})}$ = Hardware Overhead Time

$M$ = Message Size

$B$ = Network Bandwidth

Slide annotations: "Modeled by Simulator"; "Include Software Overhead Time in the Simulation Model".
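The communication-time model on this slide can be evaluated directly. The sketch below is a minimal illustration with hypothetical function and parameter names; the units (bytes, bytes per second, seconds) are assumptions chosen for the example and are not specified on the slide.

```python
def completion_time(message_bytes, bandwidth_bytes_per_s,
                    sw_overhead_s=0.0, hw_overhead_s=0.0):
    """Evaluate T_C = T_O(Software) + T_O(Hardware) + M / B, in seconds."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return sw_overhead_s + hw_overhead_s + message_bytes / bandwidth_bytes_per_s


# A zero-length message isolates the two overhead terms (values are made up).
overhead_only = completion_time(0, 120e6, sw_overhead_s=2e-6, hw_overhead_s=1e-6)
```

With a message size of zero the transfer term vanishes, which is why timing zero-length messages is a natural way to measure overhead empirically.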


SOFTWARE PROTOCOLS

• Two Communication Protocol Times will be added to the Simulation Model

– DMA MC/OS Communication Times (DMA Transfers between CNs)

– MPI (Message Passing Interface) Software Layer Communication Times

• Incorporating Software Overhead Times into the Simulation Model will be accomplished through Empirical Analysis.

– For each of the two software protocols, zero length messages will be sent through the network. Their resulting communication times will be measured.

– After analysis of multiple runs, the simulator will be calibrated to include both DMA transfer overhead and MPI software overhead.


SOFTWARE COMPONENTS

[Figure: layered software stack on a compute node. Labeled blocks: User Application; MPI Software Layer; MC/OS Runtime Environment; Interprocessor Communication System (ICS); POSIX API; MCexec; Loadable Device Drivers; ‘DX’ Data Transfer Facility; HARDWARE ABSTRACTION LAYER; DMA Controller; CN ASIC Registers, Interrupts, Timers, etc.; CPU Registers.]


PROPOSED WORK

• Incorporate New Features into the Network Simulator

• Testing and Calibration of the Network Simulator

• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer

• Optimization Problem

• Computational Investigation


TESTING AND CALIBRATION OF THE NETWORK SIMULATOR

• Test Specific Communication Patterns to Verify Accuracy of the Network Simulator

– Implement a Communication Task on the Mercury RACE® Computer

– Replicate the Communication Task on the Network Simulator

– Compare the Resultant Completion Times

– If Necessary, Fine-Tune the Network Simulator

• Two Types of Communication Patterns will be used to Test and Calibrate the Network Simulator

– Simple Test Patterns (Hand-Calculated Verification)

– Complex Test Patterns (Empirical Verification)


TESTING AND CALIBRATION WITH SPECIFIC TEST PATTERNS

• Simple Test Patterns (Hand-Calculated Verification)

– Implement simple test patterns between CNs to verify the accuracy and assist in fine-tuning of the Network Simulator. The test pattern communication time can be hand-calculated for comparison to the simulated result.

• Single Source Message Tests

• Two Source Message Tests (Non-Contending Paths)

• Two Source Message Tests (Contending Paths)

• N Source Message Tests (Non-Contending Paths)

• N Source Message Tests (Contending Paths)

• Complex Test Patterns (Empirical Verification)

– Implement more complex basic communication patterns to test the validity of the simulator. Empirical data from the Mercury Computer implementing the same test pattern will be used to calibrate the Network Simulator.

• All-to-All Personalized Communication Test

• Randomized Message Queue Communication Test


SIMPLE TEST PATTERNS: Single Source Message Tests

• Test Plan Development Diagram

[Figure: test plan flow from START to RUN TEST. Levels varied: number of messages (Single Message, Two Messages, 3..N Messages); packets per message (Single Packet, Two Packets, 3..P Packets); crossbars (Single Crossbar, 3..C Crossbars).]
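The single-source test plan diagram names three things that vary: the number of messages, the packets per message, and the number of crossbars. One way to turn that into a concrete list of test cases is sketched below. It assumes that every combination of levels is run, which the diagram does not state explicitly; the level labels are copied from the diagram and the function name is hypothetical.

```python
from itertools import product

# Levels named in the test plan diagram for single-source tests.
MESSAGES = ["1", "2", "3..N"]
PACKETS_PER_MESSAGE = ["1", "2", "3..P"]
CROSSBARS = ["1", "3..C"]


def single_source_test_cases():
    """Enumerate every combination of the three test dimensions."""
    return [
        {"messages": m, "packets_per_message": p, "crossbars": c}
        for m, p, c in product(MESSAGES, PACKETS_PER_MESSAGE, CROSSBARS)
    ]


cases = single_source_test_cases()   # 3 * 3 * 2 combinations
```

The two-source plans could reuse the same enumeration with an extra dimension for contending versus non-contending paths.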


SIMPLE TEST PATTERNS: Two Source Message Tests (*Non-Contending Paths)

• Test Plan Development Diagram (For Each Source)

[Figure: test plan flow from START to RUN TEST. Levels varied: messages per CN (Single Message / CN, Two Messages / CN, 3..N Messages / CN); packets per message (Single Packet, Two Packets, 3..P Packets); crossbars (Single Crossbar, 3..C Crossbars), all non-contending.]


SIMPLE TEST PATTERNS: Two Source Message Tests (*Contending Paths)

• Test Plan Development Diagram (For Each Source)

[Figure: test plan flow from START to RUN TEST. Levels varied: messages per CN (Single Message / CN, Two Messages / CN, 3..N Messages / CN); packets per message (Single Packet, Two Packets, 3..P Packets); crossbars (Single Crossbar, 3..C Crossbars), all contending.]


Configuration Technique for STAP

• Incorporate New Features into the Network Simulator

• Testing and Calibration of the Network Simulator

• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer

• Optimization Problem

• Computational Investigation


MERCURY RACE® COMPUTER CONFIGURATION

[Figure: multi-level crossbar network connecting compute nodes (CNs), with a VME Port and I/O. CN types in the legend: PPC 603e, 16Mb, 100 MHz; 3 SHARC DSPs, 8Mb, 40 MHz; 3 SHARC DSPs, 16Mb, 40 MHz.]


STAP IMPLEMENTATION ON MERCURY RACE® COMPUTER

• Implementation of STAP on the Mercury RACE® Computer involves the following tasks:

– Build the RT_STAP¹ benchmark designed and developed by MITRE (requires MPI software).

– Successfully install and build MPI Software Technology, Inc.’s message passing interface software (MPI/PRO™) for the Mercury Computer (used by the RT_STAP benchmark).

– Build both the sequential host and parallel Mercury Computer versions of the benchmark.

• After successfully building and executing the RT_STAP benchmark on the 8-node PowerPC Mercury RACE® computer, perform the following tasks:

– Analyze the RT_STAP benchmark source code to determine the partitioning of the data (i.e., the mapping) and the scheduling of the messages. Replicate the data partitioning and message scheduling on the Network Simulator.

– Verify the reported communication times from the RT_STAP benchmark with the Network Simulator.

– Modify the RT_STAP source code to allow for specification of mapping and ordering of the data distribution. Verify results with the Network Simulator.

¹ Cain, K.C., Torres, J.A., and Williams, R.T. RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark. MITRE Technical Report MTR 96B0000021, February 1997.


MPI/PRO™ BUILD FOR MERCURY RACE® COMPUTER

• MPI/PRO™ for RACE® is a Commercial Off-the-Shelf Standards-Based Message-Passing Middleware.

• Provides robust messaging and implements the Message Passing Interface (MPI) defined by the Message Passing Interface Forum.

• MPI/PRO™ supports MPI 1.2 extensions.

• Currently supports RACE® PowerPC and i860 CNs.

• MPI/PRO™ is layered on Mercury’s MC/OS development and runtime environment.


RT_STAP BENCHMARK ON MERCURY RACE® COMPUTER

• The RT_STAP benchmark, developed by MITRE, was designed to evaluate the application of scalable, high performance computers to the real time implementation of STAP techniques.

• The benchmark has the capability to vary the sophistication and computational complexity of the adaptive algorithms employed.

• The goal is to build and execute the MITRE RT_STAP benchmark software on an 8-node PPC 603e Mercury Computer (MC/OS 4.4.2) using MPI Software Technology, Inc. MPI/PRO.

• The RT_STAP benchmark software employs a QR-decomposition algorithm component in the space-time adaptive processing. A QRD benchmark is also provided to characterize a single processor’s performance on QR-decompositions.


Configuration Technique for STAP

• Incorporate New Features into the Network Simulator

• Testing and Calibration of the Network Simulator

• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer

• Optimization Problem

• Computational Investigation


OPTIMIZATION PROBLEM

• Overview of the Approach

• Definition of a Class of Mappings for Data Partitioning

• Development of an Objective Function to Evaluate Defined Classes of Mappings

• Implementation of a Genetic Algorithm to Produce Schedules for the Top Mapping Candidates generated by the Mapping Objective Function.

– Use the Simulator to Evaluate the Communication Performance.


OVERVIEW OF THE APPROACH

[Figure: flow diagram of the optimization loop. Labeled blocks: STAP Data Cube; Select # CNs (P) (P = Allocated Compute Nodes); Select Fixed or Random Mapping; Minimize Mapping (Use Objective Function); Genetic Algorithm (Determine Optimal Schedule); Network Simulator (Estimate Overall Communication Time); Mercury RACE® (Configured with 1..P CNs); Adjust Allocated P. A group of blocks is labeled OPTIMIZE.]


DEFINITION OF A CLASS OF MAPPINGS FOR DATA PARTITIONING

• Let the matrix $T_k$ represent the mapping for the kth processing phase: $T_k$ is a 2-d process set with $M$ rows and $N$ columns.

• Equation for the number of CNs: $P = M_k \cdot N_k$

• Possible values for M and N: $(M, N) \in \{(i, j) \mid i \cdot j = P\}$

For example, assume $P = 12$. The mapping matrices $T_1 : M_1 \times N_1$, $T_2 : M_2 \times N_2$, $T_3 : M_3 \times N_3$ could be defined by any one of the following:

$$\{(1 \times 12), (2 \times 6), (3 \times 4), (4 \times 3), (6 \times 2), (12 \times 1)\}$$

• Total number of combinations: $\left|\{(i, j) \mid i \cdot j = P\}\right|^3$

Assuming the CN assignments within a mapping matrix are raster ordered left to right, the total number of combinations is: $6^3 = 36 \cdot 6 = 216$
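The count for P = 12 can be checked by enumerating the factor pairs directly. The sketch below uses hypothetical function names; the raster-ordered matrix it builds numbers the CNs from 0, which is an assumption made for the example.

```python
def factor_pairs(p):
    """All (M, N) with M * N == p, i.e. the candidate shapes of one mapping matrix."""
    return [(m, p // m) for m in range(1, p + 1) if p % m == 0]


def count_mapping_combinations(p, phases=3):
    """One shape is chosen independently for each processing phase."""
    return len(factor_pairs(p)) ** phases


def raster_mapping(m, n):
    """M x N mapping matrix with CN indices assigned in raster order, left to right."""
    return [[r * n + c for c in range(n)] for r in range(m)]


shapes = factor_pairs(12)
total = count_mapping_combinations(12)
```

For P = 12 this gives six shapes per phase and 216 combinations over the three phases, matching the slide.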


OBJECTIVE FUNCTION DEVELOPMENT: Quality of Mapping

• An objective function can be developed based on the definition of a class of mappings for data partitioning.

Using the following definitions:

$T_1 = [T_1^{r,c}]_{M_1 \times N_1}$ such that each $T_1^{r,c}$ represents the CN where the data vector is distributed.

$T_2 = [T_2^{r,c}]_{M_2 \times N_2}$ such that each $T_2^{r,c}$ represents the CN where the data vector is distributed.

$T_3 = [T_3^{r,c}]_{M_3 \times N_3}$ such that each $T_3^{r,c}$ represents the CN where the data vector is distributed.

$m_{ij}$ = message from CN i to CN j

$|m_{ij}|$ = message size of $m_{ij}$

$d_{ij}$ = minimum number of required crossbar connections for message $m_{ij}$

$\varepsilon = \{(i, j) \mid \text{CN } i \text{ communicates with CN } j\}$, with $\varepsilon_1$ for the first corner turn and $\varepsilon_2$ for the second

$T_1 \to T_2$ Corner-Turn Produces Messages. Objective:

$$\min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij}$$

$T_2 \to T_3$ Corner-Turn Produces Messages. Objective:

$$\min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij}$$


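OBJECTIVE FUNCTION DEVELOPMENT: Quality of Mapping

• An objective function for the communication time:

$$k_1 \cdot \min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij} \;+\; k_2 \cdot \min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij}$$

The first term is the First Corner Turn; the second term is the Second Corner Turn.

• An objective function for STAP processing:

$$k_1 \cdot \min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij} \;+\; k_2 \cdot \min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij} \;+\; k_3 \cdot (\text{Range Computation Time}) \;+\; k_4 \cdot (\text{Doppler Computation Time}) \;+\; k_5 \cdot (\text{Weight Computation Time})$$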
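For one candidate mapping, the weighted objective can be evaluated as sketched below; the minimization itself would then compare this value across candidate mappings. Everything concrete here is an assumption for illustration: the function names, the dictionary encoding of message sizes, and especially the `two_level_hops` distance function, which stands in for the real crossbar topology.

```python
def corner_turn_cost(message_sizes, crossbar_hops):
    """Sum of |m_ij| * d_ij over the CN pairs (i, j) that communicate."""
    return sum(size * crossbar_hops(i, j)
               for (i, j), size in message_sizes.items())


def stap_objective(turn1, turn2, crossbar_hops, k, compute_times=(0.0, 0.0, 0.0)):
    """Weighted sum of both corner-turn costs and the three computation times."""
    k1, k2, k3, k4, k5 = k
    t_range, t_doppler, t_weight = compute_times
    return (k1 * corner_turn_cost(turn1, crossbar_hops)
            + k2 * corner_turn_cost(turn2, crossbar_hops)
            + k3 * t_range + k4 * t_doppler + k5 * t_weight)


def two_level_hops(i, j):
    """Hypothetical topology: 4 CNs per crossbar, 1 hop inside, 3 hops across."""
    return 1 if i // 4 == j // 4 else 3


# Made-up message sizes, keyed by (source CN, destination CN).
turn1 = {(0, 1): 100, (0, 5): 100}
turn2 = {(1, 2): 50}
value = stap_objective(turn1, turn2, two_level_hops, k=(1, 1, 0, 0, 0))
```

Setting k3, k4, and k5 to zero reduces the expression to the communication-time objective alone.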


GENETIC ALGORITHMS

• A genetic algorithm (GA) is a population-based model that uses selection and recombination operators to generate new sample points in a search space.

• A GA encodes a potential solution to a specific problem on a chromosome-like data structure and applies recombination operators to these structures so as to preserve critical information.

• Often, GAs are viewed as function optimizers. As a result, researchers are typically interested in GAs as optimization tools.

• Implementation of a GA begins with a population of chromosomes. Once each chromosome is evaluated, reproduction opportunities are applied in such a way that those chromosomes which represent a better solution to the target problem are given more chances to reproduce than chromosomes with poorer solutions.

• Currently, GAs are a promising heuristic approach to locating near-optimal solutions in large search spaces.


GENETIC ALGORITHMS

• A genetic algorithm is typically composed of two main components that are problem dependent:

– The problem encoding

• The first component involves generating an encoding scheme to represent possible solutions to the optimization problem. Candidate solutions are usually represented as strings of fixed length, like chromosomes, usually coded with a binary character set.

– The evaluation function

• An evaluation function measures the quality of a particular solution. In this research, the evaluation of a particular candidate will be accomplished by the Network Simulator. The fitness of the candidate from the population space will be measured based on its simulated performance.

• The objective of a GA search is to locate the chromosome that has the optimal fitness value. For this research, if the chromosome represented the scheduling of messages and the fitness value the completion time of the schedule, the objective of the GA would be to find the smallest value (i.e., shortest completion time).


IMPLEMENTATION OF A GENETIC ALGORITHM HEURISTIC

• Implementation of a GA involves the following steps:¹

– Generate an initial population: This initial population is the first generation where evolution starts. A random set of chromosomes is often used as the initial population.

– An evaluation using the evaluation or fitness function: Evaluate the quality of each chromosome in the initial population.

– A selection mechanism: In this step, chromosomes are duplicated or eliminated based on their relative quality or fitness. The population size is kept constant.

– A crossover mechanism: Some pairs of the chromosomes are selected from the current population, and some of their corresponding components are exchanged to form two valid chromosomes. The new chromosomes may or may not be in the current population.

1 Wang, L., Siegel, H.J., Roychowdhury, V.P., and Maciejewski, A.A. Task Matching and Scheduling in Heterogeneous Computing Environments using a Genetic Algorithm-Based Approach, Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing.


IMPLEMENTATION OF A GENETIC ALGORITHM HEURISTIC

• Implementation of a GA involves the following steps (continued):¹

– A mutation mechanism: After a crossover operation, each string in the population may be mutated with some probability. The mutation process transforms a chromosome into another valid one that may or may not be in the population. The motivation for using mutation is to prevent the algorithm from getting stuck in a local minimum.

– Reevaluation of the population: The new population after selection, crossover, and mutation is reevaluated. The fitness value for each new chromosome is computed.

– A set of stopping criteria: The stopping criteria specify the conditions under which the algorithm terminates. If the stopping criteria have not been met, the new population goes through another cycle of selection, crossover, mutation, and evaluation. This cycle repeats until one of the stopping criteria is met.

1 Wang, L., Siegel, H.J., Roychowdhury, V.P., and Maciejewski, A.A. Task Matching and Scheduling in Heterogeneous Computing Environments using a Genetic Algorithm-Based Approach, Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing.
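The GA steps listed on these two slides can be sketched for a message-scheduling chromosome, here encoded as a permutation of message indices. This is an illustrative sketch, not the project's implementation: the operators (binary tournament selection, order crossover, swap mutation), the fixed generation count used as the stopping criterion, the best-so-far bookkeeping, and all names are assumptions. In particular, `toy_completion_time` is a made-up stand-in for the Network Simulator, which the slides name as the real evaluation function.

```python
import random


def order_crossover(a, b, rng):
    """Copy a slice from parent a and fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    middle = a[i:j]
    rest = [g for g in b if g not in middle]
    return rest[:i] + middle + rest[i:]      # always a valid permutation


def swap_mutation(chrom, rng, rate):
    """With probability `rate`, swap two positions of the schedule."""
    chrom = list(chrom)
    if rng.random() < rate:
        i, j = rng.sample(range(len(chrom)), 2)
        chrom[i], chrom[j] = chrom[j], chrom[i]
    return chrom


def genetic_search(n_genes, fitness, pop_size=30, generations=60,
                   mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Initial population: random permutations of the message indices.
    population = [rng.sample(range(n_genes), n_genes) for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):             # stopping criterion: generation count
        # Selection: binary tournaments keep the population size constant.
        parents = [min(rng.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # Crossover followed by mutation produces the next generation.
        population = [
            swap_mutation(
                order_crossover(parents[k], parents[(k + 1) % pop_size], rng),
                rng, mutation_rate)
            for k in range(pop_size)
        ]
        # Reevaluation: remember the best chromosome seen so far.
        candidate = min(population, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate
    return best


def toy_completion_time(order):
    """Stand-in fitness: one time unit per message, plus a penalty whenever
    consecutive messages target the same (hypothetical) destination."""
    dests = [m % 3 for m in order]
    return len(order) + sum(1 for a, b in zip(dests, dests[1:]) if a == b)


best_schedule = genetic_search(9, toy_completion_time, seed=1)
```

Because smaller completion times are better, selection uses `min`; a fitness that should be maximized would flip those comparisons.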


Configuration Technique for STAP

• Incorporate New Features into the Network Simulator

• Testing and Calibration of the Network Simulator

• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer

• Optimization Problem

• Computational Investigation


COMPUTATIONAL INVESTIGATION

• A QR-D computation is deterministic (i.e., its complexity can be calculated).

• A Conjugate Gradient (CG) computation is not deterministic. Its complexity depends on the initial condition and desired tolerance.

– This work proposes the investigation of the impact of “intelligent” initial condition values on a CG algorithm.


CONJUGATE GRADIENT APPROACH: Investigation of Initial Condition Values

Range bins: A, B, C, D

Solve the following equations:

$$\psi_1(A,B,C)\, w_1 = s, \qquad \psi_2(B,C,D)\, w_2 = s$$

Where:

$$\psi_1(A,B,C) = x_1 \cdot x_1^H, \quad x_1 = \begin{bmatrix} A \\ B \\ C \end{bmatrix}, \qquad \psi_2(B,C,D) = x_2 \cdot x_2^H, \quad x_2 = \begin{bmatrix} B \\ C \\ D \end{bmatrix}$$

$w_1$ = weight vector

$s$ = steering vector


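CONJUGATE GRADIENT APPROACH: Investigation of Initial Condition Values

• Expanding $\psi_1(A,B,C)$ and $\psi_2(B,C,D)$ yields the following:

$$\psi_1(A,B,C) = x_1 \cdot x_1^H = \begin{bmatrix} A \\ B \\ C \end{bmatrix} \begin{bmatrix} A^H & B^H & C^H \end{bmatrix} = \begin{bmatrix} AA^H & AB^H & AC^H \\ BA^H & BB^H & BC^H \\ CA^H & CB^H & CC^H \end{bmatrix}$$

$$\psi_2(B,C,D) = x_2 \cdot x_2^H = \begin{bmatrix} B \\ C \\ D \end{bmatrix} \begin{bmatrix} B^H & C^H & D^H \end{bmatrix} = \begin{bmatrix} BB^H & BC^H & BD^H \\ CB^H & CC^H & CD^H \\ DB^H & DC^H & DD^H \end{bmatrix}$$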


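CONJUGATE GRADIENT APPROACH: Investigation of Initial Condition Values

• Attempting to solve the following equation for $w_1$: $\psi_1(A,B,C)\, w_1 = s$

$$\psi_1(A,B,C) \begin{bmatrix} w_{1,1} \\ w_{1,2} \\ w_{1,3} \end{bmatrix} = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}$$

Set of Linear Equations:

$$\begin{aligned} AA^H w_{1,1} + AB^H w_{1,2} + AC^H w_{1,3} &= s_1 \\ BA^H w_{1,1} + BB^H w_{1,2} + BC^H w_{1,3} &= s_2 \\ CA^H w_{1,1} + CB^H w_{1,2} + CC^H w_{1,3} &= s_3 \end{aligned}$$

• Attempting to solve the following equation for $w_2$: $\psi_2(B,C,D)\, w_2 = s$

$$\psi_2(B,C,D) \begin{bmatrix} w_{2,1} \\ w_{2,2} \\ w_{2,3} \end{bmatrix} = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}$$

Set of Linear Equations:

$$\begin{aligned} BB^H w_{2,1} + BC^H w_{2,2} + BD^H w_{2,3} &= s_1 \\ CB^H w_{2,1} + CC^H w_{2,2} + CD^H w_{2,3} &= s_2 \\ DB^H w_{2,1} + DC^H w_{2,2} + DD^H w_{2,3} &= s_3 \end{aligned}$$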


CONJUGATE GRADIENT APPROACH: Investigation of Initial Condition Values

• Investigation of the two sets of linear equations reveals similarities among the sets of equations:

$$\begin{aligned} AA^H w_{1,1} + AB^H w_{1,2} + AC^H w_{1,3} &= s_1 \\ BA^H w_{1,1} + BB^H w_{1,2} + BC^H w_{1,3} &= s_2 \\ CA^H w_{1,1} + CB^H w_{1,2} + CC^H w_{1,3} &= s_3 \end{aligned}$$

$$\begin{aligned} BB^H w_{2,1} + BC^H w_{2,2} + BD^H w_{2,3} &= s_1 \\ CB^H w_{2,1} + CC^H w_{2,2} + CD^H w_{2,3} &= s_2 \\ DB^H w_{2,1} + DC^H w_{2,2} + DD^H w_{2,3} &= s_3 \end{aligned}$$

• The similarities between the equations may provide insight into the selection of the initial condition values. Assuming the steering vector remains the same for each set of linear equations, the initial values could be assigned as follows:

– If range bin D is similar to range bin C, then $w_{2,1} \leftarrow w_{1,2}$, $w_{2,2} \leftarrow w_{1,3}$, $w_{2,3} \leftarrow w_{1,3}$

– If range bin D is similar to range bin A, then $w_{2,1} \leftarrow w_{1,2}$, $w_{2,2} \leftarrow w_{1,3}$, $w_{2,3} \leftarrow w_{1,1}$
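The idea of seeding the second solve with values from the first can be tried with a small conjugate gradient routine. The sketch below makes several simplifying assumptions that are not in the slides: the data are real-valued (so the Hermitian transpose is a plain transpose), each range bin is a short made-up row vector, and the matrices are the 3 x 3 products of those rows. All function and variable names are hypothetical. It shows how the warm start is wired in; whether it reduces the iteration count is the question the slides propose to investigate, so no claim is made about that here.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(a, x):
    return [dot(row, x) for row in a]


def gram(rows):
    """psi = x x^T for x whose rows are the given range-bin vectors."""
    return [[dot(u, v) for v in rows] for u in rows]


def conjugate_gradient(a, b, x0, tol=1e-10, max_iter=100):
    """Solve a x = b for symmetric positive-definite a; return (x, iterations)."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(a, x))]
    p = list(r)
    rs = dot(r, r)
    iterations = 0
    while rs ** 0.5 > tol and iterations < max_iter:
        ap = matvec(a, p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        iterations += 1
    return x, iterations


# Made-up range-bin data; D is chosen to resemble C.
A, B, C, D = [2, 0, 1, 0], [0, 3, 0, 1], [1, 0, 2, 1], [1, 0, 2, 0]
s = [1.0, 1.0, 1.0]                      # steering vector (assumed)

psi1, psi2 = gram([A, B, C]), gram([B, C, D])
w1, _ = conjugate_gradient(psi1, s, [0.0, 0.0, 0.0])

# "D similar to C" rule from the slide: w2,1 <- w1,2; w2,2 <- w1,3; w2,3 <- w1,3
warm_start = [w1[1], w1[2], w1[2]]
w2_warm, warm_iterations = conjugate_gradient(psi2, s, warm_start)
w2_cold, cold_iterations = conjugate_gradient(psi2, s, [0.0, 0.0, 0.0])
```

Comparing `warm_iterations` with `cold_iterations` over many data sets and tolerances would be one way to carry out the proposed investigation.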


Outline

• Program Overview and Introduction (Quad Chart)

• Program Management Status

• Highlights from Year 1

• Highlights from Year 2

• Work to be Completed


Work to be Completed

• Interfacing of FPGA and GPP/DSP Subsystems

• Implement Parallel SAR Algorithm on GPP/DSP System

• Integrate FPGA FIR Filters for Range and Azimuth Processing for SAR

• Implement Parallel STAP Algorithm for GPP/DSP System

• Integrate FPGA FIR Filters for Range Processing for STAP

• Implement FPGA-based Linear Equation Solver

• Integrate FPGA-based Linear Equation Solver with STAP