Reconfigurable Computing and the von Neumann Syndrome
Reiner Hartenstein
TU Kaiserslautern
© 2007, [email protected], http://hartenstein.de

Questions?
• Who is familiar with FPGAs? Is programming them easy?
• Who is familiar with systolic arrays?
• Duality: data streams vs. instruction streams?
• Programming a multicore microprocessor: will it be easy?
Outline
• The Pervasiveness of FPGAs
• The Reconfigurable Computing Paradox
• The Gordon Moore gap
• The von Neumann syndrome
• We need a dual paradigm approach
• Conclusions
Pervasiveness of RC
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html
http://hartenstein.de/pervasiveness.html
RCeducation 2008
http://www.fpl.uni-kl.de/RCeducation08/
The 3rd International Workshop on Reconfigurable Computing Education
April 10, 2008, Montpellier, France
Outline (revisited): the hardware / software chasm; the configware / software chasm; the instruction-stream tunnel; the overhead-prone paradigm
Outline (revisited): instruction stream vs. data stream; bridging the chasm: an old hat; stubborn curriculum task forces
RC education
http://www.fpl.uni-kl.de/RCeducation/
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html
Outline (revisited): platform FPGAs; coarse-grained arrays; saving energy
FPGA with island architecture
• reconfigurable logic box
• switch box
• connect box
• reconfigurable interconnect fabrics
Deficiencies of reconfigurable fabrics (FPGA)
(figure: density in transistors per microchip, 10^0 to 10^9, over 1980–2010: the Gordon Moore curve (microprocessor) vs. FPGA-logical, FPGA-routed, and FPGA-physical density)
• reconfigurability overhead, wiring overhead, routing congestion: overhead >> 10,000
• immense area inefficiency of the general-purpose "simple" (fine-grained) FPGA: 1st DeHon's Law [1996: Ph.D. thesis, MIT]
• slow clock, power guzzler
• deficiency factor: >10,000
Software-to-Configware (FPGA) Migration: some published speed-up factors [2003–2005]
(figure: speed-up factor on a log scale, 10^0 to 10^6, over 1980–2010; trend roughly ×2 per year)
• DSP and wireless: MAC 1000; Reed-Solomon decoding 2400; Viterbi decoding 400
• image processing, pattern matching, multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457; FFT 100
• bioinformatics: BLAST 52; protein identification 40; Smith-Waterman pattern matching 288
• astrophysics: GRAPE 20
• crypto: 1000
• molecular dynamics simulation: 88
• oil and gas: 17
The RC paradox
• deficiency factor: >10,000
• speed-up factor: 6,000
• total discrepancy: >60,000,000
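The paradox figure is plain arithmetic on the two factors above; a trivial sanity check (mine, not from the talk):

```python
# Total discrepancy = technology handicap x achieved speed-up:
# despite an FPGA area/clock/power deficiency of >10,000x versus hardwired
# logic, migrations still win by thousands, so the paradigm advantage
# must be covering both gaps at once.
deficiency = 10_000   # FPGA fabric handicap vs. hardwired logic
speedup = 6_000       # best published software-to-FPGA speed-up above
discrepancy = deficiency * speedup
print(discrepancy)    # 60000000
```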
Software-to-Configware (FPGA) Migration: some published speed-up factors [2003–2005]
These examples worked fine with on-chip memory. There are other algorithms that are more difficult to accelerate, where data caching might be useful (ASM).
How much on-chip embedded BRAM?
(figure, on-chip LatticeSC series: 256–1704 BGA; 56–424 fast on-chip block RAMs (BRAMs); 8–32; DPU: coarse-grained)
Coarse-grained Reconfigurable Array
SNN filter on a (supersystolic) KressArray (mainly a pipe network); array size: 10 × 16 rDPUs; rDPU: reconfigurable Data Path Unit, 32 bits wide; no CPU.
Legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / rout-thru only.
Note the software perspective without instruction streams: pipelining.
Compiled by Nageldinger's KressArray Xplorer, with Juergen Becker's CoDe-X inside.
Question after the talk: "but you can't implement decisions!"
Simple KressArray Configuration Example
Far fewer deficiencies with coarse-grained arrays
(figure: transistors per microchip, 10^0 to 10^9, over 1980–2010; Gordon Moore curve; rDPA-physical and rDPA-logical density)
• area efficiency very close to Moore's law: Hartenstein's Law [1996: ISIS, Austin, TX]
• very compact configuration code: very fast reconfiguration
(contrast: a CPU with program counter and DPU vs. an array of rDPUs)
Software-to-Configware (FPGA) Migration: Oil and gas [2005]
(figure: speed-up factor, 10^0 to 10^6, over 1980–2010; oil and gas: 17; trend ×2 per year)
Side effect: slashing the electricity bill by more than an order of magnitude.
An accidentally discovered side effect
• Software-to-FPGA migration of an oil and gas application:
• speed-up factor of 17
• electricity bill down to <10%
• hardware cost down to <10%
• All other publications reporting speed-ups did not report energy consumption.
Saves >$10,000 in electricity bills per year (at 7¢/kWh) per 64-processor 19" rack [Herb Riley, R. Associates]. This will change ($70 in 2010?).
What about higher speed-up factors? More dramatic electricity savings?
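The savings figure can be cross-checked with a back-of-envelope calculation (my sketch; it assumes the rack draws that power continuously all year):

```python
# What average power does a $10,000/year electricity saving imply at 7 cents/kWh?
savings_per_year = 10_000      # USD, per 64-processor 19" rack
price_per_kwh = 0.07           # USD per kWh
hours_per_year = 365 * 24      # 8760, assuming continuous operation

kwh_saved = savings_per_year / price_per_kwh   # ~142,857 kWh/year
avg_kw = kwh_saved / hours_per_year            # average draw avoided
print(round(avg_kw, 1))                        # 16.3 (kW)
```

So the quoted saving corresponds to roughly 16 kW of continuous load per rack, a plausible magnitude for 64 server-class processors.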
What’s Really Going On With Oil Prices? [BusinessWeek, January 29, 2007]
$52 Price of delivery in February 2007 [New York Mercantile Exchange: Jan. 17]
$200 Minimum oil price in 2010, in a bet by investment banker Matthew Simmons
Energy as a strategic issue
• Google's annual electricity bill: $50,000,000
• Amsterdam's electricity: 25% goes into server farms
• New York City server farms: 1/4 km² of building floor area
• Predicted for the USA in 2020: 30–50% of the entire national electricity consumption goes into the cyber infrastructure [Mark P. Mills]
• petaflop supercomputer (by 2012?): extreme power consumption
Energy: an important motivation

platform example                      | energy: W/Gflops | energy factor
MDGrape-3* (domain-specific, 2004)    | 0.2              | 1
Pentium 4                             | 14               | 70
Earth Simulator (supercomputer, 2003) | 128              | 640

*) feasible also on reconfigurable platforms
Outline (revisited): the Gordon Moore gap & the multicore crisis
What is the reason for the paradox?
• Moore's law is not applicable to all aspects of VLSI
• the Gordon Moore curve does not indicate performance
• the peak clock frequency does not indicate performance
• the law of Gates
Rapid Decline of Computational Density [BWRC, UC Berkeley, 2004]
(figure, "stolen from Bob Colwell": SPECfp2000 / MHz / billion transistors, 0–200, over 1990–2005, for DEC Alpha, SUN, HP, IBM; Alpha: down by 100 in 6 years; IBM: down by 20 in 6 years)
CPU: memory wall, caches, ...
Primary design goal: avoiding a paradigm shift. A dramatic demo of the von Neumann syndrome.
Monstrous Steam Engines of Computing (ready 2003)
5120 processors with 5000 pins each; crossbar weight: 220 t; 3000 km of thick cable; larger than a battleship; power measured in tens of megawatts; floor space measured in tens of thousands of square feet.
Dead Supercomputer Society (research 1985–1995) [Gordon Bell, keynote ISCA 2000]
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, DAPP, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, ICL, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics
We are in a Computing Crisis

platform example                      | hardware cost: $/Gflops | cost factor | energy: W/Gflops | energy factor
MDGrape-3* (domain-specific, 2004)    | 15   | 1   | 0.2 | 1
Pentium 4                             | 400  | 27  | 14  | 70
Earth Simulator (supercomputer, 2003) | 8000 | 533 | 128 | 640

*) feasible also with rDPA
Microprocessor crisis: going multicore. Supercomputing crisis: MPP parallelism does not scale.
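The "factor" columns are each platform's value divided by the MDGrape-3 baseline; a quick recomputation (using the raw numbers as printed on the slide):

```python
# Recompute cost and energy factors from the raw $/Gflops and W/Gflops columns.
platforms = {
    #                  $/Gflops, W/Gflops
    "MDGrape-3":        (15,    0.2),   # domain-specific, 2004 (baseline)
    "Pentium 4":        (400,   14),
    "Earth Simulator":  (8000,  128),   # supercomputer, 2003
}
base_cost, base_energy = platforms["MDGrape-3"]
factors = {name: (round(cost / base_cost), round(energy / base_energy))
           for name, (cost, energy) in platforms.items()}
print(factors["Pentium 4"])        # (27, 70)
print(factors["Earth Simulator"])  # (533, 640)
```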
The von Neumann Paradigm Trap [Burks, Goldstine, von Neumann; 1946]
• program counter (auto-increment, jump, goto, branch)
• Datapath Unit with ALU etc.
• I/O unit, ...
• RAM (memory cells have addresses ...)
CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. We need a dual paradigm approach: CS education's right eye is blind, and its left eye suffers from tunnel view.
What is the reason for the paradox?
It results from decades of tunnel view in CS R&D and education; the basic mind set is completely wrong: the von Neumann syndrome.
"CPU: the most flexible platform"? >1000 CPUs running in parallel are the most inflexible platform. However, FPGAs & rDPAs are very flexible.
The Law of More: drastically declining programmer productivity.
Understanding the Paradox?
An executive summary doesn't help: we must first understand the nature of the paradigm. (von Neumann chickens?)
Von Neumann CPU (tunnel view with the left eye)
CPU = program counter + DPU, with RAM memory.

term | program counter | execution triggered by | paradigm
CPU  | yes             | instruction fetch      | instruction-stream-based

World of Software Engineering; program source: software.
von Neumann is not the common model
The von Neumann instruction-stream-based machine: CPU (program counter + DPU), RAM memory, von Neumann bottleneck.
• mainframe age: software (instruction-stream-based) on the CPU
• microprocessor age: plus co-processors / accelerators: hardware, data-stream-based
Here is the contemporary common model
The von Neumann instruction-stream-based machine: CPU (program counter + DPU), RAM memory, von Neumann bottleneck.
• mainframe age: software on the CPU
• microprocessor age: plus co-processors / accelerators (hardware, data-stream-based)
• now we are in the configware age: CPU plus hardwired accelerators plus reconfigurable accelerators
Machine models

term  | program counter | execution triggered by | paradigm
CPU   | yes             | instruction fetch      | instruction-stream-based
DPU** | no              | data arrival*          | data-stream-based

von Neumann machine: CPU (program counter + DPU) with RAM memory.
Anti machine: RAM blocks with data counters feeding a DPU, or an array of rDPUs.
*) "transport-triggered"
**) does not have a program counter: no instruction fetch at run time
Nick Tredennick's Paradigm Shifts (slowly preparing to use both eyes, for a dual paradigm point of view)
• Early historic machines: algorithm fixed, resources fixed
• von Neumann: 1 programming source needed (software): algorithm variable, resources fixed (CPU)
Compilation: Software (von Neumann model; Software Engineering)
source program → software compiler → software code: a sequential instruction schedule (German: Befehls-Fahrplan).
Nick Tredennick's Paradigm Shifts
• Early historic machines: algorithm fixed, resources fixed
• von Neumann: 1 programming source needed (software): algorithm variable, resources fixed (CPU)
• Reconfigurable Computing: 2 programming sources needed: configware (resources: variable) and flowware (algorithm: variable)
Configware Compilation (Configware Engineering)
source "program" (C, FORTRAN, MATLAB) → configware compiler:
• mapper (placement & routing) → configware code
• scheduler (programming the data counters) → flowware code
Data streams run through the rDPA (a pipe network); each memory is an ASM (Auto-Sequencing Memory): GAG + data counter + RAM.
Configware compilation is fundamentally different from software compilation.
The first archetype machine model: the mainframe
The Software Industry's secret of success: a simple basic machine paradigm ("von Neumann"); personalization is RAM-based and procedural (compile or assemble to the CPU); an instruction-stream-based mind set. But now we live in the Configware Age.
Synthesis Method?
Systolic array synthesis was of course algebraic (linear projection), but only for applications with regular data dependencies: a reductionist approach; mathematicians were caught by their own paradigm trap. In 1995 Rainer Kress discarded their algebraic synthesis methods and replaced them with simulated annealing: the rDPA. The super-systolic array: a generalization of the systolic array.
Having introduced Data streams (~1980, H. T. Kung)
(figure: input and output data streams, indexed by time and port #, flowing through a DPA, a pipe network)
• execution is transport-triggered
• no memory wall
Systolic array research throughout the '80s: mathematicians' hobby. The road map to HPC: ignored for decades.
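The data-stream idea is easy to sketch in software: operands march through a fixed arrangement of operators, one stage per cycle, with no instruction fetch driving the computation. A toy 3-tap FIR filter as a systolic-style pipeline (my illustration, not code from the talk):

```python
# A systolic-style FIR filter: a chain of multiply-accumulate stages.
# The "program" is the fixed tap wiring; only data moves at run time.
def fir_pipeline(taps, stream):
    delays = [0] * len(taps)           # shift registers between stages
    out = []
    for x in stream:                   # one new sample enters per cycle
        delays = [x] + delays[:-1]     # data advances one stage ("systole")
        out.append(sum(t * d for t, d in zip(taps, delays)))
    return out

print(fir_pipeline([1, 1, 1], [1, 2, 3, 4]))  # [1, 3, 6, 9] (running 3-sample sums)
```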
Who generates the Data Streams?
Mathematicians: it's not our job (it's not algebraic).
(figure: "systolic" data streams)
Without a sequencer, it's not a machine (Machine = resources + sequencer).
The reductionist approach ("it's not our job"): mathematicians missed inventing the new machine paradigm ... the anti machine.
The counterpart of the von Neumann machine: the Kress/Kung anti machine (coarse-grained)
(figure: data streams entering and leaving an (r)DPA, fed on all sides by ASMs)
ASM: Auto-Sequencing Memory (GAG + data counter + RAM)
• data counters instead of a program counter
• the data counters are located at the memory, not at the datapath
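The contrast between the two sequencing mechanisms can be sketched in a few lines (a conceptual toy; the function names are mine, not Hartenstein's):

```python
# von Neumann: a program counter fetches an instruction, which then fetches data.
def von_neumann_sum(program, memory):
    pc, acc = 0, 0
    while pc < len(program):
        op, addr = program[pc]      # instruction fetch at run time (the bottleneck)
        if op == "add":
            acc += memory[addr]     # separate data access
        pc += 1
    return acc

# Anti machine: data counters at the memory (an ASM) emit the address sequence,
# decided before run time; the datapath fires on data arrival (transport-triggered).
def anti_machine_sum(memory, data_counter_sequence):
    acc = 0
    for addr in data_counter_sequence:
        acc += memory[addr]         # no instruction fetch per operation
    return acc

mem = [5, 7, 9]
assert von_neumann_sum([("add", 0), ("add", 2)], mem) == 14
assert anti_machine_sum(mem, [0, 2]) == 14
```

Both compute the same sum; only the first spends a fetch-decode step per operation at run time.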
Acceleration Mechanisms of the ASM-based MoM
• parallelism by a multi-bank memory architecture
• reconfigurable address computation, before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• improved parallelism by storage-scheme transformations
• minimized data movement across chip boundaries
FPGAs in Supercomputing
• Synergism: coarse-grained parallelism through conventional parallel processing (CPUs: program counter + 32/64-bit DataPath Units),
• and fine-grained parallelism through direct configware execution on the FPGAs (millions of 1-bit reconfigurable logic boxes (rLBs) embedded in a reconfigurable interconnect fabric)
Anti machine (resources + sequencer)
• hardwired anti machine: memory with data counters as sequencers; algorithms expressed as flowware
• reconfigurable anti machine: the same, plus configware to personalize the resources
von Neumann machine (resources + sequencer)
• memory; algorithms expressed as software; the sequencer is the program counter
The clash of paradigms (microprocessor age: µprocessor plus accelerators)
A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter. The hardware guy's mind set is structural (a kind of data-stream-based mind set); the programmer's is procedural (the basic mind set is instruction-stream-based): the software / hardware chasm. We need a data-stream-based machine paradigm.
Xputer Principles
• address generators: reconfigurable (ASMs)
• data path: reconfigurable (rALU, DPLA)
Contemporary? 1984: the first FPGAs were very tiny & very expensive; as the CPU we used the VAX-11/750 of my group.
Outline (part 2)
• The von Neumann Paradigm
• Accelerators and FPGAs
• The Reconfigurable Computing Paradox
• The new Paradigm
• Coarse-grained
• Bridging the Paradigm Chasm
• Conclusions
FPGA Modes of Operation: simple, static reconfigurability
Configware code is loaded from external flash memory, e.g. after power-on (~milliseconds).
(timeline: configuration phase (C ph), then execution phase (E ph), with the device off in between)
(requiring new OS principles)
Illustrating dynamically reconfigurable FPGAs: an established R&D area
(timeline: configware macros X, Y, Z, each alternating configuration phases (C ph) and execution phases (E ph); X configures Y)
Partially reconfigurable: swapping and scheduling of relocatable configware code macros is managed by a configware operating system. A configware OS is fundamentally different from a software OS.
Reconfigurable Computing at Microsoft: a Microsoft "ReconVista"?
Outline (revisited)
Reconfigurable HPC
• This area is almost 10 years old
TU KaiserslauternHave to re-think basic assumptions
Instead of physical limits, fundamental misconceptions of algorithmic complexity theory limit the progress and will necessitate new breakthroughs.
Not processing is costly, but moving data and messages
We’ve to re-think basic assumptions behind computing
Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]
The instruction-stream-based approach suffers from the von Neumann bottleneck; the data-stream-based approach has no von Neumann bottleneck (many watering pots).
Outline (part 3)
• The (non-von-Neumann) anti machine (Xputer)
• Speed-up by address generators
• Data-procedural programming language
• Generalization of the systolic array
• Partitioning compilation techniques
• Design space exploration
• Bridging the paradigm chasm
More compute power by Configware than Software (a very cautious estimation)
• 75% of all (micro)processors are embedded: 4 : 1
• 25% of the embedded µprocessors are accelerated by FPGA(s): 1 : 4
• → 1 : 1 → every 2nd µprocessor is accelerated by FPGA(s)
• average acceleration factor >2 → rMIPS* : MIPS > 2 (the difference is probably an order of magnitude)
*) rMIPS: MIPS replaced by FPGA compute power
Conclusion: most compute power comes from Configware.
Xputer Lab (around 1990)
Programming Language Paradigms: principles of MoPL [1994] (very easy to learn)

language category            | Computer Languages | Xputer Languages
both                         | deterministic, procedural sequencing: traceable, checkpointable
operation sequence driven by | read next instruction; goto (instr. addr.); jump (to instr. addr.); instruction loop, loop nesting; no parallel loops; escapes; instruction-stream branching | read next data item; goto (data addr.); jump (to data addr.); data loop, loop nesting; parallel loops; escapes; data-stream branching
state register               | program counter | data counter(s)
address computation          | massive memory-cycle overhead | overhead avoided
instruction fetch            | memory-cycle overhead | overhead avoided
parallel memory-bank access  | interleaving only | no restrictions (multiple GAGs)
Avoiding the paradigm shift?
"It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006)
"A leap too far for the existing HPC community" (panelist Allan J. Cantle)
(SuperComputing, Nov 11–17, 2006, Tampa, Florida: over 7000 registered attendees and 274 exhibitors)
We need a bridge strategy: advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques, and a shorter leap via coarse-grained platforms, which allow a software-like pipelining perspective.
We need a new machine paradigm
A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter. We urgently need a data-stream-based machine paradigm (a data-stream-based mind set); it was prepared almost 30 years ago.
(figure: data streams through an array)
Generic Address Generator (GAG): a generalization of the DMA
GAG + data counter: the sequencer of an ASM. GAG & enabling technology published in 1989 (survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]); patented by TI in 1995.
Acceleration factors by:
• address computation without memory cycles: avoids e.g. 94% address-computation overhead*
• storage-scheme optimization methodology, etc.
*) Software to Xputer migration
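A GAG can be pictured as nested data counters compiled into a fixed address pattern; a minimal sketch of a 2D block scan (my illustration; the real GAG supports far richer step, limit, and stride parameters):

```python
# Generic Address Generator sketch: emit the address stream for a
# rows x cols block inside a row-major 2D array, so the datapath's
# run time spends no memory cycles on address arithmetic.
def gag_block_scan(base, row_stride, rows, cols):
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c

# scan a 2 x 3 block of a width-10 array starting at address 100
print(list(gag_block_scan(100, 10, 2, 3)))  # [100, 101, 102, 110, 111, 112]
```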
The 2nd "archetype" machine model: the reconfigurable accelerator
The Configware Industry's secret of success: a simple basic machine paradigm ("Kress-Kung"); personalization is RAM-based, structural, and done by compilation; a data-stream-based mind set.
SNN filter on a (supersystolic) KressArray, revisited (mainly a pipe network)
Array size: 10 × 16 = 160 rDPUs; rDPU: reconfigurable Data Path Unit, e.g. 32 bits wide; no CPU. Note the software perspective without instruction streams.
Legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / rout-thru only.
Question after the talk: "but you can't implement decisions!" (a high-level R&D manager of a large Japanese IT industry group): a symptom of the von Neumann syndrome, yielded by a single-paradigm mind set. Executive summary? Forget it! How about a microprocessor giant having >100 vice presidents? In fact, an if clause turns into a multiplexer.
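The answer to the heckler is the slide's last remark: in a pipe network a data-dependent if does not redirect an instruction stream; both operand paths exist in space and a multiplexer selects per data item. A toy illustration (mine, not from the talk):

```python
# Instruction-stream view: a data-dependent branch per element.
def absolute_branchy(xs):
    out = []
    for x in xs:
        if x < 0:          # the branch redirects the instruction stream
            out.append(-x)
        else:
            out.append(x)
    return out

# Data-stream view: both datapaths exist in parallel hardware;
# a multiplexer picks one result as each item streams past.
def mux(sel, a, b):
    return a if sel else b

def absolute_muxed(xs):
    return [mux(x < 0, -x, x) for x in xs]

assert absolute_branchy([-2, 3, 0]) == absolute_muxed([-2, 3, 0]) == [2, 3, 0]
```

Decisions thus cost one multiplexer per if, not a pipeline-draining branch.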
Dual Paradigm Application Development (Juergen Becker's CoDe-X, 1996)
C language source → partitioner → SW compiler (for the CPU) and CW compiler (for the rDPU array):
• automatic parallelization by loop transformations
• generating a pipe network
• placement and routing
© 2007, [email protected] http://hartenstein.de95
TU KaiserslauternHybrid Multi Core example
twin paradigm machine
each core can run CPU mode
or rDPU mode
[Figure: 64-core array mixing CPU cores and rDPU cores]
How about a microprocessor giant having >100 vice presidents?
Customer refuses the paradigm shift?
Disabled for the paradigm shift?
© 2007, [email protected] http://hartenstein.de96
TU Kaiserslautern
[Figure: hybrid multi-core array of rDPU cores and CPU cores]
Compilation for Dual Paradigm Multicore
SW compiler
CW compiler
C language source
Partitioner
Juergen Becker's CoDe-X, 1996
compile to hybrid multicore
placement and routing
automatic parallelization by loop transformations
generating a pipe network
© 2007, [email protected] http://hartenstein.de97
TU KaiserslauternOutline
• The von Neumann Paradigm
• Accelerators and FPGAs
• The Reconfigurable Computing Paradox
• The new Paradigm
• Coarse-grained
• Bridging the Paradigm Chasm
• Conclusions
© 2007, [email protected] http://hartenstein.de98
TU Kaiserslautern
Here is the common model
[Figure: von Neumann instruction-stream-based machine: CPU (program counter + DPU) and RAM memory, coupled through the von Neumann bottleneck]
mainframe age: CPU plus hardwired accelerator co-processors (software is instruction-stream-based, the accelerator hardware is data-stream-based)
microprocessor age: CPU plus hardwired accelerator
configware age: CPU plus reconfigurable accelerator; software and configware from a software/configware co-compiler
© 2007, [email protected] http://hartenstein.de99
TU KaiserslauternOutline
• The von Neumann Paradigm
• Accelerators and FPGAs
• The Reconfigurable Computing Paradox
• The new Paradigm
• Coarse-grained
• Bridging the Paradigm Chasm
• Conclusions
© 2007, [email protected] http://hartenstein.de100
TU KaiserslauternMulti Core: Just more CPUs ?
• Complexity and clock frequency scaling of single-core microprocessors have come to an end
• Without a paradigm shift, just more CPUs on a chip lead to the dead ends known from supercomputing
• Multi-core microprocessor chips are emerging: soon 32 cores on an AMD chip, and 80 on an Intel chip
• Multi-threading is not the silver bullet
• We have to re-think the basic assumptions behind computing
© 2007, [email protected] http://hartenstein.de101
TU KaiserslauternSolution not expected from CS officers
We need mutual efforts, like the EE/CS cooperation known from the Mead & Conway revolution
Progress of the joint task force on CS curriculum recommendations is extremely disillusioning; it's more like a lobby: "my area is the most important"
For RC, other motivations are similarly high-grade: growing cost and looming shortage of energy
The personal supercomputer, by Reconfigurable Computing: a far-ranging, massive push of innovation in all areas of science and economy
© 2007, [email protected] http://hartenstein.de102
TU Kaiserslautern
Computing Sciences are in a severe Crisis
We urgently need to shape the Reconfigurable Computing Revolution, to enable incredibly promising new horizons of affordable highest-performance computing
This cannot be achieved with the classical software-based mind set
We need a new dual paradigm approach
Watch out not to get screwed !
Supercomputing titans may be your enemies
© 2007, [email protected] http://hartenstein.de103
TU KaiserslauternThe Configware Age
• Mainframe age and microprocessor(-only) age are history
• We are living in the configware age right now!
• Attempts to avoid the paradigm shift will again create a disaster
© 2007, [email protected] http://hartenstein.de107
TU Kaiserslautern
Von Neumann vs. anti machine
#  feature                      von Neumann machine      hardwired anti machine / reconfigurable anti machine
1  machine code schedules:      instruction stream       data streams
2  number of program sources:   1                        2
3  source 1:                    -                        none / configware
4  source 2:                    software                 flowware
5  sequenced by:                program counter          data counters
6  counter co-located with:     PU (data path): CPU      memory block: ASM
9  inter-PU communication:      common memory            piped through
10 data meeting the PU:         move data at run time    move locality of execution at compile time
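The contrast in rows 5 and 10 can be sketched in plain Python (illustrative names only, not a real machine model): the von Neumann loop computes every data address under program-counter control at run time, while the anti machine's ASM emits a data stream from its own data counter and the datapath just consumes it.

```python
# Illustrative contrast (not a real machine model): how data meets the PU.

# von Neumann view: a program counter sequences instructions; each data
# address is computed by instructions at run time.
def vn_sum(memory, base, n):
    total, i = 0, 0
    while i < n:                      # compare, branch: instruction stream
        total += memory[base + i]     # address computed at run time
        i += 1
    return total

# anti-machine view: the data counter lives with the memory (ASM) and emits
# a data stream; the datapath consumes operands without instruction fetches.
def asm_stream(memory, base, step, n):
    addr = base                       # the data counter
    for _ in range(n):
        yield memory[addr]
        addr += step

def anti_sum(stream):
    total = 0
    for x in stream:                  # pure data stream: no addresses here
        total += x
    return total
```

Both compute the same result; the difference is where the addresses come from.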
© 2007, [email protected] http://hartenstein.de108
TU Kaiserslautern
Overhead avoided by anti machine
#  overhead at run time            von Neumann machine    anti machine (hardwired / reconfigurable)
11 state address computation       instruction stream     none
12 data address computation        instruction stream     none
13 inter-PU communication          instruction stream     none
14 instruction fetch               instruction stream     none
15 data meeting the PU             instruction stream     none
© 2007, [email protected] http://hartenstein.de110
TU Kaiserslautern
MoM Scan Window (MoMSW) Illustration

MoM primary architectural features:
• multiple (typically 3) vari-size reconfigurable MoMSW scan windows
• MoMSW controlled by reconfigurable GAGs (generic address generators)
• 2-dimensional (data) memory address space
[Figure: four ASMs (Auto-Sequencing Memories)]
© 2007, [email protected] http://hartenstein.de111
TU Kaiserslautern
CGFFT: Parallel Scan Pattern Animation
MoM-3 with 3 varisize scan windows
[Figure: datapath with ASMs (Auto-Sequencing Memories)]
© 2007, [email protected] http://hartenstein.de112
TU Kaiserslautern
Reconfigurable Generic Address Generator (GAG)

A generalization of the DMA, driven by a data counter. GAG & enabling technology published 1989 (survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]); patented by TI 1995.

Acceleration factors by:
• address computation without memory cycles (avoids e.g. 94% of the address computation overhead*)
• storage scheme optimization methodology, etc.
• supporting scratchpad optimization strategies (smart d-caching)
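To make the GAG idea concrete, here is a minimal Python sketch (the names `gag`, `gag_2d` and the parameter set are illustrative, not the MoM interface): a handful of configured parameters generate the whole address stream before/outside the computation, so no memory cycles are spent fetching address-computation instructions.

```python
# Minimal sketch of a generic address generator (GAG): a few configured
# parameters (base, step, count) yield the entire address stream.
def gag(base, step, count):
    addr = base                      # the data counter
    for _ in range(count):
        yield addr
        addr += step

# Two nested sliders give a 2-D scan over a line-mapped memory:
# step_y jumps to the next scan line, step_x walks within the line.
def gag_2d(base, step_x, count_x, step_y, count_y):
    for line_start in gag(base, step_y, count_y):
        yield from gag(line_start, step_x, count_x)
```

With a line pitch of 16 words, `gag_2d(0, 1, 4, 16, 2)` sweeps a 4-by-2 scan window: addresses 0..3, then 16..19.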
© 2007, [email protected] http://hartenstein.de113
TU Kaiserslautern
GAG: 2-D Generic Data Sequence Examples
[Figure: seven example 2-D generic data sequences, a) through g)]
© 2007, [email protected] http://hartenstein.de114
TU KaiserslauternGAG Slider Operation Demo
[Figure: GAG slider operation example: the address slider sweeps between floor and ceiling limits in x and y]
© 2007, [email protected] http://hartenstein.de116
TU Kaiserslautern
JPEG zigzag scan pattern (x, y)

*> Declarations
EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)

[Figure: data counter animation of the scan phases (1-4) of HalfZigZag]
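The MoPL declarations above can be emulated in plain Python to watch the data counter move. This is an illustrative sketch of the scan semantics only (reading `until [1,*]` as "stop once x reaches 1"), not the actual MoPL toolchain:

```python
def run(pos, step, until=None, times=1):
    """One MoPL-like scan: step `times` times, stopping early when the
    data counter matches `until` (None in a slot means "don't care")."""
    x, y = pos
    trace = []
    for _ in range(times):
        x, y = x + step[0], y + step[1]
        trace.append((x, y))
        if until and ((until[0] is None or x == until[0])
                      and (until[1] is None or y == until[1])):
            break
    return trace

def east(p):      return run(p, (1, 0))
def south(p):     return run(p, (0, 1))
def northeast(p): return run(p, (1, -1), until=(None, 1), times=8)
def southwest(p): return run(p, (-1, 1), until=(1, None), times=8)

def half_zigzag(start=(1, 1)):
    """HalfZigZag from the slide: EastScan, then 3 x (SW, S, NE, E)."""
    trace = [start]
    trace += east(trace[-1])
    for scan_round in range(3):
        for scan in (southwest, south, northeast, east):
            trace += scan(trace[-1])
    return trace
```

Running it traces 29 distinct positions of the upper-left triangle of the 8-by-8 block, ending at (8,1); the trailing SouthWestScan and the mirrored uturn(HalfZigZag) would complete the full 64-position zigzag.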
© 2007, [email protected] http://hartenstein.de117
TU Kaiserslautern
Significance of MoMSW Reconfigurable Scan Windows
• MoMSW scan windows have the potential to drastically reduce traffic to/from slow off-chip memory
• No instruction streams are needed to implement scratchpad optimization strategies using fast on-chip memory
• MoMSW scan windows may contribute to speed-ups by a factor of 10, and sometimes much more
• MoMSW scan windows are the deterministic alternative ("d-caching") to (non-deterministic, speculative) classical cache usage: performance can be well predicted
• For data-stream-based computing, scan windows are highly effective, whereas classical caches are entirely useless
© 2007, [email protected] http://hartenstein.de118
TU Kaiserslautern
Linear Filter Application
[Figure: design progression: initial design -> hardware-level access optimization -> after scan line unrolling -> after inner scan line loop unrolling -> final design]
Parallelized merged-buffer linear filter application, with an example image of x=22 by y=11 pixels
Speed-up factor >11, due to MoMSW-based d-caching & storage scheme optimization
© 2007, [email protected] http://hartenstein.de120
TU Kaiserslautern
Processing 4-by-4 Reference Patterns
Mead-&-Conway nMOS design rules: 256 4-by-4 reference patterns
Mead-&-Conway CMOS design rules: >800 4-by-4 reference patterns
MoM: all reference patterns matched in a single clock cycle
vN software: some reference patterns can be skipped, depending on earlier patterns
DPLA, fabricated by the E.I.S. multi-university project: PISA DRC accelerator [ICCAD 1984]
1984: 1 DPLA replaces 256 FPGAs; reference patterns automatically generated from the design rules
PISA: a forerunner of the MoM reconfigurable accelerator
© 2007, [email protected] http://hartenstein.de121
TU Kaiserslautern
Speed-up by MoM-1 compared to the 68020 (PISA project)
© 2007, [email protected] http://hartenstein.de122
TU Kaiserslautern
Speed-up by MoM-3 compared to SPARC 10/51
© 2007, [email protected] http://hartenstein.de123
TU Kaiserslautern
1985 – 1990: Multimedia & DSP: MoM-3 speedup
© 2007, [email protected] http://hartenstein.de124
TU KaiserslauternOutline
• The (non-v-N) anti-machine (Xputer)
• Speed-up by address generators
• Data-procedural Programming Language
• Generalization of the Systolic Array
• Partitioning Compilation Techniques
• Design Space Exploration
• Bridging the Paradigm Chasm
© 2007, [email protected] http://hartenstein.de125
TU Kaiserslautern
Significance of Address Generators
• Address generators have the potential to reduce computation time significantly.
• In a grid-based design rule check a speed-up of more than 2000 has been achieved*
• reconfigured address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead
*) 15,000 if the same algorithm is used
© 2007, [email protected] http://hartenstein.de126
TU Kaiserslautern
hardware vs. software perspective
#  platform                              perspective                           flexibility  performance potential
single paradigm:
1  simple FPGA**                         hardware (data-stream-driven)         +++          ++
2  microprocessor & multi-core           software (instruction-stream-driven)  +++          -
3  coarse-grained                        hardware (data-stream-driven)         ++           +++
4  platform FPGA: 1 & (2)* & 3           hardware, (software)*                 +++          ++++
dual paradigm (for software people):
5  1 & 2                                 hardware + software                   ++           ++
6  2 & 4                                 hardware + software                   +++          ++++
7  2 & 3                                 hardware + software                   +            +++
8  reconfigurable instruction set        hardware + software                   +++          +
*) with soft cores and/or on-chip microprocessor   **) without soft cores
© 2007, [email protected] http://hartenstein.de127
TU Kaiserslautern
Ingredients (all multi-core, on-chip)

• simple FPGA: rLB, soft CPU
• platform FPGA: rLB, rDPU, BRAM, CPU, soft CPU, hardwired special functions
• coarse-grained array (Kress/Kung machine): rDPU, BRAM, RAM, and a CPU for running legacy software
• anti machine (Xputer): ASMs with data counters, rDPU, soft CPU
• CPU with reconfigurable instruction set extension: CPU (program counter), rLB
© 2007, [email protected] http://hartenstein.de128
TU Kaiserslautern
perspective? what expertise is needed? hardware?

• microprocessor (also multi-core): von Neumann software perspective
• simple FPGA (fine-grained): hardware perspective
• platform FPGA (domain-specific core assortment, embedded in FPGA fabrics): mishmash model, a nightmare for undergraduate studies, but by far the best optimization potential
• coarse-grained reconfigurable array: software perspective
• reconfigurable instruction set processor: mishmash model (see above)
© 2007, [email protected] http://hartenstein.de129
TU Kaiserslautern
Objectives

flexibility (for accelerators), for every area which needs:
• avoiding specific silicon
• rapid prototyping, field-patching, emulation
• cheap, compact vHPC
© 2007, [email protected] http://hartenstein.de130
TU Kaiserslautern Reconfigurable Computing opens many spectacular new horizons:
Conclusion (1)
• Cheap vHPC without needing specific silicon: no masks ...
• Massive reduction of the electricity bill, locally and nationally
• Cheap embedded vHPC; cheap desktop supercomputers (a new market)
• Fast and cheap prototyping
• Replacing expensive hardwired accelerators
• Supporting fault tolerance, self-repair and self-organization
• Flexibility for systems with unstable multiple standards, by dynamic reconfigurability
• Emulation logistics for very-long-term spare part provision and part type count reduction (automotive, aerospace ...)
© 2007, [email protected] http://hartenstein.de131
TU Kaiserslautern
Conclusion (2) Needed:
• a universal vHPC co-architecture demonstrator
• the compilation tool problem to be solved
• the language selection problem to be solved
• the education backlog problems to be solved
• use this to develop a very good high school and undergraduate lab course
• a motivator: preparing for the TOP500 contest
For widely spreading its use successfully: select killer applications for demo
© 2007, [email protected] http://hartenstein.de132
TU Kaiserslautern
More compute power by Configware than by Software

75% of all (micro)processors are embedded -> 4 : 1
25% of embedded µprocessors are accelerated by FPGA(s) -> 1 : 4
-> 1 : 1 -> every 2nd µprocessor is accelerated by FPGA(s)
average acceleration factor >2 -> rMIPS* : MIPS > 2
*) rMIPS: MIPS replaced by FPGA compute power
(a very cautious estimation; the difference is probably an order of magnitude)
Conclusion: most compute power comes from Configware
© 2007, [email protected] http://hartenstein.de133
TU KaiserslauternConclusion (3)
Needed:
• Self-repair and self-organization methodology
• Embedded r-emulation logistics methodology
• Universal vHPC co-architecture demonstrator
For widely spreading its use successfully: select a killer application for demo
© 2007, [email protected] http://hartenstein.de134
TU Kaiserslautern
some Goals

Universal HPC co-architecture for:
• embedded vHPC (nomadic, automotive, ...)
• desktop vHPC (scientific computing ...)
Application co-development environment for:
• hardware non-experts, ...
• acceptability by software-type users, ...
Meet product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations (examples: automotive, aerospace, industrial, ...)
© 2007, [email protected] http://hartenstein.de135
TU Kaiserslautern
SuperComputing 06 Panel
SuperComputing, Nov 11-17, 2006, Tampa, Florida; over 7000 registered attendees and 274 exhibitors
• Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm? (Tarek El-Ghazawi, The George Washington University)
• Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm? (Dave Bennett, Xilinx, Inc.)
• Reconfigurable Computing: The Future of HPC (Daniel S. Poznanovic, SRC Computers, Inc.)
• Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm? (Allan J. Cantle, Nallatech Ltd.)
• Challenges for Reconfigurable Computing in HPC (Keith D. Underwood, Sandia National Laboratories)
• Reconfigurable Computing - Are We There Yet? (Rob Pennington, National Center for Supercomputing Applications)
• Reconfigurable Computing: The Road Ahead (Duncan Buell, University of South Carolina)
• Opportunities and Challenges with Reconfigurable HPC (Alan D. George, University of Florida)
© 2007, [email protected] http://hartenstein.de136
TU KaiserslauternOutline
• The (non-v-N) anti-machine (Xputer)
• Speed-up by address generators
• Data-procedural Programming Language
• Generalization of the Systolic Array
• Partitioning Compilation Techniques
• Design Space Exploration
• Bridging the Paradigm Chasm
© 2007, [email protected] http://hartenstein.de138
TU Kaiserslautern
Acceleration Mechanisms by ASM-based MoMSW
• parallelism by a multi-bank memory architecture
• reconfigurable address computation, before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• improving parallelism by storage scheme transformations
• minimizing data movement across chip boundaries
© 2007, [email protected] http://hartenstein.de139
TU KaiserslauternOutline
• The (non-v-N) anti-machine (Xputer)
• Speed-up by address generators
• Data-procedural Programming Language
• Generalization of the Systolic Array
• Partitioning Compilation Techniques
• Design Space Exploration
• Bridging the Paradigm Chasm
© 2007, [email protected] http://hartenstein.de142
TU KaiserslauternC or FORTRAN ?
Gordon Bell: "Computer scientists haven't been interested in programming clusters. If putting the cluster on a chip is what excites them, fine. It will still have to run Fortran!"

Reiner Hartenstein (conclusion of this talk): ... or C (X-C). Classical programming languages, but with a slightly different, data-procedural semantics, are good candidates for parallel programming: it's a shorter leap. Support tools* have been demonstrated by academia.

*) like CoDe-X
© 2007, [email protected] http://hartenstein.de143
TU KaiserslauternNewton’s 1st Law
Newton's 1st Law à la Gordon Bell:
Scientists do not change their direction
© 2007, [email protected] http://hartenstein.de145
TU KaiserslauternDual paradigm: an old hat
Software mind set, instruction-stream-based: flow chart -> control instructions (FSM: state transition)
Mapped into a hardware mind set: action box = flipflop, decision box = (de)multiplexer
-> Register Transfer Modules (DEC, mid-1970s); a similar concept at Case Western Reserve University
[Figure: flipflop (FF) with token bit / evoke signal]
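A tiny illustration of "decision box = multiplexer" (with hypothetical function names): in the instruction-stream view the if decides which branch gets executed; in the datapath view both branch results exist and a multiplexer selects one.

```python
def mux(sel, a, b):
    """Decision box as hardware: both inputs exist, the selector picks one."""
    return a if sel else b

# software view: control flow decides which branch executes
def software_abs(x):
    if x < 0:
        return -x
    return x

# datapath view: both branches are computed; the if became a multiplexer
def datapath_abs(x):
    return mux(x < 0, -x, x)
```

Both return the same value; the datapath version has no branch, only a select, which is why an if clause maps onto a (de)multiplexer in a pipe network.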
© 2007, [email protected] http://hartenstein.de146
TU KaiserslauternDual paradigm: an old hat
(2)

Hardware Description Language scene, ~1970: "It is so simple! Why did it take 25 years to find out?"
Because of the reductionists' tunnel view
Because of a lack of transdisciplinary thinking
[Figure: flipflop (FF) with token bit / evoke signal]
© 2007, [email protected] http://hartenstein.de147
TU KaiserslauternDual paradigm: an old hat
(3)

Software (time domain): "procedure call" or function call: call Module-name (parameters);
Hardware description (space domain): Hardware Description Languages
© 2007, [email protected] http://hartenstein.de152
TU Kaiserslautern
Apropos HiPEAC: Software / Configware Co-Compilation

[Figure: the three ages again (von Neumann machine with its bottleneck; mainframe and microprocessor ages with hardwired accelerators; configware age with a reconfigurable accelerator and a software/configware co-compiler), combined with the CoDe-X (1996) flow: C language source -> Partitioner -> SW compiler and CW compiler, with automatic parallelization by loop transformations, targeting an rDPU array plus CPU]
© 2007, [email protected] http://hartenstein.de153
TU Kaiserslautern
Jürgen Becker's CoDe-X-1 Co-Compiler

X-C is the C language extended by MoPL
[Figure: X-C source -> Analyzer/Profiler and Partitioner -> GNU C compiler (computer machine paradigm: CPU, also running legacy software) and X-C compiler (anti machine paradigm: Xputer; rALU => array size 1-by-1)]
© 2007, [email protected] http://hartenstein.de154
TU Kaiserslautern
Jürgen Becker's CoDe-X-2 Co-Compiler

X-C is the C language extended by MoPL
[Figure: X-C source -> Analyzer/Profiler and Partitioner -> GNU C compiler (computer machine paradigm) and X-C compiler with DPSS (anti machine paradigm), supporting the KressArray family via resource parameters; target: rDPU array plus CPU]
Pipelining: a shorter leap
© 2007, [email protected] http://hartenstein.de155
TU Kaiserslautern
Jürgen Becker's CoDe-X-2 Co-Compiler (continued)

X-C is the C language extended by MoPL
[Figure: the same co-compiler flow, here targeting a heterogeneous multi-core built from dual-mode cores: CPU mode vs. rDPU mode]
© 2007, [email protected] http://hartenstein.de158
TU Kaiserslautern
hardware vs. software perspective (the table shown earlier is repeated here)
© 2007, [email protected] http://hartenstein.de159
TU Kaiserslautern
Data meeting the Processing Unit (PU): we have 2 choices

by Software: routing the data through shared memory, by memory-cycle-hungry instruction streams
by Configware: placement of the execution locality ... a pipe network generated by configware compilation
... partly explaining the RC paradox
© 2007, [email protected] http://hartenstein.de165
TU Kaiserslautern
Avoiding the paradigm shift?

SuperComputing, Nov 11-17, 2006, Tampa, Florida; over 7000 registered attendees and 274 exhibitors
Tarek El-Ghazawi, panelist: "It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?"
Allan J. Cantle, panelist: "A leap too far for the existing HPC community"
We need a bridge strategy, developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques: a shorter leap by coarse-grained platforms, which allow a software-like pipelining perspective
© 2007, [email protected] http://hartenstein.de166
TU Kaiserslautern
• ... the promise of almost unimagined computing power
• Have the hardware developers raced too far ahead of many programmers' ability to create software?
• Parallel computing has been an esoteric skill limited to people involved with high-performance supercomputing. That is changing now that desktop computers and even laptops are going multicore.
• "High-performance computing experts have learned to deal with this, but they are a fraction of the programmers," Saied says.
• In the future you won't be able to get a computer that's not multicore
• As multicore chips become ubiquitous, all programmers will have to learn new tricks.
• Even in high-performance computing there are areas that aren't yet ready for the new multicore machines.
• "In industry, much of their high-performance code is not parallel," Saied says. "These corporations have a lot of time and money invested in their software, and they are rightly worried about having to re-engineer that code base."

Avoiding the paradigm shift? "A leap too far for the existing HPC community"
We need a bridge strategy by developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques: a shorter leap by coarse-grained platforms which allow a software-like pipelining perspective
© 2007, [email protected] http://hartenstein.de167
TU Kaiserslautern
• "Moore's Gap"
• Steve Kirsch, an engineering fellow for Raytheon Systems Co., says that multicore computing presents both the dream of infinite computing power and the nightmare of programming.
• "The real lesson here is that the hardware and software industries have to pay attention to each other," Kirsch says. "Their futures are tied together in a way that they haven't been in recent memory, and that will change the way both businesses will operate."

Avoiding the paradigm shift?
In February, Intel released research details about a chip with 80 cores: a fingernail-sized chip with the same processing power that in 1996 required a supercomputer with a 2,000-square-foot footprint, using 1,000 times the electrical power.
A problem for those who depend on previously written software that has been steadily improving and evolving over decades: "Our legacy software is a real concern to us."
Parallel programming for multicore computers may require new computer languages. "Today we program in sequential languages. Do we need to express our algorithms at a higher level of abstraction? Research into these areas is critical to our success."
© 2007, [email protected] http://hartenstein.de168
TU Kaiserslautern
• "Our programming languages researchers are exploring new programming paradigms and models," Hambrusch says. "Our course on multicore architectures is also preparing students for future software development positions. Purdue is clearly playing a defining role in this critical technology."

Avoiding the paradigm shift?
"In five or six years, laptop computers will have the same capabilities, and face the same obstacles, as today's supercomputers," Saied says. "This challenge will face people who program for desktop computers, too. People who think they have nothing to do with supercomputers and parallel processing will find out that they need these skills, too."
Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. Like locally-based Direct Memory Access (DMA), RDMA improves throughput and performance because it frees up resources. RDMA also facilitates a faster data transfer rate. RDMA implements a transport protocol in the network interface card (NIC) hardware.
© 2007, [email protected] http://hartenstein.de169
TU Kaiserslautern
Three Ways to Make Multicore Work. Number 1, Mathematics: do more computational work with less data motion
• e.g. higher-order methods: trade memory motion for more operations per word, producing an accurate answer in less elapsed time than lower-order methods
• different problem decompositions (no stratified solvers): the mathematical equivalent of loop fusion; e.g. nonlinear Schwarz methods
• ensemble calculations: compute ensemble values directly
• it is time (really past time) to rethink algorithms for memory locality and latency tolerance
• I didn't say threads: see, e.g., Edward A. Lee, "The Problem with Threads," Computer, vol. 39, no. 5, pp. 33-42, May 2006; "Night of the Living Threads", http://weblogs.mozillazine.org/roc/archives/2005/12/night_of_the_living_threads.html, 2005; Robert O'Callahan on John Ousterhout's "Why Threads Are A Bad Idea (for most purposes)" (~2004); Allen Holub, "If I were king: A proposal for fixing the Java programming language's threading problems", http://www128.ibm.com/developerworks/library/j-king.html, 2000

Avoiding the paradigm shift? Breaking the assumptions:
• Don't have any off-chip memory. Consequence: need algorithms, programming models, and software tools that work in more limited memory (a few GB)
• Have off-chip memory, but manage it more effectively. Consequence: need to find a true, general-purpose hardware/software model
• Overlap latency with split operations. Consequence: need to find massive amounts of concurrency, and to manage the programming challenges of split operations (these are hard for programmers to use correctly; may be an opportunity for formal methods)
• Multicore doesn't just stress bandwidth, it increases the need for perfectly parallel algorithms
• All systems will look like attached processors: high latency, low (relative) bandwidth to main memory
• 128 cores? "When [a] request for data from Core 1 results in an L1 cache miss, the request is sent to the L2 cache. If this request hits a modified line in the L1 data cache of Core 2, certain internal conditions may cause incorrect data to be returned to Core 1."
• Everything does not double: traveling from New York to Chicago took 3 weeks before 1830, 1.5 days in 1857, and 6 hours now: only a factor of 6
• MPI on multi-core: 340 ns MPI ping/pong latency; improvement will require better SWE tools
• Benchmarks: ping-pong latency (ring-based ping-pong exchange between all nodes); nearest-neighbor ghost-area exchange (test code from Argonne used to evaluate one-sided and point-to-point operations); CPU availability (percentage of CPU available at the receiver while doing a fixed amount of work during message arrival)
© 2007, [email protected] http://hartenstein.de170
TU Kaiserslautern
in Memoriam ...

in Memoriam Stamatis Vassiliadis, 1951 - 2007
in Memoriam Richard Newton, 1951 - 2007
© 2007, [email protected] http://hartenstein.de171
TU Kaiserslautern
KressArray Xplorer (Platform Design Space Explorer)

KressArray DPSS, published at ASP-DAC 1995
[Figure: Xplorer flow: application set as ALE-X code -> ALE-X compiler -> expression tree / intermediate forms -> Mapper (design rules, KressArray family parameters) -> Datapath Generator -> Kress rDPU layout; Scheduler -> data stream schedule; HDL Generator -> VHDL/Verilog -> Simulator; Analyzer, Delay Estimator and Power Estimator (power data, statistical data) feed an Improvement Proposal Generator with Inference Engine (FOX) -> suggestions; user interface with Architecture Editor, Mapping Editor and suggestion selection]
© 2007, [email protected] http://hartenstein.de172
TU Kaiserslautern
KressArray Family generic Fabrics: a few examples

• select Nearest Neighbour (NN) interconnect: select mode, number and width of NN ports (e.g. 2, 4, 8, 16, 24 or 32 ports)
• rich routing resources: rout-through only, or rout-through and function; more NN ports
• select the function repertory
• examples of 2nd-level interconnect: layouted over the rDPU cell, no separate routing areas!
http://kressarray.de