Reconfigurable Computing and the von Neumann Syndrome
Reiner Hartenstein
TU Kaiserslautern
© 2007, [email protected], http://hartenstein.de

Questions?
• Who is familiar with FPGAs? Is programming them easy?
• Who is familiar with systolic arrays?
• Duality: data streams vs. instruction streams?
• Programming a multicore microprocessor: will it be easy?
Outline
• The Pervasiveness of FPGAs
• The Reconfigurable Computing Paradox
• The Gordon Moore gap
• The von Neumann syndrome
• We need a dual paradigm approach
• Conclusions
Pervasiveness of RC
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html
http://hartenstein.de/pervasiveness.html
RCeducation 2008
http://www.fpl.uni-kl.de/RCeducation08/
The 3rd International Workshop on Reconfigurable Computing Education
April 10, 2008, Montpellier, France
Outline (revisited): the hardware / software chasm; the configware / software chasm; the instruction-stream tunnel; the overhead-prone paradigm
Outline (revisited): instruction stream vs. data stream; bridging the chasm: an old hat; stubborn curriculum task forces
RC education
http://www.fpl.uni-kl.de/RCeducation/
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html
Outline (revisited): platform FPGAs; coarse-grained arrays; saving energy
FPGA with island architecture
• reconfigurable logic box
• switch box
• connect box
• reconfigurable interconnect fabrics
Deficiencies of reconfigurable fabrics (FPGA)
(figure: density in transistors per microchip, 10^0 to 10^9, over 1980–2010: the Gordon Moore curve (microprocessor) vs. FPGA-logical, FPGA-routed, and FPGA-physical density)
• reconfigurability overhead, wiring overhead, routing congestion: overhead >> 10,000
• immense area inefficiency of the general-purpose "simple" (fine-grained) FPGA: 1st DeHon's Law [1996: Ph.D. thesis, MIT]
• slow clock, power guzzler
• deficiency factor: >10,000
Software-to-Configware (FPGA) Migration: some published speed-up factors [2003–2005]
(figure: speed-up factor on a log scale, 10^0 to 10^6, over 1980–2010; trend roughly ×2 per year)
• DSP and wireless: MAC 1000; Reed-Solomon decoding 2400; Viterbi decoding 400
• image processing, pattern matching, multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457; FFT 100
• bioinformatics: BLAST 52; protein identification 40; Smith-Waterman pattern matching 288
• astrophysics: GRAPE 20
• crypto: 1000
• molecular dynamics simulation: 88
• oil and gas: 17
The RC paradox
• deficiency factor: >10,000
• speed-up factor: 6,000
• total discrepancy: >60,000,000
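The paradox figure is plain arithmetic on the two factors above; a trivial sanity check (mine, not from the talk):

```python
# Total discrepancy = technology handicap x achieved speed-up:
# despite an FPGA area/clock/power deficiency of >10,000x versus hardwired
# logic, migrations still win by thousands, so the paradigm advantage
# must be covering both gaps at once.
deficiency = 10_000   # FPGA fabric handicap vs. hardwired logic
speedup = 6_000       # best published software-to-FPGA speed-up above
discrepancy = deficiency * speedup
print(discrepancy)    # 60000000
```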
Software-to-Configware (FPGA) Migration: some published speed-up factors [2003–2005]
These examples worked fine with on-chip memory. There are other algorithms that are more difficult to accelerate, where data caching might be useful (ASM).
How much on-chip embedded BRAM?
(figure, on-chip LatticeSC series: 256–1704 BGA; 56–424 fast on-chip block RAMs (BRAMs); 8–32; DPU: coarse-grained)
Coarse-grained Reconfigurable Array
SNN filter on a (supersystolic) KressArray (mainly a pipe network); array size: 10 × 16 rDPUs; rDPU: reconfigurable Data Path Unit, 32 bits wide; no CPU.
Legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / rout-thru only.
Note the software perspective without instruction streams: pipelining.
Compiled by Nageldinger's KressArray Xplorer, with Juergen Becker's CoDe-X inside.
Question after the talk: "but you can't implement decisions!"
Simple KressArray Configuration Example
Far fewer deficiencies with coarse-grained arrays
(figure: transistors per microchip, 10^0 to 10^9, over 1980–2010; Gordon Moore curve; rDPA-physical and rDPA-logical density)
• area efficiency very close to Moore's law: Hartenstein's Law [1996: ISIS, Austin, TX]
• very compact configuration code: very fast reconfiguration
(contrast: a CPU with program counter and DPU vs. an array of rDPUs)
Software-to-Configware (FPGA) Migration: Oil and gas [2005]
(figure: speed-up factor, 10^0 to 10^6, over 1980–2010; oil and gas: 17; trend ×2 per year)
Side effect: slashing the electricity bill by more than an order of magnitude.
An accidentally discovered side effect
• Software-to-FPGA migration of an oil and gas application:
• speed-up factor of 17
• electricity bill down to <10%
• hardware cost down to <10%
• All other publications reporting speed-ups did not report energy consumption.
Saves >$10,000 in electricity bills per year (at 7¢/kWh) per 64-processor 19" rack [Herb Riley, R. Associates]. This will change ($70 in 2010?).
What about higher speed-up factors? More dramatic electricity savings?
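The savings figure can be cross-checked with a back-of-envelope calculation (my sketch; it assumes the rack draws that power continuously all year):

```python
# What average power does a $10,000/year electricity saving imply at 7 cents/kWh?
savings_per_year = 10_000      # USD, per 64-processor 19" rack
price_per_kwh = 0.07           # USD per kWh
hours_per_year = 365 * 24      # 8760, assuming continuous operation

kwh_saved = savings_per_year / price_per_kwh   # ~142,857 kWh/year
avg_kw = kwh_saved / hours_per_year            # average draw avoided
print(round(avg_kw, 1))                        # 16.3 (kW)
```

So the quoted saving corresponds to roughly 16 kW of continuous load per rack, a plausible magnitude for 64 server-class processors.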
What’s Really Going On With Oil Prices? [BusinessWeek, January 29, 2007]
$52 Price of delivery in February 2007 [New York Mercantile Exchange: Jan. 17]
$200 Minimum oil price in 2010, in a bet by investment banker Matthew Simmons
Energy as a strategic issue
• Google's annual electricity bill: $50,000,000
• Amsterdam's electricity: 25% goes into server farms
• New York City server farms: 1/4 km² of building floor area
• Predicted for the USA in 2020: 30–50% of the entire national electricity consumption goes into the cyber infrastructure [Mark P. Mills]
• petaflop supercomputer (by 2012?): extreme power consumption
Energy: an important motivation

platform example                      | energy: W/Gflops | energy factor
MDGrape-3* (domain-specific, 2004)    | 0.2              | 1
Pentium 4                             | 14               | 70
Earth Simulator (supercomputer, 2003) | 128              | 640

*) feasible also on reconfigurable platforms
Outline (revisited): the Gordon Moore gap & the multicore crisis
What is the reason for the paradox?
• Moore's law is not applicable to all aspects of VLSI
• the Gordon Moore curve does not indicate performance
• the peak clock frequency does not indicate performance
• the law of Gates
Rapid Decline of Computational Density [BWRC, UC Berkeley, 2004]
(figure, "stolen from Bob Colwell": SPECfp2000 / MHz / billion transistors, 0–200, over 1990–2005, for DEC Alpha, SUN, HP, IBM; Alpha: down by 100 in 6 years; IBM: down by 20 in 6 years)
CPU: memory wall, caches, ...
Primary design goal: avoiding a paradigm shift. A dramatic demo of the von Neumann syndrome.
Monstrous Steam Engines of Computing (ready 2003)
5120 processors with 5000 pins each; crossbar weight: 220 t; 3000 km of thick cable; larger than a battleship; power measured in tens of megawatts; floor space measured in tens of thousands of square feet.
Dead Supercomputer Society (research 1985–1995) [Gordon Bell, keynote ISCA 2000]
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, DAPP, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, ICL, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics
We are in a Computing Crisis

platform example                      | hardware cost: $/Gflops | cost factor | energy: W/Gflops | energy factor
MDGrape-3* (domain-specific, 2004)    | 15   | 1   | 0.2 | 1
Pentium 4                             | 400  | 27  | 14  | 70
Earth Simulator (supercomputer, 2003) | 8000 | 533 | 128 | 640

*) feasible also with rDPA
Microprocessor crisis: going multicore. Supercomputing crisis: MPP parallelism does not scale.
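The "factor" columns are each platform's value divided by the MDGrape-3 baseline; a quick recomputation (using the raw numbers as printed on the slide):

```python
# Recompute cost and energy factors from the raw $/Gflops and W/Gflops columns.
platforms = {
    #                  $/Gflops, W/Gflops
    "MDGrape-3":        (15,    0.2),   # domain-specific, 2004 (baseline)
    "Pentium 4":        (400,   14),
    "Earth Simulator":  (8000,  128),   # supercomputer, 2003
}
base_cost, base_energy = platforms["MDGrape-3"]
factors = {name: (round(cost / base_cost), round(energy / base_energy))
           for name, (cost, energy) in platforms.items()}
print(factors["Pentium 4"])        # (27, 70)
print(factors["Earth Simulator"])  # (533, 640)
```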
The von Neumann Paradigm Trap [Burks, Goldstine, von Neumann; 1946]
• program counter (auto-increment, jump, goto, branch)
• Datapath Unit with ALU etc.
• I/O unit, ...
• RAM (memory cells have addresses ...)
CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. We need a dual paradigm approach: CS education's right eye is blind, and its left eye suffers from tunnel view.
What is the reason for the paradox?
It results from decades of tunnel view in CS R&D and education; the basic mind set is completely wrong: the von Neumann syndrome.
"CPU: the most flexible platform"? >1000 CPUs running in parallel are the most inflexible platform. However, FPGAs & rDPAs are very flexible.
The Law of More: drastically declining programmer productivity.
Understanding the Paradox?
An executive summary doesn't help: we must first understand the nature of the paradigm. (von Neumann chickens?)
Von Neumann CPU (tunnel view with the left eye)
CPU = program counter + DPU, with RAM memory.

term | program counter | execution triggered by | paradigm
CPU  | yes             | instruction fetch      | instruction-stream-based

World of Software Engineering; program source: software.
von Neumann is not the common model
The von Neumann instruction-stream-based machine: CPU (program counter + DPU), RAM memory, von Neumann bottleneck.
• mainframe age: software (instruction-stream-based) on the CPU
• microprocessor age: plus co-processors / accelerators: hardware, data-stream-based
Here is the contemporary common model
The von Neumann instruction-stream-based machine: CPU (program counter + DPU), RAM memory, von Neumann bottleneck.
• mainframe age: software on the CPU
• microprocessor age: plus co-processors / accelerators (hardware, data-stream-based)
• now we are in the configware age: CPU plus hardwired accelerators plus reconfigurable accelerators
Machine models

term  | program counter | execution triggered by | paradigm
CPU   | yes             | instruction fetch      | instruction-stream-based
DPU** | no              | data arrival*          | data-stream-based

von Neumann machine: CPU (program counter + DPU) with RAM memory.
Anti machine: RAM blocks with data counters feeding a DPU, or an array of rDPUs.
*) "transport-triggered"
**) does not have a program counter: no instruction fetch at run time
Nick Tredennick's Paradigm Shifts (slowly preparing to use both eyes, for a dual paradigm point of view)
• Early historic machines: algorithm fixed, resources fixed
• von Neumann: 1 programming source needed (software): algorithm variable, resources fixed (CPU)
Compilation: Software (von Neumann model; Software Engineering)
source program → software compiler → software code: a sequential instruction schedule (German: Befehls-Fahrplan).
Nick Tredennick's Paradigm Shifts
• Early historic machines: algorithm fixed, resources fixed
• von Neumann: 1 programming source needed (software): algorithm variable, resources fixed (CPU)
• Reconfigurable Computing: 2 programming sources needed: configware (resources: variable) and flowware (algorithm: variable)
Configware Compilation (Configware Engineering)
source "program" (C, FORTRAN, MATLAB) → configware compiler:
• mapper (placement & routing) → configware code
• scheduler (programming the data counters) → flowware code
Data streams run through the rDPA (a pipe network); each memory is an ASM (Auto-Sequencing Memory): GAG + data counter + RAM.
Configware compilation is fundamentally different from software compilation.
The first archetype machine model: the mainframe
The Software Industry's secret of success: a simple basic machine paradigm ("von Neumann"); personalization is RAM-based and procedural (compile or assemble to the CPU); an instruction-stream-based mind set. But now we live in the Configware Age.
Synthesis Method?
Systolic array synthesis was of course algebraic (linear projection), but only for applications with regular data dependencies: a reductionist approach; mathematicians were caught by their own paradigm trap. In 1995 Rainer Kress discarded their algebraic synthesis methods and replaced them with simulated annealing: the rDPA. The super-systolic array: a generalization of the systolic array.
Having introduced Data streams (~1980, H. T. Kung)
(figure: input and output data streams, indexed by time and port #, flowing through a DPA, a pipe network)
• execution is transport-triggered
• no memory wall
Systolic array research throughout the '80s: mathematicians' hobby. The road map to HPC: ignored for decades.
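The data-stream idea is easy to sketch in software: operands march through a fixed arrangement of operators, one stage per cycle, with no instruction fetch driving the computation. A toy 3-tap FIR filter as a systolic-style pipeline (my illustration, not code from the talk):

```python
# A systolic-style FIR filter: a chain of multiply-accumulate stages.
# The "program" is the fixed tap wiring; only data moves at run time.
def fir_pipeline(taps, stream):
    delays = [0] * len(taps)           # shift registers between stages
    out = []
    for x in stream:                   # one new sample enters per cycle
        delays = [x] + delays[:-1]     # data advances one stage ("systole")
        out.append(sum(t * d for t, d in zip(taps, delays)))
    return out

print(fir_pipeline([1, 1, 1], [1, 2, 3, 4]))  # [1, 3, 6, 9] (running 3-sample sums)
```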
Who generates the Data Streams?
Mathematicians: it's not our job (it's not algebraic).
(figure: "systolic" data streams)
Without a sequencer, it's not a machine (Machine = resources + sequencer).
The reductionist approach ("it's not our job"): mathematicians missed inventing the new machine paradigm ... the anti machine.
The counterpart of the von Neumann machine: the Kress/Kung anti machine (coarse-grained)
(figure: data streams entering and leaving an (r)DPA, fed on all sides by ASMs)
ASM: Auto-Sequencing Memory (GAG + data counter + RAM)
• data counters instead of a program counter
• the data counters are located at the memory, not at the datapath
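The contrast between the two sequencing mechanisms can be sketched in a few lines (a conceptual toy; the function names are mine, not Hartenstein's):

```python
# von Neumann: a program counter fetches an instruction, which then fetches data.
def von_neumann_sum(program, memory):
    pc, acc = 0, 0
    while pc < len(program):
        op, addr = program[pc]      # instruction fetch at run time (the bottleneck)
        if op == "add":
            acc += memory[addr]     # separate data access
        pc += 1
    return acc

# Anti machine: data counters at the memory (an ASM) emit the address sequence,
# decided before run time; the datapath fires on data arrival (transport-triggered).
def anti_machine_sum(memory, data_counter_sequence):
    acc = 0
    for addr in data_counter_sequence:
        acc += memory[addr]         # no instruction fetch per operation
    return acc

mem = [5, 7, 9]
assert von_neumann_sum([("add", 0), ("add", 2)], mem) == 14
assert anti_machine_sum(mem, [0, 2]) == 14
```

Both compute the same sum; only the first spends a fetch-decode step per operation at run time.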
Acceleration Mechanisms of the ASM-based MoM
• parallelism by a multi-bank memory architecture
• reconfigurable address computation, before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• improved parallelism by storage-scheme transformations
• minimized data movement across chip boundaries
FPGAs in Supercomputing
• Synergism: coarse-grained parallelism through conventional parallel processing (CPUs: program counter + 32/64-bit DataPath Units),
• and fine-grained parallelism through direct configware execution on the FPGAs (millions of 1-bit reconfigurable logic boxes (rLBs) embedded in a reconfigurable interconnect fabric)
Anti machine (resources + sequencer)
• hardwired anti machine: memory with data counters as sequencers; algorithms expressed as flowware
• reconfigurable anti machine: the same, plus configware to personalize the resources
von Neumann machine (resources + sequencer)
• memory; algorithms expressed as software; the sequencer is the program counter
The clash of paradigms (microprocessor age: µprocessor plus accelerators)
A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter. The hardware guy's mind set is structural (a kind of data-stream-based mind set); the programmer's is procedural (the basic mind set is instruction-stream-based): the software / hardware chasm. We need a data-stream-based machine paradigm.
Xputer Principles
• address generators: reconfigurable (ASMs)
• data path: reconfigurable (rALU, DPLA)
Contemporary? 1984: the first FPGAs were very tiny & very expensive; as the CPU we used the VAX-11/750 of my group.
Outline (part 2)
• The von Neumann Paradigm
• Accelerators and FPGAs
• The Reconfigurable Computing Paradox
• The new Paradigm
• Coarse-grained
• Bridging the Paradigm Chasm
• Conclusions
FPGA Modes of Operation: simple, static reconfigurability
Configware code is loaded from external flash memory, e.g. after power-on (~milliseconds).
(timeline: configuration phase (C ph), then execution phase (E ph), with the device off in between)
(requiring new OS principles)
Illustrating dynamically reconfigurable FPGAs: an established R&D area
(timeline: configware macros X, Y, Z, each alternating configuration phases (C ph) and execution phases (E ph); X configures Y)
Partially reconfigurable: swapping and scheduling of relocatable configware code macros is managed by a configware operating system. A configware OS is fundamentally different from a software OS.
Reconfigurable Computing at Microsoft: a Microsoft "ReconVista"?
Outline (revisited)
Reconfigurable HPC
• This area is almost 10 years old
TU KaiserslauternHave to re-think basic assumptions
Instead of physical limits, fundamental misconceptions of algorithmic complexity theory limit the progress and will necessitate new breakthroughs.
Not processing is costly, but moving data and messages
We’ve to re-think basic assumptions behind computing
Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]
The instruction-stream-based approach suffers from the von Neumann bottleneck; the data-stream-based approach has no von Neumann bottleneck (many watering pots).
Outline (part 3)
• The (non-von-Neumann) anti machine (Xputer)
• Speed-up by address generators
• Data-procedural programming language
• Generalization of the systolic array
• Partitioning compilation techniques
• Design space exploration
• Bridging the paradigm chasm
More compute power by Configware than Software (a very cautious estimation)
• 75% of all (micro)processors are embedded: 4 : 1
• 25% of the embedded µprocessors are accelerated by FPGA(s): 1 : 4
• → 1 : 1 → every 2nd µprocessor is accelerated by FPGA(s)
• average acceleration factor >2 → rMIPS* : MIPS > 2 (the difference is probably an order of magnitude)
*) rMIPS: MIPS replaced by FPGA compute power
Conclusion: most compute power comes from Configware.
Xputer Lab (around 1990)
Programming Language Paradigms: principles of MoPL [1994] (very easy to learn)

language category            | Computer Languages | Xputer Languages
both                         | deterministic, procedural sequencing: traceable, checkpointable
operation sequence driven by | read next instruction; goto (instr. addr.); jump (to instr. addr.); instruction loop, loop nesting; no parallel loops; escapes; instruction-stream branching | read next data item; goto (data addr.); jump (to data addr.); data loop, loop nesting; parallel loops; escapes; data-stream branching
state register               | program counter | data counter(s)
address computation          | massive memory-cycle overhead | overhead avoided
instruction fetch            | memory-cycle overhead | overhead avoided
parallel memory-bank access  | interleaving only | no restrictions (multiple GAGs)
Avoiding the paradigm shift?
"It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006)
"A leap too far for the existing HPC community" (panelist Allan J. Cantle)
(SuperComputing, Nov 11–17, 2006, Tampa, Florida: over 7000 registered attendees and 274 exhibitors)
We need a bridge strategy: advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques, and a shorter leap via coarse-grained platforms, which allow a software-like pipelining perspective.
We need a new machine paradigm
A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter. We urgently need a data-stream-based machine paradigm (a data-stream-based mind set); it was prepared almost 30 years ago.
(figure: data streams through an array)
Generic Address Generator (GAG): a generalization of the DMA
GAG + data counter: the sequencer of an ASM. GAG & enabling technology published in 1989 (survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]); patented by TI in 1995.
Acceleration factors by:
• address computation without memory cycles: avoids e.g. 94% address-computation overhead*
• storage-scheme optimization methodology, etc.
*) Software to Xputer migration
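A GAG can be pictured as nested data counters compiled into a fixed address pattern; a minimal sketch of a 2D block scan (my illustration; the real GAG supports far richer step, limit, and stride parameters):

```python
# Generic Address Generator sketch: emit the address stream for a
# rows x cols block inside a row-major 2D array, so the datapath's
# run time spends no memory cycles on address arithmetic.
def gag_block_scan(base, row_stride, rows, cols):
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c

# scan a 2 x 3 block of a width-10 array starting at address 100
print(list(gag_block_scan(100, 10, 2, 3)))  # [100, 101, 102, 110, 111, 112]
```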
The 2nd "archetype" machine model: the reconfigurable accelerator
The Configware Industry's secret of success: a simple basic machine paradigm ("Kress-Kung"); personalization is RAM-based, structural, and done by compilation; a data-stream-based mind set.
SNN filter on a (supersystolic) KressArray, revisited (mainly a pipe network)
Array size: 10 × 16 = 160 rDPUs; rDPU: reconfigurable Data Path Unit, e.g. 32 bits wide; no CPU. Note the software perspective without instruction streams.
Legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / rout-thru only.
Question after the talk: "but you can't implement decisions!" (a high-level R&D manager of a large Japanese IT industry group): a symptom of the von Neumann syndrome, yielded by a single-paradigm mind set. Executive summary? Forget it! How about a microprocessor giant having >100 vice presidents? In fact, an if clause turns into a multiplexer.
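The answer to the heckler is the slide's last remark: in a pipe network a data-dependent if does not redirect an instruction stream; both operand paths exist in space and a multiplexer selects per data item. A toy illustration (mine, not from the talk):

```python
# Instruction-stream view: a data-dependent branch per element.
def absolute_branchy(xs):
    out = []
    for x in xs:
        if x < 0:          # the branch redirects the instruction stream
            out.append(-x)
        else:
            out.append(x)
    return out

# Data-stream view: both datapaths exist in parallel hardware;
# a multiplexer picks one result as each item streams past.
def mux(sel, a, b):
    return a if sel else b

def absolute_muxed(xs):
    return [mux(x < 0, -x, x) for x in xs]

assert absolute_branchy([-2, 3, 0]) == absolute_muxed([-2, 3, 0]) == [2, 3, 0]
```

Decisions thus cost one multiplexer per if, not a pipeline-draining branch.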
Dual Paradigm Application Development (Juergen Becker's CoDe-X, 1996)
C language source → partitioner → SW compiler (for the CPU) and CW compiler (for the rDPU array):
• automatic parallelization by loop transformations
• generating a pipe network
• placement and routing
© 2007, [email protected] http://hartenstein.de95
TU KaiserslauternHybrid Multi Core example
twin paradigm machine
each core can run CPU mode
or rDPU mode
[Figure: 64-core array mixing CPU cores and rDPU cores]
How about a microprocessor giant having >100 vice presidents?
Customer refuses the paradigm shift?
Disabled for the paradigm shift?
© 2007, [email protected] http://hartenstein.de96
TU Kaiserslautern
[Figure: hybrid multi-core array of rDPU cores and CPU cores]
Compilation for Dual Paradigm Multicore
SW compiler
CW compiler
C language source
Partitioner
Juergen Becker's CoDe-X, 1996
compile to hybrid multicore
placement and routing
automatic parallelization by loop transformations
generating a pipe network
© 2007, [email protected] http://hartenstein.de97
TU KaiserslauternOutline
• The von Neumann Paradigm
• Accelerators and FPGAs
• The Reconfigurable Computing Paradox
• The new Paradigm
• Coarse-grained
• Bridging the Paradigm Chasm
• Conclusions
© 2007, [email protected] http://hartenstein.de98
TU Kaiserslautern
Here is the common model
[Figure: von Neumann instruction-stream-based machine: CPU (program counter + DPU) and RAM memory, coupled through the von Neumann bottleneck]
mainframe age: CPU plus hardwired accelerator co-processors (software is instruction-stream-based, the accelerator hardware is data-stream-based)
microprocessor age: CPU plus hardwired accelerator
configware age: CPU plus reconfigurable accelerator; software and configware from a software/configware co-compiler
© 2007, [email protected] http://hartenstein.de99
TU KaiserslauternOutline
• The von Neumann Paradigm
• Accelerators and FPGAs
• The Reconfigurable Computing Paradox
• The new Paradigm
• Coarse-grained
• Bridging the Paradigm Chasm
• Conclusions
© 2007, [email protected] http://hartenstein.de100
TU KaiserslauternMulti Core: Just more CPUs ?
• Complexity and clock frequency scaling of single-core microprocessors have come to an end
• Without a paradigm shift, just more CPUs on a chip lead to the dead ends known from supercomputing
• Multi-core microprocessor chips are emerging: soon 32 cores on an AMD chip, and 80 on an Intel chip
• Multi-threading is not the silver bullet
• We have to re-think the basic assumptions behind computing
© 2007, [email protected] http://hartenstein.de101
TU KaiserslauternSolution not expected from CS officers
We need mutual efforts, like the EE/CS cooperation known from the Mead & Conway revolution
Progress of the joint task force on CS curriculum recommendations is extremely disillusioning; it's more like a lobby: "my area is the most important"
For RC, other motivations are similarly high-grade: growing cost and looming shortage of energy
The personal supercomputer, by Reconfigurable Computing: a far-ranging, massive push of innovation in all areas of science and economy
© 2007, [email protected] http://hartenstein.de102
TU Kaiserslautern
Computing Sciences are in a severe Crisis
We urgently need to shape the Reconfigurable Computing Revolution, to enable incredibly promising new horizons of affordable highest-performance computing
This cannot be achieved with the classical software-based mind set
We need a new dual paradigm approach
Watch out not to get screwed !
Supercomputing titans may be your enemies
© 2007, [email protected] http://hartenstein.de103
TU KaiserslauternThe Configware Age
• Mainframe age and microprocessor(-only) age are history
• We are living in the configware age right now!
• Attempts to avoid the paradigm shift will again create a disaster
© 2007, [email protected] http://hartenstein.de107
TU Kaiserslautern
Von Neumann vs. anti machine
#  feature                      von Neumann machine      hardwired anti machine / reconfigurable anti machine
1  machine code schedules:      instruction stream       data streams
2  number of program sources:   1                        2
3  source 1:                    -                        none / configware
4  source 2:                    software                 flowware
5  sequenced by:                program counter          data counters
6  counter co-located with:     PU (data path): CPU      memory block: ASM
9  inter-PU communication:      common memory            piped through
10 data meeting the PU:         move data at run time    move locality of execution at compile time
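The contrast in rows 5 and 10 can be sketched in plain Python (illustrative names only, not a real machine model): the von Neumann loop computes every data address under program-counter control at run time, while the anti machine's ASM emits a data stream from its own data counter and the datapath just consumes it.

```python
# Illustrative contrast (not a real machine model): how data meets the PU.

# von Neumann view: a program counter sequences instructions; each data
# address is computed by instructions at run time.
def vn_sum(memory, base, n):
    total, i = 0, 0
    while i < n:                      # compare, branch: instruction stream
        total += memory[base + i]     # address computed at run time
        i += 1
    return total

# anti-machine view: the data counter lives with the memory (ASM) and emits
# a data stream; the datapath consumes operands without instruction fetches.
def asm_stream(memory, base, step, n):
    addr = base                       # the data counter
    for _ in range(n):
        yield memory[addr]
        addr += step

def anti_sum(stream):
    total = 0
    for x in stream:                  # pure data stream: no addresses here
        total += x
    return total
```

Both compute the same result; the difference is where the addresses come from.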
© 2007, [email protected] http://hartenstein.de108
TU Kaiserslautern
Overhead avoided by anti machine
#  overhead at run time            von Neumann machine    anti machine (hardwired / reconfigurable)
11 state address computation       instruction stream     none
12 data address computation        instruction stream     none
13 inter-PU communication          instruction stream     none
14 instruction fetch               instruction stream     none
15 data meeting the PU             instruction stream     none
© 2007, [email protected] http://hartenstein.de110
TU Kaiserslautern
MoM Scan Window (MoMSW) Illustration

MoM primary architectural features:
• multiple (typically 3) vari-size reconfigurable MoMSW scan windows
• MoMSW controlled by reconfigurable GAGs (generic address generators)
• 2-dimensional (data) memory address space
[Figure: four ASMs (Auto-Sequencing Memories)]
© 2007, [email protected] http://hartenstein.de111
TU Kaiserslautern
CGFFT: Parallel Scan Pattern Animation
MoM-3 with 3 varisize scan windows
[Figure: datapath with ASMs (Auto-Sequencing Memories)]
© 2007, [email protected] http://hartenstein.de112
TU Kaiserslautern
Reconfigurable Generic Address Generator (GAG)

A generalization of the DMA, driven by a data counter. GAG & enabling technology published 1989 (survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]); patented by TI 1995.

Acceleration factors by:
• address computation without memory cycles (avoids e.g. 94% of the address computation overhead*)
• storage scheme optimization methodology, etc.
• supporting scratchpad optimization strategies (smart d-caching)
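To make the GAG idea concrete, here is a minimal Python sketch (the names `gag`, `gag_2d` and the parameter set are illustrative, not the MoM interface): a handful of configured parameters generate the whole address stream before/outside the computation, so no memory cycles are spent fetching address-computation instructions.

```python
# Minimal sketch of a generic address generator (GAG): a few configured
# parameters (base, step, count) yield the entire address stream.
def gag(base, step, count):
    addr = base                      # the data counter
    for _ in range(count):
        yield addr
        addr += step

# Two nested sliders give a 2-D scan over a line-mapped memory:
# step_y jumps to the next scan line, step_x walks within the line.
def gag_2d(base, step_x, count_x, step_y, count_y):
    for line_start in gag(base, step_y, count_y):
        yield from gag(line_start, step_x, count_x)
```

With a line pitch of 16 words, `gag_2d(0, 1, 4, 16, 2)` sweeps a 4-by-2 scan window: addresses 0..3, then 16..19.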
© 2007, [email protected] http://hartenstein.de113
TU Kaiserslautern
GAG: 2-D Generic Data Sequence Examples
[Figure: seven example 2-D generic data sequences, a) through g)]
© 2007, [email protected] http://hartenstein.de114
TU KaiserslauternGAG Slider Operation Demo
[Figure: GAG slider operation example: the address slider sweeps between floor and ceiling limits in x and y]
© 2007, [email protected] http://hartenstein.de116
TU Kaiserslautern
JPEG zigzag scan pattern (x, y)

*> Declarations
EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)

[Figure: data counter animation of the scan phases (1-4) of HalfZigZag]
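The MoPL declarations above can be emulated in plain Python to watch the data counter move. This is an illustrative sketch of the scan semantics only (reading `until [1,*]` as "stop once x reaches 1"), not the actual MoPL toolchain:

```python
def run(pos, step, until=None, times=1):
    """One MoPL-like scan: step `times` times, stopping early when the
    data counter matches `until` (None in a slot means "don't care")."""
    x, y = pos
    trace = []
    for _ in range(times):
        x, y = x + step[0], y + step[1]
        trace.append((x, y))
        if until and ((until[0] is None or x == until[0])
                      and (until[1] is None or y == until[1])):
            break
    return trace

def east(p):      return run(p, (1, 0))
def south(p):     return run(p, (0, 1))
def northeast(p): return run(p, (1, -1), until=(None, 1), times=8)
def southwest(p): return run(p, (-1, 1), until=(1, None), times=8)

def half_zigzag(start=(1, 1)):
    """HalfZigZag from the slide: EastScan, then 3 x (SW, S, NE, E)."""
    trace = [start]
    trace += east(trace[-1])
    for scan_round in range(3):
        for scan in (southwest, south, northeast, east):
            trace += scan(trace[-1])
    return trace
```

Running it traces 29 distinct positions of the upper-left triangle of the 8-by-8 block, ending at (8,1); the trailing SouthWestScan and the mirrored uturn(HalfZigZag) would complete the full 64-position zigzag.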
© 2007, [email protected] http://hartenstein.de117
TU Kaiserslautern
Significance of MoMSW Reconfigurable Scan Windows
• MoMSW scan windows have the potential to drastically reduce traffic to/from slow off-chip memory
• No instruction streams are needed to implement scratchpad optimization strategies using fast on-chip memory
• MoMSW scan windows may contribute to speed-ups by a factor of 10, and sometimes much more
• MoMSW scan windows are the deterministic alternative ("d-caching") to (non-deterministic, speculative) classical cache usage: performance can be well predicted
• For data-stream-based computing, scan windows are highly effective, whereas classical caches are entirely useless
© 2007, [email protected] http://hartenstein.de118
TU Kaiserslautern
Linear Filter Application
[Figure: design progression: initial design -> hardware-level access optimization -> after scan line unrolling -> after inner scan line loop unrolling -> final design]
Parallelized merged-buffer linear filter application, with an example image of x=22 by y=11 pixels
Speed-up factor >11, due to MoMSW-based d-caching & storage scheme optimization
© 2007, [email protected] http://hartenstein.de120
TU Kaiserslautern
Processing 4-by-4 Reference Patterns
Mead-&-Conway nMOS design rules: 256 4-by-4 reference patterns
Mead-&-Conway CMOS design rules: >800 4-by-4 reference patterns
MoM: all reference patterns matched in a single clock cycle
vN software: some reference patterns can be skipped, depending on earlier patterns
DPLA, fabricated by the E.I.S. multi-university project: PISA DRC accelerator [ICCAD 1984]
1984: 1 DPLA replaces 256 FPGAs; reference patterns automatically generated from the design rules
PISA: a forerunner of the MoM reconfigurable accelerator
© 2007, [email protected] http://hartenstein.de121
TU Kaiserslautern
Speed-up by MoM-1 compared to the 68020 (PISA project)
© 2007, [email protected] http://hartenstein.de122
TU Kaiserslautern
Speed-up by MoM-3 compared to SPARC 10/51
© 2007, [email protected] http://hartenstein.de123
TU Kaiserslautern
1985 – 1990: Multimedia & DSP: MoM-3 speedup
© 2007, [email protected] http://hartenstein.de124
TU KaiserslauternOutline
• The (non-v-N) anti-machine (Xputer)
• Speed-up by address generators
• Data-procedural Programming Language
• Generalization of the Systolic Array
• Partitioning Compilation Techniques
• Design Space Exploration
• Bridging the Paradigm Chasm
© 2007, [email protected] http://hartenstein.de125
TU Kaiserslautern
Significance of Address Generators
• Address generators have the potential to reduce computation time significantly.
• In a grid-based design rule check a speed-up of more than 2000 has been achieved*
• reconfigured address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead
*) 15,000 if the same algorithm is used
© 2007, [email protected] http://hartenstein.de126
TU Kaiserslautern
hardware vs. software perspective
#  platform                              perspective                           flexibility  performance potential
single paradigm:
1  simple FPGA**                         hardware (data-stream-driven)         +++          ++
2  microprocessor & multi-core           software (instruction-stream-driven)  +++          -
3  coarse-grained                        hardware (data-stream-driven)         ++           +++
4  platform FPGA: 1 & (2)* & 3           hardware, (software)*                 +++          ++++
dual paradigm (for software people):
5  1 & 2                                 hardware + software                   ++           ++
6  2 & 4                                 hardware + software                   +++          ++++
7  2 & 3                                 hardware + software                   +            +++
8  reconfigurable instruction set        hardware + software                   +++          +
*) with soft cores and/or on-chip microprocessor   **) without soft cores
© 2007, [email protected] http://hartenstein.de127
TU Kaiserslautern
Ingredients (all multi-core, on-chip)

• simple FPGA: rLB, soft CPU
• platform FPGA: rLB, rDPU, BRAM, CPU, soft CPU, hardwired special functions
• coarse-grained array (Kress/Kung machine): rDPU, BRAM, RAM, and a CPU for running legacy software
• anti machine (Xputer): ASMs with data counters, rDPU, soft CPU
• CPU with reconfigurable instruction set extension: CPU (program counter), rLB
© 2007, [email protected] http://hartenstein.de128
TU Kaiserslautern
perspective? what expertise is needed? hardware?

• microprocessor (also multi-core): von Neumann software perspective
• simple FPGA (fine-grained): hardware perspective
• platform FPGA (domain-specific core assortment, embedded in FPGA fabrics): mishmash model, a nightmare for undergraduate studies, but by far the best optimization potential
• coarse-grained reconfigurable array: software perspective
• reconfigurable instruction set processor: mishmash model (see above)
© 2007, [email protected] http://hartenstein.de129
TU Kaiserslautern
Objectives

flexibility (for accelerators), for every area which needs:
• avoiding specific silicon
• rapid prototyping, field-patching, emulation
• cheap, compact vHPC
© 2007, [email protected] http://hartenstein.de130
TU Kaiserslautern Reconfigurable Computing opens many spectacular new horizons:
Conclusion (1)
• Cheap vHPC without needing specific silicon: no masks ...
• Massive reduction of the electricity bill, locally and nationally
• Cheap embedded vHPC; cheap desktop supercomputers (a new market)
• Fast and cheap prototyping
• Replacing expensive hardwired accelerators
• Supporting fault tolerance, self-repair and self-organization
• Flexibility for systems with unstable multiple standards, by dynamic reconfigurability
• Emulation logistics for very-long-term spare part provision and part type count reduction (automotive, aerospace ...)
© 2007, [email protected] http://hartenstein.de131
TU Kaiserslautern
Conclusion (2) Needed:
• a universal vHPC co-architecture demonstrator
• the compilation tool problem to be solved
• the language selection problem to be solved
• the education backlog problems to be solved
• use this to develop a very good high school and undergraduate lab course
• a motivator: preparing for the TOP500 contest
For widely spreading its use successfully: select killer applications for demo
© 2007, [email protected] http://hartenstein.de132
TU Kaiserslautern
More compute power by Configware than by Software

75% of all (micro)processors are embedded -> 4 : 1
25% of embedded µprocessors are accelerated by FPGA(s) -> 1 : 4
-> 1 : 1 -> every 2nd µprocessor is accelerated by FPGA(s)
average acceleration factor >2 -> rMIPS* : MIPS > 2
*) rMIPS: MIPS replaced by FPGA compute power
(a very cautious estimation; the difference is probably an order of magnitude)
Conclusion: most compute power comes from Configware
© 2007, [email protected] http://hartenstein.de133
TU KaiserslauternConclusion (3)
Needed:
• Self-repair and self-organization methodology
• Embedded r-emulation logistics methodology
• Universal vHPC co-architecture demonstrator
For widely spreading its use successfully: select a killer application for demo
© 2007, [email protected] http://hartenstein.de134
TU Kaiserslautern
some Goals

Universal HPC co-architecture for:
• embedded vHPC (nomadic, automotive, ...)
• desktop vHPC (scientific computing ...)
Application co-development environment for:
• hardware non-experts, ...
• acceptability by software-type users, ...
Meet product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations (examples: automotive, aerospace, industrial, ...)
© 2007, [email protected] http://hartenstein.de135
TU Kaiserslautern
SuperComputing 06 Panel
SuperComputing, Nov 11-17, 2006, Tampa, Florida; over 7000 registered attendees and 274 exhibitors
• Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm? (Tarek El-Ghazawi, The George Washington University)
• Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm? (Dave Bennett, Xilinx, Inc.)
• Reconfigurable Computing: The Future of HPC (Daniel S. Poznanovic, SRC Computers, Inc.)
• Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm? (Allan J. Cantle, Nallatech Ltd.)
• Challenges for Reconfigurable Computing in HPC (Keith D. Underwood, Sandia National Laboratories)
• Reconfigurable Computing - Are We There Yet? (Rob Pennington, National Center for Supercomputing Applications)
• Reconfigurable Computing: The Road Ahead (Duncan Buell, University of South Carolina)
• Opportunities and Challenges with Reconfigurable HPC (Alan D. George, University of Florida)
© 2007, [email protected] http://hartenstein.de136
TU KaiserslauternOutline
• The (non-v-N) anti-machine (Xputer)
• Speed-up by address generators
• Data-procedural Programming Language
• Generalization of the Systolic Array
• Partitioning Compilation Techniques
• Design Space Exploration
• Bridging the Paradigm Chasm
© 2007, [email protected] http://hartenstein.de138
TU Kaiserslautern
Acceleration Mechanisms by ASM-based MoMSW
• parallelism by a multi-bank memory architecture
• reconfigurable address computation, before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• improving parallelism by storage scheme transformations
• minimizing data movement across chip boundaries
© 2007, [email protected] http://hartenstein.de139
TU KaiserslauternOutline
• The (non-v-N) anti-machine (Xputer)
• Speed-up by address generators
• Data-procedural Programming Language
• Generalization of the Systolic Array
• Partitioning Compilation Techniques
• Design Space Exploration
• Bridging the Paradigm Chasm
© 2007, [email protected] http://hartenstein.de142
TU KaiserslauternC or FORTRAN ?
Gordon Bell: "Computer scientists haven't been interested in programming clusters. If putting the cluster on a chip is what excites them, fine. It will still have to run Fortran!"

Reiner Hartenstein (conclusion of this talk): ... or C (X-C). Classical programming languages, but with a slightly different, data-procedural semantics, are good candidates for parallel programming: it's a shorter leap. Support tools* have been demonstrated by academia.

*) like CoDe-X
© 2007, [email protected] http://hartenstein.de143
TU KaiserslauternNewton’s 1st Law
Newton's 1st Law à la Gordon Bell:
Scientists do not change their direction
© 2007, [email protected] http://hartenstein.de145
TU KaiserslauternDual paradigm: an old hat
Software mind set, instruction-stream-based: flow chart -> control instructions (FSM: state transition)
Mapped into a hardware mind set: action box = flipflop, decision box = (de)multiplexer
-> Register Transfer Modules (DEC, mid-1970s); a similar concept at Case Western Reserve University
[Figure: flipflop (FF) with token bit / evoke signal]
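A tiny illustration of "decision box = multiplexer" (with hypothetical function names): in the instruction-stream view the if decides which branch gets executed; in the datapath view both branch results exist and a multiplexer selects one.

```python
def mux(sel, a, b):
    """Decision box as hardware: both inputs exist, the selector picks one."""
    return a if sel else b

# software view: control flow decides which branch executes
def software_abs(x):
    if x < 0:
        return -x
    return x

# datapath view: both branches are computed; the if became a multiplexer
def datapath_abs(x):
    return mux(x < 0, -x, x)
```

Both return the same value; the datapath version has no branch, only a select, which is why an if clause maps onto a (de)multiplexer in a pipe network.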
© 2007, [email protected] http://hartenstein.de146
TU KaiserslauternDual paradigm: an old hat
(2)

Hardware Description Language scene, ~1970: "It is so simple! Why did it take 25 years to find out?"
Because of the reductionists' tunnel view
Because of a lack of transdisciplinary thinking
[Figure: flipflop (FF) with token bit / evoke signal]
© 2007, [email protected] http://hartenstein.de147
TU KaiserslauternDual paradigm: an old hat
(3)

Software (time domain): "procedure call" or function call: call Module-name (parameters);
Hardware description (space domain): Hardware Description Languages
© 2007, [email protected] http://hartenstein.de152
TU Kaiserslautern
Apropos HiPEAC: Software / Configware Co-Compilation

[Figure: the three ages again (von Neumann machine with its bottleneck; mainframe and microprocessor ages with hardwired accelerators; configware age with a reconfigurable accelerator and a software/configware co-compiler), combined with the CoDe-X (1996) flow: C language source -> Partitioner -> SW compiler and CW compiler, with automatic parallelization by loop transformations, targeting an rDPU array plus CPU]
© 2007, [email protected] http://hartenstein.de153
TU Kaiserslautern
Jürgen Becker's CoDe-X-1 Co-Compiler

X-C is the C language extended by MoPL
[Figure: X-C source -> Analyzer/Profiler and Partitioner -> GNU C compiler (computer machine paradigm: CPU, also running legacy software) and X-C compiler (anti machine paradigm: Xputer; rALU => array size 1-by-1)]
© 2007, [email protected] http://hartenstein.de154
TU Kaiserslautern
Jürgen Becker's CoDe-X-2 Co-Compiler

X-C is the C language extended by MoPL
[Figure: X-C source -> Analyzer/Profiler and Partitioner -> GNU C compiler (computer machine paradigm) and X-C compiler with DPSS (anti machine paradigm), supporting the KressArray family via resource parameters; target: rDPU array plus CPU]
Pipelining: a shorter leap
© 2007, [email protected] http://hartenstein.de155
TU Kaiserslautern
Jürgen Becker's CoDe-X-2 Co-Compiler (continued)

X-C is the C language extended by MoPL
[Figure: the same co-compiler flow, here targeting a heterogeneous multi-core built from dual-mode cores: CPU mode vs. rDPU mode]
© 2007, [email protected] http://hartenstein.de158
TU Kaiserslautern
hardware vs. software perspective (the table shown earlier is repeated here)
© 2007, [email protected] http://hartenstein.de159
TU Kaiserslautern
Data meeting the Processing Unit (PU): we have 2 choices

by Software: routing the data through shared memory, by memory-cycle-hungry instruction streams
by Configware: placement of the execution locality ... a pipe network generated by configware compilation
... partly explaining the RC paradox
© 2007, [email protected] http://hartenstein.de165
TU Kaiserslautern
Avoiding the paradigm shift?

SuperComputing, Nov 11-17, 2006, Tampa, Florida; over 7000 registered attendees and 274 exhibitors
Tarek El-Ghazawi, panelist: "It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?"
Allan J. Cantle, panelist: "A leap too far for the existing HPC community"
We need a bridge strategy, developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques: a shorter leap by coarse-grained platforms, which allow a software-like pipelining perspective
© 2007, [email protected] http://hartenstein.de166
TU Kaiserslautern
• ... the promise of almost unimagined computing power
• Have the hardware developers raced too far ahead of many programmers' ability to create software?
• Parallel computing has been an esoteric skill limited to people involved with high-performance supercomputing. That is changing now that desktop computers and even laptops are going multicore.
• "High-performance computing experts have learned to deal with this, but they are a fraction of the programmers," Saied says.
• In the future you won't be able to get a computer that's not multicore
• As multicore chips become ubiquitous, all programmers will have to learn new tricks.
• Even in high-performance computing there are areas that aren't yet ready for the new multicore machines.
• "In industry, much of their high-performance code is not parallel," Saied says. "These corporations have a lot of time and money invested in their software, and they are rightly worried about having to re-engineer that code base."

Avoiding the paradigm shift? "A leap too far for the existing HPC community"
We need a bridge strategy by developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques: a shorter leap by coarse-grained platforms which allow a software-like pipelining perspective
© 2007, [email protected] http://hartenstein.de167
TU Kaiserslautern
• "Moore's Gap"
• Steve Kirsch, an engineering fellow for Raytheon Systems Co., says that multicore computing presents both the dream of infinite computing power and the nightmare of programming.
• "The real lesson here is that the hardware and software industries have to pay attention to each other," Kirsch says. "Their futures are tied together in a way that they haven't been in recent memory, and that will change the way both businesses will operate."

Avoiding the paradigm shift?
In February, Intel released research details about a chip with 80 cores: a fingernail-sized chip with the same processing power that in 1996 required a supercomputer with a 2,000-square-foot footprint, using 1,000 times the electrical power.
A problem for those who depend on previously written software that has been steadily improving and evolving over decades: "Our legacy software is a real concern to us."
Parallel programming for multicore computers may require new computer languages. "Today we program in sequential languages. Do we need to express our algorithms at a higher level of abstraction? Research into these areas is critical to our success."
© 2007, [email protected] http://hartenstein.de168
TU Kaiserslautern
• "Our programming languages researchers are exploring new programming paradigms and models," Hambrusch says. "Our course on multicore architectures is also preparing students for future software development positions. Purdue is clearly playing a defining role in this critical technology."

Avoiding the paradigm shift?
"In five or six years, laptop computers will have the same capabilities, and face the same obstacles, as today's supercomputers," Saied says. "This challenge will face people who program for desktop computers, too. People who think they have nothing to do with supercomputers and parallel processing will find out that they need these skills, too."
Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. Like locally-based Direct Memory Access (DMA), RDMA improves throughput and performance because it frees up resources. RDMA also facilitates a faster data transfer rate. RDMA implements a transport protocol in the network interface card (NIC) hardware.
© 2007, [email protected] http://hartenstein.de169
TU Kaiserslautern
Three Ways to Make Multicore Work. Number 1, Mathematics: do more computational work with less data motion
• e.g. higher-order methods: trade memory motion for more operations per word, producing an accurate answer in less elapsed time than lower-order methods
• different problem decompositions (no stratified solvers): the mathematical equivalent of loop fusion; e.g. nonlinear Schwarz methods
• ensemble calculations: compute ensemble values directly
• it is time (really past time) to rethink algorithms for memory locality and latency tolerance
• I didn't say threads: see, e.g., Edward A. Lee, "The Problem with Threads," Computer, vol. 39, no. 5, pp. 33-42, May 2006; "Night of the Living Threads", http://weblogs.mozillazine.org/roc/archives/2005/12/night_of_the_living_threads.html, 2005; Robert O'Callahan on John Ousterhout's "Why Threads Are A Bad Idea (for most purposes)" (~2004); Allen Holub, "If I were king: A proposal for fixing the Java programming language's threading problems", http://www128.ibm.com/developerworks/library/j-king.html, 2000

Avoiding the paradigm shift? Breaking the assumptions:
• Don't have any off-chip memory. Consequence: need algorithms, programming models, and software tools that work in more limited memory (a few GB)
• Have off-chip memory, but manage it more effectively. Consequence: need to find a true, general-purpose hardware/software model
• Overlap latency with split operations. Consequence: need to find massive amounts of concurrency, and to manage the programming challenges of split operations (these are hard for programmers to use correctly; may be an opportunity for formal methods)
• Multicore doesn't just stress bandwidth, it increases the need for perfectly parallel algorithms
• All systems will look like attached processors: high latency, low (relative) bandwidth to main memory
• 128 cores? "When [a] request for data from Core 1 results in an L1 cache miss, the request is sent to the L2 cache. If this request hits a modified line in the L1 data cache of Core 2, certain internal conditions may cause incorrect data to be returned to Core 1."
• Everything does not double: traveling from New York to Chicago took 3 weeks before 1830, 1.5 days in 1857, and 6 hours now: only a factor of 6
• MPI on multi-core: 340 ns MPI ping/pong latency; improvement will require better SWE tools
• Benchmarks: ping-pong latency (ring-based ping-pong exchange between all nodes); nearest-neighbor ghost-area exchange (test code from Argonne used to evaluate one-sided and point-to-point operations); CPU availability (percentage of CPU available at the receiver while doing a fixed amount of work during message arrival)
© 2007, [email protected] http://hartenstein.de170
TU Kaiserslautern
in Memoriam ...

in Memoriam Stamatis Vassiliadis, 1951 - 2007
in Memoriam Richard Newton, 1951 - 2007
© 2007, [email protected] http://hartenstein.de171
TU Kaiserslautern
KressArray Xplorer (Platform Design Space Explorer)

KressArray DPSS, published at ASP-DAC 1995
[Figure: Xplorer flow: application set as ALE-X code -> ALE-X compiler -> expression tree / intermediate forms -> Mapper (design rules, KressArray family parameters) -> Datapath Generator -> Kress rDPU layout; Scheduler -> data stream schedule; HDL Generator -> VHDL/Verilog -> Simulator; Analyzer, Delay Estimator and Power Estimator (power data, statistical data) feed an Improvement Proposal Generator with Inference Engine (FOX) -> suggestions; user interface with Architecture Editor, Mapping Editor and suggestion selection]
© 2007, [email protected] http://hartenstein.de172
TU Kaiserslautern
KressArray Family generic Fabrics: a few examples

• select Nearest Neighbour (NN) interconnect: select mode, number and width of NN ports (e.g. 2, 4, 8, 16, 24 or 32 ports)
• rich routing resources: rout-through only, or rout-through and function; more NN ports
• select the function repertory
• examples of 2nd-level interconnect: layouted over the rDPU cell, no separate routing areas!
http://kressarray.de