

Keynote, FPGA forum; 2-3 March 2010, Trondheim, Norway

Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de

We Need to Reinvent Computing to Avoid its Future Unaffordable Electricity Consumption

This talk is about HPRC


Googling "HPRC": the first of 167,000 hits.


Better googling with "NSF center for HPRC": www.chrec.org is the first of 48,400 hits. CHREC is pronounced "shreck".

Hetero Systems

Not only at CHREC, the term HPRC means heterogeneous systems including both:

• Instruction Stream Parallelism (on manycore etc.)


• Data Stream Parallelism (on accelerators; on FPGAs)

A qualified programmer population does not exist.


Outline

• Coming unaffordable electricity bill

• Massively saving energy by FPGAs

• What's the problem with FPGAs?

• The parallelism crisis

• We have to re-invent computing

• Conclusions



Crude oil prices per barrel: March 2010, >82 US-$; January 2011, >92 US-$.


Never run out of energy?

[Chart: typical oil field operation, production (%) over time; world energy mix: oil, gas, coal, hydro, nuclear; natural gas: similar situation; annotations >30 %, ~55 %]

2007: 80% of crude oil coming from declining fields [Fatih Birol, Chief Economist, IEA; https://www.theoildrum.com/]

"6 more Saudi Arabias needed for the demand predicted for 2030"

Beyond peak oil


Power Consumption of Computers

Energy cost may overtake IT equipment cost in the near future.

"We may ultimately need revolutionary new solutions." [Horst Simon, LBNL, Berkeley]
"... has become an industry-wide issue: incremental improvements are on track ..." [Albert Zomaya]

Power consumption by the internet: x30 until 2030 if trends continue. [G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8-11 Sep 2008]

[Photo: server farm at Dallas; Randy Katz: IEEE Spectrum, Feb. 2009]


Electricity Bill: a Key Issue

"The possibility of computer equipment power consumption spiraling out of control could have serious consequences for the overall affordability of computing." [L. A. Barroso, Google]

• Already in 2005, Google's electricity bill was higher than the value of its equipment.
• The cost of a Google data center is determined mainly by the monthly power bill.
• A patent for water-based data centers; Google going to sell electricity.


Supercomputers are Scientific Instruments

"In my opinion, the largest supercomputers at any time, including the first exaflops, should not be thought of as computers. They are strategic scientific instruments that happen to be built from computer technology. Their usage patterns and scientific impact are closer to major research facilities such as CERN, ITER, or Hubble." [Andrew Jones, vice president, Numerical Algorithms Group]

No reason to solve the power problem?


Overall affordability of computing

Exa-scale (10^18 computations/second) expected by 2018. Power estimates for a single such supercomputer: 250 MW to 10 GW (twice New York City with its 16 million people) [several sources].


Why Computers are important



Business Information Systems … without computers

[Photo: Lufthansa, anno 1960]


(Outline) Massively saving energy by FPGAs

Potential of RC

Reconfigurable Computing offers an overwhelming reduction of electricity consumption as well as massive speed-up factors, both by up to several orders of magnitude. Only Reconfigurable Computing can avoid that running our infrastructures becomes unaffordable in the future.


Speed-up factors are not new (obtained by avoiding the von Neumann paradigm)

[Chart: published speed-up factors, log scale from 1 to 1,000,000]

DSP and wireless: PISA project >15,000; Reed-Solomon decoding 2400; MAC 1000; Viterbi decoding 400; FFT 100
Bioinformatics: DNA sequencing 8723; Smith-Waterman pattern matching 288; molecular dynamics simulation 88; BLAST 52; protein identification 40
Astrophysics: GRAPE 20
Image processing, pattern matching, multimedia: real-time face detection 6000; CT imaging 3000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457
Crypto: DES breaking 28,500; crypto 1000


Speed-up in the mid '80ies: the PISA project

One DPLA* replacing 256 early Xilinx FPGAs.
*) fabricated by the E.I.S. project


Energy saving factors: ~10% of the speed-up

[Chart: the speed-up factors from the previous chart, annotated with the power-save factors obtained]

Low-power circuit design: PowerOpt™ (ChipVision Design Systems) divides power consumption by up to 4.
GPGPU and x86 multicore: no energy saving data available.


RC*: Demonstrating the intensive Impact

SGI Altix 4700 with RC100 RASC compared to a Beowulf cluster [Tarek El-Ghazawi et al.: IEEE COMPUTER, Feb. 2008]:

Application                  Speed-up factor   Power savings   Cost savings   Size savings
DNA and protein sequencing   8723              779             22             253
DES breaking                 28514             3439            96             1116

Much less equipment needed; massively saving energy.
*) RC = Reconfigurable Computing

Drastically less Equipment needed

For instance: a hangar full of racks replaced by a single rack (or ½ rack) without air conditioning.


Hetero HPC: SGI® RASC™

SGI® RASC™ Module (Version 1): Xilinx Virtex-II 6000 FPGA; 16 MB QDR SRAM; rack-mountable; dual NUMAlink™ 4 ports; seamless direct attach to the server's shared memory fabric.

SGI® RASC™ RC100 Blade: dual Virtex-4 LX200 FPGAs; 80 MB QDR SRAM or 20 GB DDR2 SDRAM; blade or rack-mountable form factor; dual NUMAlink™ 4 ports; seamless direct attach to the server's shared memory fabric.


The Reconfigurability Paradox

• Routing congestion
• Lower clock speed
• Reconfigurability overhead
• Wiring overhead


… because of the von Neumann Syndrome.


All but the ALU is overhead: x20 inefficiency

[Chart: energy breakdown of a general-purpose processor (data cache, etc.)]

The x20 inefficiency is just one of several overhead layers.
[R. Hameed et al.: Understanding Sources of Inefficiency in General-Purpose Chips; 37th ISCA, June 19-23, 2010, St. Malo, France]


Massive Overhead Phenomena

(partly proportionate, partly overproportionate to the number of processors)

overhead                        | von Neumann machine
instruction fetch               | instruction stream
state address computation       | instruction stream
data address computation        | instruction stream
data meet PU + other overhead   | instruction stream
i/o to/from off-chip RAM        | instruction stream
inter-PU communication          | instruction stream
message passing overhead        | instruction stream
transactional memory overhead   | instruction stream
multithreading overhead etc.    | instruction stream


Critique of the von Neumann Model

Critique of von Neumann is not new:
• Dijkstra 1968: "The Goto considered harmful"
• R. Hartenstein, G. Koch 1975: "The universal Bus considered harmful"
• Backus 1978: "Can Programming Be Liberated from the von Neumann Style?"
• Arvind et al. 1983: "A Critique of Multiprocessing the von Neumann Style"
• Brad Cox 1990: "Planning the Software Industrial Revolution"
• L. Savain 2006: "Why Software is Bad"
• Peter G. Neumann 1985-2003: 216x "Inside Risks", 18 years on the inside back cover of Comm. ACM

Overhead piles up to code sizes of astronomic dimensions: the "von Neumann Syndrome" [C. V. Ramamoorthy, UC Berkeley].
Nathan's Law [Nathan Myhrvold]: "Software is a gas. It expands to fill all its containers."
Wirth's Law: "Software is slowing faster than hardware is accelerating."


>40 years Software Crisis

• F. L. Bauer 1968: coined the term "Software Crisis"
• N. N. 1995: The Standish Group Report
• Robert N. Charette 2005: "Why Software Fails"; IEEE Spectrum, Sep 2005
• Anthony Berglas 2008: "Why it is Important that Software Projects Fail"

Parkinson's Law (The Economist, Nov 19th 1955): the size of a bureaucracy is independent of the amount of real work to be done. In 1955, Parkinson could not have foreseen the impact of software.


von Neumann overhead vs. Reconfigurable Computing

overhead                        | von Neumann machine | datastream machine
instruction fetch               | instruction stream  | none*
state address computation       | instruction stream  | none*
data address computation        | instruction stream  | none*
data meet PU + other overhead   | instruction stream  | none*
i/o to/from off-chip RAM        | instruction stream  | none*
inter-PU communication          | instruction stream  | none*
message passing overhead        | instruction stream  | none*
transactional memory overhead   | instruction stream  | none*
multithreading overhead etc.    | instruction stream  | none*
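To make the contrast in the table concrete, here is a minimal Python sketch (my illustration, not from the talk; all names are hypothetical): the same element-wise computation once in von Neumann style, where every step is driven by a fetched "instruction" plus run-time address computation, and once as a pre-configured pipeline of stages through which a data stream simply flows.

# Hedged illustration: von Neumann style vs. a configured data-stream pipeline.

def von_neumann_style(memory, program):
    """Each step fetches an 'instruction' and computes addresses at run time."""
    pc = 0                                    # program counter
    while pc < len(program):
        op, src, dst = program[pc]            # instruction fetch (overhead)
        memory[dst] = op(memory[src])         # data address computation per item
        pc += 1                               # state address computation
    return memory

def datastream_style(stream, stages):
    """Datapath 'configured' once; data items flow through, no instruction fetch."""
    for item in stream:                       # data stream, driven by data arrival
        for stage in stages:                  # fixed, pre-configured pipeline
            item = stage(item)
        yield item

# usage sketch
square = lambda x: x * x
add_one = lambda x: x + 1

mem = [1, 2, 3, 4, 0, 0, 0, 0]
prog = [(square, i, i + 4) for i in range(4)] + [(add_one, i, i) for i in range(4, 8)]
print(von_neumann_style(mem, prog))                               # [1, 2, 3, 4, 2, 5, 10, 17]
print(list(datastream_style([1, 2, 3, 4], [square, add_one])))    # [2, 5, 10, 17]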


(Outline) What's the problem with FPGAs?

Fab Line Cost


FPGA to ASIC design start ratio

97% FPGA vs. 3% ASIC [Dataquest, March 25, 2009]. Most ASIC design in the world has stopped.


Revenue down and up? [chart]


"Pure" FPGA abandoned

The programmable logic industry abandoned "pure" FPGAs, a big field of programmable fabric surrounded by I/O, for a magic mixture of FPGA fabric and hard logic on the same die. Instead of FPGA fabric in custom SoC designs, we get custom SoCs in FPGAs: devices with a narrower application focus. Super-flexible ASSP-like devices mix optimized hard-core design with FPGA flexibility … to capture huge segments of the standard parts / ASSP market.


Tools, IP and support

• Xilinx and Altera started FPGA synthesis development themselves.
• A commercially viable design flow for FPGA fabric requires years of development and customer experience.
• Synthesis and place-and-route are NP-complete: monstrously complex software, no "magic bullet".
• Developing a robust synthesis and place-and-route tool suite requires years of testing and fine-tuning.


FPGA synthesis advantages

Advantages of FPGA companies (remark: not intuitive):
1) they focus on optimizing their tools only for their own FPGAs
2) their FPGA synthesis teams influence future FPGA architectures (which EDA synthesis teams could not)
3) earliest possible access to, and most detailed information about, their company's FPGA architectures
4) they regularly access and benchmark against EDA companies' tools: a known, measurable target to work against


Will EDA companies survive?

• Routing delays, not the logic, are the dominant timing factor.
• FPGA firms interfaced synthesis to routing delay estimations.
• Big EDA firms are wrestling with these complexities.
• FPGA firms benefit from feedback from a huge variety of design projects world-wide, and from larger-than-EDA staffing levels.
• Will only the FPGA firms' own synthesis and place-and-route tools survive?
• If EDA abandons FPGA synthesis, smaller FPGA firms are in deep trouble.


(Outline) The parallelism crisis

Performance Growth by Multicore?

[Chart: relative performance vs. year, 1994-2030, marking the begin of the multicore era]

… and massive programmer productivity problems.

"Multicore shifts the burden of performance from chip designers to software developers." [J. Larus: Spending Moore's Dividend; Comm. ACM, May 2009]


Multimedia in the Multicore Era

[Chart: relative performance vs. year, 1994-2030, begin of the multicore era; mobile standards GSM, GPRS, EDGE, UMTS, next standard; courtesy E. Sanchez]

Multimedia performance needs, up to: Audio 800 MIPS, Graphics 11 GOPS, Video 160 GOPS, Digital TV 900 GOPS [Pierre Paulin, MPSoC'09].

The needed performance is growing faster than Moore's law.


vN passes into history

• "Foundational change will disrupt traditional habits throughout the discipline" [Michael Wrinn]
• "Suddenly, All Computing Is Parallel: Seizing Opportunity Amid the Clamor" [Michael Wrinn]
• "The proud era of von Neumann architecture passes into history" [Michael Wrinn]


The Parallel Programming Problem

"The parallel programming problem has been addressed, in HPC, for at least 25 years. The result: only a small number of specialized developers write parallel code. With multicore becoming ubiquitous, there is some hope that 'if you build it, they will come'." [T. Mattson, M. Wrinn]

Also see the list of the "dead supercomputer society".
The growing core counts are racing ahead of programming paradigms and programmer productivity.


The Programmability Crisis

The vast majority of HPC or supercomputing applications were originally written for a single processor with direct access to main memory. But the first petascale supercomputers employ more than 100,000 processor cores each, and distributed memory. The hope is that dozens of applications are inherently parallel enough to be laboriously decomposed, sliced and diced, for mapping onto such HPC machines. Large applications are only modestly scalable: more than 50% of applications don't scale beyond 8 cores, only 6% can exploit more than 128 cores, and only a tiny fraction 100,000 or more cores.


Some programming languages … [slide listing many languages]

Some languages for parallelism … [slide listing many languages]


Are new languages necessary?

"Humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations." [Sutter and Larus]

Language wars are religious wars. Concurrency in software is difficult because of the abstractions that have been chosen. Or should we just add new features to existing languages?


Absurdly incomprehensible

Threads make programs absurdly incomprehensible because of their wildly nondeterministic nature [E. A. Lee]. Object orientation limits the visibility of data. Concurrency models can operate at the component architecture level rather than at the programming language level [E. A. Lee].

E. A. Lee: Are new languages necessary for multicore?, 2007. E. A. Lee: The problem with threads; IEEE Computer, 2006.


We need a new Textbook

… having an impact like Mead & Conway: "The book that changed everything" [Electronic Design News, Feb. 11, 2009]. We have to re-invent computing before writing it.


(Outline) We have to re-invent computing


We have to re-write anyway

• We have to re-write software anyway (because of manycore).
• We have to re-write part of the software into configware (because of the power wall).
• For both we have to learn locality awareness.
• We have to re-invent programmer education.
• We need a tool flow that supports a twin-paradigm approach and locality awareness.


(Outline) 4. We have to re-invent computing

• Re-write anyway
• Hetero, with strong FPGA involvement
• Need locality awareness -> model-based
• We need "une levée en masse"


A Clean Terminology, please

program source | result
Software       | instruction streams
Flowware       | data streams
Configware     | configured datapath structures


Cray XD1 Architecture features

The Cray XD1 allows the Opteron µP to access the FPGA's internal registers, internal and external memory, and provides several transfer modes between the µP and the FPGA:
• The µP can read from / write to the FPGA local memory space (i.e. internal registers, internal BRAMs, and external memory).
• The FPGA can read from / write to the µP local memory space.
• The most bandwidth-efficient transfer mode is write-only mode (the producer initiates the transfer), burst (for large amounts of data) or non-burst, as sketched below.
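As a purely conceptual illustration (not Cray's actual API), this Python sketch models why producer-initiated, write-only burst transfers are more bandwidth-efficient than per-item writes: the per-transfer overhead is paid once per burst instead of once per word. All names and cost figures are hypothetical.

# Hypothetical cost model: producer-initiated transfers, burst vs. non-burst.
# (Illustrative only; not the Cray XD1 API. Costs are made-up units.)

TRANSFER_OVERHEAD = 20   # fixed cost to set up one transfer (hypothetical)
COST_PER_WORD = 1        # cost to move one word (hypothetical)

def non_burst_write(words):
    """Producer pushes each word as its own transfer."""
    return sum(TRANSFER_OVERHEAD + COST_PER_WORD for _ in words)

def burst_write(words, burst_len=64):
    """Producer groups words into bursts; overhead is paid once per burst."""
    cost = 0
    for i in range(0, len(words), burst_len):
        chunk = words[i:i + burst_len]
        cost += TRANSFER_OVERHEAD + COST_PER_WORD * len(chunk)
    return cost

data = list(range(1024))
print(non_burst_write(data))   # 1024 * (20 + 1) = 21504
print(burst_write(data))       # 16 bursts: 16 * 20 + 1024 = 1344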


HLL programming models: what is best for locality awareness?


Too many HDLs



Some more hardware description languages

term          | controlled by                | execution triggered by | paradigm
CPU           | program counter (at the ALU) | instruction fetch      | instruction-stream-based
DPU**, rDPU** | data counter(s) (at memory)  | data arrival*          | data-stream-based

*) "transport-triggered"   **) has no program counter, no instruction fetch

The single paradigm (from the mainframe age) is obsolete.


+ New Machine Model for FPGAs: the twin paradigm

term          | controlled by                | execution triggered by | paradigm
CPU           | program counter (at the ALU) | instruction fetch      | instruction stream
DPU**, rDPU** | data counter(s) (at memory)  | data arrival*          | data stream

*) "transport-triggered"   **) has no program counter, no instruction fetch
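To illustrate the data-counter idea in the table above, here is a minimal Python sketch (my illustration, not from the talk): a tiny "anti machine" in which auto-sequencing memory ports, each driven by its own data counter, stream operands into a fixed datapath unit, so execution is triggered by data arrival instead of instruction fetch. All names are hypothetical.

# Minimal sketch of a data-stream ("anti") machine: data counters at the
# memory generate address sequences; the datapath itself has no program counter.

def auto_sequencing_port(memory, addresses):
    """asM port: a data counter steps through a precomputed address sequence."""
    for addr in addresses:            # data counter
        yield memory[addr]            # one item per step, no instruction fetch

def datapath(op, *input_streams):
    """Fixed (configured) DPU: fires whenever operands arrive on all inputs."""
    for operands in zip(*input_streams):   # transport-triggered execution
        yield op(*operands)

# usage sketch: stream two vectors out of one memory and add them pairwise
memory = [1, 2, 3, 4, 10, 20, 30, 40]
a = auto_sequencing_port(memory, range(0, 4))      # addresses 0..3
b = auto_sequencing_port(memory, range(4, 8))      # addresses 4..7
print(list(datapath(lambda x, y: x + y, a, b)))    # [11, 22, 33, 44]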


Imperative Language Twins (1): Software Languages (for the von Neumann computer) vs. Flowware Languages (for the anti machine)

• Both: deterministic, procedural sequencing; traceable, checkpointable.
• Operation sequence driven by: Software: read next instruction, goto (instruction address), jump (to instruction address), instruction loop, loop nesting; no parallel loops; escapes, instruction-stream branching. Flowware: read next data item, goto (data address), jump (to data address), data loop, loop nesting, parallel loops, escapes, data-stream branching.
• State register: Software: program counter. Flowware: data counter(s).
• Address computation: Software: massive memory-cycle overhead. Flowware: overhead avoided.
• Instruction fetch: Software: memory-cycle overhead. Flowware: overhead avoided.
• Parallel memory bank access: Software: interleaving only. Flowware: no restrictions.
• Language features: Software: control flow + data manipulation. Flowware: data streams only (no data manipulation).


Imperative Language Twins (2): same comparison as on the previous slide.


A Heliocentric CS Model needed: the Twin-Paradigm (Dual Dichotomy) Approach

[Diagram: Program Engineering (PE) as the generalization of Software Engineering, encompassing Software Engineering (SE, CPU), Flowware* Engineering (FE, auto-sequencing memory asM, time-to-space mapping) and Configware Engineering (CE, structures, pipe network model, rDPU = reconfigurable DataPath Unit, rDPA = reconfigurable DataPath Array)]

*) do not confuse with "dataflow"!


POIIP: Loop turns into Pipeline [1979]

A loop body running on a CPU with memory becomes a (reconfigurable) DataPath Unit (rDPU); the loop turns into a pipeline of rDPUs. A complex loop body maps to a complex rDPU or to a pipe network inside an rDPU; nested loops map to a complex pipe network.

[Figure: FM radio pipeline example (FMDemod, Split, LPF1-3, HPF1-3, Gather, Adder, Speaker); source: MIT StreamIt]
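As a small illustration of the loop-to-pipeline idea (my sketch, not from the talk), the Python below expresses the same computation once as a sequential loop over a body and once as a chain of per-stage generators, the software analogue of a pipeline of rDPUs; the stage names are hypothetical.

# Loop vs. pipeline: the loop body is split into stages that a data stream
# flows through, the software analogue of a pipe of rDPUs.

def loop_version(samples):
    out = []
    for x in samples:                 # n iterations, one "CPU" doing everything
        y = x * 0.5                   # stage 1: scale
        y = y + 1.0                   # stage 2: offset
        out.append(y * y)             # stage 3: square
    return out

def scale(stream):
    for x in stream:
        yield x * 0.5

def offset(stream):
    for x in stream:
        yield x + 1.0

def square(stream):
    for x in stream:
        yield x * x

def pipeline_version(samples):
    return list(square(offset(scale(samples))))   # stages chained like rDPUs

data = [2.0, 4.0, 6.0]
assert loop_version(data) == pipeline_version(data)   # [4.0, 9.0, 16.0]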


Paradigm Dichotomy: a very old hat

HDL scene ~1970: "the decision box turns into a demultiplexer". [Figure: flowchart decision box vs. demultiplexer, with CONDITION and ENABLE inputs and branches B0/B1]

"That's so simple! Why did it take 30 years to find out?"

RTM available as a DEC product: 1973.
C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans. C-21/5, May 1972.
W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc., 1967.




(Outline) Conclusions


Key issues

• The massive power consumption of computers.
• Hetero systems needed (von Neumann + non-von-Neumann accelerators).
• Two scenes: eRC (embedded) + HPRC (supercomputing).
• Productivity programming methodology missing.
• Productivity programmer population missing.
• Accelerators: hardwired + RC + (soon?) ANN.


Potential of RC

Reconfigurable Computing offers an overwhelming reduction of electricity consumption as well as massive speed-up factors, both by up to several orders of magnitude. Only Reconfigurable Computing can avoid that running our infrastructures becomes unaffordable in the future. We have to re-invent computing as soon as possible.


We need "une levée en masse" (a mass mobilization).


END (backup slides follow)


Taxonomy of Twin-Paradigm Programming Flows (HPRC)

[Diagram; E. El-Araby et al.: Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology And Empirical Study; Proc. SPL2007, Mar del Plata, Argentina, Feb. 2007]

"The nroff of EDA" [courtesy Richard Newton]


Locality awareness is essential for flowware

How data are moved:
• Software: by addresses read from instructions (here locality is less relevant).
• Flowware: by wire, configured before run time; the relation to the configware calls for locality awareness.


Reinvent? (final remark)

Avoid traditional tunnel views to obtain new perspectives: rediscovery and revival of old ideas; rearrange and teach them properly to reach promising new horizons.


Amid the Clamor?


Time to space mapping

time domain (procedure domain)                           | space domain (structure domain)
program loop: n time steps, 1 CPU                        | pipeline: 1 time step, n DPUs
Bubble Sort: n x k time steps, 1 "conditional swap" unit | Shuffle Sort: k time steps, n "conditional swap" units
time algorithm                                           | space (space/time) algorithm

[Figure: a chain of "conditional swap" units operating on inputs x, y]


Example: Architecture instead of synchronization

"Shuffle Sort": a direct time-to-space mapping would create accessing conflicts. The modification with a shuffle function gives a better architecture instead of complex synchronization: half the number of blocks, plus moving the data up and down (shuffle function), and no von Neumann syndrome!

[Figure: chains of "conditional swap" units, before and after the shuffle modification]
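The transcript does not spell out the shuffle-sort circuit, so as an illustrative stand-in here is a Python sketch of the closely related odd-even transposition network: n parallel "conditional swap" units applied over k passes, instead of n x k sequential steps on a single unit. This is my interpretation, not necessarily the exact circuit shown on the slide.

# Illustrative stand-in: an odd-even transposition network built from
# "conditional swap" units, the space algorithm counterpart of bubble sort.

def conditional_swap(a, b):
    """One hardware-like compare-and-swap cell."""
    return (a, b) if a <= b else (b, a)

def transposition_sort(values):
    v = list(values)
    n = len(v)
    for step in range(n):                     # k = n parallel steps in time...
        start = step % 2                      # alternate odd/even columns
        for i in range(start, n - 1, 2):      # ...each a column of swap units
            v[i], v[i + 1] = conditional_swap(v[i], v[i + 1])
    return v

print(transposition_sort([5, 1, 4, 2, 3]))    # [1, 2, 3, 4, 5]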


Transformations since the 70ies

Strip mining transformation, from the time domain (procedure domain) to the space domain (structure domain): a program loop of n x k time steps on 1 CPU (time algorithm) becomes a pipeline of n DPUs taking k time steps (space/time algorithm).

Loop transformations: a rich methodology has been published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag].
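A minimal Python sketch of strip mining as described above (my illustration, with hypothetical names): a loop of n x k iterations is split into k outer steps, each processing a strip of n elements that could be handled by n DPUs in parallel.

# Strip mining: one long loop (n * k steps) is split into k strips of width n,
# so each strip can be mapped onto n datapath units working in parallel.

def plain_loop(data, f):
    return [f(x) for x in data]                    # n * k sequential steps

def strip_mined(data, f, strip_width):
    result = []
    for start in range(0, len(data), strip_width): # k outer time steps
        strip = data[start:start + strip_width]    # one strip of n elements
        result.extend(f(x) for x in strip)         # conceptually n DPUs at once
    return result

data = list(range(12))
double = lambda x: 2 * x
assert plain_loop(data, double) == strip_mined(data, double, strip_width=4)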


Outline

• Never run out of energy?
• Energy consumption: unaffordable soon?
• The many-core crisis
• Rescue by Reconfigurable Computing?
• We need to Reinvent Computing
• Conclusions


Development with VHDL is expensive

• The FPGAs' Achilles' heel is their long development time: low-level HDLs (VHDL/Verilog) are still dominant.
• Unlike software, FPGAs do not offer forward/backward compatibility.
• "FPGAs: low technology maturity; small user base, compared to software" [Khaled Benkrid, Un. of Edinburgh]
• Variety of tool flows (FPGA, ASIC, eASIC): a survey is urgently needed, manpower required!
• Complicated IP core scene: BDTI High-Level Synthesis Tool Certification Program™; Grant Martin, Gary Smith: "High-Level Synthesis: Past, Present, and Future," IEEE Design and Test of Computers, July/August 2009; TSMC creates a 'soft' IP cores collaboration program to improve soft IP readiness for EDA and IP suppliers including Arteris, Atrenta, Cadence, Chips & Media, Imagination Technologies, Intrinsic-ID, MIPS Technologies, Sonics, Synopsys and Vivante.


Productivity vs. Efficiency

"The nroff of EDA" [courtesy Richard Newton]
"How to hide the ugliness from the user" [Herman Schmit]
HDLs: zero ease of use!


Understanding Complex Hetero Systems

• Layers of abstraction and automatic parallelization hide critical sources of, and limits to, efficient parallel execution.
• Efficient distribution of tasks is memory limited.
• Internode communication reduces computational efficiency.
• We must change how programmers think; awareness of locality is essential.
• Focus on memory mapping issues and transfer modes to detect overhead and bottlenecks.
• Understanding streams through complex fabrics is needed.


Processor inside FPGA vs. FPGA inside Processor: EPP

Xilinx: Extensible Processing Platform™, a totally changed concept. The device is more like a heterogeneous SoC, with significant benefits for HPC applications. FPGAs became software-centric, not hardware-centric: -> EDUCATION!

Structured ASICs

Structured ASICs like eASIC are based mostly on an FPGA-like architecture with a special configuration mechanism to program at mask level: not re-programmable (more performance for less cost). Configware on ROM.


RTL Programming to ASIC / ASSP

• RTL as the platform language of silicon: attractive for IP providers.
• A tools path to ASICs / ASSPs (Application-Specific Standard Products), across FPGAs.
• RTL is inherently parallel; mapped applications are automatically and optimally parallelized by CAD tools.
• Now ESL (Electronic System Level) bridges HDL and ANSI C/C++ at an industrial level.
• The battle between FPGAs and MPSoCs is RTL vs. software programming.


Graphical / Dataflow Languages

We should have a look at DSPlogic. Evaluation metrics: [E. El-Araby et al.: Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology And Empirical Study; Proc. SPL2007 Symp., Mar del Plata, Argentina, Feb. 2007]


Architecture visible to the programmer

The software programming model: which hardware architecture parts are visible and under the programmer's direct control?
The RC programming model: how the programmer can control data transfers between the FPGA, onboard memory, the microprocessor and its memory.


>60 NoC Projects

NoC research: world-wide more than 60 projects. [Jih-Sheng Shen, Pao-Ann Hsiung (editors): Dynamic Reconfigurable Network-on-Chip Designs: Innovations for Computational Processing and Communication; Information Science Reference, Hershey, USA, April 1, 2010]

Industry faces a 'platform collision'. Which platform technology will win: ASIC, ASSP, FPGA, MCU or IP core? "It's not clear, all may coexist." Battles will get even more interesting if/when the parallel programming crisis is over. [Brad Howe, VP IC, Altera]


Intel going FPGA?

If Intel, for example, wanted to enter the FPGA market, they'd have to acquire an existing big FPGA company, specifically to acquire the core technology of synthesis and place-and-route. If FPGAs, or programmable logic technology in general, are a long-term important component of digital electronics, then synthesis and place-and-route have put just two companies, Xilinx and Altera, in the driver's seat for the massive profit and growth that are possible in this market.


(Outline) 4. We have to re-invent computing

• Re-write anyway ->
• We need a twin-paradigm approach (hetero with strong FPGA involvement)
• We need locality awareness (& model-based)
• We need "une levée en masse"


Locality Awareness is essential

[Diagram: the data stream machine; an rDPA (reconfigurable DataPath Array) surrounded by asM blocks]

asM: Auto-Sequencing Memory, implemented by distributed on-chip memory, with a reconfigurable address generator (GAG) inside each asM. Data counters are used instead of a program counter; the data streams are programmed by Flowware.
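To make the GAG idea concrete, here is a small Python sketch (my illustration, not the actual GAG hardware): a generic address generator produces a nested-loop address sequence from a few parameters fixed at configuration time, so no instructions are fetched and no addresses are computed by a CPU at run time. Parameter names are hypothetical.

# Sketch of a generic address generator (GAG) for an auto-sequencing memory:
# the address sequence is fully determined by a few configured parameters.

def gag(base, strides, counts):
    """Yield addresses of a nested scan (e.g. a 2D block walk), like a data counter."""
    def scan(level, offset):
        if level == len(counts):
            yield offset
            return
        for i in range(counts[level]):
            yield from scan(level + 1, offset + i * strides[level])
    yield from scan(0, base)

# usage sketch: stream a 3x4 block out of a row-major 8-column memory
memory = list(range(64))                       # 8 x 8 array, row major
addresses = gag(base=9, strides=(8, 1), counts=(3, 4))
block_stream = [memory[a] for a in addresses]  # no run-time address arithmetic in the "CPU"
print(block_stream)   # [9, 10, 11, 12, 17, 18, 19, 20, 25, 26, 27, 28]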


Some functional languages … some datastream languages … [slide listing languages]

The Language and Tool Disaster

• Software people don't speak VHDL; hardware people speak neither MPI nor OpenMP.
• Bad quality of application development tools: 86% of designers hate their tools [FCCM'98].
• Progress is stalled by qualification problems in industry and academia.
• Comprehensibility barrier between the procedural and the structural mind set.
[N. Conner et al.: FPGAs for Dummies; Wiley, 2008]

Productivity Semantic Gap


Some Acceleration Mechanisms

• Parallelism by multi-bank memory architecture
• Auxiliary hardware for address calculation; address calculation before run time
• Avoiding multiple accesses to the same data; avoiding memory cycles for address computation
• Optimization by storage scheme transformations and by memory architecture transformations
• Accelerating tasks by streaming: can achieve 100x improved use of memory
• MISD-structured computation: streaming computations across a long array before storing results in memory (sketched below)
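As a last illustrative sketch (mine, with hypothetical operations), the Python below contrasts repeated passes over memory with an MISD-style streaming pass: several different operations are applied to each element while it streams through, so the data is read and written only once.

# MISD-style streaming: apply a chain of different operations to each element
# while it streams through, instead of one full memory pass per operation.

import math

OPS = [lambda x: x + 1.0, math.sqrt, lambda x: x * 3.0]   # hypothetical op chain

def multi_pass(data):
    buf = list(data)
    for op in OPS:                     # one full read+write pass per operation
        buf = [op(x) for x in buf]
    return buf

def streaming_pass(data):
    out = []
    for x in data:                     # each element is read once...
        for op in OPS:                 # ...flows through all ops (MISD chain)...
            x = op(x)
        out.append(x)                  # ...and is stored once
    return out

data = [3.0, 8.0, 15.0]
assert multi_pass(data) == streaming_pass(data)   # [6.0, 9.0, 12.0]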