From Algorithms to Architectures - Alexandria...

221
The architectural solution space Dedicated VLSI architectures and how to design them Equivalence transforms for combinational computations Options for temporary storage of data Equivalence transforms for non-recursive computations Equivalence transforms for recursive computations Generalizations of the transform approach From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design Center ETH Z¨ urich Morgan Kaufmann “Top-Down Digital VLSI Design” Chapter 3 last update: July 18, 2014 c Hubert Kaeslin Microelectronics Design Center ETH Z¨ urich From Algorithms to Architectures 1 / 171

Transcript of From Algorithms to Architectures - Alexandria...

Page 1: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

From Algorithms to Architectures

Prof. Hubert KaeslinMicroelectronics Design Center

ETH Zurich

Morgan Kaufmann “Top-Down Digital VLSI Design” Chapter 3

last update: July 18, 2014

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 1 / 171

Page 2: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Content

You will learn

about the options for tailoring hardware to data/signal processing algorithms.

I General-purpose vs. special-purpose architecturesand all sorts of compromises between the two

I A toolbox for optimizing VLSI architecturesI Iterative decomposition, pipelining, replication, time sharingI Algebraic transformsI RetimingI Loop unfolding, pipeline interleaving

I Options for temporary storage of data

I Not so common architectural conceptsI Bit-serial architectures, distributed arithmeticI Computing in semirings

pipelining

replication

time sharingdecom-position

iterative

constant AT

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 2 / 171

Page 3: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The goals of architecture design

I Decide on the necessary hardware resources for carrying out computationsfrom data and/or signal processing.

I Organize their interplay such as to meet target specifications.

I Concerns:

1. Functional correctness2. Performance targets (throughput, operation rate, etc.)3. Circuit size4. Energy efficiency5. Agility (wrt to evolving needs, changing specs, future standards)6. Engineering effort and time to market

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 3 / 171

Page 4: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The goals of architecture design

I Decide on the necessary hardware resources for carrying out computationsfrom data and/or signal processing.

I Organize their interplay such as to meet target specifications.

I Concerns:

1. Functional correctness2. Performance targets (throughput, operation rate, etc.)3. Circuit size4. Energy efficiency5. Agility (wrt to evolving needs, changing specs, future standards)6. Engineering effort and time to market

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 3 / 171

Page 5: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Subject

The architectural solution space

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 4 / 171

Page 6: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

What you ought to know about microprocessors

Instruction set processors execute one program instruction after the otherin consecutive fetch-load-execute-store cycles.

ALU (arithmetic-logic unit) carries out data manipulations.

Datapath vs. Control section

+/− RAM ROM

input data

output data

higher-level control input

higher-level status output

status signals

control signals

datapath section control section

ALU

MUX FSMdata processing units,

data storage, anddata switches

finite state machines, instruction sequences, hardwired logic, or anycombination thereof

von Neumann architecture common memory space, vs.

Harvard architecture separate memory spaces for data and program code.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 5 / 171

Page 7: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The architectural antipodes I

application-specifichardware structure

general-purposehardware structure

general-purpose hardwarewith application-specificsoftware content

dedicated tospecialized unit

subtask Bdedicated to

specialized unit

subtask Adedicated to

specialized unit

subtask D

b)

dedicated tospecialized unit

subtask Coutputdata

inputdata

a)

programstorage controller

datamemory

program-controlled processor

purposedatapath

general-

outputdata

inputdata

program-controlledprocessor

input data output data

=

GP

SP

general-purpose

special-purpose

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 6 / 171

Page 8: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The architectural antipodes II

Hardware architectureGeneral purpose Special purpose

Algorithm any, not known a priori fixed, must be knownArchitecture instruction set processor dedicated, no single patternExecution model fetch-load-execute-store process data item and pass on

“instruction-oriented” “dataflow-oriented”Datapath ALU(s) plus memory customized designController with program microcode typically hardwiredPerformance instructions per second, data throughput,indicator run time of benchmarks can be anticipated analyticallyStrengths highly flexible, room for max. performance,

immediately available, highly energy-efficient,routine design flow, lean circuitrylow up-front costs

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 7 / 171

Page 9: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The architectural antipodes III

Guideline

Before embarking in ASIC design, find out

I Does an architecture dedicated to the application at hand make sense

I or is a program-controlled general-purpose processor more adequate?

I Opting for commercial microprocessors and/or FPL sidestepsmany technical problems that absorb much attentionwhen a custom IC is to be designed instead.

I Conversely, it is preciselyI the focus on the payload computation,I the absence of programming and configuration overhead, andI the full control over architecture, circuit, and layout details

that make it possible to optimize performance and energy efficiency.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 8 / 171

Page 10: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The architectural antipodes III

Guideline

Before embarking in ASIC design, find out

I Does an architecture dedicated to the application at hand make sense

I or is a program-controlled general-purpose processor more adequate?

I Opting for commercial microprocessors and/or FPL sidestepsmany technical problems that absorb much attentionwhen a custom IC is to be designed instead.

I Conversely, it is preciselyI the focus on the payload computation,I the absence of programming and configuration overhead, andI the full control over architecture, circuit, and layout details

that make it possible to optimize performance and energy efficiency.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 8 / 171

Page 11: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example: Viterbi decoderArchitecture General purpose Special purposeKey component DSP ASIC

TI TMS320C6455 sem03w6 sem05w1without with ETH ETHViterbi coprocessor VCP2

Number of chips 1 1 1 1CMOS process 90 nm 90 nm 250 nm 5Al 250 nm 5AlProgram code 187 Kibyte 242 Kibyte none noneCircuit size n.a. n.a. 73 kGE 46 kGEMax. throughput 45 kbit/s 9 Mbit/s 310 Mbit/s 54 Mbit/s

@ clock 1 GHz 1 GHz 310 MHz 54 MHzPower dissipation 2.1 W 2.1 W 1.9 W 50 mWYear 2005 2005 2004 2006

Reasons:

I DSP optimized for sustained multiply-accumulates, word width 32 bit.I Viterbi algorithm arranged to do without multiplication.I Viterbi algorithm arranged to do with words of 6 bit or less.I Dedicated architectures can exploit full potential for parallelism.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 9 / 171

Page 12: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example: AES block cipher encrypter/decrypter(Rijndael algorithm)

Architecture General purpose Special purposeKey component CISC Processor FPGA Xilinx ASIC (ETH) ASIC (UCLA)

Pentium III Virtex-II CryptoFun Rijndael coreNumber of chips motherboard 1 + config. 1 1CMOS process n.a. 150 nm 8Al 180 nm 4Al2Cu 180 nm 4Al2CuMax. throughput 648 Mbit/s 1.32 Gbit/s 2.0 Gbit/s 1.6 Gbit/s

@ clock 1.13 GHz n.a. 172 MHz 125 MHzPower dissipation 41.4 W 490 mW n.a. 56 mW

@ supply n.a. 1.5 V 1.8 V 1.8 VYear 2000 ≈ 2002 2007 2002

Reasons:

I Multiple LUTs included in hardware for S-Box function and inverse.

I Ciphering and subkey preparation carried out by concurrent units.

I Rijndael algorithm designed with Pentium III architecture in mind(MMX instructions, LUTs that fit into cache memory, etc.).

I Power dissipation of general-purpose processor remains daunting.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 10 / 171

Page 13: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

When do dedicated architectures make sense?

Dedicated architectures are favored by real-time applications such as

I Data, audio and video (de)compression

I Ciphering & deciphering (primarily for secret key ciphers)

I Error correction coding

I Digital modulation & demodulation(for modems, wireless communication, and disk drives)

I Adaptive channel equalization for copper lines and optical fibers

I Multipath combiners in broadband wireless access networks

I Computer graphics and video rendering

I Multimedia (e.g. MPEG, HDTV)

I Pattern recognition

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 11 / 171

Page 14: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Answer

“Does it make sense to consider dedicated hardware architectures?”

YES Dedicated architectures outperform program-controlledprocessors by orders of magnitude (wrt throughput andenergy efficiency) in many transformatorial systemswhere data streams get processed in fairly regular ways.

but also

NO Dedicated architectures can not rival the agility and economyof processor-type designs in applications where the computationis primarily reactive, very irregular, highly data-dependent,or memory-hungry.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 12 / 171

Page 15: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Answer

“Does it make sense to consider dedicated hardware architectures?”

YES Dedicated architectures outperform program-controlledprocessors by orders of magnitude (wrt throughput andenergy efficiency) in many transformatorial systemswhere data streams get processed in fairly regular ways.

but also

NO Dedicated architectures can not rival the agility and economyof processor-type designs in applications where the computationis primarily reactive, very irregular, highly data-dependent,or memory-hungry.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 12 / 171

Page 16: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Computational needs and capabilities2008

20122016

20202000

2004supercom

puters

data rate

[item/s]

1Gop/s

1Mop/s

1kop/s

Intel 80386 (1985)

Sun SPARC (1987)

1Eop/s

computation rate

[operation/s]

computational effort per data itemaka intensity

1Pop/s

NEC Earth Simulator

(supercomputer 2003)

1Top/s IBM Roadrunner

(supercomputer 2008)

[op/item]

technologypush

WLAN

wirelessATM

MPEG-2encoder

HDTVencoder

Turbo

for UMTSdecoder1k

10k

100k

1M

10M

100M

1G

10

100

1

Cray Titan

(supercomputer 2012)

Atmel 8051 (2012)

Intel i7-3900 (2013)

low-costmicroprocessors

GSMbaseband

simple modem

multiviewsynthesis

dedicatedhigh-performance

architectures

multi-antenna (MIMO)wireless receiver (4G)

satellite communication

multiprocessorsystems

Dolby AC-35.1 decoder

TI C66x (DSP core, 2012)

Nvidia GeForce GTX780

(graphics card, 2013)

UM

TS

: 3.8

4 M

chip

/s

CD

: 44.

1 ks

ampl

e/s

PA

L: 1

1 M

pixe

l/s

HD

TV

: 62

Mpi

xel/s

QF

HD

: 249

Mpi

xel/s

UH

D: 9

95 M

pixe

l/s

PC

M: 8

ksa

mpl

e/s

Turbo

for LTE-Adecoder

signaland graphicsprocessors

high-performancemicroprocessors

(multicore)

10 100 10k1k 100k 10M1M 100M 10G1G 100G1 1T

towardshigh data rates

low-costhardware

towards

applicationstowards sophisticated

high audio & video quality,(strong compression,

secure communication,robust transmission, etc.)

energy-savingalgorithms

towards

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 13 / 171

Page 17: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

What makes an algorithm suitable for dedicated VLSI architectures?

Ideally:

1. Loose coupling between major processing tasks• Well-defined functional specification for each task.• Manageable interactions between them.

2. Simple control flow• Course of operation does not depend on the data being processed.• No need for overly many modes of operations, data formats, etc.

I Makes it possible to anticipate the datapath resources required to meetthroughput goal and to design the architecture accordingly.

I Permits control by counters and simple finite state machines.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 14 / 171

Page 18: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

What makes an algorithm suitable for dedicated VLSI architectures?

Ideally:

1. Loose coupling between major processing tasks• Well-defined functional specification for each task.• Manageable interactions between them.

2. Simple control flow• Course of operation does not depend on the data being processed.• No need for overly many modes of operations, data formats, etc.

I Makes it possible to anticipate the datapath resources required to meetthroughput goal and to design the architecture accordingly.

I Permits control by counters and simple finite state machines.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 14 / 171

Page 19: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

... continued

3. Regular data flow, recurrence of a few identical operationsI Opens a door for sharing hardware resources in an efficient way.

4. Reasonable storage requirementsI Renders on-chip memories economically possible.I Massive storage requirements in conjunction with moderate computational

burdens place dedicated architectures at a disadvantage.

5. Compatible with finite precision arithmeticsI Insensitive to effects from finite precision, no need for floating-point

arithmetics.I Area, logic delay, interconnect length, parasitic capacitances, and energy

dissipation all grow with word width, they combine into a burden thatmultiplies at an overproportional rate.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 15 / 171

Page 20: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

... continued

3. Regular data flow, recurrence of a few identical operationsI Opens a door for sharing hardware resources in an efficient way.

4. Reasonable storage requirementsI Renders on-chip memories economically possible.I Massive storage requirements in conjunction with moderate computational

burdens place dedicated architectures at a disadvantage.

5. Compatible with finite precision arithmeticsI Insensitive to effects from finite precision, no need for floating-point

arithmetics.I Area, logic delay, interconnect length, parasitic capacitances, and energy

dissipation all grow with word width, they combine into a burden thatmultiplies at an overproportional rate.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 15 / 171

Page 21: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

... continued

3. Regular data flow, recurrence of a few identical operationsI Opens a door for sharing hardware resources in an efficient way.

4. Reasonable storage requirementsI Renders on-chip memories economically possible.I Massive storage requirements in conjunction with moderate computational

burdens place dedicated architectures at a disadvantage.

5. Compatible with finite precision arithmeticsI Insensitive to effects from finite precision, no need for floating-point

arithmetics.I Area, logic delay, interconnect length, parasitic capacitances, and energy

dissipation all grow with word width, they combine into a burden thatmultiplies at an overproportional rate.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 15 / 171

Page 22: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example:Fixed-pointdivision

0

0.2

0.4

0.6

0.8

1.0

1.4

1.6

1.2

1.8

2.0

mm2

computation delay

area occupation

NonrestoringRadix-2 SRTRadix-4 SRTCommercial

10 50 10020 30 40 60 70 80 90 ns0

16 bit24

40

32

52

64 bitPareto-optimal

solutions

suboptimal solutions

Figure: Comparison of hardware divider architectures for a 180 nm CMOS processunder worst-case PTV conditions. Note the impact of quotient width.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 16 / 171

Page 23: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures... continued

6. Non-recursive linear time-invariant computation over some algebraic fieldI Opens a door for reorganizing the data processing in many ways.I High-speed operation, in particular, is much easier to obtain.

7. No transcendental functionsI Roots, logarithmic, exponential, or trigonom. functions, translations

between incompatible number systems are expensive in hardware.◦ Results must either be stored in large lookup tables (LUTs) or◦ get calculated on-line in lengthy computation sequences.

8. Extensive usage of operations unavailable from instruction setsI Replace lengthy instruction sequences by dedicated computational units,

e.g. finite field arithmetics, many ciphering operations, CORDIC.I Fixed arguments often allow for some form of preprocessing, e.g.• drop unit factors and/or zero sum terms,• adopt special number representation schemes,• take advantage of symmetries and precomputed lookup tables.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 17 / 171

Page 24: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures... continued

6. Non-recursive linear time-invariant computation over some algebraic fieldI Opens a door for reorganizing the data processing in many ways.I High-speed operation, in particular, is much easier to obtain.

7. No transcendental functionsI Roots, logarithmic, exponential, or trigonom. functions, translations

between incompatible number systems are expensive in hardware.◦ Results must either be stored in large lookup tables (LUTs) or◦ get calculated on-line in lengthy computation sequences.

8. Extensive usage of operations unavailable from instruction setsI Replace lengthy instruction sequences by dedicated computational units,

e.g. finite field arithmetics, many ciphering operations, CORDIC.I Fixed arguments often allow for some form of preprocessing, e.g.• drop unit factors and/or zero sum terms,• adopt special number representation schemes,• take advantage of symmetries and precomputed lookup tables.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 17 / 171

Page 25: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures... continued

6. Non-recursive linear time-invariant computation over some algebraic fieldI Opens a door for reorganizing the data processing in many ways.I High-speed operation, in particular, is much easier to obtain.

7. No transcendental functionsI Roots, logarithmic, exponential, or trigonom. functions, translations

between incompatible number systems are expensive in hardware.◦ Results must either be stored in large lookup tables (LUTs) or◦ get calculated on-line in lengthy computation sequences.

8. Extensive usage of operations unavailable from instruction setsI Replace lengthy instruction sequences by dedicated computational units,

e.g. finite field arithmetics, many ciphering operations, CORDIC.I Fixed arguments often allow for some form of preprocessing, e.g.• drop unit factors and/or zero sum terms,• adopt special number representation schemes,• take advantage of symmetries and precomputed lookup tables.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 17 / 171

Page 26: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

... continued

9. No divisions and multiplications on very wide data wordsI Much more expensive than addition and subtraction.I Vast numerical range of results gives rise to scaling issues.I Matrix inversion is a particularly nasty case in point as it involves

divisions and often brings about numerical instability.

10. Throughput rather than latency is what mattersI Tight latency requirements rule out pipeliningI but are not in favor of microprocessors either as program-controlled

operation can not normally guarantee fixed response times,even less so when a complex operating system is involved.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 18 / 171

Page 27: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Algorithms suitable for dedicated architectures

... continued

9. No divisions and multiplications on very wide data wordsI Much more expensive than addition and subtraction.I Vast numerical range of results gives rise to scaling issues.I Matrix inversion is a particularly nasty case in point as it involves

divisions and often brings about numerical instability.

10. Throughput rather than latency is what mattersI Tight latency requirements rule out pipeliningI but are not in favor of microprocessors either as program-controlled

operation can not normally guarantee fixed response times,even less so when a complex operating system is involved.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 18 / 171

Page 28: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The architectural solution space

application-specifichardware structure

general-purposehardware structure

general-purpose hardwarewith application-specificsoftware content

dedicated tospecialized unit

subtask Bdedicated to

specialized unit

subtask Adedicated to

specialized unit

subtask D

b)

dedicated tospecialized unit

subtask Coutputdata

inputdata

a)

programstorage controller

datamemory

program-controlled processor

purposedatapath

general-

outputdata

inputdata

program-controlledprocessor

input data output data

=

GP

SP

general-purpose

special-purpose

Problem:How tocombinecomputationalefficiency withflexibility?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 19 / 171

Page 29: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The architectural solution space

application-specifichardware structure

general-purposehardware structure

general-purpose hardwarewith application-specificsoftware content

dedicated tospecialized unit

subtask Bdedicated to

specialized unit

subtask Adedicated to

specialized unit

subtask D

b)

dedicated tospecialized unit

subtask Coutputdata

inputdata

a)

programstorage controller

datamemory

program-controlled processor

purposedatapath

general-

outputdata

inputdata

program-controlledprocessor

input data output data

=

GP

SP

general-purpose

special-purpose

Problem:How tocombinecomputationalefficiency withflexibility?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 19 / 171

Page 30: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Have a look at typical electronic devicesSubfunctions primarily characterized by

irregular control flow and/or repetitive control flow andApplication need for flexibility need for computing efficiency

Blu-ray user interface, track seeking, 16-to-8 bit demodulation,player tray and spindle control, error correction,

processing of non-video data MPEG-2 decompression,(directory, title, author, deciphering (AACS AES-128),subtitles, region codes) video signal processing

Smartphone user interface, SMS/MMS, intermediate frequency filtering,directory management, (de)modul., channel (de)coding,battery monitoring, error correction (de)coding,communication protocol, (de)ciphering, speech andchannel allocation, video (de)compression,roaming, accounting display graphics

Guideline

Segregate the needs for computational efficiency from those of agility!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 20 / 171

Page 31: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Have a look at typical electronic devicesSubfunctions primarily characterized by

irregular control flow and/or repetitive control flow andApplication need for flexibility need for computing efficiency

Blu-ray user interface, track seeking, 16-to-8 bit demodulation,player tray and spindle control, error correction,

processing of non-video data MPEG-2 decompression,(directory, title, author, deciphering (AACS AES-128),subtitles, region codes) video signal processing

Smartphone user interface, SMS/MMS, intermediate frequency filtering,directory management, (de)modul., channel (de)coding,battery monitoring, error correction (de)coding,communication protocol, (de)ciphering, speech andchannel allocation, video (de)compression,roaming, accounting display graphics

Guideline

Segregate the needs for computational efficiency from those of agility!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 20 / 171

Page 32: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

1. Dedicated satellites and 2. Host with helper engines

dedicated tospecialized unit

subtask Bdedicated to

specialized unit

subtask Adedicated to

specialized unit

subtask D

parametrization bus(optional)

a)

handles subtask C

program-controlledprocessor

outputdata

inputdata

b)

dedicated tospecialized unit

subtask Bdedicated to

specialized unit

subtask Adedicated to

specialized unit

subtask D

input data output dataprogram-controlledprocessor

handles subtask C plus dispatching of data

data exchange and control busses

Figure: Chain of general-purpose processor and dedicated satellites (a),host computer with specialized fixed-function blocks or coprocessors (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 21 / 171

Page 33: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example: System on a chip for smartphones (by Texas Instr.)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 22 / 171

Page 34: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

3. Application-specific instruction set processor (ASIP)

programstorage

datamemory

program-controlled processor

controller

handles subtasks A, B, C and D with the aid of multiple specialized datapaths

datapathtype 1

specializeddatapath

type 2

specializeddatapath

type 3

specialized

outputdata

inputdata

application-specifichardware structure

general-purposehardware structure

general-purpose hardwarewith application-specificsoftware content

I Program-controlled operation highly flexible

I Application-specific features confined to datapath circuitry

I Single thread of execution (concurrency limited to SIMD),easily extended to multiple threads (by including multiple ASIP cores)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 23 / 171

Page 35: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example: AES cipher encrypter/decrypter revisited

General purpose Special purpose ASIPCISC Processor FPGA Xilinx ASIC (ETH) ASIC (UCLA) CryptoprocessorPentium III Virtex-II CryptoFun Rijndael core core UCLAmotherboard 1 + config. 1 1 1Assembler none none none Assemblern.a. n.a. 76 kGE 173 kGE 73.2 kGEn.a. 150 nm 8Al 180 nm 4Al2Cu 180 nm 4Al2Cu 180 nm 4Al2Cu648 Mbit/s 1.32 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 3.43 Gbit/s1.13 GHz n.a. 172 MHz 125 MHz 295 MHz41.4 W 490 mW n.a. 56 mW 86 mWn.a. 1.5 V 1.8 V 1.8 V 1.8 V2000 ≈ 2002 2007 2002 2004

Observation

ASIP combines excellent throughput and low powerwith the agility of a program-controlled architecture.

Catch: proprietary instruction set special assembler, libraries, debuggers, ...

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 24 / 171

Page 36: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example: AES cipher encrypter/decrypter revisited

General purpose Special purpose ASIPCISC Processor FPGA Xilinx ASIC (ETH) ASIC (UCLA) CryptoprocessorPentium III Virtex-II CryptoFun Rijndael core core UCLAmotherboard 1 + config. 1 1 1Assembler none none none Assemblern.a. n.a. 76 kGE 173 kGE 73.2 kGEn.a. 150 nm 8Al 180 nm 4Al2Cu 180 nm 4Al2Cu 180 nm 4Al2Cu648 Mbit/s 1.32 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 3.43 Gbit/s1.13 GHz n.a. 172 MHz 125 MHz 295 MHz41.4 W 490 mW n.a. 56 mW 86 mWn.a. 1.5 V 1.8 V 1.8 V 1.8 V2000 ≈ 2002 2007 2002 2004

Observation

ASIP combines excellent throughput and low powerwith the agility of a program-controlled architecture.

Catch: proprietary instruction set special assembler, libraries, debuggers, ...

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 24 / 171

Page 37: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

A framework for accelerating ASIP design

LISA = Language for Instruction Set Architectures(developed by CoWare Inc. acquired by Synopsys in 2010)

The design flow essentially goes

1. Define the most adequate instruction set for a target application,

2. Refine the architecture into a cycle-accurate model (optional),

3. Cast your architecture into an RTL-type model (optional)

using the LISA language.

System-level software tools then generate

I Assembler, linker, and simulator tools.

I VHDL synthesis code (from the RTL model).

Predefined processor templates also available.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 25 / 171

Page 38: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

4. Reconfigurable computing (promoted by FPL vendors)

input data output data

repository forcoprocessor

configurations

[re]configurationrequest

handles subtask C plus overhead

A, B, and Dhandles subtasks

one at a time

in-systemreconfigurablecoprocessor

data exchange bus

program-controlledprocessor

configurationdata

application-specifichardware structure

general-purposehardware structure

general-purpose hardwarewith application-specificsoftware content

Figure: General-purpose processor with juxtaposed reconfigurable coprocessor.

General procedure:1. Designers come up with a specific circuit structure

for each major piece of suitable computation.2. All configurations get stored in memory.3. Whenever the host encounters a call to one of those computations,

it downloads the pertaining configuration file into the FPL4. Host feeds coprocessor with data and fetches results.5. Host proceeds after computation completes.

dead time!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 26 / 171

Page 39: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

4. Reconfigurable computing (promoted by FPL vendors)

input data output data

repository forcoprocessor

configurations

[re]configurationrequest

handles subtask C plus overhead

A, B, and Dhandles subtasks

one at a time

in-systemreconfigurablecoprocessor

data exchange bus

program-controlledprocessor

configurationdata

application-specifichardware structure

general-purposehardware structure

general-purpose hardwarewith application-specificsoftware content

Figure: General-purpose processor with juxtaposed reconfigurable coprocessor.

General procedure:1. Designers come up with a specific circuit structure

for each major piece of suitable computation.2. All configurations get stored in memory.3. Whenever the host encounters a call to one of those computations,

it downloads the pertaining configuration file into the FPL4. Host feeds coprocessor with data and fetches results.5. Host proceeds after computation completes.

dead time!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 26 / 171

Page 40: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

5. Extendable instruction set processor (by Stretch Inc.)

programstorage

datamemory

controller

handles subtasks A, B, C, and D with a combination of fixed and configurable datapaths

outputdata

inputdata

datapath

generalpurpose

datapath logicconfigurable

on-chip

General procedure:

1. System developers write application programs in C or C++.

2. Proprietary EDA tools identify instruction sequencesthat are executed many times over (hot spots).

3. For each such sequence, reconfigurable logic is synthesized intoa parallel computation network that completes within one clock cycle.

4. Each occurrence of the original instruction sequence gets replacedby a function call that activates the custom-made logic.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 27 / 171

Page 41: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

6. Domain-specific programmable platform (DSPP) (new)

= generous and heterogeneous circuit resources in one malleable platform

I Instruction-set-programmableprocessors(CPUs, GPUs, ASIPs),

I Hardwired circuits(fixed-function blocksthat are pretty standardin one applic. domain),

I Electricallyconfigurablefabrics (FPL),

I Various types ofon-chip memories(SRAM, DRAM, flash).

dedicated tospecialized unit

subtask A

input data output dataprogram-controlledprocessor

handles subtask C plus dispatching of data

controller

datapath forsubtask B

specialized

datamemory

handles subtask B

programstorage

repository forconfiguration

handlessubtask D

in-systemconfigurable

logic

configurationdata

m u l t i p l e t h r e a d so f e x e c u t i o n

configurable bus system

dedicated tospecialized unit

subtask E

dedicated tospecialized unit

subtask F

not usedin this

application

not usedin this

application

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 28 / 171

Page 42: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

DSPP = platform ICs (continued)I Specification is using a domain-specific high-level language.I Developer tools assign most adequate execution units

such as to meet performance target at minimum energy.I The FPL is used to extend datapaths and/or instruction sets

where beneficial (in terms of throughput, energy efficiency, updates, etc.).I Little or no on-the-fly reconfiguration.I All inactive subcircuits are turned off.

Anticipated benefits

+ Good performance (intense computing done in hardware)

+ Energy-efficient (idem)

+ One platform covers a range of applications and products

+ Simplified design (essentially platform selection followed by assignmentof subfunctions to the on-chip resources)

+ Agile, fast turnaround (unless fixed-function blocks need to be modified)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 29 / 171

Page 43: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

DSPP = platform ICs (continued)I Specification is using a domain-specific high-level language.I Developer tools assign most adequate execution units

such as to meet performance target at minimum energy.I The FPL is used to extend datapaths and/or instruction sets

where beneficial (in terms of throughput, energy efficiency, updates, etc.).I Little or no on-the-fly reconfiguration.I All inactive subcircuits are turned off.

Anticipated benefits

+ Good performance (intense computing done in hardware)

+ Energy-efficient (idem)

+ One platform covers a range of applications and products

+ Simplified design (essentially platform selection followed by assignmentof subfunctions to the on-chip resources)

+ Agile, fast turnaround (unless fixed-function blocks need to be modified)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 29 / 171

Page 44: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Reality check

− Transistors used lavishly, many subcircuits may never be put to servicein a given application or product.

− Developer tools are in their infancy.

Technological progress tends to make such concerns less and less relevant.

I Viability stands or falls with the tool chain.I specification languages under developmentI standards required to ensure code reuse and portability

I In line with trends from general-purpose computing and high-end FPGAs.I costs per transistor ↓I mask costs ↑I verification costs ↑I energy-efficient computing has become a prime concernI CPU + GPU + FPL + fixed-function blocks + memory all on same chip

Cost structure to be discussed in chapter 16 “VLSI Economics and Project Management”

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 30 / 171

Page 45: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Reality check

− Transistors used lavishly, many subcircuits may never be put to servicein a given application or product.

− Developer tools are in their infancy.

Technological progress tends to make such concerns less and less relevant.

I Viability stands or falls with the tool chain.I specification languages under developmentI standards required to ensure code reuse and portability

I In line with trends from general-purpose computing and high-end FPGAs.I costs per transistor ↓I mask costs ↑I verification costs ↑I energy-efficient computing has become a prime concernI CPU + GPU + FPL + fixed-function blocks + memory all on same chip

Cost structure to be discussed in chapter 16 “VLSI Economics and Project Management”

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 30 / 171

Page 46: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

DSPP Forerunner: Zynq product family (by Xilinx Inc.)

“All Programmable SoC”

I ARM and DSP cores

I Hardwired fixed-functionblocks

I Programmable logic

I Memories andmemory controllers

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 31 / 171

Page 47: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

DSPP = platform ICs (conclusion)

Preliminary assessment

Much remains to be done before platform ICs can dominate digital VLSI,but the concept benefits from numerous technological and economic trends.

“CPU and GPU cores are the new gates (EE Times, 2011)

... and platform ICs are the new gate arrays (H. Kaeslin).”

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 32 / 171

Page 48: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

DSPP = platform ICs (conclusion)

Preliminary assessment

Much remains to be done before platform ICs can dominate digital VLSI,but the concept benefits from numerous technological and economic trends.

“CPU and GPU cores are the new gates (EE Times, 2011)... and platform ICs are the new gate arrays (H. Kaeslin).”

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 32 / 171

Page 49: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Insight gained

GP

SP

There is plenty of landbetween the antipodes

general-purposearchitectures

special-purposearchitectures

everythingprogram-controlled

everythinghardwired

design productivity

towardsagility and

(above all wrt energy)

towardscomputational efficiency

Guideline

I Rely on dedicated hardware only for those subfunctionsthat are called many times and are unlikely to change.

I Keep the rest programmable (via software or reconfiguration).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 33 / 171

Page 50: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Insight gained

GP

SP

There is plenty of landbetween the antipodes

general-purposearchitectures

special-purposearchitectures

everythingprogram-controlled

everythinghardwired

design productivity

towardsagility and

(above all wrt energy)

towardscomputational efficiency

Guideline

I Rely on dedicated hardware only for those subfunctionsthat are called many times and are unlikely to change.

I Keep the rest programmable (via software or reconfiguration).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 33 / 171

Page 51: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

The basic options of architecture designcomputational

efficiency

highhandlayout

hardwiredarchitecture

lowaccomodate unstable specs, immature standards,

and/or frequent product enhancements

limited design effort, short turnaround times

poor agility

productivity

good

ideal computingplatform

towards the

hardwaretowards universal

per-case basis

towards hardwareoptimized on a

high throughput

small circuit

low power

transformatorialfixed, repetitive,

principle of segregation:

select optimum architecturefor each task

plan for both hardwired andprogrammable subsystems,

computationnature of

reactive, irregular,subject to change

ASIP

instruction setprocessor

general-purpose

reconfigurablecomputing

DSPP

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 34 / 171

Page 52: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The antipodesWhat makes an algorithm suitable for a dedicated VLSI architecture?There is plenty of land between the antipodesDigest

Example:An industrial SoC

Note the coexistence of• general-purpose processors• ASIPs, and• hardwired helper engineson the same die.

Figure: Tegra 2 chip for smartphones and tablet computers (source Nvidia).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 35 / 171

Page 53: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Subject

How to design dedicated VLSI architectures

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 36 / 171

Page 54: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Why do we focus on dedicated architectures?

Many techniques for obtaining high performance at low cost are sharedbetween general- and special-purpose architectures.

Yet, our emphasis is on dedicated architectures because

I A priori knowledge of a computational problem offers room for ideasthat do not apply to instruction set processors.

I Utmost performance requirements often ask for special-purpose designs.

I The same holds for energy efficiency.

I Industry provides us with a vast selection of micro- and signal processorsmaking proprietary designs hard to justify.

I There exists a comprehensive literature on general-purpose architectures.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 37 / 171

Page 55: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Most processing algos must be reworked for hardware I

Departures from some mathematically ideal algorithm are almost alwaysnecessary to arrive at an economically feasible solution. Examples follow.

Digital filter Tolerate a somewhat lower stopband suppression in exchangefor a reduced computational burden.(e.g. lower order, smaller coefficients replaced by zeros.)

Viterbi decoder (for convolutional codes) Sacrifice 0.1 dB or so of coding gainfor the benefit of doing computations in a more economic way.(e.g. truncated dynamic range, frequent rescaling, restrictedtraceback.)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 38 / 171

Page 56: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Most processing algos must be reworked for hardware I

Departures from some mathematically ideal algorithm are almost alwaysnecessary to arrive at an economically feasible solution. Examples follow.

Digital filter Tolerate a somewhat lower stopband suppression in exchangefor a reduced computational burden.(e.g. lower order, smaller coefficients replaced by zeros.)

Viterbi decoder (for convolutional codes) Sacrifice 0.1 dB or so of coding gainfor the benefit of doing computations in a more economic way.(e.g. truncated dynamic range, frequent rescaling, restrictedtraceback.)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 38 / 171

Page 57: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Most processing algos must be reworked for hardware II

Autocorrelation function

Replace computation of

ACFxx (k) = rxx (k) =∞∑

n=−∞x(n) · x(n + k)

by the average magnitude difference function

AMDFxx (k) = r ′xx (k) =N−1∑n=0

|x(n)− x(n + k)|

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 39 / 171

Page 58: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Most processing algos must be reworked for hardware III

Magnitude function

I Approximated with shift, add and compare.

Name aka Formulalesser `−∞-norm l = min(|a|, |b|)sum `1-norm s = |a|+ |b|magnitude (reference) `2-norm m =

√a2 + b2

greater `∞-norm g = max(|a|, |b|)Approximation 1 m ≈ 3

8 s + 58 g

Approximation 2 m ≈ max(g , 78 g + 1

2 l)

I Simply replaced by `1- or `∞-norm.(finds applications in MIMO decoders, for instance.)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 40 / 171

Page 59: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Finding an optimal hardware organization

Guideline

There is room for remodeling computations in two distinct domains:

I Processing algorithm.

I Hardware architecture.

Alternative choices in the algorithmic domain. How to tailor an algorithmsuch as to cut the computational burden, to trim downmemory requirements, and/or to speed up calculationswithout incurring unacceptable implementation losses?

Equivalence transforms in the architectural domain. How to (re)organize acomputation such as to optimize throughput, circuit size, energyefficiency and overall costs while leaving the input-to-outputrelationship unchanged except, possibly, for latency?

We shall focus on the latter.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 41 / 171

Page 60: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Finding an optimal hardware organization

Guideline

There is room for remodeling computations in two distinct domains:

I Processing algorithm.

I Hardware architecture.

Alternative choices in the algorithmic domain. How to tailor an algorithmsuch as to cut the computational burden, to trim downmemory requirements, and/or to speed up calculationswithout incurring unacceptable implementation losses?

Equivalence transforms in the architectural domain. How to (re)organize acomputation such as to optimize throughput, circuit size, energyefficiency and overall costs while leaving the input-to-outputrelationship unchanged except, possibly, for latency?

We shall focus on the latter.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 41 / 171

Page 61: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Finding an optimal hardware organization

Guideline

There is room for remodeling computations in two distinct domains:

I Processing algorithm.

I Hardware architecture.

Alternative choices in the algorithmic domain. How to tailor an algorithmsuch as to cut the computational burden, to trim downmemory requirements, and/or to speed up calculationswithout incurring unacceptable implementation losses?

Equivalence transforms in the architectural domain. How to (re)organize acomputation such as to optimize throughput, circuit size, energyefficiency and overall costs while leaving the input-to-outputrelationship unchanged except, possibly, for latency?

We shall focus on the latter.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 41 / 171

Page 62: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Systems engineers and VLSI designers must collaborate

designarchitecture technology-

specificimplementation

algorithmdesign IC fabrication dataproduct idea

evaluation offunctional needsand specification

a)

designarchitecture

technology-specific

implementationalgorithm

design IC fabrication data

evaluation offunctional needsand specification

product idea

b)

competence of systems engineers

competence of systems engineers

competence of VLSI designers

competence of VLSI designers

Figure: Sequential thinking (a) versus networked team (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 42 / 171

Page 63: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Insight gained

Observation

It is always necessary to balance many contradicting requirementsto arrive at a working and marketable embodiment of an algorithm.

I There is more to VLSI design than accepting a given algorithm andturning that into hardware with the aid of some HDL synthesizer.

I Algorithm design is not covered in this course, but neverthelessextremely important for VLSI design.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 43 / 171

Page 64: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Example: Sequence estimation for EDGE receiverAlgorithm Delayed Max-log-MAP Soft Output

Decision Feedback Viterbi Equalizer

Soft output no yes yes

Forward recursion yes yes yesBackward recursion no yes noBacktracking step yes no no

Memory requirements 1x 50x 0.13x

Key design targets:

I soft output

I less than 577µs per burst

I small circuit, low power

I min. block error rate at anygiven signal-to-noise ratio

Which option would you go for?

SOVE

Max-Log-MAP

Log-MAP

DDFSECod

edB

LE

R

SNR [dB]

5 10 15 2010−3

10−2

10−1

100

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 44 / 171

Page 65: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Data dependency graphs (DDG)

edge transport weight indicates latency in computation cycles

Definitions vertex operationmemoryless

fan out expressed as"no operation" vertex

illegal!

0

Danger of race conditions

0

0

00

0

circular pathsof edge weight zeroare not admitted!

x(k)

time-varying data sourcevariable input expressed as

c

constant data sourceconstant input expressed as

y(k)

data sinkoutput expressed as

Shorthand notations

introduced for convenience0 =

1 =

2 =

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 45 / 171

Page 66: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

The isomorphic architecture

a)

y(k) =Σn=0

N=3bn x(k-n)

b)

b20b 1b

x(k)

y(k)

b3

y(k) = Σn=0

3bn(k) x(k-n)

c)

b20b 1b

x(k)

y(k)

b3

z-1 z-1 z-1

d)

b21b

x(k)

b3

multiplierparallel

* * **

y(k)

adder

+ + +

0b

1:1

Figure: Example: A third order transversal filter in various notations.Equation (a), DDG (b), and isomorphic architecture (d). SFG for comparison (c).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 46 / 171

Page 67: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Figures of merit for hardware architectures I (Perform.-related)

Cycles per data item Γ , number of computation cycles between releasing twosubsequent data items.

Longest path delay tlp , the lapse of time required for data to propagate alongthe longest path. A circuit cannot function correctly unlesstlp ≤ Tcp.

Time per data item T , the lapse of time between releasing two subsequentdata items, e.g. in µs/sample, ms/frame, or s/computation.T = Γ · Tcp ≥ Γ · tlp.

Data throughput Θ = 1T =

fcp

Γ expressed in pixel/s, sample/s, frame/s,record/s, FFT/s, or the like.

Latency L , number of computation cycles from a data item entering acircuit until the pertaining result becomes available.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 47 / 171

Page 68: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Figures of merit for hardware architectures I (Perform.-related)

Cycles per data item Γ , number of computation cycles between releasing twosubsequent data items.

Longest path delay tlp , the lapse of time required for data to propagate alongthe longest path. A circuit cannot function correctly unlesstlp ≤ Tcp.

Time per data item T , the lapse of time between releasing two subsequentdata items, e.g. in µs/sample, ms/frame, or s/computation.T = Γ · Tcp ≥ Γ · tlp.

Data throughput Θ = 1T =

fcp

Γ expressed in pixel/s, sample/s, frame/s,record/s, FFT/s, or the like.

Latency L , number of computation cycles from a data item entering acircuit until the pertaining result becomes available.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 47 / 171

Page 69: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Figures of merit for hardware architectures I (Perform.-related)

Cycles per data item Γ , number of computation cycles between releasing twosubsequent data items.

Longest path delay tlp , the lapse of time required for data to propagate alongthe longest path. A circuit cannot function correctly unlesstlp ≤ Tcp.

Time per data item T , the lapse of time between releasing two subsequentdata items, e.g. in µs/sample, ms/frame, or s/computation.T = Γ · Tcp ≥ Γ · tlp.

Data throughput Θ = 1T =

fcp

Γ expressed in pixel/s, sample/s, frame/s,record/s, FFT/s, or the like.

Latency L , number of computation cycles from a data item entering acircuit until the pertaining result becomes available.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 47 / 171

Page 70: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Figures of merit for hardware architectures II (Cost-related)

Circuit size A expressed in mm2, F 2 or GE (gate equivalent).

Size-time product AT , the hardware resources spent to obtain a giventhroughput. AT = A

Θ .

Energy per data item E , the amount of energy dissipated for a givencomputation on a data item e.g. in pJ/MAC, nJ/sample,µJ/datablock, or in mWs/frame.

Can also be understood as power-per-throughput ratio E = PΘ

measured in mWMbit/s or W

Gop/s .

because energydata item = energy per second

data item per second = powerthroughput

Energy-time product ET indicates how much energy gets spent for achievinga given throughput (synonym “energy-per-throughput ratio”).ET = E

Θ = PΘ2 , e.g. in µJ

datablock/s or mWsvideoframe/s .

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 48 / 171

Page 71: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Figures of merit for hardware architectures II (Cost-related)

Circuit size A expressed in mm2, F 2 or GE (gate equivalent).

Size-time product AT , the hardware resources spent to obtain a giventhroughput. AT = A

Θ .

Energy per data item E , the amount of energy dissipated for a givencomputation on a data item e.g. in pJ/MAC, nJ/sample,µJ/datablock, or in mWs/frame.

Can also be understood as power-per-throughput ratio E = PΘ

measured in mWMbit/s or W

Gop/s .

because energydata item = energy per second

data item per second = powerthroughput

Energy-time product ET indicates how much energy gets spent for achievinga given throughput (synonym “energy-per-throughput ratio”).ET = E

Θ = PΘ2 , e.g. in µJ

datablock/s or mWsvideoframe/s .

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 48 / 171

Page 72: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Example

b)

b20b 1b

x(k)

y(k)

b3

ApproximationsI Interconnect delays neglected (overly optimistic).I Delays of arithmetic operations summed up (sometimes pessimistic).I Glitching ignored (optimistic).

A = 3Areg + 4A∗ + 3A+

Γ = 1

tlp = treg + t∗ + 3t+

AT = (3Areg + 4A∗ + 3A+)(treg + t∗ + 3t+)

L = 0

E = 3Ereg + 4E∗ + 3E+

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 49 / 171

Page 73: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

A symbolic representation of hardware

f

a)

x(k)

y(k) c)

size

throughput

latency

longest path

b)

register

input stream

output stream

datapathsection

combinationallogic

no controlsection

Figure: DDG (a), reference hardware configuration (b), key characteristics (c).

Reference hardware = isomorphic architecture + output register(s)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 50 / 171

Page 74: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

There is room for remodeling in the algorithmic domain ...... and there is room in the architectural domainSystems engineers and VLSI designers must collaborateRelative merits of architectural alternativesComputation cycle versus clock period

Computation cycle versus clock period

I A computation period Tcp is the time span that separatestwo consecutive computation cycles.

I During each computation cycle, fresh data emanate from a register,propagate through combinational circuitry before the result gets storedin the next analogous register.

I It is the combinational circuitry that performs all arithmetic, logic,and data routing operations.

I Computation rate fcp = 1Tcp

denotes the inverse.

I For all circuits that adhere to single-edge-triggered one-phase clocking,computation cycle and clock period are the same.

fcp = fclk ⇔ Tcp = Tclk

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 51 / 171

Page 75: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Subject

Transforms for combinational computations

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 52 / 171

Page 76: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

What do we mean by combinational computation?

A computation is termed combinational if

I Result depends on the present arguments exclusively.

I All edges in the DDG have weight zero.

I DDG is free of circular paths.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 53 / 171

Page 77: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Darwin stepping off the boat at Galapagos

I Biological evolution has led to great diversity and adaptation.Can we achieve something similar in VLSI architecture design?

Let us try with architectural transformations.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 54 / 171

Page 78: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Darwin stepping off the boat at Galapagos

I Biological evolution has led to great diversity and adaptation.Can we achieve something similar in VLSI architecture design?

Let us try with architectural transformations.c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 54 / 171

Page 79: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: 8-point FFT0x (k)

1x (k)

2x (k)

3x (k)

4x (k)

5x (k)

7x (k)

6x (k)

0y (k)

1y (k)

2y (k)

3y (k)

4y (k)

5y (k)

7y (k)

6y (k)

a) 1st round 2nd round 3rd round

b)

=butterfly

If the combinational function f is complex (8�n-point FFT, AES, JPEG)then the isomorphic architecture is a rather expensive proposition.

What do you suggest?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 55 / 171

Page 80: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Options at the architecture level

Decomposing function f into a sequence of subfunctions that get executedone after the other on same hardware.

Pipelining of the functional unit for f to improve computation rateby cutting down combinational depth.

Replicating the hardware for f and having all units work concurrently.

Open questions:

I Does it make sense to combine pipelining with iterative decompositionin spite of their contrarian effects?

I How do replication and pipelining compare?Are there situations where one should be preferred over the other?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 56 / 171

Page 81: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Options at the architecture level

Decomposing function f into a sequence of subfunctions that get executedone after the other on same hardware.

Pipelining of the functional unit for f to improve computation rateby cutting down combinational depth.

Replicating the hardware for f and having all units work concurrently.

Open questions:

I Does it make sense to combine pipelining with iterative decompositionin spite of their contrarian effects?

I How do replication and pipelining compare?Are there situations where one should be preferred over the other?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 56 / 171

Page 82: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Iterative decomposition

Paradigm: Step-by-step execution

f

b)

datapathsection

f1 3f2f controlsection

decompositioniterative

a)

f1

3f

2ff

Figure: DDG (a) and hardware configuration for d = 3 (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 57 / 171

Page 83: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Performance and cost analysis

As a first-order approximation, iterative decomposition by a factor of d leadsto the following figures of merit:

Af

d+ Areg + Actl ≤ A(d) ≤ Af + Areg + Actl

Γ(d) = d

tlp(d) ≈ tf

d+ treg

d(Areg + Actl )treg + (Areg + Actl )tf + Af treg +1

dAf tf

≤ AT (d) ≤d(Af + Areg + Actl )treg + (Af + Areg + Actl )tf

L(d) = d

E (d) ≷ Ef + Ereg

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 58 / 171

Page 84: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Insight gained

Iterative decomposition

I Is attractive when a computation makes repetitive use of a singlesubfunction because a lot of area can then be saved.

Example: multiplication 7→ repeated shift & add operations

I Is unattractive when subfunctions are very disparate and, therefore,cannot be made to share much hardware resources.

Example: square root, logarithm, multiplication modulo some prime

I Does not impact throughput much as long as treg � tlp .

I May or may not improve energy efficiency.I yes, if cutting overly long signal propagation paths mitigates

excessive glitching and the associated energy losses.I no, if the extra activity of data registers, control logic, and

data recycling circuitry dominates.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 59 / 171

Page 85: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Insight gained

Iterative decomposition

I Is attractive when a computation makes repetitive use of a singlesubfunction because a lot of area can then be saved.

Example: multiplication 7→ repeated shift & add operations

I Is unattractive when subfunctions are very disparate and, therefore,cannot be made to share much hardware resources.

Example: square root, logarithm, multiplication modulo some prime

I Does not impact throughput much as long as treg � tlp .

I May or may not improve energy efficiency.I yes, if cutting overly long signal propagation paths mitigates

excessive glitching and the associated energy losses.I no, if the extra activity of data registers, control logic, and

data recycling circuitry dominates.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 59 / 171

Page 86: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: block cipher IDEA

first round

seven moreidentical rounds

output transformation

x(k)

y(k)

sub-key

sub-key

sub-key

sub-key

sub-key

sub-key

sub-key

sub-key

sub-key

sub-key

=modulo 2additionbitwise

= modulo 2addition 16

=multiplicationmodulo 2 +116

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 60 / 171

Page 87: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Pipelining

Paradigm: Assembly line operated by specialized workers

a)

f1

3f

2ff datapathsection

f1

3f2f no control

section

pipelining

b)

Figure: DDG (a) and hardware configuration for p = 3 (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 61 / 171

Page 88: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Performance and cost analysis

Pipelining by a factor of p changes performance and cost figures as follows

A(p) = Af + pAreg

Γ(p) = 1

tlp(p) ≈ tf

p+ treg

AT (p) ≈ pAreg treg + (Areg tf + Af treg ) +1

pAf tf

L(p) = p

E (p)fine grain

≷coarse grain

Ef + Ereg

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 62 / 171

Page 89: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Coarse grain versus fine grain pipeliningA

Areg

A f

pipelining

break-even point

treg ft

gatetmin( )

minimum size

optimum efficiencycoarse-grain regime

T

6 5

30

810

15

20

12

60

100

grainfine-

regime

p =1234

isomorphicconfiguration

40

80

120

25

50

maximumthroughput

2 4 6 8 10 12 [ns]

[GE]

800w

700w

600w

500w

400w

300w

200w

100w

pipeliningpossible

deepest

infeasible

distribution

recollectioncircuitry

super-fast

and

except with towardsthroughput

hardwaretowards

efficiencyhardwaretowards

economy

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 63 / 171

Page 90: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Insight gained

Must distinguish between two regimes of pipelining:

Coarse grain pipelining.

Few registers evenly inserted into a deep combinational network.

+ Little extra area for much better throughput.

+ AT -product lowered dramatically.

+ Long reconvergent fanout paths cut reduced glitching.

Fine grain pipelining.

Combinational delay in each stage approaches register delay.

∼ Diminishing speedup for more and more overhead.

− AT -product augments significantly.

− Significant register activity added waste of energy.

− More exposed to OCV.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 64 / 171

Page 91: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Insight gained

Must distinguish between two regimes of pipelining:

Coarse grain pipelining.

Few registers evenly inserted into a deep combinational network.

+ Little extra area for much better throughput.

+ AT -product lowered dramatically.

+ Long reconvergent fanout paths cut reduced glitching.

Fine grain pipelining.

Combinational delay in each stage approaches register delay.

∼ Diminishing speedup for more and more overhead.

− AT -product augments significantly.

− Significant register activity added waste of energy.

− More exposed to OCV.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 64 / 171

Page 92: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Theoretical bound

I Pipeline stage must accomodate at least one 2-input nand or nor. Computation rate and clock frequency are bounded.

Tcp ≥ min(tlp) = min(tgate) + treg = min(tnand , tnor ) + tsu ff + tpd ff

Numerical example:

I Standard cell library for a 130 nm CMOS process.

I Computation period bounded from below to

Tcp ≥ tNAN2D1 + tDFFPB1 = 18 ps + 249 ps ≈ 267 ps

Absolute maximum computation rate ≈ 3.7 GHz.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 65 / 171

Page 93: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

A glance at microprocessors Iyear clock FO4 inverter delays

CPU [MHz] per pipeline stage

Intel 80386 1989 33 ≈ 80Intel Pentium 4 2003 3200 12...16Core 2 Duo 2007 2167 ≈ 40Core i7 980X 2011 3333...3600 42...46IBM POWER5 2004 1650...1900 22IBM POWER6 2007 3500...5000 13IBM Cell Processor 2006 3200 11

FO4 = fanout 4

Observations

I Pipelining has been instrumental in pushing processor clock frequencies.

I 12 or so FO4 inverter delays per stage is close to practical limit.

I Trend towards ever deeper pipelines reversed in the Intel Core familyto reclaim energy efficiency.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 66 / 171

Page 94: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

A glance at microprocessors Iyear clock FO4 inverter delays

CPU [MHz] per pipeline stage

Intel 80386 1989 33 ≈ 80Intel Pentium 4 2003 3200 12...16Core 2 Duo 2007 2167 ≈ 40Core i7 980X 2011 3333...3600 42...46IBM POWER5 2004 1650...1900 22IBM POWER6 2007 3500...5000 13IBM Cell Processor 2006 3200 11

FO4 = fanout 4

Observations

I Pipelining has been instrumental in pushing processor clock frequencies.

I 12 or so FO4 inverter delays per stage is close to practical limit.

I Trend towards ever deeper pipelines reversed in the Intel Core familyto reclaim energy efficiency.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 66 / 171

Page 95: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

A glance at microprocessors II

Figure: Evolution of pipeline depth over the years (source Stanford CPU database).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 67 / 171

Page 96: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Pipelining in the presence of multiple feedforward paths

c

r’(k)

u(k)

l’(k)

c

r(k)

w(k)

u(k)

l(k)

v(k)

g(l,r,u)

g(v,w,u’)

enciphering

deciphering

function

function

a)

r(k)

w(k)

c1

2c

.....

dc

.....shimmingregisters

pipelineregisters

u(k,1)

l(k)

v(k)

.....

b)

u(k,2)

u(k,3)

=modulo 2additionbitwise

=c memorylessmapping

arbitrary

pipelining

Figure: Involutory cipher algorithm. DDG before (a) and after pipelining (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 68 / 171

Page 97: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

A brute force approach to performance I

Figure: If one functional unit does not meet your performance goals ...

What can you do?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 69 / 171

Page 98: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

A brute force approach to performance II

Figure: ... try to get more of them.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 70 / 171

Page 99: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Replication

Paradigm: Multi-piston pump

f

a)

datapathsection

b)

distributor

recollector

controlsection

replication

Figure: DDG (a) and hardware configuration for q = 3 (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 71 / 171

Page 100: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Performance and cost analysis

The key characteristics of replication by a factor of q are

A(q) = q(Af + Areg ) + Actl

Γ(q) =1

q

tlp(q) ≈ tf + treg

AT (q) ≈ (Af + Areg +1

qActl )(tf + treg ) ≈ (Af + Areg )(tf + treg )

L(q) = 1

E (q) ≈ Ef + Ereg + Ectl

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 72 / 171

Page 101: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Replication versus pipeliningA

Areg

A f

pipelining

break-even point

treg ft

gatetmin( )

minimum size

optimum efficiencycoarse-grain regime

T

6 5

30

810

15

20

12

60

100

grainfine-

regime

p =1234

isomorphicconfiguration

replication

3

4

5

6

7

8

10

9

12

11

q =1

2

40

80

120

25

50

maximumthroughput

2 4 6 8 10 12 [ns]

[GE]

800w

700w

600w

500w

400w

300w

200w

100w

pipeliningpossible

deepest

infeasible

distribution

recollectioncircuitry

super-fast

and

except with towardsthroughput

hardwaretowards

efficiencyhardwaretowards

economy

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 73 / 171

Page 102: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: Microprocessor architectures II Superscalar 7→ multiple ALUs, FPUs, etc. under common control.I Multicore 7→ multiple processor cores working independently.

Figure: Floorplan of a Sun Microsystems UltraSPARC T2 CPU (Niagara 2)that combines 8 cores on a single die (separate integer and floating point unitsin each core, 8 threads/core, 1831 pins, 65 nm CMOS, 342 mm2, 1.4 GHz).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 74 / 171

Page 103: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: Microprocessor architectures II

Computer industry has been pushed towards replication because

I CMOS offered more room for increasing circuit complexitythan for pushing clock frequencies higher.

I The faster the clock, the smaller the region on a semiconductor diethat can be reached within a single clock period.

I Fine grain pipelines dissipate a lot of energyfor relatively little computation.

I Reusing a well-tried subsystem benefits design productivityand lowers risks.

I A multicore processor can still be of commercial valueeven if one of its CPUs is found to be defective.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 75 / 171

Page 104: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Time sharing

I Many applications ask for the simultaneous processingof multiple parallel data streams.

Paradigm: Student sharing his time between various subjects

a)

hf g

b)

datapathsection hf g

collector

redistributor

output streams

input streams

f g h controlsection

sharingtime

Figure: DDG (a) and hardware configuration for s = 3 (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 76 / 171

Page 105: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Performance and cost analysis

Time sharing by a factor of s yields the following picture

maxf ,g ,h

(A) + Areg + Actl ≤ A(s) ≤∑f ,g ,h

A + Areg + Actl

Γ(s) = s

tlp(s) ≈ maxf ,g ,h

(t) + treg

s(maxf ,g ,h

(A) + Areg + Actl )(maxf ,g ,h

(t) + treg ) ≤ AT (s) ≤

s(∑f ,g ,h

A + Areg + Actl )(maxf ,g ,h

(t) + treg )

L(s) = s

E (s) ≈ s maxf ,g ,h

(E ) + Ereg + Ectl

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 77 / 171

Page 106: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Insight gained

Time sharing

I is most favorable when one monofunctional datapath proves sufficientbecause all streams are to be processed in exactly the same way

I is unattractive when subfunctions are very disparatebecause no savings can be obtained from concentrating their processinginto one multifunctional datapath

I refrains from taking advantage of the parallelism inherentin the original problem

I may be viewed as antithetic to replication

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 78 / 171

Page 107: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: 8-point FFT

0x (k)

1x (k)

2x (k)

3x (k)

4x (k)

5x (k)

7x (k)

6x (k)

0y (k)

1y (k)

2y (k)

3y (k)

4y (k)

5y (k)

7y (k)

6y (k)

a) 1st round 2nd round 3rd round

b)

=butterfly

Figure: DDG of 8-point FFT (a) and DDG of butterfly operator (b).c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 79 / 171

Page 108: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

A roadmap for tailoring combinational hardware

towardsthroughput

hardwaretowards

efficiency

hardwaretowards

economy

pipelining

replication

time sharingdecom-position

iterative

constant AT

ideal effect

actualeffect

27

3

9

1

1/3

1/9

311/31/9

isomorphicconfiguration

assumed

t =lp

A= L= cpd=

11/9

27

1

t =lp

A= L= cpd=

31/3

9

1/3 t =lp

A= L= cpd=

11/3

9

1

t =lp

A= L= cpd=

91

3

1/9 t =lp

A= L= cpd=

31

3

1/3 t =lp

A= L= cpd=

11

3

1

t =lp

A= L= cpd=

93

1

1/9 t =lp

A= L= cpd=

33

1

1/3 t =lp

A= L= cpd=

33

1

1

t =lp

A= L= cpd=

99

1/3

1/3t =lp

A= L= cpd=

279

1/3

1/9

t =lp

A= L= cpd=

2727

1/9

1/9

time-shareddatapath

single

size A

t =lp

A= L= cpd=

8127

1/9

1/27

time perdata item

T = Γ tlp

fine-grain pipelineddatapath

highly iterative andtime-shared

iterative andmid-grain pipelined

time-shared

datapath

multiplemid-grained pipelines

concurrent processing unitsmultiple replicated

multiple coarse-grainpipelines with each stageiteratively decomposed

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 80 / 171

Page 109: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: Two cryptochip architectures compared

Rijndaelarchitecture

q=2

replication p=2

pipeliningintraround lean but slow

architecture:iterativeheavily

2 hardware rounds2 regs per round

1 hardware round1 register2 registers

1 hardware round

full iterative decompositioninto a single round

d=4

Serpentarchitecture

p=4

interroundpipelining

near-optimumarchitectures:

tailor-made

compromises

4 hardware rounds1 register

p=8

interroundpipelining

p=2

pipeliningintraround

configuration:isomorphic

slow and fatfat but fast

architecture:pipelined

heavily

cipherrounds

8 hardware rounds1 register

starting point

16 registers, i.e. 2 regs per round

8 registers, i.e.1 reg per round

partial iterative decompositiond=2

area A

4 registers, i.e.1 reg per round

time per

Tdata item

available area

required throughput

acceptable solutions

Figure: Two competing teams have taken different routes but have arrivedat similar compromises between throughput and area (ETH CHES 2002).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 81 / 171

Page 110: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Universal versus algebraic transforms

Universal transforms. Whether and how to apply them can be decided from theDDG alone, no matter what operations the vertices stand for.

I Iterative decompositionI PipeliningI ReplicationI Time sharing

more to follow later in this chapter

Algebraic transforms. Take advantage of specific algebraic propertiesof the operations involved.

I Associativity transform, commutativity transformI Horner’s scheme (for evaluating polynomials)I Method of finite differences (to calculate equidistant values)I Karatsuba multiplication (for wide data words)I ...

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 82 / 171

Page 111: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Universal versus algebraic transforms

Universal transforms. Whether and how to apply them can be decided from theDDG alone, no matter what operations the vertices stand for.

I Iterative decompositionI PipeliningI ReplicationI Time sharing

more to follow later in this chapter

Algebraic transforms. Take advantage of specific algebraic propertiesof the operations involved.

I Associativity transform, commutativity transformI Horner’s scheme (for evaluating polynomials)I Method of finite differences (to calculate equidistant values)I Karatsuba multiplication (for wide data words)I ...

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 82 / 171

Page 112: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Example: Associativity transform

a)

0-th term

x(k)

y(k) minmin min min minminmin

I-th term

b)

x(k)

y(k)

min

min min min

minmin

min

0-th termI-th term

chain/treeconversion

Figure: 8-way minimum function. Chain-type DDG (a), tree-type DDG (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 83 / 171

Page 113: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Iterative decompositionPipeliningReplicationTime sharingAssociativity and other algebraic transformsDigest

Recapitulation

Equivalence transforms that help optimize combinational computations

Iterative decomposition. Pipelining. Replication. Algebraic transforms.Time sharing (in the presence of parallel data streams).

I Iterative decomposition and time sharing are most effectivewhen a computational unit can be reused several times.

I Pipelining is generally superior to replication.Coarse grain pipelining improves throughput dramatically,but benefits decline as more and more stages are included.

I Pipelining and iterative decomposition are complementary,they both can contribute to lowering the AT product.

I Lowering AT always implies cutting down the longest path tlp.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 84 / 171

Page 114: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Subject

Options for temporary storage of data

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 85 / 171

Page 115: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Why and when do we need to stora data?

Except for trivial SSI/MSI circuits, any IC includes some form of memory.

This is either because

I the data processing algorithm is of sequential nature and,therefore, asks for functional memory,

or because

I nonfunctional storage got introduced into the circuitas a consequence from architectural transformations.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 86 / 171

Page 116: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Options for temporary storage of data

Architectural options for temporary storage of data:

◦ On-chip registers built from individual flip-flops or latches.

◦ On-chip memory i.e. SRAM macrocell (or possibly embedded DRAM).

◦ Off-chip memory i.e. SRAM or DRAM catalog part.

Differences that impact high-level design decisions:

I One-at-a-time versus all-at-a-time data access patterns

I Available memory configurations and area occupation

I Storage capacities

I Wiring and the costs of going off-chip

I Energy efficiency

I Latency and timing

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 87 / 171

Page 117: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Options for temporary storage of data

Architectural options for temporary storage of data:

◦ On-chip registers built from individual flip-flops or latches.

◦ On-chip memory i.e. SRAM macrocell (or possibly embedded DRAM).

◦ Off-chip memory i.e. SRAM or DRAM catalog part.

Differences that impact high-level design decisions:

I One-at-a-time versus all-at-a-time data access patterns

I Available memory configurations and area occupation

I Storage capacities

I Wiring and the costs of going off-chip

I Energy efficiency

I Latency and timing

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 87 / 171

Page 118: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Data access patterns

RAMs impose access one data word after the otherFine in architectures obtained from

I iterative decomposition andI time sharing.

Perfect match for microprocessors(“fetch, load, execute, store”).

Registers allow for simultaneous access to all data words storedMandatory in high-throughput architectures obtained from

I pipelining,I retiming, to be introduced later in this chapter

I loop unfolding idem

where data are kept moving in every computation cycle.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 88 / 171

Page 119: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Available memory configurations

Figure: Area occupation of registers and on-chip RAMs for a 130 nm CMOS.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 89 / 171

Page 120: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Wiring and the costs of going off-chip

Off-chip memories add to pin count, package count, and board space.

I Extra parasitic capacitances

I Extra delays

I Extra energy dissipation

I Commodity RAMs impose bidirectional pads special attention requiredI Stationary and transient drive conflicts must be avoided.I ATE must be made to alternate between read and write modes

with no physical access to any control signal within the chip.I Test patterns must address bidirectional operation and

high-impedance states.I Electrical and timing measurements become more complicated.

Conclusion

Off-chip data storage is associated with important penalties.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 90 / 171

Page 121: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Wiring and the costs of going off-chip

Off-chip memories add to pin count, package count, and board space.

I Extra parasitic capacitances

I Extra delays

I Extra energy dissipation

I Commodity RAMs impose bidirectional pads special attention requiredI Stationary and transient drive conflicts must be avoided.I ATE must be made to alternate between read and write modes

with no physical access to any control signal within the chip.I Test patterns must address bidirectional operation and

high-impedance states.I Electrical and timing measurements become more complicated.

Conclusion

Off-chip data storage is associated with important penalties.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 90 / 171

Page 122: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Options for temporary data storage compared

architectural option o n - c h i p off-chipbistables embedded commodity

flip-flop latch SRAM DRAM DRAM

fabrication process compatible with logic optimizeddevices in each cell 20...30T 12...16T 6T 1T1C 1T1Ccell area per bit [F 2] 1700...2800 1100...1800 135...170 18...30* 6...8extra circuit overhead none 1.3 ≤ factor ≤ 2 off-chipmemory refresh cycles n o n e y e sextra package pins none none addr. & data busnature of wiring multitude of local lines on-chip busses package & boardbidirectional busses none optional mandatoryaccess to data words all at a time one at a timeavailable configurations any restrictedenergy efficiency good fair poor very poorlatency and paging none no fixed rules yesimpact on clock period minor substantial severe

* As low as 6...8 for processes that accomodate 3D capacitors (4 to 6 extra masks)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 91 / 171

Page 123: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Example: RAMs in a CMOS ASIC technology

Cu-11 is an ASIC technology by IBM (2002)

I gate length 110 nm, supply voltage 1.2 V

I Cu interconnect combined with low-k interlevel dielectrics

SRAM macrocell generator from 128 bit to 1 Mibit

Embedded DRAM megacells up to 16 Mibit (with trench caps)

I cycle time of 1 Mibit eDRAM is 15 ns(equivalent to 555 · tpd of a 2-input nand)

I eDRAM bit cell area is 0.31 µm2

I 1 Mibit eDRAM occupies an area of 2.09 mm2 (84% overhead)

I 16 Mibit eDRAM occupies 14.1 mm2 (63% overhead)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 92 / 171

Page 124: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Data access patternsAvailable memory configurations and area occupationWiring and the costs of going off-chipDigest

Recapitulation

Observation

There is no such thing as an optimal solution for temporary storage of data,what is best strongly depends on the situation and requirements.

I Only registers allow for simultaneous access to all data,but occupy a lot of die area per bit.

I SRAMs can hold more significant quantities of data than registersbut are slower than registers, yet faster than DRAMs.

I DRAMs require periodical refresh power dissipated even when idle.

I DRAM and Flash memories are cost-efficient for large data quantities.

I Flash is used for permanent storage, but is much slower than RAM.

I Commodity memories offer virtually unlimited capacities at low costs,but are is associated with speed, energy and other penalties.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 93 / 171

Page 125: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Subject

Transforms for non-recursive computations

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 94 / 171

Page 126: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

What do we mean by non-recursive computation?

A computation is termed (sequential and) non-recursive if

I Result is dependent on past arguments, not just present.

I Edges with weights greater than zero are present in the DDG.

I DDG is free of circular paths.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 95 / 171

Page 127: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Example: Nonlinear time-invariant third order correlator

x(k)

y(k)

a)

h

0c

1c

2c

3c

h

h

h

I Pipelining helps boost throughput but is rather inefficient in this case.

Can you do betterin terms of speed and area?

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 96 / 171

Page 128: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Retiming

Paradigm: Repartition workloads evenly for all workers on an assembly line

a)

g

wf wg

wh

f

h

i

wf wg

wh

gf

h

i

+l +l

−l

lag ldatapath

sectionno controlsection

retiming

nonuniformcomputational delays

uniformb)

Figure: DDG (a) and hardware configuration for l = 1 (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 97 / 171

Page 129: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Formal rules

To be legal, any retiming must observe the following rules:

1. Neither outputs nor sources of time-varying inputs may be partof a supervertex that is to be retimed.

2. When a supervertex is assigned a lag (lead) by l computation cycles,the weights of all its incoming edges are in- (de-)cremented by l andthe weights of all its outgoing edges are de- (in-)cremented by l .

3. No edge weight may be changed to assume a negative value.

4. Any circular path must always include at least one edgeof strictly positive weight (roundtrip weights will never change).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 98 / 171

Page 130: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Pipelining revisited

Same rules as for retiming except

1. Any supervertex to be assigned a lag (lead) must include all outputs(all time-varying inputs).

Comparison

I Both transforms aim at shortening the longest path.

I Pipelining increases latency as registers get added.

I Retiming leaves latency unchanged as registers get relocated.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 99 / 171

Page 131: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Pipelining revisited

Same rules as for retiming except

1. Any supervertex to be assigned a lag (lead) must include all outputs(all time-varying inputs).

Comparison

I Both transforms aim at shortening the longest path.

I Pipelining increases latency as registers get added.

I Retiming leaves latency unchanged as registers get relocated.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 99 / 171

Page 132: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Example:Nonlineartime-invariantthird ordercorrelator

x(k) y(k)c)

h

0c

1c

2c

3c

h

h

hlead 1

lead 2

lead 3

y(k-1)x(k)d)

h

0c

1c

2c

3c

h

h

h

lag 1

x(k)

y(k)

a)

h

0c

1c

2c

3c

h

h

h

x(k) y(k)b)

h

0c

1c

2c

3c

h

h

h

chain reversal

pipelining

retiming

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 100 / 171

Page 133: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Example: Nonlinear time-invariant third order correlator

The subsequent transforms change the circuit’s performance as follows:

Architectural variantoriginal reversed + retimed + pipelined

Key characteristics (a) (b) (c) (d)

arithmetic units (N + 1)Ah + NA+ idem idem idemfunctional registers NAreg idem idem idemnonfunctional registers 0 idem idem (N + 1)Areg

cycles per data item Γ 1 idem idem idemlongest path delay tlp treg + th + N t+ idem treg + th + t+ treg + max(th, t+)

for N = 3 [ns] 9.5 idem 5.5 3.5for N = 30 [ns] 63.5 idem 5.5 3.5

latency L 0 idem idem 1

Net benefits:

I Long path delay greatly reduced at little hardware costs.

I Maximum operating speed no longer a function of correlation order N.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 101 / 171

Page 134: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Example: Nonlinear time-invariant third order correlator

The subsequent transforms change the circuit’s performance as follows:

Architectural variantoriginal reversed + retimed + pipelined

Key characteristics (a) (b) (c) (d)

arithmetic units (N + 1)Ah + NA+ idem idem idemfunctional registers NAreg idem idem idemnonfunctional registers 0 idem idem (N + 1)Areg

cycles per data item Γ 1 idem idem idemlongest path delay tlp treg + th + N t+ idem treg + th + t+ treg + max(th, t+)

for N = 3 [ns] 9.5 idem 5.5 3.5for N = 30 [ns] 63.5 idem 5.5 3.5

latency L 0 idem idem 1

Net benefits:

I Long path delay greatly reduced at little hardware costs.

I Maximum operating speed no longer a function of correlation order N.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 101 / 171

Page 135: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Iterative decomposition and time sharing revisited

I Decomposing and time sharing sequential computations is straightforwardand can significantly reduce datapath hardware.

I Functional memory requirements remain the same as in the isomorphicarchitecture (memory bound).

I Mixed blessing energy-wise.

+ More uniform combinational depth reduces glitching activity.− Extra multiplexers necessary to route, recycle, collect and/or redistribute

data.− Extra counter or finite state machine required to control the datapath.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 102 / 171

Page 136: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Example: Third order transversal filter

a)

b21b

x(k)

b3

* * **

y(k)+ + +0b

multiplierparallel

LUT

adder accumulator

*

+

x(k)

y(k)

nb n

x(k-1)x(k-2)

x(k-3)

datapathsection

controlsection

multiplexerto 1N+1

coefficientstorage

wordsN+1

+1modN+1

longshift registerN=3

outputregister

indexregister

b)

datapathsection

no controlsection

iterative decomposition& time sharing

parallel multipliersN+1 addersN

longshift registerN=3

hard-wiredcoefficients

Figure: Isomorphic architecture (a) and a more economic alternative (b).

We will see a totally different approach later.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 103 / 171

Page 137: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

RetimingPipelining revisitedIterative decomposition and time sharing revisitedDigest

Recapitulation

Retiming

can help to optimize datapath architecture for sequential computationswithout affecting functionality nor latency.

I Retiming, pipelining and combinations of the two can improve throughputof arbitrary feedforward computations.

I The associative law allows one to take full advantage of the abovetransforms by having a DDG rearranged beforehand.

I Iterative decomposition and time sharing are the two options availablefor reducing circuit size.

I Highly time-multiplexed architectures dissipate energy on ancillaryactivities that do not directly contribute to data computation.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 104 / 171

Page 138: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Subject

Transforms for recursive computations

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 105 / 171

Page 139: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

What do we mean by recursive computation?

A computation is termed (sequential and) recursive if

I Result is dependent on earlier outcomes of the computation itself.

I Edges with weights greater than zero are present in the DDG.

I Circular paths (of non-zero weight) exist in the DDG.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 106 / 171

Page 140: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop I

Recursions such asy(k) = ay(k − 1) + x(k)

which in the z domain corresponds to transfer function

H(z) =Y (z)

X (z)=

1

1− az−1

have many technical applications.

Examples:

I IIR filters

I Differential pulse code modulation encoders (DPCM)

I Servo loops

They impose a stiff timing constraint, however.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 107 / 171

Page 141: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop IIa

x(k) y(k)

y(k-1)

a)

*

+

adder

multiplierparallela

x(k)y(k)

b)

Figure: DDG (a) and isomorphic architecture (b).

Iteration bound:∑loop

t = treg + t∗ + t+ = tlp ≤ Tcp

◦ No problem as long as long path constraint can be metwith available and affordable technology.◦ No obvious solution otherwise, recursiveness is a real bottleneck.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 108 / 171

Page 142: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop IIa

x(k) y(k)

y(k-1)

a)

*

+

adder

multiplierparallela

x(k)y(k)

b)

Figure: DDG (a) and isomorphic architecture (b).

Iteration bound:∑loop

t = treg + t∗ + t+ = tlp ≤ Tcp

◦ No problem as long as long path constraint can be metwith available and affordable technology.◦ No obvious solution otherwise, recursiveness is a real bottleneck.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 108 / 171

Page 143: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop IIIHave a second look!

Key idea

Relax the timing constraint by inserting additional latency registersinto the feedback loop.

A tentative solution must look like

H(z) =Y (z)

X (z)=

N(z)

1− apz−p

where N(z) is here to compensate for the changes due to the new denominator.

Recalling the sum of geometric series we easily establish N(z) as

N(z) =1− apz−p

1− az−1=

p−1∑n=0

anz−n

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 109 / 171

Page 144: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop IIIHave a second look!

Key idea

Relax the timing constraint by inserting additional latency registersinto the feedback loop.

A tentative solution must look like

H(z) =Y (z)

X (z)=

N(z)

1− apz−p

where N(z) is here to compensate for the changes due to the new denominator.

Recalling the sum of geometric series we easily establish N(z) as

N(z) =1− apz−p

1− az−1=

p−1∑n=0

anz−n

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 109 / 171

Page 145: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop IV

The new transfer function can then be completed to become

H(z) =

∑p−1n=0 anz−n

1− apz−p

and the new recursion in the time domain follows as

y(k) = apy(k − p) +

p−1∑n=0

anx(k − n)

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 110 / 171

Page 146: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop V

After unfolding by a factor of p = 4, the original recursion takes on the form

y(k) = a4y(k − 4) + a3x(k − 3) + a2x(k − 2) + ax(k − 1) + x(k)

which corresponds to transfer function

H(z) =1 + az−1 + a2z−2 + a3z−3

1− a4z−4in lieu of

1

1− az−1

Net result:

I Denominator has been widened to include p unit delays rather than one.

I Numerator stands for a feedforward circuit that is amenable to pipelining.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 111 / 171

Page 147: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop VI

Particularly elegant and efficient solutions exist when p isan integer power of 2 because of the lemma

p−1∑n=0

anz−n =

log2 p−1∏m=0

(a2m

z−2m

+ 1) p = 2, 4, 8, 16, ...

With p = 4, for instance, the numerator can be factorized into

H(z) =(1 + az−1)(1 + a2z−2)

1− a4z−4in lieu of

1

1− az−1

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 112 / 171

Page 148: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant first-order feedback loop VIIa

y(k)x(k)

a 2 a 4

numerator denominatora)

x(k)

a a 2building block

pipelinedmultiply-add

*

+

adder

multiplierparallel

y(k-6)

a 4

b)

*

+

*

+

Figure: DDG unfolded by p = 4 (a) and high-performance architecture (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 113 / 171

Page 149: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Higher-order loops

Guideline

Do not attempt to unfold loops of arbitrary order directly.Make use of a common technique from digital filter design.

I Any higher-order transfer function can be factored into a productof second- and first-order terms.

I The DDG takes the form of cascaded 2nd- and 1st-order sections.

I As an added benefit, cascade structures are known to be less sensitiveto quantization of coefficients and signals than direct forms.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 114 / 171

Page 150: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant second-order feedback loop I

a

x(k) y(k)

y(k-1)

y(k-2)

b

a)

+

**

+

a

b

x(k)y(k)

b)

Figure: DDG (a) and isomorphic architecture (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 115 / 171

Page 151: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant second-order feedback loop II

A second-order recursive function goes

y(k) = ay(k − 1) + by(k − 2) + x(k)

or, in the z domain,

H(z) =Y (z)

X (z)=

1

1− az−1 − bz−2

Unfolding is obtained from multiplying numerator and denominatorby an adequate factor. For p = 4, the transfer function becomes

H(z) =(1 + az−1 − bz−2) (1 + (a2 + 2b)z−2 + b2z−4)

1− ((a2 + 2b)2 − 2b2)z−4 + b4z−8

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 116 / 171

Page 152: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-invariant second-order feedback loop III

b 2

+2ba 2

+

*

+

*

b 2 - b 4

a

- b

y(k)x(k)

+2ba 2 22+2b)(a -2b2

numerator denominatora)

a

- b

x(k)

+

*

+

*

b)

building block

pipelinedmultiply-add

- b 4

22+2b)(a -2b2

y(k-6)

+

*

+

*Figure: DDG unfolded by p = 4 (a) and high-performance architecture (b).c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 117 / 171

Page 153: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Fourth-order ARMA filter 1

I Two second-order sections cascaded, loops unfolded with p=4.

I Pipelined multiply-add units with carry-save and carry-ripple adders.

I Fabricated in standard 0.9 µm CMOS technology (1992).

I Sampling frequency fs = fclk = 85 MHz, Γ = 1.

I Computation rate ≈ 1.5 Gop/s.

I One to two extra data bits added to maintain similar roundoff noise.

I Circuit size approximately 20 kGE.

I Supply 5 V, power dissipation 2.2 W at full speed.

Loop unfolding allows to push out the need for fast but costlyfabrication technologies such as GaAs, then and now.

1ARMA stands for “auto recursive moving average”, i.e. for IIR filtersthat comprise both recursive (AR) and non-recursive computations (MA).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 118 / 171

Page 154: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Fourth-order ARMA filter 1

I Two second-order sections cascaded, loops unfolded with p=4.

I Pipelined multiply-add units with carry-save and carry-ripple adders.

I Fabricated in standard 0.9 µm CMOS technology (1992).

I Sampling frequency fs = fclk = 85 MHz, Γ = 1.

I Computation rate ≈ 1.5 Gop/s.

I One to two extra data bits added to maintain similar roundoff noise.

I Circuit size approximately 20 kGE.

I Supply 5 V, power dissipation 2.2 W at full speed.

Loop unfolding allows to push out the need for fast but costlyfabrication technologies such as GaAs, then and now.

1ARMA stands for “auto recursive moving average”, i.e. for IIR filtersthat comprise both recursive (AR) and non-recursive computations (MA).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 118 / 171

Page 155: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Linear time-variant first-order feedback loop

x(k) y(k)

a(k)coefficient calculation

output computation

Figure: DDG after unfolding by a factor of p = 4.

I Coefficient terms must be calculated on-line requiring extra hardware.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 119 / 171

Page 156: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Nonlinear or general loops I

The most general case of a first-order recursion goes

y(k) = f (y(k − 1), x(k))

and can be unfolded an arbitrary number of times,e.g. with p = 2 to become

y(k) = f (f (y(k − 2), x(k − 1)), x(k))

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 120 / 171

Page 157: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Nonlinear or general loops IIy(k-1)

a)

f y(k)x(k)

b)

f y(k)x(k)

f

fx(k) y(k)

c)

f

fx(k) y(k)

d)

retiming

loopunfolding

f "

Figure: Original DDG (a) and isomorphic architecture (b), DDG after unfoldingby a factor of p = 2 (c), same DDG with retiming added on top (d).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 121 / 171

Page 158: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Nonlinear or general loops III

y(k-4)

e)

x(k) f y(k)f

f)

x(k)

x(k) y(k)

g)

f "

y(k-2)x(k)

h)

1 2

fis associative

provided operatorreordering

aggregation

feedforward feedbackf f

f "

Figure: DDG with the two functional blocks for f combined into f ” (g),pertaining architecture after pipelining and retiming (h).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 122 / 171

Page 159: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Limits to loop unfolding

Observation

I All successful architectural transforms for recursive computations takeadvantage of algorithmic properties such as linearity, fixed coefficients,associativity, limited word width or of a very limited set of register states.

I When the state size is large and the recurrence is not a closed-formfunction of specific classes, our methods for generating a high degreeof concurrency cannot be applied.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 123 / 171

Page 160: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Ciphering I

In electronic codebook mode, a block of ciphertext y(k) gets computedfrom the present block of plaintext x(k) and from key u(k)using some complex and non-analytical cipher function c .

cx(k) y(k)

u(k)

Figure: Block cipher in electronic codebook (ECB) mode.

I In search of throughput, the door is wide open for pipelining.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 124 / 171

Page 161: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Ciphering II

Figure: A computer graphics image in clear text.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 125 / 171

Page 162: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Ciphering III

Figure: Same image ciphered in electronic codebook mode (ECB).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 126 / 171

Page 163: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Ciphering IV

Figure: Same image ciphered in cipher block chaining mode (CBC).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 127 / 171

Page 164: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Ciphering V

Remedy: Cipher block chaining (CBC).

10

cx(k) y(k)

u(k)

a)

cryptographicimprovement

b)

x(k) y(k)c

u(k)

Figure: Combinational operation in ECB mode (a) vs. recursion in CBC mode (b).

I The nonlinear feedback introduced to improve cryptographic securityvetoes pipelining.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 128 / 171

Page 165: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Pipeline interleaving I

In search of higher throughput for a cipher in CBC mode, 2

none of our architectural transforms applies.

Think the unthinkable!

I “What is the effect of inserting an extra register into a first-orderrecursive loop with the idea of pipelining the datapath?”

2Operating a cipher in counter mode (CTR) manages without feedback and still avoidsthe leakage of plaintext into ciphertext that plagues ECB. This asks for a modification at thealgorithmic level, though.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 129 / 171

Page 166: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Pipeline interleaving II

c)

f

y(k=2n-2)

y(k=2n)x(k=2n)

f

y(k=2n-1)

y(k=2n+1)x(k=2n+1)

&

f

y(k-2)

y(k)x(k)

a) =

d)

f

x(0)x(2)

x(...)x(4)

x(1)x(3)

x(...)x(5)

y(0)y(2)

y(...)y(4)

y(1)y(3)

y(...)y(5)

f

y(k-2)x(k)

b)

pipelineinterleaving

Figure: Nonlinear time-variant first-order feedback loop with one extra registerinserted (a,b). Interpretation as two interleaved data streams (c,d).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 130 / 171

Page 167: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Ciphering revisited

10

=c en/decipheringmemoryless

mapping=

modulo 2additionbitwise

cx(k) y(k)

u(k)

a)

cryptographicimprovement

b)

x(k) y(k)c

u(k)

=

c)

cx(k) y(k)

u(k)

10

Figure: ECB mode (a), CBC mode with feedback (b), and CBC-8 operation (c).

Observation

Pipeline interleaving removes the bottleneck but alters functionality.

I Acceptable where data can be viewed as separate time-multiplexedstreams that are to be processed independently from each other.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 131 / 171

Page 168: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Sphere decoding in a MIMO OFDM receiver I 3

I Sphere decoding is a key subfunction in a MIMO OFDM receiver andessentially a sophisticated tree-traversal algorithm of low average searchcomplexity.

Observation

I OFDM operates on many subcarriers at a time (typically 48 to 108).

I Each subcarrier poses an independent tree-search problem.

3MIMO = Multi Input Multi Output, OFDM = Orthogonal Freq. Division Multiplexc© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 132 / 171

Page 169: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Sphere decoding in a MIMO OFDM receiver II

leve

lse

lect

radiusupdate

register bank(functional)

unit

metriccomputation

sphereconstraint

check

unit

metricenumeration

shimmingregisters

pipelineregisters

storage forintermediate

search results

triplicatedregister bank

Figure: Sphere decoder; black 7→ original architecture; color items 7→ extra circuitryrequired to handle three individual subcarriers in an interleaved fashion (p = 3).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 133 / 171

Page 170: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Example: Sphere decoding in a MIMO OFDM receiver III

0 1 2 3 4 5 6 7 8 9 100

10

20

30

40

50

60

70

80

Clock Period in ns

Are

a in

kG

E

6 pipeline stages5 pipeline stages4 pipeline stages3 pipeline stages2 pipeline stagesunpipelinedconstant AT

1

2

3

4

5

6

Figure: The beneficial impact of pipeline interleaving on area and throughputof a sphere decoder circuit (diagram courtesy of Dr. Markus Wenk).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 134 / 171

Page 171: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

The feedback bottleneckUnfolding of first-order loopsHigher-order loopsTime-variant loopsNonlinear or general loopsPipeline interleaving, not quite an equivalence transformDigest

Recapitulation

Loop unfolding

can significantly improve the throughput of linear time-invariant feedbackcalculations.

I The rapid growth of overall circuit size tends to limit economicallypractical unfolding degrees to fairly low values, say p = 2...8.

I Nonlinear feedback loops are, in general, not amenableto throughput multiplication by applying unfolding techniques.A notable exception exists when the loop function is associative.

I Pipeline interleaving is not an equivalence transform but neverthelesshelpful where multiple data streams undergo the same processingindependently from each other.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 135 / 171

Page 172: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Subject

Generalizations of the transform approach

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 136 / 171

Page 173: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Generalization to other levels of detail

Level Granu- Relevant itemsof abstraction larity Operations DataArchitecture © subtasks, processes time series, picturesWord ◦ arithmetic/logic ops words, samples, pixelsBit · gate-level ops individual bits

What if we try to apply equivalence transforms at levels of abstractionother than the word level?

I Recall: DDGs are not concerned with the granularity of operationsand data.

Lucky finding

Everything we have learned is applicable at multiple levels of abstraction.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 137 / 171

Page 174: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Generalization to other levels of detail

Level Granu- Relevant itemsof abstraction larity Operations DataArchitecture © subtasks, processes time series, picturesWord ◦ arithmetic/logic ops words, samples, pixelsBit · gate-level ops individual bits

What if we try to apply equivalence transforms at levels of abstractionother than the word level?

I Recall: DDGs are not concerned with the granularity of operationsand data.

Lucky finding

Everything we have learned is applicable at multiple levels of abstraction.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 137 / 171

Page 175: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Examples of transforms at the architecture level

b)

preprocessingsegmentationfeature extractionclassification

a)

preprocessingsegmentationfeature extractionclassification

pipelining

c)

2. segmentation3. feature extraction4. classification

1. preprocessing

overallcontrol

versatiledatapath

iterativedecomposition

all studied atarchitecture level

Figure: Architectural alternatives for a typical pattern recognition system.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 138 / 171

Page 176: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Examples of transforms at the bit levelstudied atword level

studied atbit level

c) p=W

shimmingregisters

a)W

W W

iterative decomposition

pipelining

d) s=W

b)

MSB LSB

4W=

controlsection

nonfunctionalfeedback loop

shimmingregisters

Figure: 4-bit addition (a) broken down into a ripple-carry adder (b)before being subject to pipelining (c) and iterative decomposition (d).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 139 / 171

Page 177: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

What we have seen so far

“Standard” datapaths. Word-level operations executed one after the otherwith all bits being processed simultaneously.

a)

b21b

x(k)

b3

* * **

y(k)+ + +

0b

multiplierparallel

LUT

adder accumulator

*

+

x(k)

y(k)

nb n

x(k-1)x(k-2)

x(k-3)

datapathsection

controlsection

multiplexerto 1N+1

coefficientstorage

wordsN+1

+1modN+1

longshift registerN=3

outputregister

indexregister

b)

datapathsection

no controlsection

iterative decomposition& time sharing

parallel multipliersN+1 addersN

longshift registerN=3

hard-wiredcoefficients

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 140 / 171

Page 178: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

What we will see next

Uncommon architectural concepts where one bit from each data wordis being operated upon at a time until all bits have been processed.

Bit-serial architectures.

1. Word-level operations broken into bit-level operations.2. Iterative decomposition (and possibly pipelining too).

Distributed arithmetic.

1. Word-level operations broken into bit-level operations.2. Algebraic transforms to get rid of multiplication.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 141 / 171

Page 179: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example of a bit-serial architecture

scalar (single wire)

vector (parallel bus)adder

full

multiplier1-by- -bitw

-bit adderw

y(k,w)

b21b b3

sx(k,w)

s s

b0

studied atbit level

+LSB

+LSB

+LSB

+LSB

c

s+

c

s+

c

s+

broken down to bit level& iterative decomposition

Figure: Third order transversal filter

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 142 / 171

Page 180: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Pros and cons of bit-serial architectures± Overall hardware structure remains isomorphic with the DDG.

+ Small control overhead.

− Inflexible because DDG is hardwired into the datapathwith no explicit controller.

+ High computation rates keep computational units busy.

+ All non-local data communication is via serial links.

+ Much of the data circulation is local.

− Division, data-dependent decisions, etc. ill-suited for bitwise iterativedecomposition and pipelining.

− Incompatible with word-oriented RAMs and ROMs (bit-parallel),successive approximation and max./min. picking (MSB first).

Rule of thumb

Bit-serial architectures are at their best for unvaried real-time computationsthat involve operations such as addition and multiplication by a constant.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 143 / 171

Page 181: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Pros and cons of bit-serial architectures± Overall hardware structure remains isomorphic with the DDG.

+ Small control overhead.

− Inflexible because DDG is hardwired into the datapathwith no explicit controller.

+ High computation rates keep computational units busy.

+ All non-local data communication is via serial links.

+ Much of the data circulation is local.

− Division, data-dependent decisions, etc. ill-suited for bitwise iterativedecomposition and pipelining.

− Incompatible with word-oriented RAMs and ROMs (bit-parallel),successive approximation and max./min. picking (MSB first).

Rule of thumb

Bit-serial architectures are at their best for unvaried real-time computationsthat involve operations such as addition and multiplication by a constant.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 143 / 171

Page 182: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Distributed arithmetic IConsider the calculation of the following inner product

y =K−1∑k=0

ck xk

where each ck is a fixed coefficient. Input data xk are scaled such that |xk | < 1and coded with a total of W bits in 2’s-complement format.

xk = −xk,0 +W−1∑w=1

xk,w 2−w

The desired output y can be expressed as

y =K−1∑k=0

ck (−xk,0 +W−1∑w=1

xk,w 2−w )

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 144 / 171

Page 183: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Distributed arithmetic IIWith distributive law, commutative law, and reversed order of summation

y =K−1∑k=0

ck (−xk,0) +W−1∑w=1

(K−1∑k=0

ck xk,w ) 2−w

The pivotal observation refers to the term in parentheses

K−1∑k=0

ck xk,w = p(w)

For any given bit position w , calculating the sum of products takes one bitfrom each of the K data words xk , so p(w) can take on no more than 2K

distinct values. With the coefficients ck constant, all those values can be keptin a lookup table (LUT). The computation then simply becomes

y = −p(0) +W−1∑w=1

p(w) 2−w

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 145 / 171

Page 184: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example of distributed arithmetic

multiplierparallel +1

modK

ROM

...

adder accumulator

y

kx

*

+

multiplexerto 1K

K

coefficientstorage

data words

W

motto: "all bits from one word at a time"

indexregister

outputregister

log2 K

≈ W

a)

studied atword level

...

kx

ROM

accumulator

y

+/-adder-subtractor

-1modW

bit positionregister

multiplexerto 1W

K

partial productstorage

data words2K

w = 0

1/2

motto: "one bit from each word at a time"

outputregister

log2 W

≈ W

b)

studied atbit level

broken down to bit level& algebraic transforms

Figure: Computing a sum of products by way of repeated multiply-accumulateoperations (a) and with distributed arithmetic (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 146 / 171

Page 185: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Pros and cons of distributed arithmetic

+ No need for costly multipliers as these get merged with coefficient tables.

− Memory size grows exponentially with the order of the inner productto be computed.

∼ Mitigation techniques exist but depend heavily on coefficient values.

Rule of thumb

Distributed arithmetic should be considered when

I coefficients are fixed,

I number of distinct coefficient values is small,

I hardware multipliers are expensive compared to lookup tables.

Example: DSP applications with table-based FPGAs.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 147 / 171

Page 186: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Pros and cons of distributed arithmetic

+ No need for costly multipliers as these get merged with coefficient tables.

− Memory size grows exponentially with the order of the inner productto be computed.

∼ Mitigation techniques exist but depend heavily on coefficient values.

Rule of thumb

Distributed arithmetic should be considered when

I coefficients are fixed,

I number of distinct coefficient values is small,

I hardware multipliers are expensive compared to lookup tables.

Example: DSP applications with table-based FPGAs.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 147 / 171

Page 187: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Generalization to other algebraic structures I

What we have seen so far:

“Standard” computations. Filters, correlators and the like where arithmeticoperations were taken from the field of reals (R, +, · ).

What we will see next:

More fields. ◦ with infinitely many elements, and◦ with some finite number of elements.

Semirings. More general algebraic structures.

You may want to present slide set “A Brief Glossary of Algebraic Structures” at this point!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 148 / 171

Page 188: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Generalization to other algebraic structures II

I All algebraic fields share a common set of axioms, so any algebraictransform that is valid in one field must necessarily hold for anyother field. Universal transforms remain valid anyway.

Observation

Everything we have learned is applicable to any algebraic field.

Infinite fields. (R, +, · ) and (C, +, · ) are commonplace in digital signalprocessing.

Finite fields. GF(2), GF(p), GF(pn) have numerous applications in

I data compression (source coding),I error correction (channel coding), andI information security (ciphering).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 149 / 171

Page 189: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: The Viterbi algorithm I

pathmetricupdate

branchmetric

computation

survivorpath

trace back

path metric memory (functional)

Figure: The three major steps of the Viterbi algorithm.

I Convolutional decoding is a multi-stage decision problemwhere Richard Bellman’s principle of optimality applies:“The globally optimum solution includes no suboptimal local decision.”

I Bellman has developed a technique called “Dynamic Programming”,the Viterbi algorithm is a particular case thereof.

Refer to slide set “A Gentle Introduction to Dynamic Programming and the Viterbi Algorithm”!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 150 / 171

Page 190: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: The Viterbi algorithm II

butterfly

==

ACS

ACS

min

min

b)

a)

......

......

......

state transitions with branch metricsstates with path metrics

time slot0 1 2 3 4 5 kk-1

iterativedecomposition

Figure: Abstracted trellis-type DDGfor path metric computation (a) with details for one butterfly (b).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 151 / 171

Page 191: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Architectural choices for a Viterbi decoder I

c) f)

desirablelocationfor extraregisters

d)

loop unfolding

timesharing

iterativedecomposition

inverse transform

ALU

iterativedecomposition

e)

Figure: Datapath architectures obtained from different degrees of iterativedecomposition (c,d,e). Doomed attempt to boost throughput by inserting extralatency registers into the nonlinear first-order feedback loop (f).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 152 / 171

Page 192: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Architectural choices for a Viterbi decoder II

Natural choice: A datapath that computes one set of path metricsfrom the previous set in a single clock cycle 7→ architecture d).

Goals and options:

Smaller circuit. Combine iterative decomposition and time sharing, ultimatelyleads to a processor-type datapath built around an ALU e).

Reduced clock. If the longest path in architecture d) turns out to betoo fast to match that in the remainder of the circuit,a lesser degree of decomposition may prove more adequate.c) yields roughly the same throughput with half the clock.Combinational logic gets approximately doubled, though.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 153 / 171

Page 193: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Architectural choices for a Viterbi decoder II

Natural choice: A datapath that computes one set of path metricsfrom the previous set in a single clock cycle 7→ architecture d).

Goals and options:

Smaller circuit. Combine iterative decomposition and time sharing, ultimatelyleads to a processor-type datapath built around an ALU e).

Reduced clock. If the longest path in architecture d) turns out to betoo fast to match that in the remainder of the circuit,a lesser degree of decomposition may prove more adequate.c) yields roughly the same throughput with half the clock.Combinational logic gets approximately doubled, though.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 153 / 171

Page 194: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Architectural choices for a Viterbi decoder III

Goals and options (continued):

Still higher throughput. Longest path needs to be trimmed down.The computation in a butterfly goes

y1(k) = min(a11(k) + y1(k − 1), a12(k) + y2(k − 1))

y2(k) = min(a21(k) + y1(k − 1), a22(k) + y2(k − 1))

This is a nonlinear first-order recursion f) none of our architectural transforms applies.

A more sophisticated approach is needed!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 154 / 171

Page 195: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Loop unfolding revisitedRederive substituting the generic symbols � for + and � for ·

y(k) = a(k)� y(k − 1)� x(k)

to obtain for arbitrary integer values of p ≥ 2

y(k) = (

p−1∏n=0

a(k − n))� y(k − p)�p−1∑n=1

(n−1∏m=0

a(k −m))� x(k − n)� x(k)

where∑

and∏

refer to operators � and � respectively.

I The algebraic axioms necessary for that derivation areI closure under both operators,I associativity of both operators, andI distributive law of � over �.

I The algebraic structure defined by these axioms is the semiring.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 155 / 171

Page 196: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Loop unfolding revisitedRederive substituting the generic symbols � for + and � for ·

y(k) = a(k)� y(k − 1)� x(k)

to obtain for arbitrary integer values of p ≥ 2

y(k) = (

p−1∏n=0

a(k − n))� y(k − p)�p−1∑n=1

(n−1∏m=0

a(k −m))� x(k − n)� x(k)

where∑

and∏

refer to operators � and � respectively.

I The algebraic axioms necessary for that derivation areI closure under both operators,I associativity of both operators, andI distributive law of � over �.

I The algebraic structure defined by these axioms is the semiring.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 155 / 171

Page 197: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder I

Now consider a semiring where

• Set of elements: S = R ∪ {∞},• Algebraic addition: � = min, and

• Algebraic multiplication: � = +.

The reformulated add-compare-select operation now goes

y1(k) = a11(k)� y1(k − 1)� a12(k)� y2(k − 1)

y2(k) = a21(k)� y1(k − 1)� a22(k)� y2(k − 1)

which, making use of vector and matrix notation, can be rewritten as

~y(k) = A(k)� ~y(k − 1)

I Note, this is a linear first-order recursion!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 156 / 171

Page 198: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder I

Now consider a semiring where

• Set of elements: S = R ∪ {∞},• Algebraic addition: � = min, and

• Algebraic multiplication: � = +.

The reformulated add-compare-select operation now goes

y1(k) = a11(k)� y1(k − 1)� a12(k)� y2(k − 1)

y2(k) = a21(k)� y1(k − 1)� a22(k)� y2(k − 1)

which, making use of vector and matrix notation, can be rewritten as

~y(k) = A(k)� ~y(k − 1)

I Note, this is a linear first-order recursion!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 156 / 171

Page 199: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder I

Now consider a semiring where

• Set of elements: S = R ∪ {∞},• Algebraic addition: � = min, and

• Algebraic multiplication: � = +.

The reformulated add-compare-select operation now goes

y1(k) = a11(k)� y1(k − 1)� a12(k)� y2(k − 1)

y2(k) = a21(k)� y1(k − 1)� a22(k)� y2(k − 1)

which, making use of vector and matrix notation, can be rewritten as

~y(k) = A(k)� ~y(k − 1)

I Note, this is a linear first-order recursion!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 156 / 171

Page 200: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder II

By replacing ~y(k − 1) one gets the unfolded recursion for p = 2

~y(k) = A(k)� A(k − 1)� ~y(k − 2)

To take advantage of this unfolded form,the product B(k) = A(k)� A(k − 1) must be computed outside the loop.

Resubstituting the original operators and variables we obtain the recursion

y1(k) = min(b11(k) + y1(k − 2), b12(k) + y2(k − 2))

y2(k) = min(b21(k) + y1(k − 2), b22(k) + y2(k − 2))

same number and types of operations as the original formulationin twice as much time.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 157 / 171

Page 201: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder II

By replacing ~y(k − 1) one gets the unfolded recursion for p = 2

~y(k) = A(k)� A(k − 1)� ~y(k − 2)

To take advantage of this unfolded form,the product B(k) = A(k)� A(k − 1) must be computed outside the loop.

Resubstituting the original operators and variables we obtain the recursion

y1(k) = min(b11(k) + y1(k − 2), b12(k) + y2(k − 2))

y2(k) = min(b21(k) + y1(k − 2), b22(k) + y2(k − 2))

same number and types of operations as the original formulationin twice as much time.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 157 / 171

Page 202: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder III

min

1y (k)

12a (k)

a11

(k)

(k-1)y1

min

(k)y2

21a (k)

22a (k)

(k-1)y2

a)

1y (k)

12b (k+1)

b11

(k+1)

(k-1)y1

(k)y2

21b (k+1)

22b (k+1)

(k-1)y2

c)

min

min

22b (k)+ (k-2)-y

2

21b (k)+ (k-2)y

1

12b (k)+ (k-2)y

2

11b (k)+ (k-2)y

1

1y (k)

12a (k)

a11

(k)

(k-1)y1

(k)y2

21a (k)

22a (k)

(k-1)y2

loop unfolding

b)

two nonlineartime-invariantfirst-orderfeedback loops

two lineartime-invariantfirst-orderfeedback loops

reformulatedover semiring

loop unfolding

extra registersplaced in loops

Figure: The first-order recursion of the Viterbi algorithm before (a) and after beingreformulated over a semiring (b), with loop unfolding added on top (c).

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 158 / 171

Page 203: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example: Boosting throughput of a Viterbi decoder IV

The price to pay is the extra hardware required to performthe non-recursive computations outside the loop

b11(k) = min(a11(k) + a11(k − 1), a12(k) + a21(k − 1))

b12(k) = min(a11(k) + a12(k − 1), a12(k) + a22(k − 1))

b21(k) = min(a21(k) + a11(k − 1), a22(k) + a21(k − 1))

b22(k) = min(a21(k) + a12(k − 1), a22(k) + a22(k − 1))

in a heavily pipelined way.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 159 / 171

Page 204: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Insight gained

Compare the two formulations of the same problem:

◦ Nonlinear recursion over field not amenable to loop unfolding.

◦ Linear recursion over semiring amenable to loop unfolding.

Conclusion

Taking advantage of specific properties of an algorithm and of algebraictransforms has more potential to offer than universal transforms alone.

I Some computations can be accelerated by creating concurrenciesthat did not exist in the original formulation.

Opens a door to solutions that would otherwise remain off-limits.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 160 / 171

Page 205: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Subject

Summary and conclusions

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 161 / 171

Page 206: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Options available for reorganizing datapath architectures

Type of computationcombinational sequential (memorizing)(memoryless) non-recursive recursive

Data flow feedforward feedforward feedbackMemory no yes yesData DAG with DAG with Directed cyclic graphdependency all edge some or all edge with no circular pathgraph weights zero weights non-zero of weight zeroResponse length M = 1 1 < M <∞ M =∞

Nature linear time-invariant D,P,Q,S,a D,P,q,S,a,R D,S,a,R,i,Uof linear time-variant D,P,Q,S,a D,P,S,a,R D,S,a,R,i,Usystem nonlinear D,P,Q,S,a D,P,S,a,R D,S,a,R,i,u

D : Iterative decompositionP : PipeliningQ : ReplicationS : Time sharinga : Associativity transform provided operations are identical and associativeR : Retimingi : Pipeline interleaving

U : Loop unfoldingu : Loop unfolding provided computation is linear over a semiring

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 162 / 171

Page 207: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Important architectural transformsand their characteristics

Architectural Decom- Pipe- Repli- Time Associa- Retiming Looptransform position lining cation sharing tivity unfoldingKind universal universal universal universal algebraic universal algebraic

Applicable to combinational computations sequential computationsImpact on nonrecurs. recursiveA −... = = ...+ + −... = = = +Γ + = − + = = =tlp − − =, mux − = −...+ − −T = Γ · tlp = − − + −...+ − −AT −... = −... = = = ...+ −...+ − +L + + =, mux + + = = +E −...+ −...+ = = ...+ −...+ = +Extra recy. distrib., collect., extrahardware and none recoll., redist., none none wordoverhead cntl. and cntl. and cntl. widthHelpful no coarse possibly no yes yes possiblyfor indirect grain yes yesenergy saving yesCompatible any register register any any register registerstorage type

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 163 / 171

Page 208: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Power and energy considerations

What is meant by “Helpful for indirect energy saving”?

I In CMOS, the most effective way to cut the energy spent per operationis to lower the supply voltage.

I The long paths through a circuit are likely to become unacceptably slowand need to be trimmed to recover clock rate and throughput.

I Architectural transforms that help do so with no circuit overhead:I RetimingI Chain/tree conversion (and other algebraic transforms)I Coarse grain pipelining (small overhead only)

Effectiveness must be examined in detail on a per case basis!

Simple fact

Over the first decade of the 21th century,energy efficiency has become even more important than die size.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 164 / 171

Page 209: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Power and energy considerations

What is meant by “Helpful for indirect energy saving”?

I In CMOS, the most effective way to cut the energy spent per operationis to lower the supply voltage.

I The long paths through a circuit are likely to become unacceptably slowand need to be trimmed to recover clock rate and throughput.

I Architectural transforms that help do so with no circuit overhead:I RetimingI Chain/tree conversion (and other algebraic transforms)I Coarse grain pipelining (small overhead only)

Effectiveness must be examined in detail on a per case basis!

Simple fact

Over the first decade of the 21th century,energy efficiency has become even more important than die size.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 164 / 171

Page 210: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

The grand alternatives from an energy point of view ...

GP

SP

general-purposearchitectures

special-purposearchitectures

I Processor-type architectures make use ofI general-purpose multi-operation ALUs,I generic register files of generous capacity,I multi-driver busses, bus switches, multiplexers, etc.,I uniform and often overly wide datapaths,I program and data memories along with address generation,I controllers, program sequencers, and iteration counters,I instruction fetching and decoding,I stack operations and interrupt handling,I dynamic reordering of operations,I branch prediction and speculative execution,I data shuffling between main memory and multiple levels of cache.

Observation

All of this is a tremendous waste of energyas none of the above contributes to payload data processing!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 165 / 171

Page 211: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

The grand alternatives from an energy point of view ...

GP

SP

general-purposearchitectures

special-purposearchitectures

I Processor-type architectures make use ofI general-purpose multi-operation ALUs,I generic register files of generous capacity,I multi-driver busses, bus switches, multiplexers, etc.,I uniform and often overly wide datapaths,I program and data memories along with address generation,I controllers, program sequencers, and iteration counters,I instruction fetching and decoding,I stack operations and interrupt handling,I dynamic reordering of operations,I branch prediction and speculative execution,I data shuffling between main memory and multiple levels of cache.

Observation

All of this is a tremendous waste of energyas none of the above contributes to payload data processing!

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 165 / 171

Page 212: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

... (continued)

I The impressive throughputs of general-purpose processors have beenbought by operating them under conditions such as

I fine-grain pipelining,I extremely fast clock,I comparatively high supply voltage,I low MOSFET threshold voltages ( 7→ large overdrive factors) and, hence,I significant leakage.

far from optimal for the energy efficiency of CMOS circuits.

Consequence

A program-controlled processor may dissipate 100 to 1000 times as muchenergy for the same calculation as an application-specific circuit.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 166 / 171

Page 213: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example

“To achieve long battery life when playing video, mobile devices mustdecode the video in hardware (on the GPU); decoding it in software(on the CPU) uses too much power. ... The difference is striking:on an iPhone 4, for example, H.264 videos play for up to 10 h,while videos decoded in software play for less than 5 h before thebattery is fully drained.” (Steve Jobs, 2010) 4

Truism (from “The Future of Computing, Game Over or Next Level?” 2011)

Doing only what needs to be done saves both energy and area.

4Why does the author talk of orders of magnitude when Jobs just found a factor of 2?2. Battery run times depend on the entire system, not just the video decoder.1. A GPU is a specialized instruction set processor, not a dedicated hardwired circuit.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 167 / 171

Page 214: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Example

“To achieve long battery life when playing video, mobile devices mustdecode the video in hardware (on the GPU); decoding it in software(on the CPU) uses too much power. ... The difference is striking:on an iPhone 4, for example, H.264 videos play for up to 10 h,while videos decoded in software play for less than 5 h before thebattery is fully drained.” (Steve Jobs, 2010) 4

Truism (from “The Future of Computing, Game Over or Next Level?” 2011)

Doing only what needs to be done saves both energy and area.

4Why does the author talk of orders of magnitude when Jobs just found a factor of 2?2. Battery run times depend on the entire system, not just the video decoder.1. A GPU is a specialized instruction set processor, not a dedicated hardwired circuit.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 167 / 171

Page 215: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Aside

Question: Does the total absence of unproductive computations implythe isomorphic architecture is the most energy-efficient option then?

Answer: Normally no.

Reasons:

I Glitching (redundant switching during transients) 7→ most intensewhen data recombine in combinational logic after having travelledalong propagation paths of disparate lengths.

I Leakage (static transistor currents) 7→ everything else being equal,a smaller circuit tends to have fewer leakage paths.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 168 / 171

Page 216: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Aside

Question: Does the total absence of unproductive computations implythe isomorphic architecture is the most energy-efficient option then?

Answer: Normally no.

Reasons:

I Glitching (redundant switching during transients) 7→ most intensewhen data recombine in combinational logic after having travelledalong propagation paths of disparate lengths.

I Leakage (static transistor currents) 7→ everything else being equal,a smaller circuit tends to have fewer leakage paths.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 168 / 171

Page 217: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Architecture design in an energy-constrained world

Imperative

Increasing performance in applications with a limited power budget (all today),requires that the amount of energy spent per payload operation be lowered.

as P = Θ · E In-depth discussion to follow in chapter 11 “Energy Efficiency and Heat Removal”.

A key challenge of architecture design is to

I minimize redundant switching activities,

I provide as just as much flexibility as required,

I keep the effort for design and verification within reasonable bounds,

all at a time.

Finding clever combinations between hardwired units andprogram-controlled processors asks for creativity and methodical work.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 169 / 171

Page 218: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

Architecture design in an energy-constrained world

Imperative

Increasing performance in applications with a limited power budget (all today),requires that the amount of energy spent per payload operation be lowered.

as P = Θ · E In-depth discussion to follow in chapter 11 “Energy Efficiency and Heat Removal”.

A key challenge of architecture design is to

I minimize redundant switching activities,

I provide as just as much flexibility as required,

I keep the effort for design and verification within reasonable bounds,

all at a time.

Finding clever combinations between hardwired units andprogram-controlled processors asks for creativity and methodical work.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 169 / 171

Page 219: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

A guide to evaluating architectural alternatives I

1. Begin by analyzing the algorithm. Give quantitative indications forI the data rates between all major building blocks,I the word widths,I the memory bounds and access schemes for all building blocks, andI the computation rates for all major arithmetic operations.

2. Look for simplifications and optimizations in the algorithmic domain.

3. Examine the control flow.Find out where to go for a hard-wired dedicated architecture, where fora program-controlled processor, and where to look for a compromise.

4. Let your intuition come up with preliminary architectural concepts.Establish a rough block diagram for each. Have boundaries betweenmajor subfunctions coincide with registers.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 170 / 171

Page 220: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

A guide to evaluating architectural alternatives II

5. Prepare a spreadsheet that opposes all architectures considered.

6. EstimateI overall circuit size,I computation period,I latency, andI dissipated energy.

Synthesize, place and route time-critical portions as propagation delaysoften depend on lower-level details.

7. Identify bottlenecks and inacceptably burdensome subfunctions.Improve architecture with the aid of equivalence transforms.

8. Compare, then narrow down your choice.

Concluding remark

Architecture design is more art than science.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 171 / 171

Page 221: From Algorithms to Architectures - Alexandria …eng.staff.alexu.edu.eg/~mmorsy/Courses/Undergraduate/EE...From Algorithms to Architectures Prof. Hubert Kaeslin Microelectronics Design

The architectural solution spaceDedicated VLSI architectures and how to design them

Equivalence transforms for combinational computationsOptions for temporary storage of data

Equivalence transforms for non-recursive computationsEquivalence transforms for recursive computations

Generalizations of the transform approach

Generalization to other levels of detailBit-serial architecturesDistributed arithmeticGeneralization to other algebraic structuresSummary and conclusions

A guide to evaluating architectural alternatives II

5. Prepare a spreadsheet that opposes all architectures considered.

6. EstimateI overall circuit size,I computation period,I latency, andI dissipated energy.

Synthesize, place and route time-critical portions as propagation delaysoften depend on lower-level details.

7. Identify bottlenecks and inacceptably burdensome subfunctions.Improve architecture with the aid of equivalence transforms.

8. Compare, then narrow down your choice.

Concluding remark

Architecture design is more art than science.

c© Hubert Kaeslin Microelectronics Design Center ETH Zurich From Algorithms to Architectures 171 / 171