Processor Architectures and Program Mapping H. Corporaal and B. Mesman

1

Platform-based Design

TU/e 5kk70

Henk Corporaal

Bart Mesman

Digital Signal Processors


2

[Figure: spectrum from flexibility to efficiency. From most flexible to most efficient: programmable CPU, programmable DSP, application-specific instruction set processor (ASIP), application-specific processor.]


3

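#define NTAPS 4

int fir(int in)
{
    int i;
    static int state[NTAPS + 1];   /* delay line x0..x4 */
    static int coeff[NTAPS + 1];   /* coefficients c0..c4 */
    int out[NTAPS + 1];            /* running partial sums */

    state[NTAPS] = in;             /* newest sample enters at the end */
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS + 1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];
        state[i-1] = state[i];     /* shift the delay line by one */
    }
    return out[NTAPS];
}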

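Application examples (1)

[Figure: FIR filter. A delay line (Z-1 elements) holds samples x0..x4; each sample is multiplied by its coefficient c0..c4 and the products are summed into the output y.]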


4

.L100006:
    sll   $3, $2, 2          R3 = R2 << 2            byte offset of element i-1
    addu  $14, $15, $3       R14 = R15 + R3
    lw    $24, 0($14)        R24 = load(*R14)        R24 = coeff[i-1]
    addiu $12, $6, -4        R12 = R6 - 4
    addu  $11, $12, $3       R11 = R12 + R3
    lw    $13, 0($11)        R13 = load(*R11)        R13 = state[i-1]
    nop
    mult  $24, $13           HI/LO = R24 * R13
    addu  $25, $sp, $3       R25 = sp + R3
    lw    $9, -4($25)        R9 = load(R25 - 4)      R9 = out[i-1]
    addiu $2, $2, 1          R2 = R2 + 1             i = i + 1
    mflo  $13                R13 = move from low multiply register
    addu  $10, $9, $13       R10 = R9 + R13          R10 = out[i]
    sw    $10, 0($25)        mem(*R25) = R10
    addu  $25, $7, $3        R25 = R7 + R3
    sw    $24, 0($25)        mem(*R25) = R24
    slti  $24, $2, 10
    bne   $24, $0, .L100006
    addiu $15, $7, -4        (branch delay slot)


19 instructions per tap!!


5

Bit level operations: finite field arithmetic

temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out = temp1 exor temp2

            r1 = LB input           Load byte
            r2 = SLL r1             Shift left logical
            r3 = ANDI r1, mask      AND immediate
            r4 = ADDI r3, -1        ADD immediate
            BNE (r4 != r0)          Branch on != to nonzero
            nop
            r5 = XORI(r1, 29)       Exclusive or immediate
            J common                Jump
            nop
nonzero:    r5 = XOR(r1, r0)        Exclusive OR
common:     …

[Figure: hardware version. Input bits in[0]..in[7] are wired to output bits out[0]..out[7] through three exor gates.]

10 instructions!!
Very simple in hardware
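The three-line pseudocode on this slide can be sketched in C as follows. This is only an illustration: the function name `gf_times_x` is hypothetical, and the 8-bit data type is an assumption based on the eight input and output bits in the figure.

```c
#include <stdint.h>

/* Shift left by one and, if bit 7 of the input was set, fold the
 * overflow back in by XOR-ing with the constant 29 from the slide.
 * The name gf_times_x is an illustrative assumption. */
uint8_t gf_times_x(uint8_t input)
{
    uint8_t temp1 = (uint8_t)(input << 1);      /* temp1 = input << 1      */
    uint8_t temp2 = (input & 0x80) ? 29 : 0;    /* bit 7 set ? 29 : 0      */
    return (uint8_t)(temp1 ^ temp2);            /* out = temp1 exor temp2  */
}
```

On a general-purpose CPU even this short function compiles to a shift, a test, a conditional select and an XOR, whereas in hardware it is a handful of wires and XOR gates.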


6

Bit level operations: DES example

    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25, 2

[Figure: bits 20, 22, 23, 25, 26 and 27 of the source register ($2) are gathered into bits 2, 3, 4, 5, 6 and 7 of the destination register ($24).]
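The nine instructions of the DES example map one to one onto shifts, masks and ORs in C. The sketch below mirrors them; the function name `des_gather_bits` is hypothetical and a 32-bit register width is assumed.

```c
#include <stdint.h>

/* Gather source bits 20, 22-23 and 25-27 into destination bits 2..7,
 * following the shift/mask constants of the assembly on the slide. */
uint32_t des_gather_bits(uint32_t src)
{
    uint32_t r = (src >> 20) & 1u;   /* bit 20      -> bit 0     */
    r |= (src >> 21) & 6u;           /* bits 22..23 -> bits 1..2 */
    r |= (src >> 22) & 56u;          /* bits 25..27 -> bits 3..5 */
    return r << 2;                   /* move result to bits 2..7 */
}
```

In hardware the same permutation costs no logic at all, only wiring, which is the point the slide makes about bit-level operations on a CPU.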



7

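Bit level operations: A5 example (GSM encryption)

    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12, 1

[Figure: bits 18, 17, 16 and 13 of register $5 are XORed together; the result lands in bit 0 of $13, and all other bits of $13 are 0.]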
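As a sketch of what the eight instructions compute, the XOR of the four tapped bits can be written as a single C expression. The function name `lfsr_feedback` is an assumption for illustration; the tap positions 18, 17, 16 and 13 are taken from the slide.

```c
#include <stdint.h>

/* XOR bits 18, 17, 16 and 13 of the register and return the result
 * in bit 0, with all other bits cleared (as the final andi does). */
uint32_t lfsr_feedback(uint32_t reg)
{
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1u;
}
```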



8

Application examples: conclusions

• CPUs offer flexibility, but…

• not efficient in performance

• not efficient in code size

• not efficient in power consumption


9

Power consumption in microprocessors

Power consumption is (becoming) the limiting factor in processor design.

Solution in direction of:
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency

source: ISSCC2001, Patrick Gelsinger, Intel


10

Amdahl’s law

• Impact of an improvement on the execution time of a program depends on 2 parameters:
  – f = fraction of the original computation time that is affected by the improvement
  – s = speedup factor (local)

• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s

• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)

• if s >> 1 then speedup_overall ≈ 1 / (1 - f)

• Example: 40 % of program can be executed 10 x faster
  speedup_overall = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 = 1.56
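The formula can be checked numerically with a few lines of C. This is a minimal sketch; the function name `amdahl_speedup` is chosen here for illustration, while f and s have the meaning defined on the slide.

```c
/* Overall speedup according to Amdahl's law.
 * f: fraction of the original execution time that is improved (0..1)
 * s: local speedup factor of that fraction (s > 0) */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.4 and s = 10 this gives 1.5625, the 1.56 of the example. Letting s grow very large only raises the result towards 1 / 0.6 ≈ 1.67, which shows how the unaffected fraction bounds the gain.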


11

Conclusions

• Programmable CPU cores are important for the control parts of the application.
• They are well supported with tools for the development of end-user software (vs. deeply embedded software).
• Keep it Simple heuristic (RISC vs. CISC):
  – Make frequent cases fast and rare cases correct.
  – Regular (orthogonal) instruction set.
  – No special features that match a high level language construct.
  – At least 16 registers to ease register allocation.
• Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).


12

Programmable Digital Signal Processors

• real-time worst-case processing = need for more compute power

  sec/prog = (instr/prog) * (cycles/instr) * (sec/cycle)

  – CPI = 1
  – instruction level parallelism (ILP)
  – hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  – e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
  – benchmarking (Berkeley Design Technology Inc. (BDTI)) (compare to SPECint benchmarks for CPUs)


13

Outline

• architectures for programmable DSPs
  – multiplier-accumulator
  – modified Harvard architecture
  – extension with an ALU (decision making)
  – controller architectures
• examples: TI, Motorola, Philips
• code generation
• developments: VLIW (Very Long Instruction Word), examples: C6 and TM


14

DSP data types

• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage FP: most specs are in FP (conversion to int is time consuming since the behavior may change)
• disadvantage FP: cost (area, speed, power)
• integer multiplication doubles the number of bits: n * n => 2n
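The bit growth of integer multiplication can be made concrete with a small multiply-accumulate sketch in C. It assumes 16-bit samples and coefficients (the slide does not fix a word length), and the function name `mac_int16` is hypothetical. Each 16 x 16 product needs 32 bits, and summing several products needs a few guard bits on top of that, here provided by a 64-bit accumulator.

```c
#include <stdint.h>

/* Sum of products c[i] * x[i] for 16-bit integer data.
 * Each product is formed in 32 bits (n * n => 2n); the accumulator
 * is wider so that a run of large products cannot overflow it. */
int64_t mac_int16(const int16_t *c, const int16_t *x, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        int32_t product = (int32_t)c[i] * (int32_t)x[i];
        acc += product;
    }
    return acc;
}
```

Three products of 32767 * 32767 already exceed the range of a signed 32-bit accumulator, so extra accumulator bits are needed as soon as more than a couple of taps are summed.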


15

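[Figure: multiplier-accumulator datapath. Inputs c(i) and x(i) feed a multiplier MPY (Booth, Wallace, ...), followed by a product register PR, an ADDER with accumulator register ACR, and a SHIFT / ROUND / TRUNCATE stage; the registers (P_reg) are clocked and the units are steered by control signals.]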


16

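[Figure: three memory architectures for computing c(i) * x(i).
 Von Neumann (sequential): one combined program/data memory connected to the execution unit (EXU).
 Harvard: separate program memory and data memory connected to the EXU.
 Modified Harvard: program memory plus two data memories (data mem. 1 and data mem. 2) connected to the EXU.]

Goal = 1 cycle per iteration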


17

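[Figure: DSP architecture. Data path: RAM_A and RAM_B, each with its own address computation unit (ACU_A, ACU_B), address register (AR_A, AR_B) and data register (DR_A, DR_B), feeding the MAC. Controller: program counter (PC, +1), interrupt address, stack, reset, program memory, instruction register (IR), register file (Rfile) and the control bus.]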


18

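[Figure: FIR filter with delay line (Z-1 elements) holding x1..x5, coefficients c1..c5, products ci * xi summed into y. Two nested loops: the time loop (one pass per output sample) and the filter loop over i (one pass per tap).]

How to update the delay line?

1 cycle/tap?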


19

Solution 2: indirect addressing

Memory     output     output     output     output     output
location   sample 1   sample 2   sample 3   sample 4   sample 5
1          x1                                          x9
2          x2         x2
3          x3         x3         x3
4          x4         x4         x4         x4
5          x5         x5         x5         x5         x5
6                     x6         x6         x6         x6
7                                x7         x7         x7
8                                           x8         x8

• use of a pointer to mark the beginning of the delay line
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
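A delay line with a moving pointer and modulo addressing can be sketched in C as a circular buffer. The type `fir_t`, the function `fir_step` and the choice of 5 taps are assumptions made for this illustration; the `%` operator stands in for the modulo hardware of the address unit.

```c
#define TAPS 5

typedef struct {
    int coeff[TAPS];   /* filter coefficients                      */
    int delay[TAPS];   /* delay line, used as a circular buffer    */
    unsigned head;     /* pointer: where the next sample is stored */
} fir_t;

/* Process one input sample: overwrite the oldest sample in place
 * instead of shifting the whole delay line, then walk backwards
 * through the buffer with modulo addressing. */
int fir_step(fir_t *f, int in)
{
    int acc = 0;
    unsigned idx = f->head;

    f->delay[f->head] = in;
    for (int k = 0; k < TAPS; k++) {
        acc += f->coeff[k] * f->delay[idx];
        idx = (idx + TAPS - 1) % TAPS;      /* step back, wrapping around */
    }
    f->head = (f->head + 1) % TAPS;         /* advance the pointer */
    return acc;
}
```

No sample is ever copied: only the pointer moves, and the wrap-around keeps the delay line inside its own block of memory instead of creeping through the whole RAM.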


20

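ACU architecture and instruction set

[Figure: address computation unit with registers A and S, an adder followed by a modulo unit, and an output to the RAM.]

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo can be implemented as a mask operation if the size is 2^k.
Example: 16 = 10 000 and 23 = 10 111 (binary); mask = hold, i.e. the masked upper bits keep their value while the lower bits wrap around.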
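The mask-based modulo can be sketched in C as below. The function name `modulo_step` is hypothetical, and the sketch assumes that the buffer size is a power of two and that the buffer starts at an address that is a multiple of its size, as in the 16..23 example.

```c
#include <stdint.h>

/* Advance addr by step inside a circular buffer of 'size' words.
 * Assumptions: size is a power of two and the buffer is aligned
 * to a multiple of size. The upper address bits are held, the
 * lower bits are allowed to wrap. */
uint32_t modulo_step(uint32_t addr, uint32_t step, uint32_t size)
{
    uint32_t mask = size - 1u;
    return (addr & ~mask) | ((addr + step) & mask);
}
```

For a buffer of 8 words at addresses 16..23, stepping by 1 from address 23 yields 16 again: only the three low bits change, so no divider or comparator is needed.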


21

Addressing modes

Mode          Example              Meaning
• register    ADD R4, R3           R[R4] = R[R4] + R[R3]
• immediate   ADD R4, #3           R[R4] = R[R4] + #3
• direct      ADD R4, (100)        R[R4] = R[R4] + Mem[100]
• indirect    ADD R4, (R3)         R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec  ADD R4, (R3)±        R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
• indexed     ADD R4, (R3±R2)      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x[n]
• index = for stepping through arrays, e.g. x[2n]


22

Addressing modes: extra for DSP

• 8 ARs (address or auxiliary registers) available
• extra indirect modes
  – circular      *ARn ± %       post inc/dec by 1, circular
  – circular      *ARn ± AR0 %   post inc/dec by AR0, circular
  – bit reverse   *ARn ± AR0 B   post inc/dec by AR0, bit reversed
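Bit-reversed post-increment can be emulated in software as an addition whose carry propagates from the most significant bit towards the least significant bit. The sketch below does this by reversing both operands, adding normally and reversing the sum; the function names and the explicit `bits` parameter (address width of the buffer) are assumptions for illustration, not the instruction set of a particular DSP.

```c
/* Reverse the lowest 'bits' bits of v. */
static unsigned bit_reverse(unsigned v, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Add step to addr with reversed carry propagation, within a
 * buffer of 2^bits entries. */
unsigned reverse_carry_add(unsigned addr, unsigned step, unsigned bits)
{
    unsigned mask = (1u << bits) - 1u;
    unsigned sum = (bit_reverse(addr, bits) + bit_reverse(step, bits)) & mask;
    return bit_reverse(sum, bits);
}
```

For an 8-entry buffer and a step of 4 (half the buffer size), starting at 0 this visits 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed order an FFT needs for its input or output shuffle.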


23

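[Figure: the same DSP architecture extended with an ALU next to the MAC. Controller: PC (+1), interrupt address, stack, reset, program memory, IR, Rfile, control bus. Data path: ACU_A / AR_A / RAM_A / DR_A and ACU_B / AR_B / RAM_B / DR_B feeding the MAC and the ALU.]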


24

first solution: c(i) * x(i)

LABEL   ALU             MPY-ACC                       RAM          ACU
                        acc = 0                                    init (i=0)
        init counter
loop                                                               incr (=i+1)
                                                      read x(i)
                        acc(i)=acc(i-1)+x(i)*c(i)
        dec counter
        branch to loop if counter > 0
        nop

(columns = resources, rows = time in clock cycles; not shown: coefficient RAM + ACU)

6 clock cycles/sample
limit: pipelines in the controller


25

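Loopfolding (software pipelining)

[Figure: a chain of operations f, g, h turning a_i into b_i, c_i and d_i, drawn for iterations 0, 1 and 2, and the folded version in which f, g and h of three different iterations execute side by side.]

Original loop:
for i = 0 to n
    b_i = f(a_i)
    c_i = g(b_i)
    d_i = h(c_i)

Folded loop:
for i = 2 to n
    b_i     = f(a_i)
    c_(i-1) = g(b_(i-1))
    d_(i-2) = h(c_(i-2))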
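The folded loop needs a prologue that fills the pipeline and an epilogue that drains it. The C sketch below shows both versions side by side so they can be compared; the bodies of f, g and h are arbitrary placeholders, the function names are hypothetical, and the arrays are indexed 0..n-1 rather than 0..n.

```c
/* Placeholder stage functions; any pure functions would do. */
static int f(int a) { return a + 1; }
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

/* Straightforward loop: the three stages of one iteration run in sequence. */
void loop_plain(const int *a, int *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = h(g(f(a[i])));
}

/* Folded loop (requires n >= 2): in the steady state f works on
 * iteration i, g on iteration i-1 and h on iteration i-2, so the
 * three operations in the loop body are independent of each other. */
void loop_folded(const int *a, int *d, int n)
{
    int c_prev = g(f(a[0]));      /* prologue: c for iteration 0 */
    int b_prev = f(a[1]);         /* prologue: b for iteration 1 */

    for (int i = 2; i < n; i++) {
        int b = f(a[i]);          /* iteration i   */
        int c = g(b_prev);        /* iteration i-1 */
        d[i - 2] = h(c_prev);     /* iteration i-2 */
        b_prev = b;
        c_prev = c;
    }

    d[n - 2] = h(c_prev);         /* epilogue: drain the pipeline */
    d[n - 1] = h(g(b_prev));
}
```

On a machine with separate units for f, g and h, the three statements in the folded loop body can be issued in the same cycle, which is what shortens the loop in the schedules of this section.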


26

Loopfolding (software pipelining): c(i) * x(i)

LABEL   ALU             MPY-ACC                             RAM           ACU
                        acc(i-1)=0                                        init (i=1)
        init counter                                        read x(i)     inc (=i+1)
loop                    acc(i)=acc(i-1)+x(i)*c(i)           read x(i+1)   incr (=i+2)
        dec counter
        branch to loop if counter > 0
        nop
                        acc(n-1)=acc(n-2)+x(n-1)*c(n-1)     read x(n)
                        acc(n)=acc(n-1)+x(n)*c(n)

Pre- and postamble
4 clock cycles/sample


27

hardware support for loop control: c(i) * x(i)

LABEL        ALU             MPY-ACC                             RAM           ACU
                             acc(i-1)=0                                        init (i=1)
             init counter                                        read x(i)     inc (=i+1)
repeat n-2                   acc(i)=acc(i-1)+x(i)*c(i)           read x(i+1)   incr (=i+2)
                             acc(n-1)=acc(n-2)+x(n-1)*c(n-1)     read x(n)
                             acc(n)=acc(n-1)+x(n)*c(n)

1 clock cycle/sample
repeat instruction and repeat block


28

TMS320C5000

[Figure: TMS320C5000 data path. T register, 17 x 17 multiplier with sign control on its inputs, fractional MUX, 40-bit adder with ZERO / SAT / ROUND, two 40-bit accumulators A and B, 40-bit ALU, barrel shifter, MSW/LSW select, COMP unit with TRN and TC, and input multiplexers selecting among T, A, B, C, D and P.]


29

Motorola 56K family

[Figure: Motorola 56K block diagram.
 Memories: X memory (256-by-24-bit RAM, 256-by-24-bit ROM), Y memory (256-by-24-bit RAM, 256-by-24-bit ROM), 2,048-by-24-bit program memory ROM.
 Addressing: address ALU with X, Y and P address buses, external address switch, 16-bit address bus.
 Data: X data, Y data, P data and global data buses, internal data-bus switch, external data-bus switch, 24-bit data bus.
 Data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result.
 Control and I/O: program controller (clock: 2 bits, interrupt: 3 bits); on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control) with 24 bits of I/O ports; a further 7-bit port.]


30

R.E.A.L.

[Figure: R.E.A.L. DSP block diagram. X data memory and Y data memory, each on a 16-bit bus, with two address computation units; program memory (Z data) on a 16-bit bus; program control unit and instruction decoder for 96-bit instructions; two 16-by-16-bit multipliers (inputs X, Y0, Y1; products P0, P1) with scaling; two 40-bit arithmetic-logic units with saturation; four 40-bit accumulators; saturation/scale and shift on the way back to the X, Y and Z data buses.]


31

[Figure: compiler flow.
 source
 → front end: lexical analysis, syntax analysis, semantic analysis
 → intermediate machine independent representation
 → code generation: code selection, register allocation, scheduling (1 instr = parallel ops; order of instr)
 → code]


32

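Intermediate machine independent representation

t1 := a * b
t2 := c + d
t3 := t1 + c
out := t2 * t3

[Figure: data-flow graph of these four operations (inputs a, b, c, d; intermediate values t1, t2, t3), next to a control-flow graph of basic blocks BBi, BBj and BBk.]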


33

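Code selection example

[Figure: ADSP [Analog Devices] data path. ALU (+, -) with input registers ax and ay and result registers ar and af; MAC (+, -, *) with input registers mx and my and result registers mr and mf; the x and y inputs are fed from d memory and p memory.]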


34

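Example of code selection = covering of intermediate representation with RTPs

[Figure: the data-flow graph of t1 = a * b, t2 = c + d, t3 = t1 + c, out = t2 * t3, covered by three numbered register-transfer patterns (1:, 2:, 3:).]

Register loads:
    mx := dmem    my := pmem    ax := dmem    ay := pmem
    mr := dmem

Register-transfer patterns used in the covering:
    mr := mr + (mx * my)
    ar := ax + ay
    my := ar
    mr := mr * my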
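To see that such a covering preserves the meaning of the intermediate code, the register transfers can be simulated in C and compared with the three-address form. This is only a sketch: the local variables stand in for the ADSP registers named on the slide, the assignment of a, b, c, d to memory loads is an assumption, and the order of the patterns is one plausible reading of the figure.

```c
/* Reference: the machine independent three-address code. */
int eval_three_address(int a, int b, int c, int d)
{
    int t1 = a * b;
    int t2 = c + d;
    int t3 = t1 + c;
    return t2 * t3;
}

/* The same computation expressed as register transfers. The
 * multiply-accumulate pattern covers t1 and t3 in a single step. */
int eval_covered(int a, int b, int c, int d)
{
    int mx = a, my = b;     /* mx := dmem, my := pmem */
    int mr = c;             /* mr := dmem             */
    int ax = c, ay = d;     /* ax := dmem, ay := pmem */
    int ar;

    mr = mr + (mx * my);    /* covers t1 and t3       */
    ar = ax + ay;           /* covers t2              */
    my = ar;                /* move ALU result to MAC */
    mr = mr * my;           /* covers out             */
    return mr;
}
```

Four operations in the intermediate code become three compute patterns plus one move, which illustrates why code selection interacts with register allocation: the extra move only exists because the sum ends up in an ALU register while the multiplier reads from its own input registers.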


35

Problems
• local decisions which have a global impact
• phase coupling: example
  – asap schedule
  – maximal freedom for scheduling
  – code selection during scheduling
  – register allocation comes afterwards
  – can lead to infeasible solutions


36

phase coupling: discussion

It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice = programming in assembler.

Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word
