Hardware/Software Instruction Set Configurability for System-on-Chip Processors
38th DAC, Las Vegas, June 18-22, 2001
Albert Wang, Chris Rowen, Dror Maydan, Earl Killian
Landscape of reconfigurable computing
[Figure: design points placed on two axes, optimality/integration (e.g. mW, $) and flexibility/modularity (e.g. time-to-market). Points shown: ASIC, FPGA, FPGA + processor, general processor, and instruction-set configurable processor. Two gaps are marked ∆ ~10x.]
Computing using temporal connection
[Figure: processor solution built from registers, datapath, control, and program memory.]
Processor scorecard on correct / efficient: one ✓ and one ✗
Computing using spatial connection
[Figure: the processor solution (registers, datapath, control, program memory) beside the ASIC solution (FSM, storage).]
ASIC scorecard on correct / efficient: one ✓ and one ✗
Configurable Processors: best of both
Processor with application-specific instructions
[Figure: processor solutions (registers, datapath, control, program memory) combined with ASIC solutions (FSM, storage).]
Scorecard on correct / efficient, labeled Processor and ASIC: all ✓
Outline
• Configurable processor solution
  - Xtensa™ processor architecture
  - Instruction extension automation
  - Software development tools
• An Example
• Results
• Summary
Conventional Architecture
[Figure: decoder and control driving source routing, register files RF0 to RF2 and states S0 and S1, functional units FU0 to FU3, and result routing. Successive builds of the slide add the items below.]
• More FUs
• More registers
• Deeper pipeline
• Bypass/forward
Conventional Architecture – cont.
• Problems with a fixed processor:
  - Wastes silicon: there are no universal extensions, or even one for each application class
  - Not fast enough compared with a hardware implementation
  - Wastes power
• The Tensilica solution:
  - Small core processor
  - Allows easy and efficient application-specific instruction extensions
Xtensa Architecture – Base
[Figure: base block diagram with decoder, control, source routing, register files RF0 to RF2, states S0 and S1, functional units, and result routing.]
• Good performance
  - Comparable to any embedded 32-bit RISC
• Good code density
  - Much better than 32-bit RISC
  - Uses 16b/24b instructions
• Small
  - 0.7 mm² in 0.18 µm
• Low power
  - 0.37 mW/MHz
• Easy extension
  - With the Tensilica Instruction Extension (TIE) language, at the ISA level
• Efficient extension
  - TIE compiler generates an efficient pipelined implementation
  - TIE compiler extends all software development tools
TIE language - opcode
[Figure: base block diagram (decoder, control, source routing, register files, functional units, result routing).]
• Opcode
    opcode MAC op2=5 CUST0
TIE language – regfile / state
[Figure: block diagram with register file RF0 and state S0; more are added as needed.]
• Opcode
• Register file / state, as needed
    state ACC 40
TIE language – semantics
[Figure: block diagram with RF0, S0, FU0, and a MAC unit; register files and states are added as needed.]
• Opcode
• Register file / state
• Semantics
    semantic sem1 {MAC} {assign ACC = ACC + ars[15:0] * art[15:0];}
TIE language – iclass
[Figure: block diagram with RF0, S0, FU0, and a MAC unit; register files and states are added as needed.]
• Opcode
• Register file / state
• Semantics
• Instruction class
    iclass c1 {MAC} {in ars, in art} {inout ACC}
TIE language - schedule
[Figure: block diagram with RF0, S0, FU0, and a MAC unit; register files and states are added as needed.]
• Opcode
• Register file / state
• Semantics
• Instruction class
• Schedule
    schedule s1 {MAC} {use ars 1; use art 1; use ACC 2; def ACC 2;}
A Complete Example – parallel MAC
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }
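The PMAC semantics can be checked against a small host-side reference model. The C sketch below is an illustration under stated assumptions, not Tensilica's implementation: the names pmac_state, pmac_model, and ACC_MASK are hypothetical, the 16-bit operands are treated as unsigned, and each 40-bit accumulator is assumed to wrap modulo 2^40.

```c
#include <stdint.h>

/* Hypothetical software model of the two 40-bit states ACC1 and ACC2,
   each held in the low 40 bits of a 64-bit integer. */
typedef struct {
    uint64_t acc1;
    uint64_t acc2;
} pmac_state;

#define ACC_MASK ((UINT64_C(1) << 40) - 1)

/* Model one PMAC: multiply the low halves and the high halves of ars and
   art as unsigned 16-bit values (an assumption) and accumulate each
   product into its own accumulator, wrapping at 40 bits (an assumption). */
static void pmac_model(pmac_state *s, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);
    s->acc1 = (s->acc1 + lo) & ACC_MASK;
    s->acc2 = (s->acc2 + hi) & ACC_MASK;
}
```

Calling pmac_model once per packed operand pair mirrors issuing one PMAC per loop iteration; a model like this is mainly useful for comparing expected accumulator values against a simulator run.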
Productivity Gain – language + compiler
[Figure: select processor options (ALU, pipe, I/O, timer, MMU, register file, cache) and describe new instructions. Using the Xtensa processor generator, create a tailored, synthesizable HDL µP core and a customized compiler, assembler, linker, debugger, and simulator. In minutes!]
Productivity Gain – Software Tools
[Figure: select processor options (ALU, pipe, I/O, timer, MMU, register file, cache) and describe new instructions. Using the Xtensa processor generator, create a tailored, synthesizable HDL µP core and a customized compiler, assembler, linker, debugger, and simulator.]
Software Support – Assembler
[Figure: RF0, FU0, decoder, and control extended with accumulators ACC1 and ACC2, each fed by a multiplier (∗) and an adder (+).]
• Assembler
• Custom data type
• Register allocation
• Code scheduling
• RTOS
• Simulator/debugger
Assembly using the new instruction:
        loop    a2, .L1
        l16si   a10, a3, 0
        l16si   a11, a3, 2
        addi.n  a3, a3, 2
        PMAC    a10, a11
    .L1:
Software Support – custom data type
C code:
    sat_int x, y, z;
    z = sat_add(x, y);
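The slide does not define sat_int or sat_add, so the following C sketch only illustrates what a saturating add typically computes. The names sat_int_model and sat_add_model are hypothetical, and the 32-bit signed width is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the custom type: a 32-bit signed value. */
typedef int32_t sat_int_model;

/* Add two values and clamp the result to the representable range
   instead of letting it wrap around. */
static sat_int_model sat_add_model(sat_int_model x, sat_int_model y)
{
    int64_t sum = (int64_t)x + (int64_t)y;   /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (sat_int_model)sum;
}
```

With a custom data type, the compiler can map the C-level call to a single extension instruction, whereas the portable version above needs a widening add and two compares.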
Software Support – register allocation
Spilling around a call:
    sat_add   s3, s1, s2
    sat_store s3, a1, 0
    call8     foo
    sat_load  s3, a1, 0
Software Support – code scheduling
C code:
    t = sat_mult(x, y);
    z = sat_add(z, t);
    t2 = sat_mult(x2, y2);
Scheduled code:
    sat_mult s3, s1, s2
    sat_mult s6, s5, s4
    sat_add  s7, s7, s3
Software Support - RTOS
[Figure: context switch between Task0 and Task1. Each task owns registers s0, s1, … s15, which move to and from memory with sat_store and sat_load.]
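The context switch pictured on this slide can be sketched in C as a save of the outgoing task's custom registers followed by a restore of the incoming task's. Everything named here (sat_regfile, task_context, context_switch) is hypothetical, and the 32-bit register width is an assumption; on the real core each copy would be a sequence of sat_store and sat_load instructions.

```c
#include <stdint.h>
#include <string.h>

enum { NUM_SAT_REGS = 16 };   /* s0 .. s15, as drawn on the slide */

/* Hypothetical model of the custom register file (width assumed). */
typedef struct {
    int32_t s[NUM_SAT_REGS];
} sat_regfile;

/* Per-task save area in memory. */
typedef struct {
    sat_regfile saved;
} task_context;

/* Save the live registers into the outgoing task's area, then load the
   incoming task's saved registers. */
static void context_switch(sat_regfile *live, task_context *from,
                           const task_context *to)
{
    memcpy(&from->saved, live, sizeof *live);   /* models sat_store x16 */
    memcpy(live, &to->saved, sizeof *live);     /* models sat_load x16  */
}
```

The point of the slide is that the RTOS needs to know about the added register file; a sketch like this shows the extra state that has to travel with each task.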
Software Support – simulator/debugger
    gdb> break …
    gdb> cont
    gdb> step
    gdb> display …
Outline
• Configurable processors
  - Architecture
  - Instruction extension
  - Software support
• An Example
• Results
• Summary
Data Encryption Standard (DES)
Initial step
    (R, L) = Initial_permutation(Din64)
Iterate 16 times
  Key generation
    (C, D) = PC1(k)
    n = rotate_amount (function of iteration count)
    C = rotate_right(C, n)
    D = rotate_right(D, n)
    K = PC2(D, C)
  Encryption
    R_{i+1} = L_i ⊕ Permutation(S_Box(K ⊕ Expansion(R_i)))
    L_{i+1} = R_i
Final step
    Dout64 = Final_permutation(L, R)
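The two encryption equations describe a Feistel round, and the structure can be sketched in C independently of the DES tables. In the sketch below the round function is passed in as a parameter; toy_f is a hypothetical placeholder and is not the DES expansion, S-box, or permutation logic. The names feistel16_encrypt, feistel16_decrypt, and round_fn are also assumptions made for illustration.

```c
#include <stdint.h>

/* Round function type: maps the right half and a round key to 32 bits. */
typedef uint32_t (*round_fn)(uint32_t r, uint64_t k);

/* Sixteen rounds of R_{i+1} = L_i ^ f(R_i, K_i), L_{i+1} = R_i. */
static void feistel16_encrypt(uint32_t *l, uint32_t *r,
                              const uint64_t keys[16], round_fn f)
{
    for (int i = 0; i < 16; i++) {
        uint32_t next_r = *l ^ f(*r, keys[i]);
        *l = *r;
        *r = next_r;
    }
}

/* Undo the rounds from last to first; f itself never needs inverting. */
static void feistel16_decrypt(uint32_t *l, uint32_t *r,
                              const uint64_t keys[16], round_fn f)
{
    for (int i = 15; i >= 0; i--) {
        uint32_t prev_l = *r ^ f(*l, keys[i]);
        *r = *l;
        *l = prev_l;
    }
}

/* Toy round function for demonstration only (NOT the DES round function). */
static uint32_t toy_f(uint32_t r, uint64_t k)
{
    return (r * 2654435761u) ^ (uint32_t)k ^ (uint32_t)(k >> 32);
}
```

Because each round only XORs the left half with a value derived from the right half, the same hardware can serve both directions, which is consistent with the deck using one DES instruction with different immediates for encryption and decryption.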
DES: Software Implementation
    /* Build an n-bit result: output bit ob is copied from input bit
       table[ob] (1-based) of the 64-bit value hi:lo. */
    static unsigned permute(unsigned char *table, int n, unsigned hi, unsigned lo)
    {
        int ib, ob;
        unsigned out = 0;
        for (ob = 0; ob < n; ob++) {
            ib = table[ob] - 1;
            if (ib >= 32) {
                if (hi & (1u << (ib - 32))) out |= 1u << ob;
            } else {
                if (lo & (1u << ib)) out |= 1u << ob;
            }
        }
        return out;
    }
Too much computation!
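A small usage example makes the cost of the software permutation concrete: every output bit needs a table load, a comparison, a shift, a mask, and an OR, whereas a fixed permutation in hardware is only wiring. The sketch below restates the slide's routine as permute_bits so that it compiles on its own; the table reverse8 is hypothetical and is not one of the DES tables, and a 32-bit unsigned is assumed.

```c
#include <assert.h>

/* Restatement of the slide's permute(): output bit ob is copied from
   input bit table[ob] (1-based) of the 64-bit value hi:lo. */
static unsigned permute_bits(const unsigned char *table, int n,
                             unsigned hi, unsigned lo)
{
    unsigned out = 0;
    assert(n >= 0 && n <= 32);          /* the result is one 32-bit word */
    for (int ob = 0; ob < n; ob++) {
        int ib = table[ob] - 1;         /* table entries are 1-based */
        unsigned src = (ib >= 32) ? hi >> (ib - 32) : lo >> ib;
        out |= (src & 1u) << ob;
    }
    return out;
}

/* Hypothetical table, not a DES table: reverse the low 8 bits of lo. */
static const unsigned char reverse8[8] = {8, 7, 6, 5, 4, 3, 2, 1};
```

For example, permute_bits(reverse8, 8, 0, 0x0D) walks all eight table entries to move three set bits, which is the per-bit overhead the slide objects to.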
DES: Hardware Implementation
[Figure: datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two ⊕ stages, and Final Permutation, plus Key Generation and a State Machine.]
Complicated control logic!
DES: SETDATA instruction
    SETDATA ars, art
DES: SETKEY instruction
    SETKEY ars, art
DES: DES instruction
    DES immediate
DES: GETDATA instruction
    GETDATA ars, hilo
DES: Putting it together
[Figure: the DES datapath (Initial Permutation, Expansion Permutation, S Boxes, P Permutation, ⊕ stages, Final Permutation, Key Generation, State Machine) annotated with the four instructions SETKEY ars, art; SETDATA ars, art; DES immediate; and GETDATA ars, hilo.]
DES: Improved Program
Decryption:
    SETKEY(K_hi, K_lo);
    for (;;) {
        … /* read encrypted data */
        SETDATA(D_hi, D_lo);
        DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT1);
        E_hi = GETDATA(hi);
        E_lo = GETDATA(lo);
        … /* write data */
    }
Encryption:
    SETKEY(K_hi, K_lo);
    for (;;) {
        … /* read data */
        SETDATA(D_hi, D_lo);
        DES(ENCRYPT1); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1);
        E_hi = GETDATA(hi);
        E_lo = GETDATA(lo);
        … /* write encrypted data */
    }
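The sixteen DES immediates in each loop follow a fixed pattern. The slides do not define ENCRYPT1, ENCRYPT2, DECRYPT1, or DECRYPT2, so the C sketch below rests on an assumption: that the 1 or 2 suffix selects a key rotation of one or two positions for that round. Under that reading, the decryption sequence is the encryption sequence reversed, and the rotations total 28, the width of each DES key half. The names encrypt_steps, reverse_steps, and total_rotation are hypothetical.

```c
enum { DES_ROUNDS = 16 };

/* Suffixes read off the slide's encryption sequence, in issue order
   (interpreting them as rotate amounts is an assumption). */
static const int encrypt_steps[DES_ROUNDS] =
    {1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1};

/* Fill out[] with the reverse of in[]; this reproduces the order of the
   DECRYPT suffixes on the slide. */
static void reverse_steps(const int in[DES_ROUNDS], int out[DES_ROUNDS])
{
    for (int i = 0; i < DES_ROUNDS; i++)
        out[i] = in[DES_ROUNDS - 1 - i];
}

/* Sum of all per-round amounts. */
static int total_rotation(const int steps[DES_ROUNDS])
{
    int sum = 0;
    for (int i = 0; i < DES_ROUNDS; i++)
        sum += steps[i];
    return sum;
}
```

A table like this could also drive the sixteen DES calls from a loop if the immediate were allowed to vary, but the slide's unrolled form keeps each immediate a compile-time constant.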
DES: Summary
• Add 4 TIE instructions:
  - 80 lines of TIE description
  - No cycle time impact
  - ~1700 additional gates
  - Code size reduced
[Figure: DES Performance. Speedup (X), axis 0 to 80, versus block size in bytes (1024, 64, 8) and the mean. Bar labels as extracted: 43, 50, 53, 72.]
Outline
• Configurable processors
  - Architecture
  - Instruction extension
  - Software support
• An Example
• Results
• Summary
Improvement over general purpose 32b RISC
[Figure: bar chart of improvement in MIPS or MIPS/Watt, axis marked 1x, 2x, 4x, 6x, 8x, 10x, 55x.]
Applications: JPEG (image compression), Motion Estimation (video conferencing), FIR filter (signal processing), Viterbi Decoding (wireless communication), DES (content encryption)
Bar labels: Base + 7500 gates, Base + 6500 gates, Base + 900 gates, Base + 1000 gates, Base + 1700 gates
What is “EEMBC”?
• EDN Embedded Microprocessor Benchmark Consortium
• Pronounced “Embassy”
• Non-profit consortium, funded by over 40 members
  - Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba… Tensilica, and more…
• Objective: provide independently certified benchmark scores relevant to deeply embedded processor applications
  - Independent laboratory recreates and certifies all benchmark results - no tricks
• Five different benchmark suites
• Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
  - Example: Consumer: image compression, image filtering, color conversion
EEMBC Networking Benchmark
[Figure: two bar charts, Netmark Performance (axis 0 to 14) and Netmark Efficiency in Netmark/MHz (axis 0.000 to 0.045). Colors: blue Xtensa, green desktop x86s, maroon 64b RISCs, orange 32b RISCs.]
Processors charted (name/MHz): IDT 32334/100, IDT79RC32364/100, NEC V832-143, AMD ElanSC520/133, Toshiba TMPR3927F-GH189/133, IDT79RC32V334-150, Toshiba TMPR3927F-GHM2000/133, NEC VR5432-167, Xtensa/200, IDT79RC64575IDtc/250, NEC VR5000, IDT79RC64575Algor/250, AMD K6-2/450, AMD K6-2E/400, Xtensa Optimized/200, AMD K6-2E+/500, AMD K6-IIIE+/550
• Comparable in Netmark to high-end desktop CPUs
• 2x in Netmark/MHz
• 59K total gates at 200 MHz
EEMBC Telecom Benchmark
[Figure: two bar charts, Telemark Performance (axis 0 to 90) and Telemark Efficiency in Telemark/MHz (axis 0.000 to 0.450). Colors: blue Xtensa, green desktop x86s, maroon 64b RISCs, orange 32b RISCs, gray DSPs.]
Processors charted (name/MHz): AMD ElanSC520/133, IDT 32334-100, Analog Devices 21065L/60, NEC V832-143, IDT79RC32V334-150, Xtensa/200, NEC VR5432-167, IDT79RC64575Algor/250, NEC VR5000, AMD K6-2E/400, TI TMS320C6203/300, AMD K6-2E+/500, AMD K6-III+/550, IBM PowerPC750CX/500, TI TMS320C6203 C opt/300, TI TMS320C6203 Optimized/300, Xtensa Optimized/200
• Beats all processors, including hand-optimized TI C6x
• 180K total gates at 200 MHz
EEMBC Consumer Benchmark
[Figure: two bar charts, Consumermark Performance (axis 0 to 200) and Consumermark Efficiency in Consumermark/MHz (axis 0.00 to 1.00). Colors: blue Xtensa, green desktop x86s, maroon 64b RISCs, orange 32b RISCs.]
Processors charted (name/MHz): ST20C2/50, AMD ElanSC520/133, NEC V832/143, National Geode GX1/200, NEC VR5432/167, Xtensa/200, NEC VR5000/250, AMD K6-2E/400, AMD K6-2E+/500, AMD K6-III+/550, Xtensa Optimized/200
• 6x in Consumermark and 12x in Consumermark/MHz
• 127K total gates at 200 MHz
Summary
[Figure: the landscape chart again, optimality/integration (e.g. mW, $) versus flexibility/modularity (e.g. time-to-market), with ASIC, FPGA, FPGA + processor, traditional (general) processor, and instruction-set configurable processor, and two gaps marked ∆ ~10x.]
• Benefit of SoC integration
  - Higher bandwidth
  - Lower cost
  - Lower power
• Benefit of IS configuration
  - A cost-effective computing platform
• Benefit of TIE compiler and SW tools
  - Faster time-to-market
  - Lower development cost
  - Lower risk
Thank You!