Accelerator for Energy Efficient Executions · cpc.ait.kyushu-u.ac.jp/~koji.inoue/paper/2009/... ·...
ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions
Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami†
†Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
‡School of Electrical and Computer Engineering, University of Tehran
Outline
• Introduction
• ADEXOR: Adaptive Extensible Processor
– Overview
– Microarchitecture
– Coarse-grained Reconfigurable Functional Unit
• Evaluation
• Conclusions
Motivation and Solution
• Embedded processors have to achieve
– Low cost
– High-performance
– Low-power or low-energy consumption
• Key point
– How can processors adapt to target applications?
• Solution: ASIP w/ Re-configurability
– Application specific ISA
  • Provide custom instructions (CIs)
– Implement re-configurable FUs
ADaptive EXtensible processOR (ADEXOR)
• Has a coarse-grained re-configurable functional unit
• Supports efficient "Multi-Exits CIs"
• Achieves high-performance and low energy
[Figure: ADEXOR block diagram. Blocks shown: Register File, ID/EXE Reg, ALU, Memory, EXE/MEM Reg, CRFU, Configuration Memory (indexed by mtc1 or sequencer), MUX, and Counter (triggered by mtc1 or sequencer), grouped under the labels "GPP" and "Augmented HW".]
Code listing shown in the figure (labeled "Hot Basic Block"):
  400680 subiu $25,$25,1
  400688 lbu $13,0($7)
  400690 lbu $2,0($4)
  400698 sll $2,$2,0x18
  4006a0 sra $14,$2,0x18
  4006a8 addiu $4,$4,1
  4006b0 srl $8,$2,0x1c
  4006b8 sll $2,$8,0x2
  4006c0 addu $2,$2,$25
  4006c8 bgez $10,4006f0
  4006d0 xori $13,$13,1
  4006d8 addu $10,$10,$2
  4006e0 bgez $10,4006f0
  ....
GPP: General Purpose Processor
CRFU: Coarse-grained Reconfigurable Functional Unit
CRFU Microarchitecture
• 16 FUs controlled by configuration bits
• MUX-based interconnection between FUs
• Early stage data can be transferred to output ports
[Figure: CRFU array of FUs arranged in rows (Row 1 to Row 5). Labels shown for an FU: Adder/subtractor, AND, OR, XOR, Barrel Shifter, each selected by configuration bits.]
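The bullets above (FUs selected by configuration bits, MUX-based interconnection, early results routed to outputs) can be sketched in software. This is only an illustrative model, not the authors' design: the function `run_crfu`, the tuple encoding of a configuration, the 32-bit word width, and the use of immediates as operand sources are all assumptions made for the example.

```python
from typing import List, Tuple

MASK = 0xFFFFFFFF  # assumed 32-bit datapath


def _sra(a: int, b: int) -> int:
    """Arithmetic right shift on a 32-bit word."""
    a &= MASK
    if a & 0x80000000:
        a -= 1 << 32
    return (a >> (b & 31)) & MASK


# Operations taken from the FU labels on the slide (adder/subtractor,
# AND/OR/XOR, barrel shifter); the string names are hypothetical.
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "sll": lambda a, b: (a << (b & 31)) & MASK,
    "srl": lambda a, b: (a & MASK) >> (b & 31),
    "sra": _sra,
}

# A MUX selection: ("in", port), ("fu", earlier FU index), or ("imm", value).
Source = Tuple[str, int]
FuConfig = Tuple[str, Source, Source]


def run_crfu(config: List[FuConfig], outputs: List[int],
             inputs: List[int]) -> List[int]:
    """Evaluate the configured FUs in order and return the selected results."""
    if len(config) > 16:
        raise ValueError("the CRFU on the slide has 16 FUs")
    results: List[int] = []

    def pick(src: Source) -> int:
        kind, n = src
        if kind == "in":
            return inputs[n] & MASK
        if kind == "imm":
            return n & MASK
        if kind == "fu":
            if n >= len(results):
                raise ValueError("an FU may only read earlier FUs")
            return results[n]
        raise ValueError(f"unknown source kind: {kind}")

    for op, src_a, src_b in config:
        results.append(OPS[op](pick(src_a), pick(src_b)))
    # Any FU result, including an early-stage one, can be an output.
    return [results[i] for i in outputs]


# Usage: map "sll x,24" followed by "sra x,24" onto two FUs.
sign_extend_byte = [
    ("sll", ("in", 0), ("imm", 24)),
    ("sra", ("fu", 0), ("imm", 24)),
]
print(hex(run_crfu(sign_extend_byte, [1], [0x80])[0]))
```

Selecting output index 0 instead of 1 would return the intermediate shift result, which mirrors the "early stage data" bullet.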
Supporting Multi-Exits Custom Instructions (MECIs)
Multiple-Exits Custom Instruction: Conditional Execution + Hot-Path Selection
[Figure: data-flow graph for adpcm with two exits marked "Exit". #Required nodes: 16]
Assume that at most 16 nodes can be included in one CI.
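One way to picture a multi-exit custom instruction is as a chain of nodes along the hot path, where each embedded branch either stays on the hot path or leaves the CI early. The sketch below is a sequential software model under that reading, not the hardware mechanism from the slides; the classes `Op` and `Branch`, the function `run_meci`, and the example addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Regs = Dict[str, int]


@dataclass
class Op:
    apply: Callable[[Regs], None]  # updates the registers in place


@dataclass
class Branch:
    taken: Callable[[Regs], bool]  # evaluates the branch condition
    hot_taken: bool                # branch direction on the hot path
    cold_target: int               # PC to resume at when leaving the hot path


Node = Union[Op, Branch]


def run_meci(nodes: List[Node], regs: Regs, final_pc: int,
             max_nodes: int = 16) -> int:
    """Run the nodes in order and return the PC at which the CI exits."""
    if len(nodes) > max_nodes:
        raise ValueError("too many nodes for one CI")
    for node in nodes:
        if isinstance(node, Branch):
            if node.taken(regs) != node.hot_taken:
                return node.cold_target  # early exit off the hot path
        else:
            node.apply(regs)
    return final_pc  # exit at the end of the hot path


def _inc_r1(r: Regs) -> None:
    r["r1"] += 1


def _flip_r2(r: Regs) -> None:
    r["r2"] ^= 1


# Usage: the hot path assumes the branch is NOT taken.
example_nodes: List[Node] = [
    Op(_inc_r1),
    Branch(taken=lambda r: r["r1"] >= 0, hot_taken=False, cold_target=0x4006F0),
    Op(_flip_r2),
]

regs = {"r1": -5, "r2": 0}
print(hex(run_meci(example_nodes, regs, final_pc=0x4006E8)), regs)
```

With `r1 = -5` the branch condition is false, execution stays on the hot path, and the CI leaves through the final exit; with a non-negative `r1` it leaves through the early exit and the later node never runs.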
Experimental Setup (1/2)
Base Processor Configuration
  Issue                         1-way
  L1-Instruction Cache          32K, 4 way, 1 cycle latency, miss penalty 20 cycles
  L1-Data Cache                 16K, 4 way, 1 cycle latency, miss penalty 20 cycles
  ALUs                          1 integer unit, 1 floating point unit
  Multiplier                    1 Integer (5 cycles)
  Divider                       1 Integer (8 cycles)
  Branch predictor              bimodal
  Branch prediction table size  256
  Extra branch misprediction    3
  Register File                 4-read ports, 2-write ports
  Clock Frequency               135 MHz
Experimental Setup (2/2)
[Figure: extended pipeline datapath. Blocks shown: register file (Reg0 ... Reg31), decode stage, DEC/EXE Pipeline Registers, CRFU Input Regs (with enable), ALU, MUL/DIV, CRFU, Config Memory, Counter (triggered by mtc1 or sequencer), EXE/MEM Pipeline Registers, Result bus.]
arch1: (4-read/2-write)
• Clock freq: 135MHz
• RF read/write access
– Input: 5, 6, 7, or 8: +1 extra cycle
– Output: 3 or 4: +1 extra cycle
– Output: 5 or 6: +2 extra cycles
• CRFU execution
– arch-1-var: variable (1 or 2 cycles)
– arch-1-fix: 2 cycles
arch2: (8-read/4-write)
• Clock freq: 130MHz
• RF read/write access
– Input: no extra cycle
– Output: 5 or 6: +1 extra cycle
• CRFU execution
– arch-2-var: variable (1 or 2 cycles)
– arch-2-fix: 2 cycles
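The register-file access penalties listed for arch1 and arch2 can be written as a small lookup. The thresholds are transcribed from the slide; the function name, the limits of 8 inputs and 6 outputs, and the choice to add the read and write penalties together are assumptions made for this sketch.

```python
def extra_rf_cycles(arch: str, n_inputs: int, n_outputs: int) -> int:
    """Extra cycles a custom instruction spends on register-file access."""
    # Assumption: the slide lists no case beyond 8 inputs or 6 outputs.
    if not (0 <= n_inputs <= 8 and 0 <= n_outputs <= 6):
        raise ValueError("expected at most 8 inputs and 6 outputs")

    if arch == "arch1":  # 4 read ports, 2 write ports
        read_extra = 1 if n_inputs >= 5 else 0
        if n_outputs >= 5:
            write_extra = 2
        elif n_outputs >= 3:
            write_extra = 1
        else:
            write_extra = 0
    elif arch == "arch2":  # 8 read ports, 4 write ports
        read_extra = 0
        write_extra = 1 if n_outputs >= 5 else 0
    else:
        raise ValueError(f"unknown architecture: {arch}")

    # Assumption: read and write penalties simply add up.
    return read_extra + write_extra


# Usage: a CI with 8 inputs and 6 outputs on each register-file option.
for arch in ("arch1", "arch2"):
    print(arch, extra_rf_cycles(arch, n_inputs=8, n_outputs=6))
```

The numbers follow the port counts: four read ports deliver at most four operands per cycle, so a fifth input costs a second read cycle, while the eight-port file never needs one.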
Performance Evaluation
[Figure: bar chart of speedup (y-axis, 1 to 5) per benchmark. Series in the legend: arch1-var, arch2-fix, arch2-var. Benchmarks: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, stringsearch, sha, adpcm, crc, fft, gsm, plus avg-seq and avg-mtc1.]
Energy Consumption
Pros.
• Low activity of hardware components
– I-Cache, Bpred
– Decoder
– Register File
– Functional Unit
• Higher I-Cache hit rates
– Reduce the energy for off-chip accesses
Cons.
• RFU configuration
– Accessing the config. Memory
– Setting control signals in the RFU
• Increased complexity
– Communication between the processor's data-path and the RFU
Total Energy Reduction
[Figure: bar chart of total energy reduction in % (y-axis, 0 to 80) per benchmark. Series in the legend: clk-gating-arch2-var, arch2-var, arch2-fix, arch1-var. Benchmarks: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, stringsearch, sha, adpcm, crc, fft, gsm, plus avg-seq and avg-mtc1.]
Temperature Analysis
[Figure: temperature in ℃ (y-axis, 45 to 48) for clock frequencies of 130MHz, 260MHz, 390MHz, 520MHz, and 650MHz, shown next to the CRFU floor plan (1.7 x 1.7 [mm2]) with its 16 FUs.]
Conclusions
• ADEXOR: Adaptive Extensible Processor
– Has a coarse-grain reconfigurable functional unit
– Supports multi-exit custom instructions
• Performance / Energy Analysis
– 5X speed up (best case)
– 60% energy reduction (best case)
• Future Work
– Extend for 3D-IC Implementation
Acknowledgement
• This research was supported in part by
– New Energy and Industrial Technology Development Organization
– The chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Hitachi Ltd. and Dai Nippon Printing Corporation.