Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
University of Michigan, Electrical Engineering and Computer Science
Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
Manjunath Kudlur, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
Automated C to Gates Solution
• SoC design
– 10-100 Gops, 200 mW power budget
– Low level tools ineffective
• Automated accelerator synthesis for whole application
– Correct by construction
– Increase designer productivity
– Faster time to market
[Figure: app.c synthesized into a set of loop accelerators (LA)]
Streaming Applications
• Data “streaming” through kernels
• Kernels are tight loops
– FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
– Sub-blocks of images, network packets
[Figure: H.264 Encoder, Image to Coded Image; blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor]
[Figure: W-CDMA Transmitter, Data in to Data out; blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]
Software Overview
[Figure: tool flow with stages labeled Whole Application, Frontend Analyses, Loop Graph (nodes 1-4), System Level Synthesis, Accelerator Pipeline, SRAM Buffers]
Input Specification
• Sequential C program
• Kernel specification
– Perfectly nested FOR loop
– Wrapped inside C function
– All data access made explicit

    /* Kernel: a perfectly nested loop wrapped in a function.
       The ". . ." parts are elided on the slide. */
    void row_trans(char inp[8][8], char out[8][8]) {
        int i, j;
        for (i = 0; i < 8; i++) {
            for (j = 0; j < 8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }

    void col_trans(char inp[8][8], char out[8][8]);
    void zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
– Function with main input/output
– Local arrays to pass data
– Sequence of calls to kernels

    /* System: kernels called in sequence, linked by local arrays. */
    void dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]
Performance Specification
• High performance DCT
– Process one 1024x768 image every 2 ms
– Given 400 MHz clock
    • One image every 800000 cycles
    • One block every 64 cycles
• Low performance DCT
– Process one 1024x768 image every 4 ms
– One block every 128 cycles
• Performance goal: task throughput, in number of cycles between tasks
[Figure: input image (1024 x 768) divided into 8 x 8 blocks; each block is one task passing through inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out to produce the output coeffs]
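The cycle budgets on this slide follow from the clock rate, the per-image deadline, and the number of 8 x 8 blocks in an image. A minimal sketch of that arithmetic is given here; the function names are hypothetical and not part of Streamroller.

```c
#include <stdint.h>

/* Cycles available per image: clock frequency (Hz) times the per-image
 * deadline (microseconds), kept in integer arithmetic. */
uint64_t cycles_per_image(uint64_t clock_hz, uint64_t deadline_us)
{
    return clock_hz * deadline_us / 1000000u;
}

/* Number of block_dim x block_dim blocks (tasks) in one image. */
uint64_t blocks_per_image(uint64_t width, uint64_t height, uint64_t block_dim)
{
    return (width / block_dim) * (height / block_dim);
}

/* Cycle budget between consecutive tasks, rounded down so that the
 * per-image deadline is never exceeded. */
uint64_t cycles_per_block(uint64_t clock_hz, uint64_t deadline_us,
                          uint64_t width, uint64_t height, uint64_t block_dim)
{
    return cycles_per_image(clock_hz, deadline_us)
         / blocks_per_image(width, height, block_dim);
}
```

For a 400 MHz clock and a 1024 x 768 image, this gives 800000 cycles per image and 12288 blocks, so about 65 cycles per block at 2 ms and about 130 at 4 ms. The slide's targets of 64 and 128 are slightly tighter than these bounds, presumably rounded down.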
Building Blocks
[Figure: building blocks: Kernels 1-4; Multifunction Loop Accelerator [CODES/ISSS ’06]; SRAM buffers tmp1, tmp2, tmp3]
System Schema Overview
[Figure: Kernels 1-5 mapped onto LA 1, LA 2, LA 3; timeline of successive tasks (Kernel 1, K2, K3, Kernel 4, Kernel 5) overlapped in time, with the spacing between tasks labeled as the task throughput]
Cost Components
• Cost of loop accelerator data path
– Cost of FUs, shift registers, muxes, interconnect
• Initiation interval (II)
– Key parameter that decides LA cost
    • Low II → high performance → high cost
– Loop execution time ≈ (trip count) x II
– Appropriate II chosen to satisfy task throughput
[Figure: high performance: K1, K2, K3 with TC=100 and II=1; Tasks 1-3 overlapped, time marks at 100, 200, 300; throughput = 1 task/100 cycles]
[Figure: low performance: K1, K2, K3 with TC=100 and II=2; Tasks 1-2 overlapped, time marks at 200, 400, 600; throughput = 1 task/200 cycles]
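The relation "loop execution time ≈ (trip count) x II" can be turned around to find the largest, and therefore cheapest, II that still meets a cycle budget. The sketch below assumes each loop has an LA to itself and ignores pipeline fill and drain; the function names are hypothetical.

```c
/* Approximate loop execution time in cycles: trip count times II. */
unsigned loop_cycles(unsigned trip_count, unsigned ii)
{
    return trip_count * ii;
}

/* Largest II such that trip_count * II <= budget_cycles.
 * Returns 0 when even II = 1 does not fit (or trip_count is 0). */
unsigned max_ii_for_budget(unsigned trip_count, unsigned budget_cycles)
{
    if (trip_count == 0) {
        return 0;
    }
    return budget_cycles / trip_count;
}
```

With TC=100, a budget of 100 cycles between tasks forces II=1, while a budget of 200 cycles allows II=2, matching the two design points in the figures.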
Cost Components (Contd.)
• Grouping of loops into a multifunction LA
– More loops in a single LA → LA occupied for longer time in current task
[Figure: kernels K1-K4, each with TC=100, mapped onto LA 1, LA 2, LA 3; LA 1 also runs K4, so LA 1 is occupied for 200 cycles; time marks at 100, 200, 300, 400; throughput = 1 task / 200 cycles]
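The effect of grouping can be sketched by summing the execution time of every loop assigned to the same LA: the busiest LA then bounds the spacing between tasks. This is a simplified model, and the `Loop` struct and function names are hypothetical.

```c
#include <stddef.h>

/* One loop (kernel): trip count, chosen II, and the LA it is mapped to. */
typedef struct {
    unsigned trip_count;
    unsigned ii;
    unsigned la;
} Loop;

/* Cycles that LA `la` is busy within one task: sum of TC * II over
 * all loops mapped to it. */
unsigned la_occupancy(const Loop *loops, size_t n, unsigned la)
{
    unsigned busy = 0;
    for (size_t i = 0; i < n; i++) {
        if (loops[i].la == la) {
            busy += loops[i].trip_count * loops[i].ii;
        }
    }
    return busy;
}

/* Minimum cycles between tasks: the occupancy of the busiest LA. */
unsigned task_interval(const Loop *loops, size_t n, unsigned num_las)
{
    unsigned worst = 0;
    for (unsigned la = 0; la < num_las; la++) {
        unsigned busy = la_occupancy(loops, n, la);
        if (busy > worst) {
            worst = busy;
        }
    }
    return worst;
}
```

With four TC=100, II=1 loops where two of them share the first LA, `la_occupancy` reports 200 cycles for that LA and `task_interval` reports 200, as in the figure.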
Cost Components (Contd.)
• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (TC=100, II=1) on LA 1, LA 2, LA 3 with intermediate buffers tmp1 and tmp2. Upper timeline (marks at 100, 200, 300): tmp1 buffer in use by LA 2. Lower timeline (marks at 100, 200, 300): adjacent tasks use different buffers.]
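One way to read "more buffers → more task overlap" is as a producer/consumer pair sharing an intermediate array. The sketch below is a simplified model and an assumption, not the deck's exact timing rule: with one buffer the producer must wait for the consumer to finish, and with two or more the stages overlap and adjacent tasks alternate buffers. All names are hypothetical.

```c
/* Steady-state cycles between tasks for one producer/consumer pair
 * sharing an intermediate array, under the simplified model above.
 * Returns 0 for the invalid case of zero buffers. */
unsigned pair_interval(unsigned producer_cycles, unsigned consumer_cycles,
                       unsigned num_buffers)
{
    if (num_buffers == 0) {
        return 0;
    }
    if (num_buffers == 1) {
        /* Single buffer: the two stages are serialized. */
        return producer_cycles + consumer_cycles;
    }
    /* Two or more buffers: the slower stage sets the pace. */
    return producer_cycles > consumer_cycles ? producer_cycles
                                             : consumer_cycles;
}

/* Buffer copy used by a given task when buffers are used round-robin,
 * so adjacent tasks land in different copies. */
unsigned buffer_for_task(unsigned task, unsigned num_buffers)
{
    return task % num_buffers;
}
```

For two TC=100, II=1 loops, this model gives 200 cycles between tasks with a single copy of tmp1 and 100 cycles with two copies.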
ILP Formulation
• Variables
– II for each loop
– Which loops are combined into single LA
– Number of buffers for temp array
• Objective function
– Cost of LAs + cost of buffers
• Constraints
– Overall task throughput should be achieved
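The three bullet groups above can be written out as a rough optimization sketch. The notation is assumed for illustration and is not the authors' exact formulation: $x_{k,a}$ is a 0/1 variable assigning loop $k$ to accelerator $a$, $II_k$ is the loop's initiation interval, $b_t$ is the number of buffers for temp array $t$, $TC_k$ is the trip count, and $T$ is the required number of cycles between tasks.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{a} C_{LA}(a) \;+\; \sum_{t} b_t \, C_{buf}(t) \\
\text{subject to}\quad & \sum_{a} x_{k,a} = 1 \quad \text{for every loop } k, \\
& \sum_{k} x_{k,a} \, TC_k \, II_k \;\le\; T \quad \text{for every accelerator } a, \\
& x_{k,a} \in \{0,1\}, \qquad II_k \ge 1, \qquad b_t \ge 1 .
\end{aligned}
```

The product $x_{k,a} \, II_k$ is not linear as written, so a real ILP would need to linearize it, and the constraints linking $b_t$ to task overlap are omitted here.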
Non-linear LA Cost
[Figure: relative cost (0 to 1.0) versus initiation interval (1 to 20), with IImin and IImax marked on the II axis]
II = 1*II_1 + 2*II_2 + 3*II_3 + . . . + 14*II_14, with 0 ≤ II_i ≤ 1
Cost(II) = C_1*II_1 + C_2*II_2 + C_3*II_3 + . . . + C_14*II_14
IImin ≤ II ≤ IImax
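The two sums above make a non-linear cost curve usable in a linear program: each candidate II gets its own selector variable, and both the II and its cost become linear in the selectors. The sketch below assumes the selectors are 0/1 with exactly one set, which the slide does not state explicitly. The function name and any cost values passed in are hypothetical.

```c
#include <stddef.h>

/* Decode the chosen II and its cost from 0/1 selectors sel[0..n-1],
 * where sel[i] == 1 means II = i + 1 and cost_table[i] is the cost
 * at that II.
 * Returns 0 on success, -1 if the selectors are not one-hot. */
int decode_ii_and_cost(const int *sel, const double *cost_table, size_t n,
                       unsigned *ii, double *cost)
{
    unsigned chosen = 0;
    unsigned ii_sum = 0;
    double cost_sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (sel[i] != 0 && sel[i] != 1) {
            return -1;
        }
        chosen += (unsigned)sel[i];
        ii_sum += (unsigned)(i + 1) * (unsigned)sel[i]; /* II = sum i * II_i   */
        cost_sum += cost_table[i] * sel[i];             /* Cost = sum C_i * II_i */
    }
    if (chosen != 1) {
        return -1;
    }
    *ii = ii_sum;
    *cost = cost_sum;
    return 0;
}
```

Because only one selector is set, the sums simply pick one entry out of the cost table, so the table can follow any curve shape without making the program non-linear.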
Multifunction Accelerator Cost
[Figure: three ways of combining LA 1-4 into one multifunction accelerator]
– Worst case: no sharing, cost = Sum
– Realistic case: some sharing, cost = between Sum and Max
– Best case: full sharing, cost = Max
• Impractical to obtain accurate cost of all combinations
• C_LA = 0.5 * (SUM_CLA + MAX_CLA)
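The estimate above is the midpoint between the no-sharing bound (sum of the individual LA costs) and the full-sharing bound (the largest individual cost). A minimal sketch follows; the function name is hypothetical and costs are assumed to be non-negative.

```c
#include <stddef.h>

/* Estimated cost of a multifunction LA built from n single-loop LAs:
 * halfway between the sum and the max of the individual costs. */
double multifunction_cost(const double *la_costs, size_t n)
{
    double sum = 0.0;
    double max = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += la_costs[i];
        if (la_costs[i] > max) {
            max = la_costs[i];
        }
    }
    return 0.5 * (sum + max);
}
```

For individual costs of 1, 2 and 3, the sum is 6 and the max is 3, so the estimate is 4.5. For a single loop, sum and max coincide and the estimate equals that loop's own cost.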
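Case Study: “Simple” benchmark
[Figure: loop graph with TC=256 per loop, and three design points labeled with per-loop IIs:
– four LAs (LA 1 to LA 4), IIs 1, 1, 1, 1, 1, 1, 1, 1: 512 cycles
– two LAs (LA 1, LA 2), IIs 1, 1, 2, 1, 1, 1, 3, 3: 1792 cycles and 1536 cycles
– one LA (LA 1), IIs 1, 1, 1, 1, 1, 1, 1, 1: 2048 cycles]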
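The cycle counts in this case study are consistent with "LA occupancy = sum of TC x II over the loops it runs". The check below assumes eight loops with TC=256 each and two loops per LA in the four-LA design; the exact loop-to-LA mapping is not recoverable from the transcript, so the split of IIs into groups summing to 7 and 6 is an inference.

```latex
\begin{aligned}
\text{four LAs, all } II = 1: &\quad 2 \times 256 \times 1 = 512 \text{ cycles per LA} \\
\text{one LA, all } II = 1: &\quad 8 \times 256 \times 1 = 2048 \text{ cycles} \\
\text{two LAs, } \textstyle\sum II = 1+1+2+1+1+1+3+3 = 13: &\quad 13 \times 256 = 3328 = \underbrace{7 \times 256}_{1792} + \underbrace{6 \times 256}_{1536}
\end{aligned}
```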
Beamformer
• 10 loops
• Memory cost: 60% to 70%
• Up to 20% cost savings due to hardware sharing in multifunction accelerators
• Systems at lower throughput have over-designed LAs
– Not profitable to pick a lower performance LA
• Memory buffer cost significant
– High performance producer consumer better than more buffers
Conclusions
• Automated design realistic for system of loops
• Designers can move up the abstraction hierarchy
• Observations
– Macro level hardware sharing can achieve significant cost savings
– Memory cost is significant; need to simultaneously optimize for datapath and memory cost
• ILP formulation tractable
– Solver took less than 1 minute for systems with 30 loops