OPTIMIZING FPGA-BASED ACCELERATOR DESIGN FOR DEEPCONVOLUTIONAL NEURAL NETWORKS
Surabhi Singh (ss3659), Mohammad Dohadwala (md874), Godly K Thomas (gkt27)
YOLO: REAL-TIME OBJECT DETECTION SYSTEM
• YOLO..?? You Only Live Once?
• YOU ONLY LOOK ONCE – a state-of-the-art real-time object detection system
• Can process up to 200 frames per second
HOW IT WORKS
• YOLO does the following:
• Applies a single neural network to the resized full image (448×448)
• Divides the image into regions and predicts probabilities for each region
• Weights the bounding box for each object by the predicted probabilities
Image Courtesy: http://pjreddie.com/darknet/yolo/
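The scoring step above can be sketched numerically. This is a hedged illustration based on the public YOLO description, not the project's code; the `box_class_score` helper and the example probabilities are assumptions.

```c
/* Class-specific confidence for one predicted box, YOLO-style:
   P(class | object) predicted by the grid cell, multiplied by the
   box's objectness confidence.  Detections are then thresholded
   (and non-max-suppressed) on this score. */
float box_class_score(float class_prob, float box_confidence) {
    return class_prob * box_confidence;
}
```

For a box that is 90% confident it contains an object and 80% confident that object is, say, a dog, the class-specific score is 0.8 × 0.9 = 0.72.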
NETWORK DESIGN
• 24 convolutional layers followed by two fully connected layers
• Each convolutional layer has input and output feature maps
Image Courtesy: See Reference[1]
WHERE IS HLDDA..??
• Optimize a single convolutional layer's execution in hardware (FPGA) while executing the other convolutional layers in software (ARM)
• Use pragmas learnt in class to increase throughput:
• Loop unrolling, loop pipelining, and hardware tiling for the function executed in hardware
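As a sketch of the hardware tiling idea (the tile sizes and loop structure here are illustrative assumptions, not the project's actual code): the output feature map is swept tile by tile, so that each tile's working set can live in on-chip RAM.

```c
#define R 8   /* toy feature-map rows (real layers use e.g. 448) */
#define C 8   /* toy feature-map cols */
#define TR 4  /* tile height, chosen to fit on-chip BRAM */
#define TC 4  /* tile width */

/* Visit every output pixel tile by tile; in hardware, each tile's
   inputs would be copied into local RAM before the inner loops run.
   Returns the number of pixels visited and marks each one. */
int tiled_visit(int visited[R][C]) {
    int count = 0;
    for (int rr = 0; rr < R; rr += TR)          /* tile row */
        for (int cc = 0; cc < C; cc += TC)      /* tile col */
            for (int r = rr; r < rr + TR && r < R; r++)
                for (int c = cc; c < cc + TC && c < C; c++) {
                    visited[r][c]++;
                    count++;
                }
    return count;
}
```

Tiling does not change which pixels are computed, only the order: an 8×8 map swept with 4×4 tiles still touches each output exactly once.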
SINGLE CONVOLUTIONAL LAYER
[Figure: a single convolutional layer — weight kernels (w11, w12, w13, w14, …) are applied across the input feature maps to produce the output feature maps. Labels: Weights, Input Feature Map, Output Feature Map]
KERNEL COMPUTATION
• Pseudo code for the kernel (weight) multiplication
Image Courtesy: See Reference[3]
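The pseudo-code image is not reproduced in this transcript; a minimal runnable sketch of the standard convolution loop nest it would have shown, using toy dimensions in place of the real layer sizes from the table on the next slide:

```c
#include <string.h>

#define N 2  /* input feature maps (toy; layer 1 uses N = 3) */
#define M 2  /* output feature maps */
#define K 3  /* kernel size */
#define S 1  /* stride */
#define R 4  /* output rows (toy; layer 1 uses 448) */
#define C 4  /* output cols */

/* out[m][r][c] += w[m][n][i][j] * in[n][S*r+i][S*c+j]
   over all input maps n and kernel taps (i, j). */
void conv_layer(float out[M][R][C],
                float in[N][R * S + K - 1][C * S + K - 1],
                float w[M][N][K][K]) {
    memset(out, 0, M * R * C * sizeof(float));
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            for (int r = 0; r < R; r++)
                for (int c = 0; c < C; c++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++)
                            out[m][r][c] += w[m][n][i][j] * in[n][S * r + i][S * c + j];
}
```

With all inputs and weights set to 1, every output equals N × K × K, which makes the loop nest easy to sanity-check.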
IMPLEMENTATION: FEATURE MAP CALCULATION
| Convolutional Layer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| input_fm (N) | 3 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| output_fm (M) | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 256 |
| fm row (R) | 448 | 224 | 112 | 56 | 28 | 14 | 7 | 7 |
| fm col (C) | 448 | 224 | 112 | 56 | 28 | 14 | 7 | 7 |
| kernel (K) | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| stride (S) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| set # | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Weights (MBytes) | 0.0033 | 0.0352 | 0.1406 | 0.5625 | 2.25 | 9 | 36 | 18 |
| Input feature map (MBytes) | 41.3438 | 55.125 | 27.5625 | 13.7813 | 6.8906 | 3.4453 | 1.7227 | 3.4453 |
| Output feature map (MBytes) | 24.5 | 12.25 | 6.125 | 3.0625 | 1.5313 | 0.7656 | 0.3828 | 0.0957 |
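The weight and output-feature-map rows of the table follow directly from the layer parameters, assuming 8-byte values (this byte width is inferred from the numbers, not stated on the slide). The input-feature-map row does not follow the same simple formula, likely due to padding or intermediate buffering, so only the first two are reproduced here:

```c
/* Footprints in MBytes, assuming 8 bytes per value (inferred from
   the table): weights = M*N*K*K*8 / 2^20, output fm = M*R*C*8 / 2^20. */
double weight_mbytes(int M, int N, int K) {
    return (double)M * N * K * K * 8 / (1 << 20);
}

double output_fm_mbytes(int M, int R, int C) {
    return (double)M * R * C * 8 / (1 << 20);
}
```

For layer 1 (M = 16, N = 3, K = 3, R = C = 448), `weight_mbytes(16, 3, 3)` gives 0.0033 MB and `output_fm_mbytes(16, 448, 448)` gives 24.5 MB, matching the table.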
IMPLEMENTATION: OPTIMIZATION
• Hardware optimization: achieve II = 1. Any ideas..??
• Communication overhead: reduce latency
CODE OPTIMIZATION
• Code Snippet
Image Courtesy: See Reference[3]
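The code-snippet image is not reproduced in this transcript; a sketch of what a pipelined, unrolled dot product with a separated multiply-accumulate might look like. Toy sizes are assumed, and the pragmas are Vivado HLS directives that an ordinary C compiler simply ignores:

```c
#define N 16  /* input feature maps for this layer (toy value) */
#define K 3   /* kernel size */

/* One output pixel: the dot product over all N*K*K taps.  PIPELINE on
   the function body makes HLS fully unroll the inner loops; keeping the
   multiply in a separate statement from the accumulate lets the tool
   schedule two distinct floating-point operators. */
float conv_pixel(float window[N][K][K], float w[N][K][K]) {
#pragma HLS PIPELINE II=1
    float acc = 0.0f;
    for (int n = 0; n < N; n++)
        for (int i = 0; i < K; i++)
            for (int j = 0; j < K; j++) {
#pragma HLS UNROLL
                float prod = window[n][i][j] * w[n][i][j]; /* multiply ... */
                acc += prod;                               /* ... then accumulate */
            }
    return acc;
}
```

With all window values and weights set to 1, the result is N × K × K = 144, a quick functional check before synthesis.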
HARDWARE OPTIMIZATION: STEPS TO ACHIEVE II = 1
• Stream data into local RAM for the inputs (multiple accesses required)
• Pipeline the dot-product loop to fully unroll it
• Separate the multiply and the accumulate in the inner loop to force two floating-point operators
ACTUAL II = 14: resource constraint (limited memory ports)
ARRAY PARTITION. Guesses??
• Partition the local RAMs into N/2 sub-arrays for fully parallel access (dual-port reads)
EXPECTED II = 1
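The array-partition step can be sketched as follows. The buffer size is a toy value and the directives are Vivado HLS pragmas (ignored by a plain C compiler); cyclic partitioning with factor N/2 is one way to expose N parallel reads when each sub-array is dual-ported:

```c
#define N 16  /* values needed per cycle for II = 1 (toy value) */

/* Without partitioning, one BRAM offers at most two ports, so N reads
   take ~N/2 cycles (hence the observed II = 14 before this fix).
   Splitting each buffer into N/2 two-element sub-arrays serves all
   N reads in a single cycle. */
float dot_n(float a[N], float b[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
#pragma HLS PIPELINE II=1
    float acc = 0.0f;
    for (int n = 0; n < N; n++)
        acc += a[n] * b[n];
    return acc;
}
```

The partitioning changes only where values are stored, not what is computed: the dot product of two all-ones vectors of length 16 is still 16.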
RESULTS & CONCLUSION
| Design | Avg. function runtime (CPU cycles) | Total 1st-layer runtime (CPU cycles) | Speedup | Total runtime (s) |
| --- | --- | --- | --- | --- |
| SW/HW baseline (ARM+FPGA) | 22,135,862 | 7,415,513,736 | 1X | 30.92 |
| SW baseline (ARM) | 2,763,683 | 925,833,786 | 8X | 22.81 |
| SW/HW alternate (ARM+FPGA) | 323,961 | 108,526,936 | 68X | 21.67 |
FINAL PROJECT LINK
• https://youtu.be/v9Bt19h6OUM
ACKNOWLEDGEMENT
• Special thanks to Professor Zhiru Zhang for helping us throughout the project and guiding us when we faced roadblocks.
• We would also like to acknowledge suggestions from Steve Dai and Hsiang Yi, PhD students working with Professor Zhiru Zhang.
REFERENCES
• [1] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, Jason Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
• [2] YOLO: Real-Time Object Detection System. http://pjreddie.com/darknet/yolo/
• [3] YOLO GitHub repository. https://github.com/pjreddie/darknet