Matrix multiplication implemented in data flow technology
Aleksandar Milinković
Belgrade University, School of Electrical Engineering

Introduction
• Problem with big data
• Need to change computing paradigm
• Data flow instead of control flow
• Achieved by construction of graph
• Graph nodes (vertices) perform computations
• Each node is one deep pipeline

Dataflow computation
• Dependencies are resolved at compile time
• No new dependencies are made
• The whole mechanism is in a deep pipeline
• Pipeline levels perform parallel computations
• Data flow produces one result per cycle
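
The "one result per cycle" point can be illustrated with a toy cycle-by-cycle simulation. This is only a sketch in plain Python, not Maxeler code: the function name `run_pipeline` and the three example stages are assumptions made for illustration. It shows that a pipeline with d stages needs d cycles to fill and then delivers one result on every further cycle, so N inputs take d + N - 1 cycles.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_pipeline(stages: Sequence[Callable[[float], float]],
                 inputs: Sequence[float]) -> Tuple[List[float], int]:
    """Simulate a linear pipeline (at least one stage) cycle by cycle.

    Each stage holds at most one value. On every cycle all stages work
    "in parallel": every value in flight advances by one stage and one
    new input enters the first stage, if any input is left.
    Returns (results, cycles_used).
    """
    regs: List[Optional[float]] = [None] * len(stages)  # output register of each stage
    pending = list(inputs)
    results: List[float] = []
    cycles = 0
    while len(results) < len(inputs):
        cycles += 1
        new_regs: List[Optional[float]] = [None] * len(stages)
        # Every stage consumes what the previous stage produced last cycle.
        for i in range(len(stages) - 1, 0, -1):
            if regs[i - 1] is not None:
                new_regs[i] = stages[i](regs[i - 1])
        # A new input enters the first stage.
        if pending:
            new_regs[0] = stages[0](pending.pop(0))
        regs = new_regs
        # Whatever sits in the last register is a finished result.
        if regs[-1] is not None:
            results.append(regs[-1])
    return results, cycles


if __name__ == "__main__":
    # Hypothetical three-stage pipeline computing (2x + 1)^2.
    stages = [lambda v: v * 2, lambda v: v + 1, lambda v: v * v]
    print(run_pipeline(stages, [1, 2, 3, 4]))
```

With three stages and four inputs the simulation uses 3 + 4 - 1 = 6 cycles; for long input streams the fill time becomes negligible and throughput approaches one result per cycle.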

Matrix multiplication
• Data flow doesn't suit all situations
• However, it is applicable in a lot of cases:
  • Partial differential equations
  • 3D finite differences
  • Finite element method
  • Problems in bioinformatics, etc.
• Most of them contain matrix multiplications
• Goal: realization on FPGA, using data flow

Project realizations
Two solutions:
• Maximal utilization of on-chip matrix part
  • Matrices with small dimensions
  • Matrices with large dimensions
• Multiplication using parallel pipelines

Good chip utilization A
[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)); labels: FMem capacity, Pipe 0, Pipe 1]
• A set of columns stays on the chip until it is fully used
• Every pipe calculates 48 sums at a time
• Equivalent to 2 processors with 48 cores
• Additional parallelization possible
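
One way to read this scheme: a block of columns of Y is loaded into on-chip memory once, every row of X is streamed past it, and all sums for the resident columns are accumulated side by side before the next block is loaded. The plain-Python sketch below follows that reading. It is not the author's kernel; the function name `matmul_resident_columns`, the loop order, and the parameter `block_cols` (standing in for how many columns fit on chip) are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul_resident_columns(x: Matrix, y: Matrix, block_cols: int) -> Matrix:
    """Multiply x by y while keeping `block_cols` columns of y "on chip"."""
    n_rows, n_inner, n_cols = len(x), len(y), len(y[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c0 in range(0, n_cols, block_cols):
        c1 = min(c0 + block_cols, n_cols)
        # "Load" the column block once; it stays resident for all rows of x.
        resident = [row[c0:c1] for row in y]
        for i in range(n_rows):
            sums = [0.0] * (c1 - c0)  # one running sum per resident column
            for k in range(n_inner):
                xik = x[i][k]  # one streamed element of x ...
                for j in range(c1 - c0):
                    # ... feeds every running sum (side by side in hardware)
                    sums[j] += xik * resident[k][j]
            result[i][c0:c1] = sums
    return result


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    y = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_resident_columns(x, y, block_cols=1))
```

In the slide's terms, the inner loop over `j` corresponds to the sums a pipe calculates at a time (48 per pipe), and each column block is read from large memory only once.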

Good chip utilization A: chip utilization and acceleration
• LUTs: 195345/297600 (65.64%)
• FFs: 290689/595200 (48.83%)
• BRAMs: 778/1064 (73.12%)
• DSPs: 996/2016 (49.40%)
• Matrix: 2304 x 2304
  • Intel: 42.5 s
  • MAX3: 2.38 s
• Acceleration at kernel clock 75 MHz: ≈ 18x

Good chip utilization B
[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)); labels: X00 X01 X00, FMem capacity, Pipe 0, Pipe 1]
• Part of matrix Y is on chip during computation
• Each pipe calculates 48 sums at a time
• Equivalent to 2 processors with 48 cores

Good chip utilization B: chip utilization and acceleration
• LUTs: 201237/297600 (67.62%)
• FFs: 302742/595200 (50.86%)
• BRAMs: 782/1064 (73.50%)
• DSPs: 1021/2016 (50.64%)
• Matrix: 2304 x 2304
  • Intel: 42.5 s
  • MAX3: 2.38 s
• Acceleration at kernel clock 75 MHz: ≈ 18x
• Matrix: 4608 x 4608
  • Intel: 1034 s
  • MAX3: 58.41 s
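
The quoted acceleration is simply the ratio of the two execution times. The short sketch below recomputes it from the figures on this slide; the helper name `speedup` and the `timings` dictionary are introduced here for illustration only.

```python
def speedup(cpu_seconds: float, dfe_seconds: float) -> float:
    """Ratio of CPU time to dataflow-engine time (larger means faster)."""
    return cpu_seconds / dfe_seconds


# Execution times in seconds as reported on the slide: (Intel, MAX3).
timings = {2304: (42.5, 2.38), 4608: (1034.0, 58.41)}

if __name__ == "__main__":
    for n, (cpu, dfe) in timings.items():
        print(f"{n} x {n}: {speedup(cpu, dfe):.1f}x")
    # prints:
    # 2304 x 2304: 17.9x
    # 4608 x 4608: 17.7x
```

Both matrix sizes give a ratio close to the ≈ 18x stated for the 75 MHz kernel clock.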

Multiple parallel pipelines
[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)); labels: Pipe 0, Pipe 1, Pipe 2, ..., Pipe 46, Pipe 47]
• Matrices are exclusively in a big memory
• Each pipe calculates one sum at a time
• Equivalent to 48 processors with one core
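
A minimal sketch of this arrangement, under stated assumptions: the slide only says that 48 pipes each calculate one sum at a time, so the way output elements are handed to pipes below (consecutive batches of `num_pipes` elements) is a guess, and the names `dot` and `matmul_parallel_pipes` are hypothetical. The pipes run one after another here; in hardware they would work concurrently.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def dot(row: Sequence[float], col: Sequence[float]) -> float:
    """One "sum": the dot product of a row of X and a column of Y."""
    return sum(a * b for a, b in zip(row, col))


def matmul_parallel_pipes(x: Matrix, y: Matrix, num_pipes: int = 48) -> Matrix:
    """Multiply x by y, giving each pipe one output element per batch."""
    n_rows, n_cols = len(x), len(y[0])
    columns = [list(col) for col in zip(*y)]  # column-wise view of y
    result = [[0.0] * n_cols for _ in range(n_rows)]
    positions = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    for start in range(0, len(positions), num_pipes):
        batch = positions[start:start + num_pipes]
        # Each pipe in the batch computes exactly one sum.
        for pipe_id, (i, j) in enumerate(batch):
            result[i][j] = dot(x[i], columns[j])
    return result


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    y = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_parallel_pipes(x, y, num_pipes=3))
```

Unlike the resident-column variant, nothing is cached between sums, which matches the statement that the matrices live exclusively in the big memory.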

Multiple parallel pipelines: chip utilization and acceleration
• LUTs: 166328/297600 (55.89%)
• FFs: 248047/595200 (41.67%)
• BRAMs: 430/1064 (40.41%)
• DSPs: 489/2016 (24.26%)
• Matrix: 2304 x 2304
  • Intel: 42.5 s
  • MAX3: 4.08 s
• Acceleration at kernel clock 150 MHz: > 10x
• Matrix: 4608 x 4608
  • Intel: 1034 s
  • MAX3: 98.48 s

Comparison of solutions
• First solution:
  • Good chip utilization
  • Shorter execution time
  • Drawback: matrices up to 8 GB
• Second solution:
  • Matrices up to 12 GB
  • Drawback: longer execution time

Conclusions
• Matrix multiplication is an operation with complexity O(n^3)
• Part of the complexity is moved from time to space
• That produces acceleration (shorter execution time)
• Achieved by application of data flow technology
• Developed using the tool chain from Maxeler Technologies
• Calculations are an order of magnitude faster than on Intel Xeon
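
"Moving complexity from time to space" can be made concrete with a rough estimate: an n x n product needs n^3 multiply-accumulate steps, and if P of them run side by side on the chip, the cycle count drops to about n^3 / P. The sketch below is an idealized calculation, not a model of the actual design. It assumes one multiply-accumulate per unit per cycle and ignores memory transfers and pipeline fill; the function names and the choice of P = 96 (one reading of "2 pipes, 48 sums each") are assumptions.

```python
def mac_count(n: int) -> int:
    """Multiply-accumulate steps in a naive n x n matrix product."""
    return n ** 3


def ideal_seconds(n: int, parallel_units: int, clock_hz: float) -> float:
    """Idealized run time: n^3 steps spread over `parallel_units`,
    one step per unit per clock cycle (assumption, not a measurement)."""
    return mac_count(n) / parallel_units / clock_hz


if __name__ == "__main__":
    n = 2304
    # Assumed configuration: 96 units at the 75 MHz kernel clock.
    print(f"steps: {mac_count(n)}")
    print(f"ideal time: {ideal_seconds(n, 96, 75e6):.2f} s")
```

Under these assumptions the estimate is a lower bound only; the measured times in the slides are what the design actually achieved.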