Performance of mathematical software
description
Transcript of Performance of mathematical software
![Page 1: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/1.jpg)
Performance of mathematical software
Agner FogTechnical University of Denmark
www.agner.org
![Page 2: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/2.jpg)
• Intel and AMD microarchitecture• Performance bottlenecks• Parallelization• C++ vector classes, example• Instruction set dispatching• Performance measuring• Hands-on examples
Agenda
![Page 3: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/3.jpg)
AMD Microarchitecture
• 2 threads per CPU unit
• 4 instructions per clock
• Out-of-order execution
![Page 4: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/4.jpg)
Intel Micro-architecture
![Page 5: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/5.jpg)
Typical bottlenecks• Installation of program• Program start, load framework and
libraries• System database• Network• File input / output• Graphics• RAM access, cache utilization• Algorithm• Dependency chains• CPU pipeline• CPU execution units
Sp
eed
![Page 6: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/6.jpg)
Efficient cache use
• Store data in contiguous blocks• Avoid advanced data containers
such as resizable data structures, linked lists, etc.
• Store local data inside the function they are used in
![Page 7: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/7.jpg)
Branch prediction
Loop
Branch
BA
![Page 8: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/8.jpg)
Out-Of-Order Execution
x = a / b;y = c * d;z = x + y;
![Page 9: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/9.jpg)
Dependency chains
x = a + b + c + d;
x = (a + b) + (c + d);
![Page 10: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/10.jpg)
Loop-carrieddependency chain
for (i = 0; i < n; i++) { sum += x[i]; }
for (i = 0; i < n; i += 2) { sum1 += x[i]; sum2 += x[i+1]; }sum = sum1 + sum2;
![Page 11: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/11.jpg)
Levels of parallelization
• Multiple threads running in different CPU units
• Out-of-order execution• SIMD instructions
![Page 12: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/12.jpg)
SIMD programming methods
• Assembly language• Intrinsic functions• Vector classes• Automatic vectorization by
compiler
![Page 13: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/13.jpg)
![Page 14: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/14.jpg)
Obstacles to automatic vectorization
• Pointer aliasing• Array alignment• Array size• Algebraic reduction• Branches• Table lookup
![Page 15: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/15.jpg)
Vector math libraries
Short vectors
• Intel SVML• AMD libm• Agner’s VCL
Long vectors
• Intel VML• Intel IPP• Yeppp
![Page 16: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/16.jpg)
ExampleExponential function
in vector classes
![Page 17: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/17.jpg)
Instruction set dispatching
SSE
SSE2
SSE3
Generic x86
SSE code
SSE2 code
AVX2 code
N
J
N
N
J
J
GenuineIntel
N
J
Future extensions
![Page 18: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/18.jpg)
Performance measurement
• Profiler
• Insert measuring instruments into code
• Performance monitor counters
![Page 19: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/19.jpg)
When timing results are unstable
• Stop other running programs• Warm up CPU with heavy work• Disable power saving features in
BIOS setup• Disable hyperthreading• Use “core clock cycles” counter
(Intel only)
![Page 20: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/20.jpg)
ExampleMeasuring latency and throughput of
machine instruction
![Page 21: Performance of mathematical software](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814cd7550346895db9dafa/html5/thumbnails/21.jpg)
Hands-on example
Compare performance of different mathematical function libraries
http://cern.agner.org