![Page 1: Tools for Computational Physics Week 2, Lecture 2 ... · The basic CPU structure Interface Bus Processor L1 Cache L2 Cache Processor/Memory Bus •A CPU contains a chip where circuits](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/1.jpg)
Tools for Computational Physics
Week 2, Lecture 2
Mathematical Libraries
R. Rousseau
SISSA, 2-4 Via Beirut, Trieste, Italy
![Page 2](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/2.jpg)
Outline
• Introduction
• CPUs and Memory
• Math libs.
• Performance
• Linking Math libs
• Examples
![Page 3](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/3.jpg)
Introduction
• The goal of this lecture is to give a basic understanding of how the CPU and memory work together to perform a calculation.
• Demonstrate how bottlenecks can occur in calculations and how they may be avoided by using mathematical libraries.
• Explain what these libraries contain and give a brief overview of how users incorporate them into their code.
• Illustrate their uses and provide general information on available software and how to choose what you need for a given project.
![Page 4](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/4.jpg)
The basic CPU structure
[Diagram labels: Interface Bus; Processor with L1 Cache; L2 Cache; Processor/Memory Bus]
• A CPU contains a chip whose circuits are wired to turn electrical signals into mathematical operations.
• Each CPU type has a different architecture and its own instruction set (we will not discuss the pros and cons of a given type in any great detail).
• CPUs also require data, which they keep in memory on the CPU (cache).
• Access to this memory is fast, but it is expensive to manufacture, and there is often more data than cache to hold it.
• Thus the computer also has RAM and I/O devices where the data is stored and from which it is moved into the CPU (by the program) for computation. Note that PC-based machines all have small caches compared with workstation-class machines.
Von Neumann: the CPU does the calculation; the rest of the computer stores data and code. The CPU performs Fetch-Execute Cycles (FEC).
![Page 5](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/5.jpg)
Memory hierarchy
• In a modern computer system the same data is stored in several storage devices during processing.
• The storage devices can be described and ranked by their speed and "distance" from the CPU.
• There is thus a hierarchy of memory objects.
• Programming a machine with a memory hierarchy requires optimizing the code for that memory structure.
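As an illustration of optimizing for the memory structure, here is a minimal sketch of cache blocking (tiling) for a matrix product. The function names and the block size are assumptions made for this example, not taken from the lecture. In pure Python the interpreter overhead hides any cache effect, so the sketch only shows the access pattern: the blocked version works on small tiles that could stay in cache while they are reused.

```python
def matmul_naive(A, B):
    """Textbook triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_blocked(A, B, block=2):
    """Same product, computed tile by tile.

    Each (i0, k0, j0) triple selects a block x block tile of A, B and C.
    On real hardware the tile size is chosen so the three tiles fit in cache.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Multiply one tile of A by one tile of B into one tile of C.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += a_ik * B[k][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_blocked(A, B, block=2) == matmul_naive(A, B))
```

Both functions return the same matrix; only the order in which memory is visited differs. Blocking of this kind is one of the techniques tuned libraries are generally understood to apply, alongside others not shown here.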
![Page 6](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/6.jpg)
The Memory Hierarchy
[Diagram: the CPU exchanges data, instructions and addresses with memory. From the processor side to the system side the levels are CPU register, cache, RAM and virtual memory, ranked along a Speed axis and a Size axis.]
![Page 7](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/7.jpg)
Processor-DRAM Gap (latency)
[Plot: performance (log scale, 1 to 1000) versus time (1980 to 2000).]
• CPU (µProc): 60%/yr. ("Moore's Law: 4X/3")
• DRAM: 7%/yr.
• Processor-memory performance gap: grows 50%/year
• Annotation on the plot: introduction of RISC architecture
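The 50%/year figure follows from the two growth rates quoted on the slide (60%/yr for processors, 7%/yr for DRAM). A small worked check, with function names chosen for this sketch only:

```python
def relative_gap_growth(cpu_rate=0.60, dram_rate=0.07):
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Factor by which the CPU/DRAM performance ratio has grown after `years`."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


print(f"gap grows by {relative_gap_growth():.1%} per year")
print(f"after 10 years the gap is {gap_after(10):.0f} times larger")
```

With these rates the ratio grows by a little under 50% per year, which compounds to more than a fifty-fold gap over a decade.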
![Page 8](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/8.jpg)
Memory Hierarchy
Level: speed, size (bytes)
• Processor (control, registers, on-chip cache): ~5 ns, ~KB
• Second-level cache (SRAM): ~75 ns, ~MB
• Main memory (DRAM): ~500 ns, ~GB
• Secondary storage (disk): ~10 ms, ~TB
• Tertiary storage (disk/tape): ~10 sec, 100 TB
![Page 9](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/9.jpg)
Layout of a typical Computer: Pentium IV
[Block diagram garbled in extraction. Recoverable labels: PCI-X Bridge, Other Bridge, I/O Hub, 144-bit.]
• The most commonly used machines are PIVs (single or dual), which have high clock speeds for performing the FEC and 512 KB of cache. Details: # more /proc/cpuinfo
• Multiprocessor PIVs can sit on the same board and share the same memory etc. (SMP machine).
• A typical multiprocessor PIV machine has N CPUs on the same bus, linked via a memory controller to the RAM and I/O devices.
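The /proc/cpuinfo file mentioned above is plain text: one "key : value" line per property and a blank line between processors. A minimal sketch of reading it programmatically follows. The helper name and the sample text are hypothetical, and the exact set of keys varies between kernels and CPU types.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per processor."""
    processors = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the description of one processor.
            if current:
                processors.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        processors.append(current)
    return processors


# Hypothetical sample in the usual layout; on Linux, read the real file with
# open("/proc/cpuinfo").read() instead.
SAMPLE = """\
processor\t: 0
model name\t: Example CPU
cpu MHz\t\t: 2000.000
cache size\t: 512 KB

processor\t: 1
model name\t: Example CPU
cpu MHz\t\t: 2000.000
cache size\t: 512 KB
"""

for cpu in parse_cpuinfo(SAMPLE):
    print(cpu["processor"], cpu.get("model name"), cpu.get("cache size"))
```

On an SMP machine the number of entries returned corresponds to the number of processors the kernel sees.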
![Page 10](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/10.jpg)
Computational Bottlenecks: Athlon
![Page 11](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/11.jpg)
Compute Nodes
• Lanai 9.0 Myrinet (copper)
• 2 Broadcom Gbit ethernet
• 2 Opteron 246 (2GHz/1MB cache)
• 4GB 400MHz Kingston memory
• Mainboard Celestica A220
![Page 12](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/12.jpg)
Common Architecture: AMD Opteron
[Block diagram garbled in extraction. Recoverable labels: PCI-X Bridge, Other Bridge, I/O Hub, 144-bit.]
A newer paradigm in architecture is the Opteron: the memory controller is on the CPU. However, RAM is ALWAYS slower than cache, so architecture alone will not completely solve the problem. Note that the CPU frequency is lower in the Opteron than in the PIV (less power used and less heat generated), yet it is just as fast or faster: clock speed is not everything.
![Page 13](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/13.jpg)
Math Libraries
• Routines for common math functions such as vector and matrix operations, Fourier transforms, etc., written in a specific way to take the most advantage of the architecture of the CPU.
• Compilers can optimize code only up to a certain point (they are dumb), hence sophisticated algorithms and coding are required before the compiler can produce a routine that is really efficient: naive coding won't work!
• AN ABSOLUTE NECESSITY on PC-based machines due to the small cache on the CPU.
• Makes coding easier, as common math functions can be taken from canned routines.
![Page 14](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/14.jpg)
Common Math Libraries
Groups of subroutines that are "standard" and can be modified by the machine vendor to run on their machine. This enables codes to be more portable from one machine to the next and still be efficient.
Linear Algebra (LA)
• Basic Linear Algebra Subprograms (BLAS)
  Level 1 (vector-vector operations)
  Level 2 (matrix-vector operations)
  Level 3 (matrix-matrix operations)
  Routines involving sparse vectors
• Linear Algebra PACKage (LAPACK): leverages BLAS to perform complex operations
Fast Fourier Transform (FFTW)
• Real or complex, 1D, 2D, 3D.
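To make the three BLAS levels concrete, the sketch below writes out one representative operation per level in plain Python. These are reference definitions of what the operations compute, not the library routines themselves; the function names mirror the usual BLAS operation names (axpy, gemv, gemm) without the precision prefix, and the list-of-rows storage is an assumption made for readability.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): returns alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, A, x, beta, y):
    """Level 2 (matrix-vector): returns alpha*A*x + beta*y."""
    return [
        alpha * sum(a * xj for a, xj in zip(row, x)) + beta * yi
        for row, yi in zip(A, y)
    ]


def gemm(alpha, A, B, beta, C):
    """Level 3 (matrix-matrix): returns alpha*A*B + beta*C."""
    inner = len(B)
    return [
        [
            alpha * sum(A[i][k] * B[k][j] for k in range(inner)) + beta * C[i][j]
            for j in range(len(B[0]))
        ]
        for i in range(len(A))
    ]


print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(gemv(1.0, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 0.0, [0.0, 0.0]))
```

For vectors and matrices of dimension n, Level 1 does on the order of n operations on n data, Level 2 does n² operations on n² data, and Level 3 does n³ operations on n² data. Level 3 therefore offers the most reuse of data already in cache, which is why it benefits most from a tuned library.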
![Page 15](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/15.jpg)
Common Libraries
Standard BLAS and LAPACK (not machine specific)
Intel Math Kernel Library (MKL): BLAS and LAPACK routines modified for best performance on Pentium (Itanium) based machines.
AMD Math Core Library (ACML): BLAS, LAPACK and FFTW routines modified for best performance on x86 (Athlon, Opteron) based machines.
Automatically Tuned Linear Algebra Software (ATLAS): some BLAS and LAPACK routines that can be compiled on PC-based machines to obtain better performance by tuning machine-specific parameters.
All can be downloaded free from the web.
![Page 16](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/16.jpg)
Library Performance: Variance with Machine
![Page 17](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/17.jpg)
How to include these libraries in your code.
Within your code you simply call the BLAS/LAPACK routines as if they were subroutines you would normally write.
Note: check the BLAS/LAPACK manuals for the name of each routine, which variables need to be passed to it, and in what order.
NB: DGEMM = Double-precision GEneral Matrix-Matrix multiplication.
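To show what "which variables, in what order" means for DGEMM, here is a plain-Python reference of its semantics, C := alpha*op(A)*op(B) + beta*C. The argument order follows the standard Fortran interface, DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC), with matrices stored column by column in flat arrays. This is a sketch for reading the calling sequence, not a substitute for the library, and it treats any TRANS value other than 'N' as a transpose.

```python
def dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference DGEMM: overwrites c with alpha*op(A)*op(B) + beta*C.

    a, b, c are flat lists in column-major order; lda, ldb, ldc are the
    leading dimensions (number of rows of the stored arrays).
    op(A) is m x k, op(B) is k x n, C is m x n.
    """
    def elem(mat, ld, trans, i, j):
        # Element (i, j) of op(X): X[i, j] for 'N', X[j, i] otherwise.
        if trans.upper() == "N":
            return mat[i + j * ld]
        return mat[j + i * ld]

    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += elem(a, lda, transa, i, p) * elem(b, ldb, transb, p, j)
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column.
a = [1.0, 3.0, 2.0, 4.0]
b = [5.0, 7.0, 6.0, 8.0]
c = [0.0] * 4
dgemm("N", "N", 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
print(c)  # column-major storage of A*B
```

The leading dimensions let the routine operate on a sub-block of a larger array, which is why they are passed separately from m, n and k.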
![Page 18](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/18.jpg)
Ubiquitous Nomenclature
As long as you stick to using the libraries in a standard way, diverse software can all use the SAME crucial routines:
![Page 19](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/19.jpg)
How to include these libraries in your code.
Normally, the math library exists on the machine as a precompiled library object file.
How it was compiled determines how you will link it into your code.
The location of the library, how it was compiled, whether or not you need to tell the compiler where to find it, etc. are EXTREMELY dependent on the system administrators.
Two cases: static library object file; shared object library file.
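Since the location of the libraries depends on the machine, it can help to check what the system actually provides. The sketch below uses Python's standard `ctypes.util.find_library` to ask which shared libraries are visible. The candidate names are assumptions (the names one would typically pass to the linker), and the lookup only covers shared object libraries, not static ones.

```python
from ctypes.util import find_library

# Hypothetical candidate names; what is installed depends on the machine.
CANDIDATES = ("blas", "lapack", "atlas", "fftw3")


def locate_shared_libraries(names=CANDIDATES):
    """Map each library name to what the system reports, or None if not found."""
    return {name: find_library(name) for name in names}


for name, found in locate_shared_libraries().items():
    print(f"{name:8s} -> {found if found else 'not found as a shared library'}")
```

A result of "not found" does not mean the library is absent: it may be installed as a static library or in a directory the system administrators have not added to the default search path.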
![Page 20](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/20.jpg)
How to include these libraries in your code: linking
![Page 21](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/21.jpg)
Example: CPMD w/wo atlas
As a simple example we will consider the CPMD 3.7 code compiled with pgi on a 2.0GHz single PIV:
A. With standard LAPACK/BLAS.
B. With a BLAS/LAPACK/ATLAS combo library optimized for a PIV.
![Page 22](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/22.jpg)
Example: CPMD w/wo atlas
CPU time/step of SCF wavefunction cycle:
[Bar chart of CPU time (s); the values were lost in extraction. Cases compared:]
• CPMD 3.9, improved routines (same compile)
• CPMD 3.7, old routines (ATLAS, Bochum)
• CPMD 3.7, old routines (standard libs)
![Page 23](https://reader034.fdocuments.net/reader034/viewer/2022051913/60044716d2735d2b665da8e7/html5/thumbnails/23.jpg)
Conclusion
The best performance from a computer can be obtained by including "canned" software from a library.

Avoids memory bottlenecks and makes coding easier.

Can speed software up many times over, especially on machines whose CPUs have small caches.

For legacy codes it is a huge job to BLASify them, and if the code is I/O bound it may not be worth the effort (always ask yourself "is it worth it?").

Another type of library is required for codes to run on more than one CPU at a time; the common paradigm is the Message Passing Interface (MPI).
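The "is it worth it?" question for legacy codes can be estimated before any rewriting with Amdahl's law, a standard formula that is not part of the original slides: if only a fraction of the run time is spent in operations a library can accelerate, the overall gain is limited by the rest (for example I/O). The function name and the numbers below are hypothetical.

```python
def overall_speedup(fraction_in_library, library_speedup):
    """Amdahl's law: whole-program speedup when only part of it gets faster.

    fraction_in_library: share of the original run time spent in the part
        that would be replaced by library calls (0 to 1).
    library_speedup: how many times faster that part becomes.
    """
    if not 0.0 <= fraction_in_library <= 1.0:
        raise ValueError("fraction_in_library must be between 0 and 1")
    if library_speedup <= 0.0:
        raise ValueError("library_speedup must be positive")
    remaining = 1.0 - fraction_in_library
    return 1.0 / (remaining + fraction_in_library / library_speedup)


# Hypothetical cases: a compute-dominated code and an I/O-dominated code,
# both with a library that makes the linear algebra 5 times faster.
print(f"90% linear algebra: {overall_speedup(0.9, 5.0):.2f}x overall")
print(f"10% linear algebra: {overall_speedup(0.1, 5.0):.2f}x overall")
```

Profiling the code first to measure the fraction is what makes the estimate meaningful; a code that spends most of its time on I/O gains little however fast the library is.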