IXPUG-ISC16 Workshop Presentation 1 IXPUG
Early Performance of Seismic Imaging Kernels on Intel Knights Landing
Michael Lysaght (1), Sean Delaney (2), Gilles Civario (1)
(1) ICHEC, (2) Tullow Oil plc.
IXPUG-ISC16 Workshop, Frankfurt
June 23rd 2016
Background
▪ TORTIA: a Reverse Time Migration (RTM) code:
  ➢ Developed as a suite of in-house codes at Tullow Oil plc (Oil & Gas exploration)
  ➢ An explicit FD scheme of variable spatial order, and 2nd order in time, to model wave propagation in the three isotropy cases
  ➢ Based on an unconventional rotated staggered grid (RSG) method
  ➢ Client insists on portable open standards (MPI/OpenMP)
▪ Re-factored at ICHEC to target AVX512 and improve flexibility
[Figures: Intel Advisor 2017 Beta Survey of TORTIA on KNL; RSG stencil]
Increasing Data Re-Use on KNL
[Figure: an (i, j) grid with corners (1,1), (1,512), (512,1) and (512,512), shown first as 2 x 2 subdomains (SD1,1 to SD2,2) and then as 4 x 8 subdomains (SD1,1 to SD4,8) laid over KNL Tiles 1 to 32 (4 x KNL tiles by 8 x KNL tiles). Axes: i-loop, j-loop; the k-loop is vectorized. SD = SubDomain.]
• 2D decomposition maps to KNL tiles
• Cache block parameter in j-direction
• Assign Manual OpenMP “Teams” to KNL tiles = extra parallelization over j-loop
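A minimal sketch of how such manual thread "teams" could be laid out, assuming compact affinity so that consecutive OpenMP thread ids share a tile, and assuming a team size of 8 threads (2 cores x 4 hardware threads per tile). `TeamSlot` and `team_slot` are hypothetical names introduced for illustration and are not taken from TORTIA; only `t` (thread id) and `grp` (threads per team) echo identifiers that appear on the inter-L2 traffic slide.

```c
#include <assert.h>

/* Hypothetical helper type, for illustration only. */
typedef struct {
    int  team;  /* index of the tile-sized team that thread t belongs to */
    int  lane;  /* position of thread t inside its team */
    long jBeg;  /* first j index this thread updates (inclusive) */
    long jEnd;  /* one past the last j index this thread updates */
} TeamSlot;

/* Split the j-range [jMin, jMax) of one subdomain across the grp threads
 * of a team, so that the whole team works on data held in the same L2. */
TeamSlot team_slot(int t, int grp, long jMin, long jMax)
{
    assert(grp > 0 && t >= 0 && jMax >= jMin);

    TeamSlot s;
    s.team = t / grp;                    /* consecutive thread ids form one team */
    s.lane = t % grp;

    long len   = jMax - jMin;
    long chunk = (len + grp - 1) / grp;  /* ceiling division */

    s.jBeg = jMin + s.lane * chunk;
    if (s.jBeg > jMax) s.jBeg = jMax;    /* lanes past the end get an empty range */
    s.jEnd = s.jBeg + chunk;
    if (s.jEnd > jMax) s.jEnd = jMax;
    return s;
}
```

Inside an OpenMP parallel region, each thread could call `team_slot(omp_get_thread_num(), 8, jMin, jMax)` and loop over its own `[jBeg, jEnd)` slice, leaving the innermost k-loop to the vectorizer.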
Increasing Data Re-Use on KNL
[Figure: the same decomposition (4 x 8 subdomains SD1,1 to SD4,8 over KNL Tiles 1 to 32), with one KNL tile enlarged to show its 1 MB L2 and the OMP team assigned to it. Axes: i-loop, j-loop; k-loop (vectorized), k-dim = 128. AFFINITY = compact. SD = SubDomain.]
Roofline Analysis
[Figure: Rooflines, Xeon Phi 7210 (Knights Landing). x-axis: Flops/DRAM Byte (0.125 to 128); y-axis: GFlops/s (8 to 8192). Legend: Peak (FMA), Peak (no FMA), Left wall, Right wall, Measured, Middle. Ceilings shown: Zero FMA Ceiling*, FMA Ceiling*.]
STREAM TRIAD (NERSC site): 440 GB/s
*Ceilings from Intel Advisor 2017 Beta
Cluster Mode: Quadrant; Cache Model: Flat; MCDRAM: 16 GB; 1 MPI + 256 OMP
Roofline Analysis
[Figure: Rooflines, Xeon Phi 7210 (Knights Landing). x-axis: Flops/DRAM Byte (0.125 to 128); y-axis: GFlops/s (8 to 8192). Legend: Peak (FMA), Peak (no FMA), Left wall, Right wall, Measured, Middle. Ceilings shown: Zero FMA Ceiling*, FMA Ceiling*.]
[Inset bar chart, scale 0 to 2: HSW = 2 x E5-2697 v3; KNL = 1 x 7210 (64-core); 1.67x speedup]
STREAM TRIAD (NERSC site): 440 GB/s
*Ceilings from Intel Advisor 2017 Beta
Roofline Analysis
[Figure: Rooflines, Xeon Phi 7210 (Knights Landing). x-axis: Flops/DRAM Byte (0.125 to 128); y-axis: GFlops/s (8 to 8192). Legend: Peak (FMA), Peak (no FMA), Left wall, Right wall, Measured, Middle. Ceilings shown: Zero FMA Ceiling*, FMA Ceiling*.]
[Inset bar chart, scale 0 to 2: HSW = 2 x E5-2697 v3; KNL = 1 x 7210 (64-core); 1.67x speedup]
Achieved: 448 GF/s
Current GF/s: ~23% attainable perf
STREAM TRIAD (NERSC site): 440 GB/s
*Ceilings from Intel Advisor 2017 Beta
Roofline Analysis
[Figure: Rooflines, Xeon Phi 7210 (Knights Landing). x-axis: Flops/DRAM Byte (0.125 to 128); y-axis: GFlops/s (8 to 8192). Legend: Peak (FMA), Peak (no FMA), Left wall, Right wall, Measured, Middle. Ceilings shown: Zero FMA Ceiling*, FMA Ceiling*.]
[Inset bar chart, scale 0 to 2: HSW = 2 x E5-2697 v3; KNL = 1 x 7210 (64-core); 1.67x speedup]
Arithmetic intensity (AI) markers on the plot:
• 0D caching (AI = 0.28)
• 2D (plane) caching (AI = 1.28)
• 3D (block) caching (AI = 4.17)
STREAM TRIAD (NERSC site): 440 GB/s
*Ceilings from Intel Advisor 2017 Beta
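To make the roofline reading concrete, here is a minimal sketch of the standard roofline bound: attainable performance is the smaller of the compute ceiling and the product of arithmetic intensity and memory bandwidth. The function name `attainable_gflops` is hypothetical, and the compute ceiling is left as a parameter because the slides do not state the Advisor ceiling values numerically.

```c
#include <assert.h>

/* Roofline bound: performance is limited either by the compute peak or by
 * memory bandwidth times arithmetic intensity, whichever is lower. */
double attainable_gflops(double ai, double bw_gbs, double peak_gflops)
{
    assert(ai > 0.0 && bw_gbs > 0.0 && peak_gflops > 0.0);
    double mem_bound = ai * bw_gbs;  /* (flops/byte) * (GB/s) = GFlops/s */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

With the 440 GB/s STREAM TRIAD figure, the bandwidth-bound limits for AI = 0.28, 1.28 and 4.17 come out at roughly 123, 563 and 1835 GFlops/s respectively, provided the compute ceiling lies above them.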
Roofline Analysis: Data-reuse Analysis
[Figure: stencil diagram highlighting the input elements]
Roofline Analysis: Data-reuse Analysis
[Figure: stencil diagram over 5k iterations; 4 red elements reused = 1D reuse]
Roofline Analysis: Data-reuse Analysis
[Figure: stencil diagram over 5k iterations; 4 red elements reused = 2D reuse]
'Full-plane' 2D caching:
• Assume enough cache to store the operands needed to update an Nj * Nk plane of points.
• V kernel example, inside the k loop:
  • Evict 3/N velocities.
  • Load 3/N velocities.
  • Load 2 pressures.
  • Load 1/N medium property.
  • Execute 23 operations.
  • Precision = 4 bytes.
• V kernel AI: 23 / (2 + 7/N) / 4
  • If (N == 6), AI = 1.816
• P kernel AI: 11 / (2 + 3/N) / 4
  • If (N == 6), AI = 1.1
• Weighted average AI: 1.279
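The two per-kernel formulas above can be checked with a short sketch. The function names `v_kernel_ai` and `p_kernel_ai` are hypothetical; the constants (23 and 11 operations, 2 + 7/N and 2 + 3/N words moved per point, 4 bytes per word) are the ones given on the slide.

```c
#include <assert.h>

/* V kernel: 23 operations per point, (2 + 7/N) 4-byte words moved per point. */
double v_kernel_ai(int n)
{
    assert(n > 0);
    return 23.0 / (2.0 + 7.0 / n) / 4.0;
}

/* P kernel: 11 operations per point, (2 + 3/N) 4-byte words moved per point. */
double p_kernel_ai(int n)
{
    assert(n > 0);
    return 11.0 / (2.0 + 3.0 / n) / 4.0;
}
```

For N = 6 these evaluate to about 1.816 and 1.1, matching the slide. The weighted average of 1.279 depends on the relative weights of the two kernels, which the slide does not state, so it is not reproduced here.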
Roofline Analysis with Intel Advisor 2017
• Fresh results on KNL!
• Analysis just starting!
• Already seeing correlation between qualitative & quantitative analyses
• More to come!
• Thanks to Zakhar Matveev & Kirill Rogozhin at Intel!
TORTIA “FLOP/L1 Bytes” on KNL
Quantifying impact of inter-L2 traffic
[Figure: four KNL tiles, each with its own L2, holding subdomains SD1 to SD4; the halo regions between subdomains give rise to inter-L2 traffic. Axes: i-loop, j-loop.]
#define gap (2*par.rad) // twice the stencil radius
iMin = ((t/grp) % nBi) * (bNi + gap) + gap;
iMax = min((long) iMin + bNi, par.Ni);
jMin = ((t/grp) / nBi) * (bNj + gap) + gap;
jMax = min((long) jMin + bNj, par.Nj);
• Grid size increases
• Accessing more data
• Data gaps
• Only updating same amount of data
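The index arithmetic above can be wrapped in a small self-contained function to see how the gaps separate neighbouring blocks. This is a sketch under assumptions: the `Params` and `Block` structs and the name `block_bounds` are hypothetical, and `t` is taken to be the OpenMP thread id with `grp` threads per team, which the slide implies but does not state.

```c
#include <assert.h>

/* Hypothetical containers; the field names Ni, Nj and rad follow the slide. */
typedef struct { long Ni, Nj; int rad; } Params;
typedef struct { long iMin, iMax, jMin, jMax; } Block;

static long min_long(long a, long b) { return a < b ? a : b; }

/* Bounds of the block updated by the team of thread t. Blocks are placed
 * nBi per row and separated by a gap of twice the stencil radius, so the
 * halos of neighbouring blocks never overlap. */
Block block_bounds(int t, int grp, int nBi, long bNi, long bNj, Params par)
{
    assert(grp > 0 && nBi > 0 && t >= 0);

    long gap = 2L * par.rad;  /* twice the stencil radius */
    int  b   = t / grp;       /* block (team) index */

    Block blk;
    blk.iMin = (b % nBi) * (bNi + gap) + gap;
    blk.iMax = min_long(blk.iMin + bNi, par.Ni);
    blk.jMin = (b / nBi) * (bNj + gap) + gap;
    blk.jMax = min_long(blk.jMin + bNj, par.Nj);
    return blk;
}
```

With a stencil radius of 4 and 64-point blocks in i, block 0 spans i = 8 to 72 and block 1 starts at i = 80, leaving 8 untouched points between them. The grid therefore has to grow to hold the gaps while the number of updated points stays the same.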
Conclusions and Insights
▪ Intel Xeon Phi ‘KNL’ already offering superior performance over Intel Xeon for RTM simulations (with open standards!)
  ➢ 1.67x speedup over 2 x 14-core Haswell
▪ Early Stages: Now Targeting Even Higher Performance
  ➢ Investigate Cluster Modes
  ➢ Improve Cache Blocking: “Push to 3D (Block) Wall”
  ➢ Modernize MPI for scaling out
▪ Intel Advisor proving to be an invaluable tool: “Roofline Automation” feature will be highly useful!
Thanks to Heinrich Bockhorst for assistance throughout!
Backup
Early Performance of Seismic Imaging Kernels on Intel Knights Landing �
� Results
All of our investigations on Intel Xeon Phi KNL have been carried out on a preproduction platform on the Intel Endeavour cluster made available as part of an early access program. The configuration of the KNL platform used in our investigations is summarised in Table �.
Table �: Knights Landing preproduction platform configuration
Feature        Configuration
Stepping       B�
Core freq.     �.� GHz
# Cores        ��
MCDRAM         �� GB
DDR�           ������� MB
Memory Mode    Flat
Cluster Mode   Quadrant
BIOS           GVPDINT�.��B.����.D��.����������
Modern compilers such as the Intel compiler offer many options that influence performance, which can pose challenges when searching for optimal settings. On top of this, the TORTIA code has considerable flexibility when choosing runtime parameters, e.g., array sizes, cache blocking sizes (tile size), number of OpenMP threads, OpenMP scheduling policy, etc., which taken together can often have a dramatic impact on performance. As part of the investigations reported here, we have carried out an initial search of this large parameter space with the aim of finding an optimal configuration of compile-time and run-time parameters for the TORTIA code running on Intel Xeon Ivy Bridge, Intel Xeon Haswell, Intel Xeon Phi KNC and Intel Xeon Phi KNL platforms.
In Section � and Section �, we provide tentative results for the optimal compile-time and run-time parameters found as part of our early stage investigations on KNL. These results will be updated and extended, and complemented by Intel VTune reports, in the lead up to the ISC’�� IXPUG Workshop.
When using � MPI processes on the KNL platform with the parameters per process as seen in Section �, we can achieve a higher throughput performance of ��� GFlops relative to ��� GFlops on the two-socket Haswell E�-���� platform.
Knights Landing Platform Configuration on Endeavour System
icc version 16.0.3; Intel(R) MPI Library 5.1.3 for Linux