
IXPUG-ISC16 Workshop Presentation 1 IXPUG

Early Performance of Seismic Imaging Kernels on Intel Knights Landing

Michael Lysaght (1), Sean Delaney (2), Gilles Civario (1)
(1) ICHEC, (2) Tullow Oil plc.

IXPUG-ISC16 Workshop, Frankfurt

June 23rd 2016


Background

▪  TORTIA: A Reverse Time Migration (RTM) code:
➢ Developed as suite of in-house codes at Tullow Oil plc (Oil & Gas exploration)
➢ An explicit FD scheme of variable spatial order, and 2nd order in time, to model wave propagation in the three isotropy cases
➢ Based on an unconventional rotated staggered grid (RSG) method
➢ Client insists on portable open standards (MPI/OpenMP)
▪  Re-factored at ICHEC to target AVX512 and improve flexibility
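As a rough illustration of what "explicit FD, 2nd order in time" means, the sketch below advances a 1D constant-velocity wavefield by one leapfrog step with a 2nd-order spatial stencil. It is not TORTIA's RSG scheme; the function name `wave_step_1d` and all of its parameters are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One explicit leapfrog step of the 1D wave equation u_tt = c^2 u_xx.
 * prev, curr, next: wavefields at t-dt, t and t+dt (length n each).
 * courant2 = (c*dt/dx)^2; in 1D the scheme is stable for courant2 <= 1.
 * Boundary points are simply held at zero (an assumption of this sketch). */
void wave_step_1d(const float *prev, const float *curr, float *next,
                  size_t n, float courant2)
{
    assert(n >= 3);
    next[0] = 0.0f;
    next[n - 1] = 0.0f;
    for (size_t k = 1; k + 1 < n; ++k) {
        /* 2nd-order central difference for the spatial second derivative */
        float lap = curr[k - 1] - 2.0f * curr[k] + curr[k + 1];
        /* 2nd-order central difference in time, solved for u(t+dt) */
        next[k] = 2.0f * curr[k] - prev[k] + courant2 * lap;
    }
}
```

A higher spatial order would widen the stencil around `curr[k]` while leaving the time update unchanged.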

Intel Advisor 2017 Beta Survey of TORTIA on KNL RSG Stencil


Increasing Data Re-Use on KNL

[Figure: a 512 x 512 (i, j) index plane, from (1,1) to (512,512), divided into subdomains (SD = SubDomain): first 2 x 2 (SD1,1 to SD2,2), then 4 x 8 (SD1,1 to SD4,8), one subdomain per KNL tile (Tile 1 to Tile 32, laid out as 4 x 8 KNL Tiles). Axes: i-loop, j-loop; the k-loop is vectorized.]

•  2D decomposition maps to KNL tiles
•  Cache block parameter in j-direction
•  Assign Manual OpenMP “Teams” to KNL tiles = extra parallelization over j-loop
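A minimal sketch of the manual "teams" idea: consecutive OpenMP thread ids are grouped into teams of `grp` threads (for instance 8 threads per tile if 256 threads run on 32 tiles), each team owns one subdomain, and the members of a team split that subdomain's j-range between them. The helper names `team_of`, `rank_in_team` and `split_range` are hypothetical and not taken from TORTIA.

```c
#include <assert.h>

/* Team (subdomain) index of thread t when teams hold grp consecutive ids.
 * With a compact affinity setting, consecutive ids are placed close together. */
int team_of(int t, int grp) { assert(grp > 0); return t / grp; }

/* Position of thread t inside its team, 0 .. grp-1. */
int rank_in_team(int t, int grp) { assert(grp > 0); return t % grp; }

/* Split [lo, hi) into `parts` near-equal chunks and return chunk `idx`
 * as [*out_lo, *out_hi). The first (len % parts) chunks get one extra element. */
void split_range(long lo, long hi, int parts, int idx,
                 long *out_lo, long *out_hi)
{
    assert(parts > 0 && idx >= 0 && idx < parts && hi >= lo);
    long len = hi - lo;
    long base = len / parts;
    long rem = len % parts;
    *out_lo = lo + idx * base + (idx < rem ? idx : rem);
    *out_hi = *out_lo + base + (idx < rem ? 1 : 0);
}
```

Inside an OpenMP parallel region, each thread would call `split_range` with `parts = grp` and `idx = rank_in_team(t, grp)` on the j-bounds of the subdomain selected by `team_of(t, grp)`.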


Increasing Data Re-Use on KNL

[Figure: the 4 x 8 subdomain decomposition (SD = SubDomain) over KNL Tile 1 to Tile 32, annotated: each KNL Tile with its 1 MB L2 hosts one OMP TEAM; AFFINITY = compact; k-loop (vectorized), k-dim=128]


Roofline Analysis

[Figure: rooflines for Xeon Phi 7210 (Knights Landing). x-axis: Flops/DRAM Byte, 0.125 to 128 (log scale); y-axis: GFlops/s, 8 to 8192 (log scale). Legend: Peak (FMA), Peak (no FMA), Left wall, Right wall, Measured, Middle. Ceilings marked: Zero FMA Ceiling*, FMA Ceiling*.]

STREAM TRIAD (NERSC site): 440 GB/s
*Ceilings From Intel Advisor 2017 Beta

Cluster Mode: Quadrant
Cache Model: Flat
MCDRAM: 16 GB
1 MPI + 256 OMP


Roofline Analysis

[Figure: roofline plot (GFlops/s vs Flops/DRAM Byte) with an inset bar chart, scale 0 to 2, comparing HSW and KNL]

HSW = 2 x E5-2697 v3
KNL = 1 x 7210 (64-core)
1.67x speedup


Roofline Analysis

[Figure: roofline plot (GFlops/s vs Flops/DRAM Byte) with the measured point marked]

Achieved: 448 GF/s
Current GF/s: ~23% attainable perf


Roofline Analysis

[Figure: roofline plot (GFlops/s vs Flops/DRAM Byte) with three arithmetic-intensity walls marked]

0D caching (AI = 0.28)
2D (plane) caching (AI = 1.28)
3D (block) caching (AI = 4.17)
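The walls on the plot follow from the basic roofline relation: attainable performance is the smaller of the compute ceiling and arithmetic intensity times memory bandwidth. The sketch below encodes that relation; the function names are hypothetical, and the compute ceiling is left as a parameter because the deck gives the Advisor ceilings only graphically.

```c
#include <assert.h>

/* Attainable performance (GFlop/s) under the basic roofline model:
 * min(compute ceiling, arithmetic intensity * memory bandwidth).
 * ai: Flops per DRAM byte; bw_gbs: bandwidth in GB/s;
 * peak_gflops: compute ceiling in GFlop/s (caller-supplied assumption). */
double roofline_gflops(double ai, double bw_gbs, double peak_gflops)
{
    assert(ai >= 0.0 && bw_gbs >= 0.0 && peak_gflops >= 0.0);
    double mem_bound = ai * bw_gbs;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

/* Fraction of the attainable performance that was actually achieved. */
double fraction_attained(double achieved_gflops, double attainable_gflops)
{
    assert(attainable_gflops > 0.0);
    return achieved_gflops / attainable_gflops;
}
```

With the 440 GB/s STREAM figure, the bandwidth-limited bounds are about 123 GF/s for AI = 0.28, 563 GF/s for AI = 1.28 and 1835 GF/s for AI = 4.17, provided the compute ceiling lies above each value.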


Data-reuse Analysis

[Figures: stencil “Tweak” of one point from its input elements, stepped over 5k iterations. First panel: 4 red elements reused = 1D reuse. Second panel: 4 red elements reused = 2D reuse.]

'Full-plane' 2D caching:
•  Assume enough cache to store the operands to tweak a Nj * Nk plane of points.
•  V kernel example, inside k loop:
   •  Evict 3/N velocities.
   •  Load 3/N velocities.
   •  Load 2 pressures
   •  Load 1/N medium property
   •  Execute 23 operations.
   •  Precision = 4 bytes.
•  V kernel AI: 23 / (2 + 7/N) / 4
•  If (N == 6) AI = 1.816
•  P kernel AI: 11 / (2 + 3/N) / 4
•  If (N == 6) AI = 1.1
•  Weighted average AI: 1.279
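The AI figures above can be reproduced with a few lines of arithmetic. In this sketch the function names are hypothetical, and `n` stands for the N used on the slide. The slide does not state the weights behind the weighted average; a V-kernel weight of 0.25 happens to reproduce 1.279, which is an inference made here and not something the deck confirms.

```c
#include <assert.h>

/* Arithmetic intensity (Flops per byte) for full-plane 2D caching.
 * flops: operations per point; whole: operands moved once per point;
 * per_n: operands carrying a 1/N factor; bytes: bytes per operand. */
double plane_ai(double flops, double whole, double per_n,
                double n, double bytes)
{
    assert(n > 0.0 && bytes > 0.0);
    return flops / (whole + per_n / n) / bytes;
}

/* V kernel: 23 operations, 2 + 7/N operands, 4-byte precision. */
double v_kernel_ai(double n) { return plane_ai(23.0, 2.0, 7.0, n, 4.0); }

/* P kernel: 11 operations, 2 + 3/N operands, 4-byte precision. */
double p_kernel_ai(double n) { return plane_ai(11.0, 2.0, 3.0, n, 4.0); }

/* Weighted mean of the two kernels; w_v is the V-kernel weight in [0, 1]. */
double weighted_ai(double ai_v, double ai_p, double w_v)
{
    assert(w_v >= 0.0 && w_v <= 1.0);
    return w_v * ai_v + (1.0 - w_v) * ai_p;
}
```

For N = 6, `v_kernel_ai` gives about 1.816 and `p_kernel_ai` gives 1.1, matching the slide.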



Roofline Analysis with Intel Advisor 2017

•  Fresh results on KNL!

•  Analysis just starting!

•  Already seeing correlation between qualitative & quantitative

•  More to come!

•  Thanks to Zakhar Matveev & Kirill Rogozhin at Intel!

TORTIA “FLOP/L1 Bytes” on KNL


Quantifying impact of inter-L2 traffic

[Figure: four KNL tiles, each with its own L2, holding subdomains SD1 to SD4 of the (i-loop, j-loop) grid; the halo regions between neighbouring subdomains generate inter-L2 traffic]

#define gap (2*par.rad) // twice the stencil radius

iMin = ((t/grp) % nBi) * (bNi + gap) + gap;
iMax = min((long) iMin + bNi, par.Ni);
jMin = ((t/grp) / nBi) * (bNj + gap) + gap;
jMax = min((long) jMin + bNj, par.Nj);

•  Grid size increases
•  Accessing more data
•  Data gaps
•  Only updating same amount of data
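The bounds computation above can be wrapped into a self-contained function for experimentation. This is a sketch only: the `Params` and `Block` structs and the names `block_bounds` and `min_long` are hypothetical, and the reading of the gap (blocks spaced by twice the stencil radius, presumably so that the halos of neighbouring blocks do not overlap) is an interpretation of the slide.

```c
#include <assert.h>

/* Hypothetical container for the grid parameters named on the slide. */
typedef struct { long Ni, Nj; int rad; } Params;

/* Hypothetical result type holding the four bounds. */
typedef struct { long iMin, iMax, jMin, jMax; } Block;

static long min_long(long a, long b) { return a < b ? a : b; }

/* Index bounds of the block updated by the team that thread t belongs to.
 * grp: threads per team; nBi: number of blocks along i;
 * bNi, bNj: block extents along i and j.
 * Consecutive blocks are spaced by gap = 2 * rad, as in the slide's macro. */
Block block_bounds(int t, int grp, int nBi, long bNi, long bNj, Params par)
{
    assert(grp > 0 && nBi > 0);
    long gap = 2L * par.rad;
    int team = t / grp;
    Block b;
    b.iMin = (long)(team % nBi) * (bNi + gap) + gap;
    b.iMax = min_long(b.iMin + bNi, par.Ni);
    b.jMin = (long)(team / nBi) * (bNj + gap) + gap;
    b.jMax = min_long(b.jMin + bNj, par.Nj);
    return b;
}
```

Setting `rad` to 0 makes the gap vanish and packs the blocks contiguously, which gives a baseline for comparing runs with and without shared halo regions.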


Conclusions and Insights

▪  Intel Xeon Phi ‘KNL’ already offering superior performance over Intel Xeon for RTM simulations (with open standards!)
➢ 1.67x speedup over 2 x 14-core Haswell
▪  Early Stages: Now Targeting Even Higher Performance
➢ Investigate Cluster Modes
➢ Improve Cache Blocking: “Push to 3D (Block) Wall”
➢ Modernize MPI for scaling out
▪  Intel Advisor proving to be an invaluable tool: “Roofline Automation” feature will be highly useful!

Thanks to Heinrich Bockhorst for assistance throughout!


Backup

� Results

All of our investigations on Intel Xeon Phi KNL have been carried out on a preproduction platform on the Intel Endeavour cluster made available as part of an early access program. The configuration of the KNL platform used in our investigations is summarised in Table �.

Table �: Knights Landing preproduction platform configuration

Feature        Configuration
Stepping       B�
Core freq.     �.� GHz
# Cores        ��
MCDRAM         16 GB
DDR�           �� MB
Memory Mode    Flat
Cluster Mode   Quadrant
BIOS           GVPDINT�.��B.��.D��.��

The fact that modern compilers such as the Intel compiler have many options to further influence performance can pose challenges when searching for optimal settings. On top of this, the TORTIA code has considerable flexibility when choosing runtime parameters, e.g., array sizes, cache blocking sizes (tile size), number of OpenMP threads, OpenMP scheduling policy, etc., which taken together can often have a dramatic impact on performance. As part of our investigations reported here, we have carried out an initial search of this large parameter space with the aim of finding an optimal configuration of compile-time and run-time parameters for the TORTIA code running on Intel Xeon Ivy Bridge, Intel Xeon Haswell, Intel Xeon Phi KNC and Intel Xeon Phi KNL platforms.

In Section � and Section �, we provide tentative results for the optimal compile-time and run-time parameters found as part of our early stage investigations on KNL. These results will be updated and extended in the lead up to the ISC’16 IXPUG Workshop. These results will be complemented by Intel VTune reports in the lead up to the ISC’16 IXPUG Workshop.

When using � MPI processes on the KNL platform with the parameters per process as seen in Section �, we can achieve a higher throughput performance of �� GFlops relative to �� GFlops on the two-socket Haswell E�-�� platform.


Knights Landing Platform Configuration on Endeavour System

icc version 16.0.3
Intel(R) MPI Library 5.1.3 for Linux