OPESCI: Open Performance portablE Seismic Imaging

Marcos de Aguiar 2, Gerard Gorman 1, David Ham 1, Christian Jacobs 1, Paul Kelly 1, Michael Lange 1, Fabio Luporini 1, Renato Miceli 2, Tianjiao Sun 1, Felippe Vieira Zacarias 2

1 Imperial College London, United Kingdom; 2 SENAI CIMATEC, Salvador, Brazil

Motivation

Seismic imaging, used in energy exploration, is arguably the most compute- and data-intensive application in the private sector. The bulk of the computational cost relates to simulating the propagation of waves through the subsurface. Most seismic imaging codes today use a simplified physics model based on the acoustic equation. Using the elastic wave equation would provide a more accurate physics model and ultimately lead to more accurate subsurface images; however, it would dramatically increase the cost of an already expensive computation.

Computer architectures are rapidly evolving and diversifying to continue delivering performance increases within tight energy envelopes. These changes offer opportunities, but also demand disruptive software changes to harness the full hardware potential. The question is how to achieve performance portability across diverse, rapidly evolving architectures, in spite of the sharp trade-off between easy-to-maintain, portable software written in high-level languages such as Python, and highly optimized, parallel codes for modern many-core architectures.

The proposed solution is to leverage domain-specific languages (DSLs) and code generation. At the highest level of abstraction, application developers write algorithms in a clear, concise manner akin to how algorithms are written mathematically, while at the lower levels source-to-source compilers explore rich implementation spaces to transform DSL codes into highly optimized native codes that can be compiled for target platforms at near-to-peak performance. The framework is layered so that domain experts are decoupled from code-tuning specialists, and different optimized code-generator back ends can be swapped in.

Full Waveform Inversion

• Given a sound source, an array of receivers, and a physics model for how seismic energy propagates, use a wiggle-for-wiggle comparison to infer the structure of the subsurface.

• Today a seismic survey yields on the order of a petabyte of data; this is expected to grow to tens of petabytes within 12-18 months.

• The subsurface image is built up in an iterative process by running the physics model forwards and backwards, all the while minimising the difference between the measured data and the simulated data (see the toy sketch after this list).

• The dominant computational expense is running the wave model (also known as the forward model, or propagator) for each shot of data.

• The current state of the art is to use a modified acoustic equation.
• A linear elastic model results in greater accuracy, but this greatly increases the computational cost.
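A minimal, runnable Python sketch of the iterative loop described above. The linear toy "propagator" (a random matrix G), the model and data arrays, and the step size are illustrative assumptions, not OPESCI code; the point is how repeated forward and adjoint (backward) runs drive down the misfit between simulated and measured data.

    import numpy as np

    # Toy FWI loop: the "wave propagator" here is just a linear map d = G @ m,
    # standing in for a full simulation. G, m_true, and the step size are
    # hypothetical stand-ins chosen only to make the example self-contained.
    rng = np.random.default_rng(0)
    G = rng.standard_normal((20, 5))     # hypothetical forward operator
    m_true = rng.standard_normal(5)      # "true" subsurface model
    d_obs = G @ m_true                   # measured data for one shot

    m = np.zeros(5)                      # initial model guess
    for _ in range(500):
        residual = G @ m - d_obs         # simulated minus measured data
        gradient = G.T @ residual        # adjoint (backward) run yields the gradient
        m -= 0.01 * gradient             # gradient-based model update
    print(np.allclose(m, m_true, atol=1e-4))  # True: misfit driven to zero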

Opportunity & challenge

1. Elastic wave equation

• Provides a significantly better representation of the wave field than the standard acoustic models.
• Models both P-waves and S-waves.
• Also much more expensive to compute: S-waves travel at about half the speed of P-waves, and therefore have half the wavelength, so:
  • Grid resolution needs to be doubled (a factor-of-8 increase in memory for 3D).
  • The time step needs to be halved, so twice as many time steps must be executed (i.e. roughly a 16x increase in work from the resolution and time-step changes alone).

2. Advanced numerical methods

• Regular grids with finite differences are the modus operandi of the O&G industry.

• Higher-order methods can achieve greater accuracy at lower cost:
  • Coarser resolution (fewer grid points).
  • Larger time step.
  • Great data locality, opportunities for vectorisation, etc.
• However, implementation complexity: high-order methods are many times more involved (see the stencil-weights sketch below).
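As a small illustration of where the higher-order schemes come from, the following snippet uses SymPy's finite_diff_weights to derive the classic 5-point, 4th-order central-difference weights for a second derivative; the grid-spacing symbol h and the choice of stencil are the only assumed parameters, and this is not taken from the OPESCI-FD source.

    import sympy as sp

    # 4th-order central-difference weights for d2u/dx2 on a uniform grid.
    h = sp.symbols('h', positive=True)
    points = [i * h for i in range(-2, 3)]         # 5-point stencil offsets
    # finite_diff_weights(deriv_order, grid_points, x0); [2][-1] selects the
    # weights for the 2nd derivative using all five points.
    weights = sp.finite_diff_weights(2, points, 0)[2][-1]
    print(weights)  # [-1/(12*h**2), 4/(3*h**2), -5/(2*h**2), 4/(3*h**2), -1/(12*h**2)]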

3. Architecture and code modernisation

• Hardware continues to track Moore's Law, but software must work ever harder to exploit it.

• Must exploit parallelism at every level to achieve good performance, e.g.:
  • Various parallel programming models (MPI, threads, etc.).
  • Deep memory hierarchy, data locality.
  • SIMD vectorisation.
  • Heterogeneous computing: Intel® Xeon®, Intel® Xeon Phi™.

• Increasingly difficult for domain specialists to implement HPC software.
• Traditional numerical algorithms may need to be discarded in favor of methods better suited to computer architectures.

• Domain-specific languages (DSLs) offer a route to bridge the divide between domain specialists (often the application developers) and parallel programming specialists (in this case, compiler writers).

Architectural layout

The architecture diagram maps stakeholder roles to the layers of the framework:

• Geophysicist: inversion algorithm in a high-level language such as Python or Apache Spark; nonlinear gradient-based optimization methods; compressive sensing (randomised sparse sampling).
• Numerical analyst: forward models written using the DSL (2D/3D; acoustic/elastic wave equation; isotropic/anisotropic elastic moduli; time domain); backward/adjoint model (code generation); gradient and Hessian operators (code generation); reference implementations of kernels (Fortran, C, etc.).
• Library and DSL compiler developers: stencil DSL for finite difference (OPESCI-FD); iterative solvers and optimisation (PETSc/TAO); UFL for finite element (Firedrake); seismic data I/O; platform-specific data layouts and task scheduling; code generation for MPI with OpenMP or OpenCL; anisotropic adaptive meshes.
• Platform specialist: platform-tuned kernels; autotuning frameworks; MPI, OpenMP, OpenCL; x86_64, Intel Phi, GPGPU, etc.; future architectures.

Finite difference using symbolic mathematics and code generation

• SymPy is a symbolic maths library in Python.
  • Systematically derive the computational kernel from the governing equations (see the sketch below).
  • Extensible code generation classes that can be used to generate bespoke code.
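A minimal sketch of this derivation step, assuming a 1D acoustic wave equation u_tt = c^2 u_xx and plain scalar symbols for the grid values; the symbol names and the use of sympy.ccode are illustrative, not the OPESCI-FD implementation.

    import sympy as sp

    # Point values of the wavefield around the update point (illustrative names).
    u_np1, u_n, u_nm1 = sp.symbols('u_np1 u_n u_nm1')  # times n+1, n, n-1 at point i
    u_ip1, u_im1 = sp.symbols('u_ip1 u_im1')           # spatial neighbours at time n
    c, dt, dx = sp.symbols('c dt dx', positive=True)

    # Second-order central differences for u_tt = c^2 * u_xx.
    u_tt = (u_np1 - 2*u_n + u_nm1) / dt**2
    u_xx = (u_ip1 - 2*u_n + u_im1) / dx**2

    # Solve the discretised equation for the updated value and emit C code.
    update = sp.solve(sp.Eq(u_tt, c**2 * u_xx), u_np1)[0]
    print(sp.ccode(sp.simplify(update)))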

• Finite difference is highly structured.
  • Make use of Python templating libraries (Mako, Jinja) to insert the kernel into pre-prepared code templates (see the sketch below).
  • Parameters (e.g. domain size, grid spacing) can be passed to templates as additional keywords.
  • One template file can be used for a variety of problems.
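A minimal sketch of the templating step using Jinja2; the inline C template, the parameter values, and the placeholder kernel body are hypothetical stand-ins (a real template would live in its own file and receive the SymPy-generated kernel body from the previous step).

    from jinja2 import Template

    # Pre-prepared C template with placeholders for parameters and kernel body.
    template = Template("""
    #include <stddef.h>

    void kernel(float *u, const float *u_prev, size_t nx) {
        const float c = {{ c }}f, dt = {{ dt }}f, dx = {{ dx }}f;
        for (size_t i = 1; i < nx - 1; ++i) {
            u[i] = {{ body }};
        }
    }
    """)

    # Render with problem parameters and a (placeholder) generated kernel body.
    print(template.render(c=1.5, dt=0.001, dx=0.5,
                          body="2.0f*u[i] - u_prev[i]"))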

• Only changing parameters at the symbolic level, without directly modifying source code.
  • Parameter/scheme autotuning.

• Separate mathematics from implementation.
  • Generate multiple implementations in different stencil languages from the same math specification.
  • Only need to extend the code generation class and template file to support a new stencil language.

Researcher's code written in a specific DSL (e.g. Python/SymPy) → source-to-source translator → platform-specific optimized code (e.g. C++ with OpenMP).

Initial tests using the elastic wave equation

Figure: Roofline model for the wave equation on Intel® Xeon® (single-precision performance in GFlops/s vs. arithmetic intensity in Flops/Byte; rooflines annotated at 960 Gflops/s theoretical and 892 Gflops/s achieved; points for 2nd-, 4th-, 6th- and 8th-order stencils). Why the 8th-order performance did not increase proportionally is under investigation.
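For reference, the roofline bound [1] that these plots visualise can be computed directly. In the sketch below the peak value is taken from the Xeon figure, while the bandwidth figure is an illustrative assumption, not a measured number from the poster.

    # Roofline model [1]: attainable GFlops/s is capped by the lower of the
    # compute peak and (arithmetic intensity x memory bandwidth).
    def roofline(ai_flops_per_byte, peak_gflops, bandwidth_gb_per_s):
        return min(peak_gflops, ai_flops_per_byte * bandwidth_gb_per_s)

    peak = 960.0       # theoretical single-precision peak from the Xeon figure
    bandwidth = 60.0   # GB/s; illustrative assumption for this sketch
    for ai in (0.25, 1.0, 4.0, 16.0):
        print(ai, roofline(ai, peak, bandwidth))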

Figure: Elastic wave stencil, manual vs. generated (GFlop/s on Intel® Xeon® and Intel® Xeon Phi™, scale 0-120; bars for the reference 4th-order, generated 4th-order, and generated 8th-order (best) kernels).

Figure: Roofline model for the wave equation on Intel® Xeon Phi™ (single-precision performance in GFlops/s vs. arithmetic intensity in Flops/Byte; rooflines annotated at 2021 Gflops theoretical and 1371 Gflops achieved; points for 2nd-, 4th-, 6th- and 8th-order stencils). Why the 8th-order performance did not increase proportionally is under investigation.

Results

• The operational intensity of the reference code (OpenMP) is relatively low.
• The generated code (OpenMP) increased the intensity, fixed data alignment, and enabled SIMD vectorization, and was thus able to achieve much better performance.

• There is still some room for performance gain in the generated code, according to the roofline estimation and initial profiling, but the code was able to come close to the theoretical peak.

• The generated code outperforms the reference implementation on both the Intel® Xeon® and Intel® Xeon Phi™ platforms.

Future work

• Profile the code on both Intel® Xeon® and Intel® Xeon Phi™.
• Implement different code generators for each platform if necessary.
• Optimize to try to reach the roofline-estimated value for this algorithm.
• Test polyhedral approaches for better cache usage.
• Test other equations.
• Optimise the backend compiler for Intel® Xeon Phi™.
• Build an FWI framework on top of this high-level abstraction.

Links

• OPESCI – http://opesci.github.io (Intel PCC project)
• AMCG – http://amcg.ese.ic.ac.uk

References

[1] Williams, S., Waterman, A., Patterson, D. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM 52(4), 2009.

[2] Borges, L. "3D Finite Differences on Multi-core Processors." Available online at http://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors, 2011.

[3] Reinders, J., Jeffers, J. High Performance Parallelism Pearls. Morgan Kaufmann, 2014.

[4] Joyner, D., Čertík, O., Meurer, A., Granger, B. E. "Open Source Computer Algebra Systems: SymPy." ACM Communications in Computer Algebra 45, 2011.

Acknowledgements

Open Performance portablE SeismiC Imaging (OPESCI) is funded under the Intel Parallel Computing Center program, a unique collaboration between industrial partners and researchers at Imperial College London and SENAI CIMATEC, focused on developing disruptive open-source software for HPC architectures. The authors would also like to acknowledge support from EPSRC grant EP/L000407/1.