Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013 · 19/31...
Accelerating Atmospheric Simulation
on GPU, FPGA, and MIC
Haohuan Fu
Center for Earth System Science
Tsinghua University, Beijing
Sep/19/2013 @ NCAR
The Center for Earth System Science,
Tsinghua University
Started in 2009
Study of the earth as an integrated system
to investigate interactions between atmosphere, land, water, ice, biosphere, societies, technologies, and economics
observing, understanding, and predicting global changes
to guide political / economic / technical decisions at different scales to ensure sustainable development
CMIP5: LASG-CESS
The Present Faculty
Tsinghua HPGC Group
HPGC: high performance geo-computing http://www.thuhpgc.org
High performance computational solutions for geoscience applications
simulation-oriented research: providing highly efficient and highly scalable simulation applications (climate modeling, exploration geophysics)
data-oriented research: data processing, data compression, and data mining
Combine optimizations from three different perspectives (Application, Algorithm, and Architecture), especially focused on new accelerator architectures
Application
Climate Modeling
global-scale atmospheric simulation (800 Tflops Shallow Water Equation)
GPU-based acceleration for GEOS-Chem
FPGA-based acceleration for weather forecasting
Exploration Geophysics
forward modeling / inversion / migration
Remote Sensing Data Processing
data analysis, visualization, correlation of different data sets
Algorithm
parallel Stencil on Different HPC Architectures
parallel Sparse Matrix Solver
parallel Data Compression (PLZMA)
hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup
Architecture
multi-core/many-core (CPU, GPU, MIC)
reconfigurable hardware (FPGA)
Outline
Tianhe-1A: GPU
Maxeler DFE: FPGA
Tianhe-2: MIC
Future Plan & Discussion
Multidisciplinary Collaborations
Prof. Chao Yang, Institute of Software, CAS
computational mathematics
Dr. Wei Xue, Department of Computer Science, Tsinghua University
HPC (MPI, OpenMP, MIC)
Dr. Haohuan Fu, Center for Earth System Science, Tsinghua University
HPC (accelerators, GPU, FPGA, MIC)
Prof. Lanning Wang, College of Global Change and Earth System Science, Beijing Normal University (BNU-ESM)
climate scientist
Highly-Scalable Framework for Global Atmospheric Simulation on Tianhe-1A
Starting from the shallow water equations
cubed-sphere mesh grid
adjustable partition between CPU and GPU
scale to 40,000 CPU cores and 3,750 GPUs with a sustained performance of 800 TFlops
Mesh and Stencil
13-point stencil
• Spatially discretized with a cell-centred finite volume method
• Integrated with a second-order accurate TVD Runge-Kutta method
Interpolation across patches
1-d linear interpolation
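The second-order TVD Runge-Kutta integrator named on this slide is commonly written in a two-stage form (an Euler predictor followed by an averaging step). The sketch below shows that form only; `rhs` is a hypothetical stand-in for the finite-volume spatial operator, which the slides do not spell out, and the decay equation is a toy test problem rather than the SWEs.

```python
import math

def tvd_rk2_step(u, dt, rhs):
    """One step of the two-stage, second-order TVD Runge-Kutta scheme.

    u   : list of floats (current state)
    rhs : function returning du/dt as a list of the same length
    """
    # Stage 1: forward Euler predictor.
    k1 = rhs(u)
    u1 = [ui + dt * ki for ui, ki in zip(u, k1)]
    # Stage 2: average the old state with a second Euler step from u1.
    k2 = rhs(u1)
    return [0.5 * ui + 0.5 * (u1i + dt * k2i)
            for ui, u1i, k2i in zip(u, u1, k2)]

def decay(u):
    # Toy right-hand side du/dt = -u (exact solution: u(0) * exp(-t)).
    return [-ui for ui in u]

def integrate(n_steps, t_end=1.0):
    u = [1.0]
    dt = t_end / n_steps
    for _ in range(n_steps):
        u = tvd_rk2_step(u, dt, decay)
    return u[0]

err_coarse = abs(integrate(50) - math.exp(-1.0))
err_fine = abs(integrate(100) - math.exp(-1.0))
print(err_coarse / err_fine)  # close to 4 for a second-order method
```

Halving the time step should cut the error by about a factor of four, which is a quick sanity check when swapping in a real spatial operator.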
Improved hybrid CPU-GPU Algorithm
Note: halo1/2/3/4 — the 4 steps of the “pipe-flow” communication scheme
adjustable partition between CPU and GPU
“Pipe-Flow” Scheme for Message-Passing on Cubed-Sphere
Four steps to arrange conflict-free message-passing on cubed-sphere.
The arrows indicate directions of data entering or exiting patches as a pipe flow.
For more details, please refer to our PPoPP 2013 paper: “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.
Highly-Scalable Atmospheric Simulation on Data-Flow Engines
Maxeler Data-Flow Engine (DFE)
Field-Programmable Gate Arrays (FPGA)
24 GB onboard memory
PCIe connection to host
MaxRing connection between cards
Hybrid CPU+FPGA Design
[Figure: a patch split into an inner part (FPGA) and an outer part (CPU), with a halo on each of the four sides]
For each stencil cycle
FPGA side:
① Inner-part stencil
CPU side:
① Update halos
② Interpolate if necessary
③ Outer-part stencils
BARRIER: CPU-FPGA exchange
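The inner/outer split can be illustrated with a small sketch. It assumes a square patch and a stencil that reaches two cells in each direction (`halo_width=2`); both are assumptions made for illustration, and the function name is hypothetical.

```python
def split_inner_outer(n, halo_width=2):
    """Classify the cells of an n x n patch.

    Inner cells have their whole stencil inside the patch, so they can be
    updated without waiting for fresh halo data. Outer cells need halos.
    halo_width=2 is an assumed stencil reach, not a value from the slides.
    """
    inner, outer = [], []
    for i in range(n):
        for j in range(n):
            is_inner = (halo_width <= i < n - halo_width
                        and halo_width <= j < n - halo_width)
            (inner if is_inner else outer).append((i, j))
    return inner, outer

inner, outer = split_inner_outer(8)
print(len(inner), len(outer))  # 16 48
```

For realistic patch sizes the inner part dominates, which is why it is the part handed to the accelerator while the CPU deals with the thin outer ring and the communication.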
CPU-FPGA Work Flow
[Figure: timeline of one stencil cycle (steps ① to ⑤, ending at a BARRIER). CPU: halo updating, interpolation, outer-part stencil. FPGA: inner-part stencil. Transfers: C2F, F2C.]
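The overlap in this workflow can be mimicked on a host with two threads: one stands in for the FPGA kernel, the other runs the CPU-side chain, and the exchange only happens once both have finished. This is a schematic sketch using Python's standard thread pool; the task bodies are placeholders that only log their names.

```python
from concurrent.futures import ThreadPoolExecutor

def run_cycle(log):
    """One stencil cycle; each task only records its name in `log`."""
    def accelerator_task():
        log.append("inner-part stencil")   # stand-in for the FPGA kernel

    def cpu_tasks():
        log.append("halo updating")
        log.append("interpolation")
        log.append("outer-part stencil")

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(accelerator_task), pool.submit(cpu_tasks)]
        for f in futures:
            f.result()                     # BARRIER: wait for both sides
    log.append("CPU-FPGA exchange")        # runs only after the barrier
    return log

events = run_cycle([])
print(events[-1])  # CPU-FPGA exchange
```

The interleaving of the inner-part stencil with the CPU steps is not deterministic, but the exchange is always last.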
Go for a Mixed-Precision Design
Baseline: a straightforward double-precision implementation of the SWEs
Resource Cost of SWEs on Virtex-6 SX475T
Resource   baseline
LUTs       299 %
FFs        220 %
BRAMs      20 %
DSPs       189 %
Floating point operations of SWE stencil
Operations   num
ADD/SUB      434
MUL          570
DIV          99
Others       45
Precision-based optimization to further decrease the resource usage
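A quick tally of the operation counts above; the numbers are copied from the slide, and the percentages are derived here rather than stated in the talk.

```python
# Floating-point operations of the SWE stencil, as listed on the slide.
ops = {"ADD/SUB": 434, "MUL": 570, "DIV": 99, "Others": 45}

total = sum(ops.values())
share = {name: count / total for name, count in ops.items()}

for name, count in ops.items():
    print(f"{name:8s} {count:4d} {share[name]:6.1%}")
print("total", total)  # total 1148
```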
Analysis of the Dynamic Range
[Figure: dynamic-range analysis; parts of the design are labelled fixed-point, one part is labelled reduced-precision]
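One generic way to turn a measured dynamic range into a fixed-point format is to size the integer part from the largest magnitude and choose the fraction bits separately. The sketch below illustrates that idea only; the range and bit widths are made-up values, not figures from the talk.

```python
import math

def fixed_point_format(lo, hi, frac_bits):
    """Return (integer_bits, total_bits) of a signed fixed-point format
    able to hold every value in [lo, hi]."""
    max_abs = max(abs(lo), abs(hi))
    if max_abs > 0:
        # Bits for the integer part of the magnitude, e.g. 5 -> 3 bits (101).
        int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    else:
        int_bits = 0
    return int_bits, 1 + int_bits + frac_bits   # 1 sign bit

def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale

# Hypothetical variable observed in the range [-3.5, 12.25].
print(fixed_point_format(-3.5, 12.25, 20))  # (4, 25)
print(quantize(0.1, 8))                     # 0.1015625
```

Variables with a narrow range fit this scheme well; variables with a wide range are the natural candidates for floating point with a shortened mantissa.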
Precision Exploration
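Precision exploration can be prototyped in software by rounding every intermediate result to a given number of mantissa bits and comparing against a double-precision reference. The following is a sketch of that general approach, not the procedure used in the talk; `kernel` is a hypothetical stand-in for one stencil term and the input values are arbitrary.

```python
import math

def round_mantissa(x, bits):
    """Round x to `bits` significant binary digits (software emulation)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def kernel(a, b, c, bits=None):
    """Toy expression a*b + c, rounded after every operation if bits is set."""
    q = (lambda v: v) if bits is None else (lambda v: round_mantissa(v, bits))
    return q(q(q(a) * q(b)) + q(c))

def smallest_mantissa(tol, candidates=range(8, 53)):
    """Smallest candidate mantissa width whose relative error is within tol."""
    args = (1.2345678, 9.8765432, 0.1234567)   # arbitrary sample inputs
    ref = kernel(*args)
    for bits in candidates:
        if abs(kernel(*args, bits=bits) - ref) / abs(ref) <= tol:
            return bits
    return None

best = smallest_mantissa(1e-6)
```

In a real exploration the loop would run the full solver over representative inputs and compare against an accuracy requirement set by the application.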
General Architecture of the Mixed-Precision Design
Resource Cost of SWEs on Virtex-6 SX475T
Baseline: a straightforward double-precision implementation of the SWEs
Mixed-precision: fixed-point and reduced-precision floating-point
Resource   baseline   mixed-precision
LUTs       299 %      76.17 %
FFs        220 %      53.41 %
BRAMs      20 %       12.59 %
DSPs       189 %      44.84 %
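Reading the table as percentages of the device, a short check shows that the baseline does not fit on one chip while the mixed-precision design does, and by how much each resource shrinks. The values are copied from the table; the reduction factors are derived here.

```python
# (baseline %, mixed-precision %) of the device, from the table above.
usage = {
    "LUTs":  (299.0, 76.17),
    "FFs":   (220.0, 53.41),
    "BRAMs": (20.0, 12.59),
    "DSPs":  (189.0, 44.84),
}

baseline_fits = all(base <= 100.0 for base, _ in usage.values())
mixed_fits = all(mixed <= 100.0 for _, mixed in usage.values())
reduction = {name: base / mixed for name, (base, mixed) in usage.items()}

print(baseline_fits, mixed_fits)  # False True
for name, factor in reduction.items():
    print(f"{name:6s} {factor:.2f}x smaller")
```

LUTs, FFs, and DSPs each shrink by roughly a factor of four; BRAM usage, which was never the bottleneck, shrinks by about 1.6x.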
Hardware Platform: Maxeler DFEs
Environment
MaxCompiler development tool
MaxWorkstation
• One Intel i7 quad-core CPU
• One accelerator card (Virtex-6 SX475T & 24 GB DRAM)
MaxNode
• 12 Intel Xeon CPU cores
• Four accelerator cards (Virtex-6 SX475T & 24 GB DRAM)
Performance Results
Mesh size: 1024 × 1024 × 6
Platform          Performance (points/second)   Speedup
6-core CPU        4.66K                         1
Tianhe-1A node    110.38K                       23x
MaxWorkstation    468.1K                        100x
MaxNode           1.54M                         330x
MaxNode speedup over Tianhe-1A node: 14 times
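The speedup column can be recomputed from the throughput column. The values below are copied from the table; the ratios suggest the slide rounds down to whole numbers in most rows.

```python
# Throughput in points per second, from the table above.
perf = {
    "6-core CPU": 4.66e3,
    "Tianhe-1A node": 110.38e3,
    "MaxWorkstation": 468.1e3,
    "MaxNode": 1.54e6,
}

base = perf["6-core CPU"]
speedup = {name: value / base for name, value in perf.items()}
maxnode_vs_tianhe = perf["MaxNode"] / perf["Tianhe-1A node"]

for name, s in speedup.items():
    print(f"{name:15s} {s:6.1f}x")
print(f"MaxNode vs Tianhe-1A node: {maxnode_vs_tianhe:.1f}x")
```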
Power Efficiency
Mesh size: 1024 × 1024 × 6
Platform          Efficiency (points/(second × watt))   Speedup
6-core CPU        20.71                                 1
Tianhe-1A node    306.6                                 14.8x
MaxWorkstation    2.52K                                 121.6x
MaxNode           3K                                    144.9x
MaxNode is 9 times more power efficient than the Tianhe-1A node
For more details, please refer to our FPL 2013 paper: “Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine”, in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications, 2013.
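Dividing throughput by efficiency gives the power draw that each row implies. These wattages are derived from the two tables and are not stated on the slides, so treat them as a consistency check only.

```python
# (points/second, points/(second*watt)), from the two tables above.
rows = {
    "6-core CPU": (4.66e3, 20.71),
    "Tianhe-1A node": (110.38e3, 306.6),
    "MaxWorkstation": (468.1e3, 2.52e3),
    "MaxNode": (1.54e6, 3.0e3),
}

implied_watts = {name: perf / eff for name, (perf, eff) in rows.items()}
efficiency_gain = rows["MaxNode"][1] / rows["Tianhe-1A node"][1]

for name, watts in implied_watts.items():
    print(f"{name:15s} about {watts:.0f} W")
print(f"MaxNode vs Tianhe-1A node efficiency: {efficiency_gain:.1f}x")
```

The efficiency ratio comes out a little under 10, consistent with the "9 times" quoted on the slide.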
Tianhe-2: Brief Introduction
Tianhe-2
16,000 nodes
each node contains two 12-core Intel Ivy Bridge CPUs and three Intel Xeon Phi accelerator cards
peak: 54.9 PFlops
LINPACK: 33.8 PFlops
Running SWE on Tianhe-2
Hierarchical 2D domain decomposition
Balanced CPU/MIC utilization
0-3 MICs
adjustable blocks
[Figure panels: without MICs, with 1 MIC, with 2 MICs, with 3 MICs]
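A two-level decomposition of this kind can be sketched as follows: level 1 cuts the patch into a 2D grid of tiles, and level 2 divides each tile's rows between the CPU and however many MICs are in use, in proportion to adjustable weights. All function names and the weights are hypothetical; the slides do not give the actual block sizes.

```python
def split_range(n, parts):
    """Split range(n) into contiguous, nearly equal (start, stop) pairs."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def split_by_weight(start, stop, weights):
    """Split rows [start, stop) into contiguous blocks proportional to weights."""
    n, total = stop - start, sum(weights)
    blocks, pos, acc = [], start, 0.0
    for w in weights:
        acc += w
        end = start + round(n * acc / total)
        blocks.append((pos, end))
        pos = end
    return blocks

def decompose(n, px, py, device_weights):
    """Level 1: px x py tiles. Level 2: each tile's rows are shared among
    the devices (CPU first, then 0 to 3 MICs) according to device_weights."""
    layout = {}
    for ix, (x0, x1) in enumerate(split_range(n, px)):
        for iy, (y0, y1) in enumerate(split_range(n, py)):
            layout[(ix, iy)] = [((r0, r1), (y0, y1))
                                for r0, r1 in split_by_weight(x0, x1, device_weights)]
    return layout

# CPU plus 3 MICs; the relative weights are made up for illustration.
layout = decompose(1024, 2, 2, [1.0, 1.3, 1.3, 1.3])
```

Passing a single weight (`[1.0]`) reproduces the "without MICs" case; tuning the weights is what keeps CPU and MIC utilization balanced.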
Running SWE on Tianhe-2: Workflow
Note: C2M = data movement from CPU to MIC
M2C = data movement from MIC to CPU
halo1/2/3/4 = the 4 steps of the “pipe-flow” communication scheme
Optimization Scheme
base version
• serial
• no-VEC
multi-thread
• OpenMP
multi-thread + VEC
• compiler
• Cilk
Cache blocking
• different level
• auto-search
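The cache-blocking and auto-search steps can be illustrated with a small sketch. A 5-point averaging stencil is swapped in here for brevity (it is not the SWE stencil), and in pure Python the timings do not reflect real cache behaviour; the point is the loop structure and the search over tile sizes.

```python
import time

def update(a, i, j):
    # 5-point average, a simplified stand-in for the real stencil.
    return 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])

def sweep_plain(a):
    """Update every interior cell in row-major order."""
    n = len(a)
    out = [row[:] for row in a]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = update(a, i, j)
    return out

def sweep_blocked(a, bi, bj):
    """Same update, visiting the grid tile by tile (tile size bi x bj)."""
    n = len(a)
    out = [row[:] for row in a]
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, n - 1)):
                    out[i][j] = update(a, i, j)
    return out

def auto_search(a, candidates):
    """Time each candidate tile size and return the fastest one."""
    best = None
    for bi, bj in candidates:
        t0 = time.perf_counter()
        sweep_blocked(a, bi, bj)
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[0]:
            best = (elapsed, (bi, bj))
    return best[1]

n = 16
grid = [[float(i * n + j) for j in range(n)] for i in range(n)]
candidates = [(4, 4), (4, 8), (8, 8)]
chosen = auto_search(grid, candidates)
```

Blocking changes only the order in which cells are visited, so the blocked sweep must reproduce the plain sweep exactly; that equality is a useful regression test while tuning.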
Scaling the Performance
SWE Performance on Tianhe-2
MIC against 24 CPU cores
[Figure: bar chart of speedup for 1 MIC, 2 MICs, and 3 MICs at mesh sizes 1024*1024*6, 2048*2048*6, and 4096*4096*6; y-axis from 0 to 3.5; bar labels in extraction order: 1.34, 2.11, 2.62, 1.22, 2.33, 2.91, 1.18, 2.26, 3.15]
SWE Performance on Tianhe-2
Weak scaling using up to 8,652 nodes (207,648 CPU cores + 1,583,316 MIC cores)
[Figure: weak-scaling efficiency; y-axis from 90% to 100%; x-axis ticks: 6, 24, 96, 384, 1536, 1944, 2400, 3456, 4056]
# of unknowns for the largest run: 200 billion
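The core counts quoted for the largest run can be checked against the node description given earlier (two 12-core CPUs and three MIC cards per node, 16,000 nodes in total). The per-card core count below is derived from these numbers and is not stated on the slides.

```python
nodes = 8_652
cpu_cores = 207_648
mic_cores = 1_583_316

cpu_cores_per_node = cpu_cores // nodes       # two 12-core CPUs
mic_cores_per_node = mic_cores // nodes
mic_cores_per_card = mic_cores_per_node // 3  # assuming all 3 cards per node are used
fraction_of_machine = nodes / 16_000

print(cpu_cores_per_node, mic_cores_per_node, mic_cores_per_card)  # 24 183 61
print(f"{fraction_of_machine:.1%} of the 16,000 nodes")            # 54.1% of the 16,000 nodes
```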
Highly-Scalable Framework for Global Atmospheric Simulation
evolve from “2D Shallow Water Wave Equations” to “3D Euler Equations”
Highly-Scalable Framework for Global Atmospheric Simulation
Model development:
from 2D SWE to 3D Euler
coupling the 3D Euler dynamics with physics processes to test for local and global scenarios
HPC:
an FPGA-based cluster for climate modeling?
larger-scale runs on 100P supercomputer (dynamic + physics)