The Near Future of HPC at NERSC
Transcript of The Near Future of HPC at NERSC
Jack Deslippe
July, 2015
NERSC is the production HPC center for the DOE Office of Science
• 4,000 users, 500 projects
• From 48 states; 65% from universities
• Hundreds of users each day
• 1,500 publications per year
Systems designed for science
NERSC General Resources
Edison (Cray XC30, Ivy Bridge): 5,576 compute nodes; 24 cores and 64 GB each; 2.6 petaflops peak
Hopper (Cray XE6): 6,384 compute nodes, 153,216 cores; 1.3 petaflops peak
Global Filesystem (NGF): 4 PB scratch capacity; 80 GB/s bandwidth
Test beds: Babbage (Xeon Phi), Dirac (GPU)
Archival storage: 40 PB HPSS capacity; 80 GB/s bandwidth
Other: expert consultants, analytics/visualization resources (R Studio, VisIt), science gateways, global project directories, NX, science databases, ESnet, Globus Online, web resources and hosting, training
We are Hiring!
Getting Your Own NERSC Account
To be able to run at NERSC you need to have an account and an allocation.
• An account is a username and password
• Simply fill out the Account Request Form: https://nim.nersc.gov/nersc_account_request.php
• An allocation is a repository of CPU hours
• Good news! You already have an allocation!
• Repo m1266. Talk to me to get access.
Getting Your Own NERSC Account
• If you have exhausted your CSGF allocation, apply for your own allocation with DOE
• Research must be relevant to the mission of the DOE
• https://www.nersc.gov/users/accounts/
• Builds relationship with DOE program managers
Cori
What is different about Cori?
Edison (Ivy Bridge):
● 12 Cores Per CPU
● 24 Virtual Cores Per CPU
● 2.4-3.2 GHz
● Can do 4 Double Precision Operations per Cycle (+ multiply/add)
● 2.5 GB of Memory Per Core
● ~100 GB/s Memory Bandwidth
Cori (Knights Landing):
● 60+ Physical Cores Per CPU
● 240+ Virtual Cores Per CPU
● Much lower clock speed
● Can do 8 Double Precision Operations per Cycle (+ multiply/add)
● < 0.3 GB of Fast Memory Per Core
● < 2 GB of Slow Memory Per Core
● Fast memory has ~ 5x DDR4 bandwidth
Basic Optimization Concepts
MPI Vs. OpenMP For Multi-Core Programming
[Diagram: MPI — each CPU core has its own memory, connected to the others through a network interconnect. OpenMP — CPU cores share one memory holding shared arrays, with private arrays per thread.]

OpenMP typically has less memory overhead/duplication. Communication is often implicit, through cache coherency and the runtime.
PARATEC Use Case For OpenMP
PARATEC computes parallel FFTs across all processors.
Involves MPI all-to-all communication (small messages, latency bound).
Reducing the number of MPI tasks in favor of OpenMP threads makes a large improvement in overall runtime.
Figure Courtesy of Andrew Canning
Vectorization
There is another important form of on-node parallelism
Vectorization: CPU does identical operations on different data; e.g., multiple iterations of the above loop can be done concurrently.
do i = 1, n
  a(i) = b(i) + c(i)
enddo
Intel Xeon Sandy-Bridge/Ivy-Bridge: 4 Double Precision Ops Concurrently
Intel Xeon Phi: 8 Double Precision Ops Concurrently
NVIDIA Kepler GPUs: 32+ SIMT
Things that prevent vectorization in your code
Compilers want to “vectorize” your loops whenever possible. But sometimes they get stumped. Here are a few things that prevent your code from vectorizing:
Loop dependency:

do i = 1, n
  a(i) = a(i-1) + b(i)
enddo

Task forking:

do i = 1, n
  if (a(i) < x) cycle
  if (a(i) > x) …
enddo
Things that prevent vectorization in your code

Example From NERSC User Group Hackathon (Astrophysics Transport Code)

… …

30% speed up for entire application!
Things that prevent vectorization in your code

Example From Cray COE Work on XGC1

~40% speed up for kernel
Memory Bandwidth

Consider the following loop:

do i = 1, n
  do j = 1, m
    c = c + a(i) * b(j)
  enddo
enddo

Assume n and m are very large, such that a and b don't fit into cache. Then, during execution, the number of loads from DRAM is n*m + n.
This requires 8 bytes loaded from DRAM per FMA (if supported). Assuming 100 GB/s of bandwidth on Edison, we can achieve at most 25 GFlops/second (2 flops per FMA). That is much lower than the 460 GFlops/second peak of an Edison node: the loop is memory bandwidth bound.
Roofline Model For Edison
Improving Memory Locality

Original loop:

do i = 1, n
  do j = 1, m
    c = c + a(i) * b(j)
  enddo
enddo

Loads from DRAM: n*m + n

Blocked loop:

do jout = 1, m, block
  do i = 1, n
    do j = jout, jout + block - 1
      c = c + a(i) * b(j)
    enddo
  enddo
enddo

Loads from DRAM: m/block * (n + block) = n*m/block + m

Improving memory locality reduces the bandwidth required and moves you to the right on the roofline.
Optimization Strategy
Optimizing code for Cori follows a flow chart:

• Use IPM and Darshan to measure and remove communication and IO bottlenecks from the code.
• Run your example in "half-packed" mode. Is performance affected by half-packing?
○ Yes: your code is at least partially memory bandwidth bound. Can you increase the flops per byte loaded from memory in your algorithm? If yes, make algorithm changes; if no, explore using HBM on Cori for key arrays.
• Run your example at "half-clock" speed. Is performance affected by half-clock speed?
○ Yes: you are at least partially CPU bound. Make sure your code is vectorized! Measure cycles per instruction with VTune.
• If neither test changes performance, you are likely partially memory latency bound (assuming you are not IO or communication bound). Can you reduce the memory requests per flop in the algorithm? If yes, make algorithm changes; if no, try running with as many virtual threads as possible (>240 per node on Cori).
The Ant Farm Flow Chart
Use IPM/Darshan to Measure IO + Communication Time
# wallclock 953.272 29.7897 29.6092 29.9752
# mpi 264.267 8.25834 7.73025 8.70985
# %comm 27.7234 25.8873 29.3705
#
#           [time]   [calls]  <%mpi>  <%wall>
# MPI_Send  188.386  639616   71.29   19.76
# MPI_Wait  69.5032  639616   26.30   7.29
# MPI_Irecv 6.34936  639616   2.40    0.67
...
Typical IPM Output
Typical Darshan Output
https://www.nersc.gov/users/software/debugging-and-profiling
Measuring Your Memory Bandwidth Usage (VTune)
Measure memory bandwidth usage in VTune (see next talk). Compare to STREAM GB/s. If you reach ~90% of STREAM, you are memory bandwidth bound. If less, more tests need to be done.
Are you memory or compute bound? Or both?

Run your example in "half-packed" mode:

aprun -n 24 -N 12 -S 6 ...  VS  aprun -n 24 -N 24 -S 12 ...

If you run on only half of the cores on a node, each core you do run has access to more bandwidth. If your performance changes, you are at least partially memory bandwidth bound.
Are you memory or compute bound? Or both?

Run your example at "half-clock" speed:

aprun --p-state=2400000 ...  VS  aprun --p-state=1900000 ...

Reducing the CPU speed slows down computation but doesn't reduce the memory bandwidth available. If your performance changes, you are at least partially compute bound.
So, you are memory bandwidth bound? What to do?

1. Try to improve memory locality and cache reuse.
2. Identify the key arrays leading to high memory bandwidth usage and make sure they are (or will be) allocated in HBM on Cori. Profit by getting ~5x more bandwidth.
So, you are compute bound? What to do?

1. Make sure you have good OpenMP scalability. Look at VTune to see thread activity for major OpenMP regions.
2. Make sure your code is vectorizing. Look at cycles per instruction (CPI) and VPU utilization in VTune. See whether the Intel compiler vectorized the loop using the compiler flag -qopt-report=5.
So, you are neither compute nor memory bandwidth bound?
You may be memory latency bound (or you may be spending all your time in IO and Communication).
If running with hyper-threading on Edison improves performance, you *might* be latency bound:

aprun -n 24 ...  VS  aprun -j 2 -n 48 ...

If you can, try to reduce the number of memory requests per flop by accessing contiguous and predictable segments of memory and reusing variables in cache as much as possible. On Cori, each core will support up to 4 threads. Use them all.
BerkeleyGW Case Study
BerkeleyGW Use Case
★ Big systems require more memory; the cost scales as Natoms^2 to store the data.
★ In an MPI GW implementation, in practice, data is duplicated to avoid communication, and each MPI task has a memory overhead.
★ Users are sometimes forced to use 1 of 24 available cores in order to provide MPI tasks with enough memory; 90% of the computing capability is lost.
…
Computational Bottlenecks

In-house code (I'm one of the main developers). Used as a "prototype" for App Readiness. The significant bottleneck is large matrix-reduction-like operations: turning arrays into numbers.
Targeting Intel Xeon Phi Many Core Architecture
1. Target more on-node parallelism. (MPI model already failing users)
2. Ensure key loops/kernels can be vectorized.
Example: Optimization steps for Xeon Phi Coprocessor
Refactor to Have 3 Loop Structure:
Outer: MPI
Middle: OpenMP
Inner: Vectorization
Add OpenMP
Ensure Vectorization
Final Loop Structure
ngpown typically in 100’s to 1000s. Good for many threads.
ncouls typically in 1000s - 10,000s. Good for vectorization.
Original inner loop. Too small to vectorize!
Attempt to save work breaks vectorization and makes code slower.
!$OMP DO reduction(+:achtemp)
do my_igp = 1, ngpown
  ...
  do iw = 1, 3
    scht = 0D0
    wxt = wx_array(iw)
    do ig = 1, ncouls
      !if (abs(wtilde_array(ig,my_igp) * eps(ig,my_igp)) .lt. TOL) cycle
      wdiff = wxt - wtilde_array(ig,my_igp)
      delw = wtilde_array(ig,my_igp) / wdiff
      ...
      scha(ig) = mygpvar1 * aqsntemp(ig) * delw * eps(ig,my_igp)
      scht = scht + scha(ig)
    enddo ! loop over g
    sch_array(iw) = sch_array(iw) + 0.5D0*scht
  enddo
  achtemp(:) = achtemp(:) + sch_array(:) * vcoul(my_igp)
enddo
Early Lessons Learned
Most codes are harder to optimize on KNC (relative to Xeon performance). But optimizations for KNC improve code everywhere.
Dungeon Sessions
Remaining Questions. What can a dungeon session answer?
We’ve had two dungeon sessions with Intel. What kinds of questions can they help with?
1. Why is KNC slower than Haswell for this problem?
- Known to be bandwidth bound on KNC, but not on Haswell.
- How is it bound on Haswell?
2. How much gain can we get by allocating just a few arrays in HBM?
• Without blocking we spill out of L2 on KNC and Haswell. But Haswell has L3 to catch us.
Bounding Around on the Roofline Model

2013 - Poor locality, loop ordering issues
2014 - Refactored loops, improved locality
2014 - Vectorized code
2015 - Cache blocking
Conclusions
The End (Extra Slides)
real, allocatable :: a(:,:), b(:,:), c(:)
!DIR$ ATTRIBUTE FASTMEM :: a, b, c

% module load memkind jemalloc
% ftn -dynamic -g -O3 -openmp mycode.f90
% export MEMKIND_HBW_NODES=0
% aprun -n 1 -cc numa_node numactl --membind=1 --cpunodebind=0 ./gppkernel.jemalloc.x 512 2 32768 20 2
Resources for Code Teams
What is different about Cori?
Big Changes:
1. More on-node parallelism: more cores, bigger vectors
2. Small amount of very fast memory. (similar-ish amounts of traditional DDR)
3. Burst Buffer
Code Teams
Nick Wright, Katie Antypas, Harvey Wasserman, Brian Austin, Zhengji Zhao, Jack Deslippe, Woo-Sun Yang, Helen He, Matt Cordery, Jon Rood (IPCC post-doc), Richard Gerber, Rebecca Hartman-Baker, Scott French
Preparing for the Dungeon
Created 4 kernels representing major regions of code:
○ ~500-1000 lines each.
○ Self-contained; can compile and run quickly.
○ Intended to run on a single node (single MPI task).
○ Imitates the node-level work of a typical job running at scale, not the entire run for a small problem.
Approximation                                              Wall Time
a. Real Division                                           6.37 seconds
b. Complex Division                                        4.99 seconds
c. Complex Division + -fp-model fast=2                     5.30 seconds
d. Complex Division + -fp-model fast=2 + !dir$ nounroll    4.89 seconds
Intel Xeon Phi Users' Group (IXPUG)
Early NESAP Advances (with Cray and Intel)
Kernel          Overall Improvement   Notes
BGW GPP Kernel  0-10%                 Pretty optimized to begin with. Thread scalability improved by fixing ifort allocation performance.
BGW FF Kernel   2x-4x                 Unoptimized to begin with. Cache reuse improvements.
BGW Chi Kernel  10-30%                Moved threaded region outward in code.
BGW BSE Kernel  10-50%                Created custom vector matmuls.
To run effectively on Cori users will have to:
• Exploit more on-node parallelism (more cores, bigger vectors)
• Ensure key loops/kernels vectorize
• Manage a small amount of very fast memory alongside traditional DDR
• Take advantage of the Burst Buffer
DO I = 1, N
  R(I) = B(I) + A(I)
ENDDO
NERSC Exascale Science Application Program
But why does the GPP Kernel perform worse on Knights Corner than on Haswell?
Change loop ‘trip count’ (or index bound) to fit into L2 cache on KNC
A recent NERSC hack-a-thon suggests there are many opportunities for code optimization
… …
What about application portability?
NERSC Exascale Science Program Codes
Lower the trip count to once again make the kernel memory bandwidth bound
NESAP: The NERSC Exascale Science Application Program
Codes Used at NERSC
NERSC Code Breakdown 2012 / NERSC Code Breakdown 2013
NESAP Codes
Code Coverage
Timeline
Early Lessons Learned
Cray and Intel very helpful in profiling/optimizing the code. See following slides for using Intel resources effectively
Generating small tangible kernels is important for success.
Targeting many-core greatly helps performance back on Xeon.
Complex division is slow (particularly on KNC).