The Near Future of HPC at NERSC


Transcript of The Near Future of HPC at NERSC

Page 1: The Near Future of HPC at NERSC

Jack Deslippe

The Near Future of HPC at NERSC

July, 2015

Page 2: The Near Future of HPC at NERSC

NERSC is the production HPC center for the DOE Office of Science

• 4000 users, 500 projects
• From 48 states; 65% from universities
• Hundreds of users each day
• 1500 publications per year

Systems designed for science

Page 3: The Near Future of HPC at NERSC

NERSC General Resources

Edison (Cray XC30, Ivy Bridge): 5,576 compute nodes; 24 cores and 64 GB per node; 2.6 petaflops peak

Hopper (Cray XE6): 6,384 compute nodes; 153,216 cores; 1.3 petaflops peak

Global Filesystem (NGF): 4 PB scratch capacity; 80 GB/s bandwidth

Test beds: Babbage (Xeon Phi), Dirac (GPU)

Archival storage: 40 PB HPSS capacity; 80 GB/s bandwidth

Other: expert consultants, analytics/visualization resources (RStudio, VisIt), science gateways, global project directories, NX, science databases, ESnet, Globus Online, web resources and hosting, training

Page 4: The Near Future of HPC at NERSC

We are Hiring!

Page 5: The Near Future of HPC at NERSC

Getting Your Own NERSC Account

To be able to run at NERSC you need to have an account and an allocation.

• An account is a username and password
• Simply fill out the Account Request Form: https://nim.nersc.gov/nersc_account_request.php

• An allocation is a repository of CPU hours
• Good news! You already have an allocation!
• Repo m1266. Talk to me to get access.

Page 6: The Near Future of HPC at NERSC

Getting Your Own NERSC Account

• If you have exhausted your CSGF allocation, apply for your own allocation with DOE

• Research must be relevant to the mission of the DOE
• https://www.nersc.gov/users/accounts/
• Builds a relationship with DOE program managers

Page 7: The Near Future of HPC at NERSC

Cori

Page 8: The Near Future of HPC at NERSC

What is different about Cori?

Page 9: The Near Future of HPC at NERSC

What is different about Cori?

Edison (Ivy Bridge):
● 12 cores per CPU
● 24 virtual cores per CPU
● 2.4-3.2 GHz
● Can do 4 double-precision operations per cycle (+ multiply/add)
● 2.5 GB of memory per core
● ~100 GB/s memory bandwidth

Cori (Knights Landing):
● 60+ physical cores per CPU
● 240+ virtual cores per CPU
● Much lower clock speed (GHz)
● Can do 8 double-precision operations per cycle (+ multiply/add)
● < 0.3 GB of fast memory per core
● < 2 GB of slow memory per core
● Fast memory has ~5x DDR4 bandwidth


Page 11: The Near Future of HPC at NERSC

Basic Optimization Concepts

Page 12: The Near Future of HPC at NERSC

MPI Vs. OpenMP For Multi-Core Programming

[Diagram: under MPI, each CPU core has its own memory and tasks communicate over the network interconnect; under OpenMP, the CPU cores share memory and shared arrays, with per-thread private arrays.]

MPI vs. OpenMP: OpenMP typically has less memory overhead/duplication, and communication is often implicit, through cache coherency and the runtime.
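As a concrete illustration (a sketch, not from the slides; names and sizes are invented), an OpenMP reduction over one shared array, with no per-task duplication of the data:

program omp_reduce
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: a(:)
   real(8) :: total
   integer :: i

   allocate(a(n))
   call random_number(a)

   total = 0d0
   !$OMP PARALLEL DO reduction(+:total)
   do i = 1, n
      total = total + a(i)   ! a(:) is shared by all threads; no per-task copies
   enddo
   !$OMP END PARALLEL DO

   print *, 'sum =', total, '  max threads =', omp_get_max_threads()
end program omp_reduce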

Page 13: The Near Future of HPC at NERSC

PARATEC Use Case For OpenMP

PARATEC computes parallel FFTs across all processors.

Involves MPI all-to-all communication (small messages, latency bound).

Reducing the number of MPI tasks in favor of OpenMP threads yields a large improvement in overall runtime.

Figure Courtesy of Andrew Canning

Page 14: The Near Future of HPC at NERSC

Vectorization

There is another important form of on-node parallelism.

Vectorization: the CPU performs identical operations on different data; e.g., multiple iterations of the loop below can be executed concurrently.

do i = 1, n
   a(i) = b(i) + c(i)
enddo

Page 15: The Near Future of HPC at NERSC

Vectorization


Intel Xeon Sandy-Bridge/Ivy-Bridge: 4 Double Precision Ops Concurrently

Intel Xeon Phi: 8 Double Precision Ops Concurrently

NVIDIA Kepler GPUs: 32+ SIMT

Page 16: The Near Future of HPC at NERSC

Things that prevent vectorization in your code

Compilers want to “vectorize” your loops whenever possible. But sometimes they get stumped. Here are a few things that prevent your code from vectorizing:

Loop dependency:

do i = 1, n
   a(i) = a(i-1) + b(i)
enddo

Task forking:

do i = 1, n
   if (a(i) < x) cycle
   if (a(i) > x) …
enddo
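A hedged sketch (not from the slides; the loop bodies above are elided, so this stand-in is invented): branchy per-element logic can often be rewritten as a single masked assignment, e.g. with Fortran's merge intrinsic, which compilers can vectorize with masked/blended SIMD operations:

program mask_demo
   implicit none
   integer, parameter :: n = 8
   real :: a(n), b(n), x
   integer :: i

   call random_number(a)
   call random_number(b)
   x = 0.5

   ! Branch-free rewrite: one conditional merge per element instead of cycle/if.
   do i = 1, n
      a(i) = merge(b(i), -b(i), a(i) < x)   ! b(i) where a(i) < x, else -b(i)
   enddo

   print *, a
end program mask_demo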

Page 17: The Near Future of HPC at NERSC

Things that prevent vectorization in your code

Example From NERSC User Group Hackathon - (Astrophysics Transport Code)

… …

Page 18: The Near Future of HPC at NERSC

Things that prevent vectorization in your code

Example From NERSC User Group Hackathon - (Astrophysics Transport Code)

… …

… …

Page 19: The Near Future of HPC at NERSC

Things that prevent vectorization in your code

Example From NERSC User Group Hackathon - (Astrophysics Transport Code)

… …

… …

30% speed up for entire application!

Page 20: The Near Future of HPC at NERSC

Things that prevent vectorization in your code

Example From Cray COE Work on XGC1

Page 21: The Near Future of HPC at NERSC

Things that prevent vectorization in your code

Example From Cray COE Work on XGC1

~40% speed up for kernel

Page 22: The Near Future of HPC at NERSC

Memory Bandwidth

Consider the following loop:

do i = 1, n
   do j = 1, m
      c = c + a(i) * b(j)
   enddo
enddo

Assume n and m are very large, such that a and b don't fit in cache. Then, during execution, the number of loads from DRAM is

n*m + n

(each a(i) is loaded once, for n loads, while all m elements of b must be re-streamed from DRAM on every outer iteration, for n*m loads).

Page 23: The Near Future of HPC at NERSC

Memory Bandwidth

Consider the same loop:

do i = 1, n
   do j = 1, m
      c = c + a(i) * b(j)
   enddo
enddo

Assume n and m are very large, such that a and b don't fit in cache. Then, during execution, the number of loads from DRAM is

n*m + n

This requires 8 bytes loaded from DRAM per FMA (if supported). Assuming 100 GB/s of bandwidth on Edison, we can achieve at most 100 GB/s / 8 bytes per FMA x 2 flops per FMA = 25 GFlops/second.

This is much lower than the 460 GFlops/second peak of an Edison node. The loop is memory bandwidth bound.

Page 24: The Near Future of HPC at NERSC

Roofline Model For Edison

Page 25: The Near Future of HPC at NERSC

Improving Memory Locality

Original loop (loads from DRAM: n*m + n):

do i = 1, n
   do j = 1, m
      c = c + a(i) * b(j)
   enddo
enddo

Blocked loop (loads from DRAM: m/block * (n + block) = n*m/block + m, assuming block evenly divides m):

do jout = 1, m, block
   do i = 1, n
      do j = jout, jout + block - 1
         c = c + a(i) * b(j)
      enddo
   enddo
enddo

Blocking improves memory locality, reducing the bandwidth required.
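A minimal, self-contained sketch of the comparison (array and block sizes are illustrative choices, not from the slides; sized so b does not fit in cache but each block does):

program blocking_demo
   implicit none
   integer, parameter :: n = 64, m = 8388608, blk = 4096   ! blk divides m
   real(8), allocatable :: a(:), b(:)
   real(8) :: c, t0, t1
   integer :: i, j, jout

   allocate(a(n), b(m))
   call random_number(a)
   call random_number(b)

   ! Naive: all of b (64 MB here) is re-streamed from DRAM for every i.
   c = 0d0
   call cpu_time(t0)
   do i = 1, n
      do j = 1, m
         c = c + a(i) * b(j)
      enddo
   enddo
   call cpu_time(t1)
   print *, 'naive:  ', t1 - t0, ' s,  c =', c

   ! Blocked: each 32 KB chunk of b stays resident in cache across the i loop.
   c = 0d0
   call cpu_time(t0)
   do jout = 1, m, blk
      do i = 1, n
         do j = jout, jout + blk - 1
            c = c + a(i) * b(j)
         enddo
      enddo
   enddo
   call cpu_time(t1)
   print *, 'blocked:', t1 - t0, ' s,  c =', c
end program blocking_demo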

Page 26: The Near Future of HPC at NERSC

Improving Memory Locality Moves you to the Right on the Roofline

Page 27: The Near Future of HPC at NERSC

Optimization Strategy

Page 28: The Near Future of HPC at NERSC

Optimizing Code For Cori is like:

Page 29: The Near Future of HPC at NERSC
Page 30: The Near Future of HPC at NERSC

The Ant Farm Flow Chart

1. Run example in "half packed" mode. Is performance affected by half-packing?
   Yes: your code is at least partially memory bandwidth bound. Can you increase flops per byte loaded from memory in your algorithm?
      Yes: make algorithm changes.
      No: explore using HBM on Cori for key arrays.
2. Run example at "half clock" speed. Is performance affected by half-clock speed?
   Yes: you are at least partially CPU bound. Make sure your code is vectorized! Measure cycles per instruction with VTune.
3. If neither test affects performance: likely partially memory latency bound (assuming not IO or communication bound). Can you reduce memory requests per flop in your algorithm?
   Yes: make algorithm changes.
   No: try running with as many virtual threads as possible (> 240 per node on Cori).
4. Use IPM and Darshan to measure and remove communication and IO bottlenecks from your code.


Page 32: The Near Future of HPC at NERSC

Use IPM/Darshan to Measure IO + Communication Time

# wallclock 953.272 29.7897 29.6092 29.9752
# mpi 264.267 8.25834 7.73025 8.70985
# %comm 27.7234 25.8873 29.3705
#
# [time] [calls] <%mpi> <%wall>
# MPI_Send 188.386 639616 71.29 19.76
# MPI_Wait 69.5032 639616 26.30 7.29
# MPI_Irecv 6.34936 639616 2.40 0.67
...

Typical IPM Output

Typical Darshan Output

https://www.nersc.gov/users/software/debugging-and-profiling

Page 33: The Near Future of HPC at NERSC

Measuring Your Memory Bandwidth Usage (VTune)

Measure memory bandwidth usage in VTune (see next talk). Compare it to the STREAM GB/s for the node.

If you reach ~90% of STREAM, you are memory bandwidth bound. If less, more tests need to be done.
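As a rough stand-in (a sketch, not a substitute for the real STREAM benchmark; the array size is invented, and it runs a single thread, so cpu_time approximates wall time), one can estimate achievable bandwidth with a triad loop:

program triad
   implicit none
   integer, parameter :: n = 50000000
   real(8), allocatable :: a(:), b(:), c(:)
   real(8) :: t0, t1, gbs
   integer :: i

   allocate(a(n), b(n), c(n))
   b = 1d0
   c = 2d0

   call cpu_time(t0)
   do i = 1, n
      a(i) = b(i) + 3d0 * c(i)   ! ~3 arrays x 8 bytes of DRAM traffic per iteration
   enddo
   call cpu_time(t1)

   gbs = 3d0 * 8d0 * real(n, 8) / (t1 - t0) / 1d9
   print *, 'triad bandwidth ~', gbs, ' GB/s (single thread)'
   print *, 'check:', a(n/2)    ! keep a live so the loop is not optimized away
end program triad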

Page 34: The Near Future of HPC at NERSC

Are you memory or compute bound? Or both?

Run example in "half packed" mode:

aprun -n 24 -N 12 -S 6 ... VS aprun -n 24 -N 24 -S 12 ...

If you run on only half of the cores on a node, each core you do run has access to more bandwidth

If your performance changes, you are at least partially memory bandwidth bound


Page 36: The Near Future of HPC at NERSC

Are you memory or compute bound? Or both?

Run example at "half clock" speed:

aprun --p-state=2400000 ... VS aprun --p-state=1900000 ...

Reducing the CPU speed slows down computation but doesn't reduce the memory bandwidth available.

If your performance changes, you are at least partially compute bound.

Page 37: The Near Future of HPC at NERSC

So, you are Memory Bandwidth Bound?

What to do?

1. Try to improve memory locality and cache reuse.

2. Identify the key arrays driving high memory bandwidth usage and make sure they are (or will be) allocated in HBM on Cori. Profit by getting ~5x more bandwidth.

Page 38: The Near Future of HPC at NERSC

So, you are Compute Bound?

What to do?

1. Make sure you have good OpenMP scalability. Look at VTune to see thread activity for major OpenMP regions.

2. Make sure your code is vectorizing. Look at cycles per instruction (CPI) and VPU utilization in VTune. See whether the Intel compiler vectorized a loop using the compiler flag -qopt-report=5.
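For example, a minimal check (a sketch; the file and routine names are invented) of a loop the compiler should vectorize:

! saxpy.f90 -- a loop the compiler should report as vectorized
subroutine saxpy(n, a, x, y)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in) :: a, x(n)
   real(8), intent(inout) :: y(n)
   integer :: i
   do i = 1, n
      y(i) = y(i) + a * x(i)   ! unit stride, no dependencies: vectorizable
   enddo
end subroutine saxpy

Compiling with ftn -O3 -qopt-report=5 saxpy.f90 writes an optimization report (e.g. saxpy.optrpt) stating whether each loop was vectorized and, if not, why.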

Page 39: The Near Future of HPC at NERSC

So, you are neither compute nor memory bandwidth bound?

You may be memory latency bound (or you may be spending all your time in IO and Communication).

If running with hyper-threading on Edison improves performance, you *might* be latency bound:

If you can, try to reduce the number of memory requests per flop by accessing contiguous and predictable segments of memory and reusing variables in cache as much as possible.
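A sketch of the kind of access-pattern change meant here (illustrative sizes; Fortran arrays are column-major, so the inner loop should run over the first index):

program stride_demo
   implicit none
   integer, parameter :: n = 2048
   real(8), allocatable :: x(:,:)
   real(8) :: s
   integer :: i, j

   allocate(x(n, n))
   call random_number(x)

   ! Strided: inner loop walks along a row, jumping n elements each step.
   s = 0d0
   do i = 1, n
      do j = 1, n
         s = s + x(i, j)        ! stride n between consecutive accesses
      enddo
   enddo
   print *, 'strided sum    =', s

   ! Contiguous: loop order matches Fortran column-major layout.
   s = 0d0
   do j = 1, n
      do i = 1, n
         s = s + x(i, j)        ! unit stride; predictable and prefetch-friendly
      enddo
   enddo
   print *, 'contiguous sum =', s
end program stride_demo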

On Cori, each core will support up to 4 threads. Use them all.

aprun -j 2 -n 48 ... VS aprun -n 24 ...

Page 40: The Near Future of HPC at NERSC

BerkeleyGW Case Study

Page 41: The Near Future of HPC at NERSC

BerkeleyGW Use Case

★ Big systems require more memory; the cost scales as Natoms^2 to store the data.
★ In an MPI GW implementation, in practice, data is duplicated to avoid communication, so each MPI task carries a memory overhead.
★ Users are sometimes forced to use 1 of the 24 available cores in order to provide MPI tasks with enough memory; 90% of the computing capability is lost.

Page 42: The Near Future of HPC at NERSC

Computational Bottlenecks

In-house code (I'm one of the main developers). Used as a "prototype" for App Readiness.

Page 43: The Near Future of HPC at NERSC

Computational Bottlenecks


A significant bottleneck is large matrix-reduction-like operations: turning arrays into numbers.

Page 44: The Near Future of HPC at NERSC

Targeting Intel Xeon Phi Many Core Architecture

1. Target more on-node parallelism. (The MPI-only model is already failing users.)
2. Ensure key loops/kernels can be vectorized.

Example: optimization steps for the Xeon Phi coprocessor

Refactor to a 3-loop structure:
   Outer: MPI
   Middle: OpenMP
   Inner: Vectorization

Add OpenMP. Ensure vectorization.

Page 45: The Near Future of HPC at NERSC

Final Loop Structure

● ngpown is typically in the 100s to 1000s: good for many OpenMP threads.
● ncouls is typically in the 1000s to 10,000s: good for vectorization.
● The original inner loop (do iw = 1, 3) is too small to vectorize!
● The commented-out cycle, an attempt to save work, breaks vectorization and makes the code slower.

!$OMP DO reduction(+:achtemp)
do my_igp = 1, ngpown
   ...
   do iw = 1, 3
      scht = 0D0
      wxt = wx_array(iw)
      do ig = 1, ncouls
         !if (abs(wtilde_array(ig,my_igp) * eps(ig,my_igp)) .lt. TOL) cycle
         wdiff = wxt - wtilde_array(ig,my_igp)
         delw = wtilde_array(ig,my_igp) / wdiff
         ...
         scha(ig) = mygpvar1 * aqsntemp(ig) * delw * eps(ig,my_igp)
         scht = scht + scha(ig)
      enddo ! loop over g
      sch_array(iw) = sch_array(iw) + 0.5D0*scht
   enddo
   achtemp(:) = achtemp(:) + sch_array(:) * vcoul(my_igp)
enddo

Page 46: The Near Future of HPC at NERSC

Early Lessons Learned

Most codes are harder to optimize on KNC than on Xeon. But optimizations made for KNC improve code performance everywhere.

Page 47: The Near Future of HPC at NERSC

Dungeon Sessions

Page 48: The Near Future of HPC at NERSC

Remaining Questions. What can a dungeon session answer?

We’ve had two dungeon sessions with Intel. What kinds of questions can they help with?

1. Why is KNC slower than Haswell for this problem?
   - Known to be bandwidth bound on KNC, but not on Haswell.
   - How is it bound on Haswell?

2. How much gain can we get by allocating just a few arrays in HBM?

Page 49: The Near Future of HPC at NERSC

Page 50: The Near Future of HPC at NERSC

Page 51: The Near Future of HPC at NERSC

Without blocking we spill out of L2 on both KNC and Haswell. But Haswell has an L3 to catch us.


Page 53: The Near Future of HPC at NERSC


Page 54: The Near Future of HPC at NERSC

Bounding Around on the Roofline Model

2013 - Poor locality, loop ordering issues

Page 55: The Near Future of HPC at NERSC

Bounding Around on the Roofline Model

2014 - Refactored loops, improved locality

Page 56: The Near Future of HPC at NERSC

Bounding Around on the Roofline Model

2014 - Vectorized Code

Page 57: The Near Future of HPC at NERSC

Bounding Around on the Roofline Model

2015 - Cache Blocking

Page 58: The Near Future of HPC at NERSC

Conclusions

Page 59: The Near Future of HPC at NERSC
Page 60: The Near Future of HPC at NERSC

The End (Extra Slides)

Page 61: The Near Future of HPC at NERSC
Page 62: The Near Future of HPC at NERSC
Page 63: The Near Future of HPC at NERSC

real, allocatable :: a(:,:), b(:,:), c(:)

!DIR$ ATTRIBUTES FASTMEM :: a, b, c

% module load memkind jemalloc

% ftn -dynamic -g -O3 -openmp mycode.f90

% export MEMKIND_HBW_NODES=0

% aprun -n 1 -cc numa_node numactl --membind=1 --cpunodebind=0 ./gppkernel.jemalloc.x 512 2 32768 20 2

Here MEMKIND_HBW_NODES=0 tells memkind to treat NUMA node 0 as the "fast" memory, while numactl binds default allocations to node 1 and execution to node 0's cores, emulating HBM on a dual-socket node.
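As a self-contained illustration (a sketch, assuming the Intel compiler and the memkind setup shown above; the array size is invented), a complete program using the directive:

program fastmem_demo
   implicit none
   real, allocatable :: a(:,:)
   !DIR$ ATTRIBUTES FASTMEM :: a

   allocate(a(1024, 1024))   ! with memkind linked, served from "fast"/HBW memory
   call random_number(a)
   print *, 'sum =', sum(a)
end program fastmem_demo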


Page 65: The Near Future of HPC at NERSC

Resources for Code Teams

Page 66: The Near Future of HPC at NERSC

What is different about Cori?

Big changes:

1. More on-node parallelism: more cores, bigger vectors.

2. A small amount of very fast memory (alongside similar-ish amounts of traditional DDR).

3. Burst buffer.

Page 67: The Near Future of HPC at NERSC

Code Teams

Nick Wright, Katie Antypas, Harvey Wasserman, Brian Austin, Zhengji Zhao, Jack Deslippe, Woo-Sun Yang, Helen He, Matt Cordery, Jon Rood (IPCC post-doc), Richard Gerber, Rebecca Hartman-Baker, Scott French

Page 68: The Near Future of HPC at NERSC

Are you memory or compute bound? Or both?

[Flow chart, repeated from the "Ant Farm" slide: run the example in "half packed" mode and at "half clock" speed. If half-packing affects performance, your code is at least partially memory bandwidth bound; if half-clock speed affects performance, you are at least partially CPU bound; if neither, you are likely partially memory latency bound (assuming not IO or communication bound).]

Page 69: The Near Future of HPC at NERSC
Page 70: The Near Future of HPC at NERSC
Page 71: The Near Future of HPC at NERSC
Page 72: The Near Future of HPC at NERSC
Page 73: The Near Future of HPC at NERSC

Approximation:                                  Wall Time:

a. Real Division                                6.37 seconds
b. Complex Division                             4.99 seconds
c. Complex Division + -fp-model fast=2          5.30 seconds


Page 75: The Near Future of HPC at NERSC

Preparing for the Dungeon

Created 4 kernels representing major regions of the code:

○ ~500-1000 lines each.
○ Self-contained; can compile and run quickly.
○ Intended to run on a single node (single MPI task).
○ Imitates the node-level work of a typical job running at scale, not the entire run for a small problem.

Page 76: The Near Future of HPC at NERSC

Approximation:                                             Wall Time:

a. Real Division                                           6.37 seconds
b. Complex Division                                        4.99 seconds
c. Complex Division + -fp-model fast=2                     5.30 seconds
d. Complex Division + -fp-model fast=2 + !dir$ nounroll    4.89 seconds

Page 77: The Near Future of HPC at NERSC

Intel Xeon Phi Users’ Group (IXPUG)


Page 78: The Near Future of HPC at NERSC

Early NESAP Advances (with Cray and Intel)

Kernel         | Overall Improvement | Notes
BGW GPP Kernel | 0-10%               | Pretty optimized to begin with. Thread scalability improved by fixing ifort allocation performance.
BGW FF Kernel  | 2x-4x               | Unoptimized to begin with. Cache reuse improvements.
BGW Chi Kernel | 10-30%              | Moved threaded region outward in code.
BGW BSE Kernel | 10-50%              | Created custom vector matmuls.

Page 79: The Near Future of HPC at NERSC

To run effectively on Cori users will have to:

[Diagram: a vectorizable loop]

DO I = 1, N
   R(I) = B(I) + A(I)
ENDDO

Page 80: The Near Future of HPC at NERSC

NERSC Exascale Science Application Program

Page 81: The Near Future of HPC at NERSC

But why does the GPP Kernel perform worse on Knights Corner than on Haswell?


Page 82: The Near Future of HPC at NERSC

Change loop ‘trip count’ (or index bound) to fit into L2 cache on KNC

Page 83: The Near Future of HPC at NERSC

A recent NERSC hack-a-thon suggests there are many opportunities for code optimization

… …

Page 84: The Near Future of HPC at NERSC

What about application portability?

Page 85: The Near Future of HPC at NERSC

NERSC Exascale Science Program Codes

Page 86: The Near Future of HPC at NERSC

Lower the trip count to once again become memory bandwidth bound

Page 87: The Near Future of HPC at NERSC

Code Teams

Page 88: The Near Future of HPC at NERSC

NESAP: The NERSC Exascale Science Application Program

Page 89: The Near Future of HPC at NERSC

Codes Used at NERSC

[Figures: NERSC Code Breakdown 2012; NERSC Code Breakdown 2013]

Page 90: The Near Future of HPC at NERSC

NESAP Codes

Page 91: The Near Future of HPC at NERSC

Code Coverage

Page 92: The Near Future of HPC at NERSC

Timeline

Page 93: The Near Future of HPC at NERSC

Early Lessons Learned

Cray and Intel were very helpful in profiling and optimizing the code. See the following slides for using Intel resources effectively.

Generating small, tangible kernels is important for success.

Targeting many-core greatly helps performance back on Xeon.

Complex division is slow (particularly on KNC).