Preparing your Application for Advanced Manycore Architectures


Transcript of Preparing your Application for Advanced Manycore Architectures

Page 1: Preparing your Application for Advanced Manycore Architectures

Katie Antypas Services Dept Head, NERSC-8 Project Lead

Preparing your Application for Advanced Manycore Architectures

- 1 -

CSGF HPC Workshop July 17, 2014

Page 2: Preparing your Application for Advanced Manycore Architectures

What is “Manycore?”

• No precise definition
• Multicore (heavyweight): slow evolution from 2–12 cores per chip; core performance matters more than core count
• Manycore (lightweight): jump to 30–60+ cores per chip; core count matters more than individual core performance
• Manycore: relatively simplified, lower-frequency computational cores; less instruction-level parallelism (ILP); engineered for lower power consumption and heat dissipation

- 2 -

Page 3: Preparing your Application for Advanced Manycore Architectures

Why Now? “Moore’s Law” will continue.

• Memory wall
• Power wall
• ILP wall

- 3 -

To sustain historic performance growth, the DOE community must prepare for new architectures

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed., 2011

[Figure: single-processor performance relative to the VAX-11/780, showing exponential growth followed by a slowdown. Annotations: “Exponential performance increase; do not rewrite software, just buy a new machine!”, “hmmm…”, “oh no!”]

Page 4: Preparing your Application for Advanced Manycore Architectures

What Does it Mean for Users?

• While Moore’s Law continues, users will need to make changes to applications to run well on manycore architectures

• Look at sources of parallelism: new and old

- 4 -

Page 5: Preparing your Application for Advanced Manycore Architectures

Sources of Parallelism

• Domain Parallelism – independent program units; explicit
• Thread Parallelism – independent execution units within the program; generally explicit
• Data Parallelism – same operation on multiple elements
• Instruction-Level Parallelism – between independent instructions in a sequential program

- 5 -

[Figure: MPI ranks across sub-domains (domain parallelism); threads within each sub-domain (thread parallelism); and a simple data-parallel loop:]

DO I = 1, N
   R(I) = B(I) + A(I)
ENDDO

Page 6: Preparing your Application for Advanced Manycore Architectures

Multicore vs. Manycore Parallelism

• TLP: Thread-Level Parallelism
• DLP: Data-Level Parallelism

- 6 -

Intel Ivy Bridge: TLP = 24 (12 cores with 2 hardware threads each); DLP = 4 (256-bit wide vector unit)
Intel Xeon Phi (Knights Corner): TLP = 240 (60 cores with 4 hardware threads each); DLP = 8 (512-bit wide vector unit)

So the manycore part exposes roughly 240 × 8 = 1,920 parallel work items (threads × vector lanes), versus roughly 24 × 4 = 96 for the multicore part.

Page 7: Preparing your Application for Advanced Manycore Architectures

Regardless of vendor, major systems throughout the DOE complex will require the same changes

- 7 -

• Regardless of processor architecture, users will need to modify applications to achieve performance
  – Expose more on-node parallelism in applications
  – Increase application vectorization capabilities
  – Manage hierarchical memory
  – For co-processor architectures, locality directives must be added

[Image: Knights Landing]

Page 8: Preparing your Application for Advanced Manycore Architectures

NERSC’s next supercomputer: Cori

Page 9: Preparing your Application for Advanced Manycore Architectures

Cori Configuration

• 64 cabinets of Cray XC system
  – Over 9,300 ‘Knights Landing’ compute nodes
  – Over 1,900 ‘Haswell’ compute nodes (data partition)
• Cray Aries interconnect
• Lustre file system – 28 PB capacity, 432 GB/sec peak performance
• NVRAM “Burst Buffer” for I/O acceleration
• Significant Intel and Cray application transition support
• Delivery in mid-2016; installation in the new LBNL CRT

- 9 -

Page 10: Preparing your Application for Advanced Manycore Architectures

Intel “Knights Landing” Processor

• Next-generation Xeon Phi, >3 TF peak
• Single-socket processor – self-hosted, not a co-processor, not an accelerator
• More than 60 cores per processor with support for four hardware threads each; more cores than the current-generation Intel Xeon Phi™
• Intel® "Silvermont" architecture enhanced for high performance computing
• 512-bit vector units (32 flops/clock – AVX-512)
• 3X single-thread performance over the current-generation Xeon Phi co-processor
• High-bandwidth on-package memory, up to 16 GB capacity, with bandwidth projected to be 5X that of DDR4 DRAM memory
• Higher performance per watt

- 10 -

Page 11: Preparing your Application for Advanced Manycore Architectures

Knights Landing Integrated On-Package Memory

• Cache Model – let the hardware automatically manage the integrated on-package memory as an “L3” cache between the KNL CPU and external DDR
• Flat Model – manually manage how your application uses the integrated on-package memory and external DDR for peak performance
• Hybrid Model – harness the benefits of both the cache and flat models by segmenting the integrated on-package memory

Maximum performance through higher memory bandwidth and flexibility

[Figure: side and top views of the CPU package – high-bandwidth in-package memory sits on the package next to the KNL CPU as “near” memory; DDR on the PCB serves as “far” memory.]

Page 12: Preparing your Application for Advanced Manycore Architectures

Programming Model Considerations

• Knights Landing is a self-hosted part
  – Users can focus on adding parallelism to their applications without concerning themselves with PCI-bus transfers
• MPI + OpenMP is the preferred programming model (a sketch follows below)
  – Should enable NERSC users to make robust code changes
• MPI-only will work, but performance may not be optimal
• On-package MCDRAM – how should it be used: explicitly or implicitly?

- 12 -
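As a minimal sketch of the preferred MPI + OpenMP model (illustrative only; the array sizes and names are made up): each MPI rank owns its own data, and OpenMP threads share the rank-local loop, which the compiler can also vectorize.

program hybrid_example
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, nranks, i
  integer, parameter :: n = 1000000
  real(8), allocatable :: a(:), b(:), c(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  allocate(a(n), b(n), c(n))
  b = 1.0d0
  c = 2.0d0

  ! OpenMP threads split the rank-local loop; the loop body is simple
  ! enough that the compiler can also vectorize it.
  !$omp parallel do
  do i = 1, n
     a(i) = b(i) + c(i)
  end do
  !$omp end parallel do

  if (rank == 0) print *, 'ranks =', nranks, ' max threads per rank =', omp_get_max_threads()

  deallocate(a, b, c)
  call MPI_Finalize(ierr)
end program hybrid_example

Launched, for example, with a few MPI ranks per node and OMP_NUM_THREADS chosen so that ranks × threads matches the cores (or hardware threads) per node.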

Page 13: Preparing your Application for Advanced Manycore Architectures

Case Study

Page 14: Preparing your Application for Advanced Manycore Architectures

Architecture: NERSC’s Babbage Testbed

• 45 Sandy Bridge nodes with Xeon Phi co-processors
• Each Xeon Phi co-processor has
  – 60 cores
  – 4 hardware threads per core
  – 8 GB of memory
• Multiple ways to program with the co-processor
  – As an accelerator
  – As a self-hosted processor (ignore the Sandy Bridge host)
  – As a reverse accelerator
• We chose to test as if the Xeon Phi were a standalone processor, to mimic the Knights Landing architecture

Page 15: Preparing your Application for Advanced Manycore Architectures

FLASH application readiness

• FLASH is an astrophysics code with explicit solvers for hydrodynamics and magnetohydrodynamics
• Parallelized using
  – MPI domain decomposition, AND
  – OpenMP multithreading over local domains or over cells in each local domain (both strategies are sketched below)
• Target application is a 3D Sedov explosion problem
  – A spherical blast wave is evolved over multiple time steps
  – Uses a configuration with a uniform-resolution grid and 100³ global cells
• The hydrodynamics solvers perform large stencil computations.

Case study by Chris Daley
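As a hedged illustration of the two threading strategies named above (a loop fragment in the style of the other examples, not actual FLASH code; u, nblocks, and ncells are made-up names):

! Strategy 1: OpenMP threads over the rank's local blocks
!$omp parallel do private(ic)
do ib = 1, nblocks
   do ic = 1, ncells
      u(ic, ib) = 2.0d0 * u(ic, ib)   ! placeholder per-cell update
   end do
end do

! Strategy 2: OpenMP threads over the cells within each block
do ib = 1, nblocks
   !$omp parallel do
   do ic = 1, ncells
      u(ic, ib) = 2.0d0 * u(ic, ib)   ! placeholder per-cell update
   end do
end do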

Page 16: Preparing your Application for Advanced Manycore Architectures

Initial best KNC performance vs host

[Figure: runtime comparison of initial best KNC performance vs. host; lower is better.]

Case study by Chris Daley

Page 17: Preparing your Application for Advanced Manycore Architectures

Best configuration on 1 MIC card

[Figure: runtime for the best configuration on one MIC card; lower is better.]

Case study by Chris Daley

Page 18: Preparing your Application for Advanced Manycore Architectures

MIC performance study 1: thread speedup

[Figure: thread speedup; higher is better.]

• 1 MPI rank per MIC card and various numbers of OpenMP threads
• Each OpenMP thread is placed on a separate core
• A 10x thread count ideally gives a 10x speedup
• Speedup is not ideal
  – But it is not the main cause of the poor MIC performance
  – ~70% efficiency at 12 threads (as would be used with 10 MPI ranks per card)

Case study by Chris Daley

Page 19: Preparing your Application for Advanced Manycore Architectures

Vectorization is another form of on-node parallelism

do i = 1, n
   a(i) = b(i) + c(i)
enddo

Intel Xeon Sandy Bridge/Ivy Bridge: 4 double-precision ops concurrently
Intel Xeon Phi: 8 double-precision ops concurrently
NVIDIA Kepler GPUs: 32 SIMT threads

Page 20: Preparing your Application for Advanced Manycore Architectures

Data Parallelism – Vectorization

• Vectorization is another form of on-node parallelism

- 20 -

DO i = 1, 100
   a(i) = b(i) + c(i)
ENDDO

• A single instruction launches many operations on different data
• Most commonly, multiple iterations of a loop can be done “concurrently.” (Simpler control logic, too.)
• Compilers with optimization settings turned on will attempt to vectorize your code, effectively transforming the loop above into the form below

DO i = 1, 100, 4
   a(i)   = b(i)   + c(i)
   a(i+1) = b(i+1) + c(i+1)
   a(i+2) = b(i+2) + c(i+2)
   a(i+3) = b(i+3) + c(i+3)
ENDDO

Page 21: Preparing your Application for Advanced Manycore Architectures

MIC performance study 2: vectorization

• We find that most time is spent in subroutines which update fluid state 1 grid point at a time
  – The data for 1 grid point is laid out as a structure of fluid fields, e.g. density, pressure, …, temperature next to each other: A(HY_DENS:HY_TEMP)
  – Vectorization can only happen when the same operation is performed on multiple fluid fields of 1 grid point!

No vectorization gain!

[Figure: runtime with and without vectorization; lower is better.]

Case study by Chris Daley

Page 22: Preparing your Application for Advanced Manycore Architectures

Enabling vectorization

• Must restructure the code (see the sketch below)
  – The fluid fields should no longer be next to each other in memory
  – A(HY_DENS:HY_TEMP) should become A_dens(1:N), …, A_temp(1:N)
  – The 1:N indicates the kernels now operate on N grid points at a time
• We tested these changes on part of a data reconstruction kernel
• The new code compiled with vectorization options gives the best performance on 3 different platforms

[Figure: kernel performance on the three platforms; higher is better.]

Case study by Chris Daley
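A hedged sketch of the layout change described above (field names follow the slide; the update itself is a made-up placeholder, not the FLASH kernel):

! Before: one grid point per call, unlike fields adjacent in memory
! ("array of structures") – the inner loop runs over different fluid
! fields and vectorizes poorly.
do f = HY_DENS, HY_TEMP
   A(f) = A(f) + dt * flux(f)
end do

! After: a separate array per field over N grid points ("structure of
! arrays") – each inner loop runs over like data and can vectorize.
do i = 1, N
   A_dens(i) = A_dens(i) + dt * flux_dens(i)
end do
do i = 1, N
   A_temp(i) = A_temp(i) + dt * flux_temp(i)
end do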

Page 23: Preparing your Application for Advanced Manycore Architectures

Case Study Summary

• FLASH on MIC
  – MPI+OpenMP parallel efficiency – OK
  – Vectorization – zero / negative gain … must restructure!
• Compiler auto-vectorization / vectorization directives do not help the current code
• Changes needed to enable vectorization
  – Make the kernel subroutines operate on multiple grid points at a time
  – Change the data layout by using a separate array for each fluid field
  – Effectively a change from array of structures (AoS) to structure of arrays (SoA)
• Tested these proof-of-concept changes on a reduced hydro kernel
  – Demonstrated improved performance on Ivy Bridge, BG/Q and Xeon Phi platforms

Case study by Chris Daley

Page 24: Preparing your Application for Advanced Manycore Architectures

Some Initial Lessons Learned

• Improving a code for advanced architectures can improve performance on traditional architectures.

• Inclusion of OpenMP may be needed just to get the code to run (or to fit within memory)

• Some codes may need a significant rewrite or refactoring; others gain significantly by just adding OpenMP and vectorization

• Profiling/debugging tools and optimized libraries will be essential

• Vectorization is important for performance

- 24 -

Page 25: Preparing your Application for Advanced Manycore Architectures

Preparing your code for manycore

Page 26: Preparing your Application for Advanced Manycore Architectures

(More) Optimized Code

What steps should I take to optimize my application without becoming a performance/architecture expert?

- 26 -

• Profile application
• Analyze code vectorization
• Create MPI vs OpenMP scaling plot
• Measure memory bandwidth contention

Page 27: Preparing your Application for Advanced Manycore Architectures

Step 1: Profile Code

• Determine in which routines your code spends the most time
• Find out how much memory your code uses
• Find out which MPI routines are taking the most time

- 27 -

[Figure: example profile – solve 80%, init 10%, decompose 5%, distribute 5%.]

There are many tools to help you profile your code – HPCToolkit, CrayPat, Open|SpeedShop, IPM and many more. Go to your HPC center’s web page for information.

Page 28: Preparing your Application for Advanced Manycore Architectures

Step 2: Find optimal thread and task balance

• Run this experiment:
  – Hold the total concurrency (MPI tasks × OpenMP threads per task) constant
  – Vary the number of MPI tasks, which results in a change to the number of OpenMP threads (e.g., on a 24-core node: 24 ranks × 1 thread, 12 × 2, 6 × 4, 4 × 6, 2 × 12, 1 × 24)

- 28 -

Page 29: Preparing your Application for Advanced Manycore Architectures

Step 2.5: Assess quality of your OpenMP implementation (cont)

• Take the optimal number of threads from the previous graph and assess the efficiency of the OpenMP implementation

- 29 -

Page 30: Preparing your Application for Advanced Manycore Architectures

Step 3: Run in packed vs unpacked mode

Packed

- 30 -

[Figure: Hopper node, 24 cores – two sockets, four NUMA nodes with six cores and two DDR3 memory channels each; in packed mode, every core on the node is in use.]

Page 31: Preparing your Application for Advanced Manycore Architectures

Step 3: Run in packed vs unpacked mode

Unpacked

- 31 -

[Figure: the same Hopper node (24 cores) with only a subset of cores in use per NUMA node – the unpacked layout spreads the same number of tasks across more nodes.]

Page 32: Preparing your Application for Advanced Manycore Architectures

Step 3: Run in packed and unpacked mode

• In your experiment use the same number of cores, but the packed version will use fewer nodes.
• Quiz: the unpacked case runs more than 5 times faster than the packed version – what could be happening? (A sketch of a bandwidth-bound loop follows at the end of this slide.)

- 32 -

[Figure: packed and unpacked core layouts side by side on Hopper nodes (24 cores per node, four NUMA nodes with six cores and two DDR3 channels each).]
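If it helps to see what a bandwidth-bound kernel looks like, the triad-style fragment below (a sketch, not from the slides; the array names are made up) streams three large arrays and is typically limited by memory bandwidth, so timing it in packed and unpacked runs exposes contention for the shared DDR channels within each NUMA node.

! Bandwidth-bound triad: three large arrays streamed once per iteration.
do i = 1, n
   a(i) = b(i) + s * c(i)
end do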

Page 33: Preparing your Application for Advanced Manycore Architectures

Step 4: Use on-package memory

• New!
• If your application is memory-bandwidth limited, how should you use the smaller high-bandwidth memory?
• On Knights Landing, on-package memory bandwidth is projected to be up to 5 times that of DRAM
• This is a new challenge for the HPC community and we need your help! (One explicit option is sketched below.)

- 33 -
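One explicit approach under discussion is to mark selected bandwidth-critical allocatable arrays for the on-package memory. The fragment below is a hedged sketch using the Intel Fortran FASTMEM attribute (backed by the memkind library); the array name is made up, and whether this is the right choice for a given code is exactly the open question posed above.

! Request that a bandwidth-critical array be allocated in high-bandwidth
! on-package memory when available (Intel Fortran + memkind; illustrative).
real(8), allocatable :: state(:,:)
!DIR$ ATTRIBUTES FASTMEM :: state
allocate(state(nvar, ncells))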

Page 34: Preparing your Application for Advanced Manycore Architectures

Step 5: Examine code vectorization

• Vectorization analysis is performed by the compiler.
• Compile with -vec-report (or similar) and with vectorization disabled (e.g. -no-vec) to measure the code speedup from vectorization

- 34 -

Loop dependency (will not vectorize):
DO I = 1, N
   A(I) = A(I-1) + B(I)
ENDDO

Loop exit (may not vectorize):
DO I = 1, N
   IF (A(I) < X) CYCLE
   ...
ENDDO

Example vectorization report:
% ftn -o v.out -vec-report=3 yax.f
(col. 10) remark: LOOP WAS VECTORIZED
(col. 22) remark: loop was not vectorized: not inner loop
(col. 7) remark: loop was not vectorized: existence of vector dependence
(col. 7) remark: vector dependence: assumed FLOW dependence between line 16 and line 16

Page 35: Preparing your Application for Advanced Manycore Architectures

What comes next?

Page 36: Preparing your Application for Advanced Manycore Architectures

What’s wrong with Current Programming Environments?

- 36 -

Old Constraints
• Peak clock frequency as primary limiter for performance improvement
• Cost: FLOPs are biggest cost for system: optimize for compute
• Concurrency: modest growth of parallelism by adding nodes
• Memory scaling: maintain byte-per-flop capacity and bandwidth
• Locality: MPI+X model (uniform costs within node & between nodes)
• Uniformity: assume uniform system performance
• Reliability: it’s the hardware’s problem

New Constraints
• Power is the primary design constraint for future HPC system design
• Cost: data movement dominates: optimize to minimize data movement
• Concurrency: exponential growth of parallelism within chips
• Memory scaling: compute growing 2x faster than capacity or bandwidth
• Locality: must reason about data locality and possibly topology
• Heterogeneity: architectural and performance non-uniformity increase
• Reliability: cannot count on hardware protection alone

Fundamentally breaks our current programming paradigm and computing ecosystem

Slide from John Shalf

Page 37: Preparing your Application for Advanced Manycore Architectures

Potential future node architecture

- 37 -

Page 38: Preparing your Application for Advanced Manycore Architectures

Parameterized Machine Model (what do we need to reason about when designing a new code?)

• Cores: how many? heterogeneous? SIMD width?
• Network on Chip (NoC): are all cores equidistant, or is the topology constrained (2D)?
• On-chip memory hierarchy: automatic or scratchpad? memory coherency method?
• Node topology: NUMA or flat? topology may be important, or perhaps just distance
• Memory: nonvolatile / multi-tiered? intelligence in memory (or not)?
• Fault model for the node: FIT rates, kinds of faults, granularity of faults/recovery
• Interconnect: bandwidth/latency/overhead, topology
• Primitives for data movement/sync: global address space or messaging? synchronization primitives/fences

Slide from John Shalf

Page 39: Preparing your Application for Advanced Manycore Architectures

In Summary – Join our postdoc program!

- 39 -

• We will soon be hiring 8 postdocs for the NERSC Exascale Science Applications Program

Page 40: Preparing your Application for Advanced Manycore Architectures

Acknowledgements

• Thanks to Harvey Wasserman, John Shalf, Chris Daley and Jack Deslippe for slides

• Thanks to the whole NERSC App Readiness team for discussion and content

- 40 -

Page 41: Preparing your Application for Advanced Manycore Architectures

Thank you

- 41 -

Page 42: Preparing your Application for Advanced Manycore Architectures

NERSC-8 system named after Gerty Cori (1896 – 1957): Biochemist

• First American woman to be awarded a Nobel Prize in science (1947)
• Born in Prague; naturalized as a US citizen in 1928
• Shared the Nobel Prize in Physiology or Medicine with her husband and one other scientist
• Recognized for work involving enzyme chemistry in carbohydrates: how cells produce and store energy

- 42 -

• The breakdown of carbohydrates and the mechanism of enzyme action are of fundamental importance in renewable bioenergy (cf. DOE Complex Carbohydrate Research Center)

Page 43: Preparing your Application for Advanced Manycore Architectures

The Cori System

- 43 -

[Figure: Cori system diagram – Cray Cascade, 64 cabinets; 9,304 KNL compute nodes and 1,920 Haswell compute nodes; 384 Burst Buffer nodes; 68 LNET routers; a Lustre parallel file system with 12 scalable units (2 MGS, 4 MDS, and 96 OSS nodes on DDN WolfCreek and NetApp E2700 hardware); service nodes including boot, SDB, network/RSIP, DVS, MOM, and DSL nodes; 14 esLogin nodes and 2 esMS nodes; connected via FDR InfiniBand, 40 GigE, GigE, and Fibre Channel to the boot RAID, SMW, and the NERSC network.]

Page 44: Preparing your Application for Advanced Manycore Architectures

Community Involvement

• Tri-facility application readiness meeting with the LCFs, 3/25/2014
  – Compilation of Office of Science applications
• NESAP created after learning best LCF practices
• LCFs will help evaluate NESAP proposals
• Collaboration with the Berkeley Lab Computational Research Division: postdocs, IPCC, more
• Other labs and universities
• NERSC visibility will help

- 44 -

Early Science Program (ESP)

Page 45: Preparing your Application for Advanced Manycore Architectures

You must be this tall to enter the dungeon session


- 45 -

• Profile application
• Define a few kernels that can fit into the memory of KNC and can complete in a few minutes
• Analyze code vectorization with a vector report
• Create MPI vs OpenMP scaling plot
• Compute operational intensity of application
• Run application in ‘packed’ and ‘unpacked’ mode to assess memory bandwidth contention

Page 46: Preparing your Application for Advanced Manycore Architectures

Programming Model Challenges (why is MPI+X not sufficient?)

- 46 -

• Lightweight cores are not fast enough to process complex protocol stacks at line rate
  – Simplify MPI or add thread match/dispatch extensions
  – Or use the memory address for endpoint matching
• Can no longer ignore locality (especially inside the node)
  – It’s not just memory-system NUMA issues anymore
  – On-chip fabric is not infinitely fast (topology as a first-class citizen)
  – Relaxed relaxed consistency (or no guaranteed HW coherence)
• New memory classes & memory management
  – NVRAM, fast/low-capacity, slow/high-capacity
  – How to annotate & manage data for different classes of memory
• Asynchrony/Heterogeneity
  – New potential sources of performance heterogeneity
  – Is BSP up to the task?

Page 47: Preparing your Application for Advanced Manycore Architectures

Things that Kill Vectorization

Compilers want to “vectorize” your loops whenever possible. But sometimes they get stumped. Here are a few things that prevent your code from vectorizing:

Loop dependency:
do i = 1, n
   a(i) = a(i-1) + b(i)
enddo

Loop exit:
do i = 1, n
   if (a(i) < x) cycle
   ...
enddo

Task forking is another common blocker.
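When the compiler merely assumes a dependence that the programmer knows is absent (for example, indirect indexing where the index values never repeat), a vectorization hint can recover the loop. The fragment below is a hedged sketch using the OpenMP 4.0 simd directive; idx, a, b, and n are made-up names, and the hint is only valid if the independence assumption actually holds.

! Assert that iterations are independent so the compiler may vectorize
! despite the assumed dependence through idx(i).
!$omp simd
do i = 1, n
   a(idx(i)) = a(idx(i)) + b(i)
end do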

Page 48: Preparing your Application for Advanced Manycore Architectures

Assumptions of Uniformity are Breaking (many new sources of heterogeneity)

- 48 -

• Heterogeneous compute engines (hybrid/GPU computing)
• Fine-grained power management makes homogeneous cores look heterogeneous
  – Thermal throttling – we can no longer guarantee a deterministic clock rate
• Non-uniformities in process technology create non-uniform operating characteristics for cores on a CMP
  – Near-Threshold Voltage (NTV)
• Fault resilience introduces inhomogeneity in execution rates
  – Error correction is not instantaneous
  – And this will get WAY worse if we move towards software-based resilience


[Figure: Bulk Synchronous Execution]

Page 49: Preparing your Application for Advanced Manycore Architectures

NERSC collaborates with computer companies to deploy advanced HPC and data resources

- 49 -

• Hopper (N6) and Cielo (ACES) were the first Cray petascale systems with a Gemini interconnect

• Architected and deployed data platforms including the largest DOE system focused on genomics

• Edison (N7) is the first Cray petascale system with Intel processors, Aries interconnect and Dragonfly topology (serial #1)

• Cori (N8) will be one of the first large Intel KNL systems and will have unique data capabilities