Bridging the productivity gap in HPC
Diego Rossinelli, CSE Lab, ETH Zurich
Unprecedented diversity
Industry switch to many-core architectures
● Unprecedented computing power for science/med apps
● Diverse architectural trends
● Unprecedented software challenges
Unprecedented divergence
Are these challenges properly addressed?
CUBISM/MPCF
Stanford CTR compressible turbulence
Shinjo, Umemura: liquid jet breakup
PPM
Productivity-performance gap
Fast evolution of computing hardware may lead to
● Frequent rewrites of software
● Unsustainable development efforts
● Or suboptimal use of the hardware
Berkeley View: Dwarfs are gone
[Diagram: dwarfs as common computational patterns bridging APPLICATIONS and HARDWARE]
● Idea: identify common patterns
– That are shared among different application domains
– That are difficult to execute efficiently
● Application developers rely on dwarf-libraries
– Delegation of the development and optimization burden
– Increase of productivity
– HPC experts develop dwarf libraries
– Dwarf-specific software optimization techniques
Recent software from CSE Lab
● CUBISM-MPCF
– Started in 2012
– 6 months, 3 developers
– Shock-Bubble Interaction at Mach 3
● VP2.0
– Started in 2014
– 6 months, 1 developer
– Simulation of Collapsing Bubbles
● uDeviceX
– Started in 2014
– 8 months, 4 core developers, 4 HPC specialists
– Catching a Needle in a Flowing Haystack
In Silico Lab-On-A-Chip
● Focus: devices isolating CTCs
– CTC-iChip [M. Toner's Group]: Deterministic Lateral Displacement
– Funnels ratchet [McFaul]: viscoelastic deformations
● Goal: numerical investigation
– Can we assess the device effectiveness?
– Can we improve it?
Dissipative Particle Dynamics
● N-body algorithm
● Short-range pairwise interactions (a minimal force sketch follows below)
● Fluctuation-dissipation theorem
● Unified framework for complex FSI
– Walls as frozen particles
– Cells discretized as membranes
– Suspended RBCs, CTCs in the solvent
– Enables separation of scales
[Hoogerbrugge and Koelman, 1992] [Groot and Warren, 1997] [Español, 1995]
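To make the bullets above concrete, here is a minimal C sketch of the DPD pairwise force in the standard Groot-Warren form; the parameter names (a, gamma, kBT, rc, dt) and the noise handling are illustrative assumptions, not code from uDeviceX.

#include <math.h>
#include <stdlib.h>

/* Minimal sketch of the DPD pairwise force in the standard Groot-Warren
   form, acting along the unit vector e = r_ij / |r_ij|:
     F^C = a * (1 - r/rc)                  conservative
     F^D = -gamma * w(r)^2 * dot(e, v_ij)  dissipative
     F^R = sigma * w(r) * xi / sqrt(dt)    random
   with w(r) = 1 - r/rc and sigma^2 = 2 * gamma * kBT, which is the
   fluctuation-dissipation relation listed above. */
void dpd_force(const double r_ij[3], const double v_ij[3],
               double a, double gamma, double kBT,
               double rc, double dt, double f[3])
{
    const double r = sqrt(r_ij[0] * r_ij[0] + r_ij[1] * r_ij[1] +
                          r_ij[2] * r_ij[2]);
    f[0] = f[1] = f[2] = 0;
    if (r >= rc || r <= 0)
        return; /* short-range: nothing beyond the cutoff */

    const double w = 1 - r / rc;
    const double e[3] = { r_ij[0] / r, r_ij[1] / r, r_ij[2] / r };
    const double ev = e[0] * v_ij[0] + e[1] * v_ij[1] + e[2] * v_ij[2];
    const double sigma = sqrt(2 * gamma * kBT);
    /* Placeholder zero-mean noise; production code needs a symmetric,
       per-pair-consistent RNG with the right variance. */
    const double xi = 2 * drand48() - 1;
    const double fm = a * w - gamma * w * w * ev + sigma * w * xi / sqrt(dt);
    f[0] = fm * e[0];
    f[1] = fm * e[1];
    f[2] = fm * e[2];
}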
The In-Silico Lab-on-a-Chip: Petascale and High-Throughput Simulations of Microfluidics at Cell Resolution
Rossinelli, Tang, Lykov, Alexeev, Bernaschi, Hadjidoukas, Bisson, Joubert, Conti, Karniadakis, Fatica, Pivkin, Koumoutsakos
ACM Gordon Bell finalist 2015
Simulation Complexity
Unprecedented level of detail
● Up to 0.3 ml of blood / 1.4 billion RBCs
● Devices of up to 50 mm³
● Up to 1 trillion DPD particles
● 10-100 million steps
● 3-60 ms per step
HPC Challenges (NVIDIA GK110)
DPD:
● Irregular computation
● Irregular MPI messages
● Neighborhood changes at ~every step
● About 10X more expensive than LJ
● Dominated by integer operations
Consequences:
➢ Poor GPU execution
➢ Detrimental effect on the network (both bandwidth and latency)
➢ Verlet lists are not profitable here
➢ Increased TTS (time-to-solution)
➢ GK110 has a 3X slowdown for most of the integer instructions
Results
CTC-iChip
● 13 tilted rows of egg-shaped obstacles
● Separates large cells from RBCs
Funnels ratchet
● 128 rows of shrinking funnels
● Separates large cells according to their viscoelastic properties
Supercomputing - uDeviceX
On Titan, considering 18,688 nodes:
● 30-40X faster than LAMMPS DPD with the GPU package
● 100X faster for blood simulations
● Achieving up to 65% of the nominal CPU+GPU peak
● 35% of peak overall, sustained
● Weak scaling efficiency of 99+%
● Strong scaling efficiency of 94% from 625 to 5,000 nodes
High-impact HPC software
Minimize risks
– Software should be built to answer one high-impact scientific question
– Domain-specific competences are a prerequisite (external help is good too)
– Pick a suitable algorithm: advanced and established (trade-off?)
– Target an established platform/architecture. No future systems!
HPC
– Life is 3D, computers are 1D/2D (for now) -> HPC will always make a difference
– When does HPC render computing game-changing?
– Next level: HW/SW codesign (Anton, bitcoin mining) and ASIC coprocessors (BGQ, web servers)
Primary goal and priorities
– Time-to-solution?
– Time-to-software?
➢ Time-to-science: papers, proposals ( -> time-to-money )
Vertical solutions
● Horizontal solutions almost never pay off
– Absolutely avoid division of competences outside the group
– Avoid development of libraries
– Avoid homogeneity across developers
● Form a small group (5-10 people)
– Very heterogeneous competences
– Very high competences
● A-priori software design is a waste of time
– Except if you are a genius
● Mortals like us should embrace XP principles
– Resources
– Quality
– Time
– Scope
[Figure: the HPC software stack, from Science/papers ("what we like most") through the SW interface (TTS, accuracy), numerical schemes, mapping computation to ILP/DLP/TLP, and ninja/assembly code, down to the HW architecture ("where credit goes")]
Give up on ambitions
Approach
– Start small, figure out all the details
● Nothing is easy, even fwrite
– Time-to-software
– Focus on just 1-2 platforms
– Agile Manifesto
● Test-driven development
● Code refactors and hacks are a source of benefits
● Fine-grained hack-debug-refactor cycles
– XP for team sizes < ~5
– Do force homogeneity inside the group
● Divergence of opinion is a source of benefits
And for the "puritans"...
● Embrace the sad truth
– We can't write HPC software with just one programming language
– Performance portability in HPC is a joke or a career
– Writing "clean code" is just a misleading feeling
– The standard package C/MPI/pthreads/OpenMP/CUDA will stay around
● Generality is nothing else than a big set of specific cases
– Generic programming does not make my software generic
– The most generic pieces of code around are written in C
● Do not underestimate vanilla codes
● Life is dirty, so is my code
● From the dirt, non-obvious beautiful things emerge
– Bitonic sort and recompaction at warp level in CUDA
– Distributed work stealing based on foMPI
– M4-unrolling and SSA for C kernels (see the sketch below)
– In-place 1D CUDA wavelet transforms
● Predicting re-usability and software design is almost impossible
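To illustrate the "M4-unrolling and SSA for C kernels" item, here is a hand-written sketch of what such macro-generated code can look like; the axpy kernel and all names are hypothetical, not taken from CUBISM-MPCF.

#include <stddef.h>

/* Hypothetical sketch of an unrolled, SSA-style C kernel of the kind an
   m4 macro expansion could emit: the loop body is replicated 4 times and
   every intermediate value gets a fresh single-assignment name, exposing
   independent instructions to the compiler's scheduler (ILP). */
static void axpy_unrolled4(float *restrict y, const float *restrict x,
                           const float a, const size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const float x0 = x[i + 0], x1 = x[i + 1];
        const float x2 = x[i + 2], x3 = x[i + 3];
        const float r0 = a * x0 + y[i + 0];
        const float r1 = a * x1 + y[i + 1];
        const float r2 = a * x2 + y[i + 2];
        const float r3 = a * x3 + y[i + 3];
        y[i + 0] = r0; y[i + 1] = r1;
        y[i + 2] = r2; y[i + 3] = r3;
    }
    for (; i < n; ++i) /* scalar remainder */
        y[i] = a * x[i] + y[i];
}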
Adopt the UNIX and Linux philosophy
● Bell Labs
– Ritchie, Thompson, Sweldens
– UNIX, C, second-generation wavelets
– Embrace the UNIX philosophy
● Linux
– Very powerful thread scheduler, file system, fast I/O
– Powerful programming environment
– Does a good job in dealing with:
● High oversubscription
● High-performance I/O
● IPC
– Can't be ignored during software development
● Cray and IBM supercomputers
– Put hard bounds on fanciness
– Example: aprun + fork + MPI init
End / Conversation
Abstracting the hardware
How can we characterize hardware performance?
● We target high throughput
● GFLOP/s (or other operations /s, /cycle)
● GB/s (or B/cycle)
Identify common traits:
● Increasing data-parallelism
● Small data cache units
● Superscalar execution
– Compute-transfer overlap (+ data prefetching & streaming stores; see the sketch below)
– Dedicated floating-point units
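As a small illustration of the streaming-stores trait above, the following C sketch copies data with non-temporal SSE stores; the function name and size assumptions are illustrative.

#include <stddef.h>
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Illustrative sketch of streaming (non-temporal) stores: copy n floats
   (n a multiple of 4, dst 16-byte aligned) while bypassing the cache,
   so the output traffic does not evict the kernel's working set. */
void stream_copy(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));

    _mm_sfence(); /* order the non-temporal stores before later reads */
}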
Abstracting the software
Operational Intensity (OI)
● Operations per byte of DRAM traffic [FLOP/B]
● It measures the traffic between the DRAM and the LLC
● Irregularities and instruction-related problems are ignored
● One OI for each kernel, and the spectrum is wide!
Software as a set of compute kernels
Each kernel is characterized by:
● Op count [FLOP]
● Compulsory off-chip memory traffic [B]
(a worked OI example follows below)
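A worked example with hypothetical numbers: a single-precision saxpy kernel, y[i] = a*x[i] + y[i], performs 2 FLOP per element against 12 B of compulsory traffic (read x, read y, write y), giving OI = 2/12 ≈ 0.17 FLOP/B.

#include <stdio.h>

/* Hypothetical OI computation for a single-precision saxpy kernel,
   y[i] = a * x[i] + y[i], over n elements. */
int main(void)
{
    const double n = 1e8;       /* elements (illustrative) */
    const double flops = 2 * n; /* one mul + one add per element */
    const double bytes = 12 * n; /* read x, read y, write y: 3 x 4 B */
    printf("OI = %.3f FLOP/B\n", flops / bytes); /* ~0.167 */
    return 0;
}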
The roofline model
● It visually relates hardware with software
● Performance = min(PB x OI, PP), where PB is the peak DRAM bandwidth and PP the peak performance
● The ridge point PP/PB characterizes the model
(a small C sketch follows the figure below)
[Roofline plot: attainable performance, Hardware (GFLOP/s), against operational intensity, Software (FLOP/B)]
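Finally, a minimal C sketch of the roofline formula; the GK110-class numbers are rough assumptions for illustration, not measured values.

#include <stdio.h>

/* Attainable performance under the roofline model:
   min(peak_bandwidth * OI, peak_performance). */
static double roofline(double oi_flop_per_byte,
                       double pb_gb_per_s, double pp_gflop_per_s)
{
    const double bw_bound = pb_gb_per_s * oi_flop_per_byte;
    return bw_bound < pp_gflop_per_s ? bw_bound : pp_gflop_per_s;
}

int main(void)
{
    /* Illustrative numbers only (roughly GK110-class). */
    const double pb = 250.0;  /* GB/s */
    const double pp = 3900.0; /* GFLOP/s, single precision */
    printf("ridge point: %.2f FLOP/B\n", pp / pb);
    printf("saxpy (OI ~ 0.17): %.1f GFLOP/s\n",
           roofline(2.0 / 12.0, pb, pp));
    return 0;
}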