Bridging the productivity gap in HPC
Diego Rossinelli, CSE Lab, ETH Zurich
Unprecedented diversity
Industry switch to many-core architectures
● Unprecedented computing power for science/med apps
● Diverse architectural trends
● Unprecedented software challenges
Unprecedented divergence
Are these challenges properly addressed?
CUBISM/MPCF
Stanford CTR compressible turbulence
Shinjo, Umemura: liquid jet breakup
PPM
Productivity-performance gap
Fast evolution of computing hardware may lead to
● Frequent rewrites of software
● Unsustainable development efforts
● Or suboptimal use of the hardware
Berkeley View: Dwarfs are gone
[Diagram: dwarfs as common computational patterns bridging APPLICATIONS and HARDWARE]
● Idea: identify common patterns
– That are shared among different application domains
– That are difficult to execute efficiently
● Application developers rely on dwarf-libraries
– Delegation of the development and optimization burden
– Increase of productivity
– HPC experts develop dwarf libraries
– Dwarf-specific software optimization techniques
Recent software from CSE Lab
● CUBISM-MPCF
– Started in 2012
– 6 months, 3 developers
– Shock-Bubble Interaction at Mach 3
● VP2.0
– Started in 2014
– 6 months, 1 developer
– Simulation of Collapsing Bubbles
● uDeviceX
– Started in 2014
– 8 months, 4 core developers, 4 HPC specialists
– Catching a Needle in a Flowing Haystack
In Silico Lab-On-A-Chip
● Focus: devices isolating CTCs
– CTC-iChip [M. Toner's Group]: Deterministic Lateral Displacement
– Funnels ratchet [McFaul]: viscoelastic deformations
● Goal: numerical investigation
– Can we assess the device effectiveness?
– Can we improve it?
Dissipative Particle Dynamics
● N-body algorithm
● Short-range pairwise interactions (a minimal force sketch follows below)
● Fluctuation-dissipation theorem
● Unified framework for complex FSI
– Walls as frozen particles
– Cells discretized as membranes
– Suspended RBCs, CTCs in the solvent
– Enables separation of scales
[Hoogerbrugge and Koelman, 1992] [Groot and Warren, 1997] [Español, 1995]
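To make the bullets above concrete, here is a minimal C sketch of the DPD pairwise force in the standard Groot-Warren form; the parameter names (a, gamma, kBT, rc, dt) and the noise handling are illustrative assumptions, not code from uDeviceX.

#include <math.h>
#include <stdlib.h>

/* Minimal sketch of the DPD pairwise force in the standard Groot-Warren
   form, acting along the unit vector e = r_ij / |r_ij|:
     F^C = a * (1 - r/rc)                  conservative
     F^D = -gamma * w(r)^2 * dot(e, v_ij)  dissipative
     F^R = sigma * w(r) * xi / sqrt(dt)    random
   with w(r) = 1 - r/rc and sigma^2 = 2 * gamma * kBT, which is the
   fluctuation-dissipation relation listed above. */
void dpd_force(const double r_ij[3], const double v_ij[3],
               double a, double gamma, double kBT,
               double rc, double dt, double f[3])
{
    const double r = sqrt(r_ij[0] * r_ij[0] + r_ij[1] * r_ij[1] +
                          r_ij[2] * r_ij[2]);
    f[0] = f[1] = f[2] = 0;
    if (r >= rc || r <= 0)
        return; /* short-range: nothing beyond the cutoff */

    const double w = 1 - r / rc;
    const double e[3] = { r_ij[0] / r, r_ij[1] / r, r_ij[2] / r };
    const double ev = e[0] * v_ij[0] + e[1] * v_ij[1] + e[2] * v_ij[2];
    const double sigma = sqrt(2 * gamma * kBT);
    /* Placeholder zero-mean noise; production code needs a symmetric,
       per-pair-consistent RNG with the right variance. */
    const double xi = 2 * drand48() - 1;
    const double fm = a * w - gamma * w * w * ev + sigma * w * xi / sqrt(dt);
    f[0] = fm * e[0];
    f[1] = fm * e[1];
    f[2] = fm * e[2];
}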
The In-Silico Lab-on-a-Chip: Petascale and High-Throughput Simulations of Microfluidics at Cell Resolution
Rossinelli, Tang, Lykov, Alexeev, Bernaschi, Hadjidoukas, Bisson, Joubert, Conti, Karniadakis, Fatica, Pivkin, Koumoutsakos
ACM Gordon Bell finalist 2015
Simulation Complexity
Unprecedented level of detail
● Up to 0.3 ml of blood / 1.4 billion RBCs
● Devices of up to 50 mm³
● Up to 1 trillion DPD particles
● 10-100 million steps
● 3-60 ms per step
HPC Challenges (NVIDIA GK110)
DPD:
● Irregular computation
● Irregular MPI messages
● Neighborhood changes at ~every step
● About 10X more expensive than LJ
● Dominated by integer operations
Consequences:
➢ Poor GPU execution
➢ Detrimental effect on the network (both bandwidth and latency)
➢ Verlet lists are not profitable here
➢ Increased TTS (time-to-solution)
➢ GK110 has a 3X slowdown for most of the integer instructions
Results
CTC-iChip
● 13 tilted rows of egg-shaped obstacles
● Separates large cells from RBCs
Funnels ratchet
● 128 rows of shrinking funnels
● Separates large cells according to their viscoelastic properties
Supercomputing - uDeviceX
On Titan, considering 18,688 nodes:
● 30-40X faster than LAMMPS DPD with the GPU package
● 100X faster for blood simulations
● Achieving up to 65% of the nominal CPU+GPU peak
● 35% of peak overall, sustained
● Weak scaling efficiency of 99+%
● Strong scaling efficiency of 94% from 625 to 5,000 nodes
High-impact HPC software
Minimize risks
– Software should be built to answer one high-impact scientific question
– Domain-specific competences are a prerequisite (external help is good too)
– Pick a suitable algorithm: advanced and established (trade-off?)
– Target an established platform/architecture. No future systems!
HPC
– Life is 3D, computers are 1D/2D (for now) -> HPC will always make a difference
– When does HPC render computing game-changing?
– Next level: HW/SW codesign (Anton, bitcoin mining) and ASIC coprocessors (BGQ, web servers)
Primary goal and priorities
– Time-to-solution?
– Time-to-software?
➢ Time-to-science: papers, proposals ( -> time-to-money )
Vertical solutions
● Horizontal solutions almost never pay off
– Absolutely avoid division of competences outside the group
– Avoid development of libraries
– Avoid homogeneity across developers
● Form a small group (5-10 people)
– Very heterogeneous competences
– Very high competences
● A-priori software design is a waste of time
– Except if you are a genius
● Mortals like us should embrace XP principles
– Resources
– Quality
– Time
– Scope
[Figure: the HPC software stack, from Science/papers ("what we like most") through the SW interface (TTS, accuracy), numerical schemes, mapping computation to ILP/DLP/TLP, and ninja/assembly code, down to the HW architecture ("where credit goes")]
Give up on ambitions
Approach
– Start small, figure out all the details
● Nothing is easy, even fwrite
– Time-to-software
– Focus on just 1-2 platforms
– Agile Manifesto
● Test-driven development
● Code refactors and hacks are a source of benefits
● Fine-grained hack-debug-refactor cycles
– XP for team sizes < ~5
– Do force homogeneity inside the group
● Divergence of opinion is a source of benefits
And for the "puritans"...
● Embrace the sad truth
– We can't write HPC software with just one programming language
– Performance portability in HPC is a joke or a career
– Writing "clean code" is just a misleading feeling
– The standard package C/MPI/pthreads/OpenMP/CUDA will stay around
● Generality is nothing else than a big set of specific cases
– Generic programming does not make my software generic
– The most generic pieces of code around are written in C
● Do not underestimate vanilla codes
● Life is dirty, so is my code
● From the dirt, non-obvious beautiful things emerge
– Bitonic sort and recompaction at warp level in CUDA
– Distributed work stealing based on foMPI
– M4-unrolling and SSA for C kernels (see the sketch below)
– In-place 1D CUDA wavelet transforms
● Predicting re-usability and software design is almost impossible
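To illustrate the "M4-unrolling and SSA for C kernels" item, here is a hand-written sketch of what such macro-generated code can look like; the axpy kernel and all names are hypothetical, not taken from CUBISM-MPCF.

#include <stddef.h>

/* Hypothetical sketch of an unrolled, SSA-style C kernel of the kind an
   m4 macro expansion could emit: the loop body is replicated 4 times and
   every intermediate value gets a fresh single-assignment name, exposing
   independent instructions to the compiler's scheduler (ILP). */
static void axpy_unrolled4(float *restrict y, const float *restrict x,
                           const float a, const size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const float x0 = x[i + 0], x1 = x[i + 1];
        const float x2 = x[i + 2], x3 = x[i + 3];
        const float r0 = a * x0 + y[i + 0];
        const float r1 = a * x1 + y[i + 1];
        const float r2 = a * x2 + y[i + 2];
        const float r3 = a * x3 + y[i + 3];
        y[i + 0] = r0; y[i + 1] = r1;
        y[i + 2] = r2; y[i + 3] = r3;
    }
    for (; i < n; ++i) /* scalar remainder */
        y[i] = a * x[i] + y[i];
}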
Adopt the UNIX and Linux philosophy
● Bell Labs
– Ritchie, Thompson, Sweldens
– UNIX, C, second-generation wavelets
– Embrace the UNIX philosophy
● Linux
– Very powerful thread scheduler, file system, fast I/O
– Powerful programming environment
– Does a good job in dealing with:
● High oversubscription
● High-performance I/O
● IPC
– Can't be ignored during software development
● Cray and IBM supercomputers
– Put hard bounds on fanciness
– Example: aprun + fork + MPI init
End / Conversation
Abstracting the hardware
How can we characterize hardware performance?
● We target high throughput
● GFLOP/s (or other operations /s, /cycle)
● GB/s (or B/cycle)
Identify common traits:
● Increasing data-parallelism
● Small data cache units
● Superscalar execution
– Compute-transfer overlap (+ data prefetching & streaming stores; see the sketch below)
– Dedicated floating-point units
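As a small illustration of the streaming-stores trait above, the following C sketch copies data with non-temporal SSE stores; the function name and size assumptions are illustrative.

#include <stddef.h>
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Illustrative sketch of streaming (non-temporal) stores: copy n floats
   (n a multiple of 4, dst 16-byte aligned) while bypassing the cache,
   so the output traffic does not evict the kernel's working set. */
void stream_copy(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));

    _mm_sfence(); /* order the non-temporal stores before later reads */
}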
Abstracting the software
Operational Intensity (OI)
● Operations per byte of DRAM traffic [FLOP/B]
● It measures the traffic between the DRAM and the LLC
● Irregularities and instruction-related problems are ignored
● One OI for each kernel, and the spectrum is wide!
Software as a set of compute kernels
Each kernel is characterized by:
● Op count [FLOP]
● Compulsory off-chip memory traffic [B]
(a worked OI example follows below)
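A worked example with hypothetical numbers: a single-precision saxpy kernel, y[i] = a*x[i] + y[i], performs 2 FLOP per element against 12 B of compulsory traffic (read x, read y, write y), giving OI = 2/12 ≈ 0.17 FLOP/B.

#include <stdio.h>

/* Hypothetical OI computation for a single-precision saxpy kernel,
   y[i] = a * x[i] + y[i], over n elements. */
int main(void)
{
    const double n = 1e8;       /* elements (illustrative) */
    const double flops = 2 * n; /* one mul + one add per element */
    const double bytes = 12 * n; /* read x, read y, write y: 3 x 4 B */
    printf("OI = %.3f FLOP/B\n", flops / bytes); /* ~0.167 */
    return 0;
}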
The roofline model
● It visually relates hardware with software
● Performance = min(PB x OI, PP), where PB is the peak DRAM bandwidth and PP the peak performance
● The ridge point PP/PB characterizes the model
(a small C sketch follows the figure below)
[Roofline plot: attainable performance, Hardware (GFLOP/s), against operational intensity, Software (FLOP/B)]
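Finally, a minimal C sketch of the roofline formula; the GK110-class numbers are rough assumptions for illustration, not measured values.

#include <stdio.h>

/* Attainable performance under the roofline model:
   min(peak_bandwidth * OI, peak_performance). */
static double roofline(double oi_flop_per_byte,
                       double pb_gb_per_s, double pp_gflop_per_s)
{
    const double bw_bound = pb_gb_per_s * oi_flop_per_byte;
    return bw_bound < pp_gflop_per_s ? bw_bound : pp_gflop_per_s;
}

int main(void)
{
    /* Illustrative numbers only (roughly GK110-class). */
    const double pb = 250.0;  /* GB/s */
    const double pp = 3900.0; /* GFLOP/s, single precision */
    printf("ridge point: %.2f FLOP/B\n", pp / pb);
    printf("saxpy (OI ~ 0.17): %.1f GFLOP/s\n",
           roofline(2.0 / 12.0, pb, pp));
    return 0;
}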