Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the...
Transcript of Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the...
![Page 1: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/1.jpg)
PerformancePerformance CEPBACEPBA
Judit Gimenez PJudit Gimenez – Pjudit@
Analisis withAnalisis with A ToolsA-Tools
erformance Toolserformance [email protected]
![Page 2: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/2.jpg)
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• ClusteringClustering
• On-line analysis
• Sampling• Sampling
Models
Tools integrationTools integration
How to? GROMACS analysis
![Page 3: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/3.jpg)
Traces!
.prv.prv
.pcf.pcfapplicationapplication.row.row
instrumentationinstrumentation
.trf.trf
ParaverParaver
DimemasDimemas
Since 1991
![Page 4: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/4.jpg)
• Variance is important
• Along time, across processors
• I f ti i i th d t il• Information is in the details
• Highly non linear systems
• Microscopic effects may have larmacroscopic impactmacroscopic impact
rge
![Page 5: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/5.jpg)
AlPPC / x86 clusters
Power 4/5/6 AIXAl
GL / BGP
Dimemas
me translators: TAU, KOJAK, OTF, HPC
y easy to add new programming mo
ltix Sltix(Supported by NASA AMES) By Rainer Kelle
PERUSEB R i K ll (H
Cell BE
By Rainer Keller (H& (UTK)
MPICH collective internals
C Toolkit....
odels
![Page 6: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/6.jpg)
R d tRaw data
tunable
eeing is believing
easuring is better
TimelinesId tifi f f tiIdentifier of functionHardware countsMiss ratiosPerformance (IPC, Mflops,…)
Routine duration• • •• • •
S i iStatisticsProfilesAverage miss ratio per routineg pHistogram of routine durationNumber of messages
• • •• • •
![Page 7: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/7.jpg)
•• From raw events Piece-wise constan
• Basic metricsInstructions
MPI callsMPI callsUseful duration
• Derived metrics
instrusefulIPC= instrcycles
�useful
• Models
L2 = cycles− instr / idea
nt functions of time plots / colors
Color encoding
MPI call Cost=
MPI callduration
bytes
preemptedtime=elapsed− cyccloc
alIPC
![Page 8: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/8.jpg)
• 2D T bl• 2D Tables • Accumulate over the different values of a co
• P fil C t i l ti li ( f ti• Profile: Categorical timeline (user function
• Histogram: Numeric timeline (duration, in• Diff t d t i d C l ti• Different data window – Correlations
• Average IPC @ each user function
• N b f i t ti f h f• Number of instructions for each range of • Communication Patterns• 3D extensions – 2 control windows3D extensions 2 control windows
Duration - IPC
While compu
ontrol window
i ll )n, mpi call...)
structions, IPC...)
d tidurations MPI calls profile
uting Communication pattern
![Page 9: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/9.jpg)
6 tasks trace
![Page 10: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/10.jpg)
• Simulation: Highly non linear model
• Linear componentsp
• Point to point communication
• Sequential processor performanceq p p
• Global CPU speed
• Per block/subroutinePer block/subroutine
• Non linear components
• Synchronization semantics
• Blocking receives
• Rendezvous
• Resource contention
• CPU
• Communication subsystem • links (half/full duplex), busses
B
Local
CPU
L
CPULocal
L
CPULoc
L
CPU
Local
MemoryCPU CPU
CPU
Local
Memory CPU
CPU
Loc
Mem
• GRID extensionsDedicated connectionsDedicated connections
External network (trafic)
![Page 11: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/11.jpg)
• Point to point model
SimresMPI_send
Logical TransferTransfer
LatencyUses CPUIndependent of size
MPI_recv
• Collectives modelCollectives model
• Barrier / Fan-in / Fan-out
• Cost Generic / Per callCost Generic / Per call
Processor timeBlock time
Comm. time
Time = Latency+ mBaBa
mulated contention for machine sources (links & buses)
Physical Transfer SizeBW
Process Blocked
MODEL_Bandwidth
SizeLatencyTime ∗⎟⎠⎞
⎜⎝⎛ +=
Machine 1 Machine 2
Fan in inte
Fan in extern
Fan out exte
Fan out inter
![Page 12: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/12.jpg)
• Simulations with 4 processes per node• NMM Iberia 4Km
• Not sensitive to latency
• 512 sensitive to contention?
• 256 MB/s OK• ARW Iberia 4 Km
• Not sensitive to latency
• sensitive to contention
• Need 1GB/sContention Impact (L=8; BW=2
1 2
0.8
1
1.2
omec
tivity
0.4
0.6
Spee
dup
vs. F
ull c
o
0
0.2
4 8 12 16 20 24 28
S
Impact of latency (BW=256; B=0)
0 998
1
1.002
l Lat
ency
0.994
0.996
0.998
up v
s.
Nom
inal
0.99
0.992
0 2 4 8 16 32
Spe
edu
Impact of BW (L=8; B=0)
256)
0.80
1.00
1.20
NMM 512
ARW 5120.40
0.60
0.80
Effic
ienc
y
NMM 256
ARW 256
NMM 128
ARW 1280.00
0.20
1 4 16 64 256 1024
32 36
![Page 13: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/13.jpg)
• MPIRE, 32 tasks, no network contention
L=25us BW=100MB/sL=25us, BW=100MB/s
L 25 BW 10MB/
L 1000 s BW 100MB/sL=1000us, BW=100MB/s
All windows same sca
![Page 14: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/14.jpg)
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• ClusteringClustering
• On-line analysis
• Sampling• Sampling
Models
Tools integrationTools integration
How to? GROMACS analysis
![Page 15: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/15.jpg)
l for data reductioniodic structure: Reference focus for detaomatically generate a representative sub
ta handling/summarization capability
y g p
• Software counters, filtering and cuttinggnals:
• To “clean” data: Flushing processes, %pree#msgs/BW (clogged system)
• T id tif t t #T k ti S• To identify structure: #Tasks computing, Suduration, #processes in MPI, average IPC,…
tomatizable through signal processing techniqg g p g q
• Mathematical morphology to clean up pertu
• Wavelet transform to identify coarse regionsWavelet transform to identify coarse regions
• Spectral analysis for detailed periodic patte
ailed analysisbtrace of few iterations
570 s2.2 GB
WRF-NMMPeninsula 4km
MPI, HWC 128 procs
emted time,
b t
570 s5 MB
um burst …ques:q
urbed regions
s 4.6 ss
ern36.5 MB
![Page 16: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/16.jpg)
• Useful for identifying and highlighting struct
• Cluster information injected in trace
• Phases within routines
• Different routines may have similar behav
• Compact trace encoding• Compact trace encoding
• Input to time analysis
CPMD
NAS BT
ture
vior
0.8
1
DBSCAN (Eps=0.01, MinPoints=20) clustering of trace WRF-128-PI.chop2.trf
0.4
0.6
Inst
ruct
ions
Com
plet
ed
WRF
0.2
In
![Page 17: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/17.jpg)
Focusing analysis
• Projection of HWC within a cluster
• Apply analysis to cluster level
• Statistics• CPI stack model• My favorite metrics
![Page 18: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/18.jpg)
• Based on MRNet• Based on MRNet
i … i+n10 … 300 … i+
m… 1i0
…
• 1st experiment: Collective duration thresholdp
245MB, >15500 col
• Current development
245MB, 15500 col
• Periodic frequency analysis
• P i di l t i h t
d <1MB, <
25MB, <
Collective internals
![Page 19: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/19.jpg)
A l i tAnalysis atAnalysis atAnalysis at
Clusters distribution
t i t 30t minute 10t minute 20t minute 30
CPI STACK model (generic
![Page 20: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/20.jpg)
• Useful
• To control granularity (not driven by MPI c
• To identify timeline profile, application str
• Adding sampling information on the tracefile
• based on the overflow mechanism offered
• period = f (cycles / instructions / cache m
• sampled information:
• call stack – as reference to source
• other hardware counters (not sampled)
•• interval between samples:
• High frequency sampling (> Nyquist)
• L f li• Low frequency sampling
calls, selected user functions)
ucture
e
d by PAPI
isses...)
![Page 21: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/21.jpg)
How to increase precision?Folding based on known periodic structure of application (tagged iterations) Applies to stationary applications
R lt t f it ti ith th tiResult: trace for one iteration with synthetic paraver events
Refer counts/timestamps to start of iteration
• Call stack
• Search for consecutive sequences of folded samples within same function and generate synthetic events
• Hardware countersHardware counters
• Noise reduction• Fit folded samples• Sample fitting curve to generate synthetic
events.
![Page 22: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/22.jpg)
• Useful to identify
• Internal behavior
• Density of L1 misses within MPI in an SM
Data
MP: when data actually arrives.
arrives
![Page 23: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/23.jpg)
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• ClusteringClustering
• On-line analysis
• Sampling• Sampling
Models
Tools integrationTools integration
How to? GROMACS analysis
![Page 24: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/24.jpg)
T
T i
T
Supi
LB=∑i=1
P
eff iCoeff i=
T iIPC
# LBP *max�eff i�
eff i T#instr
T ideal
Sup=
Migrating/local load imbalanceSerializationSerialization
CommicroLB=max �T i �
up= PP
� LBLB
� CommEffC Eff
� IPCIPC
�instr0
i t
Directly from real execution me
P0 LB0 CommEff 0 IPC 0 instr
ommEff= max �eff i �
Directly from real execution me
PP0
� macroLBmacroLB0
� microLBmicroLB0
� CommEffCommEff 0
� IPCIPC 0
�instr0
instr
Requires Dimemas simul
mmEff=T ideal
![Page 25: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/25.jpg)
Poor comReplication of computation Poor commmunication efficiencymmunication efficiency
Improved macrmicro load bala
CommuncMicro LoMacro LoIPCIPCComputat
![Page 26: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/26.jpg)
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• ClusteringClustering
• On-line analysis
• Sampling• Sampling
Models
Tools integrationTools integration
How to? GROMACS analysis
![Page 27: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/27.jpg)
Pattern Pattern AnalysisAnalysis
MPITraceMPITrace
DIMEMASVENUS (IBM-ZRL)
realideal_lFU
ideal_nFUideal_ROB
ideal_L1_cachesideal_L1_instr_cache
ideal
BT:C.64, task 20, cluster 1
MPSimMPSimreal_lFU
real_nFUreal_ROB
real_L1_cachesreal_L1_data_cache
real
0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3
ClusteringClusteringgg
bursts bursts selectionselection
fFfF
ValgrindValgrind
![Page 28: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/28.jpg)
• Profile presentation tools
• Reduction/aggregation of the performanc
• Time dimension disappeared / Space dim• Traces
• “All” data is there
ce dimensions
Profilestats C
mension sometimes
VarianceVarianceTimspa
![Page 29: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/29.jpg)
l t f V i d TAU tslator of Vampir and TAU traces
![Page 30: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/30.jpg)
HPCTHPCT
Pattern Pattern AnalysisAnalysis
Stats Stats
algrindalgrind
racerace FiltersFiltersPARAVPARAV
.prv.prvff
PARAVPARAV
.pcf.pcf.row.row
MPSiMPSi
ClusteringClustering
.trf.trf
MPSimMPSim
KOJAKKOJAK
TAUTAUGenGen
.cfg.cfg
VERVERVamVam
oftoft22prvprv
VERVER
DIMEMAS
VENUS (IBM-ZRL)
![Page 31: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/31.jpg)
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• ClusteringClustering
• On-line analysis
• Sampling• Sampling
Models
Tools integrationTools integration
How to? GROMACS analysis
![Page 32: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/32.jpg)
Test case: NUCLEOSOME
N
GaGaaG
2 4 8 16 32 64 128 256 512 1024
Cluster / MN perf.
2,5
3
3,5
1
1,5
2
Raw processor performance0
0,5
1 2 3 4 5 6 7 8
Raw processor performancevs
Communication efficiency?
NAMD 2.7
y
GMX4_mn_work_nucleo_nsdayGMX4_mn_stop_nucleo_nsdayayGMX4_cluster_work_nsday
![Page 33: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/33.jpg)
64 tasks: whole run 14GBfilter only “large enough” computations 25
h 1 it ti 13MBchop 1 iteration13MB
5MB
![Page 34: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/34.jpg)
64 tasks
256 tasks
64 tasks 256 tasks
Main scalability problems• Load balanceparallel eff
comm Eff • Computation balance
4% of code replication
comm. Effload balancemico load balancecomp. balance 4% of code replication
64 tasks already poor effic
pcomputationIPC balanceIPC y p
• Communication problem
![Page 35: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/35.jpg)
64 tasks
<
![Page 36: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/36.jpg)
256 tasks
T diff f
![Page 37: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/37.jpg)
Instructions histogram
64 tasks
256 tasks
![Page 38: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/38.jpg)
64 tasks
![Page 39: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/39.jpg)
64 tasks: globally balanced but some lo
Instructions imbal
IPC Imba
ocal imbalance
lance
alance
![Page 40: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/40.jpg)
MPI calls
64 tasks
256 tasks
Duration of the Computation
64 tasks
256 tasks
![Page 41: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/41.jpg)
64Load balance!!!
T id l
64 pro
– Trapezoidal– IPC ∝ Instr
– 2x instr20% IPC
256
– 20% IPC– IPC ∝ 1/Instr
ocs
procs
Long waits for FFTs
![Page 42: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/42.jpg)
Tasks phase parallel eff.64 all 0,55
FFTs 0 55FFTs 0,55part icles 0,54
256 all 0,31FFTs 0 35FFTs 0,35part icles 0,29
0,9
1
0,6
0,7
0,8
0,4
0,5
,
0 1
0,2
0,3
64 all 64 FFTs 64 part icles0
0,1
load balance comp. balance5 0,93 0,815 0 91 0 955 0,91 0,954 0,94 0,871 0,61 0,585 0 55 0 695 0,55 0,699 0,74 0,72
parallel eff.load balancecomp. balance
256 all 256 FFTs 256 part icles
![Page 43: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/43.jpg)
64 tasks
256 t k256 tasks
![Page 44: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/44.jpg)
64 tasks
![Page 45: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/45.jpg)
64 tasks
• Real
• Prediction nominal
• Prediction ideal– Not much gain– Significant serialization
• Dependences?
![Page 46: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/46.jpg)
Particles
MPI• The imbalance
computation region where OpenMP would help for
MPI_Re
_
p pload balance is between sendrecv calls in line 1533 in domdec.c
MPI Waitall:
MPI_Senredv: 1533@do
MPI_Waitall:
Color re
Allreduce: 286@pme pp.c
cv: 427@pme_pp.c
_ @p _pp
: 126@demdec network c
omdec.c
: 126@demdec_network.c
epresents MPI caller line
![Page 47: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/47.jpg)
FFTsCo
• Imbalance computation between Alltoallvs in line [email protected]– Strange?– Better selection of # fft
procs?procs?– OpenMP?
MPI_Sendrecv: 482@pdMPI_Sendrecv: 7
MPI_Sendrecv: 344@gmx_MPI AlltoMPI_Allto
MMM
lor represents MPI call
Color represents MPI caller line
pme.c@[email protected]
_parallel_3dfft.coallv: 241@fftgrid coallv: [email protected]_Sendrecv: [email protected] Sendrecv: [email protected]_Sendrecv: [email protected]
MPI_Recv: 204@pme_pp.c
![Page 48: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/48.jpg)
Main scalability problem is load unbala
• Mixing MPI+OpenMP with dynamic
Mixing of Particles and FFTs tasks maMixing of Particles and FFTs tasks ma
• Two different granularities
Even if was initially considered to havepoor communication efficiencypoor communication efficiency
ance
c run time load balance
akes things difficultakes things difficult
e a good performance, the 64 tasks have
![Page 49: Performance Analisis withAnalisis with CEPBAA-Tools · •2DT blD Tables • Accumulate over the different values of a co • P fil C t i l ti li ( f tiProfile: Categorical timeline](https://reader033.fdocuments.net/reader033/viewer/2022041807/5e5553985daff22b9222c788/html5/thumbnails/49.jpg)
Performance analysis: fun and colorful
Simple tools to analyze complex proble
Scalability requires tools to drive/suppo
Importance of the details to understand
Benefits from the tools interoperability
Tools freely available ([email protected]) B
l
ems
ort the analyst where to look at
d
Best effort support and training