“OpenMP Does Not Scale”
Ruud van der Pas
Distinguished Engineer
Architecture and Performance
SPARC Microelectronics, Oracle
Santa Clara, CA, USA

Agenda
• The Myth
• Deep Trouble
• Get Real
• The Wrapping

The Myth

“A myth, in popular use, is something that is widely believed but false.”
(source: www.reference.com)

“OpenMP Does Not Scale”
A Common Myth
A Programming Model Can Not “Not Scale”
What Can Not Scale:
• The Implementation
• The System Versus The Resource Requirements
• Or ..... You

Hmmm .... What Does That Really Mean?

Some Questions I Could Ask
“Do you mean you wrote a parallel program, using OpenMP and it doesn't perform?”
“I see. Did you make sure the program was fairly well optimized in sequential mode?”
“Oh. You didn't. By the way, why do you expect the program to scale?”
“Oh. You just think it should and you used all the cores. Have you estimated the speed up using Amdahl's Law?”
“No, this law is not a new EU financial bailout plan. It is something else.”
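
For readers who have not met Amdahl's Law: it bounds the speed up by the fraction of the run time that is actually parallelized. The sketch below is an illustration only; the function name `amdahl_speedup` and the sample numbers are assumptions and do not come from the slides.

```c
#include <assert.h>

/* Amdahl's Law: upper bound on the speed up with `threads` threads when a
 * fraction `f` (0 <= f <= 1) of the single-thread run time is parallelized.
 * The function name is illustrative, not taken from the talk. */
double amdahl_speedup(double f, int threads)
{
    assert(f >= 0.0 && f <= 1.0 && threads >= 1);
    return 1.0 / ((1.0 - f) + f / (double)threads);
}
```

With f = 0.9, sixteen threads give at most a 6.4x speed up, and no number of threads can exceed 10x, because the remaining 10% stays sequential.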

Some Questions I Could Ask
“I understand. You can't know everything. Have you at least used a tool to identify the most time consuming parts in your program?”
“Oh. You didn't. You just parallelized all loops in the program. Did you try to avoid parallelizing innermost loops in a loop nest?”
“Oh. You didn't. Did you minimize the number of parallel regions then?”
“Oh. You didn't. It just worked fine the way it was.”

More Questions I Could Ask
“Did you at least use the nowait clause to minimize the use of barriers?”
“Oh. You've never heard of a barrier. Might be worth reading up on.”
“Do all threads roughly perform the same amount of work?”
“You don't know, but think it is okay. I hope you're right.”
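
Two of the checklist items above, fewer parallel regions and fewer barriers, can be combined in one small sketch. It is a hypothetical example, not code from the talk: the function name `scale_two_vectors` and its arguments are made up for illustration, and `nowait` is only valid here because the second loop never touches `a`.

```c
#include <assert.h>

/* Two independent loops share ONE parallel region instead of two.
 * The nowait clause removes the implicit barrier after the first loop;
 * that is safe only because the second loop does not read or write a[]. */
void scale_two_vectors(int n, double *a, double *b, double s)
{
    assert(n >= 0 && a != 0 && b != 0);

    #pragma omp parallel default(none) shared(n, a, b, s)
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] *= s;

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] += s;
    } /* end of parallel region: one barrier instead of three */
}
```

If the second loop did depend on `a`, the `nowait` would have to go, since a thread could otherwise read elements another thread has not scaled yet.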

I Don’t Give Up That Easily
“Did you make optimal use of private data, or did you share most of it?”
“Oh. You didn't. Sharing is just easier. I see.”
“You seem to be using a cc-NUMA system. Did you take that into account?”
“You've never heard of that either. How unfortunate. Could there perhaps be any false sharing affecting performance?”
“Oh. Never heard of that either. May come in handy to learn a little more about both.”
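
False sharing happens when threads update different variables that sit in the same cache line, so the line bounces between cores even though no data is logically shared. The sketch below shows one common remedy, padding per-thread data. It is an assumption-laden illustration: the 64-byte line size, the `MAX_THREADS` bound and all names are hypothetical and not from the talk.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE  64   /* assumed cache line size in bytes */
#define MAX_THREADS 256  /* assumed upper bound on the team size */

/* One counter per thread, padded to a full cache line so that two
 * counters never end up in the same line. */
struct padded_count {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static int team_size(void)
{
#ifdef _OPENMP
    int t = omp_get_max_threads();
    return t < MAX_THREADS ? t : MAX_THREADS;
#else
    return 1;   /* compiled without OpenMP: sequential */
#endif
}

static int thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Count the strictly positive elements of x[0..n-1]. */
long count_positive(int n, const double *x)
{
    struct padded_count counts[MAX_THREADS] = {{0}};
    int nthreads = team_size();
    long total = 0;

    assert(n >= 0 && x != 0);

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = thread_id();

        #pragma omp for
        for (int i = 0; i < n; i++)
            if (x[i] > 0.0)
                counts[tid].value++;  /* each thread writes its own line */
    }

    for (int t = 0; t < nthreads; t++)
        total += counts[t].value;
    return total;
}
```

In real code a `reduction(+:total)` clause would do the same job with less effort; the padded array is shown only because it makes the cache line layout explicit.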

The Grass Is Always Greener ...
“So, what did you do next to address the performance?”
“Switched to MPI. I see. Does that perform any better then?”
“Oh. You don't know. You're still debugging the code.”

Going Into Pedantic Mode
“While you're waiting for your MPI debug run to finish (are you sure it doesn't hang by the way?), please allow me to talk a little more about OpenMP and Performance.”

Deep Trouble

OpenMP And Performance /1
• The transparency and ease of use of OpenMP are a mixed blessing
  – Makes things pretty easy
  – May mask performance bottlenecks
• In the ideal world, an OpenMP application “just performs well”
• Unfortunately, this is not always the case

OpenMP And Performance /2
• Two of the more obscure things that can negatively impact performance are cc-NUMA effects and False Sharing
• Neither of these is restricted to OpenMP
  – They come with shared memory programming on modern cache based systems
  – But they might show up because you used OpenMP
  – In any case they are important enough to cover here

Considerations for cc-NUMA

A Generic cc-NUMA Architecture
[Figure: two processors, each with its own memory, connected by a cache coherent interconnect. Local access is fast, remote access is slower.]
Main Issue: How To Distribute The Data?

About Data Distribution
• Important aspect on cc-NUMA systems
  – If not optimal, longer memory access times and hotspots
• OpenMP 4.0 does provide support for cc-NUMA
  – Placement under control of the Operating System (OS)
  – User control through OMP_PLACES
• Windows, Linux and Solaris all use the “First Touch” placement policy by default
  – May be possible to override default (check the docs)

Example First Touch Placement /1

    for (i=0; i<10000; i++)
        a[i] = 0;

[Figure: two processors, each with its own memory, connected by a cache coherent interconnect. a[0] : a[9999] resides in a single memory.]
First Touch: All array elements are in the memory of the processor executing this thread

Example First Touch Placement /2

    #pragma omp parallel for num_threads(2)
    for (i=0; i<10000; i++)
        a[i] = 0;

[Figure: two processors, each with its own memory, connected by a cache coherent interconnect. a[0] : a[4999] resides in one memory, a[5000] : a[9999] in the other.]
First Touch: Both memories now have “their own half” of the array
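
The two placement examples above can be turned into a small, self-contained sketch. It assumes a first-touch policy in the operating system and that pages of a large allocation are typically placed when first written rather than at `malloc` time; the function names are hypothetical and not from the talk.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and let each thread initialize ("first touch") the
 * chunk it will work on later. Under a first-touch policy this tends to
 * put each chunk in the memory closest to the thread that uses it. */
double *alloc_first_touch(int n)
{
    assert(n > 0);
    double *a = malloc((size_t)n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 0.0;
    return a;
}

/* The compute loop uses the same static schedule, so each thread mostly
 * revisits the elements it touched first. */
double sum_array(int n, const double *a)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

The benefit depends on threads staying where they are, so in practice this is usually combined with thread binding, for example through `OMP_PLACES` and `OMP_PROC_BIND`.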

Get Real

“Don’t Try This At Home (yet)”

The Initial Performance (35 GB)
[Figure: SPARC T5-2 Performance (SCALE 26). x-axis: number of threads (0 to 48); y-axis: billion traversed edges per second (GTEPS, 0.000 to 0.050).]
Game over beyond 16 threads

That doesn’t scale very well
Let’s use a bigger machine!

Initial Performance (35 GB)
[Figure: SPARC T5-2 and T5-8 Performance. x-axis: number of threads (0 to 48); y-axis: billion traversed edges per second (GTEPS, 0.000 to 0.050). Series: T5-2 SCALE 26, T5-8 SCALE 26.]

Oops! That can’t be true
Let’s run a larger graph!

Initial Performance (280 GB)
[Figure: SPARC T5-2 and T5-8 Performance. x-axis: number of threads (0 to 48); y-axis: billion traversed edges per second (GTEPS, 0.000 to 0.050). Series: T5-2 SCALE 29, T5-8 SCALE 29.]

Let’s Get Technical

Total CPU Time Distribution
[Figure: Total CPU Time Percentage Distribution (Baseline, SCALE 26). Stacked bars (0% to 100%) for 1, 2, 4, 8, 16 and 20 threads. Categories: Atomic operations, OMP-atomic_wait, OMP-critical_section_wait, OMP-implicit_barrier, Other, make_bfs_tree, verify_bfs_tree. Annotations: Function 1, Function 2.]

Bandwidth Of The Original Code
[Figure: SPARC T5-2 Measured Bandwidth (BASE, SCALE 28, 16 threads). x-axis: elapsed time (seconds, 0 to 9000); y-axis: bandwidth (GB/s, 0 to 80). Series: Total, Read (Socket 0), Read (Socket 1), Write (Socket 0), Write (Socket 1).]
Less than half of the memory bandwidth is used

Summary Original Version
• Communication costs are too high
  – Increases as threads are added
  – This seriously limits the number of threads used
  – This in turn affects memory access on larger graphs
• The bandwidth is not balanced
• Fixes:
  – Find and fix many OpenMP inefficiencies
  – Use some efficient atomic functions
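
The slides do not show which atomic functions were used in the graph code. As a generic illustration of the idea, the sketch below updates a histogram with `#pragma omp atomic` where a critical section would otherwise make every thread queue on one lock. The function and argument names are hypothetical.

```c
#include <assert.h>

/* Histogram update: each increment touches exactly one memory location,
 * so a lightweight atomic update is enough. A critical section here
 * would serialize all threads, even those updating different bins. */
void histogram(int n, const int *bin_of, int nbins, long *hist)
{
    assert(n >= 0 && nbins > 0 && bin_of != 0 && hist != 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int b = bin_of[i];
        if (b >= 0 && b < nbins) {
            #pragma omp atomic update
            hist[b]++;
        }
    }
}
```

Atomic updates still cost something when many threads hit the same bin, which is consistent with the wait times the profile above attributes to atomics and critical sections.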

Methodology
If The Code Does Not Scale Well
• Use A Profiling Tool
• Use The Checklist To Identify Bottlenecks
• Tackle Them One By One
This Is An Incremental Approach
But Very Rewarding

BO  BO
Secret Sauce

Comparison Of The Two Versions
[Figure: comparison of the two versions. Highlighted: Atomic/critical section wait, Implicit barrier, Idle state.]
Note the much shorter run time for the modified version

Performance Comparison
[Figure: SPARC T5-2 Performance (SCALE 29). x-axis: number of threads (0 to 256); y-axis: billion traversed edges per second (GTEPS, 0.00 to 0.70). Series: T5-2 Baseline, T5-2 OPT1.]
Peak performance is 13.7x higher

BO  MO
More Secret Sauce

Observations
• The Code Does Not Exploit Large Pages
  – But Needs It ....
• First Touch Placement Is Not Used
• Used A Smarter Memory Allocator

Bandwidth Of The New Code
[Figure: SPARC T5-2 Measured Bandwidth (OPT2, SCALE 29, 224 threads). x-axis: elapsed time (seconds, 0 to 4000); y-axis: bandwidth (GB/s, 0 to 150). Series: Total BW (GB/s), Read (GB/s), Write (GB/s).]
Maximum Read Bandwidth 135 GB/s

The Result
[Figure: bar chart, Billion Searched Edges Per Second (GTEPS) on the SPARC T5-8, with the bar values below.]

Version           SCALE 29 (282 GB)   SCALE 30 (580 GB)   SCALE 31 (1150 GB)
T5-8 OPT0 (BO)    0.04                0.03                0.03
T5-8 OPT1 (BO)    1.13                0.79                0.89
T5-8 OPT2 (MO)    1.47                1.73                1.67

39-52x improvement over original code

Bigger Is Definitely Better!
[Figure: SPARC T5-8 Graph Search Performance (SCALE 30 - Memory Requirements: 580 GB). x-axis: number of threads (1 to 512); y-axes: elapsed time (minutes, 0 to 800) and speed up (0 to 80). Series: Search Time (minutes), Speed Up.]
72x speed up!
Search time reduced from 12 hours to 10 minutes

A 2.3 TB Sized Problem
[Figure: SPARC T5-8 (SCALE 32, OPT2). x-axis: number of threads (0 to 1024); y-axis: billion traversed edges per second (GTEPS, 0.00 to 1.60).]
896 Threads!
Even starting at 32 threads the speed up is still 11x

Tuning Benefit Breakdown
[Figure: SPARC T5-8 Speed Up Over OPT0. x-axis: SCALE value (26 to 31); y-axis: speed up over the OPT0 version (0 to 55). Series: OPT1, OPT2.]
Somewhat diminishing return
Bigger is better

MO  MOBO
Different Secret Sauce

A Simple OpenMP Change
[Figure: bar chart, Billion Searched Edges Per Second (GTEPS) on the SPARC T5-8, with the bar values below.]

Version             SCALE 29 (282 GB)   SCALE 30 (580 GB)   SCALE 31 (1150 GB)
T5-8 OPT0 (BO)      0.04                0.03                0.03
T5-8 OPT1 (BO)      1.13                0.79                0.89
T5-8 OPT2 (MO)      1.47                1.73                1.67
T5-8 OPT2 (MOBO)    2.15                2.43                2.42

57-75x improvement

“I Value My Personal Space”

My Favorite Simple Algorithm

    void mxv(int m, int n, double *a, double *b[], double *c)
    {
        for (int i=0; i<m; i++)    /* parallel loop */
        {
            double sum = 0.0;
            for (int j=0; j<n; j++)
                sum += b[i][j]*c[j];
            a[i] = sum;
        }
    }

[Figure: matrix times vector, a = b * c, with row index i and column index j.]
56
TheOpenMPSource
#pragma omp parallel for default(none) \ shared(m,n,a,b,c) for (int i=0; i<m; i++) { double sum = 0.0; for (int j=0; j<n; j++) sum += b[i][j]*c[j]; a[i] = sum; }
= *
j
i

Performance On Intel Nehalem
[Figure: x-axis: memory footprint (KByte, log-2 scale, 0.25 to 262144); y-axis: performance in Mflop/s (0 to 35,000). Series: 1x1 thread, 2x1 threads, 4x1 threads, 8x1 threads, 8x2 threads.]
Max speed up is ~1.6x only
Wait a minute! This algorithm is highly parallel ....
Notation: Number of cores x number of threads within core
System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

?

Let's Get Technical

A Two Socket Nehalem System
[Figure: two sockets (socket 0 and socket 1) connected by the QPI interconnect. Each socket has four cores (core 0 to core 3); every core has its own caches and two hardware threads (hwthread 0 and hwthread 1). Each socket also has a shared cache and its own memory.]

Data Initialization Revisited

    #pragma omp parallel default(none) \
            shared(m,n,a,b,c) private(i,j)
    {
        #pragma omp for
        for (j=0; j<n; j++)
            c[j] = 1.0;

        #pragma omp for
        for (i=0; i<m; i++)
        {
            a[i] = -1957.0;
            for (j=0; j<n; j++)
                b[i][j] = i;
        } /*-- End of omp for --*/

    } /*-- End of parallel region --*/

[Figure: matrix times vector, a = b * c, with row index i and column index j.]

Data Placement Matters!
[Figure: Intel Nehalem. x-axis: memory footprint (KByte, log-2 scale, 0.25 to 262144); y-axis: performance in Mflop/s (0 to 35,000). Series: 1x1 thread, 2x1 threads, 4x1 threads, 8x1 threads, 8x2 threads.]
The only change is the way the data is distributed
Max speed up is ~3.3x now
Notation: Number of cores x number of threads within core
System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

A Two Socket SPARC T4-2 System
[Figure: two sockets (socket 0 and socket 1) connected by the system interconnect. Each socket has eight cores (core 0 to core 7); every core has its own caches and eight hardware threads (hwthread 0 to hwthread 7). Each socket also has a shared cache and its own memory.]

Performance On SPARC T4-2
[Figure: x-axis: memory footprint (KByte, log-2 scale, 0.25 to 262144); y-axis: performance in Mflop/s (0 to 35,000). Series: 1x1 Thread, 8x1 Threads, 16x1 Threads, 16x2 Threads.]
Max speed up is ~5.8x
Scaling on larger matrices is affected by cc-NUMA effects (similar to Nehalem)
Note that there are no idle cycles to fill here
Notation: Number of cores x number of threads within core
System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

Data Placement Matters!
[Figure: SPARC T4-2. x-axis: memory footprint (KByte, log-2 scale, 0.25 to 262144); y-axis: performance in Mflop/s (0 to 35,000). Series: 1x1 Thread, 8x1 Threads, 16x1 Threads, 16x2 Threads.]
Max speed up is ~11.3x
The only change is the way the data is distributed
Notation: Number of cores x number of threads within core
System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

Summary Matrix Times Vector
[Figure: x-axis: memory footprint (KByte, log-2 scale, 0.25 to 262144); y-axis: performance in Mflop/s (0 to 35,000). Series: Nehalem OMP 16 threads, Nehalem OMP-FT 16 threads, T4-2 OMP 16 threads, T4-2 OMP-FT 16 threads. Annotations: 1.9x, 2.1x.]

The Wrapping

Wrapping Things Up
“While we're still waiting for your MPI debug run to finish, I want to ask you whether you found my information useful.”
“Yes, it is overwhelming. I know.”
“And OpenMP is somewhat obscure in certain areas. I know that as well.”
“I understand. You're not a Computer Scientist and just need to get your scientific research done.”
“I agree this is not a good situation, but it is all about Darwin, you know. I'm sorry, it is a tough world out there.”

It Never Ends
“Oh, your MPI job just finished! Great.”
“Your program does not write a file called 'core' and it wasn't there when you started the program?”
“You wonder where such a file comes from? Let's talk, but I need to get a big and strong coffee first.”
“WAIT! What did you just say?”

It Really Never Ends
“Somebody told you WHAT??”
“You think GPUs and OpenCL will solve all your problems?”
“Let's make that an XL Triple Espresso. I’ll buy”
DTrace Why It Can Be Good For You

Motivation
• DTrace is a Dynamic Tracing Tool
  – Supported on Solaris, Mac OS X and some Linux flavours
• Monitors the Operating System (OS)
• Through “probes”, the user can see what the OS is doing
• Main target: OS related performance issues
• Surprisingly, it can also greatly help to find out what your application is doing though *

*) A regular profiling tool should be used first

How DTrace Works
• A DTrace probe is written by the user
• When the probe “fires”, the code in the probe is executed
• The probes are based on “providers”
• The providers are pre-defined
  – Example: “sched” provider to get info from the scheduler
  – You can also instrument your own code to have DTrace probes, but there is little need for that

Probe name format:

    provider:module:function:name
76
Example–ThreadAffinity/1
sched:::off-cpu/ pid == $target && self->on_cpu == 1 /{ self->time_delta = (timestamp - ts_base)/1000;
@thread_off_cpu [tid-1] = count(); @total_thr_state[probename] = count();
printf("Event %8u %4u %6u %6u %-16s %8s\n", self->time_delta, tid-1, curcpu->cpu_id, curcpu->cpu_lgrp, probename, probefunc);
self->on_cpu = 0; self->time_delta = 0;}
“predicate”

Example – Thread Affinity /2

    pid$target::processor_bind:entry
    / (processorid_t) arg2 >= 0 /
    {
        self->time_delta       = (timestamp - ts_base)/1000;
        self->target_processor = (processorid_t) arg2;

        @proc_bind_info[tid-1, curcpu->cpu_id,
                        self->target_processor] = count();

        printf("Event %8u %4u %6u %6u %9s/%-6d %8s\n",
               self->time_delta, tid-1, curcpu->cpu_id,
               curcpu->cpu_lgrp, "proc_bind",
               self->target_processor, probename);

        self->time_delta       = 0;
        self->target_processor = 0;
    }

Example – Example Code

    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("Single region executed by thread %d\n",
                   omp_get_thread_num());
        } // End of single region
    } // End of parallel region

    export OMP_NUM_THREADS=4
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close
    ./affinity.d -c ./omp-par.exe

Example – Result

    ===========================================================
                        Affinity Statistics
    ===========================================================

    Thread   On HW Thread   Lgroup   Created Thread   Count
         0            787        7                1       1
         0            787        7                2       1
         0            787        7                3       1

    Thread   Running on HW Thread   Bound to HW Thread
         0                    787                  784
         1                    771                  792
         2                    848                  800
         3                    813                  808