Cori: User Update
Helen He and Wahid Bhimji NERSC User Group Meeting March 24, 2016
Outline
• Early User Program and Cori Usage Info
• Running Jobs and Batch Queues
• Selected User Issues
• Application SSP and Scratch IO Performance
Cori Early User Program
• Early users were enabled in 7 phases:
  – Enabled as the Cori system became ready in various aspects (networking, programming environment, batch system, etc.)
• 162M MPP hours used before Cori charging started for AY16 on Jan 12, 2016.
Category                                                # users                      Date Enabled
Burst Buffer / Heavy Data users                         50                           10/29/2015
All Babbage users (covers all 3 tiers of NESAP teams)   230                          11/3/2015
NUGEX members (current and past)                        15                           11/3/2015
Heavy Edison users                                      20                           11/7/2015
Specific Requests (by users or staff)                   70                           10/29-11/7/2015
Projects that are out of time                           ~340                         11/7/2015
All users                                               6000+ enabled, 2700 active   11/12/2015
Cori Usage Info
• 11/12/2015: All users enabled
• 11/30/2015 - 1/4/2016: Edison offline
• 12/15/2015: Hopper retired
• 1/12/16: Cori started charging when AY16 began
More large jobs during free time :)
Cori Usage Info: Free Period and AY16
• 162M MPP hours used (10/29/15 - 1/11/16)
• 75.8M MPP hours used (1/12/16 - 3/22/16)
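The two totals cover periods of different length, so a per-day rate makes them easier to compare. A minimal sketch, assuming both date ranges are inclusive (the slide does not say) and ignoring that only early users were enabled for part of the free period:

```python
from datetime import date

def daily_rate(mpp_hours_millions, start, end):
    """Average MPP hours (in millions) per day over an inclusive date range."""
    days = (end - start).days + 1  # assumption: both endpoints count
    return mpp_hours_millions / days

# Totals and dates are taken from the slide.
free = daily_rate(162.0, date(2015, 10, 29), date(2016, 1, 11))
ay16 = daily_rate(75.8, date(2016, 1, 12), date(2016, 3, 22))

print(f"free period: {free:.2f}M MPP hours/day")
print(f"AY16 so far: {ay16:.2f}M MPP hours/day")
```

Under these assumptions the free period averaged roughly twice the daily usage of the charged period.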
Cori Phase 1 Data Features
• File Systems
  – Burst Buffer for high bandwidth, low latency I/O
  – High performance Lustre file system: 28 PB of disk, >700 GB/s I/O bandwidth
  – Cross mounting of file systems (Cori scratch on Edison and DTNs) (TBA)
  – Large amount of memory per node (128 GB) as well as high memory nodes (775 GB)
• Networking
  – Improved outbound Internet connections (e.g., to access a database in another center)
  – Software Defined Networking R&D for high bandwidth transfers in and out of the compute node (TBA)
• On-node software
  – Improved shared library performance
  – User-defined images / Shifter
Cori Phase 1 Data Features (SLURM)
• Cori Phase 1 is also known as the "Cori Data Partition"
• Designed to accelerate data-intensive applications, with high throughput and "realtime" need.
  – "shared" partition. Multiple jobs on the same node. Larger submit and run limits. 40 nodes set aside.
  – The 1-2 node bin in the "regular" partition for high throughput jobs. Large submit and run limits.
  – "realtime" partition for jobs requiring realtime data analysis. Highest queue priority. Special permission only.
  – Internal sshd (CCM mode) in any queue
  – Large number of login/interactive nodes to support applications with advanced workflows
  – "burst buffer" integrated in SLURM, in early user period.
  – Encourage users to run jobs using 683+ nodes on Edison with queue priority boost and 40% charging discount there.
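The Edison large-job incentive can be illustrated with a small calculation. This is only a sketch: the 683-node threshold and the 40% discount come from the slide, while the function name, the base-charge input, and the idea that the discount is a simple multiplier on the whole job's charge are assumptions.

```python
LARGE_JOB_MIN_NODES = 683   # threshold from the slide
LARGE_JOB_DISCOUNT = 0.40   # 40% charging discount from the slide

def edison_charge(base_mpp_hours, nodes):
    """Return the charged MPP hours for a job (hypothetical helper).

    Assumption: the discount multiplies the job's undiscounted charge.
    """
    if nodes >= LARGE_JOB_MIN_NODES:
        return base_mpp_hours * (1.0 - LARGE_JOB_DISCOUNT)
    return base_mpp_hours

print(edison_charge(1000.0, 683))  # job at the threshold gets the discount
print(edison_charge(1000.0, 682))  # job just below it does not
```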
Transition from Hopper/Edison to Cori
• Programming environment is very similar to Hopper/Edison. Porting to Cori is straightforward with regard to software building.
• The aspect that users need to adjust to the most is the transition from Torque/Moab to SLURM.
• Provided detailed documentation: a SLURM transition guide, example batch scripts, and mini-tutorials.
• Worked with some specific applications [...], with bh required.
SLURM Batch Scheduler Adoption
• Overall SLURM adoption is smooth.
• Easy to use "premium" and "ccm"; good support and usage for "shared" and "realtime".
• A few traps (with user education):
  – Hyperthreading is on by default
    • SLURM sees 64 CPUs per node
    • Asking for nodes with "#SBATCH -n" but without "#SBATCH -N" may get half the nodes desired
    • Need to set OMP_NUM_THREADS=1 explicitly to run with pure MPI (for a hybrid MPI/OpenMP program compiled with OpenMP enabled)
  – Automatic process and thread affinity is good. Can explore advanced settings for more complicated binding options.
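The "half the nodes" trap follows from simple arithmetic. A minimal sketch, assuming each node has 32 physical cores with 2 hardware threads each (which gives the 64 CPUs SLURM sees) and that the scheduler packs tasks onto as few nodes as possible when only a task count is given:

```python
import math

LOGICAL_CPUS_PER_NODE = 64    # what SLURM sees with hyperthreading on (from the slide)
PHYSICAL_CORES_PER_NODE = 32  # assumption: 2 hardware threads per physical core

def nodes_needed(ntasks, cpus_per_node):
    """Smallest node count that fits ntasks at one task per CPU."""
    return math.ceil(ntasks / cpus_per_node)

ntasks = 256
got = nodes_needed(ntasks, LOGICAL_CPUS_PER_NODE)       # only "-n 256" given
wanted = nodes_needed(ntasks, PHYSICAL_CORES_PER_NODE)  # one task per physical core

print(f"-n {ntasks} alone: {got} nodes; intended: {wanted} nodes")
```

Giving the node count explicitly with "#SBATCH -N" alongside "-n" avoids relying on this packing.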
Batch Job Wait Time
• Users reported VERY LONG wait times for jobs
• Changes were made on Jan 15
  – Added max number of backfill jobs per partition (on top of max number of backfill jobs per user)
  – Decreased max size of debug from 128 to 112.
  – Communicated with individual users to use the "shared" partition, job arrays, and bundling jobs.
  – Jobs not planned to run in AY16 were deleted
• Most debug jobs then started within 30 min instead of hours; many now start in a few min.
• The wait times for regular jobs are significantly shorter too
• More tuning of the queue configuration is under way.
  – Close monitoring of job throughput and utilization
  – Changes made on 03/22 to the scheduling algorithm greatly increased system utilization
NERSC Custom Queue Monitoring Script
• "sqs" is a NERSC custom queue display script which provides basic batch job info plus the job ranking based on start time provided by the backfill scheduler.
• A new version of "sqs" was deployed on Jan 19 with two columns of ranking values to give users more perspective on their jobs in the queue.
  – Added job priority ranking with absolute priority value (a function of partition, QOS, job wait time, and fair share)
A Few Tips to Get Faster Job Turnaround
• Request a shorter wall time; do not use the allowed max wall time.
• Use the "shared" partition for serial jobs or very small parallel jobs.
• Bundle jobs (multiple "srun"s in one script, sequentially or simultaneously)
• Use Job Arrays (better for managing jobs, not necessarily faster turnaround). Each array task is considered a single job for scheduling.
• Use the job dependency feature for managing workflow.
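Bundling can be sketched as a small generator that writes one batch script containing several "srun" lines. This is an illustration under stated assumptions, not a NERSC-provided tool: the function name, the executable names, and the chosen node counts and wall time are hypothetical; only the "-p", "-N", and "-t" directives and "srun" itself are standard SLURM.

```python
def bundled_script(srun_args, nodes, walltime, simultaneous=False, partition="regular"):
    """Render a batch script that runs several srun steps in one job.

    srun_args: list of argument strings, one per step (hypothetical commands).
    simultaneous: if True, steps run in the background and the script waits
    for all of them; otherwise they run one after another.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH -p {partition}",
        f"#SBATCH -N {nodes}",
        f"#SBATCH -t {walltime}",
        "",
    ]
    for args in srun_args:
        suffix = " &" if simultaneous else ""
        lines.append(f"srun {args}{suffix}")
    if simultaneous:
        lines.append("wait")  # keep the job alive until every step finishes
    return "\n".join(lines) + "\n"

# Two one-node steps side by side in a two-node job (app names are made up).
print(bundled_script(["-N 1 -n 32 ./app_a", "-N 1 -n 32 ./app_b"],
                     nodes=2, walltime="00:30:00", simultaneous=True))
```

For simultaneous steps the job has to request enough nodes for all steps at once; for sequential steps it only needs enough for the largest one.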
Resolved: Cray HDF5 with Intel16
• Internal compiler error for Fortran codes when using cray-hdf5 and cray-hdf5-parallel/1.8.14 with intel/16.0.0.109
• Two workarounds:
  – Use NERSC built hdf5/1.8.14 and hdf5-parallel/1.8.14 with the Intel/16.0.0.109 compiler
  – Use cray-hdf5/1.8.14, but swap the intel compiler version from 16.0.0.109 to 15.0.1.133.
• cray-hdf5/1.8.16 has been installed and set to default, which resolved this issue (Feb 27, 2016)
Workaround: Node Voltage Fault
• Node voltage fault only seen with one specific application, "pw.x" from Quantum Espresso.
• By default, hyperthreading is used, and the application generates a very close sequence of current spikes that may cause the Voltage Converter to self-protect and shut down.
• Workaround by user education to use 1 thread per MPI task. Also modified the NERSC provided module file to set OMP_NUM_THREADS=1. (Jan 16, 2016)
Resolved: /project IO performance
• Two applications reported 10x parallel IO performance slowdown in /project, seen after Dec 25, 2015.
• Fixed during system reboot with scheduled maintenance on Jan 20, 2016.
• Exact cause of slowdown unknown
  – Unlikely due to "Cori DVS nodes GPFS IB cable not used"
Current Issues
• Login nodes crash when hitting LBUG
• Compute nodes stuck in completing states
• Compute nodes OOM from applications
Cori Phase 1 SSP Performance
Committed SSP: 68.2    Measured SSP: 83.0
[Figure: bar chart of run time (sec), committed vs. measured, for MiniFE, MiniGhost, AMG, UMT, SNAP, MiniDFT, GTC, and MILC]
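The margin between the two SSP figures can be worked out directly. A minimal sketch using only the two numbers on the slide; it does not reproduce how SSP itself is computed from the per-benchmark results:

```python
committed_ssp = 68.2  # from the slide
measured_ssp = 83.0   # from the slide

ratio = measured_ssp / committed_ssp
margin_pct = (ratio - 1.0) * 100.0

print(f"measured / committed = {ratio:.3f} ({margin_pct:.1f}% above commitment)")
```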
Peak Cori Scratch Lustre I/O Performance
[Figure: peak I/O performance in two panels: POSIX, File-Per-Process; MPI-IO, Single Shared File]
Thank you.