Projects - GitHub Pages
Projects
• Work together for the implementation
  – Discussion and debugging, but not the code itself
• Each submit your own implementation and report
• Presentation
  – One presentation
• Additional meeting time
Name               Project
Aditi Patil        Project 1
Aparna Puram       Project 1
Erik Hoggard       Project 1
Jacques Breaux     Project 1
Karam Abughalieh   Project 2
Kevin Weinert      Project 2
Mark Easterly      Project 2
Nathan Sketch      Project 2
Shayan Mukhtar     Project 2
Lecture 16: Parallel Architecture – Thread Level Parallelism
Concurrent and Multicore Programming
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan
Topics (Part 1)
• Introduction
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory system (Chapter 7)
  – OpenMP
  – Cilk/Cilkplus
  – PThread, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Topics (Part 2)
• Parallel architectures and hardware
  – Parallel computer architectures
    • Thread level parallelism and data level parallelism
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to offloading model in OpenMP
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Parallel algorithms (Chapter 8, 9 & 10)
Moore’s Law
Source: http://en.wikipedia.org/wki/Images:Moores_law.svg
• Long-term trend in the number of transistors per integrated circuit
• Number of transistors doubles every ~18 months
Binary Code and Instructions
Stages to Execute an Instruction
Pipeline
Pipeline and Superscalar
What do we do with that many transistors?
• Optimizing the execution of a single instruction stream through
  – Pipelining
    • Overlap the execution of multiple instructions
    • Example: all RISC architectures; Intel x86 underneath the hood
  – Out-of-order execution:
    • Allow instructions to overtake each other in accordance with code dependencies (RAW, WAW, WAR)
    • Example: all commercial processors (Intel, AMD, IBM, SUN)
  – Branch prediction and speculative execution:
    • Reduce the number of stall cycles due to unresolved branches
    • Example: (nearly) all commercial processors
What do we do with that many transistors? (II)
  – Multi-issue processors:
    • Allow multiple instructions to start execution per clock cycle
    • Superscalar (Intel x86, AMD, …) vs. VLIW architectures
  – VLIW/EPIC architectures:
    • Allow compilers to indicate independent instructions per issue packet
    • Example: Intel Itanium
  – Vector units:
    • Allow for the efficient expression and execution of vector operations
    • Example: SSE-SSE4, AVX instructions
Limitations of optimizing a single instruction stream (II)
• Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously due to
  – data dependencies
  – limitations of speculative execution across multiple branches
  – difficulties in detecting memory dependencies among instructions (alias analysis)
• Consequence: a significant number of functional units are idling at any given time
• Question: Can we maybe execute instructions from another instruction stream?
  – Another thread?
  – Another process?
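The effect of data dependencies can be illustrated with a small C sketch. The two functions below are hypothetical examples, not from the slides: the first forms a single read-after-write chain, while the second uses four independent accumulators that a multi-issue core may be able to overlap. Whether a speedup is actually observed depends on the compiler and the processor.

```c
#include <stddef.h>

/* One accumulator: each addition needs the result of the previous one,
   so the additions form a single RAW dependency chain. */
double sum_chained(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four additions inside one iteration do not
   depend on each other, so they are candidates for parallel issue. */
double sum_four_way(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions compute the same sum for inputs where floating-point rounding does not matter; only the dependency structure differs.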
The “Future” of Moore’s Law
• The chips are down for Moore’s law
  – http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore's Law
  – http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore’s law really is dead this time
  – http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
  – https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf
Thread-level parallelism
• Problems for executing instructions from multiple threads at the same time
  – The instructions in each thread might use the same register names
  – Each thread has its own program counter
• Virtual memory management allows for the execution of multiple threads and sharing of the main memory
• When to switch between different threads:
  – Fine-grain multithreading: switches between every instruction
  – Coarse-grain multithreading: switches only on costly stalls (e.g. level 2 cache misses)
Simultaneous Multi-Threading (SMT)
• Convert thread-level parallelism to instruction-level parallelism
[Figure comparing Superscalar, Coarse MT, Fine MT, and SMT]
Simultaneous multi-threading (II)
• Dynamically scheduled processors already have most hardware mechanisms in place to support SMT (e.g. register renaming)
• Required additional hardware:
  – Register file per thread
  – Program counter per thread
• Operating system view:
  – If a CPU supports n simultaneous threads, the Operating System views them as n processors
  – OS distributes the most time-consuming threads ‘fairly’ across the n processors that it sees.
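The operating system view can be checked from a program. The sketch below assumes a Linux or similar Unix system where sysconf accepts _SC_NPROCESSORS_ONLN (a widely supported extension, not part of base POSIX); the function name logical_processors is hypothetical. On an SMT machine the returned number counts hardware threads, not physical cores.

```c
#include <unistd.h>

/* Number of logical processors the OS currently schedules on.
   With SMT enabled, each hardware thread is counted as one processor. */
long logical_processors(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

For example, a dual-processor system whose processors each support 2 simultaneous threads would typically report 4 here.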
Example for SMT architectures (I)
• Intel Hyperthreading:
  – First released for Intel Xeon processor family in 2002
  – Supports two architectural sets per CPU
  – Each architectural set has its own
    • General purpose registers
    • Control registers
    • Interrupt control registers
    • Machine state registers
  – Adds less than 5% to the relative chip size
Reference: D. T. Marr et al. “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, 6(1), 2002, pp. 4-15. ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf
Example for SMT architectures (II)
• IBM Power5
  – Same pipeline as IBM Power4 processor but with SMT support
  – Further improvements:
    • Increase associativity of the L1 instruction cache
    • Increase the size of the L2 and L3 caches
    • Add separate instruction prefetch and buffering units for each SMT thread
    • Increase the size of issue queues
    • Increase the number of virtual registers used internally by the processor.
Simultaneous Multi-Threading
• Works well if
  – Number of compute intensive threads does not exceed the number of threads supported in SMT
  – Threads have highly different characteristics (e.g. one thread doing mostly integer operations, another mainly doing floating point operations)
• Does not work well if
  – Threads try to utilize the same functional units
  – Assignment problems:
    • e.g. a dual processor system, each processor supporting 2 threads simultaneously (OS thinks there are 4 processors)
    • 2 compute intensive application processes might end up on the same processor instead of different processors (OS does not see the difference between SMT and real processors!)
Synchronization between processors
• Required on all levels of multi-threaded programming
  – Lock/unlock
  – Mutual exclusion
  – Barrier synchronization
• Key hardware capability:
  – Uninterruptible instruction capable of atomically retrieving or changing a value
Race Condition
int count = 0;
int *cp = &count;
….
(*cp)++; /* by two threads; without the parentheses, *cp++ would advance the pointer instead */
Pictures from Wikipedia: http://en.wikipedia.org/wiki/Race_condition
Simple Example (IIIb)
/* mymutex: a pthread_mutex_t shared by all threads, defined elsewhere */
void *thread_func(void *arg) {
    int *cp = (int *) arg;
    pthread_mutex_lock(&mymutex);
    (*cp)++;   // read, increment and write shared variable
    pthread_mutex_unlock(&mymutex);
    return NULL;
}
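To try the mutex-protected increment end to end, a minimal driver could look as follows. This is a sketch under assumptions: the loop count INCREMENTS and the helper run_two_threads are hypothetical names that do not appear in the slides, and mymutex is assumed to be a statically initialized pthread_mutex_t.

```c
#include <pthread.h>

#define INCREMENTS 100000   /* hypothetical per-thread iteration count */

static pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;

/* Same critical section as on the slide, repeated INCREMENTS times. */
static void *thread_func(void *arg) {
    int *cp = (int *) arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&mymutex);
        (*cp)++;                      /* read, increment, write */
        pthread_mutex_unlock(&mymutex);
    }
    return NULL;
}

/* Runs two threads on one shared counter and returns the final value. */
int run_two_threads(void) {
    int count = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_func, &count);
    pthread_create(&t2, NULL, thread_func, &count);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return count;
}
```

With the lock in place the result is always 2 * INCREMENTS. Removing the lock and unlock calls makes the increment a race, and the result may then come out lower.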
Synchronization
• Lock/unlock operations on the hardware level, e.g.
  – Lock returning 1 if lock is free/available
  – Lock returning 0 if lock is unavailable
• Implementation using atomic exchange (or the related compare-and-swap)
  – Process sets a register/memory location to the value required by the operation
  – Setting the value must not be interrupted in order to avoid race conditions
  – Access by multiple processes/threads will be resolved by write serialization
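A lock built on atomic exchange can be sketched with C11 atomics. The code below is an illustration, not the hardware implementation: the encoding (0 = free, 1 = held) and all names (xchg_lock, xchg_try_lock, xchg_demo, ...) are assumptions made for this example. As in the bullets above, the try-lock returns 1 if the lock was free and 0 if it was unavailable.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int xchg_lock = 0;   /* assumed encoding: 0 = free, 1 = held */
static long xchg_total = 0;        /* shared data protected by xchg_lock */

/* Atomically write 1 and look at the old value:
   returns 1 if the lock was free (now ours), 0 if it was unavailable. */
int xchg_try_lock(void) {
    return atomic_exchange(&xchg_lock, 1) == 0;
}

void xchg_unlock(void) {
    atomic_store(&xchg_lock, 0);
}

static void *xchg_worker(void *arg) {
    (void) arg;
    for (int i = 0; i < 20000; i++) {
        while (!xchg_try_lock()) { /* busy-wait until acquired */ }
        xchg_total++;
        xchg_unlock();
    }
    return NULL;
}

/* Two threads increment the shared total under the lock. */
long xchg_demo(void) {
    pthread_t t[2];
    xchg_total = 0;
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, xchg_worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return xchg_total;
}
```

If two threads call xchg_try_lock at the same time, the hardware serializes the two exchanges, so exactly one of them sees the old value 0.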
Synchronization (II)
• Other synchronization primitives:
  – Test-and-set
  – Fetch-and-increment
• Problems with all three algorithms:
  – Require a read and write operation in a single, uninterruptible sequence
  – Hardware cannot allow any operations between the read and the write operation
  – Complicates cache coherence
  – Must not deadlock
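The semantics of these two primitives can be sketched with the C11 atomics library, which exposes them as atomic_flag_test_and_set and atomic_fetch_add. The wrapper names and the variables tas_flag and ticket are hypothetical and only serve this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag tas_flag = ATOMIC_FLAG_INIT;  /* initially clear */
static atomic_int  ticket   = 0;

/* Test-and-set: sets the flag and returns its previous state,
   as one indivisible read-modify-write. */
bool test_and_set(void) {
    return atomic_flag_test_and_set(&tas_flag);
}

/* Fetch-and-increment: returns the old value and adds 1 atomically. */
int fetch_and_increment(void) {
    return atomic_fetch_add(&ticket, 1);
}
```

The first caller of test_and_set sees false (the flag was clear) and every later caller sees true until the flag is cleared again. Successive calls to fetch_and_increment return 0, 1, 2, and so on.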
Load linked/store conditional
• Pair of instructions where the second instruction returns a value indicating whether the pair of instructions was executed as if the instructions were atomic
• Special pair of load and store operations
  – Load linked (LL)
  – Store conditional (SC): returns 1 if successful, 0 otherwise
• Store conditional returns an error if
  – Contents of memory location specified by LL changed before calling SC
  – Processor executes a context switch
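Load linked/store conditional (II)
• Assembler code sequence to atomically exchange the contents of register R4 and the memory location specified by R1
try: MOV R3, R4 !copy exchange value
LL R2, 0(R1) !load linked
SC R3, 0(R1) !store conditional
BEQZ R3, try !retry if the store failed
MOV R4, R2 !put loaded value in R4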
Load linked/store conditional (III)
• Implementing fetch-and-increment using load linked and store conditional
try: LL R2, 0(R1)
DADDUI R3, R2, #1
SC R3, 0(R1)
BEQZ R3, try
• Implementation of LL/SC by using a special Link Register, which contains the address of the operation
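C does not expose LL/SC directly, but the same retry pattern can be sketched with a C11 compare-and-exchange loop. This is a swapped-in technique, not the slide's instruction sequence: atomic_compare_exchange_weak is allowed to fail spuriously, which lets compilers map it onto an LL/SC pair on architectures that provide one. The function name fetch_and_increment_cas is hypothetical.

```c
#include <stdatomic.h>

/* Retry loop in the spirit of LL/SC: read the old value, try to publish
   old + 1, and start over if the location changed in between. */
int fetch_and_increment_cas(atomic_int *addr) {
    int old = atomic_load(addr);                 /* plays the role of LL */
    while (!atomic_compare_exchange_weak(addr, &old, old + 1)) {
        /* The store did not happen; old now holds the current value,
           so the next attempt uses fresh data (like branching to try). */
    }
    return old;                                  /* value before the increment */
}
```

As with the assembler loop, the increment only takes effect in the iteration where no other thread modified the location between the read and the conditional write.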
Spinlocks
• A lock that a processor continuously tries to acquire, spinning around in a loop until it succeeds.
• Trivial implementation
DADDUI R2, R0, #1
lockit: EXCH R2, 0(R1) !atomic exchange
BNEZ R2, lockit
• Since the EXCH operation includes a read and a modify operation
  – Value will be loaded into the cache
    • Good if only one processor tries to access the lock
    • Bad if multiple processors in an SMP try to get the lock (cache coherence)
  – EXCH includes a write attempt, which will lead to a write-miss for SMPs
Spinlocks (II)
• For cache coherent SMPs, slight modification of the loop required
lockit: LD R2, 0(R1) !load the lock
BNEZ R2, lockit !lock available?
DADDUI R2, R0, #1 !load locked value
EXCH R2, 0(R1) !atomic exchange
BNEZ R2, lockit !EXCH successful?
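The same read-before-exchange idea can be sketched in C11. The code is an illustration under assumptions: the lock encoding (0 = free, 1 = held), the thread and iteration counts, and all names (ttas_lock, ttas_acquire, ttas_demo, ...) are hypothetical. The inner loop spins on plain loads, which can be served from the local cache, and the exchange is attempted only once the lock looks free.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ttas_lock = 0;   /* assumed encoding: 0 = free, 1 = held */
static long ttas_total = 0;        /* shared data protected by ttas_lock */

void ttas_acquire(void) {
    for (;;) {
        /* Spin on reads only; no write traffic while the lock is held. */
        while (atomic_load(&ttas_lock) != 0) { }
        /* Lock looked free: now try the atomic exchange. */
        if (atomic_exchange(&ttas_lock, 1) == 0)
            return;                /* old value 0: we own the lock */
        /* Another thread was faster; go back to spinning on reads. */
    }
}

void ttas_release(void) {
    atomic_store(&ttas_lock, 0);
}

static void *ttas_worker(void *arg) {
    (void) arg;
    for (int i = 0; i < 20000; i++) {
        ttas_acquire();
        ttas_total++;
        ttas_release();
    }
    return NULL;
}

/* Four threads increment the shared total under the spinlock. */
long ttas_demo(void) {
    pthread_t t[4];
    ttas_total = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, ttas_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return ttas_total;
}
```

Compared with spinning directly on the exchange, waiting threads here generate coherence traffic mainly when the lock is released, not on every loop iteration.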
Spinlocks (III)
• … or using LL/SC
lockit: LL R2, 0(R1) !load linked
BNEZ R2, lockit !lock available?
DADDUI R2, R0, #1 !load locked value
SC R2, 0(R1) !store conditional
BEQZ R2, lockit !retry if SC failed (SC returns 0)