CS 61C: Great Ideas in Computer Architecture Lecture 19 ...cs61c/fa16/lec/19/L19.pdf · CS 61c...
CS 61C: Great Ideas in Computer Architecture
Lecture 19: Thread-Level Parallel Processing
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c
Agenda
• MIMD - multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion…
CS61c Lecture19:ThreadLevelParallelProcessing 2
Improving Performance
1. Increase clock rate fs
  − Reached practical maximum for today’s technology
  − < 5 GHz for general purpose computers
2. Lower CPI (cycles per instruction)
  − SIMD, “instruction level parallelism”
3. Perform multiple tasks simultaneously (today’s lecture)
  − Multiple CPUs, each executing different program
  − Tasks may be related
    § E.g. each CPU performs part of a big matrix multiplication
  − or unrelated
    § E.g. distribute different web http requests over different computers
    § E.g. run ppt (view lecture slides) and browser (youtube) simultaneously
4. Do all of the above:
  − High fs, SIMD, multiple parallel tasks
New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., Search “Katz”
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack from Smart Phone and Warehouse Scale Computer, through Computer (Core … Core, Memory (Cache), Input/Output) and Core (Instruction Unit(s), Functional Unit(s) computing A0+B0 … A3+B3, Cache Memory), down to Logic Gates. Caption: Harness Parallelism & Achieve High Performance. Marker: Project 4.]
Parallel Computer Architectures
• Several separate computers, some means for communication (e.g. Ethernet)
• Massive array of computers, fast communication between processors
• Multi-core CPU: more than 1 datapath in a single chip
  − share L3 cache, memory, peripherals
  − Example: Hive machines
• GPU “graphics processing unit”
Example: CPU with 2 Cores
[Figure: Processor “Core” 1 and Processor “Core” 2, each with its own Control and Datapath (PC, Registers, ALU). Processor 0 and Processor 1 memory accesses go to a shared Memory (Bytes); Input and Output attach through I/O-Memory Interfaces.]
Multiprocessor Execution Model
• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest level caches (e.g. 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often 3rd level cache
    § Often on same silicon chip
    § But not a requirement
• Nomenclature
  − “Multiprocessor Microprocessor”
  − Multicore processor
    § E.g. 4 core CPU (central processing unit)
    § Executes 4 different instruction streams simultaneously
Transition to Multicore
[Figure: Sequential App Performance]
Multiprocessor Execution Model
• Shared memory
  − Each “core” has access to the entire memory in the processor
  − Special hardware keeps caches consistent
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o “Slow” memory shared by many “customers” (cores)
      o May become bottleneck (Amdahl’s Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of single task between several cores
    § E.g. each performs part of large matrix multiplication
Parallel Processing
• It’s difficult!
• It’s inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g. smart phones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.
    § motion processor in iPhone
    § GPU (graphics processing unit)
• Warehouse-scale computers
  − multiple “nodes”
    § “boxes” with several CPUs, disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node
Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits/Core   Core * SIMD bits   Total, e.g. FLOPs/Cycle
2003     2         128                256                   4
2005     4         128                512                   8
2007     6         128                768                  12
2009     8         128               1024                  16
2011    10         256               2560                  40
2013    12         256               3072                  48
2015    14         512               7168                 112
2017    16         512               8192                 128
2019    18        1024              18432                 288
2021    20        1024              20480                 320

Growth over 12 years (e.g. 2009 to 2021 in the table):
• MIMD (cores): 2.5X, +2 cores every 2 years
• SIMD (bits per core): 8X, 2X every 4 years
• MIMD & SIMD combined: 20X
20x in 12 years: 20^(1/12) = 1.28x → 28% per year, or 2x every 3 years!
IF (!) we can use it
Programs Running on my Computer
PID TTY TIME CMD
220 ?? 0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ?? 0:10.60 /usr/sbin/distnoted agent
224 ?? 0:09.11 /usr/sbin/cfprefsd agent
229 ?? 0:04.71 /usr/sbin/usernoted
230 ?? 0:02.35 /usr/libexec/nsurlsessiond
232 ?? 0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ?? 0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ?? 0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ?? 0:49.72 /usr/libexec/secinitd
239 ?? 0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ?? 0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ?? 0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ?? 0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ?? 0:00.74 /System/Library/CoreServices/mapspushd
244 ?? 0:00.79 /usr/libexec/fmfd
246 ?? 0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ?? 0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ?? 0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ?? 0:04.81 /usr/libexec/secd
254 ?? 0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ?? 0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ?? 0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ?? 0:03.91 /usr/libexec/nsurlstoraged
274 ?? 0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ?? 0:00.09 /usr/sbin/pboard
283 ?? 0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ?? 0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ?? 0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ?? 0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ?? 0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ?? 0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ?? 0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ?? 0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
… 156 total at this moment
How does my laptop do this?
Imagine doing 156 assignments all at the same time!
Threads
• Sequential flow of instructions that performs some task
  − Up to now we just called this a “program”
• Each thread has a
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each processor provides one (or more)
  − hardware threads (or harts) that actively execute instructions
  − Each core executes one “hardware thread”
• Operating system multiplexes multiple
  − software threads onto the available hardware threads
  − all threads except those mapped to hardware threads are waiting
Operating System Threads
Give illusion of many “simultaneously” active threads
1. Multiplex software threads onto hardware threads:
  a) Switch out blocked threads (e.g. cache miss, user input, network access)
  b) Timer (e.g. switch active thread every 1 ms)
2. Remove a software thread from a hardware thread by
  i. interrupting its execution
  ii. saving its registers and PC to memory
3. Start executing a different software thread by
  i. loading its previously saved registers into a hardware thread’s registers
  ii. jumping to its saved PC
Example: 4 Cores
• Thread pool: list of threads competing for processor
• OS maps threads to cores and schedules logical (software) threads
[Figure: Core 1, Core 2, Core 3, Core 4]
• Each “Core” actively runs 1 program at a time
Multithreading
• Typical scenario:
  − Active thread encounters cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run different thread until data available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform switch in ≪ 1000 cycles
• Can hardware help?
  − Moore’s law: transistors are plenty
Hardware assisted Software Multithreading
[Figure: Processor (1 Core, 2 Threads) with one Control unit and one Datapath (ALU) but two copies of the thread state (PC 0 and Registers 0, PC 1 and Registers 1), connected to Memory (Bytes), Input and Output through I/O-Memory Interfaces.]
• Two copies of PC and Registers inside processor hardware
• Looks like two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading:
  − Both threads may be active simultaneously
Note: presented incorrectly in the lecture
Multithreading
• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate Processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core
Bernhard’s Laptop
$ sysctl -a | grep hw
hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 3,145,728
• 2 Cores
• 4 Threads total
Example: 6 Cores, 24 Logical Threads
• Thread pool: list of threads competing for processor
• OS maps threads to cores and schedules logical (software) threads
[Figure: Core 1 through Core 6, each holding Thread 1, Thread 2, Thread 3 and Thread 4]
• 4 logical threads per core (hardware) thread
Languages supporting Parallel Programming

ActorScript     Concurrent Pascal    JoCaml      Orc
Ada             Concurrent ML        Join        Oz
Afnix           Concurrent Haskell   Java        Pict
Alef            Curry                Joule       Reia
Alice           CUDA                 Joyce       SALSA
APL             E                    LabVIEW     Scala
Axum            Eiffel               Limbo       SISAL
Chapel          Erlang               Linda       SR
Cilk            Fortran 90           MultiLisp   Stackless Python
Clean           Go                   Modula-3    SuperPascal
Clojure         Io                   Occam       VHDL
Concurrent C    Janus                occam-π     XC

Which one to pick?
Why so many parallel programming languages?
• Piazza question:
  − Why “intrinsics”?
  − TO Intel: fix your #()&$! Compiler!
• It’s happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o General problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!
Parallel Programming Languages
• Number of choices is indication of
  − No universal solution
    § Needs are very problem specific
  − E.g.
    § Scientific computing (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it’s all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly “easy” to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP
Parallel Loops
• Serial execution:
for (int i=0; i<100; i++) {
  …
}
• Parallel execution (the four loops run simultaneously, one per thread):
for (int i=0; i<25; i++) {
  …
}
for (int i=25; i<50; i++) {
  …
}
for (int i=50; i<75; i++) {
  …
}
for (int i=75; i<100; i++) {
  …
}
Parallel for in OpenMP
#include <omp.h>

#pragma omp parallel for
for (int i=0; i<100; i++) {
  …
}
OpenMP Example
$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5
OpenMP
• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler Directives, #pragma
  − Runtime Library Routines, #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g. same program for 1 & 16 cores
• Only works with shared memory
OpenMP Programming Model
• Fork - Join Model:
• OpenMP programs begin as single process (master thread)
  − Sequential execution
• When parallel region is encountered
  − Master thread “forks” into team of parallel threads
  − Executed simultaneously
  − At end of parallel region, parallel threads “join”, leaving only master thread
• Process repeats for each parallel region
  − Amdahl’s law?
What Kind of Threads?
• OpenMP threads are operating system (software) threads.
• OS will multiplex requested OpenMP threads onto available hardware threads.
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing.
• But other tasks on machine can also use hardware threads!
• Be “careful” (?) when timing results for project 4!
  − 5 AM?
  − Job queue?
Example 2: computing π
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π
pi = 3.142425985001
• Resembles π, but not very accurate
• Let’s increase num_steps and parallelize
Parallelize (1) …
• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …
Parallelize (2) …
[Figure: separate partial sums sum[0] and sum[1]]
1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially
Parallel π

Trial Run
i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001
Scale up: num_steps = 10^6
pi = 3.141592653590
You verify how many digits are correct …
Can we Parallelize Computing sum?
Always looking for ways to beat Amdahl’s Law …
Summation inside parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And value changes between runs?!
• What’s going on?
Your Turn
What are the possible values of *($s0) after executing this code by 2 concurrent threads?

# *($s0) = 100
lw   $t0,0($s0)
addi $t0,$t0,1
sw   $t0,0($s0)

Answer   *($s0)
A        100 or 101
B        101
C        101 or 102
D        100 or 101 or 102
E        100 or 101 or 102 or 103
Whatarethepossiblevaluesof*($s0) afterexecutingthiscodeby2concurrent threads?
# *($s0) = 100lw $t0,0($s0)addi $t0,$t0,1sw $t0,0($s0)
CS61c Lecture19:ThreadLevelParallelProcessing 41
Answer *($s0)
C 101or102
• 102ifthethreadsentercodesectionsequentially• 101ifbothexecutelw beforeeitherrunssw• onethreadsees“stale”data
What’s going on?
• Operation is really pi = pi + sum[id]
• What if >1 threads read the current (same) value of pi, compute the sum, and store the result back to pi?
• Each processor reads same intermediate value of pi!
• Result depends on who gets there when
• A “race” → result is not deterministic
Synchronization
• Problem:
  − Limit access to shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution:
  − Take turns:
    § Only one person gets the microphone & talks at a time
    § Also good practice for classrooms, btw …
Locks
• Computers use locks to control access to shared resources
  − Serves purpose of microphone in example
  − Also referred to as “semaphore”
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked
Synchronization with locks
// wait for lock released
while (lock != 0) ;
// lock == 0 now (unlocked)

// set lock
lock = 1;

// access shared resource ...
// e.g. pi
// sequential execution! (Amdahl ...)

// release lock
lock = 0;
Lock Synchronization

Thread 1                      Thread 2
while (lock != 0) ;           while (lock != 0) ;
lock = 1;                     lock = 1;
// critical section           // critical section
lock = 0;                     lock = 0;

• Thread 2 finds lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!
Try as you want, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!
Hardware Synchronization
• Solution:
  − Atomic read/write
  − Read & write in single instruction
    § No other access permitted between read and write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for “linked” read and write
    § write fails if memory location has been “tampered” with after linked read
    § MIPS uses this solution
MIPS Synchronization Instructions
• Load linked: ll $rt, off($rs)
  − Reads memory location (like lw)
  − Also sets (hidden) “link bit”
  − Link bit is reset if memory location (off($rs)) is accessed
• Store conditional: sc $rt, off($rs)
  − Stores off($rs) = $rt (like sw)
  − Sets $rt=1 (success) if link bit is set
    § i.e. no (other) process accessed off($rs) since ll
  − Sets $rt=0 (failure) otherwise
  − Note: sc clobbers $rt, i.e. changes its value
Lock Synchronization

Broken Synchronization
while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

Fix (lock is at location $s1)
Try:    addiu $t0,$zero,1     # $t0 = 1 before calling ll: minimize time between ll and sc
        ll    $t1,0($s1)
        bne   $t1,$zero,Try
        sc    $t0,0($s1)
        beq   $t0,$zero,Try   # try again if sc failed (another thread executed sc since above ll)
Locked:
        # critical section
Unlock:
        sw    $zero,0($s1)
OpenMP Locks
Synchronization in OpenMP
• Locks are typically used in libraries of higher level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − see online documentation
  − or tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
OpenMP critical
The Trouble with Locks …
• … is deadlocks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs salt
  − Cook 2 grabs pepper
  − Cook 1 notices s/he needs pepper
    § it’s not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it’s not there, so s/he waits
• A not so common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …
And in Conclusion, …
• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction level parallelism
    § Implemented in all high performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread level parallelism
    § Multicore processors
    § Supported by Operating Systems (OS)
    § Requires programmer intervention to exploit at single program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks