
Transcript of TDDA69/lectures/2015/09_concurrent.pdf (TDDA69 Data and Program Structure, lecture 9: Parallel and Distributed Computing, Cyrille Berger)


TDDA69 Data and Program Structure: Parallel and Distributed Computing

Cyrille Berger

2 / 64

List of lectures
1 Introduction and Functional Programming
2 Imperative Programming and Data Structures
3 Environment
4 Evaluation
5 Object Oriented Programming
6 Macros and decorators
7 Virtual Machines and Bytecode
8 Garbage Collection and Native Code
9 Parallel and Distributed Computing
10 Logic Programming
11 Summary

3 / 64

Lecture goal
Learn about the concepts and challenges of distributed computing
The impact of distributed programming on programming languages and implementations

4 / 64

Lecture content
Parallel Programming
  Multithreaded Programming
  The States Problems and Solutions
  Atomic actions
  Language and Interpreter Design Considerations
  Single Instruction, Multiple Threads Programming
Distributed programming
  Message Passing
  MapReduce


5 / 64

Concurrent computing
In concurrent computing, several computations are executed at the same time
In parallel computing, all computation units have access to a shared memory (for instance, in a single process)
In distributed computing, computation units communicate through message passing

6 / 64

Benefits of concurrent computing
Faster
Responsiveness
  Interactive applications can perform two tasks at the same time: rendering, spell checking...
Availability of services
  Load balancing between servers
Controllability
  Tasks requiring certain preconditions can suspend and wait until the preconditions hold, then resume execution transparently.

7 / 64

Disadvantages of concurrent computing

Concurrency is hard to implement properly
Safety
  Easy to corrupt the shared state
Deadlock
  Tasks can wait indefinitely for each other
Non-determinism
Not always faster!
  The memory bandwidth and CPU cache are limited

8 / 64

Concurrent computing programming

Four basic approaches to computing:
Sequential programming: no concurrency
Declarative concurrency: streams in a functional language
Message passing: with active objects, used in distributed computing
Atomic actions: on a shared memory, used in parallel computing


9 / 64

Stream Programming in Functional Programming

No global state
Functions only act on their input; they are reentrant
Functions can then be executed in parallel
  As long as they do not depend on the output of another function
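As a sketch of this idea in Python (an illustrative example, not from the slides): a pure function has no global state, so independent calls can be mapped over inputs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Pure function: acts only on its input, so it is reentrant
    # and independent calls can safely run in parallel.
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```

Because no call depends on the output of another, the order (and parallelism) of execution cannot change the result.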

Parallel Programming

11 / 64

Parallel Programming
In parallel computing, several computations are executed at the same time and have access to a shared memory

[Diagram: several computation units connected to one shared memory]

12 / 64

SIMD, SIMT, SMT (1/2)
SIMD: Single Instruction, Multiple Data
  Elements of a short vector (4 to 8 elements) are processed in parallel
SIMT: Single Instruction, Multiple Threads
  The same instruction is executed by multiple threads (from 128 to 3048 or more in the future)
SMT: Simultaneous Multithreading
  General purpose; different instructions are executed by different threads


13 / 64

SIMD, SIMT, SMT (2/2)
SIMD:
  PUSH [1, 2, 3, 4]
  PUSH [4, 5, 6, 7]
  VEC_ADD_4

SIMT:
  execute([1,2,3,4], [4,5,6,7],
          lambda a, b, ti: a[ti] = a[ti] + max(b[ti], 5))

SMT:
  a = [1, 2, 3, 4]
  b = [4, 5, 6, 7]
  ...
  Thread.new(lambda: a = a + b)
  Thread.new(lambda: c = c * b)

14 / 64

Why the need for the different models?

Flexibility: SMT > SIMT > SIMD
  Less flexibility gives higher performance
  Unless the lack of flexibility prevents accomplishing the task
Performance: SIMD > SIMT > SMT

Multithreaded Programming

16 / 64

Single threaded vs Multithreaded


17 / 64

Multithreaded Programming Model
Start with a single root thread
Fork: to create concurrently executing threads
Join: to synchronize threads
Threads communicate through shared memory
Threads execute asynchronously
They may or may not execute on different processors

[Diagram: a main thread forks sub0 ... subn, which later join back into main]

18 / 64

A multithreaded example
thread1 = new Thread(function() { /* do some computation */ });
thread2 = new Thread(function() { /* do some computation */ });
thread1.start();
thread2.start();
thread1.join();
thread2.join();

The States Problems and Solutions

20 / 64

Global States and multi-threading
Example:

var a = 0;
thread1 = new Thread(function() { a = a + 1; });
thread2 = new Thread(function() { a = a + 1; });
thread1.start();
thread2.start();

What is the value of a? This is called a race condition


Atomic actions

22 / 64

Mutex
Mutex is short for Mutual exclusion
It is a technique to prevent two threads from accessing a shared resource at the same time

Example:
var a = 0;
var m = new Mutex();
thread1 = new Thread(function() { m.lock(); a = a + 1; m.unlock(); });
thread2 = new Thread(function() { m.lock(); a = a + 1; m.unlock(); });
thread1.start();
thread2.start();

Now the value of a is always 2
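The same pattern, sketched with Python's threading module (an illustrative example, not from the slides): the lock makes the read-modify-write of the shared variable atomic.

```python
import threading

a = 0
m = threading.Lock()

def increment():
    global a
    with m:          # equivalent to m.lock() ... m.unlock()
        a = a + 1

threads = [threading.Thread(target=increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(a)  # always 2: the lock serializes the two increments
```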

23 / 64

Dependency
Example:

var a = 1;
var m = new Mutex();
thread1 = new Thread(function() { m.lock(); a = a + 1; m.unlock(); });
thread2 = new Thread(function() { m.lock(); a = a * 3; m.unlock(); });
thread1.start();
thread2.start();

What is the value of a? 4 or 6?

24 / 64

Condition variable
A condition variable is a set of threads waiting for a certain condition
Example:

var a = 1;
var m = new Mutex();
var cv = new ConditionVariable();
thread1 = new Thread(function() { m.lock(); a = a + 1; cv.notify(); m.unlock(); });
thread2 = new Thread(function() { cv.wait(); m.lock(); a = a * 3; m.unlock(); });
thread1.start();
thread2.start();

a = 6
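A runnable Python sketch of the same ordering (an illustrative example, not from the slides). Note that Python's `Condition.wait_for` re-checks a predicate, which avoids the lost-wakeup problem the pseudocode above would have if `notify` ran before `wait`.

```python
import threading

a = 1
cv = threading.Condition()
incremented = False  # predicate guarded by the condition's lock

def adder():
    global a, incremented
    with cv:
        a = a + 1
        incremented = True
        cv.notify()

def multiplier():
    global a
    with cv:
        # wait_for releases the lock while waiting and re-checks the
        # predicate, so this is safe even if adder already notified
        cv.wait_for(lambda: incremented)
        a = a * 3

t1 = threading.Thread(target=adder)
t2 = threading.Thread(target=multiplier)
t2.start(); t1.start()
t1.join(); t2.join()

print(a)  # 6: the multiply is guaranteed to run after the add
```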

Page 7: List of lecturesTDDA69/lectures/2015/09_concurrent.pdfTDDA69 Data and Program Structure Parallel and Distributed Computing Cyrille Berger 2 / 64 List of lectures 1Introduction and

25 / 64

Deadlock
What might happen:

var a = 0;
var b = 2;
var ma = new Mutex();
var mb = new Mutex();
thread1 = new Thread(function() { ma.lock(); mb.lock(); b = b - 1; a = a - 1; ma.unlock(); mb.unlock(); });
thread2 = new Thread(function() { mb.lock(); ma.lock(); b = b - 1; a = a + b; mb.unlock(); ma.unlock(); });
thread1.start();
thread2.start();

thread1 waits for mb, thread2 waits for ma
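The standard fix, sketched in Python (an illustrative example, not from the slides): make every thread acquire the locks in the same global order, so a cyclic wait can never form.

```python
import threading

a, b = 0, 2
ma, mb = threading.Lock(), threading.Lock()

def worker1():
    global a, b
    with ma:        # both workers take ma first, then mb:
        with mb:    # no cycle in the wait-for graph, so no deadlock
            b = b - 1
            a = a - 1

def worker2():
    global a, b
    with ma:
        with mb:
            b = b - 1
            a = a + b

t1 = threading.Thread(target=worker1)
t2 = threading.Thread(target=worker2)
t1.start(); t2.start()
t1.join(); t2.join()

# b is always 0; a still depends on scheduling order (a dependency,
# as on the earlier slide), but the program can no longer deadlock.
print(b)  # 0
```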

26 / 64

Advantages of atomic actions
Very efficient
  Less overhead, faster than message passing

27 / 64

Disadvantages of atomic actions
Blocking
  Meaning some threads have to wait
Small overhead
Deadlock
A low-priority thread can block a high-priority thread
A common source of programming errors

Language and Interpreter Design Considerations


29 / 64

Common mistakes
Forgetting to unlock a mutex
Race conditions
Deadlocks
Granularity issues: too much locking will kill the performance

30 / 64

Forgetting to unlock a mutex
Most programming languages have either:
A guard object that will unlock a mutex upon destruction
A synchronization statement

some_rlock = threading.RLock()
with some_rlock:
    print("some_rlock is locked while this executes")

31 / 64

Race condition
Can we detect potential race conditions during compilation?
In the Rust programming language:
Objects are owned by a specific thread
Types can be marked with the Send trait to indicate that the object can be moved between threads
Types can be marked with the Sync trait to indicate that the object can be accessed by multiple threads safely

32 / 64

Safe Shared Mutable State in Rust (1/3)

let mut data = vec![1u32, 2, 3];
for j in 0..2 {
    thread::spawn(move || {
        for i in 0..2 {
            data[i] += 1;
        }
    });
}

Gives an error: "capture of moved value: `data`"

Page 9: List of lecturesTDDA69/lectures/2015/09_concurrent.pdfTDDA69 Data and Program Structure Parallel and Distributed Computing Cyrille Berger 2 / 64 List of lectures 1Introduction and

33 / 64

Safe Shared Mutable State in Rust (2/3)

let data = Mutex::new(vec![1u32, 2, 3]);
for j in 0..2 {
    let data = data.lock().unwrap();
    thread::spawn(move || {
        for i in 0..2 {
            data[i] += 1;
        }
    });
}

Gives an error: MutexGuard does not have the Send trait
Meaning we cannot move data into the thread

34 / 64

Safe Shared Mutable State in Rust (3/3)

let data = Arc::new(Mutex::new(vec![1u32, 2, 3]));
for j in 0..2 {
    let data = data.clone();
    thread::spawn(move || {
        let mut data = data.lock().unwrap();
        for i in 0..2 {
            data[i] += 1;
        }
    });
}

Arc has the Sync trait.

Single Instruction, Multiple Threads Programming

36 / 64

Single Instruction, Multiple Threads Programming

With SIMT, the same instruction is executed by multiple threads on different registers


37 / 64

Single instruction, multiple flow paths (1/2)

Using a masking system, it is possible to support if/else blocks
Threads always execute the instructions of both parts of the if/else blocks

data = [-2, 0, 1, -1, 2], data2 = [...]
function f(thread_id, data, data2)
{
  if(data[thread_id] < 0)
  {
    data[thread_id] = data[thread_id] - data2[thread_id];
  } else if(data[thread_id] > 0)
  {
    data[thread_id] = data[thread_id] + data2[thread_id];
  }
}
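A toy simulation of this masked execution in plain Python (illustrative only; the sample `data2` values are made up, since the slide elides them): every "thread" steps through both branches in lockstep, but a mask decides which threads commit their writes.

```python
data  = [-2, 0, 1, -1, 2]
data2 = [10, 10, 10, 10, 10]   # made-up values for illustration

n = len(data)
mask_neg = [data[i] < 0 for i in range(n)]  # threads taking the if-branch
mask_pos = [data[i] > 0 for i in range(n)]  # threads taking the else-if

# Branch 1 is "executed" for all threads, committed only under mask_neg
for i in range(n):
    if mask_neg[i]:
        data[i] = data[i] - data2[i]

# Branch 2 likewise, committed only under mask_pos
for i in range(n):
    if mask_pos[i]:
        data[i] = data[i] + data2[i]

print(data)  # [-12, 0, 11, -11, 12]
```

Note the cost the next slide describes: both branch bodies are traversed by every thread, and masked-off threads simply wait.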

38 / 64

Single instruction, multiple flow paths (2/2)

Benefits:
  Multiple flows are needed in many algorithms
Drawbacks:
  Only one flow path is executed at a time; non-running threads must wait
  Random memory access: elements of a vector are not accessed sequentially

39 / 64

Programming Language Design for SIMT

OpenCL and CUDA are the most common
  Very low level, C/C++ derivatives
General purpose programming languages are not suitable
Some work has been done to write Python for CUDA:

@jit(argtypes=[float32[:], float32[:], float32[:]], target='gpu')
def add_matrix(A, B, C):
    A[cuda.threadIdx.x] = B[cuda.threadIdx.x] + C[cuda.threadIdx.x]

with limitations on the standard functions that can be called

Distributed programming


41 / 64

Distributed Programming (1/4)
In distributed computing, several computations are executed at the same time and communicate through message passing

[Diagram: several computation units, each with its own private memory]

42 / 64

Distributed programming (2/4)
A distributed computing application consists of multiple programs running on multiple computers that together coordinate to perform some task.
Computation is performed in parallel by many computers.
Information can be restricted to certain computers.
Redundancy and geographic diversity improve reliability.

43 / 64

Distributed programming (3/4)
Characteristics of distributed computing:
Computers are independent; they do not share memory.
Coordination is enabled by messages passed across a network.

44 / 64

Distributed programming (4/4)
Individual programs have differentiating roles.
Distributed computing for large-scale data processing:
Databases respond to queries over a network.
Datasets can be partitioned across multiple machines.


Message Passing

46 / 64

Message Passing
Messages are (usually) passed through sockets
Messages are exchanged synchronously or asynchronously
Communication can be centralized or peer-to-peer
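A minimal sketch of message passing over a socket in Python (an illustrative single-process example, not from the slides; real systems would connect separate processes or machines):

```python
import socket

# socketpair() gives two connected endpoints, standing in for
# two communicating units
parent, child = socket.socketpair()

child.sendall(b"hello")   # one endpoint sends a message...
msg = parent.recv(1024)   # ...the other blocks until it arrives

print(msg)  # b'hello'
parent.close()
child.close()
```

Here `recv` blocks, i.e. the exchange is synchronous; asynchronous styles would poll or use callbacks instead.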

47 / 64

Python's Global Interpreter Lock
CPython can only interpret one single thread at a given time
The lock is released when:
  The current thread is blocking for I/O
  Every 100 interpreter ticks
True multithreading is not possible with CPython

48 / 64

Python's Multiprocessing module
The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads
It implements transparent message passing, allowing the exchange of Python objects between processes


49 / 64

Python's Message Passing (1/2)
Example of message passing:

from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

Output: hello bob

50 / 64

Python's Message Passing (2/2)
Example of message passing with pipes:

from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())
    p.join()

Output: [42, None, 'hello']

Transparent message passing is possible thanks to serialization

51 / 64

Serialization
A serialized object is an object represented as a sequence of bytes that includes the object's data, its type and the types of data stored in the object.

52 / 64

pickle
In Python, serialization is done with the pickle module
It can serialize user-defined classes
  The class definition must be available before deserialization
Works with different versions of Python
By default, uses an ASCII protocol
It can serialize:
  Basic types: booleans, numbers, strings
  Containers: tuples, lists, sets and dictionaries (of picklable objects)
  Top-level functions and classes (only the name is serialized)
  Objects where __dict__ or __getstate__() are picklable

Example: pickle.loads(pickle.dumps(10))
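A round trip with a user-defined class (an illustrative example; the `Point` class is made up). The instance's data travels as bytes, while the class itself is referenced by name, which is why the definition must be available when deserializing.

```python
import pickle

class Point:
    # This class definition must be importable at deserialization
    # time: pickle stores only the class's name, not its code.
    def __init__(self, x, y):
        self.x, self.y = x, y

buf = pickle.dumps(Point(1, 2))  # object -> bytes
p = pickle.loads(buf)            # bytes  -> object

print(p.x, p.y)  # 1 2
```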


53 / 64

Shared memory
Memory can be shared between Python processes with a Value or an Array.

from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))
    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()
    print(num.value)
    print(arr[:])

And of course, you would need to use a mutex to avoid race conditions

MapReduce

55 / 64

Big Data Processing (1/2)
MapReduce is a framework for batch processing of big data.
Framework: A system used by programmers to build applications
Batch processing: All the data is available at the outset, and results are not used until processing completes
Big data: Used to describe data sets so large and comprehensive that they can reveal facts about a whole population, usually from statistical analysis

56 / 64

Big Data Processing (2/2)
The MapReduce idea:
Data sets are too big to be analyzed by one machine
Using multiple machines has the same complications, regardless of the application/analysis
Pure functions enable an abstraction barrier between data processing logic and coordinating a distributed application


57 / 64

MapReduce Evaluation Model (1/2)

Map phase: Apply a mapper function to all inputs, emitting intermediate key-value pairs
The mapper takes an iterable value containing inputs, such as lines of text
The mapper yields zero or more key-value pairs for each input

58 / 64

MapReduce Evaluation Model (2/2)
Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key
The reducer takes an iterable value containing intermediate key-value pairs
All pairs with the same key appear consecutively
The reducer yields zero or more values, each associated with that intermediate key

59 / 64

MapReduce Execution Model (1/2)
[Diagram]

60 / 64

MapReduce Execution Model (2/2)
[Diagram]


61 / 64

MapReduce example
From a 1.1 billion people database (facebook?), we want to know the average number of friends per age
In SQL:

SELECT age, AVG(friends) FROM users GROUP BY age

The total set of users is split into different users_set

function map(users_set)
{
  for(user in users_set)
  {
    send(user.age, user.friends.size);
  }
}

The keys are shuffled and assigned to reducers

function reduce(age, friends)
{
  var r = 0;
  for(friend in friends)
  {
    r += friend;
  }
  send(age, r / friends.size);
}
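The same map/shuffle/reduce pipeline can be sketched sequentially in Python (an illustrative example; the sample users below are made-up data, and a real framework would run the phases on many machines):

```python
from collections import defaultdict

users = [                     # made-up sample records
    {"age": 20, "friends": 100},
    {"age": 20, "friends": 200},
    {"age": 30, "friends": 50},
]

# Map phase: emit (age, friend_count) pairs
pairs = [(u["age"], u["friends"]) for u in users]

# Shuffle phase: group all values by key
groups = defaultdict(list)
for age, friends in pairs:
    groups[age].append(friends)

# Reduce phase: average the friend counts for each age
averages = {age: sum(v) / len(v) for age, v in groups.items()}

print(averages)  # {20: 150.0, 30: 50.0}
```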

62 / 64

MapReduce Assumptions
Constraints on the mapper and reducer:
The mapper must be equivalent to applying a deterministic pure function to each input independently
The reducer must be equivalent to applying a deterministic pure function to the sequence of values for each key

Benefits of functional programming:
When a program contains only pure functions, call expressions can be evaluated in any order, lazily, and in parallel
Referential transparency: a call expression can be replaced by its value (or vice versa) without changing the program

In MapReduce, these functional programming ideas allow:
Consistent results, however the computation is partitioned
Re-computation and caching of results, as needed

63 / 64

MapReduce Benefits
Fault tolerance: A machine or hard drive might crash
  The MapReduce framework automatically re-runs failed tasks
Speed: Some machine might be slow because it's overloaded
  The framework can run multiple copies of a task and keep the result of the one that finishes first
Network locality: Data transfer is expensive
  The framework tries to schedule map tasks on the machines that hold the data to be processed
Monitoring: Will my job finish before dinner?!?
  The framework provides a web-based interface describing jobs

64 / 64

Summary
Parallel programming
Multi-threading and how to help reduce programmer error
Distributed programming and MapReduce