User-Level Threads and OpenMP


Page 1:

Sangmin Seo
Assistant Computer Scientist, Argonne National Laboratory
[email protected]
4th XMP Workshop, November 7, 2016

User-Level Threads and OpenMP

Page 2:

Top500 Supercomputers Today
1. Sunway TaihuLight (Sunway MPP): 10,649,600 cores
2. Tianhe-2 (Intel Xeon + Xeon Phi): 3,120,000 cores
3. Titan (Cray XK7 + Nvidia K20x): 560,640 cores
4. Sequoia (Blue Gene/Q): 1,572,864 cores
5. K computer (SPARC64): 705,024 cores
6. Mira (Blue Gene/Q): 786,432 cores

[Chart: number of cores per socket across the Top500 (percentages) and total number of cores of the Top500 rank #1 system (0K to 12,000K)]

Page 3:

Massive On-node Parallelism

• To address massive on-node parallelism, the number of work units (e.g., threads) must increase by 100X
• MPI+OpenMP is sufficient for many apps, but the implementation is poor
  – Today, MPI+OpenMP == MPI+Pthreads
• The Pthreads abstraction is too generic, not suitable for HPC
  – Lack of fine-grained scheduling, memory management, network management, signaling, etc.
• A better runtime can significantly improve MPI+OpenMP performance and support other emerging programming models

[Figure: an MPI process with many OpenMP threads per core]

Page 4:

Outline

• Motivation
• User-Level Threads (ULTs)
• Argobots
• BOLT: OpenMP over Lightweight Threads
• Summary

Page 5:

User-Level Threads (ULTs)

• What is a user-level thread (ULT)?
  – Provides thread semantics in user space
  – Execution model: cooperative timesharing
• More than one ULT can be mapped to a single kernel thread
• ULTs on the same OS thread do not execute in parallel
  – Can be implemented with coroutines
    • Enable explicit suspend and resume of progress by preserving execution state
    • Some languages, such as Python and Go, use coroutines for asynchronous I/O

[Figure: ULT1 and ULT2 alternating on a timeline via context switches; many ULTs multiplexed over kernel threads, one kernel thread per core]
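Not part of the slides: a minimal sketch of the cooperative suspend/resume idea above, using POSIX ucontext directly (one ULT-like context plus the main context; the stack size and messages are arbitrary).

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, ult_ctx;
static char stack[64 * 1024];

static void ult_body(void) {
    printf("ULT: step 1\n");
    swapcontext(&ult_ctx, &main_ctx);   /* explicit suspend: save state, resume main */
    printf("ULT: step 2\n");
    /* returning falls back to uc_link, i.e., main_ctx */
}

int main(void) {
    getcontext(&ult_ctx);
    ult_ctx.uc_stack.ss_sp = stack;
    ult_ctx.uc_stack.ss_size = sizeof(stack);
    ult_ctx.uc_link = &main_ctx;
    makecontext(&ult_ctx, ult_body, 0);

    printf("main: resume ULT\n");
    swapcontext(&main_ctx, &ult_ctx);   /* run the ULT until it yields */
    printf("main: ULT yielded, resume again\n");
    swapcontext(&main_ctx, &ult_ctx);   /* run the ULT to completion */
    printf("main: done\n");
    return 0;
}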

Page 6:

User-Level Threads (ULTs)

• Why ULTs?
  – Conventional threads (e.g., Pthreads) are too expensive to express massive parallelism
  – If we create more Pthreads than the number of cores (i.e., oversubscription), context-switch and synchronization overhead increases dramatically
  – ULTs can mitigate the high overhead of Pthreads but need explicit control
• Where to use them?
  – To better overlap computation with communication/I/O
    • The low context-switching overhead of ULTs gives more opportunities to hide communication/I/O latency
  – To exploit fine-grained task parallelism
    • Lightweight ULTs are more suitable for expressing massive task parallelism

[Figure: several ULTs (U) multiplexed on a single OS thread]

Page 7:

Pthreads vs. ULTs

• Average time for creating and joining one thread
  – Pthread: 6.6 us - 21.2 us (avg. 34,953 cycles)
  – ULT (Argobots): 78 ns - 130 ns (avg. 191 cycles)
  – ULT is 64x - 233x faster than Pthread
• How fast is that?
  – L1 cache access: 1.112 ns, L2 cache access: 5.648 ns, memory access: 18.4 ns
  – Context switch (2 processes): 1.64 us
  – (Measured using LMbench3)

[Chart: average create and join time per thread (ns, log scale) vs. number of threads (1 to 2048), comparing Pthreads and Argobots ULTs]
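Not the benchmark behind the numbers above, just a minimal sketch of how per-thread create+join cost is typically measured with Pthreads; NUM_THREADS and the timing loop are illustrative.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_THREADS 1000   /* illustrative iteration count */

static void *empty(void *arg) { return arg; }

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    pthread_t tid;
    double t0 = now_ns();
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&tid, NULL, empty, NULL);   /* create one kernel thread */
        pthread_join(tid, NULL);                   /* wait for it to finish */
    }
    double t1 = now_ns();
    printf("Pthread create+join: %.0f ns/thread\n", (t1 - t0) / NUM_THREADS);
    /* The ULT variant replaces the two calls above with ABT_thread_create()
     * on a pool and ABT_thread_join()/ABT_thread_free(). */
    return 0;
}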

Page 8:

Growing Interest in ULTs

• ULT and tasking libraries
  – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GNU Pth, StackThreads/MP, Protothreads, Capriccio, State Threads, TiNy-threads, etc.
• OS support
  – Windows fibers, Solaris threads
• Languages and programming models
  – Cilk, OpenMP tasks, C++11 tasks, the C++17 coroutine proposal, Stackless Python, Go coroutines, etc.
• Pros
  – Easy to use with a Pthreads-like interface
• Cons
  – The runtime tries to do something smart (e.g., work stealing)
  – This may conflict with the characteristics and demands of applications

Page 9:

Argobots

Overview
• Separation of mechanisms and policies
• Massive parallelism
  – Execution Streams guarantee progress
  – Work Units execute to completion
• User-level threads (ULTs) vs. tasklets
• Clearly defined memory semantics
  – Consistency domains
    • Provide eventual consistency
  – Software can manage consistency

Argobots Innovations
• Enabling technology, but not a policy maker
  – High-level languages/libraries such as OpenMP or Charm++ have more information about the user application (data locality, dependencies)
• Explicit model
  – Enables dynamism, but always managed by high-level systems

Argobots is a lightweight, low-level threading and tasking framework (http://www.mcs.anl.gov/argobots/).

[Figure: programming models (MPI, OpenMP, Charm++, PaRSEC, ...) on top of Argobots; execution streams mapped to processor cores, each with a private pool plus a shared pool holding lightweight work units: user-level threads (U) and tasklets (T)]

* Current team members: Pavan Balaji, Sangmin Seo, Halim Amer (ANL), L. Kale, Nitin Bhat (UIUC)

Page 10:

Argobots Execution Model

• Execution Streams (ESs)
  – Sequential instruction stream
    • Can consist of one or more work units
  – Mapped efficiently to a hardware resource
  – Implicitly managed progress semantics
    • One blocked ES cannot block other ESs
• User-level Threads (ULTs)
  – Independent execution units in user space
  – Associated with an ES when running
  – Yieldable and migratable
  – Can make blocking calls
• Tasklets
  – Atomic units of work
  – Asynchronous completion via notifications
  – Not yieldable; migratable before execution
  – Cannot make blocking calls
• Scheduler
  – Stackable scheduler with pluggable strategies
• Synchronization primitives
  – Mutex, condition variable, barrier, future
• Events
  – Communication triggers

[Figure: execution streams ES1..ESn, each running a scheduler (S) over a pool of ULTs (U), tasklets (T), and events (E)]
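Not from the slides: a minimal Argobots sketch of the model above, based on the public ABT_* API (error checks omitted; NUM_ES and NUM_WORK are arbitrary). Execution streams are created explicitly, and ULTs and tasklets are pushed to the ESs' main pools.

#include <stdio.h>
#include <abt.h>

#define NUM_ES   4
#define NUM_WORK 8

static void ult_fn(void *arg)  { printf("ULT %ld\n", (long)arg); }     /* yieldable, can block */
static void task_fn(void *arg) { printf("tasklet %ld\n", (long)arg); } /* runs to completion */

int main(int argc, char **argv) {
    ABT_xstream xstreams[NUM_ES];
    ABT_pool    pools[NUM_ES];
    ABT_thread  ults[NUM_WORK];
    int i;

    ABT_init(argc, argv);

    /* The primary ES already exists; create the remaining ESs. */
    ABT_xstream_self(&xstreams[0]);
    for (i = 1; i < NUM_ES; i++)
        ABT_xstream_create(ABT_SCHED_NULL, &xstreams[i]);
    for (i = 0; i < NUM_ES; i++)
        ABT_xstream_get_main_pools(xstreams[i], 1, &pools[i]);

    /* ULTs: joinable work units, mapped round-robin to the ES pools. */
    for (i = 0; i < NUM_WORK; i++)
        ABT_thread_create(pools[i % NUM_ES], ult_fn, (void *)(long)i,
                          ABT_THREAD_ATTR_NULL, &ults[i]);

    /* Tasklets: stackless work units, created unnamed so the runtime
       releases them automatically on completion. */
    for (i = 0; i < NUM_WORK; i++)
        ABT_task_create(pools[i % NUM_ES], task_fn, (void *)(long)i, NULL);

    for (i = 0; i < NUM_WORK; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }
    for (i = 1; i < NUM_ES; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }
    ABT_finalize();
    return 0;
}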

Page 11:

Explicit Mapping ULT/Tasklet to ES

• The user needs to map work units to ESs
• No smart scheduling, no work stealing unless the user wants to use it
• Benefits
  – Allows locality optimization
    • Execute work units on the same ES
  – No expensive lock is needed between ULTs on the same ES
    • They do not run in parallel
    • A flag is enough (a sketch follows below)

[Figure: ULTs U0, U1, U2, U3 and tasklets T1, T2 mapped to ES1; ULTs U4, U5 mapped to ES2]
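Not from the slides: a small sketch of the "a flag is enough" point, assuming the public ABT_* API. Both ULTs are created on the primary ES's main pool, so they never run in parallel, and a plain flag plus cooperative yields replaces a lock.

#include <stdio.h>
#include <abt.h>

static volatile int ready = 0;   /* plain flag, no mutex */

static void producer(void *arg) { (void)arg; ready = 1; }

static void consumer(void *arg) {
    (void)arg;
    while (!ready)
        ABT_thread_yield();      /* let the producer run on the same ES */
    printf("consumer saw the flag without any lock\n");
}

int main(int argc, char **argv) {
    ABT_xstream es;
    ABT_pool pool;
    ABT_thread ults[2];

    ABT_init(argc, argv);
    ABT_xstream_self(&es);                     /* use only the primary ES */
    ABT_xstream_get_main_pools(es, 1, &pool);

    ABT_thread_create(pool, consumer, NULL, ABT_THREAD_ATTR_NULL, &ults[0]);
    ABT_thread_create(pool, producer, NULL, ABT_THREAD_ATTR_NULL, &ults[1]);

    for (int i = 0; i < 2; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }
    ABT_finalize();
    return 0;
}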

Page 12:

Stackable Scheduler with Pluggable Strategies

• Associated with an ES
• Can handle ULTs and tasklets
• Can handle schedulers
  – Allows schedulers to be stacked hierarchically
• Can handle asynchronous events
• Users can write their own schedulers
  – Argobots provides mechanisms, not policies
  – Replace the default scheduler (e.g., FIFO, LIFO, priority queue, etc.)
• A ULT can explicitly yield to another ULT
  – Avoids scheduler overhead: yield_to(target) instead of yield() (a sketch follows below)

[Figure: a scheduler (Sched) on an ES dispatching ULTs, tasklets, and events from its pool; yield() returns control to the scheduler, while yield_to(target) switches directly to the target ULT]
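Not from the slides: a small sketch of yield() versus yield_to(target), assuming the public ABT_* API (ABT_thread_yield and ABT_thread_yield_to). Both ULTs share the primary ES's main pool, so control moves cooperatively between them.

#include <stdio.h>
#include <abt.h>

static ABT_thread ult_b;          /* handle used as the yield_to() target */

static void fn_b(void *arg) { (void)arg; printf("B runs\n"); }

static void fn_a(void *arg) {
    (void)arg;
    printf("A: switch directly to B, skipping the scheduler\n");
    ABT_thread_yield_to(ult_b);   /* yield_to(target) */
    printf("A: resumed; a plain yield goes back through the scheduler\n");
    ABT_thread_yield();           /* yield() */
}

int main(int argc, char **argv) {
    ABT_xstream primary;
    ABT_pool pool;
    ABT_thread ult_a;

    ABT_init(argc, argv);
    ABT_xstream_self(&primary);
    ABT_xstream_get_main_pools(primary, 1, &pool);

    ABT_thread_create(pool, fn_a, NULL, ABT_THREAD_ATTR_NULL, &ult_a);
    ABT_thread_create(pool, fn_b, NULL, ABT_THREAD_ATTR_NULL, &ult_b);

    ABT_thread_join(ult_a);
    ABT_thread_join(ult_b);
    ABT_thread_free(&ult_a);
    ABT_thread_free(&ult_b);
    ABT_finalize();
    return 0;
}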

Page 13:

Performance: Create/Join Time

• Ideal scalability
  – If the ULT runtime is perfectly scalable, the create/join time should stay the same regardless of the number of ESs

[Chart: create/join time per ULT (cycles, log scale) vs. number of execution streams (workers, 1 to 72), comparing Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), and Argobots (Tasklet)]

Page 14:

Argobots’ Position

Argobots is a low-level threading/tasking runtime!

[Figure: per-node software stack: applications and domain-specific languages (DSLs) on top of high-level programming models/libraries, which run over Argobots and communication libraries on the node OS]

Page 15:

Argobots Ecosystem

[Figure: the Argobots ecosystem: MPI+Argobots (MPI over the Argobots runtime and communication libraries), Charm++ (Cilk-style workers with fused ULTs over Argobots ESs), PaRSEC task graphs, OpenMP, Mercury RPC (origin/target RPC processing), and OmpSs, all built on the Argobots runtime of execution streams, schedulers, ULTs, and tasklets]

More connections: TASCEL, XMP, ROSE, GridFTP, Kokkos, RAJA, etc.

Page 16:

OpenMP

• Directive-based programming model
• Commonly used for shared-memory programming within a node
• Many different implementations
  – Typically built on top of the Pthreads library
  – Intel, GCC, Clang, IBM, etc.

Sequential code:

for (i = 0; i < N; i++) {
    do_something();
}

OpenMP code:

#pragma omp parallel for
for (i = 0; i < N; i++) {
    do_something();
}

Page 17:

Nested Parallel Loop: Microbenchmark

int in[1000][1000], out[1000][1000];

/* Outer loop: a thread for each CPU is created by default,
 * and each thread executes a portion of the iterations. */
#pragma omp parallel for
for (i = 0; i < 1000; i++) {
    lib_compute(i);
}

/* Each outer thread creates more threads for the second loop,
 * and each inner thread executes a portion. */
void lib_compute(int x)
{
    #pragma omp parallel for
    for (j = 0; j < 1000; j++)
        out[x][j] = compute(in[x][j]);
}

Contribution: Adrián Castelló (Universitat Jaume I)

Page 18:

Nested Parallel Loop: Implementations

• GCC
  – Does not reuse idle threads in nested parallel constructs
  – All thread teams inside a parallel region need to be created
• ICC
  – Reuses idle threads
    • If not enough threads are available, new threads are created
    • All created threads are OS threads and add overhead
• Implementation using Argobots
  – Creates ULTs or tasklets for both the outer loop and the inner loop

[Figure: one ES per core (ES0..ESN-1); one work unit (WU) per chunk of the outer loop, each spawning work units that execute a portion of the inner loop; synchronization points close the inner and outer loops]

Page 19:

Nested Parallel Loop: Performance

• The GCC OpenMP implementation does not reuse idle threads in nested parallel regions, so all thread teams need to be created in each iteration
• Some overhead is added by creating ULTs instead of tasks

[Charts: execution time (s, log scale, lower is better) vs. number of OpenMP threads or Argobots ULTs/tasks for the inner loop, with 36 threads in the outer loop; one chart compares GCC/Pthreads with GCC over Argobots ULTs and tasks, the other compares ICC/Pthreads with ICC over Argobots ULTs and tasks]

Page 20:

Nested Parallel Loop: Analysis

• How does each implementation manage the threads in nested parallel regions?

Impl.            # Created Threads        # Reused Threads        # Created ULTs    # Created Tasks
GCC              C + N*(T-1) = 35,036     0                       ---               ---
ICC              C + C*(T-1) = 1,296      (N-C)*(T-1) = 33,740    ---               ---
Argobots tasks   C = 36                   0                       C = 36            N*T = 36,000

Parameters (example: C = 36, N = 1,000, M = 1,000, T = 36):
• C: number of cores (or threads created by the user at the beginning)
• N: number of iterations of the outer loop
• M: number of iterations of the inner loop
• T: number of threads for the inner loop

OS threads are created sequentially, whereas the Argobots ULTs/tasks are created in parallel.
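Expanding the table's formulas with the example parameters (no new data, just the arithmetic):

GCC:      C + N*(T-1)  = 36 + 1,000 * 35 = 35,036 threads created
ICC:      C + C*(T-1)  = 36 + 36 * 35    = 1,296 threads created
          (N-C)*(T-1)  = 964 * 35        = 33,740 threads reused
Argobots: C = 36 threads (ESs) and C = 36 ULTs created once, plus N*T = 1,000 * 36 = 36,000 tasks for the inner loops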

Page 21:

BOLT: A Lightning-Fast OpenMP Implementation

• About BOLT
  – BOLT is a recursive acronym that stands for "BOLT is OpenMP over Lightweight Threads"
  – https://www.mcs.anl.gov/bolt/
• Objective: an OpenMP framework that exploits lightweight threads and tasks
  – Improved nested massive parallelism
  – Enhanced fine-grained task parallelism
  – Better interoperability with MPI and other internode programming models

Page 22:

Approach & Development

• Basic approach
  – The compiler simply generates runtime API calls, while the runtime creates ULTs/tasklets and manages them over a fixed set of computational resources
  – Uses Argobots as the underlying threading and tasking mechanism
  – ABI compatibility with the Intel OpenMP compilers, LLVM/Clang, and GCC (i.e., BOLT can be used with these compilers)
• Development
  – Runtime
    • Based on the Intel OpenMP runtime API
    • Generates Argobots work units from OpenMP pragmas
    • Can generate ULTs or tasklets depending on code characteristics
  – Compiler (planned)
    • LLVM/Clang
    • Passes characteristics of a parallel region or task (e.g., the existence of blocking calls) to the runtime
    • Extends pragmas with the option "nonblocking" (a hypothetical illustration follows below)
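The slides only name the planned "nonblocking" option; its exact syntax is not defined there, so the following is a purely hypothetical illustration of how such a hint might be written.

/* Hypothetical illustration only: the slide mentions a planned "nonblocking"
 * pragma option but does not define its syntax. A task known to contain no
 * blocking calls could be marked so the runtime may run it as a tasklet
 * instead of a ULT. */
#pragma omp task nonblocking
{
    out[i] = compute(in[i]);   /* pure computation, no blocking calls */
}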

Page 23:

BOLT Execution Model

• OpenMP threads and tasks are translated into Argobots work units (i.e., ULTs and tasklets)
• Shared pools are utilized to handle nested parallelism (a sketch follows below)
• A customized Argobots scheduler manages the scheduling of work units across execution streams

[Figure: #pragma omp parallel maps OpenMP threads to ULTs (U), and #pragma omp task maps OpenMP tasks to ULTs or tasklets (T); execution streams bound to CPU cores pull work from private pools and a shared pool]
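Not BOLT's actual code: a minimal Argobots sketch of the shared-pool structure described above, assuming the public ABT_* API (one multi-producer/multi-consumer pool serviced by several execution streams; the sizes are arbitrary and error checks are omitted).

#include <abt.h>

#define NUM_ES 4
#define NUM_WORK 64

static void work(void *arg) { (void)arg; /* one OpenMP thread's or task's work */ }

int main(int argc, char **argv) {
    ABT_pool shared_pool;
    ABT_xstream xstreams[NUM_ES];
    ABT_thread threads[NUM_WORK];
    int i;

    ABT_init(argc, argv);

    /* One shared FIFO pool with MPMC access, so work pushed by any ES
       can be executed by any other ES. */
    ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE, &shared_pool);

    /* The primary ES is xstreams[0]; create the others on the shared pool. */
    ABT_xstream_self(&xstreams[0]);
    for (i = 1; i < NUM_ES; i++)
        ABT_xstream_create_basic(ABT_SCHED_DEFAULT, 1, &shared_pool,
                                 ABT_SCHED_CONFIG_NULL, &xstreams[i]);

    /* Push ULTs into the shared pool; any of the secondary ESs may run them. */
    for (i = 0; i < NUM_WORK; i++)
        ABT_thread_create(shared_pool, work, NULL, ABT_THREAD_ATTR_NULL, &threads[i]);
    for (i = 0; i < NUM_WORK; i++) {
        ABT_thread_join(threads[i]);
        ABT_thread_free(&threads[i]);
    }

    for (i = 1; i < NUM_ES; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }
    ABT_finalize();
    return 0;
}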

Page 24:

Prototype Implementation of BOLT Runtime

• Based on Intel's open-source OpenMP runtime
  – http://openmp.llvm.org/
• Kept the original runtime API for ABI compatibility
• Designed and implemented the threading layer using Argobots and modified the runtime internal layer

[Figure: runtime structure with a Runtime API Layer, a Runtime Internal Layer, and a Threading Layer in which Pthreads is replaced by Argobots]

Page 25:

OpenMP Pragma Translation

1. A set of N threads is created at runtime
   – If they have not been created yet
   – Commonly as many as the number of CPU cores
2. The number of iterations is divided among all the threads (a sketch follows below)
3. A synchronization point is added after the for loop
   – Implicit barrier at the end of parallel for

#pragma omp parallel for   /* (1), (2) */
for (i = 0; i < N; i++) {
    do_something();
}                          /* (3) */
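Not on the slide: step 2, dividing the iterations among threads, roughly corresponds to the following static chunking; the function name and types are illustrative.

/* Compute the [lo, hi) iteration range of thread "tid" out of T threads
 * for a loop of N iterations, using even (static) chunks. */
void static_chunk(long N, int T, int tid, long *lo, long *hi)
{
    long chunk = (N + T - 1) / T;     /* ceiling division */
    *lo = (long)tid * chunk;
    *hi = *lo + chunk;
    if (*hi > N) *hi = N;             /* the last thread may get a shorter range */
}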

Page 26:

OpenMP Compiler & BOLT Runtime

__kmpc_fork_call(…){

}

__kmp_fork_call(…)

__kmp_join_call(…)

IntelOpenMPRuntimeAPI

• CreateExecutionStreams(ifneeded)• AddaULTortasklettoeachES• Launchthework

• Join workunitscreated

BOLTruntime

#pragmaompparallelClangandIntelcompiler

264thXMPWorkshop(11/7/2016)
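Not from the slides: a runnable OpenMP snippet whose comments restate the path described above; the exact generated code differs by compiler, so the comments are a summary rather than literal compiler output.

#include <stdio.h>
#include <omp.h>

/* For the parallel region below, Clang/ICC outline the body into a separate
 * function and emit a call to __kmpc_fork_call() with it. With the BOLT
 * runtime, __kmp_fork_call() creates execution streams if needed and adds a
 * ULT or tasklet per ES to run that body; __kmp_join_call() joins them. */
int main(void) {
    #pragma omp parallel
    {
        printf("OpenMP thread %d running as a BOLT work unit\n",
               omp_get_thread_num());
    }
    return 0;
}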

Page 27:

parallel for

#pragma omp parallel for
for (i = 0; i < N; i++) {
    do_something();
}

• Creates threads
• Divides all iterations among the threads
• Adds a synchronization point

Implementation using Argobots (a sketch follows below):
• One execution stream for each CPU core (or hardware thread)
• Each work unit executes a portion of the for loop
• A synchronization point is added

[Figure: work units (WU) distributed across ES0, ES1, ..., ESK, each with a scheduler (S), joined at a synchronization point]
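Not BOLT's implementation: a minimal Argobots sketch of the structure on this slide, assuming the public ABT_* API (one ES per core, one work unit per ES executing a chunk of the loop, and joins as the synchronization point; NUM_ES and N are arbitrary, error checks omitted).

#include <abt.h>

#define NUM_ES 4
#define N      1000

static void do_something(int i) { (void)i; /* loop body */ }

typedef struct { int lo, hi; } range_t;

static void chunk_fn(void *arg) {
    range_t *r = (range_t *)arg;
    for (int i = r->lo; i < r->hi; i++)
        do_something(i);
}

int main(int argc, char **argv) {
    ABT_xstream xstreams[NUM_ES];
    ABT_pool pools[NUM_ES];
    ABT_thread ults[NUM_ES];
    range_t ranges[NUM_ES];
    int chunk = (N + NUM_ES - 1) / NUM_ES;

    ABT_init(argc, argv);
    ABT_xstream_self(&xstreams[0]);
    for (int i = 1; i < NUM_ES; i++)
        ABT_xstream_create(ABT_SCHED_NULL, &xstreams[i]);
    for (int i = 0; i < NUM_ES; i++)
        ABT_xstream_get_main_pools(xstreams[i], 1, &pools[i]);

    /* One work unit per ES, each taking a contiguous portion of the iterations. */
    for (int i = 0; i < NUM_ES; i++) {
        ranges[i].lo = i * chunk;
        ranges[i].hi = (i + 1) * chunk < N ? (i + 1) * chunk : N;
        ABT_thread_create(pools[i], chunk_fn, &ranges[i], ABT_THREAD_ATTR_NULL, &ults[i]);
    }
    /* Synchronization point: join all work units (the implicit barrier). */
    for (int i = 0; i < NUM_ES; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }
    for (int i = 1; i < NUM_ES; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }
    ABT_finalize();
    return 0;
}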

Page 28:

OpenUH OpenMP Validation Suite 3.1

                                GCC 6.1   ICC 17.0.0 +    ICC 17.0.0 +             BOLT
                                          Intel OpenMP    BOLT runtime (Argobots)  (clang + Argobots)
# of tested OpenMP constructs   62        62              62                       62
# of used tests                 123       123             123                      123
# of successful tests           118       118             122                      122
# of failed tests               5         5               1                        1
Pass rate (%)                   95.9      95.9            99.2                     99.2

• The BOLT prototype functionally works well!

Page 29:

Nested Parallel Loop Microbenchmark

[Chart: execution time (us, log scale) vs. number of threads for the inner loop (1 to 35), comparing gcc 6.1.0, icc 17.0.0, BOLT (ULT), BOLT (ULT + tasklet), and Argobots (ULT)]
* The number of threads for the outer loop was fixed at 36.

Page 30:

Application Study: KIFMM

• Kernel-Independent Fast Multipole Method (KIFMM)
  – Offloads dgemv operations to Intel MKL
• Evaluated the efficiency of the nested parallelism support in Intel OpenMP and BOLT during the Downward stage
  – 9 threads for the application (outer parallel region)

[Chart: execution time (s) vs. number of threads for Intel MKL (1, 2, 4, 8), comparing Intel OMP with core-close, core-true, and no-binding settings against BOLT]

Page 31:

Application Study: ACME mini-app

• ACME (Accelerated Climate Modeling for Energy)
  – Implementing additional levels of parallelism through OpenMP nested parallel loops for upcoming many-core machines
• Preliminary results of testing the transport_se mini-app version of HOMME (ACME's CAM-SE dycore)

[Chart: normalized execution time (%) of the ACME mini-app (transport_se) for the thread configurations H=16,V=1; H=8,V=2; H=4,V=4; H=4,V=8; H=8,V=4, comparing ICC + Intel OpenMP (15.0.0) with ICC + BOLT (Argobots); lower is better; BOLT is up to 3.16x faster in the oversubscribed configurations]

Page 32:

Summary

• Massive on-node parallelism is inevitable
  – We need runtime systems that exploit such parallelism
• User-level threads (ULTs)
  – Lightweight threads are more suitable for fine-grained dynamic parallelism and computation-communication overlap
• Argobots
  – A lightweight, low-level threading/tasking framework
  – Provides efficient mechanisms, not policies, to users (library developers or compilers), who can build their own solutions
• BOLT: OpenMP over Lightweight Threads
  – More efficient support of nested parallelism with Argobots ULTs and tasklets
  – Preliminary results show that BOLT is promising

Page 33:

Argo Concurrency Team

• Argonne National Laboratory (ANL)
  – Pavan Balaji (co-lead)
  – Sangmin Seo
  – Abdelhalim Amer
  – Pete Beckman (PI)
• University of Illinois at Urbana-Champaign (UIUC)
  – Laxmikant Kale (co-lead)
  – Marc Snir
  – Nitin Kundapur Bhat
• University of Tennessee, Knoxville (UTK)
  – George Bosilca
  – Thomas Herault
  – Damien Genet
• Pacific Northwest National Laboratory (PNNL)
  – Sriram Krishnamoorthy

Past team members: Cyril Bordage (UIUC), Prateek Jindal (UIUC), Jonathan Lifflander (UIUC), Esteban Meneses (University of Pittsburgh), Huiwei Lu (ANL), Yanhua Sun (UIUC)

Page 34:

BOLT Collaborations

• Maintainers
  – Argonne National Laboratory
    • Sangmin Seo
    • Abdelhalim Amer
    • Pavan Balaji
• Contributors
  – Universitat Jaume I de Castelló
    • Adrián Castelló
    • Rafael Mayo
    • Enrique S. Quintana-Ortí
  – Barcelona Supercomputing Center (BSC)
    • Antonio J. Peña
    • Jesus Labarta
  – RIKEN
    • Jinpil Lee
    • Mitsuhisa Sato

Page 35:

Try Argobots & BOLT

• Argobots
  – http://www.mcs.anl.gov/argobots/
  – git repository: https://github.com/pmodels/argobots
  – Wiki: https://github.com/pmodels/argobots/wiki
  – Doxygen: http://www.mcs.anl.gov/~sseo/public/argobots/
• BOLT
  – http://www.mcs.anl.gov/bolt/
  – git repository: https://github.com/pmodels/bolt-runtime

Page 36:

Funding Acknowledgments
• Funding grant providers
• Infrastructure providers

Page 37:

Programming Models and Runtime Systems Group

Group Lead: Pavan Balaji (computer scientist and group lead)

Advisory Board: Pete Beckman (senior scientist), Rusty Lusk (retired, STA), Marc Snir (division director), Rajeev Thakur (deputy director)

Current Staff Members: Abdelhalim Amer (postdoc), Yanfei Guo (postdoc), Rob Latham (developer), Lena Oden (postdoc), Ken Raffenetti (developer), Sangmin Seo (assistant computer scientist), Min Si (postdoc)

Past Staff Members: Antonio Pena (postdoc), Wesley Bland (postdoc), Darius T. Buntinas (developer), James S. Dinan (postdoc), David J. Goodell (developer), Huiwei Lu (postdoc), Min Tian (visiting scholar), Yanjie Wei (visiting scholar), Yuqing Xiong (visiting scholar), Jian Yu (visiting scholar), Junchao Zhang (postdoc), Xiaomin Zhu (visiting scholar)

Current and Recent Students: Ashwin Aji (Ph.D.), Abdelhalim Amer (Ph.D.), Md. Humayun Arafat (Ph.D.), Alex Brooks (Ph.D.), Adrian Castello (Ph.D.), Dazhao Cheng (Ph.D.), Hoang-Vu Dang (Ph.D.), James S. Dinan (Ph.D.), Piotr Fidkowski (Ph.D.), Priyanka Ghosh (Ph.D.), Sayan Ghosh (Ph.D.), Ralf Gunter (B.S.), Jichi Guo (Ph.D.), Yanfei Guo (Ph.D.), Marius Horga (M.S.), John Jenkins (Ph.D.), Feng Ji (Ph.D.), Ping Lai (Ph.D.), Palden Lama (Ph.D.), Yan Li (Ph.D.), Huiwei Lu (Ph.D.), Jintao Meng (Ph.D.), Ganesh Narayanaswamy (M.S.), Qingpeng Niu (Ph.D.), Ziaul Haque Olive (Ph.D.), David Ozog (Ph.D.), Renbo Pang (Ph.D.), Nikela Papadopoulou (Ph.D.), Sreeram Potluri (Ph.D.), Sarunya Pumma (Ph.D.), Li Rao (M.S.), Gopal Santhanaraman (Ph.D.), Thomas Scogland (Ph.D.), Min Si (Ph.D.), Brian Skjerven (Ph.D.), Rajesh Sudarsan (Ph.D.), Lukasz Wesolowski (Ph.D.), Shucai Xiao (Ph.D.), Chaoran Yang (Ph.D.), Boyu Zhang (Ph.D.), Xiuxia Zhang (Ph.D.), Xin Zhao (Ph.D.)

Page 38:

Q&A

• Thank you for your attention!

Questions?