Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin,...

15
Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators Azzam Haidar Yulu Jia Piotr Luszczek Stanimire Tomov Asim YarKhan Jack Dongarra

Transcript of Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin,...

Page 1: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Weighted Dynamic Scheduling with Many Parallelism Grains for

Offloading of Numerical Workloads to Multiple Varied

Accelerators

Azzam Haidar Yulu Jia

Piotr Luszczek Stanimire Tomov

Asim YarKhan Jack Dongarra

Page 2: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Problem Statement: Factorization

11/16/15, Austin, TX, USA ScalA 2015 2

Ax = b PA = LU Ly = P-1b Ux = y

GETF2 TRSM GEMM

Page 3: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Problem Statement: Algorithm

11/16/15, Austin, TX, USA ScalA 2015 3

Ax = b PA = LU Ly = P-1b Ux = y

GETF2 TRSM GEMM

Page 4: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Problem Statement: Mapping to Hardware

11/16/15, Austin, TX, USA ScalA 2015 4

Ax = b PA = LU Ly = P-1b Ux = y

GETF2 TRSM GEMM

Page 5: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

From Code to DAG

11/16/15, Austin, TX, USA ScalA 2015 5

GETRF(A[1:n,1:n]){fori=1,nb,2*nb,…{ GETF2(A[i:n,i:i+nb]) TRSM(A[i:i+nb,i:i+nb],A[i:i+nb,i:n]) GEMM(A[i:n,i:i+nb],A[i:i+nb,i:n],A[i+nb:n,i+nb:n])}

}

GETF2(A[1,1],A[2,1],…)TRSM(A[1,1],A[1,2])TRSM(A[1,1],A[1,3])GEMM(A[2,1],A[1,2],A[2,2])

Startwithcanonicalcode:

Unwindthefunc:oncalls: Trackdependences:

Thefunc:oncallsandtheirdependencesformaDAG.

Page 6: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Matrix-View of Dependences

11/16/15, Austin, TX, USA ScalA 2015 6

GPU

XeonPhi

Page 7: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

From Code to DAG

11/16/15, Austin, TX, USA ScalA 2015 7

Page 8: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Asynchronous Algorithm

11/16/15, Austin, TX, USA ScalA 2015 8

Panelfactoriza:on

Pivo:ng(rowswaps)

Triangularsolve

Matrixmul:ply(Schurcomplement)

Scheduletasksaccordingto:-datasizes-hardwareweights-affinity-performlookahead

Page 9: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Main Features of the Scheduler

•  Resource capability weight for the task – w(kernel, device)

•  Adaptive scheduling with weights •  Dynamic lookahead •  Enhanced task priorities – Regulates lookahead (DAG exploration order)

•  Transparent data movement – Uses (and tracks) asynchronous data transfers

•  Dynamic data redistribution – Data is moved as needed and only if needed

11/16/15, Austin, TX, USA ScalA 2015 9

Page 10: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Runtime Scheduler Loop

11/16/15, Austin, TX, USA ScalA 2015 10

defmain_thread_loop(user_code,queues,threshold):#ifthereareenoughtasksforcoresifqueues.total_length()>threshold: #resumeuser’scodeforsubmittingtasks task=user_code.get_next_task() q=queues.find_closes_queue(task.devices()) q.insert(task,task.priority())else: task=queues.steal_task(main_cpu) #iftasksavailableforstealing ifnottask.is_empty(): task.execute()#executesingletask else: queues.wait_for_tasks()

Page 11: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Effect of Dynamic Scheduling

11/16/15, Austin, TX, USA ScalA 2015 11

Page 12: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Performance: Kepler K20

11/16/15, Austin, TX, USA ScalA 2015 12

Page 13: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Performance: Xeon Phi

11/16/15, Austin, TX, USA ScalA 2015 13

Page 14: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Performance: Xeon Phi + Kepler K40

11/16/15, Austin, TX, USA ScalA 2015 14

Page 15: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if

Summary of Contributions

•  Fine- and coarse-grained tasks for scheduling •  Capability weights for hardware description •  Unified scheduling across CPUs, GPUs, and

coprocessors •  Synchronous memory-transfer model with

transparent asynchronous progress

11/16/15, Austin, TX, USA ScalA 2015 15