Transcript of "Scalable Dynamic DAG Building for Data-flow Task-based Runtime Systems"
Reazul Hoque
9th JLESC Workshop, April 15, 2019, Knoxville, TN, USA
Need for a Runtime
• Parallel programming is complex
• Application developers need to:
  • Express a large degree of parallelism
  • Consider data placement and movement (distributed)
  • Manage heterogeneity
  • Tackle communication bottlenecks
  • Manage various memory types
PaRSEC
• Parallel Runtime Scheduling and Execution Controller
• Executes a graph of tasks generated from a dataflow representation of a program (high-level runtime)
• Task-based and stateful (re-execution, delay, migration, stealing, …)
• Supports heterogeneous architectures and distributed-memory machines
• Communications are implicit and overlapped
• How tasks are presented to the runtime is critical in many respects
Motivation
• Building the DAG dynamically offers flexibility
• A priori information about task dependencies is not required
• Analysis of the overhead of the dynamic and static approaches to DAG building
• Promise of composition of multiple interfaces (static and dynamic)
Dynamic Task Discovery (DTD)
• The task graph (DAG) is built dynamically at runtime (similar to OpenMP, but using an API)
• Easy to express algorithms (API-based)
• Allows users to write sequential code
• Shares the entire software infrastructure with other DSLs/interfaces (such as PTG):
  • Schedulers
  • Communication engine
  • Data interface
  • Heterogeneous and architecture-aware support
[Figure: a 3x3 tile matrix (tiles 0,0 through 2,2) and the corresponding Cholesky DAG of ten tasks, numbered 0-9.]
Cholesky Factorization (3x3)

for( k = 0; k < total; k++ ) {
    parsec_insert_task( "Potrf", TILE_OF(A, k, k), INOUT | AFFINITY );
    for( m = k+1; m < total; m++ ) {
        parsec_insert_task( "Trsm",
                            TILE_OF(A, k, k), INPUT,
                            TILE_OF(A, m, k), INOUT | AFFINITY );
    }
    for( m = k+1; m < total; m++ ) {
        parsec_insert_task( "Herk",
                            TILE_OF(A, m, k), INPUT,
                            TILE_OF(A, m, m), INOUT | AFFINITY );
        for( n = m+1; n < total; n++ ) {
            parsec_insert_task( "Gemm",
                                TILE_OF(A, n, k), INPUT,
                                TILE_OF(A, m, k), INPUT,
                                TILE_OF(A, n, m), INOUT | AFFINITY );
        }
    }
}
[The same insertion loop is repeated across several slides as the DAG is built step by step: inserting 1 x POTRF, then 2 x TRSM, then 2 x HERK + 1 x GEMM, with the new nodes highlighted on the 3x3 tile grid and the growing DAG.]
Real code (slightly trimmed):

for( k = 0; k < total; k++ ) {
    parsec_insert_task( parsec_dtd_handle, parsec_core_potrf, priority, "Potrf",
                        sizeof(int), &uplo, VALUE,
                        ...
                        PASSED_BY_REF, TILE_OF(A, k, k, 0), INOUT | REGION_FULL | AFFINITY,
                        0 );
    for( m = k+1; m < total; m++ ) {
        parsec_insert_task( parsec_dtd_handle, parsec_core_trsm, priority, "Trsm",
                            ...
                            PASSED_BY_REF, TILE_OF(A, k, k, 0), INPUT | REGION_FULL,
                            PASSED_BY_REF, TILE_OF(A, m, k, 0), INOUT | REGION_FULL | AFFINITY,
                            0 );
    }
    for( m = k+1; m < total; m++ ) {
        parsec_insert_task( parsec_dtd_handle, parsec_core_herk, priority, "Herk",
                            ...
                            PASSED_BY_REF, TILE_OF(A, m, k, 0), INPUT | REGION_FULL,
                            PASSED_BY_REF, TILE_OF(A, m, m, 0), INOUT | REGION_FULL | AFFINITY,
                            0 );
        for( n = m+1; n < total; n++ ) {
            parsec_insert_task( parsec_dtd_handle, parsec_core_gemm, priority, "Gemm",
                                ...
                                PASSED_BY_REF, TILE_OF(A, n, k, 0), INPUT | REGION_FULL,
                                PASSED_BY_REF, TILE_OF(A, m, k, 0), INPUT | REGION_FULL,
                                PASSED_BY_REF, TILE_OF(A, n, m, 0), INOUT | REGION_FULL | AFFINITY,
                                0 );
        }
    }
}
[Figure: QR Factorization on Stampede, 2048 cores (128 nodes), tile size 320, block size 64. Gflops vs. matrix size N (40000-280000) for PaRSEC-PTG, PaRSEC-DTD, and ScaLAPACK; higher is better. Annotations: 2.3x faster, 1.8x faster, 1.3x faster.]
• More complicated communication pattern
• Less opportunity for an explosion in parallelism
• More complex dependencies
[Figure: Weak Scaling - DPOTRF and DGEQRF, (20k x 20k) per node, tile size 320, up to 2304 cores (1-144 nodes). Gflops/node for DGEQRF and DPOTRF with PaRSEC-PTG and PaRSEC-DTD, against the practical peak of DGEMM; higher is better. DTD is within about 5% of PTG and reaches ~75% of practical peak.]
Overhead - DTD

Overall execution time for the DTD paradigm (lower bound): every process must discover all N tasks and build their relationships sequentially, while the execution of the N tasks is shared by all P * n cores, so

    T_DTD >= max( N * (C_D + C_R), (N * C_T) / (P * n) )

where:
• T_DTD: overall time
• N: total number of tasks
• C_T: cost/duration of each task
• P: total number of nodes/processes
• n: total number of cores
• C_D: cost of discovering a task
• C_R: cost of building the DAG/relationship

[Figure: the task window over time, from Start to End: tasks 0-9 of the 3x3 Cholesky DAG are discovered and retired in order as the window slides.]
Overhead - DTD

[Slide: the cost model shown as formulas for both DTD and PTG. T_DTD/PTG: overall time; N: total number of tasks; C_T: cost/duration of each task; P: total number of nodes/processes; n: total number of cores; C_D: cost of discovering a task; C_R: cost of building the DAG/relationship.]
DTD overhead

[Figure: Cholesky (double) on Haswell, 8 nodes, 20 cores, tile size 320. Gflops vs. size N / number of tasks in millions (10k/0.03 up to 80k/15.6) for PaRSEC-DTD and PaRSEC-PTG; higher is better.]
[Figure: Cholesky (double) on Haswell, 8 nodes, 20 cores, tile size 128. Gflops vs. size N / number of tasks in millions (10k/0.5 up to 80k/244.1) for PaRSEC-DTD and PaRSEC-PTG; higher is better.]
[Figure: Cholesky (double) on Haswell, 8 nodes, 20 cores, tile size 128, with an additional "PaRSEC-DTD, 128, Execution Only" curve alongside PaRSEC-DTD and PaRSEC-PTG; higher is better. Annotation: 4%.]
We are trying to see the upper bound of the best we can do:

Initial Experiment
• Chain of tasks
• Partitioned per process/rank
• No remote dependency
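A minimal sketch of this chain experiment in the insertion style of the earlier listings; the fragment is illustrative (the task name "Chain", the loop bound ntasks, and the per-rank tile choice are assumptions, not code from the talk):

```c
for( i = 0; i < ntasks; i++ ) {
    /* Every task takes the same tile INOUT, so task i can only start
       after task i-1 completes: a pure chain. With one such tile per
       rank and AFFINITY pinning tasks to the tile's owner, there are
       no remote dependencies. */
    parsec_insert_task( "Chain", TILE_OF(A, myrank, myrank), INOUT | AFFINITY );
}
```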
Proposed Solution (Future Work)
• Distributed Dynamic Task Discovery
• Owner Tracks Future:
  • Each rank will discover all local tasks (tasks that the rank will execute)
  • Each rank will also discover all remote tasks that use data residing on it
  • The (temporary) owner rank will be responsible for tracking the future of a data item
  • The correct ordering will be coordinated by the current owner of each data item
[Figure: data placement across Rank 0, Rank 1, and Rank 2, contrasting the current scheme with the proposed "Owner Tracks Future" scheme, in which designated owner ranks coordinate the use of their data.]
Overhead

[Slide: the overhead formulas revisited. T_DTD/PTG: overall time; N: total number of tasks; C_T: cost/duration of each task; P: total number of nodes/processes; n: total number of cores; C_D: cost of discovering a task; C_R: cost of building the DAG/relationship.]
Open Question
• The simple paradigm is not scalable
• What could be an alternate interface that allows easy expression and is also scalable?