Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model...

32
Implementing Dense Linear Algebra Implementing Dense Linear Algebra Algorithms on Multi-Core Algorithms on Multi-Core Processors Using Dataflow Processors Using Dataflow Execution Model Execution Model Jakub Kurzak Jack Dongarra University of Tennessee John Shalf Lawrence Berkeley National Laboratory SIAM Parallel Processing for Scientific Computing MS18: Dataflow 2.0: The Re-emergence and Evolution of Dataflow Programming Techniques for Parallel Computing March 13, 2008

Transcript of Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model...

Page 1: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Implementing Dense Linear Algebra Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Algorithms on Multi-Core Processors

Using Dataflow Execution ModelUsing Dataflow Execution Model

Jakub KurzakJack Dongarra

University of Tennessee

John ShalfLawrence Berkeley National Laboratory

SIAM Parallel Processing for Scientific ComputingMS18: Dataflow 2.0: The Re-emergence and Evolution

of Dataflow Programming Techniques for Parallel ComputingMarch 13, 2008

Page 2: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

New Class of AlgorithmsNew Class of Algorithms

LU, QR, Cholesky, one-sided, factorizations Inspired by out-of-memory algorithmsProcessing of input by small tilesBlock data layoutFine granularitySuitable for dynamic schedulingSuitable for dynamic scheduling

LAPACK Working Note 190LAPACK Working Note 191 LAPACK Working Note 184

Page 3: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

New Class of AlgorithmsNew Class of Algorithms

Classic multi-core (x86)Faster than MKLFaster than ACMLWay faster than LAPACK

Exotic multi-core (CELL)Extremely fastNo alternatives

Page 4: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

New Algorithms – LU on x86New Algorithms – LU on x86

Page 5: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

New Algorithms – QR on x86New Algorithms – QR on x86

Page 6: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

New Algorithms – Cholesky on x86New Algorithms – Cholesky on x86

Page 7: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

New Algorithms – Cholesky on the CELLNew Algorithms – Cholesky on the CELL

Page 8: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Going “Out of Sync”Going “Out of Sync”

CELL Cholesky – 8 cores

CELL Cholesky – 16 cores

Page 9: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

PLASMA FrameworkPLASMA Framework

Don't reinvent the wheelSCHEDULEHeNCEParametrized DAGs (Cosnard, Jeannot)Dataflow Systems (Simulink, Gedae)Systolic Arrays

Page 10: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

PLASMA Goals PLASMA Goals (WHAT)(WHAT)

ScalabilityMassive parallelism

PerformanceEfficient runtime

PortabilityCluster / MPPSMP / CMPOut-of-memory / CELL

Page 11: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

PLASMA Principles PLASMA Principles (HOW)(HOW)

Data-driven (dynamic) execution Explicit expression of parallelism Implicit communication Component architecture

“Language” Runtime Tools

Dataflow visualization DAG visualization Profiling Tracing ..... There is no such thing as autoparallelization.

BTW,

There is no such thing as auto SIMD'zation.

(see LAPACK Working Note 189)

Page 12: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Current InfrastructureCurrent Infrastructure

MPIMPI

ScaLAPACScaLAPACKK

LAPACKLAPACKPBLASPBLAS

BLASBLAS

BLACSBLACS

Page 13: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Current InfrastructureCurrent Infrastructure

MPIMPI

LAPACKLAPACKPBLASPBLAS

BLASBLAS

BLACSBLACS

ScaLAPACScaLAPACKK

Page 14: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Target InfrastructureTarget Infrastructure

TRACTRACEE

VIVISS PLASMAPLASMA

““LanguageLanguage””

μ-kernelsμ-kernels(“μ(“μBLAS”)BLAS”)

RDMA-likeRDMA-like(“μMPI(“μMPI”)”)

PROFPROFPLASMAPLASMARuntimeRuntime

Page 15: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

DAG / Task GraphDAG / Task Graph

view computation as a DAG

use dependency / data driven scheduling

pursue the critical path

Page 16: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

DAGs can be hugeDifficult / impossible to storeDifficult / impossible to manage

I don't need to build the DAGI need the ability to traverse the DAG

DAG / Task GraphDAG / Task Graph

Page 17: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Producer Taskknows where its output goes

Consumer Taskknows where its input comes from

Nobody knows the DAG

Parents know their children

Children know their parents

DAG / Task GraphDAG / Task Graph

Page 18: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

P

P

P

P

T

T

T

T

T T

0

1

2

3

0,00,0 0,1 0,2

1,0 1,1

2,0

0

1

2

3

0, 1, 2, 3

Task ID / Task CoordinatesTask ID / Task Coordinates

Page 19: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

P

P

P

P

T

T

T

T

T T

0

1

2

3

0,00,0 0,1 0,2

1,0 1,1

2,0

// Task NameP

// Execution Spacei = 0 ... N

// Parallel Partitioningi % Ncores = coreID

// Parametersinput T ( i – 1, 0 )output T ( i, 0 ... i )

Task DefinitionTask Definition

Page 20: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

P

P

P

P

T

T

T

T

T T

0

1

2

3

0,00,0 0,1 0,2

1,0 1,1

2,0

// Task NameT

// Execution Spacei = 0 ... N – 1j = 0 ... N – 1 – i

// Parallel Partitioningi % Ncores = coreID

// Parametersinput 1 P ( i )input 2 T ( i – 1, j + 1 )output T ( i + 1, j – 1 )

Task DefinitionTask Definition

Page 21: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

// NamePOTRF(k)

// Execution spacek = 0..SIZE

// Parallel partitioningk % GRIDrows == rowRANKk % GRIDcols == colRANK

// ParametersINOUT T <- (k == 0) ? IN(k, k) : T SYRK(k, k) -> T TRSM(k, k+1..SIZE) -> OUT(k, k)

// Bodylapack_spotrf(lapack_lower, NB, T, NB, &info);

// NameSYRK(k, n)

// Execution spacek = 0..SIZEn = 0..k-1

// Parallel partitioningk % GRIDrows == rowRANKk % GRIDcols == colRANK

// ParametersIN A <- C TRSM(n, k)INOUT T <- (n == 0) ? IN(k, k) : T SYRK(k, n-1) -> (n == k-1) ? T POTRF(k) : T SYRK(k, n+1)

// Bodycblas_ssyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB, 1.0, A, NB, 1.0, T, NB);

// NameTRSM(k, m)

// Execution spacek = 0..SIZEm = k+1..SIZE

// Parallel partitioningm % GRIDrows == rowRANKk % GRIDcols == colRANK

// ParametersIN T <- T POTRF(k)INOUT C <- (k == 0) ? IN(m, k) : C GEMM(k, m, k-1) -> A GEMM(m, m+1..SIZE, k) -> B GEMM(k+1..m-1, m, k) -> A SYRK(m, k) -> OUT(k, m)

// Bodycblas_strsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit, NB, NB, 1.0, T, NB, B, NB);

// NameGEMM(k, m, n)

// Execuiton spacek = 0..SIZEm = k+1..SIZEn = 0..k-1

// Parallel partitioningm % GRIDrows == rowRANKk % GRIDcols == colRANK

// ParametersIN A <- C TRSM(n, k)IN B <- C TRSM(n, m)INOUT C <- (n == 0) ? IN(m, n) : C GEMM(k, m, n-1) -> (n == k-1) ? C TRSM(k, m) : C GEMM(k, m, n+1)

// Bodycblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB, 1.0, B, NB, A, NB, 1.0, C, NB);

// GlobalsGRIDrows, GRIDcols, NB

Page 22: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

depreq

datadata data

depdep req

req

execute

notify

notifynotify

Task Life-CycleTask Life-Cycle

Page 23: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

data

requ

est

POTRF

GEMM

TRSM

res

pons

e

response

data

request

POTRFcompletionnotifications

GEMMcompletionnotifications

TRSMdataavailable

TRSMdependencies satisfied

TRSMcompletion notifications

Task Life-CycleTask Life-Cycle

Page 24: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

no dependencies satisfied

some dependencies satisfied waiting for all dependencies

all dependencies satisfied some data delivered waiting for all data

all data delivered waiting for execution

Task Life-CycleTask Life-Cycle

Page 25: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Existing compiler technologytask body – standard C / Fortranformulas – C / Fortran expressionsmetadata – human readable ASCII

CompilationCompilation

Page 26: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Schedulermanagement of task bins

Communication systemschared mem – pointer aliasingdistrib mem – messagingout-of-memory – mix

RuntimeRuntime

Page 27: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Standalone ASCII codeLittle / No error checking

Visual tools to assist developmentplot dataflow diagraminstantiate small DAG

Visual ToolsVisual Tools

Page 28: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

POTRF

T

TRSM

T

C

SYRK

A

T

GEMM

B

C

A

OUTOUTININ

Dataflow VisualizationDataflow Visualization

Page 29: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Task Graph VisualizationTask Graph Visualization

Tile LU Tile QR

Tile Cholesky

Page 30: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Connectivity / TopologyConnectivity / Topology

4D cubequad-core

Page 31: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Connectivity / TopologyConnectivity / Topology

communication matrix should resemble connectivity matrix

Page 32: Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using Dataflow Execution Model Jakub Kurzak Jack Dongarra University of Tennessee.

Thank You !Thank You !

Questions ?Questions ?