
Page 1: Investigating Memory Operations Performance in the StarPU ...

Investigating Memory Operations

Performance in the StarPU Runtime

Lucas Leandro Nesi, Lucas Mello Schnorr

September 5, 2018

Graduate Program in Computer Science (PPGC/UFRGS), Porto Alegre, Brazil

Page 2: Investigating Memory Operations Performance in the StarPU ...

Summary

Introduction

Basic Concepts

Analyzing Data Management

Experiments

Conclusion

1

Page 3: Investigating Memory Operations Performance in the StarPU ...

Introduction

• Programming for HPC is complex
• Using classical paradigms, the programmer has to:
  • Map computation to resources
  • Manage the data and communication
• The task-based paradigm facilitates programming by abstracting those responsibilities into a runtime
• Runtime example: StarPU

2

Page 4: Investigating Memory Operations Performance in the StarPU ...

Problem

• As with any other HPC approach, performance analysis of task-based applications is laborious
• Some tools for performance analysis of task-based applications are available, with different features
  • Example: StarVZ
• However, those tools lack features for analyzing the memory management of the runtime and the application

3

Page 5: Investigating Memory Operations Performance in the StarPU ...

Objectives

Create mechanisms for data management performance analysis of

task-based applications running over the StarPU runtime

4

Page 6: Investigating Memory Operations Performance in the StarPU ...

Basic Concepts

5

Page 7: Investigating Memory Operations Performance in the StarPU ...

Task-Based Programming Paradigm

• Applications consist of a set of tasks
• The tasks are organized as a DAG (Directed Acyclic Graph)
• The graph links are memory dependencies between tasks
• The applications run over a runtime

[Figure: example application DAG linking tasks A, B, and C]

6

Page 8: Investigating Memory Operations Performance in the StarPU ...

Task-Based Programming Paradigm

• The runtime creates workers for the system resources
• Tasks are submitted by the application to the runtime
• The runtime schedules the tasks to workers based on scheduler heuristics; examples are:
  • LWS (local work stealing)
  • DM (deque model)

[Figure: the runtime scheduler mapping Workers 1-4 onto resources CPU 1, CPU 2, GPU 1, and GPU 2]

7

Page 9: Investigating Memory Operations Performance in the StarPU ...

StarPU - Memory Blocks

• Data is organized into memory blocks
• Example: a matrix is divided into memory blocks

[Figure: a matrix partitioned into memory blocks, each identified by its tile coordinates, e.g. (0,0), (0,1), (1,1)]

8

Page 10: Investigating Memory Operations Performance in the StarPU ...

Data Management

• Each resource memory has an entity called the memory manager
• The memory block is the minimal runtime memory unit:
  • Tasks use memory blocks as inputs
  • The DAG dependencies are reuses of memory blocks
  • Blocks are transferred between resources
• Coherence between resources: the MSI system:
  • Three possible states for each block: Modified, Shared, Invalid
  • Each memory block has one state per resource memory

9

Page 11: Investigating Memory Operations Performance in the StarPU ...

Application example: Chameleon/MORSE

• A solver for dense linear algebra using the task-based paradigm
• One of the possible operations: Cholesky factorization
  • Uses a triangular matrix
  • Four tasks: dpotrf, dsyrk, dtrsm, dgemm
  • Computes from lower memory-block coordinates to higher ones

10

Page 12: Investigating Memory Operations Performance in the StarPU ...

Analyzing Data Management

11

Page 13: Investigating Memory Operations Performance in the StarPU ...

Memory Nodes States

• Uses the traditional Gantt chart for the memory managers' states
• X is time in ms and Y is the memory managers

[Figure: Gantt chart of memory states (Allocating, Allocating Async, WritingBack) for memory managers MM1 (74.78%) and MM2 (75.03%) over 0-100000 ms]

12

Page 14: Investigating Memory Operations Performance in the StarPU ...

Memory Nodes States - Zoom

� If zoomed it shows the coordinates of the states' related to

memory blocks

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

9 15

9 10

98.22%

95.39%

MM1

MM2

50000 50010 50020 50030 50040 50050Time [ms]

Mem

Nod

es

Allocating Allocating Async WritingBack

13

Page 15: Investigating Memory Operations Performance in the StarPU ...

Memory Block Residency

� Show the residency of a memory block on the memory

managers during the execution

� X is the time divided into intervals, and Y is the time that the

memory block was present on the memory manager

[Figure: per-interval presence ("Time present [%]") of a memory block over time steps 0-186 s for MEMMANAGER0, MEMMANAGER1, and MEMMANAGER2]

14

Page 16: Investigating Memory Operations Performance in the StarPU ...

Experiments

15

Page 17: Investigating Memory Operations Performance in the StarPU ...

Experiments

• Test case: Chameleon/Morse Cholesky factorization
  • Block size: 960x960
  • 60x60 tiles
  • DMDA scheduler
  • Machine: tupi
• Preliminary tests suggested performance problems
  • Unknown reasons
• Using the earlier methodology to investigate

16

Page 18: Investigating Memory Operations Performance in the StarPU ...

Application Workers

• GPUs have high idle times after the used memory reaches a plateau

[Figure: application workers Gantt chart (CPU0-CPU4, CUDA0_0, CUDA1_0) of tasks dgemm, dpotrf, dsyrk, dtrsm, with idle times of roughly 16-18% on the CPUs and 32-33% on the GPUs, above a plot of used memory (MB) climbing to a plateau near 10000 MB over 0-100000 ms]

17

Page 19: Investigating Memory Operations Performance in the StarPU ...

Memory Managers

• The GPUs' memory managers show many allocation states

[Figure: memory manager states for MM1 (74.78%) and MM2 (75.03%) (legend: Allocating, Allocating Async, WritingBack), with used memory (MB) up to 10000 over 0-100000 ms]

18

Page 20: Investigating Memory Operations Performance in the StarPU ...

Zoom Memory Nodes

• Repeated allocations for the same memory blocks
• Inspecting the StarPU code: this could happen if the allocations fail

[Figure: zoomed view (50000-50050 ms) of MM1 (98.22%) and MM2 (95.39%) showing repeated allocation states for the same memory-block coordinates, e.g. (9,10) and (9,15); legend: Allocating, Allocating Async, WritingBack]

19

Page 21: Investigating Memory Operations Performance in the StarPU ...

Residency of Blocks

• Memory blocks remain in the resources' memory even if they are not needed anymore

[Figure: presence [%] of several memory blocks (panels 8-13) over time steps 0-120 s on memory managers MM0, MM1, and MM2]

20

Page 22: Investigating Memory Operations Performance in the StarPU ...

Code Inspection

At this point:

• We think that the allocations are failing because the memory is full
  • nvidia-smi confirms it
• However, if StarPU identified that the memory was full, it would free unused blocks
  • So StarPU believes it has free memory!
• We inspected the execution using gdb and identified a disparity between StarPU's internal GPU used-memory values and the real ones

21

Page 23: Investigating Memory Operations Performance in the StarPU ...

Problem Cause

• StarPU internally registers the used memory as the values passed to cudaMalloc
• However, cudaMalloc can reserve more memory than requested
  • Allocations are rounded up to the page size of the device; in our case: 2MB
  • This causes a deviation of 1.8GB for the matrix
• We propose a patch to correct it

22

Page 24: Investigating Memory Operations Performance in the StarPU ...

Patch Experiments

• 10 executions comparing the original and the corrected version
• Using the earlier Cholesky parameters:
  • Block size: 960
  • tupi machine
  • DMDA scheduler
• Varying the size of the matrix
• 99% confidence intervals

23

Page 25: Investigating Memory Operations Performance in the StarPU ...

Patch Results

[Figure: application performance (GFlops, 0-700) of the Original and Corrected versions for input matrix widths of 29000 to 68000 cells; a marker indicates the maximum matrix size that fits in GPU memory]

24

Page 26: Investigating Memory Operations Performance in the StarPU ...

Conclusion

25

Page 27: Investigating Memory Operations Performance in the StarPU ...

Conclusion

• The memory manager plays an important role in StarPU applications
• The proposed visualizations give extra information on the memory management decisions
  • Application independent
• We solved a runtime problem that was discovered using the new features
  • Performance gains of 66%
  • Application performance is sustained independent of the matrix size

26

Page 28: Investigating Memory Operations Performance in the StarPU ...

Investigating Memory Operations

Performance in the StarPU Runtime

Lucas Leandro Nesi, Lucas Mello Schnorr

September 5, 2018

Graduate Program in Computer Science (PPGC/UFRGS), Porto Alegre, Brazil