Investigating Memory Operations Performance in the StarPU Runtime
Lucas Leandro Nesi, Lucas Mello Schnorr
September 5, 2018
Graduate Program in Computer Science (PPGC/UFRGS), Porto Alegre, Brazil
Summary
Introduction
Basic Concepts
Analyzing Data Management
Experiments
Conclusion
Introduction
• Programming for HPC is complex
• Using classical paradigms, the programmer has to:
  • Map the computation to resources
  • Manage the data and communication
• The task-based paradigm facilitates programming by abstracting those responsibilities to a runtime
• Runtime example: StarPU
Problem
• As with any other HPC approach, performance analysis of task-based applications is laborious
• Some tools for the performance analysis of task-based applications are available, with different features
  • Example: StarVZ
• However, those tools lack features for analyzing the memory management of the runtime and the application
Objectives
Create mechanisms for data management performance analysis of task-based applications running over the StarPU runtime.
Basic Concepts
Task-Based Programming Paradigm
• The application consists of a set of tasks
• The tasks are organized in a DAG (Directed Acyclic Graph)
• The graph edges are memory dependencies between tasks
• The application runs over a runtime
[Figure: an example application DAG with tasks A, B, and C linked by dependency edges]
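In StarPU, such a DAG is not built by hand: the application submits tasks against shared data handles, and the runtime derives the edges from the declared access modes. A minimal sketch, assuming a hypothetical scale_cpu kernel and codelet (the StarPU calls themselves are the library's public API):

    #include <starpu.h>

    /* Hypothetical kernel: doubles a variable in place. */
    static void scale_cpu(void *buffers[], void *cl_arg)
    {
        (void)cl_arg;
        double *v = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
        *v *= 2.0;
    }

    static struct starpu_codelet scale_cl = {
        .cpu_funcs = { scale_cpu },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        double x = 1.0;
        starpu_data_handle_t h;

        if (starpu_init(NULL) != 0) return 1;
        starpu_variable_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)&x, sizeof(x));

        /* Both tasks access the same handle in read-write mode, so StarPU
         * inserts a DAG edge between them and runs them in order. */
        starpu_task_insert(&scale_cl, STARPU_RW, h, 0);
        starpu_task_insert(&scale_cl, STARPU_RW, h, 0);

        starpu_task_wait_for_all();
        starpu_data_unregister(h);
        starpu_shutdown();
        return 0; /* x is now 4.0 */
    }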
Task-Based Programming Paradigm
• The runtime creates workers for the system resources
• The application submits tasks to the runtime
• The runtime schedules the tasks to workers based on scheduler heuristics; examples are:
  • LWS (local work stealing)
  • DM (deque model)
[Figure: runtime architecture; a scheduler dispatches tasks to Workers 1-4, which drive the resources CPU 1, CPU 2, GPU 1, and GPU 2]
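The scheduling policy is chosen when the runtime starts, typically through the STARPU_SCHED environment variable; a small sketch setting it programmatically (it is normally exported in the shell before launching):

    #include <stdlib.h>
    #include <starpu.h>

    int main(void)
    {
        /* "lws" and "dm" are policy names shipped with StarPU; the
         * variable is read once by starpu_init(). */
        setenv("STARPU_SCHED", "lws", 1);

        if (starpu_init(NULL) != 0) return 1;
        /* ... submit tasks; the chosen policy maps them to workers ... */
        starpu_shutdown();
        return 0;
    }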
StarPU - Memory Blocks
• Data is organized into memory blocks
• Example: the matrix is divided into memory blocks
[Figure: a matrix split into tiles, each labeled with its block coordinates, e.g. (0,0), (0,1), (1,1)]
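In StarPU this division is expressed by registering the matrix as a data handle and partitioning it with filters; a sketch assuming a square matrix of doubles (the sizes and tile counts here are illustrative):

    #include <starpu.h>

    #define N  1920  /* matrix width */
    #define NT 2     /* 2x2 tiles of 960x960 each */

    static double A[N * N];

    int main(void)
    {
        starpu_data_handle_t h;

        if (starpu_init(NULL) != 0) return 1;
        starpu_matrix_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)A,
                                    N /* ld */, N /* nx */, N /* ny */,
                                    sizeof(double));

        /* Cut the matrix along both dimensions: NT x NT memory blocks. */
        struct starpu_data_filter vert = {
            .filter_func = starpu_matrix_filter_vertical_block,
            .nchildren   = NT,
        };
        struct starpu_data_filter horiz = {
            .filter_func = starpu_matrix_filter_block,
            .nchildren   = NT,
        };
        starpu_data_map_filters(h, 2, &vert, &horiz);

        /* Each block is now addressable by its coordinates. */
        starpu_data_handle_t blk01 = starpu_data_get_sub_data(h, 2, 0, 1);
        (void)blk01;

        starpu_data_unpartition(h, STARPU_MAIN_RAM);
        starpu_data_unregister(h);
        starpu_shutdown();
        return 0;
    }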
Data Management
• Each resource memory has an entity called the memory manager
• The memory block is the minimal runtime memory unit:
  • Tasks use memory blocks as inputs and outputs
  • The DAG dependencies come from the reuse of memory blocks
  • Blocks are transferred between resources
• Coherence between resources uses an MSI system:
  • Three possible states for each block: Modified, Shared, Invalid
  • Each memory block has one state per resource memory
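A simplified model of that bookkeeping, purely illustrative (StarPU's real internal structures are more involved):

    #include <stddef.h>

    enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };

    #define MAX_MEMORY_NODES 8  /* main RAM plus accelerator memories */

    struct memory_block {
        size_t size;
        /* One coherence state per memory node. */
        enum msi_state state[MAX_MEMORY_NODES];
    };

    /* A task reading the block on node `dst` triggers a transfer whenever
     * the local replica is Invalid; a valid copy is fetched from a node
     * holding the block in Modified or Shared state. */
    static int needs_transfer(const struct memory_block *b, unsigned dst)
    {
        return b->state[dst] == MSI_INVALID;
    }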
Application example: Chameleon/MORSE
• A solver for dense linear algebra using the task-based paradigm
• One of the possible operations: Cholesky factorization
  • Uses a triangular matrix
  • Four tasks: dpotrf, dsyrk, dtrsm, dgemm
  • Computes from lower memory block coordinates to higher ones (see the sketch below)
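The submission loop below sketches how the four tasks sweep the tiles from low to high coordinates; the codelets dpotrf_cl, dtrsm_cl, dsyrk_cl, dgemm_cl and the tile-handle helper A(m, n) are hypothetical names assumed to be defined elsewhere:

    #include <starpu.h>

    /* Assumed to exist: one codelet per kernel and a helper returning the
     * handle of tile (m, n) of the lower-triangular matrix. */
    extern struct starpu_codelet dpotrf_cl, dtrsm_cl, dsyrk_cl, dgemm_cl;
    extern starpu_data_handle_t A(int m, int n);

    void cholesky_submit(int NT)
    {
        for (int k = 0; k < NT; k++) {
            starpu_task_insert(&dpotrf_cl, STARPU_RW, A(k, k), 0);
            for (int m = k + 1; m < NT; m++)
                starpu_task_insert(&dtrsm_cl,
                                   STARPU_R, A(k, k), STARPU_RW, A(m, k), 0);
            for (int m = k + 1; m < NT; m++) {
                starpu_task_insert(&dsyrk_cl,
                                   STARPU_R, A(m, k), STARPU_RW, A(m, m), 0);
                for (int n = k + 1; n < m; n++)
                    starpu_task_insert(&dgemm_cl,
                                       STARPU_R, A(m, k), STARPU_R, A(n, k),
                                       STARPU_RW, A(m, n), 0);
            }
        }
        /* No explicit DAG: the R/RW modes on the tile handles encode it. */
        starpu_task_wait_for_all();
    }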
Analyzing Data Management
Memory Nodes States
• Uses the traditional Gantt chart for the memory manager states
• X is the time in ms and Y lists the memory managers
[Figure: Gantt chart of the states Allocating, Allocating Async, and WritingBack for memory managers MM1 (74.78%) and MM2 (75.03%) over 0-100000 ms]
Memory Nodes States - Zoom
• When zoomed in, it shows the coordinates of the memory blocks related to each state
[Figure: zoom between 50000 and 50050 ms; each state on MM1 (98.22%) and MM2 (95.39%) is annotated with the coordinates of its memory block, e.g. (9,10) and (9,15)]
Memory Block Residency
• Shows the residency of a memory block on the memory managers during the execution
• X is the time divided into intervals, and Y is the share of time that the memory block was present on each memory manager
[Figure: time present [%] of one memory block on MEMMANAGER0, MEMMANAGER1, and MEMMANAGER2 across time steps 0-186 s]
Experiments
Experiments
• Test case: Chameleon/MORSE Cholesky factorization
  • Block size: 960x960
  • 60x60 tiles
  • DMDA scheduler
  • Machine: tupi
• Preliminary tests suggested performance problems
  • For unknown reasons
• Using the earlier methodology to investigate
Application Workers
• The GPUs have high idle times after the used memory reaches a plateau
[Figure: per-worker Gantt chart (CPU0-CPU4, CUDA0_0, CUDA1_0) of the tasks dgemm, dpotrf, dsyrk, and dtrsm with per-worker idle percentages, aligned with the used memory (MB) over 0-100000 ms]
Memory Managers
• The GPUs' memory managers show many allocation states
[Figure: memory manager states (Allocating, Allocating Async, WritingBack) for MM1 (74.78%) and MM2 (75.03%), aligned with the used memory (MB) over 0-100000 ms]
Zoom Memory Nodes
• Repeated allocations for the same memory blocks
• Inspecting the StarPU code: this can happen when allocations fail
[Figure: zoom between 50000 and 50050 ms showing repeated Allocating states for the same block coordinates, e.g. (9,10) and (9,15), on MM1 (98.22%) and MM2 (95.39%)]
Residency of Blocks
• Memory blocks remain in the resources' memory even when they are no longer needed
[Figure: presence [%] of several memory blocks on MM0, MM1, and MM2 over the time steps; one panel per block]
Code Inspection
At this point:
• We think that the allocations are failing because the memory is full
  • nvidia-smi confirms it
• However, if StarPU identified that the memory was full, it would free unused blocks
• So StarPU believes it has free memory!
• We inspect the execution using gdb and identify a disparity between StarPU's internal GPU used-memory values and the real ones
Problem Cause
• StarPU internally registers the used memory from the sizes passed to cudaMalloc
• However, cudaMalloc can reserve more memory than requested (demonstrated in the sketch below)
  • Allocations are rounded up to the page size of the device; in our case, 2 MB
  • This causes a deviation of 1.8 GB for the whole matrix
• We propose a patch to correct it
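The over-reservation can be observed by comparing the requested size with the drop in free device memory reported by the driver; a minimal sketch (the 1 MiB request size is arbitrary):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t free_before, free_after, total;
        void *ptr;
        const size_t requested = 1 << 20;  /* 1 MiB, below a 2 MiB page */

        cudaMemGetInfo(&free_before, &total);
        cudaMalloc(&ptr, requested);
        cudaMemGetInfo(&free_after, &total);

        /* On a device with 2 MiB pages, the reserved amount exceeds the
         * requested one; accounting only the requested sizes under-counts. */
        printf("requested: %zu bytes, reserved: %zu bytes\n",
               requested, free_before - free_after);

        cudaFree(ptr);
        return 0;
    }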
Patch Experiments
• 10 executions comparing the original and the corrected versions
• Using the earlier Cholesky parameters:
  • Block size: 960
  • Machine: tupi
  • DMDA scheduler
• Varying the size of the matrix
• 99% confidence level
Patch Results
[Figure: application performance (GFlops, 0-700) of the original vs. corrected versions for input matrix widths from 29000 to 68000 cells; a vertical mark indicates the maximum matrix size that fits in GPU memory]
Conclusion
Conclusion
• The memory manager plays an important role in StarPU applications
• The proposed visualizations give extra information on the memory management decisions
  • They are application-independent
• We solved a runtime problem that was discovered using the new features
  • Performance gains of 66%
  • The application performance is sustained independently of the matrix size