Investigating Memory Operations Performance in the StarPU Runtime
Lucas Leandro Nesi, Lucas Mello Schnorr
September 5, 2018
Graduate Program in Computer Science (PPGC/UFRGS), Porto Alegre, Brazil
Summary
Introduction
Basic Concepts
Analyzing Data Management
Experiments
Conclusion
Introduction
• Programming for HPC is complex
• Using classical paradigms, the programmer has to:
  • Map the computation to resources
  • Manage the data and communication
• The task-based paradigm facilitates programming by abstracting those responsibilities to a runtime
• Runtime example: StarPU
Problem
• As with any other HPC approach, performance analysis of task-based applications is laborious
• Some tools for the performance analysis of task-based applications are available, with different features
  • Example: StarVZ
• However, those tools lack features for analyzing the memory management of the runtime and the application
Objectives
Create mechanisms for data management performance analysis of task-based applications running over the StarPU runtime.
Basic Concepts
Task-Based Programming Paradigm
• The application consists of a set of tasks
• The tasks are organized in a DAG (Directed Acyclic Graph)
• The graph edges are memory dependencies between tasks
• The application runs over a runtime
[Figure: an example application DAG with tasks A, B, and C linked by dependency edges]
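In StarPU, such a DAG is not built by hand: the application submits tasks against shared data handles, and the runtime derives the edges from the declared access modes. A minimal sketch, assuming a hypothetical scale_cpu kernel and codelet (the StarPU calls themselves are the library's public API):

    #include <starpu.h>

    /* Hypothetical kernel: doubles a variable in place. */
    static void scale_cpu(void *buffers[], void *cl_arg)
    {
        (void)cl_arg;
        double *v = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
        *v *= 2.0;
    }

    static struct starpu_codelet scale_cl = {
        .cpu_funcs = { scale_cpu },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        double x = 1.0;
        starpu_data_handle_t h;

        if (starpu_init(NULL) != 0) return 1;
        starpu_variable_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)&x, sizeof(x));

        /* Both tasks access the same handle in read-write mode, so StarPU
         * inserts a DAG edge between them and runs them in order. */
        starpu_task_insert(&scale_cl, STARPU_RW, h, 0);
        starpu_task_insert(&scale_cl, STARPU_RW, h, 0);

        starpu_task_wait_for_all();
        starpu_data_unregister(h);
        starpu_shutdown();
        return 0; /* x is now 4.0 */
    }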
Task-Based Programming Paradigm
• The runtime creates workers for the system resources
• The application submits tasks to the runtime
• The runtime schedules the tasks to workers based on scheduler heuristics; examples are:
  • LWS (local work stealing)
  • DM (deque model)
[Figure: runtime architecture; a scheduler dispatches tasks to Workers 1-4, which drive the resources CPU 1, CPU 2, GPU 1, and GPU 2]
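The scheduling policy is chosen when the runtime starts, typically through the STARPU_SCHED environment variable; a small sketch setting it programmatically (it is normally exported in the shell before launching):

    #include <stdlib.h>
    #include <starpu.h>

    int main(void)
    {
        /* "lws" and "dm" are policy names shipped with StarPU; the
         * variable is read once by starpu_init(). */
        setenv("STARPU_SCHED", "lws", 1);

        if (starpu_init(NULL) != 0) return 1;
        /* ... submit tasks; the chosen policy maps them to workers ... */
        starpu_shutdown();
        return 0;
    }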
StarPU - Memory Blocks
• Data is organized into memory blocks
• Example: the matrix is divided into memory blocks
[Figure: a matrix split into tiles, each labeled with its block coordinates, e.g. (0,0), (0,1), (1,1)]
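In StarPU this division is expressed by registering the matrix as a data handle and partitioning it with filters; a sketch assuming a square matrix of doubles (the sizes and tile counts here are illustrative):

    #include <starpu.h>

    #define N  1920  /* matrix width */
    #define NT 2     /* 2x2 tiles of 960x960 each */

    static double A[N * N];

    int main(void)
    {
        starpu_data_handle_t h;

        if (starpu_init(NULL) != 0) return 1;
        starpu_matrix_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)A,
                                    N /* ld */, N /* nx */, N /* ny */,
                                    sizeof(double));

        /* Cut the matrix along both dimensions: NT x NT memory blocks. */
        struct starpu_data_filter vert = {
            .filter_func = starpu_matrix_filter_vertical_block,
            .nchildren   = NT,
        };
        struct starpu_data_filter horiz = {
            .filter_func = starpu_matrix_filter_block,
            .nchildren   = NT,
        };
        starpu_data_map_filters(h, 2, &vert, &horiz);

        /* Each block is now addressable by its coordinates. */
        starpu_data_handle_t blk01 = starpu_data_get_sub_data(h, 2, 0, 1);
        (void)blk01;

        starpu_data_unpartition(h, STARPU_MAIN_RAM);
        starpu_data_unregister(h);
        starpu_shutdown();
        return 0;
    }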
Data Management
• Each resource memory has an entity called the memory manager
• The memory block is the minimal runtime memory unit:
  • Tasks use memory blocks as inputs and outputs
  • The DAG dependencies come from the reuse of memory blocks
  • Blocks are transferred between resources
• Coherence between resources uses an MSI system:
  • Three possible states for each block: Modified, Shared, Invalid
  • Each memory block has one state per resource memory
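A simplified model of that bookkeeping, purely illustrative (StarPU's real internal structures are more involved):

    #include <stddef.h>

    enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };

    #define MAX_MEMORY_NODES 8  /* main RAM plus accelerator memories */

    struct memory_block {
        size_t size;
        /* One coherence state per memory node. */
        enum msi_state state[MAX_MEMORY_NODES];
    };

    /* A task reading the block on node `dst` triggers a transfer whenever
     * the local replica is Invalid; a valid copy is fetched from a node
     * holding the block in Modified or Shared state. */
    static int needs_transfer(const struct memory_block *b, unsigned dst)
    {
        return b->state[dst] == MSI_INVALID;
    }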
Application example: Chameleon/MORSE
• A solver for dense linear algebra using the task-based paradigm
• One of the possible operations: Cholesky factorization
  • Uses a triangular matrix
  • Four tasks: dpotrf, dsyrk, dtrsm, dgemm
  • Computes from lower memory block coordinates to higher ones (see the sketch below)
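The submission loop below sketches how the four tasks sweep the tiles from low to high coordinates; the codelets dpotrf_cl, dtrsm_cl, dsyrk_cl, dgemm_cl and the tile-handle helper A(m, n) are hypothetical names assumed to be defined elsewhere:

    #include <starpu.h>

    /* Assumed to exist: one codelet per kernel and a helper returning the
     * handle of tile (m, n) of the lower-triangular matrix. */
    extern struct starpu_codelet dpotrf_cl, dtrsm_cl, dsyrk_cl, dgemm_cl;
    extern starpu_data_handle_t A(int m, int n);

    void cholesky_submit(int NT)
    {
        for (int k = 0; k < NT; k++) {
            starpu_task_insert(&dpotrf_cl, STARPU_RW, A(k, k), 0);
            for (int m = k + 1; m < NT; m++)
                starpu_task_insert(&dtrsm_cl,
                                   STARPU_R, A(k, k), STARPU_RW, A(m, k), 0);
            for (int m = k + 1; m < NT; m++) {
                starpu_task_insert(&dsyrk_cl,
                                   STARPU_R, A(m, k), STARPU_RW, A(m, m), 0);
                for (int n = k + 1; n < m; n++)
                    starpu_task_insert(&dgemm_cl,
                                       STARPU_R, A(m, k), STARPU_R, A(n, k),
                                       STARPU_RW, A(m, n), 0);
            }
        }
        /* No explicit DAG: the R/RW modes on the tile handles encode it. */
        starpu_task_wait_for_all();
    }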
Analyzing Data Management
Memory Nodes States
• Uses the traditional Gantt chart for the memory manager states
• X is the time in ms and Y lists the memory managers
[Figure: Gantt chart of the states Allocating, Allocating Async, and WritingBack for memory managers MM1 (74.78%) and MM2 (75.03%) over 0-100000 ms]
Memory Nodes States - Zoom
• When zoomed in, it shows the coordinates of the memory blocks related to each state
[Figure: zoom between 50000 and 50050 ms; each state on MM1 (98.22%) and MM2 (95.39%) is annotated with the coordinates of its memory block, e.g. (9,10) and (9,15)]
Memory Block Residency
• Shows the residency of a memory block on the memory managers during the execution
• X is the time divided into intervals, and Y is the share of time that the memory block was present on each memory manager
[Figure: time present [%] of one memory block on MEMMANAGER0, MEMMANAGER1, and MEMMANAGER2 across time steps 0-186 s]
Experiments
Experiments
• Test case: Chameleon/MORSE Cholesky factorization
  • Block size: 960x960
  • 60x60 tiles
  • DMDA scheduler
  • Machine: tupi
• Preliminary tests suggested performance problems
  • For unknown reasons
• Using the earlier methodology to investigate
Application Workers
• The GPUs have high idle times after the used memory reaches a plateau
[Figure: per-worker Gantt chart (CPU0-CPU4, CUDA0_0, CUDA1_0) of the tasks dgemm, dpotrf, dsyrk, and dtrsm with per-worker idle percentages, aligned with the used memory (MB) over 0-100000 ms]
Memory Managers
• The GPUs' memory managers show many allocation states
[Figure: memory manager states (Allocating, Allocating Async, WritingBack) for MM1 (74.78%) and MM2 (75.03%), aligned with the used memory (MB) over 0-100000 ms]
Zoom Memory Nodes
• Repeated allocations for the same memory blocks
• Inspecting the StarPU code: this can happen when allocations fail
[Figure: zoom between 50000 and 50050 ms showing repeated Allocating states for the same block coordinates, e.g. (9,10) and (9,15), on MM1 (98.22%) and MM2 (95.39%)]
Residency of Blocks
• Memory blocks remain in the resources' memory even when they are no longer needed
[Figure: presence [%] of several memory blocks on MM0, MM1, and MM2 over the time steps; one panel per block]
Code Inspection
At this point:
• We think that the allocations are failing because the memory is full
  • nvidia-smi confirms it
• However, if StarPU identified that the memory was full, it would free unused blocks
• So StarPU believes it has free memory!
• We inspect the execution using gdb and identify a disparity between StarPU's internal GPU used-memory values and the real ones
Problem Cause
• StarPU internally registers the used memory from the sizes passed to cudaMalloc
• However, cudaMalloc can reserve more memory than requested (demonstrated in the sketch below)
  • Allocations are rounded up to the page size of the device; in our case, 2 MB
  • This causes a deviation of 1.8 GB for the whole matrix
• We propose a patch to correct it
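The over-reservation can be observed by comparing the requested size with the drop in free device memory reported by the driver; a minimal sketch (the 1 MiB request size is arbitrary):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t free_before, free_after, total;
        void *ptr;
        const size_t requested = 1 << 20;  /* 1 MiB, below a 2 MiB page */

        cudaMemGetInfo(&free_before, &total);
        cudaMalloc(&ptr, requested);
        cudaMemGetInfo(&free_after, &total);

        /* On a device with 2 MiB pages, the reserved amount exceeds the
         * requested one; accounting only the requested sizes under-counts. */
        printf("requested: %zu bytes, reserved: %zu bytes\n",
               requested, free_before - free_after);

        cudaFree(ptr);
        return 0;
    }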
Patch Experiments
• 10 executions comparing the original and the corrected versions
• Using the earlier Cholesky parameters:
  • Block size: 960
  • Machine: tupi
  • DMDA scheduler
• Varying the size of the matrix
• 99% confidence level
Patch Results
[Figure: application performance (GFlops, 0-700) of the original vs. corrected versions for input matrix widths from 29000 to 68000 cells; a vertical mark indicates the maximum matrix size that fits in GPU memory]
Conclusion
Conclusion
• The memory manager plays an important role in StarPU applications
• The proposed visualizations give extra information on the memory management decisions
  • They are application-independent
• We solved a runtime problem that was discovered using the new features
  • Performance gains of 66%
  • The application performance is sustained independently of the matrix size