OSU INAM: A Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU-enabled HPC Clusters*

Pouya Kousha, Advisor: Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University

{kousha.2, panda.2}@osu.edu

1 Introduction and Motivation

Considering the recent advances in interconnect technology (NVLink, X-Bus, etc.), understanding the interaction between applications, MPI libraries, and the communication fabric becomes more challenging for system administrators, application developers, and MPI designers. Although there are tools that provide network-level or MPI-level metrics, there is a gap: no tool provides both sets of metrics and correlates them. Moreover, determining the root cause of performance degradation is complex for the domain scientist. The scale of emerging HPC clusters further exacerbates the problem. These issues lead to the following broad challenge:

How can we design a tool that enables in-depth understanding of the communication traffic on the interconnect and GPU through tight integration with the MPI runtime at scale?

1.1 Contributions

The contributions of the proposed tool are two-fold: 1) domain scientists and system administrators can understand the interaction of applications and runtime libraries with the underlying high-performance interconnects, and 2) designers of high-performance communication libraries gain the low-level knowledge needed to optimize existing designs and develop new algorithms that optimally utilize cutting-edge interconnects on GPU clusters.

2 Proposed Solutions and Designs

We propose two primary profiler modules on top of the CUPTI and MPI_T interfaces to enable profiling of GPU and MPI activities over high-performance interconnects [1]. We also redesigned and enhanced the Fabric Discovery and Port Metrics Inquiry modules.

2.1 Enhanced Fabric Discovery and Port Metrics Inquiry

Using multi-threading, we improved the performance of fabric discovery and port counter inquiry by up to 14x for the Ohio Supercomputer Center (OSC) cluster [3]. InfiniBand port metrics are recorded at sub-second granularity.
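As a rough illustration of this multi-threaded design, the sketch below spreads per-node port-counter queries over a fixed pool of POSIX threads. The query_port_counters() helper and the node_t type are hypothetical stand-ins for the InfiniBand management calls and fabric data structures the tool actually uses; only the threading pattern reflects the text above, with the thread count and node count taken from the figures and Section 3.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 16          /* pool size in the range studied in Figures 2 and 4 */
#define NUM_NODES   1430        /* compute nodes in the OSC deployment (Section 3) */

typedef struct { int lid; } node_t;   /* hypothetical per-node descriptor */

static node_t nodes[NUM_NODES];

/* Hypothetical stand-in for an InfiniBand port-counter query. */
static void query_port_counters(node_t *n) {
    (void)n; /* ... issue the query and store counters with a sub-second timestamp ... */
}

typedef struct { int begin, end; } range_t;

/* Each worker walks a contiguous slice of the node list. */
static void *worker(void *arg) {
    range_t *r = (range_t *)arg;
    for (int i = r->begin; i < r->end; i++)
        query_port_counters(&nodes[i]);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    range_t   slice[NUM_THREADS];
    int chunk = (NUM_NODES + NUM_THREADS - 1) / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        slice[t].begin = t * chunk;
        slice[t].end   = (t + 1) * chunk < NUM_NODES ? (t + 1) * chunk : NUM_NODES;
        pthread_create(&tid[t], NULL, worker, &slice[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    printf("queried %d nodes with %d threads\n", NUM_NODES, NUM_THREADS);
    return 0;
}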

* This research is supported in part by NSF grants #CNS-1513120, #ACI-1450440, #CCF-1565414, and #ACI-1664137.

2.2 Profiler Interface for MPI+CUDA Communication

We implement the profiling interfaces in a CUDA-Aware MPI library, MVAPICH2 [2]. The design is modular and applicable to any MPI library that needs to gather GPU metrics and topology. Our proposed design has three phases: (1) Startup, where each rank discovers the topology and updates a shared region with its rank and assigned device information; the local rank zero then starts a profiler thread on the CPU to profile all GPUs on the node. (2) Query, where the profiler thread gathers all the metrics from the enrolled GPUs at a user-defined interval and sends them to the OSU INAM daemon. (3) Exit, which occurs once the ranks stop using the device; the profiler thread performs a final query and then exits.
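A minimal sketch of this three-phase profiler thread is shown below, assuming hypothetical helpers discover_gpu_topology(), gather_gpu_metrics(), and send_to_inam_daemon() in place of the actual CUPTI/NVML calls and the OSU INAM wire protocol; the 0.5-second interval is also an assumed value for the user-defined query interval.

#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical helpers standing in for the GPU metric queries and the INAM protocol. */
static void discover_gpu_topology(void) { /* Startup: map local GPUs and their links */ }
static void gather_gpu_metrics(void)    { /* Query: read per-GPU / NVLink counters */ }
static void send_to_inam_daemon(void)   { /* Query: push samples to the daemon */ }

static atomic_int ranks_done = 0;         /* set once all local ranks stop using GPUs */
static unsigned   interval_us = 500000;   /* user-defined query interval (assumed 0.5 s) */

/* Profiler thread started by local rank zero; one instance covers all GPUs on the node. */
static void *gpu_profiler_thread(void *arg) {
    (void)arg;
    discover_gpu_topology();                  /* Phase 1: Startup */
    while (!atomic_load(&ranks_done)) {       /* Phase 2: Query loop */
        gather_gpu_metrics();
        send_to_inam_daemon();
        usleep(interval_us);
    }
    gather_gpu_metrics();                     /* Phase 3: Exit - take one last sample */
    send_to_inam_daemon();
    return NULL;
}

/* Usage (local rank zero only): */
int start_profiler(pthread_t *tid) { return pthread_create(tid, NULL, gpu_profiler_thread, NULL); }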

2.3 Introducing New MPI_T Performance Variables

For each collective and point-to-point operation, every rank stores the data transmitted to and received from every other rank, an array of start and end timestamps, the algorithm selected for the communication, and the number of calls to each particular algorithm/function. A sketch of reading such PVARs through the generic MPI_T interface is shown after Section 2.4.

2.4 Improvements to Database Schema and Storage Interface

We updated the methodology of storing data to correlate GPU metrics with MPI_T PVAR metrics. Figure 1 depicts the schema and data types used to store the GPU metrics and the PVAR information, respectively. For GPU metrics, we distinguish between NVLinks using the source ID, source port, destination ID, and destination port, where the source and destination IDs are device IDs. The intra-node topology table stores the link-level information. Such a database schema design allows the visualization module to quickly render the intra-node topology.
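For readers unfamiliar with the MPI Tools Information Interface, the sketch below shows the standard MPI_T calls a monitoring component can use to enumerate and read performance variables such as the ones introduced in Section 2.3. The specific PVAR names and layouts exposed by MVAPICH2 are not spelled out in this abstract, so the loop simply prints every scalar unsigned-long-long PVAR the library exports.

#include <mpi.h>
#include <stdio.h>

/* Enumerate the library's performance variables and read the scalar
 * unsigned-long-long PVARs that are not bound to a particular object. */
int main(int argc, char **argv) {
    int provided, num_pvars;
    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num_pvars);

    MPI_T_pvar_session session;
    MPI_T_pvar_session_create(&session);

    for (int i = 0; i < num_pvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, binding, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dtype, &enumtype, desc, &desc_len,
                            &binding, &readonly, &continuous, &atomic);

        if (binding != MPI_T_BIND_NO_OBJECT || dtype != MPI_UNSIGNED_LONG_LONG)
            continue;                     /* keep the example to simple scalar counters */

        MPI_T_pvar_handle handle;
        int count = 0;
        MPI_T_pvar_handle_alloc(session, i, NULL, &handle, &count);
        if (count == 1) {
            unsigned long long value = 0;
            if (!continuous)
                MPI_T_pvar_start(session, handle);
            MPI_T_pvar_read(session, handle, &value);
            printf("%s = %llu\n", name, value);
        }
        MPI_T_pvar_handle_free(session, &handle);
    }

    MPI_T_pvar_session_free(&session);
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}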

[Figure 1: database schema with three tables]
Intra_node_topo: Id (primary key), Node_name, Physical_link_count, Link_capacity, Source, Source_id, Destination, Destination_id
NVLink_metrics: Id (primary key), Link_id, Node_name, Source_name, Source_port, Source_id, Dest_name, Dest_port, Dest_id, Source_local_rank, Source_global_rank, Dest_local_rank, Dest_global_rank, Data_unit, Data_recv, Data_sent, Data_recv_rate, Data_sent_rate, Added_on
PVAR_table: Id (primary key), jobid, Node_name, Start_time, End_time, Bytes_recv, Bytes_sent, PVAR_name, Algorithm, Source_rank, Dest_rank, Added_on

Figure 1. Database schema used for storing the metrics gathered by the tools
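The abstract does not spell out how a GPU sample and a PVAR record are joined, so the sketch below shows one plausible correlation rule over a subset of the Figure 1 columns; the in-memory field types and the matching criteria are assumptions for illustration only.

#include <string.h>

/* In-memory mirrors of two Figure 1 tables (field subset only; types assumed). */
typedef struct {
    char   node_name[64];
    int    source_global_rank, dest_global_rank;
    double added_on;               /* sample timestamp */
    unsigned long long data_sent, data_recv;
} nvlink_metrics_row;

typedef struct {
    char   node_name[64];
    int    source_rank, dest_rank;
    double start_time, end_time;   /* PVAR operation time window */
    unsigned long long bytes_sent, bytes_recv;
} pvar_row;

/* One plausible correlation rule (not specified in the abstract): same node,
 * same communicating ranks, and the NVLink sample falls inside the PVAR
 * operation's time window. */
static int correlated(const nvlink_metrics_row *g, const pvar_row *p) {
    return strcmp(g->node_name, p->node_name) == 0 &&
           g->source_global_rank == p->source_rank &&
           g->dest_global_rank   == p->dest_rank &&
           g->added_on >= p->start_time && g->added_on <= p->end_time;
}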

Figure 4. Impact of multithreading on Fabric Discovery module on OSC cluster (y-axis: latency in seconds; x-axis: number of threads for fabric discovery)

Table 1. Network and Live Jobs View Retrieval Timing for OSC for 1K Jobs

View            Average    Min     Max        STDEV.p
Network View    196.15 ms  187 ms  206.09 ms  5.75 ms
Live Jobs View  18.17 ms   16 ms   20 ms      1 ms

Table 2. Timing of the GPU profiler thread phases for each node. Each node has four GPUs

Metrics              Average  Min      Max        STDEV.p
Startup phase        1.632 s  1.561 s  1.672 s    0.035 s
CUDA context create  1.624 s  1.548 s  1.663 s    0.035 s
Query phase          2.33 ms  1.63 ms  208.03 ms  4.43 ms
Exit phase           88 us    85 us    93 us      28 us

Table 3. Overhead of collecting PVAR data at nanosecond granularity

Metrics           Average    Min     Max        STDEV.p
Collecting PVARs  517.63 ns  140 ns  16,204 ns  305.91 ns

2.5 Improvements to the Visualization Interface

The OSU INAM web application was enhanced to show additional node-level information, including: logical and physical connectivity with real-time link utilization; the amount of data transferred between each pair of ranks in the current node; live MPI PVAR counters; and live NVLink utilization data in the current node.

3 Experimental Evaluation

We used the HPC clusters at OSC [3] with 1,430 compute nodes, 114 switches, and 3,402 links. For the GPU profiling and MPI_T evaluations, we used an NVLink-enabled GPU system, where each node has two IBM POWER9 CPUs on two sockets connected via X-Bus, and each socket is connected to two NVIDIA Tesla V100 GPUs using NVLink2. Every socket has 22 CPU cores with four hardware threads per core.

Figures 2, 3, and 4 depict the scalability and the effect of the number of threads for the port inquiry and fabric discovery designs, respectively. In Table 1, we present the time for the OSU INAM front end to process and render the Network and Live Jobs views for the OSC cluster.

Figure 2. Impact of multi-threading on Port Inquiry module on OSC cluster (y-axis: latency in seconds; x-axis: number of threads for Port Inquiry; series: Average, Median, STDEV.P)

Figure 3. Histogram of querying port metrics for all nodes for OSC cluster with 1000 samples (y-axis: latency of port inquiry in seconds; x-axis: samples with 16 threads)

Table 2 shows the timing of the GPU profiler. Table 3 shows the statistics for PVAR data collection inside MPI, measured over 9,800 samples.

References

[1] P. Kousha, B. Ramesh, K. Kandadi Suresh, C. Chu, A. Jain, N. Sarkauskas, and D. Panda. 2019. Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters.
[2] MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE. 2019. https://mvapich.cse.ohio-state.edu/.
[3] Ohio Supercomputer Center. 2019. https://www.osc.edu/.