Visualization of Lustre RPC Traces with Vampir · 2016. 3. 10. · Zellescher Weg 12 WIL A 208 Tel....
-
Zellescher Weg 12
WIL A 208
Tel. +49 351 - 463 – 34217
Michael Kluge ([email protected])
Visualization of Lustre RPC Traces with Vampir
EOFS Workshop September 2011 – Paris
Center for Information Services and High Performance Computing
-
Slide 2
Content
• Problem statement
• OTF and Vampir introduction
• Converter design
• Time synchronisation
• Screenshots
• Directions
-
Slide 3
Problem statement
• Lustre RPC traces consist of millions of events
• Data size is in the GB range
• One trace per server and client
• Log files are text files
• Evaluation?
• Visualization?
-
Slide 4
OTF
• One index file
• One definition file
• Multiple event streams
• Each stream can store events for many processes
• Typical usage: application traces
-
Slide 5
Vampir - Overview
[Figure: workflow diagram with the elements Multi-Core Program, Many-Core Program, VampirTrace, Trace File (OTF), Trace Bundle, Vampir 7 and VampirServer]
-
Slide 6
Vampir - Performance Radar
• Color-coded
• Many counter timelines
• Like heat maps in gnuplot
• Basic arithmetic on counter data
-
Slide 7
Approach
• Rewrite Lustre RPC logs to an OTF trace file bundle
• Use the Performance Radar to show:
– RPCs in flight per client
– Average RPC completion time
– Queue times on the server
– Different types of RPCs
– …
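A counter such as "RPCs in flight per client" can be derived from the per-RPC events by incrementing on a send and decrementing on completion. The following is a minimal sketch of that idea, not the converter's actual code; the event tuple layout and the kind names "send" and "ack" are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical event layout: (timestamp_seconds, client_name, kind),
# where kind is "send" (client sent the RPC) or "ack" (client saw it complete).
def rpcs_in_flight(events):
    """Return {client: [(timestamp, in_flight_count), ...]} as a step-function counter."""
    in_flight = defaultdict(int)
    timelines = defaultdict(list)
    for ts, client, kind in sorted(events):
        in_flight[client] += 1 if kind == "send" else -1
        timelines[client].append((ts, in_flight[client]))
    return dict(timelines)

example = [
    (0.0, "client1", "send"),
    (0.1, "client1", "send"),
    (0.3, "client1", "ack"),
    (0.2, "client2", "send"),
    (0.5, "client1", "ack"),
    (0.6, "client2", "ack"),
]
counters = rpcs_in_flight(example)
```

Each resulting list of (timestamp, value) pairs has the shape of a counter timeline, which is the kind of data a heat-map style display works on.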
-
Slide 8
Challenges
• Conversion of lots of unsynchronized individual files
• Find a ‘file name’ to ‘host name’ mapping
• Final event streams need to be ordered in time
• Five events per RPC
– Client send
– Server receive
– Server start work
– Server end work
– Client ack
• Catch different protocol versions
• Deal with server side logs only
[Figure labels: client, server]
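Producing time-ordered event streams from many individually sorted log files is a k-way merge. A minimal sketch using Python's `heapq.merge` follows; it assumes each file is already sorted by its local clock and that a per-host clock offset is known. The function name, the tuple layout and the event type strings are hypothetical.

```python
import heapq

def merge_streams(streams, offsets):
    """Merge per-host event lists into one globally time-ordered stream.

    streams: {host: iterable of (local_time, event_type, xid)}, each sorted by local_time
    offsets: {host: seconds to add to the local time to obtain global time}
    """
    def shifted(host, events):
        shift = offsets.get(host, 0.0)
        for local_time, event_type, xid in events:
            yield (local_time + shift, host, event_type, xid)

    # heapq.merge keeps only one pending event per input in memory.
    return heapq.merge(*(shifted(h, ev) for h, ev in streams.items()))

streams = {
    "client1": [(10.0, "client_send", 1), (10.5, "client_ack", 1)],
    "server1": [(5.125, "server_recv", 1), (5.25, "server_start", 1), (5.375, "server_end", 1)],
}
offsets = {"client1": 0.0, "server1": 5.0}  # assumed: server clock lags by 5 s
merged = list(merge_streams(streams, offsets))
order = [event_type for _, _, event_type, _ in merged]
```

Because the merge is lazy, this approach fits logs in the GB range better than loading and sorting everything at once.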
-
Slide 9
Time Synchronisation
• Initial time offsets unknown
• Unknown whether tracing started synchronously
• Solution:
– Provide a {RPC log file, file type, NID list} tuple
– Read the first N seconds from each file
– Try to find matching RPC pairs
– Create a timer offset map
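One way to turn matching RPC pairs into a clock offset is the round-trip estimate known from NTP. This is a swapped-in illustration, not necessarily what the converter does. The sketch assumes that network delay is symmetric, that "client ack" marks the arrival of the reply sent at "server end work", and that RPCs are matched by XID; all names are hypothetical.

```python
from statistics import median

def estimate_offset(client_events, server_events):
    """Estimate (server clock - client clock) in seconds from matched RPCs.

    client_events: {xid: (send_time, ack_time)} in client clock
    server_events: {xid: (receive_time, end_time)} in server clock
    Returns None if no RPC appears in both logs.
    """
    samples = []
    for xid, (t_send, t_ack) in client_events.items():
        if xid not in server_events:
            continue
        t_recv, t_end = server_events[xid]
        # NTP-style estimate: the symmetric delay cancels out.
        samples.append(((t_recv - t_send) + (t_end - t_ack)) / 2.0)
    # The median is less sensitive to a few delayed RPCs than the mean.
    return median(samples) if samples else None

client_log = {1: (0.0, 1.0), 2: (2.0, 2.75), 3: (4.0, 4.5)}
server_log = {1: (100.25, 100.75), 2: (102.25, 102.5)}
offset = estimate_offset(client_log, server_log)
```

Repeating this for every client/server pair found in the first N seconds would yield one entry per host for the timer offset map.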
-
Slide 10
Internal Data Structure
[Figure: one RPC record with the fields XID, OpCode, Client, Server and pointer (ptr) fields; a timeline for a server and a timeline for a client, each a list of entries of the form (Time, Type, RPC)]
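The structure on this slide can be sketched as one shared record per RPC plus per-host timelines whose entries refer back to that record. The class names, field types and the example opcode value below are assumptions; the transcript does not show which way the pointers run.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RPC:
    xid: int       # transaction ID used to match client and server events
    opcode: int
    client: str
    server: str

@dataclass
class TimelineEntry:
    time: float
    type: str      # e.g. "client_send" or "server_recv" (hypothetical names)
    rpc: RPC       # reference ("ptr") to the shared RPC record

@dataclass
class Timeline:
    host: str
    entries: List[TimelineEntry] = field(default_factory=list)

rpc = RPC(xid=42, opcode=4, client="client1", server="server1")  # opcode value is arbitrary
client_tl = Timeline("client1", [TimelineEntry(0.0, "client_send", rpc)])
server_tl = Timeline("server1", [TimelineEntry(0.25, "server_recv", rpc)])
```

Sharing one record between both timelines means an event seen on the server can be attributed to its client without a second lookup.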
-
Slide 11
100GBit Testbed
[Figure: testbed topology. Labels: Cluster; QDR InfiniBand; IB-Switch; 5x; Server 1, Server 2, …, Server 16; 16x 10G Uplink; 100G Link, 37 Miles; 16x 10G Downlink; 16 Servers; 16x FC8 to 2x S2A9900]
-
Slide 12
Screenshot – RPCs in flight on the clients
-
Slide 13
Screenshot – average time to complete an RPC
-
Slide 14
Screenshot – average time to complete an RPC
-
Slide 15
Screenshot – RPCs queued on the server
-
Slide 16
Screenshot – average RPC processing time (1)
-
Slide 17
Screenshot – average RPC processing time (2)
-
Slide 18
Screenshot – Jaguar (server log only)
-
Slide 19
Where to go from here
• More metrics needed?
• Use markers to mark interesting locations
• Locks are in the log files
• How about a complete log from Jaguar?
• Other possibilities to explore:
– make use of the Python interface for OTF
– make use of the Octave import for OTF files
-
Slide 20
Time for Questions
-
Slide 21
Some internal limits
• Queued (open) RPCs per process
• Maximum RPC completion time
• Time window for average calculations
• Maximum allowed time synchronisation
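To illustrate how such limits might interact, the sketch below groups RPC completion times into fixed time windows and drops samples above a maximum completion time. The class, the field names and all default values are placeholders chosen for the example, not the converter's real settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    # All defaults are placeholder values for illustration only.
    max_open_rpcs_per_process: int = 1024
    max_completion_time: float = 60.0   # seconds
    average_window: float = 1.0         # seconds
    max_clock_offset: float = 3600.0    # seconds

def windowed_averages(completions, limits):
    """completions: iterable of (finish_time, duration) in seconds.
    Returns {window_index: mean duration}, ignoring durations above the limit."""
    sums = {}
    for finish_time, duration in completions:
        if duration > limits.max_completion_time:
            continue  # treat as an outlier or an unmatched RPC
        idx = int(finish_time // limits.average_window)
        total, count = sums.get(idx, (0.0, 0))
        sums[idx] = (total + duration, count + 1)
    return {idx: total / count for idx, (total, count) in sums.items()}

samples = [(0.25, 0.5), (0.75, 1.5), (1.5, 2.0), (1.75, 100.0)]
averages = windowed_averages(samples, Limits())
```

A larger window smooths the resulting counter, while a smaller one exposes short bursts at the cost of noisier values.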