Lawrence Livermore National Laboratory
Pianola: A script-based I/O benchmark
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
John May
PSDW08, 17 November 2008
LLNL-PRES-406688
I/O benchmarking: What’s going on here?
Is my computer’s I/O system “fast”?
Is the I/O system keeping up with my application?
Is the app using the I/O system effectively?
What tools do I need to answer these questions?
And what exactly do I mean by “I/O system” anyway?
• For this talk, an I/O system is everything involved in storing data from the filesystem to the storage hardware
Existing tools can measure general or application-specific performance
IOzone automatically measures I/O system performance for different operations and parameters
Relatively little ability to customize I/O requests
Many application-oriented benchmarks
• SWarp, MADbench2…
Interface-specific benchmarks
• IOR, //TRACE
[Figure: IOzone plot, “Write new file”: throughput (0 to 600000 KB/sec) by file size (64 to 4194304 KB) and record size (4 to 16384 KB)]
Measuring the performance that matters
System benchmarks only measure general response, not application-specific response
Third-party application-based benchmarks may not generate the stimulus you care about
In-house applications may not be practical benchmarks
• Difficult for nonexperts to build and run
• Nonpublic source cannot be distributed to vendors and collaborators
Need benchmarks that…
• Can be generated and used easily
• Model application-specific characteristics
Script-based benchmarks emulate real apps
Capture trace data from application and generate the same sequence of operations in a replay-benchmark
We began with //TRACE from CMU (Ganger’s group)
• Records I/O events and intervening “compute” times
• Focused on parallel I/O, but much of the infrastructure is useful for our sequential I/O work
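[Diagram: the application runs with a capture library, which produces a replay script (open, compute, read, compute, …); the replay script and the replay tool together form the script-based benchmark]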
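As a rough illustration of the replay idea, the sketch below treats a script as a list of (operation, argument) tuples and replays them, sleeping to stand in for the “compute” intervals. The tuple layout, the event names, and the `replay` and `demo` functions are assumptions made for this example; they are not the //TRACE or Pianola script format.

```python
import os
import tempfile
import time


def replay(script):
    """Replay a list of (op, arg) events; return (bytes_read, bytes_written).

    The tuple layout is a hypothetical format chosen for this sketch.
    """
    fd = None
    bytes_read = 0
    bytes_written = 0
    for op, arg in script:
        if op == "open":
            path, flags = arg
            fd = os.open(path, flags, 0o600)
        elif op == "compute":
            # Stand in for application work between I/O events.
            time.sleep(arg)
        elif op == "write":
            # Only the request size matters here, so write filler bytes.
            bytes_written += os.write(fd, b"\0" * arg)
        elif op == "read":
            bytes_read += len(os.read(fd, arg))
        elif op == "seek":
            os.lseek(fd, arg, os.SEEK_SET)
        elif op == "close":
            os.close(fd)
            fd = None
        else:
            raise ValueError(f"unknown event: {op}")
    return bytes_read, bytes_written


def demo():
    """Replay a tiny hand-written script against a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "replay.dat")
        script = [
            ("open", (path, os.O_RDWR | os.O_CREAT)),
            ("compute", 0.01),
            ("write", 4096),
            ("seek", 0),
            ("compute", 0.01),
            ("read", 1024),
            ("close", None),
        ]
        return replay(script)
```

Calling `demo()` replays one write and one read with short pauses in between; a real script would come from a capture run rather than being written by hand.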
Challenges for script-based benchmarking: Recording I/O calls at the right level
[Diagram: at the high level the application issues fprintf( … ); /* work */ fprintf( … ); /* more work */ fprintf( … ); …; at the low level it issues write( … ); the recorded script reads open, compute, write, …]
Instrumenting at high level
+ Easy with LD_PRELOAD
- Typically generates more events, so logs are bigger
- Need to replicate formatting
- Timing includes computation
Instrumenting at low level
+ Fewer types of calls to capture
+ Instrumentation is at I/O system interface
- Cannot use LD_PRELOAD to intercept all calls
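The difference in event counts between the two levels can be seen with a small Python analogy, in which a buffered writer plays the role of stdio and a counting raw stream plays the role of the system-call interface. This is only a sketch: `CountingRaw` and `count_events` are names invented for the example, and the buffer size is an arbitrary choice.

```python
import io


class CountingRaw(io.RawIOBase):
    """Raw stream that counts low-level write calls instead of doing I/O."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0
        self.bytes_written = 0

    def writable(self):
        return True

    def write(self, data):
        self.write_calls += 1
        self.bytes_written += len(data)
        return len(data)


def count_events(n_records=100, record=b"x" * 10):
    """Return (high-level calls, low-level calls, bytes reaching the raw layer)."""
    raw = CountingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=8192)
    for _ in range(n_records):
        buffered.write(record)  # high-level call, analogous to fprintf
    buffered.flush()            # buffered data reaches the raw layer here
    return n_records, raw.write_calls, raw.bytes_written
```

A trace taken at the buffered layer would log every small record, while a trace taken at the raw layer logs only the flushes, which is the trade-off listed above.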
First attempt at capturing system calls: Linux strace utility
Records any selected set of system calls
Easy to use: just add to command line
Produces easily parsed output
$ strace -r -T -s 0 -e trace=file,desc ls
     0.000000 execve("/bin/ls", [, ...], [/* 29 vars */]) = 0 <0.000237>
     0.000297 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000047>
     0.000257 fstat64(3, {st_mode=S_IFREG|0644, st_size=64677, ...}) = 0 <0.000033>
     0.000394 close(3) = 0 <0.000015>
     0.000230 open("/lib/librt.so.1", O_RDONLY) = 3 <0.000046>
     0.000289 read(3, ""..., 512) = 512 <0.000028>
     ...
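To show how such output might be turned into replay events, here is a minimal parsing sketch. It assumes the line shape shown above: a relative timestamp from `-r` (time since the previous call began), the call text, the return value, and the `-T` duration in angle brackets. The regular expression, the dictionary layout, and the `compute_gaps` helper are assumptions for illustration, not the parser used in this work.

```python
import re

# Matches lines such as:
#   0.000297 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000047>
LINE_RE = re.compile(
    r"^\s*(?P<rel>\d+\.\d+)\s+"       # -r: time since the previous call began
    r"(?P<name>\w+)\((?P<args>.*)\)"  # call name and raw argument text
    r"\s+=\s+(?P<ret>\S+)"            # return value (kept as text)
    r".*?<(?P<dur>\d+\.\d+)>\s*$"     # -T: time spent inside the call
)


def parse_line(line):
    """Return a dict for one strace line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "rel": float(m.group("rel")),
        "name": m.group("name"),
        "args": m.group("args"),
        "ret": m.group("ret"),
        "dur": float(m.group("dur")),
    }


def compute_gaps(events):
    """Estimate time spent outside system calls between consecutive events.

    The gap before an event is its relative timestamp minus the duration
    of the previous call, clamped at zero.
    """
    return [max(0.0, cur["rel"] - prev["dur"])
            for prev, cur in zip(events, events[1:])]


SAMPLE = """\
0.000297 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000047>
0.000257 fstat64(3, {st_mode=S_IFREG|0644, st_size=64677, ...}) = 0 <0.000033>
0.000394 close(3) = 0 <0.000015>
"""

events = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
gaps = compute_gaps(events)
```

Subtracting the previous call's duration from the next relative timestamp yields the “compute” interval that a replay tool would have to reproduce.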
Strace results look reasonably accurate, but overall runtime is exaggerated
[Figure: I/O Time versus Execution Time; series: Application Read, Application Write, Replay Read, Replay Write]
                          Read (sec.)  Write (sec.)  Elapsed (sec.)
Uninstrumented            --           --            324
Instrumented Application  41.8         11.8          402
Replay                    37.0         11.8          390
For accurate recording, gather I/O calls using binary instrumentation
Can intercept and instrument specific system-level calls
Overhead of instrumentation is paid at program startup
Existing Jockey library works well for x86_32, but not ported to other platforms
Replay can be portable, though
[Diagram: uninstrumented versus instrumented code]
Uninstrumented:
  mov 4(%esp), %ebx
  mov $0x4, %eax
  int $0x80
  ret
Instrumented:
  mov 4(%esp), %ebx
  jmp to trampoline
  nop
  ret
Trampoline:
  save current state
  call my write function
  restore state
  jmp to original code
Issues for accurate replay
Replay engine must be able to read and parse events quickly
Reading script must not interfere significantly with I/O activities being replicated
Script must be portable across platforms
[Diagram: a script (open, compute, write, …) becomes replayed I/O events. Accurate replay: minimize I/O impact of reading script; correctly reproduce inter-event delays]
Accurate replay: Preparsing, compression, and buffering
Text-formatted output script is portable across platforms
Instrumentation output is parsed into binary format and compressed (~30:1)
• Conversion done on target platform
Replay engine reads and buffers script data during “compute” phases between I/O events
[Diagram: text script (open, compute, write, …), then preparsing into binary (010010110101010110…), then compression, then buffering]
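The preparse-and-compress step can be sketched as follows. The opcodes, the one-line-per-event text format, and the 9-byte record layout are all assumptions invented for this example, and zlib stands in for whatever compressor was actually used; the point is only that fixed-size binary records are cheap to decode and that a repetitive script compresses well.

```python
import struct
import zlib

# Hypothetical opcodes and record layout chosen for this sketch.
OPCODES = {"open": 0, "read": 1, "write": 2, "close": 3, "compute": 4}
NAMES = {code: name for name, code in OPCODES.items()}
RECORD = struct.Struct("<BQ")  # 1-byte opcode, 8-byte unsigned operand


def preparse(text):
    """Convert 'op operand' text lines into packed binary records."""
    out = bytearray()
    for line in text.splitlines():
        op, operand = line.split()
        out += RECORD.pack(OPCODES[op], int(operand))
    return bytes(out)


def decode(blob):
    """Turn packed records back into (op, operand) tuples."""
    return [(NAMES[code], operand)
            for code, operand in RECORD.iter_unpack(blob)]


# A repetitive toy script: compute for 1500 time units, then write 4096 bytes.
text = "\n".join(["compute 1500", "write 4096"] * 1000)
binary = preparse(text)
compressed = zlib.compress(binary)
restored = decode(zlib.decompress(compressed))
```

The compression ratio of this toy script says nothing about real traces; the roughly 30:1 figure above comes from the talk, not from this example.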
Replay timing and profile match original application well
[Figure: I/O Time versus Execution Time; series: Application Read, Application Write, Replay Read, Replay Write]
                          Read (sec.)  Write (sec.)  Elapsed (sec.)
Uninstrumented            --           --            314
Instrumented Application  35.8         12.8          334
Replay                    35.7         12.5          319
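The elapsed times in the two results tables can be compared as relative overhead against the uninstrumented run. The short calculation below uses only the numbers from those tables; the second set is labelled as binary instrumentation because it follows that discussion, which is an assumption about the slide order rather than something the table states.

```python
def overhead_percent(measured, baseline):
    """Extra elapsed time relative to the uninstrumented run, in percent."""
    return 100.0 * (measured - baseline) / baseline


# Elapsed times in seconds, copied from the two results tables.
runs = {
    "strace": {"baseline": 324, "instrumented": 402, "replay": 390},
    "binary instrumentation (assumed)": {
        "baseline": 314, "instrumented": 334, "replay": 319,
    },
}

for method, t in runs.items():
    inst = overhead_percent(t["instrumented"], t["baseline"])
    rep = overhead_percent(t["replay"], t["baseline"])
    print(f"{method}: instrumented +{inst:.1f}%, replay +{rep:.1f}%")
```

This works out to about 24.1% (instrumented) and 20.4% (replay) for the strace run, against about 6.4% and 1.6% for the later run.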
Things that didn’t help
Compressing text script as it’s generated
• Only 2:1 compression
• Timing of the I/O events themselves is not what matters most during the instrumentation phase
Replicating the memory footprint
• Memory used by application is taken from same pool as I/O buffer cache
• Smaller application (like the replay engine) should go faster because more buffer space is available
• Replicated memory footprint by tracking brk() and mmap() calls, but it made no difference!
Conclusions on script-based I/O benchmarking
Gathering accurate I/O traces is harder than it seems
• Currently, no solution is both portable and efficient
Replay is easier, but efficiency still matters
Many possibilities for future work: which matter most?
• File name transformation
• Parallel trace and replay
• More portable instrumentation
• How to monitor mmap’d I/O?