Performance Analysis of HPC with Lmbench
Didem Unat Supervisor: Nahil Sobh
July 22nd 2005
netfiles.uiuc.edu/dunat2/www
Lmbench: Micro-Benchmark Suite
• Simple, portable benchmarks
• Compares the performance of different Unix systems
• Measures latency and bandwidth
• Only analyzes performance of processor, memory, network, file system and disk
• Free software
Compiler & optimization issues
• The GNU C compiler was used on all the resources except copper
• The IBM xlc compiler was used on copper
• All of the benchmarks were compiled with optimization -O, except the benchmarks that calculate clock speed and context switch times
Metrics in the Benchmark
Bandwidth
• Pipe / TCP
• Cached file read
• Memory copy
• Memory read/write
Latency
• System call
• Signal handling
• Process creation
• Basic CPU operations
• Context switching
• Inter process communication
• File and VM system
• Memory read latencies
Inter Process Communication Bandwidth
• Transfers 64 MB of data in 64 KB chunks through a Unix pipe, Unix sockets and TCP/IP sockets
[Figure: bar chart of bandwidth in MB/sec (axis 0 to 3000) for Pipe, AF Unix and TCP on W, Co, Cu and Hg]
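The pipe measurement described above can be sketched roughly as follows. This is a minimal illustration, not lmbench's own code: the function name `pipe_bandwidth` is hypothetical, and a writer thread stands in for the separate writer process that a real benchmark would use.

```python
import os
import threading
import time


def pipe_bandwidth(total_bytes=64 * 1024 * 1024, chunk=64 * 1024):
    """Return MB/sec for moving total_bytes through a Unix pipe in chunk-sized writes."""
    read_fd, write_fd = os.pipe()
    payload = b"\0" * chunk

    def writer():
        # Assumption: a thread replaces the second process to keep the sketch short.
        sent = 0
        while sent < total_bytes:
            sent += os.write(write_fd, payload[: min(chunk, total_bytes - sent)])
        os.close(write_fd)  # closing the write end signals end-of-file to the reader

    thread = threading.Thread(target=writer)
    start = time.perf_counter()
    thread.start()
    received = 0
    while True:
        data = os.read(read_fd, chunk)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    thread.join()
    os.close(read_fd)
    return received / (1024 * 1024) / elapsed


if __name__ == "__main__":
    print(f"pipe: {pipe_bandwidth():.0f} MB/sec")
```

The defaults mirror the 64 MB transfer in 64 KB chunks from the slide. Numbers from an interpreted sketch like this are not comparable with the compiled benchmark results in the chart.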
Cached file read
• A reread benchmark, intended to be used on a file that is in memory
• File reread: copies data from the kernel's file system page cache into the process's buffer
• Mmap reread: maps the entire file (8 MB) into the process's address space
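The two reread paths can be illustrated with a small sketch. All function names here are hypothetical, and the temporary file is an assumption made so the example is self-contained; the file is read once right after being written, so it is very likely still in the page cache.

```python
import mmap
import os
import tempfile
import time


def make_file(size=8 * 1024 * 1024):
    """Create a temporary file of the given size and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\1" * size)
    return path


def reread_with_read(path, chunk=64 * 1024):
    """Reread through read(): the kernel copies cached pages into our buffer."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    return total


def reread_with_mmap(path):
    """Reread through mmap(): map the whole file, then touch every byte."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return len(m[:])  # slicing copies the mapping, which touches every page


def mb_per_sec(fn, path):
    start = time.perf_counter()
    nbytes = fn(path)
    return nbytes / (1024 * 1024) / (time.perf_counter() - start)


if __name__ == "__main__":
    path = make_file()
    print("read reread:", round(mb_per_sec(reread_with_read, path)), "MB/sec")
    print("mmap reread:", round(mb_per_sec(reread_with_mmap, path)), "MB/sec")
    os.remove(path)
```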
Memory copy
• Measures how fast the system can bcopy data
• bcopy copies n bytes from a source buffer to a destination buffer
• An 8 MB to 8 MB copy, which does not fit in the cache
• Kernel bcopy and C library bcopy
• C library bcopy shown in the next slide
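As a rough analogue of the 8 MB to 8 MB copy, the sketch below times a buffer-to-buffer copy and converts it to MB/sec. The name `copy_bandwidth` is hypothetical, and a bytearray slice assignment is used as a stand-in for bcopy, so this only shows the shape of the measurement.

```python
import time


def copy_bandwidth(size=8 * 1024 * 1024, repeats=5):
    """Return the best observed MB/sec for copying size bytes between two buffers."""
    src = bytearray(size)
    dst = bytearray(size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: a bulk in-place copy
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    return size / (1024 * 1024) / best


if __name__ == "__main__":
    print(f"copy: {copy_bandwidth():.0f} MB/sec")
```

Taking the best of several repeats is a design choice of this sketch to reduce noise from other activity on the machine.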
Memory read/write
Read
• Measures the time to read data into the processor
• An unrolled loop that sums up a series of integers
Write
• Measures the time to write data to memory
• An unrolled loop that stores a value into an integer
Operating System Entry / Signal Handling / Process Creation Costs
• Process-related latencies
• System call: null call, null I/O, stat, open/close
• Signal handling: signal installation, signal handling
• Process creation: fork + exit, fork + execve, fork + /bin/sh -c
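Two of these latencies, the null system call and fork + exit, can be sketched as below. The function names are hypothetical, and using getppid as the trivial system call is an assumption of this sketch. Interpreter overhead is included in both figures, so they overstate the raw kernel cost.

```python
import os
import time


def null_syscall_latency_us(iterations=100_000):
    """Average microseconds per trivial system call (getppid assumed as the null call)."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getppid()
    return (time.perf_counter() - start) / iterations * 1e6


def fork_exit_latency_us(iterations=50):
    """Average microseconds to fork a child that exits immediately, then reap it."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once without running cleanup handlers
        os.waitpid(pid, 0)  # parent: wait for the child to finish
    return (time.perf_counter() - start) / iterations * 1e6


if __name__ == "__main__":
    print(f"null call:   {null_syscall_latency_us():.2f} us")
    print(f"fork + exit: {fork_exit_latency_us():.0f} us")
```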
Context Switching
• The time to save the state of one process and restore the state of another process
• The processes are connected in a ring of Unix pipes
• A token is passed from process to process
• Each process allocates an array and sums it
• Context-switch time does not include the overhead of doing the work
• Two parameters: the number of processes and the size of each process
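The ring of pipes can be sketched as follows. This is a simplified illustration with a hypothetical function name: it omits the array summing, and it does not subtract the pipe overhead, so the figure it reports is the cost of one token hop rather than a pure context-switch time. On a multi-core machine the processes may also run on different cores.

```python
import os
import time


def ring_hop_latency_us(nprocs=4, rounds=1000):
    """Average microseconds per token hop around a ring of nprocs processes (nprocs >= 2)."""
    pipes = [os.pipe() for _ in range(nprocs)]  # pipes[i] feeds process i
    children = []
    for i in range(1, nprocs):
        pid = os.fork()
        if pid == 0:
            read_fd = pipes[i][0]
            write_fd = pipes[(i + 1) % nprocs][1]
            for _ in range(rounds):
                token = os.read(read_fd, 1)  # block until the token arrives
                os.write(write_fd, token)  # pass it to the next process
            os._exit(0)
        children.append(pid)

    # The parent acts as process 0: it injects the token and waits for it to return.
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(pipes[1][1], b"x")
        os.read(pipes[0][0], 1)
    elapsed = time.perf_counter() - start

    for pid in children:
        os.waitpid(pid, 0)
    for read_fd, write_fd in pipes:
        os.close(read_fd)
        os.close(write_fd)
    return elapsed / (rounds * nprocs) * 1e6


if __name__ == "__main__":
    print(f"token hop: {ring_hop_latency_us():.1f} us")
```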
Interprocess Communication Latencies
• Passing a small message back and forth between two processes
• The time reported is one round trip
• Message size: a byte or a word
• Metrics: pipe, Unix socket, UDP, TCP, RPC over UDP and TCP, TCP connection latency
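A round-trip measurement of this kind can be sketched for the Unix socket case. The function name is hypothetical and `socket.socketpair` is chosen here for brevity; the pipe, UDP and TCP variants would follow the same ping-pong pattern with a different transport.

```python
import os
import socket
import time


def socket_round_trip_us(rounds=1000):
    """Average microseconds for a one-byte round trip between two processes."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        parent_sock.close()
        for _ in range(rounds):
            child_sock.sendall(child_sock.recv(1))  # echo the byte back
        os._exit(0)

    child_sock.close()
    start = time.perf_counter()
    for _ in range(rounds):
        parent_sock.sendall(b"x")
        parent_sock.recv(1)  # wait for the echo: one full round trip
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    parent_sock.close()
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    print(f"Unix socket round trip: {socket_round_trip_us():.1f} us")
```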
File & VM System
• File create/delete: creates a number of small files in the current working directory and then removes the files
• Mmap latency: the cost of mmapping and unmapping files of varying sizes
• Prot fault: the time to catch a protection fault
• Page fault: the cost of page faulting pages from a file
• 100 fd select: the time to do a select on n file descriptors
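The file create/delete measurement can be sketched as below. The function name is hypothetical, the files are empty, and a temporary directory is used instead of the current working directory so the example cleans up after itself; results depend heavily on the file system in use.

```python
import os
import tempfile
import time


def create_delete_latency_us(count=1000):
    """Return (create_us, delete_us): average microseconds per empty file."""
    with tempfile.TemporaryDirectory() as directory:
        names = [os.path.join(directory, f"f{i}") for i in range(count)]

        start = time.perf_counter()
        for name in names:
            os.close(os.open(name, os.O_CREAT | os.O_WRONLY, 0o600))
        created = time.perf_counter()

        for name in names:
            os.unlink(name)
        deleted = time.perf_counter()

    return ((created - start) / count * 1e6, (deleted - created) / count * 1e6)


if __name__ == "__main__":
    create_us, delete_us = create_delete_latency_us()
    print(f"create: {create_us:.1f} us, delete: {delete_us:.1f} us")
```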
Memory Latencies
• Measures memory read latency for varying memory sizes and strides
• The size of the array starts from 512 bytes
• The stride varies from 16 to 1024
• Does not include the instruction execution time
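One common way to measure read latency over a given size and stride is a chain of dependent loads, where each load yields the index of the next. The sketch below shows how such a chain can be laid out and walked. The names `build_chain` and `chase` are hypothetical, and the stride is assumed to be in bytes. In Python the interpreter overhead dominates the timing, so this only illustrates the access pattern, not real memory latency.

```python
import time
from array import array


def build_chain(size_bytes, stride_bytes, word=8):
    """Build an index chain: each visited slot holds the index stride_bytes further on."""
    n = size_bytes // word
    step = max(1, stride_bytes // word)
    chain = array("q", [0]) * n  # 8-byte signed integers
    for i in range(0, n, step):
        nxt = i + step
        chain[i] = nxt if nxt < n else 0  # wrap back to the start at the end
    return chain


def chase(chain, loads):
    """Follow the chain for the given number of dependent loads."""
    p = 0
    for _ in range(loads):
        p = chain[p]  # the next index is only known once this load completes
    return p


if __name__ == "__main__":
    for stride in (16, 64, 256, 1024):
        chain = build_chain(1024 * 1024, stride)
        loads = 200_000
        start = time.perf_counter()
        chase(chain, loads)
        ns = (time.perf_counter() - start) / loads * 1e9
        print(f"stride {stride:5d}: {ns:.0f} ns per load (interpreter overhead included)")
```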
Conclusion
                        The best     Has problems
IPC bandwidth           Co           W, Cu
Cached I/O bandwidth    W            Co, Hg
Memory R/W bandwidth    W            Co, Hg
Process creation        Cu           Co
CPU ops                 W, Co, Hg    Cu
Network latency         W            Co, Cu
Memory latency          W, Co        Cu
THANK YOU !
Have a nice weekend !
References
• “Lmbench – Tools for Performance Analysis”, http://www.bitmover.com/lmbench/
• Larry McVoy and Carl Staelin, “Lmbench: Portable tools for performance analysis”, http://www.usenix.org/publications/library/proceedings/sd96/full_papers/mcvoy.pdf
• Carl Staelin, “Lmbench: an extensible micro-benchmark suite”, http://www.hpl.hp.com/techreports/2004/HPL-2004-213.html