
Performance Analysis of HPC with Lmbench

Didem Unat Supervisor: Nahil Sobh

July 22nd 2005

netfiles.uiuc.edu/dunat2/www

Lmbench: Micro-Benchmark Suite

• Simple, portable benchmarks
• Compares the performance of different Unix systems
• Measures latency and bandwidth
• Only analyzes the performance of the processor, memory, network, file system and disk

• Free software

Compiler & optimization issues

• The GNU C compiler was used on all resources except copper
• The IBM xlc compiler was used on copper
• All benchmarks were compiled with optimization -O, except the benchmarks that calculate clock speed and context switch times

Metrics in the Benchmark

Bandwidth
• Pipe/TCP
• Cached file read
• Memory copy
• Memory read/write

Latency
• System call
• Signal handling
• Process creation
• Basic CPU operations
• Context switching
• Inter process communication
• File and VM system
• Memory read latencies

Inter Process Communication Bandwidth

• Transfers 64 MB of data in 64 KB chunks through:
  • Unix pipe
  • Unix sockets
  • TCP/IP sockets

[Figure: bar chart of IPC bandwidth in MB/sec (axis 0 to 3000) for Pipe, AF Unix and TCP on W, Co, Cu and Hg]
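The pipe case above can be sketched in a few lines of Python. This is only an illustration of the idea (one process writes 64 MB in 64 KB chunks, the other reads and times it), not the Lmbench source; the function name `pipe_bandwidth` and its parameters are made up for this example, and it assumes a Unix system where `os.fork` is available.

```python
import os
import time

def pipe_bandwidth(total_bytes=64 << 20, chunk=64 << 10):
    """Return MB/sec for moving total_bytes through a Unix pipe (hypothetical helper)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child: the writer
        os.close(r)
        buf = b"\0" * chunk
        sent = 0
        while sent < total_bytes:
            # os.write returns the number of bytes actually written
            sent += os.write(w, buf[:min(chunk, total_bytes - sent)])
        os.close(w)
        os._exit(0)
    os.close(w)                           # parent: the reader
    start = time.perf_counter()
    received = 0
    while True:
        data = os.read(r, chunk)
        if not data:                      # EOF: writer closed its end
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    os.close(r)
    os.waitpid(pid, 0)
    return received / (1 << 20) / elapsed

if __name__ == "__main__":
    print(f"pipe: {pipe_bandwidth():.0f} MB/sec")
```

The number printed includes Python interpreter overhead, so it will be lower than what a C benchmark reports on the same machine.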





Cached file read

• A reread benchmark, intended to be used on a file that is in memory
• File reread: copies data from the kernel's file system page cache into the process's buffer
• Mmap reread: maps the entire file (8 MB) into the process's address space
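The two reread variants can be compared with a small sketch like the one below. It is an approximation under stated assumptions: the file is a temporary file created by the script, a warm-up pass is used to get it into the page cache, and the mmap variant touches the data by summing bytes. `reread_bandwidth` is a hypothetical name, not an Lmbench function.

```python
import mmap
import os
import tempfile
import time

def reread_bandwidth(size=8 << 20, chunk=64 << 10):
    """Return (read_mb_s, mmap_mb_s) for rereading a file that should be cached."""
    mb = size / (1 << 20)
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(size))
        f.flush()
        fd = f.fileno()

        # Warm-up pass so the file is likely to be in the page cache.
        os.lseek(fd, 0, os.SEEK_SET)
        while os.read(fd, chunk):
            pass

        # File reread: read() copies from the page cache into our buffer.
        os.lseek(fd, 0, os.SEEK_SET)
        start = time.perf_counter()
        while os.read(fd, chunk):
            pass
        read_s = time.perf_counter() - start

        # Mmap reread: map the whole file, then touch every byte by summing.
        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as m:
            with memoryview(m) as view:
                start = time.perf_counter()
                sum(view)
                mmap_s = time.perf_counter() - start
    return mb / read_s, mb / mmap_s

if __name__ == "__main__":
    read_bw, mmap_bw = reread_bandwidth()
    print(f"file reread: {read_bw:.0f} MB/sec, mmap reread: {mmap_bw:.0f} MB/sec")
```

In Python the byte-by-byte sum is dominated by interpreter overhead, so the mmap figure mainly shows the structure of the test rather than the hardware limit.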




Memory copy

• Measures how fast the system can bcopy data
• Bcopy copies n bytes from a source string to a destination string
• An 8 MB to 8 MB copy, which does not fit in the cache
• Kernel bcopy and C library bcopy
• C library bcopy is shown in the next slide
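As a rough illustration of the memory copy test, the sketch below times a bulk 8 MB to 8 MB copy between two buffers. It uses Python's `bytearray` slice assignment as a stand-in for the C library bcopy, which is an assumption of this example; `copy_bandwidth` and `repeats` are hypothetical names.

```python
import time

def copy_bandwidth(size=8 << 20, repeats=5):
    """Return the best observed MB/sec for copying `size` bytes between buffers."""
    src = bytearray(size)
    dst = bytearray(size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                  # one bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    return size / (1 << 20) / best

if __name__ == "__main__":
    print(f"memory copy: {copy_bandwidth():.0f} MB/sec")
```

Taking the best of several runs is a common way to reduce noise from other activity on the machine.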




Memory read/write

Read
• Measures the time to read data into the processor
• An unrolled loop that sums up a series of integers

Write
• Measures the time to write data to memory
• An unrolled loop that stores a value into an integer
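The shape of the two loops can be shown with the following sketch, which unrolls each loop body four times over an array of C integers. The unroll factor and all function names are assumptions for illustration; in Python the interpreter overhead dominates, so the resulting numbers say little about actual memory bandwidth.

```python
from array import array
import time

def read_sum(data):
    """Sum all elements with the loop body unrolled four times."""
    total = 0
    limit = len(data) - len(data) % 4
    for i in range(0, limit, 4):
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
    for i in range(limit, len(data)):     # leftover elements
        total += data[i]
    return total

def write_fill(data, value):
    """Store `value` into every element, unrolled four times."""
    limit = len(data) - len(data) % 4
    for i in range(0, limit, 4):
        data[i] = value
        data[i + 1] = value
        data[i + 2] = value
        data[i + 3] = value
    for i in range(limit, len(data)):     # leftover elements
        data[i] = value

def bandwidth_mb_s(func, data, *args):
    """Time one pass of func over data and convert to MB/sec."""
    start = time.perf_counter()
    func(data, *args)
    elapsed = time.perf_counter() - start
    return len(data) * data.itemsize / (1 << 20) / elapsed

if __name__ == "__main__":
    ints = array("i", [1]) * (1 << 20)
    print(f"read:  {bandwidth_mb_s(read_sum, ints):.1f} MB/sec")
    print(f"write: {bandwidth_mb_s(write_fill, ints, 0):.1f} MB/sec")
```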

Operating System Entry / Signal Handling / Process Creation Costs

• Process-related latencies
• System call: null call, null I/O, stat, open/close
• Signal handling: signal installation, signal handling
• Process creation: fork + exit, fork + execve, fork + /bin/sh -c
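Two of these latencies, the null call and fork + exit, can be approximated as below. The sketch assumes `getppid` as the cheap system call for the null call case and a Unix system with `os.fork`; the function names and iteration counts are illustrative only.

```python
import os
import time

def null_call_latency(iterations=100_000):
    """Average microseconds per getppid() call (a cheap system call)."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getppid()
    return (time.perf_counter() - start) / iterations * 1e6

def fork_exit_latency(iterations=50):
    """Average microseconds to fork a child that exits at once, then reap it."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)               # child exits without running cleanup
        os.waitpid(pid, 0)            # parent waits for the child
    return (time.perf_counter() - start) / iterations * 1e6

if __name__ == "__main__":
    print(f"null call:   {null_call_latency():.2f} us")
    print(f"fork + exit: {fork_exit_latency():.0f} us")
```

The null call figure here includes the cost of the Python function call around the system call, so it overstates the kernel entry cost.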

Context Switching

• The time to save the state of one process and restore the state of another process
• The processes are connected in a ring of Unix pipes
• A token is passed from process to process
• Each process allocates an array and sums the array
• The context-switch time does not include the overhead of doing the work
• Two parameters: number and size of processes
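The ring of pipes can be sketched as follows. This simplified version passes a one-byte token around `nprocs` processes and reports the average time per hop; unlike the benchmark described above, it does not sum an array in each process and does not subtract the pipe overhead, so the result is an upper bound on the context-switch time. `ring_hop_time` and its parameters are hypothetical names.

```python
import os
import time

def ring_hop_time(nprocs=4, rounds=1000):
    """Average microseconds per token hop around a ring of nprocs processes."""
    if nprocs < 2:
        raise ValueError("need at least two processes to switch between")
    # Process i reads from pipes[i] and writes to pipes[(i + 1) % nprocs].
    pipes = [os.pipe() for _ in range(nprocs)]
    children = []
    for i in range(1, nprocs):
        pid = os.fork()
        if pid == 0:
            rfd = pipes[i][0]
            wfd = pipes[(i + 1) % nprocs][1]
            for _ in range(rounds):
                os.write(wfd, os.read(rfd, 1))   # forward the token
            os._exit(0)
        children.append(pid)
    # The parent is process 0: it injects the token and waits for its return.
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(pipes[1][1], b"t")
        os.read(pipes[0][0], 1)
    elapsed = time.perf_counter() - start
    for pid in children:
        os.waitpid(pid, 0)
    for rfd, wfd in pipes:
        os.close(rfd)
        os.close(wfd)
    return elapsed / (rounds * nprocs) * 1e6

if __name__ == "__main__":
    print(f"{ring_hop_time():.1f} us per hop")
```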

Interprocess Communication Latencies

• Passing a small message back and forth between two processes
• The time reported is one round trip
• Message size: a byte or a word
• Metrics: Pipe, Unix socket, UDP and TCP, RPC over UDP and TCP, TCP connection latency
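For the Unix socket case, a round-trip measurement might look like the sketch below: the parent sends one byte, the child echoes it, and the average time per round trip is reported. It assumes a Unix system (`socket.socketpair` plus `os.fork`); `socketpair_round_trip` is a name made up for this example.

```python
import os
import socket
import time

def socketpair_round_trip(rounds=1000):
    """Average microseconds for a one-byte round trip over a Unix socket pair."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:                          # child: echo each byte back
        parent_sock.close()
        for _ in range(rounds):
            child_sock.sendall(child_sock.recv(1))
        os._exit(0)
    child_sock.close()
    start = time.perf_counter()
    for _ in range(rounds):
        parent_sock.sendall(b"x")         # send the message
        parent_sock.recv(1)               # wait for the echo
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    parent_sock.close()
    return elapsed / rounds * 1e6

if __name__ == "__main__":
    print(f"Unix socket round trip: {socketpair_round_trip():.1f} us")
```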

File & VM System

• File create/delete: creates a number of small files in the current working directory and then removes the files
• Mmap latency: the cost of mmapping and unmapping files of varying sizes
• Prot fault: the time to catch a protection fault
• Page fault: the cost of page faulting pages from a file
• 100 fd select: the time to do a select on 100 file descriptors
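The file create/delete measurement can be illustrated with the sketch below. As an assumption of this example, it works in a temporary directory rather than the current working directory so that it cleans up after itself; the function name and file count are hypothetical.

```python
import os
import tempfile
import time

def create_delete_latency(count=1000):
    """Return (create_us, delete_us): average microseconds per empty file."""
    with tempfile.TemporaryDirectory() as directory:
        names = [os.path.join(directory, f"f{i}") for i in range(count)]

        start = time.perf_counter()
        for name in names:
            # Create an empty file and close it immediately.
            os.close(os.open(name, os.O_CREAT | os.O_WRONLY, 0o644))
        create_s = time.perf_counter() - start

        start = time.perf_counter()
        for name in names:
            os.unlink(name)
        delete_s = time.perf_counter() - start
    return create_s / count * 1e6, delete_s / count * 1e6

if __name__ == "__main__":
    create_us, delete_us = create_delete_latency()
    print(f"create: {create_us:.1f} us, delete: {delete_us:.1f} us")
```

The results depend heavily on the file system that backs the directory, which is part of what this kind of test is meant to expose.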




Memory Latencies

• Measures memory read latency for varying memory sizes and strides

• The size of the array starts from 512 bytes

• The stride varies from 16 to 1024

• Does not include the instruction execution time
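A common way to measure read latency for a given array size and stride is a chain of dependent loads, where each element holds the index of the next one to visit. The sketch below shows that structure only; it is an assumption that this mirrors the benchmark's inner loop, sizes and strides are counted in list elements rather than bytes, and Python's interpreter overhead hides the real hardware latency. All names are illustrative.

```python
import time

def build_chain(n_elems, stride):
    """Build a list where each visited slot holds the index of the next slot."""
    chain = [0] * n_elems
    for i in range(0, n_elems, stride):
        nxt = i + stride
        chain[i] = nxt if nxt < n_elems else 0   # wrap around at the end
    return chain

def chase(chain, hops):
    """Follow the chain for `hops` loads; each load depends on the previous one."""
    p = 0
    for _ in range(hops):
        p = chain[p]
    return p

def ns_per_load(n_elems, stride, hops=100_000):
    """Average nanoseconds per dependent load for one size and stride."""
    chain = build_chain(n_elems, stride)
    start = time.perf_counter()
    chase(chain, hops)
    return (time.perf_counter() - start) / hops * 1e9

if __name__ == "__main__":
    for stride in (16, 64, 256, 1024):
        print(f"stride {stride}: {ns_per_load(1 << 20, stride):.1f} ns per load")
```

Because every load needs the result of the one before it, the processor cannot overlap them, which is what makes the loop measure latency rather than bandwidth.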

Conclusion

Metric                  The best     Has problems
IPC bandwidth           Co           W, Cu
Cached I/O bandwidth    W            Co, Hg
Memory R/W bandwidth    W            Co, Hg
Process creation        Cu           Co
CPU ops                 W, Co, Hg    Cu
Network latency         W            Co, Cu
Memory latency          W, Co        Cu

THANK YOU !

Have a nice weekend !
