Making the “Box” Transparent: System Call Performance as a First-class Result

Yaoping Ruan, Vivek Pai
Princeton University


Page 1: Making the “Box” Transparent: System Call Performance as a First-class Result Yaoping Ruan, Vivek Pai Princeton University.


Page 2: Outline

• Motivation
• Design & implementation
• Case study
• More results

Page 3: Beginning of the Story

• Flash Web Server yields a very low score on the SPECWeb99 benchmark: 200
  • 220: Standard Apache
  • 400: Customized Apache
  • 575: Best (Tux on comparable hardware)
• However, Flash Web Server is known to be fast

Motivation

Page 4: Performance Debugging of OS-Intensive Applications

Web Servers (+ full SpecWeb99)
• High throughput
• Large working sets
• Dynamic content
• QoS requirements
• Workload scaling

• CPU/network
• disk activity
• multiple programs
• latency-sensitive
• overhead-sensitive

How do we debug such a system?

Motivation

Page 5: Current Tradeoffs

• Measurement calls (getrusage(), gettimeofday()): online, but guesswork and inaccuracy
• Call graph profiling (gprof): detailed, but high overhead (> 40%)
• Statistical sampling (DCPI, Vtune, Oprofile): fast, but at the cost of completeness

Motivation

Page 6: Example of Current Tradeoffs

[Figure: wall-clock time of a non-blocking system call, comparing gettimeofday( ) with in-kernel measurement]

Motivation
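To make the tradeoff concrete, here is a minimal sketch of the user-level approach the slide refers to: wrapping a system call in two gettimeofday( ) calls. The helper names `elapsed_usec` and `timed_write` are hypothetical and do not come from the talk. The sketch only yields wall-clock time from the application's side; it cannot tell whether the call blocked, or on what.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L
         + (long)(end->tv_usec - start->tv_usec);
}

/*
 * Hypothetical helper: time one write() call from user space.
 * Every measured call pays for two extra gettimeofday() calls, and the
 * result is wall-clock time only, with no insight into kernel behavior.
 */
static ssize_t timed_write(int fd, const void *buf, size_t len, long *usec_out)
{
    struct timeval start, end;
    ssize_t n;

    assert(usec_out != NULL);
    gettimeofday(&start, NULL);
    n = write(fd, buf, len);
    gettimeofday(&end, NULL);
    *usec_out = elapsed_usec(&start, &end);
    return n;
}
```

A caller would pass a descriptor and a buffer and then log `*usec_out` per call site; any interpretation of that number (for example, guessing that a slow call blocked on disk) remains guesswork.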

Page 7: Design Goals

• Correlate kernel information with application-level information
• Low overhead on “useless” data
• High detail on useful data
• Allow application to control profiling and programmatically react to information

Design

Page 8: DeBox: Splitting Measurement Policy & Mechanism

• Add profiling primitives into kernel
• Return feedback with each syscall
  • First-class result, like errno
• Allow programs to interactively profile
  • Profile by call site and invocation
  • Pick which processes to profile
  • Selectively store/discard/utilize data

Design

Page 9: DeBox Architecture

App: … res = read (fd, buffer, count) …

Kernel: read ( ) { Start profiling; … End profiling; Return; }

Returned to the app: errno, buffer, DeBox Info

DeBox Info can be used online or offline

Design

Page 10: DeBox Data Structure

DeBoxControl ( DeBoxInfo *resultBuf, int maxSleeps, int maxTrace )

• Basic DeBox Info: performance summary
• Sleep Info (entries 0, 1, …): blocking information of the call
• Trace Info (entries 0, 1, 2, …): full call trace & timing

Implementation

Page 11: Sample Output: Copying a 10MB Mapped File

Basic DeBox Info
  System call # (write( ))    4
  Call time (usec)            3591064
  # of page faults            989
  # of PerSleepInfo used      2
  # of CallTrace used         0

PerSleepInfo[0]
  # occurrences               1270
  Time blocked (usec)         723903
  Resource label              biowr
  File where blocked          kern/vfs_bio.c
  Line where blocked          2727
  # processes on entry        1
  # processes on exit         0

CallTrace
  (depth)  function         time
  0        write( )         3591064
  1        holdfp( )        25
  2        fhold( )         5
  1        dofilewrite( )   3590326
  2        fo_write( )      3586472
  …

Implementation
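The three-part result buffer can be pictured as a C structure. The sketch below is only an illustration: every type and field name is an assumption derived from the labels on the two preceding slides, not the actual DeBox definition, and the fixed array sizes stand in for the `maxSleeps` and `maxTrace` arguments of DeBoxControl. It also shows one way an application might react to the data, by computing how much of a call's time was spent blocked.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative limits; in DeBox these would come from maxSleeps/maxTrace. */
#define MAX_SLEEPS 4
#define MAX_TRACE  8

/* Hypothetical layout: one record per distinct blocking point. */
struct per_sleep_info {
    int         occurrences;      /* "# occurrences" */
    long        time_blocked_us;  /* "Time blocked (usec)" */
    const char *resource_label;   /* "Resource label", e.g. "biowr" */
    const char *file;             /* "File where blocked" */
    int         line;             /* "Line where blocked" */
    int         procs_on_entry;   /* "# processes on entry" */
    int         procs_on_exit;    /* "# processes on exit" */
};

/* Hypothetical layout: one record per traced kernel function. */
struct call_trace_entry {
    int         depth;
    const char *function;
    long        time;
};

/* Hypothetical stand-in for DeBoxInfo (not the real definition). */
struct debox_info_sketch {
    int  syscall_number;
    long call_time_us;
    int  page_faults;
    int  sleeps_used;
    int  traces_used;
    struct per_sleep_info   sleeps[MAX_SLEEPS];
    struct call_trace_entry trace[MAX_TRACE];
};

/* Total time the call spent blocked, summed over the recorded sleeps. */
static long total_blocked_us(const struct debox_info_sketch *info)
{
    long total = 0;

    assert(info != NULL && info->sleeps_used <= MAX_SLEEPS);
    for (int i = 0; i < info->sleeps_used; i++)
        total += info->sleeps[i].time_blocked_us;
    return total;
}

/* Percentage of the call's wall-clock time spent blocked (0 if unknown). */
static double blocked_percent(const struct debox_info_sketch *info)
{
    if (info->call_time_us <= 0)
        return 0.0;
    return 100.0 * (double)total_blocked_us(info) / (double)info->call_time_us;
}
```

Filled with the sample values above (call time 3591064 usec, 723903 usec blocked on `biowr`), `blocked_percent` reports that roughly a fifth of the write( ) call was spent blocked on that one resource.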

Page 12: In-kernel Implementation

• 600 lines of code in FreeBSD 4.6.2
• Monitor trap handler
  • System call entry, exit, page fault, etc.
• Instrument scheduler
  • Time, location, and reason for blocking
  • Identify dynamic resource contention
• Full call path tracing + timings
  • More detail than gprof’s call arc counts

Implementation

Page 13: General DeBox Overheads

Implementation

Page 14: Case Study: Using DeBox on the Flash Web server

Experimental setup
• Mostly 933MHz PIII
• 1GB memory
• Netgear GA621 gigabit NIC
• Ten PII 300MHz clients

[Figure: SPECWeb99 Results (0 to 900) vs. dataset size (GB) from 0.12 to 2.64; series: orig, VM Patch, sendfile]

Case Study

Page 15: Example 1: Finding Blocking System Calls

Blocking system calls, resource blocked and their counts

  open( ) / 190        inode - 28, biord - 162
  readlink( ) / 84     inode - 84
  read( ) / 12         biord - 3, inode - 9
  stat( ) / 6          inode - 6
  unlink( ) / 1        biord - 1
  …

Case Study
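A table like the one above could be built in the application by tallying each blocking report by system call and resource label. The following is a minimal sketch under that assumption; `block_table`, `record_block`, and `total_for_syscall` are hypothetical names, and the fixed-size linear table is chosen for brevity rather than taken from Flash or DeBox.

```c
#include <assert.h>
#include <string.h>

#define MAX_SITES 64

/* One row per (system call, resource label) pair. */
struct block_count {
    const char *syscall_name;  /* caller keeps the string alive */
    const char *resource;      /* e.g. "inode" or "biord" */
    int         count;
};

struct block_table {
    struct block_count rows[MAX_SITES];
    int                used;
};

/* Record one blocking event; returns the updated count, or -1 if full. */
static int record_block(struct block_table *t, const char *syscall_name,
                        const char *resource)
{
    assert(t != NULL && syscall_name != NULL && resource != NULL);
    for (int i = 0; i < t->used; i++) {
        if (strcmp(t->rows[i].syscall_name, syscall_name) == 0 &&
            strcmp(t->rows[i].resource, resource) == 0)
            return ++t->rows[i].count;
    }
    if (t->used == MAX_SITES)
        return -1;
    t->rows[t->used].syscall_name = syscall_name;
    t->rows[t->used].resource = resource;
    t->rows[t->used].count = 1;
    t->used++;
    return 1;
}

/* Total blocking events for one system call across all resources. */
static int total_for_syscall(const struct block_table *t,
                             const char *syscall_name)
{
    int total = 0;

    for (int i = 0; i < t->used; i++)
        if (strcmp(t->rows[i].syscall_name, syscall_name) == 0)
            total += t->rows[i].count;
    return total;
}
```

Feeding it the read( ) row from the slide (3 events on `biord`, 9 on `inode`) gives a per-call total of 12, matching the count shown next to read( ).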

Page 16: Example 1 Results & Solution

• Mostly metadata locking
• Direct all metadata calls to name convert helpers
• Pass open FDs using sendmsg( )

[Figure: SPECWeb99 Results (0 to 900) vs. dataset size (GB) from 0.12 to 2.64; series: orig, VM Patch, sendfile, FD Passing]

Case Study
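Passing an open descriptor between processes with sendmsg( ) relies on the standard `SCM_RIGHTS` control message over a Unix-domain socket. The sketch below shows that general mechanism only; the function names `send_fd` and `recv_fd` are hypothetical, and this is not the code used in Flash. A helper process could open the file (absorbing any metadata blocking) and hand the resulting descriptor back to the main server this way.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized for one descriptor; the union keeps it aligned. */
union fd_ctrl {
    char           buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one open descriptor over a Unix-domain socket. Returns 0 or -1. */
static int send_fd(int sock, int fd_to_send)
{
    char byte = 'F';                      /* one byte of ordinary payload */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd_to_send >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;         /* payload is file descriptors */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(). Returns the new fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process that refers to the same open file, so the main process can use it without performing the open( ) itself.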

Page 17: Example 2: Capturing Rare Anomaly Paths

FunctionX( ) { … open( ); … }

FunctionY( ) { open( ); … open( ); }

When blocking happens:
• abort( ); (or fork + abort)
• Record call path

1. Which call caused the problem
2. Why this path is executed (App + kernel call trace)

Case Study
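The "fork + abort" variant can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `snapshot_on_block` and `reap_snapshot` are hypothetical names, and the `blocked` flag stands in for whatever check the application makes on the per-call profiling result. The child aborts immediately, so a core dump (where enabled) preserves the call path of the offending call, while the parent keeps running.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical hook: call right after a system call that was observed
 * to block. Returns 0 if nothing was done, the child's pid if a snapshot
 * child was created, or -1 if fork() failed.
 */
static pid_t snapshot_on_block(int blocked)
{
    pid_t pid;

    if (!blocked)
        return 0;          /* common case: the call did not block */
    pid = fork();
    if (pid == 0)
        abort();           /* child: die with SIGABRT at this call site */
    return pid;            /* parent: continue serving requests */
}

/* Reap a snapshot child; returns 1 if it ended via SIGABRT as intended. */
static int reap_snapshot(pid_t pid)
{
    int status = 0;

    assert(pid > 0);
    if (waitpid(pid, &status, 0) != pid)
        return 0;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

Calling plain abort( ) in the server itself is simpler but stops the workload; forking first trades a little extra cost on the rare blocking path for an uninterrupted run.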

Page 18: Example 2 Results & Solution

• Cold cache miss path
• Allow read helpers to open cache missed files
• More benefit in latency

[Figure: SPECWeb99 Results (0 to 900) vs. dataset size (GB) from 0.12 to 2.64; series: orig, VM Patch, sendfile, FD Passing]

Case Study

Page 19: Example 3: Tracking Call History & Call Trace

Call time of fork( ) as a function of invocation

Call trace indicates:

• File descriptors copy - fd_copy( )

• VM map entries copy - vm_copy( )

Case Study

Page 20: Example 3 Results & Solution

• Similar phenomenon on mmap( )
• Fork helper
• Eliminate mmap( )

[Figure: SPECWeb99 Results (0 to 900) vs. dataset size (GB) from 0.12 to 2.64; series: orig, VM Patch, sendfile, FD Passing, Fork helper, No mmap()]

Case Study

Page 21: Sendfile Modifications (Now Adopted by FreeBSD)

• Cache pmap/sfbuf entries
• Return special error for cache misses
• Pack header + data into one packet

Case Study

Page 22: SPECWeb99 Scores

[Figure: SPECWeb99 scores (0 to 900) vs. dataset size (GB) from 0.12 to 2.95; series: orig, VM Patch, sendfile, FD Passing, Fork helper, No mmap(), New CGI Interface, New sendfile( )]

Case Study

Page 23: New Flash Summary

• Application-level changes
  • FD passing helpers
  • Move fork into helper process
  • Eliminate mmap cache
  • New CGI interface
• Kernel sendfile changes
  • Reduce pmap/TLB operation
  • New flag to return if data missing
  • Send fewer network packets for small files

Case Study

Page 24: Throughput on SPECWeb Static Workload

[Figure: throughput (Mb/s, 0 to 700) vs. dataset size (500 MB, 1.5 GB, 3.3 GB); series: Apache, Flash, New-Flash]

Other Results

Page 25: Latencies on 3.3GB Static Workload

[Figure: latency comparison, annotated with improvements of 44X and 47X]

Mean latency improvement: 12X

Other Results

Page 26: Throughput Portability (on Linux)

(Server: 3.0GHz P4, 1GB memory)

[Figure: throughput (Mb/s, 0 to 1400) vs. dataset size (500 MB, 1.5 GB, 3.3 GB); series: Haboob, Apache, Flash, New-Flash]

Other Results

Page 27: Latency Portability (3.3GB Static Workload)

[Figure: latency comparison, annotated with improvements of 3.7X and 112X]

Mean latency improvement: 3.6X

Other Results

Page 28: Summary

• DeBox is effective on OS-intensive applications and complex workloads
  • Low overhead on real workloads
  • Fine detail on real bottleneck
  • Flexibility for application programmers
• Case study
  • SPECWeb99 score quadrupled
    • Even with dataset 3x of physical memory
  • Up to 36% throughput gain on static workload
  • Up to 112x latency improvement
  • Results are portable

Page 29: Thank you

www.cs.princeton.edu/~yruan/debox

[email protected]@cs.princeton.edu

Page 30: SpecWeb99 Scores

  Standard Flash                    200
  Standard Apache                   220
  Apache + special module           420
  Highest 1GB/1GHz score            575
  Improved Flash                    820
  Flash + dynamic request module    1050

Page 31: SpecWeb99 on Linux

  Standard Flash                    600
  Improved Flash                    1000
  Flash + dynamic request module    1350

3.0GHz P4 with 1GB memory

Page 32: New Flash Architecture