SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim...
-
Upload
patience-francis -
Category
Documents
-
view
225 -
download
0
Transcript of SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim...
SuperPin: Parallelizing Dynamic Instrumentation
for Real-Time Performance
Steven Wallaceand Kim Hazelwood
2 Hazelwood – CGO 2007
Dynamic Binary Instrumentation
Inserts user-defined instructions into executing binaries
•Easily•Efficiently•Transparently
Why?•Detect inefficiencies•Detect bugs•Security checks•Add features
Examples•Valgrind, DynamoRIO, Strata, HDTrans, Pin
EXE
Transform
CodeCache
Execute
Profile
3 Hazelwood – CGO 2007
Intel Pin
• A dynamic binary instrumentation system
• Easy-to-use instrumentation interface
• Supports multiple platforms– Four ISAs – IA32, Intel64, IPF, ARM– Four OSes – Linux, Windows, FreeBSD, MacOS
• Robust and stable (Pin can run itself!)– 12+ active developers– Nightly testing of 25000 binaries on 15 platforms– Large user base in academia and industry – Active mailing list (Pinheads)
• 11,500 downloads
4 Hazelwood – CGO 2007
Our Goal: Improve Performance
The latest Pin overhead numbers …
100%
120%
140%
160%
180%
200%
perlbench
sjeng
xalancbm
k
gobm
k
gcc
h264ref
omnetpp
bzip2
libquantum mcf
astar
hmmer
Rel
ativ
e to
Nat
ive
5 Hazelwood – CGO 2007
Adding Instrumentation
100%
200%
300%
400%
500%
600%
700%
800%
perlbench
sjeng
xalancbm
k
gobm
k
gcc
h264ref
omnetpp
bzip2
libquantum mcf
astar
hmmer
Rel
ativ
e to
Nat
ive Pin
Pin+icount
6 Hazelwood – CGO 2007
Sources of Overhead
Internal
• Compiling code & exit stubs (region detection, region formation, code generation)
• Managing code (eviction, linking)
• Managing directories and performing lookups
• Maintaining consistency (SMC, DLLs)
External
• User-inserted instrumentation
7 Hazelwood – CGO 2007
“Normal Pin” Execution Flow
Instrumentation is interleaved with application
Uninstrumented Application
Instrumented Application
Pin Overhead
Instrumentation Overhead
“Pinned” Application
time
8 Hazelwood – CGO 2007
“SuperPin” Execution Flow
SuperPin creates instrumented slices
Uninstrumented Application
SuperPinned Application
Instrumented Slices
9 Hazelwood – CGO 2007
Issues and Design Decisions
Creating slices• How/when to start a slice• How/when to end a slice
System calls
Merging results
10 Hazelwood – CGO 2007
for k
S6+fo
r k
S5+fo
r k
S4+fo
r kre
cord
si
gr3
, sl
eep
S3+
for k
S2+
sleep
S1+
Execution Timelinefo
r k
S1 S2 S3 S4 S5 S6
dete
ct
sigr4
dete
ct
exit
resu
me
dete
ct
sigr3
dete
ct
sigr6
dete
ct
sigr2
dete
ct
sigr5
resu
me
reco
rd
sigr4
, sl
eep
CPU2
CPU3
CPU4
time
reco
rd
sigr2
, sl
eep
resu
me
resu
me
reco
rd
sigr5
, sl
eep
resu
me
reco
rd
sigr6
, sl
eep
resu
me
original application
instrumentedapplication slices
CPU1
11 Hazelwood – CGO 2007
Starting Slices
How?
• Master application process – ptrace
• Controlling process
• Child slices – fork– Reserve memory for transparency– Each slice has its own code cache (for now)
When?
• Timeouts– Uses a special timer process– Tunable parameter
• System calls
12 Hazelwood – CGO 2007
S1
Handling System Calls
Problem: Don’t want to duplicate system calls in the main application/slices
Solutions:
• brk or anonymous mmap – duplication OK
• frequent calls – record and playback
• default – trigger new timeslice
S2
S1Instr AInstr BSysCallInstr CInstr D
Instr AInstr BMMAPInstr CInstr D
S1
Instr AInstr BSysCallPlayInstr CInstr D
Duplicate Record/Playback End Slice
13 Hazelwood – CGO 2007
Ending Slices
Each slice is responsible for detecting its own end-of-slice condition
S4+
record sig4, sleep
resume
detect sig5
Challenges: • Need to efficiently capture a point in time (signature)• Need to efficiently detect when we’ve reached that point
14 Hazelwood – CGO 2007
Signature Detection
End-of-slice conditions:
1. System calls – easy to detect
2. Timeouts at arbitrary points – harder to detect
Signature match:
• Instruction pointer
• Architectural state
• Top of stack
15 Hazelwood – CGO 2007
Implementing Signature Detection
Uses Pin’s lightweight conditional analysis• INS_InsertIfCall – lightweight inlined check• INS_InsertThenCall – heavyweight (conditional)
analysis routine
Instrument the end-of-slice instruction pointer
1. Lightweight check – two registers
2. Heavyweight check – full architectural state
3. Heavyweight check – top 100 words on the stack
• Lightweight triggers heavyweight: ~2%
• Stack check fails: ~0%
16 Hazelwood – CGO 2007
Performance Results
Icount1 – Instruments every instruction with count++
% pin –t icount1 -- ./helloHello CGOCount: 496043
Icount2 – Instruments every basic block with count += bblength
% pin –t icount2 -- ./helloHello CGOCount: 496043
17 Hazelwood – CGO 2007
Performance – icount1
% pin –t icount1 -- <benchmark>
0%
500%
1000%
1500%
2000%
2500%
3000%
ammp
applu
apsi art
bzip2
crafty
eon
equake
facerec
fma3d
galgel
gap
gcc
gzip
lucas
mcf
mesa
mgrid
parser
perlbm
sixtrack
swim
twolf
vortex vpr
wupwis
AVG
Pin SuperPin
18 Hazelwood – CGO 2007
Performance – icount2
% pin –t icount2 -- <benchmark>
0%
200%
400%
600%
800%
1000%am
mp
applu
apsi art
bzip2
crafty
eon
equake
facerec
fma3d
galgel
gap
gcc
gzip
lucas
mcf
mesa
mgrid
parser
perlbm
sixtrack
swim
twolf
vortex vpr
wupwis
AVG
Pin SuperPin
19 Hazelwood – CGO 2007
Performance Scalability
0
100
200
300
400
500
600
1 2 4 8 12 16Max # Running Slices
Ru
nti
me
(sec
)
Running on an 8-processor HT-enabled machine (16 virtual processors)
20 Hazelwood – CGO 2007
Overhead Categorization
Where is SuperPin spending its time?• Executing the application• Fork overheads• Sleeping (waiting to start a slice)• Pipeline delays
for k for k for k for k
S1 S2 S3 S4
reco
rd
sigr2
, sl
eep
dete
ct
sigr3
resu
me
reco
rd
sigr4
, sl
eep
dete
ct
exit
resu
me
sleep
dete
ct
sigr2
resu
me
dete
ct
sigr4
reco
rd
sigr3
, sl
eep
resu
me
CPU3
S2+ S4+
S1+ S3+
21 Hazelwood – CGO 2007
Overhead Categorization
0
40
80
120
160
200
Ru
nti
me
(sec
)
.5s 1s 2s 4s
Timeslice (sec)
native fork&others sleep pipeline
22 Hazelwood – CGO 2007
The SuperPin API
You may write SuperPin-aware Pintools:– SP_Init(fun)– SP_AddSliceBeginFunction(fun,val)– SP_AddSliceEndFunction(fun,val)– SP_EndSlice()– SP_CreateSharedArea(local,size,merge)
You may also control (via switches):– Spmsec {value}: milliseconds per timeslice– Spmp {value}: maximum slice count– Spsysrecs {value}: maximum syscalls per slice
23 Hazelwood – CGO 2007
Pin Instrumentation API – icount2
VOID DoCount(INT32 c) { icount += c;}
VOID Trace(TRACE trace, VOID *v) {for (BBL bbl=Head(trace); Valid(bbl); bbl=Next(bbl)) { INS_InsertCall(BBL_InsHead(bbl), IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_INT32, BBL_NumIns(bbl), IARG_END); }
}
int main(INT32 argc, CHAR **argv) {
PIN_Init(argc, argv);
TRACE_AddInstrumentFunction(Trace);
PIN_StartProgram();
return 0;
}
24 Hazelwood – CGO 2007
SuperPin Version of Icount2
VOID DoCount(INT32 c) {//same as before}
VOID ToolReset(INT32 c) { icount = 0;} VOID Merge(INT32 sliceNum) { *sharedData += icount;}
VOID Trace(TRACE trace, VOID *v) {//same as before}
int main(INT32 argc, CHAR **argv) { PIN_Init(argc, argv); SP_Init(ToolReset); sharedData = (UINT64*) SP_CreateSharedArea(&icount, sizeof(icount),0); SP_AddSliceEndFunction(Merge); TRACE_AddInstrumentFunction(Trace); PIN_StartProgram(); return 0; }
25 Hazelwood – CGO 2007
SuperPin Limitations
Not all instrumentation tasks are a good fit
Great fit – independent tasks• Instruction profiling (counts, distributions)• Trace generation
Requires massaging – dependent tasks• Branch prediction• Data cache simulation
– Assume a starting state, resolve later
Stick with regular Pin – path modification• Adaptive execution
26 Hazelwood – CGO 2007
Future (Rainy Day) Extensions
• Adaptive parallelism detection– Hardware feedback: adapts to available processors
– OS feedback: adapts to present load
• Adaptive slice timeouts
• Slice-shared code caches
• Multithreaded application support
27 Hazelwood – CGO 2007
SuperPin Summary
Allows users to leverage available parallelism for certain instrumentation tasks
• Hides the gory details
• Enables significant speedup (for the right tasks … on the right machines)
• Exposed as Pin API extensions
Download it today!
http://rogue.colorado.edu/pin
28 Hazelwood – CGO 2007
Thank You
29 Hazelwood – CGO 2007
Merging Results
Each slice is a separate process must merge slice-local results into a collective total
Merge routines:
• Addition
• Concatenation
• User-defined
Executed in program order
Use shared memory