Compactly Representing Parallel Program Executions
Ankit Goel, Abhik Roychoudhury, Tulika Mitra
National University of Singapore
Path profiles
Profiling a program’s execution
– Count based
– Path based
Count based profiles are more aggregate
– # of executions of the program’s basic blocks
– # of accesses of various memory locations
Path based profiles are more accurate
– Sequence of basic blocks executed
– Sequence of memory locations accessed
Use online compression to generate compact path profiles.
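
To make the count based versus path based distinction concrete, here is a minimal sketch. The two traces of basic-block IDs are hypothetical and chosen only so that their counts coincide while their orders differ.

```python
from collections import Counter

# Hypothetical traces of basic-block IDs from two runs of the same program.
trace_a = [1, 2, 3, 1, 2, 3]
trace_b = [1, 1, 2, 2, 3, 3]

def count_profile(trace):
    """Count based profile: how many times each basic block executed."""
    return Counter(trace)

def path_profile(trace):
    """Path based profile: the full sequence of basic blocks executed."""
    return tuple(trace)

# Both runs execute every block twice, so the count profiles agree ...
same_counts = count_profile(trace_a) == count_profile(trace_b)
# ... but the execution orders differ, which only the path profiles reveal.
same_paths = path_profile(trace_a) == path_profile(trace_b)
print(same_counts, same_paths)
```

The path profile grows with the length of the execution, which is why the rest of the talk is about compressing it.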
Organization
Compressed Path Profiles in Sequential Programs
Parallel Program Path Profiles
Compression Efficiency and Overheads
Data race detection over path profiles
Compressed Path - Example
[Figure: Control Flow Graph with basic blocks 1, 2, 3]
Uncompressed path: 123123
Compressed representation:
S -> AA
A -> 123
Online Path Compression
A program path is a string over a finite alphabet
Alphabet decided by what we instrument
– Control flow (basic blocks executed)
– Data flow (memory locations accessed)
A string s is represented by a context-free grammar Gs: the language of Gs is {s}
Construction of Gs is online and not post-mortem
– Start with a trivial grammar & modify it for each symbol
No recursive rules (DAG representation)
Compression scheme: Nevill-Manning & Witten 97
– Application to program paths: Larus 99
Online Compression in action
Path executed    Compressed representation
1                S -> 1
12               S -> 12
123              S -> 123
1231             S -> 1231
12312            S -> 12312
                 S -> A3A,  A -> 12
Online Compression in action
Path executed    Compressed representation
123123           S -> A3A3,  A -> 12
                 S -> BB,  B -> A3,  A -> 12
                 S -> BB,  B -> 123
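
The trace above can be reproduced with a small brute-force sketch in the spirit of the Nevill-Manning & Witten scheme. After each appended symbol it restores two invariants: no pair of adjacent symbols occurs twice, and every rule is used at least twice. It rescans the whole grammar on every step, so it is far slower than the original linear-time algorithm and is not the authors' implementation. The rule names `R1`, `R2`, ... and the convention that terminals are integers are assumptions made for illustration.

```python
from collections import Counter

def _find_repeat(rules):
    """Return a digram that occurs twice without overlapping, or None."""
    seen = {}
    for name, body in rules.items():
        for i in range(len(body) - 1):
            d = (body[i], body[i + 1])
            if d in seen:
                pname, pi = seen[d]
                if pname == name and pi + 1 == i:  # overlapping, e.g. "111"
                    continue
                return d
            seen[d] = (name, i)
    return None

def _replace(body, d, nt):
    """Replace every non-overlapping occurrence of digram d in body by nt."""
    out, i = [], 0
    while i < len(body):
        if i + 1 < len(body) and (body[i], body[i + 1]) == d:
            out.append(nt)
            i += 2
        else:
            out.append(body[i])
            i += 1
    return out

def compress_online(path):
    """Consume integer terminals one at a time; return {nonterminal: body}."""
    rules, fresh = {"S": []}, 0
    for sym in path:
        rules["S"].append(sym)
        while True:
            d = _find_repeat(rules)
            if d is not None:
                # Reuse a rule whose whole body is this digram, else create one.
                nt = next((n for n, b in rules.items()
                           if n != "S" and b == list(d)), None)
                if nt is None:
                    fresh += 1
                    nt = f"R{fresh}"
                    rules[nt] = list(d)
                for n in rules:
                    if n != nt:
                        rules[n] = _replace(rules[n], d, nt)
                continue
            # Rule utility: inline any rule that is referenced only once.
            uses = Counter(s for b in rules.values() for s in b if s in rules)
            once = next((n for n in rules if n != "S" and uses[n] == 1), None)
            if once is None:
                break
            body = rules.pop(once)
            for n in rules:
                rules[n] = [t for s in rules[n]
                            for t in (body if s == once else [s])]
    return rules

def expand(rules, sym="S"):
    """Decompress: return the terminal string derived from sym."""
    if sym not in rules:
        return [sym]
    return [t for s in rules[sym] for t in expand(rules, s)]

rules = compress_online([1, 2, 3, 1, 2, 3])
print(rules)
```

On the path 123123 the sketch passes through the same intermediate grammars as the slides and ends with a start rule that references a single rule for 123 twice.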
Organization
Compressed Path Profiles in Sequential Programs
Parallel Program Path Profiles
Compression Efficiency and Overheads
Data race detection over path profiles
What to represent?
Control/data flow in each program thread
Communication among threads
– Synchronization (locks, barriers)
– Unsynchronized shared variable accesses
Too costly to observe/record the order of all shared variable accesses
We will represent
– Compressed flow in each thread (via grammar)
– Communication via synchronizations (How?)
Synchronization Pattern (Locks)
[Figure: Message Sequence Chart (MSC) for Pgm = P1 || P2, with lifelines P1, P2 and Memory and events lock, unlock, Compute, lock, unlock]
Synchronization Pattern (Barrier)
[Figure: MSC for Pgm = P1 || P2, with lifelines P1, P2 and Memory and events ready, ready, Blocked, go, go, Compute, Compute]
Connection to MSCs
Partial order of the MSC matches the observed ordering
• Total order in each thread
• Ordering across threads visible via synchronization (msg. exchange)
All synchronization ops. form a total order
[Figure: MSC with lifelines Th. 1, Th. 2 and Shared Mem., showing an unlock and a lock]
A first cut
Instrument each thread to observe local control/data flow and global synch.
Represent path profile of P1 || P2
– Each thread’s flow as a grammar: (G1, G2). Contains synch. ops. as well.
– All synchronization ops. as a list.
– Associate entries in this list to the occurrences of synch. ops. in (G1, G2)
How to navigate the path profile?
– Zoom in to a specific lock-unlock segment of P1
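
One way to picture this first cut is as a recorder that keeps one flow per thread plus one global list of synchronization operations. The sketch below is an assumption-laden illustration: the class `PathProfile`, its field names, and the event tuples are hypothetical, and a plain Python list stands in for each thread's grammar.

```python
from dataclasses import dataclass, field

@dataclass
class PathProfile:
    # thread id -> symbols observed in that thread (stand-in for its grammar)
    flows: dict = field(default_factory=dict)
    # global total order of synch. ops.: (thread id, k) = k-th synch. op. of that thread
    sync_list: list = field(default_factory=list)
    _sync_seen: dict = field(default_factory=dict)

    def record(self, thread, symbol, is_sync=False):
        """Append a symbol to the thread's flow; log synch. ops. globally too."""
        self.flows.setdefault(thread, []).append(symbol)
        if is_sync:
            k = self._sync_seen.get(thread, 0) + 1
            self._sync_seen[thread] = k
            self.sync_list.append((thread, k))

# Hypothetical interleaving of two threads contending for one lock.
profile = PathProfile()
for thread, symbol, is_sync in [
    ("P1", "a", False), ("P1", "lock", True), ("P2", "p", False),
    ("P1", "unlock", True), ("P2", "lock", True), ("P1", "x", False),
    ("P2", "unlock", True),
]:
    profile.record(thread, symbol, is_sync)

print(profile.sync_list)
```

Each global entry names a thread and an occurrence number, which is the link between the list and the synch. ops. inside that thread's flow.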
Edge annotations
Path of one thread: a, b (lock), c (unlock), x, b (lock), c (unlock), y
[Figure: Grammar for one thread, drawn as a DAG with nodes S, A, a, b, c, x, y and edges annotated with the counts 0, 2, 0, 1, 2, 4]
Locating synch. operations
[Figure: the annotated grammar DAG (nodes S, A, a, b, c, x, y; edge counts 0, 2, 0, 1, 2, 4), locating the 3rd synchronization operation]
Can find synch. segments by looking up the global list.
[Figure: global list of n synch. ops., with entries X and Y marked]
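
A sketch of how counts can guide the search for the k-th synchronization operation without expanding the grammar. It assumes the thread grammar is S -> a A x A y, A -> b c (read off the figure residue, so treat it as a guess), and it derives a count of synch. ops. per symbol rather than storing the slide's exact edge annotations.

```python
def sync_count(rules, is_sync, sym, memo):
    """Number of synch. ops. in the string derived from sym (memoised per rule)."""
    if sym not in rules:
        return 1 if is_sync(sym) else 0
    if sym not in memo:
        memo[sym] = sum(sync_count(rules, is_sync, s, memo) for s in rules[sym])
    return memo[sym]

def locate(rules, is_sync, k):
    """Return (root-to-leaf path, terminal) of the k-th synch. op., k counted from 1.

    The path is a list of (rule, child index) pairs starting at S.
    """
    if k < 1:
        raise IndexError("k is counted from 1")
    memo, path, sym = {}, [], "S"
    while sym in rules:
        for i, child in enumerate(rules[sym]):
            c = sync_count(rules, is_sync, child, memo)
            if k <= c:              # the target lies inside this child
                path.append((sym, i))
                sym = child
                break
            k -= c                  # skip the whole subtree without expanding it
        else:
            raise IndexError("fewer synch. ops. than requested")
    return path, sym

# Assumed grammar for the thread path a b c x b c y, with b = lock, c = unlock.
rules = {"S": ["a", "A", "x", "A", "y"], "A": ["b", "c"]}
is_sync = lambda s: s in {"b", "c"}

third = locate(rules, is_sync, 3)
print(third)
```

The search touches one rule per level of the DAG, so its cost depends on the grammar's depth and rule lengths, not on the length of the uncompressed path.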
So far
Control flow of each thread stored as a grammar
Synchronization ops. form a global list
Grammar of each thread annotated with counts
– Easy searching of synchronization operations
What about shared data accesses?
Sequence of memory locations accessed by a single LD/ST instruction can be compressed
– Use a grammar representation for this seq. as well
Further compression
Locations accessed by a memory operation
– 10, 14, 18, 22, 26, 54, 58, 62, 66, 70, 98
Online compression of the string as grammar
– 10(1), 4(4), 28(1), 4(4), 28(1)
– Difference representation + run-length encoding
Useful for detecting regularity of array accesses
– Sweep through an array: a run of constant diffs.
– Accessing a sub-grid of a multidimensional array
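
The difference plus run-length step can be sketched as follows, using the address sequence from the slide. The function names `encode` and `decode` are illustrative, and treating the first address as a difference from 0 is an assumption that happens to reproduce the slide's 10(1) entry.

```python
from itertools import accumulate, groupby

def encode(addresses):
    """Difference representation followed by run-length encoding.

    Returns a list of (difference, run length) pairs; the first address is
    encoded as its difference from 0.
    """
    diffs = [b - a for a, b in zip([0] + addresses[:-1], addresses)]
    return [(d, len(list(group))) for d, group in groupby(diffs)]

def decode(runs):
    """Invert encode(): expand the runs, then take running sums."""
    diffs = [d for d, n in runs for _ in range(n)]
    return list(accumulate(diffs))

addresses = [10, 14, 18, 22, 26, 54, 58, 62, 66, 70, 98]
runs = encode(addresses)
print(runs)  # [(10, 1), (4, 4), (28, 1), (4, 4), (28, 1)]
```

The repeated pair (4, 4), (28, 1) is what a strided sweep over rows of a sub-grid looks like, and it is exactly the kind of repetition a grammar can then factor out.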
Organization
Compressed Path Profiles in Sequential Programs
Parallel Program Path Profiles
Compression Efficiency and Overheads
Data race detection over path profiles
Any better than gzip?
[Figure: bar chart, Compression % (2 Processors), Grammar vs. gzip for FFT, LU, Mp3d, Water, SOR; y-axis from 0 % to 12 %]
Scalability of Compression
[Figure: bar chart, Compression % for our scheme with 2, 4 and 8 processors for FFT, LU, Mp3d, Water, SOR; y-axis from 0 % to 10 %]
Concerns about Timing Overheads
Our scheme does not add substantial time overhead over grammar based string compression
Our experiments conducted using RSIM
– Tracing overheads can be higher in a real multiprocessor
– Can tracing distort program behavior?
Possible solution
– Trace minimal number of operations in a parallel program execution (Netzer 1993) to ensure deterministic replay
– Collect compressed path profile during replay.
Organization
Compressed Path Profiles in Sequential Programs
Parallel Program Path Profiles
Compression Efficiency and Overheads
Data race detection over path profiles
Apparent Data races
[Figure: MSC with lifelines Th. 1, Th. 2, Th. 3 and Mem., showing lock/unlock pairs in the threads]
• Last unlock in Th. 1 (first unlock)
• Next lock in Th. 1 (second lock)
• Locate root-to-leaf paths of these ops.
• Tree rooted at the least common ancestor of these ops.
No decompression of the grammar of Th. 1
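
The steps above can be sketched as follows: given the root-to-leaf paths of two synch. ops., the least common ancestor is where the paths diverge, and only the symbols hanging between the two paths need expanding. The grammar S -> a A x A y, A -> b c and the (rule, child index) path encoding are assumptions carried over for illustration; the paths are written out by hand here.

```python
def expand(rules, sym):
    """Terminal string derived from sym."""
    if sym not in rules:
        return [sym]
    return [t for s in rules[sym] for t in expand(rules, s)]

def segment_between(rules, p1, p2):
    """Terminals strictly between two leaves given by root-to-leaf paths.

    Each path is a list of (rule, child index) pairs from S down to a terminal.
    Assumes p1 and p2 are different leaves and p1 comes first in the string.
    """
    d = 0
    while p1[d] == p2[d]:          # walk the common prefix down to the LCA rule
        d += 1
    out = []
    # Right siblings along p1, from the leaf up to just below the LCA.
    for rule, i in reversed(p1[d + 1:]):
        for s in rules[rule][i + 1:]:
            out += expand(rules, s)
    # Children of the LCA rule that lie strictly between the two branches.
    lca, i1 = p1[d]
    _, i2 = p2[d]
    for s in rules[lca][i1 + 1:i2]:
        out += expand(rules, s)
    # Left siblings along p2, from just below the LCA down to the leaf.
    for rule, i in p2[d + 1:]:
        for s in rules[rule][:i]:
            out += expand(rules, s)
    return out

# Assumed grammar for the thread path a b c x b c y, with b = lock, c = unlock.
rules = {"S": ["a", "A", "x", "A", "y"], "A": ["b", "c"]}
first_unlock = [("S", 1), ("A", 1)]   # 2nd synch. op. of the thread
second_lock = [("S", 3), ("A", 0)]    # 3rd synch. op. of the thread

segment = segment_between(rules, first_unlock, second_lock)
print(segment)
```

Only the subtrees lying between the two paths are expanded; the rest of the thread's grammar is never decompressed.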
Data race artifacts
Sub := 1
A[1] := 0
X := Sub;
Y := A[X] (artifact)
X decides which addr. is accessed in Y := A[X]
X is set by Sub := 1 which is also in a data race.
Detecting artifacts requires data-flow
– Not captured by rd/wr sets in synch. segments
– Captured in our compact path profiles.
Summary
Compressed representation of the execution profile of shared memory parallel programs
– Control and shared data flow per thread
– Synchronization patterns across threads
Overall compression efficiency 0.25% to 9.81%
Compression efficiency scalable with increasing number of processors
Application: post-mortem debugging such as detecting data races
Other Applications
We do not capture the actual order of unsynchronized shared memory accesses across processors
Can be useful in making architectural decisions such as choice of cache coherence protocol
Sufficient to maintain [Netzer 1993]
– transitive reduction of program order on each proc.
– shared variable conflict orders
Can we capture the transitive reduction relation via annotations of WPP edges?