Tradeoffs in Automatic Provenance Capture



Manolis Stamatogiannakis, Hasanat Kazmi, Hashim Sharif, Remco Vermeulen, Ashish Gehani, Herbert Bos, and Paul Groth


Capturing Provenance

Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual effort
Examples: CPL (Macko ’12), Trio (Widom ’09), PrIME (Miles ’09), Taverna (Oinn ’06), VisTrails (Freire ’06)

Observed Provenance
+ Non-intrusive
+ Minimal manual effort
– False positives
– Semantic gap
Examples: ES3 (Frew ’08), TREC (Vahdat ’98), PASSv2 (Holland ’08), DTrace Tool (Gessiou ’12), OPUS (Balakrishnan ’13)

SPADEv2 – Provenance Collection
https://github.com/ashish-gehani/SPADE/wiki

• Strace Reporter – Programs run under strace; the produced log is parsed to extract provenance.
• LLVMTrace – Instrumentation is added at function boundaries at compile time.
• DataTracker – Dynamic taint analysis; bytes are associated with metadata that is propagated as the program executes.
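To make the Strace Reporter approach concrete, here is a minimal sketch of turning a few common strace log lines into provenance edges. It is not SPADE's actual parser; the regular expressions, the handled system calls, and the edge labels are illustrative assumptions.

```python
import re

# Minimal sketch of strace-log parsing into provenance edges.
# NOT SPADE's Strace Reporter: the patterns, the handled calls, and the
# edge labels below are illustrative assumptions only.
# Assumes a log produced with: strace -f -o trace.log <program>

OPEN_RE  = re.compile(r'^(\d+)\s+open(?:at)?\(.*?"([^"]+)".*=\s*(\d+)')
READ_RE  = re.compile(r'^(\d+)\s+read\((\d+),')
WRITE_RE = re.compile(r'^(\d+)\s+write\((\d+),')

def parse_strace(lines):
    fd_table = {}   # (pid, fd) -> file path, filled in by successful opens
    edges = []      # (subject, relation, object) provenance triples
    for line in lines:
        if m := OPEN_RE.match(line):
            pid, path, fd = m.group(1), m.group(2), int(m.group(3))
            fd_table[(pid, fd)] = path
        elif m := READ_RE.match(line):
            pid, fd = m.group(1), int(m.group(2))
            edges.append((f"process:{pid}", "used",
                          fd_table.get((pid, fd), f"fd:{fd}")))
        elif m := WRITE_RE.match(line):
            pid, fd = m.group(1), int(m.group(2))
            edges.append((fd_table.get((pid, fd), f"fd:{fd}"),
                          "wasGeneratedBy", f"process:{pid}"))
    return edges

# edges = parse_strace(open("trace.log"))
```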


SPADEv2 flow


Current Intuition



Incomplete Picture
• Faster, but how much?
• What is the performance “price” for fewer false positives?
• Is a compile-time solution worth the effort?


How can one get more insight?

Run a benchmark!


Which one?
• LMBench, UnixBench, Postmark, BLAST, SPECint…
• [Traeger 08]: “Most popular benchmarks are flawed.”
• No matter what you choose, there will be blind spots.


Start simple: UnixBench
• Well-understood sub-benchmarks.
• Emphasizes the performance of system calls.
• System calls are commonly used for the extraction of provenance.
• Gives more insight into which collection backend would suit specific applications.
• Provides a performance baseline from which to improve the specific implementations.
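The kind of comparison behind these numbers can be reproduced in miniature: time the same system-call-heavy workload natively and under a capture backend, then report the slowdown. The sketch below uses strace and a dd loop as stand-ins; the workload and flags are placeholders, not the actual UnixBench setup.

```python
import subprocess
import time

# Miniature version of the comparison: run a syscall-heavy workload natively
# and under a capture backend, then report the slowdown. The dd workload and
# the strace invocation are placeholders, not the UnixBench configuration.

def timed_run(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

workload = ["dd", "if=/dev/zero", "of=/dev/null", "bs=1k", "count=200000"]

native = timed_run(workload)
traced = timed_run(["strace", "-f", "-o", "/dev/null"] + workload)

print(f"native: {native:.2f}s  under strace: {traced:.2f}s  "
      f"slowdown: {traced / native:.1f}x")
```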


UnixBench Results


TRADEOFFS


Performance vs. Integration Effort

• Capturing provenance from completely unmodified programs may degrade performance.

• Modification of either the source (LLVMTrace) or the platform (LPM, Hi-Fi) should be considered for a production deployment.


Performance vs. Provenance Granularity

• We couldn’t verify this intuition for the case of the strace reporter compared to LLVMTrace.
– The strace reporter implementation is not optimal.
• Tracking fine-grained provenance may interfere with existing optimizations.
– E.g. buffered I/O does not benefit DataTracker.
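As a toy illustration of the second point (not DataTracker's actual implementation, which operates on machine code): even when a program reads a large buffer in a single call, a byte-granularity taint tracker still has to attach and propagate a label for every byte it copies, so buffering saves system calls but not tracking work.

```python
# Toy byte-granularity taint tracking. Illustrative only; DataTracker itself
# instruments machine instructions, not Python objects.

def read_tainted(path, size):
    """One buffered read, but every byte still gets its own (file, offset) label."""
    with open(path, "rb") as f:
        data = f.read(size)
    labels = [(path, offset) for offset in range(len(data))]
    return data, labels

def copy_tainted(data, labels):
    """Copying the buffer means propagating the label of each individual byte."""
    out, out_labels = bytearray(), []
    for byte, label in zip(data, labels):
        out.append(byte)
        out_labels.append(label)
    return out, out_labels

data, labels = read_tainted(__file__, 4096)      # a single buffered read...
copy, copy_labels = copy_tainted(data, labels)   # ...still costs per-byte work
```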


Performance vs. False Positives/Analysis Scope

• “Brute-forcing” a low false-positive ratio with the “track everything” approach of DataTracker is prohibitively expensive.
• Limiting the analysis scope gives a performance boost.
• If we exploit known semantics, we can have the best of both worlds.
– Pre-existing semantic knowledge: LLVMTrace
– Dynamically acquired knowledge: ProTracer [Ma 2016]


TAKEAWAYS


Takeaway: System Event Tracing

• A good start for quick deployments.
• Simple versions may be expensive.
• What happens in the binary?


Takeaway: Compile-time Instrumentation

• Middle-ground between disclosed and automatic provenance collection.

• But you need access to the source code.
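As a rough analogy for what function-boundary instrumentation gives you (LLVMTrace itself is a compile-time LLVM pass over the program's IR, not Python), the sketch below records an event at every function entry and exit using Python's runtime profiling hook; the event format is an assumption.

```python
import sys

# Rough analogy only: record an event at every function entry/exit.
# LLVMTrace achieves this by inserting calls at compile time; here we use
# Python's runtime profiling hook instead of compile-time instrumentation.

events = []

def boundary_hook(frame, event, arg):
    if event in ("call", "return"):
        events.append((event, frame.f_code.co_name))

def work(n):
    return sum(range(n))

sys.setprofile(boundary_hook)
work(1000)
sys.setprofile(None)

for kind, name in events:
    print(kind, name)   # prints "call work" and "return work"
```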


Takeaway: Taint Analysis

• Prohibitively expensive for computation-intensive programs.
• Likely to remain so, even after optimizations.
• Reserved for provenance analysis of unknown/legacy software.
• Offline approach (Stamatogiannakis TAPP’15).


Generalizing the Results

• Only one implementation was tested for each method.
• Repeating the tests with alternative implementations would provide more confidence in the insights gained.
• More confidence when choosing a specific collection method.

(Figure: matrix of different methods vs. different implementations)


Implementation Details Matter

• Our results are influenced by the specifics of the implementation.

• Anecdote: The initial implementation of LLVMTrace was actually slower than strace reporter.


Provenance Quality

• Qualitative features of the provenance are also very important.
• How many vertices/edges are contained in the generated provenance graph?
• Precision/recall based on provenance ground truth.

(Figure: performance benchmarks vs. qualitative benchmarks)
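A minimal sketch of the precision/recall computation, assuming provenance edges are modelled as (subject, relation, object) triples and that a small ground-truth graph has been built by hand; the representation is an assumption, not SPADE's data model.

```python
# Precision/recall of a captured provenance graph against hand-built ground
# truth, with edges modelled as (subject, relation, object) triples.
# The representation is illustrative, not SPADE's data model.

def precision_recall(captured, ground_truth):
    captured, ground_truth = set(captured), set(ground_truth)
    true_positives = captured & ground_truth
    precision = len(true_positives) / len(captured) if captured else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

ground_truth = {
    ("process:cat", "used", "file:/etc/hosts"),
    ("file:out.txt", "wasGeneratedBy", "process:cat"),
}
captured = {
    ("process:cat", "used", "file:/etc/hosts"),
    ("process:cat", "used", "file:/etc/ld.so.cache"),  # loader noise: false positive
    ("file:out.txt", "wasGeneratedBy", "process:cat"),
}

p, r = precision_recall(captured, ground_truth)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=1.00
```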


Where to go next?
• UnixBench is a basic benchmark.
• SPEC: comprehensive in terms of performance evaluation.
– Hard to get the provenance ground truth needed to assess the quality of the captured provenance.
• Better directions:
– Coreutils-based micro-benchmarks.
– Macro-benchmarks (e.g. Postmark, compilation benchmarks).


Conclusion
• Automatic provenance capture is an important part of the ecosystem.
• There are trade-offs between the different capture modes.
• Benchmarking helps to inform the choice.
• Common platforms are essential.


The End
