Bin Xin, Patrick Eugster, Xiangyu Zhang
Dept. of Computer Science, Purdue University
{xinb, peugster, xyzhang}@cs.purdue.edu
Lightweight Task Graph Inference for Distributed Applications
Jinlin Yang
Center for Software Excellence, Microsoft
2010 29th IEEE International Symposium on Reliable Distributed Systems
Introduction
• New challenges to reliability as applications move to the cloud
  • Distinct corporate entities manage the infrastructure and own the deployed application
• Application developers do not have access to lower-level debugging information in case of failures or faults.
  • Diagnosis depends on application output or custom application-level logs
• Goal: Describe the high-level structural view of a distributed program execution to facilitate easy “after the fact” diagnosis.
Contributions
• Define abstraction for representing distributed executions – “Tasks”
• A lightweight approach to generate “Task Graphs” from the application event logs.
• A declarative formulation of the rules to generate Task Graphs using Prolog.
• Demonstrate use of Task Graph to help understand the distributed execution including anomaly detection.
Relevance to SmartGrid and CiC
• Extensions
  • Fault detection by real-time log processing (CEP?)
    • The patterns for CEP can be defined by the application developer
    • Or they can be auto-generated using code augmentation and static code analysis.
  • On fault detection, the task graph can be used to decide “recovery” mechanisms (other than a naïve process-restart strategy)
• Shortcomings
  • Does not explicitly consider the “Data Repository”
    • It is considered only as one of the ‘tasks’.
  • Not sure how it handles transactions
Definitions
Event: the execution of an operation that sends (or receives) data or a signal to (or from) a different thread/process. Events are the smallest building blocks.
Signaling Event: the operation of sending.
Acting Event: the operation of receiving.
Happens Before (a → b): a partial ordering of events, where a is the signaling event of the sender and b is the acting event of the receiver that acts on that signal.
Task: Autonomous computation within a thread between two “acting” events, [A_start, A_end). A task contains:
• exactly one acting event
• zero or more signaling events
Task Graph: A DAG whose nodes are tasks and edges represent Happens Before relations
A Request: A pair of signaling and acting events, where the signaling event is originating from outside the System.
A Reply: A pair of signaling and acting events, where the Acting event is triggered outside the System.
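The Task definition above can be illustrated with a minimal sketch, assuming one thread's events are already available in log order. The `Event` and `Task` classes and the `split_into_tasks` helper are hypothetical names chosen for illustration, not part of the presented system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    event_id: int
    thread_id: str
    kind: str  # "acting" (receive) or "signaling" (send)


@dataclass
class Task:
    thread_id: str
    acting: Event                                   # exactly one acting event
    signaling: List[Event] = field(default_factory=list)  # zero or more


def split_into_tasks(thread_events: List[Event]) -> List[Task]:
    """Split one thread's events (in log order) into tasks.

    Each acting event opens a new task; the task runs until the next
    acting event, i.e. the half-open interval [A_start, A_end).
    """
    tasks: List[Task] = []
    for ev in thread_events:
        if ev.kind == "acting":
            tasks.append(Task(ev.thread_id, ev))
        elif tasks:
            tasks[-1].signaling.append(ev)
        # Assumption: signaling events seen before the first acting
        # event belong to no task and are skipped in this sketch.
    return tasks
```

For example, a thread that logs acting, signaling, signaling, acting yields two tasks: the first with two signaling events, the second with none.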
E2E service Graph:
Example
System Setup
• Uses HDFS as the example application on Cloud
• HDFS logs are not sufficient/standardized
  • Uses instrumentation via a tool called “AspectJ”
  • AspectJ lets the developer insert code based on specific “rules” during compilation
• Each event is logged as a 7-field tuple
  • (EventID, ProcID, threadID, SourceLocation, Type, Tag, Value)
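As a rough illustration of the 7-field tuple, the sketch below parses one log line into a typed record. The on-disk format is not given in the slides, so the comma-separated, parenthesized layout, the `LogEvent` type, and the `parse_event` helper are all assumptions.

```python
from typing import NamedTuple


class LogEvent(NamedTuple):
    event_id: int
    proc_id: str
    thread_id: str
    source_location: str
    event_type: str
    tag: str
    value: str


def parse_event(line: str) -> LogEvent:
    """Parse one log line, assumed to look like
    (EventID, ProcID, threadID, SourceLocation, Type, Tag, Value)."""
    # Split on at most six commas so that Value may itself contain commas.
    fields = [f.strip() for f in line.strip().strip("()").split(",", 6)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    return LogEvent(int(fields[0]), *fields[1:])
```

Keeping the record immutable and typed makes later matching steps (for example, pairing events by `tag`) straightforward.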
Constructing Task Graphs (Prolog formulation) - I
Events
A “Fact” to parse and store all events
An entry for hb is made only if the Rules on the right are true for events X & Y
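The actual Prolog rules are not reproduced in this transcript. The sketch below shows, in Python rather than Prolog, what a declarative `hb` rule of this kind might look like: `hb(X, Y)` is recorded only when every condition holds. The specific conditions (X is signaling, Y is acting, same tag, different threads) and the `infer_hb` name are assumptions for illustration.

```python
from itertools import product


def infer_hb(events):
    """Return the set of (X, Y) event-id pairs for which hb(X, Y) holds.

    events: list of dicts with keys "id", "thread", "kind", "tag".
    Assumed rule: X is a signaling event, Y is an acting event, both
    refer to the same tag, and they run in different threads.
    """
    signaling = [e for e in events if e["kind"] == "signaling"]
    acting = [e for e in events if e["kind"] == "acting"]
    return {
        (x["id"], y["id"])
        for x, y in product(signaling, acting)
        if x["tag"] == y["tag"] and x["thread"] != y["thread"]
    }
```

Because this rule ignores time, it pairs every signaling event with every acting event on the same tag, which is one way false positives can arise.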
Constructing Task Graphs (Prolog formulation) - II
Tasks
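One way to picture the step from events to a task graph is to lift each event-level happens-before pair to an edge between the tasks that contain those events. This is a minimal sketch under that assumption; `build_task_graph` and its input shapes are hypothetical, and ordering edges between consecutive tasks of the same thread are left out.

```python
def build_task_graph(task_events, hb_edges):
    """Lift event-level hb pairs to task-level edges.

    task_events: dict mapping a task name to the event ids it contains.
    hb_edges: iterable of (signaling_event_id, acting_event_id) pairs.
    Returns an adjacency dict: task name -> set of successor task names.
    """
    # Map every event id to the task that owns it.
    owner = {ev: task for task, evs in task_events.items() for ev in evs}
    graph = {task: set() for task in task_events}
    for x, y in hb_edges:
        src, dst = owner[x], owner[y]
        if src != dst:  # no self-loops, so the result can stay a DAG
            graph[src].add(dst)
    return graph
```

With tasks `T1 = [1, 2]`, `T2 = [3, 4]`, `T3 = [5]` and hb pairs `(2, 3)` and `(4, 5)`, the graph is the chain T1 → T2 → T3.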
Issues & Solutions - I
False positives caused by common sync objects
Problem: A notion of “time” is required, but global clocks or vector clocks are expensive and complex.
Proposed Solution: Heuristic: use the order of events in the event logs.
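A minimal sketch of how log order could stand in for a clock: each acting event on a sync object is paired with the most recent unmatched signaling event that precedes it in the log. The pairing policy and the `pair_by_log_order` name are assumptions; the slides state only that log order is used.

```python
def pair_by_log_order(events):
    """Pair signaling and acting events using their position in the log.

    events: list of (event_id, kind, sync_obj) tuples in log order.
    Returns a list of (signaling_id, acting_id) pairs.
    """
    pending = {}  # sync_obj -> id of the latest unmatched signaling event
    edges = []
    for ev_id, kind, obj in events:
        if kind == "signaling":
            pending[obj] = ev_id
        elif kind == "acting" and obj in pending:
            # Consume the signal so that a later acting event on the same
            # object is not linked to it again (simplification: one
            # receiver per signal).
            edges.append((pending.pop(obj), ev_id))
    return edges
```

For the log signal(1), act(2), signal(3), act(4) on one lock, this gives the pairs (1, 2) and (3, 4), whereas matching on the sync object alone would also link 1 to 4.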
Issues & Solutions - II
False positives caused by communication
Problem: Multiple writes on the same socket.
Proposed Solution: Heuristic: use the “packet size” and the total received so far to decide which write to associate with which reads.
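One possible reading of this heuristic is byte-range bookkeeping: running totals turn each write and each read on a socket into a byte interval, and a write is linked only to reads whose interval overlaps its own. The sketch below assumes a single ordered stream per socket; `match_writes_to_reads` is a hypothetical helper.

```python
def match_writes_to_reads(write_sizes, read_sizes):
    """Link writes to reads on one socket by overlapping byte ranges.

    write_sizes / read_sizes: byte counts in log order.
    Returns a list of (write_index, read_index) pairs.
    """
    def to_ranges(sizes):
        # Convert sizes to half-open [start, end) offsets in the stream.
        ranges, total = [], 0
        for n in sizes:
            ranges.append((total, total + n))
            total += n
        return ranges

    writes, reads = to_ranges(write_sizes), to_ranges(read_sizes)
    pairs = []
    for wi, (w_start, w_end) in enumerate(writes):
        for ri, (r_start, r_end) in enumerate(reads):
            if w_start < r_end and r_start < w_end:  # intervals overlap
                pairs.append((wi, ri))
    return pairs
```

With writes of 100 and 50 bytes and reads of 60, 40, and 50 bytes, the first write is linked to the first two reads and the second write only to the third read.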
Issues & Solutions - III
False negatives caused by guarded waits
Problem: Multiple waiting threads are notified, and the lock condition may be updated before the current thread executes. Hence a condition check is required after waking up.
Proposed Solution: Manually update such cases: remove the augmented code within the loop and add a marker just after the loop.
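The guarded-wait pattern and the marker placement can be sketched as follows. The presented system instruments Java code with AspectJ; this Python version with `threading.Condition` is only an analogy, and `run_guarded_wait` and the log strings are hypothetical.

```python
import threading


def run_guarded_wait():
    """Run one signaling thread and one waiting thread; return the event log."""
    cond = threading.Condition()
    state = {"ready": False}
    log = []

    def waiter():
        with cond:
            while not state["ready"]:  # guard is re-checked after each wake-up
                cond.wait()            # nothing is logged inside the loop
            log.append("acting")       # marker placed just after the loop

    t = threading.Thread(target=waiter)
    t.start()
    with cond:
        state["ready"] = True          # update the guarded condition
        log.append("signaling")
        cond.notify_all()              # wake every waiting thread
    t.join()
    return log
```

Logging once after the loop records a single acting event at the point where the guard actually holds, however many times the thread woke up inside the loop.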
Evaluation - I
Performance Impact
Runtime:
• 22.2% increase in binary size
• 38% increase in execution time
TaskGraph building using Prolog:
Evaluation – II (Demo)
To help a new HDFS developer analyze HDFS execution