Problem Diagnosis
• Distributed Problem Diagnosis
• Sherlock
• X-Trace
Troubleshooting Networked Systems
• Hard to develop, debug, deploy, and troubleshoot
• No standard way to integrate debugging, monitoring, and diagnostics
Status quo: device centric
[Figure: raw per-device logs – firewall messages, load-balancer dispatch notices (including a "Server s3 down" critical message), access logs from Web 1 and Web 2, and database query logs – each device with its own format and its own timestamps]
• Determining paths: join logs on time and ad-hoc identifiers
• Relies on
– well-synchronized clocks
– extensive application knowledge
• Requires all operations to be logged to guarantee complete paths
Examples
[Figure: an example request path, shown over four animation steps: User → DNS Server → Proxy → Web Server]
Approaches to Diagnosis
• Passively learn the relationships
– Infer problems as deviations from the norm
• Actively instrument the stack to learn the relationships
– Infer problems as deviations from the norm
Sherlock – Diagnosing Problems in the Enterprise
Srikanth Kandula
Well-Managed Enterprises Still Unreliable
[Figure: distribution of Web-server response times (ms), 10 to 10000 on a log scale: 85% of requests are normal, 10% are troubled, and 0.7% hit a down server]
10% of responses take up to 10x longer than normal
How do we manage evolving enterprise networks?
Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems
Sherlock
Challenges for the End-to-End Approach
• Don’t know what the user’s performance depends on
– Dependencies are distributed
– Dependencies are non-deterministic
• Don’t know which dependency is causing the problem
– Server CPU at 70%, a link dropped 10 packets – but which affected the user?
[Figure: e.g., a Web connection: the client depends on DNS, the Auth. server, the Web server, and the SQL backend]
Sherlock’s Contributions
• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server, and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft enterprise
Sherlock’s Architecture
[Figure: agents on clients and servers supply two inputs: user observations (e.g., Web1: 1000 ms, Web2: 30 ms, File1: timeout) + the network dependency graph → Inference Engine → list of troubled components]
Sherlock works for various client-server applications
[Figure: e.g., a video client depends on DNS, the video server, and its data store]
How do you automatically learn such distributed dependencies?
• Strawman: instrument all applications and libraries – not practical
• Sherlock instead exploits timing information
– If my client talks to B whenever it talks to C within a small interval t, the two connections may be dependent
– Naive co-occurrence yields false dependences: a frequently accessed server B co-occurs with everything
– Declare a dependence only if t << the inter-access time, and only when the co-occurrence happens with probability higher than chance
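The timing test above can be sketched in a few lines. This is an illustrative reconstruction, not Sherlock's actual code: the function name, the 10 ms window, and the "higher than chance" thresholds are all assumptions.

```python
# Sketch of timing-based dependency inference in the spirit of Sherlock's
# co-occurrence test (names, window, and thresholds are illustrative).
from collections import defaultdict

def infer_dependencies(accesses, window=0.01):
    """accesses: list of (timestamp, service) pairs from one client's logs.
    Returns (a, b, observed) triples where accesses to b follow accesses
    to a within `window` seconds more often than chance would predict."""
    accesses = sorted(accesses)
    by_service = defaultdict(list)
    for ts, svc in accesses:
        by_service[svc].append(ts)

    duration = (accesses[-1][0] - accesses[0][0]) or 1.0
    dependent = []
    for a, ts_a in by_service.items():
        for b, ts_b in by_service.items():
            if a == b:
                continue
            # fraction of accesses to b preceded by an access to a within `window`
            hits = sum(1 for tb in ts_b if any(0 <= tb - ta <= window for ta in ts_a))
            observed = hits / len(ts_b)
            # chance that a random access to b lands within `window` of some a,
            # which is what a busy server B would produce spuriously
            chance = min(1.0, len(ts_a) * window / duration)
            if observed > max(2 * chance, 0.5):
                dependent.append((a, b, observed))
    return dependent
```

On a trace where every video fetch is immediately preceded by a DNS lookup, this flags DNS → Video but not the reverse, because the window is tiny relative to the inter-access time.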
Sherlock’s Algorithm to Infer Dependencies
• Infer dependent connections from timing
• Infer topology from traceroutes & configurations
• Works with legacy applications
• Adapts to changing conditions
[Figure: dependency graph for “Bill watches Video”: Bill’s client depends on DNS and the video server; the video server depends on its data store]
But hard dependencies are not enough… we need probabilities
• E.g., if Bill caches the server’s IP, DNS can be down yet Bill still gets the video
• Sherlock uses the frequency with which a dependence occurs in the logs as its edge probability
[Figure: the “Bill watches Video” graph with edge probabilities, e.g., p1 = 10% for the DNS edge and p2 = 100% for the video-server edge]
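Estimating that frequency is simple counting over the same access logs. A minimal sketch, where the function name and 10 ms window are assumptions, and the slide's 10%/100% figures are just example values:

```python
# Sketch: estimate an edge probability as the fraction of accesses to the
# target service preceded by the candidate dependency (illustrative).
def edge_probability(accesses, dep, target, window=0.01):
    """accesses: list of (timestamp, service) pairs from the logs.
    Returns the fraction of `target` accesses preceded within `window`
    seconds by an access to `dep`; low for DNS when the client caches."""
    dep_ts = [t for t, s in accesses if s == dep]
    tgt_ts = [t for t, s in accesses if s == target]
    if not tgt_ts:
        return 0.0
    hits = sum(1 for tt in tgt_ts if any(0 <= tt - td <= window for td in dep_ts))
    return hits / len(tgt_ts)
```

With one DNS lookup followed by ten cached video fetches, the DNS → Video edge gets probability 0.1, matching the intuition behind the p1 = 10% example.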
How do we use the dependency graph to diagnose user problems?

Diagnosing User Problems
• Which components caused the problem? We need to disambiguate
• Disambiguate by correlating
– across logs from the same client
– across clients
• Prefer simpler explanations
[Figure: observations from several requests – “Bill watches Video”, “Bill sees Sales”, “Paul watches Video2” – share components such as DNS and the data store; correlating across them isolates the faulty component]
Will Correlation Scale?
• Microsoft internal network
– O(100,000) client desktops
– O(10,000) servers
– O(10,000) apps/services
– O(10,000) network devices
[Figure: building networks, campus core, corporate core, and data center]
• The dependency graph is huge
• Can we evaluate all combinations of component failures? The number of fault combinations is exponential – impossible to compute
Scalable Algorithm to Correlate
• Only a few faults happen concurrently
– Evaluate enough combinations to cover 99.9% of faults
– For the MS network, evaluating at most 2 concurrent faults is 99.9% accurate
– Exponential → polynomial
• Only a few nodes change state between candidate fault sets
– Re-evaluate a node only if an ancestor changes state
– Reduces the cost of evaluating a case by 30x–70x
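The pruned search can be sketched as follows. The scoring rule here (count the observations whose troubled/healthy state the hypothesis explains) is a deliberate simplification of Sherlock's probabilistic inference, and all names are illustrative:

```python
# Sketch of the correlation step: enumerate candidate fault sets of at
# most `max_faults` components and keep the one that best explains the
# observations (simplified scoring; not the paper's exact algorithm).
from itertools import combinations

def diagnose(deps, observations, max_faults=2):
    """deps: {observation name: set of components it depends on}.
    observations: {observation name: True if troubled, False if healthy}.
    Returns the best-scoring fault set and its score; iterating sizes
    in increasing order with a strict '>' prefers simpler explanations."""
    components = sorted(set().union(*deps.values()))
    best, best_score = None, -1
    for k in range(1, max_faults + 1):
        for faults in combinations(components, k):
            fset = set(faults)
            # an observation is explained when its troubled/healthy state
            # matches whether it touches a faulty component
            score = sum(1 for obs, troubled in observations.items()
                        if bool(deps[obs] & fset) == troubled)
            if score > best_score:
                best, best_score = fset, score
    return best, best_score
```

With Bill's video and Paul's video troubled but Bill's sales request healthy, the shared data store explains all three observations, whereas blaming DNS would wrongly predict a troubled sales request; correlation disambiguates.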
Results
Experimental Setup
• Evaluated on the Microsoft enterprise network
• Monitored 23 clients and 40 production servers for 3 weeks
– Clients are at MSR Redmond
– An extra host on each server’s Ethernet logs packets
• Busy, operational network
– Main intranet Web site and software-distribution file server
– Load-balancing front-ends
– Many paths to the data center
What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependencies for “Client accesses Portal” and “Client accesses Sales”, both involving the Auth. server]
Sherlock discovers complex dependencies of real apps.
What Do File-Server Dependencies Look Like?
[Figure: “Client accesses the software-distribution server” depends on the file server, proxy, Auth. server, WINS, DNS, and backend servers 1–4, with edge probabilities ranging from 100% down to 0.3%]
Sherlock works for many client-server applications.
Dependency graph: 2565 nodes; 358 components that can fail
Sherlock Identifies Causes of Poor Performance
[Figure: flagged component index vs. time (days) over the 2565-node inference graph with 358 components that can fail]
• 87% of problems were localized to just 16 components
• The three significant faults were corroborated

Sherlock Goes Beyond Traditional Tools
• SNMP-reported utilization on a link flagged by Sherlock: problems coincide with utilization spikes
• Sherlock identifies the troubled link, but SNMP alone cannot
X-Trace
• X-Trace records events in a distributed execution and their causal relationships
• Events are grouped into tasks
– A task has a well-defined starting event and includes everything causally related to it
• Each event generates a report, binding it to one or more preceding events
• Captures the full happens-before relation
X-Trace Output
• Task graph capturing the task execution
– Nodes: events across layers and devices
– Edges: causal relations between events
[Figure: task graph for an HTTP request through a proxy: HTTP client, HTTP proxy, and HTTP server events; each HTTP hop spans TCP 1 / TCP 2 start and end events, which in turn span IP events at routers]
Basic Mechanism
• Each event is uniquely identified within a task: [TaskId, EventId]
• [TaskId, EventId] is propagated along the execution path
• For each event, create and log an X-Trace report
– Enough info to reconstruct the task graph
[Figure: the same task graph with events labeled a–n; e.g., event g’s report reads – TaskID: T, EventID: g, Edge: from a, f]
X-Trace Library API
• Handles propagation within the application
• Supports threaded and event-based code (e.g., libasync)
• Akin to a logging API
– The main call is logEvent(message)
• The library takes care of event-id creation, binding, reporting, etc.
• Implementations in C++, Java, Ruby, and Javascript
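A minimal sketch of what such a library might look like, in Python for illustration. Only logEvent comes from the slides; the class name, report fields, and graph-reconstruction helper are assumptions, not the real X-Trace implementation:

```python
# Illustrative X-Trace-style library sketch (hypothetical API shapes).
import itertools

class XTrace:
    _next_id = itertools.count()          # event-id creation handled by the library

    def __init__(self, task_id):
        self.task_id = task_id            # all events in the task share the TaskId
        self.reports = []                 # one report logged per event

    def logEvent(self, message, parents=()):
        """Create an event bound to its preceding events and log its report."""
        event_id = next(self._next_id)
        self.reports.append({
            "TaskID": self.task_id,
            "EventID": event_id,
            "Edges": list(parents),       # e.g., "Edge: from a, f" in the slides
            "Message": message,
        })
        return event_id                   # caller propagates [TaskId, EventId] onward

    def task_graph(self):
        """The reports carry enough info to rebuild the task graph as edges."""
        return [(parent, r["EventID"]) for r in self.reports for parent in r["Edges"]]
```

Here logEvent returns the new event id so the caller can thread it through the execution path, mimicking [TaskId, EventId] propagation; collecting the reports afterwards reconstructs the causal graph.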
Task Tree
• X-Trace tags all network operations resulting from a particular task with the same task identifier
• The task tree is the set of network operations connected with an initial task
• The task tree can be reconstructed after collecting the trace data from the reports
An example of the task tree
• A simple HTTP request through a proxy
X-Trace Components
• Data: the X-Trace metadata
• Network path: the task tree
• Reports: used to reconstruct the task tree
Propagation of X-Trace Metadata
• X-Trace metadata propagates through the task tree
The X-Trace Metadata

Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  The address that X-Trace reports should be sent to
Options      Mechanism to accommodate future extensions
X-Trace Report Architecture
X-Trace-like in Google/Bing/Yahoo
• Why?
– They own a large portion of the ecosystem
– They use RPC for communication
– They need to understand
• the time for a user request
• the resource utilization per request
Sherlock vs. X-Trace
• Overhead vs. accuracy
• Deployment issues
– Invasiveness
– Code modification
Conclusions
• Sherlock passively infers network-wide dependencies from logs and traceroutes
• It diagnoses faults by correlating user observations
• X-Trace actively discovers network-wide dependencies by instrumenting the execution path