
Transcript of Automated Problem Diagnosis for Production Systems

Page 1: Automated Problem Diagnosis for Production Systems

Automated Problem Diagnosis for Production Systems

Soila P. Kavulya
Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi (CMU), Priya Narasimhan (CMU)

PARALLEL DATA LABORATORY
Carnegie Mellon University
http://www.pdl.cmu.edu/

Page 2: Automated Problem Diagnosis for Production Systems


Automated Problem Diagnosis

• Diagnosing problems
  • Creates major headaches for administrators
  • Worsens as scale and system complexity grow
• Goal: automate it and get proactive
  • Failure detection and prediction
  • Problem determination (or “fingerpointing”)
  • Problem visualization
• How: instrumentation plus statistical analysis


Page 3: Automated Problem Diagnosis for Production Systems


Target Systems for Validation

• VoIP system at a large telecom provider
  • 10s of millions of calls per day, diverse workloads
  • 100s of heterogeneous network elements
  • Labeled traces available
• Hadoop: MapReduce implementation
  • Hadoop clusters with homogeneous hardware
  • Yahoo! M45 & OpenCloud production clusters
  • Controlled experiments in an Amazon EC2 cluster
  • Long-running jobs (> 100s): hard to label failures


Page 4: Automated Problem Diagnosis for Production Systems


Assumptions of Approach

• Majority of the system is working correctly
• Problems manifest in observable behavioral changes
  • Exceptions or performance degradations
• All instrumentation is locally timestamped
• Clocks are synchronized to enable system-wide correlation of data
• Instrumentation faithfully captures system behavior


Page 5: Automated Problem Diagnosis for Production Systems


Overview of Diagnostic Approach

[Pipeline diagram: performance counters and application logs feed end-to-end trace construction; anomaly detection and localization then produce a ranked list of root causes.]
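The sketch below (not from the slides) shows one way to picture this pipeline as three composable stages in Python. The stage names construct_traces, detect_anomalies, and localize, the record layouts, and the placeholder scoring are all illustrative assumptions; the deck does not describe the system's actual interfaces.

```python
# A structural sketch of the pipeline above; stage names, record layouts, and the
# placeholder scoring in localize() are assumptions for illustration only.

def construct_traces(application_logs, performance_counters):
    """Stitch per-component log events and counters into end-to-end traces keyed by call id."""
    traces = {}
    for call_id, component, event in application_logs:
        traces.setdefault(call_id, {"events": [], "counters": []})["events"].append((component, event))
    for call_id, value in performance_counters:
        traces.setdefault(call_id, {"events": [], "counters": []})["counters"].append(value)
    return traces

def detect_anomalies(traces):
    """Flag anomalous traces (here: any ERROR event; rules or peer comparison in later slides)."""
    return {cid: any(event == "ERROR" for _, event in t["events"]) for cid, t in traces.items()}

def localize(traces, labels):
    """Rank components by how often they appear in anomalous traces (placeholder scoring)."""
    counts = {}
    for cid, t in traces.items():
        if labels[cid]:
            for component, _ in t["events"]:
                counts[component] = counts.get(component, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)  # ranked list of root-cause candidates

logs = [("call1", "Server1", "OK"), ("call1", "Server2", "ERROR"),
        ("call2", "Server1", "OK"), ("call2", "Server3", "OK")]
traces = construct_traces(logs, [("call1", 0.9), ("call2", 0.2)])
print(localize(traces, detect_anomalies(traces)))  # components seen in anomalous calls, most frequent first
```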


Page 6: Automated Problem Diagnosis for Production Systems


Anomaly Detection Overview

• Some systems have rules for anomaly detection, e.g.,
  • Redialing a number immediately after disconnection
  • Server-reported error codes and exceptions
• If no rules are available, rely on peer comparison
  • Identifies peers (nodes, flows) in distributed systems
  • Detects anomalies by identifying the “odd-man-out”


Page 7: Automated Problem Diagnosis for Production Systems


Anomaly Detection Approach

• Histogram comparison identifies anomalous nodes
  • Pairwise comparison of node histograms
  • Detect an anomaly if the difference between histograms exceeds a pre-specified threshold (a sketch of this comparison follows the figure below)

[Figure: histograms (distributions) of flow durations on three nodes, with normalized counts (total 1.0) on the y-axis; the faulty node’s histogram differs visibly from those of the two normal nodes.]
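Below is a minimal sketch of the peer-comparison step, assuming per-node flow durations, normalized histograms over shared bin edges, and a simple L1 distance with a fixed threshold; the distance measure and threshold in the actual system may differ.

```python
import numpy as np

def normalized_histogram(durations, bin_edges):
    counts, _ = np.histogram(durations, bins=bin_edges)
    return counts / max(counts.sum(), 1)          # normalized counts (total 1.0)

def odd_man_out(node_durations, threshold=0.5, bins=10):
    """Flag nodes whose duration histogram differs from a majority of their peers."""
    all_durations = np.concatenate(list(node_durations.values()))
    bin_edges = np.histogram_bin_edges(all_durations, bins=bins)
    hists = {n: normalized_histogram(d, bin_edges) for n, d in node_durations.items()}
    anomalous = []
    for node, h in hists.items():
        # Pairwise comparison against every peer node's histogram.
        diffs = [np.abs(h - h2).sum() for peer, h2 in hists.items() if peer != node]
        # Anomalous if it disagrees with more than half of its peers beyond the threshold.
        if sum(d > threshold for d in diffs) > len(diffs) / 2:
            anomalous.append(node)
    return anomalous

rng = np.random.default_rng(0)
nodes = {f"node{i}": rng.normal(1.0, 0.1, 500) for i in range(4)}   # normal nodes
nodes["node4"] = rng.normal(3.0, 0.5, 500)                          # faulty node: much slower flows
print(odd_man_out(nodes))                                           # expected: ['node4']
```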


Page 8: Automated Problem Diagnosis for Production Systems


Localization Overview

1. Obtain labeled end-to-end traces (labels indicate failures and successes)
   • Telecom systems: use heuristics, e.g., redialing a number immediately after disconnection (a labeling sketch follows this list)
   • Hadoop: use peer comparison for anomaly detection, since heuristics for detection are unavailable
2. Localize the source of problems
   • Score attributes based on how well they distinguish failed calls from successful ones
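As a labeling sketch for step 1, the snippet below marks a call as failed when the same caller redials the same number shortly after the call ends. The record layout and the 30-second redial window are illustrative assumptions, not details from the deck.

```python
from datetime import datetime, timedelta

def label_calls(call_records, redial_window=timedelta(seconds=30)):
    """Mark a call as FAIL if the same caller redials the same number shortly after it ends."""
    labels = {call["id"]: "SUCCESS" for call in call_records}
    for call in call_records:
        for other in call_records:
            if other is call or other["caller"] != call["caller"] or other["callee"] != call["callee"]:
                continue
            gap = other["start"] - call["end"]
            if timedelta(0) <= gap <= redial_window:
                labels[call["id"]] = "FAIL"   # abrupt disconnection followed by a quick redial
                break
    return labels

calls = [
    {"id": "c1", "caller": "A", "callee": "B",
     "start": datetime(2012, 11, 1, 9, 31, 0), "end": datetime(2012, 11, 1, 9, 31, 40)},
    {"id": "c2", "caller": "A", "callee": "B",
     "start": datetime(2012, 11, 1, 9, 31, 50), "end": datetime(2012, 11, 1, 9, 35, 0)},
]
print(label_calls(calls))   # c1 is labeled FAIL because A redialed B 10 seconds later
```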


Page 9: Automated Problem Diagnosis for Production Systems


“Truth Table” Call Representation


        Server1  Server2  Customer1  Phone1  Outcome
Call1      1        1         0        1     SUCCESS
Call2      1        0         1        1     FAIL

Log snippet:
Call1: 09:31am, SUCCESS, Server1, Server2, Phone1
Call2: 09:32am, FAIL, Server1, Customer1, Phone1

Scale: 10s of thousands of attributes, 10s of millions of calls
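A minimal sketch of building this truth-table representation from log lines like the snippet above; the log format and parsing are assumptions based on that example.

```python
def build_truth_table(log_lines):
    """Turn per-call log lines into one binary column per attribute plus an outcome label."""
    rows, attributes, seen = [], [], set()
    for line in log_lines:
        call_id, rest = (part.strip() for part in line.split(":", 1))
        fields = [f.strip() for f in rest.split(",")]
        outcome, attrs = fields[1], fields[2:]          # fields[0] is the timestamp
        for a in attrs:
            if a not in seen:
                seen.add(a)
                attributes.append(a)
        rows.append((call_id, set(attrs), outcome))
    header = ["Call"] + attributes + ["Outcome"]
    table = [[call_id] + [1 if a in attrs else 0 for a in attributes] + [outcome]
             for call_id, attrs, outcome in rows]
    return header, table

log_snippet = [
    "Call1: 09:31am,SUCCESS,Server1,Server2,Phone1",
    "Call2: 09:32am,FAIL,Server1,Customer1,Phone1",
]
header, table = build_truth_table(log_snippet)
print(header)        # ['Call', 'Server1', 'Server2', 'Phone1', 'Customer1', 'Outcome']
for row in table:
    print(row)       # e.g. ['Call1', 1, 1, 1, 0, 'SUCCESS']
```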

Page 10: Automated Problem Diagnosis for Production Systems


Identify Suspect Attributes

• Estimate conditional probability distributions
  • Prob(Success|Attribute) vs. Prob(Failure|Attribute)
  • Update the belief on each distribution with every call seen


[Figure: degree of belief vs. probability, showing the belief distributions for Success|Customer1 and Failure|Customer1.]

Anomaly score: distance between the distributions
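The sketch below is one hedged reading of this slide: for each attribute it tracks Beta beliefs over how often the attribute appears in successful versus failed calls, updating the belief with each call, and scores the attribute by the Hellinger distance between the two posteriors. The Beta-Bernoulli model, the conditioning direction, and the choice of Hellinger distance are illustrative assumptions, not necessarily the deployed scoring function.

```python
from math import lgamma, exp, sqrt

def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hellinger_beta(a1, b1, a2, b2):
    """Hellinger distance between Beta(a1, b1) and Beta(a2, b2) (closed form)."""
    log_bc = log_beta((a1 + a2) / 2, (b1 + b2) / 2) - 0.5 * (log_beta(a1, b1) + log_beta(a2, b2))
    return sqrt(max(0.0, 1.0 - exp(log_bc)))

def anomaly_score(calls, attribute):
    """Update beliefs with each call seen, then return the distance between the two posteriors."""
    succ = [1.0, 1.0]   # Beta(alpha, beta) belief for attribute presence given SUCCESS, uniform prior
    fail = [1.0, 1.0]   # Beta(alpha, beta) belief for attribute presence given FAIL
    for attrs, outcome in calls:
        belief = succ if outcome == "SUCCESS" else fail
        belief[0 if attribute in attrs else 1] += 1     # alpha counts presence, beta counts absence
    return hellinger_beta(fail[0], fail[1], succ[0], succ[1])

calls = [({"Server1", "Customer1"}, "FAIL")] * 40 + [({"Server1", "Server2"}, "SUCCESS")] * 60
print(anomaly_score(calls, "Customer1"))   # close to 1: Customer1 is implicated in failures
print(anomaly_score(calls, "Server1"))     # small: Server1 appears in both outcomes alike
```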

Page 11: Automated Problem Diagnosis for Production Systems


Find Multiple Ongoing Problems

• Search for the combination of attributes that maximizes the anomaly score
  • E.g., (Customer1 and ServerOS4)
• Greedy search limits the combinations explored
• Iterative search identifies multiple problems (a search sketch follows the screenshot below)


[UI screenshot: ranked list of chronics, plotted as failed calls vs. time of day (GMT). 1. Chronic signature 1: Customer1, ServerOS4. 2. Chronic signature 2: PhoneType7.]
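A minimal sketch of the greedy, iterative search over attribute combinations, using a toy score (an attribute set's prevalence among failed calls minus its prevalence among successful calls). The stopping rules and the way explained failures are removed between iterations are assumptions made for illustration.

```python
def greedy_signature(calls, score, max_attrs=3):
    """Greedily grow one attribute combination (chronic signature) that maximizes the score."""
    candidates = {a for attrs, _ in calls for a in attrs}
    signature, best = set(), 0.0
    while len(signature) < max_attrs:
        gains = [(score(calls, signature | {a}), a) for a in candidates - signature]
        if not gains:
            break
        top_score, top_attr = max(gains)
        if top_score <= best:
            break                          # no additional attribute improves the score
        signature.add(top_attr)
        best = top_score
    return signature, best

def iterative_diagnosis(calls, score, max_problems=5):
    """Find multiple ongoing problems by removing failures explained by each signature found."""
    remaining, signatures = list(calls), []
    for _ in range(max_problems):
        sig, s = greedy_signature(remaining, score)
        if not sig:
            break
        signatures.append((sig, s))
        # Drop failed calls that match the signature so the next pass surfaces a new problem.
        remaining = [(attrs, out) for attrs, out in remaining
                     if not (out == "FAIL" and sig <= attrs)]
    return signatures

def simple_score(calls, attrs):
    """Toy score: prevalence among failed calls minus prevalence among successful calls."""
    fails = [a for a, o in calls if o == "FAIL"]
    succs = [a for a, o in calls if o == "SUCCESS"]
    frac = lambda group: sum(attrs <= a for a in group) / max(len(group), 1)
    return frac(fails) - frac(succs)

calls = ([({"Customer1", "ServerOS4"}, "FAIL")] * 30 +
         [({"Customer1", "ServerOS3"}, "SUCCESS")] * 30 +
         [({"Customer2", "ServerOS4"}, "SUCCESS")] * 30 +
         [({"PhoneType7", "ServerOS3"}, "FAIL")] * 10 +
         [({"Customer3", "ServerOS5"}, "SUCCESS")] * 30)
print(iterative_diagnosis(calls, simple_score))
# e.g. [({'Customer1', 'ServerOS4'}, 0.75), ({'PhoneType7'}, 1.0)]
```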

Page 12: Automated Problem Diagnosis for Production Systems


Evaluation

• Prototype in use by the Ops team
  • Daily reports over the past 2 years
  • Helped Ops quickly discover new chronics
• Example: analyzing 25 million VoIP calls
  • 2 × 2.4 GHz Xeon cores, <1 GB of memory used
  • Data loading: 1.75 minutes for 6 GB of data
  • Diagnosis: ~4 seconds per signature (near-interactive)


Page 13: Automated Problem Diagnosis for Production Systems


Call Quality (QoS) Violations

• Message loss (>1%) used as the event failure indicator
• Draco showed that most QoS issues were tied to specific customers (identified by customer name and IP) and not to ISP network elements, as was previously believed

[Figure: failed calls vs. time of day (GMT) during an incident at the ISP. 1. Chronic signature 1: Service_A, Customer_A. 2. Chronic signature 2: Service_A, Customer_N, IP_Address_N.]

Page 14: Automated Problem Diagnosis for Production Systems


In Summary…

• Use peer comparison for anomaly detection
• Localize the source of problems using statistics
  • Applicable when end-to-end traces are available
  • E.g., customer, network element, version conflicts
• The approach used on Trone might vary
  • Depends on the instrumentation available
  • Also depends on the fault model
