
Transcript of Automated Problem Diagnosis for Production Systems

Page 1: Automated Problem Diagnosis for Production Systems

Automated Problem Diagnosis for Production Systems

Soila P. Kavulya
Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi (CMU), Priya Narasimhan (CMU)

PARALLEL DATA LABORATORY
Carnegie Mellon University
http://www.pdl.cmu.edu/

Page 2: Automated Problem Diagnosis for Production Systems


Automated Problem Diagnosis

• Diagnosing problems
  • Creates major headaches for administrators
  • Worsens as scale and system complexity grow
• Goal: automate it and get proactive
  • Failure detection and prediction
  • Problem determination (or “fingerpointing”)
  • Problem visualization
• How: instrumentation plus statistical analysis


Page 3: Automated Problem Diagnosis for Production Systems


Target Systems for Validation

• VoIP system at a large telecom provider
  • 10s of millions of calls per day, diverse workloads
  • 100s of heterogeneous network elements
  • Labeled traces available
• Hadoop: MapReduce implementation
  • Hadoop clusters with homogeneous hardware
  • Yahoo! M45 & OpenCloud production clusters
  • Controlled experiments in an Amazon EC2 cluster
  • Long-running jobs (> 100s): hard to label failures


Page 4: Automated Problem Diagnosis for Production Systems


Assumptions of Approach

• Majority of the system is working correctly
• Problems manifest in observable behavioral changes
  • Exceptions or performance degradations
• All instrumentation is locally timestamped
• Clocks are synchronized to enable system-wide correlation of data
• Instrumentation faithfully captures system behavior


Page 5: Automated Problem Diagnosis for Production Systems


Overview of Diagnostic Approach

[Pipeline diagram: performance counters and application logs feed end-to-end trace construction; anomaly detection and localization then produce a ranked list of root causes.]
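The sketch below (not from the slides) shows one way to picture this pipeline as three composable stages in Python. The stage names construct_traces, detect_anomalies, and localize, the record layouts, and the placeholder scoring are all illustrative assumptions; the deck does not describe the system's actual interfaces.

```python
# A structural sketch of the pipeline above; stage names, record layouts, and the
# placeholder scoring in localize() are assumptions for illustration only.

def construct_traces(application_logs, performance_counters):
    """Stitch per-component log events and counters into end-to-end traces keyed by call id."""
    traces = {}
    for call_id, component, event in application_logs:
        traces.setdefault(call_id, {"events": [], "counters": []})["events"].append((component, event))
    for call_id, value in performance_counters:
        traces.setdefault(call_id, {"events": [], "counters": []})["counters"].append(value)
    return traces

def detect_anomalies(traces):
    """Flag anomalous traces (here: any ERROR event; rules or peer comparison in later slides)."""
    return {cid: any(event == "ERROR" for _, event in t["events"]) for cid, t in traces.items()}

def localize(traces, labels):
    """Rank components by how often they appear in anomalous traces (placeholder scoring)."""
    counts = {}
    for cid, t in traces.items():
        if labels[cid]:
            for component, _ in t["events"]:
                counts[component] = counts.get(component, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)  # ranked list of root-cause candidates

logs = [("call1", "Server1", "OK"), ("call1", "Server2", "ERROR"),
        ("call2", "Server1", "OK"), ("call2", "Server3", "OK")]
traces = construct_traces(logs, [("call1", 0.9), ("call2", 0.2)])
print(localize(traces, detect_anomalies(traces)))  # components seen in anomalous calls, most frequent first
```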


Page 6: Automated Problem Diagnosis for Production Systems


Anomaly Detection Overview

• Some systems have rules for anomaly detection, e.g.,
  • Redialing a number immediately after disconnection
  • Server-reported error codes and exceptions
• If no rules are available, rely on peer comparison
  • Identifies peers (nodes, flows) in distributed systems
  • Detects anomalies by identifying the “odd-man-out”


Page 7: Automated Problem Diagnosis for Production Systems


Anomaly Detection Approach

• Histogram comparison identifies anomalous nodes
  • Pairwise comparison of node histograms
  • Detect an anomaly if the difference between histograms exceeds a pre-specified threshold (a sketch of this comparison follows the figure below)

[Figure: histograms (distributions) of flow durations on three nodes, with normalized counts (total 1.0) on the y-axis; the faulty node’s histogram differs visibly from those of the two normal nodes.]
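Below is a minimal sketch of the peer-comparison step, assuming per-node flow durations, normalized histograms over shared bin edges, and a simple L1 distance with a fixed threshold; the distance measure and threshold in the actual system may differ.

```python
import numpy as np

def normalized_histogram(durations, bin_edges):
    counts, _ = np.histogram(durations, bins=bin_edges)
    return counts / max(counts.sum(), 1)          # normalized counts (total 1.0)

def odd_man_out(node_durations, threshold=0.5, bins=10):
    """Flag nodes whose duration histogram differs from a majority of their peers."""
    all_durations = np.concatenate(list(node_durations.values()))
    bin_edges = np.histogram_bin_edges(all_durations, bins=bins)
    hists = {n: normalized_histogram(d, bin_edges) for n, d in node_durations.items()}
    anomalous = []
    for node, h in hists.items():
        # Pairwise comparison against every peer node's histogram.
        diffs = [np.abs(h - h2).sum() for peer, h2 in hists.items() if peer != node]
        # Anomalous if it disagrees with more than half of its peers beyond the threshold.
        if sum(d > threshold for d in diffs) > len(diffs) / 2:
            anomalous.append(node)
    return anomalous

rng = np.random.default_rng(0)
nodes = {f"node{i}": rng.normal(1.0, 0.1, 500) for i in range(4)}   # normal nodes
nodes["node4"] = rng.normal(3.0, 0.5, 500)                          # faulty node: much slower flows
print(odd_man_out(nodes))                                           # expected: ['node4']
```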


Page 8: Automated Problem Diagnosis for Production Systems


Localization Overview

1. Obtain labeled end-to-end traces (labels indicate failures and successes)
   • Telecom systems: use heuristics, e.g., redialing a number immediately after disconnection (a labeling sketch follows this list)
   • Hadoop: use peer comparison for anomaly detection, since heuristics for detection are unavailable
2. Localize the source of problems
   • Score attributes based on how well they distinguish failed calls from successful ones
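As a labeling sketch for step 1, the snippet below marks a call as failed when the same caller redials the same number shortly after the call ends. The record layout and the 30-second redial window are illustrative assumptions, not details from the deck.

```python
from datetime import datetime, timedelta

def label_calls(call_records, redial_window=timedelta(seconds=30)):
    """Mark a call as FAIL if the same caller redials the same number shortly after it ends."""
    labels = {call["id"]: "SUCCESS" for call in call_records}
    for call in call_records:
        for other in call_records:
            if other is call or other["caller"] != call["caller"] or other["callee"] != call["callee"]:
                continue
            gap = other["start"] - call["end"]
            if timedelta(0) <= gap <= redial_window:
                labels[call["id"]] = "FAIL"   # abrupt disconnection followed by a quick redial
                break
    return labels

calls = [
    {"id": "c1", "caller": "A", "callee": "B",
     "start": datetime(2012, 11, 1, 9, 31, 0), "end": datetime(2012, 11, 1, 9, 31, 40)},
    {"id": "c2", "caller": "A", "callee": "B",
     "start": datetime(2012, 11, 1, 9, 31, 50), "end": datetime(2012, 11, 1, 9, 35, 0)},
]
print(label_calls(calls))   # c1 is labeled FAIL because A redialed B 10 seconds later
```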


Page 9: Automated Problem Diagnosis for Production Systems


“Truth Table” Call Representation


        Server1  Server2  Customer1  Phone1  Outcome
Call1      1        1         0        1     SUCCESS
Call2      1        0         1        1     FAIL

Log snippet:
Call1: 09:31am, SUCCESS, Server1, Server2, Phone1
Call2: 09:32am, FAIL, Server1, Customer1, Phone1

Scale: 10s of thousands of attributes, 10s of millions of calls
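A minimal sketch of building this truth-table representation from log lines like the snippet above; the log format and parsing are assumptions based on that example.

```python
def build_truth_table(log_lines):
    """Turn per-call log lines into one binary column per attribute plus an outcome label."""
    rows, attributes, seen = [], [], set()
    for line in log_lines:
        call_id, rest = (part.strip() for part in line.split(":", 1))
        fields = [f.strip() for f in rest.split(",")]
        outcome, attrs = fields[1], fields[2:]          # fields[0] is the timestamp
        for a in attrs:
            if a not in seen:
                seen.add(a)
                attributes.append(a)
        rows.append((call_id, set(attrs), outcome))
    header = ["Call"] + attributes + ["Outcome"]
    table = [[call_id] + [1 if a in attrs else 0 for a in attributes] + [outcome]
             for call_id, attrs, outcome in rows]
    return header, table

log_snippet = [
    "Call1: 09:31am,SUCCESS,Server1,Server2,Phone1",
    "Call2: 09:32am,FAIL,Server1,Customer1,Phone1",
]
header, table = build_truth_table(log_snippet)
print(header)        # ['Call', 'Server1', 'Server2', 'Phone1', 'Customer1', 'Outcome']
for row in table:
    print(row)       # e.g. ['Call1', 1, 1, 1, 0, 'SUCCESS']
```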

Page 10: Automated Problem Diagnosis for Production Systems


Identify Suspect Attributes

• Estimate conditional probability distributions
  • Prob(Success|Attribute) vs. Prob(Failure|Attribute)
  • Update the belief on each distribution with every call seen


[Figure: degree of belief vs. probability, showing the belief distributions for Success|Customer1 and Failure|Customer1.]

Anomaly score: distance between the distributions
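The sketch below is one hedged reading of this slide: for each attribute it tracks Beta beliefs over how often the attribute appears in successful versus failed calls, updating the belief with each call, and scores the attribute by the Hellinger distance between the two posteriors. The Beta-Bernoulli model, the conditioning direction, and the choice of Hellinger distance are illustrative assumptions, not necessarily the deployed scoring function.

```python
from math import lgamma, exp, sqrt

def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hellinger_beta(a1, b1, a2, b2):
    """Hellinger distance between Beta(a1, b1) and Beta(a2, b2) (closed form)."""
    log_bc = log_beta((a1 + a2) / 2, (b1 + b2) / 2) - 0.5 * (log_beta(a1, b1) + log_beta(a2, b2))
    return sqrt(max(0.0, 1.0 - exp(log_bc)))

def anomaly_score(calls, attribute):
    """Update beliefs with each call seen, then return the distance between the two posteriors."""
    succ = [1.0, 1.0]   # Beta(alpha, beta) belief for attribute presence given SUCCESS, uniform prior
    fail = [1.0, 1.0]   # Beta(alpha, beta) belief for attribute presence given FAIL
    for attrs, outcome in calls:
        belief = succ if outcome == "SUCCESS" else fail
        belief[0 if attribute in attrs else 1] += 1     # alpha counts presence, beta counts absence
    return hellinger_beta(fail[0], fail[1], succ[0], succ[1])

calls = [({"Server1", "Customer1"}, "FAIL")] * 40 + [({"Server1", "Server2"}, "SUCCESS")] * 60
print(anomaly_score(calls, "Customer1"))   # close to 1: Customer1 is implicated in failures
print(anomaly_score(calls, "Server1"))     # small: Server1 appears in both outcomes alike
```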

Page 11: Automated Problem Diagnosis for Production Systems


Find Multiple Ongoing Problems

• Search for the combination of attributes that maximizes the anomaly score
  • E.g., (Customer1 and ServerOS4)
• Greedy search limits the combinations explored
• Iterative search identifies multiple problems (a search sketch follows the screenshot below)


[UI screenshot: ranked list of chronics, plotted as failed calls vs. time of day (GMT). 1. Chronic signature 1: Customer1, ServerOS4. 2. Chronic signature 2: PhoneType7.]
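A minimal sketch of the greedy, iterative search over attribute combinations, using a toy score (an attribute set's prevalence among failed calls minus its prevalence among successful calls). The stopping rules and the way explained failures are removed between iterations are assumptions made for illustration.

```python
def greedy_signature(calls, score, max_attrs=3):
    """Greedily grow one attribute combination (chronic signature) that maximizes the score."""
    candidates = {a for attrs, _ in calls for a in attrs}
    signature, best = set(), 0.0
    while len(signature) < max_attrs:
        gains = [(score(calls, signature | {a}), a) for a in candidates - signature]
        if not gains:
            break
        top_score, top_attr = max(gains)
        if top_score <= best:
            break                          # no additional attribute improves the score
        signature.add(top_attr)
        best = top_score
    return signature, best

def iterative_diagnosis(calls, score, max_problems=5):
    """Find multiple ongoing problems by removing failures explained by each signature found."""
    remaining, signatures = list(calls), []
    for _ in range(max_problems):
        sig, s = greedy_signature(remaining, score)
        if not sig:
            break
        signatures.append((sig, s))
        # Drop failed calls that match the signature so the next pass surfaces a new problem.
        remaining = [(attrs, out) for attrs, out in remaining
                     if not (out == "FAIL" and sig <= attrs)]
    return signatures

def simple_score(calls, attrs):
    """Toy score: prevalence among failed calls minus prevalence among successful calls."""
    fails = [a for a, o in calls if o == "FAIL"]
    succs = [a for a, o in calls if o == "SUCCESS"]
    frac = lambda group: sum(attrs <= a for a in group) / max(len(group), 1)
    return frac(fails) - frac(succs)

calls = ([({"Customer1", "ServerOS4"}, "FAIL")] * 30 +
         [({"Customer1", "ServerOS3"}, "SUCCESS")] * 30 +
         [({"Customer2", "ServerOS4"}, "SUCCESS")] * 30 +
         [({"PhoneType7", "ServerOS3"}, "FAIL")] * 10 +
         [({"Customer3", "ServerOS5"}, "SUCCESS")] * 30)
print(iterative_diagnosis(calls, simple_score))
# e.g. [({'Customer1', 'ServerOS4'}, 0.75), ({'PhoneType7'}, 1.0)]
```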

Page 12: Automated Problem Diagnosis for Production Systems


Evaluation

• Prototype in use by the Ops team
  • Daily reports over the past 2 years
  • Helped Ops quickly discover new chronics
• Example: analyzing 25 million VoIP calls
  • 2 × 2.4 GHz Xeon cores, <1 GB of memory used
  • Data loading: 1.75 minutes for 6 GB of data
  • Diagnosis: ~4 seconds per signature (near-interactive)


Page 13: Automated Problem Diagnosis for Production Systems


Call Quality (QoS) Violations

• Message loss (>1%) used as the event failure indicator
• Draco showed that most QoS issues were tied to specific customers (identified by customer name and IP) and not to ISP network elements, as was previously believed

[Figure: failed calls vs. time of day (GMT) during an incident at the ISP. 1. Chronic signature 1: Service_A, Customer_A. 2. Chronic signature 2: Service_A, Customer_N, IP_Address_N.]

Page 14: Automated Problem Diagnosis for Production Systems


In Summary…

• Use peer comparison for anomaly detection
• Localize the source of problems using statistics
  • Applicable when end-to-end traces are available
  • E.g., customer, network element, version conflicts
• The approach used on Trone might vary
  • Depends on the instrumentation available
  • Also depends on the fault model
