Automated Diagnosis of Chronic Problems in Production Systems


Page 1: Automated Diagnosis of Chronic Problems in Production Systems

Automated Diagnosis of Chronic Problems in Production Systems

Soila Kavulya

Thesis Committee:
Christos Faloutsos, CMU
Greg Ganger, CMU
Matti Hiltunen, AT&T
Priya Narasimhan, CMU (Advisor)

Page 2: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 3: Automated Diagnosis of Chronic Problems in Production Systems


Motivation
Chronics are problems that are:
- Not transient
- Not resulting in a system-wide outage
Chronics occur in real production systems:
- VoIP: a user's calls fail due to a version conflict between the user and an upgraded server.
- Hadoop (CMU's OpenCloud): a user job sporadically fails in the map phase with a cryptic block I/O error. The user and admins spent two months troubleshooting before tracing the problem to a large heap size in the tasktracker starving collocated datanodes.
Chronics are due to a variety of root-causes: configuration problems, bad hardware, software bugs.

Thesis: Automate the diagnosis of chronic problems in production systems.

Page 4: Automated Diagnosis of Chronic Problems in Production Systems


Challenge for Diagnosis

A single manifestation can have multiple possible causes:
- Due to a single node?
- Due to complex interactions between nodes?
- Due to multiple independent nodes?

[Diagram: five nodes (Node1 through Node5) illustrating the alternative failure origins]

Page 5: Automated Diagnosis of Chronic Problems in Production Systems


Challenges in Production Systems
- Labeled failure-data is not always available
  - Difficult to diagnose problems not encountered before
- Sysadmins' perspective may not correspond to users'
  - No access to user configurations or user behavior
  - No access to application semantics
  - First sign of trouble is often a customer complaint
  - Customer complaints can be cryptic
- Desired level of instrumentation may not be possible
  - As-is vendor instrumentation with limited control
  - Cost of added instrumentation may be high
  - Granularity of diagnosis is consequently limited

Page 6: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 7: Automated Diagnosis of Chronic Problems in Production Systems


Objectives
“Is there a problem?” (anomaly detection)
- Detect a problem despite potentially not having seen it before
- Distinguish a genuine problem from a workload change
“Where is the problem?” (localization)
- Drill down by analyzing different instrumentation perspectives
“What kind of problems?” (chronics)
- Manifestation: exceptions, performance degradations
- Root-cause: misconfiguration, bad hardware, bugs, contention
- Origin: single/multiple independent sources, interacting sources
“What kind of environments?” (production systems)
- Production VoIP system at AT&T
- Hadoop: open-source implementation of MapReduce

Page 8: Automated Diagnosis of Chronic Problems in Production Systems


Thesis Statement

Peer-comparison* enables anomaly detection in production systems despite workload changes, and the subsequent incremental fusion of different instrumentation sources enables localization of chronic problems.

*Comparison of some performance metric across similar (peer) system elements

Page 9: Automated Diagnosis of Chronic Problems in Production Systems


What was our Inspiration?

rika (Swahili), noun. Peer, contemporary; a member of an age-set, undergoing rites of passage (e.g., marriage) at similar times.

Page 10: Automated Diagnosis of Chronic Problems in Production Systems


What is a Peer?

- Temporal similarity. Age-set: born around the same time. Anomaly detection: events within the same time window.
- Spatial similarity. Age-set: live in the same location. Anomaly detection: run on the same node.
- Phase similarity. Age-set: (birth, initiation, marriage). Anomaly detection: (map, shuffle, reduce).
- Contextual similarity. Age-set: same gender, clan. Anomaly detection: same workload, hardware.

Page 11: Automated Diagnosis of Chronic Problems in Production Systems


Target Systems for Validation
VoIP system at a large telecommunication provider
- 10s of millions of calls per day, diverse workloads
- 100s of network elements with heterogeneous hardware
- 24x7 Ops team uses alarm correlation to diagnose outages
- Separate team troubleshoots long-term chronics
- Labeled traces available
Hadoop: open-source implementation of MapReduce
- Diverse kinds of real workloads (graph mining, language translation)
- Hadoop clusters with homogeneous hardware
  - Yahoo! M45 & OpenCloud production clusters
  - Controlled experiments in an Amazon EC2 cluster
- Long-running jobs (> 100s): hard to label failures

Page 12: Automated Diagnosis of Chronic Problems in Production Systems


In Support of Thesis Statement

OBJECTIVE            | VoIP                                                         | HADOOP
Anomaly Detection    | Heuristics-based; peer-comparison pending                    | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code    | Localize to node/task/resource
Chronics             | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems   | AT&T production system                                       | EC2 test system; OpenCloud pending
Publications         | OSR’11, DSN’12                                               | WASL’08, HotMetrics’09, ISSRE’09, ICDCS’10, NOMS’10, CCGRID’10

Page 13: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 14: Automated Diagnosis of Chronic Problems in Production Systems


Goals & Non-Goals
Goals
- Anomaly detection in the absence of labeled failure-data
- Diagnosis based on available instrumentation sources
- Differentiation of workload changes from anomalies
Non-goals
- Diagnosis of system-wide outages
- Diagnosis of value faults and transient faults
- Root-cause analysis at code-level
- Online/runtime diagnosis
- Recovery based on diagnosis

Page 15: Automated Diagnosis of Chronic Problems in Production Systems


Assumptions
- Majority of the system is working correctly
- Problems manifest in observable behavioral changes (exceptions or performance degradations)
- All instrumentation is locally timestamped
- Clocks are synchronized to enable system-wide correlation of data
- Instrumentation faithfully captures system behavior

Page 16: Automated Diagnosis of Chronic Problems in Production Systems


Overview of Approach

[Pipeline diagram: application logs feed end-to-end trace construction; the resulting traces, together with performance counters, feed anomaly detection and then localization, producing a ranked list of root-causes]

Page 17: Automated Diagnosis of Chronic Problems in Production Systems


Target System #1: VoIP

[Diagram: calls enter the ISP’s network via PSTN access and IP access; the network comprises gateway servers, IP base elements, application servers, and call control elements]

Page 18: Automated Diagnosis of Chronic Problems in Production Systems


Target System #2: Hadoop

[Diagram: the master node runs the JobTracker and NameNode; each slave node runs a TaskTracker (hosting map/reduce tasks) and a DataNode (hosting HDFS blocks); Hadoop logs and OS data are collected from the nodes]

Page 19: Automated Diagnosis of Chronic Problems in Production Systems


Performance Counters
- For both Hadoop and VoIP
- Metrics collected periodically from /proc in the OS
- Monitoring interval varies from 1 sec to 15 min
- Examples of metrics collected: CPU utilization, CPU run-queue size, pages in/out, memory used/free, context switches, packets sent/received, disk blocks read/written
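To make the collection concrete, here is a minimal sketch of sampling a couple of these counters straight from /proc on Linux; the field positions follow proc(5), but the function name, returned fields, and interval handling are illustrative rather than the thesis's actual collector.

```python
import time

def sample_counters():
    """Sample a few of the listed metrics directly from /proc (Linux)."""
    with open("/proc/loadavg") as f:
        runq = f.read().split()[3]             # run-queue, e.g. "2/513"
    mem = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            mem[key] = int(rest.split()[0])    # value in kB
    return {"timestamp": time.time(), "runq": runq,
            "mem_free_kb": mem["MemFree"], "mem_total_kb": mem["MemTotal"]}

# A collector would call this on a fixed interval (1 sec to 15 min above):
#   while True: store(sample_counters()); time.sleep(INTERVAL)
```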

Page 20: Automated Diagnosis of Chronic Problems in Production Systems


End-to-End Trace Construction

[Pipeline diagram, as in the approach overview; this section covers the trace-construction stage]

Page 21: Automated Diagnosis of Chronic Problems in Production Systems

Application Logs
- Each node logs each request that passes through it: timestamp, IP address, request duration/size, phone no., …
- Log formats vary across components and systems, so application-specific parsers extract the relevant attributes

Construction of end-to-end traces:
- A pre-defined schema is used to stitch requests across nodes
- Match on key attributes
  - In Hadoop, match tasks with the same task IDs
  - In VoIP, match calls with the same sender/receiver phone no.
- Incorporate time-based correlation (see the sketch below)
  - In Hadoop, consider block reads in the same time interval as maps
  - In VoIP, consider calls with the same phone no. within the same time interval
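A minimal sketch of the stitching step, under assumed record and field names (`task_id`, `timestamp`) rather than the production parsers' real schema:

```python
from collections import defaultdict

WINDOW_SECS = 60  # assumed correlation window; the real value is system-specific

def stitch_traces(records, join_key="task_id", window=WINDOW_SECS):
    """Stitch per-node log records into end-to-end traces.

    Records sharing a join key (e.g., task ID in Hadoop, caller/callee
    phone numbers in VoIP) are grouped directly; each group is then
    restricted to records whose timestamps fall within `window` seconds
    of the group's first record, approximating time-based correlation.
    """
    groups = defaultdict(list)
    for rec in records:                        # rec: dict with join_key, "timestamp"
        groups[rec[join_key]].append(rec)
    traces = []
    for recs in groups.values():
        recs.sort(key=lambda r: r["timestamp"])
        start = recs[0]["timestamp"]
        traces.append([r for r in recs if r["timestamp"] - start <= window])
    return traces
```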


Page 22: Automated Diagnosis of Chronic Problems in Production Systems


Application Logs: VoIP

- Combine per-element logs to obtain per-call traces: approximate match on key attributes (timestamps, caller-callee numbers, IPs, ports)
- Determine call status from per-element codes (e.g., zero talk-time, callback soon after call termination)

Elements along the call path: IP base element, call control element, application server, gateway server. Example per-element log entries:

  10:03:59, START, 973-123-8888 to 409-555-5555, 192.156.1.2 to 11.22.34.1; 10:03:59, STOP
  10:03:59, ATTEMPT, 973-123-8888 to 409-555-5555
  10:04:01, ATTEMPT, 973-123-xxxx to 409-555-xxxx, 192.156.1.2 to 11.22.34.1

Page 23: Automated Diagnosis of Chronic Problems in Production Systems


Application Logs: Hadoop (1)
- Peer-comparable attributes extracted from logs
- Correlate traces using IDs and the request schema

Example log entries:

  2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)
  2009-03-06 23:06:01,612 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 2 bytes (2 raw bytes) into RAM from attempt_200903062245_0051_m_000055_0 …from ip-10-250-90-207.ec2.internal

- Temporal similarity: timestamps
- Spatial similarity: hostnames
- Phase similarity: map/shuffle/reduce
- Contextual similarity: task type

Page 24: Automated Diagnosis of Chronic Problems in Production Systems


Application Logs: Hadoop (2)
- No global IDs for correlating logs in Hadoop & VoIP
- Extract causal flows using predefined schemas (a NoSQL database holds the events extracted from the logs)

Application log entry:
  2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)

Extracted event:
  <time=t2, type=shuffle, reduceid=reduce1, mapid=map1, duration=2s>

Flow schema (JSON):
  MapReduce: { “events” : { “Map” : { “primary-key” : “MapID”, “join-key” : “MapID”, “next-event” : “Shuffle”}, …

Events are then chained into causal flows, as sketched below.
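The sketch below shows schema-driven flow extraction under assumed event dictionaries; the schema literal mirrors the JSON fragment above, and the greedy first-match chaining is a simplification of the real extraction:

```python
# Schema format follows the JSON fragment above; the event dicts and the
# greedy first-match chaining are illustrative simplifications.
SCHEMA = {
    "Map":     {"primary-key": "MapID",    "join-key": "MapID",    "next-event": "Shuffle"},
    "Shuffle": {"primary-key": "MapID",    "join-key": "ReduceID", "next-event": "Reduce"},
    "Reduce":  {"primary-key": "ReduceID", "join-key": None,       "next-event": None},
}

def build_flows(events, schema=SCHEMA, root="Map"):
    """Chain parsed events into causal flows by following schema join keys."""
    by_type = {}
    for ev in events:                          # ev: dict with "type" and ID fields
        by_type.setdefault(ev["type"], []).append(ev)
    flows = []
    for ev in by_type.get(root, []):
        flow, cur, cur_type = [ev], ev, root
        while schema[cur_type]["next-event"] is not None:
            nxt_type = schema[cur_type]["next-event"]
            key = schema[cur_type]["join-key"]
            matches = [e for e in by_type.get(nxt_type, [])
                       if e.get(key) == cur.get(key)]
            if not matches:
                break
            cur, cur_type = matches[0], nxt_type   # real flows may fan out
            flow.append(cur)
        flows.append(flow)
    return flows
```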

Page 25: Automated Diagnosis of Chronic Problems in Production Systems


Anomaly Detection

[Pipeline diagram, as in the approach overview; this section covers the anomaly-detection stage]

Page 26: Automated Diagnosis of Chronic Problems in Production Systems


Anomaly Detection Overview
- Some systems have rules for anomaly detection
  - Redialing a number immediately after disconnection
  - Server-reported error codes and exceptions
- If no rules are available, rely on peer-comparison
  - Identify peers (nodes, flows) in distributed systems
  - Detect anomalies by identifying the “odd-man-out”

Page 27: Automated Diagnosis of Chronic Problems in Production Systems


Anomaly Detection (1)
- Empirically determine the best peer groupings (window size, request-flow types, job information); the best grouping minimizes false positives in fault-free runs (see the sketch below)
- Peer-comparison identifies “odd-man-out” behavior
  - Robust to workload changes
  - Relies on histogram-comparison, so it is less sensitive to timing differences
- Multiple suspects might be identified, due to propagating errors or multiple independent problems
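As a sketch of that empirical search (with a hypothetical `detect` callable and grouping objects; not the thesis's exact procedure), one could simply pick the candidate grouping that raises the fewest alarms on fault-free runs:

```python
def pick_grouping(fault_free_runs, candidate_groupings, detect):
    """Return the peer grouping with the fewest false positives.

    `candidate_groupings` enumerates choices of window size, request-flow
    type, and job information; `detect` is an anomaly detector (such as the
    histogram comparison sketched later) that returns the peers it flags.
    On fault-free runs, every flagged peer is a false positive.
    """
    def false_positives(grouping):
        return sum(len(detect(run, grouping)) for run in fault_free_runs)
    return min(candidate_groupings, key=false_positives)
```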

Page 28: Automated Diagnosis of Chronic Problems in Production Systems


Anomaly Detection (2)

Histogram comparison identifies anomalous flows:
- Generate an aggregate histogram that represents the majority behavior
- Compare each node's histogram against the aggregate histogram: O(n)
- Compute an anomaly score using Kullback-Leibler divergence
- Detect an anomaly if the score exceeds a pre-specified threshold

[Figure: histograms (normalized counts, total 1.0) of flow durations on three nodes; the faulty node's distribution stands out from the two normal nodes]
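A small numpy sketch of the histogram-plus-KL scoring just described; the bin count, smoothing constant, and threshold are illustrative choices rather than the thesis's tuned values:

```python
import numpy as np

def anomaly_scores(durations_per_node, bins=20, eps=1e-6):
    """Score each node by KL divergence from the aggregate histogram.

    `durations_per_node` maps node -> array of flow durations. The
    aggregate histogram stands in for majority behavior; smoothing with
    `eps` keeps the divergence finite when bins are empty.
    """
    pooled = np.concatenate(list(durations_per_node.values()))
    edges = np.histogram_bin_edges(pooled, bins=bins)
    agg, _ = np.histogram(pooled, bins=edges)
    q = (agg + eps) / (agg + eps).sum()                 # aggregate distribution
    scores = {}
    for node, durs in durations_per_node.items():
        h, _ = np.histogram(durs, bins=edges)
        p = (h + eps) / (h + eps).sum()                 # node's distribution
        scores[node] = float(np.sum(p * np.log(p / q)))  # KL(p || q)
    return scores

# Flag nodes whose score exceeds a pre-specified threshold:
#   flagged = [n for n, s in anomaly_scores(data).items() if s > THRESHOLD]
```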

Page 29: Automated Diagnosis of Chronic Problems in Production Systems


Localization

[Pipeline diagram, as in the approach overview; this section covers the localization stage]

Page 30: Automated Diagnosis of Chronic Problems in Production Systems


“Truth table” Request Representation

Log snippet:
  Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock
  Req2: 20100901064930,FAIL,Node2,Map,ReadBlock

        Node1  Node2  Map  ReadBlock  Outcome
  Req1  1      0      1    1          SUCCESS
  Req2  0      1      1    1          FAIL
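An illustrative parser (a hypothetical helper, not Draco's code) that turns log lines like the snippet into the binary truth-table rows shown above:

```python
import csv, io

LOG = """20100901064914,SUCCESS,Node1,Map,ReadBlock
20100901064930,FAIL,Node2,Map,ReadBlock"""

def to_truth_table(log_text):
    """Turn log lines into binary attribute rows plus an outcome column."""
    rows, attrs = [], []
    for fields in csv.reader(io.StringIO(log_text)):
        _ts, outcome, *attributes = [f.strip() for f in fields]
        for a in attributes:
            if a not in attrs:                 # build attribute vocabulary
                attrs.append(a)
        rows.append((set(attributes), outcome))
    table = [[1 if a in present else 0 for a in attrs] + [outcome]
             for present, outcome in rows]
    return attrs + ["Outcome"], table

header, table = to_truth_table(LOG)
# header -> ['Node1', 'Map', 'ReadBlock', 'Node2', 'Outcome']
# table  -> [[1, 1, 1, 0, 'SUCCESS'], [0, 1, 1, 1, 'FAIL']]
```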

Page 31: Automated Diagnosis of Chronic Problems in Production Systems

Identify Suspect Attributes
- Assume each attribute is represented as a “coin toss”
- Estimate attribute distributions using Bayesian inference:
  - Success distribution: Prob(Attribute | Success)
  - Anomalous distribution: Prob(Attribute | Anomalous)
  - Anomaly score: KL divergence between the two distributions
- Indict attributes with the highest divergence between distributions

[Figure: belief distributions over Probability(Node2=TRUE) for successful vs. anomalous requests]
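A sketch of this scoring step that pairs with the truth-table helper above; the Laplace smoothing (a Beta(1,1) prior) is our assumption, standing in for the thesis's exact Bayesian estimator:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def indict_attributes(table, header):
    """Rank attributes by divergence between failure and success estimates.

    Each attribute is treated as a coin toss; Prob(Attribute | Success) and
    Prob(Attribute | Anomalous) are estimated with Laplace smoothing (an
    assumed Beta(1,1) prior), and attributes are sorted by the divergence.
    """
    scores = {}
    for j, attr in enumerate(header[:-1]):
        s_heads = s_total = f_heads = f_total = 0
        for row in table:
            if row[-1] == "SUCCESS":
                s_total += 1
                s_heads += row[j]
            else:
                f_total += 1
                f_heads += row[j]
        p_fail = (f_heads + 1) / (f_total + 2)   # Prob(attr | anomalous)
        p_succ = (s_heads + 1) / (s_total + 2)   # Prob(attr | success)
        scores[attr] = bernoulli_kl(p_fail, p_succ)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# indict_attributes(table, header)[0] -> the most suspect attribute
```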

Page 32: Automated Diagnosis of Chronic Problems in Production Systems


Rank Problems by Severity

Step 1: Analyze all requests and indict the path with the highest anomaly score.
  Problem 1: Node2 + Map
Step 2: Filter out the requests matching Problem 1 and re-run on the remainder.
  Problem 2: Node3 + Shuffle

[Diagram: request paths over attributes (Map, Shuffle, Node2, Node3, ExceptionX, ExceptionY) with per-edge request counts; the highest-scoring path is indicted at each step]
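The iterative drill-down can be sketched as a greedy loop; the `indict` callable stands in for the scoring step above, and the single-conjunction filtering is a simplification of the path search:

```python
def localize_iteratively(requests, indict, max_problems=3):
    """Indict the top problem, filter the requests it explains, and repeat.

    `requests` are dicts with an "outcome" and an "attrs" set; `indict`
    returns the highest-scoring conjunction of attributes, e.g.
    ("Node2", "Map"). The greedy filtering loop is what lets multiple
    independent problems surface one after another.
    """
    remaining, problems = list(requests), []
    for _ in range(max_problems):
        if not any(r["outcome"] == "FAIL" for r in remaining):
            break                              # no unexplained failures left
        suspect = indict(remaining)
        problems.append(suspect)
        remaining = [r for r in remaining
                     if not all(a in r["attrs"] for a in suspect)]
    return problems
```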

Page 33: Automated Diagnosis of Chronic Problems in Production Systems


Incorporate Performance Counters (1)
- Annotate requests on indicted nodes with performance counters, based on timestamps
- Identify the metrics most correlated with the problem: compare the distribution of each metric in successful and failed requests

Requests on Node2:
  # Timestamp, CallNo, Status, Memory(%), CPU(%)
  20100901064914, 1, SUCCESS, 54, 6
  20100901065030, 2, SUCCESS, 54, 6
  20100901065530, 3, SUCCESS, 56, 4
  20100901070030, 4, FAIL, 52, 45
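A sketch of the metric-correlation step on such annotated requests; the effect-size score here is a crude stand-in for the full distribution comparison:

```python
import statistics

def correlated_metrics(samples, metrics=("memory", "cpu")):
    """Rank metrics by how well they separate successful from failed requests.

    `samples` are annotated requests like the Node2 listing above. The
    separation score (difference of means over pooled stdev) is a simple
    proxy for comparing the two distributions.
    """
    ranked = {}
    for m in metrics:
        ok = [s[m] for s in samples if s["status"] == "SUCCESS"]
        bad = [s[m] for s in samples if s["status"] == "FAIL"]
        if not ok or not bad:
            continue
        spread = statistics.pstdev(ok + bad) or 1.0
        ranked[m] = abs(statistics.mean(bad) - statistics.mean(ok)) / spread
    return sorted(ranked.items(), key=lambda kv: -kv[1])

# On the Node2 listing, 'cpu' (4-6% on success vs. 45% on failure) ranks first.
```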

Page 34: Automated Diagnosis of Chronic Problems in Production Systems


Incorporate Performance Counters (2)

Incorporating performance counters into the diagnosis annotates the indicted path with the correlated metric:

[Diagram: request paths over attributes (Map, Shuffle, Node2, Node3) with per-edge request counts; the indicted path becomes Problem 1: Node2 + Map + high CPU]

Page 35: Automated Diagnosis of Chronic Problems in Production Systems


Why Does It Work?
- Real-world data backs up the utility of peer-comparison: task durations are peer-comparable in >75% of jobs [CCGrid’10]
- The approach analyzes both successful and failed requests; analyzing only failed requests might elevate common elements over causal elements
- The iterative approach discovers correlated attributes: it identifies problems due to conjunctions of attributes, and the filtering step identifies multiple ongoing problems
- Handles problems not encountered before: relies neither on historical models of normal behavior nor on signatures of known defects

Page 36: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 37: Automated Diagnosis of Chronic Problems in Production Systems


VoIP: Diagnosis of Real Incidents

Examples of real-world incidents                    | Diagnosed | Resource Indicted
Customers use wrong codec to send faxes             | ✓         | NA
Customer problem causes blocked calls at IPBE       | ✓         | NA
Blocked circuit identification codes on trunk group | ✓         | NA
Software bug at control server causes blocked calls | ✓         | NA
Problem with customer equipment leads to poor QoS   | ✓         | NA
Debug tracing overloads servers during peak traffic | ✓         | CPU
Performance problem at application server           | ✓         | CPU/Memory
Congestion at gateway servers due to high load      | ✓         | CPU/Concurrent sessions
Power outage causes brief outages                   | ✗         | NA
PSX not responding to invites from app. server      | ✗         | Low responses at app. server

8 out of 10 real incidents diagnosed.

Page 38: Automated Diagnosis of Chronic Problems in Production Systems


VoIP: Case Studies

Incident 1: Chronic due to unsupported fax codec
[Plot: failed calls for two customers over Day 1 to Day 6; a chronic nightly problem persists until the customers stop using the unsupported codec]

Incident 2: Chronic server problem
[Plot: failed calls for a server over Day 1 to Day 6; an unrelated chronic server problem emerges and ends after a server reset]

Page 39: Automated Diagnosis of Chronic Problems in Production Systems


Implementation of Approach
Draco: deployed in production at AT&T (~8500 lines of C code)

Example ranked output (narrowed with a search filter):
  1. Problem 1: STOP.IP-TO-PS.487.3, STOP.IP-TO-PSTN.41.0.-.-, Chicago *GSX servers, memory overload
  2. Problem 2: STOP.IP-TO-PSTN.102.0.102.102, ServiceB, Customer Acme, IP_w.x.y.z

Page 40: Automated Diagnosis of Chronic Problems in Production Systems


VoIP: Ranking Multiple Problems

Draco performs better at ranking multiple independent problems

Page 41: Automated Diagnosis of Chronic Problems in Production Systems


VoIP: Performance of Algorithm

Offline Analysis           | Avg. Log Size | Avg. Data Load Time | Avg. Diagnosis Time
Draco simulated-1hr (C++)  | 271 MB        | 8 s                 | 4 s
Draco real-1day (C++)      | 2.4 GB        | 7 min               | 8 min

Running on a 16-core Xeon (2.4 GHz) with 24 GB of memory.

Page 42: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 43: Automated Diagnosis of Chronic Problems in Production Systems


Hadoop: Target Clusters
10- to 100-node Amazon EC2 clusters
- Commercial, pay-as-you-use cloud-computing resource
- Workloads under our control (gridmix, nutch, sort, random writer); problems injected by us
- Can harvest logs and OS data of only our workloads
4000-processor M45 & 64-node OpenCloud clusters
- Production environments, offered to CMU as free cloud-computing resources
- Diverse kinds of real workloads (massive machine-learning, language/machine-translation); problems in the wild
- Permission to harvest all logs and OS data

Page 44: Automated Diagnosis of Chronic Problems in Production Systems


Hadoop: EC2 Fault Injection

Fault         | Description
Resource contention:
  CPU hog     | External process uses 70% of CPU
  Packet-loss | 5% or 50% of incoming packets dropped
  Disk hog    | 20 GB file repeatedly written to
Application bugs (source: Hadoop JIRA):
  HADOOP-1036 | Maps hang due to unhandled exception
  HADOOP-1152 | Reduces fail while copying map output
  HADOOP-2080 | Reduces fail due to incorrect checksum

Each fault was injected on a single node.

Page 45: Automated Diagnosis of Chronic Problems in Production Systems


Hadoop: Peer-comparison Results Without Causal Flows

[Chart: true positive rates per injected fault; different metrics detect different problems]

Correlated problems (e.g., packet-loss) are harder to localize.

Page 46: Automated Diagnosis of Chronic Problems in Production Systems


Hadoop: Peer-comparison Results With Causal Flows + Localization

Injected fault | Diagnosed | Metrics indicted
CPU hog        | ✓         | Node
Packet-loss    | ✓         | Node+Shuffle
Disk hog       | ✓         | Node
HADOOP-1036    | ✓         | Node+Map
HADOOP-1152    | ✓         | Node+Shuffle
HADOOP-2080    | ✓         | Node+Shuffle

Correlated problems correctly identified.

Page 47: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 48: Automated Diagnosis of Chronic Problems in Production Systems


Critique of Approach
- Anomaly detection thresholds are fragile; need to use statistical tests
- Anomaly detection does not address problems at the master
- Peer-groups are defined statically: assumes homogeneous clusters; need to automate identification of peers
- False positives occur if the root-cause is not in the logs: the algorithm tends to implicate adjacent network elements; need to incorporate more data to improve visibility

Page 49: Automated Diagnosis of Chronic Problems in Production Systems


Related Work
- Chronics fly under the radar: undetected by alarm mining [Mahimkar09]
- Chronics can persist undetected for long periods of time: hard to detect using change-points [Kandula09]; hard to demarcate problem periods [Sambasivan11]
- Multiple ongoing problems at a time: the single-fault assumption is inadequate [Cohen05, Bodik10]
- Peer-comparison on its own is inadequate: hard to localize propagating problems [Kasick10, Tan10, Kang10]

Page 50: Automated Diagnosis of Chronic Problems in Production Systems


Outline
- Motivation
- Thesis Statement
- Approach
  - End-to-end trace construction
  - Anomaly detection
  - Localization
- Evaluation
  - VoIP
  - Hadoop
- Critique & Related Work
- Pending Work

Page 51: Automated Diagnosis of Chronic Problems in Production Systems


Pending Work

OBJECTIVE            | VoIP                                                         | HADOOP
Anomaly Detection    | Heuristics-based; peer-comparison pending                    | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code    | Localize to node/task/resource
Chronics             | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems   | AT&T production system                                       | EC2 test system; OpenCloud pending
Publications         | OSR’11, DSN’12                                               | WASL’08, HotMetrics’09, ISSRE’09, NOMS’10, CCGRID’10

Page 52: Automated Diagnosis of Chronic Problems in Production Systems


Pending Work: Details
OpenCloud production cluster & multiple-source problems [April-June 2012]
- 64-node cluster housed at Carnegie Mellon
- Obtained and parsed logs from 25 real OpenCloud incidents
- Root-causes include misconfigurations, h/w issues, buggy apps
- Yet to analyze logs
Peer comparison in VoIP [June-July 2012]
- Examining data that is not labeled, and identifying peers
- Notion of a peer might be determined by function and location
- Root-causes under investigation are as before
Dissertation writing [June-August 2012]
Defense [September 2012]

Page 53: Automated Diagnosis of Chronic Problems in Production Systems


Collaborators & Thanks
- VoIP (AT&T): Matti Hiltunen, Kaustubh Joshi, Scott Daniels
- Hadoop diagnosis: Jiaqi Tan, Xinghao Pan, Rajeev Gandhi, Keith Bare, Michael Kasick, Eugene Marinelli
- Hadoop visualization: Christos Faloutsos, U Kang, Elmer Garduno, Jason Campbell (Intel), HCI 05-610 team
- OpenCloud: Greg Ganger, Garth Gibson, Julio Lopez, Kai Ren, Mitch Franzos, Michael Stroucken

Page 54: Automated Diagnosis of Chronic Problems in Production Systems

Summary
- Peer-comparison is effective for anomaly detection: robust to workload changes, and requires little training data
- Incremental fusion of different instrumentation sources enables localization of chronics: it starts with user-visible symptoms of a problem and drills down to localize the root-cause
- Usefulness of the approach in two production systems: VoIP system at a large telecommunication provider (demonstrated); Hadoop clusters (underway)


Page 55: Automated Diagnosis of Chronic Problems in Production Systems


Questions?

Climbing Mt. Kilimanjaro comes a distant second to a thesis proposal!

Page 56: Automated Diagnosis of Chronic Problems in Production Systems


Selected Publications (1)

Diagnosis in production VoIP system:
- DSN’12: Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan. To appear in DSN 2012.
- OSR’11: Practical Experiences with Chronics Discovery in Large Telecommunications Systems. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Best Papers from SLAML 2011 in Operating Systems Review, 2011.

Survey paper & workload analysis of a production Hadoop cluster:
- RAE’12: Failure Diagnosis of Complex Systems. S. P. Kavulya, K. Joshi, F. Di Giandomenico, P. Narasimhan. To appear in Book on Resilience Assessment and Evaluation, Wolter, 2012.
- CCGrid’10: An Analysis of Traces from a Production MapReduce Cluster. S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. CCGrid 2010.

Page 57: Automated Diagnosis of Chronic Problems in Production Systems


Selected Publications (2)

Visualization in Hadoop:
- CHIMIT’11: Understanding and Improving the Diagnostic Workflow of MapReduce Users. J. D. Campbell, A. B. Ganesan, B. Gotow, S. P. Kavulya, J. Mulholland, P. Narasimhan, S. Ramasubramanian, M. Shuster, J. Tan. CHIMIT 2011.
- ICDCS’10: Visual, Log-Based Causal Tracing for Performance Debugging of MapReduce Systems. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ICDCS 2010.

Diagnosis in Hadoop (application logs + performance counters):
- NOMS’10: Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. NOMS 2010.
- ISSRE’09: Blind Men and the Elephant (BLIMEy): Piecing Together Hadoop for Diagnosis. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ISSRE 2009.

Page 58: Automated Diagnosis of Chronic Problems in Production Systems


Selected Publications (3)

Diagnosis in Hadoop (performance counters):
- HotMetrics’09: Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. HotMetrics 2009.

Diagnosis in Hadoop (application logs):
- WASL’08: SALSA: Analyzing Logs as StAte Machines. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. WASL 2008.

Diagnosis in group communication systems:
- SRDS’08: Gumshoe: Diagnosing Performance Problems in Replicated File-Systems. S. Kavulya, R. Gandhi, P. Narasimhan. SRDS 2008.
- SysML’07: Fingerpointing Correlated Failures in Replicated Systems. S. Pertet, R. Gandhi, P. Narasimhan. SysML, April 2007.

Page 59: Automated Diagnosis of Chronic Problems in Production Systems

Related Work (1)
- [Bodik10]: Fingerprinting the Datacenter: Automated Classification of Performance Crises. Peter Bodík, Moisés Goldszmidt, Armando Fox, Dawn B. Woodard, Hans Andersen. EuroSys 2010.
- [Cohen05]: Capturing, Indexing, Clustering, and Retrieving System History. Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox. SOSP 2005.
- [Kandula09]: Detailed Diagnosis in Enterprise Networks. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl. SIGCOMM 2009.
- [Kasick10]: Black-Box Problem Diagnosis in Parallel File Systems. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. FAST 2010.
- [Kiciman05]: Detecting Application-Level Failures in Component-Based Internet Services. Emre Kiciman, Armando Fox. IEEE Trans. on Neural Networks, 2005.


Page 60: Automated Diagnosis of Chronic Problems in Production Systems

Related Work (2)
- [Mahimkar09]: Towards Automated Performance Diagnosis in a Large IPTV Network. Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao. SIGCOMM 2009.
- [Sambasivan11]: Diagnosing Performance Changes by Comparing Request Flows. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. NSDI 2011.
