DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2014

Machine Learning Methods for Fault Classification

MARKUS FELLDIN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Abstract

This project, conducted at Ericsson AB, investigates the feasibility of implementing machine learning techniques in order to classify dump files for more efficient trouble report routing. The project focuses on supervised machine learning methods and in particular Bayesian statistics. It shows that a program utilizing Bayesian methods can achieve well above random prediction accuracy. It is therefore concluded that machine learning methods may indeed become a viable alternative to human classification of trouble reports in the near future.


Referat

Machine Learning Methods for Fault Classification

This degree project, conducted at Ericsson AB, examines whether machine learning techniques can be used to classify dump files for more efficient problem identification. The project focuses on supervised learning, in particular Bayesian classification. The work shows that a program utilizing Bayesian classification can achieve accuracy well above chance, and indicates that machine learning techniques may well become a usable alternative to human classification of dump files in the near future.


Contents

1 Introduction
   1.1 Introduction
   1.2 Background
   1.3 Thesis Objective
   1.4 Current Workflow
   1.5 Constraints
   1.6 External Factors
   1.7 Intended Audience
   1.8 Ethical Considerations
   1.9 Choice of Methodology
   1.10 Evaluation
   1.11 Related Research
   1.12 Implications of Related Research

2 Machine Learning
   2.1 Machine Learning
   2.2 Types of Learning
   2.3 Perceptrons and Neural Networks
   2.4 Support Vector Machines
   2.5 Expectation Maximization
   2.6 Compound Classifiers
   2.7 Naïve Bayes

3 Method
   3.1 Data Analysis
   3.2 Feature Selection
   3.3 Bayesian Methods
   3.4 Classification
   3.5 Viability of Disregarded Algorithms

4 Implementation
   4.1 Data
   4.2 Data Sources
   4.3 Feature Extraction
   4.4 Feature Selection
   4.5 Remarks on Feature Extraction
   4.6 Deviations from Bayesian Methods
   4.7 Data Processing

5 Results
   5.1 Independent Results of Classifiers
   5.2 Combined Results and Comparisons
   5.3 Comments on the Results

6 Conclusion
   6.1 Discussion
   6.2 Conclusion
   6.3 Recommendations
   6.4 Future Work

Bibliography


Preface

I would like to extend my gratitude to my superb supervisors at Ericsson AB: Elina Meier, Per-Olof Gatter and Johnny Carlbaum. Leif Jonsson, the resident machine learning guru, also deserves a big thanks for paving the way with his research and providing easier access to the data required for my research. I would also like to thank the remaining members of my team at Ericsson AB for their help and support throughout the duration of this project: Fredrik Tengblad, Tomas Jonek, Peter Kerwein, Michael Hedström and Stefan Sundin. Finally, I would like to thank my supervisor at KTH, Olov Engwall, my examiner, Olle Bälter, and my guidance team members Jens Arvidsson, Olle Hassel, Anton Lindström and David Nilsson.


Glossary

ASIC Application-specific integrated circuit

Bugs Software errors

CDA Crash dump analysis

DSP Digital signal processor

EM Expectation maximization

OS Operating system

MHO An Ericsson design unit

ML Machine learning

RBS Radio base station

SVM Support vector machine

TR Trouble report

P(A) The probability of an event A occurring

P(A|B) The probability of an event A occurring if the event B has occurred.


Chapter 1

Introduction

This chapter introduces the goals and intent of this degree project. It also provides some background on why this research is being conducted, and describes the conditions under which the research was carried out.

1.1 Introduction

As one of the world’s largest telecommunication companies, Ericsson AB, among other things, develops, sells and maintains a large number of base stations across the globe. When faults occur within one or more of these, a large amount of text must be analyzed in order to determine the cause. This process is currently performed manually, which can be both slow and imprecise, and must be repeated for every new trouble report. Part of Ericsson’s vision is that there will be 50 billion devices, each connected to a global network, by the year 2020. The sheer amount of data that could potentially be generated by these devices means that manual error identification may very well become a severe bottleneck in the very near future. If an automated initial classification could be performed based on the text files, the amount of manual labor required per trouble report could be greatly reduced. This in turn could facilitate both quicker debugging for developers and faster customer support for clients.

1.2 Background

In today’s world, the capacity to store and produce data vastly exceeds what the human mind can decipher quickly. A great deal of research has been carried out in the fields of AI, machine learning and data mining, and this research is more relevant now than ever before, as computational power has reached a level where such methods may finally be applicable in a growing number of real-world scenarios. As stated in the introduction, Ericsson, among others, currently relies heavily on non-automated error identification, a process that becomes more and more costly as the amount of data that needs to be analyzed increases. Ericsson has conducted some research in the field, but has not yet reached an applicable solution.

There are several concurrent research projects within the organization looking at different methods of incorporating some degree of automation within the trouble report (TR) routing process. Each of these projects differs in terms of data sources or methodologies. The most successful method to date has utilized human-written TR descriptions together with a compound machine learning approach. Other approaches are, for example, looking at changes in the codebase between errors.

A successful project could lead to improved trouble report routing within Ericsson, and may be incorporated as part of a more complex TR routing tool. A failure to implement the algorithms may also help illustrate various difficulties with log file analysis as a basis for classification, which could serve as a reference for future attempts. Furthermore, since automatic error identification from log files could be applicable to almost any system with crash-persistent logging, there will likely be external interest in the project as well.

1.3 Thesis Objective

The primary objective of this project is to test the viability of automatically routing TRs using machine learning techniques based on the data generated by fatal errors in Ericsson’s radio base stations. The viability of the methods found by this project will be determined by their performance in relation to random classification. A method is therefore deemed viable if it can be shown to perform better than what would be statistically probable for a random classifier. The approach utilized, if deemed viable, will serve as a baseline for future development, with the intent of constructing a system that is able to perform at human levels of TR classification. Internal empirical data has shown that humans perform TR classification at 77% accuracy.

1.4 Current Workflow

At the moment there is no easy way to classify software errors by cause. Instead, errors must be dealt with without prior knowledge of their causes. Initial classification can only be achieved through the intuition of the individual responsible for handling or reporting the error. At the moment, errors are most often handled in one of two ways:

1. The error is noticed during testing/debugging of code internally by the developers. In this case the developers will often be tasked with debugging their own code with assistance from colleagues. This scenario relies heavily on the experience and intuition of the developers involved.


2. The error occurs post-delivery and is reported to a customer service technician. In this case the technician files a TR based on their own experience and the information supplied by the customer. The TR is then assigned to a team of developers who have to debug the product based on the information given in the TR.

Both of these processes could see great reductions in complexity if some degree of error pre-classification could be performed.

1.5 Constraints

This project will only test the viability of supervised machine learning techniques on static log files from Ericsson’s business unit networks (BNET) that have been generated by Ericsson’s radio base stations (RBS) and processed by Ericsson’s Crash Dump Analysis (CDA) tool. The format and contents of the data therefore cannot be controlled or changed within the scope of this project. Furthermore, due to the complexity of the system responsible for generating the dump files, it is very difficult to artificially recreate a statistically representative spread of issues. Therefore no new data can be artificially generated for the purpose of this project.

1.6 External Factors

Insight into the contents and meaning of the dump files will also be limited by the amount of documentation and knowledge that is available within Ericsson.

1.7 Intended Audience

In order to gain insight from this report, one should have some understanding of algorithm design as well as working knowledge of basic calculus and statistics. The intended readers are primarily Ericsson employees working with TR routing and software developers working with similar problems. The project may also be of interest to researchers working with feature extraction and data aggregation.

1.8 Ethical Considerations

This project will, in all likelihood, not cause any ethical dilemmas, as it deals with theoretical problems. However, if the project is successful, it could potentially reduce the amount of human labor required to perform TR routing in the future. This in turn could lead to a reduction in the amount of support staff hired by Ericsson. However, this is not a likely outcome, as TR routing is only one of their responsibilities. Furthermore, the purpose of this project is merely to produce a tool that serves as an aid to increase the throughput of the TR routing process, not to replace the process entirely.

A higher throughput in the TR routing process may, however, increase efficiency, which may in turn result in shorter downtime. Shorter downtime would improve the quality of service and utility of Ericsson’s radio base stations, and shorter downtime and higher quality of service in cellular networks would have positive socioeconomic impacts.

1.9 Choice of Methodology

The approach should be divided into different phases. First and foremost, the dump files should be deconstructed and the current method of identification inspected. Then available research, both internal and external, should be studied for possible approaches. Additionally, test cases must either be found or constructed. After this, a baseline should be constructed or found as a source for performance comparison. A program comprised of appropriate methods should then be constructed based on speed and accuracy. This program should be improved upon, based on performance in relation to the baseline and previous attempts.

1.10 Evaluation

The accuracy should be evaluated based on comparisons with both a random algorithm and the current manual system. The current accuracy of manual classification of trouble reports is approximately 77%. Approaching this accuracy would be considered a great success, but is not a requirement. Instead, the focus lies on demonstrating that classification is possible through machine learning algorithms, which would be evident through a statistically significant improvement over random classification. The accuracy of both random classification and the prototype classification programs will be based on the results of runs on a validation set with known answers.
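A comparison against random classification of the kind described above can be sketched as a simple binomial test: under random guessing over k equally likely classes, the number of correct predictions on n validation samples follows a Binomial(n, 1/k) distribution. The counts below are illustrative only, not the project's actual results.

```python
from math import comb

def binom_p_value(n, correct, p):
    """P(X >= correct) for X ~ Binomial(n, p): the chance a random
    classifier does at least this well."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(correct, n + 1))

# Hypothetical example: 100 validation samples, 20 classes, 15 correct.
n, k, correct = 100, 20, 15
p = binom_p_value(n, correct, 1 / k)
print(p < 0.05)  # significantly better than the 1/k = 5% chance level?
```

A small p-value here means the observed accuracy is unlikely under random classification, which is exactly the viability criterion stated in Section 1.3.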

1.11 Related Research

Ericsson has undertaken previous research into implementing machine learning algorithms in order to classify trouble reports. However, these attempts have focused on different stages of the trouble reporting process, and no previous attempts have been made to classify dump files without any manual pre-processing. These attempts have had various degrees of success but have in general been deemed too error prone to yield reliable results. Most attempts have been made at sorting the reports filed by customer service technicians based on keywords. The usage of human reports introduces an additional level of entropy, as the customer service representatives’ individual idiosyncrasies can make general classification more difficult.


Another thesis worker at Ericsson, Weixi Li, recently explored utilizing machine learning techniques for automatic log file analysis [11]. Whilst her findings are not entirely applicable to this research, they are very relevant, as the data comes from a related source with similar content. Her research dealt with automatically differentiating between abnormal and normal logs. She was able to show that machine learning techniques yielded a better than random prediction percentage.

Outside of Ericsson there has been increasing interest in utilizing machine learning algorithms for software debugging, noticeable through a growing number of publications on the subject. Research by Alice X. Zheng [3] demonstrated that algorithms could be used to cluster program data in order to identify underlying bugs. The results of Zheng’s research [3] illustrated a key problem with classifying software errors: there may be more than one underlying cause for any one failure, which often results in an ambiguous classification of that particular data point.

The research paper titled “How Bayes Debug” by Chao Liu, Zeng Lian, and Jiawei Han looked at how Bayesian classifiers could facilitate software debugging [9]. They found that their particular implementation of a Bayesian classifier was indeed able to make relatively accurate predictions based on code input; i.e., given a section of code, the algorithm returns either true (yes, there is a bug) or false (no, there is not a bug) for that particular section of code. Whilst not entirely comparable to the intended research subject of this paper, it is still a promising result.

As for text file analysis in general, the term ‘data mining’ is nearly part of everyday vocabulary nowadays, in large part due to the amount of research conducted in the field. Understandably, Google is among the pioneers in this research, and papers like “Experience Mining Google’s Production Console Logs” [10] show that machine learning techniques can be successfully applied even to console log data.

1.12 Implications of Related Research

Previous research, like that by Chao Liu, Zeng Lian, and Jiawei Han [9], indicates that the chosen method has been successfully applied to problems of a similar nature with positive results. Whilst not all of their findings were applicable due to slightly different data and goals, many of the key concepts remain the same. Research like that by Weixi Li [11] shows that patterns can be found in data closely related to that used for the purpose of this project. Her paper also highlights some of the problems she faced during her research that are similar to those faced by this project during the feature extraction phase.


Chapter 2

Machine Learning

This chapter provides theoretical background for the remainder of the report and serves as a basis for the concepts utilized throughout this project.

2.1 Machine Learning

Machine learning is, as the name suggests, a term used to describe the field associated with learning algorithms. Learning in and of itself is a broad term that is hard to define. In the field it is normally defined as the ability of a particular algorithm or system to acquire knowledge or skill through analysis of related data. A machine learning system will typically consist of several distinct phases: a training phase, a validation phase and actual operation. During the training phase the system is given test data with known associations so that it can identify patterns in order to correctly make predictions within that set. The validation phase is meant to grade the performance of the system using related but disjoint data, also with previously known answers. Finally, if the program is deemed to be up to specifications, it can be used on real data with unknown answers.

The idea of machine learning originates in trying to make computers able to make predictions based on what may seem like random data. This is something that humans are very good at, whilst computers remain rather inept. One of the reasons for this is that we have a large library of preexisting knowledge. Examples of this include speech and handwriting recognition; within these fields humans can make predictions with a high degree of accuracy based on both previous knowledge and the ability to collate scattered data quickly. The difficulty lies in describing how this process actually works for humans; if the process cannot be broken down, it is hard to represent it using a conventional algorithm.

2.2 Types of Learning

Machine learning is a diverse field in computer science, and algorithms can be divided into several different groups. First, they can be divided based on the type of problem they are designed to solve. Some of the most common problem types are classification, clustering, regression, and anomaly detection [1]. They can also be categorized based on their learning methodology. Some of the more common learning methods are supervised learning, unsupervised learning, reinforcement learning, and association rule learning [1].

The primary focus of this project will be researching and implementing supervised classification algorithms, as the degree of success of an algorithm depends heavily on selecting the right approach for the data at hand. In this particular case the goal is to classify system dumps into different types of errors based on their content.

2.3 Perceptrons and Neural Networks

The perceptron algorithm is one of the very first machine learning algorithms [1]. It aims to roughly emulate the way neurons work in a biological entity in order to linearly classify an input vector. In much the same way as the neurons in our own brains work, the perceptron algorithm takes an input vector, X, computes its dot product with a weight vector, W, and if this value exceeds a threshold value, a signal is sent. The size of the input vector depends on the dimensionality of the available data. Each dimension is assigned a weight, i.e. how much values from said dimension impact the results in the data set. In its most basic configuration the perceptron algorithm is a linear classifier; in short, this means that the data space is separated by a linear function that denotes whether or not the input vector for each point exceeded the threshold.

Figure 2.1: A graphical illustration of the perceptron algorithm.

The perceptron algorithm is a form of supervised learning and is trained by changing the weights iteratively so that the algorithm achieves the correct result for a given input. There are different methods of learning, but the general idea is to iterate over the training data and, when the algorithm gets an example wrong, adjust the weights according to a predetermined method until it gets it right. Of course, if the input data cannot be separated by a hyperplane, there is no perfect solution. For linearly separable data, however, convergence can be guaranteed using the following method [1].

1. Figure by Mayranna (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons

1. Set the weights and threshold values. The weights should be small, either zero or close to zero.

2. For each data point in our training set, the following steps should be performed:

a) Determine the current output (the dot product of the input and weightvectors).

b) Update the weights based on the chosen learning rate. The learning rate is a multiplicative constant applied to each weight update.
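The training loop above can be sketched as follows. This is a minimal illustration on a toy problem (learning the logical AND function), not the implementation used in this project:

```python
def train_perceptron(X, y, lr=0.1, epochs=100):
    """Train a linear threshold unit on labels in {0, 1}."""
    w = [0.0] * len(X[0])  # small initial weights (here: zero)
    b = 0.0                # threshold folded in as a bias term
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            output = 1 if activation > 0 else 0  # step function
            if output != target:
                update = lr * (target - output)
                w = [wj + update * xj for wj, xj in zip(w, xi)]
                b += update
                errors += 1
        if errors == 0:  # converged (guaranteed for linearly separable data)
            break
    return w, b

# Learn logical AND, a linearly separable problem.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
print(preds)  # → [0, 0, 0, 1]
```

The bias term b plays the role of the threshold: moving the threshold to the left-hand side of the comparison lets it be learned with the same update rule as the weights.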

2.4 Support Vector Machines

Support vector machines (SVM) are supervised learning models in which the algorithms are designed to analyze data in order to perform pattern recognition for classification or regression [7]. The original model is a linear classifier that aims to find a separating hyperplane. Later research has introduced non-linear SVMs through the usage of kernel functions [7]. The general idea behind the model is to maximize the shortest distance between any data point and the separating hyperplane; this distance is often called the functional margin [1].

Figure 2.2: A graphical depiction of the goal of SVM algorithms.

2. Figure by Cyc (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons
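As a rough illustration of the idea, the following is a from-scratch sketch of a linear soft-margin SVM trained with subgradient descent on the hinge loss. This is one common training scheme, not necessarily the formulation referenced in [7], and the data points are invented:

```python
def train_linear_svm(X, y, lam=0.01, epochs=500, lr=0.01):
    """X: list of feature lists; y: labels in {-1, +1}.
    Minimizes lam/2*||w||^2 + average hinge loss by subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point inside the margin: hinge loss is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # only the regularizer pulls w toward zero
                w = [wj - lr * lam * wj for wj in w]
    return w, b

# Two well-separated clusters in 2-D.
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
pred = lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1
print([pred(x) for x in X])  # → [-1, -1, -1, 1, 1, 1]
```

The regularization constant lam trades margin width against training errors; a small value approximates the hard-margin behavior depicted in Figure 2.2.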


2.5 Expectation Maximization

Expectation maximization (EM) is an iterative algorithm used to find maximum likelihood estimates of parameters. Its primary usage is when the model depends on unobserved variables [7]. The name EM is derived from the two steps of the algorithm. Each iteration consists of:

1. An expectation step. This step calculates the expected value of the log-likelihood function given the current parameter estimates. The log-likelihood measures how probable the observed data is under a given set of parameters, and can be used to compare models.

2. A maximization step. This step computes parameters that maximize the log-likelihood function from step 1.

3. The resulting parameter estimates are used to determine the distribution of the unobserved or latent variables that will be the starting point for the subsequent E step.

Figure 2.3: A graphical illustration of the expectation maximization algorithm.

3. Figure by Chire (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons
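The E and M steps above can be illustrated on a toy two-component Gaussian mixture, where only the component means are unknown. The fixed unit variances and equal mixing weights are simplifying assumptions made for brevity:

```python
import math
import random

random.seed(0)
# Synthetic data: two Gaussian clusters centered at 0 and 5.
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(5, 1) for _ in range(200)])

mu = [-1.0, 1.0]  # initial guesses for the two component means
for _ in range(50):
    # E step: responsibility of each component for each point,
    # assuming unit variances and equal mixing weights.
    sums = [0.0, 0.0]
    weighted = [0.0, 0.0]
    for x in data:
        d = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
        total = d[0] + d[1]
        for k in range(2):
            r = d[k] / total
            sums[k] += r
            weighted[k] += r * x
    # M step: each mean becomes a responsibility-weighted average.
    mu = [weighted[k] / sums[k] for k in range(2)]

print(sorted(round(m, 1) for m in mu))  # the two estimated means, close to 0 and 5
```

Each iteration is guaranteed not to decrease the likelihood, which is why the estimated means settle near the true cluster centers even from poor initial guesses.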


2.6 Compound Classifiers

Compound classification utilizes at least two distinct classifiers in order to make predictions [7]. The idea behind this is that utilizing different classification methods together can reduce the error rate of the system as a whole, by attempting to mask the impact of failures of individual classifiers with the successes of others. There are many different ways of doing this: voting ensembles give each classifier the right to place a weighted vote on the predicted value for a given input, and the prediction with the most votes is selected, whilst cascading classifiers run algorithms in series to produce one final prediction.
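A weighted voting ensemble of the kind described above might look like the following sketch; the stub classifiers and fault labels are invented for illustration:

```python
from collections import Counter

class VotingEnsemble:
    """Combine classifiers that each expose a predict(x) method."""
    def __init__(self, classifiers, weights=None):
        self.classifiers = classifiers
        self.weights = weights or [1] * len(classifiers)

    def predict(self, x):
        votes = Counter()
        for clf, w in zip(self.classifiers, self.weights):
            votes[clf.predict(x)] += w  # each classifier casts a weighted vote
        return votes.most_common(1)[0][0]

# Stub classifiers standing in for real, trained models.
class Always:
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label

ensemble = VotingEnsemble([Always("hw_fault"), Always("sw_fault"), Always("hw_fault")])
print(ensemble.predict(None))  # → hw_fault
```

A cascading design would instead call the classifiers in sequence, passing each input to the next stage only when the current stage is uncertain.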

2.7 Naïve Bayes

Naïve Bayes is a type of supervised learning algorithm used for the classification of data. It is based on Bayes’ theorem, which relates the probabilities of two events A and B to their respective conditional probabilities: P(A|B) = P(B|A)P(A) / P(B). The reason it is called naïve is that it inherently assumes that all feature variables are independent of one another, which is very seldom the case in reality [1].

Figure 2.4: The underlying principle of a naïve Bayesian classifier.

4. Figure by Saed Sayad (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)
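Since the thesis does not publish its classifier, the following is only a toy naïve Bayes text classifier with Laplace smoothing; the tokens and fault labels are invented:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, class_counts, word_counts, vocab):
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, count in class_counts.items():
        lp = math.log(count / total)  # log prior P(class)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            # Laplace smoothing: unseen words get a nonzero probability.
            lp += math.log((word_counts[label][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [(["stack", "overflow", "dsp"], "sw_fault"),
        (["memory", "parity", "asic"], "hw_fault"),
        (["stack", "trace", "null"], "sw_fault")]
model = train(docs)
print(predict(["stack", "null"], *model))  # → sw_fault
```

Working in log space avoids floating-point underflow when many small word probabilities are multiplied, and the naïve independence assumption is what allows the per-word probabilities to simply be summed.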


Chapter 3

Method

3.1 Data Analysis

The initial data selection was conducted based on the existing expertise of Ericsson employees experienced with crash dump analysis and TR routing. This was followed by empirical analysis of the selected sources in order to assess their respective independence as well as the type of data they contained. The methods used for analysis were counting the unique entries associated with each data source and MHO, as well as adjacency matrices and plots used to indicate their viability for classification.

3.2 Feature Selection

Once data sources are selected, features must be selected from the data. This is one of the most important parts of the project. Features should minimize the loss of significance whilst maximizing the reduction of noise. Anything that is part of a given data source without providing added significance can be considered noise.

The selection of meaningful features was achieved through consultation with Ericsson employees, together with empirical analysis achieved by testing the accuracy of classifiers with different features and counting the number of unique entries produced by the given feature implementation. The viability of features was also evaluated based on their performance as data sources for individual classifiers.

3.3 Bayesian Methods

The selected machine learning method for this project is the Bayesian classifier. The features and practical application of this method are based on empirical analysis in combination with existing theory on the subject. The underlying idea of Bayesian statistics is utilizing the joint historical probability of a variety of features occurring for any given class. The statistical basis of Bayesian methods makes them a good choice for a proof-of-concept design for a variety of reasons. First of all, it is a widely used and proven method [7] and has been used for similar projects [9]. It is also rather simple to implement, which allows for more readily available testing.
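As a rough illustration of the idea above, the Bayesian (naive Bayes) scoring of a dump can be sketched as follows. The data layout, the class labels in the usage below and the probability floor are illustrative assumptions, not the project's actual implementation:

```python
import math
from collections import defaultdict

def train(examples):
    """examples: list of (mho_label, [feature_id, ...]) pairs."""
    class_counts = defaultdict(int)                         # dumps seen per MHO
    feature_counts = defaultdict(lambda: defaultdict(int))  # MHO -> feature -> count
    for label, features in examples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
    return class_counts, feature_counts

def classify(features, class_counts, feature_counts, floor=5e-5):
    """Return the class maximizing prior * product of per-feature probabilities."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior P(class)
        for f in features:
            p = feature_counts[label].get(f, 0) / n
            # features never seen for this class get a small positive floor
            score += math.log(p) if p > 0 else math.log(floor)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow when many features are combined for a single prediction.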

3.4 Classification

The data set utilized throughout this project contains twenty distinct classes, each represented by at least one crash dump. Implementing Bayesian statistics for this many classes is slightly atypical [7], but the statistical basis of the method remains unchanged.

3.5 Viability of Disregarded Algorithms

Perceptrons and neural networks are viable alternatives to Bayesian classifiers for similar projects. However, they require more data than was available for this particular project to achieve viable performance [7]. Support vector machines have been implemented in similar projects [2], but their practical application is somewhat less trivial than that of a Bayesian classifier, especially when data is not abundant.

Compound classifiers are, in essence, several classifiers working symbiotically to make predictions and are, as such, more of a potential path for future development than an initial attempt.


Chapter 4

Implementation

This chapter describes the data utilized for the experiments as well as the various methods used in order to achieve the results of this project.

4.1 Data

This project uses dump file data generated by Ericsson's base stations during a fatal error. These files are then run through internal software called CDA (crash dump analysis), which generates a structure of HTML documents. These documents contain a vast amount of data pertaining to the state of the system at the point of failure, not all of which is utilized for learning. Only the most pertinent sections of data are included in the feature vectors, as including more would raise the dimensionality beyond what is reasonable for the amount of training data available. There are 210 data sets available for training and validation, with 20 classes represented by at least one dump. The dump files used for training and testing are from cases that have already been solved and thus have known routes.

4.2 Data Sources

Features were extracted from a subset of the various HTML documents available from each dump. The subset was chosen based on recommendations from Ericsson employees familiar with TR routing and some rudimentary testing.

The following documents were chosen:

• LPPShowFatalError.html – A short human-readable error message describing the fatal error that caused the crash.

• StackTraceUnwind.html – A snapshot of the contents of each DSP stack at the point of failure. Each entry contains an address and a source code reference.


• LPPShowSem.html – A semaphore table where each entry contains a semaphore ID as well as the name of the program that had it reserved at the point of failure.

• LPPZipLog.html – A complete event list per DSP.

The content of each of these documents has been parsed and the data aggregated in order to make their respective contents comparable. Comparability and consistency of the data were tested independently by plotting each feature in MatLab¹.

4.3 Feature Extraction

Feature extraction is primarily performed in order to reduce the dimensionality of the problem and to reduce the amount of noise. The dimensionality of the problem is simply the number of rows in the feature vector, or the number of features extracted per element of data. Every piece of data that does not correlate with classification is considered noise and should be removed so that it does not negatively impact performance or accuracy. In addition to selecting relevant features and extracting them from their sources, they must be processed into something interpretable by the chosen machine learning approach. Often this means reducing the data to a normalized numerical value.

One potentially relevant data source from LPPShowFatalError.html is the error message, which is displayed in the form of a human-readable descriptive sentence. For the purpose of this type of analysis the actual meaning of the sentence has very little importance; instead, the sentence must be reduced into representative data that clearly differentiates messages from one another. One way of comparing strings quickly is calculating checksums, which results in a value that is, with very high probability, unique for every unique string. The checksums can then be counted and assigned a numerical value; the trivial way of doing this is simply incrementing a counter by one for each new unique checksum. Each unique checksum is then assigned a probability of occurrence per MHO.

¹MatLab – Software for numerical analysis and graphical representation of data.
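The checksum-and-counter scheme described above might look like the following sketch, where `zlib.crc32` stands in for whichever checksum the project actually used, and the message strings are invented examples:

```python
import zlib

message_ids = {}  # checksum -> incrementing numerical ID

def message_id(message):
    """Map an error-message string to a small numerical ID via its checksum."""
    # crc32 collisions are possible in principle, but rare at this scale
    checksum = zlib.crc32(message.encode("utf-8"))
    if checksum not in message_ids:
        message_ids[checksum] = len(message_ids)  # next unused counter value
    return message_ids[checksum]
```

Counting the occurrences of these IDs per MHO then yields the per-MHO probability of each unique message.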


Figure 4.1: The amount of unique fatal error messages for each MHO.

Figure 4.2: An adjacency plot where a line between two nodes indicates at least one shared error message.

Some of the MHO’s in figure 4.1 as well as in the following diagrams will show

21

Page 29: Machine Learning Methods for Fault Classification908072/FULLTEXT01.pdf · feasibility of implementing machine learning techniques in order to classify dump files for more ecient

CHAPTER 4. IMPLEMENTATION

zero unique values for certain features. This is due to the fact that not all logs arecompletely consistent in content and sometimes parts of the logs contain corruptdata which is omitted during classification.

Each LPPShowFatalError.html file also contains an error code, which is a kind of error ID. The code is a short hexadecimal entry that is typically not used for human classification unless two logs with the same error message are being compared; in that case the error code can be used to differentiate between the two. Much like with the error messages, each unique code can be assigned an incrementing value where each code is given a numerical ID. Once each error code has been assigned an ID, the probability of each error code occurring per MHO can be calculated.

Figure 4.3: The amount of unique fatal error codes for each MHO.


Figure 4.4: An adjacency plot where a line between two nodes indicates at least one shared error code.

As can be seen in figure 4.4, these error codes can be very powerful for identifying the source of an error, as the error codes are linked almost exclusively to one MHO. There is, however, a large amount of unique codes, which means that, unless more data is available at a later stage, there is a significant probability that a code occurring during classification has not been encountered during training.

Finally, each LPPShowFatalError.html contains the line of code at which the fatal error was triggered. This is presented in the form of a path and a line number. Testing showed that the specific line at which the code failed had little impact on the performance of the classifiers; it was thus omitted when identifying unique error sources. The error sources were identified in the same manner as above, given an incrementing ID and a probability of occurrence per MHO.


Figure 4.5: The amount of unique fatal error sources for each MHO.

Figure 4.6: An adjacency plot where a line between two nodes indicates at least one shared error source.

StackTraceUnwind.html contains a per-processor list of active processes. Each entry has a process ID, an associated memory address, a processor ID and a path to the source code. Everything except the source path is intended to be assigned on a priority basis and is thus not a deterministic source of identification. Each unique source, with the specific line omitted, was given an incrementing ID, and its frequency of occurrence per MHO was calculated.

Figure 4.7: The amount of unique stack trace entries for each MHO.


Figure 4.8: An adjacency plot where a line between two nodes indicates at least one shared stack trace entry.

As is made apparent by the adjacency plot in figure 4.8, every MHO shares at least one entry with every other MHO. Ideally, there should be as little correlation between features of different classes as possible, since these differences are what is being used for classification. Since the diagram above only shows whether there is a correlation, and not how strong it is, using these features may still be viable. However, this data is a strong indicator that their individual performance should be scrutinized further.

LPPShowSem.html contains a list of the semaphores reserved at the time of the fatal error. A semaphore is a construct used to control access to a shared resource, such as a portion of memory, for a specific process, and improper use of semaphores can give rise to a variety of errors, such as race conditions and deadlocks. Each row of the file is assigned an ID, and its frequency of occurrence per MHO is calculated.


Figure 4.9: The amount of unique semaphore entries for each MHO.

Figure 4.10: An adjacency plot where a line between two nodes indicates at least one shared semaphore entry.

LPPZipLog.html contains a complete list of runtime events per processor. Each entry contains a processor ID, a process ID, a buffer number, the microsecond of occurrence and the event itself. Each event is assigned an incrementing ID, and its frequency of occurrence per MHO is calculated.

Figure 4.11: The amount of unique zip log entries for each MHO.

Figure 4.12: An adjacency plot where a line between two nodes indicates at least one shared zip log entry.


4.4 Feature Selection

Based on the figures in this section and the results of the independent classifiers, the human-readable error messages and the semaphores were omitted from the final (combined) classifier, as they only served to decrease its accuracy under the existing circumstances. This means that the combined classifier utilized the following features in order to make predictions: zip log entries, stack trace entries, error sources and error codes.
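One plausible way to realize such a combined classifier is to sum, per MHO, the log-scores produced by each of the per-source classifiers. The source names and probability values below are hypothetical, and the probabilities are assumed to be positive (floored upstream for unseen entries):

```python
import math

def combined_prediction(per_source_scores):
    """per_source_scores: {source_name: {mho: probability}}; returns the best MHO."""
    totals = {}
    for scores in per_source_scores.values():
        for mho, p in scores.items():
            # accumulate the log-score of each MHO across all data sources
            totals[mho] = totals.get(mho, 0.0) + math.log(p)
    return max(totals, key=totals.get)
```

With this formulation, a data source that assigns near-zero probability to an MHO effectively vetoes it, which is one reason poorly performing sources (such as the error messages here) can drag down the combined accuracy.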

4.5 Remarks on Feature Extraction

The selection of features is heavily based on recommendations from personnel with experience of manual classification and extensive knowledge about the content and origin of the data. In addition to this, features were tested individually for prediction accuracy, entropy and MHO adjacency. These tests, conducted by counting unique entries per MHO and feature and by drawing adjacency plots, were the basis for deciding how to best reduce noise and select representative data. Noise is data that contains no significant information about which class the dump is from and thus dilutes the quality of the information. The adjacency plots show that almost all data sources have significant overlap between different classes; there is, however, a difference in frequency of occurrence, as well as features that are distinct in nearly every case.

4.6 Deviations from Bayesian Methods

The traditional application of Bayesian statistics would be to treat each unique entry in the zip log, stack trace and semaphore files as an independent feature. However, this would make for very large feature vectors in relation to the amount of data, and testing showed a significant drop in performance with this approach compared to utilizing aggregated features. Instead of treating each entry independently, an average data file is constructed for each MHO, which is then compared to that of the unclassified crash dump. If more data were to become available, this aspect could be revisited in order to reevaluate its performance.

The average data file is created by counting the occurrence of each type of entry per MHO, and subsequently dividing these counters by the number of times that particular type of MHO was encountered during training. The segment of the feature vector that represents each of these data sources is thus a series of counters, one per unique type of entry that has been encountered during training.

Additionally, some of the features will not have been encountered during training, which would mean that their historical probability of occurrence is 0%. Since the amount of unique entries is large in relation to the dataset, these cases are reassigned a very low positive probability (0.005%). This ensures that viable results are not omitted due to a small portion of the data being new to the program.
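The construction of the average data file, together with the 0.005% floor for never-seen entries, can be sketched as below. The data layout and variable names are illustrative assumptions:

```python
from collections import defaultdict

def average_profiles(training_dumps, floor=5e-5):
    """training_dumps: list of (mho, [entry_id, ...]).
    Returns, per MHO, the average count of each entry type per dump."""
    dump_count = defaultdict(int)                         # dumps seen per MHO
    entry_count = defaultdict(lambda: defaultdict(int))   # MHO -> entry -> count
    all_entries = set()
    for mho, entries in training_dumps:
        dump_count[mho] += 1
        for e in entries:
            entry_count[mho][e] += 1
            all_entries.add(e)
    profiles = {}
    for mho, n in dump_count.items():
        profiles[mho] = {
            # average occurrence per dump; never-seen entries get the floor
            e: entry_count[mho][e] / n if entry_count[mho][e] else floor
            for e in all_entries
        }
    return profiles
```

An unclassified dump would then be compared against each MHO's profile, with the floor preventing a single unseen entry from zeroing out an otherwise good match.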


4.7 Data Processing

The frequency of occurrence of each unique entry is calculated per MHO. When all data in the training set has been processed, the count per entry is divided by the number of MHOs of that type. This results in a probability distribution of entries per MHO.


Chapter 5

Results

The results in this section were generated by randomly dividing the available data into two distinct subgroups based on a predetermined ratio of 1:3: one for training the classifier and one for validating the results. The randomization can be performed on demand but is not changed between runs, and comparisons between classifiers are always based on their results on the same test data. There was a total of 210 available dump hierarchies, which were split into 78% training data and 22% validation data. The discrepancy between the desired ratio and the actual ratio is due to a portion of dumps being corrupt; these were disregarded upon detection. The answers to the training data are known to the classifiers, whilst the answers to the validation data are unknown. Performance is gauged by comparing the predictions of the classifier to the correct answers.
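A seeded hold-out split matching the description above (random once, then fixed between runs) could look like this sketch; the seed value and function name are arbitrary choices:

```python
import random

def split_data(dumps, validation_ratio=0.25, seed=42):
    """Split dumps into (training, validation) at the given ratio."""
    rng = random.Random(seed)   # fixed seed: the same split every run
    shuffled = list(dumps)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_ratio)
    return shuffled[n_val:], shuffled[:n_val]
```

Corrupt dumps discarded after this split shift the effective ratio, which is how a nominal 1:3 split can end up as 78%/22% in practice.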

Three of the twenty MHOs were omitted from the tests due to insufficient data (fewer than three crash dumps per MHO), meaning that there were a total of seventeen classes during the tests. The random baseline is assumed to be 1/17 (approx. 5.9%), and the human baseline of 77% prediction accuracy is based on anecdotal evidence from Ericsson's own internal research.

Figure 5.1: A simple diagram depicting the ratio of training and validation data used for obtaining the results of this project.


5.1 Independent Results of Classifiers

This section shows the results of running Bayesian classification with only one type of data source. Each diagram has one entry per MHO in the validation set. The values are either one, which indicates a correct classification, or zero, which indicates an incorrect classification.

Figure 5.2: Accuracy of predictions per MHO in the validation set based on the human-readable error messages.

Figure 5.3: Accuracy of predictions per MHO in the validation set based on the error codes.


Figure 5.4: Accuracy of predictions per MHO in the validation set based on the error sources.

Figure 5.5: Accuracy of predictions per MHO in the validation set based on the contents of the semaphore lists.


Figure 5.6: Accuracy of predictions per MHO in the validation set based on the contents of the zip logs.

Figure 5.7: Accuracy of predictions per MHO in the validation set based on the contents of the stack traces.

5.2 Combined Results and Comparisons

The following section contains comparisons between the accuracy and predictions of the respective classifiers, as well as the results of the final Bayesian classifier utilizing the combination of stack traces, zip logs, error sources and error codes.


Figure 5.8: Accuracy of predictions per MHO in the validation set for the combined classifier.

Figure 5.9: The prediction accuracy of each classifier.


Figure 5.10: The respective hits and misses of each method of classification per dump in the validation set.

Removing all entries that existed for more than one MHO was attempted as a result of the findings during the data analysis phase. This attempt showed that classifiers with sparse adjacency matrices performed nearly identically with exclusive features, whilst those with dense adjacency matrices showed a significant decrease in prediction accuracy.

Figure 5.11: The prediction accuracy of each classifier with exclusive features.

5.3 Comments on the Results

The resulting Bayesian classifier had a 59% prediction accuracy on the validation set for its first guess and an additional 16% on its second guess. The best performing individual classifiers were those based on stack trace entries (30%), zip log entries (38%), error codes (35%) and error sources (30%). The individual classifiers with the poorest performance were those based on semaphore entries (8.1%) and human-readable messages (2.7%).


Chapter 6

Conclusion

6.1 Discussion

The methods implemented during the course of this project show great potential and exceeded the initial goal of better-than-random accuracy. In fact, the result of the final classifier, 59% prediction accuracy, approaches human levels of accuracy. However, given the small size of the dataset, it is difficult to predict exactly how well this accuracy will hold up over time. The best performance was achieved by utilizing error sources, error codes, zip logs and stack traces as data sources. Adding the human-readable error messages or the semaphores yielded a measurable decrease in performance. This does not come as a surprise given their poor individual performance, illustrated in the results section above.

Some of the individual classifiers performed very well by themselves, and it is possible that further improvements could be made through research into different methods of utilizing their contents. Furthermore, a portion of the log files remains unexplored as a result of the time constraints of the project and a lack of insight into their content. A more exhaustive study may try to utilize some of this content to further improve prediction accuracy.

The results indicate that the human-readable messages are, at least from a machine learning perspective, extremely poor data sources for classification. This is surprising, as they are intended to be meaningful to human classifiers. This result may, on its own, impact the way these error messages are written and how much trust they are given during crash dump analysis.

6.2 Conclusion

The Bayesian classifier implemented as a result of this project far surpassed the random baseline of 5.9%, but was unable to reach human levels of accuracy. The limited amount of data is likely to have negatively impacted the prediction accuracy for certain classes where only a few historical examples were available for training. Despite this, the results of this project clearly show that machine learning techniques are a feasible alternative to human processing of trouble reports and may be able to replace or aid human classification of trouble reports in the very near future.

6.3 Recommendations

The accuracy of this particular approach has not yet reached human levels, but it could be used as an aid to point out probable error sources, which can then be analyzed by a human. If this approach is to achieve human levels of prediction accuracy, more historical data needs to be stored and utilized for training. As indicated by this project, machine learning is indeed a viable approach for TR routing and could, in all likelihood, be improved further by implementing more advanced classifiers.

6.4 Future Work

Increasing the amount of training data would enable the addition of more features, which may increase the prediction accuracy. It would also improve the accuracy of the existing methods by providing a greater data base for each class. The methods presented in this project could also be implemented as part of a compound classifier utilizing several different machine learning methods; such an approach may yield a more robust classifier with greater accuracy. Finally, the classifier could be used as a supplement to human classification: the methods and resulting program could point out probable routes, which could then be evaluated by a human, thus potentially reducing the burden on employees responsible for TR routing.


Bibliography

[1] Marsland, Stephen. Machine Learning: An Algorithmic Perspective. Boca Raton: Chapman & Hall/CRC, 2009. Print.

[2] Fronza, Ilenia, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. "Failure Prediction Based on Log Files Using Random Indexing and Support Vector Machines." The Journal of Systems and Software 86 (2013): 2-11. Web.

[3] Zheng, Alice Xiaozhou. Statistical Software Debugging. Diss. Thesis (Ph.D. Engineering – Electrical Engineering and Computer Sciences), University of California, Berkeley, Fall 1999. Berkeley: University of California, 2005. Print.

[4] Noorian, Mahdi, Ebrahim Bagheri, and Weichang Du. "Machine Learning-based Software Testing: Towards a Classification Framework." University of New Brunswick, Fredericton, Canada. Print.

[5] Roychowdhury, Shounak. "Ensemble of Feature Selectors for Software Fault Localization." IEEE International Conference on Systems, Man, and Cybernetics, COEX, Seoul, Korea (2012). Department of Electrical and Computer Engineering, The University of Texas at Austin. Web.

[6] Rish, Irina. An Empirical Study of the Naive Bayes Classifier. Tech. no. RC22230. Yorktown Heights: IBM Research Division, 2001. Print.

[7] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.

[8] Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from Data: A Short Course. United States: AMLBook.com, 2012. Print.

[9] Liu, Chao, Zeng Lian, and Jiawei Han. How Bayesians Debug. University of Illinois-UC, Brigham Young University. Print.

[10] Xu, Wei, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. Experience Mining Google's Production Console Logs. University of California at Berkeley, Intel Labs Berkeley. Print.


[11] Li, Weixi. "Automatic Log Analysis Using Machine Learning." Thesis. Uppsala Universitet, 2013. Print.


www.kth.se