
Date of publication May 7, 2019, date of current version May 7, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2915388

Tripartite Active Learning for Interactive Anomaly Discovery

YANQIAO ZHU¹ and KAI YANG²
¹School of Software Engineering, Tongji University, Shanghai, China.
²Department of Computer Science, Tongji University, Shanghai, China.

Corresponding author: Kai Yang (e-mail: [email protected])

This work was supported in part by the National Natural Science Foundation of China under Grant 61771013, in part by the 2018 Key Joint Research Program of the Department of Education of China and China Mobile under Grant 3-5, and in part by the National Young Expert Program of China.

ABSTRACT Most existing approaches to anomaly detection focus on statistical features of the data. However, in many cases, users are interested only in a subset of the statistical outliers, depending on the specific domain of interest, e.g., network attacks or financial fraud. Instruction from human experts is therefore indispensable for building predictive models in such applications. However, obtaining labels from human experts is time-consuming and expensive, whereas obtaining labels from nonexpert labelers is relatively easy and cost-effective, although the labeling accuracy of a nonexpert is usually difficult to assess. Therefore, it remains an open problem to leverage both machine intelligence and the knowledge of labelers with diverse backgrounds to construct a machine learning model for domain-specific anomaly detection. To this end, this paper proposes a framework of tripartite active learning for interactive anomaly discovery in large datasets based on crowdsourced labels. The tripartite active learning method consists of two stages. In the first stage, an unsupervised learning algorithm is employed to extract statistical outliers from the dataset. This algorithm has low computational complexity and memory requirements and is thus well suited to large datasets. We then develop an iterative algorithm consisting of two steps. The algorithm first evaluates and trains labelers based on gold instances provided by the expert labelers. Then, it assigns the most informative samples to its most confident labeler for relabeling and updates the detector based on the new labels. Capacity constraints are taken into account in the active learning approach to guarantee fair allocation of labeling instances as well as robustness against erroneous labels. Experiments show that the proposed algorithm provides an effective means for interactive anomaly detection. To the best of our knowledge, this is the first work that designs a tripartite machine learning system for domain-specific anomaly detection.

INDEX TERMS Active Learning, Interactive Artificial Intelligence, Anomaly Detection, Linear Integer Programming, Human Training

I. INTRODUCTION

ANOMALY detection finds applications in a variety of areas, including bank fraud detection, novelty discovery, network intrusion detection, and system monitoring. Traditional anomaly detection refers to the detection of patterns in a given dataset that do not comply with the majority of the data. It is becoming increasingly important in the big data era, in which massive amounts of data are available for processing and analysis. A statistical anomaly detector seeks to detect unusual patterns that deviate from the majority of the data. Such an approach, however, may give rise to many false alarms and hence is not suitable for many applications. In practice, knowledge from domain experts is indispensable for building an accurate predictive model. However, it is in general time-consuming and expensive to obtain labels from experts. Obtaining labels from nonexperts is usually less expensive, but the labels can be erroneous, which may severely deteriorate the performance of the anomaly detector.

Active learning is a subfield of machine learning that interactively queries labelers to obtain human instruction. In reality, unlabeled data is available in abundance, while it is often time-consuming and labor-intensive to manually obtain labels from human experts.


As a result, training an accurate model with as few labels as possible has attracted significant research interest. To make full use of both labeled and unlabeled data, semi-supervised learning has been widely used as an effective approach. Active learning, as one of the semi-supervised learning approaches, can achieve strong performance with few labeled training samples, especially under the constraint of a restricted labeling budget [1]. Traditionally, the active learning algorithm iteratively selects the most informative instances with the greatest potential to improve the model and queries their labels from an oracle. Under this setting, the labeler is assumed to be fully reliable, which is not realistic. In this paper, we present a tripartite active learning approach with gold instances from a crowdsourced dataset. A distinct feature of this learning system is that it involves three parties, namely the intelligent machine, the expert annotators, and the nonexpert annotators. The machine not only queries instructions from human experts; it also evaluates and trains the nonexpert labelers based on the gold instances collected from the expert labelers. Such a tripartite framework can leverage both human and machine intelligence and thus can be applied to diverse domains, e.g., detecting fake news or fake photos, a problem into which social media companies such as Facebook have put significant effort. To the best of our knowledge, this is the first work that considers active learning in a three-way manner for anomaly discovery.

The proposed anomaly detection scheme consists of two stages. In the first stage, we employ an unsupervised label propagation algorithm with random sampling (LPRS) to extract statistical outliers from the dataset. This algorithm is both memory and computationally efficient and is thus particularly suitable for large datasets. More importantly, the LPRS algorithm can efficiently extract a small portion of the data as potential anomaly candidates and then feed them into the second stage of the algorithm, in which human labeling is involved. Then, we develop an iterative algorithm consisting of two steps that refines the results obtained in the prior stage. In this stage, a number of annotators with diverse expertise from crowdsourcing platforms (e.g., Amazon Mechanical Turk) may be involved in the labeling task. We first evaluate and train labelers based on gold instances. Gold instances [2] are instances with high-quality labels along with explanations of why such labels are assigned. We categorize labelers into three groups based on their performance on these gold instances: (1) expert labelers, who can consistently provide high-quality labels, (2) apprentice labelers, whose labels are of relatively low quality but who can be trained to provide better labels, and (3) unqualified labelers, who keep providing erroneous labels. Then, we assign each error-like instance to one expert labeler and let the experts relabel these erroneous instances. At the end of the second stage, the detector from the last round produces labels for the remaining unlabeled data, and the detector is retrained with labels provided by the two sources.

Extensive experiments have been conducted on a variety of datasets. The experiments show that the proposed algorithm can significantly improve the precision and recall of anomaly detection and also reduce the labeling effort by up to 90%, compared with other state-of-the-art methods.

The rest of this paper is organized as follows. Section II reviews related prior work. Section III presents the proposed two-stage algorithm based on active learning with gold instances for anomaly detection. Detailed experimental results are shown in Section IV. We conclude the paper and discuss possible future research directions in Section V.

II. RELATED WORK

For the anomaly and rare-category detection problem, mainstream approaches in the literature are based on three techniques: classification-based approaches, clustering-based approaches [3], and nearest-neighbor-based approaches [4], [5]. The latter two approaches often make assumptions about outliers in terms of frequency or statistical distribution, but they do not provide semantic explanations of why a particular sample is selected as an outlier. In addition, they usually require a relatively large amount of computation. Even worse, typical unsupervised learning algorithms often suffer from high false alarm rates. The former approach, in contrast, relies on an accurately labeled dataset, which is usually difficult to obtain.

Over recent years, active learning has been well studied because of its ability to reduce labeling effort. Many active learning algorithms aim to design strategies for selecting the most informative and representative instances to improve the classification model [6], [7]. There is also work on anomaly detection that uses active learning strategies to reduce the number of labeled instances while maximizing the number of true anomalies found [8], [9]. One strong assumption underlying traditional active learning algorithms is that the labels come from a single perfect oracle and are of high quality. In reality, however, such an assumption does not hold. On most crowdsourcing platforms, error-prone labelers exist and will deteriorate the performance of the machine learning model due to erroneous labels.

There are many approaches for learning a model from noisy labelers. One category of methods collects an accurate labeled dataset from multiple noisy annotators by estimating the reliability of labelers and selecting the best annotator to label all instances [5], [10], [11], or by combining different parts of reliable datasets from multiple workers [12]. In addition to estimating the reliability of labelers, other work [13], [14] offers an "unsure" option when querying labelers to reduce misleading labels produced by reluctant labelers. In another category of methods, labelers are asked to relabel previous error-like instances caused by carelessness [15] to mitigate or eliminate the negative impact of noise. Admittedly, these methods are helpful for collectively improving data quality under the multiple-labeler setting. However, none of the prior art considers interactively learning in a three-way manner involving the intelligent machine, expert labelers, and nonexpert human operators.


[Figure 1: data flow among Gold Instances, Unlabeled Data, and Labeled Data, via an unsupervised learning algorithm in Stage I, and via evaluating and training labelers, relabeling by human experts, and a supervised learning algorithm in Stage II.]

FIGURE 1: Graphical illustration of the proposed TALC algorithm.

III. TRIPARTITE ANOMALY DETECTION WITH GOLD INSTANCES FROM CROWDS

In this paper, we present the Tripartite Active Learning with Gold Instances from Crowds for Anomaly Detection algorithm, TALC for brevity, a two-stage scheme for anomaly detection. As illustrated in Figure 1, the proposed algorithm consists of two stages, i.e., the initial stage and the active learning stage.

In this paper, datasets are denoted by uppercase italics, vectors by lowercase boldface, scalars by lowercase italics, and matrices by uppercase boldface. Suppose that we have already obtained a small gold instance dataset $G = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n_G}$ with $n_G$ labeled instances, where $y_i = 1$ and $y_i = 0$ indicate anomalies and non-anomalous data, respectively, and an unlabeled dataset $U = \{\mathbf{x}_i\}_{i=1}^{n_U}$ with $n_U$ instances. Also, there is a set of candidate labelers $A = \{a_l\}_{l=1}^{m}$ with $m$ labelers involved in the algorithm. To obtain a gold instance set of relatively small size, we could hire several expert labelers at a relatively high cost and let them label a small set of instances.

A. STAGE I: LABEL PROPAGATION WITH RANDOM SAMPLING

In the first stage, an unsupervised learning algorithm is employed to detect anomalies from the unlabeled dataset $U$. Because anomalies usually constitute only a small portion of the data, the unsupervised learning approach can remove most non-anomalous data and thereby facilitate the active learning with gold instances in the second stage.

Traditional unsupervised anomaly detection algorithms, such as [4], [16], compute the local deviation of density or distance of a given sample with respect to its neighbors, which demands relatively high computational resources and makes them unsuitable for large-scale anomaly detection. In this paper, we propose a novel method named Label Propagation with Random Sampling, LPRS for brevity, which can efficiently remove most non-anomalous points and retain potential anomalies.

First, we generate a dataset $P = \{\mathbf{x}_i\}_{i=1}^{n_P}$ by iteratively random sampling from $U$. In each iteration, a total of $r$ instances, denoted by $\{\mathbf{x}_1, \ldots, \mathbf{x}_r\}$, are sampled. Among them, the medoid instance determined by

$$\mathbf{x}_{\mathrm{medoid}} = \operatorname*{arg\,min}_{\mathbf{y} \in \{\mathbf{x}_1, \ldots, \mathbf{x}_r\}} \sum_{i=1}^{r} \|\mathbf{y} - \mathbf{x}_i\|_2^2 \qquad (1)$$

will be selected as a non-anomalous point and added to $P$. Once a small dataset $P$ containing $n_P$ non-anomalous points has been obtained, we select the $k$ nearest neighbors of the instances in $P$ as additional non-anomalous instances and add them to $P$ as well.
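As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of the repeated random-sampling/medoid step that builds the normal candidate set $P$; the function names and loop structure are our own, not the authors' implementation.

```python
import numpy as np

def medoid(sample: np.ndarray) -> np.ndarray:
    """Return the medoid of an (r, d) array: the row minimizing the sum of
    squared Euclidean distances to all other rows, as in Eq. (1)."""
    diffs = sample[:, None, :] - sample[None, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)   # pairwise squared distances
    return sample[np.argmin(sq_dists.sum(axis=1))]

def sample_normal_candidates(U: np.ndarray, n_P: int, r: int,
                             rng: np.random.Generator) -> np.ndarray:
    """Repeat the random-sampling/medoid step to build the normal set P."""
    P = []
    for _ in range(n_P // r):
        idx = rng.choice(len(U), size=r, replace=False)  # r random instances
        P.append(medoid(U[idx]))
    return np.asarray(P)
```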

Then, we generate a negative dataset $N = \{\mathbf{x}_i\}_{i=1}^{n_N}$ consisting of $n_N$ abnormal points by building a similarity matrix $\mathbf{W}$, whose entry $w_{ij}$ measures the distance-based similarity of $\mathbf{x}_i \in U$ and $\mathbf{x}_j \in P$. After that, we compute the mean similarity of each unlabeled instance in $U$ with the non-anomalous samples in $P$. Last, we select the instances in $U$ with the minimum mean similarity and add them to $N$.

Once we have obtained the small labeled dataset $P \cup N$, we apply Label Propagation [17] to efficiently generate labels for all instances of dataset $U$. To obtain high-quality labels, we next remove instances with low label confidence. To determine the appropriate confidence threshold, we treat the confidence values as a set and perform clustering, or segmentation, on this set. As a consequence of the clustering, two clusters are generated, and labels belonging to the minor cluster are regarded as low quality and excluded. The remaining detected instances, denoted by $U_0 = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n_{U_0}}$ with $n_{U_0} < n_U$, are then used as the input to a supervised learning algorithm to produce the initial detector $\theta_0$. The LPRS algorithm described above is summarized in Algorithm 1.
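For illustration, here is a minimal sketch of the propagation and confidence-filtering step, assuming scikit-learn's LabelSpreading (the label-propagation implementation the authors report using in Section IV) and a two-cluster k-means over the confidence scores as the segmentation step; all names are our own.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading
from sklearn.cluster import KMeans

def propagate_and_filter(X: np.ndarray, seed_labels: np.ndarray):
    """X: all instances of U; seed_labels: 1 for P, 0 for N, -1 for unlabeled.
    Returns (X0, y0): instances whose propagated labels survive the confidence
    filter, used to train the initial detector theta_0."""
    lp = LabelSpreading()                        # default parameters
    lp.fit(X, seed_labels)
    labels = lp.transduction_                    # propagated hard labels
    confidence = lp.label_distributions_.max(axis=1)

    # Cluster the confidence values into two groups; drop the minority cluster.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    km.fit(confidence.reshape(-1, 1))
    counts = np.bincount(km.labels_, minlength=2)
    keep = km.labels_ == counts.argmax()
    return X[keep], labels[keep]
```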

B. STAGE II: ACTIVE LEARNING WITH TRAINING ON GOLD INSTANCES

The pool-based active learning approach in the second stage is introduced to further refine the anomaly detector and improve the detection performance. It can be divided into two steps: (1) evaluating and training and (2) relabeling. Initially, every labeler $a_l \in A$ gets an identical copy $G_l$ of $G$ as its training dataset of gold instances.

1) Step 1: Evaluating and Training Labelers Based on Gold Instances

In the first step, at most $\rho$ gold instances from $G_l$ are taken to evaluate the performance of each labeler in each round. If labeler $a_l$ gives the correct label to gold instance $(\mathbf{x}_i, y_i) \in G_l$, $\mathbf{x}_i$ is removed from $G_l$ and not reused in future evaluations. Otherwise, if the labeler falsely labels the gold instance $\mathbf{x}_i$, the correct label $y_i$, along with the reasons why such a label is chosen, is displayed. We assume that the labeler will carefully read these provided answers and thus gain a certain degree of understanding from gold instance $(\mathbf{x}_i, y_i)$.
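A minimal sketch of this evaluation-and-training step follows, with the per-labeler state kept as a simple list of remaining gold instances; the helper names and the feedback mechanism (printing the explanation) are illustrative assumptions.

```python
import random
from typing import Callable, List, Tuple

def evaluate_and_train(G_l: List[Tuple[object, int, str]],
                       ask: Callable[[object], int],
                       rho: int) -> float:
    """Evaluate one labeler on at most rho gold instances from its copy G_l.
    Each gold instance is (x, y, explanation). Correctly answered instances
    are removed from G_l; otherwise the explanation is shown to the labeler.
    Returns the labeler's accuracy for this round."""
    batch = random.sample(G_l, min(rho, len(G_l)))
    correct = 0
    for item in batch:
        x, y, explanation = item
        if ask(x) == y:
            G_l.remove(item)          # do not reuse correctly answered gold instances
            correct += 1
        else:
            print(f"Correct label: {y}. Reason: {explanation}")  # train the labeler
    return correct / max(len(batch), 1)
```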


Algorithm 1: The LPRS Algorithm

/* Generate the normal dataset P */
1  P = ∅
2  for i = 1 to ⌊n_P / r⌋ do
3      Randomly sample r instances x_1, ..., x_r
4      Choose x_medoid according to Eq. (1)
5      Add x_medoid to P
6  end
7  Compute the k nearest neighbors of P and add them to P
/* Generate the abnormal dataset N */
8  W = {w_ij}, with every w_ij initialized to +∞
9  N = ∅
10 for x_i in U \ P do
11     for x_j in P do
12         w_ij = ‖x_i − x_j‖_2^2
13     end
14     s_i = (Σ_j w_ij) / |P|
15 end
16 Sort s in descending order
17 Add the instances with the top-n_N values of s (i.e., those farthest from P) to N
18 Propagate labels to U using P ∪ N
19 Cluster the confidence values into two classes c_1 and c_2, where |c_1| > |c_2|
20 Remove from U the labels whose confidence level falls in c_2, and denote the remaining dataset by U_0
21 return U_0

2) Step 2: Relabeling and Training the New Detector

In a crowdsourcing environment, labelers may have diverse expertise in the problem domain [18]. Therefore, we categorize labelers into three groups based on their performance on the gold instances: (1) expert labelers, who can consistently provide high-quality labels, (2) apprentice labelers, whose labels are of relatively low quality but who can be trained to provide better labels, and (3) unqualified labelers, who keep providing erroneous labels. All labelers continue to be trained on gold instances to improve their labeling accuracy, and only expert and apprentice labelers are allowed to label unlabeled instances in the next step.

In Step 2, the algorithm first selects at most $\phi$ instances and assigns each instance $\mathbf{x}_j$ to one labeler $a_l$ in every round. Intuitively, the instances selected for relabeling from the unlabeled dataset $U$ should carry the most information. In the proposed algorithm, we evaluate the informativeness of an instance based on uncertainty estimation. The uncertainty $r(\mathbf{x}_j)$ under detector $\theta$ can be measured as

$$r(\mathbf{x}_j) = -\sum_k P_\theta(y_k \mid \mathbf{x}_j) \log P_\theta(y_k \mid \mathbf{x}_j). \qquad (2)$$
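With a probabilistic detector that exposes class probabilities (e.g., any scikit-learn-style classifier with a predict_proba method), Eq. (2) reduces to the Shannon entropy of the predicted distribution; the following helper is an illustrative sketch.

```python
import numpy as np

def uncertainty(detector, X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Eq. (2): entropy of the detector's predictive distribution, per instance."""
    proba = detector.predict_proba(X)            # shape (n_samples, n_classes)
    return -np.sum(proba * np.log(proba + eps), axis=1)
```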

Besides, the selected labeler should be familiar and confident with the instance. The confidence level of an instance-labeler pair $(\mathbf{x}_j, a_l)$ can be determined according to the performance of $a_l$ on the gold instance dataset $G_l$ in the previous step. If $a_l$ has correctly labeled more of the neighbors of $\mathbf{x}_j$ in $G \setminus G_l$, it is more likely that $a_l$ will assign the correct label to $\mathbf{x}_j$. Formally, we have

$$c(\mathbf{x}_j, a_l) = \frac{1}{t} \sum_{\mathbf{x}_k \in \mathcal{N}(\mathbf{x}_j, a_l, t)} S(\mathbf{x}_j, \mathbf{x}_k), \qquad (3)$$

where $\mathcal{N}(\mathbf{x}_j, a_l, t)$ returns a set consisting of $t$ neighbors of $\mathbf{x}_j$ in the set $G \setminus G_l$ for labeler $a_l$, and $S(\mathbf{x}_j, \mathbf{x}_k)$ measures the similarity between $\mathbf{x}_j$ and $\mathbf{x}_k$. In this paper, we simply choose the cosine similarity as the similarity metric:

$$S(\mathbf{x}_j, \mathbf{x}_k) = \frac{\mathbf{x}_j \cdot \mathbf{x}_k}{\|\mathbf{x}_j\|_2 \|\mathbf{x}_k\|_2}. \qquad (4)$$
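As an illustration, a minimal sketch of Eqs. (3)-(4) follows, under the reading that $G \setminus G_l$ collects the gold instances labeler $a_l$ has already answered correctly (those removed from $G_l$ in Step 1); the function and argument names are our own.

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. (4): cosine similarity between two feature vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def confidence(x_j: np.ndarray, correct_gold: np.ndarray, t: int) -> float:
    """Eq. (3): mean similarity between x_j and its t most similar instances
    among the gold instances the labeler has answered correctly (G \\ G_l)."""
    if len(correct_gold) == 0:
        return 0.0
    sims = np.array([cosine_similarity(x_j, g) for g in correct_gold])
    # Average over the t nearest gold neighbors; the 1/t factor follows Eq. (3).
    return float(np.sort(sims)[-t:].sum() / t)
```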

Furthermore, to control labeling quality during relabeling, we require that the selected labeler have labeling performance exceeding that of the current detector model. Let $A^+$ denote the set of such labelers, i.e., $A^+ \triangleq \{a_l : P(a_l) > P(\theta)\}$, where $P(a_l)$ and $P(\theta)$ measure the performance of labeler $a_l$ and detector $\theta$ in the current round, respectively. Based on the previous discussion, a viable criterion for selecting the instance-labeler pair $(\mathbf{x}^*, a^*)$ is

$$(\mathbf{x}^*, a^*) = \operatorname*{arg\,max}_{\mathbf{x}_j \in U,\ a_l \in A^+} r^\beta(\mathbf{x}_j) \cdot c(\mathbf{x}_j, a_l), \qquad (5)$$

where $0 < \beta \leq 1$ is a weight factor. The weight factor $\beta$ can be adjusted dynamically while the labelers are under training. In the first $T_1$ rounds, all labelers have received little training and thus provide labels of relatively poor quality. At this time, we give preference to querying instance-labeler pairs with a high confidence level by setting a relatively small value of $\beta$, e.g., $\beta = 0.5$. After the overall quality of the labelers has improved, we are inclined to query the most informative instances by setting $\beta$ to a relatively large value, e.g., $\beta = 1$. Such a selection strategy ensures that every $(\mathbf{x}^*, a^*)$ pair generates an accurate label at the different stages of training and consequently allows the detector to use this additional label to improve its performance.
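The selection score in Eq. (5) and the described $\beta$ schedule translate directly into code; the thresholds below simply restate the values given in the text and in Section IV, and the function names are illustrative.

```python
def beta_schedule(current_round: int, total_rounds: int) -> float:
    """Small beta in the first T1 rounds (favor labeler confidence),
    larger beta afterwards (favor instance informativeness)."""
    T1 = total_rounds // 4            # T1 = 1/4 of the total rounds (Section IV)
    return 0.5 if current_round < T1 else 1.0

def pair_score(r_xj: float, c_xj_al: float, beta: float) -> float:
    """Eq. (5) objective for a single instance-labeler pair."""
    return (r_xj ** beta) * c_xj_al
```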

However, the aforementioned selection strategy does not consider the capacity constraints of each labeler. Consequently, it may assign many instances to a few labelers for relabeling. This may lead to excessive delay in the relabeling process. In addition, the labelers may tire of the labeling process, and their labeling accuracy could deteriorate over time. As a remedy, we propose the following selection strategy, which takes into account the capacity constraints of each labeler:

$$
\begin{aligned}
\max \quad & \sum_j \sum_l \left[ r^\beta(\mathbf{x}_j) \cdot c(\mathbf{x}_j, a_l) \right] \gamma_{jl} \\
\text{s.t.} \quad & \sum_l \gamma_{jl} \leq 1, \quad \forall j, \\
& \sum_j \gamma_{jl} \leq m_l, \quad \forall l, \\
& \sum_j \sum_l \gamma_{jl} \leq \phi, \\
& \gamma_{jl} \in \{0, 1\},
\end{aligned}
\qquad (6)
$$


where $\gamma_{jl}$ is a 0-1 indicator variable, i.e., $\gamma_{jl} = 1$ if the $j$th instance is assigned to the $l$th labeler for relabeling. The first constraint ensures that each instance is relabeled by at most one labeler. The second constraint means that the $l$th labeler relabels at most $m_l$ instances. A total of no more than $\phi$ instances will be sent for relabeling, which is captured by the third constraint.

The aforementioned problem is an integer linear programming (ILP) problem, which is in general NP-hard. We next prove that, under mild assumptions, this specific problem can be converted into a minimum cost flow problem over a layered graph and thus can be solved efficiently.

Proposition III.1. Assume the total number of instances is much larger than $\phi$ and $\sum_l m_l > \phi$. Let $w_{jl} \triangleq \eta - r^\beta(\mathbf{x}_j) c(\mathbf{x}_j, a_l)$, where $\eta = \max_{j,l} r^\beta(\mathbf{x}_j) c(\mathbf{x}_j, a_l)$. Problem (6) is equivalent to the following optimization problem:

$$
\begin{aligned}
\min \quad & \sum_j \sum_l w_{jl} \gamma_{jl} \\
\text{s.t.} \quad & \sum_l \gamma_{jl} \leq 1, \quad \forall j, \\
& \sum_j \gamma_{jl} \leq m_l, \quad \forall l, \\
& \sum_j \sum_l \gamma_{jl} = \phi, \\
& \gamma_{jl} \in \{0, 1\}.
\end{aligned}
\qquad (7)
$$

Proof. Since both $r^\beta(\mathbf{x}_j)$ and $c(\mathbf{x}_j, a_l)$ are non-negative, we have $r^\beta(\mathbf{x}_j) \cdot c(\mathbf{x}_j, a_l) \geq 0$. Assume the optimal solution to Problem (6) is $\gamma^*_{jl}$. Since the total number of instances is much larger than $\phi$ and $\sum_l m_l > \phi$, we have $\sum_j \sum_l \gamma^*_{jl} = \phi$. In addition, since $w_{jl} \triangleq \eta - r^\beta(\mathbf{x}_j) c(\mathbf{x}_j, a_l)$, the optimal solution that maximizes the objective of Problem (6) also minimizes the objective of Problem (7), which concludes the proof.

We next show that Problem (7) can be mapped to a minimum cost flow problem over a layered directed graph. As shown in Figure 2, this graph consists of four layers. There is one node in the first layer, representing the source node, whose incoming traffic is $\phi$. The only node in the fourth layer is the destination node, whose outgoing traffic is also $\phi$. The $j$th node in the second layer represents the $j$th instance, and each node in the third layer models a labeler. The capacity of the edge connecting the source node to each node in the second layer is 1, and the capacity of the edge connecting the $l$th node in the third layer to the destination node is $m_l$. The cost associated with the edge connecting the $j$th node in the second layer and the $l$th node in the third layer is $w_{jl}$, and the cost associated with all other edges is 0.

Proposition III.2. The optimal solution to Problem (7) can be obtained by solving a corresponding minimum cost flow problem over the layered graph.

Proof. Let $\zeta_j$ and $\eta_l$ denote, respectively, the network flow from the source node to the $j$th node in the second layer and the flow from the $l$th node in the third layer to the destination node. The incoming traffic for the source node is $\phi$. Let $\kappa_{jl}$ represent the network flow from the $j$th node in the second layer to the $l$th node in the third layer. Recall that the cost associated with the edge connecting the $j$th node in the second layer and the $l$th node in the third layer is $w_{jl}$, and the cost associated with all other edges is zero. The minimum cost network flow problem over the directed graph is given in the sequel:

$$
\begin{aligned}
\min \quad & \sum_j \sum_l w_{jl} \kappa_{jl} \\
\text{s.t.} \quad & \sum_l \kappa_{jl} = \zeta_j, \quad \forall j, \\
& \sum_j \kappa_{jl} = \eta_l, \quad \forall l, \\
& \sum_j \zeta_j = \phi, \\
& 0 \leq \kappa_{jl} \leq 1, \quad 0 \leq \eta_l \leq m_l, \quad 0 \leq \zeta_j \leq 1.
\end{aligned}
\qquad (8)
$$

The first to third groups of constraints are flow conservation constraints. The remaining constraints are capacity constraints. The above optimization is a linear program, and the optimal solution could be fractional. However, since all capacity constraints are integer, the resulting node-edge incidence matrix is totally unimodular, which guarantees that the problem has an integral optimal solution [19]. Assume the optimal solution to the above problem is $\{\kappa^*_{jl}, \zeta^*_j, \eta^*_l\}$. It follows from the constraints that $\sum_j \sum_l \kappa^*_{jl} = \sum_j \zeta^*_j = \phi$. Likewise, $\sum_l \kappa^*_{jl} = \zeta^*_j \leq 1$ and $\sum_j \kappa^*_{jl} = \eta^*_l \leq m_l$. Therefore, $\kappa^*_{jl}$ is a feasible solution to Problem (7). Recall that $\gamma^*_{jl}$ is the optimal solution to Problem (7). We have $\sum_j \sum_l w_{jl} \gamma^*_{jl} \leq \sum_j \sum_l w_{jl} \kappa^*_{jl}$. On the other hand, it is easy to see that $\gamma^*_{jl}$ is also a feasible solution to the minimum cost flow problem given in Eq. (8). Therefore, we have $\sum_j \sum_l w_{jl} \gamma^*_{jl} \geq \sum_j \sum_l w_{jl} \kappa^*_{jl}$. Hence it follows that $\sum_j \sum_l w_{jl} \gamma^*_{jl} = \sum_j \sum_l w_{jl} \kappa^*_{jl}$, which concludes the proof.

There exist efficient algorithms, e.g., the network simplex algorithm, that can optimally solve the minimum cost network flow problem in polynomial time. Once the instance-labeler pairs are selected, we can also utilize the unlabeled dataset $U$ by generating labels for its remaining instances using the detector from the previous round. Following that procedure, the detector is retrained on a combination of data from the two sources.
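Below is a minimal sketch of the capacity-constrained assignment (Problems (6)-(8)) expressed as a min-cost-flow instance. Since the authors report using Google Optimization Tools in Section IV, the sketch is written against the classic `ortools.graph.pywrapgraph` interface (newer OR-Tools releases expose the same solver under `ortools.graph.python.min_cost_flow`); scores are scaled to integers because the solver expects integer costs, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np
from ortools.graph import pywrapgraph  # older API; see note above for newer releases

def assign_pairs(score: np.ndarray, m: np.ndarray, phi: int, scale: int = 10_000):
    """score[j, l] = r(x_j)**beta * c(x_j, a_l); m[l] = capacity of labeler l.
    Returns a list of (j, l) pairs solving Problem (7) via min cost flow."""
    n_inst, n_lab = score.shape
    eta = score.max()
    w = np.rint((eta - score) * scale).astype(int)   # w_jl = eta - score, integerized

    source, sink = 0, 1 + n_inst + n_lab
    smcf = pywrapgraph.SimpleMinCostFlow()
    for j in range(n_inst):                          # source -> instance j, capacity 1
        smcf.AddArcWithCapacityAndUnitCost(source, 1 + j, 1, 0)
    for j in range(n_inst):                          # instance j -> labeler l, cost w_jl
        for l in range(n_lab):
            smcf.AddArcWithCapacityAndUnitCost(1 + j, 1 + n_inst + l, 1, int(w[j, l]))
    for l in range(n_lab):                           # labeler l -> sink, capacity m_l
        smcf.AddArcWithCapacityAndUnitCost(1 + n_inst + l, sink, int(m[l]), 0)
    smcf.SetNodeSupply(source, phi)                  # push exactly phi units of flow
    smcf.SetNodeSupply(sink, -phi)

    if smcf.Solve() != smcf.OPTIMAL:
        return []
    pairs = []
    for arc in range(smcf.NumArcs()):
        j, l = smcf.Tail(arc) - 1, smcf.Head(arc) - 1 - n_inst
        if 0 <= j < n_inst and 0 <= l < n_lab and smcf.Flow(arc) > 0:
            pairs.append((j, l))
    return pairs
```

By total unimodularity (Proposition III.2), the integral flow recovered here coincides with an optimal 0-1 assignment for Problem (7).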

The TALC algorithm is summarized in Algorithm 2. The algorithm first generates an initial detector from the data produced by the unsupervised learning algorithm. Then, it evaluates and trains labelers based on the gold instance dataset. After that, it assigns each selected instance to one human operator and lets that operator relabel it. In the end, a new detector is trained with the new labels.

This algorithm iteratively evaluates, trains, queries, and updates until the given cost budget or target performance is reached. The cost in this case can simply be counted as the number of rounds, or based on the accuracy of the labelers' previous labeling, considering that expert labelers may provide labels at a relatively high cost and apprentice labelers at a relatively low cost.


[Figure 2: a four-layer directed graph with a source node, a layer of instance nodes $\mathbf{x}_1, \ldots, \mathbf{x}_j, \ldots$, a layer of labeler nodes $a_1, \ldots, a_l, \ldots$, and a destination node; the edge from instance $j$ to labeler $l$ carries cost $w_{jl}$.]

FIGURE 2: Graphical model that maps Problem (7) to a minimum cost network flow problem.

Algorithm 2: The TALC Algorithm

1  L = ∅
2  Generate labels for U using Algorithm 1
3  Train detector θ_0 on U ∪ G
4  repeat
5      Evaluate and train labelers
6      for each instance x_j ∈ U and each labeler a_l ∈ A do
7          Calculate the uncertainty level of x_j using θ_{p−1} according to Eq. (2)
8          Calculate the confidence level of (x_j, a_l) according to Eq. (3)
9      end
10     Select the top-φ optimal pairs according to the greedy selection algorithm
11     for each pair (x*, a*) do
12         Query the label of x* from a*
13         L = L ∪ {(x*, y*)}
14         U = U \ {x*}
15     end
16     Generate labels using θ_{p−1} for U, with the instances and labels together denoted by U_{p−1}
17     Train the detector θ_p on G ∪ U_{p−1}
18     Evaluate the detector θ_p on the test set
19 until the target performance or cost budget is reached
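For completeness, here is a hedged Python sketch of the outer loop in Algorithm 2. Every callable argument (make_detector, evaluate_and_train, select_pairs, query_label) is an assumed stand-in for a component described above, not the authors' code; following the prose at the end of Section III-B, the retraining step combines the gold instances, the crowd-sourced labels, and the pseudo-labels produced by the previous-round detector.

```python
from typing import Callable, List, Tuple
import numpy as np

Pair = Tuple[int, int]   # (instance index, labeler index)

def talc_loop(U: List[np.ndarray],
              G: List[Tuple[np.ndarray, int]],
              make_detector: Callable[[], object],
              evaluate_and_train: Callable[[], None],
              select_pairs: Callable[[object, List[np.ndarray]], List[Pair]],
              query_label: Callable[[int, np.ndarray], int],
              rounds: int):
    """Illustrative outer loop of Algorithm 2 under the assumptions above."""
    L: List[Tuple[np.ndarray, int]] = []
    Xg = np.array([x for x, _ in G])
    yg = np.array([y for _, y in G])
    detector = make_detector()
    detector.fit(Xg, yg)                        # stand-in for theta_0 from Stage I
    for _ in range(rounds):
        evaluate_and_train()                    # Step 1: training on gold instances
        chosen = set()
        for j, l in select_pairs(detector, U):  # Step 2: capacity-constrained pairs
            L.append((U[j], query_label(l, U[j])))
            chosen.add(j)
        U = [x for j, x in enumerate(U) if j not in chosen]
        # Pseudo-label the remaining pool with the previous-round detector (U_{p-1}).
        pseudo = list(zip(U, detector.predict(np.array(U)))) if U else []
        data = G + pseudo + L                   # gold + pseudo-labels + crowd labels
        X = np.array([x for x, _ in data])
        y = np.array([t for _, t in data])
        detector = make_detector()
        detector.fit(X, y)                      # retrain theta_p
    return detector
```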

IV. EXPERIMENTS

We first compare our unsupervised LPRS learning scheme with other unsupervised learning algorithms for initial anomaly detection in high-dimensional data, focusing on its computational efficiency. Then, experiments are conducted to compare the TALC algorithm with other anomaly detection algorithms in terms of detection accuracy, i.e., precision and recall, as well as the number of labels needed to construct the model.

TABLE 1: Training time of different unsupervised learning algorithms for statistical anomaly detection.

Method    Training Time
LPRS      1.31 s
HBOS      2.19 s
IForest   3.62 s
LOF       8.39 s
LoOP      70.9 s
ABOD      > 5 h

A. EXPERIMENTS OF THE LPRS ALGORITHM

We first verify the effectiveness of the unsupervised learning scheme for anomaly detection in large datasets. We compare our method with several state-of-the-art unsupervised anomaly detection algorithms, including LOF [4], ABOD [20], HBOS [21], Isolation Forest [22], and LoOP [16]. For this comparison, we select the speech dataset from the widely used ODDS repository [23], which has 400 dimensions, 3,686 instances, and an anomaly rate of 1.65%. The results are shown in Table 1. Note that the training time of the ABOD method exceeds 5 hours, so its performance is not examined further. It is seen that for high-dimensional data, the proposed LPRS learning algorithm requires only a relatively short period of time for model training.

B. EXPERIMENTS OF THE TALC ALGORITHM

1) Experimental Settings

Evaluation metrics. We compare performance in terms of Precision, Recall, and F1-Score under a restricted cost with several baseline algorithms. Besides, we further validate the effectiveness of our approach in terms of the number of relabeled instances needed to achieve the same performance as the ideal detector. In the experiments, we set the cost as the number of rounds.

Baseline algorithms. We assume all the training data is correctly labeled, and these labels are fed into a supervised learning algorithm to obtain an ideal detector. We then compare our proposed algorithm against the following baselines, to assess different aspects of our method:

• CEAL: This algorithm [18] does not evaluate and train labelers; it aims to select labeler-instance pairs that are both low-cost and effective. Since our algorithm uses pool-based sampling, we adapt its labeler-instance selection criterion and query the labels of the top-φ optimal pairs per round.

• ALTMV: This algorithm evaluates and trains labelers as our TALC algorithm does, but it only actively chooses the most uncertain samples, queries all labelers, and uses majority voting to generate labels.

• ALA: This algorithm evaluates and trains labelers as our TALC algorithm does, but it does not restrict the labelers' quality when deciding which instance-labeler pair to query.


TABLE 2: Initial accuracy of each labeler in different datasets.

Dataset       a1     a2     a3     a4     a5
mnist         80%    70%    70%    50%    50%
annthyroid    90%    75%    75%    50%    50%
http          80%    70%    70%    50%    50%

For the compared methods, we use the official implementation of XGBoost [24] as the supervised learning algorithm and Label Spreading in scikit-learn [25] as the implementation of the label propagation algorithm. All parameters of the two algorithms are set to their defaults. Besides, we use Google Optimization Tools to solve the minimum cost flow problem (8).

Datasets. We select three datasets: annthyroid, mnist, and http. Annthyroid and mnist are from the ODDS repository, and http is extracted from the well-known KDD Cup 99 dataset. Http is the largest subset of the KDD Cup 99 network intrusion data, where the service feature is "http". We randomly sampled 20,000 instances from the original dataset. Annthyroid and mnist are selected because they have known clustered anomalies. Note that all three datasets are artificially converted from their original classification datasets: in the http dataset, attack instances are treated as anomalies, while in the annthyroid and mnist datasets, the majority classes form the inliers and the minority classes form the outliers.

For all datasets, instances are randomly shuffled before conducting the experiments, and each categorical feature is converted to multiple Boolean features via one-hot encoding. In addition, all features are standardized. The sizes of the datasets range from 1,600 to 20,000, the number of dimensions ranges from 21 to 115, and the percentage of anomalies ranges from 1.6% to 9.6%.
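For reference, the preprocessing described above (one-hot encoding of the categorical features followed by standardization) can be written with pandas and scikit-learn as follows; the column list is a placeholder, and this is an illustrative sketch rather than the authors' pipeline.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, categorical_cols: list):
    """One-hot encode the categorical columns, then standardize all features."""
    encoded = pd.get_dummies(df, columns=categorical_cols)  # Boolean dummy columns
    return StandardScaler().fit_transform(encoded.to_numpy(dtype=float))
```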

Parameter configurations. We set aside 5% of all data as gold instances and add the reasons why such labels are given. The remaining 75% and 20% of the data are chosen as the training set and the test set, respectively. For our algorithm, we set $\rho = \lceil 0.1 |G| \rceil$ and $\phi = \lceil 0.008 (|U| + |G|) \rceil$ to strike a balance between the maximum number of gold instances used per round and the number of unlabeled instances queried per round. For Eq. (5), we set $T_1$ to $1/4$ of the total number of rounds. In this experiment, we set the total budget to 20 rounds.
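The per-round budgets stated above translate directly into code; this small helper simply restates the configuration.

```python
import math

def round_budgets(n_gold: int, n_unlabeled: int, total_rounds: int = 20):
    """Per-round budgets as configured in the experiments (Section IV)."""
    rho = math.ceil(0.1 * n_gold)                     # gold instances per round
    phi = math.ceil(0.008 * (n_unlabeled + n_gold))   # relabeling queries per round
    T1 = total_rounds // 4                            # rounds before beta switches to 1
    return rho, phi, T1
```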

Labeler settings. We assume that five labelers participate in the experiments; their initial settings are shown in Table 2. Note that the labelers' initial accuracies differ across datasets to verify that our algorithm can outperform other methods both with and without expert labelers. Among these labelers, a5 is always an unqualified labeler, who keeps providing erroneous labels by random guessing.

TABLE 3: Performance indicators after 20 rounds on the three datasets.

Method           Metric       mnist      annthyroid   http
Ideal Detector   Precision    96.06%     85.45%       100.00%
                 Recall       87.23%     88.68%       100.00%
                 F1-Score     91.45%     87.04%       100.00%
TALC             Precision    96.90%     85.45%       100.00%
                 Recall       88.65%     88.68%       100.00%
                 F1-Score     92.59%     87.04%       100.00%
CEAL             Precision    45.28%     59.09%       76.74%
                 Recall       17.02%     24.53%       22.45%
                 F1-Score     24.74%     34.67%       34.74%
ALTMV            Precision    97.58%     81.74%       100.00%
                 Recall       85.82%     88.68%       100.00%
                 F1-Score     91.32%     85.07%       100.00%
ALA              Precision    92.31%     78.26%       59.31%
                 Recall       85.11%     84.91%       93.20%
                 F1-Score     88.56%     81.45%       72.49%

2) Results and Analysis

Table 3 presents the final Precision, Recall, and F1-Score of the different methods after 20 rounds, Figure 3 shows the curves of the performance indicators on the three datasets as the number of rounds increases, and Table 4 presents the number of relabeled instances needed to reach the performance of the ideal detector on each dataset. As the number of rounds increases, our method maintains a rapid growth rate on almost all datasets. In addition, within 20 rounds, our method achieves the same or even better performance than the ideal detector on all three datasets. Note that because inliers and outliers are imbalanced in the datasets, our method, trained with fewer samples, is in some cases even slightly better than the ideal detector on the test set. Regarding the number of relabeled instances needed, the results show that our method only needs to relabel around 10% of the dataset, as expected.

In the ALTMV method, all labelers are involved. These labelers are diverse in expertise and may therefore provide informative instances with erroneous labels in the early stage of training, leading to a deterioration of the detector's performance. Note that for the http dataset, the ALTMV method achieves the same performance as the ideal detector. However, this method requires majority voting among all labelers and thus requires much more labeling effort. In comparison, the proposed approach takes into account both the informativeness of instances and the labelers' accuracy, and selects the most reliable instance-labeler pairs. The proposed algorithm reaches the same performance as the ideal detector after merely 12 iterations on the three datasets. This shows that the TALC algorithm is well suited for interactive anomaly detection and significantly outperforms the other baseline algorithms.



[Figure 3: twelve panels, (a)-(l), one per dataset-method combination (annthyroid, mnist, http × TALC, CEAL, ALA, ALTMV); each panel plots Precision, Recall, and F1 against the number of rounds (1-19), with the ideal detector's precision, recall, and F1 shown as reference lines.]

FIGURE 3: Performance indicator curves on the three datasets.

TABLE 4: The number of relabeled instances in different datasets.

Dataset       # Relabeled Instances    Relabeled Percentage
mnist         840                      14.73%
annthyroid    741                      13.72%
http          800                      5.33%

V. CONCLUSION AND FUTURE WORK

Active learning, as a subfield of semi-supervised learning, provides an effective means to solve domain-specific anomaly detection problems in which only a limited number of training samples is available. This paper presents a tripartite active learning algorithm for interactive anomaly detection. The proposed algorithm not only considers the active selection of instances for relabeling based on their uncertainty level, but also develops a strategy to evaluate and train multiple labelers with gold instances. Experiments demonstrate that the proposed algorithm outperforms most state-of-the-art methods for domain-specific anomaly detection.

The study of human-in-the-loop machine learning, e.g., tripartite active learning from crowds with gold instances, remains widely open in general, with various challenges and many applications in diverse areas where robustness against erroneous labels is crucial to the successful application of machine learning algorithms. In our future work, we plan to investigate how to combine the proposed method with other query strategy frameworks, such as variance reduction [1].

REFERENCES
[1] B. Settles, "Active learning literature survey," Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[2] D. Oleson, A. Sorokin, G. Laughlin, V. Hester, J. Le, and L. Biewald, "Programmatic gold: Targeted and scalable quality assurance in crowdsourcing," in Proceedings of the 11th AAAI Conference on Human Computation, AAAIWS'11-11, pp. 43–48, 2011.
[3] L. Portnoy, E. Eskin, and S. Stolfo, "Intrusion detection with unlabeled data using clustering," in Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA 2001), pp. 5–8, 2001.
[4] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, (New York, NY, USA), pp. 93–104, ACM, 2000.
[5] E. M. Knorr and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, (San Francisco, CA, USA), pp. 392–403, 1998.
[6] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, (New York, NY, USA), pp. 3–12, 1994.
[7] B. Settles, M. Craven, and S. Ray, "Multiple-instance active learning," in Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, eds.), pp. 1289–1296, Curran Associates, Inc., 2008.
[8] S. Das, W. Wong, A. Fern, T. G. Dietterich, and M. A. Siddiqui, "Incorporating feedback into tree-based anomaly detection," CoRR, vol. abs/1708.09441, 2017.
[9] S. Das, W. K. Wong, T. Dietterich, A. Fern, and A. Emmott, "Incorporating expert feedback into active anomaly discovery," in Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), ICDM '16, pp. 853–858, Dec. 2016.
[10] P. Donmez, J. G. Carbonell, and J. Schneider, "Efficiently learning the accuracy of labeling sources for selective sampling," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, (New York, NY, USA), pp. 259–268, ACM, 2009.
[11] C. Long, G. Hua, and A. Kapoor, "Active visual recognition with expertise estimation in crowdsourcing," in 2013 IEEE International Conference on Computer Vision, pp. 3000–3007, Dec. 2013.
[12] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, (New York, NY, USA), pp. 614–622, ACM, 2008.
[13] C. C. Friedel, U. Rückert, and S. Kramer, "Cost curves for abstaining classifiers," in Proceedings of the ICML 2006 Workshop on ROC Analysis in Machine Learning, 2006.
[14] J. Zhong, K. Tang, and Z.-H. Zhou, "Active learning from crowds with unsure option," in Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI '15, pp. 1061–1067, 2015.
[15] X. Y. Zhang, S. Wang, and X. Yun, "Bidirectional active learning: A two-way exploration into unlabeled and labeled data set," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, pp. 3034–3044, Dec. 2015.
[16] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, "LoOP: Local outlier probabilities," in Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, (New York, NY, USA), pp. 1649–1652, ACM, 2009.
[17] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Tech. Rep., CMU CALD, 2002.
[18] S.-J. Huang, J.-L. Chen, X. Mu, and Z.-H. Zhou, "Cost-effective active learning from diverse labelers," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, (California, USA), pp. 1879–1885, International Joint Conferences on Artificial Intelligence Organization, 2017.
[19] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1993.
[20] H.-P. Kriegel, M. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, (New York, NY, USA), pp. 444–452, ACM, 2008.
[21] M. Goldstein and A. Dengel, "Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm," in KI-2012: Poster and Demo Track, pp. 59–63, 2012.
[22] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, (Washington, DC, USA), pp. 413–422, IEEE Computer Society, Dec. 2008.
[23] S. Rayana, "ODDS library," 2016. http://odds.cs.stonybrook.edu.
[24] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, (New York, NY, USA), pp. 785–794, ACM, 2016.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

YANQIAO ZHU ([email protected]) is pursuing his B.Eng. degree at the School of Software Engineering, Tongji University, Shanghai, China. His research interests mainly include data mining and analysis, graph mining, recommendation systems, machine learning, and distributed systems.

KAI YANG ([email protected]) received his B.Eng. degree from Southeast University, Nanjing, China, his M.S. degree from the National University of Singapore, and his Ph.D. degree from Columbia University, New York. He is a distinguished professor at Tongji University, Shanghai, China. He was a technical staff member at Bell Laboratories, New Jersey, a senior data scientist at Huawei Technologies, Plano, Texas, and a research associate at NEC Laboratories America, Princeton, New Jersey. He has also been an adjunct faculty member at Columbia University since 2011. He holds over 20 patents and has published extensively in leading IEEE journals and conferences. His current research interests include big data analytics, machine learning, wireless communications, and signal processing.

He was a recipient of the Eliahu Jury Award from Columbia University, the Bell Laboratories Teamwork Award, the Huawei Technology Breakthrough Award, and the Huawei Future Star Award. The products he has developed have been deployed by Tier-1 operators and serve billions of users worldwide. He serves as an Editor for the IEEE Internet of Things Journal and IEEE Communications Surveys & Tutorials, and as a Guest Editor for the IEEE Journal on Selected Areas in Communications. From 2012 to 2014, he was the Vice-Chair of the IEEE ComSoc Multimedia Communications Technical Committee. In 2017, he founded and served as the Chair of the IEEE TCCN Special Interest Group on AI Embedded Cognitive Networks. He has served as a Demo/Poster Co-Chair of IEEE INFOCOM, Symposium Co-Chair of IEEE GLOBECOM, and Workshop Co-Chair of IEEE ICME.
