An Investigation of Subspace Outlier Detection Alex Wiegand Supervisor: Jiuyong Li.


An Investigation of Subspace Outlier Detection

Alex Wiegand
Supervisor: Jiuyong Li


Contents

• Research Question
• Introduction to Outliers
• The problem with many dimensions
• Subspace outlier detection
• Techniques considered
• Evaluation
• Achievements
• New framework
• Left to Do
• End


Research Question

• What is the best way to find outliers in high dimensional data?


DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

The attributes-are-dimensions metaphor

• Central to the concept of outliers
• Each attribute is considered to be a dimension
• A database is a dataset
• Each object, or tuple, is a point
• The schema is a space
• So finding unusual objects is a geometric problem


Outliers

• “an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” (Hawkins, 1980)

• Outliers point to interesting phenomena.

• Being able to explain outliers adds strength to a model.

• Outliers can signify important events
  – network intrusions
  – credit card fraud
  – disease outbreaks


The Low Dimensional Case

• For 1 to about 4 attributes, outlier detection is a solved problem

• A few techniques exist
• The most popular is LOF
  – Local Outlier Factor

• But they are less reliable as the number of dimensions increases
  – Because of the curse of dimensionality
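A minimal sketch of LOF for the low-dimensional case, assuming the usual definition (k-distance, reachability distance, local reachability density). It is not code from this project: the function names, the toy data and k = 3 are illustrative assumptions, ties are ignored by taking exactly k neighbours, and duplicate points are assumed to be absent.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lof_scores(points, k=3, dist=euclidean):
    """Return one LOF score per point; values well above 1 suggest an outlier.

    Simplifications (assumptions of this sketch): exactly k neighbours are
    used even when distances tie, and the data must not contain duplicate
    points, which would lead to a division by zero.
    """
    n = len(points)
    # neighbours[i]: the k nearest other points to i as (distance, index) pairs
    neighbours = []
    for i in range(n):
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append(ranked[:k])
    # k-distance of i: distance from i to its k-th nearest neighbour
    k_distance = [nb[-1][0] for nb in neighbours]

    def local_reachability_density(i):
        # reachability distance from i to o is max(k-distance(o), d(i, o))
        reach = [max(k_distance[o], d) for d, o in neighbours[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbours relative to the point's own density
    return [
        sum(density[o] for _, o in neighbours[i]) / (k * density[i])
        for i in range(n)
    ]


# Toy 2-D data: five clustered points and one point far away (index 5)
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

The `dist` parameter is there so that a different distance function can be swapped in without touching the rest of the code.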


The Curse of Dimensionality

• Consider a dataset with d dimensions
• For any three points, let a, b and c be the distances between them
• As d → ∞,
  – a, b and c → ∞
  – but a/b and a/c → 1
• i.e. the distances become more similar as the number of dimensions increases
• This happens under most common conditions
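The effect can be observed with a small simulation. This is only an illustration under stated assumptions: uniformly random data, 200 points, a fixed random seed, and the ratio of the farthest to the nearest distance from one query point standing in for a/b. The function name `distance_ratio` is made up for this sketch.

```python
import math
import random


def distance_ratio(d, n_points=200, seed=0):
    """Farthest / nearest distance from one query point, uniform data in d dimensions."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [
        math.sqrt(sum((q - x) ** 2 for q, x in zip(query, p)))
        for p in others
    ]
    return max(dists) / min(dists)


# The ratio should drift towards 1 as the number of dimensions grows
ratios = {d: distance_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"d = {d:4d}: farthest / nearest = {r:.2f}")
```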


The Curse of Dimensionality

• So it becomes like this (figure not reproduced): every pair of points is roughly the same distance h apart, for some large distance h
• Traditional approaches can't find outliers: no points are relatively far away
• But what if there are some unusual points to be found, but some attributes are distracting us?


Subspaces

• A space within another space
• Fewer dimensions
• e.g. a 2D plane crossing 3D space
• Can be created by selecting a subset of the set of attributes
  – This is called feature selection
  – Equivalent to database projection
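As a small illustration of projection onto a subset of attributes, the sketch below uses made-up helper names (`project`, `all_subspaces`) and a made-up three-attribute dataset. It also counts the axis-parallel subspaces, of which there are 2^d − 1 for d attributes.

```python
from itertools import combinations


def project(points, attributes):
    """Keep only the chosen attribute indices of each point (a database projection)."""
    return [tuple(p[i] for i in attributes) for p in points]


def all_subspaces(d):
    """Yield every non-empty subset of the attribute indices 0..d-1."""
    for size in range(1, d + 1):
        yield from combinations(range(d), size)


# Hypothetical dataset with three attributes
dataset = [(1.0, 20.0, 0.3), (1.2, 22.0, 0.1), (0.9, 19.0, 0.2)]
subspace = project(dataset, attributes=(0, 2))
print(subspace)
print(len(list(all_subspaces(3))), "subspaces for 3 attributes")
```

The count grows exponentially with d, so checking every subspace quickly becomes impractical.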


Subspace Outlier Detection

• Outlier Detection in the subspaces

• Actually looking for “subspace outliers”
• “point x is an outlier in subspace S”
• i.e. object x has unusual values for some attributes
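A toy example of a point that is an outlier in one subspace. Everything here is an assumption made for illustration: the synthetic data (one tightly clustered attribute plus 20 noise attributes), the seed, and the use of a simple k-th nearest neighbour distance as the outlier score rather than any of the algorithms discussed in this talk.

```python
import math
import random


def knn_distance_scores(points, k=3):
    """Outlier score of each point = distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for j, q in enumerate(points)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


def rank_of(index, scores):
    """Position of a point when sorted by score; 1 = most outlying."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(index) + 1


rng = random.Random(1)
n_noise = 20
# Normal objects: attribute 0 is tightly clustered, attributes 1..20 are noise
data = [
    [rng.uniform(0.0, 0.1)] + [rng.random() for _ in range(n_noise)]
    for _ in range(50)
]
# Object x: unusual only in attribute 0
x = [1.0] + [rng.random() for _ in range(n_noise)]
data.append(x)
x_index = len(data) - 1

subspace = [[p[0]] for p in data]  # subspace S = {attribute 0}
rank_in_subspace = rank_of(x_index, knn_distance_scores(subspace))
rank_in_full_space = rank_of(x_index, knn_distance_scores(data))
print("rank of x in subspace S:", rank_in_subspace)
print("rank of x in full space:", rank_in_full_space)
```

In subspace S the object x is always ranked first. Its rank in the full space depends on the noise attributes, so the script prints it rather than assuming a value.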


Existing Techniques

• Four subspace outlier detection algorithms were looked at

• Aggarwal Evolutionary Search
• Subspace Outlier Degree (SOD)
• Lazarevic Feature Bagging (LazFB)
• Most Interesting Subspace Top N Outlier Detection (MOIS)

• Not much evaluation done
• Three of them have results for some test data in the papers that define them
• Different test data for each one

• No comparisons between them


Distance Metrics

• Normal distance is Euclidean distance

• A couple of other distance metrics have been found to increase the contrast between distances when there are many dimensions
  – Nearest neighbour ranking
    • dist(x, y) = k where y is the k-th nearest point to x
  – Fractional Lp norm
    • $\mathrm{dist}(x, y) = \left( |x_1 - y_1|^p + \dots + |x_d - y_d|^p \right)^{1/p}$
    • where p < 1

• These have not been tried in outlier detection
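Both metrics are short to write down in code. The sketch below is an illustration only: the function names are made up, p = 0.5 is an arbitrary choice of fractional exponent, and the nearest neighbour rank is computed by brute force over a list of points addressed by index.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance, used as the base distance for ranking."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def fractional_lp(x, y, p=0.5):
    """(|x_1 - y_1|^p + ... + |x_d - y_d|^p)^(1/p), intended for p < 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def nn_rank(i, j, points, base=euclidean):
    """Return k such that points[j] is the k-th nearest point to points[i].

    Counts the points strictly closer to points[i] than points[j] is,
    so tied points share the same rank.
    """
    d_ij = base(points[i], points[j])
    closer = sum(
        1
        for m, q in enumerate(points)
        if m != i and base(points[i], q) < d_ij
    )
    return closer + 1


print(fractional_lp((0, 0), (1, 1), p=0.5))

# Four points on a line; the rank is not symmetric in general
line = [(0.0,), (1.0,), (1.5,), (10.0,)]
print(nn_rank(0, 1, line), nn_rank(1, 0, line))
```

Note that the nearest neighbour rank is not symmetric: the point at 1.0 is the nearest neighbour of the point at 0.0, but not the other way round.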


Research Plan

• Compare
  – MOIS
  – Lazarevic Feature Bagging
  – Two benchmark algorithms
    • LOF
    • Distance-based outlier
    • (these are non-subspace outlier detection algorithms)

• Try new distance metrics
    • nearest neighbour rank
    • fractional Lp norm
  – Use Lazarevic Feature Bagging and LOF
  – replace the Euclidean distance function with the chosen metric
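A rough sketch of how feature bagging with a replaceable distance function might look. It reflects my understanding of the general idea (score the data in several random subspaces and combine the scores) and is not the implementation from this project. Assumptions: subspace sizes are drawn between d/2 and d − 1, scores are combined by summing, and a k-th nearest neighbour distance stands in for LOF to keep the example short. All names are hypothetical.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_scores(points, k, dist):
    """Stand-in base detector: distance to the k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def feature_bagging_scores(points, rounds=10, k=3, dist=euclidean, seed=0):
    """Sum the base detector's scores over randomly chosen attribute subsets."""
    rng = random.Random(seed)
    d = len(points[0])
    totals = [0.0] * len(points)
    for _ in range(rounds):
        # Assumed subspace size: between d/2 and d - 1 attributes
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        attrs = rng.sample(range(d), size)
        projected = [[p[a] for a in attrs] for p in points]
        for i, s in enumerate(knn_scores(projected, k, dist)):
            totals[i] += s
    return totals


def fractional_lp(x, y, p=0.5):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Synthetic data: 20 points in [0, 1]^6 plus one point far away in every attribute
rng = random.Random(42)
data = [[rng.random() for _ in range(6)] for _ in range(20)]
data.append([5.0] * 6)

euclid_totals = feature_bagging_scores(data)
frac_totals = feature_bagging_scores(data, dist=fractional_lp)
print(euclid_totals.index(max(euclid_totals)), frac_totals.index(max(frac_totals)))
```

Passing the distance as a parameter is what makes the planned experiment cheap: the same code runs with Euclidean distance or with the fractional norm.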


Evaluation

• ROC curve
• Find a parameter that controls sensitivity
  – number of reported outliers (positives) per number of points
• Run the algorithm for many values of that parameter
• Draw a scatter plot of the true positive vs false positive rates
• Connect the dots
• The area under the curve (AUC) measures the quality of the algorithm for the test data set
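The procedure above can be sketched in a few lines. This is an illustration with made-up scores and labels, not results from any of the algorithms. It assumes the sensitivity parameter is the number of top-scoring points reported as outliers, that there are no tied scores, and that the area is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """ROC points from (0, 0) to (1, 1), one per number of reported outliers.

    scores: higher means more outlying; labels: True for real outliers.
    Assumes at least one outlier, at least one normal point, and no tied scores.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_outlier in ranked:
        # Reporting one more point adds either a true or a false positive
        if is_outlier:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the connected dots, using the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical detector output for six points, two of which are real outliers
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, False, True, False, False, False]
curve = roc_points(scores, labels)
print(curve)
print("AUC =", auc(curve))
```

An AUC of 1 means every real outlier was ranked above every normal point; 0.5 is what a random ranking would be expected to give.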


Achievements

• Implementation
• New framework


Implementation

• Only one existing algorithm, MOIS, had an available implementation

• The others need new implementations
• Implementing all the algorithms involves much repetition
  • Loading datasets
  • Accessing data points
  • Calculating distances

• A system for code reuse is desirable
• Variations of the algorithms must be easy to create
  • For any improved algorithms I design
• Running many tests should be easy


Framework

• I decided to use a software framework
  • Standardised API for algorithms
  • Inversion of control
    • User commands framework
    • Framework commands algorithms

• Existing frameworks considered
  • Weka
  • RapidMiner
  • ELKI

• Weka and ELKI (0.1) don't natively support outlier detection algorithms

• RapidMiner carries a high implementation overhead
  • Due to architecture


Framework

• Looking for the quickest way to implement the algorithms

• The drawbacks mentioned made those frameworks unsuitable for my task

• Unless they were extended

• My decision: build a new framework just for subspace outlier detection


The New Framework

• Some design decisions
  – Use of Weka API
    • to use functionality already available in Weka
  – Interactive command-line interface
    • scriptability
    • e.g. easy to run tests on an arbitrary set of datasets
  – Inheritance-friendly design
    • for quick creation of modified algorithms, metrics and data structures
  – The Metric class
    • some functions are implemented as subclasses of Metric
    • makes those functions easier to replace with new ones
    • used for distance metrics
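To show the idea behind a replaceable Metric class, here is a small sketch. It is not the framework's actual API: the class and method names are hypothetical, it is written in Python rather than against the Weka API, and a k-th nearest neighbour distance detector stands in for the real algorithms.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Hypothetical base class: anything that measures the distance between two points."""

    @abstractmethod
    def distance(self, x, y):
        """Return a non-negative distance between points x and y."""


class EuclideanMetric(Metric):
    def distance(self, x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


class FractionalLpMetric(Metric):
    def __init__(self, p=0.5):
        self.p = p

    def distance(self, x, y):
        return sum(abs(a - b) ** self.p for a, b in zip(x, y)) ** (1.0 / self.p)


class KnnDistanceDetector:
    """Stand-in algorithm that only talks to the Metric interface."""

    def __init__(self, metric, k=2):
        self.metric = metric
        self.k = k

    def scores(self, points):
        result = []
        for i, p in enumerate(points):
            dists = sorted(
                self.metric.distance(p, q)
                for j, q in enumerate(points)
                if j != i
            )
            result.append(dists[self.k - 1])
        return result


# Swapping the metric needs no change to the detector
points = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]
top_outlier = {}
for metric in (EuclideanMetric(), FractionalLpMetric(p=0.5)):
    scores = KnnDistanceDetector(metric).scores(points)
    top_outlier[type(metric).__name__] = scores.index(max(scores))
print(top_outlier)
```

Because the detector depends only on the base class, a new distance metric is one small subclass, which is the kind of quick variation the design aims for.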


Left to Do

• Complete LOF implementation
• Complete Lazarevic Feature Bagging implementation
• Implement new distance metrics
• Run tests
• Analyse results


Thank you for listening

• Questions?