MOA : Massive Online Analysis

MOA: Massive Online Analysis. University of Waikato, Hamilton, New Zealand

description

A talk about MOA, a framework similar to WEKA for dealing with massive data and data streams.

Transcript of MOA : Massive Online Analysis

Page 1: MOA : Massive Online Analysis

MOA: Massive Online Analysis

University of Waikato, Hamilton, New Zealand

Page 2: MOA : Massive Online Analysis

What is MOA?

Massive Online Analysis is a framework for online learning from data streams.

It is closely related to WEKA
It includes a collection of offline and online methods as well as tools for evaluation:
  boosting and bagging
  Hoeffding Trees
with and without Naïve Bayes classifiers at the leaves.


Page 3: MOA : Massive Online Analysis

WEKA

Waikato Environment for Knowledge Analysis
Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
  Released under the GPL
Support for the whole process of experimental data mining
  Preparation of input data
  Statistical evaluation of learning schemes
  Visualization of input data and the result of learning
Used for education, research and applications
Complements "Data Mining" by Witten & Frank


Page 4: MOA : Massive Online Analysis

WEKA: the bird


Page 5: MOA : Massive Online Analysis

MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.


Page 8: MOA : Massive Online Analysis

Data stream classification cycle

1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point


Page 9: MOA : Massive Online Analysis

Experimental setting

Evaluation procedures for Data Streams
  Holdout
  Interleaved Test-Then-Train or Prequential

Environments
  Sensor Network: 100 Kb
  Handheld Computer: 32 Mb
  Server: 400 Mb
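
The interleaved test-then-train (prequential) procedure named above can be sketched as follows: each example is first used to test the model and only then to train it, so every example is inspected once and no separate holdout set is needed. This is a minimal sketch and not MOA's implementation. The names PrequentialSketch, StreamLearner and MajorityLearner are hypothetical and introduced only for illustration, and in-memory arrays stand in for a real stream.

```java
import java.util.Random;

/** Sketch of interleaved test-then-train (prequential) evaluation. */
final class PrequentialSketch {

    /** Tests on each example first, then trains on it; returns the accuracy. */
    static double run(StreamLearner learner, double[][] xs, int[] ys) {
        long correct = 0;
        for (int t = 0; t < ys.length; t++) {
            if (learner.predict(xs[t]) == ys[t]) {
                correct++;                    // test first
            }
            learner.train(xs[t], ys[t]);      // then train on the same example
        }
        return ys.length == 0 ? 0.0 : (double) correct / ys.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 10_000;
        double[][] xs = new double[n][1];
        int[] ys = new int[n];
        for (int t = 0; t < n; t++) {
            xs[t][0] = rnd.nextDouble();
            ys[t] = rnd.nextDouble() < 0.7 ? 1 : 0;  // class 1 is the majority
        }
        double acc = run(new MajorityLearner(2), xs, ys);
        System.out.println("prequential accuracy: " + acc);
    }
}

/** Hypothetical minimal learner interface (an assumption, not MOA's API). */
interface StreamLearner {
    int predict(double[] x);
    void train(double[] x, int y);
}

/** Predicts the most frequent class seen so far, using constant memory. */
final class MajorityLearner implements StreamLearner {
    private final long[] counts;

    MajorityLearner(int numClasses) {
        this.counts = new long[numClasses];
    }

    @Override
    public int predict(double[] x) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    @Override
    public void train(double[] x, int y) {
        counts[y]++;
    }
}
```

Because the learner is always tested on an example it has not yet trained on, the running accuracy is an honest estimate even though no data is set aside.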


Page 10: MOA : Massive Online Analysis

Experimental setting

Data Sources
  Random Tree Generator
  Random RBF Generator
  LED Generator
  Waveform Generator
  Function Generator


Page 11: MOA : Massive Online Analysis

Experimental setting

Classifiers
  Naive Bayes
  Decision stumps
  Hoeffding Tree
  Hoeffding Option Tree
  Bagging and Boosting

Prediction strategies
  Majority class
  Naive Bayes Leaves
  Adaptive Hybrid


Page 12: MOA : Massive Online Analysis

Hoeffding Option Tree

Hoeffding Option Trees
Regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths.


Page 13: MOA : Massive Online Analysis

Extension to Evolving Data Streams

New Evolving Data Stream Extensions

  New Stream Generators
  New UNION of Streams
  New Classifiers


Page 14: MOA : Massive Online Analysis

Extension to Evolving Data Streams

New Evolving Data Stream Generators

  Random RBF with Drift
  LED with Drift
  Waveform with Drift

  Hyperplane
  SEA Generator
  STAGGER Generator


Page 15: MOA : Massive Online Analysis

Concept Drift Framework

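[Figure: plot of the sigmoid function f(t) against t, with marks for t0, W, 0.5, 1 and the angle α.]

Definition
Given two data streams a, b, we define $c = a \oplus^{W}_{t_0} b$ as the data stream built by joining the two data streams a and b, where

$$\Pr[c(t) = b(t)] = \frac{1}{1 + e^{-4(t - t_0)/W}}$$

$$\Pr[c(t) = a(t)] = 1 - \Pr[c(t) = b(t)]$$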
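
The probability in the definition above can be evaluated directly, which makes the roles of t0 (centre of the change) and W (length of the transition) easier to see. The following is a minimal sketch under stated assumptions: the class DriftSketch and its method names are hypothetical and are not part of MOA, and the random draw is only one possible way of choosing between the two streams.

```java
import java.util.Random;

/** Sketch of the sigmoid that joins two streams a and b into c. */
final class DriftSketch {

    /** Pr[c(t) = b(t)] = 1 / (1 + e^{-4 (t - t0) / W}). */
    static double probabilityOfB(double t, double t0, double w) {
        return 1.0 / (1.0 + Math.exp(-4.0 * (t - t0) / w));
    }

    /** Returns true if the joined stream takes its t-th example from b. */
    static boolean takeFromB(long t, double t0, double w, Random rnd) {
        return rnd.nextDouble() < probabilityOfB(t, t0, w);
    }

    public static void main(String[] args) {
        double t0 = 1000;  // centre of the change (illustrative value)
        double w = 100;    // length of the transition (illustrative value)
        for (long t : new long[] {900, 975, 1000, 1025, 1100}) {
            System.out.printf("t=%d  Pr[c(t)=b(t)]=%.4f%n", t, probabilityOfB(t, t0, w));
        }
    }
}
```

At t = t0 the probability is exactly 0.5, and the slope of the curve there is 1/W, so a small W gives an abrupt change and a large W a gradual one.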


Page 16: MOA : Massive Online Analysis

Concept Drift Framework

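[Figure: plot of the sigmoid function f(t) against t, with marks for t0, W, 0.5, 1 and the angle α.]

Example

$$(((a \oplus^{W_0}_{t_0} b) \oplus^{W_1}_{t_1} c) \oplus^{W_2}_{t_2} d) \ldots$$

$$(((SEA_9 \oplus^{W}_{t_0} SEA_8) \oplus^{W}_{2t_0} SEA_7) \oplus^{W}_{3t_0} SEA_{9.5})$$

$$CovPokElec = (CoverType \oplus^{5,000}_{581,012} Poker) \oplus^{5,000}_{1,000,000} ELEC2$$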


Page 17: MOA : Massive Online Analysis

Extension to Evolving Data Streams

New Evolving Data Stream Classifiers

  Adaptive Hoeffding Option Tree
  DDM Hoeffding Tree
  EDDM Hoeffding Tree

  OCBoost
  FLBoost


Page 18: MOA : Massive Online Analysis

Ensemble Methods

New ensemble methods:
  Adaptive-Size Hoeffding Tree bagging:
    each tree has a maximum size
    after one node splits, it deletes some nodes to reduce its size if the size of the tree is higher than the maximum value
  ADWIN bagging:
    When a change is detected, the worst classifier is removed and a new classifier is added.


Page 19: MOA : Massive Online Analysis

Adaptive-Size Hoeffding Tree

[Figure: four trees T1, T2, T3, T4]

Ensemble of trees of different size
  smaller trees adapt more quickly to changes,
  larger trees do better during periods with little change
  diversity


Page 20: MOA : Massive Online Analysis

Adaptive-Size Hoeffding Tree

Figure: Kappa-Error diagrams for ASHT bagging (left) and bagging (right) on dataset RandomRBF with drift, plotting 90 pairs of classifiers. Both panels plot Error against Kappa.


Page 21: MOA : Massive Online Analysis

ADWIN Bagging

ADWIN

An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)
  On ratio of false positives and negatives
  On the relation of the size of the current window and change rates

ADWIN Bagging
When a change is detected, the worst classifier is removed and a new classifier is added.
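
The replacement rule stated above can be sketched on its own, leaving change detection aside. This is only an illustration under stated assumptions: ADWIN itself is not implemented here, the boolean changeDetected stands in for its output, "worst" is taken to mean the member with the highest running error estimate, and the names ReplaceWorstSketch, Member and onChange are hypothetical rather than MOA's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Sketch of "remove the worst classifier, add a new one" on change. */
final class ReplaceWorstSketch {

    /** Hypothetical ensemble member: a model plus its running error estimate. */
    static final class Member<M> {
        final M model;
        double errorEstimate;

        Member(M model, double errorEstimate) {
            this.model = model;
            this.errorEstimate = errorEstimate;
        }
    }

    /**
     * If a change was detected, replaces the member with the highest error
     * estimate by a fresh model. Returns the replaced index, or -1 if
     * nothing was replaced.
     */
    static <M> int onChange(List<Member<M>> ensemble, boolean changeDetected,
                            Supplier<M> freshModel) {
        if (!changeDetected || ensemble.isEmpty()) {
            return -1;
        }
        int worst = 0;
        for (int i = 1; i < ensemble.size(); i++) {
            if (ensemble.get(i).errorEstimate > ensemble.get(worst).errorEstimate) {
                worst = i;
            }
        }
        // The new member starts with a clean error estimate.
        ensemble.set(worst, new Member<>(freshModel.get(), 0.0));
        return worst;
    }

    public static void main(String[] args) {
        List<Member<String>> ensemble = new ArrayList<>();
        ensemble.add(new Member<>("tree-1", 0.10));
        ensemble.add(new Member<>("tree-2", 0.40));
        ensemble.add(new Member<>("tree-3", 0.20));
        int replaced = onChange(ensemble, true, () -> "fresh-tree");
        System.out.println("replaced index: " + replaced);
    }
}
```

Keeping the ensemble size fixed while swapping out only the weakest member lets the remaining classifiers retain what they learned before the change.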


Page 22: MOA : Massive Online Analysis

Web

http://www.cs.waikato.ac.nz/~abifet/MOA/


Page 23: MOA : Massive Online Analysis

GUI

java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher


Page 25: MOA : Massive Online Analysis

Command Line

EvaluatePeriodicHeldOutTest

java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask \
  "EvaluatePeriodicHeldOutTest -l DecisionStump -s generators.WaveformGenerator -n 100000 -i 100000000 -f 1000000" \
  > dsresult.csv

This command creates a comma separated values file:
  training the DecisionStump classifier on the WaveformGenerator data,
  using the first 100 thousand examples for testing,
  training on a total of 100 million examples, and
  testing every one million examples.


Page 26: MOA : Massive Online Analysis

Easy Design of a MOA classifier

void resetLearningImpl()

void trainOnInstanceImpl(Instance inst)

double[] getVotesForInstance(Instance i)

void getModelDescription(StringBuilder out, int indent)


Page 27: MOA : Massive Online Analysis

Example of a classifier: MajorityClass

public class MajorityClass extends AbstractClassifier {

    // Accumulated weight observed for each class value.
    protected DoubleVector observedClassDistribution;

    @Override
    public void resetLearningImpl() {
        this.observedClassDistribution = new DoubleVector();
    }

    @Override
    public void trainOnInstanceImpl(Instance inst) {
        // Add the instance weight to the total of its class.
        this.observedClassDistribution.addToValue(
                (int) inst.classValue(), inst.weight());
    }

    // The votes are the class totals, so the most frequent class wins.
    public double[] getVotesForInstance(Instance i) {
        return this.observedClassDistribution.getArrayCopy();
    }
    ...
}


Page 28: MOA : Massive Online Analysis

Summary

Massive Online Analysis is a framework for online learning from data streams.

http://www.cs.waikato.ac.nz/~abifet/MOA/

It is closely related to WEKA
It includes a collection of offline and online methods as well as tools for evaluation:
  boosting and bagging
  Hoeffding Trees
MOA deals with evolving data streams
MOA is easy to use and extend
