A new streaming computation engine for real-time analytics by Michael Barton at Big Data Spain 2015


Michael Barton (@mrb_barton), ITRS Group, Malaga

What happens when you make analysis easy to re-use

We have a big complicated trading system

Can we calculate the latency of each order?

Let's say I work in a bank

entry point

exit point

HTTP POST

{
  "MsgDirection": "I",
  "SendingTime": "2015-04-05T14:30Z",
  …
}

Simple to publish data

{
  "MsgDirection": "I",
  "SendingTime": "2015-04-05T14:31Z",
  …
}

{
  "MsgDirection": "O",
  "SendingTime": "2015-04-05T14:33Z",
  …
}

{
  "MsgDirection": "O",
  "SendingTime": "2015-04-05T14:32Z",
  …
}
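A minimal sketch of publishing one such message from Scala, assuming the Java 11 `java.net.http` client is available. The host, port and URL layout are hypothetical (the slides only show an HTTP POST and a stream name), and the payload carries just the two fields visible on the slide.

```scala
import java.net.URI
import java.net.http.HttpRequest

object PublishSketch {
  // Hypothetical endpoint: the real publish URL is not shown in the slides.
  val endpoint: URI = URI.create("http://localhost:8888/streams/demo/fix/exchange")

  // Only the fields visible on the slide; the remaining fields are elided there.
  val sampleMessage: String =
    """{"MsgDirection": "I", "SendingTime": "2015-04-05T14:30Z"}"""

  // Builds (but does not send) a JSON POST request for one message.
  def buildPost(json: String): HttpRequest =
    HttpRequest.newBuilder(endpoint)
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
}
```

Sending the request would then be a call such as `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a real endpoint.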

Tell Valo the schema?

{
  "schema": {
    "version": "1.0.0",
    "config": {},
    "topDef": {
      "type": "record",
      "properties": {
        ...
        "MsgDirection": {
          "type": "string",
          "comments": "I for input message, O for output"
        },
        "Account": {
          "type": "string",
          "optional": "true",
          "comments": "Account mnemonic as agreed between buy and sell sides"
        },
        ...
        "SendingTime": {
          "type": "datetime",
          "comments": "Time of message transmission (always expressed in UTC (Universal Time Coordinated, also known a
        },
        "Side": {
          "type": "string",
          "optional": "true",
          "comments": "Side"
        },
        "Symbol": {
          "type": "string",
          "optional": "true",
          "comments": "Ticker symbol. Common, human understood representation of the security."
        },
        ...

Let's use it

from historical /streams/demo/fix/exchange where MsgDirection == "I" into left
inner join from /streams/demo/fix/exchange where MsgDirection == "O" into right
on left.ClOrdID == right.ClOrdID
  && left.MsgType == "New Order Single"
  && right.MsgType == "Execution Report"
select left.ClOrdID, duration(right.SendingTime, left.SendingTime) as resTime

From → Filter → Join → Output
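To make the stages concrete, here is a small in-memory sketch of what the query computes: filter each side, inner join on ClOrdID, and output the elapsed time per order. It is not how Valo executes the query. The `FixMessage` case class and the `latencies` function are hypothetical names, and the sketch assumes the duration is the time from the input message to the output message, since the argument order of `duration` is not explained in the slides.

```scala
import java.time.{Duration, Instant}

object LatencyJoinSketch {
  // Hypothetical in-memory stand-in for one message on /streams/demo/fix/exchange.
  final case class FixMessage(
      clOrdID: String,
      msgDirection: String,
      msgType: String,
      sendingTime: Instant)

  // Filter both sides, inner join on ClOrdID, then project order id and elapsed time.
  def latencies(messages: Seq[FixMessage]): Seq[(String, Duration)] = {
    val left  = messages.filter(m => m.msgDirection == "I" && m.msgType == "New Order Single")
    val right = messages.filter(m => m.msgDirection == "O" && m.msgType == "Execution Report")
    // Index the right side by order id so each left row finds its matches directly.
    val rightById = right.groupBy(_.clOrdID)
    for {
      l <- left
      r <- rightById.getOrElse(l.clOrdID, Seq.empty)
    } yield (l.clOrdID, Duration.between(l.sendingTime, r.sendingTime))
  }
}
```

Orders with no matching output message simply produce no row, which is the inner-join behaviour the query asks for.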

Yeah, it gets complicated

Cluster of nodes: commodity hardware

Uniform architecture: no special leaders or roles

Streams of data: immutable, append-only, distributed; eventual consistency in failure cases

It’s just VALO

/streams/demo/fix/exchange

We know where the data is

From

nodeA

Semi-structured Repo

Time Series Repo

We know our storage

Semi-structured Repo

Hierarchical document data

Flexible schemas

Lucene indexes

Taxonomies and facets

Time Series Repo

Well-defined schema

Custom I/O layer

Bitmap B+Tree indices

…Filter

Join

Execute directly against the data and indexes in storage

Push down the query

How can we re-use this?

from /streams/demo/fix/exchange where MsgDirection == "I" into left
inner join from /streams/demo/fix/exchange where MsgDirection == "O" into right
on left.ClOrdID == right.ClOrdID
  && left.MsgType == "New Order Single"
  && right.MsgType == "Execution Report"
select left.ClOrdID, duration(right.SendingTime, left.SendingTime) as resTime

Real-time

Historical

Hybrid

Time Ranges

ward-G5

ward-G3

intensive-care

Can we look for unusual activity in the ECG monitors?

Let's say I work in a hospital

Assumption-Free Anomaly Detection in Time Series

Li Wei, Nitin Kumar, Venkata Lolla, Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana

University of California – Riverside, Department of Computer Science & Engineering

Riverside, CA 92521, USA

http://alumni.cs.ucr.edu/~ratana/SSDBM05.pdf

http://alumni.cs.ucr.edu/~wli/SSDBM05/

Here’s an interesting paper

@ValoOnlineFunction("anomaly")
@ValoOnlineFunctionAnnotation(SchemaAnnotations.ANALYTICS.ANOMALY)
@ValoOnlineFunctionDescription("Unsupervised anomaly detection for time series")
object OnlineAnomalyDetectionFactory extends
    OnlineAlgorithmFactory[OnlineAnomalyDetectionParams, Double, OnlineAnomalyDetectionResult] {

  override val isCommutative: Boolean = false
  override val isAssociative: Boolean = false
  override val isMergeable: Boolean = false

  override def getDependency(windowType: WindowType): AlgoDependency = AlgoDependency.NoDependencies

  override def init(args: OnlineAnomalyDetectionParams): OnlineAlgorithm[Double, OnlineAnomalyDetectionResult] = {
    new OnlineAnomalyDetection(args.lagWindow, args.leadWindow, args.featureSize, 3, 5)
  }
}

final case class OnlineAnomalyDetectionParams(lagWindow: Int, leadWindow: Int, featureSize: Int)

final case class OnlineAnomalyDetectionResult(isTraining: Boolean, point: Double, signal: Double)

Full algorithm code omitted for brevity!
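Since the algorithm itself is not shown, the following is only a simplified stand-in to illustrate the lag/lead window idea, not the method from the paper and not Valo's `OnlineAnomalyDetection`. It compares a value histogram of the older lag window with one of the newer lead window and reports their squared distance as the signal. The `Detector` class, its `bins`, `lo` and `hi` parameters, and the histogram comparison are all assumptions made for this sketch.

```scala
object LagLeadAnomalySketch {
  final case class Result(isTraining: Boolean, point: Double, signal: Double)

  // Simplified stand-in: histogram distance between lag and lead windows.
  // Assumes lagWindow > 0, leadWindow > 0, bins > 0 and hi > lo.
  final class Detector(lagWindow: Int, leadWindow: Int, bins: Int, lo: Double, hi: Double) {
    private var buffer: Vector[Double] = Vector.empty

    // Normalised histogram of the values, clamped into [lo, hi].
    private def histogram(xs: Seq[Double]): Vector[Double] = {
      val counts = Array.fill(bins)(0.0)
      xs.foreach { x =>
        val clamped = math.min(math.max(x, lo), hi)
        val idx = math.min(((clamped - lo) / (hi - lo) * bins).toInt, bins - 1)
        counts(idx) += 1.0
      }
      counts.map(_ / xs.size).toVector
    }

    // Consume one point; report "training" until both windows are full.
    def update(point: Double): Result = {
      buffer = (buffer :+ point).takeRight(lagWindow + leadWindow)
      if (buffer.size < lagWindow + leadWindow) {
        Result(isTraining = true, point = point, signal = 0.0)
      } else {
        val (lag, lead) = buffer.splitAt(lagWindow)
        val signal = histogram(lag).zip(histogram(lead))
          .map { case (a, b) => (a - b) * (a - b) }
          .sum
        Result(isTraining = false, point = point, signal = signal)
      }
    }
  }
}
```

A steady signal yields a value near zero, while a lead window that looks nothing like the lag window pushes the signal towards its maximum of 2.0.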

So let's implement it!

HTTP POST

{
  "ts": "2015-04-05T14:30Z",
  "contributor": "ward-g3-monitor0",
  "value": 0.25443
}

{
  "ts": "2015-04-05T14:30Z",
  "contributor": "ward-g3-monitor1",
  "value": 0.36432
}

{
  "ts": "2015-04-05T14:30Z",
  "contributor": "intensive-care",
  "value": 0.46580
}

{
  "ts": "2015-04-05T14:31Z",
  "contributor": "ward-g3-monitor0",
  "value": 0.26073
}

ward-G3

intensive-care

Simple to publish data

Let's use it

from historical /streams/demo/infrastructure/ecg
group by contributor
select contributor, anomaly(200, 40, 20, value) as result
emit every value

final case class OnlineAnomalyDetectionParams(lagWindow: Int, leadWindow: Int, featureSize: Int)
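One way to picture `group by contributor` with `emit every value` is a separate piece of online state per contributor that produces an output for each incoming reading. The sketch below assumes that reading of the query. A running mean stands in for `anomaly(...)`, and the `Reading`, `RunningMean` and `run` names are hypothetical.

```scala
import scala.collection.mutable

object GroupBySketch {
  final case class Reading(contributor: String, value: Double)

  // Hypothetical per-key online function; a running mean stands in for anomaly(...).
  final class RunningMean {
    private var n = 0L
    private var mean = 0.0
    def update(x: Double): Double = {
      n += 1
      mean += (x - mean) / n
      mean
    }
  }

  // One state object per contributor; one output row per input reading.
  def run(readings: Seq[Reading]): Seq[(String, Double)] = {
    val state = mutable.Map.empty[String, RunningMean]
    readings.map { r =>
      val algo = state.getOrElseUpdate(r.contributor, new RunningMean)
      (r.contributor, algo.update(r.value))
    }
  }
}
```

Because each contributor has its own state, readings from one monitor never influence the result reported for another.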

One type of monitor is consistently having issues and producing bad results.

Can we monitor which ones?

ward-G3

intensive-care

ward-G5

Can we re-use the analysis?

Live updating sets of contributors to data

manufacturer == "ACME"

Re-use the same query across domains
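A domain can be pictured as a named predicate over contributor metadata, so that membership follows whatever contributors are currently known. This is only an illustration of the idea, not Valo's implementation. The `Contributor` and `Domain` types and the manufacturer values used in any example data are hypothetical.

```scala
object DomainSketch {
  // Hypothetical contributor metadata; the slides only mention a manufacturer attribute.
  final case class Contributor(id: String, manufacturer: String)

  // A domain is a predicate, so membership is recomputed from the current contributors.
  final case class Domain(name: String, predicate: Contributor => Boolean) {
    def members(known: Iterable[Contributor]): Set[String] =
      known.filter(predicate).map(_.id).toSet
  }

  val acmeMonitors: Domain = Domain("ACME Monitors", _.manufacturer == "ACME")
}
```

When a new ACME monitor is registered it satisfies the predicate and joins the domain without the query being rewritten, which is what lets the same query run across wards.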

ACME Monitors

Domains

ward-G3

intensive-care

ward-G5

http://collections.rmg.co.uk/mediaLib/476/media-476182/large.jpg (CC BY-NC-SA)

Can we re-use the analysis?

Same algorithm

Similar queries

Real-time and historical

thank you
