FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)
Big Data Application Meetup, July 27th, 2016
Large-scale Machine Learning
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
○ We have a large-scale learning problem when the active budget constraint is the computing time.
Source: Léon Bottou
Apache Flink
What is Apache Flink?
● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
Expressive APIs
● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.
case class Word(word: String, frequency: Int)

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word")
  .sum("frequency")
  .print()
Pipelined Stream Processor
Iterate in the dataflow
Iterate by looping
● Loop in the client submits one job per iteration step
● Reuse data by caching in memory or on disk
Iterate in the dataflow
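Flink's native (closed-loop) iterations run the whole loop inside one dataflow job instead of one job per step. A minimal sketch with the Scala DataSet API follows; the Newton's-method step is my own toy example, not from the talk:

```scala
import org.apache.flink.api.scala._

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Initial guess for the value we want to refine
    val initial: DataSet[Double] = env.fromElements(1.0)

    // Run 10 iteration steps inside the dataflow: Flink schedules the
    // whole loop as a single job, with no per-step resubmission.
    val result = initial.iterate(10) { current =>
      // One step of Newton's method for sqrt(2): x <- (x + 2/x) / 2
      current.map(x => (x + 2.0 / x) / 2.0)
    }

    result.print()
  }
}
```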
Delta iterations
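Delta iterations only reprocess the elements that changed in the previous step (the workset), updating the solution set in place. A sketch assuming the classic connected-components use case from the Flink docs; the edge data is made up and edges are treated as directed for brevity:

```scala
import org.apache.flink.api.scala._

// Sketch: delta iteration propagating the minimum component label.
object DeltaIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy directed edges; real data would come from a source
    val edges: DataSet[(Long, Long)] =
      env.fromElements((1L, 2L), (2L, 3L), (4L, 5L))
    val vertices: DataSet[(Long, Long)] =
      edges.flatMap(e => Seq(e._1, e._2)).distinct().map(v => (v, v))

    // Solution set keyed on field 0 (the vertex id), at most 10 steps
    val components = vertices.iterateDelta(vertices, 10, Array(0)) {
      (solution, workset) =>
        // Propagate the label of every changed vertex to its neighbors
        val candidates = workset.join(edges).where(0).equalTo(0) {
          (vertex, edge) => (edge._2, vertex._2)
        }
        // Keep only candidates that lower a vertex's current label;
        // they form both the solution delta and the next workset
        val delta = candidates.groupBy(0).min(1)
          .join(solution).where(0).equalTo(0) { (cand, sol) => (cand, sol) }
          .filter(pair => pair._1._2 < pair._2._2)
          .map(_._1)
        (delta, delta)
    }

    components.print()
  }
}
```

The iteration terminates early as soon as the workset becomes empty, which is what makes delta iterations efficient for sparse updates.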
Performance
Extending the Yahoo Streaming Benchmark
FlinkML
FlinkML
● New effort to bring large-scale machine learning to Apache Flink
FlinkML
● New effort to bring large-scale machine learning to Apache Flink● Goals:
○ Truly scalable implementations○ Keep glue code to a minimum○ Ease of use
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ Support Vector Machine
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● Unsupervised learning
○ Quad-tree exact kNN search
● sklearn-like ML pipelines
FlinkML API

// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val mlr = MultipleLinearRegression()
  .setStepsize(0.01)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
FlinkML Pipelines

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features
// and multiple linear regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions = pipeline.predict(testingData)
FlinkML: Focus on scalability
Alternating Least Squares
R ≅ X ✕ Y: the Users ✕ Items ratings matrix R is approximated by the product of two low-rank factor matrices, X (user factors) and Y (item factors).
Naive Alternating Least Squares
Blocked Alternating Least Squares
Blocked ALS performance
FlinkML blocked ALS performance
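In FlinkML the blocked ALS recommender is used like any other predictor. A usage sketch follows; the ratings and parameter values are made up for illustration:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // (user, item, rating) triples; toy data for illustration
    val ratings: DataSet[(Int, Int, Double)] =
      env.fromElements((1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0))

    val als = ALS()
      .setNumFactors(10) // rank of the factor matrices X and Y
      .setIterations(10)
      .setLambda(0.1)    // regularization
      .setBlocks(4)      // user/item blocks used by the blocked formulation

    als.fit(ratings)

    // Predict ratings for unseen (user, item) pairs
    val predictions = als.predict(env.fromElements((2, 11)))
    predictions.print()
  }
}
```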
Going beyond SGD in large-scale optimization
● Beyond SGD → Use Primal-Dual framework
● Slow updates → Immediately apply local updates
CoCoA: Communication Efficient Coordinate Ascent
Primal-dual framework
Source: Smith (2014)
Immediately Apply Updates
Source: Smith (2014)
CoCoA: Communication Efficient Coordinate Ascent
CoCoA performance
Source: Jaggi (2014)
Available on FlinkML
SVM
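The CoCoA-based SVM is used like the other FlinkML predictors. A sketch follows; the data path and parameter values are placeholders:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.classification.SVM

object SvmSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Placeholder path to a libSVM-formatted training file
    val trainingData = MLUtils.readLibSVM(env, "/path/to/train.libsvm")

    val svm = SVM()
      .setBlocks(env.getParallelism) // data blocks CoCoA solves locally
      .setIterations(100)            // outer communication rounds
      .setLocalIterations(10)        // local solver steps per round
      .setRegularization(0.001)
      .setStepsize(0.1)

    svm.fit(trainingData)
  }
}
```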
Dealing with stragglers: SSP Iterations
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually, updating the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones.
○ Allows for progress, while keeping convergence guarantees.
Dealing with stragglers: SSP Iterations
Source: Ho et al. (2013)
SSP Iterations in Flink: Lasso Regression
Source: Peel et al. (2015)
PR submitted
Challenges in developing an open-source ML library
Challenges in open-source ML libraries
● Depth or breadth
● Design choices
● Testing
Challenges in open-source ML libraries
● Attracting developers
● What to commit
● Avoiding code rot
Current and future work on FlinkML
Current work
● Tooling
○ Evaluation & cross-validation framework
○ Distributed linear algebra
○ Streaming predictors
● Algorithms
○ Implicit ALS
○ Multi-layer perceptron
○ Efficient streaming decision trees
○ Column-wise statistics, histograms
Future of Machine Learning on Flink
● Streaming ML
○ Flink already has SAMOA bindings.
○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques.
● “Computation efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
Check it out:
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
“Demo”
Thank you
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
References
● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Smith (2014): CoCoA AMPCamp presentation
● Jaggi et al. (2014): “Communication-efficient distributed dual coordinate ascent.” NIPS 2014.
● Ho et al. (2013): “More effective distributed ML via a stale synchronous parallel parameter server.” NIPS 2013.
● Peel et al. (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism.” IEEE BigData 2015.
● Recent INRIA paper examining Spark vs. Flink (batch only)
● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-Week with Flink)
● Also interesting: Bayesian anomaly detection in Flink