FlinkML - Big data application meetup

FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)
Big Data Application Meetup, July 27th, 2016

Transcript of FlinkML - Big data application meetup

Page 1: FlinkML - Big data application meetup

FlinkML: Large-scale Machine Learning with Apache Flink

Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)

Big Data Application Meetup, July 27th, 2016

Page 2: FlinkML - Big data application meetup

Large-scale Machine Learning

Page 6: FlinkML - Big data application meetup

What do we mean?

● Small-scale learning
  ○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
  ○ We have a large-scale learning problem when the active budget constraint is the computing time.

Source: Léon Bottou
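
The two definitions above can be written as one constrained problem. The sketch below uses the notation common in Bottou's work on the tradeoffs of large-scale learning; the symbols are assumptions of this note and do not appear on the slide: \mathcal{E} is the excess error, split into approximation, estimation and optimization error, n is the number of examples, T the computing time, \mathcal{F} the model family and \rho the optimization tolerance.

```latex
\begin{aligned}
\mathcal{E} &= \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} \\[4pt]
\min_{\mathcal{F},\,\rho,\,n}\ \mathcal{E}
  \quad &\text{subject to} \quad n \le n_{\max}, \qquad T(\mathcal{F}, \rho, n) \le T_{\max}
\end{aligned}
```

The problem is small-scale when the constraint on n is the active one, and large-scale when the constraint on T is the active one. In the large-scale case it can pay to accept a larger optimization error in exchange for processing more examples within the time budget.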

Page 7: FlinkML - Big data application meetup

Apache Flink

Page 8: FlinkML - Big data application meetup

What is Apache Flink?

● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
  ○ true streaming dataflow engine
  ○ custom memory manager
  ○ native iterations
  ○ cost-based optimizer

Page 9: FlinkML - Big data application meetup

What is Apache Flink?

Page 10: FlinkML - Big data application meetup

What does Flink give us?

● Expressive APIs
● Pipelined stream processor
● Closed loop iterations

Page 11: FlinkML - Big data application meetup

Expressive APIs

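● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.

import org.apache.flink.api.scala._

// Entry point for batch (DataSet) programs
val env = ExecutionEnvironment.getExecutionEnvironment

case class Word(word: String, frequency: Int)

val lines: DataSet[String] = env.readTextFile(...)

// Split each line into words, then sum the counts per word
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word").sum("frequency")
  .print()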

Page 12: FlinkML - Big data application meetup

Pipelined Stream Processor

Page 13: FlinkML - Big data application meetup

Iterate in the dataflow

Page 14: FlinkML - Big data application meetup

Iterate by looping

● Loop in client submits one job per iteration step
● Reuse data by caching in memory or disk

Page 15: FlinkML - Big data application meetup

Iterate in the dataflow

Page 16: FlinkML - Big data application meetup

Delta iterations
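
A delta iteration keeps a solution set plus a workset, and each step only processes the elements that changed in the previous step. Below is a minimal plain-Scala sketch of that idea, using connected components by minimum-label propagation. It is an illustration only, not Flink API code: ordinary collections stand in for DataSets, and the `Graph` type, the object name and the function name are hypothetical. It assumes every vertex appears as a key and that edges are listed in both directions.

```scala
// Plain-Scala illustration of the delta-iteration idea (not Flink API code).
object DeltaIterationSketch {
  // Hypothetical adjacency list: vertex id -> neighbour ids (edges in both directions).
  type Graph = Map[Int, Seq[Int]]

  def connectedComponents(graph: Graph): Map[Int, Int] = {
    // Solution set: current component label of every vertex (initially its own id).
    var solution: Map[Int, Int] = graph.keys.map(v => v -> v).toMap
    // Workset: vertices whose label changed and that must notify their neighbours.
    var workset: Set[Int] = graph.keySet

    while (workset.nonEmpty) {
      // Changed vertices send their current label to each neighbour.
      val candidates: Seq[(Int, Int)] =
        workset.toSeq.flatMap(v => graph(v).map(n => n -> solution(v)))
      // Keep the smallest candidate label per target vertex.
      val best: Map[Int, Int] =
        candidates.groupBy(_._1).map { case (n, cs) => n -> cs.map(_._2).min }
      // The delta: only updates that improve on the current solution.
      val delta = best.filter { case (n, label) => label < solution(n) }
      solution = solution ++ delta
      // Only the changed vertices are processed in the next step.
      workset = delta.keySet
    }
    solution
  }
}
```

The loop stops as soon as the workset is empty, so later steps touch fewer and fewer vertices instead of recomputing every label each time.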

Page 18: FlinkML - Big data application meetup

FlinkML

Page 20: FlinkML - Big data application meetup

FlinkML

● New effort to bring large-scale machine learning to Apache Flink
● Goals:
  ○ Truly scalable implementations
  ○ Keep glue code to a minimum
  ○ Ease of use

Page 26: FlinkML - Big data application meetup

FlinkML: Overview

● Supervised Learning
  ○ Optimization framework
  ○ Support Vector Machine
  ○ Multiple linear regression
● Recommendation
  ○ Alternating Least Squares (ALS)
● Pre-processing
  ○ Polynomial features
  ○ Feature scaling
● Unsupervised learning
  ○ Quad-tree exact kNN search
● sklearn-like ML pipelines

Page 30: FlinkML - Big data application meetup

FlinkML API

// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val mlr = MultipleLinearRegression()
  .setStepsize(0.01)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)

Page 33: FlinkML - Big data application meetup

FlinkML Pipelines

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features and multiple linear regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions = pipeline.predict(testingData)
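
To show how this kind of chaining can work, here is a minimal plain-Scala sketch of the transformer/predictor pattern. It is not FlinkML's implementation: plain collections replace DataSet, and the `Features` type, the two traits and the toy `MaxAbsScaler` and `SlopeRegressor` classes are all hypothetical. Only the method names `chainTransformer` and `chainPredictor` are borrowed from the slide.

```scala
// Simplified stand-in for an sklearn-like pipeline (not FlinkML code).
object PipelineSketch {
  type Features = Vector[Double]

  trait Predictor {
    def fit(data: Seq[(Features, Double)]): Unit
    def predict(x: Features): Double
  }

  trait Transformer { self =>
    def fit(data: Seq[Features]): Unit
    def transform(x: Features): Features

    // Chaining two transformers yields another transformer.
    def chainTransformer(next: Transformer): Transformer = new Transformer {
      def fit(data: Seq[Features]): Unit = {
        self.fit(data)                      // fit the first stage
        next.fit(data.map(self.transform))  // fit the second on transformed data
      }
      def transform(x: Features): Features = next.transform(self.transform(x))
    }

    // Chaining a predictor closes the pipeline.
    def chainPredictor(last: Predictor): Predictor = new Predictor {
      def fit(data: Seq[(Features, Double)]): Unit = {
        self.fit(data.map(_._1))
        last.fit(data.map { case (x, y) => (self.transform(x), y) })
      }
      def predict(x: Features): Double = last.predict(self.transform(x))
    }
  }

  // Toy transformer: divides features by the largest absolute value seen in fit.
  class MaxAbsScaler extends Transformer {
    private var scale = 1.0
    def fit(data: Seq[Features]): Unit =
      scale = data.flatten.map(v => math.abs(v)).max
    def transform(x: Features): Features = x.map(_ / scale)
  }

  // Toy predictor: least-squares slope through the origin on the first feature.
  class SlopeRegressor extends Predictor {
    private var w = 0.0
    def fit(data: Seq[(Features, Double)]): Unit = {
      val num = data.map { case (x, y) => x(0) * y }.sum
      val den = data.map { case (x, _) => x(0) * x(0) }.sum
      w = num / den
    }
    def predict(x: Features): Double = w * x(0)
  }
}
```

In this sketch, `new PipelineSketch.MaxAbsScaler().chainPredictor(new PipelineSketch.SlopeRegressor())` has the same shape as the slide's `scaler.chainTransformer(polyFeatures).chainPredictor(mlr)`. Calling `fit` on the result fits each stage on the output of the previous one, and `predict` pushes the input through the same stages.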

Page 34: FlinkML - Big data application meetup

FlinkML: Focus on scalability

Page 35: FlinkML - Big data application meetup

Alternating Least Squares

R ≅ X ✕ Y (figure: ratings matrix R, with axes labelled Users and Items)
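
A common way to write the factorization R ≅ X ✕ Y as an optimization problem is sketched below. The notation is an assumption of this note, not taken from the slide: x_u is the factor vector of user u, y_i the factor vector of item i, λ a regularization constant, k the number of latent factors, and I_u the set of items rated by user u. The regularization actually used by a given library may differ, for example by weighting each term by the number of ratings.

```latex
\min_{X,\,Y} \sum_{(u,i)\,:\, r_{ui}\ \text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

% With Y held fixed, every user vector has a closed-form least-squares solution:
x_u = \Big( \sum_{i \in I_u} y_i y_i^{\top} + \lambda\, \mathbf{I}_k \Big)^{-1} \sum_{i \in I_u} r_{ui}\, y_i
```

The algorithm alternates: fix Y and solve for all x_u, then fix X and solve the symmetric problem for all y_i. Each half-step is a set of independent small linear systems, which is what makes the method amenable to distributed execution.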

Page 36: FlinkML - Big data application meetup

Naive Alternating Least Squares

Page 37: FlinkML - Big data application meetup

Blocked Alternating Least Squares

Page 38: FlinkML - Big data application meetup

Blocked ALS performance

FlinkML blocked ALS performance

Page 39: FlinkML - Big data application meetup

Going beyond SGD in large-scale optimization

Page 40: FlinkML - Big data application meetup

CoCoA: Communication Efficient Coordinate Ascent

● Beyond SGD → Use Primal-Dual framework
● Slow updates → Immediately apply local updates

Page 41: FlinkML - Big data application meetup

Primal-dual framework

Source: Smith (2014)

Page 42: FlinkML - Big data application meetup

Primal-dual framework

Source: Smith (2014)

Page 43: FlinkML - Big data application meetup

Immediately Apply Updates

Source: Smith (2014)

Page 44: FlinkML - Big data application meetup

Immediately Apply Updates

Source: Smith (2014)

Page 45: FlinkML - Big data application meetup

CoCoA: Communication Efficient Coordinate Ascent

Page 46: FlinkML - Big data application meetup

CoCoA performance

Source: Jaggi (2014)

Page 47: FlinkML - Big data application meetup

CoCoA performance

Available on FlinkML

SVM

Page 48: FlinkML - Big data application meetup

Dealing with stragglers: SSP Iterations

Page 53: FlinkML - Big data application meetup

Dealing with stragglers: SSP Iterations

● BSP: Bulk Synchronous Parallel
  ○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
  ○ Every worker can work individually and update the model as needed.
  ○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
  ○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones.
  ○ Allows for progress, while keeping convergence guarantees.
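
The three schemes differ only in how far a fast worker may run ahead of the slowest one. Here is a minimal plain-Scala sketch of that rule, under the simplifying assumption that each worker exposes a counter of completed iterations. The object, the function names and the `clocks` map are hypothetical; this is neither Flink nor parameter-server code.

```scala
// Simplified model of the staleness rule (illustration only).
object SspSketch {
  // clocks(w) = number of iterations worker w has completed so far.
  // A worker may start its next iteration only if it is at most `staleness`
  // iterations ahead of the slowest worker.
  def mayProceed(clocks: Map[String, Int], worker: String, staleness: Int): Boolean =
    clocks(worker) - clocks.values.min <= staleness

  // Workers that have to wait under the given staleness bound.
  def blocked(clocks: Map[String, Int], staleness: Int): Set[String] =
    clocks.keySet.filterNot(w => mayProceed(clocks, w, staleness))
}
```

With `staleness = 0` the rule reduces to BSP, since nobody starts iteration c + 1 before everyone has finished iteration c. With an unbounded staleness it behaves like ASP. Values of K in between give SSP.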

Page 54: FlinkML - Big data application meetup

Dealing with stragglers: SSP Iterations

Source: Ho et al. (2013)

Page 56: FlinkML - Big data application meetup

SSP Iterations in Flink: Lasso Regression

Source: Peel et al. (2015)

PR submitted

Page 57: FlinkML - Big data application meetup

Challenges in developing an open-source ML library

Page 58: FlinkML - Big data application meetup

Challenges in open-source ML libraries

● Depth or breadth
● Design choices
● Testing

Page 59: FlinkML - Big data application meetup

Challenges in open-source ML libraries

● Attracting developers
● What to commit
● Avoiding code rot

Page 60: FlinkML - Big data application meetup

Current and future work on FlinkML

Page 61: FlinkML - Big data application meetup

Current work

● Tooling
  ○ Evaluation & cross-validation framework
  ○ Distributed linear algebra
  ○ Streaming predictors
● Algorithms
  ○ Implicit ALS
  ○ Multi-layer perceptron
  ○ Efficient streaming decision trees
  ○ Column-wise statistics, histograms

Page 63: FlinkML - Big data application meetup

Future of Machine Learning on Flink

● Streaming ML
  ○ Flink already has SAMOA bindings.
  ○ Preliminary work has already started: implement state-of-the-art (SOTA) algorithms and develop new techniques.
● “Computation efficient” learning
  ○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.

Page 65: FlinkML - Big data application meetup

“Demo”

Page 76: FlinkML - Big data application meetup

References

● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Smith (2014): CoCoA AMPCAMP Presentation
● Jaggi (2014): "Communication-efficient distributed dual coordinate ascent." NIPS 2014.
● Ho (2013): "More effective distributed ML via a stale synchronous parallel parameter server." NIPS 2013.
● Peel (2015): "Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism." IEEE BigData 2015.
● Recent INRIA paper examining Spark vs. Flink (batch only)
● Extending the Yahoo streaming benchmark (And winning the Twitter Hack-week with Flink)
● Also interesting: Bayesian anomaly detection in Flink