FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)
Big Data Application Meetup, July 27th, 2016
Large-scale Machine Learning
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
○ We have a large-scale learning problem when the active budget constraint is the computing time.
Source: Léon Bottou
Apache Flink
What is Apache Flink?
● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
Expressive APIs
● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.
case class Word(word: String, frequency: Int)

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word")
  .sum("frequency")
  .print()
Pipelined Stream Processor
Iterate in the dataflow
Iterate by looping
● Loop in the client submits one job per iteration step
● Reuse data by caching in memory or on disk
Iterate in the dataflow
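Flink's native (closed-loop) iterations run the whole loop inside one dataflow job instead of one job per step. A minimal sketch with the Scala DataSet API follows; the Newton's-method step is my own toy example, not from the talk:

```scala
import org.apache.flink.api.scala._

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Initial guess for the value we want to refine
    val initial: DataSet[Double] = env.fromElements(1.0)

    // Run 10 iteration steps inside the dataflow: Flink schedules the
    // whole loop as a single job, with no per-step resubmission.
    val result = initial.iterate(10) { current =>
      // One step of Newton's method for sqrt(2): x <- (x + 2/x) / 2
      current.map(x => (x + 2.0 / x) / 2.0)
    }

    result.print()
  }
}
```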
Delta iterations
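Delta iterations only reprocess the elements that changed in the previous step (the workset), updating the solution set in place. A sketch assuming the classic connected-components use case from the Flink docs; the edge data is made up and edges are treated as directed for brevity:

```scala
import org.apache.flink.api.scala._

// Sketch: delta iteration propagating the minimum component label.
object DeltaIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy directed edges; real data would come from a source
    val edges: DataSet[(Long, Long)] =
      env.fromElements((1L, 2L), (2L, 3L), (4L, 5L))
    val vertices: DataSet[(Long, Long)] =
      edges.flatMap(e => Seq(e._1, e._2)).distinct().map(v => (v, v))

    // Solution set keyed on field 0 (the vertex id), at most 10 steps
    val components = vertices.iterateDelta(vertices, 10, Array(0)) {
      (solution, workset) =>
        // Propagate the label of every changed vertex to its neighbors
        val candidates = workset.join(edges).where(0).equalTo(0) {
          (vertex, edge) => (edge._2, vertex._2)
        }
        // Keep only candidates that lower a vertex's current label;
        // they form both the solution delta and the next workset
        val delta = candidates.groupBy(0).min(1)
          .join(solution).where(0).equalTo(0) { (cand, sol) => (cand, sol) }
          .filter(pair => pair._1._2 < pair._2._2)
          .map(_._1)
        (delta, delta)
    }

    components.print()
  }
}
```

The iteration terminates early as soon as the workset becomes empty, which is what makes delta iterations efficient for sparse updates.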
Performance
Extending the Yahoo Streaming Benchmark
FlinkML
FlinkML
● New effort to bring large-scale machine learning to Apache Flink
FlinkML
● New effort to bring large-scale machine learning to Apache Flink● Goals:
○ Truly scalable implementations○ Keep glue code to a minimum○ Ease of use
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ Support Vector Machine
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● Unsupervised learning
○ Quad-tree exact kNN search
● sklearn-like ML pipelines
FlinkML API

// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val mlr = MultipleLinearRegression()
  .setStepsize(0.01)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
FlinkML Pipelines

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features
// and multiple linear regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions = pipeline.predict(testingData)
FlinkML: Focus on scalability
Alternating Least Squares
R ≅ X ✕ Y: the Users ✕ Items ratings matrix R is approximated by the product of two low-rank factor matrices, X (user factors) and Y (item factors).
Naive Alternating Least Squares
Blocked Alternating Least Squares
Blocked ALS performance
FlinkML blocked ALS performance
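In FlinkML the blocked ALS recommender is used like any other predictor. A usage sketch follows; the ratings and parameter values are made up for illustration:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // (user, item, rating) triples; toy data for illustration
    val ratings: DataSet[(Int, Int, Double)] =
      env.fromElements((1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0))

    val als = ALS()
      .setNumFactors(10) // rank of the factor matrices X and Y
      .setIterations(10)
      .setLambda(0.1)    // regularization
      .setBlocks(4)      // user/item blocks used by the blocked formulation

    als.fit(ratings)

    // Predict ratings for unseen (user, item) pairs
    val predictions = als.predict(env.fromElements((2, 11)))
    predictions.print()
  }
}
```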
Going beyond SGD in large-scale optimization
● Beyond SGD → Use Primal-Dual framework
● Slow updates → Immediately apply local updates
CoCoA: Communication Efficient Coordinate Ascent
Primal-dual framework
Source: Smith (2014)
Immediately Apply Updates
Source: Smith (2014)
CoCoA: Communication Efficient Coordinate Ascent
CoCoA performance
Source: Jaggi (2014)
Available on FlinkML
SVM
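The CoCoA-based SVM is used like the other FlinkML predictors. A sketch follows; the data path and parameter values are placeholders:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.classification.SVM

object SvmSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Placeholder path to a libSVM-formatted training file
    val trainingData = MLUtils.readLibSVM(env, "/path/to/train.libsvm")

    val svm = SVM()
      .setBlocks(env.getParallelism) // data blocks CoCoA solves locally
      .setIterations(100)            // outer communication rounds
      .setLocalIterations(10)        // local solver steps per round
      .setRegularization(0.001)
      .setStepsize(0.1)

    svm.fit(trainingData)
  }
}
```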
Dealing with stragglers: SSP Iterations
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually, updating the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones.
○ Allows for progress, while keeping convergence guarantees.
Dealing with stragglers: SSP Iterations
Source: Ho et al. (2013)
SSP Iterations in Flink: Lasso Regression
Source: Peel et al. (2015)
PR submitted
Challenges in developing an open-source ML library
Challenges in open-source ML libraries
● Depth or breadth
● Design choices
● Testing
Challenges in open-source ML libraries
● Attracting developers
● What to commit
● Avoiding code rot
Current and future work on FlinkML
Current work
● Tooling
○ Evaluation & cross-validation framework
○ Distributed linear algebra
○ Streaming predictors
● Algorithms
○ Implicit ALS
○ Multi-layer perceptron
○ Efficient streaming decision trees
○ Column-wise statistics, histograms
Future of Machine Learning on Flink
● Streaming ML
○ Flink already has SAMOA bindings.
○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques.
● “Computation efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
Check it out:
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
“Demo”
Thank you
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
References
● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Smith (2014): CoCoA AMPCamp presentation
● Jaggi et al. (2014): “Communication-efficient distributed dual coordinate ascent.” NIPS 2014.
● Ho et al. (2013): “More effective distributed ML via a stale synchronous parallel parameter server.” NIPS 2013.
● Peel et al. (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism.” IEEE BigData 2015.
● Recent INRIA paper examining Spark vs. Flink (batch only)
● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-Week with Flink)
● Also interesting: Bayesian anomaly detection in Flink