Data intensive applications with Apache Flink - Simone Robutti, Radicalbit

Milan – July 13 2016 Data Intensive Applications with Apache Flink Simone Robutti Machine Learning Engineer at Radicalbit @SimoneRobutti


Page 1: Data intensive applications with Apache Flink - Simone Robutti, Radicalbit

Milan – July 13 2016

Data Intensive Applications with Apache Flink

Simone Robutti, Machine Learning Engineer at Radicalbit

@SimoneRobutti

Page 2

Agenda

1. Brief Introduction to Apache Flink

○ Why

○ What

○ How

2. Machine Learning on Flink

○ Present landscape

○ Future of the Ecosystem

3. Closing notes on Radicalbit (shameless plug ahead)

Page 3

100% Buzzword-free guaranteed

Big Data

Machine Intelligence

Web-scale

400x

It’s like the human brain

Exactly-once

Exactly-once

Page 4

Why Flink (and not Spark/Storm/Samza...)

Because it’s a production-ready, streaming-first, low-latency, fault-tolerant, high-throughput processing engine

Page 5

Flink: what is it?

From Flink’s Documentation

Page 6

Connectors and integrations

Page 7

Flink’s Runtime

From Flink’s Documentation

Page 8

Flink’s DataFlow

From Flink’s Documentation

Written by the user through DataSet/DataStream API

Compiled and optimized in the client

Page 9

Flink’s DataFlow

From Flink’s Documentation

The compiled job is translated to distributed tasks by the master and executed by workers
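To make the idea of a logical job being split into parallel tasks concrete, here is a minimal sketch in plain Python. It does not use the Flink API: threads stand in for workers, and the names `PARALLELISM`, `partition_by_key`, `count_task` and `run_job` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

PARALLELISM = 3  # hypothetical number of workers


def partition_by_key(words, parallelism):
    """Route each word to a partition; equal words always land together."""
    partitions = [[] for _ in range(parallelism)]
    for word in words:
        # Sum of code points is deterministic across runs, unlike hash() on str.
        partitions[sum(map(ord, word)) % parallelism].append(word)
    return partitions


def count_task(partition):
    """One parallel subtask: count the words of a single partition."""
    return Counter(partition)


def run_job(lines, parallelism=PARALLELISM):
    # Logical plan: split lines into words -> group by word -> count.
    words = [w for line in lines for w in line.lower().split()]
    partitions = partition_by_key(words, parallelism)
    # The "master" hands one subtask per partition to the pool of "workers".
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partial_counts = list(pool.map(count_task, partitions))
    # Partitions hold disjoint keys, so merging is a plain union.
    result = {}
    for partial in partial_counts:
        result.update(partial)
    return result


counts = run_job(["to be or not to be", "to stream or to batch"])
```

The result does not depend on the degree of parallelism, which is the property that lets an engine choose how many tasks to run without changing the user's program.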

Page 10

Machine Learning on Flink

Page 11

Ready and awesome for parallel ML

Work in progress for distributed ML

ML on Flink

Page 12

Flink for Model Evaluation Pipelines

[Diagram labels: Source (twice), Data Preparation, Evaluation, Postprocessing, Sink]

Composable, modular Flink Operator
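The composability shown on this slide can be sketched in plain Python, with generators standing in for Flink operators. This is an assumption-laden illustration, not Flink code: `compose`, the record layout, the scaling rule and the fixed linear "model" are all made up for the example.

```python
from functools import reduce


def compose(*operators):
    """Chain operators left to right; each maps an iterator to an iterator."""
    return lambda stream: reduce(lambda s, op: op(s), operators, stream)


def data_preparation(records):
    # Drop incomplete records and scale the raw feature (made-up rule).
    for r in records:
        if r.get("x") is not None:
            yield {"id": r["id"], "x": r["x"] / 100.0}


def evaluation(records):
    # Stand-in for a trained model: a fixed linear score (hypothetical weights).
    for r in records:
        yield {**r, "score": 2.0 * r["x"] - 0.5}


def postprocessing(records):
    # Turn the numeric score into a label for the sink.
    for r in records:
        yield {"id": r["id"], "label": "high" if r["score"] > 0 else "low"}


pipeline = compose(data_preparation, evaluation, postprocessing)

source = [{"id": 1, "x": 80}, {"id": 2, "x": None}, {"id": 3, "x": 10}]
sink = list(pipeline(iter(source)))
```

Because every stage has the same shape (records in, records out), a stage can be tested on its own or swapped for another without touching the rest of the pipeline.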

Page 13

Evaluation with Flink-JPMML

[Diagram labels: Source Operator (twice), Data Preparation, Flink-JPMML Operator, model.pmml, Sink Operator]

Small library that implements basic model evaluation operations on top of JPMML (Gitlab)

Page 14

“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”

- Yann LeCun

ML on Flink

Page 15
Page 16

FlinkML

What: Out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg …)

Status: early phase, slow development

Page 17

FlinkML

Pro: available out of the box, written with Flink API

Cons: reinvents the wheel, only a few algorithms, no model persistence

Page 18

Samsara

What: Linear algebra framework

Status: mature

Page 19

Samsara

Pro: generic algorithms with platform-specific bindings, skilled community

Cons: covers only a few use cases

Page 20

SAMOA

What: Online learning algorithm framework (VHT, AMR, …)

Status: early phase, complicated relationship with the industry

Page 21

SAMOA

Pro: many powerful generic online learning algorithms, backed by academics (MOA, Weka)

Cons: not production ready, academic focus
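For readers unfamiliar with the term, here is a rough sketch of what "online learning" means: the model is updated one example at a time and each example is seen only once. The sketch uses plain stochastic gradient descent on a linear model, which is a swapped-in textbook technique, not VHT or AMR and not SAMOA's API. The function names and the synthetic data are assumptions made for this example.

```python
def online_linear_regression(stream, learning_rate=0.05):
    """Yield (w, b) after each example; each example is seen exactly once."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y          # prediction error on this example
        w -= learning_rate * error * x   # gradient step for the slope
        b -= learning_rate * error       # gradient step for the intercept
        yield w, b


def examples(n):
    # Synthetic, noise-free stream generated from y = 3x + 1 (made-up data).
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 3.0 * x + 1.0


w, b = 0.0, 0.0
for w, b in online_linear_regression(examples(10000)):
    pass  # a real job would serve predictions while the model keeps updating
```

After enough examples the parameters approach the slope and intercept that generated the data, without the learner ever holding the whole dataset in memory.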

Page 22

ML on Flink: the future of the ecosystem

Page 23

Apache Beam

Programming model for data processing pipelines

● Streaming first, batch as a bounded stream

● Layered API: What, Where, When, How

● Platform agnostic: same program, different runners
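The "batch as a bounded stream" idea can be illustrated without Beam itself. In the plain-Python sketch below, the same pipeline logic consumes a finite list and an endless generator; `running_totals` and `unbounded_source` are hypothetical names, and nothing here is Beam's API.

```python
from itertools import count, islice


def running_totals(events):
    """Pipeline logic: emit the running sum per key after every event."""
    totals = {}
    for key, value in events:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]


# Batch: a bounded stream, the input simply ends.
batch = [("a", 1), ("b", 2), ("a", 3)]
batch_result = list(running_totals(batch))


def unbounded_source():
    # Hypothetical endless source alternating between two keys.
    for i in count():
        yield ("a" if i % 2 == 0 else "b", 1)


# Streaming: the same logic, but we can only ever observe a prefix.
stream_prefix = list(islice(running_totals(unbounded_source()), 4))
```

The logic in `running_totals` never asks whether its input ends, which is what allows one program to serve both the batch and the streaming case.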

Page 24

Apache Beam - Runners

● Flink

● Spark (Partial)

● Google Cloud Dataflow

● Plain Java

● Gearpump (WIP)

● Apex (WIP)

Page 25

BeamML: a runner-agnostic ML library

Page 26

FlinkML Roadmap

● More algorithms!

● Evaluation framework

● Persistence/export

● Online Learning Framework

Page 27

Proteus

Online Learning Platform - based on Flink

Source: Proteus’ website

Page 28

The role of Radicalbit

Page 29

Contributions

● Cassandra Connector

● Scala API extensions

● FlinkML (Linear Algebra Framework, MinHash)

● Akka Connector

Page 30

Our vision

Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput

To achieve this:

● Ambitious applications (aim for real-time services)

● Reliable distributed online learning (Proteus?)

● A Pipelining Framework (experiment fast, increase testability and modularity)

Page 31

Q&A

Page 32

THANKS!

Simone Robutti

Mail: [email protected]

Medium: @simone.robutti

Twitter: @SimoneRobutti, @weareradicalbit