Data intensive applications with Apache Flink - Simone Robutti, Radicalbit

Milan – July 13 2016

Data Intensive Applications with Apache Flink

Simone Robutti

Machine Learning Engineer at Radicalbit

@SimoneRobutti

Agenda

1. Brief Introduction to Apache Flink

○ Why

○ What

○ How

2. Machine Learning on Flink

○ Present landscape

○ Future of the Ecosystem

3. Closing notes on Radicalbit (shameless plug ahead)

100% Buzzword-free guaranteed

Big Data

Machine Intelligence

Web-scale

400x

It’s like the human brain

Exactly-once

Exactly-once

Why Flink (and not Spark/Storm/Samza...)

Because it’s

production-ready

streaming-first

low-latency

fault-tolerant

high-throughput

processing engine

Flink: what is it?

From Flink’s Documentation

Connectors and integrations

Flink’s Runtime


Flink’s DataFlow


Written by the user through DataSet/DataStream API

Compiled and optimized in the client

Flink’s DataFlow


The compiled job is translated to distributed tasks by the master and executed by workers
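The step from a compiled job to distributed tasks can be pictured with a toy sketch. This is plain Python, not Flink code: the Operator class, expand_to_tasks and assign_round_robin are hypothetical names, and the round-robin placement is a simplifying assumption rather than how Flink's scheduler actually works. It only shows each logical operator becoming one task per parallel instance, and those tasks being spread over workers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A logical operator in the user's dataflow (hypothetical type)."""
    name: str
    parallelism: int


def expand_to_tasks(plan):
    """Turn each logical operator into one task per parallel instance."""
    return [(op.name, i) for op in plan for i in range(op.parallelism)]


def assign_round_robin(tasks, workers):
    """Spread tasks over workers in round-robin order (toy scheduler)."""
    assignment = {w: [] for w in workers}
    for n, task in enumerate(tasks):
        assignment[workers[n % len(workers)]].append(task)
    return assignment


# A three-operator job where only the map step runs in parallel.
plan = [Operator("source", 1), Operator("map", 2), Operator("sink", 1)]
tasks = expand_to_tasks(plan)
assignment = assign_round_robin(tasks, ["worker-1", "worker-2"])
```

Here the "map" operator with parallelism 2 yields two tasks, so the job has four tasks in total, two per worker.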

Machine Learning on Flink

Ready and awesome for parallel ML

Work in progress for distributed ML

ML on Flink

Flink for Model Evaluation Pipelines

[Pipeline diagram: Source (x2), Data Preparation, Evaluation, Postprocessing, Sink]

Composable, modular Flink Operator
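A minimal sketch of what "composable, modular" can mean for an evaluation pipeline, written in plain Python rather than with the Flink API. Every name here (compose, prepare, evaluate, postprocess, run) is hypothetical, and the fixed linear score is only a stand-in for a trained model. Each stage consumes a stream of records and produces a new one, so stages can be swapped or tested in isolation.

```python
from typing import Callable, Iterable, Iterator

# A stage maps a stream of records to a stream of records.
Stage = Callable[[Iterable], Iterator]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single stage."""
    def pipeline(records: Iterable) -> Iterator:
        for stage in stages:
            records = stage(records)
        return iter(records)
    return pipeline


def prepare(records):
    # Hypothetical data preparation: scale a raw reading by 1/100.
    for r in records:
        yield {"id": r["id"], "x": r["raw"] / 100.0}


def evaluate(records):
    # Stand-in for a trained model: a fixed linear score.
    for r in records:
        yield {**r, "score": 2.0 * r["x"] - 0.5}


def postprocess(records):
    # Turn the numeric score into a decision.
    for r in records:
        yield {"id": r["id"], "alert": r["score"] > 0.0}


def run(source):
    """Drain the composed pipeline into an in-memory sink."""
    sink = []
    for out in compose(prepare, evaluate, postprocess)(source):
        sink.append(out)
    return sink
```

Because every stage has the same shape, replacing evaluate with a different model, or testing postprocess on hand-written scores, requires no change to the other stages.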

Evaluation with Flink-JPMML

[Diagram: Source Operator (x2), Data Preparation, Flink-JPMML Operator, Sink Operator, model.pmml]

Small library that implements basic model evaluation operations on top of JPMML (GitLab)

https://gitlab.com/radicalbit/Flink-JPMML

“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”

- Yann LeCun

ML on Flink

FlinkML

What: Out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg …)

Status: early phase, slow development

FlinkML

Pro: available out of the box, written with Flink API

Cons: reinvents the wheel, only a few algorithms, no model persistence

Samsara

What: Linear algebra framework

Status: mature

Samsara

Pro: generic algorithms with platform-specific bindings, skilled community

Cons: covers only a few use cases

SAMOA

What: Online learning algorithm framework (VHT, AMR, …)

Status: early phase, complicated relationship with the industry

SAMOA

Pro: many powerful generic online learning algorithms, backed by academics (MOA, Weka)

Cons: not production ready, academic focus

ML on Flink: the future of the ecosystem

Apache Beam

Programming model for data processing pipelines

● Streaming first, batch as a bounded stream

● Layered API: What, Where, When, How

● Platform agnostic: same program, different runners
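The "batch as a bounded stream" idea can be illustrated without any Beam code. The sketch below is plain Python with a hypothetical running_count function, not the Beam API: the same logic runs unchanged over a finite list (a batch) and over an endless generator (a stream), and only the way the output is consumed differs.

```python
import itertools
from typing import Iterable, Iterator


def running_count(events: Iterable[str]) -> Iterator[tuple]:
    """Emit (key, count so far) for every incoming event."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


# Bounded input: a "batch" is just a stream that ends.
batch = ["a", "b", "a"]
batch_result = list(running_count(batch))

# Unbounded input: the same logic, consumed incrementally.
endless = itertools.cycle(["a", "b"])
first_five = list(itertools.islice(running_count(endless), 5))
```

In this toy version the "runner" is simply whoever iterates over the result: list() drains the bounded case, while islice takes only as much of the unbounded case as it needs.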

Apache Beam - Runners

● Flink

● Spark (Partial)

● Google Cloud Dataflow

● Plain Java

● Gearpump (WIP)

● Apex (WIP)

BeamML: a runner-agnostic ML library

FlinkML Roadmap

● More algorithms!

● Evaluation framework

● Persistence/export

● Online Learning Framework

Proteus

Online Learning Platform - based on Flink

Source: Proteus’ website

http://www.proteus-bigdata.com/


The role of Radicalbit

Contributions

● Cassandra Connector

● Scala API extensions

● FlinkML (Linear Algebra Framework, MinHash)

● Akka Connector

Our vision

Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput

To achieve this:

● Ambitious applications (aim for real-time services)

● Reliable distributed online learning (Proteus?)

● A Pipelining Framework (experiment fast, increase testability and modularity)

THANKS!

Simone Robutti

Mail: [email protected]

Medium: @simone.robutti

Twitter: @SimoneRobutti, @weareradicalbit

