Distributed machine learning 101 using apache spark from a browser devoxx.be2015

Distributed Machine Learning using Apache Spark from the Browser Devoxx Belgium 2015, Antwerpen

Transcript of Distributed machine learning 101 using apache spark from a browser devoxx.be2015

Page 1: Distributed machine learning 101 using apache spark from a browser   devoxx.be2015

Distributed Machine Learning using Apache Spark from the Browser

Devoxx Belgium 2015, Antwerpen

Page 2:

Outline

● Distributed computing

● What is Machine Learning?

● Spark for machine learning?

● Spark MLlib by examples

● Spark and other libraries

● Wrap up

Page 3:

Data Fellas

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), Trainer Spark, Machine Learning

Page 4:

Distributed Computing: Why you must care, by Data Fellas

Andy Petrella & Xavier Tordoir

Page 5:

Computing

Traditionally, tasks are performed entirely on a single computer using three main resources:

Processing Power, Memory, Storage

Uba ga!

Page 6:

Computing

Hence performance is limited in time and space

Processing Power, Memory, Storage

TIME, SPACE

Oh no!

Page 7:

Distributing

Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.

The components interact with each other in order to achieve a common goal. [...]

Ref: https://en.wikipedia.org/wiki/Distributed_computing

Interesting

Page 8:

Consequences

The entire dataset cannot be accessed at once

Algorithms have to work on data partitions and with partial results

Oh no!

Page 9:

New resource!

Processing Power, Memory, Storage, Network

TIME, SPACE

The network will impact performance...

Damned

Page 10:

Distributing

[Diagram: four nodes, each with its own processing, memory and storage, connected by a network]

Oops did it again

Page 11:

Drawback: Partition

[Diagram: four nodes, each with its own processing, memory and storage, connected by a network]

Huh?

Page 12:

Drawback: Partition

[Diagram: the same four-node cluster, with one node (processing, memory, storage) lost: BOOM]

Hey, you sank my node!

Page 13:

Advantage: Elastic scaling

[Diagram: four nodes, each with its own processing, memory and storage, connected by a network]

What if this cluster happens to not be big enough?

Ouch, my rack

Page 14:

Advantage: Elastic scaling

[Diagram: two clusters of four nodes (processing, memory, storage), each with its own network, joined by a network]

That’s more reasonable

Page 15:

What about HPC?

HPC: computationally intensive applications

Model: specialized hardware (CPU/GPU) and network

They are orchestrated by a scheduler that gathers their computing power and memory.

Yeah! What about?

Page 16:

What about HPC?

Drawbacks:

● Costs and upgrades by large blocks

● Decoupled storage

Storage latency = no streaming / no iteration

Got No Money and NO time

Page 17:

Iterate

Why process data if not to model?

Machine learning: iterative (streaming & batch)

Data is aggregated in the form of a model (parameters)

Data changes little, model is small

Do that baby!

Page 18:

Iterate

[Diagram: four nodes with processing and memory, and five storage units]

Moving lots of data again and again...

you gotta be kidding

Page 19:

Summary

Distributed computing allows cost-effective parallelism

Efficiency requires distributed storage, colocated with the processing units

What about programming models?

Interesting

Page 20:

Distributed storage

HDFS: Apache implementation of Google FS

● Natural fit for distributed storage

● Works as a service

Other chunked sources...

● Apache Cassandra, S3, Tachyon, ...

Partitions!

Page 21:

Distributed storage

[Diagram: put /data/f256.txt (256 MB, replication factor 2) on a cluster with one Name Node and Data Nodes 1 to 4]

Split da

Page 22:

Distributed storage

[Diagram: put /data/f256.txt (256 MB, replication factor 2); the file is split into four 64 MB blocks; one Name Node and Data Nodes 1 to 4]

Split da

Page 23:

Distributed storage

[Diagram: same setup; a 64 MB block is put as /data/f256.txt/part-r-00000 on a Data Node]

Everywhere

Page 24:

Distributed storage

[Diagram: same setup; the four 64 MB blocks are spread over the Data Nodes]

everywhere

Page 25:

Distributed storage

[Diagram: same setup; the 64 MB blocks are replicated across the Data Nodes (replication factor 2)]

Replicate
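Note (illustrative sketch, not from the talk): the splitting and replication shown on the storage slides can be mimicked in a few lines of plain Python. The function names and the round-robin placement rule are assumptions made for this example; HDFS applies its own placement policy.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(n_blocks, data_nodes, replication=2):
    """Assign every block to `replication` distinct nodes (round-robin, hypothetical rule)."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


# The example from the slides: a 256 MB file, 64 MB blocks, replication factor 2.
blocks = split_into_blocks(256)
nodes = ["Data Node 1", "Data Node 2", "Data Node 3", "Data Node 4"]
placement = place_replicas(len(blocks), nodes, replication=2)

for block, holders in placement.items():
    print(f"block {block} ({blocks[block]} MB) -> {holders}")
```

With these numbers the file yields four blocks and eight stored copies, so losing a single node never removes the only copy of a block.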

Page 26:

Map Reduce: High Level Execution

[Diagram: four data parts]

Load the data

The rocket’s base

Page 27:

Map Reduce: High Level Execution

[Diagram: four data parts → four mappers]

Map and Pair

The rocket’s engines

Page 28:

Map Reduce: High Level Execution

[Diagram: four data parts → four mappers → GroupByKey]

Shuffle Pairs using Keys

The rocket’s trunk

Page 29:

Map Reduce: High Level Execution

[Diagram: four data parts → four mappers → GroupByKey → three reducers]

Values per key are Reduced

The rocket’s cockpit

Page 30:

Map Reduce: High Level Execution

[Diagram: four data parts → four mappers → GroupByKey → three reducers → results]

We collect the results

The rocket’s tip
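Note (illustrative sketch, not from the talk): the load, map, shuffle, reduce and collect stages of the preceding slides can be imitated on one machine with plain Python. Word count is used as the example job; the function names and the sample data are assumptions made here.

```python
from collections import defaultdict


def mapper(part):
    """Map: emit a (word, 1) pair for every word of one data part."""
    return [(word.lower(), 1) for word in part.split()]


def shuffle(pairs):
    """Shuffle: group the values of all pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)


def map_reduce(parts):
    """Run the three stages in sequence and collect the results."""
    pairs = [pair for part in parts for pair in mapper(part)]
    groups = shuffle(pairs)
    return dict(reducer(key, values) for key, values in groups.items())


# Four data parts, as in the diagram.
parts = ["spark is fast", "spark is distributed", "map and reduce", "reduce and collect"]
counts = map_reduce(parts)
print(counts)
```

On a cluster each mapper would run next to its own data part and only the shuffled pairs would cross the network.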

Page 31:

Map Reduce: High Level Execution

[Diagram: four data parts → four mappers → GroupByKey → three reducers → results]

The whole#!

To infinity and beyond!

Page 32:

Map Reduce: Matrix-Vector Product

[Equation image not preserved; only "=" survives]

How about word count?

Page 33:

Map Reduce: Matrix-Vector Product

[Equation image not preserved; only "=" survives]

Back to school...

Page 34:

Map Reduce: Matrix-Vector Product

[Equation image not preserved; only "=" survives]

Wait, that’s maths

Page 35:

Map Reduce: Matrix-Vector Product

Store matrix as ordered

Vector V loaded in memory as ordered

Map function:

Each matrix element mapped on a product

Where is the RAT?

Page 36:

Map Reduce: Matrix-Vector Product

MAP

OK … I TAKE OVER

Page 37:

Map Reduce: Matrix-Vector Product

REDUCE

just a sum …
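Note (illustrative sketch, not from the talk): the formulas on the matrix-vector slides were images and are not preserved. The sketch below assumes the matrix is stored as (row, column, value) triples and the vector as a list held in memory; both representations and the function name are assumptions made for this example.

```python
from collections import defaultdict


def matrix_vector_map_reduce(entries, v):
    """Compute M.v from (i, j, m_ij) triples with a map, a shuffle and a reduce."""
    # Map: each matrix element (i, j, m_ij) becomes the pair (i, m_ij * v_j).
    pairs = [(i, m_ij * v[j]) for i, j, m_ij in entries]
    # Shuffle: group the partial products by row index i.
    groups = defaultdict(list)
    for i, product in pairs:
        groups[i].append(product)
    # Reduce: "just a sum" of the products of each row.
    return {i: sum(products) for i, products in groups.items()}


# The 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored as triples.
entries = [(0, 0, 1), (0, 1, 2), (0, 2, 3), (1, 0, 4), (1, 1, 5), (1, 2, 6)]
v = [1, 0, 2]
result = matrix_vector_map_reduce(entries, v)
print(result)
```

Because every triple is mapped independently, the matrix can be partitioned arbitrarily across nodes; only the vector has to be available to every mapper.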

Page 38:

Map Reduce: Summary

Simple abstraction of computations: Map and Reduce

Using a simple abstraction of data: key-value pairs

Summary == Reduce?

Page 39:

Map Reduce: Summary

Brings transparent:

● parallelization

● distribution

● fault tolerance

So what?

Page 40:

Why Apache Spark: MapReduce on steroids

Uses

● Functional paradigm

● Lazy computations

Creates dependencies between task definitions and optimizes execution

Man… Finally!
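Note (illustrative sketch, not from the talk): lazy computation means that transformations are only recorded, and nothing runs until a result is requested. The class below is plain Python written for this example; it is not Spark's API, and the names LazyDataset, map, filter and collect are hypothetical.

```python
class LazyDataset:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Nothing is computed here: the step is only appended to the plan.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # All recorded steps are applied in a single pass over the data.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_square(x):
    calls.append(x)  # remember when the function actually runs
    return x * x


pipeline = LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(traced_square)
calls_before_collect = list(calls)  # still empty: nothing has run yet
result = pipeline.collect()
print(calls_before_collect, result)
```

Knowing the whole chain of steps before running anything is what lets an engine fuse them into one pass, as collect() does here.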

Page 41:

Why Apache Spark: MapReduce on steroids

Can cache data in memory or on the local file system.

Far less IO or network.

Almost forgot that one

Page 42:

What is Machine Learning? Why you must care, by Data Fellas

Andy Petrella & Xavier Tordoir

Page 43:

you cannot prove a vague theory is wrong

[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.

—Richard Feynman [1964]

What is Machine Learning? Science with data

Surely You’re Joking Mr…

Page 44:

What is Machine Learning? Overview

● Modelling without first principle…

2nd law neither...

Page 45:

What is Machine Learning? Overview

● Modelling without first principle…

Machine learning you do with a Learning Machine

Take that Newton...

Page 46:

What is Machine Learning? Overview

● Modelling without first principle…

● Modelling dependencies from the data

With some “a priori” knowledge

Page 47:

What is Machine Learning? Learning Machine…

● What is the problem?

● Hypothesis?

● Data Generation Process?

● Collection and Preprocessing

● Interpretation

Learning Machine

You still need a domain expert…

Like me!

Page 48:

What is Machine Learning? Overview

● Estimate dependencies from data

Machine learning you do with a Learning Machine

[Diagram: Samples Generator, System and Learning Machine, with variables x, y and z ?]

Page 49:

What is Machine Learning? Overview

● Estimate dependencies from data

● Minimize a risk functional over the set given the data

[Diagram: Samples Generator, System and Learning Machine, with variables x, y and z ?]

I like them so much in LaTeX2e

Page 50:

What is Machine Learning? Supervised learning

● Regression: continuous output

○ Risk = Prediction error

● Classification: categorical output

○ Risk = Probability of misclassification

L(y, f(x, w)) = (y - f(x, w))^2 …

WTF?
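Note (illustrative sketch, not from the talk): for both supervised settings the risk can be estimated on a sample as an average loss. The squared loss is used for regression and the 0-1 loss for classification; the function names and the toy data are assumptions made for this example.

```python
def squared_loss(y, prediction):
    """Regression loss: squared prediction error."""
    return (y - prediction) ** 2


def zero_one_loss(y, prediction):
    """Classification loss: 1 for a misclassification, 0 otherwise."""
    return 0 if y == prediction else 1


def empirical_risk(loss, ys, predictions):
    """Average loss over a sample, an estimate of the true risk."""
    return sum(loss(y, p) for y, p in zip(ys, predictions)) / len(ys)


regression_risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
classification_risk = empirical_risk(
    zero_one_loss, ["a", "b", "a", "b"], ["a", "b", "b", "b"]
)
print(regression_risk, classification_risk)
```

With the 0-1 loss the empirical risk is simply the fraction of misclassified samples.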

Page 51:

What is Machine Learning? Unsupervised learning: no output

● Clustering

○ Risk = Error Distortion (distances to center)

● Density estimation (probability densities)

I like clusters, especially with roasted nuts

Page 52:

What is Machine Learning? Bias - Variance, Regression illustration

Playtime!

Notebook!

Page 53:

What is Machine Learning? Inductive principle

In principle, it should work.

An inductive principle tells what to do

Finite Data → Inductive principle → Model

Page 54:

What is Machine Learning? Inductive principle

In principle, it should work.

Finite Data → Empirical risk minimization → Model

• Function class not defined

• Loss not defined

• Optimization procedure not defined

Page 55:

What is Machine Learning? Inductive principle

In principle, it should work.

Finite Data → Regularization → Model

• Control on penalty strength

• Penalize complexity/a priori knowledge

Page 56:

What is Machine Learning? Inductive principle

In principle, it should work.

Finite Data → Early stopping rules → Model

• Iterative optimization

• Depends on initial params and algorithm

• Used for neural networks

• Penalize along a path

Page 57:

What is Machine Learning? Inductive principle

In principle, it should work.

Finite Data → Structural Risk → Model

• Analytic estimates of empirical risk

Page 58:

What is Machine Learning? Inductive principle

In principle, it should work.

Finite Data → Bayesian inference → Model

• Explicit a priori probabilities

• Learn mixtures

• Hard multidimensional integrations…

Page 59:

What is Machine Learning? Curse of dimensionality

In principle, it should work.

We want to control complexity

Finite Data → Model

• Smoothness constraint in a neighborhood

Page 60:

What is Machine Learning? Curse of dimensionality

In principle, it should work.

Data density is key…

Finite Data in a Space → Inductive principle → Model Complexity

Page 61:

What is Machine Learning? Curse of dimensionality

In principle, it should work.

Data density is key… e.g.

● 1-D, 0.1 m interval => 10 points/m

● 2-D, 0.1 m interval => 100 points/m^2

● d-D, 0.1 m interval => 10^d points/m^d

Same smoothness requires lots of data in high dimensional spaces
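Note (illustrative sketch, not from the talk): the point counts on this slide follow from placing one point every 0.1 units along each axis of a unit cube. The function name and the chosen dimensions are assumptions made for this example.

```python
def points_needed(dimensions, spacing=0.1, side=1.0):
    """Number of grid points that keep one point per `spacing` along every axis."""
    per_axis = round(side / spacing)
    return per_axis ** dimensions


density = {d: points_needed(d) for d in (1, 2, 3, 10)}
for d, n in density.items():
    print(f"{d}-D: {n} points")
```

Ten dimensions already call for ten billion points to keep the same spacing that ten points give on a line.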

Page 62:

What is Machine Learning? Curse of dimensionality

In principle, it should work.

Sampling is hard… e.g.

● 1-D, 10% sample => 0.1 x size

● 2-D, 10% sample => 0.31 x size

● 10-D, 10% sample => 0.79 x size

=> local estimates from samples are difficult
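Note (illustrative sketch, not from the talk): the figures on this slide are consistent with asking how long the edge of a sub-cube must be to hold a given fraction of uniformly spread data, which is fraction^(1/d). The function name is an assumption made for this example.

```python
def edge_fraction(sample_fraction, dimensions):
    """Edge length, as a fraction of the full range, of a cube holding `sample_fraction` of uniform data."""
    return sample_fraction ** (1.0 / dimensions)


edges = {d: edge_fraction(0.1, d) for d in (1, 2, 10)}
for d, edge in edges.items():
    print(f"{d}-D: a 10% sample spans {edge:.3f} x size along each axis")
```

In ten dimensions a "local" neighbourhood holding 10% of the data spans roughly 79% of every axis, so it is no longer local.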

Page 63:

What is Machine Learning? Curse of dimensionality

In principle, it should work.

Data points are closer to edges…

Each data point “sees” itself as an outlier

=> Predictions require lots of extrapolation

Page 64:

What is Machine Learning? Curse of dimensionality

In principle, it should work.

Samples must increase exponentially

… or model complexity must be controlled

Page 65:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Data driven penalized risk minimization
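Note (illustrative, not from the talk): the formula on this slide was an image and is not preserved. Assuming a loss L(y, f(x, w)) for a model f with parameters w, a penalty Ω(w) and a penalty strength λ (all symbols chosen here for illustration), penalized risk minimization is commonly written as:

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr) + \lambda \, \Omega(w)
```

The first term is the empirical risk on the n available samples, the second penalizes complex models, and λ sets the balance between the two.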

Page 66:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Loss functions

Page 67:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Regularizers

L2 (ridge)

L1 (lasso)

Elastic net
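Note (illustrative sketch, not from the talk): the three regularizers can be written as penalty functions of the weight vector. The scaling used here (a factor 1/2 on the squared L2 norm and a mixing parameter alpha for the elastic net) is one common convention and is an assumption of this example, as are the function names.

```python
def l1_penalty(w):
    """L1 (lasso): sum of absolute weights, which favours sparse models."""
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    """L2 (ridge): half the squared Euclidean norm, which shrinks all weights."""
    return 0.5 * sum(wi * wi for wi in w)


def elastic_net_penalty(w, alpha=0.5):
    """Elastic net: a convex mix of L1 and L2, with alpha in [0, 1]."""
    return alpha * l1_penalty(w) + (1.0 - alpha) * l2_penalty(w)


w = [3.0, -4.0]
penalties = (l1_penalty(w), l2_penalty(w), elastic_net_penalty(w, alpha=0.5))
print(penalties)
```

Setting alpha to 1 recovers the lasso penalty and setting it to 0 recovers the ridge penalty.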

Page 68:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Optimization (here comes the fun…)

Which algorithm to find a minimum in a distributed fashion?

Convex optimization methods (linear methods)

● Gradient descent

● Stochastic gradient descent

● Limited-memory BFGS

Page 69:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Optimization (here comes the fun…)

Gradient descent

● Efficient steps but needs to read through the whole data

Page 70:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Optimization (here comes the fun…)

Stochastic gradient descent

● Samples data for each step but converges very slowly
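Note (illustrative sketch, not from the talk): the difference between the two methods is how much data each step reads. The toy problem below fits y ≈ w·x by least squares; the data, step sizes, step counts and function names are assumptions made for this example.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def gradient_descent(data, steps=100, rate=0.05):
    w = 0.0
    for _ in range(steps):
        w -= rate * gradient(w, data)  # every step reads the whole dataset
    return w


def stochastic_gradient_descent(data, steps=2000, rate=0.01, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        w -= rate * gradient(w, [rng.choice(data)])  # every step reads one sample
    return w


# Noise-free samples of y = 2x, so the optimum is w = 2.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_gd = gradient_descent(data)
w_sgd = stochastic_gradient_descent(data)
print(w_gd, w_sgd)
```

Each stochastic step is far cheaper, but in this setup it takes many more of them with a smaller step size to reach the same answer.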

Page 71:

What is Machine Learning? Regularization in more detail

In principle, it should work.

Optimization (here comes the fun…)

L-BFGS

● Second-derivative estimates by keeping several previous gradients in memory

● Fast convergence

Page 72:

What is Machine Learning? Model selection

all work and no play makes Jack a dull boy

Model Complexity control: Resampling

Selecting the right lambda…

… to minimize prediction risk

Page 73:

What is Machine Learning? Model selection

Enough theory boy!

The universe

Page 74:

What is Machine Learning? Model selection

Enough theory boy!

Our data

Page 75:

What is Machine Learning? Model selection

Enough theory boy!

Our data

Learning Set (70%)

validation set (30%)

Page 76:

What is Machine Learning? Model selection

Enough theory boy!

Our data

Learning Set (70%)

validation set (30%)
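Note (illustrative sketch, not from the talk): a 70% / 30% split into a learning set and a validation set can be done by shuffling and cutting. The function name, the seed and the toy data are assumptions made for this example.

```python
import random


def train_validation_split(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and cut them into a learning set and a validation set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


learning, validation = train_validation_split(range(10))
print(learning, validation)
```

The model is fitted on the learning set only; the validation set estimates how it behaves on data it has not seen.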

Page 77:

What is Machine Learning? Model selection

Nice flag

K-Fold

K = 4
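Note (illustrative sketch, not from the talk): with K = 4 the data is cut into four folds and each fold serves once as validation set while the other three are used for learning. The interleaved fold assignment and the function name are assumptions made for this example.

```python
def k_fold_indices(n_samples, k=4):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield training, validation


splits = list(k_fold_indices(8, k=4))
for training, validation in splits:
    print("train", training, "validate", validation)
```

Averaging the validation error over the folds gives a more stable estimate of prediction risk than a single split, which helps when selecting lambda.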

Page 78:

MLlib: A library to learn them all...

Page 79:

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

I play BIG!

Page 80:

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● SQL & DataFrames

● Streaming

● Graph Processing

● Machine Learning

With all colors!

Page 81:

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● Optimize memory usage (FAST)

● Optimize computation execution (Complex tasks)

● Easy programming model

Let the brain do the work...

Page 82:

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● Interactive

● @ any scale

Breed mixin’

Page 83:

MLlib: Spark

In principle, it should work.

Intro to Spark… notebook

Page 84:

MLlib: Spark

In principle, it should work.

Intro to Spark… notebook

So we’ve seen…

● Basics of Spark data manipulation

● MLlib data representation

● Linear regression

● Regularization and k-fold cross validation

What else is there?

Page 85:

MLlib: Spark

In principle, it should work.

● Basic statistics

● Classification and regression

● Collaborative filtering

● Clustering

● Dimensionality reduction

● Feature extraction and transformation

● Frequent pattern mining

● Evaluation metrics

● …

http://spark.apache.org/docs/latest/mllib-guide.html

Page 86:

MLlib for Genomics? ADAM + MLlib (mixture K-Means + RF)

Playtime!

Some more examples

Page 87:

Genomics: The data

So… that’s what separates us huh?

Page 88:

Genomics: The data

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M genotypes per sample (features)

Please, don’t mind the colors...

Page 89:

Genomics: The data

1000 genomes: http://www.1000genomes.org/

~1000 samples

Few samples => Machine Learning

Woooow, really, you must be kidding me… ahahahahah

Page 90:

Genomics: The data

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing

Oh… damned… hum huh

Page 91:

MLlib for Genomics? ADAM + MLlib (mixture K-Means + RF)

Playtime!

Notebook!

Page 92:

What else? Old and new players are now integrating with Spark (and Scala)

Page 93:

Spark ML Pipeline

Higher API

Integrated with DataFrame

Offers an API to create shareable/reusable pipeline constructions (PCA, …)

Page 94:

Spark ML Keystone

Higher API

Like Pipeline, but type safe, with a chainable API (andThen-friendly)

Page 95:

H2O

Higher API

Memory implementation of “Map-Reduce”

Highly optimised structures for the JVM

Blazing fast convergent models

Page 96:

DL4J Spark ML

Higher API

Page 97:

Intel Data Analytics Acceleration Library

DAAL (Intel)

Higher API

Page 98:

System ML (IBM)

Higher API

Declarative large-scale machine learning

Optimization based on data and cluster characteristics

Page 99:

Nitro's Extremely Exciting Deep Learning Engine

MLP, RBM, LSTM and more to come

Needle

Higher API

Page 100:

H2O Sparkling & Deep Learning on genomics

Learning structures using the H2O Deep Learning algorithm, integrated in Spark, in a notebook, on an EC2 cluster

http://h2o.ai/product/sparkling-water/

water in fire

Page 101:

H2O Sparkling: in-memory data exchange

I remember things better when I remember them twice.

Page 102:

Wrap up: what we hope you have learned

Page 103:

Distributed computing: For machine learning

I am ready.

Data is exploding

Distributed Technologies are maturing

Scale up and down, interactivity

Page 104:

Distributed ML on Spark: What is available

Spark MLlib, H2O, DL4J, Needle

EC2, GCE, URIKA-XA

Cloudera, MapR, Hortonworks

HDFS, C*, Kafka

What are my options by the way?

Page 105:

Shar3 (Data Fellas)

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

[Diagram labels: each step is tagged with the roles involved (ops, data, sci, web)]

Page 106:

Shar3 (Data Fellas)

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Project Generator

Micro Service / Binary format

Schema for output

Metadata

Page 107:

That’s all folks: Thanks for listening/staying

Poke us on Twitter or via http://data-fellas.guru

@DataFellas @Shar3_Fellas @SparkNotebook

@Xtordoir & @Noootsab

Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)

Check also @TypeSafe: http://t.co/o1Bt6dQtgH