Distributed Machine Learning 101 using Apache Spark from the Browser

Posted on 28-Jul-2015

Distributed Machine Learning 101 using Apache Spark from the Browser

Scala Days 2015, Amsterdam

Outline

● What is Machine Learning?
  ◦ Variables, Variance and Bias
  ◦ Model selection
● Why Spark for machine learning?
● Spark MLlib by examples
  ◦ Genomics clustering and classification example
● What for the future?
  ◦ Streaming
  ◦ Human Learning

Andy Petrella

Maths, Scala, Apache Spark

Spark Notebook, Trainer, Data Banana

Xavier Tordoir

Physics, Bioinformatics

Scala, Spark

You cannot prove a vague theory is wrong

[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.

—Richard Feynman [1964]

What is Machine Learning? Science with data

Surely You’re Joking Mr…

● Modelling without first principle…

What is Machine Learning? Overview

2nd law neither...

Machine learning you do with a Learning Machine

Take that Newton...

● Modelling without first principle…

● Modelling dependencies from the data

What is Machine Learning? Overview

With some “a priori” knowledge

● What is the problem?
● Hypothesis?
● Data Generation Process?
● Collection and Preprocessing
● Interpretation

What is Machine Learning? Learning Machine…

You still need a domain expert…

Like me!

Learning Machine

● Estimate dependencies from data

What is Machine Learning? Overview

Machine learning you do with a Learning Machine

[Diagram: Samples Generator, System and Learning Machine, with signals labelled x, y and z ?]

● Estimate dependencies from data

● Minimize a risk functional over the set of candidate functions, given the data

What is Machine Learning? Overview

I like them so much in LaTeX2e

● Regression: continuous output

○ Risk = Prediction error

● Classification: categorical output

○ Risk = Probability of misclassification

What is Machine Learning? Supervised learning

$L(y, f(x, w)) = (y - f(x, w))^2$ …

WTF?
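The two risks on the supervised learning slide can be made concrete with a small sketch. It assumes a squared-error loss for regression and a 0/1 loss for classification; the function names and toy values are illustrative and not taken from the talk.

```python
def squared_loss(y, y_hat):
    """Regression loss: squared prediction error for one sample."""
    return (y - y_hat) ** 2


def zero_one_loss(y, y_hat):
    """Classification loss: 1 for a misclassified sample, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0


def empirical_risk(loss, targets, predictions):
    """Average loss over the observed samples (an estimate of the true risk)."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    total = sum(loss(y, p) for y, p in zip(targets, predictions))
    return total / len(targets)


# Toy data, made up for illustration only.
regression_risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
classification_risk = empirical_risk(
    zero_one_loss, ["a", "b", "a", "b"], ["a", "a", "a", "b"]
)
```

With the 0/1 loss, the empirical risk is simply the observed fraction of misclassified samples.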

What is Machine Learning? Unsupervised learning: no output

I like clusters, especially with roasted nuts

● Clustering

○ Risk = Error Distortion (distances to center)

● Density estimation (probability densities)
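As a rough illustration of the clustering risk, the sketch below computes a distortion as the mean squared distance from each point to its nearest center. The choice of squared Euclidean distance, the function names and the toy points are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distortion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    nearest = (min(squared_distance(p, c) for c in centers) for p in points)
    return sum(nearest) / len(points)


# Two well separated groups of points (toy data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

risk_one = distortion(points, one_center)
risk_two = distortion(points, two_centers)
```

Adding centers can only lower this risk, which is why the number of clusters is itself a model complexity parameter.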

What is Machine Learning? Bias-Variance, Regression illustration

Playtime!

Notebook!
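The notebook demo itself is not part of this transcript. As a stand-in, here is a minimal simulation of the bias-variance trade-off for regression. Everything in it is an assumption chosen for illustration: the true function x², the noise level, and the two models (a constant predictor and a 1-nearest-neighbour predictor).

```python
import random


def true_f(x):
    """Hypothetical data generating function."""
    return x * x


def sample_training_set(rng, n=30, noise=0.1):
    """Draw n noisy observations of true_f on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise)) for x in xs]


def constant_model(train, x0):
    """Very simple model: predict the mean of the training outputs."""
    return sum(y for _, y in train) / len(train)


def nearest_neighbour_model(train, x0):
    """Very flexible model: predict the output of the closest training input."""
    return min(train, key=lambda xy: abs(xy[0] - x0))[1]


def bias2_and_variance(model, x0, runs=500, seed=0):
    """Refit on many independent training sets; measure the prediction at x0."""
    rng = random.Random(seed)
    preds = [model(sample_training_set(rng), x0) for _ in range(runs)]
    mean_pred = sum(preds) / runs
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / runs
    return bias2, variance


x0 = 0.9
const_bias2, const_var = bias2_and_variance(constant_model, x0)
nn_bias2, nn_var = bias2_and_variance(nearest_neighbour_model, x0)
```

In this setup the constant model is stable across training sets but systematically wrong at x0, while the nearest-neighbour model is right on average but changes more from one training set to the next.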

What is Machine Learning? Model selection

All work and no play makes Jack a dull boy

Model Complexity control: Resampling

Because we only see one sample of the universe

Replay it!

Spark for Machine Learning? Model selection

Enough theory boy!

f0, f1, f2

Spark for Machine Learning? Model selection

Enough theory boy!

f0, f1, f2, F3

More Samples

Spark for Machine Learning? Model selection

Enough theory boy!

f0, f1, f2, F3

Bigger Samples

Spark for Machine Learning? Model selection

Nice flag

K-Fold

K = 4
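K-fold resampling with K = 4 could be sketched as follows. This is plain Python rather than Spark code, and the helper names (`k_fold_indices`, `cross_validated_risk`, `fit_mean`) are hypothetical. It assumes at least K samples and contiguous folds; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cross_validated_risk(xs, ys, fit, loss, k=4):
    """Average validation risk over k folds; each fold is held out once."""
    fold_risks = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_risk = sum(loss(ys[i], model(xs[i])) for i in validation)
        fold_risks.append(fold_risk / len(validation))
    return sum(fold_risks) / len(fold_risks)


def fit_mean(xs, ys):
    """Toy learner: always predict the mean training output."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


xs = list(range(8))
ys = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0]

folds = list(k_fold_indices(len(xs), 4))
risk = cross_validated_risk(xs, ys, fit_mean, lambda y, p: (y - p) ** 2, k=4)
```

Each sample is used for validation exactly once, so the averaged risk uses all the data without ever scoring a model on samples it was fitted on.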

Genomics: The data

So… that’s what separates us huh?

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Genomics: The data

Please, don’t mind the colors...

1000 genomes: http://www.1000genomes.org/

~1000 samples

Few samples => Machine Learning

Genomics: The data

Woooow, really, you must be kidding me… ahahahahah

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing
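A back-of-the-envelope calculation shows why these numbers point to distributed computing. The sample and genotype counts are the approximate figures from the slide; one byte per genotype is an assumption made only for this estimate.

```python
samples = 1_000                      # ~1000 samples (from the slide)
genotypes_per_sample = 30_000_000    # ~30M genotypes per sample (from the slide)
bytes_per_genotype = 1               # assumption: one byte per encoded genotype

total_genotypes = samples * genotypes_per_sample
total_gigabytes = total_genotypes * bytes_per_genotype / 1e9

# Features vastly outnumber samples.
features_per_sample = genotypes_per_sample / samples

print(f"{total_genotypes:.1e} genotypes, about {total_gigabytes:.0f} GB")
print(f"{features_per_sample:.0f} features for every sample")
```

Under that encoding assumption the raw matrix is on the order of tens of gigabytes, with tens of thousands of features for every sample.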

Genomics: The data

Oh… damned… hum huh

Data continues to flow

Models must be trained continuously

=> Streaming Machine learning algorithms

Models must be validated

=> Batch machine learning

→ ƛambda ML
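The combination of continuous training and batch validation can be sketched in a few lines. This is a plain Python stand-in, not Spark Streaming code: the `OnlineLinearModel` class, the synthetic data (y = 2x + 1 plus noise) and the learning rate are all assumptions made for illustration.

```python
import random


class OnlineLinearModel:
    """y ~ w * x + b, updated one mini-batch at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, batch):
        """Streaming side: one gradient step per incoming sample."""
        for x, y in batch:
            error = self.predict(x) - y
            self.w -= self.learning_rate * error * x
            self.b -= self.learning_rate * error


def mean_squared_error(model, data):
    """Batch side: validate the current model on held-out data."""
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(42)


def make_batch(n):
    """Synthetic mini-batch standing in for data arriving from a stream."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)) for x in xs]


holdout = make_batch(200)
model = OnlineLinearModel()
risk_before = mean_squared_error(model, holdout)

for _ in range(50):              # 50 mini-batches flow in
    model.update(make_batch(20))

risk_after = mean_squared_error(model, holdout)
```

The model never sees the whole data set at once; the held-out batch plays the role of the periodic validation step.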

What else? Streaming

Lambada?

Learning probabilistic models

Not only learning which features are important...

but also learning the interactions that effectively explain the observations

What else? Probabilistic Programming

I’ll probably program too

That’s all folks

Roooaaar

Q / Option[A] / beers

THANKS!

Xavier Tordoir

@xtordoir

Andy Petrella

@noootsab

http://data-fellas.guru

https://github.com/andypetrella/spark-notebook/

Frank Nothaft

Matt Massie

Matt Gianni

Venkat Krishnamurthy