Distributed Machine Learning 101 using Apache Spark from the Browser

Scala Days 2015, Amsterdam

Page 1: Distributed machine learning 101 using apache spark from the browser

Distributed Machine Learning 101 using Apache Spark from the Browser

Scala Days 2015, Amsterdam

Page 2: Distributed machine learning 101 using apache spark from the browser

● What is Machine Learning?

◦ Variables, Variance and Bias

◦ Model selection

● Why Spark for machine learning?

● Spark MLlib by examples

◦ Genomics clustering and classification example

● What about the future?

◦ Streaming

◦ Human Learning

Outline

Page 3: Distributed machine learning 101 using apache spark from the browser

Andy Petrella

Maths, Scala, Apache Spark

Spark Notebook, Trainer, Data Banana

Xavier Tordoir

Physics, Bioinformatics

Scala, Spark

Page 4: Distributed machine learning 101 using apache spark from the browser

you cannot prove a vague theory is wrong

[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.

—Richard Feynman [1964]

What is Machine Learning? Science with data

Surely You’re Joking Mr…

Page 5: Distributed machine learning 101 using apache spark from the browser

● Modelling without first principles…

What is Machine Learning? Overview

2nd law neither...

Page 6: Distributed machine learning 101 using apache spark from the browser

● Modelling without first principles…

What is Machine Learning? Overview

Machine learning you do with a Learning Machine

Take that Newton...

Page 7: Distributed machine learning 101 using apache spark from the browser

● Modelling without first principles…

● Modelling dependencies from the data

What is Machine Learning? Overview

With some “a priori” knowledge

Page 8: Distributed machine learning 101 using apache spark from the browser

● What is the problem?

● Hypothesis?

● Data Generation Process?

● Collection and Preprocessing

● Interpretation

What is Machine Learning? Learning Machine…

You still need a domain expert…

Like me!

Learning Machine

Page 9: Distributed machine learning 101 using apache spark from the browser

● Estimate dependencies from data

What is Machine Learning? Overview

Machine learning you do with a Learning Machine

[Diagram: Samples Generator, System and Learning Machine, with signals labelled x, y and "z ?"]

Page 10: Distributed machine learning 101 using apache spark from the browser

● Estimate dependencies from data

● Minimize a risk functional over the set of candidate functions, given the data

What is Machine Learning? Overview

I like them so much in LaTeX2e

[Diagram: Samples Generator, System and Learning Machine, with signals labelled x, y and "z ?"]
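
The risk functional itself did not survive in the transcript. A common way to write it, using the symbols L, y, f, x and w that appear on the supervised learning slide, is the expected loss under the unknown joint distribution of inputs and outputs, approximated in practice by the average loss over the observed samples. The symbols p(x, y), n and R_emp are assumptions introduced here for illustration, not notation taken from the slides:

```latex
R(w) = \int L\bigl(y, f(x, w)\bigr)\, p(x, y)\, dx\, dy
\qquad
R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr)
```

Learning then amounts to picking the parameters w that make R_emp(w) small while keeping the model simple enough that R(w) stays small too.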

Page 11: Distributed machine learning 101 using apache spark from the browser

● Regression: continuous output

○ Risk = Prediction error

● Classification: categorical output

○ Risk = Probability of misclassification

What is Machine Learning? Supervised learning

$L(y, f(x, w)) = (y - f(x, w))^2$ …

WTF?
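
The two risks named on this slide can be sketched in a few lines of plain Scala. This is only an illustration under assumptions: the object name `EmpiricalRisk` and its helpers are hypothetical, and the risk is estimated as an average over observed (target, prediction) pairs rather than as a true expectation.

```scala
object EmpiricalRisk {
  // Squared loss for regression: (y - prediction)^2
  def squaredLoss(y: Double, prediction: Double): Double = {
    val d = y - prediction
    d * d
  }

  // 0-1 loss for classification: 1 when the predicted label is wrong, else 0
  def zeroOneLoss[A](label: A, predicted: A): Double =
    if (label == predicted) 0.0 else 1.0

  // Empirical risk: average loss over the observed (target, prediction) pairs
  def risk[A](pairs: Seq[(A, A)])(loss: (A, A) => Double): Double = {
    require(pairs.nonEmpty, "need at least one sample")
    pairs.map { case (y, p) => loss(y, p) }.sum / pairs.size
  }
}
```

With the squared loss the result is the mean squared prediction error; with the 0-1 loss it is the observed fraction of misclassified samples, which estimates the probability of misclassification.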

Page 12: Distributed machine learning 101 using apache spark from the browser

What is Machine Learning? Unsupervised learning: no output

I like clusters, especially with roasted nuts

● Clustering

○ Risk = Distortion error (distances to cluster centers)

● Density estimation (probability densities)
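
As a rough sketch of the clustering risk above, the distortion can be computed as the mean squared distance from each point to its closest center. The names `Distortion`, `sqDist` and `nearest` are hypothetical, and this plain-Scala version runs on local collections only; Spark MLlib ships its own distributed KMeans, which this does not try to reproduce.

```scala
object Distortion {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum

  // Index of the center closest to p
  def nearest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Distortion: mean squared distance from each point to its closest center
  def distortion(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => sqDist(p, centers(nearest(p, centers)))).sum / points.size
}
```

A clustering algorithm such as k-means tries to place the centers so that this number is as small as possible for a given number of clusters.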

Page 13: Distributed machine learning 101 using apache spark from the browser

What is Machine Learning? Bias-Variance, regression illustration

Playtime!

Notebook!
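
The notebook used in the talk is not part of this transcript. As a stand-in, here is a minimal sketch of the bias-variance trade-off in plain Scala. Everything in it is an assumption made for illustration: the ground truth sin(2πx), the noise level 0.3, the 20 training points per draw, and the two models (a rigid one that always predicts the mean target, and a flexible 1-nearest-neighbour one).

```scala
import scala.util.Random

object BiasVariance {
  // Assumed ground truth for the illustration
  def truth(x: Double): Double = math.sin(2 * math.Pi * x)

  // Draw n noisy training samples (x, y) with x uniform in [0, 1]
  def sample(n: Int, noise: Double, rng: Random): Seq[(Double, Double)] =
    Seq.fill(n) {
      val x = rng.nextDouble()
      (x, truth(x) + noise * rng.nextGaussian())
    }

  // Rigid model: ignores x and always predicts the mean training target
  def meanModel(train: Seq[(Double, Double)]): Double => Double = {
    val m = train.map(_._2).sum / train.size
    _ => m
  }

  // Flexible model: predicts the target of the nearest training point (1-NN)
  def nearestModel(train: Seq[(Double, Double)]): Double => Double =
    x => train.minBy { case (xi, _) => math.abs(xi - x) }._2

  // Squared bias and variance of the prediction at x0 over many resampled training sets
  def biasVariance(fit: Seq[(Double, Double)] => Double => Double,
                   x0: Double, trials: Int, seed: Long): (Double, Double) = {
    val rng = new Random(seed)
    val preds = Seq.fill(trials)(fit(sample(20, 0.3, rng))(x0))
    val mean = preds.sum / trials
    val variance = preds.map(p => (p - mean) * (p - mean)).sum / trials
    val bias = mean - truth(x0)
    (bias * bias, variance)
  }
}
```

Calling `BiasVariance.biasVariance(BiasVariance.meanModel, 0.25, 500, 42L)` and the same with `nearestModel` lets you compare the two: the rigid model should show the larger squared bias, the flexible one the larger variance.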

Page 14: Distributed machine learning 101 using apache spark from the browser

What is Machine Learning? Model selection

all work and no play makes Jack a dull boy

Model Complexity control: Resampling

Because we only see one sample of the universe

Replay it!

Page 15: Distributed machine learning 101 using apache spark from the browser

Spark for Machine Learning? Model selection

Enough theory boy!

f0, f1, f2

Page 17: Distributed machine learning 101 using apache spark from the browser

Spark for Machine Learning? Model selection

Enough theory boy!

f0, f1, f2, f3

More Samples

Page 19: Distributed machine learning 101 using apache spark from the browser

Spark for Machine Learning? Model selection

Enough theory boy!

f0, f1, f2, f3

Bigger Samples

Page 21: Distributed machine learning 101 using apache spark from the browser

Spark for Machine Learning? Model selection

Nice flag

K-Fold

K = 4
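
K-fold resampling can be sketched as a pure index split: shuffle the sample indices, deal them into K folds, and use each fold once for validation while training on the rest. The `KFold.split` helper below is a hypothetical plain-Scala illustration, not code from the talk; for RDDs, Spark MLlib offers a comparable helper, `MLUtils.kFold`.

```scala
import scala.util.Random

object KFold {
  // Split indices 0 until n into k (training, validation) pairs.
  // Each index lands in exactly one validation fold.
  def split(n: Int, k: Int, seed: Long): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    val shuffled = new Random(seed).shuffle((0 until n).toVector)
    // Deal indices round-robin so fold sizes differ by at most one
    val folds = (0 until k).map { f =>
      shuffled.zipWithIndex.collect { case (idx, pos) if pos % k == f => idx }
    }
    folds.indices.map { f =>
      val validation = folds(f)
      val training = folds.indices.filter(_ != f).flatMap(i => folds(i))
      (training, validation)
    }
  }
}
```

With K = 4 as on the slide, `KFold.split(n, 4, seed)` yields four train/validation pairs; averaging the validation risk over the four gives the score used to compare model complexities.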

Page 22: Distributed machine learning 101 using apache spark from the browser

Genomics: The data

So… that’s what separates us huh?

Page 23: Distributed machine learning 101 using apache spark from the browser

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Genomics: The data

Please, don’t mind the colors...

Page 24: Distributed machine learning 101 using apache spark from the browser

1000 genomes: http://www.1000genomes.org/

~1000 samples

Few samples => Machine Learning

Genomics: The data

Woooow, really, you must be kidding me… ahahahahah

Page 25: Distributed machine learning 101 using apache spark from the browser

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing

Genomics: The data

Oh… damned… hum huh
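
A quick back-of-the-envelope calculation shows why both conclusions hold at once. The sample and genotype counts come from the slide; the one-byte-per-genotype encoding is purely an assumption for illustration, and the object name `GenomicsScale` is hypothetical.

```scala
object GenomicsScale {
  val samples: Long = 1000L                 // ~1000 samples (from the slide)
  val genotypesPerSample: Long = 30000000L  // ~30M genotypes per sample (from the slide)
  val bytesPerGenotype: Long = 1L           // assumption: one byte per genotype call

  // Total number of genotype values across all samples
  val totalGenotypes: Long = samples * genotypesPerSample

  // Raw size in decimal gigabytes under the one-byte assumption
  val totalGigabytes: Double = totalGenotypes * bytesPerGenotype / 1e9
}
```

That is about 3 × 10^10 values, roughly 30 GB under this assumption, while each sample still carries around 30,000 times more features than there are samples.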

Page 26: Distributed machine learning 101 using apache spark from the browser

Data continues to flow

Models must be trained continuously

=> Streaming Machine learning algorithms

Models must be validated

=> Batch machine learning

→ ƛambda ML

What else? Streaming

Lambada?
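
The split between continuous training and batch validation can be sketched with a toy one-parameter linear model. This is a plain-Scala illustration under assumptions: the model y ≈ w · x, the learning rate, and the names `OnlineLinearModel`, `update`, `train` and `mse` are all hypothetical. Spark MLlib provides streaming counterparts of some algorithms, which this sketch does not use.

```scala
object OnlineLinearModel {
  // One gradient step on a mini-batch for the model y ≈ w * x
  // (gradient of half the mean squared error with respect to w)
  def update(w: Double, batch: Seq[(Double, Double)], learningRate: Double): Double = {
    val gradient = batch.map { case (x, y) => (w * x - y) * x }.sum / batch.size
    w - learningRate * gradient
  }

  // Batch validation: mean squared error of the current model on held-out data
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => (w * x - y) * (w * x - y) }.sum / data.size

  // Streaming training: update the model after each incoming mini-batch
  def train(batches: Iterator[Seq[(Double, Double)]], w0: Double, learningRate: Double): Double =
    batches.foldLeft(w0)((w, batch) => update(w, batch, learningRate))
}
```

The `train` fold plays the role of the streaming layer, and `mse` on a held-out set plays the role of the batch layer that decides whether the continuously updated model is still good enough.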

Page 27: Distributed machine learning 101 using apache spark from the browser

Learning probabilistic models

Not only learning which features are important...

but also Learning interactions effectively explaining observations

What else? Probabilistic Programming

I’ll probably program too

Page 28: Distributed machine learning 101 using apache spark from the browser

That’s all folks

Roooaaar

Page 29: Distributed machine learning 101 using apache spark from the browser

Q / Option[A] / beers

THANKS!

Xavier Tordoir

@xtordoir

Andy Petrella

@noootsab

http://data-fellas.guru

https://github.com/andypetrella/spark-notebook/

Frank Nothaft

Matt Massie

Matt Gianni

Venkat Krishnamurthy