Distributed Machine Learning 101 using Apache Spark from the Browser

  • Date posted: 28-Jul-2015

Transcript of Distributed Machine Learning 101 using Apache Spark from the Browser

1. Distributed Machine Learning 101 using Apache Spark from the Browser. Scala Days 2015, Amsterdam.
2. Outline. What is Machine Learning? Variables, Variance and Bias; Model selection. Why Spark for machine learning? Spark MLlib by examples; Genomics clustering and classification example. What for the future? Streaming; Human Learning.
3. Andy Petrella: Maths, Scala, Apache Spark, Spark Notebook, Trainer, Data Banana. Xavier Tordoir: Physics, Bioinformatics, Scala, Spark.
4. What is Machine Learning? Science with data. "you cannot prove a vague theory is wrong [...] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences." Richard Feynman [1964]. Surely You're Joking Mr
5. What is Machine Learning? Overview. Modelling without first principle. 2nd law neither...
6. What is Machine Learning? Overview. Modelling without first principle. Machine learning you do with a Learning Machine. Take that Newton...
7. What is Machine Learning? Overview. Modelling without first principle. Modelling dependencies from the data. With some a priori knowledge.
8. What is Machine Learning? Learning Machine. What is the problem? Hypothesis? Data Generation Process? Collection and Preprocessing. Interpretation. You still need a domain expert. Like me! Learning Machine.
9. What is Machine Learning? Overview. Estimate dependencies from data. Machine learning you do with a Learning Machine. Diagram: Samples Generator, System, Learning Machine; x, y, z, ?
10. What is Machine Learning? Overview. Estimate dependencies from data. Minimize a risk functional over the set given the data. I like them so much in LaTeX2e. Diagram: Samples Generator, System, Learning Machine; x, y, z, ?
11. What is Machine Learning? Supervised learning. Regression: continuous output; Risk = Prediction error. Classification: categorical output; Risk = Probability of misclassification. L(y, f(x, w)) = (y - f(x, w))^2. WTF?
12. What is Machine Learning? Unsupervised learning: no output. Clustering: Risk = Error Distortion (distances to center). Density estimation (probability densities). I like clusters, especially with roasted nuts.
13. What is Machine Learning? Bias - Variance, Regression illustration. Playtime! Notebook!
14. What is Machine Learning? Model selection. Model Complexity control: Resampling. Because we only see one sample of the universe. Replay it! All work and no play makes Jack a dull boy.
15. Spark for Machine Learning? Model selection. Enough theory boy! f0 f1 f2
16. Spark for Machine Learning? Model selection. Enough theory boy! f0 f1 f2
17. Spark for Machine Learning? Model selection. Enough theory boy! f0 f1 f2 f3. More Samples.
18. Spark for Machine Learning? Model selection. Enough theory boy! f0 f1 f2 f3. More Samples.
19. Spark for Machine Learning? Model selection. Enough theory boy! f0 f1 f2 f3. Bigger Samples.
20. Spark for Machine Learning? Model selection. Enough theory boy! f0 f1 f2 f3. Bigger Samples.
21. Spark for Machine Learning? Model selection. K-Fold, K = 4. Nice flag.
22. Genomics. The data. So that's what separates us huh?
23. Genomics. The data. 1000 genomes (http://www.1000genomes.org/): ~1000 samples; ~30M genotypes per sample (features). Please, don't mind the colors...
24. Genomics. The data. 1000 genomes (http://www.1000genomes.org/): ~1000 samples. Few samples => Machine Learning. Woooow, really, you must be kidding me ahahahahah
25. Genomics. The data. 1000 genomes (http://www.1000genomes.org/): ~1000 samples; ~30M genotypes per sample (features). Few samples => Machine Learning. Lots of Data => Distributed computing. Oh damned hum huh
26. What else? Streaming. Data continues to flow. Models must be trained continuously => Streaming machine learning algorithms. Models must be validated => Batch machine learning. Lambda ML. Lambada?
27. What else? Probabilistic Programming. Learning probabilistic models. Not only learning which features are important... but also learning interactions effectively explaining observations. I'll probably program too.
28. That's all folks. Roooaaar
29. Q / Option[A] / beers. THANKS! Xavier Tordoir @xtordoir, Andy Petrella @noootsab. http://data-fellas.guru https://github.com/andypetrella/spark-notebook/ Frank Nothaft Matt Massie Matt Gianni Venkat Krishnamurthy
30. Look at the Code. The browser part is powered by the Spark Notebook. The 3 notebooks are: mllib/Variance - Bias.snb; adam/Clustering Genomes using Adam with LDA.snb; adam/Classifying Genomes using Adam with RF.snb. So grab a Spark Notebook on http://spark-notebook.io/. Yeaaaaah!
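
The model selection slides (resampling, K-Fold with K = 4) and the squared-error risk from the supervised learning slide can be illustrated with a small sketch. This is not the code from the talk's notebooks, which run on Spark; it is a minimal single-machine Python version using only the standard library. The helper names (`k_fold_splits`, `fit_constant`, `fit_line`, `cv_risk`) and the synthetic data are assumptions made up for this illustration.

```python
import random


def k_fold_splits(n_samples, k=4, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def squared_loss(y, y_hat):
    """Squared-error loss for one sample: (y - f(x))^2."""
    return (y - y_hat) ** 2


def fit_constant(xs, ys):
    """Lowest-complexity model: always predict the mean of the training outputs."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def cv_risk(fit, xs, ys, k=4, seed=0):
    """Estimate the prediction risk of a fitting procedure by k-fold resampling."""
    fold_risks = []
    for train, test in k_fold_splits(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [squared_loss(ys[i], model(xs[i])) for i in test]
        fold_risks.append(sum(errors) / len(errors))
    return sum(fold_risks) / len(fold_risks)


# Synthetic data (hypothetical): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]

risk_constant = cv_risk(fit_constant, xs, ys)
risk_line = cv_risk(fit_line, xs, ys)
```

Comparing `risk_constant` with `risk_line` is the model selection step: the candidate with the lower cross-validated risk is kept. In a distributed setting the per-fold fits are independent of each other, which is one reason this kind of resampling is a natural fit for a cluster.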