Practical Machine Learning Pipelines with MLlib

Transcript of Practical Machine Learning Pipelines with MLlib

1. Practical Machine Learning Pipelines with MLlib
Joseph K. Bradley, March 18, 2015, Spark Summit East 2015

2. About Spark MLlib
Started in the UC Berkeley AMPLab; shipped with Spark 0.8. Currently (Spark 1.3): contributions from 50+ organizations and 100+ individuals. Good coverage of algorithms: classification, regression, clustering, recommendation, feature extraction and selection, frequent itemsets, statistics, linear algebra.

3. MLlib's Mission
How can we move beyond this list of algorithms and help users develop real ML workflows? MLlib's mission is to make practical machine learning easy and scalable: capable of learning from large-scale datasets, and easy to build machine learning applications with.

4. Outline
ML workflows; Pipelines; Roadmap

5. Outline
ML workflows; Pipelines; Roadmap

6. Example: Text Classification
Goal: given a text document, predict its topic. Example document: "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something..." Label: 1 = about science, 0 = not about science. In general, features can be text, images, vectors, ...; labels can be CTR, inches of rainfall, ... Dataset: 20 Newsgroups, from the UCI KDD Archive.

7. Training & Testing
Training: given labeled data, an RDD of (features, label) pairs (e.g. "Subject: Re: Lexan Polish?..." with label 0, "Subject: RIPEM FAQ..." with label 1), learn a model. Testing/Production: given new unlabeled data, an RDD of features (e.g. "Subject: Apollo Training...", "Subject: A demo of Nonsense..."), use the model to make predictions.

8. Example ML Workflow
Training: Load data (labels + plain text) -> Extract features (labels + feature vectors) -> Train model (labels + predictions) -> Evaluate. Pain point: you create many RDDs and must explicitly unzip & zip them:

    val labels: RDD[Double] = data.map(_.label)
    val features: RDD[Vector]
    val predictions: RDD[Double]
    labels.zip(predictions).map { case (label, prediction) => if (label == prediction) ... }

9. Example ML Workflow
Pain point: the workflow is written as a script -- not modular, and the workflow is difficult to re-use.

10. Example ML Workflow
Testing/Production: Load new data (plain text) -> Extract features (feature vectors) -> Predict using model (predictions) -> Act on predictions. An almost identical workflow to training.

11. Example ML Workflow
Pain point: parameter tuning. A key part of ML, it involves training many models, for different splits of the data and for different sets of parameters.

12. Pain Points
Create & handle many RDDs and data types; write as a script; tune parameters. Enter... Pipelines! (in Spark 1.2 & 1.3)

13. Outline
ML workflows; Pipelines; Roadmap

14. Key Concepts
DataFrame: the ML dataset. Abstractions: Transformers, Estimators & Evaluators. Parameters: API & tuning.

15. DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL. Named columns with types: label: Double, text: String, words: Seq[String], features: Vector, prediction: Double. Example rows:

    label | text         | words           | features
    0     | "This is..." | [This, is, ...] | [0.5, 1.2, ...]
    0     | "When we..." | [When, ...]     | [1.9, -0.8, ...]

16. DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL: named columns with types, plus a domain-specific language for manipulating them:

    # Select science articles
    sciDocs = data.filter(label == 1)
    # Scale labels
    data(label) * 0.5
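The DSL lines above are slide shorthand. As a rough illustration only, a minimal Scala sketch of the same two operations against the Spark 1.3 DataFrame API might look like the following; the DataFrame name `data` and its `label`/`text` columns are carried over from the running example, and the helper function name is hypothetical:

    import org.apache.spark.sql.DataFrame

    // Assumes a DataFrame `data` with the schema from slide 15:
    // label: Double, text: String, words: Seq[String], features: Vector
    def dslExamples(data: DataFrame): Unit = {
      // Select science articles (label == 1)
      val sciDocs = data.filter(data("label") === 1)

      // Scale labels by 0.5, keeping the text column alongside
      val scaled = data.select(data("label") * 0.5, data("text"))

      sciDocs.show()
      scaled.show()
    }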
17. DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL: named columns with types and a domain-specific language, scaling to big data. Shipped with Spark 1.3, with APIs for Python, Java & Scala (+ R in development) and integration with Spark SQL (data import/export, internal optimizations). This addresses the pain point of creating & handling many RDDs and data types.

18. Abstractions
The training workflow again: Load data -> Extract features -> Train model -> Evaluate. Each step maps to an abstraction.

19. Abstraction: Transformer
def transform(DataFrame): DataFrame. Extracting features is a Transformer: it turns a DataFrame with (label: Double, text: String) into one with (label: Double, text: String, features: Vector).

20. Abstraction: Estimator
def fit(DataFrame): Model. Training a model is an Estimator: e.g. LogisticRegression fits a DataFrame with (label: Double, text: String, features: Vector) and returns a Model.

21. Abstraction: Evaluator
def evaluate(DataFrame): Double. An Evaluator takes a DataFrame with (label: Double, text: String, features: Vector, prediction: Double) and returns a metric: accuracy, AUC, MSE, ...

22. Abstraction: Model
A Model is a type of Transformer: def transform(DataFrame): DataFrame. In testing/production, predicting with the model turns (text: String, features: Vector) into (text: String, features: Vector, prediction: Double), and we then act on the predictions.

23. (Recall) Abstraction: Estimator
def fit(DataFrame): Model, e.g. LogisticRegression fitting a DataFrame with (label: Double, text: String, features: Vector).

24. Abstraction: Pipeline
A Pipeline is a type of Estimator: def fit(DataFrame): Model. It wraps the training workflow (load data -> extract features -> train model) and, given a DataFrame with (label: Double, text: String), produces a PipelineModel.

25. Abstraction: PipelineModel
A PipelineModel is a type of Transformer: def transform(DataFrame): DataFrame. In testing/production it takes (text: String), runs feature extraction and prediction, and produces (text: String, features: Vector, prediction: Double) to act on.

26. Abstractions: Summary
DataFrame holds the data throughout. Transformers extract features and predict using the model, Estimators train models, and Evaluators evaluate them, in both the training and testing workflows.

27. Demo
Training pipeline: Load data -> Tokenizer (Transformer) -> HashingTF (Transformer) -> LogisticRegression (Estimator) -> BinaryClassificationEvaluator. The data schema grows along the way: label: Double, text: String, words: Seq[String], features: Vector, prediction: Double.

28. Demo
The same pipeline, showing how it addresses the pain point of writing the workflow as a script.

29. Parameters
A standard API: typed, with defaults, built-in documentation, and autocomplete.

    > hashingTF.numFeatures
    org.apache.spark.ml.param.IntParam = numFeatures: number of features (default: 262144)
    > hashingTF.setNumFeatures(1000)
    > hashingTF.getNumFeatures

30. Parameter Tuning
Given an Estimator, a parameter grid, and an Evaluator, find the best parameters. Example grid: lr.regParam in {0.01, 0.1, 0.5}, hashingTF.numFeatures in {100, 1000, 10000}. A CrossValidator wraps the pipeline (Tokenizer -> HashingTF -> LogisticRegression) and the BinaryClassificationEvaluator.

31. Parameter Tuning
This addresses the pain point of tuning parameters.
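The demo's actual code is not reproduced in this transcript, so the following is only an approximate Scala sketch of the pipeline and cross-validation described on slides 27-31, built from the standard spark.ml API. It assumes `training` and `test` DataFrames with "label" and "text" columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Tokenizer and HashingTF are Transformers; LogisticRegression is an Estimator.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression()

    // A Pipeline is itself an Estimator: fit() returns a PipelineModel.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Parameter grid from slide 30.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
      .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
      .build()

    // CrossValidator searches the grid using the evaluator's metric.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)

    val cvModel = cv.fit(training)             // training: DataFrame with "label" and "text"
    val predictions = cvModel.transform(test)  // adds a "prediction" column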
32. Pipelines: Recap
Inspirations: scikit-learn (+ Spark DataFrame and Param API) and MLBase (Berkeley AMPLab), with ongoing collaborations. Pain points and their solutions: create & handle many RDDs and data types -> DataFrame; write as a script -> Abstractions; tune parameters -> Parameter API. Also: Python, Scala & Java APIs; schema validation; User-Defined Types*; feature metadata*; multi-model training optimizations*. (*Groundwork done; full support WIP.)

33. Outline
ML workflows; Pipelines; Roadmap

34. Roadmap
spark.mllib: the primary ML package. spark.ml: the high-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3). Near future: feature attributes, feature transformers, more algorithms under the Pipeline API. Farther ahead: ideas from AMPLab MLBase (auto-tuning models), SparkR integration.

35. Thank you!
Outline: ML workflows; Pipelines (DataFrame, Abstractions, Parameter tuning); Roadmap. Spark documentation: http://spark.apache.org/ Pipelines blog post: https://databricks.com/blog/2015/01/07