MLlib sparkmeetup_8_6_13_final_reduced

of 141 /141
Evan Sparks and Ameet Talwalkar UC Berkeley UC Berkeley base ML

Embed Size (px)

description

 

Transcript of MLlib sparkmeetup_8_6_13_final_reduced

  • 1. EvanSparksandAmeetTalwalkar UCBerkeley UC Berkeley baseML baseML M ML M

2. ThreeConvergingTrends 3. BigData ThreeConvergingTrends 4. Distributed Compu2ng BigData ThreeConvergingTrends 5. Distributed Compu2ng BigData ThreeConvergingTrends Machine Learning 6. Distributed Compu2ng BigData ThreeConvergingTrends Machine Learning MLbase 7. Vision MLlib MLI MLOpAmizer ReleasePlan 8. Problem:ScalableimplementaAons dicultforMLDevelopers Me S ML Contract + Code ML 9. Problem:ScalableimplementaAons dicultforMLDevelopers Me S ML Contract + Code ML 10. Problem:ScalableimplementaAons dicultforMLDevelopers Me S ML Contract + Code ML 11. Toomany algorithms Problem:MLisdicult forEndUsers 12. Toomany algorithms Toomany knobs Problem:MLisdicult forEndUsers 13. Toomany algorithms Toomany knobs Problem:MLisdicult forEndUsers Dicultto debug 14. Toomany algorithms Toomany knobs Problem:MLisdicult forEndUsers Dicultto debug Doesntscale 15. Toomany algorithms Toomany knobs Problem:MLisdicult forEndUsers Dicultto debug Reliable Fast Accurate Provable Doesntscale 16. MLExperts SystemsExpertsMLbase 17. 1. EasyscalableMLdevelopment(MLDevelopers) 2. User-friendlyMLatscale(EndUsers) MLExperts SystemsExpertsMLbase 18. 1. EasyscalableMLdevelopment(MLDevelopers) 2. User-friendlyMLatscale(EndUsers) Alongtheway,wegaininsightintodataintensive compu2ng MLExperts SystemsExpertsMLbase 19. MatlabStack 20. MatlabStack Single Machine 21. Lapack MatlabStack Single Machine Lapack:low-levelFortranlinearalgebralibrary 22. Lapack Matlab Interface MatlabStack Single Machine Lapack:low-levelFortranlinearalgebralibrary MatlabInterface Higher-levelabstrac2onsfordataaccess/processing Moreextensivefunc2onalitythanLapack LeveragesLapackwheneverpossible 23. Lapack Matlab Interface MatlabStack Single Machine Lapack:low-levelFortranlinearalgebralibrary MatlabInterface Higher-levelabstrac2onsfordataaccess/processing Moreextensivefunc2onalitythanLapack LeveragesLapackwheneverpossible SimilarstoriesforRandPython 24. MLbaseStack Lapack Matlab Interface Single Machine 25. MLbaseStack Runtime(s) Lapack Matlab Interface Single Machine 26. MLbaseStack Runtime(s)Spark Spark:clustercompu=ngsystemdesignedforitera=vecomputa=on Lapack Matlab Interface Single Machine 27. MLbaseStack Runtime(s) MLlib Spark Spark:clustercompu=ngsystemdesignedforitera=vecomputa=on MLlib:low-levelMLlibraryinSpark CallablefromScala,Java Lapack Matlab Interface Single Machine 28. MLbaseStack Runtime(s) MLlib MLI Spark Spark:clustercompu=ngsystemdesignedforitera=vecomputa=on MLlib:low-levelMLlibraryinSpark CallablefromScala,Java MLI:API/plaHormforfeatureextrac=onandalgorithmdevelopment Includeshigher-levelfunc2onalitywithfasterdevcyclethanMLlib Lapack Matlab Interface Single Machine 29. MLbaseStack Runtime(s) MLlib MLI ML Optimizer Spark Spark:clustercompu=ngsystemdesignedforitera=vecomputa=on MLlib:low-levelMLlibraryinSpark CallablefromScala,Java MLI:API/plaHormforfeatureextrac=onandalgorithmdevelopment Includeshigher-levelfunc2onalitywithfasterdevcyclethanMLlib MLOpAmizer:automatesmodelselec=on SolvesasearchproblemoverfeatureextractorsandalgorithmsinMLI Lapack Matlab Interface Single Machine 30. MLlib MLI ML Optimizer EndUser MLbaseStackStatus Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server . result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves 31. MLlib MLI ML Optimizer EndUser MLbaseStackStatus Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server . result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves 32. MLlib MLI ML Optimizer EndUser MLbaseStackStatus Goal 1: Summer Release Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server . result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves 33. MLlib MLI ML Optimizer EndUser MLbaseStackStatus Goal 1: Summer Release Goal 2: Winter Release Spark ML Developer Meta-Data Statistics User Declarative ML Task ML Contract + Code Master Server . result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library DMX Runtime DMX Runtime DMX Runtime DMX Runtime LLP PLP MasterSlaves 34. Example:MLlib 35. Example:MLlib Goal:Classica2onoftextle 36. Example:MLlib Goal:Classica2onoftextle Featurizedatamanually 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=1 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 14 .train(featurizedData) 15 } Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and ML 37. Example:MLlib Goal:Classica2onoftextle Featurizedatamanually CallsMLlibsLRfunc2on 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=1 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 14 .train(featurizedData) 15 } Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and ML 38. Example:MLI 39. Example:MLI Usebuilt-infeatureextrac2onfunc2onality 1 def main(args: Array[String]) { 2 val mc = new MLContext("local", "MLILR") 3 4 //Read in file from HDFS 5 val rawTextTable = mc.csvFile(args(0), Seq("class","text")) 6 7 //Run feature extraction 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12) 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 40. Example:MLI Usebuilt-infeatureextrac2onfunc2onality MLILogis2cRegressionleveragesMLlib 1 def main(args: Array[String]) { 2 val mc = new MLContext("local", "MLILR") 3 4 //Read in file from HDFS 5 val rawTextTable = mc.csvFile(args(0), Seq("class","text")) 6 7 //Run feature extraction 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12) 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 41. Example:MLI Usebuilt-infeatureextrac2onfunc2onality MLILogis2cRegressionleveragesMLlib Extensions: Embedincross-valida2onrou2ne Usedierentfeatureextractors/algorithmsor writenewones 1 def main(args: Array[String]) { 2 val mc = new MLContext("local", "MLILR") 3 4 //Read in file from HDFS 5 val rawTextTable = mc.csvFile(args(0), Seq("class","text")) 6 7 //Run feature extraction 8 val classes = rawTextTable(??, "class") 9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000)) 10 val featureizedTable = classes.zip(ngrams) 11 12 //Classify the data using Logistic Regression. 13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12) 14 } 1 def main(args: Array[String]) { 2 val sc = new SparkContext("local", "SparkLR") 3 4 //Load data from HDFS 5 val data = sc.textFile(args(0)) //RDD[String] 6 7 //User is responsible for formatting/featurizing/normalizing their RDD! 8 val featurizedData: RDD[(Double,Array[Double])] = processData(data) 9 10 //Train the model using MLlib. 11 val model = new LogisticRegressionLocalRandomSGD() 12 .setStepSize(0.1) 13 .setNumIterations(50) 42. Example:MLOp2mizer varX=load(text_le,2to10) vary=load(text_le,1) var(fn-model,summary)=doClassify(X,y) Userdeclara2velyspeciestask MLOp2mizersearchesthroughMLI 43. Vision MLlib MLI MLOpAmizer ReleasePlan 44. Easeofuse Performance, Scalability LayoftheLand 45. Matlab,R x Easeofuse Performance, Scalability LayoftheLand 46. Matlab,R x Easeofuse Performance, Scalability Mahout x LayoftheLand 47. Matlab,R x Easeofuse Performance, Scalability GraphLab,VW x Mahout x LayoftheLand 48. Matlab,R x Easeofuse Performance, Scalability GraphLab,VW x Mahout x LayoftheLand MLlib x 49. Logis2cRegression,LinearSVM(+L1,L2) LinearRegression(+Lasso,Ridge) Alterna2ngLeastSquares K-Means SGD,ParallelGradient MLlib ClassicaAon: Regression: CollaboraAveFiltering: Clustering: OpAmizaAonPrimiAves: 50. Logis2cRegression,LinearSVM(+L1,L2) LinearRegression(+Lasso,Ridge) Alterna2ngLeastSquares K-Means SGD,ParallelGradient MLlib ClassicaAon: Regression: CollaboraAveFiltering: Clustering: OpAmizaAonPrimiAves: IncludedwithinSparkcodebase UnlikeMahout/Hadoop PartofSpark0.8release Con2nuedsupportviaSparkproject 51. MLlibPerformance 52. WallAme:elapsed2metoexecutetask MLlibPerformance 53. WallAme:elapsed2metoexecutetask Weakscaling xproblemsizeperprocessor ideally:constantwall2measwegrowcluster MLlibPerformance 54. WallAme:elapsed2metoexecutetask Weakscaling xproblemsizeperprocessor ideally:constantwall2measwegrowcluster Strongscaling xtotalproblemsize ideally:linearspeedupaswegrowcluster MLlibPerformance 55. WallAme:elapsed2metoexecutetask Weakscaling xproblemsizeperprocessor ideally:constantwall2measwegrowcluster Strongscaling xtotalproblemsize ideally:linearspeedupaswegrowcluster EC2Experiments m2.4xlargeinstances,upto32machineclusters MLlibPerformance 56. Logis2cRegression-WeakScaling 57. Logis2cRegression-WeakScaling Fulldataset:200Kimages,160Kdensefeatures 58. Logis2cRegression-WeakScaling Fulldataset:200Kimages,160Kdensefeatures Similarweakscaling 0 5 10 15 20 25 30 0 2 4 6 8 10 relativewalltime # machines MLbase VW Ideal Fig. 6: Weak scaling for logistic regression 15 20 25 30 35 speedup MLbase VW Ideal MLlib 59. Logis2cRegression-WeakScaling Fulldataset:200Kimages,160Kdensefeatures Similarweakscaling MLlibwithinafactorof2ofVWswall=me MLbase VW Matlab 0 1000 2000 3000 4000 walltime(s) n=6K, d=160K n=12.5K, d=160K n=25K, d=160K n=50K, d=160K n=100K, d=160K n=200K, d=160K MLlib0 5 10 15 20 25 30 0 2 4 6 8 10 relativewalltime # machines MLbase VW Ideal Fig. 6: Weak scaling for logistic regression 15 20 25 30 35 speedup MLbase VW Ideal MLlib 60. Logis2cRegression-StrongScaling 61. Logis2cRegression-StrongScaling FixedDataset:50Kimages,160Kdensefeatures 62. Logis2cRegression-StrongScaling FixedDataset:50Kimages,160Kdensefeatures MLlibexhibitsbeTerscalingproper=es 0 5 10 15 20 25 30 0 # machines ig. 6: Weak scaling for logistic regression 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 # machines speedup MLbase VW Ideal 8: Strong scaling for logistic regression System Lines of Code MLbase 32 GraphLab 383 MLlib 63. Logis2cRegression-StrongScaling FixedDataset:50Kimages,160Kdensefeatures MLlibexhibitsbeTerscalingproper=es MLlibfasterthanVWwith16and32machines MLbase VW Matlab 0 1000 wa Fig. 5: Walltime for weak scaling for logistic regressi MLbase VW Matlab 0 200 400 600 800 1000 1200 1400 walltime(s) 1 Machine 2 Machines 4 Machines 8 Machines 16 Machines 32 Machines Fig. 7: Walltime for strong scaling for logistic regress with respect to computation. In practice, we see comp scaling results as more machines are added. In MATLAB, we implement gradient descent inste SGD, as gradient descent requires roughly the same nu of numeric operations as SGD but does not require an loop to pass over the data. It can thus be implemented MLlib 0 5 10 15 20 25 30 0 # machines ig. 6: Weak scaling for logistic regression 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 # machines speedup MLbase VW Ideal 8: Strong scaling for logistic regression System Lines of Code MLbase 32 GraphLab 383 MLlib 64. ALS-Wall2me 65. ALS-Wall2me Dataset:ScaledversionofNeHlixdata(9Xinsize) Cluster:9machines 66. ALS-Wall2me Dataset:ScaledversionofNeHlixdata(9Xinsize) Cluster:9machines System WallAme(seconds) Matlab 15443 Mahout 4206 GraphLab 291 MLlib 481 67. ALS-Wall2me Dataset:ScaledversionofNeHlixdata(9Xinsize) Cluster:9machines System WallAme(seconds) Matlab 15443 Mahout 4206 GraphLab 291 MLlib 481 68. ALS-Wall2me Dataset:ScaledversionofNeHlixdata(9Xinsize) Cluster:9machines MLlibanorderofmagnitudefasterthanMahout MLlibwithinfactorof2ofGraphLab System WallAme(seconds) Matlab 15443 Mahout 4206 GraphLab 291 MLlib 481 69. DeploymentConsidera2ons 70. DeploymentConsidera2ons VowpalWabbit,GraphLab Dataprepara=onspecictoeachprogram Non-trivialsetuponcluster Nofaulttolerance 71. DeploymentConsidera2ons VowpalWabbit,GraphLab Dataprepara=onspecictoeachprogram Non-trivialsetuponcluster Nofaulttolerance MLlib ReadslesfromHDFS Launch/compile/runonclusterwithafewcommands RDDsarefaulttolerance 72. Vision MLlib MLI MLOpAmizer ReleasePlan 73. Matlab,R x Easeofuse Performance, Scalability GraphLab,VW x Mahout x LayoftheLand MLlib x 74. Matlab,R x Easeofuse Performance, Scalability GraphLab,VW x MLI x Mahout x LayoftheLand MLlib x 75. CurrentOp2ons 76. CurrentOp2ons +Easy(Resemblesmath,limited/nosetupcost) +Sucientforprototyping/wri2ngpapers Ad-hoc,non-scalablescripts Lossoftransla2onuponre-implementa2on 77. CurrentOp2ons +Easy(Resemblesmath,limited/nosetupcost) +Sucientforprototyping/wri2ngpapers Ad-hoc,non-scalablescripts Lossoftransla2onuponre-implementa2on 78. CurrentOp2ons +Easy(Resemblesmath,limited/nosetupcost) +Sucientforprototyping/wri2ngpapers Ad-hoc,non-scalablescripts Lossoftransla2onuponre-implementa2on +Scalableand(some2mes)fast +Exis2ngopen-sourcelibraryofMLalgorithms Diculttosetup,extend 79. Examples ML Developer Code 80. Examples ML Developer Code DistributedDivide-Factor-Combine(DFC) Ini2alstudiesinMATLAB(Notdistributed) DistributedprototypeinvolvingcompiledMATLAB 81. Examples ML Developer Code DistributedDivide-Factor-Combine(DFC) Ini2alstudiesinMATLAB(Notdistributed) DistributedprototypeinvolvingcompiledMATLAB MahoutALSwithEarlyStopping Theory:simpleif-statement(3linesofcode) 82. Examples ML Developer Code DistributedDivide-Factor-Combine(DFC) Ini2alstudiesinMATLAB(Notdistributed) DistributedprototypeinvolvingcompiledMATLAB MahoutALSwithEarlyStopping Theory:simpleif-statement(3linesofcode) Prac2ce:sihthrough7les,nearly1Klinesofcode 83. Insight:ProgrammingAbstrac2ons 84. Insight:ProgrammingAbstrac2ons ShieldMLDevelopersfromlow-details:provide familiarmathema2caloperatorsindistributedsejng MLDeveloperAPI(MLI) 85. Insight:ProgrammingAbstrac2ons ShieldMLDevelopersfromlow-details:provide familiarmathema2caloperatorsindistributedsejng MLDeveloperAPI(MLI) TableComputa2on:MLTable LinearAlgebra:MLSubMatrix Op2miza2onPrimi2ves:MLSolve 86. Insight:ProgrammingAbstrac2ons ShieldMLDevelopersfromlow-details:provide familiarmathema2caloperatorsindistributedsejng MLDeveloperAPI(MLI) TableComputa2on:MLTable LinearAlgebra:MLSubMatrix Op2miza2onPrimi2ves:MLSolve MLIExamples: DFC:~50linesofcode 87. Insight:ProgrammingAbstrac2ons ShieldMLDevelopersfromlow-details:provide familiarmathema2caloperatorsindistributedsejng MLDeveloperAPI(MLI) TableComputa2on:MLTable LinearAlgebra:MLSubMatrix Op2miza2onPrimi2ves:MLSolve MLIExamples: DFC:~50linesofcode ALS:earlystoppingin3lines;