Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Joseph K. Bradley
February 24th, 2016
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
We are Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
75% — share of Spark code contributed by Databricks in 2014
Data Value
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
Notable users that presented at Spark Summit 2015 San Francisco
Source: Slide 5 of Spark Community Update
Machine Learning: What and Why?
What: ML uses data to identify patterns and make decisions.
Why: The core value of ML is automated decision making.
• Especially important when dealing with TB or PB of data
Many use cases, including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
Why Spark MLlib
Provide general-purpose ML algorithms on top of Spark
• Hide complexity of distributing data & queries, and scaling
• Leverage Spark improvements (DataFrames, Tungsten, Datasets)

Advantages of MLlib's design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
Spark scales well
Largest cluster: 8,000 nodes (Tencent)
Largest single job: 1 PB (Alibaba, Databricks)
Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
2014 on-disk sort record: fastest open-source engine for sorting a PB
Machine Learning highlights
Source: Why you should use Spark for Machine Learning
Source: Toyota Customer 360 Insights on Apache Spark and MLlib

Performance
• Original batch job: 160 hours
• Same job rewritten using Apache Spark: 4 hours

ML task
• Prioritize incoming social media in real time using Spark MLlib (differentiate campaign, feedback, product feedback, and noise)
• ML life cycle: extract features and train
  • V1: 56% accuracy -> V9: 82% accuracy
  • Remove false positives and add semantic analysis (similarity between concepts)
Example analysis: Population vs. housing price

Links
• Simplifying Machine Learning with Databricks Blog Post
• Population vs. Price Multi-chart Spark SQL Notebook
• Population vs. Price Linear Regression Python Notebook
Scatterplot

import numpy as np
import matplotlib.pyplot as plt
from pandas import *
from ggplot import *

x = data.map(lambda p: (p.features[0])).collect()
y = data.map(lambda p: (p.label)).collect()

pydf = DataFrame({'pop': x, 'price': y})
p = ggplot(pydf, aes('pop', 'price')) + \
    geom_point(color='blue')
display(p)
Linear Regression with SGD: Define and Build Models

# Import the LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define the LinearRegression model
lr = LinearRegression()

# Build two models with different regularization parameters
modelA = lr.fit(data, {lr.regParam: 0.0})
modelB = lr.fit(data, {lr.regParam: 100.0})
Linear Regression with SGD: Make Predictions

# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)
Linear Regression with SGD: Evaluate the Models

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="mse")
MSE = evaluator.evaluate(predictionsA)
print("ModelA: Mean Squared Error = " + str(MSE))

Output:
ModelA: Mean Squared Error = 16538.4813081
ModelB: Mean Squared Error = 16769.2917636
Scatterplot with Regression Models

p = ggplot(pydf, aes('pop', 'price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop', 'predA'), color='red') + \
    geom_line(pydf, aes('pop', 'predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)
Combining the Strengths of MLlib, scikit-learn, & R
Great libraries → business investment
• Education
• Tooling & workflows
Big Data
Scaling (trees)
Topic model on 4.5 million Wikipedia articles
Recommendation with 50 million users, 5 million songs, 50 billion ratings
Big Data & MLlib
• More data → higher accuracy
• Scale with business (# users, available data)
• Integrate with production systems
Bridging the gap
How do you get from a single-machine workload to a distributed one?
At school: Machine Learning with R on my laptop
The Goal: Machine Learning on a huge computing cluster
Wish list
• Run original code on a production environment
• Use distributed data sources
• Distribute ML workload piece by piece
• Use familiar algorithms & APIs
Our task
Sentiment analysis
Given a review (text), predict the user's rating.
Data from https://snap.stanford.edu/data/web-Amazon.html
Our ML workflow
Text: "This scarf I bought is very strange. When I ..."
Label: Rating = 3.0
  ↓ Tokenizer
Words: [This, scarf, I, bought, ...]
  ↓ Hashing Term-Frequency (HashingTF)
Features: [2.0, 0.0, 3.0, ...]
  ↓ Linear Regression
Prediction: Rating = 2.7
Our ML workflow
Cross Validation
Feature Extraction → Linear Regression
regularization parameter: {0.0, 0.1, ...}
Cross validation
Cross Validation
Feature Extraction
  → Linear Regression #1
  → Linear Regression #2
  → Linear Regression #3
  → ...
  → Best Linear Regression
Distribute cross validation
Cross Validation
Feature Extraction
  → Linear Regression #1
  → Linear Regression #2
  → Linear Regression #3
  → ...
  → Best Linear Regression
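The spark-sklearn package distributes exactly this step by shipping each grid point's fit to a Spark worker; per the package docs, its GridSearchCV is a drop-in for scikit-learn's (same signature plus a leading SparkContext argument). The single-machine scikit-learn version it parallelizes looks like this (the estimator and grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic regression data standing in for the extracted features.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# With spark-sklearn: from spark_sklearn import GridSearchCV
# and construct as GridSearchCV(sc, estimator, param_grid) instead.
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)                    # one fit per (fold x grid point)
print(search.best_params_)
```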
Repeating this at home

This demo used:
• Spark 1.6
• spark-sklearn (on Spark Packages) (on PyPI)

The notebooks from the demo are available here:
• sklearn integration
• MLlib + sklearn: Distribute Everything!
The Amazon Reviews data20K and test4K datasets were created and can be used within databricks-datasets with permission from Professor Julian McAuley @ UCSD. Source: Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
Integrations we mentioned
Data sources
• Spark DataFrames: Conversions between pandas (local data) & Spark (distributed data)
• MLlib: Conversions between scipy & MLlib data types
Model selection / tuning
• spark-sklearn: Automatically distribute cross-validation

Python API
• MLlib: Distributed learning algorithms with familiar APIs
• spark-sklearn: Conversions between scikit-learn & MLlib models
Integrations with R

DataFrames
• Conversions between R (local) & Spark (distributed)
• SQL queries from R
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")

head(filter(df, df$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
API for calling MLlib algorithms from R
• Linear & logistic regression supported in Spark 1.6
• More algorithms in development
Learning more about integrations

Python, pandas & scikit-learn
• spark-sklearn documentation and blog post
• Spark DataFrame Python API & pandas conversions
• Databricks Guide on using scikit-learn and other libraries with Spark

R
• Spark R API User Guide (DataFrames & ML)
• Databricks Guide: Spark R overview + docs & examples for each function

TensorFlow on Apache Spark (Deep Learning in Python)
• Blog post explaining how to run TensorFlow on top of Spark, with example code
MLlib roadmap highlights
Workflow
• Simplify building and customizing ML Pipelines.

Key models
• Improve inspection for generalized linear models (linear & logistic regression).

Language APIs
• Support Pipeline persistence (saving & loading Pipelines and Models) in the Python API.
Spark 2.0 Roadmap JIRA: https://issues.apache.org/jira/browse/SPARK-12626
More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: free hosted Apache Spark. Join the waitlist for the beta release!
Thanks!