Page 1: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Joseph K. Bradley
February 24th, 2016

Page 2: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

About the speaker: Joseph Bradley

Joseph Bradley is a Software Engineer and Apache Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

Page 3: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

About the moderator: Denny Lee

Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).

Page 4: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

We are Databricks, the company behind Apache Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%

Data Value

Created Databricks on top of Spark to make big data simple.

Page 5: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Apache Spark Engine

Spark Core

Spark Streaming, Spark SQL, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

Page 6: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Page 7: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Notable users that presented at Spark Summit 2015 San Francisco

Source: Slide 5 of Spark Community Update

Page 8: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Machine Learning: What and Why?

What: ML uses data to identify patterns and make decisions.

Why: The core value of ML is automated decision making.
• Especially important when dealing with TB or PB of data

Many use cases, including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations

Page 9: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Why Spark MLlib

Provide general-purpose ML algorithms on top of Spark
• Hide complexity of distributing data & queries, and scaling
• Leverage Spark improvements (DataFrames, Tungsten, Datasets)

Advantages of MLlib’s design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

Page 10: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Spark scales well

Largest cluster: 8000 nodes (Tencent)

Largest single job: 1 PB (Alibaba, Databricks)

Top streaming intake: 1 TB/hour (HHMI Janelia Farm)

2014 On-Disk Sort Record: Fastest open source engine for sorting a PB

Page 11: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Machine Learning highlights

Source: Why you should use Spark for Machine Learning

Page 12: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Source: Toyota Customer 360 Insights on Apache Spark and MLlib

Performance
• Original batch job: 160 hours
• Same job rewritten using Apache Spark: 4 hours

ML task
• Prioritize incoming social media in real time using Spark MLlib (differentiate campaign, feedback, product feedback, and noise)
• ML life cycle: Extract features and train:
  • V1: 56% accuracy -> V9: 82% accuracy
  • Remove false positives and semantic analysis (similarity between concepts)

Page 13: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Example analysis: Population vs. housing price

Links
• Simplifying Machine Learning with Databricks Blog Post
• Population vs. Price Multi-chart Spark SQL Notebook
• Population vs. Price Linear Regression Python Notebook

Page 14: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Page 15: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Page 16: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Scatterplot

import numpy as np
import matplotlib.pyplot as plt

# Collect the first feature (population) and the label (price) to the driver
x = data.map(lambda p: (p.features[0])).collect()
y = data.map(lambda p: (p.label)).collect()

# Build a local pandas DataFrame and plot it with ggplot
from pandas import *
from ggplot import *
pydf = DataFrame({'pop':x,'price':y})
p = ggplot(pydf, aes('pop','price')) + \
    geom_point(color='blue')
display(p)

Page 17: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Linear Regression with SGD
Define and Build Models

# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression model
lr = LinearRegression()

# Build two models
modelA = lr.fit(data, {lr.regParam: 0.0})
modelB = lr.fit(data, {lr.regParam: 100.0})
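The two models above differ only in regParam. As a rough illustration of what a regularization parameter does, here is a minimal pure-Python sketch of one-feature least squares with an L2 penalty. The function name fit_ridge_1d, the toy data, and the exact form of the objective are assumptions made for this sketch; MLlib's own objective, feature scaling, and solver may differ, so the numbers will not match what lr.fit produces.

```python
# Illustrative sketch only, not MLlib's implementation.
# Minimizes (1/(2n)) * sum((y - w*x - b)^2) + (reg_param/2) * w^2,
# leaving the intercept b unpenalized.

def fit_ridge_1d(xs, ys, reg_param):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered covariance and variance give the closed-form slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / n
    slope = sxy / (sxx + reg_param)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

slope_a, intercept_a = fit_ridge_1d(xs, ys, 0.0)    # no regularization
slope_b, intercept_b = fit_ridge_1d(xs, ys, 100.0)  # heavy regularization
print(slope_a, slope_b)
```

With no penalty the fit recovers the slope of the toy data; with a large penalty the slope is pulled toward zero and the intercept moves toward the mean label.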

Page 18: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Linear Regression with SGD
Make Predictions

# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)

Page 19: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Linear Regression with SGD
Evaluate the Models

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="mse")
MSE = evaluator.evaluate(predictionsA)
print("ModelA: Mean Squared Error = " + str(MSE))

ModelA: Mean Squared Error = 16538.4813081
ModelB: Mean Squared Error = 16769.2917636
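For reference, mean squared error is the average of the squared differences between labels and predictions. The sketch below computes it in plain Python on made-up numbers; the helper name mean_squared_error and the sample values are assumptions for illustration and are unrelated to the housing data.

```python
def mean_squared_error(labels, predictions):
    # Average of squared residuals; both lists must be non-empty and aligned.
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and the same length")
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

# Hypothetical labels and predictions
labels = [3.0, 5.0, 4.0]
predictions = [2.5, 5.0, 6.0]
mse = mean_squared_error(labels, predictions)
print("Mean Squared Error = " + str(mse))
```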

Page 20: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Scatterplot with plotting Regression Models

p = ggplot(pydf, aes('pop','price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop','predA'), color='red') + \
    geom_line(pydf, aes('pop','predB'), color='green') + \
    scale_x_log(10) + scale_y_log10()
display(p)

Page 21: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Learning more about MLlib

Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.

References
• Apache Spark MLlib User Guide
  • The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)

Page 22: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Combining the Strengths of MLlib, scikit-learn, & R

Page 23: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Great libraries → Business investment
• Education
• Tooling & workflows

Page 24: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Big Data

Scaling (trees)

Topic model on 4.5 million Wikipedia articles

Recommendation with 50 million users, 5 million songs, 50 billion ratings

Page 25: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Big Data & MLlib

• More data → higher accuracy
• Scale with business (# users, available data)
• Integrate with production systems

Page 26: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Bridging the gap

How do you get from a single-machine workload to a distributed one?

At school: Machine Learning with R on my laptop

The Goal: Machine Learning on a huge computing cluster

Page 27: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Wish list

• Run original code in a production environment
• Use distributed data sources
• Distribute ML workload piece by piece
• Use familiar algorithms & APIs

Page 28: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Our task

Sentiment analysis
Given a review (text), predict the user’s rating.

Data from https://snap.stanford.edu/data/web-Amazon.html

Page 29: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Our ML workflow

Text: This scarf I bought is very strange. When I ...

Label: Rating = 3.0

Tokenizer

Words: [This, scarf, I, bought, ...]

Hashing Term-Freq

Features: [2.0, 0.0, 3.0, ...]

Linear Regression

Prediction: Rating = 2.7
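The workflow on this slide (tokenize, hash words into a term-frequency vector, apply a linear model) can be sketched in plain Python. This is an illustration under stated assumptions, not MLlib code: the helpers tokenize, hashing_tf, and predict are hypothetical names, CRC32 stands in for whatever hash function MLlib's HashingTF uses, and the weights and intercept are made up rather than learned.

```python
import zlib

def tokenize(text):
    # Lowercase and split on whitespace.
    return text.lower().split()

def hashing_tf(words, num_features):
    # Map each word to a bucket by hashing, then count occurrences per bucket.
    # Different words may collide in the same bucket; that is expected.
    vec = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec

def predict(features, weights, intercept):
    # Linear model: dot product of weights and features plus an intercept.
    return intercept + sum(w * f for w, f in zip(weights, features))

words = tokenize("This scarf I bought is very strange")
features = hashing_tf(words, 16)

# Hypothetical, hand-picked parameters (a real model would learn these)
weights = [0.1] * 16
rating = predict(features, weights, 2.0)
print(words, features, rating)
```

Hashing keeps the feature vector at a fixed size without building a vocabulary first, which is convenient when the data is spread across many machines.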

Page 30: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Our ML workflow

Cross Validation

Linear Regression

Feature Extraction

regularization parameter: {0.0, 0.1, ...}

Page 31: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Cross validation

Cross Validation

...

Best Linear Regression

Linear Regression #1

Linear Regression #2

Feature Extraction

Linear Regression #3

Page 33: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Distribute cross validation

Cross Validation

...

Best Linear Regression

Linear Regression #1

Linear Regression #2

Feature Extraction

Linear Regression #3
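Each (regularization parameter, fold) pair in cross-validation is an independent fit, which is why the work can be farmed out. The sketch below shows the idea in plain Python, with a local thread pool standing in for cluster workers. It is not the spark-sklearn API: fit_ridge_1d and cross_validate are hypothetical helpers, the model is a toy one-feature ridge regression, and the data is made up.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_ridge_1d(xs, ys, reg):
    # Closed-form one-feature least squares with an L2 penalty on the slope.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / n
    slope = sxy / (sxx + reg)
    return slope, mean_y - slope * mean_x

def cross_validate(xs, ys, reg_params, k=3, max_workers=4):
    n = len(xs)

    def run_task(task):
        # One independent unit of work: fit on k-1 folds, score on the held-out fold.
        reg, fold = task
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        w, b = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], reg)
        mse = sum((ys[i] - (w * xs[i] + b)) ** 2 for i in test) / len(test)
        return reg, mse

    tasks = [(reg, fold) for reg in reg_params for fold in range(k)]
    # The pool stands in for a cluster; tasks share nothing but read-only data.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_task, tasks))

    # Average the held-out error per parameter and keep the lowest.
    scores = {reg: 0.0 for reg in reg_params}
    for reg, mse in results:
        scores[reg] += mse / k
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical noise-free data on y = 2x + 1
xs = [float(i) for i in range(1, 10)]
ys = [2.0 * x + 1.0 for x in xs]
best_reg, scores = cross_validate(xs, ys, [0.0, 0.1, 1.0])
print(best_reg, scores)
```

Because the toy data has no noise, the unregularized model wins here; on real data the best parameter depends on the dataset.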

Page 34: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Repeating this at home

This demo used:
• Spark 1.6
• spark-sklearn (on Spark Packages) (on PyPI)

The notebooks from the demo are available here:
• sklearn integration
• MLlib + sklearn: Distribute Everything!

The Amazon Reviews data20K and test4K datasets were created and can be used within the databricks-datasets with permission from Professor Julian McAuley @ UCSD.

Source: Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.

Page 35: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Integrations we mentioned

Data sources
• Spark DataFrames: Conversions between pandas (local data) & Spark (distributed data)
• MLlib: Conversions between scipy & MLlib data types

Model selection / tuning
• spark-sklearn: Automatically distribute cross-validation

Python API
• MLlib: Distributed learning algorithms with familiar APIs
• spark-sklearn: Conversions between scikit-learn & MLlib models

Page 36: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Integrations with R

DataFrames
• Conversions between R (local) & Spark (distributed)
• SQL queries from R

head(filter(df, df$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

API for calling MLlib algorithms from R
• Linear & logistic regression supported in Spark 1.6
• More algorithms in development

model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

Page 37: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Learning more about integrations

Python, pandas & scikit-learn
• spark-sklearn documentation and blog post
• Spark DataFrame Python API & pandas conversions
• Databricks Guide on using scikit-learn and other libraries with Spark

R
• Spark R API User Guide (DataFrames & ML)
• Databricks Guide: Spark R overview + docs & examples for each function

TensorFlow on Apache Spark (Deep Learning in Python)
• Blog post explaining how to run TensorFlow on top of Spark, with example code

Page 38: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

MLlib roadmap highlights

Workflow
• Simplify building and customizing ML Pipelines.

Key models
• Improve inspection for generalized linear models (linear & logistic regression).

Language APIs
• Support Pipeline persistence (saving & loading Pipelines and Models) in the Python API.

Spark 2.0 Roadmap JIRA: https://issues.apache.org/jira/browse/SPARK-12626

Page 39: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

More resources

• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark. Join the waitlist for the beta release!

Page 40: Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Thanks!