Apache® Spark™ MLlib: From Quick Start to Scikit-Learn


  • Apache Spark MLlib: From Quick Start to Scikit-Learn

    Joseph K. Bradley
    February 24th, 2016

  • About the speaker: Joseph Bradley

    Joseph Bradley is a Software Engineer and Apache Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.


  • About the moderator: Denny Lee

    Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).


  • We are Databricks, the company behind Apache Spark

    Founded by the creators of Apache Spark in 2013

    75% share of Spark code contributed by Databricks in 2014

    Created Databricks on top of Spark to make big data simple.

  • Apache Spark Engine

    Spark Core, with standard libraries: Spark Streaming, Spark SQL, MLlib, and GraphX

    Unified engine across diverse workloads & environments

    Scale out, fault tolerant

    Python, Java, Scala, and R APIs

  • Notable users that presented at Spark Summit 2015 San Francisco

    Source: Slide 5 of Spark Community Update

  • Machine Learning: What and Why?

    What: ML uses data to identify patterns and make decisions.

    Why: The core value of ML is automated decision making, which is especially important when dealing with terabytes or petabytes of data.

    Many use cases, including: marketing and advertising optimization, security monitoring / fraud detection, and operational optimizations.

  • Why Spark MLlib

    Provides general-purpose ML algorithms on top of Spark, hides the complexity of distributing data & queries and of scaling, and leverages Spark improvements (DataFrames, Tungsten, Datasets).

    Advantages of MLlib's design: simplicity, scalability, a streamlined end-to-end workflow, and compatibility.

  • Spark scales well

    Largest cluster: 8,000 nodes (Tencent)

    Largest single job: 1 PB (Alibaba, Databricks)

    Top streaming intake: 1 TB/hour (HHMI Janelia Farm)

    2014 on-disk sort record: fastest open-source engine for sorting a PB

  • Machine Learning highlights

    Source: Why you should use Spark for Machine Learning

  • Toyota Customer 360 Insights on Apache Spark and MLlib

    Source: Toyota Customer 360 Insights on Apache Spark and MLlib

    Performance: the original batch job took 160 hours; the same job re-written using Apache Spark took 4 hours.

    ML task: prioritize incoming social media in real time using Spark MLlib (differentiate campaign, feedback, product feedback, and noise).

    ML life cycle: extract features and train (V1: 56% accuracy -> V9: 82% accuracy), then remove false positives and apply semantic analysis (similarity between concepts).

  • Example analysis: Population vs. housing price

    Links:
    Simplifying Machine Learning with Databricks (blog post)
    Population vs. Price Multi-chart Spark SQL Notebook
    Population vs. Price Linear Regression Python Notebook

  • Scatterplot

    import numpy as np
    import matplotlib.pyplot as plt
    from pandas import DataFrame
    from ggplot import *

    # Collect the feature (population) and label (price) to the driver
    x = data.map(lambda p: (p.features[0])).collect()
    y = data.map(lambda p: (p.label)).collect()

    # Build a local pandas DataFrame and plot it
    pydf = DataFrame({'pop': x, 'price': y})
    p = ggplot(pydf, aes('pop', 'price')) + \
        geom_point(color='blue')
    display(p)

  • Linear Regression with SGD: Define and Build Models

    # Import the LinearRegression class
    from pyspark.ml.regression import LinearRegression

    # Define the LinearRegression estimator
    lr = LinearRegression()

    # Build two models with different regularization parameters
    modelA = lr.fit(data, {lr.regParam: 0.0})
    modelB = lr.fit(data, {lr.regParam: 100.0})

  • Linear Regression with SGD: Make Predictions

    # Make predictions on the data
    predictionsA = modelA.transform(data)
    display(predictionsA)

  • Linear Regression with SGD: Evaluate the Models

    from pyspark.ml.evaluation import RegressionEvaluator

    # Evaluate both models on mean squared error
    # (the ModelB lines are filled in to match the output shown below)
    evaluator = RegressionEvaluator(metricName="mse")
    MSE_A = evaluator.evaluate(predictionsA)
    print("ModelA: Mean Squared Error = " + str(MSE_A))

    predictionsB = modelB.transform(data)
    MSE_B = evaluator.evaluate(predictionsB)
    print("ModelB: Mean Squared Error = " + str(MSE_B))

    Output:
    ModelA: Mean Squared Error = 16538.4813081
    ModelB: Mean Squared Error = 16769.2917636

  • Scatterplot with Regression Models

    # Add each model's predictions to the local pandas DataFrame
    # (assumed step: the slide plots 'predA'/'predB' columns without showing this)
    pydf['predA'] = [r.prediction for r in predictionsA.select('prediction').collect()]
    pydf['predB'] = [r.prediction for r in predictionsB.select('prediction').collect()]

    p = ggplot(pydf, aes('pop', 'price')) + \
        geom_point(color='blue') + \
        geom_line(pydf, aes('pop', 'predA'), color='red') + \
        geom_line(pydf, aes('pop', 'predB'), color='green') + \
        scale_x_log10() + scale_y_log10()
    display(p)

  • Learning more about MLlib

    Guides & examples:
    Example workflow using ML Pipelines (Python)
    Power plant data analysis workflow (Scala)
    The above 2 links are part of the Databricks Guide, which contains many more examples and references.

    References:
    Apache Spark MLlib User Guide, which contains code snippets for almost all algorithms, as well as links to API documentation.
    Meng et al. MLlib: Machine Learning in Apache Spark. 2015. http://arxiv.org/abs/1505.06807 (academic paper)

  • Combining the Strengths of MLlib, scikit-learn, & R

    Great libraries, business investment, education, tooling & workflows

  • Big Data

    Topic model on 4.5 million Wikipedia articles

    Recommendation with 50 million users, 5 million songs, 50 billion ratings

    [Chart: scaling (trees)]

  • Big Data & MLlib

    More data -> higher accuracy

    Scale with the business (# users, available data)

    Integrate with production systems

  • Bridging the gap

    How do you get from a single-machine workload to a distributed one?

    At school: Machine Learning with R on my laptop

    The Goal: Machine Learning on a huge computing cluster

  • Wish list

    Run original code in a production environment

    Use distributed data sources

    Distribute the ML workload piece by piece

    Use familiar algorithms & APIs

  • Our task

    Sentiment analysis: given a review (text), predict the user's rating.

    Data from https://snap.stanford.edu/data/web-Amazon.html

  • Our ML workflow

    Text ("This scarf I bought is very strange. When I ...") with Label (Rating = 3.0)

    -> Tokenizer -> Words ([This, scarf, I, bought, ...])

    -> Hashing Term-Freq -> Features ([2.0, 0.0, 3.0, ...])

    -> Linear Regression -> Prediction (Rating = 2.7)
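
    This workflow maps directly onto spark.ml Pipeline stages. A minimal sketch, assuming a DataFrame named reviews with "text" and "label" columns (the variable names are illustrative, not from the slides):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.regression import LinearRegression

    # Each box in the diagram above becomes one Pipeline stage
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LinearRegression(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    model = pipeline.fit(reviews)            # fit the whole workflow at once
    predictions = model.transform(reviews)   # adds a "prediction" column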

  • Our ML workflow

    The same workflow, with Feature Extraction -> Linear Regression wrapped in Cross Validation over the regularization parameter {0.0, 0.1, ...}

  • Cross validation

    Cross Validation evaluates each candidate:

    Feature Extraction -> Linear Regression #1
    Feature Extraction -> Linear Regression #2
    Feature Extraction -> Linear Regression #3
    ...

    and selects the Best Linear Regression.


  • Distribute cross validation

    Distribute the same search: each candidate (Feature Extraction -> Linear Regression #1, #2, #3, ...) is trained and evaluated in parallel across the cluster, and the Best Linear Regression is selected.
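
    A minimal sketch of this distributed search with MLlib's CrossValidator, reusing the illustrative pipeline and lr from the Pipeline sketch above; the grid mirrors the regularization parameters on the earlier slide:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    # One grid point per candidate "Linear Regression #N" in the diagram
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1, 1.0]).build()

    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=grid,
                        evaluator=RegressionEvaluator(metricName="mse"),
                        numFolds=3)
    cvModel = cv.fit(reviews)     # the candidate fits run across the cluster
    best = cvModel.bestModel      # the "Best Linear Regression"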

  • Repeating this at home

    This demo used: Spark 1.6 and spark-sklearn (available on Spark Packages and on PyPI).

    The notebook from the demo is available here: sklearn integration MLlib + sklearn: Distribute Everything!

    The Amazon Reviews data20K and test4K datasets were created, and can be used within databricks-datasets, with permission from Professor Julian McAuley @ UCSD. Source: Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
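
    For the scikit-learn side, spark-sklearn provides a drop-in replacement for scikit-learn's GridSearchCV that farms the per-candidate fits out to the cluster. A rough sketch, assuming a SparkContext sc and small local arrays X, y (the estimator and grid here are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from spark_sklearn import GridSearchCV   # same interface as sklearn's version

    param_grid = {"n_estimators": [10, 50], "max_depth": [3, 5, None]}

    # Each (parameter setting, CV fold) fit becomes a Spark task
    gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
    gs.fit(X, y)            # X, y stay local; only the search is distributed
    print(gs.best_params_)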

  • Integrations we mentioned

    Data sources
    Spark DataFrames: conversions between pandas (local data) & Spark (distributed data)
    MLlib: conversions between scipy & MLlib data types

    Model selection / tuning
    spark-sklearn: automatically distribute cross-validation

    Python API
    MLlib: distributed learning algorithms with familiar APIs
    spark-sklearn: conversions between scikit-learn & MLlib models
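
    The DataFrame conversions above are one-liners in each direction. A small sketch using the Spark 1.6-era entry point (sqlContext); the sample data is made up:

    import pandas as pd

    pdf = pd.DataFrame({"pop": [1.0, 2.0, 3.0], "price": [10.0, 20.0, 30.0]})

    sdf = sqlContext.createDataFrame(pdf)   # pandas (local) -> Spark (distributed)
    back = sdf.toPandas()                   # Spark (distributed) -> pandas (local)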

  • Integrations with R

    DataFrames: conversions between R (local) & Spark (distributed)

    SQL queries from R

  • Learning more about integrations

    Python, pandas & scikit-learn:
    spark-sklearn documentation and blog post
    Spark DataFrame Python API & pandas conversions
    Databricks Guide on using scikit-learn and other libraries with Spark

    R:
    Spark R API User Guide (DataFrames & ML)
    Databricks Guide: Spark R overview + docs & examples for each function

    TensorFlow on Apache Spark (Deep Learning in Python):
    Blog post explaining how to run TensorFlow on top of Spark, with example code

  • MLlib roadmap highlights

    Workflow: simplify building and customizing ML Pipelines.

    Key models: improve inspection for generalized linear models (linear & logistic regression).

    Language APIs: support Pipeline persistence (saving & loading Pipelines and Models) in the Python API.

    Spark 2.0 Roadmap JIRA: https://issues.apache.org/jira/browse/SPARK-12626
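
    A brief sketch of Pipeline persistence as it later shipped in the Spark 2.0 Python API, reusing the illustrative pipeline and model from the workflow sketch; the paths are placeholders:

    from pyspark.ml import Pipeline, PipelineModel

    pipeline.save("/tmp/review-pipeline")        # persist the (unfit) Pipeline
    model.save("/tmp/review-pipeline-model")     # persist the fitted PipelineModel

    samePipeline = Pipeline.load("/tmp/review-pipeline")
    sameModel = PipelineModel.load("/tmp/review-pipeline-model")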

  • More resources

    Databricks Guide
    Apache Spark User Guide
    Databricks Community Forum
    Training courses: public classes, MOOCs, & private training
    Databricks Community Edition: free hosted Apache Spark. Join the waitlist for the beta release!