Distributed Deep Learning At Scale On Apache Spark With BigDL

download Distributed Deep Learning At Scale On Apache Spark With BigDL

of 38

  • date post

    11-Apr-2017
  • Category

    Technology

  • view

    438
  • download

    1

Embed Size (px)

Transcript of Distributed Deep Learning At Scale On Apache Spark With BigDL

  • Intel Confidential INTERNAL USE ONLY

    SPARK MEETUP @ INTELCO-HOSTED by Intel and DATABRICKSMarch 23, 2017

  • https://github.com/intel-analytics/BigDL 2

    AGENDA7:00 - 7:10 pm Opening Remarks & Introductions

    7:10 - 7:55: pm Intel Tech Talk from Jiao Wang and Sergey Ermolin

    7:55 - 8:00 pm Short Break

    8:00 - 8:45 pm Databricks Tech Talk from Tathagata Das (TD)

    8:45 - 9:00 pm Mingling

    WIFI ACCESSConnect to the wireless network: Guest

    Enter access code: 81995579

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 3

    INTEL BIG DATA TECHNOLOGIES group Intel and Hadoop/Spark Ecosystem

    PMC members and committers on multiple projects

    Unique perspectives gained from customers

    Industry and academia collaboration on open-source projects

    software.intel.com/bigdatasoftware.intel.com/AI

    https://software.intel.com/ai

    Jason Dai Senior Principal Engineer and Chief Architect, Big Data Technologies,Intel Corporation

  • Intel Confidential INTERNAL USE ONLY

    Distributed Deep Learning At Scale on Apache Spark with BigDLJiao(Jennie) Wang, Sergey ErmolinBig Data Technologies, Software and Service GroupIntel corporation

  • https://github.com/intel-analytics/BigDL 5

    What is BigDL?BigDL is a distributed deep learning library for Apache Spark*

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 6

    Why BigDL?

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL https://software.intel.com/ai 7

    Production ML/DL system is Complex and Distributed. Spark-based Deep Learning library is a natural fit

    Why BigDL?

  • https://github.com/intel-analytics/BigDL 8

    BIGDL WITHIN SPARK FRAMEWORKEnd-to-end Big Data Analytics with Deep Learning Functionalities Directly on Spark

    Natively integrated with Big Data (Hadoop/Spark) ecosystem

    Massively distributed, scale out

    Sends compute to data

    Fault tolerance

    Elasticity

    Incremental scaling

    Dynamic resource sharing

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 9

    BIGDL benefits Allows to write deep learning applications as standard Spark programs

    Runs on top of existing Spark or Hadoop/Hive clusters

    Adds rich Deep Learning functionalities to Apache Spark

    Feature parity with Caffe and other single-node DL frameworks.

    High performance - Intel MKL and multi-threaded programming

    Efficient scale-out with an all-reduce communications on Spark

    BigDL has been open sourced since 2016: https://github.com/intel-analytics/BigDL

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 10

    WHAT is in bigdl for you?You may want to write your deep learning programs using BigDL if you need to:

    Analyze big data using deep learning on the same Hadoop/Spark cluster where the data are stored

    Add deep learning functionalities to the Big Data (Spark) programs and/or workflow

    Leverage existing Hadoop/Spark clusters to run deep learning applications

    Dynamically share with other workloads (e.g., ETL, data warehouse, feature engineering, classical machine learning, graph analytics, etc.)

    Making deep learning more accessible for Big Data users and data scientists, who are usually not experts for deep learning

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 11

    BigDL Features

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 12

    BigDL FeaturesDistributed Deep learning applications (training, fine-tuning & prediction) on Apache Spark* No changes to the existing Hadoop/Spark clusters needed

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 13

    BigDL can re-use/fine-tune models from other frameworks

    Load existing Caffe/Torch Model

    Allows for transition from single-node

    to distributed application deployment

    Useful for inference

    Allows for minor model tuning

    Allows for model sharing between

    Data Scientists and Production Engr.

    CaffeModel File

    Torch Model File

    Storage

    BigDL

    BigDL Model File

    Load

    Save

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 14

    BigDL integration with spark ml

    Integrates with Spark-ML Pipeline:

    Wrapper with Spark ML Transformer

    BigDL Plugs into Spark ML pipeline

    Support Spark v1.5/1.6/2.0

    DataFrame

    Transfomer1

    Transfomer2

    DLClassifer

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 15

    Python API SupportBased on PySpark, Python API in BigDL allows use of existing Python libs:

    Numpy

    Scipy

    Pandas

    Scikit-learn

    NLTK

    Matplotlib

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 16

    Works with Jupyter NotebookRunning BigDL applications directly in Jupyter notebooks

    Share and Reproduce

    Notebooks can be shared with others

    Easy to reproduce and track

    Rich Content

    Texts, images, videos, LaTeX and JavaScript

    Code can also produce rich contents

    Rich toolbox

    Apache Spark, from Python, R and Scala

    Pandas, scikit-learn, ggplot2, dplyr, etc

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 17

    Visualization for Learning

    BigDL integration with TensorBoard

    TensorBoard is a suite of web applications from Google for visualizing and understanding deep learning applications

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 18

    Spark Streaming RDDs

    EvaluatorBigDL Model

    StreamWriter

    Integration with Spark Streaming for runtime training and prediction

    HDFS/S3

    Kafka

    Flume

    Kinesis

    Twitter

    Train

    Predict

    BigDL integration with spark streaming

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 19

    Tight Integrations with Spark SQL, DataFrames and Structured Streaming

    *Image classification on ImageNet(http://www.image-net.org)

    BigDL Features

    Kafka File

    Data Frame(Batch/Stream)

    BigDL UDF

    Filtered Data Frame

    (Batch/Stream)

    df.select($image).withColumn(

    image_type, ImgClassifier(image))

    .filter($image_type == dog)

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 20

    BigDL FeaturesBigDL provides out-of-the box examples of popular CNN models

    - helps a developer to get started

    https://github.com/intel-analytics/BigDL/wiki/Examples

    Models (Train and Inference Example Code):

    LeNet, Inception, VGG, ResNet, RNN, Auto-encoder

    Examples:

    Text Classification

    Image Classification

    Load Torch/Caffe model

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 21

    BigDL Features Single node Xeon performance

    Benchmarked to be best on Xeon E5-26XX v3 or E5-26XX v4

    Orders of magnitude speedup vs. out-of-box open source Caffe, Torch or TensorFlow

    Scaling-out

    Efficiently scales out to 10s~100s of Xeon servers on Spark

    * For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 22

    BigDL installation on major cloud frameworks AWS.https://github.com/intel-analytics/BigDL/wiki/Running-on-EC2

    https://software.intel.com/ai

    https://github.com/intel-analytics/BigDL/wiki/Running-on-EC2

  • https://github.com/intel-analytics/BigDL 23

    BigDL on other platforms

    Intels BigDL on Databrickshttps://databricks.com/blog/2017/02/09/intels-bigdl-databricks.html

    How to use BigDL on Apache Spark for Azure HDInsighthttps://blogs.msdn.microsoft.com/azuredatalake/2017/03/17/how-to-use-bigdl-on-apache-spark-for-azure-hdinsight/

    More to come check back periodically and stay tuned

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 24

    BigDL Use cases

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 25

    Visual recognition and Object DetectionFaster-RCNN SSD: Single Shot MultiBox Detector

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 26

    Object Detection on PASCAL

    *(http://host.robots.ox.ac.uk/pascal/VOC/)

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL https://software.intel.com/ai 27

    Natural Language Model - RNN

    Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  • https://github.com/intel-analytics/BigDL 28

    Learn from Shakespeare Poems Output of RNN:

    Long live the King . The King and Queen , and the Strange of the Veils of the rhapsodic . and grapple, and the entreatments of the pressure .

    Upon her head , and in the world ? `` Oh, the gods ! O Jove ! To whom the king : `` O friends !

    Her hair, nor loose ! If , my lord , and the groundlings of the skies . jocund and Tasso in the Staggering of the Mankind . and

    https://software.intel.com/ai

  • https://github.com/intel-analytics/BigDL 29

    More RNN Support: LSTM

    BigDL also supports LSTM variants such as GRU and LSTM with peepholes

    Source: http://colah.github.io/posts/2015-