Spark Streaming and MLlib - Hyderabad Spark Group


Date posted: 14-Apr-2017

Transcript of Spark Streaming and MLlib - Hyderabad Spark Group

  • Spark Streaming and MLlib: The stack for distributed,

    massively scalable, (near) real-time data processing and machine learning

    Presented by

    Phaneendra Chiruvella

    http://twitter.com/pcx66

    Hyderabad Spark Group & Zemoso Technologies

  • Agenda

    Brief intro to Spark Core

    Introduction to Spark Streaming

    What is the world talking about?: A demo of Spark

    Streaming with Twitter

    Introduction to Spark MLlib

    Let's see what movies you might like: A demo of Spark

    MLlib by building a Movie Recommendation Engine

  • Spark: Lightning-fast cluster computing

    Data processing engine

    Distributed

    Massively scalable: the largest known cluster has 8,000

    machines, with PBs of data processed

    Programmable in Scala, Java, Python and R

    Interactive shell

    Both Batch & Stream processing

    Stable and robust: being used in production at many

    companies

    Known to work well with other Big data tools like Kafka,

    Cassandra, HDFS, HBase, etc.

    Image source: http://spark.apache.org/docs/latest/cluster-overview.html

  • Spark: How does it work?

    Every application has its own SparkContext

    Cluster Managers available are:

    Spark Standalone, YARN, Mesos

    Image source: http://spark.apache.org/docs/latest/cluster-overview.html

  • Spark: Resilient Distributed Datasets

    RDD is the fundamental abstraction of Spark, providing a rich, fault-tolerant layer over a cluster of machines

    Diagram labels: SparkContext, Executors, RDD

  • Spark Core: Demo

    Creating RDDs

    Transformations

    Actions

    Cache
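
The four demo topics above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the code shown in the talk: the word-count task and both function names are assumptions chosen for illustration, and `sc` is assumed to be an already-created `SparkContext`. The plain-Python function computes the same result without a cluster, so the expected output is easy to check by hand.

```python
def word_counts_local(lines):
    """Plain-Python reference: word counts without Spark."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def word_counts_spark(sc, lines):
    """Same computation on an RDD; `sc` is an existing SparkContext."""
    # Creating an RDD from a local collection.
    rdd = sc.parallelize(lines)

    # Transformations are lazy: nothing runs on the cluster yet.
    words = rdd.flatMap(lambda line: line.lower().split())
    pairs = words.map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Cache: mark the RDD to be kept in memory once it has been computed.
    counts.cache()

    # Actions trigger execution. The first action computes and caches
    # the RDD; the second one can reuse the cached partitions.
    counts.count()
    return dict(counts.collect())
```

Given the same input lines, both functions should return the same dictionary; the Spark version only pays off once the data no longer fits on one machine.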

  • Spark Streaming: Batch processing is not enough!

    Extension to the Core API

    Micro-batches processed in real time

    Minimizes latency to seconds

  • Spark Streaming: How does it work?

    DStreams - Just a chain of RDDs

    Batch Interval, Input DStreams and Receivers

    Some Input Sources: Sockets, File systems, Kafka, Twitter

    Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
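
The "chain of RDDs" idea can be sketched as below. `micro_batches` is a plain-Python model (a hypothetical helper, not part of Spark) showing how a batch interval slices a stream of timestamped events into consecutive batches, including empty ones. `run_socket_word_count` follows the socket word-count pattern from the Spark Streaming programming guide; the host, port, and application name are placeholders, and it needs a PySpark installation plus a process writing text to that socket.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive batches.

    Each returned list models the contents of one RDD in the DStream.
    """
    buckets = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        buckets.setdefault(index, []).append(value)
    if not buckets:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]


def run_socket_word_count(host="localhost", port=9999, batch_interval=1):
    """Word count over a socket input DStream (requires PySpark)."""
    # Imported here so the model above works without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # At least two local threads: one for the receiver, one for processing.
    sc = SparkContext("local[2]", "SocketWordCount")
    ssc = StreamingContext(sc, batch_interval)  # batch interval in seconds

    lines = ssc.socketTextStream(host, port)    # input DStream + receiver
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                              # print part of each batch

    ssc.start()
    ssc.awaitTermination()
```

For example, with a 1-second interval the events at 0.2 s and 0.9 s land in the first batch, an event at 1.5 s in the second, and a gap between 2 s and 3 s yields an empty batch.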

  • Spark Streaming: How does it work?

    Windowed operations

    DStream Transformations are translated to RDD Transformations

    Direct access to RDDs underneath

    Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
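
A windowed operation combines the last few batches each time the window slides. The sketch below models this in plain Python; `windowed_counts` is a hypothetical helper, and the window length and slide interval are expressed in numbers of batches (in Spark both must be multiples of the batch interval). In Spark Streaming itself the corresponding DStream methods are `window` and `reduceByKeyAndWindow`, and `transform` or `foreachRDD` give direct access to the underlying RDDs.

```python
from collections import Counter


def windowed_counts(batches, window_len, slide):
    """Count items over a sliding window of micro-batches.

    batches    -- list of lists, one inner list per micro-batch
    window_len -- window length, in batches
    slide      -- slide interval, in batches
    """
    results = []
    # A window result is produced every `slide` batches.
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        window = Counter()
        for batch in batches[start:end]:
            window.update(batch)
        results.append(dict(window))
    return results
```

With batches `[["a"], ["a", "b"], ["b"], ["c"]]`, a window of 3 batches, and a slide of 2, the first result covers batches 1 to 2 and the second covers batches 2 to 4, so the middle batch is counted in both windows.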

  • Spark Streaming: Demo

    What is the world talking about?: A Twitter stream analysis

  • Spark MLlib: Just analytics is not enough!

    Practical, scalable ML library with implementations of several common algorithms, and more being added

    spark.ml is an alternative high-level API based on spark.sql.DataFrame. It is out of scope for this talk.

  • Spark MLlib: Demo

    Let's see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine
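
Recommendation engines of this kind are commonly built with alternating least squares (ALS), which is what MLlib's RDD-based `ALS.train` in `pyspark.mllib.recommendation` provides. The transcript does not show the demo code, so the following is only a toy, plain-Python sketch of the idea with a single latent factor per user and per movie; the function names, the sample ratings, and the hyperparameters are all assumptions made for illustration.

```python
def als_rank1(ratings, n_iters=50, reg=0.01):
    """Toy rank-1 alternating least squares.

    ratings -- list of (user, movie, rating) tuples
    Returns (user_factors, movie_factors) as dicts of floats.
    """
    users = {u for u, _, _ in ratings}
    movies = {m for _, m, _ in ratings}
    user_f = {u: 1.0 for u in users}
    movie_f = {m: 1.0 for m in movies}

    for _ in range(n_iters):
        # Fix movie factors; each user's factor has a closed-form solution.
        for u in users:
            rated = [(m, r) for uu, m, r in ratings if uu == u]
            num = sum(r * movie_f[m] for m, r in rated)
            den = reg + sum(movie_f[m] ** 2 for m, _ in rated)
            user_f[u] = num / den
        # Fix user factors; solve for each movie's factor the same way.
        for m in movies:
            raters = [(u, r) for u, mm, r in ratings if mm == m]
            num = sum(r * user_f[u] for u, r in raters)
            den = reg + sum(user_f[u] ** 2 for u, _ in raters)
            movie_f[m] = num / den
    return user_f, movie_f


def predict(user_f, movie_f, user, movie):
    """Predicted rating is the product of the two learned factors."""
    return user_f[user] * movie_f[movie]
```

A usage sketch: train on `[("ann", "X", 1.0), ("ann", "Y", 2.0), ("bob", "X", 2.0)]` and call `predict(user_f, movie_f, "bob", "Y")` to estimate the rating Bob never gave. A real engine would use more latent factors and a held-out set to choose them.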

  • Spark: Streaming and MLlib, match made in heaven!

    MLlib provides algorithms that can learn from streaming data and, at the same time, be applied to that streaming data!

    Also, a large set of algorithms that can be trained offline and then applied to streaming data
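
Learning from a stream while applying the model to it usually means: for each micro-batch, predict with the current model, then update the model with that batch. MLlib's streaming estimators, such as `StreamingKMeans` and `StreamingLinearRegressionWithSGD`, expose this through `trainOn` and `predictOn`. The sketch below is not MLlib's implementation; it is a plain-Python model of the pattern using a one-weight linear regression trained by stochastic gradient descent, with a hypothetical function name and learning rate.

```python
def stream_predict_then_train(batches, lr=0.05):
    """Model y = w * x online over a sequence of micro-batches.

    batches -- iterable of lists of (x, y) pairs, one list per micro-batch
    Returns the final weight and the predictions made for each batch
    before that batch was used for training.
    """
    w = 0.0
    predictions = []
    for batch in batches:
        # Apply: score the incoming batch with the model learned so far.
        predictions.append([w * x for x, _ in batch])
        # Learn: one SGD step per example in the same batch.
        for x, y in batch:
            w += lr * (y - w * x) * x
    return w, predictions
```

On a stream where every example satisfies y = 2x, the weight starts at 0, so the first batch is scored with zeros, and the predictions improve as later batches arrive.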

  • Spark: What next?

    Spark SQL - A SQL-like layer over RDDs

    spark.ml

    Spark GraphX - A graph-processing abstraction over RDDs

    Apache Storm and Apache Flink - Modern streaming-first systems

    Q&A

  • Thank you!

    Slide deck will be made available at: http://blog.zemosolabs.com/

    Spark Docs are a great place to get started: http://spark.apache.org/docs/latest/programming-guide.html

    Acknowledgements:

    Code demos are from Databricks Training: https://databricks-training.s3.amazonaws.com/index.html

    Memes generated from ImgFlip.com: https://imgflip.com/