Spark Streaming and MLlib - Hyderabad Spark Group
Author: phaneendra-chiruvella

Hyderabad Spark Group & Zemoso Technologies present
Spark Streaming and MLlib
The stack for distributed, massively scalable, (near) real-time data processing and machine learning
Phaneendra Chiruvella
http://twitter.com/pcx66

Agenda
● Brief intro to Spark Core
● Introduction to Spark Streaming
● What is the world talking about?: A demo of Spark Streaming with Twitter
● Introduction to Spark MLlib
● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine

Spark: Lightning-fast cluster computing
● Data processing engine
● Distributed
● Massively scalable: the largest known cluster has 8,000 machines, processing PBs of data
● Programmable in Scala, Java, Python and R
● Interactive shell
● Both batch and stream processing
● Stable and robust: used in production at many companies
● Known to work well with other “Big data” tools like Kafka, Cassandra, HDFS, HBase, etc.
Image source: http://spark.apache.org/docs/latest/cluster-overview.html

Spark: How does it work?
● Every application has its own SparkContext
● Available cluster managers: Spark Standalone, YARN, Mesos
Image source: http://spark.apache.org/docs/latest/cluster-overview.html

Spark: Resilient Distributed Datasets
RDD is the fundamental abstraction of Spark, providing a rich, fault-tolerant layer over a cluster of machines
[Diagram: SparkContext, Executors, RDD]

Spark Core: Demo
● Creating RDDs
● Transformations
● Actions
● Cache
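
The demo code itself is not part of this transcript. As a rough illustration of the four steps listed above, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical stand-in written for this example, not the Spark API: it only models the idea that transformations are recorded lazily, actions trigger the computation, and `cache()` avoids replaying the lineage on later actions.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD: records a lineage of lazy steps."""

    def __init__(self, source, steps=()):
        self._source = list(source)   # "creating an RDD" from a local collection
        self._steps = tuple(steps)    # pending transformations (the lineage)
        self._use_cache = False
        self._cached = None           # filled by the first action after cache()
        self.computations = 0         # how many times the lineage was replayed

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def cache(self):
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._source
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cached = data
        return data

    # Actions: force the computation and return a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())

    def reduce(self, fn):
        return reduce(fn, self._compute())


numbers = ToyRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
total = even_squares.reduce(lambda a, b: a + b)   # first action: runs the lineage
how_many = even_squares.count()                   # second action: served from cache
print(even_squares.collect(), total, how_many, even_squares.computations)
# [4, 16, 36, 64, 100] 220 5 1
```

Without the `cache()` call, each of the three actions would replay the map and filter steps, which is the cost that caching is meant to avoid.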

Spark Streaming: batch processing is not enough!
● Extension to the Core API
● Micro-batches processed in near real time
● Latency reduced to seconds

Spark Streaming: How does it work?
● DStreams - Just a chain of RDDs
● Batch Interval, Input DStreams and Receivers
● Some Input Sources: Sockets, File systems, Kafka, Twitter
Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
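
To make the "chain of RDDs" idea concrete, the sketch below groups timestamped events into consecutive micro-batches, one per batch interval. It is plain Python rather than Spark code: `micro_batches` is a hypothetical helper, each returned list stands in for one RDD of the DStream, and timestamps are assumed to be non-negative seconds counted from stream start.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive batches.

    Each returned list plays the role of one RDD in the DStream.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // batch_interval)].append(value)
    last = max(buckets)
    # An interval with no data still produces an (empty) batch.
    return [buckets.get(i, []) for i in range(last + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(events, batch_interval=1.0)
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```

A smaller batch interval lowers latency but produces more, smaller batches to schedule, which is the trade-off behind choosing the interval.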

Spark Streaming: How does it work?
● Windowed operations
● DStream Transformations are translated to RDD Transformations
● Direct access to RDDs underneath
Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
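
A windowed operation combines the last few micro-batches every time the window slides. The following plain-Python sketch illustrates this with hashtag counts, in the spirit of the Twitter demo. `windowed_counts` is a hypothetical helper, not a Spark function, and the hashtag data is invented for the example. Window length and slide interval are expressed as a number of batches, mirroring Spark Streaming's requirement that both be multiples of the batch interval.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count items over a sliding window, both sizes measured in batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = Counter()
        for batch in batches[start:end]:
            window.update(batch)
        results.append(window)
    return results


# Invented sample data: one list of hashtags per micro-batch.
batches = [
    ["#spark", "#scala"],
    ["#spark"],
    ["#kafka", "#spark"],
    ["#kafka", "#kafka"],
]
windows = windowed_counts(batches, window_length=3, slide_interval=2)
top = [w.most_common(1)[0] for w in windows]
print(top)  # [('#spark', 2), ('#kafka', 3)]
```

The first window covers only two batches because the stream has just started. The second covers the last three batches, so the top hashtag changes as older data falls out of the window.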

Spark Streaming: Demo
● What is the world talking about?: A Twitter stream analysis

Spark MLlib: Just analytics is not enough!
● Practical, scalable ML library with implementations of several common algorithms, with more being added
● Alternative high-level API, spark.ml, based on spark.sql.DataFrame. Out of scope for this talk.

Spark MLlib: Demo
● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine
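
The transcript does not include the demo code or say which algorithm it used. As a hedged illustration of how a recommendation engine can work, the sketch below assumes a collaborative-filtering approach based on matrix factorization. MLlib's recommender uses alternating least squares (ALS). This example swaps in plain stochastic gradient descent to stay short, runs in plain Python, and uses invented ratings. All function names are hypothetical.

```python
import random


def train_factors(ratings, n_users, n_items, rank=2, epochs=500,
                  lr=0.02, reg=0.02, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(users, items, u, i)
            for k in range(rank):
                uk, ik = users[u][k], items[i][k]
                users[u][k] += lr * (err * ik - reg * uk)
                items[i][k] += lr * (err * uk - reg * ik)
    return users, items


def predict(users, items, u, i):
    """Predicted rating: dot product of the user and item factors."""
    return sum(a * b for a, b in zip(users[u], items[i]))


def rmse(ratings, users, items):
    se = sum((r - predict(users, items, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


def recommend(users, items, ratings, u):
    """Unrated items for user u, best predicted rating first."""
    seen = {i for uu, i, _ in ratings if uu == u}
    unseen = [i for i in range(len(items)) if i not in seen]
    return sorted(unseen, key=lambda i: predict(users, items, u, i), reverse=True)


# Invented data: users 0-1 like movies 0-1, users 2-3 like movies 2-3.
# User 0 has not rated movies 1 and 3.
ratings = [(0, 0, 5), (0, 2, 1),
           (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 3, 1),
           (2, 0, 1), (2, 1, 1), (2, 2, 5), (2, 3, 5),
           (3, 0, 1), (3, 1, 1), (3, 2, 5), (3, 3, 5)]

untrained = train_factors(ratings, 4, 4, epochs=0)
users, items = train_factors(ratings, 4, 4)
before, after = rmse(ratings, *untrained), rmse(ratings, *users_items) if False else rmse(ratings, users, items)
ranked = recommend(users, items, ratings, 0)
print(round(before, 2), round(after, 2), ranked)
```

Here the model should rank movie 1 above movie 3 for user 0, because user 0's known ratings resemble those of user 1.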

Spark: Streaming and MLlib, a match made in heaven!
● MLlib provides algorithms that can learn from streaming data and simultaneously be applied to that same stream
● Also, a large set of algorithms that can learn offline and then be applied to streaming data
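
The first bullet describes a predict-then-train loop: each incoming micro-batch is scored with the current model and then used to update it. MLlib ships streaming variants of some algorithms that follow this pattern, for example StreamingLinearRegressionWithSGD. The sketch below is a plain-Python model of the idea, not that class. `OnlineLinearModel` and `make_stream` are hypothetical, and the stream is synthetic data following y = 2x + 1.

```python
class OnlineLinearModel:
    """Hypothetical one-feature model y = w * x + b, updated per micro-batch."""

    def __init__(self, step_size=0.3):
        self.w = 0.0
        self.b = 0.0
        self.step_size = step_size

    def predict(self, x):
        return self.w * x + self.b

    def update(self, batch):
        """One gradient step on the mean squared error of a single batch."""
        if not batch:
            return
        n = len(batch)
        grad_w = sum((self.predict(x) - y) * x for x, y in batch) / n
        grad_b = sum(self.predict(x) - y for x, y in batch) / n
        self.w -= self.step_size * grad_w
        self.b -= self.step_size * grad_b


def make_stream(n_batches, batch_size):
    """Deterministic stand-in for a labeled stream following y = 2x + 1."""
    for n in range(n_batches):
        xs = [((n * batch_size + k) % 10) / 5.0 for k in range(batch_size)]
        yield [(x, 2.0 * x + 1.0) for x in xs]


model = OnlineLinearModel()
batch_errors = []
for batch in make_stream(n_batches=400, batch_size=5):
    # Apply first: score the incoming batch with the model learned so far.
    err = sum(abs(model.predict(x) - y) for x, y in batch) / len(batch)
    batch_errors.append(err)
    # Then learn: fold the same batch into the model.
    model.update(batch)

print(round(model.w, 3), round(model.b, 3))
```

The per-batch error shrinks as more batches arrive, which is what lets the model be trained and used on the same stream.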

Spark: What next?
● Spark SQL - A SQL-like layer over RDDs
● spark.ml
● Spark GraphX - A graph-processing abstraction over RDDs
● Apache Storm and Apache Flink - Modern streaming-first systems

Q&A

Thank you!
Slide deck will be made available at: http://blog.zemosolabs.com/
Spark Docs are a great place to get started: http://spark.apache.org/docs/latest/programming-guide.html
Acknowledgements:
Code demos are from Databricks Training
Memes generated from ImgFlip.com