www.hadoopexpress.com

Introduction to Apache Spark

An Overview of Features

© Net Serpents LLC, USA

08-24-2016

Agenda

• What is Apache Spark
• Major Vendors and Users
• Key Features
• Hadoop Vs Spark
• Spark Architecture
• Spark Streaming
• Spark Processing
• Examples and Use Cases

Part 1: Introduction

Disclaimer: Apache Hadoop and Spark are registered trademarks of the Apache Software Foundation (ASF). Hadoop Express and Net Serpents are not affiliated in any way with ASF. All educational material is created and owned by Net Serpents (dba Hadoop Express) and is intended only to provide training. Net Serpents does not own any of the products on which it provides training, many of which are owned by Apache while others are owned by companies such as SAS, Python and Oracle. Net Serpents LLC is committed to education and online learning. All recognizable terms and names of software, tools and programming languages that appear on this site belong to their respective copyright and/or trademark owners.

What is Apache Spark

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.

• General data processing engine compatible with Hadoop data
• Used to query, analyze and transform data
• Developed in 2009 at the AMPLab at the University of California, Berkeley
• Open sourced in 2010 and donated to the Apache Software Foundation in 2013
• Became a top-level Apache project in 2014
• First discussed in the Mesos white paper created at the AMPLab
• Optimized to run in memory
    • Up to 100 times faster than MapReduce when run in memory
    • Up to 10 times faster than MapReduce when run on disk

• A general-purpose data processing engine, suitable for use in a wide range of circumstances
• Handles interactive queries across large data sets, processing of streaming data from sensors or financial systems, and machine learning tasks
• Supports other data processing tasks with developer libraries and APIs
• Supports languages such as Java, Python, R and Scala
• Often used alongside Hadoop's HDFS
• Integrates equally well with other popular data storage subsystems such as HBase, Cassandra, MapR-DB, MongoDB and Amazon's S3

Major Vendors and Users

Major Vendors

• Databricks (founded by the creators of Spark at Berkeley)
• Cloudera
• Hortonworks
• MapR

Major Users

• More than 1000 organizations are using Spark in production
• IBM, Huawei, Baidu, Alibaba Taobao (e-commerce web site)
• Tencent (social networking site with 800 million users; 8000 compute nodes)
• Amazon, eBay, Yahoo! and many others

Key Features

Simplicity / Ease of Use

• Rich set of APIs to interact with large datasets
• Well documented
• Structured

Speed

In Memory / On Disk

• Spark is designed for speed, operating both in memory and on disk
• In 2014, Spark won the Daytona GraySort benchmarking challenge
• It sorted 100 terabytes of data on solid-state drives in 23 minutes; the previous winner used Hadoop and took 72 minutes

Stream processing

• Process "streams" of data from multiple sources simultaneously

Machine learning

• Well suited to training machine learning algorithms. Running broadly similar queries again and again, at scale, significantly reduces the time required to iterate through a set of possible solutions in order to find the most efficient algorithms.

Interactive analytics

• Explore data interactively by viewing query results and then either altering the initial query slightly or drilling deeper into results

Data integration

• Spark (and Hadoop) are increasingly being used to reduce the cost and time required for the ETL process

Development Language Support

• Scala
• Python
• Java
• SQL
• R

Hadoop Versus Spark

• Hadoop has cluster management features provided by YARN, while Spark needs a cluster manager to run on a cluster
• Spark can run on top of Hadoop and use its cluster manager (YARN), or run separately using other cluster managers such as Mesos
• Spark is not designed for data management or cluster management; Hadoop handles these well
• Hadoop provides advanced data security, which is missing in Spark
• Hadoop provides disaster recovery capabilities to Spark
• Spark provides fast in-memory processing of large data volumes, which Hadoop does not
• Spark provides enterprise-class streaming, graph processing and machine learning capabilities, which can be used by Hadoop

Spark is not a replacement for Hadoop. Spark and Hadoop complement each other.

Architecture

Integrations

Spark can run in the following modes:

• Standalone cluster mode
• On Hadoop YARN
• On Apache Mesos

Spark can access data in:

• HDFS
• Cassandra
• Hive
• HBase
• Tachyon
• Any Hadoop data source

SPARK Technology Stack

[Diagram: Spark SQL, Spark Streaming (Streaming), MLlib (Machine Learning), GraphX (Graph Computation) and SparkR (R on Spark) sit on top of the Spark Core Engine, which runs on the Standalone Scheduler, YARN or Mesos.]

SPARK Core Engine

• Basic functionality of Spark
• Uses RDDs (Resilient Distributed Datasets)
• Contains APIs for manipulating RDDs

Spark RDDs are collections of items distributed across compute nodes. The Spark core APIs allow these RDDs to be manipulated in parallel.
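The idea of a collection split across nodes and processed in parallel can be sketched in plain Python. This is not the Spark API: the helper names `partition`, `parallel_map` and `collect` are hypothetical, and a local thread pool stands in for the compute nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(items, num_partitions):
    """Split a list into roughly equal chunks, one per (pretend) compute node."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def parallel_map(partitions, func):
    """Apply func to every item, handling each partition in its own worker."""
    def run(part):
        return [func(x) for x in part]

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(run, partitions))


def collect(partitions):
    """Gather the items from all partitions back into a single list."""
    return [x for part in partitions for x in part]


numbers = list(range(1, 11))
parts = partition(numbers, 3)                    # assumed 3-node "cluster"
squared = parallel_map(parts, lambda x: x * x)   # transformation per partition
total = reduce(lambda a, b: a + b, collect(squared))
print(total)
```

Each partition is transformed independently, which is what lets a real engine spread the work over many machines; only the final `reduce` needs the results brought together.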

SPARK SQL

• Used for working with structured data
• Allows querying with SQL and HQL (Hive QL)
• Data sources can be Hive tables, Parquet, JSON and others
• Allows intermixing SQL with programmatic manipulation of RDDs in Python, Scala and Java

Note: Shark, developed at UC Berkeley, is the predecessor of Spark SQL

SPARK Streaming

• Used for processing live streams of data
• E.g., log files / message queues
• Can manipulate data stored on disk or in memory as it arrives in real time

Streaming offers high throughput and is fault tolerant and scalable.

MLlib

• Provides machine learning (ML) algorithms
• E.g., clustering, regression analysis, classification, collaborative filtering, model evaluation, data import
• Includes lower-level ML primitives such as gradient descent

MLlib is a library whose methods can scale out across a cluster.

GraphX

• Library for manipulating graphs
• Allows viewing data as graphs called property graphs
• The Pregel API is an API to create custom iterative graph algorithms

Property graphs are immutable, fault tolerant and distributed (just like RDDs).

SparkR

• Support for R in Spark is more recent (added in release 1.4)
• Allows data scientists working in R to use Spark capabilities

Spark Streaming

• Allows ingestion of data from a wide range of data sources
• Data processed by Spark can be stored in external systems or presented in dashboards

[Diagram: input sources (Kafka, Flume, HDFS, Twitter) feed Spark Streaming, which writes to databases, HDFS and dashboards.]

The input stream of data is divided into discrete chunks. Each chunk represents data collected during a brief period and is processed individually.

[Diagram: the input data stream is split into a discrete sequence of RDDs (at time 0, time 1, time 2), which the Spark engine turns into processed RDDs.]
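The chunking described above can be illustrated with a small plain-Python sketch. It is not Spark Streaming code: `discretize`, `process_batch`, the one-second batch length and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    # Return the non-empty batches in time order.
    return [batches[key] for key in sorted(batches)]


def process_batch(batch):
    """Process one batch on its own; here, count the words it contains."""
    counts = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)


# Hypothetical input stream: (arrival time in seconds, word)
stream = [(0.2, "spark"), (0.7, "hadoop"), (1.1, "spark"),
          (1.9, "spark"), (2.5, "kafka")]

micro_batches = discretize(stream, batch_seconds=1)
results = [process_batch(batch) for batch in micro_batches]
print(results)
```

Because every batch is processed on its own, the same batch-style logic can be reused for streaming input. This sketch simply skips intervals in which no data arrived.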

Spark Processing

[Diagram: Spark cluster overview, showing the driver program (SparkContext), the cluster manager and the worker nodes running executors. Source: https://spark.apache.org/docs/latest/cluster-overview.html]

• The driver program accesses Spark through a SparkContext object
• The SparkContext represents a connection to a computing cluster
• Once created, it can be used to build RDDs

Cluster Manager is an external service

• A default built-in cluster manager, called the Standalone cluster manager, is pre-packaged with Spark
• Hadoop YARN and Apache Mesos are two popular cluster managers
• The driver requests the cluster manager to provide resources for launching executors
• The cluster manager launches executors, which are then used by the driver to run tasks

Tasks are the smallest unit of physical execution

• The driver program implicitly creates a DAG (Directed Acyclic Graph) of operations
• This DAG is converted to a physical execution plan
• The driver uses the execution plan to run tasks on executors on the worker nodes
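A minimal plain-Python sketch of this "record first, execute later" behaviour follows. It is not how Spark is implemented: the `LazyDataset` class and its methods are hypothetical, and the graph here is a simple linear chain, the simplest case of a DAG.

```python
class LazyDataset:
    """Records operations instead of running them (hypothetical helper)."""

    def __init__(self, source, parent=None, op=None):
        self.source = source    # input data, only set on the root node
        self.parent = parent    # upstream node in the graph
        self.op = op            # (name, function) applied to the parent's output

    def map(self, func):
        return LazyDataset(None, parent=self, op=("map", func))

    def filter(self, func):
        return LazyDataset(None, parent=self, op=("filter", func))

    def plan(self):
        """Walk back to the root and return the operations in execution order."""
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.op)
            node = node.parent
        return node.source, list(reversed(steps))

    def collect(self):
        """Action: only now is the recorded plan executed."""
        data, steps = self.plan()
        for name, func in steps:
            if name == "map":
                data = [func(x) for x in data]
            else:  # "filter"
                data = [x for x in data if func(x)]
        return data


ds = LazyDataset([1, 2, 3, 4, 5]).map(lambda x: x * 10).filter(lambda x: x > 20)
step_names = [name for name, _ in ds.plan()[1]]
print(step_names)
print(ds.collect())
```

Calling `map` and `filter` only grows the graph; nothing is computed until `collect` asks for a result. Deferring the work like this is what gives an engine the chance to turn the whole graph into an execution plan before any task runs.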

Executors are processes that execute tasks

• Executors run the tasks and return results to the driver
• They also provide in-memory storage for RDDs

Spark Use Cases

Spark Streaming Use Cases

ETL (Extract Transform Load)

• With Spark Streaming it is possible to run ETL on streaming data that is continually cleaned and aggregated before it is moved to data stores

• This differs from the traditional approach to ETL, which is based on batch processing

• IoT data collected via sensors on devices can be continually collected, cleaned and stored in data stores for analytics

Online Data Enrichment

• With Spark Streaming it is possible to combine historical data of online customers with changes in their buying behavior and preferences to present targeted advertisements in real time

Trigger Event Detection

• Spark Streaming is used to detect events and respond to them quickly by raising alerts. E.g., fraudulent transaction detection by banking systems, and detecting changes in a patient's vital signs such as heartbeat and blood pressure in a hospital
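As a rough illustration of trigger event detection, the plain-Python sketch below raises an alert whenever a reading leaves an allowed range. It does not use Spark Streaming; the function name `detect_alerts`, the sample readings and the thresholds are assumptions chosen only for the example, not clinical guidance.

```python
def detect_alerts(readings, low, high):
    """Yield an alert for each reading that falls outside [low, high]."""
    for patient_id, heart_rate in readings:
        if heart_rate < low or heart_rate > high:
            yield (patient_id, heart_rate)


# Hypothetical stream of (patient id, heart rate in beats per minute)
readings = [("p1", 72), ("p2", 135), ("p1", 38), ("p3", 88)]

alerts = list(detect_alerts(readings, low=40, high=120))
print(alerts)
```

The generator handles one reading at a time, so the same check could be applied to each small batch of a live stream as it arrives.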

Session Analysis on the web

• Spark Streaming can be used to analyze a user's online activity on a web site and provide real-time recommendations. E.g., suggesting movies to a user on Netflix

Machine Learning Use Cases

MLlib is used for common big data functions like customer segmentation and sentiment analysis

Network Security: Predictive Intelligence can be used to inspect and detect threats on data packets arriving over the network before passing them to the storage platform.

Business examples

• Uber uses Kafka, Spark Streaming and HDFS to analyze terabytes of user data by collecting it and converting it from unstructured event data into structured data

• Pinterest uses an ETL pipeline to gain insights into how users all over the world are engaging with Pins, to help them select products to buy or plan trips to destinations

• Conviva uses Spark to optimize video streams and manage live video traffic of over 4 million video feeds per month

References

Special thanks to the following authors and contributors for providing valuable material used in this presentation:

• Apache website: spark.apache.org
• Learning Spark (Lightning fast data analytics) by Holden Karau, Andy Konwinski and Matei Zaharia
• Getting started on Apache Spark by James A Scott
• Top Apache Use Cases: https://www.qubole.com/blog/big-data/apache-spark-use-cases/
• Introduction to Apache Spark by Databricks.com (download slides: http://cdn.liber118.com/workshop/itas_workshop.pdf)

Thank You!

For queries, suggestions or feedback, please send an email to [email protected] or [email protected]