Rapid Cluster Computing with Apache Spark 2016


Transcript of Rapid Cluster Computing with Apache Spark 2016

Page 1: Rapid Cluster Computing with Apache Spark 2016

1

Rapid Cluster Computing with Apache Spark

Zohar Elkayam CTO, Brillix

[email protected]

Twitter: @realmgic

Page 2: Rapid Cluster Computing with Apache Spark 2016

2

Who am I?

• Zohar Elkayam, CTO at Brillix

• Programmer, DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years

• Oracle ACE Associate

• Part of ilOUG – Israel Oracle User Group

• Blogger – www.realdbamagic.com and www.ilDBA.co.il

Page 3: Rapid Cluster Computing with Apache Spark 2016

3

About Brillix

• We offer complete, integrated end-to-end solutions based on best-of-breed innovations in database, security, and big data technologies

• We provide complete end-to-end 24x7 expert remote database services

• We offer professional, customized on-site training, delivered by our top-notch, world-recognized instructors

Page 4: Rapid Cluster Computing with Apache Spark 2016

4

Some of Our Customers

Page 5: Rapid Cluster Computing with Apache Spark 2016

5

Agenda

• The Big Data problem and possible solutions

• Basic Spark Core

• Working with RDDs

• Working with Spark Cluster and Parallel programming

• Spark modules: Spark SQL and Spark Streaming

• Performance and Troubleshooting

Page 6: Rapid Cluster Computing with Apache Spark 2016

6

Our Goal Today

• Know more about Big Data and Big Data solutions

• Get a taste of Spark abilities – not becoming a Spark expert

• This is a starting point – don’t be afraid to try

Page 7: Rapid Cluster Computing with Apache Spark 2016

7

The REAL Agenda

• At the end of the seminar day a feedback form will be handed out; we would love to hear your opinion. A tablet will be raffled each day among those who fill out the feedback form!

10:30-10:45  Break

12:30-13:30  Lunch for all conference participants in Sasson Hall 2

15:00-15:15  Sweet break in the reception area

16:30  Going home

Page 8: Rapid Cluster Computing with Apache Spark 2016

The Challenge

And a Possible Solution…

Page 9: Rapid Cluster Computing with Apache Spark 2016

9

The Big Data Challenge

Page 10: Rapid Cluster Computing with Apache Spark 2016

10

Volume

• Big data comes in one size: Big.

• Size is measured in terabytes (10^12), petabytes (10^15), exabytes (10^18), and zettabytes (10^21)

• The storing and handling of the data becomes an issue

• Producing value out of the data in a reasonable time is an issue

Page 11: Rapid Cluster Computing with Apache Spark 2016

11

Data Grows Fast!

Page 12: Rapid Cluster Computing with Apache Spark 2016

12

Variety

• Big Data extends beyond structured data to include semi-structured and unstructured information: logs, text, audio, and video

• Wide variety of rapidly evolving data types requires highly flexible stores and handling

Unstructured vs. Structured:

– Objects vs. Tables

– Flexible vs. Columns and Rows

– Structure Unknown vs. Predefined Structure

– Textual and Binary vs. Mostly Textual

Page 13: Rapid Cluster Computing with Apache Spark 2016

13

Data Types By Industry

Page 14: Rapid Cluster Computing with Apache Spark 2016

14

Velocity

• The speed at which data is being generated and collected

• Streaming data and large volume data movement

• High velocity of data capture – requires rapid ingestion

• Might cause a backlog problem

Page 15: Rapid Cluster Computing with Apache Spark 2016

15

The Backlog Problem

• Caused when data is produced very quickly

• The time it takes to digest the new data is as long as, or very close to, the time it takes for the new data to be generated

• If the intake of new data is down for any reason, there is no way to catch up on the missing data, thus causing a backlog

Page 16: Rapid Cluster Computing with Apache Spark 2016

16

The Internet of Things (IoT)

Page 17: Rapid Cluster Computing with Apache Spark 2016

17

Value

Big data is not about the size of the data,

It’s about the value within the data

Page 18: Rapid Cluster Computing with Apache Spark 2016

18

Page 19: Rapid Cluster Computing with Apache Spark 2016

19

So, We Define the Big Data Problem…

• When the data is too big or moves too fast to handle in a sensible amount of time

• When the data doesn’t fit any conventional database structure

• When we think that we can still produce value from that data and want to handle it

• When the technical solution to the business need becomes part of the problem

Page 20: Rapid Cluster Computing with Apache Spark 2016

How to do Big Data

Page 21: Rapid Cluster Computing with Apache Spark 2016

21

Page 22: Rapid Cluster Computing with Apache Spark 2016

22

Big Data in Practice

• Big data is big: technological frameworks and infrastructure solutions are needed

• Big data is complicated:

– We need developers to manage handling of the data

– We need devops to manage the clusters

– We need data analysts and data scientists to produce value

Page 23: Rapid Cluster Computing with Apache Spark 2016

23

Possible Solutions: Scale Up

• Older solution: use a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data

– Process everything on a single server with hundreds of CPU cores

– Use lots of memory (1+ TB)

– Have a huge data store on high-end storage solutions

• Data needs to be copied to the processes in real time, so it is not suitable for large amounts of data (terabytes to petabytes)

Page 24: Rapid Cluster Computing with Apache Spark 2016

24

Another Solution: Distributed Systems

• A scale-out solution: use distributed systems, with multiple machines for a single job/application

• More machines means more resources

– CPU

– Memory

– Storage

• But the solution is still complicated: infrastructure and frameworks are needed

Page 25: Rapid Cluster Computing with Apache Spark 2016

25

Distributed Infrastructure Challenges

• We need infrastructure that is built for:

– Large-scale

– Linear scale out ability

– Data-intensive jobs that spread the problem across clusters of server nodes

• Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data

• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing

• High-end hardware is too expensive; we need a solution that uses cheaper hardware

Page 26: Rapid Cluster Computing with Apache Spark 2016

26

Distributed System/Frameworks Challenges

• How do we distribute our workload across the system?

• Programming complexity – keeping the data synced

• What to do with faults and redundancy?

• How do we handle security demands to protect highly-distributed infrastructure and data?

Page 27: Rapid Cluster Computing with Apache Spark 2016

A Big Data Solution: Apache Hadoop

Page 28: Rapid Cluster Computing with Apache Spark 2016

28

Apache Hadoop

• Open source project run by the Apache Software Foundation (2006)

• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure

• It has been the driving force behind the growth of the big data industry

• Get the public release from:

– http://hadoop.apache.org/core/

Page 29: Rapid Cluster Computing with Apache Spark 2016

29

Original Hadoop 1.0 Components

• HDFS (Hadoop Distributed File System) – a distributed file system that runs in a clustered environment

• MapReduce – programming technique for running processes over a clustered environment

• Hadoop's main idea: distribute the data to many servers, and then bring the program to the data

Page 30: Rapid Cluster Computing with Apache Spark 2016

30

Hadoop 2.0

• Hadoop 2.0 changed the Hadoop concept and introduced a better resource management model:

– Hadoop Common

– HDFS

– YARN

– Multiple data processing frameworks including MapReduce, Spark and others

Page 31: Rapid Cluster Computing with Apache Spark 2016

31

HDFS is...

• A distributed file system

• Designed to reliably store data using commodity hardware

• Designed to expect hardware failures and still stay resilient

• Intended for larger files

• Designed for batch inserts and appending data (no updates)

Page 32: Rapid Cluster Computing with Apache Spark 2016

32

Files and Blocks

• Files are split into 128 MB blocks (the single unit of storage)

– Managed by NameNode and stored on DataNodes

– Transparent to users

• Replicated across machines at load time

– Same block is stored on multiple machines

– Good for fault-tolerance and access

– Default replication factor is 3

Page 33: Rapid Cluster Computing with Apache Spark 2016

33

HDFS Node Types

• HDFS has three types of nodes

• DataNodes

– Responsible for the actual file storage

– Serve data from files to clients

• NameNode (master node)

– Distributes files in the cluster

– Responsible for replication between the DataNodes and for file block locations

• BackupNode

– A backup of the NameNode

Page 34: Rapid Cluster Computing with Apache Spark 2016

34

Using HDFS in Command Line

Page 35: Rapid Cluster Computing with Apache Spark 2016

35

How Does HDFS Look Like (GUI)

Page 36: Rapid Cluster Computing with Apache Spark 2016

36

Interfacing with HDFS

Page 37: Rapid Cluster Computing with Apache Spark 2016

37

MapReduce is...

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• MapReduce can be written in Java, Scala, C, Python, Ruby and others

• Bring the code to the data, not the data to the code

Page 38: Rapid Cluster Computing with Apache Spark 2016

38

MapReduce paradigm

• Implement two functions:

• MAP – takes a large problem, divides it into sub-problems, and performs the same function on all sub-problems

Map(k1, v1) -> list(k2, v2)

• REDUCE – combines the output from all sub-problems

Reduce(k2, list(v2)) -> list(v3)

• The framework handles everything else (almost)

• Values with the same key must go to the same reducer
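For intuition only, here is a tiny Python sketch of the two functions for a word count, written as plain functions rather than against any specific framework:

def map_fn(key, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): combine all counts gathered for one word
    return [sum(counts)]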

Page 39: Rapid Cluster Computing with Apache Spark 2016

39

Divide and Conquer

Page 40: Rapid Cluster Computing with Apache Spark 2016

40

YARN

• Takes care of distributed processing and coordination

• Scheduling

– Jobs are broken down into smaller chunks called tasks

– These tasks are scheduled to run on data nodes

• Task Localization with Data

– Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task

– Code is moved to where the data is

Page 41: Rapid Cluster Computing with Apache Spark 2016

41

YARN (cont.)

• Error Handling

– Failures are expected behavior, so tasks are automatically retried on other machines

• Data Synchronization

– Shuffle and Sort barrier re-arranges and moves data between machines

– Input and output are coordinated by the framework

Page 42: Rapid Cluster Computing with Apache Spark 2016

42

Submitting a Job

• The yarn script, given a JAR and a class argument, launches a JVM and executes the provided job

$ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob \
      /user/sample/hamlet.txt \
      /user/sample/wordcount/

Page 43: Rapid Cluster Computing with Apache Spark 2016

43

Resource Manager: UI

Page 44: Rapid Cluster Computing with Apache Spark 2016

44

Application View

Page 45: Rapid Cluster Computing with Apache Spark 2016

45

Hadoop Main Problems

• But Hadoop (the MapReduce framework, not the MapReduce paradigm) had some problems:

– Developing MapReduce was complicated – there was more than just business logic to develop

– Transferring data between stages requires the intermediate data to be written to disk (and then read by the next step)

– Multi-step jobs needed orchestration and abstraction solutions

– Initial resource management was very painful – the MapReduce framework was based on resource slots

Page 46: Rapid Cluster Computing with Apache Spark 2016

A Different Big Data Solution: Apache Spark

Page 47: Rapid Cluster Computing with Apache Spark 2016

47

Introducing Apache Spark

• Apache Spark is a fast, general engine for large-scale data processing on a cluster

• Originally developed at UC Berkeley in 2009 as a research project; it is now an open source Apache top-level project

• Main idea: use the memory resources of the cluster for better performance

• It is one of the fastest-growing projects today

Page 48: Rapid Cluster Computing with Apache Spark 2016

48

Spark Advantages

• High-level programming framework: programmers focus on what (logic), not how

• Cluster computing

– Managed by a single master node

– Distributed to worker nodes

– Scalable and fault tolerant by the framework

• Distributed storage

– Data is distributed when stored

– Replication for efficiency and fault tolerance

– “Bring the code to the data” state of mind

• High performance by in-memory utilization and caching

Page 49: Rapid Cluster Computing with Apache Spark 2016

49

Code Complexity

Page 50: Rapid Cluster Computing with Apache Spark 2016

50

Scalability

• Spark is highly scalable

• Adding worker nodes to the cluster increases performance in a near-linear fashion

– More processing power

– More memory

• Nodes can be added and removed according to load – ideal for cloud computing (EC2)

Page 51: Rapid Cluster Computing with Apache Spark 2016

51

Fault Tolerance

• Commodity hardware is bound to fail

• Spark is built for low-cost clusters and has fault tolerance embedded in the framework

– The system continues to function

– The master reassigns tasks to other nodes

– Data is replicated so there is no data loss

– Nodes rejoin the cluster automatically when they recover

Page 52: Rapid Cluster Computing with Apache Spark 2016

52

Spark and Hadoop

• Spark and Hadoop are built to co-exist

• Spark can use other storage systems (S3, local disks, NFS) but works best when combined with HDFS

– Uses Hadoop InputFormats and OutputFormats

– Fully compatible with Avro and SequenceFiles, as well as other types of files

• Spark can use YARN for running jobs

Page 53: Rapid Cluster Computing with Apache Spark 2016

53

Spark and Hadoop (cont.)

• Spark interacts with the Hadoop ecosystem:

– Flume

– Sqoop (watch out for DDoS on the database…)

– HBase

– Hive

• Spark can also interact with tools outside the Hadoop ecosystem: Kafka, NoSQL, Cassandra, XAP, Relational databases, and more

Page 54: Rapid Cluster Computing with Apache Spark 2016

54

The Spark Stack

• In addition to the core Spark engine, there are several related projects that extend Spark functionality

Page 55: Rapid Cluster Computing with Apache Spark 2016

55

Spark Use Cases

• Spark is especially useful when working with any combination of:

– Large amounts of data

– Intensive computing

– Iterative algorithms

• Spark does well because of:

– Distributed storage

– Distributed computing

– In-memory processing and pipelining

Page 56: Rapid Cluster Computing with Apache Spark 2016

56

Common Spark Use Cases

• ETL Processing

• Text Mining

• Index Building

• Graph Creation and analysis

• Pattern recognition

• Fraud detection

• Collaborative filtering

• Stream processing

• Prediction models

• Sentiment analysis

• Risk assessment

• Machine learning

Page 57: Rapid Cluster Computing with Apache Spark 2016

57

Examples for Common Use Cases

• Risk analysis

– How likely is this borrower to pay back a loan?

• Recommendation

– Which products will this customer enjoy?

• Predictions

– How can we prevent service outage?

• Classification

– How can we tell which mail is spam and which is not?

Page 58: Rapid Cluster Computing with Apache Spark 2016

58

Spark 2.0

Page 59: Rapid Cluster Computing with Apache Spark 2016

59

Spark 2.0 Major Changes

• Major performance improvements

• Unified DataFrames and Datasets for Scala/Java

• Changes to extensions

– Multiple changes to MLlib and machine learning

– Improvements to Spark Streaming [ALPHA]

– Spark SQL supports ANSI SQL 2003

– R UDFs

• Over 2,000 bugs fixed

• Current version: 2.0.2 (released Nov 14, 2016)

Page 60: Rapid Cluster Computing with Apache Spark 2016

Basic Spark

Spark Core

Page 61: Rapid Cluster Computing with Apache Spark 2016

61

What is Apache Spark?

• Apache Spark is a fast, general engine for large-scale data processing

• Written in Scala

• Spark Shell

– Interactive interface for learning, testing, and data exploration

– Scala or Python shells available

– Spark on R using RStudio and SparkR

• Spark Application

– Framework for running large-scale processes

– Supports Scala, Python, and Java

Page 62: Rapid Cluster Computing with Apache Spark 2016

62

Starting the Shells

$ pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

$ spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Spark context available as sc.
SQL context available as sqlContext.

scala>

Page 63: Rapid Cluster Computing with Apache Spark 2016

63

Spark Context

• Every Spark application requires a Spark Context

– The main entry point to the Spark API

• Spark shells provide preconfigured Spark contexts called sc

• Spark applications need to create their own Spark context instance

• The Spark Context is where the “magic” happens

Page 64: Rapid Cluster Computing with Apache Spark 2016

64

RDD: Resilient Distributed Datasets

• The basic component is the RDD

– Resilient – if the data is lost, it can be recreated from previous steps

– Distributed – appears as a single dataset, but is actually distributed across the nodes

– Dataset – the initial data can come from a file or be created programmatically

• RDDs are the fundamental unit of data in Spark

• Most Spark programs consist of performing operations on RDDs

Page 65: Rapid Cluster Computing with Apache Spark 2016

65

How to Create an RDD

• We can create an RDD in 3 ways:

– Create the RDD from a file, set of files, or directory

– Create RDD from data already in memory

– Create RDD by manipulating another RDD

• Later, we will talk about creating RDDs from different data sources…

Page 66: Rapid Cluster Computing with Apache Spark 2016

66

Creating RDD from Files

• Creating RDDs from files:

– Use SparkContext.textFile – it can read a single file, a comma-delimited list of files, or wildcards

– Each line in the file is a separate record in the RDD

• Files are referenced by an absolute or relative URI

– Absolute: file:/home/myfile.txt

– Relative (uses the default file system): myfile.txt

sc.textFile("myfile.txt")
sc.textFile("mydata/*.txt")
sc.textFile("myfile1.txt,myfile2.txt")

Page 67: Rapid Cluster Computing with Apache Spark 2016

67

Example: Creating RDD From Files (Scala)

scala> val mydata = sc.textFile("file:/home/spark/derby.log")
16/06/12 13:15:39 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 220.3 KB, free 960.5 KB)
16/06/12 13:15:39 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 26.4 KB, free 986.9 KB)
16/06/12 13:15:39 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:36895 (size: 26.4 KB, free: 511.4 MB)
16/06/12 13:15:39 INFO SparkContext: Created broadcast 3 from textFile at <console>:27
mydata: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:27

scala> mydata.count()
[..]
16/06/12 13:15:41 INFO DAGScheduler: Job 0 finished: count at <console>:30, took 0.489132 s
res3: Long = 13

scala> val mydata = sc.textFile("hdfs:/tmp/eventlog-demo.log")
[..]
mydata: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:27

scala> mydata.count()
16/06/12 13:16:41 INFO DAGScheduler: Job 1 finished: count at <console>:30, took 1.515042 s
res4: Long = 451210

Page 68: Rapid Cluster Computing with Apache Spark 2016

68

RDD Operations

• There are two types of operations

– Actions – return values

– Transformations – define a new RDD based on the current one(s)

• RDDs have a lazy execution model

– Transformations set things up

– Actions cause calculations to actually be performed

Page 69: Rapid Cluster Computing with Apache Spark 2016

69

RDD Operations: Actions

• Common actions:

– take(n) – return an array of the first n elements

– collect() – return an array of all elements

– saveAsTextFile(file) – save the RDD to a text file

– count() – return the number of elements in the RDD

scala> mydata.take(2)
res6: Array[String] = Array(2016-06-08T16:48:49|121.170.77.248|FR|SUCCESS, 2016-06-08T16:48:49|142.13.127.131|FR|SUCCESS)

scala> for (line <- mydata.take(2)) println(line)
2016-06-08T16:48:49|121.170.77.248|FR|SUCCESS
2016-06-08T16:48:49|142.13.127.131|FR|SUCCESS

scala> mydata.count()
res4: Long = 451210

Page 70: Rapid Cluster Computing with Apache Spark 2016

70

RDD Operations: Transformations

• Transformations define a new RDD based on the current one

• RDDs are immutable

– Data in an RDD cannot change

– Transform in sequence to modify as needed

• Operations can be chained (piped) for multiple operations

• Common transformations:

– map(function) – creates a new RDD by performing a function on each record in the base RDD

– filter(function) – creates a new RDD by including/excluding records in the base RDD according to the Boolean function it receives
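As an illustration, here is a minimal PySpark sketch chaining the two transformations above; the log file name and the SUCCESS filter condition are hypothetical:

> logs = sc.textFile("mydata.log")
> success = logs.filter(lambda line: "SUCCESS" in line)   # keep only matching records
> upper = success.map(lambda line: line.upper())          # transform each record
> upper.take(3)                                           # an action triggers the execution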

Page 71: Rapid Cluster Computing with Apache Spark 2016

What is Functional Programming?

Page 72: Rapid Cluster Computing with Apache Spark 2016

72

Functional Programming

• Spark depends on the concept of functional programming

– Functions are the fundamental unit of programming

– Functions have input and output only – there is no state or side effects

• Key concepts

– Passing functions as input to other functions

– Anonymous functions

Page 73: Rapid Cluster Computing with Apache Spark 2016

73

Passing Functions as Parameters

• Some of the RDD operations take functions as parameters

• The received function is applied to all the records in the RDD

• For example:

– The map function gets a function as a parameter; that function can, for example, convert each record in the RDD into a key-value tuple (a Pair RDD)

– The filter function gets a function that checks each record in the RDD and returns a Boolean value used to filter the data

Page 74: Rapid Cluster Computing with Apache Spark 2016

74

Defining Functions and Passing Them

• Python Example:

> def toUpper(s):
      return s.upper()

> mydata.map(toUpper).take(3)

• Scala Example:

> def toUpper(s: String): String = { s.toUpperCase }

> mydata.map(toUpper).take(2)

Page 75: Rapid Cluster Computing with Apache Spark 2016

75

Anonymous Functions

• Scala, Python, R, and Java can declare anonymous one-time functions

• These functions are in-line functions without a name, often used for one-off functions

• Spark doesn’t require the use of anonymous functions

• Examples:

– Python: lambda x: …

– Scala: x => …

– Java 8: x -> …

Page 76: Rapid Cluster Computing with Apache Spark 2016

76

Using Anonymous Functions

• Python:

> mydata.map(lambda line: line.upper()).take(3)

• Scala:

> mydata.map(line => line.toUpperCase()).take(2)

• Scala, using “_” as the anonymous parameter:

> mydata.map(_.toUpperCase()).take(2)

Page 77: Rapid Cluster Computing with Apache Spark 2016

Working with RDDs

Page 78: Rapid Cluster Computing with Apache Spark 2016

78

RDD Data Types

• RDDs can hold any type of element

– Primitive: integers, chars, Booleans, etc.

– Sequences: strings, lists, tuples, dictionaries, arrays, and all kinds of nested data types

– Scala and Java serialized types

– Mixed types

• Some RDD types have additional functionality

– Pair RDDs consist of key-value pairs

– Double RDDs consist of numeric data

Page 79: Rapid Cluster Computing with Apache Spark 2016

79

Generating RDDs From Collections

• We can create RDDs from collections rather than from files. Common uses: testing, integration, and cases where we need to generate data programmatically

• Example:

> import random

> randomnumlist = [random.uniform(0, 10) for _ in xrange(10000)]

> randomrdd = sc.parallelize(randomnumlist)

> print "Mean is %f" % randomrdd.mean()

Page 80: Rapid Cluster Computing with Apache Spark 2016

80

Common RDD Operations (1)

• Common transformations

– flatMap – maps an element to 0 or more output elements

– distinct – return a new dataset that contains the unique elements of the original RDD

• Other RDD operations

– first – return the first element in the dataset

– foreach – apply a function to each element in an RDD

– top(n) – return the n largest elements using natural ordering

Page 81: Rapid Cluster Computing with Apache Spark 2016

81

Common RDD Operations (2)

• Sampling operations

– sample(percent) – create a new RDD with a sampling of elements

– takeSample(percent) – return an array of sampled elements (with or without replacement)

• Double RDD operations

– Statistical functions: mean, sum, stdev etc.

Page 82: Rapid Cluster Computing with Apache Spark 2016

82

Using flatMap and distinct

> sc.textFile(file) \
     .flatMap(lambda line: line.split()) \
     .distinct()

> sc.textFile(file).flatMap(line => line.split("\\W")).distinct

Input:
I see the world
and the world see me

After flatMap: I, see, the, world, and, the, world, see, me

After distinct: I, see, the, world, and, me

Page 83: Rapid Cluster Computing with Apache Spark 2016

83

Pair RDD

• Pair RDDs are a special form of RDD

– Each element must be a key-value pair (tuple)

– Keys and values can be any type

• Use Pair RDDs when using MapReduce algorithms

• Many additional functions for common data processing needs: sorting, grouping, joining, etc.

Page 84: Rapid Cluster Computing with Apache Spark 2016

84

Simple Pair RDD

• Create a Pair RDD from a comma-delimited file (CSV)

> users = sc.textFile("file:/users.csv") \
      .map(lambda line: line.split(",")) \
      .map(lambda fields: (fields[0], fields[1]))

> val users = sc.textFile("file:/users.csv")
      .map(line => line.split(','))
      .map(fields => (fields(0), fields(1)))

Input file:
user001,Zohar Elkayam
user002,Efrat
user009,Tamar
user100,Ido

Resulting pairs:
(user001, Zohar Elkayam)
(user002, Efrat)
(user009, Tamar)
(user100, Ido)

Page 85: Rapid Cluster Computing with Apache Spark 2016

85

Creating Pair RDDs

• Commonly used functions that create Pair RDDs

– map

– flatMap

– flatMapValues

– keyBy

• Deciding what the key is and what the value is matters a great deal; the first step in most workflows is to convert the base RDD to a Pair RDD
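For illustration, a minimal PySpark sketch of keyBy using the users.csv layout from the previous slide (the field positions are assumptions):

> users = sc.textFile("file:/users.csv").map(lambda line: line.split(","))
> byId = users.keyBy(lambda fields: fields[0])   # key = user id, value = the whole record
> byId.take(1)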

Page 86: Rapid Cluster Computing with Apache Spark 2016

86

Complex Values

• When creating key-value pairs, the value can itself be a complex value

> users = sc.textFile("file:/users.csv") \
      .map(lambda line: line.split(",")) \
      .map(lambda fields: (fields[0], (fields[1], fields[2])))

Input file:
user001,Zohar,Elkayam
user002,Efrat,Elkayam
user009,Tamar,Fritzi
user100,Ido,Bob

Resulting pairs:
(user001, (Zohar, Elkayam))
(user002, (Efrat, Elkayam))
(user009, (Tamar, Fritzi))
(user100, (Ido, Bob))

Page 87: Rapid Cluster Computing with Apache Spark 2016

87

Using flatMapValues

• flatMapValues converts multiple values that share the same key into separate key-value pairs

Input file:
0001 a1:b1:c1
0002 a2:b2
0003 c3

> users = sc.textFile("file") \
      .map(lambda line: line.split(" ")) \
      .map(lambda fields: (fields[0], fields[1])) \
      .flatMapValues(lambda val: val.split(':'))

After map: (0001, a1:b1:c1), (0002, a2:b2), (0003, c3)

After flatMapValues: (0001, a1), (0001, b1), (0001, c1), (0002, a2), (0002, b2), (0003, c3)

Page 88: Rapid Cluster Computing with Apache Spark 2016

88

MapReduce: Reminder

• MapReduce is a common programming paradigm

• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel

• Hadoop MapReduce was the first major framework implementation, but it had some limitations:

– Each job can have only one Map and one Reduce

– Job output and intermediate data must be saved to files

• Spark implements the MapReduce model with greater flexibility

– Map and Reduce functions can be interspersed

– Results are stored in memory (or spilled to disk if there is not enough memory)

– Operations can easily be chained

Page 89: Rapid Cluster Computing with Apache Spark 2016

89

MapReduce in Spark

• MapReduce in Spark works on Pair RDDs

• Map phase:

– Operates on one record at a time

– “Maps” each record to one or more new records

– Use map and flatMap for the Mapping phase

• Reduce phase– Works on Map output

– Consolidates multiple records

– reduceByKey operation

Page 90: Rapid Cluster Computing with Apache Spark 2016

90

Word Count using Spark

• The famous word count example, made very easy using Spark:

> counts = sc.textFile("file") \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda v1, v2: v1 + v2)

Input: the cat sat on the mat

After flatMap: the, cat, sat, on, the, mat

After map: (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1)

After reduceByKey: (the, 2), (cat, 1), (sat, 1), (on, 1), (mat, 1)

Page 91: Rapid Cluster Computing with Apache Spark 2016

91

Why Is It Always Word Count?!

• Word count is an easy explanation of many things:

– File handling

– Breaking lines into key-value pairs

– Reducing the keys and handling values

• Calculated statistics are often simple aggregate functions, just like in the word count example

• Many common tasks are very similar to word count – log file analysis, for example

Page 92: Rapid Cluster Computing with Apache Spark 2016

92

reduceByKey

• The reduceByKey function behaves like the original reduce phase in the MapReduce model

• The function reduceByKey receives must be binary – it combines two values of the same key

• In order to work properly, the function must be

– Commutative (x+y = y+x)

– Associative ((x+y)+z = x+(y+z))

• All keys are being handled together (piped)

Page 93: Rapid Cluster Computing with Apache Spark 2016

93

Other Pair RDD Operations

• Pair RDDs can do other things besides reduce

– countByKey – returns a map with the count of occurrences of each key

– groupByKey – group all values for each key in an RDD

– sortByKey – sort in ascending/descending order by key

– join – return RDD containing all pairs with matching keys from two pair RDDs

Page 94: Rapid Cluster Computing with Apache Spark 2016

94

Pair RDD Operations Examples

> grpUsers = users.groupByKey()

> sortUsers = users.sortByKey(ascending=False)

Input pairs: (0001, a1), (0001, b1), (0001, c1), (0002, a2), (0002, b2), (0003, c3)

After groupByKey: (0001, [a1, b1, c1]), (0002, [a2, b2]), (0003, [c3])

After sortByKey (descending): (0003, c3), (0002, b2), (0002, a2), (0001, c1), (0001, b1), (0001, a1)

Page 95: Rapid Cluster Computing with Apache Spark 2016

95

Joining RDDs

• Using joins is a common programming pattern – a minimal example follows below

– Map separate datasets into key-value pair RDDs

– Join by key (make sure key is the same data type and same key structure)

– Map joined data into the desired format

– Save, display or continue processing the new RDD
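A minimal hedged PySpark sketch of the join pattern; the data, key names, and the displayed output ordering are illustrative only:

> names = sc.parallelize([("user001", "Zohar"), ("user002", "Efrat")])
> scores = sc.parallelize([("user001", 95), ("user002", 88), ("user009", 70)])
> joined = names.join(scores)     # inner join on the key; unmatched keys are dropped
> joined.collect()
[('user002', ('Efrat', 88)), ('user001', ('Zohar', 95))]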

Page 96: Rapid Cluster Computing with Apache Spark 2016

96

Other Pair RDD Operations (cont.)

• Other pair operations

– keys – return an RDD of just the keys (no values)

– values – return an RDD of just the values

– lookup(key) – return the values for a specific key

– leftOuterJoin, rightOuterJoin – left and right outer joins

– mapValues, flatMapValues – execute function on just the values, keeping the key the same

• See PairRDDFunctions class for a full (long) list of functions

Page 97: Rapid Cluster Computing with Apache Spark 2016

Working with Clusters

Spark Clusters and Resource Management

Page 98: Rapid Cluster Computing with Apache Spark 2016

98

Spark Clusters

• Spark is designed to run on a cluster

– Spark includes basic cluster management called “Spark Standalone”

– Can also run on Hadoop and Mesos

• The jobs are broken down into tasks and sent to worker nodes

– Each worker node runs executors in separate, standalone JVMs

• Spark cluster workers can work closely with HDFS

Page 99: Rapid Cluster Computing with Apache Spark 2016

99

Spark Cluster Options

• Locally (No distributed processing)

• Locally with multiple Worker threads

• On an actual cluster, resources managed by:

– Spark Standalone

– Apache Hadoop YARN (Yet Another Resource Negotiator)

– Apache Mesos

Page 100: Rapid Cluster Computing with Apache Spark 2016

100

Spark Cluster Terms

• Cluster – a group of computers working together

– e.g., Spark Standalone cluster, Hadoop (YARN), Mesos

• Node – an individual computer in the cluster

– The master node manages the distributed work

– Worker nodes do the actual work

• Daemon – a program running on a node

– Each performs different functions in the cluster

Page 101: Rapid Cluster Computing with Apache Spark 2016

101

Spark Standalone Cluster Daemons

• Spark Master (cluster manager)

– One per cluster

– Manages applications and distributes individual tasks to Spark Workers

• Spark Worker

– One per node

– Starts and monitors executors for applications

– Spark workers can run on Hadoop DataNodes – for reading data from HDFS efficiently

Page 102: Rapid Cluster Computing with Apache Spark 2016

102

Spark Driver

• The Spark driver is the main program

• Runs in a Spark Shell or as a Spark Application

• The driver creates the Spark Context for the run

• Communicates with the cluster manager to distribute the work between the workers

Page 103: Rapid Cluster Computing with Apache Spark 2016

103

Driver Modes: Client vs. Cluster

• The driver runs outside the cluster by default

– Called “client” deploy mode

– Most common

– Required for interactive use

• We can run the driver from one of the worker nodes in a cluster

– Called “cluster” deploy mode

– Doesn’t require interaction with cluster’s nodes

Page 104: Rapid Cluster Computing with Apache Spark 2016

104

Running a Cluster Application


Page 109: Rapid Cluster Computing with Apache Spark 2016

109

Supported Cluster Resource Managers

• Spark Standalone (EC2 or private)

– Included with Spark

– Easy to install and run

– Limited configurability and scalability

– Useful for testing, development, or smaller systems

• Hadoop YARN

– Requires a Hadoop cluster

– Common for production sites

– Allows sharing cluster resources with other applications (MapReduce, Hive, etc.)

• Apache Mesos

– Original platform for Spark

– Less common

Page 110: Rapid Cluster Computing with Apache Spark 2016

110

Setting sc.master

• Using the --master parameter, we set the SparkContext.master parameter in the Spark shells

• The Spark shell can set different cluster masters from the command line:

– URL – the URL of the cluster manager (Spark Standalone master or Mesos master)

– local[*] – runs locally with as many threads as cores (default)

– local[n] – runs locally with n worker threads

– local – does not use distributed processing

– yarn – use YARN as the cluster manager

• Example:

$ pyspark --master spark://sparkmasternode:7077

$ spark-shell --master yarn

Page 111: Rapid Cluster Computing with Apache Spark 2016

111

UI Management

• The Spark Standalone master provides a UI for monitoring and history

• The UI runs by default on port 18080

Page 112: Rapid Cluster Computing with Apache Spark 2016

112

Spark Job Details UI

Page 113: Rapid Cluster Computing with Apache Spark 2016

113

Spark Job Details Timeline

Page 114: Rapid Cluster Computing with Apache Spark 2016

114

Stage Breakdown

Page 115: Rapid Cluster Computing with Apache Spark 2016

Parallel Programming with Spark

Page 116: Rapid Cluster Computing with Apache Spark 2016

116

Datasets in the Cluster

• Resilient Distributed Datasets

– Data is partitioned across the worker nodes

• Partitioning is done by the Spark framework – no action is needed from the programmer

• We can control the number of partitions

Page 117: Rapid Cluster Computing with Apache Spark 2016

117

Working With Partitioned Files

• Partitioning a single file is based on the size of the file – the default is 2 partitions

• We can control the number of partitions (optional):

sc.textFile("myfile.txt", 4)

• The more partitions we have, the more parallel the program is

Page 118: Rapid Cluster Computing with Apache Spark 2016

118

Working With Multiple Files

• When working with sc.textFile, we can use a wildcard or a directory

– Each file will become at least one partition

– Operations can be done per file (JSON and XML parsing)

> sc.textFile("mydir/*")

• We can automatically create a Pair RDD by using sc.wholeTextFiles("mydir")

– Useful for many small files

– Key = file name

– Value = file contents

Page 119: Rapid Cluster Computing with Apache Spark 2016

119

Running Operations on Partitions

• Most operations work on single elements

• Some operations can be run at the partition level

– foreachPartition – call a function for each partition

– mapPartitions – create a new RDD by executing a function on each partition in the RDD (transformation)

– mapPartitionsWithIndex – same as mapPartitions but also passes the partition index (transformation)

• Commonly used for initializations

• Functions in partition operators get iterators as argument to iterate through the elements
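For illustration, a minimal PySpark sketch of mapPartitions, where a (hypothetical) per-partition setup step runs once per partition instead of once per element:

> def process_partition(records):
      prefix = "row:"              # per-partition initialization (e.g., opening a connection)
      for rec in records:          # records is an iterator over the partition's elements
          yield prefix + rec

> mydata.mapPartitions(process_partition).take(3)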

Page 120: Rapid Cluster Computing with Apache Spark 2016

120

HDFS and Data Locality

• By default, Spark partitions file-based RDDs by block; each block is loaded into a single partition

• An action triggers execution: tasks on executors load data from blocks into partitions

• When using HDFS, workers run close to their respective blocks

• The collect operation copies the data from the workers back to the driver (no locality here)

Page 121: Rapid Cluster Computing with Apache Spark 2016

121

Parallel Operations

• RDD operations run in parallel on partitions

• Some operations preserve partitioning

– map, flatMap, filter

• Some operations need repartitioning

– reduce, sort, group

• Repartitioning requires the data to move between workers – thus hurting performance

– Try to reduce the amount of data movement by running better sequences of operations before the shuffle stages

Page 122: Rapid Cluster Computing with Apache Spark 2016

122

Execution Terminology

• Job – a set of tasks executed as a result of an action

• Stage – a set of tasks in a job that can be executed in parallel

• Task – an individual unit of work sent to one executor

Page 123: Rapid Cluster Computing with Apache Spark 2016

123

From Job to Stages

• Spark calculates a Directed Acyclic Graph (DAG) of RDD dependencies

• Narrow operations:

– Only one child depends on the RDD

– No shuffle required between nodes

– Can be collected into a single stage

– e.g., map, filter, union

• Wide operations:

– Multiple children depend on the RDD

– Define a new stage

– e.g., reduceByKey, join, groupByKey

Page 124: Rapid Cluster Computing with Apache Spark 2016

124

Controlling the Level of Parallelism

• Wide operations partition result RDDs

– More partitions = more parallel tasks

– The cluster will be under-utilized if there are too few partitions

• We can control the number of partitions:

– Setting a default property (spark.default.parallelism)

– Setting an optional parameter in the function call

> counts = sc.textFile("file") \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda v1, v2: v1 + v2, 10)

Page 125: Rapid Cluster Computing with Apache Spark 2016

125

Spark Lineage

• Each transformation operation creates a new child RDD

• Spark keeps track of the parent RDD for each new RDD

• Child RDDs depend on their parents

• Action operations execute the parent transformations

• Each action re-executes the lineage transformations, starting with the base RDD

Page 126: Rapid Cluster Computing with Apache Spark 2016

126

Caching

• Since running all the transformations from the base RDD can be expensive, RDDs can be cached

• Caching an RDD means saving it in memory to shorten the dependency chain

• Caching is a suggestion to Spark

– If not enough memory is available, transformations will be re-executed when needed

– Cache will never spill to disk – it’s in memory only
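A minimal hedged sketch of caching a reused RDD in PySpark; the log file name and filter condition are hypothetical:

> errors = sc.textFile("mydata.log").filter(lambda line: "ERROR" in line)
> errors.cache()      # suggestion to Spark: keep this RDD in memory once computed
> errors.count()      # first action computes the RDD and caches it
> errors.take(5)      # later actions reuse the cached data instead of re-reading the file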

Page 127: Rapid Cluster Computing with Apache Spark 2016

127

Caching and Fault-Tolerance

• Resilient Distributed Datasets

– Resiliency is a product of tracking lineage

• RDDs can always be recomputed from their base if needed

• In case a worker fails, the task can be re-run on a different node, recalculating the data from the base RDD using the same partition

Page 128: Rapid Cluster Computing with Apache Spark 2016

128

Persistence Levels

• Cache stores data in memory only

• The persist method offers other options, called storage levels

• Storage location – where to store the data

– MEMORY_ONLY – same as cache

– MEMORY_AND_DISK – stores partitions in memory, uses disk if there is not enough memory

– MEMORY_ONLY_SER – stores partitions as serialized Java objects

– DISK_ONLY – stores partitions on disk

• Replication – store each partition on two cluster nodes

– MEMORY_ONLY_2, MEMORY_AND_DISK_2, DISK_ONLY_2
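A minimal sketch of choosing a storage level in PySpark, reusing the (hypothetical) errors RDD from the caching example:

> from pyspark import StorageLevel
> errors.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk when memory runs out
> errors.count()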

Page 129: Rapid Cluster Computing with Apache Spark 2016

129

Persistence Options

• To stop persisting and remove the RDD from memory and/or disk

– rdd.unpersist()

• To change the persistence level, we need to unpersist and then re-persist the data at the new level

Page 130: Rapid Cluster Computing with Apache Spark 2016

130

When and Where to Cache

• When should we cache a dataset?

– When the dataset is likely to be re-used (machine learning, iterative algorithms, etc.)

– When the calculation is long and we don’t want to lose steps in case of failure

• How to choose a persistence level

– Memory only – whenever possible; use serialized objects to reduce memory usage if possible

– Disk – choose when recomputation is more expensive than a disk read (filtering a large dataset, expensive functions)

– Replication – choose when recomputation is more expensive than the extra memory

Page 131: Rapid Cluster Computing with Apache Spark 2016

131

Checkpointing

• Maintaining RDD lineage provides resilience but can also cause problems when the lineage gets very long

• Recovery can be very expensive

• Potential stack overflow

• Solution: checkpointing – saving the data to HDFS (reliable) or to local disk (local)

– HDFS provides fault-tolerant storage across nodes

– Lineage is not saved

– Must be checkpointed before any actions on the RDD
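A minimal hedged sketch of RDD checkpointing in PySpark; the checkpoint directory is hypothetical:

> sc.setCheckpointDir("hdfs:/tmp/spark-checkpoints")   # reliable location for checkpoint data
> errors.checkpoint()     # mark the RDD for checkpointing before running any action on it
> errors.count()          # the action computes the RDD and writes the checkpoint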

Page 132: Rapid Cluster Computing with Apache Spark 2016

Spark Modules

Spark SQL and Spark Streaming

Page 133: Rapid Cluster Computing with Apache Spark 2016

133

The Spark Stack

• In addition to the core Spark engine, there are several related projects that extend Spark functionality

Page 134: Rapid Cluster Computing with Apache Spark 2016

134

Spark SQL

• Spark SQL is a Spark module for structured data processing

• Spark SQL provides more information about the structure of both the data and the computation being performed, allowing more optimization

• Supports basic SQL and HiveQL

• Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface

• For more information:

http://spark.apache.org/docs/latest/sql-programming-guide.html

Page 135: Rapid Cluster Computing with Apache Spark 2016

135

DataFrames, Datasets and RDDs

• A DataFrame is a distributed collection of data organized into named columns

– It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood

• A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs together with the benefits of Spark SQL’s optimized execution engine

– A Dataset can be constructed from JVM objects and then manipulated using functional transformations

Page 136: Rapid Cluster Computing with Apache Spark 2016

136

Spark SQL Context

• The Spark SQL entry point is the SQL Context:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

sqlContext <- sparkRSQL.init(sc)

Page 137: Rapid Cluster Computing with Apache Spark 2016

137

Running a Query

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")

val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

sqlContext <- sparkRSQL.init(sc)
df <- sql(sqlContext, "SELECT * FROM table")
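A minimal PySpark sketch of inspecting the resulting DataFrame; the column name used in the filter is hypothetical:

df.printSchema()                         # print the column names and types
df.show(5)                               # display the first 5 rows
df.filter(df["salary"] > 10000).count()  # hypothetical column, for illustration only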

Page 138: Rapid Cluster Computing with Apache Spark 2016

138

Connecting Oracle and Spark SQL

• Connecting Spark SQL and Oracle is easy when using JDBC:

• More about it (and a demo): https://www.realdbamagic.com/spark-sql-and-oracle-database-integration/

scala> val employees = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:zohar/zohar@//localhost:1521/single", "dbtable" -> "hr.employees"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
employees: org.apache.spark.sql.DataFrame = [EMPLOYEE_ID: decimal(6,0), FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: timestamp, JOB_ID: string, SALARY: decimal(8,2), COMMISSION_PCT: decimal(2,2), MANAGER_ID: decimal(6,0), DEPARTMENT_ID: decimal(4,0)]

Page 139: Rapid Cluster Computing with Apache Spark 2016

Spark Streaming

Ishay Wayner

[email protected]

Page 140: Rapid Cluster Computing with Apache Spark 2016

140

Agenda

• What is stream processing?

• Principles

• Stream processing with Spark

• Demo

• How does it compare?

Page 141: Rapid Cluster Computing with Apache Spark 2016

141

Spark Streaming

• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams

Page 142: Rapid Cluster Computing with Apache Spark 2016

142

What is Stream Data Processing?

• Stream – a constant flow of data events

• Data – any occurrence that happens at a clearly defined time and is recorded in a collection of fields

• Processing – the act of analyzing data

Page 143: Rapid Cluster Computing with Apache Spark 2016

143

When Do You Use Stream Processing?

• Wherever you have a continuous stream of data

• This data needs to be processed quickly so that the business can react

• Examples include trading, fraud detection, spam filtering, and many more

Page 144: Rapid Cluster Computing with Apache Spark 2016

144

Data Delivery Methods

• At-most-once – possibility for data loss.

• At-least-once – messages may be redelivered.

• Exactly-once – each message is only delivered once.

Page 145: Rapid Cluster Computing with Apache Spark 2016

145

Keep The Data Moving

• Data events are processed in the stream (in memory)

• Storage operations add unnecessary latency

• Data will be stored on disk at the end of the stream

Page 146: Rapid Cluster Computing with Apache Spark 2016

146

Window Consideration

• Windowing: grouping of events based on time

• Windowing can also be data-driven

• Out of order events make windowing tricky

• There are different types of windowing including fixed, sliding and count windows

Page 147: Rapid Cluster Computing with Apache Spark 2016

147

Window Types

[Diagram: input events arriving over time, comparing Fixed windows and Sliding windows]

Page 148: Rapid Cluster Computing with Apache Spark 2016

Stream Processing With Spark

Page 149: Rapid Cluster Computing with Apache Spark 2016

149

Spark Streaming

• An extension of the core Spark API

• Fault tolerant

• Built in support for merging streaming data with historical data

• Supports Scala, Python and Java

Page 150: Rapid Cluster Computing with Apache Spark 2016

150

Spark Streaming - Metrics

• Exactly-once delivery

• Provides stateful state management

• Groups events into micro batches

• Latency depends on the configuration of the DStream microbatch interval

Page 151: Rapid Cluster Computing with Apache Spark 2016

151

The DStream

• An abstraction provided by Spark

• Represents a continuous stream of data

• Internally treated as a sequence of RDDs

• Each RDD consists of the last X seconds

Page 152: Rapid Cluster Computing with Apache Spark 2016

152

How do we use it?

• Create a StreamingContext

• Define the input sources

• Apply transformations and output operations to DStreams

• Issue streamingContext.start() to start receiving data

• Wait for the processing to be stopped using streamingContext.awaitTermination()

Page 153: Rapid Cluster Computing with Apache Spark 2016

153

DStreams and Receivers

• Every input DStream is associated with a receiver object

• The receiver receives the data and stores it in memory for processing

• If you receive multiple input DStreams, multiple receivers will be created

• Remember to allocate enough cores to Spark to process the data as well as to run the receivers

Page 154: Rapid Cluster Computing with Apache Spark 2016

154

Scala Example

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words, pairs, and counts
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

Page 155: Rapid Cluster Computing with Apache Spark 2016

155

Python Example

from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)

    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

Page 156: Rapid Cluster Computing with Apache Spark 2016

Spark Streaming Demo

Page 157: Rapid Cluster Computing with Apache Spark 2016

157

Source Types

• Spark Streaming provides two built-in types of streaming sources:

– Basic sources: such as file systems and socket connections

– Advanced sources: such as Kafka, Flume, Kinesis, etc.

Page 158: Rapid Cluster Computing with Apache Spark 2016

158

Basic Sources - Files

• The StreamingContext API provides several methods for creating input DStreams from files

• File streams: monitor a given directory and process any files created in it

– Files must have the same format

– Once moved into the directory, the files cannot be changed

• There is an easier method for simple text files which doesn’t require a receiver

Page 159: Rapid Cluster Computing with Apache Spark 2016

159

Advanced Sources

• Require interfacing with non-Spark libraries

• Functionality to create DStreams from advanced sources has moved to separate libraries

• This is done to prevent version conflict issues

• Libraries need to be explicitly linked when needed

• There is also the ability to create a custom source using a user defined receiver

Page 160: Rapid Cluster Computing with Apache Spark 2016

160

Sliding Window Operations

• Spark Streaming allows for windowed computations

• Used to apply transformations on a sliding window of data

• Every time the window slides, the source RDDs that fall within the window are combined to produce the RDDs of the windowed DStream

• A window operation needs these two parameters

– Window length: the duration of the window

– Sliding interval: the interval in which the window slides
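For illustration, a hedged PySpark sketch of a windowed word count, assuming a pairs DStream of (word, 1) tuples like the one built in the earlier streaming examples, a 30-second window, and a 10-second slide:

windowedCounts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # add counts for data entering the window
    lambda a, b: a - b,   # subtract counts for data leaving the window (requires checkpointing)
    30,                   # window length, in seconds
    10)                   # sliding interval, in seconds
windowedCounts.pprint()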

Page 161: Rapid Cluster Computing with Apache Spark 2016

161

Checkpointing

• A streaming application usually runs 24/7

• Therefore it has to survive numerous types of failures outside the application logic

• Spark streaming checkpoints data to a fault tolerant storage system to recover if and when any failure occurs

• Two types of data are checkpointed for that purpose

Page 162: Rapid Cluster Computing with Apache Spark 2016

162

Checkpointed Data Types

• Metadata: the information defining the stream computation

– The configuration used to create the streaming application

– The DStream operations that define the streaming application

– Batches whose jobs are queued and are yet to be completed

• Data itself: the RDDs containing the data

– Necessary in transformations that combine data from multiple batches

Page 163: Rapid Cluster Computing with Apache Spark 2016

163

How To Checkpoint

• Set up a directory in a fault-tolerant file system

• Use StreamingContext.checkpoint(dir) to enable data checkpointing

• The application will recover from driver failures (metadata checkpointing) if it does the following:

– Creates the StreamingContext at the first run, sets up the DStreams, then calls start()

– When restarted, the StreamingContext is recreated from the checkpoint directory
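A minimal hedged sketch of this recoverable-driver pattern in PySpark; the checkpoint directory, application name, and DStream setup are hypothetical:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def create_context():
    sc = SparkContext(appName="RecoverableApp")
    ssc = StreamingContext(sc, 1)
    # ... define input DStreams and output operations here ...
    ssc.checkpoint("hdfs:/tmp/streaming-checkpoints")   # enable checkpointing
    return ssc

# On the first run this calls create_context(); on restart it rebuilds from the checkpoint
ssc = StreamingContext.getOrCreate("hdfs:/tmp/streaming-checkpoints", create_context)
ssc.start()
ssc.awaitTermination()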

Page 164: Rapid Cluster Computing with Apache Spark 2016

164

Monitoring And Tuning Streaming Applications

• Statistics about a running streaming application are available through the Spark web UI

• There are two main metrics to monitor and tune:

– Processing Time: the time to process each batch of data (lower is better)

– Scheduling Delay: whether batches are being processed as fast as they arrive (they should be)

Page 165: Rapid Cluster Computing with Apache Spark 2016

165

Reducing Batch Processing Time

• Divide your input DStream into several DStreams

• Increase the parallelism of data processing if you feel your resources are under-utilized

• Reduce serialization overhead by tuning the serialization format

Page 166: Rapid Cluster Computing with Apache Spark 2016

166

Setting The Right Batch Interval

• Batch processing time should be less than the batch interval time

• First start with a conservative interval (a few seconds)

• Make sure batch processing time stays below the batch interval

• If so, you may increase the data rate or lower the batch interval

• Consistently monitor the batch processing time

Page 167: Rapid Cluster Computing with Apache Spark 2016

167

Other Streaming Frameworks

• Apache Storm

• Samza

• Apache Flink

Page 168: Rapid Cluster Computing with Apache Spark 2016

168

How Does It Compare?

Delivery semantics: Storm – at least once (exactly once with Trident); Samza – at least once; Flink – exactly once; Spark – exactly once

State management: Storm – stateless (roll your own or use Trident); Samza – stateful (embedded key-value store); Flink – stateful (periodically writes state without interrupting); Spark – stateful (writes state to storage)

Latency: Storm – sub-second; Samza – sub-second; Flink – sub-second; Spark – seconds (depending on batch size)

Language support: Storm – any JVM language, Ruby, Python, JavaScript, Perl; Samza – Scala, Java; Flink – Scala, Java; Spark – Scala, Java, Python

Page 169: Rapid Cluster Computing with Apache Spark 2016

169

For More Info

• For more information:

http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 170: Rapid Cluster Computing with Apache Spark 2016

Spark Applications

Page 171: Rapid Cluster Computing with Apache Spark 2016

171

Spark Shell vs. Spark Applications

• The Spark Shell allows interactive exploration and manipulation of data (REPL – read, evaluate, print, loop)

• Spark applications run as independent programs

– Python, Scala, R with SparkR package, or Java

– Common uses: ETL processing, Streaming, and more

Page 172: Rapid Cluster Computing with Apache Spark 2016

172

SparkContext

• Every Spark program needs a SparkContext

– The interactive shell creates sc for us

– When creating our own application, we need to create the context ourselves

– A common convention is to name the context sc

Page 173: Rapid Cluster Computing with Apache Spark 2016

173

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

    // get threshold
    val threshold = args(1).toInt

    // read in text file and split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    System.out.println(charCounts.collect().mkString(", "))
  }
}

Page 174: Rapid Cluster Computing with Apache Spark 2016

174

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)

    # get threshold
    threshold = int(sys.argv[2])

    # read in text file and split each document into words
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))

    # count the occurrence of each word
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    # filter out words with fewer than threshold occurrences
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)

    # count characters
    charCounts = filtered.flatMap(lambda pair: pair[0]).map(lambda c: (c, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    results = charCounts.collect()
    print repr(results)[1:-1]

Page 175: Rapid Cluster Computing with Apache Spark 2016

175

library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 2) {
  print("Usage: wordcount <master> <file>")
  q("no")
}

# Initialize Spark context
sc <- sparkR.init(args[[1]], "RwordCount")
lines <- textFile(sc, args[[2]])

words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})

wordCount <- lapply(words, function(word) { list(word, 1L) })

counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}

Page 176: Rapid Cluster Computing with Apache Spark 2016

176

Building a Spark Application: Scala or Java
• Scala or Java applications must be compiled and assembled into JAR files

– The JAR file will be passed (uploaded) to the worker nodes

• Most developers use Apache Maven or SBT to build (a minimal SBT sketch follows this list)

– See http://spark.apache.org/docs/latest/building-spark.html for more details

• Build details will differ depending on Hadoop version, deployment platform, and other factors
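As an illustration only, a minimal build.sbt sketch for an SBT project; the project name and version numbers are assumptions and will differ per environment:

name := "SparkWordCount"

version := "1.0"

scalaVersion := "2.11.8"

// Spark itself is supplied by the cluster at runtime, so it is not bundled into the JAR
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"

Running sbt package then produces a JAR under target/scala-2.11/ that can be passed to spark-submit.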

Page 177: Rapid Cluster Computing with Apache Spark 2016

177

Running a Spark Application
• The easiest way to run a Spark application is with spark-submit

– Python:

$ spark-submit WordCount.py fileURL

– Scala and Java:

$ spark-submit --class WordCount myJarFile.jar fileURL

• spark-submit options:
– --master (local, local[*], yarn, etc.)
– --deploy-mode (client or cluster)
– --name – the application name shown in the UI
– --conf – configuration changes from the defaults
– … more …

Page 178: Rapid Cluster Computing with Apache Spark 2016

178

Runtime Configuration Options
• spark-submit can accept a properties file with settings, instead of command-line parameters

– A whitespace-delimited list of property/value pairs

– Load with spark-submit --properties-file filename

– Example:

spark.master     spark://masternode:7077
spark.local.dir  /tmp/spark
spark.ui.port    28080

• Site defaults properties file

– $SPARK_HOME/conf/spark-defaults.conf

Page 179: Rapid Cluster Computing with Apache Spark 2016

179

Setting Configuration at Runtime
• Spark allows changing the configuration when creating the SparkContext

• Configure the parameters with the SparkConf object

• Some functions:

– setAppName(name)

– setMaster(master)

– set(property-name, value)

• The set functions return the SparkConf object to support chaining (see the sketch below)
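A short Scala sketch of the chaining style; the application name, master URL, and property value are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Each setter returns the SparkConf, so the calls can be chained
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[*]")
  .set("spark.ui.port", "28080")

val sc = new SparkContext(conf)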

Page 180: Rapid Cluster Computing with Apache Spark 2016

180

Logging

• Spark uses Apache log4j for logging

• You can configure it by adding a log4j.properties file in the $SPARK_HOME/conf directory (a minimal example follows this list)

• Log file locations depend on the cluster management platform

– Spark daemons: /var/log/spark

– Individual tasks: $SPARK_HOME/work on each worker node

– YARN has a log aggregator for log files from the workers
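A minimal log4j.properties sketch, modeled on Spark's conf/log4j.properties.template and assuming the only goal is to quiet console output down to WARN:

# Log everything at WARN and above to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n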

Page 181: Rapid Cluster Computing with Apache Spark 2016

Spark Performance and Troubleshooting

Common use cases, problems and solutions

Page 182: Rapid Cluster Computing with Apache Spark 2016

182

Broadcast Variables
• Broadcast variables are set by the driver and retrieved by the workers

• They are read-only once set

• The first read of a broadcast variable retrieves its value and caches it on the worker node

Page 183: Rapid Cluster Computing with Apache Spark 2016

183

Why Use Broadcast Variables
• Use them to minimize transfer of data over the network, which is usually the biggest bottleneck

• Spark broadcast variables are distributed to worker nodes using a very efficient peer-to-peer algorithm

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res5: Array[Int] = Array(1, 2, 3)

Page 184: Rapid Cluster Computing with Apache Spark 2016

184

Accumulators
• Accumulators are shared variables

– Worker nodes can add to the value
– Only the driver application can access the value

• The default accumulator is of type Int or Double, but we can create custom types when needed (extend the AccumulatorParam class)

scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
16/06/18 14:34:40 INFO DAGScheduler: Job 5 finished: foreach at <console>:30, took 0.111459 s

scala> accum.value
res6: Int = 10

Page 185: Rapid Cluster Computing with Apache Spark 2016

185

Accumulators (cont.)

• Accumulators are only incremented once per task

– If a task has to be rerun due to a failure, Spark counts the update only for the task attempt that succeeded

• Only the driver can access the value

– Code will throw an exception if we call .value on a worker

• Accumulators support the increment (+=) operator

Page 186: Rapid Cluster Computing with Apache Spark 2016

186

Common Performance Issues: Serialization
• Serialization affects:

– Network bandwidth
– Memory (save memory by serializing to disk)

• The default serialization method in Spark is basic Java serialization

– Simple, but slow

• Use Kryo serialization for Scala and Java (a sketch follows this list)

– Set spark.serializer to org.apache.spark.serializer.KryoSerializer
– Create a KryoRegistrator class and point spark.kryo.registrator at it (e.g. spark.kryo.registrator=MyRegistrator)
– Register classes with Kryo (kryo.register(classOf[MyClass]))
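A Scala sketch of the three steps; MyClass and MyRegistrator are illustrative names, and the registrator string would normally be the fully qualified class name:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

case class MyClass(id: Int, name: String)

// Registers the application classes that Kryo should know how to serialize
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyClass])
  }
}

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")

val sc = new SparkContext(conf)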

Page 187: Rapid Cluster Computing with Apache Spark 2016

187

Small Partitions

• Problem: filter() can result in partitions with very small amounts of data

– Results in many small tasks

• Solution: repartition(n)

– This is the same as coalesce(n, shuffle = true)

• This builds a new, repartitioned RDD, reducing the number of tasks (see the sketch below)
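A short sketch, assuming filtered is an RDD left with many nearly empty partitions after a selective filter(); the target partition count of 10 is illustrative:

// Collapse many tiny partitions into 10 balanced ones (forces a shuffle)
val compacted = filtered.repartition(10)

// Equivalent form using coalesce with a shuffle
val compacted2 = filtered.coalesce(10, shuffle = true)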

Page 188: Rapid Cluster Computing with Apache Spark 2016

188

Passing Too Much Data in Functions
• Problem: passing large amounts of data to parallel functions results in poor performance

• Solution:

– If the data is relatively small, use a broadcast variable (see the sketch after this list)

– If the data is very large, parallelize it into RDDs
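A brief sketch of the broadcast approach, assuming a hypothetical small lookup map built on the driver and a hypothetical large RDD of keys (bigKeysRDD):

// Ship the small lookup table once per worker instead of once per task
val lookup = Map("a" -> 1, "b" -> 2)              // small reference data built on the driver
val lookupBC = sc.broadcast(lookup)

val enriched = bigKeysRDD.map { key =>
  (key, lookupBC.value.getOrElse(key, -1))        // workers read the broadcast copy locally
}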

Page 189: Rapid Cluster Computing with Apache Spark 2016

189

Where to Look for Performance Issues
• Scheduling and launching tasks

– Are you passing too much data to the tasks?

– Use a broadcast variable, or an RDD

• Task execution

– Are there tasks with a very high per-record overhead?

• For example, mydata.map(dbLookup)

• Each lookup call opens a connection to the DB, reads, and closes it

– Try mapPartitions instead (see the sketch after this list)

– Are a few tasks taking much more time than the others?

• Repartition, partition on a different key, or write a custom partitioner
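A sketch of the mapPartitions fix, assuming hypothetical openConnection() and dbLookup(conn, record) helpers; the point is to pay the connection cost once per partition rather than once per record:

val results = mydata.mapPartitions { records =>
  val conn = openConnection()                              // one DB connection per partition
  // materialize before closing the connection: the mapped iterator is evaluated lazily
  val looked = records.map(r => dbLookup(conn, r)).toList
  conn.close()
  looked.iterator
}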

Page 190: Rapid Cluster Computing with Apache Spark 2016

190

Where to Look for Performance Issues (cont.)
• Shuffling

– Make sure you have enough memory for the buffer cache

– Make sure spark.local.dir points to a local disk, ideally dedicated or SSD

• Collecting data to the driver

– Beware of returning large amounts of data to the driver (using collect())

– Process data on the workers, not on the driver

– Save large results to HDFS instead (see the sketch below)
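A tiny sketch of the last point, assuming largeResultRDD holds the job's output; the HDFS path is illustrative:

// Workers write the output in parallel; nothing large ever travels back to the driver
largeResultRDD.saveAsTextFile("hdfs://namenode:8020/user/spark/output")

// Use collect() or take() only for small, aggregated results
val preview = largeResultRDD.take(20)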

Page 191: Rapid Cluster Computing with Apache Spark 2016

191

Conclusion
• We talked about the Big Data problem and Hadoop

• We learned how to use Spark Core

• We got an overview of Spark clusters and parallel programming

• We reviewed the Spark modules: Spark SQL, Spark Streaming, machine learning, and graphs

• Spark is one of the leading technologies in today's world

Page 192: Rapid Cluster Computing with Apache Spark 2016

192

What Did We Not Talk About?
• Spark unified APIs and DataFrames

• Spark MLlib and machine learning in general

• Spark graph processing

Page 193: Rapid Cluster Computing with Apache Spark 2016

Q&A
Any questions? Now would be the time!

Page 194: Rapid Cluster Computing with Apache Spark 2016

Zohar Elkayam
Twitter: @realmgic
[email protected]

www.ilDBA.co.il
www.realdbamagic.com

Page 195: Rapid Cluster Computing with Apache Spark 2016

195