
Big Data Analytics
Apache Spark

Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science
University of Hildesheim, Germany


Outline

1. What is Spark

2. Working with Spark
   2.1 Spark Context
   2.2 Resilient Distributed Datasets

3. MLLib: Machine Learning with Spark


1. What is Spark

Spark Overview

Apache Spark is an open-source framework for large-scale data processing and analysis.

Main ideas:

- Processing occurs where the data resides
- Avoid moving data over the network
- Work with the data in memory

Technical details:

- Written in Scala
- Works seamlessly with Java and Python
- Developed at UC Berkeley


Apache Spark Stack

Data platform: a distributed file system / database

- E.g. HDFS, HBase, Cassandra

Execution environment: a single machine or a cluster

- Standalone, EC2, YARN, Mesos

Spark Core: the Spark API

Spark ecosystem: libraries of common algorithms

- MLLib, GraphX, Streaming


Apache Spark Ecosystem

[Figure: diagram of the Apache Spark ecosystem]


How to use Spark

Spark can be used through:

- The Spark Shell
  - Available in Python and Scala
  - Useful for learning the framework
- Spark Applications
  - Available in Python, Java and Scala
  - For “serious” large-scale processing
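As a quick illustration, both entry points are launched from the Spark distribution's bin directory (my_app.py is a hypothetical application script):

$ ./bin/pyspark                  # interactive shell, Python
$ ./bin/spark-shell              # interactive shell, Scala
$ ./bin/spark-submit my_app.py   # run a standalone Spark application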

2. Working with Spark


Working with Spark requires accessing a Spark Context:

- The main entry point to the Spark API
- Already preconfigured in the shell

Most of the work in Spark is a set of operations on Resilient Distributed Datasets (RDDs):

- The main data abstraction
- The data used and generated by the application is stored as RDDs


Spark Interactive Shell (Python)

$ ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkContext available as sc, SQLContext available as sqlContext.
>>>

2.1 Spark Context

The Spark Context is the main entry point for the Spark functionality.

- It represents the connection to a Spark cluster
- It allows creating RDDs
- It allows broadcasting variables to the cluster
- It allows creating accumulators
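In the shell the context is preconfigured as sc; in a standalone application it is created explicitly. A minimal sketch (the app name and master URL are placeholder values):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[2]")  # hypothetical app name, local mode with 2 threads
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])      # create an RDD from a local collection
broadcastVar = sc.broadcast([1, 2, 3])  # read-only variable shipped to all workers
accum = sc.accumulator(0)               # counter that workers can only add to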

2.2 Resilient Distributed Datasets (RDDs)

A Spark application stores data as RDDs.

Resilient → if data in memory is lost, it can be recreated (fault tolerance)
Distributed → stored in memory across different machines
Dataset → data coming from a file or generated by the application

A Spark program is about operations on RDDs. RDDs are immutable: operations on RDDs may create new RDDs but never change them (see the example below).
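A tiny shell session illustrating immutability (sample values chosen arbitrarily):

>>> rdd = sc.parallelize([1, 2, 3])
>>> doubled = rdd.map(lambda x: x * 2)  # produces a new RDD
>>> rdd.collect()                       # the original RDD is unchanged
[1, 2, 3]
>>> doubled.collect()
[2, 4, 6]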

[Figure: an RDD as a collection of data elements, one box per RDD element]

RDD elements can be stored on different machines (transparent to the developer), and the data can have various data types.


RDD Data types

An element of an RDD can be of any type as long as it is serializable.

Examples:

- Primitive data types: integers, characters, strings, floating point numbers, ...
- Sequences: lists, arrays, tuples, ...
- Pair RDDs: key-value pairs
- Serializable Scala/Java objects

A single RDD may have elements of different types, and some specific element types have additional functionality (see the examples below).
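A short sketch of both cases in the Python shell (values are arbitrary):

>>> mixed = sc.parallelize([1, 'two', 3.0])       # elements of different types in one RDD
>>> pairs = sc.parallelize([('a', 1), ('b', 2)])  # a pair RDD: key-value tuples
>>> pairs.keys().collect()
['a', 'b']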


Example: Text file to RDD

The file mydiary.txt is loaded into the RDD mydata, one element per line:

I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.


RDD operations

There are two types of RDD operations:

- Actions: return a value based on the RDD. Examples:
  - count(): returns the number of elements in the RDD
  - first(): returns the first element in the RDD
  - take(n): returns an array with the first n elements in the RDD
- Transformations: create a new RDD based on the current one. Examples:
  - filter: returns the elements of an RDD which match a given criterion
  - map: applies a particular function to each RDD element
  - reduce: aggregates all RDD elements with a particular function (strictly speaking an action, since it returns a single value; see the example below)
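A quick illustration of reduce aggregating elements pairwise (arbitrary sample data):

>>> sc.parallelize([1, 2, 3, 4]).reduce(lambda x, y: x + y)
10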


Actions vs. Transformations

[Figure: an action takes an RDD and returns a value; a transformation takes a base RDD and produces a new RDD]


Actions examples

Using the file mydiary.txt loaded into the RDD mydata:

>>> mydata = sc.textFile("mydiary.txt")
>>> mydata.count()
5
>>> mydata.first()
u'I had breakfast this morning.'
>>> mydata.take(2)
[u'I had breakfast this morning.', u'The coffee was really good.']


Transformation examples

Starting from the RDD mydata:

filter(lambda line: "I " in line) keeps only the elements matching the criterion:

RDD: filtered
I had breakfast this morning.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

map(lambda line: line.upper()) applies the function to every element:

RDD: filterMap
I HAD BREAKFAST THIS MORNING.
I DIDN'T LIKE THE BREAD THOUGH.
BUT I HAD CHEESE.
OH I LOVE CHEESE.


Transformation examples (shell session)

>>> filtered = mydata.filter(lambda line: "I " in line)
>>> filtered.count()
4
>>> filtered.take(4)
[u'I had breakfast this morning.', u"I didn't like the bread though.", u'But I had cheese.', u'Oh I love cheese.']

>>> filterMap = filtered.map(lambda line: line.upper())
>>> filterMap.count()
4
>>> filterMap.take(4)
[u'I HAD BREAKFAST THIS MORNING.', u"I DIDN'T LIKE THE BREAD THOUGH.", u'BUT I HAD CHEESE.', u'OH I LOVE CHEESE.']


Operations on specific types

Numeric RDDs have special operations:

- mean()
- min()
- max()
- ...

>>> linelens = mydata.map(lambda line: len(line))
>>> linelens.collect()
[29, 27, 31, 17, 17]
>>> linelens.mean()
24.2
>>> linelens.min()
17
>>> linelens.max()
31
>>> linelens.stdev()
6.0133185513491636


Operations on Key-Value Pairs

Pair RDDs contain two-element tuples: (K, V)

Keys and values can be of any type.

Extremely useful for implementing MapReduce algorithms.

Examples of operations (a small shell illustration follows the list):

- groupByKey
- reduceByKey
- aggregateByKey
- sortByKey
- ...
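A minimal sketch (sample pairs chosen arbitrarily; the ordering returned by collect may vary):

>>> pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
>>> pairs.reduceByKey(lambda x, y: x + y).collect()
[('a', 4), ('b', 2)]
>>> pairs.sortByKey().collect()
[('a', 1), ('a', 3), ('b', 2)]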


Word Count Example

Map:

- Input: document-word list pairs
- Output: word-count pairs

$(d_k, ``w_1, \ldots, w_m'') \mapsto [(w_i, c_i)]$

Reduce:

- Input: word-(count list) pairs
- Output: word-count pairs

$(w_i, [c_i]) \mapsto \big(w_i, \sum_{c \in [c_i]} c\big)$


Input documents:

(d1, “love ain't no stranger”)
(d2, “crying in the rain”)
(d3, “looking for love”)
(d4, “I'm crying”)
(d5, “the deeper the love”)
(d6, “is this love”)
(d7, “Ain't no love”)

Mapper output (word-count pairs, partially pre-aggregated per mapper):

(love, 1), (ain't, 1), (stranger, 1), (crying, 1), (rain, 1), (looking, 1), (love, 2), (crying, 1), (deeper, 1), (this, 1), (love, 2), (ain't, 1)

Reducer output (one total per word):

(love, 5), (stranger, 1), (crying, 2), (ain't, 2), (rain, 1), (looking, 1), (deeper, 1), (this, 1)


Word Count on Spark

Starting from the RDD mydata (one element per line of mydiary.txt):

flatMap(lambda line: line.split(" ")) → RDD of words:
I, had, breakfast, this, morning., ...

map(lambda word: (word, 1)) → RDD of pairs:
(I,1), (had,1), (breakfast,1), (this,1), (morning.,1), ...

reduceByKey(lambda x, y: x + y) → RDD of counts:
(I,4), (had,2), (breakfast,1), (this,1), (morning.,1), ...

>>> counts = mydata.flatMap(lambda line: line.split(" ")) \
...                .map(lambda word: (word, 1)) \
...                .reduceByKey(lambda x, y: x + y)
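Continuing the session, the result can be inspected with an action; the counts follow from the pipeline above, though the output order may vary since reduceByKey does not sort:

>>> counts.take(3)
[(u'I', 4), (u'had', 2), (u'breakfast', 1)]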


ReduceByKey

.reduceByKey(lambda x, y: x + y)

reduceByKey works a little differently from the MapReduce reduce function:

- The supplied function f takes two arguments: it combines two values at a time associated with the same key
- f must be commutative: f(x, y) = f(y, x)
- f must be associative: f(f(x, y), z) = f(x, f(y, z))

Spark does not guarantee in which order the reduce functions are executed (see the illustration below).
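A sketch of why this matters (sample data arbitrary; the second result is nondeterministic):

>>> pairs = sc.parallelize([('k', 1), ('k', 2), ('k', 3)])
>>> pairs.reduceByKey(lambda x, y: x + y).collect()  # + is commutative and associative: safe
[('k', 6)]
>>> pairs.reduceByKey(lambda x, y: x - y).collect()  # - is neither: result depends on evaluation order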


Considerations

Spark provides a much more efficient MapReduce implementation than Hadoop:

- Higher-level API
- In-memory storage (less I/O overhead)
- Chaining MapReduce operations is simplified: a sequence of MapReduce passes can be done in one job

Spark vs. Hadoop on training a logistic regression model:

[Figure: running time comparison of Spark and Hadoop for logistic regression]

Source: Apache Spark. https://spark.apache.org/

3. MLLib: Machine Learning with Spark


Overview

MLLib is Spark's machine learning library, containing implementations for:

- Computing basic statistics from datasets
- Classification and regression
- Collaborative filtering
- Clustering
- Feature extraction and dimensionality reduction
- Frequent pattern mining
- Optimization algorithms
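As a small taste of the basic-statistics part, a sketch using the MLlib 1.x API (sample vectors chosen arbitrarily):

>>> from pyspark.mllib.stat import Statistics
>>> from pyspark.mllib.linalg import Vectors
>>> mat = sc.parallelize([Vectors.dense([1.0, 10.0]), Vectors.dense([3.0, 30.0])])
>>> summary = Statistics.colStats(mat)  # column-wise summary statistics
>>> summary.mean()
array([  2.,  20.])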


Logistic Regression with MLLib

Import the necessary packages:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD

Read the data (LibSVM format):

dataset = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
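For reference, each line of a LibSVM file encodes a labeled sparse vector as label index1:value1 index2:value2 ... with 1-based indices; an illustrative line (values made up):

1 3:2.5 10:0.7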



Train the model:

model = LogisticRegressionWithSGD.train(dataset)

Evaluate:

labelsAndPreds = dataset.map(lambda p: (p.label, model.predict(p.features)))

trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(dataset.count())

print("Training Error = " + str(trainErr))
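The training error is an optimistic estimate; a held-out split gives a fairer picture. A sketch using randomSplit (the 70/30 ratio is an arbitrary choice):

train, test = dataset.randomSplit([0.7, 0.3])
model = LogisticRegressionWithSGD.train(train)
testErr = (test.map(lambda p: (p.label, model.predict(p.features)))
               .filter(lambda (v, p): v != p)
               .count() / float(test.count()))
print("Test Error = " + str(testErr))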
