Lightning Fast Big Data Analytics using Apache Spark


Description

Lightning Fast Big Data Analytics using Apache Spark
----------------------------------------------------------------------------------
Hadoop gives you a great (actually revolutionary) mechanism for storing large datasets in a highly fault-tolerant and highly available storage system (HDFS), and the ability to process these mammoth datasets using its massively parallel and distributed processing framework (MapReduce). It was built for batch processing, where analysts and programmers submit a series of jobs to crunch very large structured/unstructured datasets and then wait for the results before performing further analysis. But one of the few things Hadoop is criticized for is its speed and lack of interactivity (mainly because its user base has grown tremendously, and people always demand more, especially when it comes to speed). Spark is an open source system that can run on top of your existing HDFS and can provide up to 100x faster (almost interactive) in-memory analytics than MapReduce.

Topics that will be covered:
- Quick introduction of Hadoop and its limitations
- Introduction of Spark
- Spark architecture
- Programming model of Spark
- Demo
- Spark use cases

Transcript of Lightning Fast Big Data Analytics using Apache Spark

Page 1: Lightning Fast Big Data Analytics using Apache Spark

www.unicomlearning.com

India Big Data Week 2014: Lightning Fast Big Data Analytics using Apache Spark

www.bigdatainnovation.org

Manish Gupta, Solutions Architect – Product Engineering and Development

30th Jan 2014 - Delhi

Page 2: Lightning Fast Big Data Analytics using Apache Spark

Agenda Of The Talk:

Hadoop – A Quick Introduction

An Introduction To Spark & Shark

Spark – Architecture & Programming Model

Example & Demo

Spark Current Users & Roadmap

Page 4: Lightning Fast Big Data Analytics using Apache Spark

What is Hadoop?

HDFS: It's open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable, and flexible way.

MR: It also provides a programming model/framework for processing these large datasets in a massively parallel, fault-tolerant, and data-location-aware fashion.

[Diagram: Input flowing through Map tasks into Reduce tasks to Output, on top of HDFS]
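To make the two phases concrete, here is a tiny conceptual sketch of word count using plain Scala collections (illustrative only; this is not the actual Hadoop API):

// Conceptual word count: the shape of a MapReduce job, on in-memory data.
val input = Seq("spark is fast", "hadoop is reliable")

// Map phase: emit (key, value) pairs from each input record.
val mapped = input.flatMap(line => line.split(" ").map(word => (word, 1)))

// Shuffle: group the pairs by key (done across machines in real Hadoop).
val grouped = mapped.groupBy(_._1)

// Reduce phase: aggregate the values for each key.
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

counts.foreach(println)  // (is,2), (spark,1), (hadoop,1), (fast,1), (reliable,1)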

Page 5: Lightning Fast Big Data Analytics using Apache Spark

Limitations of MapReduce

Slow due to replication, serialization, and disk I/O.

Inefficient for:

• Iterative algorithms (Machine Learning, Graphs & Network Analysis)

• Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)

[Diagram: an iterative job on MapReduce: Input, iter. 1, iter. 2, …, with an HDFS read and an HDFS write around every iteration; each iteration is a full Map -> Reduce pass from Input to Output]

Page 6: Lightning Fast Big Data Analytics using Apache Spark

Approach: Leverage Memory?

Memory bus >> disk & SSDs

Many datasets fit into memory

1TB = 1 billion records @ 1 KB

Memory capacity also follows Moore's Law.

A single 8 GB stick of RAM is about $80 right now. In 2021, you'd be able to buy a single stick of RAM that contains 64 GB for the same price.

Page 7: Lightning Fast Big Data Analytics using Apache Spark

Hadoop – A Quick Introduction

An Introduction To Spark & Shark

Spark – Architecture & Programming Model

Example & Demo

Spark Current Users & Roadmap

Agenda Of The Talk:

Page 9: Lightning Fast Big Data Analytics using Apache Spark

Spark

Open source, originally developed in the AMPLab at UC Berkeley.

Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x).

Designed for running iterative algorithms & interactive analytics.

Highly compatible with Hadoop’s Storage APIs.

- Can run on your existing Hadoop Cluster Setup.

Developers can write driver programs using multiple programming languages.

“A big data analytics cluster-computing framework written in Scala.”

Page 10: Lightning Fast Big Data Analytics using Apache Spark

Spark

[Diagram: a Spark Driver (Master) and Cluster Manager coordinating Spark Workers, each with a cache, running alongside HDFS Datanodes holding blocks]

Page 11: Lightning Fast Big Data Analytics using Apache Spark

Spark

[Diagram: the Hadoop pattern again: Input, iter. 1, iter. 2, …, with HDFS reads and writes between every iteration]

Page 12: Lightning Fast Big Data Analytics using Apache Spark

Spark

[Diagram: the Spark pattern: one HDFS read of the Input, then iter. 1, iter. 2, … run against in-memory data]

Not tied to the 2-stage MapReduce paradigm:

1. Extract a working set
2. Cache it
3. Query it repeatedly

[Chart: logistic regression running time in Hadoop vs. Spark]

Page 13: Lightning Fast Big Data Analytics using Apache Spark

Spark

A simple analytical operation:

// 1. Equivalent to: Select count(*) from pagecounts
val pagecount = sc.textFile("/wiki/pagecounts")
pagecount.count()

// 2. Equivalent to: Select Col1, sum(Col4) from pagecounts Where Col2 = "en" Group by Col1
val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect

Page 14: Lightning Fast Big Data Analytics using Apache Spark

Shark

HIVE on SPARK = SHARK

A large-scale data warehouse system, just like Apache Hive.
Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
Built on top of Spark (thus a faster execution engine).
Provides in-memory materialized tables (cached tables).
Cached tables use columnar storage instead of row storage:

Column Storage:

1    2    3
ABC  XYZ  PPP
4.1  3.5  6.4

Row Storage:

1  ABC  4.1
2  XYZ  3.5
3  PPP  6.4

Page 15: Lightning Fast Big Data Analytics using Apache Spark

Shark

[Diagram: HIVE architecture: a Client (CLI, JDBC) talks to the Driver (SQL Parser, Query Optimizer, Physical Plan Execution), backed by the Metastore, executing as MapReduce jobs over HDFS]

Page 16: Lightning Fast Big Data Analytics using Apache Spark

SHARK

[Diagram: SHARK architecture: identical to Hive (Client with CLI/JDBC, Driver with SQL Parser, Query Optimizer, Physical Plan Execution, Metastore, HDFS), except the execution engine is Spark and a Cache Manager is added]

Page 17: Lightning Fast Big Data Analytics using Apache Spark

Hadoop – A Quick Introduction

An Introduction To Spark & Shark

Spark – Architecture & Programming Model

Example & Demo

Spark Current Users & Roadmap

Agenda Of The Talk:

Page 19: Lightning Fast Big Data Analytics using Apache Spark

Spark Programming Model

User (Developer) writes a Driver Program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

[Diagram: the Driver Program's SparkContext talks to a Cluster Manager, which runs Tasks inside Executors (with caches) on Worker Nodes, co-located with the HDFS Datanodes]

Page 20: Lightning Fast Big Data Analytics using Apache Spark

Spark Programming Model

User (Developer) writes a Driver Program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

RDD (Resilient Distributed Dataset):

• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators

A complete driver program built around this pattern is sketched below.
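As a worked version of the pattern above, a minimal standalone driver program might look like this sketch. The master URL and HDFS path are placeholders for your own cluster, and the SparkConf-based setup assumes Spark 0.9 or later:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // Point the driver at a cluster manager (placeholder master URL).
    val conf = new SparkConf()
      .setAppName("SimpleDriver")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    // Build an RDD, transform it, cache it, then run actions on it.
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")  // placeholder path
    val nonEmpty = lines.filter(_.nonEmpty)
    nonEmpty.cache()

    println("Non-empty lines: " + nonEmpty.count())
    nonEmpty.take(5).foreach(println)

    sc.stop()
  }
}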

Page 21: Lightning Fast Big Data Analytics using Apache Spark

RDD

Programming Interface: the programmer can perform 3 types of operations (a combined example follows the list):

Transformations

• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions

• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence

• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (Storage Level).
• Examples: persist(), cache()
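Here is what the three kinds of operations can look like together in a spark-shell session (the input path and filter predicate are just placeholders):

import org.apache.spark.storage.StorageLevel

// Transformations (lazy): build up a lineage, nothing executes yet.
val lines  = sc.textFile("hdfs://namenode:9000/logs/access.log")  // placeholder
val errors = lines.filter(_.contains("ERROR"))
val codes  = errors.map(_.split(" ")(0)).distinct()

// Persistence: keep the filtered RDD around for repeated use.
errors.persist(StorageLevel.MEMORY_AND_DISK)  // RAM first, spill to disk if needed

// Actions: trigger execution and bring results back to the driver.
errors.count()
codes.collect()
errors.take(5)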

Page 22: Lightning Fast Big Data Analytics using Apache Spark

Spark

How Spark Works:

RDD: a parallel collection with partitions.
User applications create RDDs, transform them, and run actions.
This results in a DAG (Directed Acyclic Graph) of operators.
The DAG is compiled into stages.
Each stage is executed as a series of Tasks (one Task for each Partition).
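One way to see this DAG from the shell is RDD.toDebugString, which prints an RDD's lineage of operators; a small sketch (the exact output format varies by Spark version):

val pages  = sc.textFile("/wiki/pagecounts")
val counts = pages.map(_.split(" "))
                  .map(r => (r(0), r(3).toInt))
                  .reduceByKey(_ + _)

// Prints the chain of RDDs behind `counts`; the shuffle introduced by
// reduceByKey is where the DAG scheduler will cut a stage boundary.
println(counts.toDebugString)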

Page 23: Lightning Fast Big Data Analytics using Apache Spark

Spark

Example:

sc.textFile("/wiki/pagecounts")

[Lineage: textFile -> RDD[String]]

Page 24: Lightning Fast Big Data Analytics using Apache Spark

Spark

Example:

sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))

[Lineage: textFile -> RDD[String]; map -> RDD[List[String]]]

Page 25: Lightning Fast Big Data Analytics using Apache Spark

Spark

Example:

sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))

[Lineage: textFile -> RDD[String]; map -> RDD[List[String]]; map -> RDD[(String, Int)]]

Page 26: Lightning Fast Big Data Analytics using Apache Spark

Spark

Example:

sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
  .reduceByKey(_ + _, 3)

[Lineage: textFile -> RDD[String]; map -> RDD[List[String]]; map -> RDD[(String, Int)]; reduceByKey -> RDD[(String, Int)]]

Page 27: Lightning Fast Big Data Analytics using Apache Spark

Spark

Example:

sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()

[Lineage: textFile -> RDD[String]; map -> RDD[List[String]]; map -> RDD[(String, Int)]; reduceByKey -> RDD[(String, Int)]; collect -> Array[(String, Int)]]

Page 28: Lightning Fast Big Data Analytics using Apache Spark

Spark

Execution Plan:

[Logical plan: textFile -> map -> map -> reduceByKey -> collect]

The above logical plan gets compiled by the DAG scheduler into a plan comprising stages, as follows.

Page 29: Lightning Fast Big Data Analytics using Apache Spark

Spark

Execution Plan:

[Stage 1: textFile -> map -> map -> (partial) reduceByKey | shuffle | Stage 2: (final) reduceByKey -> collect]

Stages are sequences of RDDs that don't have a shuffle in between.

Page 30: Lightning Fast Big Data Analytics using Apache Spark

Spark

[Logical plan: textFile -> map -> map -> reduceByKey -> collect, cut into Stage 1 and Stage 2]

Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program

Page 31: Lightning Fast Big Data Analytics using Apache Spark

Spark

Stage Execution:

[Diagram: Stage 1 fanning out into multiple Tasks, one per partition]

Create a Task for each Partition in the new RDD
Serialize the Task
Schedule and ship Tasks to slaves

And all this happens internally (you don't need to do anything).

Page 32: Lightning Fast Big Data Analytics using Apache Spark

Spark

Task Execution:

A Task is the fundamental unit of execution in Spark.

[Diagram: over time, a Task fetches input (from HDFS / RDD), executes, and writes output (to HDFS / RDD / intermediate shuffle output)]

Page 33: Lightning Fast Big Data Analytics using Apache Spark

Spark

Spark Executor (Slaves)

[Diagram: a Spark Executor running Tasks back-to-back (fetch input, execute task, write output) on each of its cores: Core 1, Core 2, Core 3]

Page 34: Lightning Fast Big Data Analytics using Apache Spark

Spark

Summary of Components

Task : The fundamental unit of execution in Spark

Stage : Set of Tasks that run in parallel

DAG : Logical Graph of RDD operations

RDD : Parallel dataset with partitions

Page 35: Lightning Fast Big Data Analytics using Apache Spark

Hadoop – A Quick Introduction

An Introduction To Spark & Shark

Spark – Architecture & Programming Model

Example & Demo

Spark Current Users & Roadmap

Agenda Of The Talk:

Page 37: Lightning Fast Big Data Analytics using Apache Spark

Example & Demo

Cluster Details:

6 m1.xlarge EC2 nodes
1 master node machine, 5 worker node machines
64-bit, 4 vCPUs, 15 GB RAM each

Page 38: Lightning Fast Big Data Analytics using Apache Spark

Example & Demo

Wiki Page View Stats: 20 GB of webpage view counts, 3 days' worth of data

Dataset:

<date_time> <project_code> <page_title> <num_hits> <page_size>

Base RDD of all Wiki pages:

val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()

Transformed RDD of all English pages (cached):

val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
englishPages.count()   // the second count() is served from the cache

Page 39: Lightning Fast Big Data Analytics using Apache Spark

Example & Demo

Wiki Page View Stats: 20 GB of webpage view counts, 3 days' worth of data

Dataset:

<date_time> <project_code> <page_title> <num_hits> <page_size>

Select date, sum(pageviews) from pagecounts group by date

englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(3).toInt))
  .reduceByKey(_ + _, 1)
  .collect.foreach(println)

Select date, count(distinct pageURL) from pagecounts group by date

englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(2)))
  .distinct()
  .countByKey()
  .foreach(println)

Select distinct(datetime) from pagecounts order by datetime

englishPages.map(line => line.split(" "))
  .map(line => (line(0), 1))
  .distinct()
  .sortByKey()
  .collect()
  .foreach(println)

Page 40: Lightning Fast Big Data Analytics using Apache Spark

Example & Demo

Dataset: Network Datasets (directed and bi-directed graphs)

One small Facebook social network: 127 nodes (friends), 1668 edges (friendships), bi-directed graph

Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks), directed graph

Page 41: Lightning Fast Big Data Analytics using Apache Spark

Example & Demo

PageRank Calculation:

• Estimate the node importance.
• Each directed link from A -> B is a vote to B from A.
• The more links to a page, the more important the page is.
• When a page with a higher PR points to something, its vote weighs more.

1. Start each page at a rank of 1.
2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.

Page 42: Lightning Fast Big Data Analytics using Apache Spark

Example & Demo

Scala Code:

var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)

val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))

Page 43: Lightning Fast Big Data Analytics using Apache Spark
Page 44: Lightning Fast Big Data Analytics using Apache Spark

2 seconds

Page 45: Lightning Fast Big Data Analytics using Apache Spark

38 seconds

Page Rank : Page URL

761.1985177 : google
455.7028756 : google/about.html
259.6052388 : google/privacy.html
192.7257649 : google/jobs/
144.0349154 : google/support
134.1566312 : google/terms_of_service.html
130.3546324 : google/intl/en/about.html
123.4014613 : google/imghp
120.0661165 : google/accounts/Login
118.6884515 : google/intl/en/options/
112.2309539 : google/preferences
108.8375347 : google/sitemap.html
106.9724799 : google/press/
105.822426 : google/language_tools
105.1554798 : google/support/toolbar/
99.97741309 : google/maps
97.90651416 : google/advanced_search
90.7910291 : google/intl/en/services/
90.70522689 : google/intl/en/ads/
87.4353413 : google/adsense/

Page 46: Lightning Fast Big Data Analytics using Apache Spark

Hadoop – A Quick Introduction

An Introduction To Spark & Shark

Spark – Architecture & Programming Model

Example & Demo

Spark Current Users & Roadmap

Agenda Of The Talk:

Page 48: Lightning Fast Big Data Analytics using Apache Spark

Roadmap

Page 49: Lightning Fast Big Data Analytics using Apache Spark

Conclusion

Because of in-memory processing, computations are very fast. Developers can write iterative algorithms without writing out a result set after each pass through the data.

Suitable for scenarios where sufficient memory is available in your cluster.

It provides an integrated framework for advanced analytics like Graph processing, Stream Processing, Machine Learning etc. This simplifies integration.

Its community is expanding and development is happening very aggressively.

It's comparatively newer than Hadoop and has only a few users so far.

Page 50: Lightning Fast Big Data Analytics using Apache Spark

Topic:

Organized by UNICOM Trainings & Seminars Pvt. Ltd.

[email protected]

Speaker name: MANISH GUPTA
Email ID: [email protected]

Thank You


Page 51: Lightning Fast Big Data Analytics using Apache Spark
Page 52: Lightning Fast Big Data Analytics using Apache Spark

Backup Slides

Page 53: Lightning Fast Big Data Analytics using Apache Spark

Spark Internal Components

[Diagram: Spark core (Operators, Block manager, Scheduler, Networking, Accumulators, Broadcast) with the Interpreter on top, and pluggable backends: Hadoop I/O, Mesos backend, Standalone backend]

Page 54: Lightning Fast Big Data Analytics using Apache Spark

In-Memory

But what if I run out of memory?

[Chart: iteration time (s) vs. % of working set in memory]

% of working set in memory : Iteration time (s)
Cache disabled : 68.8
25% : 58.1
50% : 40.7
75% : 29.7
Fully cached : 11.5

Page 55: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks

AMPLab performed quantitative and qualitative comparisons of 4 systems:

HIVE, Impala, Redshift, and Shark

Done on the Common Crawl Corpus dataset: 81 TB in size, consisting of 3 tables:

Page Rankings
User Visits
Documents

Data was partitioned in such a way that each node had:
25 GB of User Visits
1 GB of Rankings
30 GB of Web Crawl (documents)

Source: https://amplab.cs.berkeley.edu/benchmark/#

Page 56: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks

Page 57: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks: Hardware Configuration

Page 58: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks

• Redshift outperforms for on-disk data.
• Shark and Impala outperform Hive by 3-4X.
• For larger result sets, Shark outperforms Impala.

Page 59: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks

• Redshift columnar storage outperforms every time.
• Shark in-memory is 2nd best in all cases.

Page 60: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks

• Redshift's bigger cluster has an advantage.
• Shark and Impala are competing closely.

Page 61: Lightning Fast Big Data Analytics using Apache Spark

Benchmarks

• Impala & Redshift don't have UDFs.
• Shark outperforms Hive.

Page 62: Lightning Fast Big Data Analytics using Apache Spark

Roadmap

Page 63: Lightning Fast Big Data Analytics using Apache Spark

Spark

In the last 6 months of 2013