Bigdata processing with Spark


Transcript of Bigdata processing with Spark

Page 1: Bigdata processing with Spark

SIKS Big Data Course

Prof.dr.ir. Arjen P. de Vries

[email protected], Enschede, December 5, 2016

Page 2: Bigdata processing with Spark

“Big Data”

If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity

http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Page 3: Bigdata processing with Spark

Process challenges in Big Data Analytics include:

- capturing data
- aligning data from different sources (e.g., resolving when two objects are the same)
- transforming the data into a form suitable for analysis
- modeling it, whether mathematically, or through some form of simulation
- understanding the output — visualizing and sharing the results

Attributed to IBM Research’s Laura Haas in http://www.odbms.org/download/Zicari.pdf

Page 4: Bigdata processing with Spark

How big is big? Facebook (Aug 2012):

- 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)

- 2.7 billion Likes per day

- 300 million photos uploaded per day

Page 5: Bigdata processing with Spark

Big is very big!

- 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters
- 105 terabytes of data scanned via Hive, Facebook’s Hadoop query language, every 30 minutes
- 70,000 queries executed on these databases per day
- 500+ terabytes of new data ingested into the databases every day

http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/

Page 6: Bigdata processing with Spark

Back of the Envelope

Note: “105 terabytes of data scanned every 30 minutes”

A very, very fast disk can do 300 MB/s, so on one disk this scan would take (105 TB = 110,100,480 MB) / 300 MB/s ≈ 367,000 s ≈ 6,000 minutes. To finish within 30 minutes, at least 200 disks must therefore be scanning in parallel!

PS: the June 2010 estimate was that Facebook ran on 60K servers
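The same arithmetic as a small, self-contained Scala sketch (the 105 TB, 300 MB/s, and 30-minute figures are the slide’s assumptions):

  object EnvelopeCheck extends App {
    val scanMB          = 105L * 1024 * 1024   // 105 TB expressed in MB = 110,100,480 MB
    val diskMBperSec    = 300L                 // sequential throughput of one very fast disk
    val secondsOneDisk  = scanMB / diskMBperSec  // ~367,000 s on a single disk
    val minutesOneDisk  = secondsOneDisk / 60    // ~6,100 minutes
    val disksInParallel = minutesOneDisk / 30    // scan must finish in 30 minutes => ~200 disks
    println(s"$secondsOneDisk s ~= $minutesOneDisk min on one disk; ~$disksInParallel disks needed in parallel")
  }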

Page 7: Bigdata processing with Spark

Source: Google

Data Center (is the Computer)

Page 8: Bigdata processing with Spark

Source: NY Times (6/14/2006), http://www.nytimes.com/2006/06/14/technology/14search.html

Page 9: Bigdata processing with Spark

FB’s Data Centers

Suggested further reading:

- http://www.datacenterknowledge.com/the-facebook-data-center-faq/
- http://opencompute.org/
- “Open hardware”: server, storage, and data center
- Claim: 38% more efficient and 24% less expensive to build and run than other state-of-the-art data centers

Page 10: Bigdata processing with Spark

Building Blocks

Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024

Page 11: Bigdata processing with Spark

Storage Hierarchy

Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024

Page 12: Bigdata processing with Spark

Numbers Everyone Should Know

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lock/unlock 100 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 10,000 ns

Send 2K bytes over 1 Gbps network 20,000 ns

Read 1 MB sequentially from memory 250,000 ns

Round trip within same datacenter 500,000 ns

Disk seek 10,000,000 ns

Read 1 MB sequentially from network 10,000,000 ns

Read 1 MB sequentially from disk 30,000,000 ns

Send packet CA->Netherlands->CA 150,000,000 ns

According to Jeff Dean

Page 13: Bigdata processing with Spark

Storage Hierarchy

Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024

Page 14: Bigdata processing with Spark

Storage Hierarchy

Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024

Page 15: Bigdata processing with Spark

Quiz Time!!

Consider a 1 TB database with 100 byte records

- We want to update 1 percent of the records

Plan A: Seek to the records and make the updates

Plan B: Write out a new database that includes the updates

Source: Ted Dunning, on Hadoop mailing list

Page 16: Bigdata processing with Spark

Seeks vs. Scans

Consider a 1 TB database with 100 byte records

- We want to update 1 percent of the records

Scenario 1: random access

- Each update takes ~30 ms (seek, read, write)
- 10^8 updates = ~35 days

Scenario 2: rewrite all records

- Assume 100 MB/s throughput
- Time = 5.6 hours(!)

Lesson: avoid random seeks! (A sketch of this cost model follows below.)

In the words of Prof. Peter Boncz (CWI & VU): “Latency is the enemy”

Source: Ted Dunning, on Hadoop mailing list
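The cost model behind these numbers, as a small Scala sketch (30 ms per random update and 100 MB/s sequential throughput are the slide’s assumptions):

  object SeeksVsScans extends App {
    val records      = 1e12 / 100          // 1 TB of 100-byte records = 10^10 records
    val updates      = records * 0.01      // update 1 percent => 10^8 updates
    val planASeconds = updates * 0.030     // 30 ms per random in-place update (seek, read, write)
    val planBSeconds = 2 * 1e12 / 100e6    // read + rewrite the full 1 TB at 100 MB/s
    println(s"Plan A: ${planASeconds / 86400} days")   // ~35 days
    println(s"Plan B: ${planBSeconds / 3600} hours")   // ~5.6 hours
  }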

Page 17: Bigdata processing with Spark

Programming for Big Data: the Data Center

Page 18: Bigdata processing with Spark

Emerging Big Data Systems

Distributed, shared-nothing

- None of the resources are logically shared between processes

Data parallel

- Exactly the same task is performed on different pieces of the data

Page 19: Bigdata processing with Spark

Shared-nothing

A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network

- Possible trade-off: a large number of low-end servers instead of a small number of high-end ones

Page 20: Bigdata processing with Spark
Page 21: Bigdata processing with Spark


Page 22: Bigdata processing with Spark

Data Parallel

Remember: 0.5 ns (L1 cache reference) vs. 500,000 ns (round trip within the datacenter)

Δ is 6 orders of magnitude!

With huge amounts of data (and resources necessary to process it), we simply cannot expect to ship the data to the application – the application logic needs to ship to the data!

Page 23: Bigdata processing with Spark

Gray’s Laws

How to approach data engineering challenges for large-scale scientific datasets:

1. Scientific computing is becoming increasingly data intensive
2. The solution is in a “scale-out” architecture
3. Bring computations to the data, rather than data to the computations
4. Start the design with the “20 queries”
5. Go from “working to working”

See: http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part1_szalay.pdf

Page 24: Bigdata processing with Spark

Distributed File System (DFS)

The exact location of data is unknown to the programmer

The programmer writes a program at an abstraction level above that of the low-level data

- however, notice that the abstraction level offered is usually still rather low…

Page 25: Bigdata processing with Spark

GFS: Assumptions

Commodity hardware over “exotic” hardware

- Scale “out”, not “up”

High component failure rates

- Inexpensive commodity components fail all the time

“Modest” number of huge files

- Multi-gigabyte files are common, if not encouraged

Files are write-once, mostly appended to

- Perhaps concurrently

Large streaming reads over random access

- High sustained throughput over low latency

GFS slides adapted from material by (Ghemawat et al., SOSP 2003)

Page 26: Bigdata processing with Spark

GFS: Design Decisions

Files stored as chunks

- Fixed size (64 MB)

Reliability through replication

- Each chunk replicated across 3+ chunkservers

Single master to coordinate access, keep metadata

- Simple centralized management

No data caching

- Little benefit due to large datasets, streaming reads

Simplify the API

- Push some of the issues onto the client (e.g., data layout)

HDFS = GFS clone (same basic ideas)
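As a client-side illustration of these ideas, the sketch below asks HDFS for the block locations of a file, i.e. the chunk and replica metadata kept by the master/NameNode. It is a sketch only: the path is a placeholder as on the slides, and it assumes the Hadoop client libraries are on the classpath.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path

  val path   = new Path("hdfs://...")               // placeholder path
  val fs     = path.getFileSystem(new Configuration())
  val status = fs.getFileStatus(path)
  // One BlockLocation per fixed-size block; each lists the nodes holding a replica
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
  blocks.foreach(b => println(b.getHosts.mkString(", ")))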

Page 27: Bigdata processing with Spark

A Prototype “Big Data Analysis” Task

- Iterate over a large number of records
- Extract something of interest from each
- Aggregate intermediate results
- Usually, aggregation requires shuffling and sorting the intermediate results
- Generate final output

Key idea: provide a functional abstraction for these two operations:

Map

Reduce

(Dean and Ghemawat, OSDI 2004)

Page 28: Bigdata processing with Spark

Map / Reduce“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs”

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, 2004. http://research.google.com/archive/mapreduce.html

Page 29: Bigdata processing with Spark

MR Implementations

Google “invented” their MR system, a proprietary implementation in C++

- Bindings in Java, Python

Hadoop is an open-source re-implementation in Java

- Original development led by Yahoo
- Now an Apache open source project
- Emerging as the de facto big data stack
- Rapidly expanding software ecosystem

Page 30: Bigdata processing with Spark

Map / Reduce

Process data using special map() and reduce() functions

- The map() function is called on every item in the input and emits a series of intermediate key/value pairs
- All values associated with a given key are grouped together (keys arrive at each reducer in sorted order)
- The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
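To make this contract concrete, here is a minimal word-count sketch using plain Scala collections rather than Hadoop’s actual API; the object name, the sample documents, and the intermediate structure are all illustrative.

  object WordCountSketch extends App {
    val documents = Seq("to be or not to be", "to see or not to see")

    // map(): called on every input item, emits intermediate (key, value) pairs
    val intermediate: Seq[(String, Int)] =
      documents.flatMap(doc => doc.split("\\s+").map(word => (word, 1)))

    // shuffle/sort: group all values for the same key and order the keys
    val grouped: Seq[(String, Seq[Int])] =
      intermediate.groupBy(_._1).toSeq.sortBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

    // reduce(): called once per unique key with its value list, emits the output
    val counts: Seq[(String, Int)] = grouped.map { case (word, ones) => (word, ones.sum) }

    counts.foreach { case (word, n) => println(s"$word\t$n") }
  }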

Page 31: Bigdata processing with Spark

[Figure: MapReduce execution overview. The user program submits the job to the master (1), which schedules map and reduce tasks onto workers (2). Map workers read the input splits (split 0–4) (3) and write intermediate files to their local disks (4); reduce workers remote-read that intermediate data (5) and write the final output files (output file 0, output file 1) (6).]

Adapted by Jimmy Lin from (Dean and Ghemawat, OSDI 2004)

Page 32: Bigdata processing with Spark

MapReduce

[Figure: four map tasks consume input pairs (k1,v1) … (k6,v6) and emit intermediate pairs (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8). “Shuffle and Sort” aggregates the values by key, so the three reduce tasks receive a → [1,5], b → [2,7], c → [2,3,6,8] and produce outputs (r1,s1), (r2,s2), (r3,s3).]

Page 33: Bigdata processing with Spark

MapReduce “Runtime”

Handles scheduling

- Assigns workers to map and reduce tasks

Handles “data distribution”

- Moves processes to data

Handles synchronization

- Gathers, sorts, and shuffles intermediate data

Handles errors and faults

- Detects worker failures and restarts

Everything happens on top of a Distributed File System (DFS)

Page 34: Bigdata processing with Spark

Q: “Hadoop the Answer?”

Page 35: Bigdata processing with Spark

Data Juggling

The operational reality of many organizations is that Big Data is constantly being pumped between different systems:

- Key-value stores
- General-purpose distributed file systems
- (Distributed) DBMSs
- Custom (distributed) file organizations

Page 36: Bigdata processing with Spark

Q: “Hadoop the Answer?”

Not that easy to write efficient and scalable code!

Page 37: Bigdata processing with Spark

Controlling Execution

Cleverly-constructed data structures for keys and values

- Carry partial results together through the pipeline

Sort order of intermediate keys

- Control the order in which reducers process keys

Partitioning of the key space

- Control which reducer processes which keys

Preserving state in mappers and reducers

- Capture dependencies across multiple keys and values

Page 38: Bigdata processing with Spark

Hadoop’s Deficiencies

Page 39: Bigdata processing with Spark

Sources of latency…

- Job startup time
- Parsing and serialization
- Checkpointing
- Map/reduce boundary: mappers must finish before reducers start
- Multi-job dataflow: a job from the previous step in the analysis pipeline must finish first
- No indexes

Page 40: Bigdata processing with Spark

Hadoop Drawbacks / Limitations

No record abstraction

- HDFS even leads to “broken” records

Focus on scale-out, low emphasis on single-node “raw” performance

Limited (insufficient?) expressive power

- Joins? Graph traversal?

Lack of schema information

- Only becomes a problem in the long run…

Fundamentally designed for batch processing only

Page 41: Bigdata processing with Spark

Two Cases against Batch Processing

Interactive analysis

- Issues many different queries over the same data

Iterative machine learning algorithms

- Read and write the same data over and over again

Page 42: Bigdata processing with Spark

Data Sharing (Hadoop)

[Figure: with Hadoop MapReduce, each iteration of an iterative job (iter. 1, iter. 2, …) must read its input from HDFS and write its result back to HDFS before the next iteration can start, and each interactive query (query 1, query 2, query 3, …) re-reads the same input from HDFS to produce its result. Slow due to replication, serialization, and disk IO.]

Page 43: Bigdata processing with Spark

Intermezzo…

Page 44: Bigdata processing with Spark

Data Sharing (Spark)

[Figure: with Spark, the input is read and processed once (“one-time processing”) into distributed memory; subsequent iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3, …) operate on that in-memory data, which is 10–100× faster than going over the network or to disk.]

Page 45: Bigdata processing with Spark

Challenge

The distributed memory abstraction must be

- Fault-tolerant
- Efficient in large commodity clusters

How do we design a programming interface that can provide fault tolerance efficiently?

Page 46: Bigdata processing with Spark

Challenge

Previous distributed storage abstractions have offered an interface based on fine-grained updates

- Reads and writes to cells in a table
- E.g. key-value stores, databases, distributed memory

Requires replicating data or update logs across nodes for fault tolerance

- Expensive for data-intensive apps (i.e., Big Data)

Page 47: Bigdata processing with Spark

Spark Programming Model

Key idea: Resilient Distributed Datasets (RDDs)

- Distributed collections of objects
- Cached in memory across cluster nodes, upon request
- Parallel operators to manipulate data in RDDs
- Automatic reconstruction of intermediate results upon failure

Interface

- Clean language-integrated API in Scala
- Can be used interactively from the Scala console

Page 48: Bigdata processing with Spark

RDDs: Batch Processing

Set-oriented operations (instead of tuple-oriented)

- Same basic principle as relational databases, key for efficient query processing

A nested relational model

- Allows for complex values that may need to be “flattened” for further processing
- E.g.: map vs. flatMap (see the sketch below)
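A minimal sketch of that difference on an RDD of made-up log lines; it assumes an existing SparkContext named sc.

  val lines = sc.parallelize(Seq("a\tb c", "d\te f"))

  // map: exactly one output element per input element -> RDD[Array[String]]
  val tokensNested = lines.map(_.split("\\s+"))

  // flatMap: each input element may yield zero or more output elements,
  // and the nesting is flattened away -> RDD[String]
  val tokensFlat = lines.flatMap(_.split("\\s+"))

  println(tokensNested.count())   // 2: one array per line
  println(tokensFlat.count())     // 6: one entry per token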

Page 49: Bigdata processing with Spark

RDD Operations

Great documentation! http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

Page 50: Bigdata processing with Spark

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

  val lines = spark.textFile("hdfs://...")            // base RDD
  val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  val messages = errors.map(_.split('\t')(2))
  val cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count          // action
  cachedMsgs.filter(_.contains("bar")).count
  . . .

[Figure: the Driver ships tasks to the Workers; each Worker reads one HDFS block (Block 1–3), builds and caches its partition of the RDD (Cache 1–3), and sends the results of actions back to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Slide by Matei Zaharia, creator Spark, http://spark-project.org

Page 51: Bigdata processing with Spark

Example: Logistic Regression

  val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

  var w = Vector.random(D)                               // initial parameter vector

  // repeated MapReduce steps to do gradient descent
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)

Slide by Matei Zaharia, creator Spark, http://spark-project.org

Page 52: Bigdata processing with Spark

Logistic Regression Performance

Hadoop: 127 s / iteration

Spark: 174 s for the first iteration, 6 s for further iterations

Slide by Matei Zaharia, creator Spark, http://spark-project.org

Page 53: Bigdata processing with Spark

Example Job

  val sc = new SparkContext(
    "spark://...", "MyJob", home, jars)

  val file = sc.textFile("hdfs://...")             // resilient distributed
  val errors = file.filter(_.contains("ERROR"))    //   datasets (RDDs)

  errors.cache()

  errors.count()                                   // action

Page 54: Bigdata processing with Spark

Transformations build up a DAG, but don’t “do anything”
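A small sketch of this laziness; it assumes an existing SparkContext named sc, and the numbers are arbitrary.

  val nums    = sc.parallelize(1 to 1000000)
  val squares = nums.map(x => x.toLong * x)   // transformation: nothing runs yet
  val evens   = squares.filter(_ % 2 == 0)    // still nothing runs, the DAG only grows

  val n = evens.count()                       // action: now the DAG is executed
  println(n)                                  // 500000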


Page 55: Bigdata processing with Spark

RDD Graph

[Figure: dataset-level view: file is a HadoopRDD (path = hdfs://...) and errors is a FilteredRDD (func = _.contains(…), shouldCache = true) derived from it. Partition-level view: each partition of the FilteredRDD is computed from the corresponding HadoopRDD partition by its own task (Task 1, Task 2, ...).]

Page 56: Bigdata processing with Spark

Data Locality

- First run: data is not in the cache, so use the HadoopRDD’s locality preferences (from HDFS)
- Second run: the FilteredRDD is in the cache, so use its locations
- If something falls out of the cache, go back to HDFS

Page 57: Bigdata processing with Spark

Resilient Distributed Datasets (RDDs)

Offer an interface based on coarse-grained transformations (e.g. map, group-by, join)

Allows for efficient fault recovery using lineage

- Log one operation to apply to many elements
- Recompute lost partitions of the dataset on failure
- No cost if nothing fails

Page 58: Bigdata processing with Spark

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:

  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))

[Figure: lineage chain: HDFSFile → FilteredRDD (func = _.startsWith(...)) → MappedRDD (func = _.split(...)).]

Slide by Matei Zaharia, creator Spark, http://spark-project.org
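As an aside, this lineage can be inspected from the shell with toDebugString; a sketch, assuming an existing SparkContext sc and the same placeholder path as above:

  val messages = sc.textFile("hdfs://...")
    .filter(_.startsWith("ERROR"))
    .map(_.split('\t')(2))

  // Prints the chain of parent RDDs Spark would use to recompute lost partitions
  println(messages.toDebugString)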

Page 59: Bigdata processing with Spark

RDD Representation

Simple common interface:

- Set of partitions
- Preferred locations for each partition
- List of parent RDDs
- Function to compute a partition given its parents
- Optional partitioning info

Allows capturing a wide range of transformations

Users can easily add new transformations

Slide by Matei Zaharia, creator Spark, http://spark-project.org

Page 60: Bigdata processing with Spark

RDDs in More Detail

RDDs additionally provide:

- Control over partitioning, which can be used to optimize data placement across queries (usually more efficient than the sort-based approach of MapReduce)
- Control over persistence (e.g. store on disk vs in RAM)
- Fine-grained reads (treat the RDD as a big table)

Slide by Matei Zaharia, creator Spark, http://spark-project.org
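A minimal sketch of the partitioning and persistence knobs; it assumes an existing SparkContext sc, a placeholder input path, and hypothetical tab-separated lines whose first field is the key.

  import org.apache.spark.HashPartitioner
  import org.apache.spark.storage.StorageLevel

  val pairs = sc.textFile("hdfs://...")
    .map(line => (line.split('\t')(0), 1))

  val partitioned = pairs
    .partitionBy(new HashPartitioner(64))   // control data placement by key
    .persist(StorageLevel.DISK_ONLY)        // control persistence: keep on disk instead of RAM

  // reduceByKey can now reuse the existing partitioning, avoiding a shuffle
  println(partitioned.reduceByKey(_ + _).count())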

Page 61: Bigdata processing with Spark

Wrap-up: Spark

- Avoid materialization of intermediate results
- Recomputation is a viable alternative to replication for providing fault tolerance
- A good and user-friendly (i.e., programmer-friendly) API helps gain traction very fast
- In a few years, Spark has become the default tool for deploying code on clusters

Page 62: Bigdata processing with Spark

Thanks: Matei Zaharia, MIT (https://people.csail.mit.edu/matei/), http://spark-project.org