Big Data Analytics Systems
Transcript of rxin.github.io/talks/2015-04-09_cs186-berkeley.pdf (66 slides)

Page 1:

Big Data Analytics Systems: What Goes Around Comes Around

Reynold Xin, CS186 guest lecture @ Berkeley Apr 9, 2015

Page 2:

Who am I?

Co-founder & architect @ Databricks
On leave from PhD @ Berkeley AMPLab
Current world record holder in 100TB sorting (Daytona GraySort Benchmark)

Page 3:

Transaction Processing (OLTP)
“User A bought item b”

Analytics (OLAP)
“What is the revenue of each store this year?”

Page 4:

Agenda

What is “Big Data” (BD)?
GFS, MapReduce, Hadoop, Spark
What’s different between BD and DB?
Assumption: you have already learned about parallel DBs.

Page 5:
Page 6:
Page 7:

What is “Big Data”?

Page 8:

Gartner’s Definition

“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Page 9:

3 Vs of Big Data

Volume: data size
Velocity: rate of data coming in
Variety (most important V): data sources, formats, workloads

Page 10:

“Big Data” can also refer to the tech stack

Many were pioneered by Google

Page 11:

Why didn’t Google just use database systems?

Page 12:

Challenges

Data size growing (volume & velocity)
–  Processing has to scale out over large clusters

Complexity of analysis increasing (variety)
–  Massive ETL (web crawling)
–  Machine learning, graph processing

Page 13:

Examples

Google web index: 10+ PB
Types of data: HTML pages, PDFs, images, videos, …

Cost of 1 TB of disk: $50
Time to read 1 TB from disk: 6 hours (50 MB/s)

Page 14:

The Big Data Problem

Semi-/un-structured data doesn’t fit well with databases
A single machine can no longer process or even store all the data!
The only solution is to distribute general storage & processing over clusters.

Page 15:
Page 16:

GFS Assumptions

“Component failures are the norm rather than the exception”
“Files are huge by traditional standards”
“Most files are mutated by appending new data rather than overwriting existing data”
– GFS paper

Page 17:

File Splits

[Figure: a large 6,440MB file is split into blocks of e.g. 64MB: Blocks 1–100 at 64MB each, plus a final 40MB Block 101]

Files are composed of a set of blocks
•  Typically 64MB in size
•  Each block is stored as a separate file in the local file system (e.g. NTFS)
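To make the arithmetic concrete, a minimal sketch (in Scala) of the block-layout math for the 6,440MB example above:

// Block layout for the figure above: a 6,440MB file with 64MB blocks
// yields 100 full blocks plus one 40MB tail block, i.e. 101 blocks.
val fileSizeMB  = 6440
val blockSizeMB = 64

val fullBlocks  = fileSizeMB / blockSizeMB                      // 100
val tailBlockMB = fileSizeMB % blockSizeMB                      // 40
val totalBlocks = fullBlocks + (if (tailBlockMB > 0) 1 else 0)  // 101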

Page 18:

Block Placement

e.g., Replication factor = 3

Default placement policy:
–  First copy is written to the node creating the file (write affinity)
–  Second copy is written to a data node within the same rack (to minimize cross-rack network traffic)
–  Third copy is written to a data node in a different rack (to tolerate switch failures)

[Figure: Blocks 1–3, each replicated three times across Nodes 1–5]

Objectives: load balancing, fast access, fault tolerance
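A toy sketch of this placement policy in Scala; Node and placeReplicas are hypothetical names, not the real GFS/HDFS API, and error handling for clusters without a second rack is omitted:

case class Node(id: Int, rack: Int)

// Pick 3 replicas per the default policy: the writer first, then a node
// on the same rack, then a node on a different rack (assumes the
// cluster spans at least two racks).
def placeReplicas(writer: Node, cluster: Seq[Node]): Seq[Node] = {
  val second = cluster.find(n => n.rack == writer.rack && n != writer).get
  val third  = cluster.find(_.rack != writer.rack).get
  Seq(writer, second, third)
}

val cluster = Seq(Node(1, 1), Node(2, 1), Node(3, 2), Node(4, 2), Node(5, 3))
placeReplicas(cluster.head, cluster)  // Node(1,1), Node(2,1), Node(3,2)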

Page 19:

GFS Architecture

[Figure: a NameNode, with namespace backups replicated to a BackupNode, coordinates five DataNodes (heartbeat, balancing, replication, etc.)]

Page 20:

Failures, Failures, Failures

GFS paper: “Component failures are the norm rather than the exception.”

Failure types:
–  Disk errors and failures
–  DataNode failures
–  Switch/rack failures
–  NameNode failures
–  Datacenter failures

Page 21:

GFS Summary

Store large, immutable (append-only) files
Scalability
Reliability
Availability

Page 22:

Google Datacenter

How do we program this thing?

Page 23:

Traditional Network Programming

Message-passing between nodes (MPI, RPC, etc.)

Really hard to do at scale:
–  How to split the problem across nodes?
   •  Important to consider network and data locality
–  How to deal with failures?
   •  If a typical server fails every 3 years, a 10,000-node cluster sees 10 faults/day! (see the sketch after this list)
–  Even without failures: stragglers (a node is slow)

Almost nobody does this!
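A quick back-of-the-envelope check of the failure arithmetic in the list above (the 3-year failure interval is the slide’s own assumption):

// If each server fails about once every 3 years, independently:
val nodes        = 10000
val mtbfDays     = 3 * 365.0
val faultsPerDay = nodes / mtbfDays  // ≈ 9.1, i.e. roughly 10 faults/day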

Page 24:

Data-Parallel Models

Restrict the programming interface so that the system can do more automatically

“Here’s an operation, run it on all of the data”
–  I don’t care where it runs (you schedule that)
–  In fact, feel free to run it twice on different nodes

Does this sound familiar?

Page 25:
Page 26:

MapReduce

First widely popular programming model for data-intensive apps on clusters

Published by Google in 2004
–  Processes 20 PB of data / day

Popularized by the open-source Hadoop project

Page 27:

MapReduce Programming Model

Data type: key-value records

Map function:
(K_in, V_in) -> list(K_inter, V_inter)

Reduce function:
(K_inter, list(V_inter)) -> list(K_out, V_out)
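For concreteness, a minimal word-count pair matching these signatures (a Scala sketch; the Long offset key is an illustrative assumption, not the paper’s exact types):

// Map: (offset, line of text) -> list of (word, 1)
def map(key: Long, value: String): List[(String, Int)] =
  value.split(" ").filter(_.nonEmpty).map(word => (word, 1)).toList

// Reduce: (word, all the 1s for that word) -> (word, total count)
def reduce(key: String, values: List[Int]): List[(String, Int)] =
  List((key, values.sum))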

Page 28:

Hello World of Big Data: Word Count

Input:
  “the quick brown fox”
  “the fox ate the mouse”
  “how now brown cow”

Map output:
  Map 1: (the, 1) (quick, 1) (brown, 1) (fox, 1)
  Map 2: (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)

Shuffle & Sort, then Reduce output:
  Reduce 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  Reduce 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)

Input → Map → Shuffle & Sort → Reduce → Output

Page 29:

MapReduce Execution

Automatically split work into many small tasks
Send map tasks to nodes based on data locality
Load-balance dynamically as tasks finish

Page 30:

MapReduce Fault Recovery

If a task fails, re-run it and re-fetch its input
–  Requirement: input is immutable

If a node fails, re-run its map tasks on others
–  Requirement: task result is deterministic & side effects are idempotent

If a task is slow, launch a 2nd copy on another node
–  Requirement: same as above

Page 31:

MapReduce Summary

By providing a data-parallel model, MapReduce greatly simplified cluster computing:

–  Automatic division of job into tasks
–  Locality-aware scheduling
–  Load balancing
–  Recovery from failures & stragglers

Also flexible enough to model a lot of workloads…

Page 32:

Hadoop

Open-sourced by Yahoo!
–  Modeled after the two Google papers

Two components:

–  Storage: Hadoop Distributed File System (HDFS)
–  Compute: Hadoop MapReduce

Sometimes synonymous with Big Data

Page 33:
Page 34:

Why didn’t Google just use databases?

Cost
–  database vendors charge by $/TB or $/core

Scale
–  no database systems at the time had been demonstrated to work at that scale (# machines or data size)

Data Model
–  a lot of semi-/un-structured data: web pages, images, videos

Compute Model
–  SQL not expressive (or “simple”) enough for many Google tasks (e.g. crawl the web, build inverted index, log analysis on unstructured data)

Not-invented-here

Page 35:

MapReduce Programmability

Most real applications require multiple MR steps
–  Google indexing pipeline: 21 steps
–  Analytics queries (e.g. count clicks & top K): 2–5 steps
–  Iterative algorithms (e.g. PageRank): 10s of steps

Multi-step jobs create spaghetti code
–  21 MR steps -> 21 mapper and reducer classes
–  Lots of boilerplate code per step

Page 36:

Higher Level Frameworks

SELECT count(*) FROM users

A = load 'foo';
B = group A all;
C = foreach B generate COUNT(A);

In reality, 90+% of MR jobs are generated by Hive SQL

Page 37:

SQL on Hadoop (Hive)

[Figure: Hive architecture. A Client (CLI, JDBC) submits queries to the Driver, whose SQL Parser, Query Optimizer, and Physical Plan stages (consulting the Metastore) produce MapReduce jobs for Execution over HDFS]

Page 38:

Problems with MapReduce

1.  Programmability
–  We covered this earlier …

2.  Performance
–  Each MR job writes all output to disk
–  Lack of more primitives such as data broadcast

Page 39:

Spark

Started in Berkeley AMPLab in 2010; addresses MR problems.

Programmability: DSL in Scala / Java / Python
–  Functional transformations on collections
–  5–10X less code than MR
–  Interactive use from Scala / Python REPL
–  You can unit test Spark programs!

Performance:
–  General DAG of tasks (i.e. multi-stage MR)
–  Richer primitives: in-memory cache, torrent broadcast, etc.
–  Can run 10–100X faster than MR
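As a small illustration of the in-memory cache primitive, a sketch assuming spark is a SparkContext (as in the word-count example later) and a hypothetical HDFS path:

val nums = spark.textFile("hdfs://.../nums.txt")
                .map(_.toDouble)
                .cache()      // keep the parsed data in memory

var total = 0.0
for (i <- 1 to 10) {
  total += nums.sum()         // HDFS is read only on the first pass
}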

Page 40:

Programmability

Full Google WordCount:

#include "mapreduce/mapreduce.h"

// User’s map function
class SplitWords : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(SplitWords);

// User’s reduce function
class Sum : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");
  // Do partial sums within map
  out->set_combiner_class("Sum");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}

Page 41:

Programmability

Spark WordCount:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("out.txt")

Page 42:

Performance

Time per Iteration (s):

                      Hadoop MR   Spark
Logistic Regression     110        0.96
K-Means Clustering      155        4.1

Page 43:

Performance: Time to sort 100TB

2013 Record: Hadoop – 2100 machines, 72 minutes
2014 Record: Spark – 207 machines, 23 minutes
(Spark also sorted 1PB in 4 hours)

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 44:
Page 45:

Spark Summary

Spark generalizes MapReduce to provide:
–  High performance
–  Better programmability
–  (consequently) a unified engine

The most active open source data project

Page 46:

Note: not a scientific comparison.

Page 47:

Beyond Hadoop Users

[Figure: Spark’s early adopters were users who understand MapReduce & functional APIs; the broader user base now includes data engineers, data scientists, statisticians, R users, PyData, …]

Page 48:


Page 49:

DataFrames in Spark

Distributed collection of data grouped into named columns (i.e. an RDD with a schema)

DSL designed for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs

Available in Python, Scala, Java, and R (via SparkR)
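A minimal DataFrame sketch (Spark 1.3-era Scala; the sqlContext, JSON path, and column names are assumptions for illustration, not from the slides):

// Load semi-structured JSON; the schema is inferred automatically.
val users = sqlContext.jsonFile("hdfs://.../users.json")

users.printSchema()                  // metadata
users.filter(users("age") > 21)      // project & filter by named column
     .groupBy("dept")                // aggregation
     .count()
     .show()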

Page 50:

Plan Optimization & Execution

DataFrames and SQL share the same optimization/execution pipeline
Maximize code reuse & share optimization efforts

[Figure: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (against the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning yields candidate Physical Plans; a Cost Model selects one Physical Plan; Code Generation emits RDDs]

Page 51:

Our Experience So Far

SQL is wildly popular and important
–  100% of Databricks customers use some SQL

Schema is very useful
–  Most data pipelines, even the ones that start with unstructured data, end up having some implicit structure
–  Key-value is too limited
–  That said, semi-/un-structured support is paramount

Separation of logical vs physical plan
–  Important for performance optimizations (e.g. join selection)

Page 52:

Return of SQL

Page 53:
Page 54:

Why SQL?

Almost everybody knows SQL
Easier to write than MR (even Spark) for analytic queries
Lingua franca for data analysis tools (business intelligence, etc.)
Schema is useful (key-value is limited)

Page 55:

What’s really different?

SQL on BD (Hadoop/Spark) vs SQL in DB?

Two perspectives:
1.  Flexibility in data and compute model
2.  Fault-tolerance

Page 56:

Traditional Database Systems (Monolithic)

[Figure: Applications sit on top of SQL, which sits on top of the Physical Execution Engine (Dataflow), all inside one monolithic system]

One way (SQL) in/out, and data must be structured

Page 57:

Big Data Ecosystems (Layered)

[Figure: SQL, DataFrame, and M.L. libraries all run on a shared Data-Parallel Engine (Spark, MR)]

Decoupled storage, low- vs high-level compute
Structured, semi-structured, and unstructured data
Schema on read, schema on write

Page 58:

Evolution of Database Systems: Decouple Storage from Compute

[Figure: the traditional monolithic stack (Applications → SQL → Physical Execution Engine) beside the 2014–2015 stack, in which storage is decoupled from compute and nested data (e.g. JSON) is supported]

Examples: IBM BigInsights, Oracle, EMC Greenplum, …

Page 59:

Perspective 2: Fault Tolerance

Database systems: coarse-grained fault tolerance
–  If a fault happens, fail the query (or rerun it from the beginning)

MapReduce: fine-grained fault tolerance
–  Rerun failed tasks, not the entire query

Page 60:

We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks).

Page 61:

MapReduce Checkpointing-based Fault Tolerance

Checkpoint all intermediate output
–  Replicate them to multiple nodes
–  Upon failure, recover from checkpoints
–  High cost of fault-tolerance (disk and network I/O)

Necessary for PBs of data on thousands of machines

What if I have 20 nodes and my query takes only 1 min?

Page 62:

Spark Unified Checkpointing and Rerun

Simple idea: remember the lineage used to create an RDD, and recompute from the last checkpoint.
When a fault happens, the query still continues.
When faults are rare, there is no need to checkpoint, i.e. the cost of fault-tolerance is low.
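A toy illustration of lineage-based recomputation; these classes are made up for exposition, not Spark’s actual internals:

// Each dataset remembers how it was derived, so a lost result can be
// recomputed from its parent instead of restored from a replicated
// checkpoint.
sealed trait Lineage[T] { def compute(): Seq[T] }

case class Source[T](data: Seq[T]) extends Lineage[T] {
  def compute(): Seq[T] = data
}

case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  // On failure, rerun f over the parent's (re)computed output.
  def compute(): Seq[B] = parent.compute().map(f)
}

val nums    = Source(1 to 1000)
val squares = Mapped(nums, (x: Int) => x * x)
// Only the failed chain is recomputed; when faults are rare,
// no checkpoint I/O is ever paid.
val recovered = squares.compute()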

Page 63:

What’s Really Different?

Monolithic vs layered storage & compute
–  DB becoming more layered
–  Although “Big Data” is still far more flexible than DB

Fault-tolerance
–  DB mostly coarse-grained fault-tolerance, assuming faults are rare
–  Big Data mostly fine-grained fault-tolerance, with new strategies in Spark to mitigate faults at low cost

Page 64:

Convergence

DB evolving towards BD
–  Decouple storage from compute
–  Provide alternative programming models
–  Semi-structured data (JSON, XML, etc.)

BD evolving towards DB
–  Schema beyond key-value
–  Separation of logical vs physical plan
–  Query optimization
–  More optimized storage formats

Page 65:

Thanks & Questions?

Reynold Xin [email protected] @rxin

Page 66:

Acknowledgement

Some slides taken from:
–  Zaharia. Processing Big Data with Small Programs
–  Franklin. SQL, NoSQL, NewSQL? CS186 2013