
Big Data Analysis with Apache Spark

Matei Zaharia


Outline

The big data problem

MapReduce

Apache Spark

How people are using it


The Big Data Problem

Data is growing faster than computing power

Growing data sources » Mostly machine generated

Cheap storage

Stalling CPU speeds


Examples

Facebook’s daily logs: 60 TB

1000 genomes project: 200 TB

Google web index: 10+ PB

Cost of 1 TB of disk: $25

Time to read 1 TB from disk: ~6 hours at 50 MB/s (1,000,000 MB ÷ 50 MB/s = 20,000 s ≈ 5.6 h)


The Big Data Problem

A single machine can no longer process or even store all the data!

Only solution is to distribute over large clusters


Google Datacenter

How do we program this thing?


Traditional Network Programming

Message-passing between nodes

Really hard to do at scale:
» How to split the problem across nodes?
» How to deal with failures?
» Even worse: stragglers (a node that has not failed, but is slow)


Data-Parallel Models

Restrict the programming interface so that the system can do more automatically

“Here’s an operation, run it on all of the data”
» I don’t care where it runs (you schedule that)
» In fact, feel free to run it twice on different nodes

Biggest example: MapReduce


MapReduce

First widely popular programming model for data-intensive apps on clusters

Published by Google in 2004
» Processes 20 PB of data / day

Popularized by open-source Hadoop project


MapReduce Programming Model

Data type: key-value records

Map function:

(K_in, V_in) → list(K_inter, V_inter)

Reduce function:

(K_inter, list(V_inter)) → list(K_out, V_out)


Example: Word Count

def map(line):
    foreach word in line.split():
        output(word, 1)

def reduce(key, values):
    output(key, sum(values))
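To make the model concrete, here is a minimal single-process Python simulation of the same computation, including the shuffle step that groups each key’s values before the reduce (the helper names and in-line input are illustrative, not part of the original deck):

from collections import defaultdict

def map_fn(line):
    # emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Map phase: apply map_fn to every input record
mapped = [pair for line in lines for pair in map_fn(line)]

# Shuffle & sort: group all intermediate values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: one call per distinct key
counts = dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
print(counts)   # {'ate': 1, 'brown': 2, 'cow': 1, ..., 'the': 3}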


Word Count Execution

[Figure: word count execution. Three input lines ("the quick brown fox", "the fox ate the mouse", "how now brown cow") each go to a Map task that emits (word, 1) pairs; the shuffle & sort phase groups the pairs by key across two Reduce tasks, which output the final counts: brown 2, fox 2, how 1, now 1, the 3 and ate 1, cow 1, mouse 1, quick 1. Stages: Input → Map → Shuffle & Sort → Reduce → Output.]


MapReduce Execution

Automatically split work into many small tasks

Send tasks to nodes based on data locality

Automatically recover from failures


Summary

Data-parallel programming models let systems automatically manage much of execution:
» Assigning work, load balancing, fault recovery

But... the story doesn’t end here!


Outline

The big data problem

MapReduce

Apache Spark

How people are using it


Limitations of MapReduce

Programmability: most applications require higher-level functions than map / reduce
» E.g. statistics, matrix multiply, graph search
» Google’s ads pipeline had 20 MR steps

Performance: it is inefficient to combine multiple MapReduce steps into complex programs
» Each step must write its results back to stable storage, so chained jobs pay repeated disk I/O


Apache Spark

Programming model that generalizes MapReduce to support more applications
» Adds efficient, in-memory data sharing

Large library of built-in functions

APIs in Python, Java, Scala, R

[Diagram: Spark Core, with Streaming, SQL, MLlib, and GraphX libraries layered on top]


Spark Programmability

WordCount in MapReduce:

#include "mapreduce/mapreduce.h"

// User’s map function
class SplitWords: public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(SplitWords);

// User’s reduce function
class Sum: public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");
  // Do partial sums within map
  out->set_combiner_class("Sum");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}


Spark Programmability

WordCount in Spark:

file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("out.txt")
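The four-line snippet above assumes an existing context object named spark. A complete, runnable version (a sketch assuming a local PySpark installation; the file paths are illustrative) looks like this:

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")   # run locally on all cores
file = sc.textFile("input.txt")              # any local or HDFS path
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("out")                 # writes one part file per partition
sc.stop()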


Spark Performance

[Bar charts, running time in seconds. K-means clustering: Hadoop MapReduce 121 s, Spark 4.1 s. Logistic regression: Hadoop MapReduce 80 s, Spark 0.96 s.]


Programming Model

Write programs in terms of transformations on distributed datasets.

Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
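A small sketch of the model (values and names are illustrative): transformations such as filter and map are lazy and only describe the dataset; nothing executes until an action is called.

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDBasics")

nums = sc.parallelize(range(1000000))        # base RDD built from a collection
evens = nums.filter(lambda x: x % 2 == 0)    # lazy transformation
squares = evens.map(lambda x: x * x)         # another lazy transformation
squares.cache()                              # keep in memory once computed

print(squares.count())    # action: triggers the whole pipeline
print(squares.sum())      # second action reuses the cached results
sc.stop()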


Example: Text Search

Load a large log file into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))      # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "Illumina" in s).count()          # action
messages.filter(lambda s: "Dell" in s).count()
. . .

[Diagram: the driver ships tasks to workers, each reading one block of the file (Block 1–3); workers return results and keep the computed messages in memory (Cache 1–3) for later queries.]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 7 sec (vs 180 sec for on-disk data)


Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions. Ex:

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Diagram: lineage graph: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
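To make the idea concrete, here is a toy, single-machine illustration of lineage-based recovery (this is not Spark's implementation, just the principle): each dataset remembers its parent and the function that derived it, so lost data can be recomputed rather than restored from replicas.

class LineageDataset:
    def __init__(self, data=None, parent=None, fn=None):
        self.data = data          # materialized contents (may be lost)
        self.parent = parent      # upstream dataset in the lineage graph
        self.fn = fn              # transformation that produced this dataset

    def map(self, fn):
        return LineageDataset(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        if self.data is None:                           # lost or never materialized:
            self.data = self.fn(self.parent.compute())  # recompute from lineage
        return self.data

base = LineageDataset(data=["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"])
msgs = base.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
print(msgs.compute())   # ['full', 'down']
msgs.data = None        # simulate losing the computed partition
print(msgs.compute())   # rebuilt from lineage: ['full', 'down']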


Example: Logistic Regression

Goal: find a line separating two sets of points.

[Figure: scatter plot of + and – points; a random initial line is adjusted iteratively until it reaches the target separating line]
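A sketch of this algorithm in Spark, along the lines of the published examples (the file name and iteration count are illustrative): the points are cached in memory and reused by every gradient step, which is exactly the access pattern the performance numbers on the next slide measure.

import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "LogisticRegression")

def parse(line):
    vals = np.array([float(v) for v in line.split()])
    return (vals[0], vals[1:])            # (label in {-1, +1}, feature vector)

points = sc.textFile("points.txt").map(parse).cache()  # reused every iteration
D = len(points.first()[1])
w = np.random.rand(D)                      # random initial separating plane

for _ in range(10):
    # each point contributes a gradient term; sum the terms across the cluster
    grad = points.map(lambda p:
        (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= grad

print("Final separating plane:", w)
sc.stop()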


Logistic Regression Performance

[Line chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop takes about 110 s per iteration; Spark takes 80 s for its first iteration and roughly 5 s for each further iteration, because the points stay cached in memory after the first pass.]


Supported Operators

map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
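Many of these compose naturally. A quick, illustrative taste (the dataset contents are made up) combining map, reduceByKey, and leftOuterJoin:

from pyspark import SparkContext

sc = SparkContext("local[*]", "Operators")

pages  = sc.parallelize([("p1", "Home"), ("p2", "About")])
visits = sc.parallelize([("p1", "u1"), ("p1", "u2"), ("p2", "u3")])

# count visits per page, then attach the page titles
counts = visits.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
print(pages.leftOuterJoin(counts).collect())
# e.g. [('p1', ('Home', 2)), ('p2', ('About', 1))]
sc.stop()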


Built-in Libraries

[Diagram: Spark Core (RDDs) at the base, with SQL and DataFrames, Spark Streaming, MLlib, and GraphX on top]

Largest integrated standard library for big data


Combining Libraries

# Load data using SQL
ctx.jsonFile("tweets.json").registerTempTable("tweets")
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)


Summary

Libraries + a function-based interface let users write parallel programs similar to sequential code.

Can use Spark interactively in Python, R, etc.
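For example, the pyspark shell shipped with Spark starts a Python REPL with a SparkContext named sc already defined:

$ pyspark
>>> sc.parallelize([1, 2, 3]).map(lambda x: x * 10).collect()
[10, 20, 30]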


Outline

The big data problem

MapReduce

Apache Spark

How people are using it


Spark Community

1000+ deployments, clusters up to 8000 nodes


Applications

Large-scale machine learning

Analysis of neuroscience data

Network security

SQL and data clustering

Trends & recommendations


Programming Languages

[Bar chart: languages used with Spark in the 2014 vs. 2015 community surveys]


Libraries Used

[Bar chart, fraction of users per component: Core 95%, SQL 75%, MLlib 54%, Streaming 46%, GraphX 18%]


Example: Neuroscience

HHMI Janelia Farm analyzes data from full-brain imaging of neural activity.

Larval zebrafish + light-sheet imaging = 2 TB / hour of data

Images from Jeremy Freeman


Data Analysis

Streaming code does clustering and dimensionality reduction on an 80-node cluster.

Images from Jeremy Freeman


Example: Berkeley ADAM

Stores and processes reads with standard big data tools and formats.

25% smaller than BAM(!), linear scale-up

bdgenomics.org


GATK4/Spark
https://github.com/broadinstitute/gatk

GATK4 is a Spark-native application for genetic sequencing analyses. Currently in alpha testing, with good scalability to hundreds of cores, e.g.:

1. MarkDuplicates: used to take hours and was single-core only; GATK4 now runs it in 3 minutes (on a 30 GB exome)
2. Depth of coverage: used to take days on a 200 GB whole genome; GATK4 now runs in 4 minutes
3. Whole-genome metrics (e.g., insert-size distribution): run in 2-3 minutes on a 300 GB whole genome


Open-source, modular, scalable platform for statistical genetics, in development by the Neale lab at Broad.

Combines genetic and phenotypic data to uncover the biology of disease.

Built using Scala and Spark; currently in alpha.

Wall time of the QC pipeline is down from weeks to minutes.


Conclusion

Apache Spark offers a fast, high-level interface for working with big data, based on a data-parallel model.

Large set of existing libraries.

Easy to try on just your laptop! spark.apache.org


To Learn More

Free MOOCs on edX: edx.org

Use case videos at Spark Summit: spark-summit.org