Transcript of “Data-Intensive Distributed Computing, Part 1: MapReduce Algorithm Design (2/4)” (63 slides)

Page 1

Data-Intensive Distributed Computing

Part 1: MapReduce Algorithm Design (2/4)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 431/631 451/651 (Winter 2019)

Adam Roegiest, Kira Systems

January 10, 2019

These slides are available at http://roegiest.com/bigdata-2019w/

Page 2

RTFM you must

Page 3

Announcements

A0 finalized for all sections

CS 431/631 Only: If you want a challenge, you may elect to do all of the CS 451 assignments instead of the six CS 431 assignments.

This is a one-way road. No switching back.

Page 4

Agenda for Today

Why big data?
Hadoop walkthrough

Page 5

Why big data?

[Figure: the “big data stack” covered in this course — Data Science Tools layered on Analytics Infrastructure, layered on Execution Infrastructure]

Page 6

Source: Wikipedia (Everest)

Why big data?

Science
Business
Society

Page 7

Emergence of the 4th Paradigm

Data-intensive e-Science

Maximilien Brice, © CERN

Science

Page 8

Maximilien Brice, © CERN

Page 9

Maximilien Brice, © CERN

Page 10

Page 11

Source: Wikipedia (DNA)

Page 12

[Figure: subject genome → sequencer → reads. A pile of overlapping short reads, e.g., AATGCTTACTATGCGGGCCCCTT and CGGTCTAATGCTTACTATGC, must be assembled back into the subject genome]

Human genome: 3 Gbp
A few billion short reads (~100 GB of compressed data)

Page 13

Business

Data-driven decisions

Data-driven products

Source: Wikipedia (Shinjuku, Tokyo)

Page 14

An organization should retain the data that result from carrying out its mission and exploit those data to generate insights that benefit the organization: market analysis, strategic planning, decision making, etc.

Business Intelligence

Page 15

In the 1990s, Wal-Mart found that customers tended to buy diapers and beer together. So they put them next to each other and increased sales of both.*

This is not a new idea!

So what’s changed?

More compute and storage

Ability to gather behavioral data

* BTW, this is completely apocryphal. (But it makes a nice story.)

Page 16

a useful service

analyze user behavior to extract insights

transform insights into action

$ (hopefully)

Google. Facebook. Twitter. Amazon. Uber.

data science
data products

Virtuous Product Cycle

Page 17

Source: https://images.lookhuman.com/render/standard/8002245806006052/pillow14in-whi-z1-t-netflixing.png

Page 18

Source: https://www.reddit.com/r/teslamotors/comments/6gsc6v/i_think_the_neural_net_mining_is_just_starting/ (June 2017)

Page 19

Page 20

Source: Wikipedia (Rosetta Stone)

Page 21

(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

No data like more data!

Page 22

Source: Guardian

Humans as social sensors

Computational social science

Society

Page 23

Predicting X with Twitter

(Paul and Dredze, ICWSM 2011; Bond et al., Nature 2011)

Political Mobilization on Facebook

2010 US midterm elections: 60M users shown “I Voted” messages

Summary: increased turnout by 60k directly and 280k indirectly

Page 24

Source: https://www.theverge.com/2017/11/1/16593346/house-russia-facebook-ads

Page 25

Source: https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

Page 26

Page 27

Source: http://www.cnn.com/2016/12/07/asia/new-zealand-passport-robot-asian-trnd/index.html

Page 28

The Perils of Big Data

The end of privacy
Who owns your data, and can the government access it?

The echo chamber
Are you seeing only what you want to see?

The racist algorithm
Algorithms aren’t racist, people are?

We desperately need “data ethics” to go with big data!

Page 29

Page 30

Source: Popular Internet Meme

Page 31

Source: Google

Tackling Big Data

Page 32

[Figure: the logical view of MapReduce. Mappers consume input pairs (k1, v1) … (k6, v6) and emit intermediate pairs, e.g., (a, 1) (b, 2), (c, 3) (c, 6), (a, 5) (c, 2), (b, 7) (c, 8); combiners aggregate locally, merging (c, 3) and (c, 6) into (c, 9); partitioners route keys to reducers; grouping values by key yields a → [1, 5], b → [2, 7], c → [2, 9, 8]; reducers emit (r1, s1), (r2, s2), (r3, s3)]

* Important detail: reducers process keys in sorted order

Logical View
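Not on the slide, but worth making explicit: the framework treats a combiner as an optional optimization — it may run zero, one, or several times on any subset of a mapper’s output — so a combiner must be associative and commutative, and its input and output types must match the mapper’s output types.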

Page 33

[Figure: the physical view of MapReduce execution, adapted from Dean and Ghemawat, OSDI 2004. Input files are divided into splits 0–4. The user program (1) submits the job to the master, which (2) schedules map and reduce tasks onto workers. In the map phase, workers (3) read their splits and (4) write intermediate files to local disk. In the reduce phase, workers (5) remotely read the intermediate files and (6) write output files 0 and 1]

Physical View

Page 34

Source: Google

The datacenter is the computer!

Page 35

The datacenter is the computer!

It’s all about the right level of abstraction
Moving beyond the von Neumann architecture
What’s the “instruction set” of the datacenter computer?

Hide system-level details from the developers
No more race conditions, lock contention, etc.
No need to explicitly worry about reliability, fault tolerance, etc.

Separating the what from the how
Developer specifies the computation that needs to be performed
Execution framework (“runtime”) handles actual execution

Page 36

The datacenter is the computer!

“Big ideas”:

Scale “out”, not “up”
Limits of SMP and large shared-memory machines

Move processing to the data
Clusters have limited bandwidth; code is a lot smaller than data

Process data sequentially, avoid random access
Seeks are expensive; disk throughput is good

Assume that components will break
Engineer software around hardware failures

Page 37

Seeks vs. Scans

Consider a 1 TB database with 100-byte records
We want to update 1 percent of the records

Scenario 1: Mutate each record
Each update takes ~30 ms (seek, read, write)
10^8 updates = ~35 days

Scenario 2: Rewrite all records
Assume 100 MB/s throughput
Time to read and rewrite everything = 5.6 hours(!)

Lesson? Random access is expensive!

Source: Ted Dunning, on the Hadoop mailing list
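Checking the slide’s arithmetic (a quick sanity check; the 30 ms and 100 MB/s figures are the slide’s assumptions):

1 TB of 100-byte records = 10^10 records, so updating 1% touches 10^8 records
Scenario 1: 10^8 updates × 30 ms = 3 × 10^6 s ≈ 35 days
Scenario 2: reading and rewriting 2 × 10^12 bytes at 10^8 bytes/s = 2 × 10^4 s ≈ 5.6 hours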

Page 38

Source: Wikipedia (Mahout)

So you want to drive the elephant!

Page 39

So you want to drive the elephant! (Aside: what about Spark?)

Page 40

org.apache.hadoop.mapreduce
org.apache.hadoop.mapred

Source: Wikipedia (Budapest)

A tale of two packages…

Page 41

MapReduce API*

Mapper<Kin,Vin,Kout,Vout>

void setup(Mapper.Context context)
Called once at the start of the task

void map(Kin key, Vin value, Mapper.Context context)
Called once for each key/value pair in the input split

void cleanup(Mapper.Context context)
Called once at the end of the task

Reducer<Kin,Vin,Kout,Vout> / Combiner<Kin,Vin,Kout,Vout>

void setup(Reducer.Context context)
Called once at the start of the task

void reduce(Kin key, Iterable<Vin> values, Reducer.Context context)
Called once for each key

void cleanup(Reducer.Context context)
Called once at the end of the task

*Note that there are two versions of the API!
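To see how setup, map, and cleanup compose, here is a sketch of the in-mapper combining pattern (not from these slides; the class name CombiningMapper is made up, the Tokenizer helper is borrowed from the Word Count slides below, and imports are omitted as they are on the other slides): accumulate partial counts across all map() calls, then emit them once in cleanup().

private static final class CombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  public void setup(Context context) {
    counts = new HashMap<>();  // runs once, before any map() call
  }

  @Override
  public void map(LongWritable key, Text value, Context context) {
    for (String word : Tokenizer.tokenize(value.toString())) {
      counts.merge(word, 1, Integer::sum);  // aggregate across the whole split
    }
  }

  @Override
  public void cleanup(Context context)
      throws IOException, InterruptedException {
    Text word = new Text();
    IntWritable count = new IntWritable();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      word.set(e.getKey());
      count.set(e.getValue());
      context.write(word, count);  // runs once, after the last map() call
    }
  }
}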

Page 42

MapReduce API*

Partitioner<K, V>

int getPartition(K key, V value, int numPartitions)
Returns the partition number given the total number of partitions

Job

Represents a packaged Hadoop job for submission to the cluster
Need to specify input and output paths
Need to specify input and output formats
Need to specify mapper, reducer, combiner, and partitioner classes
Need to specify intermediate/final key/value classes
Need to specify the number of reducers (but not mappers — why?)

Don’t depend on defaults!

*Note that there are two versions of the API!
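Putting that checklist into code — a minimal sketch of a word count driver (an illustration, not the course’s reference driver; it assumes the MyMapper and MyReducer classes from the Word Count slides below are in scope):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input and output paths
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Input and output formats
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Mapper, combiner, reducer, partitioner classes
    job.setMapperClass(MyMapper.class);
    job.setCombinerClass(MyReducer.class);  // summing is associative, so the reducer doubles as a combiner
    job.setReducerClass(MyReducer.class);
    job.setPartitionerClass(HashPartitioner.class);

    // Intermediate/final key/value classes
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Number of reducers (the number of mappers follows from the input splits)
    job.setNumReduceTasks(1);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}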

Page 43

Data Types in Hadoop: Keys and Values

Writable
Defines a de/serialization protocol. Every data type in Hadoop is a Writable.

WritableComparable
Defines a sort order. All keys must be of this type (but not values).

IntWritable, LongWritable, Text, …
Concrete classes for different data types. Note that these are container objects.

SequenceFile
Binary-encoded sequence of key/value pairs.

Page 44

“Hello World” MapReduce: Word Count

def map(key: Long, value: String) = {
  for (word <- tokenize(value)) {
    emit(word, 1)
  }
}

def reduce(key: String, values: Iterable[Int]) = {
  var sum = 0
  for (value <- values) {
    sum += value
  }
  emit(key, sum)
}
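For a concrete trace (an illustration, not from the slides): on the input line “the quick the”, map emits (the, 1), (quick, 1), (the, 1); after grouping, reduce sees (quick, [1]) and (the, [1, 1]) — in sorted key order — and emits (quick, 1) and (the, 2).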

Page 45

private static final class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : Tokenizer.tokenize(value.toString())) {
      WORD.set(word);
      context.write(WORD, ONE);
    }
  }
}

Word Count Mapper

Page 46

private static final class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}

Word Count Reducer

Page 47

Three Gotchas

Avoid object creation

Execution framework reuses value object in reducer

Passing parameters via class statics doesn’t work!
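To make the object-reuse gotcha concrete — a sketch (the variable names are illustrative): inside reduce(), the framework hands back the same IntWritable container on every iteration, so buffering references without copying silently corrupts your data.

// Buggy: every element of this list ends up pointing at the one
// IntWritable instance that the framework keeps mutating.
List<IntWritable> buffered = new ArrayList<>();
for (IntWritable v : values) {
  buffered.add(v);
}

// Correct: copy the value out of the container before buffering it.
List<Integer> copied = new ArrayList<>();
for (IntWritable v : values) {
  copied.add(v.get());
}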

Page 48

Getting Data to Mappers and Reducers

Configuration parameters
Pass in via the Job configuration object

“Side data”
DistributedCache
Mappers/Reducers can read from HDFS in the setup method
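A sketch of the configuration-parameter route (the parameter name is made up for illustration):

// In the driver: stash a parameter in the job configuration.
job.getConfiguration().set("wordcount.case.sensitive", "false");

// In the mapper: read it back once, in setup().
@Override
public void setup(Context context) {
  boolean caseSensitive =
      context.getConfiguration().getBoolean("wordcount.case.sensitive", true);
  // ... use caseSensitive in map() ...
}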

Page 49

Complex Data Types in Hadoop

How do you implement complex data types?

The easiest way:
Encode it as Text, e.g., (a, b) = “a:b”
Use regular expressions to parse and extract data
Works, but janky

The hard way:
Define a custom implementation of Writable(Comparable)
Must implement: readFields, write, (compareTo)
Computationally efficient, but slow for rapid prototyping
Implement the WritableComparator hook for performance

Somewhere in the middle:
Bespin (via lin.tl) offers various building blocks (see the sketch below)
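A sketch of the hard way (the class name PairOfInts is hypothetical; Bespin ships polished versions of exactly this kind of building block):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class PairOfInts implements WritableComparable<PairOfInts> {
  private int left;
  private int right;

  public PairOfInts() {}  // Writables need a no-arg constructor for reflection

  public void set(int left, int right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public void write(DataOutput out) throws IOException {  // serialization
    out.writeInt(left);
    out.writeInt(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialization
    left = in.readInt();
    right = in.readInt();
  }

  @Override
  public int compareTo(PairOfInts other) {  // defines the sort order for keys
    int cmp = Integer.compare(left, other.left);
    return cmp != 0 ? cmp : Integer.compare(right, other.right);
  }
}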

Page 50

Anatomy of a Job

Job submission:
Client (i.e., driver program) creates a job, configures it, and submits it to the jobtracker
That’s it! The Hadoop cluster takes over…

Hadoop MapReduce program = Hadoop job
Jobs are divided into map and reduce tasks
An instance of a running task is called a task attempt
Each task occupies a slot on the tasktracker
Multiple jobs can be composed into a workflow

Page 51

[Figure: the physical view of MapReduce execution, repeated from Page 33 (adapted from Dean and Ghemawat, OSDI 2004)]

Page 52

Anatomy of a Job

Behind the scenes:
Input splits are computed (on the client end)
Job data (jar, configuration XML) are sent to the jobtracker
The jobtracker puts job data in a shared location and enqueues tasks
Tasktrackers poll for tasks
Off to the races…

Page 53

Source: redrawn from a slide by Cloudera, cc-licensed

[Figure: an InputFormat divides input files into InputSplits; a RecordReader turns each InputSplit into records, which feed a Mapper that produces intermediates]

Page 54

Where’s the data actually coming from?

[Figure: the client computes the InputSplits; at run time, a RecordReader reads each InputSplit and delivers individual records to its Mapper]

Page 55

Source: redrawn from a slide by Cloudera, cc-licensed

[Figure: each Mapper’s intermediates pass through a Partitioner, which routes them to the appropriate Reducer (combiners omitted here)]

Page 56

Source: redrawn from a slide by Cloudera, cc-licensed

[Figure: each Reducer writes its output through a RecordWriter, governed by an OutputFormat, to its own output file]

Page 57

Input and Output

InputFormat:
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat
…

OutputFormat:
TextOutputFormat
SequenceFileOutputFormat
…

Spark also uses these abstractions for reading and writing data!

Page 58

Hadoop Workflow

[Figure: you ↔ the submit node (datasci) ↔ the Hadoop cluster]

Writing code? Getting data in? Getting data out?
Where’s the actual data stored?

Page 59

Debugging Hadoop

First, take a deep breath
Start small, start locally
Build incrementally

Page 60

Source: Wikipedia (The Scream)

Page 61

Code Execution Environments

Different ways to run code:
Local (standalone) mode
Pseudo-distributed mode
Fully-distributed mode

Learn what’s good for what

Page 62

Hadoop Debugging Strategies

Good ol’ System.out.println
Learn to use the webapp to access logs
Logging preferred over System.out.println
Be careful how much you log!

Fail on success
Throw RuntimeExceptions and capture state

Use Hadoop as the “glue”
Implement core functionality outside mappers and reducers
Independently test (e.g., unit testing)
Compose (tested) components in mappers and reducers
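A sketch of the “fail on success” trick (the condition and message are illustrative): when the task reaches the state you want to inspect, throw, then read the message and stack trace back through the webapp’s task logs.

// Deliberately fail so the framework surfaces the captured state
// in the task attempt's logs.
if (key.toString().equals("the-suspect-key")) {
  throw new RuntimeException(
      "Debug dump: key=" + key + ", value=" + value + ", sum=" + sum);
}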

Page 63

Source: Wikipedia (Japanese rock garden)

Questions?