
Big Data & Hadoop

Simone Leo

Distributed Computing – CRS4
http://www.crs4.it

Seminar Series 2015


Outline

1 Big Data

2 MapReduce and Hadoop
   The MapReduce Model
   Hadoop

3 Pydoop
   Motivation
   Architecture
   Usage


The Data Deluge

(images: satellites, social networks, DNA sequencing, high-energy physics)

Storage costs rapidly decreasing
Easier to acquire huge amounts of data

Social networks
Particle colliders
Radio telescopes
DNA sequencers
Wearable sensors
. . .


The Three Vs: Volume

http://www.ibmbigdatahub.com/infographic/four-vs-big-data


The Three Vs: Velocity

http://www.ibmbigdatahub.com/infographic/four-vs-big-data


The Three Vs: Variety

http://www.ibmbigdatahub.com/infographic/four-vs-big-data


Big Data: a Hard-to-Define Buzzword

OED: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges” [1]

Wikipedia: “data sets so large or complex that traditional data processing applications are inadequate”
datascience@berkeley blog: 43 definitions [2]

How big is big?
   Depends on the field and organization
   Varies with time
   [Volume] does not fit in memory / in a hard disk
   [Velocity] production faster than processing time
   [Variety] unknown / unsupported formats . . .

[1] http://www.forbes.com/sites/gilpress/2013/06/18/big-data-news-a-revolution-indeed
[2] http://datascience.berkeley.edu/what-is-big-data


The Free Lunch is Over

CPU clock speed reached saturation around 2004

Multi-core architectures
Everyone must go parallel

Moore’s law reinterpreted:
   number of cores per chip doubles every 2 years
   clock speed remains fixed or decreases

http://www.gotw.ca/publications/concurrency-ddj.htm


Data-Intensive Applications


A new class of problems
Needs specific solutions
   Infrastructure
   Technologies
   Professional background

Scalability
   Automatically adapt to changes in size
   Distributed computing


What is MapReduce?

A programming model for large-scale distributed data processing
   Inspired by map and reduce in functional programming
   Map: map a set of input key/value pairs to a set of intermediate key/value pairs
   Reduce: apply a function to all values associated to the same intermediate key; emit output key/value pairs

An implementation of a system to execute such programs
   Fault-tolerant
   Hides internals from users
   Scales very well with dataset size


MapReduce’s Hello World: Word Count

(diagram) The input “the quick brown fox ate the lazy green fox” is split across three Mappers, each of which emits one (word, 1) pair per word. The Shuffle & Sort phase groups the pairs by key and routes them to two Reducers, which sum the counts and emit: ate 1, brown 1, fox 2, green 1, lazy 1, quick 1, the 2.


Wordcount: Pseudocode

map(String key, String value):
    // key: does not matter in this case
    // value: a subset of input words
    for each word w in value:
        Emit(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: all values associated to key
    int wordcount = 0;
    for each v in values:
        wordcount += ParseInt(v);
    Emit(key, AsString(wordcount));


Mock Implementation – mockmr.py

from itertools import groupby
from operator import itemgetter

def _pick_last(it):
    # strip the grouping key, keeping only the values
    for t in it:
        yield t[-1]

def mapreduce(data, mapf, redf):
    buf = []
    for line in data.splitlines():
        # a real framework would pass a meaningful key (e.g., the byte offset)
        for ik, iv in mapf("foo", line):
            buf.append((ik, iv))
    buf.sort()  # "shuffle & sort": bring identical keys together
    for ik, values in groupby(buf, itemgetter(0)):
        for ok, ov in redf(ik, _pick_last(values)):
            print ok, ov


Mock Implementation – mockwc.py

from mockmr import mapreduce

DATA = """the quick brown
fox ate the
lazy green fox"""

def map_(k, v):
    for w in v.split():
        yield w, 1

def reduce_(k, values):
    yield k, sum(v for v in values)

if __name__ == "__main__":
    mapreduce(DATA, map_, reduce_)
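Running the script should print each word with its count, keys in sorted order, matching the figure from the word count slide:

$ python mockwc.py
ate 1
brown 1
fox 2
green 1
lazy 1
quick 1
the 2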


MapReduce: Execution Model

(diagram) (1) The user program forks the Master and the worker processes. (2) The Master assigns map tasks and reduce tasks to workers. (3) Mappers read their input splits from the distributed file system. (4) Map output is written to intermediate files on local disks. (5) Reducers remote-read the intermediate data. (6) Reducers write the final output to the DFS.

adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html


MapReduce vs Alternatives – 1

(chart) The three technologies are positioned along two axes: programming model (declarative vs procedural) and data organization (flat raw files vs structured). MPI is procedural over flat raw files, DBMS/SQL is declarative over structured data, and MapReduce sits in between.

adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html


MapReduce vs Alternatives – 2

                   MPI                    MapReduce                 DBMS/SQL
programming model  message passing        map/reduce                declarative
data organization  no assumption          files split into blocks   organized structures
data type          any                    (k,v) string/Avro         tables with rich types
execution model    independent nodes      map/shuffle/reduce        transaction
communication      high                   low                       high
granularity        fine                   coarse                    fine
usability          steep learning curve   simple concept            runtime: hard to debug
key selling point  run any application    huge datasets             interactive querying

There is no one-size-fits-all solution

Choose according to your problem’s characteristics

adapted from Zhao et al., MapReduce: The programming model and practice, 2009 – http://research.google.com/pubs/pub36249.html


Hadoop: Overview

Disclaimer
   First we describe Hadoop 1
   Then we highlight what’s different in Hadoop 2

Scalable
   Thousands of nodes
   Petabytes of data over 10M files
   Single file: Gigabytes to Terabytes

Economical
   Open source
   COTS hardware (but master nodes should be reliable)

Well-suited to bag-of-tasks applications (many bio apps)
   Files are split into blocks and distributed across nodes
   High-throughput access to huge datasets
   WORM (write once, read many) storage model


Hadoop: Architecture

(diagram) The client sends a Job Request to the MR master. Master nodes host the Namenode (HDFS master) and the Job Tracker (MR master); each slave node runs a Datanode (HDFS slave) alongside a Task Tracker (MR slave).

Client sends Job request to JT
JT queries NN about data block locations
Input stream is split among n map tasks
Map tasks are scheduled closest to where data reside


Hadoop Distributed File System (HDFS)

(diagram) To read file “a”, the client sends the file name to the Namenode and gets back block IDs and Datanode locations, then reads the data (blocks a1, a2, a3) directly from Datanodes spread across racks. The Secondary Namenode performs namespace checkpointing and helps the Namenode with its logs; it is not a Namenode replacement.

Large blocks (∼100 MB)
Each block is replicated n times (3 by default)
One replica on the same rack, the others on different racks
You have to provide the network topology
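A quick way to inspect where a file’s blocks and replicas actually landed is Hadoop’s stock fsck tool (the path below is just an example):

$ hadoop fsck /user/simleo/test/hello.txt -files -blocks -locations -racks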


Wordcount: (part of) Java Code

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) sum += val.get();
    result.set(sum);
    context.write(key, result);
  }
}


Other MapReduce Components

Combiner (local reducer)

InputFormat / RecordReader
   Translates the byte-oriented view of input files into the record-oriented view required by the Mapper
   Directly accesses HDFS files
   Processing unit: input split

Partitioner
   Decides which Reducer receives which key
   Typically uses a hash function of the key (see the sketch below)

OutputFormat / RecordWriter
   Writes key/value pairs output by the Reducer
   Directly accesses HDFS files
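A minimal sketch of that default hashing strategy, mirroring Java’s HashPartitioner (partition_for is a hypothetical name, not part of any Hadoop API):

import sys

def partition_for(key, n_reducers):
    # mask the sign bit so the result is non-negative, then
    # spread keys roughly uniformly across the reducers;
    # the same key always lands on the same reducer
    return (hash(key) & sys.maxint) % n_reducers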


Input Formats and HDFS

(diagram) The InputFormat sits between HDFS and the Job, turning HDFS data into input splits.

Input split ≠ HDFS block
   An HDFS block is an actual data chunk in an HDFS file
   An input split is a logical representation of a subset of the MapReduce input dataset
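For file-based inputs, Hadoop’s FileInputFormat derives the split size from the HDFS block size, clamped between the configured minimum and maximum; a sketch of that computation (function name is illustrative):

def compute_split_size(min_size, max_size, block_size):
    # one split per HDFS block by default, unless the job sets
    # mapred.min.split.size / mapred.max.split.size
    return max(min_size, min(max_size, block_size))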


Hadoop 2

HDFS Federation
   Multiple namenodes can share the burden of maintaining the namespace(s) and managing blocks
   Better scalability

HDFS high availability (HA)
   Can run redundant namenodes in active/passive mode
   Allows failover in case of crash or planned maintenance

YARN
   Resource manager: global assignment of compute resources to applications (HA supported since 2.4.0)
   Application master(s): per-application scheduling and coordination (like a per-job JT, on a slave)
   MapReduce is just one of many possible frameworks


Developing your First MapReduce Application

The easiest path for beginners is Hadoop Streaming
   Java package included in Hadoop
   Use any executable as the mapper or reducer
      Read key/value pairs from standard input
      Write them to standard output
   Text protocol: records are serialized as <k>\t<v>\n

Word count mapper (Python)

#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t1" % word


Hadoop Streaming – Word Count Reducer (Python)

#!/usr/bin/env python
import sys

def serialize(key, value):
    return "%s\t%d" % (key, value)

def deserialize(line):
    key, value = line.split("\t", 1)
    return key, int(value)

def main():
    prev_key, out_value = None, 0
    for line in sys.stdin:
        key, value = deserialize(line)
        if key != prev_key:
            if prev_key is not None:
                print serialize(prev_key, out_value)
            out_value = 0
            prev_key = key
        out_value += value
    print serialize(key, out_value)

if __name__ == "__main__": main()
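Since Streaming only speaks stdin/stdout, the mapper/reducer pair can be tested without a cluster by emulating the shuffle phase with sort (script and file names are hypothetical):

$ cat input.txt | ./wc_mapper.py | sort | ./wc_reducer.py

On a cluster, the same scripts go through the streaming jar shipped with Hadoop, along the lines of:

$ hadoop jar /path/to/hadoop-streaming.jar \
      -input in -output out \
      -mapper wc_mapper.py -reducer wc_reducer.py \
      -file wc_mapper.py -file wc_reducer.py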


MapReduce Development with Hadoop

Java: native
C++: MapReduce API (via Hadoop Pipes) + libhdfs (included in the Hadoop distribution)
Python: several solutions, but do they meet all of the requirements of nontrivial apps?
   Reuse existing modules, including extensions
   NumPy / SciPy for numerical computation
   Specialized components (RecordReader/Writer, Partitioner)
   HDFS access
   . . .


Python MR: Hadoop-Integrated Solutions

Hadoop Streaming
   Awkward programming style
   Can only write mapper and reducer scripts (no RecordReader, etc.)
   No HDFS access

Jython
   Incomplete standard library
   Most third-party packages are only compatible with CPython
   Cannot use C/C++ extension modules
   Typically one or more releases behind CPython


Python MR: Third Party Solutions

Most third-party packages are Hadoop Streaming wrappers
They add usability, but share many of Streaming’s limitations
Most are not actively developed anymore

Our solution: Pydoop (https://github.com/crs4/pydoop)
   Actively developed since 2009
   (Almost) pure Python Pipes client + libhdfs wrapper


Pydoop: Overview

Access to most MR components, including RecordReader, RecordWriter and Partitioner
Get configuration, set counters and report status
Event-driven programming model (like the Java API, but with the simplicity of Python)
   You define classes, the MapReduce framework instantiates them and calls their methods
CPython → use any module
HDFS API


Summary of Features

                   Streaming   Jython    Pydoop
C/C++ Ext          Yes         No        Yes
Standard Lib       Full        Partial   Full
MR API             No*         Full      Partial
Event-driven prog  No          Yes       Yes
HDFS access        No          Yes       Yes

(*) you can only write the map and reduce parts as executable scripts.


Hadoop Pipes

(diagram) The Java Hadoop framework (PipesMapRunner, PipesReducer) talks to a C++ application running as a child process over a downward and an upward protocol. The child process hosts the record reader, mapper, partitioner, combiner, reducer and record writer.

App: separate process
Communication with Java framework via persistent sockets
The C++ app provides a factory used by the framework to create MR components


Integration of Pydoop with Hadoop

(diagram) Pydoop’s pure Python modules ((de)serialization, core ext, hdfs ext) sit on top of Hadoop: MapReduce goes through the Java Pipes interface, while HDFS access goes through the C libhdfs (JNI wrapper) around the Java hdfs code.


Python Wordcount, Full Program Code

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pp

class Mapper(api.Mapper):

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)

class Reducer(api.Reducer):

    def reduce(self, context):
        s = sum(context.values)
        context.emit(context.key, s)

def __main__():
    # module-level entry point called by the Pydoop submitter
    pp.run_task(pp.Factory(Mapper, Reducer))
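Assuming the program above is saved as wc.py, it can be submitted to the cluster with something like the following (module name, input and output paths are illustrative):

$ pydoop submit --upload-file-to-cache wc.py wc input output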


Status Reports and Counters

import pydoop.mapreduce.api as api

class Mapper(api.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        context.set_status("initializing mapper")
        self.input_words = context.get_counter("WC", "INPUT_WORDS")

    def map(self, context):
        words = context.value.split()
        for w in words:
            context.emit(w, 1)
        context.increment_counter(self.input_words, len(words))


Status Reports and Counters: Web UI

(screenshot: the Hadoop job web UI displaying the task status and the custom counters)


Optional Components: Record Reader

import pydoop.mapreduce.api as api
import pydoop.hdfs as hdfs
from pydoop.utils.serialize import serialize_to_string as to_s

class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        self.isplit = context.input_split
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # the record straddling the split boundary belongs
            # to the previous reader
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            raise StopIteration
        key = to_s(self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":  # end of file
            raise StopIteration
        self.bytes_read += len(record)
        return key, record


Optional Components: Record Writer, Partitioner

import sys

import pydoop.mapreduce.api as api
import pydoop.hdfs as hdfs

class Writer(api.RecordWriter):

    def __init__(self, context):
        super(Writer, self).__init__(context)
        jc = context.job_conf
        part = jc.get_int("mapred.task.partition")
        out_dir = jc["mapred.work.output.dir"]
        outfn = "%s/part-%05d" % (out_dir, part)
        hdfs_user = jc.get("pydoop.hdfs.user", None)
        self.file = hdfs.open(outfn, "w", user=hdfs_user)
        self.sep = jc.get("mapred.textoutputformat.separator", "\t")

    def close(self):
        self.file.close()
        self.file.fs.close()

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class Partitioner(api.Partitioner):

    def partition(self, key, n_red):
        reducer_id = (hash(key) & sys.maxint) % n_red
        return reducer_id


The HDFS Module

>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> text = hdfs.load("test/hello.txt")
>>> print text
hello
>>> hdfs.ls("test")
['hdfs://localhost:9000/user/simleo/test/hello.txt']
>>> hdfs.stat("test/hello.txt").st_size
5L
>>> hdfs.path.isdir("test")
True
>>> hdfs.cp("test", "test.copy")
>>> with hdfs.open("test/hello.txt", "r") as fi:
...     print fi.read(3)
...
hel


HDFS Usage by Block Size

import pydoop.hdfs as hdfs

ROOT = "pydoop_test_tree"

def usage_by_bs(fs, root):
    stats = {}
    for info in fs.walk(root):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

if __name__ == "__main__":
    with hdfs.hdfs() as fs:
        print "BS (MB)\tBYTES USED"
        for k, v in sorted(usage_by_bs(fs, ROOT).iteritems()):
            print "%d\t%d" % (k / 2**20, v)


Thank you for your time!
