HDFS & MapReduce


"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do."

Donald E. Knuth, Literate Programming, 1984

Drivers

Central activity

Dominant logics

Economy                  Subsistence       Agricultural       Industrial                            Service                     Sustainable
Question                 How to survive?   How to farm?       How to manage resources?              How to create customers?    How to reduce impact?
Dominant issue           Survival          Production                                               Customer service            Sustainability
Key information systems  Gesture, speech   Writing, calendar  Accounting, ERP, project management   CRM, analytics              Simulation, optimization, design

Data sources

Operational

Social

Environmental

Digital transformation

Data

Data are the raw material for information
Ideally, the lower the level of detail the better
• Summarize up, but don't detail down
Immutability means no updating
• Append plus a timestamp
• Maintain history
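A minimal sketch of this append-only idea in R (the account data frame and helper below are illustrative, not from the slides): instead of updating a record in place, append a new row with a timestamp, preserving the full history.

# illustrative append-only log: an "update" becomes a new row with a timestamp
log <- data.frame(id = integer(), balance = numeric(), ts = numeric())
append_fact <- function(log, id, balance) {
  rbind(log, data.frame(id = id, balance = balance, ts = as.numeric(Sys.time())))
}
log <- append_fact(log, 1, 100)  # create
log <- append_fact(log, 1, 250)  # "update" appends; the old row is kept
log  # full history maintained; the latest timestamp gives the current value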

Data types

Structured
Unstructured
Can structure with some effort

Requirements for Big Data

Robust and fault-tolerant
Low-latency reads and updates
Scalable
Support a wide variety of applications
Extensible
Ad hoc queries
Minimal maintenance
Debuggable

Bottlenecks

Solving the speed problem

Lambda architecture

Speed layer

Serving layer

Batch layer


Batch layer

Addresses the cost problem
The batch layer stores the master copy of the dataset
• A very large list of records
• An immutable, growing dataset
Continually pre-computes batch views on that master dataset so they are available when requested
Might take several hours to run

Batch programming
Automatically parallelized across a cluster of machines
Supports scalability to datasets of any size
With an x-node cluster, the computation will be about x times faster than on a single machine

Serving layer

A specialized distributed database
Indexes pre-computed batch views and loads them so they can be queried efficiently
Continuously swaps in newer pre-computed versions of batch views

Serving layer

Simple database
• Batch updates
• Random reads
• No random writes
Low complexity
• Robust
• Predictable
• Easy to configure and manage

Speed layer

The only data not represented in a batch view are those collected while the pre-computation was running
The speed layer is a real-time system that tops up the analysis with the latest data
• Does incremental updates based on recent data
• Modifies the view as data are collected
• Merges the two views as required by queries

Lambda architecture


Speed layer

Intermediate results are discarded every time a new batch view is received
The complexity of the speed layer is "isolated"
If anything goes wrong, the results are only a few hours out of date and are fixed when the next batch update arrives

Lambda architecture


Lambda architecture
New data are sent to both the batch and speed layers
New data are appended to the master dataset to preserve immutability
The speed layer does an incremental update

Lambda architecture
The batch layer pre-computes views using all the data
The serving layer indexes the batch-created views
Prepares for rapid response to queries

Lambda architecture
Queries are handled by merging data from the serving and speed layers
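A toy R sketch of that merge (the views and the numbers in them are invented for illustration): the serving layer holds counts pre-computed by the last batch run, the speed layer holds counts for data that arrived since, and a query sums the two.

# hypothetical pre-computed batch view and real-time speed view
batch_view <- data.frame(key = c("a", "b"), count = c(100, 40))
speed_view <- data.frame(key = c("b", "c"), count = c(3, 7))
# a query merges the two views: batch results topped up with recent data
aggregate(count ~ key, data = rbind(batch_view, speed_view), FUN = sum)
# key a: 100, b: 43, c: 7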

Master dataset

The goal is to preserve integrity
Other elements can be recomputed
Replication across nodes
Redundancy is integrity

CRUD to CR
Create, Read, Update, Delete becomes Create, Read

Immutability exceptions

Garbage collection
Delete elements of low potential value
• Don't keep some histories
Regulations and privacy
Delete elements that are not permitted to be kept
• For example, a history of books borrowed

Fact-based data model
Each fact is a single piece of data
• Clare is female
• Clare works at Bloomingdales
• Clare lives in New York
Multi-valued facts need to be decomposed
• "Clare is a female working at Bloomingdales in New York" decomposes into the three facts above
A fact is data about an entity or a relationship between two entities

Fact-based data model
Each fact has an associated timestamp recording the earliest time the fact is believed to be true
• For convenience, usually the time the fact is captured
• Creates a new data type of time series, or attributes become entities
More recent facts override older facts
All facts need to be uniquely identified
• Often a timestamp plus the other attributes
• Use a 64-bit nonce (number used once) field, which is a random number, if the timestamp-plus-attributes combination could be identical for two facts
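A small R illustration of "more recent facts override older facts" (the facts and timestamps are invented): the current view keeps, for each entity/attribute pair, only the fact with the latest timestamp, while the older facts remain in the history.

# illustrative fact log: entity, attribute, value, timestamp
facts <- data.frame(
  entity    = c("Clare", "Clare", "Clare"),
  attribute = c("city", "employer", "city"),
  value     = c("New York", "Bloomingdales", "Boston"),
  ts        = c(1, 1, 2)  # a later fact: Clare moved to Boston
)
# current view = latest fact per (entity, attribute)
facts <- facts[order(facts$ts), ]
facts[!duplicated(facts[, c("entity", "attribute")], fromLast = TRUE), ]
# city is now Boston; the New York fact stays in the log as history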

Fact-based versus relational

Decision-making effectiveness versus operational efficiency

Days versus seconds

Access many records versus access a few
Immutable versus mutable
History versus current view

Schemas

Schemas increase data quality by defining structure
Catch errors at creation time, when they are easier and cheaper to correct
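A hedged sketch of that point in R (the constructor below is hypothetical, not part of Hadoop or rmr2): enforce the structure when a fact is created, so a bad record fails immediately rather than surfacing later in a batch run.

# hypothetical schema check applied at creation time
new_fact <- function(entity, attribute, value, ts) {
  stopifnot(is.character(entity), is.character(attribute), is.numeric(ts))
  data.frame(entity = entity, attribute = attribute, value = value, ts = ts)
}
new_fact("Clare", "city", "New York", 1)  # conforms to the schema
# new_fact("Clare", "city", "New York", "yesterday")  # error caught at creation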

Fact-based data model

Graphs can represent fact-based data models
• Nodes are entities
• Properties are attributes of entities
• Edges are relationships between entities

Graph versus relational

Keep a full history
Append only
Scalable?

Solving the speed and cost problems


Hadoop

Distributed file system: Hadoop Distributed File System (HDFS)
Distributed computation: MapReduce
Commodity hardware: a cluster of nodes

Hadoop

Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email anti-spam, ad optimization, ETL, and more
• Over 40,000 servers
• 170 PB of storage

Hadoop

Lower cost: commodity hardware
Speed: multiple processors

HDFS

Files are broken into fixed-size blocks of at least 64 MB
Blocks are replicated across nodes
• Enables parallel processing
• Provides fault tolerance
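To make the block arithmetic concrete, a quick back-of-the-envelope in R (the 10 GB file size and the 3x replication factor are assumed for illustration; only the 64 MB block size comes from the slide):

file_mb  <- 10 * 1024              # assumed 10 GB file
block_mb <- 64                     # block size from the slide
ceiling(file_mb / block_mb)        # 160 blocks to spread across nodes
3 * ceiling(file_mb / block_mb)    # 480 block copies with 3x replication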

HDFS

Node storage
• Store blocks sequentially to minimize disk-head movement
• Blocks are grouped into files
• All files for a dataset are grouped into a single folder
• No random access to records
• New data are added as a new file

HDFS

Scalable storage: add nodes; append new data as files
Scalable computation: supports MapReduce
Partitioning: group data into folders for processing at the folder level

Vertical partitioning


MapReduce
A distributed computing method that provides primitives for scalable and fault-tolerant batch computation
Ad hoc queries on large datasets are time-consuming
• Distribute the computation across multiple processors
• Pre-compute common queries
Move the program to the data rather than the data to the program


MapReduce

Input: determines how data are read by the mapper and splits up the data for the mappers
Map: operates on each split individually
Partition: distributes key/value pairs to the reducers

MapReduce

Sort: sorts the input for the reducer
Reduce: consolidates key/value pairs
Output: writes data to HDFS
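The whole pipeline can be mimicked in a few lines of plain R (a toy simulation of the data flow, not how Hadoop actually executes it): map emits key/value pairs, the shuffle groups values by key, and reduce consolidates each group.

# toy simulation of map -> partition/sort (shuffle) -> reduce
input <- c("big data", "big ideas")
# map: each record becomes (word, 1) pairs
mapped <- data.frame(key = unlist(strsplit(input, " ")), value = 1)
# shuffle/sort: group values by key (what partition + sort achieve)
groups <- split(mapped$value, mapped$key)
# reduce: consolidate each key's list of values
sapply(groups, sum)  # big: 2, data: 1, ideas: 1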

Shuffle


Programming MapReduce

Map

A Map function converts each input element into zero or more key-value pairs
A "key" is not unique; many pairs with the same key are typically generated by the Map function
The key is the field about which you want to collect data

Map

Compute the square of a set of numbers
• Input is (null,1), (null,2), …
• Output is (1,1), (2,4), …

mapper <- function(k, v) {
  key <- v
  value <- key^2
  keyval(key, value)
}

Reduce

A Reduce function is applied, for each input key, to its associated list of values
The result is a new pair consisting of the key and whatever is produced by the Reduce function
The output of the MapReduce job is what results from applying the Reduce function to each key and its list

Reduce

Report the number of items in a list

• Input is (key, value-list), …
• Output is (key, length(value-list)), …

reducer <- function(k, v) {
  key <- k
  value <- length(v)
  keyval(key, value)
}

MapReduce API

A low-level Java implementation
Can gain additional compute efficiency, but tedious to program
Try the highest-level options first and descend to lower levels only if required

R & Hadoop

Compute squares


R

# create a list of 10 integers
ints <- 1:10
# equivalent to ints <- c(1,2,3,4,5,6,7,8,9,10)
# compute the squares
result <- sapply(ints, function(x) x^2)
result
[1]   1   4   9  16  25  36  49  64  81 100

Key-value mapping

Input      Map        Reduce   Output
(null,1)   (1,1)               (1,1)
(null,2)   (2,4)               (2,4)
…          …                   …
(null,10)  (10,100)            (10,100)

MapReduce

library(rmr2)
rmr.options(backend = "local") # local or hadoop
# load a list of 10 integers into HDFS
hdfs.ints = to.dfs(1:10)
# mapper for the key-value pairs to compute squares
mapper <- function(k, v) {
  key <- v
  value <- key^2
  keyval(key, value)
}
# run MapReduce
out = mapreduce(input = hdfs.ints, map = mapper)
# convert to a data frame
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('n', 'n^2')
# display the results
df1

No reduce step is needed

Exercise

Use the map component of mapreduce() to create the cubes of the integers from 1 to 25
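One possible solution sketch, patterned on the squares example above (assumes rmr2 is installed and configured as in that example):

library(rmr2)
rmr.options(backend = "local") # local or hadoop
# load the integers 1 to 25 into HDFS
hdfs.ints = to.dfs(1:25)
# mapper emits (n, n^3); no reduce is needed
mapper <- function(k, v) keyval(v, v^3)
out = mapreduce(input = hdfs.ints, map = mapper)
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('n', 'n^3')
df1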

R & Hadoop

Tabulation

R

library(readr)
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
# convert and round temperature to an integer
t$temperature = round((t$temperature - 32) * 5/9, 0)
# tabulate frequencies
table(t$temperature)

Key-value mapping

Input        Map (F to C)   Reduce                    Output
(null,35.1)  (2,1)          (-7,c(1))                 (-7,1)
(null,37.5)  (3,1)          (-6,c(1))                 (-6,1)
…            …              …                         …
(null,43.3)  (6,1)          (27,c(1,1,1,1,1,1,1,1))   (27,8)

MapReduce (1)

library(rmr2)
library(readr)
rmr.options(backend = "local") # local or hadoop
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
# save temperature in an HDFS file
hdfs.temp <- to.dfs(t$temperature)
# mapper for conversion to C
mapper <- function(k, v) {
  key <- round((v - 32) * 5/9, 0)
  value <- 1
  keyval(key, value)
}

MapReduce (2)

# reducer to count frequencies
reducer <- function(k, v) {
  key <- k
  value = length(v)
  keyval(key, value)
}
out = mapreduce(input = hdfs.temp, map = mapper, reduce = reducer)
df2 = as.data.frame(from.dfs(out))
colnames(df2) = c('temperature', 'count')
df3 <- df2[order(df2$temperature), ]
print(df3, row.names = FALSE) # no row names

R & Hadoop

Basic statistics

R

library(readr)
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
a1 <- aggregate(t$temperature, by = list(t$year), FUN = max)
colnames(a1) = c('year', 'value')
a1$measure = 'max'
a2 <- aggregate(t$temperature, by = list(t$year), FUN = mean)
colnames(a2) = c('year', 'value')
a2$value = round(a2$value, 1)
a2$measure = 'mean'
a3 <- aggregate(t$temperature, by = list(t$year), FUN = min)
colnames(a3) = c('year', 'value')
a3$measure = 'min'
# stack the results
stack <- rbind(a1, a2, a3)
library(reshape)
# reshape with year, max, mean, min in one row
stats <- cast(stack, year ~ measure, value = "value")
head(stats)

Key-value mapping

Input          Map                  Reduce                          Output
(null,record)  (year, temperature)  (year, vector of temperatures)  (year, max) (year, mean) (year, min)

MapReduce (1)

library(rmr2)
library(reshape)
library(readr)
rmr.options(backend = "local") # local or hadoop
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
# convert to an HDFS file
hdfs.temp <- to.dfs(data.frame(t))
# mapper for computing temperature measures for each year
mapper <- function(k, v) {
  key <- v$year
  value <- v$temperature
  keyval(key, value)
}

MapReduce (2)

# reducer to report stats
reducer <- function(k, v) {
  key <- k # year
  value <- c(max(v), round(mean(v), 1), min(v)) # v is the list of values for a year
  keyval(key, value)
}
out = mapreduce(input = hdfs.temp, map = mapper, reduce = reducer)
df3 = as.data.frame(from.dfs(out))
df3$measure <- c('max', 'mean', 'min')
# reshape with year, max, mean, min in one row
stats2 <- cast(df3, key ~ measure, value = "val")
head(stats2)

R & Hadoop

Word counting

R

library(stringr)
# read as a single character string
t <- readChar("http://people.terry.uga.edu/rwatson/data/yogiquotes.txt", nchars = 1e6)
t1 <- tolower(t[[1]]) # convert to lower case
t2 <- str_replace_all(t1, "[[:punct:]]", "") # get rid of punctuation
wordList <- str_split(t2, "\\s") # split into strings
wordVector <- unlist(wordList) # convert list to vector
table(wordVector)

Key-value mapping

Input         Map                 Reduce            Output
(null, text)  (word,1) (word,1) …  (word, vector) …  (word, length(vector)) …

MapReduce (1)

library(rmr2)
library(stringr)
rmr.options(backend = "local") # local or hadoop
# read as a single character string
url <- "http://people.terry.uga.edu/rwatson/data/yogiquotes.txt"
t <- readChar(url, nchars = 1e6)
text.hdfs <- to.dfs(t)
mapper = function(k, v) {
  t1 <- tolower(v) # convert to lower case
  t2 <- str_replace_all(t1, "[[:punct:]]", "") # get rid of punctuation
  wordList <- str_split(t2, "\\s") # split into words
  wordVector <- unlist(wordList) # convert list to vector
  keyval(wordVector, 1)
}

MapReduce (2)

reducer = function(k, v) {
  keyval(k, length(v))
}
out <- mapreduce(input = text.hdfs, map = mapper, reduce = reducer, combine = T)
# convert output to a data frame
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('word', 'count')
# display the results
print(df1, row.names = FALSE) # no row names

Hortonworks data platform


HBase

A distributed database
Does not enforce relationships
Does not enforce strict column data typing
Part of the Hadoop ecosystem

Applications

Facebook
Twitter
StumbleUpon

Hiring: learning from big data

People with a criminal background perform a bit better in customer-support call centers
Customer-service employees who live nearby are less likely to leave
Honest people tend to perform better and stay on the job longer, but make less effective salespeople

Outcomes

Scientific discovery
• Quasars
• Higgs boson
Discovering linkages among humans, products, and services
An ecologically sustainable society
• Energy Informatics

Critical questions

What's the business problem?
What information is needed to make a high-quality decision?
What data can be converted into information?

Conclusions
Faster and lower-cost solutions for data-driven decision making
HDFS
• Reduces the cost of storing large datasets
• Becoming the new standard for data storage
MapReduce is changing the way data are processed
• Cheaper
• Faster
• Need to reprogram for parallelism