HDFS & MapReduce


Transcript of HDFS & MapReduce

Page 1: HDFS & MapReduce

HDFS & MapReduce

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.

Donald E. Knuth, Literate Programming, 1984

Page 2: HDFS & MapReduce

Drivers

2

Page 3: HDFS & MapReduce

Central activity

3

Page 4: HDFS & MapReduce

Dominant logics

4

Economy: Subsistence | Agricultural | Industrial | Service | Sustainable

Question: How to survive? | How to farm? | How to manage resources? | How to create customers? | How to reduce impact?

Dominant issue: Survival | Production | Customer service | Sustainability

Key information systems: Gesture, Speech | Writing, Calendar | Accounting, ERP, Project management | CRM, Analytics | Simulation, Optimization, Design

Page 5: HDFS & MapReduce

Data sources

5

Page 6: HDFS & MapReduce

Operational

6

Page 7: HDFS & MapReduce

Social

7

Page 8: HDFS & MapReduce

Environmental

8

Page 9: HDFS & MapReduce

Digital transformation

9

Page 10: HDFS & MapReduce

Data

Data are the raw material for information. Ideally, the lower the level of detail the better:
Summarize up, but not detail down

Immutability means no updating: append plus a time stamp
Maintain history

10
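To make the append-plus-timestamp idea concrete, here is a minimal base-R sketch (the customer data are invented for illustration): rather than overwriting a value, a new fact is appended with its own timestamp, and the current view is derived by keeping the most recent fact per customer.

# master dataset: facts are appended, never updated in place
master <- data.frame(
  customer  = c("Clare", "Clare"),
  city      = c("New York", "Boston"),
  timestamp = as.Date(c("2015-01-01", "2016-06-01")),
  stringsAsFactors = FALSE
)
# current view = most recent fact per customer; the full history is preserved
master  <- master[order(master$customer, -as.numeric(master$timestamp)), ]
current <- master[!duplicated(master$customer), ]
current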

Page 11: HDFS & MapReduce

Data types

Structured
Unstructured
Can be structured with some effort

11

Page 12: HDFS & MapReduce

Requirements for Big Data

Robust and fault-tolerant
Low latency reads and updates
Scalable
Support a wide variety of applications
Extensible
Ad hoc queries
Minimal maintenance
Debuggable

12

Page 13: HDFS & MapReduce

Bottlenecks

13

Page 14: HDFS & MapReduce

Solving the speed problem

14

Page 15: HDFS & MapReduce

Lambda architecture

Speed layer

Serving layer

Batch layer

15

Page 16: HDFS & MapReduce

Batch layer

Addresses the cost problem. The batch layer stores the master copy of the dataset:
• A very large list of records
• An immutable, growing dataset

Continually pre-computes batch views on that master dataset so they are available when requested

Might take several hours to run

16

Page 17: HDFS & MapReduce

Batch programming

Automatically parallelized across a cluster of machines

Supports scalability to a dataset of any size

If you have a cluster of x nodes, the computation will be about x times faster than on a single machine

17

Page 18: HDFS & MapReduce

Serving layer

A specialized distributed database
Indexes pre-computed batch views and loads them so they can be queried efficiently
Continuously swaps in newer pre-computed versions of batch views

18

Page 19: HDFS & MapReduce

Serving layer

Simple database
Batch updates
Random reads
No random writes

Low complexity
Robust
Predictable
Easy to configure and manage

19

Page 20: HDFS & MapReduce

Speed layer

The only data not represented in a batch view are those collected while the pre-computation was running. The speed layer is a real-time system that tops up the analysis with the latest data.

Does incremental updates based on recent data
Modifies the view as data are collected
Merges the two views as required by queries

20

Page 21: HDFS & MapReduce

Lambda architecture

21

Page 22: HDFS & MapReduce

Speed layer

Intermediate results are discarded every time a new batch view is received. The complexity of the speed layer is “isolated.”

If anything goes wrong, the results are only a few hours out of date and are corrected when the next batch update is received.

22

Page 23: HDFS & MapReduce

Lambda architecture

23

Page 24: HDFS & MapReduce

Lambda architecture

New data are sent to the batch and speed layers

New data are appended to the master dataset to preserve immutability

Speed layer does an incremental update

24

Page 25: HDFS & MapReduce

Lambda architecture

The batch layer pre-computes views using all the data
The serving layer indexes the batch-created views

This prepares for rapid responses to queries

25

Page 26: HDFS & MapReduce

Lambda architecture

Queries are handled by merging data from the serving and speed layers

26
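As a rough illustration of this merge (the view contents below are invented, not from the slides), a query can simply combine the pre-computed batch view from the serving layer with the incremental speed view and aggregate the result:

# batch view: counts pre-computed by the last batch run (serving layer)
batch_view <- data.frame(word = c("data", "hadoop"), count = c(100, 40))
# speed view: counts for data that arrived after that batch run started
speed_view <- data.frame(word = c("data", "spark"), count = c(3, 1))
# answer a query by merging the two views and summing per key
merged <- aggregate(count ~ word, data = rbind(batch_view, speed_view), FUN = sum)
merged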

Page 27: HDFS & MapReduce

Master dataset

The goal is to preserve the integrity of the master dataset; other elements can be recomputed

Replication across nodes
Redundancy is integrity

27

Page 28: HDFS & MapReduce

From CRUD to CR

CRUD: Create, Read, Update, Delete
CR: Create, Read

28

Page 29: HDFS & MapReduce

Immutability exceptions

Garbage collection
Delete elements of low potential value
• Don’t keep some histories

Regulations and privacy
Delete elements that are not permitted to be kept
• History of books borrowed

29

Page 30: HDFS & MapReduce

Fact-based data model

Each fact is a single piece of data:

Clare is female
Clare works at Bloomingdales
Clare lives in New York

Multi-valued facts need to be decomposed

Clare is a female working at Bloomingdales in New York

A fact is data about an entity or a relationship between two entities

30

Page 31: HDFS & MapReduce

Fact-based data model

Each fact has an associated timestamp recording the earliest time that the fact is believed to be true

For convenience, this is usually the time the fact is captured
Create a new data type for time series, or attributes become entities

More recent facts override older facts
All facts need to be uniquely identified

Often a timestamp plus other attributes
Use a 64-bit nonce (number used once) field, which is a random number, if the timestamp plus attribute combination could be identical

31
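A minimal sketch of how the three facts about Clare might be laid out as fact records in R (the column names, and the use of a 32-bit random number as a stand-in for a 64-bit nonce, are illustrative assumptions):

facts <- data.frame(
  entity    = c("Clare", "Clare", "Clare"),
  attribute = c("gender", "employer", "city"),
  value     = c("female", "Bloomingdales", "New York"),
  timestamp = Sys.time(),                       # earliest time the fact is believed true
  nonce     = sample(.Machine$integer.max, 3),  # random number used once, for uniqueness
  stringsAsFactors = FALSE
)
facts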

Page 32: HDFS & MapReduce

Fact-based versus relational

Decision-making effectiveness versus operational efficiency

Days versus seconds

Access many records versus access a few
Immutable versus mutable

History versus current view

32

Page 33: HDFS & MapReduce

Schemas

Schemas increase data quality by defining structure
Catch errors at creation time, when they are easier and cheaper to correct

33

Page 34: HDFS & MapReduce

Fact-based data model

Graphs can represent fact-based data models:

Nodes are entities
Properties are attributes of entities
Edges are relationships between entities

34
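As an illustrative sketch (using the igraph package, which is not part of the slides), the facts about Clare can be stored as a small graph with entities as nodes, attributes as node properties, and relationships as edges:

library(igraph)
# nodes are entities; extra columns become node properties (attributes)
nodes <- data.frame(name = c("Clare", "Bloomingdales", "New York"),
                    type = c("person", "company", "city"))
# edges are relationships between entities
edges <- data.frame(from = c("Clare", "Clare"),
                    to   = c("Bloomingdales", "New York"),
                    relationship = c("works at", "lives in"))
g <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
V(g)$type          # node properties
E(g)$relationship  # edge labels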

Page 35: HDFS & MapReduce

Graph versus relational

Keep a full history
Append only
Scalable?

35

Page 36: HDFS & MapReduce

Solving the speed and cost problems

36

Page 37: HDFS & MapReduce

Hadoop

Distributed file system: the Hadoop distributed file system (HDFS)

Distributed computation: MapReduce

Commodity hardware: a cluster of nodes

37

Page 38: HDFS & MapReduce

Hadoop

Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email anti-spam, ad optimization, ETL, and more
Over 40,000 servers
170 PB of storage

38

Page 39: HDFS & MapReduce

Hadoop

Lower cost: commodity hardware

Speed: multiple processors

39

Page 40: HDFS & MapReduce

HDFS

Files are broken into fixed-size blocks of at least 64 MB
Blocks are replicated across nodes

Parallel processing
Fault tolerance

40
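As a back-of-the-envelope illustration (assuming a 64 MB block size and a replication factor of 3, neither of which is specified on the slide), a 1 GB file becomes 16 blocks and 48 block replicas spread across the cluster:

file_mb     <- 1024  # a 1 GB file
block_mb    <- 64    # assumed HDFS block size
replication <- 3     # assumed replication factor
blocks   <- ceiling(file_mb / block_mb)  # 16 blocks
replicas <- blocks * replication         # 48 block replicas across the nodes
raw_mb   <- file_mb * replication        # 3072 MB of raw storage consumed
c(blocks = blocks, replicas = replicas, raw_mb = raw_mb)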

Page 41: HDFS & MapReduce

HDFS

Node storage
Store blocks sequentially to minimize disk head movement
Blocks are grouped into files
All files for a dataset are grouped into a single folder
No random access to records
New data are added as a new file

41

Page 42: HDFS & MapReduce

HDFS

Scalable storage: add nodes; append new data as files

Scalable computation: support for MapReduce

Partitioning: group data into folders for processing at the folder level

42

Page 43: HDFS & MapReduce

Vertical partitioning

43

Page 44: HDFS & MapReduce

MapReduce

A distributed computing method that provides primitives for scalable and fault-tolerant batch computation
Ad hoc queries on large datasets are time consuming

Distribute the computation across multiple processors
Pre-compute common queries

Move the program to the data rather than the data to the program

44

Page 45: HDFS & MapReduce

MapReduce

45

Page 46: HDFS & MapReduce

MapReduce

46

Page 47: HDFS & MapReduce

MapReduce

Input: determines how data are read by the mapper; splits up the data for the mappers

Map: operates on each data set individually

Partition: distributes key/value pairs to the reducers

47

Page 48: HDFS & MapReduce

MapReduce

Sort: sorts the input for the reducer

Reduce: consolidates key/value pairs

Output: writes the data to HDFS (a base-R sketch of these stages follows below)

48
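The same stages can be imitated in a few lines of base R for a word count; this sketch is only an analogy for how Hadoop behaves, not rmr2 or Hadoop code (the sample text is invented):

# input/split: two "records" (lines of text)
input <- c("to be or not to be", "to do is to be")
# map: emit a (word, 1) pair for every word
words  <- unlist(strsplit(input, " "))
mapped <- data.frame(key = words, value = 1)
# partition/sort (the shuffle): group the values by key
grouped <- split(mapped$value, mapped$key)
# reduce: consolidate each key's list of values into one count
reduced <- sapply(grouped, sum)
# output: the word counts
reduced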

Page 49: HDFS & MapReduce

Shuffle

49

Page 50: HDFS & MapReduce

Programming MapReduce

Page 51: HDFS & MapReduce

Map

A Map function converts each input element into zero or more key-value pairs
A “key” is not unique; many pairs with the same key are typically generated by the Map function
The key is the field about which you want to collect data

51

Page 52: HDFS & MapReduce

Map

Compute the squares of a set of numbers

Input is (null,1), (null,2), …
Output is (1,1), (2,4), …

52

mapper <- function(k,v) {
  key <- v
  value <- key^2
  keyval(key, value)
}

Page 53: HDFS & MapReduce

Reduce

A Reduce function is applied, for each input key, to its associated list of values
The result is a new pair consisting of the key and whatever is produced by the Reduce function
The output of the MapReduce job is what results from applying the Reduce function to each key and its list

53

Page 54: HDFS & MapReduce

Reduce

Report the number of items in a list

Input is (key, value-list), …
Output is (key, length(value-list)), …

54

reducer <- function(k,v) {
  key <- k
  value <- length(v)
  keyval(key, value)
}

Page 55: HDFS & MapReduce

MapReduce API

A low-level Java implementation
Can gain additional compute efficiency, but is tedious to program
Try the highest-level options first and descend to lower levels only if required

55

Page 56: HDFS & MapReduce

R & Hadoop

Compute squares

56

Page 57: HDFS & MapReduce

R

# create a list of 10 integers
ints <- 1:10
# equivalent to ints <- c(1,2,3,4,5,6,7,8,9,10)
# compute the squares
result <- sapply(ints, function(x) x^2)
result
[1]   1   4   9  16  25  36  49  64  81 100

Page 58: HDFS & MapReduce

Key-value mapping

58

Input      Map        Reduce   Output
(null,1)   (1,1)               (1,1)
(null,2)   (2,4)               (2,4)
…          …                   …
(null,10)  (10,100)            (10,100)

Page 59: HDFS & MapReduce

MapReduce

library(rmr2)
rmr.options(backend = "local") # local or hadoop
# load a list of 10 integers into HDFS
hdfs.ints = to.dfs(1:10)
# mapper for the key-value pairs to compute squares
mapper <- function(k,v) {
  key <- v
  value <- key^2
  keyval(key, value)
}
# run MapReduce
out = mapreduce(input = hdfs.ints, map = mapper)
# convert to a data frame
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('n', 'n^2')
# display the results
df1

No reduce

Page 60: HDFS & MapReduce

Exercise

Use the map component of mapreduce() to compute the cubes of the integers from 1 to 25 (one possible answer is sketched below)

60
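For reference, one possible answer sketch, following the pattern of the squares example on the previous slides (only the range and the power change):

library(rmr2)
rmr.options(backend = "local") # local or hadoop
# load the integers 1 to 25 into HDFS
hdfs.ints <- to.dfs(1:25)
# mapper emits (n, n^3) pairs
mapper <- function(k, v) {
  keyval(v, v^3)
}
out <- mapreduce(input = hdfs.ints, map = mapper)
df1 <- as.data.frame(from.dfs(out))
colnames(df1) <- c('n', 'n^3')
df1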

Page 61: HDFS & MapReduce

R & Hadoop

Tabulation

Page 62: HDFS & MapReduce

R

library(readr)
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
# convert and round temperature to an integer
t$temperature = round((t$temperature - 32) * 5/9, 0)
# tabulate frequencies
table(t$temperature)

Page 63: HDFS & MapReduce

Key-value mapping

63

Input        Map (F to C)   Reduce                    Output
(null,35.1)  (2,1)          (-7,c(1))                 (-7,1)
(null,37.5)  (3,1)          (-6,c(1))                 (-6,1)
…            …              …                         …
(null,43.3)  (6,1)          (27,c(1,1,1,1,1,1,1,1))   (27,8)

Page 64: HDFS & MapReduce

MapReduce (1)

library(rmr2)
library(readr)
rmr.options(backend = "local") # local or hadoop
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
# save temperature in an HDFS file
hdfs.temp <- to.dfs(t$temperature)
# mapper for conversion to Celsius
mapper <- function(k,v) {
  key <- round((v - 32) * 5/9, 0)
  value <- 1
  keyval(key, value)
}

Page 65: HDFS & MapReduce

MapReduce (2)

# reducer to count frequencies
reducer <- function(k,v) {
  key <- k
  value <- length(v)
  keyval(key, value)
}
out = mapreduce(input = hdfs.temp, map = mapper, reduce = reducer)
df2 = as.data.frame(from.dfs(out))
colnames(df2) = c('temperature', 'count')
df3 <- df2[order(df2$temperature),]
print(df3, row.names = FALSE) # no row names

Page 66: HDFS & MapReduce

R & Hadoop

Basic statistics

Page 67: HDFS & MapReduce

R

library(readr)
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
a1 <- aggregate(t$temperature, by = list(t$year), FUN = max)
colnames(a1) = c('year', 'value')
a1$measure = 'max'
a2 <- aggregate(t$temperature, by = list(t$year), FUN = mean)
colnames(a2) = c('year', 'value')
a2$value = round(a2$value, 1)
a2$measure = 'mean'
a3 <- aggregate(t$temperature, by = list(t$year), FUN = min)
colnames(a3) = c('year', 'value')
a3$measure = 'min'
# stack the results
stack <- rbind(a1, a2, a3)
library(reshape)
# reshape with year, max, mean, min in one row
stats <- cast(stack, year ~ measure, value = "value")
head(stats)

Page 68: HDFS & MapReduce

Key-value mapping

68

Input          Map                  Reduce                          Output
(null,record)  (year, temperature)  (year, vector of temperatures)  (year, max), (year, mean), (year, min)

Page 69: HDFS & MapReduce

MapReduce (1)

library(rmr2)
library(reshape)
library(readr)
rmr.options(backend = "local") # local or hadoop
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read_delim(url, delim = ',')
# convert to an HDFS file
hdfs.temp <- to.dfs(data.frame(t))
# mapper for computing temperature measures for each year
mapper <- function(k,v) {
  key <- v$year
  value <- v$temperature
  keyval(key, value)
}

Page 70: HDFS & MapReduce

MapReduce (2)

# reducer to report stats
reducer <- function(k,v) {
  key <- k # year
  value <- c(max(v), round(mean(v), 1), min(v)) # v is the list of values for a year
  keyval(key, value)
}
out = mapreduce(input = hdfs.temp, map = mapper, reduce = reducer)
df3 = as.data.frame(from.dfs(out))
df3$measure <- c('max', 'mean', 'min')
# reshape with year, max, mean, min in one row
stats2 <- cast(df3, key ~ measure, value = "val")
head(stats2)

Page 71: HDFS & MapReduce

R & Hadoop

Word counting

Page 72: HDFS & MapReduce

R

library(stringr)
# read as a single character string
t <- readChar("http://people.terry.uga.edu/rwatson/data/yogiquotes.txt", nchars = 1e6)
t1 <- tolower(t[[1]]) # convert to lower case
t2 <- str_replace_all(t1, "[[:punct:]]", "") # get rid of punctuation
wordList <- str_split(t2, "\\s") # split into strings
wordVector <- unlist(wordList) # convert list to vector
table(wordVector)

Page 73: HDFS & MapReduce

Key-value mapping

73

Input         Map                    Reduce             Output
(null, text)  (word,1), (word,1), …  (word, vector), …  (word, length(vector)), …

Page 74: HDFS & MapReduce

MapReduce (1)

library(rmr2)
library(stringr)
rmr.options(backend = "local") # local or hadoop
# read as a single character string
url <- "http://people.terry.uga.edu/rwatson/data/yogiquotes.txt"
t <- readChar(url, nchars = 1e6)
text.hdfs <- to.dfs(t)
mapper = function(k,v) {
  t1 <- tolower(v) # convert to lower case
  t2 <- str_replace_all(t1, "[[:punct:]]", "") # get rid of punctuation
  wordList <- str_split(t2, "\\s") # split into words
  wordVector <- unlist(wordList) # convert list to vector
  keyval(wordVector, 1)
}

Page 75: HDFS & MapReduce

MapReduce (2)

reducer = function(k,v) {
  keyval(k, length(v))
}
out <- mapreduce(input = text.hdfs, map = mapper, reduce = reducer, combine = TRUE)
# convert output to a data frame
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('word', 'count')
# display the results
print(df1, row.names = FALSE) # no row names

Page 76: HDFS & MapReduce

Hortonworks data platform

76

Page 77: HDFS & MapReduce

HBase

A distributed database
Does not enforce relationships
Does not enforce strict column data typing
Part of the Hadoop ecosystem

77

Page 78: HDFS & MapReduce

Applications

Facebook
Twitter
StumbleUpon

78

Page 79: HDFS & MapReduce

Hiring: learning from big data

People with a criminal background perform a bit better in customer-support call centers
Customer-service employees who live nearby are less likely to leave
Honest people tend to perform better and stay on the job longer, but make less effective salespeople

79

Page 80: HDFS & MapReduce

Outcomes

Scientific discovery
Quasars
Higgs boson

Discovering linkages among humans, products, and services
An ecologically sustainable society

Energy Informatics

80

Page 81: HDFS & MapReduce

Critical questions

What’s the business problem?
What information is needed to make a high-quality decision?
What data can be converted into information?

81

Page 82: HDFS & MapReduce

Conclusions

Faster and lower-cost solutions for data-driven decision making

HDFS
Reduces the cost of storing large datasets
Becoming the new standard for data storage

MapReduce is changing the way data are processed
Cheaper
Faster
Requires reprogramming for parallelism

82