A BigData Tour – HDFS, Ceph and MapReduce
  • Slide 1
  • A BigData Tour – HDFS, Ceph and MapReduce. These slides are possible thanks to these sources: Jonathan Dursi - SciNet Toronto Hadoop Tutorial; Amir Payberah - Course in Data Intensive Computing, SICS; Yahoo! Developer Network MapReduce Tutorial
  • Slide 2
  • MAP-REDUCE
  • Slide 3
  • What is it? Amir Payberah https://www.sics.se/~amir/dic.htm
  • Slide 4
  • MapReduce Basics. A programming model and its associated implementation for parallel processing of large data sets. It was developed within Google as a mechanism for processing large amounts of raw data, e.g. crawled documents or web request logs. It can efficiently distribute the processing of TBs of data across thousands of processing nodes. This distribution implies parallel computing, since the same computation is performed on each CPU, but on a different dataset (or a different segment of a large dataset). The implementation's run-time system takes care of parallelism, fault tolerance, data distribution, load balancing, etc. It is complementary to RDBMSs, but differs in many ways (data size, access, update, structure, integrity and scale). Features: fault tolerance, locality, task granularity, backup tasks, skipping bad records, and so on.
  • Slide 5
  • MapReduce Simple Dataflow Amir Payberah https://www.sics.se/~amir/dic.htm
  • Slide 6
  • MapReduce Functional Programming Concepts. MapReduce programs are designed to process large volumes of data in a parallel fashion. This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily: the communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale. Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If a mapping task changes an input (key, value) pair, the change is not reflected back in the input files; communication occurs only by generating new output (key, value) pairs, which are then forwarded by the Hadoop system into the next phase of execution. https://developer.yahoo.com/hadoop/tutorial
  • Slide 7
  • MapReduce List Processing. Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program will do this twice, using two different list processing idioms: map and reduce. These terms are taken from list processing languages such as LISP, Scheme, or ML. https://developer.yahoo.com/hadoop/tutorial
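    These idioms predate Hadoop. As a rough analogy in ordinary in-memory Java (not Hadoop code), the two steps look like this; the list contents and class name are purely illustrative:

        import java.util.Arrays;
        import java.util.List;
        import java.util.stream.Collectors;

        public class ListIdioms {
            public static void main(String[] args) {
                List<String> words = Arrays.asList("map", "reduce", "lists");

                // map: transform each element of a list individually into a new element
                List<Integer> lengths = words.stream()
                        .map(String::length)
                        .collect(Collectors.toList());

                // reduce: fold a list of values into a single output value ("+" as the reducer)
                int total = lengths.stream().reduce(0, Integer::sum);

                System.out.println(lengths + " -> " + total);  // prints [3, 6, 5] -> 14
            }
        }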
  • Slide 8
  • MapReduce Mapping Lists. The first phase of a MapReduce program is called mapping. A list of data elements is provided, one element at a time, to a function called the Mapper, which transforms each element individually into an output data element. Say there is a toUpper(str) function, which returns an uppercase version of the input string. The map would then turn a list of input strings into a list of uppercase strings. Note: the input has not been modified; a new string has been returned. https://developer.yahoo.com/hadoop/tutorial
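    A minimal sketch of such a mapper with Hadoop's Java mapreduce API, assuming the standard TextInputFormat (which supplies the line's byte offset as the key and the line itself as the value); the class name UpperCaseMapper is illustrative, not from the original slides:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Emits an uppercased copy of every input line; the input value itself is never modified.
        public class UpperCaseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(offset, new Text(line.toString().toUpperCase()));
            }
        }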
  • Slide 9
  • MapReduce Reducing Lists. Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself. For example, "+" can be used as a reducing function to return the sum of a list of input values. https://developer.yahoo.com/hadoop/tutorial
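    A sketch of "+" as a reducing function in Hadoop's Java API; the class name SumReducer and the Text key type are assumptions made for illustration:

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // For each key, adds up all values in the iterator and emits a single (key, sum) pair.
        public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }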
  • Slide 10
  • Putting Map and Reduce Together. A MapReduce program has two components: one that implements the mapper, and another that implements the reducer. Keys and values: in MapReduce, no value stands on its own. Every value has a key associated with it; keys identify related values. E.g., the list below is for flight departures and the number of passengers that failed to board. The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow.
        EK123 65, 12:00pm
        BA789 50, 12:02pm
        EK123 40, 12:05pm
        QF456 25, 12:15pm
        ...
    https://developer.yahoo.com/hadoop/tutorial
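    One plausible way to express the flight data above as (key, value) pairs is a mapper that parses each line and emits the flight number as the key and the failed-boarding count as the value, so that a reducer can then total the count per flight. The field layout and class name below are assumptions:

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Turns a line such as "EK123 65, 12:00pm" into the pair (EK123, 65).
        public class FlightMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("[ ,]+");  // e.g. ["EK123", "65", "12:00pm"]
                if (fields.length >= 2) {
                    context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
                }
            }
        }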
  • Slide 11
  • MapReduce - Keys. In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs, and a reducer may compute over an input list and emit one or a dozen different outputs. Keys divide the reduce space: a reducing function turns a large list of values into one (or a few) output values. In MapReduce, all of the output values are not usually reduced together; all of the values with the same key are presented to a single reducer together. This is performed independently of any reduce operations occurring on other lists of values with different keys attached. (In the accompanying figure, different colors represent different keys; all values with the same key are presented to a single reduce task.) https://developer.yahoo.com/hadoop/tutorial
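    Which reduce task receives which key is decided by a partitioner. Hadoop's default behaves essentially like the hash-based sketch below (a simplified illustration, not the library's exact source):

        import org.apache.hadoop.mapreduce.Partitioner;

        // Routes every occurrence of the same key to the same reduce task.
        public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
            @Override
            public int getPartition(K key, V value, int numReduceTasks) {
                // Mask off the sign bit so the result is always a valid partition index.
                return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            }
        }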
  • Slide 12
  • Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
  • Slide 13
  • High-Level Structure of a MR Program 1/2.
        mapper (filename, file-contents):
            for each word in file-contents:
                emit (word, 1)

        reducer (word, values):
            sum = 0
            for each value in values:
                sum = sum + value
            emit (word, sum)
    https://developer.yahoo.com/hadoop/tutorial
  • Slide 14
  • High-Level Structure of a MR Program 2/2. Several instances of the mapper function are created on the different machines in a Hadoop cluster. Each instance receives a different input file (it is assumed that there are many such files). The mappers output (word, 1) pairs, which are then forwarded to the reducers. Several instances of the reducer method are also instantiated on the different machines. Each reducer is responsible for processing the list of values associated with a different word. The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word, then emits the final (word, count) output, which is written to an output file.
        mapper (filename, file-contents):
            for each word in file-contents:
                emit (word, 1)

        reducer (word, values):
            sum = 0
            for each value in values:
                sum = sum + value
            emit (word, sum)
    https://developer.yahoo.com/hadoop/tutorial
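    A runnable Java version of this word-count structure, written against the standard Hadoop mapreduce API; the class names and the use of the reducer as a combiner are illustrative choices, not taken from the slides:

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // mapper(filename, file-contents): emit (word, 1) for each word in the line
            public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable offset, Text line, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer tokens = new StringTokenizer(line.toString());
                    while (tokens.hasMoreTokens()) {
                        word.set(tokens.nextToken());
                        context.write(word, ONE);
                    }
                }
            }

            // reducer(word, values): emit (word, sum of values)
            public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable value : values) {
                        sum += value.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
                FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }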
  • Slide 15
  • Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
  • Slide 16
  • Slide 17
  • Slide 18
  • MapReduce Data Flow 1/4 https://developer.yahoo.com/hadoop/tutorial
  • Slide 19
  • MapReduce Data Flow 2/4. MapReduce inputs typically come from input files loaded onto the Hadoop cluster's HDFS file system. These files are evenly distributed across all nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in the Hadoop cluster. Each of these mapping tasks is equivalent: no mapper has a particular "identity" associated with it. Thus, any mapper can process any input file; each mapper loads the set of files local to that machine and processes them. https://developer.yahoo.com/hadoop/tutorial
  • Slide 20
  • MapReduce Data Flow 3/4. When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines. All values with the same key are sent to a single reducer. https://developer.yahoo.com/hadoop/tutorial
  • Slide 21
  • MapReduce Data Flow 4/4. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce: the user never explicitly marshals information from one machine to another. All data transfer is handled by the Hadoop MapReduce platform runtime, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted; if they have been performing side-effects, e.g. communicating with the outside world, then the shared state must be restored when a task is restarted. By restricting communication and eliminating side-effects, restarts can be handled more gracefully. https://developer.yahoo.com/hadoop/tutorial
  • Slide 22
  • In-Depth View of MapReduce
  • Slide 23
  • https://developer.yahoo.com/hadoop/tutorial
  • Slide 24
  • InputFormat 1/2. Input files reside on HDFS and can be of an arbitrary format. How to split up and read these files is decided by the InputFormat class, which: selects the files or other objects that should be used for input; defines the InputSplits that break a file into tasks; and provides a factory for RecordReader objects that read the file. https://developer.yahoo.com/hadoop/tutorial
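    In the Java API, the driver chooses the InputFormat and the input paths. A brief sketch, assuming a Job object like the one in the word-count example above and an illustrative input path:

        import java.io.IOException;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        public class InputConfig {
            // Selects which files are read and which InputFormat splits them into
            // InputSplits and supplies RecordReaders; TextInputFormat is the line-oriented default.
            public static void configureInput(Job job) throws IOException {
                job.setInputFormatClass(TextInputFormat.class);
                FileInputFormat.addInputPath(job, new Path("/user/example/input"));  // illustrative path
            }
        }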
  • Slide 25
  • InputFormat 2/2. An InputSplit is the unit of work that comprises a single map task. By default this is one 64 MB chunk (one HDFS block). Since a file is made up of several such blocks, parallel map tasks can be run on these chunks. https://developer.yahoo.com/hadoop/tutorial
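    If map tasks should work on smaller (or larger) chunks than one block, the split size can be bounded from the driver. A hedged sketch; the 16 MB and 32 MB figures are arbitrary illustrations:

        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        public class SplitSizeConfig {
            // Caps each InputSplit at 32 MB and floors it at 16 MB, overriding the
            // one-split-per-block default; smaller splits mean more (smaller) map tasks.
            public static void configureSplits(Job job) {
                FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
                FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
            }
        }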
  • Slide 26
  • RecordReader and Mapper. An InputSplit defines a unit of work. The RecordReader class defines how to load that data and convert it into (key, value) pairs that the map phase can use. The Mapper then performs the map step, emitting (key, value) pairs for use by the reduce phase. https://developer.yahoo.com/hadoop/tutorial
  • Slide