Bigdata Hadoop part 1

FEBRUARY 4, 2015

Introduction

In this modern era, everyone holds a set of information that is unique to him or her. The backbone of any industry, however big or small, is its consumers, so for any industry to prosper it must know its consumers. To do that, it needs to collect and maintain information about them, mine the collected data, and come up with the trends that drive the business.

This has been done since time immemorial. Earlier, industries were small and consumers could be counted manually, but as industries grew bigger, their consumer bases also increased. Consequently, industries started using technology, because processing the data and mining information from it could be done far more efficiently.

Then a stage came when the consumer base started increasing exponentially, and analyzing and processing the data started taking far more time (a task that earlier finished in an hour began taking a day or more) because the data under consideration was huge.

Let's explore this problem with a practical example. Say a bullock cart with one bull can pull 1 quintal of goods, and now we have to pull 2 quintals. To meet this requirement we do not make the bull bigger; instead we increase the number of bulls to two. With two bulls we are able to move the 2-quintal load from point A to point B.

With this thought process we changed the traditional way of processing data and started sharing the load among multiple machines. Earlier we used a single machine to process the data; now, instead of increasing the size of the machine, we increase the number of machines. Here we apply the famous "divide and rule" philosophy: the data is divided into sets and the sets are processed on different machines. Processing becomes faster and the overall throughput increases. This distributed architecture for analyzing data is what Hadoop is built on.

HADOOP

To make data analysis simple, Apache engineers developed a Java framework called Hadoop, which provides a simple and efficient way of mining knowledge from data. It is based on the philosophy "write once, read many times [data]". The Hadoop architecture comprises two sub-modules:

1. HDFS, the Hadoop Distributed File System [storage].
2. MapReduce [processing model].

Big data

Big data is a term used to describe a volume of data [structured / semi-structured / unstructured] that is so large that it is difficult to process using traditional database and software techniques. Big data has the potential to help companies improve operations and make faster, more intelligent decisions. Facebook, for example, generates petabytes of data every day, and the same is true of the American stock exchanges.


For data to be considered big data, it must have these three features:

1. Volume (huge volume of data): many factors contribute to the increase in data volume, such as transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue; with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from the relevant data.

2. Velocity: in simple words, velocity is speed over time; here it means the speed at which data is generated. For example, social media sites such as Facebook, Twitter and LinkedIn receive new feeds every second, which means new data is generated and stored every second (or even every microsecond). That is velocity in big data.

3. Variety (diversity of the data): data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.

HDFS (Hadoop Distributed File System)

HDFS is a distributed file system designed for storing very large files with streaming data access patterns, running on a cluster of commodity hardware (*commodity hardware means easily available hardware, such as ordinary PCs).

HDFS is built to serve a specific set of purposes:

1. High fault tolerance: even if the data is corrupted on one of the data nodes, we still have access to correct data, because the data is replicated across the cluster.

2. High throughput: high throughput here refers to the overall analyzing or processing time, not the access time for an individual record.

3. Streaming access to the file system: data is accessed as a stream; as data travels across the racks there are no barriers, because the nodes work on replicas of the data.

4. Can be built out of commodity hardware: there is no need for specialized systems with high-end configurations; HDFS can work on normal PCs with ordinary specifications.
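To make the idea of HDFS as a file system concrete, here is a minimal sketch of a client reading a file from HDFS through Hadoop's Java FileSystem API. It is an illustration only: the NameNode URI hdfs://namenode:9000 and the path /data/sample.txt are placeholder assumptions, not values from these slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; this NameNode URI is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Get a handle to the distributed file system.
        FileSystem fs = FileSystem.get(conf);

        // Stream the file line by line; HDFS serves the read from a nearby replica.
        Path file = new Path("/data/sample.txt");   // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}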


HDFS Components:

1. NameNode: the master node of the cluster. It stores metadata about the data held in the DataNodes (the number of blocks, which rack and which DataNode each block is stored on, and other details), whereas the DataNodes store the actual data.

2. DataNode: as the name suggests, the actual data resides on these nodes, and data processing also happens on them. DataNodes are also known as the slave nodes or worker nodes.

3. NodeManager: performs tasks on the data, collects the token or reply of each job, and gives a job ID to every task. It sends heartbeat signals to the ResourceManager and frees up resources after a task completes.

4. ResourceManager: responsible for tracking the resources in a cluster and scheduling applications.

5. Secondary NameNode: merges the fsimage and edits log files periodically and keeps the edits log size within a limit. It is usually run on a different machine than the primary NameNode, since its memory requirements are of the same order as the primary NameNode's.

6. YARN: processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner. It consists of the ResourceManager and the NodeManager (both discussed above).

7. ApplicationManager: responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

Architecture of Hadoop 2nd generation (Hadoop 2.x)


Replication Awareness or Rack Awareness

Rack awareness improves data reliability, availability, and network bandwidth utilization. Nodes are grouped on the basis of their location, i.e. the distance between them, and replica placement follows a simple strategy:

The first replica is placed on the same node as the client.


The second replica is placed on a different rack from the first (off-rack), chosen at random.

The third replica is placed on the same rack as the second, but on a different node chosen at random.

Further replicas are placed on random nodes in the cluster.

Replica selection: HDFS tries to satisfy a read request from the replica that is closest to the reader.
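The placement strategy above can be sketched in plain Java. The sketch below is only an illustration of the rule (first replica on the client's node, second off-rack, third on the second's rack but a different node); the Node class and chooseTargets method are made-up names, not Hadoop's actual BlockPlacementPolicy code, and it assumes the cluster has at least two racks with enough nodes.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified model of a cluster node: a hostname plus the rack it sits on.
class Node {
    final String host;
    final String rack;
    Node(String host, String rack) { this.host = host; this.rack = rack; }
}

public class ReplicaPlacementSketch {
    private static final Random RANDOM = new Random();

    // Choose three targets for a block written by 'client', following the strategy above.
    static List<Node> chooseTargets(Node client, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();

        // 1st replica: on the same node as the client.
        targets.add(client);

        // 2nd replica: on a random node of a different rack (off-rack).
        Node second = pick(cluster, n -> !n.rack.equals(client.rack));
        targets.add(second);

        // 3rd replica: same rack as the 2nd, but a different node.
        Node third = pick(cluster,
                n -> n.rack.equals(second.rack) && !n.host.equals(second.host));
        targets.add(third);

        return targets;
    }

    // Pick a random node satisfying the given condition
    // (assumes at least one candidate exists).
    static Node pick(List<Node> cluster, java.util.function.Predicate<Node> condition) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            if (condition.test(n)) candidates.add(n);
        }
        return candidates.get(RANDOM.nextInt(candidates.size()));
    }
}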

File System Image (fsimage) and Edit Logs

The fsimage file is a persistent checkpoint of the file system metadata. When a client performs a write operation, it is first recorded in the edit log. The NameNode also keeps an in-memory representation of the file system metadata, which it updates after the edit log has been modified.
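As a rough illustration of this write-ahead pattern (log the operation first, then update the in-memory metadata), here is a tiny hypothetical Java sketch; it is not the NameNode's real code, and the namespace is reduced to a simple file-name-to-block-count map.

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class EditLogSketch {
    private final FileWriter editLog;                               // persistent edit log
    private final Map<String, Integer> namespace = new HashMap<>(); // in-memory metadata

    public EditLogSketch(String editLogPath) throws IOException {
        this.editLog = new FileWriter(editLogPath, true);           // append mode
    }

    // A client "write": record the operation in the edit log first,
    // then update the in-memory representation.
    public void createFile(String name, int blocks) throws IOException {
        editLog.write("CREATE " + name + " " + blocks + "\n");
        editLog.flush();                  // make it durable before acknowledging
        namespace.put(name, blocks);      // in-memory update after logging
    }

    // A checkpoint would serialize 'namespace' to an fsimage file and trim the
    // edit log, which is the merge role of the Secondary NameNode described earlier.
}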

MapReduce

MapReduce is basically a programming model, built in Java, for processing huge data sets and generating result sets from them. MapReduce works on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting, and a Reduce() procedure that performs a summary operation.

Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
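To show how these two procedures are wired into a job, here is a hedged sketch of a driver for the classic word-count problem using Hadoop's Java MapReduce API. The class names WordCountMapper and WordCountReducer are assumptions sketched after the Mapper and Reducer sections below, and the /input and /output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map(k1, v1) -> list(k2, v2): (line offset, line text) -> (word, 1)
        job.setMapperClass(WordCountMapper.class);
        // Reduce(k2, list(v2)) -> list(v3): (word, [1, 1, ...]) -> (word, total)
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/input"));      // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/output"));   // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}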

A set of functions is responsible for the working of MapReduce, namely:

1. Reader()
2. Map()
3. Combiner()
4. Partitioner()
5. Reduce()
6. Reporter()

Reader function: the reader takes the data as input, splits it according to size, and sends it on for further processing. A few components make up the reader:

1. InputSplit: an InputSplit presents a byte-oriented view of the input. It is the responsibility of the RecordReader to process it and present a record-oriented view. FileSplit is the default InputSplit.


2. InputFormat: an InputFormat describes the input specification for a MapReduce job. It validates the input specification of the job, splits the input file(s) into logical InputSplits (each of which is then assigned to an individual Mapper), and provides the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper. In some cases the application has to implement its own RecordReader, which carries the responsibility of respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
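As a rough sketch of what implementing an InputFormat looks like (assuming the standard org.apache.hadoop.mapreduce API), the hypothetical class below inherits FileInputFormat's split logic and simply hands back Hadoop's LineRecordReader for each split; a real custom format would normally also customize the splitting or record boundaries.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Minimal custom InputFormat: splitting is inherited from FileInputFormat,
// and a RecordReader is supplied for each logical InputSplit.
public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // Delegate to Hadoop's line-oriented reader: keys are byte offsets,
        // values are the lines themselves.
        return new LineRecordReader();
    }
}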

3. RecordReader: the RecordReader breaks the data into key/value pairs for input to the Mapper. It provides the following interface for getting keys and values from the input:

getCurrentKey(): get the current key.
getCurrentValue(): get the current value.
getProgress(): a number between 0.0 and 1.0 that is the fraction of the data read.
initialize(InputSplit split, TaskAttemptContext context): called once at initialization.
nextKeyValue(): read the next key/value pair.
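Wired to the methods just listed, a bare-bones custom RecordReader might look like the sketch below. It is only an illustration: it delegates everything to Hadoop's LineRecordReader rather than doing its own parsing.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative RecordReader that delegates to LineRecordReader, showing where
// each of the interface methods listed above fits in.
public class DelegatingRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);    // called once at initialization
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        return delegate.nextKeyValue();         // read the next key/value pair
    }

    @Override
    public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();        // byte offset of the current line
    }

    @Override
    public Text getCurrentValue() {
        return delegate.getCurrentValue();      // the current line itself
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress();          // fraction of the split read so far
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}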

Mapper / Map(): it maps the input dataset, processed as key/value pairs, to a set of intermediate pairs of records. The structure and number of the intermediate output records may be the same as or different from those of the input chunk. Mappers run on the different chunks of data available on the different data nodes and produce the output for their chunk. The number of map tasks executed by a MapReduce application depends on the total size of the input and the total number of blocks of the input file. After the mapper function, shuffle and sort comes into the picture; this is the most important part of MapReduce.
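Continuing the hypothetical word-count sketch, the WordCountMapper referenced in the earlier driver could be a minimal Mapper that turns each input line into (word, 1) pairs:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input pair:   (byte offset of the line, line text)
// Output pairs: (word, 1) for every word in the line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit one intermediate pair per word
        }
    }
}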

Shuffle and sort: between the map and reduce stages the data is shuffled (exchanged between nodes) and sorted in parallel, in order to move the data from the map node that produced it to the shard in which it will be reduced.

Combiner(): the combine function is used as an optimization for the MapReduce job. The combiner runs on the output of the map phase and acts as a filtering or aggregating step to lessen the number of intermediate keys being passed to the reducer.
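In the word-count sketch the reduce logic (summing counts) is associative, so the reducer class itself can double as the combiner. Assuming the hypothetical WordCountDriver and WordCountReducer from the other sketches, registering it takes one extra line in the driver:

// In WordCountDriver.main(), after setReducerClass(): run the reduce logic
// locally on each mapper's output to shrink the data shuffled across the network.
job.setCombinerClass(WordCountReducer.class);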

Partitioner(): it partitions the key space and controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of it) is used to derive the partition, usually by a hash function. The total number of partitions is the same as the number of reduce tasks. HashPartitioner is the default Partitioner. The Partitioner controls which reducer each of the Mapper's keys is sent to for reduction; after partitioning, the keys are sent to the reducers.
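The default hashing behavior described here can be written out as a small custom Partitioner (illustrative only; Hadoop's HashPartitioner already does exactly this for you):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each key to a reducer chosen by hashing the key, mirroring the
// behavior of the default HashPartitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).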


Reducer / Reduce(): the reduce function is called once for each unique key, in sorted order. The reducer can iterate through the values associated with that key and produce zero or more outputs.

For example, in the word-count problem the reducer iterates over the values for each word, sums them up, and gives a single reduced output, i.e. the total count for that word.
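Continuing the same hypothetical word-count sketch, the WordCountReducer referenced in the earlier driver could be written as:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per unique word with the list of 1s emitted by the mappers;
// sums them and writes out (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);   // single reduced output per word
    }
}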

Reporter(): the Reporter enables a MapReduce application to report progress, update counters, and set application-level status messages. Mappers and Reducers use Reporters to report progress and to indicate that they are alive. The OutputCollector enables the MapReduce framework to collect the data output by the Mapper and the Reducer.