Hadoop

BY ANKIT PRASAD

CSE 3RD YEAR

NSEC

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.

Big Data is characterized by huge volume, high velocity, and a wide variety of data.

Classification of Big Data

The data falls into three types:

Structured data: Relational data.

Semi-structured data: XML data.

Unstructured data: Word, PDF, Text, Media Logs.

Big Data Challenges

The major challenges associated with big data:

Capturing data

Storage

Searching

Sharing

Transfer

Analysis

Presentation

Google's Solution: MapReduce

It is a parallel programming model for writing distributed applications.

It can efficiently process multi-terabyte datasets.

It runs on large clusters of commodity hardware in a reliable, fault-tolerant manner.

Introduction to Hadoop

Hadoop was developed by Doug Cutting.

Hadoop is an Apache open source framework written in Java.

Hadoop allows distributed storage and processing of large datasets across clusters of computers.

Hadoop Architecture

Hadoop has two major layers:

Processing/Computation layer (MapReduce)

Storage layer (Hadoop Distributed File System)

Other modules of the Hadoop framework include:

Hadoop Common

Hadoop YARN (Yet Another Resource Negotiator)

What is MapReduce?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

Map takes a set of data and breaks individual elements into tuples (key/value pairs).

Reduce takes Map's output as its input and combines those data tuples into a smaller set of tuples.

Under the MapReduce model, the data processing primitives are called mappers and reducers.
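
As an illustration of the two tasks described above, here is a minimal word-count sketch in plain Python. It does not use Hadoop itself; the function names map_words, reduce_counts, and run_word_count are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_words(line):
    """Map: break one line of input into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: combine all values for one key into a single tuple."""
    return (word, sum(counts))


def run_word_count(lines):
    # Apply the mapper to every input line.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Group the values by key so each reducer call sees exactly one key.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    # Apply the reducer to each key and its list of values.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


result = run_word_count(["big data big clusters", "data nodes"])
print(result)
```

Six input words collapse into four output tuples, which is the "smaller set of tuples" that Reduce is meant to produce.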

MapReduce Algorithm

Hadoop initiates the Map stage by issuing mapping tasks to appropriate servers in the cluster.

Map stage:

The input file or directory, stored in HDFS, is passed to the mapper function line by line.

The mapper processes the data and creates several small chunks of data (key/value pairs).

Hadoop monitors task completion and initiates the shuffle stage.

Shuffle stage:

The framework groups data from all mappers by key and splits it among the appropriate servers for the reduce stage.

Reduce stage:

The reducer processes the data coming from the mappers, producing a new set of output that is stored in HDFS.

The framework manages all the details of data passing and copying between the nodes in the cluster.
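
The three stages can be sketched as a single-process simulation in Python. This is only a toy model under stated assumptions: the "city,temperature" record format, the CRC32-based partition function, and all function names are hypothetical, and a real cluster runs each stage on separate servers.

```python
import zlib
from collections import defaultdict


def mapper(line):
    # Assumed record format for this example: "city,temperature".
    city, temp = line.split(",")
    return [(city, int(temp))]


def partition(key, num_reducers):
    # A stable hash, so the same key always goes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    # One bucket per reducer; inside a bucket, values are grouped by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reducer(key, values):
    # Keep the maximum temperature seen for each city.
    return (key, max(values))


def run_job(lines, num_reducers=2):
    mapped = [pair for line in lines for pair in mapper(line)]  # map stage
    buckets = shuffle(mapped, num_reducers)                     # shuffle stage
    output = {}
    for bucket in buckets:                                      # reduce stage
        for key in sorted(bucket):  # each reducer sees its keys in sorted order
            out_key, out_value = reducer(key, bucket[key])
            output[out_key] = out_value
    return output


max_temps = run_job(["delhi,41", "kolkata,36", "delhi,44", "kolkata,38"])
print(max_temps)
```

Because the partition function depends only on the key, every value for a given city reaches the same reducer, which is what allows the reducer to compute a correct result without seeing the rest of the data.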

Hadoop Distributed File System

HDFS is based on the Google File System.

It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

It is suitable for applications that have large datasets.

Files are stored redundantly to protect the system against data loss in case of failure.

HDFS Architecture

Namenode:

It acts as a master server that manages the file system namespace.

It regulates clients' access to files.

Datanode:

These nodes manage the data storage of their system.

They perform read-write and block operations as instructed by the namenode.

Block:

It is the minimum amount of data that HDFS can read or write.

The files are divided into one or more blocks.

Blocks are stored in individual data nodes.
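
The division of labour between namenode, datanodes, and blocks can be sketched with a toy in-memory model. This is not the HDFS API: the classes ToyNamenode and ToyDatanode, the helpers put and get, the round-robin placement, and the tiny 4-byte block size are all assumptions made for illustration.

```python
def split_into_blocks(data, block_size):
    """Divide file contents into one or more fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyDatanode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNamenode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes  # name -> ToyDatanode
        self.namespace = {}         # path -> [(block_id, datanode name)]

    def allocate(self, path, num_blocks):
        names = sorted(self.datanodes)
        # Round-robin placement is an assumption made for this sketch.
        plan = [(f"{path}#blk{i}", names[i % len(names)]) for i in range(num_blocks)]
        self.namespace[path] = plan
        return plan

    def locate(self, path):
        return self.namespace[path]


def put(namenode, path, data, block_size):
    blocks = split_into_blocks(data, block_size)
    plan = namenode.allocate(path, len(blocks))
    for (block_id, node), block in zip(plan, blocks):
        namenode.datanodes[node].write_block(block_id, block)


def get(namenode, path):
    return b"".join(
        namenode.datanodes[node].read_block(block_id)
        for block_id, node in namenode.locate(path)
    )


nodes = {name: ToyDatanode(name) for name in ["dn1", "dn2"]}
namenode = ToyNamenode(nodes)
put(namenode, "/logs/a.txt", b"abcdefghij", block_size=4)
print(namenode.locate("/logs/a.txt"))
print(get(namenode, "/logs/a.txt"))
```

The namenode in this sketch never touches file contents; it only records the block-to-datanode mapping, mirroring its role as the manager of the file system namespace.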

Hadoop Common

It provides essential services and basic processes such as abstraction of the underlying operating system and its file system.

It assumes that hardware failures are common and should be automatically handled by the framework.

It also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop.

Hadoop YARN

ResourceManager:

It is a clustering platform that helps to manage and allocate resources to applications and schedule tasks.

ApplicationMasters:

Responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to execute and monitor the tasks.

NodeManager:

It takes instructions from the ResourceManager and manages resources on its own node.
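
How the three roles interact can be sketched with a toy model in Python. It does not use any YARN API: the class names, the memory-only resource model, and the first-fit allocation policy are all assumptions made for illustration, and real YARN schedulers are considerably more involved.

```python
class ToyNodeManager:
    """Tracks free memory on one node and records the tasks it runs."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.running = []

    def reserve(self, memory_mb):
        self.free_mb -= memory_mb

    def launch(self, app_id, task_id):
        self.running.append((app_id, task_id))


class ToyResourceManager:
    """Decides which node, if any, can satisfy a resource request."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        # First-fit policy: an assumption made for this sketch.
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                nm.reserve(memory_mb)
                return nm
        return None  # no node has enough free memory


class ToyApplicationMaster:
    """Negotiates resources, then asks the chosen node to run each task."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def run_tasks(self, num_tasks, memory_mb):
        placements = []
        for task_id in range(num_tasks):
            nm = self.rm.allocate(memory_mb)
            if nm is None:
                placements.append(None)  # request could not be satisfied
            else:
                nm.launch(self.app_id, task_id)
                placements.append(nm.name)
        return placements


node_managers = [ToyNodeManager("node1", 2048), ToyNodeManager("node2", 1024)]
rm = ToyResourceManager(node_managers)
am = ToyApplicationMaster("app-1", rm)
placements = am.run_tasks(num_tasks=4, memory_mb=1024)
print(placements)
```

In this run, the fourth request is refused because both nodes are full, showing that the ApplicationMaster only gets what the ResourceManager is able to grant.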

How Does Hadoop Work?

Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB or 64 MB.

These files are then distributed across various cluster nodes for further processing, supervised by HDFS.

Blocks are replicated for handling hardware failure.

Checking that the code was executed successfully.

Performing the sort that takes place between the map and reduce stages.

Sending the sorted data to a certain computer.

Writing the debugging logs for each job.
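
The block division and replication steps above reduce to simple arithmetic, sketched below in Python. The function name block_plan is hypothetical, and the replication factor of 3 is an assumption not stated in these slides (it is the usual HDFS default).

```python
MB = 1024 * 1024


def block_plan(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate how many blocks and block replicas a file produces."""
    # Integer ceiling division: a partly filled last block still counts as a block.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    return {
        "blocks": num_blocks,
        "replicas": num_blocks * replication,
        # The last block only occupies its actual size, so raw storage
        # is the file size times the replication factor.
        "raw_storage_bytes": file_size_bytes * replication,
    }


plan_128 = block_plan(1000 * MB)
plan_64 = block_plan(1000 * MB, block_size_bytes=64 * MB)
print(plan_128)
print(plan_64)
```

For a 1000 MB file, a 128 MB block size gives 8 blocks and a 64 MB block size gives 16, so the larger block size halves the number of blocks the namenode has to track.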

Applications of Hadoop

Black Box Data

Social Media Data

Stock Exchange Data

Transport Data

Search Engine Data

Prominent users of Hadoop

The Yahoo! Search Webmap is a Hadoop application that runs on a large Linux cluster.

In 2010, Facebook claimed that they had the largest Hadoop cluster in the world.

The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of data into 11 million PDFs in a day at a computation cost of about $240.


Advantages of Hadoop

Hadoop is open source and, being Java-based, is compatible with all platforms.

Hadoop does not rely on hardware to provide fault tolerance and high availability.

Servers can be added to or removed from the cluster dynamically without interruption.

Hadoop efficiently utilizes the underlying parallelism of the CPU cores in distributed systems.

References:

www.tutorialspoint.com/hadoop/

https://en.wikipedia.org/wiki/Apache_Hadoop

https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html

https://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/

http://saphanatutorial.com/how-yarn-overcomes-mapreduce-limitations-in-hadoop-2-0/

