Transcript of Hadoop

Page 1: Hadoop

BY ANKIT PRASAD

CSE 3RD YEAR

NSEC

Page 2: Hadoop

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.

Big Data involves huge volume, high velocity, and an ever-growing variety of data.

Page 3: Hadoop

Classification of Big Data

The data comes in three types:

Structured data: relational data.

Semi-structured data: XML data.

Unstructured data: Word, PDF, Text, Media Logs.

Page 4: Hadoop

Big Data Challenges

The major challenges associated with Big Data are:

Capturing data

Storage

Searching

Sharing

Transfer

Analysis

Presentation

Page 5: Hadoop

Google's Solution: MapReduce

MapReduce is a parallel programming model for writing distributed applications.

It can efficiently process multi-terabyte datasets.

It runs on large clusters of commodity hardware in a reliable, fault-tolerant manner.

Page 7: Hadoop

Introduction to Hadoop

Hadoop was developed by Doug Cutting.

Hadoop is an Apache open-source framework written in Java.

Hadoop allows distributed storage and processing of large datasets across clusters of computers.

Page 8: Hadoop

Hadoop Architecture

Hadoop has two major layers, namely:

Processing/computation layer (MapReduce)

Storage layer (Hadoop Distributed File System)

Other modules of the Hadoop framework include:

Hadoop Common

Hadoop YARN (Yet Another Resource Negotiator)

Page 9: Hadoop

What is MapReduce?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

Map takes a set of data and breaks individual elements into tuples (key/value pairs).

Reduce takes Map's output as its input and combines those data tuples into a smaller set of tuples.

Page 10: Hadoop

Under the MapReduce model, the data processing primitives are called mappers and reducers.
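To make mappers and reducers concrete, here is a minimal word-count sketch in Hadoop's Java MapReduce API; the class names are illustrative, not part of the slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: breaks each input line into (word, 1) key/value pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit one tuple per word
        }
    }
}

// Reducer: combines the tuples grouped under each word into a single count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // smaller set of tuples
    }
}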

Page 11: Hadoop

MapReduce Algorithm

Hadoop initiates the Map stage by issuing mapping tasks to the appropriate servers in the cluster.

Map stage:

The input file or directory, stored in HDFS, is passed to the mapper function line by line.

The mapper processes the data and creates several small chunks of data (key/value pairs).

Hadoop monitors for task completion and then initiates the shuffle stage.

Page 12: Hadoop

Shuffle stage:

The framework groups the data from all mappers by key and splits the groups among the appropriate servers for the reduce stage.

Reduce stage:

The reducer processes the data coming from the mappers, producing a new set of output that is stored in HDFS.

The framework manages all the details of data passing and copying between the nodes in the cluster.
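The stages above are wired together by a driver program. Here is a minimal sketch using Hadoop's standard Job API, assuming the word-count mapper and reducer from the earlier sketch and taking the HDFS input and output paths as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // Map stage
        job.setReducerClass(WordCountReducer.class);  // Reduce stage; shuffle runs in between
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}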

Page 13: Hadoop

Hadoop Distributed File System

HDFS is based on the Google File System.

It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

It is suitable for applications with large datasets.

Files are stored in a redundant fashion to protect the system against data loss in case of failure.

Page 14: Hadoop

HDFS Architecture

Namenode:

It acts as a master server that manages the file system namespace.

It regulates clients' access to files.

Datanode:

These nodes manage the storage of the machines they run on.

They perform read/write and block operations as regulated by the namenode.

Page 15: Hadoop

Block:

A block is the minimum amount of data that HDFS can read or write.

Files are divided into one or more blocks.

Blocks are stored on individual datanodes.
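As an illustration of the namenode/datanode division of labor, here is a minimal client sketch using Hadoop's Java FileSystem API; the file path is hypothetical, and the configuration is assumed to point fs.defaultFS at the namenode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.defaultFS (the namenode address)
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        // The namenode allocates blocks; the bytes themselves are stored on datanodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS!");
        }
        // Reads look up block locations at the namenode, then stream from datanodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}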

Page 16: Hadoop

Hadoop Common

It provides essential services and basic processes, such as abstraction of the underlying operating system and its file system.

It assumes that hardware failures are common and should be automatically handled by the framework.

It also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop.

Page 17: Hadoop

Hadoop YARN

ResourceManager:

It is the cluster-wide platform that manages and allocates resources to applications and schedules their tasks.

ApplicationMaster (one per application):

It is responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to execute and monitor the tasks.

Page 18: Hadoop

NodeManager:

It takes instructions from the ResourceManager and manages resources on its own node.
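Seen from a client, these components can be queried directly. Here is a minimal sketch using Hadoop's YarnClient API to ask the ResourceManager for the live NodeManagers, assuming a reachable cluster configured via the usual YARN configuration files:

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());  // reads the ResourceManager address
        client.start();
        // The ResourceManager tracks every NodeManager and the resources it offers.
        for (NodeReport node : client.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        client.stop();
    }
}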

Page 19: Hadoop

How Does Hadoop Work?

Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB or 64 MB.

These files are then distributed across various cluster nodes for further processing, supervised by HDFS.

Blocks are replicated to handle hardware failure (a quick arithmetic sketch follows the list below).

Hadoop also performs the following tasks:

Checking that the code was executed successfully.

Page 20: Hadoop

Performing the sort that takes place between the map and reduce stages.

Sending the sorted data to a certain computer.

Writing the debugging logs for each job.
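As a quick, self-contained illustration of the block arithmetic behind the steps above, assuming a hypothetical 1 GiB file, 128 MB blocks, and HDFS's default replication factor of 3:

public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 1L << 30;      // hypothetical 1 GiB input file
        long blockSize = 128L << 20;   // 128 MB HDFS block size
        int replication = 3;           // HDFS default replication factor
        long blocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division
        System.out.println(blocks + " blocks, " + (blocks * replication) + " block replicas stored");
        // Prints: 8 blocks, 24 block replicas stored
    }
}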

Page 21: Hadoop

Applications of Hadoop

Black Box Data

Social Media Data

Stock Exchange Data

Transport Data

Search Engine Data

Page 22: Hadoop

Prominent Users of Hadoop

The Yahoo! Search Webmap is a Hadoop application that runs on a large Linux cluster.

In 2010, Facebook claimed that it had the largest Hadoop cluster in the world.

The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of data into 11 million PDFs in a day, at a computation cost of about $240.

Page 23: Hadoop

Advantages of Hadoop

Hadoop is open source and, since it is Java-based, compatible with all platforms.

Hadoop does not rely on hardware to provide fault tolerance and high availability; rather, the Hadoop library itself detects and handles failures at the application layer.

Servers can be added to or removed from the cluster dynamically without interruption.

Hadoop efficiently utilizes the underlying parallelism of the CPU cores in distributed systems.

Page 24: Hadoop

References:

https://www.tutorialspoint.com/hadoop/

https://en.wikipedia.org/wiki/Apache_Hadoop

https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html

https://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/

http://saphanatutorial.com/how-yarn-overcomes-mapreduce-limitations-in-hadoop-2-0/
