Hadoop architecture meetup

Post on 24-Dec-2014

Hadoop Architecture

Agenda

• The different Hadoop daemons and their roles

• What does a Hadoop cluster look like

• Under the hood: how does it write a file

• Under the hood: how does it read a file

• Under the hood: how does it replicate a file

• Under the hood: how does it run a job

• How to balance an unbalanced Hadoop cluster

Hadoop – A bit of background

• It’s an open source project

• Based on two technical papers published by Google (the Google File System and MapReduce papers)

• A well known platform for distributed applications

• Easy to scale-out

• Works well with commodity hardware (not entirely true)

• Very good for background applications

Hadoop Architecture

• Two primary components:

• Distributed File System (HDFS): deals with file operations such as read, write and delete

• Map Reduce Engine: deals with parallel computation

Hadoop Distributed File System

• Runs on top of existing file system

• A file is broken into equal-sized blocks of a predefined size, and each block is stored individually

• Designed to handle very large files

• Not good for a huge number of small files
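The block splitting described above can be illustrated with a small sketch. It is not HDFS code: the 128 MB block size is an assumption (the real value is configurable per cluster), and `split_into_blocks` and `total_blocks` are hypothetical helper names. It also hints at why many small files are a poor fit: every file needs at least one block of its own, and the Name node has to track each of them.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB); configurable in a real cluster


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block only holds what is left
    return sizes


def total_blocks(file_sizes, block_size=BLOCK_SIZE):
    """Count the blocks needed to store every file in file_sizes."""
    return sum(len(split_into_blocks(size, block_size)) for size in file_sizes)


one_gb = 1024 * 1024 * 1024
print(len(split_into_blocks(one_gb)))          # one 1 GB file: 8 blocks
print(total_blocks([10 * 1024] * 100_000))     # 100,000 files of 10 KB: 100000 blocks
```

Both cases hold roughly 1 GB of data, but the second needs 100,000 block entries instead of 8, which is the overhead the last bullet warns about.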

Map Reduce Engine

• A Map Reduce Program consists of map and reduce functions

• A Map Reduce job is broken into tasks that run in parallel

• Prefers local processing if possible
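To make the map and reduce roles concrete, here is a minimal single-process sketch of a word count job. It only imitates the flow (map tasks per input split, a shuffle that groups values by key, then reduce) and does not use the Hadoop API; `map_fn`, `reduce_fn` and `run_job` are hypothetical names chosen for illustration. In a real cluster the map tasks would run in parallel, preferably on the nodes that already hold the input blocks.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(splits):
    """Run map, shuffle and reduce sequentially over a list of input splits."""
    # Map phase: one (simulated) task per split.
    intermediate = []
    for split in splits:
        for line in split:
            intermediate.extend(map_fn(line))

    # Shuffle phase: group all values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


splits = [["Hadoop stores data", "Hadoop runs jobs"], ["jobs read data"]]
print(run_job(splits))
```

Each inner list stands in for one input split, so the two splits above correspond to two map tasks.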

Hadoop Cluster

Typical Workflow

Cluster Balancing

Quiz

• If you write a 1 TB file into HDFS with a replication factor of 2, how much storage does HDFS actually need to store this file?

• True/False? Even if the Name node goes down, I will still be able to read files from HDFS.

Quiz

• True/False? In a Hadoop cluster, we can have a secondary Job Tracker to enhance fault tolerance.

• True/False? If the Job Tracker goes down, you will not be able to write any files into HDFS.

Quiz

• True/False? The Name node stores the actual data itself.

• True/False? The Name node can be rebuilt using the secondary Name node.

• True/False? If a data node goes down, Hadoop takes care of re-replicating the affected data blocks.

Quiz

• In which scenario does one data node try to read data from another data node?

• What are the benefits of the Name node's rack awareness?

• True/False? HDFS is well suited for applications that write a huge number of small files.

Quiz

• True/False? Hadoop takes care of balancing the cluster automatically.

• True/False? The output of Map tasks is written to an HDFS file.

• True/False? The output of Reduce tasks is written to an HDFS file.

Quiz

• True/False? In a production cluster, commodity hardware can be used to set up the Name node.

Thank You