Understanding HDFS

Transcript of Understanding HDFS

Page 1: Understanding HDFS

HDFS (Hadoop Distributed File System)

Thiru

Page 2: Understanding HDFS

Agenda

Typical workflow

Writing a file into HDFS

Reading a file from HDFS

Rack awareness

Planning for a cluster

Q & A

Page 3: Understanding HDFS

Hadoop Server Roles

[Diagram: the masters are the Name Node and Secondary Name Node (HDFS) and the Job Tracker (MapReduce); the slaves are many worker nodes, each running a Data Node paired with a Task Tracker. A client talks to the Name Node for HDFS and to the Job Tracker for MapReduce.]

Page 4: Understanding hdfs

Hadoop Cluster Name

NodeJob

TrackerSecondary

NNHadoop Client

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

Hadoop Client

Page 5: Understanding HDFS

Sample HDFS Workflow

Write data into the cluster (HDFS)

Analyze the data (MapReduce)

Store the result in the cluster (HDFS)

Read the result from the cluster (HDFS)

Sample scenario: How many times did customers call customer care enquiring about a recently launched product? Compare that against the ad campaign on television, correlate the two, and find the best time to run the ad.

[Diagram: CRM data entry → Sqoop → HDFS → MapReduce → Result.]

Page 6: Understanding hdfs

Write data in to cluster (HDFS

Hadoop Client

File size 200 MB

I want to write file

Name Node

Ok! Block size is 64 MB. Split the file in to 3 and write in to node 1,4,5

Data Node 1

Data Node 2

Data Node 3

Data Node 4

Data Node 5

Data Node 6

Client Consults

Name node

Client write data to one data node

Data node replicates

as per replication factor and intimates

Name node

Cycle repeats for every block
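This write path can also be driven from the Java FileSystem API. A minimal sketch, assuming a Name Node reachable at hdfs://namenode:8020 and a hypothetical path /data/file.txt (both placeholders):

    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // create() consults the Name Node for block allocations; the
            // stream then writes to Data Nodes directly, block by block.
            try (OutputStream out = fs.create(new Path("/data/file.txt"))) {
                out.write("hello hdfs".getBytes("UTF-8"));
            } // close() completes the file on the Name Node
            fs.close();
        }
    }

Note that the file data never passes through the Name Node; it only hands out block locations.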

Page 7: Understanding HDFS

Rack Awareness

Never lose data when a rack goes down.

Keep bulky flows within a rack when possible.

Assumption: in-rack communication has higher bandwidth and lower latency.

[Diagram: twelve data nodes spread across racks. The Name Node keeps a rack-awareness map, e.g. Rack 1: data nodes 1 and 2; Rack 2: data node 5. Replicas of blocks A, B, and C are placed so that no single rack holds every copy of a block.]
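To make the rack-aware placement rule concrete, here is an illustrative plain-Java sketch (hypothetical names and structures, not Hadoop's real placement classes) of the default strategy: first replica on the writer's node, second and third on two different nodes of another rack:

    import java.util.List;
    import java.util.Map;

    public class PlacementSketch {
        // Hypothetical topology: rack path -> data nodes in that rack.
        static Map<String, List<String>> topology = Map.of(
                "/rack1", List.of("dn1", "dn2", "dn3"),
                "/rack2", List.of("dn4", "dn5", "dn6"));

        static String rackOf(String node) {
            for (Map.Entry<String, List<String>> e : topology.entrySet())
                if (e.getValue().contains(node)) return e.getKey();
            throw new IllegalArgumentException(node);
        }

        // Default rule: 1st replica on the writer's node, 2nd on a
        // different rack, 3rd on that second rack but another node.
        static List<String> placeReplicas(String writer) {
            String firstRack = rackOf(writer);
            String otherRack = topology.keySet().stream()
                    .filter(r -> !r.equals(firstRack)).findFirst().orElseThrow();
            List<String> remote = topology.get(otherRack);
            return List.of(writer, remote.get(0), remote.get(1));
        }

        public static void main(String[] args) {
            System.out.println(placeReplicas("dn2")); // [dn2, dn4, dn5]
        }
    }

This keeps two of the three copies on one rack (so the bulky replication flow stays in-rack) while still surviving the loss of either rack.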

Page 8: Understanding HDFS

Multi-Block Replication

[Diagram: the Hadoop client writes File.txt (200 MB) as blocks A, B, and C. For block A the Name Node answers "replicate to nodes 3 and 8", so copies of A land on DN 3 and DN 8 on other racks; blocks B and C are placed the same way.]
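The replication factor itself is a per-file setting that a client can change. A minimal sketch, again with a hypothetical cluster URI and path; FileSystem.setReplication is the real API call, the rest is placeholder:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), new Configuration());
            // Ask the Name Node to keep 3 copies of every block of this
            // file; Data Nodes then replicate or delete blocks to match.
            fs.setReplication(new Path("/data/file.txt"), (short) 3);
            fs.close();
        }
    }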

Page 9: Understanding HDFS

Name Node

Data nodes send heartbeats; every 10th heartbeat is a block report. The Name Node builds its metadata from block reports.

If the Name Node is down, HDFS is down. Missing heartbeats signify lost nodes.

The Name Node consults its metadata and finds the affected data.

The Name Node consults the rack-awareness script.

The Name Node tells data nodes to re-replicate the affected blocks.
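The bookkeeping behind this can be pictured with an illustrative sketch (plain Java with made-up structures, not the Name Node's actual internals): block reports build a block-to-node map, and a lost node triggers a scan for under-replicated blocks:

    import java.util.*;

    public class NameNodeSketch {
        // Metadata built from block reports: block id -> nodes holding it.
        static Map<String, Set<String>> blockMap = new HashMap<>();
        static final int REPLICATION = 3;

        static void blockReport(String node, List<String> blocks) {
            for (String b : blocks)
                blockMap.computeIfAbsent(b, k -> new HashSet<>()).add(node);
        }

        // Called when heartbeats from a node go missing.
        static void nodeLost(String node) {
            for (Map.Entry<String, Set<String>> e : blockMap.entrySet()) {
                e.getValue().remove(node);
                if (e.getValue().size() < REPLICATION)
                    System.out.println("re-replicate block " + e.getKey()
                            + " from " + e.getValue());
            }
        }

        public static void main(String[] args) {
            blockReport("dn1", List.of("A", "B"));
            blockReport("dn4", List.of("A"));
            blockReport("dn5", List.of("A", "B"));
            blockReport("dn6", List.of("B"));
            nodeLost("dn4"); // block A is now under-replicated and reported
        }
    }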

Page 10: Understanding HDFS

Name Node & Secondary Name Node

Not a hot standby for the Name Node* (ZooKeeper-based failover is planned; see Future Work).

Connects to the Name Node every hour* (configurable).

Does housekeeping and backs up the Name Node metadata.

The saved metadata can be used to rebuild the Name Node.

[Diagram: the Secondary Name Node tells the Primary Name Node "It's been 1 hr, give me your data" and pulls a copy of the file-system metadata, e.g. File.txt = A0 {1,5,7}, A1 {1,7,9}, A2 {5,10,15}.]
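The one-hour interval is configurable. A minimal sketch, assuming the Hadoop 2.x property name dfs.namenode.checkpoint.period (older releases used fs.checkpoint.period):

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Checkpoint every 3600 seconds; a smaller value bounds how
            // much edit log must be replayed after a Name Node crash.
            conf.set("dfs.namenode.checkpoint.period", "3600");
            System.out.println(conf.get("dfs.namenode.checkpoint.period"));
        }
    }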

Page 11: Understanding HDFS

Understanding Secondary Name Node Housekeeping

[Diagram: the checkpoint cycle. 1. The primary rolls its edit log, directing new edits to edits-new. 2. The secondary fetches fsimage and edits from the primary. 3. The secondary merges them into Fsimage.ckpt. 4. The secondary ships Fsimage.ckpt back; the primary adopts it as the new fsimage and renames edits-new to edits.]

Page 12: Understanding HDFS

Reading data from the HDFS cluster

Hadoop Client: "I want to read file.txt."

Name Node: "OK! File.txt = block A {1,5,6}, block B {8,1,2}, block C {5,8,9}."

[Diagram: nine data nodes holding the replicas of blocks A, B, and C.]

Client consults the Name Node.

Client receives a data-node list for each block.

Client picks the first node on each list.

Client reads the data sequentially.
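A minimal read sketch using the same Java FileSystem API, with the same hypothetical cluster URI and path as before; open() fetches the block list from the Name Node, and the stream then pulls each block from a Data Node:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // The returned stream reads block after block, contacting the
            // first (closest) Data Node on each block's location list.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    fs.open(new Path("/data/file.txt")), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null)
                    System.out.println(line);
            }
            fs.close();
        }
    }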

Page 13: Understanding HDFS

Choosing the Right Hardware

Master node:

Single point of failure.

Dual power supplies for redundancy.

No commodity hardware.

Regular data backups.

RAM thumb rule: 1 GB per million blocks of data.

Tasks per node: 1 core can run 1.5 mappers or reducers.
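A worked example of the two thumb rules above, using made-up cluster numbers:

    public class SizingRules {
        public static void main(String[] args) {
            // RAM thumb rule: 1 GB of Name Node RAM per million blocks.
            long blocks = 50_000_000L;          // hypothetical block count
            long ramGb = blocks / 1_000_000L;   // -> 50 GB
            // Task rule: 1 core runs about 1.5 mappers or reducers.
            int cores = 8;                      // per slave node
            double tasks = cores * 1.5;         // -> 12 task slots
            System.out.println(ramGb + " GB Name Node RAM, "
                    + (int) tasks + " tasks per node");
        }
    }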

Page 14: Understanding HDFS

Practice at Yahoo!

Page 15: Understanding HDFS

HDFS clusters at Yahoo! include about 3,500 nodes.

A typical cluster node has:

2 quad-core Xeon processors @ 2.5 GHz
Red Hat Enterprise Linux Server Release 5.1
Sun Java JDK 1.6.0_13-b03
4 directly attached SATA drives (one terabyte each)
16 GB RAM
1-gigabit Ethernet

Page 16: Understanding HDFS

70 percent of the disk space is allocated to HDFS. The remainder is reserved for the operating system (Red Hat Linux), logs, and space to spill the output of map tasks. (MapReduce intermediate data are not stored in HDFS.)

For each cluster, the NameNode and the BackupNode hosts are specially provisioned with up to 64 GB RAM; application tasks are never assigned to those hosts.

In total, a cluster of 3,500 nodes has 9.8 PB of storage available as blocks that are replicated three times, yielding a net 3.3 PB of storage for user applications. (3,500 nodes × 4 TB is 14 PB raw; 70 percent of that is 9.8 PB; dividing by the replication factor of 3 gives 3.3 PB.) As a convenient approximation, one thousand nodes represent one PB of application storage.

Page 17: Understanding HDFS

Durability of Data

Uncorrelated node failures: replicating data three times is a robust guard against loss of data due to uncorrelated node failures.

Correlated node failures (the failure of a rack or core switch): HDFS can tolerate losing a rack switch, since each block has a replica on some other rack.

Loss of electrical power to the cluster: a large cluster will lose a handful of blocks during a power-on restart.

Page 18: Understanding HDFS

Benchmarks

Page 19: Understanding HDFS

Benchmarks

NameNode throughput benchmark

Page 20: Understanding HDFS

Future Work

Automated failover. Plan: use ZooKeeper, Yahoo's distributed consensus technology, to build an automated failover solution.

Scalability of the NameNode. Solution: the near-term solution to scalability is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster. Drawback: the main drawback of multiple independent namespaces is the cost of managing them.

Page 21: Understanding HDFS

Thank you