HADOOP DISTRIBUTED FILE SYSTEM
HDFS Reliability
Based on "The Hadoop Distributed File System"
K. Shvachko et al., MSST 2010
Michael Tsitrin
26/05/13
Topics
• Introduction
• HDFS Overview
  • Basics
  • Architecture
• Data reliability
  • Block replicas
• NameNode reliability
  • NameNode failure
  • Journal
  • Checkpoint
• Conclusion
INTRODUCTION
Introduction
• HDFS is a cloud-based file system that allows storage of large data sets on clusters of commodity hardware
• With a huge number of components, each having a non-trivial probability of failure, hardware failure is the norm rather than the exception
• The purpose of this presentation is to present the techniques HDFS uses to keep the system and its data reliable
HDFS OVERVIEW
HDFS Basics
• An open-source implementation of a distributed file system, based on the Google File System
• Designed to store very large data sets reliably across large clusters of computers
• Optimized for MapReduce applications
  • Large files, some several GB in size
  • Reads are performed in a large streaming fashion
  • High throughput rather than low latency
HDFS Architecture
[Architecture diagram: clients issue metadata ops to a single NameNode, which keeps the metadata (name, replicas, …; e.g. /home/foo/data) and issues block ops to the DataNodes; clients read and write blocks directly to DataNodes, which are spread across racks (Rack 1, Rack 2) and replicate blocks between them.]
HDFS NameNode
• The HDFS NameNode keeps the metadata for each data block in the system
• Implemented as a single master server for a cluster
• To achieve high performance, the entire namespace is kept in RAM
• Manages the replication logic for the DataNodes
• Serves clients with file block locations for reads
• Metadata includes:
  • Files and directories hierarchy
  • Permissions, modification time, etc.
  • Mapping of file blocks to DataNodes
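A minimal sketch (not Hadoop source; class and field names are illustrative) of the two kinds of in-memory metadata described above: the namespace maps file paths to block IDs, and a separate map records which DataNodes currently hold each block, so the NameNode can answer client reads with block locations:

```python
# Conceptual model of NameNode metadata. The block-to-DataNode
# mapping is rebuilt from block reports rather than persisted.
class NameNodeMetadata:
    def __init__(self):
        self.files = {}            # path -> [block_id, ...]
        self.block_locations = {}  # block_id -> set of DataNode ids

    def add_file(self, path, block_ids):
        self.files[path] = list(block_ids)
        for b in block_ids:
            self.block_locations.setdefault(b, set())

    def record_replica(self, block_id, datanode):
        # Learned from a DataNode's block report.
        self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Serve a client read: each block ID with the DataNodes holding it."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path]]
```

Because everything here lives in RAM, lookups are fast, but the structure is lost on a crash unless it is made persistent (the checkpoint and journal discussed later).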
HDFS DataNode
• A cluster can contain thousands of DataNodes
• The DataNode is where the actual file blocks are kept
• User data is divided into blocks and replicated across DataNodes
• A DataNode identifies the block replicas in its possession to the NameNode by sending a block report
• DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode
DATA RELIABILITY
Block replicas
NameNode & Data Replication
• All data-replication information is stored and managed by the NameNode
• The NameNode makes all decisions regarding replication of blocks
• It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster
  • A Blockreport contains a list of all blocks on a DataNode
  • Receipt of a Heartbeat implies that the DataNode is functioning properly
  • DataNodes without a recent Heartbeat are marked as dead
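The heartbeat bookkeeping above can be sketched as follows (a simplified model, not Hadoop's implementation; the timeout value and method names are assumptions): each heartbeat refreshes a per-node timestamp, and any node whose last heartbeat is older than the timeout is reported as dead, which is what triggers re-replication of its blocks:

```python
# Simplified NameNode-side heartbeat tracking.
class HeartbeatMonitor:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # DataNode id -> time of last heartbeat

    def heartbeat(self, datanode, now):
        # Called whenever a Heartbeat message arrives.
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        # Nodes silent for longer than the timeout are considered dead.
        return sorted(dn for dn, t in self.last_seen.items()
                      if now - t > self.timeout)
```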
Re-replication
• The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary
• The need for re-replication may arise for many reasons:
  • a DataNode may become unavailable
  • the replication factor of a file may be increased
  • a replica may become corrupted
  • a hard disk on a DataNode may fail
• Re-replication is fast because it is a parallel problem that scales with the size of the cluster
• This lowers the probability of block loss while re-replication is carried out
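One natural refinement, sketched below, is to re-replicate the most endangered blocks first: a block down to one live replica is closer to being lost than one missing a single replica. (HDFS maintains priority queues for this; the simple sort here is an illustrative stand-in, not Hadoop's code.)

```python
# Order under-replicated blocks by how endangered they are.
def rereplication_order(blocks, target_factor):
    """blocks: dict of block_id -> number of live replicas.
    Returns under-replicated block IDs, fewest live replicas first."""
    under = {b: n for b, n in blocks.items() if n < target_factor}
    return sorted(under, key=lambda b: (under[b], b))
```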
Replica placement
• To protect against rack failure (as in a power outage), the NameNode can manage replicas so that they are stored in different racks
• Besides data reliability, this can also improve network bandwidth utilization and clients' latency
• Common case (replication factor == 3):
  • Put one replica on one node in the local rack
  • Another on a different node in the local rack
  • The last on a different node in a different rack
• This doesn't compromise data reliability and availability
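The three-replica policy above can be sketched as a small placement function (a conceptual sketch under the slide's description, not HDFS's actual placement code; the rack/node naming is hypothetical): two replicas land on distinct nodes in the writer's local rack, and the third on a node in a different rack, so a single rack failure never loses all copies:

```python
# Rack-aware placement for replication factor 3, per the policy above.
def place_replicas(local_rack, racks):
    """racks: dict of rack name -> list of node names.
    Returns three (rack, node) placements."""
    local_nodes = racks[local_rack]
    picks = [
        (local_rack, local_nodes[0]),   # first replica: local rack
        (local_rack, local_nodes[1]),   # second: different node, same rack
    ]
    # Third replica: any node on a different rack.
    remote_rack = next(r for r in sorted(racks) if r != local_rack)
    picks.append((remote_rack, racks[remote_rack][0]))
    return picks
```

Keeping two replicas in the local rack keeps most write traffic within one rack switch, which is the bandwidth benefit the slide mentions.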
NAMENODE RELIABILITY
NameNode Failure
• The NameNode is a single point of failure for an HDFS cluster
  • If it becomes unavailable to clients, the whole cluster is unusable
  • Corruption / loss of its metadata makes the data blocks unavailable
• The NameNode keeps its data in RAM – full data loss in case of a power outage
  • A persistent solution is needed
NameNode Persistence
• The persistent record of the image stored in the local host's native file system is called a checkpoint
• The NameNode also stores the modification log of the image, called the journal, in the local host's native file system
• For improved durability, redundant copies of the checkpoint and journal can be made at other servers
Journal
• The journal persistently records every change that occurs to file system metadata (not including the block mapping)
• Implemented as a write-ahead commit log for changes to the file system that must be persistent
  • To avoid becoming a bottleneck, several transactions are batched and committed together
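The batching idea can be sketched like this (a conceptual model, not Hadoop's edit-log code; the batch-size trigger and names are assumptions): transactions accumulate in memory and are made durable together in one write, amortizing the cost of each durable commit:

```python
# Write-ahead journal with batched commits.
class Journal:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # transactions logged but not yet durable
        self.durable = []   # stands in for the on-disk log
        self.flushes = 0    # number of durable writes performed

    def log(self, txn):
        self.pending.append(txn)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # One durable write covers the whole batch.
            self.durable.extend(self.pending)
            self.pending.clear()
            self.flushes += 1
```

With a batch size of 3, seven transactions cost only two durable writes until the final partial batch is flushed, which is the bottleneck-avoidance the slide describes.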
Checkpoint
• A checkpoint is a persistent record of the NameNode's state written to disk
• The checkpoint file is never changed by the NameNode
  • Either a new checkpoint is created, or a namespace is loaded from a previous checkpoint by the NameNode
• When the NameNode starts, it performs the checkpoint process:
  • reads the current checkpoint and journal from disk
  • applies all the transactions from the journal to the in-memory representation of the namespace
  • flushes out this new version into a new checkpoint on disk
  • truncates the old journal
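The startup steps above can be sketched end to end (a toy model, not Hadoop code: the namespace is a dict of paths and the journal a list of `(op, path)` transactions): load the checkpoint, replay the journal onto it, persist the merged state as the new checkpoint, and truncate the journal:

```python
# Checkpoint process performed at NameNode startup.
def startup(checkpoint, journal):
    namespace = dict(checkpoint)      # read the current checkpoint
    for op, path in journal:          # apply journal transactions in order
        if op == "create":
            namespace[path] = {}
        elif op == "delete":
            namespace.pop(path, None)
    new_checkpoint = dict(namespace)  # flush a new checkpoint "to disk"
    journal.clear()                   # truncate the old journal
    return namespace, new_checkpoint
```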
Creating a Checkpoint
• A new checkpoint file can be created at startup only, or periodically
• Creating a checkpoint allows emptying the journal:
  • A long journal increases the probability of loss or corruption of the journal file
  • A very large journal extends the time required to restart the NameNode
• To create periodic checkpoints, a dedicated server is required (Checkpoint Node)
  • since it has the same memory requirements as the NameNode
CONCLUSION
Conclusion
• HDFS has a good reliability model, which can handle the expected hardware failures
• While several techniques are in use to achieve namespace fault tolerance, the NameNode is still a single point of failure in the system
• Many reliability parameters are configurable and can be changed to fit system demands:
  • Replica count
  • Rack-scattering policy
  • Checkpoint and journal redundancy