
Frontiers of Information Technology & Electronic Engineering

www.zju.edu.cn/jzus; engineering.cae.cn; www.springerlink.com

ISSN 2095-9184 (print); ISSN 2095-9230 (online)

E-mail: [email protected]

Dr.Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal!

Dipayan DEV‡, Ripon PATGIRI (Department of Computer Science and Engineering, NIT Silchar, India)

E-mail: [email protected]; [email protected]

Received Jan. 12, 2015; Revision accepted June 11, 2015; Crosschecked

Abstract: In this Exabyte-scale era, data increase at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework to deal with Big Data, but because of this huge growth of metadata, the efficiency of Hadoop has been questioned numerous times by many researchers. Therefore, it is essential to create an efficient and scalable metadata management scheme for Hadoop. Hash-based mapping and subtree partitioning are suitable for distributed metadata management schemes. Subtree partitioning does not uniformly distribute the workload among the metadata servers, and metadata need to be migrated to keep the load roughly balanced. Hash-based mapping suffers from a constraint on the locality of metadata, though it uniformly distributes the load among NameNodes, the metadata servers of Hadoop. In this paper, we present a circular metadata management mechanism named Dynamic Circular Metadata Splitting (DCMS). DCMS preserves metadata locality using consistent hashing as well as locality-preserving hashing, keeps replicated metadata for excellent reliability, and dynamically distributes metadata among the NameNodes to maintain load balancing. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, and its failure therefore constitutes a single point of failure (SPOF). DCMS removes Hadoop's SPOF and provides an efficient and scalable metadata management. The new framework is named 'Dr.Hadoop' after the names of the authors.

Key words: Hadoop, NameNode, Metadata, Locality-preserving hashing, Consistent hashing
doi:10.1631/FITEE.1500015    Document code: A    CLC number:

1 Introduction

Big Data is a recent buzz in the Internet world, and it is growing louder with every passing day. Facebook holds almost 21 PB in 200 M objects (Beaver et al., 2010), whereas Jaguar ORNL has more than 5 PB of data. These stored data are growing so rapidly that EB-scale storage systems are likely to be in use by 2018–2020, and by that time there should be more than one thousand 10 PB deployments. Hadoop (Hadoop, 2012) has its own file system, the Hadoop Distributed File System (HDFS) (Shvachko et al., 2010; Hadoop, 2007; Dev and Patgiri, 2014), which is one of the most widely used large-scale distributed file systems, like GFS (Ghemawat et al., 2003) and Ceph (Weil et al., 2006), and which processes "Big Data" efficiently.

‡ Corresponding author
© Zhejiang University and Springer-Verlag Berlin Heidelberg 2015

HDFS separates file system data access and metadata transactions to achieve better performance and scalability. Application data are stored in various storage servers called DataNodes, whereas metadata are stored in one or more dedicated servers called NameNodes. The NameNode stores the global namespace, the directory hierarchy of HDFS, and other inode information. Clients access the data in the DataNodes through the NameNode(s) via the network. In HDFS, all the metadata have only one copy, stored in the NameNode. When the metadata server designated as NameNode in Hadoop fails, the whole Hadoop cluster goes offline for a period of time. This is termed the single point of failure (SPOF) (Hadoop, 2012) of Hadoop.

The NameNode mediates data transfers between a large number of clients and DataNodes, while not itself being responsible for storage and retrieval of data. Moreover, the NameNode keeps the entire namespace in its main memory, which is very likely to run out of memory as the metadata size increases. Considering all these factors, the centralized architecture of the NameNode in the Hadoop framework is a serious bottleneck and looks quite impractical. A practical system needs an efficient and scalable NameNode cluster to provide infinite scalability to the whole Hadoop framework.

The main problem in designing a NameNode cluster is how to partition the metadata efficiently among the NameNodes to provide high-performance metadata services (Brandt et al., 2003; Weil et al., 2004; Zhu et al., 2008). A typical metadata server cluster splits metadata among its members and tries to keep a proper load balance. In trying to achieve this while preserving good namespace locality, some servers become overloaded and others somewhat underloaded.

The size of metadata ranges from 0.1 to 1 percent of the whole data size stored (Miller et al., 2008). This fraction seems negligible, but when measured against EB-scale data (Raicu et al., 2011), the metadata become huge for storage in main memory, e.g., 1 PB to 10 PB for 1 EB of data. On the other hand, more than half of file system accesses are for metadata (Ousterhout et al., 1985). Therefore, an efficient and high-performance file system implies efficient and systematic metadata management (Dev and Patgiri, 2015). A better NameNode cluster management scheme should be designed and implemented to resolve all the serious bottlenecks of the file system. The metadata workload can be balanced by uniformly distributing the metadata over all the NameNodes in the cluster. Moreover, with the ever-growing metadata size, infinite scalability should also be achieved. The metadata of each NameNode server can be replicated to other NameNodes to provide better reliability and excellent failure tolerance.

We examine these issues and propose an efficient solution. As Hadoop is a "write once, read many" architecture, metadata consistency of the write operation (atomic operations) is beyond the scope of this paper.

We provide a modified model of Hadoop, termed Dr.Hadoop. 'Dr.' is an acronym of the authors (Dipayan, Ripon). Dr.Hadoop uses a novel NameNode server cluster architecture named dynamic circular metadata splitting (DCMS) that removes the SPOF of Hadoop. The cluster is highly scalable and uses a key-value store mechanism (Giuseppe et al., 2007; Lim et al., 2007; Robert et al., 2007; Biplob et al., 2010) that provides a simple interface, lookup(key), under write and read operations. In DCMS, locality-preserving hashing (LpH) is implemented for excellent namespace locality. Consistent hashing is used in addition to LpH, which provides a uniform distribution of metadata across the NameNode cluster. In DCMS, as the number of NameNodes in the cluster increases, the throughput of metadata operations does not decrease, and hence an increase in the metadata scale does not affect the throughput of the file system. To enhance the reliability of the cluster, DCMS also replicates each NameNode's metadata to different NameNodes.

We evaluate DCMS and its competing metadata management policies in terms of namespace locality. We compare Dr.Hadoop's performance with traditional Hadoop from multiple perspectives such as throughput, fault tolerance, and NameNode load. We demonstrate that DCMS is more productive than traditional state-of-the-art metadata management approaches for large-scale file systems.

2 Background

This section discusses the different types of metadata servers of distributed file systems (Haddad, 2000; White et al., 2001; Braam, 2007). The remainder of the section discusses the major techniques used in distributed metadata management in large-scale systems.

2.1 Metadata server cluster scale

2.1.1 Single metadata server

A framework having a single metadata server (MDS) simplifies the overall architecture to a vast extent and enables the server to make data placement and replica management relatively easy. But such a structure faces a serious bottleneck, which engenders the SPOF. Many distributed file systems (DFSs), such as Coda (Satyanarayanan et al., 1990), divide their namespace among multiple storage servers, thereby making all of the metadata operations decentralized. Other DFSs, e.g., GFS (Ghemawat et al., 2003), also use one MDS, with a failover server that takes over on the failure of the primary MDS. The application data and file system metadata are stored in different places in GFS: the metadata are stored in a dedicated server called the master, while data servers called chunkservers are used for storage of application data. Using a single MDS at a time is a serious bottleneck in such an architecture, as the numbers of clients and/or files/directories increase day by day.

2.1.2 Multiple metadata servers

A file system's metadata can be dynamically distributed over several MDSs. Expanding the MDS cluster provides high performance and avoids heavy load on any particular server within the cluster. Many DFSs are now working to provide a distributed approach to their MDSs instead of a centralized namespace. Ceph (Weil et al., 2006) uses clusters of MDSs and has a dynamic subtree partitioning algorithm (Weil et al., 2004) to evenly map the namespace tree to metadata servers. GFS is also moving to a distributed namespace approach (McKusick and Quinlan, 2009); the upcoming GFS will have more than a hundred MDSs with 100 million files per master server. Lustre (Braam, 2007), in its Lustre 2.2 release, uses a clustered namespace. The purpose of this is to map a directory over multiple MDSs, where each MDS contains a disjoint portion of the namespace.

2.2 Metadata organization techniques

A file system with cloud-scale data management (Cao et al., 2007) should have multiple MDSs to dynamically split metadata across them. Subtree partitioning and hash-based mapping are the two major techniques used for MDS clusters in large-scale file systems.

2.2.1 Hash-based mapping

Hash-based mapping (Corbett and Feitelson, 1996; Miller and Katz, 1997; Rodeh and Teperman, 2003) applies a hash function to a pathname or filename to track the file's metadata. This helps clients locate and discover the right MDS. Clients' requests are distributed evenly among the MDSs to reduce the load on any single MDS. Vesta (Corbett and Feitelson, 1996), Rama (Miller and Katz, 1997), and zFs (Rodeh and Teperman, 2003) use hashing of the pathname to retrieve the metadata. The hash-based technique achieves better load balancing and removes hot spots, e.g., popular directories.

2.2.2 Subtree partitioning

Static subtree partitioning (Nagle et al., 2004) is a simple approach for distributing metadata among MDS clusters. With this approach, the directory hierarchy is statically partitioned and each subtree is assigned to a particular MDS. It provides much better locality of reference and improved MDS independence compared to hash-based mapping. The major drawback of this approach is improper load balancing, which causes a system performance bottleneck. To cope with this, subtrees may be migrated in some cases (e.g., PanFS (Nagle et al., 2004)).

3 Dynamic circular metadata splitting (DCMS) protocol

The dynamic circular metadata splitting of the Dr.Hadoop framework is a circular arrangement of NameNodes. The internal architecture of a NameNode (metadata server), which is one element of the circular arrangement of DCMS, is shown in Fig. 1. The hashtable storage manager keeps a hashtable to store the key-value pairs of metadata. It also administers a SHA1-based (FIPS 180-1, 1995) storage system with two components, a replication engine and failover management. The replication engine implements the redundancy mechanisms and determines the throughput of metadata reads and writes in DCMS.

DCMS balances metadata across the servers using consistent hashing and achieves excellent namespace locality using LpH, which together provide highly efficient metadata cluster management. A well-designed LpH-based cluster needs only a minimum number of remote procedure calls (RPCs) for read and write lookups.

All these features and approaches are described in the subsequent sections of this paper.


Fig. 1 The Skeleton of a NameNode Server in DCMS

3.1 DCMS features

DCMS simplifies the design of Hadoop and the NameNode cluster architecture and removes various bottlenecks by addressing the following difficult problems:

1. Load balance: DCMS uses a consistent hashing approach, spreading keys uniformly over all the NameNodes; this contributes a high degree of load balance.

2. Decentralization: DCMS is fully distributed: all the nodes in the cluster have equal importance. This improves robustness and makes DCMS appropriate for loosely organized metadata lookup.

3. Scalability: DCMS provides infinite scalability to the Dr.Hadoop framework. As the size of the metadata increases, only new nodes (NameNodes) need to be added to the cluster. No extra parameter modification is required to achieve this scaling.

4. Two-for-one: The left and right NameNodes of a NameNode keep the metadata of the middle one and behave as its "hot standby" (Hot, 2012).

5. Availability: DCMS uses the ResourceManager to keep track of all newly added NameNodes as well as failed ones. During a failure, the left and right neighbors of the failed NameNode in DCMS can serve lookups on its behalf, so there is no downtime in DCMS. The failed node is replaced by electing a DataNode as NameNode using a suitable election algorithm; if recovered later, the failed NameNode serves as a DataNode.

6. Uniform key size: DCMS puts no restriction on the structure of the keys in the hashtable; the DCMS key space is flat. It uses SHA1 on each file or directory path to generate a unique identifier of k bytes. So even a huge file/directory pathname does not take more than k bytes of key.

7. Self-healing, self-managing and self-balancing: With DCMS, the administrator need not worry about increasing the number of NameNodes or replacing a NameNode. DCMS features self-healing, self-balancing and self-managing of metadata and metadata servers.

3.2 Overview

DCMS is an in-memory hashtable-based system, whose key-value pair is shown in Fig. 2. It is designed to meet the following five general goals:

1. High scalability of the NameNode cluster
2. High availability with suitable replication management
3. Excellent namespace locality
4. Fault tolerance capacity
5. Dynamic load balancing

Fig. 2 Key-value system. Key is SHA1(pathname), while value is its corresponding metadata

In the DCMS design, each NameNode stores one unique hashtable. Like other hash-based mappings, DCMS uses hashing to distribute the metadata across multiple NameNodes in the cluster. It maintains the directory hierarchy in such a way that the metadata of all the files in a common directory are stored in the same NameNode. For the distribution of metadata, it uses locality-preserving hashing (LpH), which is based on the pathname of each file or directory. The use of LpH in DCMS eliminates the overhead of directory hierarchy traversal. To access data, a client hashes the pathname of the file with the same locality-preserving hash function to locate the metadata server that contains the metadata of the file, and then contacts that metadata server. The process is extremely efficient, typically involving a single message to a single NameNode. DCMS applies the SHA1() hash function to the full path of each file or directory. SHA1 stands for "secure hash algorithm"; it generates a unique 20-byte identifier for every different input.

In HDFS, every file is divided into multiple chunks of a predefined size, called blocks. SHA1() is applied to each block to generate a unique 20-byte identifier, which is stored in the hashtable as the key. For example, the identifiers of the first and third blocks of the file "/home/dip/folder/fold/fold1/file.txt" are, respectively, SHA1(home/dip/folder/fold/fold1/file.txt, 0) and SHA1(home/dip/folder/fold/fold1/file.txt, 2).
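As an illustration only (this is not the paper's implementation), the following Java sketch shows how such a 20-byte block identifier could be computed with the standard MessageDigest API; the exact way the path and block index are concatenated before hashing is an assumption made for the example.

import java.security.MessageDigest;

/** Illustrative sketch: derives a DCMS-style block key following the
 *  SHA1(pathname, blockIndex) convention described above. The concatenation
 *  format is an assumption for illustration. */
public class BlockKeyExample {
    /** Returns the 160-bit identifier as a hexadecimal string. */
    static String blockKey(String absolutePath, int blockIndex) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest((absolutePath + "," + blockIndex).getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Keys of the first and third block of the example file in the text.
        System.out.println(blockKey("/home/dip/folder/fold/fold1/file.txt", 0));
        System.out.println(blockKey("/home/dip/folder/fold/fold1/file.txt", 2));
    }
}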

3.3 System architecture

In DCMS, each NameNode has equal priority, and hence the cluster exhibits purely decentralized behavior. The typical system architecture is shown in Fig. 3. DCMS is a cluster of NameNodes organized in a circular fashion. Each NameNode's hostname is denoted NameNode_X, and its neighbors are NameNode_(X − 1) and NameNode_(X + 1). The hash function, SHA1 in our case, is sufficiently random that when many keys are inserted, the nature of consistent hashing distributes those keys evenly across the NameNode servers. DCMS improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node in the cluster. Instead, each node in DCMS needs routing information for only two nodes, its left and right neighbors in the topological order, because each NameNode places a replica of its hashtable on its two neighbor servers. Replication management is discussed in the following section.

DCMS holds three primary data structures: namenode, fileinfo, and clusterinfo. These data structures are the building blocks of the DCMS metadata. namenode and clusterinfo are stored on the ResourceManager (JobTracker) of the typical Hadoop architecture, and fileinfo is stored on each NameNode, which handles the mapping from a file to its paths in the DataNodes.

Fig. 3 System architecture. Physical NameNode servers compose a metadata cluster to form a DCMS overlay network

/** File information: mapping of a file or directory path to its metadata */
public static Hashtable<FilePath, MetaData> fileInfo =
        new Hashtable<FilePath, MetaData>();

/** Cluster information: mapping of a nodeURL to its ClusterInfo object */
public static Hashtable<String, ClusterInfo> clusterinfo =
        new Hashtable<String, ClusterInfo>();

/** Data structure storing all the NameNode servers (hostname to URL) */
public Hashtable<NameNode_hostname, NamenodeURL> namenode =
        new Hashtable<NameNode_hostname, NamenodeURL>();

1. namenode: The namenode data structure stores the mapping from each NameNode's hostname to its NameNode URL. It is used to obtain the list of all the servers and to keep track of all the NameNodes in the DCMS cluster. The ResourceManager holds this data structure and tracks all the live NameNodes in the cluster through a heartbeat mechanism. It must update this mapping table whenever a new NameNode joins or leaves the DCMS cluster. The Dr.Hadoop framework can be reconfigured by reforming this data structure.

2. fileinfo: The fileinfo data structure, stored in each NameNode of the DCMS cluster, keeps the mapping from the SHA1() of a file or directory path to the corresponding locations where the data are stored in the DataNodes. The mapping is essentially the key-value store in the hashtable and locates the DataNodes on the second layer. The mapped fileinfo metadata are used to access the DataNodes of the cluster, where the files are actually stored.

3. clusterinfo: The clusterinfo data structure, stored in the ResourceManager, stores basic information about the cluster, such as storage size, available space and number of DataNodes. A simplified sketch of how these structures combine in a lookup is given below.
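The following simplified sketch (hypothetical, not the paper's code) shows how the namenode and fileinfo structures could be combined during a lookup: the ResourceManager's namenode table locates the server, and that server's fileinfo hashtable returns the metadata for a hashed path.

import java.util.Hashtable;

/** Simplified single-process sketch of a metadata lookup over the DCMS
 *  data structures; in the real system fileinfo lives on each NameNode and
 *  is reached over RPC. */
class DcmsLookupSketch {
    // namenode: NameNode hostname -> NameNode URL (held by the ResourceManager).
    Hashtable<String, String> namenode = new Hashtable<>();
    // fileinfo of one NameNode: SHA1(path) key -> metadata value.
    Hashtable<String, String> fileInfo = new Hashtable<>();

    /** Resolves the metadata of a key that hashes to the given NameNode. */
    String lookup(String nameNodeHostname, String pathKey) {
        String url = namenode.get(nameNodeHostname); // where the RPC would be sent
        if (url == null) {
            return null;                             // unknown or failed NameNode
        }
        return fileInfo.get(pathKey);                // key-value lookup on that server
    }
}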

3.4 Replication management

In Dr.Hadoop we replicate each NameNode's hashtable to other NameNodes. The idea behind the replication management of Dr.Hadoop in DCMS is that each replica starts from a particular state, receives some input (a key-value pair), moves to the same next state, and outputs the same result. In DCMS, all the replicas follow an identical metadata management principle. The input here means only the metadata write requests, because only file/directory write requests can change the state of a NameNode server; a read request needs only a lookup.

We designate the main hashtable as the primary, and the secondary hashtable replicas are placed in the left and right NameNodes of the primary. The primary is in charge of replicating all metadata updates and of preserving consistency with the other two. As shown in Fig. 4, NameNode_(X + 1) stores the primary hashtable HT_(X + 1), and its secondary replicas are stored in NameNode_X and NameNode_(X + 2), its left and right NameNodes respectively. These two nodes, hosting the in-memory replicas, behave as hot standbys for NameNode_(X + 1).

Fig. 4 Replication management of Dr.Hadoop in DCMS

The metadata requests are not processed concurrently in Dr.Hadoop, in order to preserve metadata consistency. When a primary processes a metadata write request, it first pushes the replicas to its secondaries before sending the acknowledgment back to the client.

Details about these operations are explained later. The secondary replica hashtables do not need to process any metadata write requests: clients send metadata write requests only to the primary hashtable, i.e., HT_(X + 1) in the figure. Read requests, however, are processed by any replica, either primary or secondary, i.e., HT_X or HT_(X + 2) in the figure, to increase the read throughput.
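As a sketch only, with hypothetical interfaces that are not part of Dr.Hadoop's code, the write path described above can be summarized as: update the primary hashtable, push the same key-value pair synchronously to the left and right neighbors, and only then acknowledge the client.

import java.util.Hashtable;

/** Illustrative sketch (hypothetical interfaces): a primary NameNode applies
 *  a metadata write to its own hashtable and synchronously pushes the same
 *  key-value pair to its left and right neighbours before acknowledging. */
class PrimaryNameNode {
    private final Hashtable<String, String> hashtable = new Hashtable<>();
    private final ReplicaLink left;   // neighbour NameNode_X
    private final ReplicaLink right;  // neighbour NameNode_(X + 2)

    PrimaryNameNode(ReplicaLink left, ReplicaLink right) {
        this.left = left;
        this.right = right;
    }

    /** Write path: local update, then synchronous replication, then ack. */
    boolean write(String key, String metadata) {
        hashtable.put(key, metadata);          // update primary hashtable
        left.replicate(key, metadata);         // secondary replica (left)
        right.replicate(key, metadata);        // secondary replica (right)
        return true;                           // acknowledgment to the client
    }

    /** Reads can be served by any replica; here the primary serves locally. */
    String read(String key) {
        return hashtable.get(key);
    }
}

/** Minimal stand-in for the RPC channel to a neighbouring NameNode. */
interface ReplicaLink {
    void replicate(String key, String metadata);
}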

4 Hashing approaches of DCMS

4.1 Consistent Hashing

The consistent hash function assigns each NameNode and each key (hashtable entry) an n-bit identifier using a base hash function such as SHA-1 (FIPS 180-1, 1995). A NameNode's identifier is chosen by hashing its IP address, while a key identifier for the hashtable is produced by hashing the key.

As explained in an earlier section, DCMS consists of a circular arrangement of NameNodes, and n bits are needed to specify an identifier. We apply consistent hashing in DCMS as follows: identifiers are ordered on an identifier circle modulo 2^n. Key k is assigned to NameNode_0 if its identifier maps to 0, and so on. This node is called the home node of key k, denoted home(k). If identifiers are numbered from 0 to 2^n − 1, then home(k) for an identifier of 0 is the first NameNode of the circle, with hostname NameNode_0.
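The home(k) mapping can be illustrated with the following sketch (an assumption-laden simplification, not the DCMS implementation): NameNode identifiers and key identifiers share one SHA-1 identifier space, and a key's home is the first NameNode identifier encountered clockwise on the circle.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustrative consistent-hash ring: maps a key to its home NameNode, i.e.
 *  the first NameNode identifier at or after the key's identifier modulo 2^160. */
class ConsistentRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    private static BigInteger sha1(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d);   // identifier in [0, 2^160)
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** A NameNode's identifier is the hash of its IP address (as in the text). */
    void addNameNode(String ipAddress, String hostname) {
        ring.put(sha1(ipAddress), hostname);
    }

    /** home(k): the first NameNode clockwise from the key's identifier. */
    String home(String key) {
        BigInteger id = sha1(key);
        SortedMap<BigInteger, String> tail = ring.tailMap(id);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}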

The following results, proven in the papers that introduced consistent hashing (Karger et al., 1997; Lewin, 1998), are the foundation of our architecture.

THEOREM 1. For any set of N nodes and K keys, with high probability:
1. Each node is responsible for at most (1 + ε)K/N keys.
2. When an (N + 1)st node joins or leaves the network, responsibility for O(K/N) keys changes hands (and only to or from the joining or leaving node).

With SHA1, ε can be reduced to a negligible value, providing a uniform K/N key distribution to each NameNode in DCMS.

We need to discuss the term "with high probability" mentioned in the theorem. As the NameNodes and keys in our model are chosen randomly, the probability distribution over the choices of keys is very likely to produce a balanced distribution.

But if an adversary chose all the keys so that they hash to an identical identifier (NameNode), the result would be a highly unbalanced distribution, destroying the load balancing property. The consistent hashing paper uses "k-universal hash functions" to guarantee load balancing even when someone uses non-random keys.

Instead of using a universal hash function, DCMS uses the standard SHA-1 function as the base hash function. This makes the hashing deterministic, so that the phrase "high probability" no longer applies. It should be noted that collisions in SHA-1 can occur, but with negligible probability.

4.1.1 Collision probability of SHA1

Consider random hash values with a uniform distribution, a hash function that produces n bits, and a collection of b different blocks. The probability p that one or more collisions occur among different blocks is given by

p ≤ (b(b − 1)/2) · (1/2^n).

For a huge storage system that contains a petabyte (10^15 bytes) of data, or an even larger system that contains an exabyte (10^18 bytes) stored as 8 KB blocks (10^14 blocks), using the SHA-1 hash function the probability of a collision is less than 10^−20. Such a scenario is sufficiently unlikely that we can ignore it and use the SHA-1 hash as a unique identifier for a block.
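For illustration only (not taken from the paper), the bound can be checked numerically for the exabyte case as follows.

/** Illustrative check: evaluates the collision bound p <= b(b-1)/2 * 1/2^n
 *  for b = 10^14 blocks and n = 160 (SHA-1). */
public class CollisionBound {
    public static void main(String[] args) {
        double b = 1e14;                        // number of 8 KB blocks in ~1 EB
        double pairs = b * (b - 1) / 2.0;       // number of distinct block pairs
        double space = Math.pow(2.0, 160);      // size of the SHA-1 output space
        System.out.println("collision probability bound = " + (pairs / space));
        // Prints roughly 3.4e-21, i.e. below the 10^-20 bound quoted above.
    }
}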

4.2 Locality-preserving Hashing

In EB-scale storage systems, we should achieve near-optimal namespace locality by assigning keys that are related through their full pathnames to the same NameNode.

To attain near-optimal locality, the entire directory tree nested under a point has to reside on the same NameNode. The clients distribute the metadata for files and directories over DCMS by computing a hash of the parent path present in the file operation. Using the parent path implies that the metadata for all the files in a given directory are present on the same NameNode. Algorithms 1 and 2 describe the creation and reading of a file by a client over DCMS.
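A minimal sketch (not the paper's code) of the client-side step is given below: the hash input for LpH is the parent directory of the path appearing in the file operation, so all files of one directory share the same home NameNode.

import java.nio.file.Paths;

/** Illustrative sketch: the client-side step of locality-preserving hashing,
 *  i.e. hashing the parent directory of the operated path so that every file
 *  of a directory maps to the same NameNode. */
public class LocalityHash {
    /** Returns the parent path used as hash input for LpH. */
    static String lphInput(String filePath) {
        // e.g. "/dir1/dir2/dir3/file.txt" -> "/dir1/dir2/dir3"
        return Paths.get(filePath).getParent().toString();
    }

    public static void main(String[] args) {
        String f1 = "/dir1/dir2/dir3/file.txt";
        String f2 = "/dir1/dir2/dir3/other.txt";
        // Both files share the same parent path, hence the same home NameNode.
        System.out.println(lphInput(f1).equals(lphInput(f2))); // true
    }
}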

Algorithms 1 and 2 show the write and read operations, respectively, in Dr.Hadoop on DCMS that preserve the locality-preserving hashing.

Algorithm 1 Write operation of Dr.Hadoop in DCMS
Require: Write query(filepath) from client
Ensure: File stored in DataNodes with proper replication and consistency
  for i ← 0 to m do
    hashtable(NameNode_i) ← hashtable_i
  end for
  while true do
    Client issues write(/dir1/dir2/dir3/file.txt)
    Perform hash of its parent path: file.getAbsolutePath()
    if client path hashes to NameNode_X then
      update(hashtable_X)
      update(hashtable_(X−1))
      update(hashtable_(X+1))
    end if
    Reply back to the client from NameNode_X for acknowledgement
  end while

Algorithm 2 Read operation of Dr.Hadoop in DCMS
Require: Read query(filepath) from client
Ensure: Read data from files stored in the DataNodes
  for i ← 0 to m do
    hashtable(NameNode_i) ← hashtable_i
  end for
  while true do
    Client issues read(/dir1/dir2/dir3/file.txt)
    Perform hash of its parent path: file.getAbsolutePath()
    if client path hashes to NameNode_X then
      parallel reading from hashtable_X, hashtable_(X−1), hashtable_(X+1)
    end if
    Perform getBlock()
    Perform getMoreElements()
    The client then communicates directly with the DataNodes to read the contents of the file
  end while

In Algorithm 1, the client sends the query as a file path or directory path to Dr.Hadoop, and Dr.Hadoop writes the data into the DataNodes as well as the metadata into the NameNodes of DCMS with proper replication and consistency.

As mentioned in Section 3.4, the hashtable of NameNode_X is denoted Hashtable_X. When a client issues a write operation for "/dir1/dir2/dir3/file.txt", it hashes the parent path, i.e., the hash of "/dir1/dir2/dir3", so that all the files belonging to that particular directory reside in the same NameNode. If the hash of the above query outputs 3, then the metadata of that file is stored in Hashtable_3. For synchronous one-copy replication, that metadata is replicated to Hashtable_2 and Hashtable_4. After the whole process is completed, Dr.Hadoop sends an acknowledgment to the client.

The read operation of Dr.Hadoop, portrayed in Algorithm 2 to preserve LpH, follows a trend similar to Algorithm 1. When the client issues a read operation on "/dir1/dir2/dir3/file.txt", it again hashes the parent directory path. If during the write this value was 3, it will now output the same value, so, as per the algorithm, the client automatically contacts NameNode_3 again, where the metadata actually resides. Now, as the metadata is stored in three consecutive hashtables, DCMS provides a parallel reading capability, i.e., from Hashtable_2, Hashtable_3 and Hashtable_4. The application performs getBlock() in parallel each time to get the metadata of each block of the file. This hugely increases the throughput of the read operation of Dr.Hadoop.

Fig. 5 shows the step-by-step operations performed when clients access files in Dr.Hadoop.

Fig. 5 Client accessing files on the Dr.Hadoop framework

Step 1: The client wishes to create a file named /dir1/dir2/filename. It computes a hash of the parent path, /dir1/dir2, to determine which NameNode it has to contact.

Step 2: Before returning the response to the client, the NameNode sends a replication request to its left and right topological NameNodes to perform the same operation.

Step 3: The NameNode looks up or adds a record in its hashtable, and the file is stored in or retrieved from the DataNodes.

Step 4: The NameNode sends the response back to the client.

5 Scalability and failure management

5.1 Node join

In a dynamic network like DCMS, new NameNodes join the cluster whenever the metadata size exceeds the combined main memory (considering the replication factor) of all the NameNodes. While doing so, the main challenge is to preserve the locality of every key k in the cluster. To achieve this, DCMS needs to ensure two things:

1. Each node's successor and predecessor pointers are properly maintained.

2. home(k) is responsible for every key k of DCMS.

This section discusses how to preserve these two properties while adding a new NameNode to the DCMS cluster.

For the join operation, each NameNode maintains a successor and a predecessor pointer. A NameNode's successor and predecessor pointers contain the DCMS identifier and IP address of the corresponding node. To preserve the above properties, DCMS must perform the following three operations when a node n, i.e., NameNode_N, joins the network (a sketch of these steps follows the list):

1. Initialize the successor and predecessor of NameNode_N.

2. Update these two pointers of NameNode_(N − 1) and NameNode_0.

3. Notify the ResourceManager to update its namenode data structure with the addition of NameNode_N.
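A sketch of these three steps, with hypothetical interfaces that are not part of Dr.Hadoop's code, is given below.

/** Illustrative sketch (hypothetical types): the three pointer/notification
 *  steps performed when NameNode_N joins the DCMS ring. */
class NodeJoin {
    static void join(NameNodeRef newNode, NameNodeRef predecessor,
                     NameNodeRef successor, ResourceManagerRef rm) {
        // 1. Initialize the successor and predecessor of NameNode_N.
        newNode.setSuccessor(successor);
        newNode.setPredecessor(predecessor);

        // 2. Update the pointers of its neighbours (NameNode_(N-1) and NameNode_0).
        predecessor.setSuccessor(newNode);
        successor.setPredecessor(newNode);

        // 3. Notify the ResourceManager so its namenode table tracks the new node.
        rm.registerNameNode(newNode.hostname(), newNode.url());
    }
}

/** Minimal stand-ins for the ring node and ResourceManager interfaces. */
interface NameNodeRef {
    void setSuccessor(NameNodeRef n);
    void setPredecessor(NameNodeRef n);
    String hostname();
    String url();
}
interface ResourceManagerRef {
    void registerNameNode(String hostname, String url);
}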

5.2 Transferring replicas

The last operation to be executed when NameNode_N joins DCMS is to shift the responsibility for holding the replicas (hashtables) of its new successor and predecessor. Typically, this involves moving the data associated with the keys to keep the hashtables in DCMS balanced for Dr.Hadoop.

Algorithm 3 and Fig. 6 provide a detailed overview of this operation.

Algorithm 3 The node join operation of Dr.Hadoop: addnew_NameNode_N()
Require: Random node from the cluster of DataNodes
Ensure: Add the Nth NameNode of DCMS
  Randomly choose one DataNode from the cluster as the Nth NameNode of DCMS
  while true do
    update_metadata(NameNode_N)
    update_metadata(NameNode_(N − 1))
    update_metadata(NameNode_0)
  end while
  while update_metadata(NameNode_N) do
    place(hashtable_(N − 1))
    place(hashtable_N)
    place(hashtable_0)
  end while
  while update_metadata(NameNode_(N − 1)) do
    hashtable_0.clear()
    place(hashtable_N)
  end while
  while update_metadata(NameNode_0) do
    hashtable_(N − 1).clear()
    place(hashtable_N)
  end while

5.3 Failure and replication

The number of NameNodes in the DCMS of Dr.Hadoop is not very large, so we use heartbeats (Aguilera et al., 1997) for failure detection. Each NameNode sends a heartbeat to the ResourceManager every 3 s. The ResourceManager detects a NameNode failure by checking every 200 s whether any node has not sent a heartbeat for at least 600 s. If a NameNode_N is declared failed, other NameNodes whose successor and predecessor pointers contain NameNode_N must adjust their pointers accordingly. In addition, the failure of the node must not be allowed to disrupt metadata operations in progress while the cluster re-stabilizes.
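A simplified sketch of the ResourceManager-side detector, assuming a hypothetical HeartbeatMonitor class that is not part of Hadoop or Dr.Hadoop, could look as follows.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch: NameNodes report every ~3 s; every 200 s the sweep
 *  marks as failed any NameNode silent for at least 600 s, as described above. */
class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** Called when a heartbeat RPC arrives from a NameNode. */
    void onHeartbeat(String nameNodeHost) {
        lastHeartbeat.put(nameNodeHost, System.currentTimeMillis());
    }

    /** Called by a timer every 200 s; returns the NameNodes declared failed. */
    java.util.List<String> sweep() {
        long now = System.currentTimeMillis();
        java.util.List<String> failed = new java.util.ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() >= 600_000L) {  // silent for >= 600 s
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}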

After a node fails, but before stabilization takes place, a client might send a lookup(key) request to this NameNode. Such lookups are generally processed after a timeout, but by another NameNode (its left or right neighbor), which acts as a hot standby for it despite the failure. This is possible because the hashtable has its replicas stored in the left and right nodes.

Fig. 6 Dynamic nature of Dr.Hadoop. (a) Initial setup of DCMS with half-filled RAM; (b) addition of a NameNode in DCMS when RAM gets filled up

6 Analysis

6.1 Design analysis

Let the total number of nodes in the cluster be n. In traditional Hadoop, there are only 1 primary NameNode, 1 ResourceManager and (n − 2) DataNodes to store the actual data. The single NameNode stores all the metadata of the framework. Hence it is a centralized NameNode, and the system cannot tolerate the crash of that single server.

In Dr.Hadoop, we use m NameNodes and distribute the metadata with replication factor r. So Dr.Hadoop can handle the crash of (r − 1) NameNode servers. In traditional Hadoop, 1 remote procedure call (RPC) is enough for the lookup, using a hash of the file/directory path, needed to create a file. Using Algorithm 1, Dr.Hadoop needs 3 RPCs to create a file and 1 RPC to read a file.

Dr.Hadoop uses m NameNodes to handle the read operation. The throughput of metadata reads is thus m times that of traditional Hadoop. Similarly, the throughput of metadata writes is m/r times that of traditional Hadoop (m/r because the metadata is distributed over m NameNodes but with the overhead of replication factor r). The whole analysis and comparison are tabulated in Table 1.

Table 1 Analytical comparison of traditional Hadoop and Dr.Hadoop

Parameter                                         Traditional Hadoop    Dr.Hadoop
Maximum number of NameNode crashes survivable     0                     r − 1
Number of RPCs needed for a read operation        1                     1
Number of RPCs needed for a write operation       1                     r
Metadata storage per NameNode                     X                     (X·r)/m
Throughput of metadata read                       X                     X·m
Throughput of metadata write                      X                     X·(m/r)

6.2 Failure analysis

In the failure analysis of Dr.Hadoop, NameNode failures are assumed to be independent of one another and equally distributed over time. We ignore the failures of DataNodes in this study because they have the same effect on both designs, i.e., Hadoop and Dr.Hadoop, which simplifies the failure analysis.

6.2.1 Traditional Hadoop

Mean time between failures (MTBF) is defined as the mean time in which a system is expected to fail. The probability that a NameNode server fails in a given time is thus

f = 1/MTBF.    (1)

Let R_Hadoop be the time to recover from the failure. The traditional Hadoop framework will then be unavailable for f·R_Hadoop of the time:

Unavailability = f·R_Hadoop.    (2)

If the NameNode fails once in a month and it takes 6 hours to recover from the failure, then

f = 1/(30 × 24) = 1.38 × 10^−3,
R_Hadoop = 6 h,
∴ Unavailability = f·R_Hadoop = 1.38 × 10^−3 × 6 = 8.28 × 10^−3
∴ Percentage of unavailability = 0.828%
∴ Availability = 100 − 0.828 ≈ 99.17%

6.2.2 Dr.Hadoop

In Dr.Hadoop, say there are m metadata servers used for DCMS, and let r be the replication factor used. Let f denote the probability that a given server fails in a given time t, and let R_Dr.Hadoop denote the time to recover from the failure.

Note that R_Dr.Hadoop will be approximately

R_Dr.Hadoop = r·R_Hadoop/m.    (3)

This is because the recovery time of metadata is directly proportional to the amount of metadata stored in the system, and, as per our assumption, the metadata is replicated r times.

Now, the probability of failure of any r consecutive NameNodes in DCMS can be calculated using the binomial distribution. Let P be the probability of this happening:

P = m·f^r·(1 − f)^(m−r).    (4)

Let there be 10 NameNodes, let each metadata item be replicated three times, i.e., r = 3, and let f be 0.1 in 3 days. Putting these values in the equation gives P = 0.47%.

In Dr.Hadoop, a portion of the system becomes offline if and only if r consecutive NameNodes of DCMS fail within the recovery time of one another. This situation happens with probability

F_Dr.Hadoop = m·f·(f·R_Dr.Hadoop/t)^(r−1)·(1 − f)^(m−r).    (5)

To compare the failure analysis with traditional Hadoop, we take the failure probability f to be much smaller than in the earlier calculation. Let f be 0.1 in 3 days, let m be 10 NameNodes, and for DCMS let r = 3.

R_Dr.Hadoop is calculated as

R_Dr.Hadoop = r·R_Hadoop/m = (3 × 6)/10 = 1.8 h.

So the recovery time of Dr.Hadoop is 1.8 h, while in the case of Hadoop it was 6 h. Now,

F_Dr.Hadoop = 10 × 0.1 × (0.1 × 1.8/(3 × 24))^3 × (1 − 0.1)^(10−3).

The above result gives

F_Dr.Hadoop = 7.46 × 10^−9.


The file system Dr.Hadoop is unavailable for F_Dr.Hadoop × R_Dr.Hadoop of the time:

∴ F_Dr.Hadoop × R_Dr.Hadoop = 7.46 × 10^−9 × 1.8 = 1.34 × 10^−8
∴ Percentage of availability = 99.99%

This shows the increase in the availability of Dr.Hadoop over traditional Hadoop, whose availability is 99.17%. The improvement in recovery time is also shown above.
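The arithmetic above can be reproduced with the following short sketch (illustrative only; it simply substitutes the stated assumptions into the formulas as used in the numerical example).

/** Illustrative sketch: reproduces the availability arithmetic above under the
 *  stated assumptions (one failure per month and 6 h recovery for traditional
 *  Hadoop; m = 10 NameNodes, r = 3, f = 0.1 per 3 days for Dr.Hadoop). */
public class AvailabilityCalc {
    public static void main(String[] args) {
        // Traditional Hadoop: f = 1/MTBF (per hour), unavailability = f * R.
        double fHadoop = 1.0 / (30 * 24);
        double rHadoop = 6.0;                       // hours to recover
        double unavailHadoop = fHadoop * rHadoop;   // ~8.3e-3, i.e. ~0.83%
        System.out.println("Hadoop availability    = " + (100 * (1 - unavailHadoop)) + " %");

        // Dr.Hadoop: R = r * R_Hadoop / m, and the failure term as substituted in the text.
        int m = 10, r = 3;
        double f = 0.1;                             // per t = 3 days
        double tHours = 3 * 24;
        double rDr = r * rHadoop / m;               // 1.8 h
        double fDr = m * f * Math.pow(f * rDr / tHours, 3) * Math.pow(1 - f, m - r);
        double unavailDr = fDr * rDr;               // ~1.3e-8
        System.out.println("Dr.Hadoop availability = " + (100 * (1 - unavailDr)) + " %");
    }
}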

7 Performance evaluation

This section provides the performance evaluation of the DCMS of Dr.Hadoop using trace-driven simulation. Namespace locality is examined first, and then we perform the scalability measurement of DCMS. The performance of DCMS with respect to locality preservation is compared with (1) FileHash, a random distribution of files based on their pathnames, with each file assigned to an MDS, and (2) DirHash, where directories are randomly distributed just like in FileHash. Each NameNode identifier in the experiment is 160 bits in size, obtained from the SHA1 hash function. We use the real traces shown in Table 2. Yahoo denotes traces of NFS and email by the Yahoo finance group, with a data size of 256.8 GB (including access pattern information). Microsoft denotes traces of Microsoft Windows production build servers (Kavalanekar et al., 2008), BuildServer00 to BuildServer07, within 24 h, with a data size of 223.7 GB (access pattern information included). A metadata crawler is applied to the datasets to recursively extract file/directory metadata using the stat() function.

Table 2 Real data traces

Trace        Number of files    Data size (GB)    Metadata extracted (MB)
Yahoo        8 139 723          256.8             596.4
Microsoft    7 725 928          223.7             416.2

The cluster used for all the experiments has 40 DataNodes, one ResourceManager and 10 NameNodes. Each node runs at a 3.10 GHz clock speed with 4 GB of RAM and a gigabit Ethernet NIC. All nodes are configured with a 500 GB hard disk. Ubuntu (Ubuntu, 2012) 14.04 is used as the operating system. Hadoop version 2.5.0 is configured for the comparison between Hadoop and Dr.Hadoop, keeping the HDFS block size at 512 MB.

For our experiments, we developed an environment for the different events of our simulator. The simulator is used to validate different design choices and decisions.

To write the code for the simulator, the system (Dr.Hadoop) was studied in depth and the different entities and parameters, such as replication_factor, start_time, end_time, and split_path, were identified. After this, the behaviors of the entities were specified and classes for each entity were defined.

The different simulation classes are built on an event-driven Java simulator, SimJava (Fred and McNab, 1998). SimJava is a popular Java toolkit that uses multiple threads for the simulation. As our model is based on an object-oriented system, the multi-threaded simulator was preferred.

In terms of resource consumption, SimJava gives the best efficiency for long-running simulations. For experiments with a massive amount of data, it might run out of system memory. That was not the case here, because we basically dealt with metadata, and the simulator delivered its best results when we verified them against real-time calculations.

The accuracy of the designed simulator is assessed based on time calculations as well as benchmarking, depending on the nature of the experiment. For the failure analysis, a real-time calculation is done and compared with the result found from the simulator. For the other experiments, benchmarks are executed and a mismatch ranging from milliseconds to tens of seconds is observed. The overall results show a mismatch between the designed simulator and the real-time system of around 1.39% on average.

To better understand the accuracy of the simulator, a comparison between simulator results and real-time results is shown in Fig. 8a.

7.1 Experiment on locality preservation

DCMS uses LpH, which is the finest feature of Dr.Hadoop. Excellent namespace locality in Dr.Hadoop's MDS cluster is necessary to reduce the I/O requests of a metadata lookup. In this experiment, we use the parameter

locality = Σ_{j=1}^{m} p_ij,

where p_ij (either 0 or 1) represents whether a subtree (directory) path p_i is stored in NameNode j or not. Essentially, this metric gives the total number of NameNodes across which a subtree path is split. Fig. 7 portrays the average namespace locality comparison of three different path levels on the two given traces using the three metadata distribution techniques.
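As an illustration only (hypothetical code, not the evaluation harness used in the paper), the metric can be computed from a path-to-NameNode assignment as follows.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch: computes the locality metric defined above, i.e. the
 *  number of NameNodes on which at least one entry of a subtree path is stored. */
class LocalityMetric {
    /** assignment maps each file path to the NameNode holding its metadata. */
    static int locality(String subtreePath, Map<String, String> assignment) {
        Set<String> nameNodes = new java.util.HashSet<>();
        for (Map.Entry<String, String> e : assignment.entrySet()) {
            if (e.getKey().startsWith(subtreePath)) {
                nameNodes.add(e.getValue());   // NameNode j holds part of the subtree
            }
        }
        return nameNodes.size();               // 1 = perfect locality, m = fully split
    }

    public static void main(String[] args) {
        Map<String, String> assignment = new HashMap<>();
        assignment.put("/dir1/dir2/a.txt", "NameNode_3");
        assignment.put("/dir1/dir2/b.txt", "NameNode_3");
        assignment.put("/dir1/dir3/c.txt", "NameNode_7");
        System.out.println(locality("/dir1/dir2", assignment)); // 1
        System.out.println(locality("/dir1", assignment));      // 2
    }
}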

Figs. 7a and 7b show that the performance of DCMS is significantly better than FileHash and DirHash for the two traces. This is because DCMS achieves near-optimal namespace locality using LpH, i.e., keys are assigned based on the order of the pathnames, whereas in FileHash and DirHash the order of pathnames is not maintained, so namespace locality is not preserved at all.

Fig. 7 Locality comparison of three path levels over two traces in the 10-NameNode cluster (x-axis: path levels; y-axis: number of NameNodes; series: DCMS, DirHash, FileHash). (a) Yahoo trace; (b) Microsoft Windows trace

7.2 Scalability and storage capacity

The scalability of Dr.Hadoop is analyzed with the two real traces of Table 2. The growth of the metadata (namespace) is studied as increasing amounts of data are uploaded to the cluster in HDFS. These observations are tabulated in Tables 3, 4, and 5 for Hadoop and Dr.Hadoop, with 10 GB of data uploaded in each attempt.

Fig. 8 Linear growth of metadata load (MB) per NameNode for Hadoop and Dr.Hadoop (x-axis: data (GB); y-axis: metadata load (MB); series in (a): Hadoop, Hadoop (RT), Dr.Hadoop, Dr.Hadoop (RT)). (a) Microsoft Windows trace; (b) Yahoo! trace

The metadata sizes in Table 5 are 3 times the original sizes for Yahoo and Microsoft because of the replication. Figs. 8a and 8b present the scalability of Hadoop and Dr.Hadoop in terms of load (MB) per NameNode for both datasets. The graphs show a linear increase in the metadata size for Hadoop and Dr.Hadoop. In traditional Hadoop, as the data size in the DataNodes increases, the metadata grows toward the upper bound of the main memory of the single NameNode. So the maximum data size that the DataNodes can store is limited by the memory available on the single NameNode server. In Dr.Hadoop, DCMS provides a cluster of NameNodes, which reduces the metadata load per NameNode. This results in an enormous increase in the storage capacity of Dr.Hadoop.

Table 3 Incremental data storage vs. namespace size for the traditional Hadoop cluster (Yahoo trace)

Data (GB)  Namespace (MB)    Data (GB)  Namespace (MB)
10         29.8              140        318.33
20         53.13             150        337.22
30         77.92             160        356.71
40         96.18             170        373.18
50         119.02            180        394.22
60         142.11            190        426.13
70         159.17            200        454.76
80         181.67            210        481.01
90         203.09            220        512.16
100        229.37            230        536.92
110        251.04            240        558.23
120        279.30            250        570.12
130        299.82            256.8      594.40

Table 4 Incremental data storage vs. namespace size for the traditional Hadoop cluster (Microsoft trace)

Data (GB)  Namespace (MB)    Data (GB)  Namespace (MB)
10         20.5              130        254.12
20         41.83             140        271.89
30         59.03             150        297.12
40         77.08             160        312.71
50         103.07            170        329.11
60         128.18            180        352.12
70         141.9             190        369.77
80         157.19            200        384.76
90         174.34            210        393.04
100        190.18            220        401.86
110        214.2             223.7      416.02
120        237.43

Table 5 Data storage vs. namespace size for the Dr.Hadoop cluster

Data (GB)    Namespace (MB), Yahoo    Namespace (MB), Microsoft
100          684.11                   572.65
200          1364.90                  1152.89
256.8        1789.34                  1248.02

Due to the scalable nature of DCMS in Dr.Hadoop, each NameNode's load is inconsequential in comparison to traditional Hadoop. In spite of the 3-fold replication of each metadata item in DCMS, the load per NameNode of Dr.Hadoop shows a drastic decline compared to Hadoop's NameNode. A smaller load on the main memory of a node implies that it is less prone to failure.

In Fig. 8a, we also compare the values obtained from the simulator with the values analyzed in real time. In the graph, 'RT' stands for real time, and the difference between the two is shown as a dotted line vs. a solid line. As the result suggests, the percentage of mismatch is very small and the lines for Hadoop and Hadoop (RT) are almost coincident. This validates the accuracy of the simulator.

7.3 Experiment on throughput

In Dr.Hadoop, each NameNode of DCMS behaves both as a primary and as a hot standby node for its left and right NameNodes. During failover, these hot standbys take over the responsibility of metadata lookup, which might affect the throughput of read and write operations. These two metrics are the most important for our model, as they dictate the amount of work performed for the clients. We first measured the read/write throughput with no failures; the throughput (operations/s) was then measured under NameNode failures to evaluate fault tolerance.

7.3.1 Read/write throughput

This experiment demonstrates the throughput of metadata operations with no failover. A multithreaded client is configured, which sends metadata operations (read and write) at an appropriate frequency, and the corresponding successful operations are measured. We measure the successful completion of metadata operations per second to quantify Dr.Hadoop's efficiency and capacity.

Figs. 9a and 9b show the average read and write throughput, in terms of successfully completed operations, for Hadoop and Dr.Hadoop on both data traces. The figures show that Dr.Hadoop's DCMS throughput is significantly higher than that of Hadoop, which validates our claim in Table 1. The experiment is conducted using 10 NameNodes; after a few seconds, Dr.Hadoop shows a slight reduction in speed only because of the extra RPCs involved.

7.3.2 Throughput against fault-tolerance

This part demonstrates the fault tolerance of Dr.Hadoop using DCMS. A client thread is made to send 100 metadata operations (read and write) per second, and the successful operations for Hadoop and Dr.Hadoop are displayed in Figs. 10a and 10b.

Fig. 9 Comparison of throughput under different load conditions (x-axis: operations sent per second; y-axis: successful operations per second; series: Hadoop, Dr.Hadoop). (a) Yahoo trace; (b) Microsoft Windows trace

In Fig. 10a, the experiment is carried out on the Yahoo data trace, where Hadoop initially shows a steady-state throughput. We killed the primary NameNode's daemon at t = 100 s, and eventually the whole HDFS filesystem went offline. Around t = 130 s, the NameNode is restarted; it recovers from the state check-pointed by the secondary NameNode and replays the log operations that it failed to perform. During the recovery phase, the graph shows a few spikes because the NameNode buffers all the requests until it recovers and then batches them together.

To test the fault tolerance of DCMS in Dr.Hadoop, out of 10 NameNodes we randomly killed the daemons of a few, viz. NameNode_0 and NameNode_6, at t = 40 s. This did not lead to any reduction in the successful completion of metadata operations, because the neighboring NameNodes acted as hot standbys for the killed ones. Again, at t = 70 s, we killed NameNode_3 and NameNode_8, leaving 6 NameNodes out of 10, which caused some decline in throughput. At t = 110 s, we restarted the daemons of two NameNodes, which stabilized the throughput of Dr.Hadoop.

Fig. 10 Fault tolerance of Hadoop and Dr.Hadoop (x-axis: time (s); y-axis: successful operations per second; series: Hadoop, Dr.Hadoop). (a) Yahoo trace; (b) Microsoft Windows trace

The experiment with the Microsoft data trace is shown in Fig. 10b. It shows a trend similar to the Yahoo trace for Hadoop. During the test of Dr.Hadoop, we killed two sets of three consecutive NameNodes, NameNode_1 to NameNode_3 and NameNode_6 to NameNode_8, at t = 90 s to create the worst-case scenario (which has a probability of almost 10^−8). This made a portion of the filesystem unavailable, and the throughput of Dr.Hadoop reaches its poorest value. At t = 110 s we brought NameNode_1 to NameNode_3 back up to increase the throughput, and at t = 130 s we brought NameNode_6 to NameNode_8 back up, which re-stabilized the situation.

7.4 Metadata migration in dynamic situations

In DCMS, addition and deletion of files or directories happen very frequently. As NameNodes join and leave the system, the metadata distribution of Dr.Hadoop changes, and DCMS might have to migrate metadata to maintain consistent hashing. DCMS needs to address two questions: first, how DCMS behaves with respect to the storage load per NameNode when the number of servers is constant; second, how much key redistribution or migration overhead is needed to maintain proper metadata load balancing. The experiment on scalability and storage capacity has answered the first question sufficiently.

Fig. 11 depicts the metadata migration overhead, showing outstanding scalability. We performed the experiment as follows. At the beginning, all the NameNodes in the DCMS system are in an adequate load-balancing state, as discussed earlier in the consistent hashing section. In the experiment, 2 NameNodes are allowed to join the system at random after a periodic interval, and the system reaches a new balanced key distribution. We examined how many metadata elements are migrated to the newly added servers.

Fig. 11 Migration overhead (x-axis: increase in the number of NameNodes, from 10→12 up to 18→20; y-axis: number of metadata items (×10^6); series: Yahoo, Microsoft)

7.5 Comparison of Hadoop vs. Dr.Hadoop with MapReduce job execution (WordCount)

In this experiment, a MapReduce job is executed to obtain comparative results for both frameworks. 'WordCount', a popular benchmark job, is executed on Hadoop and Dr.Hadoop, and the comparison is depicted in Fig. 12.

Fig. 12 WordCount job execution time for different dataset sizes (x-axis: data size (GB), from 4 to 128; y-axis: execution time (s); series: Hadoop, Dr.Hadoop)

The job is executed on a Wikipedia dataset, varying the input size from 4 GB to 128 GB. In Hadoop, the lone NameNode is responsible for the execution of the job, whereas in Dr.Hadoop the JobTracker (the master for DCMS) submits the job to a particular NameNode after mapping the input filepath. So there is an extra step involved in the case of Dr.Hadoop, because of which the graph shows slower execution in comparison to Hadoop.

The overhead of replicating metadata in Dr.Hadoop is also a factor in this small delay. As the locality of metadata is not 100% in Dr.Hadoop, the metadata of subdirectories might be stored in different NameNodes. This can lead to additional RPCs, which increase the job execution time initially.

However, as the size of the dataset increases beyond 32 GB, the running times almost converge, neutralizing all these factors.

During the execution, the rest of the NameNodes remain freely available for other jobs, which is not the case for Hadoop with its single NameNode. So it is quite reasonable to neglect the small extra overhead when considering the broader scenario.

8 Related work

Various recent works on decentralizing the NameNode server are discussed in this section.

Providing Hadoop with a highly available metadata server is an urgent and topical problem in the Hadoop and research communities.

The Apache Hadoop Foundation introduced the secondary NameNode (White, 2009), which is purely optional. The secondary NameNode periodically checks the NameNode's namespace status and merges the fsimage with the edit logs, which decreases the restart time of the NameNode. Unfortunately, it is not a hot backup daemon of the NameNode and thus is not capable of hosting DataNodes in the absence of the NameNode. So it does not resolve the SPOF of Hadoop.

Apache Hadoop's core project team has been working on a Backup NameNode that is capable of hosting DataNodes when the NameNode fails. A NameNode can have only one Backup NameNode, which continually contacts the primary NameNode for synchronization and hence adds to the complexity of the architecture. According to the project team, serious contributions from different researchers are needed for the Backup NameNode.

Wang et al. (2009), with a team from IBM China, worked on the NameNode's high availability by replicating the metadata. Their work is somewhat similar to ours, but their solution consists of three stages: an initialization phase, a replication phase and a failover phase. This increases the complexity of the whole system by having different subsystems for cluster management and for failure management. In our Dr.Hadoop, similar work is done with less overhead and with the same kind of machines.

A major Apache Hadoop project (HDFS-976) targeted the HA of Hadoop by introducing the concept of an AvatarNode (HDFS-976, 2010). The experiment was carried out on a cluster of over 1200 nodes. The fundamental approach was to switch between the primary and standby NameNodes as if switching to an avatar. For this, the standby AvatarNode needs support from the primary NameNode, which creates extra load and overhead on the primary because it is already exhausted by the huge volume of client requests. This slows down the performance of the Hadoop cluster.

ZooKeeper (Apache Zookeeper, 2007), a subproject of Apache Hadoop, uses replication among a cluster of servers and provides a leader election algorithm as its coordination mechanism. However, the project focuses on the coordination of distributed applications rather than on an HA solution.

ContextWeb performed an experiment on an HA solution for Cloudera Hadoop (Bisciglia, 2009). The experiment used DRBD from LINBIT and Heartbeat from the Linux-HA project, but it did not optimize the solution for the performance and availability of Hadoop; for example, it replicated a large amount of unnecessary data.

E. Sorensen added a hot standby feature to Apache Derby (Bisciglia, 2009). Unfortunately, the solution does not support a multi-threaded approach to serving replicated messages, and therefore it cannot be used for the parallel processing of large-scale data in Hadoop.

Okorafor and Patrick (2012) worked on the HA of Hadoop by presenting a mathematical model that basically features ZooKeeper leader election. The solution covers only the HA of the ResourceManager, not the NameNode.

A systematic replica management scheme is provided by Berkeley DB (Yadava, 2007), but it targets only database management systems. Users who want to use Berkeley DB replication for an application other than a DBMS have to spend a lot of time redesigning the replication framework.

IBM DB2 HADR (High Availability Disaster Recovery) (Torodanhan, 2009) can adjust replication management at runtime for efficient performance. The solution uses a simulator to estimate the previous performance of a replica, but in the actual Hadoop scenario this approach is not suitable for tuning the replication process.

9 Conclusions

This paper presents Dr.Hadoop, a modification of the Hadoop framework built around Dynamic Circular Metadata Splitting (DCMS), a dynamic and highly scalable distributed metadata management system. DCMS preserves excellent namespace locality by exploiting locality-preserving hashing to distribute the metadata among multiple NameNodes. Attractive features of DCMS include its simplicity, self-healing, correctness, and performance when a new NameNode joins the system.

When the size of the NameNode cluster changes, DCMS uses consistent hashing to maintain balance among the servers. DCMS also includes a replication management approach that makes it highly reliable in the case of node failure: Dr.Hadoop continues to work correctly in such a case, albeit with degraded performance when consecutive replicated servers fail. Our theoretical analysis of the availability of Dr.Hadoop shows that the framework stays alive 99.99% of the time.
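As a back-of-the-envelope illustration of how a figure of this order can arise (the assumptions here, independent failures and a per-NameNode availability p, are ours and not taken from the paper's analysis), a metadata entry replicated on r consecutive NameNodes is unreachable only when all r replicas are down:

A = 1 - (1 - p)^{r}, \quad \text{e.g. } p = 0.99,\ r = 2 \;\Rightarrow\; A = 1 - (0.01)^{2} = 0.9999.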

DCMS offers multiple advantages, such as high scalability, efficient balancing of metadata storage, and the absence of bottlenecks, all with negligible additional overhead. We believe that Dr.Hadoop with DCMS will be a valuable component for large-scale distributed applications in which metadata management is the most essential part of the whole system.

Acknowledgment

The authors would like to thank the Data Science and Analytics Lab of NIT Silchar for supporting this research.

References
Beaver, D., Kumar, S., Li, H.C., et al., 2010. Finding a needle in Haystack: Facebook's photo storage. OSDI, p.47-60.
Ghemawat, S., Gobioff, H., Leung, S.T., 2003. The Google file system. Proc. 19th ACM Symp. on Operating Systems Principles, p.29-43.
Weil, S.A., Brandt, S.A., Miller, E.L., et al., 2006. Ceph: a scalable, high-performance distributed file system. OSDI, p.307-320.
Hadoop, 2012. Hadoop Distributed File System. http://hadoop.apache.org/
Shvachko, K., Kuang, H.R., Radia, S., et al., 2010. The Hadoop Distributed File System. IEEE 26th Symposium on Mass Storage Systems and Technologies, p.1-10.
Hadoop, 2012. Hadoop Distributed File System. http://hadoop.apache.org/hdfs/
Dev, D., Patgiri, R., 2014. Performance evaluation of HDFS in big data management. International Conference on High Performance Computing and Applications, p.1-7.
Hadoop Wiki, 2012. NameNodeFailover. http://wiki.apache.org/hadoop/NameNodeFailover
DeCandia, G., et al., 2007. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review, 41(6).
Lim, H., et al., 2011. SILT: a memory-efficient, high-performance key-value store. Proceedings of the 23rd ACM Symposium on Operating Systems Principles.
Escriva, R., Wong, B., Sirer, E.G., 2012. HyperDex: a distributed, searchable key-value store. ACM SIGCOMM Computer Communication Review, 42(4):25-36.
Debnath, B., Sengupta, S., Li, J., 2010. FlashStore: high throughput persistent key-value store. Proceedings of the VLDB Endowment, p.1414-1425.
Brandt, S.A., Miller, E.L., Long, D.D.E., et al., 2003. Efficient metadata management in large distributed storage systems. IEEE Symposium on Mass Storage Systems, p.290-298.
Weil, S.A., Pollack, K.T., Brandt, S.A., et al., 2004. Dynamic metadata management for petabyte-scale file systems. SC, p.4.
Zhu, Y., Jiang, H., Wang, J., et al., 2008. HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst., 19(6):750-763.
Cao, Y., Chen, C., Guo, F., et al., 2011. ES2: a cloud data storage system for supporting both OLTP and OLAP. Proc. IEEE ICDE, p.291-302.
Corbett, P.F., Feitelson, D.G., 1996. The Vesta parallel file system. ACM Trans. Comput. Syst., 14(3):225-264.
Miller, E.L., Katz, R.H., 1997. RAMA: an easy-to-use, high-performance parallel file system. Parallel Computing, 23(4-5):419-446.
Rodeh, O., Teperman, A., 2003. zFS: a scalable distributed file system using object disks. IEEE Symposium on Mass Storage Systems, p.207-218.
Miller, E.L., Greenan, K., Leung, A., et al., 2008. Reliable and efficient metadata storage and indexing using NVRAM. [Online]. Available: dcslab.hanyang.ac.kr/nvramos08/EthanMiller.pdf
Nagle, D., Serenyi, D., Matthews, A., 2004. The Panasas ActiveScale storage cluster: delivering scalable high bandwidth storage. Proc. ACM/IEEE SC, p.1-10.
Dev, D., Patgiri, R., 2015. HAR+: archive and metadata distribution! Why not both? ICCCI, to be published.
Raicu, I., Foster, I.T., Beckman, P., 2011. Making a case for distributed file systems at exascale. LSAP, p.11-18.
Ousterhout, J.K., Costa, H.D., Harrison, D., et al., 1985. A trace-driven analysis of the UNIX 4.2 BSD file system. SOSP, p.15-24.
FIPS 180-1, 1995. Secure Hash Standard. U.S. Department of Commerce/NIST, National Technical Information Service, Springfield, VA, Apr. 1995.
Hot Standby for NameNode. http://issues.apache.org/jira/browse/HDFS-976, last accessed on July 10th, 2012.
Karger, D., Lehman, E., Leighton, F., et al., 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. Proceedings of the 29th Annual ACM Symposium on Theory of Computing, p.654-663.
Lewin, D., 1998. Consistent hashing and random trees: algorithms for caching in distributed networks. Master's thesis, Department of EECS, MIT.
Aguilera, M.K., Chen, W., Toueg, S., 1997. Heartbeat: a timeout-free failure detector for quiescent reliable communication. Proceedings of the 11th International Workshop on Distributed Algorithms, p.126-140.
Kavalanekar, S., Worthington, B.L., Zhang, Q., et al., 2008. Characterization of storage workload traces from production Windows Servers. Proc. IEEE IISWC, p.119-128.
Ubuntu, 2012. http://www.ubuntu.com/
Fred, H., McNab, R., 1998. SimJava: a discrete event simulation library for Java. Simulation Series, 30:51-56.
Haddad, I.F., 2000. PVFS: a parallel virtual file system for Linux clusters. Linux J., p.5.
White, B.S., Walker, M., Humphrey, M., et al., 2001. LegionFS: a secure and scalable file system supporting cross-domain high-performance applications. Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, p.59.
Braam, P.J. Lustre: a scalable, high performance file system. Cluster File Systems, Inc.
Satyanarayanan, M., Kistler, J.J., Kumar, P., et al., 1990. Coda: a highly available file system for a distributed workstation environment. IEEE Trans. Comput., 39(4):447-459.
McKusick, M.K., Quinlan, S., 2009. GFS: evolution on fast-forward. ACM Queue, 7(7):10-20.
White, T., 2009. Hadoop: The Definitive Guide. O'Reilly Media, Inc.
Wang, F., Qiu, J., Yang, J., et al., 2009. Hadoop high availability through metadata replication. Proceedings of the 1st International Workshop on Cloud Data Management.
Apache Hadoop Project HDFS-976, 2010. Hadoop AvatarNode High Availability. http://hadoopblog.blogspot.com/2010/02/hadoopnamenode-high-availability.htm
Apache Zookeeper, 2007. http://hadoop.apache.org/zookeeper/

Bisciglia, C., 2009. Hadoop HA Configuration. http://www.cloudera.com/blog/2009/07/22/hadoop-haconfiguration/
Okorafor, E., Patrick, M.K., 2012. Availability of JobTracker machine in Hadoop/MapReduce ZooKeeper coordinated clusters. Advanced Computing: An International Journal (ACIJ), 3(3).
Yadava, H., 2007. The Berkeley DB Book. Apress.
Torodanhan, 2009. Best Practice: DB2 High Availability Disaster Recovery. http://www.ibm.com/developerworks/wikis/display/data/Best+Practice+-+DB2+High+Availability+Disaster+Recovery
