Transcript of June 2013 Large Scale File Systems by Dale Denis.

Page 1: June 2013 Large Scale File Systems by Dale Denis.

June 2013

Large Scale File Systems by Dale Denis

Page 2: June 2013 Large Scale File Systems by Dale Denis.


Outline

• The need for Large Scale File Systems.
  • Big Data.
  • The Network File System (NFS).
  • The use of inexpensive commodity hardware.

• Large Scale File Systems.
  • The Google File System (GFS).
  • The Hadoop Distributed File System (HDFS).

Page 3: June 2013 Large Scale File Systems by Dale Denis.


Big Data

• International Data Corporation (IDC):
  • The digital universe will grow to 35 zettabytes globally by 2020.

• The New York Stock Exchange generates over one terabyte of new trade data per day.

• The Large Hadron Collider near Geneva, Switzerland, will produce approximately 15 petabytes of data per year.

• Facebook hosts approximately 10 billion photos, taking up 2 petabytes of data.

Page 4: June 2013 Large Scale File Systems by Dale Denis.


Big Data

• 85% of the data being stored is unstructured data.
• The data does not require frequent updating once it is written, but it is read often.
• This scenario is complementary to data that is more suitable for an RDBMS.
• Relational Database Management Systems are good at storing structured data:
  • Microsoft SQL Server
  • Oracle
  • MySQL

Page 5: June 2013 Large Scale File Systems by Dale Denis.


NFS

• NFS: the ubiquitous distributed file system.
• Developed by Sun Microsystems in the early 1980s.
• While its design is straightforward, it is also very constrained:
  • The files in an NFS volume must all reside on a single machine.
  • All clients must go to this machine to retrieve their data.

Page 6: June 2013 Large Scale File Systems by Dale Denis.


NFS

• The bottleneck of reading data from a drive:
  • Transfer speeds have not kept up with storage capacity.
  • 1990: A typical drive with 1370 MB capacity had a transfer speed of 4.4 MB/s, so it would take about 5 minutes to read all of the drive's data.
  • 2010: A terabyte drive with a typical transfer speed of 100 MB/s takes more than two and a half hours to read all of the data.
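For reference, the arithmetic behind those two figures works out as follows (simple capacity-over-throughput estimates, ignoring seek overhead):

```latex
\frac{1370\ \text{MB}}{4.4\ \text{MB/s}} \approx 311\ \text{s} \approx 5\ \text{minutes},
\qquad
\frac{1\ \text{TB}}{100\ \text{MB/s}} = \frac{10^6\ \text{MB}}{100\ \text{MB/s}} = 10^4\ \text{s} \approx 2.8\ \text{hours}
```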

Page 7: June 2013 Large Scale File Systems by Dale Denis.


Inexpensive Commodity Hardware

• Cost-effective.
  • One Google server rack:
    • 176 processors
    • 176 GB of memory
    • $278,000
  • Commercial grade server:
    • 8 processors
    • 1/3 the memory
    • Comparable amount of disk space
    • $758,000
• Scalable.
• Failure is to be expected.

Page 8: June 2013 Large Scale File Systems by Dale Denis.


A Solution is Born

• Apache Nutch:
  • Doug Cutting created Apache Lucene.
  • Apache Nutch was a spin-off of Lucene.
  • Nutch was an open source web search engine.
  • Development started in 2002.
  • The architecture had scalability issues due to the very large files generated as part of the web crawl and indexing process.

Page 9: June 2013 Large Scale File Systems by Dale Denis.


A Solution is Born

• In 2003 the paper “The Google File System” was published.
• In 2004 work began on an open source implementation of the Google File System (GFS) for the Nutch web search engine.
  • The project was called the Nutch Distributed File System (NDFS).
• In 2004 Google published a paper introducing MapReduce.
• By early 2005 the Nutch developers had a working implementation of MapReduce, and by the end of the year all of the major algorithms in Nutch had been ported to run using MapReduce and NDFS.

Page 10: June 2013 Large Scale File Systems by Dale Denis.


A Solution is Born

• In early 2006 they realized that the MapReduce implementation and NDFS had potential beyond web search.
  • The project was moved out of Nutch and was renamed Hadoop.
• In 2006 Doug Cutting was hired by Yahoo!.
  • Hadoop became an open source project at Yahoo!.
• In 2008 Yahoo! announced that its production search index was running on a 10,000-core Hadoop cluster.

Page 11: June 2013 Large Scale File Systems by Dale Denis.


The Google File System (GFS)

• A scalable distributed file system for large distributed data-intensive applications.

• Provides fault tolerance while running on inexpensive commodity hardware.

• Delivers high aggregate performance to a large number of clients.

• The design was driven by observations at Google of their application workloads and technological environment.
  • The file system API and the applications were co-designed.

Page 12: June 2013 Large Scale File Systems by Dale Denis.


GFS Design Goals and Assumptions

• The system is built from many inexpensive commodity components that often fail.
  • The system must tolerate, detect, and recover from component failure.

• The system stores a modest number of large files.
  • Small files must also be supported, but the system doesn’t need to be optimized for them.

• The workloads primarily consist of large streaming reads and small random reads.

Page 13: June 2013 Large Scale File Systems by Dale Denis.


GFS Design Goals and Assumptions

• The workloads also have many large, sequential writes that append data to files.
  • Small writes at arbitrary positions in a file are to be supported but do not have to be efficient.

• Multiple clients must be able to concurrently append to the same file.

• High sustained bandwidth is more important than low latency.

• The system must provide a familiar file system interface.

Page 14: June 2013 Large Scale File Systems by Dale Denis.


GFS Interface

• Supports operations to create, delete, open, close, read, and write files.

• Has snapshot and record append operations.
  • Record append operations allow multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each client’s append.

Page 15: June 2013 Large Scale File Systems by Dale Denis.


GFS Architecture

• A GFS cluster consists of:
  • A single master.
  • Multiple chunk servers.
• The GFS cluster is accessed by multiple clients.
• Files are divided into fixed-size chunks.
  • The chunks are 64 MB by default; this is configurable.
• Chunk servers store the chunks on local disks.
• Each chunk is replicated on multiple servers.
  • A standard replication factor of 3.

Page 16: June 2013 Large Scale File Systems by Dale Denis.


GFS Architecture

• Two files being stored on three chunk servers with a replication factor of 2.

Page 17: June 2013 Large Scale File Systems by Dale Denis.


GFS Single Master

• Maintains all file system metadata.
  • Namespace information.
  • Access control information.
  • Mapping from files to chunks.
  • The current locations of the chunks.
• Controls system-wide activities.
  • Executes all namespace operations.
  • Chunk lease management.
  • Garbage collection.
  • Chunk migration between the servers.
• Communicates with each chunk server in heartbeat messages.

Page 18: June 2013 Large Scale File Systems by Dale Denis.


GFS Single Master

• Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunk servers.

• A client sends the master a request for a file and the master responds with the locations of all of the chunks.

• The client then sends its read request to one of the chunk servers that holds a replica (see the sketch below).
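A minimal sketch of that read path in Java, assuming hypothetical master and chunk-server stubs (GFS is proprietary and publishes no client API, so only the 64 MB chunk size and the offset-to-chunk-index translation below come from the paper):

```java
// Illustrative GFS-style client read path; the interfaces are assumptions, not Google's API.
public class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks

    interface Master { ChunkInfo lookup(String file, long chunkIndex); }
    interface ChunkServer { byte[] read(String chunkHandle, long offsetInChunk, int length); }
    record ChunkInfo(String chunkHandle, java.util.List<ChunkServer> replicas) {}

    static byte[] read(Master master, String file, long byteOffset, int length) {
        // 1. Translate the byte offset into a chunk index (files are split into fixed-size chunks).
        long chunkIndex = byteOffset / CHUNK_SIZE;
        long offsetInChunk = byteOffset % CHUNK_SIZE;

        // 2. Ask the master for the chunk handle and the replica locations (metadata only).
        ChunkInfo info = master.lookup(file, chunkIndex);

        // 3. Read the data directly from one of the chunk servers; the master
        //    never sits on the data path.
        ChunkServer replica = info.replicas().get(0); // e.g. the closest replica
        return replica.read(info.chunkHandle(), offsetInChunk, length);
    }
}
```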

Page 19: June 2013 Large Scale File Systems by Dale Denis.


GFS Chunks

• The chunk size is large.
• Advantages:
  • Reduces the client’s need to interact with the master.
    • Helps to keep the master from being a bottleneck.
  • Reduces the size of the metadata stored on the master.
    • The master is able to keep all of the metadata in memory.
  • Reduces the network overhead by keeping persistent TCP connections to the chunk server over an extended period of time.
• Disadvantages:
  • Hotspots with small files if too many clients are accessing the same file.

• Chunks are stored on local disks as Linux files.

Page 20: June 2013 Large Scale File Systems by Dale Denis.


GFS Metadata

• The master stores three types of metadata:
  • File and chunk namespaces.
  • The mapping from files to chunks.
  • The location of each chunk’s replicas.
• The master doesn’t store the locations of the replicas persistently.
  • The master asks each chunk server about its chunks:
    • At master startup.
    • When a chunk server joins the cluster.

• The master also includes rudimentary support for permissions and quotas.

Page 21: June 2013 Large Scale File Systems by Dale Denis.


GFS Metadata

• The operation log is central to GFS!
  • The operation log contains a historical record of critical metadata changes.
  • Files and chunks are uniquely identified by the logical times at which they were created.
  • The log is replicated on multiple remote machines.
  • The master recovers its state by replaying the operation log.
  • Monitoring infrastructure outside of GFS starts a new master process if the old master fails.
  • “Shadow masters” provide read-only access when the primary master is down.

Page 22: June 2013 Large Scale File Systems by Dale Denis.


GFS Leases and Mutations

• A mutation is an operation that changes the contents or metadata of a chunk.

• Leases are used to maintain a consistent mutation order across replicas.
  • The master grants a lease to one of the replicas, which is called the primary.
  • The primary picks a serial order for all mutations to the chunk.

• The lease mechanism is designed to minimize the management overhead at the master.

Page 23: June 2013 Large Scale File Systems by Dale Denis.


GFS The anatomy of a mutation

1. The client asks the master which chunk server holds the current lease for a chunk and the locations of the other replicas.

2. The client pushes the data to all of the replicas.

3. When all replicas acknowledge receiving the data, the client sends a write request to the primary.

4. The primary serializes the mutations and applies the changes to its own state.

Page 24: June 2013 Large Scale File Systems by Dale Denis.


GFS The anatomy of a mutation

5. The primary forwards the write request to all of the secondary replicas.

6. The secondaries apply the mutations in the same serial order assigned by the primary.

7. The secondaries reply to the primary that they have completed.

8. The primary replies to the client.
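Putting steps 1–8 together, here is a hedged Java sketch of the control flow; every interface below is an assumption made for illustration (GFS exposes no public API), and the primary's forwarding in steps 5–7 is shown inline for brevity:

```java
import java.util.List;

// Illustrative sketch of a GFS-style write (steps 1-8 above); not Google's actual interfaces.
public class GfsWriteSketch {
    interface Master  { Lease findLease(String chunkHandle); }        // step 1
    interface Replica {
        void push(byte[] data);                                       // step 2: data flow
        void apply(long serial, byte[] data);                         // step 6: apply in serial order
    }
    interface Primary extends Replica {
        long writeRequest(byte[] data);                               // steps 3-4: assign a serial number, apply locally
    }
    record Lease(Primary primary, List<Replica> secondaries) {}

    static void write(Master master, String chunkHandle, byte[] data) {
        Lease lease = master.findLease(chunkHandle);                  // step 1
        lease.primary().push(data);                                   // step 2
        for (Replica r : lease.secondaries()) r.push(data);           // step 2
        long serial = lease.primary().writeRequest(data);             // steps 3-4
        for (Replica r : lease.secondaries()) r.apply(serial, data);  // steps 5-6 (in GFS the primary forwards this)
        // Step 7: secondaries acknowledge; step 8: the primary replies to the client.
    }
}
```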

Page 25: June 2013 Large Scale File Systems by Dale Denis.


GFS The anatomy of a mutation

• The data flow and the control flow have been decoupled.

• The data is pushed linearly along a carefully picked chain of chunk servers in a pipeline fashion.

• Each chunk server forwards the data to the next nearest chunk server in the chain.

• The goal is to fully utilize each machine’s network bandwidth and avoid bottlenecks.
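The GFS paper quantifies this: without network congestion, the ideal elapsed time to push B bytes to R replicas through the pipeline is

```latex
t_{\text{ideal}} = \frac{B}{T} + R\,L
```

where T is the network throughput and L is the latency between two machines. With 100 Mbps links and sub-millisecond latency, 1 MB can ideally be distributed in about 80 ms.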

Page 26: June 2013 Large Scale File Systems by Dale Denis.


GFS The anatomy of a mutation

• Write Control and Data Flow.

Page 27: June 2013 Large Scale File Systems by Dale Denis.


GFS Record Append

• Record appends are atomic.
  • The client specifies the data, and GFS appends it to the file atomically at an offset of GFS’s choosing.
  • In a traditional write, the client specifies the offset at which the data is to be written.

• The primary replica checks to see if appending to the current chunk would exceed the maximum size. If so, the primary pads the current chunk and replies to the client that the operation should be retried on the next chunk.
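A minimal sketch of that check on the primary, assuming the 64 MB chunk size; the names below are hypothetical, and the real logic (which also bounds the size of a single record) is more involved:

```java
// Illustrative primary-side record-append decision; not actual GFS code.
public class RecordAppendSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB

    enum Status { APPENDED, RETRY_ON_NEXT_CHUNK }
    record Result(Status status, long offset) {}

    private long bytesUsedInCurrentChunk;                // state the primary tracks for its chunk

    Result recordAppend(byte[] record) {
        if (bytesUsedInCurrentChunk + record.length > CHUNK_SIZE) {
            // Pad the rest of the chunk (secondaries do the same) and ask the
            // client to retry the append on the next chunk.
            bytesUsedInCurrentChunk = CHUNK_SIZE;
            return new Result(Status.RETRY_ON_NEXT_CHUNK, -1);
        }
        long offset = bytesUsedInCurrentChunk;           // GFS, not the client, picks the offset
        bytesUsedInCurrentChunk += record.length;
        return new Result(Status.APPENDED, offset);
    }
}
```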

Page 28: June 2013 Large Scale File Systems by Dale Denis.


GFS Master Operations

• Namespace Management and Locking.
  • Locks allow multiple operations to be active at the same time.

• Locks over regions of the namespace ensure proper serialization.

• Each master operation acquires a set of locks before it runs.

• The centralized server approach was chosen in order to simplify the design.

• Note: GFS does not have a per-directory data structure that lists all the files in that directory.

Page 29: June 2013 Large Scale File Systems by Dale Denis.


GFS Master Operations

• Replica Placement.
  • The dual goals of the replica placement policy:
    • Maximize data reliability and availability.
    • Maximize network bandwidth utilization.
  • Chunks must not only be spread across machines, they must also be spread across racks:
    • Fault tolerance.
    • To exploit the aggregate bandwidth of multiple racks.

Page 30: June 2013 Large Scale File Systems by Dale Denis.


GFS Master Operations

• The master rebalances replicas periodically.
  • Replicas are removed from chunk servers with below-average free space.
  • Through this process the master gradually fills up a new chunk server.

• Chunks are re-replicated when the number of replicas falls below a user-specified goal:
  • Due to chunk server failure.
  • Due to data corruption.

• Garbage collection is done lazily at regular intervals.

Page 31: June 2013 Large Scale File Systems by Dale Denis.


GFS Garbage Collection

• When a file is deleted, it is renamed to a hidden name and given a deletion timestamp.
  • After three days the file is removed from the namespace.
    • The time interval is configurable.
  • Hidden files can be undeleted.

• In its regular heartbeat message the chunk server reports a subset of the chunks that it has.

• The master replies with the IDs of the chunks that are no longer in the namespace.

• The chunk server is then free to delete the chunks that are not in the namespace (see the sketch below).
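A minimal sketch of that exchange from the master's side, treating chunk handles as plain strings (an assumption for illustration; not the actual GFS data structures):

```java
import java.util.HashSet;
import java.util.Set;

public class GarbageCollectionSketch {
    // Chunk handles still referenced by some file in the namespace.
    private final Set<String> liveChunks = new HashSet<>();

    // Given the chunks a chunk server reports in its heartbeat, return the
    // handles that are no longer in the namespace and may be deleted.
    Set<String> orphanedChunks(Set<String> reportedChunks) {
        Set<String> orphaned = new HashSet<>(reportedChunks);
        orphaned.removeAll(liveChunks);
        return orphaned;
    }
}
```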

Page 32: June 2013 Large Scale File Systems by Dale Denis.


GFS Data Integrity

• Each chunk server uses checksumming to detect corruption of stored data.

• Each chunk is broken into 64 KB blocks, and each block has a 32-bit checksum (see the sketch after this list).

• During idle periods the chunk servers are scanned to verify the contents of inactive chunks.

• GFS servers generate diagnostic logs that record many significant events.

  • Chunk servers going online and offline.
  • All RPC requests and replies.
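A minimal sketch of the per-block checksumming using Java's built-in CRC32; the GFS paper doesn't name the checksum algorithm, so CRC32 is an assumption here, but the 64 KB block size and 32-bit checksum come from the slide above:

```java
import java.util.zip.CRC32;

public class BlockChecksumSketch {
    static final int BLOCK_SIZE = 64 * 1024;   // 64 KB blocks

    // Compute one 32-bit checksum per 64 KB block of a chunk's data.
    static long[] checksumBlocks(byte[] chunkData) {
        int blockCount = (chunkData.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] checksums = new long[blockCount];
        for (int i = 0; i < blockCount; i++) {
            int start = i * BLOCK_SIZE;
            int length = Math.min(BLOCK_SIZE, chunkData.length - start);
            CRC32 crc = new CRC32();
            crc.update(chunkData, start, length);
            checksums[i] = crc.getValue();     // 32-bit value, stored in a long
        }
        return checksums;
    }
}
```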

Page 33: June 2013 Large Scale File Systems by Dale Denis.


GFS Recovery Time

• Experiment 1:
  • One chunk server with approx. 15,000 chunks containing 600 GB of data was taken off-line.
  • The number of concurrent clonings was restricted to 40% of the total number of chunk servers.
  • All chunks were restored in 23.3 minutes, at an effective replication rate of 440 MB/s.

Page 34: June 2013 Large Scale File Systems by Dale Denis.


GFS Recovery Time

• Experiment 2:
  • Two chunk servers with approx. 16,000 chunks and 660 GB of data were taken off-line.
  • Cloning was set to a high priority.
  • All chunks were restored to 2x replication within 2 minutes.
    • The cluster was back in a state where it could tolerate another chunk server failure without data loss.

Page 35: June 2013 Large Scale File Systems by Dale Denis.


GFS Measurements

• Test Environment:
  • 16 client machines.
  • 19 GFS servers:
    • 16 chunk servers.
    • 1 master, 2 master replicas.
  • All machines had the same configuration.
  • Each machine had a 100 Mbps full-duplex Ethernet connection.
  • 2 HP 2524 10/100 switches.
    • All 19 servers were connected to one switch and all 16 clients were connected to the other.
    • A 1 Gbps link connected the two switches.

Page 36: June 2013 Large Scale File Systems by Dale Denis.


GFS Measurements - Reads

• N clients reading simultaneously from the file system.
  • The theoretical limit peaks at an aggregate of 125 MB/s when the 1 Gbps link is saturated.
  • Theoretical per-client limit of 12.5 MB/s when the client’s network interface is saturated.
  • Observed read rate of 10 MB/s when one client is reading.
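Those limits follow directly from the link speeds (8 bits per byte):

```latex
\frac{1\ \text{Gbps}}{8} = 125\ \text{MB/s (inter-switch link)},
\qquad
\frac{100\ \text{Mbps}}{8} = 12.5\ \text{MB/s (per-client NIC)}
```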

Page 37: June 2013 Large Scale File Systems by Dale Denis.


GFS Measurements - Writes

• N clients writing simultaneously to N distinct files.
  • The theoretical limit peaks at an aggregate of 67 MB/s because each byte has to be written to 3 of the 16 chunk servers.
  • Observed write rate of 6.3 MB/s. This slow rate was attributed to issues with the network stack that didn’t interact well with the GFS pipelining scheme.
  • In practice this has not been a problem.
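The 67 MB/s aggregate limit comes from the chunk servers' 12.5 MB/s input links combined with 3x replication:

```latex
\frac{16\ \text{chunk servers} \times 12.5\ \text{MB/s}}{3\ \text{replicas per byte}} \approx 67\ \text{MB/s}
```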

Page 38: June 2013 Large Scale File Systems by Dale Denis.


GFS Measurements - Appends

• N clients append simultaneously to a single file.
  • The performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file.
  • As the number of clients increases, the congestion on those chunk servers also increases.

Page 39: June 2013 Large Scale File Systems by Dale Denis.


The Hadoop Distributed File System (HDFS)

• The Hadoop Distributed File System: The open source distributed file system for large data sets that is based upon the Google File System.

• As with GFS, HDFS is a distributed file system that is designed to run on commodity hardware.

• The HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

• The HDFS is not a general purpose file system.

Page 40: June 2013 Large Scale File Systems by Dale Denis.


HDFS Design Goals and Assumptions

• Hardware failure is the norm.
  • The detection of faults and quick, automatic recovery is a core architectural goal.

• Large Data Sets.
  • Applications that run on HDFS have large data sets.
  • A typical file is gigabytes to terabytes in size.
  • HDFS should support tens of millions of files.

• Streaming Data Access.
  • Applications that run on HDFS need streaming access to their data sets.
  • The emphasis is on high throughput of data access rather than low latency.

Page 41: June 2013 Large Scale File Systems by Dale Denis.


HDFS Design Goals and Assumptions

• Simple Coherency Model.
  • Once a file has been written, it cannot be changed.
  • There is a plan to support appending writes in the future.

• Portability across heterogeneous hardware and software platforms.
  • In order to facilitate the adoption of HDFS.
  • Hadoop is written in Java.

• Provide interfaces for applications to move themselves closer to where the data is.
  • “Moving computation is cheaper than moving data.”

Page 42: June 2013 Large Scale File Systems by Dale Denis.


HDFS Architecture

• An HDFS cluster consists of:
  • A NameNode.
  • Multiple DataNodes.
• Files are divided into fixed-size blocks.
  • The blocks are 64 MB by default; this is configurable.
  • The goal is to minimize the cost of seeks (see the worked example below).
    • Seek time should be about 1% of transfer time.
  • As transfer speeds increase, the block size can be increased.
  • Block sizes that are too big will cause MapReduce jobs to run slowly.
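A rough worked example of the 1% rule, assuming a 10 ms seek time and a 100 MB/s transfer rate (typical 2010-era figures; these numbers are assumptions, not from the slides):

```latex
\text{block size} \approx \text{transfer rate} \times \frac{\text{seek time}}{0.01}
  = 100\ \text{MB/s} \times \frac{10\ \text{ms}}{0.01} = 100\ \text{MB}
```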

Page 43: June 2013 Large Scale File Systems by Dale Denis.


HDFS Architecture

• Writing data to HDFS.
  • No control messages to or from the DataNodes.
  • No concern for serialization.

Page 44: June 2013 Large Scale File Systems by Dale Denis.


HDFS Network Topology

• Ideally bandwidth between nodes should be used to determine distance.

• In practice measuring bandwidth between nodes is difficult.

• HDFS assumes that bandwidth becomes progressively less in each of the following scenarios (see the sketch below):
  • Processes on the same node.
  • Different nodes on the same rack.
  • Nodes on different racks in the same data center.
  • Nodes in different data centers.
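Hadoop turns this ordering into a simple tree distance of 0, 2, 4, and 6 for the four scenarios above. A minimal sketch of the idea, assuming location strings of the form "/datacenter/rack/node" (the real implementation lives in org.apache.hadoop.net.NetworkTopology):

```java
// Illustrative tree-distance calculation; not Hadoop's actual NetworkTopology code.
public class DistanceSketch {
    // Locations are assumed to look like "/d1/r1/n1" (data center / rack / node).
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int depth = pa.length;                       // three levels in this sketch
        int common = 0;
        while (common < depth && pa[common].equals(pb[common])) {
            common++;
        }
        return 2 * (depth - common);                 // hops up to the common ancestor and back down
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center, different rack
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}
```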

Page 45: June 2013 Large Scale File Systems by Dale Denis.


HDFS Network Topology

• By default HDFS assumes that all nodes are on the same rack in the same data center.

• An XML configuration script is used to map nodes to locations.

Page 46: June 2013 Large Scale File Systems by Dale Denis.


HDFS Replica Placement

• Trade-off between reliability, write bandwidth, and read bandwidth.

• Placing all replicas on nodes in different data centers provides high redundancy at the cost of high write bandwidth.

Page 47: June 2013 Large Scale File Systems by Dale Denis.


HDFS Replica Placement

• First replica goes on the same node as the client.

• Second replica goes on a different rack, selected at random.

• The third replica is placed on the same rack as the second but a different node is chosen.

• Further replicas are placed on nodes selected at random from the cluster.

• Nodes are selected that are not too busy or full.
  • The system avoids placing too many replicas on the same rack.

Page 48: June 2013 Large Scale File Systems by Dale Denis.


HDFS Permissions

• Based upon the POSIX model but does not provide strong security for HDFS files.

• Is designed to prevent accidental corruption or misuse of information.

• Each file and directory is associated with an owner and a group.

• For files there are separate permissions to read, write or append to the file.

• For directories there are separate permissions to create or delete files or directories.

• Permissions are new to HDFS; adding mechanisms such as Kerberos authentication in order to establish user identity is planned for the future.

Page 49: June 2013 Large Scale File Systems by Dale Denis.


HDFS Accessibility

• HDFS provides a Java API (see the example below).
  • A JNI-based wrapper, libhdfs, has been developed that allows C/C++ applications to work with the Java API.

• Work is underway to expose HDFS through the WebDAV protocol.
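A small example of the Java API in the Hadoop 1.x style, reading a file from HDFS and copying it to standard output; the path is hypothetical, and the cluster location comes from the usual XML configuration files:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // connects to the configured NameNode
        Path path = new Path("/user/example/data.txt");  // hypothetical HDFS path

        FSDataInputStream in = fs.open(path);
        try {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, bytesRead);
            }
        } finally {
            in.close();
            fs.close();
        }
    }
}
```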

Page 50: June 2013 Large Scale File Systems by Dale Denis.


HDFS Web Interface

• The main web interface is exposed on the NameNode at port 50070.

• Contains an overview about the health, capacity and usage of the cluster.

• Each DataNode also has a web interface at port 50075.

• Logfiles generated by the Hadoop daemons can be accessed through this interface.

• Very useful for distributed debugging and troubleshooting.

Page 51: June 2013 Large Scale File Systems by Dale Denis.


HDFS Configuration

• The scripts to manage a Hadoop cluster were written in the UNIX shell scripting language.

• The HDFS configuration is located in a set of XML files.

• Hadoop can run under Windows.
  • Requires Cygwin.
  • Not recommended for production environments.

• Hadoop can be installed on a single machine.
  • Standalone – HDFS is not used.
  • Pseudo-distributed – a functioning NameNode/DataNode is installed.
  • Memory intensive.

Page 52: June 2013 Large Scale File Systems by Dale Denis.


Hadoop

• Who uses Hadoop?
  • Amazon
  • Adobe
  • Cloudspace
  • eBay
  • Facebook
  • IBM
  • LinkedIn
  • The New York Times
  • Twitter
  • Yahoo!
  • …

Page 53: June 2013 Large Scale File Systems by Dale Denis.


Microsoft & Hadoop

• Microsoft Data Management Solutions:
  • Relational Data
    • Microsoft SQL Server 2012
  • Non-Relational Data
    • Hadoop for Windows Server and Windows Azure
      • An Apache Hadoop-based distribution for Windows developed in partnership with Hortonworks Inc.

Page 54: June 2013 Large Scale File Systems by Dale Denis.


Microsoft & Hadoop

Page 55: June 2013 Large Scale File Systems by Dale Denis.


Conclusion

• The current technological environment has created the need for new Large Scale File Systems:
  • Big data.
  • Unstructured data.
  • The use of inexpensive commodity hardware.

• Large Scale File Systems:
  • The proprietary Google File System.
  • The open source Hadoop Distributed File System.

Page 56: June 2013 Large Scale File Systems by Dale Denis.


References

• Barroso, Luiz André, Jeffrey Dean, and Urs Hölzle. “Web Search for a Planet: The Google Cluster Architecture.” IEEE Micro, vol. 23, no. 2, March–April 2003.

• Callaghan, Brent. NFS Illustrated. Reading, MA: Addison-Wesley Professional, 2000.

• Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google File System.” SOSP ’03, October 19–22, 2003.

• Stross, Randall. Planet Google: One Company’s Audacious Plan to Organize Everything We Know. New York: Free Press, 2008.

• White, Tom. Hadoop: The Definitive Guide. Sebastopol, CA: O’Reilly Media, Inc., 2009.

Page 57: June 2013 Large Scale File Systems by Dale Denis.


References

• May 2013. <http://developer.yahoo.com/hadoop/tutorial/module2.html>

• May 2013. <http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html#Introduction>

• May 2012. <http://strata.oreilly.com/2012/01/microsoft-big-data.html>

• May 2013. <www.Microsoft.com> Microsoft_Big_Data_Booklet.pdf