
Distributed File System: Data Storage for Networks Large and Small

Pei Cao
Cisco Systems, Inc.

Review: DFS Design Considerations

1. Name space construction

2. AAA (authentication, authorization, accounting)

3. Operation batching

4. Client caching

5. Data consistency

6. Locking

Summing it Up: CIFS as an Example

• Network transport in CIFS
  – Uses SMB (Server Message Block) messages over a reliable connection-oriented transport
    • TCP
    • NetBIOS over TCP
  – Uses persistent connections called “sessions”
    • If a session is broken, the client does the recovery (see the sketch below)
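As a rough illustration of that last point, here is a minimal Python sketch of a client that owns the recovery of a broken session; the Session class, the reestablish_state() step, and the retry policy are hypothetical, not actual SMB library calls:

```python
import socket
import time

# Hypothetical sketch of CIFS-style session recovery: the client holds the
# persistent connection and reconnects if it breaks.
class Session:
    def __init__(self, host, port):
        self.host, self.port = host, port
        self.sock = None

    def connect(self):
        self.sock = socket.create_connection((self.host, self.port))
        self.reestablish_state()

    def reestablish_state(self):
        pass  # placeholder: renegotiate, re-open handles, re-acquire locks

    def send(self, msg: bytes, retries: int = 3):
        for attempt in range(retries):
            try:
                if self.sock is None:
                    self.connect()
                self.sock.sendall(msg)  # one SMB message over the session
                return
            except OSError:
                self.sock = None        # broken session: client recovers
                time.sleep(2 ** attempt)
        raise ConnectionError("session could not be recovered")
```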

Design Choices in CIFS

• Name space construction:
  – Per-client linkage, multiple methods for server resolution
    • file://fs.xyz.com/users/alice/stuff.doc
    • \\cifsserver\users\alice\stuff.doc
    • E:\stuff.doc
  – CIFS also offers a “redirection” method
    • A share can be replicated on multiple servers or moved
    • Client open → server replies “STATUS_DFS_PATH_NOT_COVERED” → client issues “TRANS2_DFS_GET_REFERRAL” → server replies with the new server (see the sketch below)
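A toy sketch of this referral exchange, using the status and command names from the slide; open_on_server() and get_referral() are hypothetical stand-ins for the real SMB messages:

```python
# Toy sketch of the DFS referral exchange described above.
def dfs_open(path, server, open_on_server, get_referral):
    status, handle = open_on_server(server, path)
    if status == "STATUS_DFS_PATH_NOT_COVERED":
        # Client issues TRANS2_DFS_GET_REFERRAL; the server replies with
        # the server that now hosts the (moved or replicated) share.
        new_server = get_referral(server, path)
        status, handle = open_on_server(new_server, path)
    return status, handle

# Example: the share moved from fs1 to fs2 (all names are placeholders).
opens = {"fs2": ("STATUS_SUCCESS", "fid-7")}
probe = lambda srv, p: opens.get(srv, ("STATUS_DFS_PATH_NOT_COVERED", None))
print(dfs_open(r"\\cifsserver\users\alice\stuff.doc", "fs1",
               probe, lambda srv, p: "fs2"))
```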

Design Choices in CIFS

• AAA: Kerberos
  – Older systems use NTLM
• Operation batching: supported
  – These commands have “AndX” variations: TREE_CONNECT, OPEN, CREATE, READ, WRITE, LOCK
  – The server implicitly takes the results of preceding operations as input for subsequent operations
  – The first command that encounters an error stops all subsequent processing in the batch (see the sketch below)
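A minimal sketch of that batching behavior: each command's result feeds the next, and the first error short-circuits the rest. The command callables are placeholders, not real SMB dispatch routines:

```python
# AndX-style batching: chain results, stop at the first error.
def run_andx_batch(commands, initial=None):
    """commands: callables taking the previous result and
    returning (status, result); status 0 means success."""
    result, results = initial, []
    for cmd in commands:
        status, result = cmd(result)          # implicit chaining of results
        results.append((status, result))
        if status != 0:                       # first error aborts the rest
            break
    return results

# Example: OPEN returns a file id, which READ consumes.
batch = [
    lambda _: (0, "fid-42"),                  # OPEN
    lambda fid: (0, f"data read via {fid}"),  # READ, using OPEN's result
]
print(run_andx_batch(batch))
```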

Design Choices in CIFS

• Client caching
  – Caches both file data and file metadata; write-back cache; can read ahead
  – Offers strong cache consistency using an invalidation-based approach
• Data access consistency
  – Oplocks: similar to “tokens” in AFS v3
    • “Level II oplock”: read-only data lock
    • “Exclusive oplock”: exclusive read/write data lock
    • “Batch oplock”: exclusive read/write “open” lock plus data lock plus metadata lock
  – Transitions among the oplock levels (sketched below)
  – Observation: can have a hierarchy of lock managers
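A rough sketch of server-side oplock bookkeeping, simplified from the real protocol: the sole opener gets an exclusive oplock, and a second opener forces everyone down to level II (read-only caching). The names and break mechanics here are illustrative:

```python
# Simplified per-file oplock state on the server.
class OplockState:
    NONE, LEVEL_II, EXCLUSIVE, BATCH = "none", "level2", "exclusive", "batch"

    def __init__(self):
        self.holders = {}          # client -> oplock level

    def request_open(self, client):
        if not self.holders:       # sole opener may cache reads and writes
            self.holders[client] = self.EXCLUSIVE
        else:
            # Break existing exclusive/batch oplocks: those holders must
            # flush dirty data, then everyone drops to read-only caching.
            for c, level in self.holders.items():
                if level in (self.EXCLUSIVE, self.BATCH):
                    self.holders[c] = self.LEVEL_II   # oplock break sent to c
            self.holders[client] = self.LEVEL_II
        return self.holders[client]

f = OplockState()
print(f.request_open("alice"))   # exclusive
print(f.request_open("bob"))     # level2 (alice broken to level2 as well)
```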

Design Choices in CIFS

• File and data record locking
  – Offers “shared” (read-only) and “exclusive” (read/write) locks
  – Part of the file system; mandatory
  – Can lock either a whole file or a byte range within the file
  – A lock request can specify a timeout for waiting
  – Enables atomic writes via “AndX” batching with writes
    • “Lock/write/unlock” as a batched command sequence (a POSIX analogue is sketched below)
• Additional capability: “directory change notification”
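For comparison, a POSIX analogue of the lock/write/unlock sequence (assuming a Unix system). Note the semantic gap: fcntl byte-range locks are advisory, whereas CIFS locks are mandatory, and CIFS can ship the three steps as one batched exchange:

```python
import fcntl
import os

# Lock a byte range, write under the lock, then release (Unix only).
fd = os.open("/tmp/demo.dat", os.O_RDWR | os.O_CREAT, 0o600)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX, 100, 0)   # exclusive lock, bytes 0-99
    os.pwrite(fd, b"atomic update", 0)       # write while holding the lock
finally:
    fcntl.lockf(fd, fcntl.LOCK_UN, 100, 0)   # release the byte range
    os.close(fd)
```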

DFS for Mobile Networks

• What properties of a DFS are desirable?
  – Handle frequent connection and disconnection
  – Enable clients to operate in a disconnected state for an extended period of time
  – Provide ways to resolve/merge conflicts

Design Issues for DFS in Mobile Networks

• What should be kept in the client cache?
• How to update the client cache copies with changes made on the server?
• How to upload changes made by the client to the server?
• How to resolve conflicts when more than one client changes a file during the disconnected state?

Example System: Coda

• Client cache content:
  – Users can specify which directories should always be cached on the client
  – Also caches recently used files
  – Cache replacement: walk over the cached items every 10 minutes to reevaluate their priorities (see the sketch below)
• Updates from server to client:
  – The server keeps a log of callbacks that could not be delivered and delivers them upon client reconnection
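A sketch of what such a priority walk might look like; the priority formula and weights are illustrative guesses, not Coda's actual policy:

```python
import time

# Coda-style hoard walk sketch: each cached item's priority mixes the
# user's hoard preference with recency of use; the lowest-priority items
# become eviction candidates.
def hoard_walk(cache, now=None):
    now = now or time.time()
    for item in cache:
        recency = 1.0 / (1.0 + (now - item["last_used"]))   # decays with age
        item["priority"] = item["hoard_pref"] * 100 + recency * 10
    cache.sort(key=lambda i: i["priority"])   # eviction candidates first

cache = [
    {"name": "/coda/usr/alice", "hoard_pref": 5, "last_used": time.time() - 60},
    {"name": "/tmp/scratch",    "hoard_pref": 0, "last_used": time.time() - 3600},
]
hoard_walk(cache)   # in Coda, this walk runs roughly every 10 minutes
print([i["name"] for i in cache])
```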

Coda File System

• Uploading changes from client to server
  – The client has to keep a “replay log”
  – Contents of the “replay log”
  – Ways to reduce the “replay log” size (one approach is sketched below)
• Handling conflicts
  – Detecting conflicts
  – Resolving conflicts
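One plausible compaction rule, sketched under simplifying assumptions (a flat log of (operation, path) records, with removes ordered after creates): a file created and then removed while disconnected drops out entirely, and only the last STORE per file survives:

```python
# Sketch of replay-log compaction; the record format is made up.
def compact_replay_log(log):
    created = {p for op, p in log if op == "CREATE"}
    removed = {p for op, p in log if op == "REMOVE"}
    cancelled = created & removed          # created then removed offline
    last_store = {}
    for i, (op, p) in enumerate(log):
        if op == "STORE":
            last_store[p] = i
    out = []
    for i, (op, p) in enumerate(log):
        if p in cancelled:
            continue                       # drop the whole chain for p
        if op == "STORE" and i != last_store[p]:
            continue                       # only the final STORE matters
        out.append((op, p))
    return out

log = [("CREATE", "a"), ("STORE", "a"), ("STORE", "a"), ("REMOVE", "a"),
       ("STORE", "b"), ("STORE", "b")]
print(compact_replay_log(log))             # only the last STORE of "b" survives
```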

Performance Issues in File Servers

• Components of server load
  – Network protocol handling
  – File system implementation
  – Disk accesses
• Read operations
  – Metadata
  – Data
• Write operations
  – Metadata
  – Data
• Workload characterization

DFS for High-Speed Networks: DAFS

• Proposal from Network Appliance and other companies
• Goal: eliminate memory copies and protocol processing
  – Standard implementation: network buffers → file system buffer cache → user-level application buffers
• Designed to take advantage of RDMA (“Remote DMA”) network protocols
  – The network transport provides direct memory-to-memory transfer
  – Protocol processing is provided in hardware
• Suitable for high-bandwidth, low-error-rate, low-latency networks

DAFS Protocol

• Data read from the client:
  – RDMA request from the server copies file data directly into the application buffer
• Data write from the client:
  – RDMA request from the server copies the application buffer into server memory
• Implementation:
  – As a library linked to the user application, interfacing with the RDMA network library directly
    • Eliminates two data copies (a loose user-space analogy is sketched below)
  – As a new file system implementation in the kernel
    • Eliminates one data copy
• Performance advantage:
  – Example: ~90 µs/op in NFS vs. ~25 µs/op in DAFS
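Not DAFS itself, but a loose analogy for the copy elimination (assuming a POSIX system): the first read hands back a freshly allocated buffer, while the second deposits data directly into a preallocated application buffer, the way an RDMA transfer would:

```python
import os
import tempfile

# Illustration only: contrast an allocate-and-copy read with reading
# straight into a preallocated application buffer.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello dafs")
fd = os.open(f.name, os.O_RDONLY)

data = os.read(fd, 4096)            # kernel -> newly allocated bytes object

app_buffer = bytearray(4096)        # stands in for a preregistered RDMA buffer
n = os.preadv(fd, [app_buffer], 0)  # filled in place; no extra user-space copy
print(n, bytes(app_buffer[:n]))
os.close(fd)
```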

DAFS Features

• Session-based

• Offers authentication of client machines

• Flow control by server

• Stateful lock implementation with leases

• Offers atomic writes

• Offers operation batching

Clustered File Servers

• Goal: scalability in file service
  – Build a high-performance file service out of a collection of cheap file servers
• Methods for partitioning the workload (two of them are contrasted in the sketch below):
  – Each server can support one “subtree”
    • Advantages
    • Disadvantages
  – Each server can support a group of clients
    • Advantages
    • Disadvantages
  – Client requests are sent to servers in round-robin or load-balanced fashion
    • Advantages
    • Disadvantages
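A toy contrast of two of these partitioning schemes: routing by top-level subtree keeps a directory tree on one server (good locality, but possible hot spots), while hashing whole paths spreads load but scatters a directory across servers. Server names are placeholders:

```python
import hashlib

SERVERS = ["fs0", "fs1", "fs2"]

def route_by_subtree(path):
    top = path.strip("/").split("/")[0]           # /users/... -> "users"
    idx = int(hashlib.md5(top.encode()).hexdigest(), 16) % len(SERVERS)
    return SERVERS[idx]                           # whole subtree on one server

def route_by_hash(path):
    idx = int(hashlib.md5(path.encode()).hexdigest(), 16) % len(SERVERS)
    return SERVERS[idx]                           # spreads load, loses locality

print(route_by_subtree("/users/alice/stuff.doc"))
print(route_by_hash("/users/alice/stuff.doc"))
```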

Non-Subtree-Partition Clustered File Servers

• Design issues
  – On which disks should the data be stored?
  – Management of the memory caches in the file servers
  – Data consistency management
    • Metadata operation consistency
    • Data operation consistency
  – Server failure management
    • Fault tolerance for a single server failure
    • Fault tolerance for disk failures

Mapping Between Disks and Servers

• Direct-attached disks
• Network-attached disks
  – Fibre Channel attached disks
  – iSCSI attached disks
• Managing the network-attached disks: the “volume manager”

Functionalities of a Volume Manager

• Groups multiple disk partitions into a “logical” disk volume
• A volume can expand or shrink in size without affecting existing data
• A volume can be RAID-0/1/5, tolerating disk failures (a RAID-0 mapping is sketched below)
• A volume can offer “snapshot” functionality for easy backup
• Volumes are “self-evident”: configuration metadata is stored on the disks themselves
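A minimal sketch of the kind of mapping a volume manager maintains, here for a RAID-0 (striped) volume: logical block numbers are translated to (disk, physical block). The disk names and stripe size are illustrative:

```python
# RAID-0 logical-to-physical block mapping sketch.
STRIPE_BLOCKS = 64                  # blocks per stripe unit
DISKS = ["sda1", "sdb1", "sdc1"]

def logical_to_physical(lbn):
    stripe = lbn // STRIPE_BLOCKS               # which stripe unit
    offset = lbn % STRIPE_BLOCKS                # offset within the unit
    disk = DISKS[stripe % len(DISKS)]           # round-robin across disks
    pbn = (stripe // len(DISKS)) * STRIPE_BLOCKS + offset
    return disk, pbn

print(logical_to_physical(0))       # ('sda1', 0)
print(logical_to_physical(64))      # ('sdb1', 0)
print(logical_to_physical(200))     # ('sda1', 72)
```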

Implementations of Volume Manager

• In-kernel implementation
  – Examples: the Linux volume manager (LVM), Veritas Volume Manager, etc.
• Disk server implementation
  – Example: EMC storage systems

Serverless File Systems

• Serverless file systems in a WAN
  – Motivation: peer-to-peer storage; never lose the file
• Serverless file systems in a LAN
  – Motivation: clients are powerful enough to act like servers; use all clients’ memory to cache file data