GMount: An Ad-hoc and Locality-Aware Distributed File System by using SSH and FUSE
Nan Dun, Kenjiro Taura, Akinori Yonezawa
Graduate School of Information Science and Technology, The University of Tokyo
CCGrid 2009, Shanghai, China, May 20, 2009
Today You May Have
Computing resources across different administration domains
◦ InTrigger (JP), Tsubame (JP), T2K-Tokyo (JP)
◦ Grid5000 (FR), D-Grid (DE), INFN Grid (IT), National Grid Services (UK)
◦ Open Science Grid (US)
Workloads to run on all the available resources
◦ Finding supernovae
◦ Gene decoding
◦ Weather simulation, etc.
Scenario I
How do you share your data among arbitrary machines across different domains?
Ways of Sharing
Option 1: Staging your data
◦ Too troublesome: SCP, FTP, GridFTP, etc.
Option 2: Conventional DFSs
◦ Ask your administrators!
  Which one? NFS, OpenAFS, PVFS, GPFS, Lustre, GoogleFS, Gfarm
  Only for you? Believe me, they won't do so
  Quota, security, policy? Headaches...
  Configure and install, even if admins are supposed to do their job...
Option 3: GMount
◦ Build a DFS by yourself, on the fly!
Scenario II
You have many clients/resources, and you want more servers.
Ways of Scaling
Option 1: Conventional DFSs
◦ File servers are fixed at deploy time
  Fixed number of MDS (metadata servers)
  Fixed number of DSS (data storage servers)
◦ Ask your administrators again to append more DSS
Option 2: GMount
◦ No metadata server
◦ File servers scale with the clients: the more nodes you have, the more DSS you have
◦ Especially beneficial if your workloads prefer a large amount of local writes
Scenario III
What happens when clients access nearby files in a wide-area environment?
File Lookup in Wide-Area
High latency: DFSs with a central MDS
◦ The central MDS is far away from some clients
Locality-aware: GMount
◦ Search nearby nodes first
◦ Send a high-latency message only if the target file cannot be found locally
Impression of Usage
Prerequisites
1. You can SSH login to some nodes
2. Each node has an export directory containing the data you want to share
3. Specify a mountpoint via which the DFS can be accessed
◦ Simply make an empty directory on each node
Impression of Usage
Just one command and you are done:
◦ gmnt /export/directory /mountpoint
◦ GMount creates a DFS at the mountpoint: a union of all export directories that can be mutually accessed by all nodes
[Figure: Mutual access example. Host001 exports dat1 and Host002 exports dat2, dat3 and dat4 under dir1 and dir2. After mounting, /mount on both hosts shows the union dat1, dat2, dat3, dat4.]
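A hypothetical session on the two hosts from the figure; the gmnt invocation follows the slides, the listing output is only illustrative, and it assumes both hosts have already been grabbed by GXP.
Host001$ gmnt /export /mount
Host002$ ls -R /mount
# dat1, dat2, dat3 and dat4 all appear under /mount on both hosts:
# the union of the two export directories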
Enabling Techniques
Building blocks
◦ FUSE, SSHFS and SSHFS-MUX
  To create a basic userspace file system
  To utilize existing SSH authentication and data transfer features
◦ Grid and cluster shell (GXP)
  To efficiently execute commands in parallel
Core ideas
◦ Scalable All-Mount-All algorithm
  To let all nodes hierarchically and simultaneously share with each other
◦ Locality-aware optimization
  To make file access aware of closer files
FUSE and SSHFS Magic
FUSE [fuse.sf.net]
◦ Framework for quickly building a userspace file system
◦ Widely available (Linux kernel version > 2.6.14)
SSHFS [fuse.sf.net/sshfs.html]
◦ Manipulate files on remote hosts as local files
◦ $ sshfs myhost.net:/export /mount
◦ Limitation: can only mount one host at a time
FUSE and SSHFS Magic (cont.)
SSHFS-MUX: manipulate multiple hosts simultaneously
◦ A$ sshfsm B:/export C:/export /mount
◦ Priority lookup: e.g., C:/export will be accessed before B:/export
[Figure: A runs sshfsm B:/export C:/export /mount; B's /export and C's /export each contain files under dir1 and dir2 (dat1, dat2, dat3), and A's /mount shows their union.]
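A hypothetical session illustrating the priority lookup; the file name dup.txt is invented for this example, and the behavior follows the rule above: the branch listed later (C:/export) is checked first.
A$ sshfsm B:/export C:/export /mount
A$ cat /mount/dup.txt
# If dup.txt exists under both B:/export and C:/export, A reads C's copy,
# because C:/export was listed later and therefore has lookup priority.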
Problem Setting
INPUT: the export directory at each node (e.g. 3 nodes), containing the data to export at /export: E1, E2, E3
OUTPUT: the DFS mount directory at each node, mounted at /mount: M1, M2, M3, such that M1 = M2 = M3 = E1 ∪ E2 ∪ E3
A Straightforward Approach
Every node mounts every node's export, so M1 = M2 = M3 = E1 ∪ E2 ∪ E3.
Execution example for 3 nodes:
1$ sshfsm 1:/export 2:/export 3:/export /mount
2$ sshfsm 1:/export 2:/export 3:/export /mount
3$ sshfsm 1:/export 2:/export 3:/export /mount
What if we have 100 nodes? Scalability!
Scalable Approach: Phase I
Phase I: One-Mount-All
◦ The root node (node 1) mounts the exports of all nodes: M1 = E1 ∪ E2 ∪ E3
◦ 1$ sshfsm 1:/export 2:/export 3:/export /mount
Scalable Approach: Phase II
Phase II: All-Mount-One
◦ Every other node mounts the root's mountpoint: M2 = M3 = M1 = E1 ∪ E2 ∪ E3
◦ 2$ sshfsm 1:/mount /mount
◦ 3$ sshfsm 1:/mount /mount
Comparison
Straightforward vs. scalable approach (3 nodes):
◦ Connections: 9 (N^2) vs. 4 (O(K log_K N))
◦ SSH daemons on each node: 3 (N) vs. 2 (K)
◦ K is the number of children in the spanning tree
Further Optimization: Locality-Aware Lookup
Without the optimization:
1$ sshfsm 1:/export 2:/export 3:/export /mount
2$ sshfsm 1:/mount /mount
3$ sshfsm 1:/mount /mount
With locality-aware lookup, each non-root node also mounts its own export, listed last so that it is checked first:
1$ sshfsm 1:/export 2:/export 3:/export /mount
2$ sshfsm 1:/mount 2:/export /mount
3$ sshfsm 1:/mount 3:/export /mount
Recursively and Hierarchically Constructing
Hierarchical grouping, sharing and lookup
• Nodes share with each other at the same level
• Each group exports its union to the upper level
• File lookup happens in the local group first
• Lookup goes upward only if the file is not found locally
How to Execute Many Mounts in Parallel?
Grid and Cluster Shell (GXP) [Taura '04]
◦ Simultaneously operates hundreds of nodes
◦ Scalable and efficient
◦ Works across different administration domains
◦ Install on one node; it deploys itself to all nodes
◦ Also a useful tool for daily Grid interaction: a programmable parallel execution framework
In GMount: used to efficiently execute SSHFS-MUX in parallel on many nodes
Summary of GMount Execution
1. Grab nodes with GXP
◦ Assign the starting node as master, the others as workers
2. The master gathers information and makes a mount plan
◦ Gets the number of nodes
◦ Gets the information of each node
◦ Builds a spanning tree, makes a mount plan, and sends it to the workers
3. Execute the plan (a shell sketch of the resulting mounts follows below)
◦ Workers execute the mount plan and send results back to the master
◦ The master aggregates the results and reports to the user
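The plan and its execution can be pictured with a small shell sketch. This is not GMount's actual implementation (GMount builds a K-ary spanning tree and issues the mounts in parallel through GXP); it is a sequential approximation for a flat host list with the first host as the root, and the hostnames are illustrative.
hosts="node1 node2 node3 node4"
root="node1"
# Phase I (One-Mount-All): the root mounts every node's export
branches=""
for h in $hosts; do branches="$branches $h:/export"; done
ssh "$root" sshfsm $branches /mount
# Phase II (All-Mount-One): every other node mounts the root's union,
# with its own export listed last so that local files are looked up first
for h in $hosts; do
  [ "$h" = "$root" ] && continue
  ssh "$h" sshfsm "$root":/mount "$h":/export /mount
done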
Deal with Real Environments
Utilize network topology information
◦ Group nodes based on implicit/explicit network affinity
  Using IP address affinity
  Using network topology information if available
NAT/Firewall
◦ Overcome by cascade mounts (see the sketch below)
  Specify gateways as the roots of internal nodes and cascade inside-outside traffic
[Figure: nodes inside a LAN behind a NAT/firewall are reached through a gateway node at the root of the internal subtree.]
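A rough sketch of a cascade mount through a gateway; hostnames are illustrative and not from the slides. Here gw is reachable from outside, while int1 and int2 sit behind the NAT/firewall: gw first builds the union of the internal exports, and an external node then mounts gw's mountpoint instead of the unreachable internal nodes.
gw$   sshfsm gw:/export int1:/export int2:/export /mount
ext1$ sshfsm gw:/mount ext1:/export /mount
# ext1 now sees the internal files through gw; inside-outside traffic is
# cascaded over the single connection to the gateway.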
Evaluation
Experimental environment
◦ InTrigger: a distributed cluster of clusters spanning 15 sites in Japan
Experiments
◦ Performance of the building block (SSHFS-MUX): I/O performance, metadata performance
◦ File system construction time vs. system size: mount time, unmount time
◦ I/O performance vs. spanning tree shape
◦ Metadata performance vs. fraction of local accesses
InTrigger Platform (http://www.intrigger.jp)
Over 300 nodes across 12 sites
◦ Representative platform for wide-area environments: heterogeneous wide-area links
◦ NAT enabled in 2 sites
Unified software environment
◦ Linux 2.6.18
◦ FUSE 2.7.3
◦ OpenSSH 4.3p2
◦ SSHFS-MUX 1.1
◦ GXP 3.03
[Figure: map of InTrigger sites with node counts per site.]
File System Construction Time
[Figure: mount and unmount times vs. number of sites, from 1 site (69 nodes) to 12 sites (329 nodes), for all nodes per site and for 4 nodes per site; time stays under 10 seconds.]
Less than 10 seconds to mount 329 nodes nation-wide.
Parallel I/O Performance
[Figure: aggregate I/O performance vs. number of concurrent clients (2 to 32) for GMount with K=4, 8, 16, compared with Gfarm-FUSE.]
• The limited SSH transfer rate is the primary bottleneck
• Performance also depends on the shape of the spanning tree
Metadata Operation Performance
Gfarm: a wide-area DFS with a central metadata server
◦ Clients first query the metadata server for file locations
◦ Clients may be distant from the metadata server
Locality awareness
◦ Clients prefer to access files stored in nodes close to them (within the same cluster/LAN)
◦ Percentage of local accesses:
  P_local = (number of local accesses) / (total number of accesses)
  where a local access is an access to a node within the same cluster/LAN
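For example, if 80 out of a client's 100 file accesses go to nodes within its own cluster/LAN, then P_local = 80/100 = 0.8.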
Metadata: GMount in WAN
[Figure: aggregate operation latency (sec) vs. percentage of local accesses (0.2 to 1.0) for metadata operations (mkdir, rmdir, open+close, stat EXIST, stat NONEXIST, utime, chmod u+x, unlink), compared with Gfarm in WAN and Gfarm in LAN.]
Locality awareness saves network latency.
Highlights
Conventional DFS vs. GMount:
◦ Resources: fixed within a domain, fixed at deploy time vs. ad hoc, scale on demand
◦ Volume quota: policy dependent vs. sum of local volumes
◦ Firewall: optional vs. OK
◦ Wide-area: potentially high-latency file lookup if using a central metadata server vs. distributed metadata and locality-aware file lookup
◦ Data persistence: permanent storage vs. on-demand sharing
◦ Data redundancy: yes vs. no
◦ Authentication: GSI, shared key, etc. vs. SSH
◦ Deploy privilege: administrator vs. user
◦ Prerequisites: kernel source, DB, etc. vs. SSH, FUSE, Python
◦ Enabling effort: weeks to months vs. minutes to hours
◦ Implementation: years vs. months
Future Work
SFTP limitations
◦ Not fully POSIX compatible: rename and link operations
◦ Limited receive buffer [Rapier et al. '08]: low data transfer rate over long-fat networks
◦ SFTP extended attribute support: piggybacking file location during lookup
Performance enhancement
◦ SSHFS-MUX local mount operation (done!)
Fault tolerance
◦ Tolerate connection drops
Available as OSS
SSHFS-MUX: http://sshfsmux.googlecode.com/
Grid and Cluster Shell (GXP): http://sourceforge.net/projects/gxp/
Thank You!