GMount: An Ad-hoc and Locality-Aware Distributed File System by using SSH and FUSE
Nan Dun, Kenjiro Taura, Akinori Yonezawa
Graduate School of Information Science and Technology, The University of Tokyo
CCGrid 2009, Shanghai, China, May 20, 2009
Today You May Have
Computing resources across different administration domains
◦ InTrigger (JP), Tsubame (JP), T2K-Tokyo (JP)
◦ Grid5000 (FR), D-Grid (DE), INFN Grid (IT), National Grid Services (UK)
◦ Open Science Grid (US)
Workloads to run on all the available resources
◦ Finding supernovae
◦ Gene decoding
◦ Weather simulation, etc.
Scenario I
How do you share your data among arbitrary machines across different domains?
Ways of Sharing
Option 1: Staging your data
◦ Too troublesome: SCP, FTP, GridFTP, etc.
Option 2: Conventional DFSs
◦ Ask your administrators!
  Which one? NFS, OpenAFS, PVFS, GPFS, Lustre, GoogleFS, Gfarm
  Only for you? Believe me, they won't do so
  Quota, security, policy? Headaches...
  Configure and install, even if admins are supposed to do their job...
Option 3: GMount
◦ Build a DFS by yourself, on the fly!
Scenario II
You have many clients/resources, and you want more servers.
Ways of Scaling
Option 1: Conventional DFSs
◦ File servers are fixed at deploy time
  Fixed number of MDS (metadata servers)
  Fixed number of DSS (data storage servers)
◦ Ask your administrators again to append more DSS
Option 2: GMount
◦ No metadata server
◦ File servers scale with the clients: the more nodes you have, the more DSS you have
◦ Especially beneficial if your workloads prefer a large amount of local writes
Scenario III
What happens when clients access nearby files in a wide-area environment?
File Lookup in Wide-Area
High latency: DFSs with a central MDS
◦ The central MDS is far away from some clients
Locality-aware: GMount
◦ Search nearby nodes first
◦ Send a high-latency message only if the target file cannot be found locally
Impression of Usage
Prerequisites
1. You can SSH login to some nodes
2. Each node has an export directory containing the data you want to share
3. Specify a mountpoint via which the DFS can be accessed
◦ Simply make an empty directory on each node
Impression of Usage
Just one command and you are done:
◦ gmnt /export/directory /mountpoint
◦ GMount creates a DFS at the mountpoint: a union of all export directories that can be mutually accessed by all nodes
[Figure: Mutual access example. Host001 exports dat1 and Host002 exports dat2, dat3 and dat4 under dir1 and dir2. After mounting, /mount on both hosts shows the union dat1, dat2, dat3, dat4.]
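A hypothetical session on the two hosts from the figure; the gmnt invocation follows the slides, the listing output is only illustrative, and it assumes both hosts have already been grabbed by GXP.
Host001$ gmnt /export /mount
Host002$ ls -R /mount
# dat1, dat2, dat3 and dat4 all appear under /mount on both hosts:
# the union of the two export directories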
Enabling Techniques
Building blocks
◦ FUSE, SSHFS and SSHFS-MUX
  To create a basic userspace file system
  To utilize existing SSH authentication and data transfer features
◦ Grid and cluster shell (GXP)
  To efficiently execute commands in parallel
Core ideas
◦ Scalable All-Mount-All algorithm
  To let all nodes hierarchically and simultaneously share with each other
◦ Locality-aware optimization
  To make file access aware of closer files
FUSE and SSHFS Magic
FUSE [fuse.sf.net]
◦ Framework for quickly building a userspace file system
◦ Widely available (Linux kernel version > 2.6.14)
SSHFS [fuse.sf.net/sshfs.html]
◦ Manipulate files on remote hosts as local files
◦ $ sshfs myhost.net:/export /mount
◦ Limitation: can only mount one host at a time
FUSE and SSHFS Magic (cont.)
SSHFS-MUX: manipulate multiple hosts simultaneously
◦ A$ sshfsm B:/export C:/export /mount
◦ Priority lookup: e.g., C:/export will be accessed before B:/export
[Figure: A runs sshfsm B:/export C:/export /mount; B's /export and C's /export each contain files under dir1 and dir2 (dat1, dat2, dat3), and A's /mount shows their union.]
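A hypothetical session illustrating the priority lookup; the file name dup.txt is invented for this example, and the behavior follows the rule above: the branch listed later (C:/export) is checked first.
A$ sshfsm B:/export C:/export /mount
A$ cat /mount/dup.txt
# If dup.txt exists under both B:/export and C:/export, A reads C's copy,
# because C:/export was listed later and therefore has lookup priority.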
Problem Setting
INPUT: the export directory at each node (e.g. 3 nodes), containing the data to export at /export: E1, E2, E3
OUTPUT: the DFS mount directory at each node, mounted at /mount: M1, M2, M3, such that M1 = M2 = M3 = E1 ∪ E2 ∪ E3
A Straightforward Approach
Every node mounts every node's export, so M1 = M2 = M3 = E1 ∪ E2 ∪ E3.
Execution example for 3 nodes:
1$ sshfsm 1:/export 2:/export 3:/export /mount
2$ sshfsm 1:/export 2:/export 3:/export /mount
3$ sshfsm 1:/export 2:/export 3:/export /mount
What if we have 100 nodes? Scalability!
Scalable Approach: Phase I
Phase I: One-Mount-All
◦ The root node (node 1) mounts the exports of all nodes: M1 = E1 ∪ E2 ∪ E3
◦ 1$ sshfsm 1:/export 2:/export 3:/export /mount
Scalable Approach: Phase II
Phase II: All-Mount-One
◦ Every other node mounts the root's mountpoint: M2 = M3 = M1 = E1 ∪ E2 ∪ E3
◦ 2$ sshfsm 1:/mount /mount
◦ 3$ sshfsm 1:/mount /mount
Comparison
Straightforward vs. scalable approach (3 nodes):
◦ Connections: 9 (N^2) vs. 4 (O(K log_K N))
◦ SSH daemons on each node: 3 (N) vs. 2 (K)
◦ K is the number of children in the spanning tree
Further Optimization: Locality-Aware Lookup
Without the optimization:
1$ sshfsm 1:/export 2:/export 3:/export /mount
2$ sshfsm 1:/mount /mount
3$ sshfsm 1:/mount /mount
With locality-aware lookup, each non-root node also mounts its own export, listed last so that it is checked first:
1$ sshfsm 1:/export 2:/export 3:/export /mount
2$ sshfsm 1:/mount 2:/export /mount
3$ sshfsm 1:/mount 3:/export /mount
Recursively and Hierarchically Constructing
Hierarchical grouping, sharing and lookup
• Nodes share with each other at the same level
• Each group exports its union to the upper level
• File lookup happens in the local group first
• Lookup goes upward only if the file is not found locally
How to Execute Many Mounts in Parallel?
Grid and Cluster Shell (GXP) [Taura '04]
◦ Simultaneously operates hundreds of nodes
◦ Scalable and efficient
◦ Works across different administration domains
◦ Install on one node; it deploys itself to all nodes
◦ Also a useful tool for daily Grid interaction: a programmable parallel execution framework
In GMount: used to efficiently execute SSHFS-MUX in parallel on many nodes
Summary of GMount Execution
1. Grab nodes with GXP
◦ Assign the starting node as master, the others as workers
2. The master gathers information and makes a mount plan
◦ Gets the number of nodes
◦ Gets the information of each node
◦ Builds a spanning tree, makes a mount plan, and sends it to the workers
3. Execute the plan (a shell sketch of the resulting mounts follows below)
◦ Workers execute the mount plan and send results back to the master
◦ The master aggregates the results and reports to the user
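The plan and its execution can be pictured with a small shell sketch. This is not GMount's actual implementation (GMount builds a K-ary spanning tree and issues the mounts in parallel through GXP); it is a sequential approximation for a flat host list with the first host as the root, and the hostnames are illustrative.
hosts="node1 node2 node3 node4"
root="node1"
# Phase I (One-Mount-All): the root mounts every node's export
branches=""
for h in $hosts; do branches="$branches $h:/export"; done
ssh "$root" sshfsm $branches /mount
# Phase II (All-Mount-One): every other node mounts the root's union,
# with its own export listed last so that local files are looked up first
for h in $hosts; do
  [ "$h" = "$root" ] && continue
  ssh "$h" sshfsm "$root":/mount "$h":/export /mount
done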
Deal with Real Environments
Utilize network topology information
◦ Group nodes based on implicit/explicit network affinity
  Using IP address affinity
  Using network topology information if available
NAT/Firewall
◦ Overcome by cascade mounts (see the sketch below)
  Specify gateways as the roots of internal nodes and cascade inside-outside traffic
[Figure: nodes inside a LAN behind a NAT/firewall are reached through a gateway node at the root of the internal subtree.]
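A rough sketch of a cascade mount through a gateway; hostnames are illustrative and not from the slides. Here gw is reachable from outside, while int1 and int2 sit behind the NAT/firewall: gw first builds the union of the internal exports, and an external node then mounts gw's mountpoint instead of the unreachable internal nodes.
gw$   sshfsm gw:/export int1:/export int2:/export /mount
ext1$ sshfsm gw:/mount ext1:/export /mount
# ext1 now sees the internal files through gw; inside-outside traffic is
# cascaded over the single connection to the gateway.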
Evaluation
Experimental environment
◦ InTrigger: a distributed cluster of clusters spanning 15 sites in Japan
Experiments
◦ Performance of the building block (SSHFS-MUX): I/O performance, metadata performance
◦ File system construction time vs. system size: mount time, unmount time
◦ I/O performance vs. spanning tree shape
◦ Metadata performance vs. fraction of local accesses
InTrigger Platform (http://www.intrigger.jp)
Over 300 nodes across 12 sites
◦ Representative platform for wide-area environments: heterogeneous wide-area links
◦ NAT enabled in 2 sites
Unified software environment
◦ Linux 2.6.18
◦ FUSE 2.7.3
◦ OpenSSH 4.3p2
◦ SSHFS-MUX 1.1
◦ GXP 3.03
[Figure: map of InTrigger sites with node counts per site.]
File System Construction Time
[Figure: mount and unmount times vs. number of sites, from 1 site (69 nodes) to 12 sites (329 nodes), for all nodes per site and for 4 nodes per site; time stays under 10 seconds.]
Less than 10 seconds to mount 329 nodes nation-wide.
Parallel I/O Performance
[Figure: aggregate I/O performance vs. number of concurrent clients (2 to 32) for GMount with K=4, 8, 16, compared with Gfarm-FUSE.]
• The limited SSH transfer rate is the primary bottleneck
• Performance also depends on the shape of the spanning tree
Metadata Operation Performance
Gfarm: a wide-area DFS with a central metadata server
◦ Clients first query the metadata server for file locations
◦ Clients may be distant from the metadata server
Locality awareness
◦ Clients prefer to access files stored in nodes close to them (within the same cluster/LAN)
◦ Percentage of local accesses:
  P_local = (number of local accesses) / (total number of accesses)
  where a local access is an access to a node within the same cluster/LAN
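For example, if 80 out of a client's 100 file accesses go to nodes within its own cluster/LAN, then P_local = 80/100 = 0.8.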
Metadata: GMount in WAN
[Figure: aggregate operation latency (sec) vs. percentage of local accesses (0.2 to 1.0) for metadata operations (mkdir, rmdir, open+close, stat EXIST, stat NONEXIST, utime, chmod u+x, unlink), compared with Gfarm in WAN and Gfarm in LAN.]
Locality awareness saves network latency.
Highlights
Conventional DFS vs. GMount:
◦ Resources: fixed within a domain, fixed at deploy time vs. ad hoc, scale on demand
◦ Volume quota: policy dependent vs. sum of local volumes
◦ Firewall: optional vs. OK
◦ Wide-area: potentially high-latency file lookup if using a central metadata server vs. distributed metadata and locality-aware file lookup
◦ Data persistence: permanent storage vs. on-demand sharing
◦ Data redundancy: yes vs. no
◦ Authentication: GSI, shared key, etc. vs. SSH
◦ Deploy privilege: administrator vs. user
◦ Prerequisites: kernel source, DB, etc. vs. SSH, FUSE, Python
◦ Enabling effort: weeks to months vs. minutes to hours
◦ Implementation: years vs. months
Future Work
SFTP limitations
◦ Not fully POSIX compatible: rename and link operations
◦ Limited receive buffer [Rapier et al. '08]: low data transfer rate over long-fat networks
◦ SFTP extended attribute support: piggybacking file location during lookup
Performance enhancement
◦ SSHFS-MUX local mount operation (done!)
Fault tolerance
◦ Tolerate connection drops
Available as OSS
SSHFS-MUX: http://sshfsmux.googlecode.com/
Grid and Cluster Shell (GXP): http://sourceforge.net/projects/gxp/
Thank You!