Cplant I/O. Pang Chen and Lee Ward, Sandia National Laboratories, Scalable Computing Systems. Fifth NASA/DOE Joint PC Cluster Computing Conference, October 6-8, 1999.
1
Cplant I/O
Pang Chen
Lee Ward
Sandia National Laboratories
Scalable Computing Systems
Fifth NASA/DOE Joint PC Cluster Computing Conference
October 6-8, 1999
2
Conceptual Partition Model
[Figure: diagram relating Users, /home, the Service and Compute partitions, and the Net I/O and File I/O services]
3
File I/O Model
• Support large-scale unstructured grid applications.
– Manipulate single file per application, not per processor.
• Support collective I/O libraries.
– Require fast concurrent writes to a single file.
4
Problems
• Need a file system NOW!
• Need scalable, parallel I/O.
• Need file management infrastructure.
• Need to present the I/O subsystem as a single parallel file system both internally and externally.
• Need production-quality code.
5
Approaches
• Provide independent access to file systems on each I/O node.
– Can’t stripe across multiple I/O nodes to get better performance.
• Add a file management layer to “glue” the independent file systems so as to present a single file view.
– Require users (both on and off Cplant) to differentiate between this “special” file system and other “normal” file systems.
– Lots of special utilities are required.
• Build our own parallel file system from scratch.
– A lot of work just to reinvent the wheel, let alone the right wheel.
• Port other parallel file systems into Cplant.
– Also a lot of work with no immediate payoff.
6
Current Approach
• Build our I/O partition as a scalable nexus between Cplant and external file systems.
+ Leverage existing and future parallel file systems.
+ Allow immediate payoff, with Cplant accessing existing file systems.
+ Reduce data storage, copies, and management.
– Expect lower performance with non-local file systems.
– Waste external bandwidth when accessing scratch files.
7
Building the Nexus
• Semantics
– How can and should the compute partition use this service?
• Architecture
– What are the components, and what protocols run between them?
• Implementation
– What do we have now, and what do we hope to achieve in the future?
8
Compute Partition Semantics
• POSIX-like.
– Allow users to be in a familiar environment.
• No support for ordered operations (e.g., no O_APPEND).
• No support for data locking.
– Enable fast non-overlapping concurrent writes to a single file.
– Prevent a job from slowing down the entire system for others.
• Additional call to invalidate the buffer cache.
– Allow file views to synchronize when required.
9
Cplant I/O
[Figure: compute nodes route file I/O through a bank of I/O nodes to Enterprise Storage Services]
10
Architecture
• I/O nodes present a symmetric view.
– Every I/O node behaves the same (except for the cache).
– Without any control, a compute node may open a file with one I/O node and write that file via another I/O node.
• I/O partition is fault-tolerant and scalable.
– Any I/O node can go down without the system losing jobs.
– An appropriate number of I/O nodes can be added to scale with the compute partition.
• I/O partition is the nexus for all file I/O.
– It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition.
• Links/protocols to external storage servers are server dependent.
– External implementation is hidden from the compute partition.
11
Compute -- I/O node protocol
• Base protocol is NFS version 2.
– Stateless protocols allow us to repair faulty I/O nodes without aborting applications.
– Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here.
• Extensions/modifications:
– Larger I/O requests.
– Propagation of a call to invalidate the cache on the I/O node.
12
Current Implementation
• Basic implementation of the I/O nodes.
• Have straight NFS inside Linux, with the ability to invalidate the cache.
• I/O nodes have no cache.
• I/O nodes are dumb proxies knowing only about one server.
• Credentials are rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes.
• I/O nodes are attached via 100BaseT to a Gb Ethernet, with an SGI O2K as the (XFS) file server on the other end.
• No jumbo packets yet.
• Bandwidth is about 30 MB/s with 18 clients driving 3 I/O nodes, each using about 15% of its CPU (roughly 10 MB/s per I/O node, near the ~12.5 MB/s line rate of 100BaseT).
13
Current Improvements
• Put a VFS infrastructure into the I/O node daemon.
– Allow access to multiple servers.
– Allow a Linux /proc interface to tune individual I/O nodes quickly and easily.
– Allow vnode identification to associate buffer cache with files.
• Experiment with a multi-node server (SGI/CXFS).
14
Future Improvements
• Stop retries from going out of the network.
• Put in jumbo packets.
• Put in read cache.
• Put in write cache.
• Port over Portals 3.0.
• Put in bulk data services.
• Allow dynamic compute-node-to-I/O-node mapping.
15
Looking for Collaborations
Lee Ward
505-844-9545
Pang Chen
510-796-9605