Architecture of a Next-Generation Parallel File System
Agenda
• Introduction
• What's in the code now
• Futures
An introduction
What is OrangeFS?
• OrangeFS is a next-generation parallel file system
• Based on PVFS
• Distributes file data across multiple file servers, leveraging any block-level file system
• Distributes metadata across 1 to all storage servers
• Supports simultaneous access by multiple clients, including Windows, using the PVFS protocol directly
• Works with standard kernel releases; does not require custom kernel patches
• Easy to install and maintain
Why a Parallel File System?

HPC - Data Intensive Parallel (PVFS) Protocol
• Large datasets
• Checkpointing
• Visualization
• Video
• Big Data

Unstructured Data Silos
• Unify dispersed file systems
• Simplify storage leveling

Interfaces to Match Problems
• Multidimensional arrays
• Typed data
• Portable formats
Original PVFS Design Goals
§ Scalable
  § Configurable file striping
  § Non-contiguous I/O patterns
  § Eliminates bottlenecks in the I/O path
  § Does not need locks for metadata ops
  § Does not need locks for non-conflicting applications
§ Usability
  § Very easy to install, small VFS kernel driver
  § Modular design for disk, network, etc.
  § Easy to extend -> hundreds of research projects have used it, including dissertations and theses
OrangeFS Philosophy
• Focus on a broader set of applications
• Customer & community focused (>300-member community & growing)
• Open source
• Commercially viable
• Enable research
[Diagram: design trade-offs - configurability, performance, consistency, reliability]
System Architecture
• OrangeFS servers manage objects
  • Objects map to a specific server
  • Objects store data or metadata
  • The request protocol specifies operations on one or more objects
• OrangeFS object implementation (see the sketch below)
  • A DB for indexing key/value data
  • The local block file system for the data stream of bytes
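As a mental model only (struct and field names here are illustrative, not the OrangeFS source), a server-side object pairs a key/value space with a byte stream kept as a plain file on the server's local block file system:

```c
#include <stdio.h>

/* Illustrative shape of a server-side object: metadata lives in an
 * indexed key/value store, file contents in a byte stream that the
 * server keeps as a regular file on its local file system. */
struct kv_pair {
    const char *key;
    const char *value;
};

struct server_object {
    unsigned long long handle;    /* which object; maps to one server */
    struct kv_pair    *keyvals;   /* metadata, indexed by a DB        */
    size_t             n_keyvals;
    const char        *bstream;   /* path of local file holding bytes */
};

int main(void)
{
    struct kv_pair md[] = { { "owner", "alice" }, { "mode", "0644" } };
    struct server_object obj = { 42ULL, md, 2,
                                 "/storage/bstreams/0000002a" }; /* made-up path */
    printf("object %llu: %zu keyvals, data in %s\n",
           obj.handle, obj.n_keyvals, obj.bstream);
    return 0;
}
```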
Current Architecture
[Diagram: client and server components of the current architecture]
History: PVFS -> PVFS2 -> OrangeFS
• 1994-2004: PVFS design and development at Clemson University (Dr. Ligon) + ANL (CU graduates)
• 2004-2010: PVFS2; primary maintenance & development by ANL (CU graduates) + community
• 2007-2010: new PVFS branch; new development focused on a broader set of problems
• SC10 (fall 2010): OrangeFS announced with the community; now the mainline of future development as of 2.8.4
• SC11 (fall 2011): 2.8.5 + Windows support; targeted development services initially offered by Omnibond. Improved MD, stability, server-side operations, newer kernels, testing, Windows client, replicate on immutable
• Spring 2012: 2.8.6 + Webpack. Performance improvements, Direct Lib + cache stability, WebDAV, S3
• Winter 2013: 2.8.7 + Webpack. Performance improvements, stability
• Spring 2014: 2.8.8 + Webpack. Performance improvements, stability, shared mmap, multi TCP/IP server homing, Hadoop MapReduce, user lib fixes, new spec file for RPMs + DKMS. Available in the AWS Marketplace
• Summer 2014: 2.9.0. Distributed directory MD, capability-based security
• 2015: OrangeFS 3.0. Replicated MD and file data, 128-bit UUIDs for file handles, parallel background processes, web-based management UI, self-healing processes, data balancing
In the Code Now
Server-to-Server Communications (2.8.5)

Traditional metadata operation
• A create request causes the client to communicate with all servers: O(p)

Scalable metadata operation
• A create request goes to a single server, which in turn communicates with the other servers using a tree-based protocol: O(log p)
[Diagrams: in the traditional operation, the client (app / client lib) contacts every server over the network; in the scalable operation, the client contacts one server, which fans the request out to the remaining servers]
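A minimal sketch of the O(log p) fan-out idea (illustrative only; the function and the binary split below are not the actual OrangeFS request protocol): each server that receives the create forwards it to two children covering halves of the remaining server range, so the request reaches p servers in O(log p) rounds rather than p client messages.

```c
#include <stdio.h>

/* Hypothetical sketch: the server at `lo` handles a create for
 * servers [lo, hi) and forwards the remainder to two children,
 * halving the range each round -> O(log p) depth. */
static void tree_create(int lo, int hi, int depth)
{
    if (lo >= hi)
        return;
    printf("%*sserver %d: create local object\n", depth * 2, "", lo);
    int mid = lo + (hi - lo + 1) / 2;     /* split remaining servers */
    tree_create(lo + 1, mid, depth + 1);  /* forward to first child  */
    tree_create(mid, hi, depth + 1);      /* forward to second child */
}

int main(void)
{
    tree_create(0, 8, 0);  /* client sends one request to server 0 */
    return 0;
}
```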
Recent Additions (2.8.5)
• SSD metadata storage
• Replicate on immutable (file based)
• Windows client
  • Supports Windows 32/64-bit: Server 2008, R2, Vista, 7
Direct Access Interface (2.8.6)
• Implements:
  • POSIX system calls
  • stdio library calls
• Parallel extensions
  • Noncontiguous I/O
  • Non-blocking I/O
• MPI-IO library
• Found more boundary conditions; fixes coming in 2.8.7
[Diagram: two client stacks reaching the file systems over IB or TCP - one app going through the kernel and PVFS lib via the client core, the other linking the Direct lib]
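Because the direct interface implements the ordinary POSIX calls, unmodified code like the sketch below can run through it. The mount point is made up for illustration; the library is typically preloaded (e.g. via LD_PRELOAD), and the exact library name varies by release.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Plain POSIX I/O: with the OrangeFS direct-access (preload)
     * library loaded, these calls bypass the kernel VFS path.
     * The path below is illustrative. */
    int fd = open("/mnt/orangefs/demo.txt", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, "hello\n", 6) != 6)
        perror("write");
    close(fd);
    return 0;
}
```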
Direct Interface Client Caching (2.8.6)
• Direct Interface enables Multi-‐Process Coherent Client Caching for a single client
[Diagram: multiple client application processes sharing one coherent client cache behind the direct interface, in front of the file system]
WebDAV (2.8.6 webpack)
[Diagram: DAV clients talk to Apache, which reaches OrangeFS over the PVFS protocol]
• Supports the DAV protocol; tested with the Litmus DAV test suite
• Supports DAV cooperative locking in metadata
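For example, any generic WebDAV client can list a directory with a PROPFIND request. A minimal libcurl sketch follows; the host and path are made up for illustration, not an OrangeFS endpoint:

```c
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* List a collection with a WebDAV PROPFIND at depth 1.
     * URL is illustrative. */
    struct curl_slist *hdrs = curl_slist_append(NULL, "Depth: 1");
    curl_easy_setopt(curl, CURLOPT_URL, "http://dav.example.org/orangefs/");
    curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PROPFIND");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```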
S3 (2.8.6 webpack)
[Diagram: S3 clients talk to Apache, which reaches OrangeFS over the PVFS protocol]
• Tested using the s3cmd client
• Files are accessible via the other access methods
• Containers are directories
• Accounting pieces not implemented
Summary - Recently Added to OrangeFS
• 2.8.3
  • Server-to-server communication
  • SSD metadata storage
  • Replicate on immutable
• 2.8.4, 2.8.5 (fixes, support for newer kernels)
  • Windows client
• 2.8.6 - performance, fixes, IB updates
  • Direct access libraries (initial release): preload library for applications, including optional client cache
  • Webpack: WebDAV (with file locking), S3
OrangeFS on AWS Marketplace
• Available on the Amazon AWS Marketplace, brought to you by Omnibond
[Diagram: OrangeFS instances presenting a unified high-performance file system, backed by EBS volumes and DynamoDB]
In 2.8.8 (Just Released)
Hadoop JNI Interface (2.8.8)
• OrangeFS Java Native Interface
• Extension of the Hadoop FileSystem class -> JNI
• Buffering
• Distribution
• Fast PVFS protocol for remote configuration
[Diagram: Hadoop reaching the OrangeFS servers over the PVFS protocol]
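The JNI layer binds Java methods to native code roughly like the sketch below; the package, class, and method names are hypothetical stand-ins, not the actual OrangeFS JNI symbols:

```c
#include <jni.h>

/* Hypothetical native method backing a Java declaration such as
 *   package org.example;
 *   class OrangeFSNative { native long open(String path); }
 * The exported symbol name encodes package/class/method. */
JNIEXPORT jlong JNICALL
Java_org_example_OrangeFSNative_open(JNIEnv *env, jobject self, jstring jpath)
{
    (void)self;
    const char *path = (*env)->GetStringUTFChars(env, jpath, NULL);
    if (path == NULL)
        return -1;              /* OutOfMemoryError already thrown */

    /* ... a real shim would call into the OrangeFS client library
     * here and return its handle ... */
    long handle = 0;            /* placeholder */

    (*env)->ReleaseStringUTFChars(env, jpath, path);
    return (jlong)handle;
}
```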
Additional Items (2.8.8)
• Updated user lib
• Shared mmap support in the kernel module
• Support for kernels up to 3.11
• Multi-homing servers over IP
  • Clients can access a server over multiple interfaces (say, clients on IPoIB + clients on IPoEthernet + clients on IPoMX)
• Enterprise installers (coming shortly)
  • Client (with DKMS for the kernel module)
  • Server
  • Devel
Performance
Scaling Tests
16 storage servers with 2 LVM'd 5+1 RAID sets were tested with up to 32 clients; read performance reached nearly 12 GB/s and write performance nearly 8 GB/s.
MapReduce over OrangeFS
• 8 Dell R720 servers connected with 10 Gb/s Ethernet
• The remote case adds 8 additional identical servers and does all OrangeFS work remotely; only local work is done on the compute node (traditional HPC model)
• *25% improvement with OrangeFS running remotely
MapReduce over OrangeFS
• 8 Dell R720 servers connected with 10 Gb/s Ethernet
• Remote clients are R720s with single SAS disks for local data (vs. 12-disk arrays in the previous test)
SC13 Demo Overview
[Diagram: OrangeFS clients on the SC13 floor (Clemson, USC, I2, Omnibond) reaching 16 Dell R720 OrangeFS servers over the I2 Innovation Platform at 100 Gb/s]

SC13 WAN Performance
• Multiple concurrent client file creates over the PVFS protocol (nullio)
For 2.9 (summer 2014)
Distributed Directory Metadata (2.9.0)
[Diagram: directory entries DirEnt1-DirEnt6 hashed across Server0-Server3]

Extensible hashing
• State management based on GIGA+ (Garth Gibson, CMU)
• Improves access times for directories with a very large number of entries
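A toy sketch of the idea (assumptions: a 32-bit hash and radix-style splitting; this is not the GIGA+ or OrangeFS implementation): the directory tracks how many times it has split, and the low bits of the entry-name hash pick the server holding that entry, so entries fan out as the directory grows.

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a, standing in for whatever hash the real system uses. */
static uint32_t hash_name(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) {
        h ^= (uint8_t)*name;
        h *= 16777619u;
    }
    return h;
}

/* After `splits` doublings the directory spans 2^splits servers;
 * the low `splits` bits of the hash choose the entry's server. */
static int server_for_entry(const char *name, int splits)
{
    return (int)(hash_name(name) & ((1u << splits) - 1));
}

int main(void)
{
    const char *entries[] = { "DirEnt1", "DirEnt2", "DirEnt3",
                              "DirEnt4", "DirEnt5", "DirEnt6" };
    for (size_t i = 0; i < 6; i++)
        printf("%s -> server %d\n", entries[i],
               server_for_entry(entries[i], 2)); /* 2 splits = 4 servers */
    return 0;
}
```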
Capability-Based Security (2.9.0)
[Diagram: a client presents a cert or credential, receives a signed capability, and presents that capability to the I/O servers; trust is rooted in OpenSSL PKI]
• 3 security modes
  • Basic: OrangeFS/PVFS classic mode
  • Key-based: keys are used to authorize clients for use with the FS
  • User-certificate based with LDAP: user certs are used for access to the file system and are generated based on LDAP uid/gid info
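Capability signing can be pictured with stock OpenSSL EVP calls. A minimal sketch, assuming a PEM private key at an illustrative path; the capability buffer layout and key location are assumptions, not the OrangeFS wire format (requires OpenSSL 1.1.1+ for the one-shot EVP_DigestSign):

```c
#include <openssl/evp.h>
#include <openssl/pem.h>
#include <stdio.h>

/* Sketch: sign an opaque capability blob with a server key so I/O
 * servers can verify it. Caller sizes `sig`; on entry *sig_len is
 * the buffer size, on return the signature length. */
int sign_capability(const unsigned char *cap, size_t cap_len,
                    unsigned char *sig, size_t *sig_len)
{
    FILE *fp = fopen("/etc/orangefs/server-key.pem", "r"); /* assumed path */
    if (!fp)
        return -1;
    EVP_PKEY *key = PEM_read_PrivateKey(fp, NULL, NULL, NULL);
    fclose(fp);
    if (!key)
        return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    int ok = ctx
        && EVP_DigestSignInit(ctx, NULL, EVP_sha256(), NULL, key) == 1
        && EVP_DigestSign(ctx, sig, sig_len, cap, cap_len) == 1;

    EVP_MD_CTX_free(ctx);
    EVP_PKEY_free(key);
    return ok ? 0 : -1;
}
```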
For v3
Replication / Redundancy
• Redundant metadata
  • Seamless recovery after a failure
  • Redundant objects from the root directory down
  • Configurable
• Redundant data
  • Configurable update mode (real time, on close, on immutable, none) and number of replicas
  • Real-time "forked flow" work shows little overhead
  • Replicate on close
  • Replicate to external (like LTFS)
  • Looking at supporting an HSM option to external (no local replica)
• Emphasis on continuous operation
OrangeFS 3.0
• An OID (object identifier) is a 128-‐bit UUID that is unique to the data-‐space
• An SID (server identifier) is a 128-‐bit UUID that is unique to each server.
• No more than one copy of a given data-‐space can exist on any server
• The (OID, SID) tuple is unique within the file system.
• (OID, SID1), (OID, SID2), (OID, SID3) are copies of the object identifier on different servers.
Handles -‐> UUIDs OrangeFS 3.0
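In C terms the handle pairing might look like the sketch below, using libuuid for the 128-bit identifiers; the struct and field names are illustrative, not the OrangeFS source:

```c
#include <uuid/uuid.h>   /* libuuid: link with -luuid */
#include <stdio.h>

/* Sketch of the 3.0 handle idea: a handle pairs a 128-bit object
 * UUID with the 128-bit UUID of the server holding that copy. */
struct ofs_handle {
    uuid_t oid;   /* object identifier, unique per data space */
    uuid_t sid;   /* server identifier, unique per server     */
};

int main(void)
{
    uuid_t oid, sid1, sid2;
    uuid_generate(oid);
    uuid_generate(sid1);
    uuid_generate(sid2);

    /* Two replicas of one object: same OID, different SIDs. */
    struct ofs_handle copy1, copy2;
    uuid_copy(copy1.oid, oid); uuid_copy(copy1.sid, sid1);
    uuid_copy(copy2.oid, oid); uuid_copy(copy2.sid, sid2);

    char buf[37];
    uuid_unparse(copy1.oid, buf);
    printf("OID %s replicated on two servers\n", buf);
    (void)copy2;
    return 0;
}
```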
OrangeFS 3.0: Server Location / SID Management
• In an exascale environment with the potential for thousands of I/O servers, it will no longer be feasible for each server to know about all other servers
• Server discovery
  • Servers will know a subset of their neighbors at startup (or may have them cached from previous startups), similar to DNS domains
  • Servers will learn about unknown servers on an as-needed basis and cache them, similar to DNS query mechanisms (root servers, authoritative domain servers)
• SID cache: an in-memory DB that stores server attributes
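A minimal sketch of a SID-cache lookup (an assumed shape; the real cache stores a richer attribute set): entries sorted by SID, found by binary search, with a miss triggering DNS-style discovery.

```c
#include <stdlib.h>
#include <uuid/uuid.h>

/* Assumed SID-cache entry; the real attribute set is richer. */
struct sid_entry {
    uuid_t sid;
    char   address[64];   /* e.g. "tcp://host:3334" (illustrative) */
};

static int cmp_sid(const void *a, const void *b)
{
    return uuid_compare(((const struct sid_entry *)a)->sid,
                        ((const struct sid_entry *)b)->sid);
}

/* Look up a server's attributes in a SID-sorted cache; returns NULL
 * on a miss, at which point a real implementation would query its
 * peers (DNS-style) and insert the answer. */
struct sid_entry *sid_lookup(struct sid_entry *cache, size_t n,
                             const uuid_t sid)
{
    struct sid_entry key;
    uuid_copy(key.sid, sid);
    return bsearch(&key, cache, n, sizeof *cache, cmp_sid);
}
```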
OrangeFS 3.0: Policy-Based Location
• User-defined attributes for servers and clients, stored in the SID cache
• Policy is used for data location, replication location, and multi-tenant support
• Completely flexible: rack, row, app, region
Background Parallel Processing Infrastructure (3.0)
• Modular infrastructure to easily build background parallel processes for the file system
• Used for:
  • Gathering stats for monitoring
  • Usage calculation (can be leveraged for directory space restrictions, chargebacks)
  • Background safe FSCK processing (can mark bad items in MD)
  • Background checksum comparisons
  • Etc.
Admin REST Interface / Admin UI (3.0)
[Diagram: the admin UI drives a REST interface in Apache, which reaches OrangeFS over the PVFS protocol]
Data Migration / Management
• Built on redundancy & background processes
• Migrate objects between servers
  • De-populate a server going out of service
  • Populate a newly activated server (HW lifecycle)
  • Moving computation to data
  • Hierarchical storage
• Uses existing metadata services
• Possible: directory hierarchy cloning
  • Copy on write (dev, QA, prod environments with a high percentage of data overlap)
OrangeFS 3.x: Hierarchical Data Management
[Diagram: users and HPC clients on OrangeFS, with OrangeFS metadata directing data to archive/intermediate storage over NFS and to remote systems (exceed, OSG, Lustre, GPFS, Ceph, Gluster)]
OrangeFS 3.x: Attribute-Based Metadata Search
• Clients tag files with keys/values
• Keys/values are indexed on the metadata servers
• Clients query for files based on keys/values
• Returns file handles, with options for filename and path
[Diagram: a parallel key/value query fans out across the metadata servers; matching files are then accessed directly]
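From the client side, tagging in this model could look like ordinary extended attributes; a sketch only, where the path and the "user.project" attribute name are made up for illustration, and the query side is not a published OrangeFS interface:

```c
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    /* Tag a file with a key/value pair. On Linux, user.* attributes
     * are the usual namespace for application tags. */
    const char *path = "/mnt/orangefs/results/run42.dat";
    const char *val  = "climate-model";

    if (setxattr(path, "user.project", val, strlen(val), 0) != 0) {
        perror("setxattr");
        return 1;
    }
    /* A search would then ask the metadata servers for every file
     * where user.project == "climate-model" instead of walking the
     * directory tree. */
    return 0;
}
```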
Beyond: OrangeFS Next
Extend Capability-Based Security
• Enable certificate-level access (in process)
• Federated-access capable
• Can be integrated with rules-based access control
  • Department X in company Y can share with department Q in company Z
  • Rules and roles establish the relationship
  • Each company manages its own control of who is in the company and in the department
SDN - OpenFlow
• Working with the OpenFlow research team at CU
• OpenFlow separates the control plane from delivery, giving the ability to control the network with software
• Looking at bandwidth optimization leveraging OpenFlow and OrangeFS
ParalleX
ParalleX is a new parallel execution model
• Key components:
  • Asynchronous Global Address Space (AGAS)
  • Threads
  • Parcels (message driven instead of message passing)
  • Locality
  • Percolation
  • Synchronization primitives
• High Performance ParalleX (HPX): a library implementation written in C++
PXFS
• Parallel I/O for ParalleX, based on PVFS
• Common themes with OrangeFS Next
• Primary objective: unification of the ParalleX and storage name spaces
  • Integration of the AGAS and storage metadata subsystems
  • Persistent object model
• Extends ParalleX with a number of I/O concepts
  • Replication
  • Metadata
• Extends I/O with ParalleX concepts
  • Moving work to data
  • Local synchronization
• Effort with LSU, Clemson, and Indiana U. (Walt Ligon, Thomas Sterling)
Community
Johns Hopkins OrangeFS Selection
• JHU HLTCOE selected OrangeFS after evaluating Ceph, GlusterFS, Lustre, and OrangeFS
• "Leveraging OrangeFS for the parallel filesystem, the system as a whole is capable of delivering 30GB/s write, 46GB/s read, and between 37,260-237,180 IOPS of performance. The variation in IOPS performance is dependent on the file size and number of bytes written per commit as documented in the Test Results section."*
• "The final system design represents a 2,775% increase in read performance and a 1,763-11,759% increase in IOPS"*
* http://hltcoe.jhu.edu/uploads/publications/papers/14662_slides.pdf
Learning More
• www.orangefs.org web site: releases, documentation, wiki
• pvfs2-users@beowulf-underground.org: support for users
• pvfs2-developers@beowulf-underground.org: support for developers
Support & Development Services
• www.orangefs.com & www.omnibond.com
• Professional support & development team
• Buy into the project
Omnibond Info

Solution areas:
• Identity Manager drivers & Sentinel connectors
• Parallel scale-out storage software
• Computer vision (intelligent transportation solutions, enterprise, personal)
• Social media interaction system