iRODS UGM 2016 Preso Summary FINAL


iRODS UGM 2016, June 10-11, 2016, Chapel Hill, NC. Copyright Randall Splinter, 2016


Academic Workflow for Research Repositories Using iRODS and Object Storage

Randall Splinter, Ph.D.

DataDirect Networks 238 Serenoa Drive Canton, GA 30114

[email protected]

ABSTRACT

Traditionally, the sharing and retention of research data has been a contentious issue. Sharing data over WANs has been limited by the available storage technologies. NAS solutions, while excellent for sharing data over a LAN, have never had the same success over WANs. The successful implementation of object storage solutions has opened the door to sharing data over WAN links.

Coupling that ability to share objects over a WAN with middleware like iRODS gives the research community more stringent controls over the data, including

▪ Better control of ACLs, including
   o Implementing data retention policies to meet regulatory requirements
   o Preventing loss of IP when faculty leave
▪ Virtualization of multiple storage silos under a single namespace
▪ Extensive metadata tags and searching of those tags
▪ Extensible rules engine to implement functionality such as
   o HSM-style functionality between storage devices
   o Data migration based upon set criteria

Some of the advantages of this approach include

• Ease of administration – Once rules are tested and in place, the system can be managed with a minimum of administrative overhead
• Automating workflows to guarantee consistency and reproducibility in the science that is produced
• Ease of auditing for both usage and back charging and for maintaining adequate data security compliance
• Using storage platforms like DDN WOS, remote replication becomes simple and provides a straightforward way to manage DR systems

Keywords

Data Archives, Data Repositories, iRODS, Object Storage, WOS


INTRODUCTION

Over the past few years as government funding agencies have begun to adopt requirements for data management plans for research, universities and federal laboratories have struggled with the development of programs to provide long-term storage of research data. The problem is not new per se, but the recent development of object storage has finally provided an interesting solution by which data can be reliably stored long-term while providing solutions to

1. Data security: protection from accidental deletion (retention policies) and from loss due to theft or hardware failure
2. Satisfying regulatory restrictions such as HIPAA and FIPS 140-2
3. Changing hardware standards
4. Hardware availability

Historically, satisfying all of these simultaneously has been a challenge for traditional NAS and Fibre Channel storage technologies.

In this brief we would like to summarize the existing state of affairs and show how a combination of object storage and iRODS can provide a solution to long-term archiving and active research data repositories that is significantly more cost effective and flexible than more traditional methodologies.

INTRODUCTION TO THE PROBLEM

Reducing the problem to its bare essence we are left with the problem of collaboration and research repositories. Both have their challenges using traditional storage technologies. In the next two sections we wish to outline the problem in some detail before moving on to a modern solution to both problems.

The Problem of Collaboration

The first significant attempt at providing an affordable solution to the problem of collaborative storage was provided by Sun Microsystems in 1984 in the form of the Network File System (NFS)[1]. This was followed by the development of Server Message Block (SMB) by IBM around 1990[2][3]. In 1996 Microsoft renamed SMB to Common Internet File System (CIFS)[4]. Both of these technologies still enjoy a good measure of success in the marketplace, but they also have their share of problems. The biggest single obstacle is a lack of interoperability between NFS and CIFS, but they also suffer from being primarily local technologies, in the sense that they do not perform well over WAN distances. This significantly limits their usefulness for the long-distance collaborations that are common today.

Therefore, the problem of collaboration reduces to one of finding the appropriate technology to enable the sharing of data over large distances. Two interesting solutions to this problem from a software perspective have emerged over the last few years. The first is Globus Online[5] and the second is iRODS[6]. The Globus Alliance has focused on the ease of transferring data over arbitrary distances. The development of GridFTP has made the transfer of data more efficient by enabling parallel IO streams and has enhanced the security of the data transfer. The iRODS team has focused on providing

1. A virtual namespace spanning all available storage resources
2. A mechanism by which end users can easily modify ACLs and permissions to share data
3. A mechanism by which end users can easily add metadata tags to objects and search that metadata
4. A rules engine which can be used to automate workflows and simplify the movement of data based on rules

These four capabilities can be graphically summarized as a “Swiss Army” knife, with each blade representing one of the above items.


For the rest of this paper we will focus on iRODS. The Globus team is developing very interesting technology to enable data transfers, sharing, and the publishing of data, but that work is beyond the scope of this paper.

The Problem of Research Repositories

As noted above, government agencies have started to demand data management plans as part of grant funding, so the development of adequate solutions for the long-term storage of data has become more pressing. The problems that are encountered can be summarized briefly as

1. Ensure long-term availability of the data
   a. Hardware availability/failure
   b. Changing hardware standards
2. Data security
   a. Retention policies are needed
   b. Loss due to theft – loss of IP
3. Regulatory restrictions
   a. HIPAA, FIPS 140-2, etc.
4. Enable ease of data sharing

Each of these presents its own set of problems, and frequently they conflict with one another. For instance, the need to share data or research results may well create issues for data security. External users may need access to some objects, but not to all objects. As another example, a researcher who leaves an institution may need to take some data along, but the former employer may view that data as institutional IP.
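The tension between sharing and security can be made concrete with a small sketch of per-object permissions. This is an illustration only and not the iRODS API: the `DataObject` class, the permission level names, the paths and the user names are all assumptions introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical permission levels, ordered from weakest to strongest.
LEVELS = {"none": 0, "read": 1, "write": 2, "own": 3}


@dataclass
class DataObject:
    path: str
    acl: dict = field(default_factory=dict)  # user name -> permission level

    def grant(self, user: str, level: str) -> None:
        if level not in LEVELS:
            raise ValueError(f"unknown permission level: {level}")
        self.acl[user] = level

    def allows(self, user: str, needed: str) -> bool:
        # A user with no ACL entry gets no access at all.
        return LEVELS[self.acl.get(user, "none")] >= LEVELS[needed]


# Two objects owned by the same (hypothetical) principal investigator.
results = DataObject("/repo/lab/results.csv")
raw = DataObject("/repo/lab/raw_patient_data.csv")
for obj in (results, raw):
    obj.grant("pi_owner", "own")

# The external collaborator is granted read access to one object only.
results.grant("external_collaborator", "read")

can_read_results = results.allows("external_collaborator", "read")
can_read_raw = raw.allows("external_collaborator", "read")
```

Because access is decided per object, the published results can be shared while the raw data stays restricted to its owner.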

OBJECT STORAGE AS A BRIDGE

The first proposal for object storage was given by Mesnier et al. [7]. Since then a number of competing object storage technologies have been released into the market, for example

1. DDN WOS
2. Scality RING
3. IBM Cleversafe
4. OpenStack Swift
5. Ceph
6. Amazon S3
7. Amplidata
8. EMC Atmos
9. Caringo
10. Others

Some of these are Open Source products, while most of them are commercial platforms. Nonetheless, all of them share some common features that make them ideal for use in collaborative/repository storage.


1. POSIX locking has been recognized as a significant hurdle to scaling filesystems to very large sizes. Therefore, all object store technologies have abstracted away the underlying POSIX filesystem.
2. To avoid the POSIX issues above, all of them limit the operations that an end user can perform, usually to some combination of create, read, update and delete (CRUD).
3. Rich custom metadata for an object. At a bare minimum, this allows end users to add metadata tags to objects and to search that metadata.
4. The use of standards such as ReST.
5. The use of data replication. The effectiveness of replication over WAN distances depends largely on the object storage platform.
6. Erasure encoding to provide data security (in place of RAID).
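The first three features can be illustrated with a toy in-memory store. It is a sketch under stated assumptions, not the interface of WOS or of any other product listed above: the class name `ToyObjectStore`, its method names and the sample metadata are hypothetical.

```python
import uuid


class ToyObjectStore:
    """In-memory stand-in for an object store: flat namespace, no file locking."""

    def __init__(self):
        self._objects = {}  # object ID -> (data bytes, metadata dict)

    def create(self, data: bytes, **metadata) -> str:
        oid = uuid.uuid4().hex  # the store, not the user, chooses the ID
        self._objects[oid] = (data, dict(metadata))
        return oid

    def read(self, oid: str) -> bytes:
        return self._objects[oid][0]

    def update(self, oid: str, data: bytes) -> None:
        # Whole-object replacement; there are no partial in-place writes.
        _, metadata = self._objects[oid]
        self._objects[oid] = (data, metadata)

    def delete(self, oid: str) -> None:
        del self._objects[oid]

    def search(self, **criteria) -> list:
        # Return the IDs of objects whose metadata matches every criterion.
        return [
            oid
            for oid, (_, metadata) in self._objects.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        ]


store = ToyObjectStore()
oid = store.create(b"temperature,42\n", project="climate", year="2016")
store.create(b"unrelated", project="genomics", year="2015")
matches = store.search(project="climate")
```

There are no directories, file handles or locks in this model. Objects are addressed by an opaque ID and found again through their metadata, which is why a software layer that presents a friendlier namespace is useful on top.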

HOW IRODS ENABLES OBJECT STORAGE

The role of research repositories and archives will become increasingly important over the next few years as funding agencies push for increased openness in supported research. This provides the ideal environment for bringing together object storage and iRODS.

Object storage is ideal in a Write Once, Read Many (WORM) environment. By removing the limitations imposed by POSIX locking, object storage enables the growth of storage environments to massive sizes. Furthermore, the native replication techniques that object storage platforms implement provide data protection for extended periods of time. Some object storage platforms, such as DDN WOS, also implement erasure encoding at the local level to provide additional data protection.

On the other hand, object storage platforms in their raw form tend to be difficult to use. Most support the ReST interface natively, but ReST can be problematic for less sophisticated users. Gateways, regardless of the protocol they support (NFS, CIFS, S3 and Swift, to name a few), provide a simple and often POSIX-like interface to the raw object storage. That is somewhat counterintuitive, since a great deal of work has gone into removing POSIX from the object storage in the first place. The solution, of course, is software.

iRODS stands as the best alternative for a software stack to layer on top of an object storage platform and provides the best user experience in terms of usability. Recall from above that iRODS provides

1. A virtual namespace spanning all available storage resources
2. A mechanism by which end users can easily modify ACLs and permissions to share data
3. A mechanism by which end users can easily add metadata tags to objects and search that metadata
4. A rules engine which can be used to automate workflows and simplify the movement of data based on rules

The first item allows multiple storage platforms to be included in the overall solution, which can simplify data movement between storage resources. For instance, in an HPC environment the output of compute jobs can easily be moved, either by hand or through the Rules Engine, from high-speed filesystems to the long-term object storage. Metadata tags can be added to the dataset before it is moved to the object storage platform to support later searches. Finally, the end user can modify ACLs or create tickets to share data with colleagues on or off campus.
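The HPC example can be sketched as a simple age-based migration rule. The sketch is written in plain Python rather than in the iRODS rule language, and the resource names (`lustre_scratch`, `wos_archive`), the paths and the 30-day threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    logical_path: str  # the name the user sees; it never changes
    resource: str      # where the bytes currently live
    age_days: int
    metadata: dict = field(default_factory=dict)


def migrate_old_data(catalog, source, target, min_age_days):
    """Move entries older than the threshold from `source` to `target`."""
    moved = []
    for entry in catalog:
        if entry.resource == source and entry.age_days >= min_age_days:
            # Tag first, so the data's origin can be searched for later.
            entry.metadata["migrated_from"] = source
            entry.resource = target
            moved.append(entry.logical_path)
    return moved


catalog = [
    Entry("/zone/home/alice/run1.out", "lustre_scratch", age_days=45),
    Entry("/zone/home/alice/run2.out", "lustre_scratch", age_days=2),
    Entry("/zone/home/bob/survey.dat", "wos_archive", age_days=400),
]
moved = migrate_old_data(catalog, "lustre_scratch", "wos_archive", min_age_days=30)
```

Only the physical resource changes. The logical path that users see stays the same, which is the practical benefit of a virtual namespace.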

AN EXAMPLE IMPLEMENTATION OF A DDN WOS WITH IRODS REPOSITORY

The following is an example of a research data repository that was recently delivered to Texas Tech University (TTU). The system uses DDN WOS as the object storage back end and iRODS as the software layer that provides the functionality the repository needs.

Some critical features of the repository are


1. Two WOS zones for replication of all data between two geographically separated data centers, and local erasure encoding to provide additional data safety.

   WOS Policy Name    WOS Zone       Description
   wosresUDC          UDC Only       Local Object Assure with one copy at UDC
   wosresRDC          RDC Only       Local Object Assure with one copy at RDC
   wosresREPL         UDC and RDC    Local Object Assure at both sites with one copy at both UDC and RDC

2. A highly available environment using Corosync and Pacemaker to provide an active/passive HA cluster for the PostgreSQL/iCAT database, HAProxy services and the iRODS Cloud Browser.

3. A scale-out environment using WOS back-end storage to enable simple capacity upgrades. The simplicity of adding more WOS capacity is a critical feature of the overall design. The customer did not want to perform extensive reconfiguration after adding more capacity.

4. Multiple ingest methods, including iCommands, the iRODS Cloud Browser and the GridFTP/iRODS connector. Future work will include adding a 10GbE link to the campus HPC Lustre filesystem so that data can be moved directly from Lustre into the back-end WOS storage.

5. Authentication must use PAM and Kerberos in order to tie into eRaider, the existing campus-wide authentication system.
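The three WOS policies listed under item 1 can be written as a small lookup to show how a policy choice determines where copies live. The policy and zone names come from the table. The helper functions are hypothetical and are not part of WOS or iRODS.

```python
# Policy name -> zones holding a copy, taken from the policy table in item 1.
WOS_POLICIES = {
    "wosresUDC": {"UDC"},
    "wosresRDC": {"RDC"},
    "wosresREPL": {"UDC", "RDC"},
}


def policies_covering(required_zones):
    """Return the names of policies that keep a copy in every required zone."""
    required = set(required_zones)
    return sorted(
        name for name, zones in WOS_POLICIES.items() if required <= zones
    )


def survives_site_loss(policy, lost_zone):
    """True if at least one copy remains when `lost_zone` is unavailable."""
    return bool(WOS_POLICIES[policy] - {lost_zone})


both_sites = policies_covering(["UDC", "RDC"])
replicated_ok = survives_site_loss("wosresREPL", "UDC")
single_site_ok = survives_site_loss("wosresUDC", "UDC")
```

Under these assumptions only `wosresREPL` keeps data reachable when one data center is lost. The two single-zone policies rely on local erasure encoding alone.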


The complete solution will provide a Research Repository for TTU that will scale into the foreseeable future and give campus researchers a platform for the long-term storage of research data. The system has been opened to beta customers, and upon completion of beta testing it will be opened for campus-wide use later this year.

It also demonstrates the flexibility and robustness that can be built into an iRODS deployment. HA clustering provides resistance to downtime from hardware or software failures. The scale-up and scale-out design provides a WOS cluster whose capacity grows simply, and multiple iCAT servers to handle any peaks in workload that occur during normal usage.

CONCLUSION

In conclusion, we have presented arguments for the use of iRODS as a middleware stack, with object storage as the physical storage, for long-term archives or repositories. We believe our work at TTU demonstrates that combining the two makes it possible to build a highly available and scalable iRODS/WOS environment.

The choice of iRODS is simple: the combination of a virtualized namespace, end-user ACLs and permissions, metadata tagging and searching, and a highly extensible rules engine makes iRODS the ideal middleware stack for use in an archive/repository environment. Using the iRODS rules engine, data can be migrated transparently between multiple iRODS storage resources, and data retention policies can be enforced for data that must not be lost to accidental deletion or theft.
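A retention policy of the kind mentioned above reduces to a date check made before any delete is allowed. The following is a minimal sketch in plain Python, not iRODS rule language. The function names, the exception and the 365-day period are assumptions made for illustration.

```python
from datetime import date, timedelta


class RetentionError(Exception):
    """Raised when a delete is attempted while retention is still in force."""


def check_delete(created: date, retention_days: int, today: date) -> None:
    # The object may be deleted only on or after its release date.
    release = created + timedelta(days=retention_days)
    if today < release:
        raise RetentionError(
            f"object is under retention until {release.isoformat()}"
        )


def can_delete(created: date, retention_days: int, today: date) -> bool:
    try:
        check_delete(created, retention_days, today)
    except RetentionError:
        return False
    return True


# A hypothetical object stored on 1 January 2016 with one year of retention.
blocked = can_delete(date(2016, 1, 1), 365, today=date(2016, 6, 10))
allowed = can_delete(date(2016, 1, 1), 365, today=date(2017, 1, 1))
```

In a real deployment a check like this would be enforced by the middleware rather than left to client code, so that users cannot bypass it.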

The ability of object storage to scale to many petabytes, with local erasure encoding and replication, provides an ideal method for the long-term storage of data while keeping the data secure and safe from hardware failure. The DDN WOS hardware platform was used in the example here. A WOS cluster scales out easily in terms of capacity, so growth can be handled without extensive changes to the existing environment. Finally, WOS has flexible storage policies, so virtually any situation can be accommodated, from local erasure encoding to replication of objects between multiple WOS zones in the cluster.

ACKNOWLEDGMENTS

The author would like to thank the iRODS Consortium for the opportunity to present this material at the 2016 iRODS User Group Meeting.

REFERENCES

[1] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, Bob Lyon (1985). "Design and Implementation of the Sun Network Filesystem". USENIX.

[2] "Common Internet File System". Microsoft TechNet Library. Retrieved 2013-08-20. The Common Internet File System (CIFS) is the standard way that computer users share files across corporate intranets and the Internet. An enhanced version of the Microsoft open, cross-platform Server Message Block (SMB) protocol, CIFS is a native file-sharing protocol in Windows 2000.

[3] "Microsoft SMB Protocol and CIFS Protocol Overview". Microsoft MSDN Library. 2013-07-25. Retrieved 2013-08-20. The Server Message Block (SMB) Protocol is a network file sharing protocol, and as implemented in Microsoft Windows is known as Microsoft SMB Protocol. The set of message packets that defines a particular version of the protocol is called a dialect. The Common Internet File System (CIFS) Protocol is a dialect of SMB. Both SMB and CIFS are also available on VMS, several versions of Unix and other operating systems.

[4] Tridgell, Andrew. "Myths About Samba". Retrieved 2016-01-03.

[5] "Globus Alliance established as international consortium to advance Globus grid software" (Press release). Globus Alliance. September 2, 2003. Retrieved 2007-08-08. the Globus Project today transformed itself into the “Globus Alliance.” ... The Globus Project was established in 1995 by the U.S. Argonne National Laboratory, the University of Southern California's Information Sciences Institute (ISI) and the University of Chicago (UofC).

[6] Conway, Mike; Moore, Reagan; Rajasekar, Arcot; Nief, Jean-Yves (2011). "Demonstration of Policy-Guided Data Preservation Using iRODS". Proceedings of the 2011 IEEE International Symposium on Policies for Distributed Systems and Networks: 173–174. doi:10.1109/POLICY.2011.17. ISBN 978-0-7695-4330-7.

[7] Mesnier, Mike; Gregory R. Ganger; Erik Riedel (August 2003). "Object-Based Storage" (PDF). IEEE Communications Magazine: 84–90. doi:10.1109/mcom.2003.1222722. Retrieved 27 October 2013.