National Science Foundation Cooperative Agreement: OCI-0940841
Compute Resources – HPC centers, institutional clusters
DFC Collaboration Environment – Data Grid
Community Resources – Repository, Catalog
DFC Vision
• Build collaboration environment
– Sharing of data, information, and knowledge
• Form national data cyberinfrastructure
– Federation of existing data management systems
• Support reproducible data-driven research
– Encapsulate knowledge within shared workflows
• Enable student participation in research
– Policy-controlled analysis of “live” data
Data Driven Science and Engineering
Collaboration Environments
– Oceanography – Ocean Observatories Initiative
• Archiving climatic data records from real-time sensor data streams
– Engineering – CIBER-U
• Engineering Digital Library: Curating civil engineering data, materials data, archaeology data, student training materials
– Hydrology – EarthCube
• Automating hydrology research workflows (data retrieval, transformation, analysis)
– Plant biology – the iPlant Collaborative
• Enable collaborative research across existing data repositories
– Cognitive science – the Temporal Dynamics of Learning Center
• Manage research data, apply IRB policies
– Social Science – the Odum Institute
• Integrate policy-based data management with the existing Dataverse repository
Challenges
• Federated national data cyberinfrastructure
• Existing projects have web services, data repositories, digital libraries, archives, processing pipelines, science portals
• What are the interoperability mechanisms needed to enable federation of existing resources?
1. Astrophysics: Auger supernova search
2. Atmospheric science: NASA Langley Atmospheric Sciences Center
3. Biology: Phylogenetics at CC IN2P3
4. Climate: NOAA National Climatic Data Center
5. Cognitive Science: Temporal Dynamics of Learning Center
6. Computer Science: GENI experimental network
7. Cosmic Ray: AMS experiment on the International Space Station
8. Dark Matter Physics: Edelweiss II
9. Earth Science: NASA Center for Climate Simulations
10. Ecology: CEED Caveat Emptor Ecological Data
11. Engineering: CIBER-U
12. High Energy Physics: BaBar / Stanford Linear Accelerator
13. Hydrology: Institute for the Environment, UNC-CH; Hydroshare
14. Genomics: Broad Institute, Wellcome Trust Sanger Institute, NGS
15. Medicine: Sick Kids Hospital
16. Neuroscience: International Neuroinformatics Coordinating Facility
17. Neutrino Physics: T2K and dChooz neutrino experiments
18. Oceanography: Ocean Observatories Initiative
19. Optical Astronomy: National Optical Astronomy Observatory
20. Particle Physics: Indra multi-detector collaboration at IN2P3
21. Plant genetics: the iPlant Collaborative
22. Quantum Chromodynamics: IN2P3
23. Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
24. Seismology: Southern California Earthquake Center
25. Social Science: Odum, TerraPop
DFC Builds on the iRODS data grid (integrated Rule-Oriented Data System)
[Figure: Policy Concept Graph]
Core concepts: Purpose, Collection, Attribute, Digital Object, Property, Policy, Procedure, Persistent State Information
Relation labels: Defines, Has, Controls, Updates, HasFeature, Isa, Invokes, Chains, HasSubType
Property-related concepts: Completeness, Correctness, Consensus, Consistency, Integrity, Authenticity, Access control
Policy examples: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy
Procedure-related concepts: Workflow, Function, Operation
Operation examples: GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj
Persistent state examples: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM
Policy enforcement: Client Action, Periodic Assessment Criteria, Policy Enforcement Point
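The concept graph pairs a Checksum Policy and a Replication Policy with persistent state such as DATA_ID, DATA_REPL_NUM, and DATA_CHECKSUM, and lists Periodic Assessment Criteria as one way policies are applied. A minimal sketch of such a periodic assessment follows. The `ReplicaState` record, the `assess_integrity` function, the use of SHA-256, and the two-replica threshold are all assumptions made for illustration; this is not the iRODS implementation.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ReplicaState:
    """Hypothetical persistent-state record for one replica of a digital object."""
    data_id: int
    data_repl_num: int
    data_checksum: str  # checksum recorded when the replica was registered


def sha256_hex(content: bytes) -> str:
    """Checksum helper; SHA-256 is an assumption, not something the slides specify."""
    return hashlib.sha256(content).hexdigest()


def assess_integrity(states, stored_bytes, min_replicas=2):
    """Compare recorded state with what is on storage and report problems.

    states: list of ReplicaState
    stored_bytes: dict mapping (data_id, data_repl_num) -> bytes found on storage
    Returns a list of (data_id, problem) tuples.
    """
    findings = []
    replicas_per_object = {}
    for s in states:
        replicas_per_object[s.data_id] = replicas_per_object.get(s.data_id, 0) + 1
        content = stored_bytes.get((s.data_id, s.data_repl_num))
        if content is None:
            findings.append((s.data_id, f"replica {s.data_repl_num} missing"))
        elif sha256_hex(content) != s.data_checksum:
            findings.append((s.data_id, f"replica {s.data_repl_num} checksum mismatch"))
    # Replication check: every object should have at least min_replicas copies.
    for data_id, count in sorted(replicas_per_object.items()):
        if count < min_replicas:
            findings.append((data_id, f"only {count} replica(s), need {min_replicas}"))
    return findings
```

Running the assessment on a schedule, rather than only at ingest, is what lets a collection keep verifying integrity over time.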
Policy-based Data Management – Implementation in iRODS
[Figure: policy concept graph annotated with iRODS implementation counts]
Concepts and counts: Purpose (5 main types), Property (7 default), Policy (11 default), Procedure (11 default), Clients (50), Policy Enforcement Points (70), Micro-service (317), Persistent State Information (338)
Other concepts: Collection, Attribute, Digital Object, Workflow, Operation, Periodic Assessment Criteria
Relation labels: Defines, Has, Controls, Updates, HasFeature, Isa, Invokes, Chains, HasSubType, SubType
Property-related concepts: Completeness, Correctness, Consensus, Consistency, Integrity, Authenticity, Access control
Policy examples: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy
Micro-service examples: msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj
Persistent state examples: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM
Collection subtypes: Archive, Data grid, Digital Library, Processing Pipeline
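In this slide, policy enforcement points invoke policies, policies control procedures built by chaining micro-services, and the micro-services update persistent state information. The flow can be sketched as follows. Everything here is a hypothetical stand-in: the `PolicyEngine` class, the `post_put` enforcement point name, the state keys, and the two Python functions are illustrative only and do not reproduce the signatures of the iRODS micro-services named above.

```python
from typing import Callable, Dict, List

State = Dict[str, object]               # stand-in for persistent state information
MicroService = Callable[[State], None]  # each micro-service updates the state in place


def set_data_type(state: State) -> None:
    """Illustrative micro-service: derive a data type from the file extension."""
    name = str(state["name"])
    state["data_type"] = name.rsplit(".", 1)[-1] if "." in name else "generic"


def replicate(state: State) -> None:
    """Illustrative micro-service: record that one more replica now exists."""
    state["replica_count"] = int(state.get("replica_count", 0)) + 1


class PolicyEngine:
    """Maps each policy enforcement point to a chain of micro-services."""

    def __init__(self) -> None:
        self._policies: Dict[str, List[MicroService]] = {}

    def register(self, enforcement_point: str, chain: List[MicroService]) -> None:
        self._policies[enforcement_point] = chain

    def trigger(self, enforcement_point: str, state: State) -> State:
        # Run the chained micro-services in order; unknown points do nothing.
        for micro_service in self._policies.get(enforcement_point, []):
            micro_service(state)
        return state


engine = PolicyEngine()
engine.register("post_put", [set_data_type, replicate])
result = engine.trigger("post_put", {"name": "stream.nc", "replica_count": 1})
```

Keeping the chain as data rather than code means a collection can change its policy without changing the client that triggered the enforcement point.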
DFC - CNI
Federation Approach
• Use middleware to implement unifying name spaces for:
1. Users: Single sign-on
2. Collections: Directories, workflow, time series
3. Objects: Files, soft links, workflows
4. Storage systems: Cloud, tape, file systems, objects
5. Metadata: Provenance, description, state
6. Policies: Management, assessment
7. Micro-services: Procedures, interactions
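A unifying name space for objects and storage systems can be pictured as a catalog that maps one logical path to replicas held on different kinds of storage. The sketch below illustrates that idea only. The class names, the path layout, and the example physical locations are assumptions; the zone name `dfcmain` is taken from the federation hub slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    storage_system: str  # e.g. "filesystem", "tape", "cloud"
    physical_path: str   # location meaningful only to that storage system


@dataclass
class LogicalNamespace:
    """Hypothetical catalog: logical path -> replicas on heterogeneous storage."""
    zone: str
    _objects: Dict[str, List[Replica]] = field(default_factory=dict)

    def register(self, logical_path: str, replica: Replica) -> None:
        # Prefix every path with the zone so names stay unique across a federation.
        full = f"/{self.zone}/{logical_path.lstrip('/')}"
        self._objects.setdefault(full, []).append(replica)

    def resolve(self, logical_path: str) -> List[Replica]:
        # Clients use only the logical name; physical details stay hidden.
        return list(self._objects.get(logical_path, []))


ns = LogicalNamespace("dfcmain")
ns.register("home/alice/run1.dat", Replica("filesystem", "/vault/a/run1.dat"))
ns.register("/home/alice/run1.dat", Replica("tape", "vol7:000123"))
replicas = ns.resolve("/dfcmain/home/alice/run1.dat")
```

Because users address the logical name only, a replica can move between storage systems without any client-visible change.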
Port: 1237, Zone: dfcmain
iCAT (iren2.renci.org)
hydroResc (hydro.renci.org)
res-bk15 (srbbrick15.ucsd.edu)
res-dfcmain (iren2.renci.org)
demoResc (iren2.renci.org)
renci (iren2.renci.org:1247)
ooi (icat.oceanobservatories.org:1247)
TDLC (tdlc-01.sdsc.edu:6688)
odumMain (iodum1.irss.unc.edu:1247)
dfctest (dfctest.renci.org:1248)
engineering (irods.ischool.drexel.edu:1247)
hydrology (iren2.renci.org:2823)
DFC Federation Hub
National Infrastructure
Research Environment - Portals, Applications, Workflows
DFC Collaboration Environment – Data Grid
Community Resource: Repository
Community Resource: Catalog
Community Resource: Services
Existing infrastructure
XSEDE, Kepler, OOI, TDLC, iPlant, CUAHSI, NCDC, Dataverse, GeoBrain, DataONE, NCSA Polyglot
The Challenge: Support reproducible data-driven research
PETABYTES DOUBLING EVERY TWO YEARS
Deliver the capability to manage, mine, and publish knowledge through collaboration environments.
Experiments
Archives
Sensors
Literature
Simulation
The Future: Reproducible Research
National Infrastructure Approach
1. Build national data cyberinfrastructure prototype
– Support multiple science and engineering domains by loosely coupling their existing infrastructure with a collaboration environment
2. Develop generic interoperability framework
– Define the generic infrastructure needed for the national infrastructure to manage knowledge as well as data and information
3. Define interoperability mechanisms
– Support access across the disparate types of infrastructure in common use
4. Define domain specific extensions
– Support three levels: technical interoperability, project level policy, and end user usage requirements
Interoperability Mechanisms
Information
Collection Registration
Information Exchange
Soft Links
Message Queue
Information Manipulation
Database Query
Policies control execution of each interoperability mechanism
Data
Data Access
Data Manipulation
Micro-services
Storage Driver
Knowledge
Knowledge Creation
Analysis Workflows
Knowledge Management
Procedures: Micro-services
DataNet Interoperability
Research Environment - Portals, Applications, Workflows
DFC Collaboration Environment
Message Queue
Web Service
DataONE Member Node
TerraPop Server
SEAD Portal (VIVO)
DataONE Coordinating Node
SEAD Engagement Center
DFC Data Grid
SEAD Data
DFC Data Grid
DFC Interoperability Layers
• Authentication – PAM / GSSAPI: InCommon, GSI, Kerberos, Shibboleth, LDAP
• Workflows – Micro-Services: Kepler, NCSA Cyberintegrator, Taverna, NCSA Polyglot
• Data Manipulation – Format Drivers: NetCDF, HDF5, THREDDS, ERDDAP
• Networks – Network Drivers: HTTPS, TCP/IP, Parallel TCP/IP, RBUDP
• Data Access – Micro-Services: DataONE, Data Conservancy, CUAHSI, NCDC
• Clients – OpenSocial, Web browsers, Web Services, Workflows, FUSE, Synchronization, MediaWiki
• Vocabulary – Micro-Services: HIVE, (Cheshire)
• Messaging – Micro-Services: AMQP, iRODS Xmsg
• Management – Policies: (RDA Policies), (ISO 16363 Criteria)
• Storage Systems – Storage Drivers: File Systems, Tape Archives, Object Stores, Cloud Storage
Interoperability Mechanisms
• Drivers
– Encapsulate knowledge to support your operations at the remote repository: partial I/O, parsing of formats, manipulation of data structures
– Authentication, format, storage
• Micro-services
– Encapsulate knowledge needed to interact with an external system or with a data set using the remote protocol
– Data access, external workflows, semantics, messaging
• Policies
– Encapsulate knowledge needed for management functions
– Federation control, administrative tasks, validation checks
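The three mechanism types can be contrasted in a few lines of code: a driver hides how a repository performs I/O, a micro-service is a small operation built on a driver, and a policy decides whether the operation may run. The sketch below is illustrative only. All class and function names are hypothetical, the in-memory driver stands in for a real storage system, and the ACL check stands in for a real management policy.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Driver: encapsulates how one kind of repository performs partial I/O."""

    @abstractmethod
    def read(self, path: str, offset: int, length: int) -> bytes:
        ...


class InMemoryDriver(StorageDriver):
    """Toy driver backed by a dict, standing in for a file system or object store."""

    def __init__(self, files):
        self._files = dict(files)

    def read(self, path, offset, length):
        return self._files[path][offset:offset + length]


def fetch_header(driver: StorageDriver, path: str, size: int = 4) -> bytes:
    """Micro-service: one small operation (a partial read) built on any driver."""
    return driver.read(path, 0, size)


def allow_read(user: str, path: str, acl) -> bool:
    """Policy: a management decision evaluated before the operation runs."""
    return user in acl.get(path, set())


def controlled_fetch(user, path, driver, acl):
    """The policy controls execution of the micro-service."""
    if not allow_read(user, path, acl):
        raise PermissionError(f"{user} may not read {path}")
    return fetch_header(driver, path)


driver = InMemoryDriver({"/d/a.h5": b"\x89HDF\r\n\x1a\n"})
acl = {"/d/a.h5": {"alice"}}
header = controlled_fetch("alice", "/d/a.h5", driver, acl)
```

Swapping `InMemoryDriver` for another `StorageDriver` subclass leaves the micro-service and the policy unchanged, which is the separation the three mechanism types aim for.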
Assertion
• Three basic types of interoperability mechanisms are sufficient for assembling national data cyberinfrastructure
• Example: Linked software defined networks to data grids
– From an iRODS data grid, controlled the selection of three disjoint network paths for optimizing data transport by adding appropriate policy enforcement points and micro-services
• Expect functionality currently in data grid middleware to migrate into network middleware
Future Architecture
Clients
Resources
Data Grid Middleware
Clients
Network Middleware
Data Grid Middleware
Resources
DFC Federation
GEMI - GENI
Virtual collection
Virtual network
Contacts
http://datafed.org
http://irods.org
Reagan W. [email protected]