e-Infrastructure Integration-with gCube
description
Transcript of e-Infrastructure Integration-with gCube
e-Infrastructure Integration with gCube
Andrea Manzi ( CERN )Pasquale Pagano ( ISTI-CNR )
EGI User Forum
13 April 2011Vilnius (Lithuania)
www.d4science.eu
2
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Outline
• D4Science II Ecosystem• gCube architecture• Interoperability approaches
• Resource Discovery• Data Storage & Access• Data Discovery • Data Process• Security
• Applications• AquaMaps• Time Series
3
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
D4SCIENCE INFRASTRUCTURE
Hadoop EGEE/EGI
INSPIRE
DRIVER
GENESI-DR
AquaMaps
D4Science
II
Ecosyste
m
• Heterogeneous resources
• Heterogeneous computational platforms
• Rich set of legacy applications
• Multiple administrative domains
• Evolving communities
FAO Geonetwork
FAO FIGIS
Community A
Community B
4
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
gCube architecture
gCube run-time environment
gCube Definition and Management Services
gCube Application Services
Presentation Services
Portlets
Application Support Layer
Information Organization Services
Storage Management
Collection - Content - Metadata -
Annotation -… Management
Information Access Services
Search Framework
Ontology Management
Personalization Service
Index Management Framework
DIR Support Framework
Process ExecutionManagement
VREManagement
Information System Security
gCube Container gCore Framework
Use
r Se
rvices
5
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Virtual Organization
A Virtual Organization (VO) specifies how a set of users can access a set of resources
what is shared who is allowed to share the conditions under which
sharing can occur
The concept of VO Is not adequate to cover some common scenarios
• Data needs to be assessed before to make it publically exploitable by the VO members.
• Restricted set of users have to collaborate to refine processes and implement show cases.
• Products generated through elaboration of data or simulation have to be validated by expert users.
6
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Virtual Research Environment
Virtual Research Environment (VRE) is a distributed and dynamically created
environment where subset of resources can be
assigned to a subset of users via interfaces
for a limited timeframe at little or no cost for the providers of
the infrastructure Integrated with cloud systems
( OpenNebula )
VRE 2
VRE 1
VO
gCube is a first example of a VRE management system
7
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability: Assumptions
Very rich applications and data collections are currently maintained by a multitude of authoritative providers
Different problems require different execution paradigms: batch, map-reduce, synchronous call, message-queue, …
Key distributed computation technologies exist: grid (gLite and Globus), distributed resource management (Condor), clusters (Hadoop), …
Several standards are adopted in the same domain
8
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability: Landscape
Unstructured Data: blob (binary), and textual filesStructured Data: tabular, statistical, geospatial, temporal, and textual dataCompound Data: data composed by unstructured and structured data entities
secu
rity
9
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability: gCube Vision
gCube objectives:
hide heterogeneity, i.e. abstract over differences in location, protocol, and model;
embrace heterogeneity, i.e. allow for multiple locations, protocols, and models;
Technical goals:
no bottlenecks: scale no less than the interfaced resources no outages: keep failures partial and temporary autonomicity: system reacts and recovers
10
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Hiding Heterogeneity
• Heterogeneous resources are virtually accessible in a common ecosystem of resources
• despite their locations, technologies, and protocol
• Different communities have access to different views• according to the conditions under which the sharing can occur
• Each community can define its own VRE
• for a limited timeframe and at no cost for the providers of the resource
• Several VRE can coexist• without interfering each other
even by competing for the same resources
11
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Embracing Heterogeneity
Approaches and solutions to achieve interoperability :
Blackboard-based
asynchronous communication between components in a system one protocol to R/W and one language to specify messages
Wrapper/ Mediator-based
translates one interface for a component into a compatible interface
Adaptor-based
provides a unified interface to a set of other components interfaces and encapsulates how this set of objects interact
12
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Each resource is represented by a profile (metadata) characterising:
the interface the state the list of dependencies the run-time status the policies the configuration the pending tasks to execute
A Resource profile
is published by the resource owner is discovered by the resource consumers asynchronously through a
common resource-independent protocol
gCube offers a distributed and scalable Information System (blackboard) to store, discover, and access resource profiles
Interoperability Approaches: Resource Discovery
gCub
e in
tero
pera
bilit
y fr
amew
ork:
the
sol
utio
n
13
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability Approaches: Content Interoperability[1/2]
gCube Open Content Management Architecture (OCMA)
Assumption data stored in different storage back-ends diverse locations, models, access types few common primitives: documents, collections, repositories
gCube allows to reach content that lies outside system expose content (reachable from) inside system perform coarse-grained as well as fine-grained retrieval, update, and
addressing
Runtime scalability autonomic read-only state replication, maximize throughput, minimize response time: discovery-time load balancing
(through IS) reduce latencies
Software plugin-based architecture to reduce development costs (plugins over Storage
systems)
14
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
T1 T2
adapts adapts
factory
Interoperability Approaches: Content Interoperability[2/2]
gDoc
…gDocRead
gDocWrite
Content Manager Service ( OCMA Service)•Adapts gCube doc model ( gDoc ) to an unbounded number of back-end types
15
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability Approaches :Data Discovery
gCube offers
Several index types Forward indexing, which supports ultra fast lookups on tabular
typed metadata; XML indexing, that supports semistructured lookups on content
metadata; Textual field indexing, that supports full text and qualified lookups
on textual (mainly) metadata; Metadata full text indexing, that enables full text lookups on
metadata; Content full text indexing, that enables full text lookups on text
extracted by content; Geospatial/temporal indexing, that enables geospatial proximity and
coverage queries to be executed over geospatial/temporal metadata;
Feature indexing, that enables high-dimension vector indexing, for feature lookup (currently the feature is inactive);
16
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
gCube offers solutions to:
Decouple the business domain and infrastructure specific logic from the core “execution” functionality
Invocate a wide range of logic components: SOAP and REST WebServices, Shell Scripts, Executable Binaries, POJOs, …
Support most of the execution paradigms: batch, map-reduce, synchronous call
Bridges key distributed computation technologies: grid (gLite and Globus), Condor, Hadoop
Control and monitor the execution of a processing flow
Staging of data among different storage providers
Streaming data among computation elements
Interoperability Approaches : Process Execution [1/2]
17
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability Approaches : Process Execution [2/2]
By using adaptors that
operate on a specific third party language and translate them into native constructs,
allow for the creation of complex workflows that exploit several diverse technologies deployed on different infrastructures
18
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Interoperability Approaches : Security [1/2]
gCube offers solutions :
To secure access to gCube resources for interoperable external systems (incoming security) To ensure Interoperability of gCube security mechanisms with standards compliant security systems (reuse) To facilitate secure access to external resources for gCube services (outgoing)
19
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Authz: •XACML for authz request/response protocol and policy definition •SAML assertions to transport user/service authN information•Argus-based approach (EMI Authz framework) having pluggable design to integrate additional PIPs•SAML Profile for XACML 2.0 following the OASIS Authorization Interoperability Profile SpecificationAuthN:•Production level SSL/HTTPS support•Key- and Trust-Manager
Interoperability Approaches : Security [2/2]
20
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
AquaMaps is an application*
tailored to predict global distributions of marine species initially designed for marine mammals and subsequently generalised to marine species,
that generates color-coded species range maps using a half-degree latitude and longitude blocks
by interfacing several databases and repository providers
Species Distribution Maps Generation
* Algorithm by Kashner et al. 2006
21
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
AquaMaps execution is based on the gCube Ecological Niche Modelling Suite which allows the extrapolation of known species occurrences
Species Distribution Maps Generation
◦ to determine environmental envelopes (species tolerances)
◦ to predict future distributions by matching species tolerances against local environmental conditions (e.g. climate change and sea pollution)
Very large volume of input and output data: HSPEC native range 56,468,301 - HSPEC suitable range 114,989,360Very large number of computation: One multispecies map computed on 6,188 half degree cells (over 170k) and 2,540 species requires 125 millions computations (Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorlFish Center)
22
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Time Series Management
Offers a set of tools to manage capture statistics
Supports the complete TS lifecycle Supports validation, curation, and analysis Provides support for data reallocation Produces uniform data-set
23
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Time Series and R statistical software integration
The main aims are to:
•provide a complete, fully working, environment for R language
•give user methods to automatically extract data from the time series he was working on
•give user the possibility to perform queries on the time series database
•provide a service distributed on the infrastructure. Multiple instances can be managed on the infrastructure VREs, the distribution being transparent to the users (SaaS model)
24
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Conclusions
gCube System:
Stable software being improved over the last 5 years ( end of DILIGENT -> D4Science -> D4ScienceII)
gCube offers a variety of patterns, tools, and solutions
to delivery interoperability solutions and interconnect Heterogeneous digital content Heterogeneous repository systems Heterogeneous computation platforms
to decrease the cost of adoption to deal with several standards
25
www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011
Questions Time