
Federated data catalogues supporting cross-facility, cross-discipline interaction at the scale of atoms and molecules

Neutron diffraction

X-ray diffraction

High-quality structure refinement

• Unification of data management policies

• Shared protocols for exchange of user information

• Common scientific data formats

• Interoperation of data analysis software

• Linking Data and Publications and supporting the long-term preservation of the research outputs

PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories

PaN-data Partners

PaN-data brings together 11 major European Research Infrastructures

PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK

ISIS is the world’s leading pulsed spallation neutron source

ILL operates the most intense slow neutron source in the world

PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser

HZB operates the BER II research reactor and the BESSY II synchrotron

CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor

ESRF is a third-generation synchrotron light source jointly funded by 19 European countries

Diamond is a new third-generation synchrotron funded by the UK and the Wellcome Trust

DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser

Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007

ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser

ALBA is a new 3 GeV synchrotron facility due to become operational in 2010


PaN-data Applications

The partners operate hundreds of instruments used by over 30,000 scientists each year

These instruments support scientific fields as varied as:

• Physics, Chemistry, Biology, Material sciences, Energy technology, Environmental science, Medical technology and Cultural heritage

Applications include:

• crystallography that reveals the structures of viruses and proteins important for the development of new drugs

• neutron scattering that identifies stresses within engineering components such as turbine blades

• tomography that can image microscopic details of the 3D-structure of the brain

Industrial applications include pharmaceuticals, petrochemicals and microelectronics


PaN-data Standardisation

PaN-data Europe is undertaking 5 standardisation activities:

1. Development of a common data policy framework

2. Agreement on protocols for shared user information exchange

3. Definition of standards for common scientific data formats

4. Strategy for the interoperation of data analysis software, enabling the most appropriate software to be used independently of where the data is collected

5. Integration and cross-linking of research outputs, completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs


PaN-data Europe Timeline

PaN-data Europe runs from June 2010 until December 2011 with workshops in Spring and Autumn 2011.


Timeline chart (monthly columns: Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov)

Milestones and workshops, in chart order: M1, M2, W1, M3, M4, W2

Deliverable markers per workpackage (abbreviated title):

WP1 Management: D D D D
WP2 Common data policy framework: D D D D
WP3 Knowledge exchange/dissemination: D D D D
WP4 Common user information exchange: D D D
WP5 Scientific data: D D D
WP6 Data analysis software infrastructure: D D D D
WP7 Integration and cross-linking: D D D

Key: D - Deliverable, M - Milestone, W - Workshop

Milestones

M1 Data Policy: development and delivery of the common data policy

M2 User and Data Standards: delivery of draft standards for data and user information

M3 Baseline for integration: delivery of policy on user information, first report on publications and integration

M4 Integration proposal: delivery of policy and first proposal on integration and on analysis software

Final Workshop: final reports on standards

Task diagram (boxes, grouped by workpackage):

WP2: 2.1 Data Policy, 2.2 Software Policy, 2.3 User Policy, 2.4 Integrated Policy
WP4: 4.1 User Proposal, 4.2 User Workshop, 4.3 User Revision
WP5: 5.1 Data Proposal, 5.2 Data Workshop, 5.3 Data Revision
WP6: 6.1 Software Review, 6.2 Software Workshop, 6.3 Software Proposal, 6.4 Software Revision
WP7: 7.1 Integration Report, 7.2 Integration Proposal, 7.3 Integration Revision
WP3: 3.4 Final Workshop

Project Management, Knowledge Exchange and Dissemination Activities

Dependencies between the major project tasks

• The establishment of the policy framework within each technical theme guides the development of the relevant standards in that theme.

• The Data and User Policies tasks start from a mature basis and do not require an initial review. They will take their initial proposal to the first workshop and a revised proposal to the final workshop for dissemination.

• The software theme is less mature in the community and thus requires an initial review and an exploratory first workshop, and will take a draft proposal before the final workshop, to be revised subsequently.

• The data and software themes provide requirements for the user information theme, as requirements on data and software will influence the user information that needs to be shared.

• The three workshops within WP4, 5 and 6 will be co-located in one project workshop, represented by the box grouping tasks 4.2, 5.2 and 6.2 above.

• The final workshop is aimed at a wider audience and will present the results of the technical themes of the project, together with a draft proposal for an integrated strategy for developing the infrastructure further for the community. This integration strategy will be revised in the light of the workshop.

Objectives

Objective 1 – Collaboration. To establish an effective and efficient collaboration between the partners delivering added value to each participant through shared activities and to integrate this collaboration with related infrastructure initiatives beyond the project.

Objective 2 – Policy. To agree between partners on the elements of a general, standard data policy framework and to establish and maintain individual data policies in accordance with this standard.

Objective 3 – Knowledge exchange and dissemination. To promote feedback from user communities with respect to project objectives and results; to liaise with other projects and facilities, including other photon and neutron facilities and e-infrastructure projects; and to ensure effective communication with third parties such as software suppliers.

Objective 4 – Users. To foster interoperability of user information across the participating facilities and the wider research community.

Objective 5 – Data (including Formats and Metadata). To foster interoperability of data formats and metadata schemas across the participating facilities and the wider research community.

Objective 6 – Software. To determine how to develop, deploy, operate and evaluate a common registry of data analysis software and, where appropriate, the necessary format converters, so that data from different sources can, in the future, be treated with a variety of data analysis software.

Objective 7 – Integration and cross-linking of outputs. To foster the integration of the whole science lifecycle, focusing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.

5 Standardisation Activities

The common data policy framework work package aims to agree between partners on the elements of a standard data policy framework and to establish and maintain individual data policies in accordance with this standard. It is a basis for the work packages dedicated to individual strands, listed below, and its phased timing corresponds to those work packages.

The common user information exchange work package will underpin Virtual Organisation Management across the participants. This work package will build upon existing technology developed elsewhere and thus begins from a mature basis. It will consist primarily of proposing adaptations of these technologies to the current environment.

The scientific data work package is slightly different in nature as it is largely centred on the common data formats that will enable the integration of the Data Catalogue and Software Services. These formats are already well understood and accepted, so there is no need for a review phase. The data standards will enable the sharing of data across the participating facilities by providing integrated searching across the associated metadata.

The data analysis software infrastructure work package will enable best use of the available software by allowing the most appropriate software to be used independently of where the data is collected. This will require interaction with external parties such as software developers, and so the first workshop takes place only a short time after the commencement of this work package, to allow their input to be obtained.

The integration and cross-linking of outputs work package is concerned with closing the lifecycle of research, by incorporating the publications that are the end result, and are held in repositories. It also focuses on long-term preservation of the data and other outputs. These aspects are linked because publications can provide what is called Representation Information (in the terminology of the OAIS standard) to assist the continued correct interpretation and use of data into the future.

Questions

Staff Effort

Staff Months      STFC    STFC    ESRF     ILL Diamond     PSI    DESY ELETTRA  Soleil    ALBA   BESSY     CEA   Total
Management         4.5       -       -       -       -       -       -       -       -       -       -       -     4.5
Policy               -       2       4       2     0.5     0.5     0.5     0.5     0.5     0.5     0.5     0.5      12
Dissemination        -       2     0.5     0.5     0.5     0.5       4     0.5     0.5     0.5     0.5     0.5    10.5
User                 -     0.5       2     0.5       4       2     0.5     0.5     0.5     0.5     0.5     0.5      12
Data                 -     0.5       2     0.5     0.5     5.5       2     0.5     0.5     0.5     0.5     0.5    13.5
Software             -     0.5     0.5     5.5       2     0.5       2     0.5     0.5     0.5     0.5     0.5    13.5
Integration          -       4     0.5     0.5       2     0.5     0.5     0.5     0.5     0.5     0.5     0.5    10.5
Total              4.5     9.5     9.5     9.5     9.5     9.5     9.5       3       3       3       3       3    76.5
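The row and column totals of the staff effort table can be cross-checked mechanically. The sketch below is only an illustration: the figures are copied from the table, and it assumes that Management effort sits in the first STFC column while every other row fills the remaining eleven columns, which is the layout consistent with the printed totals. The labels `STFC (1)` and `STFC (2)` are made-up names used to tell the two STFC columns apart.

```python
# Staff months per work package, copied from the staff effort table.
# Assumption: Management occupies only the first STFC column; all other
# rows fill the remaining eleven columns.
PARTNERS = ["STFC (1)", "STFC (2)", "ESRF", "ILL", "Diamond", "PSI", "DESY",
            "ELETTRA", "Soleil", "ALBA", "BESSY", "CEA"]

EFFORT = {
    "Management":    [4.5] + [0.0] * 11,
    "Policy":        [0.0, 2, 4, 2] + [0.5] * 8,
    "Dissemination": [0.0, 2, 0.5, 0.5, 0.5, 0.5, 4] + [0.5] * 5,
    "User":          [0.0, 0.5, 2, 0.5, 4, 2] + [0.5] * 6,
    "Data":          [0.0, 0.5, 2, 0.5, 0.5, 5.5, 2] + [0.5] * 5,
    "Software":      [0.0, 0.5, 0.5, 5.5, 2, 0.5, 2] + [0.5] * 5,
    "Integration":   [0.0, 4, 0.5, 0.5, 2, 0.5, 0.5] + [0.5] * 5,
}


def row_totals(effort):
    """Total staff months per work package."""
    return {wp: sum(months) for wp, months in effort.items()}


def column_totals(effort):
    """Total staff months per partner column, in PARTNERS order."""
    return [sum(column) for column in zip(*effort.values())]


if __name__ == "__main__":
    for wp, total in row_totals(EFFORT).items():
        print(f"{wp:<14}{total:>6}")
    for partner, total in zip(PARTNERS, column_totals(EFFORT)):
        print(f"{partner:<14}{total:>6}")
    print("Grand total:", sum(column_totals(EFFORT)))
```

All values are multiples of 0.5, so the floating-point sums are exact and can be compared directly with the printed totals.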

Deliverables

D1.1: Project Reporting, IPR, risk and quality management procedures (M3)
D1.2: First bi-annual management report (M6)
D1.3: Second annual management report (M12)
D1.4: Final management report (M18)

D2.1: Common policy framework on scientific data (M4)
D2.2: Common policy framework on analysis software (M8)
D2.3: Common policy framework on user data (M12)
D2.4: Common integrated policy framework (M16)

D3.1: Project Website (M3)
D3.2: Dissemination plan (M6)
D3.3: Update on dissemination including version one roadmap for international standardisation (M12)
D3.4: Final Dissemination Workshop Report (M18)

D4.1: Proposal for authentication system enabling shared Virtual Organisation Management (M8)
D4.2: User information workshop report (M10)
D4.3: Revised specification of common authentication system (M12)

D5.1: Proposal for data format standards (M8)
D5.2: Data standards workshop report (M10)
D5.3: Revised specification of data standards (M12)

D6.1: Report on current software registries and data analysis software (M8)
D6.2: Workshop report on standards and methods for sharing software (M10)
D6.3: Draft proposal on strategy for data analysis software infrastructure (M16)
D6.4: Final proposal on standards and methods for sharing software (M18)

D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12)
D7.2: Proposal for integration of practices (M16)
D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)

WP1 Management

Objectives

• To establish an effective and efficient collaboration between the partners delivering added value to each participant and to the European research community as a whole.
• To report to the Commission as required.

Methodology

• Establish and enforce financial and administrative procedures to report and manage the EC contract with the Commission and partners.
• Establish mailing lists, an internal website and hold regular meetings to ensure an efficient flow of information between the consortium partners.
• Establish quality management procedures and monitor quality of output.
• Establish a plan and procedures for the management of IPR.
• Establish a risk management plan and monitor risks, reporting to the Project Management Board.

Task 1.1: Agree on common modes of working required to achieve the goals of the project including management of IPR (M1–M3).
Task 1.2: Monitor progress of these joint activities and put in place appropriate corrective actions as necessary to deliver the project (ongoing).
Task 1.3: Organise general meetings of the project (to be held at M1, M6, M12 and M18).
Task 1.4: Report to EC on the financial and technical progress of the project (M6, M12, M18).

Deliverables

D1.1: Project Reporting, IPR, risk and quality management procedures (M3)
D1.2: First bi-annual management report (M6)
D1.3: Second annual management report (M12)
D1.4: Final management report (M18)

WP2 Development of standards for a common data policy framework

Objectives

• To agree between the partners on the elements of a general, standard data policy framework and to establish, promote and maintain individual data policies in accordance with this standard.
• This work package is the basis for the work packages devoted to the individual strands; it sets the requirements and principles within which they operate.

Methodology

• Survey existing relevant policies at the partner facilities and correlate them with guidelines emerging from national and international bodies.
• Abstract from these a common set of generic policy elements and refine and approve existing policies against this framework.
• Undertake a common foresight activity to inform evolution of policy in the light of technical and regulatory developments.
• Work towards convergence of policies in the longer term as experience of what constitutes best practice emerges.
• Liaise with other parties where such policy frameworks already exist to promote best practice in data management and exploitation.

The policies will influence and be influenced by the corresponding Work Packages devoted to the development of standards and practices, and the timings are chosen to match the milestones of those Work Packages.

Task 2.1: Development of common policy framework for scientific data (M1-M4)
Task 2.2: Development of common policy framework for analysis software (M1-M8)
Task 2.3: Development of common policy framework for user data (M8-M12)
Task 2.4: Development of integrated common policy framework for data (M12-M16)

Deliverables

D2.1: Common policy framework on scientific data (M4)
D2.2: Common policy framework on analysis software (M8)
D2.3: Common policy framework on user data (M12)
D2.4: Common integrated policy framework (M16)

WP3 Knowledge Exchange and Dissemination (Original Text)

Objectives

• To promote feedback from user communities with respect to project objectives and results.
• To liaise with other projects and facilities, including other photon and neutron facilities and e-infrastructure projects.
• To ensure effective communication with third parties such as software suppliers.

Methodology

• Set up mechanisms for effective communication of project outputs to other relevant I3 projects, facility user communities, partner research institutes/organisations, and more general e-infrastructure developments.
• Remain aware of related e-infrastructure and data integration developments outside the project, in particular across Europe, with a view to the longer term integration of this work into the broader integrated infrastructure required to support European Research in the coming decade.
• Contribute to the development of the broader infrastructure through participation in relevant integration, planning and standardization activities required to achieve the eIRG vision of an integrated European e-Infrastructure.

Task 3.1: Establish an external web site (M1-M3).
Task 3.2: Establish an interest group for project news items via community channels, informing them of project progress (M4-M9).
Task 3.3: Presentations to relevant international audiences at conferences, symposia, (other) project meetings etc. (ongoing).
Task 3.4: Final workshops to present the integrated systems to user and facility communities (M18).

Deliverables

D3.1: Project Website (M3)
D3.2: Dissemination plan (M6)
D3.3: Update on dissemination (M12)
D3.4: Final Dissemination Workshop Report (M18)

WP3 Knowledge Exchange and Dissemination (EC Additions)

Methodology

• The project will plan adequately resourced dissemination activities for specialised constituencies and the general public, in particular for awareness and educational purposes.
• The dissemination plan deliverable has to consider adequate messages about the objectives of the project and its societal and economic impact.
• The tools to be used should include web-based communication, press releases, brochures, booklets, multimedia material, etc.
• The 'dissemination material' should be regularly updated to provide the latest version of the project status and objectives.
• Electronic and/or paper versions of this 'dissemination material' will be made available to the Project Officer beforehand for consultation and upon its final release.
• The project will actively participate in the concertation activities and meetings related to the e-Infrastructures area.
• The objective is to optimise synergies between projects by providing input to and receiving feedback from working groups addressing activities of common interest (e.g. from clusters and projects).
• Projects may offer advice and guidance and receive information relating to 7th Framework Programme implementation, standardisation, policy and regulatory activities, EU Member States initiatives or relevant international initiatives.

Note that proper acknowledgement of the source funding (the FP7 logo and the EU flag, EC/e-Infrastructures, etc.) will be provided in all dissemination activities.

D3.3: Update on dissemination including version one roadmap for international standardisation (M12)

WP4 Development of standards for common user information exchange

Objectives

• To foster interoperability of user information across the participating facilities and the wider research community.
• To develop standards enabling a shared Virtual Organisation Management and common processes across the participating facilities.

Methodology

• The ultimate objective is the implementation of a system to allow scientific users to access data files across the physically distributed repositories. A typical use case would be a user who has performed experiments at several facilities and needs to perform the same data analysis on all data sets. This process involves the use of remote computing resources and software packages, which implies a system whereby a user logged in at a local site can be automatically authenticated and authorised (AAA) to use remote facilities. This additional level of AAA should be as transparent as possible to the user.
• Data protection laws in each country enormously complicate the sharing of user information between organisations. Consequently the AAA must function with the transfer of the very minimum of information, possibly only the user’s name and/or email and the trust information. A corollary is that AAA is not involved in implementing user databases at each site but rather in providing a mechanism for interfacing with existing applications to make the trust information available in a consistent and coordinated manner across the facilities.

Task 4.1: Review existing authentication solutions with special emphasis on the IRUVX / ESRFUP prototype solution. Propose a prototype authentication system in view of the needs of the full neutron and photon community (M1-M8).
Task 4.2: Workshop with facility authentication experts; plan the adoption strategy for the full-community authentication system (M9).
Task 4.3: Revise the proposal in the light of the workshop findings, and determine the next steps (non web-based applications, GRID-related issues) (M8-M12).
(Note: the final workshop to disseminate the results of the work package takes place in WP3)

Deliverables

D4.1: Proposal for authentication system enabling shared Virtual Organisation Management (M8)
D4.2: User information workshop report (M10)
D4.3: Revised specification of common authentication system (M12)
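The WP4 requirement that AAA works with "the very minimum of information" can be illustrated with a short sketch. This is not the IRUVX / ESRFUP prototype or any system proposed by the project: it assumes, purely for illustration, that each home facility shares a secret key with the remote facility and signs a payload containing only name, email and issuer with HMAC-SHA256. The function names, field names and keys are all hypothetical.

```python
import hashlib
import hmac
import json


def issue_assertion(name, email, issuer, key):
    """Home facility: sign the minimum user information (hypothetical format)."""
    payload = json.dumps({"name": name, "email": email, "issuer": issuer},
                         sort_keys=True)
    signature = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_assertion(assertion, trusted_keys):
    """Remote facility: return the claims if the issuer is trusted, else None."""
    payload = assertion["payload"]
    claims = json.loads(payload)
    # Trust information: only issuers whose key is known are accepted.
    key = trusted_keys.get(claims.get("issuer"))
    if key is None:
        return None
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, assertion["signature"]):
        return None
    return claims


if __name__ == "__main__":
    keys = {"facility-a": b"example-shared-secret"}  # hypothetical key store
    token = issue_assertion("A. User", "a.user@example.org", "facility-a",
                            keys["facility-a"])
    print(verify_assertion(token, keys))
```

The remote site never queries the home site's user database; it only checks a signature over the few fields it receives, which matches the point that AAA interfaces with existing applications rather than replacing them.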

WP5 Development of standards for scientific data

Objectives

• To foster interoperability of data formats and metadata schemas across the participating facilities and the wider research community.

Methodology

• Today all participating facilities use their own data file formats, which is a great obstacle for file access as input file readers have to be provided for each format. A shared infrastructure, involving databases and software, effectively imposes a common data format, which requires some agreement on the data to store and the format itself. This work package, through fact finding, monitoring and strategy development, will define a common data format, based on the NeXus international standard.
• In order to make raw and processed data accessible to scientists it is essential to be able to search databases by their metadata, which refers to the data describing the stored data, e.g. experiment name, date, facility where the data was taken, energy range of the data, type of technique, sample type and name, etc. The metadata with a link to the raw or processed data file will be made available via a metadata catalogue. This work package, through fact finding, monitoring and strategy development, will determine the metadata to be included in databases.

Task 5.1: Evaluate existing data format standards and propose a coherent set covering the format requirements across the facilities and in the user community, prepare workshop (M4-M8).
Task 5.2: Workshop to agree on this minimum set; include decision makers from users, facilities, software developers (M9).
Task 5.3: Revise the data format standards in the light of the workshop findings (M8-M12).
(Note: the final workshop to disseminate the results of the work package takes place in WP3)

Deliverables

D5.1: Proposal for data format standards (M8)
D5.2: Data standards workshop report (M10)
D5.3: Revised specification of data standards (M12)
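The metadata catalogue described in the WP5 methodology can be pictured as a set of records, each carrying descriptive fields and a link to the data file, searched by field values. The sketch below is a minimal illustration only: the record fields follow the examples listed in the text, while the class name, the `search` helper, the experiment identifiers and the file paths are hypothetical and do not reflect any schema agreed by the project.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetadataRecord:
    experiment: str
    taken_on: date
    facility: str
    technique: str
    sample: str
    data_file: str  # link to the raw or processed data file


# Hypothetical entries standing in for records gathered from several facilities.
CATALOGUE = [
    MetadataRecord("exp-001", date(2010, 6, 14), "ISIS", "neutron diffraction",
                   "sample-a", "isis/exp-001.nxs"),
    MetadataRecord("exp-002", date(2010, 9, 2), "ESRF", "X-ray diffraction",
                   "sample-a", "esrf/exp-002.nxs"),
    MetadataRecord("exp-003", date(2011, 1, 20), "ESRF", "tomography",
                   "sample-b", "esrf/exp-003.nxs"),
]


def search(catalogue, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [record for record in catalogue
            if all(getattr(record, field) == value
                   for field, value in criteria.items())]


if __name__ == "__main__":
    # All measurements of one sample, regardless of facility or technique.
    for record in search(CATALOGUE, sample="sample-a"):
        print(record.facility, record.technique, record.data_file)
```

Searching by sample across facilities is the cross-facility case the title slide points at: the same sample measured by neutron and X-ray diffraction at different sites is found through one query.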

WP6 Strategy for data analysis software infrastructure

Objectives

• To determine how to develop, deploy, operate and evaluate a common registry of data analysis software and, where appropriate, the necessary format converters, so that data from different sources can, in the future, be treated with a variety of data analysis software.

Methodology

Data analysis (software) is a key link in the chain of events that transforms original ideas into conclusive scientific output. This WP, by fostering a common software resource, will ultimately enable the most appropriate software to be used independently of where the data is collected. A model for this type of activity is the “Collaborative Computational Projects” in the UK (see www.ccp.ac.uk). The approach of this WP is therefore to help define a common software resource that will:

1. simplify and streamline for facility users the conversion of raw data into high quality scientific data for publication,
2. accelerate the deployment and use of new data analysis methods which will open doors to new science across the facilities and the user community,
3. enhance and optimise the scientific output of the facilities, i.e. better value for money.

Task 6.1: Review existing registries for data analysis software. Catalogue the data analysis software in use across the facilities and in the user community (M4-M8).
Task 6.2: Workshop to agree position on data analysis software infrastructure, including providers of this software to define standards/rules for sharing, versioning, tracing software (M9).
Task 6.3: Analyse findings of workshop and propose strategy on software sharing (M8-M16).
Task 6.4: Revise proposal strategy for data analysis software sharing (M17-M18).
(Note: the final workshop to disseminate the results of the work package takes place in WP3)

Deliverables

D6.1: Report on current software registries and data analysis software (M8)
D6.2: Workshop report on standards and methods for sharing software (M10)
D6.3: Draft proposal on strategy for data analysis software infrastructure (M16)
D6.4: Final proposal on standards and methods for sharing software (M18)
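The WP6 objective combines two ideas: a registry recording which formats each analysis package reads, and format converters that bridge the gaps. A rough sketch of how the two could work together follows. The package names, format names and the single-step conversion rule are all assumptions made for illustration; the project itself only sets out to determine how such a registry should be built.

```python
# Hypothetical registry: analysis package -> data formats it can read.
REGISTRY = {
    "refine-a": {"nexus"},
    "fit-b": {"facility-x-raw"},
}

# Hypothetical converters, each a (source format, target format) pair.
CONVERTERS = {("facility-x-raw", "nexus")}


def usable_software(data_format, registry, converters):
    """Map each usable package to the conversion it needs.

    The value is None when the package reads the format directly, or a
    (source, target) pair when one conversion step is required. Packages
    that cannot be reached in one step are left out.
    """
    result = {}
    for name, formats in registry.items():
        if data_format in formats:
            result[name] = None
            continue
        for source, target in sorted(converters):
            if source == data_format and target in formats:
                result[name] = (source, target)
                break
    return result


if __name__ == "__main__":
    print(usable_software("facility-x-raw", REGISTRY, CONVERTERS))
```

With a converter in place, data collected in one facility's own format becomes usable by software written for another format, which is the sense in which software can be chosen independently of where the data is collected.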

WP7 Development of standards for integration and cross-linking of outputs

Objectives

• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.

Methodology

• Publications repositories complete the lifecycle of innovation. Linking to Users, Data and Software enables traceability of published results through the scientific process. Sharing of the final results provides a foundation for the next cycle of science, and packaging enables long-term preservation of the outputs of research. Association of data with the publications resulting from it is a basis for preservation through Representation Information, a term from the OAIS standard (Open Archival Information System) meaning information necessary to ensure continued understandability and usability of a digital resource.
• Furthermore, this is also a basis for reuse of data across diverse communities, since the supplementary information needed for continued understandability is also valuable for transfer across communities. The European Support Action PARSE.Insight (of which STFC is WPL) is producing a roadmap for digital preservation in Europe, informed by a large-scale survey of attitudes and practices in a wide range of scientific disciplines. The roadmap includes components such as tools for creation of Representation Information, and will be taken into account in the project work.

Task 7.1: Review existing provision for publication repositories, citation recording and long-term preservation in use across the facilities and in the user community, including facility libraries (M8-M12).
Task 7.2: Propose strategy on integration of practices across the community (M12-M16).
Task 7.3: Develop final proposal on integration of practices across the community (M17-M18).
(Note: the final workshop to disseminate the results of the work package takes place in WP3)

Deliverables

D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12)
D7.2: Proposal for integration of practices (M16)
D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)
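The traceability that WP7 aims for, from a published result back to its data and from a dataset forward to the publications that use it, amounts to keeping links that can be followed in both directions. The sketch below shows one simple way to hold such links in memory. The class, method names and identifiers are hypothetical; a real service would sit on top of the publication repositories and data catalogues rather than a dictionary.

```python
from collections import defaultdict


class CrossLinks:
    """Two-way index between publications and the datasets behind them."""

    def __init__(self):
        self._data_for_pub = defaultdict(set)
        self._pubs_for_data = defaultdict(set)

    def link(self, publication_id, dataset_id):
        """Record that a publication is based on a dataset."""
        self._data_for_pub[publication_id].add(dataset_id)
        self._pubs_for_data[dataset_id].add(publication_id)

    def datasets_behind(self, publication_id):
        """Datasets underpinning a publication, in sorted order."""
        return sorted(self._data_for_pub.get(publication_id, ()))

    def publications_using(self, dataset_id):
        """Publications that draw on a dataset, in sorted order."""
        return sorted(self._pubs_for_data.get(dataset_id, ()))


if __name__ == "__main__":
    links = CrossLinks()
    links.link("pub-1", "isis/exp-001")  # hypothetical identifiers
    links.link("pub-1", "esrf/exp-002")
    links.link("pub-2", "esrf/exp-002")
    print(links.datasets_behind("pub-1"))
    print(links.publications_using("esrf/exp-002"))
```

Following the link from dataset to publication is also how a publication can serve as Representation Information for the data it describes.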