INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
Operation and management issues in the EGEE/SWE grid infrastructure
G. Barreira, G. Borges, M. David, N. Dias, J. Gomes, J. P. Martins
LIP: Laboratório de Instrumentação em Física Experimental de Partículas
C. Borrego, M. Delfino, G. Merino, K. Neuffer, A. Pacheco
PIC: Port d’Informació Científica
F. Bernabé, J. Fontán, J. Lopez, P. Rey
CESGA: Fundación Centro Tecnológico de Supercomputación de Galicia
R. Marco
IFCA/CSIC: Instituto de Física de Cantabria / Consejo Superior de Investigaciones Científicas
J. Palacios
IFIC/CSIC: Instituto de Física Corpuscular / Consejo Superior de Investigaciones Científicas
Operation and management issues in the EGEE/SWE grid infrastructure CGW’06 2
Enabling Grids for E-sciencE
INFSO-RI-031688
Outline
o The EGEE grid project.
o Main operation activities inside EGEE South-West grid infrastructure:
  – Resources;
  – Activities coordination:
      Certification;
        • Sites and middleware certification;
      Accounting;
        • EGEE View;
        • Participation in the Accounting Enforcement task;
      Monitoring;
        • Interaction with the Grid Operation Centre (GOC);
        • Participation in COD;
      Support;
        • Interaction with the Global Grid User Support (GGUS);
      Authentication and Security;
        • Activities in the EUGridPMA framework.
      Middleware tests and integration.
EGEE project
o The Enabling Grids for E-sciencE project:
  – A European-funded grid project;
  – The largest worldwide grid for multi-disciplinary sciences;
      Integrates several national and regional grids;
      More than 90 partners distributed over 32 countries;
  – Built on top of the infrastructures and software developed in the EDG and LCG grid projects.
o The LHC Computing Grid project:
  – The LHC will be the world’s most powerful particle accelerator;
      Built at CERN and expected to start operating in 2007;
  – LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community:
      15 petabytes of experimental data annually;
      Available during the 15-year lifetime of the LHC machine;
      Fully accessible to ~5000 scientists from more than 500 institutes.
o EGEE concentrates on three core areas:
  – Improve and maintain the middleware;
      Provide a reliable service;
  – Attract new users from industry as well as from science;
      Ensure they receive a high standard of training and support;
  – Combine national, regional and thematic Grid efforts;
      For a seamless Grid infrastructure for scientific research, and to build a sustainable Grid for business research and industry.
o EGEE has expanded from its original two scientific fields (high energy physics and life sciences) and now integrates applications from other scientific fields:
  – Astrophysics; Biomedical and Bioinformatics applications;
  – Computational chemistry; Earth Sciences;
  – Finance; Fusion; Geophysics;
  – (...)
o EGEE supports more than 100 virtual organizations.
EGEE Operations: The GOC
o The Grid Operations Centre (GOC) is responsible for coordinating the overall operation of the EGEE Grid:
  – Devises and manages mechanisms and procedures which encourage optimal operation of the Grid;
  – Acts as a central point of operational information such as:
      Site local and central services;
      Site resources configuration;
      Contact details.
  – Monitors the operation of the Grid infrastructure as a whole;
      The GOC works with the federations’ local support groups to assist them in providing the best possible service while their infrastructure is connected to the Grid.
EGEE Operations: The ROCs
o The fulfillment of each federation’s key objectives is supervised by its Regional Operation Centre (ROC):
  – Operates essential core services;
      RBs, data management services, information services, VOMS servers;
  – Acts as the interface between VO requests and site resources;
  – Provides monitoring and operational troubleshooting services;
  – Receives, responds to and coordinates the resolution of grid operation problems, from both the sites’ and the users’ point of view.
o ROC regions:
  – South-Western Europe
  – France
  – UK/Ireland
  – Northern Europe
  – Germany/Switzerland
  – CERN
  – Italy
  – Central Europe
  – South Eastern Europe
  – Russia
  – Asia/Pacific
South-West federation
o The EGEE South-West federation is part of the European Grid Operation, Support and Management activity (SA1).
o It is responsible for maintaining high quality services of the grid infrastructure inside the South-West region:
  – Portuguese partner: LIP;
  – Spanish partners: CESGA, CSIC, PIC, CIEMAT, BIFI;
  – PIC is the “Tier 1” centre of the SWE federation.
o The EGEE SWE ROC is shared among the different institutes:
  – This requires a higher coordination effort;
      All operations/management questions are reported weekly to the ROC manager during a VRVS meeting;
      The meeting promotes communication between the different site managers;
      It promotes the knowledge exchange necessary for a faster resolution of problems.
South-West federation resources
o The EGEE South-West federation is presently offering:
  – Core services for the production testbed (as of 13/10/2006):
      8 Resource Brokers;
      8 top-level BDII machines;
      3 LFC central catalogs;
      1 FTS service.
  – Local services for the production infrastructure:
      18 Computing Elements;
        • 1052 CPUs = 935.2 normalized CPUs (1 normalized CPU = 1000 SpecInt2000, roughly a Pentium IV @ 2.8 GHz).
      18 Storage Elements;
        • 35.4 TB of online storage (disk);
        • 1.5 PB of nearline storage (tape backend).
  – These resources are currently shared, according to the federation’s internal policies, by more than 20 virtual organizations.
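The CPU normalization quoted above can be illustrated with a short sketch. It assumes, as the slide states, that one normalized CPU corresponds to 1000 SpecInt2000. The function name and the per-node ratings in `example_site` are hypothetical and chosen only for illustration; they are not SWE site data.

```python
# Reference rating from the slide: 1 normalized CPU = 1000 SpecInt2000.
REFERENCE_SI2K = 1000.0


def normalized_cpus(cpu_groups):
    """Sum CPU counts weighted by their SpecInt2000 rating.

    cpu_groups: iterable of (cpu_count, si2k_per_cpu) pairs.
    """
    return sum(count * si2k / REFERENCE_SI2K for count, si2k in cpu_groups)


# Hypothetical site: 40 CPUs at the reference rating, 20 slower CPUs.
# These ratings are illustrative assumptions, not measured values.
example_site = [(40, 1000.0), (20, 800.0)]
site_total = normalized_cpus(example_site)  # 40 * 1.0 + 20 * 0.8

# Average rating per CPU implied by the federation totals quoted above
# (935.2 normalized CPUs spread over 1052 physical CPUs).
average_si2k = 935.2 / 1052 * REFERENCE_SI2K
```

With the federation totals, the average CPU comes out at roughly 889 SpecInt2000, slightly below the reference machine.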
SWE ROC tasks: Site certification
o The SWE ROC is responsible for certifying that a site fulfills the necessary requirements to join the grid production infrastructure:
  – Performed by LIP in Portugal;
  – Performed by PIC in Spain;
  – The certification process consists of a set of demanding tests:
      Information system;
      Site configuration;
      Interactions with the central core services.
  – The ROC negotiates service level agreements (SLAs):
      These settle the level of service each Resource Centre (RC) should provide to the infrastructure.
SWE ROC tasks: Accounting
o The EGEE South-West federation was one of the first to widely deploy grid accounting tools:
  – CESGA is the entity responsible inside the South-West federation for maintaining the accounting portal;
  – The most relevant information is compiled monthly and reported to the ROC and federation members.
o Due to its expertise, CESGA was proposed as the entity responsible for the “Accounting enforcement task”…
  – Monitor the whole EGEE infrastructure;
  – Check whether all the Resource Centres are publishing correct accounting information, and open tickets if they are not;
  – Help the Resource Centres to deploy the necessary accounting tools;
o … and to take charge of the “EGEE View”:
  – Portal with accounting information from all EGEE sites.
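The enforcement check described above (find Resource Centres that are not publishing accounting data and open tickets for them) might be sketched as follows. This is only an illustration under stated assumptions: the function name, the site names, the dates and the 7-day freshness threshold are all hypothetical, and the sketch checks only whether records are recent, not whether their content is correct.

```python
from datetime import date, timedelta


def sites_missing_accounting(last_published, today, max_age_days=7):
    """Return sites whose latest accounting record is missing or too old.

    last_published: dict mapping site name -> date of last record (or None).
    max_age_days: assumed freshness threshold, not an EGEE rule.
    """
    limit = today - timedelta(days=max_age_days)
    return sorted(
        site
        for site, last in last_published.items()
        if last is None or last < limit
    )


# Hypothetical publication dates for three made-up sites.
records = {
    "SITE-A": date(2006, 10, 12),
    "SITE-B": date(2006, 9, 1),
    "SITE-C": None,
}
stale = sites_missing_accounting(records, today=date(2006, 10, 13))

# One ticket text per offending site; opening it in a real ticketing
# system is outside the scope of this sketch.
tickets = [f"Accounting data missing or outdated at {site}" for site in stale]
```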
[Figure: some SWE accounting charts, annotated with the totals 949658 jobs, 3504204 hours and 2870184 hours.]
[Figure: some “EGEE View” charts.]
SWE ROC tasks: Monitoring
o CIC on Duty (COD) work is done by Telefonica I+D, helped by PIC;
o CODs are grid expert teams which manage the day-to-day operation of the grid:
  – Active monitoring of the infrastructure;
  – Take appropriate action to protect the grid from the effects of failing components and to recover from operational problems. Example: a Resource Centre is causing problems by generating invalid information;
      The COD team opens a ticket to the Resource Centre;
      The COD team contacts the corresponding ROC operations support line;
      The COD team informs a network operations centre of suspected failures;
      The COD may remove the RC from the grid if the RC is unresponsive, until the problem has been fixed;
  – Many of these support and troubleshooting roles are undertaken in conjunction with the Regional Operation Centres;
      It is intended that tools will be developed to automate much of this work.
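The escalation steps in the example above can be written down as a small decision function. This is a sketch of the procedure as listed on the slide, not an actual COD tool: the function name, its parameters and the action strings are hypothetical.

```python
def cod_actions(rc_responsive, network_suspected=False):
    """Return the ordered COD actions for a failing Resource Centre (RC).

    rc_responsive: whether the RC reacts to the ticket.
    network_suspected: whether a network failure is suspected.
    """
    actions = [
        "open ticket to Resource Centre",
        "contact ROC operations support line",
    ]
    if network_suspected:
        actions.append("inform network operations centre")
    if not rc_responsive:
        # Last resort: the RC stays out until the problem has been fixed.
        actions.append("remove Resource Centre from the grid until fixed")
    return actions


routine_case = cod_actions(rc_responsive=True)
worst_case = cod_actions(rc_responsive=False, network_suspected=True)
```

Encoding the steps this way is one possible starting point for the automation the slide anticipates; the real procedure involves human judgement at each step.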
o CESGA maintains a GridICE portal for all the SWE RCs.
  – The GridICE server collects information through specific sensors included in the EGEE middleware: job information, grid service and fabric monitoring data.
  – Based on some plugins for Nagios:
      Collects the data published by the sites;
      Keeps them in a PostgreSQL database;
      Shows them in a web page.
  – GridICE also includes e-mail notifications about changes in the status of the sites (hosts, important processes, etc.).
o CESGA is also responsible for the SWE monitoring alert system, based on SFT/SAM results and GStat:
  – Site Availability Monitoring (SAM):
      Collection of comprehensive tests that are run daily on each certified site;
  – GStat Monitor:
      A snapshot of the Grid Information System.
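Notifications on status changes, as mentioned for GridICE above, amount to comparing two consecutive monitoring snapshots. The sketch below shows that idea only; it is not GridICE or Nagios code, and the function name, site names and status labels ("OK", "CRITICAL", "UNKNOWN") are assumptions made for illustration.

```python
def status_changes(previous, current):
    """Compare two status snapshots and list what changed.

    previous, current: dicts mapping site name -> status string.
    Returns a sorted list of (site, old_status, new_status) tuples.
    """
    changes = []
    for site in sorted(set(previous) | set(current)):
        old = previous.get(site, "UNKNOWN")
        new = current.get(site, "UNKNOWN")
        if old != new:
            changes.append((site, old, new))
    return changes


# Hypothetical snapshots taken at two consecutive monitoring runs.
previous = {"SITE-A": "OK", "SITE-B": "OK"}
current = {"SITE-A": "OK", "SITE-B": "CRITICAL", "SITE-C": "OK"}

# One message per change; sending it by e-mail is left out of the sketch.
notifications = [
    f"{site}: {old} -> {new}"
    for site, old, new in status_changes(previous, current)
]
```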
SWE ROC tasks: Support
o The regional EGEE South-West federation help desk portal is maintained by CSIC-IFIC:
  – Users/admins from the SWE federation can open tickets;
o The coordination of the user support services inside the federation is handled by LIP:
  – It is LIP’s responsibility to follow all tickets assigned to the SWE federation;
  – Make sure that they are routed to the correct RC and solved in time;
  – The SWE ROC is automatically warned (and acts accordingly) when:
      Tickets are opened by users or COD staff on federation sites;
      SAM or any other monitoring tool reports failures…
o The SWE help desk portal interacts with the EGEE Global Grid User Support (GGUS);
o GGUS is a trouble ticketing system:
  – Grid users and administrators can open tickets asking for help;
      Users can start a ticket using independent regional portals. Local experts can try to solve the problem or assign it to the central GGUS service;
      A ticket can also be opened directly in the GGUS service via a web form or email;
  – First line of support is provided by “Ticket Processing Managers” (TPMs):
      TPM teams are composed of 3 Grid experts, who change on a weekly basis;
      TPMs are able to provide a solution to a given grid operation problem or assign the issue to a more specialized support unit.
  – Support is assured 5 days a week, 9 hours a day;
  – GGUS is used to start COD trouble tickets when the monitoring jobs fail;
o LIP contributes one “Ticket Processing Manager” team for the general GGUS tasks.
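The support chain described above can be sketched as a routing function plus a weekly duty rota. Both are illustrations under stated assumptions: the function names and team names are hypothetical, the round-robin rotation is only one possible reading of "change on a weekly basis", and the sketch covers tickets entered through a regional portal, not those opened directly in GGUS.

```python
def route_ticket(solved_locally, tpm_can_solve):
    """Return the support levels a regional ticket passes through."""
    path = ["regional help desk"]
    if solved_locally:
        return path
    # Local experts could not solve it: hand over to the central service.
    path.append("GGUS first line (TPM)")
    if not tpm_can_solve:
        path.append("specialized support unit")
    return path


def tpm_team_on_duty(teams, week_number):
    """Pick the TPM team for a given week, assuming round-robin rotation."""
    return teams[week_number % len(teams)]


# Hypothetical team names; the real list of TPM teams is not given here.
teams = ["team-1", "team-2", "team-3"]
on_duty = tpm_team_on_duty(teams, week_number=41)

hard_ticket = route_ticket(solved_locally=False, tpm_can_solve=False)
```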
[Figure: regional SWE help-desk.]
SWE ROC tasks: Authentication and Security
o Valid certificates for EGEE in the SWE region are issued by:
  – LIP, through the LIP Certification Authority (LIPCA), in Portugal;
  – CSIC-IFCA and PK-IRISGRID in Spain.
o These CAs are members of the European Policy Management Authority for Grid Authentication in e-Science (EUGridPMA).
  – EUGridPMA coordinates a Public Key Infrastructure (PKI) used to issue X.509 certificates;
o SWE CAs participate in the EUGridPMA body and in the review of the CP/CPS (Certificate Policy/Certification Practice Statement).
o LIP (in Portugal) and RED.ES (in Spain) are responsible for security coordination and for handling security incidents.
SWE ROC tasks: Middleware integration
o gLite is the middleware layer developed by EGEE.
  – Extends the use of the grid infrastructure to all fields of science;
  – Follows a Service Oriented Architecture (SOA):
      Decreases the middleware dependence on the user’s applications and interactions with the different services.
o gLite middleware doesn’t support all Local Resource Management Systems (LRMS):
  – Only the LSF and Torque/Maui batch schedulers by default;
  – LIP and CESGA, together with IC, are involved in an EGEE task force to provide gLite support for the SGE batch system:
      New jobmanager implementation;
      New infoprovider scripts;
      Upgrade of the yaim installation procedure.
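Part of what a jobmanager does is translate generic job actions into the commands of the local batch system. The sketch below shows only that mapping for the three schedulers named above; it is not the gLite jobmanager. The table layout and the function name are hypothetical, and command options, job scripts and output parsing are left out.

```python
# Command-line tools of each batch system for three generic job actions.
BATCH_COMMANDS = {
    "torque": {"submit": "qsub", "status": "qstat", "cancel": "qdel"},
    "lsf": {"submit": "bsub", "status": "bjobs", "cancel": "bkill"},
    "sge": {"submit": "qsub", "status": "qstat", "cancel": "qdel"},
}


def command_for(lrms, action):
    """Return the tool name that performs `action` on batch system `lrms`."""
    try:
        return BATCH_COMMANDS[lrms.lower()][action]
    except KeyError:
        raise ValueError(f"unsupported batch system or action: {lrms}/{action}")


sge_submit = command_for("SGE", "submit")
lsf_cancel = command_for("lsf", "cancel")
```

Although SGE and Torque share tool names, their options and output formats differ, which is one reason a separate jobmanager and separate infoprovider scripts are needed.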
SWE pre-production testbed
o In parallel with the EGEE production testbed, some SWE sites also participate in a pre-production testbed:
  – CESGA, CSIC-IFIC, LIP and PIC;
o Objectives of the pre-production testbed:
  – Test new middleware releases;
      First contact with new services;
      Test all service interactions/interconnections;
      Report bugs to the developers;
      Test bug fixes;
  – Release to the production testbed the middleware packages/patches which were correctly validated;
o The SWE ROC participates in the validation process of middleware components and helps their deployment in the RCs.
Summary & Conclusions
o We have presented the main EGEE SWE federation activities:
  – Its resources for the production testbed;
  – Its operation and regional management procedures;
  – Its responsibilities in some general EGEE tasks:
      Certification; Accounting; Support; Monitoring; Authentication; Middleware tests and integration;
  – Further details regarding the EGEE SWE federation activities can be obtained by consulting the SWE portal maintained by CSIC-IFCA.
o This presentation aims to give a better understanding of the EGEE project and its fundamental organization, and to show how the different resources work together to deliver high quality services to the users.