INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

Operation and management issues in the EGEE/SWE grid infrastructure

G. Barreira, G. Borges, M. David, N. Dias, J. Gomes, J. P. Martins
LIP: Laboratório de Instrumentação em Física Experimental de Partículas

C. Borrego, M. Delfino, G. Merino, K. Neuffer, A. Pacheco
PIC: Port d’Informació Científica

F. Bernabé, J. Fontán, J. Lopez, P. Rey
CESGA: Fundación Centro Tecnológico de Supercomputación de Galicia

R. Marco
IFCA/CSIC: Instituto de Física de Cantabria / Consejo Superior de Investigaciones Científicas

J. Palacios
IFIC/CSIC: Instituto de Física Corpuscular / Consejo Superior de Investigaciones Científicas

Operation and management issues in the EGEE/SWE grid infrastructure CGW’06 2

INFSO-RI-031688

Outline

o The EGEE grid project.
o Main operation activities inside the EGEE South-West grid infrastructure:
  – Resources;
  – Activities coordination:
      Certification;
        • Sites and middleware certification;
      Accounting;
        • EGEE View;
        • Participation in the Accounting Enforcement task;
      Monitoring;
        • Interaction with the Grid Operations Centre (GOC);
        • Participation in COD;
      Support;
        • Interaction with the Global Grid User Support (GGUS);
      Authentication and Security;
        • Activities in the EUGridPMA framework.
      Middleware tests and integration.

EGEE project

o The Enabling Grids for E-sciencE project:
  – A European-funded grid project;
  – The largest worldwide grid for multi-disciplinary sciences;
      Integrates several national and regional grids;
      More than 90 partners distributed over 32 countries;
  – Built on top of the infrastructures and software developed in the EDG and LCG grid projects.
o The LHC Computing Grid project:
  – The LHC will be the world's most powerful particle accelerator;
      Built at CERN and expected to start operating in 2007;
  – LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community:
      15 Petabytes of experimental data annually,
      Available during the 15-year lifetime of the LHC machine;
      Fully accessible to ~5000 scientists from more than 500 institutes.

o EGEE concentrates on three core areas:
  – Improve and maintain the middleware;
      Provide a reliable service;
  – Attract new users from industry as well as from science;
      Ensure they receive a high standard of training and support;
  – Combine national, regional and thematic Grid efforts;
      For a seamless Grid infrastructure for scientific research, and to build a sustainable Grid for business research and industry.
o EGEE has expanded from its original two scientific fields (high energy physics and life sciences) and now integrates applications from other scientific fields:
  – Astrophysics; Biomedical and Bioinformatics applications;
  – Computational chemistry; Earth Sciences;
  – Finance; Fusion; Geophysics;
  – (...)
o EGEE supports more than 100 virtual organizations.

EGEE Operations: The GOC

o The Grid Operations Centre (GOC) is responsible for coordinating the overall operation of the EGEE Grid:
  – Devises and manages mechanisms and procedures that encourage optimal operation of the Grid;
  – Acts as a central point of operational information, such as:
      Site local and central services;
      Site resources configuration;
      Contact details.
  – Monitors the operation of the Grid infrastructure as a whole;
      The GOC works with the federations' local support groups to assist them in providing the best possible service while their infrastructure is connected to the Grid.

EGEE Operations: The ROCs

o The fulfillment of each federation's key objectives is supervised by its Regional Operation Centre (ROC):
  – Operates essential core services;
      RBs, data management services, information services, VOMS servers;
  – Acts as the interface between VO requests and site resources;
  – Provides monitoring and operational troubleshooting services;
  – Receives, responds to and coordinates the resolution of grid operation problems, from both the sites' and the users' point of view.
o Regions shown on the slide:
  – South-Western Europe
  – France
  – UK/Ireland
  – Northern Europe
  – Germany/Switzerland
  – CERN
  – Italy
  – Central Europe
  – South Eastern Europe
  – Russia
  – Asia/Pacific

South-West federation

o The EGEE South-West federation is part of the European Grid Operation, Support and Management activity (SA1).
o It is responsible for maintaining high quality grid infrastructure services inside the South-West region:
  – Portuguese: LIP;
  – Spanish: CESGA, CSIC, PIC, CIEMAT, BIFI;
  – PIC is the “Tier 1” centre of the SWE federation.
o The EGEE SWE ROC is shared among the different institutes:
  – This requires a higher coordination effort;
      All operations/management questions are reported weekly to the ROC manager during a VRVS meeting;
      This promotes communication between the different site managers;
      It also promotes the knowledge exchange necessary for a faster resolution of problems.

South-West federation resources

o The EGEE South-West federation is presently offering…
  – Core services for the production testbed (13/10/2006):
      8 Resource Brokers;
      8 top BDII machines;
      3 LFC central catalogs;
      1 FTS service.
  – Local services for the production infrastructure:
      18 Computing Elements;
        • 1052 CPUs = 935.2 normalized CPUs
          (1 normalized CPU = 1000 SpecInt2000 = Pentium IV @ 2.8 GHz).
      18 Storage Elements;
        • 35.4 TB of online storage (disk);
        • 1.5 PB of nearline storage (tape backend).
  – These resources are currently shared, according to the federation's internal policies, by more than 20 virtual organizations.
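
The CPU normalization quoted above can be illustrated with a short Python sketch. Only the reference value (1 normalized CPU = 1000 SpecInt2000) and the federation totals (1052 CPUs, 935.2 normalized CPUs) come from the slide; the per-site ratings and CPU counts are hypothetical.

```python
# Reference rating taken from the slide: 1 normalized CPU = 1000 SpecInt2000.
REFERENCE_SI2K = 1000.0


def normalized_cpus(cpu_counts_by_rating):
    """Convert {SpecInt2000 rating: number of CPUs} into normalized CPUs."""
    total_si2k = sum(rating * count for rating, count in cpu_counts_by_rating.items())
    return total_si2k / REFERENCE_SI2K


# Hypothetical site: 40 CPUs rated 800 SI2K and 10 CPUs rated 1200 SI2K.
site = {800: 40, 1200: 10}
site_total = normalized_cpus(site)  # (800*40 + 1200*10) / 1000 = 44.0

# The federation totals from the slide imply an average rating per CPU
# of roughly 889 SpecInt2000.
average_si2k = 935.2 / 1052 * REFERENCE_SI2K
```

Read this way, the federation's CPUs are on average slightly slower than the 2.8 GHz Pentium IV used as the reference machine.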

SWE ROC tasks: Site certification

o The SWE ROC is responsible for certifying that a site fulfills the necessary requirements to join the grid production infrastructure:
  – Performed by LIP in Portugal;
  – Performed by PIC in Spain;
  – The certification process consists of a set of demanding tests:
      Information system;
      Site configuration;
      Interactions with the central core services.
  – The ROC negotiates service level agreements (SLAs):
      These settle the level of service each Resource Centre (RC) should provide to the infrastructure.

SWE ROC tasks: Accounting

o The EGEE South-West federation was one of the first to widely deploy grid accounting tools;
  – CESGA is the entity responsible, inside the South-West federation, for maintaining the accounting portal;
  – The most relevant information is compiled monthly and reported to the ROC and federation members.
o Due to its expertise, CESGA was proposed as the entity responsible for the “Accounting enforcement task”…
  – Monitor all the EGEE infrastructure;
  – Check whether all the Resource Centres are publishing correct accounting information, and open tickets if they are not;
  – Help the Resource Centres to deploy the necessary accounting tools;
o … and to take charge of the “EGEE View”:
  – Portal with accounting information from all EGEE sites.
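
The check behind the accounting enforcement task (find Resource Centres that are not publishing and open tickets for them) could be sketched as follows. This is an illustration only: the site names, the dates and the 7-day threshold are assumptions, not values from the presentation or from EGEE policy.

```python
from datetime import date, timedelta


def sites_needing_ticket(last_published, today, max_age_days=7):
    """Return the sites whose latest accounting record is older than max_age_days.

    last_published maps a site name to the date of its most recent record,
    or to None if the site has never published. max_age_days is an
    illustrative threshold, not an EGEE policy value.
    """
    limit = today - timedelta(days=max_age_days)
    return sorted(
        site for site, last in last_published.items()
        if last is None or last < limit
    )


# Hypothetical data for three Resource Centres.
records = {
    "SITE-A": date(2006, 10, 12),  # published yesterday
    "SITE-B": date(2006, 9, 1),    # stale
    "SITE-C": None,                # never published
}
stale = sites_needing_ticket(records, today=date(2006, 10, 13))
```

Here `stale` holds the two sites that would each receive a ticket; whether the published data is also correct is a separate check that this sketch does not cover.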

Some SWE accounting charts (figures not reproduced). Values highlighted on the charts: 949658 jobs; 3504204 hours; 2870184 hours.

Some “EGEE View” charts

SWE ROC tasks: Monitoring

o The CIC on Duty (COD) shift is done by Telefonica I+D, helped by PIC;
o CODs are grid expert teams which manage the day-to-day operation of the grid:
  – Active monitoring of the infrastructure;
  – Take appropriate action to protect the grid from the effects of failing components and to recover from operational problems. Example: a Resource Centre is causing problems by generating invalid information;
      The COD team opens a ticket to the Resource Centre;
      The COD team contacts the corresponding ROC operations support line;
      The COD team informs a network operations centre of suspected failures;
      The COD may remove the RC from the grid if the RC is unresponsive, until the problem has been fixed;
  – Many of these support and troubleshooting roles are undertaken in conjunction with the Regional Operation Centres;
      It is intended that tools will be developed to automate much of this work.
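
Since the slide expects tools to automate much of this work, the example escalation can be sketched as a small Python function. The step texts paraphrase the slide; treating them as a strict sequence that stops once the site responds is an assumption of this sketch, not a documented COD procedure.

```python
# Escalation steps listed on the slide, in the order they appear there.
ESCALATION_STEPS = [
    "open a ticket to the Resource Centre",
    "contact the ROC operations support line",
    "inform the network operations centre of suspected failures",
    "remove the Resource Centre from the grid until the problem is fixed",
]


def next_action(steps_taken, site_responded):
    """Return the next escalation step, or None if no further step is needed.

    steps_taken: number of steps already carried out for this problem.
    site_responded: True once the Resource Centre has reacted.
    Treating the list as a strict sequence is an assumption of this sketch.
    """
    if site_responded or steps_taken >= len(ESCALATION_STEPS):
        return None
    return ESCALATION_STEPS[steps_taken]
```

For example, `next_action(0, False)` returns the ticket-opening step, while `next_action(1, True)` returns `None` because the site has already reacted.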

o CESGA maintains a GridICE portal for all the SWE RCs.
  – The GridICE server collects information through specific sensors included in the EGEE middleware: job information, grid service and fabric monitoring data.
  – Based on some plugins for Nagios, it:
      Collects the data published by the sites;
      Keeps them in a PostgreSQL database;
      Shows them in a web page.
  – GridICE also includes e-mail notifications about changes in the status of the sites (hosts, important processes, etc.).
o CESGA is also responsible for the SWE monitoring alert system based on SFT/SAM results and GStat:
  – Site Availability Monitoring (SAM):
      Collection of comprehensive tests that are run daily on each certified site;
  – GStat Monitor:
      A snapshot of the Grid Information System.
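
The idea of notifying on status changes can be sketched in Python by comparing two monitoring snapshots. This is not GridICE or Nagios code; the host names and status strings are hypothetical, and a real alert system would send an e-mail for each reported change.

```python
def status_changes(previous, current):
    """Compare two {host: status} snapshots and list what changed.

    Returns (host, old_status, new_status) tuples sorted by host name.
    Hosts that appear or disappear are reported with None on the missing side.
    """
    changes = []
    for host in sorted(set(previous) | set(current)):
        old, new = previous.get(host), current.get(host)
        if old != new:
            changes.append((host, old, new))
    return changes


# Hypothetical snapshots; host names and status strings are illustrative.
before = {"ce01.example.org": "OK", "se01.example.org": "OK"}
after = {
    "ce01.example.org": "CRITICAL",
    "se01.example.org": "OK",
    "bdii.example.org": "OK",
}
changes = status_changes(before, after)
```

Only the hosts whose status differs between the two snapshots are reported, so an unchanged host such as `se01.example.org` generates no notification.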

SWE ROC tasks: Support

o The regional EGEE South-West federation help desk portal is maintained by CSIC-IFIC:
  – Users/admins from the SWE federation can open tickets;
o The coordination of the user support services inside the federation is handled by LIP:
  – It is LIP's responsibility to follow all tickets assigned to the SWE federation;
  – Make sure that they are routed to the correct RC and solved in time;
  – The SWE ROC is automatically warned (and acts accordingly) when:
      Tickets are opened by users or COD staff on federation sites;
      SAM or any other monitoring tool reports failures…

o The SWE help desk portal interacts with the EGEE Global Grid User Support (GGUS);
o GGUS is a trouble ticketing system:
  – Grid users and administrators can open tickets asking for help;
      Users can start a ticket using independent regional portals. Local experts can try to solve the problem or assign it to the central GGUS service;
      A ticket can also be opened directly in GGUS via a web form or email;
  – The first line of support is provided by “Ticket Processing Managers” (TPMs):
      TPM teams are composed of 3 Grid experts, who change on a weekly basis;
      TPMs are able to provide a solution to a given grid operation problem or assign the issue to a more specialized support unit.
  – Support is assured 5 days a week, 9 hours a day;
  – GGUS is used to start COD trouble tickets when the monitoring jobs fail;
o LIP contributes one “Ticket Processing Manager” team to the general GGUS tasks.
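
The path a ticket follows through the support levels described above can be sketched as a small Python function. The level names are labels chosen for this illustration, not GGUS identifiers, and the function covers only tickets that start at a regional portal.

```python
def route_ticket(solved_locally, solved_by_tpm, support_unit):
    """Return the list of support levels a regional ticket passes through.

    solved_locally: True if the regional experts solve the problem.
    solved_by_tpm: True if the TPM team on shift solves it.
    support_unit: name of the specialized unit used as the last resort.
    The level names are labels for this sketch, not GGUS identifiers.
    """
    path = ["regional help desk"]
    if solved_locally:
        return path
    path.append("GGUS TPM")
    if solved_by_tpm:
        return path
    path.append(support_unit)
    return path
```

For instance, `route_ticket(False, False, "middleware support")` passes through all three levels, while a ticket solved by the regional experts never reaches GGUS.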

Regional SWE help-desk

SWE ROC tasks: Authentication and Security

o Valid certificates for EGEE in the SWE region are issued by:
  – LIP, through the LIP Certification Authority (LIPCA), in Portugal;
  – CSIC-IFCA and PK-IRISGRID in Spain.
o These CAs are members of the European Policy Management Authority for Grid Authentication in e-Science (EUGridPMA).
  – EUGridPMA coordinates a Public Key Infrastructure (PKI) used to issue X.509 certificates;
o SWE CAs participate in the EUGridPMA body and in the revision of the CP/CPS (Certificate Policy/Certification Practice Statement).
o LIP (in Portugal) and RED.ES (in Spain) are responsible for security coordination and for handling security incidents.

SWE ROC tasks: Middleware integration

o gLite is the middleware layer developed by EGEE.
  – Extends the use of the grid infrastructure to all fields of science;
  – Follows a Service Oriented Architecture (SOA):
      Decreases the middleware dependence on the user’s applications and interactions with the different services.
o gLite middleware doesn’t support all Local Resource Management Systems (LRMS):
  – Only the LSF and Torque/Maui batch schedulers by default;
  – LIP and CESGA, together with IC, are involved in an EGEE task force to provide gLite support for the SGE batch system:
      New jobmanager implementation;
      New infoprovider scripts;
      Upgrade of the yaim installation procedure.
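
To make the role of a jobmanager more concrete, the Python sketch below maps each batch system to its native submit command. It is not gLite jobmanager code: the function name and queue argument are hypothetical, and the only facts assumed are that LSF submits jobs with `bsub` while Torque and SGE both use `qsub`, each accepting `-q` to select a queue.

```python
# Native submit commands of the batch systems discussed above.
SUBMIT_COMMANDS = {
    "lsf": "bsub",
    "torque": "qsub",
    "sge": "qsub",
}


def submit_command(lrms, queue, script):
    """Build the argument list for submitting a script to the given batch system."""
    try:
        command = SUBMIT_COMMANDS[lrms.lower()]
    except KeyError:
        raise ValueError(f"unsupported batch system: {lrms}") from None
    return [command, "-q", queue, script]
```

Supporting a new batch system then amounts to more than adding a table entry: as the list above indicates, job status reporting (infoprovider scripts) and installation tooling have to be adapted as well.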

SWE pre-production testbed

o In parallel with the EGEE production testbed, some SWE sites also participate in a pre-production testbed:
  – CESGA, CSIC-IFIC, LIP and PIC;
o Objectives of the pre-production testbed:
  – Test new middleware releases;
      First contact with new services;
      Test all service interactions/interconnections;
      Report bugs to the developers;
      Test bug fixes;
  – Release the middleware packages/patches which were correctly validated to the production testbed;
o The SWE ROC participates in the validation of middleware components and helps with their deployment in the RCs.

Summary & Conclusions

o We have presented the main EGEE SWE federation activities:
  – Its resources for the production testbed;
  – Its operation and regional management procedures;
  – Its responsibilities in some general EGEE tasks:
      Certification; Accounting; Support; Monitoring; Authentication; Middleware tests and integration;
  – Further details regarding EGEE SWE federation activities can be obtained by consulting the SWE portal maintained by CSIC-IFCA.
o This presentation aims to give a better understanding of the EGEE project and its fundamental organization, and to show how the different resources work together to deliver high quality services to the users.