Overview of monitoring tools for Grid Systems, Varenna, 12 May 2008, Antonio Pierro

May 12, 2008 Overview on monitoring tools for Grid Systems - Antonio Pierro (INFN-BARI) 1

Overview of monitoring tools for Grid Systems

Varenna, 12 May 2008

Antonio Pierro, INFN-BARI (Italy)

Antonio.pierro <at> ba.infn.it

Outline

Overview of EGEE monitoring tools:

SAM (Service Availability Monitoring)

GridMap

GStat (Global Grid Information Monitoring System)

GridView

GridICE (infrastructure and application monitoring)

Why do we need monitoring?

Resource utilization and performance evaluation

Resource observability is needed for optimized Grid utilization

Management decisions

To reduce the time spent waiting for resource availability

To always be aware of what is happening

Debugging purposes

To help the operations team locate and troubleshoot problems

Grid resources and services are subject to failures

Requirements for a Grid Monitoring tool

Scalable

Dynamic

Robust

Should be integrated with other Grid technologies and middleware (security infrastructure, resource brokers, schedulers, ...)

SAM (introduction)

Service Availability Monitoring (SAM) framework:

Monitors all Grid services and nodes, not only CEs

It is used in the validation process of sites and services

SAM wiki : http://goc.grid.sinica.edu.tw/gocwiki/SAM

SAM portal : https://lcg-sam.cern.ch:8443/sam/sam.py

Service and site status are recorded (several snapshots per day)

Daily, weekly and monthly availability is calculated by integrating (averaging) the status over the given period

It provides the official evaluation of T0, T1 and T2 sites.
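The averaging step can be illustrated with a small time-weighted calculation. This is a minimal sketch under our own assumptions, not SAM's actual algorithm: each snapshot is a hypothetical `(timestamp, ok)` pair, a status is assumed to hold until the next snapshot, and time not covered by any snapshot is treated as unknown and left out of the average.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Hypothetical snapshot representation: (time of the test run, True if the
# site/service passed). SAM's real data model is richer than this.
Snapshot = Tuple[datetime, bool]


def availability(snapshots: Iterable[Snapshot], start: datetime, end: datetime) -> float:
    """Time-weighted fraction of [start, end) during which the status was OK.

    Each snapshot's status is assumed to hold until the next snapshot (or
    until `end` for the last one). Time not covered by any snapshot is
    treated as unknown and excluded from the denominator.
    """
    snaps = sorted(snapshots)
    ok_time = timedelta(0)
    known_time = timedelta(0)
    for i, (ts, ok) in enumerate(snaps):
        next_ts = snaps[i + 1][0] if i + 1 < len(snaps) else end
        lo, hi = max(ts, start), min(next_ts, end)
        if hi <= lo:
            continue  # this interval lies outside the requested period
        known_time += hi - lo
        if ok:
            ok_time += hi - lo
    if known_time == timedelta(0):
        raise ValueError("no snapshot covers the requested period")
    return ok_time / known_time
```

Daily, weekly or monthly figures come from calling the same function with different windows, for example `availability(snaps, day, day + timedelta(days=7))` for one week.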

SAM (performed tests) 1/2

CE

job submission - UI->RB->CE->WN chain

version of CA certificates installed (on WN!) and software middleware (on WN!)

replica management tests - using lcg-utils, the default SE defined on the WN and a selected “central” SE

accessibility of the experiment software directory - environment variable, directory existence

accessibility of VO tag management tools

other tests: R-GMA client check, APEL accounting records

SE, SRM

storing file from the UI - using lcg-cr command with LFC registration

getting file back to the UI - using lcg-cp command

removing file - using lcg-del command with LFC de-registration
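The SE/SRM store, retrieve and remove sequence can be sketched as a small driver. This is an illustration under assumptions, not the SAM sensor code: the function names, the host `se.example.org` used below and the exact lcg-utils option spellings (`--vo`, `-d`, `-l`, `-a`) are our assumptions and should be checked against the installed man pages. The command runner is injectable so the sequencing logic can be exercised without a Grid UI.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# A runner executes a command and returns its exit code; injectable for testing.
Runner = Callable[[Sequence[str]], int]


def subprocess_runner(cmd: Sequence[str]) -> int:
    return subprocess.run(list(cmd)).returncode


def se_test_steps(vo: str, se_host: str, lfn: str,
                  local_src: str, local_dst: str) -> List[Tuple[str, List[str]]]:
    """Build the store/get/remove commands (option spellings are assumptions)."""
    return [
        # Copy a local file to the SE and register it in the LFC under `lfn`.
        ("store", ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, "file:" + local_src]),
        # Copy the file back from the Grid to the UI.
        ("get", ["lcg-cp", "--vo", vo, lfn, "file:" + local_dst]),
        # Delete all replicas, which also removes the LFC entry.
        ("remove", ["lcg-del", "--vo", vo, "-a", lfn]),
    ]


def run_steps(steps: List[Tuple[str, List[str]]],
              runner: Runner = subprocess_runner) -> List[Tuple[str, bool]]:
    """Run steps in order and stop at the first failure, since later steps need earlier ones."""
    results: List[Tuple[str, bool]] = []
    for name, cmd in steps:
        ok = runner(cmd) == 0
        results.append((name, ok))
        if not ok:
            break
    return results
```

Stopping at the first failure keeps the report simple; a production test would probably still attempt the clean-up step after a failed retrieval.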

SAM (performed tests) 2/2

LFC

directory listing - using lfc-ls command on /grid

creating file entry in /grid/<VO> area

FTS

checking if FTS is published correctly in the BDII

channel listing - using the glite-transfer-channel-list command with the ChannelManagement service

transfer test (in development)

Standalone tests: GStat, RB

VO-specific tests as well

SAM - CE sensor tests (France region, VO OPS)

OK: normal status

Error: subject has failed and the problem is localized

Subject may fail soon

Example output of a failed R-GMA client test:

*** Running R-GMA client test on alifarm57.ct.infn.it ***

Inserting tuple: ERROR: Could not contact R-GMA server at grid005.ct.infn.it:8443 – (104, 'Connection reset by peer')

ERROR: Could not contact R-GMA server at grid005.ct.infn.it:8443 – (104, 'Connection reset by peer')

Failed Timeout when executing test CE-sft-rgma after 600 seconds!
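The 600-second timeout in the output above suggests that each test is run under a time limit. A minimal sketch of that idea, assuming a simple two-state OK/ERROR mapping of our own (the function name and status strings are hypothetical, not SAM's interface):

```python
import subprocess
from typing import Sequence, Tuple


def run_probe(name: str, cmd: Sequence[str], timeout_s: float = 600) -> Tuple[str, str]:
    """Run one test command under a time limit and map the outcome to a status."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        return "ERROR", f"Timeout when executing test {name} after {timeout_s:g} seconds!"
    if proc.returncode == 0:
        return "OK", proc.stdout
    # Non-zero exit: report both streams so the problem can be localized.
    return "ERROR", proc.stdout + proc.stderr
```

Killing a hung test keeps one unreachable service (here the R-GMA server) from blocking the rest of the test run.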

GridMap

It publishes the same data as SAM in a different way

It is a simple, interactive and user-friendly interface to see the state of the Grid

Sites or services of the Grid are represented by rectangles of different size and colour, allowing two dimensions of data to be visualized simultaneously.

This representation of monitoring data requires much less space than conventional sorted tables or bar charts.
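Encoding one quantity as rectangle area and another as colour is a treemap. The sketch below uses the simplest treemap layout, slice-and-dice, and is not necessarily what GridMap implements. The choice of size (for example number of CPUs), the colour thresholds and all names are our assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (label, size, availability) for one site; size could be, e.g., number of CPUs.
Item = Tuple[str, float, float]


@dataclass
class Rect:
    label: str
    x: float
    y: float
    w: float
    h: float
    colour: str


def colour_for(availability: float) -> str:
    # Hypothetical thresholds, not GridMap's actual colour scale.
    if availability >= 0.9:
        return "green"
    if availability >= 0.5:
        return "yellow"
    return "red"


def slice_layout(items: List[Item], x: float, y: float, w: float, h: float) -> List[Rect]:
    """Split a rectangle into side-by-side columns with widths proportional to size."""
    total = sum(size for _, size, _ in items)
    rects, cursor = [], x
    for label, size, avail in items:
        width = w * size / total
        rects.append(Rect(label, cursor, y, width, h, colour_for(avail)))
        cursor += width
    return rects


def treemap(groups: Dict[str, List[Item]], w: float, h: float) -> List[Rect]:
    """Two-level slice-and-dice: groups (e.g. regions) as horizontal bands, sites as columns."""
    totals = {g: sum(size for _, size, _ in items) for g, items in groups.items()}
    grand = sum(totals.values())
    rects, cursor = [], 0.0
    for g, items in groups.items():
        band_h = h * totals[g] / grand
        rects += slice_layout(items, 0.0, cursor, w, band_h)
        cursor += band_h
    return rects
```

Slice-and-dice preserves the order of sites but can produce very thin rectangles; squarified layouts trade that order for rectangles closer to squares.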

GridMap

GridMap prototype: visualizing the state of the Grid

The state of the Grid: SAM tests

Daily availability

GridView 1/2

It is a visualization system for viewing monitoring information

Approach:

Collects monitoring information from different sources, e.g. SAM, the GridFTP monitor and RB logs

Monitoring records are stored in a central Oracle database at CERN

Visualization of summary data through a web interface

Target: Grid operators, Site administrators, VO managers
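The collect, store centrally, then summarize approach can be sketched with Python's built-in `sqlite3` standing in for the central Oracle database. The table layouts, function names and the sample records in the usage check are hypothetical; they only show how records from different sources end up in one store that the web interface can summarize.

```python
import sqlite3
from typing import Iterable, List, Tuple


def make_store() -> sqlite3.Connection:
    """In-memory stand-in for the central monitoring database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transfers (src_site TEXT, dst_site TEXT, bytes INTEGER, day TEXT)")
    conn.execute("CREATE TABLE sam_results (site TEXT, test TEXT, ok INTEGER, day TEXT)")
    return conn


def load_gridftp(conn: sqlite3.Connection, records: Iterable[Tuple[str, str, int, str]]) -> None:
    """Insert transfer records as a GridFTP-monitor collector might produce them."""
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?)", records)


def load_sam(conn: sqlite3.Connection, records: Iterable[Tuple[str, str, int, str]]) -> None:
    """Insert SAM test outcomes (ok = 1 for pass, 0 for fail)."""
    conn.executemany("INSERT INTO sam_results VALUES (?, ?, ?, ?)", records)


def transfer_summary(conn: sqlite3.Connection, day: str) -> List[Tuple[str, str, int]]:
    """Total bytes moved between each pair of sites on one day."""
    return conn.execute(
        "SELECT src_site, dst_site, SUM(bytes) FROM transfers WHERE day = ? "
        "GROUP BY src_site, dst_site ORDER BY src_site, dst_site",
        (day,),
    ).fetchall()


def sam_pass_rate(conn: sqlite3.Connection, day: str) -> List[Tuple[str, float]]:
    """Fraction of passed SAM tests per site on one day."""
    return conn.execute(
        "SELECT site, AVG(ok) FROM sam_results WHERE day = ? GROUP BY site ORDER BY site",
        (day,),
    ).fetchall()
```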

GridView (web page) 2/2

Statistics of data transfers

Running jobs

Service availability

GStat 1/2

GStat is built using Python scripts that generate web-based reports, which Grid site administrators use to troubleshoot Information System (IS) issues or to access usage information.

GStat scripts are executed periodically to query and collect the information published by each site in the Grid infrastructure.

The published information is then processed by an extensible analysis framework that checks for IS failures and errors.

Target:

Grid operators

Site administrators
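The collect-then-analyse pipeline can be sketched in Python as a minimal LDIF reader plus a registry of pluggable checks. This is not GStat's code: the LDIF is assumed to come from querying an information system such as a BDII (for example `ldapsearch -x -H ldap://<host>:2170 -b o=grid`), and the two checks are illustrative. The first assumes that, in the Glue 1.x schema, total jobs equal running plus waiting jobs.

```python
from typing import Callable, Dict, List

# One LDIF entry: attribute name -> list of values (attributes may repeat).
Entry = Dict[str, List[str]]


def parse_ldif(text: str) -> List[Entry]:
    """Minimal LDIF reader, enough for plain-text attributes.

    Blank lines separate entries and a line starting with one space continues
    the previous value. Base64 ("attr:: ...") values are not decoded.
    """
    entries: List[Entry] = []
    current: Entry = {}
    last_attr = None
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        if not line.strip():
            if current:
                entries.append(current)
            current, last_attr = {}, None
        elif line.startswith(" ") and last_attr is not None:
            current[last_attr][-1] += line[1:]
        else:
            attr, _, value = line.partition(":")
            last_attr = attr.strip()
            current.setdefault(last_attr, []).append(value.strip())
    if current:
        entries.append(current)
    return entries


# Registry of checks: each takes an entry and returns a list of problem messages.
Check = Callable[[Entry], List[str]]
CHECKS: List[Check] = []


def check(fn: Check) -> Check:
    """Decorator that adds a function to the registry (the 'extensible' part)."""
    CHECKS.append(fn)
    return fn


def _dn(entry: Entry) -> str:
    return entry.get("dn", ["<no dn>"])[0]


@check
def job_counts_consistent(entry: Entry) -> List[str]:
    keys = ("GlueCEStateRunningJobs", "GlueCEStateWaitingJobs", "GlueCEStateTotalJobs")
    if not all(k in entry for k in keys):
        return []
    running, waiting, total = (int(entry[k][0]) for k in keys)
    if running + waiting != total:
        return [f"{_dn(entry)}: running ({running}) + waiting ({waiting}) != total ({total})"]
    return []


@check
def ce_status_published(entry: Entry) -> List[str]:
    if "GlueCE" in entry.get("objectClass", []) and "GlueCEStateStatus" not in entry:
        return [f"{_dn(entry)}: GlueCEStateStatus is missing"]
    return []


def analyse(entries: List[Entry]) -> List[str]:
    """Run every registered check against every entry."""
    return [msg for e in entries for chk in CHECKS for msg in chk(e)]
```

New checks are added by decorating a function with `@check`, without touching the collection or reporting code.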

GStat 2/2

The main page of GStat shows the overall status and usage statistics for each site.

GStat site detailed report

GStat site resource status

GridICE: Overview

It is a distributed monitoring tool for Grid systems

It is evolving in the context of EU-EGEE and many other EU Grid projects

Fully integrated with the gLite 3.x middleware

Self-configurable collection and presentation: just give the URL of the root Grid Information Service (GIS)

Installed servers are monitoring Grid resources in the scope of:

EGEE, EGEE-SWE, RDIG, EGEE-SEE, Grid.it, GILDA, CMS, ATLAS, EUMedGrid, EUChinaGrid, EUIndiaGrid, BalticGrid, LIBI, BioinfoGRID, EELA, OMII, BeGrid

Recent evolution of GridICE: lightweight sensor + VOMS information

Attributes measured by the Job Monitoring sensor

To reduce its intrusiveness in terms of resource consumption:

Two running daemons and a periodically executed probe

They listen to a set of log files and collect the relevant information

Only a few LRMS commands are used to retrieve job status

The status of all jobs is stored in a cache (stateful behaviour)
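The low-intrusiveness design (follow log files incrementally, keep job status in a cache) can be sketched as follows. The GridICE sensor internals are not described here, so the log line format `<job_id> <state> key=value ...`, the class and the function names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class JobState:
    job_id: str
    state: str
    attrs: Dict[str, str] = field(default_factory=dict)


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return complete lines appended to `path` since byte `offset`, and the new offset.

    A trailing partial line is left unread so it is picked up whole next time.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n")
    if end < 0:
        return [], offset
    return data[: end + 1].decode("utf-8", "replace").splitlines(), offset + end + 1


class JobCache:
    """Stateful cache of job status, updated incrementally from log events."""

    def __init__(self) -> None:
        self.jobs: Dict[str, JobState] = {}

    @staticmethod
    def parse(line: str) -> Optional[JobState]:
        # Hypothetical event format: "<job_id> <state> key=value ..."
        parts = line.split()
        if len(parts) < 2:
            return None
        attrs = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
        return JobState(parts[0], parts[1], attrs)

    def feed(self, lines: Iterable[str]) -> List[str]:
        """Apply log lines; return ids of jobs whose cached state changed."""
        changed: List[str] = []
        for line in lines:
            event = self.parse(line)
            if event is None:
                continue
            old = self.jobs.get(event.job_id)
            if old is None:
                self.jobs[event.job_id] = event
            elif old.state != event.state or any(old.attrs.get(k) != v for k, v in event.attrs.items()):
                old.state = event.state
                old.attrs.update(event.attrs)
            else:
                continue  # nothing new for this job
            changed.append(event.job_id)
        return changed
```

Because only newly appended bytes are read and only changed jobs are reported, the per-cycle cost depends on new activity rather than on the total number of jobs.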

Integration with local monitoring systems (LEMON)

Grid monitoring integrated with local monitoring

The latest server version is very simple to install

The client installation can be enabled in the standard LCG middleware installation (no additional operations are needed)

The LEMON monitoring system and alarm management are integrated in the new version of the GridICE server

The local sensor currently used for farm monitoring can be interfaced with GridICE to collect all the available data

The back-end is implemented with LEMON

Local farms monitored with LEMON can be integrated with GridICE

LRMSinfo

The LRMS Info sensor provides aggregated information about the Local Resource Management System (LRMS)
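Aggregation here means turning per-job records into counts. A minimal sketch, assuming job records have already been obtained from the LRMS (for example by parsing the batch system's status command) as dictionaries with hypothetical `queue`, `vo` and `state` fields:

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Iterable, Mapping


def aggregate(jobs: Iterable[Mapping[str, str]],
              key: Callable[[Mapping[str, str]], Hashable]) -> Dict[Hashable, Dict[str, int]]:
    """Count jobs per state, grouped by an arbitrary key (queue, VO, ...)."""
    summary: Dict[Hashable, Counter] = defaultdict(Counter)
    for job in jobs:
        summary[key(job)][job["state"]] += 1
    return {k: dict(counts) for k, counts in summary.items()}
```

Grouping by queue gives a site-level view, while grouping by `(queue, vo)` answers the VO manager's question "how many of my jobs are running or queued?".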

We focus on the following categories of users:

VO manager

The actual set of resources accessible to VO members: “How many jobs submitted by my users are running or queued?” (with details of the VOMS groups and/or single users)

Grid operator

All resources under the responsibility of a Grid Operator Center (“How many resources are available?”)

Site administrator

Site resources offered to a Grid (“Is there any service down?”)

Grid users

The status of their jobs on the Grid

How do we identify the user/role?

Users are identified by the digital certificate installed in their browser

a valid CA certificate

server based on the HTTPS protocol

The new sensors are able to retrieve the VOMS information

VOMS information: groups and roles of the users submitting the jobs

The related role (e.g., site manager, VO manager) can be retrieved from the GridICE database.
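Once the web server has authenticated the user's DN from the client certificate, the view can be restricted with a simple filter: a standard user sees only their own jobs, while a VO manager or site manager also sees the jobs of their VO or site. A minimal sketch; the `Job` fields and the `"vo-manager:<vo>"` / `"site-manager:<site>"` role encoding are hypothetical stand-ins for what the GridICE database stores.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Job:
    job_id: str
    owner_dn: str
    vo: str
    site: str


def visible_jobs(user_dn: str, roles: Dict[str, Set[str]], jobs: List[Job]) -> List[Job]:
    """Return the jobs that the authenticated user `user_dn` may see.

    `roles` maps a DN to role strings such as "vo-manager:cms" or
    "site-manager:INFN-BARI" (a hypothetical encoding of the role table).
    """
    user_roles = roles.get(user_dn, set())
    managed_vos = {r.split(":", 1)[1] for r in user_roles if r.startswith("vo-manager:")}
    managed_sites = {r.split(":", 1)[1] for r in user_roles if r.startswith("site-manager:")}
    return [
        j for j in jobs
        # Own jobs are always visible; roles widen the view to a VO or a site.
        if j.owner_dn == user_dn or j.vo in managed_vos or j.site in managed_sites
    ]
```

In a real deployment the DN is only trustworthy after the HTTPS server has verified the certificate chain against the trusted CAs.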

“Standard user” monitoring (1)

User who has no submitted jobs and no registered role

“Standard user” monitoring (2)

An authenticated user sees only his/her own jobs

“Standard user” monitoring (3)

An authenticated user sees only his/her own jobs

exit status = 0 => successful jobs

exit status <> 0 => failed jobs

Grid monitoring from the VO Manager perspective

Grid monitoring from the Site Manager perspective

Acronyms and Abbreviations (1):

ACL - Access Control List
APEL - Accounting Processor for Event Logs
API - Application Programming Interface
BDII - Berkeley Database Information Index
CA - Certificate Authority
CE - Computing Element: a Grid-enabled computing resource
CERN - European Organisation for Nuclear Research
GIIS - Grid Index Information Service. MDS index node. Aggregates information
dCache - (disk pool management system)
DN - Distinguished Name (X.500, LDAP)
EGEE - Enabling Grids for E-sciencE
FTS - File Transfer Service (EGEE)
GARR - Gruppo per l'Armonizzazione delle Reti della Ricerca
GGUS - Global Grid User Support
GIIS - Grid Information Index Server
GILDA - Grid INFN Laboratory for Dissemination Activities
GRIS - Grid Resource Information Service. Collects information for MDS.
IN2P3 - Institut National de Physique Nucléaire et de Physique des Particules
INFN - Istituto Nazionale di Fisica Nucleare (in Italy)
ISO - International Organization for Standardization
JDL - Job Description Language
LB - Logging and Bookkeeping service
LEMON - LHC Era Monitoring

Acronyms and Abbreviations (2):

LCG - LHC Computing Grid
LDAP - Lightweight Directory Access Protocol
LDIF - LDAP Data Interchange Format
LDN - Logical Dataset Name
LFC - LCG File Catalog
LFN - Logical File Name
LHC - Large Hadron Collider. Under construction. Hosts CMS, ATLAS, and other experiments.
LRMS - Local Resource Management System
MDS - Meta Directory Service, or Monitoring and Discovery Service (Globus)
MPI - Message Passing Interface
PhEDEx - Physics Experiment Data Export (CMS)
RFIO - Remote File I/O
R-GMA - Relational Grid Monitoring Architecture (EGEE). A monitoring system similar to MDS
ROC - Regional Operations Centre
RLS - Replica Location Service
SE - Storage Element
SOAP - Simple Object Access Protocol
SRM - Storage Resource Management
VO - Virtual Organization, e.g., an experiment
VOBOX - VO box
VOMRS - Virtual Organization Membership Registration Service
VOMS - Virtual Organization Membership Service
X.509 - (ITU-T standard for Public-key and attribute certificate frameworks)

References

SAM

http://goc.grid.sinica.edu.tw/gocwiki/SAME_Planning

https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=

GridMap

http://gridmap.cern.ch/gm/

http://cerncourier.com/cws/article/cnl/31986

GStat

http://goc.grid.sinica.edu.tw/gstat/

GridView

Portal: http://gridview.cern.ch/

TWiki: https://twiki.cern.ch/twiki/bin/view/LCG/GridView

GridICE

http://gridice.forge.cnaf.infn.it/

Conclusions

There are several monitoring tools available for Grid systems

Which tool should you use?

It depends on your role in the Grid

Sometimes you may need to use several tools at the same time to satisfy your needs

Thank You