Grid Deployment & Operations: EGEE, LCG and GridPP Jeremy Coles GridPP Production Manager UK&I Operations Manager for EGEE J.Coles@rl.ac.uk 20th September 2005


Page 1: Grid Deployment & Operations: EGEE, LCG and GridPP Jeremy Coles GridPP Production Manager UK&I Operations Manager for EGEE J.Coles@rl.ac.uk 20 th September.

Grid Deployment & Operations:

EGEE, LCG and GridPP

Jeremy Coles
GridPP Production Manager
UK&I Operations Manager for EGEE
J.Coles@rl.ac.uk

20th September 2005

Page 2

Overview

1 Project Background (to EGEE, LCG and GridPP)

2 The middleware and its deployment

3 Structures developed in response to operating a large grid

4 How the infrastructure is being used

5 Particular problems being faced

6 Summary

Page 3

A reminder of the Enabling Grids for E-sciencE project

• 48 % service activities (Grid Operations, Support and Management, Network Resource Provision)

• 24 % middleware re-engineering (Quality Assurance, Security, Network Services Development)

• 28 % networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)

32 Million Euros EU funding over 2 years starting 1st April 2004

Emphasis in EGEE is on operating a production grid and supporting the end-users

From Bob Jones’s talk AHM 2004!

Pages 4-7

The UK & Ireland contribution to SA1 – deployment & operations

Consists of 3 partners:

• Grid Ireland

• The National Grid Service (NGS)
  - Leeds/Manchester/Oxford/RAL

• GridPP
  • Currently the lead partner
  • Based on a Tier-2 structure within the Large Hadron Collider Grid Project (LCG) [See T Doyle’s talk tomorrow 11am CR2]

The UKI structure:

• Regional Operations Centre (ROC)
  • Helpdesk
  • Communications
  • Liaison with ROCs and CICs
  • Monitoring of resources

• Core Infrastructure Centre (CIC)
  • Team take “shifts” to …
    • Monitor core services and
    • Follow up on site problems

Page 8

GridPP is a major contributor to the growth of EGEE resources

[Chart: EGEE total job slots and UK total job slots over time, 24 June 2004 to 18 August 2005; vertical axis 0 to 20000 job slots]

Page 9

When sites join EGEE the ROC …

• Records site details in a central Grid Operations Centre DataBase (GOCDB) with certificate-controlled access

• Ensures that the site has agreed to and signed the Acceptable Use and Incident Response procedures

• Runs tests against the site to ensure that the setup is correctly configured

NB. Page access requires appropriate grid certificate

Page 10

Experience has revealed growing requirements for the GOCDB

• ROC manager control - To be able to update site information and to change the monitoring status of, or remove, sites

• A structure that allows easy population of structured views (such as accounting according to regional structures)

• To be able to differentiate pure production sites from test resources (e.g. preproduction services)

Page 11

EGEE middleware is still evolving based on operational needs

[Diagram: middleware layer stack]

• Application level services: User interfaces, Applications

• “Collective” services: Resource Broker, Data management, App monitoring system

• “Basic” services: User access, Security, Data transfer, Information schema

• System software: Operating system, Local scheduler, File system

• Hardware: Computing cluster, Network resources, Data storage

Other labels on the diagram: Information system; dCache-SRM, DPM…; Scientific Linux, RHEL…; NFS, …; PBS, Condor, LSF, …; VDT (Condor, Globus, GLUE); EU DataGrid

Page 12

An overview of the (changing) middleware release process

[Diagram: middleware release cycle. Labels: Release(s); Certification is run daily; Update User Guides (EIS); Update Release Notes (GIS); Release Notes, Installation Guides, User Guides; Re-Certify (CIC); Client Release; Deploy Client Releases (User Space) (GIS); Deploy Service Releases (Optional) (CICs, RCs); Deploy Major Releases (Mandatory) (ROCs, RCs); YAIM. Timing labels: Every Month; Every 3 months on fixed dates!; at own pace]

Site deployment of middleware

• YAIM – bash script. Simple and transparent. Much preferred by administrators.

• QUATTOR – Steep learning curve but allows tighter control over installation.

Patches & functionality vs stability!

Porting to “non-standard” LCG operating systems

Page 13

A mixed infrastructure is inevitable and local variations must be manageable

[Chart: number of sites at each release (vertical axis “Sites at release”, 0 to 40) against date, 24 January 2005 to 22 August 2005. Series: LCG-2_3_0, LCG-2_3_1, LCG-2_4_0, LCG-2_5_0, LCG-2_6_0, Sites]

• Releases take time to be adopted – how will more frequent updates be tagged and handled!?

• Grid Ireland has a completely different deployment model to GridPP (central vs site based)

Page 14

Additional components are added such as for managed storage

Storage Resource Management interface

• Provides a protocol for large scale storage systems on the grid

• Clients can retrieve and store files, control file lifetimes and filespace

• Sites will need to offer an SRM compliant storage element to VOs

• These SEs are basically filesystem mount points on specific servers

• There are few solutions available and deployment at test sites has proved time consuming (integration at sites, understanding hardware setup (documentation improving))

[Diagram: storage element architecture. Labels: The Grid; Storage Element interfaces; “Handlers”; TAPE storage (or disk); Access Control; File Metadata]

Pages 15-17

Once sites are part of the grid they are actively monitored

• The Site Functional Tests (SFTs) are a series of jobs reporting whether a site is able to do basic transfers, publish required information, etc.

• These have recently been updated as certain “critical tests” gave a misleading impression of a site

• The tests are being used (and expanded) by Virtual Organisations (VOs) to select stable sites (to improve efficiency)

• They have proved very useful to sites and can now be run by them on demand
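As a rough illustration of how per-test results might be rolled up into a site status, the sketch below marks a site as failing only when a test flagged as critical fails, and lets a VO pick the sites that pass everything. The test names, the three status values and the choice of which tests are critical are all hypothetical; they are not taken from the actual SFT suite.

```python
from collections import namedtuple

# Hypothetical record for one test outcome at one site.
# "critical" marks tests whose failure should count against the site.
CheckResult = namedtuple("CheckResult", ["name", "passed", "critical"])


def site_status(results):
    """Return 'FAIL' if any critical test failed, 'WARN' if only
    non-critical tests failed, and 'OK' otherwise."""
    failed = [r for r in results if not r.passed]
    if any(r.critical for r in failed):
        return "FAIL"
    if failed:
        return "WARN"
    return "OK"


def stable_sites(results_by_site):
    """Names of sites whose status is 'OK', in alphabetical order."""
    return sorted(site for site, results in results_by_site.items()
                  if site_status(results) == "OK")


if __name__ == "__main__":
    # Illustrative data only.
    example = {
        "site-a": [CheckResult("job-submission", True, True),
                   CheckResult("info-published", True, True)],
        "site-b": [CheckResult("job-submission", True, True),
                   CheckResult("software-version", False, False)],
        "site-c": [CheckResult("basic-transfer", False, True)],
    }
    for site, results in sorted(example.items()):
        print(site, site_status(results))
    print("stable:", stable_sites(example))
```

Separating critical from non-critical failures is one way to avoid the misleading impression mentioned above: a site with a minor failure shows as a warning rather than as broken.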

Page 18

The tests form part of a suite of information used by the Core Infrastructure Centres (CICs)

• There are currently 5 CICs in EGEE

• Introduction of a CIC on Duty rota (whereby each CIC oversees EGEE operations for 1 week at a time) saw a great improvement in grid stability

• Available information is captured in a Trouble Ticket and sent to problem sites (and their ROC) informing them that there is a problem

• Tickets are automatically escalated if not resolved

• Core services are monitored in addition to sites
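The automatic escalation of unresolved tickets could look something like the sketch below, where the recipient of a reminder depends on how long the ticket has been open. The time limits and the escalation ladder are assumptions made for illustration; the slides do not state the actual rules.

```python
from datetime import datetime, timedelta

# Hypothetical escalation ladder: (minimum age of an open ticket, who is chased).
# The thresholds and recipients are illustrative assumptions only.
ESCALATION_STEPS = [
    (timedelta(days=0), "site"),
    (timedelta(days=3), "site and ROC"),
    (timedelta(days=7), "ROC and CIC on duty"),
]


def escalation_target(opened, now, resolved=False):
    """Return who should currently be chasing a ticket, or None if resolved."""
    if resolved:
        return None
    age = now - opened
    target = ESCALATION_STEPS[0][1]
    # Steps are ordered by age, so the last matching step wins.
    for min_age, recipient in ESCALATION_STEPS:
        if age >= min_age:
            target = recipient
    return target


if __name__ == "__main__":
    opened = datetime(2005, 9, 1)
    for days in (1, 4, 10):
        print(days, "days open ->", escalation_target(opened, opened + timedelta(days=days)))
```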

Page 19

Good, reliable and easy to access information has been extremely useful to sites and ROC staff

At a glance we can see for each site:

• whether it passes or fails the functional tests

• if there are configuration errors (via “sanity checks”)

• what middleware version is deployed

• the total job slots available and used as published by the site

• basic storage information

• average and maximum published job slots showing deviations

Page 20

With a rapidly growing number of sites and geographic coverage many tools have had to evolve

Page 21

And new ones developed. EGEE and LCG metrics are an increasing area of focus – how else are we to manage!

Page 22

We need to develop a better understanding of grid dynamics

Is this several sites with large farms upgrading?

Is this the result of a loss of the Tier-1 scheduler? Or just a problem with the tests!

Page 23

The good news is that UKI is currently the largest contributor to EGEE resources

Page 24

… and resource usage is growing (55% for August and 26% for the period from June 04)

[Chart: percentage of job slots used against date, 2 June 2004 to 24 August 2005; vertical axis “% job slots used”, 0% to 90%. Series: % EGEE slots used, % UK slots used]

• Utilisation may worry some people but note that the majority of resources are being deployed for High Energy Physics experiments which will ramp up usage quickly in 2007

• Recent activity is partly due to a Biomedical data challenge in August
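For reference, a utilisation figure of this kind can be computed from published slot counts as in the sketch below. It assumes utilisation means job slots in use divided by total published job slots; the sample numbers are invented and are not the data behind the chart.

```python
def utilisation(slots_used, slots_total):
    """Percentage of published job slots in use; 0.0 if none are published."""
    if slots_total <= 0:
        return 0.0
    return 100.0 * slots_used / slots_total


def mean_utilisation(samples):
    """Utilisation over several (slots_used, slots_total) snapshots.

    Totals are summed before dividing, so snapshots taken when more
    slots were published carry proportionally more weight.
    """
    used = sum(u for u, _ in samples)
    total = sum(t for _, t in samples)
    return utilisation(used, total)


if __name__ == "__main__":
    # Invented snapshots: (slots in use, slots published).
    snapshots = [(50, 100), (150, 300)]
    print(mean_utilisation(snapshots))
```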

Page 25

Several sites have been running full for July/August. The plot below is for the Tier-1 in August

[Plot: Tier-1 usage in August, with the “Maximum Capacity” level marked]

Page 26

However full does not always mean well used!

• The plot shows weighted job efficiencies for the ATLAS VO in July 2005

• Straight line structures show jobs which ran for a period of time before blocking on an external resource and eventually being killed by an elapsed time limit

• Clusters at low efficiency probably show performance problems on external storage elements

• Many problems seen here are NOW FIXED
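The slide does not define “weighted job efficiency”. A common reading, assumed in the sketch below, is CPU time divided by wall-clock time, with the average over many jobs weighted by each job's wall-clock time. A job that blocks on an external resource until it is killed accumulates wall-clock time but little CPU time, which is why it shows up at low efficiency.

```python
def job_efficiency(cpu_seconds, wall_seconds):
    """CPU time as a fraction of wall-clock time; 0.0 for zero-length jobs."""
    if wall_seconds <= 0:
        return 0.0
    return cpu_seconds / wall_seconds


def weighted_efficiency(jobs):
    """Wall-time-weighted mean efficiency over (cpu_seconds, wall_seconds) pairs.

    Weighting each job's efficiency by its wall-clock time is the same
    as dividing total CPU time by total wall-clock time.
    """
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return job_efficiency(total_cpu, total_wall)


if __name__ == "__main__":
    # Invented jobs: one busy for a full hour, one blocked for a full hour.
    jobs = [(3600, 3600), (0, 3600)]
    print(weighted_efficiency(jobs))
```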

Page 27

… and some sites have specific scheduling requirements

[Diagram: PBS batch system. Labels: Batch server (pbs_server), holding batch server and cluster configuration, job queue and state table; Execution hosts (pbs_mom); user commands qsub, qdel, qstat; Scheduler plug-in, holding scheduler and additional cluster configuration. Messages between components: job, start, stop, status; node, job, start, stop, status]

Grid scheduling (using user specified requirements to select resources) vs Local policies (the site “prefers” certain VOs)

Page 28

The user community is expanding, creating new problems

• Over 900 users in some 60+ VOs

• UK sites support about 10 VOs

• Opening up resources for non-traditional site VOs/users requires effort

• Negotiation between VOs and the regional sites has required the creation of an “Operational Advisory Group”

• New Acceptable Use policies that apply across countries and are agreeable (and actually readable) are taking time to develop.


Pages 29-30

Aggregation of job accounting is recording VO usage, but …

[Diagram: accounting data flowing between SITE and GOC, with a web summary view of data]

• Not all batch systems are covered

• Not all sites are publishing data

• Farm normalisation factors are not consistent

• Publishing across grids yet to be tackled (but the solution in EGEE does use a GGF schema)
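To show why consistent normalisation factors matter, the sketch below aggregates CPU hours per VO after scaling each site's raw figure by a per-site factor. The record layout, the meaning of the factor (farm CPU speed relative to some reference) and the handling of sites without a factor are assumptions for illustration, not the EGEE accounting implementation.

```python
from collections import defaultdict


def aggregate_by_vo(records, site_factors):
    """Sum normalised CPU hours per VO.

    records: iterable of (site, vo, cpu_hours) tuples.
    site_factors: mapping of site -> normalisation factor relative to a
        reference CPU (hypothetical values).
    Returns (totals_by_vo, sites_without_factor). Records from sites with
    no factor are left out of the totals and the site is reported instead.
    """
    totals = defaultdict(float)
    missing = set()
    for site, vo, cpu_hours in records:
        factor = site_factors.get(site)
        if factor is None:
            missing.add(site)
            continue
        totals[vo] += cpu_hours * factor
    return dict(totals), sorted(missing)


if __name__ == "__main__":
    # Invented records and factors.
    records = [("s1", "atlas", 10.0), ("s2", "atlas", 10.0),
               ("s2", "biomed", 4.0), ("s3", "lhcb", 1.0)]
    factors = {"s1": 1.0, "s2": 1.5}
    print(aggregate_by_vo(records, factors))
```

Two sites reporting the same raw hours contribute different normalised totals, so a wrong or missing factor distorts the per-VO summary directly.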

Page 31

GridPP data is reasonably complete for recent months

Note the usage by non-particle-physics organisations. This is what the EGEE grid is all about.

Page 32

Support is proving difficult because the project is so large and diverse

[Diagram: support routes seen from the UKI operations centre. Labels: UKI ROC ticket tracking system (Footprints); GGUS (Remedy); Tier-1 helpdesk (Request Tracker); Grid-Ireland helpdesk (Request Tracker); GOSC (Footprints); Savannah – bug tracking; CIC-on-duty; Site A; Regional service 1; Users; Experiments/VOs; Site administrators; mailing lists LCG-ROLLOUT and TB-SUPPORT]

• This is ONLY the view for the UKI operations centre. There are 9 ROCs.

Page 33

The EGEE model uses a central helpdesk facility and Ticket Process Managers

• User: “I need help! I send e-mail to [email protected]”

• E-mail is automatically converted into a GGUS ticket. It can be addressed to TPM VO only, or TPM only, or to both

• Ticket Process Manager (TPM): Monitor ticket assignments. Direct to correct support unit. Notify users of specific actions and ticket status

• TPM VO Support: People from VOs. Receive VO-related tickets and follow them. Solve/forward VO-specific problems. Recognize Grid-related problems and assign them to specific support units or back to TPM

• Support units: VO Support Units; ROC Support Units; CIC Support Unit; Middleware Support Units; Other Grids Support Units; Mailing lists

Page 34

The EGEE model uses a central helpdesk facility and ticket process managers, but …

• Mailing lists are very active on their own!

• Linking up ROC helpdesks is taking time. Getting VOs to populate their follow up lists is not happening quickly

• Some users are confused - mixed messages

• The central GGUS facility is taking time to become stable

• Ticket Process Managers are difficult to provide – EGEE funding did not account for them

• VOs still have independent support lists and routes – especially the larger VOs

Page 35

Interoperability is another area to be developed

In terms of:

• Operations

• Support

• Job submission

• Job monitoring

• …

Currently the VOs/experiments develop their own solutions to this problem.

Page 36

Some other areas which are talks in themselves!

Security

• Getting all sites to adopt best practices
  • check patches
  • check port changes
  • review log files

• Scanning for grid wide intrusion

Network monitoring

• Aggregation of data from site “network boxes”

• Mediator for integrated network checks

Page 37

Going forward, one of the main drivers pushing the service is a series of service challenges in LCG

• Main UK site connected to CERN via UKLIGHT

• Up to 650 Mb/s sustained transfers

• 3 “Tier-2” centres deployed an SRM and managed sustained data transfer rates up to 550 Mb/s over SJ4. One connected via UKLIGHT
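To put the quoted rates in context, the short calculation below converts a sustained rate into a daily data volume. It assumes “Mb/s” means decimal megabits per second and that the rate is held for a full 24 hours, which is an idealisation.

```python
def volume_per_day_tb(rate_mbit_per_s):
    """Data moved in 24 hours at a sustained rate, in decimal terabytes."""
    bytes_per_s = rate_mbit_per_s * 1_000_000 / 8   # megabits/s -> bytes/s
    return bytes_per_s * 86_400 / 1e12              # seconds per day, bytes -> TB


if __name__ == "__main__":
    for rate in (650, 550):
        print(f"{rate} Mb/s sustained is about {volume_per_day_tb(rate):.2f} TB/day")
    # prints:
    # 650 Mb/s sustained is about 7.02 TB/day
    # 550 Mb/s sustained is about 5.94 TB/day
```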

Page 38

Summary

1 UK&I has a strong presence in EGEE and LCG

2 Our grid management tools are now evolving rapidly

3 Grid utilisation is improving – we start to look at the dynamics

4 Growing focus areas include support and interoperation (and gLite!)

5 There is a lot of work not covered here! Fabric, Security, Networking…

6 Come and visit the GridPP (PPARC) and CCLRC stands!

Page 39

gLite vs LCG-2 Components

[Diagram: gLite and LCG-2 components deployed side by side at a SITE. Labels: FIREMAN; VOMS; LFC; shared; LCG; gLite; SRM-SE; myProxy; gLite WLM; RB; UIs; WNs; gLite-IO; gLite-CE; LCG CE; FTS; R-GMA; BD-II; dgas; APEL]

• Catalogue and access control

• Data from LCG is owned by VO and role, gLite-IO service owns gLite data

• FTS for LCG uses user proxy, gLite uses service cert

• R-GMAs can be merged (security ON)

• CEs use same batch system

• Independent IS