
INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

Status of EGEE Operations

Ian Bird, CERN

SA1 Activity Leader

EGEE 3rd Conference

Athens, 18th April, 2005


Overview

• Overall activity status
  - Service & Operations

• Planning for remainder of project
  - Main focus of activities
  - gLite migration

• Summary

See tomorrow’s plenary session for technical details


Operations Status

Computing Resources: April 2005

[Map legend: country providing resources; country anticipating joining]

• In LCG-2:
  - 131 sites, 30 countries
  - >12,000 CPUs
  - ~5 PB storage

• Includes non-EGEE sites:
  - 9 countries
  - 20 sites


Infrastructure metrics

Countries, sites, and CPU available in EGEE production service

Region                Countries  Sites  CPU M6 (TA)  CPU M15 (TA)  CPU actual

EGEE partner regions
CERN                          0      1          900          1800        1841
UK/Ireland                    2     19          100          2200        2398
France                        1      8          400           895        1172
Italy                         1     21          553           679        2164
South East                    5     16          146           322         159
South West                    2     13          250           250         498
Central Europe                5     10          385           730         629
Northern Europe               2      4          200          2000         427
Germany/Switzerland           2     10          100           400        1733
Russia                        1      9           50           152         276
EGEE-total                   21    111         3084          9428       11297

Other collaborating sites
USA                           1      3            -             -         555
Canada                        1      6            -             -         316
Asia-Pacific                  6      8            -             -         394
Hewlett-Packard               1      3            -             -         172
Total other                   9     20            -             -        1437

Grand Total                  30    131            -             -       12734
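The totals in the metrics table can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the table, while the variable and function names are illustrative and not part of any EGEE tool. It also lists the regions whose actual CPU count is still below the M15 target from the TA.

```python
# Rows copied from the table: (region, countries, sites, cpu_m6, cpu_m15, cpu_actual)
EGEE_REGIONS = [
    ("CERN", 0, 1, 900, 1800, 1841),
    ("UK/Ireland", 2, 19, 100, 2200, 2398),
    ("France", 1, 8, 400, 895, 1172),
    ("Italy", 1, 21, 553, 679, 2164),
    ("South East", 5, 16, 146, 322, 159),
    ("South West", 2, 13, 250, 250, 498),
    ("Central Europe", 5, 10, 385, 730, 629),
    ("Northern Europe", 2, 4, 200, 2000, 427),
    ("Germany/Switzerland", 2, 10, 100, 400, 1733),
    ("Russia", 1, 9, 50, 152, 276),
]

# Collaborating sites have no TA targets: (region, countries, sites, cpu_actual)
OTHER_SITES = [
    ("USA", 1, 3, 555),
    ("Canada", 1, 6, 316),
    ("Asia-Pacific", 6, 8, 394),
    ("Hewlett-Packard", 1, 3, 172),
]


def column_totals(rows):
    """Sum every numeric column, skipping the region name in column 0."""
    return tuple(sum(col) for col in zip(*(row[1:] for row in rows)))


egee_total = column_totals(EGEE_REGIONS)    # countries, sites, M6, M15, actual
other_total = column_totals(OTHER_SITES)    # countries, sites, actual
grand_total_cpu = egee_total[4] + other_total[2]

# Regions whose actual CPU count is below the M15 target
below_m15 = [row[0] for row in EGEE_REGIONS if row[5] < row[4]]

print(egee_total)       # (21, 111, 3084, 9428, 11297)
print(other_total)      # (9, 20, 1437)
print(grand_total_cpu)  # 12734
print(below_m15)        # ['South East', 'Central Europe', 'Northern Europe']
```

The computed sums agree with the EGEE-total, Total other and Grand Total rows.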


Service Usage

• VOs and users on the production service
  - Active HEP experiments: 4 LHC, D0, CDF, Zeus, Babar
  - Active other VOs: Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics)
  - 6 disciplines
  - Registered users in these VOs: 600
  - In addition, there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE

• Scale of work performed: LHC Data Challenges 2004
  - >1 M SI2K years of CPU time (~1000 CPU years)
  - 400 TB of data generated, moved and stored
  - 1 VO achieved ~4000 simultaneous jobs (~4 times CERN grid capacity)

[Chart: number of jobs processed per month]
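As a rough reading of the data-challenge figures, the two CPU-time numbers imply an average processor rating, and the job count implies a CERN capacity. The sketch below only divides the rounded values quoted on the slide; the variable names are illustrative and the results are order-of-magnitude estimates, not measured values.

```python
# Rounded figures quoted on the slide (lower bounds / approximations)
si2k_years = 1_000_000      # ">1 M SI2K years of CPU time"
cpu_years = 1_000           # "~1000 CPU years"
simultaneous_jobs = 4_000   # "~4000 simultaneous jobs"
cern_capacity_ratio = 4     # "~4 times CERN grid capacity"

# Implied average rating of one CPU, in SI2K units
si2k_per_cpu = si2k_years / cpu_years

# Implied CERN grid capacity, in simultaneous job slots
implied_cern_job_slots = simultaneous_jobs / cern_capacity_ratio

print(si2k_per_cpu)            # 1000.0
print(implied_cern_job_slots)  # 1000.0
```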


SA1 – Operations Structure

• Operations Management Centre (OMC)

• Core Infrastructure Centres (CIC)
  - Manage daily grid operations: oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to ROCs
  - UK/I, Fr, It, CERN, + Russia (M12)
  - Weekly rotation in place since October
  - Taipei also runs a CIC

• Regional Operations Centres (ROC)
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region; many distributed

• User Support Centre (GGUS)
  - In FZK: manages PTS, provides single point of contact (service desk)
  - Not foreseen as such in TA, but need is clear


Operations Procedures

• Driven by experience during 2004 Data Challenges, and

• Reflecting the outcome of the November Operations Workshop

• Operations Procedures
  - roles of CICs, ROCs, RCs
  - weekly rotation of operations centre duties (CIC-on-duty); process in place since October
  - daily tasks of the operations shift
    - monitoring (tools, frequency)
    - problem reporting: problem tracking system; communication with ROCs & RCs
    - escalation of unresolved problems
    - handing over the service to the next CIC
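The weekly CIC-on-duty rotation can be pictured as a simple round-robin over the operating CICs. The sketch below is illustrative only: the slides name the CICs and say the rotation has run since October, but the rotation order, the exact start date and the function name are assumptions made for this example.

```python
from datetime import date

# CICs named on the slides; the ORDER is an assumption for illustration.
# Russia joins at M12 and could be appended to the list from then on.
CICS = ["CERN", "UK/I", "France", "Italy"]

# Hypothetical start of the rotation: a Monday in October 2004
ROTATION_START = date(2004, 10, 4)


def cic_on_duty(day, cics=CICS, start=ROTATION_START):
    """Return the CIC holding the operations shift in the week containing `day`."""
    if day < start:
        raise ValueError("rotation had not started on this date")
    weeks_elapsed = (day - start).days // 7
    return cics[weeks_elapsed % len(cics)]


print(cic_on_duty(date(2004, 10, 4)))   # CERN
print(cic_on_duty(date(2004, 10, 13)))  # UK/I
print(cic_on_duty(date(2004, 11, 1)))   # CERN (rotation wraps after four weeks)
```

Handing over the service to the next CIC then corresponds to the weekly change in the returned value.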


New Release Process (simplified)

[Flow diagram with seven numbered steps. Recoverable labels, in order of appearance:]

• Bugs/Patches/Tasks recorded in Savannah; participants shown: C&T, EIS, GIS, GDB, Applications, RCs, CICs, Developers
• Head of Deployment: prioritization & selection
• List for next release (can be empty)
• Integration & first tests (C&T)
• Internal Releases
• User-level install of client tools (EIS)
• Full deployment on test clusters (6); functional/stress tests, ~1 week (C&T)
• Assign and update cost (Bugs/Patches/Task, Savannah); components ready at cutoff
• Internal Client Release
• Client Release, Service Release, Updates Release, Core Service Release (C&T)


Deployment process

[Flow diagram. Recoverable labels:]

• Release(s); certification is run daily
• Update User Guides (EIS); update Release Notes (GIS)
• Documents: Release Notes, Installation Guides, User Guides
• Re-certify (CIC)
• Deploy Client Releases (user space): GIS
• Deploy Service Releases (optional): CICs, RCs
• Deploy Major Releases (mandatory): ROCs, RCs
• YAIM
• Timing labels: every month; every 3 months, on fixed dates!; at own pace


Planning for next year


Future work – comments from review

• Testing and software packaging will be critical to success. Reinforce these also intellectually very demanding activities even further.
  - Yes, this is agreed!

• Work hard on event-based monitoring techniques, triggering preventive maintenance actions, to improve the stability of the Grid infrastructure.

• Implement a strong mechanism to quickly isolate unstable sites in the production Grid.
  - These are both part of the ongoing programme of work
  - Use R-GMA as monitoring framework; build triggers and alarms on top
  - Better mechanism to remove sites: web interface to allow VO to select

• Improve the middleware deployment process (technical, organisational) even further to increase the stability of the infrastructure and consequently improve the job success rate and reduce the load on the support team.
  - Already updated and streamlined deployment and release process and improved configuration mechanisms
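The idea of building triggers on top of monitoring data to isolate unstable sites can be sketched as a sliding-window check on recent test results. This is a hypothetical illustration only: it does not use R-GMA or any EGEE tool, and the class name, window size, threshold and site names are all assumptions.

```python
from collections import defaultdict, deque


class SiteMonitor:
    """Keeps the most recent test results per site and flags unstable sites."""

    def __init__(self, window=10, min_success_rate=0.5):
        self.window = window                      # results kept per site
        self.min_success_rate = min_success_rate  # alarm threshold
        self._results = defaultdict(lambda: deque(maxlen=window))

    def record(self, site, passed):
        """Store the outcome of one functional test for `site`."""
        self._results[site].append(bool(passed))

    def success_rate(self, site):
        """Fraction of passed tests in the window, or None if no data."""
        results = self._results.get(site)
        if not results:
            return None
        return sum(results) / len(results)

    def unstable_sites(self):
        """Sites with a full window of results and a success rate below threshold."""
        flagged = []
        for site, results in self._results.items():
            if len(results) < self.window:
                continue  # not enough evidence yet to raise an alarm
            if sum(results) / len(results) < self.min_success_rate:
                flagged.append(site)
        return sorted(flagged)


# Hypothetical site names and test outcomes
monitor = SiteMonitor(window=4, min_success_rate=0.5)
for passed in [True, False, False, False]:
    monitor.record("site-a.example.org", passed)
for passed in [True, True, False, True]:
    monitor.record("site-b.example.org", passed)

print(monitor.unstable_sites())  # ['site-a.example.org']
```

Requiring a full window before flagging a site is a design choice that avoids removing a site on the basis of a single failed test.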


15 month plan

• No major changes to goals or work

• Areas of work focus:
  - Migration to gLite
    - See next slides
  - Improving operational and grid reliability
    - Follow recommendations of review discussed above
    - Improve monitoring systems: build reactive alarms
    - Site isolation: need simple mechanism (CIC tool) to remove sites (bad sites, security problems, etc.)
  - Improving user support
    - In progress; need recognised usable service by mid-year
  - 24x7 service availability
    - Availability of service rather than components
    - Identify critical services
    - Issues: on-call support; hot stand-by machines; etc. (might need work on middleware to support this!)


Review recommendations to SA1

• The migration path to gLite needs to be better planned, as it is inherently difficult to support two different grid software stacks indefinitely. More specifically, establishing a fixed time-line for migration as well as deprecation deadlines for LCG-2 services, plus possibly identifying who would be the earliest adopters from the application side and the time-line for their possible early committal, would be essential; otherwise, existing users may not be motivated to migrate.
  - Migration plan is being worked out in detail, but will be driven by experience in the certification and pre-production deployment
  - Must be a migration plan and not a switch from old to new
  - Early adopters include LCG; others should be identified via NA4


Migration to gLite

• Migration strategy
  - Needs to be incremental rather than big-bang, as has been stated for a year

• 2 activities in parallel:
  - Deploy components into LCG-2 certification test-bed and then to pre-production
  - Deploy pre-production sites in parallel

• PPS and Production
  - Are evolutionary LCG-2 gLite components

• Cannot provide LCG-2 end-of-life estimate/deadlines
  - LCG-2 is the fallback solution

• Applications must test services and decide which ones they need

[Timeline diagram, 2004 to 2005: LCG-2 (=EGEE-0) and LCG-3 (=EGEE-x?), with "prototyping" and "product" phases]


Review recommendations to SA1

• Consider the current gLite as a stepping stone towards a more robust standards-based infrastructure, rather than a final deployment solution. Select additional components for integration and deployment through collaborations with other international middleware R&D initiatives.
  - Work with Globus, VDT, OSG, etc. on common solutions/interfaces, but this has to be driven by the applications and experience from operations
  - Should be in a situation to be able to deploy components needed by the applications
  - Integration and certification process: mechanism for selecting other components


Review recommendations to SA1

• Continue to conduct application-driven investigation that may result in complex usage scenarios and consider how the advanced middleware and infrastructure would support them in a viable manner. As such, keep a keen eye on new generations of production-level Grid middleware from various international groups that go beyond gLite features.
  - For HEP: data challenges and service challenges bring specific goals and targets (and timescales); this will continue
  - Other applications might consider similar exercises: define some goals


Milestones for rest of project

• M14: full production grid in production
  - 9 ROCs, 5 CICs (include Russia at M12), 20 sites
  - Should be based on EGEE re-engineered middleware
    - This is dependent on the quality and robustness of gLite components
    - Experience: takes 6 months to put new software into production
    - Will not deploy new components unless they improve upon existing components or add new required functionality

• M21: expanded production infrastructure in place
  - As above, but expanded to 50 sites
  - Now decoupled from specific gLite release


Deliverables for rest of project

• Release notes corresponding to milestones
  - Updated relative to first set of release notes; snapshots corresponding to milestones
  - NB: ALL releases are accompanied by a full set of release notes

• EGEE “Cookbook”
  - Foreseen as planning guides to assist new participants to join or build components of the infrastructure:
    - Resource centres and their administrators
    - ROCs, CICs, and VOs
  - Templates and checklists to assist administrators to: design a facility, determine what resources to acquire, how to configure them, etc.
  - Detailed enough to allow admins to understand what the limitations of the system are and how to address them (e.g. what services can run on 1 machine, how to configure, etc.)
  - Make use of expertise of CICs, ROCs and staff in RCs (“and use technical writers in NA3”)

• M24: Assessment of infrastructure operation throughout the project
  - Remove suggestions on long-term sustainability (put into EGEE-2 planning)


Summary

• Production grid is operational and in use
  - Larger scale than foreseen; use in 2004 was probably the first time such a set of large-scale grid productions has been done
  - Modest growth in resources foreseen over next year

• Operational infrastructure in place and working
  - Need to continue to improve reliability of service
  - Need to continue to improve user support

• Support for applications and VOs
  - VO deployment should become still simpler and more routine
  - Application support needs more resources than foreseen

• Deployment and migration to gLite is now a major focus