GridPP: Running a Production Grid


Transcript of GridPP: Running a Production Grid

Page 1: GridPP: Running a Production Grid

GridPP: Running a Production Grid

Stephen Burke, CLRC/RAL

On behalf of the GridPP Deployment & Operations Team

UK e-Science All-hands, Nottingham, 21st September 2006

Page 2: GridPP: Running a Production Grid


Overview

• EGEE, LCG and GridPP

• Middleware

• Deployment & Operations

• Conclusions

Page 3: GridPP: Running a Production Grid

EGEE, LCG and GridPP

Page 4: GridPP: Running a Production Grid


EGEE

• Major EU Grid project: 2004-08 (in two phases)
– Successor to the European DataGrid (EDG) project, 2001-04
– 32 countries, 91 partners, €37 million + matching funding
– Associated with several Grid projects outside Europe
– Expected to be succeeded by a permanent European e-infrastructure

• Supports many areas of e-science, but currently High Energy Physics is the major user
– Biomedical research is also a pioneer
– Currently ~3000 users in 200 Virtual Organisations

• Currently 195 sites, 28689 CPUs, 18.4 PB of storage
– Values taken from the information system – beware of GIGO (garbage in, garbage out)!

Page 5: GridPP: Running a Production Grid


EGEE/LCG Google map

Page 6: GridPP: Running a Production Grid


(W)LCG

• The computing services for the LHC (Large Hadron Collider) at CERN in Geneva are provided by the LHC Computing Grid (LCG) project
– LHC starts running in ~1 year
– Four experiments, all very large
– ~5000 users at 500 sites worldwide, 15-year lifetime

• Expect ~15 PB/year of data, plus similar volumes of simulated data
• Processing requirement is ~100,000 CPUs
• Must transfer ~100 Mbyte/sec/site, sustained for 15 years (see the rough check after this list)

• Running a series of Service Challenges to ramp up to full scale

• LCG uses the EGEE infrastructure, but also the Open Science Grid (OSG) in the US and other Grid infrastructures
– Hence WLCG = Worldwide LCG
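As a rough consistency check of those numbers (a back-of-the-envelope sketch, not from the slides), the sustained transfer rate implies a data volume per site of a few PB per year:

```python
# Back-of-the-envelope check of the sustained transfer requirement.
SECONDS_PER_YEAR = 365 * 24 * 3600   # ~3.15e7 seconds
rate_mb_per_s = 100                  # ~100 Mbyte/sec/site (from the slide)

volume_pb_per_year = rate_mb_per_s * SECONDS_PER_YEAR / 1e9  # MB -> PB
print(f"Sustained volume per site: {volume_pb_per_year:.1f} PB/year")
# -> ~3.2 PB/year/site, to be sustained over the 15-year LHC lifetime
```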

Page 7: GridPP: Running a Production Grid


Organisation

• EGEE sites are organised by region
– GridPP is part of the UK/Ireland region, which also includes NGS and Grid Ireland
– Each region has a Regional Operation Centre (ROC) to look after the sites in the region
– Overall operations co-ordination rotates weekly between ROCs

• LCG divides sites into Tiers 1/2/3, plus CERN as Tier 0
– The Tier is a function of size and quality of service (QOS)
– Tier 1 needs >97% availability and a maximum 24-hour response time
– Tier 2 needs 95% availability and a 72-hour response time (see the downtime sketch after this list)
– Tier 3 sites are local facilities, with no specific targets

• ROC ≈ Tier 1: RAL is both
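To make the availability targets concrete, here is a minimal sketch (not from the slides) converting them into allowed downtime per year:

```python
# Translate availability targets into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24

for tier, availability in [("Tier 1", 0.97), ("Tier 2", 0.95)]:
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{tier}: {availability:.0%} availability -> "
          f"up to {downtime_h:.0f} hours (~{downtime_h / 24:.0f} days) down per year")
# Tier 1: 97% -> ~263 hours (~11 days); Tier 2: 95% -> ~438 hours (~18 days)
```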

Page 8: GridPP: Running a Production Grid


GridPP

• Grid for UK Particle Physics
– Two phases: 2001-04 and 2004-07
– Proposal for a third phase, to 2011
– Part of EGEE and LCG
• Working towards interoperability with NGS

• 20 sites, 4354 CPUs, 298 TB of storage
• Currently supports 33 VOs, including some non-PP
– But not many non-PP from the UK – any volunteers?
• For LCG, sites are grouped into four “virtual” Tier 2s
– Plus RAL as Tier 1
– Grouping is largely administrative; the Grid sites remain separate

• Runs the UK/Ireland ROC (with NGS)
• Grid Operations Centre (GOC) @ RAL (with NGS)
– Grid-wide configuration, monitoring and accounting repository/portal
• Operations and User Support shifts (working hours only)

Page 9: GridPP: Running a Production Grid


GridPP sites

Page 10: GridPP: Running a Production Grid

Middleware

Page 11: GridPP: Running a Production Grid


Site services

• Basis is Globus (still GT2, GT4 soon) and Condor, as packaged in the Virtual Data Toolkit (VDT) – also used by NGS

• EGEE/LCG/EDG middleware distribution now under the gLite brand name

• Computing Element (CE): Globus gatekeeper + batch system + batch workers
– In transition from the Globus gatekeeper to Condor-C

• Storage Element (SE): Storage Resource Manager (SRM) + GridFTP + other data transports + storage system (disk-only or disk+tape)
– Three SRM implementations in use within GridPP

• Berkeley Database Information Index (BDII): LDAP server publishing CE + SE + site + service information according to the GLUE schema (a minimal query sketch follows this list)

• Relational Grid Monitoring Architecture (R-GMA) server: publishing GLUE schema, monitoring, accounting, user information

• VOBOX: Container for VO-specific services (aka “edge services”)
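As a concrete illustration of querying a BDII, here is a minimal sketch: the hostname is hypothetical, the attribute names follow the GLUE 1.x schema (exact names may vary by release), and it assumes the third-party ldap3 Python library.

```python
# Query a BDII for Computing Elements via anonymous LDAP (GLUE 1.x schema).
# The hostname is hypothetical; BDIIs conventionally listen on port 2170
# and publish under the base DN "o=grid".
from ldap3 import Server, Connection

server = Server("ldap://bdii.example.org:2170")
conn = Connection(server, auto_bind=True)  # anonymous bind, as for a public BDII

conn.search(search_base="o=grid",
            search_filter="(objectClass=GlueCE)",
            attributes=["GlueCEUniqueID", "GlueCEInfoTotalCPUs"])

for entry in conn.entries:
    # Each entry describes one CE/queue; the totals are whatever the site
    # publishes, hence the "beware of GIGO" caveat earlier.
    print(entry.GlueCEUniqueID, entry.GlueCEInfoTotalCPUs)
```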

Page 12: GridPP: Running a Production Grid


Core services

• Workload Management System (WMS), aka Resource Broker: accepts jobs, dispatches them to sites and manages their lifecycle

• Logging & Bookkeeping: primarily logs lifecycle events for jobs

• MyProxy: stores long-lived credentials

• LCG File Catalogue (LFC): maps logical file names to physical replicas on SEs (a toy illustration of this mapping follows this list)

• File Transfer Service (FTS): provides managed, reliable file transfers

• BDII: aggregates information from site BDIIs

• R-GMA schema/registry: stores table definitions and lists of producers/consumers

• VO Membership Service (VOMS) server: stores VO group/role assignments

• User Interface (UI): provides user client tools for the Grid services
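To illustrate what the LFC provides, here is a toy in-memory sketch of the logical-name-to-replica mapping. This is purely illustrative (the names and SURLs are invented); the real LFC is a database-backed service with a hierarchical namespace, GUIDs and access control.

```python
# Toy model of the LFC's core mapping: one logical file name (LFN)
# -> one or more physical replicas (SURLs) on different SEs.
catalogue: dict[str, list[str]] = {
    "/grid/myvo/data/run123/file001": [
        "srm://se.site-a.example.org/myvo/file001",
        "srm://se.site-b.example.org/myvo/file001",
    ],
}

def list_replicas(lfn: str) -> list[str]:
    """Return all known physical replicas for a logical file name."""
    return catalogue.get(lfn, [])

def add_replica(lfn: str, surl: str) -> None:
    """Register a new physical copy under an existing (or new) LFN."""
    catalogue.setdefault(lfn, []).append(surl)

print(list_replicas("/grid/myvo/data/run123/file001"))
```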

Page 13: GridPP: Running a Production Grid


Grid services

• Some extra services are needed to allow the Grid to be operated effectively
– Mostly unique instances, not part of the gLite distribution

• Grid Operations Centre DataBase (GOCDB): stores information about each site, including contact details, status and a node list
– Queried by other tools to generate configuration, monitoring etc.

• Accounting (APEL): publishes information about CPU and storage use

• Various monitoring tools, including:
– gstat (Grid status): collects data from the information system and performs sanity checks
– Site Availability Monitoring (SAM): runs regular test jobs at every site, raises alerts and measures availability over time
– GridView: collects and displays information about file transfers
– Real Time Monitor: displays job movements and records statistics

• Freedom of Choice for Resources (FCR): allows the view of resources in a BDII to be filtered according to VO-specific criteria, e.g. SAM test failures (illustrated after this list)

• Operations portal: aggregates monitoring and operational information, broadcast email tool, news, VO information, …
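The FCR idea can be shown with a small sketch. This is illustrative only (the site names, test names and data are invented; the real FCR edits the view served by a top-level BDII): filter the published resources down to those passing the tests a given VO marks as critical.

```python
# Illustrative FCR-style filtering: hide resources failing a VO's critical tests.
sam_results = {
    # site -> {test name -> passed?}; toy data, not real SAM output
    "site-a.example.org": {"job-submit": True,  "replica-mgmt": True},
    "site-b.example.org": {"job-submit": False, "replica-mgmt": True},
}

critical_tests_for_vo = ["job-submit"]  # each VO chooses its own critical set

def visible_sites(results, critical):
    """Sites passing every test the VO marked as critical stay visible."""
    return [site for site, tests in results.items()
            if all(tests.get(t, False) for t in critical)]

print(visible_sites(sam_results, critical_tests_for_vo))
# -> ['site-a.example.org']; site-b is filtered out of the VO's view
```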

Page 14: GridPP: Running a Production Grid


SAM monitoring

Page 15: GridPP: Running a Production Grid


GridView

Page 16: GridPP: Running a Production Grid


Middleware issues

• We need to operate a large production system with 24×7, year-round availability

• Middleware development is usually done on small, controlled test systems, but the production system is much larger in many dimensions, more heterogeneous and not under any central control

• Much of the middleware is still immature, with a significant number of bugs, and developing rapidly
– Documentation is sometimes lacking or out of date

• There are therefore a number of issues which must be managed by deployment and operational procedures, for example:
– The rapid rate of change, and sometimes lack of backward compatibility, requires careful management of code deployment
– Porting to new hardware, operating systems etc. can be time-consuming
– Components are often developed in isolation, so integration of new components can take time
– Configuration can be very complex, and only a small subset of possible configurations produce a working system
– Fault tolerance, error reporting and logging are in need of improvement
– Remote management and diagnostic tools are generally undeveloped

Page 17: GridPP: Running a Production Grid

Deployment & Operations

Page 18: GridPP: Running a Production Grid


Configuration

• We have tried many installation & configuration tools over the years

• Configuration is complex, but system managers don’t like complex tools!

• Most configuration flexibility needs to be “frozen”
– Admins don’t understand all the options anyway
– Many configuration changes will break something
– The more an admin has to type, the more chances for a mistake

• The current method preferred by most sites is YAIM (Yet Another Installation Method):
– bash scripts
– simple configuration of key parameters only
– doesn’t always have enough flexibility, but good enough for most cases (a small sketch of this style follows)
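To give a flavour of the “key parameters only” style, here is a minimal sketch. The parameter names are modeled on YAIM’s site-info.def but are illustrative here, as is the validation logic:

```python
# Sketch of the "small set of key parameters" configuration style.
# Parameter names are modeled on YAIM's site-info.def for illustration only.
REQUIRED = ["SITE_NAME", "CE_HOST", "SE_HOST", "BDII_HOST", "VOS"]

def parse_site_info(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip().strip('"')
    return params

conf = parse_site_info("""
SITE_NAME=UKI-EXAMPLE
CE_HOST=ce.example.ac.uk
SE_HOST=se.example.ac.uk
BDII_HOST=bdii.example.ac.uk
VOS="atlas cms dteam"
""")

missing = [k for k in REQUIRED if k not in conf]
print("missing keys:", missing or "none")  # fewer knobs -> fewer chances for a typo
```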

Page 19: GridPP: Running a Production Grid


Release management

• There is a constant tension between the desire to upgrade to get new features and the desire to have a stable system
– Need to be realistic about how long it takes to get new things into production

• We have so far had a few “big bang” releases per year, but these have some disadvantages
– Anything which misses a release has to wait a long time, hence there is pressure to include untested code
– Releases can be held up by problems in any area, hence are usually late
– They involve a lot of work for system managers, so it may be several months before all sites upgrade

• We are now moving to incremental releases, updating each component as it completes integration and testing
– Have to avoid dependencies between component upgrades
– Releases go first to a 10%-scale pre-production Grid
– Updates every couple of weeks
– The system becomes more heterogeneous
– Still some big bangs, e.g. a new OS
– Seems OK so far; time will tell!

Page 20: GridPP: Running a Production Grid


VO support

• If sites are going to support a large number of VOs, the configuration has to be done in a standard way
– Largely true, but not perfect: adding a VO needs changes in several areas
– Configuration parameters for VOs should be available on the operations portal, although many VOs still need to add their data

• It needs to be possible to install VO-specific software, and maybe services, in a standard way
– Software is ~OK: an NFS-shared area, writeable by specific VO members, with publication in the information system
– Services are still under discussion: concerns about security and support

• VOs often expect to have dedicated contacts at sites (and vice versa)
– May be necessary in some cases, but does not scale
– The operations portal stores contacts, but a site-to-VO contact may not reach the right people – need contacts by area
– Not too bad, but still needs some work to find a good modus vivendi

Page 21: GridPP: Running a Production Grid


Availability

• LCG requires high availability, but the intrinsic failure rate is high
– Most of the middleware does not deal gracefully with failures
– Some failure modes can lead to “black holes”
– Must fix/mask failures via operational tools so users don’t see them

• Several monitoring tools have been developed, including test jobs run regularly at sites

• On-duty operators look for problems and submit tickets to sites
– Currently ~50 tickets per week (cf. 200 sites)

• The FCR tool allows sites failing specified tests to be made “invisible”
– New sites must be certified before they become visible
– Persistently failing sites can be decertified
– Sites can be removed temporarily for scheduled downtime

• Performance is monitored over time
– The situation has improved a lot, but we still have some way to go

Page 22: GridPP: Running a Production Grid

Conclusions

Page 23: GridPP: Running a Production Grid


Lessons learnt

• “Good enough” is not good enough
– Grids are good at magnifying problems, so we must try to fix everything

• Exceptions are the norm (see the check after this list)
– 15,000 nodes × an MTBF of 5 years = 8 failures a day
– Also 15,000 ways to be misconfigured!
– Something somewhere will always be broken

• But middleware developers tend to assume that everything will work
– It needs a lot of manpower to keep a big system going

• Bad error reporting can cost a lot of time
– And reduce people’s confidence

• Very few people understand how the whole system works
– Or even a large subset of it
– It is easy to do things which look reasonable but have a bad side-effect

• Communication between sites and users is an n×m problem
– Need to collapse it to n+m
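Checking the failure arithmetic from the “exceptions” bullet (a one-line calculation using only the inputs stated on the slide):

```python
# Expected hardware failures per day for a large node count.
nodes = 15_000
mtbf_years = 5

failures_per_day = nodes / (mtbf_years * 365)
print(f"~{failures_per_day:.1f} failures/day")  # ~8.2, matching the slide's "8 a day"
```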

Page 24: GridPP: Running a Production Grid


Summary

• LHC turns on in 1 year – we must focus on delivering a high QOS

• Grid middleware is still immature, developing rapidly and in many cases a fair way from production quality

• Experience is that new middleware developments take ~ 2 years to reach the production system, so LHC will start with what we have now

• The underlying failure rate is high – this will always be true with so many components, so middleware and operational procedures must allow for it

• We need procedures which can manage the underlying problems, and present users with a system which appears to work smoothly at all times
– Considerable progress has been made, but there is more to do

• GridPP is running a major part of the EGEE/LCG Grid, which is now a very large system operated as a high-quality, 24×7, year-round service

• We are living in interesting times!