Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience

Stefano Belforte (INFN Trieste), Frank Wuerthwein, James Letts, Sanjay Padhi, Dave L. Evans, Marco Calloni

Transcript of Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience

Page 1: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience

Jan 26, 2010

Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience

Stefano Belforte (INFN Trieste), Frank Wuerthwein, James Letts, Sanjay Padhi, Dave L. Evans, Marco Calloni

Page 2: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Summary

1. Metrics: facts

2. Operational experience: subjective report

3. Concerns & Lessons: our opinions

4. Conclusions: my very personal opinions

Page 3: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


1. Metrics

Page 4: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Metrics report from James Letts 1

Page 5: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Metrics report from James Letts 2

Page 6: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Metrics report from James Letts 3

Page 7: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Metrics report from James Letts 4

Page 8: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Metrics report from James Letts 5

Page 9: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Metrics: CAF (only batch work is monitored)

114 hosts with 8 cores each ≈ 900 batch slots

Load and users from Lemon, last year

Dashboard, Nov 1 to Jan 1: 160M CPU seconds ≈ 30 used CPU hours per clock hour (~30 busy slots)

CAF CRAB server, last 2 months: 30K jobs ≈ 500 jobs/day
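
As a sanity check of the numbers above, a minimal back-of-the-envelope sketch (assuming the Dashboard window Nov 1 to Jan 1 is 61 days; the input figures are the ones quoted on this slide):

```python
# Back-of-the-envelope check of the CAF batch numbers quoted above.
# Assumption: the Dashboard window Nov 1 - Jan 1 is 61 days.

SECONDS_PER_DAY = 86_400
window_days = 61
window_sec = window_days * SECONDS_PER_DAY      # ~5.3M clock seconds

slots = 114 * 8                                 # 912, i.e. ~900 batch slots
cpu_sec = 160e6                                 # 160M CPU seconds (Dashboard)

busy_slots = cpu_sec / window_sec               # average concurrently busy slots
print(f"average busy slots: {busy_slots:.0f}")  # ~30
print(f"occupancy: {busy_slots / slots:.1%}")   # ~3% of the ~900 slots

jobs = 30_000                                   # CAF CRAB server, last 2 months
print(f"jobs/day: {jobs / window_days:.0f}")    # ~500
```

So the ~30 used CPU hours per clock hour correspond to about 30 of the ~900 slots busy on average, consistent with the later observation that the CAF does not appear steadily overloaded.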

Page 10: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


2. Operations Report

Page 11: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Operations: Support

Little data: user activity was not a visible load, so lessons in this area still come from OctX.

Much MC, largely a success in this area: MC samples were felt to be of proper size and available as needed. If anything there were too many, so it is difficult to know which one to use. Meta-information in DBS does not appear adequate yet; people work out of dataset lists in twikis and mails.

CRAB support load is still too high and requires (good) expert support. We have no idea yet how effective new people can be; most questions are not solved by pointing to documentation.

Good: documentation and tutorials are good and people use them.

Bad: few threads are closed quickly; we often need to reproduce, debug, and pass to developers. Even within our group we need to consult with each other and make mistakes.

Many questions belong to the category "a better message from the tool will prevent users from asking again".
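
To make the last point concrete, a minimal sketch of what "a better message from the tool" could mean: a helper (hypothetical, not actual CRAB code) that maps bare job exit codes to actionable hints. The codes follow the CMS exit-code convention of the era, but the hint texts are invented for illustration.

```python
# Illustrative sketch only: turn opaque job exit codes into hints a user
# can act on, so the same support question is not asked again.
# Codes follow the CMS convention of the era; texts are invented.

HINTS = {
    60307: "Output stageout failed: check your storage settings and that "
           "you have write permission at the destination.",
    8001:  "CMSSW threw an exception: reproduce locally with cmsRun on one "
           "input file before resubmitting.",
    50664: "Job hit the wall-clock limit: use fewer events per job or "
           "split the task into more jobs.",
}

def explain(exit_code: int) -> str:
    """Return an actionable message for a bare exit code."""
    return HINTS.get(exit_code,
                     f"Exit code {exit_code}: not yet catalogued, "
                     "please open a support thread.")

print(explain(60307))
```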

Page 12: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Operations: CAF

Users relied on CAF in 2009 Little data, ntuple could be made on CAF and replicated around by

"all sort of means“. CAF does not appear steadily overloaded. in a pattern of one-day-beam-few-days-no that may continue

may be painful to wean from it in a rush under pressure many people use grid daily, we know it works, but it needs a

bit of getting used to use cases appeared for moving data from CAF to T2's (MC so

far, expect data and calibration next) CAF is by design separated by the distributed infrastructure But hard to explain to users that "data can't move to/from CERN

disks and offsite T2 disks“ and that jobs can not be submitted to CAF unless logged on lxplus (political limitations, not technical, other experiments do those things)

difficult to defend policies that make little technical sense grid access to CAF would allow to user same /store/result

instance (only have ~1FTE overall to operate the service)

Page 13: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Operations: CAF CRAB Server

The CRAB server for general CAF users is not really "like the grid one, but with AFS": it requires dedicated test/validation and support. Operations deals mostly with failures, and they are different here; the dedicated development effort is not addressed here. It doubles the work while adding <10% of the resources.

It does not hide the "unreliable grid" but the highly reliable local batch system. CERN may go down, but it is hard for lxplus to work while lxbatch does not, whereas on the grid N sites go up and down as a habit. On the grid the problem is to make the grid invisible; here it is to make the CRAB server invisible. On the grid we look forward to pilot jobs to overlay our own global batch scheduler over the N sites, making them look like a single batch system. The CAF has that already via LSF. What do we really gain with the server?

It is de facto the last priority for support now. Fill-up of the local disk was detected by operations, not by users, and none complained when the server was turned off for a week to address it.

Page 14: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Operations: changes

Transitions: changes in offline (SL5, Python 2.6, ...) are often felt to be done once CMSSW is out, but the transition in CRAB and the overall infrastructure needs to be taken into account more fully.

Last December we had to throw in a new CRAB release in a hurry to cope with SL5; grid integration issues were solved roughly on the last day (with a recipe from operations) and led to "horror code". DPM sites are still not working with the CMS distribution and use an unofficial patch.

The new version is always better, but we operate the current one. Backporting fixes to the production release is expensive for developers; answering the same problem day after day is expensive for us.

Adding servers, falling back, resilience: there is a long lead time to get new servers in, especially at CERN. We need good advance planning, no sudden changes, and robust servers. CRAB server failures are not transparent and take ~half of the CMS workload with them. UPS?

Page 15: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


3. Concerns

Page 16: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Concerns: Data Distribution

Data distribution was largely a success, but we have not yet dealt with cleanup and working with "full sites". Are the space bookkeeping tools adequate?

Stageout of users' output data is still a major problem and likely to remain so for a while. Effort on new versions, validation, support, documentation etc. delayed the deploy/test campaign on /store/temp/user; we have to catch up fast.
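
For context, the stageout being discussed is driven by the [USER] section of the CRAB 2-era crab.cfg. A minimal sketch, assuming option names as in the CRAB 2 documentation of the time (they varied between releases); the site and directory values are placeholders, not a recommendation:

```ini
[USER]
# Sketch only: CRAB 2-era remote stageout settings; values are placeholders.
return_data     = 0                  # do not bring output back in the sandbox
copy_data       = 1                  # stage out to remote storage instead
storage_element = T2_IT_Legnaro      # placeholder destination site
user_remote_dir = my_analysis_v1     # lands under the user's /store/user area
```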

Page 17: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Concerns: Need to reduce daily support load

Software is 10% coding and 90% support.

CRAB is very powerful and flexible; it touches everything from CMSSW to DBS to the grid, from MC production to data splitting and calibration access. We cannot offer expert advice and debugging on all currently implemented features, and cannot provide help on use cases we have not explored ourselves.

We have already started to cut, e.g. multicrab. We will make a list of what we are comfortable in knowing how to use, and only support that.

We could use better separation of the CMSSW/DBS/CRAB/grid parts (true for us, for users, and possibly for developers); clean segmentation allows us to attach problems to a specific area.

We need to become able to operate in the mode "watch it while it runs and figure out overall patterns".

Page 18: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Concerns: CAF

We don't see how we can give it the needed level of support with the planned resources. Provocative proposals follow.

So far the CAF server approach has been used for two very different things:

As an incubator for workflows that will move to the T0 (AlCa, express, ...): predicted/predictable load, few workflows, few people, more development/integration than operations, then hand over.

As a general tool for the CAF community to do all kinds of CRAB work, enticing users to move from bsub to crab -submit: we have to support all possible kinds of loads in a time-critical way, with a continuous self-inflicted DoS risk. Proposal: drop the server and only support direct CRAB-to-LSF submission (see the sketch after this list).

(A piece of) the CAF should be integrated with the distributed infrastructure, usable with the same tools running the same workflows: submit with the same CRAB server, use the same /store/result instance, access to PhEDEx transfers.
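
On the user side, the bsub-to-crab transition mentioned above amounts to a scheduler choice in crab.cfg. A minimal sketch, assuming CRAB 2-era option names (scheduler values such as caf and glite existed in that series, though exact spellings varied by release):

```ini
[CRAB]
# Sketch only: the same task pointed either at the CAF (LSF) or at the grid.
jobtype    = cmssw
scheduler  = caf        # LSF-backed CAF submission; 'glite' targets the grid
use_server = 0          # direct submission, i.e. the "drop the server" option
```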

Page 19: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Concerns: too many features

Each small "addition", however useful and well motivated, generates more work forever. We would like to focus on what's needed, not what's nice. We are uncomfortable making this selection ourselves, but of course ready to do it.

Feature requests (especially new ones) should be championed by some (group of) users providing clear use scenarios relevant in the short term, and willing to take on the education of the community. Dumping something in our lap that we in Analysis Operations are not familiar with is not likely to result in users' satisfaction.

We cannot be, or become, experts on all possible workflows, from user-produced MC to DQM. Even the simple "give me your script and I'll make it run on the grid in N copies" is still more work than we can deal with at the moment.

Page 20: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Concerns: changes

Changes in offline do not stop at CMSSW: SL5, new Python, new ROOT, new ... all also affect CRAB and running on remote sites. Making CRAB work, and making CMSSW work at all CMS Tier-2s, should be a prerequisite for release validation, not a race by operations and CRAB developers after the fact.

Changes in CRAB: as seen in other projects, developers think in terms of the next (and next-to-next) release, while we need to operate with the current one (and are asked to validate the next). This is a concern mostly because of the large looming transition to WMAgent, with timing, duration, and functionality changes unknown to us. What features/fixes could/should be backported to the production release instead?

Page 21: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Concerns: user MC

We see an extraordinary amount of user-made MC. Support for large workflows is difficult, and it is also difficult to predict and control side effects on other users. It sets a bad example/habit of "DataOps will take long, so let's do it ourselves".

Why is this not going via DataOps? A communication issue? A perception of bureaucratic overload? Private code? Unsupported generators?

We are not prepared/equipped to offer good support for workflows that are not based on datasets located at T2s.

Page 22: Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience


Conclusions

We learned a lot in 2009 and built competence in the new group. MC exercised us at scale; we learned little from work on beam data.

Computing resources appear well usable and not saturated. The "analysis ops metrics" effort is particularly appreciated; we are starting to have the overall vision. Central placement of useful data at T2s is working well so far. We are ready to face more beam data.

We have listed our main concerns and desires: cut the scope of CRAB support to reduce the daily load; reduce CRAB complexity and coupling to all possible tools; focus on stageout issues (50% of all failures); push large single workflows into DataOps; add more access options to the CAF to better integrate with the grid; focus the CAF CRAB server on critical workload; more care and coordination in transitions, with operations a stakeholder in next-release planning.