Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience
Transcript of Analysis Operations Postmortem Report from 2009 beam data and overall 2009 experience
Stefano Belforte INFN Trieste Analysis Operations 2009 Post-Mortem
Jan 26, 2010
Authors: Stefano Belforte, Frank Wuerthwein, James Letts, Sanjay Padhi, Dave L. Evans, Marco Calloni
Summary
1. Metrics: facts
2. Operational experience: subjective report
3. Concerns & Lessons: our opinions
4. Conclusions: my very personal opinions
1. Metrics
Metrics report from James Letts (slides 1-5; figures not transcribed)
Metrics: CAF (only batch work is monitored)
- 114 hosts, 8 cores each ≈ 900 batch slots
- Load/users from Lemon, last year
- Dashboard, Nov 1 to Jan 1: 160M CPU seconds ≈ 30 used hours per clock hour (~30 slots)
- CAF CrabServer, last 2 months: 30K jobs ≈ 500 jobs/day
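The per-slot and per-day figures above are simple averages over the monitoring window; a minimal sketch checking the arithmetic (the ~61-day length of the Nov 1 to Jan 1 window is an assumption, not stated on the slide):

```python
# Sanity check of the slide's back-of-envelope CAF metrics.
# Assumption: the Dashboard window Nov 1 - Jan 1 is ~61 days.
days = 61
clock_seconds = days * 86400          # wall-clock seconds in the window

cpu_seconds = 160e6                   # total CPU time reported by Dashboard
avg_busy_slots = cpu_seconds / clock_seconds
print(round(avg_busy_slots))          # ~30 slots busy on average

jobs = 30_000                         # CrabServer jobs over ~2 months
print(round(jobs / days))             # ~490 jobs/day, i.e. the quoted ~500
```

With ~900 slots available, ~30 busy slots corresponds to roughly 3% average batch utilization, consistent with the slide's point that the CAF was not saturated.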
2. Operations Report
Operations: Support
- Little data? User activity was not a visible load; lessons in this area still come from OctX
- Much MC, largely a success in this area: MC samples were felt to be of proper size and available as needed; if anything too many, making it difficult to know which one to use
- Meta-info in DBS does not appear adequate yet; people work out of dataset lists in twikis and mails
- CRAB support load still too high; it requires (good) expert support
- No idea yet how effective new people can be; most questions are not solved by pointing to documentation
  - good: documentation and tutorials are good and people use them
  - bad: few threads are closed quickly; often we need to reproduce, debug, and pass to developers, and even within our group we need to consult with each other and make mistakes
- Many questions belong to the category: a better message from the tool would prevent users from asking again
Operations: CAF
- Users relied on the CAF in 2009: with little data, ntuples could be made on the CAF and replicated around by "all sorts of means"
- The CAF does not appear steadily overloaded; in a pattern of one-day-beam, few-days-no, that may continue
- It may be painful to wean users from it in a rush under pressure; many people use the grid daily and we know it works, but it takes a bit of getting used to
- Use cases appeared for moving data from the CAF to T2s (MC so far; expect data and calibration next)
- The CAF is by design separated from the distributed infrastructure, but it is hard to explain to users that "data can't move between CERN disks and off-site T2 disks" and that jobs cannot be submitted to the CAF unless logged in on lxplus (political limitations, not technical; other experiments do these things)
- It is difficult to defend policies that make little technical sense; grid access to the CAF would let users use the same /store/result instance (we only have ~1 FTE overall to operate the service)
Operations: CAF CRAB Server
- The CRAB Server for general CAF users is not really "like the grid one, but with AFS"
- It requires dedicated test/validation and support; operations deals mostly with failures, and they are different here
- Dedicated development effort is not addressed here: double the work, adding <10% of resources
- It does not hide the "unreliable grid" but the highly reliable local batch system: CERN may go down, but it is unlikely that lxplus works and lxbatch does not, while on the grid N sites go up and down as a habit
- On the grid the problem is to make the grid invisible; here it is to make the CRAB server invisible
- On the grid we look forward to pilot jobs to overlay our own global batch scheduler over the N sites, making them look like a single batch system. The CAF has that already via LSF. What do we really gain with the server?
- De facto last priority for support now: fill-up of the local disk was detected by ops, not users, and nobody complained when it was turned off for a week to address that
Operations: changes
- Transitions in offline (SL5, Python 2.6, ...) are often felt to be done when CMSSW is out, but the transition in CRAB and the overall infrastructure needs to be taken into account more fully
- Last December we had to throw in a new CRAB release in a hurry to cope with SL5; grid integration issues were solved roughly on the last day (with a recipe from operations) and led to "horror code". DPM sites are still not working with the CMS distribution and use an unofficial patch
- The new version is always better, but we operate the current one: backporting fixes to the production release is expensive for developers, and answering the same problem day after day is expensive for us
- Adding servers, falling back, resilience: long lead time to get new servers in, especially at CERN. We need good advance planning, no sudden changes, and robust servers. CRAB server failures are not transparent and take ~half of the CMS workload with them. UPS?
3. Concerns
Concerns: Data Distribution
- Data distribution was largely a success, but we have not yet dealt with cleanup and working with "full sites"; are the space bookkeeping tools adequate?
- Users' output data stageout is still a major problem, and likely to be so for a while: effort on new versions, validation, support, documentation, etc. delayed the deploy/test campaign on /store/temp/user. We have to catch up fast
Concerns: need to reduce daily support load
- Software is 10% coding and 90% support
- CRAB is very powerful and flexible; it touches everything from CMSSW to DBS to the grid, from MC production to data splitting and calibration access
- We cannot offer expert advice and debugging on all currently implemented features, and cannot provide help on use cases we have not explored ourselves
- We have already started to cut, e.g. multicrab. We will make a list of what we are comfortable knowing how to use, and only support that
- We could use a better separation of the CMSSW/DBS/CRAB/grid parts (true for us, for users, and possibly for developers); clean segmentation allows us to "attach problems to a specific area"
- We need to become able to operate in the mode: "watch it while it runs and figure out overall patterns"
Concerns: CAF
- We don't see how we can give it the needed level of support with the planned resources; provocative proposals follow
- So far the CAF server approach has been used for two very different things:
  - Incubator for workflows that will move to T0 (AlCa, express, ...): predicted/predictable load, few workflows, few people, more development/integration than operations; hand it over
  - General tool for the CAF community to do all kinds of CRAB work, enticing users to move from bsub to crab -submit: we have to support all possible kinds of loads in a time-critical way, with a continuous self-inflicted DoS risk; drop the server and only support direct CRAB-to-LSF submission
- (A piece of) the CAF should be integrated with the distributed infrastructure: usable with the same tools, running the same workflows, submitting with the same CRAB server, using the same /store/result instance, with access to PhEDEx transfers
Concerns: too many features
- Each small "addition" generates more work forever, however useful and well motivated
- We would like to focus on what's needed, not what's nice; we are uncomfortable making this selection ourselves, but of course ready to do it
- Feature requests (esp. new ones) should be championed by some (group of) users providing clear use scenarios relevant in the short term, and willing to take on education of the community
- Dumping something in our lap that we in Analysis Ops are not familiar with is not likely to result in user satisfaction
- We cannot be or become experts on all possible workflows, from user-produced MC to DQM
- The simple "give me your script and I'll make it run on the grid in N copies" is still more work than we can deal with at the moment
Concerns: changes
- Changes in offline do not stop at CMSSW: SL5, new Python, new ROOT, new ... all also affect CRAB and running on remote sites
- Making CRAB work and making CMSSW work at all CMS Tier-2s should be a prerequisite for release validation, not a race by ops and CRAB developers after the fact
- Changes in CRAB: as seen in other projects, developers think in terms of the next (and next-to-next) release, while we need to operate with the current one (and are asked to validate the next)
- A concern mostly because of the large looming transition to WMAgent, with timing, duration, and functionality changes unknown (to us)
- What features/fixes could or should be backported to the production release instead?
Concerns: user MC
- We see an extraordinary amount of user-made MC
- Support for large workflows is difficult; it is also difficult to predict and control side effects on other users
- This sets a bad example/habit of "DataOps will take long, so let's do it ourselves"
- Why is this not going via DataOps? A communication issue? A perception of bureaucratic overhead? Private code? Unsupported generators?
- We are not prepared or equipped to offer good support for workflows that are not based on datasets located at T2s
Conclusions
- Learned a lot in 2009 and built competence in the new group: MC exercised us at scale; we learned little from work on beam data
- Computing resources appear well usable and not saturated; the effort on "analysis ops metrics" is particularly appreciated, and we start to have the overall vision
- Central placement of useful data at T2s is working well so far; we are ready to face more beam data
- We have listed our main concerns and desires:
  - Cut the scope of CRAB support to reduce the daily load
  - Reduce CRAB complexity and coupling to all possible tools
  - Focus on stageout issues (50% of all failures)
  - Push large single workflows to DataOps
  - Add more access options to the CAF to better integrate with the grid
  - Focus the CAF CRAB server on critical workload
  - More care and coordination in transitions, with operations a stakeholder in next-release planning