BaBar Status Report
Chris Brew
GridPP16, QMUL
28/06/2006
Outline
• 3 BaBar Grid Projects:
– Monte Carlo (Simulation) Production
– Skimming
– User Analysis
   • easyGrid
   • bbrbsub
• Overall experience with the Grid
• Conclusion
Usual Guff
• BaBar is a running experiment, situated at SLAC near San Francisco
• e+e- collider tuned to investigate CP violation in B physics
• Started taking data in 1999/2000; currently has 350 fb⁻¹ of data
• Projected to have 1000 fb⁻¹ by the end of 2008
Data Flow
[Diagram: data flow between Tier 0 (SLAC), Tier 1 (RAL), large Tier 2s and Tier 2s. Labelled activities: Simulation Production, Skimming, Analysis, Merging.]
Simulation Production
• Running at M/Cr, RAL, RALPP and B'ham
– Tests at Lancs, Oxford + others
– Still working to add other BaBar sites
– Limited by the need to install the Objy DB at each site
• Stable running: 500,000,000 events produced, 12% of the worldwide total
• New R-GMA-based job monitor: status query down from 45 minutes to 5 minutes
• Recent hiatus due to bugs found in the BaBar simulation code, which caused a global halt. Production has recently restarted
C. Brew, G. Castelli
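As a quick check of what the quoted figures imply, the short sketch below derives the worldwide event total and the monitoring speed-up from the numbers on this slide. The variable names are illustrative, and the result is only as precise as the rounded 12% share.

```python
# Figures quoted on the slide.
gridpp_events = 500_000_000   # events produced on GridPP
gridpp_share = 0.12           # fraction of the worldwide total (rounded on the slide)

# Worldwide total implied by the two numbers above.
worldwide_events = gridpp_events / gridpp_share

# Job-monitor status query: 45 minutes before, 5 minutes with the R-GMA based monitor.
speedup = 45 / 5

print(f"Implied worldwide total: {worldwide_events:.2e} events")
print(f"Status query speed-up: {speedup:.0f}x")
```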
Cumulative Simulation Production on GridPP
[Chart: cumulative events produced (y-axis: 1000's of events, 0 to 600,000) against week beginning (x-axis: 10-Jul-05 to 11-Jun-06). Series: gridpp.rl.ac.uk, pp.rl.ac.uk, ph.bham.ac.uk, tier2.hep.manchester.ac.uk, Total.]
Skimming
• New Grid project: process real and simulated data to select ~200 subsamples, defined by the BaBar physics analysis working groups.
– Much quicker to run over a skim than over the full data sample
– Skimming includes physics analysis code and saves the results, so CPU time spent in skimming is regained many times over
• Plan is to run at one or more large T2s. If we can get this into production we should be able to recover some of the UK's Common Fund rebate that we have lost due to lack of T1 resources
• GridPP has funded three months of effort from Will Roethel to further this work
G. Castelli, W. Roethel, C. Brew
Status of Skimming
• Prepare code to be installed on grid: Done
• Modify BaBar framework to read data out of dCache and RFIO: Working, starting load and stability testing
• Develop tools for copying and managing data on Storage Elements: Under development (PhEDEx?)
• Integration with BaBar Task Management software:
– Task DB Creation: Done
– Task List Creation: Works
– Job Creation: Works
– Local Job Submission: Works
– Grid Job Submission: Works
– Job Monitoring: In progress, should be able to reuse code from SP Tools
– Job Recovery
– Job Output Checking: In progress
– Data Merging: Not started
User Analysis (easyGrid)
• Prototype running on the Manchester testbed (80 CPUs) since Nov 2005 without problems. Real analysis with real data by real users who know nothing about the Grid.
• No errors in easyGrid job submission.
• No errors in the Grid testbed, owing to installation configuration and improvements.
J. Werner
• Many problems encountered moving from the testbed to production Grid resources
– errors in RB, CE, etc.: 10% of the time, with a submission rate of less than 4 jobs/second.
– errors in BDII, SE, dCache. SE fails 40% of jobs (with fewer than 100 jobs in parallel).
– when the SE works, performance is terrible (approx. 8 times longer to run the same software).
– lack of response to problems from site admins.
• Serious issue for a typical user analysis, which is about 2000 jobs of 8 CPU hours each
• Product development will be resumed when resources are available and reliable. Meanwhile, the easyGrid prototype and M/Cr testbed will continue to serve users
For more information: http://www.hep.man.ac.uk/u/jamwer/
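To put the quoted failure rate and slowdown in context, here is a rough back-of-envelope sketch for the typical analysis described above (2000 jobs of 8 CPU hours). It rests on assumptions that are not from the slides: failures are independent, every failed job is resubmitted until it succeeds, and the 8x figure is applied directly to the nominal job time. The function and variable names are illustrative only.

```python
def expected_attempts(n_jobs: int, failure_rate: float) -> float:
    """Expected number of submissions needed for n_jobs to all succeed,
    assuming independent failures and resubmission until success."""
    return n_jobs / (1.0 - failure_rate)

n_jobs = 2000            # typical user analysis (from the slide)
hours_per_job = 8        # CPU hours per job (from the slide)
se_failure_rate = 0.40   # fraction of jobs failed by the SE (from the slide)
slowdown = 8             # approximate running-time factor when the SE works

nominal_hours = n_jobs * hours_per_job                 # workload with no Grid overhead
attempts = expected_attempts(n_jobs, se_failure_rate)  # submissions including retries
slowed_hours = nominal_hours * slowdown                # job-hours at the observed slowdown

print(f"Nominal workload: {nominal_hours} hours")
print(f"Expected submissions: {attempts:.0f}")
print(f"Job-hours with 8x slowdown: {slowed_hours}")
```

Under these assumptions a 16,000-hour analysis needs roughly two-thirds more submissions than jobs, before the slowdown is counted.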
User Analysis (bbrbsub)
• Integration of Simple Job Manager + bbrbsub with Grid submission
• Take the tools already used by analysis users to submit jobs at RAL
• Transparently add RAL -> RAL grid submission
• Add RAL -> M/Cr and M/Cr -> RAL submission capabilities
• Add RAL -> RALPP and M/Cr -> RALPP
• Gradually build up full grid functionality
– Application transport and configuration
– Automatic output recovery
– Job to data matching
G. Castelli
Overall Grid Experience
• Grid is still not reliable (worst test run):
RAL to RAL Successful Job Rate
Grid: <50%
PBS: >99%
• SP running seems to indicate that the Grid isn't getting more reliable and may be getting less so; long-term efficiency is stuck around 80%:
– RB problems (we have the capability to use multiple RBs, but efficiency drops because of the lack of failover)
– Central LFC problems
– BDII problems: sites drop in and out of the BDII
– SE problems: files randomly fail to upload/download
• Could run for 1-2 weeks at a time with minimal intervention; now seems to need daily (or more frequent) interventions
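The missing Resource Broker failover mentioned above could look roughly like the sketch below: try each broker in turn and move on when one fails. This is a generic illustration, not the SP tools' actual code. The `submit` callable, the broker host names and the job name are all hypothetical stand-ins for whatever submission command a site really uses.

```python
from typing import Callable, List, Sequence


class SubmissionError(Exception):
    """Raised when a job cannot be submitted through any broker."""


def submit_with_failover(
    job: str,
    brokers: Sequence[str],
    submit: Callable[[str, str], str],
) -> str:
    """Try each broker in order; return the job ID from the first that accepts the job."""
    errors: List[str] = []
    for broker in brokers:
        try:
            return submit(broker, job)
        except Exception as exc:  # any broker-side failure triggers failover
            errors.append(f"{broker}: {exc}")
    raise SubmissionError("; ".join(errors))


# Stand-in submit function (hypothetical): the first broker is down.
def fake_submit(broker: str, job: str) -> str:
    if broker == "rb1.example.org":
        raise ConnectionError("broker unavailable")
    return f"{job}@{broker}"


job_id = submit_with_failover(
    "sp-job-001", ["rb1.example.org", "rb2.example.org"], fake_submit
)
print(job_id)
```

Collecting the per-broker errors before raising keeps the diagnostic information that would otherwise be lost when every broker fails.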
Conclusions
• BaBar has made good progress on moving its three main offline compute-intensive processes to the Grid
• Monte Carlo generation is in production; significant progress has been made in skimming and user analysis
• There are many things we like about the Grid
• We are adapting the BaBar software framework to integrate better with the Grid: the dependence on Objectivity will be removed and we are adding the ability to read data directly from Storage Elements
• However, reliability and ease of use are still big issues