Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.
Results of the LHCb experiment Data Challenge 2004
Joël Closier
CERN / LHCb
CHEP’ 04
Result of LHCb DC04 2
The LHCb DC04 team
DIRAC
– Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees
Production management
– Joel Closier, Ricardo Graciani (LCG), Johan Blouw, Andrew Pickford … and the LHCb site managers
LHCb Bookkeeping, Monitoring & Accounting
– Markus Frank, Carmine Cioffi, Manuel Sanchez, Ruben Vizcaya
LCG-LHCb liaison
– Flavia Donno, Roberto Santinelli
The LCG-GDA team
– Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz, David Smith, Zdenek Sekera, Marco Serra…
Outline
Aims of the LHCb Data Challenge 2004
Production model
Performance of DC’04
Lessons from DC’04
Conclusions
LHCb DC’04 aims
Main goal: gather information to be used for writing the LHCb computing Technical Design Report
– Robustness test of the LHCb software and production system
• Using software as realistic as possible in terms of performance
– Test of the LHCb distributed computing model
• Including distributed analyses
• Realistic test of the analysis environment, which needs realistic analyses
– Incorporation of the LCG application area software into the LHCb production environment
– Use of LCG resources (at least 50% of the production capacity)
– 3 phases
• Production: MC simulation and reconstruction
• Stripping: event pre-selection
• Analysis
LHCb DC04 aims (cont’d)
Physics goals
– HLT studies, consolidating efficiencies
– Background/signal studies, consolidating background estimates and background properties
Requires a quantitative increase in the number of signal and background events compared to DC03:
– 30 × 10^6 signal events
– 15 × 10^6 specific background events
– 125 × 10^6 background events (B inclusive + minimum bias, ratio 1:1.8)
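As a rough check of the sample sizes above, the sketch below adds up the requested events and splits the 125 × 10^6 background sample by the quoted 1:1.8 ratio. It assumes the ratio is read as (B inclusive):(minimum bias), which the slide does not state explicitly; the function and variable names are illustrative only.

```python
# Requested DC'04 sample sizes in millions of events (figures from the slide).
SIGNAL_M = 30.0
SPECIFIC_BACKGROUND_M = 15.0
INCLUSIVE_BACKGROUND_M = 125.0


def split_by_ratio(total, first, second):
    """Split `total` into two parts in the proportion first:second."""
    whole = first + second
    return total * first / whole, total * second / whole


total_requested_m = SIGNAL_M + SPECIFIC_BACKGROUND_M + INCLUSIVE_BACKGROUND_M

# Assumption: the 1:1.8 ratio means (B inclusive):(minimum bias).
b_inclusive_m, minimum_bias_m = split_by_ratio(INCLUSIVE_BACKGROUND_M, 1.0, 1.8)

print(f"total requested: {total_requested_m:.0f} M events")
print(f"B inclusive: {b_inclusive_m:.1f} M, minimum bias: {minimum_bias_m:.1f} M")
```

Under that reading, the request totals 170 M events, of which roughly 45 M are B inclusive and 80 M minimum bias.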
Production
Production done with the DIRAC system
– Track 4 - Distributed Computing Services : id 377
DIRAC is deployed at each site participating in DC’04
Central services supporting the Data Challenge
– Production database
– Workload Management System
– Monitoring, Accounting
– Bookkeeping, AliEn File Catalog
Technologies used by the production services
– C++, Python, XML-RPC
– Oracle and MySQL databases
LHCb job
Non-LCG site
1. DIRAC deployment (CE).
2. DIRAC JobAgent:
– Check CE status.
– Request a DIRAC task (JDL).
– Install LHCb software if needed.
– Submit the job to the local batch system.
– Execute task.
– Check steps.
– Upload results.
3. DIRAC TransferAgent.

LCG site
1. Input SandBox:
– Small bash script (~50 lines).
1. Check environment:
• Site, hostname, CPU, memory, disk space…
2. Install DIRAC:
• Download DIRAC tarball (~1 MB).
• Deploy DIRAC on WN.
3. Execute the job:
A. Request a DIRAC task (LHCb simulation job)
B. Execute task
C. Check steps
D. Upload results
2. Retrieval of SandBox
3. Analysis of retrieved Output SandBox
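The agent steps listed above follow a pull model: the agent asks for work only when its resource is free. Below is a minimal sketch of that loop. It is not DIRAC code; `TaskQueue`, `run_agent`, the task dictionary layout and the "sim-v1" label are all hypothetical stand-ins chosen for illustration.

```python
from collections import deque


class TaskQueue:
    """Hypothetical stand-in for the central service that hands out tasks."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)

    def request_task(self):
        # Return the next waiting task, or None when nothing is queued.
        return self._tasks.popleft() if self._tasks else None


def run_agent(queue, resource_is_free, installed, execute):
    """Pull tasks while the resource is free; return (task id, outcome) pairs."""
    outcomes = []
    while resource_is_free():                    # check CE status
        task = queue.request_task()              # request a task
        if task is None:
            break                                # nothing left to do
        if task["software"] not in installed:    # install software if needed
            installed.add(task["software"])
        steps_ok = execute(task)                 # execute the task, check its steps
        outcomes.append((task["id"], "uploaded" if steps_ok else "failed"))
    return outcomes


demo_queue = TaskQueue([
    {"id": 1, "software": "sim-v1"},
    {"id": 2, "software": "sim-v1"},
])
demo_installed = set()
demo_outcomes = run_agent(
    demo_queue,
    resource_is_free=lambda: True,
    installed=demo_installed,
    execute=lambda task: task["id"] != 2,        # pretend task 2 fails its checks
)
print(demo_outcomes)
```

Because the agent pulls work, the central queue never needs to know in advance which worker will run a given task, which is what allows the same agent to run on a non-LCG site or inside an LCG job.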
Strategy
Test sites:
– Each site is tested with special and production-like jobs.
Enable site:
– DIRAC Workload Management System.
Always keep jobs in the queues
DIRAC: run the local agent continuously:
– Via cron jobs
– Via runsv
– Via daemon
LCG: submit jobs continuously:
– Via cron job on the User Interface
PS: LCG is considered as a site from the DIRAC point of view
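One way to read "always keep jobs in the queues" is as a periodic top-up: each cron tick submits just enough jobs to restore a target queue depth. The sketch below shows that arithmetic only. The target of 20 and the function name are assumptions for illustration, not values from the talk.

```python
def jobs_to_submit(target_queued, currently_queued):
    """Number of new jobs needed to bring the queue back up to the target."""
    return max(0, target_queued - currently_queued)


# Assumed target depth of 20 queued jobs, checked at three queue states.
for queued in (0, 12, 25):
    print(queued, "queued ->", jobs_to_submit(20, queued), "to submit")
```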
Data Storage
All the output of the reconstruction phase (DST) is sent to CERN (as Tier 0)
Intermediate files are not kept.
DSTs are also stored in one of our 5 Tier 1 centres
– CNAF (Italy)
– Karlsruhe (Germany)
– Lyon (France)
– PIC (Spain)
– RAL (United Kingdom)
DC’04 performances
Phase 1 results
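[Plot of event production over time; surviving annotations, in slide order:]
– DIRAC alone
– LCG in action
– 1.8 × 10^6/day
– LCG paused
– Phase 1 completed
– 3-5 × 10^6/day
– LCG restarted
– 186 M events produced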
Daily performance
5 million/day
Sites involved
43 LCG Sites (8 also DIRAC sites)
20 DIRAC Sites
Used resources from non-LHCb countries, e.g. Hungary produced ~2M events
Simultaneous jobs (a snapshot)
TIER storage
Tier 1       Nb of events   Size (TB)
CNAF           37 129 350        12.6
RAL            19 462 850         6.5
PIC            16 505 010         5.4
Karlsruhe      12 486 300         4
Lyon            4 368 656         1.5

Tier 0       Nb of events   Size (TB)
CERN          187 557 231        62
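The storage table lends itself to two quick checks: what fraction of the CERN event count is also held at the Tier 1 centres, and the implied size per event. The sketch below uses the table values as given and assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify.

```python
# (events, size in TB) per Tier 1 centre, copied from the table.
TIER1 = {
    "CNAF": (37_129_350, 12.6),
    "RAL": (19_462_850, 6.5),
    "PIC": (16_505_010, 5.4),
    "Karlsruhe": (12_486_300, 4.0),
    "Lyon": (4_368_656, 1.5),
}
CERN_EVENTS, CERN_TB = 187_557_231, 62.0


def kb_per_event(events, size_tb):
    # Assumption: 1 TB = 10**12 bytes and 1 kB = 10**3 bytes.
    return size_tb * 1e12 / events / 1e3


tier1_events = sum(events for events, _ in TIER1.values())
tier1_tb = sum(size for _, size in TIER1.values())
tier1_fraction = tier1_events / CERN_EVENTS

print(f"Tier 1 total: {tier1_events} events, {tier1_tb:.1f} TB")
print(f"Tier 1 / CERN event count: {tier1_fraction:.1%}")
print(f"CERN: {kb_per_event(CERN_EVENTS, CERN_TB):.0f} kB/event")
```

With these figures the five Tier 1 centres together hold about 90 M events in 30 TB, a little under half of the CERN event count, at roughly 330 kB per event.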
DIRAC-LCG : events share
[Bar chart "LHCb DC'04": events (M), axis from 0 to 200, for Total, May, June, July and August, split into LCG and DIRAC]
50% of events were produced using LCG
DIRAC – LCG : CPU share
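Monthly CPU share (DIRAC : LCG) and each month's fraction of the DC’04 total:
– May: 88% : 12% (11% of DC’04)
– Jun: 78% : 22% (25% of DC’04)
– Jul: 75% : 25% (22% of DC’04)
– Aug: 26% : 74% (42% of DC’04)
Total: 376 CPU·years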
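The monthly shares can be combined into an overall DIRAC : LCG split by weighting each month with its fraction of the DC’04 total. The sketch below does that, assuming the pairs are ordered DIRAC first and LCG second, as the slide title suggests.

```python
# month: (DIRAC share %, LCG share %, month's % of total DC'04 CPU)
MONTHS = {
    "May": (88, 12, 11),
    "Jun": (78, 22, 25),
    "Jul": (75, 25, 22),
    "Aug": (26, 74, 42),
}
TOTAL_CPU_YEARS = 376


def overall_share(months):
    """Weight each month's share by that month's fraction of the total."""
    dirac = sum(d * weight for d, _, weight in months.values()) / 100
    lcg = sum(l * weight for _, l, weight in months.values()) / 100
    return dirac, lcg


dirac_pct, lcg_pct = overall_share(MONTHS)
lcg_cpu_years = TOTAL_CPU_YEARS * lcg_pct / 100

print(f"DIRAC {dirac_pct:.1f}% : LCG {lcg_pct:.1f}%")
print(f"LCG CPU: about {lcg_cpu_years:.0f} CPU-years")
```

This gives roughly 57% : 43% over the whole period, so the LCG share of CPU comes out somewhat below its share of events.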
LCG performance
211 k jobs submitted to LCG

                     Jobs (k)    % Sub   % Remain
Submitted                 211   100.0%
Cancelled                  26    12.2%
Remaining                 185    87.8%     100.0%
Aborted (not Run)          37    17.6%      20.1%
Running                   148    70.0%      79.7%
After running:
Aborted (Run)              34    16.2%      18.5%
Done                      113    53.8%      61.2%
Retrieved                 113    53.8%      61.2%

113 k Done (successful), 34 k Aborted
LCG efficiency: 61%
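The quoted efficiency is the number of successful jobs divided by the jobs remaining after cancellation. The sketch below recomputes it from the rounded counts in thousands, so the last digit can differ slightly from the slide's percentages, which were presumably derived from exact job counts.

```python
# Job counts in thousands, as quoted on the slide.
JOBS_K = {
    "submitted": 211,
    "cancelled": 26,
    "aborted_not_run": 37,
    "aborted_run": 34,
    "done": 113,
}


def percent(part, whole):
    return 100.0 * part / whole


remaining = JOBS_K["submitted"] - JOBS_K["cancelled"]   # jobs not cancelled
running = remaining - JOBS_K["aborted_not_run"]         # jobs that started
efficiency = percent(JOBS_K["done"], remaining)

print(f"remaining: {remaining} k, running: {running} k")
print(f"LCG efficiency: {efficiency:.0f}%")
```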
DC’04 lessons
Lessons learnt: DIRAC
The concept of light, customizable and simple-to-deploy agents proved to be very effective
Easy update procedure: bug fixes to the DIRAC tools propagate quickly
Application software installation triggered by a running job
Most of the central services were running on the same machine
– Too many processes, high loads
Improve server availability
Improve error handling and reporting
Lessons learnt: LCG
Improve the OutputSandBox upload/retrieval mechanism:
– Should also be available for Failed and Aborted jobs.
Improve reliability of CE status collection methods (timestamps?).
Add intelligence on CE or RB to detect and avoid large numbers of aborted jobs on start-up:
– Prevent a misconfigured site from becoming a black hole.
Need to collect LCG log info, and a tool to navigate it (including different JobIDs).
Need a way to limit the CPU (and wall-clock) time:
– LCG wrapper must issue appropriate signals to the user job to allow graceful termination.
How-to manuals:
– Clear instructions to site managers on the procedure to shut down a site (for maintenance and/or upgrade).
– Problems with site configurations (LCG config, firewalls, gridFTP servers…)
Conclusions
LHCb DC’04 Phase 1 is over.
The production target was achieved:
– 186 M events in 424 CPU years.
– ~50% on LCG resources (75-80% in the last weeks).
LHCb strategy successful:
– Submitting “empty” DIRAC agents to LCG has proven to be very flexible, allowing a success rate above that of LCG alone.
Large room for improvement, both on DIRAC and LCG:
– DIRAC needs to improve the reliability of the servers:
• a big step was already made during the DC.
– LCG needs to improve the single-job efficiency:
• ~40% aborted jobs, ~10% did the work but failed from the LCG viewpoint.
– In both cases extra protection against external failures (network, unexpected shutdowns…) must be built in.
Success was due to dedicated support from the LCG team and the DIRAC site managers
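The headline figures imply an average CPU cost per event. The sketch below converts CPU years to seconds, assuming a 365.25-day year, and evaluates it for both CPU totals that appear in the talk (424 CPU years here, 376 CPU·years on the CPU share slide). It takes no position on which total is the right one to use.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year
EVENTS_PRODUCED = 186e6


def cpu_seconds_per_event(cpu_years, events):
    """Average CPU time per event, in seconds."""
    return cpu_years * SECONDS_PER_YEAR / events


for cpu_years in (424, 376):
    cost = cpu_seconds_per_event(cpu_years, EVENTS_PRODUCED)
    print(f"{cpu_years} CPU years -> {cost:.1f} CPU s/event")
```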
Other links
CHEP04 talks:
– File-Metadata Management System for the LHCb Experiment
• (Track 4 - Distributed Computing Services) id 392
• 27-Sep-2004 17:30
– DIRAC Workload Management System
• (Track 5 - Distributed Computing Systems and Experiences) id 365
• 29-Sep-2004 10:00
– Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC
• (Track 4 - Distributed Computing Services) id 368
• 29-Sep-2004 10:00
– DIRAC - The Distributed MC Production and Analysis for LHCb
• (Track 4 - Distributed Computing Services) id 377
• 30-Sep-2004 18:10
– A Lightweight Monitoring and Accounting System for LHCb DC04 Production
• (Track 4 - Distributed Computing Services) id 388
• 30-Sep-2004 17:30