Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

23
Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04

Transcript of Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Page 1: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Results of the LHCb experiment Data Challenge 2004

Joël Closier

CERN / LHCb

CHEP’ 04

Page 2: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 2

The LHCb DC04 team

Dirac– Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees

Production management– Joel Closier, Ricardo Graciani (LCG), Johan Blouw, Andrew

Pickford … and the LHCb site managers

LHCb Bookkeeping, Monitoring & accounting– Markus Frank, Carmine Cioffi, Manuel Sanchez, Ruben Vizcaya

LCG-LHCb liaison– Flavia Donno, Roberto Santinelli

The LCG-GDA team– Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz,

David Smith, Zdenek Sekera, Marco Serra…

Page 3: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 3

Outline

Aims of the LHCb Data Challenge 2004 Production model Performances of DC’04 Lessons from DC’04 Conclusions

Page 4: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 4

LHCb DC’04 aims

Main goal :gather information to be used for writing the LHCb computing Technical Design Report– Robustness test of the LHCb software and production system

• Using software as realistic as possible in terms of performance

– Test of the LHCb distributed computing model• Including distributed analyses• Realistic test of analysis environment, need realistic analyses

– Incorporation of the LCG application area software into the LHCb production environment

– Use of LCG resources (at least 50% of the production capacity)– 3 phases

• Production : MC simulation and reconstruction• Stripping : Event pre-selection• Analysis

Page 5: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 5

LHCb DC04 aims (cont’d)

Physics goals– HLT studies, consolidating efficiencies– Background/Signal studies, consolidate background estimates +

background properties

Requires quantitative increase in number of signal and background events compared to DC03:– 30 106 signal events– 15 106 specific background– 125 106 background (B inclusive + minimum bias, ratio 1:1.8)

Page 6: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 6

Production

Production done with DIRAC system – Track 4 - Distributed Computing Services : id 377

DIRAC is deployed to each site participating to DC’04 Central Services supporting the Data Challenge

– Production database– Workload Management System– Monitoring, Accounting– Bookkeeping, ALIEN File Catalog

Technologies used by the production services– C++, python, XML-RPC– ORACLE and mysql databases

Page 7: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 7

LHCb job

Non LCG site1. DIRAC deployment (CE).2. DIRAC JobAgent:

– Check CE status.– Request a DIRAC task (jdl).– Install LHCb software if

needed– Submit to Local Batch

System the job.– Execute task:– Check Steps.– Upload results

3. DIRAC TransferAgent.

LCG site1. Input SandBox:

– Small bash script (~50 lines).1. Check environment:

• Site, hostname, CPU, Memory, Disk Space…

2. Install DIRAC:• Download DIRAC tarball (~1 MB).• Deploy DIRAC on WN.

3. Execute the job: A. Request a DIRAC task (LHCb

Simulation job) B. Execute task:C. Check StepsD. Upload results:

2. Retrieval of SandBox3. Analysis of Retrieved Output

SandBox

Page 8: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 8

Strategy Test sites:

– Each site is tested with special and production-like jobs. Enable site :

– DIRAC Workload Management System. Always keep jobs in the queues

DIRAC Run Local Agent continuously:

– Via cron jobs– Via runsv– Via daemon

LCG Submit jobs continuously:

– Via cron job on User Interface

PS: LCG is considered as a site for DIRAC point of view

Page 9: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 9

Data Storage

All the output of the reconstructed phase (DST) are send to CERN (as Tier0)

All the intermediate files are not kept. DSTs are also stored in one of our 5 TIER1

– CNAF (Italy)– Karlsruhe (Germany)– Lyon (France)– PIC (Spain)– RAL (United Kingdom)

Page 10: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 10

DC’04 performances

Page 11: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 11

Phase 1 results

DIRAC alone

LCG inaction

1.8 106/day

LCG paused

Phase 1 Completed

3-5 106/day

LCG restarted

186 M Produced Events

Page 12: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 12

Daily performance

5 million/day

Page 13: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 13

Sites involved

43 LCG Sites (8 also DIRAC sites)

20 DIRAC Sites

Used resources from non-LHCb countries

e.g. Hungary produced ~2M

events

Page 14: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 14

Simultaneous jobs (a snapshot)

Page 15: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 15

TIER storage

Tier 1 Nb of Events Size (TB)

CNAF 37 129 350 12.6

RAL 19 462 850 6.5

PIC 16 505 010 5.4

Karlsruhe 12 486 300 4

Lyon 4 368 656 1.5

TIER 0 Nb of Events Size (TB)

CERN 187 557 231 62

Page 16: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 16

DIRAC-LCG : events share

LHCb DC'04

0

50

100

150

200

Total may june july august

Month

Eve

nts

(M

)

LCG

DIRAC

50% of events were produced using LCG

Page 17: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 17

DIRAC – LCG : CPU share

May: 88%:12%

11% of DC’04

Jun: 78%:22%

25% of DC’04

Jul: 75%:25%

22% of DC’04

Aug: 26%:74%

42% of DC’04

376 CPU · Years

Page 18: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 18

211k Submitted Jobs to LCG

After Running:

Jobs(k) %Sub %RemainSubmitted 211 100.0%Cancelled 26 12.2%Remaining 185 87.8% 100.0%Aborted (not Run) 37 17.6% 20.1%Running 148 70.0% 79.7%Aborted (Run) 34 16.2% 18.5%Done 113 53.8% 61.2%

Retrieved 113 53.8% 61.2%

LCG Efficiency: 61 %

LCG performance

113 k Done (Successful) 34 k Aborted

Page 19: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 19

DC’04 lessons

Page 20: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 20

Lessons learnt: DIRAC

The concept of the light, customizable and simple to deploy agents proved to be very effective

Easy update procedure - propagate bug fixes quickly of DIRAC tools

Applications software installation triggered by a running job

Most of the central services were running on the same machine– Too many processes, high loads

Improve Server Availability

Improve Error Handling and Reporting.

Page 21: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 21

Lessons learnt: LCG Improve OutputSandBox Upload | Retrieval mechanism:

– Should also be available for Failed and Aborted Jobs. Improve reliability of CE status collection methods (timestamps?). Add intelligence on CE or RB to detect and avoid large number of

aborted jobs on start-up:– Avoid miss-configured site to become a black-hole.

Need to collect LCG-log info and tool to navigate them (including different JobIDs).

Need a way to limit the CPU (and Wall-clock time):– LCG Wrapper must issue appropriated signals to User Job to allow graceful

termination. How to manuals:

– Clear instruction to Site Managers on the procedure to shutdown a site (for maintenance and/or upgrade).

– Problems with site configurations (LCG config, firewalls, gridFTP servers..)

Page 22: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 22

LHCb DC’04 Phase 1 is over. The Production Target was achieved:

– 186 M Events in 424 CPU years.– ~ 50% on LCG Resources (75-80% at the last weeks).

LHCb Strategy successful: – Submitting “empty” DIRAC Agents to LCG has proven to be very flexible

allowing a success rate above LCG alone. Big room for improvements, both on DIRAC and LCG

– DIRAC needs to improve in the reliability of the Servers:• big step already during DC.

– LCG needs improvement on the single job efficiency:• ~40% aborted jobs, ~10% did the work but failed from LCG viewpoint.

– In both cases extra protections against external failures (network, unexpected shutdowns…) must be built in.

Success due to dedicated support from LCG team and DIRAC Site Managers

Conclusions

Page 23: Results of the LHCb experiment Data Challenge 2004 Joël Closier CERN / LHCb CHEP’ 04.

Result of LHCb DC04 23

Other links CHEP04 talks:

– File-Metadata Management System for the LHCb Experiment • (Track 4 - Distributed Computing Services) id 392• 27-Sep-2004 17:30

– DIRAC Workload Management System • (Track 5 - Distributed Computing Systems and Experiences) id 365• 29-Sep-2004 10:00

– Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC

• (Track 4 - Distributed Computing Services) id 368• 29-Sep-2004 10:00

– DIRAC - The Distributed MC Production and Analysis for LHCb • (Track 4 - Distributed Computing Services) id 377• 30-Sep-2004 18:10

– A Lightweight Monitoring and Accounting System for LHCb DC04 Production

• (Track 4 - Distributed Computing Services) id388• 30-Sep-2004 17:30