LHCb Data Challenge 2004

A.Tsaregorodtsev, CPPM, Marseille

LCG-France Meeting, 22 July 2004, CERN

Goals of DC’04

Main goal: gather information to be used for writing the LHCb computing TDR/TP

Robustness test of the LHCb software and production system
  • Using software as realistic as possible in terms of performance

Test of the LHCb distributed computing model
  • Including distributed analyses
  • Realistic test of analysis environment, need realistic analyses

Incorporation of the LCG application area software into the LHCb production environment

Use of LCG resources as a substantial fraction of the production capacity

DC 2004 phases

Phase 1 – MC data production
  • 180M events of different signals, background, minimum bias
  • Simulation+reconstruction
  • DST’s are copied to Tier1 centres

Phase 2 – Data reprocessing
  • Selection of various physics streams from DST’s
  • Copy selections to all Tier1 centres

Phase 3 – User analysis
  • User analysis jobs on DST data distributed in all the Tier1 centres

Phase 1 MC production

DIRAC Services and Resources

[Diagram. Labels: DIRAC JobManagement; Agent (x3); CE 1, CE 2, CE 3; Production manager; GANGA UI; User CLI; Job monitor; BK query webpage; DIRAC Storage]

Software to be installed

Before an LHCb application can run on a Worker Node the following software components should be installed:
  • Application software itself;
  • Software packages on which the application depends;
  • Necessary databases (file based);
  • DIRAC software.

Single untar command to install in place

All the necessary libraries are included – no assumption is made about the availability of any software on the destination site (except a recent Python interpreter):
  • External libraries;
  • Compiler libraries.

Same binary distribution running on RH 7.1-9.0
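
The "single untar" installation can be illustrated with a short Python sketch. This is not the DIRAC installer: the function name `install_in_place` and the archive layout are assumptions made for the example, and the path check is an extra precaution that the slides do not mention.

```python
import tarfile
from pathlib import Path


def install_in_place(tarball, destination):
    """Unpack a binary distribution tarball into `destination`.

    Hypothetical helper for illustration only.
    Returns the list of member names found in the archive.
    """
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(tarball, "r:*") as archive:
        names = archive.getnames()
        # Refuse members that would be written outside the destination.
        for name in names:
            target = (root / name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {name}")
        archive.extractall(root)
    return names
```

Because the distribution carries its own external and compiler libraries, a single call such as `install_in_place("dist.tar.gz", "/path/to/site/software")` is all that the destination needs besides Python (both paths here are placeholders).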

Software installation

Software repository:
  • Web server (http protocol)
  • LCG Storage Element

Installation in place DIRAC way:
  • By Agent upon reception of a job with particular software requirements; OR
  • By a running job itself.

Installation in place LCG2 way:
  • Special kind of a job running standard DIRAC software installation utility

Software installation in the job

A job may need extra SW packages not in place on CE:
  • Special version of geometry;
  • User analysis algorithms.

Any number of packages can be installed in the job itself (up to all of them)

Packages are installed in the job user space
  • Imitate the structure of the LHCb standard SW directory tree with symbolic links
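
One way to picture the symbolic-link approach is the sketch below. It assumes a flat layout with one directory per package under the standard software root; the function name, its arguments and that layout are all hypothetical, and the real LHCb tree is more elaborate.

```python
import os
from pathlib import Path


def build_job_software_tree(standard_root, job_root, local_packages):
    """Mirror `standard_root` inside `job_root` using symbolic links.

    `local_packages` maps a package name to a directory unpacked in the
    job's own space; those packages override the site installation.
    All names are illustrative assumptions.
    """
    job = Path(job_root)
    job.mkdir(parents=True, exist_ok=True)
    # Link every package already installed on the site...
    for entry in Path(standard_root).iterdir():
        if entry.name not in local_packages:
            os.symlink(entry.resolve(), job / entry.name)
    # ...and point the overridden names at the job-local copies.
    for name, path in local_packages.items():
        os.symlink(Path(path).resolve(), job / name)
```

The application then sees one directory tree with the usual structure, whether a given package came from the site installation or was installed by the job itself.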

3rd party components

Originally DIRAC aimed at producing the following components:
  • Production database;
  • Metadata and job provenance database;
  • Workload management.

Expected 3rd party components:
  • Data management (FileCatalogue, replica management)
  • Security services
  • Information and Monitoring Services

Expectations for early delivery of the ARDA prototype components were not met

File catalog service

The LHCb Bookkeeping was not meant to be used as a File (Replica) Catalog
  • Main use as Metadata and Job Provenance database
  • Replica catalog based on specially built views

AliEn File Catalog was chosen to get a (full) set of the necessary functionality:
  • Hierarchical structure:
    • Logical organization of data – optimized queries;
    • ACL by directory;
    • Metadata by directory;
    • File system paradigm;
  • Robust, proven implementation
  • Easy to wrap as an independent service:
    • Inspired by the ARDA RTAG work

AliEn FileCatalog in DIRAC

AliEn FC SOAP interface was not ready in the beginning of 2004
  • Had to provide our own XML-RPC wrapper
    • Compatible with XML-RPC BK File Catalog
  • Using AliEn command line “alien –exec”
    • Ugly, but works

Building service on top of AliEn which is run by the lhcbprod AliEn user
  • Not really using the AliEn security mechanisms

Using AliEn version 1.32

So far in DC2004:
  • >100’000 files with >250’000 replicas
  • Very stable performance
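
The idea of wrapping a command-line client in an XML-RPC service can be sketched with the Python standard library. This is only an illustration of the pattern, not the DIRAC wrapper: the class `CommandLineCatalog`, its `execute` method and `make_server` are hypothetical names, and the real AliEn command syntax is deliberately not reproduced (the caller supplies the base command).

```python
import subprocess
import sys
from xmlrpc.server import SimpleXMLRPCServer


class CommandLineCatalog:
    """Expose a command-line catalog client through XML-RPC.

    `base_command` is the client invocation prefixed to every request;
    it is supplied by the caller and is an assumption of this sketch.
    """

    def __init__(self, base_command):
        self.base_command = list(base_command)

    def execute(self, *arguments):
        """Run the wrapped command; return its exit status and output."""
        completed = subprocess.run(
            self.base_command + [str(a) for a in arguments],
            capture_output=True,
            text=True,
        )
        return {"status": completed.returncode,
                "output": completed.stdout.strip()}


def make_server(catalog, host="localhost", port=8000):
    """Build (but do not start) an XML-RPC server around the catalog."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(catalog)
    return server
```

A client would then call `xmlrpc.client.ServerProxy("http://localhost:8000").execute(...)`. Spawning a process per request is the "ugly, but works" part: simple to build, at the cost of per-call overhead and of running everything under one service account.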

File catalogs

[Diagram. Labels: AliEn FC; AliEn UI; AliEn FC Client; BK FC Client; FC Client; DIRAC; AliEn FileCatalog Service; BK FileCatalog Service; FileCatalog Client]

Data Production – 2004

Currently distributed data sets
  • CERN:
    • Complete DST (copied directly from production centres)
  • Tier1:
    • Master copy of DST produced at associated sites

DIRAC sites:
  • Bologna, Karlsruhe, Spain (PIC), Lyon, UK sites (RAL), all otherwise CERN

LCG sites:
  • Currently only 3 Grid (MSS) SE sites - CASTOR
  • Bologna, PIC, CERN
    • Bologna: ru, pl, hu, cz, gr, it
    • PIC: us, ca, es, pt, tw
    • CERN: elsewhere
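
The country-to-site association for LCG sites amounts to a small lookup table with a default. The sketch below assumes that the two-letter codes identify the country of the producing site and that the listed site receives its output; the slide does not spell this out, and the names `SITE_COUNTRIES` and `storage_site_for` are hypothetical.

```python
# Country codes associated with each LCG storage site, as listed above.
SITE_COUNTRIES = {
    "Bologna": ["ru", "pl", "hu", "cz", "gr", "it"],
    "PIC": ["us", "ca", "es", "pt", "tw"],
}
DEFAULT_SITE = "CERN"  # "elsewhere"


def storage_site_for(country_code):
    """Return the storage site associated with a two-letter country code."""
    code = country_code.lower()
    for site, countries in SITE_COUNTRIES.items():
        if code in countries:
            return site
    return DEFAULT_SITE
```

For example, `storage_site_for("it")` gives `"Bologna"`, while any code that is not listed falls back to `"CERN"`.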

DIRAC DataManagement tools

DIRAC Storage Element:
  • IS description + server (bbftpd, sftpd, httpd, gridftpd, xmlrpcd, file, rfio, etc)
  • Need no special service installation on the site

Description in the Information Service:
  • Host, Protocol, Local path

ReplicaManager API for common operations:
  • copy(), copyDir(), get(), exists(), size(), mkdir(), etc

Examples of usage:
  dirac-rm-copyAndRegister <lfn> <fname> <size> <SE> <guid>
  dirac-rm-copy dc2004.dst CERN_Castor_BBFTP

Tier0SE and Tier1SE’s are defined in the central IS

Reliable Data Transfer

Any data transfer should be accomplished despite temporary failures of various services or networks:
  • Multiple retries of failed transfers with any necessary delay:
    • Until services are up and running;
    • Not applicable for LCG jobs.
  • Multiple retries of registration in the Catalog.

Transfer Agent:
  • Maintains a database of Transfer requests;
  • Transfers datasets or whole directories with log files;
  • Retries transfers until success
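
The retry-until-success behaviour of a transfer agent can be sketched as a queue of pending requests that are put back whenever an attempt fails. This is a minimal illustration, not the DIRAC Transfer Agent: the class and method names are hypothetical, an in-memory queue stands in for the request database, and the `transfer` callable is supplied by the caller.

```python
import time
from collections import deque


class TransferAgent:
    """Keep transfer requests in a queue and retry them until they succeed.

    `transfer` is any callable(source, destination) that raises on failure.
    A deque stands in for the persistent database of transfer requests.
    """

    def __init__(self, transfer, retry_delay=60.0, sleep=time.sleep):
        self.transfer = transfer
        self.retry_delay = retry_delay
        self.sleep = sleep
        self.requests = deque()

    def add_request(self, source, destination):
        self.requests.append((source, destination))

    def run(self):
        """Process all pending requests; return the number of attempts."""
        attempts = 0
        while self.requests:
            source, destination = self.requests.popleft()
            attempts += 1
            try:
                self.transfer(source, destination)
            except Exception:
                # Temporary failure: requeue and wait before the next try.
                self.requests.append((source, destination))
                self.sleep(self.retry_delay)
        return attempts
```

Decoupling the request from the job that produced the data is what makes this work: the job can finish while the agent keeps retrying until the destination service is back.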

DIRAC DataManagement tools

[Diagram. Labels: Transfer DB; Job; Data Manager; Data Optimizer; SE 1; SE 2; cache]


DIRAC DC2004 performance

In May-July:
  • Simulation+Reconstruction
  • >80000 jobs
  • ~75M events
  • ~25TB of data
    • Stored at CERN, PIC, Lyon, CNAF, RAL Tier1 centres
  • >150’000 files in the catalogs
  • ~2000 jobs running continuously
    • Up to 3000 in a peak

DC2004 at CC/IN2P3

The main DIRAC development site

The CC/IN2P3 contribution is very weak
  • Production runs stably and continuously;
  • Resources are very limited

HPSS performance is stable

Note on the BBFTP

Nice product
  • Stable, performant, complete, grid enabled

Light weight
  • Easy deployment of the statically linked executable

Good performance
  • Would be nice to have a parallelized load balancing server

Functionality not complete with respect to GRIDFTP:
  • Remote storage management (ls(), size(), remove())
  • Transfers between remote servers

LCG experience

Production jobs

Long jobs – 23 hours on average on a 2GHz PIV
  • Simulation+Digitization+Reconstruction steps
  • 5 to 10 steps in one job

No event input data

Output data – 1-2 output files of ~200MB
  • Stored to Tier1 and Tier0 SE

Log files copied to an SE at CERN

AliEn and Bookkeeping Catalogues are …


Using LCG resources

Different ways of scheduling jobs to LCG
  • Standard: jobs go via RB;
  • Direct: jobs go directly to CE;
  • Resource reservation

Using Reservation mode for the DC2004 production:
  • Deploying agents to WN as LCG jobs
  • DIRAC jobs are fetched by the agents in case the environment is OK
  • Agent steers the job execution including data transfers, update of the catalogs and bookkeeping.
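
The reservation mode described above (an agent lands on the worker node first and only then pulls real work) can be sketched as follows. This is an illustration under stated assumptions, not DIRAC code: the three callables stand in for DIRAC services, and free disk space is used as the only "environment is OK" criterion, which is a simplification chosen for the example.

```python
import shutil


def environment_ok(required_free_bytes, workdir="."):
    """Return True if the worker node has enough free disk space."""
    return shutil.disk_usage(workdir).free >= required_free_bytes


def run_agent(fetch_job, execute, report, required_free_bytes, workdir="."):
    """One agent cycle: check the node, then pull and run a single job.

    fetch_job() -> job or None, execute(job) -> result and
    report(job, result) are hypothetical callables.
    """
    if not environment_ok(required_free_bytes, workdir):
        return "environment not OK"  # never request a job on a bad node
    job = fetch_job()
    if job is None:
        return "no job available"
    result = execute(job)
    report(job, result)  # e.g. catalog and bookkeeping updates
    return "done"
```

The point of the design is that a job is taken from the DIRAC queue only after the slot is known to be usable, so a misconfigured node wastes an agent rather than a production job.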

Using LCG resources (2)

Using DIRAC DataManagement tools:
  • DIRAC SE + gridftp + sftp

Starting to populate RLS from DIRAC catalogues:
  • For evaluation
  • For use with ReplicaManager of LCG

Resource Broker I

No trivial use of tools for large number of jobs, i.e. production
  • Command re-authenticated for every job
  • Produces errors with list of jobs (e.g. retrieve non-terminated …)

Slow to respond when a few 100 jobs are in the RB
  • e.g. 15 seconds for job scheduling

Ranking mechanism to provide even distribution of jobs
  • Number of CPUs published is for site & not for user/VO - request for free CPU in JDL doesn’t help

Resource Broker II

LCG, in general, does not advertise normalised time units
  • Solution: request CPU resources for the slowest CPU (500 MHz)
  • Problem: only very few sites have long enough queues
  • Solution: DIRAC agent scales CPU for particular WN before request to DIRAC
  • Problem: some sites have normalised their units!

Jobs with ∞ loops
  • 3 day job in week queue - killed by proxy expiry rather than CPU requirement

Jobs aborted by “proxy expired”
  • RB was re-using old proxies !!!!
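
The CPU-time scaling issue can be made concrete with a small calculation. The sketch assumes that run time is inversely proportional to clock frequency, which is only a rough approximation, and it reads the slowest-CPU figure quoted above as 500 MHz; the function name is hypothetical.

```python
def scaled_cpu_hours(reference_hours, reference_mhz, node_mhz):
    """Scale a CPU-time requirement from a reference CPU to another clock.

    Assumes run time is inversely proportional to clock frequency.
    """
    if node_mhz <= 0:
        raise ValueError("node_mhz must be positive")
    return reference_hours * reference_mhz / node_mhz
```

Under that assumption, a production job that averages 23 hours on a 2 GHz CPU would need `scaled_cpu_hours(23, 2000, 500)`, i.e. 92 hours, on a 500 MHz machine. Requesting for the slowest CPU therefore asks for queues several times longer than most nodes need, which is why few sites qualify and why scaling on the worker node itself is the more practical option.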

Resource Broker III

Job cancelled by RB but with message “cancelled by user”
  • Due to loss of communication between RB & CE - job rescheduled & killed on original CE
  • Some jobs are not killed until they fail due to inability to transfer data
  • DIRAC also re-schedules!

RB lost control of the status of all jobs

RB “stuck” - not responding to any request - solved without loss of jobs

Disk Storage

Job runs in directory without enough space
  • Running jobs need ~2GB - a problem where a site has jobs sharing the same disk server rather than local WN space

Reliable Data Transfer

In case of data transfer failure the data on LCG is lost.
  • There is no retry mechanism if the destination SE is temporarily not available

Problems with GRIDFTP server at CERN:
  • Certificates not understood
  • Refused connections

Odds & Sods

LDAP of globus-mds server stops
  • OK - no jobs can be submitted to site
  • BUT also problems with authentication of GridFTP

Empty output sandbox
  • Tricky to debug !

Jobs cancelled by retry count
  • Occurs on sites with many jobs running
  • DIRAC just submits more agents

Demand 2004

CPU: 14 M UI hours (1.4 M UI hours consumed so far)

Storage
  • HPSS 20 TB
  • Disk 2 TB
    • Accessible from the LCG grid

Demand 2005

CPU: ~15 M UI hours

Storage
  • HPSS 30 TB (~15 TB recycled)
  • Disk 2 TB

Tier2 centers

Feasible
  • Good network connectivity is essential

Limited functionality:
  • Number crunchers (production simulation type tasks)

Standard technical solution
  • Hardware (CPU+storage)
  • Cluster software
  • Central consultancy support

Housing space
  • Adequate rooms in the labs (cooling, electric power, etc)

Tier2 centers (2)

Local support
  • Staff to be found (remote central watch tower ?)
  • 24/24, 7/7 or best effort support

Serving the community
  • Regional
    • Possible financing source
    • Extra clients (security, resource sharing policies issues)
  • National
    • French grid (segment) ?

LHCb DC'04 Accounting

Next Phases: Reprocessing and Analysis

Data reprocessing and analysis

Preparing data reprocessing phase:
  • Stripping – selecting events on DST files into several output streams by groups of physics
  • Scheduling jobs to sites where the needed data are
    • Tier1’s (CERN, Lyon, PIC, CNAF, RAL, Karlsruhe)

The workload management is capable of automatic job scheduling to a site having data

Tools are being prepared to formulate reprocessing tasks.
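
Scheduling a job to a site that already holds its data reduces to intersecting the replica locations of the input files. The sketch below assumes a simple mapping from logical file names to sites in place of a real catalog query; the function name and the data structure are hypothetical.

```python
def sites_with_all_data(input_files, replica_catalog):
    """Return the sites holding a replica of every input file, sorted.

    `replica_catalog` maps a logical file name to the collection of sites
    that hold a replica; it stands in for a real file catalog lookup.
    """
    candidate_sites = None
    for lfn in input_files:
        sites = set(replica_catalog.get(lfn, ()))
        if candidate_sites is None:
            candidate_sites = sites
        else:
            candidate_sites &= sites
    return sorted(candidate_sites or [])
```

A job is then eligible only at the returned sites; an empty result means that no single site holds all inputs, so the files must first be replicated or the task split.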

Data reprocessing and analysis (2)

User analysis:
  • Interfacing GANGA to submit jobs to DIRAC
  • Submitting user jobs to DIRAC sites:
    • Security concerns – jobs are executed by the agent account on behalf of the user
  • Submitting user jobs to LCG sites:
    • Through DIRAC to have common job Monitoring and Accounting
    • Using user certificates to submit to LCG
    • No agent submission:
      • Expecting high failure rate