ALICE computing in view of LHC start

Page 1: ALICE computing in view of LHC start

ALICE computing in view of LHC start

G.Shabratova

On behalf of ALICE offline

27.09.06 CERN-Russia JWGC meeting CERN

Page 2: ALICE computing in view of LHC start

Part 1

ALICE computing in 2006

Page 3: ALICE computing in view of LHC start

ALICE computing model

• For pp, similar to the other experiments
  – Quasi-online data distribution, reconstruction/calibration/alignment @ T0
  – Prompt calibration/alignment/reconstruction/analysis @ CAF (CERN Analysis Facility)
  – Further reconstruction passes @ T1s

• For AA, a different model
  – Only partial quasi-online data distribution
  – Prompt calibration/alignment/reconstruction/analysis @ CAF (CERN Analysis Facility)
  – RAW data replication and reconstruction @ T0 in the four months after the AA run (shutdown)
  – Further reconstruction passes @ T1s

• T0: first-pass reconstruction; storage of RAW, calibration data and first-pass ESDs

• T1: subsequent reconstruction passes and scheduled analysis; storage of a collective copy of the RAW and of one safely kept copy of reconstructed and simulated data; disk replicas of ESDs, AODs and calibration data

• T2: simulation and end-user analysis; disk replicas of ESDs and AODs
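As a rough illustration of the model above, the two processing chains can be written down as a short Python sketch; the function name and the stage strings are invented for this note and are not part of any ALICE software.

```python
# Illustrative summary of the pp vs. AA processing chains described above.
# plan_processing() and the stage names are invented for this sketch.
def plan_processing(system: str) -> list:
    """Return the ordered processing stages for a collision system."""
    if system == "pp":
        return [
            "quasi-online data distribution",
            "reconstruction/calibration/alignment @ T0",
            "prompt calibration/alignment/reco/analysis @ CAF",
            "further reconstruction passes @ T1s",
        ]
    if system == "AA":
        return [
            "partial quasi-online data distribution",
            "prompt calibration/alignment/reco/analysis @ CAF",
            "RAW replication and reconstruction @ T0 (~4 months after the AA run)",
            "further reconstruction passes @ T1s",
        ]
    raise ValueError(f"unknown collision system: {system}")

print(plan_processing("AA"))
```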

Page 4: ALICE computing in view of LHC start

Computing resources

• Serious crisis of computing resources

• Balance re-evaluated since the TDR in view of the pledged resources and the T1-T2 relations
  – Down-scaling of resources in 2007/08
  – The problem is NOT solved, just shifted
  – If the situation does not evolve, we will have to change our computing model accordingly, at the cost of less, or lower-quality, physics

Page 5: ALICE computing in view of LHC start

ALICE view on the current situation

[Diagram: evolution of the Grid middleware as seen by ALICE – EDG plus experiment-specific services (AliEn); then LCG (AliEn architecture + LCG code) running on EGEE; now experiment-specific services (AliEn’ for ALICE) on top of EGEE, ARC, OSG…]

Page 6: ALICE computing in view of LHC start

AliEn

• Coherent set of modular services

– Used in production 2001-2006

• LCG elements have progressively replaced AliEn ones

– Consistent with the plan announced by ALICE since 2001
– This will continue as suitable components become available

• Whenever possible, we use “common” services

• AliEn offers a single interface for the ALICE physicists into the complex, heterogeneous (multiple grids and platforms) and fast-evolving Grid reality

Page 7: ALICE computing in view of LHC start

Elements of the GRID machinery

• AliEn – entry point of ALICE to the GRID, both for data production and for user analysis
  – Steady improvements in stability and capacity
  – Central services are now at 95% availability and are practically at production level
  – Quick advances on the end-user interface (gshell)

• Software improvements are coupled with user training – tutorials on analysis are given monthly

• Analysis job submission and tracking is very robust

• Remaining major issues
  – Access to data – standard storage solutions (CASTOR2, DPM, dCache) with the xrootd interface required by ALICE are still in active development (a data-access sketch follows below)
    • Blockages and high latency of the storage are quite frustrating for the end user
    • This has to be solved by the October phase (user analysis) of PDC’06
  – Stability of participation and level of provided resources at the computing centres
    • The resources commitment was overestimated almost everywhere
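For orientation, end-user access to data through the xrootd interface mentioned above is normally just a matter of opening a root:// URL from ROOT. The snippet below is a minimal PyROOT sketch; the server name, file path and tree name are hypothetical examples, not actual ALICE locations.

```python
# Minimal PyROOT sketch of reading a file through an xrootd door, as analysis
# code does once the storage element exposes an xrootd interface.
# Host, path and tree name are hypothetical examples.
import ROOT

f = ROOT.TFile.Open("root://some-se.example.org//alice/sim/2006/run123/AliESDs.root")
if not f or f.IsZombie():
    raise RuntimeError("could not open the file via xrootd")

tree = f.Get("esdTree")   # assumed tree name, for illustration only
print("entries:", tree.GetEntries())
f.Close()
```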

Page 8: ALICE computing in view of LHC start

Elements of the GRID machinery (2)

• LCG components
  – Steady improvement in quality of service
  – ALICE is fully using all stable LCG services
  – The VO-box infrastructure (required at all sites) has been tested and found to be a scalable and secure solution
    • Plays the role of an interface between the AliEn services and the local LCG/site services
    • Was a highly controversial issue initially; now an accepted standard solution at all ALICE sites

• Interfaces to other GRID flavours
  – ARC (NDGF) – a prototype exists and is being tested at Bergen
  – OSG (USA) – discussions have started

• Remaining major issues:
  – Storage: DPM, CASTOR2, dCache, allowing the sites to install only LCG storage services (with incorporated xrootd)
    • In development

Page 9: ALICE computing in view of LHC start

Elements of the GRID machinery (3)

• Operational support
  – ALICE’s goal is to automate the GRID operations as much as possible – a small team of experts takes care of everything
    • The core team (ARDA, SFT, LCG GD, INFN, Offline) is responsible for the software and central services operations, for the training of regional experts and for the general dissemination of GRID-related information
    • Regional experts (one per country) are responsible for the site operations and for the interactions with the local system administrators
    • In total, 15 people are responsible for the software development, daily operations and support of the ALICE GRID
  – The prolonged periods of running (already 5 months) are only possible thanks to the pool of experts with overlapping knowledge and responsibilities
    • Still, this is quite a strain on very few people – we expect that the load will go down as the software matures

• GRID monitoring – MonALISA
  – Almost complete monitoring and history of running on the GRID, available at http://alimonitor.cern.ch:8889

Page 10: ALICE computing in view of LHC start

PDC’06 goals

– Production of MC events for detector and software performance studies

– Verification of the ALICE distributed computing model
  • Integration and debugging of the GRID components into a stable system
    – LCG Resource Broker, LCG File Catalogue, File Transfer Service, VO-boxes
    – AliEn central services – catalogue, job submission and control, task queue, monitoring
  • Distributed calibration and alignment framework
  • Full data chain – RAW data from DAQ, registration in the AliEn FC, first-pass reconstruction at T0, replication at the T1s (see the sketch after this list)
  • Computing resources – verification of the scalability and stability of the on-site services, and building up of expert support
  • End-user analysis on the GRID
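The "full data chain" goal can be visualised with a small sketch. Every function below is a hypothetical placeholder standing in for the corresponding real service (DAQ, AliEn File Catalogue, first-pass reconstruction at T0, FTS replication); none of it is actual ALICE code.

```python
# Schematic sketch of the PDC'06 "full data chain" listed above.
# All functions are hypothetical placeholders for the real services.
def take_raw_from_daq(run: int) -> str: ...
def register_in_alien_fc(raw_file: str) -> str: ...
def first_pass_reco_at_t0(raw_lfn: str) -> str: ...
def replicate_to_t1s(lfn: str, t1_sites: list) -> None: ...

def full_data_chain(run: int, t1_sites: list) -> None:
    raw_file = take_raw_from_daq(run)            # RAW data from DAQ
    raw_lfn = register_in_alien_fc(raw_file)     # registration in the AliEn FC
    esd_lfn = first_pass_reco_at_t0(raw_lfn)     # first-pass reconstruction at T0
    replicate_to_t1s(raw_lfn, t1_sites)          # RAW replication at the T1s
    replicate_to_t1s(esd_lfn, t1_sites)          # ESDs follow, as in the computing model
```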

Page 11: ALICE computing in view of LHC start

History of PDC’06

• Continuous running since April 2006
  – Test jobs, allowing us to debug all site services and to test their stability
  – From July – production and reconstruction of p+p MC events

Page 12: ALICE computing in view of LHC start

History of PDC’06 (2)

• Gradual inclusion of sites in the ALICE Grid – current status:
  – 6 T1s: CCIN2P3, CERN, CNAF, GridKA, NIKHEF, RAL
  – 30 T2s

• Currently available CPU power – 2000 CPUs for ALICE (expected ~4000)
  – Competing for resources with the other LHC experiments
  – Computing centres are waiting until the last moment to buy hardware – they will get more for the same price
  – Additional resources expected from the Nordic countries and from the US (LBL and LLNL)

Page 13: ALICE computing in view of LHC start

Resources statistics

• Resources contribution (normalized SI2K units): 50% from T1s, 50% from T2s
  – The role of the T2s remains very high!

Page 14: ALICE computing in view of LHC start

Resources statistics (2)

• Total amount of CPU work: 2 MSI2K units, equivalent to 500 CPUs working continuously since the beginning of the exercise (142 days)
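A back-of-the-envelope check of the figure above, assuming the quoted work is in MSI2K·hours and a typical 2006 worker node is rated at roughly 1.2 kSI2K (both assumptions, not stated on the slide):

```python
# Hedged consistency check: 2 MSI2K·hours over 142 days vs. "~500 CPUs".
total_work_ksi2k_h = 2_000_000   # 2 MSI2K·hours, expressed in kSI2K·hours (assumption)
cpu_rating_ksi2k = 1.2           # assumed average rating of one CPU (assumption)
duration_h = 142 * 24            # 142 days of continuous running

equivalent_cpus = total_work_ksi2k_h / (cpu_rating_ksi2k * duration_h)
print(f"equivalent continuously running CPUs: {equivalent_cpus:.0f}")   # ~490, i.e. ~500
```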

Page 15: ALICE computing in view of LHC start

Data movement

• Step 1: produced data is sent to CERN
  – Up to 150 MB/s data rate (limited by the number of available CPUs) – ½ of the rate during Pb+Pb data export

Page 16: ALICE computing in view of LHC start

Data movement (2)

• Step 2: data is replicated from CERN to the T1s
  – Test of the LCG File Transfer Service
  – The goal is 300 MB/s – the exercise is still ongoing

Page 17: ALICE computing in view of LHC start

Registration of RAW data from DAQ

• Dummy registration in AliEn and CASTOR2 – continuous since July 15

• Registration and automatic reconstruction of TPC test data

Page 18: ALICE computing in view of LHC start

MC production statistics

• Total of 0.5 PB of data registered in CASTOR2
  – 300K files, 1.6 GB/file
  – Files are combined in archives for optimal load on the MSS

• 7M p+p events combined in 70 runs (production is continuing)
  – ESDs, simulated RAW and ESD tag files

• 50K Pb+Pb events in 100 runs
  – ESDs and ESD tag files
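The bulk numbers above are internally consistent, as a quick check shows (illustrative arithmetic only):

```python
# 300K files at an average of 1.6 GB each add up to the quoted ~0.5 PB.
n_files = 300_000
avg_file_size_gb = 1.6

total_tb = n_files * avg_file_size_gb / 1000     # ~480 TB
print(f"total volume: {total_tb:.0f} TB (~{total_tb / 1000:.2f} PB)")

# Events per run, for orientation.
print("p+p events per run:  ", 7_000_000 // 70)   # 100000
print("Pb+Pb events per run:", 50_000 // 100)     # 500
```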

Page 19: ALICE computing in view of LHC start

ALICE sites on the world map

Page 20: ALICE computing in view of LHC start

ALICE sites on the world map

Page 21: ALICE computing in view of LHC start

Current FTS status

• Using 500 files × 1.9 GB each to test the sites

• An automatic load generator based on aliensh keeps the transfer queues at optimal capacity (see the sketch below)

• Performance
  – Peak performance achieved: 160 MB/s (averaged over a few hours)
  – Still far from the target of 300 MB/s sustained for one week
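A minimal sketch of the idea behind the load generator mentioned above – keep the transfer queue topped up so that the CERN-to-T1 channels never run dry. All helper functions are hypothetical placeholders, not the actual aliensh/AliEn interface.

```python
# Sketch of an FTS load-generator loop, assuming hypothetical helpers
# queued_transfers(), list_test_files() and schedule_transfer().
import time

TARGET_QUEUE_DEPTH = 100   # assumed "optimal capacity" of the transfer queue
POLL_INTERVAL_S = 60       # how often the queue level is checked

def queued_transfers():
    """Number of transfers currently queued (placeholder)."""
    raise NotImplementedError

def list_test_files():
    """The pool of ~500 test files of 1.9 GB each (placeholder)."""
    raise NotImplementedError

def schedule_transfer(lfn):
    """Queue one file for CERN -> T1 replication (placeholder)."""
    raise NotImplementedError

def run_load_generator():
    files = list_test_files()
    i = 0
    while True:
        missing = TARGET_QUEUE_DEPTH - queued_transfers()
        for _ in range(max(missing, 0)):
            schedule_transfer(files[i % len(files)])   # cycle through the test set
            i += 1
        time.sleep(POLL_INTERVAL_S)
```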

Page 22: ALICE computing in view of LHC start

Current FTS status (cont)

[Monitoring snapshot: typical picture when the transfer queue is full]

Page 23: ALICE computing in view of LHC start

Current status (cont)

• ALICE is beginning to use a fair share of the resources

• Typical situation: 3 sites (out of 5) working at any given time

Page 24: ALICE computing in view of LHC start

Conclusions and outlook

• Production has been running on the GRID continuously since April
  – Testing of the ALICE computing model with ever-increasing complexity of tasks
  – Gradual build-up of the distributed infrastructure in preparation for data taking in 2007
  – Training of experts and collection of operational experience
  – Improvements of the AliEn software – hidden thresholds are only uncovered under high load

• We are reaching a ‘near production’ level of service
  – Storage still requires a lot of work and attention

• Next big event – user analysis of the produced data
  – Pilot users help a lot in finding the weak spots and in pushing the development forward