LHCb report to LHCC and C-RSG
Philippe Charpentier, CERN
on behalf of LHCb
LHCb to LHCC and C-RSG review, PhC 2
Activities in 2009-Q3/Q4
- Core Software
  - Stable versions of Gaudi and LCG-AA
- Applications
  - Stable as of September for real data
  - Fast minor releases to cope with reality of life…
- Monte-Carlo
  - Intensive MC09 simulation (@ 5 TeV)
    - Minimum bias
    - b- and c- inclusive
    - b signal channels
  - Few events in foreseen 2009 configuration (450 GeV)
  - MC09 stripping (2 passes)
    - Trigger stripping
    - Physics stripping
- Real data reconstruction and stripping
  - As of November 20th …
Resource usage
139 sites hit, 4.2 million jobs
- Start in June: start of MC09
Job failure: 15% (17% at Tier1s)
Failure breakdown
Production and user jobs
Jobs at Tier1s
Job types at Tier1s
CPU used (not normalised)
- Average job duration
  - 5.6 hours for all jobs
  - 20 min for user jobs (20%)
  - 6.6 hours for production jobs
CPU usage (not normalised)
WLCG vs LHCb accounting (unnormalised)
- 13% more in WLCG than in DIRAC (unnormalised)
  - 1.26 Mdays vs 1.1 Mdays
  - Overhead of non-reporting jobs + pilot/LCG/batch frameworks
- Average CPU power: 1.5 kSI2k (from WLCG accounting)
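The comparison of the two accounting totals can be checked with a few lines of arithmetic. This is only a sketch using the two rounded figures quoted on the slide; the variable names are illustrative, and the reading that the 13% is expressed relative to the WLCG total is an assumption, not something the slide states.

```python
# Unnormalised CPU time quoted on the slide, in millions of CPU days.
wlcg_mdays = 1.26   # WLCG accounting
dirac_mdays = 1.10  # DIRAC accounting

# Extra CPU time seen by WLCG but not reported through DIRAC.
excess = wlcg_mdays - dirac_mdays

# The same excess expressed against either total (which one the slide
# uses is an assumption on our side).
frac_of_wlcg = excess / wlcg_mdays
frac_of_dirac = excess / dirac_mdays

print(f"excess: {excess:.2f} Mdays")
print(f"relative to WLCG:  {frac_of_wlcg:.1%}")
print(f"relative to DIRAC: {frac_of_dirac:.1%}")
```

With these rounded inputs the excess is about 13% of the WLCG total and about 14.5% of the DIRAC total, so the quoted figure is consistent with the first reading.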
Normalised CPU usage in 2009
- Ramping up of pilot role in summer
- Resource usage decreased since LHC restarted
  - Concentrate on (few) real data
  - Wait for data analysis before continuing MC simulation
- Group 1: production
- Group 2: pilot
- Group 3 & 4: user
- Group 5: lcgadmin
Resource usage
- Note: the CERN figure in the table below does not include non-Grid usage
  - From WLCG accounting: 32% is non-Grid at CERN
  - CERN number should then read: 2.18 kHS06.years
- CPU usage within 10% of requests
- Distribution not exactly as expected
  - More non-Tier1 resources available
    - Less MC run at CERN + Tier1s
  - Almost no real data: fewer resources used at CERN
    - CAF not used as much as expected
Site     Used (kHS06.years)   Requested (kHS06.years)
CERN     1.48                 8.54
Tier1s   8.24                 11.7
Tier2s   24.44                17.12
Total    34.16                37.36
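The figures in this table and the accompanying notes can be cross-checked with a short calculation. The sketch below uses only the numbers shown on the slide; the dictionary layout and variable names are illustrative, and it assumes that "32% is non-Grid" means the Grid figure is 68% of the total CERN usage.

```python
# CPU usage and requests from the table, in kHS06.years.
used = {"CERN": 1.48, "Tier1s": 8.24, "Tier2s": 24.44}
requested = {"CERN": 8.54, "Tier1s": 11.7, "Tier2s": 17.12}

total_used = sum(used.values())
total_requested = sum(requested.values())

# Fraction of the overall request that was actually used.
usage_ratio = total_used / total_requested

# Assumption: the Grid-only CERN figure is (1 - 0.32) of the full usage,
# so the full usage is recovered by dividing by 0.68.
non_grid_fraction = 0.32
cern_with_non_grid = used["CERN"] / (1 - non_grid_fraction)

print(f"total used:      {total_used:.2f} kHS06.years")
print(f"total requested: {total_requested:.2f} kHS06.years")
print(f"used / requested: {usage_ratio:.1%}")
print(f"CERN incl. non-Grid: {cern_with_non_grid:.2f} kHS06.years")
```

Under that assumption the totals reproduce the table, the usage comes out at roughly 91% of the request (hence "within 10%"), and the corrected CERN figure is about 2.18 kHS06.years.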
Storage usage
- *) From Castor queries today
- **) From WLCG accounting end December
- ***) Including 420 TB for T1D0 cache
- Sites provided slightly more than the pledges
  - Thanks!
  - At CERN, some disk pools (default, T1D0) were not included in the requests but are in the accounting
Site          Requested (TB)   Allocated (TB)   Used (TB)
CERN*) TxD1   650              696.5            482.7
CERN*) T1D0   70               148.5            irrelevant
CERN**)       720              721              478
Tier1s**)     1740***)         1915             633
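The statement that sites provided slightly more than pledged, and how much of the allocation was filled, can be read off the table with a small calculation. This is a sketch only: the row labels are shortened, the units are assumed to be TB (as suggested by the 420 TB footnote), and the T1D0 row is left out because its "used" entry is marked irrelevant.

```python
# (requested, allocated, used) per row; units assumed to be TB.
storage = {
    "CERN TxD1 (Castor)": (650, 696.5, 482.7),
    "CERN (WLCG)": (720, 721, 478),
    "Tier1s (WLCG)": (1740, 1915, 633),
}

summary = {}
for site, (req, alloc, used) in storage.items():
    summary[site] = {
        "allocated_over_requested": alloc / req,  # > 1 means more than requested
        "used_over_allocated": used / alloc,      # fill level of the allocation
    }

for site, s in summary.items():
    print(f"{site:20s} alloc/req = {s['allocated_over_requested']:.2f}, "
          f"used/alloc = {s['used_over_allocated']:.0%}")
```

In all three rows the allocation is at or above the request, while the Tier1 disk is only about one third full, which fits the remark elsewhere in the talk that there was almost no real data in 2009.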
Experience with real data
First experience with real data
- Very low crossing rate
  - Maximum 8 bunches colliding (88 kHz crossing)
  - Very low luminosity
  - Minimum bias trigger rate: from 0.1 to 10 Hz
  - Data taken with single beam and with collisions
No zero-suppression in VELO; otherwise ~25 GB only!
Real data processing
- Iterative process
  - Small changes in reconstruction application
  - Improved alignment
  - In total 7 sets of processing conditions
    - Only last files were all processed 4 times now (twice in 2010)
- Processing submission
  - Automatic job creation and submission after:
    - File is successfully migrated in Castor
    - File is successfully replicated at Tier1
  - If job fails for a reason other than application crash
    - The file is reset as “to be processed”
    - New job is created / submitted (automatic)
  - Processing more efficient at CERN (see later)
    - Eventually after few trials at Tier1, the file is processed at CERN
  - No stripping ;-)
    - DST files distributed to all Tier1s for analysis
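The submission logic described above can be sketched as a small retry loop. This is not the DIRAC production code: the class, function and status names, the file name, and the limits of 3 Tier1 trials and 10 total attempts are all hypothetical, chosen only to illustrate the rules on the slide (eligibility after migration and replication, automatic resubmission unless the application crashed, fallback to CERN after a few Tier1 trials).

```python
from dataclasses import dataclass, field

MAX_TIER1_TRIALS = 3  # assumption: the slide only says "a few trials"


@dataclass
class RawFile:
    name: str
    migrated: bool = False    # successfully migrated in Castor
    replicated: bool = False  # successfully replicated at a Tier1
    status: str = "to be processed"
    tier1_trials: int = 0
    history: list = field(default_factory=list)


def next_site(f):
    """Try the Tier1 first, fall back to CERN after a few failed trials."""
    return "Tier1" if f.tier1_trials < MAX_TIER1_TRIALS else "CERN"


def process(f, run_job, max_attempts=10):
    """run_job(file, site) returns 'ok', 'app_crash' or 'other_failure'."""
    if not (f.migrated and f.replicated):
        return f.status  # not yet eligible for job creation
    for _ in range(max_attempts):
        site = next_site(f)
        outcome = run_job(f, site)
        f.history.append((site, outcome))
        if outcome == "ok":
            f.status = "processed"
            break
        if outcome == "app_crash":
            f.status = "failed"  # application crashes are not resubmitted
            break
        # Any other failure: file stays "to be processed", a new job follows.
        if site == "Tier1":
            f.tier1_trials += 1
    return f.status


# Usage: a Tier1 with a persistent data-access problem, CERN working fine.
def flaky_tier1(f, site):
    return "ok" if site == "CERN" else "other_failure"


raw = RawFile("run_0001.raw", migrated=True, replicated=True)
final_status = process(raw, flaky_tier1)
print(final_status, raw.history)
```

In this toy run the file fails three times at the Tier1 and is then processed at CERN, which mirrors the behaviour the slide describes as "eventually after few trials at Tier1".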
Reconstruction jobs
Issues with real data
- Castor migration
  - Very low rate: had to change the migration algorithm for more frequent migration (1 hour instead of 8 hours)
- Issue with large files (above 2 GB)
  - Real data files are not ROOT files but are opened by ROOT
  - There was an issue with a compatibility library for slc4-32 bit on slc5 nodes
    - Fixed within a day
- Wrong magnetic field sign
  - Due to different coordinate systems for LHCb and LHC ;-)
  - Fixed within hours
- Data access problem (by protocol, directly from server)
  - Still dCache issue at IN2P3 and NIKHEF
    - dCache experts working on it
  - Moved to copy mode paradigm for reconstruction
  - Still a problem for user jobs: a pain!
    - Sites are regularly banned for analysis
Transfers and job latency
- No problem observed during file transfers
  - Files randomly distributed to Tier1
  - Will move to distribution by runs (few 100’s files)
  - For 2009, runs were never longer than 4-5 files!
  - Max file size set to 3 GB
- Very good Grid latency
  - Time between submission and the job starting to run
Resource requests
Resource requests for 2010-12
- 2010 running
  - The requests were made in April-June 2009
    - No additional resources expected
    - Try to fit within those requests
  - Running scenario for LHCb
    - March: 35% LHC efficiency @ 100 Hz
    - April-May-June: 50% LHC efficiency @ 1 kHz on average
    - July-August-September-half October: 50% @ 2 kHz
    - No Heavy Ion run for LHCb
    - This corresponds to 6.1 × 10^6 seconds @ 2 kHz
    - The 2009-10 request accounted precisely by chance for 6.1 × 10^6 seconds (0.5+5.6)
    - Therefore we use 6.1 × 10^6 seconds for 2010 at 2 kHz trigger rate
- 2011 running
  - Use the recommendation of MB
    - March: 35% LHC efficiency @ 2 kHz
    - April to mid-October: 50% LHC efficiency @ 2 kHz
    - Total running time: 8.9 × 10^6 seconds
- 2012: no run
Resource requirements for 2010-12
kHEP06*year                           2010 (old)   2010 (confirmed)     2011 (prelim.)       2012 (very prelim.)
                                      Integrated   Integrated   Power   Integrated   Power   Integrated   Power
CERN T0                                            5.70                 4.50                 4.07
CERN CAF - Analysis/Calib/Alignment                11.56                11.91                15.46
CERN T0 + T1                          17.19        17.26        21      16.41        20      19.53        24
Tier1s                                32.99        33.84        41      57.49        70      65.55        80
Tier2s                                31.74        31.74        46      31.48        46      31.48        46
Total                                 81.91        82.83        108     105.38       136     116.57       150

Disk (TB)
CERN T0 + T1                          1290         1270                 1685                 1776
Tier1s                                3290         3350                 4215                 4458
Tier2s                                20           20                   20                   20
Total                                 4600         4640                 5920                 6254

Tape (TB)
CERN T0 + T1                          1500         1462                 3020                 3723
Tier1s                                1800         1922                 4271                 5605
Total                                 3300         3384                 7290                 9328
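As a consistency check on the table, the per-site rows can be summed and compared with the quoted totals. The sketch below is illustrative only: the data layout and the helper name max_total_mismatch are hypothetical, and the assignment of the values to year columns is inferred from the row sums, since the extracted slide lost its column alignment.

```python
# Integrated CPU (kHEP06*year), disk (TB) and tape (TB) per year column.
cpu = {
    "2010 (old)":       {"CERN": 17.19, "Tier1s": 32.99, "Tier2s": 31.74, "Total": 81.91},
    "2010 (confirmed)": {"CERN": 17.26, "Tier1s": 33.84, "Tier2s": 31.74, "Total": 82.83},
    "2011":             {"CERN": 16.41, "Tier1s": 57.49, "Tier2s": 31.48, "Total": 105.38},
    "2012":             {"CERN": 19.53, "Tier1s": 65.55, "Tier2s": 31.48, "Total": 116.57},
}
disk = {
    "2010 (old)":       {"CERN": 1290, "Tier1s": 3290, "Tier2s": 20, "Total": 4600},
    "2010 (confirmed)": {"CERN": 1270, "Tier1s": 3350, "Tier2s": 20, "Total": 4640},
    "2011":             {"CERN": 1685, "Tier1s": 4215, "Tier2s": 20, "Total": 5920},
    "2012":             {"CERN": 1776, "Tier1s": 4458, "Tier2s": 20, "Total": 6254},
}
tape = {
    "2010 (old)":       {"CERN": 1500, "Tier1s": 1800, "Total": 3300},
    "2010 (confirmed)": {"CERN": 1462, "Tier1s": 1922, "Total": 3384},
    "2011":             {"CERN": 3020, "Tier1s": 4271, "Total": 7290},
    "2012":             {"CERN": 3723, "Tier1s": 5605, "Total": 9328},
}


def max_total_mismatch(table):
    """Largest |sum of site rows - quoted Total| over all year columns."""
    worst = 0.0
    for column in table.values():
        parts = sum(v for k, v in column.items() if k != "Total")
        worst = max(worst, abs(parts - column["Total"]))
    return worst


for name, table in (("CPU", cpu), ("Disk", disk), ("Tape", tape)):
    print(f"{name}: largest mismatch = {max_total_mismatch(table):.2f}")

# CERN T0 and CAF rows should add up to the "CERN T0 + T1" row.
t0 = {"2010 (confirmed)": 5.70, "2011": 4.50, "2012": 4.07}
caf = {"2010 (confirmed)": 11.56, "2011": 11.91, "2012": 15.46}
cern_split_ok = all(abs(t0[y] + caf[y] - cpu[y]["CERN"]) < 0.005 for y in t0)
print("CERN T0 + CAF matches CERN row:", cern_split_ok)
```

The remaining differences (0.01 kHEP06*year in some CPU columns, 1 TB in the 2011 tape total) are at the level expected from rounding of the individual rows.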
Comments on resources
- Very uncertain and fluctuating running plans!
- Depending on LHC running, MC requests may be different
  - Minimum bias, charm physics, b physics…
- Only after (at least) one year of experience can we see how running analysis on the Grid works
  - Analysis at CERN?
  - Analysis at Tier3s?
  - Reliability for analysis?
- 2012 is still very uncertain
  - No LHC running
  - Will the MC requests be the same as in previous years?
  - How many reprocessings?
    - Currently assume 1 full reprocessing of 2010 and 2 of 2011
Conclusions
- Real data in 2009
  - So few that it didn’t impact resource usage
  - Was extremely valuable for
    - Setting procedures
    - Starting to understand the detector
      - Already very promising performance after a few days
      - π0 peak, Λ and K0 reconstruction…
    - Exercising automatic processes
- 2010
  - Still expect somewhat chaotic running
    - Frequent changes in LHC settings, LHCb trigger commissioning
  - No change in LHCb resource requests w.r.t. June 2009
- 2011
  - More precise requests with experience from 2010
- 2012
  - Still very preliminary, but small increase only compared to 2011