
Page 1:

Status of Tier1s & Tier2s ~~~

Focus on STEP’09 Metrics & Recommendations

N.B. much more detail in post-mortems!

Jamie.Shiers@cern.ch ~~~

LCG-LHCC Referees Meeting, 6th July 2009

Page 2:

What Were The Metrics?

• Those set by the experiments: based on the main “functional blocks” that Tier1s and Tier2s support

• Primary (additional) Use Cases in STEP’09:

1. (Concurrent) reprocessing at Tier1s – including recall from tape
2. Analysis – primarily at Tier2s (except LHCb)

• In addition, we set a single service / operations site metric, primarily aimed at the Tier1s (and Tier0)

• Details:
• ATLAS (logbook, p-m w/s), CMS (p-m), blogs
• Daily minutes: week1, week2
• WLCG Post-mortem workshop


Page 3:

What Were The Results?

The good news first:

Most Tier1s and many of the Tier2s met – and in some cases exceeded by a significant margin – the targets that were set

• In addition, this was done with reasonable operational load at the site level and with quite a high background of scheduled and unscheduled interventions and other problems – including 5 simultaneous LHC OPN fibre cuts!

• Operationally, things went really rather well
• Experiment operations – particularly ATLAS – were overloaded

The not-so-good news:

• Some Tier1s and Tier2s did not meet one or more of the targets


Page 4:

Tier1s: “not-so-good”

• Of the Tier1s that did not meet the metrics, need to consider (alphabetically) ASGC, DE-KIT and NL-T1

• In terms of priority (i.e. what these sites deliver to the experiments), the order is probably DE-KIT, NL-T1, ASGC

In all cases, once the problems have been understood and resolved, need to re-test (also other sites…)

• Also important to understand reproducibility of results: would another STEP yield the same results (in particular for those sites that passed?)

There is also a very strong correlation between these sites and those that did not fully satisfy (I think I am being kind…) the site metric (which translates to full compliance with WLCG operations standards)


Page 5:

IN2P3

• IN2P3 had already scheduled an intervention when the timetable for STEP’09 was agreed

• It took longer for them to recover than anticipated, but IMHO their results are good not only wrt the experiment targets, but also wrt the “site metric”

• Had we not taken a rather short two-week snapshot, they would almost certainly have caught up with the backlog very quickly – in the end they met 97% of the ATLAS target!


Page 6:

DE-KIT

• (Very) shortly before the start of the main STEP period, DE-KIT announced that they could not take part in any tape-related activities of STEP’09 due to major SAN / tape problems

• Despite this, attempts were made to recover the situation, but the experiment targets were not met

• AFAIK, the situation is still not fully understood

Communication of problems and follow-up still do not meet the agreed WLCG “standards” – even after such a major incident and numerous reminders, there has still not been a “Service Incident Report” for this major issue

I do not believe that this is acceptable: this is a major European Tier1 providing resources to all 4 LHC experiments: the technical and operational problems must be fixed as soon as possible


Page 7:

To: WLCG operations and WLCG management board

Dear all,

Unfortunately, we have some bad news about the GridKa tape infrastructure. The connection to archival storage (tape) at GridKa is broken and will probably not function before STEP09. The actual cause is yet unknown but is rooted in the SAN configuration and/or hardware. We experience random timeouts on all fibre channel links that connect the dCache pool hosts to the tape drives and libraries. GridKa technicians (and involved supporters from the vendors) are doing everything possible to fix the problems as soon as possible, but chances to have the issue solved until next week are low. We therefore cannot take part in any tape-dependent activities during STEP09. If the situation improves during STEP09 we might be able to join later, maybe during the second week.

Sorry for the late information (experiments have been informed earlier), but I hoped to get some better news from our technicians this afternoon. We will keep you updated if the situation changes or (hopefully) improves.

Regards, Andreas


Page 8:

NL-T1

• NL-T1 is somewhat less ‘critical’ than DE-KIT (but still important) as it supports fewer of the experiments (not CMS) and has very few Tier2s to support
• As opposed to DE-KIT, which supports a large number in several countries…

• Some progress in resolving the MSS performance problems (DMF tuning) has been made, but it is fair to point out that this site too is relatively poor on “the site metric”

• The problems continue to be compounded by the NIKHEF / SARA “split” – which does not seem to affect NDGF, a much more distributed site (but simpler in terms of total capacity & number of VOs)


Page 9:

• Main issues (Maurice Bouwhuis [[email protected]]):
• Network bandwidth to the compute nodes too small to accommodate analysis jobs [will be fixed over summer together with new procurement of compute and storage resources]
• Performance issues with the Mass Storage environment: lack of tape drives and tuning of new components [12 new tape drives have been installed after STEP’09 and will be brought online over the next month; tuning of the Mass Storage disk cache during STEP’09 is finished; unexplained low performance during a limited period still under investigation]

• As expected before STEP’09, did not make the grade.

• Achievements:
• dCache performed well and was stable (> 24 hours of continuous 10 Gb/s data transfers out of dCache-based storage)
• Relatively few incidents (improvement compared to CCRC’08)
• Performance issues were known before STEP’09 and will be addressed over summer and early autumn. A detailed technical report on the STEP’09 NL-T1 post-mortem is available.

• Resources NL-T1:
• Until end of 2009 – computing: 3,800 kSI2k; disk storage: 1,110 TB; tape storage: 460 TB
• New resources will be delivered September/October
• Beginning of 2010 – computing: 11,800 kSI2k; disk storage: 3,935 TB; tape storage: 1,500 TB
• Upgrade of network infrastructure at NIKHEF and SARA over the summer
• Upgrade of Mass Storage infrastructure over the summer

Page 10:

ASGC

• ASGC suffered a fire in Q1 which had a major impact on the site
• They made very impressive efforts to recover as quickly as possible, including relocating to a temporary centre

They did not pass the metric(s) for a number of reasons

• It is clearly important to understand these in detail and retest once they have relocated back (on-going)

But there have been and continue to be major concerns and problems with this site which pre-date the fire by many months

• The man-power situation appears to be sub-critical
• Communication has been and continues to be a major problem – despite improvements, including local participation in the daily operations meeting

Other sites that are roughly equidistant from CERN (TRIUMF, Tokyo) do not suffer from these problems

Page 11:


Tier 1 experiences

NDGF: Operationally all worked as expected

No grid-related failures.

Issues:
• Low throughput – especially no pre-staging of data in NDGF
• User analysis jobs caused heavy load on network and filesystems at several sites
• Different types of jobs need different optimisation; none has been done, and it is nearly impossible with pilot jobs
• Failing ATLAS jobs report success – triggers transfer errors – affects efficiency

PIC:

Achievements:
• Experiment activities overlapped during most of the test. Services were stable with high availability
• Highest performance ever seen at PIC; met experiment targets
• WAN: T0→T1 up to 500 MB/s; T1–T2 up to 500 MB/s
• LAN: up to 1400 simultaneous jobs, reading from storage at 2.5 GB/s
• Tape: >350 MB/s total read reached in pre-stage + reprocessing

Issues:
• LTO drive performance lower than expected (~30 MB/s), dominated by seek times. Will purchase more drives (a rough drive-count estimate is sketched below).
• Need to work with experiments to improve strategies.
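As a back-of-the-envelope illustration of the “more drives” reasoning above – a sketch, not PIC’s actual sizing exercise, assuming the quoted ~30 MB/s effective per-drive rate and the >350 MB/s aggregate read rate are the relevant numbers:

    # Rough estimate: LTO drives needed to sustain a target aggregate read rate
    import math

    per_drive_rate = 30   # MB/s, effective seek-dominated rate quoted by PIC
    target_rate = 350     # MB/s, aggregate pre-stage + reprocessing read rate
    drives_needed = math.ceil(target_rate / per_drive_rate)
    print(drives_needed)  # 12 drives at the observed effective rate

Since the nominal streaming rate of LTO drives is far higher than the seek-dominated ~30 MB/s observed, reducing seek overhead (e.g. better ordering of recall requests) would lower the drive count needed as much as buying hardware.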


Page 12:


Tier 1 – cont.

FZK:

MSS problems started before STEP’09; decided not to participate in the tape-testing part of STEP’09.

Achievements:
• T0–T1, T1–T1 and T1–T2 transfers worked well for CMS; OK for ATLAS until a gridftp problem overloaded servers
• Tape writing was OK
• Reprocessing from disk was OK
• No problems with gLite services
• No load problems with 3D services
• No WAN bottlenecks, even during the OPN cut

Issues:
• Should have tested the SAN and dCache pool setup better
• STEP as a load test was important: functional testing and background activity are not enough
• Need to repeat the tests now that the SAN setup is fixed

BNL:

Tape system performance:
• Up to 45k files written, 35k files / 30 TB read per day
• 26k staging requests received in 2 hours, 6k files staged in 1 hour
• >300 MB/s average tape I/O over the course of the exercise

(Re-)processing:
• Up to 10k successfully completed jobs/day with input data from tape

LAN and WAN transfers:
• Up to 1.5 GB/s into/out of the Tier-1 (1 GB/s from CERN/other Tier-1s)
• >22k/hour completed SRM-managed WAN transfers
• Fibre cut in DE reduced BNL’s bandwidth to ~5 Gbps; TRIUMF was served via BNL
• 2.7 GB/s average transfer rate between disk storage and WNs

All Tier-1 services were stable over the entire course of STEP09.

Issues:
• The ATLAS Tier-0 pool got saturated at 3 GB/s (output rate); BNL was not getting data at the nominal rate at times => leading to a backlog
• Other sites still observing decent data rates
• ATLAS ADC and LCG working on a fix (proposed: dynamic adjustment/rebalancing of FTS active transfers; an illustrative sketch follows below)
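A minimal sketch of what “dynamic adjustment / rebalancing of FTS active transfers” could look like – periodically raising or lowering the number of concurrent transfers on a channel depending on observed throughput versus a nominal target. The channel names, rates, thresholds and adjustment rule below are illustrative assumptions, not the actual ATLAS ADC / FTS implementation:

    # Illustrative sketch: adjust concurrent FTS transfers per channel toward a target rate
    channels = {
        "CERN-BNL":    {"active": 30, "observed_mbps": 600, "target_mbps": 1000},
        "CERN-TRIUMF": {"active": 20, "observed_mbps": 450, "target_mbps": 400},
    }

    def rebalance(ch, step=5, min_active=5, max_active=100):
        if ch["observed_mbps"] < 0.9 * ch["target_mbps"]:
            ch["active"] = min(max_active, ch["active"] + step)  # below target: add transfers
        elif ch["observed_mbps"] > 1.1 * ch["target_mbps"]:
            ch["active"] = max(min_active, ch["active"] - step)  # above target: back off
        return ch["active"]

    for name, ch in channels.items():
        print(name, "->", rebalance(ch), "active transfers")

The point of such a scheme is that a channel suffering a backlog (as BNL did during the fibre cut) automatically gets more concurrent transfers, while well-served channels are throttled back.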

Page 13:


Tier 1 – cont.

FNAL:
• Goal was a 240 MB/s staging rate; could only do 200 MB/s in the 1st week (LTO has long seek times). 2nd week achieved 400 MB/s after tuning: need to regroup files on tape and monitor performance (see the sketch after this block).
• 5000 processing slots in continuous use; CPU efficiency good with pre-staging
• T0→FNAL – no issues, but traffic was not high
• T1–T1 – no major issues, some configuration problems
• T1–T2 – hit 1 known FTS bug
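As an illustration of the kind of regrouping meant above – batching staging requests by tape so each tape is mounted once and read through sequentially instead of seek-by-seek. The request format and output are hypothetical, not FNAL’s actual tooling:

    from collections import defaultdict

    # Hypothetical staging requests: (file, tape_label)
    requests = [("f1", "TAPE03"), ("f2", "TAPE01"), ("f3", "TAPE03"), ("f4", "TAPE01")]

    by_tape = defaultdict(list)
    for filename, tape in requests:
        by_tape[tape].append(filename)

    # Submit one sequential recall per tape instead of interleaved single-file recalls
    for tape, files in sorted(by_tape.items()):
        print(f"recall {len(files)} files from {tape}: {files}")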

RAL: Generally very smooth operation:
• Calm atmosphere. Only one callout; most service system load low
• Very good liaison with VOs and a good idea of what was going on
• Some problems with CASTOR tape migration (and robotics) – but all handled satisfactorily and fixed. Did not impact experiments.
• Data rates much higher than ever seen before:
• Network: peak steady state 4.5 Gb/s inbound (most from OPN), 2.5 Gb/s outbound, but massive 6 Gb/s spikes to Tier-2 sites. Reprocessing network load not a problem.
• Tape system running flat out at times but keeping up (ATLAS/CMS/LHCb). Sustaining 3.5 Gb/s for days and peaking at 6 Gb/s.
• Batch system running (CPU) efficiently (80-90%) but could not entirely fill it (3 GB vmem requirement).
• Very useful exercise – learned a lot, but very reassuring

More at: http://www.gridpp.rl.ac.uk/blog/category/step09/

Page 14:


Tier 1 – cont.

CNAF:
• ATLAS: transfer rates ~130 MB/s; production + reprocessing – completed 30k jobs vs 10k requested; 500 MB/s data access for analysis
• CMS: good performance for pre-staging, up to 400 MB/s. All pledged CPU used, but efficiency affected by a Castor disk server issue.

IN2P3:

• Demonstrated that the MSS can deliver data for reprocessing of both ATLAS and CMS. Introduced a new component for scheduling tape staging requests issued by dCache to HPSS (an illustrative sketch of such fair-share scheduling appears below). It was only used for the CMS reprocessing exercises, during 5 days. It was not used by ATLAS because we discovered a bug that prevented the fair usage of tape drives for both experiments; dCache therefore interacted directly (i.e. not optimally) with HPSS for staging data for their reprocessing, while CMS used the new component. In any case it worked well for both experiments, and HPSS & dCache delivered as expected.

• Planning, for the week of July 6th, internal tests of the whole software chain for data staging for both ATLAS and CMS. An exercise with ATLAS has been agreed for August.

• Need to improve monitoring tools to have an integrated view of the experiments’ data transfer activities (import & export) and in-site data processing (delivering data to jobs running at the site). Currently, information is split over several systems, monitored by different tools.

• During STEP’09 we saw some negative impact on the activity of the Tier-1 by the co-located Tier-2, mainly from the analysis activities of CMS. The storage configuration is being modified to prevent CMS analysis jobs from chaotically requesting staging of data from HPSS.

• Migration (or not) to dCache / Chimera: the schedule would be tight; need clarity on the LHC schedule.
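A minimal sketch of the fair-share idea behind such a staging scheduler – interleaving requests from the experiments so that neither monopolises the available tape drives. The queue contents, the VO keys, the drive count and the round-robin rule are all illustrative assumptions, not the actual IN2P3 component:

    from collections import deque
    from itertools import cycle

    # Hypothetical pending staging requests per experiment (VO)
    pending = {"atlas": deque(["a1", "a2", "a3"]), "cms": deque(["c1", "c2"])}
    free_drives = 2  # tape drives available right now (made-up number)

    def next_requests():
        """Round-robin over VOs so one experiment cannot monopolise the drives."""
        picked = []
        for vo in cycle(list(pending)):
            if len(picked) == free_drives or not any(pending.values()):
                break
            if pending[vo]:
                picked.append((vo, pending[vo].popleft()))
        return picked

    print(next_requests())  # e.g. [('atlas', 'a1'), ('cms', 'c1')]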

Page 15:


Tier 1 – cont.

NL-T1:

Issues:
• Network bandwidth to the compute nodes too small to accommodate analysis jobs [will be fixed over summer together with new procurement of compute and storage resources]
• Performance issues with the Mass Storage environment: lack of tape drives and tuning of new components [12 new tape drives have been installed after STEP’09 and will be brought online over the next month; tuning of the Mass Storage disk cache during STEP’09 is finished; unexplained low performance during a limited period still under investigation]
• As expected before STEP’09, did not make the grade.

Achievements:
• dCache performed well and was stable (> 24 hours of continuous 10 Gb/s data transfers out of dCache-based storage)
• Relatively few incidents (improvement compared to CCRC’08)
• Performance issues were known before STEP’09 and will be addressed over summer and early autumn. A detailed technical report on the STEP’09 NL-T1 post-mortem is available.

ASGC:

Achievements:
• ATLAS got >1 GB/s to the ASGC cloud
• Reprocessing smooth in the 2nd week only (several outstanding issues were fixed)
• CMS pre-staging: 150 MB/s vs the 73 MB/s target; 8k successful jobs
• Good data transfer performance

Issues (still problems related to relocation etc. after the fire):
• Wrong 3D port delayed the start of ATLAS reprocessing tests
• Oracle BigID problem in Castor (they were missing the workaround)
• Wrong fairshare allowed too many simulation and Tier 2 jobs
• CMS reprocessing job efficiency low compared to other Tier 1s

Page 16:

Summary of Tier1s

• The immediate problems of ASGC, DE-KIT and NL-T1 need to be solved and documented

• Once this has been done, they need to be re-tested (separately, multi-VO and finally together)

• A site visit to these sites (initially DE-KIT and NL-T1) should be performed with priority: it is important that the issues are understood in depth and that the situation is clarified not only with MB/GDB contacts, but also at the more technical level (buy-in)

Other major changes are underway (RAL CC move) or in the pipeline (dCache pnfs to Chimera migration)

A “rerun” – already referred to as “SEPT’09” – is needed

Page 17:

Tier2s

• The results from Tier2s are somewhat more complex to analyse – an example this time from CMS:
• Primary goal: use at least 50% of the pledged T2 level for analysis
• backfill ongoing analysis activity
• go above 50% if possible
• Preliminary results (an illustrative calculation of the “fraction of pledge used” figure follows below):
• In aggregate: 88% of the pledge was used. 14 sites with > 100%
• 9 sites below 50%
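Assuming the “fraction of pledge used” figure is simply the capacity actually used for analysis divided by the pledged T2 capacity, here is a minimal sketch of how such per-site and aggregate numbers can be computed (the site names and numbers are made up):

    # Hypothetical sketch: fraction of pledged Tier2 capacity used for analysis
    pledged = {"T2_A": 1000, "T2_B": 500}   # pledged job slots (made-up numbers)
    used    = {"T2_A": 1200, "T2_B": 200}   # average slots used during STEP'09 (made-up)

    for site in pledged:
        frac = used[site] / pledged[site]
        print(site, f"{frac:.0%}")          # T2_A 120% (above pledge), T2_B 40% (below 50%)

    aggregate = sum(used.values()) / sum(pledged.values())
    print("aggregate", f"{aggregate:.0%}")  # 93% for these made-up numbers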

• The number of Tier2s is such that it does not make sense to go through each by name, however:
• Need to understand the primary causes for some sites performing well and some performing relatively badly
• Some concerns on data access performance / data management in general at Tier2s: this is an area which has not been looked at in (sufficient?) detail


Page 18:

Summary of Tier2s

• Detailed reports written by a number of Tier2s
• MC conclusion “solved since a long time” (Glasgow)
• Also some numbers on specific tasks, e.g. GangaRobot
• Some specific areas of concern (likely to grow IMHO):
• Networking: internal bandwidth and/or external
• Data access: aside from the constraints above, concern whether data access will meet the load / requirements from heavy end-user analysis
• “Efficiency” – # successful analysis jobs – varies from 94% down to 56% per (ATLAS) cloud, but >99% down to 0% (e.g. 13K jobs failed, 100 succeeded; an illustrative calculation follows below) (error analysis also exists)
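For concreteness, the quoted extreme is consistent with treating efficiency simply as successful / (successful + failed) jobs – a minimal illustration using the numbers from the example above (the formula itself is my assumption of how the figure is derived):

    # Hypothetical illustration: analysis-job efficiency at one site
    succeeded = 100       # jobs that finished successfully (example from the slide)
    failed = 13_000       # jobs that failed (example from the slide)
    efficiency = succeeded / (succeeded + failed)
    print(f"{efficiency:.1%}")   # ~0.8%, i.e. effectively the 0% quoted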

• IMHO, the detailed summaries maintained by the experiments together with site reviews demonstrate that the process is under control, notwithstanding concerns

Page 19:

Recommendations

1. Resolution of major problems with in-depth written reports

2. Site visits to Tier1s that gave problems during STEP’09 (at least DE-KIT & NL-T1)

3. Understanding of Tier2 successes and failures

4. Rerun of “STEP’09” – perhaps split into reprocessing and analysis before a “final” re-run – on timescale of September 2009

5. Review of results prior to LHC restart

Page 20:

Overall Conclusions

• Once again, this was an extremely valuable exercise and we all learned a great deal!

• Progress – again – has been significant

• The WLCG operations procedures / meetings have proven their worth

• Need to understand and resolve outstanding issues!

Overall, STEP’09 was a big step forward!