WLCG Service Report ~~~ WLCG Management Board, 7th September 2010. Updated 8th September 2010.



Introduction

• Generally smooth operation on experiment and service side
• A prolonged outage at NL-T1 due to database issues (most of this reporting period but now recovering)
   • NL-T1 well represented at daily meeting with regular reports and ticket updates; prognosis at some stages uncertain
• And the start of a multi-day outage at ASGC, again due to database issues
   • No attendance in recent days (all last week), so it is hard to get a clear picture of the problem and its resolution
   • Some information (but not very useful) in GGUS tickets (we are in downtime) and in EGEE broadcasts
   • This point was made at yesterday’s WLCG operations call, where an update on the recent events was requested (a SIR is also due)


GGUS summary (2 weeks)

VO       User   Team   Alarm   Total
ALICE       4      0       0       4
ATLAS      11     88       1     100
CMS         9      2       1      12
LHCb        7     32       0      39
Totals     31    122       2     155
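The totals in the table above can be cross-checked with a few lines of Python. This is only a sketch: the counts are copied from the table, and the variable names (`tickets`, `row_totals`, `column_totals`, `grand_total`) are illustrative, not part of any GGUS tooling.

```python
# Ticket counts per VO, copied from the GGUS summary table.
# Each tuple is (user, team, alarm); the totals are recomputed here.
tickets = {
    "ALICE": (4, 0, 0),
    "ATLAS": (11, 88, 1),
    "CMS": (9, 2, 1),
    "LHCb": (7, 32, 0),
}

# Per-VO totals (the "Total" column).
row_totals = {vo: sum(counts) for vo, counts in tickets.items()}

# Per-type totals (the "Totals" row): user, team, alarm.
column_totals = tuple(sum(col) for col in zip(*tickets.values()))

# Overall number of tickets in the two-week period.
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals, grand_total)
```

The recomputed values agree with the table: 31 user, 122 team and 2 alarm tickets, 155 in total.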


Support-related events since last MB

• There were 2 ALARM tickets since the last MB (2 weeks), raised by ATLAS and CMS.
   • The ATLAS ALARM was a real case, issued towards CERN (recurrence of a known LSF bug): GGUS:61607.
   • The CMS ALARM was a test issued by the GGUS developer towards NGI_DE, which integrated its local ticketing system with GGUS at the July Release (2010/07/21) and fails to understand ticket attachments.
• At the last T1 Service Coordination Meeting on 2010/08/26, BNL raised the need to route network problems efficiently in GGUS. This is being investigated by CERN/IT/ES.
• In the post-EGEE era a number of GGUS Support Units are no longer appropriate. New ones are imminent and cover yaim-core, new gLite Releases, UI, WN and VOBOX, and individual batch systems.


ATLAS ALARM->CERN LSF SLOW JOB DISPATCH

•https://gus.fzk.de/ws/ticket_info.php?ticket=61607


What time (UTC)     What happened
2010/08/27 11:40    GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN
2010/08/27 12:07    Operator confirms the expert has been notified.
2010/08/27 12:43    Expert reports the problem is known and being traced with Platform (the vendor). Reconfigured the batch service daemons and put the ticket to ‘solved’.
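The response times implied by this timeline can be worked out as follows. This is a minimal sketch using only the three timestamps listed above; the short event labels and the names `FMT`, `events` and `elapsed_minutes` are illustrative.

```python
from datetime import datetime

# Timestamp layout used in the timeline (all times UTC).
FMT = "%Y/%m/%d %H:%M"

# (timestamp, short label) pairs copied from the timeline.
events = [
    ("2010/08/27 11:40", "ALARM ticket opened"),
    ("2010/08/27 12:07", "operator confirms expert notified"),
    ("2010/08/27 12:43", "expert reports, ticket set to solved"),
]

# Measure every event relative to the moment the ticket was opened.
opened = datetime.strptime(events[0][0], FMT)
elapsed_minutes = [
    int((datetime.strptime(stamp, FMT) - opened).total_seconds() // 60)
    for stamp, _ in events
]

for (stamp, label), minutes in zip(events, elapsed_minutes):
    print(f"{stamp}  +{minutes:3d} min  {label}")
```

On these figures the operator acknowledged the alarm 27 minutes after it was opened, and the ticket was set to solved after 63 minutes.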


[Figure: site availability plots per experiment, annotated with the item numbers (0.1 to 4.3) used in the analysis that follows]


Analysis of the availability plots

COMMON TO ALL EXPERIMENTS
0.1 NIKHEF: SRMv2 test failures. GGUS:61266. Issue related to GGUS:61265 → The database behind FTS service at SARA is down due to a corrupted Oracle database.
0.2 SARA-MATRIX: The database behind FTS service at SARA is down due to a corrupted Oracle database. Unscheduled downtime. GGUS:61265.
0.3 TAIWAN: SAM SRM instabilities, causing transfer inefficiencies and SAM errors. Savannah tickets 116338 & 116356.
0.4 IN2P3: CE & SRM related problems. SRM was down GGUS:61634. Tests were failing occasionally. GGUS:61631.

ATLAS
1.1 BNL: new SAM ATLAS CE specific tests developed, tested and deployed for OSG sites: the unavailability observed is not due to the site but to an issue related to certificate’s mapping to high priority queue.
   • ATLAS proposes these CE tests to be used from now on to compute ATLAS specific availability.

ALICE
2.1 FZK: VOBOX related problems. The user proxy registration and delegation procedures were not working.
2.2 SARA: CE tests were failing occasionally.

CMS
NTR.

LHCb
4.1 CNAF: SRM tests were failing due to full space token (problem on experiment side).
4.2 IN2P3: CE tests were failing occasionally. SRM was down GGUS:61634.
4.3 RAL: SRM tests were failing occasionally.


[Figure: site availability plots per experiment, annotated with the item numbers (0.1 to 4.3) used in the analysis that follows]


Analysis of the availability plots

COMMON TO ALL EXPERIMENTS
0.1 NIKHEF: SRMv2 test failures. GGUS:61266. Issue related to GGUS:61265 → The database behind FTS service at SARA is down due to a corrupted Oracle database.
0.2 SARA-MATRIX: The database behind FTS service at SARA is down due to a corrupted Oracle database. Unscheduled downtime. GGUS:61265.
0.3 TAIWAN: Unscheduled downtime. Restoring CASTOR stager db. GGUS:61683 & 61725.

ATLAS
1.1 BNL: new SAM ATLAS CE specific tests developed, tested and deployed for OSG sites: the unavailability observed is not due to the site but to an issue related to certificate’s mapping to high priority queue.
   • ATLAS proposes these CE tests to be used from now on to compute ATLAS specific availability.

ALICE
2.1 FZK: VOBOX related problems. The user proxy registration and delegation procedures were not working.
2.2 RAL: CE tests were failing occasionally. Various "communication error on send" / "host not known" errors.

CMS
3.1 PIC: SRMv2 tests were failing occasionally. GGUS:61630.
3.2 IN2P3-CC: CE-cms-analysis test failures.

LHCb
4.1 CNAF: LFC tests were failing on Monday and SRM tests were failing throughout the week.
4.2 PIC: CE and SRM test failures. SRM 'disk full / space token is full' errors.
4.3 RAL: SRM tests were failing throughout the week.


Summary

• (Largely) smooth operation, both over the last 2 weeks and over most of the summer (despite concerns over potential absences)
• New SAM ATLAS CE specific tests developed, tested and deployed for OSG sites.
   • ATLAS proposes these tests to be used from now on to compute ATLAS specific availability.
• SIRs expected both from the NL-T1 incident and from ASGC, from whom regular attendance at the daily meeting should be re-established asap
• The number of SIRs during this quarter has been lower than the recent average, but prolonged site downtimes continue
