GGUS summary (7 weeks)


Page 1: GGUS summary  (7 weeks)

GGUS summary (7 weeks)

VO      User   Team   Alarm   Total
ALICE
ATLAS
CMS
LHCb
Totals


To calculate the totals for this slide and copy/paste the usual graph please:

1. Take the summary from the table on pages:
   https://gus.fzk.de/download/wlcg_metrics/html/20110718_escalationreport_wlcg.html
   https://gus.fzk.de/download/wlcg_metrics/html/20110725_escalationreport_wlcg.html
2. Copy the file https://twiki.cern.ch/twiki/pub/LCG/WLCGOperationsMeetings/ggus-tickets.xls locally and add the 2 lines for 18-Jul and 25-Jul. Re-upload the .xls on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3. Add up the last 7 weeks, starting 13-Jun (included), and put them in this table.
4. Copy/paste the graph from the .xls file of point 2 above.
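Step 3 of the recipe above (adding up the last 7 weeks per VO) could also be scripted instead of done by hand in the spreadsheet. The sketch below is only an illustration: the row layout (`week_start`, `vo`, `user`, `team`, `alarm` keys) is an assumption for the example, not the actual structure of ggus-tickets.xls.

```python
# Sketch of step 3: sum the last 7 weeks of ticket counts per VO.
# ASSUMPTION: the ggus-tickets.xls data has been exported to rows with
# hypothetical keys week_start (YYYY-MM-DD), vo, user, team, alarm;
# the real spreadsheet layout may differ.
from collections import defaultdict
from datetime import date, timedelta

def summarise(rows, first_week=date(2011, 6, 13), n_weeks=7):
    """Return {vo: {'user': n, 'team': n, 'alarm': n, 'total': n}} for the window."""
    last_day = first_week + timedelta(days=7 * n_weeks - 1)
    totals = defaultdict(lambda: {"user": 0, "team": 0, "alarm": 0, "total": 0})
    for row in rows:
        week = date.fromisoformat(row["week_start"])
        if not (first_week <= week <= last_day):
            continue  # outside the 7-week window starting 13-Jun
        for kind in ("user", "team", "alarm"):
            totals[row["vo"]][kind] += int(row[kind])
            totals[row["vo"]]["total"] += int(row[kind])
    return dict(totals)

# Example with made-up in-memory rows (not real GGUS numbers):
rows = [
    {"week_start": "2011-06-13", "vo": "ATLAS", "user": 3, "team": 10, "alarm": 1},
    {"week_start": "2011-06-20", "vo": "ATLAS", "user": 2, "team": 8, "alarm": 2},
    {"week_start": "2011-06-06", "vo": "CMS", "user": 5, "team": 1, "alarm": 0},  # before 13-Jun: ignored
]
print(summarise(rows))  # ATLAS only: user 5, team 18, alarm 3, total 26
```

The weekly rows can come from any export of the .xls (e.g. a CSV read with `csv.DictReader`); only the per-VO sums matter for the table.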

Page 2: GGUS summary  (7 weeks)

04/21/23 WLCG MB Report WLCG Service Report 2

Support-related events since last MB

NB!!! CHECK IF THERE ARE MORE ALARMS BETWEEN 13-24 July & adjust totals below!!!

• There were 11 real ALARM tickets since the 2011/06/07 MB (7 weeks): 9 submitted by ATLAS, 2 by CMS; all ‘solved’, some even ‘verified’; 10 of them for CERN and 1 for CNAF.
• The first 5 ALARM tickets for CERN did not generate the required email notification to the CERN operators and experts on call! This was due to a switch of the sender’s email address from [email protected] to [email protected], which happened with the 2011/05/25 GGUS Release due to the new exim mailer at KIT.
• This was solved in the week of 2011/06/27 by including this new email address in the admins of the CERN [VO][email protected] e-groups.
• All test ALARMs following the 2011/07/06 release were successful.

Details follow…

Page 3: GGUS summary  (7 weeks)

ATLAS ALARM->CERN SRM connections fail GGUS:71471


What time UTC What happened

2011/06/12 16:54 SUNDAY & WHIT MONDAY

GGUS TEAM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/06/12 16:58 Submitter adds reference to dashboard link with details.

2011/06/12 17:28 Submitter escalates ticket to ALARM. Email notification recorded as ‘Sent to [email protected]’ but no email received by the e-group members (operators & service mgrs)!!!

2011/06/12 19:50 Service mgr starts investigation (GGUS-SNOW mapping takes care of bypassing the helpdesk outside working hours and notifying the service mgrs’ list).

2011/06/12 20:27 – 2011/06/13 17:21 (8 comments exchanged)

Service mgr records a load-related lack of available frontend threads. Submitter and other Atlas members acknowledge FTS config. may need to be restored to pre-CASTOR/EOS migration values. Related GGUS:71328

2011/06/14 07:30 Submitter records ‘FTS settings reviewed’ & sets to ‘solved’ and ‘verified’.

Page 4: GGUS summary  (7 weeks)

ATLAS ALARM->CERN SRM many errors GGUS:71715


What time UTC What happened

2011/06/20 15:25 GGUS TEAM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/06/20 16:16 Service mgr starts investigation.

2011/06/20 16:50 Problem appears to be due to FTS killing transfers.

2011/06/20 17:12 A supporter escalates ticket to ALARM. Email notification recorded as ‘Sent to [email protected]’ but no email received by the e-group members (operators & service mgrs)!!!

2011/06/20 17:19 – 2011/06/21 10:19 (23 comments exchanged)

Service mgrs, submitter and other shifters & supporters supply dashboard data and more debug info. Internal SNOW escalation records misleading entries in the GGUS diary. This is followed-up via a SNOW development Request.

2011/06/21 11:07 Service mgr records the FTS DB clean-up working at a time of high transfer load caused the timeouts. Configured a lower rate of clean-up and set the ticket to ‘solved’.

2011/06/21 13:07 Supporter sets the ticket to status ‘verified’.

Page 5: GGUS summary  (7 weeks)

ATLAS ALARM->CERN Castor timeouts GGUS:71904


What time UTC What happened

2011/06/24 11:34 GGUS ALARM ticket, automatic email notification sent to [email protected], but no email received by the e-group members (operators & service mgrs)!!! Automatic GGUS assignment to ROC_CERN successful. Automatic SNOW ticket creation successful.

2011/06/24 11:58 Service mgr starts investigation (due to the assignment of the relevant SNOW ticket to Castor).

2011/06/24 12:28 A stuck job was found to have locked the whole Atlas stager DB. This job was removed and the service was restored.

2011/06/24 15:51 Service mgr sets the ticket to ‘solved’.

2011/06/24 16:24 Submitter sets the ticket to ‘verified’.

Page 6: GGUS summary  (7 weeks)

CMS ALARM->CERN job stageout errors GGUS:71934


What time UTC What happened

2011/06/26 14:31 SUNDAY

GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/06/26 14:43 Submitter adds a long list of email addresses in Cc.

2011/06/26 14:49 Submitter emails [email protected]. Email notification recorded as ‘Sent to [email protected]’ but no email received by the e-group members (operators & service mgrs)!!!

2011/06/26 15:20 Service mgrs start investigation.

2011/06/26 15:45 – 2011/06/26 21:07 (8 comments exchanged)

Service mgr records 2 out of 3 Castor headnodes are in trouble (readonly FS, stuck rsyslog daemon, files appearing to have zero size). Moved tape functionality to another machine. Service restarted very slowly.

2011/06/26 21:33 Submitter records ‘unclear if the stuck rsyslog was the reason’ & sets to ‘solved’ and ‘verified’.

Page 7: GGUS summary  (7 weeks)

CMS ALARM-> CERN CASTOR POOL FULL GGUS:71969


What time UTC What happened

2011/06/27 13:48 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/06/27 14:10 Service mgr records in the ticket that investigation has started. The email notification to operators was not yet fixed at that point.

2011/06/27 17:35 Submitter records in the ticket info received by phone. Castor can’t perform with intense pool use and high rate of file deletions.

2011/06/28 07:47 Service mgr puts ticket to status ‘solved’. Work on-going for garbage collection optimisation.

Page 8: GGUS summary  (7 weeks)

ATLAS ALARM-> CERN EXPORT FAILS WITH FTS ERRORS GGUS:71985


What time UTC What happened

2011/06/27 21:04 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Email notification to operators was not yet fixed at that point.

2011/06/27 21:40 Service mgr records in the ticket that operators should be asked to call the FTS expert on call.

2011/06/28 01:22 Night shifter escalates ticket in GGUS. This sends a reminder email to the relevant Support Unit (ROC_CERN).

2011/06/28 03:21 Expert acknowledges ticket reception.

2011/06/28 04:15 Expert records agents were stopped to clean-up.

2011/06/28 08:05 Submitter reports jobs are stuck in FTS for hours.

2011/06/28 09:19 – 13:00 (3 comments)

Expert checks again and puts the ticket into status ‘solved’ with diagnostic: /var/tmp was filling up too quickly. The reason was the clean-up job failing, since the FTS uid had recently become global by mistake.

Page 9: GGUS summary  (7 weeks)

ATLAS ALARM-> CERN T0MERGE WRITING ERRORS GGUS:72132


What time UTC What happened

2011/07/01 06:57 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Email notification to operators was working again!

2011/07/01 07:11 Operator records in the ticket that the Castor piquet was called.

2011/07/01 07:35 First diagnostic recorded in the ticket. Filesystem full, no space left on device. Castor expert on piquet implements work-around, by which each re-try will try a different filesystem.

2011/07/01 21:13 Service mgr puts the ticket into status ‘solved’.

2011/07/02 05:02 SATURDAY

Submitter puts the ticket into status ‘verified’.

Page 10: GGUS summary  (7 weeks)

ATLAS ALARM-> CERN NO SPACE LEFT ON DEVICE IN POOLS GGUS:72218


What time UTC What happened

2011/07/04 13:42 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Email notification to operators worked as expected!

2011/07/04 13:48 Castor expert records that investigation started on atlt3 filesystems appearing DISABLED!

2011/07/04 13:49 Operator records in the ticket that the Castor piquet was called.

2011/07/04 14:51 Service mgr puts the ticket into status ‘solved’. Explanations include Oracle errors (then under investigation) and configuration problems (repaired).

2011/07/04 15:00 Submitter puts the ticket into status ‘verified’.

Page 11: GGUS summary  (7 weeks)

ATLAS ALARM-> CERN CASTOR POOLS’ WRITING HANGS GGUS:72262


What time UTC What happened

2011/07/05 13:10 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/07/05 13:22 Operator records in the ticket that the Castor piquet was called.

2011/07/05 14:06 Service expert puts the ticket into status ‘solved’. There was a hardware issue with the load-balanced head-nodes. A xroot-castor plugin was also upgraded with the occasion.

2011/07/05 14:11 Submitter puts the ticket into status ‘verified’.

Page 12: GGUS summary  (7 weeks)

ATLAS ALARM-> CNAF MONITORING SHOWS ZERO SPACE ON DATATAPE GGUS:72473


What time UTC What happened

2011/07/08 21:55 GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to NGI_IT.

2011/07/09 06:56 SATURDAY

Site mgr records in the ticket that a problem with info provider should by now be fixed.

2011/07/09 14:13 Shifter records errors from DDM dashboard.

2011/07/09 14:35 Shifter upgrades TEAM ticket into an ALARM. Email sent to [email protected].

2011/07/09 18:02 Site admin (?) records they are checking.

2011/07/11 06:07 Automatic (?) warning about non-authorised ALARM raising (?)

2011/07/11 14:02 Site admin puts the ticket in status ‘solved’ with explanation ‘storm misconfiguration fixed’.

Page 13: GGUS summary  (7 weeks)

ATLAS ALARM-> CERN CASTOR NO ACCESS TO FILE GGUS:72528


What time UTC What happened

2011/07/11 16:01 GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.

2011/07/11 16:46 Shifter upgrades TEAM ticket into an ALARM. Email sent to [email protected]

2011/07/11 17:05 Expert records in the ticket that the file is on an unavailable server and the incident doesn’t qualify for an ALARM.

2011/07/11 17:22 Operator records in the ticket that the Castor piquet is called.

2011/07/11 17:51 Service mgr puts the ticket in status ‘solved’ with explanation ‘the file server is a faulty box, discussed at the WLCG daily meeting already, which is given to the vendor for repair’.

Page 14: GGUS summary  (7 weeks)

VONAME ALARM->SITE SERVICE GGUS:XXXXX


What time UTC What happened

2011/xx/yy xx:yy GGUS ALARM ticket opened, automatic email notification to Mailing_list_name AND automatic assignment to ROC_or_NGI_name

2011/xx/yy xx:yy Comment on acknowledgment. There may be several rows on operator-to-service-mgr notification. Investigation.

2011/xx/yy xx:yy Problem traced down to [put the Diagnosis here]. Service mgr puts ticket to ‘solved’.

2011/xx/yy xx:yy Submitter puts ticket to status ‘verified’.