Lesson learned after our recent cooling problem Michele Onofri, Stefano Zani, Andrea Chierici HEPiX Spring 2014

Lesson learned after our recent cooling problem

Michele Onofri, Stefano Zani, Andrea Chierici

HEPiX Spring 2014

Outline

• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions

21/05/2013

INFN-T1 on-call procedure

On-call service

• CNAF staff are on call on a weekly basis
  – 2 or 3 times per year
  – Must live within 30 min of CNAF
  – Service phone receives alarm SMSes
  – Periodic training on security and intervention procedures
• 3 incidents in the last three years
  – only this last one required the site to be totally powered off

Service Dashboard

Incident

What happened on the 9th of March

• 1.08am: fire alarm
  – On-call person intervenes and calls the firefighters
• 2.45am: fire extinguished
• 3.18am: high temperature warning
  – Air conditioning blocked
  – On-call person calls for help
• 4.40am: decision is taken to shut down the center
• 12.00pm: chiller under maintenance
• 5.00pm: chiller fixed, center can be turned back on
• 9.00pm: farm back on-line, waiting for storage

10th of March

• 9.00am: support call to switch storage back on
• 6.00pm: center open again for LHC experiments

• Next day: center fully open again
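The timestamps on the two timeline slides can be turned into elapsed times with a few lines of Python. This is only a sketch: the year is an assumption (the slides give day and month only), and the event labels and the `elapsed` helper are hypothetical shorthand for the bullets above.

```python
from datetime import datetime, timedelta

YEAR = 2014  # assumption: the slides only give day and month

# Timestamps copied from the timeline bullets (local time, 24-hour clock).
EVENTS = {
    "fire alarm": datetime(YEAR, 3, 9, 1, 8),
    "fire extinguished": datetime(YEAR, 3, 9, 2, 45),
    "high temperature warning": datetime(YEAR, 3, 9, 3, 18),
    "shutdown decision": datetime(YEAR, 3, 9, 4, 40),
    "chiller fixed": datetime(YEAR, 3, 9, 17, 0),
    "farm back on-line": datetime(YEAR, 3, 9, 21, 0),
    "open for LHC experiments": datetime(YEAR, 3, 10, 18, 0),
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two named events."""
    return EVENTS[end] - EVENTS[start]


if __name__ == "__main__":
    for start, end in [
        ("fire alarm", "shutdown decision"),
        ("shutdown decision", "chiller fixed"),
        ("shutdown decision", "open for LHC experiments"),
    ]:
        print(f"{start} -> {end}: {elapsed(start, end)}")
```

With these timestamps, the shutdown decision came 3 h 32 min after the fire alarm, and the center reopened for the LHC experiments 37 h 20 min after that decision.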

Chiller power supply

Incident representation

[Diagram: Chillers 1 to 6 and the control system head; the chiller control systems (Ctrl sys) are fed by two power supplies, Pow 1 and Pow 2]

Incident examination

• 6 chillers for the computing room
• 5 share the same power supply for the control logic (we did not know that!)
• A fire in one of the control logic units cut power to 5 chillers out of 6
  – 1 chiller was still working and we weren’t aware of that!
  – Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
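The shared control-logic supply is a single point of failure, and that kind of dependency can be checked mechanically once it is written down. The sketch below is a hypothetical model, not the site's tooling: it maps each chiller to the supply feeding its control logic and reports what survives when one supply fails. The slides do not say which chiller sat on the separate supply, so chiller 6 is an assumption, and the function names are illustrative.

```python
from collections import Counter

# Assumption: chillers 1-5 share "Pow 1"; chiller 6 is the one on "Pow 2".
SHARED = {1: "Pow 1", 2: "Pow 1", 3: "Pow 1", 4: "Pow 1", 5: "Pow 1", 6: "Pow 2"}

# One control-logic supply per chiller, for comparison.
DEDICATED = {n: f"Pow {n}" for n in range(1, 7)}


def surviving_chillers(topology, failed_supply):
    """Chillers whose control logic is not fed by the failed supply."""
    return sorted(c for c, supply in topology.items() if supply != failed_supply)


def worst_case_loss(topology):
    """Largest number of chillers lost when any single supply fails."""
    return max(Counter(topology.values()).values())


if __name__ == "__main__":
    print("survivors after Pow 1 fails:", surviving_chillers(SHARED, "Pow 1"))
    print("worst case, shared supply:", worst_case_loss(SHARED))
    print("worst case, dedicated supplies:", worst_case_loss(DEDICATED))
```

Running the same check against any documented power topology would flag a shared supply like this one before it fails.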

Facility monitoring app

Chiller n.4

[Plot legend: BLACK: electric power in (kW); BLUE: water temp. in (°C); YELLOW: water temp. out (°C); CYAN: Ch. room temp. (°C)]
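The quantities plotted for chiller n.4 are enough to answer the question of whether a given chiller is actually working. A minimal sketch, assuming a chiller counts as working when it draws power and returns water colder than it receives; the thresholds, the `ChillerSample` type and the sample readings are all hypothetical and are not taken from the facility monitoring app.

```python
from dataclasses import dataclass

MIN_POWER_KW = 5.0  # hypothetical: below this the chiller is considered off
MIN_DELTA_C = 1.0   # hypothetical: minimum inlet-to-outlet temperature drop


@dataclass
class ChillerSample:
    power_kw: float     # electric power in
    water_in_c: float   # water temperature entering the chiller
    water_out_c: float  # water temperature leaving the chiller


def is_cooling(sample: ChillerSample) -> bool:
    """True if the chiller draws power and cools the water it receives."""
    delta = sample.water_in_c - sample.water_out_c
    return sample.power_kw >= MIN_POWER_KW and delta >= MIN_DELTA_C


def working_chillers(samples: dict) -> list:
    """Numbers of the chillers that are currently cooling."""
    return sorted(n for n, s in samples.items() if is_cooling(s))


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    readings = {
        4: ChillerSample(power_kw=0.0, water_in_c=18.0, water_out_c=18.0),
        6: ChillerSample(power_kw=120.0, water_in_c=12.0, water_out_c=7.0),
    }
    print("working chillers:", working_chillers(readings))
```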

Incident seen from inside

Incident seen from outside

Recovery Procedure

Recovery procedure

• Facility: support call for an emergency intervention on the chiller
  – recovered the burned bus and control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
  – Switch-on order: LSF server, CEs, UIs
  – For a moment we considered upgrading to LSF 9

Failures (1)

• Old WNs
  – BIOS battery exhausted, configuration reset
    • PXE boot, hyper-threading, disk configuration (AHCI)
  – lost IPMI configuration (30% broken)

Failures (2)

• Some storage controllers were replaced

• 1% of PCI cards (mainly 10 Gbit network) were replaced

• Disks, power supplies and network switches were almost undamaged

What we learned

We fixed our weak point

[Diagram: Chillers 1 to 6 and the control system head; each chiller control system (Ctrl sys) now has its own power supply, Pow 1 to Pow 6]

We miss an emergency button

• Shutting the center down is not easy: a real “emergency shutdown” procedure is missing
  – We could have avoided switching off the whole center if we had had more control
  – Depending on the incident, some services may be left on-line
• The person on call can’t know all the site details

Hosted services

• Our computing room hosts services and nodes outside our direct supervision, over which it is difficult to gain full control
  – We need an emergency procedure for those too
  – We need a better understanding of the SLAs

Conclusions

We benchmarked ourselves

• It took 2 days to get the center back on-line
  – less than one to reopen for the LHC experiments
  – everyone knew what to do
  – All working nodes rebooted with a solid configuration
  – A few nodes were reinstalled and put back on-line in a few minutes

Lesson learned

• We must have clearer evidence of which chiller is working at every moment (the on-call person does not have it right now)
  – The new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure
  – Establish a shutdown order
    • WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
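The shutdown order above can be encoded so that a plan is generated from an inventory rather than remembered under pressure. This is a sketch under stated assumptions, not the task force's procedure: the stage labels, the `shutdown_plan` function and the host names are hypothetical, and grid and non-grid services are treated as a single stage. The `keep_online` argument reflects the earlier point that some services may be left on-line depending on the incident.

```python
# Stages in the order given on the slide: WNs first, network switches last.
SHUTDOWN_ORDER = (
    "worker-node",
    "disk-server",
    "service",         # grid and non-grid services, one stage by assumption
    "bastion",
    "network-switch",
)


def shutdown_plan(inventory, keep_online=frozenset()):
    """Group hosts into ordered shutdown stages.

    inventory maps host name -> stage label; hosts listed in keep_online
    are left out of the plan. Returns a list of (stage, [hosts]) tuples.
    """
    unknown = {stage for stage in inventory.values() if stage not in SHUTDOWN_ORDER}
    if unknown:
        raise ValueError(f"no shutdown stage defined for: {sorted(unknown)}")
    plan = []
    for stage in SHUTDOWN_ORDER:
        hosts = sorted(
            host for host, label in inventory.items()
            if label == stage and host not in keep_online
        )
        if hosts:
            plan.append((stage, hosts))
    return plan


if __name__ == "__main__":
    # Made-up host names, for illustration only.
    example = {
        "sw-01": "network-switch",
        "wn-001": "worker-node",
        "wn-002": "worker-node",
        "ds-01": "disk-server",
        "ce-01": "service",
        "bastion-01": "bastion",
    }
    for stage, hosts in shutdown_plan(example, keep_online={"ce-01"}):
        print(stage, hosts)
```

Rejecting unknown stage labels is deliberate: a host that fits no stage, such as a hosted service outside direct supervision, should be noticed when the inventory is written, not during an emergency.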

Testing shutdown procedure

• The shutdown procedure we are implementing can’t be easily tested
• How to perform a “simulation”?
  – It doesn’t sound right to switch the center off just to prove we can do it safely
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?
