David Britton, 28/May/09.. 2 14 TeV Collisions 27 km circumference 1200 14m 8.36 Tesla SC dipoles...


The Large Hadron Collider at CERN

• 14 TeV collisions
• 27 km circumference
• 1,200 superconducting (SC) dipoles, each 14 m long, 8.36 Tesla
• 8,000 cryomagnets
• 40,000 tons of metal at -271 °C
• 700,000 L liquid He
• 12,000,000 L liquid N2
• 800,000,000 proton-proton collisions/sec


Data from the LHC Experiments

Detectors: ATLAS (7,000 tonnes), CMS (12,500 tonnes), ALICE (10,000 tonnes), LHCb (5,600 tonnes)

Channel counts and raw data rates shown for the four experiments:
• 55 million channels, raw data = 220 MB/s
• 18 million channels, raw data = 100 MB/s
• 1.2 million channels, raw data = 50 MB/s
• 100 million channels, raw data = 320 MB/s

Raw data flow: ~700 MB/s
Total: ~15 PB of data per year

One year’s data from the LHC would fill a stack of CDs 20 km high (for scale: Concorde, 15 km; Mt. Blanc, 4.8 km).
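The figures on this slide can be cross-checked with a little arithmetic. The sketch below sums the four quoted raw data rates and estimates the height of a CD stack holding 15 PB. The CD capacity (700 MB) and thickness (1.2 mm) are assumptions, not values from the slide, so the result only needs to agree with the quoted 20 km in order of magnitude.

```python
# Per-experiment raw data rates quoted on the slide, in MB/s.
RAW_RATES_MB_S = [220, 100, 50, 320]

# Assumptions (not from the slide): typical CD capacity and thickness.
CD_CAPACITY_MB = 700
CD_THICKNESS_MM = 1.2


def total_raw_rate(rates):
    """Sum the per-experiment raw data rates (MB/s)."""
    return sum(rates)


def cd_stack_height_km(data_pb, cd_capacity_mb=CD_CAPACITY_MB,
                       cd_thickness_mm=CD_THICKNESS_MM):
    """Height in km of a stack of CDs holding data_pb petabytes."""
    data_mb = data_pb * 1e9            # 1 PB = 10^9 MB in decimal units
    n_cds = data_mb / cd_capacity_mb   # number of CDs needed
    return n_cds * cd_thickness_mm / 1e6  # mm -> km


print(total_raw_rate(RAW_RATES_MB_S))       # sum of the four quoted rates
print(round(cd_stack_height_km(15), 1))     # stack height for 15 PB
```

Under these assumptions the rates sum to 690 MB/s, consistent with the quoted ~700 MB/s, and the stack comes out at roughly 26 km, the same order as the 20 km on the slide.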


Data Driven Grid Computing


Grid architecture chosen because:

• Costs of maintaining and updating resources more easily shared in a distributed environment.

• Funding bodies can provide local resources and contribute to global goal.

• Easier to build redundancy and fault tolerance and minimise risks from a single point of failure.

• LHC will operate around the clock for 8 months each year. Spanning time zones means that monitoring and support are more readily provided.

Experiments: ALICE, ATLAS, CMS, LHCb


Worldwide LHC Computing Grid


• Tier 0: CERN computer centre (online system, offline farm). Primary Data Store.
• Tier 1: National centres (11 T1 centres), e.g. RAL (UK), France, Italy, Germany, Spain. Reconstruction, Storage, Analysis.
• Tier 2: Regional groups, e.g. ScotGrid, NorthGrid, SouthGrid, London. Simulation, Analysis.
• Institutes, e.g. Glasgow, Edinburgh, Durham
• Workstations


Worldwide Resources

Worldwide: 55 countries, 283 sites, 180,000 CPUs

UK: 23 sites, 20,000 CPUs


How does it work? Components

WLCG FABRIC
• Tier 0, Tier 1, Tier 2

GRID MIDDLEWARE
• Data movement – File Transfer Service (FTS)
• Storage interface – Storage Resource Manager (SRM)
• Authorisation/roles – Virtual Organisation Membership Service (VOMS)
• Metadata/replication – LCG File Catalogue (LFC)
• Batch submission – Workload Management System (WMS)
• Distributed conditions databases – Oracle Streams (3D)

EXPERIMENT FRAMEWORKS
• Grid interfaces (e.g. Ganga)
• Production/analysis systems


How does it work? Workflow

[Diagram: a submitter on a grid UI describes a job in JDL. Components shown: VOMS, WLMS, JS, RB, LFC, BDII, Logging & Bookkeeping, and four sets of Grid Enabled Resources (CPU nodes, storage). Steps are numbered 0 to 11; the labelled ones are: 0 voms-proxy-init; 1 Job Submission; 2 Job Status?; 11 Job Retrieval.]
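The labelled steps of the workflow (proxy creation, job submission, status query, job retrieval) can be sketched as follows. This is a minimal illustration, assuming a gLite WMS command-line client on the grid UI; the VO name, file names, and exact flags are assumptions that should be checked against the local site documentation. The code only builds the JDL text and the command lines; it does not run anything.

```python
def make_jdl(executable, arguments="", outputs=("std.out", "std.err")):
    """Return a minimal JDL job description as text."""
    sandbox = ", ".join(f'"{name}"' for name in outputs)
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        f'StdOutput = "{outputs[0]}";',
        f'StdError = "{outputs[1]}";',
        f'OutputSandbox = {{{sandbox}}};',
    ]
    return "\n".join(lines)


def workflow_commands(vo, jdl_file, job_id_file="jobids.txt"):
    """Command lines for the labelled workflow steps (assumed gLite CLI)."""
    return [
        ["voms-proxy-init", "--voms", vo],                           # step 0
        ["glite-wms-job-submit", "-a", "-o", job_id_file, jdl_file], # step 1
        ["glite-wms-job-status", "-i", job_id_file],                 # step 2
        ["glite-wms-job-output", "-i", job_id_file],                 # step 11
    ]


# "myvo" and "hello.jdl" are hypothetical names used for illustration.
print(make_jdl("/bin/hostname"))
for command in workflow_commands("myvo", "hello.jdl"):
    print(" ".join(command))
```

Keeping the commands as argument lists means they could later be passed to `subprocess.run` unchanged, without shell quoting issues.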


Availability: The UK Tier-1

Availability: the fraction of time the site is up (so even scheduled maintenance counts against this metric).

Target is 97% (achieved).

Measured by SAM tests (Service Availability Monitor).

There are also experiment-specific SAM tests which are more demanding. Example shown here is from ATLAS.

Target is also 97%. Performance is improving but was degraded by the CASTOR mass storage system.
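To make the 97% target concrete, the sketch below converts an availability fraction into the downtime it allows over a period. The function names are illustrative, and the 30-day month is an assumption chosen for round numbers.

```python
HOURS_PER_YEAR = 365 * 24    # 8760 hours in a non-leap year
HOURS_PER_MONTH = 30 * 24    # assumption: a 30-day month


def availability(up_hours, total_hours):
    """Fraction of time the site was up; any downtime counts against it."""
    return up_hours / total_hours


def allowed_downtime_hours(target, period_hours):
    """Hours of downtime (scheduled or not) permitted by a target."""
    return (1.0 - target) * period_hours


print(round(allowed_downtime_hours(0.97, HOURS_PER_YEAR), 1))
print(round(allowed_downtime_hours(0.97, HOURS_PER_MONTH), 1))
```

A 97% target therefore leaves about 263 hours per year, or under a day per month, for all maintenance and failures combined.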


Availability: Full UK Picture


Resilience and Disaster Planning

• The Grid must be made resilient to failures and disasters over a wide scale, from simple disk failures up to major incidents like the prolonged loss of a whole site.

• One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault-tolerance a reality.


Strategy

• Fortifying the Service

• Duplicate services or machines

• Increase the hardware’s capacity (to handle faults)

• Use (good) fault detection

• Implement automatic restarts

• Provide fast intervention

• Fully investigate failures

• Report bugs -> ask for better middleware

Disaster Planning

• Taking control early enough.
• (Pre-)establishing possible options.
• Understanding user priorities.
• Timely action.
• Effective communication.

Hardware; Software; Location


Duplicating Services or Machines

Multiple WMSes
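One way a client can benefit from duplicated services is to try each instance in turn. The sketch below shows that pattern, together with the availability gain expected from replicas. The endpoint names and the `submit` callable are hypothetical, and the availability formula assumes replicas fail independently, which real services sharing a site or network may not.

```python
def submit_with_failover(endpoints, submit):
    """Try each endpoint in order; return (endpoint, result) on first success."""
    errors = {}
    for endpoint in endpoints:
        try:
            return endpoint, submit(endpoint)
        except Exception as exc:   # record the failure and try the next one
            errors[endpoint] = exc
    raise RuntimeError(f"all endpoints failed: {errors}")


def combined_availability(single, replicas):
    """Availability of a service with independent replicas."""
    return 1.0 - (1.0 - single) ** replicas


def fake_submit(endpoint):
    """Stand-in for a real submission call; 'wms1' is pretended to be down."""
    if endpoint == "wms1":
        raise ConnectionError("wms1 unreachable")
    return f"job accepted by {endpoint}"


print(submit_with_failover(["wms1", "wms2"], fake_submit))
print(round(combined_availability(0.90, 2), 4))
```

Under the independence assumption, two instances that are each up 90% of the time give a combined availability of about 99%.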


Hardware Capacity and Fault Tolerance

Examples:

Storage – Use RAID arrays: RAID5 or RAID6 for storage arrays; RAID1 for system disks. Use of hot-spares allows automatic rebuilds.

Memory – Increase memory capacity; use ECC (error-correction-code) memory and monitor for a rise in error correction rate.

Power – Use redundant power supplies connected to different circuits if possible. UPS for critical systems.

Interconnects – Use two or more bonded network connections with cables routed separately.

CPU – Use more powerful machines.

Databases – Use Oracle RACs (Real Application Clusters), which enable multiple servers to access the database simultaneously.

Resilient hardware will help services survive common failure modes and keep them operating until you can replace the component and make the service resilient again.
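The trade-off between the RAID levels mentioned for storage can be illustrated with a short calculation of usable capacity and the number of simultaneous disk failures each level tolerates. This is a sketch of the standard RAID arithmetic; the disk counts and sizes in the example are made up, and RAID1 is modelled as a simple mirror of all disks.

```python
def raid_summary(level, n_disks, disk_tb):
    """Return (usable capacity in TB, tolerated simultaneous disk failures)."""
    if level == "RAID1":
        if n_disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        return disk_tb, n_disks - 1              # every disk holds a full copy
    if level == "RAID5":
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (n_disks - 1) * disk_tb, 1        # one disk's worth of parity
    if level == "RAID6":
        if n_disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        return (n_disks - 2) * disk_tb, 2        # two disks' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical example: 12 x 1 TB storage array, 2 x 1 TB system disks.
print(raid_summary("RAID5", 12, 1.0))
print(raid_summary("RAID6", 12, 1.0))
print(raid_summary("RAID1", 2, 1.0))
```

RAID6 gives up one more disk of capacity than RAID5 in exchange for surviving a second failure, which matters if another disk fails while a rebuild onto a hot-spare is still in progress.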


Fault Detection

• If it can be monitored, monitor it!
• Catch problems early, e.g. with Nagios alarms: load alarms; file systems near to full; certificates close to expiry; failed drives.
• Look for signatures of impending problems to predict component failure.
• Idle disks hide their faults
  – Regular low-level verification runs to push sick drives over the edge
  – Replace early in the failure cycle, so it doesn’t fail during a rebuild…
• Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
  – If you have redundant links, you can replace the faulty one and keep the service going

• Call-out system for problems that impact services
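A "file system near to full" alarm of the kind listed above could look like the sketch below. It follows the Nagios plugin convention of exit status 0 for OK, 1 for WARNING and 2 for CRITICAL; the 80% and 90% thresholds and the message format are assumptions chosen for illustration.

```python
import shutil

# Nagios plugin exit statuses.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction to a Nagios status."""
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def check_filesystem(path, warn=0.80, crit=0.90):
    """Return (status, message) for the file system containing path."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    status = classify_usage(used_fraction, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} is {used_fraction:.0%} full"


status, message = check_filesystem(".")
print(message)
```

When run as a Nagios plugin, the script would print the message and finish with `sys.exit(status)` so that the monitoring server can raise the alarm or trigger a call-out.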


Intervention and Investigation


• Run a 24 x 7 call-out system connected to a pager that is triggered by automatic alarms.
• 2-hour response time for critical failures.
• All incidents are examined to learn lessons: the call-out rate has dropped from 10/day to as low as 1/week.
• Reports are written up on serious incidents (reported to the wLCG so other sites around the world can see).


“Despite everything, disasters will happen”


wLCG weekly operations report, Feb-09

(Taiwan)


Disaster Planning


• Need a Disaster Response plan which is well understood – use it regularly for anything that could turn into a disaster!

• Stage 1: Disaster Potential Identified
  – Informally assess/monitor/set deadlines/do not interfere.
• Stage 2: Possible Disaster
  – Add internal management oversight/formally assess/divert resources.
• Stage 3: Disaster Likely
  – Add external experts and stakeholder representation to oversight.
  – Regular meetings with the experiments.
  – Prepare contingencies; communicate widely.
• Stage 4: Actual Disaster
  – Manage disaster according to high level disaster plan and contingencies identified at Stage-3. Communicate widely.
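The four stages can be read as a simple escalation ladder. The sketch below models them as an enumeration with the actions listed for each stage. The class, function, and constant names are illustrative only; the actual process is a management procedure, not software.

```python
from enum import IntEnum


class Stage(IntEnum):
    POTENTIAL_IDENTIFIED = 1
    POSSIBLE = 2
    LIKELY = 3
    ACTUAL = 4


# Actions per stage, paraphrased from the slide.
ACTIONS = {
    Stage.POTENTIAL_IDENTIFIED: ["informally assess", "monitor",
                                 "set deadlines", "do not interfere"],
    Stage.POSSIBLE: ["add internal management oversight",
                     "formally assess", "divert resources"],
    Stage.LIKELY: ["add external experts and stakeholders to oversight",
                   "hold regular meetings with the experiments",
                   "prepare contingencies", "communicate widely"],
    Stage.ACTUAL: ["manage according to the disaster plan and contingencies",
                   "communicate widely"],
}


def escalate(stage):
    """Move to the next stage, stopping at ACTUAL."""
    return Stage(min(stage + 1, Stage.ACTUAL))


stage = Stage.POTENTIAL_IDENTIFIED
while True:
    print(stage.name, "->", "; ".join(ACTIONS[stage]))
    if stage is Stage.ACTUAL:
        break
    stage = escalate(stage)
```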


Summary

• In the UK we have spent the last 6 years preparing for the LHC data challenge and have deployed 20,000 CPUs as part of a world-wide Grid of 180,000 CPUs: The largest scientific computing Grid in the world.

• The last year has focused on making the service reliable and resilient: Our Tier-1 centre currently delivers 97% availability and our Tier-2 centres average over 90%.

• We have initiated planning to understand the possible responses to a major disaster and to set up a disaster management process to handle such incidents.

• We look forward to the arrival of LHC data!

1/Apr/09

LHC Data