CERN Data Services Update HEPiX 2004 / NeSC Edinburgh Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith


CERN Data Services Update

HEPiX 2004 / NeSC Edinburgh

Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith


2004/05/26 CERN Data Services: [email protected] 2 of 19

Outline

Data Services Drivers
Disk Service
    Migration to Quattor / LEMON
    Future directions
Tape Service
    Media migration
    Future directions
Grid Data Services


Data Flows Tier-0 / Tier-1 for the LHC

Data Challenges:
    CMS: DC04 (finished), PCP05 (Autumn)   +80; +170
    ALICE ongoing                          +137 TB
    LHCb ramping up                        +40 TB
    ATLAS ramping up                       +60 TB

Fixed Target Programme:
    NA48 at 80 MB/s                        +200 TB
    COMPASS at 70 MB/s (peak 120)          +625 TB
    nToF at 45 MB/s                        +180 TB
    NA60 at 15 MB/s                        +60 TB
    Testbeams at 1~5 MB/s (x 5)

Analysis…
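To put the fixed-target figures in proportion, the sketch below converts each quoted volume and sustained rate into the number of days of continuous writing it represents. It assumes decimal units (1 TB = 1,000,000 MB) and that the quoted rate is held constant; neither assumption comes from the slides, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # assumption: decimal units


def days_to_write(volume_tb: float, rate_mb_s: float) -> float:
    """Days of continuous writing needed to store volume_tb at rate_mb_s."""
    return volume_tb * MB_PER_TB / rate_mb_s / SECONDS_PER_DAY


# (volume in TB, sustained rate in MB/s) as quoted on the slide
FIXED_TARGET = {
    "NA48": (200, 80),
    "COMPASS": (625, 70),
    "nToF": (180, 45),
    "NA60": (60, 15),
}

# Sum of the four quoted fixed-target volumes
total_tb = sum(volume for volume, _ in FIXED_TARGET.values())

if __name__ == "__main__":
    for name, (volume_tb, rate) in FIXED_TARGET.items():
        print(f"{name}: {days_to_write(volume_tb, rate):.1f} days at {rate} MB/s")
    print(f"Fixed-target total: {total_tb} TB")
```

Under these assumptions COMPASS alone corresponds to roughly three and a half months of uninterrupted data taking at its nominal rate.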


Disk Server Functions

AFS                             1%
Oracle                          8%
CASTOR: Experiment dedicated   64%
CASTOR: Infrastructure          9%
CASTOR: Public Services         4%
LCG                            14%


Generations

0th: Jumbos
1st & 2nd: 4U
3rd & 4th: 8U


Warranties

[Chart: number of disk servers (0 to 400) versus date (Jan-00 to Jan-09), by purchase batch, grouped into 0th to 4th generation, with an "Out of warranty" marker. Batches in the legend: ELONEX 2.4GHz (x2), ELONEX 2.0GHz, ELONEX 1.1GHz, JTT 1.1GHz, ELONEX 1GHz (x2), JTT 1GHz, ELONEX 900 (x3), TECH 800, ELONEX 700, ELONEX 650, COGESTRA 500, ELONEX 500 (x2), COGESTRA 450.]


Disk Servers: Jan 2004

370 EIDE Disk Servers
    Commodity storage in a box
    544 TB of disk capacity
    6700 spinning disks
Storage Configuration
    HW Raid-1 mirrored for “maximum reliability”
    ext2 file systems
Operating systems
    RH6.1, 6.2, 7.2, 7.3, RHES
    13 different kernels
Application uniformity; CASTOR SW


Quattor-ising

Motivation: Scale
    Uniformity; Manageability; Automation
Configuration Description (into CDB)
    HW and SW; nodes and services
Reinstallation
    Production machines – min service interruption!
    Eliminate peculiarities from CASTOR nodes
        MySQL, web servers
    Refocus root control
    Quiescing a disk server ≠ draining a batch node!
    Gigabit cards gymnastics (ext2 -> ext3)
Complete (except 10 RH6 boxes for Objectivity)


LEMON-ising

MSA everywhere
    Linux box monitoring and alarms
    Automatic HW static checks
Adding
    CASTOR server specific
    Service monitoring
HW Monitoring
    lm_sensors (see tape section)
    smartmontools
        smartd deployment
        Kernel issues; firmware bugs; through 3ware controller
        smartctl auto checks; predictive monitoring
IPMI investigations; especially remote access
    Remote reset/power-on/power-off
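As an illustration of what an automated smartctl check through a 3ware controller might look like, the sketch below builds the command line and interprets the overall-health verdict. It is not the team's actual script: the controller device path, the port numbers and the function names are assumptions, and the parser only recognises the overall-health line that smartctl prints for ATA disks.

```python
import re
import subprocess
from typing import List, Optional

# Line printed by `smartctl -H` for ATA disks
HEALTH_LINE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\w+)"
)


def build_command(controller_device: str, port: int) -> List[str]:
    """-H requests the health verdict; -d 3ware,N selects disk N on the controller."""
    return ["smartctl", "-H", "-d", f"3ware,{port}", controller_device]


def parse_health(output: str) -> Optional[bool]:
    """True if PASSED, False for any other verdict, None if no verdict was found."""
    match = HEALTH_LINE.search(output)
    if match is None:
        return None
    return match.group(1) == "PASSED"


def check_disk(controller_device: str, port: int) -> Optional[bool]:
    """Run smartctl for one disk; needs smartmontools installed and root access."""
    result = subprocess.run(
        build_command(controller_device, port),
        capture_output=True, text=True, check=False,
    )
    return parse_health(result.stdout)


if __name__ == "__main__":
    # Hypothetical layout: one controller at /dev/twe0 with eight ports.
    for port in range(8):
        print(port, check_disk("/dev/twe0", port))
```

A `None` result is kept distinct from `False` so that a disk smartctl could not query (kernel or firmware trouble) is not mistaken for a disk that reported a failure.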


Disk Replacement

Failure rate unacceptably high
    10 months to be believed
    4 weeks to execute
1224 disks exchanged (out of 6700)
    And the cages
Western Digital; type DUA
    Head instabilities

[Chart: % broken mirrors (0.0% to 4.5%), Dec-03 to May-04]
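The scale of the exchange is easier to see as ratios. The short calculation below uses only numbers quoted on the slides (1224 of 6700 disks, 370 servers, 4 weeks); the assumption that the work was spread evenly over the four weeks is mine.

```python
TOTAL_DISKS = 6700      # spinning disks, Jan 2004
EXCHANGED_DISKS = 1224  # disks exchanged in the campaign
DISK_SERVERS = 370      # EIDE disk servers
CAMPAIGN_WEEKS = 4      # "4 weeks to execute"

# Share of the whole disk population that was swapped
fraction_exchanged = EXCHANGED_DISKS / TOTAL_DISKS

# Average number of disks per server across the farm
disks_per_server = TOTAL_DISKS / DISK_SERVERS

# Average swap rate, assuming an even spread over the campaign
exchanged_per_week = EXCHANGED_DISKS / CAMPAIGN_WEEKS

if __name__ == "__main__":
    print(f"Exchanged: {fraction_exchanged:.1%} of all disks")
    print(f"Average disks per server: {disks_per_server:.1f}")
    print(f"Average exchange rate: {exchanged_per_week:.0f} disks/week")
```

Roughly one disk in five was replaced, at an average of about 300 disks a week.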


Disk Storage Futures

EIDE Commodity storage in a box
    Production systems
        HW Raid-1 / ext3
    Pilots (15 production systems)
        HW Raid-5 + SW Raid-0 / XFS (See Jan Iven’s talk next)
New tenders out…
    30TB SATA in a box
    30TB external SATA disk arrays
New CASTOR stager (see Olof’s talk)


Tape Service

70 tape servers (Linux)
(mostly) Single FibreChannel attached drives
2 symmetric robotic installations
    5 x STK 9310 Silos in each

Type      Drives    Media
9940B         50    14157
9940A          4     8889
9840          20     8149
3590          14     8639
LTO            6


Tape Server Temperatures

lm_sensors package
    General SMBus access and hardware monitoring.
Used to access
    LM87 chip
        Fan speeds
        Voltages
        Int/Ext temperatures
    ADM1023 chip
        Int/Ext temperatures
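A minimal sketch of how temperature readings from the lm_sensors `sensors` command could be turned into an alarm condition. The sample text is illustrative and not taken from a CERN tape server; labels such as `temp1` depend on the chip and the local sensors configuration, and the 50 °C limit is an arbitrary example threshold.

```python
import re
from typing import Dict, List

# Matches lines such as "temp1:    +38.0°C  (high = +60.0°C)"
TEMP_LINE = re.compile(
    r"^\s*(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)\s*°?C\b"
)

# Illustrative output only; real labels and layout vary by chip and version.
SAMPLE = """\
lm87-i2c-0-2e
Adapter: SMBus adapter (illustrative)
fan1:      4500 RPM
Vcore:    +1.50 V
temp1:    +38.0°C  (high = +60.0°C)
temp2:    +52.5°C  (high = +60.0°C)
"""


def parse_temperatures(sensors_output: str) -> Dict[str, float]:
    """Return {label: degrees Celsius} for every temperature line found."""
    temps: Dict[str, float] = {}
    for line in sensors_output.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            temps[match.group("label").strip()] = float(match.group("value"))
    return temps


def too_hot(temps: Dict[str, float], limit_c: float) -> List[str]:
    """Labels whose reading exceeds limit_c, sorted by name."""
    return sorted(label for label, value in temps.items() if value > limit_c)


if __name__ == "__main__":
    readings = parse_temperatures(SAMPLE)
    print(readings)
    print("Over limit:", too_hot(readings, limit_c=50.0))
```

Fan and voltage lines are skipped because they do not end in a Celsius reading; a monitoring sensor would feed the live `sensors` output to `parse_temperatures` instead of `SAMPLE`.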


[Figure: Tape Server Temperatures]


Media Migration

To 9940B (mainly from 9940A)
    200GB – extra capacity avoids unnecessary acquisitions
    Better performance – though hard to benefit in normal chaotic mode
    Reduced errors; fewer interventions
1-2% of A tapes can not be read (extremely slow) on B drives
    Have not been able to return all A-drives
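The "extra capacity" argument can be made concrete with a rough repack estimate. The 200 GB figure is from the slide; the 60 GB native capacity for 9940A is the commonly published figure and is an assumption here, as is the simplification that every 9940A cartridge is full. The count of 8889 cartridges is the 9940A media figure from the Tape Service slide.

```python
import math


def cartridges_after_repack(old_count: int, old_gb: int, new_gb: int) -> int:
    """Cartridges needed to hold the same data at the new per-cartridge capacity."""
    total_gb = old_count * old_gb
    return math.ceil(total_gb / new_gb)


CARTRIDGES_9940A = 8889  # 9940A media count from the Tape Service slide
CAPACITY_9940A_GB = 60   # assumption: published native capacity, not in the slides
CAPACITY_9940B_GB = 200  # from the slide

needed = cartridges_after_repack(
    CARTRIDGES_9940A, CAPACITY_9940A_GB, CAPACITY_9940B_GB
)
freed = CARTRIDGES_9940A - needed

if __name__ == "__main__":
    print(f"Data held: {CARTRIDGES_9940A * CAPACITY_9940A_GB / 1000:.0f} TB")
    print(f"Cartridges needed at 200 GB: {needed}")
    print(f"Cartridges freed for reuse: {freed}")
```

Because partly filled tapes hold less than the native capacity, this is an upper bound on the data moved and therefore on the cartridges needed after repacking.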


Tape Service Developments

Removing tails…
    Tracking of all tape errors (18 months)
    Retiring of problematic media
    Proactive retiring of heavily used media (>5000 mounts)
        repack on new media
Checksums
    Populated writing to tape
    Verified loading back to disk
    22% already after few weeks
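The populate-on-write, verify-on-recall pattern can be sketched as follows. The slides do not name the checksum algorithm; Adler-32 via Python's `zlib` is used here purely as an assumption, and the in-memory `catalogue` dictionary stands in for wherever the checksum is actually recorded.

```python
import io
import zlib
from typing import BinaryIO, Dict


def stream_checksum(stream: BinaryIO, chunk_size: int = 1 << 20) -> int:
    """Adler-32 of a stream, computed chunk by chunk to bound memory use."""
    value = 1  # Adler-32 starting value
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def record_on_write(catalogue: Dict[str, int], name: str, stream: BinaryIO) -> None:
    """Populate the checksum while the file is being written to tape."""
    catalogue[name] = stream_checksum(stream)


def verify_on_recall(catalogue: Dict[str, int], name: str, stream: BinaryIO) -> bool:
    """Recompute the checksum when the file is loaded back to disk and compare."""
    return catalogue.get(name) == stream_checksum(stream)


if __name__ == "__main__":
    catalogue: Dict[str, int] = {}
    payload = b"event data " * 1000
    record_on_write(catalogue, "run001.raw", io.BytesIO(payload))
    print(verify_on_recall(catalogue, "run001.raw", io.BytesIO(payload)))
    print(verify_on_recall(catalogue, "run001.raw", io.BytesIO(payload[:-1])))
```

Files written before checksumming was enabled have no catalogue entry, so `verify_on_recall` returns `False` for them; a production system would treat "no checksum yet" as a separate case, which is consistent with coverage growing gradually as the slide reports.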


Water Cooled Tapes!

Plumbing error!
    5000 tapes disabled for a few days
    550 superficially wet
    152 seriously wet – visually inspected


Tape Storage Futures

Commodity drive studies
    LTO-2 (Collaboratively CASPUR/Valencia)
Test and evaluate High-end drives
    IBM 3592
    STK NGD
Other STK offerings
    SL8500 robotics and silos
    Indigo; managed storage, tape virtualisation


GRID Data Management

GridFTP + SRM servers
    (Former) Standalone / experiment dedicated
        Hard to intervene; not scalable
    New load-balanced 6 node Service
        castorgrid.cern.ch
        SRM modifications to support operation behind load balancer
    GridFTP standalone client
Retire ftp and bbftp access to CASTOR
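The slides do not say how the load balancing behind castorgrid.cern.ch is done. One common approach, shown here only as a sketch under that caveat, is to publish the least-loaded subset of nodes behind the service name. The node names and load values below are hypothetical.

```python
from typing import Dict, List


def least_loaded(loads: Dict[str, float], count: int) -> List[str]:
    """Return the `count` nodes with the lowest load; ties broken by name."""
    ranked = sorted(loads.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:count]]


# Hypothetical load metrics for a six-node service
EXAMPLE_LOADS = {
    "node1": 0.80,
    "node2": 0.15,
    "node3": 0.40,
    "node4": 0.15,
    "node5": 0.95,
    "node6": 0.30,
}

if __name__ == "__main__":
    # Nodes that would be advertised behind the service name
    print(least_loaded(EXAMPLE_LOADS, count=2))
```

Whatever the mechanism, a client may reach a different node on each connection, which is presumably why the SRM servers needed modification to operate behind the balancer.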