HEPiX Trip Report, Jefferson Laboratory, 9-13 October 2006


Martin Bly – RAL Tier1
HEPSysMan – Cambridge

23 October 2006


Introduction

• Site issues
• Subject talks


Sites: CERN
• Successfully negotiated new LCG-wide licences for Oracle
• All Physics databases now migrated to Oracle RAC hosting
• SLC4 for LHC start up, SLC3 support ends October 2007
• Lemon Alarm System (LAS) replacing SURE
• Central CVS service running well
  – Looking at Subversion
• First Opteron systems in CERN CC
• Insecure mail protocols forbidden/blocked
  – POP/IMAP etc must use SSL
• No compromise on performance of disk servers to get ‘fat’ systems


Sites: FermiLab

• Multiple 10Gb/s connections to Starlight
• Efforts to automate computer security
  – Replace home-grown tools with commercial utilities
• New computer rooms
  – Overhead power and networking
  – Plastic curtains to trap cold air in front of machines
• US-CMS
  – 700TB dCache space
    • Expected to be 2.5PB by autumn 2007
  – 700 node cluster expanding to 1600 nodes
• BlueArc NAS for online storage
  – Expensive…


Sites: GridKa

• Issues with recent Opteron procurement
  – MSI K1-1000D motherboard, AMD 270s
    • BIOS issues, BMC and NIC firmware updates
• Issues with water cooled racks traced to leaks in chillers
• NEC supplying 4500TB storage
  – 28 Storage controllers, RAID 6, 60 file servers
• Report on latest benchmarks
  – Woodcrest performs very well


Sites: NERSC
National Energy Research Scientific Computing Center, Berkeley
• NERSC Global Filesystem (NGF) in production
  – 70TB for project file space (subject of separate talk)
  – Aim to procure ‘just storage’
• 10Gb/s internal/external networks
  – 10Gb/s ‘jumbo’ network
• Cray Hood system
  – 19000+ CPUs, 70TB disk, 102 cabinets
• Nagios for monitoring, extending to the Cray
• Computer room full, need more power, space


Sites: INFN

• 10Gb/s link to GARR backbone
• T2s now at 1Gb/s
• GPFS now robust enough to be adopted by many sites
  – Lustre also being tested by a few sites
• Testing iSCSI
  – Satisfactory but not completely satisfying
  – Looking at new EMC device and home-grown solutions to try and resolve issues


Sites: GSI Darmstadt

• Issues with large storage farm
  – 100/120 nodes failed to boot after move to new racks
    • Had been OK for 6 months previously in old racks
  – Traced to vibration resonance between disk and CPU cooling fans
• Issues with cooling in racks
  – Keep cold and warm air flows separate
    • Blanking plates important


Sites: SLAC

• SLAC now a US-Atlas site
  – Procurements to start soon
• Non-HEP experiment computing building up
• Many old clusters being decommissioned to make space
• Plan for 150/200-node infiniband cluster
  – Model check-pointing is a challenge
• Testing Lustre
• Need to move away from AFS (K4) token passing
  – SSH/K5 with GSSAPI to pass K5 tickets
• New wireless registration scheme to enable users to be contacted should their machine cause problems


Sites: INFN-CNAF

• CPU capacity upgrade delayed while cooling system upgraded after cooling issues during summer

• Using Quattor/Lemon
  – CERN customisations sometimes a problem

• Staying with SLC3 (v3.0.8 supports Woodcrest)

• SLC4 when EGEE moves


Sites: LAL

• VMware still preferred Linux-on-desktop solution

• Installed gLite3 on SL4 without modification

• Using Quattor and Lemon
  – Having removed CERN specifics


Sites: General

• Moving to specifying computing capacity requirements in performance terms for CPUs
  – Needs ‘common’ benchmarking
    • Require vendors to do it (and prove it!)
• Corresponding interest in benchmarking and how to do it so it means something
• 10Gb links now very common
• Big Condor pools in use at some sites
• Waiting for Grid middleware to be ported to SL4
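Specifying capacity in performance terms reduces a procurement target to arithmetic on benchmark scores. A minimal sketch in Python, assuming a hypothetical per-node score and capacity target expressed in the same benchmark unit (the function name and the figures are illustrative, not taken from the talks):

```python
import math

def nodes_required(target_capacity: float, score_per_node: float) -> int:
    """Smallest whole number of nodes whose combined benchmark score meets the target.

    Both arguments must be in the same (hypothetical) benchmark unit.
    """
    if target_capacity < 0 or score_per_node <= 0:
        raise ValueError("target must be non-negative and score must be positive")
    return math.ceil(target_capacity / score_per_node)

# Illustrative figures only: a 500,000-unit target met by nodes scoring 1,800 units each.
print(nodes_required(500_000, 1_800))
```

The answer is only as trustworthy as the per-node score, which is why an agreed benchmark that vendors can run and prove matters.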


Scientific Linux Update
• UK top by download (no stats from mirrors)
• FTP repository moved from GFS to NAS
• New Plone version for scientificlinux.org site
• SL 4.4 Oct 2006 for i386, x86_64
• SL 3.0.8 release candidate available soon
  – Now available…
• Bug fix repositories for SL variants
  – bugfixNN where NN is version
• SL 3.0.8 should be the last of the 3 series
  – Support plan as previously published: till Autumn 2007
• Working on SL5 (installers etc)
  – SL5 alphas to be based on TUV beta releases


Core Services/Infrastructure (1)

• Tale of FermiLab’s run-in with SpamCop
  – SpamCop don’t respond to any requests
    • Takes 24 hrs to ‘fall off’ list
  – Remove bounce messages and verify local addresses
  – Trap obvious spam
  – Have alternative IP addresses for email gateways
• Propose ‘white list’ of HEP sites…


Core Services/Infrastructure (2)

• Service Level Status service
  – CERN tool for displaying the status of services rather than individual nodes
  – Status defined by managers in terms of dependencies and dependants, and what service availability levels mean
  – Services and meta-services
  – Displays Key Performance Indicators of service levels compared to targets
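The idea of deriving a service's status from its dependencies can be illustrated with a short Python sketch. This is not the CERN tool's implementation: the `Service` class, the rule that a service is no more available than its weakest dependency, and all figures are assumptions made for illustration, and the dependency graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    """A service with its own measured availability (0-100) and the services it depends on."""
    name: str
    own_availability: float
    depends_on: list["Service"] = field(default_factory=list)

    def availability(self) -> float:
        # Assumed rule: a service is no more available than its weakest dependency.
        return min([self.own_availability] + [d.availability() for d in self.depends_on])

def status(service: Service, target: float) -> str:
    """Compare a service's effective availability with its target."""
    return "ok" if service.availability() >= target else "degraded"

# Hypothetical example: a batch meta-service built on a database and a file system.
database = Service("database", 99.0)
filesystem = Service("filesystem", 92.0)
batch = Service("batch", 100.0, [database, filesystem])
print(batch.availability(), status(batch, target=95.0))
```

Here the batch meta-service reports as degraded even though its own nodes are healthy, which is the distinction between service status and node status that the slide draws.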


Core Services/Infrastructure (3)

• RT used to manage Installation workflow (SLAC)

• High Availability methods and experiences at GSI

• Scientific Linux Inventory Project (FermiLab)
  – Need to monitor software inventory and hardware of a machine


Compute Clusters & Storage

• Hazards of Fast Tape Drives (JLAB)
  – Is your memory buffer big enough to prevent the tape drive having to stop, rewind and take a run up to speed when more data is available to write?
  – CERN report 100MB/sec using two stage tape serving, with large (8GB) RAM on the L1 caches
• NGF: NERSC’s Global File System (NERSC)
• Benchmark Updates (CERN)
  – Spec.org results unreliable for HEP purposes
    • Don’t match our conditions
  – Requires vendors to use ‘fixed’ configuration of SPEC2000 benchmark
  – HPL used to benchmark ‘power’ performance
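The tape buffer question can be put as a back-of-envelope calculation: a full buffer keeps the drive streaming only until the drive has drained it faster than upstream can refill it. A minimal Python sketch, using the 8GB and 100MB/sec figures quoted above; the 60MB/sec input rate is an assumption chosen for illustration, and the model ignores drive-specific behaviour such as speed matching.

```python
def seconds_until_underrun(buffer_bytes: float, drive_rate: float, fill_rate: float) -> float:
    """Time a full buffer can keep a drive streaming.

    drive_rate: bytes/s the tape drive writes at.
    fill_rate:  bytes/s arriving into the buffer from upstream.
    Returns infinity when data arrives at least as fast as the drive consumes it.
    """
    if buffer_bytes < 0 or drive_rate <= 0 or fill_rate < 0:
        raise ValueError("sizes and rates must be non-negative, drive_rate positive")
    if fill_rate >= drive_rate:
        return float("inf")
    return buffer_bytes / (drive_rate - fill_rate)

GB = 1e9
MB = 1e6
# 8 GB and 100 MB/s are the figures quoted above; the 60 MB/s input rate is an assumption.
print(seconds_until_underrun(8 * GB, 100 * MB, 60 * MB))
```

If the files being written take longer than this to stream, the drive will run dry and have to stop and reposition.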


Security

• No Bob Cowles
  – Therefore, no ‘scare the pants off everyone’ talk
• But:
  – The Stakkato Intrusion
    • The tale of the long-running intrusion at the Swedish National Supercomputer Centre, 2004-2005
  – Network Security Monitoring
    • How it is done at Brookhaven National Lab, with Sguil


Grid Projects
• Issues and problems around Grid Site Management (+discussions) – Ian Bird
  – Measuring site availability: T1s poor
  – Instabilities in site availabilities observed
  – Strategies:
    • Improve sites, improve job direction
  – SAM (Site Availability Monitor)
    • An expansion of SFT functionality
    • Sensors integrated with submission framework or standalone
    • Integrated tests done by test job submission
    • Analysis of job efficiencies (failure rates): reasons non-trivial
  – Good sites change daily!
  – Plan to use Job wrappers to test as submitting-VO view rather than OPS-VO view
    • Better view of system ‘weather’
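Measuring availability from test job outcomes can be sketched as a pass fraction per site. The Python below is an illustration only and not how SAM computes its metrics; the function name, the input format and the site names are all hypothetical.

```python
from collections import defaultdict

def availability_by_site(results):
    """Fraction of passed tests per site.

    results: iterable of (site, passed) pairs, where passed is a bool.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, ok in results:
        total[site] += 1
        passed[site] += 1 if ok else 0
    return {site: passed[site] / total[site] for site in total}

# Hypothetical test outcomes for two made-up sites.
sample = [("SITE-A", True), ("SITE-A", True), ("SITE-A", False), ("SITE-A", True),
          ("SITE-B", True), ("SITE-B", False)]
for site, fraction in sorted(availability_by_site(sample).items()):
    print(f"{site}: {fraction:.0%}")
```

Recomputing this per day rather than over a long window would expose the instability noted above, where the set of good sites changes daily.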


IHEPCCC

• IHEPCCC discussing collaborating with HEPiX on areas of mutual interest, particularly benchmarking and global file systems

• RTAG format proposed
  – Short-term study groups, report to HEPiX/IHEPCCC

• Lots of interest in participating, particularly in benchmarking and discussing whether SPEC2006 is appropriate


Next meetings

• Spring 2007:
  – April 23rd to 27th in DESY Hamburg
    • Topics suggested included benchmarking, cluster file systems, VoIP and in general, ‘discussion topics’ (as opposed to LCG workshops) likely to attract LCG Tier 2 sites.
• Autumn/Fall 2007:
  – possibly early November at either Berkeley or FermiLab, hopefully in the week preceding Supercomputing’07 in Reno
• Spring 2008:
  – CERN


References

• Abstracts and slides from HEPiX Fall 2006: https://indico.fnal.gov/conferenceDisplay.py?confId=384
• Alan Silverman’s comprehensive trip report: https://www.hepix.org/mtg/fall_06_jlab/HEPiX%20_Lab_Trip_Report_silverman.pdf