Transcript of Northgrid Status, Alessandra Forti, GridPP24, RHUL, 15 April 2010.

Page 1:

Northgrid Status

Alessandra Forti, GridPP24, RHUL, 15 April 2010

Page 2:

Outline

• APEL pies
• Lancaster status
• Liverpool status
• Manchester status
• Sheffield
• Conclusions

Page 3:

APEL pie (1)

Page 4:

APEL pie (2)

Page 5:

APEL pie (3)

Page 6:

Lancaster

• All WNs moved to the tarball installation (see the sketch after this list).
• Moving all nodes to SL5 solved the “sub-cluster” problems.
• Deployed and decommissioned a test SCAS.
  – Will install glexec when users demand it.
• In the middle of deploying a CREAM CE.
• Finished tendering for the HEC facility.
  – Will give us access to 2,500 cores.
  – An extra 280 TB of storage.
  – The shared facility has Roger Jones as director, so we have a strong voice for GridPP interests.
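For readers unfamiliar with tarball WNs, a minimal sketch of the usual deployment steps, assuming a gLite 3.2-era release unpacked on an NFS export; the path, tarball name and YAIM node types are illustrative, not Lancaster's actual setup:

    # Unpack the gLite WN tarball onto an NFS-exported area visible to
    # all WNs (illustrative path and file name).
    tar -xzf glite-WN-3.2.0-tarball.tar.gz -C /exports/glite-wn

    # Configure it with YAIM from inside the unpacked tree; for a Torque
    # site the node types were typically WN plus TORQUE_client (check
    # against the release documentation).
    /exports/glite-wn/glite/yaim/bin/yaim -c -s /root/site-info.def \
        -n WN -n TORQUE_client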

Page 7:

Lancaster

• Older storage nodes are being re-tasked.
• Tarball WNs are working well, but YAIM is suboptimal for configuring them.
• Maui continues to behave oddly for us (see the maui.cfg sketch after this list):
  – Jobs blocking other jobs.
  – Confused by multiple queues.
  – Jobs don't use their reservations when they are blocked.
• Problems trying to use the same NFS server for experiment software and tarballs.
  – The two have now been split.
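A hedged sketch of the Maui parameters that usually govern reservation and backfill behaviour of this kind; the values and class names are illustrative, not Lancaster's configuration:

    # maui.cfg fragment (illustrative values)
    # Hold a reservation for the highest-priority idle job so backfilled
    # jobs cannot indefinitely block it.
    RESERVATIONPOLICY  CURRENTHIGHEST
    RESERVATIONDEPTH   1

    # FIRSTFIT backfill runs short jobs in scheduling gaps without
    # delaying the reserved job's start time.
    BACKFILLPOLICY     FIRSTFIT

    # Per-queue (class) caps, e.g. to stop one queue starving the others.
    CLASSCFG[long]     MAXPROC=200
    CLASSCFG[short]    MAXPROC=400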

Page 8:

Liverpool

• What we did (and were supposed to do):
  – Major hardware procurement:
    • A 48 TB storage unit with a 4 Gbit bonded link (see the bonding sketch after this list).
    • 7x4x8 units = 224 cores, with 3 GB memory and 2x1 TB disks.
  – Scrapped some 32-bit nodes.
  – CREAM test CE running.
• Other things we did:
  – General guide to capacity publishing.
  – Horizontal job allocation.
  – Improved use of VMs.
  – Grid use of slack local HEP nodes.
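As an illustration of the bonded link, a minimal sketch for an SL5-era machine; the interface names, address and a 4x1 Gbit LACP layout are assumptions, and mode=4 also needs matching switch configuration:

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
    DEVICE=bond0
    IPADDR=192.168.10.20          # placeholder address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    # mode=4 is 802.3ad (LACP); miimon polls link state every 100 ms
    BONDING_OPTS="mode=4 miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1..eth3)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none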

Page 9:

Liverpool

• Things in progress:
  – Put CREAM into the GOCDB (ready).
  – Scrap all 32-bit nodes (gradually).
  – Production runs of the central computing cluster (other departments involved).
• Problems:
  – Obsolete equipment.
  – WMS/ICE fault at RAL.
• What's next:
  – Install/deploy the newly procured storage and CPU hardware.
  – Achieve production runs of the central computing cluster.

Page 10:

Manchester

• Since last time:
  – Upgraded the WNs to SL5.
  – Eliminated all dCache setup from the nodes.
  – RAID0 on the internal disks (see the mdadm sketch after this list).
  – Increased the scratch area.
  – Unified the two DPM instances.
  – 106 TB, of which 84 TB is dedicated to ATLAS.
  – Upgraded to DPM 1.7.2.
  – Changed the network configuration of the data servers.
  – Installed a squid cache.
  – Installed a CREAM CE (still in the test phase).
  – Last HammerCloud test in March: 99% efficiency.
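As an illustration of the RAID0 step, a minimal mdadm sketch; the device names and two-disk layout are assumptions, and striping destroys existing data on the partitions used:

    # Stripe the two internal disks into one device for job scratch space
    # (illustrative partition names).
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda3 /dev/sdb3

    # Make a filesystem and mount it as the scratch area.
    mkfs.ext3 /dev/md0
    mkdir -p /scratch
    mount /dev/md0 /scratch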

Page 11:

Manchester

• Major UK site in ATLAS production: 2nd or 3rd after RAL and Glasgow.
• The last HammerCloud test in March had 99% efficiency.
• 80 TB almost empty:
  – Not many jobs.
  – But from the stats of the past few days, real users also seem fine.


Page 12:

Manchester

• Tender:
  – European tender submitted on 15/09/2009.
  – Vendors' replies should be in by 16/04/2010 (in two days).
  – Additional GridPP3 money can be added:
    • Included a clause for an increased budget.
  – Minimum requirements: 4400 HEPSPEC / 240 TB.
    • Can be exceeded.
    • Buying only nodes.
  – Talking to the university about Green funding to replace what we can't otherwise replace.
    • Not easy.

Page 13:

Sheffield

• Storage upgrade:
  – Storage moved to physics: 24/7 access.
  – All nodes running SL5 and DPM 1.7.3.
  – 4x25 TB disk pools: 2 TB disks, RAID5, 4 cores.
  – Memory will be upgraded to 8 GB on all nodes.
  – 95% reserved for ATLAS.
  – XFS crashed; the problem was solved with an additional kernel module (see the sketch after this list).
  – Software server: 1 TB (RAID1).
  – Squid server.
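XFS was not part of the stock SL5 kernel's default module set, so a fix along these lines is plausible; the kmod-xfs package name and device paths are assumptions to be checked against the running kernel:

    # Install and load the XFS kernel module plus userspace tools
    # (kmod-xfs is the usual SL/CentOS 5 package name; verify for your
    # kernel version).
    yum install -y kmod-xfs xfsprogs
    modprobe xfs

    # Confirm the module is loaded, then mount the pool filesystems.
    lsmod | grep xfs
    mount -t xfs /dev/sdb1 /storage/pool1   # illustrative device and path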

Page 14:

Sheffield

• Worker nodes:
  – 200 old 2.4 GHz, 2 GB nodes, running SL5.
  – 72 GB of local disk per 2 cores.
  – lcg-CE and MONBOX on SL4.
  – An additional 32 amp ring has been added.
  – Fibre link between CiCS and physics.
• Availability:
  – 97-98% since January 2008.
  – 94.5% efficiency in ATLAS.

Page 15:

Sheffield Plans

• Additional storage:
  – 20 TB, bringing the total to 120 TB for ATLAS.
• Cluster integration (see the sketch after this list):
  – Local HEP and UKI-NORTHGRID-SHEF-HEP will have joint WNs.
  – 128 CPUs + 72 new nodes (exact numbers still to be confirmed).
  – Torque server from the local cluster and lcg-CE from the grid cluster.
  – Need 2 days of downtime; waiting for ATLAS approval.
  – CREAM CE installed; waiting to complete the cluster integration.
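A hedged sketch of what pointing the grid lcg-CE at the local cluster's Torque server typically involves; the queue name, host names and file location are invented for illustration:

    # On the local cluster's Torque server: add a queue for grid jobs and
    # accept submissions from the CE (illustrative names).
    qmgr -c "create queue grid queue_type=execution"
    qmgr -c "set queue grid enabled=true"
    qmgr -c "set queue grid started=true"
    qmgr -c "set server submit_hosts += lcgce.example.ac.uk"

    # On the lcg-CE: point the Torque client at the remote server instead
    # of a local one (file location depends on the Torque build).
    echo "torque.example.ac.uk" > /var/spool/pbs/server_name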