Large Cluster Workshop

Large-Scale Cluster Computing Workshop
held at Fermilab, 22-25th May 2001

Alan Silverman and Dane Skow
CHEP 2001, Beijing
7th September 2001
Outline
Background and Goals
Attendees
The Challenge to be faced
Format of the Workshop
Panel Summaries
Conclusions
References
Background
Sponsored by HEPIX, in particular by the Large Cluster SIG
In background reading on Grid technologies, we found many papers and USENIX-type talks on cluster techniques, methods and tools.
But often with results and conclusions based on small numbers of nodes.
What is the “real world” doing? Gathering practical experience was the primary goal.
Goals
Understand what exists and what might scale to large clusters (1000-5000 nodes and up).
And by implication, predict what might not scale.
Produce the definitive guide to building and running a cluster: how to choose, select and test the hardware; software installation and upgrade tools; performance management, logging, accounting, alarms, security, and so on.
Maintain this guide.
The Attendees
Participation was targeted at sites with a minimum cluster size (100-200 nodes)
Invitations were sent not only to HENP sites but also to other sciences, including biophysics. We also invited participation by technical representatives from commercial firms (salespeople refused!)
Our target was 50-60 people, chosen to optimise interaction and discussion
64 people registered, 60 attended
The Challenge
Fermilab Run II and CERN LHC experiments will need clusters measured in thousands of nodes
[Figure: Level-1 trigger rate (Hz) against event size (bytes) for HEP experiments, spanning UA1 and LEP through KLOE, HERA-B, KTeV, NA49, H1, ZEUS, CDF, CDF IIa, D0 IIa and LHCB up to ALICE, ATLAS and CMS. The LHC experiments sit at the extreme: high Level-1 trigger rate (1 MHz), high channel count and bandwidth (500 Gbit/s), high data archive (petabytes). For comparison, the plot also marks the rate corresponding to 1 billion people surfing the Web.]
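To put rough numbers on this (my own arithmetic, using round values read off the figure rather than figures quoted at the workshop): an LHC-class experiment with a Level-1 rate of \(10^{5}\) Hz and an event size of \(10^{6}\) bytes would produce

\[ 10^{5}\ \text{Hz} \times 10^{6}\ \text{bytes} = 10^{11}\ \text{bytes/s} = 100\ \text{GB/s} \approx 800\ \text{Gbit/s}, \]

the same order as the 500 Gbit/s bandwidth in the figure; even after heavy filtering, a running year of \(10^{7}\) s of such a stream reaches the petabyte archive scale.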
The Challenge (continued)
Fermilab Run II and CERN LHC experiments will need clusters measured in thousands of nodes
Should or could a cluster emulate a mainframe?
How much can HENP compute models be adjusted to make the most efficient use of clusters?
Where do clusters not make sense?
What is the real total cost of ownership of clusters?
Can we harness the unused CPU power of desktops?
How to use clusters for high I/O applications?
How to design clusters for high availability?
LHC Computing Plans
As proposed by MONARC, computing will be arranged in Tiers where Tier 0 runs at CERN, Tier 1 are Regional Centres, Tier 2 are National Centres and so on down to Tier 4 on desks
Grid Computing will be an important constituent.
But we will still need to manage very large clusters; do we have the tools and the resources?
Many sessions at this conference have already covered MONARC, Grid Computing and associated tools, but few at the fabric level.
Workshop Layout
Apart from a few plenary sessions, typically to set the scale of the problem as compared to where we are today, the workshop was arranged in 2 streams of highly-interactive panels
Each panel was presented with some initial questions to consider as a starting point
Each panel was “seeded” with 2 or 3 short informal talks relevant to the panel topic
The panels were summarised on the last day.
Proceedings are in preparation and will be published soon.
Examples of Cluster Acquisition Procedures
FNAL and CERN have formal tender cycles with technical evaluations. But FNAL can select the bidders, whereas CERN must invite bids Europe-wide and the lowest valid bid wins.
Also, FNAL qualifies N suppliers for 18-24 months, while CERN rebids each major order, lowest bid wins. Variety is the spice of life?
KEK's funding agency demands long-term leases; the switch to PCs was delayed by in-place leases with RISC vendors.
NERSC is funded by user groups buying CPU slices and disc space, but NERSC still decides the configurations and still owns the systems.
Panel A1 - Configuration Management
Identified a number of key tools in use (for example, VA Linux’s VACM and SystemImager, and the Chiba City tools) and some, strangely, not much used in HENP (e.g. cfengine; its convergent style is sketched below)
Tool sharing is not so common - historical constraints, different local environments, less of an intellectual challenge
Almost no prior modelling - previous experience is much more the prime planning “method”
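cfengine's main idea, and perhaps why it deserved more HENP attention, is convergent configuration: declare the desired state of a node and run a tool that repeatedly pulls the node towards it, so drift is repaired rather than accumulating. A minimal sketch of that idea in Python (the managed file and its contents are invented for illustration; a real tool would manage system files such as /etc/motd):

```python
import os

# Desired state: path -> (mode, expected content). Illustrative values only.
DESIRED = {
    "./motd.demo": (0o644, "Authorised users only.\n"),
}

def converge(desired=DESIRED):
    """Pull this node towards the desired state; safe to run repeatedly."""
    for path, (mode, content) in desired.items():
        try:
            with open(path) as f:
                current = f.read()
        except OSError:
            current = None
        if current != content:
            with open(path, "w") as f:   # repair a drifted or missing file
                f.write(content)
        os.chmod(path, mode)             # enforce permissions on every run
        print(f"{path}: converged")

if __name__ == "__main__":
    converge()   # idempotent: a second run changes nothing
```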
Panel A2 - Installation, Upgrading
Dolly+ at KEK uses a logical ring structure for speed (presented earlier this week at CHEP); the ring idea is sketched below
The ROCKS toolkit at San Diego uses vendor tools and stores everything in packages; if you doubt the validity of a node’s configuration, re-install the node
European DataGrid WP4 - a major challenge for its first milestone (due mid-Oct); an interim solution was chosen
Burn-in tests are rare (FNAL and NERSC, yes) - but look at CTCS from VA Linux (handle with care!)
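The attraction of the logical ring is that every node forwards the image to its successor while still receiving it from its predecessor, so the server feeds only one node and total time grows only weakly with cluster size. A back-of-the-envelope comparison in Python (the pipelined-ring model, image size and link speed are my own illustrative choices, not figures from the Dolly+ talk):

```python
def star_time(nodes, image_gb, link_gbps):
    """Central server feeds every node over one shared link: time scales with N."""
    transfer = image_gb * 8 / link_gbps          # seconds for one full copy
    return nodes * transfer

def ring_time(nodes, image_gb, link_gbps, chunk_mb=64):
    """Pipelined ring: each node forwards chunks as they arrive.
    Completion time ~ one full transfer + (N-1) chunk-hop delays."""
    transfer = image_gb * 8 / link_gbps
    chunk = chunk_mb * 8 / 1000 / link_gbps      # seconds per chunk hop
    return transfer + (nodes - 1) * chunk

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n:5d} nodes: star {star_time(n, 4, 1)/3600:7.2f} h, "
              f"ring {ring_time(n, 4, 1)/60:6.1f} min")
```

With these assumed numbers (4 GB image, 1 Gbit/s links), a 1000-node install drops from roughly nine hours in the star model to about nine minutes in the ring model.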
Panel A3 - Monitoring
BNL wrote their own tools but use vendors’ tools where possible (e.g. for the AFS and LSF services)
FNAL and CERN started projects (NGOP and PEM respectively) when market surveys produced no tool sufficiently flexible, scalable or affordable
Bought-in tools in this area, at our scales of cluster, are expensive and a lot of work to implement; but one must not forget the ongoing support costs of in-house developments. (The collect-compare-alarm loop common to all these tools is sketched below.)
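Whatever the tool, the skeleton is the same: collect metrics from every node, compare against thresholds, raise alarms. A toy version of that loop in Python (the host names, the fake sampler and the thresholds are all invented; NGOP and PEM are of course far more elaborate):

```python
import random
import time

NODES = [f"node{i:03d}" for i in range(8)]       # hypothetical host names
THRESHOLDS = {"load": 10.0, "disk_pct": 90.0}    # alarm above these values

def sample(node):
    """Stand-in for a real probe (agent, rsh, SNMP); returns fake metrics."""
    return {"load": random.uniform(0, 12), "disk_pct": random.uniform(50, 99)}

def poll_once():
    for node in NODES:
        for name, value in sample(node).items():
            if value > THRESHOLDS[name]:
                print(f"ALARM {node}: {name}={value:.1f} "
                      f"exceeds {THRESHOLDS[name]}")

if __name__ == "__main__":
    for _ in range(3):        # a real monitor would loop forever
        poll_once()
        time.sleep(1)
```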
Panel A4 – Grid Computing
Three relevant efforts – European DataGrid, PPDG and GriPhyN. Refer to presentations earlier in this conference for details of these
Clear parallels and overlaps – it will be important to keep these in mind to avoid developing conflicting schemes which will have common (LHC) users
No PPDG or GriPhyN equivalent of the European DataGrid Work Package 4 – Fabric Management; is this a problem/risk?
Panel B1 – Data Access
Future direction is heavily related to Grid activities.
All tools must be freely available.
Network bandwidth and error rates/recovery can be the bottleneck.
“A single active physics collaborator can generate up to 20 TB of data per year” (Kors Bos, NIKHEF).
A genomics team at the University of Minnesota needed to access “opportunistic cycles” on desktops via Condor: their science has moved so fast that resources scheduled for 2008 are needed now.
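In sustained-rate terms (my own arithmetic, not a figure from the talk), 20 TB per year is

\[ \frac{2 \times 10^{13}\ \text{bytes}}{3.15 \times 10^{7}\ \text{s}} \approx 0.6\ \text{MB/s}, \]

modest as an average, but bursty in practice and multiplied by every active collaborator.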
Panel B2 – CPU, Resource Allocation
30% of the workshop audience used LSF, 30% used PBS, 20% used Condor
FNAL developed FBS and then FBSng; CCIN2P3 developed BQS.
The usual trade-off: the resources needed to develop one's own tool or adapt public-domain tools, against the cost of a commercial tool and less flexibility with regard to features.
Platform (who were represented) claimed to be listening and to understand the issue as regards LSF. (A toy sketch of one scheduling policy, fair share, follows below.)
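At their core, all these systems match a queue of pending jobs to free slots under some policy. A toy fair-share policy in Python (the groups, jobs and charging model are invented for illustration; this is not how LSF, PBS, Condor, FBS or BQS are implemented):

```python
from collections import defaultdict

class FairShareQueue:
    """Toy fair share: dispatch next from the group with least CPU used so far."""
    def __init__(self):
        self.usage = defaultdict(float)   # group -> CPU-hours consumed so far
        self.pending = []                 # (group, job) pairs, FIFO within group

    def submit(self, group, job):
        self.pending.append((group, job))

    def dispatch(self, cpu_hours):
        # Pick the earliest-submitted job of the least-charged group.
        group, job = min(self.pending, key=lambda gj: self.usage[gj[0]])
        self.pending.remove((group, job))
        self.usage[group] += cpu_hours    # charge the group for this job
        return group, job

q = FairShareQueue()
for g, j in [("cms", "reco1"), ("cms", "reco2"), ("atlas", "sim1"), ("cms", "reco3")]:
    q.submit(g, j)
while q.pending:
    print(q.dispatch(cpu_hours=2.0))
```

Dispatching from the least-charged group interleaves work across groups; real schedulers add decay of historical usage, priorities and per-queue limits on top of this.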
Panel B3 - Security
BNL and FNAL were (are) adopting formal Kerberos-based pilot security schemes
Elsewhere the usual procedures are in place – CRACK password checking, firewalls, local security response teams, etc
Many sites, especially those seriously hacked, forbid access from offsite with clear text passwords
Smart cards and certificates are starting to be used
Panel B4 – Load Balancing
For distributed application sharing, use remote file sharing or perform local node re-synchronisation?
Link applications to libraries dynamically (the user’s usual preference) or statically (normally the sys admin’s choice)?
Frequent use of a cluster alias and DNS for load balancing; some quite clever algorithms in use (one simple selection scheme is sketched after this list)
Delegate queue management to users – peer pressure works much better on abusers
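The cluster-alias approach: a single name resolves, lookup by lookup, to whichever member node the balancer currently prefers. One simple selection scheme in Python (the host names, load table and two-candidate rule are illustrative assumptions, not a description of any site's algorithm):

```python
import random

# Hypothetical member nodes of the alias and their last-reported load averages.
LOADS = {"lx01": 2.1, "lx02": 0.4, "lx03": 7.9, "lx04": 0.6}

def pick_target(loads=LOADS, candidates=2):
    """Return the node the alias should resolve to next.
    Choosing randomly among the k least-loaded nodes avoids a stampede
    on a single 'best' host between load updates."""
    best = sorted(loads, key=loads.get)[:candidates]
    return random.choice(best)

if __name__ == "__main__":
    # A DNS front end would publish this answer with a short TTL,
    # then refresh LOADS from the nodes on each update cycle.
    for _ in range(5):
        print(pick_target())
```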
Other Highlights
Introduction to the IEEE Task Force on Cluster Computing: most of us did not know it existed!
Description of the issues facing biophysicists, such as those at the Sanger Centre in the UK
Quote of the week - “a cluster is a great error amplifier” (Chuck Boeheim, SLAC)
Report from the Supercomputer Scalable Cluster Conference: they seem to consist largely (wholly?) of ASCI sites. They are already at the multi-thousand node cluster level but for them money seems to be little problem. They promised to keep in touch though.
Conclusions
How to produce conclusions when the goal was to share experiences and discuss technologies?
Each delegate will have his/her own conclusions, suggestions to follow-up, ideas to investigate, tools to experiment with, and so on.
My Conclusions
It was a valuable sharing of experiences.
Many tools were exposed, some frequently mentioned, others new to many in the audience.
Clusters are here to stay, but they don’t solve every problem and they bring their own, especially in the area of systems administration.
Growing awareness of in-house development costs, but also of management and operational costs.
Don’t forget the resources locked up in desktops.
Cluster Builders Guide
A framework covering all (we hope) aspects of designing, configuring, acquiring, building, installing, administering, monitoring and upgrading a cluster.
Not the only way to do it but it should make cluster owners think of the correct questions to ask and hopefully where to start looking for answers.
Section headings to be filled in as we gain experience.
1. Cluster Design Considerations
   1.1 What are the characteristics of the computational problems?
       1.1.1 Is there a “natural” unit of work?
             1.1.1.1 Executable size
             1.1.1.2 Input data size
             ...
   1.2 What are the characteristics of the budget available?
       1.2.1 What initial investment is available?
       1.2.2 What is the annual budget available?
       ...
...
5. Operations
   5.1 Usage
   5.2 Management
       5.2.1 Installation
       5.2.2 Testing
       ...
Future Meetings
HEPiX (and HEPNT) - Oct 15 to 18, NERSC, LBNL (Berkeley, California); see the web site http://wwwinfo.cern.ch/hepix/ for details.
IEEE TFCC - Oct 8 to 11, Newport Beach, California.
Large Cluster Workshop - late 2002 or early 2003; by invitation, contact me if you are interested in receiving news.
References
Most of the overheads presented at the workshop can be found on the web site http://conferences.fnal.gov/lccws/
There you will also find the full programme and, soon (end October?), the Proceedings (now in preparation), plus some useful cluster links (including many links within the Proceedings).
Other useful links for clusters:
IEEE Cluster Task Force http://www.ieeetfcc.org
Top500 Clusters http://clusters.top500.org