The State of Cluster Systems Overview and Future Directions
Donald Becker
February 2007 BWBUG meeting
State of Commodity Clustering
The state of commodity clustering is strong
Commodity Clusters dominate the HPC world
» A strong showing in the Top500 list
» Displacing custom HPC architectures
Linux is the OS of choice for clusters
» Over 75% of high-end HPC machines use Linux (November 2006)
» An even higher percentage in midrange and utility computing
Trends in High Performance Computing

[Diagram: three approaches to HPC and their trade-offs]

Virtualization / SMP-like Model
» Simplicity, manageability, reliability, power
» But: proprietary, prescriptive, expensive

Commodity Clusters
» Economics, standards, scalability
» But: heavy system integration, management complexity, non-optimized performance

Open Source Linux / Standards-Based Computing
» R&D, quality, scale, innovation, economics, ability to customize, potential stability
» But: opaque commercial support

Maximizing the ROI of clustering is all about getting the best of all worlds. The ideal HPC platform:
» Lowers capital costs
» Supports standards
» Dramatically improves productivity
» Lowers your operational expenses
Beowulf Clustering: The Early Years

Conceived by Becker and Sterling in ‘93 and initiated at NASA in ’94
Objective: show that commodity clusters could solve HPC problems
Build a system that scaled in all dimensions
» Networking, bandwidth, disks, main memory, processing power
Initial prototype
» 16 processors, channel-bonded Ethernet, under $50K
» Matched the performance of a contemporary $1M machine
[Diagram: master node linked to compute nodes by an interconnection network; the master also connects to the Internet or an internal network]
Traditional “Ad-Hoc” Linux Cluster

● Full set of disparate daemons, services, user/password, and host access setup
● Basic parallel shell with complex glue scripts to run jobs
● Monitoring and management added as isolated tools
● Full Linux install to disk, then loaded into memory: manual and slow, 5-30 minutes
● Manual management of multiple, individual servers
Lessons Learned
What did we do wrong?
Complexity
Clusters required extensive training to install, configure and use
Long-term administration and updates were difficult
Only “static scalability”
Definitions

Clustering: independent computers, combined into a unified system through software and networking.

Virtualization: isolation from physical resources, in order to:
» Divide ample resources
» Emulate scarce resources
» Combine resources
» Handle changes
What does cluster architecture have to do with virtualization?

Good cluster systems virtualize
» Create a unified view of independent machines
» Single System Image (illusion)
Machine virtualization creates clusters
» Many problems are solvable with cluster experience
Virtualization isn't always a single thing
» Creating a unified view with a toolkit
HPC Clustering Redefined

FROM (traditional):
● Manual management of multiple, individual servers: slow to scale, reliability issues
● Full operating system install on each compute node in the cluster
● Complex cluster management and administration: time, skill, and cost intensive
● A security threat for each HPC compute node deployed
● ‘Do-it-yourself’, ad hoc clusters: cumbersome and slow

TO (Cluster Virtualization):
● Deploy, monitor, and scale a virtual pool of resources from a single point
● Only a lightweight distribution on compute nodes, provisioned on demand, up in seconds
● Software version skews eliminated
● Central command/control makes the complex simple and cost-effective: manage the entire cluster as a single, consistent virtual system
● A complete cluster management solution that runs out-of-the-box: robust, sophisticated architecture; commercially tested and supported
● Central security management: lightweight compute nodes = nothing to hack

Driving productivity up and complexity out: simplicity is the value of Cluster Virtualization.
Cluster Virtualization Architecture Realized

• Virtual, unified process space enables intuitive single sign-on and job submission
» Effortless job migration to nodes
• Minimal in-memory OS with a single daemon, rapidly deployed with no disk
» Up in less than 20 seconds
• Monitor and manage efficiently from the master
» Single system install
» Single process space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which is inherently more reliable

[Diagram: master node, interconnection network, Internet or internal network, compute nodes with optional disks]

Manage and use a cluster like a single SMP machine.
Elements of Cluster Systems
Some important elements of a cluster system:
» Booting and provisioning
» Process creation, monitoring, and control
» Update and consistency model
» Name services
» File systems
» Physical management
» Workload virtualization
Booting
Booting requirements:
» Reliable booting
» Automatic scaling
» Centralized control and updates
» Reporting and diagnostics designed into the system
» No dependency on machine-local storage or configuration

The Scyld solution:
» Integrated, automatic network boot
» Basic hardware reporting and diagnostics in the pre-OS stage; only CPU, memory, and NIC needed
» Kernel and minimal environment from the master, just enough to say “what do I do now?”
» Remaining configuration driven by the master
Booting Implementation

Integrated boot server, supporting PXE and other protocols:
» Avoids network-booting reliability issues
» Understands PXE versions and bugs
» Avoids the TFTP "capture effect"
» Built-in dynamic node addition
» DHCP service for non-cluster hosts

Hardware reporting and diagnostics in the pre-boot environment are used to provide node-specific boot images, as sketched below.
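A minimal sketch of that last point, assuming a hypothetical report format and image names; this illustrates the idea, not Scyld's actual boot server:

    # Choose a boot image for a node based on what it reported about itself
    # in the pre-boot environment. Fields and image names are hypothetical.

    def select_boot_image(report: dict) -> dict:
        """Map a node's pre-boot hardware report to a kernel + initrd pair."""
        if "myri10ge" in report.get("nic_drivers", []):
            # Node has a 10G NIC: hand it an image with that driver included.
            return {"kernel": "vmlinuz-cluster", "initrd": "initrd-10g.img"}
        if report.get("mem_mb", 0) < 512:
            # Very small memory: hand it a trimmed ramdisk.
            return {"kernel": "vmlinuz-cluster", "initrd": "initrd-small.img"}
        return {"kernel": "vmlinuz-cluster", "initrd": "initrd-default.img"}

    # During the pre-OS stage a node reports only CPU, memory, and NIC.
    node_report = {"cpu": "x86_64", "mem_mb": 2048, "nic_drivers": ["e1000"]}
    print(select_boot_image(node_report))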
Provisioning

Provisioning design:
» Traditional full standard installation on masters
» Dynamically built "distribution" on the other nodes
» Core system is "stateless" (non-persistent), provided by the boot server as ramdisks
» No local storage or network file system used
» Additional elements provided under centralized master control

Initial provisioning architecture:
» Initial operation uses only CPU, memory, and NIC
» Minimize actions taken before coming under central control
» The master configures systems with a step-by-step flow of control
» Configuration and policy exist, and are used, only on the master
» The opposite of the traditional "download a system and cross your fingers"
Phase 1 Booting Details

Client operations from the network boot image:
» Scan the bus and load all NIC drivers
» DHCP on all interfaces (locate the log host and runtime master)
» Start logging, then execute commands from the master
» The master orders PCI bus scans and uses the IDs to load device drivers

Final steps of boot: create the application environment
» Mount local and remote file systems as needed by the application

A sketch of the client-side flow appears below.
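A minimal sketch of the phase-1 client flow, with stand-in functions for the PCI scan, DHCP, and the master's command stream; none of these are the actual Scyld interfaces:

    from typing import Iterator

    def scan_pci_bus() -> list[str]:
        # Stand-in for a real PCI scan: return the device IDs on the bus.
        return ["8086:100e"]           # e.g. an Intel e1000 NIC

    def dhcp_discover() -> str:
        # Stand-in for DHCP on all interfaces: returns the master's address.
        return "10.0.0.1"

    def master_commands(master: str) -> Iterator[str]:
        # Stand-in for the master-driven command stream; the real node reads
        # these over the network, one step at a time.
        yield "load_driver e1000"
        yield f"mount {master}:/home /home"

    def phase1_boot() -> None:
        for dev in scan_pci_bus():            # scan bus; master maps IDs to drivers
            print("found PCI device:", dev)
        master = dhcp_discover()              # locate runtime master + log host
        print("logging to", master)           # start logging first
        for cmd in master_commands(master):   # then do only what the master says
            print("executing:", cmd)

    phase1_boot()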
Installation, Initial Provisioning, and Updates

FROM (traditional):
● Disk-based install of the entire cluster: script or tool maintenance (kickstart); 15-30 minutes per node, days per cluster
● Replacing a failed node is painful: replace hardware, power on, manually install and configure; 15-30 minutes
● Complex, scripted node updates: log in and apply for each node; manually correct failed updates
● Disk installs get stale and inconsistent: tedious manual verification, or a 30-minute reprovision

TO (Stateless Auto-Provisioning):
● Single install on the master: auto-provisions the in-memory OS to the nodes; under 20 seconds per node, minutes to hours per cluster
● Replacing a failed node is effortless: replace hardware, power on; the node is up in 20 seconds
● Single install of an update on the master: the master auto-updates the nodes and guarantees consistency
● Stateless provisioning is always consistent: refresh in 20 seconds if need be

Turning provisioning from painful to effortless: focus on application challenges, not node administration.
Process Initiation, Monitoring and Control

Problems:
» Starting jobs on a dynamic cluster
» Monitoring and controlling running processes
» Allowing interactive and scheduler-based usage

Opportunity:
» Cluster jobs are issued from designated masters
» That master has exactly the required environment
» We already have a POSIX job control model
» Tight communication
Operation

Semantics: execution consistency. Remote execution produces the same results as local execution.

Implications:
» Same executable (including version!)
» Same libraries, including library linkage order
» Same parameters and environment

A sketch of capturing this context follows.
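A minimal sketch of what "same executable, libraries, parameters, environment" means in practice: capture exactly what a local run would use, so a remote run can be verified against it. This illustrates the semantics, not the Scyld mechanism:

    import hashlib
    import os
    import sys

    def execution_context(argv: list[str]) -> dict:
        """Record what a local execution would actually use."""
        with open(sys.executable, "rb") as f:
            exe_hash = hashlib.sha256(f.read()).hexdigest()
        return {
            "executable": sys.executable,
            "executable_sha256": exe_hash,   # same executable, same version
            "args": argv,                    # same parameters
            "env": dict(os.environ),         # same environment
            # Same libraries and linkage order would come from the dynamic
            # linker; elided here.
        }

    ctx = execution_context(["myapp", "--steps=100"])
    print(ctx["executable"], ctx["executable_sha256"][:12])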
Traditional Approaches to Remote Execution Consistency

One approach: treat cluster nodes as workstations, with a full local OS installation
» Automated installation
» Ad hoc reinstall or update system
» But synchronization (copying) is not consistency!

Another approach: NFS-mounting the OS
» Scalability problem: the typical limit is 30-40 nodes
» Reliability problems under heavy load
» Mediocre consistency when updating
Runtime provisioning architecture

We "copy" running processes
» Correct: we know the current versions, library linkage order, environment, etc.
» Fast: libraries and executables are cached
» Transfer and cache as versioned whole files
» With dynamic caching we can start with "nothing"!
» Dynamically builds a minimal custom installation

Feasible and efficient because the "slave nodes" run only applications, not a full OS install. A sketch of such a cache follows.
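A minimal sketch of a versioned whole-file cache, assuming a content hash as the version and a local copy standing in for the network fetch; the cache path and functions are hypothetical, not the Scyld implementation:

    import hashlib
    import os
    import shutil

    CACHE_DIR = "/var/cache/node-libs"   # hypothetical node-side cache

    def content_version(path: str) -> str:
        """How the master computes a file's version: hash the whole file."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def ensure_cached(name: str, master_copy: str, version: str) -> str:
        """Return a cached file matching the exact version the job uses."""
        cached = os.path.join(CACHE_DIR, version, name)
        if not os.path.exists(cached):               # miss: fetch the whole file
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            shutil.copy(master_copy, cached)          # stand-in for network fetch
        return cached                                 # hit: nothing to transfer

Because files are cached under their version, an updated library on the master simply produces a new cache entry; a running job keeps the version it started with.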
Hybrid environment

Caching libraries and executables has limits
» It only handles automatic consistency
» Scripts have inherently unspecified consistency: which version is correct after updates?

Approach: extend and expose the caching file system; network-mount the master and other servers
» This violates the rule of "mount only to support the application"
Cluster-wide process space

Single virtual process space over the cluster
» Standard process monitoring tools work unchanged
» Negligible performance impact
» Careful design avoids cascading failures

Major operational and performance benefits. Consider a cluster-wide "killall": over 7 minutes on the UAz cluster with 'ssh', versus real-time, interactive response with the Scyld approach. The sketch below contrasts the two.
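A minimal sketch of why the two approaches differ so much, with hypothetical node names and a toy process table rather than the real BProc data structures:

    import signal

    # Traditional: a serial ssh round-trip to every node; each iteration
    # pays connection, authentication, and command latency.
    def killall_ssh(nodes: list[str], job: str) -> None:
        for node in nodes:
            print(f"ssh {node} killall {job}")

    # Unified process space: remote processes already appear in the
    # master's process table, so one local walk signals all of them.
    def killall_unified(process_table: dict[int, str], job: str) -> None:
        for pid, name in process_table.items():
            if name == job:
                print(f"kill -{int(signal.SIGTERM)} {pid}")

    killall_ssh(["node0", "node1"], "myapp")
    killall_unified({9001: "myapp", 9002: "other", 9003: "myapp"}, "myapp")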
Unified process space implementation

Correct semantics and efficient monitoring/control
» Implemented by extending the master's process table
» 2.2/2.4 Linux kernel implementation with custom hooks
» The redesigned 2.6 implementation minimizes/eliminates hooks
» Uses the security and auditing interface (e.g. "SELinux")
Challenges to Unified Process Spaces

Technical scalability is good
» Scales to 1000s of nodes and processes
Perceptual scalability is the limit
» What do you do when 'top' reports the top 20 of 10,000 processes?
Prototype solution: application-specified process presentation
Casco Bay

A new standard-like specification being pushed by Intel
» Developed by Intel, and very Intel(tm)-specific
» Weak, yet overly specific
» Flawed, but it addresses a need
Casco Bay details

» Linux 2.6.17 or later, except if your name is "Red Hat" or "SuSE", then 2.6.9
» PXE booting
» Linux Standard Base execution environment
» Specific compilers and development environment
» No distinction between execution and development environments
Maintaining Efficiency, Scalability, Security

FROM (traditional):
● Multiple services, agents, and daemons consume resources and compromise performance and scalability
» Memory and bandwidth (a 400 MB full install)
» Scheduling latencies of daemons and services
» Multiple agents and extra traffic won't scale
● Full user and access setup requires configuration and opens security holes
● Configuring large numbers of servers is painful: users, services, access, security, etc.

TO (Lightweight Compute Nodes):
● A right-sized compute node OS consumes fewer resources (8 MB of RAM is typical)
» Can always auto-provision extra libraries
» A 5-15% performance improvement is typical
» Optimized management agent design scales well
● Compute nodes are slaves that run jobs directed by the master: never have to set up compute nodes
● Virtual fortress: a lightweight slave with no attack points is inherently more secure and reliable

Dynamically provisioned compute nodes dramatically improve efficiency, scalability, and security.
Simplifying the complexity of managing large pools of servers

FROM (traditional):
● Multiple, disparate methods and tools for monitoring: complex to set up and manage; does not scale
● Job control requires interaction with multiple servers
● Users and security set up on every machine
● Manual management of multiple, individual servers: tedious and time-consuming

TO (Scyld):
● Present the cluster as a single, powerful Linux system: as easy as managing a single workstation
● Single point of monitoring: logs, statistics, and status automatically forward to the master node; exceptional ease of use and scalability
● Single point of job submission and job control: a single unified process space; signal and standard I/O forwarding
● Single point of user/security administration

Manage large pools of servers as a single, consistent, virtual system.
Workload Virtualization

"Workload virtualization" is just another name for site-wide schedulers: they provide a virtual environment, but only when viewed through specific tools, mapping utilization to organizational policies and priorities.

FROM (traditional):
● Command line only: write many complex, difficult-to-maintain scripts; no visibility into what is happening
● Weak at mapping priorities to organizational policies
● No graphical, historical reporting or accounting

TO (commercial solutions):
● Powerful graphical visualization: point-and-click control; view all relevant resource/job/user status; timeline and job-location views; powerful "what-if" simulations
● Powerful, easy-to-implement policies: optimize utilization according to multiple organizational criteria; fair share, deadline, backfill; reservations, triggers (a fair-share sketch follows)
● Instant, graphical real-time and historical reporting and chargeback accounting

Advanced workload virtualization: maximizing cluster utilization, mapped to business priorities, with powerful, instant visualization of the cluster.
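One of the policies named above, fair share, can be stated in a few lines. A minimal sketch of the idea, not any particular scheduler's formula; the weighting and numbers are illustrative:

    # Fair share: a group's job priority falls as its recent usage
    # climbs past its target share of the cluster.

    def fair_share_priority(base: float, used_fraction: float,
                            target_share: float) -> float:
        """Scale base priority by how far a group is over or under its share."""
        return base * (target_share / max(used_fraction, 1e-9))

    # A group entitled to 25% of the cluster that has recently used 50%
    # sees its priority halved: 100 -> 50.
    print(fair_share_priority(100.0, used_fraction=0.50, target_share=0.25))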
File System Architecture

The "diskless administrative" model
» Reliability of having no file system dependencies
» Yet with no constraints on the application root, and no performance penalty

File system agnostic
» The underlying system needs no local or remote file system
» File systems are mounted to support the application
» Clean consistency and administrative model
Scyld Beowulf file system support

Mounting application file systems is a per-installation configuration
» The default system mounts no file systems

Administration is done on the master
» File system configuration tables (fstab) are on the master
» Cluster-wide default, with per-node specialization
» Standard format, but in /etc/beowulf/fstab

Mounting is a final step of initial provisioning
» Flow of control is on the master: failures are diagnosable and non-fatal
» Handles kernel extension (module) loading

A sketch of this master-side model follows.
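A minimal sketch of the master-side model, assuming a per-node override file named fstab.<N>; the /etc/beowulf/fstab path comes from the slide, but the override naming is an assumption:

    import os

    def read_fstab(path: str) -> list[str]:
        """Read an fstab-format file, skipping blanks and comments."""
        if not os.path.exists(path):
            return []
        with open(path) as f:
            return [ln.strip() for ln in f
                    if ln.strip() and not ln.startswith("#")]

    def mounts_for_node(node: int) -> list[str]:
        default = read_fstab("/etc/beowulf/fstab")           # cluster-wide default
        override = read_fstab(f"/etc/beowulf/fstab.{node}")  # per-node specialization
        return override if override else default

    # The master walks the list itself, so a bad entry is a diagnosable,
    # non-fatal error instead of a node that silently fails to come up.
    for entry in mounts_for_node(5):
        print("node 5 mounts:", entry)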
Cluster Name Services: BeoNSS

A "name service," AKA "directory service," is the general concept of mapping a name to a value, or providing a list of names. Concrete examples:
» Host names
» Network groups
» User names and passwords
» IP addresses and Ethernet MAC addresses
Cluster Name Services

Why are cluster name services important?

Simplicity
» Eliminates per-node configuration files
» Automates scaling and updates

Performance
» Avoids the serialization of network name lookups
» Avoids communicating with a busy server
» Avoids the failure and retry cost of server overload
» Avoids the latency of consulting large databases
Opportunities: Why we can do better

» Cluster nodes have predictable names and addresses
» We always issue jobs from masters: one point in space and time where we know the answer
» Clusters have a single set of users per master: "single sign-on" means similar credentials over all machines
» Many table lookups return never-changing results
» Uncommon queries are rarely performance problems: we don't need to make them fast, we just can't get them wrong
BeoNSS: A cluster-specific name service

The solution: BeoNSS, the Beowulf Name Services. First up: hostnames.
BeoNSS Hostnames

Opportunity: we control IP address assignment
» Assign node IP addresses in node order
» This turns name lookup into simple addition (see the sketch below)

Name format
» Cluster hostnames have the base form .<N>
» Options for admin-defined names and networks
» Special names for "self" and "master": the current machine is ".-2" or "self"; the master is known as ".-1", "master", or "master0"
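A minimal sketch of "lookup becomes addition": node N's address is assigned as a base address plus N, so resolving ".N" needs arithmetic, not a network query. The base address here is an arbitrary example, not a fixed BeoNSS constant:

    import ipaddress

    NODE_BASE = ipaddress.IPv4Address("10.54.0.100")   # hypothetical node-0 address

    def beonss_lookup(hostname: str) -> str:
        """Resolve a cluster hostname of the base form '.<N>' by addition."""
        n = int(hostname.lstrip("."))    # ".31" -> 31; ".-1" (master) -> -1
        return str(NODE_BASE + n)

    print(beonss_lookup(".0"))     # 10.54.0.100
    print(beonss_lookup(".31"))    # 10.54.0.131
    print(beonss_lookup(".-1"))    # 10.54.0.99, the master in this scheme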
BeoNSS user credentials

» Names are reported as a password table entry ('pwent')
» BeoNSS reports only the current user, plus a few others, e.g. root
» Cluster jobs do not need to know about other users
» Much faster than scanning large lists
» Uses BProc credentials (the full passwd entry) if available

A sketch of this lookup style follows.
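A minimal sketch of a lookup that answers only for the current user and root, instead of scanning a site-wide user list. This illustrates the idea; it is not the BeoNSS code:

    import os
    import pwd

    def cluster_getpwnam(name: str) -> pwd.struct_passwd:
        """Return a pwent only for the users a cluster job actually needs."""
        me = pwd.getpwuid(os.getuid())     # the current user: always known
        if name == me.pw_name:
            return me
        if name == "root":                 # and a few others, e.g. root
            return pwd.getpwnam("root")
        raise KeyError(name)               # everyone else: deliberately unknown

    print(cluster_getpwnam(pwd.getpwuid(os.getuid()).pw_name))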
Netgroups

Netgroups are used primarily for file server permissions
» Netgroups are used in /etc/exports for NFS
» Other file systems have similar security
» Other uses, such as rexec and rsh, are also supported

The archetype netgroup is "cluster" (an alias for "cluster0")
» Group members are all compute nodes
» Hostnames are reported as ".0", ".1" ... ".31"
» The maximum cluster node count is important
» There are other netgroups, e.g. "nodes-available"

A sketch of generating the "cluster" netgroup follows.
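Since the members are just ".0" through ".<max-1>", the netgroup can be generated from the node count rather than maintained by hand. A minimal sketch, using the standard (host,user,domain) netgroup triple syntax:

    def cluster_netgroup(max_nodes: int) -> str:
        """Emit a netgroup line listing every compute node as a host triple."""
        members = " ".join(f"(.{n},-,-)" for n in range(max_nodes))
        return f"cluster {members}"

    # cluster (.0,-,-) (.1,-,-) (.2,-,-) (.3,-,-)
    print(cluster_netgroup(4))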
Beostat

Unified state, status, and statistics, used for:
» Scheduling
» Monitoring
» Diagnostics

A sketch of the shared-cache idea follows.
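A minimal sketch of a beostat-style shared cache: every node's state, status, and statistics sit in one master-side table that schedulers, monitors, and diagnostics all read. The fields here are hypothetical:

    node_stats = {
        0: {"state": "up",   "load": 0.12, "mem_free_mb": 1900},
        1: {"state": "up",   "load": 1.98, "mem_free_mb": 120},
        2: {"state": "down", "load": None, "mem_free_mb": None},
    }

    def least_loaded(stats: dict) -> int:
        """A scheduling use: pick the 'up' node with the lowest load."""
        up = {n: s for n, s in stats.items() if s["state"] == "up"}
        return min(up, key=lambda n: up[n]["load"])

    print("schedule on node", least_loaded(node_stats))    # node 0
    print("down nodes:",
          [n for n, s in node_stats.items() if s["state"] == "down"])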
Other Virtualization Aspects

What about those other virtualization aspects, such as "abstracting changes"?

Dynamic node addition
» Handled automatically
» Must not be easy, or everyone would do it correctly

Node failures and dynamic node deletion
» Just cross your fingers and hope? ...or have a model to handle them:
» Compute nodes cache system elements: kernel, basic filesystem, libraries, executables
» But the application handles I/O and process failures
» Other masters take over control and monitoring
Where can we use virtualization?

» Better server utilization
» Running validated OS+application pairs on new hardware
» Apps with mutually incompatible OS requirements
What are our challenges?

The VM support problem is exactly the same as the cluster problem:
» Booting and provisioning
» Monitoring and resource limitations
» Process initiation, monitoring, and control
Questions?