The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering...

49
1 The State of Cluster Systems Overview and Future Directions Donald Becker February 2007 BWBUG meeting

Transcript of The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering...

Page 1: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

1

The State of Cluster Systems Overview and Future Directions

Donald Becker

February 2007 BWBUG meeting

Page 2: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

State of Commodity Clustering

The state of commodity clustering is strong

 Commodity Clusters dominate the HPC world

» A strong showing in the Top­500 list

» Displacing custom HPC architectures

 Linux is the OS of choice for clusters

» Over 75% of high­end HPC machines use Linux (November 2006)

» An even higher percentage in mid­range and utility computing

Page 3: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Virtualization / Virtualization / 

SMP­like ModelSMP­like Model Economics Standards Scalabilityx Heavy system integration

x Management complexity 

x Non­optimized performance

Simplicity Manageability Reliability Powerx Proprietary

x Prescriptivex Expensive

Commodity Commodity 

ClustersClusters

Trends in High Performance Computing

Open Source LinuxOpen Source Linux Standards Based Standards Based 

ComputingComputing

Maximizing the ROI of Clustering is all about getting the Maximizing the ROI of Clustering is all about getting the 

Best of all WorldsBest of all Worlds

 R&D Quality Scale

 Innovation EconomicsAbility to customizePotential stabilityx  Opaque commercial support

The ideal HPC platformThe ideal HPC platform

Lowers capital costsLowers capital costs

Supports standards Supports standards 

Dramatically improves Dramatically improves 

productivity productivity 

Lowers your operational Lowers your operational 

expensesexpenses

Page 4: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Beowulf Clustering: The Early Years Conceived by Becker and Sterling in ‘93 and 

initiated at NASA in ’94 Objective: show that commodity clusters could 

solve HPC problems Build a system that scaled in all dimensions

» Networking, bandwidth, disks, main memory, processing power

Initial prototype

» 16 processors, Channel­bonded Ethernet, under $50K

» Matched performance of contemporary $1M machine

Page 5: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Master Node

Interconnection Network

Internet or Internal Network

Traditional “Ad­Hoc” Linux Cluster

 Full set of disparate daemons, services, user/password, & host access setup Basic parallel shell with complex glue scripts run jobs Monitoring & management added as isolated tools

 Full Linux install to disk; Load into memory

● Manual & Slow; 5­30 minutes

Manual management of multiple, individual servers

Master node

Page 6: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Lessons Learned

What did we do wrong?

Complexity

Clusters required extensive training to install, configure and use

Long­term administration and updates were difficult

Only “static scalability”

Page 7: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Definitions

Clustering:

Independent computers

Combined into a unified 

system

Through software and 

networking

Virtualization: 

Isolation from physical 

resources to

Divide ample resources

Emulate scarce resources

Combine resources

Handle changes

Page 8: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

What does cluster architecture have to do with 

virtualization?

Good cluster systems virtualize

» Create a unified view of independent machines

» Single System Image (Illusion)

Machine Virtualization creates clusters» Many problems solvable with cluster experience

Virtualization isn't always a single thing» Creating unified view with a toolkit

Page 9: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

HPC Clustering ­ redefined

Manual management of multiple, individual servers – slow to scale, reliability issues

● Full operating system install on each compute node in cluster

Complex Cluster Management and Administration

● Time, skill and cost intensive

Security threat for each HPC compute node deployed 

‘Do­it­yourself’, ad hoc clusters – cumbersome and slow

Deploy, monitor and scale virtual pool of resources from a single point

● Only lightweight distribution on compute nodes ­ provision on demand, up in seconds 

● Software version skews eliminated

Central Command/Control makes the complex simple & cost­effective

● Manage entire cluster as single, consistent virtual system

FROM (traditional) TO (Cluster Virtualization) Complete cluster management solution 

that runs out­of­the­box● Robust, sophisticated architecture● Commercially tested and supported

Central security management● Lightweight compute nodes = nothing to 

hack

Driving productivity up and complexity outSimplicity is the value of Cluster Virtualization

Page 10: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Cluster Virtualization Architecture Realized

• Virtual, unified process space enables intuitive single sign­on, job submission

● Effortless job migration to nodes

• Minimal in­memory OS with single daemon rapidly deployed in seconds ­ no disk

● Less than 20 seconds

Master Node

Interconnection Network

Internet or Internal Network

• Monitor & manage efficiently from the Master ●Single System Install●Single Process Space●Shared cache of the cluster state●Single point of provisioning●Better performance due to lightweight nodes●No version skew is inherently more reliable 

Optional Disks

Manage & use a cluster like a single SMP machineManage & use a cluster like a single SMP machine

Page 11: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Elements of Cluster Systems

Some important elements of a cluster system Booting and Provisioning Process creation, monitoring and control Update and consistency model Name services File Systems Physical management Workload virtualization

Page 12: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Booting

Booting requirements Reliable booting Automatic scaling Centralized control and updates Reporting and diagnostics designed into the system No dependency on machine­local storage or 

configuration

Page 13: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•The Scyld solution

Integrated, automatic network boot

Basic hardware reporting and diagnostics in the Pre­OS stageOnly CPU, memory and NIC needed

Kernel and minimal environment from masterJust enough to say “what do I do now?”

Remaining configuration driven by master

Page 14: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Booting ImplementationIntegrated boot server, supporting PXE and other 

protocols Avoids network booting reliability issues Understands PXE versions and bugs Avoids TFTP "capture effect" Built­in dynamic node addition DHCP service for non­cluster hosts

Hardware reporting and diagnostics in pre­boot environment

Used to provide specific boot images

Page 15: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Provisioning

Provisioning Design: Traditional full standard installation on Masters Dynamically built "distribution" on other nodes

Core system is "stateless" non­persistent Provided by the boot server as ramdisks No local storage or network file system used Additional elements provided under centralized master 

control

Page 16: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•Initial provisioning architecture

Initial operation using only CPU, memory and NIC

Minimize actions before under central control

Master configures systemsStep­by­step flow of controlConfiguration and policy exists and used only on masterOpposite of traditional "download a system and cross 

fingers"

Page 17: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Phase 1 Booting Details

Client operations from network boot image Scan bus and load all NIC drivers DHCP on all interfaces (locate log host and run­time 

master) Start logging, then execute commands from master Master order PCI bus scans, uses IDs to load device 

drivers

Final steps of boot: create application environment Mount local and remote file system as needed by 

application

Page 18: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Installation, initial provisioning, and updates

Replacing a failed node is painful● Replace hardware, power on● Manual Install and Configuration● 15­30 minutes

Complex, Scripted Node Updates● Login/apply foreach node● Manually correct failed updates

Disk installs get stale, inconsistent● Tedious manual verification ● Or 30 minute reprovision

Disk­Based Install of entire cluster● Script or tool maintenance (kickstart) ●15­30 minutes/node, days/cluster

Replacing a failed node is effortless● Replace hardware, power on● Node is up in 20 seconds

Single install of the update on Master● Master auto­updates nodes● Guarantees consistency

FROM (traditional) TO (Stateless Auto-Provisioning) Single Install on the Master 

● Auto­provisions in­memory OS to nodes● < 20 seconds/node; min. to hours/cluster

Stateless provisioning always consistent● Refresh in 20 seconds if need be

Turning Provisioning from Painful to EffortlessFocus on application challenges ­ Not node administration

Page 19: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Process Initiation, Monitoring and Control

Problems:Starting jobs on a dynamic clusterMonitoring and controlling running processesAllowing interactive and scheduler­based usage

Opportunity:Clusters jobs are issued from designated mastersThat master has exactly the required environmentWe already have a POSIX job control modelTight communication 

Page 20: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Operation

Semantics: Execution consistency Remote execution produces same results as local 

execution

Implications: same executable (including version!) same libraries, including library linkage order same parameters and environment

Page 21: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Traditional Approaches to Remote Execution Consistency

One approach: treat cluster nodes as workstations Full local OS installation

» Automated installation» Ad hoc reinstall or update system» But synchronization (copying) is not consistency!

NFS mounting OS» Scalability problem, typical limit 30­40 nodes» Reliability problems under heavy load» Mediocre consistency when updating

Page 22: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Run­time provisioning architecture

We "copy" running processes Correct: We know the current versions, library linkage 

order, environment, etc. Fast: libraries and executables are cached Transfer and cache as versioned whole files With dynamic caching we can start with "nothing"! Dynamically builds a minimal custom installation

Feasible and efficient because "slave nodes" run only applications,

 not full OS install

Page 23: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•Hybrid environment

Caching libraries and executables limits Only handles automatic consistency Scripts have inherently unspecified consistency

» Which version is correct after updates?

Approach Extend and expose caching FS Network mount master and other servers

Violates rule of “mount only to support application”

Page 24: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Cluster­wide process space

Single virtual process space over cluster Standard process monitoring tools work unchanged Negligible performance impact Careful design avoids cascading failures

Major operational and performance benefitsConsider cluster­wide “killall”Over 7 minutes on U­Az cluster with 'ssh'Real­time, interactive response with Scyld approach

Page 25: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Unified process space implementation

Correct semantics and efficient monitoring/control Implemented by extending master's process table 2.2/2.4 Linux kernel implementation with custom hooks Redesigned 2.6 implementation minimizes/eliminates 

hooks» Uses security and auditing interface (e.g. "SE Linux")

Page 26: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•Challenges to Unified Process Spaces

Technical scalability is good» Scales to 1000s of nodes and processes

Perceptual scalability is the limit» What do you do when 'top' reports top 20 of 10,000?

Prototype solution: application­specified process presentation

Page 27: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•Casco Bay

  A new standard­like specification being pushed by Intel Developed by Intel, and is very Intel(tm) specific Weak yet overly specific

  Flawed, but addresses a need

Page 28: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Casco Bay details

Linux 2.6.17 or laterExcept if your name is “RedHat” or “SuSE”, then 2.6.9

PXE booting

Linux Standards Base execution environment

Specific compilers and development environmentNo distinction between execution and development 

environments

Page 29: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Maintaining Efficiency, Scalability, Security

Multiple Services, Agents, Daemons consume resources & compromise Performance and Scalability

● Memory Bandwidth (400Mb Full Install)● Scheduling latencies of daemons, 

services● Multiple agents, extra traffic won’t scale

Full User and Access requires configuration and opens security holes

Configuring large numbers of servers is painful

● Users, Services, Access, Security etc. Right Sized Compute Node OS consumes 

less resources ● (8 MB typical RAM)● Can always auto provision extra libraries● 5­15% Performance Improvement typical● Optimized management agent design scales 

well

FROM (Traditional) TO (Lightweight Compute Nodes) Compute nodes are slaves to run jobs 

directed by Master● Never have to setup compute nodes

Virtual Fortress● Lightweight Slave with no attack points is 

inherently more secure and reliable

Dynamically Provisioned Compute Nodes DramaticallyImproves Efficiency, Scalability, Security

Page 30: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Simplifying the complexity of managing large pools of servers

Multiple, disparate methods and tools for monitoring

● Complex to setup and manage● Does not scale

Job control requires interaction with multiple servers

Setup users and security on all machines

Manual management of multiple, individual servers

● Tedious, time­consuming

Single point of monitoring● Logs, statistics, status automatically forward 

to the Master node● Exceptional ease of use and scalability

Single point of job submission and job control

● Single unified process space● Signals and Std I/O  Forwarding

FROM (traditional) TO (Scyld) Present the cluster as a single, powerful 

Linux system● As easy as managing a single workstation

Single point of user/security administration

Manage Large Pools of Servers as a Single, Consistent, Virtual System

Page 31: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Workload Virtualization

“Workload Virtualization”,                                                  

 just another name for site­wide schedulers

Provide a virtual environment,                                           

      but only when viewed through specific tools

Page 32: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Mapping utilization to organization policies and priorities

Command Line only● Write many complex, difficult to 

maintain scripts● No visualization as to what is 

happening

Weak on mapping priorities to organizational policies

No Graphical, Historical Reporting or Accounting

Powerful graphical visualization● Point and click control● View all relevant resources/job/user status● Timeline and job location views● Powerful “what­if” simulations

Powerful, easy­to­implement policies● Optimize utilization according to multiple 

organizational criteria● Fair Share, Deadline, Backfill ● Reservations, Triggers

FROM (Traditional) TO (Commercial Solutions)

Instant, graphical real time and historical reporting and charge­back accounting

Maximize Cluster Utilization Against Business PrioritiesWith Powerful, Instant Visualization of the Cluster

Page 33: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Advanced Workload Virtualization

Maximizing Cluster Utilization­ Maximizing Cluster Utilization­ Mapped to Business PrioritiesMapped to Business Priorities

Page 34: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

File System Architecture

"Diskless Administrative" modelReliability of no file system dependencies

Yet with no constraints on application root or performance penalty

File system agnostic Underlying system needs no local or remote file 

system File systems are mounted to support the application

Clean consistency and administrative model 

Page 35: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Scyld Beowulf file system support

Mounting application file systems is a per­installation configuration» Default system mounts “no” file systems

Administration is done on the master» File system configuration tables (fstab) are on the master

» Cluster­wide default, with per­node specialization

» Standard format, but in /etc/beowulf/fstab

Mounting is a final step of initial provisioningFlow­of­control on master: failures are diagnosable, non­fatal

Handles kernel extension (module) loading

Page 36: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Cluster Name Services: BeoNSS

"Name Service" AKA "Directory Service" is the general concept of a

mapping a name to a value, or providing a list of names.

Concrete examples are Host names Network groups User names and passwords IP addresses and Ethernet MAC addresses

Page 37: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Cluster Name Services

Why are cluster nameservices important? Simplicity Eliminates per­node configuration files Automates scaling and updates Performance

Avoid the serialization of network name lookups.

Avoid communicating with a busy server

Avoid failures and retry cost of server overload

Avoid the latency of consulting large databases

Page 38: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Opportunities: Why we can do better

Cluster nodes have predictable names and addresses

We always issue jobs from MastersOne point in space and time where we know the  answer

Clusters have a single set of users per master"Single Sign On" means similar credentials over all machines

Many table lookups are to never­changing results

Uncommon queries are rarely performance problemsWe don't need to get it right, we just can't get it wrong

Page 39: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

BeoNSS: A cluster­specific name service

Solution: BeoNSS, Beowulf Name Services Hostnames

Page 40: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•BeoNSS Hostnames

Opportunity: We control IP address assignmentAssign node IP addresses in node order

Changes name lookup to addition

Name formatCluster hostnames have the base form  .<N>

Options for admin­defined names and networks

Special names for "self" and "master"

Current machine is ".­2"  or "self".

Master is known as  ".­1", “master”, “master0”

Page 41: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

BeoNSS User credentials

Names are reported as password table entry 'pwent'

BeoNSS reports only the current userAnd a few other e.g. root

Cluster jobs do not need to know other users

Much faster than scanning large lists

Use BProc credentials (full passwd entry) if available

Page 42: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Netgroups

Netgroups are used primarily for file server permissionsNetgroups are used in /etc/exports for NFS

Other file systems have similar security

Other uses, such as rexec and rsh, are supported

The archtype netgroup is "cluster"Alias for "cluster0"

Group members are all compute nodes

Hostnames are reported as ".0", ".1" ... ".31"

Maximum cluster node count is important

Other netgroups e.g. “nodes­available”

Page 43: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Beostat

Unified State, Status and Statistics

Used forScheduling

Monitoring

Diagnostics

Page 44: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Other Virtualization Aspects

What about those other virtualization aspects?

Such as “abstracting changes”?

Dynamic node addition

» Handled automatically

» Must not be easy, or everyone would do it correctly

Page 45: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Node Failures

Dynamic node deletion

Just cross fingers and hope?

... Or have a model to handle that

Compute nodes cache system elements

Kernel, basic filesystem, libraries, executables

But application handles I/O and process failures

Other masters take over control and monitoring

Page 46: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Where can we use virtualization?

Better server utilization

Running validated 

OS+application pairs on new  

hardware

Apps w/ mutually incompatible 

OS requirements

Page 47: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

•What are our challenges?

The VM support problem is exactly the same as the 

cluster problem

Booting and provisioning

Monitoring and resource limitations

Process initiation, monitoring and control

Page 48: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...
Page 49: The State of Cluster Systems - WordPress.com · 2007. 4. 20. · Maximizing the ROI of Clustering is all about getting the Best of ... Optimized management agent design scales ...

Questions?