The State of Cluster Systems Overview and Future Directions
Donald Becker
February 2007 BWBUG meeting
State of Commodity Clustering
The state of commodity clustering is strong
Commodity Clusters dominate the HPC world
» A strong showing in the Top500 list
» Displacing custom HPC architectures
Linux is the OS of choice for clusters
» Over 75% of high-end HPC machines use Linux (November 2006)
» An even higher percentage in midrange and utility computing
Trends in High Performance Computing

[Diagram: three approaches to HPC and their trade-offs]

Virtualization / SMP-like Model
» Simplicity, manageability, reliability, power
» But: proprietary, prescriptive, expensive

Commodity Clusters
» Economics, standards, scalability
» But: heavy system integration, management complexity, non-optimized performance

Open Source Linux / Standards-Based Computing
» R&D, quality, scale, innovation, economics, ability to customize, potential stability
» But: opaque commercial support

Maximizing the ROI of clustering is all about getting the best of all worlds. The ideal HPC platform:
» Lowers capital costs
» Supports standards
» Dramatically improves productivity
» Lowers your operational expenses
Beowulf Clustering: The Early Years

Conceived by Becker and Sterling in ‘93 and initiated at NASA in ’94
Objective: show that commodity clusters could solve HPC problems
Build a system that scaled in all dimensions
» Networking, bandwidth, disks, main memory, processing power
Initial prototype
» 16 processors, channel-bonded Ethernet, under $50K
» Matched the performance of a contemporary $1M machine
[Diagram: master node linked to compute nodes by an interconnection network; the master also connects to the Internet or an internal network]
Traditional “Ad-Hoc” Linux Cluster

● Full set of disparate daemons, services, user/password, and host access setup
● Basic parallel shell with complex glue scripts to run jobs
● Monitoring and management added as isolated tools
● Full Linux install to disk, then loaded into memory: manual and slow, 5-30 minutes
● Manual management of multiple, individual servers
Lessons Learned
What did we do wrong?
Complexity
Clusters required extensive training to install, configure and use
Long-term administration and updates were difficult
Only “static scalability”
Definitions

Clustering: independent computers, combined into a unified system through software and networking.

Virtualization: isolation from physical resources, in order to:
» Divide ample resources
» Emulate scarce resources
» Combine resources
» Handle changes
What does cluster architecture have to do with virtualization?

Good cluster systems virtualize
» Create a unified view of independent machines
» Single System Image (illusion)
Machine virtualization creates clusters
» Many problems are solvable with cluster experience
Virtualization isn't always a single thing
» Creating a unified view with a toolkit
HPC Clustering Redefined

FROM (traditional):
● Manual management of multiple, individual servers: slow to scale, reliability issues
● Full operating system install on each compute node in the cluster
● Complex cluster management and administration: time, skill, and cost intensive
● A security threat for each HPC compute node deployed
● ‘Do-it-yourself’, ad hoc clusters: cumbersome and slow

TO (Cluster Virtualization):
● Deploy, monitor, and scale a virtual pool of resources from a single point
● Only a lightweight distribution on compute nodes, provisioned on demand, up in seconds
● Software version skews eliminated
● Central command/control makes the complex simple and cost-effective: manage the entire cluster as a single, consistent virtual system
● A complete cluster management solution that runs out-of-the-box: robust, sophisticated architecture; commercially tested and supported
● Central security management: lightweight compute nodes = nothing to hack

Driving productivity up and complexity out: simplicity is the value of Cluster Virtualization.
Cluster Virtualization Architecture Realized

• Virtual, unified process space enables intuitive single sign-on and job submission
» Effortless job migration to nodes
• Minimal in-memory OS with a single daemon, rapidly deployed with no disk
» Up in less than 20 seconds
• Monitor and manage efficiently from the master
» Single system install
» Single process space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which is inherently more reliable

[Diagram: master node, interconnection network, Internet or internal network, compute nodes with optional disks]

Manage and use a cluster like a single SMP machine.
Elements of Cluster Systems
Some important elements of a cluster system:
» Booting and provisioning
» Process creation, monitoring, and control
» Update and consistency model
» Name services
» File systems
» Physical management
» Workload virtualization
Booting
Booting requirements:
» Reliable booting
» Automatic scaling
» Centralized control and updates
» Reporting and diagnostics designed into the system
» No dependency on machine-local storage or configuration

The Scyld solution:
» Integrated, automatic network boot
» Basic hardware reporting and diagnostics in the pre-OS stage; only CPU, memory, and NIC needed
» Kernel and minimal environment from the master, just enough to say “what do I do now?”
» Remaining configuration driven by the master
Booting Implementation

Integrated boot server, supporting PXE and other protocols:
» Avoids network-booting reliability issues
» Understands PXE versions and bugs
» Avoids the TFTP "capture effect"
» Built-in dynamic node addition
» DHCP service for non-cluster hosts

Hardware reporting and diagnostics in the pre-boot environment are used to provide node-specific boot images, as sketched below.
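A minimal sketch of that last point, assuming a hypothetical report format and image names; this illustrates the idea, not Scyld's actual boot server:

    # Choose a boot image for a node based on what it reported about itself
    # in the pre-boot environment. Fields and image names are hypothetical.

    def select_boot_image(report: dict) -> dict:
        """Map a node's pre-boot hardware report to a kernel + initrd pair."""
        if "myri10ge" in report.get("nic_drivers", []):
            # Node has a 10G NIC: hand it an image with that driver included.
            return {"kernel": "vmlinuz-cluster", "initrd": "initrd-10g.img"}
        if report.get("mem_mb", 0) < 512:
            # Very small memory: hand it a trimmed ramdisk.
            return {"kernel": "vmlinuz-cluster", "initrd": "initrd-small.img"}
        return {"kernel": "vmlinuz-cluster", "initrd": "initrd-default.img"}

    # During the pre-OS stage a node reports only CPU, memory, and NIC.
    node_report = {"cpu": "x86_64", "mem_mb": 2048, "nic_drivers": ["e1000"]}
    print(select_boot_image(node_report))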
Provisioning

Provisioning design:
» Traditional full standard installation on masters
» Dynamically built "distribution" on the other nodes
» Core system is "stateless" (non-persistent), provided by the boot server as ramdisks
» No local storage or network file system used
» Additional elements provided under centralized master control

Initial provisioning architecture:
» Initial operation uses only CPU, memory, and NIC
» Minimize actions taken before coming under central control
» The master configures systems with a step-by-step flow of control
» Configuration and policy exist, and are used, only on the master
» The opposite of the traditional "download a system and cross your fingers"
Phase 1 Booting Details

Client operations from the network boot image:
» Scan the bus and load all NIC drivers
» DHCP on all interfaces (locate the log host and runtime master)
» Start logging, then execute commands from the master
» The master orders PCI bus scans and uses the IDs to load device drivers

Final steps of boot: create the application environment
» Mount local and remote file systems as needed by the application

A sketch of the client-side flow appears below.
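A minimal sketch of the phase-1 client flow, with stand-in functions for the PCI scan, DHCP, and the master's command stream; none of these are the actual Scyld interfaces:

    from typing import Iterator

    def scan_pci_bus() -> list[str]:
        # Stand-in for a real PCI scan: return the device IDs on the bus.
        return ["8086:100e"]           # e.g. an Intel e1000 NIC

    def dhcp_discover() -> str:
        # Stand-in for DHCP on all interfaces: returns the master's address.
        return "10.0.0.1"

    def master_commands(master: str) -> Iterator[str]:
        # Stand-in for the master-driven command stream; the real node reads
        # these over the network, one step at a time.
        yield "load_driver e1000"
        yield f"mount {master}:/home /home"

    def phase1_boot() -> None:
        for dev in scan_pci_bus():            # scan bus; master maps IDs to drivers
            print("found PCI device:", dev)
        master = dhcp_discover()              # locate runtime master + log host
        print("logging to", master)           # start logging first
        for cmd in master_commands(master):   # then do only what the master says
            print("executing:", cmd)

    phase1_boot()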
Installation, Initial Provisioning, and Updates

FROM (traditional):
● Disk-based install of the entire cluster: script or tool maintenance (kickstart); 15-30 minutes per node, days per cluster
● Replacing a failed node is painful: replace hardware, power on, manually install and configure; 15-30 minutes
● Complex, scripted node updates: log in and apply for each node; manually correct failed updates
● Disk installs get stale and inconsistent: tedious manual verification, or a 30-minute reprovision

TO (Stateless Auto-Provisioning):
● Single install on the master: auto-provisions the in-memory OS to the nodes; under 20 seconds per node, minutes to hours per cluster
● Replacing a failed node is effortless: replace hardware, power on; the node is up in 20 seconds
● Single install of an update on the master: the master auto-updates the nodes and guarantees consistency
● Stateless provisioning is always consistent: refresh in 20 seconds if need be

Turning provisioning from painful to effortless: focus on application challenges, not node administration.
Process Initiation, Monitoring and Control

Problems:
» Starting jobs on a dynamic cluster
» Monitoring and controlling running processes
» Allowing interactive and scheduler-based usage

Opportunity:
» Cluster jobs are issued from designated masters
» That master has exactly the required environment
» We already have a POSIX job control model
» Tight communication
Operation

Semantics: execution consistency. Remote execution produces the same results as local execution.

Implications:
» Same executable (including version!)
» Same libraries, including library linkage order
» Same parameters and environment

A sketch of capturing this context follows.
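A minimal sketch of what "same executable, libraries, parameters, environment" means in practice: capture exactly what a local run would use, so a remote run can be verified against it. This illustrates the semantics, not the Scyld mechanism:

    import hashlib
    import os
    import sys

    def execution_context(argv: list[str]) -> dict:
        """Record what a local execution would actually use."""
        with open(sys.executable, "rb") as f:
            exe_hash = hashlib.sha256(f.read()).hexdigest()
        return {
            "executable": sys.executable,
            "executable_sha256": exe_hash,   # same executable, same version
            "args": argv,                    # same parameters
            "env": dict(os.environ),         # same environment
            # Same libraries and linkage order would come from the dynamic
            # linker; elided here.
        }

    ctx = execution_context(["myapp", "--steps=100"])
    print(ctx["executable"], ctx["executable_sha256"][:12])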
Traditional Approaches to Remote Execution Consistency

One approach: treat cluster nodes as workstations, with a full local OS installation
» Automated installation
» Ad hoc reinstall or update system
» But synchronization (copying) is not consistency!

Another approach: NFS-mounting the OS
» Scalability problem: the typical limit is 30-40 nodes
» Reliability problems under heavy load
» Mediocre consistency when updating
Runtime provisioning architecture

We "copy" running processes
» Correct: we know the current versions, library linkage order, environment, etc.
» Fast: libraries and executables are cached
» Transfer and cache as versioned whole files
» With dynamic caching we can start with "nothing"!
» Dynamically builds a minimal custom installation

Feasible and efficient because the "slave nodes" run only applications, not a full OS install. A sketch of such a cache follows.
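A minimal sketch of a versioned whole-file cache, assuming a content hash as the version and a local copy standing in for the network fetch; the cache path and functions are hypothetical, not the Scyld implementation:

    import hashlib
    import os
    import shutil

    CACHE_DIR = "/var/cache/node-libs"   # hypothetical node-side cache

    def content_version(path: str) -> str:
        """How the master computes a file's version: hash the whole file."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def ensure_cached(name: str, master_copy: str, version: str) -> str:
        """Return a cached file matching the exact version the job uses."""
        cached = os.path.join(CACHE_DIR, version, name)
        if not os.path.exists(cached):               # miss: fetch the whole file
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            shutil.copy(master_copy, cached)          # stand-in for network fetch
        return cached                                 # hit: nothing to transfer

Because files are cached under their version, an updated library on the master simply produces a new cache entry; a running job keeps the version it started with.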
Hybrid environment

Caching libraries and executables has limits
» It only handles automatic consistency
» Scripts have inherently unspecified consistency: which version is correct after updates?

Approach: extend and expose the caching file system; network-mount the master and other servers
» This violates the rule of "mount only to support the application"
Cluster-wide process space

Single virtual process space over the cluster
» Standard process monitoring tools work unchanged
» Negligible performance impact
» Careful design avoids cascading failures

Major operational and performance benefits. Consider a cluster-wide "killall": over 7 minutes on the UAz cluster with 'ssh', versus real-time, interactive response with the Scyld approach. The sketch below contrasts the two.
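A minimal sketch of why the two approaches differ so much, with hypothetical node names and a toy process table rather than the real BProc data structures:

    import signal

    # Traditional: a serial ssh round-trip to every node; each iteration
    # pays connection, authentication, and command latency.
    def killall_ssh(nodes: list[str], job: str) -> None:
        for node in nodes:
            print(f"ssh {node} killall {job}")

    # Unified process space: remote processes already appear in the
    # master's process table, so one local walk signals all of them.
    def killall_unified(process_table: dict[int, str], job: str) -> None:
        for pid, name in process_table.items():
            if name == job:
                print(f"kill -{int(signal.SIGTERM)} {pid}")

    killall_ssh(["node0", "node1"], "myapp")
    killall_unified({9001: "myapp", 9002: "other", 9003: "myapp"}, "myapp")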
Unified process space implementation

Correct semantics and efficient monitoring/control
» Implemented by extending the master's process table
» 2.2/2.4 Linux kernel implementation with custom hooks
» The redesigned 2.6 implementation minimizes/eliminates hooks
» Uses the security and auditing interface (e.g. "SELinux")
Challenges to Unified Process Spaces

Technical scalability is good
» Scales to 1000s of nodes and processes
Perceptual scalability is the limit
» What do you do when 'top' reports the top 20 of 10,000 processes?
Prototype solution: application-specified process presentation
Casco Bay

A new standard-like specification being pushed by Intel
» Developed by Intel, and very Intel(tm)-specific
» Weak, yet overly specific
» Flawed, but it addresses a need
Casco Bay details

» Linux 2.6.17 or later, except if your name is "Red Hat" or "SuSE", then 2.6.9
» PXE booting
» Linux Standard Base execution environment
» Specific compilers and development environment
» No distinction between execution and development environments
Maintaining Efficiency, Scalability, Security

FROM (traditional):
● Multiple services, agents, and daemons consume resources and compromise performance and scalability
» Memory and bandwidth (a 400 MB full install)
» Scheduling latencies of daemons and services
» Multiple agents and extra traffic won't scale
● Full user and access setup requires configuration and opens security holes
● Configuring large numbers of servers is painful: users, services, access, security, etc.

TO (Lightweight Compute Nodes):
● A right-sized compute node OS consumes fewer resources (8 MB of RAM is typical)
» Can always auto-provision extra libraries
» A 5-15% performance improvement is typical
» Optimized management agent design scales well
● Compute nodes are slaves that run jobs directed by the master: never have to set up compute nodes
● Virtual fortress: a lightweight slave with no attack points is inherently more secure and reliable

Dynamically provisioned compute nodes dramatically improve efficiency, scalability, and security.
Simplifying the complexity of managing large pools of servers

FROM (traditional):
● Multiple, disparate methods and tools for monitoring: complex to set up and manage; does not scale
● Job control requires interaction with multiple servers
● Users and security set up on every machine
● Manual management of multiple, individual servers: tedious and time-consuming

TO (Scyld):
● Present the cluster as a single, powerful Linux system: as easy as managing a single workstation
● Single point of monitoring: logs, statistics, and status automatically forward to the master node; exceptional ease of use and scalability
● Single point of job submission and job control: a single unified process space; signal and standard I/O forwarding
● Single point of user/security administration

Manage large pools of servers as a single, consistent, virtual system.
Workload Virtualization

"Workload virtualization" is just another name for site-wide schedulers: they provide a virtual environment, but only when viewed through specific tools, mapping utilization to organizational policies and priorities.

FROM (traditional):
● Command line only: write many complex, difficult-to-maintain scripts; no visibility into what is happening
● Weak at mapping priorities to organizational policies
● No graphical, historical reporting or accounting

TO (commercial solutions):
● Powerful graphical visualization: point-and-click control; view all relevant resource/job/user status; timeline and job-location views; powerful "what-if" simulations
● Powerful, easy-to-implement policies: optimize utilization according to multiple organizational criteria; fair share, deadline, backfill; reservations, triggers (a fair-share sketch follows)
● Instant, graphical real-time and historical reporting and chargeback accounting

Advanced workload virtualization: maximizing cluster utilization, mapped to business priorities, with powerful, instant visualization of the cluster.
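One of the policies named above, fair share, can be stated in a few lines. A minimal sketch of the idea, not any particular scheduler's formula; the weighting and numbers are illustrative:

    # Fair share: a group's job priority falls as its recent usage
    # climbs past its target share of the cluster.

    def fair_share_priority(base: float, used_fraction: float,
                            target_share: float) -> float:
        """Scale base priority by how far a group is over or under its share."""
        return base * (target_share / max(used_fraction, 1e-9))

    # A group entitled to 25% of the cluster that has recently used 50%
    # sees its priority halved: 100 -> 50.
    print(fair_share_priority(100.0, used_fraction=0.50, target_share=0.25))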
File System Architecture

The "diskless administrative" model
» Reliability of having no file system dependencies
» Yet with no constraints on the application root, and no performance penalty

File system agnostic
» The underlying system needs no local or remote file system
» File systems are mounted to support the application
» Clean consistency and administrative model
Scyld Beowulf file system support

Mounting application file systems is a per-installation configuration
» The default system mounts no file systems

Administration is done on the master
» File system configuration tables (fstab) are on the master
» Cluster-wide default, with per-node specialization
» Standard format, but in /etc/beowulf/fstab

Mounting is a final step of initial provisioning
» Flow of control is on the master: failures are diagnosable and non-fatal
» Handles kernel extension (module) loading

A sketch of this master-side model follows.
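A minimal sketch of the master-side model, assuming a per-node override file named fstab.<N>; the /etc/beowulf/fstab path comes from the slide, but the override naming is an assumption:

    import os

    def read_fstab(path: str) -> list[str]:
        """Read an fstab-format file, skipping blanks and comments."""
        if not os.path.exists(path):
            return []
        with open(path) as f:
            return [ln.strip() for ln in f
                    if ln.strip() and not ln.startswith("#")]

    def mounts_for_node(node: int) -> list[str]:
        default = read_fstab("/etc/beowulf/fstab")           # cluster-wide default
        override = read_fstab(f"/etc/beowulf/fstab.{node}")  # per-node specialization
        return override if override else default

    # The master walks the list itself, so a bad entry is a diagnosable,
    # non-fatal error instead of a node that silently fails to come up.
    for entry in mounts_for_node(5):
        print("node 5 mounts:", entry)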
Cluster Name Services: BeoNSS

A "name service," AKA "directory service," is the general concept of mapping a name to a value, or providing a list of names. Concrete examples:
» Host names
» Network groups
» User names and passwords
» IP addresses and Ethernet MAC addresses
Cluster Name Services

Why are cluster name services important?

Simplicity
» Eliminates per-node configuration files
» Automates scaling and updates

Performance
» Avoids the serialization of network name lookups
» Avoids communicating with a busy server
» Avoids the failure and retry cost of server overload
» Avoids the latency of consulting large databases
Opportunities: Why we can do better

» Cluster nodes have predictable names and addresses
» We always issue jobs from masters: one point in space and time where we know the answer
» Clusters have a single set of users per master: "single sign-on" means similar credentials over all machines
» Many table lookups return never-changing results
» Uncommon queries are rarely performance problems: we don't need to make them fast, we just can't get them wrong
BeoNSS: A cluster-specific name service

The solution: BeoNSS, the Beowulf Name Services. First up: hostnames.
BeoNSS Hostnames

Opportunity: we control IP address assignment
» Assign node IP addresses in node order
» This turns name lookup into simple addition (see the sketch below)

Name format
» Cluster hostnames have the base form .<N>
» Options for admin-defined names and networks
» Special names for "self" and "master": the current machine is ".-2" or "self"; the master is known as ".-1", "master", or "master0"
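A minimal sketch of "lookup becomes addition": node N's address is assigned as a base address plus N, so resolving ".N" needs arithmetic, not a network query. The base address here is an arbitrary example, not a fixed BeoNSS constant:

    import ipaddress

    NODE_BASE = ipaddress.IPv4Address("10.54.0.100")   # hypothetical node-0 address

    def beonss_lookup(hostname: str) -> str:
        """Resolve a cluster hostname of the base form '.<N>' by addition."""
        n = int(hostname.lstrip("."))    # ".31" -> 31; ".-1" (master) -> -1
        return str(NODE_BASE + n)

    print(beonss_lookup(".0"))     # 10.54.0.100
    print(beonss_lookup(".31"))    # 10.54.0.131
    print(beonss_lookup(".-1"))    # 10.54.0.99, the master in this scheme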
BeoNSS user credentials

» Names are reported as a password table entry ('pwent')
» BeoNSS reports only the current user, plus a few others, e.g. root
» Cluster jobs do not need to know about other users
» Much faster than scanning large lists
» Uses BProc credentials (the full passwd entry) if available

A sketch of this lookup style follows.
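A minimal sketch of a lookup that answers only for the current user and root, instead of scanning a site-wide user list. This illustrates the idea; it is not the BeoNSS code:

    import os
    import pwd

    def cluster_getpwnam(name: str) -> pwd.struct_passwd:
        """Return a pwent only for the users a cluster job actually needs."""
        me = pwd.getpwuid(os.getuid())     # the current user: always known
        if name == me.pw_name:
            return me
        if name == "root":                 # and a few others, e.g. root
            return pwd.getpwnam("root")
        raise KeyError(name)               # everyone else: deliberately unknown

    print(cluster_getpwnam(pwd.getpwuid(os.getuid()).pw_name))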
Netgroups

Netgroups are used primarily for file server permissions
» Netgroups are used in /etc/exports for NFS
» Other file systems have similar security
» Other uses, such as rexec and rsh, are also supported

The archetype netgroup is "cluster" (an alias for "cluster0")
» Group members are all compute nodes
» Hostnames are reported as ".0", ".1" ... ".31"
» The maximum cluster node count is important
» There are other netgroups, e.g. "nodes-available"

A sketch of generating the "cluster" netgroup follows.
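Since the members are just ".0" through ".<max-1>", the netgroup can be generated from the node count rather than maintained by hand. A minimal sketch, using the standard (host,user,domain) netgroup triple syntax:

    def cluster_netgroup(max_nodes: int) -> str:
        """Emit a netgroup line listing every compute node as a host triple."""
        members = " ".join(f"(.{n},-,-)" for n in range(max_nodes))
        return f"cluster {members}"

    # cluster (.0,-,-) (.1,-,-) (.2,-,-) (.3,-,-)
    print(cluster_netgroup(4))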
Beostat

Unified state, status, and statistics, used for:
» Scheduling
» Monitoring
» Diagnostics

A sketch of the shared-cache idea follows.
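A minimal sketch of a beostat-style shared cache: every node's state, status, and statistics sit in one master-side table that schedulers, monitors, and diagnostics all read. The fields here are hypothetical:

    node_stats = {
        0: {"state": "up",   "load": 0.12, "mem_free_mb": 1900},
        1: {"state": "up",   "load": 1.98, "mem_free_mb": 120},
        2: {"state": "down", "load": None, "mem_free_mb": None},
    }

    def least_loaded(stats: dict) -> int:
        """A scheduling use: pick the 'up' node with the lowest load."""
        up = {n: s for n, s in stats.items() if s["state"] == "up"}
        return min(up, key=lambda n: up[n]["load"])

    print("schedule on node", least_loaded(node_stats))    # node 0
    print("down nodes:",
          [n for n, s in node_stats.items() if s["state"] == "down"])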
Other Virtualization Aspects

What about those other virtualization aspects, such as "abstracting changes"?

Dynamic node addition
» Handled automatically
» Must not be easy, or everyone would do it correctly

Node failures and dynamic node deletion
» Just cross your fingers and hope? ...or have a model to handle them:
» Compute nodes cache system elements: kernel, basic filesystem, libraries, executables
» But the application handles I/O and process failures
» Other masters take over control and monitoring
Where can we use virtualization?

» Better server utilization
» Running validated OS+application pairs on new hardware
» Apps with mutually incompatible OS requirements
What are our challenges?

The VM support problem is exactly the same as the cluster problem:
» Booting and provisioning
» Monitoring and resource limitations
» Process initiation, monitoring, and control
Questions?