Approaches to Clustering
CS444I Internet Services, Winter 00
© 1999-2000 Armando Fox <[email protected]>
© 1999, Armando Fox
Outline
- Non-cluster approaches to bigness
- Approaches to clustering
- Cluster case studies
  - Berkeley NOW/GLUnix
  - SNS/TACC
  - Microsoft Wolfpack
Approaches to Bigness
- One Big Mongo Server
- DNS Round Robin
- Magic Routers (a/k/a L4/L5 load balancing)
- Application-Level Replication
- True Clustering (case studies)
  - NOW/GLUnix: single-system Unix image
  - Microsoft Wolfpack: virtualize every service
  - SNS/TACC: fixed Internet-service programming model
One Big Mongo Server
- Example: AltaVista
- Scaling: what if you can’t get a server with enough main memory?
- Availability
- Growth path and cost
- Advantages of one big mongo server?
  - Many agencies now using their (old?) mainframes (e.g., IBM 390)
  - Putting a Web front end on legacy DBs/apps
- What if the application is (say) I/O bound?
DNS Round Robin
- Benefits
  - Software-transparent all the way to the network level
  - Expand the farm by updating DNS servers
- Costs
  - Coarse grain
  - Ad hoc
  - Effect of node failure
  - Some apps can’t be easily replicated (e.g., a database)
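The "coarse grain" and "effect of node failure" costs can be made concrete with a toy model (all names and addresses below are invented for illustration) of how round-robin DNS rotates its A-record list on each lookup:

```python
class RoundRobinDNS:
    """Toy model of round-robin DNS: each lookup rotates the A-record list."""
    def __init__(self, addresses):
        self.addresses = list(addresses)

    def lookup(self):
        # Return the current ordering, then rotate so the next client
        # sees a different first address.
        answer = list(self.addresses)
        self.addresses.append(self.addresses.pop(0))
        return answer

dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
first_picks = [dns.lookup()[0] for _ in range(6)]
# first_picks == ["10.0.0.1", "10.0.0.2", "10.0.0.3",
#                 "10.0.0.1", "10.0.0.2", "10.0.0.3"]
```

The rotation itself knows nothing about node health: if 10.0.0.2 dies, it still receives about a third of new clients until the zone is updated and cached answers expire, which is exactly the coarse-grained failure behavior the slide refers to.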
Approaches to True Clustering
- NOW/GLUnix: single Unix system image
- Microsoft Wolfpack: off-the-shelf support for commodity apps
- SNS/TACC: fixed Internet-service programming model
NOW: GLUnix
- Original goals:
  - High availability through redundancy
  - Load balancing, self-management
  - Binary compatibility
  - Both batch and parallel-job support
- I.e., a single system image for NOW users
  - Cluster abstractions == Unix abstractions
  - This is both good and bad… what’s missing?
- For portability and rapid development, build on top of an off-the-shelf OS (Solaris)
GLUnix Architecture
- Master collects load, status, etc. info from daemons
  - Repository of cluster state, centralized resource allocation
  - Pros/cons of this approach?
- Glib app library talks to the GLUnix master as an app proxy
  - Signal catching, process mgmt, I/O redirection, etc.
- Death of a daemon is treated as a SIGKILL by the master
[Diagram: one GLUnix master per cluster, collecting state from a glud daemon on each NOW node]
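A minimal sketch of the centralized-master pattern above (class and node names are invented; this is not GLUnix code): daemons report load to a single master, which does all resource allocation and drops nodes whose daemons die.

```python
class GlunixStyleMaster:
    """Toy sketch: a centralized master holds cluster state reported by
    per-node daemons and makes all placement decisions itself."""
    def __init__(self):
        self.load = {}  # node name -> last reported load average

    def report(self, node, load_avg):
        # Periodic update from a node's daemon.
        self.load[node] = load_avg

    def daemon_died(self, node):
        # Master forgets the node; in GLUnix, jobs there are treated
        # as having received SIGKILL.
        self.load.pop(node, None)

    def place_job(self):
        # Centralized allocation: run on the least-loaded live node.
        return min(self.load, key=self.load.get)

master = GlunixStyleMaster()
master.report("now1", 0.9)
master.report("now2", 0.2)
master.report("now3", 0.5)
chosen = master.place_job()        # "now2", the least loaded
master.daemon_died("now2")
fallback = master.place_job()      # "now3", next least loaded
```

The pros and cons fall straight out of this shape: the master has a globally consistent view (easy, optimal placement), but it is a single point of failure and a scalability bottleneck, exactly the issues the retrospective and scalability slides raise.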
GLUnix Retrospective
- Trends that changed the assumptions
  - SMPs have replaced MPPs, and are tougher to compete with
  - Kernels have become extensible
- Final features vs. initial goals
  - Tools: glurun, glumake (2nd most popular use of NOW!), glups/glukill, glustat, glureserve
  - Remote execution, but not total transparency
  - Load balancing/distribution, but not transparent migration/failover
  - Redundancy for high availability, but not for the “GLUnix master” node
GLUnix Interesting Problems
- Glumake and NFS “consistency”
- Support for benchmark-style batch jobs
  - Many instantiations, different parameters
  - Embarrassingly parallel
- Social considerations
  - User-initiated unnecessary (malicious?) restarts
- Lack of migration: an obstacle to harnessing desktop idle cycles (why?)
- Philosophy: did GLUnix ask the right question?
Scalability Limits
- Centralized resource management
- TCP connections! (file descriptors)
- Interconnect latency and bandwidth (HW level)
  - Myrinet: ~10 usec latency, 640 Mbits/s throughput
  - Ethernet: ~400 usec latency, 100 Mbits/s throughput
  - ATM: ~600 usec latency, 78 Mbits/s throughput (ATM was the initial target of the NOW!)
- Thoughts about the interconnect
  - What’s more important, latency or bandwidth?
  - Why else might we want a secondary interconnect?
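The latency-vs-bandwidth question can be made concrete with a back-of-the-envelope model, transfer time ≈ latency + size/bandwidth, plugging in the Myrinet and Ethernet figures quoted above:

```python
def xfer_time_us(size_bytes, latency_us, bw_mbits):
    # One-way transfer time in microseconds:
    # bits / (Mbit/s) conveniently comes out in microseconds.
    return latency_us + size_bytes * 8 / bw_mbits

# Small message (1 KB, e.g. an RPC or heartbeat): latency dominates.
myrinet_small  = xfer_time_us(1024, 10, 640)    # ~22.8 us
ethernet_small = xfer_time_us(1024, 400, 100)   # ~481.9 us (~21x slower)

# Large message (1 MB bulk transfer): bandwidth dominates.
myrinet_big  = xfer_time_us(2**20, 10, 640)     # ~13.1 ms
ethernet_big = xfer_time_us(2**20, 400, 100)    # ~84.3 ms (only ~6.4x slower)
```

So for the small control messages that dominate cluster coordination (heartbeats, load reports, RPCs), latency matters far more than bandwidth, which is why the Myrinet numbers were so attractive for NOW.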
Microsoft Wolfpack
- Goal: clustering support for “commodity” OS & apps (NT)
  - Clustering DLLs
  - Limited support for existing applications
- Elements of a Wolfpack cluster
  - Cluster leader & quorum resource
  - Other cluster members
  - Failover managers
  - Virtualized services
Wolfpack Operation
- Cluster leader and quorum resource
  - The quorum (cluster configuration DB) defines the cluster
  - Quorum had better be robust/highly available!
  - Prevents the “split brain” problem resulting from partitioning
- Heartbeats used to obtain membership info
- Services can be virtualized to run on one or more nodes, but sharing a single network name
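A toy sketch of why a single quorum resource prevents split brain (names are invented; Wolfpack's quorum is a shared cluster-configuration resource, modeled here as a simple single-owner lock): after a partition, at most one side can own the quorum, so at most one side continues as "the" cluster and the virtualized services' shared network name is never served twice.

```python
class QuorumResource:
    """Toy model: the quorum resource can be owned by at most one
    partition; only the owner may continue operating as the cluster."""
    def __init__(self):
        self.owner = None

    def try_acquire(self, partition_id):
        # First claimant wins; everyone else must stand down.
        if self.owner is None:
            self.owner = partition_id
        return self.owner == partition_id

quorum = QuorumResource()
# The network partitions the cluster into two groups; each group's
# surviving members try to continue as the cluster.
side_a_continues = quorum.try_acquire("partition-A")   # True
side_b_continues = quorum.try_acquire("partition-B")   # False: must halt
```

This is why the slide stresses that the quorum itself "had better be robust/highly available": it is the one piece of hard state the whole scheme hangs on.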
Wolfpack: Failover
- Failover managers negotiate among themselves to determine when/where/whether to restart a failed service
- Degenerate case: can restart legacy apps
  - Cluster-aware DLLs provided for writing your own apps
  - No guarantees on integrity/consistency
- Pfister: “…a means of simply providing transactional semantics for data, without necessarily having to buy an entire relational database in the bargain, would make it significantly easier for applications to be highly available in a cluster.”
TACC/SNS
- Specialized cluster runtime to host Web-like workloads
  - TACC: transformation, aggregation, caching, and customization, the elements of an Internet service
  - Build apps from composable modules, Unix-pipeline-style
- Goal: complete separation of *ility concerns from application logic
  - Legacy code encapsulation, multiple-language support
  - Insulate programmers from nasty engineering
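The Unix-pipeline-style composition can be sketched as follows (all worker names and transformations here are invented stand-ins, not TACC APIs): application logic is written as small workers, and concerns like caching wrap around them without the workers knowing.

```python
# Stand-in workers: a transformation and a customization step.
def distill_images(page):
    # Transformation: e.g., lossy compression of inline images.
    return page.replace("image", "small-image")

def add_banner(page):
    # Customization: per-service or per-user presentation tweak.
    return "[myservice] " + page

cache = {}
def cached(worker):
    # Caching composes around any worker without the worker's knowledge,
    # illustrating the separation of *ility concerns from app logic.
    def wrapper(page):
        if page not in cache:
            cache[page] = worker(page)
        return cache[page]
    return wrapper

def pipeline(*workers):
    # Unix-pipeline-style composition: each worker's output feeds the next.
    def service(page):
        for w in workers:
            page = w(page)
        return page
    return service

service = pipeline(cached(distill_images), add_banner)
out = service("html with image")   # "[myservice] html with small-image"
```

The point of the sketch is the shape, not the workers: the runtime, not the application author, owns replication, caching, and fault handling around each stage.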
TACC Examples
- HotBot search engine
  - Query the crawler’s DB
  - Cache recent searches
  - Customize UI/presentation
- TranSend transformation proxy
  - On-the-fly lossy compression of inline images (GIF, JPG, etc.)
  - Cache original & transformed versions
  - User specifies aggressiveness, “refinement” UI, etc.
[Diagram: TACC dataflow for the examples: clients (C), transformers (T), aggregators (A), caches ($), the crawler DB, and HTML output]
Cluster-Based TACC Server
- Component replication for scaling and availability
- High-bandwidth, low-latency interconnect
- Incremental scaling: commodity PCs
[Diagram: front ends (FE), caches ($), workers (W), a user-profile database, the load balancing & fault tolerance manager (LB/FT), and an administration interface/GUI, all connected by the interconnect]
“Starfish” Availability: LB Death
- FE detects LB death via broken pipe/timeout, then restarts the LB
[Diagram: the LB/FT process dies; front ends, caches, and workers remain connected via the interconnect]
“Starfish” Availability: LB Death, cont’d.
- The new LB announces itself (multicast), is contacted by workers, and gradually rebuilds its load tables
- FEs operate using cached LB info during the failure
- If the partition heals, extra LBs commit suicide
[Diagram: a restarted LB/FT rejoins the cluster over the interconnect]
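The recovery sequence above can be sketched as follows (class names and worker loads are invented; the real system uses multicast announcements rather than direct calls): front ends fall back to cached load-balancer state, and a restarted LB begins empty and repopulates its tables from worker beacons.

```python
class FrontEnd:
    def __init__(self):
        # Stale snapshot of the LB's load table, kept for exactly this case.
        self.cached_lb_table = {"workerA": 0.3, "workerB": 0.6}
        self.lb_alive = True

    def pick_worker(self, lb_table=None):
        # Use live LB info when available; otherwise serve from the cache.
        table = lb_table if (self.lb_alive and lb_table) else self.cached_lb_table
        return min(table, key=table.get)

class LoadBalancer:
    """A restarted LB starts with no state and rebuilds its load table
    as workers, hearing its multicast announcement, report their loads."""
    def __init__(self):
        self.table = {}

    def hear_beacon(self, worker, load):
        self.table[worker] = load

fe = FrontEnd()
fe.lb_alive = False                     # broken pipe / timeout detected
stale_pick = fe.pick_worker()           # keep serving from cached info
new_lb = LoadBalancer()                 # FE restarts the LB
new_lb.hear_beacon("workerB", 0.1)      # workers re-announce their loads
new_lb.hear_beacon("workerA", 0.4)
fe.lb_alive = True
fresh_pick = fe.pick_worker(new_lb.table)
```

Because the LB's state is reconstructible from the workers themselves, killing and restarting it loses nothing permanent, which is the essence of the "starfish" design.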
SNS Availability Mechanisms
- Soft state everywhere
  - Multicast-based announce/listen to refresh the state
  - Idea stolen from multicast routing in the Internet!
- Process peers watch each other
  - Because there is no hard state, “recovery” == “restart”
  - Because of the multicast level of indirection, no location directory for resources is needed
- Load balancing, hot updates, and migration are “easy”
  - Shoot down a worker, and it will recover
  - Upgrade == install new software, shoot down the old
- Mostly graceful degradation
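The announce/listen refresh idea can be sketched as a table of entries that expire unless periodically re-announced (TTL value and names are invented for illustration): crashed processes simply stop refreshing and are silently forgotten, and "recovery" is just re-announcing after a restart.

```python
class SoftStateTable:
    """Toy announce/listen registry: entries survive only as long as
    they are refreshed by periodic announcements; there is no hard
    state to recover after a crash."""
    TTL = 3  # ticks an entry survives without a refresh (arbitrary)

    def __init__(self):
        self.entries = {}  # name -> ticks remaining

    def announce(self, name):
        # A periodic multicast announcement resets the entry's lifetime.
        self.entries[name] = self.TTL

    def tick(self):
        # Time passes; unrefreshed entries silently expire.
        for name in list(self.entries):
            self.entries[name] -= 1
            if self.entries[name] <= 0:
                del self.entries[name]

table = SoftStateTable()
table.announce("worker1")
table.announce("worker2")
table.tick(); table.tick()
table.announce("worker1")   # worker2 has crashed and stops refreshing
table.tick()                # ...so it ages out of the table
# "Recovery" == "restart": the restarted worker2 just announces again.
table.announce("worker2")
```

Nothing ever has to clean up after a failure, and no location directory is needed: whoever is currently announcing is, by definition, alive and locatable.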
SNS Availability Mechanisms, cont’d.
- Orthogonal mechanisms
  - Composition without interfaces
  - Example: Scalable Reliable Multicast (SRM) group state management with SNS
  - Eliminates the O(n^2) complexity of composing modules
  - The state space of failure mechanisms is easy to reason about
  - What’s the cost?
- More on orthogonal mechanisms later
Administering SNS
- Multicast means the monitor can run anywhere on the cluster
- Extensible via self-describing data structures and mobile code in Tcl
Comparing SNS & Wolfpack
- Somewhat different targets
- Quorum Resource <--> Load Balancer/FT manager
  - But the LB is soft state, and the cluster can (temporarily) function without it
  - Better partition resilience
- Failover
  - Wolfpack’s Failover Manager is slightly more flexible
  - Neither system itself provides any integrity/consistency guarantees
- Multicast heartbeats detect membership, failures, and locations of things
What We Really Learned From TACC
- Design for failure
  - It will fail anyway
  - End-to-end argument applied to high availability
- Orthogonality is even better than layering
  - Narrow interface vs. no interface
  - A great way to manage system complexity
  - The price of orthogonality
- Techniques: refreshable soft state; watchdogs/timeouts; sandboxing
- Software compatibility is hard, but valuable
Clusters Summary
- Many approaches to clustering, software transparency, and failure semantics
  - An end-to-end problem that is often application-specific
  - We’ll see this again at the application level in the harvest-vs.-yield discussion
- Internet workloads are a particularly good match for clusters
  - What software support is needed to mate these two things?
  - What new abstractions do we want for writing failure-tolerant applications in light of these techniques?
  - What about Pfister’s comment about transactional semantics?