Red Hat Clustering:
Best Practices & Pitfalls
Lon Hohberger
Principal Software Engineer
Red Hat
May 2013
Red Hat Clustering: Best Practices & Pitfalls
● Why Cluster?
● I/O Fencing and Your Cluster
● 2-Node Clusters and Why They Are Special
● Quorum Disks
● Service Structure
● Multipath Considerations in a Clustered Environment
● GFS2 – Cluster File System
Why Cluster?
● Application/Service Failover
  ● Reduce MTTR
  ● Meet business needs and SLAs
  ● Protect against software and hardware faults
  ● Virtual machine management
  ● Allow for planned maintenance with minimal downtime
● Load Balancing
  ● Scale out workloads
  ● Improve application response times
Why not Cluster?
● Often requires additional hardware
● Increases total system complexity
  ● More possible parts that can fail
  ● More failure scenarios to evaluate
● Harder to configure
● Harder to debug problems
Component Overview
● corosync – Totem SRP/RRP-based membership, virtual synchrony (VS) messaging, closed process groups
● cman – quorum, voting, quorum disk
● fenced – handles I/O fencing for joined members
  ● Fencing agents – carry out fencing operations
● DLM – distributed lock manager (kernel)
● clvmd – cluster logical volume manager
● gfs2 – cluster file system
● rgmanager – cold failover for applications
● Pacemaker (Technology Preview) – next-generation cluster resource manager (CRM)
Failure Recovery Overview
● corosync - Totem token is lost; Totem forms a new ring
● fenced enters recovery state – quorate partition initiates fencing of dead node(s)
● DLM enters recovery state – locks on dead node(s) are dropped
● clvmd, gfs2 enter recovery state – recover / replay journals
● rgmanager initiates cold failover of user applications
I/O Fencing
● An active countermeasure taken by a functioning host to isolate a misbehaving or presumed dead host from shared data
● Most critical part of a cluster utilizing SAN or other shared storage technology
● Despite this, not everyone uses it
  ● How much is your data worth?
● Required by gfs2, clvmd, and cold failover software shipped by Red Hat
● Utilized by RHEV, too – Fencing is not a cluster-specific technology
I/O Fencing
● Protects data in the event of planned or unplanned system downtime
  ● Kernel panic
  ● System freeze
  ● Live hang / recovery
● Enables nodes to safely assume control of shared resources when booted in a network partition situation
I/O Fencing
● SAN fabric and SCSI fencing are not fully recoverable
  ● Node must typically be rebooted manually
  ● Enables an autopsy of the node
  ● Sometimes does not require additional hardware
● Power fencing is usually fully recoverable
  ● Your system can reboot and rejoin the cluster, thereby restoring capacity, without administrator intervention
  ● This is a reduction in MTTR
I/O Fencing – Drawbacks
● Difficult to configure
  ● No automated way to “discover” fencing devices
  ● Fencing devices are all very different and have different permission schemes and requirements
● Typically requires additional hardware
  ● Additional cost often not considered when purchasing systems
  ● A given “approved” IHV may not sell the hardware you want to use
I/O Fencing – Best Practices
● Integrated power management
  ● Use servers with dual power supplies
  ● Use a backup fencing device
  ● IPMI over LAN fencing usually requires disabling acpid
● Single-rail switched PDUs
  ● Use 2 switched PDUs
  ● Use a PDU with two power rails
  ● Use a backup fencing device
Integrated Power Management Pitfall
[Figure: two hosts, each with its own fencing device and network connection (Net)]
● Host (and fencing device) lose power
  ● Safe to recover; host is off
● Host and Fencing Device lose network connectivity
  ● NEVER safe to recover!
● The two cases are indistinguishable
● A timeout does not ensure data integrity in this case
● Not all integrated power management devices suffer this problem
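Why the two cases look the same from the surviving node can be illustrated with a small Python sketch. This is a toy model, not part of any cluster software: `Observation` and `observe` are hypothetical names, and the model assumes the integrated fencing device shares both power and network with its host.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """What a surviving node can see of its peer (hypothetical model)."""
    heartbeat_seen: bool
    fence_device_reachable: bool


def observe(host_powered: bool, network_up: bool) -> Observation:
    # Assumption: the integrated fencing device shares the host's power
    # and the host's network path, so it is reachable only if both are up.
    heartbeat_seen = host_powered and network_up
    fence_device_reachable = host_powered and network_up
    return Observation(heartbeat_seen, fence_device_reachable)


# Case 1: host (and fencing device) lose power. Safe to recover.
power_loss = observe(host_powered=False, network_up=True)

# Case 2: host and fencing device lose network. The host is still running.
network_loss = observe(host_powered=True, network_up=False)

# The survivor sees exactly the same thing in both cases.
print(power_loss == network_loss)
```

Because the observations are identical, no amount of waiting lets the survivor tell the safe case from the unsafe one, which is the point the slide makes about timeouts.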
Single Rail Pitfall
[Figure: two diagrams, each showing two hosts attached to a single fencing device]
● One power cord = Single Point of Failure
Best Practice: Dual Rail Fencing Device
● Dual power sources, two rails in the fencing device, two power supplies in the cluster nodes
● Fencing device electronics run off of either rail
[Figure: two hosts, a fencing device with Rail A and Rail B, and the cluster interconnect]
Best Practice: Dual Single Rail Fencing Devices
● Dual power sources, two fencing devices
[Figure: two hosts, fencing devices Device A and Device B, and the cluster interconnect]
I/O Fencing – Pitfalls
● SAN fabric fencing
  ● Full recovery typically not automatic
  ● Unfencing in RHEL6 allows a host to turn on its ports after reboot
● SCSI-3 PR fencing
  ● Not all devices support it
  ● Quorum disk may not reside on a LUN managed by SCSI fencing due to quorum “chicken and egg” problem
I/O Fencing – Pitfalls
● SCSI-3 PR Fencing (cont.)
  ● Preempt-and-abort command is not required by SCSI-3 specification
  ● Not all SCSI-3 compliant devices support it
● LUN detection can be done by querying CLVM, looking for volume groups with the cluster tag set
● On RHEL6, watchdog script allows reboot after fencing
2-Node Clusters
● Most common use case in high availability / cold failover clusters
● Inexpensive to set up; several can fit in a single rack
● Red Hat has had two-node failover clustering since 2002
Why 2-Node Clusters are Special
● Cluster operates using a simple majority quorum algorithm
● Best predictability with respect to node failure counts compared to other quorum algorithms (e.g., Grid)
● There is never a majority with one node out of two
● Simple Solution: two_node="1" mode
  ● When a node boots, it assumes quorum
  ● Services, gfs2, etc. are prevented from operating until fencing completes
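The majority arithmetic behind this slide can be sketched in a few lines of Python. This is an illustrative model, not cman's implementation: `quorum_threshold` and `is_quorate` are hypothetical names, every node is assumed to carry one vote, and the `two_node` flag simply models the special case in which a lone node is treated as quorate.

```python
def quorum_threshold(total_votes: int, two_node: bool = False) -> int:
    """Votes needed for quorum under a simple majority rule (toy model)."""
    if two_node:
        # Assumption: the special mode only makes sense with two votes.
        if total_votes != 2:
            raise ValueError("two_node mode only applies to a two-vote cluster")
        return 1
    # Simple majority: strictly more than half of all votes.
    return total_votes // 2 + 1


def is_quorate(votes_present: int, total_votes: int, two_node: bool = False) -> bool:
    return votes_present >= quorum_threshold(total_votes, two_node)


for nodes in (2, 3, 4, 5):
    print(nodes, "nodes -> quorum needs", quorum_threshold(nodes), "votes")

print("lone node of two, plain majority:", is_quorate(1, 2))
print("lone node of two, two_node mode:", is_quorate(1, 2, two_node=True))
```

With plain majority a two-node cluster needs both votes, so losing either node stops the cluster. The special mode lets one node continue, which is why fencing must complete before services or gfs2 are allowed to run.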
2-Node Pitfalls: Fence Loops
● If two nodes become partitioned, a fence loop can occur
● Node A kills node B, who reboots and kills node A... etc.
● Solutions
  ● Correct network configuration
    ● Fencing devices on same network used for cluster communication
  ● Use fencing delays
  ● Use a quorum disk
Fence Loop
[Figure sequence: Node 1 and Node 2, a shared fencing device, a network, and a separate cluster interconnect. The steps shown are:]
● Cluster interconnect fails (cable pull or switch loses power)
● Both nodes send a fencing request; one request is blocked because the device allows only one user at a time
● Node 1 power cycled
● Node 1 boots
● Fencing request
● Node 2 power cycled
● Node 2 boots
● Fencing request
Immune to Fence Loops
● On cable pull, node without connectivity can not fence
● If interconnect dies and comes back later, fencing device serializes access so that only one node is fenced
[Figure: Node 1 and Node 2 reach a single fencing device over the cluster interconnect]
2-Node Pitfalls: Fence Death
● A combined pitfall when using integrated power in two-node clusters
● If a two-node cluster becomes partitioned, a fence death can occur if fencing devices are still accessible
● Both nodes tell each other's fencing device to turn off the other node at the same time
● No one is alive to turn either host back on!
● Solutions
  ● Same as fence loops
  ● Use a switched PDU which serializes access
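A toy Python model shows why breaking the symmetry between the two nodes avoids fence death. It is a sketch under stated assumptions, not a description of fenced: `surviving_nodes` is a hypothetical function, a fence action is assumed to complete before the slower node's delay expires, and when a serializing device receives two simultaneous requests the model arbitrarily lets Node 1 win.

```python
def surviving_nodes(delay_node1: float, delay_node2: float,
                    serialized_device: bool) -> set:
    """Return which nodes are still powered after a partition (toy model).

    Each node waits its configured delay, then tries to fence the other.
    """
    if delay_node1 < delay_node2:
        # Node 1 fires first; Node 2 is off before its own delay expires.
        return {"node1"}
    if delay_node2 < delay_node1:
        return {"node2"}
    # Equal delays: both requests are issued at the same moment.
    if serialized_device:
        # A single device handles one request at a time, so only one node
        # is turned off. Which one is arbitrary; the model picks node1.
        return {"node1"}
    # Separate integrated devices accept both requests: fence death.
    return set()


print(surviving_nodes(0, 0, serialized_device=False))
print(surviving_nodes(0, 5, serialized_device=False))
print(surviving_nodes(0, 0, serialized_device=True))
```

In this model either a delay on one node or a serializing device is enough to leave exactly one survivor, which matches the two kinds of solution listed on the slide.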
Fence Death
[Figure sequence: Node 1 and Node 2, each with its own fencing device, joined by a network and a cluster interconnect. The steps shown are:]
● Cluster interconnect is lost (cable pull, switch turned off, etc.)
● Both nodes fence each other
● No one is alive to turn the other back on.
Immune to Fence Death
● Single power fencing device serializes access
● Cable pull ensures one node “loses”
[Figure: Node 1 and Node 2 reach a single fencing device over the cluster interconnect]
2-Node Pitfalls: Crossover Cables
● Causes both nodes to lose link on cluster interconnect when only one link has failed
● Indeterminate state for quorum disk without very clever heuristics (use master_wins)
● Fencing can't be placed on the same network
● We don't test this
2-Node Clusters: Pitfall avoidance
● Network / fencing configuration evaluation
● Use a quorum disk
● Create a 3 node cluster :)
  ● Simple to configure, increased working capacity, etc.
Quorum Disk - Benefits
● Prevents fence-loop and fence death situations
  ● Existing cluster member retains quorum until it fails or cluster connectivity is restored
● Heuristics ensure that administrator-defined “best-fit” node continues operation in a network partition
● Provides all-but-one or last-man-standing failure mode
  ● Examples:
    ● 4 node cluster, and 3 nodes fail
    ● 4 node cluster and 3 nodes lose access to a critical network path as decided by the administrator
● Note: Ensure capacity of remaining node is adequate for all cluster operations before trying this
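The vote arithmetic that makes a last-man-standing mode possible can be sketched as follows. The slides do not give vote counts, so this example assumes one vote per node and a quorum disk carrying (number of nodes - 1) votes, a commonly used convention; the function names are hypothetical.

```python
def qdisk_votes(nodes: int) -> int:
    # Assumption: the quorum disk carries one vote fewer than the node count.
    return nodes - 1


def quorum_threshold(total_votes: int) -> int:
    # Simple majority of all votes, node votes plus quorum disk votes.
    return total_votes // 2 + 1


def partition_is_quorate(nodes_alive: int, holds_qdisk: bool, nodes: int) -> bool:
    total = nodes + qdisk_votes(nodes)
    present = nodes_alive + (qdisk_votes(nodes) if holds_qdisk else 0)
    return present >= quorum_threshold(total)


# 4-node cluster: 4 node votes + 3 quorum disk votes = 7, so quorum is 4.
print(partition_is_quorate(nodes_alive=1, holds_qdisk=True, nodes=4))
print(partition_is_quorate(nodes_alive=3, holds_qdisk=False, nodes=4))
```

Under these assumptions a single node that still holds the quorum disk stays quorate, while three nodes that have lost it do not, which corresponds to both examples on the slide.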
Quorum Disk - Drawbacks
● Used to be complex to configure, but RHEL 6.3 fixes most of this
● Heuristics need to be written by administrators for their particular environments
● Incorrect configuration can reduce availability
● Algorithm used is non-traditional
  ● Backup membership algorithm vs. ownership algorithm or simple “tie-breaker”
Quorum Disk Timing Pitfall (RHEL5)
Quorum Disk Made “Simple” (RHEL5)
● Quorum disk failure recovery should be a bit less than half of CMAN's failure time
● This allows for the quorum disk arbitration node to fail over before CMAN times out
● Quorum disk failure recovery should be approximately 30% longer than a multipath failover. Example [1]:
  ● x = multipath failover
  ● x * 1.3 = quorum disk failover
  ● x * 2.7 = CMAN failover
[1] http://kbase.redhat.com/faq/docs/DOC-2882
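The rule of thumb above is plain arithmetic, sketched here in Python. The function name `cluster_timeouts` and the 10-second multipath failover are illustrative assumptions; only the 1.3 and 2.7 multipliers come from the slide.

```python
def cluster_timeouts(multipath_failover_s: float) -> dict:
    """Derive quorum disk and CMAN failover times from multipath failover."""
    return {
        "multipath": multipath_failover_s,
        "qdisk": multipath_failover_s * 1.3,  # about 30% longer than multipath
        "cman": multipath_failover_s * 2.7,   # a bit more than twice qdisk
    }


# Hypothetical measurement: multipath takes 10 seconds to fail over.
timeouts = cluster_timeouts(10.0)
for name, seconds in timeouts.items():
    print(f"{name}: {seconds:.1f} s")

# The quorum disk time should come out a bit less than half of CMAN's.
print(f"qdisk / cman = {timeouts['qdisk'] / timeouts['cman']:.2f}")
```

The ratio check ties the two bullets together: 1.3 / 2.7 is just under one half, so the quorum disk arbitration can complete before CMAN times out.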
Quorum Disk Best Practices
● Don't use it if you don't need it
  ● Fencing delays can usually provide adequate decision-making
● If required, use heuristics for your environment
  ● Prefer master_wins over heuristics
● I/O Scheduling
  ● deadline scheduler
  ● cfq scheduler with realtime prio
    ● ionice -c 1 -n 0 -p `pidof qdiskd`
Clustered Services – Best Practices
● Service structure should be as flat as possible
  ● Improves readability / maintainability
  ● Reduces configuration file footprint
  ● Rgmanager fixes most common ordering mistakes
● The resources block is not required
● Virtual machines should not exceed memory limits of a host after a failover for best predictability
On Multipath
● With SCSI-3 PR Fencing, multipath works, but only when using device-mapper
● When using multiple paths and SAN fencing, you must ensure all paths to all storage are fenced for a given host
● When using multipath with a quorum disk, you must not use no_path_retry = queue.
● When using multipath with GFS2, you should not use no_path_retry = queue.
On Multipath
● Do not place /var on a multipath device without relocating the bindings file to the root partition
● Not all SAN fabrics behave the same way in the same failure scenarios
● Test all failure scenarios you expect to have the cluster handle
● Use device-mapper multipath rather than vendor supplied versions for the best support from Red Hat
GFS2 – Shared Disk Cluster File System
● Provide uniform views of a file system in a cluster
● POSIX compliant (as much as Linux is, anyway)
● Allow easy management of things like virtual machine images
● Good for getting lots of data to several nodes quickly
GFS2 Considerations
● Journal count (cluster size)
  ● One journal per node
● File system size
  ● Online extend supported
  ● Shrinking is not supported
● Workload requirements & planned usage
GFS2 Pitfalls
● Making a file system with lock_nolock as the locking protocol
● Failure to allocate enough journals at file system creation time and adding nodes to the cluster (GFS only)
● NFS lock failover does not work!
● Never use a cluster file system on top of an md-raid device
● Use of local file systems on md-raid for failover is also not supported
Other Topics
● Stretch clustering – multiple buildings on the same campus in the same cluster
  ● Minimal support for this
● Geographic clustering / disaster tolerance – longer-distance
  ● Evaluated typically on a case-by-case basis; requires site to site storage replication and a backup cluster
  ● Active/active clustering across sites is not supported
Troubleshooting corosync & CMAN
● corosync does not have an easy tool to assist troubleshooting; check system logs (it is very verbose if problems occur)
● Most common problem with corosync is incorrect multicast configuration on the switch
  ● UDPU (6.2+) more reliable
● cman_tool status
  ● Shows cluster states (incl. votes)
● cman_tool nodes
  ● Shows cluster node states
Troubleshooting Fencing
● group_tool ls – The fence group should be in NONE (or “run” depending on version)
● If it is in another state (FAIL_STOP_WAIT, FAIL_START_WAIT), check logs on the low node ID
● cman_tool nodes -f – Show nodes and the last time each was fenced (if ever)
● fence_ack_manual -e -n <node> - emergency fencing override. Use if you are sure the host is dead and the fencing device is inaccessible (or if fencing is incorrectly configured) to allow the cluster to recover.
Summary
● Choose a fencing configuration which works in the failure cases you expect
● Test all failure cases you expect the cluster to recover from
● The more complex the system, the more likely a single component will fail
● Use the simplest configuration whenever possible
● When using clustered file systems, tune according to your workload
References
● https://access.redhat.com/knowledge/solutions/17784
● https://access.redhat.com/knowledge/node/28603
● https://access.redhat.com/knowledge/node/29440
● https://access.redhat.com/knowledge/articles/40051
● http://people.redhat.com/lhh/ClusterPitfalls.pdf
Complex NSPF Cluster
● Any single failure in the system either allows recovery or continued operation
● Bordering on insane
[Figure: two hosts and a fencing device with Rail A and Rail B; remaining labels: Net, Quorum, Cluster Net]
Simpler NSPF configuration
[Figure: three hosts, a fencing device with Rail A and Rail B, and two switches (Switch1, Switch2) joined by an ISL]