VERITAS Cluster Server Reference Guide, Recommended Configurations and Frequently Asked Questions

Business Without Interruption


Table of Contents

1 OVERVIEW
1.1 PURPOSE OF THIS DOCUMENT
1.2 SOURCES OF INFORMATION
1.3 CREDITS
2 HIGH AVAILABILITY IN THE IT ENVIRONMENT
2.1 THE HISTORY OF HIGH AVAILABILITY
2.1.1 Mainframes to Open Systems
2.1.2 Methods used to increase availability
2.1.3 Development of Failover Management Software
2.1.4 Second Generation High Availability Software
2.2 APPLICATION CONSIDERATIONS
3 VERITAS CLUSTER SERVER OVERVIEW
4 VCS TECHNOLOGY OVERVIEW
4.1 CLUSTERS
4.2 RESOURCES AND RESOURCE TYPES
4.3 AGENTS
4.4 CLASSIFICATIONS OF VCS AGENTS
4.5 SERVICE GROUPS
4.6 RESOURCE DEPENDENCIES
4.7 TYPES OF SERVICE GROUPS
4.7.1 Failover Groups
4.7.2 Parallel Groups
4.8 CLUSTER COMMUNICATIONS (HEARTBEAT)
4.9 PUTTING THE PIECES TOGETHER
5 COMMON CLUSTER CONFIGURATION TASKS
5.1 HEARTBEAT NETWORK CONFIGURATION
5.2 STORAGE CONFIGURATION
5.2.1 Dual hosted SCSI
5.2.2 Storage Area Networks
5.2.3 Storage Configuration Sequence
5.3 APPLICATION SETUP
5.4 PUBLIC NETWORK DETAILS
5.5 INITIAL VCS INSTALL AND SETUP
5.5.1 Unix systems
5.5.2 NT systems
5.6 COMMUNICATION VERIFICATION
5.6.1 LLT
5.6.2 GAB
5.6.3 Cluster operation
6 VCS CONFIGURATION
6.1 MAIN.CF FILE FORMAT
6.2 RESOURCE TYPE DEFINITIONS
6.3 ATTRIBUTES
6.3.1 Type dependant attributes
6.3.2 Type independent attributes
6.3.3 Resource specific attributes


6.3.4 Type specific attributes
6.3.5 Local and Global attributes
7 NFS SAMPLE CONFIGURATIONS
7.1 TWO NODE ASYMMETRIC NFS CLUSTER
7.1.1 Example main.cf file
7.2 TWO NODE SYMMETRICAL NFS CONFIGURATION
7.2.1 Example main.cf file
7.3 SPECIAL STORAGE CONSIDERATIONS FOR NFS SERVICE
8 ORACLE SAMPLE CONFIGURATIONS
8.1 ORACLE SETUP
8.2 ORACLE ENTERPRISE AGENT INSTALLATION
8.3 SINGLE INSTANCE CONFIGURATION
8.3.1 Example main.cf
8.3.2 Oracle listener.ora configuration
8.4 ADDING DEEP LEVEL TESTING
8.4.1 Oracle changes
8.4.2 VCS Configuration changes
8.5 MULTIPLE INSTANCE CONFIGURATION
9 ADMINISTERING VCS
9.1 STARTING AND STOPPING
9.2 MODIFYING THE CONFIGURATION FROM THE COMMAND LINE
9.3 MODIFYING THE CONFIGURATION USING THE GUI
9.4 MODIFYING THE MAIN.CF FILE
9.5 SNMP
10 TROUBLESHOOTING
11 VCS DAEMONS AND COMMUNICATIONS
11.1 HAD
11.2 HASHADOW
11.3 GROUP MEMBERSHIP SERVICES/ATOMIC BROADCAST (GAB)
11.4 CLUSTER MEMBERSHIP
11.5 CLUSTER STATE
11.6 LLT
11.7 LOW PRIORITY LINK
11.8 LLT CONFIGURATION
11.8.1 LLT configuration directives
11.8.2 Example LLT configuration
11.8.3 Example llthosts file
11.9 GAB CONFIGURATION
11.10 DISK HEARTBEATS (GABDISK)
11.10.1 Configuring GABDISK
11.11 THE DIFFERENCE BETWEEN NETWORK AND DISK CHANNELS
11.12 JEOPARDY, NETWORK PARTITIONS AND SPLIT-BRAIN
11.13 VCS 1.3 GAB CHANGES
11.14 EXAMPLE SCENARIOS
11.15 PRE-EXISTING NETWORK PARTITIONS
11.16 VCS SEEDING
11.17 VCS 1.3 SEEDING AND PROBING CHANGES
11.18 NETWORK PARTITIONS AND THE UNIX BOOT MONITOR (OR “HOW TO CREATE YOUR VERY OWN SPLIT-BRAIN CONDITION”)
11.19 VCS MESSAGING


12 VCS TRIGGERS
12.1 HOW VCS PERFORMS EVENT NOTIFICATION/TRIGGERS
12.2 TRIGGER DESCRIPTION
12.2.1 PostOnline trigger
12.2.2 PostOffline trigger
12.2.3 PreOnline trigger
12.2.4 ResFault trigger
12.2.5 ResNotOff trigger
12.2.6 SysOffline trigger
12.2.7 InJeopardy trigger
12.2.8 Violation trigger
12.3 TRIGGER CONFIGURATION
12.4 RECOMMENDED TRIGGER USAGE
13 SERVICE GROUP DEPENDENCIES
14 VCS STARTUP AND SHUTDOWN
14.1 VCS STARTUP
14.2 VCS SHUTDOWN
14.3 STALE CONFIGURATIONS
15 AGENT DETAILS
15.1 PARAMETER PASSING
15.2 AGENT CONFIGURATION
15.2.1 ConfInterval
15.2.2 FaultOnMonitorTimeouts
15.2.3 MonitorInterval
15.2.4 MonitorTimeout
15.2.5 OfflineMonitorInterval
15.2.6 OfflineTimeout
15.2.7 OnlineRetryLimit
15.2.8 OnlineTimeout
15.2.9 RestartLimit
15.2.10 ToleranceLimit
16 FREQUENTLY ASKED QUESTIONS
16.1 GENERAL
16.1.1 Does VCS support NFS lock failover?
16.1.2 Can I mix different operating systems in a cluster?
16.1.3 Can I configure a “Shared Nothing” cluster?
16.1.4 What is the purpose of hashadow?
16.1.5 What is “haload”?
16.1.6 What failover policies are available?
16.1.7 What are System Zones?
16.1.8 What is “halink”?
16.1.9 What does “stale admin wait” mean?
16.1.10 How many nodes can VCS support?
16.2 RESOURCES
16.2.1 What is the MultiNICA resource?
16.2.2 What is the IPMultiNIC resource?
16.2.3 What is a Proxy resource?
16.2.4 How do I configure an IPMultiNIC and MultiNICA resource pair?
16.2.5 How can I use MultiNIC and Proxy together?
16.3 COMMUNICATIONS
16.3.1 What is the recommended heartbeat configuration?


16.3.2 Can LLT be run over a VLAN?
16.3.3 Can I place LLT links on a switch?
16.3.4 Can LLT/GAB be routed?
16.3.5 How far apart can nodes in a cluster be?
16.3.6 Do heartbeat channels require additional IP addresses?
16.3.7 How many nodes should be set in my GAB configuration?
16.3.8 What is a split brain?
16.4 AGENTS
16.4.1 What are Entry Points?
16.4.2 What should be the return value of the online Entry Point?
16.4.3 What should be the return value of the offline Entry Point?
16.4.4 When will the monitor Entry Point be called?
16.4.5 When will the clean Entry Point be called?
16.4.6 Should I implement the clean Entry Point?
16.4.7 What should be the return value of the monitor Entry Point?
16.4.8 What should be the return value of the clean Entry Point?
16.4.9 What should I do if I figure within the online Entry Point that it is not possible to online the resource?
16.4.10 Is the Agent Framework Multi-threaded?
16.4.11 How do I configure the agent to automatically retry the online procedure when the initial attempt to online a resource fails?
16.4.12 What is the significance of the Enabled attribute?
16.4.13 How do I request a VCS agent not to online/offline/monitor a resource?
16.4.14 What is MonitorOnly?
16.4.15 How do I request a VCS agent not to online/offline a resource?
16.4.16 How do I configure the agent to ignore "transient" faults?
16.4.17 How do I configure the agent to automatically restart a resource on the local node when the resource faults?
16.4.18 What is ConfInterval?


1 Overview

1.1 Purpose of this document

This document is intended to assist customers and VERITAS personnel in understanding the VERITAS Cluster Server product. It is not intended to replace the existing documentation shipped with the product, nor is it “VCS For Dummies”. It is intended more as a “VCS for System Administrators”. It will cover, as much as possible, VCS for NT, Solaris and HP/UX. Differences between versions will be noted.

1.2 Sources of information

Material for this document was gathered from existing VCS documentation, VERITAS engineering documents and personnel, VERITAS Enterprise Consulting Services personnel and VERITAS University course materials.

1.3 Credits

Special thanks to the following VERITAS folks:

Paul Massiglia for his work on the “VERITAS in E-Business” white paper, which served as a base idea for this document.

Tom Stephens for providing the initial FAQ list, guidance, humor and constant review

VCS Engineering team for answering my thousand or so questions

Evan Marcus for providing customer needs, multiple review cycles and, in my opinion, the best book on High Availability published, “Blueprints for High Availability: Designing Resilient Distributed Systems”

Diane Garey for providing VCS on NT information and guidance.

2 High Availability in the IT environment

2.1 The History of High Availability

2.1.1 Mainframes to Open Systems

Looking back over the evolution of business processing systems in the last 10-15 years, a number of key changes can be noted. One of the first would be the move from mainframe processing systems to more distributed Unix (open) systems. Large monolithic mainframes were slowly replaced with smaller, dedicated (and significantly cheaper) open systems to provide specific functionality. In this large decentralization move, significant numbers of open systems were deployed to solve a large number of business problems. One of the key factors to note is that this decentralization decreased the overall impact of any single system outage. Rather than all personnel being idled by a mainframe failure, a single small open system would support only a very limited number of users and therefore not impact others. There are always two sides to every issue, however. Deploying tens or hundreds of open systems to replace a single mainframe decreased the overall impact of failure, but drastically increased administrative complexity. As businesses grew, there could be literally hundreds of various open systems providing application support.

As time passed, newer open systems gained significant computing power and expandability. Rather than a single or dual processor system with memory measured in megabytes and storage in hundreds of megabytes, systems evolved to tens or even hundreds of processors, gigabytes of memory and terabytes of disk capacity. This drastic increase in processing power allowed IT managers to begin to consolidate applications onto larger systems to reduce administrative complexity and hardware footprint.

So now we have huge open systems providing unheard of processing power. These “enterprise class” systems have replaced departmental and workgroup level servers throughout organizations. At this point, we have come full circle. Critical applications are now run on a very limited number of large systems. During the shift from mainframe centralization to distributed, open systems and back to centralized, enterprise class, open systems, one other significant change overtook the IT industry.

This could be best summed up with the statement “IT is the business”. Over the last several years, information processing has gone from a function that augmented day-to-day business operations to one of actually being the day-to-day operations. Enterprise Resource Planning (ERP) systems began this revolution and the dawn of e-commerce made it a complete reality. In today’s business world, loss of IT functions means the entire business can be idled.

2.1.2 Methods used to increase availability

As open systems proliferated, IT managers became concerned with the impact of system outages on various business units. These concerns led to the development of tools to increase overall system availability. One of the easiest points to address is the components with the highest failure rates. Hardware vendors began providing systems with built-in redundant components such as power supplies and fans. These high-failure items were now protected by in-system spare components that would assume duty on the failure of another. Disk drives were another constant-failure item, and hardware and software vendors responded with disk mirroring and RAID devices. Overall, these developments can be summarized with the industry term RAS, standing for reliability, availability and serviceability.

As individual components became more reliable, managers looked to decrease exposure to losing an entire system. Just like having a spare power supply or disk drive, IT managers wanted the capability to have a spare system to take over on a system failure. Early configurations were just that: a spare system. On a system failure, external storage containing application data would be disconnected from the failed system, connected to the spare system, then the spare brought into service. This action can be called “failover”. In a properly designed scenario, the client systems would require no change to recognize the spare system. This is accomplished by having the now-promoted spare system take over the network identity of its original peer. The following figure (reproduced here as task lists) details the sequence necessary to properly “fail over” an NFS server using VERITAS Volume Manager:

Failover of application services

Takeover server tasks:

• Import Disk Groups

• Start Volumes

• Mount file systems

• Start Applications

• Configure IP address

Original server tasks (if server still online):

• Unconfigure IP address

• Stop Applications

• Unmount File Systems

• Stop Volumes

• Deport Disk Groups

As storage systems evolved, the ability to connect more than one host to a storage array was developed. By “dual-hosting” a given storage array, the spare system could be brought online quicker in the event of failure. This is one of the key concepts that will remain throughout the evolution of failover configurations. Reducing time to recover is key to increasing availability. Dual-hosting storage meant that the spare system would no longer have to be physically cabled on a failure. Having a system ready to utilize application data led to the development of scripts to assist the spare server in functioning as a “takeover” server. In the event of a failure, the proper scripts could be run to effectively change the personality of the spare to mirror the original failed server. These scripts were the very beginning of Failover Management Software (FMS).

Now that it was possible to automate takeover of a failed server, the other part of the problem became detecting failures. The two key components to providing application availability are failure detection and time to recover. Many corporations developed elaborate application and server monitoring code to provide failover management.

2.1.3 Development of Failover Management Software

Software vendors responded by developing commercial FMS packages on common platforms to manage common applications. These original packages such as VERITAS First Watch and Sun Solstice-HA can be considered “first generation HA” packages. The packages all have several common capabilities and limitations.

The first common capability is failure detection. The FMS package runs specific applications or scripts to monitor the overall health of a given application. This may be as simple as checking for the existence of a process in the system process table or as complex as actually communicating with the application and expecting certain responses. In the case of a web server, simple monitoring would be testing if the correct “httpd” process is in the process table. Complex monitoring would involve connecting to the web server on the proper address and port and testing for the existence of the home page. Application monitoring is always a trade-off between lightweight, low processor footprint and thorough testing for not only application existence, but also functionality.
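To make the trade-off concrete, the sketch below contrasts the two levels of web server monitoring just described. It is only an illustration, not VCS agent code; the process name, service address and port are assumptions for the example.

import subprocess
import http.client

def simple_monitor(process_name="httpd"):
    """Lightweight check: is a process with this name in the process table?"""
    result = subprocess.run(["pgrep", "-x", process_name], capture_output=True)
    return result.returncode == 0

def deep_monitor(address="192.1.1.1", port=80):
    """Thorough check: connect to the service address and fetch the home page."""
    try:
        conn = http.client.HTTPConnection(address, port, timeout=10)
        conn.request("GET", "/")
        status = conn.getresponse().status
        conn.close()
        return status == 200
    except OSError:
        return False

if __name__ == "__main__":
    # The deep check costs more (a real connection and page fetch) but proves the
    # service is actually answering clients, not merely that a process exists.
    print("process present:", simple_monitor())
    print("service answering:", deep_monitor())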

The second common capability is failover. FMS packages automate the process of bringing a standby machine online as a spare. From a high level, this requires stopping necessary applications, removing the IP address known to the clients and un-mounting file systems. The takeover server then reverses the process. File systems are mounted, the IP address known to the clients is configured and applications are started.

FMS packages typically differ in one area: detecting the failure of a complete system rather than a specific application. One of the most difficult tasks in an FMS package is correctly discriminating between a loss of a system and loss of communications between systems. There are a large number of technologies used, including heartbeat networks between servers, quorum disks, SCSI reservation and others. The difficulty arises in providing a mechanism that is reliable and scales well to multiple nodes. This document will only discuss node failure determination as it pertains to VCS. Please see the communications section for the complete description.

System configuration choices with first generation HA products are fairly limited. Common configurations are termed asymmetrical and symmetrical.

In an asymmetrical configuration, an application runs on a primary or master server. A dedicated backup server is present to take over on any failure. The backup server is not configured to perform any other functions. In the following two illustrations, a file server application will be moved, or failed over, from master to backup. Notice that the IP address used by the clients moves as well. This is extremely important; otherwise all clients would have to be updated on each server failover.

[Figure: Asymmetric configuration before failover. The master file server carries the public network address 192.1.1.1; a dedicated backup server stands by. The two are joined by dual dedicated heartbeat networks, mirrored copies of critical data sit on dual data paths, and the backup server is physically connected to the storage but not logically in use.]


[Figure: Asymmetric configuration after failover. The file service and the 192.1.1.1 address have moved from the master to the dedicated backup server; the heartbeat networks and mirrored data paths are unchanged.]

In a symmetrical configuration, each server is configured to run a specific application or service and essentially provide backup for its peer. In the example shown, the file server or application server can fail and its peer will take over both roles. Notice the surviving server has two addresses assigned.


[Figure: Symmetric configuration before failover. A dedicated file server carries public network address 192.1.1.1 and a dedicated application server carries 192.1.1.2; the two are joined by dual dedicated heartbeat networks.]

[Figure: Symmetric configuration after failover. The surviving server now acts as both application and file server and carries both addresses, 192.1.1.1 and 192.1.1.2.]


On the surface, it would appear the symmetrical configuration is a far more beneficial configuration in terms of hardware utilization. Many customers seriously dislike the concept of a valuable system sitting idle. There is a serious flaw in this line of reasoning, however. In the first asymmetrical example, the takeover server would need only as much processor power as its peer. On failover, performance would remain the same. In the symmetrical example, the takeover server would need sufficient processor power not only to run its existing application but also to run the new application it takes over. To put it another way, if a single application needs one processor to run properly, an asymmetric config would need two single processor systems. To run identical applications on each server, a symmetrical config would require two dual processor systems.

Another important limitation of first generation FMS systems is failover granularity. This refers to what must fail over in the event of any problem. First generation systems had failover granularity equal to a server. This means on failure of any HA application on a system, all applications would fail over to a second system. This fact severely limited scalability of any given server. For example, running multiple production Oracle instances on a single system is problematic, as the failure of any instance will cause an outage of all instances on the system while all applications are migrated to another server.

2.1.4 Second Generation High Availability Software

Second generation HA software can generally be characterized by two features. The first is scalability. Most second generation HA packages can scale to 8 or more nodes. One of the key enabling technologies behind this scalability is the advent of Storage Area Networks. Earlier packages were not only constrained by the software design, but more importantly by the storage platforms available. Attaching more than two hosts to a single SCSI storage device becomes problematic, as specialized cabling must be used. Scaling beyond 4 hosts is not practical, as it severely limits the actual number of SCSI disks that can be placed on the bus. SANs provide the ability to connect a large number of hosts to a nearly unlimited amount of storage. This allows much larger clusters to be constructed easily.


[Figure: “2n vs. n+1” — a set of two-node failover pairs compared with a single Fibre Channel SAN cluster in which one spare system backs several processing systems.]

In the configuration above, rather than having 6 systems essentially standing by for 6 processing systems, we have 1 system acting as the spare for 6 processing systems.

The second distinguishing feature of a second generation HA package is the concept of resource groups or service groups. As nodes get larger, it is less likely that they will be used to host a single application service. Particularly on the larger Sun servers such as the E6500 or the E10000, it is rare that the entire server will be dedicated to a single application service. Configuring multiple domains on an Enterprise Sun server partially alleviates the problem; however, multiple applications may still run within each domain. Failures that affect a single application service, such as a software failure or hang, should not necessarily affect other application services that may reside on the same physical host or domain. If they do, then downtime may be unnecessarily incurred for the other application services.

An application service is the service the end user perceives when accessing a particular network address. An application service is typically composed of multiple resources, some hardware and some software based, all cooperating to produce a single service. For example, a database service may be composed of one or more logical network addresses (such as IP), RDBMS software, an underlying file system, a logical volume manager and a set of physical disks being managed by the volume manager. If this service, typically called a service group, needs to be migrated to another node for recovery purposes, all of its resources must migrate together to re-create the service on another node. A single large node may host any number of service groups, each providing a discrete service to networked clients who may or may not know that they physically reside on a single node.



Service groups can be proactively managed to maintain service availability through an intelligent availability management tool. Given the ability to test a service group to ensure that it is providing the expected service to networked clients and an ability to automatically start and stop it, such a service group can be made highly available. If multiple service groups are running on a single node, then they must be monitored and managed independently. Independent management allows a service group to be automatically recovered or manually idled (e.g. for administrative or maintenance reasons) without necessarily impacting any of the other service groups running on a node. This is particularly important on the larger Sun Enterprise servers, which may easily be running eight or more applications concurrently. Of course, if the entire server crashes (as opposed to just a software failure or hang), then all the service groups on that node must be recovered elsewhere.

At the most basic level, the fault management process includes monitoring a service group and, when a failure is detected, restarting that service group automatically. This could mean restarting it locally or moving it to another node and then restarting it, as determined by the type of failure incurred. In the case of local restart in response to a fault, the entire service group does not necessarily need to be restarted; perhaps just a single resource within that group may need to be restarted to restore the application service. Given that service groups can be independently manipulated, a failed node’s workload can be load balanced across remaining cluster nodes, and potentially failed over successive times (due to consecutive failures over time) without manual intervention, as shown below.
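As a rough illustration of that decision flow, the sketch below shows how a fault management engine might choose between restarting a resource in place and failing the whole group over. The function and attribute names are invented for this example and are not the actual VCS internals.

def pick_target(cluster_nodes, exclude):
    """Choose the least loaded surviving node as the failover target."""
    candidates = [n for n in cluster_nodes if n["name"] != exclude and not n["crashed"]]
    return min(candidates, key=lambda n: n["load"])["name"]

def handle_fault(group, node, cluster_nodes, node_crashed, restarts_attempted, restart_limit=1):
    """Decide between a local restart and a group failover (illustrative only)."""
    if node_crashed:
        # The whole node is gone: every group it hosted is recovered elsewhere.
        return ("failover", group, pick_target(cluster_nodes, exclude=node))
    if restarts_attempted < restart_limit:
        # A software fault or hang: first try restarting just the failed piece in place.
        return ("restart locally", group, node)
    # Repeated local failures: migrate the whole group to another node.
    return ("failover", group, pick_target(cluster_nodes, exclude=node))

nodes = [{"name": "sysa", "load": 3, "crashed": False},
         {"name": "sysb", "load": 1, "crashed": False}]
print(handle_fault("websg", "sysa", nodes, node_crashed=False, restarts_attempted=0))
print(handle_fault("websg", "sysa", nodes, node_crashed=True, restarts_attempted=0))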


2.2 Application Considerations

Nearly all applications can be placed under cluster control, as long as basic guidelines are met:

• The application must have a defined procedure for startup. This means the FMS developer can determine the exact command used to start the application, as well as all other outside requirements the application may have, such as mounted file systems, IP addresses, etc. An Oracle database agent, for example, needs the Oracle user, Instance ID, Oracle home directory and the pfile. The developer must also know what disk groups, volumes and file systems must be present.

• The application must have a defined procedure for stopping. This means an individual instance of an application must be capable of being stopped without affecting other instances. Using a web server for example, killing all HTTPD processes is unacceptable since it would stop other web servers as well. In the case of Apache 1.3, the documented process for shutdown involves locating the PID file written by the specific instance on startup, and sending the PID contained in the PID file a kill -TERM signal. This causes the master HTTPD process for that particular instance to halt all child processes.

• The application must have a defined procedure for monitoring overall health of an individual instance. Using the web server as an example, simply checking the process table for the existence of “httpd” is unacceptable, as any web server would cause the monitor to return an online value. Checking whether the PID contained in the PID file is actually in the process table would be a better solution. Taking this check one step further, we should ensure the process in the proc table is actually the correct httpd process, thereby ensuring the operating system has not reused the PID. (A sketch of such a check follows this list.)


• To add more robust monitoring, an application can be monitored from closer to the user perspective. For example, an HTTPD server can be monitored by connecting to the correct IP address and port and testing whether the web server responds to HTTP commands. In a database environment, the monitoring application can connect to the database server, perform SQL commands and verify read and write access to the database. It is important that data written for subsequent read-back is changed each time to prevent caching from hiding underlying problems. (If the same data is written each time to the same block, the database knows it does not really have to update the disk.) In both cases, end-to-end monitoring is a far more robust check of application health. The closer a test comes to exactly what a user does, the better the test is at discovering problems. This does come at a price. End-to-end monitoring increases system load and may increase system response time. From a design perspective, the level of monitoring implemented should be a careful balance between assuring the application is up and minimizing monitor overhead.

• The application must be capable of storing all required data on shared disks. This may require specific setup options or even soft links. For example, the VERITAS NetBackup product is designed to install in /usr/openv only. This requires either linking /usr/openv to a file system mounted from the shared storage device or actually mounting a file system from the shared device on /usr/openv. On the same note, the application must store data to disk, rather than maintaining it in memory. The takeover system must be capable of accessing all required information. More on this in the next paragraph.

• The application must be capable of being restarted to a known state. This is probably the most important application requirement. On a switchover, the application is brought down under controlled conditions and started on another node. The requirements here are fairly straightforward. The application must close out all tasks, store data properly on shared disk and exit. At this time, the peer system can start up from a clean state. The problem scenario arises when one server crashes and another must take over. The application must be written in such a way that data is not stored in memory, but regularly written to disk. A commercial database such as Oracle is the perfect example of a well-written, crash-tolerant application. On any given client SQL request, the client is responsible for holding the request until it receives an acknowledgement from the server. When the server receives a request, it is placed in a special log file, or “redo” file. This data is confirmed as being written to stable disk storage before acknowledging the client. At a later time, Oracle then de-stages the data from the redo log to actual table space. (This is known as checkpointing.) After a server crash, Oracle can recover to the last known committed state by mounting the data tables and “applying” the redo logs. This in effect brings the database to the exact point in time of the crash. The client resubmits any outstanding client requests not acknowledged by the server; all others are contained in the redo logs. One key factor to note is the cooperation between client application and server. This must be factored in when assessing the overall “cluster compatibility” of an application.

• The application must be capable of running on all servers designated as potential hosts. This means there are no license issues, host name dependencies or other such problems. Prior to attempting to bring an application under cluster control, it is highly advisable that the application be test-run on all systems in the proposed cluster that may be configured to host the app.
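The following sketch illustrates the stopping and monitoring guidelines above for an Apache-style instance: shut down by signalling the PID recorded at startup, and monitor by confirming that the recorded PID is present and actually belongs to the expected program, guarding against PID reuse. The paths, process name and the /proc-based check are assumptions for this example; the exact mechanics differ by operating system.

import os
import signal

PID_FILE = "/var/run/httpd-instance1.pid"   # assumed per-instance PID file location
EXPECTED_NAME = "httpd"

def read_pid(pid_file=PID_FILE):
    with open(pid_file) as f:
        return int(f.read().strip())

def stop_instance(pid_file=PID_FILE):
    """Stop only this instance: send SIGTERM to the master process recorded at startup."""
    os.kill(read_pid(pid_file), signal.SIGTERM)

def monitor_instance(pid_file=PID_FILE):
    """Return True if the recorded PID exists and really is the expected program."""
    try:
        pid = read_pid(pid_file)
    except (OSError, ValueError):
        return False
    try:
        with open("/proc/%d/comm" % pid) as f:   # Linux-style check; other systems differ
            name = f.read().strip()
    except OSError:
        return False                             # PID is not in the process table
    return name == EXPECTED_NAME                 # guards against the OS reusing the PID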

3 VERITAS Cluster Server Overview

VERITAS Cluster Server provides high availability through automated or manual failover of applications and services. Key features of VERITAS Cluster Server include:

• Extremely scalable (up to 32 nodes in a cluster)

• Supports mixed environments. Windows NT, Solaris and HP/UX are supported today. Support for additional operating systems is planned. Because Cluster Server is a cross-platform solution, administrators only need to learn one clustering technology to support multiple environments. (Individual clusters must all be comprised of the same operating system family. Clusters of multiple OS types can all be managed from the Cluster Manager console)

• Provides a new approach to managing large server clusters. Through its Java-based graphical management interface, administrators can manage large clusters automatically or manually, and migrate applications and services among them.

• Supports all major third-party storage providers and works in SCSI and SAN environments. VERITAS provides on-going testing of storage devices through its own Interoperability Lab (iLab) and Storage Certification Suite – a self-certifying test for third-party vendors to qualify their arrays.

• Provides flexible failover possibilities. 1 to 1, any to 1, any to any, and 1 to any failovers are possible.

• Integrates seamlessly with other VERITAS products to increase availability, reliability, and performance.

4 VCS Technology Overview

At first glance, VCS seems to be a very complex package. By breaking the technology into understandable blocks, it can be explained in a much simpler fashion. The following section will describe each major building block in a VCS configuration. Understanding each of these items, as well as their interaction with the others, is key to understanding VCS. The primary items to discuss include the following:

• Clusters

• Resources and resource types

• Resource Categories

• Agents

• Agent Classifications

• Service Groups

• Resource Dependencies

• Heartbeat

4.1 Clusters

A single VCS cluster consists of multiple systems connected in various combinations to shared storage devices. VCS monitors and controls applications running in the cluster, and can restart applications in response to a variety of hardware or software faults. A cluster is defined as all systems with the same cluster ID and connected via a set of redundant heartbeat networks. (See the VCS Communications section for a detailed discussion on cluster ID and heartbeat networks.) Clusters can have from 1 to 32 member systems, or “nodes”. All nodes in the cluster are constantly aware of the status of all resources (see below) on all other nodes. Applications can be configured to run on specific nodes in the cluster. Storage is configured to provide access to shared application data for those systems hosting the application. In that respect, the actual storage connectivity will determine where applications can be run. In the examples below, the full storage connectivity model would allow any application to run on any node. In the partial storage connectivity model, an application requiring access to Volume X would be capable of running on nodes A' or B', and an application requiring access to Volume Y could be configured to run on nodes B' or C'.
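A tiny sketch of the placement rule just described, using the node and volume names from the partial connectivity example (the connectivity map itself, including the Volume Z entry, is assumed for illustration):

# Which nodes can physically see which volumes (partial storage connectivity model).
connectivity = {
    "Volume X": {"A'", "B'"},
    "Volume Y": {"B'", "C'"},
    "Volume Z": {"C'", "D'"},   # assumed for the example
}

def eligible_nodes(required_volumes):
    """An application can only run on nodes that can access all of its volumes."""
    node_sets = [connectivity[v] for v in required_volumes]
    return set.intersection(*node_sets)

print(eligible_nodes(["Volume X"]))   # {"A'", "B'"}
print(eligible_nodes(["Volume Y"]))   # {"B'", "C'"}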


Within a single VCS cluster, all member nodes must run the same operating system family. For example, a Solaris cluster would consist of entirely Solaris nodes, likewise with HPUX and NT clusters. Multiple clusters can all be managed from one central console with the Cluster Server Cluster Manager.

The Cluster manager allows an administrator to log in and manage a virtually unlimited number of VCS clusters, using one common GUI and command line interface.

[Figure: Full Storage Connectivity Model (e.g., Fibre Channel). Cluster Server nodes A, B, C and D share a client access network, private cluster interconnects (redundant heartbeat) and a storage access network built on a Fibre Channel hub or switch, so every node can reach all of the shared storage.]

[Figure: Partial Storage Connectivity Model (e.g., multi-hosted SCSI). Cluster Server nodes A', B', C' and D' share the same client access network and private cluster interconnects, but each of Volume X, Volume Y and Volume Z is cabled to only a subset of the nodes.]


4.2 Resources and Resource Types

Resources are hardware or software entities, such as disks, network interface cards (NICs), IP addresses, applications, and databases, which are controlled by VCS. Controlling a resource means bringing it online (starting), taking it offline (stopping) as well as monitoring the health or status of the resource.

Resources are classified according to types, and multiple resources can be of a single type; for example, two disk resources are both classified as type Disk. How VCS starts and stops a resource is specific to the resource type. For example, mounting starts a file system resource, and an IP resource is started by configuring the IP address on a network interface card. Monitoring a resource means testing it to determine if it is online or offline. How VCS monitors a resource is also specific to the resource type. For example, a file system resource tests as online if mounted, and an IP address tests as online if configured. Each resource is identified by a name that is unique among all resources in the cluster.

Different types of resources require different levels of control. Most resource types are considered “On-Off” resources. In this case, VCS will start and stop these resources as necessary. Other resources may be needed by VCS as well as external applications. An example is NFS daemons. VCS requires the NFS daemons to be running to export a file system. There may also be other file systems exported locally, outside VCS control. The NFS resource is considered “OnOnly”. VCS will start the daemons if necessary, but does not stop them if the service group is offlined. The last level of control is a resource that cannot be physically onlined or offlined, yet VCS needs the resource to be present. For example, a NIC cannot be started or stopped, but is necessary to configure an IP address. Resources of this type are considered “Persistent” resources. VCS monitors them to make sure they are present and healthy.
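A short sketch of these three levels of control. The category names follow the text above, while the example resource types and the dispatch logic are a simplified illustration rather than the engine's actual implementation.

ON_OFF, ON_ONLY, PERSISTENT = "On-Off", "OnOnly", "Persistent"

# Example mapping of resource types to control categories (illustrative).
RESOURCE_CATEGORIES = {
    "Mount": ON_OFF,       # VCS starts and stops these as needed
    "NFS": ON_ONLY,        # VCS starts the daemons if necessary but never stops them
    "NIC": PERSISTENT,     # cannot be started or stopped, only monitored
}

def on_group_offline(resource_type):
    """What happens to each category when its service group is taken offline."""
    category = RESOURCE_CATEGORIES[resource_type]
    if category == ON_OFF:
        return "stop " + resource_type
    # OnOnly and Persistent resources are left in place; they are only monitored.
    return "leave " + resource_type + " running, keep monitoring"

for rtype in ("Mount", "NFS", "NIC"):
    print(rtype, "->", on_group_offline(rtype))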

VCS includes a set of predefined resource types. For each resource type, VCS has a corresponding agent. The agent provides the resource-type-specific logic to control resources.

4.3 Agents

The actions required to bring a resource online or take it offline differ significantly for different types of resources. Bringing a disk group online, for example, requires importing the Disk Group, whereas bringing an Oracle database online would require starting the database manager process and issuing the appropriate startup command(s) to it. From the cluster engine’s point of view the same result is achieved—making the resource available. The actions performed are quite different, however. VCS handles this functional disparity between different types of resources in a particularly elegant way, which also makes it simple for application and hardware developers to integrate additional types of resources into the cluster framework.


Each type of resource supported in a cluster is associated with an agent. An agent is an installed program designed to control a particular resource type. For example, for VCS to bring an Oracle resource online it does not need to understand Oracle; it simply passes the online command to the OracleAgent. Since the structure of cluster resource agents is straightforward, it is relatively easy to develop agents as additional cluster resource types are identified.
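To show how that division of labor looks in practice, here is a minimal sketch of the idea: the engine asks only for “online” or “offline”, and each agent supplies the type-specific steps. The command strings returned are illustrative, not the literal agent implementations.

class DiskGroupAgent:
    """Type-specific control for DiskGroup resources (commands are illustrative)."""
    def online(self, res):
        return ["vxdg", "import", res["DiskGroup"]]    # import the disk group
    def offline(self, res):
        return ["vxdg", "deport", res["DiskGroup"]]    # deport it again

class OracleAgent:
    """The engine never issues Oracle commands itself; the agent encapsulates them."""
    def online(self, res):
        return ["start Oracle instance", res["Sid"]]
    def offline(self, res):
        return ["shut down Oracle instance", res["Sid"]]

def engine_online(agent, resource):
    # From the engine's point of view, the request is identical for every resource
    # type: "bring this resource online"; the agent supplies the type-specific steps.
    return agent.online(resource)

print(engine_online(DiskGroupAgent(), {"DiskGroup": "datadg"}))
print(engine_online(OracleAgent(), {"Sid": "PROD"}))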

VCS agents are "multi-threaded". This means a single VCS agent monitors multiple resources of the same resource type on one host; for example, the DiskAgent manages all Disk resources. VCS monitors resources when they are online as well as when they are offline (to ensure resources are not started on systems where they are not supposed to be currently running). For this reason, VCS starts the agent for any resource configured to run on a system when the cluster is started.

If there are no resources of a particular type configured to run on a particular system, the corresponding agent will not be started on that system. For example, if there are no Oracle resources configured to run on a system (either as the primary for the database or as a "failover target"), the OracleAgent will not be started on that system.

VCS agents are located in the /opt/VRTSvcs/bin/$TypeName directory, where $TypeName is the name of the resource type. For example, the Mount agent and corresponding online/offline/monitor scripts are located in the /opt/VRTSvcs/bin/Mount directory. The agent itself is named MountAgent.
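For example, a listing of the Mount agent directory typically looks something like the following sketch (exact contents vary by VCS version and platform, and the clean entry point shown here is an assumption):

    ServerA# ls /opt/VRTSvcs/bin/Mount
    MountAgent   clean   monitor   offline   online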

4.4 Classifications of VCS Agents

• Bundled Agents

Agents packaged with VCS are referred to as bundled agents. They include agents for Disk, Mount, IP, and several other resource types. For a complete description of Bundled Agents shipped with the VCS product, see the VCS Bundled Agents Guide.

• Enterprise Agents

Agents that can be purchased from VERITAS but are packaged separately from VCS are referred to as Enterprise agents. They include agents for Informix, Oracle, NetBackup, and Sybase. Each Enterprise Agent ships with documentation on the proper installation and configuration of the agent.

• Custom Agents

If a customer has a specific need to control an application that is not covered by the agent types listed above, a custom agent must be developed. VERITAS Enterprise Consulting Services provides agent development for customers, or customers can choose to write their own. Refer to the VERITAS Cluster Server Agent Developer's Guide, which is part of the standard documentation distribution, for more information on creating VCS agents.

Page 23: vcs-refguide

4.5 Service Groups

Service Groups are the primary difference between first generation HA packages and second generation. As mentioned in The History of High Availability, early systems such as VERITAS FirstWatch used the entire server as the level of granularity for failover. If an application failed, all applications under FirstWatch control were migrated to a second machine. Second generation HA packages such as VCS reduce the granularity of application control to a smaller container around applications and their associated resources, called a Service Group. A service group is a set of resources working together to provide application services to clients.

For example, a web application Service Group might consist of:

• Disk Groups on which the web pages to be served are stored,

• A volume built in the disk group,

• A file system using the volume,

• A database whose table spaces are files and whose rows contain page pointers,

• The network interface card (NIC) or cards used to export the web service,

• One or more IP addresses associated with the network card(s), and,

• The application program and associated code libraries.

VCS performs administrative operations on resources, including starting, stopping, restarting, and monitoring at the Service Group level. Service group operations initiate administrative operations for all resources within the group. For example, when a service group is brought online, all the resources within the group are brought online. When a failover occurs in VCS, resources never failover individually – the entire service group that the resource is a member of is the unit of failover. If there is more than one group defined on a server, one group may failover without affecting the other group(s) on the server.
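As a sketch of what group-level administration looks like from the command line (assuming the standard hagrp utility; check the VCS Users Guide for the exact options on your release):

    # Bring a service group online on a specific system
    hagrp -online NFS_Group -sys ServerA

    # Switch the group to another system in its SystemList
    hagrp -switch NFS_Group -to ServerB

    # Take the group offline
    hagrp -offline NFS_Group -sys ServerB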

From a cluster standpoint, there are two significant aspects to this view of an application Service Group as a collection of resources:

• If a Service Group is to run on a particular server, all of the resources it requires must be available to the server.

• The resources comprising a Service Group have interdependencies; that is, some resources (e.g., volumes) must be operational before other resources (e.g., the file system) can be made operational.

Page 24: vcs-refguide

4.6 Resource dependencies

One of the most important parts of a service group definition is the concept of resource dependencies. As mentioned above, resource dependencies determine the order in which specific resources within a Service Group are brought online or taken offline when the Service Group is brought online or taken offline. For example, a VxVM Disk Group must be imported before volumes in the disk group can be started, and volumes must start before file systems can be mounted. In the same manner, file systems must be unmounted before volumes are stopped, and volumes stopped before disk groups are deported. Diagramming resources and their dependencies forms a graph. The resources at the top of the graph are root resources. Resources at the bottom of the graph are leaf resources. Parent resources appear at the top of the arcs that connect them to their child resources. Typically, child resources are brought online before parent resources, and parent resources are taken offline before child resources. Resources must adhere to the established order of dependency. The dependency graph is common shorthand used to document resource dependencies within a Service Group (and, as explained later, dependencies between Service Groups). The illustration shows a resource dependency graph for a cluster service.

[Figure: Cluster Service Resource Dependency Graph -- the Application requires the Database and the IP Address; the Database requires the File System; the File System requires the Volume; the Volume requires the Disk Group; the IP Address requires the Network Card]

In the figure above, the lower (child) resources represent resources required by the upper (parent) resources. Thus, the volume requires that the disk group be online, the file system requires that the volume be active, and so forth. The application program itself requires two independent resource sub trees to function—a database and an IP address for client communications.

The VERITAS Cluster Server includes a language for specifying resource types and dependency relationships. The main VCS high availability daemon, or HAD,

Page 25: vcs-refguide

uses resource definitions and dependency definitions when activating or deactivating applications. In general, child resources must be functioning before their parents can be started. The cluster engine starts a service by bringing online the resources represented by leaf resources of the service’s resource dependency graph. Referring to the figure above for example, the disks and the network card could be brought online concurrently, because they have no interdependencies. When all child resources required by a parent are online, the parent itself is brought online, and so on up the tree, until finally the application program itself is started.

Similarly, when deactivating a service, the cluster engine begins at the top of the graph. In the example above, the application program would be stopped first, followed by the database and the IP address in parallel, and so forth.
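For illustration, the dependency graph above could be expressed with the requires keyword used in the VCS configuration language described later; the resource names app1, db1, ip1, fs1, vol1, dg1 and nic1 are hypothetical:

    app1 requires db1
    app1 requires ip1
    db1 requires fs1
    fs1 requires vol1
    vol1 requires dg1
    ip1 requires nic1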

4.7 Types of Service Groups

4.7.1 Failover Groups

A failover group runs on one system in the cluster at a time. Failover groups are used for most application services, such as NFS servers using the VERITAS VxVM Volume Manager.

4.7.2 Parallel Groups

A parallel group can run concurrently on more than one system in the cluster at a time.

Parallel service groups require applications that are designed to be run in more than one place at a time. For example, the standard VERITAS Volume Manager is not designed to allow a volume group to be online on two hosts at once without risk of data corruption. However the VERITAS Cluster Volume Manager, shipped as part of the SANPoint Foundation Suite, is designed to function properly in a cluster environment. For the most part, applications available today will require modification to work in a parallel environment.

4.8 Cluster Communications (Heartbeat)

VCS uses private network communications between cluster nodes for cluster maintenance. This communication takes the form of nodes informing other nodes that they are alive, known as heartbeat, and nodes informing all other nodes of actions taking place and the status of all resources on a particular node, known as cluster status. This cluster communication takes place over a private, dedicated network between cluster nodes. VERITAS requires two completely independent, private network connections between all cluster nodes to provide the necessary communication path redundancy and allow VCS to discriminate between a network failure and a system failure.

VCS communications are discussed in detail in section X.

Page 26: vcs-refguide

4.9 Putting the pieces together

How do all these pieces tie together to form a cluster? Understanding this makes the rest of VCS fairly simple. Let's take a very simple example: a two-node cluster serving a single NFS file system to clients. The cluster itself consists of two nodes connected to shared storage to allow both servers to access the data needed for the file system export. The following drawing shows the basic cluster layout:

In this example, we are going to configure a single Service Group called NFS_Group that will be failed over between ServerA and ServerB as necessary. The service group, configured as a Failover Group, consists of resources, each one with a different resource type. The resources must be started in a specific order for everything to work. This is described with resource dependencies. Finally, in order to control each specific resource type, VCS will require an agent. The VCS engine, HAD, will read the configuration file and determine what agents are necessary to control the resources in this group (as well as any resources in any other service group configured to run on this system) and start the corresponding VCS Agents. HAD will then determine the order to bring up the resources based on resource dependency statements in the configuration. When it is time to online the service group, VCS will issue online commands to the proper agents in the proper order. The following drawing is a representation of a VCS service group, with the appropriate resources and dependencies for the NFS Group. The method used to display the resource dependencies is identical to the VCS GUI.

[Figure: basic cluster layout -- ServerA and ServerB connected by redundant heartbeat links, both attached to the client access network and to mirrored disks on a shared SCSI bus]

Page 27: vcs-refguide

In this configuration, the VCS engine would start agents for DiskGroup, Mount, Share, NFS, NIC and IP on all systems configured to run this group. The resource dependencies are configured as follows:

• The /home file system, shown as home_mount requires the Disk Group shared_dg1 to be online before mounting

• The NFS export of the home file system requires the home file system to be mounted as well as the NFS daemons to be running.

• The high availability IP address, nfs_IP requires the file system to be shared as well as the network interface to be up, represented as nfs_group_hme0.

• The NFS daemons and the Disk Group have no lower (child) dependencies, so they can start in parallel.

• The NIC resource is a special resource called a persistent resource and does not require starting. Please see the VCS Configuration Details section for more on persistent resources.

The NFS Group can be configured to automatically start on either node in the example. It can then move or failover to the second node based on operator command, or automatically if the first node fails. VCS will offline the resources starting at the top of the graph and start them on the second node starting at the bottom of the graph.

[Figure: NFS_Group resource dependency graph -- nfs_IP requires home_share and nfs_group_hme0; home_share requires home_mount and NFS_nfs_group_16; home_mount requires shared_dg1]

Page 28: vcs-refguide

5 Common cluster configuration tasks

Regardless of overall cluster intent, several steps must be taken in all new VCS cluster configurations. These include VCS heartbeat setup, storage configuration and system layout. The following section will cover these basics.

5.1 Heartbeat network configuration

VCS private communications/heartbeat is one of the most critical configuration decisions, as VCS uses this path to control the entire cluster and maintain a coherent state. Loss of heartbeat communications due to poor network design can cause system outages, and in the worst case even data corruption.

It is absolutely essential that two completely independent networks be provided for private VCS communications. Completely independent means there can be no single failure that can disable both paths. Careful attention must be paid to wiring runs, network hub power sources, network interface cards, etc. To state it another way, "the only way it should be possible to lose all communications between two systems is for one system to fail." If any failure can remove all communications between systems, AND still leave systems running and capable of accessing shared storage, a chance for data corruption exists.

To set up private communications, first choose two independent network interface cards within each system. Use of two ports on a multi-port card should be avoided. To interconnect, VERITAS recommends the use of network hubs from a quality vendor. Crossover cabling between two-node clusters is acceptable; however, the use of hubs allows future cluster growth without heartbeat interruption to existing nodes. Next, ensure the hubs are powered from separate power sources. In many cases, tying one hub to the power source for one server and the second hub to power for the second server provides good redundancy. Connect systems to the hubs with professionally built network cables, running on separate paths. Ensure a single wiring bundle or network patch panel problem cannot affect both cable runs.

Depending on operating system, ensure network interface speed and duplex settings are hard set and auto negotiation is disabled.

Test the network connections by temporarily assigning network addresses and using telnet or ping to verify communications. You must use different IP network addresses on each link to ensure traffic actually uses the correct port.
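On Solaris, for example, a quick link-by-link test might look like the following sketch; the qfe0/qfe1 interface names and the 10.10.x.x test addresses are assumptions, so substitute your own:

    # On ServerA
    ifconfig qfe0 plumb ; ifconfig qfe0 10.10.10.1 netmask 255.255.255.0 up
    ifconfig qfe1 plumb ; ifconfig qfe1 10.10.11.1 netmask 255.255.255.0 up

    # On ServerB
    ifconfig qfe0 plumb ; ifconfig qfe0 10.10.10.2 netmask 255.255.255.0 up
    ifconfig qfe1 plumb ; ifconfig qfe1 10.10.11.2 netmask 255.255.255.0 up

    # From ServerA, verify each link independently, then remove the test addresses
    ping 10.10.10.2
    ping 10.10.11.2
    ifconfig qfe0 unplumb ; ifconfig qfe1 unplumb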

The following diagram shows basic VCS private network connectivity

Page 29: vcs-refguide

The InstallVCS script will configure actual VCS heartbeat at a later time. For manual VCS communication configuration, see the VCS communications section.

5.2 Storage Configuration

VCS is a "shared data" high availability product. In order to fail over an application from one node to another, both nodes must have direct access to the data storage. This can be accomplished with dual-hosted SCSI or a Storage Area Network. The use of disk storage on drives internal to a server for shared application data is not possible. VERITAS also does not support the use of replication products to provide mirroring of data between arrays for cluster node usage. This means any system that will run a specific service group must have direct access to shared storage. Put another way, you may not have one node connecting to one array, replicate the data to another array, and configure VCS to fail over applications to this node.

5.2.1 Dual hosted SCSI

Dual hosted SCSI has been around for a number of years and works well in smaller configurations. Its primary limitation is scalability. Typically two and at most four systems can be connected to a single drive array. Large storage vendors such as EMC provide high-end arrays with multiple SCSI connections into an array to overcome this problem. In most cases however, the nodes will be connected to a simple array in a configuration like the following diagram.

Page 30: vcs-refguide

Notice the SCSI Host ID settings on each system. A typical SCSI bus has one SCSI Initiator (Controller or HBA) and one or more SCSI Targets (Drives). To configure a dual hosted SCSI configuration, one SCSI Initiator or SCSI Host ID must be set to a value different than its peer. The SCSI ID must be chosen so it does not conflict with any drive installed or the peer initiator.

The method of setting the SCSI Initiator ID is dependent on the system manufacturer.

Sun Microsystems provides two methods to set the SCSI ID. One is at the EEPROM level and affects all SCSI controllers in the system. It is set by changing the scsi-initiator-id value in the Open Boot PROM, such as setenv scsi-initiator-id = 5. This change affects all SCSI controllers, including the internal controller for the system disk and CD-ROM. Be careful when choosing a new controller ID not to conflict with the boot disk, floppy drive or CD-ROM. On most recent Sun systems, ID 5 is a possible choice. Sun systems can also set the SCSI ID on a per-controller basis if necessary. This is done by editing the SCSI driver control file in the /kernel/drv area. For details on setting the SCSI ID on a per-controller basis, please see the VCS Installation Guide, Setting up shared storage.
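A minimal OpenBoot PROM session for the EEPROM method might look like the following sketch; the exact syntax and output format can differ between PROM revisions, so verify against the Sun documentation for your platform:

    ok printenv scsi-initiator-id
    scsi-initiator-id        7
    ok setenv scsi-initiator-id 5
    scsi-initiator-id =      5
    ok reset-all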

NT/Intel systems are typically set on a per-controller basis with a utility package provided by the SCSI controller manufacturer. This is available during system boot time with a command sequence such as <cntrl S> or <cntrl U>, or as a utility run from within NT. Refer to your system documentation for details.

[Figure: dual-hosted SCSI configuration -- two Sun Enterprise servers (A and B) with redundant private networks and a public network connection, both attached to a dual-hosted external SCSI array; SCSI Host ID = 5 on one server and SCSI Host ID = 7 on the other]

Page 31: vcs-refguide

HP/UX systems vary between platforms. Controllers are typically set with jumper or switch settings on a per controller basis.

The most common problem seen in configuring shared SCSI storage is duplicate SCSI IDs. A duplicate SCSI ID will, in many cases, exhibit different symptoms depending on whether there are duplicate controller IDs or a controller ID conflicting with a disk drive. A controller conflicting with a drive will often manifest itself as "phantom drives". For example, on a Sun system with a drive ID conflict, the output of the format command will show 16 drives, IDs 0-15, attached to the bus with the conflict. Duplicate controller IDs are a very serious problem, yet are harder to spot. SCSI controllers are also known as SCSI Initiators. An initiator, as the name implies, initiates commands. SCSI drives are targets. In a normal communication sequence, a target can only respond to a command from an initiator, and an initiator that sees a command from another initiator will ignore it. The problem may only manifest itself during simultaneous commands from both initiators. A controller could issue a command, see a response from a drive and assume all was well, when the response was actually to a command from the peer system and the original command never completed. Carefully examine systems attached to shared SCSI and make certain the controller IDs are different.

The following is an example of a typical shared SCSI configuration.

• Start with the storage attached to one system. Terminate the SCSI bus at the array.

• Power up the host system and array.

• Verify all drives can be seen with the operating system using available commands such as format.

• Identify what SCSI drive ID’s are used in the array and internal SCSI drives if present.

• Identify the SCSI controller ID. On Sun systems, this is displayed at system boot. NT systems may require launching the SCSI configuration utility during system boot.

• Identify a suitable ID for the controller on the second system.

o This ID must not conflict with any drive in the array or the peer controller.

o If you plan to set all controllers to a new ID, as done from EEPROM on a Sun system, ensure the controller ID chosen on the second system does not conflict with internal SCSI devices.

Page 32: vcs-refguide

• Set the new SCSI controller ID on the second system. It may be a good idea to test boot at this point.

• Power down both systems and the external array. SCSI controllers or the array may be damaged if you attempt to “hot-plug” a SCSI cable. Disconnect the SCSI terminator and cable the array to the second system.

• Power up the array and both systems. Depending on hardware platform, you may be able to check for array connectivity before the operating system is brought up.

o On Sun systems, halt the boot process at the boot prom. Use the command probe-scsi-all to verify the disks can be seen from the hardware level on both systems. If this works, proceed with a boot -r to reconfigure the Solaris /dev entries.

o On NT systems, most SCSI adapters provide a utility available from the boot sequence. Entering the SCSI utility will allow you to view attached devices. Verify both systems can see the shared storage, verify SCSI controller ID one last time and then boot the systems.

• Boot console messages such as “unexpected SCSI reset” are a normal occurrence during the boot sequence of a system connected to a shared array. Most SCSI adapters will perform a bus reset during initialisation. The error message is generated when it sees a reset it did not initiate (initiated by the peer).

5.2.2 Storage Area Networks

Storage Area Networks, or SANs, have dramatically increased configuration capability and scalability in cluster environments. The use of Fibre Channel fabric switches and loop hubs allows storage to be added with no electrical interruption to the host system and eliminates termination issues.

Configuration steps to build a VCS cluster on a SAN differ depending on SAN architecture.

Depending on system design, it is likely you will not be able to verify disk connectivity before system boot.

5.2.3 Storage Configuration Sequence

VCS requires the underlying operating system to be able to see and access shared storage. After installing the shared array, verify the drives can be seen from the operating system. In Solaris, the format command can be used.

Once disk access is verified from the operating system, it is time to address cluster storage requirements. This will be determined by the application(s) that will be

Page 33: vcs-refguide

run in the cluster. The rest of this section assumes the installer will be using the VERITAS Volume Manager VxVM to control and allocate disk storage.

Recall the discussion on Service Groups. In this section it was stated that a service group must be completely self-contained, including storage resources. From a VxVM perspective, this means a Disk Group can only belong to one service group. Multiple service groups will require multiple Disk Groups. Volumes may not be created in the VxVM rootdg for use in VCS, as rootdg cannot be deported and imported by the second server.

Determine the number of Disk Groups needed as well as the number and size of volumes in each disk group. Do not compromise disk protection afforded by disk mirroring or RAID to achieve the storage sizes needed. Buy more disks if necessary!

Perform all VxVM configuration tasks from one server. It is not necessary to perform any volume configuration on the second server, as all volume configuration data is stored within the volume itself. Working from one server will significantly decrease chances of errors during configuration.

Create the required file systems on the volumes. On Unix systems, the use of journaled file systems (VxFS or Online JFS) is highly recommended to minimize recovery time after a system crash. This feature is not currently available on NT systems. Do not configure file systems to automatically mount at boot time; this is the responsibility of VCS. Test access to the new file systems.

On the second server, create all necessary file system mount points to mirror the first server. At this point, it is recommended that the VxVM disk groups be deported from the first server, imported on the second server, and the file systems test mounted.
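A hedged sketch of this sequence, assuming the shared_dg1 disk group and a mirrored VxFS volume home_vol mounted on /export/home (names and sizes are illustrative only):

    # On the first server: create the volume and file system, then test mount and release
    vxassist -g shared_dg1 make home_vol 2g layout=mirror
    mkfs -F vxfs /dev/vx/rdsk/shared_dg1/home_vol
    mount -F vxfs /dev/vx/dsk/shared_dg1/home_vol /export/home
    umount /export/home
    vxdg deport shared_dg1

    # On the second server: import the disk group, start volumes and test mount
    vxdg import shared_dg1
    vxvol -g shared_dg1 startall
    mount -F vxfs /dev/vx/dsk/shared_dg1/home_vol /export/home
    umount /export/home
    vxdg deport shared_dg1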

5.3 Application setup

One of the primary difficulties new VCS users encounter is "trying to get applications to work in VCS". Very rarely is the trouble with VCS, but rather with the application itself. VCS has the capability to start, stop and monitor individual resources. It does not have any magic hidden powers to start applications. Stated simply, "if the application cannot be started from the command line, VCS will not be able to start it". Understanding this is the key to simple VCS deployments. Manually testing that an application can be started and stopped on both systems before VCS is involved will save a lot of time and frustration.

Another common question concerns application install locations. For example, in a simple two-node Oracle configuration, should the Oracle binaries be installed on the shared storage or locally on each system? Both methods have benefits. Installing application binaries on shared storage can provide simpler administration. Only one copy must be maintained, updated, etc. Installing separate copies also has its strong points. For example, installing local copies of

Page 34: vcs-refguide

the Oracle binaries may allow the offline system to be upgraded with the latest Oracle patch and minimize application downtime. The offline system is upgraded, the service group is failed over to the new patched version, and the now offline system is upgraded. Refer to the "VCS Best Practices" section for more discussion on this topic.

Choose whichever method best suits your environment. Then install and test the application on one server. When this is successful, deport the disk group, import it on the second server and test that the application runs properly. Details like system file modifications, file system mount points, licensing issues, etc. are much easier to sort out at this time, before bringing the cluster package into the picture.

While installing, configuring and testing your application, document the exact resources needed for the application and the order in which they must be configured. This will provide you with the necessary resource dependency details for the VCS configuration. For example, if your application requires three file systems, the beginning resource dependency is disk group, volumes, file systems.

5.4 Public Network details

VCS service groups require an IP address for client access. This address will be the High Availability address or "floating" address. During a failover, this address is moved from one server to another. Each server configured to host this service group must have a physical NIC on the proper subnet for the HA IP address. The physical interfaces must be configured with a fixed IP address at all times. Clients do not need to know the physical addresses, just the HA IP address. For example, two servers have hostnames SystemA and SystemB, with IP addresses 192.168.1.1 and 192.168.1.2 respectively. The clients could be configured to access the virtual host SystemAB at 192.168.1.3. During the cluster implementation, name resolution systems such as DNS, NIS or WINS will need to be updated to properly point clients to the HA address.

VCS cannot be configured to fail an IP address between subnets. While it is possible to do so with specific configuration directives, moving an IP address to a different subnet will make it inaccessible and therefore useless.

5.5 Initial VCS install and setup

5.5.1 Unix systems

VCS 1.3 for Solaris and 1.3.1 for HP/UX provide a setup script called InstallVCS that automates the installation of the VCS packages and communication setup. In order to run this utility, rsh access must be temporarily provided between cluster nodes. This can be done by editing the /.rhosts file and providing root rsh access for the duration of the install. Following the software install, rsh access can be disabled. Please see the VCS 1.3 Installation Guide for detailed instructions on the InstallVCS utility.
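A temporary /.rhosts entry on each node might look like the following sketch (remove the entries once the installation completes):

    # /.rhosts on ServerA and ServerB
    ServerA root
    ServerB root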

Page 35: vcs-refguide

5.5.2 NT systems

The installation routine for VCS NT is very straightforward and runs as a standard InstallShield-type process.

Please see the VCS NT Installation Guide for detailed instructions.

5.6 Communication verification

The InstallVCS utility on Unix and the NT setup utility create a very basic configuration with LLT and GAB running and a basic configuration file to allow VCS to start. At this time, it is a good practice to verify VCS communications.

5.6.1 LLT

Use the lltstat command to verify that the links are active for LLT. This command returns information about the LLT links for the system on which it is typed. Refer to the lltstat(1M) manual page on Unix and the Online Help on NT for more information. In the following example, lltstat -n is typed on each system in the cluster. On Unix systems, use /sbin/lltstat. On NT, use %VCS_ROOT%\comms\llt\lltstat -n

ServerA# lltstat -n
Output resembles:

LLT node information:
    Node       State      Links
  * 0          OPEN       2
    1          OPEN       2
ServerA#

ServerB# lltstat -n
Output resembles:

LLT node information:
    Node       State      Links
    0          OPEN       2
  * 1          OPEN       2
ServerB#

Note that each system has two links and that each system is in the OPEN state. The asterisk (*) denotes the system on which the command is typed.

5.6.2 GAB

To verify that GAB is operating, use the gabconfig -a command. On Unix systems, use /sbin/gabconfig -a. On NT systems, use %VCS_ROOT%\comms\gab\gabconfig -a

ServerA# /sbin/gabconfig -a

If GAB is operating, the following GAB port membership information is returned:

GAB Port Memberships
===================================
Port a gen a36e0003 membership 01

Page 36: vcs-refguide

Port h gen fd570002 membership 01

Port a indicates that GAB is communicating, gen a36e0003 is a random generation number, and membership 01 indicates that systems 0 and 1 are connected.

Port h indicates that VCS is started, gen fd570002 is a random generation number, and membership 01 indicates that systems 0 and 1 are both running VCS.

If GAB is not operating, no GAB port membership information is returned:

GAB Port Memberships
===================================

If only one network is connected, the following GAB port membership information is returned:

GAB Port Memberships
===================================
Port a gen a36e0003 membership 01
Port a gen a36e0003 jeopardy 1
Port h gen fd570002 membership 01
Port h gen fd570002 jeopardy 1

5.6.3 Cluster operation

To verify that the cluster is operating, use the hastatus -summary command. On Unix systems, use /opt/VRTSvcs/bin/hastatus -summary. On NT, use %VCS_HOME%\bin\hastatus -summary

ServerA# hastatus -summary

The output resembles:

-- SYSTEM STATE
-- System               State          Frozen

A  SystemA              RUNNING        0
A  SystemB              RUNNING        0

Note the system state. If the value is RUNNING, VCS is successfully installed and running. Refer to hastatus(1M) manual page on Unix and Online Help on NT for more information.

If any problems exist, refer to the VCS Installation Guide, Verifying LLT, GAB and Cluster operation for more information.

Page 37: vcs-refguide

6 VCS Configuration

VCS uses two main configuration files in a default configuration. The main.cf file describes the entire cluster, and the types.cf file describes the installed resource types. By default, both of these files reside in the /etc/VRTSvcs/conf/config directory (%VCS_HOME%\conf\config on Windows NT). Additional files similar to types.cf may be present if additional agents have been added, such as OracleTypes.cf or SybaseTypes.cf.

6.1 Main.cf file format

The main.cf file is the single file used to define an individual cluster. On startup, the VCS engine uses the hacf utility to parse the main.cf and build a command file, main.cmd, to run. The overall format of the main.cf file is as follows:

• Include clauses Include clauses are used to bring in resource definitions. At a minimum, the types.cf file is included. Other type definitions must be configured as necessary. Typically, the addition of VERITAS VCS Enterprise Agents will add additional type definitions in their own files, as well as custom agents developed for this cluster. Most customers and VERITAS consultants will not modify the provided types.cf file, but instead create additional type files.

• Cluster definition The cluster section describes the overall attributes of the cluster. This includes:

• Cluster name

• Cluster GUI users

• System definitions Each system designated as part of the cluster is listed in this section. The names listed as system names must match the name returned by the uname -n command on Unix. If fully qualified domain names are used, an additional file, /etc/VRTSvcs/conf/sysname, must be created. See the FAQ for more information on sysname. System names are preceded with the keyword "system". For any system to be used in a later service group definition, it must be defined here! Think of this as the overall set of systems available, with each service group being a subset.

• snmp definition More on this in Advanced Configuration Topics.

• Service group definitions The service group definition specifies the overall attributes of this particular service group. Possible attributes for a service group include the following (see the VCS Users Guide for a complete list of Service Group Attributes):

• SystemList

Page 38: vcs-refguide

o Lists all systems that can run this service group. VCS will not allow a service group to be onlined on any system not in the group's system list. The order of systems in the list defines, by default, the priority of systems used in a failover. For example, SystemList = { ServerA, ServerB, ServerC } would configure ServerA to be the first choice on failover, followed by ServerB and so on. System priority may also be assigned explicitly in the SystemList by assigning numeric values to each system name. For example, SystemList = { ServerA = 0, ServerB = 1, ServerC = 2 } is identical to the preceding example, but in this case the administrator could change priority by changing the numeric priority values. Also note the formatting of the "{ }" characters. This is detailed in section X.X, Attributes.

• AutoStartList

o The AutoStartList defines the system that should bring up the group on a full cluster start. If this system is not up when all others are brought online, the service group will remain offline. For example: AutoStartList = { ServerA }.

• Resource definitions

This section will define each resource used in this service group (and only this service group). Resources can be added in any order, and hacf will reorder them alphabetically the first time the configuration file is run.

• Service group dependency clauses To configure a service group dependency, place the keyword requires clause in the service group declaration within the VCS configuration file, before the resource dependency specifications, and after the resource declarations.

• Resource dependency clauses A dependency between resources is indicated by the keyword requires between two resource names. This indicates that the second resource (the child) must be online before the first resource (the parent) can be brought online. Conversely, the parent must be offline before the child can be taken offline. Also, faults of the children are propagated to the parent. This is the most common resource dependency

6.2 Resource type definitions

The types.cf file describes the standard resource types to the VCS engine. The file describes the data necessary to control a given resource. The following is an example of the DiskGroup resource type definition.

type DiskGroup (
    static int NumThreads = 1
    static int OnlineRetryLimit = 1
    static str ArgList[] = { DiskGroup, StartVolumes, StopVolumes, MonitorOnly }
    NameRule = resource.DiskGroup
    str DiskGroup
    str StartVolumes = 1
    str StopVolumes = 1
)

Page 39: vcs-refguide

In this example, the definition is started with the keyword “type”. This is followed by an optional unique name. All resource names must be unique in a VCS cluster. If a name is not specified, the hacf utility will generate a unique name based on the “NameRule” Please see the following section explaining NameRule.

The types definition performs two very important functions. First it defines the sort of values that may be set for each attribute. In the DiskGroup example, the NumThreads and OnlineRetryLimit are both classified as int, or integer. Signed integer constants are a sequence of digits from 0 to 9. They may be preceded by a dash, and are interpreted in base 10.

The DiskGroup, StartVolumes and StopVolumes are strings. As described in the Users Guide: A string is a sequence of characters enclosed by double quotes. A string may also contain double quotes, but the quotes must be immediately preceded by a backslash. A backslash is represented in a string as \\. Quotes are not required if a string begins with a letter, and contains only letters, numbers, dashes (-), and underscores (_).

The second critical piece of information provided by the type definition is the "ArgList". The line static str ArgList[] = { xxx, yyy, zzz } defines the order in which parameters are passed to the agents for starting, stopping and monitoring resources. For example, when VCS wishes to online the disk group "shared_dg1", it passes the online command to the DiskGroupAgent with the following arguments: (shared_dg1 shared_dg1 1 1 <null>). This is the online command, the name of the resource, then the contents of the ArgList. Since MonitorOnly is not set, it is passed as a null. This is always the case: command, resource name, ArgList.

For another example, look at the following main.cf and types.cf pair representing an IP resource:

IP nfs_ip1 (
    Device = hme0
    Address = "192.168.1.201"
)

type IP (
    static str ArgList[] = { Device, Address, NetMask, Options, ArpDelay, IfconfigTwice }
    NameRule = IP_ + resource.Address
    str Device
    str Address
    str NetMask
    str Options
    int ArpDelay = 1
    int IfconfigTwice
)

Page 40: vcs-refguide

In this example, we configure the high availability address on interface hme0. Notice the double quotes around the IP address. The string contains periods and therefore must be quoted. The arguments passed to the IPAgent with the online command are: (nfs_ip1 hme0 192.168.1.201 <null> <null> 1 <null>).

The VCS engine passes the identical arguments to the IPAgent for online, offline, clean and monitor. It is up to the agent to use the arguments that it needs. This is a very key concept to understand later in the custom agent section.
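To illustrate the point, a skeletal script-based monitor entry point for a hypothetical IP-like resource type might look like the sketch below. The argument order follows the ArgList convention described above; the exit-code convention shown (100 for offline, 110 for online) is an assumption and should be confirmed in the VCS Agent Developer's Guide:

    #!/bin/sh
    # monitor: called as  monitor <resource name> <ArgList values...>
    RESNAME=$1     # resource name, e.g. nfs_ip1
    DEVICE=$2      # first ArgList entry (Device)
    ADDRESS=$3     # second ArgList entry (Address)

    # Consider the resource online if the address is configured on the device
    if ifconfig "$DEVICE" 2>/dev/null | grep "$ADDRESS" >/dev/null 2>&1 ; then
        exit 110   # online (assumed convention)
    else
        exit 100   # offline (assumed convention)
    fi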

The NameRule for the above example would provide a name of “IP_192.168.1.201”

6.3 Attributes

VCS components are configured using “attributes”. Attributes contain data regarding the cluster, systems, service groups, resources, resource types, and agents. For example, the value of a service group’s SystemList attribute specifies on which systems the group is configured, and the priority of each system within the group. Each attribute has a definition and a value. You define an attribute by specifying its data type and dimension. Attributes also have default values that are assigned when a value is not specified.

Data Type Description

String A string is a sequence of characters enclosed by double quotes. A string may also contain double quotes, but the quotes must be immediately preceded by a backslash. A backslash is represented in a string as \\. Quotes are not required if a string begins with a letter, and contains only letters, numbers, dashes (-), and underscores (_). For example, a string defining a network interface such as hme0 does not require quotes as it contains only letters and numbers. However a string defining an IP address requires quotes, such as: “192.168.100.1” since the IP contains periods.

Integer Signed integer constants are a sequence of digits from 0 to 9. They may be preceded by a dash, and are interpreted in base 10. In the example above, the number of times to retry the online operation of a DiskGroup is defined with an integer:

static int OnlineRetryLimit = 1

Page 41: vcs-refguide

Boolean A boolean is an integer, the possible values of which are 0 (false) and 1 (true). From the main.cf example above, SNMP is enabled by setting the Enabled attribute to 1 as follows:

Enabled = 1

Dimension Description

Scalar A scalar has only one value. This is the default dimension.

Vector A vector is an ordered list of values. Each value is indexed using a positive integer beginning with zero. A set of brackets ([]) denotes that the dimension is a vector. Brackets are specified after the attribute name on the attribute definition. For example, to designate a dependency between resource types specified in the service group list, and all instances of the respective resource type: Dependencies[] = { Mount, Disk, DiskGroup }

Keylist A keylist is an unordered list of strings, and each string is unique within the list. For example, to designate the list of systems on which a service group will be started with VCS (usually at system boot): AutoStartList = { sysa, sysb, sysc }

Association An association is an unordered list of name-value pairs. Each pair is separated by an equal sign. A set of braces ({}) denotes that an attribute is an association. Braces are specified after the attribute name on the attribute definition. For example, to designate the list of systems on which the service group is configured to run and the systems' priorities: SystemList = { sysa = 1, sysb = 2, sysc = 3 }

6.3.1 Type dependent attributes

Type dependent attributes are those attributes that pertain to a particular resource type. For example, the "DiskGroup" attribute is only relevant to the DiskGroup resource type. Similarly, the Address attribute pertains to the IP resource type.

Page 42: vcs-refguide

6.3.2 Type independent attributes

Type independent attributes are attributes that apply to all resource types. This means there is a set of attributes that all agents can understand, regardless of resource type. These attributes are coded into the agent framework when the agent is developed. Attributes such as RestartLimit and MonitorInterval can be set for any resource type. These type independent attributes must still be set on a per-resource-type basis, but the agent will understand the values and know how to use them.

6.3.3 Resource specific attributes

Resource specific attributes are those attributes that pertain to a given resource only. These are discrete values that define the "personality" of a given resource. For example, the IPAgent knows how to use an Address attribute, but actually setting an IP address is only done within a specific resource definition. Resource specific attributes are set in the main.cf file.

6.3.4 Type specific attributes

Type specific attributes refer to attributes that are set for all resources of a specific type. For example, setting MonitorInterval for the IP resource type affects all IP resources. This value would be placed in the types.cf file. In some cases, attributes can be placed in either location. For example, setting "StartVolumes = 1" in the DiskGroup types.cf entry would default StartVolumes to true for all DiskGroup resources. Placing the value in main.cf instead sets StartVolumes on a per-resource basis.

In the following examples of types.cf entries, we will document several methods to set type specific attributes.

In the example below, StartVolumes and StopVolumes are set in types.cf. This sets the default for all DiskGroup resources to automatically start all volumes contained in a disk group when the disk group is onlined. This is simply a default. If no value for StartVolumes or StopVolumes is set in main.cf, they will default to true.

type DiskGroup (
    static int NumThreads = 1
    static int OnlineRetryLimit = 1
    static str ArgList[] = { DiskGroup, StartVolumes, StopVolumes, MonitorOnly }
    NameRule = resource.DiskGroup
    str DiskGroup
    str StartVolumes = 1
    str StopVolumes = 1
)

Adding the required lines in main.cf will allow this value to be overridden. In the next excerpt, the main.cf is used to override the default type specific attribute with a resource specific attribute

DiskGroup shared_dg1 (
    DiskGroup = shared_dg1
    StartVolumes = 0
    StopVolumes = 0
)

Page 43: vcs-refguide

In the next example, changing the StartVolumes and StopVolumes attributes to static str disables main.cf from overriding.

type DiskGroup (
    static int NumThreads = 1
    static int OnlineRetryLimit = 1
    static str ArgList[] = { DiskGroup, StartVolumes, StopVolumes, MonitorOnly }
    NameRule = resource.DiskGroup
    str DiskGroup
    static str StartVolumes = 1
    static str StopVolumes = 1
)

6.3.5 Local and Global attributes

An attribute whose value applies to all systems is global in scope. An attribute whose value applies on a per-system basis is local in scope. The "at" operator (@) indicates the system to which a local value applies. An example of local attributes can be found in the MultiNICA resource type, where IP addresses and routing options are assigned on a per-machine basis.

MultiNICA mnic (
    Device@sysa = { le0 = "166.98.16.103", qfe3 = "166.98.16.103" }
    Device@sysb = { le0 = "166.98.16.104", qfe3 = "166.98.16.104" }
    NetMask = "255.255.255.0"
    ArpDelay = 5
    Options = "trailers"
    RouteOptions@sysa = "default 166.98.16.103 0"
    RouteOptions@sysb = "default 166.98.16.104 0"
)

7 NFS Sample Configurations

7.1 Two node asymmetric NFS cluster

The following section will walk through a basic two-node cluster exporting an NFS file system. The systems are configured as follows:

• Servers: ServerA and ServerB

• Storage: One disk group, shared_dg1

• File System: /home

• IP address: 192.168.1.3 nfs_IP

• Public interface: hme0

• ServerA is primary location to start the NFS_Group

Page 44: vcs-refguide

The resource dependency tree looks like the following example. Notice the IP address is brought up last. In an NFS configuration this is important, as it prevents the client from accessing the server until everything is ready. This will prevent unnecessary “Stale Filehandle” errors on the clients and reduce support calls.

7.1.1 Example main.cf file

Comments in the example are preceded with "#". Placing actual comments in the main.cf file is not possible, since the hacf utility will remove them when it parses the file.

include "types.cf"#Brings in all the standard type definitions cluster HA-NFS (# Names the cluster for management purposes

    UserNames = { veritas = cD9MAPjJQm6go }   # Required for GUI access. Added
                                              # manually or with the "hauser -add" command.
                                              # This is the encryption of the password "veritas".

)

system ServerA

system ServerB
# What systems are part of the entire "HA-NFS" cluster. You can add up to 32 nodes here.


Page 45: vcs-refguide

snmp vcs

# The following section will describe the NFS group. This group
# definition runs till end of file or till the next instance of the
# keyword "group".

group NFS_Group (                      # Begins the NFS_Group definition

    SystemList = { ServerA, ServerB }  # What systems within the cluster this service group (SG) will run on

    AutoStartList = { ServerA }        # What system the group will normally start on

    # Additional Service Group attributes can be found in the VCS 1.3.0 Users Guide.
    # By default, this service group will be a failover group and be enabled.

)
# The closing parenthesis above completes the definition of the main attributes of the service
# group itself.
# Immediately following this are the resource definitions for resources within the group as well as
# resource dependencies. The service group definition runs till end of file or the next instance of the
# keyword "group".

DiskGroup shared_dg1 (
    DiskGroup = shared_dg1
)
# Defines the disk group for the NFS_Group SG

IP nfs_ip (
    Device = hme0
    Address = "192.168.1.201"
)
# Defines the IP resource used to create the IP alias clients will use to access this SG

Mount home_mount (
    MountPoint = "/export/home"
    BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
    FSType = vxfs
    MountOpt = rw
)
# Defines the mount resource used to mount the file system

NFS nfs_16 ()
# This resource is an example of an "On Only" type resource. We need the NFS daemon, "nfsd",
# to run in order to export the file system later with the Share resource. In this case, VCS
# will start it if necessary, with the default number of threads (16), or monitor it if already running.
# VCS will not stop this resource.

NIC NIC_hme0 (
    Device = hme0
    NetworkType = ether
)
# This resource is an example of a "Persistent" resource. VCS requires it to be there in order to
# use it, but is not capable of starting or stopping it.

Page 46: vcs-refguide

Share home_share (
    PathName = "/export/home"
)
# This resource provides the NFS share to export the file system

nfs_ip requires home_share
# The High Availability IP is brought up last, following the share, to prevent NFS stale filehandles

nfs_ip requires NIC_hme0
# The High Availability IP also requires the NIC to place the IP alias on

home_mount requires shared_dg1
# The mount of the file system requires the disk group to be imported and started.
# We are not using a Volume resource because the DiskGroup resource starts all
# volumes and the likelihood of a volume within a disk group failing is minimal.

home_share requires nfs_16
# Exporting the file system via NFS requires the NFS daemons to be running

home_share requires home_mount
# Exporting the file system also requires the file system to be mounted.

7.2 Two node symmetrical NFS configuration

The following example will add a second NFS Service Group, NFS_Group2. This group will be configured to normally run on the second system in the cluster. The systems are configured as follows:

• Servers: ServerA and ServerB

• Storage: One disk group, shared_dg2

• File System: /source-code

• IP address: 192.168.1.4 code_IP

• Public interface: hme0

• ServerB is primary location to start the NFS_Group2

7.2.1 Example main.cf file

Comments in the example are preceded with "#". The second service group definition begins after the first and is preceded with the keyword "group".

include "types.cf"cluster HA-NFS (

UserNames = veritas = cD9MAPjJQm6go )

system ServerA

Page 47: vcs-refguide

system ServerB

snmp vcs

group NFS_Group (
    SystemList = { ServerA, ServerB }
    AutoStartList = { ServerA }
)

DiskGroup shared_dg1 (
    DiskGroup = shared_dg1
)

IP nfs_ip (
    Device = hme0
    Address = "192.168.1.201"
)

Mount home_mount (
    MountPoint = "/export/home"
    BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
    FSType = vxfs
    MountOpt = rw
)

NFS nfs_16 ()

NIC NIC_hme0 (
    Device = hme0
    NetworkType = ether
)

Share home_share (
    PathName = "/export/home"
)

nfs_ip requires home_share
nfs_ip requires NIC_hme0
home_mount requires shared_dg1
home_share requires nfs_16
home_share requires home_mount

# Now we can begin the second service group definition

group NFS_Group2 (
    SystemList = { ServerA, ServerB }
    AutoStartList = { ServerB }
)

DiskGroup shared_dg2 (
    DiskGroup = shared_dg2
)
# Note the second VxVM DiskGroup. A disk group may only exist in a single failover
# service group, so a second disk group is required.

Page 48: vcs-refguide

IP code_IP (
    Device = hme0
    Address = "192.168.1.4"
)

Mount code_mount (
    MountPoint = "/export/sourcecode"
    BlockDevice = "/dev/vx/dsk/shared_dg2/code_vol"
    FSType = vxfs
    MountOpt = rw
)

# Resource names must be unique within the cluster, so the second group defines its
# own NFS, NIC and Share resources rather than reusing those from NFS_Group.
NFS nfs2_16 ()

NIC NIC2_hme0 (
    Device = hme0
    NetworkType = ether
)

Share code_share (
    PathName = "/export/sourcecode"
)

code_IP requires code_share
code_IP requires NIC2_hme0
code_mount requires shared_dg2
code_share requires nfs2_16
code_share requires code_mount

7.3 Special storage considerations for NFS Service

NFS servers and clients use the concept of a "filehandle". This concept is based on an NFS design principle that the client is unaware of the underlying layout or architecture of an NFS server's file system. When a client wishes to access a file, the server responds with a filehandle. This filehandle is used for all subsequent access to the file. For example, a client has the /export/home file system NFS mounted on /home and is currently in /home/user. The client system wishes to open the file /home/user/letter. The client NFS process issues an NFS lookup procedure call to the server for the file letter. The server responds with a filehandle that the client can use to access letter.

This filehandle is considered an "opaque data type" to the client. The client has no visibility into what the filehandle contains; it simply knows to use this handle when it wishes to access letter. To the server, the filehandle has a very specific meaning: it encodes all information necessary to access a specific piece of data on the server. Typical NFS filehandles contain the major and minor number of the file system, the inode number and the inode generation number (a sequential number assigned when an inode is allocated to a file; this prevents a client from mistakenly accessing, by inode number, a file that has been deleted and whose inode has been reused for a new file). The NFS filehandle describes one unique file on the entire server. If a client accesses the server using a filehandle that does not work, such as a major or minor number that differs from what is available on the server, or an inode number where the inode generation number is incorrect, the server will reply with a "Stale NFS filehandle" error.

Page 49: vcs-refguide

Many sites have seen this error after a full restore of an NFS exported file system. In this scenario, the files from a full file-level restore are written in a new order, with new inode and inode generation numbers for all files, and all clients must unmount and re-mount the file system to receive new filehandle assignments from the server.

Rebooting an NFS server has no effect on an NFS client other than an outage while the server boots. Once the server is back, the client mounted file systems are accessible with the same file handles.

From a cluster perspective, a file system failover must look exactly like a very rapid server reboot. In order for this to occur, a filehandle valid on one server must point to the identical file on the peer server. Within a given file system located on shared storage this is guaranteed as inode and inode generation must match since they are read out of the same storage following a failover. The problem exists with major and minor numbers used by Unix to access the disks or volumes used for the storage. From a straight disk perspective, different controllers would use different minor numbers. If two servers in a cluster do not have exactly matching controller and slot layout, this can be a problem.

This problem is greatly mitigated through the use of VERITAS Volume Manager. VxVM abstracts the data from the physical storage. In this case, the Unix major number is a pointer to VxVM and the minor number to a volume within a disk group. Problems arise in two situations. The first is differing major numbers. This typically occurs when the VxVM, VxFS and VCS are installed in different orders. Both VxVM and LLT/GAB use major numbers assigned by Solaris during software installation to create device entries. Installing in different orders will cause a mismatch in major number. Another cause of differing major numbers is different packages installed on each system prior to installing VxVM. Differing minor numbers within VxVM setup is rare and usually only happens when a server has a large number of local disk groups and volumes prior to beginning setup as a cluster peer.

Before beginning VCS NFS server configuration, verify file system major and minor numbers match between servers. On VxVM this will require importing the disk group on one server, checking major and minor, deporting the disk group then repeating the process on the second server.
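A minimal sketch of that check, assuming the illustrative disk group and volume names used in the Oracle example later in this guide (ora_prod_dg and u01-vol):

# On the first server, with the disk group imported:
ls -lL /dev/vx/dsk/ora_prod_dg/u01-vol    # note the major,minor pair in the output
# Deport the disk group, import it on the second server, and repeat:
ls -lL /dev/vx/dsk/ora_prod_dg/u01-vol    # the major,minor pair must match the first server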

If any problems arise, refer to the VCS Installation Guide, Preparing NFS Services.


8 Oracle sample configurations

The following examples show a two-node cluster running a single Oracle instance in an asymmetrical configuration and a two-instance symmetrical configuration. They also show the required changes to the Oracle configuration files such as listener.ora and tnsnames.ora.

8.1 Oracle setup

As described in section 5.3, the best method to configure a complex application like Oracle is to first configure one system to run Oracle properly and test it. After a successful test of the database on one server, import the shared storage and configure the second system identically. The most common configuration mistakes in VCS Oracle setups involve system configuration files. On Unix these are typically /etc/system, /etc/passwd, /etc/shadow and /etc/group. On NT systems, (insert NT/Oracle setup here)

Oracle must also be configured to operate in the cluster environment. The main Oracle setup task is to ensure all data required by the database resides on shared storage. During failover the second server must be able to access all table spaces, data files, logs, etc. The Oracle listener must also be modified to work in the cluster. The files typically requiring changes are $ORACLE_HOME/network/admin/tnsnames.ora and $ORACLE_HOME/network/admin/listener.ora. These files must be modified to use the hostname and IP address of the virtual server rather than a particular physical server. Remember to take this into account during Oracle setup and testing. If you use the physical address of a server, the listener configuration files must be changed during testing on the second server. If you use the high availability IP address selected for the Oracle service group, you will need to manually configure this address on each machine during testing.
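A minimal sketch of configuring the high availability address by hand during such testing, assuming Solaris, the hme0 interface and the 192.168.1.6 address used in the example below (the netmask is illustrative):

# Bring the service group address up manually while testing Oracle outside VCS control
ifconfig hme0:1 plumb
ifconfig hme0:1 inet 192.168.1.6 netmask 255.255.255.0 up
# ... test the database and listener ...
# Remove the address before handing it back to VCS
ifconfig hme0:1 unplumb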

8.2 Oracle Enterprise Agent installation To control Oracle in a VCS environment, the customer must purchase and install the VCS Enterprise Agent for Oracle. This package actually contains two agents, the OracleAgent to control the Oracle database and the SqlnetAgent to control the Oracle listener. Follow the instructions in the Enterprise Agent for Oracle Installation and Configuration Guide for details.

8.3 Single instance configuration

The following example shows a single instance asymmetric failover configuration for Oracle 8i. The configuration assumes the following system configuration:

• Cluster HA-Oracle

• Servers: ServerA and ServerB

• Service group ORA_PROD_Group


• Storage: One disk group, ora_prod_dg

• File Systems: /u01 and /u02

• IP address: 192.168.1.6 PROD_IP

• Public interface: hme0

• ServerA is the primary location to start the ORA_PROD_Group

• The Listener starts before the Oracle database to allow Multi Threaded Server usage.

• DNS mapping for 192.168.1.6 maps to host “prod-server”

The resource dependency tree for the group follows directly from the “requires” statements at the end of the main.cf example below.

8.3.1 Example main.cf

include "types.cf"
include "OracleTypes.cf"

cluster HA-Oracle (
        UserNames = { root = cD9MAPjJQm6go }
        )

system ServerA

system ServerB



snmp vcs

group ORA_PROD_Group (
        SystemList = { ServerA, ServerB }
        AutoStartList = { ServerA }
        )

        DiskGroup PROD_DG (
                DiskGroup = ora_prod_dg
                )

        IP PROD_IP (
                Device = hme0
                Address = "192.168.1.6"
                )

        Mount PROD_U01 (
                MountPoint = "/u01"
                BlockDevice = "/dev/vx/dsk/ora_prod_dg/u01-vol"
                FSType = vxfs
                MountOpt = rw
                )

        Mount PROD_U02 (
                MountPoint = "/u02"
                BlockDevice = "/dev/vx/dsk/ora_prod_dg/u02-vol"
                FSType = vxfs
                MountOpt = rw
                )

        NIC NIC_prod_hme0 (
                Device = hme0
                NetworkType = ether
                )

        Oracle ORA_PROD (
                Critical = 1
                Sid = PROD
                Owner = oracle
                Home = "/u01/oracle/product/8.1.5"
                Pfile = "/u01/oracle/admin/pfile/initPROD.ora"
                )

        Sqlnet PROD_Listener (
                Owner = oracle
                Home = "/u01/oracle/product/8.1.5"
                TnsAdmin = "/u01/oracle/network/admin"
                Listener = LISTENER_PROD
                )

        Volume PROD_Vol1 (
                Volume = "u01-vol"
                DiskGroup = "ora_prod_dg"
                )


        Volume PROD_Vol2 (
                Volume = "u02-vol"
                DiskGroup = "ora_prod_dg"
                )

        PROD_Vol1 requires PROD_DG
        PROD_Vol2 requires PROD_DG
        PROD_U01 requires PROD_Vol1
        PROD_U02 requires PROD_Vol2
        PROD_IP requires NIC_prod_hme0
        PROD_Listener requires PROD_Vol1
        PROD_Listener requires PROD_Vol2
        PROD_Listener requires PROD_IP
        ORA_PROD requires PROD_Listener
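When main.cf is edited by hand, a quick syntax check before starting the cluster can catch typos early. A minimal sketch, assuming the standard configuration directory:

# Verify the configuration files in /etc/VRTSvcs/conf/config
hacf -verify /etc/VRTSvcs/conf/config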

8.3.2 Oracle listener.ora configuration

LISTENER_PROD =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(Host = prod-server)(Port = 1521))
  )

SID_LIST_LISTENER_PROD =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = db01.)
      (ORACLE_HOME = /u01/oracle/product/8.1.5)
      (SID_NAME = PROD)
    )
    (SID_DESC =
      (SID_NAME = extproc)
      (ORACLE_HOME = /u01/oracle/product/8.1.5)
      (PROGRAM = extproc)
    )
  )

STARTUP_WAIT_TIME_LISTENER_PROD = 0
CONNECT_TIMEOUT_LISTENER_PROD = 10
TRACE_LEVEL_LISTENER_PROD = OFF

8.4 Adding deep level testing

Deep level testing gives VCS the ability to test Oracle and the Listener from closer to a real user perspective. The OracleAgent will log into the database and write data to a table, logout, log back in and test that it can read from the same table. The SqlnetAgent will test that it can actually connect to the listener and access the database.

8.4.1 Oracle changes

To configure deep level testing of the database, a low privilege user must be defined that can create and modify a table. The following is documented in the SqlTest.pl file in the $VCS_HOME/bin/Oracle directory.


The following test updates a column "tstamp" with the latest value of the Oracle internal function SYSDATE.

A prerequisite for this test is that a user/password/table has been created before enabling the script by defining the VCS attributes User/Pword/Table/MonScript for the Oracle resource.

This task can be accomplished by the following SQL statements as DB-admin:

SVRMGR> connect internal
SVRMGR> create user <User>
     2> identified by <Pword>
     3> default tablespace USERS
     4> temporary tablespace USERS
     5> quota 100K on USERS;

USERS is the tablespace name present in all standard Oracle installations.

It might be replaced by any other tablespace for the specific installation.

(To get a list of valid tablespaces use: select * from sys.dba_tablespaces;)

SVRMGR> grant create session to <User>;
SVRMGR> create table <User>.<Table> ( tstamp date );
SVRMGR> insert into <User>.<Table> ( tstamp ) values ( SYSDATE );

The name of the column "tstamp" should match the one in the update statement below.

To test DB-setup use:

SVRMGR> disconnect
SVRMGR> connect <User>/<Pword>
SVRMGR> update <User>.<Table> set ( tstamp ) = SYSDATE;
SVRMGR> select TO_CHAR(tstamp, 'MON DD, YYYY HH:MI:SS AM') tstamp
     2> from <User>.<Table>;
SVRMGR> exit

If you received the correct timestamp, the in-depth testing can be enabled.


8.4.2 VCS Configuration changes

To enable VCS to perform deep level Oracle testing, you must define the Oracle user, password and table used for testing. The following is an example of the modifications to main.cf for the Oracle and Sqlnet resources:

Oracle ORA_PROD (
        Critical = 1
        Sid = PROD
        Owner = oracle
        Home = "/u01/oracle/product/8.1.5"
        Pfile = "/u01/oracle/admin/pfile/initPROD.ora"
        User = "testuser"
        Pword = "vcstest"
        Table = "USERS"
        MonScript = "/opt/VRTSvcs/bin/Oracle/SqlTest.pl"
        )

Sqlnet PROD_Listener (
        Owner = oracle
        Home = "/u01/oracle/product/8.1.5"
        TnsAdmin = "/u01/oracle/network/admin"
        Listener = LISTENER_PROD
        MonScript = "/opt/VRTSvcs/bin/Sqlnet/LsnrTest.pl"
        )
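These attributes can also be set on a running cluster rather than by editing main.cf directly. A minimal sketch using the standard VCS command line, with the illustrative values from the example above:

haconf -makerw
hares -modify ORA_PROD User testuser
hares -modify ORA_PROD Pword vcstest
hares -modify ORA_PROD Table USERS
hares -modify ORA_PROD MonScript "/opt/VRTSvcs/bin/Oracle/SqlTest.pl"
hares -modify PROD_Listener MonScript "/opt/VRTSvcs/bin/Sqlnet/LsnrTest.pl"
haconf -dump -makero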

8.5 Multiple Instance configuration

9 Administering VCS

9.1 Starting and stopping

9.2 Modifying the configuration from the command line

9.3 Modifying the configuration using the GUI

9.4 Modifying the main.cf file

9.5 SNMP

10 Troubleshooting

11 VCS Daemons and Communications

The following section describes VCS node-to-node communications, that is, the communications used by VCS to maintain cluster operation. This section does not discuss public communications for client access.

VCS is a replicated state machine. This requires two basic forms of information: all nodes are constantly aware of who their peers are (Cluster Membership) as well as the exact state of resources on the peers (Cluster State). This requires constant communications between nodes in a cluster.

The drawing below shows a general overview of VCS communications. On each cluster node, agents monitor the status of resources. The agents communicate status of resources to the High Availability Daemon (HAD). HAD communicates status of all resources on the local system to all other systems via the Group Membership Services/Atomic Broadcast (GAB) protocol. GAB uses the underlying Low Latency Transport (LLT) to communicate reliably between servers. HAD, GAB and LLT will be discussed at length in the sections below.

11.1 HAD

The High Availability Daemon, or “HAD”, is the main VCS daemon running on each system. HAD collects all information about resources running on the local system and forwards the information to all other systems in the cluster. It also receives information from all other cluster members to update its own view of the cluster.

[Figure: VCS communications on Node A and Node B. On each node, agents report through the agent framework to had; had exchanges cluster information with its peer via GAB, which runs over LLT.]


11.2 HASHADOW

hashadow runs on each system in the cluster and is responsible for monitoring and, if necessary, restarting the had daemon. HAD monitors hashadow as well and restarts it if necessary.

11.3 Group Membership Services/Atomic Broadcast (GAB)

The Group Membership Services/Atomic Broadcast protocol, abbreviated “GAB”, is responsible for the Cluster Membership and Cluster State communications described below.

11.4 Cluster membership At the highest level, one would assume cluster membership to mean all systems configured by the installer to operate as a cluster. In VCS, a cluster membership generally refers to all systems configured with the same “Cluster-ID” and interconnected via a pair of redundant heartbeat networks. The section on LLT discusses configuring LLT Cluster-ID. This means that under normal operation all systems configured as part of the physical cluster during system install will be actively participating in cluster communications.

In order to maintain a complete picture of the exact status of all resources and groups on all nodes, VCS must be constantly aware of which nodes are currently participating in the cluster. While this may sound like an over-simplification, realize that at any time nodes can be rebooted, powered off, faulted, added to the cluster, etc. VCS uses its cluster membership capability to dynamically track what the overall cluster topology looks like.

Systems join a cluster by issuing a “Cluster Join” message during GAB startup. Cluster membership is maintained via the use of “heartbeats”. Heartbeats are signals that are sent periodically from one system to another to verify that the systems are active. Heartbeats over the network are handled by the LLT protocol and disk heartbeats by the GABDISK utility (see section 4.8 for an explanation of GABDISK). When systems no longer receive heartbeat messages from a peer for an interval set by “Heartbeat Timeout” (see the Communications FAQ), that peer is marked DOWN and excluded from the cluster. Its applications are then migrated to the other systems.

11.5 Cluster State

Cluster State refers to tracking the status of all resources and groups in the cluster. This is the function of the “Atomic Broadcast” capability of GAB. Atomic Broadcast ensures all systems within the cluster are immediately notified of changes in resource status, cluster membership, and configuration. Atomic means that either all systems receive an update or all are “rolled back” to the previous state, much like a database atomic commit. If a failure occurs while transmitting status changes, GAB’s atomicity ensures that, upon recovery, all systems will have the same information regarding the status of any monitored resource in the cluster. The broadcast messaging service employs a two-phase commit protocol to deliver messages atomically to all surviving members of a group in the presence of node failures.

11.6 LLT LLT (Low Latency Transport) provides fast, kernel-to-kernel communications, and monitors network connections. LLT functions as a replacement for the IP stack on systems. LLT runs directly on top of the Data Link Protocol Interface (DLPI) layer on UNIX, and the Network Driver Interface Specification (NDIS) on Windows NT. This ensures that events such as state changes are reflected more quickly, which in turn enables faster responses.

LLT has several major functions:

• Traffic Distribution. LLT distributes (load balances) inter-node communication across all available private network links. This means all cluster state information is evenly distributed across all private (up to 8) network links for performance and fault resilience. On failure of a link, traffic is redirected to remaining links.

• Heartbeat. LLT is responsible for sending and receiving heartbeat traffic over network links. The frequency of heartbeats can be set in the /etc/llttab file. Heartbeats are used to determine the health of nodes in the cluster.

• Reliable vs. unreliable communication notification. LLT informs GAB whether communications to a peer are “reliable” or “unreliable”. A peer connection is said to be reliable if more than one network link exists between the peers. LLT monitors multiple links and routes network traffic over the surviving links. For example, if two fully connected independent networks exist between nodes A, B and C, and one network interface card on node C fails, LLT on nodes A and B will route traffic to C over the remaining interface card while multiplexing traffic to each other over both networks. Nodes A and B have a reliable connection with each other, and an unreliable connection to Node C. Node C would have an unreliable connection to Nodes A and B. For the reliable designation to have meaning, it is critical that the networks used fail independently. LLT supports multiple independent links between systems. Using different interfaces and connecting infrastructure decreases the chance that two links will fail at the same time, increasing overall reliability.

[Figure: GAB atomic broadcast. A broadcaster delivers a globally ordered broadcast input stream to Node A, Node B, Node C and Node D.]

11.7 Low Priority Link

LLT can be configured to use a low priority network link as a backup to normal heartbeat channels. Low priority links are typically configured on the customer’s public network or administrative network. The low priority link is not used for cluster status traffic until it is the only remaining link. In normal operation, the low priority link carries only heartbeat traffic for cluster membership and link state maintenance. The frequency of heartbeats is reduced to 20% of normal to reduce network overhead. When the low priority link is the only remaining network link, LLT will switch all cluster status traffic over to it as well. Upon repair of any configured private link, LLT switches cluster status traffic back to the high priority link.

11.8 LLT Configuration LLT is configured with a simple text file called /etc/llttab on Unix systems and %VCS_ROOT%\comms\llt\llttab.txt on Windows NT. During system install, an example llttab file is placed in /opt/VRTSllt/llttab on Unix systems and %VCS_ROOT%\comms\llt\default.llttab.txt on Windows NT. This file documents all possible settings that can be used in an llttab file. VCS versions after 1.1 require the use of an llthosts file in addition to llttab.

11.8.1 LLT configuration directives The following are the possible directives that can be used in an llttab file.

Standard llttab directives

set-cluster: Assigns a unique cluster number. Use this directive when more than one cluster is configured on the same physical network connection. The “same physical network” refers to all private links as well as any configured low-priority public links. If there is any chance, no matter how slight, that multiple independent clusters can ever be seen on the same physical network, a unique cluster-id MUST be set. VERITAS recommends always setting unique cluster-ids for each cluster. Possible values range from 0 to 255; LLT uses a default cluster number of zero. The format is "set-cluster XX", where XX is an integer between 0 and 255. Example:

set-cluster 10

set-node: Assigns the system ID. This number must be unique for each system in the cluster, and must be in the range 0-31. Note that LLT fails to operate if any systems share the same ID and will cause system panics if duplicates are encountered. The node ID can be set in three ways; you may enter a number, a name, or a filename.

• A number is taken literally as a node ID.
• A name is translated to a node ID via the /etc/llthosts file.
• A filename will take the first word in the file and translate it via /etc/llthosts to a node ID.

The following examples show the possible methods of setting the system ID.

set-node 1

This uses a direct number to set the system ID to 1. The value must be between 0 and 31.

set-node system1

This method will extract the value associated with “system1” from /etc/llthosts.

set-node /etc/nodename

This method will extract the first word from the file /etc/nodename and translate that value via /etc/llthosts. All VCS versions from 1.2 and higher require the use of the llthosts file, regardless of the method used to set the system ID. The system’s hostname, or the value set in /etc/sysname, must match a valid name in llthosts.

link: Assigns a network interface for LLT use. The format is:

link tag-name device-name:device-unit node-range link-type SAP MTU

• Tag-name: A symbolic name used to reference this link in set-addr commands and lltstat output.
• Device-name:device-unit: The DLPI STREAMS device for the LAN interface, and the unit number on that device.
• Node-range: The range of nodes that should process this command. A dash '-' is the default for "all nodes". This is useful to use the same file on multiple nodes that have differing hardware.
• Link-type: The type of network. Currently supported value: ether.
• SAP: The Service Access Point (SAP) used to bind to the network link. A dash '-' is the default. If multiple clusters share the same network infrastructure, each cluster MUST have a unique cluster ID or each cluster must use a different SAP for LLT communications. For ease of administration, VERITAS recommends using the default SAP and setting unique cluster ID numbers.
• MTU: The maximum transmission size for packets on the network link. A dash '-' is the default.

Examples:

Solaris example
link qfe0 /dev/qfe:0 - ether - -
link hme1 /dev/hme:1 - ether - -

HP/UX example
link lan0 /dev/dlpi:0 - ether - -
link lan1 /dev/dlpi:1 - ether - -

link-lowpri: Creates a low priority link for LLT use. The low priority link is used for heartbeat only until it is the last remaining link. At this time, cluster status is placed on the low-priority link until a regular heartbeat is restored. See the VCS Communications section for more detail on low priority links. All fields after the “link-lowpri” directive are identical to a standard link.

Examples:

Solaris
link-lowpri qfe3 /dev/qfe:3 - ether - -

HP/UX
link-lowpri lan3 /dev/dlpi:3 - ether - -

start: Starts LLT. This line should appear as the last line in /etc/llttab.


Additional Options

set-verbose: To enable verbose messages from lltconfig to the console and syslog, add this line first in llttab. This allows better troubleshooting of LLT configuration issues, but increases logging significantly. Example:

set-verbose 1

include / exclude: The include and exclude options are used to specify a range of valid nodes in the VCS cluster. The default is all nodes included (0-31). See /kernel/drv/llt.conf "nodes=nnn" for the maximum. These directives are useful to limit the output of the lltstat command to only those nodes configured in the cluster. The exclude statement is used to block a range of node numbers from cluster participation. For example, the following will cause only nodes 0-7 to be valid for cluster participation:

exclude 8-31

The include statement is somewhat redundant, as nodes 0-31 are already included. However, it can be used to “re-include” a small range of nodes if all are first excluded. For example, the following two lines would enable nodes 12-16 only:

exclude 0-31
include 12-16

Heartbeat broadcast configuration directives

By default, LLT uses a broadcast packet for heartbeat. The actual heartbeat is a broadcast Address Resolution Protocol response packet that contains the cluster ID and node ID in the address field. In this manner, each node automatically learns the MAC address of each of its peers for the point-to-point communications used by GAB and does not have to generate any additional network traffic to “learn” the MAC addresses required. This configuration is the default: broadcast heartbeat is enabled (set-bcasthb 1), Address Resolution is disabled since every packet is an Address Resolution response and no additional Address Resolution requests are necessary (set-arp 0), and no manual MAC addresses are assigned.


For network architectures that do not support a broadcast mechanism, Broadcast Heartbeat as well as Address Resolution is not possible. In this instance, MAC addresses for each heartbeat interface on each system must be assigned in the llttab file with the set-addr directive. (Note: VERITAS Customer Support will only support heartbeat networks on 802.3 Ethernet type networks. Use of other types of networks for heartbeat is discouraged.) For situations where broadcast is possible, but the customer wishes to limit its use, it is possible to disable Broadcast Heartbeat (set-bcasthb 0) but still use limited broadcast for Address Resolution (set-arp 1). The use of non-broadcast heartbeats is a non-standard configuration. VERITAS recommends two dedicated, private networks for heartbeat use; broadcast traffic is then constrained to just those systems that require the information. Using unicast heartbeats requires acknowledgement packets to be sent between systems, effectively doubling network traffic.

set-bcasthb: This directive can disable the use of broadcast heartbeats:

set-bcasthb 0

Using this configuration requires manual MAC address configuration or the ability to use ARP. Setting "set-arp 0" and not setting MAC addresses will disable LLT.

set-arp: The set-arp directive is used to enable or disable the use of the Address Resolution Protocol for determining the MAC address of peer nodes. It is disabled by default. In order to disable broadcast heartbeats, this option must be enabled or MAC addresses must be manually set. The use of broadcast heartbeats as well as the ARP feature is only supported on network architectures that support MAC-level broadcast, such as Ethernet. To enable ARP, set the following directive:

set-arp 1

set-addr: Used to set MAC addresses manually for networks that do not support broadcast for address resolution or where broadcast is not desired due to customer requirements. It should be noted that manually setting MAC addresses is prone to human error and also causes difficulty when network interface cards are changed. Each link for each system in the cluster must be set. Format:

set-addr node-id tag-name address

# set address for node 2 on link le0
set-addr 2 le0 01:02:03:01:02:03

# set address for node 2 on link lan0
set-addr 2 lan0 01:02:03:01:02:03

Advanced Options (do not modify unless directed by VERITAS Customer Support)

set-timer: Sets the frequency of LLT heartbeats on private or low-pri links. This value is expressed in 1/100 second. Examples:

# Send a heartbeat 2 times per second
set-timer heartbeat:50

# Send a heartbeat 1 time per second (for link-lowpri links)
set-timer heartbeatlo:100

Setting the peer timeout. Example: mark a link to a peer down after 16 seconds of missed heartbeats (peerinact must be larger than either heartbeat timer):

set-timer peerinact:1600

Other timers and flow control: the following are for development use only and should not be modified.

set-timer oos:10
set-timer retrans:10
set-timer service:100
set-timer arp:30000
set-flow lowater:40
set-flow hiwater:80
set-flow window:60


11.8.2 Example LLT configuration

The example llttab file is configured with the following:

• 5 nodes

• Private heartbeat on hme0 and qfe1

• Low priority heartbeat on public network on qfe0

• Cluster id set to 10. (Any cluster in a customer environment must have a unique cluster id. Default is 0. It is recommended to always hard set a unique cluster id to prevent problems in the future as more clusters are added.)

• Llthosts file used

## /etc/llttab
set-cluster 10
## Sets this cluster ID to 10. Can be anything from 0-255.
## Required if any cluster can ever "see" another on any interface.
set-node /etc/nodename
## Needs the /etc/llthosts file, required in 1.3 and above.
## Format for llthosts is "node number <white space> hostname".
link hme0 /dev/hme:0 - ether - -
link qfe1 /dev/qfe:1 - ether - -
## High pri links are: link unique_name device - ether - -
link-lowpri qfe0 /dev/qfe:0 - ether - -
## Sets up your low pri link.
start

11.8.3 Example llthosts file

## /etc/llthosts:
1 pinky
2 brain
3 yakko
4 whacko
5 dot
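A minimal sketch of verifying the LLT layer once llttab and llthosts are in place (lltstat is the standard LLT status utility; output formats vary by version):

# Show the state of each configured node as seen by LLT on this system
lltstat -n
# Show detailed per-link status for all nodes
lltstat -nvv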


11.9 GAB configuration

GAB is configured with a simple text file, /etc/gabtab. The file contains only one line used to start GAB. The line contains the command to start GAB, /sbin/gabconfig -c -n XX. The value for "XX" is the total number of systems in the cluster. For example, for a five node cluster, place the following line in /etc/gabtab:

/sbin/gabconfig -c -n 5

11.10 Disk heartbeats (GABDISK) Disk heartbeats offer yet another way to improve cluster resiliency by allowing a heartbeat to be placed on a physical disk shared by all systems in the cluster. It uses two small, dedicated regions of a physical disk. It has two important limitations:

• Max cluster size is limited to 8 nodes

• A dedicated disk should be used to prevent performance issues. Gabdisk is fairly chatty and will adversely affect performance of applications accessing the disk. At the same time, heavy application access to a disk used for GAB could cause heartbeat timeouts.

11.10.1 Configuring GABDISK Please see the VCS Installation Guide for a detailed description on configuring membership regions on disk.

11.11 The difference between network and disk channels As mentioned earlier, communications between VCS nodes in a cluster are one of two types:

• Cluster Membership. This is a simple unacknowledged broadcast to all nodes in the cluster that basically says, “I am node X and I am here”. This is the basic function of “heartbeat”. Individual cluster nodes track other cluster nodes via the heartbeat mechanism. Network and Disk heartbeat channels can be used for cluster membership

• Cluster State. Cluster status requires considerably more information to be passed between nodes than cluster membership. Cluster status can only be transmitted on Network heartbeat connections. Disk heartbeat channels cannot carry cluster state.


11.12 Jeopardy, Network Partitions and Split-Brain Under normal circumstances, when a VCS node ceases heartbeat communication with its peers, the peers assume it has failed. This could be due to a power loss or a system crash. At this time a new regular membership is issued that excludes the departed system. A designated system in the cluster would then take over the service groups running on the departed system, assuring application availability.

The problem is that heartbeats can also fail due to network failures. If all network connections between any two groups of systems fail at the same time, you have a network partition. In this condition, systems on both sides of the partition may restart applications from the other side resulting in duplicate services, also called “split-brain”. The worst problem resulting from a network partition involves the use of data on shared disks.

If both systems were to provide the same service by updating the same data without coordination, data will become corrupted.

The design of VCS requires that a minimum of two heartbeat capable channels be available between cluster nodes to provide adequate protection against network failure. When a node is down to a single heartbeat connection, VCS can no longer reliably discriminate between loss of a system and loss of the last network connection. It must then handle the loss of communications over a single remaining network differently from the multi-network case. This handling is called jeopardy.

Recall that LLT provides notification of reliable vs. unreliable network communications up to GAB. GAB uses this information, along with the presence or lack of a functional disk heartbeat to make intelligent choices on cluster membership. If a system’s heartbeats are lost simultaneously across all channels, VCS determines that the system has failed. The services running on that system are then restarted on another. However, if prior to loss of heartbeat from a node the node was only running with one heartbeat (in jeopardy), VCS will NOT restart the applications on a new node. This action of disabling failover is a safety mechanism to prevent data corruption.

Jeopardy membership is a strange concept to grasp. A system can be placed in a jeopardy membership on two conditions:

• The system has only one functional network heartbeat and no disk heartbeat. In this situation, the node is a member of the regular membership and the jeopardy membership. Being in a regular membership and jeopardy membership at the same time changes the failover-on-system-fault behaviour only. All other cluster functions remain; failover due to a resource fault or switchover of service groups at operator request is unaffected. The only change is disabling other systems from assuming service groups on system fault. To state it as documented in the VCS users guide: VCS continues to operate as a single cluster when at least one network channel exists between the systems. However, when only one channel remains, failover due to system failure is disabled. Even after the last network connection is lost, VCS continues to operate as partitioned clusters on each side of the failure.

• The system has no network heartbeat and only a disk heartbeat. As mentioned above, disk heartbeats are not capable of carrying Cluster Status. In this case, the node is excluded from the regular membership since it is impossible to track status of resources on the node and it is placed in a jeopardy membership only. Failover on resource fault or operator initiated switchover is disabled. VCS prevents any actions taken on any service group that were running on the departed system since it is impossible to ascertain the status of resources on the system with just disk heartbeat. Reconnecting the network without stopping VCS and GAB will result in one or more systems halting.

The two situations above mentioned another concept, that of excluding nodes from the regular membership. This brings up another situation where the cluster splits into “mini clusters”. When a final network connection is lost, the systems on each side of the network partition do not stop, they instead segregate into mini-clusters. Each cluster continues to operate and provide services that were running; however failover of any service group to or from the opposite side of the partition is disabled. This design enables administrative services to operate uninterrupted; for example, you can use VCS to shut down applications during system maintenance. Once the cluster is split, reconnecting the private network must be undertaken with care. As stated in the VCS users guide:

If the private network has been disconnected, you must shutdown VCS before reconnecting the systems. Failure to do so results in one or more systems being halted until only the larger of the previously disconnected mini-clusters remains. Halting the systems protects the integrity of shared storage when network connections become unstable. In such an environment, the data on shared storage may already be corrupted by the time the network connections are stabilized.

Reconnecting a private network after a cluster has been segregated causes systems to be halted via a call to kernel panic. There are several rules that determine which systems will halt.

• On a two node cluster, the system with the lowest LLT host ID will stay running and the higher will halt

• In a multinode cluster, the largest running group will stay running. The smaller group(s) will be halted

• On a multinode cluster splitting into two equal size clusters, the cluster with the lowest node number present will stay running.


11.13 VCS 1.3 GAB changes

VCS 1.3 has changed the default behaviour of a cluster when systems are reconnected after heartbeat has been lost. The default for VCS 1.1.2 was termed “Halt on Rejoin”. This caused selected systems (which systems are selected is described above) to panic on reconnect. This behaviour could be disabled by starting GAB with “gabconfig -r”. The default behaviour for VCS 1.3 is to not cause a panic on reconnect, but to restart the HA daemons, which is identical to the earlier gabconfig -r option. This will still cause an interruption in HA services, as all Service Groups are shut down and HAD is restarted on the affected systems.

11.14 Example Scenarios The following example scenarios will detail the possible situations surrounding heartbeat problems.

• A 4-node cluster is operating with two private network heartbeat connections, no low priority link and no disk heartbeat. In normal configuration, both private links are load balancing cluster status and both links carry heartbeat. The figure below shows basic VCS communications configuration.

o Now a link to node C fails. This places Node C in an unreliable communications state, as there is only one possible heartbeat. A new cluster membership is issued with nodes A, B, C and D in the regular membership and node C in a jeopardy membership. All normal cluster operations continue, including normal failover of service groups due to resource fault. The figure below shows communications configuration.

[Figure: four-node cluster A, B, C, D with two private heartbeat networks and the public network. Regular membership: A, B, C, D.]


o Same configuration as the first, but now node C fails due to a power fault. All other systems recognize that the node has faulted. In this situation, a new membership is issued for nodes A, B and D as regular members and no jeopardy membership. No further action is taken at this point. Since node C was in a jeopardy membership, any service group that was running on node C is “AutoDisabled” so no other node will attempt to assume ownership of these service groups. If the node has actually failed, the system administrator can clear the AutoDisabled flag on the service groups in question and online the groups on other systems in the cluster, as shown in the sketch below. This is an example of VCS taking the safest possible choice in a situation where it cannot be positive about the status of resources on a system. The system administrator, by clearing the AutoDisabled flag, informs VCS that the node is actually down.
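A minimal sketch of that manual intervention, borrowing the ORA_PROD_Group service group name from the earlier Oracle example (the group and system names here are illustrative; consult the VCS Users Guide for the exact flag semantics):

# Tell VCS the departed node C really is down, then bring the group up on node A
hagrp -autoenable ORA_PROD_Group -sys C
hagrp -online ORA_PROD_Group -sys A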

[Figure: the same four-node cluster with one private link to node C failed. Regular membership: A, B, C, D; jeopardy membership: C.]


o Now we reset to the same configuration as above, with node C operating in the cluster on one heartbeat, and we lose the second heartbeat to node C. In this situation, a new membership is issued for nodes A, B and D as regular members and no jeopardy membership. Since node C was in a jeopardy membership, any service group that was running on node C is AutoDisabled so no other node will attempt to assume ownership of these service groups. Nodes A, B and D become a mini-cluster comprised of three nodes. Node C becomes its own mini-cluster comprised of only itself. All service groups that were present on nodes A, B and D are AutoDisabled to node C due to the earlier jeopardy membership. Node C issues its own new membership with itself as the only regular member.

[Figure: node C has lost both private links. Regular membership: A, B, D (with known previous jeopardy membership for C).]


• 4 nodes, connected with two private networks and one public low priority network. In this situation, cluster status is load balanced across the two private links and heartbeat is sent on all three links. The public net heartbeat is reduced in frequency to twice per second.

[Figure: the cluster has split into mini-clusters. Regular membership A, B, D (cluster 1; service groups that were running on node C are disabled in this cluster); regular membership C (cluster 2; service groups that were running on A, B, D are disabled in this cluster).]


o Once again we lose a private link to node C. The other nodes now send all cluster status traffic to node C over the remaining private link and use both private links for traffic between themselves. The low priority link continues with heartbeat only. No jeopardy condition exists because there are two links available to discriminate system failure.

[Figure: four nodes with two private networks and a low priority public link. Regular membership: A, B, C, D; cluster status on the Green and Blue private networks; heartbeat only on the public link.]


o Now we lose the second private heartbeat link. At this point, cluster status communication is routed over the public link to node C. Node C is placed in a jeopardy membership as detailed in the first example. Auto failover on node C fault is disabled.

[Figure: one private link to node C has failed. Regular membership: A, B, C, D; no jeopardy, due to heartbeat on the low priority link; heartbeat (no status) on the public link.]


o Reconnecting a private network has no ill effect. All cluster status will revert to the private link and the low priority link returns to heartbeat only. At this point, node C would be placed back in normal regular membership with no jeopardy membership.

• 4 Node configuration with two private heartbeat and one disk heartbeat.

o Under normal operation, all cluster status is load balanced across the two private networks. Heartbeat is sent on both network channels. Gabdisk (or gabdiskhb) places another heartbeat on the disk.

[Figure: both private links to node C have failed. Regular membership: A, B, C, D; jeopardy membership: C; cluster status and heartbeat for node C carried on the public link.]


o On loss of a private heartbeat link, all cluster status shifts to the remaining private link. There is no jeopardy at this point because two heartbeats are still available to discriminate system failure.

[Figure: four nodes with two private heartbeat networks and a GABDISK disk heartbeat on shared storage. Regular membership: A, B, C, D.]


o On loss of the second heartbeat, things change a bit. The cluster splits into mini clusters since no cluster status channel is available. Since heartbeats continue to write to disk, systems on each side of the break AutoDisable service groups running on the opposite side. This is the second type of jeopardy membership, one where there is not a corresponding regular membership.

[Figure: one private link has failed. Regular membership: A, B, C, D; cluster status on the remaining (Blue) private network; heartbeat also on GABDISK.]


o Reconnection of a private link will cause a system panic since the cluster has been segregated. (HAD and Service Group restart on VCS 1.3)

[Figure: all network heartbeats are lost while the GABDISK heartbeat continues. Regular membership A, B, D (cluster 1; groups running on node C are disabled in this cluster); regular membership C (cluster 2; groups running on A, B, D are disabled in this cluster).]

11.15 Pre-existing Network Partitions

From the VCS Users Guide: A pre-existing network partition refers to failures in communication channels that occur while the systems are down. Regardless of whether the cause is scheduled maintenance or system failure, VCS cannot respond to failures when systems are down. This leaves VCS vulnerable to network partitioning when the systems are booted.

One of the key concepts to remember here is “probing”. During startup, VCS performs a monitor sequence (probe) on all resources configured in the cluster to ascertain what is potentially online on any system. This is designed to prevent any possible concurrency violation due to a system administrator starting resources manually, outside VCS control. VCS can only communicate with those nodes that are part of the LLT network. For example, imagine a 4-node cluster. During weekend maintenance the entire cluster is shut down, and during this time the heartbeat connections to node 4 are severed. A system administrator is directed to bring the Oracle database back up. If he manually brings up Oracle on node 4, we have a potential problem: if VCS were allowed to start on nodes 1-3, they would not be able to “see” node 4 and its online resources. This can lead to a potential split-brain situation. VCS seeding is designed to prevent exactly this situation.

11.16 VCS Seeding To protect your cluster from a pre-existing network partition, VCS employs the concept of a seed. By default, when a system comes up, it is not seeded. Systems can be seeded automatically or manually. Note that only systems that have been seeded can run VCS.

Systems are seeded automatically in one of two ways:

• When an unseeded system communicates with a seeded system.

• When all systems in the cluster are unseeded and able to communicate with each other.

VCS requires that you declare the number of systems that will participate in the cluster.

When the last system is booted, the cluster will seed and start VCS on all systems. Systems can then be brought down and restarted in any combination. Seeding is automatic as long as at least one instance of VCS is running somewhere in the cluster.

Seeding control is established via the /etc/gabtab file. GAB is started with the command line "/sbin/gabconfig -c -n X", where X is equal to the total number of nodes in the cluster. A 4-node cluster should have the line "/sbin/gabconfig -c -n 4" in /etc/gabtab. If a system administrator wishes to start the cluster with less than all nodes, he or she must first verify that the nodes not joining the cluster are actually down, then start GAB with "/sbin/gabconfig -c -x". This will manually seed the cluster and allow VCS to start on all connected systems.
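A minimal sketch of the two startup cases for a four-node cluster (gabconfig -a, shown as a verification step, prints the current GAB port memberships):

# Normal automatic seeding: this line lives in /etc/gabtab on every node
/sbin/gabconfig -c -n 4
# Manual seed when starting with fewer nodes, after confirming the missing nodes are down
/sbin/gabconfig -c -x
# Verify GAB membership
/sbin/gabconfig -a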


11.17 VCS 1.3 Seeding and Probing Changes VCS 1.3 changes the behaviour of a cluster during initial startup by adding further protection. VCS 1.3 will AutoDisable a service group until all resources are probed for the group on all systems in the SystemList that have GAB running. This protects against a situation where enough systems are running LLT and GAB to seed the cluster, but not all systems have HAD running. The new method requires HAD to be running so the status of resources can be determined.

11.18 Network Partitions and the UNIX Boot Monitor (or “how to create your very own split-brain condition”)

Most UNIX systems provide a console-abort sequence that enables you to halt and continue the processor. On Sun systems, this is the “L1-A” or “Stop-A” keyboard sequence. Continuing operations after the processor has stopped may corrupt data and is therefore unsupported by VCS. Specifically, when a system is halted with the abort sequence it stops producing heartbeats. The other systems in the cluster then consider the system failed and take over its services. If the system is later enabled with another console sequence, it continues writing to shared storage as before, even though its applications have been restarted on other systems where available.

The best way to think about this is to realize the console abort sequence essentially “stops time” for a Sun system. If a write were about to occur when the abort is processed, it will happen immediately after the resume or “go” command. So, the operator halts a system with “stop-A”. This appears to all other nodes as a complete system fault, as all heartbeats disappear simultaneously. Some other node will take over services for the missing node. When the resume occurs, it will take several seconds before the return of a formerly missing heartbeat causes a system panic. During this time, the write that was waiting on the stopped node will occur, leading to data corruption.

VERITAS recommends disabling the console-abort sequence or creating an alias to force the “go” command to actually perform a reboot. See the VCS installation guide for instructions.
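One common way to disable the abort sequence on Solaris (shown as an illustration only; the supported procedure is the one in the VCS installation guide) is:

# Disable Stop-A on the running system
kbd -a disable
# To make the change persistent, set KEYBOARD_ABORT=disable in /etc/default/kbd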

11.19 VCS Messaging This section describes messaging in VCS.

There are three types of messages used by the various components of VCS corresponding to the three “levels” of message infrastructure:

• Internal messages: messages generated from within the HAD process to cause other functions to be called. Internal messages do not go out over any wire; instead they are used as “deferred procedure calls,” a way for one function within HAD to call another after having finished the current function. Every HAD server should generate the same internal messages in the same order because they execute the same logic.

• Broadcast or UCAST messages: some messages are required to be broadcast to every HAD peer in a cluster in the same order; for instance a request to online a group or notification that a resource changed state must be broadcast to every peer so they all update their internal data structures in parallel. Broadcast messages are sent via GAB from HAD through the GabHandle, which physically is a handle to the GAB driver on the local system. UCAST messages are simply GAB messages sent directly to a single peer for snapshotting. (Used by a system to download the running configuration from a peer).

• IPM messages: clients connect to the HAD process to deliver requests and receive responses. IPM messages are sent over an IpmHandle, which is physically a standard TCP/IP socket. Each HAD process contains a socket listener called the IpmServer that listens for and accepts new IpmHandle connections.

The figure below shows an example of the message infrastructure for two systems:


12 VCS Triggers

The following section discusses a feature of VCS called triggers. VCS 1.1.2 incorporated the concept of a “PreOnline” attribute for a Service Group. This allowed the administrator to code specific actions to be taken prior to onlining a service group (such as updating remote hardware devices or restarting applications external to VCS) or to send mail announcing the service group was going online (this was used as a less-than-adequate method to notify administrators that a group had already gone offline).

The release of VCS 1.2.x on Windows NT and 1.3.x on Unix has brought the concept of Triggers. Triggers provide two very important functions in VCS:

• Event Notification. This is the simplest use of Trigger capability. Each event can be configured to send email to specific personnel.

• Allow specific actions to be taken on specific events. For example, running a script before bringing a Service Group online.

12.1 How VCS Performs Event Notification/Triggers

VCS determines if the event is enabled.

VCS then invokes hatrigger, a high-level Perl script located at:

• On UNIX: $VCS_HOME/bin/hatrigger

• On Windows NT: VCS_HOME\bin\hatrigger.pl

VCS also passes the name of the event trigger and the parameters specific to the event. For example, when a service group becomes fully online on a system, VCS invokes hatrigger -postonline system service_group. Note that VCS does not wait for hatrigger or the event trigger to complete execution. After calling the triggers, VCS continues normal operations.

Event triggers are invoked on the system where the event occurred, with the following exceptions:

• The InJeopardy, SysOffline, and NoFailover event triggers are invoked from the lowest-numbered system in RUNNING state.

• The Violation event trigger is invoked from all systems on which the service group was brought partially or fully online.

The script hatrigger invokes an event trigger.


The script hatrigger performs actions common to all triggers, and calls the intended event trigger as instructed by VCS. This script also passes the parameters specific to the event.

12.2 Trigger Description The following section lists each trigger and when it is invoked.

12.2.1 PostOnline trigger

The PostOnline trigger is called when a Service Group completely transitions to Online following an Offline state.

• It will be invoked after the group is completely online from a non-online state.

• It will be invoked (for that group) on the node where the group went online.

• It will not be invoked for a group already in online state.

• It will not be invoked when the group transitions to partially online state.

• Manual resource online may lead a group to transition to online and thus trigger PostOnline script to run.

• It is configurable, by setting the group level attribute PostOnline to 1. By default, PostOnline is 1.

The PostOnline trigger is useful for signalling remote systems that an application group has come fully online. For instance, in a 3-tier E-Commerce environment, the middleware may need a restart after the database is online.

12.2.2 PostOffline trigger: The PostOffline trigger is called when a Service Group transitions to a completely offline state. This state can be reached from Online or Faulted.

• It will be invoked when all OnOff (non-persistent and non-OnOnly) resources transition to offline from a non-offline state.

• It will be invoked (for that group) on the node where the group went offline.

• It will not be invoked for a group already in offline state.

• Manual resource offline may lead a group to transition to offline and thus trigger PostOffline script to run

• It is configurable, by setting the group level attribute PostOffline to 1. By default, PostOffline is 1.


12.2.3 PreOnline trigger:

The PreOnline trigger is the most commonly used trigger. It is invoked prior to beginning the service group online process.

• It will be invoked (for that group) on the node where the group is to be onlined.

• Additional parameter whyonlining will be set to either FAULT or MANUAL to indicate reason for onlining.

• A group online request can result from several things, such as a manual online, a manual switch, a group failover, or the clearing of a persistent resource on an IntentOnline group.

• PreOnline script is not invoked for group already in online state.

• Manual resource online does not trigger PreOnline script.

• PreOnlining attribute is set while PreOnline script runs. PreOnlining attribute is reset when hagrp calls group online with nopre option. PreOnlining attribute is currently not used to make any decisions. (This attribute is an internal HAD value)

• If PreOnline script can't be run, either because script doesn't exist, or script is not executable, group is onlined with -nopre option and PreOnlining attribute is reset.

• It is configurable, by setting the group level attribute PreOnline to 1. By default, PreOnline is 0.
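As a hedged illustration only, the following Perl sketch shows the general shape of a preonline trigger. It assumes the trigger receives the system name, group name and the whyonlining value described above, and that the trigger itself must re-issue the group online with the -nopre option once its checks pass; verify the exact calling convention and flag placement against the sample preonline script shipped with your release. The maintenance flag file and VCS_HOME default are placeholders.

#!/usr/bin/perl
# Minimal preonline trigger sketch (assumptions noted in the lead-in).
# Assumed invocation: preonline <system> <service_group> <whyonlining>
my ($system, $group, $why) = @ARGV;    # $why is FAULT or MANUAL

my $vcs_home = $ENV{'VCS_HOME'} || "/opt/VRTSvcs";   # assumed default location

# Example check: block a fault-driven online while a (hypothetical)
# maintenance flag file is present on this node.
if ($why eq "FAULT" && -f "/var/tmp/maintenance_in_progress") {
    exit 0;    # do nothing; the group is left offline
}

# Checks passed: continue the online, bypassing PreOnline this time.
system("$vcs_home/bin/hagrp -online -nopre $group -sys $system");
exit 0;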

12.2.4 ResFault trigger

The ResFault trigger is invoked whenever a resource faults on a system. It is commonly used for notification purposes.

• It will be invoked for a critical or a non-critical resource when that resource faults.

• It will be invoked (for the faulted resource) on the node where resource faults.

• It is non-configurable, and will be invoked if the script exists.
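Since ResFault is typically used for notification, a sketch along the following lines is common. The argument list is assumed here to be the system and resource name; check the sample resfault script shipped with your release for the exact parameters, and substitute a real mail address for the placeholder.

#!/usr/bin/perl
# Minimal resfault trigger sketch for notification (hypothetical example).
# Assumed invocation: resfault <system> <resource>
my ($system, $resource) = @ARGV;
my $admin = 'admin@example.com';    # placeholder address

open(MAIL, "| mailx -s 'VCS resource fault' $admin") or exit 1;
print MAIL "Resource $resource faulted on system $system\n";
close(MAIL);
exit 0;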

12.2.5 ResNotOff trigger

• It will be invoked for a critical or a non-critical resource when that resource fails to offline. The offline may have been requested as a resource offline or a group offline.


• It will be invoked (for the unable-to-offline resource) on the node where resource doesn't go offline.

• It is non-configurable, and will be invoked if the script exists.

12.2.6 SysOffline trigger

• It will be invoked on the lowest-numbered node in RUNNING state for the node that went offline. The node may have gone offline due to an engine crash, a system crash, a graceful engine stop, or a forceful engine stop.

• If there are no nodes in RUNNING state, SysOffline will not be invoked.

• If all nodes in the cluster are offlined at once, SysOffline may not be invoked.

• It is non-configurable, and will be invoked if the script exists.

12.2.7 InJeopardy trigger

• It will be invoked on the lowest-numbered node in RUNNING state when a system transitions to regular jeopardy. For example, when a node loses one of its two private heartbeat links with other nodes, InJeopardy will be invoked for the nodes in jeopardy.

• If there are no nodes in RUNNING state, InJeopardy will not be invoked.

• If a node loses one heartbeat link followed by another heartbeat link, InJeopardy will be invoked only once (for the first heartbeat link).

• If a node loses both heartbeat links at once, it is a split-brain; InJeopardy will not be invoked.

• It is non-configurable, and will be invoked if the script exists.

12.2.8 Violation trigger

• It will be invoked on all nodes when a group is in the concurrency violation state (when a failover group is ONLINE/PARTIAL on more than one node).

• It will be invoked every time the group's CurrentCount attribute is modified and the resulting CurrentCount is greater than 1.

• It will be invoked on all nodes that are in concurrency violation.

• This trigger doesn't apply to parallel groups.

• It is non-configurable, and will be invoked if the script exists


12.3 Trigger Configuration

VCS provides a sample Perl script for each event trigger. These scripts can be customized according to your requirements. On UNIX, you may also write your own Perl scripts, C, or C++ programs instead of using the sample scripts. On Windows NT, you may write your own Perl scripts only.

Sample Perl scripts for event triggers are located in the following directories:

• On UNIX: $VCS_HOME/bin/sample_triggers

• On Windows NT: VCS_HOME\bin\sample_triggers

Note that event triggers must reside on all systems in the cluster in the following directories:

• On UNIX: $VCS_HOME/bin/triggers

• On Windows NT: VCS_HOME\bin\triggers

If VCS determines that there is no corresponding trigger script or executable in the locations listed for each event trigger, VCS takes no further action.

12.4 Recommended Trigger usage TO BE COMPLETED

13 Service Group Dependencies TO BE COMPLETED

14 VCS startup and shutdown

This section describes the startup and shutdown of a VCS cluster and how the configuration file is used.

14.1 VCS startup

The following diagram shows the possible state transitions when VCS starts up.

[Startup state-transition diagram: from INITING, a system with a valid configuration on disk moves to CURRENT_DISCOVER_WAIT and then to LOCAL_BUILD, CURRENT_PEER_WAIT, REMOTE_BUILD or ADMIN_WAIT depending on peer state; a system with a stale configuration on disk moves to STALE_DISCOVER_WAIT and then to STALE_ADMIN_WAIT, STALE_PEER_WAIT, REMOTE_BUILD or ADMIN_WAIT; all successful paths end in RUNNING.]


When a cluster member initially starts up, it transitions to the INITING state. This is had performing general start-up processing. The system must then determine where to get its configuration. It first checks whether the local on-disk copy is valid: valid means the main.cf file passes verification and there is no “.stale” file in the config directory (more on .stale later).

If the config is valid, the system transitions to the CURRENT_DISCOVER_WAIT state. Here it is looking for another system in one of the following states: ADMIN_WAIT, LOCAL_BUILD or RUNNING.

• If another system is in ADMIN_WAIT, this system will also transition to ADMIN_WAIT. The ADMIN_WAIT state is a very rare occurrence and can only happen in one of two situations:

• When a node is in the middle of a remote build and the node it is building from dies and there are no other running nodes.

• When doing a local build and hacf reports an error during command file generation. This is a very unlikely corner case, as hacf was already run to determine that the local file is valid. It would typically require an I/O error to occur while building the local configuration.

• If another system is building the configuration from its own on-disk config file (LOCAL_BUILD), this system will transition to CURRENT_PEER_WAIT and wait for the peer system to complete. When the peer transitions to RUNNING, this system will do a REMOTE_BUILD to get the configuration from the peer.

• If another system is already in RUNNING state, this system will do a REMOTE_BUILD and get the configuration from the peer.

If no other systems are in any of the 3 states listed above, this system will transition to LOCAL_BUILD and generate the cluster config from its own on disk config file. Other systems coming up after this point will do REMOTE_BUILD.

If the system comes up and determines the local configuration is not valid, i.e. does not pass verification or has a “.stale” file, the system will shift to STALE_DISCOVER_WAIT. The system then looks for other systems in the following states: ADMIN_WAIT, LOCAL_BUILD or RUNNING.

• If another system is in ADMIN_WAIT, this system will also transition to ADMIN_WAIT

• If another system is building the configuration from its own on-disk config file (LOCAL_BUILD), this system will transition to STALE_PEER_WAIT and wait for the peer system to complete. When the peer transitions to RUNNING, this system will do a REMOTE_BUILD to get the configuration from the peer.

• If another system is already in RUNNING state, this system will do a REMOTE_BUILD and get the configuration from the peer.

If no other system is in any of the three states above, this system will transition to STALE_ADMIN_WAIT. It will remain in this state until another peer comes up with a valid config file and does a LOCAL_BUILD. This system will then transition to STALE_PEER_WAIT, wait for the peer to finish, then transition to REMOTE_BUILD and finally RUNNING.

14.2 VCS Shutdown

The following diagram shows the possible state transitions on a VCS shutdown.

[Shutdown state-transition diagram: from RUNNING, hastop leads to LEAVING, then (after resources are offlined and agents stopped) to EXITING and finally EXITED; hastop -force leads directly to EXITING_FORCIBLY; an unexpected exit is seen by the peers as FAULTED.]


There are three possible ways a system can leave a running cluster: using hastop, using hastop -force, or having the system or had fault.

In the left-most branch, we see an “unexpected exit” and a state of FAULTED. This is from the peer’s perspective. If a system suddenly stops communicating via heartbeat, all other systems in the cluster mark its state as faulted.

In the center branch, we have a normal exit. The system leaving informs the cluster that it is shutting down. It changes state to LEAVING. It then offlines all service groups running on this node. When all service groups have gone offline, the current copy of the configuration is written out to main.cf. At this point, the system transitions to EXITING. The system then shuts down had and the peers see this system as EXITED. This is important because the peers know they can safely online service groups previously online on the exited system.

In the right-most branch, the administrator forcefully shuts down a node or all nodes with “hastop -force” or “hastop -all -force”. With one node, the system transitions to an EXITING_FORCIBLY state. All other systems see this transition. On the local node, all service groups remain online and had exits. All other systems mark any service group that was online on the exiting system as Autodisabled. This is a safety feature: the other systems in the cluster know certain resources were in use and now can no longer see the status of those resources. In order to bring service groups up on any other system, the Autodisabled flag must be cleared for the groups.

14.3 Stale configurations

There are several instances where VCS will come up in a stale state. The first is having a configuration file that is not valid: if running hacf -verify produces any errors, the file is not valid. The second is opening the configuration for writing while VCS is running, with the GUI or by the command hacf -makerw. When the config is opened, VCS writes a .stale file to the config directory on each system. The .stale file is removed when the configuration is once again read-only (closed with the GUI or with the command hacf -makero -dump). If a system is shut down with the configuration open, the .stale file will remain.

VCS can ignore the .stale problem by starting had with “hastart -force”. You must first verify the local main.cf is actually correct for the cluster configuration.

15 Agent Details

VCS consists primarily of two classes of processes: the engine and the agents.

The VCS engine performs the core cluster management functions. An instance of the VCS engine runs on every node in the cluster. The VCS engine is responsible for servicing GUI requests and user commands, managing the cluster, and keeping the cluster systems in sync. The actual task of managing the individual resources is delegated to the VCS agents.

The VCS agents perform the actual operations on the resources. Each VCS agent manages resources of a particular type (for example, Disk resources) on a system. So, you may see multiple VCS agent processes running on a system, one for each resource type (one for Disk resources, another for IP resources, and so on).

All the VCS agents need to perform some common tasks, including:

• Upon starting up, download the resource configuration information from the VCS engine. Also, register with the VCS engine, so that the agent will receive notification when the above information is changed.

• Periodically monitor the resources and notify the status to the VCS engine.

• Online/Offline the resources when the VCS engine instructs so.

• Cancel the online/offline/monitor operation, if it takes too long to complete.

• When a resource faults, restart it.

• Send a log message to the VCS engine when any error is detected.


The VCS Agent Framework takes care of all such common tasks & greatly simplifies agent development. The highlights of the VCS Agent Framework design are:

• Parallelism - The Agent Framework is multi-threaded. So, the same agent process can perform operations on multiple resources simultaneously.

• Serialized Operations on Resources - The Agent Framework guarantees that at most one operation will be performed on a resource at any given time. For example, when a resource is being onlined, no other operations (for example, offline or monitor) will be performed on that resource.

• Scalability - An agent can support several hundred resources.

• Implementation Flexibility - Agents support both C++ & scripts.

• Configurability - Agents need to be developed for varied resource types. So, the Agent Framework is configurable to suit the needs of different resource types.

• Recovery - Agents can detect a hung/failed service and restart it on the local node, without any intervention from the user or the VCS engine.

• Faulty Resource Isolation - A faulty or misbehaving resource will not prevent the agent from effectively managing other resources.

VCS agents are the key enabling technology that allows VCS to control such a wide variety of applications and other resources. As any new application is written, an agent can be created to allow VCS to properly start, stop and monitor the application.

15.1 Parameter Passing

VCS Agents pass parameters to entry points or scripts (online, offline, monitor, clean) in a very controlled sequence. The agent calls online, offline and monitor with the name of the resource followed by the contents of the ArgList. Clean is called with the name of the resource, CleanReason, then the contents of the ArgList.

In the following example, VCS will use the MountAgent to mount a file system. The Mount resource type description looks like the following:

type Mount (
    static str ArgList[] = { MountPoint, BlockDevice, FSType, MountOpt, FsckOpt }
    NameRule = resource.MountPoint
    str MountPoint
    str BlockDevice
    str FSType
    str MountOpt
    str FsckOpt
)

The mount resource is defined in main.cf as follows:

Mount home_mount (
    MountPoint = "/export/home"
    BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
    FSType = vxfs
    MountOpt = rw
    )

When had wishes to bring the home_mount resource online, it will direct the MountAgent to online home_mount. The MountAgent will pass the proper parameters to the online entry point as follows: “home_mount /export/home /dev/vx/dsk/shared_dg1/home_vol vxfs rw <null>” The identical string is passed to the monitor and offline entry points/scripts when necessary. It is the script’s responsibility to use the passed parameter values correctly. For example, the offline script does not need to know fsck options or mount options, just the mount point or block device. However, the offline script is still passed all these values.

The following is an excerpt from the Mount online script that shows bringing in the variables from the MountAgent

# This script onlines the file system by mounting it after doing a
# file system check.
#
my ($MountPoint, $BlockDevice, $Type, $MountOpt, $FsckOpt);
my ($RawDevice, $i, $rc);
my ($mount, $fsck, $df);
my ($log_message, $vcs_home, $ResName);

$ResName = $ARGV[0];        ## Note that the agent passes the resource name as the first parameter
shift;
$MountPoint = $ARGV[0];     ## Assign the first parameter in the ArgList to MountPoint
$BlockDevice = $ARGV[1];    ## Assign the second parameter in the ArgList to BlockDevice
$RawDevice = $BlockDevice;
$RawDevice =~ s/dsk/rdsk/;  ## Determine the raw device from the block device
$Type = $ARGV[2];           ## Assign the third parameter in the ArgList to Type
$MountOpt = $ARGV[3];       ## Assign the fourth parameter in the ArgList to MountOpt
$FsckOpt = $ARGV[4];        ## Assign the fifth parameter in the ArgList to FsckOpt


Clean is the only exception to the rule, as it passes an additional parameter “CleanReason” between the resource name and the ArgList.

The parameters used by the clean script for Mount are as follows:

($ResID, $CleanReason, $MountPoint, $BlockDevice, $Type, $MountOpt, $FsckOpt)

The variable CleanReason equals one of the following values:

0 - The offline entry point did not complete within the expected time.

1 - The offline entry point was ineffective.

2 - The online entry point did not complete within the expected time.

3 - The online entry point was ineffective.

4 - The resource was taken offline unexpectedly.

5 - The monitor entry point consistently failed to complete within the expected time.
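To illustrate this calling convention, the following is a minimal Perl sketch of a clean script for the Mount example above. The forced unmount is a hypothetical last-resort action, not the shipped Mount clean implementation; adapt it to whatever is safe in your environment.

#!/usr/bin/perl
# Sketch of a clean script for the Mount example (hypothetical).
# Per the calling convention described above, clean receives:
#   <resource name> <CleanReason> <ArgList contents>
my ($ResName, $CleanReason, $MountPoint, $BlockDevice, $Type,
    $MountOpt, $FsckOpt) = @ARGV;

# The CleanReason value (0-5) is informational here; the job of clean is
# simply to make sure the resource is down. A more careful script would
# verify the mount point is actually gone afterwards.
my $rc = system("/usr/sbin/umount -f $MountPoint");

# Return 0 if the cleanup succeeded, 1 if it failed.
exit($rc == 0 ? 0 : 1);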

15.2 Agent configuration

A large number of attributes are available that are understood by all agents and allow tuning the behaviour of resource types. This section lists the most common:

15.2.1 ConfInterval

ConfInterval determines how long a resource must remain online to be considered “healthy”. When a resource has remained online for the specified time (in seconds), previous faults and restart attempts are ignored by the agent. (See the ToleranceLimit and RestartLimit attributes for details.) For example, suppose an ApacheAgent is configured with the default ConfInterval of 300 seconds (5 minutes) and a RestartLimit of 1, and the Apache Web Server process is started and remains online for two hours before failing. With RestartLimit set to 1, the ApacheAgent will restart the failing web server. If the server fails again before the time set by ConfInterval, the ApacheAgent informs HAD that the web server has failed, and HAD will mark the resource as faulted and begin a failover for the Service Group. If instead the web server stays online longer than the time specified by ConfInterval, the RestartLimit counter is cleared. In this way, the resource could fail again at a later time and be restarted. The ConfInterval attribute gives the developer a method to discriminate between a resource that occasionally fails and one that is essentially bouncing up and down.


15.2.2 FaultOnMonitorTimeouts

When the monitor entry point times out as many times as the value specified, the corresponding resource is brought down by calling the clean entry point. The resource is then marked FAULTED, or it is restarted, depending on the value set in the RestartLimit attribute. When FaultOnMonitorTimeouts is set to 0, monitor failures are not considered indicative of a resource fault. (This attribute is available only in VCS versions later than 1.2.)

Default = 4

15.2.3 MonitorInterval

Duration (in seconds) between two consecutive monitor calls for an ONLINE or transitioning resource. The interval between monitor cycles directly affects the amount of time it takes to detect a failed resource. Reducing MonitorInterval can reduce the time required for detection. At the same time, reducing this interval also increases system load due to increased monitoring and can increase the chance of false failure detection.

Default = 60 seconds

15.2.4 MonitorTimeout

Maximum time (in seconds) within which the monitor entry point must complete or else be terminated. In VCS 1.3, a monitor timeout can be configured to be treated as a resource failure (see FaultOnMonitorTimeouts). In VCS 1.1.2, a monitor timeout simply caused a warning message in the VCS engine log.

Default = 60 seconds

15.2.5 OfflineMonitorInterval

Duration (in seconds) between two consecutive monitor calls for an OFFLINE resource. If set to 0, OFFLINE resources are not monitored. Individual resources are monitored on all systems in the SystemList of the service group the resource belongs to, even when they are OFFLINE. This is done to detect concurrency violations when a resource is started outside VCS control on another system. The default OfflineMonitorInterval is set to 5 minutes to reduce the system loading imposed by monitoring offline service groups.

Default = 300 seconds

15.2.6 OfflineTimeout

Maximum time (in seconds) within which the offline entry point must complete or else be terminated. There are certain cases where the offline function may take a long time to complete, such as shutting down an active Oracle database. When writing custom agents, the developer must remember that it is the function of the monitor entry point to actually check that the offline was successful, not the offline entry point itself. In many cases, an offline timeout is caused by the offline script waiting for the offline to complete and doing some sort of testing.

Default = 300 seconds

15.2.7 OnlineRetryLimit

Number of times to retry online if the attempt to online a resource is unsuccessful. This parameter is meaningful only if clean is implemented. This attribute is different from RestartLimit in that it only applies during the initial attempt to bring a resource online when the service group is brought online. The counter for this value is reset when the monitor process reports that the resource has been successfully brought online.

Default = 0

15.2.8 OnlineTimeout

Maximum time (in seconds) within which the online entry point must complete or else be terminated. As with the offline timeout, the developer must remember that the function of the online entry point is to start the resource, not to check whether it is actually online. If extra time is needed to wait for the resource to come online, this should be reflected in the online entry point's exit code, which specifies the number of seconds to wait before monitoring.

Default = 300 seconds

15.2.9 RestartLimit

Affects how the agent responds to a resource fault. If set to a value greater than zero, the agent will attempt to restart the resource when it faults. In order to utilize RestartLimit, a clean function must be implemented. The act of restarting a resource happens completely within the agent and is not reported to HAD. In this manner, a resource will still show as online on the VCS GUI or in the output of hastatus during this process. The resource will only be declared as offline if the restart is unsuccessful.

Default = 0

15.2.10 ToleranceLimit

A non-zero ToleranceLimit allows the monitor entry point to return OFFLINE several times before the resource is declared FAULTED. This is useful when a resource may be heavily loaded and end-to-end monitoring is in effect. For example, a web server under extreme load may not be able to respond to an in-depth monitor probe that connects and expects an html response. Setting a ToleranceLimit greater than zero allows multiple monitor cycles to attempt the check before declaring a failure.

Default = 0


16 Frequently Asked Questions

16.1 General

16.1.1 Does VCS support NFS lock failover? No. Current VCS does not fail over NFS locks when a file system share is switched between servers. It is on the roadmap for a future release.

16.1.2 Can I mix different operating systems in a cluster? All nodes within an individual cluster must be the same operating system, for example, Solaris, HP or NT. You cannot have systems with mixed operating systems in the same cluster. The Cluster Server Cluster Manager can manage separate clusters of different operating systems from the same console. For example, a single VCS Cluster Manager GUI can log in and manage clusters of HP, NT and Sun.

16.1.3 Can I configure a “Shared Nothing” cluster? That really depends on the definition of “shared-nothing”. VCS is designed as a shared data disk solution. If the data being used is read-only, then a non-shared disk configuration is possible. For example, if multiple systems are capable of running a web server serving static content from a local disk, then the answer is yes.

Attempting to use replication to keep read-write data consistent between cluster members not using a shared disk is not supported. VCS does not support replication between failover members. VCS does support making the primary or secondary of a replication configuration highly available. For example, you can fail over the primary of a VVR configuration to another system, or you can fail over the secondary of a VVR config. VCS does not support failing services between a primary and a secondary.

16.1.4 What is the purpose of hashadow? Hashadow is responsible for monitoring had and restarting it if necessary. If hashadow has restarted had, the ps output will resemble: “had -restart”.

16.1.5 What is “haload”? Haload is a process originally used by VCS to calculate overall system load to be used during failover node determination. It used the values in main.cf such as Factor, MaxFactor, etc. This method of load determination is no longer supported in VCS and the haload binary will be removed from future distributions. See the section on “Failover Policy” for a current description of controlling node selection during failover.


16.1.6 What failover policies are available? VCS has three possible failover policies:

Priority (default): The system defined with the lowest priority in the SystemList attribute will be chosen. Priority is set either implicitly by the ordering of system names in the SystemList attribute (for example, SystemList = { HA1, HA2 } is identical to SystemList = { HA1 = 0, HA2 = 1 }) or explicitly by assigning values. In the priority scheme, 0 is the lowest value and values increase from there.

Load: The system with the least value in the system’s Load attribute will be chosen. Load is a per-system value set with the command "hasys -load <systemname> <value>", such as "hasys -load HA1 20". The actual "haload" command is no longer used and will be removed from future releases. The entries in main.cf for Factor and MaxFactor also relate to haload and will be removed. The use of hasys -load requires the user to determine their own policy for what will be considered in computing load. The value entered on the command line is compared against other systems on failover, so setting a system to 20 means it has a higher load than a system set to 19.

RoundRobin: The system with the least number of active service groups will be chosen. This is likely the best policy to set if multiple service groups can run on multiple machines.

16.1.7 What are System Zones? System zones are an enhancement to the SystemList attribute that changes failover behaviour. The System Zone indicates the virtual sub lists within the SystemList attribute that grant priority in failing over. Values are string/integer pairs. The string key is the name of a system in the SystemList attribute, and the integer is the number of the zone. Systems with the same zone number are members of the same zone. If a service group faults on one system in a zone, it is granted priority to fail over to another system within the same zone, despite the policy granted by the FailOverPolicy attribute.

16.1.8 What is “halink”? The halink binary is a daemon that can be run to update the VCS GUI with the status of heartbeat links. It is not started by default. Enabling “LinkMonitoring” in the Cluster configuration will start halink. There is currently a bug with LinkMonitoring that causes difficulty shutting down a system. The problem is detailed in the VCS Release Notes as follows:

Problem with LinkMonitoring

If you enable LinkMonitoring and issue any form of the hastop command, the HAD process crashes. A crash of VCS is not critical because HAD crashes after offlining or switching service groups. However, because the HAD process quits ungracefully, the hashadow process restarts HAD and VCS remains running. This is timing-dependent and occurs sporadically.


Workaround: There are two solutions to this problem:

• Do not use LinkMonitoring. This is default behavior. Use the InJeopardy trigger or SNMP traps to get notification when only one network link is remaining.

• If you must use LinkMonitoring, set it to 0 before issuing the command hastop. Issue the command haclus -disable LinkMonitoring to set the attribute to 0.

16.1.9 What does “stale admin wait” mean?

The system has a stale configuration (the local on-disk configuration file does not pass verification or there is a “.stale” file present) and there is no other system in the state of RUNNING from which to retrieve a configuration. If a system with a valid configuration is started, that system enters the LOCAL_BUILD state. Then the systems in STALE_ADMIN_WAIT transition to STALE_PEER_WAIT. When the system finishes LOCAL_BUILD and transitions to RUNNING, the systems in STALE_PEER_WAIT will transition to REMOTE_BUILD followed by RUNNING.

16.1.10 How many nodes can VCS support? Current shipping VCS can support up to 32 nodes. Typical large clusters seen in the field range from 8 to 16 nodes. Larger configurations greatly increase complexity as well as system overhead from maintaining the distributed state. Clusters should typically be broken up along major functional areas. For example, combining a 4-node cluster running web applications with a 4-node cluster running a database simply to make an 8-node cluster merely increases overall complexity. On the other hand, combining these separate clusters due to a requirement that web servers must be restarted after a database switch over is valid. In this case, placing all systems in one 8-node cluster allows VCS to actively react to changes in the database by taking appropriate action on the web front end. The basic rule of thumb is to make your cluster only as large as required to support all necessary functionality.

16.2 Resources

16.2.1 What is the MultiNICA resource? The MultiNICA resource is a special configuration to allow “in box failover” of a faulted network connection. Upon detecting a failure of a configured network interface, VCS will move the IP address to a second standby interface in the same system. This can be far less costly in terms of service outage than a complete service group failover to a peer in many cases. It must be noted that there is still an interruption of service between the time a network card or cable fails, detection of the failure and migration to a new interface.


The MultiNICA resource only keeps a base address up on an interface, not the High Availability address used by VCS service groups. The HA address is the responsibility of the IPMultiNIC agent

16.2.2 What is the IPMultiNIC resource? The IPMultiNIC resource is a special IP resource designed to sit on top of a MultiNICA resource. Just as IP sits on an NIC resource, IPMultiNIC can only sit on a MultiNICA resource. IPMultiNIC configures and moves the HA IP address between hosts

16.2.3 What is a Proxy resource? A proxy resource allows the state of a resource configured and monitored in one service group to be mirrored into another service group. This is provided for two reasons:

• Reduce monitoring overhead. Configuring multiple resources pointing at the same physical device adds unnecessary monitoring overhead. For example, if multiple service groups use the same NIC device, all configured resources would monitor the same NIC. Using a proxy resource allows one Service group to monitor the NIC and this status is mirrored to the proxy resource.

• Determine status of an OnOff Resource in a different Service Group. VCS OnOff resources may only exist on one Service Group in a Failover group configuration.

16.2.4 How do I configure an IPMultiNIC and MultiNICA resource pair? In a normal VCS configuration, the IP resource is dependent on the NIC resource. To use a high availability NIC configuration, VCS is configured to use the IPMultiNIC resource depending on the MultiNICA resource. The MultiNICA resource is responsible for maintaining the base IP address up on one of the assigned interfaces, and moving this IP on the event of a failure to another interface. The IPMultiNIC resource actually configures up the floating VCS IP address on the physical interface maintained by MultiNICA.

In the following example, two machines, sysa and sysb, each have a pair of network interfaces, qfe1 and qfe5. The two interfaces have the same base, or physical, IP address. This base address is moved between interfaces during a failure. Only one interface is ever active at a time. The addresses assigned to the interface pairs differ for each host. Since each host will have a physical address up and assigned to an interface during normal operation (base address, not HA address) the addresses must be different. Note the lines beginning at Device@sysb; the use of different physical addresses shows how to localize an attribute for a particular host.


The MultiNICA resource fails over only the physical IP address to the backup NIC in the event of a failure. The IPMultiNIC agent configures the logical IP addresses. The resource ip1, shown in the following example, has an attribute called Address, which contains the logical IP address. In the event of a NIC failure on sysa, the physical IP address and the logical IP addresses will fail over from qfe1 to qfe5. In the event that qfe5 fails, the address will fail back to qfe1 if qfe1 has been reconnected. However, if both the NICs on sysa are disconnected, the MultiNICA and IPMultiNIC resources work in tandem to fault the group on sysa. The entire group then fails over to sysb.

If you have more than one group using the MultiNICA resource, the second group can use a Proxy resource to point to the MultiNICA resource in the first group. This prevents redundant monitoring of the NICs on the same system. The IPMultiNIC resource is always made dependent on the MultiNICA resource.

group grp1 (
    SystemList = { sysa, sysb }
    AutoStartList = { sysa }
    )

    MultiNICA mnic (
        Device@sysa = { qfe1 = "166.98.16.103", qfe5 = "166.98.16.103" }
        Device@sysb = { qfe1 = "166.98.16.104", qfe5 = "166.98.16.104" }
        NetMask = "255.255.255.0"
        )

    IPMultiNIC ip1 (
        Address = "166.98.16.78"
        NetMask = "255.255.255.0"
        MultiNICResName = mnic
        )

    ip1 requires mnic

Notes about Using MultiNICA Agent

If all the NICs configured in the Device attribute are down, the MultiNICA agent will fault the resource after a 2-3 minute interval. This delay occurs because the MultiNICA agent tests the failed NIC several times before marking the resource offline. Messages recorded in the engine log during failover provide a detailed description of the events that take place during failover. (The engine log is located at /var/VRTSvcs/log/engine_A.log).

The MultiNICA agent supports only one active NIC on one IP subnet; the agent will not work with multiple active NICs.

The primary NIC must be configured before VCS is started. You can use the ifconfig(1M) command to configure it manually, or edit the file /etc/hostname.<nic> so that configuration of the NIC occurs automatically when the system boots. VCS plumbs and configures the backup NIC, so it does not require the file /etc/hostname.<nic>.


16.2.5 How can I use MultiNIC and Proxy together? The following example shows the use of MultiNICA, IPMultiNIC and Proxy together. In this example, the customer wants to use IPMultiNIC in each service group. They also want each service group to look identical from a configuration standpoint. This example configures a parallel group on each server consisting of the MultiNICA resource and a Phantom resource, and multiple failover groups with a Proxy to the MultiNICA. The example does not define disk, mount or listener resources. Note that the IPMultiNIC resource attribute MultiNICResName = mnic always points to the physical MultiNICA resource and not the proxy.

The parallel service group containing MultiNICA resources ensures there is a local instance of the MultiNICA resource running on the box.

group multi-nic_group (
    SystemList = { sys1, sys2 }
    AutoStartList = { sys1, sys2 }
    Parallel = 1
    )

    Phantom Multi-NICs ()

    MultiNICA mnic (
        Device@sys1 = { qfe0 = "192.168.1.1", qfe5 = "192.168.1.1" }
        Device@sys2 = { qfe0 = "192.168.1.2", qfe5 = "192.168.1.2" }
        NetMask = "255.255.255.0"
        Options = "trailers"
        )

group Oracle-Instance1 (
    SystemList = { sys1, sys2 }
    AutoStartList = { sys1 }
    )

    DiskGroup xxx
    Volumes xxx
    Mounts xxx
    Oracle xxx
    Listener xxx

    Proxy Oracle1-NIC-Proxy (
        TargetResName = "mnic"
        )

    IPMultiNIC Oracle1-IP (
        Address = "192.168.1.3"
        NetMask = "255.255.255.0"
        MultiNICResName = mnic
        Options = "trailers"
        )

    Oracle1-IP requires Oracle1-NIC-Proxy

group Oracle-Instance2 (
    SystemList = { sys1, sys2 }
    AutoStartList = { sys2 }
    )

    DiskGroup xxx
    Volumes xxx
    Mounts xxx
    Oracle xxx
    Listener xxx

    Proxy Oracle2-NIC-Proxy (
        TargetResName = "mnic"
        )

    IPMultiNIC Oracle2-IP (
        Address = "192.168.1.4"
        NetMask = "255.255.255.0"
        MultiNICResName = mnic
        Options = "trailers"
        )

    Oracle2-IP requires Oracle2-NIC-Proxy

16.3 Communications

16.3.1 What is the recommended heartbeat configuration? VERITAS recommends a minimum of two dedicated 100 Megabit private links between cluster nodes. These must be completely isolated from each other so the failure of one heartbeat link cannot possibly affect the other.

Use of a low priority link is also recommended to provide further redundancy.

For example, on a Sun E4500 with a built-in HME and a QFE expansion card, the best configuration would place one heartbeat on the HME port and one on a QFE port. The public network would be placed on a second QFE port and also configured as a low-priority link (link-lowpri). The low priority link prevents a jeopardy condition on loss of any single private link and provides additional redundancy.

Configuring private heartbeats to share any infrastructure is not recommended. Configurations such as running two shared heartbeats to the same hub or switch, or using a single VLAN to trunk between two switches, introduce a single point of failure in the heartbeat architecture. The simplest guideline is: no single failure, such as power, network equipment or cabling, can be allowed to disable both heartbeat connections.


16.3.2 Can LLT be run over a VLAN? Yes, as long as the following rules are met.

• Heartbeat infrastructure is completely separate.

• The VLAN connects the machines at Layer2 (MAC)

16.3.3 Can I place LLT links on a switch? Yes. LLT/GAB operates at layer 2 and will function perfectly on a switch. It should be noted that LLT operates as a broadcast protocol and switching will not provide any performance gain. LLT links should be placed in their own VLAN to prevent broadcast traffic from impacting the performance of other network connections. VERITAS recommends the use of two network heartbeats for all cluster configurations. These heartbeat connections must run on completely separate infrastructure, so if switches are used, the heartbeats must run on completely independent switch infrastructure.

16.3.4 Can LLT/GAB be routed? No. LLT is a layer 2 DLPI protocol and carries no layer 3 (IP) information; therefore it cannot be routed.

16.3.5 How far apart can nodes in a cluster be? Cluster distance is governed by a number of factors. The primary factors for LLT are network connectivity and latency. Direct, layer 2, low-latency connections must be provided for LLT. This, combined with difficulties in extending the underlying storage fabric, typically limits “campus clusters” to approximately 10 km in radius. Large campus clusters or metropolitan area clusters must be very carefully designed to provide completely separate paths for heartbeat and storage fabric, to prevent a single fiber optic or fiber bundle failure from taking out heartbeat or storage.

Greater distances are best served by implementing local clusters at each site and coordinating inter-site failover with the VERITAS Global Cluster Manager.

16.3.6 Do heartbeat channels require additional IP addresses? No. VCS LLT/GAB is a layer 2 or MAC layer protocol. All communications are addressed using the MAC address of each machine.

16.3.7 How many nodes should be set in my GAB configuration? VERITAS recommends setting gabconfig parameters to the total number of systems in the cluster. If the customer has a 5-node cluster, GAB should not automatically seed until 5 nodes are present. Based on this configuration, the following is the proper entry in /etc/gabtab

/sbin/gabconfig -c -n 5


16.3.8 What is a split brain? A split brain occurs when two independent systems configured in a cluster both believe they have exclusive access to a given resource (usually file system/volume). Different vendors approach split brain prevention in different ways.

In all cases, the failover management software (FMS) uses pre-defined methods to determine if its peer is alive. If so, it knows it cannot safely take over resources. The "split-brain" situation comes up when the method of determining failure of a peer has been compromised. In virtually all FMS systems, true split-brain situations are very rare. A real split brain means multiple systems are online AND have simultaneously accessed an exclusive resource.

To make another point, simply splitting communications between cluster nodes does not constitute a split brain! A split-brain means cluster membership was affected in such a way that multiple systems utilize the same exclusive resources, usually resulting in data corruption.

The problem with all methods is finding a way to minimize chance of ever taking over an exclusive resource while another has it active, yet still deal with a system powering off.

In a perfect world, just after a system died, it would send a message from beyond the grave saying it was dead. Since we cannot convene a séance every time a system fails, we need a way to discriminate dead from non-communicating.

VCS uses a heartbeat method to determine health of its peer(s). These can be private network heartbeats, public (low priority) heartbeats and disk heartbeats. Regardless of heartbeat configuration, VCS determines that a system has gone away or more correctly termed "faulted" (i.e. power loss, kernel panic, Godzilla, etc) when ALL heartbeats simultaneously fail. For this to work the system must have two or more functioning heartbeats and all must fail at the same time.

VCS design assumes that for all heartbeats to actually fail at the same time, a system must be dead.

Further, VCS has a concept of "jeopardy". VCS must see multiple heartbeats disappear simultaneously to declare a system fault.

If systems in a cluster are down to only one functioning heartbeat, VCS says it cannot safely discriminate between a heartbeat failure and a real system fault.

In normal operation, complete loss of heartbeat is considered a system fault. In this case, other surviving nodes in a cluster, configured in the service group's "SystemList" will take over services from the faulted system. If the system had previously been in a jeopardy membership, where other systems had only one functional heartbeat to this system, upon loss of the final heartbeat, a peer system will not attempt a take over.


In order for VCS to actually attain a "split-brain" situation, the following events must occur:

• A service group is online on a system in a cluster

• The service group must have a system (or systems) in its SystemList as a potential failover target.

• All heartbeat communication between the system with the SG online and potential takeover target must fail simultaneously while the original system stays online.

• The potential takeover target must actually online resources that are normally an exclusive ownership type item (disk groups, volume, file systems).

16.4 Agents

16.4.1 What are Entry Points? An Entry Point is a user-defined plug-in that will be called when an event occurs within the VCS Agent. Examples are the VCSAgStartup Entry Point, which will be called immediately after the agent starts up, and the online Entry Point, which will be called when a resource needs to be onlined.

VCS Agent development involves implementing the Entry Points.

VCS Agent Framework (code common to all the agents) + VCS Agent Entry Point implementation (code specific to a resource type) = VCS Agent.

VCS Agent Framework supports the following Entry Points: VCSAgStartup, open, online, offline, monitor, attr_changed, clean, close and shutdown.

VCSAgStartup and monitor Entry Points are mandatory. The other Entry Points are optional.

VCSAgStartup and shutdown Entry Points relate to the agent process as a whole, whereas the other Entry Points signify actions about a specific resource.

16.4.2 What should be the return value of the online Entry Point? The return value of the online Entry Point should indicate the time (in seconds) the resource should take to become ONLINE, after the online Entry Point returns. The agent will not monitor the resource during that time.

For example, if the online Entry Point for a resource returns 10, the agent will resume periodic monitoring of the resource after 10 seconds. Please note that the monitor Entry Point will not be invoked when the online Entry Point is being executed.

For most of the resource types, it will be appropriate to resume periodic monitoring immediately after the online Entry Point returns. Correspondingly, the typical return value of the online Entry Point will be 0.
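For illustration, here is a minimal Perl online script for a hypothetical resource type whose ArgList contains a single StartCommand attribute; the type and attribute are invented for the example.

#!/usr/bin/perl
# Sketch of an online script for a hypothetical resource type.
# The agent passes the resource name followed by the ArgList
# (assumed here to contain a single StartCommand attribute).
my ($ResName, $StartCommand) = @ARGV;

system($StartCommand);

# The exit code is the number of seconds the agent should wait before it
# resumes monitoring: 0 means monitor immediately; a slow-starting
# application might return, say, 10.
exit 0;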

16.4.3 What should be the return value of the offline Entry Point? The return value of the offline Entry Point should indicate the time (in seconds) the resource should take to become OFFLINE, after the offline Entry Point returns. The agent will not monitor the resource during that time.

For example, if the offline Entry Point for a resource returns 10, the agent will resume periodic monitoring of the resource after 10 seconds. Please note that the monitor Entry Point will not be invoked when the offline Entry Point is being executed.

For most of the resource types, it will be appropriate to resume periodic monitoring immediately after the offline Entry Point returns. Correspondingly, the typical return value of the offline Entry Point will be 0.

16.4.4 When will the monitor Entry Point be called? The monitor Entry Point will be called

• Periodically, with a configurable period (thru the MonitorInterval attribute of the resource type). This interval is configurable per resource type only. It is not possible to configure monitor interval on a per resource basis. For example, setting MonitorInterval = 120 for the Apache resource type will monitor all Apache resources every two minutes. You cannot monitor one Apache web server every 60 seconds and another every 120 seconds. If this functionality is required, you must create a second agent and configure monitor interval on the second agent. In this example, all resources of type Apache would have one monitor interval and configuring a second resource type, for instance, Apache1, would allow setting a separate interval for these resources.

• After completing online/offline

16.4.5 When will the clean Entry Point be called? The clean Entry Point will be called when all the ongoing actions associated with a resource need to be terminated and the resource needs to be offlined, maybe forcibly.

The clean Entry Point will be called under any of the following conditions:

• The online Entry Point is not effective


• The online Entry Point does not complete within the expected time

• The offline Entry Point is not effective.

• The offline Entry Point does not complete within the expected time.

• The agent is configured to automatically restart faulted resources & a resource faults

• The resource becomes offline unexpectedly (VCS 1.3.x change). Clean will be called whenever a resource goes offline without VCS initiating the offline.

16.4.6 Should I implement the clean Entry Point? The clean Entry Point will be called when all the ongoing actions associated with a resource need to be terminated and the resource needs to be offlined, may be forcibly.

The agent will support the following features only if the clean Entry Point is implemented:

• Automatically restart a resource on the local node when the resource faults (see the RestartLimit attribute of the resource type.)

• Automatically retry the online Entry Point when the initial attempt to online a resource fails. (OnlineRetryLimit is 1 or greater)

• Allow the VCS engine to online the resource on another node, when the online Entry Point for that resource fails on the local node.

If you want to take advantage of any of the above features, you need to implement the clean Entry Point.

Determine the safe and guaranteed way to clean up (i.e., to offline the resource and to terminate any outstanding actions), if any. Then choose one of the steps below:

• If no clean up action is required for a resource type, the clean Entry Point can simply return 0, indicating success.

16.4.7 What should be the return value of the monitor Entry Point? The return value semantics for the monitor Entry Point depend on whether it is implemented using scripts or C++.

When using scripts, the return value must be


• 101 to 110 (if the resource is ONLINE). The return value can also encode the confidence level, starting at 10 for a return value of 101 and increasing by 10 for each higher value (20 for 102, 30 for 103, and so on up to 100 for 110).

• 100 (if the resource is OFFLINE)

• Any other value (if the resource is neither ONLINE nor OFFLINE)

When using C++, the return value must be one of the following:

• VCSAgResOnline (if the resource is ONLINE)

• VCSAgResOffline (if the resource is OFFLINE)

• VCSAgUnknown (if the resource is neither ONLINE nor OFFLINE)

Please note that when implementing the monitor Entry Point using C++, the confidence level will need to be returned through a separate output parameter.
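As a script-based illustration of these return values, the following Perl sketch monitors a hypothetical resource type whose ArgList contains a single PidFile attribute; the type and attribute are invented for the example.

#!/usr/bin/perl
# Sketch of a monitor script for a hypothetical resource type.
# Assumed ArgList: a PidFile attribute naming the file that holds the
# application's process ID.
my ($ResName, $PidFile) = @ARGV;

# A missing PID file, or a process that no longer exists, means OFFLINE.
exit 100 unless -f $PidFile;

open(PID, "< $PidFile") or exit 100;
chomp(my $pid = <PID>);
close(PID);

if ($pid && kill(0, $pid)) {
    exit 110;    # ONLINE, confidence level 100 (110 maps to 100)
} else {
    exit 100;    # OFFLINE
}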

16.4.8 What should be the return value of the clean Entry Point? The return value of the clean Entry Point must be 0 (if the resource was cleaned successfully) or 1 (if clean failed).

16.4.9 What should I do if I determine within the online Entry Point that it is not possible to online the resource? When implementing the online Entry Point, you may realize that under some conditions it is not possible to online the resource (or you know for sure that online will fail). Under such conditions, do the necessary programming cleanup and return exit code 0. The agent will immediately call the monitor Entry Point and then, if configured, may either notify the engine that the resource cannot be onlined or retry the online Entry Point.

16.4.10 Is the Agent Framework Multi-threaded? Yes. This implies that all the C++ Entry Point implementations must be thread-safe. In particular, you should not use global variables (unless protected by mutex locks) and you should not make any C/C++ library calls that are not thread-safe. On the Solaris operating system, thread-safe equivalents (indicated by the _r suffix) exist for most of the library routines.

16.4.11 How do I configure the agent to automatically retry the online procedure when the initial attempt to online a resource fails? Set the OnlineRetryLimit attribute of the resource type to a non-zero value. The default value of this attribute is 0. Also, you must implement the clean Entry Point.


OnlineRetryLimit specifies the number of times to retry online(), if the attempt to online a resource is not successful.

16.4.12 What is the significance of the Enabled attribute? The agent will monitor/online/offline a resource only if its Enabled attribute is 1.

For all the resources defined in the configuration file (main.cf), the Enabled attribute is 1 by default.

16.4.13 How do I request a VCS agent not to online/offline/monitor a resource? By setting the Enabled attribute of the resource to 0. This may be useful when you want to take a resource out of VCS control temporarily.

16.4.14 What is MonitorOnly? MonitorOnly is a predefined VCS resource attribute. The default value of this attribute is 0. When it is set to 1, the agent will not honor the online/offline requests for the resource. The agent will however continue to monitor the resource periodically.

16.4.15 How do I request a VCS agent not to online/offline a resource? By setting the MonitorOnly attribute of the resource to 1. This may be useful in running VCS in shadow mode.

16.4.16 How do I configure the agent to ignore "transient" faults? A resource is said to have FAULTED if it becomes OFFLINE unexpectedly.

You can configure the agent to ignore “transient” faults by setting the ToleranceLimit attribute of the resource type to a non-zero value. The default value of this attribute is 0. A non-zero ToleranceLimit allows the monitor Entry Point to return OFFLINE more than once, before the resource is declared FAULTED. If the monitor Entry Point reports OFFLINE for a greater number of times than ToleranceLimit within ConfInterval, the resource will be declared FAULTED.

This is similar to the “-tolerance” option in FirstWatch.

16.4.17 How do I configure the agent to automatically restart a resource on the local node when the resource faults? Set the RestartLimit attribute of the resource type to a non-zero value. The default value of this attribute is 0. Also, you must implement the clean Entry Point.

RestartLimit affects how the agent responds to a resource fault. A non-zero RestartLimit will cause VCS to invoke the online Entry Point instead of failing over the group to another node. VCS will attempt to restart the resource RestartLimit times within ConfInterval, before giving up and failing over.


This is similar to the “-restarts” option in FirstWatch.

16.4.18 What is ConfInterval? ConfInterval determines how long a resource must stay online without failing to be considered healthy. This value is used with RestartLimit to determine application health. For example, if RestartLimit is set to 2 and ConfInterval to 600, the resource is allowed to be restarted twice in ten minutes. Once the resource stays online for 10 minutes, the RestartLimit counter is reset. To continue the example, a resource initially starts at 10:00AM. It fails at 10:02AM and is restarted. It fails again at 10:08. It can then be restarted again (RestartLimit = 2). If it fails again before 10:18AM, the service group will be failed over to a peer system. The RestartLimit counter is only cleared when the resource stays online for ConfInterval. If the resource in this example stays online until 10:18, the RestartLimit counter is reset and the cycle can begin again.