DB2 HADR Automation – where is the magic?
Transcript of DB2 HADR Automation – where is the magic?
#IDUG
DB2 HADR Automation – where is the magic?
Dale McInnis, Gareth Holl, Danny Arnold (IBM Canada Ltd.)
Session Code: D15
Thursday November 13, 08:30 – 09:30 | Platform: LUW
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
DB2 9.5 Integrated High Availability
• HA Cluster Manager Integration
  • Coupling of DB2 and TSA on Linux and AIX, other platforms coming later
  • DB2 interface to configure the cluster
  • DB2 maintains the cluster configuration: add node, add tablespace, …
  • Exploitation of new vendor independent layering (VIL), providing support for any cluster manager
  • Eventually replace VIL with an industry standard, e.g. SAF?
• NO SCRIPTING REQUIRED!
  • One set of embedded scripts that are used by all cluster managers
• Automate HADR failover
  • Exploits the HA cluster manager integration previously described
Clustering Setup Pre-9.5
Overworked admin doing initial setup:
• Install cluster SW
• Install DB2
• Create DB2 instance & database
• Add nodes / partitions to DB2 instance
• Add each DB2 node's host to cluster domain
• Create cluster domain
• Design and create 'resource models'
• Refer to cluster management course material to figure out how to do this
• Take multiple courses on cluster management
• BRING OUT THE CONSULTANTS!
Overworked admin adding a new file system for DB2 (e.g. tablespace container or storage path):
• Add container or storage path to DB2 (ALTER)
• Remember to modify the cluster resource model to account for the new file system
Clustering Setup with 9.5
Relaxed admin doing initial setup:
• Install DB2 9.5
• Create DB2 instance & database
• Run the DB2 HA config. tool – db2haicu
• Add nodes / partitions to DB2 instance
Relaxed admin adding a new file system for DB2 (tablespace container or storage path):
• Add container or storage path to DB2 (ALTER)
Integrated DB2 architecture
[Diagram: DB2 commands such as db2start, db2stop, ADD NODE, and ALTER … ADD … flow through the integrated cluster layer, which keeps the cluster manager's configuration in step with DB2]
DB2 HA Feature: Installation
• DB2 ships the IBM Cluster Manager product (TSA)
  • DB2 9.5 GA included TSA 2.2.0.3 (AIX + Linux)
  • DB2 9.7 GA included TSA 3.1.0.0 (AIX, Linux + Solaris 10/SPARC)
  • DB2 10 GA included TSA 3.2.2.1
  • DB2 10.5 GA included TSA 3.2.2.4
  • Integrated packaging, installation, and update
• TSA is the default Cluster Manager when the HA feature is chosen
• Fixpacks include TSA updates; uninstalling removes /opt/IBM/db2/Vxx.x/ha/tsa/:
  db2Vxx_monitor.ksh, db2Vxx_start.ksh, db2Vxx_stop.ksh
  hadrVxx_monitor.ksh, hadrVxx_start.ksh, hadrVxx_stop.ksh
• These scripts can be upgraded when a DB2 fixpack is installed
DB2 HA Feature: Configuration
• Distinct failover setups possible:
  • DB2 shared disk HA configuration
  • DB2 HADR configuration (automated failover)
  • DB2 DPF HA configuration with shared disk
HADR Architecture
[Diagram: On the primary server, the DB2 engine's log writer writes log pages to the logs while HADR ships them over TCP/IP to the standby server. On the standby, HADR passes the log pages to a shredder, which feeds log records to a replay master and a set of redo/replay slaves that apply them to the standby's tables, indexes, and logs. A log reader on each side services the current and old logs.]
Why the need for TSAMP?
• HADR does not perform active monitoring of the topology
• HADR will not detect a node outage or NIC failure
• HADR cannot take automated actions in the event of a failed primary instance, node outage, or NIC failure
• Instead, a DB administrator must monitor the HADR pair manually and issue the appropriate takeover commands in the event of a primary database interruption
• This is where TSAMP's automation capabilities come into play:
  • TSAMP can perform restart actions if an instance unexpectedly exits
  • TSAMP can perform an HADR takeover automatically when certain problems are detected on the primary server
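Without automation, the takeover is a manual command issued on the standby. A minimal sketch of the two forms of the command an administrator would run (here `db2` is a stub so the sketch is self-contained; on a real standby server it is the DB2 command line processor, and HADRDB is a placeholder database name):

```shell
#!/bin/sh
# Sketch only: the commands a DBA must issue by hand without TSAMP.
# "db2" is stubbed here so the script runs standalone.
db2() { echo "db2 $*"; }

# Graceful role switch, possible while the pair is in peer state:
graceful=$(db2 takeover hadr on db HADRDB)

# Forced takeover when the primary is unreachable (risk of transaction
# loss depending on the HADR synchronization mode):
forced=$(db2 takeover hadr on db HADRDB by force)

echo "$graceful"
echo "$forced"
```

TSAMP automates exactly this decision: detecting the failure and choosing when a forced takeover is safe.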
Introduction
• DB2 provides a High Availability Disaster Recovery (HADR) feature that keeps a primary and a standby database synchronized and allows an administrator to switch control to the standby DB2 server
• DB2 provides a set of scripts that TSAMP uses to start, stop, and monitor each of the DB2 resources; this is the primary link between the two products
• DB2 provides a utility called 'db2haicu' that is used to define the domain and automation policy within TSAMP (the initial setup):
  • The automation policy is the set of definitions of all resources, resource groups, and the relationships between them
  • The resource definitions contain attributes that specify which DB2 start, stop, and monitor scripts (the automation scripts) to use for a particular resource
• TSAMP can be used to monitor an application's resources and automate the starting, stopping, and failover of those resources; it attempts to maintain a desired operational state
Software Summary
• Each of the following software products/components needs to be installed on both systems (primary and standby servers):
  • DB2 v10.1 (10.1.0.3 was the latest available at the time this deck was written)
  • TSAMP v3.2.2 (Fixpack 7 (3.2.2.7) or later recommended)
  • RSCT v3.1.5.2 (installed as part of a TSAMP installation)
• Installation of DB2 v10.1 includes the DB2 automation policy scripts in /opt/IBM/db2/V10.1/ha/tsa/:
  • db2V10_monitor.ksh, db2V10_start.ksh, db2V10_stop.ksh
  • hadrV10_monitor.ksh, hadrV10_start.ksh, hadrV10_stop.ksh
  • lockreqprocessed
• These scripts can be upgraded when a DB2 fixpack is installed
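TSAMP drives each DB2 resource through these start/stop/monitor scripts, and the contract is the monitor script's exit code: in RSCT's convention, 1 reports the resource Online and 2 reports it Offline. A rough, self-contained sketch of that contract (the DB2 probe is stubbed; the real db2V10_monitor.ksh receives the instance name and partition number as arguments and probes the actual instance):

```shell
#!/bin/sh
# Sketch of the monitor-script contract used by the policy scripts above.
db2_instance_is_up() { return 0; }   # stub: pretend the instance is running

monitor_sketch() {
  instance=$1
  partition=$2
  if db2_instance_is_up "$instance" "$partition"; then
    return 1   # RSCT convention: exit code 1 means the resource is Online
  else
    return 2   # exit code 2 means the resource is Offline
  fi
}

rc=0
monitor_sketch db2inst1 0 || rc=$?
```

TSAMP runs the monitor on a fixed period (see the MonitorCommandPeriod attribute later in this deck) and reacts when the reported state diverges from the nominal state.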
Software Summary (continued …)
• TSAMP/RSCT does not need to be installed on a third node to maintain quorum
  • A Network TieBreaker can be used instead (db2haicu calls this a quorum device)
• The license file for TSAMP is included with the DB2 Activation zip file, available via your Passport Advantage account
  • If a base level of TSAMP is not installed, the license file will need to be manually installed
  • TSAMP can be silently installed by the DB2 installer, but if a base level of DB2 is not installed, then again the TSAMP license will need to be manually installed
• See the TSAMP formal documentation for platform compatibility & dependencies :
For TSAMP v3.2.2.7 Release Note:
http://pic.dhe.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_3.2.2/HALRN329.pdf
For TSAMP v4.1 Installation and Configuration Guide:
http://pic.dhe.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_4.1/HALICG41.pdf
Example 1 of a DB2 HADR environment
[Diagram: Primary and standby servers each run a DB2 instance and database, with DB2, HADR, TSAMP, and RSCT installed. Each server has a single NIC (eth0) on the public network; client apps connect through a virtual IP (eth0:0). Both the cluster (RSCT) heartbeat and the HADR replication of DB2 transactions travel over the public network.]
HADR replication via the public network
Example 2 of a DB2 HADR environment
[Diagram: As in Example 1, but each server has a second NIC (eth1) connected via a switch to a private network. HADR replication of DB2 transactions travels over the private network, while client apps and the virtual IP (eth0:0) use the public network; the cluster (RSCT) heartbeat runs on both networks.]
HADR replication via a private network
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
TSAMP Architectural Overview
Terminology
• Peer Domain: A cluster of servers, or nodes, for which TSA is responsible.
• Resource: Hardware or software that can be monitored or controlled. Resources can be fixed or floating; floating resources can move between nodes.
• Resource group: A virtual group or collection of resources.
• Relationships: Describe how resources work together. A start-stop relationship creates a dependency (see below) on another resource. A location relationship applies when resources should be started on the same or different nodes.
• Dependency: A limitation on a resource that restricts operation. For example, if resource A depends on resource B, then resource B must be online for resource A to be started.
• Equivalency: A set of fixed resources of the same resource class that provide the same functionality.
• Quorum: A cluster has quorum when it can form a majority among its nodes. The cluster can lose quorum when there is a communication failure and sub-clusters form with an even number of nodes.
• Nominal State: The desired state of a resource, either Online or Offline; it can be changed so that TSA will bring a resource online or shut it down.
• Tie Breaker: Used to maintain quorum, even in a split-brain situation. A tie-breaker allows sub-clusters to determine which set of nodes will take control of the domain.
• Failover: When a failure (typically hardware) causes resources to be moved from one machine to another, the resources are said to have "failed over".
Quorum and TieBreaker
One of the questions from the 'db2haicu' utility deals with a cluster/automation concept called Quorum.
Quorum
• The number of nodes in a cluster that are required to control the resources, modify the cluster definition, or perform certain cluster operations.
• The main goals of quorum operations:
  • identify who has the majority when a cluster is broken up into sub-clusters
  • keep data consistent, especially when shared file systems are being used
  • protect critical resources … maintain HA control
• Two types: configuration vs. operational quorum
• Note: "configuration" quorum requires a majority of nodes (more than half the number of nodes) to be Online for configuration changes to be carried out.
TieBreaker
• A TieBreaker situation occurs when a cluster is split into sub-clusters with equal numbers of nodes.
• A TieBreaker determines which sub-cluster will have operational quorum in a tie situation.
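The "majority of nodes" rule is simple arithmetic, and a two-node sketch shows why a two-node HADR domain always needs a tie-breaker:

```shell
#!/bin/sh
# "More than half the number of nodes" for configuration quorum.
nodes=2
majority=$(( nodes / 2 + 1 ))   # 2 of 2 in a two-node domain

# Losing one node leaves a single survivor, which is below the majority:
# no majority is possible, which is exactly what the tie-breaker resolves.
surviving=1
if [ "$surviving" -ge "$majority" ]; then quorum=yes; else quorum=no; fi
```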
db2haicu: Quorum and TieBreaker (continued …)
• Disk TieBreaker: not supported by db2haicu
• NFS TieBreaker: not supported by db2haicu
• Network TieBreaker:
  • The goal is for each system to figure out (via the RSCT infrastructure) which one is operational and should therefore take control (if not already the active node).
  • Use a pingable system independent of node1 and node2.
  • Without an active TieBreaker, automated failover/takeover will NEVER occur.
  • The network tie breaker should be used only for domains where all nodes are in the same IP subnet.
  • Choose an IP address that can be reached through only a single path from each node in the domain.
The following slides demonstrate how a TieBreaker works …
Base Tie-Breaker Functionality
[Diagram: node1 (eth0 10.20.30.40) and node2 (eth0 10.20.30.41) on subnet 10.20.30.0, with a gateway router at 10.20.30.1]

Node Failure Scenario:
1. System node1 fails
1a. System node2 gets quorum using the network tiebreaker

Network Adapter Failure Scenario:
1. Network problem affecting node1
2a. Again node2 gets quorum using the network tiebreaker
2b. System node1 is forced to reboot

Network Tiebreaker Assumption:
If node1 can communicate (ping) with the gateway and node2 can communicate (ping) with the gateway, THEN node1 must be able to communicate (heartbeat) with node2.
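The decision in these scenarios reduces to a reachability test against the tie-breaker address. A sketch of one sub-cluster's view (here `can_reach` is a stub; RSCT performs the real ICMP check and tie-breaker reservation internally, and 10.20.30.1 is the example gateway from the diagrams):

```shell
#!/bin/sh
# Sketch of a sub-cluster's network tie-breaker decision.
TIEBREAKER_IP=10.20.30.1
can_reach() { [ "$1" = "10.20.30.1" ]; }   # stub: pretend the gateway answers

if can_reach "$TIEBREAKER_IP"; then
  verdict="has operational quorum"    # this side keeps/takes control
else
  verdict="loses quorum"              # resources stopped; node may be rebooted
fi
```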
System Automation – Components
Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of two Resource Managers:
• IBM.RecoveryRM – the automation engine. Owns IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource, IBM.ManagedRelationship
• IBM.GblResRM – starts, stops, and monitors resources of the classes IBM.Application and IBM.ServiceIP
Reliable Scalable Cluster Technology (RSCT), the "Cluster" software, is made up of some core daemons and some Resource Managers, the most important being the following two:
• IBM.ConfigRM – configuration tasks across the nodes in the domain, including quorum and TieBreaker functionality
• IBM.StorageRM – maps IBM.AgFileSystem resources to IBM.LogicalVolume to IBM.VolumeGroup to IBM.Disk, and manages the mount/umount of the file systems and the varyon/varyoff of the Volume Groups
Resource Manager – Resource Class Composition
• IBM.ConfigRM – cluster configuration
  • IBM.CommunicationGroup
  • IBM.HeartbeatInterface – manage heartbeats
  • IBM.NetworkInterface
  • IBM.PeerDomain
  • IBM.PeerNode
  • IBM.RSCTParameter
  • IBM.TieBreaker
• IBM.StorageRM – storage
  • IBM.AgFileSystem
  • IBM.Disk
  • IBM.LogicalVolume
  • IBM.Partition
  • IBM.VolumeGroup
Resource Manager – Resource Class Composition
• IBM.RecoveryRM
  • Is the decision engine
  • Runs on every node, with one being designated as the master
  • Responsible for evaluating the monitoring information that is supplied by the various resource managers, such as the Storage RM and the Global Resource RM
  • Drives the decisions that result in start or stop operations on the resources as needed
• IBM.GblResRM
  • IBM.Application – manage applications
  • IBM.ServiceIP – manage IP addresses
System Automation – Components
Everything is a "Resource" in the TSAMP and RSCT world:
• There are different kinds of resources, and that is where we introduce the concept of a resource "class".
• There are different Resource Managers, each responsible for managing or controlling resources that belong to a particular set of resource classes.
• [Diagram: mapping of three key Resource Managers to some Resource Classes they manage, and then to some example Resources]
System Automation – Components
• Consider the servers that make up the cluster … they are also resources, of the class IBM.PeerNode.
• The domain itself is a resource, of class IBM.PeerDomain. The network interfaces are resources, of class IBM.NetworkInterface.
• [Diagram: the IBM.ConfigRM Resource Manager (part of RSCT), two Resource Classes it manages, and the Resources modelled by those classes]
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
Mapping DB2 HADR components to TSAMP resources
[Diagram: On node1 and node2, each DB2 instance resource (db2_db2inst1_node1-rs / db2_db2inst1_node2-rs) sits in its own resource group (db2_db2inst1_node1-rg / db2_db2inst1_node2-rg). The floating HADR resource db2_db2inst1_db2inst1_HADRDB-rs and the optional floating virtual IP resource db2ip_10_20_30_42-rs share the resource group db2_db2inst1_db2inst1_HADRDB-rg. Equivalencies db2_public_network_0 (eth0 on both nodes) and db2_private_network_0 (eth1, optional) model the networks; dependsOn relationships tie the resources to the public network equivalency.]
Mapping DB2 Components to TSAMP Resources
• A DB2 instance called "db2inst1" on a server called "node1" maps to a TSAMP managed resource called "db2_db2inst1_node1_0-rs"
• A DB2 instance called "db2inst1" on a server called "node2" maps to a TSAMP managed resource called "db2_db2inst1_node2_0-rs"
• A DB2 HADR database called "HADRDB" whose primary and standby instances are both named "db2inst1" maps to a TSAMP managed resource called "db2_db2inst1_db2inst1_HADRDB-rs"
• The virtual IP address (optional) maps to a TSAMP managed resource called "db2ip_10_20_30_42-rs", where 10_20_30_42 is the virtual IP address
• A public network can be defined, and this maps to a TSAMP resource (Equivalency) called "db2_public_network_0"
• Note: there is no need to define a private network …
  • TSAMP does not manage anything related to the private network; there are no dependencies on it, so there is no need for it! Just say "no" to db2haicu
  • You can still have an actual private network for HADR replication; it is totally independent of TSAMP
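The resource names above follow a consistent pattern. A sketch of the convention as inferred from these examples (inferred, not an official API; the variables are the example names from this deck):

```shell
#!/bin/sh
# Naming convention inferred from the examples on this slide.
instance=db2inst1
host=node1
partition=0
db=HADRDB
vip=10.20.30.42

inst_rs="db2_${instance}_${host}_${partition}-rs"    # per-instance resource
hadr_rs="db2_${instance}_${instance}_${db}-rs"       # primary inst + standby inst + db
ip_rs="db2ip_$(echo "$vip" | tr '.' '_')-rs"         # dots in the VIP become underscores
```

Knowing the pattern makes it easy to target a specific resource with lsrsrc, lsrg, or chrsrc later in the deck.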
IBM.Application class – display attributes of a resource/resource class
# lsrsrc -s "Name = 'db2_db2inst1_node1_0-rs'" -Ab IBM.Application
Name                  = "db2_db2inst1_node1_0-rs"
StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh db2inst1 0"
StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V10_stop.ksh db2inst1 0"
MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
MonitorCommandPeriod  = 10
MonitorCommandTimeout = 120
StartCommandTimeout   = 330
StopCommandTimeout    = 140
UserName              = "root"
RunCommandsSync       = 1
ProtectionMode        = 1
ActivePeerDomain      = hadr_domain
NodeNameList          = {"node1"}
OpState               = 1
IBM.Application class – example of a DB2 HADR resource
# lsrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs'" -Ab IBM.Application
Name                  = "db2_db2inst1_db2inst1_HADRDB-rs"
StartCommand          = "/usr/sbin/rsct/sapolicies/db2/hadrV10_start.ksh db2inst1 db2inst1 HADRDB"
StopCommand           = "/usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh db2inst1 db2inst1 HADRDB"
MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/hadrV10_monitor.ksh db2inst1 db2inst1 HADRDB"
MonitorCommandPeriod  = 21
MonitorCommandTimeout = 29
StartCommandTimeout   = 330
StopCommandTimeout    = 140
UserName              = "root"
RunCommandsSync       = 1
ProtectionMode        = 1
ActivePeerDomain      = hadr_domain
NodeNameList          = {"node1","node2"}
OpState               = 1
IBM.ServiceIP class – Virtual IP addresses
# lsrsrc -Ab IBM.ServiceIP
Name             = "db2ip_10_20_30_42-rs"
IPAddress        = "10.20.30.42"
NetMask          = "255.255.255.0"
ProtectionMode   = 1
ActivePeerDomain = "hadr_dom"
NodeNameList     = {"node1","node2"}
OpState          = 1
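The numeric OpState values shown by lsrsrc map to the usual RSCT state names (values per the RSCT documentation; the helper itself is just an illustration, not a real tool):

```shell
#!/bin/sh
# Map a numeric OpState (as printed by lsrsrc) to its RSCT state name.
opstate_name() {
  case "$1" in
    0) echo "Unknown" ;;
    1) echo "Online" ;;
    2) echo "Offline" ;;
    3) echo "Failed Offline" ;;
    4) echo "Stuck Online" ;;
    5) echo "Pending Online" ;;
    6) echo "Pending Offline" ;;
    *) echo "Invalid" ;;
  esac
}

state=$(opstate_name 1)   # OpState = 1 in the output above
```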
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
Use db2haicu to configure TSA for HADR automation
db2haicu [ -f <XML-input-file-name> ]
         [ -disable ]
         [ -delete [ dbpartitionnum <db-partition-list> | hadrdb <database-name> ] ]
• Default is to run in interactive mode
• Follow the instructions in https://www.ibm.com/developerworks/mydeveloperworks/blogs/nsharma/resource/HADR-db2haicu_v5.pdf
  Table of contents:
  Part 1: DB2 configuration
  Part 2: TSA Cluster setup
  Part 3: Miscellaneous tasks / Diagnostics
  Part 4: Remove TSA/HADR configuration
  Part 5: Automatic client reroute (ACR)
Using 'db2haicu' to Automate HADR Failover
Step 1. Run the following command as root on each node to configure the RSCT ACLs (security) and allow cluster communication between the servers:
root@node1# preprpnode node1 node2
root@node2# preprpnode node1 node2
Step 2. Log on to the standby server as the instance owner and issue:
db2inst1@node2> db2haicu
• The db2haicu tool will determine the current instance and apply all cluster configuration steps based on it. It will also activate all databases for the instance as it attempts to gather information.
• Next, db2haicu will determine if a domain has already been created by searching for an "Online" domain. If it doesn't find one, you will see the following:
db2haicu: Create a new domain
[Screenshot: db2haicu prompts to create a new domain with two nodes]
db2haicu: List the new domain
• At this point there would be a domain called "hadr_domain" in an online state:
root@node1# lsrpdomain
Name        OpState RSCTActiveVersion MixedVersions TSPort GSPort
hadr_domain Online  3.1.2.1           No            12347  12348
• You can also list the states of the individual nodes and see output similar to the following, from either server:
root@node1# lsrpnode
Name  OpState RSCTVersion
node1 Online  3.1.2.1
node2 Online  3.1.2.1
db2haicu: Quorum and TieBreaker
The next db2haicu question deals with the creation of a Network TieBreaker.
At this point you could list the TieBreaker resources and see the new network TieBreaker:
root@node1# lsrsrc -Ab IBM.TieBreaker
The following command should show that your new network TieBreaker is currently active:
root@node1# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
db2haicu: Network Equivalencies
Special TSAMP groups called Equivalencies are created containing the network interfaces found on each of the servers in the cluster. This allows TSAMP to be notified of NIC failures by the RSCT subsystem (which harvested the NICs) and react accordingly.
• We use db2haicu to create an equivalency called db2_public_network_0 and populate it with the en0 NIC from the server called "node1":
db2haicu: Network Equivalencies (continued …)
• Next we add the en0 NIC from the other server, “node2”, to the same equivalency (db2_public_network_0) :
db2haicu: Private Network
In the previous slide, notice the option to say "Yes" or "No" when adding NICs to a network.
When asked if a non-public NIC should be added to a private network, I recommend you choose "No".
So do not create a private network equivalency via db2haicu, even if your DB2 HADR environment does use a private network for HADR replication data.
db2haicu: Private Network (continued …)
If your DB2 environment uses LDAP for authentication and you have multiple NICs per server (e.g. a private network), then disable the RSCT cluster heartbeat for all NICs not in the public network:
• Identify the Communication Group that contains the non-public NICs:
  lsrsrc -Ab IBM.NetworkInterface Name IPAddress CommGroup HeartbeatActive NodeNameList
• Change HeartbeatActive to 0 to disable heartbeating for a CommGroup:
  chrsrc -s "CommGroup=='CG2'" IBM.NetworkInterface HeartbeatActive=0
See the following technote for more details on cluster heartbeat settings and Communication Groups:
http://www.ibm.com/support/docview.wss?uid=swg21292274
db2haicu: Additional NICs per server
• For non-LDAP setups, it is recommended to have at least a second pair of NICs for cluster heartbeating, to reduce the likelihood of a forced reboot if there is a problem in the public network (or with the public NICs).
• For DB2 v9.7 environments, additional "dependsOn" relationships need to be manually added to the automation policy, from each HADR database resource to the public network equivalency.
• If db2haicu from v10.1.0.0, v10.1.0.1, or v10.1.0.2 is used to create the automation policy, all the necessary dependsOn relationships will be missing due to a bug in the db2haicu utility (fixed as of 10.1.0.3); they will need to be manually created.
• Refer to the following technote to obtain a script that can be used to create any missing relationships in either a DB2 v9.7 or v10 environment:
http://www.ibm.com/support/docview.wss?uid=swg21634431
• The dependsOn relationships between the HADR database resources and the Public Network Equivalency are recommended even if there is only one NIC per server, for both DB2 v9.7 and v10 environments.
db2haicu: Listing the Equivalency
• The network equivalency(ies) would be created at this point and can be listed as follows:
root@node1# lsequ -Ab
Displaying Equivalency information:
All Attributes
Equivalency 1:
        Name = db2_public_network_0
        MemberClass = IBM.NetworkInterface
        Resource:Node[Membership] = {en0:node1, en0:node2}
        SelectString = ""
        ActivePeerDomain = hadr_domain
        Resource:Node[ValidSelectResources] = {en0:node1, en0:node2}
db2haicu: Adding the database node to the Automation Policy
• The final part to running db2haicu on the standby server is setting the CLUSTER_MGR variable to “TSA” and then adding resources that represent the DB2 instance on the server where you’re running db2haicu:
db2haicu: Adding the database node to the Automation Policy
• Note in the previous screenshot that you won't be able to validate and automate the HADR database via db2haicu from the standby server. This is why the next part involves running db2haicu a second time, but from the current HADR primary server.
• At this point we can view a few more changes to the automation policy and the database manager's configuration:
db2inst1@node2> db2 get dbm cfg | grep -i cluster
 Cluster manager                         = TSA
root@node2# lsrg
Resource Group names:
db2_db2inst1_node2_0-rg
db2haicu: DB2 Standby Instance Resources
root@node2# lsrg -g db2_db2inst1_node2_0-rg
Displaying Member Resource information:
For Resource Group "db2_db2inst1_node2_0-rg".
Resource Group 1:
        Name = db2_db2inst1_node2_0-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_node2_0-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_node2_0-rg
        TopGroupNominalState = Online
• Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server this resource is allowed to run on … see the next slide for output showing that PeerNode Equivalency's details.
db2haicu: DB2 Standby Instance Dependencies
root@node2# lsequ -Ab
Equivalency 1:
        Name = db2_db2inst1_node2_0-rg_group-equ
        MemberClass = IBM.PeerNode
        Resource:Node[Membership] = {node2:node2.tivlab.raleigh.ibm.com}
        SelectString = ""
        SelectFromPolicy = ANY
        MinimumNecessary = 1
        ActivePeerDomain = hadr_domain
        Resource:Node[ValidSelectResources] = {node2:node2.tivlab.raleigh.ibm.com}
• This restricts the DB2 instance resource from the previous slide to being brought Online by TSAMP only on node2.
• This is fairly obvious, given it is the resource that represents the standby database partition.
db2haicu: DB2 Standby Instance Dependencies (continued …)
• At this point there would be one or two relationships defined in the automation policy, depending on how many network equivalencies you created:
root@node2# lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node2_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node2_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
• This shows us that the DB2 instance is dependent on the operational state of the NICs in the public network. If the NIC is Online, then TSAMP will be able to start the associated DB2 instance.
db2haicu: The Automation Policy so far …
• Let’s look at what resources and groups are listed in the ‘lssam’ output after completing the execution of ‘db2haicu’ on the standby server :
root@node2# lssam
[Screenshot: lssam output showing the standby instance resource group, its member resource, the PeerNode equivalency, and the public network equivalency]
Here we see that the DB2 instance for server “node2” is defined and within its own resource group.
There is a PeerNode equivalency which dictates which server the above instance is allowed to run on.
Finally, there is a Network Equivalency which contains the NICs for the public network … the DB2 instance would have a dependency relationship on this equivalency.
Using ‘db2haicu’ to Automate HADR Failover (continued …)
Step 3. Log on to the primary server as the instance owner and issue:
db2inst1@node1> db2haicu
• The db2haicu tool will determine the current instance and apply all cluster configuration steps based on it. It will also activate all databases for the instance as it attempts to gather information.
• Next, db2haicu will determine if a domain has already been created by searching for an "Online" domain. Since we've already run db2haicu on the standby server, an Online domain should already exist.
• You will then be asked to set the cluster manager.
db2haicu: Adding the database node to the Automation Policy
• db2haicu sets the CLUSTER_MGR variable to "TSA" within the local database manager's configuration:
db2inst1@node1> db2 get dbm cfg | grep -i cluster
 Cluster manager                         = TSA
• Please note that once the dbm is configured with “Cluster manager” set to TSA, the DB2 engine expects to have a domain Online. You will have issues stopping and starting the DB2 instance if no domain is Online.
• Run 'db2haicu -disable' on each DB2 server if you want to break the connection between DB2 and TSAMP. This is the only way to unset “Cluster manager” for DB2 v10.x
• Then db2haicu adds resources that represent the DB2 instance (the primary DB2 instance) on the server where you’re currently running db2haicu:
db2haicu: DB2 Primary Instance Resources
• At this point we can view a few more changes to the automation policy:
root@node1# lsrg
Resource Group names:
db2_db2inst1_node1_0-rg
db2_db2inst1_node2_0-rg
root@node1# lsrg -g db2_db2inst1_node1_0-rg
Displaying Member Resource information:
For Resource Group "db2_db2inst1_node1_0-rg".
Resource Group 1:
        Name = db2_db2inst1_node1_0-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_node1_0-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_node1_0-rg
        TopGroupNominalState = Online
• Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server this resource is allowed to run on … similar to the other DB2 instance resource group, but with a different server name.
db2haicu: Dependencies for the DB2 Instances
• Now there would be additional relationships defined in the automation policy:
root@node1# lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node2_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node2_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
Managed Relationship 2:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node1_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node1_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
• This shows us that the DB2 instances are both dependent on the operational state of the NICs in the public network. If the NICs are Online, then TSAMP will be able to start the associated DB2 instances … it also means if either NIC goes offline for any reason, the local DB2 instance will be stopped by TSAMP.
db2haicu: The Automation Policy so far …
• Let’s take another look at what resources and groups are listed in the ‘lssam’ output after ‘db2haicu’ has added both standby and primary database partitions :
root@node1# lssam
[Screenshot: lssam output now listing resource groups for the DB2 instances on both servers]
There's now a resource and group for the DB2 instance on server "node1".
There's now another PeerNode equivalency … it forces the "db2_db2inst1_node1_0-rg" resource group to run on "node1" only.
db2haicu: Adding the HADR database to the Automation Policy
• Validating and automating HADR failover can only be done from the current primary server and only after successfully running db2haicu on the standby server.
• You may also want to add a virtual IP address for this HADR database
62
#IDUG
db2haicu: HADR Database Resources
# lsrg -g db2_db2inst1_db2inst1_HADRDB-rg
Displaying Resource Group information:
For Resource Group "db2_db2inst1_db2inst1_HADRDB-rg".
Resource Group 1:
Name = db2_db2inst1_db2inst1_HADRDB-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_db2inst1_HADRDB-rg
TopGroupNominalState = Online
• Note the AllowedNode attribute. It points to a PeerNode Equivalency containing the servers "node1" and "node2", which dictates which servers the HADR database can reside on. This mirrors the setup for the two DB2 instance resource groups, which also use the AllowedNode attribute with their own PeerNode Equivalencies; in this case, however, the HADR resource is a floating resource with two servers to choose from.
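• These "attribute = value" listings are script-friendly. As a hedged sketch (the sample lines below are transcribed from the 'lsrg' output on this slide; 'get_attr' is a hypothetical helper, not a TSAMP command), you could pull a single attribute out of 'lsrg -g' output, e.g. to verify NominalState before a maintenance window:

```shell
# Hypothetical helper: print the value of one "Attr = Value" line.
get_attr() {
  awk -F' = ' -v a="$1" '$1 == a { print $2; exit }'
}

# Sample attribute lines, transcribed from the lsrg output above.
sample='Name = db2_db2inst1_db2inst1_HADRDB-rg
AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
NominalState = Online
OpState = Online'

printf '%s\n' "$sample" | get_attr NominalState   # prints: Online
```

On a live cluster this would be fed from 'lsrg -g <group>' directly (real output indents the attribute names, so a trim step may be needed).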
63
#IDUG
db2haicu: HADR Database Resources
# lsrg -m -g db2_db2inst1_db2inst1_HADRDB-rg
Displaying Member Resource information:
For Resource Group "db2_db2inst1_db2inst1_HADRDB-rg".
Member Resource 1:
Class:Resource:Node[ManagedResource] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
Member Resource 2:
Class:Resource:Node[ManagedResource] = IBM.ServiceIP:db2ip_10_20_30_42-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
64
#IDUG
db2haicu: Dependency for the HADR DB Resource
• Now there would be an additional relationship defined in the automation policy:
root@node1# lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
[...]
Managed Relationship 3:
Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
Relationship = DependsOn
Conditional = NoCondition
Name = db2_db2inst1_db2inst1_HADRDB-rs_DependsOn_db2_public_network_0-rel
ActivePeerDomain = hadr_domain
• This shows us that the HADRDB resource is dependent on the operational state of the NICs in the public network.
If the NICs are Online, then TSAMP will be able to bring the associated HADR db resource online. It also means that if either NIC goes offline for any reason, the constituent of the HADR db resource local to the offline NIC will be taken offline by TSAMP (if it is currently online) … this could trigger a failover.
65
#IDUG
db2haicu: The Complete Automation Policy …
• Let's look at what resources and groups are listed in the 'lssam' output after completing the execution of 'db2haicu' on both servers:
root@node1# lssam
66
The resources and group for the HADR database and virtual IP address have been added, as has a new PeerNode Equivalency containing servers “node1” and “node2”.
#IDUG
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
67
#IDUG
Starting and Stopping DB2 resources
• Now that the DB2 instances are managed by TSAMP, they cannot be started with the db2start command unless the "Nominal" (desired) state of the resource group that contains the DB2 instance resource is set to "Online". The following is an example:
# chrg -o online <Resource_Group>
• Changing the desired states of the resource groups will instruct TSAMP to start/stop the resources using the scripts defined in the "StartCommand" and "StopCommand" attributes of a resource of class "IBM.Application", as is the case for a DB2 instance resource.
• To change the desired state of multiple resource groups with similar names (for example, to start in parallel all DB2 resource groups where the instance name on each server starts with "db2inst"), use the following syntax:
# chrg -o online -s "Name like 'db2_db2inst_%'"
• Another example: to take the HADR resource group offline, including removal of the currently assigned virtual IP address:
# chrg -o offline db2_db2inst1_db2inst1_HADRDB-rg
• Note: After offlining just the HADR resource group, the HADR pair will remain in a peer connected state even though shown as Offline on both servers when viewed as a TSAMP resource!
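• Because of that, 'lssam' alone cannot tell you whether the pair is still in Peer; 'db2pd' can. A small sketch ('hadr_state' is a hypothetical helper; the sample lines are abbreviated from the db2pd output shown later in this deck) that extracts the Role/State columns:

```shell
# Hypothetical helper: print "Role/State" from db2pd -hadr output lines.
hadr_state() {
  awk '$1 == "Primary" || $1 == "Standby" { print $1 "/" $2; exit }'
}

# Sample output, abbreviated from a later slide in this deck.
sample='Role    State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Primary Peer   Sync      0                 0'

printf '%s\n' "$sample" | hadr_state   # prints: Primary/Peer
```

On a live system this would be fed from 'db2pd -hadr -db <db_name>' run as the instance owner.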
68
#IDUG
Starting and Stopping DB2 resources (continued …)
• A couple of things to note:
• It is good practice (and potentially less prone to startup issues) to have the Nominal (desired) State of all the Resource Groups set to Offline prior to stopping the domain. This will allow the domain to be started at some point in the future without any start actions being taken against all DB2 resources simultaneously.
• Offline individual nodes in the cluster using ‘stoprpnode <node_name>’ before shutting down or rebooting that server for maintenance.
• I'd suggest starting both DB2 instances simultaneously (changing their Resource Groups' Nominal State to Online), but waiting until both are completely online before changing the Nominal State of the HADR resource group to Online.
• Starting both instances should result in the HADR pair reaching a peer connected state, even though the HADR resource group’s Nominal state may still be set to Offline. However, while the HADR resource group is set to Offline, the Virtual IP address is truly offline (not assigned to any NIC so not available to communicate through), AND no automated failover actions will occur.
• If the HADR pair does not reach a peer connected state after both instances have successfully started, troubleshoot this as a DB2 problem. Once corrected and back in Peer, proceed to start the HADR resource group.
• If you go ahead and attempt to start the HADR resource group when not in peer, the HADR resource will likely end up in a Failed Offline state on one or both servers requiring manual reset actions.
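• The ordering advice above can be scripted. A minimal sketch ('wait_online' is a hypothetical helper; the state-query command is injectable so the logic can be exercised without a cluster; in practice it would wrap a check of 'lssam' output for each instance resource):

```shell
# Hypothetical helper: poll a state-query command until it prints "Online".
# Args: query command, max tries (default 60), seconds between tries (default 5).
wait_online() {
  query_cmd="$1"; tries="${2:-60}"; interval="${3:-5}"
  n=0
  while [ "$n" -lt "$tries" ]; do
    [ "$("$query_cmd")" = "Online" ] && return 0
    n=$((n + 1))
    sleep "$interval"
  done
  return 1
}

# Intended use (resource names are the examples from this deck; the two
# check_* query commands are hypothetical wrappers around lssam output):
#   wait_online check_node1_instance && wait_online check_node2_instance \
#     && chrg -o online db2_db2inst1_db2inst1_HADRDB-rg
```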
69
#IDUG
Starting DB2 Instances & HADR Database
• Start the underlying domain if not already online:
# startrpdomain hadr_domain
• Start the primary and standby instances simultaneously:
# chrg -o online -s "Name like 'db2_db2inst1_node%'"
• The above assumes both instances are named "db2inst1" and the servers have hostnames starting with "node"
• Check that both instances reach Online states. Do not proceed until both DB2 instances have come online. Confirm using “lssam –top” and “db2_ps” (Run “db2_ps” as the DB2 instance owner on each node)
• The DB2 start scripts used to start the instances will also activate the databases, resulting in the HADR pair establishing a peer connected state. So confirm that the HADR pair have reached peer state by running the following on each DB2 node:
# db2pd -hadr -db hadrdb
• If HADR is not active, then manually bring the HADR pair into peer state as follows:
a. On the designated standby node:
# db2 start hadr on db hadrdb as standby
b. On the designated primary node:
# db2 start hadr on db hadrdb as primary
Repeat for all HADR databases. Again check the state of the HADR pair before proceeding
70
#IDUG
Starting DB2 Instances & HADR DB (continued..)
• As instance owner, ensure that the HADR pair is in “Peer” state (on both nodes) as follows:
# db2pd -hadr -db hadrdb
You should see output similar (abbreviated) to the following on the primary server:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:13:36
Role     State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Primary  Peer   Sync      0                 0
ConnectStatus  ConnectTime                            Timeout
Connected      Tue Jul 8 17:47:12 2008 (1215553632)   120
You should see output similar (abbreviated) to the following on the standby server:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:12:51
Role     State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Standby  Peer   Sync      0                 0
ConnectStatus  ConnectTime                            Timeout
Connected      Tue Jul 8 17:48:23 2008 (1215552946)   120
• Finally, change the HADR resource group to Online:
# chrg -o online db2_db2inst1_db2inst1_HADRDB-rg
• This last step will cause the virtual IP address (if the policy includes one) to be assigned.
71
#IDUG
Taking Standby Instance Offline
• Because the database is active, the force option is required for the db2stop command:
db2inst1@node2> db2stop force
• DB2 will also request that TSAMP lock the instance group to prevent TSAMP from trying to restart the instance. The HADR group will also get locked … this ALWAYS happens when the HADR pair are no longer in a Peer state.
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Pending online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
        '- Offline IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
                '- Offline IBM.Application:db2_db2inst1_node2_0-rs:node2
• To restart the instance:
db2inst1@node2> db2start
• This will result in TSAMP executing the 'db2Vxxx_start.ksh' script, which is also responsible for activating the HADR database; HADR re-integration then takes place and peer state results.
72
#IDUG
Taking Primary Instance Offline
• Because the database is active, the force option is required for the db2stop command:
db2inst1@node1> db2stop force
• DB2 will also request that TSAMP lock the instance group to prevent TSAMP from trying to restart the instance. The HADR group will also get locked … this ALWAYS happens when the HADR pair are no longer in a Peer state.
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
                |- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Pending online IBM.ResourceGroup:db2_db2inst1_node1_0-rg Request=Lock Nominal=Online
        '- Offline IBM.Application:db2_db2inst1_node1_0-rs Control=SuspendedPropagated
                '- Offline IBM.Application:db2_db2inst1_node1_0-rs:node1
• To restart the primary instance:
db2inst1@node1> db2start
• This will result in TSAMP executing the 'db2Vxxx_start.ksh' script, which should activate the db
• The 'hadrVxxx_start.ksh' script will then be executed and peer state should be re-established.
73
#IDUG
Performing a Manual Takeover (Controlled Failover)
• Because the DB2 instances are cluster aware in v9.5+, you can use the native DB2 takeover command. In fact, you should only use the DB2 takeover command, as follows (issued as instance owner on the current standby server):
db2inst1@node2> db2 takeover hadr on database HADRDB
• The HADR resource group will be locked and unlocked several times. There will also be a move request at some point.
• ‘lssam’ will show the online/offline states swapped for the HADR resource and ServiceIP, assuming the takeover is successful :
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
                |- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
• Use the following command to check the HADR role has swapped between the nodes and ensure the HADR pair have reached a peer state again :
db2inst1@node2> db2pd -hadr -db <hadr_db_name>
74
#IDUG
The wrong way to attempt a Controlled Failover
• If you attempt to move the HADR resource group using the 'rgreq -o move db2_db2inst1_db2inst1_HADRDB-rg' command, the failover/takeover may not succeed and you could end up with the HADR resource showing "Failed Offline" on the standby server.
• During the attempted move, “lssam” (lssam –top) would show:
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Move Nominal=Online
|- Pending online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
|- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
'- Pending online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
|- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
75
#IDUG
The wrong way to attempt a Controlled Failover (continued)
• If the move attempt fails, 'lssam' will show the virtual IP address moved back to the original primary, and the HADR resource on the standby will be set to Failed Offline, requiring a manual reset before any further failovers/takeovers will be possible:
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
• To reset the Failed Offline state, use the following TSAMP command:
resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' & NodeNameList={'node2'}" IBM.Application
• This will cause the hadrV10_stop.ksh script to be executed on node2; if successful (return code 0), the Operational State will change to "Offline".
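• Since the resource and node names must be typed exactly, a hedged sketch ('suggest_reset' is a hypothetical helper; the sample line is transcribed from the lssam output above) that derives the resetrsrc command from the "Failed offline" line itself:

```shell
# Hypothetical helper: for each "Failed offline" IBM.Application constituent in
# lssam output, print the matching resetrsrc command.
suggest_reset() {
  awk '/Failed offline IBM.Application:/ {
    split($NF, p, ":")   # p[2] = resource name, p[3] = node name
    printf "resetrsrc -s \"Name = %c%s%c & NodeNameList={%c%s%c}\" IBM.Application\n", \
           39, p[2], 39, 39, p[3], 39
  }'
}

sample="'- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2"
printf '%s\n' "$sample" | suggest_reset
# prints: resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' & NodeNameList={'node2'}" IBM.Application
```

On a live cluster this would be fed from 'lssam' directly; review the printed command before running it.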
76
#IDUG
Failure Scenarios
77
• The various failover scenarios supported by this solution are detailed in section 6 of a whitepaper called “Automated Cluster Controlled HADR (High Availability Disaster Recovery) Configuration Setup using the IBM DB2 High Availability Instance Configuration Utility (db2haicu) ”
• This whitepaper can be downloaded via the following URL:http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf
• The following scenarios result in automated actions, including failovers/takeovers:
1. Standby Instance Failure
2. Primary Instance Failure
3. Standby NIC Failures (public network)
4. Primary NIC Failures (public network)
5. Standby Node Failure
6. Primary Node Failure
#IDUG
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
78
#IDUG
Disable/Re-enable HA/Automated Failover (using db2haicu)
• To prevent TSAMP from taking any action on DB2 resources, disable HA:
db2inst1@node2> db2haicu -disable
79
The local database manager's configuration will be updated so that "Cluster manager" is unset. The 'db2haicu -disable' command also needs to be executed on the other server so that its instance configuration is updated as well.
With "Cluster manager" unset, you can take the entire domain Offline without affecting the manual operation of the DB2 instances.
#IDUG
Disable/Re-enable HA/Automated Failover (Continued …)
80
As part of the -disable process, DB2 will request that TSAMP lock all Resource Groups to prevent TSAMP from taking any action against DB2 resources:
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Online IBM.ResourceGroup:db2_db2inst1_node1_0-rg Request=Lock Nominal=Online
        '- Online IBM.Application:db2_db2inst1_node1_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_node1_0-rs:node1
Online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
        '- Online IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_node2_0-rs:node2
To re-enable, run 'db2haicu' as instance owner on each server, select "1" (Yes) when asked whether to enable high availability, and then choose "TSA".
#IDUG
Alternative for preventing TSAMP starting and stopping DB2 Resources
• The quickest way of preventing TSAMP from stopping/starting the resources is to change TSAMP to manual mode (Automation = Manual):
# samctrl -M T
• The only action TSAMP will continue to do is monitor the resources by continuing to execute the monitoring scripts associated with each resource.
• Check the current automation mode with the following command:
# lssamctrl
• To re-enable automation mode (Automation = Auto):
# samctrl -M F
• Although changing the Nominal (desired) state of a resource group to “offline” will trigger TSAMP to stop its resources, this does not mean automation is stopped. TSAMP will attempt to maintain the offline state, so if any resource is manually started, TSAMP will stop it again.
• Note you will not be able to perform a takeover while TSAMP is in Manual mode.
81
(After 'samctrl -M T', lssamctrl shows "Manual = True"; after 'samctrl -M F', it shows "Manual = False".)
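• A quick scripted check of the mode before attempting a takeover, as a hedged sketch (the exact 'lssamctrl' output layout below is an assumption, not transcribed from a live system; 'automation_manual' is a hypothetical helper):

```shell
# Hypothetical helper: print the "Manual" flag from lssamctrl-style output.
automation_manual() {
  awk -F' = ' '/Manual/ { print $2; exit }'
}

# Assumed (abbreviated) lssamctrl output layout.
sample='Displaying SAM Control information:
SAMControl:
        TimeOut = 60
        Manual = False'

printf '%s\n' "$sample" | automation_manual   # prints: False
```

"False" here would mean automation is active, so a takeover is possible; "True" would mean TSAMP is in Manual mode.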
#IDUG
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
82
#IDUG
83
Serviceability - CLI commands
Use the TSAMP command "lssam" as previously demonstrated:
# lssam -top
# lssam -g <resource_group>
An alternative is the following TSAMP command:
# lsrg -m
#IDUG
84
Serviceability – logs
Three main areas of logging:
1. Logging from the DB2 automation scripts (i.e. start/stop/monitor scripts)
   "logger" statements in the policy scripts are written to syslog (eg. /var/log/messages on Linux systems)
2. Logging of TSAMP / RSCT core processes (i.e. quorum, monitor command timeouts)
   Written to syslog (Linux/AIX/Solaris) and errpt (AIX)
   Daemon log file directory: /var/ct/<DOMAIN>/log/mc/IBM.<DAEMON>RM
   – where <DAEMON> = Recovery, GblRes, …
   – Circular logs, cannot be opened with an editor directly! Format them with:
     rpttr -o dtic <log file dir>/trace_summary > my_trace.out
3. DB2's log file, "db2diag.log", with DIAGLEVEL 3 or higher
Use TSAMP Level 2 Support’s ‘getsadata’ script to collect data:
http://www.ibm.com/support/docview.wss?&uid=swg21285496
#IDUG
85
Serviceability – syslog messages from DB2 automation scripts
The following syslog message indicates the DB2 instance is Online (return code = 1):
<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)
The following syslog message indicates the DB2 instance is Offline (return code = 2):
<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 2 (db2inst1, 0)
The DB2 instance monitors repeat approximately every 10 seconds on each server if you're using a default automation policy.
The following syslog message indicates that the HADR resource is considered Online (return code = 1) and has the Primary role:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
Seen only on the node that's currently the primary node; repeats approximately every 21 seconds.
The following syslog message indicates that the HADR resource is considered Offline (return code = 2) and is most likely in a Standby state (normal/OK state):
<timestamp> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB
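• When scanning syslog by script, the return codes above can be mapped to states, as a minimal sketch ('state_of' is a hypothetical helper; the mapping 1 = Online, 2 = Offline is the one described on this slide):

```shell
# Hypothetical helper: map a monitor-script syslog line to an operational state.
state_of() {
  case "$1" in
    *'Returning 1'*) echo Online ;;
    *'Returning 2'*) echo Offline ;;
    *)               echo Unknown ;;
  esac
}

state_of 'node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)'            # prints: Online
state_of 'node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB' # prints: Offline
```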
#IDUG
86
Serviceability – syslog messages from DB2 automation scripts
The following syslog messages occur when TSAMP starts a DB2 instance:
<timestamp> node1 user:notice db2V10_start.ksh[856142]: Entered db2V10_start.ksh, db2inst1, 0
<timestamp> node1 user:debug db2V10_start.ksh[856146]: Able to cd to /home/db2inst1/sqllib : db2V10_start.ksh, db2inst1, 0
<timestamp> node1 user:debug db2V10_start.ksh[262214]: 1 partitions total: db2V10_start.ksh, db2inst1, 0
<timestamp> node1 user:notice db2V10_start.ksh[393252]: Returning 0 from db2V10_start.ksh ( db2inst1, 0)
If db2start was used to start the instance, the message below would be seen instead of the "1 partitions total" message shown above:
<timestamp> node1 user:info db2V10_start.ksh[856150]: db2V10_start.ksh is already up...
The following syslog messages are typical of the HADR resource group being brought online:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug root[524540]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node1 user:notice hadrV10_start.ksh[422078]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug hadrV10_start.ksh[422086]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:debug root[524290]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node1 user:notice hadrV10_start.ksh[422090]: Returning 0 : db2inst1 db2inst1 HADRDB
Note: the 'hadrV10_start.ksh' script doesn't actually bring the HADR database pair into a Peer state. It's likely to already be in a Peer state beforehand, because the databases are activated as part of starting the DB2 instances.
#IDUG
87
Serviceability – syslog messages from DB2 automation scripts
The following syslog messages occur when TSAMP stops a DB2 instance. This includes resetting a Failed Offline state for a DB2 instance resource:
<timestamp> node1 user:notice db2V10_stop.ksh[856142]: Entered db2V10_stop.ksh, db2inst1, 0
<timestamp> node1 user:notice db2V10_stop.ksh[393252]: Returning 0 from db2V10_stop.ksh ( db2inst1, 0)
The following syslog messages are typical of the HADR resource being stopped on one node so that a manual takeover can occur to the other node. It's also what you would see when resetting a Failed Offline state for an HADR resource:
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602322]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602330]: su - db2inst1 -c db2gcf -t 3600 -d -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602334]: Returning 0 : db2inst1 db2inst1 HADRDB
Note: the ‘hadrV10_stop.ksh’ script doesn’t actually stop the HADR functionality within DB2. It doesn’t affect Peer state.
#IDUG
88
Serviceability – syslog messages from DB2 automation scripts
The following syslog messages show the HADR resource group lock/unlock state:
<timestamp> node1 user:debug root[327754]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
<timestamp> node1 user:debug root[327780]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
and
<timestamp> node1 user:debug root[856206]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node1 user:debug root[856212]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
The HADR resource group is locked whenever Peer state is lost. The DB2 software uses a TSAMP API to request the lock. The “lockreqprocessed” script is used to check the lock and unlock states.
When the HADR pair are back in Peer state, the HADR resource group is unlocked, again requested by DB2.
The DB2 Instance resource groups also get locked if the db2stop command is used to stop an instance, and unlocked when db2start is used to start it again.
#IDUG
89
Serviceability – syslog messages from DB2 automation scripts
A manual (no force option) "takeover" (db2 takeover hadr on db HADRDB) would result in the following messages on the original primary server:
<timestamp> node1 user:notice hadrV10_stop.ksh[405566]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug hadrV10_stop.ksh[405574]: su - db2inst1 -c db2gcf -t 3600 -d -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:notice hadrV10_stop.ksh[405578]: Returning 0 : db2inst1 db2inst1 HADRDB
Assuming the above hadrV10_stop.ksh script completes with a 0 return code, a similar sequence of messages to the following would be seen on the original standby server:
<timestamp> node2 user:debug root[487538]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
<timestamp> node2 user:debug root[487564]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
<timestamp> node2 user:debug root[487566]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node2 user:debug root[487572]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node2 user:notice hadrV10_start.ksh[548876]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_start.ksh[548884]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node2 user:notice hadrV10_start.ksh[548888]: Returning 0 : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_monitor.ksh[696436]: Returning 1 : db2inst1 db2inst1 HADRDB
Note the return code of 0 from “hadrV10_start.ksh” meaning a successful takeover. Any other return code would be considered unsuccessful and would need to be diagnosed from a DB2 perspective.
#IDUG
90
Serviceability – syslog messages from TSAMP/RSCT
The following set of messages would indicate a cluster communication problem (domain split):
Firstly, the state of the domain changes to PENDING_QUORUM on each node:
CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
The Automation Engine (RecoveryRM) on each node reports that the other node has left the domain:
RECOVERYRM_INFO_4_ST A member has left. Node number = 1
The Network TieBreaker is tested; rc=0 indicates a successful poll of the network TieBreaker:
samtb_net[1294584]: op=reserve ip=10.201.1.1 rc=0 log=1 count=2
If the TieBreaker poll is successful, the node regains QUORUM:
CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM.
#IDUG
91
Serviceability – syslog messages from TSAMP/RSCT
The following messages are expected when TSAMP is assigning and removing ServiceIP (Virtual IP address) resources :
<timestamp> <node_name> daemon:notice GblResRM[1319044]: … :::GBLRESRM_IPONLINE IBM.ServiceIP assigned address on device. IBM.ServiceIP 10.20.30.42 en0
<timestamp> <node_name> daemon:notice GblResRM[618532]: … :::GBLRESRM_IPOFFLINE IBM.ServiceIP removed address. IBM.ServiceIP 10.20.30.42
The TieBreaker is released, and the TieBreaker reserve block removed, when a node has rejoined the domain:
<timestamp> <node_name> daemon:info samtb_net[790758]: op=release ip=10.20.30.1 rc=0 log=1 count=2
<timestamp> <node_name> daemon:info samtb_net[925932]: remove reserve block /var/ct/samtb_net_blockreserve_10.20.30.1
A MonitorCommand for a resource of class IBM.Application reached a defined timeout:
<timestamp> <node_name> GblResRM[24275]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Application.C,1.2.1,2434 :::GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. Resource name <resource_name>
Similar TIMEOUT messages exist for the StartCommand and StopCommand scripts.
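• Those TIMEOUT messages pair naturally with the MonitorCommandTimeout fix shown at the end of this deck. A hedged sketch ('suggest_timeout_fix' is a hypothetical helper; the sample line is abbreviated from the message above, with an example resource name from this deck filled in):

```shell
# Hypothetical helper: for each monitor-timeout message, print the chrsrc
# command that raises MonitorCommandTimeout for the named resource.
suggest_timeout_fix() {
  new_timeout="$1"
  awk -v t="$new_timeout" '/GBLRESRM_MONITOR_TIMEOUT/ {
    for (i = 1; i <= NF; i++)
      if ($i == "name") {   # the resource name follows the word "name"
        printf "chrsrc -s \"Name = %c%s%c\" IBM.Application MonitorCommandTimeout=%s\n", \
               39, $(i + 1), 39, t
      }
  }'
}

sample='GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. Resource name db2_db2inst1_node1_0-rs'
printf '%s\n' "$sample" | suggest_timeout_fix 120
# prints: chrsrc -s "Name = 'db2_db2inst1_node1_0-rs'" IBM.Application MonitorCommandTimeout=120
```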
#IDUG
92
Serviceability – Example of the TSAMP trace summary files
First, format the trace_summary file(s):
rpttr -o dtic <log file dir>/trace_summary > my_trace_summary.txt
The IBM.RecoveryRM traces (on the "master" node only) show:
• all 'online/offline order' statements
• Binder messages and exceptions
16:10:10.660208 T(229390) _RCD Offline Request against db2_db2inst1_db2inst1_HADRDB-rs on node node2
16:10:22.365645 T(229390) _RCD Offline request injected: db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
16:10:26.433653 T(229390) _RCD Online request injected: db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
16:10:26.442814 T(229390) _RCD RIBME-Hist for <NULL>: BINDER: Bind db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
The IBM.GblResRM traces (from each individual node) show:
• all start/stop command executions and service IP online/offline actions
13:56:55.919065 T(16386) _GBD Monitor reports: Network device "en0:0" (IP address 10.20.30.42) flagged UP. Bringing resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10) online.
13:57:30.693158 T(163851) _GBD Resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10): IP address 10.20.30.42 has been successfully taken offline on network interface "en0:0"
#IDUG
93
Serviceability – Operational States (OpState)
UNKNOWN (0)
– Generally a problematic state … really shouldn't be deliberately used in the automation scripts
ONLINE (1)
OFFLINE (2)
– Offline, and the resource should be startable here if needed
FAILED_OFFLINE (3)
– Offline, and not a possible node on which to start the resource
– If the MonitorCommand returns FAILED_OFFLINE, then availability can change as soon as the MonitorCommand returns something different, like Offline (return code 2)
– If the status is set to FAILED_OFFLINE by the StartCommand not succeeding within RetryCount, then manual intervention will be needed to fix the underlying resource and reset (resetrsrc) the resource
STUCK_ONLINE (4)
– Manual intervention will be needed to stop the underlying resource
PENDING_ONLINE (5)
– No action is taken in this state; the resource should eventually become online, or the start attempt will time out
PENDING_OFFLINE (6)
– No action is taken in this state; the resource should eventually become offline, or the stop attempt will time out
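• For reference when reading raw lsrsrc or trace output, the numeric values above as a lookup (a minimal sketch; 'opstate_name' is a hypothetical helper, the values are the ones listed on this slide):

```shell
# Hypothetical helper: map a numeric OpState to its name (values per this deck).
opstate_name() {
  case "$1" in
    0) echo UNKNOWN ;;
    1) echo ONLINE ;;
    2) echo OFFLINE ;;
    3) echo FAILED_OFFLINE ;;
    4) echo STUCK_ONLINE ;;
    5) echo PENDING_ONLINE ;;
    6) echo PENDING_OFFLINE ;;
    *) echo "INVALID($1)" ;;
  esac
}

opstate_name 3   # prints: FAILED_OFFLINE
```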
#IDUG
94
Serviceability
Check syslog and trace_summary to see if TSAMP is issuing start / stop orders/commands– If yes, then problem is most likely in DB2 automation scripts or core DB2 components– If no, problem is most likely in cluster/automation S/W, requiring TSAMP Level 2 involvement
If Operational State = UNKNOWN (OpState=0)– Check syslog and trace_summary for GBLRESRM _MONITOR_TIMEOUT– Fix: Increase MonitorCommandTimeout value
chrsrc -s "Name = '<resource_name>'" IBM.Application MonitorCommandTimeout=<new value>
lsrsrc -s "Name = '<resource_name>'" IBM.Application Name MonitorCommandTimeout
#IDUG
95
Db2pd -ha
#IDUG
96
Questions/Comments ?
#IDUG
Dale McInnis
IBM Canada Ltd.
[email protected]
Session: D15
Title: DB2 HADR Automation – Where is the magic?
Please fill out your session evaluation before leaving!