DB2 HADR Automation – where is the magic?
Transcript of DB2 HADR Automation – where is the magic?
#IDUG
DB2 HADR Automation – where is the magic?
Dale McInnis, Gareth Holl, Danny Arnold (IBM Canada Ltd.)
Session Code: D15
Thursday November 13, 08:30 – 09:30 | Platform: LUW
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
DB2 9.5 Integrated High Availability
• HA Cluster Manager Integration
  • Coupling of DB2 and TSA on Linux and AIX, other platforms coming later
  • DB2 interface to configure the cluster
  • DB2 maintains the cluster configuration: add node, add tablespace, …
  • Exploitation of new vendor independent layering (VIL), providing support for any cluster manager
  • Eventually replace VIL with an industry standard, e.g. SAF?
• NO SCRIPTING REQUIRED!
  • One set of embedded scripts that are used by all cluster managers
• Automate HADR failover
  • Exploits the HA cluster manager integration previously described
Clustering Setup Pre-9.5
Overworked admin doing initial setup:
• Install cluster SW
• Install DB2
• Create DB2 instance & database
• Add nodes / partitions to DB2 instance
• Add each DB2 node's host to cluster domain
• Create cluster domain
• Design and create 'resource models'
• Refer to cluster management course material to figure out how to do this
• Take multiple courses on cluster management
• BRING OUT THE CONSULTANTS!
Overworked admin adding a new file system for DB2 (e.g. tablespace container or storage path):
• Add container or storage path to DB2 (ALTER)
• Remember to modify the cluster resource model to account for the new file system
Clustering Setup with 9.5
Relaxed admin doing initial setup:
• Install DB2 9.5
• Create DB2 instance & database
• Run the DB2 HA config. tool – db2haicu
• Add nodes / partitions to DB2 instance
Relaxed admin adding a new file system for DB2 (tablespace container or storage path):
• Add container or storage path to DB2 (ALTER)
Integrated DB2 architecture
[Diagram: DB2 commands such as db2start, db2stop, ADD NODE, and ALTER … ADD … flow through the integrated cluster layer, which keeps the cluster manager's configuration in step with DB2]
DB2 HA Feature: Installation
• DB2 ships the IBM Cluster Manager product (TSA)
  • DB2 9.5 GA included TSA 2.2.0.3 (AIX + Linux)
  • DB2 9.7 GA included TSA 3.1.0.0 (AIX, Linux + Solaris 10/SPARC)
  • DB2 10 GA included TSA 3.2.2.1
  • DB2 10.5 GA included TSA 3.2.2.4
  • Integrated packaging, installation, and update
• TSA is the default Cluster Manager when the HA feature is chosen
• Fixpacks include TSA updates; uninstalling removes /opt/IBM/db2/Vxx.x/ha/tsa/:
  db2Vxx_monitor.ksh, db2Vxx_start.ksh, db2Vxx_stop.ksh
  hadrVxx_monitor.ksh, hadrVxx_start.ksh, hadrVxx_stop.ksh
• These scripts can be upgraded when a DB2 fixpack is installed
DB2 HA Feature: Configuration
• Distinct failover setups possible:
  • DB2 shared disk HA configuration
  • DB2 HADR configuration (automated failover)
  • DB2 DPF HA configuration with shared disk
HADR Architecture
[Diagram: On the primary server, the DB2 engine's log writer writes log pages to the logs while HADR ships them over TCP/IP to the standby server. On the standby, HADR passes the log pages to a shredder, which feeds log records to a replay master and a set of redo/replay slaves that apply them to the standby's tables, indexes, and logs. A log reader on each side services the current and old logs.]
Why the need for TSAMP?
• HADR does not perform active monitoring of the topology
• HADR will not detect a node outage or NIC failure
• HADR cannot take automated actions in the event of a failed primary instance, node outage, or NIC failure
• Instead, a DB administrator must monitor the HADR pair manually and issue the appropriate takeover commands in the event of a primary database interruption
• This is where TSAMP's automation capabilities come into play:
  • TSAMP can perform restart actions if an instance unexpectedly exits
  • TSAMP can perform an HADR takeover automatically when certain problems are detected on the primary server
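Without automation, the takeover is a manual command issued on the standby. A minimal sketch of the two forms of the command an administrator would run (here `db2` is a stub so the sketch is self-contained; on a real standby server it is the DB2 command line processor, and HADRDB is a placeholder database name):

```shell
#!/bin/sh
# Sketch only: the commands a DBA must issue by hand without TSAMP.
# "db2" is stubbed here so the script runs standalone.
db2() { echo "db2 $*"; }

# Graceful role switch, possible while the pair is in peer state:
graceful=$(db2 takeover hadr on db HADRDB)

# Forced takeover when the primary is unreachable (risk of transaction
# loss depending on the HADR synchronization mode):
forced=$(db2 takeover hadr on db HADRDB by force)

echo "$graceful"
echo "$forced"
```

TSAMP automates exactly this decision: detecting the failure and choosing when a forced takeover is safe.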
Introduction
• DB2 provides a High Availability Disaster Recovery (HADR) feature that keeps a primary and a standby database synchronized and allows an administrator to switch control to the standby DB2 server
• DB2 provides a set of scripts that TSAMP uses to start, stop, and monitor each of the DB2 resources; this is the primary link between the two products
• DB2 provides a utility called 'db2haicu' that is used to define the domain and automation policy within TSAMP (the initial setup):
  • The automation policy is the set of definitions of all resources, resource groups, and the relationships between them
  • The resource definitions contain attributes that specify which DB2 start, stop, and monitor scripts (the automation scripts) to use for a particular resource
• TSAMP can be used to monitor an application's resources and automate the starting, stopping, and failover of those resources; it attempts to maintain a desired operational state
Software Summary
• Each of the following software products/components needs to be installed on both systems (primary and standby servers):
  • DB2 v10.1 (10.1.0.3 was the latest available at the time this deck was written)
  • TSAMP v3.2.2 (Fixpack 7 (3.2.2.7) or later recommended)
  • RSCT v3.1.5.2 (installed as part of a TSAMP installation)
• Installation of DB2 v10.1 includes the DB2 automation policy scripts in /opt/IBM/db2/V10.1/ha/tsa/:
  • db2V10_monitor.ksh, db2V10_start.ksh, db2V10_stop.ksh
  • hadrV10_monitor.ksh, hadrV10_start.ksh, hadrV10_stop.ksh
  • lockreqprocessed
• These scripts can be upgraded when a DB2 fixpack is installed
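TSAMP drives each DB2 resource through these start/stop/monitor scripts, and the contract is the monitor script's exit code: in RSCT's convention, 1 reports the resource Online and 2 reports it Offline. A rough, self-contained sketch of that contract (the DB2 probe is stubbed; the real db2V10_monitor.ksh receives the instance name and partition number as arguments and probes the actual instance):

```shell
#!/bin/sh
# Sketch of the monitor-script contract used by the policy scripts above.
db2_instance_is_up() { return 0; }   # stub: pretend the instance is running

monitor_sketch() {
  instance=$1
  partition=$2
  if db2_instance_is_up "$instance" "$partition"; then
    return 1   # RSCT convention: exit code 1 means the resource is Online
  else
    return 2   # exit code 2 means the resource is Offline
  fi
}

rc=0
monitor_sketch db2inst1 0 || rc=$?
```

TSAMP runs the monitor on a fixed period (see the MonitorCommandPeriod attribute later in this deck) and reacts when the reported state diverges from the nominal state.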
Software Summary (continued …)
• TSAMP/RSCT does not need to be installed on a third node to maintain quorum
  • A Network TieBreaker can be used instead (db2haicu calls this a quorum device)
• The license file for TSAMP is included with the DB2 Activation zip file, available via your Passport Advantage account
  • If a base level of TSAMP is not installed, the license file will need to be manually installed
  • TSAMP can be silently installed by the DB2 installer, but if a base level of DB2 is not installed, then again the TSAMP license will need to be manually installed
• See the TSAMP formal documentation for platform compatibility & dependencies :
For TSAMP v3.2.2.7 Release Note:
http://pic.dhe.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_3.2.2/HALRN329.pdf
For TSAMP v4.1 Installation and Configuration Guide:
http://pic.dhe.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_4.1/HALICG41.pdf
Example 1 of a DB2 HADR environment
[Diagram: Primary and standby servers each run a DB2 instance and database, with DB2, HADR, TSAMP, and RSCT installed. Each server has a single NIC (eth0) on the public network; client apps connect through a virtual IP (eth0:0). Both the cluster (RSCT) heartbeat and the HADR replication of DB2 transactions travel over the public network.]
HADR replication via the public network
Example 2 of a DB2 HADR environment
[Diagram: As in Example 1, but each server has a second NIC (eth1) connected via a switch to a private network. HADR replication of DB2 transactions travels over the private network, while client apps and the virtual IP (eth0:0) use the public network; the cluster (RSCT) heartbeat runs on both networks.]
HADR replication via a private network
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
TSAMP Architectural Overview
Terminology
• Peer Domain: A cluster of servers, or nodes, for which TSA is responsible.
• Resource: Hardware or software that can be monitored or controlled. Resources can be fixed or floating; floating resources can move between nodes.
• Resource group: A virtual group or collection of resources.
• Relationships: Describe how resources work together. A start-stop relationship creates a dependency (see below) on another resource. A location relationship applies when resources should be started on the same or different nodes.
• Dependency: A limitation on a resource that restricts operation. For example, if resource A depends on resource B, then resource B must be online for resource A to be started.
• Equivalency: A set of fixed resources of the same resource class that provide the same functionality.
• Quorum: A cluster has quorum when it can form a majority among its nodes. The cluster can lose quorum when there is a communication failure and sub-clusters form with an even number of nodes.
• Nominal State: The desired state of a resource, either Online or Offline; it can be changed so that TSA will bring a resource online or shut it down.
• Tie Breaker: Used to maintain quorum, even in a split-brain situation. A tie-breaker allows sub-clusters to determine which set of nodes will take control of the domain.
• Failover: When a failure (typically hardware) causes resources to be moved from one machine to another, the resources are said to have "failed over".
Quorum and TieBreaker
One of the questions from the 'db2haicu' utility deals with a cluster/automation concept called Quorum.
Quorum
• The number of nodes in a cluster that are required to control the resources, modify the cluster definition, or perform certain cluster operations.
• The main goals of quorum operations:
  • identify who has the majority when a cluster is broken up into sub-clusters
  • keep data consistent, especially when shared file systems are being used
  • protect critical resources … maintain HA control
• Two types: configuration vs. operational quorum
• Note: "configuration" quorum requires a majority of nodes (more than half the number of nodes) to be Online for configuration changes to be carried out.
TieBreaker
• A TieBreaker situation occurs when a cluster is split into sub-clusters with equal numbers of nodes.
• A TieBreaker determines which sub-cluster will have operational quorum in a tie situation.
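The "majority of nodes" rule is simple arithmetic, and a two-node sketch shows why a two-node HADR domain always needs a tie-breaker:

```shell
#!/bin/sh
# "More than half the number of nodes" for configuration quorum.
nodes=2
majority=$(( nodes / 2 + 1 ))   # 2 of 2 in a two-node domain

# Losing one node leaves a single survivor, which is below the majority:
# no majority is possible, which is exactly what the tie-breaker resolves.
surviving=1
if [ "$surviving" -ge "$majority" ]; then quorum=yes; else quorum=no; fi
```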
db2haicu: Quorum and TieBreaker (continued …)
• Disk TieBreaker: not supported by db2haicu
• NFS TieBreaker: not supported by db2haicu
• Network TieBreaker:
  • The goal is for each system to figure out (via the RSCT infrastructure) which one is operational and should therefore take control (if not already the active node).
  • Use a pingable system independent of node1 and node2.
  • Without an active TieBreaker, automated failover/takeover will NEVER occur.
  • The network tie breaker should be used only for domains where all nodes are in the same IP subnet.
  • Choose an IP address that can be reached through only a single path from each node in the domain.
The following slides demonstrate how a TieBreaker works …
Base Tie-Breaker Functionality
[Diagram: node1 (eth0 10.20.30.40) and node2 (eth0 10.20.30.41) on subnet 10.20.30.0, with a gateway router at 10.20.30.1]

Node Failure Scenario:
1. System node1 fails
1a. System node2 gets quorum using the network tiebreaker

Network Adapter Failure Scenario:
1. Network problem affecting node1
2a. Again node2 gets quorum using the network tiebreaker
2b. System node1 is forced to reboot

Network Tiebreaker Assumption:
If node1 can communicate (ping) with the gateway and node2 can communicate (ping) with the gateway, THEN node1 must be able to communicate (heartbeat) with node2.
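The decision in these scenarios reduces to a reachability test against the tie-breaker address. A sketch of one sub-cluster's view (here `can_reach` is a stub; RSCT performs the real ICMP check and tie-breaker reservation internally, and 10.20.30.1 is the example gateway from the diagrams):

```shell
#!/bin/sh
# Sketch of a sub-cluster's network tie-breaker decision.
TIEBREAKER_IP=10.20.30.1
can_reach() { [ "$1" = "10.20.30.1" ]; }   # stub: pretend the gateway answers

if can_reach "$TIEBREAKER_IP"; then
  verdict="has operational quorum"    # this side keeps/takes control
else
  verdict="loses quorum"              # resources stopped; node may be rebooted
fi
```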
System Automation – Components
Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of two Resource Managers:
• IBM.RecoveryRM – the automation engine. Owns IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource, IBM.ManagedRelationship
• IBM.GblResRM – starts, stops, and monitors resources of the classes IBM.Application and IBM.ServiceIP
Reliable Scalable Cluster Technology (RSCT), the "Cluster" software, is made up of some core daemons and some Resource Managers, the most important being the following two:
• IBM.ConfigRM – configuration tasks across the nodes in the domain, including quorum and TieBreaker functionality
• IBM.StorageRM – maps IBM.AgFileSystem resources to IBM.LogicalVolume to IBM.VolumeGroup to IBM.Disk, and manages the mount/umount of the file systems and the varyon/varyoff of the Volume Groups
Resource Manager – Resource Class Composition
• IBM.ConfigRM – cluster configuration
  • IBM.CommunicationGroup
  • IBM.HeartbeatInterface – manage heartbeats
  • IBM.NetworkInterface
  • IBM.PeerDomain
  • IBM.PeerNode
  • IBM.RSCTParameter
  • IBM.TieBreaker
• IBM.StorageRM – storage
  • IBM.AgFileSystem
  • IBM.Disk
  • IBM.LogicalVolume
  • IBM.Partition
  • IBM.VolumeGroup
Resource Manager – Resource Class Composition
• IBM.RecoveryRM
  • Is the decision engine
  • Runs on every node, with one being designated as the master
  • Responsible for evaluating the monitoring information that is supplied by the various resource managers, such as the Storage RM and the Global Resource RM
  • Drives the decisions that result in start or stop operations on the resources as needed
• IBM.GblResRM
  • IBM.Application – manage applications
  • IBM.ServiceIP – manage IP addresses
System Automation – Components
Everything is a "Resource" in the TSAMP and RSCT world:
• There are different kinds of resources, and that is where we introduce the concept of a resource "class".
• There are different Resource Managers, each responsible for managing or controlling resources that belong to a particular set of resource classes.
• [Diagram: mapping of three key Resource Managers to some Resource Classes they manage, and then to some example Resources]
System Automation – Components
• Consider the servers that make up the cluster … they are also resources, of the class IBM.PeerNode.
• The domain itself is a resource, of class IBM.PeerDomain. The network interfaces are resources, of class IBM.NetworkInterface.
• [Diagram: the IBM.ConfigRM Resource Manager (part of RSCT), two Resource Classes it manages, and the Resources modelled by those classes]
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
Mapping DB2 HADR components to TSAMP resources
[Diagram: On node1 and node2, each DB2 instance resource (db2_db2inst1_node1-rs / db2_db2inst1_node2-rs) sits in its own resource group (db2_db2inst1_node1-rg / db2_db2inst1_node2-rg). The floating HADR resource db2_db2inst1_db2inst1_HADRDB-rs and the optional floating virtual IP resource db2ip_10_20_30_42-rs share the resource group db2_db2inst1_db2inst1_HADRDB-rg. Equivalencies db2_public_network_0 (eth0 on both nodes) and db2_private_network_0 (eth1, optional) model the networks; dependsOn relationships tie the resources to the public network equivalency.]
Mapping DB2 Components to TSAMP Resources
• A DB2 instance called "db2inst1" on a server called "node1" maps to a TSAMP managed resource called "db2_db2inst1_node1_0-rs"
• A DB2 instance called "db2inst1" on a server called "node2" maps to a TSAMP managed resource called "db2_db2inst1_node2_0-rs"
• A DB2 HADR database called "HADRDB" whose primary and standby instances are both named "db2inst1" maps to a TSAMP managed resource called "db2_db2inst1_db2inst1_HADRDB-rs"
• The virtual IP address (optional) maps to a TSAMP managed resource called "db2ip_10_20_30_42-rs", where 10_20_30_42 is the virtual IP address
• A public network can be defined, and this maps to a TSAMP resource (Equivalency) called "db2_public_network_0"
• Note: there is no need to define a private network …
  • TSAMP does not manage anything related to the private network; there are no dependencies on it, so there is no need for it! Just say "no" to db2haicu
  • You can still have an actual private network for HADR replication; it is totally independent of TSAMP
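The resource names above follow a consistent pattern. A sketch of the convention as inferred from these examples (inferred, not an official API; the variables are the example names from this deck):

```shell
#!/bin/sh
# Naming convention inferred from the examples on this slide.
instance=db2inst1
host=node1
partition=0
db=HADRDB
vip=10.20.30.42

inst_rs="db2_${instance}_${host}_${partition}-rs"    # per-instance resource
hadr_rs="db2_${instance}_${instance}_${db}-rs"       # primary inst + standby inst + db
ip_rs="db2ip_$(echo "$vip" | tr '.' '_')-rs"         # dots in the VIP become underscores
```

Knowing the pattern makes it easy to target a specific resource with lsrsrc, lsrg, or chrsrc later in the deck.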
IBM.Application class – display attributes of a resource/resource class
# lsrsrc -s "Name = 'db2_db2inst1_node1_0-rs'" -Ab IBM.Application
Name                  = "db2_db2inst1_node1_0-rs"
StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh db2inst1 0"
StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V10_stop.ksh db2inst1 0"
MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
MonitorCommandPeriod  = 10
MonitorCommandTimeout = 120
StartCommandTimeout   = 330
StopCommandTimeout    = 140
UserName              = "root"
RunCommandsSync       = 1
ProtectionMode        = 1
ActivePeerDomain      = hadr_domain
NodeNameList          = {"node1"}
OpState               = 1
IBM.Application class – example of a DB2 HADR resource
# lsrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs'" -Ab IBM.Application
Name                  = "db2_db2inst1_db2inst1_HADRDB-rs"
StartCommand          = "/usr/sbin/rsct/sapolicies/db2/hadrV10_start.ksh db2inst1 db2inst1 HADRDB"
StopCommand           = "/usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh db2inst1 db2inst1 HADRDB"
MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/hadrV10_monitor.ksh db2inst1 db2inst1 HADRDB"
MonitorCommandPeriod  = 21
MonitorCommandTimeout = 29
StartCommandTimeout   = 330
StopCommandTimeout    = 140
UserName              = "root"
RunCommandsSync       = 1
ProtectionMode        = 1
ActivePeerDomain      = hadr_domain
NodeNameList          = {"node1","node2"}
OpState               = 1
IBM.ServiceIP class – Virtual IP addresses
# lsrsrc -Ab IBM.ServiceIP
Name             = "db2ip_10_20_30_42-rs"
IPAddress        = "10.20.30.42"
NetMask          = "255.255.255.0"
ProtectionMode   = 1
ActivePeerDomain = "hadr_dom"
NodeNameList     = {"node1","node2"}
OpState          = 1
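The numeric OpState values shown by lsrsrc map to the usual RSCT state names (values per the RSCT documentation; the helper itself is just an illustration, not a real tool):

```shell
#!/bin/sh
# Map a numeric OpState (as printed by lsrsrc) to its RSCT state name.
opstate_name() {
  case "$1" in
    0) echo "Unknown" ;;
    1) echo "Online" ;;
    2) echo "Offline" ;;
    3) echo "Failed Offline" ;;
    4) echo "Stuck Online" ;;
    5) echo "Pending Online" ;;
    6) echo "Pending Offline" ;;
    *) echo "Invalid" ;;
  esac
}

state=$(opstate_name 1)   # OpState = 1 in the output above
```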
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
Use db2haicu to configure TSA for HADR automation
db2haicu [ -f <XML-input-file-name> ]
         [ -disable ]
         [ -delete [ dbpartitionnum <db-partition-list> | hadrdb <database-name> ] ]
• Default is to run in interactive mode
• Follow the instructions in https://www.ibm.com/developerworks/mydeveloperworks/blogs/nsharma/resource/HADR-db2haicu_v5.pdf
  Table of contents:
  Part 1: DB2 configuration
  Part 2: TSA Cluster setup
  Part 3: Miscellaneous tasks / Diagnostics
  Part 4: Remove TSA/HADR configuration
  Part 5: Automatic client reroute (ACR)
Using 'db2haicu' to Automate HADR Failover
Step 1. Run the following command as root on each node to configure the RSCT ACLs (security) and allow cluster communication between the servers:
root@node1# preprpnode node1 node2
root@node2# preprpnode node1 node2
Step 2. Log on to the standby server as the instance owner and issue:
db2inst1@node2> db2haicu
• The db2haicu tool will determine the current instance and apply all cluster configuration steps based on it. It will also activate all databases for the instance as it attempts to gather information.
• Next, db2haicu will determine if a domain has already been created by searching for an "Online" domain. If it doesn't find one, you will see the following:
db2haicu: Create a new domain
[Screenshot: db2haicu prompts to create a new domain with two nodes]
db2haicu: List the new domain
• At this point there would be a domain called "hadr_domain" in an online state:
root@node1# lsrpdomain
Name        OpState RSCTActiveVersion MixedVersions TSPort GSPort
hadr_domain Online  3.1.2.1           No            12347  12348
• You can also list the states of the individual nodes and see output similar to the following, from either server:
root@node1# lsrpnode
Name  OpState RSCTVersion
node1 Online  3.1.2.1
node2 Online  3.1.2.1
db2haicu: Quorum and TieBreaker
The next db2haicu question deals with the creation of a Network TieBreaker.
At this point you could list the TieBreaker resources and see the new network TieBreaker:
root@node1# lsrsrc -Ab IBM.TieBreaker
The following command should show that your new network TieBreaker is currently active:
root@node1# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
db2haicu: Network Equivalencies
Special TSAMP groups called Equivalencies are created containing the network interfaces found on each of the servers in the cluster. This allows TSAMP to be notified of NIC failures by the RSCT subsystem (which harvested the NICs) and react accordingly.
• We use db2haicu to create an equivalency called db2_public_network_0 and populate it with the en0 NIC from the server called "node1":
db2haicu: Network Equivalencies (continued …)
• Next we add the en0 NIC from the other server, “node2”, to the same equivalency (db2_public_network_0) :
db2haicu: Private Network
In the previous slide, notice the option to say "Yes" or "No" when adding NICs to a network.
When asked if a non-public NIC should be added to a private network, I recommend you choose "No".
So do not create a private network equivalency via db2haicu, even if your DB2 HADR environment does use a private network for HADR replication data.
db2haicu: Private Network (continued …)
If your DB2 environment uses LDAP for authentication and you have multiple NICs per server (e.g. a private network), then disable the RSCT cluster heartbeat for all NICs not in the public network:
• Identify the Communication Group that contains the non-public NICs:
  lsrsrc -Ab IBM.NetworkInterface Name IPAddress CommGroup HeartbeatActive NodeNameList
• Change HeartbeatActive to 0 to disable heartbeating for a CommGroup:
  chrsrc -s "CommGroup=='CG2'" IBM.NetworkInterface HeartbeatActive=0
See the following technote for more details on cluster heartbeat settings and Communication Groups:
http://www.ibm.com/support/docview.wss?uid=swg21292274
db2haicu: Additional NICs per server
• For non-LDAP setups, it is recommended to have at least a second pair of NICs for cluster heartbeating, to reduce the likelihood of a forced reboot if there is a problem in the public network (or with the public NICs).
• For DB2 v9.7 environments, additional "dependsOn" relationships need to be manually added to the automation policy, from each HADR database resource to the public network equivalency.
• If db2haicu from v10.1.0.0, v10.1.0.1, or v10.1.0.2 is used to create the automation policy, all the necessary dependsOn relationships will be missing due to a bug in the db2haicu utility (fixed as of 10.1.0.3); they will need to be manually created.
• Refer to the following technote to obtain a script that can be used to create any missing relationships in either a DB2 v9.7 or v10 environment:
http://www.ibm.com/support/docview.wss?uid=swg21634431
• The dependsOn relationships between the HADR database resources and the Public Network Equivalency are recommended even if there is only one NIC per server, for both DB2 v9.7 and v10 environments.
db2haicu: Listing the Equivalency
• The network equivalency(ies) would be created at this point and can be listed as follows:
root@node1# lsequ -Ab
Displaying Equivalency information:
All Attributes
Equivalency 1:
        Name = db2_public_network_0
        MemberClass = IBM.NetworkInterface
        Resource:Node[Membership] = {en0:node1, en0:node2}
        SelectString = ""
        ActivePeerDomain = hadr_domain
        Resource:Node[ValidSelectResources] = {en0:node1, en0:node2}
db2haicu: Adding the database node to the Automation Policy
• The final part to running db2haicu on the standby server is setting the CLUSTER_MGR variable to “TSA” and then adding resources that represent the DB2 instance on the server where you’re running db2haicu:
db2haicu: Adding the database node to the Automation Policy
• Note in the previous screenshot that you won't be able to validate and automate the HADR database via db2haicu from the standby server. This is why the next part involves running db2haicu a second time, but from the current HADR primary server.
• At this point we can view a few more changes to the automation policy and the database manager's configuration:
db2inst1@node2> db2 get dbm cfg | grep -i cluster
 Cluster manager                         = TSA
root@node2# lsrg
Resource Group names:
db2_db2inst1_node2_0-rg
db2haicu: DB2 Standby Instance Resources
root@node2# lsrg -g db2_db2inst1_node2_0-rg
Displaying Member Resource information:
For Resource Group "db2_db2inst1_node2_0-rg".
Resource Group 1:
        Name = db2_db2inst1_node2_0-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_node2_0-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_node2_0-rg
        TopGroupNominalState = Online
• Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server this resource is allowed to run on … see the next slide for output showing that PeerNode Equivalency's details.
db2haicu: DB2 Standby Instance Dependencies
root@node2# lsequ -Ab
Equivalency 1:
        Name = db2_db2inst1_node2_0-rg_group-equ
        MemberClass = IBM.PeerNode
        Resource:Node[Membership] = {node2:node2.tivlab.raleigh.ibm.com}
        SelectString = ""
        SelectFromPolicy = ANY
        MinimumNecessary = 1
        ActivePeerDomain = hadr_domain
        Resource:Node[ValidSelectResources] = {node2:node2.tivlab.raleigh.ibm.com}
• This restricts the DB2 instance resource from the previous slide to being brought Online by TSAMP only on node2.
• This is fairly obvious, given it is the resource that represents the standby database partition.
db2haicu: DB2 Standby Instance Dependencies (continued …)
• At this point there would be one or two relationships defined in the automation policy, depending on how many network equivalencies you created:
root@node2# lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node2_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node2_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
• This shows us that the DB2 instance is dependent on the operational state of the NICs in the public network. If the NIC is Online, then TSAMP will be able to start the associated DB2 instance.
db2haicu: The Automation Policy so far …
• Let’s look at what resources and groups are listed in the ‘lssam’ output after completing the execution of ‘db2haicu’ on the standby server :
root@node2# lssam
[Screenshot: lssam output showing the standby instance resource group, its member resource, the PeerNode equivalency, and the public network equivalency]
Here we see that the DB2 instance for server “node2” is defined and within its own resource group.
There is a PeerNode equivalency which dictates which server the above instance is allowed to run on.
Finally, there is a Network Equivalency which contains the NICs for the public network … the DB2 instance would have a dependency relationship on this equivalency.
Using ‘db2haicu’ to Automate HADR Failover (continued …)
Step 3. Log on to the primary server as the instance owner and issue:
db2inst1@node1> db2haicu
• The db2haicu tool will determine the current instance and apply all cluster configuration steps based on it. It will also activate all databases for the instance as it attempts to gather information.
• Next, db2haicu will determine if a domain has already been created by searching for an "Online" domain. Since we've already run db2haicu on the standby server, an Online domain should already exist.
• You will then be asked to set the cluster manager.
db2haicu: Adding the database node to the Automation Policy
• db2haicu sets the CLUSTER_MGR variable to "TSA" within the local database manager's configuration:
db2inst1@node1> db2 get dbm cfg | grep -i cluster
 Cluster manager                         = TSA
• Please note that once the dbm is configured with “Cluster manager” set to TSA, the DB2 engine expects to have a domain Online. You will have issues stopping and starting the DB2 instance if no domain is Online.
• Run 'db2haicu -disable' on each DB2 server if you want to break the connection between DB2 and TSAMP. This is the only way to unset “Cluster manager” for DB2 v10.x
• Then db2haicu adds resources that represent the DB2 instance (the primary DB2 instance) on the server where you’re currently running db2haicu:
db2haicu: DB2 Primary Instance Resources
• At this point we can view a few more changes to the automation policy:
root@node1# lsrg
Resource Group names:
db2_db2inst1_node1_0-rg
db2_db2inst1_node2_0-rg
root@node1# lsrg -g db2_db2inst1_node1_0-rg
Displaying Member Resource information:
For Resource Group "db2_db2inst1_node1_0-rg".
Resource Group 1:
        Name = db2_db2inst1_node1_0-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_node1_0-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_node1_0-rg
        TopGroupNominalState = Online
• Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server this resource is allowed to run on … similar to the other DB2 instance resource group, but with a different server name.
db2haicu: Dependencies for the DB2 Instances
• Now there would be additional relationships defined in the automation policy:
root@node1# lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node2_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node2_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
Managed Relationship 2:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node1_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node1_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
• This shows us that the DB2 instances are both dependent on the operational state of the NICs in the public network. If the NICs are Online, then TSAMP will be able to start the associated DB2 instances … it also means if either NIC goes offline for any reason, the local DB2 instance will be stopped by TSAMP.
db2haicu: The Automation Policy so far …
• Let’s take another look at what resources and groups are listed in the ‘lssam’ output after ‘db2haicu’ has added both standby and primary database partitions :
root@node1# lssam
[Screenshot: lssam output now listing resource groups for the DB2 instances on both servers]
There's now a resource and group for the DB2 instance on server "node1".
There's now another PeerNode equivalency … it forces the "db2_db2inst1_node1_0-rg" resource group to run on "node1" only.
db2haicu: Adding the HADR database to the Automation Policy
• Validating and automating HADR failover can only be done from the current primary server and only after successfully running db2haicu on the standby server.
• You may also want to add a virtual IP address for this HADR database
62
#IDUG
db2haicu: HADR Database Resources
# lsrg -g db2_db2inst1_db2inst1_HADRDB-rg
Displaying Resource Group information:
For Resource Group "db2_db2inst1_db2inst1_HADRDB-rg".
Resource Group 1:
Name = db2_db2inst1_db2inst1_HADRDB-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_db2inst1_HADRDB-rg
TopGroupNominalState = Online
• Note the AllowedNode attribute. It points to a PeerNode Equivalency containing the servers "node1" and "node2", which dictates which servers the HADR database can reside on. This mirrors the setup for the two DB2 instance resource groups, which also use the AllowedNode attribute with their own PeerNode Equivalencies; in this case, however, the HADR resource is a floating resource with two servers to choose from.
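• These "attribute = value" listings are script-friendly. As a hedged sketch (the sample lines below are transcribed from the 'lsrg' output on this slide; 'get_attr' is a hypothetical helper, not a TSAMP command), you could pull a single attribute out of 'lsrg -g' output, e.g. to verify NominalState before a maintenance window:

```shell
# Hypothetical helper: print the value of one "Attr = Value" line.
get_attr() {
  awk -F' = ' -v a="$1" '$1 == a { print $2; exit }'
}

# Sample attribute lines, transcribed from the lsrg output above.
sample='Name = db2_db2inst1_db2inst1_HADRDB-rg
AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
NominalState = Online
OpState = Online'

printf '%s\n' "$sample" | get_attr NominalState   # prints: Online
```

On a live cluster this would be fed from 'lsrg -g <group>' directly (real output indents the attribute names, so a trim step may be needed).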
63
#IDUG
db2haicu: HADR Database Resources
# lsrg -m -g db2_db2inst1_db2inst1_HADRDB-rg
Displaying Member Resource information:
For Resource Group "db2_db2inst1_db2inst1_HADRDB-rg".
Member Resource 1:
Class:Resource:Node[ManagedResource] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
Member Resource 2:
Class:Resource:Node[ManagedResource] = IBM.ServiceIP:db2ip_10_20_30_42-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
64
#IDUG
db2haicu: Dependency for the HADR DB Resource
• Now there would be an additional relationship defined in the automation policy:
root@node1# lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
[...]
Managed Relationship 3:
Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
Relationship = DependsOn
Conditional = NoCondition
Name = db2_db2inst1_db2inst1_HADRDB-rs_DependsOn_db2_public_network_0-rel
ActivePeerDomain = hadr_domain
• This shows us that the HADRDB resource is dependent on the operational state of the NICs in the public network.
If the NICs are Online, then TSAMP will be able to bring the associated HADR db resource online. It also means that if either NIC goes offline for any reason, the constituent of the HADR db resource local to the offline NIC will be taken offline by TSAMP (if it is currently online) … this could trigger a failover.
65
#IDUG
db2haicu: The Complete Automation Policy …
• Let's look at what resources and groups are listed in the 'lssam' output after completing the execution of 'db2haicu' on both servers:
root@node1# lssam
66
The resources and group for the HADR database and virtual IP address have been added, as has a new PeerNode Equivalency containing servers “node1” and “node2”.
#IDUG
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
67
#IDUG
Starting and Stopping DB2 resources
• Now that the DB2 instances are managed by TSAMP, they cannot be started with the db2start command unless the "Nominal" (desired) state of the resource group that contains the DB2 instance resource is set to "Online". The following is an example:
# chrg -o online <Resource_Group>
• Changing the desired states of the resource groups will instruct TSAMP to start/stop the resources using the scripts defined in the "StartCommand" and "StopCommand" attributes of a resource of class "IBM.Application", as is the case for a DB2 instance resource.
• To change the desired state of multiple resource groups with similar names (for example, to start in parallel all DB2 resource groups where the instance name on each server starts with "db2inst"), use the following syntax:
# chrg -o online -s "Name like 'db2_db2inst_%'"
• Another example: to take the HADR resource group offline, including removal of the currently assigned virtual IP address:
# chrg -o offline db2_db2inst1_db2inst1_HADRDB-rg
• Note: After offlining just the HADR resource group, the HADR pair will remain in a peer connected state even though shown as Offline on both servers when viewed as a TSAMP resource!
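• Because of that, 'lssam' alone cannot tell you whether the pair is still in Peer; 'db2pd' can. A small sketch ('hadr_state' is a hypothetical helper; the sample lines are abbreviated from the db2pd output shown later in this deck) that extracts the Role/State columns:

```shell
# Hypothetical helper: print "Role/State" from db2pd -hadr output lines.
hadr_state() {
  awk '$1 == "Primary" || $1 == "Standby" { print $1 "/" $2; exit }'
}

# Sample output, abbreviated from a later slide in this deck.
sample='Role    State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Primary Peer   Sync      0                 0'

printf '%s\n' "$sample" | hadr_state   # prints: Primary/Peer
```

On a live system this would be fed from 'db2pd -hadr -db <db_name>' run as the instance owner.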
68
#IDUG
Starting and Stopping DB2 resources (continued …)
• A couple of things to note:
• It is good practice (and potentially less prone to startup issues) to have the Nominal (desired) State of all the Resource Groups set to Offline prior to stopping the domain. This will allow the domain to be started at some point in the future without any start actions being taken against all DB2 resources simultaneously.
• Offline individual nodes in the cluster using ‘stoprpnode <node_name>’ before shutting down or rebooting that server for maintenance.
• I'd suggest starting both DB2 instances simultaneously (changing their Resource Groups' Nominal State to Online), but waiting until both are completely online before changing the Nominal State of the HADR resource group to Online.
• Starting both instances should result in the HADR pair reaching a peer connected state, even though the HADR resource group’s Nominal state may still be set to Offline. However, while the HADR resource group is set to Offline, the Virtual IP address is truly offline (not assigned to any NIC so not available to communicate through), AND no automated failover actions will occur.
• If the HADR pair does not reach a peer connected state after both instances have successfully started, troubleshoot this as a DB2 problem. Once corrected and back in Peer, proceed to start the HADR resource group.
• If you go ahead and attempt to start the HADR resource group when not in peer, the HADR resource will likely end up in a Failed Offline state on one or both servers requiring manual reset actions.
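• The ordering advice above can be scripted. A minimal sketch ('wait_online' is a hypothetical helper; the state-query command is injectable so the logic can be exercised without a cluster; in practice it would wrap a check of 'lssam' output for each instance resource):

```shell
# Hypothetical helper: poll a state-query command until it prints "Online".
# Args: query command, max tries (default 60), seconds between tries (default 5).
wait_online() {
  query_cmd="$1"; tries="${2:-60}"; interval="${3:-5}"
  n=0
  while [ "$n" -lt "$tries" ]; do
    [ "$("$query_cmd")" = "Online" ] && return 0
    n=$((n + 1))
    sleep "$interval"
  done
  return 1
}

# Intended use (resource names are the examples from this deck; the two
# check_* query commands are hypothetical wrappers around lssam output):
#   wait_online check_node1_instance && wait_online check_node2_instance \
#     && chrg -o online db2_db2inst1_db2inst1_HADRDB-rg
```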
69
#IDUG
Starting DB2 Instances & HADR Database
• Start the underlying domain if not already online:
# startrpdomain hadr_domain
• Start the primary and standby instances simultaneously:
# chrg -o online -s "Name like 'db2_db2inst1_node%'"
• The above assumes both instances are named "db2inst1" and the servers have hostnames starting with "node"
• Check that both instances reach Online states. Do not proceed until both DB2 instances have come online. Confirm using “lssam –top” and “db2_ps” (Run “db2_ps” as the DB2 instance owner on each node)
• The DB2 start scripts used to start the instances will also activate the databases, resulting in the HADR pair establishing a peer connected state. So confirm that the HADR pair have reached peer state by running the following on each DB2 node:
# db2pd -hadr -db hadrdb
• If HADR is not active, then manually bring the HADR pair into peer state as follows:
a. On the designated standby node:
# db2 start hadr on db hadrdb as standby
b. On the designated primary node:
# db2 start hadr on db hadrdb as primary
Repeat for all HADR databases. Again check the state of the HADR pair before proceeding
70
#IDUG
Starting DB2 Instances & HADR DB (continued..)
• As instance owner, ensure that the HADR pair is in “Peer” state (on both nodes) as follows:
# db2pd -hadr -db hadrdb
You should see output similar (abbreviated) to the following on the primary server:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:13:36
Role     State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Primary  Peer   Sync      0                 0
ConnectStatus  ConnectTime                            Timeout
Connected      Tue Jul 8 17:47:12 2008 (1215553632)   120
You should see output similar (abbreviated) to the following on the standby server:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:12:51
Role     State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Standby  Peer   Sync      0                 0
ConnectStatus  ConnectTime                            Timeout
Connected      Tue Jul 8 17:48:23 2008 (1215552946)   120
• Finally, change the HADR resource group to Online:
# chrg -o online db2_db2inst1_db2inst1_HADRDB-rg
• This last step will cause the virtual IP address (if the policy includes one) to be assigned.
71
#IDUG
Taking Standby Instance Offline
• Because the database is active, the force option is required for the db2stop command:
db2inst1@node2> db2stop force
• DB2 will also request that TSAMP lock the instance group to prevent TSAMP from trying to restart the instance. The HADR group will also get locked … this ALWAYS happens when the HADR pair are no longer in a Peer state.
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Pending online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
        '- Offline IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
                '- Offline IBM.Application:db2_db2inst1_node2_0-rs:node2
• To restart the instance:
db2inst1@node2> db2start
• This will result in TSAMP executing the 'db2Vxxx_start.ksh' script, which is also responsible for activating the HADR database; HADR re-integration then takes place and peer state results.
72
#IDUG
Taking Primary Instance Offline
• Because the database is active, the force option is required for the db2stop command:
db2inst1@node1> db2stop force
• DB2 will also request that TSAMP lock the instance group to prevent TSAMP from trying to restart the instance. The HADR group will also get locked … this ALWAYS happens when the HADR pair are no longer in a Peer state.
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
                |- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Pending online IBM.ResourceGroup:db2_db2inst1_node1_0-rg Request=Lock Nominal=Online
        '- Offline IBM.Application:db2_db2inst1_node1_0-rs Control=SuspendedPropagated
                '- Offline IBM.Application:db2_db2inst1_node1_0-rs:node1
• To restart the primary instance:
db2inst1@node1> db2start
• This will result in TSAMP executing the 'db2Vxxx_start.ksh' script, which should activate the db
• The 'hadrVxxx_start.ksh' script will then be executed and peer state should be re-established.
73
#IDUG
Performing a Manual Takeover (Controlled Failover)
• Because the DB2 instances are cluster aware in v9.5+, you can use the native DB2 takeover command. In fact, you should only use the DB2 takeover command, as follows (issued as instance owner on the current standby server):
db2inst1@node2> db2 takeover hadr on database HADRDB
• The HADR resource group will be locked and unlocked several times. There will also be a move request at some point.
• ‘lssam’ will show the online/offline states swapped for the HADR resource and ServiceIP, assuming the takeover is successful :
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
                |- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
• Use the following command to check the HADR role has swapped between the nodes and ensure the HADR pair have reached a peer state again :
db2inst1@node2> db2pd -hadr -db <hadr_db_name>
74
#IDUG
The wrong way to attempt a Controlled Failover
• If you attempt to move the HADR resource group using the 'rgreq -o move db2_db2inst1_db2inst1_HADRDB-rg' command, the failover/takeover may not succeed and you could end up with the HADR resource showing "Failed Offline" on the standby server.
• During the attempted move, “lssam” (lssam –top) would show:
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Move Nominal=Online
|- Pending online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
|- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
'- Pending online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
|- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
75
#IDUG
The wrong way to attempt a Controlled Failover (continued)
• If the move attempt fails, 'lssam' will show the virtual IP address moved back to the original primary, and the HADR resource on the standby will be set to Failed Offline, requiring a manual reset before any further failovers/takeovers will be possible:
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
• To reset the Failed Offline state, use the following TSAMP command:
resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' & NodeNameList={'node2'}" IBM.Application
• This will cause the hadrV10_stop.ksh script to be executed on node2; if successful (return code 0), the Operational State will change to "Offline".
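• Since the resource and node names must be typed exactly, a hedged sketch ('suggest_reset' is a hypothetical helper; the sample line is transcribed from the lssam output above) that derives the resetrsrc command from the "Failed offline" line itself:

```shell
# Hypothetical helper: for each "Failed offline" IBM.Application constituent in
# lssam output, print the matching resetrsrc command.
suggest_reset() {
  awk '/Failed offline IBM.Application:/ {
    split($NF, p, ":")   # p[2] = resource name, p[3] = node name
    printf "resetrsrc -s \"Name = %c%s%c & NodeNameList={%c%s%c}\" IBM.Application\n", \
           39, p[2], 39, 39, p[3], 39
  }'
}

sample="'- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2"
printf '%s\n' "$sample" | suggest_reset
# prints: resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' & NodeNameList={'node2'}" IBM.Application
```

On a live cluster this would be fed from 'lssam' directly; review the printed command before running it.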
76
#IDUG
Failure Scenarios
77
• The various failover scenarios supported by this solution are detailed in section 6 of a whitepaper called “Automated Cluster Controlled HADR (High Availability Disaster Recovery) Configuration Setup using the IBM DB2 High Availability Instance Configuration Utility (db2haicu) ”
• This whitepaper can be downloaded via the following URL:http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf
• The following scenarios result in automated actions, including failovers/takeovers:
1. Standby Instance Failure
2. Primary Instance Failure
3. Standby NIC Failures (public network)
4. Primary NIC Failures (public network)
5. Standby Node Failure
6. Primary Node Failure
#IDUG
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
78
#IDUG
Disable/Re-enable HA/Automated Failover (using db2haicu)
• To prevent TSAMP from taking any action on DB2 resources, disable HA:
db2inst1@node2> db2haicu -disable
79
The local database manager's configuration will be updated so that "Cluster manager" is unset. The 'db2haicu -disable' command also needs to be executed on the other server so that its instance configuration is updated as well.
With "Cluster manager" unset, you can take the entire domain Offline without affecting the manual operation of the DB2 instances.
#IDUG
Disable/Re-enable HA/Automated Failover (Continued …)
80
As part of the -disable process, DB2 will request that TSAMP lock all Resource Groups to prevent TSAMP from taking any action against DB2 resources:
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Online IBM.ResourceGroup:db2_db2inst1_node1_0-rg Request=Lock Nominal=Online
        '- Online IBM.Application:db2_db2inst1_node1_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_node1_0-rs:node1
Online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
        '- Online IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_node2_0-rs:node2
To re-enable, run 'db2haicu' as instance owner on each server, select "1" (Yes) when asked whether to enable high availability, and then choose "TSA".
#IDUG
Alternative for preventing TSAMP starting and stopping DB2 Resources
• The quickest way of preventing TSAMP from stopping/starting the resources is to change TSAMP to manual mode (Automation = Manual):
# samctrl -M T
• The only action TSAMP will continue to do is monitor the resources by continuing to execute the monitoring scripts associated with each resource.
• Check the current automation mode with the following command:
# lssamctrl
• To re-enable automation mode (Automation = Auto):
# samctrl -M F
• Although changing the Nominal (desired) state of a resource group to “offline” will trigger TSAMP to stop its resources, this does not mean automation is stopped. TSAMP will attempt to maintain the offline state, so if any resource is manually started, TSAMP will stop it again.
• Note you will not be able to perform a takeover while TSAMP is in Manual mode.
81
(After 'samctrl -M T', lssamctrl shows "Manual = True"; after 'samctrl -M F', it shows "Manual = False".)
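• A quick scripted check of the mode before attempting a takeover, as a hedged sketch (the exact 'lssamctrl' output layout below is an assumption, not transcribed from a live system; 'automation_manual' is a hypothetical helper):

```shell
# Hypothetical helper: print the "Manual" flag from lssamctrl-style output.
automation_manual() {
  awk -F' = ' '/Manual/ { print $2; exit }'
}

# Assumed (abbreviated) lssamctrl output layout.
sample='Displaying SAM Control information:
SAMControl:
        TimeOut = 60
        Manual = False'

printf '%s\n' "$sample" | automation_manual   # prints: False
```

"False" here would mean automation is active, so a takeover is possible; "True" would mean TSAMP is in Manual mode.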
#IDUG
Agenda
• Introduction and Overview
• System Automation Components Overview
• Mapping DB2 Components to TSAMP Resources
• Integrating TSAMP with DB2 HADR using db2haicu
• Controlling the Operational State of the DB2 Resources
• Disabling Automation (re-gain manual control of DB2)
• Serviceability
82
#IDUG
83
Serviceability - CLI commands
Use the TSAMP command "lssam" as previously demonstrated:
# lssam -top
# lssam -g <resource_group>
An alternative is the following TSAMP command:
# lsrg -m
#IDUG
84
Serviceability – logs
Three main areas of logging:
1. Logging from the DB2 automation scripts (i.e. start/stop/monitor scripts)
   "logger" statements in the policy scripts are written to syslog (eg. /var/log/messages on Linux systems)
2. Logging of TSAMP / RSCT core processes (i.e. quorum, monitor command timeouts)
   Written to syslog (Linux/AIX/Solaris) and errpt (AIX)
   Daemon log file directory: /var/ct/<DOMAIN>/log/mc/IBM.<DAEMON>RM
   – where <DAEMON> = Recovery, GblRes, …
   – Circular logs, cannot be opened with an editor directly! Format them with:
     rpttr -o dtic <log file dir>/trace_summary > my_trace.out
3. DB2's log file, "db2diag.log", with DIAGLEVEL 3 or higher
Use TSAMP Level 2 Support’s ‘getsadata’ script to collect data:
http://www.ibm.com/support/docview.wss?&uid=swg21285496
#IDUG
85
Serviceability – syslog messages from DB2 automation scripts
The following syslog message indicates the DB2 instance is Online (return code = 1):
<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)
The following syslog message indicates the DB2 instance is Offline (return code = 2):
<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 2 (db2inst1, 0)
The DB2 instance monitors repeat approximately every 10 seconds on each server if you're using a default automation policy.
The following syslog message indicates that the HADR resource is considered Online (return code = 1) and has the Primary role:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
Seen only on the node that's currently the primary node; repeats approximately every 21 seconds.
The following syslog message indicates that the HADR resource is considered Offline (return code = 2) and is most likely in a Standby state (normal/OK state):
<timestamp> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB
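• When scanning syslog by script, the return codes above can be mapped to states, as a minimal sketch ('state_of' is a hypothetical helper; the mapping 1 = Online, 2 = Offline is the one described on this slide):

```shell
# Hypothetical helper: map a monitor-script syslog line to an operational state.
state_of() {
  case "$1" in
    *'Returning 1'*) echo Online ;;
    *'Returning 2'*) echo Offline ;;
    *)               echo Unknown ;;
  esac
}

state_of 'node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)'            # prints: Online
state_of 'node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB' # prints: Offline
```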
#IDUG
86
Serviceability – syslog messages from DB2 automation scripts
The following syslog messages occur when TSAMP starts a DB2 instance:
<timestamp> node1 user:notice db2V10_start.ksh[856142]: Entered db2V10_start.ksh, db2inst1, 0
<timestamp> node1 user:debug db2V10_start.ksh[856146]: Able to cd to /home/db2inst1/sqllib : db2V10_start.ksh, db2inst1, 0
<timestamp> node1 user:debug db2V10_start.ksh[262214]: 1 partitions total: db2V10_start.ksh, db2inst1, 0
<timestamp> node1 user:notice db2V10_start.ksh[393252]: Returning 0 from db2V10_start.ksh ( db2inst1, 0)
If db2start was used to start the instance, the message below would be seen instead of the "1 partitions total" message shown above:
<timestamp> node1 user:info db2V10_start.ksh[856150]: db2V10_start.ksh is already up...
The following syslog messages are typical of the HADR resource group being brought online:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug root[524540]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node1 user:notice hadrV10_start.ksh[422078]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug hadrV10_start.ksh[422086]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:debug root[524290]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node1 user:notice hadrV10_start.ksh[422090]: Returning 0 : db2inst1 db2inst1 HADRDB
Note: the 'hadrV10_start.ksh' script doesn't actually bring the HADR database pair into a Peer state. It's likely to already be in a Peer state beforehand, because the databases are activated as part of starting the DB2 instances.
#IDUG
87
Serviceability – syslog messages from DB2 automation scripts
The following syslog messages occur when TSAMP stops a DB2 instance. This includes resetting a Failed Offline state for a DB2 instance resource:
<timestamp> node1 user:notice db2V10_stop.ksh[856142]: Entered db2V10_stop.ksh, db2inst1, 0
<timestamp> node1 user:notice db2V10_stop.ksh[393252]: Returning 0 from db2V10_stop.ksh ( db2inst1, 0)
The following syslog messages are typical of the HADR resource being stopped on one node so that a manual takeover can occur to the other node. It's also what you would see when resetting a Failed Offline state for an HADR resource:
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602322]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602330]: su - db2inst1 -c db2gcf -t 3600 -d -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602334]: Returning 0 : db2inst1 db2inst1 HADRDB
Note: the ‘hadrV10_stop.ksh’ script doesn’t actually stop the HADR functionality within DB2. It doesn’t affect Peer state.
#IDUG
88
Serviceability – syslog messages from DB2 automation scripts
The following syslog messages show the HADR resource group lock/unlock state:
<timestamp> node1 user:debug root[327754]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
<timestamp> node1 user:debug root[327780]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
and
<timestamp> node1 user:debug root[856206]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node1 user:debug root[856212]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
The HADR resource group is locked whenever Peer state is lost. The DB2 software uses a TSAMP API to request the lock. The “lockreqprocessed” script is used to check the lock and unlock states.
When the HADR pair are back in Peer state, the HADR resource group is unlocked, again requested by DB2.
The DB2 Instance resource groups also get locked if the db2stop command is used to stop an instance, and unlocked when db2start is used to start it again.
#IDUG
89
Serviceability – syslog messages from DB2 automation scripts
A manual (no force option) "takeover" (db2 takeover hadr on db HADRDB) would result in the following messages on the original primary server:
<timestamp> node1 user:notice hadrV10_stop.ksh[405566]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug hadrV10_stop.ksh[405574]: su - db2inst1 -c db2gcf -t 3600 -d -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:notice hadrV10_stop.ksh[405578]: Returning 0 : db2inst1 db2inst1 HADRDB
Assuming the above hadrV10_stop.ksh script completes with a 0 return code, a similar sequence of messages to the following would be seen on the original standby server:
<timestamp> node2 user:debug root[487538]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
<timestamp> node2 user:debug root[487564]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
<timestamp> node2 user:debug root[487566]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node2 user:debug root[487572]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node2 user:notice hadrV10_start.ksh[548876]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_start.ksh[548884]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node2 user:notice hadrV10_start.ksh[548888]: Returning 0 : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_monitor.ksh[696436]: Returning 1 : db2inst1 db2inst1 HADRDB
Note the return code of 0 from “hadrV10_start.ksh” meaning a successful takeover. Any other return code would be considered unsuccessful and would need to be diagnosed from a DB2 perspective.
#IDUG
90
Serviceability – syslog messages from TSAMP/RSCT
The following set of messages would indicate a cluster communication problem (domain split):
Firstly, the state of the domain changes to PENDING_QUORUM on each node:
CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
The Automation Engine (RecoveryRM) on each node reports that the other node has left the domain:
RECOVERYRM_INFO_4_ST A member has left. Node number = 1
The Network TieBreaker is tested; rc=0 indicates a successful poll of the network TieBreaker:
samtb_net[1294584]: op=reserve ip=10.201.1.1 rc=0 log=1 count=2
If the TieBreaker poll is successful, the node regains QUORUM:
CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM.
#IDUG
91
Serviceability – syslog messages from TSAMP/RSCT
The following messages are expected when TSAMP is assigning and removing ServiceIP (Virtual IP address) resources :
<timestamp> <node_name> daemon:notice GblResRM[1319044]: … :::GBLRESRM_IPONLINE IBM.ServiceIP assigned address on device. IBM.ServiceIP 10.20.30.42 en0
<timestamp> <node_name> daemon:notice GblResRM[618532]: … :::GBLRESRM_IPOFFLINE IBM.ServiceIP removed address. IBM.ServiceIP 10.20.30.42
The TieBreaker is released, and the TieBreaker reserve block removed, when a node has rejoined the domain:
<timestamp> <node_name> daemon:info samtb_net[790758]: op=release ip=10.20.30.1 rc=0 log=1 count=2
<timestamp> <node_name> daemon:info samtb_net[925932]: remove reserve block /var/ct/samtb_net_blockreserve_10.20.30.1
A MonitorCommand for a resource of class IBM.Application reached a defined timeout:
<timestamp> <node_name> GblResRM[24275]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Application.C,1.2.1,2434 :::GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. Resource name <resource_name>
Similar TIMEOUT messages exist for the StartCommand and StopCommand scripts.
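• Those TIMEOUT messages pair naturally with the MonitorCommandTimeout fix shown at the end of this deck. A hedged sketch ('suggest_timeout_fix' is a hypothetical helper; the sample line is abbreviated from the message above, with an example resource name from this deck filled in):

```shell
# Hypothetical helper: for each monitor-timeout message, print the chrsrc
# command that raises MonitorCommandTimeout for the named resource.
suggest_timeout_fix() {
  new_timeout="$1"
  awk -v t="$new_timeout" '/GBLRESRM_MONITOR_TIMEOUT/ {
    for (i = 1; i <= NF; i++)
      if ($i == "name") {   # the resource name follows the word "name"
        printf "chrsrc -s \"Name = %c%s%c\" IBM.Application MonitorCommandTimeout=%s\n", \
               39, $(i + 1), 39, t
      }
  }'
}

sample='GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. Resource name db2_db2inst1_node1_0-rs'
printf '%s\n' "$sample" | suggest_timeout_fix 120
# prints: chrsrc -s "Name = 'db2_db2inst1_node1_0-rs'" IBM.Application MonitorCommandTimeout=120
```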
#IDUG
92
Serviceability – Example of the TSAMP trace summary files
First, format the trace_summary file(s):
rpttr -o dtic <log file dir>/trace_summary > my_trace_summary.txt
The IBM.RecoveryRM traces (on the "master" node only) show:
• all 'online/offline order' statements
• Binder messages and exceptions
16:10:10.660208 T(229390) _RCD Offline Request against db2_db2inst1_db2inst1_HADRDB-rs on node node2
16:10:22.365645 T(229390) _RCD Offline request injected: db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
16:10:26.433653 T(229390) _RCD Online request injected: db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
16:10:26.442814 T(229390) _RCD RIBME-Hist for <NULL>: BINDER: Bind db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
The IBM.GblResRM traces (from each individual node) show:
• all start/stop command executions and service IP online/offline actions
13:56:55.919065 T(16386) _GBD Monitor reports: Network device "en0:0" (IP address 10.20.30.42) flagged UP. Bringing resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10) online.
13:57:30.693158 T(163851) _GBD Resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10): IP address 10.20.30.42 has been successfully taken offline on network interface "en0:0"
#IDUG
93
Serviceability – Operational States (OpState)
UNKNOWN (0)
– Generally a problematic state … really shouldn't be deliberately used in the automation scripts
ONLINE (1)
OFFLINE (2)
– Offline, and the resource should be startable here if needed
FAILED_OFFLINE (3)
– Offline, and not a possible node on which to start the resource
– If the MonitorCommand returns FAILED_OFFLINE, then availability can change as soon as the MonitorCommand returns something different, like Offline (return code 2)
– If the status is set to FAILED_OFFLINE by the StartCommand not succeeding within RetryCount, then manual intervention will be needed to fix the underlying resource and reset (resetrsrc) the resource
STUCK_ONLINE (4)
– Manual intervention will be needed to stop the underlying resource
PENDING_ONLINE (5)
– No action is taken in this state; the resource should eventually become online, or the start attempt will time out
PENDING_OFFLINE (6)
– No action is taken in this state; the resource should eventually become offline, or the stop attempt will time out
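• For reference when reading raw lsrsrc or trace output, the numeric values above as a lookup (a minimal sketch; 'opstate_name' is a hypothetical helper, the values are the ones listed on this slide):

```shell
# Hypothetical helper: map a numeric OpState to its name (values per this deck).
opstate_name() {
  case "$1" in
    0) echo UNKNOWN ;;
    1) echo ONLINE ;;
    2) echo OFFLINE ;;
    3) echo FAILED_OFFLINE ;;
    4) echo STUCK_ONLINE ;;
    5) echo PENDING_ONLINE ;;
    6) echo PENDING_OFFLINE ;;
    *) echo "INVALID($1)" ;;
  esac
}

opstate_name 3   # prints: FAILED_OFFLINE
```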
#IDUG
94
Serviceability
Check syslog and trace_summary to see if TSAMP is issuing start / stop orders/commands– If yes, then problem is most likely in DB2 automation scripts or core DB2 components– If no, problem is most likely in cluster/automation S/W, requiring TSAMP Level 2 involvement
If Operational State = UNKNOWN (OpState=0)– Check syslog and trace_summary for GBLRESRM _MONITOR_TIMEOUT– Fix: Increase MonitorCommandTimeout value
chrsrc -s "Name = '<resource_name>'" IBM.Application MonitorCommandTimeout=<new value>
lsrsrc -s "Name = '<resource_name>'" IBM.Application Name MonitorCommandTimeout
#IDUG
95
Db2pd -ha
#IDUG
96
Questions/Comments ?
#IDUG
Dale McInnis
IBM Canada Ltd.
[email protected]
Session: D15
Title: DB2 HADR Automation – Where is the magic?
Please fill out your session evaluation before leaving!