Post on 10-Jun-2015
DB2 pureScale Availability & Recovery
© 2010 IBM Corporation, October 13, 2010
Aamer Sachedina (aamers@ca.ibm.com)
Kelly Schlamb (kschlamb@ca.ibm.com)
Information Management
Continuous Availability
• Protect from infrastructure outages
– Automatic workload balancing
– Automatically recovers from component failures
– Tolerates multiple node failures
– Duplexed secondary global lock and memory manager
DB2 Cluster Services Overview
• Integrated DB2 component
• Single install as part of DB2 installation
• Upgrades and maintenance through DB2 fixpack
[Diagram: DB2 members and CFs running on top of DB2 Cluster Services: Cluster File System (GPFS), Cluster Manager (RSCT), Cluster Automation (Tivoli SA MP)]
DB2 Cluster Services
[Diagram: DB2 Cluster Services comprises Reliable Scalable Cluster Technology (RSCT), Tivoli Systems Automation for Multi-Platforms, and IBM General Parallel File System (GPFS)]
• DB2 CS tightly integrates these IBM products into DB2 pureScale
• DB2 instance creation creates RSCT and GPFS domains across hosts
• Single command used to add hosts to the instance:
db2iupdt -add -m newhost.acme.com db2inst1
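Because adding a host is a single command, it is easy to script when several hosts join the instance. The following is only a minimal sketch that generates the command lines in the form shown on this slide; the helper name `add_host_commands` is hypothetical, the sketch does not run anything, and depending on the DB2 release the real `db2iupdt` invocation may need further options, so check the command reference before using it.

```python
import shlex


def add_host_commands(instance, hosts):
    """Build one `db2iupdt -add -m <host> <instance>` line per host.

    Only the options that appear on the slide are used; nothing is executed.
    """
    return [
        shlex.join(["db2iupdt", "-add", "-m", host, instance])
        for host in hosts
    ]


if __name__ == "__main__":
    for line in add_host_commands("db2inst1", ["newhost.acme.com"]):
        print(line)
```

Generating the lines rather than executing them keeps the sketch safe to run anywhere and lets an administrator review each command first.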
DB2 pureScale HA Architecture
[Diagram: four members, each running DB2 CS, connected over the cluster interconnect to the primary and secondary CFs (each also running CS), all on a shared GPFS file system]
Virtually Instantaneous Recovery From Node Failure
• Protect from infrastructure related outages
– Automatically redistribute workload to surviving nodes
– Automatically recover in-flight transactions in as little as 15-20 seconds, including detection of the problem
[Diagram: application servers and DB2 clients]
Minimize the Impact of Planned Outages
• Keep your system up
– During OS fixes
– HW updates
– Administration
[Diagram: Identify member, do maintenance, bring node back online]
Member Hardware Failure
• Power cord tripped over accidentally
• DB2 Cluster Services loses heartbeat and declares member down
– Informs other members & CF servers
– Fences member from logs and data
– Initiates automated member restart on another (“guest”) host
  > Using a reduced, pre-allocated memory model
– Member restart is like a database crash recovery in a single system database, but is much faster
  • Redo limited to in-flight transactions (due to FAC)
  • Benefits from page cache in CF
• In the meantime, client connections are automatically re-routed to healthy members
– Based on least load (by default), or
– Pre-designated failover member
• Other members remain fully available throughout – “Online Failover”
– Primary retains update locks held by member at the time of failure
– Other members can continue to read and update data not locked for write access by failed member
• Member restart completes
– Retained locks released and all data fully available
Automatic; Ultra Fast; Online
Almost all data remains available. Affected connections transparently re-routed to other members.
[Diagram: clients see a single database view; members (DB2 + CS) with their logs and the shared data; primary and secondary CFs hold updated pages and global locks]
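The failure sequence on this slide can be sketched as a small in-memory model. This is an illustrative simulation only, not DB2 code: the `Member` class, the function names, and the rule that all connections of the failed member move to a single target are assumptions made to keep the example short. The state names come from the `db2instance -list` output later in this deck.

```python
from dataclasses import dataclass


@dataclass
class Member:
    member_id: int
    home_host: str
    current_host: str
    connections: int = 0
    state: str = "STARTED"


def pick_reroute_target(members, failed_id, failover_member=None):
    """Return the healthy member that takes over the failed member's clients."""
    healthy = [m for m in members
               if m.member_id != failed_id and m.state == "STARTED"]
    if not healthy:
        raise RuntimeError("no healthy member left to accept connections")
    # A pre-designated failover member wins if it is healthy ...
    for m in healthy:
        if m.member_id == failover_member:
            return m
    # ... otherwise use the least-loaded healthy member (the default).
    return min(healthy, key=lambda m: m.connections)


def fail_member(members, failed_id, guest_host, failover_member=None):
    """Model the failure: restart on a guest host and re-route the clients."""
    failed = next(m for m in members if m.member_id == failed_id)
    failed.state = "RESTARTING"        # declared down, fenced, restart begins
    failed.current_host = guest_host   # restart runs on the guest host
    target = pick_reroute_target(members, failed_id, failover_member)
    target.connections += failed.connections  # simplification: one target
    failed.connections = 0
    return target


def is_available(page, retained_locks):
    """Other members may touch any page the failed member had not write-locked."""
    return page not in retained_locks


def complete_restart(member, retained_locks):
    """Member restart finished: retained locks are released."""
    retained_locks.clear()
    member.state = "WAITING_FOR_FAILBACK"
```

With four members carrying 10, 5, 7 and 3 connections, failing member 3 with no designated failover member moves its 3 connections to member 1, the least loaded of the survivors. Pages in `retained_locks` stay unavailable until `complete_restart` runs.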
Member Failback
• Power restored and system re-booted
• DB2 Cluster Services automatically detects system availability
– Informs other members and PowerHA pureScale servers
– Removes fence
– Brings up member on home host
• Client connections automatically re-routed back to member
[Diagram: clients see a single database view; members (DB2 + CS) with their logs and the shared data; primary and secondary CFs hold updated pages and global locks]
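Failure and failback together form a short cycle of member states. The sketch below writes that cycle as a transition table. The state names are the ones shown in the `db2instance -list` output in this deck; the event names (`host_failed`, `restart_complete`, `home_host_available`) are hypothetical labels chosen for illustration and are not DB2 terminology.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("STARTED", "host_failed"): "RESTARTING",
    ("RESTARTING", "restart_complete"): "WAITING_FOR_FAILBACK",
    ("WAITING_FOR_FAILBACK", "home_host_available"): "STARTED",
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}")


def replay(events, state="STARTED"):
    """Apply a sequence of events and return every state visited."""
    visited = [state]
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited
```

Replaying `host_failed`, `restart_complete`, `home_host_available` from `STARTED` visits `RESTARTING` and `WAITING_FOR_FAILBACK` and ends back at `STARTED`, which matches the sequence described on the failure and failback slides.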
Member Hardware Failure and Failback
Normal operation:

> db2instance -list
ID  TYPE    STATE    HOME_HOST  CURRENT_HOST  ALERT
0   MEMBER  STARTED  host0      host0         NO
1   MEMBER  STARTED  host1      host1         NO
2   MEMBER  STARTED  host2      host2         NO
3   MEMBER  STARTED  host3      host3         NO
4   CF      PRIMARY  host4      host4         NO
5   CF      PEER     host5      host5         NO

HOST_NAME  STATE   INSTANCE_STOPPED  ALERT
host0      ACTIVE  NO                NO
host1      ACTIVE  NO                NO
host2      ACTIVE  NO                NO
host3      ACTIVE  NO                NO
host4      ACTIVE  NO                NO
host5      ACTIVE  NO                NO

host3 fails; member 3 restarting on host2:

> db2instance -list
ID  TYPE    STATE       HOME_HOST  CURRENT_HOST  ALERT
0   MEMBER  STARTED     host0      host0         NO
1   MEMBER  STARTED     host1      host1         NO
2   MEMBER  STARTED     host2      host2         NO
3   MEMBER  RESTARTING  host3      host2         NO
4   CF      PRIMARY     host4      host4         NO
5   CF      PEER        host5      host5         NO

HOST_NAME  STATE     INSTANCE_STOPPED  ALERT
host0      ACTIVE    NO                NO
host1      ACTIVE    NO                NO
host2      ACTIVE    NO                NO
host3      INACTIVE  NO                YES
host4      ACTIVE    NO                NO
host5      ACTIVE    NO                NO

Member 3 restarted on host2, waiting for failback to host3:

> db2instance -list
ID  TYPE    STATE                 HOME_HOST  CURRENT_HOST  ALERT
0   MEMBER  STARTED               host0      host0         NO
1   MEMBER  STARTED               host1      host1         NO
2   MEMBER  STARTED               host2      host2         NO
3   MEMBER  WAITING_FOR_FAILBACK  host3      host2         NO
4   CF      PRIMARY               host4      host4         NO
5   CF      PEER                  host5      host5         NO

HOST_NAME  STATE     INSTANCE_STOPPED  ALERT
host0      ACTIVE    NO                NO
host1      ACTIVE    NO                NO
host2      ACTIVE    NO                NO
host3      INACTIVE  NO                YES
host4      ACTIVE    NO                NO
host5      ACTIVE    NO                NO

db2nodes.cfg:
0 host0 0 - MEMBER
1 host1 0 - MEMBER
2 host2 0 - MEMBER
3 host3 0 - MEMBER
4 host4 0 - CF
5 host5 0 - CF

[Diagram: members (DB2 + CS) on host0 to host3 with their logs, primary CF on host4, secondary CF on host5, shared data]
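A monitoring script can spot a member that is running away from its home host by comparing the HOME_HOST and CURRENT_HOST columns of the `db2instance -list` output shown on this slide. The following is a minimal sketch that parses captured text; it assumes the whitespace-separated column layout from the slide, and the function names `parse_member_rows` and `displaced_members` are hypothetical. It does not invoke DB2.

```python
def parse_member_rows(output):
    """Parse the ID/TYPE/STATE/... section of captured `db2instance -list` text."""
    rows = []
    columns = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ID":
            columns = fields          # header row names the columns
            continue
        if fields[0] == "HOST_NAME":
            break                     # the per-host section starts here
        # Keep only data rows: numeric ID and one value per column.
        if columns and len(fields) == len(columns) and fields[0].isdigit():
            rows.append(dict(zip(columns, fields)))
    return rows


def displaced_members(rows):
    """Return members whose current host differs from their home host."""
    return [r for r in rows
            if r["TYPE"] == "MEMBER" and r["HOME_HOST"] != r["CURRENT_HOST"]]


SAMPLE = """\
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT
0 MEMBER STARTED host0 host0 NO
1 MEMBER STARTED host1 host1 NO
2 MEMBER STARTED host2 host2 NO
3 MEMBER WAITING_FOR_FAILBACK host3 host2 NO
4 CF PRIMARY host4 host4 NO
5 CF PEER host5 host5 NO
HOST_NAME STATE INSTANCE_STOPPED ALERT
host3 INACTIVE NO YES
"""

if __name__ == "__main__":
    for row in displaced_members(parse_member_rows(SAMPLE)):
        print(row["ID"], row["STATE"], row["HOME_HOST"], "->", row["CURRENT_HOST"])
```

On the sample, which copies the third state from the slide, the only displaced member is member 3, whose home host is host3 and current host is host2.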
Summary: Single Failure
Columns: Failure Mode; Other Members Remain Online?; Automatic & Transparent?; Comments

Failure mode: Member
Comments: Only data that was in-flight on the failed member remains locked temporarily. Connections to the failed member transparently move to another member.

Failure mode: Primary CF
Comments: Momentary “blip” in CF service. Transparent to members (in-flight CF requests just take a few more seconds before completing normally).

Failure mode: Secondary CF
Comments: Momentary “blip” in CF service. Transparent to members (in-flight CF requests just take a few more seconds before completing normally).
Summary: Multiple Failures
Columns: Failure Mode; Other Members Remain Online?; Automatic & Transparent?; Comments

Failure mode: [Diagram]
Comments: Only data that was in-flight on the failed members remains locked temporarily. Recoveries done in parallel. Connections to a failed member transparently move to another member.

Failure mode: [Diagram]
Comments: Same as member failure. Momentary, transparent “blip” in CF service. Connections to the failed member transparently move to another member.

Failure mode: [Diagram]
Comments: Same as member failure. Momentary, transparent “blip” in CF service. Connections to the failed member transparently move to another member.