DB2 z/OS Recovery Overview
04/10/23
DB2 Recovery 101 Overview
Bill Arledge
DB2 Data Management Strategist
Mainframe DB Recovery
04/10/23 ©2007 BMC Software2
Availability is critical …
› Lost revenue
– Up to $6 million/hour for e-businesses
– 2-3% of annual revenue for every 10 hours of database outage
› Lost customers
– Nearly 14 times your initial investment to win customers back
› Lost market share
– Approximately 1/2 point of market share for every 8 hours of outage (it will take an estimated 3 years to win customers back)
Recovery Elements
[Diagram: recovery timeline. Steps: FAILURE, ANALYSIS, CREATE RECOVERY JCL, EXECUTE RECOVERY. Span labels: RECOVERY MANAGEMENT, FAST UTILITIES, APPLICATION OUTAGE]
When Availability is Critical, Recovery is Crucial!
› Unplanned downtime is an unfortunate fact of life.
› Up to 80% of all unplanned downtime is caused by software or human error*
› Up to 70% of recovery is “think time”!
*Source: Gartner, “Aftermath: Disaster Recovery”, Vic Wheatman, September 21, 2001
Recover 30%
Analyze 30%
Investigate 20%
Diagnose 20%
DB2 Recovery Resource Review
– Large lines = User data flow
• Updates to Tablespace are Logged, Copied
– Small lines = DB2 info flow
• Copies and Logs are registered
• Log range for Tablespace recovery tracked
– MVS ICF Catalog
• Watches over all
[Diagram components: ICF Catalog, SYSLGRNX, TABLESPACE, BSDS, SYSCOPY, Active Log, Archive Logs, Full Copy or Incremental Copies]
ICF Catalog
› All DB2 pagesets must be cataloged
– Cataloging updates three MVS system files
• VTOC - Volume Table of Contents
• VVDS - VSAM Volume Data Set
• ICF - Integrated Catalog Facility
CREATE TABLESPACE TS000001 IN DP000001 USING VCAT DB2PVCAT ...
ICF Catalog, VTOC, VVDS:
DB2PVCAT.DSNDBC.DP000001.TS000001.I0001.A001
DB2PVCAT.DSNDBD.DP000001.TS000001.I0001.A001
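The two data set names on this slide follow a fixed pattern: the VCAT name, DSNDBC (cluster) or DSNDBD (data component), the database name, the table space name, an instance qualifier, and a piece number. A minimal Python sketch of that pattern follows. The function name and its defaults are hypothetical, chosen only to reproduce the names shown on the slide.

```python
def db2_dataset_names(vcat, dbname, tsname, instance="I0001", piece=1):
    """Build the cluster and data component names for one pageset data set.

    Hypothetical helper: it only mirrors the naming pattern on the slide.
    """
    suffix = f"{dbname}.{tsname}.{instance}.A{piece:03d}"
    return (f"{vcat}.DSNDBC.{suffix}", f"{vcat}.DSNDBD.{suffix}")


cluster, data = db2_dataset_names("DB2PVCAT", "DP000001", "TS000001")
print(cluster)
print(data)
```

Knowing the pattern lets you map a damaged volume's VTOC listing back to the databases and table spaces that live on it.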
Boot Strap Data Set (BSDS)
– Boot Strap Data Set (BSDS)
• Active and Archive Log Inventories
• Active Log is reused, Archive Log is serial up to 10000 (V8 and later)
– Current Active Log is DS01
– Next Archive Log will be A004
Active Logs: DS01, DS02, DS03
ACTIVE LOG   RBA START   RBA END   STATUS
DS01         40000       4FFFF     Not Reusable
DS02         20000       2FFFF     Reusable
DS03         30000       3FFFF     Reusable
ARCHIVE LOG  RBA START   RBA END   VOLSER
A001         10000       1FFFF     VOL=T00001
A002         20000       2FFFF     VOL=T00002
A003         30000       3FFFF     VOL=T00003
Archive Logs: A001, A002, A003
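To make the inventory concrete, here is a small Python sketch that looks up which log data sets cover a given RBA, using the ranges from this slide. The class and function names are hypothetical, and listing active logs ahead of archive logs is an assumption of this sketch, not a statement of how DB2 itself orders its reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogDataSet:
    name: str
    start_rba: int
    end_rba: int
    kind: str  # "active" or "archive"


# Inventory copied from the slide (RBAs are hexadecimal).
INVENTORY = [
    LogDataSet("DS01", 0x40000, 0x4FFFF, "active"),
    LogDataSet("DS02", 0x20000, 0x2FFFF, "active"),
    LogDataSet("DS03", 0x30000, 0x3FFFF, "active"),
    LogDataSet("A001", 0x10000, 0x1FFFF, "archive"),
    LogDataSet("A002", 0x20000, 0x2FFFF, "archive"),
    LogDataSet("A003", 0x30000, 0x3FFFF, "archive"),
]


def locate_rba(rba, inventory=INVENTORY):
    """Return the names of data sets covering rba, active logs listed first."""
    hits = [d for d in inventory if d.start_rba <= rba <= d.end_rba]
    hits.sort(key=lambda d: d.kind != "active")  # stable sort keeps slide order
    return [d.name for d in hits]


print(locate_rba(0x25000))  # covered by an active log and its archived copy
print(locate_rba(0x15000))  # only on an archive log
```

RBA 15000 is only on archive log A001 because the active data set that once held it has since been reused.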
BSDS and Relationship to the LOG
DB2 BSDS
– Active & Archive Log Dataset Information
– DDF Communication Record
– Checkpoint Queue
LOG
– Checkpoint
• Incomplete UR Summary Record
• Open Pageset Summary Record
• Database and Pageset Exceptions Summary Record
Print Log Map (DSNJU004)
› DSNJU004 (Print Log Map) provides
– Active log data set information
– Archive log data set information
– System checkpoints
• Driven by the LOGLOAD ZPARM
• Driven by an active log switch
Local Time
GMT
DB2 Logs – A Shared Resource
› Primary usage is to provide for restart and recovery of DB2 subsystems and objects
– Other uses include audit and data migration
› Log records
– Record updates for all spaces with before and/or after images
• Data and index pages
– Record DDL operations (updates to catalog pages)
– Capture checkpoint and restart information including
• Transaction information
• Object exception information
– Are identified by a log point
• RBAs (Relative Byte Addresses) for non-data sharing
• LRSNs (Log Record Sequence Numbers) for data sharing
– Include the page (and row if applicable) being impacted
DB2 Log Data Flow
– Log Buffers
• RBA (Relative Byte Address)
– Written to Active Log
– Full Active written to Archive
[Diagram: INSERT ONE ROW flows through the DB2 RECOVERY LOG MANAGER into the LOG BUFFERS (RBA 1000, RBA 2000), then to the Active Log and, at LOG SWITCH, to the Archive Log. Log records shown: BEGIN_UR RECORD, UNDO / REDO DATA, UNDO / REDO INDEX (twice), COMMIT RECORDS, END_UR RECORD, OPEN PAGESETS, Checkpoint RECORD]
DB2 Log Data Breakdown
Data 20%
Index 50%
Checkpoint 10%
Commit 5%
Other 15%
[Chart labels: UNDO, REDO]
DB2 Log Archival Process
› Active log offload is triggered by several events, including:
– An active log data set is full
– Starting DB2 and an active log data set is full
– ARCHIVE LOG command
– Two uncommon events also trigger the offload:
• An error occurring while writing to an active log data set
• Filling of the last un-archived active log data set
ARCHIVING THE ACTIVE LOG
[Diagram: Write to Active Log → Triggering Event → Offload Process → Write the Archive → Update the BSDS]
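The offload sequence above can be pictured with a toy model. The Python sketch below is an illustration only, not DB2's implementation: the class name, the "records per data set" capacity, and the tuple layout of the inventory are all hypothetical. It models two of the triggers from the list (a full active log data set and the ARCHIVE LOG command) and the order "write the archive, then update the BSDS".

```python
class LogManager:
    """Toy model: fixed-size active log data sets reused in a ring."""

    def __init__(self, active_count=3, capacity=4):
        self.capacity = capacity            # records per active data set (hypothetical unit)
        self.active = [[] for _ in range(active_count)]
        self.current = 0                    # index of the data set being written
        self.archives = []                  # offloaded copies, in order
        self.bsds = []                      # archive inventory: (name, first, last)

    def write(self, record):
        self.active[self.current].append(record)
        if len(self.active[self.current]) == self.capacity:
            self._offload()                 # trigger: active log data set is full

    def archive_log(self):
        """Model of the ARCHIVE LOG command: offload what has been written so far."""
        if self.active[self.current]:
            self._offload()

    def _offload(self):
        data = self.active[self.current]
        name = f"A{len(self.archives) + 1:03d}"
        self.archives.append(list(data))             # write the archive
        self.bsds.append((name, data[0], data[-1]))  # update the BSDS
        self.current = (self.current + 1) % len(self.active)
        self.active[self.current] = []               # the next data set is reused


mgr = LogManager(active_count=3, capacity=4)
for rba in range(1, 11):
    mgr.write(rba)
mgr.archive_log()
print(mgr.bsds)
```

In this run two offloads are triggered by full data sets and the third by the explicit command, so the inventory ends with three archive entries.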
DSN1LOGP - Looking at Log Records
› DSN1LOGP is a standalone utility available with DB2
› IBM does not set out to document everything you see in a detail report
› You can still get lots of information but not easily
› Most recovery experts would find a more sophisticated log tool handy to
– Re-create SQL from the log
– Report and filter on transaction and column data more effectively
– Handle compression and other issues
DSN1LOGP – Usage Examples
› Who updated that table that's not supposed to be updated?
› Who DROPPED that DATABASE?
› Did BADUSER update anything?
› Are GRANTs and REVOKEs being done outside of our control?
› Who FREEd that PLAN or PACKAGE?
› Are SAVEPOINTs being executed on this subsystem?
› Can I find a common quiet point for two tablespaces for Recovery?
› Sample summary log record for a Unit of Recovery (UOR)
DSN1151I DSN1LPRT UR CONNID=TSO CORRID=BADUSER AUTHID=BADUSER PLAN=DSNESPCS
         START DATE=05.059 TIME=14:29:20 DISP=COMMITTED INFO=COMPLETE
         STARTRBA=024AC666C4C7 ENDRBA=024AC666C76B
         STARTLRSN=BCA426C98AED ENDLRSN=BCA426C98C3E
         NID=* LUWID=USBMCN01.DEBALU.BCA4267B984C.0001
         COORDINATOR=* PARTICIPANTS=*
         DATA MODIFIED:
           DATABASE=0600=KMMSEGDB PAGE SET=0002=KMMSEGS
Identifying Relevant Log
› The SYSLGRNX directory table records log ranges containing updates to a space (or partition)
– There are entries for each data sharing member updating and
• these entries give the location range on the logs (relative byte address, RBA) and
• the relative time range (log record sequence number, LRSN) to coordinate with copies and other logs
– SYSLGRNX output is provided by the REPORT RECOVERY utility
• Identifies assets required for recovery
DB2 Directory Table
– SYSLGRNX
• Open update log ranges on the DB2 Log
• Provides for faster recovery
[Diagram: LOG timeline marked with COPY, open log ranges A-B, C-D, E-F, QUIESCE (Quiet Point), and Current]
SYSLGRNX
DBID  PSID  Start Range  End Range
0105  000F  A            B
0105  000F  C            D
0105  000F  E            F
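The "faster recovery" benefit comes from skipping log that holds no updates for the space. A minimal Python sketch of that idea: given the update ranges recorded for a space, keep only the parts that fall after the image copy and before an optional stop point. The function name and the numeric RBAs are hypothetical stand-ins for the A to F points on the slide.

```python
def ranges_to_apply(open_ranges, copy_rba, stop_rba=None):
    """Clip recorded update ranges to the window between copy_rba and stop_rba."""
    needed = []
    for start, end in sorted(open_ranges):
        lo = max(start, copy_rba)
        hi = end if stop_rba is None else min(end, stop_rba)
        if lo < hi:                      # range overlaps the recovery window
            needed.append((lo, hi))
    return needed


# Hypothetical RBAs standing in for the ranges A-B, C-D, E-F.
ranges = [(0x100, 0x200), (0x300, 0x400), (0x500, 0x600)]

to_current = ranges_to_apply(ranges, copy_rba=0x250)
to_point = ranges_to_apply(ranges, copy_rba=0x350, stop_rba=0x550)
print([(hex(a), hex(b)) for a, b in to_current])
print([(hex(a), hex(b)) for a, b in to_point])
```

Log between the ranges (for example between 0x200 and 0x300) never needs to be read for this space.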
Spaces Absent in SYSLGRNX
› Some catalog and directory spaces don’t have SYSLGRNX entries
– DSNDB01.SYSUTILX
– DSNDB01.DBD01
– DSNDB01.SYSLGRNX
– DSNDB06.SYSCOPY
– DSNDB06.SYSGROUP
– DSNDB01.SCT02
– DSNDB01.SPT01
Log Compatibility
› Since log records reference pages and rows in spaces, and spaces are identified by internal IDs:
– Certain activities make one series of log records incompatible with others and require a new copy or starting point
– LOAD REPLACE completely resets the data, and REORG and REBUILD change row and key entry locations
– Certain DROPs can be disastrous
Log Roadblocks
[Diagram labels: COPY; insert row, page 1F2, row 7; REORG is executed; Row is now on page E2, row 1; COPY; update row, page E2, row 1; log apply (shown twice)]
Problem Categories
› An expert categorizes failures and plans and performs accordingly. There are three common possibilities.
– A media failure destroys data or compromises it (a disk, controller, or cache failure occurs)
– Data becomes logically compromised by an incorrect job or transaction
– The data center is unusable (aka disaster)
Media Failure
› A media failure destroys data or compromises it
– Identify volume contents and RECOVER or REBUILD for traditional DASD
– Identify objects affected by the storage component and RECOVER or REBUILD, bearing in mind that some may not be affected because they weren’t recently updated
Pop Quiz - Index Recovery
Maybe Not
Recovery to current (for a media failure) would not require it. Some objects are being recovered to overcome media failure. Related objects should still be consistent if they were unaffected by the media failure.
Logical Data Corruption
› Data has become logically compromised by an incorrect job or transaction
– An expert finds the cause of the problem.
– An expert knows the possible tools to use
• whether it is a set of RECOVER and REBUILD statements or
• a special program or
• a special log tool
• or some combination of the above
Finding Corruption Point
Look for a place where everyone agrees data wasn’t corrupted. Get as close to now as possible!
POINT A: Application data is fine
POINT B: Application data is corrupted
What happened in between?
Consistency Point
If RECOVER must be used, a point of consistency across affected table spaces must be located and any good updates after that point will be lost.
[Timeline labels: Batch Job; Online Trans (twice); Application Data Is fine; Batch Job Rerun; Application Data Now Corrupted]
Looking for QUIESCE Points
• Official points are recorded in SYSCOPY, generally as a result of execution of the QUIESCE utility
• Useful queries for evaluating available quiesce (quiet) points on the DB2 log
Identifies all quiesce points after the specified ICDATE:
  SELECT DBNAME, TSNAME, ICDATE, ICTIME, HEX(START_RBA)
    FROM SYSIBM.SYSCOPY
   WHERE DBNAME IN ('LSBX', 'LSBQ')
     AND ICTYPE = 'Q'
     AND ICDATE > '020124'
   ORDER BY ICDATE, ICTIME;
Identifies the latest quiesce point for a set of objects:
  SELECT HEX(MAX(START_RBA))
    FROM SYSIBM.SYSCOPY
   WHERE DBNAME IN ('LSBX', 'LSBQ')
     AND ICTYPE = 'Q';
Identifies related objects with no entry at the latest quiesce point:
  SELECT DBNAME, NAME
    FROM SYSIBM.SYSTABLESPACE
   WHERE DBNAME IN ('LSBX', 'LSBQ')
     AND (DBNAME, NAME) NOT IN
         (SELECT DBNAME, TSNAME
            FROM SYSIBM.SYSCOPY
           WHERE START_RBA =
                 (SELECT MAX(START_RBA)
                    FROM SYSIBM.SYSCOPY
                   WHERE DBNAME IN ('LSBX', 'LSBQ')
                     AND ICTYPE = 'Q'));
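The same search can be sketched outside SQL. The Python function below takes rows shaped like a few SYSCOPY columns and returns the highest START_RBA at which every table space in a set has a quiesce entry. It is a sketch under two assumptions: that one QUIESCE of a set registers the same START_RBA for each member (the slide's queries rely on this too), and that the table space names and RBA values in the sample rows are invented for illustration.

```python
def latest_common_quiesce(rows, spaces):
    """rows: iterable of (dbname, tsname, ictype, start_rba) tuples.

    Returns the highest START_RBA at which every space has an ICTYPE 'Q'
    entry, or None if no such point exists.
    """
    wanted = set(spaces)
    by_rba = {}
    for dbname, tsname, ictype, start_rba in rows:
        if ictype == "Q" and (dbname, tsname) in wanted:
            by_rba.setdefault(start_rba, set()).add((dbname, tsname))
    common = [rba for rba, seen in by_rba.items() if seen == wanted]
    return max(common) if common else None


# Hypothetical SYSCOPY rows: TS3 missed the second quiesce.
rows = [
    ("LSBX", "TS1", "Q", 0x1000),
    ("LSBX", "TS2", "Q", 0x1000),
    ("LSBQ", "TS3", "Q", 0x1000),
    ("LSBX", "TS1", "F", 0x1800),
    ("LSBX", "TS1", "Q", 0x2000),
    ("LSBX", "TS2", "Q", 0x2000),
]
all_three = [("LSBX", "TS1"), ("LSBX", "TS2"), ("LSBQ", "TS3")]
print(latest_common_quiesce(rows, all_three))
```

Here the later quiesce is not usable for the full set because one table space has no entry at it, so the function falls back to the earlier point.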
Recovering from Logical Errors Using SQL Processes
If the result of the batch job was undone with SQL then the online transactions might be preserved and it would not be necessary to find a point of consistency for RECOVER.
[Timeline labels: Online Trans (repeated); Batch Job Rerun; Application Data Now Corrupted; Batch Job Reversed with SQL INSERT and SQL DELETE]
Caveats for a SQL approach
› Using SQL to correct logical errors has some possible pitfalls
• The transactions being preserved may have depended on the incorrect data
[Diagram: "transaction changes addresses"; "transaction changes salesmen based on addresses"; labels "Good!" and "BAD!"]
Caveats for a SQL approach
› Using SQL to correct logical errors has some possible pitfalls
• The transactions that ran during or after the corruption may have also updated the same column in some of the corrupted rows
[Diagram: "many employees' 401K deductions set to zero"; "Employee requests to set new percentages for 401K processed"; labels "Good!" and "BAD!"]
Caveats for a SQL approach
› Allowing access to application spaces while they are corrupted may cause problems, as in these examples:
• If a group of customers were accidentally deleted from your database, then allowing salesmen to continue placing orders might cause them to recreate customer rows because they don’t see the rows. When an insert is attempted for the customer to correct the delete, it will likely receive -803 or cause an improper duplicate customer record
• Corrupted customer shipping addresses might cause a label to be printed (read only) and packages to be misdirected
Image Copy
› An image copy is a sequential dataset
– Contains page images from the tablespace or indexspace
– Represents at least one data set of a space and at most a complete space (all data sets or partitions)
› Image Copies
– Can be made while changes are taking place (SHRLEVEL CHANGE) or
– Can be made allowing only reads so they are consistent (SHRLEVEL REFERENCE)
– Registered in SYSCOPY and accessible via SQL SELECT
– REPORT RECOVERY identifies the copy required for recovery
• No guarantee that a copy listed in SYSCOPY still exists or is still cataloged.
– Can be used to UNLOAD data
– Deleted by
• DROP DDL against the space
• Potentially by the MODIFY utility
Image Copy Types
› Multiple identical image copies (up to four) may be made. They are identified as:
• Primary or Backup; and
• Local or Recovery site.
› Image copies may be made with only changed pages. These are incremental image copies.
[Diagram: a space of x nK pages and its copy]
SHRLEVEL CHANGE Copies
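[Diagram: timeline of a SHRLEVEL CHANGE copy, with events numbered 1 to 5 on the slide. Event labels: copy begins, page 2 update, page FFF2 update, page 2 copied, page FFF2 copied, copy ends. Contents of copy: pg0, pg1, pg2 ... pgFFF0, pgFFF1, pgFFF2]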
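Why a copy taken during updates is still usable can be shown with a toy model. In the Python sketch below, pages are copied one at a time while updates arrive, so the copy alone mixes old and new page images. Applying the log from the copy's starting point brings it to a consistent state. Everything here is a simplification for illustration: page contents are strings, the log holds only after-images, and the names and RBA values are hypothetical.

```python
def fuzzy_copy_then_recover():
    pages = {0: "a0", 1: "b0", 2: "c0"}   # live table space: page -> contents
    log = []                               # redo records: (rba, page, after_image)
    rba = 100                              # hypothetical current log position

    def update(page, value):
        nonlocal rba
        rba += 1
        pages[page] = value
        log.append((rba, page, value))

    copy_start_rba = rba                   # log point registered for the copy
    copy = {}
    copy[0] = pages[0]                     # page 0 copied
    update(2, "c1")                        # page 2 updated before it is copied
    copy[1] = pages[1]                     # page 1 copied
    copy[2] = pages[2]                     # page 2 copied with the new image
    update(0, "a1")                        # page 0 updated after it was copied

    # Recovery: restore the fuzzy copy, then apply the log from its start point.
    restored = dict(copy)
    for record_rba, page, after_image in log:
        if record_rba > copy_start_rba:
            restored[page] = after_image
    return copy, restored, pages


copy, restored, current = fuzzy_copy_then_recover()
print(copy)      # the copy alone mixes old and new page images
print(restored)  # after log apply it matches the live data
```

Reapplying the update to page 2 is harmless in this model because an after-image simply overwrites the page with the same contents.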
External Copies Unknown to DB2
› Data set dumps made by DFDSS or DSN1COPY or other mechanisms
– Aren’t registered but may be used by
• Restoring known copies that are consistent because the space was stopped or DB2 was cleanly stopped
• Restoring a complete set of system data
– ‘flash copied’ or ‘snapped’ between the SET LOG SUSPEND and SET LOG RESUME commands or
– made while DB2 is down after it was taken down cleanly
– and then restarting DB2.
SYSCOPY MINING
SYSCOPY contains a wealth of data!
WHO?
GROUP_MEMBER
DBNAME
TSNAME
DSNUM
DSNAME
JOBNAME
AUTHID
WHEN?
TIMESTAMP
START_RBA
ICDATE
ICTIME
WHAT?
ICTYPE
STYPE
SHRLEVEL
ICBACKUP
PIT_RBA
OTYPE
SYSCOPY Example
– DB2 Catalog Table SYSIBM.SYSCOPY
• Backup and recovery point information
IC Type   Description
F         Full
I         Incremental
Q         Quiesce
X         REORG LOG(YES)
SHRLEVEL  Description
R         Reference
C         Change
Spaces not recorded in SYSCOPY
› Three catalog and directory spaces do not have entries in the SYSCOPY table
– DSNDB01.DBD01
– DSNDB01.SYSUTILX
– DSNDB06.SYSCOPY
› Information on copies for these spaces resides in the DB2 log
Pop Quiz - Avoid Copies?
PROBABLY NOT
Not if REORG or LOAD are used with LOG NO.
DB2 Recovery Processing
[Diagram: RECOVER processing. Inputs: Image Copy, SYSCOPY, SYSLGRNX, BSDS, ACTIVE LOG, ARCHIVE LOG. Phases: RESTORE, LOGAPPLY. Outputs: TABLE SPACE, MESSAGES]
RECOVER Flavors
› RECOVER can use all the log records to the end of the subsystem log(s) or
› RECOVER can be instructed to stop at a particular log point
› RECOVER usually starts by restoring image copies except in the rare cases where everything is on the log and
› RECOVER has a LOGONLY feature that assumes the space is restored outside its control
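These flavors differ only in where the pages come from and where log apply stops. The Python sketch below models that with one function. It is a toy illustration, not the utility's logic: the function name, its parameters, and the sample page contents and RBAs are hypothetical, and the log holds only after-images.

```python
def recover(base_pages, start_rba, log, to_rba=None):
    """Toy model of restore followed by log apply.

    base_pages: dict page -> contents, from an image copy or from a
                restore done outside this function (the LOGONLY case)
    start_rba:  log point the base pages correspond to
    log:        list of (rba, page, after_image) in ascending RBA order
    to_rba:     stop point, or None to apply to the end of the log
    """
    pages = dict(base_pages)
    for rba, page, after_image in log:
        if rba <= start_rba:
            continue                      # already reflected in the base pages
        if to_rba is not None and rba > to_rba:
            break                         # stop at the requested log point
        pages[page] = after_image
    return pages


image_copy = {1: "A", 2: "B"}             # hypothetical copy taken at RBA 100
log = [(90, 1, "old"), (110, 1, "A1"), (120, 2, "B1"), (130, 1, "A2")]

to_current = recover(image_copy, 100, log)
to_log_point = recover(image_copy, 100, log, to_rba=120)
print(to_current)
print(to_log_point)
```

Stopping at RBA 120 discards the later update to page 1, which is the data loss a point-in-time recovery accepts.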
Disaster Recovery
› Options from weekly dumps to offsite logging
– Dumps - Simple, cheap, maximum data loss
• Weekly dumps mean several days of data loss
– Offsite Logs - Complex, expensive, no data loss
• Applying log data to a shadow increases expense
– Compromise - Periodic vaulting of Copies & Logs
• Daily or hourly log shipment will minimize data loss
› Good topic for a future presentation
[Chart: Cost and Complexity versus Data Loss and Outage Time]
Expert Summary
› Know the basics and don’t be caught by the myths
› Know the assets you are trying to protect
› Know what you have to protect them with
› Plan for each type of failure and practice if you can