Adventures in Dataguard
-
Upload
jason-arneil -
Category
Technology
-
view
5.045 -
download
2
Transcript of Adventures in Dataguard
Adventures in DataguardDr. Jason Arneil
Motivation
Why Dataguard
• Introduction
• The Motivation
• Dataguard Architecture & Features
• Creating a Physical Standby
• Maintaining your standby
• Using your Standby
• Performing a Switchover
AGENDA
Introduction
Health Warning
Introduction
• Jason Arneil
• System Administrator/DBA
• Using Oracle since 1998
• At Nominet since 2001
About Me
Introduction
• Nominet is the internet registry for .uk domain names
• Nominet has been in existence for over 11 years
• Nominet is run as a not-for-profit company
• Nominet is owned by its members
• There are over 6 Million .uk domain names
About Nominet
Motivation
Why Dataguard
• Big push on a Nominet Business Continuity Plan
• Dataguard is the Oracle solution for disaster recovery
• Physical Standby was the obvious option
• Maximum Availability Architecture (MAA)
Motivation
Business Continuity Site
Architecture & Features
Dataguard Processes
PrimaryDatabase
Transactions Physical/Logical StandbyDatabase
Backup /Reports
Transform Redo to SQL for SQL Apply
MRP/ LSP
ARCHArchived Redo Logs
Archived Redo Logs
ARCH
Oracle Net
StandbyRedo Logs
RFS
FAL
Online Redo Logs
LGWRLNS
Architecture & Features
Dataguard Features
• Several Protection Modes
– Maximum Protection
– Maximum Availability
– Maximum Performance
• Several Transport Modes
– LGWR SYNC
– LGWR ASYNC
– ARCH
Creating a Standby
Prepare Primary & Standby
• Prepare Primary Database
– Enable Force Logging
SQL> alter database force logging;
– Modify initialization parameters
• Prepare Standby Database
– Setup directory structure
– Create spfile with correct parameters
– Start database in nomount
Creating a Standby
Log Transport Parameters
• LOG_ARCHIVE_CONFIG='DG_CONFIG=(PRIMARY, STANDBY)'
• LOG_ARCHIVE_DEST_1='LOCATION=/var/oracle/PRIMARY/arch'
• LOG_ARCHIVE_DEST_2='SERVICE=PRIMARC DB_UNIQUE_NAME=PRIMARY'
• LOG_ARCHIVE_DEST_3='SERVICE=STANDBY LGWR ASYNCREOPEN=15 MAX_FAILURE=10 OPTIONAL
VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=STANDBY'
Creating a Standby
ssh tunnels
• You may not wish your redo data being sent unencrypted across the internet to your standby. You can use ssh tunnels to avoid this
– ssh -N -L 3333:standby:1521 oracle@standby
• Now the tnsnames entry points to the localhost
STANDBYARC =
(DESCRIPTION =
(SDU = 32767)
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT=3333)))
(CONNECT_DATA =
(SERVICE_NAME = STANDBY)))
Creating a Standby
Some Other Parameters
• FAL_SERVER
• FAL_CLIENT
• ARCHIVE_LAG_TARGET
• STANDBY_FILE_MANAGEMENT
• DB_FILE_NAME_CONVERT
• LOG_FILE_NAME_CONVERT
Creating a Standby
backup your primary
• Backup primary - rman is good
– rman> backup format '/backup/%U' database plus archivelog;
– rman> backup format '/backup/%U' current controlfile for standby;
• Recover backup on standby node
– I like using rman duplicate to create standby:
• (oracle$) rman target sys/password@PRIMARY auxiliary /
• rman> duplicate target database for standby;
Creating a Standby
Start applying redo
• Create standby redo log files on both primary and standby:
– sql> alter database add standby logfile thread 2 group 42 (’PATH_TO_DATA/standbyredo01.log') size 512M;
• Now you can start the physical standby recovering logs:
– sql>alter database recover managed standby database disconnect from session;
• Or if you prefer real time apply:
– sql>alter database recover managed standby database using current logfile disconnect from session;
Maintaining your standby
Monitoring the Standby
• You have to ensure your standby is keeping up with your primary
• You can check which was the last log to have been applied to your standby is
– sql> SELECT MAX(SEQUENCE#), THREAD# FROM V$ARCHIVED_LOG where APPLIED='YES' GROUP BY THREAD#;
MAX(SEQUENCE#) THREAD#
-------------- ----------
2976 1
1888 2
Maintaining your standby
Monitoring Standby Progress
• A good way of checking what the background processes of your standby are up to is using v$managed_standby
– SQL> select process, sequence#, status
from V$managed_standby;
PROCESS SEQUENCE# STATUS
-------- ---------- ------------
ARCH 2967 CLOSING
ARCH 2974 CLOSING
RFS 2977 IDLE
MRP0 1889 APPLYING_LOG
RFS 1889 IDLE
RFS 2977 IDLE
Maintaining your standby
Monitoring Your Standby
• You have to ensure your standby is keeping up with your primary
• V$DATAGUARD_STATS provides useful information
– SQL> select name, value from v$dataguard_stats;
NAME VALUE
-------------------------------- ------------------------------------
apply finish time +00 00:00:00
apply lag +00 00:00:11
estimated startup time 41
standby has been open N
transport lag +00 00:00:03
Maintaining your standby
Monitoring Your Standby
• A way of finding out what has been happening to your standby over a period time is to look at the v$dataguard_status view
– Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 1 sequence 2977 (in transit)
– Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 1 sequence 2977 (in transit)
– Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 2 sequence 1889 (in transit)
– Remote File Server 01-AUG-07 Primary database is in MAXIMUM PERFORMANCE mode
– Remote File Server 01-AUG-07 RFS[53]: Successfully opened standby log 14: '+DATA2/standby/standbyredo02.log'
Maintaining your standby
Oracle can’t divide by 0
• Standby was happily working away
– ORA-07445: exception encountered: core dump [kcrarmb()+152] [SIGFPE] [Integer divide by zero] [0x00085C300
• MRP process crashes
– No redo gets applied from this point
• Logs after the one that caused the ORA-07445 still being shipped
• A simple restart of the managed recovery process does a FAL and the standby is back up-to-date
Maintaining your standby
kcrfr_resize2
• Lots of problems after upgrade to 10.2.0.3
– Recovery of Online Redo Log: Thread 2 Group 23 Seq 999 Reading mem 0
Mem# 0: +DATA3/standby/standbyredo11.log
ORA-00600: internal error code, arguments: [kcrfr_resize2], [652614828032], [268423168], [], [], [], [], []
• Perhaps caused by the following:
– Bug 3306010 OERI[kcrfr_resize2] possible in MEDIA recovery
Media recovery may fail with ORA-600 [kcrfr_resize2] when
the number of redo strands is set to a high value using
log_parallelism.
Maintaining your standby
kcrfr_resize2
• This issue has recently been published as Note:453259.1
– Triggered by having a large log_buffer
• This bug affects 10.2.0.3 and potentially 9.2.0.8
• It is related to the size of the log_buffer parameter
• Fix is included in 10.2.0.4
Maintaining your standby
kcrrupirfs
• ARC processes died on primary:
ORA-00600: [kcrrupirfs.20] [4] [368]
• Trace file showed the following:
Corrupt redo block 479421 detected: bad block number
Flag: 0x0 Format: 0x0 Block: 0x00000000 Seq: 0x00000000 Beg: 0x0 Cks:0x0 <<<<<<<--
----- Dump of Corrupt Redo Buffer -----000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Maintaining your standby
kcrrupirfs
• Oracle think initially think this ORA-600 error was hardware related
– There are NO indications of any hardware fault - the primary keeps running
• After a couple of weeks it was decided this was a “bug situation”
– This was bug 4767278 which talked about FAL not being able to read from multiple mirror sides when encountering invalid/stale redo in a file. Apparently required for ASM configurations because ASM does not guarantee all mirror sides contain same data after writing.
– We were using ASM, but external redundancy
– Oracle then said “The ASM group is not 100% sure if the patch 4767278 will fix the problem”
Maintaining your standby
log corruption
• The Managed Recovery process crashed complaining about log corruption
MRP0: Background Media Recovery terminated with error 355
ORA-00355: change numbers out of order
ORA-00353: log corruption near block 2 change 1273622545 time 03/06/2007 08:32:46
ORA-00312: online log 13 thread 1: '+DATA2/standby/standbyredo01.log'
• Oracle blame the upgrade process at first. They suggest rebuilding the standby
• Then I notice that trying managed recovery rather than real time apply seems to allow the standby to progress
Maintaining your standby
log corruption
• At this point Oracle say “it looks like a bug”
• Lots of time spent diagnosing the issue
– ALTER SYSTEM DUMP LOGFILE '+DATA2/nom/standby33.log' scn min 865465290 scn max 865465300;
• Eventually Oracle produced a patch 5746174
– MRP HANGS WITH ASYNC LNS AND PARALLEL ARCHIVAL
Using Your Standby
Utilize those cpu cycles
• A Standby can be considered an insurance policy
• Several ways to utilize your standby
– Run your backups from your standby
– Open your standby read only for reporting
– Flashback standby to look at old data
– Open your standby read write for testing purposes
Using Your Standby
Open for Reports
• You need to cancel managed recovery
– sql> alter database recover managed standby database cancel;
• Then simply open the standby
– sql> alter database open;
• Redo is still transported to your standby
• To transition back to applying redo shutdown the open standby, startup mount and restart the recovery process
Using Your Standby
Open for read write
• You must have flashback database enabled for this
• Stop redo apply on standby
• Create a restore point
• Activate the Standby & perform read/write testing
• Flashback to restore point
• Start the redo on the Standby again
Using Your Standby
Open for read write
Physical Standby Physical Standby
read write
RestorePoint
Flashback Database
Activate standby
Using Your Standby
Flashback Database in a Nutshell
• Set up Flashback Database
– alter system set db_recovery_file_dest_size = 8G;
– alter system set db_recovery_file_dest = 'your flashback destination';
– alter system set db_flashback_retention_target = 1440 ;
– alter database flashback on;
• Once you have cancelled the standby recovery create a guaranteed restore point
– create guaranteed restore point before_activate;
•
Using Your Standby
Open for read write
• Activate your Standby
– SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;
• You can open the Standby for business
– SQL> ALTER DATABASE OPEN;
• To become a Standby again shutdown and startup in mount
– SQL> FLASHBACK DATABASE TO RESTORE POINT BEFORE_ACTIVATE;
– SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
Using Your Standby
Open for read write
• However things never go according to plan
– ORA-00600: internal error code, arguments: [3705], [1], [8], [3], [8], [], []
• This was bug 4479323 which is a bug with recovery (not standby specific) and only occurs in a RAC environment
• This is fixed in 10.2.0.3
Doing a Switchover
It’s good to test
• A business continuity plan is no good unless it’s been tested
• It’s not all about the database
• Good to think in terms of services
Doing a Switchover
Database Switchover
• Make sure your standby is up-to-date
• Check your primary database switchover status:
– primary> SELECT SWITCHOVER_STATUS FROM V$DATABASE;
• Switchover primary database
– primary> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY with session shutdown;
• Switchover the standby
– standby> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY with session shutdown;
Doing a switchover
DNS Primer
• DNS allows translation from hostname to IP address
– example.co.uk IN A 162.0.0.1
• Our principle is all services are accessed through a CNAME
– anexample.co.uk 5M IN CNAME example.co.uk
• relocation of the service is just a case of changing where the CNAME points
Conclusion
Conclusion
• Dataguard is an efficient DR solution for your primary database
• Dataguard is mostly reliable but is not without it’s blips
• There are opportunities for gaining added value from your standby
• You can’t test your Business continuity plan enough