MIGRATING EXCHANGE SERVER 2003 TO HIGHLY AVAILABLE EXCHANGE SERVER
How CCR and SCR provide High Availability in Exchange Server 2007 SP1 Scott Schnoll...
How CCR and SCR provide High Availability in Exchange Server 2007 SP1
Scott Schnoll, [email protected]
Technical Writer, Exchange Server, Microsoft Corporation
Ilse Van Criekinge, [email protected]
MVP, Trainer & Consultant, Microsoft Unified Communications, Global Knowledge
Agenda
• Mailbox Server High Availability Options
• CCR and SCR: Better Together
• Why CCR? Why not SCC?
• Continuous Replication Demystified
• Troubleshooting Exchange Clusters and Continuous Replication
• Known Issues
Mailbox Server High Availability Options
• Local Continuous Replication (LCR)
• Single Copy Cluster (SCC)
• Cluster Continuous Replication (CCR)
Standby Continuous Replication (SCR)
• SCR sources: CCR, standalone, SCC
• SCR targets: standalone Mailbox server (w/o LCR), standby cluster with passive Mailbox role
CCR and SCR: Better Together
• CCR provides high availability for Mailbox data and services within the datacenter
• SCR replicates data remotely to provide site resilience for the Mailbox data
[Diagram: Datacenter A and Datacenter B]
CCR across 2 Sites
[Diagram: Datacenter A and Datacenter B]
CCR local / SCR to remote Site
[Diagram: Datacenter A and Datacenter B]
CCR/SCR vs SCC/Sync: 2 sites
[Diagram: Datacenter A and Datacenter B, comparing CCR and SCC. Element labels: DB, Logs, Q, VSS Clone, "Exchange Disaster Recovery or 3rd Party Failover"]
Callouts on the diagram:
• Physical corruption
• Undetected physical corruption
• 1 month later, undetected physical corruption
• On full storage or site failure in primary site, corruption is detected; must recover from backup
• Log corruption detected immediately on replication at both targets
• Setup /recovercms, play logs forward
• On site failure in primary site, if corruption not detected and corrected from a test failover, must recover from backup
Why CCR? Why Not SCC?
Why CCR? Why not SCC?
Single Point of Failure
  CCR: None when stretched across sites or combined with SCR for site resiliency
  SCC: Data, storage and site single points of failure. Potential for massive data loss on single failure:
    • Storage device failures can lose collocated backups
    • Hardware replication can propagate physical errors
    • Storage failure requires activation of remote copy if one exists
    • SCC requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR
Simplicity
  CCR: Simple setup
    • No special storage configuration required
  SCC: Shared storage; storage configuration before and after forming cluster; complex storage stack
    • Driver mgmt
    • Cluster WCL
    • Switches
    • Multipathing
    • Queue depths
Built-in Site Resilience
  CCR: Same technology and redundancy model for intra- and inter-site protection
  SCC: Complex deployment to approach RTO/RPO of 1 CCR cluster
Why CCR? Why not SCC?
Backups
  CCR: Backups off passive copy eliminate or reduce the backup window
  SCC: Backups must be off active
TCO
  CCR: Reduced TCO
    • Cheaper hardware
    • No special storage expertise required
    • In-the-box solution
    • Integrated management
    • Single operations team
    • Reduced backup cost
  SCC: Higher TCO
    • Additional products needed to achieve equivalent combined RTO/RPO
    • Separate management tools for HA operations may be required
    • Higher-end servers and storage required
    • Storage expertise needed
Large Mailboxes
  CCR: Great RTO/RPO, simplicity, no maintenance window, reduced TCO → improved support for larger mailboxes
  SCC: Higher TCO, long recovery times constrain mailbox size
Why CCR? Why not SCC?
Configurations compared, by failure type:
  CCR: Stretched CCR or CCR + SCR
  SCC: SCC + SCR/3rd party replication + 2 VSS clones to approach combined RTO/RPO of 1 CCR cluster
RTO
  Server
    CCR: ~2 minutes
    SCC: ~2 minutes
  Data or LUN
    CCR: ~2 minutes
    SCC: 15 min to 1 hour
  Full Storage
    CCR: ~2 minutes
    SCC: ~15 min with synchronous replication; days with VSS clones only
  Site
    CCR: ~2 minutes for Stretched CCR; 30-60 minutes for CCR + SCR
    SCC: ~15 min with synchronous replication; days with VSS clones only
RPO
  Server
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 (uses same copy of data)
  Physical corruption, DB
    CCR: 0
    SCC: Hours to days if sync repl; point in time if VSS
  Physical corruption, Logs
    CCR: 0 (must reseed passive)
    SCC: N/A if log not needed; same as DB if needed
  DB LUN dies
    CCR: 0
    SCC: 0 with synchronous replication; point-in-time with VSS clones
  LOG LUN dies
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 with synchronous replication; point-in-time with VSS clones
  Full Storage
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 with synchronous replication; hours to days with VSS clones only
  Site
    CCR: Same as Server for Stretched CCR; 1 log**
    SCC: 0 with synchronous replication; hours to days with VSS clone
* Assumes following best practice guidance for Transport Dumpster
** Assumes replication is keeping up
Why CCR? Why not SCC?
Physical Corruption
• SCC: no mechanism to detect database corruption on the copy replicated by 3rd party solutions (e.g., backups)
• SCC: no mechanism to detect log corruption on the copy replicated by 3rd party solutions (e.g., log inspection)
• With hardware-based replication, deeper stack can lead to corruption caused by:
  • HBA driver/firmware
  • Multi-path driver
  • Server hardware
  • FC switch firmware
  • Storage controller firmware/OS
  • Target storage controller firmware/OS
Logical Corruption
• Corruptions caused by the application
• Logical corruption replicated by all synchronous and asynchronous replication solutions
• SCR with lag replay can mitigate if detected early
Continuous Replication Demystified
Basic Replication Pipeline
[Diagram: Source DB / Store → Source Log Directory → Log Copier → Inspector Directory → Log Inspector → Replica Log Directory → Log Replayer → Replica DB]
Continuous Replication Basics
• When current log file is closed, it is copied to the replication target by the Replication service
• Replication service
  • at source: creates read-only shares for log directory
  • at target: reads from the shares and pulls a copy of the log file
  • contains a ReplicaInstance for each storage group
  • Configuration discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR)
Continuous Replication Basics
• Communication is done via logs, registry, cluster database and RPC
  • Logs: replicate database changes and backup status
  • Registry: used in LCR and SCR. Also in CCR for checkpointing the current log generation value for loss calculation
  • Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
  • RPCs: Target Replication service RPCs into Store for log truncation coordination
Lost Log Resilience (LLR)
• Designed to minimize need to reseed after lossy failover
• Database changes written to log file prior to database, and the database can be updated as soon as change is logged
• LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created
• Utilizes a new log stream marker called the waypoint
  • Minimum Log Required to prevent database divergence
  • No modifications after the waypoint have been written to the database
Transaction Markers
Initiating FILE DUMP mode...
  Database: priv1.edb
  ...
  State: Dirty Shutdown
  Log Required: 2-10 (0x2-0xA)
  Log Committed: 0-20 (0x0-0x14)
  ...
• Committed: Log generation 20
• Checkpoint: Log generation 2
• Waypoint: Log generation 10
• What this means:
  • We only need logs 2-10
  • Logs 11-20 can be discarded
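The marker arithmetic above can be restated as a short Python sketch. It is illustrative only: `classify_log_generations` is a hypothetical helper, not part of eseutil or Exchange, and it encodes nothing beyond the slide's rule that generations from the checkpoint through the waypoint are required while later committed generations can be discarded.

```python
def classify_log_generations(checkpoint, waypoint, committed):
    """Split log generations into (required, discardable) ranges.

    Hypothetical helper: required logs run from the checkpoint through the
    waypoint; logs after the waypoint, up to the last committed generation,
    hold no changes that reached the database and can be discarded.
    """
    if not checkpoint <= waypoint <= committed:
        raise ValueError("expected checkpoint <= waypoint <= committed")
    required = range(checkpoint, waypoint + 1)
    discardable = range(waypoint + 1, committed + 1)
    return required, discardable


# Values from the header dump: Log Required 2-10, Log Committed 0-20.
required, discardable = classify_log_generations(checkpoint=2, waypoint=10, committed=20)
print(f"required: {required.start}-{required.stop - 1}")           # required: 2-10
print(f"discardable: {discardable.start}-{discardable.stop - 1}")  # discardable: 11-20
```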
[Diagram: log generations 1-21 on NodeA and NodeB, with checkpoint and waypoint markers, stepping through a lossy failover and recovery]
1. Healthy CCR
2. NodeA fails and a failover to NodeB occurs
3. Validate database can mount: logs lost < AutoDatabaseMountDial
4. Logs are generated on NodeB (beyond gen 21)
5. NodeA recovers and performs a divergence check
6. NodeA performs incremental reseed and copies logs
7. Healthy CCR
Log Roll Activity
• In the absence of user or database activity, ESE now also forces the active log file to close
• [15 (minutes) ÷ LLR Depth value] = Frequency of log roll activity (in minutes)

Maximum number of logs generated each day due to log roll activity
Mailbox server configuration | Maximum number of logs generated per day by an idle storage group
Stand-alone (with or without LCR), SCC | 96
CCR | 960
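The two table values follow directly from the log roll formula. The sketch below works the arithmetic; the LLR depth values of 1 (stand-alone and SCC) and 10 (CCR) are assumptions inferred from the 96 and 960 figures, since the slide does not state them, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60
LOG_ROLL_BASE_MINUTES = 15  # constant from the slide's formula


def log_roll_interval_minutes(llr_depth):
    """Frequency of forced log roll, in minutes: 15 / LLR depth."""
    if llr_depth < 1:
        raise ValueError("LLR depth must be at least 1")
    return LOG_ROLL_BASE_MINUTES / llr_depth


def max_idle_logs_per_day(llr_depth):
    """Upper bound on logs an idle storage group generates per day."""
    if llr_depth < 1:
        raise ValueError("LLR depth must be at least 1")
    # Minutes per day divided by (15 / depth), kept in integer arithmetic.
    return MINUTES_PER_DAY * llr_depth // LOG_ROLL_BASE_MINUTES


# Assumed LLR depths (not stated on the slide): 1 and 10.
for label, depth in [("Stand-alone / SCC", 1), ("CCR", 10)]:
    print(label, log_roll_interval_minutes(depth), max_idle_logs_per_day(depth))
```

With these assumed depths, a log rolls every 15 minutes (96 per day) on a stand-alone or SCC server and every 1.5 minutes (960 per day) on CCR, matching the table.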
When Do I Need A Full Reseed?
• Rarely
• Lost log past current Waypoint
  • Admin accepted large amount of loss by running Restore-StorageGroupCopy
  • Automatic mount while LLR was “not honored”
  • Automatic lossy mount with “stale” loss window calculation
• Log corruption prior to log replay
  • ESE cannot skip over logs
• Database files modified outside of Store or Replication service
  • E.g., offline defrag, eseutil /r
Transport Dumpster
• Hub Transport servers retain messages that have been delivered to destination mailbox until size or time limit is reached
• Transport Dumpster is per storage group per Hub Transport server for servers in same Active Directory site as the storage group
• Transport Dumpster statistics:
  Get-StorageGroupCopyStatus -DumpsterStatistics
  Output:
  DumpsterServersNotAvailable: {HUB1}
  DumpsterStatistics: {HUB2(2/25/2009 10:20:37 PM; 2 ; 1032KB)}
Transport Dumpster
[Diagram: CCR CMS with nodes MBX1 and MBX2 (active and passive), each holding SG1 and SG2, plus Hub Transport servers HUB1 and HUB2 with per-storage-group dumpster contents (Msg1 to Msg4). After a failover, SG1 and SG2 are listed as "Resubmit Required" against HUB1 and HUB2, and "Redeliver SG1,SG2" requests to the hubs return retry or timeout until they return success.]
Transport Dumpster
• How much data loss can transport dumpster mitigate?
  • 18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
  • [20 MB / 10 hours] x [100 users / SG] = 200 MB message traffic in one hour
  • Putting the above two together gives 60 min x 144 / 200 = 43.2 minutes' worth of data; in 43.2 minutes, 144+ logs are created per SG
• Customize transport dumpster size/time limit:
  Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
• No time window guarantees
  • If there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for destination storage group(s) on a given Hub Transport server
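The sizing estimate above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the function name is hypothetical, "20 MB / 10 hour" is read as 20 MB of mail per user over a 10-hour day, and the log count assumes the 1 MB transaction log files used by Exchange 2007.

```python
LOG_FILE_MB = 1  # Exchange 2007 transaction log file size


def dumpster_coverage_minutes(dumpster_mb_per_hub, hub_count,
                              mb_per_user_per_10_hours, users_per_sg):
    """Minutes of message traffic the combined dumpsters can hold for one SG."""
    total_dumpster_mb = dumpster_mb_per_hub * hub_count                 # 18 x 8 = 144 MB
    traffic_mb_per_hour = mb_per_user_per_10_hours / 10 * users_per_sg  # 2 x 100 = 200 MB
    return 60 * total_dumpster_mb / traffic_mb_per_hour


minutes = dumpster_coverage_minutes(
    dumpster_mb_per_hub=18, hub_count=8,
    mb_per_user_per_10_hours=20, users_per_sg=100,
)
logs_in_window = (18 * 8) // LOG_FILE_MB  # at least one 1 MB log per MB of mail

print(round(minutes, 1))  # 43.2
print(logs_in_window)     # 144
```

Raising MaxDumpsterSizePerStorageGroup widens the window proportionally: at 30 MB per hub, the same traffic profile gives 60 x 240 / 200 = 72 minutes.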
Transport Dumpster
• When CCR detects a lossy failover:
  • Expands loss window by 12 hours back and 1 hour forward
  • Finds all Hub Transport servers in the local Active Directory site
  • Requests transport dumpster redelivery from all detected servers
  • New servers not added to redelivery list
  • Inaccessible servers: CCR retries same request every 30 seconds until configured MaxDumpsterTime
  • If multiple lossy failovers take place, new loss window is added to previous one
• Restore-StorageGroupCopy on LCR is one time request, no retries
• Redelivery not triggered as part of Setup /recoverCMS
• No other ways to redeliver messages from transport dumpster
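The loss-window handling described above can be illustrated with a small sketch. Everything here is hypothetical: the function names, the example timestamps, and in particular the reading of "added to previous one" as taking the earliest start and latest end of the two windows. The slide does not describe how the Replication service represents the window internally.

```python
from datetime import datetime, timedelta

EXPAND_BACK = timedelta(hours=12)    # from the slide: 12 hours back
EXPAND_FORWARD = timedelta(hours=1)  # from the slide: 1 hour forward


def expand_loss_window(loss_start, loss_end):
    """Widen a calculated loss window before requesting redelivery."""
    return loss_start - EXPAND_BACK, loss_end + EXPAND_FORWARD


def combine_windows(previous, new):
    """Assumed merge rule: cover both windows with a single span."""
    return min(previous[0], new[0]), max(previous[1], new[1])


# Two illustrative lossy failovers a few hours apart.
first = expand_loss_window(datetime(2009, 2, 25, 22, 0), datetime(2009, 2, 25, 22, 20))
second = expand_loss_window(datetime(2009, 2, 26, 3, 0), datetime(2009, 2, 26, 3, 5))
combined = combine_windows(first, second)
print(combined[0], "to", combined[1])
```

Under these assumptions the redelivery request after the second failover would span 10:00 on 25 February through 04:05 on 26 February.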
Redundant Networks
• Use for log shipping and seeding in CCR
  • Enable-ContinuousReplicationHostName
  • Seeding: Update-StorageGroupCopy -DataHostNames:Host1,Host2
  • Get-ClusteredMailboxServerStatus reports OperationalReplicationHostNames, FailedReplicationHostNames and InUseReplicationHostNames
• Watch out for a misconfigured hosts file
Circular Logging
• One configuration setting with two consumers
  • Store service: requires database to be dismounted and re-mounted to take effect
  • Replication service: picks up new setting dynamically
• In CCR, it’s no big deal to switch between on/off/on
• In some settings, logs are deleted prematurely
  • Example: turn off circular logging, then enable LCR without dismount/mount of database
  • ESE is still doing log truncation with circular logging logic
  • Logs will get truncated before making it to the LCR copy
• To be safe follow this recipe:
  • Suspend, dismount, change setting, mount, resume
Troubleshooting Exchange Clusters and Continuous Replication
Troubleshooting Replication & Failover
• Get-StorageGroupCopyStatus
• Test-ReplicationHealth
• Cluster Log
• Get-ClusteredMailboxServerStatus
• Getscrsources.ps1
• Test-Mailflow
• Application Event Log: Replication events
  • Get-EventLogLevel -id:"MSExchange Repl" | Set-EventLogLevel -Level expert
  • Get-EventLogLevel -id:"MSExchange Cluster" | Set-EventLogLevel -Level expert
• System Event Log: Cluster events
• Active Directory management tools
• Network Monitor
Troubleshooting: Get-StorageGroupCopyStatus
[Screenshot of Get-StorageGroupCopyStatus output, with these callouts:]
• = LastLogCopyNotified - LastLogCopied
• Time stamp on source SG of most recent log
• Time of source's most recent log known to copy
• Time stamp on source SG of last successful log copy
• Must use -DumpsterStatistics option to get these values
Troubleshooting: Test-ReplicationHealth
• ClusterNetwork:
  • Checks connectivity of all network interfaces
  • Checks cluster group is up
  • Warns in multi subnet topologies since not all cluster networks can be up at the same time
• SGCopyQueueLength: warns at 3 and errors at 6
• SGReplayQueueLength: warns at 30 and errors at 60
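The queue-length thresholds above amount to a simple three-way classification, sketched here in Python. The function and result names are hypothetical, and treating "at 3" as "3 or more" is an assumption; the slide does not say whether the comparison is inclusive, and this is not the cmdlet's actual implementation.

```python
# (warn_at, error_at) thresholds taken from the slide.
THRESHOLDS = {
    "SGCopyQueueLength": (3, 6),
    "SGReplayQueueLength": (30, 60),
}


def queue_check_result(check_name, queue_length):
    """Classify a queue length as Passed, Warning or Error.

    Assumes inclusive thresholds (a length equal to the threshold trips it).
    """
    warn_at, error_at = THRESHOLDS[check_name]
    if queue_length >= error_at:
        return "Error"
    if queue_length >= warn_at:
        return "Warning"
    return "Passed"


for name, length in [("SGCopyQueueLength", 2),
                     ("SGCopyQueueLength", 4),
                     ("SGReplayQueueLength", 75)]:
    print(name, length, queue_check_result(name, length))
```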
Troubleshooting: Cluster Log
• Windows Server 2003
  • %windir%\cluster\cluster.log
  • Logs are always appended to this file
• Windows Server 2008
  • Must generate the cluster log file:
    cluster.exe [[/CLUSTER:]cluster-name] LOG <options>
    <options> =
      /G[EN[ERATE]] [/COPY[:"directory"]] [/NODE:"node-name"] [/SPAN[MIN[UTE[S]]]:min]
      /SIZE:logsize-MB
      /LEVEL:logLevel
  • If /COPY is not specified, the log is written to %windir%\Cluster\Reports\Cluster.log
  • If /NODE is not specified, a log file is generated on every node
  • /SIZE must be between 8 and 1024 MB
  • /LEVEL must be between 0 and 10
Troubleshooting: Server Failover but Databases Didn’t Mount
• Steps to troubleshoot:
  1. Run Get-StorageGroupCopyStatus
  2. Check the log directories on Active and Passive
  3. Run Restore-StorageGroupCopy and then Mount-Database
Troubleshooting: Log File Corrupted
• Steps to troubleshoot:
  • Run Get-StorageGroupCopyStatus and/or Test-ReplicationHealth
  • Reseed the passive copy / SCR target: first run Suspend-StorageGroupCopy
  • Then run Update-StorageGroupCopy on the passive node or SCR target
Troubleshooting: SMB File Share for Replication Missing
• Steps to troubleshoot (SCR/CCR):
  1. Run Test-ReplicationHealth on Passive
  2. Run Get-StorageGroupCopyStatus on Passive
  3. Run Get-ClusteredMailboxServerStatus
  4. Verify share on Active Node
  5. Stop sharing the file share; the Replication service recreates it in 30 seconds
  6. Run Test-ReplicationHealth on Active
  7. Check Application Event Log
  8. Check Active Directory permissions
Known Issues
• Update Rollup 5 (RU5) for Exchange 2007 SP1 can cause Enable-StorageGroupCopy to fail in an SCR topology that consists of a parent and child domain structure:
  “Standby continuous replication is not supported between computers in different Active Directory domains. The target node is in domain <child domain> which is different from the source domain of <parent domain>”
• Workarounds
  • Uninstall RU5, enable SCR, re-install RU5
  • Use a Management Console running pre-RU5 code
• Exchange12 bug 152967
• Expected fix in RU7 for Exchange 2007 SP1
Known Issues
• Network shares get deleted and created every 5 minutes by the replication service on a Windows 2008 SCC when SCR is enabled
• Replication service share names intermittently disappear from the cluster, causing replication status to repeatedly switch back and forth between failed and healthy states
  • Test-ReplicationHealth on SCR target may succeed showing all tests passed
  • Get-StorageGroupCopyStatus on SCR target shows Healthy, Initializing or Failed
  • No events on source, but SCR target will log ESE event 522 in the application event log
• Exchange12 bug 146483
• Expected fix in RU7 for Exchange 2007 SP1
Known Issues
• When running VSS backup, ESE event 522 is logged on the passive node
  • Event is logged on resuming a suspended storage group
  • Event log fills
• Event message details:
  Microsoft.Exchange.Cluster.ReplayService (7012) Log Verifier e0a 31573001: An attempt to open the device name "\\source\share$" containing "\\source\share$\" failed with system error 5 (0x00000005): "Access is denied. ". The operation will fail with error -1032 (0xfffffbf8).
• Workaround
  • If Get-StorageGroupCopyStatus is healthy for storage groups, ignore the event
  • If Test-ReplicationHealth passes all tests, ignore the event
• Exchange12 bug 147432
• Expected fix in RU7 for Exchange 2007 SP1
Known Issues
• Reseed fails when you restore 1 full backup and then more than 2 differential backups.
• Restoring to active node can succeed, but CCR no longer works after recovery.
• Workaround
  • Take full backup when restore is finished (note this may not be practical with large databases)
• Exchange12 bug 152258
• Expected fix in RU8 for Exchange 2007 SP1
Key Takeaways
• Exchange 2007 includes several Mailbox Server availability configurations
• CCR+SCR provide higher availability at a lower cost than any other solution
• There are a number of cmdlets and tools that can be used for troubleshooting and managing continuous replication
• LLR minimizes need for full reseeds
• Transport Dumpster redelivers all routed mail after failover
• CCR addresses all ranges of failures from disk to full site
• CCR on DAS provides great RTO and RPO
Thank You!
Pro-Exchange
• Here this week!
• Visit our usergroup booth
• Core members
  • Ilse Van Criekinge
  • Johan Delimon
  • Tonino Bruno
• http://www.proexchange.be
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions,
it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.