How CCR and SCR provide High Availability in Exchange Server 2007 SP1

Scott Schnoll
Technical Writer, Exchange Server
Microsoft Corporation

Ilse Van Criekinge
MVP, Trainer & Consultant, Microsoft Unified Communications
Global Knowledge


Agenda

• Mailbox Server High Availability Options
• CCR and SCR: Better Together
• Why CCR? Why not SCC?
• Continuous Replication Demystified
• Troubleshooting Exchange Clusters and Continuous Replication
• Known Issues

Mailbox Server High Availability Options

• Local Continuous Replication (LCR)
• Single Copy Cluster (SCC)
• Cluster Continuous Replication (CCR)
• Standby Continuous Replication (SCR)
  • SCR sources: CCR, Standalone, SCC
  • SCR targets: Standalone Mailbox Server (w/o LCR), Standby Cluster with Passive Mailbox Role

CCR and SCR: Better Together

• CCR provides high availability for Mailbox data and services within the datacenter

• SCR replicates data remotely to provide site resilience for the Mailbox data

[Diagram: Datacenter A and Datacenter B]

CCR across 2 Sites
[Diagram: Datacenter A and Datacenter B]

CCR local / SCR to remote Site
[Diagram: Datacenter A and Datacenter B]

CCR/SCR vs SCC/Sync, 2 sites
[Diagram: Datacenter A and Datacenter B, each holding databases (DB) and logs; the CCR side is compared with an SCC side that uses a quorum (Q) and VSS clones. Physical corruption is introduced on both sides. Callouts:]
• Log corruption detected immediately on replication at both targets
• Setup /recoverCMS, play logs forward
• Undetected physical corruption
• 1 month later, undetected physical corruption
• On full storage or site failure in the primary site, corruption is detected; must recover from backup
• On site failure in the primary site, if corruption is not detected and corrected from a test failover, must recover from backup
• Exchange disaster recovery or 3rd-party failover

Why CCR? Why Not SCC?

Single Point of Failure
• CCR: None when stretched across sites or combined with SCR for site resiliency
• SCC: Data, storage and site single points of failure. Potential for massive data loss on a single failure:
  • Storage device failures can lose collocated backups
  • Hardware replication can propagate physical errors
  • Storage failure requires activation of remote copy if one exists
  • SCC requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR

Simplicity
• CCR:
  • Simple setup: no special storage configuration required
  • Built-in site resilience: same technology and redundancy model for intra- and inter-site protection
• SCC:
  • Shared storage
  • Storage configuration before and after forming cluster
  • Complex storage stack: driver mgmt, Cluster WCL, switches, multipathing, queue depths
  • Complex deployment to approach RTO/RPO of 1 CCR cluster

Backups
• CCR: Backups off the passive copy eliminate or reduce the backup window
• SCC: Backups must be off active

TCO
• CCR: Reduced TCO
  • Cheaper hardware
  • No special storage expertise required
  • In-the-box solution
  • Integrated management
  • Single operations team
  • Reduced backup cost
• SCC: Higher TCO
  • Additional products needed to achieve equivalent combined RTO/RPO
  • Separate management tools for HA operations may be required
  • Higher-end servers and storage required
  • Storage expertise needed

Large Mailboxes
• CCR: Great RTO/RPO, simplicity, no maintenance window and reduced TCO give improved support for larger mailboxes
• SCC: Higher TCO and long recovery times constrain mailbox size

RTO and RPO by failure
• CCR column: Stretched CCR or CCR + SCR
• SCC column: SCC + SCR/3rd party replication + 2 VSS clones to approach combined RTO/RPO of 1 CCR cluster

RTO
• Server
  • CCR: ~2 minutes
  • SCC: ~2 minutes
• Data or LUN
  • CCR: ~2 minutes
  • SCC: 15 min to 1 hour
• Full Storage
  • CCR: ~2 minutes
  • SCC: ~15 min with synchronous replication; days with VSS clones only
• Site
  • CCR: ~2 minutes for Stretched CCR; 30-60 minutes for CCR + SCR
  • SCC: ~15 min with synchronous replication; days with VSS clones only

RPO
• Server
  • CCR: 0 for mail*; appointment, contact, task, draft
  • SCC: 0 (uses same copy of data)
• Physical corruption, DB
  • CCR: 0
  • SCC: Hours to days if sync repl; point in time if VSS
• Physical corruption, logs
  • CCR: 0 (must reseed passive)
  • SCC: N/A if log not needed; same as DB if needed
• DB LUN dies
  • CCR: 0
  • SCC: 0 with synchronous replication; point-in-time with VSS clones
• Log LUN dies
  • CCR: 0 for mail*; appointment, contact, task, draft
  • SCC: 0 with synchronous replication; point-in-time with VSS clones
• Full Storage
  • CCR: 0 for mail*; appointment, contact, task, draft
  • SCC: 0 with synchronous replication; hours to days with VSS clones only
• Site
  • CCR: Same as Server for Stretched CCR; 1 log**
  • SCC: 0 with synchronous replication; hours to days with VSS clone

* Assumes following best practice guidance for Transport Dumpster
** Assumes replication is keeping up


Physical Corruption
• SCC: no mechanism (e.g., backups) to detect database corruption on the copy replicated by 3rd-party solutions
• SCC: no mechanism (e.g., log inspection) to detect log corruption on the copy replicated by 3rd-party solutions
• With hardware-based replication, the deeper stack can lead to corruption caused by:
  • HBA driver/firmware
  • Multi-path driver
  • Server hardware
  • FC switch firmware
  • Storage controller firmware/OS
  • Target storage controller firmware/OS

Logical Corruption
• Corruptions caused by the application
• Logical corruption is replicated by all synchronous and asynchronous replication solutions
• SCR with lag replay can mitigate if detected early

Continuous Replication Demystified

Basic Replication Pipeline
[Diagram: Store, Source DB, Source Log Directory, Log Copier, Inspector Directory, Log Inspector, Replica Log Directory, Log Replayer, Replica DB]

Continuous Replication Basics

• When the current log file is closed, it is copied to the replication target by the Replication service
• Replication service
  • at source: creates read-only shares for the log directory
  • at target: reads from the shares and pulls a copy of the log file
  • contains a ReplicaInstance for each storage group
  • Configuration discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR)
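The copy, inspect and replay stages can be illustrated with a toy model. This is only a sketch in Python, not how the Replication service is implemented: the `LogFile` class, the CRC32 checksum, and the in-memory lists standing in for the source, inspector and replica log directories are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFile:
    """A closed transaction log (hypothetical stand-in, not a real ESE log)."""
    generation: int
    payload: bytes
    checksum: int


def close_log(generation: int, payload: bytes) -> LogFile:
    """Source side: seal a log file once it is closed."""
    return LogFile(generation, payload, zlib.crc32(payload))


def copy_logs(source_dir, inspector_dir, last_copied):
    """Log copier: pull every closed log newer than last_copied."""
    for log in sorted(source_dir, key=lambda entry: entry.generation):
        if log.generation > last_copied:
            inspector_dir.append(log)
            last_copied = log.generation
    return last_copied


def inspect_logs(inspector_dir, replica_dir, last_inspected):
    """Log inspector: verify sequence and checksum, then release for replay."""
    while inspector_dir:
        log = inspector_dir[0]
        if log.generation != last_inspected + 1:
            raise ValueError(f"gap before generation {log.generation}")
        if zlib.crc32(log.payload) != log.checksum:
            raise ValueError(f"generation {log.generation} is corrupt")
        replica_dir.append(inspector_dir.pop(0))
        last_inspected = log.generation
    return last_inspected


def replay_logs(replica_dir, replica_db, last_replayed):
    """Log replayer: apply inspected logs to the replica database in order."""
    for log in replica_dir:
        if log.generation > last_replayed:
            replica_db.append(log.payload)
            last_replayed = log.generation
    return last_replayed


def run_pipeline(source_dir):
    """Run the three stages once; return replica contents and stage markers."""
    inspector_dir, replica_dir, replica_db = [], [], []
    copied = copy_logs(source_dir, inspector_dir, 0)
    inspected = inspect_logs(inspector_dir, replica_dir, 0)
    replayed = replay_logs(replica_dir, replica_db, 0)
    return replica_db, (copied, inspected, replayed)
```

In this model a log with a bad checksum stops at the inspector and never reaches the replica log directory, which mirrors the idea that the target pulls logs and checks them before replay.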


Continuous Replication Basics

• Communication is done via logs, registry, cluster database and RPC
  • Logs: replicate database changes and backup status
  • Registry: used in LCR and SCR. Also in CCR for checkpointing the current log generation value for loss calculation
  • Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
  • RPCs: Target Replication service RPCs into Store for log truncation coordination

Lost Log Resilience (LLR)

• Designed to minimize need to reseed after lossy failover
• Database changes are written to the log file prior to the database, and the database can be updated as soon as the change is logged
• LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created
• Utilizes a new log stream marker called the waypoint
  • Minimum Log Required to prevent database divergence
  • No modifications after the waypoint have been written to the database

Transaction Markers

    Initiating FILE DUMP mode...
    Database: priv1.edb
    ...
    State: Dirty Shutdown
    Log Required: 2-10 (0x2-0xA)
    Log Committed: 0-20 (0x0-0x14)
    ...

• Committed: Log generation 20
• Checkpoint: Log generation 2
• Waypoint: Log generation 10
• What this means:
  • We only need logs 2-10
  • Logs 11-20 can be discarded
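The reading of the markers above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the regular expression assumes the "Log Required" and "Log Committed" lines look exactly like the excerpt on this slide.

```python
import re

# Matches lines such as "Log Required: 2-10 (0x2-0xA)".
RANGE = re.compile(r"Log (Required|Committed):\s*(\d+)-(\d+)")


def parse_markers(dump_text: str) -> dict:
    """Extract checkpoint, waypoint and committed generations from a dump."""
    found = {kind: (int(lo), int(hi)) for kind, lo, hi in RANGE.findall(dump_text)}
    required_lo, required_hi = found["Required"]
    _, committed_hi = found["Committed"]
    return {
        "checkpoint": required_lo,   # first log still required
        "waypoint": required_hi,     # last log required to avoid divergence
        "committed": committed_hi,   # last log generation written
    }


def classify_logs(markers: dict) -> tuple[list[int], list[int]]:
    """Return (required generations, generations past the waypoint)."""
    required = list(range(markers["checkpoint"], markers["waypoint"] + 1))
    discardable = list(range(markers["waypoint"] + 1, markers["committed"] + 1))
    return required, discardable
```

Fed the dump shown on the slide, `parse_markers` yields checkpoint 2, waypoint 10 and committed 20, and `classify_logs` returns generations 2 through 10 as required and 11 through 20 as discardable.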

[Diagram: log generations 1-21 on NodeA and NodeB, with the checkpoint and waypoint marked, stepping through a lossy failover:]
1. Healthy CCR
2. NodeA fails and a failover to NodeB occurs
3. Validate database can mount: logs lost < AutoDatabaseMountDial
4. Logs are generated on NodeB (beyond gen 21)
5. NodeA recovers and performs a divergence check
6. NodeA performs incremental reseed and copies logs
7. Healthy CCR

Log Roll Activity

• In the absence of user or database activity, ESE now also forces the active log file to close

• [15 (minutes) ÷ LLR Depth value] = Frequency of log roll activity (in minutes)

Maximum number of logs generated each day due to log roll activity (per idle storage group)
• Stand-alone (with or without LCR), SCC: 96
• CCR: 960
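The two figures follow directly from the log roll formula. The sketch below reproduces the arithmetic; the LLR depth values of 1 and 10 are not stated on the slide and are inferred here from the 96 and 960 results, so treat them as assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def log_roll_interval_minutes(llr_depth: int) -> float:
    """Frequency of log roll activity: 15 minutes divided by the LLR depth."""
    if llr_depth < 1:
        raise ValueError("LLR depth must be at least 1")
    return 15 / llr_depth


def max_idle_logs_per_day(llr_depth: int) -> int:
    """Upper bound on logs an idle storage group generates per day."""
    # Integer form of MINUTES_PER_DAY / (15 / llr_depth).
    return MINUTES_PER_DAY * llr_depth // 15


# Assumed depths (inferred from the table, not given on the slide).
STANDALONE_OR_SCC_DEPTH = 1
CCR_DEPTH = 10
```

With a depth of 1 the active log rolls every 15 minutes (96 logs per day); with a depth of 10 it rolls every 1.5 minutes (960 logs per day).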


When Do I Need A Full Reseed?

• Rarely
• Lost log past current waypoint
  • Admin accepted large amount of loss by running Restore-StorageGroupCopy
  • Automatic mount while LLR was "not honored"
  • Automatic lossy mount with "stale" loss window calculation
• Log corruption prior to log replay
  • ESE cannot skip over logs
• Database files modified outside of Store or Replication service
  • E.g., offline defrag, eseutil /r

Transport Dumpster

• Hub Transport servers retain messages that have been delivered to destination mailbox until size or time limit is reached

• Transport Dumpster is per storage group per Hub Transport server for servers in same Active Directory site as the storage group

• Transport Dumpster statistics:
    Get-StorageGroupCopyStatus -DumpsterStatistics
  Output:
    DumpsterServersNotAvailable: {HUB1}
    DumpsterStatistics: {HUB2(2/25/2009 10:20:37 PM; 2 ; 1032KB)}

[Diagram: Transport Dumpster redelivery after a lossy failover. A CCR clustered mailbox server (nodes MBX1 and MBX2, active and passive) hosts storage groups SG1 and SG2. HUB1 and HUB2 each hold dumpster contents per storage group (Msg1 through Msg4). After the failover, resubmission is required for SG1 and SG2 from HUB1 and HUB2; "Redeliver SG1,SG2" requests return retry or timeout until they return success, and each Hub Transport server is removed from the list once it succeeds.]

Transport Dumpster

• How much data loss can transport dumpster mitigate?
  • 18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
  • [20 MB / 10 hour] x [100 users / SG] = 200 MB message traffic in one hour
  • Putting the above two together gives 60 min x 144 / 200 = 43.2 minutes worth of data
  • In 43.2 minutes, 144+ logs are created per SG
• Customize transport dumpster size/time limit
    Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
• No time window guarantees
  • If there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for destination storage group(s) on a given Hub Transport server
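The sizing arithmetic on this slide can be replayed with your own numbers. The following Python sketch only restates the slide's formula; the function names are hypothetical and the inputs in the usage note are the slide's example values.

```python
def dumpster_capacity_mb(size_per_sg_mb: float, hub_servers: int) -> float:
    """Total dumpster space for one storage group across all Hub Transport servers."""
    return size_per_sg_mb * hub_servers


def hourly_traffic_mb(mb_per_user: float, over_hours: float, users_per_sg: int) -> float:
    """Message traffic per hour for one storage group."""
    return mb_per_user / over_hours * users_per_sg


def coverage_minutes(capacity_mb: float, traffic_mb_per_hour: float) -> float:
    """Minutes of mail flow the dumpster can hold before older items are purged."""
    return 60 * capacity_mb / traffic_mb_per_hour
```

With the slide's values, `dumpster_capacity_mb(18, 8)` is 144 MB, `hourly_traffic_mb(20, 10, 100)` is 200 MB per hour, and `coverage_minutes(144, 200)` is 43.2 minutes. As the slide notes, this is an estimate and not a guaranteed time window.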


Transport Dumpster

• When CCR detects a lossy failover:
  • Expands loss window by 12 hours back and 1 hour forward
  • Finds all Hub Transport servers in the local Active Directory site
  • Requests transport dumpster redelivery from all detected servers
    • New servers not added to redelivery list
    • Inaccessible servers: CCR retries same request every 30 seconds until configured MaxDumpsterTime
  • If multiple lossy failovers take place, the new loss window is added to the previous one
• Restore-StorageGroupCopy on LCR is a one-time request, no retries
• Redelivery not triggered as part of Setup /recoverCMS
• No other ways to redeliver messages from transport dumpster
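The window expansion described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Replication service's logic: the function names are hypothetical, and treating "added to the previous one" as a single window spanning both is an interpretation of the slide.

```python
from datetime import datetime, timedelta

BACK = timedelta(hours=12)     # expansion before the detected loss
FORWARD = timedelta(hours=1)   # expansion after the detected loss


def redelivery_window(loss_start: datetime, loss_end: datetime):
    """Expand the detected loss window 12 hours back and 1 hour forward."""
    return loss_start - BACK, loss_end + FORWARD


def merge_windows(previous, new):
    """Combine windows from successive lossy failovers (assumed: covering span)."""
    if previous is None:
        return new
    return min(previous[0], new[0]), max(previous[1], new[1])
```

For a loss detected between 10:00 and 10:05 on a given day, the redelivery request would cover 22:00 the previous evening through 11:05.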

Redundant Networks

• Use for log shipping and seeding in CCR
    Enable-ContinuousReplicationHostName
• Seeding
    Update-StorageGroupCopy -DataHostNames:Host1,Host2
• Get-ClusteredMailboxServerStatus
    OperationalReplicationHostNames:
    FailedReplicationHostNames:
    InUseReplicationHostNames:
• Watch out for a misconfigured hosts file

Circular Logging

• One configuration setting with two consumers
  • Store service: requires database to be dismounted and re-mounted to take effect
  • Replication service: picks up new setting dynamically
• In CCR, it's no big deal to switch between on/off/on
• In some settings, logs are deleted prematurely
  • Example: turn off circular logging, then enable LCR without dismount/mount of database
    • ESE is still doing log truncation with circular logging logic
    • Logs will get truncated before making it to the LCR copy
• To be safe, follow this recipe:
  • Suspend, dismount, change setting, mount, resume

Troubleshooting Exchange Clusters and Continuous Replication

Troubleshooting Replication & Failover

• Get-StorageGroupCopyStatus
• Test-ReplicationHealth
• Cluster Log
• Get-ClusteredMailboxServerStatus
• Getscrsources.ps1
• Test-Mailflow
• Application Event Log: Replication events
  • Get-EventLogLevel -id:"MSExchange Repl" | Set-EventLogLevel -Level expert
  • Get-EventLogLevel -id:"MSExchange Cluster" | Set-EventLogLevel -Level expert
• System Event Log: Cluster events
• Active Directory management tools
• Network Monitor

Troubleshooting: Get-StorageGroupCopyStatus

[Screenshot of Get-StorageGroupCopyStatus output, with callouts:]
• Copy queue length = LastLogCopyNotified - LastLogCopied
• Time stamp on source SG of most recent log
• Time of source's most recent log known to copy
• Time stamp on source SG of last successful log copy
• Must use -DumpsterStatistics option to get the dumpster values

Troubleshooting: Test-ReplicationHealth

• ClusterNetwork:
  • Checks connectivity of all network interfaces
  • Checks cluster group is up
  • Warns in multi-subnet topologies since not all cluster networks can be up at the same time
• SGCopyQueueLength: warns at 3 and errors at 6
• SGReplayQueueLength: warns at 30 and errors at 60
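The queue thresholds can be expressed as a small classifier, useful when post-processing exported status values. This Python sketch is not the cmdlet's implementation: the function names and result strings are hypothetical, and treating the thresholds as inclusive ("warns at 3" meaning 3 or more) is an assumption. The copy queue calculation uses the LastLogCopyNotified and LastLogCopied values from Get-StorageGroupCopyStatus.

```python
# (warn_at, error_at) per check, as listed on the slide.
THRESHOLDS = {
    "SGCopyQueueLength": (3, 6),
    "SGReplayQueueLength": (30, 60),
}


def copy_queue_length(last_log_copy_notified: int, last_log_copied: int) -> int:
    """Logs the target knows about but has not yet copied."""
    return max(0, last_log_copy_notified - last_log_copied)


def check_queue(check: str, length: int) -> str:
    """Classify a queue length as Passed, Warning or Error (inclusive bounds assumed)."""
    warn_at, error_at = THRESHOLDS[check]
    if length >= error_at:
        return "Error"
    if length >= warn_at:
        return "Warning"
    return "Passed"
```

For example, a storage group with LastLogCopyNotified at generation 120 and LastLogCopied at 116 has a copy queue of 4, which this sketch reports as a warning.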


Troubleshooting: Cluster Log

• Windows Server 2003
  • %windir%\cluster\cluster.log
  • Logs are always appended to this file
• Windows Server 2008
  • Must generate the cluster log file
      cluster.exe [[/CLUSTER:]cluster-name] LOG <options>
      <options> =
        /G[EN[ERATE]] [/COPY[:"directory"]] [/NODE:"node-name"] [/SPAN[MIN[UTE[S]]]:min]
        /SIZE:logsize-MB
        /LEVEL:logLevel
  • If /COPY is not specified, the log is written to %windir%\Cluster\Reports\Cluster.log
  • If /NODE is not specified, a log file is generated on every node
  • /SIZE must be between 8 and 1024 MB
  • /LEVEL must be between 0 and 10
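When these commands are scripted, it helps to validate the arguments before calling cluster.exe. The Python sketch below only builds command-line strings from the options and limits listed on this slide; the function names are hypothetical, and it assumes the three option forms (/GEN, /SIZE, /LEVEL) are issued as separate commands.

```python
def _prefix(cluster_name=None):
    """Common start of the command: cluster.exe [/CLUSTER:name] LOG."""
    parts = ["cluster.exe"]
    if cluster_name:
        parts.append(f"/CLUSTER:{cluster_name}")
    parts.append("LOG")
    return parts


def generate_log_command(cluster_name=None, copy_dir=None, node=None, span_minutes=None):
    """Build the /GEN form; omitted options fall back to the defaults on the slide."""
    parts = _prefix(cluster_name) + ["/GEN"]
    if copy_dir is not None:
        parts.append(f'/COPY:"{copy_dir}"')
    if node is not None:
        parts.append(f'/NODE:"{node}"')
    if span_minutes is not None:
        if span_minutes <= 0:
            raise ValueError("span must be a positive number of minutes")
        parts.append(f"/SPAN:{span_minutes}")
    return " ".join(parts)


def set_log_size_command(size_mb, cluster_name=None):
    """Build the /SIZE form; the slide gives a range of 8 to 1024 MB."""
    if not 8 <= size_mb <= 1024:
        raise ValueError("/SIZE must be between 8 and 1024 MB")
    return " ".join(_prefix(cluster_name) + [f"/SIZE:{size_mb}"])


def set_log_level_command(level, cluster_name=None):
    """Build the /LEVEL form; the slide gives a range of 0 to 10."""
    if not 0 <= level <= 10:
        raise ValueError("/LEVEL must be between 0 and 10")
    return " ".join(_prefix(cluster_name) + [f"/LEVEL:{level}"])
```

The cluster name `CLUS1` and the directory in any call you make are placeholders; `generate_log_command("CLUS1", span_minutes=30)` produces `cluster.exe /CLUSTER:CLUS1 LOG /GEN /SPAN:30`.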

Troubleshooting: Server Failover but Databases Didn't Mount

• Steps to troubleshoot:
  1. Run Get-StorageGroupCopyStatus
  2. Check the log directories on Active and Passive
  3. Run Restore-StorageGroupCopy and then Mount-Database

Troubleshooting: Log File Corrupted

• Steps to troubleshoot:
  • Run Get-StorageGroupCopyStatus and/or Test-ReplicationHealth
  • Reseed passive copy / SCR target by running Suspend-StorageGroupCopy
  • Run Update-StorageGroupCopy on the passive node or SCR target

Troubleshooting: SMB File Share for Replication Missing

• Steps to troubleshoot (SCR/CCR)
  1. Run Test-ReplicationHealth on Passive
  2. Run Get-StorageGroupCopyStatus on Passive
  3. Run Get-ClusteredMailboxServerStatus
  4. Verify share on Active Node
  5. Stop sharing the file share; the Replication service recreates it in 30 seconds
  6. Run Test-ReplicationHealth on Active
  7. Check Application Event Log
  8. Check Active Directory permissions

Known Issues

• Update Rollup 5 (RU5) for Exchange 2007 SP1 can cause Enable-StorageGroupCopy to fail in an SCR topology that consists of a parent and child domain structure:
  "Standby continuous replication is not supported between computers in different Active Directory domains. The target node is in domain <child domain> which is different from the source domain of <parent domain>"
• Workarounds
  • Uninstall RU5, enable SCR, re-install RU5
  • Use a Management Console running pre-RU5 code
• Exchange12 bug 152967
  • Expected fix in RU7 for Exchange 2007 SP1

Known Issues

• Network shares get deleted and created every 5 minutes by the replication service on a Windows 2008 SCC when SCR is enabled
• Replication service share names intermittently disappear from the cluster, causing replication status to repeatedly switch back and forth between failed and healthy states
  • Test-ReplicationHealth on SCR target may succeed, showing all tests passed
  • Get-StorageGroupCopyStatus on SCR target shows status Healthy, Initializing or Failed
  • No events on source, but SCR target will log ESE event 522 in the application event log
• Exchange12 bug 146483
  • Expected fix in RU7 for Exchange 2007 SP1

Known Issues

• When running VSS backup, ESE event 522 is logged on the passive node
  • Event is logged on resuming a suspended storage group
  • Event log fills
• Event message details:
  Microsoft.Exchange.Cluster.ReplayService (7012) Log Verifier e0a 31573001: An attempt to open the device name "\\source\share$" containing "\\source\share$\" failed with system error 5 (0x00000005): "Access is denied. ". The operation will fail with error -1032 (0xfffffbf8).
• Workaround
  • If Get-StorageGroupCopyStatus is healthy for storage groups, ignore the event
  • If Test-ReplicationHealth passes all tests, ignore the event

Known Issues

• Reseed fails when you restore 1 full backup and then more than 2 differential backups.

• Restoring to active node can succeed, but CCR no longer works after recovery.

• Workaround
  • Take a full backup when the restore is finished (note this may not be practical with large databases)

Key Takeaways

• Exchange 2007 includes several Mailbox Server availability configurations

• CCR+SCR provide higher availability at a lower cost than any other solution

• There are a number of cmdlets and tools that can be used for troubleshooting and managing continuous replication

• LLR minimizes need for full reseeds
• Transport Dumpster redelivers all routed mail after failover
• CCR addresses all ranges of failures from disk to full site
• CCR on DAS provides great RTO and RPO

Thank You!

Pro-Exchange

• Here this week!
• Visit our usergroup booth
• Core members
  • Ilse Van Criekinge
  • Johan Delimon
  • Tonino Bruno
• http://www.proexchange.be

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
