MSI SYSTEMS INTEGRATORS Building Better Resiliency Through Data Replication A presentation by: Alan...

35
MSI SYSTEMS INTEGRATORS Building Better Resiliency Through Data Replication A presentation b Alan Salesma IT Consultan August 24, 201

Transcript of MSI SYSTEMS INTEGRATORS Building Better Resiliency Through Data Replication A presentation by: Alan...

MSI SYSTEMS INTEGRATORSBuilding Better ResiliencyThrough Data Replication

A presentation by:

Alan SalesmanIT Consultant

August 24, 2010

Abstract

• Abstract:  Are you feeling the pressure for high availability, need quick recovery time without losing data?  MSI will be reviewing approaches to improving your Recovery Time Objectives (RTO) and Recovery Point Objectives(RPO).  A discussion around technologies, Service Definition, and Process will help you understand how your critical systems can achieve performance and resiliency.

 

Industry Terms and Definitions

The Disaster• The catastrophic failure of an IT Service

environment caused by an unplanned event.

That causes… • A significant and extended failure of an IT

service resulting in a severe impact to the business through:• A failure to satisfy a service level

agreement (SLA)• Loss of market share• Revenue loss• Exposure to litigation / legal action

Definitions

Recovery Time Objective (RTO)– is the duration of time and a service level within which a IT service must 

be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in IT Service continuity

Recovery Point Objective (RPO)– is the point in time to which you must recover an IT Service or data as 

defined by your organization. This is generally a definition of what an organization determines is an “acceptable loss" in a disaster situation

Definitions

Replication– the process of sharing information so as to ensure consistency between 

redundant resources, such as software or hardware components, to improve reliability, fault-tolerance or accessibility. 

Resilient– An IT service deployment philosophy that describes an IT service that is 

able to survive the loss of major components of the IT infrastructure without presenting a total loss of functionality to the user.  (The application many continue to operate at an altered level of service.)

Costs per Occurrence

Fre

que

ncy

pe

r ye

ar

1

100

1/100,000

$ $$$ $$$$$

Virus Data Corruption

Worms

Application OutageDisk Failure

Component FailureNetwork Problem

Power Failure

Building FireNatural Disaster

Terrorism/Civil Unrest

Availability Related

Disaster Resiliency

ITSCM – Recovery Related

IT Service Continuity Management Spectrum

Evolution of Disaster Resiliency

The transformation of your IT Service

Evolution of Disaster Resiliency

2010• IT has grown to be a strategic

center of companies, not just a cost center

• Globalization and collaboration open IT environment

• Proactive Disaster Resiliency/Data Protection

1990•IT is a competitive Advantage

•Proprietary Systems

•Reactive Disaster Recovery

Pressure from Both Sides

Business Pressure

•Globalization

•True 24/7 availability

•‘”Mobilization” of Data – user access at anytime or anyplace

•All data critical

•Compliance/Regulatory

IT Pressure

•Increased Data Capacity

•Data growth at 30% per year

•Increased Data Value

•Maintenance windows few and far between

•Outages are visible, some make the news

Tim

e and Expectations

Replication Model Key Characteristics

• Data replication is performed for critical data• Consistency of replicated data is supported by synchronous or asynchronous 

techniques• Recovery point is based on timing of consistency group creation and tape backup 

timing• Recovery environment provides servers and supporting infrastructure for key 

applications• Servers are inactive, excepts to support non-disk based replication• Tests can be performed and should be conducted on a regularly scheduled basis• Facilities are provided by a vendor hot site/co-location agreement, company 

owned internal private cloud or external private cloud • Configurations for infrastructure represent a subset of what is needed to support 

production levels, plus what is needed to support replication method• Network bandwidth is established to support the volume of data replication

Infrastructure Replication

Performed at the infrastructure layer (synchronous or asynchronous) • Covers both geo-proximate or Geo-Remote dual data center designs. • Application neutral and lightweight on the operating system.• Asynchronous delivery can require up to three times the amount of physical storage capacity. • Requires an high bandwidthExample Services are found in:• Peer-to-Peer Remote Copy (PPRC) (IBM)• Fast Remote Mirroring• FlashCopyRecommended Infrastructure/Policy:• Geo-Remote data centers with high bandwidth connectivity• Redundant/Duplicate Infrastructure• A good job scheduler for asynchronous replication• Defined Policy/process for recovery or restarting of Services• Defined Process for Declaration of DisasterRTO/RPO Target• Target can be 24 hours or less for RTO• Target can be less than 4 hour RPO

Client Example for a Recovery Strategy

Client Focus

• Support sophisticated recovery initiative by implementing storage replication technology to:– Reduce Recovery Time (RTO)– Recovery Point (RPO) 

MSI Role

• Performed a review of IT services included within scope  to determine the impact of the new storage recovery environment on recovery exercises, outages and daily processing. 

• Developed operational recovery procedure recommendations to leverage the capabilities of the new recovery environment and improve upon RTO/RPO objectives.

• Developed recovery test timeline recommendations to improve the quality and reduce the duration of recovery test exercises. 

Beginning State

• All mainframe IT services have the same recovery time and recovery point objectives.

• Recovery time objective was 72 hours, which could not be achieved.

• Recovery exercises were performed during 64 hour sessions at a vendor provided recovery site.

• The test process could be completed within 64 hours, but the process was a subset of what would be necessary for a full recovery.

Beginning State

• The recovery point objective was stated as being no greater than 24 hours from time of failure. 

• Disaster recovery tapes were not taken off site until 3 pm each day. 

• The time between tape creation and off-site rotation introduced an additional eight hours to the recovery point, or a total of 32 hours.

• On weekends and holidays, cycles were generally not scheduled, therefore backups and off-site rotation did not occur, leaving longer time periods of exposure.

Technology

• (2) DS8300 Storage Subsystems– Flashcopy– Global Mirror

• PPRC• Asynchronous replication

– Storage instances• A Copy - Production        Local DS8300• B Copy – Performance    Remote DS8300• C Copy – Consistency     Remote DS8300• D Copy – DR Test            Remote DS8300

Strategic Shift to Internal Recovery

• Driven by inflexible recovery vendor:– Pricing of services– Definition of roles– Access to facilities– Contention for tests and regional disasters

• The existence of a secondary data center• Mainframe cost mitigated by Capacity BackUp (CBU) 

offering• Open systems also need to be addressed, magnifying 

inflexibility• Cost concerns for extended stay at recovery site

Support Recommendations

• Establish clear ownership of remote storage, to include:– FlashCopy schedules.– Bandwidth utilization.– Global Mirror and Consistency Group validation and 

monitoring process.– Change control.

Support Recommendations

• Volume specific FlashCopy methods:– Page datasets and coupling facility datasets should be 

replicated once to create an instance on the recovery platform and never replicated again.

– JES2 Spool and secondary checkpoint should follow the same FlashCopy scheme as other production volumes.

– SYSRES volumes should be replicated once per month, following a successful IPL.

– Development and Test volumes should be replicated, and FlashCopied once per day.

Support Recommendations

• Production volume consistency group creation schedule:– During prime shift create consistency groups at system 

default interval of 1-2 minutes, to support CICS and DB2.– Pause Global Mirror operation before cycle and resume 

after cycle, to support batch processing.– For DR test exercises, perform FlashCopies to DR test 

instance at a point in time to satisfy the test requirements.

– Plan to implement TPC for Replication for advanced capabilities.

Recovery Point Improvements

Recovery Point in Time Comparison

0

5

10

15

20

25

30

35

Tim

e of

Day

13:0

014

:00

15:0

016

:00

17:0

018

:00

19:0

020

:00

21:0

022

:00

23:0

00:

001:

002:

003:

004:

005:

006:

007:

008:

009:

0010

:00

11:0

0

Rec

ove

ry P

oin

t in

Tim

e

Before

After

Total Recovery Improvement

Total Recovery Time to Point of Outage

0

10

20

30

40

50

60

70

80

Tim

e of

Day

13:0

014

:00

15:0

016

:00

17:0

018

:00

19:0

020

:00

21:0

022

:00

23:0

00:

001:

002:

003:

004:

005:

006:

007:

008:

009:

0010

:00

11:0

0RT

O +

RP

O t

o P

oin

t o

f O

uta

ge

Before

After

Summary

• Shift to internal recovery model using:– Storage replication– Capacity backup mainframe

• Shift from “restore” to “restart”• Significant improvement in recovery objectives

Core Efficiency Technologies

DeduplicationSaves up to 90% for full backups

Saveup to

90%

SnapshotSpace efficient imaging for data protection

Saveover

70%

Storage VirtualizationPooling and sharing heterogeneous storage resources

Deduplication

• Deduplication removes redundant data blocks from volumes, regardless of application or protocol

• With deduplication, users can recoup 50% or more of their capacity for many data sets and environments

• Deduplication can be used for primary, secondary, and archival storage tiers

Snapshots

• Locally retained point-in time copies of file systems which you can use to protect data– Single files or complete backup and recovery

• Block-incremental behavior limits associated storage capacity consumption

• Reliable off-media backups without the need for long backup windows 

• Simplifies the process of recovering, duplicating, or archiving data

Storage Virtualization

• Designed to improve the flexibility and utilization of your storage resources

• Pools your storage volumes, files and file systems into a single reservoir for centralized management

• Works with heterogeneous storage systems

• Reduce the effects of hardware configurations and helps support business continuity

Where do you start?

• Understand your current Resiliency/Recovery capabilities, limitations, and risks (BIA, Risk Assessment)

• Develop Service Inventory– Prioritize and define the value/importance to the business of that IT Service, i.e. 

Tier 1, 2, or 3• Define Data Protection Methods• Review/Define Service Level Objectives/Agreements• Document Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for 

each IT Service• Design and Build the resiliency model to support the business supported IT Services• Develop Data Migration Plan

Develop Service Inventory

• Define tier specific patterns for – Virtualization and Cloud Opportunities– Daily operational quality of service– High Availability requirements – Recovery requirements– Data protection requirements

• Establish service placement in each tier• Develop budgetary roll-up of all services by tier• Perform service delivery chain mapping for critical 

systems 

Sample Service Mapping

Sample IT Service Tiering

Tier Definitons:

Target RPO/RTO for 2010: RTO: <24 Hours RPO: <12 Hours RTO: <72 Hours RPO: <24 Hours RTO: 7-10 Days RPO: <24 Hours RTO: 30 DaysRPO: <24 Hours

Services by Tier:

Sharepoint (non-critical)

Tier 1 Tier 2 Tier 3 Tier 4

Communications/Infrastructure Critical Applications Non-Critical Applications Development / TestNetwork Phone System client websites (no SLA) TFS - Firewall - Call Manager QC - Telecom - Unity Oracle ERP - Router - IPCC XXX Apps - Switch SharePoint (intranet) Move-it DMZ - VPN - intranet.XXX.org Records Management - ACS client websites (w/ SLA) License Servers (Cached)Email Infra SmartFilter - Exchange License Servers (No Cache) DMS (2010) - Blackberry ProjectWisePhones SIP (2010) - SRST

Define Data Protection Patterns

• Develop an inventory of data types supporting complete service delivery

• Based on data protection requirements, determine best replication methods

• Develop estimated bandwidth and budgetary role-up costs to support the preferred replication methods

• Define roadmap or incremental implementation approach for desired replication methods which are outside current approved funding

Data Migration Planning

• Discover characteristics of the source data• Based on data protection requirements, determine 

data migration requirements• Develop data migration plan• Deliverables:

– Initial and incremental roadmap for replication implementation

Conclusion

• There is no one-size-fits-all approach • Most businesses do not fully understand how vulnerable they are using existing recovery 

processes.• Companies who are looking to improve their RTO/RPO approach need to:

– Look carefully at their business requirements  and risk– Understand IT demands by understanding the applications, the operating system, 

storage, network bandwidth requirements and the total business impact. • Relating the business support needs to IT Service Management can assist them in 

determining which approach is the most suitable. – Sometimes a combination of drawing elements from several approaches works best. – You can mix and match in order to tailor a solution that meets unique needs of any 

business unit