how to start your disaster recovery in this “cloudy” landscape
EMC Proven Professional Knowledge Sharing 2011
Roy Mikes, Storage and Virtualization Architect, Mondriaan, [email protected]
Table of Contents
About This Article
Who Should Read This Document?
Introduction
1. What is a Disaster?
2. What is a Disaster Recovery Plan (DR plan)?
2.1. Other benefits of a Disaster Recovery Plan
3. Business Impact Analysis (BIA)
3.1. Maximum Tolerable Downtime (MTD)
3.2. Recovery Time Objective (RTO)
3.3. Recovery Point Objective (RPO)
4. Data Classification
5. Risk Assessment
5.1. Component Failure Impact Analysis (CFIA)
5.2. Identifying Critical Components
5.2.1. Personnel
5.2.2. Systems
5.3. Dependencies
5.4. Redundancy
6. Emergency Response Team (ERT)
7. Developing a Recovery Strategy
7.1. Types of backup
7.2. Virtualized Servers and Disaster Recovery
7.3. Other thoughts
8. Testing Recovery Plans
9. Role of virtualization
9.1. Role of VMware
9.2. Role of EMC
9.3. Role of VMware Site Recovery Manager (SRM)
10. VMware Site Recovery Manager
11. Standardization
12. Conclusion
References
About This Article
Despite our best efforts and precautions, disasters of all kinds eventually strike an organization, usually unanticipated and unannounced. Natural disasters such as hurricanes, floods, or fires can threaten the very existence of an organization. Well-prepared organizations establish plans, procedures, and protocols to survive the effects that a disaster may have on continuing operations and to facilitate a speedy return to working order.
Continuity planning and recovery planning are two separate sets of preparations for restoring and recovering critical business operations in the event of such disasters. My focus in this article is recovery planning.
This article should help you understand the need for Business Continuity Management and Disaster Recovery Planning in relation to a working failover plan. Because the subject is not purely technical, this article covers most of the non-technical discussions related to Disaster Recovery Planning. After reading this document, I think you can make a good start.
As such, this material is probably most useful to those with little or no familiarity with the topic. Readers who fall into this category will be well served by reading this document.
Who Should Read This Document?
This article is written for IT professionals who are responsible for defining the strategic direction of protecting data in their data center(s).
These include:
Storage Administrators
Operational and middle-level managers
Business Managers
IT managers and executives (CIO, Chief Information Officer)
Organizations and individuals who have the same interests should read this article as well.
Where do you start with Disaster Recovery Planning? That often remains a difficult story.
My goal is to give a general guideline that provides insight into Disaster Recovery Planning, and that should not be too difficult to read.
Introduction
Let's start with a simple quote: "Information is the organization's most important asset."
Data is created by applications and is processed to become information. Information is undoubtedly the most important asset of an organization. Does this make sense?
Absolutely! The digital footprint of each person on this planet is growing. In a sense, it does not matter whether we store data as a person or as a corporation; it has to be protected. For some people, photos are just as important as a company's ERP system. It is not for nothing that storage vendors put so much energy into managing this information.
From a Disaster Recovery perspective, the world is divided into two types of businesses: those that have DR plans and those that don't. If a disaster strikes an organization in each category, which do you think will survive?
When disaster strikes, organizations without DR plans have an extremely difficult road
ahead. If the business has any highly time-sensitive critical business processes, that
business is almost certain to fail. If a disaster hits an organization without a DR plan, that organization has very little chance of recovery. And it's certainly too late to begin planning then.
Organizations that do have DR plans may still have a difficult time when a disaster strikes. You may have to put in considerable effort to recover time-sensitive critical business functions. But if you have a DR plan, you have a fighting chance at survival.
Does your organization have a disaster recovery plan today? If not, how many critical, time-sensitive business processes does your organization have? Many organizations think they have a DR plan. They think they have some procedures, and that is all it takes. True, you need procedures, but you also need to be sure that you can actually fail over. How do you manage that? Personally, I think testing a failover live can do more damage than the certainty it provides is worth. I can take a guess, but no one actually knows for sure how many organizational changes occur. Many organizational infrastructures change every hour. Try to keep your DR plan current when things change that fast. Where does that leave you? Good question. You probably test your failover once per year, maybe twice, or perhaps each quarter. How much do you think has changed since the last time you performed that failover? This is a considerable challenge.
Luckily for you, there are many techniques and solutions, such as "clouds", where DR is probably already well organized, or VMware Site Recovery Manager (SRM), which can help you with your failover. VMware SRM is a business continuity and disaster recovery solution that helps you plan, test, and execute a scheduled migration or emergency failover of data center services from one site to another. But the most beautiful part of SRM is that you can test a plan without running it live. Wow! I can actually fail over anytime without doing damage to the infrastructure environment? True! Virtualization these days can make Disaster Recovery implementations easy. Think not only public clouds but also private ones. Private clouds can have a huge positive impact and synergy. How many of you are looking for partnerships, or serve as each other's failover site? That makes 1+1=3. But take it easy. Don't press <Enter> too soon. There is a lot to consider before taking this road.
Depending on the nature of your business, good disaster recovery is achieved by designing
a process which enables your operations to continue to work, perhaps from a different
location, with different equipment, or from home, making full use of technology to achieve a
near seamless transition that is all but invisible to your customers and suppliers. Insurance
can mitigate the cost of recovery, but without a disaster recovery plan that gets you back up
and running, you could still go under. Indeed, more than 70% of businesses that don't have a
DR plan fail within 2 years of suffering a disaster.
So what's next? Certainly a lot! But don't make life too difficult. There will always be one or more single points of failure. You should ask yourself whether the costs are worth five nines (99.999%) of availability. The primary task and next step is to determine how you will achieve your Disaster Recovery goals for each of the systems and system components, to ensure that the critical, time-sensitive business processes continue working. This is the point at which it becomes important to consider exactly what types of disasters you need to prepare for, and to classify them by the extent and type of impact they have.
1. What is a Disaster?
You may argue with me about the definition of a disaster, because there is more than one definition. To some, anything that doesn't go according to their schedule or plans is a disaster. On a personal level, a fire in our house could be considered a disaster. In most cases, one broken server isn't a disaster, but many broken servers are. However, it is important to understand the difference between these kinds of events and a 'true' disaster. This will allow you to keep things in perspective when making your own disaster plans.
Should your company experience a disaster, the first 48 hours following the disaster will be
the most critical in your recovery efforts. How you respond during that period will determine if
your business will survive. Furthermore, the most important hour is the one immediately
following the event.
A disaster is defined as an event causing great loss, hardship, or suffering to many
organizations. When we think of this kind of event we usually think of catastrophic events
such as hurricanes, earthquakes, floods, fires, and even man-made disasters. In situations
like this, help may be unavailable because rescuers may be in the same predicament as you,
and it could take a considerable length of time for help to arrive.
Disaster preparedness is the sensible thing to do. It doesn't need to be expensive and it can
save your business! In these situations we are not talking about losing server cooling or
power for a few hours; we are talking about losing essential services, data, or information,
under extreme circumstances, for a prolonged period of time.
Disaster recovery is becoming an increasingly important aspect of enterprise computing. As
devices, systems, and networks become ever more complex, there are simply more things
that can go wrong. As a consequence, recovery plans have also become more complex.
It is a common misconception that most of the threats to continuity are a result of natural
disaster. To the contrary, statistically, these threats account for fewer than 1% of IT service
unavailability.
2. What is a Disaster Recovery Plan (DR plan)?
A good Disaster Recovery Plan (DR plan) is like an information insurance policy for a business. A DR plan documents the ability to continue work after any number of catastrophic problems, ranging from natural disasters such as floods, fires, and earthquakes to planned or unplanned scenarios such as database corruption, server failures, or simply human error.
Often a DR plan is confused with a Business Continuity Plan (BCP). Both address events that make the continuation of normal functions impossible, but a DR plan is the IT-related part of its big brother, the Business Continuity Plan. I am not going to talk about the Business Continuity Plan. Instead, we are sticking with the DR plan.
A DR plan consists of the precautions taken so that the effects of a disaster will be minimized
and the organization will be able to either maintain or quickly resume mission-critical
functions. Typically, DR planning involves an analysis of business processes and continuity
needs; it may also include a significant focus on disaster prevention.
2.1. Other benefits of a Disaster Recovery Plan
Besides the obvious readiness to survive a disaster, organizations can realize several other benefits from DR planning [1]:
Improved business processes: Business processes undergo continuous analysis and
reviews; there are always areas for improvement.
Improved technology: Often, you need to improve IT systems to support recovery
objectives that you develop in the disaster recovery plan. The attention you pay to
recoverability also often leads to making your IT systems more consistent with each
other and, hence, more easily and predictably managed.
Fewer disruptions: As a result of improved technology, IT systems tend to be more
stable than in the past. Also, when you make changes to system architecture to meet
recovery objectives, events that used to cause outages no longer do so.
Higher quality services: Improved processes and technologies improve services, both
internally and to customers and supply-chain partners.
Competitive advantages: Having a good DR plan gives an organization bragging rights that may outshine competitors. Price isn't necessarily the only point on which companies compete for business. A DR plan allows a company to also claim higher availability and reliability of services.
3. Business Impact Analysis (BIA)
Although a full DR plan takes many months or even longer to complete, a good first step for an individual DR plan is mapping out the most critical aspects of day-to-day business in your company. Data safety is perhaps one of the most crucial and overlooked aspects of disaster recovery. [2]
A Business Impact Analysis (BIA) is a detailed inventory of the critical processes, systems, and people that are associated with an organization's primary business activities. If you have never done a Business Impact Analysis, it can seem to be one of the most difficult tasks. There always seem to be a lot of questions about what should and should not be included in the BIA.
The purpose of a BIA is to identify which business units, operations, and processes are essential to the survival of the business. Of course, there is no standard BIA; it differs per organization. Basically, there are two areas to discover:
1. Determine the most critical business areas, often referred to as mission-critical
applications. We will cover this later.
2. For each business area, determine the sub-business processes and identify the
processes which are essential to the operation of the business, often referred to as
business-critical. We will cover this one later also.
After you have a clear view of which processes are critical for your business (and don't take this lightly), management should estimate the maximum downtime that can be tolerated. Management should determine the longest period of time that a critical process can be disrupted. This figure is known as the Maximum Tolerable Downtime (MTD). You may measure an MTD in hours or days. These are often the most difficult answers to get.
After you complete the MTD and risk analysis for each critical business process, you need to
condense the detailed information to a simple spreadsheet so you can see all the business
processes on one page, along with their respective MTD and risk figures. Try to see the big
picture here.
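The condensation step described above can be sketched in a few lines. The process names, MTD figures, and risk ratings below are purely hypothetical examples, not taken from any real BIA:

```python
# Condense detailed BIA findings into a one-page overview:
# each row holds a business process, its Maximum Tolerable
# Downtime (MTD), and a simple risk rating (1 = low, 5 = high).
# All names and figures are hypothetical examples.

bia_rows = [
    ("Order entry",  4, 5),   # process, MTD in hours, risk
    ("E-mail",      24, 3),
    ("Payroll",     72, 2),
    ("Intranet",   168, 1),
]

def summarize(rows):
    """Sort processes by MTD so the most time-sensitive come first."""
    return sorted(rows, key=lambda r: r[1])

for process, mtd_hours, risk in summarize(bia_rows):
    print(f"{process:<12} MTD: {mtd_hours:>4} h   risk: {risk}")
```

Listing every process with its MTD and risk figure on a single, sorted page is exactly what makes the big picture visible.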
Because there is a potential risk for downtime regarding these critical processes, we cannot
ignore the major consequences. These consequences are related to objectives of the
business.
The objectives of the business impact analysis (BIA) include:
Financial/Cash Flow/revenue loss
Legal/Regulatory
Life-threatening issues in hospitals, for example
Reputation
And so on…
(There are many more, depending on your type of organization)
3.1. Maximum Tolerable Downtime
For each process in the BIA, you need to determine its Maximum Tolerable Downtime
(MTD), which is the time after which the process being unavailable creates irreversible (and
often fatal) consequences. Generally, exceeding the MTD leads to severe damage to the
viability of the business, including the actual failure of the business. Depending on the
process, you can express the MTD in hours or days.
3.2. Recovery Time Objective (RTO)
After you determine the MTD for processes, you can begin setting targets for recovery. One important target is the Recovery Time Objective (RTO). RTO is the period of time required to return an application or process to a working state after a downtime situation. For any given process, the RTO is less than the MTD. By definition, it has to be. If you set a 5-day RTO for a process with a 2-day MTD, your business has failed before you can get the critical process running again. And what's the point of that? A process's RTO forms the basis for any DR planning that you'll do for that process.
For example, if a process has a 30-day RTO, you can get it running again—purchase a new
server, install software, and restore backup data—at a leisurely pace. However, a process
with a one-hour RTO requires a hot site with a standby server and data replication in near-real time. The costs for these two scenarios vary greatly.
3.3. Recovery Point Objective (RPO)
The Recovery Point Objective (RPO) represents the point in time to which data must be restored: you may, for example, lose the transactions of the last 5 minutes, hour, or day. It represents the risk of a permanent loss of some part of your data.
Assume that an organization wants to establish a 5-hour RPO for an order entry system. To meet this figure, the organization has to implement a mechanism to back up or replicate transaction data so that it loses no more than 5 hours of transactions in a disaster scenario.
Similar to the RTO, setting the RPO determines what sort of measures you need to take to ensure that you don't lose information related to any particular business process. Speed costs.
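The arithmetic behind an RPO can be sketched, assuming a simple periodic backup or replication scheme: in the worst case, disaster strikes just before the next run, so the maximum data loss equals one full backup interval. The intervals below are hypothetical; the 5-hour RPO mirrors the order entry example:

```python
# Worst-case data loss under periodic backup equals the backup
# interval: a disaster just before the next run loses everything
# since the previous one. Figures are hypothetical examples.

def worst_case_loss_hours(backup_interval_hours: float) -> float:
    """Maximum amount of transaction data (in hours) that can be lost."""
    return backup_interval_hours

def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """True if the backup schedule satisfies the RPO."""
    return worst_case_loss_hours(backup_interval_hours) <= rpo_hours

print(meets_rpo(4, rpo_hours=5))    # backing up every 4 hours: True
print(meets_rpo(24, rpo_hours=5))   # a daily backup loses too much: False
```

A tighter RPO therefore forces a shorter backup or replication interval, which is where "speed costs" comes from.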
4. Data Classification
Every business requires certain applications that it uses every day to run the business. These applications become assets and are incorporated into the business. Critical information is deployed on these assets, and the assets are provided to each employee.
In this chapter I will go through five classes of applications. It is important that the data in each organization is analyzed and classified in order to develop a recovery strategy. You must classify your data; otherwise, everything gets protected the same way. Without classification everything is important, and you don't want that. Unless you own a gold mine!
Let's face it; most employees are aware of which applications or processes are important for their organization. But not everything is, or should be, important, and that's why we are going to classify these types of data.
EMC's application matrix provides a requirement-driven, five-category methodology for mapping technology solutions to critical applications. Criticality-ranking each application within the matrix dictates the need for disaster recovery and the method used to protect the data.
Class 1 - Mission Critical
Class 2 - Business Critical
Class 3 - Business Important
Class 4 - Productivity Important
Class 5 - Non-critical
Let's go through them.
Mission Critical applications are applications necessary for the company to perform its mission. Downtime of these applications has a significant impact on revenue.
Business Critical applications are applications that increase productivity. These are the applications that usually support mission-critical applications. After a major disaster, these should be the second set of applications restored. Downtime of these applications also has an impact on revenue.
Business Important applications are also applications that increase productivity, and they support applications that are not critical. These are 'third-rate' applications.
Productivity Important applications are departmental applications, rather than company-wide ones. These will only affect the productivity of their departments.
Non-critical applications have a minor impact on productivity. They are too personal to be recognized in times of crisis.
Please note that these are guidelines. Within an organization, the importance of an application may differ per department. But beware of losing yourself in discussions of each application's importance. It eventually happened to me! All of these departments declaring the importance of their applications comes at a price. But hey... as long as somebody is willing to pay for it, no harm is done and you have created a new challenge.
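As a minimal sketch, the five classes can be captured in a small lookup table so that an application's class immediately dictates its protection method. The application names and the exact measures below are hypothetical illustrations, not a fixed standard:

```python
# Map each criticality class to an illustrative protection level.
# Class numbers follow the five-category methodology above; the
# protection descriptions and application names are hypothetical.

CLASSES = {
    1: ("Mission Critical",       "offsite data replica, rapid restore"),
    2: ("Business Critical",      "offsite DR location, rapid restore"),
    3: ("Business Important",     "offsite tape"),
    4: ("Productivity Important", "onsite tape"),
    5: ("Non-critical",           "no dedicated protection"),
}

applications = {"ERP": 1, "Pharmacy": 2, "Intranet": 4}

def protection_for(app: str) -> str:
    """Describe the protection an application gets, based on its class."""
    label, measure = CLASSES[applications[app]]
    return f"{app} ({label}): {measure}"

for app in applications:
    print(protection_for(app))
```

The point of the table is exactly the one made above: once an application is placed in a class, its protection (and its price tag) follows automatically, instead of being renegotiated per department.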
It's not rocket science to know that mission-critical applications need more performance and protection. Ultimately, we earn our money using these applications. So once we are clear about which applications go in which boxes, hypothetically speaking, we know what each needs in terms of performance, capacity, protection, and so on.
Requirements such as high availability, high scalability, and redundant connections are examples of what you may need. As I mentioned earlier, mission-critical applications have bigger needs than 'nice to have' applications. By now you probably know which applications are mission-critical or very important to the business. But is there really a need for high-performance, no-downtime hardware? Every requirement has a price; is your business willing to pay for a Ferrari, or is a nice Mercedes enough? The questions that have to be answered are mostly about RPO and RTO. Ask questions such as:
- How much downtime can be afforded?
- How much data may be lost in an outage?
- How fast do we need to recover the data?
- How fast do we need to be up and running again?
If the answer is no downtime or less than 2 days, this certainly impacts the type of
technology you need. Meeting such standards requires excellent skills combined with
excellent hardware and software.
Once complete, the results can be integrated into a matrix such as the one below:
The Criticality Matrix shown below depicts the typical requirements and the information infrastructure associated with each class. [2] Here again, you decide which requirements are necessary in each class.
Example of a Criticality Matrix for a mental health institution:
Mission critical
- Applications: PSYGIS/Basis (client administration), PSYGIS File Lining, PSYGIS Medication prescription, PSYGIS Authentication and billing, PSYGIS Calendar Management
- Requirements: high availability, high scalability, redundant connections, scalable performance, non-disruptive backups, rapid restore, instant test environments, business continuance, advanced recovery
- Backup & disaster recovery: offsite tape, instant onsite recovery, offsite disaster recovery location, offsite data replica
Business critical
- Applications: Centrasys (pharmacy), Mirador (lab results), MUS (methadone), REAKT (registration of day activities), Vila (purchase & inventory)
- Requirements: high availability, high scalability, redundant connections, scalable performance, non-disruptive backups, rapid restore, instant test environments
- Backup & disaster recovery: offsite tape, instant onsite recovery, offsite disaster recovery location
Business important
- Applications: SDB HRM (personnel administration & salaries), Square (planning), FIS (financials), IRIS (financials), Priva (building management), Microsoft Office (e-mail/documents)
- Requirements: high availability, high scalability, redundant connections, scalable performance, non-disruptive backups
- Backup & disaster recovery: offsite tape
Productivity important
- Applications: BI (business intelligence), Prodacapo (accounting), SharePoint (intranet), Marvin (process management system), Topdesk (Enterprise Service Management)
- Requirements: high availability, high scalability, high performance
- Backup & disaster recovery: onsite tape
Non-critical
- Applications: other applications
- Requirements: scalable, low cost
- Backup & disaster recovery: none
5. Risk Assessment
A risk assessment is an important step in protecting your business. It identifies the various natural or man-made threats that can disrupt processes in the organization and its facilities.
A common misconception is that most of the threats to continuity are the result of natural disaster. Statistically, these threats account for less than 1% of IT service unavailability. That leaves us with 99% attributable to other threats. It's important that you know the risks of disasters such as tornadoes, hurricanes, floods, or other natural disasters. It's more important that you protect yourself against man-made threats.
This one is difficult because there are innumerable scenarios, caused by humans, that can go wrong. Do we look into all of these scenarios? I think not; there are simply too many.
This leaves us with two questions:
Which possible risks are there for the organization?
What are the results of these possible threats?
NOTE: Because there are so many scenarios, it is more important to consider the misery
caused by these scenarios/risks. There are more than one hundred ways to destroy
something. The fact is, one way or the other, it’s broken. Try to shift your focus from cause to
effect!
Besides the effects of disasters, you need to create a relatively complete list of the disasters
that are reasonably likely to occur. The following list isn‘t meant to be complete. Disasters not
listed here might belong in your threat model. But this list should give you a good starting
point.
Global Threats
Part of the risk process is to review the types of disruptive events that can affect the normal
running of the organization. There are many potential disruptive events and the impact and
probability level must be assessed to give a sound basis for progress.
Environmental Disasters
o Tornado
o Hurricane
o Flood
o Snowstorm
o Earthquake
o Electrical storms
o Fire
o Subsidence and Landslides
o Freezing conditions
o Contamination and Environmental Hazards
o Epidemic
Organized and / or Deliberate Disruption
o Act of terrorism
o Act of sabotage
o Act of war
o Theft
o Labor Disputes / Industrial Action
Loss of Utilities and Services
o Electrical power failure
o Loss of gas supply
o Loss of water supply
o Communications services breakdown
o Loss of drainage / waste removal
Equipment or System Failure
o Internal power failure
o Air conditioning failure
o Production line failure
o Cooling plant failure
o Equipment failure (excluding IT hardware)
Serious Information Security Incidents
o Cyber crime
o Loss of records or data
o Disclosure of sensitive information
o IT system failure
Other Emergency Situations
o Workplace violence
o Public transportation disruption
o Neighborhood hazard
o Health and Safety Regulations
o Employee morale
o Mergers and acquisitions
o Negative publicity
o Legal problems
Although not a complete list, it does give a good idea of the wide variety of potential threats.
Figure: Consequences of disasters, according to the International Disaster Database
5.1. Component Failure Impact Analysis
Originally a process defined by IBM in the 1980s to improve availability, Component Failure
Impact Analysis (CFIA) [2] [3] is now a part of the ITIL "Best Practices". CFIA is a process of
analyzing a particular hardware/software configuration to determine the true impact of any
individual failed component.
Many know that CFIA is somehow related to ITIL Problem and Availability Management, yet
it remains at best a fuzzy concept for most. While CFIA is impressive sounding, it is really
just a way of evaluating (and predicting) the impact of failures, and locating Single Points of
Failure (SPoF). CFIA can:
1. Identify Configuration Items (CIs) that can cause an outage
2. Locate CIs that have no backup
3. Evaluate the risk of failure for each CI
4. Justify future investments
5. Assist in Configuration Management Database (CMDB) creation and maintenance
All it takes to gain these benefits is an Excel spreadsheet or some graph paper. Following are the three steps to success with Component Failure Impact Analysis.
1. Select an IT Service, and get the list of CIs upon which the IT Service depends, hopefully from Configuration Management. If there is no formal CMDB, then ask around IT for documentation, paper diagrams, and general knowledge.
2. Using a spreadsheet or graph paper, list CIs in one column and the IT Service(s) across the top row. Then, for each CI, under each service:
a.) Mark "X" in the column if a CI failure causes an outage
b.) Mark "A" when the CI has an immediate backup ("hot-start")
c.) Mark "B" when the CI has an intermediate backup ("warm-start")
You now have a basic CFIA matrix. Every "X" and "B" is a potential liability.
3. Examine first the "X"s, then the "B"s, by asking the following questions:
Is this CI a SPoF?
What is the business/customer impact of this CI failing? How many users would be
impacted? What would be the cost to the business?
What is the probability of failure? Is there anything we can do differently to avoid this
impact?
Are there design changes that could prevent this impact? Should we propose
redundancy or some form of resiliency? What would redundancy cost?
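The three steps above translate almost directly into code. A minimal sketch of a CFIA matrix, using hypothetical CIs and services ("X" = failure causes an outage, "A" = hot-start backup, "B" = warm-start backup):

```python
# CFIA matrix: rows are Configuration Items (CIs), columns are IT
# services. "X" = CI failure causes an outage, "A" = hot-start
# backup, "B" = warm-start backup. All names are hypothetical.

cfia = {
    "core switch": {"E-mail": "X", "Order entry": "X"},
    "mail server": {"E-mail": "B"},
    "db server":   {"Order entry": "A"},
}

def liabilities(matrix):
    """Every 'X' and 'B' cell: the potential liabilities to examine first."""
    return [(ci, svc, mark)
            for ci, row in matrix.items()
            for svc, mark in row.items()
            if mark in ("X", "B")]

def single_points_of_failure(matrix):
    """CIs with at least one 'X': no backup, so failure means an outage."""
    return [ci for ci, row in matrix.items() if "X" in row.values()]

print(single_points_of_failure(cfia))   # ['core switch']
print(liabilities(cfia))
```

Even this toy matrix surfaces the core switch as the SPoF to examine first, which is exactly the reasoning the questions above walk through.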
As you get good at CFIA, consider expanding your matrix to include the procedure used to recover from each CI failure as a row across the bottom. (Of course, this requires that you have written procedures!) Adding documented response procedures to your CFIA matrix lets you examine the organization as well as the infrastructure. Ask yourself:
How do we respond when this CI fails?
What procedures do we follow? Are these procedures documented? Could they be
improved? Could they be automated?
Can we improve the procedure through staff training? New tools or techniques?
Could preventative maintenance have helped avoid this problem?
NOTE: We will cover these questions later in this article.
Sound CFIA at any level (infrastructure, organization, or both) produces RFCs (Requests for Change) that can deliver real improvements to the business without requiring high process maturity or expensive supporting software. There are some IT-centric benefits to CFIA as well, including a head start on IT Service Continuity Management; aiding Configuration Management, which benefits from the addition of recovery procedures to the CMDB; and helping Problem and Incident Management, which may follow these procedures.
How far do we go with this? You could take every little thing in your CMDB, but it's wise to focus on the CIs behind your mission-critical applications. Identify your critical components. Only they are important. When one of these critical components fails, your mission-critical application probably does, too.
5.2. Identifying Critical Components
This chapter is very closely related to Component Failure Impact Analysis (CFIA), which
talked about data classification, and gives you a good start to the applications we are going
to protect.
5.2.1. Personnel
It's important to recognize that your personnel are critical components, too. You may begin to notice a few names that appear frequently in the most critical processes or that are involved with several critical applications. You may want to take a closer look at those people and consider whether they're truly critical for so many business processes or applications. Items in your DR plan that relate to critical personnel may include cross-training or some form of staff expansion, in order to reduce the exposure created by too many processes depending on too few individuals.
5.2.2. Systems
By this time you've collected all the necessary information about the important business processes for your Business Impact Analysis. You've identified the information systems, personnel, assets, and suppliers that these processes depend on.
In chapter 4 (Data Classification), we discussed the critical applications and what is needed to run them and keep them available at all times. These critical applications depend on and run on systems such as power, cooling, switches, firewalls, and servers. Now, all systems that are relevant to these critical applications should be named; but name only those systems that are absolutely necessary for the critical applications that need to fail over when disaster strikes.
So, again, keep things simple and start with which application runs on which server. Work systematically. For example: this application runs on this server and depends on the following services; this server runs on precisely this blade in this blade chassis. Once you are done, you have a list of all servers. You can expand this list to your own needs. Do the same for power, switches, and other systems related to the critical applications. The more extensive, the better. The important thing is to start somewhere!
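The application-to-server mapping described above can be sketched as a simple inventory structure. This is only an illustration; all server, blade, and service names here are hypothetical placeholders:

```python
# Hypothetical inventory sketch: map each critical application to the
# systems it runs on, so nothing is missed when building the failover list.
inventory = {
    "mail": {
        "server": "vm-mail-01",
        "blade": "chassis-1/blade-4",
        "services": ["DNS", "ADS"],
    },
    "sharepoint": {
        "server": "vm-sp-01",
        "blade": "chassis-1/blade-2",
        "services": ["SQL Server", "IIS", "ADS"],
    },
}

# Derive the full set of systems that must fail over with these applications.
systems = set()
for app, info in inventory.items():
    systems.add(info["server"])
    systems.add(info["blade"])
    systems.update(info["services"])

print(sorted(systems))
```

Starting from a structure like this, you can keep expanding it with power feeds, switches, and chassis until the list covers everything the critical applications need.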
EMC Proven Professional Knowledge Sharing 19
5.3. Dependencies
Why identify dependencies? Your mission in this phase is to identify the systems that are critical. The systems that support business processes are not just the servers hosting the applications, but also everything else you need to keep those systems running properly.
The following sections discuss dependencies in greater detail. As previously discussed, try to focus on the things that are related to the mission-critical applications. After you reach that point you can always expand. A wise lesson: start simple!
When you have completed your list of all the critical components for the mission-critical applications, you need not only an inventory-level view of your systems and applications, but also a high-level view of them. If you don't have these views of your environment, it's worth the time required to develop them. Often, such diagrams are the only way to get a complete end-to-end view of a single application or an entire environment. Put all your information in a diagram, then try to connect the components to each other.
In my example on the next page there are some layers to play with. I roughly used the
following layers:
Power
Network infrastructure
Hardware
Storage
Hypervisor
Operating systems
Services
Reference [4] describes a good way to capture these dependencies in a schema. Below, we see how data flows as email is delivered: it is routed from the external mail server, through the spam filter, and into the internal mail server. IT staff envision their world like this. It's a conceptual model that facilitates troubleshooting of email problems, but it doesn't clearly show how email services might be impacted by various system failures. Adding dependency relations makes that visible.
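As a sketch of this idea, the email chain can be modeled as a small dependency graph and queried for the impact of a component failure. The component names are illustrative, not taken from any real environment:

```python
# Minimal dependency-graph sketch: each component lists what it depends on.
# A failure impacts a service if the failed component appears anywhere in
# that service's transitive dependencies.
deps = {
    "email": ["internal-mail-server"],
    "internal-mail-server": ["spam-filter", "DNS"],
    "spam-filter": ["external-mail-server"],
    "external-mail-server": ["network"],
    "DNS": ["network"],
    "network": ["power"],
    "power": [],
}

def impacted_by(service, failed, graph=deps):
    """Return True if `service` is impacted by the failure of `failed`."""
    stack = [service]
    seen = set()
    while stack:
        node = stack.pop()
        if node == failed:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

print(impacted_by("email", "power"))  # a power failure reaches email
```

This is exactly what the dependency relations in the diagram buy you: given any single failure, you can immediately see which services go down with it.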
You need at least these high-level diagrams of your systems environment. On the next page, I provide an example in which I identify the critical systems. Just add your dependencies and you are ready. (No dependency lines are drawn in the figure because it would become unreadable.)
[Figure: Example of a high-level diagram. It layers power, network infrastructure, hardware (HP blade chassis, HP thin clients), hypervisor (VMware ESX), virtual machines, operating systems, services (DNS, DHCP, Active Directory with its forest and domain roles, IIS, Citrix, SharePoint with its WFE and SSP, SQL Server 2005 instances and databases), and storage (EMC CLARiiON CX4-240 with RAID 5 and RAID 1/0 groups, LUNs, and VMFS volumes) beneath the mission-critical processes and their dependencies.]
5.4. Redundancy
The term redundancy is often confused with availability. While the two are related, they are not the same. Redundancy refers to, for example, the use of multiple servers, more than one Host Bus Adapter (HBA), or RAID-protected disks. Redundancy is the duplication of critical components of a system with the intention of increasing system reliability. When it comes to redundancy, you might say that more is better. But redundancy has its price: you have to buy at least double the hardware. On the other hand, what is the price of a server compared to an outage of your mission-critical application?
There are many scenarios for deploying a redundant solution for servers and storage. Let's look at some examples:
No redundancy
Look at the components in the picture: each is a single point of failure. A single point of failure (SPoF) is a hardware or software element whose loss results in the loss of service.
HBA redundancy
Configuring multiple HBAs and using multipathing software provides path redundancy. Upon detecting a failed HBA, the software can re-drive the I/O through another available path.
HBA and Switch redundancy
This picture provides HBA and switch redundancy as well. It also protects against storage
array port failures.
HBA, Switch, and Disk redundancy
Now we are using some level of RAID, such as RAID-5. RAID protection will ensure
continuous operation in the event of disk failures.
HBA, Switch, Disk, and Storage array redundancy
The diagram above depicts a highly redundant infrastructure. Everything is redundant, so the failure of a single component is unlikely to make your applications unavailable.
Remote replication is an essential part of any data protection plan. It provides protection in
case of primary device, storage, or site failure. Remote replication involves moving data to a
secondary storage array to protect against data loss in case of primary site failure.
There are two types of remote replication: synchronous, which allows an RPO of close to zero, and asynchronous, which allows updates to be made to a secondary image at intervals selected by the user.
Bottom line: there should be some form of redundancy in your infrastructure to make sure all data and information are protected. Without redundancy, you will discover during an outage that you are lost. You will have to decide how far you want to take redundancy; as mentioned earlier, it has a price. The more redundancy, the more it costs.
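The cost-versus-protection trade-off can be made concrete with a back-of-the-envelope calculation. Assuming independent failures, n parallel components of availability a give a combined availability of 1 − (1 − a)^n; the 99% figure below is an assumed example, not a measured value:

```python
# Availability of n redundant components in parallel, assuming
# independent failures: 1 - (1 - a) ** n.
def parallel_availability(a, n):
    return 1 - (1 - a) ** n

single = 0.99                                # one HBA/path at 99% availability
dual = parallel_availability(0.99, 2)        # two redundant paths
print(f"single path: {single:.4f}, dual path: {dual:.4f}")
```

Doubling the hardware takes this example from two nines to four nines, which is the kind of number to weigh against the cost of an outage of your mission-critical application.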
6. Emergency Response Team
An emergency response team (ERT) [5] is a group of people who
prepare for and respond to any emergency incident, such as a
natural disaster or an interruption of business operations. Incident
response teams are common in corporations as well as in public
service organizations. This team is generally composed of specific
members designated before an incident occurs, although under
certain circumstances the team may be an ad-hoc group of willing
volunteers.
Incident response team members typically are trained and prepared to fulfill the roles
required by the specific situation. Ideally, the team has already defined a protocol or set of
actions to perform to mitigate the negative effects of the incident.
You need to determine when to declare a disaster, so define a procedure for declaring one; at least two ERT members should make the declaration. Most of the time this is relatively easy. Why? Because it does not matter whether the disaster was caused by a man-made or a natural event: base your decision on consequences rather than causes.
For example, assume your mission-critical environment is unavailable, or data has been lost in your mission-critical application. Either way, this is usually severe enough to declare a disaster. When a disaster is declared, the ERT launches the DR plan. Before declaring a disaster, determine the Maximum Acceptable Outage Time (MAOT). The MAOT may range from a few hours to several days or more. It is the longest time that can be tolerated between the onset of a disaster and the resumption of a critical business process. The ERT should assess the disaster and determine whether the outage of your business's critical processes is likely to exceed the MAOT. If the ERT thinks it will, the ERT should declare a disaster.
Example of MAOT:
Decide moment (12 hours maximum): the final moment at which it is decided that the IT infrastructure at the primary location will not recover within one business day.
Start fallback scenario (48 hours maximum): start the failover to the secondary location.
Availability of mission-critical app: a working IT environment for at least 200 employees.
The scenario above is an example. It should be clear that at the decision point you do not always need 12 hours to decide whether or not you will fail over; if a fire causes irreparable damage in your data center, it is obvious that you need to fail over. After the ERT decides that the MAOT will be exceeded for critical processes, it invokes the DR plan.
Here are some guidelines to get your DR plan up and running:
Arrange an emergency meeting
Appoint an ERT Leader
Assign other roles, such as communications
Designate someone on the ERT to keep a logbook
Discuss the MAOT
Initiate recovery plans
Many organizations put emergency contact lists on laminated wallet cards. Wallet cards are very portable because they fit into a wallet, and you are more likely to have your wallet with you when a disaster strikes. Consider putting items on the card such as names, phone numbers, the URL of the disaster recovery procedure, and even spouse information. You might need more or less information than what I have listed here.
7. Developing a Recovery Strategy
The primary task of this step is to determine how you will achieve your disaster recovery
goals for each of the systems and system components that were identified. For most
organizations, the design of a recovery strategy solution is a fairly custom process. While the design principles and considerations are largely common, designers typically have to make a number of compromises.
Backup and recovery [7] are components of business continuity. Business continuity is the
term that covers all efforts to keep critical data and applications running despite any type of
interruption (including planned and unplanned). Planned interruptions include regular
maintenance or upgrades. Unplanned interruptions could include hardware or software
failures, data corruption, natural or man-made disasters, viruses, or human error. Backup
and recovery is essential for operational recovery; that is, recovery from errors that occur on a regular basis but are not catastrophic, such as data corruption or accidentally deleted files. Disaster recovery is concerned with catastrophic failures. Believe me, nothing is as interesting as a big failure, because that is the moment you actually learn something. When planning for backup and recovery, you should decide how much data loss you are willing to incur. You can use this decision to calculate how often you need to perform backups. Backups should be performed at fixed intervals.
The length of time between backups is driven by the Recovery Point Objective (RPO); that is, the maximum amount of data that you are willing to lose. You should also decide how long you are willing to wait until the data is completely restored and business applications become available. The time it takes to completely restore data and for business applications to become available is called the Recovery Time Objective (RTO). Your RTO can be different from your RPO.
After determining your recovery time and recovery point objectives, you can determine how much time you actually have to perform your backups, typically called your backup window. The backup window determines the type and level of your backups. For example, if you have a system that requires 24-hours-a-day, 7-days-a-week, 365-days-a-year availability, there is no backup window, so you would have to perform an online backup (also known as a hot backup), in which the system is not taken offline.
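The link between RPO and backup frequency is simple arithmetic: if you can tolerate losing at most a given number of hours of data, backups must run at least that often. A minimal sketch:

```python
# If the RPO allows losing at most `rpo_hours` of data, backups must be
# taken at least every `rpo_hours`, i.e. this many times per day.
def backups_per_day(rpo_hours):
    return 24 / rpo_hours

print(backups_per_day(4))   # an RPO of 4 hours means 6 backups a day
```

Each of those runs then has to fit inside the backup window; if six daily runs do not fit, the RPO, the backup type, or the window itself has to change.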
Lastly, as the number of backups increases, the space required to store them also increases. Therefore, you should consider how long you are required to retain your backups (also referred to as the data retention period) and plan for the appropriate amount of storage space. When your deployment fails, you recover it by restoring it to a previously consistent state (that is, a particular point in time) from your backups. Restoring a deployment to a particular point in time is also known as a point-in-time recovery.
7.1. Types of backup
You can choose from three different backup methods. Most backup strategies use a
combination of two or three of these methods:
Full is the starting point for all other backups and contains all the data in the folders
and files that are selected to be backed up. Because the full backup stores all files
and folders, frequent full backups result in faster and simpler restore operations.
Remember that when you choose other backup types, restore jobs may take longer.
It would be ideal to make full backups all the time, because they are the most
comprehensive and are self-contained. However, the amount of time it takes to run
full backups often prevents us from using this backup type. Full backups are often
restricted to a weekly or monthly schedule, although the increasing speed and
capacity of backup media is making overnight full backups a more realistic
proposition.
Incremental provides a faster method of backing up data than repeatedly running full
backups. During an incremental backup only the files changed since the most recent
backup are included. The time it takes to execute the backup may be a fraction of the
time it takes to perform a full backup.
Differential contains all files that have changed since the last full backup. The advantage of a differential backup is that it shortens restore time compared to restoring from a chain of incremental backups. However, the longer the time since the last full backup, the larger the differential backup grows; it might even become larger than the baseline full backup.
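The difference between the three types comes down to which reference point decides whether a file is included. A small sketch with made-up file modification times:

```python
# Illustrative file selection for the three backup types.
# Modification times are expressed as "day" numbers for simplicity.
files = {"a.txt": 1, "b.txt": 3, "c.txt": 5}   # name -> last-modified day
last_full = 2          # day of the last full backup
last_backup = 4        # day of the most recent backup of any type

full = set(files)                                               # everything
incremental = {f for f, m in files.items() if m > last_backup}  # since last backup
differential = {f for f, m in files.items() if m > last_full}   # since last full

print(full, incremental, differential)
```

Restoring from incrementals means replaying the full backup plus every incremental since; restoring from a differential needs only the full backup plus the latest differential, which is why differentials restore faster but grow over time.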
I talked about three backup types. Maybe the following one doesn't belong here, but it is a definitive copy of the original data, which makes it a backup:
Mirrored ensures your information is protected from both system and site failures. In an array, it is block-level protection, so you cannot open and navigate these files in Windows Explorer.
In EMC terms, we speak about MirrorView™. It leverages the power of EMC
CLARiiON® networked storage systems to offer both synchronous and asynchronous
remote mirroring. Whether you mirror data around the corner or across the globe,
MirrorView provides disaster recovery that protects your most critical data in the event
of an outage.
Another replication method is Symmetrix Remote Data Facility (SRDF®), which is used in EMC Symmetrix® systems. SRDF provides remote replication for disaster recovery and business continuity.
7.2. Virtualized Servers and Disaster Recovery
Traditional disaster recovery plans are often complex and difficult, largely because of bare-metal recovery. Virtualization makes life easier for us and simplifies this environment. A
virtual machine typically is stored on the host computer in a set of files, usually in a directory
created by the host for that specific virtual machine. When you protect these files using your
backup or replication software, you've protected the entire system. These files can then be
recovered to any hardware without requiring any changes because virtual machines are
hardware-independent.
Reliable disaster recovery solutions traditionally require duplicating your entire production
infrastructure and with it, your costs. With virtualization software such as VMware vSphere,
you can provide rapid and reliable recovery without requiring identical hardware. Virtual
machines can share the physical resources of a single computer while remaining completely
isolated from each other as if they were separate physical machines. If, for example, there
are three virtual machines on one physical
server and one of the virtual machines
crashes, the other virtual machines
remain available. Isolation is an important reason why availability and security in a virtual environment are superior to those of applications running on a traditional, non-virtualized system. Server consolidation also lets you slash the cost of the server infrastructure needed for both production and disaster recovery.
Virtualization is a must-have these days in
combination with disaster recovery. You
can easily test your disaster recovery plan
to ensure the highest levels of reliability
and availability of your entire IT
infrastructure.
7.3. Other thoughts
An amazing amount of work and planning is required before you push the button and begin
drafting actual recovery plans. Disaster recovery has many aspects because you may need
to recover different portions of your environment, depending on the scope and magnitude of
the disaster that strikes. Your worst-case scenario (an earthquake, tornado, flood, or
whatever sort of disaster happens in your part of the world) can render your work facility
completely damaged or destroyed, requiring the business to continue elsewhere.
But besides that, there are more business justifications for developing a recovery strategy.
– Level of attention and expertise required
– Performance impacts
– Effect of link outages
– Change Control Integration
Do you know who your experts are? The ones who can provide innovative, valuable solutions to your organization (whether internal or external)? The ones who know the jargon, products, and tools of your organization? Such experts exist at every level of your organization. They are your most valuable competitive asset and also your most scarce; their scarcity is probably the greatest single factor limiting your growth. Your experts also go home every night, and they are what you lose when they retire or move to the competition. Not many people have that specific experience and knowledge. Everything is learnable, but it takes a while; it could be years before you are back at the level you were before. So, take good care of your experts.
Most organizations do not have an exact copy of their data center, such as a fully automatic failover site. In most cases, it is more important to recover your data and mission-critical applications at the failover site than to have the same number of people working at the same time as before the disaster. Performing a recovery with less hardware has an impact on performance. Be aware that there will also be fewer people available to get behind a keyboard.
Things will not always go according to plan. That is a fact, and it is the whole reason for this article. Be prepared that things will not always go as planned in your recovery plan, and try to anticipate as much as possible what can go wrong.
If your organization works according to ITIL, you are probably using Change Management. When a disaster strikes, a lot is going on; try to fit Change Management in an appropriate manner given the circumstances.
8. Testing Recovery Plans
Traditional recovery plans are often difficult to test, difficult to keep up to date, and depend
on exact execution of complex, manual processes. In a virtualized environment, testing is
simpler because you can execute non-disruptive tests using existing resources. Hardware
independence eliminates the complexity of maintaining the recovery site by eliminating
failures due to hardware differences.
But still, your organization changes by the day: servers are added and removed, mission-critical applications may need to be added, or you may simply merge with another organization. The fact remains that changes occur every day, and these changes have an enormous impact on your DR plan. After you develop the DR plan, you need to put it through progressively more intense cycles of testing. If an organization must trust its very survival to the quality and accuracy of a DR plan, that plan has to be tested to be sure it actually works. In disasters, you rarely get second chances.
DR plans contain lists of procedures to follow when a natural or man-made disaster occurs.
The purpose of the plan is to recover the IT applications and infrastructure that support
business-critical processes. When disaster hits you, it hits hard. You seldom can clearly tell
whether those disaster plans will actually work. And given the nature of disasters, if your
disaster plan fails, the organization may not survive the disaster.
When you test your disaster plan, note anything that does not go according to plan, and then pass the plan back to the people who designed it so they can update it. This process improves the quality and accuracy of the disaster plan. Therefore, periodic, realistic testing of the recovery plan is necessary and is required for your mission to succeed.
Another consideration is whether you can test and maintain protection simultaneously: what happens if you start on Friday and are not back up and running by Monday? It is important to include this in your plan. Every time, ask yourself 'what if', and be prepared for the worst that can happen. A good start is to fragment your recovery plan into small pieces: start by destroying one server and see whether it can be restored.
9. Role of virtualization
The business world has undergone an enormous transformation over the past 20 years.
Business process after business process has been captured in software and automated,
moving from paper to electrons.
In today's world, virtually every strategic business decision has an IT implication. Market
forces continue to accelerate in every region of the world, and across every industry, putting
increasing pressure on IT departments to be more responsive and help organizations stay
competitive and pursue new opportunities at lower cost.
Virtualization is rapidly transforming the IT landscape and fundamentally changing the way
companies compute. Virtualization is the catalyst that makes IT-as-a-Service a reality. It is
the enabling technology on which cloud computing architectures are and will be built.
Whether you have virtualized all of your IT assets and applications or you are just starting
out, you are on your way to transforming to a new model for IT.
Before virtualization, IT organizations would run one application per physical server, so cost per server was a quick way to compare costs; it was a one-to-one relationship. As a result, many data centers have machines running at only 10 or 15 percent of total processing capacity; in other words, 85 or 90 percent of the machine's power is unused. It isn't rocket science to recognize that this situation is a waste of resources. But once you virtualize, many applications (each in its own virtual machine) run on each physical server; it is now a many-to-one relationship.
When a server is used to host a number of virtual machines, it is faced with much higher
levels of demand for system resources than would be presented by a single operating
system running a single application. Obviously, with more virtual machines running on the
server, there will be more demand for processing. Even with two or more processors,
virtualization can outstrip the processing capability of a traditional commodity server.
Also, with more virtual machines on the server, there will be far higher storage and network traffic, as each virtual machine transmits and receives as much data as would be demanded by a single operating system in the old "one application, one server" model.
Furthermore, because virtualization makes hardware robustness more important, most IT organizations seek to avoid so-called Single Point of Failure (SPoF) situations by implementing redundant resources in their servers: multiple network cards, multiple storage cards, extra memory, and multiple processors, all doubled or even tripled in an effort to avoid a situation where a number of virtual machines stall due to the failure of a single hardware resource.
9.1. Role of VMware
As virtualization is now a critical
component of an overall IT strategy, it
is important to choose the right
vendor. VMware is the leading business virtualization infrastructure provider, offering a trusted and reliable platform for building IT infrastructures as well as private and public clouds.
VMware [6] stands alone as a leader.
While challengers like Microsoft and
Citrix are emerging, VMware has a
tremendous head start in this market.
It is clearly ahead in understanding
the market, and is ahead in product
strategy, business model, and
technology innovations.
Why VMware?
Has been built on a robust, reliable foundation over many years
Delivers a complete virtualization platform, from desktop through the data center out
to public clouds
Provides the most comprehensive virtualization and cloud management
Integrates with your overall IT infrastructure
Is proven by more than 190,000 customers
VMware has invested in technologies to achieve very high virtual machine density on
VMware vSphere. As of 2010, VMware supports more guest operating systems than any other bare-metal virtualization platform. The superior performance of VMware vSphere with unmodified (fully virtualized) guests, made possible by VMware's exclusive binary translation
technology, means that VMware vSphere can run off-the-shelf operating systems with near-
native performance. No other virtualization platform achieves the high virtual machine density
of VMware vSphere and still maintains consistent, high application performance across all
running virtual machines.
With VMware you can lower your operational costs. You can directly reduce your operational
costs by using the dynamic IT services built into VMware vSphere that most other
competitors do not offer.
The most common, for example, are [8]:
High Availability (HA)
VMware HA provides uniform, cost-effective failover protection against hardware and operating system failures within your virtualized IT environment.
Distributed Resource Scheduler (DRS)
VMware DRS continuously balances computing capacity in resource pools to deliver the performance, scalability, and availability not possible with physical infrastructure.
vMotion
VMware vMotion uses VMware's cluster file system to control access to a virtual machine's storage. During a vMotion, the active memory and precise execution state of a virtual machine are rapidly transmitted over a high-speed network from one physical server to another, and access to the virtual machine's disk storage is instantly switched to the new physical host. Since the network is also virtualized by the VMware host, the virtual machine retains its network identity and connections, ensuring a seamless migration process.
Storage vMotion
VMware Storage vMotion is a state-of-the-art solution that enables you to perform live migration of virtual machine disk files across heterogeneous storage arrays with complete transaction integrity and no interruption in service for critical applications.
Site Recovery Manager (SRM)
VMware vCenter Site Recovery Manager eliminates complex manual recovery steps and removes the risk and worry from disaster recovery.
Fault Tolerance (FT)
VMware Fault Tolerance provides continuous availability for applications in the event of server failures by creating a live shadow instance of a virtual machine that runs in virtual lockstep with the primary instance.
Find more at http://www.vmware.com/products/
VMware is the proven choice for virtualization from the desktop to the data center. Small and
midsize businesses run on VMware. More than 190,000 customers of all sizes, including all
of the Fortune 100, trust VMware as their virtualization infrastructure platform. That must
mean something!
9.2. Role of EMC
The digital universe is still growing, even during a global economic downturn. The creation
and replication of digital information set a record in 2009 by growing to 800 billion gigabytes,
more than 60% over the previous year. People continue to take pictures, send e-mail, blog,
and post videos. Organizations are still adding information. Governments are still requiring
more information to be kept. And that's only the beginning of what's to come.
That's nice for business, and undoubtedly so for storage vendors. But it's not just about storing data. It's more about innovation, protection, optimization, and leveraging information.
In 2003, EMC, the world leader in information storage and management, acquired VMware.
Joe Tucci, EMC President and CEO, said, "Customers want help simplifying the
management of their IT infrastructures. This is more than a storage challenge. Until now,
server and storage virtualization have existed as disparate entities. Today, EMC is
accelerating the convergence of these two worlds." Was he wrong?
I have the privilege to work with nice things related to EMC and VMware every day. And it is
just amazing how easy things are to integrate. Let me give you a great example of the
products EMC builds in relation to VMware.
EMC Unified Storage vCenter plug-in
This plug-in is a must-have in combination with vSphere. With EMC's second-generation vCenter plug-in family (Virtual Storage Integrator, the CLARiiON plug-in, and the Celerra® NFS plug-in), EMC gives VMware administrators the ability to simplify visibility, provisioning, and management of EMC storage through the VMware lens. From VMware vCenter, administrators can leverage array functions to increase efficiency in their VMware environment and hardware-accelerate VM deployment.
Download: http://www.mikes.eu/download/EMC Plug-in for VMware vCenter.pdf
Integration is good. EMC offers direct integration and management of its systems from VMware's management suite by making use of APIs. EMC and VMware integration makes things simpler and more efficient. Without discussing the products in depth, I don't want to keep information from you, so they are shown in the table below.
Product Families [9]
Hardware
Celerra, Explained here
Bring powerful, high-availability unified storage to your organization in convenient integrated
models and flexible gateways. All are easy to deploy and manage. Plus, simplify management
with powerful software.
CLARiiON, Explained here
Get the high availability, scalability, and flexibility you need to manage and consolidate more
data. Combine easy-to-use midrange networked storage with innovative technology and
robust software capabilities.
Connectrix®, Explained here
Move your organization's vital information where it needs to go—quickly, easily, and reliably.
Advanced directors and switches make it happen. Get best-in-class availability and easy
management.
Centera®, Explained here
Store and manage your "fixed content"—unchanging digital assets—and keep them available
online and accessible. All with EMC Centera content-addressed storage (CAS) systems. Be
ready for growth with petabyte scalability.
Iomega, Explained here
Store, protect, and share your valuable data with reliable and easy-to-use storage solutions for
home and small business.
Symmetrix, Explained here
Make high-end networked storage part of your information infrastructure with systems that
take performance, availability, and security to new heights. Manage and protect your
information today and expand in the future.
VPLEX™
Deploy next-generation architecture to enable simultaneous information access within,
between, and across data centers.
Software
Atmos™
Build your own cloud services or leverage a public cloud to deliver content and information
services anywhere in the world with EMC Atmos.
Ionix™
Simplify and automate key tasks—such as discovery, monitoring, reporting, planning, and
provisioning—for even the largest, most complex storage environments.
PowerPath®
Host-based solutions including multipathing, data migration, and host-based encryption.
9.3. Role of VMware Site Recovery Manager
The beautiful part of VMware Site Recovery Manager (SRM) is that you can test a plan
without running it live. With SRM you can fail over at any time without damaging the
infrastructure environment.
SRM [8] provides business continuity and disaster recovery protection for virtual
environments. Protection can extend from individual replicated data stores to an entire virtual
site. VMware's virtualization of the data center offers advantages that can be applied to
business continuity and disaster recovery:
- The entire state of a virtual machine (memory, disk images, I/O, and device state) is
encapsulated. Encapsulation enables the state of a virtual machine to be saved to a
file, which in turn allows the transfer of an entire virtual machine to another host.
- Hardware independence eliminates the need for a complete replication of hardware
at the recovery site. Hardware running VMware ESX at one site can provide business
continuity and disaster recovery protection for hardware running VMware ESX at
another site. This eliminates the cost of purchasing and maintaining a system that sits
idle until disaster strikes.
- Hardware independence allows an image of the system at the protected site to boot
from disk at the recovery site in minutes or hours instead of days.
SRM leverages array-based replication between a protected site and a recovery site. The
workflow that is built into SRM automatically discovers which datastores are set up for
replication between the protected and recovery sites. SRM can be configured to support
bi-directional protection between two sites.
SRM provides protection for the operating systems and applications encapsulated by the
virtual machines running on VMware ESX. An SRM server must be installed at the protected
site and at the recovery site. The protected and recovery sites must each be managed by
their own vCenter Server.
Implementing an SRM solution is almost "too easy". But as you have read so far, it is not only
about the software you are using. The software, which will make your life a lot easier, is not
the most important piece of the puzzle. Keep thinking about the first 8 chapters of this article,
which are more important than the software.
VMWARE IS A TRUE ENABLER FOR DISASTER RECOVERY
10. VMware Site Recovery Manager
Downtime is expensive! Disaster preparedness and recovery planning is an iterative process,
not a one-time event. You need to continually revisit disaster recovery plans to ensure they
remain aligned with current business goals and test those plans regularly to ensure that they
perform as planned.
VMware Site Recovery Manager [8] provides business continuity and disaster recovery
protection for virtual environments. In a Site Recovery Manager environment, there are two
sites involved, a protected (primary) site and a recovery (secondary) site. Protection groups
that contain protected virtual machines are configured on the protected site and these virtual
machines can be recovered by executing the recovery plans on the recovery site. The
illustration below depicts how it operates at a very high level.
Site Recovery Manager uses a database on both protected and recovery sites to store
information. The protected site Recovery Manager database stores data regarding the
protection group settings and protected virtual machines, while the recovery site Recovery
Manager database stores information on recovery plan settings.
VMware Site Recovery Manager changes the way disaster recovery plans are designed and
executed by involving two simple steps: protection and recovery.
Protection involves the following operations:
- Array manager configuration
- Inventory mapping
- Creating a protection group
Recovery involves the following operations:
- Creating a recovery plan
- Test recovery
- Real recovery
The vCenter Server must be installed at both the protected site and recovery site, as well as
an SQL Server or Oracle Database server.
See Site Recovery Manager Compatibility Matrixes documentation for a list of
supported servers and databases.
Each site has an inventory of virtual machines that reside on array-based replicated LUNs
(logical unit numbers), which are disk volumes in a storage array that are identified
numerically. Before installing SRM, install the Storage Replication Adapter (SRA) for your
storage and storage replication environment. The SRA is software that ensures integration of
your storage device with SRM. Because SRM interacts with arrays from a variety of storage
vendors, consult the documentation that your storage vendor provides for array-specific
information used during SRM installation and configuration. The SRAs that storage vendors
have created for Site Recovery Manager can be downloaded from the vmware.com website.
See Site Recovery Manager Storage Partner Compatibility Matrixes for a list of
supported SRAs.
Optimally SRM is installed bi-directionally, so that each site serves as a recovery site for the
other. The two sites should be a significant geographic distance from each other. The
protected and recovery sites must be in a networked configuration that allows TCP
connectivity. Each site consists of a vCenter Server, which is a Windows machine that runs
the vCenter service. Installed with each vCenter Server is the SRM Server. The SRM Server
hosts Site Recovery Manager and array management technology. It also serves the SRM
plug-in to the VI Client. Management is done from the vCenter client on the protected site.
SRM uses block-based replication with SRAs installed on the SRM Server. This integration
of hardware and software supports the most demanding application business continuance
needs, in this case, a failover following a disaster.
Replication, Replication, Replication - Technology
SRM only works properly with a replication technology. Data replication, however, is a
growing challenge. Working to achieve higher levels of data availability, storage
administrators increasingly create multiple copies of business-critical data to quickly recover
from disasters. As data centers attempt to maintain data availability in the event of local
catastrophes while globally servicing customers, multiple copies of data must also be
efficiently distributed and synchronized to other data centers.
There are several replication techniques that can be used with VMware SRM; there is a
compatibility matrix of supported vendors. The strengths that SRM delivers are the ability to:
- Remove manual recovery complexity through automation
- Provide central management of recovery plans and protection groups
- Simplify and automate disaster recovery workflows
Replication in combination with VMware and EMC comes in a few flavors, such as:
EMC SRDF [9]
EMC Symmetrix Remote Data Facility (SRDF) provides remote replication for disaster
recovery and business continuity.
Download: http://www.emc.com/products/detail/software/srdf.htm
EMC MirrorView [9]
EMC MirrorView ensures your information is protected from both system and site failures. It
leverages the power of EMC CLARiiON networked storage systems to offer both
synchronous and asynchronous remote mirroring.
Download: http://www.emc.com/products/detail/software/mirrorview.htm
EMC Celerra Replicator [9]
EMC Celerra Replicator provides efficient, asynchronous data replication over Internet
Protocol (IP) networks.
Download: http://www.emc.com/products/detail/software/celerra-replicator.htm
EMC RecoverPoint [9]
EMC RecoverPoint brings you continuous data protection and continuous remote replication
for on-demand protection and recovery to any point in time. RecoverPoint's advanced
capabilities include policy-based management, application integration, and bandwidth
reduction.
Download: http://www.emc.com/products/detail/software/recoverpoint.htm
Plans
The next steps involve making plans to configure Site Recovery Manager. Recovery plans
are created and managed directly from vCenter and are both powerful and easy to build. Site
Recovery Manager provides an intuitive interface to help users create recovery plans for
different failover scenarios and different parts of their infrastructure. Users can specify virtual
machines to be suspended or shut down. They can also specify the order in which virtual
machines are powered on or shut down, set user-defined scripts to execute automatically,
and determine where to pause the recovery process if necessary. These steps are not
detailed here, as they are beyond the scope of this article; refer to VMware and storage
vendor documentation for additional details. There is also much useful information in the
VMware communities.
Basically it comes down to this:
- Deploy Site Recovery Manager (SRM) at both the protected and recovery sites.
- Install the Storage Replication Adapters (SRA) on the same server as SRM on
both the protected and recovery sites. Install the SRM plug-in on the protected
and recovery vCenter servers.
- Set up connections between the protected and recovery sites.
- Configure the Array Manager so that SRM knows about the storage arrays.
- Create one or more Protection Groups that contain the replicated LUNs and the
associated virtual machines that host the mission-critical applications.
- Create a Recovery Plan associated with a Protection Group, so that in the event
of a failover the recovery site knows the relationship between virtual machines
and the failed-over storage.
- Run a test failover to verify functionality.
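To make the concepts concrete, here is a minimal sketch that models protection groups, priority-ordered recovery plans, and the difference between a test and a real failover. This is an illustrative model only, not the actual SRM API; all class names, method names, and values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualMachine:
    name: str
    priority: int = 3          # lower number = powered on earlier

@dataclass
class ProtectionGroup:
    """Groups a replicated LUN with the virtual machines that live on it."""
    replicated_lun: str
    vms: list = field(default_factory=list)

@dataclass
class RecoveryPlan:
    """Associates a protection group with an ordered recovery workflow."""
    name: str
    group: ProtectionGroup

    def run(self, test_mode=True):
        """Return the ordered recovery steps.

        In a test, SRM works from a writable snapshot of the replica on an
        isolated network, so production replication is left untouched; in a
        real failover the replica itself is promoted.
        """
        source = "snapshot of replica" if test_mode else "promoted replica"
        steps = [f"present {source} of {self.group.replicated_lun}"]
        for vm in sorted(self.group.vms, key=lambda v: v.priority):
            steps.append(f"power on {vm.name} (priority {vm.priority})")
        return steps

# Example: the database must come up before the application server.
pg = ProtectionGroup("LUN-42",
                     [VirtualMachine("app01", 2), VirtualMachine("db01", 1)])
plan = RecoveryPlan("CRM failover", pg)
for step in plan.run(test_mode=True):
    print(step)
```

Running the example prints the snapshot presentation step first, then powers on db01 before app01 because of its lower priority number, mirroring the ordered power-on behavior described above.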
11. Standardization
A good start is to ask: what are standards? According to 'search and Google', a standard is a
definition or format that has been approved by a recognized standards organization or is
accepted as a de facto standard by the industry.
Standards exist for programming languages, operating systems, data formats,
communications protocols, and so forth.
Standards are extremely important in the computer industry because they allow the
combination of products from different manufacturers to create a customized system. Without
standards, only hardware and software from the same company could be used together. In
addition, standard user interfaces can make it much easier to learn how to use new
applications.
A lot of organizations are committed to an open, standards-based approach to
interoperability so that customers can implement solutions that meet their individual needs.
It's important to create a policy covering the basic concepts of standardization: stability,
future-proofing, controlled innovation, and security are essential.
VMware is committed to an open, standards-based approach to licensing and interoperability
so that customers can implement virtualization-based solutions that meet their individual
needs. Whether you have virtualized all of your IT assets and applications or you are just
starting out, you are on your way to transforming to a new 'standard' model for IT.
12. Conclusion
We started with this sentence, and now we end with it: "Information is the organization's most
important asset."
Given that, the information must be protected. We must look carefully at which information
we protect, because there is no point in protecting your entire infrastructure environment. You
must classify your data; otherwise, everything gets protected the same way. Without
classification everything is important, and you don't want that.
When disaster strikes, it hurts, one way or the other. If a disaster hits an organization without
a disaster recovery plan, that organization has very little chance of recovery. Organizations
that do have DR plans may still have a difficult time when a disaster strikes. You may have to
put in considerable effort to recover time-sensitive critical business functions. But if you have
a disaster recovery plan, you have a chance at survival.
It is a common misconception that most of the threats to continuity are a result of natural
disaster. Statistically, these threats account for less than 1% of IT service unavailability. This
finding indicates that you should mainly focus on other things than just natural disasters.
Doing nothing isn't an option because it can damage your company in many ways. For
example:
- Financial/cash flow/revenue loss
- Legal/regulatory consequences
- Life-threatening issues, in hospitals for example
- Reputation damage
A good disaster recovery plan is like an information insurance policy for a business. A
disaster recovery plan provides the ability to continue work after any number of catastrophic
problems, ranging from natural disasters such as floods, fires, and earthquakes to
planned or unplanned scenarios like database corruption, server failures, or simply human
error. Disaster recovery is becoming an increasingly important aspect for an organization.
Besides the fact that a disaster recovery plan is a must-have for the survival of your
organization, it has further benefits, such as improved business processes, improved
technology, fewer disruptions, higher quality services, and competitive advantages.
The maximum length of time a business function can be discontinued without causing
irreparable damage to the business is called the Maximum Tolerable Downtime (MTD). This
value must fall within the MAOT, which is given by management. After setting targets for
MTD, you must set targets for your Recovery Point Objective (RPO) and Recovery Time
Objective (RTO) for each process. You need these when disaster strikes: they let you give
a guarantee of how much data may be lost and how long it will take before you are back online.
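As a minimal illustration of how these targets interact, the sketch below checks whether a design meets its MTD, RTO, and RPO. The function name and all figures are invented for the example; real targets come from your Business Impact Analysis.

```python
def meets_targets(mtd_h, rto_target_h, rpo_target_min,
                  achievable_rto_h, replication_interval_min):
    """The achievable RTO must fit inside both its target and the MTD;
    the worst-case data loss equals the replication interval and must
    fit inside the RPO target."""
    rto_ok = achievable_rto_h <= min(rto_target_h, mtd_h)
    rpo_ok = replication_interval_min <= rpo_target_min
    return rto_ok and rpo_ok

# A process with MTD 8 h, RTO target 4 h, RPO target 15 min:
# failover takes 2 h and the array replicates every 5 min -> targets met.
print(meets_targets(8, 4, 15, 2, 5))          # True
# A nightly backup alone (up to 24 h of data loss) misses the RPO.
print(meets_targets(8, 4, 15, 2, 24 * 60))    # False
```

The second call fails only on the RPO check, which is exactly the kind of gap that data classification and the BIA from the earlier chapters are meant to expose before a disaster does.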
Make sure you have an Emergency Response Team ready. This ERT is a group of people
prepared for any emergency or big incident, such as a natural disaster or an interruption of
business operations. Emergency Response Team members typically are trained and
prepared to fulfill the roles required by the specific situation. Ideally the team has already
defined a protocol or set of actions to perform to mitigate the negative effects of the incident.
Traditional disaster recovery plans are often very complex and difficult. As virtualization is
now a critical component to an overall IT strategy, it is important to choose the right vendor.
Avoid unnecessary risk and overhead when choosing a robust and production-proven
hypervisor for your virtualized datacenter.
Not all hypervisors are equal. VMware has a true enabler for disaster recovery named
VMware Site Recovery Manager (SRM). VMware Site Recovery Manager is a business
continuity and disaster recovery solution that helps you plan, test, and execute a scheduled
migration or emergency failover of datacenter services from one site to another. As
mentioned in the introduction, you can test a recovery plan without ruining anything: you
can fail over at any time without damaging the infrastructure environment. Virtualization
these days can make disaster recovery implementations easy.
As the leader goes, so goes the organization. A disaster recovery plan needs executive
sponsorship; without it, a disaster recovery plan is not feasible. The executives are
responsible for making decisions relating to an organization's direction, strategy, and
financial commitment, and they approve the financing of hardware or software purchases.
Finally, the executive sponsorship role is needed to make decisions about a company's
policies, procedures, and strategic directions. Ensure that the plan has the attention of
executive management; when it does, it is more broad-based and more likely to succeed.
Disaster recovery and business continuity are extremely complex. This is often the reason
why companies hold back on a recovery strategy. What I try to convey with this article is
that we should not make disaster recovery too complicated. We can, but it isn't necessary.
The most important issue is that data is protected and that we can return this data quickly
to the organization. Surely we must consider risks and do everything to prevent them, but
this should not be your main concern. Your concern is to return to daily business as quickly
as possible.
Virtualization is a true enabler to recover after a disaster. Costs are relatively low and it is
very easy to integrate this into your infrastructure.
References
[1] IT Disaster Recovery Planning for Dummies, by Peter Gregory
[2] EMC Information Availability Design and Management course
[3] Source: Hank Marquis (2006), http://www.hankmarquis.com/articles.html
[4] http://dependencymapping.com/
[5] http://en.wikipedia.org
[6] Gartner RAS Core Research Note G00200526, Thomas J. Bittman, Philip Dawson,
George J. Weis, 26 May 2010, Magic Quadrant for x86 Server Virtualization Infrastructure
[7] EMC® Documentum® Content Server Backup and Recovery White Paper version 6.5,
Published January 2010
[8] VMware, http://www.vmware.com
[9] EMC, http://www.emc.com
Disclaimer: The views, processes, or methodologies published in this article are those of the
author. They do not necessarily reflect EMC Corporation's views, processes, or
methodologies.
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.