how to start your disaster recovery in this “cloudy” landscape
EMC Proven Professional Knowledge Sharing 2011
Roy Mikes, Storage and Virtualization Architect, Mondriaan, [email protected]
Table of Contents
About This Article
Who Should Read This Document?
Introduction
1. What is a Disaster?
2. What is a Disaster Recovery Plan (DR plan)?
2.1. Other benefits of a Disaster Recovery Plan
3. Business Impact Analysis (BIA)
3.1. Maximum Tolerable Downtime (MTD)
3.2. Recovery Time Objective (RTO)
3.3. Recovery Point Objective (RPO)
4. Data Classification
5. Risk Assessment
5.1. Component Failure Impact Analysis (CFIA)
5.2. Identifying Critical Components
5.2.1. Personnel
5.2.2. Systems
5.3. Dependencies
5.4. Redundancy
6. Emergency Response Team (ERT)
7. Developing a Recovery Strategy
7.1. Types of backup
7.2. Virtualized Servers and Disaster Recovery
7.3. Other thoughts
8. Testing Recovery Plans
9. Role of virtualization
9.1. Role of VMware
9.2. Role of EMC
9.3. Role of VMware Site Recovery Manager (SRM)
10. VMware Site Recovery Manager
11. Standardization
12. Conclusion
References
About This Article
Despite our best efforts and precautions, disasters of all kinds eventually strike an organization, usually unanticipated and unannounced. Natural disasters such as hurricanes, floods, or fires can threaten the very existence of an organization. Well-prepared organizations establish plans, procedures, and protocols to survive the effects that a disaster may have on continuing operations and to facilitate a speedy return to working order.
Continuity planning and recovery planning are two separate sets of preparations for restoring and recovering critical business operations in the event of such disasters. My focus in this article is recovery planning.
This article should help you understand the need for Business Continuity Management and Disaster Recovery Planning in relation to a working failover plan. Because the subject is not purely technical, this article covers most of the non-technical discussions related to Disaster Recovery Planning. After reading this document, I think you can make a good start.
As such, this material is probably most useful to those with little or no familiarity with the topic. Readers who fall into this category will be well served by reading this document.
Who Should Read This Document?
This article is written for IT professionals who are responsible for defining the strategic direction of protecting data in their data center(s).
These include:
Storage Administrators
Operational and middle-level managers
Business Managers
IT managers and executives (CIO, Chief Information Officer)
Organizations and individuals who have the same interests should read this article as well.
Where do you start with Disaster Recovery Planning? That often remains a difficult story.
My goal is to give a general guideline that provides insight into Disaster Recovery Planning, and that should not be too difficult to read.
Introduction
Let's start with a simple quote: "Information is the organization's most important asset."
Data is created by applications and is processed to become information. Information is undoubtedly the most important asset of an organization. Does this make sense?
Absolutely! The digital footprint of each person on this planet is growing. In a sense, it does not matter whether we store data as a person or as a corporation; it has to be protected. For some people, photos are just as important as a company's ERP system. It is not for nothing that storage vendors put so much energy into managing this information.
From a Disaster Recovery perspective, the world is divided into two types of businesses: those that have DR plans and those that don't. If a disaster strikes an organization in each category, which do you think will survive?
When disaster strikes, organizations without DR plans have an extremely difficult road
ahead. If the business has any highly time-sensitive critical business processes, that
business is almost certain to fail. If a disaster hits an organization without a DR plan, that organization has very little chance of recovery. And it's certainly too late to begin planning then.
Organizations that do have DR plans may still have a difficult time when a disaster strikes. You may have to put in considerable effort to recover time-sensitive critical business functions. But if you have a DR plan, you have a fighting chance at survival.
Does your organization have a disaster recovery plan today? If not, how many critical, time-sensitive business processes does your organization have? Many organizations think they have a DR plan. They think they have some procedures, and that is all it takes. True, you need procedures, but you also need to be sure that you can actually fail over. How do you manage that? Personally, I think testing a failover live can do more damage than the certainty it provides is worth. I can take a guess, but no one actually knows for sure how many organizational changes occur. Many organizational infrastructures change every hour. Try to keep your DR plan current when things change that fast. Where does that leave you? Good question. You probably test your failover once per year, maybe twice, or perhaps each quarter. How much do you think has changed since the last time you performed that failover? This is a considerable challenge.
Luckily for you, there are many techniques and solutions, such as "clouds", where DR is probably already well organized, or VMware Site Recovery Manager (SRM), which can help you with your failover. VMware SRM is a business continuity and disaster recovery solution that helps you plan, test, and execute a scheduled migration or emergency failover of data center services from one site to another. But the most beautiful part of SRM is that you can test a plan without running it live. Wow! I can actually fail over anytime without doing damage to the infrastructure environment? True! Virtualization these days can make Disaster Recovery implementations easy. Think not only public clouds but also private ones. Private clouds can have a huge positive impact and synergy. How many of you are looking for partnerships, or serve as each other's failover site? That makes 1+1=3. But take it easy. Don't press <Enter> too soon. There is a lot to consider before taking this road.
Depending on the nature of your business, good disaster recovery is achieved by designing
a process which enables your operations to continue to work, perhaps from a different
location, with different equipment, or from home, making full use of technology to achieve a
near seamless transition that is all but invisible to your customers and suppliers. Insurance
can mitigate the cost of recovery, but without a disaster recovery plan that gets you back up
and running, you could still go under. Indeed, more than 70% of businesses that don't have a
DR plan fail within 2 years of suffering a disaster.
So what's next? Certainly a lot! But don't make life too difficult. There will always be one or more single points of failure. You should ask yourself whether the costs are worth five nines (99.999%) of availability. The primary task and next step is to determine how you will achieve your Disaster Recovery goals for each of the systems and system components, to ensure that the critical, time-sensitive business processes continue working. This is the point at which it becomes important to consider exactly what types of disasters you need to prepare for, and to classify them by the extent and type of impact they have.
1. What is a Disaster?
You may argue with me about the definition of a disaster, because there is more than one definition. To some, anything that doesn't go according to their schedule or plans is a disaster. On a personal level, a fire in our house could be considered a disaster. In most cases, one broken server isn't a disaster, but many broken servers are. However, it is important to understand the difference between these kinds of events and a 'true' disaster. This will allow you to keep things in perspective when making your own disaster plans.
Should your company experience a disaster, the first 48 hours following the disaster will be
the most critical in your recovery efforts. How you respond during that period will determine if
your business will survive. Furthermore, the most important hour is the one immediately
following the event.
A disaster is defined as an event causing great loss, hardship, or suffering to many
organizations. When we think of this kind of event we usually think of catastrophic events
such as hurricanes, earthquakes, floods, fires, and even man-made disasters. In situations
like this, help may be unavailable because rescuers may be in the same predicament as you,
and it could take a considerable length of time for help to arrive.
Disaster preparedness is the sensible thing to do. It doesn't need to be expensive and it can
save your business! In these situations we are not talking about losing server cooling or
power for a few hours; we are talking about losing essential services, data, or information,
under extreme circumstances, for a prolonged period of time.
Disaster recovery is becoming an increasingly important aspect of enterprise computing. As
devices, systems, and networks become ever more complex, there are simply more things
that can go wrong. As a consequence, recovery plans have also become more complex.
It is a common misconception that most of the threats to continuity are a result of natural
disaster. To the contrary, statistically, these threats account for fewer than 1% of IT service
unavailability.
2. What is a Disaster Recovery Plan (DR plan)?
A good Disaster Recovery Plan (DR plan) is like an information insurance policy for a business. A DR plan documents the ability to continue work after any number of catastrophic problems, ranging from natural disasters such as floods, fires, and earthquakes to planned or unplanned scenarios such as database corruption, server failures, or simply human error.
Often a DR plan is confused with a Business Continuity Plan (BCP). Both address events that make the continuation of normal functions impossible, but a DR plan is the IT-related part of its big brother, the Business Continuity Plan. I am not going to talk about the Business Continuity Plan. Instead, we are sticking with the DR plan.
A DR plan consists of the precautions taken so that the effects of a disaster will be minimized
and the organization will be able to either maintain or quickly resume mission-critical
functions. Typically, DR planning involves an analysis of business processes and continuity
needs; it may also include a significant focus on disaster prevention.
2.1. Other benefits of a Disaster Recovery Plan
Besides the obvious readiness to survive a disaster, organizations can realize several other benefits from DR planning [1]:
Improved business processes: Business processes undergo continuous analysis and
reviews; there are always areas for improvement.
Improved technology: Often, you need to improve IT systems to support recovery
objectives that you develop in the disaster recovery plan. The attention you pay to
recoverability also often leads to making your IT systems more consistent with each
other and, hence, more easily and predictably managed.
Fewer disruptions: As a result of improved technology, IT systems tend to be more
stable than in the past. Also, when you make changes to system architecture to meet
recovery objectives, events that used to cause outages no longer do so.
Higher quality services: Improved processes and technologies improve services, both
internally and to customers and supply-chain partners.
Competitive advantages: Having a good DR plan gives an organization bragging rights that may outshine competitors. Price isn't necessarily the only point on which companies compete for business. A DR plan allows a company to also claim higher availability and reliability of services.
3. Business Impact Analysis (BIA)
Although a full DR plan takes many months or even longer to complete, a good first step for an individual DR plan is mapping out the most critical aspects of day-to-day business in your company. Data safety is perhaps one of the most crucial and overlooked aspects of disaster recovery. [2]
A Business Impact Analysis (BIA) is a detailed inventory of the critical processes, systems, and people that are associated with an organization's primary business activities. If you have never done a Business Impact Analysis, it can seem to be one of the most difficult tasks. There always seem to be a lot of questions about what should and should not be included in the BIA.
The purpose of a BIA is to identify which business units, operations, and processes are essential to the survival of the business. Of course, there is no standard BIA; it differs per organization. Basically, there are two areas to discover:
1. Determine the most critical business areas, often referred to as mission-critical
applications. We will cover this later.
2. For each business area, determine the sub-business processes and identify the
processes which are essential to the operation of the business, often referred to as
business-critical. We will cover this one later also.
After you have a clear view of which processes are critical for your business (and don't take this lightly), management should estimate the maximum downtime that can be tolerated. Management should determine the longest period of time that a critical process can be disrupted. This figure is known as the Maximum Tolerable Downtime (MTD). You may measure an MTD in hours or days. These are often the most difficult answers to get.
After you complete the MTD and risk analysis for each critical business process, you need to
condense the detailed information to a simple spreadsheet so you can see all the business
processes on one page, along with their respective MTD and risk figures. Try to see the big
picture here.
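The condensation step described above can be sketched in a few lines. The process names, MTD figures, and risk ratings below are purely hypothetical examples, not taken from any real BIA:

```python
# Condense detailed BIA findings into a one-page overview:
# each row holds a business process, its Maximum Tolerable
# Downtime (MTD), and a simple risk rating (1 = low, 5 = high).
# All names and figures are hypothetical examples.

bia_rows = [
    ("Order entry",  4, 5),   # process, MTD in hours, risk
    ("E-mail",      24, 3),
    ("Payroll",     72, 2),
    ("Intranet",   168, 1),
]

def summarize(rows):
    """Sort processes by MTD so the most time-sensitive come first."""
    return sorted(rows, key=lambda r: r[1])

for process, mtd_hours, risk in summarize(bia_rows):
    print(f"{process:<12} MTD: {mtd_hours:>4} h   risk: {risk}")
```

Listing every process with its MTD and risk figure on a single, sorted page is exactly what makes the big picture visible.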
Because there is a potential risk for downtime regarding these critical processes, we cannot
ignore the major consequences. These consequences are related to objectives of the
business.
The objectives of the business impact analysis (BIA) include:
Financial/Cash Flow/revenue loss
Legal/Regulatory
Life-threatening issues in hospitals, for example
Reputation
And so on…
(There are many more, depending on your type of organization)
3.1. Maximum Tolerable Downtime
For each process in the BIA, you need to determine its Maximum Tolerable Downtime
(MTD), which is the time after which the process being unavailable creates irreversible (and
often fatal) consequences. Generally, exceeding the MTD leads to severe damage to the
viability of the business, including the actual failure of the business. Depending on the
process, you can express the MTD in hours or days.
3.2. Recovery Time Objective (RTO)
After you determine the MTD for processes, you can begin setting targets for recovery. One important target is the Recovery Time Objective (RTO). RTO is the period of time required to return an application or process to a working state after a downtime situation. For any given process, the RTO is less than the MTD. By definition, it has to be. If you set a 5-day RTO for a process with a 2-day MTD, your business has failed before you can get the critical process running again. And what's the point of that? A process's RTO forms the basis for any DR planning that you'll do for that process.
For example, if a process has a 30-day RTO, you can get it running again—purchase a new
server, install software, and restore backup data—at a leisurely pace. However, a process
with a one-hour RTO requires a hot site with a standby server and data replication in near-real time. The costs for these two scenarios vary greatly.
3.3. Recovery Point Objective (RPO)
The Recovery Point Objective (RPO) represents the point in time to which data must be restored: you may, for example, lose the transactions of the last 5 minutes, hour, or day. It represents the risk of a permanent loss of some part of your data.
Assume that an organization wants to establish a 5-hour RPO for an order entry system. To meet this figure, the organization has to implement a mechanism to back up or replicate transaction data so that it loses no more than 5 hours of transactions in a disaster scenario.
Similar to the RTO, setting the RPO determines what sort of measures you need to take to ensure that you don't lose information related to any particular business process. Speed costs.
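The arithmetic behind an RPO can be sketched, assuming a simple periodic backup or replication scheme: in the worst case, disaster strikes just before the next run, so the maximum data loss equals one full backup interval. The intervals below are hypothetical; the 5-hour RPO mirrors the order entry example:

```python
# Worst-case data loss under periodic backup equals the backup
# interval: a disaster just before the next run loses everything
# since the previous one. Figures are hypothetical examples.

def worst_case_loss_hours(backup_interval_hours: float) -> float:
    """Maximum amount of transaction data (in hours) that can be lost."""
    return backup_interval_hours

def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """True if the backup schedule satisfies the RPO."""
    return worst_case_loss_hours(backup_interval_hours) <= rpo_hours

print(meets_rpo(4, rpo_hours=5))    # backing up every 4 hours: True
print(meets_rpo(24, rpo_hours=5))   # a daily backup loses too much: False
```

A tighter RPO therefore forces a shorter backup or replication interval, which is where "speed costs" comes from.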
4. Data Classification
Every business requires certain applications that it uses every day to run the business. These applications become assets and are incorporated into the business. Critical information is deployed on these assets, and the assets are provided to each employee.
In this chapter I will go through five classes of applications. It is important that the data in each organization is analyzed and classified in order to develop a recovery strategy. You must classify your data; otherwise, everything gets protected the same way. Without classification everything is important, and you don't want that. Unless you own a gold mine!
Let's face it; most employees are aware of which applications or processes are important for their organization. But not everything is, or should be, important, and that's why we are going to classify these types of data.
EMC's application matrix provides a requirement-driven, five-category methodology for mapping technology solutions to critical applications. Criticality-ranking each application within the matrix dictates the need for disaster recovery and the method used to protect the data.
Class 1 - Mission Critical
Class 2 - Business Critical
Class 3 - Business Important
Class 4 - Productivity Important
Class 5 - Non-critical
Let's go through them.
Mission Critical applications are applications necessary for the company to perform its mission. Downtime of these applications has a significant impact on revenue.
Business Critical applications are applications that increase productivity. These are the applications that usually support mission-critical applications. After a major disaster, these should be the second set of applications restored. Downtime of these applications also has an impact on revenue.
Business Important applications are also applications that increase productivity, and they support applications that are not critical. These are 'third-rate' applications.
Productivity Important applications are departmental applications, rather than company-wide ones. These will only affect the productivity of their departments.
Non-critical applications have a minor impact on productivity. They are too personal to be recognized in times of crisis.
Please note that these are guidelines. Within an organization, the importance of an application may differ per department. But beware of losing yourself in discussions of each application's importance. It eventually happened to me! All of these departments declaring the importance of their applications comes at a price. But hey... as long as somebody is willing to pay for it, no harm is done and you have created a new challenge.
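As a minimal sketch, the five classes can be captured in a small lookup table so that an application's class immediately dictates its protection method. The application names and the exact measures below are hypothetical illustrations, not a fixed standard:

```python
# Map each criticality class to an illustrative protection level.
# Class numbers follow the five-category methodology above; the
# protection descriptions and application names are hypothetical.

CLASSES = {
    1: ("Mission Critical",       "offsite data replica, rapid restore"),
    2: ("Business Critical",      "offsite DR location, rapid restore"),
    3: ("Business Important",     "offsite tape"),
    4: ("Productivity Important", "onsite tape"),
    5: ("Non-critical",           "no dedicated protection"),
}

applications = {"ERP": 1, "Pharmacy": 2, "Intranet": 4}

def protection_for(app: str) -> str:
    """Describe the protection an application gets, based on its class."""
    label, measure = CLASSES[applications[app]]
    return f"{app} ({label}): {measure}"

for app in applications:
    print(protection_for(app))
```

The point of the table is exactly the one made above: once an application is placed in a class, its protection (and its price tag) follows automatically, instead of being renegotiated per department.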
It's not rocket science to know that mission-critical applications need more performance and protection. Ultimately, we earn our money using these applications. So once we are clear about which applications go in which boxes, hypothetically speaking, we know what each needs in terms of performance, capacity, protection, and so on.
Requirements such as high availability, high scalability, and redundant connections are examples of what you may need. As I mentioned earlier, mission-critical applications have bigger needs than 'nice to have' applications. By now you probably know which applications are mission-critical or very important to the business. But is there really a need for high-performance, no-downtime hardware? Every requirement has a price; is your business willing to pay for a Ferrari, or is a nice Mercedes enough? The questions that have to be answered are mostly about RPO and RTO. Ask questions such as:
- How much downtime can be afforded?
- How much data may be lost in an outage?
- How fast do we need to recover the data?
- How fast do we need to be up and running again?
If the answer is no downtime or less than 2 days, this certainly impacts the type of
technology you need. Meeting such standards requires excellent skills combined with
excellent hardware and software.
Once complete, the results can be integrated into a matrix such as the one below:
The Criticality Matrix shown below depicts the typical requirements and the information infrastructure associated with each class. [2] Here again, you decide which requirements are necessary in each class.
Example of a Criticality Matrix for a mental health institution:
Mission critical
- Applications: PSYGIS/Basis (client administration), PSYGIS File Lining, PSYGIS Medication prescription, PSYGIS Authentication and billing, PSYGIS Calendar Management
- Requirements: high availability, high scalability, redundant connections, scalable performance, non-disruptive backups, rapid restore, instant test environments, business continuance, advanced recovery
- Backup & disaster recovery: offsite tape, instant onsite recovery, offsite disaster recovery location, offsite data replica
Business critical
- Applications: Centrasys (pharmacy), Mirador (lab results), MUS (methadone), REAKT (registration of day activities), Vila (purchase & inventory)
- Requirements: high availability, high scalability, redundant connections, scalable performance, non-disruptive backups, rapid restore, instant test environments
- Backup & disaster recovery: offsite tape, instant onsite recovery, offsite disaster recovery location
Business important
- Applications: SDB HRM (personnel administration & salaries), Square (planning), FIS (financials), IRIS (financials), Priva (building management), Microsoft Office (e-mail/documents)
- Requirements: high availability, high scalability, redundant connections, scalable performance, non-disruptive backups
- Backup & disaster recovery: offsite tape
Productivity important
- Applications: BI (business intelligence), Prodacapo (accounting), SharePoint (intranet), Marvin (process management system), Topdesk (Enterprise Service Management)
- Requirements: high availability, high scalability, high performance
- Backup & disaster recovery: onsite tape
Non-critical
- Applications: other applications
- Requirements: scalable, low cost
- Backup & disaster recovery: none
5. Risk Assessment
A risk assessment is an important step in protecting your business. It identifies the various natural or man-made threats that can disrupt processes in the organization and its facilities.
A common misconception is that most of the threats to continuity are the result of natural disaster. Statistically, these threats account for less than 1% of IT service unavailability. That leaves us with 99% attributable to other threats. It's important that you know the risks of disasters such as tornadoes, hurricanes, floods, or other natural disasters. It's more important that you protect yourself against man-made threats.
This one is difficult because there are innumerable scenarios, caused by humans, that can go wrong. Do we look into all of these scenarios? I think not; there are simply too many.
This leaves us with two questions:
Which possible risks are there for the organization?
What are the results of these possible threats?
NOTE: Because there are so many scenarios, it is more important to consider the misery
caused by these scenarios/risks. There are more than one hundred ways to destroy
something. The fact is, one way or the other, it’s broken. Try to shift your focus from cause to
effect!
Besides the effects of disasters, you need to create a relatively complete list of the disasters
that are reasonably likely to occur. The following list isn‘t meant to be complete. Disasters not
listed here might belong in your threat model. But this list should give you a good starting
point.
Global Threats
Part of the risk process is to review the types of disruptive events that can affect the normal
running of the organization. There are many potential disruptive events and the impact and
probability level must be assessed to give a sound basis for progress.
Environmental Disasters
o Tornado
o Hurricane
o Flood
o Snowstorm
o Earthquake
o Electrical storms
o Fire
o Subsidence and Landslides
o Freezing conditions
o Contamination and Environmental Hazards
o Epidemic
Organized and / or Deliberate Disruption
o Act of terrorism
o Act of sabotage
o Act of war
o Theft
o Labor Disputes / Industrial Action
Loss of Utilities and Services
o Electrical power failure
o Loss of gas supply
o Loss of water supply
o Communications services breakdown
o Loss of drainage / waste removal
Equipment or System Failure
o Internal power failure
o Air conditioning failure
o Production line failure
o Cooling plant failure
o Equipment failure (excluding IT hardware)
Serious Information Security Incidents
o Cyber crime
o Loss of records or data
o Disclosure of sensitive information
o IT system failure
Other Emergency Situations
o Workplace violence
o Public transportation disruption
o Neighborhood hazard
o Health and Safety Regulations
o Employee morale
o Mergers and acquisitions
o Negative publicity
o Legal problems
Although not a complete list, it does give a good idea of the wide variety of potential threats.
Figure: Consequences of disasters, according to the International Disaster Database
5.1. Component Failure Impact Analysis
Originally a process defined by IBM in the 1980s to improve availability, Component Failure
Impact Analysis (CFIA) [2] [3] is now a part of the ITIL "Best Practices". CFIA is a process of
analyzing a particular hardware/software configuration to determine the true impact of any
individual failed component.
Many know that CFIA is somehow related to ITIL Problem and Availability Management, yet
it remains at best a fuzzy concept for most. While CFIA is impressive sounding, it is really
just a way of evaluating (and predicting) the impact of failures, and locating Single Points of
Failure (SPoF). CFIA can:
1. Identify Configuration Items (CIs) that can cause an outage
2. Locate CIs that have no backup
3. Evaluate the risk of failure for each CI
4. Justify future investments
5. Assist in Configuration Management Database (CMDB) creation and maintenance
All it takes to gain these benefits is an Excel spreadsheet or some graph paper. Following are the three steps to success with Component Failure Impact Analysis.
1. Select an IT Service, and get the list of CIs upon which the IT Service depends, hopefully from Configuration Management. If there is no formal CMDB, then ask around IT for documentation, paper diagrams, and general knowledge.
2. Using a spreadsheet or graph paper, list CIs in one column and the IT Service(s) across the top row. Then, for each CI, under each service:
a.) Mark "X" in the column if a CI failure causes an outage
b.) Mark "A" when the CI has an immediate backup ("hot-start")
c.) Mark "B" when the CI has an intermediate backup ("warm-start")
You now have a basic CFIA matrix. Every "X" and "B" is a potential liability.
3. Examine first the "X"s, then the "B"s, by asking the following questions:
Is this CI a SPoF?
What is the business/customer impact of this CI failing? How many users would be
impacted? What would be the cost to the business?
What is the probability of failure? Is there anything we can do differently to avoid this
impact?
Are there design changes that could prevent this impact? Should we propose
redundancy or some form of resiliency? What would redundancy cost?
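The three steps above translate almost directly into code. A minimal sketch of a CFIA matrix, using hypothetical CIs and services ("X" = failure causes an outage, "A" = hot-start backup, "B" = warm-start backup):

```python
# CFIA matrix: rows are Configuration Items (CIs), columns are IT
# services. "X" = CI failure causes an outage, "A" = hot-start
# backup, "B" = warm-start backup. All names are hypothetical.

cfia = {
    "core switch": {"E-mail": "X", "Order entry": "X"},
    "mail server": {"E-mail": "B"},
    "db server":   {"Order entry": "A"},
}

def liabilities(matrix):
    """Every 'X' and 'B' cell: the potential liabilities to examine first."""
    return [(ci, svc, mark)
            for ci, row in matrix.items()
            for svc, mark in row.items()
            if mark in ("X", "B")]

def single_points_of_failure(matrix):
    """CIs with at least one 'X': no backup, so failure means an outage."""
    return [ci for ci, row in matrix.items() if "X" in row.values()]

print(single_points_of_failure(cfia))   # ['core switch']
print(liabilities(cfia))
```

Even this toy matrix surfaces the core switch as the SPoF to examine first, which is exactly the reasoning the questions above walk through.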
As you get good at CFIA, consider expanding your matrix to include the procedure used to recover from each CI failure as a row across the bottom. (Of course, this requires that you have written procedures!) Adding documented response procedures to your CFIA matrix lets you examine the organization as well as the infrastructure. Ask yourself:
How do we respond when this CI fails?
What procedures do we follow? Are these procedures documented? Could they be
improved? Could they be automated?
Can we improve the procedure through staff training? New tools or techniques?
Could preventative maintenance have helped avoid this problem?
NOTE: We will cover these questions later in this article.
Sound CFIA at any level (infrastructure, organization, or both) produces RFCs (Requests for Change) that can deliver real improvements to the business without requiring high process maturity or expensive supporting software. There are some IT-centric benefits to CFIA as well, including a head start on IT Service Continuity Management; aiding Configuration Management, which benefits from the addition of recovery procedures to the CMDB; and helping Problem and Incident Management, which may follow these procedures.
How far do we go with this? You could take every little thing in your CMDB, but it's wise to focus on the CIs behind your mission-critical applications. Identify your critical components. Only they are important. When one of these critical components fails, your mission-critical application probably does, too.
5.2. Identifying Critical Components
This chapter is very closely related to Component Failure Impact Analysis (CFIA), which
talked about data classification, and gives you a good start to the applications we are going
to protect.
5.2.1. Personnel
It's important to recognize that your personnel are critical components, too. You may begin to notice a few names that appear frequently in the most critical processes or that are involved with several critical applications. You may want to take a closer look at those people and consider whether they're truly critical for so many business processes or applications. Items in your DR plan that relate to critical personnel may include cross-training or some form of staff expansion, in order to reduce the exposure created by too many processes depending on too few individuals.
5.2.2. Systems
By this time you've collected all the necessary information about the important business processes for your Business Impact Analysis. You've identified the information systems, personnel, assets, and suppliers that these processes depend on.
In chapter 4 (Data Classification), we discussed the critical applications and what is needed to run them and keep them available at all times. These critical applications depend on and run on systems such as power, cooling, switches, firewalls, and servers. Now, all systems that are relevant to these critical applications should be named; but name only those systems that are absolutely necessary for the critical applications that need to fail over when disaster strikes.
So, again, keep things simple and start with which application runs on which server. Work systematically. For example: this application runs on this server and depends on the following services; this server runs on precisely this blade in this blade chassis. Once you are done, you have a list of all servers. You can expand this list to your own needs. Do the same for power, switches, and other systems related to the critical applications. The more extensive, the better. The important thing is to start somewhere!
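The application-to-server mapping described above can be sketched as a simple inventory structure. This is only an illustration; all server, blade, and service names here are hypothetical placeholders:

```python
# Hypothetical inventory sketch: map each critical application to the
# systems it runs on, so nothing is missed when building the failover list.
inventory = {
    "mail": {
        "server": "vm-mail-01",
        "blade": "chassis-1/blade-4",
        "services": ["DNS", "ADS"],
    },
    "sharepoint": {
        "server": "vm-sp-01",
        "blade": "chassis-1/blade-2",
        "services": ["SQL Server", "IIS", "ADS"],
    },
}

# Derive the full set of systems that must fail over with these applications.
systems = set()
for app, info in inventory.items():
    systems.add(info["server"])
    systems.add(info["blade"])
    systems.update(info["services"])

print(sorted(systems))
```

Starting from a structure like this, you can keep expanding it with power feeds, switches, and chassis until the list covers everything the critical applications need.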
EMC Proven Professional Knowledge Sharing 19
5.3. Dependencies
Why identify dependencies? Your mission in this phase is to identify the systems that are critical. The systems that support business processes are not just the servers hosting the applications, but also everything else you need to keep those systems running properly.
The following sections discuss dependencies in greater detail. As previously discussed, try to focus on the things that are related to the mission-critical applications. After you reach that point you can always expand. A wise lesson: start simple!
When you have completed your list of all the critical components for the mission-critical applications, you need not only an inventory-level view of your systems and applications, but also a high-level view of them. If you don't have these views of your environment, it's worth the time required to develop them. Often, such diagrams are the only way to get a complete end-to-end view of a single application or an entire environment. Put all your information in a diagram, then try to connect the components to each other.
In my example on the next page there are some layers to play with. I roughly used the
following layers:
Power
Network infrastructure
Hardware
Storage
Hypervisor
Operating systems
Services
Reference [4] describes a good way to capture these dependencies in a schema. Below, we see how data flows as email is delivered: it is routed from the external mail server, through the spam filter, and into the internal mail server. IT staff envision their world like this. It's a conceptual model that facilitates troubleshooting of email problems, but it doesn't clearly show how email services might be impacted by various system failures. Adding dependency relations makes that visible.
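As a sketch of this idea, the email chain can be modeled as a small dependency graph and queried for the impact of a component failure. The component names are illustrative, not taken from any real environment:

```python
# Minimal dependency-graph sketch: each component lists what it depends on.
# A failure impacts a service if the failed component appears anywhere in
# that service's transitive dependencies.
deps = {
    "email": ["internal-mail-server"],
    "internal-mail-server": ["spam-filter", "DNS"],
    "spam-filter": ["external-mail-server"],
    "external-mail-server": ["network"],
    "DNS": ["network"],
    "network": ["power"],
    "power": [],
}

def impacted_by(service, failed, graph=deps):
    """Return True if `service` is impacted by the failure of `failed`."""
    stack = [service]
    seen = set()
    while stack:
        node = stack.pop()
        if node == failed:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

print(impacted_by("email", "power"))  # a power failure reaches email
```

This is exactly what the dependency relations in the diagram buy you: given any single failure, you can immediately see which services go down with it.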
You need at least these high-level diagrams of your systems environment. On the next page, I provide an example in which I identify the critical systems. Just add your dependencies and you are ready. (No dependency lines are drawn in the figure because it would become unreadable.)
[Figure: Example of a high-level diagram. It layers power, network infrastructure, hardware (HP blade chassis, HP thin clients), hypervisor (VMware ESX), virtual machines, operating systems, services (DNS, DHCP, Active Directory with its forest and domain roles, IIS, Citrix, SharePoint with its WFE and SSP, SQL Server 2005 instances and databases), and storage (EMC CLARiiON CX4-240 with RAID 5 and RAID 1/0 groups, LUNs, and VMFS volumes) beneath the mission-critical processes and their dependencies.]
5.4. Redundancy
The term redundancy is often confused with availability. While the two are related, they are not the same. Redundancy refers to, for example, the use of multiple servers, more than one Host Bus Adapter (HBA), or RAID-protected disks. Redundancy is the duplication of critical components of a system with the intention of increasing system reliability. When it comes to redundancy, you might say that more is better. But redundancy has its price: you have to buy at least double the hardware. On the other hand, what is the price of a server compared to an outage of your mission-critical application?
There are many scenarios for deploying a redundant solution for servers and storage. Let's look at some examples:
No redundancy
Look at the components in the picture: each is a single point of failure. A single point of failure (SPoF) is a hardware or software element whose loss results in the loss of service.
HBA redundancy
Configuring multiple HBAs and using multipathing software provides path redundancy. Upon detecting a failed HBA, the software can re-drive the I/O through another available path.
HBA and Switch redundancy
This picture provides HBA and switch redundancy as well. It also protects against storage
array port failures.
HBA, Switch, and Disk redundancy
Now we are using some level of RAID, such as RAID-5. RAID protection will ensure
continuous operation in the event of disk failures.
HBA, Switch, Disk, and Storage array redundancy
The diagram above depicts a highly redundant infrastructure. Everything is redundant, so the failure of a single component is unlikely to make your applications unavailable.
Remote replication is an essential part of any data protection plan. It provides protection in
case of primary device, storage, or site failure. Remote replication involves moving data to a
secondary storage array to protect against data loss in case of primary site failure.
There are two types of remote replication: synchronous, which allows an RPO of close to zero, and asynchronous, which allows updates to be made to a secondary image at intervals selected by the user.
Bottom line: there should be some form of redundancy in your infrastructure to make sure all data and information are protected. Without redundancy, you will discover during an outage that you are lost. You will have to decide how far you want to take redundancy; as mentioned earlier, it has a price. The more redundancy, the more it costs.
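The cost-versus-protection trade-off can be made concrete with a back-of-the-envelope calculation. Assuming independent failures, n parallel components of availability a give a combined availability of 1 − (1 − a)^n; the 99% figure below is an assumed example, not a measured value:

```python
# Availability of n redundant components in parallel, assuming
# independent failures: 1 - (1 - a) ** n.
def parallel_availability(a, n):
    return 1 - (1 - a) ** n

single = 0.99                                # one HBA/path at 99% availability
dual = parallel_availability(0.99, 2)        # two redundant paths
print(f"single path: {single:.4f}, dual path: {dual:.4f}")
```

Doubling the hardware takes this example from two nines to four nines, which is the kind of number to weigh against the cost of an outage of your mission-critical application.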
6. Emergency Response Team
An emergency response team (ERT) [5] is a group of people who
prepare for and respond to any emergency incident, such as a
natural disaster or an interruption of business operations. Incident
response teams are common in corporations as well as in public
service organizations. This team is generally composed of specific
members designated before an incident occurs, although under
certain circumstances the team may be an ad-hoc group of willing
volunteers.
Incident response team members typically are trained and prepared to fulfill the roles
required by the specific situation. Ideally, the team has already defined a protocol or set of
actions to perform to mitigate the negative effects of the incident.
You need to determine when to declare a disaster, so define a procedure for declaring one; at least two ERT members should make the declaration. Most of the time this is relatively easy. Why? Because it does not matter whether the disaster was caused by a man-made or a natural event: base your decision on consequences rather than causes.
For example, assume your mission-critical environment is unavailable, or data has been lost in your mission-critical application. Either way, this is usually severe enough to declare a disaster. When a disaster is declared, the ERT launches the DR plan. Before declaring a disaster, determine the Maximum Acceptable Outage Time (MAOT). The MAOT may range from a few hours to several days or more. It is the longest time that can be tolerated between the onset of a disaster and the resumption of a critical business process. The ERT should assess the disaster and determine whether the outage of your business's critical processes is likely to exceed the MAOT. If the ERT thinks it will, the ERT should declare a disaster.
Example of MAOT:
Decide moment (12 hours maximum): the final moment at which it is decided that the IT infrastructure at the primary location will not recover within one business day.
Start fallback scenario (48 hours maximum): start the failover to the secondary location.
Availability of mission-critical app: a working IT environment for at least 200 employees.
The scenario above is an example. It should be clear that at the decision point you do not always need 12 hours to decide whether or not you will fail over; if a fire causes irreparable damage in your data center, it is obvious that you need to fail over. After the ERT decides that the MAOT will be exceeded for critical processes, it invokes the DR plan.
Here are some guidelines to get your DR plan up and running:
Arrange an emergency meeting
Appoint an ERT Leader
Assign other roles, such as communications
Designate someone on the ERT to keep a logbook
Discuss the MAOT
Initiate recovery plans
Many organizations put emergency contact lists on laminated wallet cards. Wallet cards are very portable because they fit into a wallet, and you are more likely to have your wallet with you when a disaster strikes. Consider putting items on the card such as names, phone numbers, the URL of the disaster recovery procedure, and even spouse information. You might need more or less information than what I have listed here.
7. Developing a Recovery Strategy
The primary task of this step is to determine how you will achieve your disaster recovery
goals for each of the systems and system components that were identified. For most
organizations, the design of a recovery strategy solution is a fairly custom process. While the design principles and considerations are largely common, designers typically have to make a number of compromises.
Backup and recovery [7] are components of business continuity. Business continuity is the
term that covers all efforts to keep critical data and applications running despite any type of
interruption (including planned and unplanned). Planned interruptions include regular
maintenance or upgrades. Unplanned interruptions could include hardware or software
failures, data corruption, natural or man-made disasters, viruses, or human error. Backup
and recovery is essential for operational recovery; that is, recovery from errors that occur on a regular basis but are not catastrophic, such as data corruption or accidentally deleted files. Disaster recovery is concerned with catastrophic failures. Believe me, nothing is as interesting as a big failure, because that is the moment you actually learn something. When planning for backup and recovery, you should decide how much data loss you are willing to incur. You can use this decision to calculate how often you need to perform backups. Backups should be performed at fixed intervals.
The length of time between backups is driven by the Recovery Point Objective (RPO); that is, the maximum amount of data that you are willing to lose. You should also decide how long you are willing to wait until the data is completely restored and business applications become available. The time it takes to completely restore data and for business applications to become available is called the Recovery Time Objective (RTO). Your RTO can be different from your RPO.
After determining your recovery time and recovery point objectives, you can determine how much time you actually have to perform your backups, typically called your backup window. The backup window determines the type and level of your backups. For example, if you have a system that requires 24-hours-a-day, 7-days-a-week, 365-days-a-year availability, there is no backup window, so you would have to perform an online backup (also known as a hot backup), in which the system is not taken offline.
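The link between RPO and backup frequency is simple arithmetic: if you can tolerate losing at most a given number of hours of data, backups must run at least that often. A minimal sketch:

```python
# If the RPO allows losing at most `rpo_hours` of data, backups must be
# taken at least every `rpo_hours`, i.e. this many times per day.
def backups_per_day(rpo_hours):
    return 24 / rpo_hours

print(backups_per_day(4))   # an RPO of 4 hours means 6 backups a day
```

Each of those runs then has to fit inside the backup window; if six daily runs do not fit, the RPO, the backup type, or the window itself has to change.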
Lastly, as the number of backups increases, the space required to store them also increases. Therefore, you should consider how long you are required to retain your backups (also referred to as the data retention period) and plan for the appropriate amount of storage space. When your deployment fails, you recover it by restoring it to a previously consistent state (that is, a particular point in time) from your backups. Restoring a deployment to a particular point in time is also known as a point-in-time recovery.
7.1. Types of backup
You can choose from three different backup methods. Most backup strategies use a
combination of two or three of these methods:
Full is the starting point for all other backups and contains all the data in the folders
and files that are selected to be backed up. Because the full backup stores all files
and folders, frequent full backups result in faster and simpler restore operations.
Remember that when you choose other backup types, restore jobs may take longer.
It would be ideal to make full backups all the time, because they are the most
comprehensive and are self-contained. However, the amount of time it takes to run
full backups often prevents us from using this backup type. Full backups are often
restricted to a weekly or monthly schedule, although the increasing speed and
capacity of backup media is making overnight full backups a more realistic
proposition.
Incremental provides a faster method of backing up data than repeatedly running full
backups. During an incremental backup only the files changed since the most recent
backup are included. The time it takes to execute the backup may be a fraction of the
time it takes to perform a full backup.
Differential contains all files that have changed since the last full backup. The advantage of a differential backup is that it shortens restore time compared to restoring from a chain of incremental backups. However, the longer the time since the last full backup, the larger the differential backup grows; it might even become larger than the baseline full backup.
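The difference between the three types comes down to which reference point decides whether a file is included. A small sketch with made-up file modification times:

```python
# Illustrative file selection for the three backup types.
# Modification times are expressed as "day" numbers for simplicity.
files = {"a.txt": 1, "b.txt": 3, "c.txt": 5}   # name -> last-modified day
last_full = 2          # day of the last full backup
last_backup = 4        # day of the most recent backup of any type

full = set(files)                                               # everything
incremental = {f for f, m in files.items() if m > last_backup}  # since last backup
differential = {f for f, m in files.items() if m > last_full}   # since last full

print(full, incremental, differential)
```

Restoring from incrementals means replaying the full backup plus every incremental since; restoring from a differential needs only the full backup plus the latest differential, which is why differentials restore faster but grow over time.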
I talked about three backup types. Maybe the following one doesn't belong here, but it is a definitive copy of the original data, which makes it a backup:
Mirrored ensures your information is protected from both system and site failures. In an array, it is block-level protection, so you cannot open and navigate these files in Windows Explorer.
In EMC terms, we speak about MirrorView™. It leverages the power of EMC
CLARiiON® networked storage systems to offer both synchronous and asynchronous
remote mirroring. Whether you mirror data around the corner or across the globe,
MirrorView provides disaster recovery that protects your most critical data in the event
of an outage.
Another replication method is Symmetrix Remote Data Facility (SRDF®), which is used in EMC Symmetrix® systems. SRDF provides remote replication for disaster recovery and business continuity.
7.2. Virtualized Servers and Disaster Recovery
Traditional disaster recovery plans are often complex and difficult, largely because of bare-metal recovery. Virtualization makes life easier for us and simplifies this environment. A
virtual machine typically is stored on the host computer in a set of files, usually in a directory
created by the host for that specific virtual machine. When you protect these files using your
backup or replication software, you've protected the entire system. These files can then be
recovered to any hardware without requiring any changes because virtual machines are
hardware-independent.
Reliable disaster recovery solutions traditionally require duplicating your entire production
infrastructure and with it, your costs. With virtualization software such as VMware vSphere,
you can provide rapid and reliable recovery without requiring identical hardware. Virtual
machines can share the physical resources of a single computer while remaining completely
isolated from each other as if they were separate physical machines. If, for example, there
are three virtual machines on one physical
server and one of the virtual machines
crashes, the other virtual machines
remain available. Isolation is an important reason why availability and security in a virtual environment are superior to those of applications running on a traditional, non-virtualized system. Server consolidation also lets you slash the cost of the server infrastructure needed for both production and disaster recovery.
Virtualization is a must-have these days in
combination with disaster recovery. You
can easily test your disaster recovery plan
to ensure the highest levels of reliability
and availability of your entire IT
infrastructure.
7.3. Other thoughts
An amazing amount of work and planning is required before you push the button and begin
drafting actual recovery plans. Disaster recovery has many aspects because you may need
to recover different portions of your environment, depending on the scope and magnitude of
the disaster that strikes. Your worst-case scenario (an earthquake, tornado, flood, or
whatever sort of disaster happens in your part of the world) can render your work facility
completely damaged or destroyed, requiring the business to continue elsewhere.
But besides that, there are more business justifications for developing a recovery strategy.
– Level of attention and expertise required
– Performance impacts
– Effect of link outages
– Change Control Integration
Do you know who your experts are? The ones who can provide innovative, valuable solutions to your organization (whether internal or external)? The ones who know the jargon, products, and tools of your organization? Such experts exist at every level of your organization. They are your most valuable competitive asset and also your most scarce; their scarcity is probably the greatest single factor limiting your growth. Your experts also go home every night, and they are what you lose when they retire or move to the competition. Not many people have that specific experience and knowledge. Everything is learnable, but it takes a while; it could be years before you are back at the level you were before. So, take good care of your experts.
Most organizations do not have an exact copy of their data center, such as a fully automatic failover site. In most cases, it is more important to recover your data and mission-critical applications at the failover site than to have the same number of people working at the same time as before the disaster. Performing a recovery with less hardware has an impact on performance. Be aware that there will also be fewer people available to get behind a keyboard.
Things will not always go according to plan. That is a fact, and it is the whole reason for this article. Be prepared that things will not always go as planned in your recovery plan, and try to anticipate as much as possible what can go wrong.
If your organization works according to ITIL, you are probably using Change Management. When a disaster strikes, a lot is going on; try to fit Change Management in an appropriate manner given the circumstances.
8. Testing Recovery Plans
Traditional recovery plans are often difficult to test, difficult to keep up to date, and depend
on exact execution of complex, manual processes. In a virtualized environment, testing is
simpler because you can execute non-disruptive tests using existing resources. Hardware
independence eliminates the complexity of maintaining the recovery site by eliminating
failures due to hardware differences.
But still, your organization changes by the day: servers are added and removed, mission-critical applications may need to be added, or you may simply merge with another organization. The fact remains that changes occur every day, and these changes have an enormous impact on your DR plan. After you develop the DR plan, you need to put it through progressively more intense cycles of testing. If an organization must trust its very survival to the quality and accuracy of a DR plan, that plan has to be tested to be sure it actually works. In disasters, you rarely get second chances.
DR plans contain lists of procedures to follow when a natural or man-made disaster occurs.
The purpose of the plan is to recover the IT applications and infrastructure that support
business-critical processes. When disaster hits you, it hits hard. You seldom can clearly tell
whether those disaster plans will actually work. And given the nature of disasters, if your
disaster plan fails, the organization may not survive the disaster.
When you test your disaster plan, note anything that does not go according to plan, and then pass the plan back to the people who designed it so they can update it. This process improves the quality and accuracy of the disaster plan. Therefore, periodic, realistic testing of the recovery plan is necessary and is required for your mission to succeed.
Another consideration is whether you can test and maintain protection simultaneously: what happens if you start on Friday and are not back up and running by Monday? It is important to include this in your plan. Every time, ask yourself 'what if', and be prepared for the worst that can happen. A good start is to fragment your recovery plan into small pieces: start by destroying one server and see whether it can be restored.
9. Role of virtualization
The business world has undergone an enormous transformation over the past 20 years.
Business process after business process has been captured in software and automated,
moving from paper to electrons.
In today's world, virtually every strategic business decision has an IT implication. Market
forces continue to accelerate in every region of the world, and across every industry, putting
increasing pressure on IT departments to be more responsive and help organizations stay
competitive and pursue new opportunities at lower cost.
Virtualization is rapidly transforming the IT landscape and fundamentally changing the way
companies compute. Virtualization is the catalyst that makes IT-as-a-Service a reality. It is
the enabling technology on which cloud computing architectures are and will be built.
Whether you have virtualized all of your IT assets and applications or you are just starting
out, you are on your way to transforming to a new model for IT.
Before virtualization, IT organizations would run one application per physical server, so cost per server was a quick way to compare costs; it was a one-to-one relationship. As a result, many data centers have machines running at only 10 or 15 percent of total processing capacity; in other words, 85 or 90 percent of the machine's power is unused. It isn't rocket science to recognize that this situation is a waste of resources. But once you virtualize, many applications (each in its own virtual machine) run on each physical server; it is now a many-to-one relationship.
When a server is used to host a number of virtual machines, it is faced with much higher
levels of demand for system resources than would be presented by a single operating
system running a single application. Obviously, with more virtual machines running on the
server, there will be more demand for processing. Even with two or more processors,
virtualization can outstrip the processing capability of a traditional commodity server.
Also, with more virtual machines on the server, there will be far higher storage and network traffic, as each virtual machine transmits and receives as much data as would be demanded by a single operating system in the old "one application, one server" model.
Furthermore, because virtualization makes hardware robustness more important, most IT organizations seek to avoid so-called Single Point of Failure (SPoF) situations by implementing redundant resources in their servers: multiple network cards, multiple storage cards, extra memory, and multiple processors, all doubled or even tripled in an effort to avoid a situation where a number of virtual machines stall due to the failure of a single hardware resource.
9.1. Role of VMware
As virtualization is now a critical
component of an overall IT strategy, it
is important to choose the right
vendor. VMware is the leading business virtualization infrastructure provider, offering a trusted and reliable platform for building IT infrastructures as well as private and public clouds.
VMware [6] stands alone as a leader.
While challengers like Microsoft and
Citrix are emerging, VMware has a
tremendous head start in this market.
It is clearly ahead in understanding
the market, and is ahead in product
strategy, business model, and
technology innovations.
Why VMware?
Has been built on a robust, reliable foundation over many years
Delivers a complete virtualization platform, from desktop through the data center out
to public clouds
Provides the most comprehensive virtualization and cloud management
Integrates with your overall IT infrastructure
Is proven by more than 190,000 customers
VMware has invested in technologies to achieve very high virtual machine density on
VMware vSphere. As of 2010, VMware supports more guest operating systems than any other bare-metal virtualization platform. The superior performance of VMware vSphere with unmodified (fully virtualized) guests, made possible by VMware's exclusive binary translation
technology, means that VMware vSphere can run off-the-shelf operating systems with near-
native performance. No other virtualization platform achieves the high virtual machine density
of VMware vSphere and still maintains consistent, high application performance across all
running virtual machines.
With VMware you can lower your operational costs. You can directly reduce your operational
costs by using the dynamic IT services built into VMware vSphere that most other
competitors do not offer.
The most common, for example, are [8]:
High Availability (HA)
VMware HA provides uniform, cost-effective failover protection against hardware and operating system failures within your virtualized IT environment.
Distributed Resource Scheduler (DRS)
VMware DRS continuously balances computing capacity in resource pools to deliver the performance, scalability, and availability not possible with physical infrastructure.
vMotion
VMware vMotion uses VMware's cluster file system to control access to a virtual machine's storage. During a vMotion, the active memory and precise execution state of a virtual machine are rapidly transmitted over a high-speed network from one physical server to another, and access to the virtual machine's disk storage is instantly switched to the new physical host. Since the network is also virtualized by the VMware host, the virtual machine retains its network identity and connections, ensuring a seamless migration process.
Storage vMotion
VMware Storage vMotion is a state-of-the-art solution that enables you to perform live migration of virtual machine disk files across heterogeneous storage arrays with complete transaction integrity and no interruption in service for critical applications.
Site Recovery Manager (SRM)
VMware vCenter Site Recovery Manager eliminates complex manual recovery steps and removes the risk and worry from disaster recovery.
Fault Tolerance (FT)
VMware Fault Tolerance provides continuous availability for applications in the event of server failures by creating a live shadow instance of a virtual machine that runs in virtual lockstep with the primary instance.
Find more at http://www.vmware.com/products/
VMware is the proven choice for virtualization from the desktop to the data center. Small and
midsize businesses run on VMware. More than 190,000 customers of all sizes, including all
of the Fortune 100, trust VMware as their virtualization infrastructure platform. That must
mean something!
9.2. Role of EMC
The digital universe is still growing, even during a global economic downturn. The creation
and replication of digital information set a record in 2009 by growing to 800 billion gigabytes,
more than 60% over the previous year. People continue to take pictures, send e-mail, blog,
and post videos. Organizations are still adding information. Governments are still requiring
more information to be kept. And that's only the beginning of what's to come.
That's nice for business, and undoubtedly so for storage vendors. But it's not just about storing data. It's more about innovation, protection, optimization, and leveraging information.
In 2003, EMC, the world leader in information storage and management, acquired VMware.
Joe Tucci, EMC President and CEO, said, "Customers want help simplifying the
management of their IT infrastructures. This is more than a storage challenge. Until now,
server and storage virtualization have existed as disparate entities. Today, EMC is
accelerating the convergence of these two worlds." Was he wrong?
I have the privilege to work with nice things related to EMC and VMware every day. And it is
just amazing how easy things are to integrate. Let me give you a great example of the
products EMC builds in relation to VMware.
EMC Unified Storage vCenter plug-in
This plug-in is a must-have in combination with vSphere. With EMC's second-generation vCenter plug-in family (Virtual Storage Integrator, the CLARiiON plug-in, and the Celerra® NFS plug-in), EMC gives VMware administrators the ability to simplify visibility, provisioning, and management of EMC storage through the VMware lens. From VMware vCenter, administrators can leverage array functions to increase efficiency in their VMware environment and hardware-accelerate VM deployment.
Download: http://www.mikes.eu/download/EMC Plug-in for VMware vCenter.pdf
Integration is good. EMC offers direct integration and management of its systems from VMware's management suite by making use of APIs. EMC and VMware integration makes things simpler and more efficient. Without discussing the products in depth, I don't want to keep information from you, so they are shown in the table below.
Product Families [9]
Hardware
Celerra, Explained here
Bring powerful, high-availability unified storage to your organization in convenient integrated
models and flexible gateways. All are easy to deploy and manage. Plus, simplify management
with powerful software.
CLARiiON, Explained here
Get the high availability, scalability, and flexibility you need to manage and consolidate more
data. Combine easy-to-use midrange networked storage with innovative technology and
robust software capabilities.
Connectrix®, Explained here
Move your organization's vital information where it needs to go—quickly, easily, and reliably.
Advanced directors and switches make it happen. Get best-in-class availability and easy
management.
Centera®, Explained here
Store and manage your "fixed content"—unchanging digital assets—and keep them available
online and accessible. All with EMC Centera content-addressed storage (CAS) systems. Be
ready for growth with petabyte scalability.
Iomega, Explained here
Store, protect, and share your valuable data with reliable and easy-to-use storage solutions for
home and small business.
Symmetrix, Explained here
Make high-end networked storage part of your information infrastructure with systems that
take performance, availability, and security to new heights. Manage and protect your
information today and expand in the future.
VPLEX™
Deploy next-generation architecture to enable simultaneous information access within,
between, and across data centers.
Software
Atmos™
Build your own cloud services or leverage a public cloud to deliver content and information
services anywhere in the world with EMC Atmos.
Ionix™
Simplify and automate key tasks—such as discovery, monitoring, reporting, planning, and
provisioning—for even the largest, most complex storage environments.
PowerPath®
Host-based solutions including multipathing, data migration, and host-based encryption.
9.3. Role of VMware Site Recovery Manager
The beautiful part of VMware Site Recovery Manager (SRM) is that you can test a plan
without running it live. With SRM you can fail over at any time without damaging the
infrastructure environment.
SRM [8] provides business continuity and disaster recovery protection for virtual
environments. Protection can extend from individual replicated data stores to an entire virtual
site. VMware's virtualization of the data center offers advantages that can be applied to
business continuity and disaster recovery:
- The entire state of a virtual machine (memory, disk images, I/O, and device state) is
encapsulated. Encapsulation enables the state of a virtual machine to be saved to a
file, which in turn allows the transfer of an entire virtual machine to another host.
- Hardware independence eliminates the need for a complete replication of hardware
at the recovery site. Hardware running VMware ESX at one site can provide business
continuity and disaster recovery protection for hardware running VMware ESX at
another site. This eliminates the cost of purchasing and maintaining a system that sits
idle until disaster strikes.
- Hardware independence allows an image of the system at the protected site to boot
from disk at the recovery site in minutes or hours instead of days.
SRM leverages array-based replication between a protected site and a recovery site. The
workflow that is built into SRM automatically discovers which datastores are set up for
replication between the protected and recovery sites. SRM can be configured to support
bi-directional protection between two sites.
SRM provides protection for the operating systems and applications encapsulated by the
virtual machines running on VMware ESX. An SRM server must be installed at the protected
site and at the recovery site. The protected and recovery sites must each be managed by
their own vCenter Server.
Implementing an SRM solution is almost "too easy". But as you have read so far, it is not only
about the software you are using. The software, which will make your life a lot easier, is not
the most important piece of the puzzle. Keep thinking about the first 8 chapters of this article,
which are more important than the software.
VMWARE IS A TRUE ENABLER FOR DISASTER RECOVERY
10. VMware Site Recovery Manager
Downtime is expensive! Disaster preparedness and recovery planning is an iterative process,
not a one-time event. You need to continually revisit disaster recovery plans to ensure they
remain aligned with current business goals and test those plans regularly to ensure that they
perform as planned.
VMware Site Recovery Manager [8] provides business continuity and disaster recovery
protection for virtual environments. In a Site Recovery Manager environment, there are two
sites involved, a protected (primary) site and a recovery (secondary) site. Protection groups
that contain protected virtual machines are configured on the protected site and these virtual
machines can be recovered by executing the recovery plans on the recovery site. The
illustration below depicts how it operates at a very high level.
Site Recovery Manager uses a database on both protected and recovery sites to store
information. The protected site Recovery Manager database stores data regarding the
protection group settings and protected virtual machines, while the recovery site Recovery
Manager database stores information on recovery plan settings.
VMware Site Recovery Manager changes the way disaster recovery plans are designed and
executed by involving two simple steps: protection and recovery.
Protection involves the following operations:
- Array manager configuration
- Inventory mapping
- Creating a protection group
Recovery involves the following operations:
- Creating a recovery plan
- Test recovery
- Real recovery
The vCenter Server must be installed at both the protected site and recovery site, as well as
an SQL Server or Oracle Database server.
See Site Recovery Manager Compatibility Matrixes documentation for a list of
supported servers and databases.
Each site has an inventory of virtual machines that reside on array-based replicated LUNs
(logical unit numbers), which are disk volumes in a storage array that are identified
numerically. Before installing SRM, install the Storage Replication Adapter (SRA) for your
storage and storage replication environment. The SRA is software that ensures integration of
your storage device with SRM. Because SRM interacts with arrays from a variety of storage
vendors, consult the documentation that your storage vendor provides for array-specific
information used during SRM installation and configuration. The SRAs that storage vendors
have created for Site Recovery Manager can be downloaded from the vmware.com website.
See Site Recovery Manager Storage Partner Compatibility Matrixes for a list of
supported SRAs.
Optimally SRM is installed bi-directionally, so that each site serves as a recovery site for the
other. The two sites should be a significant geographic distance from each other. The
protected and recovery sites must be in a networked configuration that allows TCP
connectivity. Each site consists of a vCenter Server, which is a Windows machine that runs
the vCenter service. Installed with each vCenter Server is the SRM Server. The SRM Server
hosts Site Recovery Manager and array management technology. It also serves the SRM
plug-in to the VI Client. Management is done from the vCenter client on the protected site.
SRM uses block-based replication with SRAs installed on the SRM Server. This integration
of hardware and software supports the most demanding application business continuance
needs, in this case, a failover following a disaster.
Replication, Replication, Replication - Technology
SRM only works properly with a replication technology. Data replication, however, is a
growing challenge. Working to achieve higher levels of data availability, storage
administrators increasingly create multiple copies of business-critical data to quickly recover
from disasters. As data centers attempt to maintain data availability in the event of local
catastrophes while globally servicing customers, multiple copies of data must also be
efficiently distributed and synchronized to other data centers.
There are several replication techniques that can be used with VMware SRM; there is a
compatibility matrix of supported vendors. The strengths that SRM delivers are the ability to:
- Remove manual recovery complexity through automation
- Provide central management of recovery plans and protection groups
- Simplify and automate disaster recovery workflows
Replication in combination with VMware and EMC comes in a few flavors, such as:
EMC SRDF [9]
EMC Symmetrix Remote Data Facility (SRDF) provides remote replication for disaster
recovery and business continuity.
Download: http://www.emc.com/products/detail/software/srdf.htm
EMC MirrorView [9]
EMC MirrorView ensures your information is protected from both system and site failures. It
leverages the power of EMC CLARiiON networked storage systems to offer both
synchronous and asynchronous remote mirroring.
Download: http://www.emc.com/products/detail/software/mirrorview.htm
EMC Celerra Replicator [9]
EMC Celerra Replicator provides efficient, asynchronous data replication over Internet
Protocol (IP) networks.
Download: http://www.emc.com/products/detail/software/celerra-replicator.htm
EMC RecoverPoint [9]
EMC RecoverPoint brings you continuous data protection and continuous remote replication
for on-demand protection and recovery to any point in time. RecoverPoint's advanced
capabilities include policy-based management, application integration, and bandwidth
reduction.
Download: http://www.emc.com/products/detail/software/recoverpoint.htm
Plans
The next steps involve making plans to configure Site Recovery Manager. Recovery plans
are created and managed directly from vCenter and are both powerful and easy to build. Site
Recovery Manager provides an intuitive interface to help users create recovery plans for
different failover scenarios and different parts of their infrastructure. Users can specify virtual
machines to be suspended or shut down. They can also specify the order in which virtual
machines are powered on or shut down, set user-defined scripts to execute automatically,
and determine where to pause the recovery process if necessary. These steps are not
detailed here, as they are beyond the scope of this article; refer to VMware and storage
vendor documentation for additional details. There is also much useful information in the
VMware communities.
Basically it comes down to this:
- Deploy Site Recovery Manager (SRM) at both the protected and recovery sites.
- Install the Storage Replication Adapters (SRA) on the same server as SRM on
both the protected and recovery sites. Install the SRM plug-in on the protected
and recovery vCenter servers.
- Set up connections between the protected and recovery sites.
- Configure the Array Manager so that SRM knows about the storage arrays.
- Create one or more Protection Groups that contain the replicated LUNs and the
associated virtual machines that host the mission-critical applications.
- Create a Recovery Plan associated with a Protection Group, so that in the event
of a failover the recovery site knows the relationship between virtual machines
and the failed-over storage.
- Run a test failover to verify functionality.
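To make the concepts concrete, here is a minimal sketch that models protection groups, priority-ordered recovery plans, and the difference between a test and a real failover. This is an illustrative model only, not the actual SRM API; all class names, method names, and values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualMachine:
    name: str
    priority: int = 3          # lower number = powered on earlier

@dataclass
class ProtectionGroup:
    """Groups a replicated LUN with the virtual machines that live on it."""
    replicated_lun: str
    vms: list = field(default_factory=list)

@dataclass
class RecoveryPlan:
    """Associates a protection group with an ordered recovery workflow."""
    name: str
    group: ProtectionGroup

    def run(self, test_mode=True):
        """Return the ordered recovery steps.

        In a test, SRM works from a writable snapshot of the replica on an
        isolated network, so production replication is left untouched; in a
        real failover the replica itself is promoted.
        """
        source = "snapshot of replica" if test_mode else "promoted replica"
        steps = [f"present {source} of {self.group.replicated_lun}"]
        for vm in sorted(self.group.vms, key=lambda v: v.priority):
            steps.append(f"power on {vm.name} (priority {vm.priority})")
        return steps

# Example: the database must come up before the application server.
pg = ProtectionGroup("LUN-42",
                     [VirtualMachine("app01", 2), VirtualMachine("db01", 1)])
plan = RecoveryPlan("CRM failover", pg)
for step in plan.run(test_mode=True):
    print(step)
```

Running the example prints the snapshot presentation step first, then powers on db01 before app01 because of its lower priority number, mirroring the ordered power-on behavior described above.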
11. Standardization
A good start is to ask: what are standards? According to 'search and Google', a standard is a
definition or format that has been approved by a recognized standards organization or is
accepted as a de facto standard by the industry.
Standards exist for programming languages, operating systems, data formats,
communications protocols, and so forth.
Standards are extremely important in the computer industry because they allow the
combination of products from different manufacturers to create a customized system. Without
standards, only hardware and software from the same company could be used together. In
addition, standard user interfaces can make it much easier to learn how to use new
applications.
A lot of organizations are committed to an open, standards-based approach to
interoperability so that customers can implement solutions that meet their individual needs.
It's important to create a policy covering the basic concepts of standardization: stability,
future-proofing, controlled innovation, and security are essential.
VMware is committed to an open, standards-based approach to licensing and interoperability
so that customers can implement virtualization-based solutions that meet their individual
needs. Whether you have virtualized all of your IT assets and applications or you are just
starting out, you are on your way to transforming to a new 'standard' model for IT.
12. Conclusion
We started with this sentence, and now we end with it: "Information is the organization's most
important asset."
Given that, the information must be protected. We must look carefully at which information
we protect, because there is no point in protecting your entire infrastructure environment. You
must classify your data; otherwise, everything gets protected the same way. Without
classification everything is important, and you don't want that.
When disaster strikes, it hurts, one way or the other. If a disaster hits an organization without
a disaster recovery plan, that organization has very little chance of recovery. Organizations
that do have DR plans may still have a difficult time when a disaster strikes. You may have to
put in considerable effort to recover time-sensitive critical business functions. But if you have
a disaster recovery plan, you have a chance at survival.
It is a common misconception that most of the threats to continuity are a result of natural
disaster. Statistically, these threats account for less than 1% of IT service unavailability. This
finding indicates that you should mainly focus on other things than just natural disasters.
Doing nothing isn't an option because it can damage your company in many ways. For
example:
- Financial/cash flow/revenue loss
- Legal/regulatory consequences
- Life-threatening issues, in hospitals for example
- Reputation damage
A good disaster recovery plan is like an information insurance policy for a business. A
disaster recovery plan provides the ability to continue work after any number of catastrophic
problems, ranging from natural disasters such as floods, fires, and earthquakes to
planned or unplanned scenarios like database corruption, server failures, or simply human
error. Disaster recovery is becoming an increasingly important aspect for an organization.
Besides the fact that a disaster recovery plan is a must-have for the survival of your
organization, it has further benefits, such as improved business processes, improved
technology, fewer disruptions, higher quality services, and competitive advantages.
The maximum length of time a business function can be discontinued without causing
irreparable damage to the business is called the Maximum Tolerable Downtime (MTD). This
value must fall within the MAOT, which is given by management. After setting targets for
MTD, you must set targets for your Recovery Point Objective (RPO) and Recovery Time
Objective (RTO) for each process. You need these when disaster strikes: they let you give
a guarantee of how much data may be lost and how long it will take before you are back online.
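As a minimal illustration of how these targets interact, the sketch below checks whether a design meets its MTD, RTO, and RPO. The function name and all figures are invented for the example; real targets come from your Business Impact Analysis.

```python
def meets_targets(mtd_h, rto_target_h, rpo_target_min,
                  achievable_rto_h, replication_interval_min):
    """The achievable RTO must fit inside both its target and the MTD;
    the worst-case data loss equals the replication interval and must
    fit inside the RPO target."""
    rto_ok = achievable_rto_h <= min(rto_target_h, mtd_h)
    rpo_ok = replication_interval_min <= rpo_target_min
    return rto_ok and rpo_ok

# A process with MTD 8 h, RTO target 4 h, RPO target 15 min:
# failover takes 2 h and the array replicates every 5 min -> targets met.
print(meets_targets(8, 4, 15, 2, 5))          # True
# A nightly backup alone (up to 24 h of data loss) misses the RPO.
print(meets_targets(8, 4, 15, 2, 24 * 60))    # False
```

The second call fails only on the RPO check, which is exactly the kind of gap that data classification and the BIA from the earlier chapters are meant to expose before a disaster does.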
Make sure you have an Emergency Response Team ready. This ERT is a group of people
prepared for any emergency or big incident, such as a natural disaster or an interruption of
business operations. Emergency Response Team members typically are trained and
prepared to fulfill the roles required by the specific situation. Ideally the team has already
defined a protocol or set of actions to perform to mitigate the negative effects of the incident.
Traditional disaster recovery plans are often very complex and difficult. As virtualization is
now a critical component to an overall IT strategy, it is important to choose the right vendor.
Avoid unnecessary risk and overhead when choosing a robust and production-proven
hypervisor for your virtualized datacenter.
Not all hypervisors are equal. VMware has a true enabler for disaster recovery named
VMware Site Recovery Manager (SRM). VMware Site Recovery Manager is a business
continuity and disaster recovery solution that helps you plan, test, and execute a scheduled
migration or emergency failover of datacenter services from one site to another. As
mentioned in the introduction, you can test a recovery plan without ruining anything: you
can fail over at any time without damaging the infrastructure environment. Virtualization
these days can make disaster recovery implementations easy.
As the leader goes, so goes the organization. A disaster recovery plan needs executive
sponsorship; without it, a disaster recovery plan is not feasible. The executives are
responsible for making decisions relating to an organization's direction, strategy, and
financial commitment, and they approve the financing of hardware or software purchases.
Finally, the executive sponsorship role is needed to make decisions about a company's
policies, procedures, and strategic directions. Ensure that the plan has the attention of
executive management; when it does, it is more broad-based and more likely to succeed.
Disaster recovery and business continuity are extremely complex. This is often the reason
why companies hold back on a recovery strategy. What I try to convey with this article is
that we should not make disaster recovery too complicated. We can, but it isn't necessary.
The most important issue is that data is protected and that we can return this data quickly
to the organization. Surely we must consider risks and do everything to prevent them, but
this should not be your main concern. Your concern is to return to daily business as quickly
as possible.
Virtualization is a true enabler to recover after a disaster. Costs are relatively low and it is
very easy to integrate this into your infrastructure.
References
[1] IT Disaster Recovery Planning for Dummies, by Peter Gregory
[2] EMC Information Availability Design and Management course
[3] Source: Hank Marquis (2006), http://www.hankmarquis.com/articles.html
[4] http://dependencymapping.com/
[5] http://en.wikipedia.org
[6] Gartner RAS Core Research Note G00200526, Thomas J. Bittman, Philip Dawson,
George J. Weis, 26 May 2010, Magic Quadrant for x86 Server Virtualization Infrastructure
[7] EMC® Documentum® Content Server Backup and Recovery White Paper version 6.5,
Published January 2010
[8] VMware, http://www.vmware.com
[9] EMC, http://www.emc.com
Disclaimer: The views, processes, or methodologies published in this article are those of the
author. They do not necessarily reflect EMC Corporation's views, processes, or
methodologies.
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.