Presentation best practices for fault tolerance for tier 1


Transcript of Presentation best practices for fault tolerance for tier 1

Page 1: Presentation   best practices for fault tolerance for tier 1

© 2009 VMware Inc. All rights reserved

Confidential

Best Practices for Fault Tolerance for Tier 1 Apps

Randy Carson

Sr. System Engineer

[email protected]

Page 2: Presentation   best practices for fault tolerance for tier 1

Agenda

• Virtualizing Tier 1 Apps

• Building a Virtualized DR Solution

• Site Recovery Manager

• Summary / Questions

Page 3: Presentation   best practices for fault tolerance for tier 1

Applications Are Key Milestone on Private Cloud Journey

Phase 1 - Explore: Manage hypervisors, VMs and dev/test environments

Phase 2 - Expand: Manage large, dynamic, shared infrastructure including enterprise applications

Phase 3 - Standardize on the private cloud: Self-service IT with policy-driven automation to ensure service levels

Can I deploy my Tier 1 apps on VMware?

• Performance

• ISV support

• Consolidation

• Guarantee Uptime

Why should I deploy my Tier 1 apps on VMware?

• Accelerate app lifecycle

• Guarantee app QoS

• Cost reduction

Page 4: Presentation   best practices for fault tolerance for tier 1

CIOs and IT Operations Want to Virtualize More…

“Our CIO told the team that we need to virtualize as much as we can as soon as we can”

VI Admin, VMware Customer

Organizations are on the path to 100% virtualization for:

• Consolidation and infrastructure efficiency

• Simpler management

• Built-in availability

• Greater Agility

Page 5: Presentation   best practices for fault tolerance for tier 1

The Trend Is Clear…

In a recent Gartner poll, 73% of customers claimed to use x86 virtualization for mission critical applications in production

Source: Gartner IOM Conference (June 2008), “Linux and Windows Server Virtualization Is Picking Up Steam” (ID Number: G00161702)

[Bar chart: % of customers running apps in production on VMware. Bar labels, in extraction order: MS Exchange, MS SharePoint, MS SQL, Oracle Middleware, Oracle DB, IBM WebSphere, IBM DB2, SAP. Bar values, in extraction order: 36%, 53%, 56%, 41%, 34%, 50%, 24%, 27%]

Source: VMware customer survey, September 2008, sample size 1038

Data: Within subset of VMware customers running a specific app, % that have at least one instance of that app in production in a VM

Page 6: Presentation   best practices for fault tolerance for tier 1

“No way,” Say Many App Owners

“The Exchange admin vetoed the virtualization project, he felt it was too risky”

-VI Admin, VMware Customer

IT / VI Admins and CIO

• Standardize on VMware

• Virtualization first

• Consolidation and infrastructure efficiency

• Reduce operational complexity

App Owners

• Why change?

• Can Virtual Machines handle my performance requirements?

• What’s in it for me?

• My app is too important to run on shared hardware

Can You Afford NOT To Run Your Business Critical Apps on VMware?

Page 7: Presentation   best practices for fault tolerance for tier 1

“Yes You Can!” - Debunking Common Objections for Tier 1 Apps

Performance

Virtual Machines run the most demanding apps

• 8 vCPUs and 255 GB of memory

• Small overhead (typically 2% to 10%)

Scale apps better on large multi-core servers

• Double server capacity for Exchange 2007

ISV Support / Licensing

Majority of large ISVs support VMware

• Microsoft, SAP and IBM provide full support

• Oracle in grey zone – has support statement

Licensing costs often reduced with virtualization

• “Per vCPU” licensing: pay only for what you use

• “Physical processor” licensing: consolidate multiple licenses on shared cluster

Page 8: Presentation   best practices for fault tolerance for tier 1

Deliver Apps as Dynamic, Cost-Efficient IT Services

Consolidation: Reduce Hardware and Software Costs

• Achieve 5X - 10X consolidation for large apps

Performance: Match or Exceed Physical Performance

• Maximize scale out performance on large multi-core servers

App Lifecycle: Accelerate Application Lifecycle

• Provision On-Demand in production and in the labs

ISV Support: Leverage expanding ISV Ecosystem

• Includes Microsoft, SAP, Oracle, and IBM

Quality of Service: Guarantee Application Quality of Service

• Scale dynamically to ensure service levels

• Provide built-in High Availability and reliable Disaster Recovery

Page 9: Presentation   best practices for fault tolerance for tier 1

Apps Run Better On The Private Cloud

Consolidation: Cut Infrastructure and Software License Costs

• Achieve 5X - 10X server consolidation for large apps

• Increase utilization of software licenses

App Lifecycle: Accelerate App Lifecycle from Dev to Production

• Reduce provisioning times from weeks to minutes

• Self-service provisioning

Quality of Service: Guarantee Application Quality of Service

• Policy-driven Service Level assurance

• Provide cost-effective HA and simple Disaster Recovery

Page 10: Presentation   best practices for fault tolerance for tier 1

Agenda

• Challenges for Virtualizing Tier 1 Apps

• Building a Virtualized BC / FT Solution

• Site Recovery Manager

• Summary / Questions

Page 11: Presentation   best practices for fault tolerance for tier 1

VMware Solutions Maximize Uptime

[Diagram: VMware features arranged by datacenter layer under two headings, "Prevent Planned Downtime" and "Minimize Unplanned Downtime"]

Layers shown: Server, Storage, Interconnect, Site

Features shown:

• VMotion + DRS Maintenance Mode

• Storage vMotion

• Network Redundancy

• NIC & HBA Teaming

• Consolidated Backup + backup software, Data Recovery

• HA, Fault Tolerance

• Site Recovery Manager

Page 12: Presentation   best practices for fault tolerance for tier 1

VMware VMotion

62% of VMware customers have implemented VMotion

Live migration of virtual machines

Zero downtime

Page 13: Presentation   best practices for fault tolerance for tier 1

EVC Cluster Requirements

Hosts

• CPUs from a single vendor, either Intel or AMD

• Running ESX Server 3.5 Update 2 or later

• Connected to vCenter Server

• Hardware virtualization support (AMD‐V or Intel VT) enabled

• AMD No eXecute (NX) or Intel eXecute Disable (XD) technology enabled

• Support hardware live migration (AMD-V Extended Migration or Intel FlexMigration) or have baseline processor of intended feature set

Virtual Machines

• Powered off or migrated out of cluster when EVC is enabled

• Applications on virtual machines must use the CPUID instruction to detect CPU features
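
The host requirements on this slide read naturally as a pre-flight checklist. The sketch below is illustrative Python only and does not call any VMware API: the `Host` record, its field names, and the `evc_blockers` function are hypothetical, and the ESX version is simplified to a `(major, minor, update)` tuple.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical inventory record; every field name is an assumption."""
    name: str
    cpu_vendor: str                     # "Intel" or "AMD"
    esx_version: tuple                  # (major, minor, update), e.g. (3, 5, 2)
    connected_to_vcenter: bool = True
    hw_virtualization: bool = True      # AMD-V or Intel VT enabled
    nx_xd_enabled: bool = True          # AMD NX or Intel XD enabled
    migration_support: bool = True      # Extended Migration / FlexMigration, or baseline CPU


MIN_ESX = (3, 5, 2)  # ESX Server 3.5 Update 2, per the slide


def evc_blockers(hosts):
    """Return the reasons a set of hosts fails the slide's EVC checklist."""
    problems = []
    vendors = sorted({h.cpu_vendor for h in hosts})
    if len(vendors) > 1:
        problems.append("mixed CPU vendors: " + ", ".join(vendors))
    for h in hosts:
        checks = [
            (h.esx_version >= MIN_ESX, "ESX older than 3.5 Update 2"),
            (h.connected_to_vcenter, "not connected to vCenter Server"),
            (h.hw_virtualization, "hardware virtualization support disabled"),
            (h.nx_xd_enabled, "NX/XD disabled"),
            (h.migration_support, "no hardware live migration support"),
        ]
        problems += [f"{h.name}: {msg}" for ok, msg in checks if not ok]
    return problems
```

An empty result means every listed host requirement is met; the virtual machine requirements (powered off or migrated out of the cluster) would still need a separate check.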

Page 14: Presentation   best practices for fault tolerance for tier 1

Zero-downtime maintenance using VMware

• Activate Maintenance Mode for physical host

• DRS migrates running virtual machines to other hosts

• Shut down idle host and perform maintenance

• Restart host; DRS automatically rebalances workloads

Use VMotion to evacuate hosts

• Move running applications to other servers without disruption

• Perform maintenance at any time of day

Automate with DRS maintenance mode

• Automates moving virtual machines to other hosts

• Automates re-balancing after maintenance complete
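
The maintenance-mode workflow on this slide can be pictured with a toy evacuation routine. This is a minimal sketch, not the DRS placement algorithm: the function name, the single "load" number per VM, and the greedy least-loaded-host rule are all assumptions made for illustration.

```python
def enter_maintenance_mode(placement, host, capacity):
    """Evacuate `host` by moving each of its VMs to the least-loaded other host.

    placement: dict mapping host name -> list of (vm_name, load) tuples
    capacity:  dict mapping host name -> maximum total load
    Returns the list of (vm_name, source, destination) moves.
    Raises RuntimeError and leaves `placement` untouched if a VM has no home.
    """
    others = [h for h in placement if h != host]
    load = {h: sum(l for _, l in placement[h]) for h in others}
    planned = []
    # Place the largest VMs first so they are not squeezed out by small ones.
    for vm, vm_load in sorted(placement[host], key=lambda v: -v[1]):
        candidates = [h for h in others if load[h] + vm_load <= capacity[h]]
        if not candidates:
            raise RuntimeError(f"no host has room for {vm}")
        dest = min(candidates, key=lambda h: load[h])
        load[dest] += vm_load
        planned.append((vm, vm_load, dest))
    # Apply the plan only once every VM has a destination.
    for vm, vm_load, dest in planned:
        placement[dest].append((vm, vm_load))
    placement[host] = []
    return [(vm, host, dest) for vm, _, dest in planned]
```

Once the function returns, the host holds no VMs and can be shut down for maintenance; rebalancing after the host comes back is outside this sketch.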

Page 15: Presentation   best practices for fault tolerance for tier 1

New DRS Management Pages

Pages shown: Recommendations page, Faults page, History tab

• Refresh recommendations

• Apply a subset of recommendations

• Apply all selected recommendations

• Edit cluster properties

• Faults view displays issues that prevented DRS from providing or applying recommendations.

• Actions taken based on recommendations

• Customize the display

Page 16: Presentation   best practices for fault tolerance for tier 1

Storage VMotion in vSphere 4

Enhancements

• Can administer via vSphere Client

• Supports NFS, Fibre Channel, and iSCSI

• No longer requires 2 x memory

• Supports moving VMDKs from thick to thin formats

• Can migrate RDMs to RDMs and RDMs to VMDKs (non-passthrough)

• Leverages new vSphere 4 features to speed migration

Limitations

• Virtual machine cannot include snapshots

• VM must be powered off to simultaneously migrate both host and datastore

Page 17: Presentation   best practices for fault tolerance for tier 1

Storage VMotion in vSphere 4

[Diagram: Source and Destination datastores, with numbered arrows for steps 1 to 5]

1. Copy virtual machine files except disks to new datastore

2. Enable changed block tracking on the virtual machine’s disk

3. “Pre-copy” virtual machine’s disk and swap file from source to destination

4. Invoke fast suspend/resume on virtual machine

5. Remove source home and disks of virtual machine
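
Steps 2 to 4 of this sequence (track changed blocks, pre-copy, then a brief suspend to copy what is left) can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not VMware's implementation: disks are Python lists, guest writes are supplied per copy pass, and the `threshold` and `max_passes` parameters are hypothetical.

```python
def storage_vmotion(source, writes_per_pass, threshold=1, max_passes=5):
    """Toy pre-copy loop driven by changed block tracking.

    source:          list of block values (mutated by simulated guest writes)
    writes_per_pass: list of dicts {block_index: new_value}; entry k holds the
                     guest writes that land while copy pass k is running
    Returns (destination_blocks, number_of_precopy_passes).
    """
    dest = [None] * len(source)
    to_copy = set(range(len(source)))       # the first pass copies every block
    pending = iter(writes_per_pass)
    passes = 0
    while len(to_copy) > threshold and passes < max_passes:
        for i in to_copy:
            dest[i] = source[i]
        # The VM keeps running: record every block it touches during the pass.
        dirty = set()
        for idx, value in next(pending, {}).items():
            source[idx] = value
            dirty.add(idx)
        to_copy = dirty                     # only changed blocks need another pass
        passes += 1
    # Fast suspend/resume: the VM is paused, so nothing can change while the
    # last few dirty blocks are copied and the VM resumes on the destination.
    for i in to_copy:
        dest[i] = source[i]
    return dest, passes
```

Each pass only has to move the blocks dirtied during the previous one, which is why the final suspended copy can be short.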

Page 18: Presentation   best practices for fault tolerance for tier 1

New HA Cluster Settings

Ability to suspend host monitoring

Choice of three admission control strategies

Page 19: Presentation   best practices for fault tolerance for tier 1

VM Monitoring

Enable automatic restart due to failure of guest operating system

Determine how quickly failures are detected

Set monitoring sensitivity for individual virtual machines
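
The idea of a monitoring sensitivity with per-VM overrides can be sketched as a heartbeat timeout check. This is a hedged illustration only: the function name, the timestamps, and the interval values are hypothetical and are not taken from the product's settings.

```python
def vms_to_restart(last_heartbeat, now, failure_interval, overrides=None):
    """Return the VMs whose last heartbeat is older than their failure interval.

    last_heartbeat:   dict mapping VM name -> time of last heartbeat (seconds)
    failure_interval: cluster-wide default, in seconds (hypothetical value)
    overrides:        optional dict mapping VM name -> its own interval
    """
    overrides = overrides or {}
    return sorted(
        vm
        for vm, seen in last_heartbeat.items()
        if now - seen > overrides.get(vm, failure_interval)
    )
```

A shorter interval detects guest failures sooner but is more likely to restart a VM that is merely busy, which is the trade-off the sensitivity setting exposes.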

Page 20: Presentation   best practices for fault tolerance for tier 1

VMware Fault Tolerance

Single identical VMs running in lockstep on separate hosts

Zero downtime, zero data loss failover for all virtual machines in case of hardware failures

Integrated with VMware HA/DRS

2-node VM pairs, multiple FT VMs per host

Dynamic enablement / disablement

No complex clustering or specialized hardware required

Single common mechanism for all applications and operating systems

Single vCPU VMs supported

[Diagram: virtual machines (App/OS) on two VMware ESX hosts, labeled HA and FT]

Page 21: Presentation   best practices for fault tolerance for tier 1

VMware Fault Tolerance (FT)

VMware FT provides zero-downtime, zero-data-loss protection to virtual machines in an HA cluster.

[Diagram: Primary and Secondary virtual machines linked by vLockstep Technology; New Primary and New Secondary virtual machines linked by vLockstep Technology]
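
The primary/secondary relationship on this slide can be pictured with a toy lockstep model. The sketch below is not how vLockstep is implemented: the classes are hypothetical, the "VM" is a trivial deterministic state machine, and the new secondary is built by replaying a log only because that is the simplest way to show both copies ending up identical.

```python
class ReplicaVM:
    """Deterministic toy VM: identical inputs always yield identical state."""

    def __init__(self):
        self.state = 0

    def apply(self, value):
        self.state = self.state * 31 + value


class FtPair:
    """Keeps a primary and a secondary in step by feeding both the same inputs."""

    def __init__(self):
        self.primary = ReplicaVM()
        self.secondary = ReplicaVM()
        self.log = []                      # inputs recorded on the primary

    def handle_input(self, value):
        self.log.append(value)
        self.primary.apply(value)
        self.secondary.apply(value)        # replayed on the secondary

    def fail_primary(self):
        # The secondary takes over with no state lost, then a fresh
        # secondary is brought in step so protection continues.
        self.primary = self.secondary
        self.secondary = ReplicaVM()
        for value in self.log:
            self.secondary.apply(value)
```

Because both copies see the same inputs in the same order, promoting the secondary loses no state, which is the property the slide describes as zero-downtime, zero-data-loss protection.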

Page 22: Presentation   best practices for fault tolerance for tier 1

Enable Fault Tolerance with a Single Click

Primary Virtual Machine > Summary Tab

After you turn on Fault Tolerance, the Status tab on the primary virtual machine shows Fault Tolerance information.

Page 23: Presentation   best practices for fault tolerance for tier 1

Target VMware FT Applications

Workload Type | Application | Rationale

Database | Small to medium instances that are strategic to IT infrastructure | Costs to deploy traditional cluster solutions not justified but availability is a must

Exchange and messaging | < 1000 users | Reduced licensing and management costs

Remote Branch Office | Many workloads | SLA requirements require a traditional cluster ($$$). Deliver high availability at lower cost and easier to administer.

Custom applications | Business-specific solutions | Cluster solutions not available today

Page 24: Presentation   best practices for fault tolerance for tier 1

Agenda

• Challenges for Virtualizing Tier 1 Apps

• Building a Virtualized BC / FT Solution

• Site Recovery Manager

• Summary / Questions

Page 25: Presentation   best practices for fault tolerance for tier 1

Recovery Risk

Drivers of risk

• New applications or changing app/infrastructure configuration

• Gap between current configuration and last revision of the DR plan

• Human error and manual steps during DR testing & failover

• Availability of key DR staff

• Lengthy recovery time

• Increasing complexity of managing the DR solution

Associated costs

• Lost business & productivity for each hour of downtime

• (Unpredictable) staff overtime

• Application end-users disrupted by testing & outages; inability to meet SLAs

Page 26: Presentation   best practices for fault tolerance for tier 1

Reducing and Managing Recovery Risk

During the testing gap, organizations can’t be sure that they can recover the current IT environment

A failover scenario may take days or weeks to complete, leaving the business at extreme risk

Virtualization & DR Automation Greatly Reduce Recovery Risk

[Figure: two plots of Recovery Risk over Time. "IT Environment without Virtualization & DR Automation": a TESTING GAP of Unproven Recoverability between DR Tests. "Virtualization + DR Automation": Frequent DR Testing]

Page 27: Presentation   best practices for fault tolerance for tier 1

Best Practices for Recovery Risk Mitigation

Frequent testing to ensure DR plan correct & successful

Automation to minimize mistakes and speed up recovery time

Tight integration between infrastructure management and DR solution

Multiple layers of downtime protection at all levels of the datacenter

Page 28: Presentation   best practices for fault tolerance for tier 1

VMware vCenter Site Recovery Manager

Site Recovery Manager leverages VMware Infrastructure to deliver advanced disaster recovery management and automation

• Simplifies and automates disaster recovery workflows: setup, testing, failover

• Turns manual recovery runbooks into automated recovery plans

• Provides central management of recovery plans from the VMware Infrastructure Client

Works with VMware Infrastructure to make disaster recovery rapid, reliable, manageable, affordable

Page 29: Presentation   best practices for fault tolerance for tier 1

Site Recovery Manager Key Components

[Diagram: two sites, each with Virtual Machines, vCenter Server, Site Recovery Manager, VMware Infrastructure, Servers and Storage, connected by Storage Partner Replication]

Site Recovery Manager

> Manages and monitors recovery plans

> Tightly integrated with vCenter Server

Storage

> iSCSI or Fibre Channel storage

Partner Replication

> Integrated via replication adapters created, certified and supported by replication vendor

VMware Infrastructure

> Requires supported version of ESX

> Requires supported version of vCenter Server

Page 30: Presentation   best practices for fault tolerance for tier 1

Site Recovery Manager: User Interface

Managed through VirtualCenter plug-in

Key configuration steps

Page 31: Presentation   best practices for fault tolerance for tier 1

Disaster Recovery Setup

• Simplify configuration of recovery infrastructure and process

• Simplify coordination of replication with virtual environment

Integrate with replication: Identify which virtual machines are protected by replication configuration

Map recovery resources: Server resources, network resources, management objects

Create recovery plans: For virtual machines, applications, business units

Convert manual runbook to pre-programmed response

Customizable with scripting and callouts

Page 32: Presentation   best practices for fault tolerance for tier 1

Site Recovery Manager: Creating and Editing Recovery Plans

Recovery plans for failure scenarios

Recovery plan editor

Page 33: Presentation   best practices for fault tolerance for tier 1

Testing

• Non-disruptive testing of recovery plans

• Testing can incorporate existing/non-virtual DR tools and processes

Replication Management

• Snapshot replicated LUNs before test

• Delete snapshots of replicated LUNs after test

Network Management

• Change all virtual machines to a test port group before powering them on

Customization/extensibility

• Same breakpoints and callouts as failover sequence

• Extra breakpoints and callouts around the test bubble
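
The ordering constraints on this slide (snapshot first, test port group before power-on, snapshot cleanup last) can be written down as a small plan runner. This is a sketch under stated assumptions, not the Site Recovery Manager API: the step names and the `callouts` hook are hypothetical, and the power-off step before cleanup is an assumption the slide does not state.

```python
def run_recovery_test(vms, callouts=None):
    """Return the ordered actions of one non-disruptive test run.

    vms:      VM names in the order they should be powered on
    callouts: optional dict mapping a step name to a function that is
              called just before every occurrence of that step
    """
    callouts = callouts or {}
    actions = []

    def step(name, target=""):
        if name in callouts:
            callouts[name]()               # user-supplied script or check
        actions.append(f"{name} {target}".strip())

    step("snapshot_replicated_luns")       # the test runs against snapshots
    for vm in vms:
        step("connect_test_port_group", vm)    # isolate before power-on
        step("power_on", vm)
    step("power_off_test_vms")             # assumption: cleanup precedes deletion
    step("delete_lun_snapshots")
    return actions
```

Running the test against snapshots on an isolated port group is what keeps it non-disruptive: neither replication nor the production network is touched.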

Page 34: Presentation   best practices for fault tolerance for tier 1

Testing and Executing Recovery Plans

Steps in recovery plan

Status and time stamps

When to execute

User confirmation message

Page 35: Presentation   best practices for fault tolerance for tier 1

Failover Automation

• Automation for failover (and failback) process

• Real-time, step-by-step visibility into execution progress

Detect site failures

• Raise alert when heartbeat lost

Initiate failover

• User confirmation of outage

• Granular failover initiation

Manage replication failover

• Break replication

• Make replica visible to recovery hosts

Execute recovery process

• Use pre-programmed plan

• Provide visibility into progress
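
The failover flow on this slide is gated on a human decision: a lost heartbeat only raises an alert, and nothing is failed over until the outage is confirmed. The sketch below illustrates that gating in plain Python; the function and action names are hypothetical and do not correspond to a real Site Recovery Manager interface.

```python
def failover(heartbeat_ok, confirmed_by_user, plan_steps):
    """Return the actions taken for one evaluation of the protected site.

    heartbeat_ok:      False when the heartbeat from the protected site is lost
    confirmed_by_user: True once an operator has confirmed the outage
    plan_steps:        ordered step descriptions from the recovery plan
    """
    if heartbeat_ok:
        return []
    actions = ["raise_alert"]
    if not confirmed_by_user:
        return actions                     # wait for a human decision
    actions += ["break_replication", "present_replica_to_recovery_hosts"]
    total = len(plan_steps)
    for number, description in enumerate(plan_steps, start=1):
        # Numbering each step gives step-by-step visibility into progress.
        actions.append(f"step {number}/{total}: {description}")
    return actions
```

Requiring confirmation means a broken link between sites cannot trigger an unwanted failover by itself.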

Page 36: Presentation   best practices for fault tolerance for tier 1

Failover Initiation

Page 37: Presentation   best practices for fault tolerance for tier 1

Simplified Compliance

Self-documenting recovery plans

• Centrally managed

• Always current

Easier testing

• Ensure recoverability with realistic testing

Auditable testing and failover

• View and export recovery plans, tests, execution

Page 38: Presentation   best practices for fault tolerance for tier 1

Agenda

• Challenges for Virtualizing Tier 1 Apps

• Building a Virtualized BC / FT Solution

• Site Recovery Manager

• Summary / Questions

Page 39: Presentation   best practices for fault tolerance for tier 1

Deliver Apps as Dynamic, Cost-Efficient IT Services

Consolidation: Reduce Hardware and Software Costs

• Achieve 5X - 10X consolidation for large apps

Performance: Match or Exceed Physical Performance

• Maximize scale out performance on large multi-core servers

App Lifecycle: Accelerate Application Lifecycle

• Provision On-Demand in production and in the labs

ISV Support: Leverage expanding ISV Ecosystem

• Includes Microsoft, SAP, Oracle, and IBM

Quality of Service: Guarantee Application Quality of Service

• Scale dynamically to ensure service levels

• Provide built-in High Availability and reliable Disaster Recovery

Page 40: Presentation   best practices for fault tolerance for tier 1

Questions?