Failsafe 1 hour 2013

Post on 08-Apr-2017

Ulrich Homann, Marc Mercuri

FailSafe: Patterns for Implementing Resilient Cloud Applications

ARC302

Presented in 2013

Resiliency

Netflix is currently unavailable. Try again later.

What is a modern application?

Microsoft Dynamics: Line of Business Applications

Retail HQ

Manufacturing

Project

Point of Sale

Industry Operational Workloads

HR

Finance

SRM

Expense HCM

Sales and Distribution

Customer Care

Citizen Portal

Sales Force Automation

Marketing Automation

Administrative Core

Horizontal Operational Workloads

CRM Workloads

FailSafe Services

Cloud Services, Roles and Instances

Cloud Service is a management, configuration, security, networking and service model boundary

[Diagram: ROLES and their INSTANCES: VM1 VM2 VM3 VM4 VM5 VM…]

Fabrikam-CloudSvc

Cloud Service 1

WA Web Roles

Windows Azure

SQL Database

Data Access

What are the “9”s

90% ("one nine")
99% ("two nines")
99.9% ("three nines")
99.99% ("four nines")
99.999% ("five nines")
99.9999% ("six nines")
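The availability levels above are easier to compare as a downtime budget. The following is a minimal sketch, assuming a 365-day year and that availability is measured over the whole year; the function name is illustrative and not from the talk.

```python
# Convert an availability percentage into allowed downtime per year.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in ("90", "99", "99.9", "99.99", "99.999", "99.9999"):
    minutes = allowed_downtime_minutes(float(level))
    print(f"{level}% -> {minutes:,.2f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why the step from three nines to five nines is so much harder than the numbers suggest.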

The Truth About 9s

Throttling

Decompose by Workload

Define Lifecycle Model

Workload 1

Workload 2

Workload 1

Workload 2

Availability Model and Plan

Failure Points

Failure Modes

Failure Mode Example

catch (Exception e)

Scale

Resources

Demands

Unit of Scale

Workloads

Workload 1

Workload 2

Bottom Ramp Peak

Fault Domains

Fault and upgrade domains

• Failed component can’t take down service

• Isolated infrastructure

• Physical hosts, racks

• Network equipment

• Two by default

• Role instances across 2+ fault domains

Upgrade Domains

• VM rolling upgrades, no availability impact

• Logical grouping of role instances

• Five by default

• Role instances spread over upgrade domains

• Deployment upgraded for all or one at a time
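To make the two groupings concrete, here is a minimal sketch that spreads role instances over fault and upgrade domains and lists the batches a rolling upgrade would walk through. The round-robin placement is an assumption for illustration only; it is not the platform's actual allocation algorithm. The defaults (two fault domains, five upgrade domains) come from the slides.

```python
from collections import defaultdict

FAULT_DOMAINS = 2    # "Two by default"
UPGRADE_DOMAINS = 5  # "Five by default"


def place_instances(count, fault_domains=FAULT_DOMAINS,
                    upgrade_domains=UPGRADE_DOMAINS):
    """Assign each instance a fault and upgrade domain (round-robin assumption)."""
    return [
        {
            "instance": i,
            "fault_domain": i % fault_domains,
            "upgrade_domain": i % upgrade_domains,
        }
        for i in range(count)
    ]


def rolling_upgrade_batches(placements):
    """Group instance ids by upgrade domain; one batch is upgraded at a time."""
    batches = defaultdict(list)
    for p in placements:
        batches[p["upgrade_domain"]].append(p["instance"])
    return [batches[ud] for ud in sorted(batches)]


placements = place_instances(6)
print(rolling_upgrade_batches(placements))
```

With at least two instances per role, no single fault domain or upgrade batch holds every instance, which is the property the bullets above describe.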

Deployment Redundancy

Application considerations

Circuit Breaker at Netflix

Circuit Breaker at Netflix - Fallbacks
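The slides name the pattern without showing code, so the following is a generic sketch of a circuit breaker with a fallback, not Netflix's implementation. The class name, thresholds, and injected clock are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()   # open: skip the dependency entirely
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each dependency call, for example `breaker.call(fetch_recommendations, lambda: DEFAULT_LIST)`, where both names are hypothetical. The fallback keeps the page usable while the dependency recovers.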

Incorporate Open Standards

Data Partitioning

Data Decomposition
• Apply functional composition to database layer too
• Don’t force partitioning for the sake of partitioning; you will lose manageability
• Partition where and when required to reduce dependencies and to enable independent management and scale
• Reduce logic in SQL Databases; CRUD is acceptable; say NO to others
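One way to read the advice above is to split data by workload first and to partition a workload's data only where scale demands it. The sketch below assumes hypothetical workload and database names and uses a stable hash to pick a partition; none of these names come from the talk.

```python
import hashlib

# Hypothetical layout: one store per workload, partitioned only where needed.
STORES = {
    "catalog": ["catalog-db"],                                # single database
    "orders": ["orders-db-0", "orders-db-1", "orders-db-2"],  # partitioned
}


def store_for(workload: str, key: str) -> str:
    """Pick the database for a key; hash only when the workload is partitioned."""
    shards = STORES[workload]
    if len(shards) == 1:
        return shards[0]
    # A stable hash keeps the key on the same partition across processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


print(store_for("catalog", "sku-1"))
print(store_for("orders", "customer-42"))
```

Keeping the routing in one small function means an unpartitioned workload pays no complexity cost until it actually needs to scale.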

Understanding the 3Vs

Understanding Queryability

Data – to cache or not to cache….

Data on the inside – Data on the outside
http://msdn.microsoft.com/en-us/library/ms954587.aspx

“Query Ready” Cache

Query patterns
• Push the data close to where it is queried
  - Example: BING Maps
• Process, structure, produce, format etc. data and cache “query ready” data
• Light/cheap data production is OK
  - Pure and idempotent operations are usually good candidates
• Duplication is OK
  - Same data in a different format
  - Same data in multiple places
• This requires processing data before it is queried - NOT at the query time
• All data can be cached
• Some data can be cached:
  - Frequently used
  - Process heavy, expensive data
  - Build as you go
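The "query ready" idea can be sketched as a cache that does all shaping when data arrives, so that a query is a plain lookup. The class, record fields, and JSON output format below are assumptions made for illustration.

```python
import json


class QueryReadyCache:
    """Pre-compute responses at ingest time; queries only read them."""

    def __init__(self):
        self._by_region = {}  # region -> pre-rendered JSON string

    def ingest(self, records):
        """Group, sort, and format once, before any query arrives."""
        grouped = {}
        for record in records:
            grouped.setdefault(record["region"], []).append(record["name"])
        for region, names in grouped.items():
            self._by_region[region] = json.dumps(
                {"region": region, "names": sorted(names)}
            )

    def query(self, region):
        """No processing at query time: return the stored result or None."""
        return self._by_region.get(region)


cache = QueryReadyCache()
cache.ingest([
    {"region": "eu", "name": "b"},
    {"region": "eu", "name": "a"},
    {"region": "us", "name": "c"},
])
print(cache.query("eu"))
```

Ingesting the same records again produces the same stored result, which is the idempotent behaviour the slide calls a good candidate for caching.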

Backup and Restore

CDN

Latency shifts

• Direct users to the service in the closest region

Traffic Manager: Monitoring, Policies

[Diagram: foo.cloudapp.net resolves to foo-us.cloudapp.net, foo-europe.cloudapp.net, or foo-asia.cloudapp.net; DNS response: 1.2.3.4]
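The routing decision in the diagram can be approximated as "pick the lowest-latency endpoint that monitoring reports as healthy". This is a sketch of that policy only; the latency figures are made-up sample values and the function is not the Traffic Manager API.

```python
def resolve(latencies_ms, healthy):
    """Return the healthy endpoint with the lowest latency for this user."""
    candidates = {host: ms for host, ms in latencies_ms.items() if host in healthy}
    if not candidates:
        raise LookupError("no healthy endpoint available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements from one user's location (milliseconds).
sample_latencies = {
    "foo-us.cloudapp.net": 120,
    "foo-europe.cloudapp.net": 35,
    "foo-asia.cloudapp.net": 210,
}

print(resolve(sample_latencies, healthy=set(sample_latencies)))
```

If monitoring marks the closest region unhealthy, the same call falls through to the next closest one, so users see higher latency rather than an outage.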

Traffic Management

Cross-Premises Connectivity (Cloud and Enterprise)
• Application-Layer Connectivity & Messaging: Service Bus
• Data Synchronization: SQL Database Data Sync
• Secure Machine-to-Machine Connectivity: Windows Azure Connect
• Secure Site-to-Site Network Connectivity: Windows Azure Virtual Network
• App Monitoring & Management: System Center

Design for operations

What is a health model?

Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist

Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility e.g. security, performance, backup
• May be specific or non-technology e.g. orders shipped

Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well designed applications may support partial operation e.g. read only
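The three health model terms can be tied together in a small data structure: a managed entity holds one operational condition per aspect, and its overall health state is derived from them. The condition names and the worst-case rollup rule below are assumptions; the slides do not prescribe either.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Condition(IntEnum):
    """Operational conditions, ordered from best to worst (assumed names)."""
    NORMAL = 0       # fully functional
    PARTIAL = 1      # e.g. read only
    UNAVAILABLE = 2


@dataclass
class ManagedEntity:
    name: str
    aspects: dict = field(default_factory=dict)  # aspect name -> Condition

    def health(self) -> Condition:
        """Roll up to the worst aspect condition (a simple assumed policy)."""
        return max(self.aspects.values(), default=Condition.NORMAL)


orders_db = ManagedEntity(
    "orders-db",  # hypothetical entity name
    {"security": Condition.NORMAL, "performance": Condition.PARTIAL},
)
print(orders_db.name, orders_db.health().name)
```

Because aspects are mutually exclusive and grouped by responsibility, an operator can see both the overall state and which team owns the degraded aspect.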

Troubleshooting Workflow

Tools

Demo

FailSafe Modeling Tool

Test Plans Creation and Execution
 - Create, review, execute, and save test plans and executions
 - Test Execution Reports

Multi-Subscription Test Execution
 - Send disruptions to multiple Cloud applications
 - Ability to define the disruptions execution order

Multiple Disruption Delivery Mechanisms
 - Use WA Management API and/or Overlord Agent
 - Mix the Disruption delivery

Extensible Disruptions’ Database
 - Template Engine for PowerShell Scripts
 - Ability to execute programs (NotMyFault.exe)

Cloud Overlord Testing Framework
 - Fault Injection Testing Framework
 - Generate consistent and repeatable platform level disruptions

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.