Post on 08-Apr-2017
Ulrich Homann, Marc Mercuri
FailSafe: Patterns for Implementing Resilient Cloud Applications
ARC302
Presented in 2013
Resiliency
End Slide
Netflix is currently unavailable. Try again later.
What is a modern application?
Microsoft Dynamics: Line of Business Applications
Retail HQ
Manufacturing
Project
Point of Sale
Industry Operational Workloads
HR
Finance
SRM
Expense HCM
Sales and Distribution
Customer Care
Citizen Portal
Sales Force Automation
Marketing Automation
Administrative Core
Horizontal Operational Workloads
CRM Workloads
FailSafe Services
Cloud Services, Roles and Instances
Cloud Service is a management, configuration, security, networking and service model boundary
VM1 VM2 VM3
VM4 VM5 VM…
INSTANCES
ROLES
Fabrikam-CloudSvc
Cloud Service 1
WA Web Roles
Windows Azure
SQL Database
Data Access
What are the “9”s?
• 90% ("one nine")
• 99% ("two nines")
• 99.9% ("three nines")
• 99.99% ("four nines")
• 99.999% ("five nines")
• 99.9999% ("six nines")
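To make the "nines" concrete, each availability level can be converted into a yearly downtime budget. The sketch below is a plain arithmetic illustration, assuming a 365-day year; the function name `allowed_downtime_minutes` is hypothetical and not from the presentation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# The six levels listed on the slide, from "one nine" to "six nines".
NINES = [90.0, 99.0, 99.9, 99.99, 99.999, 99.9999]

if __name__ == "__main__":
    for level in NINES:
        minutes = allowed_downtime_minutes(level)
        print(f"{level}% -> {minutes:,.2f} minutes of downtime per year")
```

Each additional nine cuts the budget by a factor of ten, which is why moving from three nines to five nines is an architectural change rather than a tuning exercise.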
The Truth About 9s
Throttling
Decompose by Workload
Define Lifecycle Model
Workload 1
Workload 2
Workload 1
Workload 2
Availability Model and Plan
Failure Points
Failure Modes
Failure Mode Example
catch (Exception e)
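One possible reading of the catch-all handler above is that it hides distinct failure modes behind a single code path. The sketch below shows one way to separate them in Python, under assumptions: the exception classes `TransientError` and `PermanentError` and the status strings are hypothetical names chosen for illustration, not part of the presentation.

```python
class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying will not fix (e.g. bad input)."""


def call_with_failure_modes(operation, max_attempts=3):
    """Run operation, handling each failure mode with its own response."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ("ok", operation())
        except TransientError:
            # Transient mode: retry, and degrade once the attempts run out.
            if attempt == max_attempts:
                return ("degraded", None)
        except PermanentError:
            # Permanent mode: retrying cannot help, so fail immediately.
            return ("failed", None)
    # Any other exception type propagates to the caller unchanged.
```

Listing the failure modes explicitly forces a decision about each one, which a blanket handler postpones.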
Scale
Resources
Demands
Unit of Scale
Workloads
Workload 1
Workload 2
Bottom Ramp Peak
Fault Domains
Fault and upgrade domains
• Failed component can’t take down service
• Isolated infrastructure
  • Physical hosts, racks
  • Network equipment
• Two by default
• Role instances across 2+ fault domains
Upgrade Domains
• VM rolling upgrades, no availability impact
• Logical grouping of role instances
• Five by default
• Role instances spread over upgrade domains
• Deployment upgraded for all or one at a time
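The effect of spreading role instances over fault and upgrade domains can be sketched as follows. This is an illustration only: the round-robin placement is an assumption made for the example, not a description of how the Windows Azure fabric actually places instances, and both function names are hypothetical. The defaults of two fault domains and five upgrade domains come from the slides.

```python
def assign_domains(instance_count, fault_domains=2, upgrade_domains=5):
    """Spread instances over domains round-robin (an assumed placement policy)."""
    return [
        {
            "instance": i,
            "fault_domain": i % fault_domains,
            "upgrade_domain": i % upgrade_domains,
        }
        for i in range(instance_count)
    ]


def survivors_after_fault(placement, failed_fault_domain):
    """Return the instances still running after one fault domain fails."""
    return [
        p["instance"]
        for p in placement
        if p["fault_domain"] != failed_fault_domain
    ]


if __name__ == "__main__":
    placement = assign_domains(4)
    print(survivors_after_fault(placement, failed_fault_domain=0))
```

With a single instance, losing its fault domain leaves no survivors, which is why a role needs at least two instances before a failed component can no longer take down the service.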
Deployment Redundancy
Application considerations
Circuit Breaker at Netflix
Circuit Breaker at Netflix - Fallbacks
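The circuit breaker with fallback can be sketched in a few lines. This is a minimal, single-threaded illustration of the general pattern and not Netflix's implementation; the class name, the thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed, open, and half-open states."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the dependency
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            # Deliberately broad: any failure of the dependency counts.
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each dependency call, for example `breaker.call(fetch_recommendations, lambda: DEFAULT_LIST)`, where both names are placeholders. While the circuit is open the fallback is returned immediately, so a failing dependency stops consuming threads and time in the caller.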
Incorporate Open Standards
Data Partitioning
Data Decomposition
• Apply functional composition to database layer too
• Don’t force partitioning for the sake of partitioning; you will lose manageability
• Partition where and when required to reduce dependencies and to allow independent management and scale
• Reduce logic in SQL Databases; CRUD is acceptable; say NO to others
Understanding the 3Vs
Understanding Queryability
Horizontal Partitioning
David Alexander, Jarred Carlson, Sue Charles, Simon Mitchel, Richard Zeng
A C M Z
Vertical Partitioning
David Alexander, Jarred Carlson, Sue Charles
Simon Mitchel
Richard Zeng
Hybrid Partitioning
David Alexander, Jarred Carlson, Sue Charles
Simon Mitchel
Richard Zeng
A-L M-Z
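The A-L / M-Z split shown above can be sketched as a routing function that assigns each row to a partition by the first letter of the last name. This is an illustration under assumptions: the function names are hypothetical, the last word of each name is treated as the last name, and anything that does not fall in A-L is sent to the M-Z partition.

```python
def partition_for(last_name: str) -> str:
    """Choose a partition from the first letter of the last name."""
    initial = last_name[0].upper()
    # Assumption: initials outside A-L (including non-letters) go to M-Z.
    return "A-L" if "A" <= initial <= "L" else "M-Z"


def partition_rows(full_names):
    """Group 'First Last' names into the two range partitions."""
    partitions = {"A-L": [], "M-Z": []}
    for name in full_names:
        last_name = name.split()[-1]
        partitions[partition_for(last_name)].append(name)
    return partitions


if __name__ == "__main__":
    people = [
        "David Alexander",
        "Jarred Carlson",
        "Sue Charles",
        "Simon Mitchel",
        "Richard Zeng",
    ]
    for key, rows in partition_rows(people).items():
        print(key, rows)
```

Range partitioning keeps lookups by last name on a single partition, at the cost of uneven partitions when the key distribution is skewed.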
Data – to cache or not to cache….
Data on the inside – Data on the outside: http://msdn.microsoft.com/en-us/library/ms954587.aspx
“Query Ready” Cache
Query patterns
• Push the data close to where it is queried
  • Example: BING Maps
• Process, structure, produce, format etc. data and cache “query ready” data
• Light/cheap data production is OK
  • Pure and idempotent operations are usually good candidates
• Duplication is OK
  • Same data in a different format
  • Same data in multiple places
• This requires processing data before it is queried - NOT at the query time
• All data can be cached
• Some data can be cached:
  • Frequently used
  • Process heavy, expensive data
  • Build as you go
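The "query ready" idea, doing the processing before the data is queried rather than at query time, can be sketched as a cache that renders each record when it is written. The class and method names are hypothetical, and JSON rendering stands in for whatever processing and formatting a real workload needs.

```python
import json


class QueryReadyCache:
    """Stores each record twice: raw, and pre-rendered for queries."""

    def __init__(self):
        self._raw = {}    # the source data
        self._ready = {}  # same data in a different, query-ready format

    def write(self, key, record):
        """Do the processing at write time, before any query arrives."""
        self._raw[key] = record
        self._ready[key] = self._render(record)

    @staticmethod
    def _render(record):
        # Pure and idempotent: the same record always renders the same way.
        return json.dumps(record, sort_keys=True)

    def query(self, key):
        """Return the pre-rendered form; no processing happens here."""
        return self._ready[key]


if __name__ == "__main__":
    cache = QueryReadyCache()
    cache.write("tile-1", {"zoom": 3, "x": 1})
    print(cache.query("tile-1"))
```

The duplication is intentional: the raw copy remains the source of truth, while the rendered copy exists only to make reads cheap.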
Backup and Restore
CDN
Latency shifts
• Direct users to the service in the closest region
Traffic Manager: Monitoring, Policies
foo.cloudapp.net
foo-us.cloudapp.net
foo-europe.cloudapp.net
foo-asia.cloudapp.net
DNS response: 1.2.3.4
Traffic Management
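The routing decision behind "direct users to the service in the closest region" can be sketched as picking the healthy endpoint with the lowest measured latency. This models only the selection policy, not DNS resolution or the Traffic Manager service itself; the function name and the latency figures are hypothetical, while the hostnames are the ones shown on the slide.

```python
def pick_endpoint(latencies_ms, healthy):
    """Return the healthy endpoint with the lowest latency for a user."""
    candidates = {
        host: ms for host, ms in latencies_ms.items() if host in healthy
    }
    if not candidates:
        raise LookupError("no healthy endpoint available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies, in milliseconds, as seen by a user in Europe.
    latencies = {
        "foo-us.cloudapp.net": 120,
        "foo-europe.cloudapp.net": 25,
        "foo-asia.cloudapp.net": 210,
    }
    healthy = {"foo-us.cloudapp.net", "foo-asia.cloudapp.net"}
    # Monitoring marked the European endpoint unhealthy, so the
    # next-closest region is chosen instead.
    print(pick_endpoint(latencies, healthy))
```

Combining monitoring with the latency policy is what produces the latency shift: when the closest region fails, users are still served, only from farther away.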
Cloud Enterprise
Application-Layer Connectivity & Messaging: Service Bus
Data Synchronization: SQL Database Data Sync
Secure Machine-to-Machine Connectivity: Windows Azure Connect
Secure Site-to-Site Network Connectivity: Windows Azure Virtual Network
App Monitoring & Management: System Center
Cross-Premises Connectivity
Design for operations
What is a health model?
Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist
Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility e.g. security, performance, backup
• May be specific or non-technology e.g. orders shipped
Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well designed applications may support partial operation e.g. read only
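The three health-model concepts (managed entity, aspect, operational condition) can be sketched as a small data structure. This is an illustration under assumptions: the type names, the three condition levels, and the rule that an entity's health is its worst aspect are choices made for the example and are not stated in the presentation.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Condition(IntEnum):
    """Operational condition; higher values mean less function is available."""
    FULLY_FUNCTIONAL = 0
    PARTIAL = 1      # e.g. read only
    UNAVAILABLE = 2


@dataclass
class ManagedEntity:
    """A logical piece of the application that makes sense to an operator."""
    name: str
    # One entry per aspect, e.g. "security", "performance", "backup".
    aspects: dict = field(default_factory=dict)

    def health(self) -> Condition:
        # Assumed roll-up rule: the entity is as healthy as its worst aspect.
        return max(self.aspects.values(), default=Condition.FULLY_FUNCTIONAL)


if __name__ == "__main__":
    orders = ManagedEntity("orders-db", {
        "performance": Condition.FULLY_FUNCTIONAL,
        "backup": Condition.PARTIAL,
    })
    print(orders.name, orders.health().name)
```

Keeping aspects as separate keys lets each functional team own its part of the health state while operators still see one condition per entity.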
Troubleshooting Workflow
Tools
Demo
FailSafe Modeling Tool
Test Plans Creation and Execution
• Create, review, execute, and save test plans and executions
• Test Execution Reports
Multi-Subscription Test Execution
• Send disruptions to multiple Cloud applications
• Ability to define the disruption execution order
Multiple Disruption Delivery Mechanisms
• Use WA Management API and/or Overlord Agent
• Mix the disruption delivery
Extensible Disruptions’ Database
• Template Engine for PowerShell Scripts
• Ability to execute programs (NotMyFault.exe)
Cloud Overlord Testing Framework
• Fault Injection Testing Framework
• Generate consistent and repeatable platform level disruptions
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.