Windows Azure Automation and Dev/Test for the Enterprise - RightScale Compute 2013
Windows Azure Compute
description
Transcript of Windows Azure Compute
Windows Azure ComputeBrad Calder
General ManagerWindows Azure
Usage
Com
pu te
Time
Average
InactivityPeriod
“On and Off “ RiskMetrics
On and off workloads (e.g. batch job)Over provisioned capacity is wasted
Com
pu te
Time
“Unpredictable Bursting“
Average Usage
Unexpected/unplanned peak in demand Sudden spike impacts performance Hard/costly to over provision for extreme cases
Average Usage
Com
pu te
Time
“Growing Fast“Docs.com on Facebook
Successful services needs to grow/scale Keeping up w/growth is big IT challenge Can be hard to predict growth
Com
pu te
Time
Average Usage
“Predictable Bursting“Walmart
Services with micro seasonality trends Peaks due to periodic increased demandWasted capacity off season
Large-Scale Workload Patterns
12:00 AM 1:54 AM 3:48 AM 5:42 AM 7:36 AM 9:30 AM11:24 AM 1:18 PM 3:12 PM 5:06 PM 7:00 PM 8:54 PM 10:48 PM
Japan Great Britain
BING SEARCHES – JAPAN VS. UK
Source: Microsoft
Computing Demand Daily FluctuationQu
ery
Volu
me
• turbotax.com • taxcut.com• hrblock.com • taxact.com
Source: Alexa
~4x normal load(Holiday shopping)
~10x normal load(Tax season)
• target.com • walmart.com• toysrus.com • barnesandnoble.com
Jan 2009 Jan 2010 Jan 2009 Jan 2010Source: Alexa
Computing Demand Yearly Variability
Time
Dem
and
What is a “Cloud”?• Cloud: on-demand, scalable, compute and storage
resources
TimeDe
man
dSelf Server Provisioning Cloud Provisioning
OverprovisionedUnderprovisioned
What is Under the Covers of a Service
Business logic
Datacenter (Power and Cooling)
Respond to hardware failures
Monitoring and alerting infrastructureReliable/Secure storage and computation
Metering and billing infrastructureLive upgrades and OS patches
Add compute/storage capacity on the flyOverprovision for peak traffic
Service “glue”
…
Buy and provision hardware
What is Windows Azure?An operating system for the cloud:
….Service 1 Service 2 Service NService 3
……
Cloud Terminology• Infrastructure as a Service (IaaS):
basic compute and storage resources• On-demand servers• Amazon EC2, VMWare vCloud, etc
• Platform as a Service (PaaS): cloud application infrastructure• On-demand application-hosting environment• Google AppEngine, Salesforce.com, Windows Azure, etc
• Software as a Service (SaaS): cloud applications• On-demand applications• Office 365, GMail, etc
Operating System
Operating System
VM
WebServer
Operating System
VM
DBMS
2) Choose image, then create and configure VM(s) for
application
1) Choose image, then
create VM for DBMS and
configure DBMS
IaaS
Library
VM Images
Developer/Ops
ApplicationDataLoad
Balancer
5) Config
ure load
balancer
6) Manage VMs and DBMS (e.g.,
deploying new OS images in VMs)
3) Provision database,
then create tables and add data
4) Install
application
Operating System
Operating System
VM
Operating System
VM
DBMS
PaaS Developer/Ops
ApplicationDataLoad
Balancer
2) Deploy applicati
on w/ service model
WebServer
1) Provision database,
then create tables and add data
3) Automated Service Managem
ent
Windows Azure• Windows Azure is an OS for the data center• Handles resource management, provisioning, and
monitoring• Manages application lifecycle• Allows developers to concentrate on business logic
• Provides common building blocks for distributed applications• Reliable queuing• Simple unstructured and structured storage• SQL storage• Application services like access control, caching, and
connectivity
Windows Azure Platform
Fabric Controller Windows Azure Networking
AppFabric Caching
AppFabric Access Control Server
SQL Azure
AppFabric Service Bus
WindowsAzure
Compute
WindowsAzure
Middleware Services
Windows Azure Applications
Windows Azure Storage
Windows Azure CDN
WindowsAzure
Data Services
• Owns all the hardware in the data center• Uses the inventory to host services• Similar to what a per machine operating system
does with applications• Provisions the hardware as necessary• Maintains the health of the hardware• Deploys applications to free resources• Maintains the health of those applications
Fabric Controller
Windows Azure Fabric Controller
Highly-availableFabric Controller
Hardware control Software control
WS08 Hypervisor
VMVM
VM
Fabric
Agent
Switches
Load-balancers
Scaling with the Fabric Controller Service Model
Scaling• There are two basic scaling models:
Compute
Compute
Compute
Compute
Scale Up Scale Out
Compute
Scaling Lessons• Use few, well-defined scaling units• Define scaling boundaries• Scale out those units as needed
Scale-Out ApplicationsNetwork Load Balancer
Stateless ‘Worker’
Stateless Front End
Shared Filesystem
(Azure Blobs)
Partitioned RDBMS
(SQL Azure)
Key/ValueDatastore
(Azure Tables)
AzureQueues
Scale Out
Scale Out
AlreadyProvided ScalableStorage
The Windows Azure Service Model• A Windows Azure application is called a “service”• Definition information• Configuration information• At least one “role”
• A role is the scaling boundary withina service• Roles are like DLLs in the service “process”• Collection of code with an entry point
that runs in its own virtual machine• Virtual machine is scale unit • Role code runs in a virtual machine • Role scales by instances of a virtual machine size
LB
Durable
Store
Front End
Middle Tier
Multi-Tier Cloud Application• A cloud application is typically made up of different
components• Front end: e.g. load-balanced stateless web servers• Middle worker tier: e.g. order processing, encoding• Backend storage: e.g. Azure Blobs, Azure Tables, SQL
Azure• Multiple instances of each for scalability and availability• Requires at least 2 instances of each to achieve the SLA
Front-End
Cloud Application
Front-End
HTTP/HTTPS
Windows
AzureStorag
e,SQL
Azure
Load Balancer Middle-
Tier
Service Model and Role Contents• Definition:
• Role name• Role type • VM size (e.g. small, medium, etc.)• Network endpoints
• Configuration:• Number of instances• Number of update and fault domains
• Code: • Web/Worker Role: Hosted DLL
and other executables• VM Role: VHD
Service ModelRole: Front-End
DefinitionType: WebVM Size: SmallEndpoints: External-1ConfigurationInstances: 2Update Domains: 3Fault Domains: 2
Role: Middle-Tier
DefinitionType: WorkerVM Size: LargeEndpoints: Internal-1ConfigurationInstances: 3Update Domains: 3Fault Domains: 2
Service Model Files• Service definition is in ServiceDefinition.csdef
• Service configuration is in ServiceConfiguration.cscfg
• CSPack program Zips service binaries and definition into Service Package File (service.cscfg)
ServiceDefinition.csdef<?xml version="1.0" encoding="utf-8"?><ServiceDefinition name="Sample" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition" upgradeDomainCount=“3"> <WorkerRole name="Middle-Tier" vmsize="Large"> <Endpoints> <InternalEndpoint name="Internal-1" protocol="tcp" /> </Endpoints> </WorkerRole> <WebRole name="Front-End" vmsize="Small"> <Sites> <Site name="Web"> <Bindings> <Binding name="Endpoint1" endpointName="External-1" /> </Bindings> </Site> </Sites> <Endpoints> <InputEndpoint name="External-1" protocol="http" port="80" /> </Endpoints> <Imports> <Import moduleName="Diagnostics" /> </Imports> </WebRole></ServiceDefinition>
ServiceConfiguration.cscfg<?xml version="1.0" encoding="utf-8"?><ServiceConfiguration serviceName="Sample" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration" osFamily=“2"
osVersion="*"> <Role name="Middle-Tier"> <Instances count="3" /> </Role> <Role name="Front-End"> <Instances count="2" /> <ConfigurationSettings> <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" value="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=key" /> </ConfigurationSettings> </Role></ServiceConfiguration>
Windows Azure Push-button Deployment• Step 1: Allocate VMs/nodes,
VDIPs/VIPs• Across fault domains• Across update domains
• Step 2: Place role images on nodes
• Step 3: Start roles in VM instances
• Step 4: Configure load-balancers• Step 5: Maintain desired number
of role instances• Failed roles automatically
restarted• Node failure results in new VMs
automatically allocated
Allocation across fault and update domains
Load-balancers
• Windows Azure FC monitors the health of roles• FC detects if a role dies• Restart the role to bring it back to a healthy state
• If a failed node can’t be recovered, FC migrates role instances to a new node• A suitable replacement location is found• Existing role instances are notified of the
configuration change
FC Automated Management
Availability andFault/Upgrade Domains
Availability Service Level Agreements (SLA)
• Windows Azure Platform SLAs:• Compute External Connectivity: 99.95% (2 or more
instances)• Storage Availability: 99.9%• SQL Azure Availability: 99.9%
Availability % Downtime per year Downtime per month* Downtime per week
99% ("two nines") 3.65 days 7.20 hours 1.68 hours99.9% ("three nines") 8.76 hours 43.2 minutes 10.1 minutes99.99% ("four nines") 52.56 minutes 4.32 minutes 1.01 minutes99.999% ("five nines") 5.26 minutes 25.9 seconds 6.05 seconds
Maintaining Availability: Assume Failure• Hardware fails • 3-5% of servers experience failures annually
• Software fails• Inevitable in any evolving, complex system
• Tolerating failure means:• Redundancy where possible• Need to build in retries and backoff• Fast recovery• Big red buttons
Hardware Redundancy
TOR
LB LBAgg
PDU
LB LBAgg LB LB
Agg LB LB
Agg LB LB
Agg LB LB
Agg
Racks
Datacenter
RoutersAggregation Routers and
Load Balancers
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
TOR
PDU
……… … …
Top of RackSwitches
Power Distribution
Units
…Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Node
s
Top of Rack Switch is a Single Point of Failure
Maintaining Availability: Fault Domains• Avoid single points of failure• Unit of failure based on
data center topology is a rack• E.g. top-of-rack switch on a rack of
machines• Windows Azure considers
fault domains when allocating service roles• At least 2 fault domains per service• Will try and spread roles out across
more
Front-End-1
Fault Domain 1
Fault Domain
2
Front-End-2
Middle Tier-2
Middle Tier-1
Fault Domain 3
Middle Tier-3
Front-End-1
Middle Tier-1
Front-End-
2Middl
e Tier-
2
Middle
Tier-3
• Update domains specifies what percentage of your service you will take offline for an upgrade• Specify the # of update domains for
your service• Default is 5 and max is 20
• Roles are evenly assigned an update domain
• Used to update only one domain at a time• Rolling update
Update Domains
Upgrade domains
allocated across fault domains
Fault domains
Service Deployment and Maintenance
Containing Failure: Datacenter Clusters• Datacenters are divided into “clusters”
• Approximately 1000 rack-mounted servers (we call them “nodes”)• Provides a unit of fault isolation
• Each cluster is managed by a Fabric Controller (FC)
Cluster1
Cluster2
Clustern
…Datacenter network
FC FC FC
The Fabric Controller (FC)• The “kernel” of the cloud operating system
• Manages datacenter hardware• Manages Windows Azure services
• Four main responsibilities:• Datacenter resource allocation• Datacenter resource provisioning• Service lifecycle management• Service health management
• Inputs:• Description of the hardware and network resources it will
control• Service model and binaries for cloud applications
Service Deployment Steps• Process service model files
• Determine resource requirements• Create role images
• Allocate compute and network resources• Prepare nodes
• Place role images on nodes• Create virtual machines• Start virtual machines and roles
• Configure networking• Dynamic IP addresses (DIPs) assigned to nodes• Virtual IP addresses (VIPs) + ports allocated and mapped to sets of
DIPs• Configure packet filter for VM to VM traffic within service• Program load balancers to allow traffic to external endpoints
Service Resource Allocation• Goal: allocate service components to available resources
while satisfying all hard constraints • Size of VM
• HW requirements: CPU, Memory, Storage, Network• Upgrade domains• Fault domains
• Secondary goal: Satisfy soft constraints • Optimize network proximity: pack different roles into same node
Deploying a ServiceRole B
Middle-Tier RoleCount: 3
Update Domains: 3Size: Large
Role AFront-End Role
(Front End)Count: 2
Update Domains: 3Size: Medium
LoadBalance
r10.100.0.36
10.100.0.122
www.mycloudapp.net
www.mycloudapp.net
Fault domain
Upgrade domain
Inside a Deployed Node
Fabric Controller (Primary)
FC Host Agent
Host Partition
Guest Partitio
nGuest Agent
Guest Partitio
nGuest Agent
Guest Partitio
nGuest Agent
Guest Partitio
nGuest Agent
Physical Node
Fabric Controller (Replica)
Fabric Controller (Replica)…
Role Instance
Role Instance
Role Instance
Role Instance
Trust boundary Image Repository
(OS VHDs, role ZIP files)
Detection: Load Balancer Operation• FC programs load balancers (LB) to “probe” guest
agent (GA) every 15 seconds• If the guest misses two probes, the LB stops forwarding
traffic• The role can report “busy” status to the GA • GA stops responding to probes
• LB keeps an idle connection open for 60s• Use keep-alive commands if the connection needs to be
open longer
Recovery: Server and Role Health• FC maintains service availability by monitoring the
software and hardware health• Based primarily on heartbeats • Automatically “heals” affected rolesProblem Fabric Detection Fabric Response
Role instance crashes FC guest agent monitors role termination FC restarts role
Guest VM or agent crashes FC host agent notices missing guest agent heartbeats
FC restarts VM and hosted role
Host OS or agent crashes FC notices missing host agent heartbeat Tries to recover nodeFC reallocates roles to other nodes
Detected node hardware issue Host agent informs FC FC migrates roles to other nodesMarks node “out for repair”
Updating Your Service• There are two update types:• In-place: used for large scale services and used to
updated services with local state• VIP swap: for ease of testing and fail-back for smaller
services• In-place (rolling) update:• Role instances updated one update domain at a time• Two modes: automatic and manual
In-Place Update• Purpose: Ensure service stays up
while updating• Used by Windows Azure OS updates
• System considers update domains when upgrading a service• 1/Update domains = percent of
service that will be offline• Default is 5 and max is 20
• The Windows Azure SLA is based on at least two update domains and two role instances of each role
Front-End-
1
Front-End-
2
Update Domain 1
Update Domain
2
Middle
Tier-1
Middle
Tier-2
Middle
Tier-3
Update Domain
3
Middle Tier-
3Front-End-2Front-End-
1
Middle Tier-
2
Middle
Tier-1
Windows Azure Compute Summary• Platform as a Service is all about reducing
management and operations overhead• The Windows Azure Fabric Controller is the
foundation for Windows Azure’s PaaS• Provisions machines• Deploys services• Configures hardware for services• Monitors service and hardware health