From Server to Service: How Microsoft moved Team Foundation Server to Windows Azure
Grant Holliday
Senior Premier Field Engineer
AZR323b
Introduction
TFSPreview.com
3 Years in Redmond
What does a Premier Field Engineer do?
Microsoft Services Premier Support
Help customers manage and support their IT
Technical account management and proactive advisory services
Fast issue resolution
Through packaged and customized offerings
Demo
A quick introduction to http://www.tfspreview.com/
Team Foundation Service
Standing on the Shoulders of Giants
20+ Years of Experience in Internet-Scale Services
James Hamilton
The goal should be that a highly-reliable, 24x7 service should be maintained by a small 8x5 operations staff.
Engineer the problems. Don’t scale the operations team.
Low-cost administration correlates highly with how closely the development, test, and operations teams work together.
The product team is held accountable for the success of the service. This drives the right behaviours.
Team Structure
Brian Harry
Version Control
Work Item Tracking .. Agile Tools
Service Delivery
Team
* Missing a couple of management/organisational layers, but the point is that everybody is on the same team
Evolution of a Service
2005: Source Control, Work Items, Builds, Reporting, SharePoint
2008: Performance fixes, Web Access
2010: Architectural changes, Scale out, Farms, Flexibility
2012: Web Access v2, Internet identities, File Service, Retries, Enhanced Tracing, Partitioning
Service Architecture
Clients
Windows Azure Cloud Service
Windows Azure SQL Databases
Windows Azure Active Directory
Load Balancer
Worker Role
Windows Azure Storage Blobs
Web Role
Necessary Additions
Internet Identities: Windows Identity Foundation, Windows Azure Active Directory (ACS)
File Service: Move blobs from expensive SQL storage to cheap Windows Azure Storage Blobs
Fault Tolerance: Server-side retry logic to handle transient failures
Enhanced Tracing: Manipulate tracing at runtime with very fine-grained control
Partitioning: Co-locate logical customer databases in single physical Windows Azure SQL Databases
Things to think about
Building a Service
Expect Failures
On-Premises Assumptions:
Network is solid
SQL is available
Dedicated servers
Cloud:
Shared infrastructure
Transient failures
Flexible to cope with variations in usage and load
There’s no place like Production
Windows Azure SQL Database Errors
Error Number: 40197
Error Message: The service has encountered an error processing your request. Please try again.
Cause: In case of a hardware failure, SQL Database provides automatic failover to optimize availability for your application. Some failover actions may result in an abrupt termination of a session.
Error Number: 40501
Error Message: The service is currently busy. Retry the request after 10 seconds.
Cause: When soft throttling limit for worker threads on a machine is exceeded, the database with the highest requests per second is throttled.
Error Number: 40552
Error Message: The session has been terminated because of excessive transaction log space usage. Try modifying fewer rows in a single transaction.
Cause: Uncommitted transactions can block the truncation of log files.
Transient Fault Handling Application Block
    using (SqlConnection conn = new SqlConnection(connString))
    {
        // Attempt to open a connection using the
        // specified retry policy.
        conn.OpenWithRetry(retryPolicy);
        // ... execute SQL queries
    }
Transient Fault Handling Application Block
    using (IDataReader dataReader =
        selectCommand.ExecuteReaderWithRetry(retryPolicy))
    {
        if (dataReader.Read())
        {
            // ... etc
        }
    }
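As a rough illustration of what a retry policy does behind calls like these, here is a minimal sketch in Python. It is not the Application Block's implementation: `SqlError`, `run_with_retry`, the choice of which error numbers count as transient, and the backoff delays are all assumptions made for the example.

```python
import time

# Assumption: error numbers treated as transient for this sketch.
# 40552 is left out because retrying an oversized transaction would not help.
TRANSIENT_ERRORS = {40197, 40501}


class SqlError(Exception):
    """Hypothetical stand-in for a driver exception carrying an error number."""

    def __init__(self, number, message=""):
        super().__init__(message)
        self.number = number


def run_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); retry transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlError as err:
            # Give up on non-transient errors or once attempts are exhausted.
            if err.number not in TRANSIENT_ERRORS or attempt == max_attempts:
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the delays can be replaced in tests. A production policy would probably honour the wait suggested by error 40501 rather than the short delays assumed here.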
Availability & SLAs
Assume your service depends on these four services:
Storage – 99.9%
Network – 99.95%
Compute – 99.9%
Access Control – 99.9%
What is the maximum uptime your service can guarantee?
…without building extra redundancy in
99.9% * 99.95% * 99.9% * 99.9% = 99.65% (~35 min/week of downtime)
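The arithmetic can be checked with a few lines of Python. This sketch assumes the four services are serial dependencies (the service is up only when all of them are up) and that their failures are independent; the SLA figures are the ones listed above.

```python
# SLA figures from the slide; treating them as independent serial
# dependencies is an assumption of this sketch.
slas = {
    "Storage": 0.999,
    "Network": 0.9995,
    "Compute": 0.999,
    "Access Control": 0.999,
}


def composite_availability(values):
    """Multiply the availabilities of services that must all be up at once."""
    result = 1.0
    for value in values:
        result *= value
    return result


MINUTES_PER_WEEK = 7 * 24 * 60

availability = composite_availability(slas.values())
downtime_per_week = (1 - availability) * MINUTES_PER_WEEK
print(f"{availability:.2%} available, about {downtime_per_week:.0f} min/week down")
```

Under those assumptions the product is about 99.65%, or roughly 35 minutes of permitted downtime per week.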
How is Availability Defined?
Service: Qualifications of Downtime
Cloud Services (compute): “Role Instance Downtime” is the total accumulated minutes for all role instances during a billing month that had been deployed and started by action initiated by Customer which had not been running for longer than two minutes without detection and corrective action being initiated.
Storage: We guarantee that at least 99.9% of the time we will successfully process correctly formatted requests that we receive to add, update, read and delete data. “Error Rate” is the total number of Failed Storage Transactions divided by the Total Storage Transactions during a set time interval (currently set at one hour).
SQL Database: SQL Database will maintain a “Monthly Availability” of 99.9% during a billing month. A 5-minute interval is marked as unavailable if all the customer’s attempts to establish a connection to SQL Azure fail or take longer than 30 seconds to succeed, or if all basic valid read and write operations (as described in our technical documentation) fail after connection is established.
Exchange Online: Any period of time when end users are unable to send or receive email with Outlook Web Access.
Monitoring the Database
On-Premises Assumption:
You have access to the OS
You can collect performance counters
You can install SCOM Agents
Cloud:
No access to the underlying infrastructure
No access to performance counters, because it’s a shared server
How to Monitor the Database
Periodically poll the DMVs
Management Pack for SQL Azure
Build counters into your application:
Average SQL Connect Time
Current SQL Connection Failures/Sec
Current SQL Connection Retries/Sec
Current SQL Execution Retries/Sec
Current SQL Executions/Sec
Current SQL Notification Queries/Sec
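A minimal sketch of what application-level counters might look like, written in Python. The `SqlCounters` class and its method names are hypothetical; they loosely mirror the counter names on the slide and are not TFS code. Converting the raw counts into per-second rates is left to whatever publishes them.

```python
from collections import Counter


class SqlCounters:
    """Hypothetical in-process counters for database calls."""

    def __init__(self):
        self.counts = Counter()
        self.connect_seconds = 0.0

    def record_connect(self, seconds, succeeded=True):
        """Record one connection attempt and how long it took."""
        if succeeded:
            self.counts["connects"] += 1
            self.connect_seconds += seconds
        else:
            self.counts["connection_failures"] += 1

    def record_retry(self, kind):
        """Record a retry; kind is 'connection' or 'execution'."""
        self.counts[f"{kind}_retries"] += 1

    def average_connect_time(self):
        """Average time of successful connects, or 0.0 if there were none."""
        connects = self.counts["connects"]
        return self.connect_seconds / connects if connects else 0.0
```

Because the data layer records these itself, the numbers stay available even when the database host's own performance counters are not.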
Monitoring the Application Tier
On-Premises Assumption:
Call a TFS ‘Server Status’ web service on server
Performance counters
Cloud:
Lots of Application Tiers
Not directly accessible to the Internet
Can’t sync status across servers (doesn’t scale)
How to Monitor the Application Tier
Build events into your application:
“A request for service host XX has been executing for 34 seconds, exceeding the warning threshold of 30.”
Windows Azure Diagnostics:
Built into Azure
Periodically collects perf counters, event logs, crash dumps
Uploads them to Table/Blob storage
System Center Monitoring Pack for Windows Azure Apps
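An event like the long-running request warning quoted above could come from a small watchdog that tracks in-flight requests. The Python sketch below is an assumption about how such a check might be structured; `RequestWatchdog` and its methods are hypothetical names, and only the message text and the 30-second threshold come from the slide.

```python
import time


class RequestWatchdog:
    """Hypothetical tracker that reports requests running past a threshold."""

    def __init__(self, threshold=30, clock=time.monotonic):
        self.threshold = threshold
        self.clock = clock          # injectable so the check can be tested
        self.in_flight = {}         # request id -> (service host, start time)

    def start(self, request_id, host):
        self.in_flight[request_id] = (host, self.clock())

    def finish(self, request_id):
        self.in_flight.pop(request_id, None)

    def check(self):
        """Return one warning message per request over the threshold."""
        now = self.clock()
        messages = []
        for host, started in self.in_flight.values():
            elapsed = now - started
            if elapsed > self.threshold:
                messages.append(
                    f"A request for service host {host} has been executing "
                    f"for {elapsed:.0f} seconds, exceeding the warning "
                    f"threshold of {self.threshold}."
                )
        return messages
```

A background timer would call `check()` periodically and write each message to the event log, where a diagnostics collector can pick it up.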
Monitoring the End-User Experience
On-Premises Assumption:
Wait for them to tell you
SCOM monitors
Cloud:
Many more users
Less reliable and slower networks
Would probably give up, rather than say something is slow/broken
Outside-In Monitoring
Synthetic transactions:
Executed continuously
From key points around the world
Using typical ISP connections
System Center Global Service Monitor:
Agents run by Microsoft
Integrates with System Center Operations Manager
Others: Gomez, Keynote
Testing in Production (TiP monitors)
Synthetic transactions:
Executed continuously
From another role in the same datacentre
Exercise dependent services
Continuous smoke testing of the service
Keeps downstream providers accountable
Information to quickly diagnose an outage
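A synthetic transaction of this kind can be sketched as an ordered list of named steps, each exercising one dependent service, with the first failure recorded so an outage can be traced to a provider quickly. This Python sketch is illustrative only; `run_synthetic_transaction` and the step names in the usage note are hypothetical.

```python
import time


def run_synthetic_transaction(name, steps, clock=time.monotonic):
    """Run (step_name, callable) pairs in order; stop at the first failure."""
    started = clock()
    for step_name, step in steps:
        try:
            step()
        except Exception as err:
            # Record which dependency failed and how long the run took.
            return {
                "monitor": name,
                "ok": False,
                "failed_step": step_name,
                "error": str(err),
                "seconds": clock() - started,
            }
    return {
        "monitor": name,
        "ok": True,
        "failed_step": None,
        "error": None,
        "seconds": clock() - started,
    }
```

A monitor role would run something like `run_synthetic_transaction("tip", [("sign in", sign_in), ("read blob", read_blob)])` on a schedule and raise an alert whenever `ok` is false.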
Diagnosing Issues
Easy problems: TFS Activity Log – keeps track of every command & parameter that a user runs
Complex problems: Fine-grained tracing – controllable at runtime via database
Really hard problems: Debugging role – a parallel Azure role deployment to which a customer can be redirected and a debugger attached
Fine-Grained Tracing
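The deck does not show how the trace filter works, so the following Python sketch is only an assumption about one way to make tracing controllable at runtime: rules are reloaded from a configuration store (a database in the talk, a plain list here) and each trace call checks them. `TraceRule`, `TraceSettings`, and the `area` and `account` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRule:
    """Hypothetical rule; a field left as None matches any value."""
    area: Optional[str] = None
    account: Optional[str] = None
    enabled: bool = True


class TraceSettings:
    """Holds the current rule set and decides whether a call should trace."""

    def __init__(self):
        self.rules = []

    def reload(self, rules):
        """Swap in a new rule set, e.g. after polling a config table."""
        self.rules = list(rules)

    def should_trace(self, area, account):
        return any(
            rule.enabled
            and rule.area in (None, area)
            and rule.account in (None, account)
            for rule in self.rules
        )
```

With no rules loaded nothing is traced, so the cost in normal operation is a single cheap check per trace point. Adding a rule for one area and one account turns on detailed tracing for just that customer without a redeploy.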
Separate Debugging Role
DNS: https://*.tfspreview.com/
VIP: 65.52.8.37
Web Role: Role Instance #1…n
Worker Role: Role Instance #1…n
New DNS Record: https://sadcustomer.tfspreview.com/
VIP: 65.52.X.Y
Web Role: Role Instance #1
Attach Debugger
Config DB
Customer DB
Customer DB
Upgrades / Patches / Hotfixes
Users are geographically distributed:
No ideal time for an offline upgrade
Can’t upgrade every customer at once:
Too much load
Too much risk
How to do big “Keynote” releases?
Feature flagging
Turn features on/off at runtime based upon Account, IP, etc
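Feature flagging by account or IP address can be sketched in a few lines of Python. This is a hypothetical illustration, not the TFS implementation: `FeatureFlags`, the feature name, and the rule format in the usage note are all assumptions, and in practice the rules would live in a shared store so they can change without redeploying.

```python
import ipaddress


class FeatureFlags:
    """Hypothetical runtime switch keyed by account name or client network."""

    def __init__(self):
        self.rules = {}  # feature -> {"accounts": set, "networks": list}

    def enable(self, feature, accounts=(), networks=()):
        """Turn a feature on for the given accounts and/or CIDR networks."""
        self.rules[feature] = {
            "accounts": set(accounts),
            "networks": [ipaddress.ip_network(n) for n in networks],
        }

    def disable(self, feature):
        self.rules.pop(feature, None)

    def is_enabled(self, feature, account, ip):
        rule = self.rules.get(feature)
        if rule is None:
            return False  # unknown or disabled features stay off
        if account in rule["accounts"]:
            return True
        address = ipaddress.ip_address(ip)
        return any(address in network for network in rule["networks"])
```

For example, `flags.enable("new-board", accounts=["contoso"], networks=["10.0.0.0/8"])` ships the code dark for everyone else. The feature can then be switched on for all accounts at announcement time by updating the rule.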
How to Think About Upgrades
Upgrade must be an online operation:
Deploy one piece at a time (Schema, Services, Web)
“Trickle” upgrades that migrate the data to new schema
Multiple versions must coexist peacefully:
New web binaries, old DB schema
Old clients, new server
Regular, fixed deployment windows:
Every three weeks is a deployment opportunity
If you miss one, not too long to wait for the next one
Avoids building debt and risk
Communication
Maintain (and build) trust during an outage:
Immediately: “Yes, there’s a problem. We’re working on it”
Regularly: “Still working on it, going to do <x>”
After: “Root cause was <y>. It’s not going to happen again, because we’ve done <z>”
Summary
Team Structure Matters!
Dev, Test & Ops together
Expect Failures
Handle all failures gracefully
Most Problems Have Been Solved
Your job is to find and bring those solutions together
Related Content
Planning for Failure in Cloud Applications (AZR333 - Fri 11:30)
Exploring Windows Azure Storage (AZRILL102 - Fri 11:30)
Research Paper (http://aka.ms/InternetScaleServices)
Exam 70-583: Designing and Developing Windows Azure Applications
Find Me Later at the Speaker Lounge (12:45 – 1:45)
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.