
From Server to Service: How Microsoft moved Team Foundation Server to Windows Azure

Grant Holliday
Senior Premier Field Engineer

AZR323b

Introduction

TFSPreview.com
3 Years in Redmond

What does a Premier Field Engineer do?

Microsoft Services Premier Support

Help customers manage and support their IT
  Technical account management and proactive advisory services

Fast issue resolution
  Through packaged and customized offerings

Demo

A quick introduction to http://www.tfspreview.com/

Team Foundation Service

Standing on the Shoulders of Giants

20+ Years of Experience in Internet-Scale Services

James Hamilton

The goal should be that a highly-reliable, 24x7 service should be maintained by a small 8x5 operations staff.

Engineer the problems. Don’t scale the operations team.

Low-cost administration correlates highly with how closely the development, test, and operations teams work together.

The product team is held accountable for the success of the service. This drives the right behaviours.

Team Structure

Brian Harry

Version Control

Work Item Tracking .. Agile Tools

Service Delivery

Team

* Missing a couple of management/organisational layers, but the point is that everybody is on the same team

Evolution of a Service

2005: Source Control, Work Items, Builds, Reporting, SharePoint

2008: Performance fixes, Web Access

2010: Architectural changes, Scale out, Farms, Flexibility

2012: Web Access v2, Internet identities, File Service, Retries, Enhanced Tracing, Partitioning

Service Architecture

[Diagram components: Clients; Load Balancer; Windows Azure Cloud Service (Web Role, Worker Role); Windows Azure SQL Databases; Windows Azure Storage Blobs; Windows Azure Active Directory]

Necessary Additions

Internet Identities
  Windows Identity Foundation, Windows Azure Active Directory (ACS)

File Service
  Move blobs from expensive SQL storage to cheap Windows Azure Storage Blobs

Fault Tolerance
  Server-side retry logic to handle transient failures

Enhanced Tracing
  Manipulate tracing at runtime with very fine-grained control

Partitioning
  Co-locate logical customer databases in single physical Windows Azure SQL Databases

Things to think about

Building a Service

Expect Failures

On-Premises Assumptions:
  Network is solid
  SQL is available
  Dedicated servers

Cloud:
  Shared infrastructure
  Transient failures
  Flexible to cope with variations in usage and load
  There’s no place like Production

Windows Azure SQL Database Errors

| Error Number | Error Message | Cause |
|---|---|---|
| 40197 | The service has encountered an error processing your request. Please try again. | In case of a hardware failure, SQL Database provides automatic failover to optimize availability for your application. Some failover actions may result in an abrupt termination of a session. |
| 40501 | The service is currently busy. Retry the request after 10 seconds. | When soft throttling limit for worker threads on a machine is exceeded, the database with the highest requests per second is throttled. |
| 40552 | The session has been terminated because of excessive transaction log space usage. Try modifying fewer rows in a single transaction. | Uncommitted transactions can block the truncation of log files. |

Transient Fault Handling Application Block

    using (SqlConnection conn = new SqlConnection(connString))
    {
        // Attempt to open a connection using the
        // specified retry policy.
        conn.OpenWithRetry(retryPolicy);

        // ... execute SQL queries
    }

Transient Fault Handling Application Block

    // Execute the command using the specified retry policy.
    using (IDataReader dataReader = selectCommand.ExecuteReaderWithRetry(retryPolicy))
    {
        if (dataReader.Read())
        {
            // ... etc
        }
    }
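The retry idea itself is independent of the Application Block. As a rough sketch only, the following Python shows one way to retry an operation when it fails with an error number from the table above. The `SqlError` class, the `run_with_retry` helper, the backoff values, and the choice to treat only 40197 and 40501 as retryable are all assumptions made for illustration; they are not part of the Application Block or of TFS.

```python
import time

# Error numbers taken from the table above. Treating exactly these two as
# retryable is an assumption made for this sketch.
TRANSIENT_ERROR_NUMBERS = {40197, 40501}


class SqlError(Exception):
    """Hypothetical exception carrying a SQL Database error number."""

    def __init__(self, number, message=""):
        super().__init__(f"{number}: {message}")
        self.number = number


def run_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); retry with exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlError as err:
            is_last_attempt = attempt == max_attempts
            if err.number not in TRANSIENT_ERROR_NUMBERS or is_last_attempt:
                raise  # not transient, or out of attempts
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays. A non-transient error such as 40552 is re-raised immediately, because repeating the same oversized transaction would not help.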

Availability & SLAs

Assume your service depends on these four services:

Storage – 99.9%
Network – 99.95%
Compute – 99.9%
Access Control – 99.9%

What is the maximum uptime your service can guarantee?

…without building extra redundancy in

99.9% * 99.95% * 99.9% * 99.9% ≈ 99.65% (roughly 35 minutes of downtime per week)
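The arithmetic can be checked with a few lines of Python. This is a sketch under two assumptions that the slide implies but does not state: the service needs all four dependencies at once, and their failures are independent. The function name is made up for the example.

```python
from math import prod

MINUTES_PER_WEEK = 7 * 24 * 60


def composite_availability(slas):
    """Availability of a service that needs every dependency to be up."""
    return prod(slas)


# Storage, Network, Compute, Access Control (figures from the slide)
slas = [0.999, 0.9995, 0.999, 0.999]

availability = composite_availability(slas)
downtime_minutes = (1 - availability) * MINUTES_PER_WEEK

print(f"{availability:.2%} available")
print(f"about {downtime_minutes:.0f} minutes of downtime per week")
```

Adding a fifth dependency at 99.9% to the list shows how quickly the product drops, which is the argument for building redundancy in.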

How is Availability Defined?

| Service | Qualifications of Downtime |
|---|---|
| Cloud Services (compute) | “Role Instance Downtime” is the total accumulated minutes for all role instances during a billing month that had been deployed and started by action initiated by Customer which had not been running for longer than two minutes without detection and corrective action being initiated. |
| Storage | We guarantee that at least 99.9% of the time we will successfully process correctly formatted requests that we receive to add, update, read and delete data. “Error Rate” is the total number of Failed Storage Transactions divided by the Total Storage Transactions during a set time interval (currently set at one hour). |
| SQL Database | SQL Database will maintain a “Monthly Availability” of 99.9% during a billing month. A 5-minute interval is marked as unavailable if all the customer’s attempts to establish a connection to SQL Azure fail or take longer than 30 seconds to succeed, or if all basic valid read and write operations (as described in our technical documentation) fail after connection is established. |
| Exchange Online | Any period of time when end users are unable to send or receive email with Outlook Web Access. |

Monitoring the Database

On-Premises Assumption:
  You have access to the OS
  You can collect performance counters
  You can install SCOM Agents

Cloud:
  No access to the underlying infrastructure
  No access to performance counters, because it’s a shared server

How to Monitor the Database

Periodically poll the DMVs
  Management Pack for SQL Azure

Build counters into your application
  Average SQL Connect Time
  Current SQL Connection Failures/Sec
  Current SQL Connection Retries/Sec
  Current SQL Execution Retries/Sec
  Current SQL Executions/Sec
  Current SQL Notification Queries/Sec
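A minimal sketch of what building such counters into an application could look like is shown below. The `SqlCounters` class and its method names are hypothetical; only the counter names are taken from the slide, and the notification-query counter is left out for brevity.

```python
from collections import Counter


class SqlCounters:
    """Hypothetical in-process counters, read out and reset once per interval."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.counts = Counter()
        self.connect_seconds = 0.0

    def record_connect(self, seconds, retries=0, failed=False):
        """Record one connection attempt and how long it took."""
        self.counts["connects"] += 1
        self.counts["connection_retries"] += retries
        self.counts["connection_failures"] += int(failed)
        self.connect_seconds += seconds

    def record_execution(self, retries=0):
        """Record one SQL execution and any retries it needed."""
        self.counts["executions"] += 1
        self.counts["execution_retries"] += retries

    def snapshot(self, interval_seconds):
        """Return the counter values for the elapsed interval, then reset."""
        c = self.counts
        connects = c["connects"]
        values = {
            "Average SQL Connect Time": (
                self.connect_seconds / connects if connects else 0.0
            ),
            "Current SQL Connection Failures/Sec": c["connection_failures"] / interval_seconds,
            "Current SQL Connection Retries/Sec": c["connection_retries"] / interval_seconds,
            "Current SQL Execution Retries/Sec": c["execution_retries"] / interval_seconds,
            "Current SQL Executions/Sec": c["executions"] / interval_seconds,
        }
        self.reset()
        return values
```

The data-access layer would call `record_connect` and `record_execution`, and a timer would call `snapshot` once per interval and publish the result wherever monitoring reads it.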

Monitoring the Application Tier

On-Premises Assumption:
  Call a TFS ‘Server Status’ web service on server
  Performance counters

Cloud:
  Lots of Application Tiers
  Not directly accessible to the Internet
  Can’t sync status across servers (doesn’t scale)

How to Monitor the Application Tier

Build events into your application:
  “A request for service host XX has been executing for 34 seconds, exceeding the warning threshold of 30.”

Windows Azure Diagnostics
  Built-in to Azure
  Periodically collects perf counters, event logs, crash dumps
  Uploads them to Table/Blob storage

System Center Monitoring Pack for Windows Azure Apps
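The application event quoted above could come from a check like the following. This is only a sketch: the function name, its parameters, and the idea of polling in-flight requests are assumptions, and only the message text and the 30-second threshold come from the slide.

```python
import time

WARNING_THRESHOLD_SECONDS = 30  # threshold quoted on the slide


def slow_request_warning(host, started_at, now=None,
                         threshold=WARNING_THRESHOLD_SECONDS):
    """Return a warning message if a request has run past the threshold.

    `started_at` and `now` are readings of the same monotonic clock, in
    seconds. Returns None while the request is still within the threshold.
    """
    now = time.monotonic() if now is None else now
    elapsed = int(now - started_at)
    if elapsed <= threshold:
        return None
    return (
        f"A request for service host {host} has been executing for "
        f"{elapsed} seconds, exceeding the warning threshold of {threshold}."
    )
```

A background task could call this for every in-flight request and write any non-empty result to the event log, where Windows Azure Diagnostics would collect it.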

Monitoring the End-User Experience

On-Premises Assumption:
  Wait for them to tell you
  SCOM monitors

Cloud:
  Many more users
  Less reliable and slower networks
  Would probably give up, rather than say something is slow/broken

Outside-In Monitoring

Synthetic transactions
  Executed continuously
  From key points around the world
  Using typical ISP connections

System Center Global Service Monitor
  Agents run by Microsoft
  Integrates with System Center Operations Manager

Others: Gomez, Keynote

Testing in Production (TiP monitors)

Synthetic transactions
  Executed continuously
  From another role in the same datacentre
  Exercise dependent services

Continuous smoke testing of the service
Keeps downstream providers accountable
Information to quickly diagnose an outage
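As an illustration of the pattern, one pass of a TiP monitor might look like the sketch below. The function name, the shape of the `checks` mapping, and the result format are hypothetical. In a real monitor each check would run a small real operation against a dependency, such as a query or a blob read.

```python
import time


def run_synthetic_checks(checks, clock=time.perf_counter):
    """Run each named check once; record pass/fail, duration and error.

    `checks` maps a dependency name to a zero-argument callable that
    raises an exception when the dependency misbehaves.
    """
    results = {}
    for name, check in checks.items():
        start = clock()
        try:
            check()
            error = None
        except Exception as exc:  # the monitor reports failures, it must not crash
            error = repr(exc)
        results[name] = {
            "ok": error is None,
            "seconds": clock() - start,
            "error": error,
        }
    return results
```

Running this on a schedule from a separate role and keeping the results per dependency gives a continuous smoke test and a record of which downstream service failed, and when.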

Diagnosing Issues

Easy problems
  TFS Activity Log – keeps track of every command & parameter that a user runs

Complex problems
  Fine-grained tracing – controllable at runtime via database

Really hard problems
  Debugging role – a parallel Azure role deployment to which a customer can be redirected and a debugger attached
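To make "controllable at runtime via database" concrete, here is a minimal sketch in which trace rules are re-read from a settings source and decide which calls emit output. The `TraceRule` fields, the `Tracer` class, and the account name in the usage note are invented for the example; the actual TFS tracing schema is not shown in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceRule:
    """One row of a hypothetical trace-settings table; None matches anything."""
    area: Optional[str] = None
    account: Optional[str] = None
    min_level: int = 0


class Tracer:
    def __init__(self, load_rules):
        self._load_rules = load_rules  # callable that reads the current rows
        self.rules = []
        self.output = []

    def refresh(self):
        """Re-read the rules; call periodically so changes apply without a restart."""
        self.rules = list(self._load_rules())

    def trace(self, area, account, level, message):
        """Emit the message only if some rule matches; return whether it was emitted."""
        for rule in self.rules:
            if (rule.area in (None, area)
                    and rule.account in (None, account)
                    and level >= rule.min_level):
                self.output.append(f"[{area}] {account}: {message}")
                return True
        return False
```

With no rules loaded nothing is traced. Adding a row such as `TraceRule(area="VersionControl", account="contoso")` and calling `refresh()` turns tracing on for that one area and account only, leaving every other customer unaffected.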

Fine-Grained Tracing

Separate Debugging Role

[Diagram: separate debugging role]

Production path: DNS https://*.tfspreview.com/ → VIP 65.52.8.37 → Web Role and Worker Role (Role Instances #1…n)

Debugging path: New DNS Record https://sadcustomer.tfspreview.com/ → VIP 65.52.X.Y → Web Role (Role Instance #1), Attach Debugger

Databases shown: Config DB, Customer DB, Customer DB

Upgrades / Patches / Hotfixes

Users are geographically distributed
  No ideal time for an offline upgrade

Can’t upgrade every customer at once
  Too much load
  Too much risk

How to do big “Keynote” releases?
  Feature flagging
  Turn features on/off at runtime based upon Account, IP, etc.
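A feature flag of the kind described can be as small as the sketch below. The `FeatureFlags` class, its rule format, and the feature and account names in the usage note are hypothetical. The sketch only illustrates switching a feature on per account or per client IP range at runtime.

```python
import ipaddress


class FeatureFlags:
    """Hypothetical runtime feature switches keyed by account or client IP range."""

    def __init__(self):
        self._rules = {}  # feature name -> rule dictionary

    def enable(self, feature, accounts=(), networks=(), everyone=False):
        """Turn a feature on for some accounts, some IP ranges, or everyone."""
        self._rules[feature] = {
            "everyone": everyone,
            "accounts": set(accounts),
            "networks": [ipaddress.ip_network(n) for n in networks],
        }

    def disable(self, feature):
        """Turn a feature off for everyone."""
        self._rules.pop(feature, None)

    def is_enabled(self, feature, account=None, ip=None):
        rule = self._rules.get(feature)
        if rule is None:
            return False  # unknown or disabled features stay off
        if rule["everyone"] or account in rule["accounts"]:
            return True
        if ip is None:
            return False
        address = ipaddress.ip_address(ip)
        return any(address in network for network in rule["networks"])
```

For example, `flags.enable("new-dashboard", accounts=["contoso"], networks=["10.0.0.0/8"])` exposes the feature to one account and one network range. A later `flags.enable("new-dashboard", everyone=True)` is the "Keynote" moment, with the code already deployed and exercised.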

How to Think About Upgrades

Upgrade must be an online operation
  Deploy one piece at a time (Schema, Services, Web)
  “Trickle” upgrades that migrate the data to new schema

Multiple versions must coexist peacefully
  New web binaries, old DB schema
  Old clients, new server

Regular, fixed deployment windows
  Every three weeks is a deployment opportunity
  If you miss one, not too long to wait for the next one
  Avoids building debt and risk

Communication

Maintain (and build) trust during an outage:
  Immediately: “Yes, there’s a problem. We’re working on it”
  Regularly: “Still working on it, going to do <x>”
  After: “Root cause was <y>. It’s not going to happen again, because we’ve done <z>”

Summary

Team Structure Matters!
  Dev, Test & Ops together

Expect Failures
  Handle all failures gracefully

Most Problems Have Been Solved
  Your job is to find and bring those solutions together

Related Content

Planning for Failure in Cloud Applications (AZR333 - Fri 11:30)

Exploring Windows Azure Storage (AZRILL102 - Fri 11:30)

Research Paper (http://aka.ms/InternetScaleServices)

Exam 70-583: Designing and Developing Windows Azure Applications

Find Me Later at the Speaker Lounge (12:45 – 1:45)

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.