The Shortcut Guide to SQL Server Infrastructure Optimization


Transcript of The Shortcut Guide to SQL Server Infrastructure Optimization

Page 1: The Shortcut Guide to SQL Server Infrastructure Optimization

The Shortcut Guide To™

SQL Server Infrastructure Optimization

Don Jones

Page 2: The Shortcut Guide to SQL Server Infrastructure Optimization

Introduction


Introduction to Realtimepublishers

by Don Jones, Series Editor

For several years now, Realtime has produced dozens and dozens of high-quality books that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve made this unique publishing model work through the generous support and cooperation of our sponsors, who agree to bear each book’s production expenses for the benefit of our readers.

Although we’ve always offered our publications to you for free, don’t think for a moment that quality is anything less than our top priority. My job is to make sure that our books are as good as—and in most cases better than—any printed book that would cost you $40 or more. Our electronic publishing model offers several advantages over printed books: You receive chapters literally as fast as our authors produce them (hence the “realtime” aspect of our model), and we can update chapters to reflect the latest changes in technology.

I want to point out that our books are by no means paid advertisements or white papers. We’re an independent publishing company, and an important aspect of my job is to make sure that our authors are free to voice their expertise and opinions without reservation or restriction. We maintain complete editorial control of our publications, and I’m proud that we’ve produced so many quality books over the past years.

I want to extend an invitation to visit us at http://nexus.realtimepublishers.com, especially if you’ve received this publication from a friend or colleague. We have a wide variety of additional books on a range of topics, and you’re sure to find something that’s of interest to you—and it won’t cost you a thing. We hope you’ll continue to come to Realtime for your educational needs far into the future.

Until then, enjoy.

Don Jones

Page 3: The Shortcut Guide to SQL Server Infrastructure Optimization

Table of Contents


Introduction to Realtimepublishers.................................................................................................. i

Chapter 1: Traditional Challenges and Their Impact.......................................................................1

The Symptoms of an Un-Optimized Infrastructure .........................................................................2

Isolated Systems and Data Silos ..........................................................................................2

Reactive Management..........................................................................................................3

Availability Concerns ..........................................................................................................3

Top-Level Causes of the Un-Optimized Infrastructure ...................................................................4

Technological Causes ..........................................................................................................4

Performance Constraints..........................................................................................4

Storage Constraints ..................................................................................................5

Chassis Constraints ..................................................................................................5

Business Causes ...................................................................................................................5

Branch Offices .........................................................................................................6

Departmental Servers...............................................................................................6

Project-Based Servers ..............................................................................................6

Top-Level Causes: A Summary...........................................................................................6

The Result of an Un-Optimized Infrastructure ................................................................................7

Maintenance Overhead ........................................................................................................7

Application and Data Availability .......................................................................................8

Management Overhead ......................................................................................................10

Disaster Recovery Overhead..............................................................................................10

Life Cycle Management.....................................................................................................11

Storage Management .........................................................................................................11

Performance Management .................................................................................................11

Security and Auditing ........................................................................................................11

Introducing the Infrastructure Optimization Model.......................................................................12

Basic...................................................................................................................................13

Standardized.......................................................................................................................13

Advanced ...........................................................................................................................14

Dynamic.............................................................................................................................14

Application Platform Optimization................................................................................................15

Keys for SQL Server Infrastructure Optimization.........................................................................16

Abstracted ..........................................................................................................................16

Page 4: The Shortcut Guide to SQL Server Infrastructure Optimization


Segmentation......................................................................................................................17

Manageability ....................................................................................................................18

Performance .......................................................................................................................18

Availability ........................................................................................................................18

Life Cycle...........................................................................................................................19

Disaster Recovery ..............................................................................................................19

Dynamic.............................................................................................................................19

Coming Up….................................................................................................................................19

Chapter 2: Optimizing Your SQL Server Infrastructure: Good Ideas, Bad Ideas .........................20

The Nature of RDBMS and SQL Server .......................................................................................20

SQL Server Instances.........................................................................................................20

Processor and Memory Requirements ...............................................................................22

Storage Requirements ........................................................................................................25

Client Connectivity Requirements.....................................................................................26

Disaster Recovery Requirements.......................................................................................27

Basic Server Consolidation............................................................................................................29

Technology Overview........................................................................................................29

The SQL Server Connection..............................................................................................29

The Impact on Sprawl........................................................................................................30

The Impact on Flexibility...................................................................................................30

Windows Server Clusters...............................................................................................................31

Technology Overview........................................................................................................31

The SQL Server Connection..............................................................................................32

The Impact on Sprawl........................................................................................................32

The Impact on Flexibility...................................................................................................33

Hardware Virtualization.................................................................................................................34

Technology Overview........................................................................................................34

The SQL Server Connection..............................................................................................35

The Impact on Sprawl........................................................................................................35

The Impact on Flexibility...................................................................................................36

Pooled-Resource Clustering...........................................................................................................37

Technology Overview........................................................................................................37

The SQL Server Connection..............................................................................................39

Page 5: The Shortcut Guide to SQL Server Infrastructure Optimization


The Impact on Sprawl........................................................................................................39

The Impact on Flexibility...................................................................................................40

Infrastructure Optimization: The Positive Impact on SQL Server Sprawl ....................................40

Coming Up Next…........................................................................................................................40

Chapter 3: The New Cluster: Technologies for SQL Server Infrastructure Optimization ............41

Rethinking the SQL Server Cluster ...............................................................................................42

Processor and Memory Requirements ...............................................................................42

Storage Requirements ........................................................................................................42

Network Requirements ......................................................................................................44

Foundation Technologies for the New Cluster ..............................................................................46

Abstracting the Processor and Memory.............................................................................46

Abstracting the Storage......................................................................................................46

Abstracting the Network ....................................................................................................48

Coordination: The Key to the New Cluster .......................................................................49

Rethinking SQL Server Management............................................................................................51

Managing SQL Server Software........................................................................................51

Adding Nodes to the Cluster..............................................................................................51

Adding SQL Server Instances............................................................................................52

Rolling Hardware Generations ..........................................................................................53

The New Cluster and Infrastructure Optimization.........................................................................54

Techniques for Creating Clusters.......................................................................................54

Using Clusters to Improve Infrastructure Optimization ....................................................55

Benefits of the New Cluster...........................................................................................................56

Consistent Software Configurations ..................................................................................56

Reduction in Operating Costs ............................................................................................56

High Availability ...............................................................................................................56

Reduced Complexity..........................................................................................................57

Room for Growth...............................................................................................................57

Conclusion: Welcome to the Optimized SQL Server Infrastructure .............................................59

Page 6: The Shortcut Guide to SQL Server Infrastructure Optimization

Copyright Statement


© 2007 Realtimepublishers.com, Inc. All rights reserved. This site contains materials that have been created, developed, or commissioned by, and published with the permission of, Realtimepublishers.com, Inc. (the “Materials”) and this site and any such Materials are protected by international copyright and trademark laws.

THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice and do not represent a commitment on the part of Realtimepublishers.com, Inc or its web site sponsors. In no event shall Realtimepublishers.com, Inc. or its web site sponsors be held liable for technical or editorial errors or omissions contained in the Materials, including without limitation, for any direct, indirect, incidental, special, exemplary or consequential damages whatsoever resulting from the use of any information contained in the Materials.

The Materials (including but not limited to the text, images, audio, and/or video) may not be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any way, in whole or in part, except that one copy may be downloaded for your personal, non-commercial use on a single computer. In connection with such use, you may not modify or obscure any copyright or other proprietary notice.

The Materials may contain trademarks, services marks and logos that are the property of third parties. You are not permitted to use these trademarks, services marks or logos without prior written consent of such third parties.

Realtimepublishers.com and the Realtimepublishers logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners.

If you have any questions about these terms, or if you would like information about licensing materials from Realtimepublishers.com, please contact us via e-mail at [email protected].

Page 7: The Shortcut Guide to SQL Server Infrastructure Optimization

Chapter 1


[Editor's Note: This eBook was downloaded from Realtime Nexus—The Digital Library. All leading technology guides from Realtimepublishers can be found at http://nexus.realtimepublishers.com.]

Chapter 1: Traditional Challenges and Their Impact

SQL Server can be an organization’s greatest asset and—from a corporate management perspective—a complex challenge. The reason is that SQL Server (or any enterprise-class database system, for that matter) places a heavy demand on infrastructure resources such as servers, the network, and storage. Many organizations continually struggle to scale their SQL Server installations to meet business demand—including requirements for availability in the face of hardware failure—while at the same time bemoaning the “data center bloat” that seems to be the unavoidable companion of large SQL Server installations. A lot of that is simply the cost of SQL Server’s success in the enterprise: As companies find more uses for SQL Server, there are inevitably more SQL Server installations to deal with. Some of the challenge comes from what I’ll call traditional SQL Server architecture, which too often consists primarily of double-clicking Setup, or architecting SQL Server for performance without taking manageability into account.

In fact, SQL Server and “data center bloat” don’t necessarily go together any more than SQL Server and “difficult to manage” do. With the right tools and techniques, you can have a top-performing SQL Server infrastructure without having to cram your data centers with so much hardware that they’re all but overflowing. Some of these tools and techniques may not seem very obvious, which is perhaps why many SQL Server architects don’t discover them right away. They are, however, extremely effective.

It’s all about infrastructure optimization, or letting your application get the very most from the infrastructure that it’s running on. Before I can begin to share some of these new techniques, however, I need to back up a bit and explain exactly why so many organizations aren’t really optimizing their SQL Server platform.

Page 8: The Shortcut Guide to SQL Server Infrastructure Optimization


The Symptoms of an Un-Optimized Infrastructure

I want to stress that the “management problems” most often associated with SQL Server are really associated with the underlying infrastructure: too many servers, too many disparate storage devices, and so forth. When the infrastructure that SQL Server relies upon isn’t optimized, what sort of “big picture” problems can you expect to encounter?

Isolated Systems and Data Silos

One sure sign of an un-optimized system is data silos, as illustrated in Figure 1.1. This occurs when each major application within a company is given not only its own database—as you would expect—and perhaps its own SQL Server instance—which might make sense—but also its own SQL Server computer, and sometimes its own server cluster.

Figure 1.1: Data silos result in hardware bloat and are a symptom of an un-optimized infrastructure.

Page 9: The Shortcut Guide to SQL Server Infrastructure Optimization


Some IT administrators would argue that these single-purpose servers are the cause of an un-optimized infrastructure—and, to a point, I suppose I would agree. Other administrators would tell you that this is an optimal architecture because they like this kind of isolation between applications—even if none of these servers are even close to fully utilized. However, I think that these types of isolated systems are for the most part a symptom of an un-optimized infrastructure. I’ll explain why.

Imagine you work for a department in your company that needs to deploy a SQL Server computer for a project, or perhaps a department-specific application. You’re perfectly willing to have the database hosted on your company’s “big” SQL Server—the one that perhaps runs SAP, or PeopleSoft, or some other large enterprise application. The database administrators (DBAs) who “own” that server won’t hear of it, though, because you’ll be occupying precious “growth capacity” on their machine—unused resources, including computing power and storage, that they’re saving against future need. So you go out and pay for your own SQL Server computer. It’s probably housed in the corporate data center and may even be tended to by those same DBAs, but it’s a machine dedicated to you. Most likely, a lot of your machine’s resources are being wasted, too, simply because the machine is too big for your one little database.

So the symptom is data center bloat but the cause is that the infrastructure isn’t optimized to the point at which your DBAs feel comfortable using every last iota of resources available to them before adding more—they insist on maintaining a degree of wasted, unused resources on each SQL Server machine, “just in case.” It’s not their fault, of course—it’s the infrastructure’s fault.

Reactive Management

Reactive management is another symptom—not cause—of an un-optimized infrastructure. Reactive management simply means managing the latest crisis: When Server “X” runs out of capacity, everyone runs around trying to fix the problem, implement a bigger server, or find some other solution. Although corporate management culture can certainly take some of the blame for this style of management, for the most part, the blame can be laid squarely on the infrastructure. If it were capable of adapting more quickly and easily to changing business needs, then new business needs could be handled smoothly, without all the running around and firefighting.

Availability Concerns

One last major symptom—not cause—of an un-optimized infrastructure is availability concerns. In most cases, organizations would prefer that their databases never go down, even in the face of scheduled maintenance. Without an optimized infrastructure—one capable of moving database workload around to any available computer, at any time, without interruption—the only way to achieve that type of availability is through relatively simplistic, resource-dependent clustering mechanisms, such as Windows Cluster Service. Although those kinds of clustering mechanisms solve the unplanned downtime problem, they also create more data center bloat by doubling (or more) the number of servers, storage resources, and other infrastructure elements. Think about it this way: If you have a two-node SQL Server cluster serving a single application, and the cluster’s “active” node is at 30% utilization, the overall cluster is at just 15% utilization—keep in mind that there is a “passive” Server B essentially doing nothing! That’s the type of utilization that provides great fault-tolerance but at a pretty high cost in terms of hardware and ongoing maintenance. Again, this is a symptom; if the infrastructure were properly optimized, less of that extra hardware would be necessary.
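To make that utilization arithmetic concrete, here is a minimal Python sketch. The figures are the illustrative ones from the paragraph above, not measurements from any real cluster:

```python
def cluster_utilization(node_utilizations):
    """Effective utilization of a failover cluster: work being done
    divided by total capacity across all nodes. A passive node
    contributes capacity to the denominator but no work."""
    return sum(node_utilizations) / len(node_utilizations)

# Two-node active/passive cluster: the active node runs at 30%,
# the passive node sits idle.
print(cluster_utilization([0.30, 0.0]))  # 0.15 -- just 15% overall
```

The same formula shows why adding more passive standby nodes drives effective utilization even lower.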

Page 10: The Shortcut Guide to SQL Server Infrastructure Optimization


Top-Level Causes of the Un-Optimized Infrastructure

So where does an un-optimized infrastructure come from? It’s difficult to find all the root causes because they seem so embedded in the way we’re used to working. Often these bad habits were derived from working with the product and institutionalizing workarounds for the product’s deficiencies. Even when the deficiency is addressed, the workaround persists. However, there are clear technological and business causes that result in an un-optimized SQL Server infrastructure. Many of these causes are part and parcel of traditional architectural techniques that have simply met their limit in terms of business agility; by examining some of these causes, we’ll understand what we need to change about our infrastructure architecture in order to achieve some degree of optimization.

Technological Causes

Technological causes are typically not bad decisions or even bad products (although there are obviously exceptions). Instead, they’re often just limitations of the way we look at infrastructure architecture.

Performance Constraints

Traditional thinking tells us that the only way to improve SQL Server performance is to buy a bigger server. SQL Server is almost always the first application to benefit from new processors—some of the first multi-processor, multi-core x64 servers, for example, were sold as SQL Server computers. And in many cases, “bigger is better” is still true for SQL Server. It’s pretty rare, though, to find even massive enterprise applications that can’t be satisfied with a 4-way dual-core 64-bit server with lots of memory installed.

However, in reality very few databases require everything that a single server can deliver; in many cases, several of an organization’s databases could be “stacked” onto a single powerful server. But the server has always been held to be the performance constraint for SQL Server—scaling up with bigger hardware, in other words, can be better than scaling out to multiple servers through the use of a distributed database. Any given server can only be expected to support a finite amount of SQL Server workload, so you tend to buy a server capable of handling all the workload you think you’ll ever get, and then you just put a single task on that server—“reserving” the rest of its capacity for growth. This is the architectural technique that often first leads to data silos and data center bloat.
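The “stacking” idea can be sketched as a back-of-the-envelope capacity check. The workload figures and the 25% growth headroom below are hypothetical, purely for illustration:

```python
def fits_on_server(workloads, server_capacity=1.0, headroom=0.25):
    """True if the combined workloads fit on one server while still
    reserving the given fraction of capacity for future growth."""
    usable = server_capacity * (1.0 - headroom)
    return sum(workloads) <= usable

# Five departmental databases, each a modest slice of one "big"
# server's capacity (normalized so 1.0 = the whole server).
workloads = [0.10, 0.15, 0.08, 0.12, 0.20]
print(fits_on_server(workloads))  # True: roughly 0.65 used, 0.75 usable
```

Even with a healthy growth reserve, all five of these hypothetical databases fit on one server that traditional architecture would have split across five.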

Page 11: The Shortcut Guide to SQL Server Infrastructure Optimization


Storage Constraints

Traditional architecture assigns storage space to a single computer. The reasons are fairly obvious; the concept of multiple servers sharing the same storage space is technologically challenging. In many cases, the storage assigned to a given server is pulled from a larger pool of storage available from a Storage Area Network (SAN); although the SAN is one big “cloud” of storage, individual chunks are carved out and assigned to specific computers and aren’t typically shared between them.

This technique of “assigning” storage plays a crucial role in SQL Server infrastructure. SQL Server, of course, needs lots and lots of storage. Once a “chunk” is assigned to a given computer, it may or may not be straightforward to resize that “chunk” in the future; organizations tend to assign the maximum space they expect SQL Server to need for a given database application. Assigning additional “chunks” in the future, to accommodate long-term growth, is always an option but can become expensive both in terms of hardware and in terms of resource management.
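To see why “assign the maximum you expect to need” became the usual practice, consider a quick growth projection. The database size, chunk size, and 30% annual growth rate below are hypothetical:

```python
def years_until_full(current_gb, chunk_gb, annual_growth):
    """Years of compound growth before a database outgrows the
    storage 'chunk' assigned to its server."""
    years, size = 0, float(current_gb)
    while size <= chunk_gb:
        size *= 1.0 + annual_growth
        years += 1
    return years

# A 200 GB database in a 500 GB chunk, growing 30% a year,
# outgrows its allocation during its fourth year.
print(years_until_full(200, 500, 0.30))  # 4
```

With resizing difficult and new chunks expensive, architects understandably over-provision up front rather than revisit the allocation every few years.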

Chassis Constraints

Servers can, of course, be upgraded—but only to a point. Most mainstream servers support a maximum of four to eight processors and are typically locked into a single family of processors. Servers can only accommodate so much memory; even large x64 servers have an upper limit on how much memory can be physically installed in the chassis. These are hard limits; you can’t “squeeze” an extra processor under the hood, so once you’ve upgraded a server’s hardware to its maximum capability, you’re done. If the SQL Server application running on that server needs more hardware resources, you’re faced with the sometimes-ugly task of migrating the application to an all-new server.

That is why SQL Server computers typically have a lot of extra capacity: Migrating databases from one server to another is painful, but once you reach the limits of your hardware, you have no other options. Thus, it would seem to make sense to build servers with plenty of “free space” or “room to grow” and not load them up to maximum capacity right at the outset.

Business Causes

Not all causes of un-optimized SQL Server infrastructure can be laid at the feet of technology. In some instances, business requirements force architects into less-than-optimal designs. Again, I have to stress that these business requirements aren’t bad in and of themselves; it’s just that—using traditional architecture—they often lead to an un-optimized infrastructure.

Page 12: The Shortcut Guide to SQL Server Infrastructure Optimization


Branch Offices

Branch offices are one of the most difficult aspects of any IT infrastructure, and it seems as if software vendors never take branch offices into consideration when designing their products. Branch offices typically need high-speed—for example, local—connectivity to certain resources, practically demanding their own on-site servers. Those servers are rarely fully utilized because many branch offices are (almost by definition) relatively small. Branch offices may or may not have their own dedicated IT administration resources, meaning these widely distributed resources (servers, storage, operating systems—OSs, and so forth) may need to be centrally managed—imposing a higher level of overhead for the remote administrators.

Branch offices may also be a cause for data center bloat. Branch offices may require dedicated databases, meaning they often wind up with those databases on a dedicated server. Again, this server is rarely fully utilized, creating wasted resources in addition to the distributed management overhead.

Departmental Servers

As I’ve already discussed, individual departments often have specific applications that require their own database, which in many organizations leads to the department “owning” (if not actually administering, although sometimes that’s the case, too) their own dedicated SQL Server computers. After all, companies and departments have found SQL Server to be easy to deploy, relatively inexpensive, easy to administer, and so forth—it’s easy and inexpensive to just spin up a new database server when you need one for a project, or a department, or any other need. However, this creates “bloat,” meaning the organization winds up having more SQL Server computers than it really needs based on the total SQL Server workload being served.

Project-Based Servers

Project-specific database servers—whether “owned” by a department or used in a more cross-organization fashion—are also one result of traditional SQL Server architectural decisions, and a cause of bloat. Project-based servers are amongst the worst kind of bloat because they’re typically designed to be used for a relatively finite period of time—a design decision that often leads to the use of older, “spare” hardware. Too often, the project extends to near-infinite duration, meaning that older hardware continues to be a management and maintenance burden. And, as with most bloat, the server is rarely well-utilized—creating additional capacity in an infrastructure that rarely requires it.

Top-Level Causes: A Summary

It’s important to remember that none of the technological or business causes I’ve discussed are inherently bad. Desiring department-specific databases, or project-specific databases, is perfectly normal and acceptable. Wanting to maintain “room for growth” in the database infrastructure is absolutely necessary. The problems come from technological limitations—such as chassis constraints—and from perceptions of how SQL Server must be designed in order to meet business requirements.

Simply put, the design choice of, “let’s use a dedicated server for that” happens too frequently, when it isn’t actually necessary in order to meet all the organization’s needs and goals. Dedicated servers, however, are the cause of bloat—one aspect of an un-optimized infrastructure. What are some other results of an un-optimized infrastructure?


The Result of an Un-Optimized Infrastructure

Let’s quickly define un-optimized infrastructure to be sure we’re on the same page with the impact of having one: infrastructure, of course, refers to the underlying, supporting technologies that make something—in this case, SQL Server—work. SQL Server’s infrastructure consists of obvious elements such as servers, storage resources, backup and restore resources, networking components, and so forth. Less-obvious elements include data center resources, power requirements, administrators’ time, and so on. Un-optimized simply means that the infrastructure elements involved aren’t being used to the best effect or maximum capability or capacity. In other words, the infrastructure as a whole has a lot of wasted or unused resources in it.

“Extra” isn’t always a good thing. Imagine having a lot of extra food in your kitchen, for example: it takes up room, spoils if it goes unused for too long, costs more to purchase than necessary, and has other negatives. “Extra” money in your bank account might seem nice, but if it’s just sitting there earning no interest, it’s not really working to its maximum effect for you. “Extra” insurance might give you a feeling of comfort and security, but it costs a lot, and if it’s never used, not only is that “extra” capacity wasted but so is all the money you spent acquiring and maintaining it.

As I’ll discuss later, one goal of an optimized infrastructure is to achieve agility—the ability to quickly rearrange IT resources to smoothly meet changing business needs. An optimized infrastructure isn’t the only way to achieve agility. You could simply spend a fortune on extra infrastructure resources, for example. However, the cost of achieving agility that way is prohibitively high, which is why few organizations have done so. An optimized infrastructure both saves you money and provides agility. But back to the topic at hand: If you’ve got an unoptimized infrastructure, what are some other negative effects you can expect it to deliver?

Maintenance Overhead

If un-optimized translates loosely to “too many extra resources,” an increase in maintenance overhead is probably an obvious negative impact. Server management—not to mention storage management and management of other types of resources—isn’t a linear thing. Managing 10 servers might require a couple of hours per week apiece; managing 20 might require three apiece. That’s because, as the number of resources increases, certain overhead elements—inventory, SQL Server patch management, Windows OS patch management, hardware failure management, warranty management, and so forth—simply become more cumbersome. Figure 1.2 illustrates the curve that I’ve experienced in the past, with the number of hours expended each week growing faster than linearly with the number of resource elements being managed.


Figure 1.2: Maintenance overhead increases significantly as resources are added to the infrastructure.

Although it might initially seem like a good idea to have extra resources in case they’re needed, there is a very real cost in terms of maintenance that is associated with such resources.
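As a rough illustration, the curve from Figure 1.2 can be modeled as a fixed amount of per-server attention plus coordination overhead (inventory, patch tracking, auditing) that grows faster than linearly with the server count. The coefficients below are invented for the sketch, not measured values:

```python
import math

def weekly_admin_hours(servers, per_server=0.15, coordination=0.05):
    """Toy model of the Figure 1.2 curve (coefficients are illustrative):
    direct per-server work scales linearly, while coordination overhead
    grows superlinearly because every added server touches org-wide
    processes such as inventory, patching, and auditing."""
    if servers < 1:
        return 0.0
    return per_server * servers + coordination * servers * math.log2(servers)

for n in (10, 20, 40):
    print(f"{n} servers -> {weekly_admin_hours(n):.1f} hours/week")
```

Doubling the server count in this model more than doubles the weekly hours, which is the shape of the curve the chapter describes; real coefficients would have to come from your own time tracking.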

The problem gets worse when SQL Server is being independently deployed by individual departments. Too often, they create their own management processes, resulting in inconsistency, more difficult auditing procedures, different patch levels, and more—exponentially increasing the effort spent within the organization to maintain and manage SQL Server.

Application and Data Availability

An un-optimized infrastructure has a significant impact on application availability, as well. Some companies feel that hardware clustering is an ideal way to increase application availability. After all, the odds of two entire servers (or more, in larger clusters) utterly failing are pretty remote. But it can happen, and so some companies add even more resources to create a “never fails” infrastructure. That is an infrastructure positively laden with wasted, difficult-to-manage resources.

There is a vicious cycle here, too. More resources provide better availability but increase maintenance and management. An increase in maintenance and management can lead to improper maintenance and management, which can negatively impact availability. Figure 1.3 illustrates this seemingly endless circle.


Figure 1.3: The need for better availability leads to adding resources, which leads to increased maintenance (with a higher risk of mistakes), which negatively impacts availability.

Ideally, availability is something an optimized infrastructure gives you for free without the need for additional “availability-dedicated” resources and without the companion rise in maintenance costs and complexity that comes with adding resources to the infrastructure.


Management Overhead

An un-optimized infrastructure has a particularly chilling effect on management. One aspect of this stems from the number of resources being managed. There are simply too many, in most cases, to manage effectively. Management loses track of the number of resources present on the network, often forgetting about “extra” capacity and instead implementing more resources when additional projects kick off.

Another management impact comes when the business needs a change in the infrastructure. Perhaps the organization made a major acquisition, launched a new product division, or changed the way it does business; whatever the cause, the business needs have changed. Un-optimized infrastructures, however, don’t readily accept change, and so every change in the business results in a management headache for the infrastructure: Resources are shuffled, redeployed, migrated, upgraded, and changed, all with incredible manual labor. Manual labor of course means risk of error, and so more management resources are spent analyzing risks and coming up with mitigation plans. More management time is spent tracking the project, and when it’s all over, often as not, the business needs have changed again.

This is such an issue in most organizations that many of them simply avoid changing business needs. In other words, they let the inflexibility of their own infrastructure dictate what the business can do, and in today’s increasingly competitive marketplace, that’s simply not acceptable.

Imagine a world where you discover that a new business acquisition will put a specific SQL Server over-capacity. No problem: You have another server with plenty of capacity that is only handling a few file-serving tasks. You push a button, and SQL Server is installed on that server. The over-capacity database is instantly and automatically moved to the new server, and users never realize the move took place—it happened in seconds. Other databases move onto the now-empty database server, filling it almost to capacity. Those databases leave older server hardware vacant, allowing you to dispose of legacy hardware that may be becoming a maintenance nightmare. With a few clicks in a graphical user interface (GUI), you’ve turned a business challenge into an infrastructure benefit! That’s what an optimized infrastructure should look like from a management perspective—no projects, no hassles, just the ability to reconfigure IT assets smoothly, and with the click of a button. Does your infrastructure let you do that today?

Disaster Recovery Overhead

Having lots of database servers means having lots of databases, and having lots of databases means you have to find a way to back up and potentially restore all of that data. “Silo” data—that is, databases running on essentially standalone servers—is especially difficult to back up and restore simply because you’ve got so many management points. Each silo has its own storage, which means each server either needs dedicated backup equipment—even more resources you’ll need to manage—or you’ll burden the network with backup-related data. Some organizations create a separate network specifically to carry backup data to central backup servers—a network that involves even more resources that will need maintenance and management.

This is a classic problem of a fragmented, inflexible, un-optimized infrastructure. An optimized infrastructure, however, gives you centralized backup and restore capabilities for free, as part of being optimized.


Life Cycle Management

An un-optimized network plays havoc with hardware life cycle management. Older hardware is difficult to get rid of simply because it’s so risky and time-consuming to move software such as SQL Server from one machine to another (in many cases, for example, client applications may have to be altered to use a new server name, or you might have to deal with significant downtime during a move or migration). Newer hardware is easy to add, but adds new overhead for management and maintenance.

To cope with these issues, organizations often try to have a standardized server platform, standardized storage platform, and so forth. The problem with this technique is that the hardware industry changes rapidly, and a “standardized” platform that is state-of-the-art today might not even be available for purchase in as little as a year.

Storage Management

Storage management has, for some time, been the bane of many organizations. Although storage area networks (SANs) make storage capacity somewhat more dynamic, allocating that storage to SQL Server still requires relatively static designs. It’s not always straightforward to expand or shrink the space made available to a given database, meaning you wind up designing each database with a “little extra” capacity—creating, in total, a lot of extra capacity—usually a lot more than you really need. Admittedly, storage management issues have improved vastly in the past few years, but when you start mixing in requirements for availability and the ways in which SQL Server needs everything to “mesh” together, storage management is still very much a challenge.

Performance Management

I’ve talked at length about the desire to engineer extra capacity into any SQL Server infrastructure; that’s a valid and worthy goal, but simply adding more servers every time a new database requirement comes up isn’t the answer. Doing so creates unacceptable bloat and “dedicated” extra capacity.

In other words, if Server A is running two databases at 40% capacity, the extra 60% is only available to those two databases. If Server B is running one database at 90% capacity, there is no way—using traditional architecture—to “transfer” the extra capacity from Server A to Server B. The result, then, is that every server must have extra capacity, creating a huge capacity surplus, albeit a surplus that is inflexible.
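A quick sketch of that arithmetic, using the hypothetical Server A (40% utilized) and Server B (90% utilized) figures from above:

```python
# Illustrative sketch only: why per-server headroom is inflexible.
# Utilization figures are the hypothetical ones from the text.
utilization = {"ServerA": 0.40, "ServerB": 0.90}

def can_grow_dedicated(server, extra):
    """Traditional architecture: growth is limited to that one server's headroom."""
    return (1.0 - utilization[server]) >= extra

def can_grow_pooled(extra):
    """Pooled architecture: any database can draw on the pool-wide headroom."""
    spare = len(utilization) - sum(utilization.values())
    return spare >= extra

# Suppose Server B's database needs 30% more capacity:
print(can_grow_dedicated("ServerB", 0.30))  # False: only 10% free locally
print(can_grow_pooled(0.30))                # True: 70% free across the pool
```

The total spare capacity (70% of a server) is identical in both models; only the pooled model can actually apply it where it is needed.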

Here’s an analogy: Imagine that every utility company you buy from required payment in a different currency. You would have to maintain cash reserves in euros, dollars, yen, and so forth—and imagine that there was no such thing as currency exchange. Many homeowners maintain a bit of a “pad” in their bank accounts for emergencies; you’d be maintaining that “pad” in multiple different currencies, meaning your total “pad” would be several times what you might prefer. Wasted capacity, in other words.

Security and Auditing

And now the whopper: securing and auditing your un-optimized infrastructure. Every server, every storage device, every network component, and every database—they all have to be secured, and organizations are increasingly subject to legal and industry requirements (to say nothing of internal policies) that require not only security but verifiable, auditable security. This is where you pay the price for all those “extra” resources in the infrastructure. Every single one is a major piece of overhead when it comes to securing and auditing.


In an optimized infrastructure, of course, it’s easy. You simply tell the infrastructure what it should look like, from a security standpoint, and the various infrastructure elements dynamically comply. That is a large part of what Microsoft’s Dynamic Systems Initiative (DSI) is all about—defining top-level policies and having managed elements automatically configure themselves to comply.

At a simpler level, however, you can ease security and auditing pains by simply reducing the number of resources on your network. Consolidate. Make every resource earn its place in the infrastructure by working at nearly maximum capacity so that every resource is absolutely necessary. That is another large part of what infrastructure optimization is all about.

Introducing the Infrastructure Optimization Model

Microsoft’s Infrastructure Optimization Model (IOM) is a simple way for organizations to find out how optimized their infrastructure really is. The model uses common characteristics of IT organizations to classify the organization into one of four areas, shown in Figure 1.4: Basic, Standardized, Advanced, and Dynamic.

Figure 1.4: The Microsoft IOM.


Basic

An organization at the Basic level is essentially reactive, meaning they “fight fires.” The infrastructure is what it is, and most of the organization’s time is occupied dealing with problems and issues: little or no time may be available for new projects; the infrastructure contains wildly varying (non-standard) elements; and there may be no formal processes for change control, ensuring service levels, measuring service levels, and so forth. Processes tend to be manual and localized, with minimal central control over resources. Standards may not exist for backup, security, deployment, compliance, and other common practices.

Basic organizations tend to feature few tools, and the tools they do use are often nonstandard, assembled haphazardly over a long period of time. Desktop and server computers will run varying OSs, often at varying patch levels. Hardware is rarely standardized and legacy hardware and applications are commonly found. IT staff members may have little formal IT training or may simply be overwhelmed by the quantity of work. Very little central planning occurs, and new projects—if time is found for any—are often deployed haphazardly or over a long period of time. Many new deployments fail or spend forever in “planning” stages. IT is usually perceived as a cost center that significantly trails business requirements.

Standardized

An organization at the Standardized level is gaining control over its infrastructure. Written standards and policies have been introduced to govern common IT practices, to manage desktops, and so forth. Some centralization, such as for authentication, has taken place. An effort is being made to standardize resource types and configuration to make maintenance and management easier.

Fewer IT staffers are “jacks of all trades,” instead focusing on specific areas for administration. More specialization leads to more skilled staffers who make fewer mistakes and can focus more on automating common tasks. More tools are found in the Standardized organization, freeing up staffers to work on new projects and to implement new technologies and tools.

IT is still perceived as a cost center but it’s one that is well-managed. The Standardized organization may be able to allocate IT costs back to production business units, helping to justify their overhead; in most cases, business units will also have written service level agreements (SLAs) in place defining the quality of IT service they receive.


Advanced

An IT organization at the Advanced level is perceived as a business asset rather than as overhead. IT efforts are fully standardized, documented, and in most cases centrally controlled and managed. The cost of managing resources is at the lowest possible level due to standardization and well-written and executed IT policies. Security tends to be proactive, responding to threats and challenges rapidly and before they’re realized in the form of production impact.

Management frameworks such as the Information Technology Infrastructure Library (ITIL) are commonly part of an Advanced organization. You can also expect to see management technologies, such as Microsoft’s System Center family, providing centralized configuration and control. Deployment tools such as Windows Deployment Services (WDS) may also exist. In general, nearly every common IT task will be handled by some form of automated, centralized toolset—only exceptional tasks are handled manually in an Advanced organization.

Advanced IT organizations are seen as a business partner and an enabler for new business efforts. IT assets are still directly managed, although management is done through tools rather than locally and manually. Production business units turn to IT to help implement new business efforts and to reduce the cost of existing business efforts.

Dynamic

An organization at the Dynamic level exhibits the agility I’ve been mentioning throughout this chapter. The IT organization has implemented the tools, technologies, and techniques to quickly and smoothly reconfigure and redeploy IT resources on demand, creating a flexible, malleable environment that can quickly respond to business needs—not only reacting to changing needs but in fact anticipating those needs and enabling them to occur. Virtualization plays a large role, as do advanced forms of clustering, because they abstract software resources (such as applications) from the underlying hardware infrastructure, enabling software resources to be dynamically deployed and repositioned with little effort.

The tools used in a dynamic organization are more advanced. They tend to create a layer of abstraction between business policies and actual IT assets; deploying a new Active Directory (AD) domain controller, for example, may be as simple as selecting a machine and telling it to become a domain controller. The tools take care of reconfiguring the computer with not only the necessary software but also the necessary configuration standards for operations, security, auditing, and so forth. Changing a security setting for the organization may simply require a change to a policy; the tools take care of actually implementing that change on any computers that require it.

IT is perceived as a valuable partner in the business. IT creates the capability for new business efforts. It reduces overhead on existing business efforts, freeing up production business unit assets and management to focus on new directions; IT then implements additional capability and capacity to handle those new directions—all dynamically, all smoothly, and all nearly transparently. A major portion of IT staff time is spent on new projects because most routine maintenance is handled automatically and proactively. Little or no manual administration occurs.


Application Platform Optimization

Microsoft’s Application Platform Optimization (APO) model is an application-specific look at the IOM, with a goal of providing the infrastructure, technologies, and tools needed to build more connected and adaptable systems. Within the APO model are several key areas, including:

• Data management—obviously the one most applicable to SQL Server

• Business intelligence

• Business process management

• Software development

• User experience

These five core capabilities ride “on top of” an optimized infrastructure, providing proactive capabilities that are corporate assets rather than cost centers.

Within the specific realm of data management, APO focuses on scalable, integrated database platforms designed to securely store and manage increasingly large quantities of data drawn from disparate sources—in other words, SQL Server. But, as I’ve discussed already, it’s not enough to simply have SQL Server—it has to run on an infrastructure specifically designed to provide optimized capabilities and characteristics to SQL Server.

Let’s take roadways as an example. What does an “optimized roadway” look like? Well, that depends on what type of traffic it’s going to carry. A major local thoroughfare, for example, might include a main highway for traffic that is passing through, with separate access roads leading to local homes and businesses. The access roads would periodically connect into the main highway, allowing traffic to interchange between “passing through” and “local access” lanes. That same level of optimization, however, wouldn’t work for a major interstate, where the goal is to move larger amounts of traffic—including larger trucks—longer distances in a shorter amount of time. For a highway, specialized optimization technologies would be brought into play: Onramps and exits to firmly separate interstate traffic from local traffic; electronic metering systems to manage the merging of the two types of traffic at interchange points; bypasses for areas with closely spaced interchanges; and so forth. In other words, there is no single type of infrastructure that is suitable for all situations; the basic infrastructure model—the road—needs to be augmented with specific technologies designed to support the ways in which the infrastructure will be used. The same holds true for SQL Server, and in the next section, I’ll lay out some of the specific keys that you’ll need to look for in order to achieve a SQL Server-optimized infrastructure.


Keys for SQL Server Infrastructure Optimization

The next chapter will begin examining various ways of optimizing your infrastructure for SQL Server’s benefit. I’ll look at SQL Server’s own technology and design parameters, technologies such as virtualization and clustering, techniques such as pooled-resource clusters, and so forth. Not all of these technologies and techniques are optimal for SQL Server, though, so I need to first define a set of criteria that we’ll use to determine the appropriateness of any given technology or technique.

Abstracted

A key criterion is that SQL Server be abstracted from its underlying hardware. We need to manage SQL Server without regard for the underlying storage or other hardware resources. Ideally, we should be able to consider a large pool of servers, and a large pool of available storage, as a single unit. If we have the total server capacity to support 20,000 SQL Server connections (or physical input/output, or whatever other measurement you want to use), we should be able to allocate databases across that capacity at will. Similarly, if we have 10 terabytes of storage, we should be able to distribute that across databases as needed, with no regard for how that storage is physically implemented in the data center. Figure 1.5 illustrates this abstraction concept.

Figure 1.5: Abstracting SQL Server from the hardware resources.
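To make the idea concrete, here is a minimal sketch of that abstraction, with hypothetical capacity units and database names. Databases are placed against one pool of capacity; which physical server or disk backs each placement is a detail the pool hides:

```python
# Illustrative sketch of hardware abstraction (names and units hypothetical).
# Databases request capacity from one pool; which physical server or LUN
# backs each placement is an implementation detail the pool hides.

class ResourcePool:
    def __init__(self, connections, storage_tb):
        self.connections = connections   # e.g., 20,000 total connections
        self.storage_tb = storage_tb     # e.g., 10 TB total storage
        self.placements = {}

    def place(self, db, connections, storage_tb):
        """Allocate from pooled capacity; fail only if the POOL is
        exhausted, never because one particular box is full."""
        if connections > self.connections or storage_tb > self.storage_tb:
            raise RuntimeError(f"pool exhausted placing {db}")
        self.connections -= connections
        self.storage_tb -= storage_tb
        self.placements[db] = (connections, storage_tb)

pool = ResourcePool(connections=20_000, storage_tb=10.0)
pool.place("Sales", connections=6_000, storage_tb=2.5)
pool.place("HR", connections=1_500, storage_tb=0.5)
print(pool.connections, pool.storage_tb)  # 12500 7.0
```

A placement fails only when the pool as a whole is exhausted, never because one particular box is full; that is the essential property the abstraction provides.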


Segmentation

We must be able to create multiple asset pools and assign databases to an entire pool. For example, we may have one pool for mission-critical applications, and that pool may have a larger percentage of extra capacity engineered into it (covered in following sections). Less-critical applications may be assigned to other asset pools with less extra capacity engineered in. Figure 1.6 illustrates the segmentation concept.

Figure 1.6: Segmenting assets into pools and assigning databases to pools.
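Segmentation can be sketched as multiple pools with different engineered headroom; the pool names and percentages here are hypothetical:

```python
# Illustrative sketch of segmented asset pools (names and figures hypothetical).
# Each pool protects a different fraction of reserve capacity; a placement is
# rejected if it would eat into that pool's engineered headroom.

POOLS = {
    "mission_critical": {"capacity": 100, "headroom": 0.40, "used": 0},
    "general":          {"capacity": 100, "headroom": 0.10, "used": 0},
}

def assign(db, pool_name, load):
    pool = POOLS[pool_name]
    usable = pool["capacity"] * (1 - pool["headroom"])
    if pool["used"] + load > usable:
        return False  # would violate the pool's engineered reserve
    pool["used"] += load
    return True

print(assign("ERP", "mission_critical", 55))  # True  (55 <= 60 usable)
print(assign("CRM", "mission_critical", 10))  # False (would exceed 60)
print(assign("CRM", "general", 10))           # True  (90 usable)
```

A mission-critical pool rejects placements sooner because it protects a larger reserve; less-critical pools are allowed to run closer to full.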


Manageability

Management of SQL Server itself must be no more difficult than it is today. In other words, we can assume that any database will have a given amount of management overhead and that any single SQL Server computer (or, more accurately, SQL Server instance) will have a given amount of overhead. Whatever means are used to achieve our other goals should add no more management overhead than the existing database- and instance-related overhead we’re already accustomed to. If possible, our solution should actually reduce the management overhead.

Ideally, management of SQL Server should become easier. An ideal solution would provide the capability to manage a “farm” of SQL Server computers as easily as one. Designating a single patch for installation, for example, would “push” the patch to all SQL Server computers as part of an overall process. Health monitoring would likewise be aggregated, and so forth, so that the entire collection of SQL Servers is “seen” as a single unit by administrators.

Performance

We must be able to achieve optimal SQL Server performance with little additional effort. We can rely on tools such as Microsoft System Center Operations Manager to tell us whether a given database or application is “healthy” or not; we must be able to correct “unhealthy” situations quickly and transparently—without taking the database or application offline.

This is a key requirement because it’s the one that most frequently leads to an un-optimized infrastructure using traditional SQL Server architectural techniques. It’s easier to ensure good performance with an infinite number of servers, though, so a companion requirement is that we must have a minimum of wasted capacity in the new infrastructure. Any “margin for growth” capacity offered by the infrastructure must be offered in such a way that it is growth capacity for all of our databases, not just a particular one. In other words, we may design the infrastructure to have 10% extra capacity on purpose; that capacity must be immediately available to any database that needs it—the capacity cannot be “hard coded” to a specific database or SQL Server instance.

Availability

We must be able to maintain 100% availability of all databases at all times, through hardware failure, maintenance requirements, and so forth. We may, as a design decision, accept that a failure or maintenance action may reduce performance temporarily. Or we may engineer extra capacity into the infrastructure so that performance can be maintained to a specific level—understanding the requirement that any extra capacity must be universally available to all databases and SQL Server instances.

I fully acknowledge that most organizations do not promise 100% availability for all of their databases in their SLAs. In some cases, then, this level of availability might be lower. I will suggest, however, that 100% availability is what everyone wants—they simply can’t afford it using traditional techniques. So, for now, let’s “aim high” for 100% availability all the time, and see what compromises we may need to make, or are willing to make, for lower levels of availability.


Life Cycle

We must be able to bring new server and storage assets into our infrastructure and remove old assets with no downtime and with minimal effort. For example, we should have the ability to deploy new servers with little manual setup beyond the Windows OS itself (which can also be automated, of course). Our pool of server assets should be able to be heterogeneous, permitting us to mix existing hardware assets and to bring in new assets based on market conditions, not on an arbitrary hardware standard.

Disaster Recovery

Our solution must provide robust support for disaster recovery, preferably using native SQL Server capabilities and/or industry-standard disaster recovery tools and techniques. Disaster recovery requirements should add an absolute minimum of additional assets to the infrastructure, and disaster recovery should work in conjunction with our pooled, abstracted hardware assets. We should have the capability of restoring to a purpose-built “recovery pool,” perhaps located off-site or hosted by a dedicated disaster recovery facility, and we should be able to independently engineer the performance, availability, and other aspects of this recovery pool. We should not have to give up our management advantages and dynamic capabilities (covered next) when restoring the infrastructure, even in the event of a total site failure.

Dynamic

Our infrastructure must be dynamic. We must be able to reallocate performance and storage assets transparently, with little or no impact to users or applications, on demand. We must be able to dynamically shrink and expand server and storage pools, and dynamically reallocate databases and instances as desired. Reallocation should occur automatically in the event of a hardware failure, but in a controllable and planned way. We must also be able to instantly reallocate, on demand, any resource that was automatically reallocated due to a failure.

Coming Up…

With those criteria out of the way, let’s move on to the next chapter, where I’ll look at various ways of optimizing the infrastructure to support an application like SQL Server.


Chapter 2: Optimizing Your SQL Server Infrastructure: Good Ideas, Bad Ideas

In the previous chapter, I discussed the impact of sprawl in a SQL Server infrastructure. Basically, my conclusion was that traditional SQL Server architectural techniques, along with a number of business factors (such as political ones), seem to make SQL Server sprawl inevitable. The result is a highly un-optimized infrastructure, in which copious unused resources exist but cannot easily be deployed to help fill any need. That is, if you take a look at your infrastructure as a whole, you probably have lots of unused storage, processor capacity, and so forth, but you’re not necessarily free to “move” unused resources from one area to cover a shortage in another. Your resources are inflexible, tied to individual servers, each often handling a very specific workload.

This chapter looks a bit harder at why things turn out this way, from a SQL Server technical viewpoint, and explores some of the ways the industry has tried to solve the problem. Remember that the ultimate goal is to create an infrastructure that’s dynamic according to the Microsoft Infrastructure Optimization Model (IOM)—capable of changing smoothly and quickly to meet business needs, and in fact designed to accommodate rapid changes.

The Nature of RDBMS and SQL Server

SQL Server itself has traditionally defied attempts to be incorporated into a dynamic infrastructure. The very nature of SQL Server—the way it works—doesn’t seem, at first glance, to accommodate rapid change. To be fair, the challenges you get with SQL Server are really issues with any relational database management system (RDBMS), whether IBM DB2, Oracle, or any other brand. An RDBMS is a resource-hungry, intensive application, and it can be very difficult to architect that type of application so that it can accommodate rapid change. Let’s look at a few specific aspects of SQL Server (again, most of which apply to any RDBMS) that would seem, on the face of it, to make a dynamic infrastructure more difficult.

SQL Server Instances

Since SQL Server 2000, SQL Server has been built as a multi-instance product. In other words, you can basically have multiple copies of SQL Server running on the same computer, at the same time. Each copy is called an instance, and each instance has an independent server configuration and one or more user databases, as illustrated in Figure 2.1.

Page 27: The Shortcut Guide to SQL Server Infrastructure Optimization

Chapter 2

21

Figure 2.1: Multiple instances of SQL Server running on a single computer.

Clients connect to each instance using a combination of the physical server name and the instance’s name: ServerA\Instance1, for example. Apart from this unique naming convention, client computers aren’t really aware that they’re connecting to an instance; as far as they’re concerned, each instance could be running on a completely different physical machine.
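To make the naming convention concrete, here’s a minimal sketch in Python of how a client application might build a connection string that targets a named instance. The server, instance, and database names are hypothetical, and real client libraries add more options than this:

```python
# Sketch: building a client connection string that targets a named
# SQL Server instance. All names here are hypothetical.
def connection_string(server, instance, database):
    # Clients address an instance as "server\instance"; they neither
    # know nor care which physical machine actually answers.
    return (f"Server={server}\\{instance};"
            f"Database={database};Integrated Security=SSPI;")

cs = connection_string("ServerA", "Instance1", "Sales")
print(cs)  # Server=ServerA\Instance1;Database=Sales;Integrated Security=SSPI;
```

The point of the sketch is simply that the instance, not the physical host, is the unit the client addresses.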

Instances are actually a SQL Server feature that works in favor of a dynamic infrastructure, although few companies really use them that way. One of the primary reasons that instances were introduced to the product was to help improve stability. If one instance needs to be taken down for maintenance, for example, another instance could continue to run, provided the physical server has sufficient free hardware resources. Another reason for instances was configuration: SQL Server has certain configuration parameters that are server-wide (well, instance-wide); instances permit multiple databases to run on a single physical machine, using different “server-wide” configuration settings. For example, one instance might be configured to use Windows Authentication, while another might allow mixed Windows and SQL Server authentication.


Instances also have a real security benefit because they represent a security boundary. Being able to log on to Instance A, for example, doesn’t imply anything about your ability to log on to Instance B.

The ability to have instances in SQL Server actually enables almost every technique I’ll examine for creating a dynamic infrastructure. That’s because instances, for the first time, created an abstraction between the physical server and the “SQL Server” that client computers see and connect to. As far as clients are concerned, “SQL Server” is an instance, not a physical computer. If an instance were moved from one computer to another, all the client might have to do is change the name used to connect to that instance (and playing tricks with DNS could even eliminate that need). In other words, SQL Server clients don’t care what physical hardware they’re talking to; they care what instance of SQL Server they’re talking to.

So, instances: A mark in SQL Server’s favor for dynamic infrastructure.

Processor and Memory Requirements

SQL Server is definitely an application that can fully utilize a lot of processor power. Record-setting SQL Server benchmarks are often accomplished on amazingly well-equipped servers with eight or more processors running in parallel.

SQL Server is also known as an application that likes its memory. The more memory, the better, I always say, and SQL Server makes excellent use of however much memory you can throw at it. One of the major advances of the past few years has been the introduction of 64-bit processors, which provide a much larger memory address space than the traditional 4GB limit. In the past, I never built a SQL Server that didn’t have the full 4GB of memory, and now with 64-bit processors, I like to throw in a few extra gigabytes of memory whenever possible.

Take some processors or memory away from SQL Server, and its performance can suffer. More importantly, either processor or memory capacity is often a “ceiling” that limits a SQL Server instance’s ability to handle additional workload (the other main “ceiling” is often disk throughput, and that’s the one you often hit first). Once a computer’s processor and memory capacity are maxed out, and SQL Server is fully utilizing those resources, SQL Server simply can’t grow any further. This single fact is the primary cause of SQL Server sprawl in many organizations (see Figure 2.2).


Figure 2.2: Processor and memory capacity on two SQL Server computers.

Suppose you were responsible for these two servers, A and B. The colored bars represent current resource utilization for both memory and processor. Your organization needs to deploy a new database related to a specific project, and you know that, at least initially, this new database won’t use more than 5% of a SQL Server’s memory or processor capacity. To which server would you deploy the database?

Most administrators would choose either “B,” or a third answer: Get a new server. Because A is already nearing its limit, most would fear that either the new database, or whatever databases A is currently hosting, would eventually grow and exceed the server’s capacity. Because moving a database to a new server can be a time-consuming task, depending upon how clients are connecting to that server, most administrators prefer to stick a database somewhere where they can leave it more or less permanently. Moving a database also entails downtime, which often isn’t desirable from a business perspective, providing another reason to avoid hosting a database on a server that might soon run out of capacity.

This attitude on the part of administrators—largely driven by the way SQL Server itself works, and the relative difficulty of relocating databases to other servers—results in lots of servers with spare capacity sitting around. It’s also what drives the installation of new physical servers, even when the infrastructure technically has enough free capacity to handle whatever new databases are needed: The anticipation of growth.
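The placement decision administrators face here can be sketched as a tiny heuristic: put the new database only on a server whose worst-case utilization stays under a safety ceiling, and otherwise buy a new server. This is just an illustration with hypothetical numbers, not a real capacity-planning tool:

```python
# Toy placement heuristic (hypothetical utilization figures): pick the
# server whose utilization after adding the new load stays under a
# safety ceiling; otherwise recommend a new server.
def place(servers, new_load, ceiling=0.80):
    candidates = [
        name for name, (cpu, mem) in servers.items()
        if max(cpu, mem) + new_load <= ceiling
    ]
    # Prefer the candidate with the most headroom remaining.
    return min(candidates, key=lambda n: max(servers[n])) if candidates else "new server"

servers = {"A": (0.85, 0.90), "B": (0.30, 0.40)}  # (cpu, mem) utilization
print(place(servers, 0.05))  # "B" -- Server A would exceed the ceiling
```

Note how the safety margin, not the actual 5% load, drives the answer: Server A could technically host the database today, but no administrator wants to bet on it staying small.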

Let’s say, however, that you decided to put the new database on Server A. Eventually, that short-term project database became a long-term mission-critical application (of course), and its usage grew—eventually exceeding the processor capacity of the server, as shown in Figure 2.3.


Figure 2.3: Server’s processor capacity is exceeded.

Here’s the ultimate problem with a non-optimized infrastructure: Even though Server B has plenty of excess capacity, it doesn’t do Server A any good. Certainly, if Server A’s workload is the result of more than one database, as we’ve suggested, then perhaps one of those could be migrated to Server B. If both databases were held by a single SQL Server instance, that migration can be fairly difficult. Client applications will have to be reconfigured or modified to connect to the different server, at the very least, and the database will experience downtime during the migration period. If the databases were hosted by separate instances, the move might be a bit easier, but there would still be a production impact of some kind.

Instead, many organizations will take an easier route. Move all of the workload from Server A to Server C, a brand-new, bigger, and better server. That attitude makes server manufacturers very happy, but it’s not using the infrastructure very efficiently. Perhaps Server A could be redeployed to handle other databases, meaning the infrastructure finds itself with more inflexible excess capacity.


Storage Requirements

SQL Server also, of course, requires disk space for its databases. Disk space on large SQL Server computers these days is often provided by a Storage Area Network (SAN), which is often pretty easy to expand, giving SQL Server computers extra disk capacity when they need it. That’s not always the case, of course. Many departmental or project servers might have local SCSI storage, which can be more difficult to expand on demand. In those cases, storage becomes another constraining resource much like memory or processor, which I’ve already discussed, and imposes many of the same limitations on the infrastructure.

But even more easily expanded SANs can lead toward a less-dynamic infrastructure, simply because of the way SQL Server must use its storage. Keep in mind that a given instance of SQL Server, no matter how many databases it is hosting, needs a single set of storage resources. This is definitely easier to explain with an illustration, so take a look at Figure 2.4.

Figure 2.4: Storage resources used by a SQL Server instance.


In this scenario, a server hosts a single instance of SQL Server, which in turn hosts four databases. The server has six storage resources connected to it—logical disks, essentially, either connected from a SAN or directly from SCSI storage. Each database has its own dedicated logical disk (probably for data files), and each pair of databases shares a logical disk (perhaps for log files). In reality, of course, SQL Server storage configurations can be much more complex, but this serves to illustrate the point, which is this: All six storage resources are tied to the SQL Server instance. If this server starts to reach its capacity, you can’t simply move the instance to a new server without also moving (or re-provisioning) all six storage elements—or reconfiguring the databases to use a different storage configuration. If one of the storage elements has a lot of excess capacity, it can’t be used by SQL Server instances running on other servers, because the storage is tied to this instance.

This is a very important distinction: Storage is most commonly tied to the instance, not to the database. Yes, the storage is being used by the database, but the database is not necessarily your most flexible unit of management for storage. For example, consider the storage element shared by databases 1 and 2. If database 1 were migrated to a different SQL Server instance running on another server, perhaps its dedicated storage element could be migrated with it. The shared element, however, could not be so easily migrated because it’s being used by another database.
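A quick sketch makes the shared-element problem concrete. Modeling Figure 2.4’s layout as a map from storage elements to the databases using them (element and database names are hypothetical), only an element used exclusively by the migrating database can travel with it:

```python
# Sketch: which storage elements can move with a database? An element
# can accompany the migration only if no other database uses it.
# Element and database names are hypothetical.
storage_map = {              # storage element -> databases using it
    "data1": {"db1"},
    "data2": {"db2"},
    "log12": {"db1", "db2"},  # a log disk shared by both databases
}

def movable_with(db):
    # Elements whose only user is the migrating database.
    return sorted(e for e, users in storage_map.items() if users == {db})

print(movable_with("db1"))  # ['data1'] -- the shared 'log12' must stay behind
```

The shared log disk is exactly the kind of resource that keeps administrators from attempting the move at all.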

This isn’t to say that reconfiguring storage is insanely complicated, because of course it isn’t. However, doing so can result in some downtime as well as some administrative time, and it’s not entirely without risk in terms of data loss or additional downtime. All of these factors conspire to make storage configurations just a bit less flexible than they could be. Administrators therefore try to avoid storage reconfiguration whenever possible, and you wind up with a lot of unused storage capacity floating around the infrastructure—capacity that is essentially locked to a specific instance, and that can’t be easily used to relieve overcrowding on another instance.

Client Connectivity Requirements

The need for a SQL Server client application to know the name of a SQL Server computer and instance can also limit how dynamic the SQL Server infrastructure can be. Unfortunately, many application developers persist in hardcoding database connection strings, meaning that moving a database to another server—or even another instance on the same server—requires the application to be modified and redistributed (and the redistribution is often the difficult part). Even with an application that is more reconfigurable, actually performing the reconfiguration can be difficult depending upon how the application was written.
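One mitigation for hardcoded connection strings is simply to externalize them. Here’s a minimal Python sketch using a config file, so that a database move means editing one setting rather than redistributing the application. The file layout and key names are hypothetical:

```python
# Sketch: keeping the connection target in a config file instead of the
# code, so a database move only requires editing the file. The section
# and key names here are hypothetical.
import configparser, io

cfg_text = """
[database]
server = ServerA\\Instance1
database = Sales
"""

cfg = configparser.ConfigParser()
cfg.read_file(io.StringIO(cfg_text))   # a real app would read a file on disk
conn = f"Server={cfg['database']['server']};Database={cfg['database']['database']}"
print(conn)
```

An application built this way is “reconfigurable” in the sense the text describes; the remaining pain is distributing the updated config, which is still far easier than rebuilding the application.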


Tricks exist that can help alleviate these issues. For example, one trick is to never allow clients to use actual server host names. Instead, a new DNS alias is created for each SQL Server instance that exists in the organization. The alias points to the physical server name that is hosting the instance. This allows each instance to have a unique “server name,” and if the instance is moved to another physical server, the only update needed is to the DNS alias record (typically a CNAME record). So a certain amount of flexibility exists for client connectivity if the organization has been smart. The task of moving an instance, from a client connectivity standpoint, can be accomplished fairly quickly by simply updating DNS.

Note that this DNS technique doesn’t abstract the instance name; DNS can’t provide that service. Instead of accessing InstanceA on Server2, you set up DNS so that you’re accessing InstanceA on ServerABC—with ServerABC simply “pointing” to Server2. If InstanceA needs to be moved to Server3, it can be, so long as it keeps the same instance name. Just re-point ServerABC to Server3, and the change is accomplished.

I’m not sure I’d qualify this as flexible within the Microsoft IOM definition of the word, though. DNS changes can take time to propagate through the organization, and Windows computers cache DNS information for about ten minutes, so you can’t really rely on instant changes just using DNS. Other techniques exist, which I’ll discuss later, but for now my point is this: SQL Server itself requires server names and provides very little in the way of flexibility for quickly changing server names. Other non-SQL Server techniques can be used to add flexibility to the infrastructure, but not all organizations are able to use all these techniques. Without using some kind of non-SQL Server technique, moving instances between physical servers can become anything from a hassle to an outright maintenance nightmare. That’s one reason why administrators often loathe relocating SQL Server instances: Getting clients reconnected.

Disaster Recovery Requirements

The need for reliable disaster recovery is another factor that contributes to SQL Server sprawl and to a less-than-flexible infrastructure. Making backups of large amounts of data is one issue. For example, some organizations may rely on direct-attached tape drives to back up each SQL Server computer’s databases. If the databases’ size begins to exceed the capacity of the tape system—either in storage capacity or in the time it takes to back up the databases—then backup becomes a constraint similar to memory or processor. Administrators respond by ensuring every server has excess capacity, creating a lot of excess, inflexible capacity within the infrastructure.

Many organizations today use much smarter and more flexible backup schemes. For example, they may have a centralized backup solution that backs up databases across a private network that is used only for backup traffic. This is a much more flexible scheme, as it centralizes excess capacity. Any server can benefit from any central excess backup capacity. Most organizations have a hybrid approach, using several centralized backup systems for various “clusters” of SQL Server computers. This helps to create a better balance between backup speed and excess capacity.
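The arithmetic behind centralizing backup capacity is worth making explicit. With hypothetical figures, dedicated per-server backup units and a shared pool strand exactly the same total excess, but only the pool can apply that excess wherever it’s needed:

```python
# Toy illustration (hypothetical GB figures): per-server backup units
# strand excess capacity in fixed slices; a pooled solution makes the
# same total excess available to any server.
needs     = [400, 120, 900, 250]           # backup need per server, GB
dedicated = [500, 500, 1000, 500]          # one backup unit per server, GB

stranded    = sum(c - n for c, n in zip(dedicated, needs))
pooled_free = sum(dedicated) - sum(needs)  # same hardware, centralized

print(stranded, pooled_free)  # both 830 GB -- but only the pooled excess
                              # can be applied to whichever server needs it
```

The totals are identical; what changes is whether a server facing a spike can actually reach the unused capacity.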


Business factors often step in to make the infrastructure less flexible, though. For example, consider Figure 2.5, which illustrates an organization’s four geographically dispersed offices. Backing up SQL Server over a wide-area network (WAN) connection is rarely feasible, so each office has at least one backup solution in place for its SQL Server computers. In reality, political issues—departmental ownership of resources, for example—mean that each office has a couple of backup solutions, some shared by multiple servers and others dedicated to a single server. I’ve shown the backup solutions as colored bars underneath the servers that share each solution, with colors indicating the backup solutions’ current utilization—red for almost maxed out, yellow for moderate utilization, and green for low utilization.

Note that the utilization I’m showing refers to the backup solution, which as you can see is shared by different sets of servers at each location.

Figure 2.5: Backup solutions and capacities in use across a geographically distributed organization.


Once again, we see that ample excess capacity is present to accommodate the organization’s total needs. However, this capacity isn’t flexible. Office A, for example, is coming close to running out of backup capacity; the excess capacity in Office C, however, is of no help in solving the problem. Office B has more than double the capacity it requires, but that extra doesn’t do Office D any good for their almost-overburdened solution. Even the excess capacity already available in Office D isn’t available to the SQL Server computers that need it.

Often, the business factors that result in these situations are legitimate for other reasons—administrative control over resources, budgetary requirements, and so forth. However, they create a hardware-centric infrastructure that is inherently less flexible than it could be. If Office A needed to implement a couple of new databases, they might well be looking at significant expenditure and effort to make that happen—even though the organization has already spent more than enough money on backup capacity. Although Office C could probably implement new databases without any issues, Office A has less flexibility than it really should.

Basic Server Consolidation

I want to establish a baseline for reducing SQL Server sprawl, so I’ll begin with the basic consolidation capabilities that are built into SQL Server itself.

Technology Overview

The instance is SQL Server’s basic unit of consolidation. Essentially, any server running SQL Server 2000 or later is ready to use this consolidation technique: simply relocate instances onto fewer physical servers. Of course, I say “simply” even though that relocation isn’t always simple.

The SQL Server Connection

There aren’t any built-in tools for “moving” a SQL Server instance from one physical server to another. Instead, you have to perform the task manually, which involves several distinct steps. First, decide whether you’re going to create a new instance on the destination server or use an existing instance. Once you’ve decided which instance to use, copy over the database: Data Transformation Services (called Integration Services in SQL Server 2005) can copy the database, or you can simply copy the database files directly. This will involve downtime; you really need to take the original server offline so that changes aren’t being made until the move is complete. Then you’ll need to reconnect your client applications using whatever means those applications support and require.

This type of consolidation is almost entirely a manual process, and it can be time-consuming depending upon the amount of data involved. It does involve a certain amount of downtime, and it’s not something you can do in the middle of the business day for production databases.
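The manual steps above can be sketched as a simple sequence. This Python fragment only simulates the move with ordinary files in temporary directories (the paths and file names are hypothetical, and a real move uses detach/attach or backup/restore, not bare file copies alone), but it captures the offline-copy-attach ordering that forces downtime:

```python
# Sketch of the manual database move: take the database offline, copy
# its files to the destination server, then attach there and repoint
# clients. Paths and names are hypothetical; this only simulates the
# sequence with plain files.
import shutil, tempfile, pathlib

src = pathlib.Path(tempfile.mkdtemp(prefix="serverA_"))
dst = pathlib.Path(tempfile.mkdtemp(prefix="serverB_"))
(src / "sales.mdf").write_text("...data pages...")

online = False                                        # 1. take the database offline
shutil.copy2(src / "sales.mdf", dst / "sales.mdf")    # 2. copy the database files
attached_on = "ServerB"                               # 3. attach at destination, repoint clients

print((dst / "sales.mdf").read_text() == "...data pages...", attached_on)
```

Step 1 is the expensive part from a business standpoint: the database is unavailable for the entire duration of step 2, which scales with data size.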


The Impact on Sprawl

This technique obviously has a positive impact on sprawl by reducing the number of servers that are being managed. As databases are migrated off of servers, those servers can be decommissioned and put to other, non-SQL Server uses or removed from the environment entirely.

The Impact on Flexibility

This technique doesn’t have much of an impact on flexibility. You’re certainly not going to be moving databases from server to server all the time, and organizations rarely consolidate in this manner for any reason other than reaction to a problem. For example, consider Figure 2.6.

Figure 2.6: Lack of flexibility.

As shown, plenty of excess capacity exists across these four servers. In fact, you could probably combine Servers B and D. And let’s say that a new database came along that would require about 60% of a server’s capacity. Perhaps moving some workload from A to C would free up enough capacity on A to host the new database, without implementing a new server. Would the average SQL Server administrator do it? Probably not, because none of these decisions leaves much free capacity on any one server, and you can’t really move databases and instances around easily enough to be doing it all the time. Consolidation using this technique is something you might do only in really obvious circumstances, such as when you have a couple of very under-utilized servers, and combining them would still leave plenty of free capacity.
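The “really obvious circumstances” test can be written down as a one-line check: two servers are worth combining only if their summed load still leaves a comfortable margin on a single box. Utilization figures here are hypothetical:

```python
# Toy check (hypothetical utilization figures) for the "obvious"
# consolidation case: combine two servers only if their summed load
# still leaves comfortable headroom on one box.
def can_combine(load1, load2, ceiling=0.75):
    return load1 + load2 <= ceiling

print(can_combine(0.25, 0.30))  # True  -- like two lightly loaded servers
print(can_combine(0.25, 0.60))  # False -- combined load leaves too little headroom
```

Because the move itself is manual and disruptive, administrators only act when this check passes by a wide margin, which is exactly why the technique does little for day-to-day flexibility.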


Windows Server Clusters

Windows Server Clusters—that is, clusters built using the Windows Cluster Service—are primarily intended as high-availability solutions rather than a server-consolidation technique. However, they do seem to offer some interesting capabilities for server consolidation, as on the face of it they work around some of the issues involved with basic server consolidation. Let’s see how well-suited they are to reducing SQL Server sprawl and improving the flexibility of the infrastructure.

Technology Overview

Figure 2.7 illustrates a typical two-node cluster running SQL Server. Part of the clustering technology is the creation of virtual server names. In this case, SQLA, SQLB, and SQLC are all “computer names” that the cluster uses; only one cluster node—the “owner”—will respond to a given name at any given time. In this case, each virtual name maps to one of three SQL Server instances; as shown, two instances are running on the first node, and one is on the second node.

Figure 2.7: SQL Server in a Windows Cluster.


For each node that will be “active”—that is, for each node running a SQL Server instance that is fulfilling client requests—an external storage system is required. Therefore, this cluster requires two such systems. Each storage system is physically connected to each node (often via a SAN, although sometimes with SCSI cabling), although only one node “owns” a given storage system at any given time, and only the “owner” node can access a given storage system.

In a failure situation, the resources “owned” by the failed node are “seized” by the surviving node. That’s straightforward enough: If you can picture the second node in our cluster dying, then the first would “own” all three virtual server names, would run all three SQL Server instances, and would “see” both storage systems.

From an infrastructure flexibility standpoint, clustering is interesting because it allows SQL Server instances to “move” across nodes with relatively little effort. Each instance is actually installed on every node in the cluster; it’s just running on the node which “owns” that instance at the time. So if we needed to take the second node offline for maintenance, we’d just manually move its resources to node 1 for a while. This would seem to enhance server consolidation, as well. Since resources can be moved between cluster nodes, you might feel less need to have a lot of excess capacity. Any excess capacity on one node could be used for any workload, since workload could be transferred to that node to utilize what capacity it had available.

The SQL Server Connection

SQL Server actually needs only a minimal awareness that it’s running on a cluster in order to work properly. To oversimplify a bit, each instance of SQL Server is installed on each node in the cluster but is stopped. When a node “owns” an instance, it starts the instance and begins serving client requests.

Each instance usually depends upon other resources being owned by the same node. That is, if a node can’t “own” the external storage that holds an instance’s database files, or can’t “own” the virtual server name that the instance uses to communicate, then that node can’t “own” the instance either. Those resources—storage, instance, and name—are transferred as a set. Most of the work is done by the Windows Cluster Service.
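That transfer-as-a-set behavior can be sketched as a simple failover simulation. The group and resource names are hypothetical, and a real cluster service does far more, but the key rule is the same: the whole group moves, never a subset:

```python
# Sketch: cluster resources (instance, storage, virtual name) form a
# group that always fails over together. Group and resource names are
# hypothetical.
groups = {
    "SQLA": {"node": 1, "resources": ["instance-A", "storage-1", "name-SQLA"]},
    "SQLC": {"node": 2, "resources": ["instance-C", "storage-2", "name-SQLC"]},
}

def fail_node(failed, survivor):
    for g in groups.values():
        if g["node"] == failed:   # the whole set moves, never a subset
            g["node"] = survivor

fail_node(2, 1)
print(groups["SQLC"]["node"])  # 1 -- instance, storage, and name all moved together
```

The same mechanism is used for planned maintenance: moving a group manually is just a failover you schedule yourself.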

The Impact on Sprawl

On the face of it, Windows clusters might not seem to impact sprawl at all. After all, two servers are two servers, whether they’re in a cluster or not. The theory of a positive sprawl impact depends upon the cluster itself having the excess capacity you need to handle growth or spikes in workload; because resources can be transferred between nodes, you would perhaps need only one node with a lot of excess capacity—the others could run closer to full capacity, meaning you could possibly eliminate some servers. Three- and four-node clusters add flexibility to this idea. The actual impact on sprawl might be minimal, though, because the impact on flexibility is smaller than you might think.


The Impact on Flexibility

Windows clustering isn’t a free-for-all when it comes to moving workload between nodes. Consider our example cluster: Because the two instances running on the first node store their databases on the same external storage, both instances depend upon that one resource. Therefore, both instances must be “owned” by the same node at all times. If you move the SQLA instance to the second node, its storage (the lower one) would also have to move, which means the SQLB instance would also have to move. That’s poor flexibility, because you can’t be very granular when moving workload around. Of course, you could build your cluster to look more like Figure 2.8, with each instance having completely dedicated resources.
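The shared-storage dependency described above is transitive: moving one instance drags along every instance reachable through shared storage. A short sketch (instance and storage names are hypothetical) computes that co-movement group:

```python
# Sketch: instances that share a storage resource must be owned by the
# same node, so moving one drags along everything reachable through
# shared storage. Instance and storage names are hypothetical.
uses = {"SQLA": {"store1"}, "SQLB": {"store1"}, "SQLC": {"store2"}}

def must_move_together(instance):
    group, frontier = {instance}, {instance}
    while frontier:
        stores = set().union(*(uses[i] for i in frontier))
        frontier = {i for i, s in uses.items() if s & stores} - group
        group |= frontier
    return sorted(group)

print(must_move_together("SQLA"))  # ['SQLA', 'SQLB'] -- SQLC is independent
```

With fully dedicated resources, as in Figure 2.8, every group has exactly one member, which is precisely what makes that configuration more flexible and more expensive to manage.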

Figure 2.8: Dedicated resources for each instance in a cluster.

However, you now enter a realm of increasingly complicated resource management. Imagine a dozen instances running on this cluster, with a dozen virtual names and a dozen storage resources. Managing all of that becomes complicated, error-prone, and difficult. Simply configuring the storage resources to operate that way can require excessive hardware. For example, if those external storage systems were SCSI arrays, you might need a dedicated controller card for each (depending on the hardware you’re using)—in each node. You’d quickly run out of expansion slots in your server! Even SAN-attached storage can present similar complications. Thus, although Windows clusters can add some flexibility to the infrastructure, they can’t add much, simply because that’s not the problem Windows clusters were designed to solve: they are intended purely as a high-availability solution, not a consolidation tool.


Hardware Virtualization

Technologies such as Microsoft Virtual Server and VMware Server are becoming increasingly popular for server consolidation. Essentially, these technologies make it possible to run multiple “virtual servers” (commonly called virtual machines or VMs) on a single physical machine.

Technology Overview

As shown in Figure 2.9, virtualization essentially works by laying a “hypervisor” layer on top of the machine’s physical hardware. This layer is controlled by an operating system (OS) such as Windows.

Figure 2.9: Hardware virtualization.

This model reflects a relatively modern way of building virtualization and is essentially how VMware ESX Server is built; with software products such as Microsoft Virtual Server 2005 or VMware Server, the model is slightly different. With those products, the virtualization software runs on top of the host OS rather than beside it. This adds overhead for accessing the machine’s physical hardware, which can cause resource constraints.

On top of the virtualization software run VMs. These essentially emulate real physical hardware, and you install a real OS such as Windows onto them, and then install software such as SQL Server. Each VM has its own virtual hardware resources: memory, disk, processor, and so forth. “Within” the VM, the OS believes it is running on a dedicated physical machine. In reality, the VM’s virtual resources are mapped to the machine’s physical hardware resources, which are partitioned and shared across all running VMs.
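The mapping of virtual resources onto physical ones is easiest to see with memory. Here’s a toy allocation check with hypothetical figures; real platforms add overcommit, ballooning, and reservations on top of this basic partitioning:

```python
# Sketch: a host partitions its physical memory among VMs; without
# overcommit, the allocations must fit within what the host actually
# has. All figures are hypothetical.
host_ram_gb = 64
vms = {"vm1": 16, "vm2": 16, "vm3": 24}   # RAM requested per VM, GB

allocated = sum(vms.values())
fits = allocated <= host_ram_gb
print(allocated, fits)  # 56 True -- 8 GB left for the host OS and hypervisor
```

The same partitioning logic applies to processor and disk: each VM sees dedicated hardware, but every cycle and byte comes out of one shared physical budget.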


The SQL Server Connection

SQL Server doesn’t “care” about virtualization of this kind and doesn’t need to do anything special to work with it. As far as SQL Server is concerned, the VM is a real, physical machine. It has a server name, memory, disk space, and everything else a physical machine would have.

The Impact on Sprawl

Virtualization can definitely have a partial positive impact on sprawl. By provisioning virtualization hosts as large, powerful servers, you can consolidate smaller machines—often older ones with hardware that is inferior to the current state of the art—onto fewer physical machines. Figure 2.10 illustrates the consolidation of five physical machines onto two virtual hosts.

Figure 2.10: Consolidating machines with virtualization.


There’s not exactly a formula for how many physical machines can go onto a single virtual host, though. For one, the virtual host may have more powerful processors, meaning each one is individually more powerful than a single processor in an older physical machine. And the overhead of the virtualization process itself is not insignificant, although it can be difficult to measure accurately.

Also consider that sprawl isn’t just the physical count of computers. Although reducing computer count saves data center space, converting physical machines to virtual machines doesn’t lessen the amount of OS maintenance you have to perform, the amount of SQL Server maintenance, and so forth. In fact, maintenance increases: Not only must you keep the OS and software on each VM up-to-date; you also have to worry about the OS and software on the physical machine that all the VMs run on. In this regard, virtualization has a negative impact on sprawl, in that it can slightly increase the number of computers, both physical and virtual combined, that you have to maintain.
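The maintenance arithmetic here is simple but worth spelling out. Using Figure 2.10’s example counts (the figures are illustrative), converting N physical servers into N VMs on H hosts leaves N + H operating systems to patch, not N:

```python
# The maintenance arithmetic behind this point (counts taken from the
# five-machines-onto-two-hosts example; purely illustrative):
# N physical servers become N VMs on H hosts, so you maintain N + H
# operating systems, not N.
physical_before = 5
hosts = 2
os_to_maintain = physical_before + hosts   # every VM keeps its OS, plus the hosts
print(os_to_maintain)  # 7 -- more OS instances than before consolidation
```

You save rack space and power, but the patching and monitoring workload actually grows.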

The Impact on Flexibility

Virtualization can also seem to have a positive impact on flexibility. Most high-end virtualization platforms, such as VMware’s ESX Server, provide tools to help move virtual machines between physical hosts to quickly rebalance workload as needed. With this technique, the virtual hosts—the physical machines—become one giant pool of resources, and workload—running inside VMs—can be spread across them fairly dynamically. Tools vary in their ability to seamlessly move VMs; none can do so without at least minor downtime, and few can do so without fairly major downtime.

In addition, because SQL Server itself is such a big consumer of computer resources, the additional overhead of virtualization can be something of a waste. Virtualization is absolutely ideal for combining disparate applications running on disparate OSs; it’s perhaps a good, but not great, solution for SQL Server consolidation, at least for low-use databases. However, virtualization solutions simply aren’t appropriate for high-use databases—they just can’t provide sufficient performance. Keep in mind that SQL Server, through its instances, already has a rough form of “virtualization”; what’s needed is a way to quickly move instances across physical hosts, without having to encapsulate those instances in a whole VM—along with a complete OS to run on.


Pooled-Resource Clustering

Pooled-resource clustering is a combination, of sorts, of Windows clustering and virtualization. Essentially, it’s a means of allowing SQL Server instances to become “virtual SQL Servers” in the sense of true VMs. They’re encapsulated and can be easily and quickly moved between physical computers. It borrows concepts from Windows clustering but does so with a crucial difference that eliminates the resource management overhead that comes along with Windows clustering. Another term for this technique is shared storage clustering.

Technology Overview With pooled-resource clustering, a cluster of servers—which can all have different hardware specifications—is connected to a single, shared storage system (typically a SAN). By leveraging a Windows-based cluster file system, each of the nodes in the cluster can “see” the same storage area. That is, they all have (for example) an S drive, and all of the files on the entire storage system are exposed to each node through that S drive.

Each server runs one or more instances of SQL Server. In reality, each instance may be installed on each server, although it is only started on one node at a time—much like Windows clustering. Like Windows clustering, a virtual server name is usually associated with each instance, and is “owned” by the node that “owns” the corresponding instance. Figure 2.11 illustrates the basic configuration.

Figure 2.11: Pooled-resource clustering.


The trick with this technology is that no two nodes must ever try to access the same database files at the same time. The reason is that Windows assumes all local storage is private, and therefore provides no mechanism for resolving access conflicts. Fortunately, each instance of SQL Server opens a different set of database files, and locks them when doing so. For example, the “SQLA” instance might be installed on all seven nodes but running only on node 1. If node 2 tried to start its “SQLA” instance, that instance would find the database files it was configured to use visible, but not accessible, because the instance running on node 1 had them locked. The instance would simply shut down, being unable to open its configuration database or any other databases. Figure 2.12 illustrates this activity.

Figure 2.12: File locking prevents duplicate instances from accessing the same files.

To “move” the SQLA instance to the second node simply requires shutting down the instance on node 1 and starting it on node 2—which, since the instance on node 1 is no longer running, will be able to access the database files on the shared storage area.

Automating this shutdown and startup process provides a very quick and robust high-availability solution. With this shared file system, data ownership doesn’t have to be transferred; it’s already there. Thus, clusters can have many servers providing active-to-active failover support for each other.
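That shutdown-and-startup move can be sketched in a few lines. This is a minimal simulation with hypothetical node and instance names; real cluster software would drive the Windows service control manager instead, stopping and starting services such as MSSQL$SQLA on each node.

```python
# Simulate "moving" an instance: stop it on one node, start it on another.
# Node and instance names are hypothetical. Stopping the instance releases
# its database file locks; starting it elsewhere reacquires them.
running = {"node1": {"SQLA", "SQLB"}, "node2": set()}

def move_instance(instance, source, target):
    """Relocate an instance by stopping it on source and starting it on target."""
    running[source].remove(instance)   # shut down; file locks are released
    running[target].add(instance)      # start up; file locks are reacquired

move_instance("SQLA", "node1", "node2")
print(running["node2"])  # {'SQLA'}
```

In a real failover product the same two steps run automatically when a node fails, which is why the technique doubles as a high-availability solution.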

Each cluster of physical nodes only requires a single storage area; instances depend upon their files rather than the storage area itself, so each instance of SQL Server becomes independent. Further, your configuration can be as flexible as you like: The “SQLA” instance, for example, might be capable of running on four out of the seven nodes, if those were the only nodes with sufficient hardware resources to support that instance.


The SQL Server Connection SQL Server itself doesn’t need to do anything special to run in this kind of environment; in fact, it won’t “know” that it is running in this kind of environment. External cluster management software is responsible for starting and stopping SQL Server instances as necessary, either on demand or in response to a hardware failure in a given node.

The Impact on Sprawl The impact on sprawl is positive with this technique. Every server can be utilized to its fullest. As shown in Figure 2.13, you can balance the workload however you want. In this example, the second server has plenty of excess capacity. If another server becomes overworked, that excess capacity can be used by transferring workload to the less-busy server. Not every server requires its own excess capacity, so you can use fewer computers to handle the same workload you have today.

Figure 2.13: Spreading workload across the pool.

In this figure, I’m showing server capacity and workload in colored bars below each server. The bars next to each database represent that database’s activity level.
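The rebalancing decision pictured in Figure 2.13 can be sketched as choosing the node with the most free capacity. The node names, capacities, and loads below are arbitrary illustrative numbers, not measurements from any real product.

```python
# Hypothetical capacities and current loads (arbitrary workload units).
capacity = {"node1": 100, "node2": 100, "node3": 80}
load     = {"node1": 90,  "node2": 30,  "node3": 70}

def best_target(workload):
    """Return the node with the most free capacity that can absorb workload,
    or None if no single node has enough headroom."""
    free = {node: capacity[node] - load[node] for node in capacity}
    target = max(free, key=free.get)
    return target if free[target] >= workload else None

print(best_target(40))  # 'node2' has 70 units free
```

Note the limitation this sketch shares with the real technique: free capacity isn’t truly pooled, so a workload must fit on a single node.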

This is a true reduction in sprawl, with a commensurate reduction in maintenance overhead for maintaining OSs and software. Further, when your pool itself eventually reaches its maximum capacity, you can add another, brand-new, leading-edge server to the pool. That server will likely be able to replace perhaps two older servers from the pool, while still having excess capacity for overall workload growth, which helps to maintain the reduction in sprawl.


The Impact on Flexibility This technique has the best impact on flexibility of any I’ve discussed so far. Because all servers can see each instance’s database, moving an instance to another server is simply a process of shutting off one instance and starting another—a process that takes about 30 seconds. Need to implement a new database? Put it on whatever server has the capacity for it. If it’ll initially be a low-use database, put it on a server like the first node in Figure 2.13; if the workload goes up over time, simply relocate it in seconds to the third or fourth node. You’re still maintaining extra capacity in the infrastructure, but that capacity is now available to any SQL Server instance that needs it, not just the instances installed on a particular computer. Want to completely upgrade all of your server hardware? Fine: Add new servers to the pool, move workload to them, and turn the old servers off. Done. Do you desperately need one of your SQL Server computers for another task for a short period of time (perhaps to replace a failed server that was doing some other task)? Relocate your SQL Server workload to other servers in the pool, and take the server you need out of the pool. You get tremendous flexibility, on demand. Rearranging your infrastructure is no longer a hassle or potential risk: The infrastructure is specifically designed for change. In Microsoft IOM terms, it’s dynamic.

Infrastructure Optimization: The Positive Impact on SQL Server Sprawl With an optimized infrastructure, you’ll definitely reduce SQL Server sprawl but you’ll also give “new life” to your business capabilities by being more flexible and more dynamic in the use of your infrastructure’s resources. I want to point out that the techniques I’ve shared in this chapter—particularly pooled-resource clustering—probably still meet the political needs of your organization. Remember that each instance of SQL Server is an “island,” with its own server-wide configuration. You can set up each instance to serve whatever business or political needs you have; if a department needs “total and exclusive control” of their database, fine: Give them an instance. You decide what hardware it runs on. SQL Server’s internal security is robust and granular enough to allow for any combination of scenarios.

Coming Up Next… In this chapter, I looked at various ways of consolidating SQL Server computers to reduce sprawl and explored each technique’s impact on the infrastructure’s flexibility. My conclusion is that pooled-resource clustering, of the techniques I’ve discussed, provides one of the most positive impacts on sprawl and on the infrastructure itself. By creating the ability to pool processor, memory, storage, and other resources, and giving you the ability to almost instantly relocate workloads across that pool, you create a highly dynamic infrastructure for SQL Server. In the next chapter, I’ll focus entirely on this pooled-resource clustering idea, looking at some of the base technologies that make it possible, and looking closely at how to actually build this type of infrastructure into your environment.


Chapter 3: The New Cluster: Technologies for SQL Server Infrastructure Optimization

The term clustering, when applied to SQL Server, is almost always used to denote a high-availability solution. In other words, SQL Server clustering is traditionally seen as a means of providing redundancy to SQL Server. Although that’s certainly admirable, and in many environments desirable, it’s not the end of what clustering can do. Rethinking what the word clustering can mean opens up the possibility of clusters that can actually help achieve an optimized SQL Server infrastructure. That’s what I call the new cluster: Clusters that use different techniques and technologies to go beyond high availability and to instead provide the kind of agility required to create what Microsoft’s Infrastructure Optimization Model would recognize as a Dynamic organization.

A number of third-party companies produce high-availability products for SQL Server, many of which use the term clustering to refer to groups of servers that provide backup capacity for each other. These companies include Neverfail, XLink, Sonasoft, and others. Some companies provide solutions that go beyond high-availability and seek to meet some of the other goals of infrastructure optimization, such as SQL Server consolidation; these companies include EMC, Scalent, and HP’s PolyServe brand. What’s interesting is that most of these companies take very different approaches to SQL Server, with the aim of achieving different business goals. For portions of this chapter, I’ll use the HP PolyServe Software for Microsoft SQL Server consolidation product as an example to illustrate some of the various concepts I’ll introduce.

I should note that mere consolidation is not actually a goal of what I’ll discuss in this chapter; it’s almost a side benefit. With a truly optimized SQL Server infrastructure, consolidation—that is, a reduced number of SQL Server computers handling the same workload—is something you get “for free.” The real goal is to create a set of techniques and technologies that help achieve a Dynamic organization, with the ability to quickly rearrange the SQL Server infrastructure, completely dynamically, to meet changing business needs.


Rethinking the SQL Server Cluster I want to step back for a minute and forget everything we’ve discussed about the often-overused word, cluster. Instead, I want to focus on SQL Server’s actual product requirements, and discuss how those requirements can lend themselves to a new type of cluster—one that is designed to meet the goals of an optimized infrastructure rather than just high availability. I’ll look first at SQL Server’s three main requirements: processor and memory, storage, and network.

Processor and Memory Requirements For the purposes of this discussion, processor and memory are something I consider to be a single unit—perhaps this section would be better named “stuff that is attached to the server’s motherboard requirements,” instead. Both are hardware resources that “live” inside the server and can’t be taken outside of the server in the way that, for example, disk storage can be.

SQL Server does, of course, require processor and memory resources in order to run. Because SQL Server is a multi-instance product, the processor and memory resources of a single server are considered a pool, which is shared amongst the instances of SQL Server that are running (as well as being shared with Windows itself and any other running applications). Windows dynamically allocates these resources on the fly based on the current workload.

From a planning perspective—and ignoring Windows itself and any other applications (SQL Server computers typically don’t run anything but SQL Server in terms of major applications)—the SQL Server instance is the smallest unit of management when it comes to processor and memory resources. That is, you can’t allocate processor time or memory capacity to a single database: Windows allocates those resources to an entire instance of SQL Server. Thus, if a single instance hosting a single database is experiencing a lot of growth in its workload, that work can’t be subdivided without significant database re-design (such as federated databases, which are beyond the scope of this book).

Storage Requirements Storage resources are much more interesting than processor and memory resources, in part because of the advent of Storage Area Networks (SANs), which permit disk storage to be moved “outside the server” (and physically away from the server), and more importantly to allow a single large pool of disk space to be broken down, or partitioned, for use by different servers. Perhaps the most typical type of SAN configuration is a single pool of disk resources, partitioned down into “chunks” that are made available for the exclusive use of a single server.

The reason that this type of configuration is “typical” is that neither Windows nor most basic SAN products have the ability to deal with resource contention. That is, if a single “chunk” of SAN disk space were made available to two servers simultaneously, things would work more or less fine provided both servers did nothing but read the data contained in that disk space (there are cautions you would need to take with regard to file permissions and so forth, but those are more administrative issues than technical ones). Even if one server only read data and the other server wrote data, things would probably still work more or less fine. It’s when the two servers tried to modify the shared data at the same time that things would go wrong: At best, the changes made by one server would simply be lost; at worst, the two servers would experience continual file-access errors as they attempted to modify the same data at the same time.


Interestingly, this is not an inherent problem when it comes to SQL Server. Consider the way SQL Server works: Each database consists of two or more files: At least a database file (with an .MDF filename extension) and at least one log file (with an .LDF filename extension). An instance is configured to open that database and make it available for client queries. However, when SQL Server actually does open a database, it locks that database’s files. And, before SQL Server attempts to open a database, it checks to see whether the database’s files are already locked. Thus, you could actually assign two instances, running on the same server, to both open the same database. Only the instance that actually started up first and locked the database would be able to do so; the second instance would “see” the lock from the first, and would gracefully fail to open the database.
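The “first one in, wins” behavior can be illustrated with an OS-level advisory lock. This is a rough analogy only: the sketch below uses Python’s fcntl.flock, which is Unix-only, whereas SQL Server itself takes Windows exclusive file locks on its .MDF and .LDF files.

```python
# Rough analogy: two "instances" try to open the same database file, and
# only the first to lock it wins. fcntl.flock is a Unix advisory lock;
# SQL Server uses Windows exclusive file locks, but the effect is similar.
import fcntl

first = open("example.mdf", "w")                   # instance 1 opens the file...
fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)  # ...and locks it exclusively

second = open("example.mdf", "w")                  # instance 2 tries the same file
try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("opened")                                # only reachable if unlocked
except BlockingIOError:
    print("files locked by another instance; failing gracefully")
```

The second lock attempt fails immediately rather than corrupting anything, which is exactly the graceful failure mode described above.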

This behavior raises interesting possibilities. Figure 3.1, for example, shows four SQL Server computers connected to a single “chunk” of SAN disk space. This disk space is used only for storing databases; in my initial configuration, each SQL Server computer is running a single instance, which is opening only a single database. As you can see, there is no contention for disk files or other disk resources, even though all the servers can “see” the other servers’ database files.

Figure 3.1: Shared storage.


These four servers now share a particular resource—the disk storage they can all “see”—so they can reasonably be referred to as a cluster. But keep in mind how SQL Server works: It’s okay to configure multiple instances of SQL Server to open the same database files because only the first one in will “win.” That makes a configuration like that shown in Figure 3.2 possible. Here, each server is running four instances of SQL Server. Each instance, A through D, is configured to open a single database, A through D. However, even though all four servers are “trying” to open the same four sets of files, there will be no resource contention because only the first instance to open each set of files will “win.” Obviously, relying on the “first one in, wins” method might not achieve the consistent, reliable results we’d want in terms of management, but this is a first step toward creating a new, more optimized infrastructure for SQL Server.

Figure 3.2: Shared storage, multiple instances—no contention.

Network Requirements When I discuss network requirements in this context, I’m not focused on the need for each SQL Server to be physically connected to a network, although that’s obviously a requirement. Instead, I’m looking at the need for client computers to connect to SQL Server via some type of address, usually an easy-to-remember name that is translated (or resolved) to an IP address by a DNS server.

As I discussed in the last chapter, a major hurdle to SQL Server consolidation and agility is these DNS names. If clients are connecting to ServerA and you move the database to ServerB, you’re going to have to somehow reconfigure these clients to connect to ServerB instead.


Certain tricks can be done with DNS to help abstract the connection between a name and a physical server. For example, the server name “SQL1” might not refer to a physical server at all but might instead be a “nickname” for a server named ServerA. If you wanted clients to start connecting to ServerB instead, you’d just change the “nickname” in DNS so that “SQL1” pointed to “ServerB” instead. That trick doesn’t necessarily work with SQL Server instances, though—at least, not without some additional tricks in SQL Server itself. However, the overall technique is a good one, and it can be leveraged in SQL Server through the use of third-party software. Figure 3.3 illustrates this: A client computer attempts a connection to \\SQL3\C, a named instance of SQL Server. A combination of DNS and third-party redirector software gets the client to the correct computer—ServerB—where an instance named C is running.

Figure 3.3: Redirector software helps abstract server names from actual servers.

If you needed to take ServerB offline for maintenance, you’d simply notify the redirector to start sending SQL3 requests to ServerA instead. ServerA already has an instance named C installed, and provided it was configured the same as the instance C on ServerB—and could “see” the same database files—clients would be none the wiser. Provided ServerA and ServerB used a SAN disk configuration like the one I discussed in the previous section, having them each “see” the same database files would be easy, and you’d be well on your way to a more optimized SQL Server infrastructure.
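The redirection idea can be sketched as a small lookup table. The alias, server, and instance names below come from the Figure 3.3 example, but the code is purely illustrative; real redirector products maintain and replicate this mapping for you.

```python
# Map an abstract name (what clients connect to) onto whichever physical
# server currently hosts that instance. Names are from the running example.
alias_table = {r"SQL3\C": ("ServerB", "C")}   # alias -> (host, instance)

def resolve(alias):
    """Return the (host, instance) pair a client should actually reach."""
    return alias_table[alias]

def repoint(alias, new_host):
    """Send future requests for this alias to a different host."""
    _, instance = alias_table[alias]
    alias_table[alias] = (new_host, instance)

repoint(r"SQL3\C", "ServerA")    # take ServerB offline for maintenance
print(resolve(r"SQL3\C"))        # ('ServerA', 'C')
```

Because clients only ever see the alias, the move is invisible to them—provided, as noted above, the instance on the new host sees the same database files.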


Foundation Technologies for the New Cluster Revisiting the previous chapter, and translating some of the requirements I outlined into “new cluster” terms, we find a fairly simple set of requirements. We essentially just need to be able to move instances from server to server in response to changing workload, hardware failures, or maintenance requirements. We need to be able to do this on a “cluster” of heterogeneous (in terms of hardware) servers so that we can incorporate all the hardware currently being used to run SQL Server in the environment. For the moment, I’m not focusing on security issues—specifically, the issue of what administrators might have access to within SQL Server. That’s something I’ll address later. For now, let’s focus on the basic infrastructure technologies necessary to meet the goals I’ve stated.

Abstracting the Processor and Memory In one sense, it’s impossible to “abstract” processor and memory resources. That is, there’s no way to take the processor and memory resources of several servers and have them appear as one giant “pool” of resources. However, SQL Server instances don’t “care” what processor, or how many processors, they run on, so from that perspective, abstraction isn’t necessary. That is, a given SQL Server instance can be installed on multiple servers and can be started or stopped on those servers as necessary. So long as the instance can “see” the database files it’s configured to use, and so long as clients can connect to the instance, the instance doesn’t “care” what server it’s running on.

This is similar to the way the Windows Cluster Service operates in an “active-active” configuration. Each cluster node runs one or more SQL Server instances, with the capability to move instances to a different node on demand. The main difference between the “new cluster” and Windows Cluster Service is in how the cluster deals with disk storage.

Abstracting the Storage SANs make it fairly easy to abstract the storage. As I explained earlier, the right SAN solution will allow the same “chunk” of disk space to be assigned to multiple servers. All the servers will be able to “see” all the files in that disk space, and provided there’s no need for resource access contention management—which there isn’t, with SQL Server—there won’t be a problem. Figure 3.4 illustrates this: You’re looking at four Remote Desktop sessions, each connected to a different server. All four servers have an “MFS Global” volume, which exists on a SAN. The disk space for that volume is the same physical space for all four servers, meaning that when I create a “New Rich Text Document.rtf” file on one server, it is immediately “seen” by the other three.


Figure 3.4: Abstracting storage through a SAN.

With a design like this, it doesn’t matter how many servers are in my “cluster.” Provided all of my SQL Server database files exist on that SAN disk space, any given server in the “cluster” can “see” any given database, meaning any SQL Server instance could be started on any given server, and it would run fine.

Note that additional disk space is needed for the Windows OS and the SQL Server application files—this space is often in fault-tolerant storage inside the server, although it could also be on other forms of storage. That “system” disk space needs to be private for each server because those files can’t be easily shared due to resource access contention.

This design is very different from the approach used by the Windows Cluster Service. Windows clustering, when used with SQL Server, requires that a given cluster node exclusively “own” disk storage that can be “seen” by every node in the cluster. In other words, if Node A is running SQL Server Instance A, then Disk Storage A must be exclusively available for Node A’s use, and it’s where the instance will look for its database files. If the instance is moved to Node B, then Node A must “relinquish” the disk space and allow Node B to have exclusive use of it. Any other instances that were running on Node A and storing their files in the same disk space must also move to Node B because “their” disk storage is moving. Windows Cluster Service’s design thus limits and complicates how flexible the cluster can be: Either each instance must have dedicated disk storage—creating a complex system of resources and dependencies that must be managed—or you must be content moving instances in groups according to their disk storage. The “new cluster” is more flexible: Storage is no longer a logistical issue because it has been abstracted away from any particular server or SQL Server instance.


Abstracting the Network As I mentioned earlier, special software will be needed so that clients can address specific SQL Server instances using names that aren’t tied to a specific physical server. As I explained, DNS tricks alone aren’t enough due to the way SQL Server works; instead, specific third-party software is needed to create nicknames, or aliases, for SQL Server instances. As Figure 3.5 shows, HP PolyServe Software provides an Instance Aliasing service that handles this requirement. It allows an administrator to define a series of abstract “server names.” In the organization’s DNS, these names are “pointed” to IP addresses that have been assigned to an HP PolyServe Software “cluster” (that is, a collection of computers all running this aliasing service, amongst other things). The aliasing service takes care of translating those incoming “server names” into actual physical server names and SQL Server instance names so that the client is connected to the proper SQL Server instance, no matter where in the cluster that instance happens to be running.

Figure 3.5: Aliasing abstract names to server/instance names requires a third-party service.

This is actually similar to the way that Windows Cluster Service works, with the Service itself providing the aliasing functionality.


Coordination: The Key to the New Cluster Let me wrap up what the “New Cluster” looks like so far: It’s a collection of physical servers, each with several instances of SQL Server installed. In fact, at this point, let’s say that every server has a copy of every SQL Server instance that the cluster is capable of running, meaning that any given instance can run on any given server. That’s about the best we can do in terms of abstracting processor and memory resources: Ensuring that any given instance can run on any given processor/memory set within the cluster.

The servers are all connected to a SAN, and they have at least one volume that they share. This allows every server to “see” every database’s files, effectively abstracting the disk storage so that it’s not server-dependent.

The servers all run an aliasing service that allows them to translate abstract host names into physical server/instance names. This allows clients to address a specific SQL Server instance, without being aware of the physical server that is running that instance at a given time.

Our cluster still has a few things that need to be done in order for it to work effectively: First and foremost, some form of coordination needs to exist so that only one server attempts to start any given SQL Server instance at a time. That way, we can predict which SQL Server computer will be running any given instance, and we can update our aliasing service accordingly, so that it “knows” where to send incoming requests for that particular instance. The HP PolyServe Software for Microsoft SQL Server product (which incorporates the aliasing service I pointed out earlier) provides this coordination. Figure 3.6 shows the main HP PolyServe Software administrative interface.

Figure 3.6: HP PolyServe Software for Microsoft SQL Server.


As you can see, this particular product doesn’t require that every server be able to run every SQL Server instance within the cluster. Servers are shown across the top of the grid, and SQL Server instances down the left side. A green dot within the grid means that server is capable of running that particular instance of SQL Server; the number indicates the server’s “preference” for that instance. A preference of “1” means that the server is the top choice for running that instance, for example. A green checkmark indicates the server that is actually running a given instance of SQL Server. This product provides the coordination necessary for the cluster to function, ensuring that each SQL Server instance is only started on a single physical server at any given time. It also provides other administrative and management functionality, which I’ll get to in the following section. At this point, though, we have a cluster fully capable of meeting the requirements for a highly optimized SQL Server infrastructure:

• Any instance of SQL Server can be started on any given computer. Thus, we can take a server offline for maintenance without significant downtime because its instances can be transferred to another computer.

• Free resources on any server can potentially be used for any instance. Although all of our free resources aren’t exactly “pooled” (meaning that two servers with 50% capacity don’t equal one server with 100% capacity because the two servers couldn’t “split” a workload requiring 75%), our free resources can be much more flexibly used by whatever workload will fit.

• A hardware failure is no more a problem than hardware maintenance: We simply transfer SQL Server instances to other servers, taking the failed server out of the cluster until it’s repaired.

• We don’t need homogeneous hardware within the cluster. Because SQL Server instances don’t care what they’re running on (provided the server meets the minimum SQL Server requirements, of course), we’re free to use any mix of servers that meets our requirements. That means we can create our “new cluster” using existing SQL Server resources.
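The preference grid described for Figure 3.6 can be sketched as a placement function that, for each instance, picks the lowest-numbered (most preferred) node that is still online. All names and preference numbers below are invented for illustration and don’t reflect any particular product’s defaults.

```python
# For each instance, the nodes able to run it, with a preference number
# (1 = top choice). Names and numbers are hypothetical.
preferences = {
    "SQLA": {"node1": 1, "node2": 2, "node3": 3},
    "SQLB": {"node2": 1, "node4": 2},
}

def place(instance, online_nodes):
    """Choose the most-preferred online node capable of hosting an instance."""
    candidates = {n: p for n, p in preferences[instance].items()
                  if n in online_nodes}
    if not candidates:
        raise RuntimeError(f"no online node can host {instance}")
    return min(candidates, key=candidates.get)

print(place("SQLA", {"node2", "node3"}))  # node1 is down, so 'node2'
```

The coordination software’s real job is simply to evaluate something like this for every instance, and to guarantee each instance starts on exactly one node.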

We still haven’t solved all the management problems—patch management, agile deployment of new servers, and so forth—that we need to be a fully Dynamic organization. I’ll tackle those topics next.


Rethinking SQL Server Management What I’ve introduced so far of the “new cluster” points toward a more Dynamic organization, but not a completely Dynamic one. There are still problems that I haven’t addressed, primarily related to cluster and server management. A Dynamic organization would manage the entire cluster as a unit, looking at the cluster’s overall free resources, adding nodes to the cluster almost on demand, and so forth. So the next major element of the “new cluster” is to rethink SQL Server management.

We can no longer be satisfied with managing individual servers, reconfiguring individual servers, and maintaining individual servers. Doing so simply leaves us stuck with the same problems as managing non-clustered SQL Servers: Every new server we add to the mix creates additional administrative overhead. In order to become a Dynamic organization, we need to be able to maintain more or less the same amount of administrative overhead for the entire cluster, no matter how many servers it encompasses.

Managing SQL Server Software The “new cluster” allows an entire cluster to be managed as easily as a single server. Obviously, database-specific configurations are stored within the SQL Server database itself, and so any SQL Server instance that opens that database will “see” any configuration changes. Server-level configuration for individual instances can be managed through the HP PolyServe Software console. These configuration changes can then be “pushed” out to the individual servers automatically by the HP PolyServe Software. Configuration changes can include updates and hotfixes along with instance-level configuration such as port number and other settings.

Adding Nodes to the Cluster A major requirement of the “new cluster” is the ability to quickly add new nodes in order to support additional workload. New nodes can either be intended to support growth in the cluster’s workload or can be intended to replace older, outdated hardware. HP PolyServe Software helps provide a Dynamic level of functionality by providing a single console that can “push” SQL Server’s software installation out to new cluster nodes—including multiple nodes at once. Figure 3.7 illustrates this concept, with nodes being selected to receive the SQL Server software. Service packs can be deployed in the same fashion, helping to simplify long-term maintenance of the cluster. Overall, this functionality allows new nodes to be added to the cluster much more quickly and helps to remove the “individual server” management technique in favor of a “cluster-wide” management technique.


Figure 3.7: Deploying SQL Server to new cluster nodes.

Adding SQL Server Instances Adding new SQL Server instances to the cluster should also be a one-step operation, with the new instance configured on whatever cluster nodes are desired automatically. Figure 3.8 shows this task in HP PolyServe Software: Each new instance—called a virtual SQL Server by this product—gets a dedicated IP address, a name, and an application name. As many cluster nodes as desired can be designated to potentially host this instance, and a failback policy (which determines whether the instance will try to “fail back” to its preferred cluster node after “failing over” to another for some reason) can be configured to support high availability.

Page 59: The Shortcut Guide to SQL Server Infrastructure Optimization

Figure 3.8: Configuring a new SQL Server instance in the cluster.

Rolling Hardware Generations Another major benefit of the “new cluster” is the ability to use dissimilar hardware for cluster nodes. This allows all existing SQL Server hardware to potentially become a resource within the cluster. And, because the cluster allows SQL Server instances to be moved almost transparently from node to node, it’s relatively easy to remove old, unsupported hardware from the cluster. When an older server reaches the end of its operational life, you can “roll” the cluster’s hardware generation by adding a new node to the cluster, moving SQL Server instances to it (thus “emptying” the old server), and simply removing the old server from the cluster. Minimal downtime is required—usually on the order of a few seconds for each SQL Server instance you move—making it easy for nearly any organization to find an appropriate maintenance window.
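The “roll” is really just a drain-and-remove loop on the cluster’s bookkeeping: join the new node, move each instance, retire the empty node. A sketch of that procedure (node and instance names are hypothetical):

```python
def roll_hardware(cluster, old_node, new_node):
    """Retire old_node from a cluster (a mapping of node -> hosted instances):
    1. join the new node, 2. drain every instance from the old node to it,
    3. drop the now-empty old node."""
    cluster.setdefault(new_node, [])                 # 1. add the new node
    cluster[new_node].extend(cluster.pop(old_node))  # 2. move instances, 3. retire
    return cluster
```

Each `extend` step corresponds to the few seconds of downtime per instance mentioned above; the rest of the procedure touches no running workload.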

Page 60: The Shortcut Guide to SQL Server Infrastructure Optimization

The New Cluster and Infrastructure Optimization The “new cluster” provides all the tools you need to create a fully optimized SQL Server infrastructure, one that Microsoft’s Infrastructure Optimization Model would recognize as Dynamic. Of course, simply having the technologies in place doesn’t give you that optimization: You have to use them correctly.

Techniques for Creating Clusters It’s unlikely that large organizations will lump every SQL Server they own into a single cluster. After all, different SQL Server computers serve distinctly different functions, and you still have to accommodate factors such as large, geographically distant offices that require their own physical servers. Exactly how you choose to build your clusters will depend on your exact business requirements, and it’s a question that bears careful consideration and planning. Figure 3.9 shows one simple example of how an organization’s SQL Server computers might be “divvied up” into three different clusters.

Figure 3.9: One way to divide servers into clusters.

Here, I’ve created the largest cluster for my organization’s line-of-business applications. This might include my SAP Financials database, my PeopleSoft databases for personnel management, as well as several in-house databases used for various business-specific applications. This cluster will use my newest and best server hardware, and it might have a fairly large overall amount of free capacity so that a significant number of servers could fail entirely without taking a single database offline for more than a few seconds (of course, if the maximum number of servers did fail, I might be satisfied with less-than-optimal application performance, so I don’t need the cluster to have 200% of my working capacity).

Page 61: The Shortcut Guide to SQL Server Infrastructure Optimization

A second, smaller cluster is used to run smaller department- and project-specific databases. This cluster might use older hardware and might have much less overall free capacity depending on how critical these smaller applications are. A third, very small cluster might be used for development and testing purposes.

Workloads couldn’t be easily moved between these clusters; doing so would be far from impossible, but it would require some manual effort, file copying, and so forth. Workload could be repositioned within each cluster, though, based on hardware failures, maintenance, and other needs. If a department-specific application one day became mission-critical and I decided to move it to my large cluster, doing so wouldn’t be a massive undertaking. I’d use the HP PolyServe Software migration tools to automatically create a new SQL Server instance on the large cluster while maintaining configuration and security settings, and to safely copy the database files between the two clusters’ shared storage pools, providing a working mirror copy of the original database without ever taking it out of service. I’d decide which nodes of the new cluster would be potential hosts for the new SQL Server instance, and once testing was complete, I’d flip the switch to migrate over any incremental updates to the primary database and transition users to the new server. This is the type of dynamic flexibility that the Infrastructure Optimization Model calls for at the Dynamic level.

Using Clusters to Improve Infrastructure Optimization Let’s look at the “new cluster” in terms of Microsoft’s Infrastructure Optimization Model—specifically, the Dynamic level (characteristics are quoted from http://www.microsoft.com/technet/infrastructure/datasheet.mspx):

• “Processes are fully automated.” With regard to SQL Server, this is true. Failover occurs automatically within the “new cluster,” and other processes—deploying software updates, deploying new SQL Server instances, manually moving SQL Server instances, and so forth—are all automated.

• “The use of self-provisioning software…” is definitely in place. When the time comes to create a new SQL Server instance or to add a new node to the cluster, you just point the clustering software at the task and it handles it.

• “…with service level agreements…established.” This is also possible with the “new cluster.” Although you could certainly establish service level agreements (SLAs) without this clustering technology, actually meeting those SLAs requires a much larger hardware and administrative investment. With the “new cluster,” you can meet your SLAs with minimal redundancy and with minimal additional administrative overhead.

• “Costs are fully controlled.” This is definitely possible because the cluster provides you with the desired level of high availability without excessive redundancy, and without imposing significantly higher administrative overhead. In other words, you can achieve business goals for availability and maintenance without vastly increased costs from hardware or administrative overhead.

The “new cluster” gives you the capabilities to meet all these criteria, helping to make SQL Server a more strategic asset within your organization.

Page 62: The Shortcut Guide to SQL Server Infrastructure Optimization

Benefits of the New Cluster The “new cluster’s” use of shared disk resources, alias-based connectivity, single point of administration, and flexible reallocation of workloads provides a number of business benefits.

Consistent Software Configurations SQL Server software stays consistently configured. From a security and operational standpoint, that is a huge benefit: you can be assured that no matter which server is presently hosting a given SQL Server instance, it will behave exactly the same as any other server that might host that same instance. For organizations subject to industry or regulatory compliance requirements, this consistency makes auditing easier and less error-prone. Consistency also helps reduce debugging time for in-house developers because they’re assured of the same server behavior at all times and don’t need to be concerned about different nodes behaving differently. Consistency also applies to SQL Server updates and hotfixes, ensuring that every instance of SQL Server is at the same patch level at all times, no matter which cluster node the instance happens to be running on.

Reduction in Operating Costs Because maintenance—including instance reconfiguration and patch deployment—doesn’t need to be performed on a per-server basis, administrative costs are significantly lowered. In fact, using a product such as HP PolyServe Software can help reduce administrative costs immediately, even if you don’t take advantage of the product’s other features and capabilities.

Capital costs can also be reduced because individual servers don’t necessarily have to be configured with vast amounts of inflexible extra capacity. Instead, free capacity can be provided to the cluster as a whole, giving you a place to move SQL Server instances that are outgrowing their current server, and giving you the ability to bring in new server resources—and remove older hardware—on a gradual, as-needed basis.

High Availability Availability is built into the cluster: the cluster nodes communicate with one another continuously, and SQL Server instances are immediately relocated from failed servers to running ones. The instances from a failed server can be relocated to many different “backup” servers, if necessary, allowing you to control the impact on application performance by re-balancing the workload automatically through instance preference assignments. Instances can be configured to “fail back” to their preferred server automatically if that server becomes available again after a failure.

SQL Server’s own fault tolerance ensures that, when an instance fails over to another cluster node, data loss is minimal. The new node has access to the same SQL Server data and log files, allowing it to immediately begin transaction recovery as it begins to accept new client connections.
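The recovery step described above is standard write-ahead-log replay: work from committed transactions is redone from the log, while work from transactions that never committed is discarded. The following toy illustrates the principle only—it is a generic WAL sketch, not SQL Server’s actual recovery code:

```python
def recover(log, data):
    """Replay a write-ahead log after failover: redo writes belonging to
    committed transactions, ignore writes from in-flight transactions.
    Log records are (operation, transaction_id, payload) tuples."""
    committed = {txn for op, txn, _ in log if op == "COMMIT"}
    for op, txn, payload in log:
        if op == "WRITE" and txn in committed:
            key, value = payload
            data[key] = value
    return data
```

Because the new node reads the same log files from shared storage, this replay can begin the moment the instance is relocated.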

Although the disk storage shared by the cluster’s nodes can be seen as a single point of failure, in reality, most SANs provide many options for fault tolerance, including RAID arrays, hot-swappable drives, and more. You can also schedule frequent SQL Server backups to a second “shared” disk storage area, allowing SQL Server instances to (for example) create very frequent log backups to guard against physical damage to the database files.

Page 63: The Shortcut Guide to SQL Server Infrastructure Optimization

Reduced Complexity Although managing a cluster full of SQL Server computers may sound more complex, it actually offers less complexity than managing a similar number of individual, non-clustered servers. Because much of the cluster’s management is done from a single location, managing many aspects of the cluster is no different from managing a single SQL Server computer—no matter how many nodes exist in the cluster.

And the “new cluster” offers less complex management than Windows Cluster Service. SQL Server instances don’t have complex disk storage dependencies that have to be managed. Instead, each SQL Server instance is an entity unto itself, with only a network alias name that has to be “moved” with the instance—and the clustering software takes care of that automatically. Disk storage resources are shared across the cluster, ensuring that every instance can “see” its disk storage, no matter which cluster node the instance happens to be running on.

Room for Growth One of the main benefits of the “new cluster” is the retention of “room for growth.” In previous chapters, I described how organizations seek to maintain this room on individual servers by always over-provisioning servers so that no individual server runs at 100% capacity. That creates a lot of “extra capacity” within the environment, as Figure 3.10 shows. However, that extra capacity isn’t very flexible because it can only be used to support whatever applications are on that individual server.

Figure 3.10: Provisioning inflexible extra capacity on a per-server basis.

Page 64: The Shortcut Guide to SQL Server Infrastructure Optimization

This figure illustrates four servers, each with varying levels of memory and processor utilization (there are, of course, other resources that play into overall server utilization, but I’m choosing these two to keep the illustration simple). As you can see, Server B has plenty of extra capacity, but it’s useless to Server A, which is beginning to reach its maximum. Overall, this organization has something in excess of 50% free capacity—or, in other words, wasted resources—although this capacity can’t be flexibly reallocated to help support a single server.
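That “in excess of 50%” figure is simple arithmetic over per-node utilization. A sketch, using made-up utilization numbers loosely matching Figure 3.10 (the numbers and node names are purely illustrative):

```python
def cluster_free_capacity(util):
    """Average free capacity across all nodes and resources, as a 0.0-1.0 fraction.
    `util` maps node name -> {resource name -> utilization fraction}."""
    free = [1.0 - pct for node in util.values() for pct in node.values()]
    return sum(free) / len(free)

# Hypothetical utilization for the four servers in Figure 3.10.
UTIL = {
    "A": {"cpu": 0.85, "mem": 0.80},   # nearly maxed out
    "B": {"cpu": 0.20, "mem": 0.25},   # plenty of headroom
    "C": {"cpu": 0.50, "mem": 0.45},
    "D": {"cpu": 0.30, "mem": 0.35},
}
```

On standalone servers that aggregate figure is academic—Server B’s headroom does nothing for Server A—but in the cluster it becomes a single pool you can actually spend.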

In Figure 3.11, I’ve made those four servers nodes in a cluster, using the new techniques I’ve discussed in this chapter. This enables me to relocate the workload from Server D to Server B. Now, Server D is completely unused—essentially, I’ve centralized most of the four servers’ free capacity onto a single server by more fully utilizing Server B.

Figure 3.11: Aggregating free capacity in a new cluster.

Now, Server D’s free capacity can support any of the other three servers. For example, if Server A’s workload begins to exceed that server’s hardware capacity, a beefier Server D could be used by simply relocating Server A’s SQL Server instances to Server D. If Server A were running more than one SQL Server instance, then Server A’s workload could be subdivided, and a portion relocated to Server D to provide room for growth on Server A. Or, if the workloads managed by Servers A, B, and C were relatively static, Server D could be removed entirely and put to some other use. Regardless of what you choose to do and how you choose to architect your cluster, the point is that you now have a flexibility you never had before, with the ability to quickly and easily move workloads across servers to create the exact balance you want.
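Choosing where to relocate an instance reduces to asking which node has the most headroom. A sketch of that selection (the utilization data and node names are hypothetical):

```python
def best_target(util, exclude=()):
    """Pick the node with the lowest peak resource utilization, skipping
    any nodes in `exclude` (for example, the overloaded source node).
    `util` maps node name -> {resource name -> utilization fraction}."""
    candidates = {node: max(res.values())
                  for node, res in util.items() if node not in exclude}
    return min(candidates, key=candidates.get)

# Hypothetical per-node utilization fractions.
UTIL = {"A": {"cpu": 0.85, "mem": 0.80}, "B": {"cpu": 0.20, "mem": 0.25},
        "C": {"cpu": 0.50, "mem": 0.45}, "D": {"cpu": 0.30, "mem": 0.35}}
```

Using the peak (rather than average) utilization is a deliberately conservative choice: a node that is memory-bound is a poor target no matter how idle its processors are.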

Page 65: The Shortcut Guide to SQL Server Infrastructure Optimization

Conclusion: Welcome to the Optimized SQL Server Infrastructure You’re no longer worried about a server hardware failure, because you’ve got ample free capacity in your cluster and you know you can quickly redistribute SQL Server instances as necessary if a failure occurs. Maintenance is now a no-downtime affair because SQL Server instances can be relocated on demand to servers with free capacity. New nodes can be added to the cluster with a few button clicks, and new SQL Server instances can be created and provisioned with a few more. You’re no longer worried about processor, memory, or disk storage constraints because those resources can be easily expanded within the cluster at any time—without significant increases to administrative overhead. Your SQL Server infrastructure can respond almost instantly to changing business needs without all-night maintenance marathons and without having double the amount of SQL Server hardware you really need. Welcome to the Optimized SQL Server Infrastructure.

Download Additional eBooks from Realtime Nexus! Realtime Nexus—The Digital Library provides world-class expert resources that IT professionals depend on to learn about the newest technologies. If you found this eBook to be informative, we encourage you to download more of our industry-leading technology eBooks and video guides at Realtime Nexus. Please visit http://nexus.realtimepublishers.com.