Aspirus Information Technology Epic Hyperspace Virtualization using Virtual
Computing Environment Coalition Infrastructure Proof of Concept
Written by: Thomas Whalen
Core Infrastructure Team Leader – Aspirus IT
Co-Written by: Phil Hammen
VMware / Citrix Systems Administrator – Aspirus IT
Abstract
Aspirus has struggled over the years to provide a consistent computing model for its Epic Hyperspace deployment. This white paper illustrates how the VCE architecture solves our workload management problems and presents the outcomes of our testing scenarios.
Table of Contents
Executive Summary
Introduction
Current Operating State
Quick Overview of VCE
VCE Testing Scenarios
    Testing Architecture
    Scenario 1 – Virtualization Taxation
    Scenario 2 – Virtualization Aggregation Performance
    Scenario 3 – UCS B440 Redline-Go-for-Broke Test
Conclusion
Acknowledgments
Executive Summary
In 2006, roughly two years after the Epic go-live for our hospital and clinic business units, our Epic Hyperspace deployment was the most dynamic environment we managed in terms of the CPU and memory performance that shapes the user experience of the Epic HIS. IT's struggle was to find a consistent architecture that could absorb the continual growth of the Epic HIS user base while still assuring the high level of client performance a growing and demanding user base expects. Because of these requirements, Aspirus IT began using VMware to virtualize the Citrix XenApp servers hosting the Epic Hyperspace client; as the physical server footprint continued to grow, it had become obvious that adding physical servers to our environment was not sustainable.
The initial results with VMware were very impressive, and the transition from physical to virtual proved to be a positive process for both IT and the user base. The mountain of 1U pizza-box servers slowly eroded away as we transitioned to blade form factor servers. But as new application functionality was introduced, and the user base grew through acquisitions made by Aspirus, we started to fall into the trap of architecting for the moment rather than for the long term. The performance swings we saw at varying times during an average production day were rarely consistent, which made it difficult to understand the real root of the performance curve issues we were measuring. In 2009, the decision was made to back away from using VMware to provision the Citrix XenApp servers and instead use Citrix Provisioning Server to stream them to bare metal blades. While this was an acceptable short-term solution, it sacrificed the value a hypervisor brings: better TCO for CPU and memory resources, as well as DR and high availability.
In late 2010, Aspirus was able to invest in VCE (Virtual Computing Environment Coalition) infrastructure, which combines technologies from Cisco, VMware, and EMC. As we tested various server build-out scenarios, it became increasingly clear that we could return to a truly virtualized infrastructure without degrading the user experience. More importantly, we concluded that the VCE architecture is scalable enough for this workload to handle dramatic changes in application functionality and user base growth.
Introduction
In 2004 Aspirus embarked on the largest single project it had ever tackled – the implementation of the Epic HIS for Wausau Hospital in Wausau, WI. As IT began to think through the architecture for the presentation layer, server virtualization was still very new to Aspirus, and we felt at the time that placing a strategic application workload on a technology so new to us was risky. Using Citrix as an application delivery technology was something Aspirus had been doing since the late 1990s, so delivering the Hyperspace client through Citrix seemed like a good idea. With these early decisions, we deployed over 60 individual 1U servers running Windows 2000 with Citrix Presentation Server, each carrying approximately 30-35 users. At the time, this was a pretty common configuration for Epic shops: keep the performance localized to each server.
But then the user base and the application functionality began to grow, which drove up server performance needs, and the user experience began to suffer. We soon recognized that the physical nature of this implementation was not sustainable once we factored in the power, cooling, and data center space needed to continue it. More telling still was an assessment of CPU and memory utilization across our data center as a whole: average CPU and memory consumption was only roughly 5-7% of the servers' capability, meaning our servers were, figuratively speaking, idling in our data center. When we drilled into the data, the servers with the highest utilization were the Hyperspace Citrix servers. They were also our biggest pain points. From an application growth perspective, we had added an additional hospital and about 35 clinics to the environment.
After this eye-opening assessment, the decision was made to embrace virtual server provisioning with VMware to scale down the physical footprint of the data center and to improve the TCO of the server investment needed to resolve the pain points we were experiencing. At the same time, we invested in blade form factor servers with more CPU and memory per blade than our existing physical servers. We still felt Citrix was the best application delivery vehicle for the Hyperspace client, so our architecture evolved to VMware provisioning the Citrix Presentation Servers that delivered the client. At first we saw good performance and the user experience was satisfactory, but from time to time we would see unusual CPU spikes that pushed average processor queue length above two per CPU. To mitigate this we added more VMs and reduced the per-VM user load to give each VM more CPU headroom, which seemed to smooth out the performance curve. In our most dense configuration we were running 64 VMs with roughly 30-35 users per guest across 16 blade servers.
Then came a client version upgrade. As part of the upgrade process, every customer must pass Epic's "Pre-Upgrade Assessment," an evaluation of the current architecture against the expected workload of the new version. While the VM-based Hyperspace architecture was not officially approved by Epic at the time, they evaluated its performance using internal Epic tools and Performance Monitor at the guest level and determined that performance should be acceptable. Said differently, we got the yellow cautionary indication. Our initial testing with the new client on the production VM architecture didn't reveal any significant performance changes. (Writer's note: lesson learned – always test with a workload consistent with your current AVERAGE user base, not a handful of super users.)
We went live on the existing architecture with the new client, and the VM infrastructure crashed and burned. Why? Some functionality that had not been CPU-heavy in its previous form had been redesigned; the redesign was functionally better for users but significantly heavier on CPU for the guest running the client. We struggled to determine exactly why this happened. We drilled deep into the performance of the servers, the storage, anything we could find; no stone was left unturned. Epic did a great job of helping us see what was happening at the client level so we could translate it into potential architecture changes to improve the situation. But in the end, the client simply needed more resources than expected. We increased our CPU counts per guest to handle the change in the performance curve. This put a significant strain on our existing computing resources and forced us to dedicate more blades to this workload alone, which scaled in the wrong direction for Aspirus.
We reached a point where we felt the technology was managing us for this particular workload, and something needed to be done for the long term. We made the decision to abandon VMware and hypervisor virtualization for this workload. We still used VMware heavily for other workloads, where it performed great; but this is not an average workload, and we had to recognize that for the sake of the user experience. We took the blades used for the VM guests and converted them to bare metal servers, using Citrix Provisioning Server to deliver the server image to the blades in a read-only configuration. It worked great. However, we knew there had to be a way to virtualize this mission-critical client and regain all the things you gain when you host guests on VMware: vMotion, DRS, Storage vMotion, and so on.
Current Operating State
Today, the production environment (figure 1) uses Citrix Provisioning Server to stream bare metal servers to 20 blades, which allows us to manage a per-server user base of approximately 125-130 users on blades with two 6-core AMD processors and 16GB of RAM. While the performance of this infrastructure is great, a number of DR and hardware-protection capabilities are missing because no virtualization hypervisor manages the servers. For instance, if we lose a blade for whatever reason, there is no automated High Availability process to bring the server back online quickly. We also have no means of moving the workload to different hardware on the fly in the event of hardware degradation, i.e. vMotion. So while the current operating state is optimal from a performance perspective, there are HA/DR issues we need to address.
Figure 1. Current Operating State diagram
What we need to find is an architecture that blends the performance of bare metal with the advanced HA/DR capabilities of a hypervisor, with a sound fabric to tie it all together. By combining the benefits Citrix Provisioning Server gives us (read-only disks, fast provisioning, easier update management, identical servers) with the benefits of a hypervisor (vMotion, HA/DRS clusters, Storage vMotion), we could achieve the best of both worlds.
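For context, the capacity arithmetic behind the current footprint is simple enough to sketch. The following Python snippet is a minimal back-of-the-envelope illustration; the 20-blade count and the 125-130 users-per-blade range come from the description above, and the function and variable names are purely illustrative.

```python
# Capacity arithmetic for the current bare metal farm described above.
# The 20-blade count and the 125-130 users-per-blade range come from the text;
# the function and variable names are purely illustrative.

BLADES = 20
USERS_PER_BLADE = (125, 130)  # observed low/high per-server user load

def farm_capacity(blades: int, users_per_blade: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) concurrent-user capacity of the whole farm."""
    low, high = users_per_blade
    return blades * low, blades * high

low, high = farm_capacity(BLADES, USERS_PER_BLADE)
print(f"Approximate farm capacity: {low}-{high} concurrent users")
# -> Approximate farm capacity: 2500-2600 concurrent users
```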
Quick Overview of VCE
The VCE Coalition was formed by three superpowers of the IT industry – Cisco Systems, VMware, and EMC – with the mission of creating a single architecture that can handle the demands of varying workloads on a common hardware and hypervisor base. It came out of the recognition that, amid the "Cloud Computing" marketing coming from virtually everyone in the IT space, someone needed to deliver a solution that could truly realize cloud computing's potential. The VCE offering combines network, server, storage, virtualization software, and management tools into a foundational layer for cloud-based applications – or any application workload, for that matter. It is delivered in building blocks called Vblocks. Vblocks are configured and tested to achieve a certain degree of performance and scalability, and there are various Vblock configurations to choose from. What makes the VCE Coalition interesting is the partners' mission to integrate all of the components so that each layer is aware of the layers above and below it. This integration gives the user the ability to manage all layers effectively, end to end. Using separate tools within each layer, or the EMC Ionix Unified Infrastructure Manager, you get eyes into all layers through a common management interface. For more information on VCE, go to www.vce.com.
VCE Testing Scenarios
Testing Architecture
In late 2010, Aspirus made some strategic purchases to begin building a second data center to complement its current one. As part of that purchase, we acquired two Cisco UCS 6120 Fabric Interconnects and two Cisco UCS 5108 chassis supporting up to 16 blades. The configuration came with eight B200 blades spread across the two chassis; each blade contains two 6-core Intel Xeon processors and 48GB of RAM. Ethernet connectivity was provided by 10Gb uplinks to Cisco Catalyst 6500 switches. The SAN fabric used for the testing consisted of two Brocade DCX directors with 4Gb connectivity. Storage was an EMC Clariion CX4-960 array, using two 4+4 RAID 1/0 raid groups to present the VMware datastore pools to the guests. Figure 2 gives a high-level look at the testing architecture. VMware ESXi 4.1 was the hypervisor, configured to boot from SAN. We used Citrix Provisioning Server to stream the guest OS, with each guest configured with four vCPUs and 16GB of RAM.
Figure 2. EMC / Cisco / Brocade Testing Architecture
It's important to note that this was not an isolated infrastructure built specifically for testing but a live production infrastructure; live clinical users were part of our testing crew. We did this to assure that the workload was truly based on user workflow and not induced through artificial means. Safeguards were in place so that, in the event of a problem, we could pull back quickly. The following tests were determined to be the most relevant and would exhibit the results we were looking for:
1. Virtualization Taxation
2. Virtualization Aggregation Performance
3. UCS B440 Redline-Go-for-Broke Test
Scenario 1 – Virtualization Taxation
The first test used a single VM built to support 125-130 users, the same per-server load carried by our current bare metal production architecture on HP BL465 G6 blades, as illustrated in figure 1. Once everything was ready, we slowly began to push users toward this virtual machine while carefully monitoring utilization as the user base grew on the guest. Figure 3 illustrates this testing scenario.
Figure 3. Scenario 1 – Virtualization Taxation
At around 70 concurrent users we noticed processor queue length starting to go out of bounds, but it didn't square with the observed performance of the guest itself. We checked esxtop, and processor queue length never went above three. The desired value is less than two per CPU; since our guest was configured with four vCPUs, a value of three is well within acceptable limits. We attributed the discrepancy to perfmon inaccurately reporting the true processor queue length at the guest level. At 125 users, the server was humming along with no apparent tax showing in any of the performance data. But something was unusual: when we launched the Hyperspace application, we noticed a six-second improvement in the Windows logon time. At first we thought this was a fluke, so we continued to launch sessions. No fluke. Compared with logon times of 10-12 seconds on our 12-core bare metal HP blades, the 4-vCPU VM guest on UCS delivered a 50% improvement in the Windows logon process while matching user density.
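The sanity check we applied to the conflicting queue-length readings can be expressed as a short sketch. In this Python snippet, only the "less than two per CPU" rule of thumb comes from the text; reading the counter itself (perfmon or esxtop) is out of scope, so a sampled value is passed in directly and the names are our own.

```python
# Rule-of-thumb check for processor queue length, as discussed above.
# The threshold of roughly two queued threads per CPU comes from the text;
# how the value is sampled (perfmon or esxtop) is out of scope here.

def queue_length_acceptable(queue_length: float, vcpus: int,
                            per_cpu_limit: float = 2.0) -> bool:
    """True if the sampled processor queue length is within the per-CPU rule of thumb."""
    return queue_length < per_cpu_limit * vcpus

# Scenario 1 guest: four vCPUs, worst reading of three in esxtop -> acceptable.
print(queue_length_acceptable(3, vcpus=4))   # True  (3 < 8)

# The same reading would be out of bounds on a single-vCPU guest.
print(queue_length_acceptable(3, vcpus=1))   # False (3 >= 2)
```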
Scenario 2 – Virtualization Aggregation Performance
For the next test we ramped things up, this time building multiple VMs on a single UCS B200 blade. Our goal was three VMs, each loaded with 125 users, for a total of 375 users on a single blade. We needed to see whether the host held up as we layered more VMs onto the single blade. Figure 4 illustrates this testing scenario.
Figure 4. Virtualization Aggregation Performance
We streamed the new VM guests and the users slowly began to fill them. Once we reached the goal of 375 users across three VMs, we checked the performance data and saw CPU metrics matching what we had seen when running a single VM on the blade. Although transparent page sharing was enabled in ESXi, we did not want to overcommit memory at this point (16GB x 3 VMs = 48GB of RAM), because live production users were on the system and we wanted to keep the user experience acceptable. In this case, the user experience remained unchanged from Scenario 1: the Windows logon process was still 50% faster than on our current 12-core servers. There was no apparent performance tax for VM guest density; all measured metrics matched regardless of the number of VMs running on the host.
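The memory decision above is simple arithmetic, but for clarity here is a hedged sketch of the no-overcommit check in Python. The host and guest sizes come from the testing architecture described earlier; ESXi's own memory overhead is ignored, and the names are ours.

```python
# No-memory-overcommit check used when stacking guests in Scenario 2.
# Host RAM (B200: 48 GB) and guest RAM (16 GB) come from the testing
# architecture above; ESXi's own memory overhead is ignored for simplicity.

HOST_RAM_GB = 48
GUEST_RAM_GB = 16

def fits_without_overcommit(guest_count: int,
                            guest_ram_gb: int = GUEST_RAM_GB,
                            host_ram_gb: int = HOST_RAM_GB) -> bool:
    """True if the guests' configured memory fits in host RAM without overcommit."""
    return guest_count * guest_ram_gb <= host_ram_gb

print(fits_without_overcommit(3))  # True  -> 3 x 16 GB = 48 GB, exactly the blade's RAM
print(fits_without_overcommit(4))  # False -> a fourth guest would mean overcommitting
```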
Scenario 3 – UCS B440 Redline-Go-for-Broke Test
Reaching this point meant that virtualizing the Epic Hyperspace workload could be a reality given the right set of hardware, software, fabric, and storage. This last test was about pushing the envelope to determine what the scalability of the Vblock architecture really looked like.
Figure 5. UCS B440 Redline-Go-for-Broke Test
Our friends at Cisco Systems gave us the opportunity to try a B440 full-width blade with four 8-core Intel Xeon processors and 128GB of RAM in an effort to try one thing: running our entire production user load of more than 2000 users on a single server. With that in mind, we jumped in and created VM guests just as we did in the other tests. In this case, we placed nine VM guests with 100-125 users per guest on the B440. Again we tried not to push memory too far past its limits, since these were production users; CPU metrics at this density, however, looked very good. We started with eight VM guests on the B440 and used vMotion to move the ninth guest over, which worked flawlessly. The graph below illustrates the performance we observed once we hit a load of 1034 users. The B440 blade didn't have quite enough headroom to reach our goal of 2000+ users, but even 1034 users on a single full-width blade was a huge win in scalability and user density without negatively impacting the user experience.
Figure 6. UCS B440 M1 Blade CPU Performance
As figure 6 indicates, with 1034 users there was still quite a bit of CPU headroom available on the blade. We believe that 1400+ users (another three VM guests) would be possible with the proper amount of RAM available. The screenshot below is the VMware Virtual Center display of the host running the 1034-user load.
Figure 7. VMware Virtual Center view of the B440 M1 blade
The screenshot below shows the Hyperspace user sessions as reported in the Citrix Delivery Services Console.
Figure 8. XenApp indicating 1034 live production users on nine VM guests
As we watched, in some awe, the performance of this workload at this degree of user density across the nine VMs, it became clear to us that this was an important moment in our effort to understand the virtualization design that could be achieved with this infrastructure.
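To make the memory-bound nature of this test explicit, here is a rough projection sketch in Python. The 128GB host, 16GB guest size, and 100-125 users-per-guest figures come from the text above; the function names and the twelve-guest projection (which assumes additional RAM were installed) are illustrative only.

```python
# Rough projection of guest and user density on the B440, assuming memory rather
# than CPU is the limiting factor. The 128 GB host, 16 GB guests, and 100-125
# users-per-guest figures come from the text; names are illustrative.

B440_RAM_GB = 128
GUEST_RAM_GB = 16
USERS_PER_GUEST = (100, 125)

def guests_without_overcommit(host_ram_gb: int, guest_ram_gb: int) -> int:
    """Guests whose configured memory fits in host RAM with no overcommit."""
    return host_ram_gb // guest_ram_gb

def projected_users(guest_count: int) -> tuple[int, int]:
    """User range at the observed per-guest density."""
    low, high = USERS_PER_GUEST
    return guest_count * low, guest_count * high

print(guests_without_overcommit(B440_RAM_GB, GUEST_RAM_GB))  # 8 -> the nine guests
# actually run already sat just past the no-overcommit line, hence RAM as the limit.

print(projected_users(9))   # (900, 1125)  -> brackets the 1034 users observed
print(projected_users(12))  # (1200, 1500) -> the "another three guests for 1400+ users"
                            #    projection, assuming additional RAM were installed
```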
Conclusion
We went into this process not entirely sure what we would find. We had an idea that performance would be good and that, at some level, we would end up with a VM-based Hyperspace server that performed very well. As the findings indicate, virtualizing this particular workload is not only possible but, more importantly, scalable. In today's IT world, with mission-critical applications like an Epic environment, closing the gap between scalability and performance can be as hard to achieve as our history indicates. But based on the results of these tests, Aspirus believes the VCE infrastructure can not only support this particular workload without breaking a sweat, but also allow an organization to layer other equally important workloads onto a single scalable architecture. When you consider the degree of blade density possible with the Cisco UCS architecture (40 blade chassis and 320 blade servers per pair of fabric interconnects) and the underlying fabric bandwidth optimizations both at the hardware level and within VMware itself, we have only scratched the surface of what this architecture can achieve. Wrapped around a storage infrastructure like our EMC Clariion CX4-960 and Celerra NAS gateway, it offers performance headroom for nearly any application demand.
But let's talk about results. Testing Scenario 1 duplicated our existing bare metal installation of Epic Hyperspace in a virtualized server design while decreasing application launch time by 50%, from 10-12 seconds to 6 seconds, with no negative impact to the user experience. Based on the results from Scenario 2, we would need only 7 B200 blades to match the density, user load, and improved performance of the 20 HP blades we use today. Test 3 is, in our opinion, the real game changer. With a full-width B440 blade carrying 1034 users without degrading performance or the user experience, coupled with the scalability of the fabric and chassis density, three blades could effectively run the presentation layer of a typical Epic environment of 2500 concurrent users (the sketch below walks through this arithmetic). Extend the capabilities of virtualization to the services layer (interconnect servers, web servers, EPS servers) and leverage Linux at the database layer, and the VCE architecture could potentially service the entire Epic workload across all layers. Much more testing would be required to make that a reality, but there is little doubt the VCE infrastructure could support it.
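A minimal sketch of the fleet-sizing arithmetic behind those blade counts follows. The per-blade densities (375 users on a B200 from Scenario 2, roughly 1000 on a B440 from Scenario 3) and the approximate 2500-user target come from the text; no N+1 spare capacity or maintenance headroom is included, and the names are ours.

```python
# Fleet-sizing arithmetic behind the blade counts quoted in the conclusion.
# Per-blade densities and the ~2500 concurrent-user target come from the text.
# No N+1 spare capacity or maintenance headroom is included in this sketch.

import math

def blades_needed(total_users: int, users_per_blade: int) -> int:
    """Blades required to carry total_users at a given per-blade density."""
    return math.ceil(total_users / users_per_blade)

TARGET_USERS = 2500

print(blades_needed(TARGET_USERS, 375))    # 7 -> B200 half-width blades (vs. 20 HP blades today)
print(blades_needed(TARGET_USERS, 1000))   # 3 -> B440 full-width blades for the presentation layer
```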
Acknowledgements
Phil and I would like to acknowledge the following for their support and assistance in this testing process and in the creation of this document.
Aspirus IT
Glynn Hollis – For his continued support, mentoring, and guidance as the Director of Technical Service for Aspirus IT. Without his support of this process, this document and these results simply would not have been possible.
Jerry Mourey – As the CIO for Aspirus IT, Jerry's unyielding support for the technical group in our efforts to continually redefine ourselves is rare, and we appreciate all that you are and all that you do.
Aspirus Tech Team – Jesse Kozikowski, Mark Chickering, Joe Thompson, Bart Kniess, and Jeremy Woller. Not only the best group of guys to work with; the talent and passion for technology of these gentlemen is unparalleled for a single team. You guys all rock!
Ahead, Inc.
Eric Ledyard – As CTO of Ahead and a key player in our testing process, we were honored that you spent so much time with us given your crazy schedule. More importantly, we got a new friend out of the deal. Thank you so much.
Steve Pantol – As our UCS engineer, your knowledge of this technology exceeded our expectations and your guidance paved the way for our success.
EMC Corporation
Nick Butrym and the Wisconsin Sales Team – For being so visible in our environment as a manufacturer and for helping us get the most out of our EMC products. Your support continues to be critical to our success.
Cisco Systems
Dean Novitzke, Brian Joanis, and the Wisconsin Sales Team – For all your help getting us the loaner gear that made this testing possible.
Epic Systems
Sameer Chowdhary, Alex Wang, and the Aspirus support group – For always being there to help us understand the Epic environment.
VMware
Kevin Moran – For being supportive of this effort and offering the support needed to make it successful.