Exchange 2003 Storage Design Brad Carter Rapid Response Engineer EMEA Exchange Centre of Excellence.

Welcome to this TechNet Event

We would like to bring your attention to the key elements of the TechNet programme, the central information and community resource for IT professionals in the UK:

- FREE bi-weekly technical newsletter

- FREE regular technical events hosted across the UK

- FREE weekly UK & US led technical webcasts

- FREE comprehensive technical web site

- Monthly CD / DVD subscription with the latest technical tools & resources

- FREE quarterly technical magazine

To subscribe to the newsletter or just to find out more, please visit http://www.microsoft.com/uk/technet or speak to a Microsoft representative during the break.


Purpose of Session

– Introduce Exchange Storage Concepts

– Improve understanding of Storage Design Best Practices

– Validate design and monitor performance

– Objective: To provide attendees with enough knowledge to ensure they deploy Exchange 2003 optimally on their chosen Storage Platform.

Overview

Introduction

Disk I/O and Exchange Server

Best Practices for Optimizing your Storage Architecture

Storage Design

- Establishing your Disk I/O Requirements

- Establishing your Disk Capacity Requirements

Using Jetstress to verify Sub-Storage System Performance & Reliability

Basic Primary Partitions

Key Recommendations Summary

PERF Counters

Examples

Session Structure and Content

Exchange Server 2003 is a disk-intensive application that requires a fast, reliable disk subsystem to function correctly.

Storage subsystem bottlenecks cause more performance problems, and more critical situations (CritSits), than any other server-side component such as CPU or RAM.

A poorly designed disk subsystem will deliver very poor performance to your users. Specifically, your disk subsystem is performing poorly if it is experiencing:

- Average read and write latencies over 20 ms for database drives

- MSExchangeIS/RPC Average Latency above 50 ms

High disk latency = High RPC latency

High RPC latency = Slow performance
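The two thresholds above can be turned into a simple check. The following is a minimal sketch, assuming the latency values have already been collected as averages in milliseconds; the function name and the sample values are illustrative and not part of the original session.

```python
# Thresholds quoted in this session (milliseconds).
DB_DISK_LATENCY_LIMIT_MS = 20.0   # average read/write latency on database drives
RPC_LATENCY_LIMIT_MS = 50.0       # MSExchangeIS RPC average latency


def latency_warnings(avg_read_ms, avg_write_ms, rpc_avg_latency_ms):
    """Return one warning string for each value that exceeds its threshold."""
    warnings = []
    if avg_read_ms > DB_DISK_LATENCY_LIMIT_MS:
        warnings.append(f"database read latency {avg_read_ms:.1f} ms is over 20 ms")
    if avg_write_ms > DB_DISK_LATENCY_LIMIT_MS:
        warnings.append(f"database write latency {avg_write_ms:.1f} ms is over 20 ms")
    if rpc_avg_latency_ms > RPC_LATENCY_LIMIT_MS:
        warnings.append(f"RPC average latency {rpc_avg_latency_ms:.1f} ms is over 50 ms")
    return warnings


# Hypothetical sample values: slow reads drive up RPC latency.
for warning in latency_warnings(28.0, 12.0, 65.0):
    print(warning)
```

An empty list means none of the three values crossed the limits quoted above.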

Introduction

Disk I/O and Exchange Server

Every time data is read from or written to Exchange, disk I/O is generated.

Exchange Data Components (.EDB, .STM, Logs)

Component and I/O Pattern

Jet database (.edb file)

- Read from and written to at random

- 4 KB page size

Streaming database (.stm file)

- Normally read from and written to sequentially

- Variable page size that averages 8 KB in production

- Note: There are significant numbers of seek operations, so the I/O pattern is neither entirely random nor entirely sequential.

Transaction log files (.log files)

- 100 percent sequential writes during normal operations

- 100 percent sequential reads during recovery operations

- Writes vary in size from 512 bytes to the log buffer size

Best Practices for Optimizing your Storage Architecture

Database Files

- .edb and .stm file placement

- fast random access speeds

Content Indexing Files

- Never place content indexing files on the same disk as the page file (although that is the default location).

- Random-access file, should be placed on the same volume as the databases

Transaction Log Files

- The most latency-sensitive drive for write performance

- Sequential write pattern

- Instance placement (in terms of ESE)

Cont… Best Practices for Optimizing your Storage Architecture

SMTP Queue

- Should never be on any spindle that performs another function (due to very different I/O patterns).

Page File

- Place your page file on separate spindles

- If you lose the disk with the page file, the server will experience a stop error.

MTA Queue

- MTA queues should never reside on log or database volumes.

- If your server handles a significant amount of SMTP and/or MTA traffic, you should provide a separate set of spindles for the SMTP and MTA queues. (MBX Servers, Bridgeheads, etc).


Setting up and Configuring the Storage System:

All devices on the storage system must be listed on the Windows Hardware Compatibility List (WHCL). HCL Web site: http://go.microsoft.com/fwlink/?LinkId=23194.

Exchange should be run only against storage that is certified for Windows.

Clustered deployments require cluster-certified storage; geographically dispersed deployments require Geo-Cluster/Multi-Cluster certification.

Drivers and firmware must be up-to-date:

- Server BIOS/firmware

- SCSI/Array Controller firmware and driver

- Fibre Channel Host Bus Adapter (HBA) firmware and driver

- Fibre Channel switch/hub firmware

- SAN (Storage Area Network) enclosure Operating System/Microcode/firmware

- Hard disk firmware

Verify that the HBA/SAN-specific configuration is set correctly. HBAs use registry keys to customize the configuration to a specific SAN platform (for example, Queue Depth and Queue Target).

Sharing sequential and random I/O within the same disk group can have a significant impact:

– It results in excessive latency if sequential and random workloads share the same disks, for example Exchange mail stores sharing disks with backup content or with transaction logs.

Use Diskpar to align disks.

Cont… Setting up and Configuring the Storage System:

Segment Size

- Stripe size specifies the size of the segment written to each disk in a RAID array.

Controller Cache

- If your controller allows you to configure the cache page size, configure it for 4K pages to accommodate Exchange.

- Set the cache to 100% write cache.

- Make sure the cache is battery-backed.

Storage Design

Required information

– Total IOPS required (IOPS/mailbox x # Mailboxes)

– Read/Write Ratio

– Disk capabilities (10K, 15K, 72GB, 146GB, 300GB?)

– Disk transfer capability of the storage enclosure (consider throughput with failed components)

– Backup window (understand the workloads)

– Restore window

– Near immediate restore is viable with VSS

– Don’t forget OLM and recovery utilities as part of your design considerations

– Always design for performance first, then capacity

Storage Design …Peak IO Requirements

Profiling

– Utilize existing infrastructure to determine peak IO requirements

– Use Windows System Performance Monitor to trend the following counters:

– Disk Transfers/sec

– Disk Reads/sec

– Disk Writes/sec

– Trend during peak period

– Monday is typically the busiest day in Microsoft.

Storage Design: I/O Profile

This image is a six-hour peak-period representation of disk activity on a production server in Microsoft supporting 4700 mailboxes.

The image is based on a 10-second sample rate for a period of over 6 hours.

The average rate defined is ~4300 transfers/sec for the six hours.

The rate we use for profiling is taken between 10am and 12pm, where activity stays below ~5000 transfers/sec.

Mailbox IOPS is defined by dividing the average peak IO by total mailboxes.

5000 / 4700 = ~1.06

We profile at 1.2 IOPS per mailbox for this server based on historical data.
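The profiling arithmetic above can be sketched as follows. This is an illustrative example only: the helper names are hypothetical, and the sample list stands in for Disk Transfers/sec values exported from Performance Monitor.

```python
def peak_window_average(samples, start, end):
    """Average Disk Transfers/sec over the samples in [start, end)."""
    window = samples[start:end]
    if not window:
        raise ValueError("the peak window contains no samples")
    return sum(window) / len(window)


def iops_per_mailbox(peak_transfers_per_sec, mailboxes):
    """Peak host transfers/sec divided by the number of mailboxes."""
    return peak_transfers_per_sec / mailboxes


# Figures from this session: ~5000 transfers/sec at peak, 4700 mailboxes.
measured = iops_per_mailbox(5000, 4700)
print(f"measured profile: {measured:.2f} IOPS per mailbox")

# Hypothetical samples, only to show how the window helper is called.
samples = [4100, 4300, 4900, 5000, 4800, 4200]
print(f"peak window average: {peak_window_average(samples, 2, 5):.0f} transfers/sec")
```

The measured value sits a little below the 1.2 IOPS per mailbox design figure; the difference is headroom chosen from historical data.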

Peak IO Read Write Mix (R:W)

The read write mix profile in Microsoft is typically based on a 2:1 R:W ratio.

Within the MSIT deployment this is not a critical value for our design methodology as we utilize RAID 1+0 for our production devices.

Customers considering storage requirements for Exchange with the intent of using RAID 5 should trend their peak period R:W mix.

RAID 5 has a significant write penalty and disk allocations will vary substantially based on R:W mix.
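To see why the R:W mix matters more for RAID 5, the sketch below computes back-end disk I/O for a few mixes. It assumes the write penalties used later in this session (2 for RAID 1+0, 4 for RAID 5) and a host load of 4800 transfers/sec; the function name and the extra 3:1 and 1:1 mixes are illustrative.

```python
# Write penalties as used in this session's disk calculations.
WRITE_PENALTY = {"RAID 1+0": 2, "RAID 5": 4}


def backend_iops(host_iops, reads, writes, write_penalty):
    """Back-end disk I/O generated by a host load with a reads:writes mix."""
    read_io = host_iops * reads / (reads + writes)
    write_io = host_iops * writes / (reads + writes)
    return read_io + write_penalty * write_io


# 2:1 is the mix quoted above; 3:1 and 1:1 are hypothetical comparisons.
for reads, writes in [(3, 1), (2, 1), (1, 1)]:
    for raid, penalty in WRITE_PENALTY.items():
        total = backend_iops(4800, reads, writes, penalty)
        print(f"R:W {reads}:{writes}  {raid}: {total:.0f} back-end transfers/sec")
```

As the share of writes grows, the RAID 5 back-end load rises much faster than the RAID 1+0 load, which is why the mix should be trended before sizing a RAID 5 design.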

Storage Design …Select RAID and Disk Type

RAID 10 or RAID 5

– Different Write Penalties (WP)

– Very different performance profiles under heavy load

36GB, 72GB or 146GB

– Disks are getting larger, performance is not changing

– 300GB will be available soon !!

10K or 15K RPM

Disk speed    IO measured at the host    IO measured behind the controller
              (transfers/sec per disk)   (transfers/sec per disk)
10K           ~100                       ~130
15K           ~150                       ~180

Table represents throughput based on 80% capacity utilization under a 4K random load delivering below 20ms latencies.

These values can be controller specific so testing is required.

[Figure: 12 10K disks delivering 1200 transfers/sec. Disk IO measured at the host using Jetstress: ~100 transfers/disk. Disk IO measured behind the controller, using backend tools during the same Jetstress test: ~130 transfers/disk.]

Storage Design …Select Correct RAID Type

Consider Microsoft’s scenario:

– 4000 mailboxes per server

– 200MB limits for most users, some exceptions

– 1.2 IOPS per mailbox (from trending profile)

– 2:1 Read / Write mix

– Deleted item retention of 3 days

– Fluff Factor of 1.4 (overhead for provisioning mailbox storage)

Simple Math:

Disks required = [(IOPS X READ RATIO) + RAID PENALTY X (IOPS X WRITE RATIO)] / SPINDLE SPEED AT THE CONTROLLER

[(4800 X 0.66) + 2 X (4800 X 0.34)] / (130 OR 180) = 6432 / (130 OR 180)

10K disk = 6432 / 130 (disk capability) = ~49 disks

15K disk = 6432 / 180 (disk capability) = ~35 disks

Select Correct RAID Type

Consider requirement behind the controller

– Use “130” for 10K and “180” for 15K disks

– Assume 80% capacity utilization on disk

RAID 1+0

4000 X 1.2 = 4800 transfers/sec

3200 Reads + (1600 Writes * 2 (write penalty)) = ~ 6400

10K disk = 6400 / 130 (disk capability) = ~ 49 disks – round down to 48

15K disk = 6400 / 180 (disk capability) = ~ 35 disks – round down to 34

RAID 5

4000 X 1.2 = 4800 transfers/sec

3200 Reads + (1600 Writes * 4 (write penalty - 2 R and 2 W per Host W)) = ~ 9600

10K disk = 9600 / 130 (disk capability) = ~ 73 disks

15K disk = 9600 / 180 (disk capability) = ~ 53 disks

RAID 1+0 is the obvious choice for internal deployment. Some controllers do a better job handling RAID 5 than others by reducing the impact of the write penalty with effective caching.
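The disk-count arithmetic above can be reproduced with a short function. This is a sketch under the session's stated assumptions (4000 mailboxes, 1.2 IOPS per mailbox, 2:1 R:W mix, write penalty of 2 for RAID 1+0 and 4 for RAID 5, and 130 or 180 transfers/sec per disk behind the controller); the function and parameter names are illustrative.

```python
def disks_required(mailboxes, iops_per_mailbox, reads, writes,
                   write_penalty, disk_iops):
    """Unrounded number of disks needed to serve the back-end I/O load."""
    host_iops = mailboxes * iops_per_mailbox
    read_io = host_iops * reads / (reads + writes)
    write_io = host_iops * writes / (reads + writes)
    return (read_io + write_penalty * write_io) / disk_iops


for raid, penalty in (("RAID 1+0", 2), ("RAID 5", 4)):
    for speed, disk_iops in (("10K", 130), ("15K", 180)):
        raw = disks_required(4000, 1.2, 2, 1, penalty, disk_iops)
        print(f"{raid} {speed}: {raw:.1f} disks before rounding")
```

The function returns the unrounded figure so the rounding policy stays visible. The session truncates the result (and rounds RAID 1+0 down to an even number of disks for mirroring); rounding up instead would be the more conservative choice.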

Ensure Cache is Enabled

The white line deviation represents the change in IO capability as measured at the host when write cache is disabled on the storage controller.

In this case the result is a reduced throughput of ~400 transfers/sec, with a corresponding impact on read and write latency characteristics.

Write latency was below 2ms prior to disabling the cache and rose to just under ~16ms when it was disabled.

When designing for Exchange you should try and achieve the best level of latency for both read and write

Be Careful with RAID 5

This is a representation of total disk transfers on an optimized RAID 5 configuration.

System sustaining averages of 1600 transfers/sec until a drive failed and forced a rebuild to a hot-spare.

System lost 400 transfers/sec, or 25% of its capability (33% when measured against the 1200 transfers/sec that remained) !!

This representation is extracted using an array performance analyzer to show the impact during rebuild.

Rebuild took over 3 hours on a 146GB disk.

Allocate more disks than required to cater for a failed disk.

Do not rebuild during peak time, to minimize the impact on users.

Note: Rebuild time will vary between storage enclosures.

Understand QUEUEDEPTH

Representation of a system running Jetstress displaying Disk Transfers, Read and Write latency

Host is connected via a single 2Gb FCA using the StorPort driver

Gradual reduction in sustainable throughput, with a corresponding impact on latency, as a result of reducing Queuedepth

Queuedepth can throttle IO back at the host, making the result look like storage contention

Use JETSTRESS

Use Jetstress to validate storage capability:

http://www.microsoft.com/downloads/thankyou.aspx?FamilyID=94b9810b-670e-433a-b5ef-b47054595e9c&displaylang=en

Determine total system throughput under normal operational conditions

Determine capability and resilience of system during simulated component failure.

– Host Bus Adapter (HBA) failures

– Array controller failures

– Etc.

Ensure peak activity is viable under all conditions, many customers fail to plan for these occurrences

Testing Categories (Performance & Long Haul)

Importance of Database sizes

JETSTRESS UI - Interface

Jetstress set to simulate 4000 mailboxes at 1.2 IOPS per mailbox

Databases can take an extended time to create!

Create Databases based on expected profile, in the example using:

4 - Storage Groups (SG)

40GB - Databases (edb)

6 - edb’s per SG

System will generate 4K random Read Write activity to simulate an Exchange load for predicting storage capability

Short & Long Stroking

– Outer track will produce a higher level of performance

– Inner track will produce a lower level of performance

[Figure labels: Data Platter, Spindle, Arm & Head]

JETSTRESS with Full Stroke

• Create test databases that are sized to represent what you will have in production

• The Jetstress UI does a good job of this today; BEWARE of the command-line (CMD) version

• Using ~70%+ of actual disk capacity results in a more realistic performance characteristic

• In this example 48 10K disks delivered over 4800 transfers/sec with 40GB EDBs

• Expected IO is perceived to be ~100 IOPS per disk at the host

(Full stroke seek time) - The time it takes to seek over all tracks of a disk, i.e., from the innermost to the outermost or vice versa

JETSTRESS with Short Stroke

• Original Jetstress recommendation suggested using 1/20th of expected database size for testing storage design

• Using a small amount of physical disk capacity can result in better performance than expected

• In this example 48 10K disks delivered over 10,000 transfers/sec using 4GB EDBs

• Expected IO is perceived to be ~200 IOPS per disk at the host. This is ~100% more throughput than can be expected when the disk is ~80% utilized from a capacity perspective.
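The per-disk figures in the two Jetstress examples follow from dividing host throughput by the number of disks. A minimal sketch using the session's numbers (48 disks, 4800 and 10,000 transfers/sec); the variable names are illustrative.

```python
def host_iops_per_disk(total_transfers_per_sec, disk_count):
    """Host-measured transfers/sec attributed to each disk."""
    return total_transfers_per_sec / disk_count


full_stroke = host_iops_per_disk(4800, 48)     # 40GB EDBs, realistic capacity use
short_stroke = host_iops_per_disk(10000, 48)   # 4GB EDBs, small part of each disk
overstatement = (short_stroke - full_stroke) / full_stroke

print(f"full stroke:  {full_stroke:.0f} IOPS per disk")
print(f"short stroke: {short_stroke:.0f} IOPS per disk")
print(f"short-stroke result overstates capability by {overstatement:.0%}")
```

Sizing a design from the short-stroke result would provision roughly half the disks actually needed once the databases grow to production size.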

Basic Primary Partitions

Master Boot Record (MBR) creates an alignment offset

– Utilize Diskpar from the Windows 2000 Resource kit

– Provides ~10% improvement in sustainable throughput when corrected

– This will not resolve excessive latencies

– Data destructive

– Creates RAW partition

– Assign drive letter or mount point and then format in “Disk Manager”

Diskpar -i (displays the current partition information)

Diskpar -s (sets the partition starting offset)
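The alignment problem can be illustrated with simple arithmetic. The sketch below assumes 512-byte sectors, the 63-sector starting offset commonly found on MBR basic disks of this era, and a 64 KB stripe unit; none of these values come from the session itself, so check them against your own array.

```python
SECTOR_BYTES = 512                 # assumption: 512-byte sectors
DEFAULT_MBR_OFFSET_SECTORS = 63    # assumption: typical default MBR starting offset
PAGE_BYTES = 4 * 1024              # Exchange database page size
STRIPE_UNIT_BYTES = 64 * 1024      # assumption: hypothetical array stripe unit


def is_aligned(offset_sectors, unit_bytes):
    """True if the partition's starting offset is a multiple of unit_bytes."""
    return (offset_sectors * SECTOR_BYTES) % unit_bytes == 0


for offset in (DEFAULT_MBR_OFFSET_SECTORS, 128):
    print(f"offset {offset} sectors ({offset * SECTOR_BYTES} bytes): "
          f"4 KB aligned={is_aligned(offset, PAGE_BYTES)}, "
          f"64 KB aligned={is_aligned(offset, STRIPE_UNIT_BYTES)}")
```

With a misaligned offset, some I/Os straddle a stripe boundary and cost the array two operations instead of one, which is the overhead that realignment removes.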

Key Recommendations Summary

- Understand user IO profiles

- Select the correct RAID type, disk capacity and speed

- Design for performance, and then capacity

- Isolate Exchange disks – keep other applications away (SHARED STORAGE SHOULD BE AVOIDED)

- Dedicated disk spindles for data !!

- Dedicated disk spindles for logs !!

- Align your disks using diskpar; this won’t correct a bad design

- Validate storage design using Jetstress

- Scale mailboxes for performance efficiency with strong consideration towards maintaining a backup and restore SLA

The PERF Counters to watch

Physical Disk->Average Disk sec/Read->Instances

Physical Disk->Average Disk sec/Write->Instances

Physical Disk->Current Disk Queue Length->Instances

Physical Disk->Disk Bytes/Sec->Instances

Physical Disk->Disk Writes/Sec->Instances

Physical Disk->Disk Reads/Sec->Instances

Physical Disk->Disk Transfers/Sec->Instances

Physical Disk->Average Disk Bytes/Transfer->Instances

Database\Log Record Stalls/sec

– Log Record Stalls/sec is the number of log records that cannot be added to the log buffers per second because the buffers are full. The average for this value should be below 25. There shouldn’t be spikes (maximum values) higher than 250.

Database\Log Threads Waiting

– Log Threads Waiting is the number of threads waiting for their data to be written to the log in order to complete an update of the database

Under MSExchangeIS there are RPC counters that are extremely useful for understanding user latencies

RPC Requests – number of operations currently being processed by the store

RPC operations/sec – number of incoming operations

RPC Average Latency – average of latency of the last 100 RPC operations (sliding window)

RPC Num. of Slow Packets – number of packets in the past 1024 that have latencies longer than 2 seconds
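One way to log the Physical Disk counters listed above is the Windows typeperf command. The sketch below only builds the command line; it is an illustration, not the method used in the session. The counter paths use the names Performance Monitor displays ("Avg." rather than "Average"), the output file name is hypothetical, and the Exchange-specific counters are left out because their exact paths should be confirmed on the server.

```python
SAMPLE_INTERVAL_SEC = 10   # matches the 10-second sample rate used in this session
DURATION_HOURS = 6         # matches the six-hour peak period

PHYSICAL_DISK_COUNTERS = [
    r"\PhysicalDisk(*)\Avg. Disk sec/Read",
    r"\PhysicalDisk(*)\Avg. Disk sec/Write",
    r"\PhysicalDisk(*)\Current Disk Queue Length",
    r"\PhysicalDisk(*)\Disk Bytes/sec",
    r"\PhysicalDisk(*)\Disk Writes/sec",
    r"\PhysicalDisk(*)\Disk Reads/sec",
    r"\PhysicalDisk(*)\Disk Transfers/sec",
    r"\PhysicalDisk(*)\Avg. Disk Bytes/Transfer",
]


def build_typeperf_command(counters, output_csv):
    """Return a typeperf argument list that logs the counters to a CSV file."""
    sample_count = DURATION_HOURS * 3600 // SAMPLE_INTERVAL_SEC
    return (["typeperf"] + list(counters) +
            ["-si", str(SAMPLE_INTERVAL_SEC),
             "-sc", str(sample_count),
             "-f", "CSV",
             "-o", output_csv])


# "exchange_disk_baseline.csv" is a hypothetical output file name.
cmd = build_typeperf_command(PHYSICAL_DISK_COUNTERS, "exchange_disk_baseline.csv")
print(" ".join(f'"{arg}"' if " " in arg else arg for arg in cmd))
```

The argument list can be passed to `subprocess.run` on the Exchange server, or the printed line pasted into a command prompt.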

Data Device Baseline

Flat response times for read and write activity independent of I/O load

A ROCK’IN System!

Questions and Answers

© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. Data in this presentation is current as of its publish date (see title slide).

Optimizing Storage for Exchange Server 2003

http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/optimizestorage.mspx

Exchange Development Team Blog (storage):

http://blogs.msdn.com/exchange/archive/2004/10/11/240868.aspx

Troubleshooting Microsoft Exchange Server 2003 Performance

http://www.microsoft.com/downloads/details.aspx?FamilyID=8679F6BD-7FF0-41F5-BDD0-C09019409FC0&displaylang=en

Exchange Best Practices Analyzer

http://www.microsoft.com/exchange/downloads/2003/exbpa/default.asp
