Performance Testing at Scale - York...
An overview of performance testing at NetApp.
Shaun Dunning
Performance Testing at Scale
Outline
Performance Engineering responsibilities
– How we protect performance
Overview of 3 key areas:
1. The system under test (SUT)
2. Performance test and analysis methods
3. Performance tools and infrastructure
Future Work
Protecting Performance of a Product at NetApp
Protecting Performance of a New Release
SDLC phases: Requirements → Design → Implementation → Testing / Validation → Deployment → Maintenance

Performance activities across the SDLC:
– Performance Modeling
– Design for Performance
– Analyst Engagements with Product Dev.
– Regression Testing: 1. Weekly Systemic, 2. Daily Continuous
– Release Guidance
– Customer Escalations

Desired discovery point, actors involved, and defect-resolution cost:

Discovery Point | Actors Involved | Cost
Customer Escalations | Support, Perf. Eng., Acct. Teams | 50 to 100
Release Guidance | Sizing, Analysts, Measurement | ~50
Regression Testing | Measurement, Analysts | ~40
Analyst Engagements | Analysts | ~20
Performance Modeling / Design | TDs + PEs | 4
Overview of the SUT
What is it that we are protecting?
SUT – Networked Storage Hardware
Applications Generate I/O Requests
Networked Storage services I/O Requests
Clients: general-purpose compute servers, network ports
Storage Controllers: CPU, memory, network ports, I/O expansion
Disk Shelves: Fibre Channel, SATA, Flash (SSDs)
SUT – Networked Storage Software (OS)
Linux, Windows, Solaris, …
[Diagram: Server A and Server B each reach their volumes (Vol A1, Vol A2, Vol B1) through logical interfaces (LIF A1, LIF B1, LIF B2)]
Data ONTAP® • Specialized OS • Feature-rich • Runs on a variety of FAS nodes
• #1 most deployed storage OS in the industry!
Logical views into storage: • LIFs (logical interfaces) abstract ports • volumes abstract disks
Request types: NAS file I/O requests, SAN block I/O requests, Content Repository object ID requests
Communication over protocols: • NFS/SMB (NAS) • iSCSI/FCP (SAN)
SUT – Storage Virtualization
Clustered Data ONTAP® (cDOT)
Logical views span physical systems: • virtual servers use LIFs and volumes located anywhere within a cluster
Network for intra-cluster traffic
ESX, HyperV, Xen, … • Server Virtualization
[Diagram: server-virtualization hosts running VMs access VserverA (LIF A1, LIF A2; Vol A1) and VserverB (LIF B1; Vol B1, Vol B2), with LIFs and volumes spread across the cluster; multi-tenancy separates tenant 1 and tenant 2]
Intractability of Performance Test Cases

Common Variant | # of Values
Controller Models (Nodes) | ~20s
Protocols | 5
Data ONTAP® Storage Features / Configurations | 1000s
Disk Type | <5
Cluster Size | 24
Client Configurations | ~10s
Workload | ~100s

12 billion different test scenarios!
– More variants within each Common Variant…
Requires robust test methods
Requires a flexible infrastructure
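Taking the rough per-variant counts at face value, the size of the full cross-product can be sketched in a few lines (the counts below are approximations read off the table, not exact figures):

```python
# Approximate number of values per common variant, taken from the table above.
variants = {
    "controller_models": 20,
    "protocols": 5,
    "storage_features_configs": 1000,
    "disk_types": 5,
    "cluster_sizes": 24,
    "client_configs": 10,
    "workloads": 100,
}

total = 1
for count in variants.values():
    total *= count  # full cross-product of all variants

print(f"{total:,} test scenarios")  # 12,000,000,000 test scenarios
```

Even these rounded numbers land on the 12-billion figure, which is why exhaustive testing is off the table.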
Performance Test and Analysis Methods
Context Guide:
Network Storage Performance Workloads
Some key storage I/O workload variants:
– Request type: reads, writes, deletes, or file ops
– Request size: typically 4KiB – 1MiB
– Data access pattern: sequential vs. random
– Data access distribution: uniform vs. non-uniform
– Concurrency of workers: high to low
Micro-benchmark Workloads
Narrow the focus down to well-known workload subsets
Useful for:
– Limits testing and modeling
CPU, Disk, Network bottlenecks
– Regression detection
Simplifies SUT behavior
Workload | Request Type | Request Size | Access Pattern | Access Distribution | Concurrency
Sequential Read | reads | >= 64KiB | 100% sequential | uniform | low
Sequential Write | writes | >= 32KiB | 100% sequential | uniform | low
Random Read | reads | <= 8KiB | 100% random | uniform | high
Random Write | writes | <= 8KiB | 100% random | uniform | high
Application-specific Workloads
Workload | Request Type | Request Size | Access Pattern | Access Distribution | Concurrency
OLTP | 60% writes, 40% reads | 4KiB <= x <= 64KiB | 2 streams: 100% random; 1 stream: 100% sequential | 5% uniform | high
VDI | 50% writes, 50% reads | 512B <= x <= 132KiB | 86% random, 14% sequential | 10% uniform | high
Video Data Acquisition | Stream 1: 100% writes; Stream 2: 89% reads, 10% file-ops, 1% deletes | 64KiB <= x <= 1MiB | Stream 1: 100% sequential; Stream 2: 84% random | uniform | low
Software Build Env. | 7% writes, 6% reads, 2% deletes, 85% file ops | 0B < x <= 132KiB | 100% sequential on small files (appears random) | 90% uniform | high
Application workloads are more complex
“Checks and balances” on micros/synthetic
– Check for escapes
Other Important Workload Properties
Working set size
– Relation to size of data caches
– Used to move bottleneck from CPU to Disk
Open vs. closed workloads [4]
– Open – new jobs arrive independently
Response time curves
– Closed – new jobs arrive when old jobs complete
Max throughput curve
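A toy simulation (illustrative only, not NetApp's tooling) makes the open/closed distinction concrete: in a closed model the offered load self-throttles because each worker waits for its previous request to complete, while in an open model arrivals keep coming regardless of completions:

```python
import random

def closed_loop_throughput(n_workers, service_time_s, duration_s):
    """Closed model: a worker issues its next request only after the
    previous one completes, so throughput tops out at n / service_time."""
    t = [0.0] * n_workers          # each worker's clock
    completions = 0
    while min(t) < duration_s:
        i = t.index(min(t))        # next worker to finish
        t[i] += service_time_s
        completions += 1
    return completions / duration_s

def open_arrival_rate(rate_per_s, duration_s):
    """Open model: new requests arrive independently (Poisson here), so
    queues can grow without bound if the SUT falls behind."""
    t, arrivals = 0.0, 0
    while t < duration_s:
        t += random.expovariate(rate_per_s)
        arrivals += 1
    return arrivals / duration_s
```

With 4 workers and a 10 ms service time the closed loop converges on about 400 req/s no matter how long it runs; the open generator keeps offering its configured rate regardless of what the SUT can absorb.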
Performance Testing Methods
Competitive Performance
– Apply closed work to drive max util. of resource(s)
Scale by increasing concurrency
– Results find the max tput curve
Performance Testing Methods
Consistent Performance
– Apply fixed open work to measure resp. time
Scale by increasing requested I/Os
– Results in a resp. time vs. tput curve
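As a rough intuition for why that curve bends upward, a single-queue M/M/1 approximation (a big simplification of the real SUT, used here only for the shape) predicts response time growing sharply as throughput approaches capacity:

```python
def mm1_response_time(throughput, capacity, service_time_s):
    """M/M/1 approximation: R = S / (1 - utilization).
    Response time explodes as utilization approaches 1."""
    utilization = throughput / capacity
    if utilization >= 1.0:
        raise ValueError("offered load meets or exceeds capacity")
    return service_time_s / (1.0 - utilization)

# Sketch of the resp. time vs. tput curve for a 1000-op/s, 1 ms system:
for tput in (100, 500, 900, 990):
    print(tput, round(mm1_response_time(tput, 1000, 0.001) * 1000, 2), "ms")
```

At 50% utilization the response time merely doubles; at 99% it is a hundred times the bare service time, which is the "knee" these tests are designed to find.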
Performance Testing Methods
Predictable Performance
– Apply open work then inject disruptions
– Observe impact on response time:
[Chart: response time over time, showing the failure event, a period running in a degraded state, then the return to "steady state"]
Layered Analytics and Metrics

Sub-system breakdown into delay centers and service centers (resources), introduced as part of Storage QoS.

Three layers of visibility:
1. High-level reports: system-wide throughput, latency, and utilization
2. Sub-system queuing model: per Data ONTAP® sub-system (CPU, disk, data network, cluster network), tracking arrivals, visits, and completions, with wait time at delay centers and service time at resources
3. Low-level trace data, the most granular view of the system: kernel MSG traces, network traces, function profiling
Metrics for PASS/FAIL Correlation
System-wide metrics:
– Throughput and Latency
Cost-based metrics:
– CPU path length per I/O
What is the overall CPU cost for servicing the I/O?
– Resource Residence Time
What is the specific-resource cost for work?
More granular metric of CPU path length
– Delay Center Residence Time
Time spent waiting for a resource to free up
This + resource residence time == system latency
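That decomposition can be expressed as simple per-I/O accounting (the numbers below are made up for illustration): delay-center residence plus resource residence should reconstruct the observed system latency:

```python
# Hypothetical per-I/O samples: (delay-center wait, resource service), in µs.
samples_us = [
    (120.0, 380.0),
    (90.0, 410.0),
    (200.0, 400.0),
]

# System latency per I/O = wait at delay centers + residence at resources.
latencies_us = [wait + service for wait, service in samples_us]
avg_latency_us = sum(latencies_us) / len(latencies_us)
print(latencies_us, round(avg_latency_us, 1))  # [500.0, 500.0, 600.0] 533.3
```

Splitting latency this way shows whether a regression comes from doing more work (service time up) or from queuing (wait time up), which points at very different fixes.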
Summary of Analysis and Test Methods
Overview of key workloads
– Micro and Application-specific workloads
Performance testing methods
– Competitive perf. testing
– Predictable perf. testing
– Consistent perf. testing
Analytics and metrics
– System-wide reports and metrics
– Queuing model for cost-based metrics
– Low-level trace data (when the going gets tough)
Performance Tools and Testing Infrastructure
Context Guide:
Performance Testing Lab
Dedicated HW, to aid in repeatability: a pool of clients, a pool of storage controllers, and a pool of disks, joined by a data network
Switches (Cisco, Brocade) provide Layer 2 switching for FCP, NFS, SMB, and iSCSI traffic and disk ops
Switched infra. supports dynamic testbed configurations
8 Perf. Labs deployed • 735 clients • 377 controllers • Over 2PB of raw storage
Performance Testing SW
Measurement Invariant Interval
– The core of perf. test execution automation
– The interval where a set of pre- and post-conditions are guaranteed to be true, to ensure accurate measurements
Timeline: (apply work) → assert pre-conditions (e.g., warmup achieved?) → Measurement Interval (collect measurement data) → assert post-conditions (e.g., utilization achieved?) → (next measurement)
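A minimal harness for this idea might look like the following sketch (the function and condition names are illustrative assumptions, not NetApp's actual automation API):

```python
class MeasurementError(RuntimeError):
    """An invariant did not hold around the measurement interval."""

def run_measurement(apply_work, warmed_up, collect, utilization_ok):
    """Collected data counts only if the pre-conditions hold going into
    the interval AND the post-conditions still hold coming out of it."""
    apply_work()                # start driving the workload
    if not warmed_up():         # pre-condition, e.g. warmup achieved?
        raise MeasurementError("pre-condition failed: warmup not achieved")
    data = collect()            # the measurement interval itself
    if not utilization_ok():    # post-condition, e.g. utilization achieved?
        raise MeasurementError("post-condition failed: utilization not achieved")
    return data                 # safe to use for analysis
```

Checking the post-conditions after collection matters: a run that drifted out of its target utilization mid-interval is discarded rather than silently polluting the results.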
Performance Testing SW
Workflow automation (Regression Tests)
Basic Workflow for RCA of Performance Regressions:
Regression test results → Analytics → Regression?
– Yes: bisect on dev commits
– No: update reports
A lot of “special sauce” here
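The "bisect on dev commits" step is essentially a binary search; a bare-bones sketch (assuming a single regression point and a reliable pass/fail perf check) might look like:

```python
def find_first_slow_commit(commits, passes_perf_check):
    """Binary-search an ordered commit list for the first commit where the
    performance check fails. Assumes commits[0] passes, commits[-1] fails,
    and there is a single regression point in between."""
    lo, hi = 0, len(commits) - 1       # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_perf_check(commits[mid]):
            lo = mid                   # regression is after mid
        else:
            hi = mid                   # regression is at or before mid
    return commits[hi]                 # first failing commit
```

Each probe re-runs the regression test at that commit, so a thousand-commit window costs about ten test runs rather than a thousand, which is what makes daily-continuous RCA tractable.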
Rapid Prototyping
SLSL[1] for rapidly prototyping new tests
– Proprietary, domain-specific language
IPython Notebooks[2] for rapidly prototyping new reports
Leveraged for:
– Customer escalations (repro in-house)
– Exploring new regression test ideas
Summary of Tools and Infrastructure
Perf. Labs aid in repeatability
– Dedicated HW implementation
– Test execution SW automation
Reduce human error
Implement strict measurement invariant interval
– Facilities for rapid prototyping
Workflow automation for regression tests
– Increased capacity and test coverage
– Automatic RCA
Future Work
Tests generate a lot of data
– Statistical modeling techniques
– Data science specialists
Creating more synthetic “performance” tests
– Run in virtualized SUTs
– Metrics based on sub-system queues as proxies for changes that impact system performance
More analysis to be done here…
~Fin~
Provided an overview of how Performance Engineering protects the performance of Data ONTAP® through its SDLC.
Covered 3 key areas: 1. The SUT: Networked Storage Systems running
Data ONTAP®
2. Performance test and analysis methods
3. Performance tools and infrastructure
Discussed future improvement areas
Citations and References
1. Dunning, SLSL: http://dl.acm.org/citation.cfm?id=1958798
2. IPython Notebooks: http://ipython.org/notebook
3. RabbitMQ: https://www.rabbitmq.com
4. Schroeder et al., Open Versus Closed: A Cautionary Tale: http://users.cms.caltech.edu/~adamw/papers/openvsclosed.pdf