Performance Testing at Scale - York University

Transcript of Performance Testing at Scale - York University

Page 1:

An overview of performance testing at NetApp.

Shaun Dunning

shaun.dunning@netapp.com

Performance Testing at Scale

Page 2:

Outline

Performance Engineering responsibilities

– How we protect performance

Overview of 3 key areas:

1. The system under test (SUT)

2. Performance test and analysis methods

3. Performance tools and infrastructure

Future Works

Page 3:

Protecting Performance of a Product at NetApp

Page 4:

Protecting Performance of a New Release

(Figure: Performance Engineering activities mapped onto the SDLC phases of Requirements, Design, Implementation, Testing / Validation, Deployment, and Maintenance. Activities shown include Design for Performance, Performance Modeling, Analyst Engagements with Product Dev., Regression Testing (1. Weekly Systemic, 2. Daily Continuous), Release Guidance, and Customer Escalations. A second chart contrasts the desired point of discovery with the cost of defect resolution and the actors involved: Support, Perf. Eng., and Acct. Teams (50 to 100); Sizing, Analysts, Measurement (~50); Measurement, Analysts (~40); Analysts (~20); TDs + PEs (4).)

Page 5:

Overview of the SUT

What is it that we are protecting?

Page 6:

SUT – Networked Storage Hardware

Applications Generate I/O Requests

Networked Storage services I/O Requests

Clients • General Purpose compute servers • Network ports

Storage Controllers • CPU • Memory • Network ports • I/O Expansion

Disk Shelves • Fibre Channel • SATA • Flash (SSDs)

Page 7:

SUT – Networked Storage Software (OS)

Linux, Windows, Solaris, …

(Figure: Server A with volumes A1, A2 and LIF A1; Server B with volume B1 and LIFs B1, B2.)

Data ONTAP® • Specialized OS • Feature-rich • Runs on a variety of FAS nodes

• #1 most deployed storage OS in the industry!

Logical views into storage: • LIFs (logical interfaces) abstract ports • volumes abstract disks

Request types: NAS file I/O requests • SAN block I/O requests • Content Repository object-ID requests

Communication Over Protocols: • NFS/SMB (NAS) • iSCSI/FCP (SAN)

Page 8:

SUT – Storage Virtualization

Clustered Data ONTAP® (cDOT)

Logical views span physical systems: • virtual servers (Vservers) use LIFs and volumes located anywhere within a cluster

Network for intra-cluster traffic

ESX, HyperV, Xen, … • Server Virtualization

(Figure: rows of VMs served by VserverA (LIFs A1, A2; volume A1) and VserverB (LIF B1; volumes B1, B2), illustrating multi-tenancy: tenant 1 and tenant 2.)

Page 9:

Intractability of Performance Test Cases

Common Variants and # of Values:

• Controller Models (Nodes): ~20s
• Protocols: 5
• Data ONTAP® Storage Features / Configurations: 1000s
• Disk Type: <5
• Cluster Size: 24
• Client Configurations: ~10s
• Workload: ~100s

12 billion different test scenarios!

– More variants within each Common Variant…

Requires robust test methods

Requires a flexible infrastructure
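
For a rough sense of where the "12 billion" figure comes from, here is an illustrative back-of-the-envelope calculation that takes one representative value per row of the table above (the per-variant counts are approximations, not exact figures):

```python
# Back-of-the-envelope count of test-scenario combinations, using one
# representative value per "common variant" row from the slide.
variants = {
    "controller models": 20,              # "~20s"
    "protocols": 5,
    "storage features / configs": 1000,   # "1000s"
    "disk types": 5,                      # "<5"
    "cluster sizes": 24,
    "client configurations": 10,          # "~10s"
    "workloads": 100,                     # "~100s"
}

total = 1
for count in variants.values():
    total *= count

print(f"{total:,} combinations")  # 12,000,000,000, i.e. ~12 billion
```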

Page 10:

Performance Test and Analysis Methods

Context Guide:

Page 11:

Network Storage Performance Workloads

Some key storage I/O workload variants:

– Request type: reads, writes, deletes, or file ops

– Request size: typically 4KiB – 1MiB

– Data access pattern: sequential VS random

– Data access distribution: uniform VS non-uniform

– Concurrency of workers: high to low

Page 12:

Micro-benchmark Workloads

Narrow the focus down to well-known workload subsets

Useful for:

– Limits testing and modeling

CPU, Disk, Network bottlenecks

– Regression detection

Simplifies SUT behavior

Micro-benchmark matrix (Request Type, Request Size, Access Pattern, Access Distribution, Concurrency):

• Sequential Read: reads >= 64KiB, 100% sequential, uniform, low concurrency
• Sequential Write: writes >= 32KiB, 100% sequential, uniform, low concurrency
• Random Read: reads <= 8KiB, 100% random, uniform, high concurrency
• Random Write: writes <= 8KiB, 100% random, uniform, high concurrency
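
As an illustration of how such a matrix can be pinned down in code (the field names, concurrency values, and the `Workload` type below are hypothetical, not NetApp's actual test definitions):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Illustrative micro-benchmark spec; fields mirror the table columns."""
    request_type: str    # "read" or "write"
    request_size: int    # bytes
    access_pattern: str  # "sequential" or "random"
    distribution: str    # "uniform" or "non-uniform"
    concurrency: int     # number of outstanding workers

KiB = 1024
MICRO_BENCHMARKS = {
    "sequential_read":  Workload("read",  64 * KiB, "sequential", "uniform", concurrency=4),
    "sequential_write": Workload("write", 32 * KiB, "sequential", "uniform", concurrency=4),
    "random_read":      Workload("read",   8 * KiB, "random",     "uniform", concurrency=64),
    "random_write":     Workload("write",  8 * KiB, "random",     "uniform", concurrency=64),
}
```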

Page 13:

Application-specific Workloads

Application-specific workload matrix (Request Type, Request Size, Access Pattern, Access Distribution, Concurrency):

• OLTP: 60% writes / 40% reads; 4KiB <= x <= 64KiB; 2 streams 100% random, 1 stream 100% sequential; 5% uniform; high concurrency
• VDI: 50% writes / 50% reads; 512B <= x <= 132KiB; 86% random, 14% sequential; 10% uniform; high concurrency
• Video Data Acquisition: stream 1: 100% writes, stream 2: 89% reads / 10% file-ops / 1% deletes; 64KiB <= x <= 1MiB; stream 1: 100% sequential, stream 2: 84% random; uniform; low concurrency
• Software Build Env.: 7% writes / 6% reads / 2% deletes / 85% file ops; 0B < x <= 132KiB; 100% sequential on small files (appears random); 90% uniform; high concurrency

Application workloads are more complex

“Checks and balances” on micros/synthetic

– Check for escapes

Page 14:

Other Important Workload Properties

Working set size

– Relation to size of data caches

– Used to move bottleneck from CPU to Disk

Open VS Closed workloads [4]

– Open – new jobs arrive independently

Response time curves

– Closed – new jobs arrive when old jobs complete

Max throughput curve
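
To make the open/closed distinction concrete, a minimal single-process sketch of the two arrival disciplines (illustrative only; `do_io` stands in for issuing one real storage I/O and waiting for its reply):

```python
import random
import threading
import time

def do_io():
    """Placeholder for issuing one storage I/O and waiting for the reply."""
    time.sleep(0.001)

def closed_loop(iterations=1000):
    # Closed: the next job is issued only when the previous one completes,
    # so the offered load self-throttles as response time grows.
    for _ in range(iterations):
        do_io()

def open_loop(rate_per_sec=500.0, duration_sec=5.0):
    # Open: jobs arrive independently of completions (Poisson arrivals here),
    # so queues and response times grow without bound once the SUT saturates.
    deadline = time.time() + duration_sec
    while time.time() < deadline:
        time.sleep(random.expovariate(rate_per_sec))
        threading.Thread(target=do_io).start()
```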

Page 15:

Performance Testing Methods

Competitive Performance

– Apply closed work to drive max util. of resource(s)

Scale by increasing concurrency

– Results trace out the max throughput curve; a sketch of the concurrency sweep follows
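
A minimal sketch of that concurrency sweep (reusing the hypothetical `do_io` placeholder from the open/closed sketch above; worker counts and durations are arbitrary):

```python
import concurrent.futures
import time

def run_closed(concurrency, duration_sec=10.0):
    """Run a closed-loop load at a fixed worker count and return rough IOPS."""
    completions = 0
    deadline = time.time() + duration_sec

    def worker():
        nonlocal completions
        while time.time() < deadline:
            do_io()            # placeholder I/O from the earlier sketch
            completions += 1   # rough count; good enough for a sketch

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            pool.submit(worker)
    return completions / duration_sec

# Sweep concurrency upward until throughput stops improving; the plateau is
# the max-throughput point, where some resource is driven to full utilization.
max_tput_curve = {n: run_closed(n) for n in (1, 2, 4, 8, 16, 32, 64)}
```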

Page 16:

Performance Testing Methods

Consistent Performance

– Apply fixed open work to measure resp. time

Scale by increasing requested I/Os

– Results in a resp. time VS tput curve; a sketch of the rate sweep follows
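
A matching sketch for the open case (again hypothetical; the requested arrival rate is fixed per run and per-request latency is recorded):

```python
import random
import statistics
import threading
import time

def measure_open(rate_per_sec, duration_sec=10.0):
    """Offer a fixed open arrival rate; return (achieved IOPS, mean latency)."""
    latencies = []

    def timed_io():
        start = time.time()
        do_io()                              # placeholder I/O from the earlier sketch
        latencies.append(time.time() - start)

    deadline = time.time() + duration_sec
    while time.time() < deadline:
        time.sleep(random.expovariate(rate_per_sec))
        threading.Thread(target=timed_io).start()

    time.sleep(1.0)                          # crude drain of in-flight requests
    return len(latencies) / duration_sec, statistics.mean(latencies)

# Stepping the requested I/O rate upward traces the latency-vs-throughput curve:
# latency stays flat at low rates and climbs sharply as the SUT saturates.
curve = [measure_open(rate) for rate in (100, 200, 400, 800)]
```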

Page 17:

Performance Testing Methods

Predictable Performance

– Apply open work then inject disruptions

– Observe impact on response time:

(Figure: response time over time, annotated with the failure event, a period running in a degraded state, and the return to "steady state".)
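
A self-contained sketch of the disruption-injection idea (everything here is illustrative; `inject_disruption` would be whatever failure the test targets, such as pulling a disk or rebooting a node):

```python
import time

def latency_probe():
    """Placeholder: issue one I/O against the SUT and return its latency in seconds."""
    start = time.time()
    time.sleep(0.002)                  # stands in for the real request
    return time.time() - start

def predictability_test(inject_disruption, duration_sec=120):
    # Run a steady workload (abstracted away here), trigger a disruption midway,
    # and record latency over time so the failure event, the degraded window,
    # and the return to "steady state" are all visible in the timeline.
    timeline = []
    for second in range(duration_sec):
        if second == duration_sec // 2:
            inject_disruption()
        timeline.append((second, latency_probe()))
        time.sleep(1.0)
    return timeline
```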

Page 18:

Layered Analytics and Metrics

Sub-system breakdown into: • Delay centers • Service centers (resources)
Introduced as part of Storage QoS

Three layers of visibility:

• High Level Reports: system-wide throughput, latency, and utilization
• Sub-system queuing model: Data ONTAP® sub-systems modeled as resources (CPU, Disk, Data Network, Cluster Network) and delay centers, tracked in terms of arrivals, visits, completions, wait time, and service time
• Low-level trace data: the most granular view of the system: kernel MSG traces, network traces, function profiling
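
In standard operational/queuing terms, the quantities in the middle layer relate roughly as follows (a sketch with ad hoc notation, not NetApp's exact formulation):

```latex
% For each service/delay center c, with visit count V_c, wait time W_c,
% service time S_c, and throughput (completion rate) X_c:
R_c = W_c + S_c                          % residence time at center c
U_c = X_c \cdot S_c                      % utilization of the resource
\text{latency per I/O} \approx \sum_c V_c \, R_c
```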

Page 19:

Metrics for PASS/FAIL Correlation

System-wide metrics:

– Throughput and Latency

Cost-based metrics:

– CPU path length per I/O

What is the overall CPU cost for servicing the I/O?

– Resource Residence Time

What is the specific-resource cost for work?

More granular metric of CPU path length

– Delay Center Residence Time

Time spent waiting on a resource to free up?

This + resource residence time == system latency
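
Written out explicitly (symbols are ad hoc, restating the definitions above):

```latex
\text{CPU path length per I/O} = \frac{\text{total CPU busy time}}{\text{I/Os completed}}
\text{system latency} = \text{delay-center residence time} + \text{resource residence time}
```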

Page 20:

Summary of Analysis and Test Methods

Overview of key workloads

– Micro and Application-specific workloads

Performance testing methods

– Competitive perf. testing

– Predictable perf. testing

– Consistent perf. testing

Analytics and metrics

– System-wide reports and metrics

– Queuing model for cost-based metrics

– Low-level trace data (when the going gets tough)

Page 21:

Performance Tools and Testing Infrastructure

Context Guide:

Page 22:

Performance Testing Lab

Dedicated HW, to aid in repeatability

(Figure: a pool of clients and a pool of storage controllers, attached to a pool of disks, connected through data network switches (Cisco, Brocade) providing Layer 2 switching and carrying FCP, NFS, SMB, iSCSI, and disk-op traffic.)

Switched infra. supports dynamic testbed configurations

8 Perf. Labs deployed • 735 clients • 377 controllers • Over 2PB of raw storage

Page 23:

Performance Testing SW

Page 24:

Performance Testing SW

Measurement Invariant Interval

– Central to perf. test execution automation

– The interval during which a set of pre- and post-conditions are guaranteed to be true, to ensure accurate measurements

(Figure: timeline of a measurement: apply work, assert pre-conditions (e.g., warmup achieved?), collect measurement data during the measurement interval, assert post-conditions (e.g., utilization achieved?), then move on to the next measurement.)
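
A minimal sketch of the invariant-interval idea (names and structure are illustrative, not the actual test-execution framework):

```python
def run_measurement(apply_work, collect, preconditions, postconditions):
    """Run one measurement interval guarded by pre- and post-conditions.

    The measurement only counts if every pre-condition held before the
    interval and every post-condition held after it; otherwise it is
    discarded rather than reported as a potentially misleading data point.
    """
    apply_work()                                    # e.g., start the workload

    for name, check in preconditions.items():      # e.g., "warmup achieved?"
        assert check(), f"pre-condition failed: {name}"

    samples = collect()                             # the measurement interval

    for name, check in postconditions.items():     # e.g., "utilization achieved?"
        assert check(), f"post-condition failed: {name}"

    return samples
```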

Page 25:

Performance Testing SW

Workflow automation (Regression Tests)

(Figure: basic workflow for RCA of performance regressions: regression test results feed analytics; if a regression is detected, bisect on dev commits to isolate it; otherwise, update reports.)

A lot of “special sauce” here
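
The bisection step can be sketched as a binary search over the ordered list of dev commits (illustrative only; the `is_regressed` callback hides the build, deploy, run, and compare-to-baseline machinery, which is where most of the "special sauce" lives):

```python
def find_regressing_commit(commits, is_regressed):
    """Return the first commit in an ordered list where the perf test regresses.

    Assumes commits[0] is known good, commits[-1] is known bad, and that the
    regression, once introduced, persists in later commits.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_regressed(commits[mid]):
            hi = mid           # regression is at mid or earlier
        else:
            lo = mid + 1       # regression was introduced after mid
    return commits[lo]
```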

Page 26:

Rapid Prototyping

SLSL[1] for rapidly prototyping new tests

– Proprietary, domain-specific language

IPython Notebooks[2] for rapidly prototyping new reports

Leveraged for:

– Customer escalations (repro in-house)

– Exploring new regression test ideas
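
For flavor, a notebook-style report can be prototyped in a handful of lines (the file name and column names below are made up for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical export: one row per measurement interval.
results = pd.read_csv("regression_run_results.csv")

# Quick latency-vs-throughput report, grouped by build, to eyeball a regression.
for build, group in results.groupby("build"):
    plt.plot(group["throughput_iops"], group["latency_ms"], marker="o", label=build)

plt.xlabel("Throughput (IOPS)")
plt.ylabel("Latency (ms)")
plt.legend()
plt.show()
```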

Page 27:

Summary of Tools and Infrastructure

Perf. Labs aid in repeatability

– Dedicated HW implementation

– Test execution SW automation

Reduce human error

Implement strict measurement invariant interval

– Facilities for rapid prototyping

Workflow automation for regression tests

– Increased capacity and test coverage

– Automatic RCA

Page 28:

Future Works

Tests generate a lot of data

– Statistical modeling techniques

– Data science specialists

Creating more synthetic “performance” tests

– Run in virtualized SUTs

– Metrics based on sub-system queues as proxies for changes that impact system performance

More analysis to be done here…

Page 29:

~Fin~

Provided an overview of how Performance Engineering protects the performance of Data ONTAP® through its SDLC.

Covered 3 key areas:

1. The SUT: Networked Storage Systems running Data ONTAP®

2. Performance test and analysis methods

3. Performance tools and infrastructure

Discussed future improvement areas

Page 30:

Citations and References

1. Dunning, SLSL: http://dl.acm.org/citation.cfm?id=1958798

2. IPython Notebooks: http://ipython.org/notebook

3. RabbitMQ: https://www.rabbitmq.com

4. Schroeder, Open Versus Closed: A Cautionary Tale: http://users.cms.caltech.edu/~adamw/papers/openvsclosed.pdf
