Performance Testing at Scale - York...
An overview of performance testing at NetApp.
Shaun Dunning
Performance Testing at Scale
Outline
Performance Engineering responsibilities
– How we protect performance
Overview of 3 key areas:
1. The system under test (SUT)
2. Performance test and analysis methods
3. Performance tools and infrastructure
Future Work
Protecting Performance of a Product at NetApp
Protecting Performance of a New Release
SDLC phases: Requirements → Design → Implementation → Testing / Validation → Deployment → Maintenance

Performance activities across the SDLC:
– Performance Modeling
– Design for Performance
– Analyst Engagements with Product Dev.
– Regression Testing: 1. Weekly Systemic, 2. Daily Continuous
– Release Guidance
– Customer Escalations

Desired discovery point, actors involved, and defect-resolution cost:

Discovery Point | Actors Involved | Cost
Customer Escalations | Support, Perf. Eng., Acct. Teams | 50 to 100
Release Guidance | Sizing, Analysts, Measurement | ~50
Regression Testing | Measurement, Analysts | ~40
Analyst Engagements | Analysts | ~20
Performance Modeling / Design | TDs + PEs | 4
Overview of the SUT
What is it that we are protecting?
SUT – Networked Storage Hardware
Applications Generate I/O Requests
Networked Storage services I/O Requests
Clients: general-purpose compute servers, network ports
Storage Controllers: CPU, memory, network ports, I/O expansion
Disk Shelves: Fibre Channel, SATA, Flash (SSDs)
SUT – Networked Storage Software (OS)
Linux, Windows, Solaris, …
[Diagram: Server A and Server B each reach their volumes (Vol A1, Vol A2, Vol B1) through logical interfaces (LIF A1, LIF B1, LIF B2)]
Data ONTAP® • Specialized OS • Feature-rich • Runs on a variety of FAS nodes
• #1 most deployed storage OS in the industry!
Logical views into storage: • LIFs (logical interfaces) abstract ports • volumes abstract disks
Request types: NAS file I/O requests, SAN block I/O requests, Content Repository object ID requests
Communication over protocols: • NFS/SMB (NAS) • iSCSI/FCP (SAN)
SUT – Storage Virtualization
Clustered Data ONTAP® (cDOT)
Logical views span physical systems: • virtual servers use LIFs and volumes located anywhere within a cluster
Network for intra-cluster traffic
ESX, HyperV, Xen, … • Server Virtualization
[Diagram: server-virtualization hosts running VMs access VserverA (LIF A1, LIF A2; Vol A1) and VserverB (LIF B1; Vol B1, Vol B2), with LIFs and volumes spread across the cluster; multi-tenancy separates tenant 1 and tenant 2]
Intractability of Performance Test Cases

Common Variant | # of Values
Controller Models (Nodes) | ~20s
Protocols | 5
Data ONTAP® Storage Features / Configurations | 1000s
Disk Type | <5
Cluster Size | 24
Client Configurations | ~10s
Workload | ~100s

12 billion different test scenarios!
– More variants within each Common Variant…
Requires robust test methods
Requires a flexible infrastructure
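Taking the rough per-variant counts at face value, the size of the full cross-product can be sketched in a few lines (the counts below are approximations read off the table, not exact figures):

```python
# Approximate number of values per common variant, taken from the table above.
variants = {
    "controller_models": 20,
    "protocols": 5,
    "storage_features_configs": 1000,
    "disk_types": 5,
    "cluster_sizes": 24,
    "client_configs": 10,
    "workloads": 100,
}

total = 1
for count in variants.values():
    total *= count  # full cross-product of all variants

print(f"{total:,} test scenarios")  # 12,000,000,000 test scenarios
```

Even these rounded numbers land on the 12-billion figure, which is why exhaustive testing is off the table.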
Performance Test and Analysis Methods
Context Guide:
Network Storage Performance Workloads
Some key storage I/O workload variants:
– Request type: reads, writes, deletes, or file ops
– Request size: typically 4KiB – 1MiB
– Data access pattern: sequential vs. random
– Data access distribution: uniform vs. non-uniform
– Concurrency of workers: high to low
Micro-benchmark Workloads
Narrow the focus down to well-known workload subsets
Useful for:
– Limits testing and modeling
CPU, Disk, Network bottlenecks
– Regression detection
Simplifies SUT behavior
Workload | Request Type | Request Size | Access Pattern | Access Distribution | Concurrency
Sequential Read | reads | >= 64KiB | 100% sequential | uniform | low
Sequential Write | writes | >= 32KiB | 100% sequential | uniform | low
Random Read | reads | <= 8KiB | 100% random | uniform | high
Random Write | writes | <= 8KiB | 100% random | uniform | high
Application-specific Workloads
Workload | Request Type | Request Size | Access Pattern | Access Distribution | Concurrency
OLTP | 60% writes, 40% reads | 4KiB <= x <= 64KiB | 2 streams: 100% random; 1 stream: 100% sequential | 5% uniform | high
VDI | 50% writes, 50% reads | 512B <= x <= 132KiB | 86% random, 14% sequential | 10% uniform | high
Video Data Acquisition | Stream 1: 100% writes; Stream 2: 89% reads, 10% file-ops, 1% deletes | 64KiB <= x <= 1MiB | Stream 1: 100% sequential; Stream 2: 84% random | uniform | low
Software Build Env. | 7% writes, 6% reads, 2% deletes, 85% file ops | 0B < x <= 132KiB | 100% sequential on small files (appears random) | 90% uniform | high
Application workloads are more complex
“Checks and balances” on micros/synthetic
– Check for escapes
Other Important Workload Properties
Working set size
– Relation to size of data caches
– Used to move bottleneck from CPU to Disk
Open vs. closed workloads [4]
– Open – new jobs arrive independently
Response time curves
– Closed – new jobs arrive when old jobs complete
Max throughput curve
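A toy simulation (illustrative only, not NetApp's tooling) makes the open/closed distinction concrete: in a closed model the offered load self-throttles because each worker waits for its previous request to complete, while in an open model arrivals keep coming regardless of completions:

```python
import random

def closed_loop_throughput(n_workers, service_time_s, duration_s):
    """Closed model: a worker issues its next request only after the
    previous one completes, so throughput tops out at n / service_time."""
    t = [0.0] * n_workers          # each worker's clock
    completions = 0
    while min(t) < duration_s:
        i = t.index(min(t))        # next worker to finish
        t[i] += service_time_s
        completions += 1
    return completions / duration_s

def open_arrival_rate(rate_per_s, duration_s):
    """Open model: new requests arrive independently (Poisson here), so
    queues can grow without bound if the SUT falls behind."""
    t, arrivals = 0.0, 0
    while t < duration_s:
        t += random.expovariate(rate_per_s)
        arrivals += 1
    return arrivals / duration_s
```

With 4 workers and a 10 ms service time the closed loop converges on about 400 req/s no matter how long it runs; the open generator keeps offering its configured rate regardless of what the SUT can absorb.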
Performance Testing Methods
Competitive Performance
– Apply closed work to drive max util. of resource(s)
Scale by increasing concurrency
– Results find the max tput curve
Performance Testing Methods
Consistent Performance
– Apply fixed open work to measure resp. time
Scale by increasing requested I/Os
– Results in a resp. time vs. tput curve
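As a rough intuition for why that curve bends upward, a single-queue M/M/1 approximation (a big simplification of the real SUT, used here only for the shape) predicts response time growing sharply as throughput approaches capacity:

```python
def mm1_response_time(throughput, capacity, service_time_s):
    """M/M/1 approximation: R = S / (1 - utilization).
    Response time explodes as utilization approaches 1."""
    utilization = throughput / capacity
    if utilization >= 1.0:
        raise ValueError("offered load meets or exceeds capacity")
    return service_time_s / (1.0 - utilization)

# Sketch of the resp. time vs. tput curve for a 1000-op/s, 1 ms system:
for tput in (100, 500, 900, 990):
    print(tput, round(mm1_response_time(tput, 1000, 0.001) * 1000, 2), "ms")
```

At 50% utilization the response time merely doubles; at 99% it is a hundred times the bare service time, which is the "knee" these tests are designed to find.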
Performance Testing Methods
Predictable Performance
– Apply open work then inject disruptions
– Observe impact on response time:
[Chart: response time over time, showing the failure event, a period running in a degraded state, then the return to "steady state"]
Layered Analytics and Metrics

Sub-system breakdown into delay centers and service centers (resources), introduced as part of Storage QoS.

Three layers of visibility:
1. High-level reports: system-wide throughput, latency, and utilization
2. Sub-system queuing model: per Data ONTAP® sub-system (CPU, disk, data network, cluster network), tracking arrivals, visits, and completions, with wait time at delay centers and service time at resources
3. Low-level trace data, the most granular view of the system: kernel MSG traces, network traces, function profiling
Metrics for PASS/FAIL Correlation
System-wide metrics:
– Throughput and Latency
Cost-based metrics:
– CPU path length per I/O
What is the overall CPU cost for servicing the I/O?
– Resource Residence Time
What is the specific-resource cost for work?
More granular metric of CPU path length
– Delay Center Residence Time
Time spent waiting for a resource to free up
This + resource residence time == system latency
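That decomposition can be expressed as simple per-I/O accounting (the numbers below are made up for illustration): delay-center residence plus resource residence should reconstruct the observed system latency:

```python
# Hypothetical per-I/O samples: (delay-center wait, resource service), in µs.
samples_us = [
    (120.0, 380.0),
    (90.0, 410.0),
    (200.0, 400.0),
]

# System latency per I/O = wait at delay centers + residence at resources.
latencies_us = [wait + service for wait, service in samples_us]
avg_latency_us = sum(latencies_us) / len(latencies_us)
print(latencies_us, round(avg_latency_us, 1))  # [500.0, 500.0, 600.0] 533.3
```

Splitting latency this way shows whether a regression comes from doing more work (service time up) or from queuing (wait time up), which points at very different fixes.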
Summary of Analysis and Test Methods
Overview of key workloads
– Micro and Application-specific workloads
Performance testing methods
– Competitive perf. testing
– Predictable perf. testing
– Consistent perf. testing
Analytics and metrics
– System-wide reports and metrics
– Queuing model for cost-based metrics
– Low-level trace data (when the going gets tough)
Performance Tools and Testing Infrastructure
Context Guide:
Performance Testing Lab
Dedicated HW, to aid in repeatability: a pool of clients, a pool of storage controllers, and a pool of disks, joined by a data network
Switches (Cisco, Brocade) provide Layer 2 switching for FCP, NFS, SMB, and iSCSI traffic and disk ops
Switched infra. supports dynamic testbed configurations
8 Perf. Labs deployed • 735 clients • 377 controllers • Over 2PB of raw storage
Performance Testing SW
Measurement Invariant Interval
– The core of perf. test execution automation
– The interval where a set of pre- and post-conditions are guaranteed to be true, to ensure accurate measurements
Timeline: (apply work) → assert pre-conditions (e.g., warmup achieved?) → Measurement Interval (collect measurement data) → assert post-conditions (e.g., utilization achieved?) → (next measurement)
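A minimal harness for this idea might look like the following sketch (the function and condition names are illustrative assumptions, not NetApp's actual automation API):

```python
class MeasurementError(RuntimeError):
    """An invariant did not hold around the measurement interval."""

def run_measurement(apply_work, warmed_up, collect, utilization_ok):
    """Collected data counts only if the pre-conditions hold going into
    the interval AND the post-conditions still hold coming out of it."""
    apply_work()                # start driving the workload
    if not warmed_up():         # pre-condition, e.g. warmup achieved?
        raise MeasurementError("pre-condition failed: warmup not achieved")
    data = collect()            # the measurement interval itself
    if not utilization_ok():    # post-condition, e.g. utilization achieved?
        raise MeasurementError("post-condition failed: utilization not achieved")
    return data                 # safe to use for analysis
```

Checking the post-conditions after collection matters: a run that drifted out of its target utilization mid-interval is discarded rather than silently polluting the results.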
Performance Testing SW
Workflow automation (Regression Tests)
Basic Workflow for RCA of Performance Regressions:
Regression test results → Analytics → Regression?
– Yes: bisect on dev commits
– No: update reports
A lot of “special sauce” here
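The "bisect on dev commits" step is essentially a binary search; a bare-bones sketch (assuming a single regression point and a reliable pass/fail perf check) might look like:

```python
def find_first_slow_commit(commits, passes_perf_check):
    """Binary-search an ordered commit list for the first commit where the
    performance check fails. Assumes commits[0] passes, commits[-1] fails,
    and there is a single regression point in between."""
    lo, hi = 0, len(commits) - 1       # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_perf_check(commits[mid]):
            lo = mid                   # regression is after mid
        else:
            hi = mid                   # regression is at or before mid
    return commits[hi]                 # first failing commit
```

Each probe re-runs the regression test at that commit, so a thousand-commit window costs about ten test runs rather than a thousand, which is what makes daily-continuous RCA tractable.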
Rapid Prototyping
SLSL[1] for rapidly prototyping new tests
– Proprietary, domain-specific language
IPython Notebooks[2] for rapidly prototyping new reports
Leveraged for:
– Customer escalations (repro in-house)
– Exploring new regression test ideas
Summary of Tools and Infrastructure
Perf. Labs aid in repeatability
– Dedicated HW implementation
– Test execution SW automation
Reduce human error
Implement strict measurement invariant interval
– Facilities for rapid prototyping
Workflow automation for regression tests
– Increased capacity and test coverage
– Automatic RCA
Future Work
Tests generate a lot of data
– Statistical modeling techniques
– Data science specialists
Creating more synthetic “performance” tests
– Run in virtualized SUTs
– Metrics based on sub-system queues as proxies for changes that impact system performance
More analysis to be done here…
~Fin~
Provided an overview of how Performance Engineering protects the performance of Data ONTAP® through its SDLC.
Covered 3 key areas: 1. The SUT: Networked Storage Systems running
Data ONTAP®
2. Performance test and analysis methods
3. Performance tools and infrastructure
Discussed future improvement areas
Citations and References
1. Dunning, SLSL: http://dl.acm.org/citation.cfm?id=1958798
2. IPython Notebooks: http://ipython.org/notebook
3. RabbitMQ: https://www.rabbitmq.com
4. Schroeder et al., Open Versus Closed: A Cautionary Tale: http://users.cms.caltech.edu/~adamw/papers/openvsclosed.pdf