Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Best Practices for Benchmarking and
Performance Analysis in the Cloud
Robert Barnes, Amazon Web Services
November 15, 2013
Benchmarks: Measurement Demo
[Figure: measurement demo, labeled 3, 3, 4, 6, 4]
How many ways to measure? At least 20…
Cloud Benchmarks: Prequel
• The best benchmark
• Absolute vs. relative measures
• Fixed time or fixed work
• What’s different?
• Use a good AMI
[Chart: average CPU result (axis 0.00 to 30.00) and coefficient of variance (axis 0% to 60%) for five AMIs: Ubuntu 12.4 ami-…, AWS CentOS 5.4 ami-…, and three CentOS 5.4 ami-… images]
Scenario: CPU-based Instance Selection
• Application runs on premises
• Primary requirement is integer CPU performance
• Application is complex to set up, no benchmark tests exist, limited time
• What instance would work best?
1. Choose a synthetic benchmark
2. Baseline: Build, configure, tune, and run it on premises
3. Run the same test (or tests) on a set of instance types
4. Use results from the instance tests to choose the best match
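Step 4 amounts to expressing every result relative to a baseline, which is how the later result tables report a "ratio" (with cc2.8xlarge at 1.00). A minimal sketch of that normalization follows, assuming each input line holds an instance type and a raw score; the function name `ratio_to_baseline`, the input layout, and the sample scores are illustrative and do not come from the talk.

```shell
# ratio_to_baseline BASELINE_TYPE
# Reads "instance-type score" lines on stdin and prints each score divided
# by the baseline's score. Name and input layout are illustrative only.
ratio_to_baseline() {
  awk -v base="$1" '
    NF >= 2 { type[++n] = $1; score[n] = $2; if ($1 == base) b = $2 }
    END {
      if (b + 0 == 0) { print "baseline not found"; exit 1 }
      for (i = 1; i <= n; i++) printf "%s %.2f\n", type[i], score[i] / b
    }'
}

# Example with made-up scores (not measurements from the talk):
printf 'cc2.8xlarge 2000\nc3.large 2260\n' | ratio_to_baseline cc2.8xlarge
```

The baseline could equally be the on-premises result from step 2; the function only needs one line whose first field matches the name passed in.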
Testing CPU
• Choose a benchmark
– geekbench, UnixBench, sysbench (cpu), and SPEC CPU2006 Integer
• How do you know when you have a good result?
• Tests run on 9 instance types
– 10 instances of each of the 9 types launched
– Tests run a minimum of 4 times on each instance
– Ubuntu 13.04 base AMI
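One way to judge whether repeated runs give a "good result" is the coefficient of variation (C.O.V.), the standard deviation divided by the mean, which the result tables report per instance type. The sketch below assumes one score per line on stdin and uses the sample (n-1) standard deviation; the slides do not say which form was used, and the name `cov_summary` is hypothetical.

```shell
# cov_summary: read one numeric score per line and print "mean cov%".
# Uses the sample standard deviation (n-1); that choice is an assumption.
cov_summary() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n < 2) { print "need at least 2 samples"; exit 1 }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)
      if (var < 0) var = 0          # guard against rounding error
      printf "%.2f %.2f%%\n", mean, 100 * sqrt(var) / mean
    }'
}

# Example with four made-up scores:
printf '10\n12\n11\n13\n' | cov_summary    # prints: 11.50 11.23%
```

A low C.O.V. across runs and across instances suggests the mean is a usable number; a high one is a cue to run more tests before comparing instance types.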
geekbench Overview
• Workloads in 3 categories
– 13 Integer tests
– 10 Floating Point tests
– 4 Memory tests
• Commercial product (64bit)
• No source code
• Runs single and multi-cpu
• Fast setup, fast runtime
Integer
AES
Twofish
SHA1
SHA2
BZip2 compress
BZip2 decompress
JPEG compress
JPEG decompress
PNG compress
PNG decompress
Sobel
LUA
Dijkstra
Floating Point
Black-Scholes
Mandelbrot
Sharpen image
Blur image
SGEMM
DGEMM
SFFT
DFFT
N-Body
Ray trace
Memory
STREAM copy
STREAM scale
STREAM add
STREAM triad
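geekbench Script
# Usage: pass a run sequence number as the first argument
SEQNO=$1
GBTXT=gbtest.txt
DL=+    # field delimiter used in result file names
# Look up this instance's ID and type from the EC2 instance metadata service
ID="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)"
TYPE="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-type)"
OUTID=$ID$DL$TYPE$DL
# Time the benchmark run in seconds
START=$(date +%s.%N)
./geekbench_x86_64 --no-upload >$GBTXT
END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)
# Rename the output to <id>+<type>+<seqno>+<elapsed>+gbtest.txt
OUTNAME=$OUTID$SEQNO$DL$DIFF$DL$GBTXT
mv "$GBTXT" "$OUTNAME"
…
# Collect the scores from all result files into a semicolon-separated CSV
grep "Geekbench Score" i-*$GBTXT >gbresults.txt
cat gbresults.txt | sed s/:// | awk '/i-/ {print $1";"$4";"$5}' >gbresults.csv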
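The script encodes the instance ID, instance type, sequence number, and elapsed seconds into each result file name, separated by "+". A small sketch of recovering those fields for later analysis is shown here; the function name `split_result_name` and the sample file name (including its instance ID) are made up for illustration.

```shell
# split_result_name FILE
# Splits a name of the form <id>+<type>+<seqno>+<elapsed>+gbtest.txt
# into "id;type;seqno;elapsed". Hypothetical helper, not from the talk.
split_result_name() {
  # Strip any directory part, then split on the "+" delimiter.
  basename "$1" | awk -F'+' '{ print $1 ";" $2 ";" $3 ";" $4 }'
}

# Example with a made-up file name:
split_result_name "i-0abc1234+c3.large+2+104.512+gbtest.txt"
```

Keeping the metadata in the file name means every raw result stays self-describing, which fits the closing advice to keep all samples.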
geekbench
Geekbench    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
m3.xlarge    0.93        1.04%   2.04        2.31%   2.06
m3.2xlarge   0.93        1.40%   3.80        1.46%   2.08
m2.xlarge    0.80        2.84%   1.54        4.06%   1.99
m2.2xlarge   0.80        1.34%   2.82        1.21%   2.04
m2.4xlarge   0.76        2.28%   5.11        1.71%   2.01
c3.large     1.13        0.93%   1.32        0.71%   1.76
c3.xlarge    1.13        0.39%   2.51        1.81%   1.74
c3.2xlarge   1.13        0.19%   4.88        0.25%   1.70
cc2.8xlarge  1.00        0.71%   15.46       1.93%   2.21
geekbench – Run Variance
geekbench    1CPU ratio  C.O.V.
m3.xlarge
instance-1   0.93        0.31%
instance-2   0.97        0.23%
instance-3   0.94        0.17%
instance-4   0.94        0.10%
instance-5   0.94        0.32%
instance-6   0.94        0.10%
instance-7   0.93        0.25%
instance-8   0.93        0.38%
instance-9   0.94        0.11%
instance-10  0.94        0.09%
geekbench – Integer Portion
gb-integer   1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large     1.12        0.50%   1.37        0.43%   NA
c3.xlarge    1.13        0.38%   2.72        0.41%   NA
c3.2xlarge   1.12        0.38%   5.35        0.51%   NA
cc2.8xlarge  1.00        0.20%   17.88       3.31%   NA
geekbench    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large     1.13        0.93%   1.32        0.71%   1.76
c3.xlarge    1.13        0.39%   2.51        1.81%   1.74
c3.2xlarge   1.13        0.19%   4.88        0.25%   1.70
cc2.8xlarge  1.00        0.71%   15.46       1.93%   2.21
UnixBench Overview
• Default: the BYTE Index
– 12 workloads, run 2 times (roughly 29 minutes each time)
• Integer computation
• Floating point computation
• System calls
• File system calls
– Geometric mean of results relative to a baseline produces the System Benchmarks Index Score
• Open source
– Must be built
– Must be patched for > 16 CPUs
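The index score described above is a geometric mean of per-workload results relative to a baseline. The sketch below shows only that general calculation, assuming each input line holds a result and its baseline value; it is not UnixBench's own code, it omits any scaling factor UnixBench applies, and the name `geomean_ratio` and the sample numbers are illustrative.

```shell
# geomean_ratio: read "result baseline" pairs, one per line, and print the
# geometric mean of result/baseline. Illustrative helper only.
geomean_ratio() {
  awk '
    NF >= 2 && $1 > 0 && $2 > 0 { n++; logsum += log($1 / $2) }
    END {
      if (n == 0) { print "no data"; exit 1 }
      printf "%.2f\n", exp(logsum / n)
    }'
}

# Example with made-up pairs: ratios 4 and 1 give a geometric mean of 2.00
printf '400 100\n100 100\n' | geomean_ratio
```

A geometric mean keeps one unusually fast or slow workload from dominating the index the way it would in an arithmetic mean, which is worth remembering when the overall UnixBench ratios differ from the Dhrystone-only ratios.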
UnixBench Script
SEQNO=$1
UBTXT=ubtest.txt
DL=+
ID="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)"
TYPE="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-type)"
FN=$ID$DL$TYPE$DL$SEQNO$DL$UBTXT
# Count logical CPUs
COPIES=$(grep -c processor /proc/cpuinfo)
# Run once with 1 copy and once with one copy per CPU
./Run -c 1 -c "$COPIES" >"$FN"
…
# Collect the index scores from all result files into a CSV
grep "System Benchmarks Index Score" i-*$UBTXT >ubresults.txt
cat ubresults.txt | sed 's/.txt:System Benchmarks Index Score//' | \
awk '/i-/ {print $1";"$2}' >ubresults.csv
UnixBench
UnixBench    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
m3.xlarge    1.38        1.90%   2.49        1.36%   28.25
m3.2xlarge   1.42        1.85%   4.21        1.99%   28.29
m2.xlarge    0.40        5.82%   0.76        1.28%   28.30
m2.2xlarge   0.42        1.71%   1.23        1.75%   28.32
m2.4xlarge   0.48        3.31%   2.02        1.71%   28.34
c3.large     1.10        1.33%   1.91        1.54%   28.17
c3.xlarge    1.06        1.48%   2.85        1.26%   28.21
c3.2xlarge   1.10        0.54%   4.50        1.02%   28.96
cc2.8xlarge  1.00        2.97%   6.44        2.65%   30.20
UnixBench – Dhrystone 2
UB-Integer   1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large     1.05        0.24%   1.10        0.30%   0.17
c3.xlarge    1.05        0.27%   2.20        0.28%   0.17
c3.2xlarge   1.05        0.07%   4.34        0.23%   0.17
cc2.8xlarge  1.00        0.10%   15.54       0.95%   0.17
UnixBench    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large     1.10        1.33%   1.91        1.54%   28.17
c3.xlarge    1.06        1.48%   2.85        1.26%   28.21
c3.2xlarge   1.10        0.54%   4.50        1.02%   28.96
cc2.8xlarge  1.00        2.97%   6.44        2.65%   30.20
SPEC CPU2006 Overview
• Competitive (reviewed)
• Commercial (site) license required
• Source code provided, must be built
• Highly customizable
• Full “reportable” run 5+ hours
• Published results on www.spec.org
SPEC CPU2006 Overview
Benchmark       Language  Category
400.perlbench   C         Programming language
401.bzip2       C         Compression
403.gcc         C         C compiler
429.mcf         C         Combinatorial optimization
445.gobmk       C         Artificial intelligence
456.hmmer       C         Search gene sequence
458.sjeng       C         Artificial intelligence
462.libquantum  C         Physics / quantum computing
464.h264ref     C         Video compression
471.omnetpp     C++       Discrete event simulation
473.astar       C++       Path-finding algorithms
483.xalancbmk   C++       XML processing
SPEC CPU2006 Integer Script
# SEQNO was not set on the slide; taken from the first argument as in the other scripts
SEQNO=$1
CPATH="/cpu2006/result"
COPIES=$(grep -c processor /proc/cpuinfo)
SITXT=estspecint.txt
DL=+
ID="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)"
TYPE="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-type)"
FN=$ID$DL$TYPE$DL$SEQNO$DL$SITXT
# Non-reportable base rate run, one iteration, over 10 of the 12 integer benchmarks
runspec --noreportable --tune=base --size=ref --rate=$COPIES --iterations=1 \
  400 403 445 456 458 462 464 471 473 483
# Save the per-benchmark base results, then append the elapsed time
grep "_base" $CPATH/CINT*.ref.csv | cut -d, -f1-2 > $FN
grep "total seconds elapsed" $CPATH/CPU*.log | awk '/finished/ {print $9}' >>$FN
Estimated SPEC CPU2006 Integer
Est. SPECint  1CPU ratio  C.O.V.  RT (min)  NCPU ratio  C.O.V.  RT (min)
m3.xlarge     1.01        1.06%   54.39     2.24        1.15%   104.18
m3.2xlarge    1.01        1.67%   54.49     4.25        1.63%   109.22
m2.xlarge     0.76        1.97%   70.83     1.39        2.45%   85.37
m2.2xlarge    0.79        0.94%   68.85     2.76        1.24%   85.42
m2.4xlarge    0.78        0.16%   68.73     5.21        1.26%   89.91
c3.large      1.11        1.95%   50.00     1.25        1.47%   94.22
c3.xlarge     1.10        1.96%   50.29     2.39        1.28%   97.66
c3.2xlarge    1.08        0.87%   50.87     4.67        0.25%   100.22
cc2.8xlarge   1.00        0.29%   54.92     14.92       0.52%   125.74
Sysbench Overview
• Designed as quick system test of MySQL servers
• Test categories
– Fileio
– Cpu
– Memory
– Threads
– Mutex
– oltp
• Source code provided, must be built
• Very simplistic defaults – tuning recommended
Sysbench Script
COPIES=$(grep -c processor /proc/cpuinfo)
TDS=$(($COPIES * 2))    # two threads per logical CPU
STXT=sysbenchcpu.txt
DL=+
ID="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)"
TYPE="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-type)"
FN=$ID$DL$TYPE$DL$TDS$DL$STXT
sysbench --num-threads=$TDS --max-requests=30000 --test=cpu \
  --cpu-max-prime=100000 run > $FN
# Collect run times from all result files. The slide redirected this to $FN,
# which would overwrite the raw output; sbresults.txt is a name chosen here.
grep "total time:" i-*$STXT | cut -d, -f1-2 > sbresults.txt
Sysbench – CPU
sysbench     Default  C.O.V.  RT (min)  tuned ratio  C.O.V.  RT (min)
m3.xlarge    3.21     1.44%   0.06      1.69         1.29%   3.86
m3.2xlarge   6.41     1.38%   0.03      3.38         1.41%   1.93
m2.xlarge    1.59     0.75%   0.11      0.80         0.23%   8.16
m2.2xlarge   3.19     0.64%   0.06      1.60         0.76%   4.07
m2.4xlarge   8.83     0.62%   0.02      4.71         0.20%   1.38
c3.large     1.78     0.26%   0.10      0.91         0.09%   7.13
c3.xlarge    3.55     0.53%   0.05      1.83         0.02%   3.57
c3.2xlarge   6.55     8.45%   0.03      3.54         3.31%   1.85
cc2.8xlarge  25.34    2.30%   0.01      13.69        1.10%   0.48
Summary: CPU Comparison
             GB     GB Int  UB    UB Int  Est. SPECInt  sysbench default  sysbench tuned
m3.xlarge    2.04   2.01    2.49  1.88    2.24          3.21              1.69
m3.2xlarge   3.80   3.96    4.21  3.77    4.25          6.41              3.38
m2.xlarge    1.54   1.52    0.76  1.59    1.38          1.59              0.80
m2.2xlarge   2.82   3.02    1.23  3.19    2.76          3.19              1.60
m2.4xlarge   5.11   5.54    2.02  6.48    5.21          8.83              4.71
c3.large     1.32   1.37    1.91  1.10    1.25          1.78              0.91
c3.xlarge    2.51   2.72    2.85  2.20    2.39          3.55              1.83
c3.2xlarge   4.88   5.35    4.50  4.34    4.67          6.55              3.54
cc2.8xlarge  15.46  17.88   6.44  15.54   14.92         25.34             13.69
Scenario: Memory Instance Selection
• Application runs on premises
• Primary requirement: memory throughput of 20K MB/sec
• What instance would work best?
1. Choose a synthetic benchmark
2. Baseline: Build, configure, tune, and run it on premises
3. Run the same test (or tests) on a set of instance types
4. Use results from the instance tests to choose the best match
Testing Memory
• Choose a benchmark:
– stream, geekbench, sysbench (memory)
• How do you know when you have a good result?
• Tests run on 9 instance types
– Minimum of 10 instances launched
– Tests run a minimum of 3 times on each instance
– Ubuntu 13.04 base AMI
Stream* Overview
• Synthetic measure of sustainable memory bandwidth
– Published results at www.cs.virginia.edu/stream/top20/Bandwidth.html
– Must be built
– By default, runs 1 thread per cpu
– Use stream-scaling to automate array size and thread scaling
• https://github.com/gregs1104/stream-scaling
name    kernel                  bytes/iter  FLOPS/iter
COPY:   a(i) = b(i)             16          0
SCALE:  a(i) = q*b(i)           16          1
SUM:    a(i) = b(i) + c(i)      24          1
TRIAD:  a(i) = b(i) + q*c(i)    24          2
* McCalpin, John D.: "STREAM: Sustainable Memory Bandwidth in High Performance Computers",
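The bytes-per-iteration column in the kernel table is what turns a timed loop into a bandwidth figure: bytes per iteration times the array length, divided by elapsed time. The sketch below works that arithmetic for one timing, counting a megabyte as 10^6 bytes; the function name `stream_mbps` and the sample array length and time are assumptions for illustration, not values from the talk.

```shell
# stream_mbps BYTES_PER_ITER ARRAY_LENGTH SECONDS
# Prints the implied bandwidth in MB/sec (1 MB = 10^6 bytes).
# Illustrative helper; the STREAM program does this calculation itself.
stream_mbps() {
  awk -v bytes="$1" -v n="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", bytes * n / secs / 1000000 }'
}

# Example: TRIAD (24 bytes/iter) over a hypothetical 10,000,000-element
# array that takes 0.01 seconds
stream_mbps 24 10000000 0.01    # prints: 24000.00
```

Because the rate depends on the array length, an array small enough to sit in cache would report cache bandwidth rather than memory bandwidth, which is one reason to let a tool such as stream-scaling pick the size.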
Memory Scripts
# One thread per logical CPU
TDS=$(grep -c processor /proc/cpuinfo)
export OMP_NUM_THREADS=$TDS
MTXT=stream.txt
DL=+
ID="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)"
TYPE="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-type)"
FN=$ID$DL$TYPE$DL$TDS$DL$MTXT
# Keep only the thread count, the Triad result, and validation lines
./stream | egrep \
"Number of Threads requested|Function|Triad|Failed|Expected|Observed" > $FN
# Same naming scheme for the sysbench memory test
MTXT=sysbench-mem.txt
FN=$ID$DL$TYPE$DL$TDS$DL$MTXT
./sysbench --num-threads=$TDS --test=memory run >$FN
Memory Comparison
             Stream-Triad  Geekbench Memory-Triad  sysbench (default)
m3.xlarge    23640.56      15375.64                302.95
m3.2xlarge   26046.17      14999.27                603.40
m2.xlarge    18766.58      17365.76                528.16
m2.2xlarge   22421.91      17600.00                1019.08
m2.4xlarge   19634.50      14405.82                1576.30
c3.large     11434.83      9967.96                 2116.84
c3.xlarge    21141.30      13972.65                2643.33
c3.2xlarge   30235.78      20657.49                2944.91
cc2.8xlarge  55200.86      37067.32                1195.90
sysbench memory defaults
--memory-block-size [1K]
--memory-total-size [100G]
--memory-scope {global,local} [global]
--memory-hugetlb [off]
--memory-oper {read, write, none} [write]
--memory-access-mode {seq,rnd} [seq]
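With a stated requirement of 20K MB/sec, the final selection step is a threshold filter over one column of the comparison. The sketch below applies it to the Stream-Triad column, assuming those figures are in MB/sec (the table does not state units); the function name `meets_threshold` is hypothetical.

```shell
# meets_threshold MIN
# Reads "instance-type value" lines and prints those with value >= MIN.
# Illustrative helper only.
meets_threshold() {
  awk -v min="$1" 'NF >= 2 && $2 + 0 >= min + 0 { print $1, $2 }'
}

# Stream-Triad figures copied from the Memory Comparison table
meets_threshold 20000 <<'EOF'
m3.xlarge 23640.56
m3.2xlarge 26046.17
m2.xlarge 18766.58
m2.2xlarge 22421.91
m2.4xlarge 19634.50
c3.large 11434.83
c3.xlarge 21141.30
c3.2xlarge 30235.78
cc2.8xlarge 55200.86
EOF
```

Running the same filter over the Geekbench Memory-Triad column would pass fewer instance types, which is the practical reason to decide first which benchmark best represents the application.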
Testing Disk I/O
• Storage options:
– Amazon EBS
– Amazon EBS PIOPs
– Ephemeral
– hi1.4xlarge local storage
• I/O metrics
– IOPs
– Throughput
– Latency
• Test parameters:
– Read %
– Write %
– Sequential
– Random
– Queue depth
• Storage configuration
– Volume(s)
– RAID
– LVM
Benchmarking PIOPs
• Launch an Amazon EBS-optimized instance
• Create provisioned IOPS volumes
• Attach the volumes to Amazon EBS-optimized instance
• Pre-warm volumes
• Tune queue depth and latency against IOPs
[Chart: PIOPs 2K Queue Depth. Latency (usec), axis 0 to 1200, for Seq.Read, Seq.Write, MixedSeqRead, MixedSeqWrite, RandRead, RandWrite, MixedRandRead, MixedRandWrite. Series: 1D PIOPS 2K, 1D PIOPS 2KQD2, 2D PIOPS 2K, 2D PIOPS 2KQD2]
Testing Disk I/O Examples
[global]
clocksource=cpu
randrepeat=0
ioengine=libaio
direct=1
group_reporting
size=1G

[xvdd-fill]
filename=/data1/testfile1
refill_buffers
scramble_buffers=1
iodepth=4
rw=write
bs=2m
stonewall

[xvdd-1disk-write-1k-1]
time_based
ioscheduler=deadline
iodepth=1
rate_iops=4080
ramp_time=10
filename=/data1/testfile1
runtime=30
bs=1k
rw=write

• disk copy
– cp file1 /disk1/file1
• dd
– dd if=/dev/zero of=/data1/testfile1 bs=1048 count=1024000
• fio – flexible io tester
– fio simple.cfg
Summary Disk I/O
Command                                          Seconds  MB/sec
cp f1 f2                                         17.248   59.37
rm -rf f2; cp f1 f2                              .853     1200.47
cp f1 f3                                         .880     1164.96
dd if=/dev/zero bs=1048 count=1024000 of=d1      .722     1419.01
dd if=/dev/urandom bs=1048 count=1024000 of=d2   79.710   12.84
fio simple.cfg                                   NA       61.55
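The MB/sec column for the cp and dd rows is file size divided by elapsed seconds. The sketch below reproduces that arithmetic, assuming a file of roughly 1024 MB; that size is inferred from the table (seconds times MB/sec is close to 1024 in each row) and is not stated on the slide, and the function name `mb_per_sec` is hypothetical.

```shell
# mb_per_sec SIZE_MB SECONDS
# Prints throughput in MB/sec. Illustrative helper only.
mb_per_sec() {
  awk -v mb="$1" -v secs="$2" 'BEGIN {
    if (secs <= 0) { print "seconds must be positive"; exit 1 }
    printf "%.2f\n", mb / secs
  }'
}

mb_per_sec 1024 17.248    # first "cp f1 f2" row: prints 59.37
mb_per_sec 1024 .853      # repeated copy row:    prints 1200.47
```

The same file copied a second time comes out about twenty times faster, and the first cp is the only simple test that lands near the fio figure. That gap suggests the faster rows may be measuring something other than the disk, for example caching, so the arithmetic alone does not say what was measured.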
Beyond Simple Disk I/O
Random 1M I/O, PIOPs 16disk
        MBps
read    1006.73
write   904.03
r70w30  1005.91
Summary
If benchmarking your application is not practical, synthetic benchmarks can be used if you are careful.
• Choose the benchmark that best represents your application
• Analysis – what does “best” mean?
• Run enough tests to quantify variability
• Baseline – what is a “good result”?
• Samples – keep all of your results – more is better!