Steve Shaw, Intel Database Technology Manager
2
Agenda
• Introduction
• HammerDB Introduction
• OS Configuration Essentials and Tools: CPU, Memory, I/O
• OLTP workload customer example
• Analysing Results and Price/Performance
• Scaling and Clustering
• Development and Directions
• Summary
3
Introduction
• Steve Shaw – Database Tech Manager Intel
• Co-authored 2 books on Oracle
• Has worked with multiple commercial and open source databases, all on Intel, since Dynix/ptx on the Pentium Pro (200 MHz, 1 MB cache)
• Specialized in scaling up and scaling out
• HammerDB is a GPL, employer-approved open source project developed under the Intel Linux User Group program
4
5
What is HammerDB?
"industry standard database benchmarking tool"*
http://www.hammerdb.com/benchmarks.html
Database                              License      Test Results     Interface    Library
Oracle/TimesTen                       Commercial   Restricted       Oracle OCI   Oratcl
MS SQL Server                         Commercial   Restricted       ODBC         TclODBC
IBM DB2                               Commercial   Restricted       DB2 CLI      Db2tcl
MariaDB/MySQL (Amazon Aurora)         Open Source  Free to publish  MySQL C API  MySQLtcl
PostgreSQL (EnterpriseDB/Greenplum/
  Amazon Redshift)                    Open Source  Free to publish  libpq        Pgtclng
Redis                                 Open Source  Free to publish  TCL client   In-built (Retcl planned)
Trafodion (SQL on Hadoop)             Open Source  Free to publish  ODBC         TDBC

Workloads: OLTP (from TPC-C) and OLAP (from TPC-H)
6
Databases, Licenses and Workloads
• HammerDB is GPL, as are all extensions, compiled using gcc on Linux and MS Visual C/C++ on Windows
• Third party database libraries are required in the library path
7
How benchmarking has changed
• Old method:
  • Submit official audited benchmarks
  • Example cost for 1 official benchmark: $4,483,729
  • 'Bare Metal' environment
  • Last official TPC-C benchmark in 2014; 1 current
• New method:
  • Enable people to run their own benchmarks
  • Inbuilt OLTP (TPC-C) and OLAP (TPC-H) workloads
  • Zero cost
  • Bare metal + cloud, virtualization, containers, control groups
  • Share your results online
OLTP and OLAP workloads

[Figure: TPC-C OLTP database serving Business Operations with OLTP transactions; TPC-H DSS (Decision Support System) database at 100GB, 300GB, 1TB, 3TB, 10TB serving Business Analysis with DSS queries for Decision Makers. Source: www.tpc.org]
• OLAP, based on the TPC-H specification: complex analytic queries that favour parallel query engines and column-store databases
• OLTP, based on the TPC-C specification: transactional throughput, a complex workload with deliberate contention to test the scalability of the database engine
NOTE: HammerDB does not implement a full TPC-C or TPC-H workload or use any of the terminology (e.g. tpmC, QphH) to imply that the workloads do.
Scalable Schema Configurations
OLTP/TPC-C OLAP/TPC-H
10
OLTP HammerDB Scalability
• Used in Intel database testing in multiple groups, e.g. to generate database performance data for forthcoming processors (more than 50 data points)
  – CPUs are required to pass multiple performance tests; HammerDB is one of these
  – Scalability: proven alignment with TPC-C over CPU generations at a fraction of the cost
• Example: 'Skylake'
  – Generation of performance data for multiple SKUs
  – Review and approval
  – Known product performance
  – Not public due to commercial database licensing, known as the 'DeWitt clause'
[Charts: TPM vs CPU generations; TPM vs CPU generations and SKUs]
2. Up to 5x claim based on OLTP Warehouse workload: 1-Node, 4 x Intel® Xeon® Processor E7-4870. Source: Request Number: 56, Benchmark: HammerDB, Score: 2.46322e+006 (higher is better) vs. 1-Node, 4 x Intel® Xeon® Platinum 8180 Processor
11
Example: Skylake Launch
“With up to 28 of the highest-performance
cores, the all-new Intel Xeon Scalable
platform can support up to 5x more
transactions per second2 than 4-year-old systems”
https://newsroom.intel.com/editorials/intel-xeon-scalable-processor-family-data-center/
‘Relative’ rather than ‘Absolute’ performance data at product launches
12
OLTP Comparing Results: TPM and NOPM
• HammerDB produces 2 results: TPM and NOPM
• TPM is database specific, i.e. TPM cannot be compared between different databases (apart from relative scaling)
• NOPM is schema/workload specific, i.e. NOPM can be compared between different databases
• HammerDB uses both: TPM is a lightweight database statistic to gather and monitor; NOPM may impact the schema, so it is used minimally
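The TPM/NOPM distinction can be made concrete in a short sketch; the run names and numbers below are illustrative, not real HammerDB output:

```python
# Hypothetical results for illustration; names and numbers are mine, not HammerDB output
run_old = {"db": "MariaDB", "tpm": 1_200_000, "nopm": 400_000}
run_new = {"db": "MariaDB", "tpm": 1_512_000, "nopm": 504_000}   # same database, newer CPU
run_other = {"db": "CommercialDB", "tpm": 3_000_000, "nopm": 480_000}

# TPM: only meaningful as relative scaling within the SAME database
scaling = run_new["tpm"] / run_old["tpm"]
print(f"MariaDB scaled {scaling:.2f}x across CPU generations")  # 1.26x

# NOPM: comparable across DIFFERENT databases running the same schema/workload
faster = max((run_new, run_other), key=lambda r: r["nopm"])
print(f"{faster['db']} leads on NOPM")
```

Note that the commercial database's much higher TPM says nothing against MariaDB's NOPM, which is the cross-database figure.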
13
HammerDB/Commercial and Sysbench/MariaDB
[Chart: HammerDB v3 to v4 (Commercial), E5-2697 v3 to E5-2697 v4: 1.26X; HammerDB v4 to Skylake (Commercial), E5-2699 v4 to Intel® Xeon® Platinum 8180: 1.70X]
• Similar scaling observed across databases
• HammerDB has more complexity and contention and supports OLTP and OLAP
14
Why HammerDB is written in TCL (and not Python)
• TCL threading model – user sees 1 process per application
• TCL: one interpreter per thread = high performance, high scalability, stability
• Python restricted by the GIL ('global interpreter lock') = one thread executing at a time
[Diagram: one thread, 1 interpreter with a low-level API; database driven to 100%; users check a TSV to see if the stop button is pressed, or threads can be killed; tens of millions of transactions per minute]
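A minimal sketch of the GIL effect mentioned above (my illustration, not from the deck): CPU-bound work split across two Python threads takes roughly as long as running it serially, because only one thread executes bytecode at a time.

```python
import threading
import time

def busy(n, out, i):
    # CPU-bound loop: holds the GIL while computing
    total = 0
    for k in range(n):
        total += k * k
    out[i] = total

N = 1_000_000

# Serial baseline: one call after the other
t0 = time.perf_counter()
res = [0, 0]
busy(N, res, 0)
busy(N, res, 1)
serial = time.perf_counter() - t0

# Two threads: the GIL still lets only one run Python bytecode at a time
t0 = time.perf_counter()
res2 = [0, 0]
threads = [threading.Thread(target=busy, args=(N, res2, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print(res == res2)  # → True; wall time is typically similar to serial, not halved
```

TCL sidesteps this with one interpreter per thread, which is why HammerDB scales to many concurrent virtual users in one process.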
15
Commercial Tool Comparison
• TCL based open source application delivers higher performance at lower CPU utilisation than leading commercial tool
16
BIOS Settings
17
• Optimal BIOS settings are essential to performance
• And testing essential for optimization
• Beware of ‘Maximum Performance’ set and forget
Visit ark.intel.com
CPU

cat /proc/cpuinfo | grep -i intel
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
18
intel@purley1:~$ sudo turbostat --debug
turbostat version 4.16 24 Dec 2016 Len Brown <[email protected]>
10 * 100 = 1000 MHz max efficiency frequency
25 * 100 = 2500 MHz base frequency
cpu106: MSR_IA32_POWER_CTL: 0x29240059
cpu106: MSR_TURBO_RATIO_LIMIT: 0x2021232323232426
32 * 100 = 3200 MHz max turbo 8 active cores
33 * 100 = 3300 MHz max turbo 7 active cores
35 * 100 = 3500 MHz max turbo 6 active cores
35 * 100 = 3500 MHz max turbo 5 active cores
35 * 100 = 3500 MHz max turbo 4 active cores
35 * 100 = 3500 MHz max turbo 3 active cores
36 * 100 = 3600 MHz max turbo 2 active cores
38 * 100 = 3800 MHz max turbo 1 active cores
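The turbostat lines above encode the turbo ratio limit per active-core count; a small sketch (helper name and sample are mine) parses them into an active-cores → MHz table:

```python
import re

def parse_turbo(lines):
    """Map active-core count -> max turbo MHz from turbostat --debug output."""
    pat = re.compile(r"(\d+) \* 100 = (\d+) MHz max turbo (\d+) active cores")
    table = {}
    for line in lines:
        m = pat.search(line)
        if m:
            table[int(m.group(3))] = int(m.group(2))
    return table

sample = """\
32 * 100 = 3200 MHz max turbo 8 active cores
36 * 100 = 3600 MHz max turbo 2 active cores
38 * 100 = 3800 MHz max turbo 1 active cores
"""
print(parse_turbo(sample.splitlines()))  # {8: 3200, 2: 3600, 1: 3800}
```

The table makes the pattern on the slide explicit: the fewer cores are active, the higher the turbo frequency each may reach.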
Intel® Turbo Boost Technology
Increases performance by increasing processor frequency, enabling faster speeds when conditions allow.
[Figure: frequency by active cores – Normal: all cores operate at rated frequency; 8C Turbo: all cores operate at a higher frequency; <8C Turbo: fewer cores may operate at even higher frequencies. Higher performance on demand.]
Intel® Turbo Boost Technology (turbostat)
cpupower frequency (P-States)
20
intel@purley1:~$ cpupower frequency-set --governor=performance
intel@purley1:~$ sudo cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1000 MHz - 3.80 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1000 MHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 2.83 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
cpupower idle (C-States)
intel@purley1:~$ cpupower idle-set --enable-all
intel@purley1:~$ sudo cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:
Number of idle states: 4
Available idle states: POLL C1-SKX C1E-SKX C6-SKX
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 2280
Duration: 1822287
21
Energy Performance Bias
MSR_IA32_ENERGY_PERF_BIAS
0 = high performance 6 = balanced 15 = low power
Red Hat 5 defaulted to ‘performance’
Red Hat 7 set this MSR to ‘balanced’
To return to performance setting :
intel@purley1:~$ sudo x86_energy_perf_policy -v performance
CPUID.06H.ECX: 0x9
cpu0 msr0x1b0 0x0000000000000006 -> 0x0000000000000000
cpu1 msr0x1b0 0x0000000000000006 -> 0x0000000000000000
22
Hyper-Threading
./cpu_topology64.out
Software visible enumeration in the system:
Number of logical processors visible to the OS: 112
Number of logical processors visible to this process: 112
Number of processor cores visible to this process: 56
Number of physical packages visible to this process: 2
Memory

dmidecode | more
Base Board Information
Manufacturer: Intel Corporation
Product Name: S2600WFD
…
Memory Device
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Bank Locator: NODE 1
Type: DDR4
Speed: 2666 MHz
Manufacturer: Micron
Configured Clock Speed: 2666 MHz
ark.intel.com
NUMA and Memory Latency
intel@purley1:~/cputools/Linux$ sudo ./mlc
Intel(R) Memory Latency Checker - v3.4
Measuring idle latencies (in ns)...
                Numa node
Numa node       0        1
    0        72.5    135.0
    1       132.4     70.5
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is
enabled
Using traffic with the following read-write ratios
ALL Reads : 226259.4
3:1 Reads-Writes : 207531.0
2:1 Reads-Writes : 204783.2
1:1 Reads-Writes : 188107.7
Stream-triad like: 182377.5
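The remote-to-local NUMA penalty follows directly from the latency matrix above; the sketch below (my arithmetic check, numbers copied from the mlc output) works it out:

```python
# Idle latencies (ns) copied from the mlc matrix above: (requesting node, memory node)
lat = {(0, 0): 72.5, (0, 1): 135.0, (1, 0): 132.4, (1, 1): 70.5}

local = (lat[(0, 0)] + lat[(1, 1)]) / 2   # average local-node latency
remote = (lat[(0, 1)] + lat[(1, 0)]) / 2  # average remote-node latency
penalty = remote / local
print(f"remote memory access costs ~{penalty:.2f}x local latency")  # ~1.87x
```

An almost 2x penalty for remote access is why NUMA-aware placement of database memory matters for the workloads in this deck.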
Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
26
I/O: Intel® Optane™ SSD DC P4800X

                    With all-NAND SSDs                With Intel® Optane™ SSD
Server              2 x Intel® Xeon® E5               2 x Intel® Xeon® E5
Database Files      1 x Intel® SSD DC P3700 Series    1 x Intel® Optane™ SSD DC P4800X
TPS                 1395                              16480
Latency             ~11ms @ 99%                       ~10ms @ 99%
$/transaction       ~$10.09                           ~$0.90

Up to 10x more transactions per second (TPS at the same latency level)1
Up to 91% lower cost per transaction1

1. System configuration: Server Intel® Server System R2208WT2YS, 2x Intel® Xeon® E5 2699v4, 384 GB DDR4 DRAM, boot drive: 1x Intel® SSD DC S3710 Series (400 GB), database drives: 1x Intel® SSD DC P3700 Series (400 GB) and 1x Intel® SSD DC P4800X Series (140 GB prototype), CentOS 7.2, MySQL Server 5.7.14, Sysbench 0.5 configured for 70/30 Read/Write OLTP transaction split using a 100GB database. Cost per transaction determined by total MSRP for each configuration divided by the transactions per second.
*Other names and brands may be claimed as the property of others
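The cost-per-transaction claim follows from the footnote's method (total MSRP ÷ TPS); a quick check of the quoted figures (my arithmetic, not from the deck) reproduces the ~91% saving:

```python
# Figures from the slide: all-NAND (P3700) vs Optane (P4800X) configuration
nand_tps, optane_tps = 1395, 16480
nand_cost, optane_cost = 10.09, 0.90   # ~$ per transaction (total MSRP / TPS)

saving = 1 - optane_cost / nand_cost
print(f"{saving:.0%} lower cost per transaction")     # 91%
print(f"{optane_tps / nand_tps:.1f}x raw TPS ratio")  # slide's 'up to 10x' is at matched latency
```

The raw TPS ratio is about 11.8x; the headline "up to 10x" is the more conservative figure at the same 99th-percentile latency level.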
27
28
Choose Database
29
Schema Creation
• Choose Schema Options
• Select Build
• Multi-threaded schema build
• Option for flat-file data generation for cloud
30
Build Complete
• Schema of chosen size has built successfully
31
Timed Workload
• Test and Timed Workloads
32
Virtual Users
• Configure Virtual Users
• For timed workloads one Virtual User is a monitor
33
Transaction Counter
• Transaction Counter should be as ‘flat’ as possible
• Peaks and Troughs indicate configuration errors
34
Test Complete
• Monitor Virtual User shows average TPM and NOPM over the test
• TPM will be slightly lower than Transaction Counter due to longer sample interval
35
Autopilot
• Completely Automated and Unattended performance test
• Provide sequence of Virtual Users and leave to run
36
37
Analysing Results
[Chart: Performance Profile – NOPM (0-600000) vs Virtual Users (0-100), MariaDB v3 vs MariaDB v4]
[Chart: MariaDB Peak Performance – NOPM, MariaDB v3 vs MariaDB v4: 1.26X]
MariaDB v3: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
MariaDB v4: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
38
Commercial Comparison
[Chart: MariaDB Peak Performance – NOPM (0-600000), MariaDB v3 vs MariaDB v4]
• NOPM results enable comparison of MariaDB results to the commercial database results archive
39
Example Price / Performance
[Chart: Database Software 3 year TCO (USD $, 0-1200000) – Commercial vs MariaDB]
[Chart: Cost per Transaction (Price / NOPM, 0-1.4) – Commercial vs MariaDB]
• Calculate License Cost + Support for a 3 year period
• Divide Cost by Performance for Cost per Transaction
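The two steps above reduce to a single division; a sketch with made-up illustrative prices (the deck does not publish the actual figures):

```python
def cost_per_transaction(license_cost, annual_support, nopm, years=3):
    """3-year TCO (license + support) divided by measured NOPM, per the slide's method."""
    tco = license_cost + annual_support * years
    return tco / nopm

# Hypothetical prices for illustration only
print(cost_per_transaction(license_cost=100_000, annual_support=20_000, nopm=500_000))  # 0.32
```

Because NOPM is comparable across databases, this $/NOPM figure is what allows the Commercial-vs-MariaDB bar chart above.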
40
41
Skylake InnoDB Optimisation
[Chart: MariaDB OLTP Database Performance – NOPM (0-700000) vs Virtual Users (0-90), MariaDB Skylake InnoDB Update vs MariaDB Skylake DEFAULT]
• Non-contention workloads do not highlight the performance impact
• Optimize the InnoDB spinlock on Skylake to reduce contention and increase throughput: 1.43X
SELECT FOR UPDATE Locking Overhead

MariaDB / MySQL:
  SELECT d_next_o_id, d_tax INTO no_d_next_o_id, no_d_tax
  FROM district
  WHERE d_id = no_d_id AND d_w_id = no_w_id FOR UPDATE;
  UPDATE district SET d_next_o_id = d_next_o_id + 1
  WHERE d_id = no_d_id AND d_w_id = no_w_id;
  SET o_id = no_d_next_o_id;

Oracle:
  UPDATE district SET d_next_o_id = d_next_o_id + 1
  WHERE d_id = no_d_id AND d_w_id = no_w_id
  RETURNING d_next_o_id, d_tax INTO no_d_next_o_id, no_d_tax;
  o_id := no_d_next_o_id;

DB2:
  SELECT d_next_o_id, d_tax INTO no_d_next_o_id, no_d_tax
  FROM OLD TABLE ( UPDATE district
    SET d_next_o_id = d_next_o_id + 1
    WHERE d_id = no_d_id
    AND d_w_id = no_w_id );
  SET o_id = no_d_next_o_id;

MS SQL Server:
  UPDATE dbo.district
  SET @no_d_tax = d_tax
    , @o_id = d_next_o_id
    , d_next_o_id = district.d_next_o_id + 1
  WHERE district.d_id = @no_d_id
  AND district.d_w_id = @no_w_id
  SET @no_d_next_o_id = @o_id+1
MariaDB Galera Cluster on Intel® SSD DC P3700 Series

Clustering Performance
[Chart: Stored Procedure Peak Throughput Response Times (microseconds, 0-180000) for NEWORD, PAYMENT, DELIVERY, SLEV, OSTAT – MariaDB at 363285 NOPM vs Galera at 72489 NOPM]
• Replication impact on performance
• High penalty of stored procedures with 'Delete' transactions
45
MyRocks Potential
• Optane and MyRocks show great potential to better use NVM
• > 460,000 HammerDB NOPM in initial testing
Future: Intel® DIMMs based on 3D XPoint™ memory media
[Diagram: Intel® Xeon® E5 with DDR-attached DRAM 'memory pool' and PCIe-attached Intel® Optane™ SSDs and Intel® 3D NAND SSDs; OS paging extended via Intel® Memory Drive Technology]
FPGA: Intel Arria 10 for Database
• Up to 80% reduction in power consumption (vs. Intel® Xeon®)
• Real-time, inline processing of streaming data without buffering
• Power efficiency, high throughput, low latency
• Acceleration of targeted algorithms, for example:
  • Compression
48
HammerDB v3.0
• Multiple requests to support additional databases, e.g. SAP HANA, SQLite, Cassandra, MongoDB, Tibero, Cubrid, Linter and TPC-E, TPC-DS
• Refactoring underway to make adding databases easier
• Now XML <-> Dict driven, so a database can be added with an XML file plus build and driver scripts
Moore’s Law
49
Hi-K Metal Gate
Strained Silicon
3D Transistors
90 nm 65 nm 45 nm 32 nm 22 nm 14 nm 10 nm 7 nm
Enabling new devices with higher functionality and complexity while
controlling power, cost, and size
50
Executing to Moore’s Law