Advanced Micro Devices

NVMe Performance Testing and Optimization Application Note

Publication # 56163 Revision: 0.72 Issue Date: December 2017

© 2017 Advanced Micro Devices, Inc. All rights reserved.

The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.

Trademarks

AMD, the AMD Arrow logo, AMD EPYC, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Linux is a registered trademark of Linus Torvalds.

PCI and PCIe are registered trademarks of PCI SIG.


Contents

Introduction
AMD EPYC™ Processor Architecture
System Optimizations
FIO CPU Pinning
Test System
Test Setup
Results
    100% Read Test
    70% Read 30% Write Test
    30% Read 70% Write Test
    100% Write Test
Summary


List of Figures

Figure 1. AMD EPYC™ Processor Architecture
Figure 2. 100% Read IOPs
Figure 3. 100% Read Bandwidth
Figure 4. 100% Read CPU Utilization
Figure 5. 70% Read 30% Write IOPs
Figure 6. 70% Read 30% Write Bandwidth
Figure 7. 70% Read 30% Write CPU Utilization
Figure 8. 30% Read 70% Write IOPs
Figure 9. 30% Read 70% Write Bandwidth
Figure 10. 30% Read 70% Write CPU Utilization
Figure 11. 100% Write IOPs
Figure 12. 100% Write Bandwidth
Figure 13. 100% Write CPU Utilization



Revision History

Date              Revision   Description
December 2017     0.72       Initial public release; updated line “a.” in the System Optimizations section.
September 2017    0.71       Updated FIO CPU Pinning section; updated legend for Figures 4, 5, and 7.
August 2017       0.70       Initial NDA release.



Introduction

The AMD EPYC™ processor has more PCIe® lanes and NUMA nodes than a traditional processor, which can adversely affect synthetic IO test results. Achieving maximum performance in synthetic IO testing requires some optimizations. This application note describes the EPYC architecture and how to optimize IO for it.

AMD EPYC™ Processor Architecture

The AMD EPYC™ processor is functionally different from other CPUs on the market. It combines four dies into a single CPU. Each die contains two Core Compute Complexes (CCXs), and each CCX has four “Zen” cores that share a single L3 cache. The test system uses the AMD EPYC 7601 processor, which has 8 physical cores per die for a total of 32 cores. Communication between the dies is handled by the Infinity Fabric, a low-latency fabric that manages inter-die and inter-CCX communication. The AMD EPYC processor architecture provides higher performance, core count, and PCIe connectivity than traditional CPU architectures.

When doing synthetic disk testing, it is best to pin the IO to the die associated with the device. Linux® sees each die as a NUMA node, and each IO device is attached to one of these nodes. The optimizations in this paper can be applied to SATA, SAS, and NVMe drives.
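The core counts above follow from simple multiplication. The following is a minimal Python sketch of that arithmetic. It assumes two hardware threads per core, which matches the two PUs per core in the lstopo output later in this note; the class and field names are illustrative only and do not come from any AMD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpycTopology:
    """Illustrative model of one EPYC package (all names are hypothetical)."""
    dies: int = 4              # dies per processor
    ccx_per_die: int = 2       # core complexes per die
    cores_per_ccx: int = 4     # "Zen" cores sharing one L3 cache
    threads_per_core: int = 2  # assumption: two hardware threads per core

    @property
    def cores_per_die(self) -> int:
        return self.ccx_per_die * self.cores_per_ccx

    @property
    def total_cores(self) -> int:
        return self.dies * self.cores_per_die

    @property
    def total_threads(self) -> int:
        return self.total_cores * self.threads_per_core


epyc_7601 = EpycTopology()
print(epyc_7601.cores_per_die, epyc_7601.total_cores, epyc_7601.total_threads)
# prints: 8 32 64
```

With these defaults, each NUMA node exposes 16 logical CPUs, which is why keeping a workload on one die still leaves several cores available for IO jobs.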



Figure 1. AMD EPYC™ Processor Architecture

Figure 1 shows the separate NUMA nodes with their associated dies and their direct internal connectivity to the SATA, NVMe, and PCIe® devices installed in this test system. For example, PCI device 144d:a822 is a Samsung NVMe drive, and 22 of these drives are connected to PCIe root complexes in the platform. The testing below focuses on a single drive attached to die 0. Keeping the IO local to the die associated with the SATA drive, NVMe device, or PCIe device minimizes the use of memory attached to other dies. The Infinity Fabric provides a high-speed interconnect between the dies, but it is not as fast as die-local IO.



System Optimizations

The system optimizations that should be performed for synthetic disk testing are the standard optimizations for Linux®.

a. Load the latest kernel (4.13 or later), which provides patches that optimize IO.

b. Change the IO scheduler to NOOP.

• Edit /etc/default/grub, add “elevator=noop” to the GRUB_CMDLINE_LINUX_DEFAULT line, and then run update-grub.

c. Set the CPU governor to performance.

• Run this command from a prompt: cpufreq-set -c 0 -g performance (the -c option selects a single CPU, so repeat the command for each CPU number).

• If the cpufreq-set command is not available, install the cpufrequtils package.
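After a reboot, it can be useful to confirm that the scheduler and governor settings took effect. The following Python sketch reads the standard Linux sysfs files for a block device and a CPU. The function names, the default device name nvme0n1, and the `sysfs` parameter are assumptions made for this illustration. Note that on kernels where the drive uses the multi-queue block layer, the no-op choice is reported as none rather than noop.

```python
from pathlib import Path


def active_scheduler(scheduler_text: str) -> str:
    """Return the scheduler marked with [brackets] in the sysfs text."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some kernels list a single scheduler without brackets.
    return scheduler_text.strip()


def read_settings(device: str = "nvme0n1", cpu: int = 0, sysfs: str = "/sys") -> dict:
    """Read the active IO scheduler of a block device and one CPU's governor."""
    root = Path(sysfs)
    scheduler = (root / "block" / device / "queue" / "scheduler").read_text()
    governor = (
        root / "devices/system/cpu" / f"cpu{cpu}" / "cpufreq/scaling_governor"
    ).read_text()
    return {"scheduler": active_scheduler(scheduler), "governor": governor.strip()}


# Parsing example with a typical sysfs line (no hardware access needed):
print(active_scheduler("noop [deadline] cfq"))  # prints: deadline
```

On the test system, calling `read_settings()` would be expected to report the governor as performance for every CPU that was configured.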

As described previously, the AMD EPYC™ processor architecture is fundamentally different than prior CPU architectures. AMD has worked with the open source community to provide updates to the Linux kernel so that it is optimized to use the EPYC CPU to its full capabilities. A large amount of work has been done on the IRQBALANCE service, specifically around optimizations for data locality and core count. The latest version can be found at:

https://github.com/Irqbalance/irqbalance/pull/51/commits/a4b5781f28700eb624871ceccb1cfee7cd84bd93

If the IO is expected to be extremely high, it is best to pin the workload to the CPU cores that are directly connected to the IO device. A simple way to find these cores is to install the hwloc package on Linux, which contains lstopo. The lstopo command shows PCIe connectivity and helps visualize the NUMA nodes in the system.

The following shows the command and its output on the example platform. Only a single NUMA node is shown to keep the output short.

root@nvmetestsys1:~# lstopo-no-graphics
Machine (252GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0
      L3 L#0 (8192KB)
        L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#1)
        L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
          PU L#2 (P#2)
          PU L#3 (P#3)
        L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
          PU L#4 (P#4)
          PU L#5 (P#5)
        L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
          PU L#6 (P#6)
          PU L#7 (P#7)
      L3 L#1 (8192KB)
        L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
          PU L#8 (P#8)
          PU L#9 (P#9)
        L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
          PU L#10 (P#10)
          PU L#11 (P#11)
        L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
          PU L#12 (P#12)
          PU L#13 (P#13)
        L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
          PU L#14 (P#14)
          PU L#15 (P#15)
    HostBridge L#0
      PCIBridge
        PCI 144d:a822
      PCIBridge
        PCI 144d:a822
      PCIBridge
        PCIBridge
          PCI 1a03:2000
            GPU L#0 "card0"
            GPU L#1 "controlD64"
      PCIBridge
        PCI 144d:a822
      PCIBridge
        PCI 144d:a822
      PCIBridge
        PCI 144d:a822
      PCIBridge
        PCI 144d:a822
      PCIBridge
        PCI 1022:7901
          Block(Disk) L#2 "sda"
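The same device-to-CPU mapping can also be read from sysfs without parsing lstopo output. The sketch below is one possible approach, not part of the original test procedure. It assumes the NVMe controller is named nvme0, and it relies on the kernel-provided files `numa_node` (under the controller's PCI device) and `cpulist` (under each NUMA node). The function names and the `sysfs` parameter are hypothetical, and the parser handles only plain ranges and single CPU numbers.

```python
from pathlib import Path
from typing import List, Tuple


def parse_cpulist(text: str) -> List[int]:
    """Expand a kernel CPU list such as '0-3,8,10-11' into CPU numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def local_cpus(controller: str = "nvme0", sysfs: str = "/sys") -> Tuple[int, List[int]]:
    """Return (numa_node, cpus) for the PCI device behind an NVMe controller."""
    root = Path(sysfs)
    node = int((root / "class/nvme" / controller / "device/numa_node").read_text())
    if node < 0:
        # The kernel reports -1 when it has no NUMA affinity for the device.
        raise RuntimeError("no NUMA affinity reported for this device")
    cpulist = (root / "devices/system/node" / f"node{node}" / "cpulist").read_text()
    return node, parse_cpulist(cpulist)


print(parse_cpulist("0-3,8,10-11"))  # prints: [0, 1, 2, 3, 8, 10, 11]
```

For a drive attached to NUMA node 0 with the numbering shown above, the returned CPU list would be the candidates for the `cpus_allowed` setting in an FIO job.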



FIO CPU Pinning

FIO supports CPU pinning within the FIO workload file. An example follows:

[global]
name=4k random read 4 ios in the queue in 32 queues
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
size=100%
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10

[job1]
filename=/dev/nvme0n1
bs=4k
cpus_allowed=0

[job2]
filename=/dev/nvme0n1
bs=4k
cpus_allowed=2

[job3]
filename=/dev/nvme0n1
bs=4k
cpus_allowed=4

Job 1 is pinned to CPU 0 and sends IO to nvme0n1, which is directly attached to the die that contains CPU 0. To verify that pinning affects synthetic disk benchmark results, we ran the four corners of disk IO and checked for improvement.
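When several drives or CPU sets need to be tested, writing job files by hand becomes tedious. The following Python sketch generates a pinned job file in the same style as the example above. The function name and parameters are hypothetical, and only a subset of the global options is emitted to keep the example short.

```python
from typing import List


def pinned_job_file(device: str, cpus: List[int], block_size: str = "4k") -> str:
    """Build an FIO job file with one job per CPU, each pinned via cpus_allowed."""
    lines = [
        "[global]",
        "ioengine=libaio",
        "direct=1",
        "readwrite=randrw",
        "rwmixread=70",
        "iodepth=64",
        "runtime=30",
        "time_based",
        "ramp_time=10",
    ]
    for index, cpu in enumerate(cpus, start=1):
        lines += [
            "",
            f"[job{index}]",
            f"filename={device}",
            f"bs={block_size}",
            f"cpus_allowed={cpu}",
        ]
    return "\n".join(lines) + "\n"


# CPUs 0, 2 and 4 are the first PUs of three different cores in the
# lstopo numbering shown earlier, so each job runs on its own core.
print(pinned_job_file("/dev/nvme0n1", [0, 2, 4]))
```

Choosing one logical CPU per physical core, as the original example does, avoids placing two jobs on sibling hardware threads of the same core.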

Test System

System                    HPE CL3150
Memory                    256GB of installed memory
CPU                       AMD EPYC™ 7601 processor
NVMe Drive                Samsung PM1725a
OS                        Ubuntu 17.04
Optimizations to the OS   IO scheduler set to NOOP
                          CPU governor set to performance
                          Latest IRQBALANCE patches

Test Setup

The single-drive test was set up using FIO and focused on four scenarios:

1- 100% Read, pinned vs. unpinned IO
2- 70/30% Read/Write, pinned vs. unpinned IO
3- 30/70% Read/Write, pinned vs. unpinned IO
4- 100% Write, pinned vs. unpinned IO

The test was set up to verify that these optimizations improve the results of synthetic disk benchmarking.
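The four scenarios, each run pinned and unpinned across the IO depths shown on the result charts, form a small test matrix. The sketch below enumerates that matrix as FIO command lines. It is an illustration under stated assumptions, not the script used for the published results: the 4k random workload is taken from the example job file, an unpinned run is modeled as simply omitting `cpus_allowed`, and the function names are hypothetical.

```python
from itertools import product
from typing import List, Optional

READ_MIXES = [100, 70, 30, 0]          # percent reads in the four scenarios
IO_DEPTHS = [1, 4, 8, 16, 24, 32, 64]  # x-axis values of the result charts


def fio_command(read_pct: int, iodepth: int, pinned_cpu: Optional[int] = None,
                device: str = "/dev/nvme0n1") -> List[str]:
    """Build one FIO command line; pinned_cpu=None means an unpinned run."""
    mode = "unpinned" if pinned_cpu is None else "pinned"
    cmd = [
        "fio",
        f"--name=read{read_pct}_qd{iodepth}_{mode}",
        f"--filename={device}",  # WARNING: write tests destroy data on this device
        "--ioengine=libaio",
        "--direct=1",
        "--bs=4k",
        "--rw=randrw",
        f"--rwmixread={read_pct}",
        f"--iodepth={iodepth}",
        "--runtime=30",
        "--time_based",
        "--ramp_time=10",
        "--output-format=json",
    ]
    if pinned_cpu is not None:
        cmd.append(f"--cpus_allowed={pinned_cpu}")
    return cmd


def build_matrix(pinned_cpu: int = 0) -> List[List[str]]:
    """Return every (mix, depth, pinned/unpinned) combination as a command."""
    return [
        fio_command(mix, depth, cpu)
        for mix, depth, cpu in product(READ_MIXES, IO_DEPTHS, [pinned_cpu, None])
    ]


print(len(build_matrix()))  # prints: 56
```

Each command list can be passed to `subprocess.run` on a test machine; the JSON output format makes the results easy to compare afterwards.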

Results

100% Read Test

Figure 2, Figure 3, and Figure 4 show the results of the 100% Read test.



Figure 2. 100% Read IOPs

[Chart not reproduced. Title: 100% Read IOPs. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 900,000. Series: Read IOPs Pinned, Read IOPs Unpinned.]

Figure 3. 100% Read Bandwidth

[Chart not reproduced. Title: 100% Read Bandwidth. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 3,500,000. Series: BW Pinned, BW Unpinned.]



Figure 4. 100% Read CPU Utilization

[Chart not reproduced. Title: 100% Read CPU Utilization. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 80. Series: USR CPU Pinned, USR CPU Unpinned, SYS CPU Pinned, SYS CPU Unpinned.]

The 100% Read test shows a significant improvement for pinned IO over unpinned IO. With pinning, the CPU spent less time in SYS (kernel) space, which means the kernel handled IO more efficiently. As a result, IOPs increased and the drive performed at its full capability.
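A comparison like the one above can be computed from FIO's JSON output, which reports per-job IOPs along with `usr_cpu` and `sys_cpu` percentages. The following Python sketch summarizes two runs and reports the relative change. The function names are hypothetical, and the sample numbers are invented placeholders for illustration only; they are not measurements from this note.

```python
import json
from typing import Dict


def summarize(fio_json: str) -> Dict[str, float]:
    """Total IOPs and average CPU percentages across all jobs of one FIO run."""
    jobs = json.loads(fio_json)["jobs"]
    return {
        "read_iops": sum(job["read"]["iops"] for job in jobs),
        "write_iops": sum(job["write"]["iops"] for job in jobs),
        "usr_cpu": sum(job["usr_cpu"] for job in jobs) / len(jobs),
        "sys_cpu": sum(job["sys_cpu"] for job in jobs) / len(jobs),
    }


def percent_change(pinned: float, unpinned: float) -> float:
    """Relative change of the pinned value against the unpinned baseline."""
    return (pinned - unpinned) * 100.0 / unpinned


# Invented placeholder results, NOT data from the tests in this note.
pinned_run = json.dumps({"jobs": [
    {"read": {"iops": 600000.0}, "write": {"iops": 0.0},
     "usr_cpu": 20.0, "sys_cpu": 30.0}]})
unpinned_run = json.dumps({"jobs": [
    {"read": {"iops": 500000.0}, "write": {"iops": 0.0},
     "usr_cpu": 20.0, "sys_cpu": 40.0}]})

pinned, unpinned = summarize(pinned_run), summarize(unpinned_run)
for key in ("read_iops", "sys_cpu"):
    change = percent_change(pinned[key], unpinned[key])
    print(f"{key}: {change:+.1f}% pinned vs. unpinned")
```

A positive change in IOPs together with a negative change in `sys_cpu` corresponds to the pattern described for the 100% Read test.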



70% Read 30% Write Test

Figure 5, Figure 6, and Figure 7 show the results of the 70% Read 30% Write test.

Figure 5. 70% Read 30% Write IOPs

[Chart not reproduced. Title: 70% Read 30% Write IOPs. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 250,000. Series: Read IOPs Pinned, Read IOPs Unpinned, Write IOPs Pinned, Write IOPs Unpinned.]

Figure 6. 70% Read 30% Write Bandwidth

[Chart not reproduced. Title: 70% Read 30% Write Bandwidth. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 1,400,000. Series: Bandwidth Pinned, Bandwidth Unpinned.]



Figure 7. 70% Read 30% Write CPU Utilization

[Chart not reproduced. Title: 70% Read 30% Write CPU Utilization. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 60.]

The mixed read/write tests benefit less from pinning in terms of IOPs, but pinning lets the CPU work more efficiently than unpinned IO, as the CPU utilization graph shows. IOPs were slightly lower with pinned IO, but the CPU performed more efficiently in this workload.




30% Read 70% Write Test

Figure 8, Figure 9, and Figure 10 show the results of the 30% Read 70% Write test.

Figure 8. 30% Read 70% Write IOPs

[Chart not reproduced. Title: 30% Read 70% Write IOPs. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 160,000. Series: Read IOPs Pinned, Read IOPs Unpinned, Write IOPs Pinned, Write IOPs Unpinned.]



Figure 9. 30% Read 70% Write Bandwidth

[Chart not reproduced. Title: 30% Read 70% Write Bandwidth. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 1,000,000.]

Figure 10. 30% Read 70% Write CPU Utilization

[Chart not reproduced. Title: 30% Read 70% Write CPU Utilization. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 40.]




The mixed read/write tests benefit less from pinning in terms of IOPs, but pinning lets the CPU work more efficiently than unpinned IO, as the CPU utilization graph shows. IOPs were slightly lower with pinned IO, but the CPU performed more efficiently in this workload.

100% Write Test

Figure 11, Figure 12, and Figure 13 show the results of the 100% Write test.

Figure 11. 100% Write IOPs

[Chart not reproduced. Title: 100% Write IOPs. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 250,000. Series: Write IOPs Pinned, Write IOPs Unpinned.]



Figure 12. 100% Write Bandwidth

[Chart not reproduced. Title: 100% Write Bandwidth. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 900,000.]

Figure 13. 100% Write CPU Utilization

[Chart not reproduced. Title: 100% Write CPU Utilization. X-axis: IO Depth (1, 4, 8, 16, 24, 32, 64). Y-axis range: 0 to 35.]




Pinned IO performed better than unpinned IO in this synthetic test.

Summary

The optimizations increase performance and allow the CPU to perform at its potential. The AMD EPYC™ processor performs better when the IO is localized to the attached CPU. For synthetic testing, the best options are to run the latest IRQBALANCE patches and to pin the IO to the CPUs local to the device.
