An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM •...

58
1 An Empirical Guide to the Behavior and Use of Scalable Persistent Memory Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, Steven Swanson Non-Volatile Systems Laboratory Department of Computer Science & Engineering University of California, San Diego Department of Electrical, Computer & Energy Engineering University of Colorado, Boulder

Transcript of An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM •...

Page 1: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

1

An Empirical Guide to the Behavior and Use of Scalable Persistent Memory

Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, Steven Swanson

Non-Volatile Systems LaboratoryDepartment of Computer Science & Engineering

University of California, San Diego

Department of Electrical, Computer & Energy EngineeringUniversity of Colorado, Boulder

Page 2: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

2

Optane DIMMs are Here!

Page 3: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

3

Optane DIMMs are Here!

Cool.

Page 4: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

4

Optane DIMMs are Here!

• Not Just Slow Dense DRAMTM

Page 5: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

5

Optane DIMMs are Here!

• Not Just Slow Dense DRAMTM

• Slower media

-> More complex architecture

-> Second-order performance anomalies

-> Fundamentally different

Page 6: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

6

Outline

• Background

• Basics: Optane DIMM Performance

• Lessons: Optane DIMM Best Practices– *not included: Emulation Study, Best Practices in Macrobenchmarks (see the paper)

• Conclusion

Page 7: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

7

Background

Page 8: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

8

Background: Optane in the Machine

Core CoreCore

Page 9: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

9

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

Core

L1/2

Page 10: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

10

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

Page 11: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

11

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

DRAM

DRAM

DRAM

DRAM

DRAM

DRAM

Page 12: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

12

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Page 13: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

13

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

AppDirectMode

Core

L1/2

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Page 14: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

14

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Core

L1/2

AppDirectMode

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Far MemoryNear Memory

(Direct Mapped Cache)

Memory Mode

Page 15: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

15

Background: Optane in the Machine

Core Core

L1/2 L1/2

L3

iMC iMC

Optane DIMM

Optane DIMM

Optane DIMM

DRAM

DRAM

DRAM

Far MemoryNear Memory

(Direct Mapped Cache)

Memory Mode Core

L1/2

AppDirectMode

DRAM

DRAM

DRAM

Optane DIMM

Optane DIMM

Optane DIMM

Page 16: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

16

Background: Optane Internals

$

MC

Optane

DRAM

Optane DIMM

Page 17: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

17

Background: Optane Internals

$

MC

Optane

DRAM

WPQ

iMC

Optane DIMM

From CPU

Page 18: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

18

Background: Optane Internals

$

MC DRAM

Optane Controller

WPQ

iMC

Optane DIMM

From CPU

Optane

Page 19: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

19

Background: Optane Internals

$

MC DRAM

Optane Media

Optane Controller

WPQ

iMC

Optane DIMM

From CPU

256B Block

Optane

Page 20: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

20

Background: Optane Internals

$

MC DRAM

Optane Media

Optane ControllerOptaneBuffer

WPQ

iMC

Optane DIMM

From CPU

256B Block

Optane

Page 21: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

21

Background: Optane Internals

$

MC DRAM

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

Optane

Page 22: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

22

Background: Optane Internals

$

MC DRAM

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

ADR (Asynchronous DRAM Refresh)

Optane

Page 23: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

23

Background: Optane Interleaving

• Optane is interleaved across NVDIMMs at 4KB granularity.

Optane4

Optane5

Optane6

Optane1

Optane2

Optane3

Page 24: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

24

Background: Optane Interleaving

• Optane is interleaved across NVDIMMs at 4KB granularity.

Optane4

Optane5

Optane6

Optane1

Optane2

Optane3

4 5 61 2 3

Phys. Addr: 4KB 12KB 20KB0 8KB 16KB

1

24KB

Page 25: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

25

Basics: How does Optane DC perform?

Page 26: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

26

Basics: Our Approach

• Microbenchmark sweeps across state space– Access Patterns (random, sequential)

– Operations (read, ntstore, clflush, etc.)

– Access/Stride Size

– Power Budget

– NUMA Configuration

– Address Space Interleaving

• Targeted experiments

• Total: 10,000 experiments– https://github.com/NVSL/OptaneStudy

Page 27: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

27

Test Platform

• CPU– Intel Cascade Lake, 24 cores at 2.2 GHz in 2 sockets

– Hyperthreading off

• DRAM– 32 GB Micron DDR4 2666 MHz

– 384 GB across 2 sockets w/ 6 channels

• Optane– 256 GB Intel Optane 2666 MHz QS

– 3 TB across 2 sockets w/ 6 channels

• OS– Fedora 27, 4.13.0

Page 28: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

28

Basics: Latency

• 2x -3x as slow as DRAM

• Write latency masked by ADR

Page 29: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

30

Basics: Bandwidth

• ~Scalable reads, Non-scalable writes

Page 30: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

31

Basics: Bandwidth

• Access size matters

Page 31: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

32

Basics: Bandwidth

• A mystery!

*answer in ~10 minutes

Page 32: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

35

Lessons: What are Optane Best Practices?

Page 33: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

36

Lessons: What are Optane Best Practices?

• Avoid small random accesses

• Use ntstores for large writes

• Limit threads accessing one NVDIMM

• Avoid mixed and multi-threaded NUMA accesses

Page 34: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

37

Lesson #1: Avoid small random accesses

Page 35: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

38

Lesson #1: Avoid small random accesses

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

$

MC

Optane

DRAM

Page 36: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

39

Lesson #1: Avoid small random accesses

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

$

MC

Optane

DRAM

Page 37: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

40

Lesson #1: Avoid small random accesses

Optane Media

Optane ControllerOptaneBuffer

AIT Cache

AIT

WPQ

iMC

Optane DIMM

From CPU

256B Block

$

MC

Optane

DRAM

Page 38: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

41

Lessons: Optane Buffer Size

• Write amplification if working set is larger than Optane Buffer

Page 39: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

43

Lesson #1: Avoid small random accesses

• Bad bandwidth with:

– Small random writes (<256B)

– Not tiny working set / NVDIMM (>16KB)

• Good bandwidth with

– Sequential accesses

Page 40: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

44

Lesson #2: Use ntstores for large writes

Page 41: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

45

Lessons: Store instructions

ntstore store + clwb store

$

MC

Optane

DRAM

1

2

3

1

2

3

Page 42: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

46

Lessons: Store instructions

ntstore store + clwb store

$

MC

Optane

DRAM

1

2

3

1

2

3

*lost bandwidth *lost locality

Page 43: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

47

Lesson #2: Use ntstores for large writes

Page 44: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

48

Lesson #2: Use ntstores for large writes

• Non-temporal stores bypass the cache

– Avoid cache-line read

– Maintain locality

Page 45: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

49

Lesson #3: Limit threads accessing one NVDIMM

Page 46: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

50

Lesson #3: Limit threads accessing one NVDIMM

• Contention at Optane Buffer

• Contention at iMC

Page 47: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

51

Lessons: Contention at Optane Buffer

• More threads = access amplification = lower bandwidth

Page 48: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

52

Lessons: Contention at iMC

• iMCs aren’t designed for slow, variable latency accesses

– Short queues end up clogged

Page 49: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

53

Lessons: Contention at iMC

• Vary #threads/NVDIMM (total threads = 6)

Page 50: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

54

Lesson: Contention at iMC

• iMC contention is largest when random access size = interleave size (4KB)

Page 51: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

55

Lesson: Contention at iMC

• iMC contention is largest when random access size = interleave size (4KB)

• iMC load is fairest when access size = #DIMMs x interleave size (24KB)

Page 52: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

56

Lesson #3: Limit threads accessing one NVDIMM

• Contention at Optane Buffer

– Increase access amplification

• Contention at iMC

– Lose bandwidth through uneven NVDIMM load

– Avoid interleave aligned random accesses

Page 53: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

57

Lesson #4: Avoid mixed and multi-threaded NUMA accesses

Page 54: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

58

Lesson #4: Avoid mixed and multi-threaded NUMA accesses

• NUMA effects impact Optane more than DRAM

Page 55: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

59

Lesson #4: Avoid mixed and multi-threaded NUMA accesses

• NUMA effects impact Optane more than DRAM

• R/W ratio is more important

Page 56: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

60

Lesson #4: Avoid mixed and multi-threaded NUMA accesses

Page 57: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

61

Conclusion

Page 58: An Empirical Guide to Scalable Persistent Memory · • Limit threads accessing one NVDIMM • Avoid mixed and multi-threaded NUMA accesses. 37 Lesson #1: Avoid small random accesses.

62

Conclusion

• Not Just Slow Dense DRAM

• Slower media

-> More complex architecture

-> Second-order performance anomalies

-> Fundamentally different

• Max performance is tricky– Avoid small random accesses

– Use ntstores for large writes

– Limit threads accessing one NVDIMM

– Avoid mixed and multi-threaded NUMA accesses