
ReFlex: Remote Flash at the Performance of Local Flash

Ana Klimovic Heiner Litz Christos Kozyrakis

Flash Memory Summit 2017 Santa Clara, CA

Stanford MAST

Flash in Datacenters

§  NVMe Flash
•  1,000× higher throughput than disk (1M IOPS)
•  100× lower latency than disk (50–70 µs)

§  But Flash is often underutilized due to imbalanced resource requirements

Example Datacenter Flash Use-Case

[Diagram: Clients connect over TCP/IP to an application tier of app servers, which issue get(k)/put(k,val) requests to a datastore tier running a key-value store service. Each tier is software running over CPU, RAM, NIC, and Flash hardware.]

Imbalanced Resource Utilization

Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months

[Figure: CPU and Flash utilization over time. Flash capacity and bandwidth are underutilized for long periods of time.]

[EuroSys’16] Flash storage disaggregation. Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.

Solution: Remote Access to Storage

§  Improve resource utilization by sharing NVMe Flash between remote tenants

§  There are 3 main concerns:
1. Performance overhead for remote access
2. Interference on shared Flash
3. Flexibility of Flash usage

Issue 1: Performance Overhead

•  Traditional network/storage protocols and Linux I/O libraries (e.g. libaio, libevent) have high overhead

[Figure: p95 read latency (µs) vs. IOPS (thousands) for 4kB random reads, comparing Local Flash, iSCSI (1 core), and libaio+libevent (1 core). Remote access shows a 4× throughput drop and a 2× latency increase over local Flash.]

Issue 2: Performance Interference

[Figure: p95 read latency (µs) vs. total IOPS (thousands) for workloads ranging from 100% reads to 50% reads. Writes impact read tail latency, and latency depends on IOPS load.]

To share Flash, we need to enforce performance isolation

Issue 3: Flexibility

§  Flexibility is key
§  Want to allocate Flash capacity and bandwidth for applications:
•  On any machine in the datacenter
•  At any scale
•  Accessed using any network-storage protocol

How does ReFlex achieve high performance?

Linux vs. ReFlex

[Diagram: In Linux, a remote storage application in user space traverses the filesystem, block I/O layer, and device driver to reach the network interface and Flash storage. In ReFlex, the application sits on a user-level data plane, with a separate control plane, and accesses the hardware directly.]

ReFlex removes software bloat through several techniques (sketched below):
•  Separating the control plane from the data plane
•  Direct access to hardware via DPDK (networking) and SPDK (storage)
•  One data plane per CPU core
•  Polling instead of interrupts (IRQs)
•  Run-to-completion execution
•  Adaptive batching
•  Zero-copy, device-to-device data transfers

[ASPLOS’17] ReFlex: Remote Flash ≈ Local Flash. Ana Klimovic, Heiner Litz, Christos Kozyrakis.
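To make these techniques concrete, here is a minimal sketch of a per-core, run-to-completion data plane loop with polling and adaptive batching. All names (poll_rx, poll_nvme_cq, and so on) are hypothetical placeholders rather than the actual ReFlex, DPDK, or SPDK APIs; the point is the structure: each core polls its own queues, processes a whole batch to completion, and grows or shrinks the batch size with load.

    /* Hypothetical per-core data plane loop: polling, run-to-completion,
     * adaptive batching. Names are placeholders, not real DPDK/SPDK calls. */
    #include <stddef.h>

    #define MAX_BATCH 64

    struct request;                                       /* opaque request   */
    size_t poll_rx(struct request **reqs, size_t max);    /* poll NIC RX queue */
    size_t poll_nvme_cq(struct request **reqs, size_t max); /* poll NVMe CQ   */
    void   process_to_completion(struct request *r);      /* TCP/IP + NVMe SQ */
    void   send_tx(struct request *r);                    /* enqueue response */

    void data_plane_loop(void)
    {
        size_t batch = 1;                  /* adaptive batch size */
        struct request *reqs[MAX_BATCH];

        for (;;) {
            /* 1. Poll (no interrupts): grab up to `batch` new requests. */
            size_t n = poll_rx(reqs, batch);

            /* 2. Run to completion: fully process each request before the
             *    next poll; no context switches, caches stay warm. */
            for (size_t i = 0; i < n; i++)
                process_to_completion(reqs[i]);

            /* 3. Poll NVMe completions and send responses out the TX queue. */
            size_t c = poll_nvme_cq(reqs, MAX_BATCH);
            for (size_t i = 0; i < c; i++)
                send_tx(reqs[i]);

            /* 4. Adapt batch size to load: batch more under load to amortize
             *    per-call overheads, batch less when idle to cut latency. */
            if (n == batch && batch < MAX_BATCH)
                batch *= 2;
            else if (n < batch / 2 && batch > 1)
                batch /= 2;
        }
    }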

How does ReFlex enable performance isolation?

§  Request-cost-based scheduling

§  Determine the impact of tenant A on the tail latency and IOPS of tenant B

§  The control plane assigns each tenant a quota
§  The data plane enforces quotas through throttling (see the sketch below)
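As a hedged illustration of data-plane throttling, the sketch below shows a simple token-bucket rate limiter of the kind such a quota could be enforced with. The structure and names are assumptions for illustration, not ReFlex's actual scheduler.

    /* Hypothetical token-bucket throttling for a tenant's quota. Tokens
     * accrue at the tenant's provisioned rate; each request consumes
     * tokens proportional to its cost (see the request-cost model below). */
    #include <stdint.h>
    #include <stdbool.h>

    struct token_bucket {
        double tokens;      /* currently available tokens         */
        double rate;        /* tokens per second (tenant's quota) */
        double burst;       /* maximum bucket depth               */
        uint64_t last_ns;   /* timestamp of last refill           */
    };

    static void refill(struct token_bucket *tb, uint64_t now_ns)
    {
        double elapsed = (now_ns - tb->last_ns) / 1e9;
        tb->tokens += elapsed * tb->rate;
        if (tb->tokens > tb->burst)
            tb->tokens = tb->burst;
        tb->last_ns = now_ns;
    }

    /* Returns true if the request may be submitted now; otherwise the
     * data plane defers (throttles) it until enough tokens accumulate. */
    static bool admit_request(struct token_bucket *tb, double cost,
                              uint64_t now_ns)
    {
        refill(tb, now_ns);
        if (tb->tokens >= cost) {
            tb->tokens -= cost;
            return true;
        }
        return false;
    }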

Request Cost Modeling

[Figure, left: p95 read latency (µs) vs. total IOPS (thousands) for 100% to 50% read workloads; the curves diverge widely. Right: the same data plotted against weighted IOPS (×10³ tokens/s); the curves collapse onto one another.]

Compensate for read-write asymmetry by charging requests in tokens (a sketch follows).

For this device: Write == 10× Read
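A minimal sketch of how a weighted request cost could be computed, assuming the 10:1 write-to-read ratio measured for this device. The constants and function are illustrative, not ReFlex's actual implementation; in practice costs are calibrated per device.

    /* Hypothetical request-cost function. On this device a write costs
     * roughly 10x a read, so charging tokens accordingly makes tail
     * latency a function of weighted (token) load rather than raw IOPS. */
    enum io_op { IO_READ, IO_WRITE };

    #define READ_COST   1.0   /* tokens per 4kB read (unit cost)        */
    #define WRITE_COST 10.0   /* tokens per 4kB write, for this device  */

    static double request_cost(enum io_op op)
    {
        return (op == IO_WRITE) ? WRITE_COST : READ_COST;
    }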

Request Cost Based Scheduling

Tenant A SLO: 1ms tail latency, 200K IOPS.
At a 1ms tail latency SLO, the device can sustain up to 510K IOPS, so reserving 200K IOPS for Tenant A leaves 310K IOPS of slack.

Tenant B SLO: 1.5ms tail latency, 100K IOPS.
Tenant B's 100K IOPS fit within the slack, leaving 210K IOPS.

Tenant C SLO: 0.5ms tail latency, 200K IOPS.
The tighter 0.5ms latency SLO lowers the device's sustainable rate to 375K IOPS. With 300K IOPS already reserved for A and B, only 75K of slack remains: Tenant C cannot register since the system cannot satisfy its SLO (see the admission sketch below).
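Below is a hedged sketch of this admission check: reject a tenant if the reserved IOPS plus its request would exceed the device's sustainable rate at the tightest latency SLO in the system. The device profile values and all names are illustrative assumptions mirroring the numbers on the slides.

    /* Hypothetical SLO admission control. max_iops_at() would come from
     * an offline device profile mapping a tail-latency target to the
     * highest sustainable (weighted) IOPS; values mirror the slides. */
    #include <stdbool.h>

    struct tenant { double latency_slo_ms; double iops_slo; };

    static double max_iops_at(double latency_slo_ms)
    {
        if (latency_slo_ms >= 1.0) return 510e3;  /* 1ms SLO   -> 510K IOPS */
        return 375e3;                              /* 0.5ms SLO -> 375K IOPS */
    }

    /* Admit only if all reservations still fit under the device limit
     * implied by the tightest latency SLO among registered tenants. */
    static bool admit_tenant(const struct tenant *reg, int n,
                             struct tenant cand)
    {
        double reserved = cand.iops_slo;
        double tightest = cand.latency_slo_ms;
        for (int i = 0; i < n; i++) {
            reserved += reg[i].iops_slo;
            if (reg[i].latency_slo_ms < tightest)
                tightest = reg[i].latency_slo_ms;
        }
        return reserved <= max_iops_at(tightest);
    }

With tenants A and B registered, admit_tenant() rejects Tenant C: 200K + 100K + 200K = 500K reserved exceeds the 375K IOPS the device sustains at a 0.5ms tail latency SLO.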

Summary: Life of a Packet in ReFlex

[Diagram: a request flows through libReFlex between the network, the Flash application, and NVMe. Reconstructed steps: (1) the packet arrives in the network RX queue; (2) TCP/IP processing; (3) event conditions notify the application; (4) the application issues batched network/storage calls and the scheduler enqueues the request in the NVMe submission queue (SQ); (5) the Flash device executes the I/O; (6) the completion arrives in the NVMe completion queue (CQ); (7) TCP/IP builds the response; (8) the response leaves through the network TX queue.]

Results: Local ≈ Remote Latency

Using 1 core on the Flash server:

[Figure: p95 read latency (µs) vs. IOPS (thousands) for Local-1T, ReFlex-1T, Libaio-1T, and Linux-1T. ReFlex achieves 850K IOPS/core over TCP; Linux achieves 75K IOPS/core over TCP.]

Latency:
Local Flash   78 µs
ReFlex        99 µs
Linux        200 µs

Using 2 vs. 1 cores on the Flash server:

[Figure: adding a second thread (Local-2T, ReFlex-2T, Libaio-2T, Linux-2T) shows ReFlex saturating the Flash device.]

Results: Performance Isolation

•  Tenants A & B: latency-critical; Tenants C & D: best effort
•  Without the scheduler: latency and bandwidth QoS for A/B are violated
•  The scheduler rate-limits the best-effort tenants to enforce SLOs

[Figure: two bar charts comparing I/O scheduler disabled vs. enabled for Tenants A (100% read), B (80% read), C (95% read), and D (25% read). Left: IOPS (thousands) against Tenant A's and Tenant B's IOPS SLOs. Right: read p95 latency (µs) against the latency SLO.]

Results: Scalability

§  ReFlex scales to multiple cores (we tested with up to 12 cores)
§  ReFlex scales to thousands of tenants and connections

ReFlex can manage 5,000 tenants with a single core, with each tenant issuing 100 1kB reads/sec.

How can I use ReFlex?

§  ReFlex is open source: www.github.com/stanford-mast/reflex

§  Two versions of ReFlex are available:
1. Kernel implementation:
–  Provides a protected network-storage data plane (the user application is not trusted)
–  Builds on hardware support for virtualization
–  Requires the Dune kernel module for direct access to hardware and memory management in Linux
2. User space implementation:
–  The network-storage data plane runs as a user space application (kernel bypass)
–  Portable implementation, tested on an Amazon Web Services i3 instance

ReFlex Clients

§  ReFlex exposes a logical block interface

§  ReFlex client implementations are available:
1. Block I/O client → for issuing raw network block I/O requests to ReFlex
2. Remote block device driver → for legacy Linux applications (see the example below)
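As a hedged illustration of the legacy path: once the remote block device driver exposes ReFlex as an ordinary block device, an unmodified application can use standard POSIX I/O. The device name /dev/reflex0 below is a made-up placeholder; consult the repository README for the actual device naming and setup.

    /* Reading a 4kB block from a hypothetical ReFlex-backed block device
     * using plain POSIX I/O; no application changes beyond the path. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/dev/reflex0", O_RDONLY);  /* placeholder device name */
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        /* Read logical block 0; the driver turns this into a network
         * block I/O request served by a remote ReFlex server. */
        ssize_t n = pread(fd, buf, sizeof(buf), 0);
        if (n < 0) { perror("pread"); close(fd); return EXIT_FAILURE; }

        printf("read %zd bytes from remote Flash\n", n);
        close(fd);
        return EXIT_SUCCESS;
    }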

Control Plane for ReFlex

§  ReFlex provides an efficient data plane for remote Flash access

§  Bring your own control plane or distributed storage system:
•  A global control plane allocates Flash capacity & bandwidth in the cluster
•  The control plane can be a cluster manager or a distributed storage system (e.g. a distributed file system or database)

§  Example: ReFlex storage tier for the Crail distributed file system
•  Spark applications can use ReFlex as a remote Flash storage tier, managed by the Crail distributed data store

Example Use-Case

§  Big data analytics frameworks such as Spark store a variety of data types in memory, Flash, and disk

§  Crail is a distributed storage system that manages and provides high-performance access to data across multiple storage tiers

[Diagram: Spark workloads (SQL, Streaming, MLlib, GraphX) produce shuffle data, input/output files, broadcast data, and RDDs, which Crail places across storage tiers. www.crail.io]

Example Use-Case

§  Spark executors often use (local) Flash to store temporary data (e.g. shuffle)

§  Spark applications often saturate CPU resources before they can saturate NVMe Flash IOPS
→ Local Flash bandwidth is underutilized

[Diagram: four Spark executors, each with its own compute, DRAM, and locally attached NVMe Flash.]

Example Use-Case

§  ReFlex provides a shared remote Flash tier for big data analytics
•  Improves resource utilization while offering local Flash performance

[Diagram: four Spark executors with compute and DRAM connect over a high-bandwidth Ethernet network to a pair of ReFlex servers hosting the NVMe Flash.]

Summary

§  ReFlex enables Flash disaggregation

§  Performance: remote ≈ local

§  Commodity networking, low CPU overhead

§  QoS on shared Flash with I/O scheduler

Thank You!

ReFlex is available at:

https://github.com/stanford-mast/reflex