Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems
USENIX HotStorage 2016
Sungyong Ahn*, Kwanghyun La*, Jihong Kim**
*Memory Business, Samsung Electronics Co., Ltd.
**Seoul National University
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
OS-level Virtualization
Multiple isolated instances (containers) running on a single host.
• Hardware resources should be isolated and allocated to containers
• Different resource requirements should be satisfied
Linux Control Groups (Cgroups)
Kernel-level resource manager of Linux
[Figure: Containers on one host with different requirements: "I have CPU-bound jobs" and "I have I/O-bound jobs".]
I/O bandwidth is shared according to I/O weights
Proportional I/O scheme in Linux Cgroups
[Figure: Ideal proportional I/O sharing. Normalized I/O bandwidth of C(10), C(5), C(2.5), and C(1) is 10, 5, 2.5, and 1, respectively.]
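For concreteness, below is a minimal sketch of how proportional I/O weights are assigned through the cgroup v1 blkio controller; the mount point /sys/fs/cgroup/blkio, the cgroup names, and the weight values are illustrative assumptions and not part of the talk.

```c
/* Minimal sketch: assign proportional I/O weights to two cgroups via
 * the cgroup v1 blkio controller.  The mount point and the cgroup
 * names ("containerA", "containerB") are illustrative assumptions. */
#include <stdio.h>

static int set_blkio_weight(const char *cgroup, unsigned int weight)
{
    char path[256];
    FILE *f;

    /* e.g. /sys/fs/cgroup/blkio/containerA/blkio.weight */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/blkio/%s/blkio.weight", cgroup);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%u\n", weight);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Container A should receive twice the I/O bandwidth of container B. */
    set_blkio_weight("containerA", 1000);
    set_blkio_weight("containerB", 500);
    return 0;
}
```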
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Questions
When Linux Cgroups work with NVMe SSDs,
1. Can they share the I/O resource in proportion to I/O weights?
2. Is it scalable?
Proportional I/O with NVMe SSDs
[Figure: Normalized I/O bandwidth with the existing Cgroups (BASELINE) on an NVMe SSD: C(10), C(5), C(2.5), and C(1) achieve 1.5, 1.1, 0.5, and 1.0 instead of the ideal 10, 5, 2.5, and 1.0.]
Existing Cgroups cannot support proportional I/O for NVMe SSDs.
Because...
NVMe SSDs have a different I/O stack from SATA storage.
[Figure: SATA I/O stack vs. NVMe I/O stack]
The existing proportional I/O scheme is implemented in the single-queue block layer, whereas NVMe SSDs use the multi-queue block layer.
First Attempt: Using the Existing Static Throttling
Upper limit of I/O bandwidth
• Limit the maximum number of bytes or I/O requests for a particular time interval (throttling window)
[Figure: Container A with upper limit = 10: requests 1-10 are issued within one throttling window, and any further requests are throttled until the next window.]
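Such a static upper limit is configured through the existing cgroup v1 throttling interface. A minimal sketch follows; the cgroup name, the device major:minor numbers, and the 100 MB/s limit are illustrative assumptions.

```c
/* Minimal sketch: impose a static upper limit on a cgroup's read
 * bandwidth via the cgroup v1 blkio throttling interface.  The cgroup
 * name, the device numbers (259:0), and the limit are assumptions. */
#include <stdio.h>

static int set_read_bps_limit(const char *cgroup, const char *dev,
                              unsigned long long bps)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/blkio/%s/blkio.throttle.read_bps_device", cgroup);
    f = fopen(path, "w");
    if (!f)
        return -1;
    /* Format expected by the kernel: "<major>:<minor> <bytes per second>" */
    fprintf(f, "%s %llu\n", dev, bps);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Cap container A's reads from device 259:0 at 100 MB/s. */
    return set_read_bps_limit("containerA", "259:0", 100ULL * 1024 * 1024);
}
```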
[Figure: Normalized I/O bandwidth of C(10), C(5), C(2.5), and C(1) under static throttling: 9.9, 2.2, 1.2, and 1.0.]
Static throttling is not enough to support proportional I/O.
Because...
I/O workloads fluctuate with time.
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Contributions
We achieved proportional I/O for NVMe SSDs.
We achieved scalable performance of Linux Cgroups.
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Overview of WDT Scheme
[Figure: WDT architecture above the block layer. Each container i with I/O weight w_i is monitored (CPM_i); the Future I/O Demand Predictor and the TotalCredit Updater maintain TotalCredit, with residual credits carried over, and the Budget Distributor allocates per-container credits C_1, C_2, and C_3. Each container i is checked against the condition B_i^j + R_i^j > U_i^j in interval j before its I/O is dispatched.]
To update TotalCredit, the future I/O demand is predicted.
Credits are distributed to containers according to their I/O weights.
Budget Distributor
All containers are allocated credits in proportion to their I/O weights.
Credits are replenished periodically.
If a container has no available credit, it is throttled.
[Figure: Container A (I/O weight = 10) and Container B (I/O weight = 5) share TotalCredit = 15 in each throttling window: A receives credits 1-10 and B receives credits 11-15. Credits are replenished at every window, and a container that has used up its credits is throttled until the next replenishment.]
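A minimal user-space sketch of this credit mechanism is shown below; it is an illustration, not the authors' kernel implementation. TotalCredit is split in proportion to I/O weights at every replenishment, each request consumes one credit, and a container with no remaining credits is throttled until the next window. The structure and function names, and the one-credit-per-request accounting, are assumptions.

```c
/* Sketch of weight-proportional credit distribution and throttling.
 * Illustration only; not the authors' kernel implementation. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CONTAINERS 2

struct container {
    const char *name;
    unsigned int weight;   /* I/O weight (e.g., 10 or 5) */
    long credits;          /* credits left in the current window */
};

/* Replenish: distribute total_credit in proportion to the weights.
 * (Remainder credits from the integer division are ignored here.) */
static void replenish(struct container *c, int n, long total_credit)
{
    unsigned int wsum = 0;
    for (int i = 0; i < n; i++)
        wsum += c[i].weight;
    for (int i = 0; i < n; i++)
        c[i].credits = total_credit * c[i].weight / wsum;
}

/* Charge one request; return false if the container must be throttled. */
static bool charge(struct container *c)
{
    if (c->credits <= 0)
        return false;          /* no credit left until the next replenishment */
    c->credits--;
    return true;
}

int main(void)
{
    struct container cs[NR_CONTAINERS] = {
        { "A", 10, 0 }, { "B", 5, 0 },
    };

    replenish(cs, NR_CONTAINERS, 15);   /* TotalCredit = 15 -> A:10, B:5 */
    for (int i = 0; i < NR_CONTAINERS; i++)
        printf("%s gets %ld credits (throttle on empty: %d)\n",
               cs[i].name, cs[i].credits, !charge(&cs[i]));
    return 0;
}
```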
In order to remove storage idle time, TotalCredit is adjusted.
TotalCredit Updater
Technical Challenge
How to predict the TotalCredit, i.e., the future I/O demand, required for the next interval?
Monitoring the I/O demand of each container in every interval
• Prediction of the future I/O demand from its cumulative distribution function
– 80th percentile of the cumulative distribution of I/O demand (assuming a normal distribution)
Future I/O Demand Predictor
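This prediction can be sketched as follows: track the per-interval demand samples of a container, compute their mean and standard deviation, and take mean + 0.8416 * stddev, since 0.8416 is the 80th percentile of the standard normal distribution. The function name and the sample values below are hypothetical.

```c
/* Sketch of the 80th-percentile demand prediction under a normal
 * distribution assumption.  Build with: cc predictor.c -lm */
#include <math.h>
#include <stdio.h>

static double predict_demand(const double *samples, int n)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < n; i++)
        mean += samples[i];
    mean /= n;

    for (int i = 0; i < n; i++)
        var += (samples[i] - mean) * (samples[i] - mean);
    var /= n;

    /* 0.8416 = 80th percentile of the standard normal distribution */
    return mean + 0.8416 * sqrt(var);
}

int main(void)
{
    /* Per-interval I/O demand of one container (illustrative, in MB). */
    double demand[] = { 90, 110, 95, 120, 100, 105 };
    int n = sizeof(demand) / sizeof(demand[0]);

    printf("predicted demand for the next interval: %.1f MB\n",
           predict_demand(demand, n));
    return 0;
}
```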
Questions
When Linux Cgroups work with NVMe SSDs,
1. Can they share the I/O resource in proportion to I/O weights?
2. Is it scalable?
Scalability of the Existing Cgroups on NUMA
Scalability problem of the existing Cgroups throttling layer
[Figure: Aggregate throughput (KIOPS) of containers C1-C4 as the number of NUMA nodes grows from 1 to 4.]
Because...
All containers share a single request_queue lock across NUMA nodes.
[Figure: CPU cache miss ratio (%) versus the number of NUMA nodes: 0.6%, 21.8%, 31.7%, and 38.2% for 1, 2, 3, and 4 nodes.]
[Figure: Containers A and B reach the storage through the single-queue block layer, which provides proportional I/O (CFQ); containers C and D use the multi-queue block layer, which provides only Cgroup I/O throttling protected by a single LOCK.]
The single shared lock causes:
• Lock contention
• Remote memory accesses to the lock state
• Cacheline invalidations caused by the cache coherence protocol
Per-container Locks
We adopt fine-grained per-container locks.
The cache miss ratio decreases from 38.2% to 12.8%.
[Figure: The single LOCK in the Cgroup I/O throttling layer above the multi-queue block layer is replaced with one lock per container (A, B, C, D).]
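The per-container locking idea can be sketched in user space as follows, with pthread spinlocks standing in for the kernel's spinlock_t; the structure and field names are hypothetical. Each container's throttling state is protected by its own lock, so threads charging I/O to different containers do not contend on one shared request_queue lock across NUMA nodes.

```c
/* User-space sketch of per-container locking.  Build with -pthread.
 * pthread spinlocks stand in for the kernel's spinlock_t. */
#include <pthread.h>

struct container_tg {
    pthread_spinlock_t lock;        /* protects only this container's state */
    long credits;                   /* credits left in the current window */
    unsigned long bytes_dispatched;
};

static void container_tg_init(struct container_tg *tg, long credits)
{
    pthread_spin_init(&tg->lock, PTHREAD_PROCESS_PRIVATE);
    tg->credits = credits;
    tg->bytes_dispatched = 0;
}

/* Charge an I/O to one container; only that container's lock is taken. */
static int charge_io(struct container_tg *tg, unsigned long bytes)
{
    int ok;

    pthread_spin_lock(&tg->lock);
    ok = (tg->credits > 0);
    if (ok) {
        tg->credits--;
        tg->bytes_dispatched += bytes;
    }
    pthread_spin_unlock(&tg->lock);
    return ok;                      /* 0 means the container is throttled */
}

int main(void)
{
    struct container_tg a, b;

    container_tg_init(&a, 10);      /* weight-proportional credits */
    container_tg_init(&b, 5);

    charge_io(&a, 4096);            /* touches only container A's lock */
    charge_io(&b, 4096);            /* touches only container B's lock */
    return 0;
}
```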
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Experiment Setup
Linux kernel 4.0.4 (modified)
Dell R920 server (4-node NUMA machine)
Samsung XS1715 NVMe SSDs
Block I/O trace replayer
• UMass trace repository
• SNIA IOTTA repository
Result 1: Proportional I/O Support
Normalized I/O bandwidth:
                    C(10)   C(5)   C(2.5)   C(1)
BASELINE             1.5    1.1     0.5     1.0
Static Throttling    9.9    2.2     1.2     1.0
WDT                 10.0    5.0     2.5     1.0
The WDT scheme satisfies the proportional sharing requirements.
Result 2: Performance Scalability
WDT-: using a single spin lock
WDT: using per-container locks
I/O bandwidth (MB/s):
         C1     C2    C3    C4
WDT-    1333   670   334   133
WDT     1762   881   440   176
Per-container locks (WDT) improve I/O bandwidth over the single-lock version (WDT-) while preserving the weight ratios.
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Conclusion
Proposed the weight-based dynamic throttling scheme to support proportional I/O sharing for NVMe SSDs.
Proposed the per-container locks for scalable performance.