Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems
USENIX HotStorage 2016
Sungyong Ahn*, Kwanghyun La*, Jihong Kim**
*Memory Business, Samsung Electronics Co., Ltd.
**Seoul National University
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
OS-level Virtualization
Multiple isolated instances (containers) running on a single host.
• Hardware resources should be isolated and allocated to containers
• Different resource requirements should be satisfied
Linux Control Groups (Cgroups)
Kernel-level resource manager of Linux
[Figure: Containers on one host with different requirements: "I have CPU-bound jobs" and "I have I/O-bound jobs".]
I/O bandwidth is shared according to I/O weights
Proportional I/O scheme in Linux Cgroups
[Figure: Ideal proportional I/O sharing. Normalized I/O bandwidth of C(10), C(5), C(2.5), and C(1) is 10, 5, 2.5, and 1, respectively.]
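For concreteness, below is a minimal sketch of how proportional I/O weights are assigned through the cgroup v1 blkio controller; the mount point /sys/fs/cgroup/blkio, the cgroup names, and the weight values are illustrative assumptions and not part of the talk.

```c
/* Minimal sketch: assign proportional I/O weights to two cgroups via
 * the cgroup v1 blkio controller.  The mount point and the cgroup
 * names ("containerA", "containerB") are illustrative assumptions. */
#include <stdio.h>

static int set_blkio_weight(const char *cgroup, unsigned int weight)
{
    char path[256];
    FILE *f;

    /* e.g. /sys/fs/cgroup/blkio/containerA/blkio.weight */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/blkio/%s/blkio.weight", cgroup);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%u\n", weight);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Container A should receive twice the I/O bandwidth of container B. */
    set_blkio_weight("containerA", 1000);
    set_blkio_weight("containerB", 500);
    return 0;
}
```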
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Questions
When Linux Cgroups work with NVMe SSDs,
1. Can they share the I/O resource in proportion to I/O weights?
2. Is it scalable?
Proportional I/O with NVMe SSDs
[Figure: Normalized I/O bandwidth with the existing Cgroups (BASELINE) on an NVMe SSD: C(10), C(5), C(2.5), and C(1) achieve 1.5, 1.1, 0.5, and 1.0 instead of the ideal 10, 5, 2.5, and 1.0.]
Existing Cgroups cannot support proportional I/O for NVMe SSDs.
Because...
NVMe SSDs have a different I/O stack from SATA storage.
[Figure: SATA I/O stack vs. NVMe I/O stack]
The existing proportional I/O scheme is implemented in the single-queue block layer, whereas NVMe SSDs use the multi-queue block layer.
First Attempt: Using the Existing Static Throttling
Upper limit of I/O bandwidth
• Limit the maximum number of bytes or I/O requests for a particular time interval (throttling window)
[Figure: Container A with upper limit = 10: requests 1-10 are issued within one throttling window, and any further requests are throttled until the next window.]
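Such a static upper limit is configured through the existing cgroup v1 throttling interface. A minimal sketch follows; the cgroup name, the device major:minor numbers, and the 100 MB/s limit are illustrative assumptions.

```c
/* Minimal sketch: impose a static upper limit on a cgroup's read
 * bandwidth via the cgroup v1 blkio throttling interface.  The cgroup
 * name, the device numbers (259:0), and the limit are assumptions. */
#include <stdio.h>

static int set_read_bps_limit(const char *cgroup, const char *dev,
                              unsigned long long bps)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/blkio/%s/blkio.throttle.read_bps_device", cgroup);
    f = fopen(path, "w");
    if (!f)
        return -1;
    /* Format expected by the kernel: "<major>:<minor> <bytes per second>" */
    fprintf(f, "%s %llu\n", dev, bps);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Cap container A's reads from device 259:0 at 100 MB/s. */
    return set_read_bps_limit("containerA", "259:0", 100ULL * 1024 * 1024);
}
```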
[Figure: Normalized I/O bandwidth of C(10), C(5), C(2.5), and C(1) under static throttling: 9.9, 2.2, 1.2, and 1.0.]
Static throttling is not enough to support proportional I/O.
Because...
I/O workloads fluctuate with time.
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Contributions
We achieved proportional I/O for NVMe SSDs.
We achieved scalable performance of Linux Cgroups.
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Overview of WDT Scheme
[Figure: WDT architecture above the block layer. Each container i with I/O weight w_i is monitored (CPM_i); the Future I/O Demand Predictor and the TotalCredit Updater maintain TotalCredit, with residual credits carried over, and the Budget Distributor allocates per-container credits C_1, C_2, and C_3. Each container i is checked against the condition B_i^j + R_i^j > U_i^j in interval j before its I/O is dispatched.]
To update TotalCredit, the future I/O demand is predicted.
Credits are distributed to containers according to their I/O weights.
Budget Distributor
All containers are allocated credits in proportion to their I/O weights.
Credits are replenished periodically.
If a container has no available credit, it is throttled.
[Figure: Container A (I/O weight = 10) and Container B (I/O weight = 5) share TotalCredit = 15 in each throttling window: A receives credits 1-10 and B receives credits 11-15. Credits are replenished at every window, and a container that has used up its credits is throttled until the next replenishment.]
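A minimal user-space sketch of this credit mechanism is shown below; it is an illustration, not the authors' kernel implementation. TotalCredit is split in proportion to I/O weights at every replenishment, each request consumes one credit, and a container with no remaining credits is throttled until the next window. The structure and function names, and the one-credit-per-request accounting, are assumptions.

```c
/* Sketch of weight-proportional credit distribution and throttling.
 * Illustration only; not the authors' kernel implementation. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CONTAINERS 2

struct container {
    const char *name;
    unsigned int weight;   /* I/O weight (e.g., 10 or 5) */
    long credits;          /* credits left in the current window */
};

/* Replenish: distribute total_credit in proportion to the weights.
 * (Remainder credits from the integer division are ignored here.) */
static void replenish(struct container *c, int n, long total_credit)
{
    unsigned int wsum = 0;
    for (int i = 0; i < n; i++)
        wsum += c[i].weight;
    for (int i = 0; i < n; i++)
        c[i].credits = total_credit * c[i].weight / wsum;
}

/* Charge one request; return false if the container must be throttled. */
static bool charge(struct container *c)
{
    if (c->credits <= 0)
        return false;          /* no credit left until the next replenishment */
    c->credits--;
    return true;
}

int main(void)
{
    struct container cs[NR_CONTAINERS] = {
        { "A", 10, 0 }, { "B", 5, 0 },
    };

    replenish(cs, NR_CONTAINERS, 15);   /* TotalCredit = 15 -> A:10, B:5 */
    for (int i = 0; i < NR_CONTAINERS; i++)
        printf("%s gets %ld credits (throttle on empty: %d)\n",
               cs[i].name, cs[i].credits, !charge(&cs[i]));
    return 0;
}
```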
In order to remove storage idle time, TotalCredit is adjusted.
TotalCredit Updater
Technical Challenge
How to predict the TotalCredit, i.e., the future I/O demand, required for the next interval?
Monitoring the I/O demand of each container in every interval
• Prediction of the future I/O demand from its cumulative distribution function
– 80th percentile of the cumulative distribution of I/O demand (assuming a normal distribution)
Future I/O Demand Predictor
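This prediction can be sketched as follows: track the per-interval demand samples of a container, compute their mean and standard deviation, and take mean + 0.8416 * stddev, since 0.8416 is the 80th percentile of the standard normal distribution. The function name and the sample values below are hypothetical.

```c
/* Sketch of the 80th-percentile demand prediction under a normal
 * distribution assumption.  Build with: cc predictor.c -lm */
#include <math.h>
#include <stdio.h>

static double predict_demand(const double *samples, int n)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < n; i++)
        mean += samples[i];
    mean /= n;

    for (int i = 0; i < n; i++)
        var += (samples[i] - mean) * (samples[i] - mean);
    var /= n;

    /* 0.8416 = 80th percentile of the standard normal distribution */
    return mean + 0.8416 * sqrt(var);
}

int main(void)
{
    /* Per-interval I/O demand of one container (illustrative, in MB). */
    double demand[] = { 90, 110, 95, 120, 100, 105 };
    int n = sizeof(demand) / sizeof(demand[0]);

    printf("predicted demand for the next interval: %.1f MB\n",
           predict_demand(demand, n));
    return 0;
}
```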
Questions
When Linux Cgroups work with NVMe SSDs,
1. Can they share the I/O resource in proportion to I/O weights?
2. Is it scalable?
Scalability of the Existing Cgroups on NUMA
Scalability problem of the existing Cgroups throttling layer
[Figure: Aggregate throughput (KIOPS) of containers C1-C4 as the number of NUMA nodes grows from 1 to 4.]
Because...
All containers share a single request_queue lock across NUMA nodes.
[Figure: CPU cache miss ratio (%) versus the number of NUMA nodes: 0.6%, 21.8%, 31.7%, and 38.2% for 1, 2, 3, and 4 nodes.]
[Figure: Containers A and B reach the storage through the single-queue block layer, which provides proportional I/O (CFQ); containers C and D use the multi-queue block layer, which provides only Cgroup I/O throttling protected by a single LOCK.]
The single shared lock causes:
• Lock contention
• Remote memory accesses to the lock state
• Cacheline invalidations caused by the cache coherence protocol
Per-container Locks
We adopt fine-grained per-container locks.
The cache miss ratio decreases from 38.2% to 12.8%.
[Figure: The single LOCK in the Cgroup I/O throttling layer above the multi-queue block layer is replaced with one lock per container (A, B, C, D).]
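The per-container locking idea can be sketched in user space as follows, with pthread spinlocks standing in for the kernel's spinlock_t; the structure and field names are hypothetical. Each container's throttling state is protected by its own lock, so threads charging I/O to different containers do not contend on one shared request_queue lock across NUMA nodes.

```c
/* User-space sketch of per-container locking.  Build with -pthread.
 * pthread spinlocks stand in for the kernel's spinlock_t. */
#include <pthread.h>

struct container_tg {
    pthread_spinlock_t lock;        /* protects only this container's state */
    long credits;                   /* credits left in the current window */
    unsigned long bytes_dispatched;
};

static void container_tg_init(struct container_tg *tg, long credits)
{
    pthread_spin_init(&tg->lock, PTHREAD_PROCESS_PRIVATE);
    tg->credits = credits;
    tg->bytes_dispatched = 0;
}

/* Charge an I/O to one container; only that container's lock is taken. */
static int charge_io(struct container_tg *tg, unsigned long bytes)
{
    int ok;

    pthread_spin_lock(&tg->lock);
    ok = (tg->credits > 0);
    if (ok) {
        tg->credits--;
        tg->bytes_dispatched += bytes;
    }
    pthread_spin_unlock(&tg->lock);
    return ok;                      /* 0 means the container is throttled */
}

int main(void)
{
    struct container_tg a, b;

    container_tg_init(&a, 10);      /* weight-proportional credits */
    container_tg_init(&b, 5);

    charge_io(&a, 4096);            /* touches only container A's lock */
    charge_io(&b, 4096);            /* touches only container B's lock */
    return 0;
}
```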
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Experiment Setup
Linux kernel 4.0.4 (modified)
Dell R920 server (4-node NUMA machine)
Samsung XS1715 NVMe SSDs
Block I/O trace replayer
• UMass trace repository
• SNIA IOTTA repository
Result 1: Proportional I/O Support
Normalized I/O bandwidth:
                    C(10)   C(5)   C(2.5)   C(1)
BASELINE             1.5    1.1     0.5     1.0
Static Throttling    9.9    2.2     1.2     1.0
WDT                 10.0    5.0     2.5     1.0
The WDT scheme satisfies the proportional sharing requirements.
Result 2: Performance Scalability
WDT-: using a single spin lock
WDT: using per-container locks
I/O bandwidth (MB/s):
         C1     C2    C3    C4
WDT-    1333   670   334   133
WDT     1762   881   440   176
Per-container locks (WDT) improve I/O bandwidth over the single-lock version (WDT-) while preserving the weight ratios.
Introduction
Motivation
Contributions
Weight-based dynamic throttling (WDT) scheme
Experimental Results
Conclusion
Outline
Conclusion
Proposed the weight-based dynamic throttling scheme to support proportional I/O sharing for NVMe SSDs.
Proposed the per-container locks for scalable performance.