Understanding Manycore Scalability of File Systems
Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim
Applications must parallelize I/O operations
● Death of single-core CPU scaling
  – CPU clock frequency: 3 ~ 3.8 GHz
  – # of physical cores: up to 24 (Xeon E7 v4)
● From mechanical HDD to flash SSD
  – IOPS of a commodity SSD: 900K
  – Non-volatile memory (e.g., 3D XPoint): 1,000x ↑
But file systems become a scalability bottleneck
Problem: lack of understanding of internal scalability behavior
[Figure: Exim mail server on RAMDISK — messages/sec vs. #core (0–80) for btrfs, ext4, F2FS, and XFS. Although Exim is an embarrassingly parallel application, throughput either (1) saturates, (2) collapses, or (3) never scales.]
● Intel 80-core machine: 8-socket, 10-core Xeon E7-8870
● RAM: 512 GB, 1 TB SSD, 7200 RPM HDD
Even on slower storage media, the file system becomes a bottleneck
[Figure: Exim mail server throughput (messages/sec) at 80 cores for btrfs, ext4, F2FS, and XFS on RAMDISK, SSD, and HDD.]
Outline
● Background
● FxMark design
  – A file system benchmark suite for manycore scalability
● Analysis of five Linux file systems
● Pilot solution
● Related work
● Summary
Research questions
● What file system operations are not scalable?
● Why are they not scalable?
● Is it a problem of implementation or design?
Technical challenges
● Applications are usually stuck on a few bottlenecks
  → cannot see the next level of bottlenecks before resolving the current ones
  → difficult to understand overall scalability behavior
● How can we systematically stress file systems to understand their scalability behavior?
FxMark: evaluate & analyze manycore scalability of file systems
● FxMark: 19 micro-benchmarks and 3 applications
● File systems: tmpfs (memory FS); ext4 and ext4NJ (journaling FS, with and without journal); XFS (journaling FS); btrfs (CoW FS); F2FS (log FS)
● Storage medium: SSD
● # cores: 1, 2, 4, 10, 20, 30, 40, 50, 60, 70, 80
● In total: >4,700 experiments
Microbenchmarks: unveil hidden scalability bottlenecks
● Example: data block read (operation R in the legend)
[Figure: each microbenchmark pairs a file system operation (legend: R = read) with a sharing level — Low, Medium, or High — at the file, block, and process level.]
Stress different components with various sharing levels
Evaluation
● Data block read, low sharing: each process reads a block in its own file
[Figure: M ops/sec vs. #core (0–80); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS. All file systems show linear scalability.]
Outline
● Background
● FxMark design
● Analysis of five Linux file systems
  – What are the scalability bottlenecks?
● Pilot solution
● Related work
● Summary
Summary of results: file systems are not scalable
[Figure: throughput vs. #core (0–80) for all 19 microbenchmarks — DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM (M ops/sec), plus O_DIRECT variants of DRBL, DRBM, DWOL, and DWOM — and for three applications: Exim (messages/sec), RocksDB (ops/sec), and DBENCH (GB/sec). Legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS.]
Data block read
● Low sharing (DRBL): all file systems scale linearly
● Medium sharing (DRBM): XFS shows performance collapse
● High sharing (DRBH): all file systems show performance collapse
[Figure: M ops/sec vs. #core (0–80) for DRBL, DRBM, and DRBH.]
Page cache is maintained for efficient access of file data
[Figure: reading a file block on a cache miss — 1. read a file block; 2. look up the page cache; 3. cache miss; 4. read the page from disk; 5. copy the page to the user buffer.]
Page cache hit
[Figure: 1. read a file block; 2. look up the page cache; 3. cache hit; 4. copy the page to the user buffer.]
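The two flows above boil down to a lookup-then-fill pattern. Below is a minimal userland sketch of that logic, assuming a toy direct-mapped cache; the cache layout, disk_read(), and read_file_block() are hypothetical stand-ins for the kernel's real page cache machinery, not its actual code.

#include <string.h>
#include <stdbool.h>

#define BLK_SIZE    4096
#define CACHE_SLOTS 64

/* Toy direct-mapped page cache: one slot per block index. */
static struct {
    long index;              /* which file block this slot holds */
    bool valid;
    char data[BLK_SIZE];
} cache[CACHE_SLOTS];

/* Hypothetical stand-in for an actual disk read. */
static void disk_read(long index, char *buf)
{
    memset(buf, (int)(index & 0xff), BLK_SIZE);   /* pretend I/O */
}

/* Read one file block, going through the page cache first. */
void read_file_block(long index, char *user_buf)
{
    int slot = (int)(index % CACHE_SLOTS);        /* 2. look up the cache */
    if (!cache[slot].valid || cache[slot].index != index) {
        disk_read(index, cache[slot].data);       /* 3. miss -> 4. read from disk */
        cache[slot].index = index;
        cache[slot].valid = true;
    }
    memcpy(user_buf, cache[slot].data, BLK_SIZE); /* 4./5. copy page to user buffer */
}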
Page cache can be evicted to secure free memory
[Figure: under memory pressure, the OS kernel evicts pages from the page cache back to disk.]
… but only when the page is not being accessed
[Figure: 1. read a file block; 4. copy page — a page currently being read must not be evicted.]
Reference counting is used to track the # of accessing tasks:

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}
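To make the eviction rule concrete, here is a minimal userland sketch of how such a counter pins a page, loosely modeled on the kernel's get_page/put_page pattern; this struct page and try_evict() are deliberate simplifications, not the real API.

#include <stdatomic.h>
#include <stdbool.h>

struct page {
    atomic_int _count;   /* # of tasks currently accessing the page */
    /* ... page state ... */
};

/* Pin the page for the duration of an access. */
static void page_get(struct page *p) { atomic_fetch_add(&p->_count, 1); }
static void page_put(struct page *p) { atomic_fetch_sub(&p->_count, 1); }

/* Eviction may proceed only when no task holds a reference:
 * atomically move the count to -1 iff it is exactly 0. */
static bool try_evict(struct page *p)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&p->_count, &expected, -1);
}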
Reference counting becomes a scalability bottleneck

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}

[Figure: DRBH, M ops/sec vs. #core (0–80) — throughput collapses as cores are added.]
● Uncontended: 20 CPI (cycles-per-instruction); with many readers (R R R R R R …) contending on the same counter: 100 CPI
→ High contention on a page reference counter → huge memory stall
● Many more such hot spots: directory entry cache, XFS inode, etc.
Lessons learned
● High locality can cause performance collapse
● Cache hits should be scalable
  → When cache hits are dominant, the scalability of the cache-hit path is what matters
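One well-known way to make such a hot counter scale, sketched below, is to split it per core so the common-case update touches only a core-local cache line and to sum the shards only on the rare slow path (in the spirit of Linux's per-CPU refcounts). The fixed NCPU and the reconciliation policy here are illustrative assumptions, not the authors' proposal.

#include <stdatomic.h>

#define NCPU      80    /* assumption: fixed core count, as on the test machine */
#define CACHELINE 64

/* One shard per core, each on its own cache line to avoid false sharing. */
struct percpu_ref {
    struct {
        _Alignas(CACHELINE) atomic_long value;
    } cpu[NCPU];
};

/* Fast path: touch only the local shard -- no cross-core contention. */
static void ref_get(struct percpu_ref *r, int cpu)
{
    atomic_fetch_add_explicit(&r->cpu[cpu].value, 1, memory_order_relaxed);
}

static void ref_put(struct percpu_ref *r, int cpu)
{
    atomic_fetch_sub_explicit(&r->cpu[cpu].value, 1, memory_order_relaxed);
}

/* Slow path (e.g., before eviction): sum all shards for the true count. */
static long ref_count(struct percpu_ref *r)
{
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += atomic_load(&r->cpu[i].value);
    return sum;
}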
Data block overwrite
● Low sharing (DWOL): ext4, F2FS, and btrfs show performance collapse
● Medium sharing (DWOM): all file systems degrade gradually
[Figure: M ops/sec vs. #core (0–80) for DWOL and DWOM.]
Btrfs is a copy-on-write (CoW) file system
● Directs a write to a block to a new copy of the block
  → Never overwrites the block in place
  → Maintains multiple versions of the file system image
[Figure: a write (W) at time T produces a new file system image at time T+1.]
CoW triggers a disk block allocation for every write
[Figure: a write (W) at time T allocates a new block for the updated data at time T+1.]
● Disk block allocation becomes a bottleneck
● Likewise: ext4 → journaling, F2FS → checkpointing become bottlenecks
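A minimal sketch of the write path makes the cost visible: every overwrite allocates a fresh block, copies, and swings a pointer, so the (shared) allocator sits on the critical path of every write. alloc_block() below is a hypothetical stand-in for btrfs's real allocator, not its actual code.

#include <stdlib.h>
#include <string.h>

#define BLK_SIZE 4096

/* Stand-in for the disk block allocator; in btrfs this is shared
 * state that every writing core must synchronize on. */
static char *alloc_block(void)
{
    return malloc(BLK_SIZE);
}

/* Copy-on-write overwrite: never update the block in place.
 * Assumes off + len <= BLK_SIZE. */
void cow_write(char **blockp, size_t off, const void *buf, size_t len)
{
    char *new_blk = alloc_block();       /* block allocation on EVERY write */
    memcpy(new_blk, *blockp, BLK_SIZE);  /* copy the old contents (time T) */
    memcpy(new_blk + off, buf, len);     /* apply the update */
    *blockp = new_blk;                   /* publish the new version (time T+1) */
    /* The old block is left intact, preserving the previous image. */
}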
Lessons learned
● Overwriting can be as expensive as appending
  → Critical for log-structured (F2FS) and CoW (btrfs) file systems
● Consistency guarantee mechanisms should be scalable
  → Scalable journaling
  → Scalable CoW index structures
  → Parallel log-structured writing
Entire file is locked regardless of update range
● All tested file systems hold the inode mutex for write operations
  – Range-based locking is not implemented

***_file_write_iter(...) {
    mutex_lock(&inode->i_mutex);
    ...
    mutex_unlock(&inode->i_mutex);
}
Lessons learned
● A file cannot be concurrently updated
  – Critical for VMs and DBMSs, which manage large files
● Consider techniques used in parallel file systems
  → E.g., range-based locking, as sketched below
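For illustration, here is a minimal userland sketch of range-based locking: a writer blocks only if its byte range overlaps one already in flight, so disjoint updates to the same file proceed in parallel. The fixed-size table, the linear scan, and ignoring the table-full case are simplifications for brevity, not how a real file system would implement it.

#include <pthread.h>
#include <stdbool.h>

#define MAX_RANGES 128

/* A minimal range lock: a table of in-flight byte ranges guarded by a
 * mutex; writers wait only if their range overlaps another's. */
struct range_lock {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    struct { long start, end; bool used; } r[MAX_RANGES];
};

static bool overlaps(struct range_lock *rl, long start, long end)
{
    for (int i = 0; i < MAX_RANGES; i++)
        if (rl->r[i].used && start < rl->r[i].end && rl->r[i].start < end)
            return true;
    return false;
}

void range_lock_acquire(struct range_lock *rl, long start, long end)
{
    pthread_mutex_lock(&rl->lock);
    while (overlaps(rl, start, end))            /* wait only on overlap */
        pthread_cond_wait(&rl->cv, &rl->lock);
    for (int i = 0; i < MAX_RANGES; i++) {      /* record our range */
        if (!rl->r[i].used) {
            rl->r[i].start = start; rl->r[i].end = end; rl->r[i].used = true;
            break;                              /* table-full case omitted */
        }
    }
    pthread_mutex_unlock(&rl->lock);
}

void range_lock_release(struct range_lock *rl, long start, long end)
{
    pthread_mutex_lock(&rl->lock);
    for (int i = 0; i < MAX_RANGES; i++) {
        if (rl->r[i].used && rl->r[i].start == start && rl->r[i].end == end) {
            rl->r[i].used = false;
            break;
        }
    }
    pthread_cond_broadcast(&rl->cv);            /* wake waiters to re-check */
    pthread_mutex_unlock(&rl->lock);
}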
Summary of findings
● High locality can cause performance collapse
● Overwriting can be as expensive as appending
● A file cannot be concurrently updated
● All directory operations are sequential
● Renaming is system-wide sequential
● Metadata changes are not scalable
● Non-scalability often means wasting CPU cycles
● Scalability is not portable
(See our paper for details.)
Many of these findings are unexpected and counter-intuitive
→ They stem from contention at the file system level to maintain data dependencies
Outline
● Background
● FxMark design
● Analysis of five Linux file systems
● Pilot solution
  – If we remove contention in a file system, is it then scalable?
● Related work
● Summary
RocksDB on a 60-partitioned RAMDISK scales better
[Figure: RocksDB throughput (ops/sec) vs. #core (0–80) for btrfs, ext4, F2FS, tmpfs, and XFS on a single-partition RAMDISK vs. a 60-partition RAMDISK — up to 2.1x improvement. Tested workload: DB_BENCH overwrite.]
→ Reducing contention in the file system improves performance and scalability
But partitioning makes performance worse on HDD
[Figure: throughput (ops/sec) vs. #core (0–20) for btrfs, ext4, F2FS, and XFS on a single-partition HDD vs. a 60-partition HDD — up to 2.7x worse (F2FS). Tested workload: DB_BENCH overwrite.]
→ Reduced spatial locality degrades performance
→ Medium-specific characteristics (e.g., spatial locality) should be considered
Related work
● Scaling operating systems
  – Mostly use a memory file system to factor out the effect of I/O operations
● Scaling file systems
  – Scalable file system journaling
    ● ScaleFS [MIT:MSThesis'14]
    ● SpanFS [ATC'15]
  – Parallel log-structured writing on NVRAM
    ● NOVA [FAST'16]
Summary
● Comprehensive analysis of the manycore scalability of five widely-used file systems using FxMark
● Manycore scalability should be of utmost importance in file system design
● New challenges in scalable file system design
  – Minimizing contention, scalable consistency guarantees, spatial locality, etc.
● FxMark is open source: https://github.com/sslab-gatech/fxmark