Gyusun Lee¹*, Wenjing Jin²*, Wonsuk Song¹, Jeonghun Gong², Jonghyun Bae², Tae Jun Ham², Jae W. Lee², and Jinkyu Jeong¹
Sungkyunkwan University (SKKU)¹
Seoul National University (SNU)²
A Case for Hardware-Based Demand Paging
ISCA 2020
1
* Equally contributed
Demand Paging
§ Core mechanism of virtual memory
§ Memory as a cache of backing storage
§ Prevalent from mobile to cloud computing
§ Benefits
  § Fewer I/Os needed
  § Less memory needed
  § Faster response
  § Better support for multiple processes
2
Source: A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts, 9th Edition, John Wiley & Sons, Inc. 2014.
[Figure: Storage performance trends, 1985-2020: time (ns, log scale) for disk seek, SSD access, DRAM access, SRAM access, and CPU cycle]
Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015
Bottleneck Shift in Demand Paging
3
Ultra-low latency SSDs (Samsung Z-SSD, etc.) bring device time down to 11.35μs.
[Figure: Breakdown of page fault handling: CPU-side steps (page table lookup, exception, I/O stack, exception, interrupt delivery, context switching, I/O completion, PTE update) surrounding the device time]
Bottleneck Shift in Demand Paging
4
With ultra-low latency SSDs (device time 11.35μs), the CPU-side page fault handling (w/ and w/o exception) amounts to 76% of device time!!
Indirect Cost of Demand Paging
§ Architecture resource pollution (branch mispredictions, cache misses) by page faults
5
[Figure: Normalized user-level IPC, branch misses, L2 cache misses, and LLC load misses; interleaving user instructions with page fault exceptions pollutes the CPU pipeline, branch predictor, and data cache]
Why Doesn't Hardware Handle Page Faults?
6
Source: R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces, Arpaci-Dusseau Books LLC, 2018.
Operating System-Based Demand Paging (OSDP)
7
[Diagram: the application issues a virtual address; the core's MMU walks the page table (entries with present bit 0 are NULL, entries with present bit 1 hold a PFN); a page miss raises a page fault exception into the operating system (OS), which performs the I/O from disk and fills the PTE]
Hardware-Based Demand Paging (HWDP)
8
[Diagram: the application issues a virtual address; on a page miss, the MMU reads the LBA-augmented page table and hands the miss to the SMU (Storage Management Unit), which performs the file I/O from disk directly; no page fault exception is raised, and the OS contributes only a syscall extension and a kernel thread]
§ Challenges
  § How to make a CPU understand storage layout?
    § LBA*-augmented page table
  § How to make a CPU control a storage I/O device?
    § Storage Management Unit
  § How to handle remaining page fault operations in the OS kernel?
    § OS extensions (syscall extension, kernel thread, ...)
§ Benefits
  § Eliminates OS intervention (no exception occurs)
  § Reduces demand paging latency
  § Mitigates architecture resource pollution
*LBA: Logical Block Address (= sector #)
Outline
§ Overview
§ Architectural extensions
  § LBA-augmented page table
  § Storage Management Unit (SMU)
§ OS supports
  § Fast file mmap()
  § Updating OS metadata for HWDP
  § Management of free page queue
§ Evaluation
§ Conclusion
9
LBA-augmented Page Table
10
§ The CPU understands storage layout through the LBA-augmented page table
§ 4-level page table: PGD → PUD → PMD → PTE

Conventional PTE (resident page):     Page Frame Number (PFN) | Protection Bits | Present bit = 1
Conventional PTE (non-resident page): Ignored | Present bit = 0
LBA-augmented PTE (non-resident):     Socket ID | Device ID | LBA | Protection Bits | LBA bit = 1 | Present bit = 0
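The three PTE forms above can be sketched as plain bit manipulation. The talk does not fix exact field offsets or widths, so the layout below (present at bit 0, LBA bit at bit 1, a 38-bit PFN/LBA field at bit 12, device and socket IDs above it) is an assumption for illustration only:

```c
#include <stdint.h>

/* Assumed bit layout; the slides only name the fields, not their positions. */
#define PTE_PRESENT (1ULL << 0)   /* present bit */
#define PTE_LBA     (1ULL << 1)   /* LBA bit (meaningful when present == 0) */

/* Conventional PTE for a resident page: PFN, protection bits, present = 1 */
static uint64_t make_resident_pte(uint64_t pfn, uint64_t prot) {
    return (pfn << 12) | (prot << 2) | PTE_PRESENT;
}

/* LBA-augmented PTE for a non-resident page: socket ID, device ID and LBA
 * encoded where the PFN would go, LBA bit = 1, present = 0 */
static uint64_t make_lba_pte(uint64_t sock, uint64_t dev, uint64_t lba,
                             uint64_t prot) {
    return (sock << 58) | (dev << 50) | (lba << 12) | (prot << 2) | PTE_LBA;
}

/* The MMU hands an entry to the SMU only when it is non-resident AND carries
 * an LBA; an all-zero conventional entry still takes the normal fault path. */
static int smu_handles(uint64_t pte) {
    return !(pte & PTE_PRESENT) && (pte & PTE_LBA);
}

/* Extract the 38-bit LBA field (bits 12..49 in this assumed layout). */
static uint64_t pte_lba(uint64_t pte) {
    return (pte >> 12) & ((1ULL << 38) - 1);
}
```

Under this sketch, a resident page and a conventional non-resident page both bypass the SMU; only the LBA-augmented form triggers hardware miss handling.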
HWDP with LBA-augmented Page Table
11
[Diagram: the application issues a virtual address; on a page miss, the MMU passes the LBA-augmented PTE to the SMU, which performs the I/O from disk]

LBA-augmented PTE:              Sock. ID | Dev. ID | LBA | Prot. Bits | LBA bit = 1 | Present bit = 0
After page miss handling
(OS metadata not yet updated):  PFN | Prot. Bits | LBA bit = 1 | Present bit = 1
After OS metadata update
(conventional PTE, resident):   PFN | Prot. Bits | LBA bit = 0 | Present bit = 1
Storage Management Unit (SMU)
12
§ Additional architectural component to handle page misses in hardware
[Diagram: per-core MMUs (Core #0-#3) reach the SMU, composed of a Page Miss Handler and an NVMe Host Controller, over the memory bus; the NVMe Host Controller talks to the NVMe SSD through the PCIe root complex and PCIe bus]
SMU – Page Miss Handler
13-21
[Diagram: cores #0-#3 send (PTEP*, Device ID, LBA) to the Page Miss Handler, which contains the Page Miss Status Holding Registers (PMSHR; fields: PTEP (key), DevID, LBA, PFN), a Free Page Fetcher backed by a free page queue in memory, and a Page Table Updater operating on the LBA-augmented page table, next to the NVMe Host Controller]
1. PMSHR lookup, keyed by the PTE address (PTEP)
2. PMSHR miss: a new entry is allocated for the missing PTEP
3. Free page retrieval: the Free Page Fetcher takes a (PFN, DMAP*) pair from the free page queue
4. PMSHR update with the allocated PFN
5. I/O request issued to the NVMe Host Controller
6. I/O completion delivered by the NVMe Host Controller
7. PTE update: the Page Table Updater installs the PFN and sets the present bit
8. PMSHR entry deletion & broadcast of the completion to the cores

*PTEP: PTE address  *DMAP: DMA address
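Steps 1-8 of the Page Miss Handler can be sketched as a toy software model. Everything here is a simplified stand-in (a 4-entry PMSHR, a flat array as the page table, an I/O completion driven by the caller), not the hardware design itself:

```c
#include <stdint.h>

#define PMSHR_SIZE 4
#define PTE_PRESENT 1ULL

struct pmshr_entry { uint64_t ptep, lba, pfn; int valid; };

static struct pmshr_entry pmshr[PMSHR_SIZE];
static uint64_t free_queue[8] = {100, 101, 102, 103}; /* pre-filled PFNs */
static int free_head;
static uint64_t page_table[16];  /* toy page table indexed by PTEP */

/* Steps 1-5: returns the PMSHR slot of a newly issued I/O, or -1 if the
 * miss coalesced with an in-flight entry for the same PTEP (or PMSHR full). */
static int issue_miss(uint64_t ptep, uint64_t lba) {
    int i, slot = -1;
    for (i = 0; i < PMSHR_SIZE; i++) {            /* 1. PMSHR lookup (key: PTEP) */
        if (pmshr[i].valid && pmshr[i].ptep == ptep)
            return -1;                            /* duplicate miss: coalesce */
        if (!pmshr[i].valid) slot = i;
    }
    if (slot < 0) return -1;                      /* PMSHR full in this toy model */
    pmshr[slot] = (struct pmshr_entry){ptep, lba, 0, 1};   /* 2. allocate entry */
    pmshr[slot].pfn = free_queue[free_head++];    /* 3-4. free page + PMSHR update */
    /* 5. I/O request handed to the NVMe Host Controller (not modeled) */
    return slot;
}

/* Steps 6-8: on I/O completion, install the PFN and retire the entry. */
static void complete_io(int slot) {
    page_table[pmshr[slot].ptep] =
        (pmshr[slot].pfn << 12) | PTE_PRESENT;    /* 7. PTE update */
    pmshr[slot].valid = 0;                        /* 8. delete & broadcast */
}
```

The PMSHR plays the same role MSHRs play in non-blocking caches: concurrent misses to the same PTE are merged into one outstanding I/O.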
SMU – NVMe Host Controller
22
[Diagram: the NVMe Host Controller holds the NVMe queue descriptor registers; on an I/O request from the Page Miss Handler it writes the command into the submission queue in memory, the NVMe SSD, reached through the PCIe root complex and PCIe bus, posts the I/O completion to the completion queue, and the controller detects it by snooping on memory]
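The queue interaction above can be modeled as two ring buffers: the host writes a command into the in-memory submission queue (SQ) and rings the doorbell; the device consumes it and writes a completion into the completion queue (CQ), which the host notices by watching memory. The queue depth and command layout below are illustrative assumptions, not the NVMe specification:

```c
#include <stdint.h>

#define QDEPTH 8
struct nvme_cmd { uint64_t lba, dmap; };

static struct nvme_cmd sq[QDEPTH];   /* submission queue in memory */
static uint64_t cq[QDEPTH];          /* completion queue (completed LBAs) */
static int sq_tail, sq_head, cq_tail, cq_snoop;

/* Host side: write the command to memory, then ring the SQ doorbell. */
static void submit_io(uint64_t lba, uint64_t dmap) {
    sq[sq_tail % QDEPTH] = (struct nvme_cmd){lba, dmap};
    sq_tail++;                       /* doorbell: advance the tail */
}

/* Device side: consume one SQ entry and post a completion to the CQ. */
static void device_step(void) {
    if (sq_head == sq_tail) return;
    struct nvme_cmd c = sq[sq_head % QDEPTH];
    sq_head++;
    cq[cq_tail % QDEPTH] = c.lba;    /* completion written to memory */
    cq_tail++;
}

/* Host side: snoop the CQ memory for a new completion.
 * Returns the completed LBA, or UINT64_MAX if nothing new arrived. */
static uint64_t snoop_completion(void) {
    if (cq_snoop == cq_tail) return UINT64_MAX;
    return cq[(cq_snoop++) % QDEPTH];
}
```

The design choice reflected here is that the SMU polls (snoops) for completions in hardware rather than taking an interrupt, which was one of the OSDP overheads shown earlier.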
Outline
§ Overview
§ Architectural extensions
  § LBA-augmented page table
  § Storage Management Unit (SMU)
§ OS supports
  § Fast file mmap()
  § Updating OS metadata for HWDP
  § Management of free page queue
§ Evaluation
§ Conclusion
23
Fast File mmap()
24
§ A new interface to utilize HWDP
  § Allocates a region of a memory-mapped file to use HWDP
  § Updates associated metadata lazily
  § Currently, only for files that are not shared across multiple processes
[Diagram: mmap(MAP_HWDP) creates a fast file mmap area in virtual memory; walking PGD → PUD → PMD, each PTE is filled from the OS page cache on a page cache hit (PFN, present = 1) or from the file system on a page cache miss (LBA, present = 0)]
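Per page, the fast mmap() path fills the PTE without doing any I/O: cached pages get a conventional resident PTE, uncached pages get an LBA-augmented one resolved from the file system's block mapping. The sketch below models that decision; `page_cache_lookup` and `extent_lookup` are hypothetical stand-ins (with canned answers), not kernel APIs:

```c
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_LBA     (1ULL << 1)

/* Hypothetical page cache: PFN if the file page is cached, 0 otherwise.
 * In this toy model, only file offset 0 is cached. */
static uint64_t page_cache_lookup(uint64_t file_off) {
    return file_off == 0 ? 77 : 0;
}

/* Hypothetical FS extent map: file offset -> LBA (a contiguous extent
 * of 512-byte sectors starting at LBA 0x1000 in this toy model). */
static uint64_t extent_lookup(uint64_t file_off) {
    return 0x1000 + file_off / 512;
}

/* Fill one PTE at mmap time, without issuing any I/O. */
static uint64_t map_one_page(uint64_t file_off) {
    uint64_t pfn = page_cache_lookup(file_off);
    if (pfn)                              /* page cache hit: resident PTE */
        return (pfn << 12) | PTE_PRESENT;
    /* page cache miss: LBA-augmented PTE; the SMU will do the read lazily */
    return (extent_lookup(file_off) << 12) | PTE_LBA;
}
```

This is why the mmap() itself is fast: the expensive reads are deferred until the SMU actually sees a miss on the page.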
Updating OS Metadata for HWDP
25
§ Introduces a kernel thread (kpted) to update OS-managed metadata in the background
§ kpted periodically scans the page table (PGD → PUD → PMD → PTE):

if (PUD→LBA bit) {
    Clears PUD→LBA bit;
    Goes to next level (PMD);
}
if (PMD→LBA bit) {
    Clears PMD→LBA bit;
    Goes to next level (PTE);
}
if (PTE→LBA bit && PTE→present bit) {
    Updates OS metadata; (inserts page into LRU list, ...)
    Clears PTE→LBA bit;
}
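The leaf-level step of the scan above can be sketched concretely. This toy version walks a flat array of PTEs (the real kpted walks PGD → PUD → PMD → PTE, using upper-level LBA bits as "descend here" hints) and treats LRU insertion as appending the PFN to a list:

```c
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_LBA     (1ULL << 1)

/* Scan n PTEs; for each entry the SMU made resident (LBA bit still set AND
 * present bit set), update OS metadata (append page to the LRU list) and
 * clear the LBA bit. Returns how many pages were processed. */
static int kpted_scan(uint64_t *pte, int n, uint64_t *lru, int *lru_len) {
    int updated = 0;
    for (int i = 0; i < n; i++) {
        if ((pte[i] & PTE_LBA) && (pte[i] & PTE_PRESENT)) {
            lru[(*lru_len)++] = pte[i] >> 12;  /* insert page into LRU list */
            pte[i] &= ~PTE_LBA;                /* metadata is now up to date */
            updated++;
        }
    }
    return updated;
}
```

Entries that are present but have no LBA bit were never HWDP-handled (or were already processed), and entries with only the LBA bit set are still waiting for their I/O, so both are skipped.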
Management of Free Page Queue
§ Synchronous free page refill
  § The SMU raises a page fault exception when it fails to allocate a free page
  § The OS page fault handler handles the page fault and refills the free page queue
§ Asynchronous free page refill
  § A background kernel thread (kpoold) periodically refills free pages
  § 44-78% reduction of synchronous page refills (in YCSB workloads)
26
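The effect of the asynchronous refill can be illustrated with a toy simulation: each page miss consumes one free page, and kpoold's periodic wakeup is modeled as topping the queue up every `period` misses. The queue size and refill policy here are arbitrary assumptions, chosen only to show why the background thread reduces synchronous refills:

```c
/* Run `misses` page misses against a free page queue of capacity `qcap`.
 * period == 0 disables the background (kpoold-style) refill. Each time the
 * queue runs dry, count one synchronous refill (a page fault exception
 * in the real design) and refill it on the spot. */
static int run(int misses, int qcap, int period) {
    int level = qcap, sync_refills = 0;
    for (int i = 0; i < misses; i++) {
        if (period && i % period == 0)
            level = qcap;             /* kpoold wakes up and refills */
        if (level == 0) {             /* SMU failed to allocate a free page */
            sync_refills++;           /* exception: OS refills the queue */
            level = qcap;
        }
        level--;                      /* SMU consumes one free page */
    }
    return sync_refills;
}
```

With a wakeup period shorter than the time to drain the queue, the synchronous path is never taken; with a longer period, it is merely reduced, matching the 44-78% reduction reported for YCSB.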
Summary - Demand Paging Comparison
27
[Timeline diagram: Under OS-based Demand Paging (OSDP), every page miss serializes on the faulting CPU: user execution, FS*, block I/O, OS metadata update, and page refill. Under Hardware-based Demand Paging (HWDP) with fast mmap(), the CPU runs only a short FS step per page miss and otherwise interleaves user execution with the I/O, while kpted (background) performs the OS metadata updates and kpoold (background) performs the page refills on periodic wakeups.]
*FS: File System
Outline
§ Overview
§ Architectural extensions
  § LBA-augmented page table
  § Storage Management Unit (SMU)
§ OS supports
  § Fast file mmap()
  § Updating OS metadata for HWDP
  § Management of free page queue
§ Evaluation
§ Conclusion
28
Evaluation
29
§ Experimental Setup
  § Micro-architecture-level evaluation
    § Gem5-based full-system simulator integrated with the SSD model
  § End-to-end system-level evaluation
    § Implementing HWDP on a real x86 machine (software-emulated SMU)
    § Delta: measured difference in single page miss handling time between the software-emulated SMU (x86 machine) and the Gem5-based full-system simulator
    § HWDP performance estimation (Ops/sec) = # of operations / (execution time of workload − # of page misses × Delta)
§ x86 machine configuration
  Server:          Dell R730
  OS:              Ubuntu 16.04.6
  Base kernel:     Linux 4.9.30
  CPU:             Intel Xeon E5-2640v3 2.6GHz, 8 physical cores (Hyper-Threading on)
  Memory:          DDR4 32GB
  Storage device:  Samsung SZ985 800GB Z-SSD
Page Miss Analysis
30-41
Page Miss Timeline (HWDP; a page miss travels MMU → Page Miss Handler → NVMe Host Controller → NVMe SSD and back):

Step                                               Time (ns)   Cycles
MMU → Page Miss Handler: bus access                0.36        1
PMSHR lookup                                       1.79        5
Retrieve free page / I/O request                   0.36        1
PMSHR update / buffer write / prefetch free page   0.36        1
Write NVMe cmd. to memory                          77.16       -
Ring SQ doorbell                                   1.60        -
NVMe SSD: device time                              11,350      -
Snoop memory bus / I/O completion                  0.71        2
PTE, PMD, PUD read / ring CQ doorbell              18.20       51
PTE, PMD, PUD update                               16.43       46
Notify MMU / delete PMSHR entry                    0.71        2

Page Miss Breakdown:
        Before device I/O   Device time   After device I/O
OSDP    2.46μs              11.35μs       6.20μs
HWDP    0.08μs              11.35μs       0.04μs
→ 43% reduction in total page miss handling time
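The 43% figure in the breakdown follows directly from the three components, as a quick arithmetic check shows:

```c
/* Total page miss handling time (μs): before device I/O + device time +
 * after device I/O, using the breakdown numbers from the slide. */
static double total(double before, double device, double after) {
    return before + device + after;
}
```

Summing gives 20.01μs for OSDP versus 11.47μs for HWDP, a reduction of about 43%, dominated by the unchanged 11.35μs of device time on both sides.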
Application Performance
42
§ Throughput improvement by HWDP over OSDP
[Figure: Throughput improvement (0-60%) for FIO, DBbench, and YCSB-A through YCSB-E, each with 1, 2, 4, and 8 threads]

Workloads:              FIO, DBbench, YCSB-A, YCSB-B, YCSB-C, YCSB-D, YCSB-E
Dataset / Working set:  64/16 (GB), 64/128 (GB)
Access pattern:         Uniform, Zipfian, Zipfian, Zipfian, Latest, Zipfian
Read:Write:             100:0, 100:0, 50:50, 95:5, 100:0, 95:5, 95:5
Architecture Resource Pollution
43
§ Comparison of micro-architectural events in the YCSB-C workload
[Figure: OSDP vs. HWDP, normalized values: throughput (7.0% ▲ for HWDP) alongside user-level IPC, branch misses, L2 cache misses, and LLC load misses]
Conclusion
§ Scaling SSD performance demands new directions
§ A case for hardware-based demand paging
  § New architectural extensions + OS supports
  § Page misses (mostly) handled in hardware
  § Fast demand paging performance + improved user-level IPC
  § Up to 27% throughput improvement in a realistic workload (YCSB-C on RocksDB)
44
Q&A
§ Thank you
45