DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance
Dang Tang, Yungang Bao,
Weiwu Hu, Mingyu Chen
2010.1
Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
INSTITUTE OF COMPUTING
TECHNOLOGY
The role of I/O
• I/O is ubiquitous
  • Loading binary files: Disk → Memory
  • Browsing the web, media streaming: Network → Memory
  • …
• I/O is significant
  • Many commercial applications are I/O intensive, e.g., databases
State-of-the-Art I/O Technologies
• I/O Bus: 20 GB/s
  • PCI-Express 2.0
  • HyperTransport 3.0
  • QuickPath Interconnect
• I/O Devices
  • SSD RAID: 1.2 GB/s
  • 10GE: 1.25 GB/s
  • Fusion-io: 8 GB/s, 1M IOPS (2KB random 70/30 read/write mix)
Direct Memory Access (DMA)
• DMA is used for I/O operations in all modern computers.
• DMA allows I/O subsystems to access system memory independently of the CPU.
• Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs.
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
An Example of Disk Read: DMA Receiving Operation
[Figure: a DMA engine, a CPU, and memory holding a descriptor, a driver buffer, and a kernel buffer; the steps of the disk read are numbered ① to ④.]
• Cache access latency: ~20 cycles
• Memory access latency: ~200 cycles
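As a rough illustration of why the two latencies matter, the sketch below estimates the cycles a CPU would spend reading a DMA buffer one cache line at a time. The 64-byte line size, the 4 KB buffer, and the assumption that accesses never overlap are illustrative choices, not figures from the slides.

```python
CACHE_LATENCY_CYCLES = 20    # from the slide: ~20 cycles per cache access
MEMORY_LATENCY_CYCLES = 200  # from the slide: ~200 cycles per memory access
LINE_BYTES = 64              # assumed cache-line size (not from the slide)


def fetch_cycles(buffer_bytes: int, latency_cycles: int) -> int:
    """Cycles to read a buffer line by line, assuming no overlap between accesses."""
    lines = -(-buffer_bytes // LINE_BYTES)  # ceiling division
    return lines * latency_cycles


buffer_bytes = 4096  # assumed 4 KB DMA buffer
from_memory = fetch_cycles(buffer_bytes, MEMORY_LATENCY_CYCLES)
from_cache = fetch_cycles(buffer_bytes, CACHE_LATENCY_CYCLES)
print(from_memory, from_cache, from_memory / from_cache)
```

Under these assumptions the gap is simply the ratio of the two latencies; real hardware overlaps accesses, so this is an upper bound on the benefit rather than a prediction.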
Direct Cache Access [Ram-ISCA05]
[Figure: disk-read diagram with a DMA engine, a CPU, and memory holding a descriptor, a driver buffer, and a kernel buffer; steps numbered ① to ④.]
• This is a typical shared-cache scheme
• Prefetch-Hint Approach [Kumar-Micro07]
Problems of Shared-Cache Scheme
• Cache pollution
• Cache thrashing
• Not suitable for other I/O
• Degrades performance when DMA requests are large (>100KB) for the “Oracle + TPC-H” application
To address this problem thoroughly, we need to investigate the characteristics of I/O data.
I/O Data vs. CPU Data
[Figure: diagram labeled MemCtrl, I/O Data, CPU Data, HMTT, and I/O Data + CPU Data.]
A Short Ad for HMTT [Bao-Sigmetrics08]
• A hardware/software hybrid memory trace tool
  • Supports the DDR2 DIMM interface on multiple platforms
  • Collects full-system off-chip memory traces
  • Provides traces with semantic information, e.g.:
    • Virtual address
    • Process ID
    • I/O operation
  • Collects traces of commercial applications, e.g.:
    • Oracle
    • Web server
[Figure: The HMTT System]
Characteristics of I/O Data (1)
[Figure: % of memory references to I/O data]
[Figure: % of references of various I/O types]
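The two percentages named above can be computed from a memory trace along the following lines. This is a minimal sketch: the `TraceRecord` layout and the source labels ("cpu", "disk", "network") are hypothetical, since the actual trace format is not shown in the slides.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class TraceRecord(NamedTuple):
    """Hypothetical record layout; the real trace format is not given here."""
    address: int
    source: str  # assumed labels: "cpu" for CPU data, otherwise an I/O type


def io_reference_share(trace: Iterable[TraceRecord]) -> Tuple[float, Dict[str, float]]:
    """Return (% of all references to I/O data, % of I/O references per I/O type)."""
    counts = Counter(record.source for record in trace)
    total = sum(counts.values())
    io_counts = {src: n for src, n in counts.items() if src != "cpu"}
    io_total = sum(io_counts.values())
    if io_total == 0:
        return 0.0, {}
    overall = 100.0 * io_total / total
    by_type = {src: 100.0 * n / io_total for src, n in io_counts.items()}
    return overall, by_type
```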
Characteristics of I/O Data (2)
• I/O request size distribution
Characteristics of I/O Data (3)
• Sequential access in I/O data
• Compared with CPU data, I/O data is very regular
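One simple way to quantify "sequential" is the fraction of references that touch the block immediately following the previous one. The sketch below uses that definition with an assumed 64-byte block size; the slides do not state which metric was actually used.

```python
from typing import Sequence


def sequential_fraction(addresses: Sequence[int], block_bytes: int = 64) -> float:
    """Fraction of references that hit the block right after the previous reference."""
    blocks = [address // block_bytes for address in addresses]
    if len(blocks) < 2:
        return 0.0
    sequential = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1)
    return sequential / (len(blocks) - 1)
```

A purely streaming buffer scores 1.0 under this metric, while scattered pointer-chasing accesses score close to 0.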
Characteristics of I/O Data (4)
• Reuse Distance (RD): LRU stack distance
[Figure: an example reference stream over blocks 1 to 4 with the LRU stack after each reference, and a CDF plot of reuse distance (x% of references have RD <= n).]
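Reuse distance as an LRU stack distance can be computed with an explicit stack, as sketched below. The convention used here (depth 1 for the most recently used block, `None` for a first touch) is an assumption; the slides do not spell out the exact indexing.

```python
from typing import Hashable, List, Optional, Sequence


def lru_stack_distances(refs: Sequence[Hashable]) -> List[Optional[int]]:
    """LRU stack depth of each reference (1 = most recently used, None = first touch)."""
    stack: List[Hashable] = []  # index 0 holds the most recently used block
    distances: List[Optional[int]] = []
    for block in refs:
        if block in stack:
            depth = stack.index(block)
            distances.append(depth + 1)
            stack.pop(depth)
        else:
            distances.append(None)
        stack.insert(0, block)  # the referenced block moves to the top
    return distances


def cdf_at(distances: Sequence[Optional[int]], n: int) -> float:
    """Fraction of reuses (first touches excluded) whose distance is <= n."""
    reuses = [d for d in distances if d is not None]
    if not reuses:
        return 0.0
    return sum(1 for d in reuses if d <= n) / len(reuses)
```

Evaluating `cdf_at` over a range of `n` gives the kind of "x% of references have RD <= n" curve that the CDF plot describes.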
Characteristics of I/O Data (5)
[Figure: access sequences on I/O data, labeled DMA-W, CPU-R, CPU-RW, CPU-W, and DMA-R.]
Rethink I/O & DMA Operation
• 20~40% of memory references are to I/O data in I/O-intensive applications.
• The characteristics of I/O data differ from those of CPU data:
  • I/O data has an explicit produce-consume relationship
  • The reuse distance of I/O data is smaller than that of CPU data
  • References to I/O data are primarily sequential
• Separating I/O data and CPU data
Separating I/O data and CPU data
[Figure: before separating]
[Figure: after separating]
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
DMA Cache Design Issues
• Write Policy
• Cache Coherence
• Replacement Policy
• Prefetching
Dedicated DMA Cache (DDC)
DMA Cache Design Issues
• Write Policy
  • Adopt a write-allocate policy
  • Both write-back and write-through policies are available
DMA Cache Design Issues
• Cache Coherence
  [Figure: IO-ESI protocol for the WT policy]
  [Figure: IO-MOESI protocol for the WB policy]
  • The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of state transitions are exchanged.
A Big Issue
How can we prove the correctness of integrating heterogeneous cache coherence protocols in one system?
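A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98]
[Figure: a DMA cache ($) and multiple CPU caches ($), each in its own state, are summarized as one global state. OS+I+ is marked √ and MS+I+ is marked X. Example global states EI+, MI+, and S+I+ appear with transition labels R|E, W|*, and R|I.]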
Global State Cache Coherence Theorem
Given N (N>1) well-defined cache protocols, they do not conflict if and only if no Conflict Global State exists in the global state transition machine.
[Figure: global state transition machine over the states I+, S+I+, EI+, MI+, and OS+I+, with edges labeled R|*, W|*, R|I, R|E, and R|M.]
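To make the idea of a Conflict Global State concrete, the sketch below checks one concrete instance of a global state, given as a list of per-cache states. It assumes the usual MOESI invariants (M and E tolerate no other valid copy; at most one O); the slides do not give the formal definition, so treat this as an illustration rather than the authors' checker.

```python
from typing import Sequence

VALID_STATES = {"M", "O", "E", "S", "I"}


def is_conflict(states: Sequence[str]) -> bool:
    """True if the per-cache states violate the assumed MOESI invariants."""
    unknown = set(states) - VALID_STATES
    if unknown:
        raise ValueError(f"unknown cache states: {sorted(unknown)}")
    valid_copies = [s for s in states if s != "I"]
    # M or E means exclusive ownership: no other cache may hold a valid copy.
    if any(s in ("M", "E") for s in valid_copies) and len(valid_copies) > 1:
        return True
    # At most one cache may be the owner of a shared dirty block.
    if valid_copies.count("O") > 1:
        return True
    return False


print(is_conflict(["O", "S", "I"]))  # one instance of OS+I+
print(is_conflict(["M", "S", "I"]))  # one instance of MS+I+
```

A list such as `["O", "S", "I"]` is one expansion of the composite state OS+I+, where `+` stands for one or more caches in that state.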
MOESI + ESI
[Figure: global state transition machine for MOESI combined with ESI, over the states I+, S+I+, ECI+, MCI+, EDI+, and OCS+I+; edges carry labels such as R*|*, RC|E, RD|I, WC|*, WD|SI, and RC|M.]
6 Global States, each marked √:
• S+I+
• ECI*
• I*
• MCI*
• EDI*
• OCS*I*
DMA Cache Design Issues
• Replacement Policy
  • An LRU-like replacement policy
  1. Invalid
  2. Shared
  3. Owned
  4. Exclusive
  5. Modified
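One plausible reading of the numbered list is a victim-selection priority: evict Invalid blocks first and Modified blocks last, falling back to LRU order within a state. The sketch below implements that reading; both the interpretation and the `Block` layout are assumptions, not details confirmed by the slides.

```python
from dataclasses import dataclass
from typing import List

# Assumed eviction priority taken from the list order: lower value is evicted first.
EVICTION_PRIORITY = {"I": 0, "S": 1, "O": 2, "E": 3, "M": 4}


@dataclass
class Block:
    tag: int
    state: str      # one of "I", "S", "O", "E", "M"
    last_used: int  # logical timestamp; smaller means less recently used


def choose_victim(cache_set: List[Block]) -> Block:
    """Pick the block with the lowest state priority, breaking ties by LRU."""
    return min(cache_set, key=lambda b: (EVICTION_PRIORITY[b.state], b.last_used))
```

Under this reading, clean or already-invalid blocks are sacrificed before dirty ones, which avoids write-backs where possible.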
DMA Cache Design Issues
• Prefetching
  • Adopt straightforward sequential prefetching
  • Prefetching is triggered by a cache miss
  • Fetch 4 blocks at a time
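The miss-triggered sequential prefetching described above can be sketched with a toy cache. The unlimited capacity and the reading of "4 blocks" as the missing block plus the next three are assumptions made for illustration.

```python
PREFETCH_DEGREE = 4  # from the slide: fetch 4 blocks at a time


class SequentialPrefetchCache:
    """Toy cache with unlimited capacity, used only to show the prefetch trigger."""

    def __init__(self) -> None:
        self.resident = set()
        self.misses = 0

    def access(self, block: int) -> bool:
        """Return True on a hit; on a miss, fetch the block and its successors."""
        if block in self.resident:
            return True
        self.misses += 1
        # Assumed reading of "fetch 4 blocks": the missing block plus the next three.
        self.resident.update(range(block, block + PREFETCH_DEGREE))
        return False


cache = SequentialPrefetchCache()
hits = sum(cache.access(block) for block in range(8))
print(hits, cache.misses)
```

For a purely sequential stream, only one access in every four misses under these assumptions, which is why regular I/O access patterns suit such a simple prefetcher.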
Design Complexity vs. Design Cost
• Dedicated DMA Cache (DDC)
• Partition-Based DMA Cache (PBDC)
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
Speedup of Dedicated DMA Cache
% of Valid Prefetched Blocks
• DMA caches exhibit impressively high prefetching accuracy.
• This is because I/O data has a very regular access pattern.
Performance Comparisons
Although PBDC requires no additional on-chip storage, it achieves about 80% of DDC’s performance improvement.
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
Conclusions
Thanks!
Questions?
Design Complexity of PBDC
More References on Cache Coherence Protocol Verification
• Fong Pong, Michel Dubois, Formal verification of complex coherence protocols using symbolic state models, Journal of the ACM (JACM), v.45 n.4, p.557-587, July 1998
• Fong Pong, Michel Dubois, Verification techniques for cache coherence protocols, ACM Computing Surveys (CSUR), v.29 n.1, p.82-126, March 1997