Parallel Application Memory Scheduling
Transcript of Parallel Application Memory Scheduling
![Page 1: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/1.jpg)
Parallel Application Memory Scheduling
Eiman Ebrahimi*
Rustam Miftakhutdinov*, Chris Fallin‡
Chang Joo Lee*+, Jose Joao*
Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin
![Page 2: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/2.jpg)
2

Background (system diagram): Core 0 through Core N share a Shared Cache, a Memory Controller, and DRAM Banks 0 through K, the shared memory resources. The chip boundary separates the on-chip components (cores, cache, memory controller) from the off-chip DRAM banks.
![Page 3: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/3.jpg)
Background
Memory requests from different cores interfere in shared memory resources.
Multi-programmed workloads: affects system performance and fairness.
What about a single multi-threaded application?

3

(Same system diagram as the previous slide: Core 0 through Core N, Shared Cache, Memory Controller, DRAM Banks 0 through K, and the chip boundary.)
![Page 4: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/4.jpg)
Memory System Interference in a Single Multi-Threaded Application
Inter-dependent threads from the same application slow each other down.
Most importantly, the critical path of execution can be significantly slowed down.
The problem and goal are very different from interference between independent applications:
Interdependence between threads
Goal: reduce execution time of a single application
No notion of fairness among the threads of the same application

4
![Page 5: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/5.jpg)
Potential in a Single Multi-Threaded Application

(Chart: execution time of hist, mg, cg, is, bt, ft, and their gmean, normalized to a system using FR-FCFS memory scheduling; y-axis from 0 to 1.)

If all main-memory-related interference is ideally eliminated, execution time is reduced by 45% on average.

5
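The gmean bar averages the per-benchmark ratios geometrically. As a minimal sketch (the per-benchmark values below are invented for illustration, not the slide's data), a normalized-execution-time average can be computed like this:

```python
# Illustrative geometric-mean calculation; the input values are made up,
# not the paper's measured results.
import math

def gmean(xs):
    # Geometric mean: exp of the mean of logs.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical normalized execution times (ideal system / FR-FCFS baseline):
normalized = [0.50, 0.60, 0.55, 0.50, 0.60, 0.55]
reduction = 1.0 - gmean(normalized)  # a gmean near 0.55 means ~45% reduction
```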
![Page 6: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/6.jpg)
Outline
Problem Statement
Parallel Application Memory Scheduling
Evaluation
Conclusion

6
![Page 7: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/7.jpg)
7
![Page 8: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/8.jpg)
Parallel Application Memory Scheduler
Identify the set of threads likely to be on the critical path as limiter threads, and prioritize requests from limiter threads.
Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI).
Among non-limiter threads: shuffle priorities of non-limiter threads to reduce inter-thread memory interference, and prioritize requests from threads falling behind others in a parallel for-loop.

8
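A minimal sketch of this priority ordering (the `Request` fields are hypothetical stand-ins for memory-controller state; the real scheduler also applies standard DRAM scheduling rules such as row-hit-first within a priority level):

```python
# Sketch of the PAMS prioritization order described above. Field names are
# illustrative assumptions, not the paper's hardware interface.
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    is_limiter: bool   # thread classified as a limiter (likely on critical path)
    mpki: float        # misses per kilo-instruction of the issuing thread
    shuffle_rank: int  # current shuffled priority among non-limiter threads

def priority_key(req: Request):
    # Lower tuple sorts first (is served first).
    if req.is_limiter:
        # Limiter threads come first; among them, latency-sensitive
        # (lower-MPKI) threads are prioritized.
        return (0, req.mpki)
    # Non-limiter threads follow, ordered by their shuffled rank.
    return (1, req.shuffle_rank)

def schedule(requests):
    return sorted(requests, key=priority_key)
```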
![Page 9: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/9.jpg)
9
![Page 10: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/10.jpg)
Runtime System Limiter Identification
Contended critical sections are often on the critical path of execution.
Extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
Track the total amount of time all threads wait on each lock in a given interval.
Identify the lock with the largest waiting time as the most contended.
The thread holding the most contended lock is a limiter, and this information is exposed to the memory controller.

10
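The steps above can be sketched as a small bookkeeping structure (names and interfaces are illustrative; in the paper this logic lives in the runtime system, which exposes the limiter to the hardware memory controller):

```python
# Sketch of runtime-system limiter identification: per-interval lock wait
# times are accumulated, and the holder of the most contended lock is the
# limiter. All names here are illustrative assumptions.

class LimiterIdentifier:
    def __init__(self):
        self.wait_time = {}  # lock address -> total wait time this interval
        self.holder = {}     # lock address -> thread currently holding it

    def record_wait(self, lock_addr, waited):
        # Called when a thread finishes waiting on a lock: accumulate the
        # time all threads spent waiting on this lock in this interval.
        self.wait_time[lock_addr] = self.wait_time.get(lock_addr, 0) + waited

    def record_acquire(self, lock_addr, thread_id):
        self.holder[lock_addr] = thread_id

    def end_interval(self):
        # The lock with the largest accumulated waiting time is the most
        # contended; the thread holding it is classified as the limiter.
        if not self.wait_time:
            return None
        hottest = max(self.wait_time, key=self.wait_time.get)
        limiter = self.holder.get(hottest)
        self.wait_time.clear()  # start a fresh interval
        return limiter
```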
![Page 11: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/11.jpg)
Prioritizing Requests from Limiter Threads

11

(Timeline diagram: threads A through D run critical sections and non-critical sections between barriers, with periods spent waiting for a sync or lock. Limiter thread identification selects Critical Section 1 as the most contended, and the limiter thread changes over time: D, B, C, A. Prioritizing the limiter's requests shortens the critical path, saving cycles.)
![Page 12: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/12.jpg)
12
![Page 13: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/13.jpg)
Time-Based Classification of Threads as Latency- vs. BW-Sensitive

13

(Timeline diagram: threads A through D between barriers, divided into Time Interval 1 and Time Interval 2; legend shows critical section, barrier, non-critical section, and waiting for sync.)

Time-based classification is used in Thread Cluster Memory Scheduling (TCM) [Kim et al., MICRO'10].
![Page 14: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/14.jpg)
Terminology
A code-segment is defined as a program region between two consecutive synchronization operations, identified with a 2-tuple: <beginning IP, lock address>.
Code-segments are important for classifying threads as latency- vs. bandwidth-sensitive: classification can be time-based or code-segment based.

14
![Page 15: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/15.jpg)
Code-Segment Based Classification of Threads as Latency- vs. BW-Sensitive

15

(Timeline diagram: threads A through D between barriers, as in the time-based view, but classification intervals are delimited by code-segment changes (Code Segment 1, Code Segment 2) instead of fixed time intervals; legend shows critical section, barrier, non-critical section, and waiting for sync.)
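One way to sketch code-segment-based classification (the <beginning IP, lock address> tuple is the slide's definition; the MPKI threshold and method names are illustrative assumptions):

```python
# Sketch of code-segment-based classification: memory intensity is tracked
# per code-segment, so when a thread re-enters a known segment it can be
# classified immediately. Threshold and names are illustrative assumptions.

MPKI_THRESHOLD = 1.0  # below: latency-sensitive; above: bandwidth-sensitive

class SegmentClassifier:
    def __init__(self):
        # A code-segment is identified by <beginning IP, lock address>.
        self.segment_mpki = {}

    def record(self, begin_ip, lock_addr, mpki):
        # Update the observed memory intensity of this code-segment.
        self.segment_mpki[(begin_ip, lock_addr)] = mpki

    def classify(self, begin_ip, lock_addr):
        mpki = self.segment_mpki.get((begin_ip, lock_addr))
        if mpki is None:
            return "unknown"
        return ("latency-sensitive" if mpki < MPKI_THRESHOLD
                else "bandwidth-sensitive")
```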
![Page 16: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/16.jpg)
16
![Page 17: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/17.jpg)
Shuffling Priorities of Non-Limiter Threads
Goal: reduce inter-thread interference among a set of threads with the same importance in terms of our estimation of the critical path, and prevent any of these threads from becoming new bottlenecks.
Basic idea: give each thread a chance to be high priority in the memory system, exploiting intra-thread bank parallelism and row-buffer locality.
Every interval, assign a set of random priorities to the threads, and shuffle the priorities at the end of the interval.

17
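The per-interval shuffle can be sketched as assigning a fresh random permutation of priority ranks each interval (a sketch only; the actual policies and interval lengths are chosen by PAMS at runtime):

```python
# Sketch of interval-based priority shuffling for non-limiter threads:
# each interval, every thread gets a distinct random rank, so each thread
# periodically runs as highest priority in the memory system.
import random

def shuffle_priorities(thread_ids, rng=random):
    # Assign a random permutation of ranks 1..N (1 = highest priority).
    ranks = list(range(1, len(thread_ids) + 1))
    rng.shuffle(ranks)
    return dict(zip(thread_ids, ranks))

# At the end of every interval the controller would call this again,
# handing the top slot to a different thread.
```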
![Page 18: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/18.jpg)
Shuffling Priorities of Non-Limiter Threads

18

(Timeline diagrams compare a baseline with no shuffling against two shuffling policies, Policy 1 and Policy 2, for two cases: threads with similar memory behavior and threads with different memory behavior. Legend: active vs. waiting. Depending on the threads' memory behavior, each policy can save or lose cycles relative to the baseline.)

PAMS dynamically monitors memory intensities and chooses the appropriate shuffling policy for non-limiter threads at runtime.
![Page 19: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/19.jpg)
Outline
Problem Statement
Parallel Application Memory Scheduling
Evaluation
Conclusion

19
![Page 20: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/20.jpg)
Evaluation Methodology
x86 cycle-accurate simulator
Baseline processor configuration:
Per-core: 4-wide issue, out-of-order, 64-entry ROB
Shared (16-core system): 128 MSHRs; 4 MB, 16-way L2 cache
Main memory: DDR3 1333 MHz; latency of 15 ns per command (tRP, tRCD, CL); 8-byte-wide core-to-memory bus

20
![Page 21: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/21.jpg)
PAMS Evaluation

21

(Chart: execution time of hist, mg, cg, is, bt, ft, and their gmean, normalized to FR-FCFS, comparing the thread cluster memory scheduler [Kim+, MICRO'10], a thread-criticality-prediction-based memory scheduler, and the parallel application memory scheduler. PAMS outperforms TCM by 13% and the TCP-based scheduler by 7% on average.)

Thread criticality predictors (TCP) [Bhattacharjee+, ISCA'09]
![Page 22: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/22.jpg)
22

Sensitivity to System Parameters

L2 Cache Size                4 MB        8 MB        16 MB
Δ FR-FCFS                   -10.5%      -15.9%      -16.7%

Number of Memory Channels   1 Channel   2 Channels   4 Channels
Δ FR-FCFS                   -10.4%      -11.6%       -16.7%
![Page 23: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/23.jpg)
Conclusion
Inter-thread main memory interference within a multi-threaded application increases execution time.
Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
Identifying a set of threads likely to be on the critical path and prioritizing requests from them
Periodically shuffling priorities of non-likely-critical threads to reduce inter-thread interference among them
PAMS significantly outperforms:
The best previous memory scheduler designed for multi-programmed workloads
A memory scheduler that uses a state-of-the-art thread criticality predictor (TCP)

23
![Page 24: Parallel Application Memory Scheduling](https://reader036.fdocuments.net/reader036/viewer/2022081420/568166ea550346895ddb3104/html5/thumbnails/24.jpg)
Parallel Application Memory Scheduling

Eiman Ebrahimi*
Rustam Miftakhutdinov*, Chris Fallin‡
Chang Joo Lee*+, Jose Joao*
Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin