ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
-
Upload
nasim-caldwell -
Category
Documents
-
view
37 -
download
5
description
Transcript of ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
![Page 1: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/1.jpg)
ATLASA Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
Yoongu KimDongsu HanOnur Mutlu
Mor Harchol-Balter
![Page 2: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/2.jpg)
2
Motivation
Modern multi-core systems employ multiple memory controllers
Applications contend with each other in multiple controllers
How to perform memory scheduling for multiple controllers?
![Page 3: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/3.jpg)
3
Desired Properties of Memory Scheduling Algorithm
Maximize system performance Without starving any cores
Configurable by system software To enforce thread priorities and QoS/fairness policies
Scalable to a large number of controllers Should not require significant coordination between
controllers
No previous scheduling algorithm satisfies all these requirements
Multiple memory controllers
![Page 4: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/4.jpg)
4
Multiple Memory Controllers
Core
MC Memory
Single-MC system
Multiple-MC system
Difference?
The need for coordination
Core MC Memory
MC Memory
![Page 5: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/5.jpg)
5
MC 1 T1 T2
Memory service timeline
Thread 1’s request
Thread 2’s request
T2
Thread Ranking in Single-MC
# of requests: Thread 1 < Thread 2
Thread 1 Shorter job
Optimal averagestall time: 2TThread
2STALL
Thread 1
STALL
Execution timeline
Thread ranking: Thread 1 > Thread 2
Thread 1 Assigned higher rank
Assume all requests are to the same bank
![Page 6: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/6.jpg)
6
Thread Ranking in Multiple-MC
MC 1 T2 T1T2
MC 2 T1 T1T1
MC 1 T1 T2T2
MC 2 T1 T1T1
Uncoordinated
Coordinated
MC 1’s shorter job: Thread 1Global shorter job: Thread 2
MC 1 incorrectly assigns higher rank to Thread 1
Global shorter job: Thread 2
MC 1 correctly assigns higher rank to Thread 2
Coordination
Coordination Better scheduling decisions
Thread 1
Thread 2
STALL
STALL
Avg. stall time: 3T Thread
1Thread
2
STALL
STALLSAVED CYCLE
S!
Avg. stall time: 2.5T
![Page 7: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/7.jpg)
7
Coordination Limits Scalability
Coordination?
MC 1 MC 2
MC 3 MC 4
Meta-MC
MC-to-MC
Meta-MC
To be scalable, coordination should: exchange little information occur infrequently
Consumes bandwidth
![Page 8: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/8.jpg)
8
The Problem and Our Goal
Problem: Previous memory scheduling algorithms are not scalable to many controllers
Not designed for multiple MCs Require significant coordination
Our Goal: Fundamentally redesign the memory scheduling algorithm such that it
Provides high system throughput Requires little or no coordination among MCs
![Page 9: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/9.jpg)
9
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 10: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/10.jpg)
10
Rethinking Memory SchedulingA thread alternates between two states (episodes)
Compute episode: Zero outstanding memory requests High IPC
Memory episode: Non-zero outstanding memory requests Low IPC
Goal: Minimize time spent in memory episodes
Ou
tsta
nd
ing
m
em
ory
re
qu
est
s
TimeMemory episode
Compute episode
![Page 11: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/11.jpg)
11
How to Minimize Memory Episode Time
Minimizes time spent in memory episodes across all threads Supported by queueing theory:
Shortest-Remaining-Processing-Time scheduling is optimal in single-server queue
Remaining length of a memory episode?
Prioritize thread whose memory episode will end the soonest
Time
Outs
tandin
g
mem
ory
re
quest
s
How much longer?
![Page 12: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/12.jpg)
Predicting Memory Episode Lengths
Large attained service Large expected remaining service
Q: Why?A: Memory episode lengths are Pareto distributed…
12
We discovered: past is excellent predictor for future
Time
Outs
tandin
g
mem
ory
re
quest
s
Remaining serviceFUTURE
Attained service
PAST
![Page 13: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/13.jpg)
13
Pareto Distribution of Memory Episode Lengths
401.bzip2
Favoring least-attained-service memory episode
= Favoring memory episode which will end the soonest
Pr{
Mem
. epi
sode
> x
}
x (cycles)
Memory episode lengths of SPEC benchmarks
Pareto distribution
Attained service correlates with remaining service
The longer an episode has lasted The longer it will last
further
![Page 14: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/14.jpg)
14
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 15: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/15.jpg)
15
Prioritize the job with shortest-remaining-processing-
time
Provably optimal
Remaining service: Correlates with attained service
Attained service: Tracked by per-thread counter
Least Attained Service (LAS) Memory Scheduling
Prioritize the memory episode with least-remaining-service
Our Approach Queueing Theory
Least-attained-service (LAS)
scheduling: Minimize memory
episode time
However, LAS does not consider
long-term thread behavior
Prioritize the memory episode with least-attained-service
![Page 16: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/16.jpg)
16
Long-Term Thread Behavior
Mem.episode
Thread 1 Thread 2
Short-termthread behavior
Mem.episode
Long-termthread behavior
Compute episode
Compute episode
>priority
<priority
Prioritizing Thread 2 is more beneficial: results in very long stretches of compute episodes
Short memory episode Long memory episode
![Page 17: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/17.jpg)
17
Quantum-Based Attained Service of a Thread
TimeOuts
tandin
g
mem
ory
re
quest
s
Attained service
Short-termthread
behavior
We divide time into large, fixed-length intervals: quanta (millions of cycles)
Attained service
Long-termthread
behavior Outs
tandin
g
mem
ory
re
quest
s
Time
…Quantum (millions of
cycles)
![Page 18: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/18.jpg)
18
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 19: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/19.jpg)
19
LAS Thread Ranking
Each thread’s attained service (AS) is tracked by MCs
ASi = A thread’s AS during only the i-th quantum
Each thread’s TotalAS computed as:
TotalASi = α · TotalASi-1 + (1- α) · ASi
High α More bias towards history
Threads are ranked, favoring threads with lower TotalAS
Threads are serviced according to their ranking
During a quantum
End of a quantum
Next quantum
![Page 20: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/20.jpg)
20
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 21: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/21.jpg)
21
ATLAS Scheduling Algorithm
ATLAS Adaptive per-Thread Least Attained Service
Request prioritization order 1. Prevent starvation: Over threshold request 2. Maximize performance: Higher LAS rank 3. Exploit locality: Row-hit request 4. Tie-breaker: Oldest request
How to coordinate MCs to agree upon a consistent ranking?
![Page 22: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/22.jpg)
22
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 23: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/23.jpg)
23
ATLAS Coordination Mechanism
During a quantum: Each MC increments the local AS of each thread
End of a quantum: Each MC sends local AS of each thread to centralized meta-MC Meta-MC accumulates local AS and calculates ranking Meta-MC broadcasts ranking to all MCs
Consistent thread ranking
![Page 24: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/24.jpg)
Coordination Cost in ATLAS
24
ATLASPAR-BS
(previous best work [ISCA08])
How often?
Very infrequently
Every quantum boundary (10 M cycles)
Frequently
Every batch boundary (thousands of cycles)
Sensitive to coordination latency?
Insensitive
Coordination latency << Quantum length
Sensitive
Coordination latency ~ Batch length
How costly is coordination in ATLAS?
![Page 25: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/25.jpg)
25
Properties of ATLAS
LAS-ranking Bank-level parallelism Row-buffer locality
Very infrequent coordination
Scale attained service with thread weight (in paper)
Low complexity: Attained service requires a single counter per thread in each MC
Maximize system performance
Scalable to large number of controllers
Configurable by system software
Goals Properties of ATLAS
![Page 26: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/26.jpg)
26
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 27: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/27.jpg)
27
Evaluation Methodology
4, 8, 16, 24, 32-core systems 5 GHz processor, 128-entry instruction window 512 Kbyte per-core private L2 caches
1, 2, 4, 8, 16-MC systems 128-entry memory request buffer 4 banks, 2Kbyte row buffer 40ns (200 cycles) row-hit round-trip latency 80ns (400 cycles) row-conflict round-trip latency
Workloads Multiprogrammed SPEC CPU2006 applications 32 program combinations for 4, 8, 16, 24, 32-core
experiments
![Page 28: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/28.jpg)
28
Comparison to Previous Scheduling Algorithms FCFS, FR-FCFS [Rixner+, ISCA00]
Oldest-first, row-hit first Low multi-core performance Do not distinguish between threads
Network Fair Queueing [Nesbit+, MICRO06] Partitions memory bandwidth equally among threads Low system performance Bank-level parallelism, locality not
exploited
Stall-time Fair Memory Scheduler [Mutlu+, MICRO07] Balances thread slowdowns relative to when run alone High coordination costs Requires heavy cycle-by-cycle
coordination
Parallelism-Aware Batch Scheduler [Mutlu+, ISCA08] Batches requests and performs thread ranking to preserve bank-level
parallelism High coordination costs Batch duration is very short
![Page 29: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/29.jpg)
29
1 2 4 8 16Memory controllers
4
6
8
10
12
14
16
FCFS FR_FCFS STFM PAR-BS ATLAS
Syst
em th
roug
hput
System Throughput: 24-Core System
System throughput = ∑ Speedup
ATLAS consistently provides higher system throughput than all previous scheduling algorithms
17.0%
9.8%
8.4%
5.9%
3.5%
Syst
em
th
rou
gh
pu
t
# of memory controllers
![Page 30: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/30.jpg)
30
4 8 16 24 32Cores
0
2
4
6
8
10
12
14
PAR-BS ATLAS
Syst
em th
roug
hput
System Throughput: 4-MC System
# of cores increases ATLAS performance benefit increases
1.1%3.5%
4.0%
8.4%10.8%
Syst
em
th
rou
gh
pu
t
# of cores
![Page 31: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/31.jpg)
31
Other Evaluations In Paper System software support
ATLAS effectively enforces thread weights
Workload analysis ATLAS performs best for mixed-intensity workloads
Effect of ATLAS on fairness
Sensitivity to algorithmic parameters
Sensitivity to system parameters Memory address mapping, cache size, memory latency
![Page 32: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/32.jpg)
32
Outline Motivation Rethinking Memory Scheduling
Minimizing Memory Episode Time ATLAS
Least Attained Service Memory Scheduling Thread Ranking Request Prioritization Rules Coordination
Evaluation Conclusion
![Page 33: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/33.jpg)
33
Conclusions
Multiple memory controllers require coordination Need to agree upon a consistent ranking of threads
ATLAS is a fundamentally new approach to memory scheduling Scalable: Thread ranking decisions at coarse-grained intervals High-performance: Minimizes system time spent in memory
episodes (Least Attained Service scheduling principle) Configurable: Enforces thread priorities
ATLAS provides the highest system throughput compared to five previous scheduling algorithms Performance benefit increases as the number of cores increases
![Page 34: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/34.jpg)
34
THANK YOU.
QUESTIONS?
![Page 35: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/35.jpg)
35
ATLASA Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
Yoongu KimDongsu HanOnur Mutlu
Mor Harchol-Balter
![Page 36: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/36.jpg)
36
Hardware Cost
Additional hardware storage: For a 24-core, 4-MC system: 9kb
Not on critical path of execution
![Page 37: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/37.jpg)
37
System Configuration
DDR2-800 memory timing
![Page 38: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/38.jpg)
38
Benchmark Characteristics
Benchmarks have a wildly varying level of memory intensity Orders of magnitude difference Prioritizing the “light” threads over “heavy” threads is
critical
![Page 39: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/39.jpg)
39
Workload-by-workload Result
Workloads
Results
![Page 40: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/40.jpg)
40
Result on 24-Core System
24-core system with varying number of controllers.
PAR-BSc Ideally coordinated version of PAR-BS Instantaneous knowledge of every other controller ATLAS outperforms this unrealistic version of PAR-BSc
![Page 41: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/41.jpg)
41
Result on X-Core, Y-MC System
System where both the numbers of cores and controllers are varied.
PAR-BSc Ideally coordinated version of PAR-BS Instantaneous knowledge of every other controller ATLAS outperforms this unrealistic version of PAR-BSc
![Page 42: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/42.jpg)
42
Varying Memory Intensity
Workloads that are composed of varying number of memory intensive threads
ATLAS performs the best for a balanced-mix workload Distinguishes a memory-non-intensive threads and
prioritizes them over memory-intensive threads
![Page 43: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/43.jpg)
43
ATLAS with and w/o Coordination ATLAS provides ~1% improvement over ATLASu
ATLAS is specifically designed to operate with little coordination
PAR-BSc provides ~3% improvement over PAR-BS.
![Page 44: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/44.jpg)
44
ATLAS Algorithmic Parameters
Empirically determined quantum length: 10M cycles When quantum length is large:
Infrequent coordination Even long memory episodes can fit within the quantum Thread ranking is stable
![Page 45: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/45.jpg)
45
ATLAS Algorithmic Parameters
Empirically determined history weight: 0.875 When history weight is large:
Attained service is retained for longer TotalAS changes slowly over time Better distinguish memory intensive threads
![Page 46: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/46.jpg)
46
ATLAS Algorithmic Paramters
Empirically determined starvation threshold: 100K cycles
When T is too low: System performance is degraded (similar to FCFS)
When T is too high: Starvation-freedom guarantee is lost
![Page 47: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/47.jpg)
47
Multi-threaded Applications In our evaluations, all threads are single-threaded
applications
For multi-threaded applications, we can generalize the notion of attained service Attained service of multiple threads
= Sum or average of each of their attained service
Within-application ranking
Future work
![Page 48: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/48.jpg)
48
Unfairness in ATLAS
Assuming threads are equal priority, ATLAS maximizes system throughput by prioritizing shorter jobs at expense to the longer jobs
ATLAS increases unfairness Maximum slowdown: 20.3% Harmonic speedup: -10.3%
If threads are of different priorities, system software can ensure fairness/QoS by appropriate configuring thread priorities supported by ATLAS
![Page 49: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/49.jpg)
49
System Software Support
ATLAS enforces system priorities, or thread weights. Linear relationship between thread weight and
speedup.
![Page 50: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/50.jpg)
50
System Parameters
ATLAS performance on systems with varying cache sizes and memory timing
ATLAS performance benefit increases as contention for memory increases
![Page 51: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/51.jpg)
51
Handling Prefetches Attained service encourages efficient prefetch
behavior Prefetches cause disruption to memory accesses of other
threads, while being less useful Attained service can be thought as a fee that is charged
Orthogonal to Prefetch Aware DRAM Controllers [Lee+, MICRO08] PADC: Track accuracy of prefetches.
If accurate, prefetches and demands are same priority. If inaccurate, deprioritize prefetches.
ATLAS + PADC If accurate, treat prefetches as demands (i.e., increment
attained service). If inaccurate, deprioritize prefetches.
![Page 52: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/52.jpg)
52
Memory Addressing Scheme Evaluated memory addressing schemes
Block-interleaving: 3.5% improvement Row-interleaving: 8.4% improvement
Block-interleaving has more bank-level parallelism that ATLAS can exploit.
![Page 53: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/53.jpg)
53
Row-buffer Locality? A balance exists between:
Row-buffer locality Bank-level parallelism
By emphasizing row-buffer locality, bank-level parallelism suffers We evaluated this algorithm and saw that performance
was worse
We try to achieve a balance, but place more emphasis on bank-level parallelism. But within a thread, we do preserve row-buffer locality
![Page 54: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/54.jpg)
54
GPU
This will be applicable as long as threads are competing with each other.
![Page 55: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/55.jpg)
55
PCM
Applicable to any kind of memory technologies. Banks Contention between threads
![Page 56: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/56.jpg)
56
Difference between ATLAS and PAR-BS Attained service
Not utilized by PAR-BS Scheduling theory
ATLAS leverages Least-Attained-Scheduling principle Long-term thread behavior
PAR-BS forms batches on instantaneous state of the request buffer Memory episodes are broken across batch boundaries
But ATLAS monitors long-term thread behavior
Scalability PAR-BS need to coordinate frequently PAR-BS is more complex
Max # of requests to any bank Total # of requests
![Page 57: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/57.jpg)
57
Non-Pareto-distributed Benchmarks There were three that weren’t Pareto distributed.
We can’t take advantage of the short-term thread behavior optimizations, i.e. returning quickly to the compute episode
However, we can still take advantage of long-term thread behavior optimizations We take into account the memory intensity of threads Based on this intensity, we favors non-intensive
threads, which leads to higher system performance
![Page 58: ATLAS A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers](https://reader036.fdocuments.net/reader036/viewer/2022062304/56813322550346895d99f796/html5/thumbnails/58.jpg)
58
Writes
Handle them traditionally using write buffers.
Don’t incur the read-write switch penalty everytime.
So whenever a write buffer becomes close to full, issue the writes.