Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures
M. Aater Suleman*, Onur Mutlu†, Moinuddin K. Qureshi‡, Yale N. Patt*
*The University of Texas at Austin
†Carnegie Mellon University
‡IBM Research
Background
• To leverage CMPs:
  – Programs must be split into threads
• Mutual Exclusion:
  – Threads are not allowed to update shared data concurrently
• Accesses to shared data are encapsulated inside critical sections
• Only one thread can execute a critical section at a given time
Example of Critical Section from MySQL
[Figure: the List of Open Tables (tables A, B, C, D, E) shared by Thread 0, Thread 1, Thread 2, and Thread 3. Thread 3: OpenTables(D, E). Thread 2: CloseAllTables().]
Example Critical Section from MySQL
[Figure: tables A, B, C, D, and E, annotated with thread numbers 0 to 3.]
Example Critical Section from MySQL
End of Transaction:
    LOCK_openAcquire()
    foreach (table opened by thread)
        if (table.temporary)
            table.close()
    LOCK_openRelease()
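The end-of-transaction critical section can be sketched in Python as follows. This is only an illustration and not MySQL source code: `Table`, `end_of_transaction`, and `demo` are hypothetical names, and a `threading.Lock` stands in for the LOCK_open mutex.

```python
import threading
from dataclasses import dataclass

# Hypothetical stand-in for MySQL's LOCK_open mutex.
LOCK_open = threading.Lock()


@dataclass
class Table:
    name: str
    temporary: bool
    is_open: bool = True

    def close(self):
        self.is_open = False


def end_of_transaction(open_tables):
    """Close every temporary table opened by the calling thread."""
    with LOCK_open:                    # LOCK_openAcquire()
        for table in open_tables:      # foreach (table opened by thread)
            if table.temporary:
                table.close()
    # Leaving the with block corresponds to LOCK_openRelease().


def demo():
    """Run the critical section on a worker thread and list tables still open."""
    tables = [Table("A", False), Table("D", True), Table("E", True)]
    worker = threading.Thread(target=end_of_transaction, args=(tables,))
    worker.start()
    worker.join()
    return [t.name for t in tables if t.is_open]
```

Only one thread at a time can be inside the `with LOCK_open` block, so other threads that reach it must wait.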
Contention for Critical Sections
[Figure: two timelines (t1 to t7) for Threads 1 to 4, with segments marked Critical Section, Parallel, and Idle. In the second timeline, critical sections execute 2x faster.]
Accelerating critical sections not only helps the thread executing the
critical sections, but also the waiting threads
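A small worked example makes the effect on waiting threads concrete. It assumes a deliberately simplified model that is not taken from the slides: every thread does `parallel` time units of independent work, then executes one critical section of `cs` time units under the same lock, and all threads reach the lock at the same moment. The function names are hypothetical.

```python
def finish_time(num_threads, parallel, cs, cs_speedup=1.0):
    """Time until the last thread leaves the critical section.

    All threads finish their parallel work at time `parallel`, then
    take turns in the critical section, one at a time.
    """
    cs_time = cs / cs_speedup
    return parallel + num_threads * cs_time


def idle_time(num_threads, cs, cs_speedup=1.0):
    """Total time all threads spend waiting for the lock."""
    cs_time = cs / cs_speedup
    # Thread k (0-based) waits for the k threads ahead of it.
    return sum(k * cs_time for k in range(num_threads))
```

Under these assumptions, with 4 threads, 2 units of parallel work, and a 1-unit critical section, running the critical section 2x faster lowers the finish time from 6 to 4 units and the total waiting time from 6 to 3 units.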
Impact of Critical Sections on Scalability
• Contention for critical sections increases with the number of threads and limits scalability
[Figure: MySQL (oltp-1), speedup vs. chip area (cores).]
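One rough way to see why contention limits scalability is the bound below. It assumes, as a simplification that is not part of the talk, that a fraction `cs_fraction` of the single-thread execution time is spent in critical sections guarded by one lock. Those sections cannot overlap, so the run on N threads takes at least `cs_fraction` of the single-thread time, and the speedup cannot exceed `min(N, 1 / cs_fraction)`. The function name is hypothetical.

```python
def speedup_upper_bound(num_threads, cs_fraction):
    """Upper bound on speedup when cs_fraction of the work is serialized by one lock."""
    if not 0.0 <= cs_fraction <= 1.0:
        raise ValueError("cs_fraction must be between 0 and 1")
    if cs_fraction == 0.0:
        return float(num_threads)
    # Critical sections run one at a time, so they cap the speedup at 1/cs_fraction.
    return min(float(num_threads), 1.0 / cs_fraction)
```

For example, with 25% of the work inside the critical section, adding threads beyond 4 cannot raise the bound any further.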
Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary
The Asymmetric Chip Multiprocessor (ACMP)
• Provide one large core and many small cores
• Execute parallel part on small cores for high throughput
• Accelerate serial part using the large core
[Figure: ACMP Approach, with one large core and 12 Niagara-like small cores.]
Conventional ACMP
EnterCS()
PriorityQ.insert(…)
LeaveCS()
1. P2 encounters a Critical Section
2. Sends a request for the lock
3. Acquires the lock
4. Executes Critical Section
5. Releases the lock
[Figure: cores P1, P2, P3, and P4 on the on-chip interconnect, with the core executing the critical section marked.]
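The conventional flow can be sketched in software, with threads standing in for cores. This is an illustrative sketch only: `pq_lock`, `insert_subproblem`, and `run_conventional` are hypothetical names, and a `heapq`-based list plays the role of PriorityQ.

```python
import heapq
import threading

priority_q = []               # shared data
pq_lock = threading.Lock()    # lock protecting priority_q


def insert_subproblem(item):
    with pq_lock:                          # EnterCS(): request and acquire the lock
        heapq.heappush(priority_q, item)   # PriorityQ.insert(...)
    # LeaveCS(): the lock is released on leaving the with block


def run_conventional(num_threads=4, inserts_per_thread=100):
    """Each thread executes its own critical sections after acquiring the lock."""
    priority_q.clear()

    def worker(core_id):
        for i in range(inserts_per_thread):
            insert_subproblem((i, core_id))

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(priority_q)
```

Here the thread that encounters the critical section is also the one that executes it, so the lock and the shared data move between whichever cores run those threads.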
Accelerating Critical Sections (ACS)
• Accelerate Amdahl’s serial part and critical sections using the large core
[Figure: ACMP Approach, with one large core, 12 Niagara-like small cores, and a Critical Section Request Buffer (CSRB).]
Accelerated Critical Sections (ACS)
EnterCS()
PriorityQ.insert(…)
LeaveCS()
1. P2 encounters a Critical Section
2. P2 sends CSCALL Request to CSRB
3. P1 executes Critical Section
4. P1 sends CSDONE signal
[Figure: cores P1, P2, P3, and P4 on the on-chip interconnect, with the Critical Section Request Buffer (CSRB) and the core executing the critical section marked.]
Architecture Overview
• ISA extensions
  – CSCALL LOCK_ADDR, TARGET_PC
  – CSRET LOCK_ADDR
• Compiler/Library inserts CSCALL/CSRET
• On a CSCALL, the small core:
  – Sends a CSCALL request to the large core
    • Arguments: Lock address, Target PC, Stack Pointer, Core ID
  – Stalls and waits for CSDONE
• Large Core
  – Critical Section Request Buffer (CSRB)
  – Executes the critical section and sends CSDONE to the requesting core
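The CSCALL/CSDONE handshake can be mimicked in software to show the control flow. This is a sketch under stated assumptions, not the hardware mechanism: a `queue.Queue` stands in for the CSRB, a dedicated thread stands in for the large core, a `threading.Event` stands in for the CSDONE signal, and a Python callable replaces the target PC. The stack pointer and core ID arguments are omitted, and all function names are hypothetical.

```python
import heapq
import queue
import threading

csrb = queue.Queue()    # stands in for the Critical Section Request Buffer
priority_q = []         # shared data, touched only by the large-core thread


def large_core():
    """Serve CSCALL requests in arrival order until a None sentinel arrives."""
    while True:
        request = csrb.get()
        if request is None:
            break
        lock_addr, target, args, csdone = request
        target(*args)    # execute the critical section
        csdone.set()     # send CSDONE to the requesting core


def cscall(lock_addr, target, *args):
    """Small core: send the request, then stall until CSDONE."""
    csdone = threading.Event()
    csrb.put((lock_addr, target, args, csdone))
    csdone.wait()


def pq_insert(item):
    heapq.heappush(priority_q, item)


def run_acs(num_small_cores=4, inserts_per_core=100):
    priority_q.clear()
    server = threading.Thread(target=large_core)
    server.start()

    def small_core(core_id):
        for i in range(inserts_per_core):
            cscall("pq_lock", pq_insert, (i, core_id))

    cores = [threading.Thread(target=small_core, args=(c,)) for c in range(num_small_cores)]
    for t in cores:
        t.start()
    for t in cores:
        t.join()
    csrb.put(None)    # stop the large-core thread
    server.join()
    return len(priority_q)
```

Because only the large-core thread ever touches `priority_q`, no lock is taken in this sketch; the buffer itself orders the critical sections.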
False Serialization
• ACS can serialize independent critical sections
• Selective Acceleration of Critical Sections (SEL)
  – Saturating counters to track false serialization
[Figure: CSCALL (A), CSCALL (A), and CSCALL (B) requests from small cores waiting in the Critical Section Request Buffer (CSRB) on their way to the large core, with counter values shown for locks A and B.]
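The slide states only that SEL uses saturating counters to track false serialization. The sketch below shows one possible policy built on that idea. The counter width, the update rule, and the disable threshold are assumptions chosen for illustration, and the class and method names are hypothetical.

```python
class SaturatingCounter:
    """Counter that sticks at 0 and at its maximum instead of wrapping."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1
        self.value = 0

    def add(self, amount):
        self.value = min(self.max_value, self.value + amount)

    def subtract(self, amount):
        self.value = max(0, self.value - amount)


class SelectiveAcceleration:
    """Assumed policy: stop sending a lock's critical sections once its counter saturates."""

    def __init__(self, bits=4):
        self.bits = bits
        self.counters = {}

    def record_request(self, lock_addr, pending_lock_addrs):
        """Update the counter for lock_addr given the locks already waiting in the CSRB."""
        counter = self.counters.setdefault(lock_addr, SaturatingCounter(self.bits))
        independent = sum(1 for addr in pending_lock_addrs if addr != lock_addr)
        if independent > 0:
            counter.add(independent)    # waited behind unrelated critical sections
        else:
            counter.subtract(1)         # no false serialization this time

    def use_large_core(self, lock_addr):
        counter = self.counters.get(lock_addr)
        return counter is None or counter.value < counter.max_value
```

In this sketch, a CSCALL (B) that repeatedly arrives behind pending CSCALL (A) requests drives B's counter to its maximum, after which B's critical sections would run on the requesting small core instead.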
Performance Tradeoffs
• Fewer threads vs. accelerated critical sections
  – Accelerating critical sections offsets loss in throughput
  – As the number of cores (threads) on chip increases:
    • Fractional loss in parallel performance decreases
    • Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
  – ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data
Cache misses for private data
Private Data: NewSubProblems
Shared Data: The priority heap
PriorityHeap.insert(NewSubProblems)
Puzzle Benchmark
Performance Tradeoffs
• More cache misses for private data vs. fewer misses for shared data
  – Cache misses decrease if shared data > private data
Experimental Methodology
SCMP
• All small cores
• Conventional locking
[Figure: 16 Niagara-like small cores.]
ACMP
• One large core (area-equal to 4 small cores)
• Conventional locking
[Figure: one large core and 12 Niagara-like small cores.]
ACS
• ACMP with a CSRB
• Accelerates Critical Sections
[Figure: one large core and 12 Niagara-like small cores.]
Experimental Methodology
• Workloads
  – 12 critical section intensive applications from various domains
  – 7 use coarse-grain locks and 5 use fine-grain locks
• Simulation parameters:
  – x86 cycle accurate processor simulator
  – Large core: Similar to Pentium-M with 2-way SMT. 2 GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage
  – Small core: Similar to Pentium 1, 2 GHz, in-order, 2-wide issue, 5-stage
  – Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  – On-chip interconnect: Bi-directional ring
Workloads with Coarse-Grain Locks
Equal-area comparison
Number of threads = Best threads
Chip Area = 16 small cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
Chip Area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
[Figure: per-workload results for each chip area; the value labels 210 and 150 appear on both charts.]
Workloads with Fine-Grain Locks
Equal-area comparison
Number of threads = Best threads
Chip Area = 16 small cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
Chip Area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
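The equal-area configurations follow from simple arithmetic: the methodology states that one large core occupies the area of 4 small cores, so an ACMP or ACS chip trades 4 small cores for the large one. The helper names below are hypothetical and the snippet is just a worked check of the numbers on these slides.

```python
LARGE_CORE_AREA = 4    # one large core occupies the area of 4 small cores


def scmp_config(chip_area):
    """Symmetric CMP: the whole area budget is spent on small cores."""
    return {"large": 0, "small": chip_area}


def acmp_config(chip_area):
    """ACMP/ACS: one large core, remaining area spent on small cores."""
    if chip_area < LARGE_CORE_AREA:
        raise ValueError("chip area is too small to fit a large core")
    return {"large": 1, "small": chip_area - LARGE_CORE_AREA}
```

For a chip area of 16 this gives 1 large and 12 small cores, and for 32 it gives 1 large and 28 small cores, matching the configurations listed above.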
Equal-Area Comparisons
Number of threads = No. of cores
[Figure: speedup over a small core vs. chip area (small cores) for SCMP, ACMP, and ACS. Panels: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache.]
ACS on Symmetric CMP
Majority of benefit is from large core
Related Work
• Improving locality of shared data by thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
  ACS not only improves locality but also uses a large core to accelerate critical section execution
• Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
  ACS accelerates not only Amdahl’s bottleneck but also critical sections
• Remote procedure calls (Birrell+)
  ACS is for critical sections among shared memory cores
Hiding Latency of Critical Sections
• Transactional memory (Herlihy+)
  ACS does not require code modification
• Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+)
  – Hide critical section latency by increasing concurrency
    ACS reduces latency of each critical section
  – Overlaps execution of critical sections with no data conflicts
    ACS accelerates ALL critical sections
  – Does not improve locality of shared data
    ACS improves locality of shared data
ACS outperforms TLR (Rajwar+) by 18% (details in paper)
Conclusion
• Critical sections reduce performance and limit scalability
• Accelerate critical sections by executing them on a powerful core
• ACS reduces average execution time by:
  – 34% compared to an equal-area SCMP
  – 23% compared to an equal-area ACMP
• ACS improves scalability of 7 of the 12 workloads