18-447 Computer Architecture Guest Lecture: Multi-Core...
Transcript of 18-447 Computer Architecture Guest Lecture: Multi-Core...
![Page 1: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/1.jpg)
18-447
Computer Architecture
Guest Lecture: Multi-Core Systems
Prof. Onur Mutlu
Carnegie Mellon University
![Page 2: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/2.jpg)
Unexpected Slowdowns in Multi-Core
2
Low priority
High priority
(Core 0) (Core 1)
![Page 3: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/3.jpg)
Agenda
Intro to Multi-Core Systems
Multi-Core Design Issues
Shared Main Memory Systems
Shared Caches
Core Organization
Interconnects
Announcements
CALCM reading group
ECE 18-742: Parallel Computers (offered Spring 2010)
Interested in summer and future research in computer architecture and multi-core systems?
3
![Page 4: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/4.jpg)
Announcements
Weekly CALCM Reading Group
Will start the first week of May (May 5)
Readings and brainstorming on cutting-edge research in comp arch and related areas
+ snacks
Email me or join CALCM mailing list if you are interested in attending and receiving announcements
https://sos.ece.cmu.edu/mailman/listinfo/calcm-list
4
![Page 5: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/5.jpg)
Announcements (II)
Interested in more Computer Architecture classes?
18-740: Advanced Comp Arch (Fall 2009, Prof. Mowry)
18-742: Parallel Comp Arch (Spring 2010, Prof. Mutlu)
Interested in Summer of Future Research in Comp Arch?
Talk to me. Some sample projects:
MS-Manic: Memory systems for 1000-core processors
On-chip security: attacks, defenses, many-core resource management
BLESS: Bufferless on-chip networks
Asymmetric Multi-Core Design
Architectural support for safe/managed programming languages
Hardware/software/system support for tolerating hardware defects and bugs
5
![Page 6: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/6.jpg)
6
Computer architecture is the science/art of designing high-performance processing systems under many different constraints (power, cost, size, battery life, reliability, etc)
Processor performance improvements enabled innovation in software development for decades
Single-thread performance has become very difficult to improve
Complexity wall
Memory wall
Power wall
Reliability wall (soon)
Chip-multiprocessor architectures are mainstream
Reduce mainly the “complexity wall” by tiling cores
Create new problems shared resources, parallel programming, off-chip bandwidth, serial bottleneck
The State of Computer Architecture
Flynn & Hung, IEEE Micro, 2005
Processor core speed: ~60% per year
DRAM speed: ~7-10% per year
1 memory access > 300 clock cycles
Hard to bridge the gap with a single thread
Moore, FCRC, 2007
![Page 7: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/7.jpg)
Virtuous Cycle, 1950-2005 (per Jim Larus)
7
Increased
processor
performance
Larger, more
feature-full
software
Larger
development
teams
Higher-level
languages &
abstractions
Slower
programs
World-Wide Software Market (per IDC):
$212b (2005) $310b (2010)
*Slide credit: Mark Hill, HPCA 2007
![Page 8: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/8.jpg)
Virtuous Cycle, 2005+
8
Increased
processor
performance
Larger, more
feature-full
software
Larger
development
teams
Higher-level
languages &
abstractions
Slower
programs
World-Wide Software Market: $212b (2005) ???
X
Thread Level Parallelism & Multicore Chips
*Slide credit: Mark Hill, HPCA 2007
![Page 9: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/9.jpg)
An Example Multi-Core System
9
CORE 1
L2 C
AC
HE
0
SH
AR
ED
L3 C
AC
HE
DR
AM
INT
ER
FA
CE
CORE 0
CORE 2 CORE 3L
2 C
AC
HE
1
L2 C
AC
HE
2
L2 C
AC
HE
3
DR
AM
BA
NK
S
Multi-Core
Chip
*Die photo credit: AMD Barcelona
DRAM MEMORY
CONTROLLER
![Page 10: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/10.jpg)
A Future Multi-Core Chip
10
![Page 11: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/11.jpg)
Designing Multi-Core Chips is Difficult
Designers must confront single-core design options
Instruction fetch, decode, wakeup, select, out-of-order execution
Execution unit configuration, operand bypass, SIMD extensions
Load/store queues, data cache, L2 caches
Checkpoint, runahead, commit
Speculative execution: Prefetching, branch prediction
As well as additional design degrees of freedom
How many cores? How big each? Heterogeneous/homogeneous?
Shared caches: levels? How many banks? How to share?
Shared memory interface: How many controllers? How to share?
On-chip interconnect: bus, switched, ordered? How to share?
Prefetching: how to manage prefetchers across cores?
11
![Page 12: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/12.jpg)
Problems in Multi Core Chips
Simplify the design complexity problem
Somewhat…
Stamping multiple of the same cores side by side and connect them with some interconnection network easier
However, create many other (new) problems
Shared resources among multiple cores: how to design and manage?
More cores, NOT faster cores: single-thread performance suffers, serial code performance suffers
Memory bandwidth: How to supply all the cores with enough data
Parallel programming: How to write programs that can benefit from multiple cores? How to ease parallel programming?
How to design the cores: what kind? homogeneous or heterogeneous?
How to design the interconnect between cores/caches/memory?
12
![Page 13: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/13.jpg)
Let’s Take a Look at Some of
These Problems
13
![Page 14: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/14.jpg)
Unexpected Slowdowns in Multi-Core
14
Memory Performance HogLow priority
High priority
(Core 0) (Core 1)
![Page 15: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/15.jpg)
15
Why the Disparity in Slowdowns?
CORE 1 CORE 2
L2
CACHE
L2
CACHE
DRAM MEMORY CONTROLLER
DRAM
Bank 0
DRAM
Bank 1
DRAM
Bank 2
Shared DRAM
Memory System
Multi-Core
Chip
unfairness
INTERCONNECT
matlab gcc
DRAM
Bank 3
![Page 16: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/16.jpg)
16
DRAM Bank Operation
Row Buffer
(Row 0, Column 0)
Row
decoder
Column mux
Row address 0
Column address 0
Data
Row 0Empty
(Row 0, Column 1)
Column address 1
(Row 0, Column 85)
Column address 85
(Row 1, Column 0)
HITHIT
Row address 1
Row 1
Column address 0
CONFLICT !
Columns
Row
s
Access Address:
![Page 17: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/17.jpg)
17
DRAM Controllers
A row-conflict memory access takes 2-3 times longer than a row-hit access
Current controllers take advantage of the row buffer
Commonly used scheduling policy (FR-FCFS) [Rixner 2000]*
(1) Row-hit first: Service row-hit memory accesses first
(2) Oldest-first: Then service older accesses first
This scheduling policy aims to maximize DRAM throughput
*Rixner et al., “Memory Access Scheduling,” ISCA 2000.
*Zuravleff and Robinson, “Controller for a synchronous DRAM …,” US Patent 5,630,096, May 1997.
![Page 18: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/18.jpg)
18
The Problem
Multiple threads share the DRAM controller
DRAM controllers designed to maximize DRAM throughput
DRAM scheduling policies are thread-unfair
Row-hit first: unfairly prioritizes threads with high row buffer locality
Threads that keep on accessing the same row
Oldest-first: unfairly prioritizes memory-intensive threads
DRAM controllers vulnerable to denial of service
Can write programs that deny memory service to others
Memory performance hogs
![Page 19: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/19.jpg)
// initialize large arrays A, B
for (j=0; j<N; j++) {
index = rand();
A[index] = B[index];
…
}
19
An Example Memory Performance Hog
STREAM
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive
RANDOM
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive
// initialize large arrays A, B
for (j=0; j<N; j++) {
index = j*linesize;
A[index] = B[index];
…
}
streaming random
![Page 20: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/20.jpg)
20
What does the MPH do?
Row BufferR
ow
decoder
Column mux
Data
Row 0
T0: Row 0
Row 0
T1: Row 16
T0: Row 0T1: Row 111
T0: Row 0T0: Row 0T1: Row 5
T0: Row 0T0: Row 0T0: Row 0T0: Row 0T0: Row 0
Request Buffer
T0: STREAMT1: RANDOM
Row size: 8KB, cache block size: 64B
128 (8KB/64B) requests of T0 serviced before T1
![Page 21: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/21.jpg)
Effect of the MPH
0
0.5
1
1.5
2
2.5
3
STREAM RANDOM
21
1.18X slowdown
2.82X slowdown
Results on Intel Pentium D running Windows XP
(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)
Slo
wd
ow
n
0
0.5
1
1.5
2
2.5
3
STREAM gcc
0
0.5
1
1.5
2
2.5
3
STREAM Virtual PC
![Page 22: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/22.jpg)
Can Be a Bigger Problem with More Cores
1.05
1.85
4.72
7.74
0
1
2
3
4
5
6
7
8
libquantum hmmer h264ref omnetpp
Slo
wdow
n
22
DRAM memory is the only shared resource
Memory Performance Hog
(Core 0) (Core 1) (Core 2) (Core 3)
Low priority
High priority
![Page 23: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/23.jpg)
Problems Caused by MPHs
1.051.85
4.72
7.74
0
1
2
3
4
5
6
7
8
libquantum hmmer h264ref omnetpp
Slo
wdow
n
23
Vulnerability to denial of service [Usenix Security 2007]
Inability to enforce thread priorities [MICRO 2007, ISCA 2008]
System performance loss [MICRO 2007, ISCA 2008]
Cores make
very slow
progress
Memory performance hogLow priority
High priority
![Page 24: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/24.jpg)
24
Preventing Memory Performance Hogs
Fundamentally hard to distinguish between malicious and unintentional MPHs
MATLAB’s memory access behavior is very similar to STREAM’s
Unfair DRAM scheduling is the fundamental cause of MPHs
MPHs exploit the unfairness in the DRAM controller
Solution: Prevent DRAM unfairness
Contain and limit MPHs by providing fair memory scheduling
![Page 25: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/25.jpg)
25
Solution: Hardware-Software Cooperation
Hardware provides a fair scheduler that is
Configurable by software
High-performance
Simple to implement (cost- and power-efficient)
System software decides policy
Configures the fair scheduler to enforce thread priorities and quality of service policies
But, what is fairness in shared DRAM systems?
![Page 26: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/26.jpg)
26
Stall-Time Fairness in Shared DRAM Systems
A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system
DRAM-related stall-time: The time a thread spends waiting for DRAM memory
STshared: DRAM-related stall-time when the thread runs with other threads
STalone: DRAM-related stall-time when the thread runs alone
Memory-slowdown = STshared/STalone
Relative increase in stall-time
Stall-Time Fair Memory scheduler (STFM) aims to equalizeMemory-slowdown for interfering threads, without sacrificing performance
Considers inherent DRAM performance of each thread
Aims to allow proportional progress of threads
![Page 27: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/27.jpg)
27
STFM Scheduling Algorithm [MICRO’07]
For each thread, the DRAM controller
Tracks STshared
Estimates STalone
Each cycle, the DRAM controller
Computes Slowdown = STshared/STalone for threads with legal requests
Computes unfairness = MAX Slowdown / MIN Slowdown
If unfairness <
Use DRAM throughput oriented scheduling policy
If unfairness ≥
Use fairness-oriented scheduling policy
(1) requests from thread with MAX Slowdown first
(2) row-hit first , (3) oldest-first
![Page 28: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/28.jpg)
28
How Does STFM Prevent Unfairness?
Row Buffer
Data
Row 0
T0: Row 0
Row 0
T1: Row 16
T0: Row 0
T1: Row 111
T0: Row 0T0: Row 0
T1: Row 5
T0: Row 0T0: Row 0
T0: Row 0
T0 Slowdown
T1 Slowdown 1.00
1.00
1.00Unfairness
1.03
1.03
1.06
1.06
1.05
1.03
1.06
1.031.04
1.08
1.04
1.04
1.11
1.06
1.07
1.04
1.10
1.14
1.03
Row 16Row 111
![Page 29: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/29.jpg)
Containing the Memory Performance Hog
1.07
3.04
0
0.5
1
1.5
2
2.5
3
3.5
4
matlab gcc
Slo
wdow
n
29
1.76 1.85
0
0.5
1
1.5
2
2.5
3
3.5
4
matlab gcc
Slo
wdow
n
14% improvement in system performance
(Core 0) (Core 1)
![Page 30: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/30.jpg)
30
STFM Implementation
Tracking STshared
Increase STshared if the thread cannot commit instructions due to an outstanding DRAM access
Estimating STalone
Difficult to estimate directly because thread not running alone
Observation: STalone = STshared - STinterference
Estimate STinterference: Extra stall-time due to interference
Update STinterference when a thread incurs delay due to other threads When a row buffer hit turns into a row-buffer conflict
(keep track of the row that would have been in the row buffer)
When a request is delayed due to bank or bus conflict
![Page 31: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/31.jpg)
31
Support for System Software
System-level thread weights (priorities)
OS can choose thread weights to satisfy QoS requirements
Larger-weight threads should be slowed down less
OS communicates thread weights to the memory controller
Controller scales each thread’s slowdown by its weight
Controller uses weighted slowdown used for scheduling
Favors threads with larger weights
: Maximum tolerable unfairness set by system software
Don’t need fairness? Set large.
Need strict fairness? Set close to 1.
Other values of : trade off fairness and throughput
![Page 32: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/32.jpg)
Enforcing Thread Priorities
1.07
3.04
0
0.5
1
1.5
2
2.5
3
3.5
4
matlab gcc
Slo
wdow
n
32
1.76 1.85
0
0.5
1
1.5
2
2.5
3
3.5
4
matlab gcc
Slo
wdow
n
2.78
1.06
0
0.5
1
1.5
2
2.5
3
3.5
4
matlab gcc
Slo
wdow
n
Low priority
High priority
(Core 0) (Core 1)
![Page 33: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/33.jpg)
Some Issues in Multi-Core Design
Shared Main Memory System
Shared vs. Private Caches
Interconnect Design
Amdahl’s Law: Asymmetric Multi-Core Chips
33
![Page 34: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/34.jpg)
Multi-core Issues in Caching
How does the cache hierarchy change in a multi-core system?
Private cache: Cache belongs to one core
Shared cache: Cache is shared by multiple cores
34
CORE 0 CORE 1 CORE 2 CORE 3
L2
CACHE
L2
CACHE
L2
CACHE
DRAM MEMORY CONTROLLER
L2
CACHE
CORE 0 CORE 1 CORE 2 CORE 3
DRAM MEMORY CONTROLLER
L2
CACHE
![Page 35: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/35.jpg)
Shared Caches Between Cores
Advantages:
Dynamic partitioning of available cache space
No fragmentation due to static partitioning
Easier to maintain coherence
Shared data and locks do not ping pong between caches
Disadvantages
Cores incur conflict misses due to other cores’ accesses
Misses due to inter-core interference
Some cores can destroy the hit rate of other cores
What kind of access patterns could cause this?
Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)
High bandwidth harder to obtain (N cores N ports?)35
![Page 36: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/36.jpg)
Handling Shared Data in Private Caches
Shared data and locks ping-pong between processors if caches are private
-- Increases latency to fetch shared data/locks
-- Reduces cache efficiency (many invalid blocks)
-- Scalability problem: maintaining coherence across a large number of private caches is costly
How to do better?
Idea: Store shared data and locks only in one special core’s cache. Divert all critical section execution to that core/cache.
Essentially, a specialized core for processing critical sections
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.
36
![Page 37: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/37.jpg)
Multi-Core Cache Efficiency: Bandwidth Filters
Caches act as a filter that reduce memory bandwidth requirement
Cache hit: No need to access memory
This is in addition to the latency reduction benefit of caching
GPUs use caches to reduce memory BW requirements
Efficient utilization of cache space becomes more important with multi-core
Memory bandwidth is more valuable
Pin count not increasing as fast as # of transistors
10% vs. 2x every 2 years
More cores put more pressure on the memory bandwidth
37
![Page 38: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/38.jpg)
Some Issues in Multi-Core Design
Shared Main Memory System
Shared vs. Private Caches
Interconnect Design
Amdahl’s Law: Asymmetric Multi-Core Chips
38
![Page 39: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/39.jpg)
On-Chip Interconnects
Or Networks-On-Chip (NoC)
Each node on chip consists of
A core and caches associated with the core
How should we connect the nodes?
A shared bus is not scalable
A crossbar is too expensive
A ring?
A 2D mesh?
A torus?
39
![Page 40: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/40.jpg)
On-Chip Interconnects
What we want
Fast communication
No congestion
Many paths or good routing
Small area overhead
Small energy consumption
40
![Page 41: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/41.jpg)
2D Mesh
+ Easy to layout in a 2D chip
+ Many paths, yet relatively simple
-- Large diameter, maximum distance
-- Large energy and area overhead
-- Compared to rings, buses
-- Many buffers in router
41
![Page 42: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/42.jpg)
How to Make a 2D Mesh More Efficient
NoC consumes 20-40% of system power in prototype chips
Problem: Buffers consume energy, occupy area, increase router/NoC complexity/latency
Question: When are buffers most helpful? Congestion.
Observation: On-chip networks lightly loaded
Idea: Eliminate Buffers
Misroute a packet upon congestion instead of buffering it
Called Hot Potato routing
Deflected/misrouted packets eventually reach destination
42
![Page 43: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/43.jpg)
Bufferless On-Chip Networks
Benefits
Network Energy Savings: ~40%
Performance Increase: ~2%
Reduced router latency
Network Area Savings: ~40%
Simpler network/router design
Adaptivity, deadlock freedom
Many remaining research issues
How to provide fairness to cores?
How to provide quality of service guarantees?
Better routing and flow-control algorithms to handle congestion
Prototyping in FPGAs
How to apply it to other topologies?43
![Page 44: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/44.jpg)
Some Issues in Multi-Core Design
Shared Main Memory System
Shared vs. Private Caches
Interconnect Design
Amdahl’s Law: Asymmetric Multi-Core Chips
44
![Page 45: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/45.jpg)
Remember Amdahl’s Law?
Begins with Simple Software Assumption (Limit Arg.)
Fraction F of execution time perfectly parallelizable
No Overhead for
Scheduling
Communication
Synchronization, etc.
Fraction 1 – F Completely Serial
Time on 1 core = (1 – F) / 1 + F / 1 = 1
Time on N cores = (1 – F) / 1 + F / N
Speedup limited by the serial fraction of the program
45*Slide credit: Mark Hill, HPCA 2007
![Page 46: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/46.jpg)
Accelerating Serial Program Portions
Tile-large: Good at serial program portions
Niagara: Good at exploiting thread-level parallelism
ACMP (Asymmetric Multi-Core)
Good at both
Serial: on large core, Parallel: on many small cores
46
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Large
core
ACMP Approach
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
Niagara
-like
core
“Niagara” Approach
Large
core
Large
core
Large
core
Large
core
“Tile-Large” Approach
![Page 47: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/47.jpg)
Asymmetric Multi-Core Approach
47
![Page 48: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/48.jpg)
Performance vs. Parallel Fraction
48
![Page 49: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/49.jpg)
Performance vs. Parallel Fraction (II)
49
![Page 50: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/50.jpg)
Performance vs. Parallel Fraction (III)
50
![Page 51: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/51.jpg)
Performance vs. Parallel Fraction (IV)
51
![Page 52: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/52.jpg)
Asymmetric Multi-Core Chips
Powerful execution engines are needed to execute
Single-threaded applications
Serial sections of multithreaded applications (remember Amdahl’s law)
Where single thread performance matters (e.g., transactions, game logic)
Accelerate multithreaded applications (e.g., critical sections)
Corollary: Core design and enhancements still very important in multi-core chips
Many research questions
How many types of cores? How many “powerful” cores?
Specialized accelerator cores? For what kernels/applications?
How to allocate cores to threads and applications?
What should be shipped to and executed on powerful cores?52
![Page 53: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/53.jpg)
Summary
Multi-core chips bring about many new challenges
In Computer Architecture
Design of uncore components
Design of cores
Allocation of chip real-estate to types of cores and uncore
In System Software
Hardware resource allocation and management
Virtualization and QoS support
In Programming Languages and Compilers
Parallelization, thread extraction, easy parallel programming
53
![Page 54: 18-447 Computer Architecture Guest Lecture: Multi-Core Systemsjhoe/course/ece447/S09handouts/L24.pdf · Chip-multiprocessor architectures are mainstream Reduce mainly the “complexity](https://reader036.fdocuments.net/reader036/viewer/2022062604/5fc648c28f591066951e258e/html5/thumbnails/54.jpg)
References and Readings
Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,“ USENIX SECURITY 2007.
Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,“ MICRO 2007.
Moscibroda and Mutlu, "A Case for Bufferless Routing in On-Chip Networks,“ ISCA 2009.
Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
Hill and Marty, “Amdahl's Law in the Multicore Era,” IEEE Computer 2008.
54