Optimizing shared caches in chip multiprocessors
![Page 1: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/1.jpg)
Core 2 Duo die
“Just a few years ago, the idea of putting multiple processors on a chip was farfetched. Now it is accepted and commonplace, and virtually every new high performance processor is a chip multiprocessor of some sort…”
Center for Electronic System Design, Univ. of California, Berkeley
Chip Multiprocessors??
“Mowry is working on the development of single-chip multiprocessors: one large chip capable of performing multiple operations at once, using similar techniques to maximize performance”
-- Technology Review, 1999
Sony's Playstation 3, 2006
![Page 2: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/2.jpg)
CMP Caches: Design Space
• Architecture
  – Placement of Cache/Processors
  – Interconnects/Routing
• Cache Organization & Management
  – Private/Shared/Hybrid
  – Fully Hardware/OS Interface
“L2 is the last line of defense before hitting the memory wall, and is the focus of our talk”
![Page 3: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/3.jpg)
Private L2 Cache
[Figure: processors (Proc), each with L1 I$/D$ and a private L2 $, joined by an interconnect with a coherence protocol to off-chip memory]
+ Less interconnect traffic
+ Insulates L2 units
+ Lower hit latency
– Duplication
– Load imbalance
– Complexity of coherence
– Higher miss rate
![Page 4: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/4.jpg)
Shared-Interleaved L2 Cache
[Figure: processors with L1 I$/D$ connected through an interconnect with a coherence protocol to a shared L2]
– Interconnect traffic
– Interference between cores
– Hit latency is higher
+ No duplication
+ Balance the load
+ Lower miss rate
+ Simplicity of coherence
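To make "interleaved" concrete, here is a minimal sketch of how a shared L2 might spread blocks across its banks. The block size, the bank count, and the simple modulo mapping are all assumptions for illustration; the slides do not specify them.

```python
BLOCK_BYTES = 64   # assumed cache-block size
NUM_BANKS = 4      # assumed number of L2 banks

def home_bank(addr: int) -> int:
    """Return the L2 bank holding the block that contains addr."""
    block_number = addr // BLOCK_BYTES
    return block_number % NUM_BANKS   # consecutive blocks go to consecutive banks

# Eight consecutive blocks cycle through the four banks.
banks = [home_bank(i * BLOCK_BYTES) for i in range(8)]
print(banks)
```

Because every block has exactly one home bank, there is no duplication and load spreads evenly, but a core usually has to cross the interconnect to reach the bank it needs.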
![Page 5: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/5.jpg)
Take Home Messages
• Leverage on-chip access time
• Better sharing of cache resources
• Isolating performance of processors
• Place data on the chip close to where it is used
• Minimize inter-processor misses (in shared cache)
• Fairness towards processors
![Page 7: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/7.jpg)
On to some solutions…
Jichuan Chang and Gurindar S. Sohi. Cooperative Caching for Chip Multiprocessors. International Symposium on Computer Architecture, 2006.
Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. International Symposium on Computer Architecture, 2009.
Shekhar Srikantaiah, Mahmut Kandemir, and Mary Jane Irwin. Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors. Architectural Support for Programming Languages and Operating Systems, 2008.
Each handles this problem in a different way.
![Page 8: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/8.jpg)
Co-operative Caching (Chang & Sohi)
• Private L2 caches
• Attract data locally to reduce remote on-chip accesses. Lowers average on-chip misses.
• Co-operation among the private caches for efficient use of resources on the chip.
• Controlling the extent of co-operation to suit the dynamic workload behavior
![Page 9: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/9.jpg)
CC Techniques
• Cache-to-cache transfer of clean data
  – On a miss, transfer “clean” blocks from another L2 cache.
  – This is useful for “read only” data (instructions).
• Replication-aware data replacement
  – Blocks are either singlets (the only on-chip copy) or replicates.
  – Evict a singlet only when no replicates exist.
  – Singlets can be “spilled” to other cache banks.
• Global replacement of inactive data
  – Global management is needed to control “spilling”.
  – N-Chance Forwarding.
  – Set the block's recirculation count to N when it is first spilled.
  – Decrease the count by 1 each time the block is spilled again; once the count reaches 0, the block is not forwarded any further.
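The N-Chance Forwarding rule above can be sketched as follows. This is an illustrative model only: the `Block` class, the `on_evict` function, and the random choice of receiving cache are assumptions made for the example, not details taken from the slides.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    addr: int
    recirculation: Optional[int] = None   # None means "never spilled so far"

def on_evict(block: Block, evicting_cache: int, num_caches: int,
             rng: random.Random, n: int = 1) -> Optional[int]:
    """Handle eviction of a singlet from one private L2.

    Returns the index of the peer cache that receives the block,
    or None when the block leaves the chip.
    """
    if block.recirculation is None:
        block.recirculation = n          # first spill: start with N chances
    else:
        block.recirculation -= 1         # spilled again: one chance used up
    if block.recirculation <= 0:
        return None                      # no chances left, drop the block
    peers = [c for c in range(num_caches) if c != evicting_cache]
    return rng.choice(peers)             # assumed policy: pick a random peer

rng = random.Random(0)
blk = Block(addr=0x40)
first_host = on_evict(blk, evicting_cache=0, num_caches=4, rng=rng, n=1)
second_host = on_evict(blk, evicting_cache=first_host, num_caches=4, rng=rng, n=1)
print(first_host is not None, second_host)
```

With n = 1 the block is forwarded once and dropped the next time it is evicted; a larger n lets inactive data circulate longer at the cost of more on-chip traffic.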
![Page 10: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/10.jpg)
Set “Pinning” -- Setup
[Figure: processors P1–P4, each with an L1 cache, connected through an interconnect to a shared L2 cache (Set 0 … Set S-1) and main memory]
![Page 11: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/11.jpg)
Set “Pinning” -- Problem
[Figure: processors P1–P4 accessing the sets (Set 0 … Set S-1) of the shared L2 cache, backed by main memory]
![Page 12: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/12.jpg)
Set “Pinning” -- Types of Cache Misses
• Compulsory (aka Cold)
• Capacity
• Conflict
• Coherence
versus
• Compulsory
• Inter-processor
• Intra-processor
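The second classification can be illustrated with a small trace-driven sketch. It assumes that an inter-processor miss is a miss on a block last evicted by a different processor's reference, and an intra-processor miss is one on a block evicted by the same processor. The fully associative LRU cache and the `classify_misses` helper are simplifications invented for this example.

```python
from collections import OrderedDict

def classify_misses(trace, capacity):
    """Label each (cpu, block) access to a shared, fully associative LRU cache."""
    cache = OrderedDict()    # block -> cpu that inserted it, oldest first
    evicted_by = {}          # block -> cpu whose reference evicted it
    seen = set()             # blocks referenced at least once
    labels = []
    for cpu, block in trace:
        if block in cache:
            cache.move_to_end(block)           # refresh LRU position
            labels.append("hit")
            continue
        if block not in seen:
            labels.append("compulsory")
        elif evicted_by[block] == cpu:
            labels.append("intra-processor")
        else:
            labels.append("inter-processor")
        seen.add(block)
        cache[block] = cpu
        if len(cache) > capacity:
            victim, _ = cache.popitem(last=False)   # evict the LRU block
            evicted_by[victim] = cpu                # remember who caused it
    return labels

trace = [(1, "A"), (2, "B"), (2, "C"), (1, "A"), (1, "D"), (1, "C"), (1, "C")]
labels = classify_misses(trace, capacity=2)
print(labels)
```

In this trace, P1's second access to A misses because P2's reference to C evicted it (inter-processor), while P1's miss on C is caused by P1's own reference to D (intra-processor).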
![Page 13: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/13.jpg)
[Figure: processors P1–P4, each with a POP cache (POP 1–POP 4), in front of the shared cache sets; each set entry holds Owner, Other bits, and Data fields; main memory behind]
![Page 14: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/14.jpg)
R-NUCA: Use Class-Based Strategies
Solve for the common case!
Most current (and future) programs have the following types of accesses:
1. Instruction Access – Shared, but Read-Only
2. Private Data Access – Read-Write, but not Shared
3. Shared Data Access – Read-Write (or) Read-Only, but Shared
![Page 15: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/15.jpg)
R-NUCA: Can do this online!
• We have information from the OS and TLB
• For each memory block, classify it as
  – Instruction
  – Private Data
  – Shared Data
• Handle them differently
  – Replicate instructions
  – Keep private data locally
  – Keep shared data globally
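A rough sketch of the online classification idea, under stated assumptions: classification is done per page, a plain dictionary stands in for the OS page table, and a page starts as private to the first core that touches it and becomes shared when a second core accesses it. The `PageClassifier` class and its page size are hypothetical and chosen only for illustration.

```python
class PageClassifier:
    """Toy stand-in for OS/TLB bookkeeping; not the authors' implementation."""

    def __init__(self, page_bytes=4096):       # assumed page size
        self.page_bytes = page_bytes
        self.table = {}                          # page -> (class, owner core)

    def access(self, core, addr, is_ifetch):
        page = addr // self.page_bytes
        if is_ifetch:
            self.table[page] = ("instruction", None)
        elif page not in self.table:
            self.table[page] = ("private", core)     # first touch: private
        else:
            kind, owner = self.table[page]
            if kind == "private" and owner != core:
                self.table[page] = ("shared", None)  # second core: now shared
        return self.table[page][0]

# Handling per class, as listed on the slide.
PLACEMENT = {
    "instruction": "replicate",
    "private": "keep locally",
    "shared": "keep globally",
}

pc = PageClassifier()
first = pc.access(core=0, addr=0x1000, is_ifetch=False)
second = pc.access(core=1, addr=0x1010, is_ifetch=False)
code = pc.access(core=2, addr=0x8000, is_ifetch=True)
print(first, second, code, PLACEMENT[second])
```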
![Page 16: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/16.jpg)
R-NUCA: Reactive Clustering
• Assign clusters based on level of sharing
  – Private Data given level-1 clusters (local cache)
  – Shared Data given level-16 clusters (16 neighboring tiles), etc.
• Clusters ≈ Overlapping Sets in Set-Associative Mapping
• Within a cluster, “Rotational Interleaving”
  – Load-Balancing to minimize contention on bus and controller
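As a simplified illustration of cluster-based placement, the sketch below maps a block to a cache slice according to its class. It assumes a 16-tile chip, so the size-16 cluster covers every tile, and it uses plain modulo interleaving instead of the rotational interleaving named on the slide. All names and constants are hypothetical.

```python
NUM_TILES = 16      # assumed tile count, matching the size-16 cluster
BLOCK_BYTES = 64    # assumed cache-block size
CLUSTER_SIZE = {"private": 1, "shared": 16}

def home_slice(kind: str, requester: int, addr: int) -> int:
    """Pick the L2 slice that should hold the block for this requester."""
    size = CLUSTER_SIZE[kind]
    if size == 1:
        return requester                      # size-1 cluster: the local slice
    # Size-16 cluster spans the whole chip: interleave blocks across slices.
    return (addr // BLOCK_BYTES) % size

print(home_slice("private", requester=5, addr=0x1234))
print(home_slice("shared", requester=5, addr=0x40))
print(home_slice("shared", requester=9, addr=0x40))
```

Private data always lands next to its requester, while a shared block has a single home slice regardless of which core asks for it.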
![Page 17: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/17.jpg)
Future Directions
Area has been closed.
![Page 18: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/18.jpg)
Just Kidding…
• Optimize for Power Consumption
• Assess trade-offs between more caches and more cores
• Minimize usage of OS, but still retain flexibility
• Application adaptation to allocated cache quotas
• Adding hardware-directed thread-level speculation
![Page 19: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/19.jpg)
Questions?
THANK YOU!
![Page 20: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/20.jpg)
Backup
• Commercial and research prototypes
  – Sun MAJC
  – Piranha
  – IBM Power 4/5
  – Stanford Hydra
![Page 21: Optimizing shared caches in chip multiprocessors](https://reader036.fdocuments.net/reader036/viewer/2022070513/5885e7341a28ab906d8b75e1/html5/thumbnails/21.jpg)
Backup