CS 258 Parallel Computer Architecture
LimitLESS Directories: A Scalable Cache Coherence Scheme
David Chaiken, John Kubiatowicz, and Anant Agarwal
Presented: March 19, 2008
Ankit Jain
LimitLESS.2 3/19/08
The Background & Problems
• Bus-Based Protocols
  – Do not scale because broadcasts are slow and limit parallelism
• Traditional Directory-Based Protocols
  – Monolithic Directories
    » Implicitly serialize all memory requests
  – Directory accesses consume a disproportionately large fraction of available network bandwidth
  – Full Directories are Large
    » Full-Map Size: Total Memory Size * Number of Processors
  – Limited Directory Protocols
    » Allow only a limited number of simultaneous cached copies of any block of data
    » Pro: Size of directory is smaller
    » Con: Potential thrashing, since pointers are evicted and reassigned when more simultaneous copies are needed
    » Previous studies show a small set of pointers is sufficient to capture the worker-set of processors
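To make the size comparison above concrete, here is a minimal sketch of the arithmetic. It assumes one presence bit per processor per block for a full map, and a fixed number of processor-ID pointers per block for a limited directory. State bits are ignored, and the machine parameters (64 processors, 2**24 blocks, 4 pointers) are illustrative values that do not come from the slides.

```python
def full_map_bits(num_blocks: int, num_procs: int) -> int:
    """Full map: one presence bit per processor for every memory block."""
    return num_blocks * num_procs


def limited_bits(num_blocks: int, num_procs: int, num_pointers: int) -> int:
    """Limited directory: num_pointers pointers per block, each wide enough
    to name any one of num_procs processors."""
    pointer_width = max(1, (num_procs - 1).bit_length())
    return num_blocks * num_pointers * pointer_width


# Illustrative (assumed) machine: 64 processors, 2**24 memory blocks in total.
blocks, procs, pointers = 2**24, 64, 4
full = full_map_bits(blocks, procs)
limited = limited_bits(blocks, procs, pointers)
print(f"full map: {full} bits, limited ({pointers} pointers): {limited} bits")
```

With these assumed numbers the full map grows linearly in the processor count per block, while the limited directory grows only with the pointer width (log2 of the processor count).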
Alewife Architecture
• Cost Effective Mesh Network
  – Pro: Scales in terms of hardware
  – Pro: Exploits Locality
• Directory Distributed along with main memory
  – Bandwidth scales with number of processors
• Con: Non-Uniform Latencies of Communication
  – Have to manage the mapping of processes/threads onto processors
  – Alewife employs techniques for latency minimization and latency tolerance so the programmer does not have to manage this
• Context switch between processes takes 11 cycles and is triggered on a remote memory request that has to incur communication network latency
• Cache Controller holds tags and implements the coherence protocol
• Cache Controller holds tags and implements the coherence protocol
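The non-uniform latency point can be illustrated with a hop count on a 2D mesh. This is only a sketch: the row-by-row node numbering, the 8x8 size, and the use of Manhattan distance (as with dimension-ordered routing) are assumptions for illustration, not details taken from the slides.

```python
def mesh_hops(src: int, dst: int, width: int) -> int:
    """Manhattan distance between two nodes of a 2D mesh whose nodes are
    numbered row by row (a hypothetical numbering)."""
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


# On an assumed 8x8 mesh, a neighbour is 1 hop away, the far corner 14 hops.
near = mesh_hops(0, 1, width=8)
far = mesh_hops(0, 63, width=8)
print(near, far)
```

The spread between the two values is why the mapping of threads onto processors matters on such a network.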
LimitLESS Protocol + Requirements
• Limited Directory that is Locally Extended through Software Support
• Handle the common case (small worker set) in hardware and the exceptional case (overflow) in software
• Processor with rapid trap handling (executes trap code within 5-10 cycles of initiation)
• State Shared
  – Processor needs complete access to coherence related controller state in the hardware directories
  – Directory Controller can invoke processor trap handlers
• Machine needs an interface to the network that allows the processor to launch and intercept coherence protocol packets
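The split between the hardware common case and the software exceptional case can be sketched as a single dispatch decision on each read request. The function name, the list-based pointer array, and the returned labels are hypothetical, chosen only to illustrate the idea.

```python
def handle_read_request(pointers: list, capacity: int, requester: int) -> str:
    """Decide who services a read request for one memory block.

    pointers  -- processors currently recorded in the hardware pointers
    capacity  -- number of hardware pointers per directory entry
    """
    if requester in pointers:
        return "hardware"            # already recorded, nothing to add
    if len(pointers) < capacity:
        pointers.append(requester)   # common case: a free pointer exists
        return "hardware"
    return "software-trap"           # overflow: hand off to the trap handler


# Six distinct readers against an assumed four-pointer entry.
pointers = []
outcomes = [handle_read_request(pointers, 4, proc) for proc in range(6)]
print(outcomes)
```

Only the last two requests reach software here, which is the behaviour the slide relies on when worker sets are small.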
The Protocol
Note: In the Read-Only State, the notation S: n>p indicates that the outputs from the state are handled through a software interrupt handler if the size of the pointer set (n) is greater than the size of the limited directory (p).
An Example
• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i

| Data Block | State |
| --- | --- |
| d | Read-Write |

Processor j

| Data Block | State |
| --- | --- |
| d | Invalid |

Processor d Directory Entry

| Data Block | State | AckCtr | Owning Processors |
| --- | --- | --- | --- |
| d | Read-Write | 0 | i |
An Example
• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i

| Data Block | State |
| --- | --- |
| d | Read-Write |

Processor j

| Data Block | State |
| --- | --- |
| d | Invalid |

Input message: j → WREQ
Precondition: P = { i }
Output message: INV → i

Processor d Directory Entry

| Data Block | State | AckCtr | Owning Processors |
| --- | --- | --- | --- |
| d | Read-Write | 0 | i |
An Example
• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i

| Data Block | State |
| --- | --- |
| d | Invalid |

Processor j

| Data Block | State |
| --- | --- |
| d | Invalid |

Processor d Directory Entry

| Data Block | State | AckCtr | Owning Processors |
| --- | --- | --- | --- |
| d | Read-Write | 1 | j |
An Example
• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i

| Data Block | State |
| --- | --- |
| d | Invalid |

Processor j

| Data Block | State |
| --- | --- |
| d | Invalid |

Input message: i → ACKC
Precondition: AckCtr = 1, P = { j }

Processor d Directory Entry

| Data Block | State | AckCtr | Owning Processors |
| --- | --- | --- | --- |
| d | Read-Write | 1 | j |
An Example
• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i

| Data Block | State |
| --- | --- |
| d | Invalid |

Processor j

| Data Block | State |
| --- | --- |
| d | Read-Write |

Processor d Directory Entry

| Data Block | State | AckCtr | Owning Processors |
| --- | --- | --- | --- |
| d | Read-Write | 0 | j |
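The example can be replayed as a small simulation. This is a sketch under simplifying assumptions: messages are delivered instantly and in order, the dictionaries standing in for the caches and the directory entry are hypothetical, and the final "write permission" message is named generically because the example slides do not name it.

```python
def write_transaction(caches: dict, directory: dict, writer: str) -> list:
    """Replay WREQ, INV, ACKC and the final write grant for one block.
    Returns the ordered list of messages exchanged."""
    owners = directory["owners"]
    if len(owners) != 1 or owners[0] == writer:
        raise ValueError("precondition P = {owner}, owner != writer not met")
    owner = owners[0]

    log = [f"{writer} -> WREQ"]           # writer asks the home node
    directory["owners"] = [writer]        # directory now points at the writer
    directory["ack_ctr"] = 1              # one invalidation outstanding
    log.append(f"INV -> {owner}")

    caches[owner] = "Invalid"             # old owner drops its copy
    log.append(f"{owner} -> ACKC")
    directory["ack_ctr"] -= 1

    if directory["ack_ctr"] == 0:         # all acknowledgements received
        caches[writer] = "Read-Write"
        log.append(f"write permission -> {writer}")
    return log


caches = {"i": "Read-Write", "j": "Invalid"}
directory = {"state": "Read-Write", "ack_ctr": 0, "owners": ["i"]}
for message in write_transaction(caches, directory, "j"):
    print(message)
```

After the call, the cache and directory contents match the final tables of the example: i is Invalid, j is Read-Write, AckCtr is 0, and the owning processor is j.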
Interprocessor-Interrupt (1/2)
• Trap routine can either discard packet or store it to memory
• Store-back capability permits message-passing and block transfers
• Potential Deadlock Scenario with Processor Stalled and waiting for a remote cache-fill
• Solution: a synchronous trap (handler stored in local memory) empties the input queue
Interprocessor-Interrupt (2/2)
• Overflow Trap Scenario
  – First Instance: Full-Map bit-vector allocated in local memory, hardware pointers emptied into it, and vector entered into hash table
  – Otherwise: Empty hardware pointers into bit vector
  – Meta-State Set to “Trap-On-Write”
  – While emptying hardware pointers, Meta-State: “Trans-In-Progress”
• Incoming Write Request Scenario
  – Empty hardware pointers to memory
  – Set AckCtr to number of bits that are set in bit-vector
  – Send invalidations to all caches except possibly requesting one
  – Free vector in memory
  – Upon invalidate acknowledgement (AckCtr == 0), send Write-Permission and set Memory State to “Read-Write”
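The two trap scenarios above can be sketched as software handlers. Everything here is illustrative: a Python set stands in for the full-map bit vector, a dict stands in for the hash table, and the entry layout, the "Normal" meta-state name, and the block address are assumptions. AckCtr is set to the number of invalidations actually sent, which covers the "except possibly requesting one" case.

```python
def overflow_trap(entry: dict, table: dict, block: int) -> None:
    """Spill the hardware pointers of one entry into the software vector."""
    entry["meta_state"] = "Trans-In-Progress"   # held while pointers are emptied
    vector = table.setdefault(block, set())     # first overflow allocates the vector
    vector.update(entry["pointers"])
    entry["pointers"].clear()
    entry["meta_state"] = "Trap-On-Write"


def write_request_trap(entry: dict, table: dict, block: int, writer: int) -> list:
    """Return the processors that must be sent invalidations."""
    vector = table.pop(block, set())            # look up, then free, the vector
    vector.update(entry["pointers"])            # empty hardware pointers to memory
    entry["pointers"].clear()
    targets = sorted(vector - {writer})         # skip the requesting cache
    entry["ack_ctr"] = len(targets)
    return targets


def invalidate_ack(entry: dict, writer: int) -> bool:
    """Count one acknowledgement; True once write permission can be sent."""
    entry["ack_ctr"] -= 1
    if entry["ack_ctr"] == 0:
        entry["state"] = "Read-Write"
        entry["pointers"] = [writer]
        return True
    return False


entry = {"state": "Read-Only", "meta_state": "Normal",
         "pointers": [0, 1, 2, 3], "ack_ctr": 0}
table = {}
overflow_trap(entry, table, block=0x40)         # a fifth reader overflowed the entry
entry["pointers"].append(4)                     # assumption: it takes a freed pointer
targets = write_request_trap(entry, table, block=0x40, writer=2)
granted = [invalidate_ack(entry, writer=2) for _ in targets]
print(targets, granted, entry["state"])
```

Resetting the meta-state after the write completes is not described on the slide, so the sketch leaves it untouched.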
Performance Technique
Notes:
• Multigrid: worker sets are small, so limited directories perform as well as full-map
• SIMPLE implemented barrier synchronization with single lock
• Matexpr has worker sets up to 16 processors
• Weather has one variable initialized by one processor and then read by all the other processors
Results (1/3)
Results (2/3)
Results (3/3)
Summary
• LimitLESS directories can closely emulate Full-Map Directories while saving hardware resources
• LimitLESS is not as sensitive to tuning parameters as the Limited Directory approach
• The protocol is general enough to apply to other coherence techniques
• In the future, it can be extended to give feedback to programmers/compilers about hot-spots, etc
Full Memory State Transition Diagram