Lecture 3. Directory-based Cache Coherence
Prof. Taeweon Suh, Computer Science Education
Korea University
COM503 Parallel Computer Architecture & Programming
Korea Univ

Scalability Problem with Bus
• Bus-based machines are essentially based on broadcasting via the bus
• Bus bandwidth becomes a limitation as the number of processors increases
  • Bandwidth = #wires × clock frequency
    • Wire length grows with #processors
    • Load (capacitance) grows with #processors
  • Snoopy bus could delay memory accesses

Distributed Shared Memory
• Goal: scale up machines
  • Distribute memory
  • Use a scalable point-to-point interconnection network providing
    • Bandwidth that scales linearly with #nodes
    • Memory access latency that grows sub-linearly with #nodes
[Figure: nodes P1 … Pn, each with a cache ($), memory (Mem), and communication assist (CA), connected by a scalable interconnection network]

Distributed Shared Memory (Cont.)
• Important issue
  • How to provide caching and coherence in hardware on a machine with physically distributed memory, without the benefits of a globally snoopable interconnect such as a bus
  • Not only must the memory access latency and bandwidth scale well, but so must the protocols used for coherence, at least up to the scales of practical interest
• The most common approach for full hardware support for cache coherence: directory-based cache coherence

A Scalable Multiprocessor with Directories
[Figure: nodes P1 … Pn, each with a cache ($), communication assist (CA), memory, and directory, connected by a scalable interconnection network]
• Scalable cache coherence is typically based on the concept of a directory
  • The state of each memory block is maintained explicitly in a place called a directory
  • The directory entry for a block is co-located with the main memory that holds the block
    • Its location can be determined from the block address
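
As a rough illustration of how a directory entry's location can follow from the block address, the sketch below assumes that memory blocks are interleaved across nodes at block granularity. The interleaving policy and the names `BLOCK_SIZE`, `NUM_NODES`, `home_node`, and `directory_index` are assumptions made for this example; a real machine may instead assign a contiguous address range to each node.

```python
BLOCK_SIZE = 64   # bytes per memory block (assumed for illustration)
NUM_NODES = 4     # number of nodes in the machine (assumed for illustration)

def home_node(address: int) -> int:
    """Return the node whose memory (and directory) holds this address.

    Assumes block-interleaved placement: consecutive blocks are placed
    on consecutive nodes.
    """
    block_number = address // BLOCK_SIZE
    return block_number % NUM_NODES

def directory_index(address: int) -> int:
    """Return the index of the block's directory entry within its home node."""
    block_number = address // BLOCK_SIZE
    return block_number // NUM_NODES
```

With these assumed parameters, addresses 0 to 63 map to node 0, addresses 64 to 127 map to node 1, and address 256 wraps around to node 0 again as that node's second block.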

2-level Cache Coherent Systems
• An approach that is (was?) popular takes the form of a 2-level protocol hierarchy
  • Each node of the machine is itself a multiprocessor
  • The caches within a node are kept coherent by an inner protocol
  • Coherence across the nodes is maintained by an outer protocol
• A common organization
  • Inner protocol: snooping protocol
  • Outer protocol: directory-based protocol

ccNUMA
• Main memory is physically distributed and has non-uniform access costs to a processor
  • Architectures of this type are often called cache-coherent, non-uniform memory access (ccNUMA)
• More generally, systems that provide a shared address space programming model with physically distributed memory and coherent replication are called distributed shared memory (DSM) systems

DSM (NUMA) Machine Examples
• Nehalem-based systems with QPI (QuickPath Interconnect)
[Figure: Nehalem-based Xeon 5500 system. Source: http://www.qdpma.com/systemarchitecture/SystemArchitecture_QPI.html]

Directory-based Approach
• Three important functions upon a cache miss
  • Finding out enough information about the state of the memory block
  • Locating other copies if needed (e.g., to invalidate them)
  • Communicating with the other copies (e.g., obtaining data from them or invalidating or updating them)
• In snoopy protocols, all three functions are done by the broadcast on the bus
• Broadcasting in DSM machines generates a large amount of traffic
  • On a p-node machine, at least p network transactions on every miss
  • So, it does not scale
• A simple directory approach:
  • Look up the directory to find out the information about the state of blocks in other caches
  • The location of the copies is also found from the directory
  • The copies are communicated with using point-to-point network transactions in an arbitrary interconnection network, without resorting to broadcast

Terminology
• Home node is the node in whose main memory the block is allocated
• Dirty node is the node that has a copy of the block in its cache in modified (dirty) state
  • The home node and the dirty node for a block may be the same
• Owner node is the node that currently holds the valid copy of a block and must supply the data when needed
  • This is either the home node (when the block is not in dirty state in a cache) or the dirty node
• Exclusive node is the node that has a copy of the block in its cache in an exclusive (dirty or clean) state
• Local node (or requesting node) is the node containing the processor that issues a request for the block
• Blocks whose home is the requesting node are called locally allocated (or local blocks), whereas all others are called remotely allocated (or remote blocks)
[Figure: nodes P1 … Pn, each with a cache ($), communication assist (CA), memory, and directory, connected by a scalable interconnection network]

A Directory Structure
• A natural way to organize a directory
  • Maintain the directory together with a block in main memory at home
• A simple organization of the directory for a block is a bit vector of n presence bits (one for each of the n nodes, indicating whether that node caches the block) together with state bits
  • For simplicity, let’s assume that there is only one state bit (dirty)
[Figure: each memory block at a node has a directory entry consisting of n presence bits and a dirty bit]
A Directory Structure (Cont.)
• If the dirty bit is ON, then only one node (the dirty node) should be caching that block and only that node’s presence bit should be ON
• A read miss can be serviced by looking up the directory to see which node has a dirty copy of the block or if the block is valid in main memory at home
• A write miss can be serviced by looking up the directory as well to see which nodes are the sharers that must be invalidated

Basic Operation
[Figure: basic operation of the directory protocol involving the home node]

A Read Miss (at node i) Handling
• If the dirty bit is OFF,
  • The CA (communication assist) at the home obtains the block from main memory, supplies it to the requestor in a reply transaction, and turns the ith presence bit (presence[i]) ON
• If the dirty bit is ON,
  • The CA at the home responds to the requestor with the identity of the node whose presence bit is ON (i.e., the owner node)
  • The requestor then sends a request network transaction to that owner node
  • At the owner, the cache changes its state to shared and supplies the block to both the requesting node and main memory at the home node
    • The requestor node stores the block in its cache in shared state
    • At memory, the dirty bit is turned OFF and presence[i] is turned ON
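
The read-miss steps above can be sketched as a small simulation. This is a minimal sketch, not the lecture's implementation: the names `DirEntry` and `read_miss`, the cache-state letters "M", "S", and "I", and the message strings are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class DirEntry:
    presence: list        # presence[j] is True if node j caches the block
    dirty: bool = False   # True if exactly one node holds a modified copy

def read_miss(entry: DirEntry, cache_state: dict, i: int) -> list:
    """Handle a read miss at node i; return the network transactions used.

    cache_state maps node id -> "M" (dirty), "S" (shared), or "I" (invalid).
    """
    msgs = ["requestor->home: read request"]
    if not entry.dirty:
        # Main memory at home is valid, so home replies with the data.
        msgs.append("home->requestor: data reply")
    else:
        owner = entry.presence.index(True)   # the only presence bit that is ON
        msgs.append("home->requestor: owner identity")
        msgs.append("requestor->owner: read request")
        cache_state[owner] = "S"             # owner downgrades to shared
        msgs.append("owner->requestor: data reply")
        msgs.append("owner->home: data for main memory")
        entry.dirty = False
    entry.presence[i] = True
    cache_state[i] = "S"
    return msgs
```

In this sketch a miss to a clean block takes two transactions, while a miss to a dirty block takes five, which is one way to see why the dirty case dominates read-miss latency.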

A Write Miss (at node i) Handling
• If the dirty bit is OFF,
  • Main memory has a clean copy of the data
  • Invalidation request transactions must be sent to all nodes j for which presence[j] is ON
    • The home node supplies the block to the requesting node (node i) together with the presence bit vector
    • The CA (communication assist) at the requestor sends invalidation requests to the required nodes and waits for acknowledgment transactions from them
  • The directory entry is cleared, leaving only presence[i] and the dirty bit ON
  • The requestor places the block in its cache in dirty state
• If the dirty bit is ON,
  • The block is first delivered from the dirty node (whose presence bit is ON) using network transactions
    • The cache in the dirty node changes its state to invalid, and then the block is supplied to the requesting processor
  • The requesting processor places the block in its cache in dirty state
  • The directory entry at the home node is updated so that only presence[i] and the dirty bit are ON
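
A minimal sketch of the write-miss handling follows, under the same simple bit vector directory. The function name `write_miss` and its return convention are assumptions for illustration; acknowledgments and the data transfer itself are not modeled.

```python
def write_miss(presence: list, dirty: bool, i: int):
    """Handle a write miss at node i under the simple bit vector directory.

    Returns (new_presence, new_dirty, invalidated), where `invalidated`
    lists the nodes whose cached copies are invalidated.
    """
    if dirty:
        # Exactly one node (the dirty node) holds the block; it supplies
        # the block and invalidates its own copy.
        invalidated = [presence.index(True)]
    else:
        # Home supplies the block and the presence vector; the requestor
        # invalidates every other sharer and waits for acknowledgments.
        invalidated = [j for j, bit in enumerate(presence) if bit and j != i]
    # Afterwards only presence[i] and the dirty bit are ON.
    new_presence = [j == i for j in range(len(presence))]
    return new_presence, True, invalidated
```

For example, with sharers at nodes 0, 2, and 3 and a write miss at node 0, the sketch reports that nodes 2 and 3 must be invalidated and leaves node 0 as the sole (dirty) holder.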

Block Replacement Issues in Cache
• Replacement of a dirty block in node i
  • The dirty data being replaced is written back to main memory in the home node
  • The directory is updated to turn OFF the dirty bit and presence[i]
• Replacement of a shared block in node i
  • A message may or may not be sent to the directory in the home node to turn OFF the corresponding presence bit
    • If the message was sent, an invalidation is not sent to this node the next time the block is written
    • This message is called a replacement hint (whether it is sent or not does not affect the correctness of the protocol or the execution)
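
The effect of a replacement hint on later invalidation traffic can be sketched as below. The function names `replace_shared` and `invalidations_on_write` are hypothetical, and the sketch only counts invalidation requests; it does not model the messages themselves.

```python
def replace_shared(presence: list, i: int, send_hint: bool) -> None:
    """Node i drops a shared block, optionally sending a replacement hint."""
    if send_hint:
        presence[i] = False   # home turns OFF the presence bit for node i

def invalidations_on_write(presence: list, writer: int) -> int:
    """Count the invalidation requests a later write by `writer` generates."""
    return sum(1 for j, bit in enumerate(presence) if bit and j != writer)
```

Without the hint the directory still lists node i as a sharer, so a later write sends it one unnecessary (but harmless) invalidation; with the hint that message is saved at the cost of the hint message itself.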

Scalability of Directory-based Protocol
• The main goal of using directory protocols is to allow cache coherence to scale beyond the number of processors that may be sustained by a bus
  • It is important to understand the scalability of directory protocols in terms of both performance and storage overhead for directory information
• The major performance scaling issues for a protocol are how the latency and bandwidth demands scale with the number of processors used
  • The bandwidth demands are governed by the number of network transactions generated per miss (multiplied by the frequency of misses), and latency by the number of these transactions that are in the critical path of the miss
  • These quantities are affected both by the directory organization and by how well the flow of network transactions is optimized in the protocol

Reducing Latency
[Figure: network transactions for a remote read (RemRd: remote read request) in the baseline protocol and in the optimized protocol adopted in Stanford DASH]

Race Conditions
• What if multiple requests for a block are in transit?
• Solutions
  • The directory controller can serialize transactions for a block through an additional status bit (busy or lock bit)
  • If the busy bit is set and the home node receives another request for the same block, there are 2 options:
    • Reject incoming requests for as long as the busy bit is set, that is, send a nack (negative acknowledgement) to the requests
    • Queue incoming requests: bandwidth is saved and request latency is kept to a minimum, but design complexity increases (e.g., queue overflow)
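
The nack option can be sketched as a home directory that tracks one busy bit per block. The class name `HomeDirectory` and the return strings are assumptions for illustration; the queueing option and the retry logic at the requester are not modeled.

```python
class HomeDirectory:
    """Serializes transactions per block with a busy bit (nack option)."""

    def __init__(self):
        self.busy = set()   # blocks that have a transaction in progress

    def request(self, block: int) -> str:
        if block in self.busy:
            return "NACK"            # requester must retry later
        self.busy.add(block)         # set the busy bit for this block
        return "ACCEPT"

    def complete(self, block: int) -> None:
        self.busy.discard(block)     # transaction finished; clear the busy bit
```

Requests to different blocks proceed independently; only a second request to a block whose busy bit is still set is rejected.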

Issue with Turning Off Busy Bit
• When to turn off the busy bit?
  • Turn off the busy bit before the home responds to the requester (local node)?
• Race condition example
  • In (c), before UpgrAck reaches the local node, what if RemRd arrives at the node?
    • This race condition occurs if subtransactions follow physically different paths
• One solution
  • Send a Nack to the RemRd request
[Figure: race condition example with panels including (c); RemRd: remote read request]

Scalability of Directory-based Protocol (Cont.)
• Storage is affected only by how the directory information is organized
  • For the simple bit vector organization, the number of presence bits needed grows linearly with
    • #nodes: p bits per memory block for p nodes
    • Main memory size: one bit vector per memory block
  • It could lead to a potentially large storage overhead for the directory
• Examples
  • With a 64-byte block size and 64 processors, what is the directory overhead as a fraction of non-directory (i.e., data) storage?
    • 64 bits for each 64-byte block: 64b/64B = 1/8 = 12.5%
  • What if there are 256 processors with the same block size?
    • 256 bits for each 64-byte block: 256b/64B = 1/2 = 50%!
  • What if there are 1024 processors?
    • 1024 bits for each 64-byte block: 1024b/64B = 2 = 200%!!!
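
The three examples above can be reproduced with a short calculation. The function name `directory_overhead` is an assumption for illustration, and, like the examples, the sketch counts only presence bits and ignores the dirty bit.

```python
def directory_overhead(num_nodes: int, block_bytes: int) -> float:
    """Presence bits per block divided by data bits per block."""
    return num_nodes / (block_bytes * 8)

for p in (64, 256, 1024):
    print(f"{p} nodes: {directory_overhead(p, 64):.1%}")
```

The overhead grows linearly with the number of nodes for a fixed block size, which is the scaling problem the alternative organizations try to address.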
Alternative Directory Organizations
• Fortunately, there are many other ways to organize directory information that improve the scalability of directory storage
• The different organizations naturally lead to different high-level protocols with different ways of addressing the protocol functions

Alternative Directory Organizations (Cont.)
• The 2 major classes of alternatives for finding the source of the directory information for a block
  • Hierarchical directory schemes
  • Flat directory schemes

Hierarchical Directory Schemes
• A hierarchical scheme organizes the processors as the leaves of a logical tree (need not be binary)
  • An internal node stores the directory entries for the memory blocks local to its children
    • A directory entry essentially indicates which of its children subtrees are caching the block and whether some subtrees that are not its children are caching the block
  • Finding the directory entry of a block involves a traversal of the tree until the entry is found
    • Inclusion is maintained between level k and level k+1 directory nodes, where the root is at the highest level
  • In the worst case, you may have to go to the root to find the directory entry
• These organizations are not popular since the latency and bandwidth characteristics tend to be much worse than those of flat schemes

Flat, Memory-Based Directory Schemes
• The bit vector organization is the most straightforward way to store directory information in a flat, memory-based scheme
• Performance characteristics on writes
  • The number of network transactions per invalidating write grows only with the number of actual sharers
    • Moreover, because the identity of all sharers is available at home, invalidations can be sent in parallel
    • The number of fully serialized network transactions in the critical path is thus not proportional to the number of sharers, reducing latency
• The main disadvantage of the bit vector organization is storage overhead

Flat, Memory-Based Directory Schemes (Cont.)
• 2 ways to reduce the overhead
  • Increase the cache block size
  • Put multiple processors (rather than just one) in a node that is visible to the directory protocol (that is, use a 2-level protocol)
    • Example: with 4-processor nodes and 128-byte cache blocks, the directory memory overhead for a 256-processor machine is 6.25%
• However, these methods reduce the overhead by only a small constant factor
  • Total directory storage is still proportional to P × (P × m)
    • P is #nodes
    • P × m is #total memory blocks in the machine
    • m is #memory blocks in local memory
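
The 6.25% figure in the example above can be checked as follows, counting one presence bit per node (not per processor) and ignoring state bits, as the earlier overhead examples do:

```latex
\[
\text{overhead}
  = \frac{\text{presence bits per block}}{\text{data bits per block}}
  = \frac{256 / 4}{128 \times 8}
  = \frac{64}{1024}
  = 6.25\%
\]
```

Both changes act only on the constants: grouping 4 processors per node divides the numerator by 4, and doubling the block size from 64 to 128 bytes doubles the denominator.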

Reducing Directory Overhead
• Reducing the directory width
  • Directory width is reduced by using limited pointer directories
    • Most of the time, only a few caches have a copy of a block when the block is written
    • Limited pointer schemes maintain a fixed number of pointers, each pointing to a node that currently caches a copy of the block
      • Each pointer takes log2 P bits of storage for P nodes
      • For example, for a machine with 1024 nodes, each pointer needs 10 bits, so even having 100 pointers uses less storage than a full bit vector scheme
    • These schemes need some kind of backup or overflow strategy for the situation when more than i readable copies are cached but the directory can keep track of only i copies precisely
• Reducing the directory height
  • Directory height can be reduced by organizing the directory itself as a cache
    • Only a small fraction of the memory blocks will actually be present in caches at a given time, so most of the directory entries will be unused anyway
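
The pointer-width arithmetic for limited pointer directories can be sketched as below. The function names `pointer_bits` and `limited_pointer_bits` are assumptions for illustration; state bits and any overflow indicator are not counted.

```python
def pointer_bits(num_nodes: int) -> int:
    """Bits needed for one pointer that can name any of num_nodes nodes.

    Equals ceil(log2(num_nodes)), computed with integer arithmetic.
    """
    return (num_nodes - 1).bit_length()

def limited_pointer_bits(num_nodes: int, num_pointers: int) -> int:
    """Total pointer storage per directory entry, in bits."""
    return num_pointers * pointer_bits(num_nodes)
```

For 1024 nodes, `limited_pointer_bits(1024, 100)` gives 1000 bits, which is below the 1024 bits of a full bit vector, matching the example in the text.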

Limited Pointer Directories
• Assume that the directory can track i copies
  • In other words, the directory can keep i pointers
• Pointer overflow handling options
  • If there are more than i sharers, resort to broadcast starting from the (i+1)th request
  • Victimize (invalidate) one pointer and use it for the new node
  • Let software handle the overflow (used by MIT Alewife)
    • When hardware pointers are exhausted, a trap to a software handler is taken
    • The handler allocates a new pointer in regular (non-directory) memory
    • The software handler is needed
      • When the number of sharing nodes exceeds the number of hardware pointers available
      • When a block is modified and some pointers have been allocated in memory
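
The broadcast overflow option can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `LimitedPointerEntry`, its `broadcast` flag, and the method names are hypothetical, and the victimization and software-handler options are not modeled.

```python
class LimitedPointerEntry:
    """Directory entry with at most `limit` sharer pointers (assumed design)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.pointers = []       # node ids of precisely tracked sharers
        self.broadcast = False   # set once the pointers overflow

    def add_sharer(self, node: int) -> None:
        if self.broadcast or node in self.pointers:
            return
        if len(self.pointers) < self.limit:
            self.pointers.append(node)
        else:
            self.broadcast = True   # overflow: sharers no longer tracked precisely

    def invalidation_targets(self, writer: int, num_nodes: int) -> list:
        """Nodes that must receive an invalidation when `writer` writes."""
        if self.broadcast:
            return [n for n in range(num_nodes) if n != writer]
        return [n for n in self.pointers if n != writer]
```

While the number of sharers stays within the limit, invalidations go only to the recorded nodes; after an overflow, every node except the writer is invalidated, which is correct but generates more traffic.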

Flat, Cache-Based Directory Schemes
• There is still a home main memory for the block
  • However, the directory entry at the home node does not contain the identities of all sharers, but only a pointer to the first sharer in the list plus a few state bits
    • This pointer is called the head pointer for the block
  • The remaining nodes caching that block are joined together in a distributed, doubly-linked list
    • A cache that contains a copy of the block also contains pointers to the next and previous caches that have a copy
    • They are called the forward and backward pointers

A Read Miss Handling
• The requesting node sends a network transaction to the home memory to find out the identity of the head node
• If the head pointer is null (no sharers), the home replies with data
• If the head pointer is not null, the requestor must be added to the list of sharers
  • The home responds to the requestor with the head pointer
  • The requestor sends a message to the head node, asking to be inserted at the head of the list and hence to become the new head node
    • The head pointer at home now points to the requestor
    • The forward pointer of the requestor’s cache entry points to the old head node
    • The backward pointer of the old head node points to the requestor
  • The data for the block is provided by the home if it has the latest copy or by the head node, which always has the latest copy
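
The pointer updates for a read miss in a cache-based scheme can be sketched as an insertion at the head of a doubly linked list. The class name `SharingList` and its fields are assumptions for illustration; the sketch assumes the requestor is not already in the list and does not model data transfer or concurrent insertions.

```python
class SharingList:
    """Doubly linked sharing list for one block (cache-based directory sketch)."""

    def __init__(self):
        self.head = None     # head pointer kept at the home node
        self.forward = {}    # node -> next sharer (stored in that node's cache)
        self.backward = {}   # node -> previous sharer (stored in that node's cache)

    def read_miss(self, requestor: int) -> None:
        old_head = self.head
        self.forward[requestor] = old_head     # points to old head, or None
        self.backward[requestor] = None        # the new head has no predecessor
        if old_head is not None:
            self.backward[old_head] = requestor
        self.head = requestor                  # home now points to the requestor

    def sharers(self) -> list:
        """Walk the list from the head, following forward pointers."""
        out, node = [], self.head
        while node is not None:
            out.append(node)
            node = self.forward[node]
        return out
```

After read misses from nodes 2, 5, and 1 in that order, the list read from the head is 1, 5, 2, so the most recent reader is always the head.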

A Write Miss Handling
• The writer obtains the identity of the head node from the home
• It then inserts itself into the list as the head node
  • If the writer was already in the list as a sharer and is now performing an upgrade, it is deleted from its current position in the list and inserted as the new head
• The rest of the distributed linked list is traversed node by node via network transactions to find and invalidate the cached blocks
  • If a block that is written is shared by three nodes A, B, and C, the home only knows about A, so the writer sends an invalidation message to it; the identity of the next sharer B can only be known once A is reached, and so on
  • Acknowledgements for these invalidations are sent to the writer
• If the data for the block is needed by the writer, it is provided by either the home or the head node as appropriate
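
The node-by-node invalidation traversal can be sketched as below. The function name `invalidate_sharers` and the dictionary representation of forward pointers are assumptions for illustration; acknowledgements and the writer's own insertion at the head are not modeled.

```python
def invalidate_sharers(head, forward: dict, writer) -> list:
    """Invalidate all sharers by walking the list; return nodes in visit order.

    `forward` maps each sharer to the next sharer (None at the tail).
    The next node is only known after the current one has been reached,
    so the invalidations are serialized.
    """
    visited = []
    node = head
    while node is not None:
        if node != writer:
            visited.append(node)   # send invalidation, learn the next pointer
        node = forward[node]
    return visited
```

For the A, B, C example above, `invalidate_sharers("A", {"A": "B", "B": "C", "C": None}, "W")` visits A, then B, then C, so the number of serialized steps equals the number of sharers.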

Pros and Cons of Flat, Cache-based Schemes
• Cons
  • Latency: the messages for an invalidating write are sent one sharer at a time, so their number (proportional to the number of sharers) is in the critical path
• Pros
  • The directory overhead is small
  • The linked list records the order in which accesses were made to memory for the block, thus making it easier to provide fairness in a protocol
  • Sending invalidations is not centralized at the home but rather distributed among sharers
    • This reduces the bandwidth demands placed on a particularly busy home CA (communication assist)

IEEE 1596-1992 SCI
• Scalable Coherent Interface (SCI)
  • Manipulating insertion in and deletion from distributed linked lists can lead to complex protocol implementations
  • These complexity issues have been greatly alleviated by the formalization and publication of a standard for a cache-based directory organization and protocol: IEEE 1596-1992 Scalable Coherent Interface (SCI)
  • Several commercial machines use this protocol
    • Sequent NUMA-Q, 1996
    • Convex Exemplar, Convex Computer Corp., 1993
    • Data General, 1996

Correctness Issues
• Read Section 8.4 in Culler’s book
  • Serialization to a location for coherence
  • Serialization across locations for sequential consistency
  • Deadlock, livelock, and starvation

Example: Write Atomicity
• A common solution for write atomicity in an invalidation-based scheme is for the current owner of a block (the main memory or the processor holding the dirty copy in its cache) to provide the appearance of atomicity by not allowing access to the new value by any process until all invalidation acknowledgements for the write have returned