MN Cache Coherence


CACHE COHERENCE By: Mahesh Neupane

Cache: A cache is key to the performance of a modern processor. In a modern computing system nearly 25% of instructions reference memory, so memory access time is a critical factor in performance. By effectively reducing the cost of a memory access, caches enable the greater-than-one-instruction-per-cycle throughput goal of modern processors. A cache exploits the locality of reference property to improve the access time to data and to reduce the cost of accessing main memory. There are two types of locality:

a) Temporal Locality b) Spatial Locality

Temporal Locality:

Once a location is referenced, there is a high probability that it will be referenced again in the near future. Temporal locality appears in both data and instructions. The simplest example of temporal locality is the instructions in a loop: once the loop is entered, all the instructions in the loop will be referenced many times before the loop exits. Commonly called subroutines, functions, and interrupt handlers also exhibit temporal locality. Many programs also have "hot" data that they use or update many times before going on to another block of data.
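As a small illustration, the following C sketch (array and variable names are purely illustrative) sums an array in a loop: the loop instructions and the accumulator are reused on every iteration (temporal locality), while the sequential walk over the array previews the spatial locality described next.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int data[N];
    long total = 0;

    /* Initialize the array. */
    for (int i = 0; i < N; i++)
        data[i] = i;

    /* The loop body and `total` are touched on every iteration
     * (temporal locality); `data` is accessed sequentially, so each
     * cache line fetched serves several consecutive elements
     * (spatial locality). */
    for (int i = 0; i < N; i++)
        total += data[i];

    printf("total = %ld\n", total);
    return 0;
}
```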

Spatial Locality: When an instruction or datum is accessed, it is very likely that nearby instructions or data will be accessed soon. An instruction stream also exhibits spatial locality: in the absence of jumps, the next instruction to be executed is the one immediately following the current one. Data also shows considerable spatial locality when arrays or strings are accessed. Programs commonly step through an array from beginning to end, accessing each element sequentially.

Cache Coherence: In a shared memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed as well. Cache coherence is the discipline that ensures that changes to the values of shared operands are propagated throughout the system in a timely fashion.


There are three distinct levels of cache coherence:

1) Every write operation appears to occur instantaneously.
2) All processes see exactly the same sequence of changes of values for each separate operand.
3) Different processes may see an operand assume different sequences of values. (This is non-coherent behavior.)

A cache coherence problem arises when a cache reflects a view of memory that is different from reality. Cache coherence is a common issue when handling the I/O subsystem. In a centralized shared memory architecture, two different processors can see two different values for the same memory location. Let's consider the following example of a centralized shared memory architecture with two processors, A and B:

Time | Event                 | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
0    |                       |                          |                          | 1
1    | CPU A reads X         | 1                        |                          | 1
2    | CPU B reads X         | 1                        | 1                        | 1
3    | CPU A stores 0 into X | 0                        | 1                        | 0
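The scenario in this table can be imitated in software. Below is a minimal C sketch (thread and variable names are illustrative, and C11 atomics stand in for the hardware-level sharing): two threads play the roles of CPU A and CPU B, CPU A stores a new value into the shared location X, and cache coherence guarantees that CPU B eventually observes the new value instead of a stale cached copy.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int X = 1;   /* shared location X, initially 1 as in the table */

static void *cpu_b(void *arg) {
    (void)arg;
    /* CPU B keeps reading X; coherence guarantees that CPU A's store
     * eventually becomes visible here instead of a stale copy. */
    while (atomic_load(&X) != 0)
        ;
    printf("CPU B sees X = %d\n", atomic_load(&X));
    return NULL;
}

int main(void) {
    pthread_t b;
    pthread_create(&b, NULL, cpu_b, NULL);
    atomic_store(&X, 0);            /* CPU A stores 0 into X */
    pthread_join(b, NULL);
    return 0;
}
```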

Conditions of Coherency:

1) A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring in between, always returns the value written by P.
2) A read by a processor P to location X that follows a write by another processor to X returns the newly written value if the read and write are sufficiently separated.
3) Writes to the same location are serialized: that is, two writes to the same location by any two processors are seen in the same order by all processors.

In a coherent multiprocessor the caches provide both migration and replication of shared data. Data migration allows low-latency access to shared data from a local cache. Replicating data enables simultaneous access to shared data and limits the potential for contention.

Solution to Cache Coherency: The basic protocol that is adopted in order to eliminate the cache coherence problem in the memory system is the snooping protocol.


SNOOPING PROTOCOL:

Snooping Cache Coherence protocols can be used in systems with a shared bus between the processors and memory modules.

Snoopy Cache coherence protocols rely on a common channel (or bus) connecting the processors to main memory.

This enables all cache controllers to observe (or snoop) the activities of all other processors and take appropriate actions to prevent their processors from obtaining stale data.

The snooping protocol is further classified into two schemes:

a) Write Invalidate Scheme b) Write Update Scheme

a) Write Invalidate Scheme:

• In this protocol all other caches must invalidate their copy of a block before a single processor can modify it.

• In other words, a processor must obtain exclusive ownership of a block before it can modify the block.

• A processor first broadcasts a write over the bus to obtain the exclusive ownership of a block. Once this is done, the writing processor is sure that no other processor can receive old (stale) data.

In the write invalidate protocol, a cache block is always in one of four possible states: INVALID, VALID, RESERVED, or DIRTY.

• INVALID state indicates that the cache does not have the current data for the block and that an access to the block must result in a cache miss.

• VALID state indicates the correctness of the block data and the block may be present in other caches as well.

• RESERVED state indicates the data is correct and the block is present only in this cache and the main memory.

• DIRTY state indicates that the block is present only in this cache and that only this cache has the correct data.

Even though the RESERVED and DIRTY states may look like the same case, there is a significant difference between the two. In the RESERVED state, the cache as well as main memory has the correct data. In the DIRTY state, only the cache has the correct copy of the data.


The state diagram for this protocol is given below:

[State diagram: STATE 0 INVALID, STATE 1 VALID (clean, potentially shared), STATE 2 RESERVED (clean, only copy), STATE 3 DIRTY (modified, only copy). Edges are labeled with read/write hits and misses from the local processor (processor-based transitions) and with bus reads/writes observed from other processors (bus-induced transitions).]

The protocol works as follows:

Read Hit: The data is returned to the processor without any delay or bus transactions.

Read Miss: On a read miss, the block is loaded into the VALID state from memory if no cache has a copy of the block in the DIRTY state. If another cache has a DIRTY copy, then it supplies the data to the requesting cache and also writes the data back to main memory. All copies of the data are then set to the VALID state. How a read miss is serviced also depends on the write policy: write-through or write-back. With write-through, memory is always kept up to date. With write-back, the caches are snooped to find the most recent copy of the data.

Write Hit:

• If the block is in the RESERVED or DIRTY state, the new data is written in the cache without any delay or bus transactions.


• If the block is in the VALID state, the new data is written through to memory and the block state is changed to RESERVED. All other caches set their copies to the INVALID state.

Since repeated writes to a block by a processor generate only one write to main memory (the write that is used to invalidate all other copies of the block), no bus transactions are required for the repeated writes. That is why this protocol is also known as the write-once protocol.

Write Miss:

• If no cache has the block in the DIRTY state, then on a write miss the block is loaded from memory into the DIRTY state.

• If another cache has the block in the DIRTY state, then it supplies the block data to the requesting cache instead of memory and sets its local block state to INVALID.

• After observing the write miss on the bus, all other caches set their copies to the INVALID state.

• Only the requesting processor has exclusive ownership of the block.
• Once the block is loaded, the processor modifies it and sets the state to DIRTY.

Block replacement: The block data must be written back to memory if the block is in the DIRTY state. Otherwise, the block can be replaced without any bus transaction.

The working of this protocol can be shown by the following example:

Processor activity    | Bus activity       | CPU A's cache content | CPU B's cache content | Contents of memory (X)
                      |                    |                       |                       | 0
CPU A reads X         | Cache miss for X   | 0                     |                       | 0
CPU B reads X         | Cache miss for X   | 0                     | 0                     | 0
CPU A writes a 1 to X | Invalidation for X | 1                     | 0 (invalidated)       | 0
CPU B reads X         | Cache miss for X   | 1                     | 1                     | 1
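To make the transitions concrete, here is a minimal C sketch of one cache controller's state machine for a single block under this write-once protocol. It is a simplified model based on the rules above (bus arbitration and data movement are omitted, and a second write hit to a RESERVED block is modeled as moving to DIRTY, since memory then no longer holds the latest data); all names are illustrative.

```c
#include <stdio.h>

/* Block states of the write-once (write invalidate) protocol. */
typedef enum { INVALID, VALID, RESERVED, DIRTY } State;

/* Events seen by one cache controller for one block: the first four
 * come from the local processor, the last two are snooped from the bus. */
typedef enum { READ_HIT, READ_MISS, WRITE_HIT, WRITE_MISS,
               BUS_READ, BUS_WRITE } Event;

static State next_state(State s, Event e) {
    switch (e) {
    case READ_HIT:   return s;                /* no bus transaction        */
    case READ_MISS:  return VALID;            /* block loaded, maybe shared*/
    case WRITE_HIT:  return (s == VALID) ? RESERVED  /* write-through once */
                                         : DIRTY;    /* local write only   */
    case WRITE_MISS: return DIRTY;            /* gain exclusive ownership  */
    case BUS_READ:   return (s == RESERVED || s == DIRTY) ? VALID : s;
    case BUS_WRITE:  return INVALID;          /* another CPU writes block  */
    }
    return s;
}

int main(void) {
    const char *name[] = { "INVALID", "VALID", "RESERVED", "DIRTY" };
    State s = INVALID;
    Event trace[] = { READ_MISS, WRITE_HIT, WRITE_HIT, BUS_READ, BUS_WRITE };

    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: %s\n", i, name[s]);
    }
    return 0;
}
```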

b) Write Update Scheme:

• In this protocol, a write to a shared block causes the written data to be broadcast over the bus so that all other caches can update their copies.

• If the writing processor determines that there are other copies of the block, then it broadcasts the new data over the bus when it writes to the block.


• When the other caches observe the new data being broadcast over the bus, they update their copy of the block data.

This protocol is also known as the Firefly protocol. In this protocol, a cache block may be in one of three possible states: VALID-EXCLUSIVE, SHARED, and DIRTY.

• VALID-EXCLUSIVE state signifies that this cache has the only copy of the block and that the data present in main memory is correct.
• SHARED state indicates that the block has the correct copy of the data and may be present in other caches as well.
• A block is in the DIRTY state if this cache has the only copy of the data and the data stored in main memory is incorrect.

The SHARED and VALID-EXCLUSIVE states are identical to the write invalidate protocol's VALID and RESERVED states respectively. The DIRTY state has the same meaning in both protocols.

This protocol requires that the bus provide an additional signal called the Shared-Line.

The Shared-Line is used to indicate whether or not a cache has a copy of the block being accessed on the bus and is sampled by the cache controllers to determine the appropriate actions to be taken.

A cache raises the Shared-Line if it has a copy of the block whenever a read or write to the block is performed.

The state transition diagram for this protocol is given below:

[State diagram: State 0 VALID-EXCLUSIVE (clean, only copy), State 1 SHARED (clean), State 2 DIRTY (dirty, only copy). Edges are labeled with read/write hits and misses, qualified by whether the Shared Line is true or false, for processor-based transitions, and with bus reads/writes for bus-induced transitions.]


The protocol works as follows:

Read Hit: The data is returned to the processor without any delay or bus transactions.

Read Miss: On a read miss the following operations are performed:

• The block is loaded into the cache in either the VALID-EXCLUSIVE or the SHARED state, depending on whether the Shared Line is raised.
• If another cache has the block, then that cache supplies the block to the requesting cache and raises the Shared Line on the bus.
• All other caches abort their attempt to supply the requested data.
• If the block is supplied to the requesting cache by another cache instead of memory (i.e. the Shared Line was raised), all the caches holding the block update their copies with the supplied data.

Write Hit:

• If the block is in the VALID-EXCLUSIVE state, the data is written in the cache without any delay or bus transactions, and the block moves to the DIRTY state.
• If the block is in the SHARED state, the new data is broadcast over the bus so that the other copies and memory are updated; if no other cache raises the Shared Line, the writing cache sets its state to VALID-EXCLUSIVE, otherwise the block stays SHARED.
• If the block is in the DIRTY state, the data is written in the cache without any bus transaction and the block remains DIRTY.

Write Miss: On a write miss the following operations are performed:

• The block is first loaded as on a read miss: from another cache if one raises the Shared Line, or from main memory otherwise.
• If the Shared Line is true, the written data is broadcast over the bus so that the other copies and memory are updated, and the block remains in the SHARED state.
• If the Shared Line is false, no other cache holds a copy, so the write is performed locally and the block ends up in the DIRTY state.

The following example shows the operation of the write update protocol between two processors, A and B:

Processor activity    | Bus activity         | CPU A's cache content | CPU B's cache content | Contents of memory (X)
                      |                      |                       |                       | 0
CPU A reads X         | Cache miss for X     | 0                     |                       | 0
CPU B reads X         | Cache miss for X     | 0                     | 0                     | 0
CPU A writes a 1 to X | Write broadcast of X | 1                     | 1                     | 1
CPU B reads X         |                      | 1                     | 1                     | 1
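The behavior in this table can be sketched as a toy simulation in C (all structures and names are illustrative, not part of any real protocol implementation): when one cache writes, the new value is "broadcast" so that every cache holding a copy, and memory, are updated, and the other processor's next read hits on current data.

```c
#include <stdbool.h>
#include <stdio.h>

#define NCACHES 2

/* One simulated cache entry for a single shared block X. */
struct cache { bool has_copy; int value; };

static struct cache caches[NCACHES];
static int memory_x = 0;

/* Write update: the writer updates its copy, and the new value is
 * broadcast on the "bus" so every other cache holding the block,
 * and memory, are updated as well. */
static void write_update(int writer, int value) {
    caches[writer].has_copy = true;
    caches[writer].value = value;
    for (int c = 0; c < NCACHES; c++)
        if (c != writer && caches[c].has_copy)
            caches[c].value = value;   /* snarf the broadcast data */
    memory_x = value;                  /* write-through to memory  */
}

int main(void) {
    /* Both CPUs read X and cache it (value 0). */
    caches[0] = (struct cache){ true, memory_x };
    caches[1] = (struct cache){ true, memory_x };

    /* CPU A writes 1 to X: the write is broadcast. */
    write_update(0, 1);

    /* CPU B's next read hits in its own cache and sees the new value. */
    printf("CPU B reads X = %d (memory = %d)\n", caches[1].value, memory_x);
    return 0;
}
```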


Comparison between Write Invalidate and Write Update:

• Write invalidate is used in the vast majority of designs.
• Qualitative performance differences:

• Write-invalidate requires one transaction per write run (sequence of writes) while write-update involves a broadcast for each write.

• Write-invalidate exploits spatial locality: one transaction per cache block, while write-update requires a broadcast per word.

• Write-update (write-broadcast) has lower latency between a write on one processor and a read of the new value on another processor.

• The write-invalidate protocol is popular because bus and memory bandwidth are usually the scarce resources, and invalidation places less demand on them.

• Write-update can cause problems for some memory consistency models, reducing the potential performance gain.

• The high bandwidth demand of write-update limits its scalability to large numbers of processors.

MESI Algorithm: The Modified Exclusive Shared Invalid (MESI) algorithm is also used to eliminate the cache coherence problem. The states in this algorithm are given below:

MESI State    | Definition
Modified (M)  | The line is valid in the cache and in only this cache. The line is modified with respect to system memory; that is, the modified data in the line has not been written back to memory.
Exclusive (E) | The addressed line is in this cache only. The data in this line is consistent with system memory.
Shared (S)    | The addressed line is valid in the cache and in at least one other cache. A shared line is always consistent with system memory; that is, the shared state is shared-unmodified, and there is no shared-modified state.
Invalid (I)   | This state indicates that the addressed line is not resident in the cache and/or any data it contains is considered not useful.

Exclusive may also be called CleanExclusive. Modified may also be called DirtyExclusive.


Cache States in MESI:

Line state in Cache A | Data in Cache A | Copy in Cache B | System memory
Modified in Cache A   | Valid data      | Invalid data    | Invalid data
Shared in Cache A     | Valid data      | Valid data      | Valid data
Exclusive in Cache A  | Valid data      | Invalid data    | Valid data
Invalid in Cache A    | Invalid data    | Don't care      | Don't care

Some processors add a fifth state for Shared Modified and call the result the MOESI protocol. Caches with a line in the shared modified state update each other's copies of the line with the current data, but do not write it back to main memory.


MESI State Diagram: [diagram not reproduced; the transitions are summarized in the state table below]

MESI State Table:

State     | Event                                        | Action                                                   | Next State
Invalid   | Read miss, shared (cache copies exist)       | Read cache line                                          | Shared
Invalid   | Read miss, exclusive (no cache copies exist) | Read cache line                                          | Exclusive
Invalid   | Write miss                                   | Broadcast invalidate; read cache line; modify cache line | Modified
Shared    | Read hit                                     |                                                          | Shared
Shared    | Write hit                                    | Broadcast invalidate                                     | Modified
Shared    | Snoop hit on read                            |                                                          | Shared
Shared    | Snoop hit on invalidate                      | Invalidate cache line                                    | Invalid
Exclusive | Read hit                                     |                                                          | Exclusive
Exclusive | Write hit                                    |                                                          | Modified
Exclusive | Snoop hit on read                            |                                                          | Shared
Exclusive | Snoop hit on invalidate                      | Invalidate cache line                                    | Invalid
Modified  | Read hit                                     |                                                          | Modified
Modified  | Write hit                                    |                                                          | Modified
Modified  | Snoop hit on read                            | Write back to memory                                     | Shared
Modified  | Snoop hit on invalidate                      | Write back to memory                                     | Invalid
Modified  | LRU replacement                              | Write back to memory                                     | Invalid
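The state table translates almost directly into code. The following C sketch (a simplified model; the event names and helper function are illustrative) returns the next MESI state of a cache line for each row of the table, leaving out the data movement that the listed actions would perform.

```c
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } MesiState;

typedef enum {
    READ_MISS_SHARED,     /* read miss, other cache copies exist */
    READ_MISS_EXCLUSIVE,  /* read miss, no other cache copies    */
    WRITE_MISS,
    READ_HIT,
    WRITE_HIT,
    SNOOP_HIT_ON_READ,
    SNOOP_HIT_ON_INVALIDATE,
    LRU_REPLACEMENT
} MesiEvent;

/* Next-state function for one cache line, following the MESI state
 * table above.  Actions such as "read cache line", "broadcast
 * invalidate", and "write back to memory" are omitted; only the
 * resulting state is modeled. */
static MesiState mesi_next(MesiState s, MesiEvent e) {
    switch (e) {
    case READ_MISS_SHARED:        return SHARED;
    case READ_MISS_EXCLUSIVE:     return EXCLUSIVE;
    case WRITE_MISS:              return MODIFIED;
    case READ_HIT:                return s;        /* state unchanged     */
    case WRITE_HIT:               return MODIFIED; /* S broadcasts an
                                                      invalidate first    */
    case SNOOP_HIT_ON_READ:       return SHARED;   /* M writes back first */
    case SNOOP_HIT_ON_INVALIDATE: return INVALID;  /* M writes back first */
    case LRU_REPLACEMENT:         return INVALID;  /* M writes back first */
    }
    return s;
}

int main(void) {
    const char *name[] = { "Modified", "Exclusive", "Shared", "Invalid" };
    MesiState s = INVALID;

    s = mesi_next(s, READ_MISS_SHARED);        /* Invalid  -> Shared   */
    printf("%s\n", name[s]);
    s = mesi_next(s, WRITE_HIT);               /* Shared   -> Modified */
    printf("%s\n", name[s]);
    s = mesi_next(s, SNOOP_HIT_ON_READ);       /* Modified -> Shared   */
    printf("%s\n", name[s]);
    s = mesi_next(s, SNOOP_HIT_ON_INVALIDATE); /* Shared   -> Invalid  */
    printf("%s\n", name[s]);
    return 0;
}
```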
