Constructive Computer Architecture Cache Coherence Arvind

30
Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 18, 2013 http:// www.csg.csail.mit.edu/ 6.s195 L21-1

description

Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology. Contributors to the course material. Arvind, Rishiyur S. Nikhil, Joel Emer, Muralidaran Vijayaraghavan - PowerPoint PPT Presentation

Transcript of Constructive Computer Architecture Cache Coherence Arvind

Page 1: Constructive Computer Architecture Cache Coherence Arvind

Constructive Computer Architecture

Cache Coherence

ArvindComputer Science & Artificial Intelligence Lab.Massachusetts Institute of Technology

November 18, 2013

http://www.csg.csail.mit.edu/6.s195 L21-1

Page 2: Constructive Computer Architecture Cache Coherence Arvind

Contributors to the course material

Arvind, Rishiyur S. Nikhil, Joel Emer, Muralidaran VijayaraghavanStaff and students in 6.375 (Spring 2013), 6.S195 (Fall 2012), 6.S078 (Spring 2012)

Asif Khan, Richard Ruhler, Sang Woo Jun, Abhinav Agarwal, Myron King, Kermin Fleming, Ming Liu, Li-Shiuan Peh

External Prof Amey Karkare & students at IIT Kanpur Prof Jihong Kim & students at Seoul Nation University Prof Derek Chiou, University of Texas at Austin Prof Yoav Etsion & students at Technion

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-2

Page 3: Constructive Computer Architecture Cache Coherence Arvind

Memory Consistency in SMPs

Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has a stale value

cache-1A 100

CPU-Memory bus

CPU-1 CPU-2

cache-2A 100

memoryA 100

200

200

Do these stale values matter?What is the view of shared memory for programming?

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-3

Page 4: Constructive Computer Architecture Cache Coherence Arvind

Maintaining Store AtomicityStore atomicity requires all processors to see writes occur in the same order

multiple copies of a location in various caches can cause this to be violated

To meet the ordering requirement it is sufficient for hardware to ensure:

Only one processor at a time has write permission for a location

No processor can load a stale copy of the data after a write to the location

November 18, 2013 L21-4http://www.csg.csail.mit.edu/

6.s195

cache coherence protocols

Page 5: Constructive Computer Architecture Cache Coherence Arvind

A System with Multiple Caches

M

L1 P

L1 P

L1 P

L1 P

L2 L2 L1P

L1 P

Interconnect

Modern systems often have hierarchical cachesEach cache has exactly one parent but can have zero or more childrenLogically only a parent and its children can communicate directlyInclusion property is maintained between a parent and its children, i.e.,

a Li a Li+1Because usuallyLi+1 >> Li

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-5

Page 6: Constructive Computer Architecture Cache Coherence Arvind

Cache Coherence ProtocolsWrite request:

the address is invalidated in all other caches before the write is performed, or

the address is updated in all other caches after the write is performed

Read request: if a dirty copy is found in some cache, that is the value that must

be used, e.g., by doing a write-back and reading the memory or forwarding that dirty value directly to the reader.

November 18, 2013 L21-6http://www.csg.csail.mit.edu/

6.s195

We will focus on Invalidation protocols as opposed to Update protocols

Page 7: Constructive Computer Architecture Cache Coherence Arvind

State needed to maintain Cache Coherence

Use MSI encoding in caches whereI means this cache does not contain the locationS means this cache has the location but so may other

caches; hence it can only be readM means only this cache has the location; hence it can

be read and writtenThe states M, S, I can be thought of as an order M > S > I

A transition from a lower state to a higher state is called an Upgrade

A transition from a higher state to a lower state is called a Downgrade

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-7

Page 8: Constructive Computer Architecture Cache Coherence Arvind

Sibling invariant and compatibility

Sibling invariant: Cache is in state M its siblings are in state I That is, the sibling states are “compatible”

The states x, y of two siblings are compatible iff IsCompatible(x, y) is True where

IsCompatible(M, M) = FalseIsCompatible(M, S) = FalseIsCompatible(S, M) = FalseAll other cases = True

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-8

Page 9: Constructive Computer Architecture Cache Coherence Arvind

Cache State Transitions

S M

I

storeload

write-back

invalidate flush

store

optimizations

This state diagram is helpful as long as one remembers that each transition involves cooperation of other caches and the main memory

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-9

Page 10: Constructive Computer Architecture Cache Coherence Arvind

Cache ActionsOn a read miss (i.e., Cache state is I):

In case some other cache has the location in state M then write back the dirty data to Memory

Read the value from Memory and set the state to S On a write miss (i.e., Cache state is I or S):

Invalidate the location in all other caches and in case some cache has the location in state M then write back the dirty data

Read the value from Memory if necessary and set the state to M

Misses cause Cache upgrade actions which in turn may cause further downgrades or upgrades on other caches

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-10

Page 11: Constructive Computer Architecture Cache Coherence Arvind

MSI protocol: some issuesIt is possible to have multiple requests for the same location from different processors. Hence there is a need to arbitrate requests

In bus-based systems bus controller performs this function

In directory-based systems upgrade requests are passed to the parent who acts as an arbitrator

On a cache miss there is a need to find out the state of other caches

In a bus-based system a system-wide broadcast of the request determines the state of other caches by “snooping”

In directory-based systems a directory keeps track of the state of each child cache

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-11

Page 12: Constructive Computer Architecture Cache Coherence Arvind

Directory State EncodingTwo-level (L1, M) system

For each location in a cache, the directory keeps two types of info

c.state[a] (sibling info): do c’s siblings have a copy of location a; M (means no), S (means maybe)

c.child[ck][a] (children info): the state of c’s child ck for location a; At most one child can be in state M

Since L1 has no children, only sibling information is kept and since main (home) memory has no siblings only children cache information is kept

November 18, 2013 L21-12http://www.csg.csail.mit.edu/

6.s195

All addresses in the home memory are in state M

a

a P

L1 P

L1 L1

Interconnect<S,I,I,I>

S P P

Page 13: Constructive Computer Architecture Cache Coherence Arvind

Directory state encoding cont

New states needed to deal with waiting for responses:

c.waitp[a] : Denotes if cache c is waiting for a response from its parent

Nothing means not waiting Valid (M|S|I) means waiting for a response to transition to M

or S or I state, respectively c.waitc[ck][a] : Denotes if cache c is waiting for a

response from its child ck Nothing | Valid (M|S|I)

Cache state in L1: <(M|S|I), (Nothing | Valid(M|S|I))>

Directory state in home memory: <[(M|S|I), (Nothing | Valid(M|S|I))]>

November 18, 2013 L21-13http://www.csg.csail.mit.edu/

6.s195

Children’s state

Page 14: Constructive Computer Architecture Cache Coherence Arvind

A Directory-based Protocol an abstract view

interconnectPP

P

c2mm2c

L1p2m m2p

mPPin out

PP

P

c2mm2c

L1p2mm2p

Each cache has 2 pairs of queues (c2m, m2c) to communicate with the memory (p2m, m2p) to communicate with the processor

Message format: <cmd, srcdst, a, s, data>

FIFO message passing between each (srcdst) pair except a Resp cannot block a Req

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-14

Req/Resp address state

Page 15: Constructive Computer Architecture Cache Coherence Arvind

Processor Hit Rules Load-hit rule

p2m.msg=(Load a) & (c.state[a]>I) p2m.deq;

m2p.enq(c.data[a]);

Store-hit rulep2m.msg=(Store a v) &

c.state[a]=M p2m.deq;

m2p.enq(Ack);c.data[a]:=v;

PP

P

c2mm2c

L1p2m m2p

The miss rules are taken care of by the general cache rules to be presented

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-15

Page 16: Constructive Computer Architecture Cache Coherence Arvind

Processing a Load or a Store miss

Child to Parent: Upgrade-to-y request

Parent to Child: process Upgrade-to-y request

Parent to other child caches: Downgrade-to-x request

Child to Parent: Downgrade-to-x response

Parent waits for all Downgrade-to-x responses

Parent to Child: Upgrade-to-y response

Child receives upgrade-to-y response

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-16

Page 17: Constructive Computer Architecture Cache Coherence Arvind

Processing a Load missL1 to Parent: Upgrade-to-S request(c.state[a]=I) & (c.waitp[a]=Nothing) c.waitp[a]:=Valid S; c2m.enq(<Req, cm, a, S, - >);

Parent to L1: Upgrade-to-S response(j, m.waitc[j][a]=Nothing) & c2m.msg=<Req,cm,a,S,-> & (i≠c, IsCompatible(m.child[i][a],S)) m2c.enq(<Resp, mc, a, S, m.data[a]>); m.child[c][a]:=S; c2m.deq

L1 receiving upgrade-to-S responsem2c.msg=<Resp, mc, a, S, data> m2c.deq; c.data[a]:=data; c.state[a]:=S; c.waitp[a]:=Nothing;

November 18, 2013 L21-17http://www.csg.csail.mit.edu/

6.s195

Page 18: Constructive Computer Architecture Cache Coherence Arvind

Processing Load miss cont.

What if (i≠c, IsCompatible(m.child[i][a],y)) is false?Downgrade other child caches

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-18

Parent to L1: Upgrade-to-S response(j, m.waitc[j][a]=Nothing) & c2m.msg=<Req,cm,a,S,-> & (i≠c, IsCompatible(m.child[i][a],S)) m2c.enq(<Resp, mc, a, S, m.data[a]>); m.child[c][a]:=S; c2m.deq

Parent to Child: Downgrade to S request c2m.msg=<Req,cm,a,S,-> & (m.child[i][a]>S) & (m.waitc[i][a]=Nothing)

m.waitc[i][a]:=Valid S; m2c.enq(<Req, mi, a, S, - >);

Page 19: Constructive Computer Architecture Cache Coherence Arvind

Complete set of cache actions

req = {1,4,7}resp = {2,3,5,6,8}

A protocol specifies cache actions corresponding to each of these 8 different messages

Cache

1,5,8 3,7

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-19

Memory

4,2 6

Page 20: Constructive Computer Architecture Cache Coherence Arvind

Child Requests1. Child to Parent: Upgrade-to-y Request

(c.state[a]<y) & (c.waitp[a]=Nothing) c.waitp[a]:=Valid y; c2m.enq(<Req, cm, a, y, - >);

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-20

Page 21: Constructive Computer Architecture Cache Coherence Arvind

Parent Responds2. Parent to Child: Upgrade-to-y response

(j, m.waitc[j][a]=Nothing) & c2m.msg=<Req,cm,a,y,-> & (m.child[c][a]<y) & (i≠c, IsCompatible(m.child[i][a],y)) m2c.enq(<Resp, mc, a, y, (if (m.child[c][a]=I) then m.data[a] else -)>); m.child[c][a]:=y; c2m.deq;

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-21

Page 22: Constructive Computer Architecture Cache Coherence Arvind

Child receives Response3. Child receiving upgrade-to-y response

m2c.msg=<Resp, mc, a, y, data> m2c.deq; if(c.state[a]=I) c.data[a]:=data; c.state[a]:=y; if(c.waitp[a]=(Valid x) & x≤y) c.waitp[a]:=Nothing;

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-22

Page 23: Constructive Computer Architecture Cache Coherence Arvind

Parent Requests4. Parent to Child: Downgrade-to-y Request

(m.child[i][a]>y) & (m.waitc[i][a]=Nothing) m.waitc[i][a]:=Valid y; m2c.enq(<Req, mc, a, y, - >);

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-23

Page 24: Constructive Computer Architecture Cache Coherence Arvind

Child Responds5. Child to Parent: Downgrade-to-y response

(m2c.msg=<Req,mc,a,y,->) & (c.state[a]>y) c2m.enq(<Resp, c->m, a, y, (if (c.state[a]=M) then c.data[a] else - )>); c.state[a]:=y; m2c.deq

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-24

Page 25: Constructive Computer Architecture Cache Coherence Arvind

Parent receives Response6. Parent receiving downgrade-to-y response

c2m.msg=<Resp, cm, a, y, data> c2m.deq; if(m.child[c][a]=M) m.data[a]:=data; c.state[a]:=y; if(m.waitc[c][a]=(Valid x) & x≥y) m.waitc[c][a]:=Nothing;

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-25

Page 26: Constructive Computer Architecture Cache Coherence Arvind

Child receives served Request7. Child receiving downgrade-to-y request

(m2c.msg=<Req, mc, a, y, - >) & (c.state[a]≤y) m2c.deq;

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-26

Page 27: Constructive Computer Architecture Cache Coherence Arvind

Child Voluntarily downgrades 8. Child to Parent: Downgrade-to-y response (vol)

(c.waitp[a]=Nothing) & (c.state[a]>y) c2m.enq(<Resp, c->m, a, y, (if (c.state[a]=M) then c.data[a] else - )>); c.state[a]:=y;

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-27

Page 28: Constructive Computer Architecture Cache Coherence Arvind

Some propertiesRules 1 to 8 are complete - cover all possibilities and cannot deadlock or violate cache invariantsOur protocol maintains two important invariants:

Directory state is always a conservative estimate of a child’s state

Every request eventually gets a corresponding response (assuming responses cannot be blocked by requests and a request cannot overtake a response for the same address)

Starvation, that is a Load or store request is ignored indefinitely has to be prevented; Fair arbitration at the memory between requests from various caches will ensure starvation freedom.

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-28

Page 29: Constructive Computer Architecture Cache Coherence Arvind

FIFO property of queuesIf FIFO property is not enforced, then the protocol can either deadlock or update with wrong dataA deadlock scenario:

1. Child 1 requests upgrade (from I) to M (msg1)2. Parent responds to Child 1 with upgrade from I to M (msg2)3. Child 2 requests upgrade (from I) to M (msg2)4. Parent requests Child 1 for downgrade (from M) to I (msg3)5. msg3 overtakes msg26. Child 1 sees request to downgrade to I and drops it7. Parent never gets a response from Child 1 for downgrade

to I

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-29

Page 30: Constructive Computer Architecture Cache Coherence Arvind

H and L Priority MessagesAt the memory, unprocessed request messages cannot block reply messages. Hence all messages are classified as H or L priority.

all messages carrying replies are classified as high priority

Accomplished by having separate paths for H and L priority

In Theory: separate networks In Practice:

Separate Queues Shared physical wires for both networks

HL

November 18, 2013http://www.csg.csail.mit.edu/

6.s195 L21-30