Cache Coherent Distributed Shared Memory

Motivations

• Small processor count
  – SMP machines
  – Single shared memory with multiple processors interconnected with a bus

• Large processor count
  – Distributed Shared Memory Machines
  – Largely message passing architectures

Programming Concerns

• Message passing
  – Access to memory involves send/request packets
  – Communication costs

• Shared memory model
  – Ease of programming
  – But not very scalable

• Scalable and easy to program?

Distributed Shared Memory

• Physically distributed memory

• Implemented with a single shared address space

• Also known as NUMA machines since memory access times are non-uniform
  – Local access times < Remote access times

DSM and Memory access

• Big difference in accessing local versus remote data

• Large differences make it difficult to hide latency

• How about caching?
  – In short, it’s difficult
  – Cache coherence

Cache coherence

• Cache coherence
  – Different processors may access values at the same memory location
  – How to ensure data integrity at all times?

• An update by a processor at time t is available to other processors at time t+1
  – Snoopy protocol
  – Directory-based protocol

Snoopy Coherence Protocols

• Transparent to the user

• Easy to implement

• For a read
  – Data fetched from another cache or from memory

• For a write
  – All copies in other caches are invalidated
  – Delayed or immediate write-back

• The bus plays an important role (see the sketch below)
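
To make the read/write behaviour concrete, here is a minimal Python sketch of a bus-based snoopy invalidation protocol. It illustrates the idea only (a simplified MSI-style scheme); the class and method names are invented for this example and do not correspond to any particular machine.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Bus:
    def __init__(self, memory):
        self.memory, self.caches = memory, []

    def read_miss(self, addr, requester):
        # Try a cache-to-cache transfer first, otherwise go to memory.
        for cache in self.caches:
            if cache is not requester and addr in cache.lines:
                state, value = cache.lines[addr]
                if state == MODIFIED:
                    self.memory[addr] = value        # write back dirty data
                cache.lines[addr] = (SHARED, value)  # owner downgrades to shared
                return value
        return self.memory.get(addr, 0)

    def write_invalidate(self, addr, requester):
        # Every other cache snoops the bus and drops its copy.
        for cache in self.caches:
            if cache is not requester and addr in cache.lines:
                state, value = cache.lines.pop(addr)
                if state == MODIFIED:
                    self.memory[addr] = value        # flush before invalidating

class SnoopyCache:
    def __init__(self, bus):
        self.bus, self.lines = bus, {}               # addr -> (state, value)
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                   # miss: data comes from a peer cache or memory
            self.lines[addr] = (SHARED, self.bus.read_miss(addr, self))
        return self.lines[addr][1]

    def write(self, addr, value):
        self.bus.write_invalidate(addr, self)        # broadcast invalidation on the bus
        self.lines[addr] = (MODIFIED, value)         # write-back to memory is delayed

# Usage: P1 reads X after P0 wrote it; P0's dirty copy is supplied and downgraded.
memory = {"X": 0}
bus = Bus(memory)
p0, p1 = SnoopyCache(bus), SnoopyCache(bus)
p0.write("X", 42)
print(p1.read("X"))   # 42
print(memory["X"])    # 42: write-back happened when P1's miss was serviced
```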

Example

But it does not scale!

• Not feasible for machines with memory distributed across a large number of systems

• The broadcast-on-the-bus approach scales poorly

• Leads to bus saturation

• Wastes processor cycles snooping all caches in the system

Directory-Based Cache Coherence

• A directory tracks which processors have cached a block of memory

• The directory contains information for all cache blocks in the system

• Each cache block can be in 1 of 3 states
  – Invalid
  – Shared
  – Exclusive

• To enter the exclusive state, all other cached copies of the same memory location are invalidated (sketched below)
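
A minimal sketch of the directory bookkeeping, assuming a single directory that records each block's state and its set of sharers. The names and the print-based invalidation are illustrative only; real hardware would send invalidation messages and transfer the data itself.

```python
INVALID, SHARED, EXCLUSIVE = "I", "S", "E"

class Directory:
    """Tracks, for each memory block, its state and which processors cache it."""
    def __init__(self):
        self.entries = {}                                # block -> (state, set of sharers)

    def read_request(self, block, proc):
        state, sharers = self.entries.get(block, (INVALID, set()))
        # If a processor held the block exclusively, it would supply the data
        # and be downgraded; this sketch only tracks the bookkeeping.
        self.entries[block] = (SHARED, sharers | {proc})

    def write_request(self, block, proc):
        state, sharers = self.entries.get(block, (INVALID, set()))
        for other in sorted(sharers - {proc}):
            print(f"invalidate block {block} in cache of P{other}")
        self.entries[block] = (EXCLUSIVE, {proc})        # requester becomes sole owner

# Usage: P0 and P1 read block 7, then P1 writes it, so P0's copy is invalidated.
directory = Directory()
directory.read_request(7, 0)
directory.read_request(7, 1)
directory.write_request(7, 1)    # prints: invalidate block 7 in cache of P0
```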

Original form not popular

• Compared to snoopy protocols
  – Directory systems avoid broadcasting on the bus

• But requests are served by a single directory server
  – May saturate the directory server

• Still not scalable

• How about distributing the directory?
  – Load balancing
  – Hierarchical model?

Distributed Directory Protocol

• Involves sending messages among 3 node types
  – Local node
    • Requesting processor node
  – Home node
    • Node containing the memory location
  – Remote node
    • Node containing a cache block in the exclusive state

3 Scenarios

• Scenario 1
  – Local node sends request to home node
  – Home node sends data back to local node

• Scenario 2
  – Local node sends request to home node
  – Home node redirects request to remote node
  – Remote node sends data back to local node

• Scenario 3
  – Local node sends request for exclusive state
  – Home node redirects request to other remote nodes for invalidation (all three flows are sketched below)
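
The three flows can be written out as message sequences. This is a sketch of the slide's scenarios, not a specific protocol implementation: the message names are made up, and invalidation acknowledgements are assumed to go to the requesting (local) node, as in DASH, though other protocols collect them at the home node.

```python
def scenario_1(local, home):
    # Clean block: the home node services the request itself.
    return [(local, home, "read request"),
            (home, local, "data reply")]

def scenario_2(local, home, remote):
    # Block held exclusively at a remote node: home redirects the request.
    return [(local, home, "read request"),
            (home, remote, "forward request"),
            (remote, local, "data reply")]

def scenario_3(local, home, remotes):
    # Local node wants exclusive state: home tells every sharer to invalidate.
    msgs = [(local, home, "exclusive request")]
    msgs += [(home, r, "invalidate") for r in remotes]
    msgs += [(r, local, "invalidate ack") for r in remotes]
    msgs.append((home, local, "exclusive grant"))
    return msgs

for flow in (scenario_1("L", "H"),
             scenario_2("L", "H", "R"),
             scenario_3("L", "H", ["R1", "R2"])):
    for src, dst, what in flow:
        print(f"{src} -> {dst}: {what}")
    print()
```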

Example

Stanford DASH Multiprocessor

• 1st operational multiprocessor to support a scalable cache coherence protocol

• Demonstrates that scalability and cache coherence are not incompatible

• 2 hypotheses
  – Shared memory machines are easier to program
  – Cache coherence is vital

Past experience

• From experience
  – Memory access times differ widely between physical locations
  – Latency and bandwidth are important for shared memory systems
  – Caching helps amortize the cost of memory access in a memory hierarchy

DASH Multiprocessor

• Relaxed memory consistency model

• Observation
  – Most programs use explicit synchronization
  – Sequential consistency is not necessary
  – Allows the system to perform writes without waiting until all invalidations are performed

• Offers advantages in hiding memory latency (see the sketch below)
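
A minimal sketch of why explicit synchronization makes this safe, using Python threads as a stand-in for processors: the writer's ordinary stores can complete (and their invalidations propagate) in any order, because the reader only looks at the data after the lock hand-off; only the release/acquire pair has to be ordered.

```python
import threading

data = {}
done = threading.Lock()
done.acquire()                      # held until the writer finishes

def writer():
    data["a"] = 1                   # ordinary stores: their invalidations may
    data["b"] = 2                   # complete in any order, unacknowledged
    done.release()                  # only this release has to wait for them

def reader():
    with done:                      # acquire: writes before the release are visible
        print(data["a"] + data["b"])    # prints 3

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()
```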

DASH Multiprocessor

• Non-binding software prefetch
  – Prefetches data into the cache
  – Maintains coherence
  – Transparent to the user

• The compiler can issue such instructions to help runtime performance
  – If the data is invalidated, it is fetched again when it is actually accessed

• Helps to hide latency as well (example below)
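
A small software simulation of what "non-binding" means, with invented names (prefetch, invalidate, read): the prefetch only warms the cache, so if the line is invalidated before use, the later access simply re-fetches the current value and coherence is preserved.

```python
class CoherentCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}            # addr -> value, dropped on invalidation

    def prefetch(self, addr):
        self.lines.setdefault(addr, self.memory[addr])   # a hint only, not binding

    def invalidate(self, addr):
        self.lines.pop(addr, None)                       # another processor wrote addr

    def read(self, addr):
        if addr not in self.lines:                       # prefetched copy may be gone
            self.lines[addr] = self.memory[addr]         # re-fetch keeps coherence
        return self.lines[addr]

memory = {0: 10, 1: 20}
cache = CoherentCache(memory)
cache.prefetch(1)          # issued early, e.g. by the compiler
memory[1] = 99             # another processor updates the location...
cache.invalidate(1)        # ...so the prefetched copy is invalidated
print(cache.read(1))       # 99: the stale prefetched value is never used
```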

DASH Multiprocessor

• Remote Access Cache
  – Remote accesses are combined and buffered within individual nodes
  – Can be likened to having a 2-level cache hierarchy (sketched below)
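
A minimal sketch of the two-level idea, assuming a per-node cache that fronts the network: the first remote read from a node crosses the network, and later reads by any processor in that node hit the node-level cache. The class and counter are illustrative only.

```python
class RemoteAccessCache:
    """Node-level cache for remote data: a second level between a node's
    processor caches and the remote home memory."""
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote     # performs the actual network access
        self.lines = {}
        self.remote_fetches = 0

    def read(self, addr):
        if addr not in self.lines:           # first access from this node goes remote
            self.lines[addr] = self.fetch_remote(addr)
            self.remote_fetches += 1
        return self.lines[addr]              # later accesses from this node hit here

# Usage: two processors in the same node read the same remote address;
# only one network access is made.
home_memory = {100: "remote data"}
rac = RemoteAccessCache(lambda addr: home_memory[addr])
rac.read(100); rac.read(100)
print(rac.remote_fetches)    # 1
```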

Lessons

• High performance requires careful planning of remote data access

• Scaling applications depends on other factors
  – Load balancing
  – Limited parallelism
  – Difficult to scale an application to use more processors

Challenges

• Programming model?
  – A model that helps programmers reason about code rather than fine-tune for a specific machine

• Fault tolerance and recovery?
  – More computers = higher chance of failure

• Increasing latency?
  – Deeper hierarchies = larger variety of latencies

Callisto

• Previously, networking gateways
  – Handled a diverse set of services
  – Handled 1000s of channels
  – Complex designs involving many chips
  – High power requirements

• Callisto is a gateway on a chip
  – Used to implement communication gateways for different networks

In a nutshell

• Integrates DSPs, CPUs, RAM, and I/O channels on a single chip

• Programmable multi-service platform

• Handles 60 to 240 channels per chip

• An array of Callisto chips can fit in a small space
  – Power efficient
  – Handles a large number of channels
