Cache Coherent Distributed Shared Memory

Motivations

• Small processor count
  – SMP machines
  – Single shared memory with multiple processors interconnected with a bus

• Large processor count
  – Distributed Shared Memory Machines
  – Largely message passing architectures

Programming Concerns

• Message passing
  – Access to memory involves send/request packets
  – Communication costs

• Shared memory model
  – Ease of programming
  – But not very scalable

• Scalable and easy to program?

Distributed Shared Memory

• Physically distributed memory

• Implemented with a single shared address space

• Also known as NUMA machines since memory access times are non-uniform
  – Local access times < Remote access times

DSM and Memory access

• Big difference in accessing local versus remote data

• Large differences make it difficult to hide latency

• How about caching?
  – In short, it’s difficult
  – Cache coherence

Cache coherence

• Cache coherence
  – Different processors may access values at the same memory location
  – How to ensure data integrity at all times?

• An update by a processor at time t is available to other processors at time t+1
  – Snoopy protocol
  – Directory-based protocol

Snoopy Coherence Protocols

• Transparent to the user

• Easy to implement

• For a read
  – Data fetched from another cache or from memory

• For a write
  – All copies in other caches are invalidated
  – Delayed or immediate write-back

• The bus plays an important role (see the sketch below)
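
To make the read/write behaviour concrete, here is a minimal Python sketch of a bus-based snoopy invalidation protocol. It illustrates the idea only (a simplified MSI-style scheme); the class and method names are invented for this example and do not correspond to any particular machine.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Bus:
    def __init__(self, memory):
        self.memory, self.caches = memory, []

    def read_miss(self, addr, requester):
        # Try a cache-to-cache transfer first, otherwise go to memory.
        for cache in self.caches:
            if cache is not requester and addr in cache.lines:
                state, value = cache.lines[addr]
                if state == MODIFIED:
                    self.memory[addr] = value        # write back dirty data
                cache.lines[addr] = (SHARED, value)  # owner downgrades to shared
                return value
        return self.memory.get(addr, 0)

    def write_invalidate(self, addr, requester):
        # Every other cache snoops the bus and drops its copy.
        for cache in self.caches:
            if cache is not requester and addr in cache.lines:
                state, value = cache.lines.pop(addr)
                if state == MODIFIED:
                    self.memory[addr] = value        # flush before invalidating

class SnoopyCache:
    def __init__(self, bus):
        self.bus, self.lines = bus, {}               # addr -> (state, value)
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                   # miss: data comes from a peer cache or memory
            self.lines[addr] = (SHARED, self.bus.read_miss(addr, self))
        return self.lines[addr][1]

    def write(self, addr, value):
        self.bus.write_invalidate(addr, self)        # broadcast invalidation on the bus
        self.lines[addr] = (MODIFIED, value)         # write-back to memory is delayed

# Usage: P1 reads X after P0 wrote it; P0's dirty copy is supplied and downgraded.
memory = {"X": 0}
bus = Bus(memory)
p0, p1 = SnoopyCache(bus), SnoopyCache(bus)
p0.write("X", 42)
print(p1.read("X"))   # 42
print(memory["X"])    # 42: write-back happened when P1's miss was serviced
```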

Example

But it does not scale!

• Not feasible for machines with memory distributed across a large number of systems

• The broadcast-on-the-bus approach scales poorly

• Leads to bus saturation

• Wastes processor cycles snooping all caches in the system

Directory-Based Cache Coherence

• A directory tracks which processors have cached a block of memory

• The directory contains information for all cache blocks in the system

• Each cache block can be in 1 of 3 states
  – Invalid
  – Shared
  – Exclusive

• To enter the exclusive state, all other cached copies of the same memory location are invalidated (sketched below)
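
A minimal sketch of the directory bookkeeping, assuming a single directory that records each block's state and its set of sharers. The names and the print-based invalidation are illustrative only; real hardware would send invalidation messages and transfer the data itself.

```python
INVALID, SHARED, EXCLUSIVE = "I", "S", "E"

class Directory:
    """Tracks, for each memory block, its state and which processors cache it."""
    def __init__(self):
        self.entries = {}                                # block -> (state, set of sharers)

    def read_request(self, block, proc):
        state, sharers = self.entries.get(block, (INVALID, set()))
        # If a processor held the block exclusively, it would supply the data
        # and be downgraded; this sketch only tracks the bookkeeping.
        self.entries[block] = (SHARED, sharers | {proc})

    def write_request(self, block, proc):
        state, sharers = self.entries.get(block, (INVALID, set()))
        for other in sorted(sharers - {proc}):
            print(f"invalidate block {block} in cache of P{other}")
        self.entries[block] = (EXCLUSIVE, {proc})        # requester becomes sole owner

# Usage: P0 and P1 read block 7, then P1 writes it, so P0's copy is invalidated.
directory = Directory()
directory.read_request(7, 0)
directory.read_request(7, 1)
directory.write_request(7, 1)    # prints: invalidate block 7 in cache of P0
```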

Original form not popular

• Compared to snoopy protocols
  – Directory systems avoid broadcasting on the bus

• But requests are served by a single directory server
  – May saturate the directory server

• Still not scalable

• How about distributing the directory?
  – Load balancing
  – Hierarchical model?

Distributed Directory Protocol

• Involves sending messages among 3 node types
  – Local node
    • Requesting processor node
  – Home node
    • Node containing the memory location
  – Remote node
    • Node containing a cache block in the exclusive state

3 Scenarios

• Scenario 1
  – Local node sends request to home node
  – Home node sends data back to local node

• Scenario 2
  – Local node sends request to home node
  – Home node redirects request to remote node
  – Remote node sends data back to local node

• Scenario 3
  – Local node sends request for exclusive state
  – Home node redirects request to other remote nodes for invalidation (all three flows are sketched below)
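
The three flows can be written out as message sequences. This is a sketch of the slide's scenarios, not a specific protocol implementation: the message names are made up, and invalidation acknowledgements are assumed to go to the requesting (local) node, as in DASH, though other protocols collect them at the home node.

```python
def scenario_1(local, home):
    # Clean block: the home node services the request itself.
    return [(local, home, "read request"),
            (home, local, "data reply")]

def scenario_2(local, home, remote):
    # Block held exclusively at a remote node: home redirects the request.
    return [(local, home, "read request"),
            (home, remote, "forward request"),
            (remote, local, "data reply")]

def scenario_3(local, home, remotes):
    # Local node wants exclusive state: home tells every sharer to invalidate.
    msgs = [(local, home, "exclusive request")]
    msgs += [(home, r, "invalidate") for r in remotes]
    msgs += [(r, local, "invalidate ack") for r in remotes]
    msgs.append((home, local, "exclusive grant"))
    return msgs

for flow in (scenario_1("L", "H"),
             scenario_2("L", "H", "R"),
             scenario_3("L", "H", ["R1", "R2"])):
    for src, dst, what in flow:
        print(f"{src} -> {dst}: {what}")
    print()
```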

Example

Stanford DASH Multiprocessor

• 1st operational multiprocessor to support a scalable cache coherence protocol

• Demonstrates that scalability and cache coherence are not incompatible

• 2 hypotheses
  – Shared memory machines are easier to program
  – Cache coherence is vital

Past experience

• From experience
  – Memory access times differ widely between physical locations
  – Latency and bandwidth are important for shared memory systems
  – Caching helps amortize the cost of memory access in a memory hierarchy

DASH Multiprocessor

• Relaxed memory consistency model

• Observation
  – Most programs use explicit synchronization
  – Sequential consistency is not necessary
  – Allows the system to perform writes without waiting until all invalidations are performed

• Offers advantages in hiding memory latency (see the sketch below)
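
A minimal sketch of why explicit synchronization makes this safe, using Python threads as a stand-in for processors: the writer's ordinary stores can complete (and their invalidations propagate) in any order, because the reader only looks at the data after the lock hand-off; only the release/acquire pair has to be ordered.

```python
import threading

data = {}
done = threading.Lock()
done.acquire()                      # held until the writer finishes

def writer():
    data["a"] = 1                   # ordinary stores: their invalidations may
    data["b"] = 2                   # complete in any order, unacknowledged
    done.release()                  # only this release has to wait for them

def reader():
    with done:                      # acquire: writes before the release are visible
        print(data["a"] + data["b"])    # prints 3

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()
```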

DASH Multiprocessor

• Non-binding software prefetch
  – Prefetches data into the cache
  – Maintains coherence
  – Transparent to the user

• The compiler can issue such instructions to help runtime performance
  – If the data is invalidated, it is fetched again when it is actually accessed

• Helps to hide latency as well (example below)
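
A small software simulation of what "non-binding" means, with invented names (prefetch, invalidate, read): the prefetch only warms the cache, so if the line is invalidated before use, the later access simply re-fetches the current value and coherence is preserved.

```python
class CoherentCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}            # addr -> value, dropped on invalidation

    def prefetch(self, addr):
        self.lines.setdefault(addr, self.memory[addr])   # a hint only, not binding

    def invalidate(self, addr):
        self.lines.pop(addr, None)                       # another processor wrote addr

    def read(self, addr):
        if addr not in self.lines:                       # prefetched copy may be gone
            self.lines[addr] = self.memory[addr]         # re-fetch keeps coherence
        return self.lines[addr]

memory = {0: 10, 1: 20}
cache = CoherentCache(memory)
cache.prefetch(1)          # issued early, e.g. by the compiler
memory[1] = 99             # another processor updates the location...
cache.invalidate(1)        # ...so the prefetched copy is invalidated
print(cache.read(1))       # 99: the stale prefetched value is never used
```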

DASH Multiprocessor

• Remote Access Cache
  – Remote accesses are combined and buffered within individual nodes
  – Can be likened to having a 2-level cache hierarchy (sketched below)
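
A minimal sketch of the two-level idea, assuming a per-node cache that fronts the network: the first remote read from a node crosses the network, and later reads by any processor in that node hit the node-level cache. The class and counter are illustrative only.

```python
class RemoteAccessCache:
    """Node-level cache for remote data: a second level between a node's
    processor caches and the remote home memory."""
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote     # performs the actual network access
        self.lines = {}
        self.remote_fetches = 0

    def read(self, addr):
        if addr not in self.lines:           # first access from this node goes remote
            self.lines[addr] = self.fetch_remote(addr)
            self.remote_fetches += 1
        return self.lines[addr]              # later accesses from this node hit here

# Usage: two processors in the same node read the same remote address;
# only one network access is made.
home_memory = {100: "remote data"}
rac = RemoteAccessCache(lambda addr: home_memory[addr])
rac.read(100); rac.read(100)
print(rac.remote_fetches)    # 1
```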

Lessons

• High performance requires careful planning of remote data access

• Scaling applications depends on other factors
  – Load balancing
  – Limited parallelism
  – Difficult to scale an application to use more processors

Challenges

• Programming model?
  – A model that helps programmers reason about code rather than fine-tune for a specific machine

• Fault tolerance and recovery?
  – More computers = higher chance of failure

• Increasing latency?
  – Deeper hierarchies = larger variety of latencies

Callisto

• Previously, networking gateways
  – Handled a diverse set of services
  – Handled 1000s of channels
  – Complex designs involving many chips
  – High power requirements

• Callisto is a gateway on a chip
  – Used to implement communication gateways for different networks

In a nutshell

• Integrates DSPs, CPUs, RAM, and I/O channels on a single chip

• Programmable multi-service platform

• Handles 60 to 240 channels per chip

• An array of Callisto chips can fit in a small space
  – Power efficient
  – Handles a large number of channels
