DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Byn Choi, Nima Honarmand, Rakesh Komuravelli, Robert Smolinski, Hyojin Sung, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou
University of Illinois at Urbana-Champaign and Intel

Transcript of DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Page 1: Title

DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Byn Choi, Nima Honarmand, Rakesh Komuravelli, Robert Smolinski, Hyojin Sung, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou

University of Illinois at Urbana-Champaign and Intel

Page 2: Motivation

• Goal: power-, complexity-, performance-scalable hardware
• Today: shared memory
  – Directory-based coherence
    • Complex and unscalable
  – Difficult programming model
    • Data races, non-determinism, no safety/composability/modularity
  – Mismatched interface between HW and SW, a.k.a. the memory model
    • Can’t specify “what value a read can return”
    • Data races defy acceptable semantics
• Fundamentally broken for hardware and software

Page 3: Solution?

• Banish shared memory?

  “As you scale the number of cores on a cache coherent system (CC), ‘cost’ in ‘time and memory’ grows to a point beyond which the additional cores are not useful in a single parallel program. This is the coherency wall….”

  A case for message-passing for many-core computing, Timothy Mattson et al.

Page 4: Solution

• The problems are not inherent to the shared-memory paradigm

[Diagram: Shared Memory = global address space + implicit, unstructured communication and synchronization]

Page 5: Solution

• Banish wild shared memory!

[Diagram: Wild Shared Memory = global address space + implicit, unstructured communication and synchronization]

Page 6: Solution

• Build disciplined shared memory!

[Diagram: Disciplined Shared Memory = global address space + explicit, structured side-effects, replacing implicit, unstructured communication and synchronization]

Page 7: Disciplined Shared Memory

Disciplined shared memory = explicit effects + structured parallel control

• No data races, determinism-by-default, safe non-determinism
• Simple semantics, safety, and composability
• Simple coherence and consistency
• Software-aware address/communication/coherence granularity
• Power-, complexity-, performance-scalable hardware

Page 8: Research Strategy: Software Scope

• Many systems provide disciplined shared-memory features
• Current driver is DPJ (Deterministic Parallel Java)
  – Determinism-by-default, safe non-determinism
• End goal is a language-oblivious interface
• Current focus on deterministic codes
  – Common and best case
• Extend later to safe non-determinism and legacy codes

Page 9: Research Strategy: Hardware Scope

• Today’s focus: cache coherence
• Limitations of current directory-based protocols
  1. Complexity
     • Subtle races and numerous transient states in the protocol
     • Hard to extend for optimizations
  2. Storage overhead
     • Directory overhead for sharer lists
  3. Performance and power inefficiencies
     • Invalidation and ack messages
     • False sharing
     • Indirection through the directory
     • Suboptimal communication granularity of a cache line, …
• Ongoing: rethink communication architecture and data layout

Page 10: Contributions

• Simplicity
  – Compared protocol complexity with MESI
  – 25x fewer reachable states in model checking
• Extensibility
  – Direct cache-to-cache transfer
  – Flexible communication granularity
• Storage overhead
  – No storage overhead for directory information
  – Storage overhead beats MESI after tens of cores and scales beyond
• Performance/power
  – Up to 73% reduction in memory stall time
  – Up to 70% reduction in network traffic

Page 15: Outline

• Motivation
• Background: DPJ
• DeNovo Protocol
• DeNovo Extensions
• Evaluation
  – Protocol Verification
  – Performance
• Conclusion and Future Work

Page 16: Background: DPJ

• Extension for modern OO languages
• Structured parallel control: nested fork-join style
  – foreach, cobegin
• Type and effect system ensures non-interference (sketched below)
  – Region names partition the heap
  – Effect summaries declare which regions each method reads/writes
  – Type checking ensures race-freedom for parallel tasks
  → Deterministic execution
• Recent support for safe non-determinism
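
A minimal sketch of these pieces in DPJ-like syntax; the class, regions, and method here are illustrative, not from the paper:

    class Body {
        region Pos, Vel;                  // region names partition the heap
        double pos in Pos;                // every field is placed in a region
        double vel in Vel;

        // Effect summary: the compiler checks the body against it.
        void move() reads Vel writes Pos {
            pos = pos + vel;
        }
    }

    // Parallel phase. In full DPJ, index-parameterized regions let the
    // type checker prove that iterations write disjoint data, making the
    // foreach race-free and the phase deterministic.
    foreach (int i in 0, N) {
        bodies[i].move();
    }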

Page 17: Memory Consistency Model

• Guaranteed determinism
• A read returns the value of the last write in sequential order, where the write is either:
  1. By the same task in this parallel phase
  2. Or from before this parallel phase

[Diagram: timeline of a parallel phase; a task’s LD 0xa returns the value of its own earlier ST 0xa, or of a ST 0xa that precedes the phase]

Page 18: Memory Consistency Model

• Guaranteed determinism
• Read returns the value of the last write, as above

[Diagram: same timeline; when the ST 0xa happened on another core, the coherence mechanism is what delivers its value to the LD 0xa]
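
Concretely, the two rules could play out as below (schematic DPJ-like code; a[] and b sit in distinct regions and only a’s region is written in this phase):

    b = 7;                        // write before the parallel phase
    foreach (int i in 0, N) {     // parallel phase
        a[i] = i;                 // this task's own write
        int r = a[i];             // rule 1: returns i, the last write by
                                  //   the same task in this phase
        int s = b;                // rule 2: b's region is not written this
    }                             //   phase, so the read returns 7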

Page 21: Cache Coherence

• Coherence enforcement
  1. Invalidate stale copies in caches
  2. Track the up-to-date copy
• Explicit effects
  – Compiler knows all regions written in this parallel phase
  – Cache can self-invalidate before the next parallel phase
    • Invalidates data in writable regions not accessed by itself
• Registration
  – Directory keeps track of one up-to-date copy
  – Writer updates it before the next parallel phase (both steps sketched below)
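
Both obligations can be pictured as code run at every phase boundary; a schematic in Java, where Cache and its two methods are hypothetical stand-ins for the hardware actions:

    import java.util.Set;

    interface Cache {
        // Registration: make the L2 directory point at this core for
        // every word this core wrote during the phase.
        void flushRegistrations();
        // Self-invalidation: drop words in the given regions that this
        // core neither wrote (Registered) nor read (touched) this phase.
        void selfInvalidate(Set<String> regionsWrittenThisPhase);
    }

    class PhaseBoundary {
        static void onPhaseEnd(Cache l1, Set<String> writtenRegions) {
            l1.flushRegistrations();           // must reach the directory
            l1.selfInvalidate(writtenRegions); // before the next phase starts
        }
    }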

Page 22: Basic DeNovo Coherence

• Assume (for now): private L1, shared L2; single-word cache lines
  – Data-race-freedom at word granularity
• L2 data arrays double as the directory
  – Keep valid data or the registered core ID; no space overhead
• L1/L2 states: Invalid, Valid, Registered (state machine sketched below)
• “Touched” bit: set only if the word is read in the phase

[State diagram: a Read takes Invalid to Valid; a Write takes Invalid or Valid to Registered; the registry (L2) records which core is Registered]
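
A minimal Java reading of that per-word state machine, touched bit included (an illustration of the diagram, not the paper’s implementation):

    // Per-word L1 coherence state. A read of an Invalid word misses and
    // fills to Valid; any write makes the word Registered (and registers
    // it at the L2 directory).
    enum State { INVALID, VALID, REGISTERED }

    class L1Word {
        State state = State.INVALID;
        boolean touched = false;           // set only if read in this phase

        void onRead()  {
            if (state == State.INVALID) state = State.VALID;  // miss + fill
            touched = true;
        }
        void onWrite() { state = State.REGISTERED; }

        // Phase boundary: a Valid word in a region written this phase may
        // be stale unless this core read it (touched: race-freedom means
        // nobody else wrote it this phase) or wrote it (Registered).
        void onPhaseEnd(boolean regionWrittenThisPhase) {
            if (regionWrittenThisPhase && !touched && state == State.VALID)
                state = State.INVALID;     // self-invalidate
            touched = false;
        }
    }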

Page 23: Example Run

    class S_type {
        X in DeNovo-region ;
        Y in DeNovo-region ;
    }
    S_type S[size];
    ...
    Phase1 writes { // DeNovo effect
        foreach i in 0, size {
            S[i].X = …;
        }
        self_invalidate( );
    }

[Animation: Core 1 runs iterations i = 0..2 and Core 2 runs i = 3..5 over an array of six structs with fields X and Y. Legend: R = Registered, V = Valid, I = Invalid.

Initially every word is Valid in both L1s and in the shared L2.

Each core’s writes register the X words it produced: Core 1’s L1 holds X0–X2 Registered, Core 2’s L1 holds X3–X5 Registered. Registration requests go to the L2, which acks them and records the registered core ID in its data array (X0–X2: C1, X3–X5: C2).

At the phase boundary each core self-invalidates the written region it did not touch: X3–X5 become Invalid in Core 1’s L1 and X0–X2 become Invalid in Core 2’s L1. The Y words, which were not written, stay Valid everywhere.]

Page 24: Addressing Limitations

• Limitations of current directory-based protocols
  1. Complexity
     • Subtle races and numerous transient states in the protocol
     • Hard to extend for optimizations
  2. Storage overhead
     • Directory overhead for sharer lists
  3. Performance and power overhead
     • Invalidation and ack messages
     • False sharing
     • Indirection through the directory

Page 25: Practical DeNovo Coherence

• Basic protocol impractical
  – High tag storage overhead (a tag per word)
• Address/transfer granularity > coherence granularity
• DeNovo line-based protocol
  – Traditional software-oblivious spatial locality
  – Coherence granularity still at word level (see the sketch below)
    • No word-level false sharing

[Diagram: “line merging” cache, one tag per line with per-word state, e.g. a line whose three words are V, V, R]
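
A sketch of that organization in Java, reusing the State enum from the sketch above (field names are illustrative): one tag covers the line while every word keeps its own coherence state.

    // "Line merging" cache frame: line-granularity tag, word-granularity state.
    class LineFrame {
        static final int WORDS_PER_LINE = 16;   // 64-byte line, 4-byte words
        long tag;
        State[] state = new State[WORDS_PER_LINE];
        int[]   data  = new int[WORDS_PER_LINE];

        // A reply carrying one word upgrades only that word; a neighbor
        // Registered by this core is untouched, so another core's writes
        // to the same line never invalidate it (no word-level false sharing).
        void mergeWord(int idx, int value) {
            data[idx]  = value;
            state[idx] = State.VALID;
        }
    }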

Page 26: Storage Overhead

• DeNovo line (64 bytes/line)
  – L1: 2 state bits, 1 touched bit, 5 region bits per word (128 bits/line)
  – L2: 1 valid bit/line, 1 dirty bit/line, 1 state bit/word (18 bits/line)
• MESI (full-map, in-cache directory)
  – L1: 5 state bits (5 bits/line)
  – L2: 5 state bits + [# of cores] directory bits (5 + P bits/line)
• Size of L2 = 8 x [aggregate size of L1s] (worked through below)
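
Working those numbers through (a small self-contained check; the 8x weighting of L2 lines follows the slide’s L2 sizing, and the crossover it predicts matches the 29 cores on the next slide):

    public class OverheadCheck {
        public static void main(String[] args) {
            // Per-line overhead bits for a 64-byte (16-word) line.
            int denovoL1 = 16 * (2 + 1 + 5); // 128: state+touched+region bits/word
            int denovoL2 = 1 + 1 + 16;       // 18: valid+dirty/line, 1 state bit/word
            int mesiL1   = 5;
            for (int p : new int[]{16, 32, 64}) {
                int mesiL2 = 5 + p;          // full-map sharer vector grows with P
                // L2 holds 8x the aggregate L1 capacity, so count 8 L2 lines
                // per L1 line:
                int denovo = denovoL1 + 8 * denovoL2;  // 272 bits, independent of P
                int mesi   = mesiL1   + 8 * mesiL2;    // 45 + 8P bits
                System.out.println(p + " cores: DeNovo " + denovo + " vs. MESI " + mesi);
            }
            // 45 + 8P exceeds 272 once P >= 29: MESI overtakes DeNovo at 29 cores.
        }
    }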

Page 27: Storage Overhead

[Chart: storage overhead vs. number of cores; DeNovo’s fixed overhead falls below MESI’s growing directory overhead at 29 cores.]

Page 28: Alternative Directory Designs

• Duplicate tag directory
  – L1 tags duplicated in the directory
  – Requires highly associative lookups
    • 256-way associative lookup for a 64-core setup with 4-way L1s
• Sparse directory
  – Significant performance overhead at higher core counts
• Tagless directory (MICRO ’09)
  – Imprecise sharer-list encoding using Bloom filters
  – Even more protocol complexity (false positives)
  – Trades performance for less storage overhead/power

Page 29: Addressing Limitations

• Limitations of current directory-based protocols
  1. Complexity ✔
     • Subtle races and numerous transient states in the protocol
     • Hard to extend for optimizations
  2. Storage overhead ✔
     • Directory overhead for sharer lists
  3. Performance and power overhead
     • Invalidation and ack messages
     • False sharing
     • Indirection through the directory

Page 30: Extensions

• Traditional directory-based protocols
  – Sharer lists always contain all the true sharers
• DeNovo protocol
  – Registry points to the latest copy at the end of the phase
  – Valid data can be copied around freely

Page 31: Extensions (1 of 2)

• Basic protocol with direct cache-to-cache transfer
  – Get data directly from the producer
  – Through prediction and/or software assistance
  – Converts 3-hop misses to 2-hop misses (schematic below)

[Animation: Core 1’s L1 holds X3–X5 Invalid while Core 2’s L1 holds them Registered, and the L2 maps X3–X5 to C2. Instead of indirecting through the L2, Core 1 sends LD X3 straight to Core 2’s L1 and receives X3 directly.]
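
A schematic of that path in Java; the predictor, network, and fallback here are hypothetical abstractions, not the paper’s interfaces:

    import java.util.function.LongToIntFunction;

    class DirectTransfer {
        interface Net { Integer readFromCore(int coreId, long addr); } // null = miss

        int predictedProducer;  // trained by history and/or software hints

        int load(long addr, Net net, LongToIntFunction viaL2) {
            Integer v = net.readFromCore(predictedProducer, addr); // 2 hops if right
            if (v != null) return v;
            return viaL2.applyAsInt(addr); // misprediction: usual 3-hop path via L2
        }
    }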

Page 34: Extensions (2 of 2)

• Basic protocol with flexible communication granularity
  – Software-directed data transfer
  – Transfer “relevant” data together
  – Achieves the effect of an AoS-to-SoA transformation without programmer/compiler intervention (responder sketch below)

[Animation: Core 1 misses on X3 with X3–X5 Invalid in its L1. The response to LD X3 carries X3, X4, and X5 together, ignoring line boundaries (and could likewise gather the Y and Z slices), so Core 1’s L1 ends with X3–X5 Valid rather than pulling in a line’s worth of unrelated Y and Z words.]
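
One way to picture the responder side (hypothetical names; the real transfer policy is software-directed): gather every word of the requested region, ignoring line boundaries.

    import java.util.HashMap;
    import java.util.Map;

    class FlexResponder {
        Map<Long, Integer> words    = new HashMap<>(); // addr -> value held here
        Map<Long, String>  regionOf = new HashMap<>(); // addr -> region name

        // Reply to a miss with all words of the same region that we hold,
        // e.g. {X3, X4, X5} for a miss on X3.
        Map<Long, Integer> replyFor(long missAddr) {
            String r = regionOf.get(missAddr);
            Map<Long, Integer> reply = new HashMap<>();
            for (Map.Entry<Long, Integer> e : words.entrySet())
                if (r != null && r.equals(regionOf.get(e.getKey())))
                    reply.put(e.getKey(), e.getValue());
            return reply;
        }
    }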

Page 35: Outline

• Motivation
• Background: DPJ
• DeNovo Protocol
• DeNovo Extensions
• Evaluation
  – Protocol Verification
  – Performance
• Conclusion and Future Work

Page 36: Evaluation

• Simplicity
  – Formal verification of the cache coherence protocol
  – Comparing reachable states
• Performance/power
  – Simulation experiments
• Extensibility
  – DeNovo extensions

Page 37: Protocol Verification Methodology

• Murphi model-checking tool
• Verified DeNovo-word and MESI-word protocols
  – MESI taken from a state-of-the-art GEMS implementation
• Abstract model
  – Single address/region, two data values
  – Two cores with private L1s and a unified L2, unordered network
  – Data-race-freedom guarantee modeled for DeNovo
  – Cross-phase interactions

Page 38: Protocol Verification: Results

• Correctness
  – Three bugs in the DeNovo protocol
    • Mistakes in translation from the high-level specification
    • Simple to fix
  – Six bugs in the MESI protocol
    • Two deadlock scenarios
    • Unhandled races due to L1 writebacks
    • Several days to fix
• Complexity
  – 25x fewer reachable states for DeNovo
  – 30x difference in model-checking runtime

Page 39: Performance Evaluation Methodology

• Simulation environment
  – Wisconsin GEMS + Simics + Princeton Garnet network
• System parameters
  – 64 cores
  – Private L1s (128KB) and unified L2 (32MB)
• Simple core model
  – 5-stage, single-issue, in-order core
  – Results reported for memory stall time only

Page 40: Benchmarks

• FFT and LU from SPLASH-2
• kdTree: construction of k-D trees for ray tracing
  – Developed within UPCRC, presented at HPG 2010
  – Two versions
    • kdFalse: false sharing in an auxiliary structure
    • kdPad: padding to eliminate false sharing

Page 41: Simulated Protocols

• Word-based (4B cache line)
  – MESI (Mword) and DeNovo (Dword)
• Line-based (64B cache line)
  – MESI (Mline) and DeNovo (Dline)
• DeNovo extensions (proof of concept, word-based)
  – DeNovo direct (Ddirect)
  – DeNovo flex (Dflex)

Page 42: Word Protocols

[Bar chart: memory stall time for Mword vs. Dword on FFT, LU, kdFalse, and kdPad (0–350% scale), broken into L2 hit, remote L1 hit, and memory hit components; two clipped bars are labeled 563 and 547.]

Dword is comparable to Mword.
Simplicity doesn’t compromise performance.

Page 43: DeNovo Direct

[Bar chart: memory stall time for Mword, Dword, and Ddirect on FFT, LU, kdFalse, and kdPad, with the same breakdown; clipped bars labeled 563, 547, and 505.]

Ddirect reduces remote L1 hit time.

Page 44: Line Protocols

[Bar chart: memory stall time for Mword, Dword, Ddirect, Mline, and Dline on FFT, LU, kdFalse, and kdPad; clipped bars labeled 563, 547, and 505.]

Dline is not susceptible to false sharing.
Up to 66% reduction in total time.

Page 45: Word vs. Line

[Same chart as the previous slide, now read as a word- vs. line-based comparison.]

Benefit is application dependent.
kdFalse: Mline worse by 12%.

Page 46: DeNovo Flex

[Bar chart: memory stall time for Mword, Dword, Ddirect, Mline, Dline, and Dflex on FFT, LU, kdFalse, and kdPad; clipped bars labeled 563, 547, and 505.]

Dflex outperforms all systems.
Up to 73% reduction over Mline.

Page 47: Network Traffic

[Bar chart: network flits for Mline, Dline, and Dflex on FFT, LU, kdFalse, and kdPad, normalized to 100%.]

Dline and Dflex generate less network traffic than Mline.
Up to 70% reduction.

Page 48: Conclusion

• Disciplined programming models are key for software
  – DeNovo rethinks hardware for disciplined models
• Simplicity
  – 25x fewer reachable states in model checking
  – 30x difference in model-checking runtime
• Extensibility
  – Direct cache-to-cache transfer
  – Flexible communication granularity
• Storage overhead
  – No storage overhead for directory information
  – Storage overhead beats MESI after tens of cores
• Performance/power
  – Up to 73% less time spent on memory stalls
  – Up to 70% reduction in network traffic

Page 49: Future Work

• Rethinking cache data layout
• Extend to
  – Disciplined non-deterministic codes
  – Synchronization
  – Legacy codes
• Extend to off-chip memory
• Automate generation of hardware regions
• More extensive evaluations

Page 50: Thank You!

Thank You!

Page 51: Backup Slides

Backup Slides

Page 52: Flexible Coherence Granularity

• Byte-level sharing is uncommon
  – None in the apps we have studied so far
• Handle it correctly, but not necessarily efficiently
  – Compiler aligns byte-granularity regions at word boundaries
  – If that fails, hardware “clones” the line into 4 cache frames
    • With at least 4-way associativity, all clones reside in the same set

[Diagram: a normal frame holds Tag, State, and Bytes 0–3 together; a cloned line occupies four frames of the same set, each holding Tag, State, and one of Bytes 0–3.]