Prefetching Challenges in Distributed Memories for CMPs

Martí Torrents, Raúl Martínez, and Carlos Molina

Computer Architecture Department, UPC – BarcelonaTech

Outline

Introduction

Naming the challenges

Challenge evaluation methodology

Experimental framework

Challenge Quantification

Facing the Challenges

Conclusions

Introduction

Prefetching

• Reduce memory latency

• Bring the data the CPU will need next into a nearer cache

• Increase the hit ratio

• It is implemented in most commercial processors

• Erroneous prefetching may produce

– Cache pollution

– Resource consumption (queues, bandwidth, etc.)

– Power consumption

Motivation

• The number of cores on a single chip grows every year

– Nehalem: 4 to 6 cores

– Tilera: 64 to 100 cores

– Intel Polaris: 80 cores

– Nvidia GeForce: up to 256 cores

Prefetch in CMPs

• Useful prefetches improve performance

– Avoid network latency

– Reduce memory access latency

• Useless prefetches degrade performance

– More power consumption

– More NoC congestion

– Interference with other cores' requests

Prefetch adverse behaviors

M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modelling Practice and Theory (SIMPAT), 2014.

Distributed memories

• Distribution of the memory access pattern:

@ @+2 @+4 @+6 @+8 @+10 @+12 @+14

TILE 00: @      TILE 01: @+2    TILE 02: @+4    TILE 03: @+6

TILE 04: @+8    TILE 05: @+10   TILE 06: @+12   TILE 07: @+14
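The distribution shown above can be sketched as a block-interleaved mapping. This is only an illustration: the slides do not state the mapping function, so the modulo interleaving, the block size of 2 address units, and the names `NUM_TILES`, `BLOCK` and `home_tile` are all assumptions.

```python
NUM_TILES = 8  # assumption: the eight tiles (TILE 00 to TILE 07) of the slide
BLOCK = 2      # assumption: consecutive blocks are 2 address units apart


def home_tile(offset: int) -> int:
    """Map an offset from the base address @ to a tile (modulo interleaving)."""
    return (offset // BLOCK) % NUM_TILES


# The stream @, @+2, ..., @+14 expressed as offsets from @.
stream = [i * BLOCK for i in range(8)]
placement = {offset: home_tile(offset) for offset in stream}

for offset, tile in placement.items():
    print(f"@+{offset} -> TILE {tile:02d}")
```

Under these assumptions, every access of the stream lands on a different tile, so no single tile observes two consecutive accesses.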

Naming the challenges

Prefetch Distributed Memory Systems

• Analysis phase

[Figure: four tiles, each with CPU, L1 and PREFETCH blocks, on top of a DISTRIBUTED L2 MEMORY. Labels: "L1 MISS for @", "Distributed patterns".]

Pattern Detection Challenge

• The memory stream is distributed among the tiles

• Each prefetcher is aware of only part of the stream

• Access patterns and correlations are harder to detect

• Not all prefetchers are affected

– Correlation prefetchers are affected: GHB

– One Block Lookahead prefetchers are not affected: Tagged
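A small sketch can illustrate why a correlation prefetcher suffers when it sees only part of the stream. It is not a Global History Buffer: it uses a deliberately simple "last successor" table, and the addresses, `NUM_TILES` and the modulo mapping are hypothetical choices made for the example.

```python
from collections import defaultdict

NUM_TILES = 4  # assumption: addresses are interleaved over four tiles


def successor_table(stream):
    """For each address, remember the address that followed it most recently."""
    table = {}
    for prev, nxt in zip(stream, stream[1:]):
        table[prev] = nxt
    return table


def split_by_tile(stream, num_tiles=NUM_TILES):
    """Give each tile only the accesses whose address maps to it."""
    per_tile = defaultdict(list)
    for addr in stream:
        per_tile[addr % num_tiles].append(addr)
    return per_tile


# Hypothetical repeating access sequence (for example, a pointer chase).
stream = [10, 37, 5, 22] * 3

global_view = successor_table(stream)
tile_views = {tile: successor_table(sub) for tile, sub in split_by_tile(stream).items()}

print("whole stream, after 10 comes:", global_view[10])
print("tile 2 only,  after 10 comes:", tile_views[2][10])
```

With the whole stream, the table learns that 37 follows 10. The tile that owns address 10 never sees 37, so it learns that 22 follows 10 and would prefetch the wrong block.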

Prefetch Distributed Memory Systems

• Request generation phase

[Figure: four tiles, each with CPU, L1 and PREFETCH blocks, on top of a DISTRIBUTED L2 MEMORY. Labels: "@", "@+2", "@+4", "Queue filtering".]

Prefetch Queue Filtering Challenge

• Prefetch requests are queued in distributed queues

• Independent engines generate the requests

• Repeated requests can be queued

• In a centralized queue, those would be merged

• Adverse effects:

– Power consumption

– Network contention
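The difference between distributed and centralized queues can be sketched as follows. This is a simplified model under stated assumptions: queues are unbounded lists, each request is a hypothetical `(tile, address)` pair, and merging simply means that an address already pending is not queued again.

```python
def enqueue_distributed(requests, num_tiles):
    """Each tile has its own queue and can only filter its own duplicates."""
    queues = [[] for _ in range(num_tiles)]
    for tile, addr in requests:
        if addr not in queues[tile]:
            queues[tile].append(addr)
    return queues


def enqueue_centralized(requests):
    """One shared queue: a repeated address merges with the pending entry."""
    queue = []
    for _tile, addr in requests:
        if addr not in queue:
            queue.append(addr)
    return queue


# Hypothetical prefetch requests generated by independent engines.
requests = [(0, 0x100), (1, 0x100), (2, 0x140), (0, 0x100), (3, 0x140)]

distributed = enqueue_distributed(requests, num_tiles=4)
centralized = enqueue_centralized(requests)

issued_distributed = sum(len(q) for q in distributed)
issued_centralized = len(centralized)
print("issued with distributed queues:", issued_distributed)
print("issued with a centralized queue:", issued_centralized)
```

In this example the distributed queues issue four requests for two distinct blocks, while the centralized queue issues two. The extra requests are the source of the power and network overheads listed above.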

Prefetch Distributed Memory Systems

• Evaluation phase

[Figure: four tiles, each with CPU, L1 and PREFETCH blocks, on top of a DISTRIBUTED L2 MEMORY. Labels: "@", "@+2", "@+4", "L1 MISS for @ + 2", "Dynamic profiling".]

Dynamic Profiling Challenge

• Prefetch requests are generated in one tile

• The dynamic profiling information is collected in another tile

• The profiling seen by the originating tile is therefore erroneous

• Techniques that use this information may work incorrectly

– Filtering

– Throttling

– Specific prefetching engines
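The misattribution described above can be illustrated with a toy event log. Everything here is hypothetical: each event is an `(origin_tile, observing_tile, useful)` triple, where the origin tile generated the prefetch and the observing tile is where its outcome is recorded.

```python
from collections import Counter

# Hypothetical prefetch outcomes: (origin tile, observing tile, was it useful?)
events = [
    (0, 1, True),
    (0, 1, True),
    (0, 2, False),
    (1, 1, True),
    (2, 0, False),
]

local_useful = Counter()   # what each tile counts from its own observations
actual_useful = Counter()  # what each tile's prefetch engine really achieved

for origin, observer, useful in events:
    if useful:
        local_useful[observer] += 1
        actual_useful[origin] += 1

for tile in range(3):
    print(f"tile {tile}: local={local_useful[tile]} actual={actual_useful[tile]}")
```

Tile 0 generated two useful prefetches but observes none locally, so a throttling mechanism in tile 0 that relies on local counters could wrongly reduce its aggressiveness, while tile 1 would be credited with prefetches it did not generate.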

Challenge evaluation methodology

• Three environments to test the challenges

• Pattern Detection Challenge: Ideal Prefetcher

– Prefetcher that is aware of the whole memory stream

– No extra network contention added to the system

– No extra power consumed

– Requests are classified by their core identifier, to preserve the original stream of each core

• Prefetcher used for the test: Global History Buffer
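Classifying requests by core identifier can be sketched as a simple demultiplexing step. The `(core_id, address)` request format and the function name `per_core_streams` are assumptions made for this illustration; the slides do not describe how the ideal prefetcher is implemented.

```python
from collections import defaultdict


def per_core_streams(requests):
    """Rebuild each core's original access stream from an interleaved request list."""
    streams = defaultdict(list)
    for core_id, addr in requests:
        streams[core_id].append(addr)
    return dict(streams)


# Hypothetical requests from two cores, interleaved in arrival order.
requests = [(0, 0x10), (1, 0x80), (0, 0x12), (1, 0x84), (0, 0x14)]
streams = per_core_streams(requests)

for core_id, addrs in streams.items():
    print(core_id, [hex(a) for a in addrs])
```

Each per-core list keeps the order in which that core issued its accesses, which is what a pattern detector needs in order to see the undistributed stream.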

Pattern Detection Challenge

Challenge evaluation methodology

• Three environments to test the challenges

• Prefetch Queue Filtering: Centralized queue

– All requests are sent to a centralized queue

– Repeated requests are merged

– No extra network contention added to the system

– No extra power consumed

– Repeated requests are not issued

• Prefetcher used for the test: Tagged prefetcher
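For readers unfamiliar with tagged prefetching, the following is a minimal sketch of the scheme as it is commonly described: the next block is prefetched on a demand miss and on the first demand reference to a prefetched block. The slides give no implementation details, so the class name, the unbounded cache and the block numbering are assumptions.

```python
class TaggedPrefetcher:
    """One-block-lookahead prefetcher with a tag bit per cached block."""

    def __init__(self):
        self.cache = {}   # block -> tag bit (True: prefetched, not yet referenced)
        self.issued = []  # prefetches issued, in order

    def _prefetch(self, block):
        if block not in self.cache:
            self.cache[block] = True
            self.issued.append(block)

    def access(self, block):
        if block not in self.cache:
            # Demand miss: fetch the block and prefetch its successor.
            self.cache[block] = False
            self._prefetch(block + 1)
            return "miss"
        if self.cache[block]:
            # First reference to a prefetched block: clear the tag, prefetch again.
            self.cache[block] = False
            self._prefetch(block + 1)
        return "hit"


pf = TaggedPrefetcher()
results = [pf.access(block) for block in (100, 101, 102, 103)]
print(results)
print(pf.issued)
```

On a sequential stream only the first access misses, and every later access triggers one new prefetch. Because each decision depends only on the block being referenced, this scheme does not need the history of the whole stream.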

Prefetch Queue Filtering Challenge

Challenge evaluation methodology

• Three environments to test the challenges

• Dynamic Profiling Challenge: Hardware counters

– For each statistic and core, add a hardware counter

– Statistics: useful prefetches and useless prefetches

– Use the id of the origin core to classify the statistic

– Quantify the error for each core by:

* Where statistic is the number of useful or useless prefetches

• Prefetcher used for the test: Tagged Prefetcher

Dynamic Profiling Challenge

Experimental framework

• Gem5

– 64 x86 CPUs

– Ruby memory system

– L2 prefetchers

– MOESI coherence protocol

– Garnet network simulator

• PARSEC 2.1

Simulation environment

Challenge Quantification

Pattern Detection Challenge

Prefetch Queue Filtering Challenge

Dynamic Profiling Challenge

Facing the challenges

• There are two main options

– Redesign the entire prefetch philosophy

– Adapt the current techniques to work with DSMs

• Moreover, there are two main directions

– Centralize the information (handicap: increased communication)

– Distribute the prefetcher (handicap: distributing the prefetcher smartly)

Conclusions

• Three challenges when prefetching in DSMs

– Pattern Detection Challenge

– Prefetch Queue Filtering Challenge

– Dynamic Profiling Challenge

• Challenge evaluation methodology

• Directions for future researchers

• There are no evident solutions for them

• Not solving them limits prefetch performance

Q & A
