Memory System Performance in a NUMA Multicore Multiprocessor
![Page 1: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/1.jpg)
1
Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
![Page 2: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/2.jpg)
2
Summary
• NUMA multicore systems are unfair to local memory accesses
• Local execution sometimes suboptimal
![Page 3: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/3.jpg)
3
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
![Page 4: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/4.jpg)
4
NUMA multicores: how it happened
[Figure: total bandwidth [GB/s] vs. active cores (1-8), SMP curve; diagram of the SMP system: cores 0-7 with BusC, Northbridge with MC, DRAM memory]
First generation: SMP
![Page 5: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/5.jpg)
5
NUMA multicores: how it happened
[Figure: total bandwidth [GB/s] vs. active cores (1-8), SMP curve; diagram of the transition from SMP (BusC, Northbridge) to NUMA: cores 0-3 and 4-7, each group with its own MC and DRAM memory, connected via IC]
Next generation: NUMA
![Page 6: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/6.jpg)
6
NUMA multicores: how it happened
[Figure: total bandwidth [GB/s] vs. active cores (1-8) for SMP and NUMA (local); diagram of the NUMA system: cores 0-3 and 4-7, each processor with its own MC and DRAM memory, connected via IC]
Next generation: NUMA
![Page 7: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/7.jpg)
7
NUMA multicores: how it happened
[Figure: total bandwidth [GB/s] vs. active cores (1-8) for SMP, NUMA (local), and NUMA (remote); diagram of the NUMA system: cores 0-3 and 4-7, each processor with its own MC and DRAM memory, connected via IC]
Next generation: NUMA
![Page 8: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/8.jpg)
8
Bandwidth sharing
• Frequent scenario: bandwidth shared between cores
• Sharing model for the Intel Nehalem
[Figure: diagram of the NUMA system: cores 0-3 and 4-7, each processor with its own MC and DRAM memory, connected via IC]
![Page 9: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/9.jpg)
9
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
![Page 10: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/10.jpg)
10
Evaluation system
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
[Figure: diagram of the evaluation system: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with Level 3 cache, Global Queue, MC, QPI, and DRAM memory]
![Page 11: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/11.jpg)
11
Bandwidth sharing: local accesses
[Figure: diagram of Processor 0 and Processor 1 (Level 3 cache, Global Queue, MC, QPI, DRAM memory); highlighted: cores 0 and 3, the Global Queue, and DRAM memory]
![Page 12: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/12.jpg)
12
Bandwidth sharing: remote accesses
[Figure: diagram of Processor 0 and Processor 1 (Level 3 cache, Global Queue, MC, QPI, DRAM memory); highlighted: cores 4 and 5, the Global Queue, and DRAM memory, with cores 0 and 3 also shown]
![Page 13: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/13.jpg)
13
Bandwidth sharing: combined accesses
[Figure: diagram of Processor 0 and Processor 1 (Level 3 cache, Global Queue, MC, QPI, DRAM memory); highlighted: cores 0 and 3, cores 4 and 5, the Global Queue, and DRAM memory]
![Page 14: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/14.jpg)
14
Global Queue
• Mechanism to arbitrate between different types of memory accesses
• We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
![Page 15: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/15.jpg)
15
Benchmark program
• STREAM triad
for (i = 0; i < SIZE; i++) {
    a[i] = b[i] + SCALAR * c[i];
}
• Multiple co-executing triad clones
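A minimal sketch of how one triad clone could be written and its bandwidth computed. The function names (`triad`, `triad_bytes`, `triad_bandwidth_mbs`) are hypothetical and not taken from the talk. The byte count follows the usual STREAM convention of counting three arrays per pass, and it is an assumption that the authors measured bandwidth this way.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM triad kernel from the slide: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}

/* Bytes counted for one triad pass: two arrays read, one written.
 * Assumption: write-allocate traffic is not counted. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)n * (double)sizeof(double);
}

/* Bandwidth in MB/s (1 MB = 10^6 bytes) for a pass that took `seconds`. */
double triad_bandwidth_mbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return triad_bytes(n) / seconds / 1e6;
}
```

In a real measurement the arrays would need to be much larger than the level 3 cache so that the loop streams from DRAM, and the elapsed time for `triad` would come from a wall-clock timer.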
![Page 16: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/16.jpg)
16
Multi-clone experiments
• All memory allocated on Processor 0
• Local clones (on Processor 0) and remote clones (on Processor 1)
• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R)
[Figure: clones (C) placed on the cores of Processor 0 and Processor 1 for each example configuration]
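The configuration notation can be sketched as a mapping from (local, remote) clone counts to core IDs. This assumes the core numbering from the slide diagrams (cores 0-3 on Processor 0, which holds all memory, and cores 4-7 on Processor 1). `assign_cores` is a hypothetical helper, not part of the authors' setup, and actually pinning a clone to a core would need an OS-specific affinity call that is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#define CORES_PER_PROCESSOR 4  /* evaluation system in the talk: 2 x 4 cores */

/*
 * Fill `cores` with the core IDs used by configuration (local L, remote R).
 * Assumption: cores 0-3 are on Processor 0 and cores 4-7 on Processor 1.
 * Returns the number of clones, or -1 for an invalid configuration.
 */
int assign_cores(int local, int remote, int cores[2 * CORES_PER_PROCESSOR])
{
    assert(cores != NULL);
    if (local < 0 || local > CORES_PER_PROCESSOR ||
        remote < 0 || remote > CORES_PER_PROCESSOR) {
        return -1;
    }
    int n = 0;
    for (int i = 0; i < local; i++) {
        cores[n++] = i;                        /* local clone on Processor 0 */
    }
    for (int i = 0; i < remote; i++) {
        cores[n++] = CORES_PER_PROCESSOR + i;  /* remote clone on Processor 1 */
    }
    return n;
}
```

For (2L, 3R) this yields cores 0 and 1 for the local clones and cores 4, 5, and 6 for the remote clones.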
![Page 17: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/17.jpg)
17
GQ fairness: local accesses
[Figure: bar chart of total bandwidth [GB/s] per benchmark configuration (1L, 2L, 3L, 4L), broken down by Core 0 to Core 3; diagram of Processor 0 and Processor 1 (Cache, GQ, IMC, QPI, DRAM) with clones (C) accessing DRAM memory]
![Page 18: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/18.jpg)
18
GQ fairness: remote accesses
[Figure: bar charts of total bandwidth [GB/s] per benchmark configuration, broken down by Core 0 to Core 3: one chart for the remote configurations (1R, 2R, 3R, 4R) and one showing local and remote configurations together (1L, 1R, 2L, 2R, 3L, 3R, 4L, 4R); diagram of Processor 0 and Processor 1 (Cache, GQ, IMC, QPI, DRAM) with clones (C) accessing DRAM memory]
![Page 19: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/19.jpg)
19
Global Queue fairness
• Global Queue is fair when there are only local or only remote accesses in the system
• What about combined accesses?
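One simple way to put a number on "fair" is the ratio between the smallest and the largest per-clone bandwidth. This is an illustrative metric chosen for this sketch, not the measure used in the talk, and `min_max_ratio` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the smallest to the largest per-clone bandwidth.
 * 1.0 means every clone received the same bandwidth; smaller values
 * mean a less even split. Illustrative metric, not from the talk. */
double min_max_ratio(const double *bw, size_t n)
{
    assert(bw != NULL && n > 0);
    double lo = bw[0], hi = bw[0];
    for (size_t i = 1; i < n; i++) {
        if (bw[i] < lo) lo = bw[i];
        if (bw[i] > hi) hi = bw[i];
    }
    assert(hi > 0.0);
    return lo / hi;
}
```

Under this metric, the local-only and remote-only results would be expected to score close to 1.0, while a combined configuration in which one clone is favored would score lower.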
![Page 20: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/20.jpg)
20
GQ fairness: combined accesses
Execute clones in all possible configurations
[Figure: grid of configurations by # local clones (0-4) and # remote clones (0-4), with (2L, 3R) marked as an example]
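The configuration grid can be enumerated with two nested loops. The sketch below assumes that the empty configuration (0L, 0R) is skipped because no clone would be running; whether the authors counted it is not stated. `struct config` and `enumerate_configs` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLONES 4  /* up to 4 local and 4 remote clones on the 2 x 4-core system */

struct config {
    int local;   /* number of local clones */
    int remote;  /* number of remote clones */
};

/* Writes every (local, remote) pair except (0L, 0R) into `out` and
 * returns how many were written. `out` must hold at least 24 entries. */
size_t enumerate_configs(struct config *out)
{
    assert(out != NULL);
    size_t n = 0;
    for (int l = 0; l <= MAX_CLONES; l++) {
        for (int r = 0; r <= MAX_CLONES; r++) {
            if (l == 0 && r == 0) {
                continue;  /* no clone running: nothing to measure */
            }
            out[n].local = l;
            out[n].remote = r;
            n++;
        }
    }
    return n;
}
```

With up to four clones per processor, the 5 x 5 grid minus the empty cell gives 24 configurations.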
![Page 21: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/21.jpg)
21
GQ fairness: combined accesses
Execute clones in all possible configurations
[Figure: grid of configurations by # local clones (0-4) and # remote clones (0-4)]
![Page 22: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/22.jpg)
22
GQ fairness: combined accesses
[Figure: bar chart of total bandwidth [GB/s] for benchmark configurations (4L, 0R), (4L, 1R), (4L, 2R), (4L, 3R), and (4L, 4R), split into local clones and remote clones]
![Page 23: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/23.jpg)
23
GQ fairness: combined accesses
Execute clones in all possible configurations
[Figure: grid of configurations by # local clones (0-4) and # remote clones (0-4)]
![Page 24: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/24.jpg)
24
Combined accesses
[Figure: bar chart of total bandwidth [GB/s] for configurations (1L,1R), (2L,1R), (3L,1R), and (4L,1R), split into remote clone and local clones 1-4]
![Page 25: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/25.jpg)
25
Combined accesses
• In configuration (4L, 1R), the remote clone gets 30% more bandwidth than a local clone
• Remote execution can be better than local
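As a worked example of what the 30% figure implies, assume (hypothetically) that the four local clones in (4L, 1R) each receive the same bandwidth b, so that the remote clone receives 1.3b. The remote clone's share of the total is then:

```latex
\[
\frac{1.3\,b}{4b + 1.3\,b} = \frac{1.3}{5.3} \approx 24.5\%,
\qquad
\text{compared with } \frac{1}{5} = 20\% \text{ under an equal split.}
\]
```

Under the same assumption, each local clone receives about 1/5.3, or roughly 18.9%, of the total.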
![Page 26: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/26.jpg)
26
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
![Page 27: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/27.jpg)
27
Bandwidth sharing model
[Equation relating bandwidth_total, bandwidth_local, and bandwidth_remote through the sharing factor, including a (1 - sharing factor) term]
[Figure: diagram of the two processors (cores 0-3 and 4-7), each with Level 3 cache, Global Queue, IMC, QPI, and DRAM memory; clones (C) access DRAM memory]
![Page 28: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/28.jpg)
28
Sharing factor
• Characterizes the fairness of the Global Queue
• Dependence of sharing factor on contention?
![Page 29: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/29.jpg)
29
Contention affects sharing factor
[Figure: clones (C) accessing DRAM, one of them through QPI, with additional clones (C) acting as contenders]
![Page 30: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/30.jpg)
30
Contention affects sharing factor
[Figure: sharing factor (axis 0% to 50%) vs. additional contention (+0L, +1L, +2L, +3L)]
![Page 31: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/31.jpg)
31
Combined accesses
[Figure: bar chart of total bandwidth [GB/s] for configurations (1L,1R), (2L,1R), (3L,1R), and (4L,1R), split into remote clone and local clones 1-4]
![Page 32: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/32.jpg)
32
Contention affects sharing factor
• Sharing factor decreases with contention
• With local contention remote execution becomes more favorable
![Page 33: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/33.jpg)
33
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
![Page 34: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/34.jpg)
34
The next generation
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
[Figure: diagram of the two Westmere processors with twelve cores labeled 0-B, each processor with Level 3 cache, Global Queue, IMC, QPI, and DRAM memory]
![Page 35: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/35.jpg)
35
The next generation
[Figure: bar chart of total bandwidth [GB/s] for benchmark configurations (1L, 1R) through (6L, 1R), split into remote clone and local clones 1-6]
![Page 36: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/36.jpg)
36
Conclusions
• Optimizing for data locality can be suboptimal
• Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling
![Page 37: Memory System Performance in a NUMA Multicore Multiprocessor](https://reader034.fdocuments.net/reader034/viewer/2022051316/56815b37550346895dc90c84/html5/thumbnails/37.jpg)
37
Thank you! Questions?