Memory System Performance of High End SMPs, PCs and Clusters of PCs
![Page 1: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/1.jpg)
Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich
25th Annual International Symposium on Computer Architecture
7th Workshop on Scalable Shared Memory Multiprocessors
Memory System Performance of High End SMPs, PCs and Clusters of PCs
Ch. Kurmann, T. Stricker
Laboratory for Computer Systems, ETHZ - Swiss Institute of Technology
CH-8092 Zurich
Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/
![Page 2: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/2.jpg)
Memory Systems
Low End designs in PCs: extremely low cost, standard I/O interface.
High End designs in “Killer” Workstations: well engineered memory systems, support for additional datastreams, better I/O busses.
Are Low End SMPs the universal compute nodes for parallel and distributed systems?
![Page 3: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/3.jpg)
Contribution
The answer probably lies in memory system performance.
How significant are the differences in memory system performance?
Limitations of Low End memory systems: for local computation (e.g. in scientific applications) and for inter-node communication (e.g. in databases).
![Page 4: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/4.jpg)
Extended Copy Transfer Characterization
ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97).
Categories: access pattern, i.e. stride (spatial locality), and working set (temporal locality).
Value: transfer bandwidth (large amount of data).
The same chart results from one microbenchmark and covers local and remote transfers as well as compute and communicate accesses.
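The two categories and the measured value can be made concrete with a small sketch. It is not the authors' benchmark: the function names, the pass order over the working set, and the convention 1 MByte = 10**6 bytes are assumptions made for illustration, and Python is used only to show the index arithmetic (real measurements need tuned low-level code).

```python
def strided_word_order(working_set_words, stride):
    """Order in which a strided sweep touches the 64-bit words of a working set.

    Assumption: passes start at offsets 0, 1, ..., stride-1 and step by
    `stride`, so every word of the working set is touched exactly once.
    """
    order = []
    for offset in range(min(stride, working_set_words)):
        order.extend(range(offset, working_set_words, stride))
    return order


def bandwidth_mbyte_per_s(words_moved, seconds, word_bytes=8):
    """Transfer bandwidth in MByte/s, assuming 1 MByte = 10**6 bytes."""
    return words_moved * word_bytes / seconds / 1e6


def parameter_grid(working_sets, strides):
    """All (working set, stride) points that make up one chart."""
    return [(ws, s) for ws in working_sets for s in strides]
```

For example, `strided_word_order(8, 3)` visits the words in the order 0, 3, 6, 1, 4, 7, 2, 5, and one bandwidth value would be recorded for every point returned by `parameter_grid`.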
![Page 5: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/5.jpg)
Measurement Problems
Some parameter combinations are hard to measure, even with carefully tuned C code:
Reduced performance for large strides and small working sets in L1 caches is a measurement artifact and not architecture related.
Compilers occasionally generate suboptimal instruction schedules for loads / stores.
![Page 6: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/6.jpg)
Local Load Access: Pentium Pro PC
[Figure: load bandwidth (MByte/s, axis 0 to 600) as a function of working set (0.5 K to 16 M) and access pattern (stride between 64-bit words, 1 to 128). Pentium Pro FX, one processor, 200 MHz. Regions labeled L1, L2 and DRAM.]
![Page 7: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/7.jpg)
Local Load Access: SGI Origin
[Figure: load bandwidth (MByte/s, axis 0 to 1600) as a function of working set (0.5 K to 64 M) and access pattern (stride between 64-bit words, 1 to 128). SGI Origin 10000, one processor, 195 MHz. Regions labeled L1 and L2.]
![Page 8: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/8.jpg)
Local Load Access: DEC 8400
[Figure: load bandwidth (MByte/s, axis 0 to 1200) as a function of working set (0.5 K to 64 M) and access pattern (stride between 64-bit words, 1 to 128). DEC Alpha 8400, one processor, 300 MHz. Regions labeled L1, L2 and L3.]
![Page 9: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/9.jpg)
Local Load Access: Sun Enterprise
[Figure: load bandwidth (MByte/s, axis 0 to 700) as a function of working set (0.5 K to 16 M) and access pattern (stride between 64-bit words, 1 to 128). Sun Ultra Enterprise, one Ultra SPARC II, 248 MHz. Regions labeled L1, L2 and DRAM.]
![Page 10: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/10.jpg)
Local Load Access: SGI Cray T3E
[Figure: load bandwidth (MByte/s, axis 0 to 1200) as a function of working set (0.5 K to 16 M) and access pattern (stride between 64-bit words, 1 to 128). Cray T3E, one processor, 300 MHz. Regions labeled L1, L2 and DRAM.]
![Page 11: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/11.jpg)
Comparison - Local Access
[Figure: memory load bandwidth (MByte/s, axis 0 to 300, with one off-scale value annotated as 450) versus access pattern (stride between 64-bit words, 1 to 192) for Pentium Pro, SGI Origin, DEC 8400, Sun Enterprise and Cray T3E.]
![Page 12: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/12.jpg)
Performance in an SMP setting
Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors.
Topics of interest:
small working sets in caches: performance remains the same
large working sets in memory: interesting differences
behavior for even/uneven strides
“Gather copy stream” (strided load / contiguous store)
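The "gather copy stream" access pattern (strided load / contiguous store) can be sketched as follows. The function name and signature are hypothetical and the code only illustrates the index pattern; it is not the tuned benchmark kernel used for the measurements.

```python
def gather_copy(src, stride, count):
    """Gather copy: strided load, contiguous store.

    Reads every `stride`-th word of `src` and writes the values to
    consecutive positions of a new destination buffer.
    """
    dst = [0] * count
    for i in range(count):
        dst[i] = src[i * stride]  # load address advances by `stride`, store by 1
    return dst
```

With `stride = 1` this degenerates to a plain contiguous copy. For larger strides, consecutive loads touch words `stride` positions apart while the stores remain contiguous.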
![Page 13: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/13.jpg)
Local Copy: Pentium Pro SMP
[Figure: memory copy bandwidth (MByte/s, axis 0 to 50) versus access pattern (stride between 64-bit words, 1 to 192) for one processor and two processors.]
![Page 14: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/14.jpg)
Local Copy: SGI Origin CC-NUMA
[Figure: memory copy bandwidth (MByte/s, axis 0 to 140) versus access pattern (stride between 64-bit words, 1 to 192) for 1, 2 and 4 processors.]
![Page 15: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/15.jpg)
Local Copy: DEC 8400 SMP
[Figure: memory copy bandwidth (MByte/s, axis 0 to 60) versus access pattern (stride between 64-bit words, 1 to 64) for 1 processor and 4 processors.]
![Page 16: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/16.jpg)
Local Copy: Sun Enterprise SMP
[Figure: memory copy bandwidth (MByte/s, axis 0 to 70) versus access pattern (stride between 64-bit words, 1 to 192) for 1, 2, 4 and 8 processors.]
![Page 17: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/17.jpg)
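Remote in Parallel Computers
Parallel & Network Computers: SGI Cray T3E, SGI Origin, Clusters of PCs (CoPs)
Symmetric Multiprocessors: DEC 8400, Sun Enterprise, Pentium Pro SMPs
[Diagram, with legend P = Processor, C = Caches, M = Memory: in parallel and network computers each node has its own P, C and M and the nodes are connected by a network; in symmetric multiprocessors several P/C pairs share the memory modules M over a bus/network.]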
![Page 18: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/18.jpg)
Remote Transfers: CoPs Pentium Pro with SCI / Myrinet
[Figure: remote copy bandwidth (MByte/s, axis 0 to 80, with one off-scale value annotated as 128) versus access pattern (stride between 64-bit words, 1 to 64) for local copy, remote copy by Myrinet and remote copy by SCI.]
![Page 19: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/19.jpg)
Remote Transfers: SGI Origin
[Figure: remote copy bandwidth (MByte/s, axis 0 to 120) versus access pattern (stride between 64-bit words, 1 to 64) for local copy and remote copy.]
![Page 20: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/20.jpg)
Remote Transfers: DEC 8400
[Figure: memory load bandwidth (MByte/s, axis 0 to 160) versus access pattern (stride between 64-bit words, 1 to 64) for local loads and remote loads.]
![Page 21: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/21.jpg)
Remote Transfers: SGI Cray T3E
[Figure: memory load bandwidth (MByte/s, axis 0 to 200) versus access pattern (stride between 64-bit words, 1 to 64) for local loads and remote loads.]
![Page 22: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/22.jpg)
Comparison - Remote Transfers
[Figure: memory load bandwidth (MByte/s, axis 0 to 200, with one off-scale value annotated as 350) versus access pattern (stride between 64-bit words, 1 to 64) for PPro-Myrinet, PPro-SCI, SGI Origin, DEC 8400 and Cray T3E.]
![Page 23: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/23.jpg)
Improvement of PC Chipsets
Intel 440 BX AGP Chip Set: 400 MHz / 100 MHz
Intel 440 LX AGP Chip Set: 233 MHz / 66 MHz
Intel 440 FX Natoma Chip Set: 200 MHz / 66 MHz
[Figure: memory copy bandwidth (MByte/s, axis 0 to 100) versus access pattern (stride between 64-bit words, 1 to 192) for the 440 FX, 440 LX and 440 BX chip sets.]
![Page 24: Memory System Performance of High End SMPs, PCs and Clusters of PCs](https://reader035.fdocuments.net/reader035/viewer/2022062314/56813480550346895d9b5f23/html5/thumbnails/24.jpg)
Conclusion
ECT characterizations for different memory systems: T3E (MPP node), Origin (NUMA), DEC 8400 (SMP), CoPs (Intel P6 SMPs and clusters).
High End SMP vs. Low End SMP: less than half the performance on two-processor PCs.
Fast communication puts high demands on the memory system:
Unlike in traditional SMPs and CC-NUMAs, fine grained remote accesses do not perform at all in PC-SMPs and CoPs.
Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.