Post on 04-Jul-2020
�
Design and Evaluation of the Hamal Parallel Computer
J.P. Grossman
Project AriesSupervisor: Tom Knight
Dec. 3, 2002
December 3, 2002 The Hamal Parallel Computer 2
�Motivation
� Million node general-purpose machine
� Scalable memory system
� Support for massive multithreading
� Discarding Network
� Billion transistor era
� Embedded DRAM
December 3, 2002 The Hamal Parallel Computer 3
�Talk Outline
� Overview of Hamal
� The Hamal memory system
� Thread management and synchronization
� Fault-tolerant messaging protocol
December 3, 2002 The Hamal Parallel Computer 4
�Hamal - Overview
� Distributed shared memory machine
� Multiple processor-memory nodes per die
� Fat tree interconnect
� Split-phase memory operations
� Memory consistency implemented in software
� No data caches
� Complete system cycle-accurate simulator
December 3, 2002 The Hamal Parallel Computer 5
�Hamal - Overview
Data Data Data Data
Controller/Arbiter
CodeNet Processor
December 3, 2002 The Hamal Parallel Computer 6
�The Hamal Processor
� 128-bit VLIW multithreaded (8 contexts)
� No register renaming, branch prediction, speculative execution, etc.
� Event-driven microkernel runs in context 0
Handle event
Pollqueue
Event Queue
Memory Events
Processor Events
Network Events
Context 0
December 3, 2002 The Hamal Parallel Computer 7
�Talk Outline
�Overview of Hamal
� The Hamal memory system
� Threads and synchronization
� Fault-tolerant messaging protocol
December 3, 2002 The Hamal Parallel Computer 8
�Capabilities
� 128 bit pointers with embedded hardware-enforced permissions and bounds� 64 address bits, 64 capability bits
� Single virtual address space� Reduces state associated with a process
� Easy sharing of data
� Intra-process protection
� Object-based protection
� Simple lazy page allocation
December 3, 2002 The Hamal Parallel Computer 9
�A Haiku
Capabilities!
It is no longer a sin
to program in C
December 3, 2002 The Hamal Parallel Computer 10
�Virtual Memory
virtual address physical address
node page offset
virtual address
physical page virtual page physical page
node page offset
virtual address
TLB Hardware Page Tables
December 3, 2002 The Hamal Parallel Computer 11
�Distributed Objects
� Hamal implements Sparsely Faceted Arrays[Brown02]
� Conceptually allocate same segment on all nodes, but actual facets are lazily allocated
� Network interface translates between global segment IDs and local facets
December 3, 2002 The Hamal Parallel Computer 12
�Talk Outline
�Overview of Hamal
� The Hamal memory system
� Threads and synchronization
� Fault-tolerant messaging protocol
December 3, 2002 The Hamal Parallel Computer 13
�Motivation
� Run time for a problem of size m on N nodes:
� Optimal run time for fixed m:
)log(21
NCN
mCt +=
+=⇒=
2
1
212log1
C
mCCt
N
mCC
December 3, 2002 The Hamal Parallel Computer 14
�Thread Creation
� fork instruction specifies:
� Starting address
� Destination node
� Set of registers to copy to child
� Each node contains a hardware fork queue
� Queue is serviced by microkernel
fork
December 3, 2002 The Hamal Parallel Computer 15
�Register Dribbling
� Each thread has a swap page in memory
� Threads are loaded/unloaded in the background on unused memory cycles (Register dribbling - [Soundararajan92])
� Reduces overhead of thread swapping
� Least recently issued (LRI) context constantly dribbles
� Reduces latency of thread swapping
December 3, 2002 The Hamal Parallel Computer 16
�Thread Suspension
� When should a blocked thread be suspended?
� Two part strategy:
1. Wait until
a) No context can issue
b) The LRI context is clean
2. Generate a stall event and allow the microkernel to decide if the thread should be suspended
December 3, 2002 The Hamal Parallel Computer 17
�Register-Based Synchronization
� join capabilities allow one thread to write directly to another thread’s registers
� Three instructions: jcap, busy, and join
Parent Thread
r0 = jcap r1r1 = busyfork _child, {r0}r1 = and r1, r1
Child Thread
_child:
<computation>join r0, 0
December 3, 2002 The Hamal Parallel Computer 18
�Example - Barrier
doacross
December 3, 2002 The Hamal Parallel Computer 19
�Barrier Times
barrier time
12
58
105
161
216
273
333
393
456
523
0
100
200
300
400
500
600
1 2 4 8 16 32 64 128 256 512
# processors
tim
e (
cycle
s)
December 3, 2002 The Hamal Parallel Computer 20
�Benchmark - ppadd
ppadd
11
12
13
14
15
16
17
18
19
20
1 2 4 8 16 32 64 128 256 512
# processors
log(tim
e)
1024
2048
4096
8192
16384
32768
65536
December 3, 2002 The Hamal Parallel Computer 21
�UV Trap Bits
� Each memory word is tagged with two user trap bits (U, V)
� Each memory operation may optionally:
� Trap on U high, U low, V high, V low
� Modify U, V if successful
� Traps generate events which are handled by the microkernel on the node containing the memory word
December 3, 2002 The Hamal Parallel Computer 22
�Example – Word Locking
� Aqcuire:
� load, trap on U high or V high, set U
� Release:
� store, trap on V high, clear U
busy11
trap10
unavailable01
available00
MeaningVU
December 3, 2002 The Hamal Parallel Computer 23
�Example – Word Locking
B
B
C
B
CD
A
A
A
A
B
CD
B
CD
B
CD
= word = lock = join capability= thread
December 3, 2002 The Hamal Parallel Computer 24
�Benchmark – wordcount
� Frequency count of words in [Brown02]
� Distributed hash table used to store counts
� remote version: access hash table remotely
� local version: create a thread on target node 0
0.5
1
1.5
2
2.5
3
3.5
ex
ecu
tio
n t
ime
remote local
spin
UV
December 3, 2002 The Hamal Parallel Computer 25
�Talk Outline
�Overview of Hamal
� The Hamal memory system
� Threads and synchronization
� Fault-tolerant messaging protocol
December 3, 2002 The Hamal Parallel Computer 26
�Motivation
Discarding vs. Non-Discarding
Internet Examples Most || Computers
Performance �
� Simplicity �
� Reliability �
December 3, 2002 The Hamal Parallel Computer 27
�Fault Tolerant Messaging
� Idempotence?
� Sequence numbers (e.g. TCP)
� 220 nodes, 32 bits ⇒ 8MB/node
MSG MSG
MSG ACK
December 3, 2002 The Hamal Parallel Computer 28
�Idempotent Messaging Protocol
MSG MSG
� CONF: No more messages will be sent
� [Brown01]
MSG ACK MSG
CONF MSG
December 3, 2002 The Hamal Parallel Computer 29
�Out of Order Packets
MSG 7
sender receiver
MSG 7MSG 7
ACK 7
MSG 7
MSG 7
CONF 7
MSG 7MSG 7 CONF 7
December 3, 2002 The Hamal Parallel Computer 30
�Message Identification
� Sender generates message ID
� All packets contain source node and msg. ID
ACKSender ReceiverMSG
CONF
� ID identifies MSG
� source node gives destination for CONF
� (source node, ID) identifies MSG
December 3, 2002 The Hamal Parallel Computer 31
�Can We Reduce Overhead?
� ACK/CONF packets:
� Two ideas to reduce size:
1. Use short (4-8 bit) MSG ID
2. Receiver assigns short secondary ID
type dest message IDsource
ID2type dest message IDsource
type dest ID2
ACK:
CONF:
December 3, 2002 The Hamal Parallel Computer 32
�Failure of Short IDs
MSG 7
sender receiver
MSG 7
CONF 7
ACK 7
ACK 7 MSG 7
MSG 7
MSG 7
ACK 7
MSG 7
CONF 7 MSG 7
December 3, 2002 The Hamal Parallel Computer 33
�Secondary IDs
MSG 5
sender receiver
MSG 5
MSG 9
MSG 9
CONF 2
MSG 5,2MSG 5 ACK 5,2
CONF 2
MSG 5,2
ACK 5,2
MSG 9 MSG 9,2
ACK 9,2
MSG 9 MSG 9
December 3, 2002 The Hamal Parallel Computer 34
�How Many Bits Is Enough?
� Can’t reuse an ID if it’s still in the system
� But receivers can remember an ID for arbitrarily long periods of time
� Solution:
� use 48 bit IDs
� flush the network every 4-12 months when a node runs out of IDs
December 3, 2002 The Hamal Parallel Computer 35
�Experimental Results
� Trace driven simulation of 4 microbenchmarkson 4 topologies
� Linear backoff for packet retransmission
� Small send tables (8 entries)
� Larger receive tables (64 entries)
� Buffer ~4 flits at each network node
December 3, 2002 The Hamal Parallel Computer 36
�Summary
� Scalable memory system
� Capabilities → single shared virtual address space
� Hardware page tables, sparsely faceted arrays
� Low overhead for parallel programs
� Efficient thread management
� Register-based synchronization
� UV Trap bits
� Discarding network
� Lightweight fault-tolerant messaging protocol
December 3, 2002 The Hamal Parallel Computer 37
�Conclusion
Yesterday – ENIAC Today – P4
Soon – 1M nodes Tomorrow – ???
December 3, 2002 The Hamal Parallel Computer 38
�Comparison with Non-Discarding
0
200
400
600
800
1000
2D Grid 3D Grid Fat Tree Butterfly0
2000
4000
6000
8000
10000
12000
14000
2D Grid 3D Grid Fat Tree Butterfly
0
50000
100000
150000
200000
250000
2D Grid 3D Grid Fat Tree Butterfly
0
20000
40000
60000
80000
100000
2D Grid 3D Grid Fat Tree Butterfly
add reverse
nbodyquicksort
Non-Discarding Discarding + fault-tolerant messaging
December 3, 2002 The Hamal Parallel Computer 39
�Hamal Benchmarks
� ppadd – Parallel-prefix addition
� quicksort – Parallel quicksort
� nbody – exact N-body simulation, 256 bodies
� Processors conceptually organized in square array
� Communication is in rows and columns only
� wordcount – frequency count of words in [Brown02]
� Distributed hash table used to maintain counts
December 3, 2002 The Hamal Parallel Computer 40
�ppadd – Model vs. Simulations
ppadd
10
11
12
13
14
15
16
17
18
19
20
1 2 4 8 16 32 64 128 256 512
# processors
log(tim
e)
December 3, 2002 The Hamal Parallel Computer 41
�quicksort – Execution Time
quicksort
17
18
19
20
21
22
23
24
25
26
27
1 2 4 8 16 32 64 128 256 512
# processors
log(tim
e)
4096
8192
16384
32768
65536
131072
262144
December 3, 2002 The Hamal Parallel Computer 42
�quicksort - Speedup
quicksort
0
1
2
3
4
5
6
7
1 2 4 8 16 32 64 128 256 512
# processors
log(speedup)
4096
8192
16384
32768
65536
131072
262144
December 3, 2002 The Hamal Parallel Computer 43
�nbody – Speedup
nbody
0
2
4
6
8
1 4 16 64 256
# processors
log(speedup)
December 3, 2002 The Hamal Parallel Computer 44
�Register Dribbling
ppadd8
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
4 5 6 7 8 9 10 11 12 13 14 15 16
# contexts
tim
e (
cycle
s)
dribble on suspend
extended dribbling
quicksort
2800000
2850000
2900000
2950000
3000000
3050000
3100000
3150000
4 5 6 7 8 9 10 11 12 13 14 15 16
# contexts
tim
e (
cycle
s)
dribble on suspend
extended dribbling
nbody
2200000
2300000
2400000
2500000
2600000
2700000
2800000
2900000
4 5 6 7 8 9 10 11 12 13 14 15 16
# contexts
tim
e (
cycle
s)
dribble on suspend
extended dribbling
wordcount
0
200000
400000
600000
800000
1000000
1200000
1400000
4 5 6 7 8 9 10 11 12 13 14 15 16
# contexts
tim
e (
cycle
s)
dribble on suspend
extended dribbling
December 3, 2002 The Hamal Parallel Computer 45
�Network Benchmarks
� add – parallel prefix addition on 4096 nodes
� reverse – reverse the data of a 16K entry vector distributed across 1024 nodes
� quicksort – parallel quicksort of a 32K entry vector on 1024 nodes
� nbody – full N-body simulation on 256 nodes with one body per node
December 3, 2002 The Hamal Parallel Computer 46
�Network Topologies
� 2D Grid
� Dimension-ordered routing prefered
� 3D Grid
� Dimension-ordered routing prefered
� Fat tree
� radix-4 (down) dilation-2 (up), randomized
� Multibutterfly
� radix-2 dilation-2, randomized
December 3, 2002 The Hamal Parallel Computer 47
�Network Retransmission
L301.028L301.093overall
L321.015L321.041multibutterfly
L301.036L311.085fat tree
L301.018L321.0393D grid
L281.026L321.0332D grid
L311.045L311.085nbody
Q121.015Q151.020quicksort
L321.016L301.028reverse
L51.003L51.009add
average case worst case
slowdown over optimal
December 3, 2002 The Hamal Parallel Computer 48
�Network Send Table Size
add
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1 2 4 8 16 32 64 128 256
reverse
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
1 2 4 8 16 32 64 128 256
quicksort
0
50000
100000
150000
200000
250000
300000
1 2 4 8 16 32 64 128 256
nbody
0
20000
40000
60000
80000
100000
120000
140000
1 2 4 8 16 32 64 128 256
December 3, 2002 The Hamal Parallel Computer 49
�Network Buffering
add
0
200
400
600
800
1000
1200
1400
1600
1 2 3 4 5 6 7 8
reverse
0
5000
10000
15000
20000
25000
30000
35000
1 2 3 4 5 6 7 8
quicksort
0
50000
100000
150000
200000
250000
300000
350000
1 2 3 4 5 6 7 8
nbody
0
20000
40000
60000
80000
100000
120000
140000
160000
1 2 3 4 5 6 7 8
December 3, 2002 The Hamal Parallel Computer 50
�Network Receive Table Size
add
0
500
1000
1500
2000
2500
3000
3500
16 32 64 128
reverse
0
2000
4000
6000
8000
10000
12000
14000
16000
16 32 64 128
quicksort
0
50000
100000
150000
200000
250000
300000
16 32 64 128
nbody
0
20000
40000
60000
80000
100000
120000
140000
16 32 64 128