Hamal Parallel Computer Design and Evaluation of theDecember 3, 2002 The Hamal Parallel Computer 45...

Post on 04-Jul-2020

2 views 0 download

Transcript of Hamal Parallel Computer Design and Evaluation of theDecember 3, 2002 The Hamal Parallel Computer 45...

Design and Evaluation of the Hamal Parallel Computer

J.P. Grossman

Project AriesSupervisor: Tom Knight

Dec. 3, 2002

December 3, 2002 The Hamal Parallel Computer 2

�Motivation

� Million node general-purpose machine

� Scalable memory system

� Support for massive multithreading

� Discarding Network

� Billion transistor era

� Embedded DRAM

December 3, 2002 The Hamal Parallel Computer 3

�Talk Outline

� Overview of Hamal

� The Hamal memory system

� Thread management and synchronization

� Fault-tolerant messaging protocol

December 3, 2002 The Hamal Parallel Computer 4

�Hamal - Overview

� Distributed shared memory machine

� Multiple processor-memory nodes per die

� Fat tree interconnect

� Split-phase memory operations

� Memory consistency implemented in software

� No data caches

� Complete system cycle-accurate simulator

December 3, 2002 The Hamal Parallel Computer 5

�Hamal - Overview

Data Data Data Data

Controller/Arbiter

CodeNet Processor

December 3, 2002 The Hamal Parallel Computer 6

�The Hamal Processor

� 128-bit VLIW multithreaded (8 contexts)

� No register renaming, branch prediction, speculative execution, etc.

� Event-driven microkernel runs in context 0

Handle event

Pollqueue

Event Queue

Memory Events

Processor Events

Network Events

Context 0

December 3, 2002 The Hamal Parallel Computer 7

�Talk Outline

�Overview of Hamal

� The Hamal memory system

� Threads and synchronization

� Fault-tolerant messaging protocol

December 3, 2002 The Hamal Parallel Computer 8

�Capabilities

� 128 bit pointers with embedded hardware-enforced permissions and bounds� 64 address bits, 64 capability bits

� Single virtual address space� Reduces state associated with a process

� Easy sharing of data

� Intra-process protection

� Object-based protection

� Simple lazy page allocation

December 3, 2002 The Hamal Parallel Computer 9

�A Haiku

Capabilities!

It is no longer a sin

to program in C

December 3, 2002 The Hamal Parallel Computer 10

�Virtual Memory

virtual address physical address

node page offset

virtual address

physical page virtual page physical page

node page offset

virtual address

TLB Hardware Page Tables

December 3, 2002 The Hamal Parallel Computer 11

�Distributed Objects

� Hamal implements Sparsely Faceted Arrays[Brown02]

� Conceptually allocate same segment on all nodes, but actual facets are lazily allocated

� Network interface translates between global segment IDs and local facets

December 3, 2002 The Hamal Parallel Computer 12

�Talk Outline

�Overview of Hamal

� The Hamal memory system

� Threads and synchronization

� Fault-tolerant messaging protocol

December 3, 2002 The Hamal Parallel Computer 13

�Motivation

� Run time for a problem of size m on N nodes:

� Optimal run time for fixed m:

)log(21

NCN

mCt +=

+=⇒=

2

1

212log1

C

mCCt

N

mCC

December 3, 2002 The Hamal Parallel Computer 14

�Thread Creation

� fork instruction specifies:

� Starting address

� Destination node

� Set of registers to copy to child

� Each node contains a hardware fork queue

� Queue is serviced by microkernel

fork

December 3, 2002 The Hamal Parallel Computer 15

�Register Dribbling

� Each thread has a swap page in memory

� Threads are loaded/unloaded in the background on unused memory cycles (Register dribbling - [Soundararajan92])

� Reduces overhead of thread swapping

� Least recently issued (LRI) context constantly dribbles

� Reduces latency of thread swapping

December 3, 2002 The Hamal Parallel Computer 16

�Thread Suspension

� When should a blocked thread be suspended?

� Two part strategy:

1. Wait until

a) No context can issue

b) The LRI context is clean

2. Generate a stall event and allow the microkernel to decide if the thread should be suspended

December 3, 2002 The Hamal Parallel Computer 17

�Register-Based Synchronization

� join capabilities allow one thread to write directly to another thread’s registers

� Three instructions: jcap, busy, and join

Parent Thread

r0 = jcap r1r1 = busyfork _child, {r0}r1 = and r1, r1

Child Thread

_child:

<computation>join r0, 0

December 3, 2002 The Hamal Parallel Computer 18

�Example - Barrier

doacross

December 3, 2002 The Hamal Parallel Computer 19

�Barrier Times

barrier time

12

58

105

161

216

273

333

393

456

523

0

100

200

300

400

500

600

1 2 4 8 16 32 64 128 256 512

# processors

tim

e (

cycle

s)

December 3, 2002 The Hamal Parallel Computer 20

�Benchmark - ppadd

ppadd

11

12

13

14

15

16

17

18

19

20

1 2 4 8 16 32 64 128 256 512

# processors

log(tim

e)

1024

2048

4096

8192

16384

32768

65536

December 3, 2002 The Hamal Parallel Computer 21

�UV Trap Bits

� Each memory word is tagged with two user trap bits (U, V)

� Each memory operation may optionally:

� Trap on U high, U low, V high, V low

� Modify U, V if successful

� Traps generate events which are handled by the microkernel on the node containing the memory word

December 3, 2002 The Hamal Parallel Computer 22

�Example – Word Locking

� Aqcuire:

� load, trap on U high or V high, set U

� Release:

� store, trap on V high, clear U

busy11

trap10

unavailable01

available00

MeaningVU

December 3, 2002 The Hamal Parallel Computer 23

�Example – Word Locking

B

B

C

B

CD

A

A

A

A

B

CD

B

CD

B

CD

= word = lock = join capability= thread

December 3, 2002 The Hamal Parallel Computer 24

�Benchmark – wordcount

� Frequency count of words in [Brown02]

� Distributed hash table used to store counts

� remote version: access hash table remotely

� local version: create a thread on target node 0

0.5

1

1.5

2

2.5

3

3.5

ex

ecu

tio

n t

ime

remote local

spin

UV

December 3, 2002 The Hamal Parallel Computer 25

�Talk Outline

�Overview of Hamal

� The Hamal memory system

� Threads and synchronization

� Fault-tolerant messaging protocol

December 3, 2002 The Hamal Parallel Computer 26

�Motivation

Discarding vs. Non-Discarding

Internet Examples Most || Computers

Performance �

� Simplicity �

� Reliability �

December 3, 2002 The Hamal Parallel Computer 27

�Fault Tolerant Messaging

� Idempotence?

� Sequence numbers (e.g. TCP)

� 220 nodes, 32 bits ⇒ 8MB/node

MSG MSG

MSG ACK

December 3, 2002 The Hamal Parallel Computer 28

�Idempotent Messaging Protocol

MSG MSG

� CONF: No more messages will be sent

� [Brown01]

MSG ACK MSG

CONF MSG

December 3, 2002 The Hamal Parallel Computer 29

�Out of Order Packets

MSG 7

sender receiver

MSG 7MSG 7

ACK 7

MSG 7

MSG 7

CONF 7

MSG 7MSG 7 CONF 7

December 3, 2002 The Hamal Parallel Computer 30

�Message Identification

� Sender generates message ID

� All packets contain source node and msg. ID

ACKSender ReceiverMSG

CONF

� ID identifies MSG

� source node gives destination for CONF

� (source node, ID) identifies MSG

December 3, 2002 The Hamal Parallel Computer 31

�Can We Reduce Overhead?

� ACK/CONF packets:

� Two ideas to reduce size:

1. Use short (4-8 bit) MSG ID

2. Receiver assigns short secondary ID

type dest message IDsource

ID2type dest message IDsource

type dest ID2

ACK:

CONF:

December 3, 2002 The Hamal Parallel Computer 32

�Failure of Short IDs

MSG 7

sender receiver

MSG 7

CONF 7

ACK 7

ACK 7 MSG 7

MSG 7

MSG 7

ACK 7

MSG 7

CONF 7 MSG 7

December 3, 2002 The Hamal Parallel Computer 33

�Secondary IDs

MSG 5

sender receiver

MSG 5

MSG 9

MSG 9

CONF 2

MSG 5,2MSG 5 ACK 5,2

CONF 2

MSG 5,2

ACK 5,2

MSG 9 MSG 9,2

ACK 9,2

MSG 9 MSG 9

December 3, 2002 The Hamal Parallel Computer 34

�How Many Bits Is Enough?

� Can’t reuse an ID if it’s still in the system

� But receivers can remember an ID for arbitrarily long periods of time

� Solution:

� use 48 bit IDs

� flush the network every 4-12 months when a node runs out of IDs

December 3, 2002 The Hamal Parallel Computer 35

�Experimental Results

� Trace driven simulation of 4 microbenchmarkson 4 topologies

� Linear backoff for packet retransmission

� Small send tables (8 entries)

� Larger receive tables (64 entries)

� Buffer ~4 flits at each network node

December 3, 2002 The Hamal Parallel Computer 36

�Summary

� Scalable memory system

� Capabilities → single shared virtual address space

� Hardware page tables, sparsely faceted arrays

� Low overhead for parallel programs

� Efficient thread management

� Register-based synchronization

� UV Trap bits

� Discarding network

� Lightweight fault-tolerant messaging protocol

December 3, 2002 The Hamal Parallel Computer 37

�Conclusion

Yesterday – ENIAC Today – P4

Soon – 1M nodes Tomorrow – ???

December 3, 2002 The Hamal Parallel Computer 38

�Comparison with Non-Discarding

0

200

400

600

800

1000

2D Grid 3D Grid Fat Tree Butterfly0

2000

4000

6000

8000

10000

12000

14000

2D Grid 3D Grid Fat Tree Butterfly

0

50000

100000

150000

200000

250000

2D Grid 3D Grid Fat Tree Butterfly

0

20000

40000

60000

80000

100000

2D Grid 3D Grid Fat Tree Butterfly

add reverse

nbodyquicksort

Non-Discarding Discarding + fault-tolerant messaging

December 3, 2002 The Hamal Parallel Computer 39

�Hamal Benchmarks

� ppadd – Parallel-prefix addition

� quicksort – Parallel quicksort

� nbody – exact N-body simulation, 256 bodies

� Processors conceptually organized in square array

� Communication is in rows and columns only

� wordcount – frequency count of words in [Brown02]

� Distributed hash table used to maintain counts

December 3, 2002 The Hamal Parallel Computer 40

�ppadd – Model vs. Simulations

ppadd

10

11

12

13

14

15

16

17

18

19

20

1 2 4 8 16 32 64 128 256 512

# processors

log(tim

e)

December 3, 2002 The Hamal Parallel Computer 41

�quicksort – Execution Time

quicksort

17

18

19

20

21

22

23

24

25

26

27

1 2 4 8 16 32 64 128 256 512

# processors

log(tim

e)

4096

8192

16384

32768

65536

131072

262144

December 3, 2002 The Hamal Parallel Computer 42

�quicksort - Speedup

quicksort

0

1

2

3

4

5

6

7

1 2 4 8 16 32 64 128 256 512

# processors

log(speedup)

4096

8192

16384

32768

65536

131072

262144

December 3, 2002 The Hamal Parallel Computer 43

�nbody – Speedup

nbody

0

2

4

6

8

1 4 16 64 256

# processors

log(speedup)

December 3, 2002 The Hamal Parallel Computer 44

�Register Dribbling

ppadd8

0

100000

200000

300000

400000

500000

600000

700000

800000

900000

1000000

4 5 6 7 8 9 10 11 12 13 14 15 16

# contexts

tim

e (

cycle

s)

dribble on suspend

extended dribbling

quicksort

2800000

2850000

2900000

2950000

3000000

3050000

3100000

3150000

4 5 6 7 8 9 10 11 12 13 14 15 16

# contexts

tim

e (

cycle

s)

dribble on suspend

extended dribbling

nbody

2200000

2300000

2400000

2500000

2600000

2700000

2800000

2900000

4 5 6 7 8 9 10 11 12 13 14 15 16

# contexts

tim

e (

cycle

s)

dribble on suspend

extended dribbling

wordcount

0

200000

400000

600000

800000

1000000

1200000

1400000

4 5 6 7 8 9 10 11 12 13 14 15 16

# contexts

tim

e (

cycle

s)

dribble on suspend

extended dribbling

December 3, 2002 The Hamal Parallel Computer 45

�Network Benchmarks

� add – parallel prefix addition on 4096 nodes

� reverse – reverse the data of a 16K entry vector distributed across 1024 nodes

� quicksort – parallel quicksort of a 32K entry vector on 1024 nodes

� nbody – full N-body simulation on 256 nodes with one body per node

December 3, 2002 The Hamal Parallel Computer 46

�Network Topologies

� 2D Grid

� Dimension-ordered routing prefered

� 3D Grid

� Dimension-ordered routing prefered

� Fat tree

� radix-4 (down) dilation-2 (up), randomized

� Multibutterfly

� radix-2 dilation-2, randomized

December 3, 2002 The Hamal Parallel Computer 47

�Network Retransmission

L301.028L301.093overall

L321.015L321.041multibutterfly

L301.036L311.085fat tree

L301.018L321.0393D grid

L281.026L321.0332D grid

L311.045L311.085nbody

Q121.015Q151.020quicksort

L321.016L301.028reverse

L51.003L51.009add

average case worst case

slowdown over optimal

December 3, 2002 The Hamal Parallel Computer 48

�Network Send Table Size

add

0

200

400

600

800

1000

1200

1400

1600

1800

2000

1 2 4 8 16 32 64 128 256

reverse

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

1 2 4 8 16 32 64 128 256

quicksort

0

50000

100000

150000

200000

250000

300000

1 2 4 8 16 32 64 128 256

nbody

0

20000

40000

60000

80000

100000

120000

140000

1 2 4 8 16 32 64 128 256

December 3, 2002 The Hamal Parallel Computer 49

�Network Buffering

add

0

200

400

600

800

1000

1200

1400

1600

1 2 3 4 5 6 7 8

reverse

0

5000

10000

15000

20000

25000

30000

35000

1 2 3 4 5 6 7 8

quicksort

0

50000

100000

150000

200000

250000

300000

350000

1 2 3 4 5 6 7 8

nbody

0

20000

40000

60000

80000

100000

120000

140000

160000

1 2 3 4 5 6 7 8

December 3, 2002 The Hamal Parallel Computer 50

�Network Receive Table Size

add

0

500

1000

1500

2000

2500

3000

3500

16 32 64 128

reverse

0

2000

4000

6000

8000

10000

12000

14000

16000

16 32 64 128

quicksort

0

50000

100000

150000

200000

250000

300000

16 32 64 128

nbody

0

20000

40000

60000

80000

100000

120000

140000

16 32 64 128