Computer Architecture
Dataflow Machines
Data Flow
• Conventional programming models are control driven
  • Instruction sequence is precisely specified
  • Sequence specifies control
    • which instruction the CPU will execute next
• Execution rule:
  • Execute an instruction when its predecessor has completed
      s1: r = a*b;
      s2: s = c*d;
      s3: y = r + s;
  • s2 executes when s1 is complete
  • s3 executes when s2 is complete
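Data Flow
• Consider the calculation
  • y = a*b + c*d
• Represent it by a graph
  • Nodes represent computations
  • Data flows along arcs
• Execution rule:
  • Execute an instruction when its data is available
  • Data driven rule
[Figure: dataflow graph for y = a*b + c*d. Inputs a, b feed one multiply node and c, d feed another; both products feed an add node that produces y.]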
Data Flow
• Dataflow firing rule
  • An instruction fires (executes) when its data is available
  • Exposes all possible parallelism
    • Either multiplication can fire as soon as data arrives
    • Addition must wait
• Data dependence analysis!
  • Instruction issue units:
    • Fire (issue) each instruction when its operands (registers) have been written
[Figure: dataflow graph for y = a*b + c*d: two multiply nodes feeding an add node.]
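The firing rule can be illustrated with a minimal sketch in Python. It assumes a hypothetical graph encoding (node name mapped to an operation and two input names) and a made-up helper `run_dataflow`; it is not taken from any of the machines discussed here. Nodes that fire in the same "wave" have no dependence on each other, so they could run in parallel.

```python
import operator

# Hypothetical encoding of the graph y = a*b + c*d:
# node name -> (operation, left input name, right input name).
GRAPH = {
    "r": (operator.mul, "a", "b"),
    "s": (operator.mul, "c", "d"),
    "y": (operator.add, "r", "s"),
}

def run_dataflow(graph, inputs):
    """Fire every node whose operands are available; return values and firing waves."""
    values = dict(inputs)   # data that has arrived so far
    pending = dict(graph)   # nodes that have not fired yet
    waves = []              # nodes fired in the same wave are independent
    while pending:
        ready = sorted(name for name, (_, left, right) in pending.items()
                       if left in values and right in values)
        if not ready:
            raise ValueError("deadlock: some operands never arrive")
        results = {}
        for name in ready:
            op, left, right = pending.pop(name)
            results[name] = op(values[left], values[right])
        values.update(results)  # results become visible to the next wave
        waves.append(ready)
    return values, waves

values, waves = run_dataflow(GRAPH, {"a": 2, "b": 3, "c": 4, "d": 5})
print(values["y"])  # 26
print(waves)        # [['r', 's'], ['y']]
```

Both multiplications fire in the first wave; the addition must wait for the second.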
Data Flow - Realisations
• Several experimental machines built
  • Manchester: Gurd & Watson
  • Tagged Token: Arvind, MIT
  • Sigma: ETL, Tsukuba
  • EMC-4: ETL, Tsukuba
  • Monsoon: Arvind, MIT
  • EMX: ETL, Tsukuba
  • RAPID: Osaka/Sharp/Mitsubishi (asynchronous!)
  • Naiad: Tasmania
  • and some others
Data Flow - Realisations
• Manchester
Data Flow - Program
• Program word
• Matching Store Entry
  • When both Presence Flags are Y, this packet is despatched to a PE (any PE!)
[Figure: matching store entry fields: Operation (+, -, *, / etc) | Left, Right Operands | Presence Flags | Destination Address | Destination Left or Right]
Data Flow - Matching Store
• Special purpose memory
• Limited processing capability
• Detects full slots
• Despatches operation packets to any idle PE
[Figure: matching store entry fields: Operation (+, -, *, / etc) | Left, Right Operands | Presence Flags | Destination Address | Destination Left or Right]
Data Flow - Processing Elements
• Receive operation packets
• Generate result
• Form result packet
• Despatch to matching store
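The interplay between the matching store and the processing elements can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PROGRAM` layout, the use of `None` as an "output" destination, and the single-queue `run_machine` loop are all hypothetical simplifications, not the Manchester machine's actual formats. A real machine would hand each operation packet to any idle PE; here the PE step is done inline.

```python
from collections import deque
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

# Hypothetical program words: address -> (operation, destination address, destination side).
# A destination address of None marks the program output (an assumption of this sketch).
PROGRAM = {
    0: ("*", 2, "L"),      # a*b -> left operand of the add
    1: ("*", 2, "R"),      # c*d -> right operand of the add
    2: ("+", None, None),  # y leaves the machine
}

def run_machine(program, tokens):
    store = {}                   # matching store: address -> {"L": value, "R": value}
    token_queue = deque(tokens)  # packets of the form (address, side, value)
    outputs = []
    while token_queue:
        addr, side, value = token_queue.popleft()
        slot = store.setdefault(addr, {})
        slot[side] = value                   # operand present: flag becomes Y
        if "L" in slot and "R" in slot:      # both presence flags are Y
            del store[addr]                  # slot is emptied when the packet is despatched
            op, dest, dest_side = program[addr]
            result = OPS[op](slot["L"], slot["R"])  # PE generates the result
            if dest is None:
                outputs.append(result)
            else:
                token_queue.append((dest, dest_side, result))  # result packet
    return outputs

tokens = [(0, "L", 2), (1, "R", 5), (0, "R", 3), (1, "L", 4)]
print(run_machine(PROGRAM, tokens))  # [26]
```

The input tokens arrive in an arbitrary order; execution order is decided only by when both operands of an entry are present.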
Data Flow - EM4
• Architects
  • Yamaguchi, Sakai, Kodama, Sato et al
  • ElectroTechnical Laboratory, Tsukuba, Japan
• PE (EM-Y)
  • CMOS Gate Array
  • 80k gates / 1.0
  • f = 20MHz
  • ~1992
Data Flow - Monsoon
• Architects
  • Papadopoulos, Culler et al
  • MIT, Cambridge
• PE
  • f = 10MHz
  • ~1990
• I-Structure Processor
Data Flow - I-Structures
• Memory with a presence bit
  • Tag each memory location with a bit indicating its validity
  • Valid bit set -> normal read (no wait)
  • Data not yet written (valid bit not set)
    • Wait
    • Read requests queued
• Data driven execution
  • Operations proceed when data is available
[Figure: memory words, each shown as a valid bit alongside its data.]
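A minimal sketch of an I-structure memory in Python follows. The class name `IStructure`, the callback-style `read`, and the write-once check are assumptions made for illustration; the slide only specifies the presence bit and the queuing of early reads.

```python
class IStructure:
    """Memory whose slots carry a presence (valid) bit and a queue of waiting reads."""

    def __init__(self, size):
        self.valid = [False] * size
        self.data = [None] * size
        self.deferred = [[] for _ in range(size)]  # queued read requests per slot

    def read(self, index, consumer):
        """Deliver the value to `consumer` now, or queue the request until it is written."""
        if self.valid[index]:
            consumer(self.data[index])             # valid bit set: normal read
        else:
            self.deferred[index].append(consumer)  # not yet written: wait

    def write(self, index, value):
        if self.valid[index]:
            # Write-once behaviour is an assumption of this sketch.
            raise ValueError(f"slot {index} already written")
        self.data[index] = value
        self.valid[index] = True
        waiting, self.deferred[index] = self.deferred[index], []
        for consumer in waiting:                   # release the queued readers
            consumer(value)

mem = IStructure(4)
seen = []
mem.read(1, seen.append)  # not yet written: request is queued
mem.write(1, 42)          # satisfies the queued read
mem.read(1, seen.append)  # valid bit set: immediate read
print(seen)               # [42, 42]
```

The reader that arrived early is not blocked in a loop; it simply proceeds when the data becomes available, which is the data driven rule applied to memory.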
Data Flow - Monsoon Pipeline
• 8 stage pipeline
• “Presence bits” check operand availability
• Frame (coarse grain) basis
Data Flow - Summary
• Fine-Grain Dataflow
  • Suffered from comms network overload!
• Coarse-Grain Dataflow
  • Monsoon ...
  • Overtaken by commercial technology!!
• A sad “fact-of-life”
  • It’s almost impossible to generate the funds for non-“mainstream” computer architecture research
  • $n x 10^8 required
  • Non-mainstream = interesting!
Data Flow - Summary
• As a software model …
  • Functional languages
    • Dataflow in a different guise!
    • Theoretically
      • important
    • Practically?
      • Inefficient ( = slow!!)
      • ….. Ask your CS colleagues!
  • Cilk - based on C
    • Used on CIIPS Myrmidons
    • Uses a dataflow model
      • Threads become ready for execution when their data is generated
    • Message passing efficiency
      • Without explicit data transfer & synchronisation!
Networks
• Network Topology (or shape)
  • Vital to efficient parallel algorithms
  • Communication is the limiting factor!
• Ideal
  • Cross-bar
    • Any-to-any
    • Non-blocking
      • Except two sources to same receiver
    • Realisable
      • But only for limited order (number of ports)
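The one blocking case of a cross-bar, two sources addressing the same receiver, can be checked with a short Python sketch. The function name `contended_receivers` and the request format (a list of source/receiver port pairs for one cycle) are hypothetical.

```python
from collections import Counter

def contended_receivers(requests):
    """Receivers addressed by more than one source in the same cycle.

    requests is a list of (source_port, receiver_port) pairs.
    """
    counts = Counter(receiver for _, receiver in requests)
    return sorted(r for r, c in counts.items() if c > 1)

# A full permutation on an 8 x 8 cross-bar: every source has its own receiver.
print(contended_receivers([(s, (s + 3) % 8) for s in range(8)]))  # []
# Sources 0 and 5 both address receiver 2.
print(contended_receivers([(0, 2), (5, 2), (1, 7)]))              # [2]
```

Any set of requests with an empty result can be routed simultaneously, which is what "non-blocking" means here.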
Networks
• Cross-bars
  • Achilles
    • 8 x 8
    • Full duplex
      • Simultaneous Input and Output at each port
    • 32 bit data-path
    • Target: 1Gbyte / second total throughput
  but we needed the 3-D arrangement to achieve
    • bandwidth
    • high order
Networks
• Cross-bars
  • Achilles
    • Hardware almost trivial!
      • Single FPGA on each level
    • Programmable
      • VHDL Models
      • Several topologies
        • Just by changing the software!
Networks - More than 8 PEs
• Simple
  • Use 2 8x8 routers!
but ….
  • The single link between the two routers gets a lot of traffic!
Networks - Fat tree
• Problem:
  • High-traffic links between PEs can become a bottleneck
• Solution: Fat-tree
  • Links higher up the tree are “fatter”
  • Sustainable bandwidth between all PEs is the same
Networks - Performance Metrics
• Metrics for comparing network topologies
  • Diameter
    • Maximum distance between any pair of nodes
    • Determines latency
  • Bisection Bandwidth
    • Aggregate bandwidth over any “cut” which divides the network in half
    • Determines throughput
• Crossbar
  • Diameter: 1
    • Every PE is directly connected to the router, so a single “hop” suffices
  • Bisection Bandwidth: b bytes/sec
    • b is the bandwidth of a single link
Networks - Performance Metrics
• Metrics for comparing network topologies
  • To connect n PEs with m x m crossbars
  • Single link bandwidth b bytes/s
• Simple: n = 14 (2 switches)
  • Diameter 3
  • Bisection Bandwidth b
[Figure: the two-switch network, with labels 1, 2, 3 marking hops along a path.]
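The figures for the simple two-switch network can be reproduced with a breadth-first search. The sketch below is in Python and assumes that one port on each 8 x 8 switch is used for the inter-switch link (which leaves 7 + 7 = 14 ports for PEs) and that every link traversed counts as one hop; the node names and helper functions are hypothetical.

```python
from collections import deque

def build_two_switch_network(ports=8):
    """Two ports x ports crossbars joined by one link; remaining ports hold PEs."""
    links = {"S0": {"S1"}, "S1": {"S0"}}
    pes = []
    for switch in ("S0", "S1"):
        for k in range(ports - 1):  # one port per switch is taken by the inter-switch link
            pe = f"{switch}-PE{k}"
            pes.append(pe)
            links[pe] = {switch}
            links[switch].add(pe)
    return links, pes

def hops(links, source):
    """Breadth-first search: number of links from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

links, pes = build_two_switch_network()
diameter = 0
for a in pes:
    dist = hops(links, a)
    diameter = max(diameter, max(dist[b] for b in pes))
print(len(pes), diameter)  # 14 3
```

Cutting the network in half between the two switches severs exactly one link, so the bisection bandwidth is b.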
Networks - Performance Metrics
• Fat-tree
  • Diameter: 2 log_m n
    • Height is log_m n
    • Worst case distance - up and down
  • Bisection Bandwidth: b n/2 bytes/sec
    • Links are fatter higher up the tree
[Figure: fat tree of height log_m n.]
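As a worked example, the fat-tree formulas can be evaluated for concrete sizes. This Python sketch takes the slide's formulas at face value; the function names are hypothetical, and rounding the height up when n is not an exact power of m is an assumption.

```python
def tree_height(n, m):
    """Smallest h with m**h >= n, i.e. log_m n rounded up, using integer arithmetic."""
    height, capacity = 0, 1
    while capacity < n:
        capacity *= m
        height += 1
    return height

def fat_tree_metrics(n, m, b):
    """Height, diameter (up and down the tree) and bisection bandwidth in bytes/sec."""
    height = tree_height(n, m)
    return {"height": height, "diameter": 2 * height, "bisection": b * n / 2}

print(fat_tree_metrics(64, 8, 1.0))
# {'height': 2, 'diameter': 4, 'bisection': 32.0}
```

For 64 PEs built from 8 x 8 crossbars, the worst-case path climbs two levels and descends two, and the bisection bandwidth is 32 times that of a single link.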
Networks - Performance Metrics
• Mesh (n x n array of PEs)
  • Diameter: 2n-2
  • Bisection Bandwidth: b n bytes/sec
  • Order: 4
Networks - Performance Metrics
• Hypercube
  • Hypercube of order m
    • Link 2 order m-1 hypercubes with 2^(m-1) links
  • Number of PEs: n = 2^m
  • Order: log_2 n = m
[Figure: two order 2 hypercubes linked to form an order 3 hypercube.]
Networks - Hypercubes
• Embedding property
  • In an n PE hypercube, we have hypercubes of size n/2, n/4, …
  • Number PEs with binary numbers
    • 000, 001, 010, 011, 100, …
  • Joining two hypercubes
    • add one binary digit to the numbering
  • Each PE is connected to every PE whose index differs in only one bit
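The one-bit rule makes neighbour lookup a single XOR per dimension. The Python sketch below uses hypothetical function names; `hop_distance` is an addition not stated on the slide, relying on the standard fact that the shortest path between two PEs flips each differing bit once.

```python
def neighbours(pe, order):
    """PEs whose index differs from `pe` in exactly one bit."""
    return [pe ^ (1 << bit) for bit in range(order)]

def hop_distance(a, b):
    """Number of differing bits, which is the minimum number of links between a and b."""
    return bin(a ^ b).count("1")

print([format(p, "03b") for p in neighbours(0b000, 3)])  # ['001', '010', '100']
print(hop_distance(0b000, 0b111))                         # 3
```

Each PE in an order m hypercube therefore has m neighbours, and the farthest PE is m hops away.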
Networks - Hypercubes
• Embedding property
  • Partitioning tasks
    • Allocate to sub-cubes
    • Sub-tasks allocated to sub-cubes of that cube, etc
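One way to picture the partitioning is to fix the leading bits of the PE index: each extra fixed bit halves the sub-cube. The Python sketch below assumes this prefix-based allocation scheme and a hypothetical helper `subcube`; other choices of which bits to fix work equally well.

```python
def subcube(order, fixed_bits):
    """PE indices of the sub-cube whose leading bits equal `fixed_bits` (a string of 0/1)."""
    free = order - len(fixed_bits)
    prefix = int(fixed_bits, 2) << free if fixed_bits else 0
    return [prefix | low for low in range(1 << free)]

print(subcube(3, ""))    # [0, 1, 2, 3, 4, 5, 6, 7]  whole cube: the task
print(subcube(3, "1"))   # [4, 5, 6, 7]              half the cube: a sub-task
print(subcube(3, "10"))  # [4, 5]                    a sub-cube of that sub-cube
```

Because the PEs inside each sub-cube still differ from their neighbours in one of the free bits, every sub-cube is itself a hypercube of lower order.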