CA406 Computer Architecture: Networks

Data Flow - Summary
  - Fine-Grain Dataflow
      - Suffered from comms network overload!
  - Coarse-Grain Dataflow
      - Monsoon ...
      - Overtaken by commercial technology!!
  - A sad fact-of-life
      - It's almost impossible to generate the funds for non-mainstream computer architecture research
      - $n x 10^8 required :-(
      - Non-mainstream = interesting!

Data Flow - Summary
  - As a software model
      - Functional languages
          - Dataflow in a different guise!
          - Theoretically important
          - Practically? Inefficient (= slow!!) ... Ask your CS colleagues!
  - Cilk - based on C
      - Used on CIIPS Myrmidons
      - Uses a dataflow model
          - Threads become ready for execution when their data is generated
      - Message passing efficiency
          - Without explicit data transfer & synchronisation!

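
The firing rule on the slide above (a thread becomes ready when its data has been generated) can be sketched in a few lines. This is plain Python, not Cilk; the function `run_dataflow`, the `GRAPH` layout and the node names are all hypothetical, chosen only to illustrate the idea.

```python
from collections import deque
import operator

def run_dataflow(nodes, inputs):
    """Fire each node once all of its operands have been produced.

    nodes:  name -> (function, [operand names])
    inputs: name -> value available at the start
    Returns (all values, order in which the nodes fired).
    """
    values = dict(inputs)
    missing = {}    # node -> operands it is still waiting for
    consumers = {}  # operand -> nodes waiting for it
    for name, (_, operands) in nodes.items():
        missing[name] = {op for op in operands if op not in values}
        for op in missing[name]:
            consumers.setdefault(op, []).append(name)

    ready = deque(name for name in nodes if not missing[name])
    fired = []
    while ready:
        name = ready.popleft()
        func, operands = nodes[name]
        values[name] = func(*(values[op] for op in operands))
        fired.append(name)
        # Producing a value may make waiting nodes ready.
        for waiter in consumers.get(name, []):
            missing[waiter].discard(name)
            if not missing[waiter]:
                ready.append(waiter)
    return values, fired

# Hypothetical graph computing (a + b) * (a - b).
GRAPH = {
    "prod": (operator.mul, ["sum", "diff"]),
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
}
values, fired = run_dataflow(GRAPH, {"a": 5, "b": 3})
print(values["prod"], fired)
```

Although "prod" is listed first, it fires last: nothing orders the nodes except the arrival of their operands, so no explicit synchronisation is written anywhere.
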
Networks
  - Network Topology (or shape)
      - Vital to efficient parallel algorithms
      - Communication is the limiting factor!
  - Ideal
      - Cross-bar
          - Any-to-any
          - Non-blocking
              - Except two sources to same receiver
      - Realisable
          - But only for limited order (number of ports)

Networks
  - Cross-bars
      - Achilles
          - 8 x 8
          - Full duplex
              - Simultaneous input and output at each port
          - 32 bit data-path
          - Target: 1 Gbyte/second total throughput
          - but we needed the 3-D arrangement to achieve
              - bandwidth
              - high order

Networks
  - Cross-bars
      - Achilles
          - Hardware almost trivial!
              - Single FPGA on each level
          - Programmable
              - VHDL models
              - Several topologies
                  - Just by changing the software!

Networks - More than 8 PEs
  - Simple
      - Use 2 8x8 routers!
  - but ...
      - This link gets a lot of traffic!

Networks - Fat tree
  - Problem:
      - High-traffic links between PEs can become a bottleneck
  - Solution: Fat-tree
      - Links higher up the tree are fatter
      - Sustainable bandwidth between all PEs is the same

Networks - Performance Metrics
  - Metrics for comparing network topologies
      - Diameter
          - Maximum distance between any pair of nodes
          - Determines latency
      - Bisection Bandwidth
          - Aggregate bandwidth over any cut which divides the network in half
          - Determines throughput
  - Crossbar
      - Diameter: 1
          - Every PE is directly connected to router so a single hop suffices
      - Bisection Bandwidth: b bytes/sec
          - b is the bandwidth of a single link

Networks - Performance Metrics
  - Metrics for comparing network topologies
      - To connect n PEs with m x m crossbars
      - Single link bandwidth b bytes/s
  - Simple: n = 14 (2 switches)
      - Diameter: 3
      - Bisection Bandwidth: b

Networks - Performance Metrics
  - Fat-tree
      - Diameter: 2 log_m n
          - Height is log_m n
          - Worst case distance - up and down
      - Bisection Bandwidth: b n/2 bytes/sec
          - Links are fatter higher up the tree

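
The fat-tree figures above can be evaluated directly. The helper below is a minimal sketch: the name `fat_tree_metrics` is hypothetical, and rounding the height up when n is not an exact power of m is an assumption, since the slide only gives log_m n.

```python
def fat_tree_metrics(n, m, b):
    """Evaluate the slide's fat-tree formulas for n PEs and m x m crossbars.

    b is the bandwidth of a single link in bytes/s.
    Assumption: the height is rounded up when n is not a power of m.
    """
    if m < 2:
        raise ValueError("crossbars need at least 2 ports")
    height = 0
    reach = 1
    while reach < n:          # integer loop avoids floating-point log errors
        reach *= m
        height += 1
    return {
        "height": height,
        "diameter": 2 * height,   # worst case: all the way up, then down
        "bisection": b * n / 2,   # bytes/s
    }

print(fat_tree_metrics(64, 8, 1))
```

With 64 PEs and 8 x 8 crossbars the tree is two levels high, so the worst-case path is four hops, while the bisection bandwidth grows with n instead of staying at b as in the two-switch arrangement.
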
Networks - Performance Metrics
  - Mesh (n x n PEs, i.e. n PEs along each side)
      - Diameter: 2n-2
      - Bisection Bandwidth: b n bytes/sec
      - Order: 4

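
The mesh figures can be checked by brute force on a small example. The sketch below assumes a square mesh with `side` PEs along each edge and no wrap-around links; the function names are hypothetical.

```python
from collections import deque

def mesh_neighbours(node, side):
    """Yield the (up to 4) PEs directly linked to node in a side x side mesh."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < side and 0 <= nc < side:
            yield (nr, nc)

def mesh_diameter(side):
    """Longest shortest path (in links) between any two PEs, by BFS from each PE."""
    worst = 0
    for start in ((r, c) for r in range(side) for c in range(side)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in mesh_neighbours(node, side):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

def mesh_bisection_links(side):
    """Count links crossing a vertical cut down the middle (side assumed even)."""
    half = side // 2
    count = 0
    for r in range(side):
        for c in range(side):
            for _, nc in mesh_neighbours((r, c), side):
                if c < half <= nc:   # link goes from the left half to the right half
                    count += 1
    return count

print(mesh_diameter(4), mesh_bisection_links(4))
```

For a 4 x 4 mesh this gives a diameter of 6 links (2n-2 with n = 4) and 4 links across the cut, so the bisection bandwidth is 4b.
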
Networks - Performance Metrics
  - Hypercube
      - Hypercube of order m
          - Link 2 order m-1 hypercubes with 2^(m-1) links
      - Number of PEs: n = 2^m
      - Order: log2 n = m

[Figure: two order 2 hypercubes linked to form an order 3 hypercube]

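
The recursive construction (link two order m-1 hypercubes with 2^(m-1) links) translates directly into code. This is a minimal sketch; the function name `hypercube_links` and the choice to number the second copy by adding 2^(m-1) are assumptions made for illustration.

```python
def hypercube_links(m):
    """Return the set of links (a, b), a < b, of an order-m hypercube."""
    if m == 0:
        return set()                      # a single PE, no links
    smaller = hypercube_links(m - 1)
    offset = 1 << (m - 1)                 # 2^(m-1) PEs in each half
    links = set(smaller)                  # first copy: PEs 0 .. offset-1
    links |= {(a + offset, b + offset) for a, b in smaller}   # second copy
    links |= {(pe, pe + offset) for pe in range(offset)}      # 2^(m-1) joining links
    return links

links = hypercube_links(3)
print(len(links))
```

An order 3 hypercube built this way has 8 PEs and 12 links, and every PE ends up with exactly 3 links, matching order = log2 n.
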
Networks - Hypercubes
  - Embedding property
      - In an n PE hypercube, we have hypercubes of size n/2, n/4, ...
  - Number PEs with binary numbers
      - 000, 001, 010, 011, 100, ...
  - Joining two hypercubes
      - add one binary digit to the numbering
  - Each PE is connected to every PE whose index differs in only one bit

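
Because neighbours differ in exactly one bit, both the neighbour list and a route between two PEs follow from bit operations. The sketch below uses dimension-order routing (fix the differing bits from lowest to highest), which is a common scheme but is not named on the slides; the function names are hypothetical.

```python
def neighbours(pe, m):
    """PEs directly linked to pe in an order-m hypercube: flip each bit once."""
    return [pe ^ (1 << bit) for bit in range(m)]

def route(src, dst, m):
    """Path from src to dst, correcting one differing address bit per hop."""
    path = [src]
    current = src
    for bit in range(m):
        mask = 1 << bit
        if (current ^ dst) & mask:    # this bit still differs from the destination
            current ^= mask
            path.append(current)
    return path

print(neighbours(0b000, 3))
print(route(0b000, 0b101, 3))
```

The number of hops equals the number of bits in which the two addresses differ, so no route in an order m hypercube needs more than m hops.
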
Networks - Hypercubes
  - Embedding property
      - Partitioning tasks
          - Allocate to sub-cubes
          - Sub-tasks allocated to sub-cubes of that cube, etc

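
One way to picture the partitioning is to fix the leading address bits: every PE sharing the same prefix forms a smaller hypercube. The sketch below assumes the number of tasks is a power of two and gives each task an equal sub-cube; `sub_cube` and `allocate` are hypothetical names.

```python
def sub_cube(prefix, prefix_len, m):
    """PEs of an order-m hypercube whose top prefix_len address bits equal prefix."""
    free_bits = m - prefix_len
    base = prefix << free_bits
    return [base + low for low in range(1 << free_bits)]

def allocate(task_names, m):
    """Give each task its own equal-sized sub-cube of an order-m hypercube.

    Assumption: the number of tasks is a power of two, at most 2^m.
    """
    count = len(task_names)
    if count < 1:
        raise ValueError("need at least one task")
    prefix_len = count.bit_length() - 1
    if count != 1 << prefix_len or prefix_len > m:
        raise ValueError("task count must be a power of two, at most 2^m")
    return {name: sub_cube(i, prefix_len, m) for i, name in enumerate(task_names)}

print(allocate(["A", "B"], 3))
print(allocate(["A1", "A2", "B1", "B2"], 3))
```

Splitting task A into sub-tasks A1 and A2 simply fixes one more address bit, so the sub-tasks stay inside the sub-cube that A occupied.
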
VLIW - Very Long Instruction Word
  - Instruction word: multiple operations
      - n RISC-style instructions
  - Architecture: fixed set of functional units
      - Each FU matched to a slot in the instruction

VLIW - Very Long Instruction Word
  - Compiler responsible for allocating instructions to words
      - Burden squarely on compiler
      - Needs to produce near optimal schedule
  - Inevitable: large number of empty slots!
      - Lower code density
  - Similar to superscalar
      - but instruction issue flexibility missing
      - VLIW simpler -> faster?
  - Re-compilation needed
      - Each new generation will have different functional unit mix

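
The compiler's job of filling slots, and the empty slots that result, can be illustrated with a toy scheduler. This is a greedy sketch, not how any real VLIW compiler works: the slot mix, the operation names and the one-cycle latency for every operation are assumptions made for the example.

```python
def pack_vliw(ops, slots):
    """Greedily place operations into instruction words.

    ops:   list of (name, unit kind, [names it depends on]), in dependency order
    slots: unit kind of each slot in the word, e.g. ["alu", "alu", "mem", "branch"]
    Assumption: every operation completes in one cycle, so a dependent
    operation may go in the word immediately after its producer.
    """
    words = []
    placed = {}                       # op name -> index of the word holding it
    for name, unit, deps in ops:
        if unit not in slots:
            raise ValueError(f"no functional unit of kind {unit!r}")
        w = max((placed[d] + 1 for d in deps), default=0)
        while True:
            while len(words) <= w:
                words.append([None] * len(slots))
            free = [i for i, kind in enumerate(slots)
                    if kind == unit and words[w][i] is None]
            if free:
                words[w][free[0]] = name
                placed[name] = w
                break
            w += 1                    # matching slots are full: try the next word
    return words

SLOTS = ["alu", "alu", "mem", "branch"]
OPS = [
    ("load_a", "mem", []),
    ("load_b", "mem", []),
    ("add", "alu", ["load_a", "load_b"]),
    ("store", "mem", ["add"]),
    ("jump", "branch", ["store"]),
]
words = pack_vliw(OPS, SLOTS)
empty = sum(slot is None for word in words for slot in word)
for word in words:
    print(word)
print(empty, "empty slots out of", len(words) * len(SLOTS))
```

In this example five operations occupy five words of four slots each, leaving 15 of the 20 slots empty. Changing `SLOTS` to a different functional unit mix changes the schedule, which is why code has to be recompiled for each new generation.
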
Synchronous Logic Systems
  - Clock distribution
      - Major problem for chip architect
      - Clock skews < 100-200ps over whole die
          - 10% of cycle time
      - Small changes -> re-engineer whole chip
          - Checking for data hazards & logic races

Synchronous Logic Systems
  - Clock distribution
  - Power consumption
      - Major problem @ 30W+ per chip
      - CMOS logic consumes power only on switch
          - but synch systems clock a lot of logic on every cycle
      - Clock is distributed to every subsystem
          - Even if the logic of the subsystem is disabled!

Synchronous Logic Systems
  - Clock distribution
  - Power consumption
  - Worst case propagation delay
      - Determines maximum clock speed
      - Clock edge must wait until all logic has settled
      - Temperature and process fabrication -> even slower clocks
  - Design is simpler
      - Logic designers have experience
      - Good tools

Asynchronous Logic Systems
  - Clock distribution
      - No longer a problem
      - Synchronisation bundled with data
  - Circuits are composable
      - No global clock -> no need to re-engineer a whole chip to change one section!
      - Known correct circuits can be combined
  - Power consumption
      - Circuits switch only when they're computing
      - Potentially very low power consumption
      - May be the biggest attraction of asynch systems!

Asynchronous Logic Systems
  - Clock distribution problem removed
  - Circuits are composable
  - Power consumption
  - Average case propagation delay
      - Completion signal generated when result is available
      - Independent of temperature and process fabrication
  - Design is harder
      - Experience will remove this?

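
The difference between worst case and average case delay is easy to see with numbers. The sketch below is deliberately simplified: the delay values are invented, the `margin` and `handshake` parameters are hypothetical knobs for clock safety margin and completion-signal overhead, and the worst case is taken over the listed operations only.

```python
def synchronous_time(delays, margin=0.0):
    """Every operation takes one clock period, set by the slowest operation."""
    clock_period = max(delays) * (1 + margin)
    return clock_period * len(delays)

def asynchronous_time(delays, handshake=0.0):
    """Each operation takes only as long as it needs, plus handshake overhead."""
    return sum(d + handshake for d in delays)

# Invented per-operation delays in ns; one operation is much slower than the rest.
DELAYS = [2, 3, 2, 8, 2, 3]
print(synchronous_time(DELAYS), asynchronous_time(DELAYS))
```

With these made-up figures the clocked design needs 48 ns because every step waits for the 8 ns worst case, while the self-timed design finishes in 20 ns. A non-zero `handshake` cost narrows the gap.
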
Laboratory 1.51
Practical Examinations will be held in this laboratory every afternoon from 1:50pm to 5:30pm next week, June 1 to June 5.
The laboratory will be closed to everyone except those in CT105/CLP110 actually taking the exams during these times.
Please consider the students taking the exam by not disturbing them in any way.