Parallel and Multiprocessor Architectures
Chapter 9.4
By Eric Neto
Parallel & Multiprocessor Architecture
In making processors faster, we run into certain limitations:
Physical
Economic
Solution: When necessary, use more processors, working in sync.
Parallel & Multiprocessor Limitations
Though parallel processing can speed up performance, the gain is limited.
Intuitively, you’d expect N processors to do the work in 1/N of the time, but parts of a process must sometimes run in sequence, so there will be some downtime while idle processors wait for the active processor to finish.
Therefore, the more sequential a process is, the less cost-effective it is to implement parallelism.
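The limit described above is commonly quantified with Amdahl's law, which the slides do not name. The sketch below assumes an idealized split: a fraction of the work (`parallel_fraction`, a name chosen here for illustration) spreads perfectly across `n` processors, and the rest runs sequentially.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Idealized speedup on n processors when only parallel_fraction
    of the work can be parallelized (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    # Sequential part takes the same time; parallel part shrinks by n.
    return 1.0 / (sequential_fraction + parallel_fraction / n)


if __name__ == "__main__":
    # Example values are illustrative only.
    for n in (2, 4, 16, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Under this model, a process that is 90% parallel can never run more than 10 times faster, no matter how many processors are added, because the sequential 10% always remains.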
Parallel and Multiprocessing Architectures
Superscalar
VLIW
Vector
Interconnection Networks
Shared Memory
Distributed Computing
Superscalar Architecture
Allow multiple instructions to be executed simultaneously in each cycle.
Contain:
Execution units – Each executes one instruction at a time.
Specialized instruction fetch unit – Fetches multiple instructions at once and sends them to the decoding unit.
Decoding unit – Determines whether the given instructions are independent of one another.
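As a rough illustration of the independence test the decoding unit performs, the sketch below checks two instructions for register hazards. The `Instr` format and the `independent` function are hypothetical and not from the slides. Real hardware also tracks memory dependencies and can remove some hazards through register renaming, which this sketch ignores.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: dest = op(srcs)."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def independent(first: Instr, second: Instr) -> bool:
    """True if `second` could issue in the same cycle as `first`.

    Checks the three classic register hazards:
      RAW: second reads a register that first writes
      WAW: both write the same register
      WAR: second writes a register that first reads
    """
    raw = first.dest in second.srcs
    waw = first.dest == second.dest
    war = second.dest in first.srcs
    return not (raw or waw or war)


if __name__ == "__main__":
    a = Instr("add", "r1", ("r2", "r3"))
    b = Instr("mul", "r4", ("r5", "r6"))
    c = Instr("sub", "r7", ("r1", "r4"))
    print(independent(a, b))  # a and b share no registers
    print(independent(a, c))  # c reads r1, which a writes
```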
VLIW Architecture
Similar to superscalar, but relies on the compiler rather than specific hardware.
Puts independent instructions into one “Very Long Instruction Word”.
Advantages:
Simpler hardware
Disadvantages:
Instructions are fixed at compile time, so some modifications could affect execution of instructions
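To make the compiler's job concrete, here is a minimal sketch of packing instructions into fixed-width words. It assumes a hypothetical instruction format (destination register, source registers) and a simple greedy, in-order strategy. Real VLIW compilers reorder instructions and fill empty slots with no-ops, which this sketch does not attempt.

```python
from typing import List, Tuple

# Hypothetical instruction format: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]


def conflicts(a: Instr, b: Instr) -> bool:
    """True if a and b share a register in a way that forces an order."""
    a_dest, a_srcs = a
    b_dest, b_srcs = b
    return a_dest == b_dest or a_dest in b_srcs or b_dest in a_srcs


def pack_bundles(program: List[Instr], width: int) -> List[List[Instr]]:
    """Greedy in-order packing into long instruction words.

    A new word is started when the current one is full or when the next
    instruction conflicts with an instruction already in it.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    bundles: List[List[Instr]] = []
    current: List[Instr] = []
    for instr in program:
        full = len(current) == width
        if full or any(conflicts(prev, instr) for prev in current):
            bundles.append(current)
            current = []
        current.append(instr)
    if current:
        bundles.append(current)
    return bundles


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),
        ("r7", ("r1", "r4")),  # depends on the two instructions above
        ("r8", ("r9", "r9")),
    ]
    for word in pack_bundles(program, width=3):
        print(word)
```

Because the grouping is decided before the program runs, the hardware needs no dependency-checking logic, which is the trade-off the slide describes.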
Vector Processors
Use vector pipelines to store and perform operations on many values at once, as opposed to scalar processing, which only performs operations on individual values.
Since vector processing uses fewer instructions, there is less decoding, control unit overhead, and memory bandwidth usage.
Can be SIMD or MIMD.
Xn = X1 + X2 ;
Yn = Y1 + Y2 ;
Zn = Z1 + Z2 ;
Wn = W1 + W2 ;
…
VS.
LDV V1, R1
LDV V2, R2
ADDV R3, V1, V2
STV R3, V3
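The comparison above can be turned into a rough instruction-count model. This is a sketch under stated assumptions, not a measurement: each scalar element costs four instructions (two loads, one add, one store), loop overhead is ignored, and the vector machine needs the same four instructions once per chunk of `vector_length` elements. The function names and the vector length of 64 are illustrative.

```python
import math

# Two loads, one add, one store: per element (scalar) or per chunk (vector).
INSTRUCTIONS_PER_STEP = 4


def scalar_instruction_count(n_elements: int) -> int:
    """Instructions issued when every element is handled individually."""
    return INSTRUCTIONS_PER_STEP * n_elements


def vector_instruction_count(n_elements: int, vector_length: int) -> int:
    """Instructions issued when elements are handled vector_length at a time."""
    if vector_length < 1:
        raise ValueError("vector_length must be at least 1")
    chunks = math.ceil(n_elements / vector_length)
    return INSTRUCTIONS_PER_STEP * chunks


if __name__ == "__main__":
    n = 64
    print("scalar:", scalar_instruction_count(n))
    print("vector:", vector_instruction_count(n, vector_length=64))
```

Fewer issued instructions means fewer fetches and decodes, which is where the reduced control unit overhead comes from.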
Interconnection Networks
Each processor has its own memory, which can be accessed and shared by other processors through an interconnection network.
Efficiency of messages shared through the network is limited by:
Bandwidth
Message latency
Transport latency
Overhead
In general, the number of messages sent and the distances they must travel should be minimized.
Topologies
Interconnection networks can be either static or dynamic.
Different configurations of static networks are more useful for different tasks.
Completely Connected
Ring
Star
More Topologies
Tree
Mesh
Hypercube
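One way to compare static topologies is by how many links they need and how far apart two nodes can be (the diameter, in hops). The sketch below uses standard formulas for four of the topologies listed. The function name is illustrative, and tree and mesh are left out because their figures depend on shape parameters the slides do not give.

```python
from typing import Tuple


def topology_metrics(name: str, n: int) -> Tuple[int, int]:
    """Return (number of links, diameter in hops) for n nodes."""
    if n < 3:
        raise ValueError("formulas below assume at least 3 nodes")
    if name == "completely connected":
        # Every pair of nodes has its own link.
        return n * (n - 1) // 2, 1
    if name == "ring":
        # Each node links to two neighbours; worst case is halfway round.
        return n, n // 2
    if name == "star":
        # One hub plus n - 1 leaves; leaf to leaf goes through the hub.
        return n - 1, 2
    if name == "hypercube":
        dimensions = n.bit_length() - 1
        if 2 ** dimensions != n:
            raise ValueError("a hypercube needs a power-of-two node count")
        # Each node has one link per dimension; each link is shared by two.
        return n * dimensions // 2, dimensions
    raise ValueError(f"unknown topology: {name}")


if __name__ == "__main__":
    for name in ("completely connected", "ring", "star", "hypercube"):
        print(name, topology_metrics(name, 8))
```

The completely connected network has the shortest paths but its link count grows fastest, which is why the other configurations suit different tasks.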
Dynamic Networks
Buses, crossbars, switches, multistage connections.
As you add more processors, these get much more expensive (a crossbar's switch count, for example, grows with the square of the number of processors).
Dynamic Networking: Crossbar Network
• Efficient
• Direct
• Expensive
Dynamic Networking: Switch Network
• Complex
• Moderately Efficient
• Cheaper
Dynamic Networking: Bus
• Simple
• Slow
• Inefficient
• Cheap
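The cost gap between a crossbar and a switch network can be sketched by counting switches. The slides do not specify a particular multistage design, so this example assumes an Omega network, which uses log2(N) stages of N/2 two-by-two switches. The function names are illustrative.

```python
def crossbar_switches(n: int) -> int:
    """An n-by-n crossbar has one switch at every row/column crossing."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * n


def omega_switches(n: int) -> int:
    """2x2 switches in an Omega multistage network with n inputs
    (assumed design; n must be a power of two)."""
    stages = n.bit_length() - 1
    if n < 2 or 2 ** stages != n:
        raise ValueError("n must be a power of two and at least 2")
    return (n // 2) * stages


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, crossbar_switches(n), omega_switches(n))
```

The multistage network needs far fewer switches, but a message passes through several stages and can be blocked by other traffic, which matches the "moderately efficient, cheaper" description.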
Shared Memory Multiprocessors
Memory is shared either globally or locally, or a combination of the two.
Shared Memory Access
Uniform Memory Access (UMA) systems use a shared memory pool, where all memory takes the same amount of time to access.
Quickly becomes expensive when more processors are added.
Shared Memory Access
Non-Uniform Memory Access (NUMA) systems have memory distributed across all the processors, and it takes less time for a processor to read from its own local memory than from non-local memory.
Prone to cache coherence problems, which occur when a local cache isn’t in sync with non-local caches representing the same data.
Dealing with these problems requires extra mechanisms to ensure coherence.
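The coherence problem can be reproduced in a few lines. The toy model below is an illustration only: it assumes write-through caches and uses write-invalidate as one example of an "extra mechanism". The slides do not say which mechanism a given system uses, and `ToySystem` and `stale_read_demo` are hypothetical names.

```python
class ToySystem:
    """Shared memory plus one private write-through cache per processor."""

    def __init__(self, n_processors: int, invalidate_on_write: bool):
        self.memory = {}
        self.caches = [dict() for _ in range(n_processors)]
        self.invalidate_on_write = invalidate_on_write

    def read(self, cpu: int, addr: str) -> int:
        cache = self.caches[cpu]
        if addr not in cache:
            # Cache miss: fetch the current value from shared memory.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu: int, addr: str, value: int) -> None:
        self.memory[addr] = value        # write through to shared memory
        self.caches[cpu][addr] = value   # update the writer's own cache
        if self.invalidate_on_write:
            for other, cache in enumerate(self.caches):
                if other != cpu:
                    cache.pop(addr, None)  # drop copies held elsewhere


def stale_read_demo(invalidate_on_write: bool) -> int:
    """Return what processor 1 sees after processor 0 updates x to 2."""
    system = ToySystem(2, invalidate_on_write)
    system.write(0, "x", 1)
    system.read(1, "x")       # processor 1 now caches x = 1
    system.write(0, "x", 2)   # processor 0 changes x
    return system.read(1, "x")


if __name__ == "__main__":
    print("without invalidation:", stale_read_demo(False))
    print("with invalidation:   ", stale_read_demo(True))
```

Without invalidation, processor 1 keeps reading its old cached copy; with it, the old copy is discarded and the next read fetches the new value from memory.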
Distributed Computing
Multi-computer processing.
Works on the same principle as multiprocessors, on a larger scale.
Uses a large network of computers to solve small parts of a very large problem.
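The divide-and-combine pattern can be sketched on a single machine. In the example below, threads stand in for networked computers, which is an assumption made purely for illustration; a real distributed system would send each chunk over a network. The function names and the sum-of-squares task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(data: List[int], n_parts: int) -> List[List[int]]:
    """Cut data into at most n_parts chunks of nearly equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    if not data:
        return []
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk: List[int]) -> int:
    """The small piece of work each worker performs."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data: List[int], n_workers: int) -> int:
    """Solve each chunk separately, then combine the partial results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(distributed_sum_of_squares(list(range(1, 101)), n_workers=4))
```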