Parallel and Multiprocessor Architectures

Chapter 9.4

By Eric Neto

Parallel & Multiprocessor Architecture

In making processors faster, we run into certain limitations:
• Physical
• Economic

Solution: When necessary, use more processors, working in sync.

Parallel & Multiprocessor Limitations

Though parallel processing can speed up performance, the gain is limited.

Intuitively, you’d expect N processors to do the work in 1/N of the time, but some parts of a process must run in sequence, so there will be some downtime while idle processors wait for the active processor to finish.

Therefore, the more sequential a process is, the less cost-effective it is to implement parallelism.
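This limit is commonly formalized as Amdahl's law. The sketch below is a minimal illustration of that formula; the function name and the 90% figure are chosen here for the example and do not come from the slides.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Overall speedup when only part of the work can be parallelized.

    parallel_fraction: share of the work (0.0 to 1.0) that can run in parallel.
    processors: number of processors working on the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    # The sequential share takes the same time no matter how many processors exist.
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical workload: 90% parallel, 10% sequential.
    for n in (1, 2, 4, 16, 1024):
        print(n, "processors ->", round(amdahl_speedup(0.9, n), 2), "x speedup")
```

With a 10% sequential share, the speedup stays below 10x however many processors are added, which is why highly sequential work benefits little from parallel hardware.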

Parallel and Multiprocessing Architectures
• Superscalar
• VLIW
• Vector
• Interconnection Networks
• Shared Memory
• Distributed Computing

Superscalar Architecture

Allows multiple instructions to be executed simultaneously in each cycle.

Contains:
• Execution units: each allows one instruction to execute.
• Specialized instruction fetch unit: fetches multiple instructions at once and sends them to the decoding unit.
• Decoding unit: determines whether the given instructions are independent of one another.
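The independence check made by the decoding unit can be sketched in software. This is a simplified model, not a description of any real decoder: the `Instr` type and register names are hypothetical, and the test covers only register dependences between two instructions.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: one destination, several sources."""
    dest: str
    sources: Tuple[str, ...]


def independent(first: Instr, second: Instr) -> bool:
    """True if `second` may issue in the same cycle as the earlier `first`."""
    read_after_write = first.dest in second.sources   # second needs first's result
    write_after_read = second.dest in first.sources   # second overwrites first's input
    write_after_write = first.dest == second.dest     # both write the same register
    return not (read_after_write or write_after_read or write_after_write)


if __name__ == "__main__":
    add = Instr("r1", ("r2", "r3"))
    mul = Instr("r4", ("r5", "r6"))
    sub = Instr("r7", ("r1", "r4"))
    print(independent(add, mul))  # no shared registers
    print(independent(add, sub))  # sub reads r1, which add writes
```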

VLIW Architecture

Similar to superscalar, but relies on the compiler rather than specific hardware.

Puts independent instructions into one “Very Long Instruction Word”.

Advantages:
• Simpler hardware

Disadvantages:
• Instructions are fixed at compile time, so later changes, such as to the hardware the code runs on, could affect how those instructions execute
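A rough sketch of the compiler's job: walk the program in order and pack mutually independent instructions into fixed-width words. The tuple layout `(dest, src1, src2)`, the greedy in-order strategy, and the function names are assumptions for illustration; real VLIW compilers use far more elaborate scheduling.

```python
def _independent(a, b):
    """True if instructions a and b, each (dest, src1, src2), share no register hazard."""
    a_dest, a_srcs = a[0], a[1:]
    b_dest, b_srcs = b[0], b[1:]
    return a_dest not in b_srcs and b_dest not in a_srcs and a_dest != b_dest


def pack_bundles(instructions, width):
    """Greedily pack instructions, in program order, into words of at most `width` slots."""
    if width < 1:
        raise ValueError("width must be at least 1")
    bundles = []
    current = []
    for instr in instructions:
        fits = len(current) < width and all(_independent(p, instr) for p in current)
        if fits:
            current.append(instr)
        else:
            # Close the current word and start a new one with this instruction.
            bundles.append(current)
            current = [instr]
    if current:
        bundles.append(current)
    return bundles


if __name__ == "__main__":
    program = [
        ("r1", "r2", "r3"),
        ("r4", "r5", "r6"),
        ("r7", "r1", "r4"),  # depends on both instructions above
        ("r8", "r2", "r5"),
    ]
    for word in pack_bundles(program, width=4):
        print(word)
```

Because the packing is decided before the program runs, the word contents cannot adapt to conditions only known at run time, which is the trade-off noted under Disadvantages.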

Vector Processors

Use vector registers and pipelines to store and perform operations on many values at once, as opposed to scalar processing, which only performs operations on individual values.

Since they use fewer instructions, there is less decoding, control unit overhead, and memory bandwidth usage.

Can be SIMD or MIMD.

Xn = X1 + X2 ;
Yn = Y1 + Y2 ;
Zn = Z1 + Z2 ;
Wn = W1 + W2 ;
…

VS.

LDV V1, R1
LDV V2, R2
ADDV R3, V1, V2
STV R3, V3
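The comparison above can be mimicked with a small counting model. This is only a sketch: it counts one add statement per element for the scalar case, and a fixed four instructions (the LDV, LDV, ADDV, STV sequence) for the vector case, under the assumption that each vector fits in a single vector register.

```python
def scalar_add(a, b):
    """Add element by element; returns (result, instructions_issued)."""
    result = []
    issued = 0
    for x, y in zip(a, b):
        result.append(x + y)
        issued += 1  # one separate add statement per element
    return result, issued


def vector_add(a, b):
    """Add whole vectors at once; returns (result, instructions_issued)."""
    result = [x + y for x, y in zip(a, b)]
    # Two vector loads, one vector add, one vector store, regardless of length
    # (assumes the vectors fit in one vector register).
    return result, 4


if __name__ == "__main__":
    a = list(range(64))
    b = list(range(64, 128))
    print("scalar instructions:", scalar_add(a, b)[1])
    print("vector instructions:", vector_add(a, b)[1])
```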

Interconnection Networks

Each processor has its own memory, which can be accessed and shared by other processors through an interconnection network.

The efficiency of messages sent through the network is limited by:
• Bandwidth
• Message latency
• Transport latency
• Overhead

In general, the number of messages sent and the distances they must travel should be minimized.
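To see why fewer messages are preferred, consider a deliberately simplified additive cost model. The formula, the parameter names, and the numbers are assumptions for illustration only; the slides do not define how the four factors combine.

```python
def transfer_time(message_bytes, bandwidth_bytes_per_s, latency_s, overhead_s):
    """Estimated seconds to deliver one message under a simple additive model."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # Fixed per-message costs plus the time to push the bytes through the link.
    return overhead_s + latency_s + message_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical link: 1 MB/s bandwidth, 1 ms latency, 0.5 ms overhead per message.
    one_large = transfer_time(100_000, 1_000_000, 0.001, 0.0005)
    many_small = 100 * transfer_time(1_000, 1_000_000, 0.001, 0.0005)
    print("one 100 kB message:", one_large)
    print("100 messages of 1 kB:", many_small)
```

In this model the same data costs more when split into many messages, because each message pays the latency and overhead again.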

Topologies

Connections between processors in a network can be either static or dynamic.

Different static configurations are more useful for different tasks:
• Completely Connected
• Ring
• Star
• Tree
• Mesh
• Hypercube
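One way to compare static topologies is by how many links they need and how far apart two nodes can be (the diameter, in hops). The sketch below covers four of the topologies using their standard formulas; the function name and string labels are chosen for this example, and it assumes at least 4 nodes.

```python
def topology_metrics(kind, n):
    """Return (links, diameter) for a static network of n nodes (n >= 4 assumed)."""
    if n < 4:
        raise ValueError("this sketch assumes at least 4 nodes")
    if kind == "completely_connected":
        return n * (n - 1) // 2, 1      # every pair of nodes has its own link
    if kind == "ring":
        return n, n // 2                # worst case is halfway around the ring
    if kind == "star":
        return n - 1, 2                 # n counts the hub; leaf -> hub -> leaf
    if kind == "hypercube":
        dimensions = n.bit_length() - 1
        if n != 1 << dimensions:
            raise ValueError("a hypercube needs a power-of-two node count")
        return n * dimensions // 2, dimensions
    raise ValueError("unknown topology: " + kind)


if __name__ == "__main__":
    for kind in ("completely_connected", "ring", "star", "hypercube"):
        links, diameter = topology_metrics(kind, 16)
        print(kind, "links:", links, "diameter:", diameter)
```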

Dynamic Networks

Busses, crossbars, switches, multistage connections.

As you add more processors, these get rapidly more expensive.

Dynamic Networking: Crossbar Network
• Efficient
• Direct
• Expensive

Dynamic Networking: Switch Network
• Complex
• Moderately Efficient
• Cheaper

Dynamic Networking: Bus
• Simple
• Slow
• Inefficient
• Cheap
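The cost difference can be made concrete by counting switching elements. This sketch assumes n processors connected to n memory modules, a full crossbar with one crosspoint per pair, and an Omega-style multistage network built from 2x2 switches as a stand-in for the switch network; the function names are made up for the example.

```python
def crossbar_switches(n):
    """Crosspoints in an n x n crossbar: one per processor-memory pair."""
    return n * n


def omega_switches(n):
    """2x2 switches in an n-input Omega multistage network (n a power of two)."""
    stages = n.bit_length() - 1
    if n < 2 or n != 1 << stages:
        raise ValueError("n must be a power of two, at least 2")
    # Each of the log2(n) stages holds n/2 two-by-two switches.
    return (n // 2) * stages


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, "crossbar:", crossbar_switches(n), "multistage:", omega_switches(n))
```

A bus needs only the one shared path, which is why it is the cheapest option but also the one where processors wait on each other the most.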

Shared Memory Multiprocessors

Memory is shared either globally, locally, or as a combination of the two.

Shared Memory Access

Uniform Memory Access (UMA) systems use a shared memory pool, where all memory takes the same amount of time to access.

Quickly becomes expensive when more processors are added.

Shared Memory Access

Non-Uniform Memory Access (NUMA) systems have memory distributed across all the processors, and it takes less time for a processor to read from its own local memory than from non-local memory.

Prone to cache coherence problems, which occur when a local cache isn’t in sync with non-local caches representing the same data.

Dealing with these problems requires extra mechanisms to ensure coherence.
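A toy model can show both the problem and one possible fix. The `Cache` class below is hypothetical, and the mechanism it uses (write-through plus invalidating copies held by other caches) is just one of several coherence schemes, picked here for brevity rather than taken from the slides.

```python
class Cache:
    """Toy per-processor cache in front of a shared memory (a plain dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> locally cached value

    def read(self, address):
        # On a miss, copy the value from shared memory into the cache.
        if address not in self.lines:
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value, peers=()):
        # Write-through: update the local copy and shared memory together.
        self.lines[address] = value
        self.memory[address] = value
        # Coherence step: invalidate stale copies held by the other caches.
        for peer in peers:
            peer.lines.pop(address, None)


if __name__ == "__main__":
    memory = {"x": 1}
    a, b = Cache(memory), Cache(memory)
    b.read("x")                      # b now holds its own copy of x
    a.write("x", 2)                  # no invalidation: b's copy goes stale
    print("without coherence, b sees", b.read("x"))
    a.write("x", 3, peers=[b])       # invalidation forces b to re-read
    print("with invalidation, b sees", b.read("x"))
```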

Distributed Computing

Multi-computer processing.

Works on the same principle as multiprocessors, on a larger scale.

Uses a large network of computers, each solving a small part of a very large problem.
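The split-solve-combine pattern can be sketched on a single machine. Here a thread pool stands in for the network of computers, which is purely an assumption for the sake of a runnable example; `split` and `distributed_sum` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks of near-equal size."""
    if not data:
        return []
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_sum(data, workers=4):
    """Sum a large list by giving each worker one small part of it."""
    chunks = split(data, workers)
    # Each worker stands in for one computer solving its own part.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_sums)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 1_000_001)), workers=8))
```

In a real distributed system the chunks and partial results would travel over the network, so the communication costs discussed under Interconnection Networks apply here too.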
