PARALLEL PROCESSOR ORGANIZATIONS


Jehan-François Pâris
jfparis@uh.edu

Chapter Organization

• Overview
• Writing parallel programs
• Multiprocessor Organizations
• Hardware multithreading
• Alphabet soup (SISD, SIMD, MIMD, …)
• Roofline performance model

OVERVIEW

The hardware side

• Many parallel processing solutions
  – Multiprocessor architectures
    • Two or more microprocessor chips
    • Multiple architectures
  – Multicore architectures
    • Several processors on a single chip

The software side

• Two ways for software to exploit the parallel processing capabilities of hardware
  – Job-level parallelism
    • Several sequential processes run in parallel
    • Easy to implement (the OS does the job!)
  – Process-level parallelism
    • A single program runs on several processors at the same time

WRITING PARALLEL PROGRAMS

Overview

• Some problems are embarrassingly parallel
  – Many computer graphics tasks
  – Brute force searches in cryptography or password guessing
• Much more difficult for other applications
  – Communication overhead among sub-tasks
  – Amdahl's law
  – Balancing the load

Amdahl's Law

• Assume a sequential process takes
  – tp seconds to perform operations that could be performed in parallel
  – ts seconds to perform purely sequential operations
• The maximum speedup will be

  (tp + ts) / ts
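A minimal C sketch of this bound follows; the 95-second and 5-second timings are made-up numbers used only to illustrate the formula, not measurements from the slides.

#include <stdio.h>

/* Maximum speedup per Amdahl's law: (tp + ts) / ts, where tp is the
   parallelizable time and ts the purely sequential time of the
   original sequential program. */
static double max_speedup(double tp, double ts)
{
    return (tp + ts) / ts;
}

int main(void)
{
    double tp = 95.0;   /* assumed parallelizable portion, in seconds */
    double ts = 5.0;    /* assumed purely sequential portion */
    printf("Maximum speedup: %.1f\n", max_speedup(tp, ts));   /* prints 20.0 */
    return 0;
}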

Balancing the load

• Must ensure that workload is equally divided among all the processors

• Worst case is when one of the processors does much more work than all others

Example (I)

• Computation partitioned among n processors
• One of them does 1/m of the work, with m < n
  – That processor becomes a bottleneck
• Maximum expected speedup: n
• Actual maximum speedup: m
  – The parallel run cannot finish before that processor has done its 1/m share of the work, so the speedup cannot exceed m

Example (II)

• Computation partitioned among 64 processors
• One of them does 1/8 of the work

• Maximum expected speedup: 64

• Actual maximum speedup: 8
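The small C sketch below captures the bound illustrated by the two examples above; the helper name bounded_speedup is just an illustrative choice.

#include <stdio.h>

/* If one processor does a fraction f of the total work, the parallel
   run cannot finish before that processor does, so the speedup is at
   most min(n, 1/f). */
static double bounded_speedup(int n, double f)
{
    double bound = 1.0 / f;
    return bound < n ? bound : n;
}

int main(void)
{
    /* Slide example: 64 processors, one of them does 1/8 of the work. */
    printf("Actual maximum speedup: %.0f\n",
           bounded_speedup(64, 1.0 / 8.0));   /* prints 8 */
    return 0;
}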

A last issue

• Humans like to address issues one after the other
  – We have meeting agendas
  – We do not like to be interrupted
  – We write sequential programs

René Descartes

• Seventeenth-century French philosopher
• Invented
  – Cartesian coordinates
  – Methodical doubt
    • [To] never accept anything for true which I did not clearly know to be such

• Proposed a scientific method based on four precepts

Method's third rule

• The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.

MULTIPROCESSOR ORGANIZATIONS

Shared memory multiprocessors

[Diagram: several processing units (PU), each with its own cache, connected by an interconnection network to a shared RAM and I/O.]

Shared memory multiprocessor

• Can offer
  – Uniform memory access to all processors (UMA)
    • Easiest to program
  – Non-uniform memory access to all processors (NUMA)
    • Can scale up to larger sizes
    • Offers faster access to nearby memory

Computer clusters

[Diagram: several nodes, each with its own PU, cache, and RAM, connected by an interconnection network.]

Computer clusters

• Very easy to assemble
• Can take advantage of high-speed LANs
  – Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing

Message passing (I)

• If processor P wants to access data in the main memory of processor Q, it must
  – Send a request to Q
  – Wait for a reply
• For this to work, processor Q must have a thread
  – Waiting for messages from other processors
  – Sending them replies
• A sketch of such an exchange appears below
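A minimal sketch of the request/reply exchange using MPI, assuming MPI is available on the cluster; the tag values, the ranks chosen for P and Q, and the table contents are illustrative assumptions, not part of the slides.

#include <mpi.h>
#include <stdio.h>

#define TAG_REQUEST 1   /* illustrative tag values, not fixed by MPI */
#define TAG_REPLY   2

int main(int argc, char *argv[])
{
    int rank, index, value;
    int table[4] = {10, 20, 30, 40};   /* data owned by process Q (rank 1) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                   /* processor P: ask Q for element 2 */
        index = 2;
        MPI_Send(&index, 1, MPI_INT, 1, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, TAG_REPLY, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("P received %d\n", value);
    } else if (rank == 1) {            /* processor Q: wait for the request,
                                          then send the reply */
        MPI_Recv(&index, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&table[index], 1, MPI_INT, 0, TAG_REPLY, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}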

Message passing (II)

• In a shared memory architecture, each processor can directly access all data
• A proposed solution
  – Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
  – Still has performance issues

When things do not add up

• Memory capacity is very important for big computing applications
  – If the data can fit into main memory, the computation will run much faster
• A company replaced
  – A single shared-memory computer with 32 GB of RAM

A problem

• A company replaced
  – A single shared-memory computer with 32 GB of RAM
  – With four "clustered" computers with 8 GB each
• More I/O than ever
• What happened?

The explanation

• Assume the OS occupies one GB of RAM
  – The old shared-memory computer still had 31 GB of free RAM
  – Each of the clustered computers has only 7 GB of free RAM
• The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!

Grid computing

• The computers are distributed over a very large network
  – Sometimes computer time is donated
    • Volunteer computing
    • SETI@home
  – Works well with embarrassingly parallel workloads
    • Searches in an n-dimensional space

HARDWARE MULTITHREADING

General idea

• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
  – Increased cost of cache misses

Implementation

• Entirely controlled by the hardware
  – Unlike multiprogramming
• Requires a processor capable of
  – Keeping track of the state of each thread
    • One set of registers (including the PC) for each concurrent thread
  – Quickly switching among concurrent threads

Approaches

• Fine-grained multithreading
  – Switches between threads on each instruction
  – Provides the highest throughput
  – Slows down the execution of individual threads

Approaches

• Coarse-grained multithreading
  – Switches between threads whenever a long stall is detected
  – Easier to implement
  – Cannot eliminate all stalls

Approaches

• Simultaneous multithreading
  – Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel
  – Best solution

ALPHABET SOUP

Overview

• Used to describe processor organizations where the same instructions can be applied to multiple data instances
• Encountered in
  – Vector processors in the past
  – Graphics processing units (GPUs)
  – x86 multimedia extensions

Classification

• SISD
  – Single instruction, single data
  – Conventional uniprocessor architecture
• MIMD
  – Multiple instructions, multiple data
  – Conventional multiprocessor architecture

Classification

• SIMD
  – Single instruction, multiple data
  – Performs the same operations on a set of similar data
• Think of adding two vectors

for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];
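As one hedged illustration of the x86 multimedia extensions mentioned above, the sketch below rewrites the same loop with SSE intrinsics; it assumes VECSIZE is a multiple of 4 and that the target CPU supports SSE.

#include <xmmintrin.h>   /* SSE intrinsics */

#define VECSIZE 1024     /* assumed to be a multiple of 4 here */

void vector_add(const float *a, const float *b, float *sum)
{
    /* One SSE add instruction operates on four floats at a time. */
    for (int i = 0; i < VECSIZE; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb));
    }
}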

Vector computing

• A kind of SIMD architecture
  – Used by Cray computers
• Pipelines multiple executions of a single instruction with different data ("vectors") through the ALU
• Requires
  – Vector registers able to store multiple values
  – Special vector instructions: say lv, addv, …

Benchmarking

• Two factors to consider
  – Memory bandwidth
    • Depends on the interconnection network
  – Floating-point performance
• Best known benchmark is LINPACK

Roofline model

• Takes into account
  – Memory bandwidth
  – Floating-point performance
• Introduces arithmetic intensity
  – Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  – Measured in FLOPs/byte

Roofline model

• Attainable GFLOP/s =
      min(Peak memory bandwidth × Arithmetic intensity,
          Peak floating-point performance)
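A minimal C sketch of this formula; the 25 GB/s bandwidth and 100 GFLOP/s peak are made-up machine parameters chosen only to show how the two limits interact.

#include <stdio.h>

/* Roofline model: attainable performance is the lesser of
   (peak memory bandwidth x arithmetic intensity) and the
   peak floating-point performance. */
static double attainable_gflops(double peak_bw_gb_s,
                                double intensity_flops_per_byte,
                                double peak_gflops)
{
    double bw_limited = peak_bw_gb_s * intensity_flops_per_byte;
    return bw_limited < peak_gflops ? bw_limited : peak_gflops;
}

int main(void)
{
    /* Assumed machine: 25 GB/s memory bandwidth, 100 GFLOP/s peak. */
    for (double ai = 0.5; ai <= 8.0; ai *= 2)
        printf("AI = %4.1f FLOPs/byte -> %6.1f GFLOP/s\n",
               ai, attainable_gflops(25.0, ai, 100.0));
    return 0;
}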

Roofline model

[Roofline plot: attainable GFLOP/s versus arithmetic intensity; performance rises along the memory-bandwidth slope until it reaches the flat ceiling of peak floating-point performance. To the left of that ridge point, floating-point performance is limited by memory bandwidth.]