Parallel Computer Organization and Design EDA282 Slide 1.
Transcript of Parallel Computer Organization and Design EDA282 Slide 1.
Parallel Computer Organization and Design, EDA282
Slide 1
Slide 2
Why Study Parallel Computers?
Almost ALL computers are now parallel
Understanding hardware is important for producing good software (converse also true!)
It’s fun!
Logistics
EL43, 1:15-3:00 T/Th (often 1:15 F, too)
Expected participation:
- Attend lectures, participate in discussion
- Complete labs, including a satisfactory writeup (dates/times TBD)
- Read papers
- Complete quizzes
- Write (short) survey article (in teams)
- Finish (short) take-home exam
Canvas course-management system:
- https://canvas.instructure.com/courses/777378
- Link: http://www.cse.chalmers.se/~mckee/eda282
Slide 3
Personnel
Prof. Sally McKee
- Office hours: arrange meetings via email
- Available for discussions after class
- [email protected]

Jacob Lidman
- [email protected]
Slide 4
Course Materials
“Parallel Computer Organization and Design” by Dubois, Annavaram, and Stenström (at Cremona)
Research and survey papers (linked to web page)
Slide 5
Course Structure/Contents
Intro today
Programming models:
- Data parallelism
- Shared address spaces
- Message passing
- Hybrid

Design principles/tradeoffs (this is the bulk of the material):
- Small-scale systems
- Scalable systems
- Interconnects
Slide 6
For Each Big Topic, We’ll Discuss . . .
History:
- How concepts originated in old machines
- How they show up in current machines

Basics required in any parallel machine:
- Memory coherence
- Communication
- Synchronization
How Did We Get Here?
Transistor count doubling every ~2 years
Transistor feature sizes shrinking
Costs changing
Clock speeds hitting limits
Parallelism per processor increasing
Looking at trends is important when designing new systems!
Slide 8
Costs of Parallel Machines
Things to keep in mind when designing a machine . . .
- What does it cost to design the mechanism?
- What does it cost to verify?
- What does it cost to manufacture?
- What does it cost to test?
- What does it cost to program it?
- What does it cost to deploy (turn on)?
- What does it cost to keep it running (power costs, maintenance)?
- What does it cost to use it?
- What does it cost to dispose of it at the end of its lifetime? (how long is a "lifetime"?)
Slide 9
Slide 10
Interesting Questions (i.e., course content)
What do we mean by parallel?
- Task parallelism (SPMD, MPMD)
- Data parallelism (SIMD)
- Thread parallelism (Hyperthreading, SMT)

How do the processors coordinate their work?
- Shared memory/message passing
- Interconnection network (at least one!)
- Synchronization primitives
- Many combinations/variations

What’s the best way to put these pieces together?
- What do you want to run?
- How fast do you have to run it?
- How much can you spend?
- How much energy can you use?
Moore’s Law: Transistor Counts
Slide 11
Feature Sizes
Slide 12
Costs: Apple
Slide 13
Costs of Consumer Electronics Today
Slide 14
History
- Pascal adding machine, 1642
- Leibniz adder/multiplier, ~1670
- Babbage analytical engine, 1837 (punch cards, memory, printer!)
- Hollerith punch cards, 1890 (used for US census data)
- Aiken digital computer, 1940s (Harvard)
- Von Neumann stored-program computer, 1945
- Eckert/Mauchly ENIAC general-purpose computer, 1946
Slide 15
Evolution of Electronic Computers
Vacuum tubes replaced by transistors, late 1950s:
- Smaller, faster, more versatile logic elements
- Lower power
- Longer lifetime

Integrated circuits, late 1960s:
- Many transistors fabricated on a silicon substrate
- Wires plated in place
- Lower price
- Smaller size
- Lower failure rate

LSI/VLSI/microprocessors, 1970s:
- 1000s of interconnected transistors etched into silicon
- Could check 8 switches at once → 8-bit “byte”
Slide 16
History of Supercomputers
IBM 7030 Stretch, 1961:
- 2K sq. ft.
- Fastest computer in the world at the time
- Slower than expected!
- Cost initially $13M, dropped to $8.5M
- Instruction pipelining, prefetching/decoding, memory interleaving

CDC 6600, 1964:
- Size ~= 4 filing cabinets
- Cost $8M ($60M today)
- 40 MHz, 3 MFLOPS at peak
- Freon cooled
- CPU == 10 FUs, multiple PCBs
- 60-bit words/regs
Slide 17
History of Supercomputers (2)
Cray-1, 1976:
- 64-bit words
- 80 MHz
- 136 MFLOPS!
- Speed-critical parts placed inside the horseshoe-shaped chassis
- 1662 PCBs w/ 144 ICs
- 80 sold in 10 years
- $5-8M ($25M now)
Slide 18
History of Supercomputers (3)
Cray X-MP, 1982:
- Up to 4 CPUs in 1 chassis
- Up to 16M 64-bit words (128 MB, all SRAM!)
- Up to 32 1.2 GB disks
- 105 MHz
- Up to 800 MFLOPS (200/CPU)
- Double the memory bandwidth of the Cray-1

Cray-2, 1985:
- Again, ICs packed on logic boards
- Again, horseshoe shape
- Boards packed tightly; submerged in Fluorinert for cooling (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf)
- Up to 8 CPUs, 1.9 GFLOPS
- Mainstream software/Unix System V OS
Slide 19
History of Supercomputers (4)
Intel Paragon, 1989:
- i860-based
- 32- or 64-bit
- Up to 4K CPUs
- 2D MIMD topology
- Poor memory bandwidth utilization

ASCI Red, 1996:
- First to use off-the-shelf CPUs (Pentium Pros, Xeons)
- 6K CPUs
- Broke the 1 TFLOPS barrier
- Cost $46M ($67M now)
- Upgrade had 9298 Xeons for 3.1 TFLOPS
- Over 1 MW of power!
Slide 20
History of Supercomputers (5)
Hitachi SR2201, 1996:
- H-shaped chassis
- 2048 CPUs
- 600 GFLOPS peak

Other similar machines (many Japanese):
- 100s of CPUs
- 2D or 3D networks (e.g., Cray torus)
- MIMD

Seymour Cray leaves Cray Research:
- Cray Computer Corp. (CCC)
  - Cray-3: first gallium arsenide chips
  - Cray-4 failed → bankruptcy
- SRC Computers (see http://www.srccomp.com/about/aboutus.asp)
Slide 21
Biggest Machine Today
Sequoia, the IBM BlueGene/Q machine at the U.S. Dept. of Energy's Lawrence Livermore National Lab
Slide 22
Types of Parallelism
- Instruction-Level Parallelism (ILP)
  - Superscalar issue
  - Out-of-order execution
  - Very Long Instruction Word (VLIW)
- Thread-Level Parallelism (TLP)
  - Loop-level
  - Multithreading
    - Explicit
    - Speculative
    - Simultaneous/Hyperthreading
- Task-Level Parallelism
- Program-Level Parallelism
- Data-Level Parallelism
Slide 23
Parallelism in Sequential Programs
Programming model: C (sequential)
Architecture: superscalar
- ILP
- Communication through registers
- Synchronization through pipeline interlocks
Slide 24
```
for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C * a[i];
```
| Iteration     | 0    | 1    | …   | N-1    |
| ------------- | ---- | ---- | --- | ------ |
| Loop 1 writes | a[1] | a[2] | …   | a[0]   |
| Loop 2 reads  | a[0] | a[1] | …   | a[N-1] |

Data dependencies: every element Loop 2 reads is produced by some iteration of Loop 1, so Loop 2 cannot begin until Loop 1 has finished (or the two are carefully synchronized).
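To make this concrete, here is a direct C rendering of the pseudocode above (N, C, and the initialization values are invented for illustration); running it shows that every d[i] depends on a Loop 1 result:

```c
#include <stdio.h>

#define N 8
#define C 3.0

int main(void) {
    double a[N], b[N], c[N], d[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; a[i] = 0.0; }

    /* Loop 1: iteration i writes a[(i+1) mod N]; a[0] is written last. */
    for (int i = 0; i < N; i++)
        a[(i + 1) % N] = b[i] + c[i];

    /* Loop 2: iteration i reads a[i]; each read depends on some Loop 1
       write, so the two loops cannot run concurrently as written. */
    for (int i = 0; i < N; i++)
        d[i] = C * a[i];

    for (int i = 0; i < N; i++) printf("d[%d] = %g\n", i, d[i]);
    return 0;
}
```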
Parallel Programming Models
Extend semantics to express:
- Units of parallelism:
  - Instructions
  - Threads
  - Programs
- Communication and coordination between units via:
  - Registers
  - Memory
  - I/O
Slide 25
Model vs. Architecture
- Communication abstraction supports the model
- Communication architecture (ISA + comm/sync) implements part of the model
- The HW/SW boundary defines which parts of the communication architecture are implemented in hardware and which in software
Slide 26
[Figure: layers of abstraction, top to bottom]
- Parallel applications: CAD, databases, scientific modeling
- Programming models: multiprogramming, shared address, message passing, data parallel
- Communication abstraction (the user/system boundary)
- Compiler or library; operating system support
- Communication hardware (the hardware/software boundary)
- Physical communication medium
Shared Address Space Model
TLP
Communication/coordination among threads via shared global address space
Slide 27
```
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C * a[j];
```
Communication abstraction supported by the HW/SW interface
[Figure: P processors sharing a single memory]
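As a sketch of this model in practice, here is a minimal pthreads version of the same two loops (P threads over shared arrays; N, P, and the initialization are invented, and N is assumed divisible by P). The `pthread_barrier_wait` call plays exactly the role of `barrier;` in the pseudocode; compile with `-pthread`.

```c
#include <pthread.h>
#include <stdio.h>

#define N 8
#define P 4
#define C 3.0

static double a[N], b[N], c[N], d[N];  /* shared global address space */
static pthread_barrier_t bar;

static void *worker(void *arg) {
    long i = (long)arg;                      /* thread id, 0..P-1 */
    int lo = i * (N / P), hi = lo + N / P;   /* block i0[i]..in[i] */

    for (int j = lo; j < hi; j++)
        a[(j + 1) % N] = b[j] + c[j];

    pthread_barrier_wait(&bar);  /* all writes to a[] complete before any read */

    for (int j = lo; j < hi; j++)
        d[j] = C * a[j];
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }
    pthread_barrier_init(&bar, NULL, P);
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++) printf("d[%d] = %g\n", i, d[i]);
    return 0;
}
```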
Message Passing Model
Process-level parallelism (separate address spaces)
Communication/coordination via messages
Slide 28
```
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index := (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then
            send(a[index], (i+1) mod P);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then
            recv(tmp, (P+i-1) mod P);
            d[j] := C * tmp;
        else
            d[j] := C * a[j];
    end_for
```
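The same computation in MPI, as a sketch under the same assumptions (N divisible by P; invented initialization). Note that the message itself synchronizes the one cross-process dependence, so no explicit barrier is needed here:

```c
#include <mpi.h>
#include <stdio.h>

#define N 8
#define C 3.0

int main(int argc, char **argv) {
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int n = N / P, lo = rank * n;        /* this process's block of iterations */
    double a[N], b[N], c[N], d[N], tmp;  /* private, per-process copies */
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; a[i] = 0.0; }

    /* Loop 1: the last iteration produces the element the next process reads. */
    for (int j = lo; j < lo + n; j++) {
        int idx = (j + 1) % N;
        a[idx] = b[j] + c[j];
        if (j == lo + n - 1)
            MPI_Send(&a[idx], 1, MPI_DOUBLE, (rank + 1) % P, 0, MPI_COMM_WORLD);
    }

    /* Loop 2: the first element of the block arrives from the previous process. */
    for (int j = lo; j < lo + n; j++) {
        if (j == lo) {
            MPI_Recv(&tmp, 1, MPI_DOUBLE, (rank + P - 1) % P, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            d[j] = C * tmp;
        } else {
            d[j] = C * a[j];
        }
    }
    printf("rank %d: d[%d] = %g\n", rank, lo, d[lo]);
    MPI_Finalize();
    return 0;
}
```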
Data Parallelism (SIMD)
Programming model:
- Operations done in parallel on multiple data elements
- Single thread of control

Architectural model:
- Array of simple, cheap processors w/ little memory
- Attached to a control processor that issues instructions
- Specialized + general communication, cheap synchronization
Slide 29
[Figure: 3x3 grid of PEs driven by a control processor]
parallel (i:0->N-1) a[(i+1) mod N] := b[i] + c[i];parallel (i:0->N-1) d[i] := C * a[i];
Coarser-Grain Data Parallelism
Single-Program Multiple-Data
More broadly applicable than SIMD
Slide 30
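A minimal SPMD sketch (MPI-style; the strings and block numbering are invented): every process runs the same program, but each can take its own control path based on its rank, which is what makes SPMD more broadly applicable than lock-step SIMD.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    if (rank == 0)
        printf("coordinator: %d processes\n", P);  /* rank-dependent branch */
    printf("rank %d works on block %d\n", rank, rank);

    MPI_Finalize();
    return 0;
}
```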
Creating a Parallel Program
Identify work that can be done in parallel:
- Computation
- Data access
- I/O

Partition work/data among entities:
- Processes
- Threads

Manage data access, communication, and synchronization
Speedup(P) = Performance(P) / Performance(1) = Time(1) / Time(P)
(e.g., a program that takes 100 s on one processor and 25 s on four has Speedup(4) = 100/25 = 4)
Slide 31
Steps
Decomposition
Assignment
Orchestration
Mapping
Can be done by:
- Programmer
- Compiler
- Runtime
- Hardware (speculatively)
Slide 32
Parallelization
[Figure: sequential computation → tasks (decomposition) → processes (assignment) → parallel program (orchestration) → processors P0-P3 (mapping); decomposition and assignment are architecture-independent, orchestration and mapping are architecture-dependent]
Slide 33
Concepts
Task:
- Arbitrary piece of work from the computation
- Sequentially executed
- Could be fine- or coarse-grained

Process (or thread):
- What gets executed by a core
- Abstract entity that performs tasks assigned to it
- Processes communicate & synchronize to perform tasks

Processor (core):
- Physical engine on which processes run
- Virtualized machine view for the programmer
Slide 34
Decomposition
Purpose: break up the computation into tasks to be divided among processes
- Tasks may become available dynamically
- Number of available tasks may vary with time
- i.e., identify concurrency and decide the level at which to exploit it

Goal: keep processes busy, but keep management reasonable
- Number of tasks creates an upper bound on speedup (with T tasks, at most T can run concurrently)
- Too many tasks requires too much coordination
Slide 35
Assignment
Specify the mechanism for dividing work among processes:
- Strive for balance (a block vs. cyclic sketch follows this slide)
- Reduce communication and management overhead

Structured approach recommended:
- Inspect the code
- Apply well-known heuristics

Programmer focuses on decomposition/assignment first:
- Largely independent of architecture/programming model
- Choice of primitives (cost/complexity) affects decisions

Architects assume the program(mer) does a decent job
Slide 36
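To make the balance tradeoff concrete, here is a small C sketch (workers simulated sequentially; N, P, and work() are invented) contrasting the two most common ways to assign loop iterations:

```c
#include <stdio.h>

#define N 8
#define P 4

static void work(const char *scheme, int id, int j) {
    printf("%s: worker %d runs iteration %d\n", scheme, id, j);
}

int main(void) {
    for (int id = 0; id < P; id++) {
        /* Block: one contiguous chunk per worker -- good locality. */
        for (int j = id * (N / P); j < (id + 1) * (N / P); j++)
            work("block", id, j);

        /* Cyclic: every P-th iteration -- balances loops whose
           iterations have varying cost. */
        for (int j = id; j < N; j += P)
            work("cyclic", id, j);
    }
    return 0;
}
```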
Orchestration
Purpose:
- Name data, structure communication/synchronization
- Organize data structures, schedule tasks (temporally)

Goals:
- Reduce costs of comm/sync from the processor's point of view
- Improve data locality
- Reduce the overhead of managing parallelism

Choices depend heavily on the communication abstraction and the efficiency of its primitives
Architects must provide appropriate, efficient primitives
Slide 37
Mapping
Two aspects:
- Which processes run on the same processor
- Which process runs on which processor

One extreme: space sharing
- Partition the machine s.t. only 1 app at a time runs in a subset
- Pin processes to cores, or let the OS balance workloads (a pinning sketch follows this slide)

Another extreme:
- Complete resource management controlled by the OS
- Use performance techniques for dynamic balancing

The real world is between the two:
- User specifies desires in some aspects
- System may ignore them
Slide 38
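A sketch of the "pin processes to cores" option, using the Linux/glibc extension `pthread_setaffinity_np` (other operating systems expose different affinity APIs; a machine with at least one core 0 is assumed):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core (Linux, glibc extension). */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    if (pin_to_core(0) == 0)
        printf("pinned to core 0\n");  /* the OS will no longer migrate us */
    return 0;
}
```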
High-Level Goals
- High performance
- Low resource usage
- Low development effort
- Low power consumption

Implications for algorithm designers and architects:
- Algorithm designers: high performance, low resource needs
- Architects: high performance, low cost, reduced programming effort
Slide 39
Costs of Parallel Machines
Things to keep in mind when designing a machine . . .
- What does it cost to design the mechanism?
- What does it cost to verify?
- What does it cost to manufacture?
- What does it cost to test?
- What does it cost to program it?
- What does it cost to deploy (turn on)?
- What does it cost to keep it running (power costs, maintenance)?
- What does it cost to use it?
- What does it cost to dispose of it at the end of its lifetime? (how long is a "lifetime"?)
Slide 40