CS 5234 – Spring 2013
Advanced Parallel Computing

Architecture

Yong Cao

Source: people.cs.vt.edu/yongcao/teaching/cs5234/spring2013/slides/Lecture2.pdf

Goals

Ø Sequential Machine and Von-Neumann Model
Ø Parallel Hardware
    Ø Distributed vs Shared Memory
Ø Architecture Classes
    Ø Multiple-core
    Ø Many-core (massive parallel)
Ø NVIDIA GPU Architecture

Von-Neumann Machine (VN)

Ø PC: Program counter
Ø MAR: Memory address register
Ø MDR: Memory data register
Ø IR: Instruction register
Ø ALU: Arithmetic Logic Unit
Ø Acc: Accumulator

[Figure: Von Neumann machine with PC, MAR, Acc, MDR, instruction fields OP and ADDRESS, Decoder, ALU, and MEMORY]

Sequential Execution and Instruction Cycle

Ø The six phases of the instruction cycle:
    Ø Fetch
    Ø Decode
    Ø Evaluate Address
    Ø Fetch Operands
    Ø Execute
    Ø Store Result

Sequential Execution and Instruction Cycle

Ø Fetch
    Ø MAR ← PC
    Ø MDR ← MEM[MAR]
    Ø IR ← MDR

Sequential Execution and Instruction Cycle

Ø Decode
    Ø DECODER ← IR.OP

Sequential Execution and Instruction Cycle

Ø Evaluate Address
    Ø MAR ← IR.ADDR
    Ø MDR ← MEM[MAR]

Sequential Execution and Instruction Cycle

Ø Execute
    Ø Acc ← Acc + MDR

Sequential Execution and Instruction Cycle

Ø Store Result
    Ø MDR ← Acc
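The register transfers on the preceding slides can be traced in a small Python sketch of an accumulator machine. This is only an illustration under stated assumptions: the opcodes LOAD, ADD, STORE, and HALT, the PC increment after fetch, and the final write MEM[MAR] ← MDR are not shown on the slides and are introduced here so that the cycle can run end to end.

```python
from dataclasses import dataclass

# Hypothetical opcodes for this sketch; the slides only show an add in Execute.
LOAD, ADD, STORE, HALT = "LOAD", "ADD", "STORE", "HALT"


@dataclass
class Machine:
    memory: list            # holds (OP, ADDRESS) instructions and integer data
    pc: int = 0             # program counter
    mar: int = 0            # memory address register
    mdr: object = None      # memory data register
    ir: tuple = (HALT, 0)   # instruction register: (OP, ADDRESS)
    acc: int = 0            # accumulator

    def step(self):
        """Run one instruction cycle; return False once HALT is fetched."""
        # Fetch: MAR <- PC, MDR <- MEM[MAR], IR <- MDR
        self.mar = self.pc
        self.mdr = self.memory[self.mar]
        self.ir = self.mdr
        self.pc += 1  # assumption: PC advances after fetch (not on the slides)

        # Decode: DECODER <- IR.OP
        op, addr = self.ir
        if op == HALT:
            return False

        # Evaluate address: MAR <- IR.ADDR
        self.mar = addr

        if op in (LOAD, ADD):
            # Fetch operands: MDR <- MEM[MAR]
            self.mdr = self.memory[self.mar]
            # Execute: Acc <- MDR (LOAD) or Acc <- Acc + MDR (ADD)
            self.acc = self.mdr if op == LOAD else self.acc + self.mdr
        elif op == STORE:
            # Store result: MDR <- Acc, then MEM[MAR] <- MDR (assumed write-back)
            self.mdr = self.acc
            self.memory[self.mar] = self.mdr
        else:
            raise ValueError(f"unknown opcode: {op}")
        return True


def run(memory):
    """Execute the program in a copy of `memory` until HALT."""
    machine = Machine(memory=list(memory))
    while machine.step():
        pass
    return machine
```

For example, running `[(LOAD, 4), (ADD, 5), (STORE, 6), (HALT, 0), 2, 3, 0]` adds the values at addresses 4 and 5 and stores the sum at address 6. Every instruction passes through the same sequential cycle, one at a time.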

Sequential Execution and Instruction Cycle

Ø Register File

[Figure: machine with PC, MAR, Register File, instruction fields OP and ADDRESS, Decoder, ALU, and MEMORY]

Sequential Execution and Instruction Cycle

[Figure: diagram of PC, IR, ALU, Register File, and MEMORY]

Parallel Hardware

Ø Shared vs Distributed Memory

[Figure: shared memory; several processors, each with its own PC, IR, ALU, and Register File, share a single MEMORY]

Ø Multi-Core and Many-Core Architecture

Parallel Hardware

Ø Shared vs Distributed Memory

[Figure: distributed memory; several processors, each with its own PC, IR, ALU, Register File, and MEMORY]

Ø Cluster Computing, Grid Computing, Cloud Computing
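The two memory organizations can be contrasted in a short Python sketch. It is a model only: threads in one process stand in for processors, a `queue.Queue` stands in for the interconnect, and the function names are made up for illustration. A real distributed-memory system would run on separate machines with a message-passing library.

```python
import queue
import threading


def shared_memory_sum(data, workers=3):
    """Shared memory: all workers read `data` and write into one shared list."""
    partial = [0] * workers                      # the single shared MEMORY

    def work(rank):
        # Each worker writes only its own slot, so no lock is needed here.
        partial[rank] = sum(data[rank::workers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def distributed_memory_sum(data, workers=3):
    """Distributed memory: each worker owns a private copy of its slice and
    communicates its result only by sending a message."""
    inbox = queue.Queue()                        # stands in for the network

    def work(local_data):                        # local_data: private MEMORY
        inbox.put(sum(local_data))               # explicit send

    threads = [
        threading.Thread(target=work, args=(list(data[r::workers]),))
        for r in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(workers))   # explicit receives
```

Both functions compute the same sum. The difference is how results reach the caller: through a memory location every worker can address, or through explicit messages.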

Multi-Core vs Many-Core

Ø Definition of Core – Independent ALU
    Ø How about a vector processor?
    Ø SIMD: E.g. Intel’s SSE.
Ø How many is “many”?
    Ø What if there are too “many” cores in the Multi-core design?
    Ø Shared control logic (PC, IR, Schedule)

Multi-Core

Ø Each core has its own control (PC and IR)

Many-Core

Ø A group of cores shares the control (PC, IR and Thread Scheduling)
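The difference in control can be sketched in Python. This is a sequential simulation with hypothetical function names, where an "instruction" is modeled as a Python function applied to a value. It only shows where the program counter lives, not real hardware timing.

```python
def run_multi_core(programs, inputs):
    """Multi-core: each core has its own PC and its own instruction stream."""
    results = []
    for program, value in zip(programs, inputs):
        pc = 0                              # private program counter
        while pc < len(program):
            value = program[pc](value)      # private instruction fetch
            pc += 1
        results.append(value)
    return results


def run_many_core(program, inputs):
    """Many-core: a group of cores (lanes) shares one PC and one fetch."""
    lanes = list(inputs)
    pc = 0                                  # single shared program counter
    while pc < len(program):
        instruction = program[pc]           # fetched once for the whole group
        lanes = [instruction(v) for v in lanes]   # all lanes run in lockstep
        pc += 1
    return lanes
```

With `double = lambda x: x * 2` and `inc = lambda x: x + 1`, `run_many_core([double, inc], [1, 2, 3])` applies the same two instructions to every lane, while `run_multi_core` accepts a different program per core.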

NVIDIA Fermi Architecture

16 Streaming Multiprocessors (SMs)
32 cores per SM

Fermi SM

Execution in an SM

A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. This figure shows how instructions are issued to the execution blocks.
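The unit counts above lend themselves to a quick back-of-the-envelope calculation, sketched here in Python. Two assumptions are not stated on the slides: a warp holds 32 threads (NVIDIA's warp size), and each unit in an execution block handles one thread per cycle. The constant and function names are illustrative.

```python
SM_COUNT = 16         # from the Fermi slide
CORES_PER_SM = 32     # from the Fermi slide
WARP_SIZE = 32        # assumption: threads per warp, not stated on the slides

# Units in each of the four execution blocks of one SM, as listed above.
EXECUTION_BLOCKS = {
    "cores_block_0": 16,
    "cores_block_1": 16,
    "special_function_units": 4,
    "load_store_units": 16,
}


def total_cores():
    """Cores on the whole chip."""
    return SM_COUNT * CORES_PER_SM


def cycles_per_warp(units):
    """Cycles one warp instruction occupies a block of `units` units,
    assuming one thread per unit per cycle (ceiling division)."""
    return -(-WARP_SIZE // units)
```

Under these assumptions the chip has 512 cores, a warp instruction occupies a 16-core block for 2 cycles, and it occupies the 4 Special Function Units for 8 cycles. Two warps issued to the two 16-core blocks account for the 32 instructions per cycle mentioned above.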

Data Parallel

Ø Data Parallel vs Task Parallel
    Ø What to partition? Data or Task?
Ø Massive Data Parallel
    Ø Millions (or more) of threads
    Ø Same instruction, different data elements
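The "what to partition" question can be illustrated with Python's standard `concurrent.futures` module. This sketch shows the partitioning only: Python threads do not give the speedup of a GPU's millions of threads, and the function names are chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def data_parallel(func, data, workers=4):
    """Partition the data: every worker applies the same function
    to different elements."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, data))    # results keep the input order


def task_parallel(tasks, data, workers=4):
    """Partition the work: each worker runs a different task
    on the same data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, data) for task in tasks]
        return [future.result() for future in futures]
```

For instance, `data_parallel(lambda x: x * x, [1, 2, 3, 4])` squares each element independently, while `task_parallel([min, max, sum], [1, 2, 3, 4])` runs three different computations side by side.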

Computing on GPUs

Ø Stream processing and Vectorization (SIMD)

[Figure: an Input Stream passes through a SIMD unit driven by Instructions to produce an Output Stream]
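A minimal Python sketch of the SIMD idea follows. It assumes a width of four lanes, as with four 32-bit floats in a 128-bit SSE register. The function name and the returned issue count are illustrative, and the lanes are simulated one after another rather than executed in hardware.

```python
SIMD_WIDTH = 4   # assumption: four lanes, e.g. 4 x 32-bit floats in SSE


def simd_apply(instruction, input_stream, width=SIMD_WIDTH):
    """Apply one instruction to the stream, `width` elements per issue.

    Returns the output stream and the number of vector instructions issued.
    """
    output_stream = []
    issued = 0
    for start in range(0, len(input_stream), width):
        vector = input_stream[start:start + width]   # one vector register
        # A single instruction operates on every lane of the vector.
        output_stream.extend(instruction(x) for x in vector)
        issued += 1
    return output_stream, issued
```

A stream of 10 elements needs 3 vector issues at width 4, compared with 10 scalar instructions. The last vector is only partly filled.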

GPU Programming Model: Stream

Ø Stream Programming Model
    Ø Streams:
        Ø An array of data units
    Ø Kernels:
        Ø Take streams as input, produce streams at output
        Ø Perform computation on streams
        Ø Kernels can be linked together

[Figure: Stream → Kernel → Stream]
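The stream model above can be sketched with Python generators. The helper names `make_kernel` and `link` are hypothetical. A kernel here is any function that consumes one stream and yields another, which is one possible reading of the model rather than a GPU API.

```python
def make_kernel(func):
    """Wrap a per-element function as a kernel: stream in, stream out."""
    def kernel(stream):
        for element in stream:
            yield func(element)
    return kernel


def link(*kernels):
    """Link kernels so that each kernel's output stream feeds the next."""
    def pipeline(stream):
        for kernel in kernels:
            stream = kernel(stream)
        return stream
    return pipeline
```

For example, with `scale = make_kernel(lambda x: x * 2)` and `offset = make_kernel(lambda x: x + 1)`, `list(link(scale, offset)([1, 2, 3]))` pushes each element through both kernels in turn. Because generators are lazy, an element produced by one kernel is consumed by the next right away, without an intermediate array.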

Why Streams?

Ø Ample computation by exposing parallelism
    Ø Streams expose data parallelism
        Ø Multiple stream elements can be processed in parallel
    Ø Pipeline (task) parallelism
        Ø Multiple tasks can be processed in parallel
Ø Efficient communication
    Ø Producer-consumer locality
    Ø Predictable memory access pattern
Ø Optimize for throughput of all elements, not latency of one
    Ø Processing many elements at once allows latency hiding