Computer Architecture Guidance

59
Computer Architecture Guidance Keio University AMANO, Hideharu hunga@am ics keio ac jp

description

Computer Architecture Guidance. Keio University AMANO, Hideharu hunga@am . ics . keio . ac . jp. Contents. Techniques on Parallel Processing Parallel Architectures Parallel Programming → On real machines Advanced uni-processor architecture → Special Course of Microprocessors - PowerPoint PPT Presentation

Transcript of Computer Architecture Guidance

Page 1: Computer Architecture Guidance

Computer   Architecture  Guidance

Keio University

AMANO, Hideharu

hunga@am . ics . keio . ac . jp

Page 2: Computer Architecture Guidance

Contents

Techniques on Parallel Processing Parallel Architectures Parallel Programming →   On real machines

Advanced uni-processor architecture→   Special Course of Microprocessors

(by Prof. Yamasaki, Fall term)

Page 3: Computer Architecture Guidance

Class

Lecture using Powerpoint: (70mins-90mins. ) The ppt file is uploaded on the web site

http://www.am.ics.keio.ac.jp, and you can down load/print before the lecture.

When the file is uploaded, the message is sent to you by E-mail.

Textbook: “Parallel Computers” by H.Amano (Sho-ko-do)  but too old….

Exercise (20mins-home work.) Simple design or calculation on design issues Sorry, it often becomes a home work.

Page 4: Computer Architecture Guidance

Evaluation

Exercise on Parallel Programming using GPU (50%) Caution! If the program does not run, the unit

cannot be given even if you finish all other exercises.

Exercise after every lecture (50%)

Page 5: Computer Architecture Guidance

TSUBAME2.0(Xeon+Tesla,Top500 2010/11 4th ) 天河一号 (Xeon+FireStream,2009/11 5th )

GPGPU(General-Purpose computing on Graphic ProcessingUnit)

※() 内は開発環境

Page 6: Computer Architecture Guidance

glossary 1 英語の単語がさっぱりわからんとのことなので用

語集を付けることにする。 この glossary は、コンピュータ分野に限り有効で

ある。英語一般の使い方とかなり異なる場合がある。 Parallel: 並列の 本当に同時に動かすことを意味

する。並列に動いているように見えることを含める場合を concurrent( 並行)と呼び区別する。概念的には concurrent > parallel である。

Exercise: ここでは授業の最後にやる簡単な演習を指す

GPU: Graphic ProcessingUnit   Cell Broadband Engine を使って来たが、昨年から GPU を導入した。今年は新型でより高速のを使う予定

Page 7: Computer Architecture Guidance

Computer   Architecture   1Introduction to Parallel Architectures

Keio University

AMANO, Hideharu

hunga@am . ics . keio . ac . jp

Page 8: Computer Architecture Guidance

Parallel Architecture

Purposes Classifications Terms Trends

A parallel architecture consists of multiple processing units which work simultaneously.→   Thread level parallelism

Page 9: Computer Architecture Guidance

Boundary between Parallel machines and Uniprocessors

ILP(Instruction   Level   Parallelism) A single Program Counter Parallelism Inside/Between instructions

TLP(Thread   Level   Parallelism) Multiple Program Counters Parallelism between processes and jobs

Uniprocessors

Parallel MachinesDefinition

Hennessy & Petterson’s Computer Architecture: A quantitative approach

Page 10: Computer Architecture Guidance

Multicore Revolution

Niagara 2

1. The end of increasing clock frequency1. Consuming power becomes too much.2. A large wiring delay in recent processes.3. The gap between CPU performance and memory latency

2. The limitation of ILP3. Since 2003, multicore and manycore have become popular.

Page 11: Computer Architecture Guidance

Increasing power consumption

Page 12: Computer Architecture Guidance

End of Moore’s Law in computer performance

1.25/year1.5/year=Moore’s Law

1.2/year

Page 13: Computer Architecture Guidance

Purposes of providing multiple processors

Performance A job can be executed quickly with multiple processors

Dependability If a processing unit is damaged, total system can be

available : Redundant systems Resource sharing

Multiple jobs share memory and/or I/O modules for cost effective processing : Distributed systems

Low energy High performance with Low frequency operation

Parallel Architecture: Performance Centric!

Page 14: Computer Architecture Guidance

glossary 2

Simultaneously: 同時に、という意味で in parallel とほとんど同じだが、ちょっとニュアンスが違う。 in parallel だと同じようなことを同時にやる感じがするが、 simultaneously だととにかく同時にやればよい感じがする。

Thread: プログラムの一連の流れのこと。 Thread level parallelism ( TLP) は、 Thread 間の並列性のことで、ここではHennessy and Patterson のテキストに従って PC が独立している場合に使うが違った意味に使う人も居る。これに対して PC が単一で命令間にある並列性を ILP と呼ぶ

Dependability: 耐故障性、 Reliability( 信頼性) , Availability( 可用性)双方を含み、要するに故障に強いこと。 Redundant system は冗長システムのことで、多めに資源を持つことで耐故障性を上げることができる。

Distributed system: 分散システム、分散して処理することにより効率的に処理をしたり耐故障性を上げたりする

Page 15: Computer Architecture Guidance

Flynn’s Classification

The number of Instruction   Stream :  M(Multiple)/S(Single)

The number of Data   Stream : M/S SISD

Uniprocessors ( including Super scalar 、 VLIW ) MISD : Not existing ( Analog   Computer ) SIMD MIMD

He gave a lecture at Keio on the last May.

Page 16: Computer Architecture Guidance

SIMD (Single Instruction StreamMultiple Data Streams

Instruction

InstructionMemory

Processing Unit

Data memory

•All Processing Units executes the same instruction•Low degree of flexibility•Illiac-IV/MMX instructions/ClearSpeed/IMAP/GP-GPU( coarse grain )•CM-2, ( fine grain )

Page 17: Computer Architecture Guidance

Two types of SIMD Coarse grain : Each node performs floating point n

umerical operations Old SuperComputers: ILLIAC-IV , BSP,GF-11 Multimedia instructions in recent high-end CPUs Accelerator: GPU, ClearSpeed Dedicated on-chip approach: NEC’s IMAP

Fine grain : Each node only performs a few bits operations

ICL   DAP,   CM-2 , MP-2 Image/Signal Processing Connection Machines  ( CM-2) extends the application

to Artificial Intelligence (CmLisp)

Page 18: Computer Architecture Guidance

TSUBAME2.0(Xeon+Tesla,Top500 2010/11 4th ) 天河一号 (Xeon+FireStream,2009/11 5th )

GPGPU(General-Purpose computing on Graphic ProcessingUnit)

※() 内は開発環境

Page 19: Computer Architecture Guidance

PBSM PBSM

Thread Processors

PBSM PBSM

Thread Processors

PBSM PBSM

Thread Processors

PBSM PBSM

Thread Processors

PBSM PBSM

Thread Processors

Thread Execution Manager

Input Assembler

Host

Load/Store

Global Memory

GeForceGTX280240 cores

Page 20: Computer Architecture Guidance

GPU(NVIDIA’s GTX580)

512 GPU cores ( 128 X 4 )768 KB L2 cache40nm CMOS 550 mm^2

128 Cores128 Cores 128 Cores128 Cores

128 Cores128 Cores 128 Cores128 Cores

L2 CacheL2 Cache

Page 21: Computer Architecture Guidance

LPA is consisting of 16 PE Groups

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

PE

G

Interface UnitS

DRAM IF

CPU IF

Background transfer control

Bus IF

Inter-PE data selector

Linear Processor Array (LPA)

Data cache 2KB

Inst. cache 32KB

GR

16bx32

Instruction

Fetch(4w/clk)

ALUMUL

Control Processor (CP) P

E 

status

PE data

Wired OR logic

64b

64b

16b

16b

16b

64b

64b

IMAP-CE

IMAP-CE

Control Processor (CP)

16b

64b

8bx16

8b1b

8b

PE

PE

PE

PE

PE

PE

PE

PE

SR

SR

SR

SR

SR

SR

SR

SR

Video data in

Video data out

Page 22: Computer Architecture Guidance

ClearSpeedCSX600

FP

MU

LF

PA

DD

DIV

/SQ

RT

MA

CA

LU

Reg File

SRAM

PIO

FP

MU

LF

PA

DD

DIV

/SQ

RT

MA

CA

LU

Reg File

SRAM

PIO

FP

MU

LF

PA

DD

DIV

/SQ

RT

MA

CA

LU

Reg File

SRAM

PIO

PIO Collection/Distribution

Poly Execution Unit

Poly MCodedControl

Poly LSControl

Poly PIOControl

Poly Scoreboard

Mono Exec Unit

Thread Cont.

Semaphore Unit

D Cache D CacheControlDebug

96 Execution Unitswhich work at 250MHz

Page 23: Computer Architecture Guidance

GRAPE-DR

Kei Hiraki “GRAPE-DR” http://www.fpl.org (FPL2007)

Page 24: Computer Architecture Guidance

Renesas   MTXI/

O I

nte

rfac

e

..

..

PEPEPEPEPEPEPEPE

PEPE

..

Inst. memory Controller

Data Register 1Data Register 0

Inst. Pointer1Pointer0

256bit 256bit4096bit

2048PEs

SE

L

2b-ALU

Valid OH-ch

V-ch H-chPE structure

Page 25: Computer Architecture Guidance

The future of SIMD

Coarse grain SIMD GPGPU became a main stream of accelerators Other SIMD accelerators: CS600, GRAPE - DR Multi-media instructions will be used in the future.

Fine grain SIMD Advantageous to specific applications like image

processing On-chip accelerator General purpose machines are difficult to be built

ex.CM2  →  CM5

Page 26: Computer Architecture Guidance

MIMD

Processors

Memory modules (Instructions ・ Data )

•Each processor executes individual instructions•Synchronization is required•High degree of flexibility•Various structures are possible

Interconnectionnetworks

Page 27: Computer Architecture Guidance

Classification of MIMD machines Structure of shared memory UMA(Uniform   Memory   Access   Model )

provides shared memory which can be accessed from all processors with the same manner.

NUMA(Non-Uniform   Memory   Access   Model )provides shared memory but not uniformly accessed.

NORA/NORMA ( No   Remote   Memory   Access   Model )provides no shared memory. Communication is done with

message passing.

Page 28: Computer Architecture Guidance

UMA The simplest structure of shared memory machin

e The extension of uniprocessors OS which is an extension for single processor ca

n be used. Programming is easy. System size is limited.

Bus connected Switch connected

A total system can be implemented on a single chip

On-chip multiprocessorChip multiprocessorSingle chip multiprocessorIBM Power 5 NEC/ARM chip multiprocessor for embedded systems

Page 29: Computer Architecture Guidance

An example of UMA : Bus connected

PU PU

SnoopCache

PU

SnoopCache

PU

SnoopCache

Main   Memory

  shared   bus

SnoopCache

SMP(Symmetric MultiProcessor)

On chip multiprocessor

Note that   it is a logical image

Page 30: Computer Architecture Guidance

MPCore (ARM+NEC)

CPUinterface

Timer

WdogCPU

interfaceTimer

WdogCPU

interfaceTimer

WdogCPU

interfaceTimer

Wdog

CPU/VFP

L1 Memory

CPU/VFP

L1 Memory

CPU/VFP

L1 Memory

CPU/VFP

L1 Memory

Interrupt Distributor

Snoop Control Unit (SCU) CoherenceControl Bus

DuplicatedL1 Tag

IRQ IRQ IRQ IRQ

PrivateFIQ Lines

PrivatePeripheralBus

L2 Cache

PrivateAXI R/W64bit Bus

SMP for Embeddedapplication

Page 31: Computer Architecture Guidance

SUN T1Core

Core

Core

Core

Core

Core

Core

Core

CrossbarSwitch

FPU

L2Cachebank

Directory

L2Cachebank

Directory

L2Cachebank

Directory

L2Cachebank

DirectorySingle issue six-stage pipelineRISC with 16KB Instruction cache/8KB Data cache for L1

Total 3MB, 64byte Interleaved

Memory

Page 32: Computer Architecture Guidance

Multi-Core (Intel’s Nehalem-EX)

8 CPU cores24MB L3 cache45nm CMOS 600 mm^2

CPUCPU

CPUCPU

CPUCPU

CPUCPU

CPUCPU

CPUCPU

CPUCPU

CPUCPUL3 CacheL3 Cache

L3 CacheL3 Cache

Page 33: Computer Architecture Guidance

Heterogeneous vs. Homogeneous Homogeneous: consisting of the same processing

elements A single task can be easily executed in parallel. Unique programming environment

Heterogeneous: consisting of various types of processing elements Mainly for task-level parallel processing High performance per cost Most recent high-end processors for cellular phone use this

structure However, programming is difficult.

Page 34: Computer Architecture Guidance

NEC MP211

ARM926PE0

ARM926PE1

ARM926PE2

SPX-K602DSP

DMAC USB OTG

3D Acc.

Rot-ater.

ImageAcc.

CamDTVI/F.

LCDI/F

AsyncBridge0

AsyncBridge1

APBBridge0

IIC UART

TIM1

TIM2

TIM3

WDT

Mem. card

PCM

APBBridge1

Bus Interface

Scheduler

SDRAMController

SRAMInterface

On-chip SRAM

(640KB)

PLL OSC

Inst.RAM

PMU

INTC TIM0GPIO SIO

Sec.Acc.

SMU uWIRE

CameraLCD

FLASH DDR SDRAM

Multi-Layer AHB

Heterogeneous type UMA

Page 35: Computer Architecture Guidance

NUMA Each processor provides a local memory,

and accesses other processors’ memory through the network.

Address translation and cache control often make the hardware structure complicated.

Scalable : Programs for UMA can run without modification. The performance is improved as the system size.

Competitive to WS/PC clusters with Software DSM

Page 36: Computer Architecture Guidance

Typical structure of NUMA

Node  1

Node   2

Node  3

Node  0 0

InterconnectonNetwork

Logical address space

Page 37: Computer Architecture Guidance

Classification of NUMA Simple NUMA :

Remote memory is not cached. Simple structure but access cost of remote memory is large.

CC-NUMA : Cache   Coherent Cache consistency is maintained with hardware. The structure tends to be complicated.

COMA:Cache   Only   Memory   Architecture No home memory Complicated control mechanism

Page 38: Computer Architecture Guidance

Cray’s T3D: A simple NUMA supercomputer (1993)

Using

Alpha 21064

Page 39: Computer Architecture Guidance

The Earth simulator(2002)  Simple NUMA

Page 40: Computer Architecture Guidance

From IBM   web site

The fastest computerAlso simple NUMA

Page 41: Computer Architecture Guidance

Cell ( IBM/SONY/Toshiba )

SXU

LS

DMA

PXU

L1 C

L2 C

MIC

BIC

ExternalDRAM

Flex I/O

EIB: 2+2 Ring Bus

CPU Core IBM Power2-way superscalar, 2-thread

SPE:Synergistic Processing Element(SIMD core)128bit(32bit X 4)2 way superscalar

32KB+32KB

512KB

PPE

512KB Local Store

SXU

LS

DMA

SXU

LS

DMA

SXU

LS

DMA

SXU

LS

DMA

SXU

LS

DMA

SXU

LS

DMA

SXU

LS

DMA

The LS of SPEsare mapped onthe same addressspace of the PPE

Page 42: Computer Architecture Guidance

SGI   Origin

Hub  Chip

Main   Memory

Network

Bristled   Hypercube

Main   Memory is connected directly with Hub   Chip1 cluster consists of 2 PE.

Page 43: Computer Architecture Guidance

SGI’s CC-NUMA Origin3000(2000)

Using R12000

Page 44: Computer Architecture Guidance

TRIPS

N NNNN

NNNNN N

NM MM MM M

MM MM MM M

NNNN

NNNN

IIII

IIIII

N M

M N

M N

DDDD

DDDDG

GE E E EEEE

E E EE E EE E E

E E E EEEE

E E EE E EE E E

R R R R

R R R RDMA C2C

SD EBCDMA

SDSD

DMA

C2C

TRIPS L2 Cache(OCN)

TRIPS processor 0(OPN)

TRIPS processor 1(OPN)

Register Tile

Execution Tile

Instruction cache Tile

Data cache Tile

Global Control Tile

Network Tile

Memory Tile

DDRAM ControllerDMA ControllerChip to chip InterfaceOCN interconnect

OPN Interconnect

Page 45: Computer Architecture Guidance

DDM(Data   Diffusion  Machine )

... ... ... ...

Page 46: Computer Architecture Guidance

NORA/NORMA

No shared memory Communication is done with message

passing Simple structure but high peak performance

Cost effective solution.

Hard for programming

Inter-PU communications Cluster computing

Tile Processors: On-chip NORMA for embedded applications

Page 47: Computer Architecture Guidance

Early Hypercube machine nCUBE2

Page 48: Computer Architecture Guidance

Fujitsu’s NORA AP1000(1990)

Mesh connection SPARC

Page 49: Computer Architecture Guidance

Intel’s Paragon XP/S(1991)

Mesh connection i860

Page 50: Computer Architecture Guidance

PC Cluster

Beowulf Cluster (NASA’s Beowulf Projects 1994, by Sterling) Commodity components TCP/IP Free software

Others Commodity components High performance networks like Myrinet / Infiniban

d Dedicated software

Page 51: Computer Architecture Guidance

RHiNET-2 cluster

Page 52: Computer Architecture Guidance

Tilera’s Tile64

Tile Pro, Tile Gx

Linux runs ineach core.

Page 53: Computer Architecture Guidance

Intel 80-Core Chip

Intel 80-core chip [Vangal,ISSCC’07]

Page 54: Computer Architecture Guidance

Multi-core + Accelerator

Intel’s Sandy Bridge AMD’s Fusion APU

I / O

System Agent

Core4

Core3

Core2

Core1

LLC

LLC

LLC

LLC

GPU

GPU 1

Video Decoder

Core 1

GPU 2 Core 2

memorycontroller

PlatformInterface

Page 55: Computer Architecture Guidance

glossary 3 Flynn’s Classification: Flynn(Stanford 大の教授)が論文中に用いた分類、内容

は本文を参照のこと Coarse grain :粗粒度、この場合はプロセッシングエレメントが浮動小数演算

が可能な程度大きいこと。反対が Fine grain (細粒度)で、数ビットの演算しかできないもの

Illiac-IV, BSP, GF-11, Connection Machine   CM-2 , MP-2 などはマシン名。SIMD の往年の名機

Synchronization: 同期、 Shared Memory: 共有メモリ、この辺は後の授業で詳細を解説する

Message passing: メッセージ交換。共有メモリを使わずにデータを直接交換する方法

Embedded System: 組み込みシステム Homogeneous: 等質な  Heterogeneous :性質の異なったものから成る Coherent Cache: 内容の一貫性が保障されたキャッシュ、 Cache Consistency

は内容の一貫性、これも後の授業で解説する Commodity   Component:  標準部品、価格が安く入手が容易 Power 5, Origin2000, Cray XD-1,AP1000,NCUBE などもマシン名。 The earth

simulator は地球シミュレータ ,IBM BlueGene/L は現在のところ最速

Page 56: Computer Architecture Guidance

Terms(1)

Multiprocessors : MIMD machines with shared memory ( Strict definition :by Enslow J

r.) Shared memory Shared I/O Distributed OS Homogeneous

Extended definition: All parallel machines ( Wrong usage )

Page 57: Computer Architecture Guidance

Terms ( 2) Multicomputer

MIMD machines without shared memory, that is NORA/NORMA

Arraycomputer A machine consisting of array of processing elements :

SIMD A supercomputer for array calculation (Wrong usage)

Loosely coupled ・ Tightly coupled Loosely coupled: NORA, Tightly coupled: UMA But, are NORAs really loosely coupled??

Don’t use if possible

Page 58: Computer Architecture Guidance

Stored programmingbased

Others

SIMDFine grain Coarse grain 

MIMD

Bus connected UMASwitch connected UMA

NUMASimple NUMACC-NUMACOMA

NORA

Systolic architectureData flow architectureMixed controlDemand driven architecture

Multiprocessors

Multicomputers

Classification

Page 59: Computer Architecture Guidance

Exercise 1

In 2011, Japanese supercomputer K got the award of “world fastest computer”.

Which type is K classified into ? Why do you think so? The reason is important!

If you take this class, send the answer with your name and student number to [email protected]

You can use either Japanese or English.