Introduction to Multi-Core - Rev2 - UMass

Reach To Teach

Intel Higher Education Program & Foundation for Advancement of Education and Research (FAER)

Introduction to Multi-Core

Baskaran Ganesan

[email protected]

Sr. Design Engineer, Digital Enterprise Group, Intel Corporation



Topics

1. CPU (semiconductor) HISTORY (SESSION-1)

a. Moore’s Law

b. Transistor scaling

c. Scaling limitations & impact

d. What then?

- Dual core

e. The new era

2. ARCHITECTURE (SESSION-2)

a. Core Architecture

- Core basics, Platform architecture, Core architecture

b. Multi-core architecture

c. Multi-core challenges

d. Closing notes



Moore’s Law



Moore’s law at work

Compute Power

SW/IT eco-system

Volume Market

CPU Cost

Manufacturing technology

CPU Arch technology

Transistor Size Transistor Count



Historical Driving Forces

[Figure: Shrinking Geometry. Feature Size (um) versus year, 1970 to 2020, log scale from 0.01 to 10.]

[Figure: Increased Frequency. Frequency (MHz) versus year, 1970 to 2020, log scale from 1 to 100000.]

Milestones marked on the charts:

1971: 4004 Processor, 2300 Transistors

1978: 8008 Processor, IBM PC

1986: i386 Processor, 32-bit

1993: Pentium Processor, 3.1M transistors

2005: Montecito, 1.7B Transistors



Scale Factors (loosely defined)

Voltage scale-factor: Rate at which the transistor voltage decreases with respect to a change in transistor dimensions

Frequency scale-factor: Rate at which the transistor frequency increases with respect to a change in transistor dimensions

Cost scale-factor: Rate at which the per-transistor cost decreases with respect to a change in transistor dimensions

Count scale-factor: Rate at which the transistor count increases with respect to a change in transistor dimensions



Scaling: More data



The Act of Balancing

Delivered Performance = Instructions Per Cycle (IPC) * Frequency

Power α Cdynamic * V * V * Frequency

Goal is higher performance and lower power
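
The two relations on this slide can be tried out numerically. The sketch below is illustrative only: the function names and the normalized inputs are assumptions that do not appear in the slides, and because the power relation is a proportionality, only ratios between configurations are meaningful.

```python
def delivered_performance(ipc: float, frequency: float) -> float:
    """Delivered Performance = IPC * Frequency."""
    return ipc * frequency


def relative_dynamic_power(c_dynamic: float, voltage: float, frequency: float) -> float:
    """Power is proportional to Cdynamic * V * V * Frequency (a ratio, not watts)."""
    return c_dynamic * voltage * voltage * frequency


# Hypothetical normalized baseline: every input set to 1.0.
base_perf = delivered_performance(1.0, 1.0)
base_power = relative_dynamic_power(1.0, 1.0, 1.0)

# Doubling frequency at a fixed voltage doubles performance and power alike.
fast_perf = delivered_performance(1.0, 2.0)
fast_power = relative_dynamic_power(1.0, 1.0, 2.0)

# Doubling IPC at a fixed frequency doubles performance at the same power,
# under the optimistic assumption that Cdynamic does not grow.
wide_perf = delivered_performance(2.0, 1.0)
wide_power = relative_dynamic_power(1.0, 1.0, 1.0)

print("frequency x2:", fast_perf / base_perf, fast_power / base_power)
print("IPC x2:      ", wide_perf / base_perf, wide_power / base_power)
```

In practice a higher frequency usually also needs a higher voltage, which is why the frequency route costs more power than this fixed-voltage case suggests.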



386 Processor: May 1986 @ 16 MHz core; 275,000 1.5µ transistors; ~1.2 SPECint2000

Pentium® 4 Processor: August 27, [email protected] GHz core; 55 Million 0.13µ transistors; 1249 SPECint2000

17 Years: 200x frequency; 200x transistors at 11x smaller feature size; 1000x SPECint2000

Scaling at its best
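
The scaling multiples quoted for the 386 and the Pentium 4 can be checked from the transistor counts, feature sizes, and SPECint2000 scores given on the slide. This is a small arithmetic sketch; the variable names are made up for illustration, and the frequency ratio is left out because the Pentium 4 clock is not legible in this transcript.

```python
# Figures as quoted on the slide.
i386 = {"transistors": 275_000, "feature_um": 1.5, "specint2000": 1.2}
pentium4 = {"transistors": 55_000_000, "feature_um": 0.13, "specint2000": 1249}

# Ratio of the newer part to the older one for each metric.
transistor_ratio = pentium4["transistors"] / i386["transistors"]
feature_shrink = i386["feature_um"] / pentium4["feature_um"]
performance_ratio = pentium4["specint2000"] / i386["specint2000"]

print(f"transistor count: {transistor_ratio:.0f}x")
print(f"feature size:     {feature_shrink:.1f}x smaller")
print(f"SPECint2000:      {performance_ratio:.0f}x")
```

The results line up with the slide's rounded "200x", "11x", and "1000x".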



Architectural Innovations

• Serial, sequential execution

• Overlapped execution (pipelining)

• Multi-stage, deep pipelining

• Control-speculative execution

• Data-speculative execution

• Super-scalar execution

• Out-of-order execution

• Vector computing

• Addressing extensions

• Application specific instructions

• Multi-level on-chip caching

• Memory disambiguation

• Register renaming

• Score-boarding

• Hardware data prefetching

• …

Many decades of computer architecture focused on Instruction-Level Parallelism (ILP) enhancement



The Challenges

[Figure: Diminishing Voltage Scaling. Supply Voltage (V) versus year (1990, 1993, 1997, 2001, 2005, 2009), log scale from 0.1 to 10, across process generations 0.7um, 0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm, 30nm. Annotations: ~30%, slowing.]

Power Limitations

Power = Capacitance x Voltage^2 x Frequency

also

Power ~ Voltage^3



Heat Dissipation

[Figure: Power Density (W/cm2) versus year, '70 to '10, log scale from 1 to 10,000, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processors, with projected values. Reference levels: Hot Plate, Nuclear Reactor, Rocket Nozzle, Sun's Surface.]



What then?

Relative single-core frequency and Vcc:

Max Frequency: Power 1.00x, Performance 1.00x

Over-clocked (+20%): Power 1.73x, Performance 1.13x

Under-clocked (-20%): Power 0.51x, Performance 0.87x

Dual-Core (-20%): Power 1.02x, Performance 1.73x

Multi-Core: Energy-Efficient Performance
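
The power figures for over-clocking, under-clocking, and dual-core follow from the earlier relation Power ~ Voltage^3, if supply voltage is assumed to track frequency. The sketch below reproduces them under that assumption. The function name is illustrative, the single-core performance values are copied from the slide rather than derived, and the dual-core performance estimate assumes a perfectly parallel workload.

```python
def relative_power(freq_scale: float) -> float:
    """Power ~ f * V^2; assuming V scales with f, power ~ f^3."""
    return freq_scale ** 3


# Single-core performance at each relative frequency, as quoted on the slide.
SLIDE_PERFORMANCE = {1.2: 1.13, 1.0: 1.00, 0.8: 0.87}

over_clocked_power = relative_power(1.2)        # +20% frequency and Vcc
under_clocked_power = relative_power(0.8)       # -20% frequency and Vcc
dual_core_power = 2 * relative_power(0.8)       # two under-clocked cores
dual_core_perf = 2 * SLIDE_PERFORMANCE[0.8]     # assumes ideal 2x scaling

print(f"over-clocked:  power {over_clocked_power:.2f}x")
print(f"under-clocked: power {under_clocked_power:.2f}x")
print(f"dual-core:     power {dual_core_power:.2f}x, perf about {dual_core_perf:.2f}x")
```

Two cores at 80% frequency draw about the same power as one core at full frequency, while the ideal throughput estimate lands close to the 1.73x shown on the slide.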



Dual core with voltage scaling

RULE OF THUMB: A 15% Reduction In Voltage Yields a 15% Frequency Reduction, a 45% Power Reduction, and a 10% Performance Reduction

SINGLE CORE: Area = 1, Voltage = 1, Freq = 1, Power = 1, Perf = 1

DUAL CORE: Area = 2, Voltage = 0.85, Freq = 0.85, Power = 1, Perf = ~1.8
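
Applying the rule of thumb to two cores gives roughly the dual-core numbers on this slide. The sketch below is a back-of-the-envelope calculation, not a power model: the constant and function names are illustrative, and it assumes the workload scales perfectly across cores.

```python
# Rule of thumb from the slide: a 15% voltage reduction yields these reductions.
POWER_REDUCTION = 0.45
PERF_REDUCTION = 0.10


def scaled_multicore(cores: int) -> tuple[float, float]:
    """Return (relative power, relative performance) for `cores` voltage-scaled cores.

    Assumes perfect parallel scaling; real workloads will see less.
    """
    per_core_power = 1.0 - POWER_REDUCTION
    per_core_perf = 1.0 - PERF_REDUCTION
    return cores * per_core_power, cores * per_core_perf


power, perf = scaled_multicore(2)
print(f"dual core: power about {power:.2f}x, performance about {perf:.2f}x")
```

The estimate gives about 1.1x power and 1.8x performance, consistent with the slide's rounded "Power = 1" and "Perf = ~1.8".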



Intel: Dual & Quad Cores



A New Era…

THE OLD

Performance Equals Frequency

Unconstrained Power

Voltage Scaling

THE NEW

Performance Equals IPC

Multi-Core

Microarchitecture Advancements

Power Efficiency



Trade-off equations

- Power is costly; Transistors, relatively cheap

- Frequency alone is not important; Efficiency IS

- Performance-per-watt is critical; per-core performance is not quite

- Computation is relatively easy; Memory accesses are NOT



Q & A



Topics

1. CPU (semiconductor) HISTORY (SESSION-1)

a. Moore’s Law

b. Transistor scaling

c. Scaling limitations & impact

d. What then?

- Dual core

e. The new era

2. ARCHITECTURE (SESSION-2)

a. Core Architecture

- Core basics, Platform architecture, Core architecture

b. Multi-core architecture

c. Multi-core challenges

d. Closing notes



Typical PC Architecture



Processor Resources

- Caches: L0, L1, L2 etc (Different levels of caches)

- General Purpose Registers (For SW programming)

- Segment Registers & TLB (for memory management)

- FP registers, XMM registers

- System Flags

- Control and Data registers, Debug registers, MSRs

- Many more



CMP/SMP/HT

CMP: Chip Multi Processing, refers to multiple physical core engines that have unique resources

Unique: L0/L1 Cache, TLBs, Instruction Pointer, GP Regs

Shared: L2 Cache

SMP: Refers to multiple threads that share all resources (time muxed)

Shared: L0/L1/L2 Caches, TLBs

Unique: Instruction Pointer, GP Regs

Hyper Threading: Refers to multiple threads that share more resources (L0/L1 Cache for example); May/May not be part of a CMP core

SW Threading: Application (SW) level threading of processes on one/more physical core engines



Core Architecture (Prescott)



Core Architecture (Xeon – Dual Core)



Multi-core platform (Freescale: embedded)



Multi-Core platform (RMI-XLR: embedded)



Tilera – 64 core CPU



Tilera – Platform



Tera-scale Computing

[Figure: Performance (KIPS, MIPS, GIPS, TIPS) versus Dataset Size (Kilobytes, Megabytes, Gigabytes, Terabytes). Workloads: Text, Multi-Media, 3D & Video, RMS. Platforms: Single Core, Multi-core, Tera-scale. Example uses: Personal Media Creation and Management, Entertainment, Learning & Travel, Health.]

IPS = Instructions per second

"RMS" Applications: Recognition, Mining, Synthesis



Intel Polaris (80-core)



Multi-Core: what next?



Connecting multiple cores



Platform Architecture (multi-core)

External

I/F



Multi-core: Architectural Challenges

- Instruction-level parallelism v/s Thread-level parallelism tradeoffs and balance

- Shared resource management (functional units, caches, tlb, btb)

- Multi-threading v/s Multi-core tradeoffs

- On and Off-chip bandwidth requirements

- Latencies (execution, cache, and memory) reduction

- Memory Coherence/Consistency (for high speed on-die cache hierarchies)

- Multiple domains (and crossing) in clocking, voltage, reset,...

- Partitioning resources (between threads/cores)

- Fault tolerance (at device, storage, execution, core level) (aka reliability)

- On-die interconnect (optimized along latency, bw, modularity, power, ...)

- Integration (of system components, and/or fixed function devices)



Multi-core: Design Challenges

Design Complexity, Productivity Tools / Methods Advance

• …But at slower rate than Moore’s Law

• Replicating cores improves productivity

Visibility for Test & Debug

• Pin Bandwidth/Transistor continues to decline

• Shrinking dimensions, increasing speeds, …

• Increased test time adding to cost

Power

• Power Delivery – di/dt of Amps/nano-second

• Thermals: Overall power and thermal density



Multi-core: Eco-system challenges

Underlying Software assumptions on resource sharing

• Lack of standard mechanisms to share “resource sharing info” between hw and OS

Lack of “Resource sharing” aware SW

• Compilers, Schedulers, Configuration/Management (Power!) etc

Legacy SW architectural requirements left on Multi-Core CPUs

• Compatibility requirements

Many more…unknowns (to CPU Design world)



Multi-core: Software Challenges

Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready for 100s of CPUs / chip

- Scalability of O/S Data Structures and Policies: Synchronization and locking, Scheduling, Process management, Data structure sizing and management limitations, Threading granularity and primitives

- Memory Hierarchy Awareness: Impact of coherency policy, Efficiency of Data-sharing and Process migration effects, SW visibility to High speed on-die interconnect, SW control of Cache hierarchy, NUCA Awareness

- High Bandwidth I/O Support: Light weight Interrupts, Data movement and transformation engines, I/O Affinity



More than the cores



Closing notes

• Single and Multi-core architectures presented

• Multi-Core CPU is the next generation CPU Architecture

– 2Core and Intel Quad-Core designs plenty on market already

– Many More are on their way

• Several old paradigms ineffective; Several new problems to be addressed

• Chip Level Multiprocessing and large caches can exploit Moore’s Law

• Thread/Core count in future microprocessor systems to increase

• Eco-system immature/non-existent

• Numerous domains in arch/design awaiting research & innovation and here is where you come in!!!

Multi-Core Architecture and Design ready for research, development and innovation!



Acknowledgements

Gautam Doshi [Principal Engineer, Digital Enterprise Group]

Ajay Bhatt [Intel Fellow, Digital Enterprise Group]

Dileep Bhandarkar [Architect, Digital Enterprise Group]

Sunit Tyagi [Sr. Principal Engineer, Digital Enterprise Group]

… and countless foil-wares



Resources

Intel Tech/Research: http://www.intel.com/technology/index.htm

Energy Efficient Performance: http://www.intel.com/technology/eep/index.htm

Intel Core Microarchitecture: http://www.intel.com/technology/architecture/coremicro/

Dual-core processor: http://www.intel.com/technology/computing/dual-core/index.htm

Multi/Many Core: http://www.intel.com/multi-core/index.htm

Intel Platforms: http://www.intel.com/platforms/index.htm

Threading: http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/index.htm



Q & A



Backup: Core uArch



Intel® Core™ Microarchitecture

*Graphics not representative of actual die photo or relative size

Scalable, Low Power, High Performance (65nm)

Merom (Core2 Duo): Mobile Optimized

Conroe (Core2 Duo): Desktop Optimized

Woodcrest (Xeon): Server Optimized

Intel® Wide Dynamic Execution

Intel® Intelligent Power Capability

Intel® Advanced Smart Cache

Intel® Smart Memory Access

Intel® Advanced Digital Media Boost



Intel® Intelligent Power Capability

Ultra Fine Grained / Coarse Grained

• Aggressive Clock Gating

• Enhanced Speed-Step

• Low VCC Arrays

• Blocks Controlled Via Sleep Transistors

• Low Leakage

• Sleep Transistor

Process

• 65nm

• Strained Silicon

• Low-K Dielectric

• More Metal Layers

ADVANTAGE

• Mobile-Level Power Management

• Energy Efficient Performance



Intel® Wide Dynamic Execution

[Figure: per-core pipeline, drawn for both CORE 1 and CORE 2: INSTRUCTION FETCH AND PRE-DECODE, INSTRUCTION QUEUE, DECODE, RENAME / ALLOC, RETIREMENT UNIT (REORDER BUFFER), SCHEDULERS, EXECUTE.]

EACH CORE

• 4 WIDE - DECODE TO EXECUTE

• 4 WIDE - MICRO-OP EXECUTE

• MICRO and MACRO FUSION

• DEEPER BUFFERS

• EFFICIENT 14 STAGE PIPELINE

• ENHANCED ALUs

ADVANTAGE

• 33% Wider Execution over Previous Gen

• Comprehensive Advancements

• Enabled In Each Core



Intel® Wide Dynamic Execution: Micro and Macro Fusion

[Figure: DECODE and EXECUTE stages with uCODE ROM, shown WITHOUT MACRO FUSION and WITH MACRO FUSION. Labels: INSTRUCTION 3, INTERNAL INST 1, COMPLETED INST 3, COMBINED INST 2 & 3, Micro Fusion, Macro Fusion.]

MACRO FUSION EXAMPLE: CMP+JMP IN 1 CLOCK

ADVANTAGE

• Instruction Load Reduced ~ 15%**

• Micro-Ops Reduced ~ 10%**

*Graphics not representative of actual die photo or relative size

** Workload dependent



Intel® Advanced Smart Cache: Dynamic L2 Cache Usage

[Figure: Core™ Microarchitecture with a Shared L2 that is Dynamically, Bi-Directionally Available to both L1 caches (Decreased Traffic), compared with Independent L2 caches that are Not Shareable (Increased Traffic).]

ADVANTAGE

• Higher Cache Hit Rate

• Reduced BUS Traffic

• Lower Latency to Data



Intel® Smart Memory Access: Hardware-based Memory Disambiguation

[Figure: INST 1 “STORE [X]” and INST 2 “LOAD [Y]” passing through DECODE/SCHEDULE and EXECUTE. Core™ Microarchitecture, with a HARDWARE Mem. Dis. Predictor (OUT OF ORDER): Inst. 2 “Load” Can Occur Before Inst. 1 “Store”. Other (IN ORDER): Inst. 2 Must Wait For Inst. 1 “Store” To Complete (STALL).]

ADVANTAGE

• Higher Utilization of Pipeline

• Masks latency to data access

• Higher Performance



Intel® Advanced Digital Media Boost: Single Cycle SSE

[Figure: 128-bit SSE Operation (SSE/SSE2/SSE3), bits 127 to 0. SOURCE operands X4, X3, X2, X1 and Y4, Y3, Y2, Y1 produce DEST X4opY4, X3opY3, X2opY2, X1opY1. Previous: DECODE and EXECUTE of an SSE/2/3 OP span CLOCK CYCLE 1 and CLOCK CYCLE 2. Core™ µarch: the operation completes in CLOCK CYCLE 1, In Each Core. Labels: Fusion Support, Single Cycle SSE.]

ADVANTAGE

• Increased Performance

• 128 bit Single Cycle in each core

• Improved Energy Efficiency



Backup: Next Gen Technologies



Traditional Operating Systems (Time-mux)



What is Virtualization?

Without VMs: Single OS owns all hardware resources

[Figure: Apps running on one Operating System on Physical Host Hardware (Processors, Memory, Graphics, Storage, Network, Keyboard / Mouse).]

With VMs: Multiple OSes share hardware resources

[Figure: VM0 (Apps on Guest OS0) and VM1 (Apps on Guest OS1) running on a VM Monitor (VMM), a new layer of software, on Physical Host Hardware.]

Virtualization enables multiple operating systems to run on the same platform



Types of Virtualization

Hosted VMM

• launched from within an OS, e.g., VMplayer, WSX, GSX, Virtual PC, Virtual Server

– Cheap but lower performance

Hypervisor: A bootable layer on BIOS

• Thick: embeds all the drivers, e.g., ESX

• Thin: has a service VM, e.g., Xen derivatives

Virtual Appliances: dedicated Virtual machines, e.g., MojoPC



Intel® Virtualization Technology (VT)

Intel® VT

• First to market with native virtualization support

• Broadest HW and SW ecosystem support

Core™ Microarchitecture based systems

• Significant increase in performance and improved VT performance over all segments

• Mobile - Intel® Core™2 Duo Mobile Processor for Intel® Centrino® Duo Mobile Technology

• Desktop - Intel® Core™2 Duo Desktop Processor E6000 sequence

• Server - Dual and Quad Core Intel® Xeon® Processor 5000 series

Get More Done On Every Server; Get More Capabilities On Client

[Figure: several OS + App stacks running on a Virtual Machine Monitor on Processors with Intel® Virtualization Technology. Caption: 1st VT base SW Solutions, and others …]



Trusted Execution Technology



LT Hardware Ingredients

LT = CPU + Chipset + TPM + Protected I/O

[Figure: platform block diagram with Intel CPU, Intel (G)MCH, RAM, ICH, USB, LPC, TPM; LT-specific enhancements are highlighted.]

CPU Extensions

• Enables domain separation

• Sets policy for protected memory

Protected Graphics

• Trusted channel between graphics and trusted SW

• Integrated or third party discrete graphics

Protected Keyboard & Mouse

• Trusted channel between keyboard/mouse and trusted SW

Protected Memory Mgmt

• Enforces access policy to protected memory

Trusted Platform Module v1.2

• Protects keys, digital certificates & attestation credentials

• Provides platform authentication



Backup: Misc



Moore’s Law Moving Forward

|---------------------ACTUAL---------------------------|--FORECAST-|

Production            1995    1997    1999    2001    2003    2005    2007    2009    2011

Generation            0.35um  0.25um  0.18um  130nm   90nm    65nm    45nm    35nm    22nm

Gate Length           0.35um  0.20um  0.13um  <70nm   <50nm   <35nm   <35nm   <35nm   <22nm

Wafer Size (mm)       200     200     200     300     300     300     300     300?    300?

Integration Capacity  <100M   100M    200M    500M    1B      >1B     >2B     >4B     >8B

“Another decade is probably straight-forward … There is certainly no end to creativity.”

- Gordon Moore, speaking of extending Moore’s Law at ISSCC, Feb 2003



Multi-Core Power Efficiency

[Figure: one Big core with Cache compared with four Small cores (C1, C2, C3, C4) with Cache; bar charts of relative Power and Performance on a scale of 1 to 4.]

Power ~ area

Single thread performance ~ area**.5

Small core: Power = 1/4, Performance = 1/2

Many core is more power efficient
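
The two proportionalities on this slide (power ~ area, single-thread performance ~ area**.5) are enough to reproduce its "Power = 1/4, Performance = 1/2" figures. The sketch below does so. The function names are illustrative, areas are normalized to the big core, and the aggregate throughput of four small cores assumes a perfectly parallel workload.

```python
import math


def relative_power(area: float) -> float:
    """Power ~ area (normalized so the big core has area 1 and power 1)."""
    return area


def single_thread_perf(area: float) -> float:
    """Single thread performance ~ area**.5."""
    return math.sqrt(area)


BIG_AREA = 1.0
SMALL_AREA = BIG_AREA / 4          # four small cores fit in one big core's area

small_power = relative_power(SMALL_AREA)
small_perf = single_thread_perf(SMALL_AREA)

# Four small cores: same total area and power as the big core.
many_core_power = 4 * small_power
many_core_throughput = 4 * small_perf   # assumes ideal parallel scaling

print(f"small core: power {small_power}, performance {small_perf}")
print(f"4 small cores: power {many_core_power}, throughput {many_core_throughput}")
```

Under these assumptions four small cores deliver twice the throughput of the big core at the same power, at the cost of half the single-thread performance.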



Multi-Core and Memory Gap

[Figure: Growing Performance Gap. Peak Instructions Per DRAM Access (0 to 700), LOGIC versus MEMORY, 1992 to 2002, for Pentium 66MHz, Pentium-Pro 200MHz, PentiumIII 1100MHz, and Pentium4 2 GHz.]

Reduce DRAM access with large caches

• Extra benefit: power savings. Cache is lower power than logic

Tolerate memory latency with multiple threads

• Multiple cores

• Hyper-threading



Multi-threading tolerates memory latency

Serial Execution: Ai, Idle, Ai+1, Idle, Bi, Idle, Bi+1

Multi-threaded Execution: Ai, Bi, Ai+1, Bi+1 (thread B fills the slots where thread A was idle)

Execute thread B while thread A waits for memory

Multi-core has a similar effect
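
The benefit of running thread B while thread A waits for memory can be shown with a toy timeline model. This is a simplified sketch and not how a real core schedules threads: each thread alternates a fixed compute burst with a fixed memory wait, thread switches are free, and the function name and the numbers are invented for illustration.

```python
def finish_time(num_threads: int, iterations: int, compute: int, mem_wait: int) -> int:
    """Time for one core to finish all threads, switching threads on a memory wait."""
    ready_at = [0] * num_threads            # when each thread's next data arrives
    remaining = [iterations] * num_threads  # compute bursts left per thread
    core_free = 0                           # when the core can start new work
    finish = 0
    while any(remaining):
        # Run the unfinished thread whose data arrives earliest.
        t = min((i for i in range(num_threads) if remaining[i]),
                key=lambda i: ready_at[i])
        start = max(core_free, ready_at[t])
        core_free = start + compute
        ready_at[t] = core_free + mem_wait  # thread now waits on memory
        remaining[t] -= 1
        finish = max(finish, ready_at[t])
    return finish


COMPUTE, MEM_WAIT, ITERATIONS = 1, 3, 4

# Serial execution: thread A runs to completion, then thread B.
serial = 2 * finish_time(1, ITERATIONS, COMPUTE, MEM_WAIT)
# Multi-threaded execution: B's compute bursts fill A's memory waits.
multithreaded = finish_time(2, ITERATIONS, COMPUTE, MEM_WAIT)

print("serial:", serial, "multi-threaded:", multithreaded)
```

With these hypothetical numbers the memory wait dominates, so overlapping the two threads nearly halves the total time.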



Multi-core tolerates memory latency

Serial Execution: Ai, Idle, Ai+1, Idle, Bi, Idle, Bi+1

Multi-core Execution: one core runs Ai, Idle, Ai+1 while another core runs Bi, Idle, Bi+1

Execute thread A and B simultaneously



How does Multicore Change Parallel Programming?

No change in fundamental programming model

Synchronization and communication costs greatly reduced

• Makes it practical to parallelize more programs

Resources now shared

• Caches

• Memory interface

• Optimization choices may be different

[Figure: SMP: processors P1, P2, P3, P4, each with its own cache, connected to Memory. CMP: cores C1, C2, C3, C4, each with its own cache, connected to Memory.]



Art of the Possible

Billion transistors realized in 65nm Si process

Multi-Billion transistors possible in future Si process

Large die sizes can be built

– 400 to 600 square millimeters

What can fit on a single die?

– For 65nm (rough est)

• 30 mm2 per proc.

• 15 mm2 per MB

Die size (core + cache only) in mm2

               8 cores   4 cores   2 cores

32 MB cache    720       600       540

16 MB cache    480       360       300
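
The die-size figures follow directly from the rough 65nm estimates above (30 mm2 per processor core, 15 mm2 per MB of cache). The sketch below recomputes them; the function and constant names are illustrative, and like the slide it counts cores and cache only.

```python
MM2_PER_CORE = 30      # rough 65nm estimate from the slide
MM2_PER_MB_CACHE = 15  # rough 65nm estimate from the slide


def die_area_mm2(cores: int, cache_mb: int) -> int:
    """Core + cache area only; I/O, interconnect and other logic are ignored."""
    return cores * MM2_PER_CORE + cache_mb * MM2_PER_MB_CACHE


for cache_mb in (32, 16):
    row = [die_area_mm2(cores, cache_mb) for cores in (8, 4, 2)]
    print(f"{cache_mb} MB cache:", row)
```

Every combination except 8 cores with 32 MB of cache stays within the 400 to 600 square millimeter range the slide calls buildable, or below it.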



Quad Cores – here a quarter ago already!



Multi-Core
