Reach To Teach: Intel Higher Education Program & Foundation for Advancement of Education and Research (FAER)
Introduction to Multi-Core
Baskaran Ganesan
Sr. Design Engineer, Digital Enterprise Group, Intel Corporation
Topics
1. CPU (semiconductor) HISTORY (SESSION-1)
a. Moore’s Law
b. Transistor scaling
c. Scaling limitations & impact
d. What then?
- Dual core
e. The new era
2. ARCHITECTURE (SESSION-2)
a. Core Architecture
- Core basics, Platform architecture, Core architecture
b. Multi-core architecture
c. Multi-core challenges
d. Closing notes
Moore’s Law
Moore’s law at work
Compute Power
SW/IT eco-system
Volume Market
CPU Cost
Manufacturing technology
CPU Arch technology
Transistor Size Transistor Count
Historical Driving Forces
[Figure: Shrinking Geometry. Feature size (um), log scale 0.01 to 10, versus year, 1970 to 2020. Milestones: 1971, 4004 Processor, 2300 transistors; 1978, 8008 Processor, IBM PC; 1986, i386 Processor, 32-bit; 1993, Pentium Processor, 3.1M transistors; 2005, Montecito, 1.7B transistors.]
[Figure: Increased Frequency. Frequency (MHz), log scale 1 to 100000, versus year, 1970 to 2020.]
Scale Factors (loosely defined)
Voltage scale-factor: Rate at which the transistor voltage decreases with respect to a change in transistor dimensions
Frequency scale-factor: Rate at which the transistor frequency increases with respect to a change in transistor dimensions
Cost scale-factor: Rate at which the per-transistor cost decreases with respect to a change in transistor dimensions
Count scale-factor: Rate at which the transistor count increases with respect to a change in transistor dimensions
Scaling: More data
The Act of Balancing
Delivered Performance = Instructions Per Cycle (IPC) * Frequency
Power ∝ Cdynamic * V * V * Frequency
Goal is higher performance and lower power
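The two relations above can be evaluated directly. A minimal sketch, assuming normalized units; the function names `delivered_performance` and `dynamic_power` and the sample numbers are illustrative and do not come from the slides:

```python
def delivered_performance(ipc, frequency_hz):
    """Delivered Performance = IPC * Frequency (instructions per second)."""
    return ipc * frequency_hz


def dynamic_power(c_dynamic, voltage, frequency_hz):
    """Power is proportional to Cdynamic * V * V * Frequency."""
    return c_dynamic * voltage * voltage * frequency_hz


# Hypothetical design points (illustrative numbers only).
base_perf = delivered_performance(ipc=1.0, frequency_hz=2.0e9)
base_power = dynamic_power(c_dynamic=1.0e-9, voltage=1.2, frequency_hz=2.0e9)

# Doubling IPC at the same voltage and frequency doubles performance
# without changing this power estimate (Cdynamic assumed unchanged).
wide_perf = delivered_performance(ipc=2.0, frequency_hz=2.0e9)

# Doubling frequency instead doubles both performance and power.
fast_perf = delivered_performance(ipc=1.0, frequency_hz=4.0e9)
fast_power = dynamic_power(c_dynamic=1.0e-9, voltage=1.2, frequency_hz=4.0e9)

print(wide_perf / base_perf, fast_perf / base_perf, fast_power / base_power)
```

In a real design a wider, higher-IPC core usually also raises Cdynamic, so the comparison is a simplification.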
Scaling at its best
386 Processor: May 1986, 16 MHz core, 275,000 1.5µ transistors, ~1.2 SPECint2000
Pentium® 4 Processor: August 27, 2003, 3.2 GHz core, 55 million 0.13µ transistors, 1249 SPECint2000
Over 17 years: frequency 200x, transistor count 200x, feature size 11x, performance 1000x
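The scaling ratios on this slide follow from the per-chip figures. A quick check, using only the transistor counts, feature sizes and SPECint2000 scores quoted on the slide (the variable names are illustrative):

```python
# Figures quoted on the slide for the 386 (1986) and the Pentium 4.
i386 = {"transistors": 275_000, "feature_um": 1.5, "specint2000": 1.2}
pentium4 = {"transistors": 55_000_000, "feature_um": 0.13, "specint2000": 1249}

transistor_ratio = pentium4["transistors"] / i386["transistors"]
feature_ratio = i386["feature_um"] / pentium4["feature_um"]
performance_ratio = pentium4["specint2000"] / i386["specint2000"]

print(f"transistor count: {transistor_ratio:.0f}x")
print(f"feature size:     {feature_ratio:.1f}x smaller")
print(f"SPECint2000:      {performance_ratio:.0f}x")
```

The computed ratios (200x, about 11.5x and about 1040x) line up with the slide's rounded 200x, 11x and 1000x.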
Architectural Innovations
• Serial, sequential execution
• Overlapped execution (pipelining)
• Multi-stage, deep pipelining
• Control-speculative execution
• Data-speculative execution
• Super-scalar execution
• Out-of-order execution
• Vector computing
• Addressing extensions
• Application specific instructions
• Multi-level on-chip caching
• Memory disambiguation
• Register renaming
• Score-boarding
• Hardware data prefetching
• …
Many decades of computer architecture focused on Instruction-Level Parallelism (ILP) enhancement
The Challenges
[Figure: Diminishing Voltage Scaling. Supply voltage (V), log scale 0.1 to 10, versus year, 1990 to 2009, across process generations 0.7um, 0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm and 30nm. Annotations: ~30%, slowing.]
Power Limitations
Power = Capacitance x Voltage^2 x Frequency
also
Power ~ Voltage^3
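The step from the first relation to the second is not spelled out on the slide. One common argument, stated here as an assumption, is that the maximum stable frequency scales roughly linearly with supply voltage, so that:

```latex
P = C V^{2} f, \qquad f \propto V \;\Longrightarrow\; P \propto C V^{2} \cdot V = C V^{3}
```

Under that assumption, with capacitance held fixed, a 20% increase in voltage and frequency costs about 1.2^3 ≈ 1.73 times the power.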
Heat Dissipation
[Figure: Power density (W/cm2), log scale 1 to 10,000, versus year, '70 to '10, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processors, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface. Later points are marked as projected.]
What then?
[Figure: bar chart. Max Frequency: Power 1.00x, Performance 1.00x.]
Over-clocking
[Figure: bar chart. Max Frequency: Power 1.00x, Performance 1.00x. Over-clocked (+20%): Power 1.73x, Performance 1.13x.]
Under-clocking
[Figure: bar chart. Over-clocked (+20%): Power 1.73x, Performance 1.13x. Max Frequency: Power 1.00x, Performance 1.00x. Under-clocked (-20%): Power 0.51x, Performance 0.87x.]
Multi-Core Energy-Efficient Performance
[Figure: bar chart, relative single-core frequency and Vcc. Over-clocked (+20%): Power 1.73x, Performance 1.13x. Max Frequency: Power 1.00x, Performance 1.00x. Dual-core (-20%): Power 1.02x, Performance 1.73x.]
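The bar values on the over-clocking, under-clocking and dual-core slides are consistent with a simple cubic power model. A sketch, assuming power scales with the cube of the relative frequency (voltage tracking frequency) and that a second core doubles both power and throughput; the single-core performance factors 1.13x and 0.87x are taken from the slides rather than derived, and the function names are illustrative:

```python
def relative_power(freq_scale, cores=1):
    """Power relative to one core at max frequency, assuming P ~ f**3."""
    return cores * freq_scale ** 3


def relative_performance(per_core_perf, cores=1):
    """Throughput relative to one core, assuming perfect scaling across cores."""
    return cores * per_core_perf


over_clocked = (relative_power(1.2), relative_performance(1.13))
under_clocked = (relative_power(0.8), relative_performance(0.87))
dual_core = (relative_power(0.8, cores=2), relative_performance(0.87, cores=2))

for name, (power, perf) in [
    ("over-clocked (+20%)", over_clocked),
    ("under-clocked (-20%)", under_clocked),
    ("dual-core (-20%)", dual_core),
]:
    print(f"{name:22s} power {power:.2f}x  performance {perf:.2f}x")
```

The model reproduces the 1.73x, 0.51x and 1.02x power figures. Dual-core performance comes out at 1.74x against the slide's 1.73x, and in practice it depends on how well the workload parallelizes.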
Dual core with voltage scaling
RULE OF THUMB: A 15% reduction in voltage yields a 15% frequency reduction, a 45% power reduction and a 10% performance reduction.
SINGLE CORE: Area = 1, Voltage = 1, Freq = 1, Power = 1, Perf = 1
DUAL CORE: Area = 2, Voltage = 0.85, Freq = 0.85, Power = 1, Perf = ~1.8
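The rule of thumb can be applied to the dual-core case directly. A sketch, assuming the per-core reductions on this slide (45% power and 10% performance for a 15% voltage and frequency reduction) and perfect scaling across two cores; the names are illustrative:

```python
# Rule-of-thumb reductions for a 15% cut in voltage and frequency.
POWER_REDUCTION = 0.45
PERFORMANCE_REDUCTION = 0.10


def scaled_multicore(cores):
    """Return (power, performance) relative to one core at full voltage."""
    power = cores * (1.0 - POWER_REDUCTION)
    performance = cores * (1.0 - PERFORMANCE_REDUCTION)
    return power, performance


power, performance = scaled_multicore(cores=2)
print(f"dual core: power {power:.2f}, performance {performance:.2f}")
```

This gives a power of about 1.1 and a performance of about 1.8, close to the slide's Power = 1 and Perf = ~1.8.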
Intel: Dual & Quad Cores
A New Era…
THE OLD: Performance Equals Frequency; Unconstrained Power; Voltage Scaling
THE NEW: Performance Equals IPC; Multi-Core; Microarchitecture Advancements; Power Efficiency
Trade-off equations
- Power is costly; Transistors, relatively cheap
- Frequency alone is not important; Efficiency IS
- Performance-per-watt is critical; per-core performance is not quite
- Computation is relatively easy; Memory accesses are NOT
Q & A
Topics
1. CPU (semiconductor) HISTORY (SESSION-1)
a. Moore’s Law
b. Transistor scaling
c. Scaling limitations & impact
d. What then?
- Dual core
e. The new era
2. ARCHITECTURE (SESSION-2)
a. Core Architecture
- Core basics, Platform architecture, Core architecture
b. Multi-core architecture
c. Multi-core challenges
d. Closing notes
Typical PC Architecture
Processor Resources
- Caches: L0, L1, L2 etc (Different levels of caches)
- General Purpose Registers (For SW programming)
- Segment Registers & TLB (for memory management)
- FP registers, XMM registers
- System Flags
- Control and Data registers, Debug registers, MSRs
- Many more
CMP/SMP/HT
CMP: Chip Multi Processing, refers to multiple physical core engines that have unique resources
Unique: L0/L1 Cache, TLBs, Instruction Pointer, GP Regs
Shared: L2 Cache
SMP: Refers to multiple threads that share all resources (time muxed)
Shared: L0/L1/L2 Caches, TLBs
Unique: Instruction Pointer, GP Regs
Hyper Threading: Refers to multiple threads that share more resources (L0/L1 Cache for example); May/May not be part of a CMP core
SW Threading: Application (SW) level threading of processes on one/more physical core engines
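As an illustration of SW threading, an application can split its work into tasks and let the OS schedule them onto whatever logical processors the platform exposes. A minimal sketch using Python's standard library; the workload (`count_primes`) and the chunking scheme are hypothetical. Threads are used for simplicity: in CPython, CPU-bound threads are serialized by the global interpreter lock, so a process pool would be needed for a real parallel speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_primes(lo, hi):
    """Count primes in [lo, hi) by trial division (deliberately simple)."""
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def parallel_count(limit, workers):
    """Split [0, limit) into one chunk per worker and sum the partial counts."""
    step = -(-limit // workers)  # ceiling division
    chunks = [(start, min(start + step, limit)) for start in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: count_primes(*c), chunks))


workers = os.cpu_count() or 1  # logical processors visible to the OS
total = parallel_count(10_000, workers)
print(workers, total)
```

The result does not depend on the worker count; only the way the range is divided changes.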
Core Architecture (Prescott)
Core Architecture (Xeon – Dual Core)
Multi-core platform (Freescale: embedded)
Multi-Core platform (RMI-XLR: embedded)
Tilera – 64 core CPU
Tilera – Platform
Tera-scale Computing
[Figure: Performance versus dataset size. Performance axis: KIPS, MIPS, GIPS, TIPS (IPS = instructions per second). Dataset size axis: kilobytes, megabytes, gigabytes, terabytes. Workloads: text, multi-media, 3D & video, and "RMS" applications (Recognition, Mining, Synthesis), spanning the single-core, multi-core and tera-scale regions. Example uses: personal media creation and management, entertainment, learning & travel, health.]
Intel Polaris (80-core)
Multi-Core: what next?
Connecting multiple cores
Platform Architecture (multi-core)
External I/F
Multi-core: Architectural Challenges
- Instruction-level parallelism v/s Thread-level parallelism tradeoffs and balance
- Shared resource management (functional units, caches, tlb, btb)
- Multi-threading v/s Multi-core tradeoffs
- On and Off-chip bandwidth requirements
- Latencies (execution, cache, and memory) reduction
- Memory Coherence/Consistency (for high speed on-die cache hierarchies)
- Multiple domains (and crossing) in clocking, voltage, reset,...
- Partitioning resources (between threads/cores)
- Fault tolerance (at device, storage, execution, core level) (aka reliability)
- On-die interconnect (optimized along latency, bw, modularity, power, ...)
- Integration (of system components, and/or fixed function devices)
Multi-core: Design Challenges
Design Complexity, Productivity Tools / Methods Advance
• …But at slower rate than Moore’s Law
• Replicating cores improves productivity
Visibility for Test & Debug
• Pin Bandwidth/Transistor continues to decline
• Shrinking dimensions, increasing speeds, …
• Increased test time adding to cost
Power
• Power Delivery – di/dt of Amps/nano-second
• Thermals: Overall power and thermal density
Multi-core: Eco-system challenges
Underlying Software assumptions on resource sharing
• Lack of standard mechanisms to share “resource sharing info” between hw and OS
Lack of “Resource sharing” aware SW
• Compilers, Schedulers, Configuration/Management (Power!) etc
Legacy SW architectural requirements left on Multi-Core CPUs
• Compatibility requirements
Many more…unknowns (to CPU Design world)
Multi-core: Software Challenges
Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready for 100s of CPUs / chip
- Scalability of O/S Data Structures and Policies
  - Synchronization and locking, Scheduling, Process management, Data structure sizing and management limitations, Threading granularity and primitives
- Memory Hierarchy Awareness
  - Impact of coherency policy, Efficiency of Data-sharing and Process migration effects, SW visibility to High speed on-die interconnect, SW control of Cache hierarchy, NUCA Awareness
- High Bandwidth I/O Support
  - Light weight Interrupts, Data movement and transformation engines, I/O Affinity
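The synchronization and locking item can be illustrated in miniature. A sketch, assuming a shared counter updated by several threads: one version takes a single global lock on every update, the other gives each thread a private counter and combines the results once at the end, which is one common way to reduce lock contention as core counts grow. All names are illustrative.

```python
import threading


def locked_count(threads, increments):
    """Every update takes one shared lock: correct, but all threads contend."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:
                total += 1

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total


def sharded_count(threads, increments):
    """Each thread updates its own slot; results are merged after joining."""
    slots = [0] * threads

    def worker(index):
        local = 0
        for _ in range(increments):
            local += 1
        slots[index] = local  # single write per thread, no lock needed

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sum(slots)


print(locked_count(4, 10_000), sharded_count(4, 10_000))
```

Both versions return the same total; they differ in how often threads must wait on each other, which is the scalability concern the slide raises for O/S data structures.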
More than the cores
Closing notes
• Single and Multi-core architectures presented
• Multi-Core CPU is the next generation CPU Architecture
– 2Core and Intel Quad-Core designs plenty on market already
– Many More are on their way
• Several old paradigms ineffective; Several new problems to be addressed
• Chip Level Multiprocessing and large caches can exploit Moore’s Law
• Thread/Core count in future microprocessor systems to increase
• Eco-system immature/non-existent
• Numerous domains in arch/design awaiting research & innovation and here is where you come in!!!
Multi-Core Architecture and Design ready for research, development and innovation!
Acknowledgements
Gautam Doshi [Principal Engineer, Digital Enterprise Group]
Ajay Bhatt [Intel Fellow, Digital Enterprise Group]
Dileep Bhandarkar [Architect, Digital Enterprise Group]
Sunit Tyagi [Sr. Principal Engineer, Digital Enterprise Group]
… and countless foil-wares
Resources
Intel Tech/Research: http://www.intel.com/technology/index.htm
Energy Efficient Performance: http://www.intel.com/technology/eep/index.htm
Intel Core Microarchitecture: http://www.intel.com/technology/architecture/coremicro/
Dual-core processor: http://www.intel.com/technology/computing/dual-core/index.htm
Multi/Many Core: http://www.intel.com/multi-core/index.htm
Intel Platforms: http://www.intel.com/platforms/index.htm
Threading: http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/index.htm
Q & A
Backup: Core uArch
Intel® Core™ Microarchitecture
Scalable, Low Power, High Performance
65nm products:
• Merom (Core2 Duo): Mobile Optimized
• Conroe (Core2 Duo): Desktop Optimized
• Woodcrest (Xeon): Server Optimized
Features:
• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Smart Cache
• Intel® Smart Memory Access
• Intel® Advanced Digital Media Boost
*Graphics not representative of actual die photo or relative size
Intel® Intelligent Power Capability
Power management, from Ultra Fine Grained to Coarse Grained:
• Aggressive Clock Gating
• Enhanced Speed-Step
• Low VCC Arrays
• Blocks Controlled Via Sleep Transistors
Transistor:
• Low Leakage Transistors
• Sleep Transistors
Process:
• 65nm
• Strained Silicon
• Low-K Dielectric
• More Metal Layers
ADVANTAGE
• Mobile-Level Power Management
• Energy Efficient Performance
Benefit: Energy
*Graphics not representative of actual die photo or relative size
Intel® Wide Dynamic Execution
[Figure: CORE 1 and CORE 2, each containing the same blocks: Instruction Fetch and Pre-Decode, Instruction Queue, Decode, Rename / Alloc, Retirement Unit (Reorder Buffer), Schedulers, Execute.]
EACH CORE:
• 4 wide decode to execute
• 4 wide micro-op execute
• Micro and macro fusion
• Deeper buffers
• Efficient 14 stage pipeline
• Enhanced ALUs
ADVANTAGE
• 33% Wider Execution over Previous Gen
• Comprehensive Advancements
• Enabled In Each Core
Benefit: Energy, Perf
Intel® Wide Dynamic Execution: Micro and Macro Fusion
[Figure: decode and execute stages with a uCODE ROM, illustrating Micro Fusion and Macro Fusion. WITHOUT MACRO FUSION: Instructions 1, 2 and 3 decode into Internal Inst 1, 2 and 3 and retire as Completed Inst 1, 2 and 3. WITH MACRO FUSION: Instructions 1, 2 and 3 decode into Internal Inst 1 and Combined Inst 2 & 3, and still retire as Completed Inst 1, 2 and 3.]
MACRO FUSION EXAMPLE: CMP+JMP IN 1 CLOCK
ADVANTAGE
• Instruction Load Reduced ~ 15%**
• Micro-Ops Reduced ~ 10%**
Benefit: Energy, Perf
*Graphics not representative of actual die photo or relative size
** Workload dependent
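Macro fusion can be pictured as a decode-time pass that merges an adjacent compare and jump into one internal operation. A toy sketch, assuming instructions are plain strings and that only a `cmp` immediately followed by a `jmp` is fused, as in the slide's CMP+JMP example; real hardware applies further conditions that this model ignores, and `macro_fuse` is a hypothetical name:

```python
def macro_fuse(instructions):
    """Merge each adjacent cmp/jmp pair into a single fused operation."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (
            current.startswith("cmp")
            and following is not None
            and following.startswith("jmp")
        ):
            fused.append(f"{current} + {following}")
            i += 2  # both instructions are consumed by one fused op
        else:
            fused.append(current)
            i += 1
    return fused


program = ["load r1, [x]", "cmp r1, 0", "jmp done", "add r2, r1"]
decoded = macro_fuse(program)
print(len(program), "instructions ->", len(decoded), "internal ops")
print(decoded)
```

Here four instructions decode into three internal operations, which mirrors the slide's picture of three instructions producing two internal ones.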
Intel® Advanced Smart Cache: Dynamic L2 Cache Usage
[Figure: two dual-core layouts, each core with its own L1 cache. Core™ Microarchitecture, Shared L2: the L2 is dynamically, bi-directionally available to CORE 1 and CORE 2; decreased traffic. Independent L2: each core's L2 is not shareable; increased traffic.]
ADVANTAGE
• Higher Cache Hit Rate
• Reduced BUS Traffic
• Lower Latency to Data
Benefit: Energy, Perf
*Graphics not representative of actual die photo or relative size
Intel® Smart Memory Access: Hardware-based Memory Disambiguation
[Figure: INST 1 “STORE [X]” followed by INST 2 “LOAD [Y]”, passing in order through decode/schedule. Core™ Microarchitecture, with a hardware memory disambiguation predictor: execution is out of order, and Inst. 2 “Load” can occur before Inst. 1 “Store”. Other: execution stalls, and Inst. 2 must wait for Inst. 1 “Store” to complete.]
ADVANTAGE
• Higher Utilization of Pipeline
• Masks latency to data access
• Higher Performance
Benefit: Energy, Perf
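The idea behind memory disambiguation can be sketched in software: a load is allowed to run ahead of an earlier store, and its result is kept only if the two turn out to touch different addresses. This toy model assumes a dictionary as memory and always speculates; it has no predictor, unlike the hardware on the slide, and the function names are illustrative.

```python
def in_order(memory, store, load_addr):
    """Reference behaviour: STORE completes, then LOAD reads memory."""
    addr, value = store
    memory = dict(memory)
    memory[addr] = value
    return memory[load_addr], memory


def speculative(memory, store, load_addr):
    """Issue the LOAD first, then replay it if it aliased the STORE."""
    addr, value = store
    memory = dict(memory)
    loaded = memory[load_addr]      # load executes early, before the store
    memory[addr] = value            # store completes later
    replayed = addr == load_addr    # same address: the early load was stale
    if replayed:
        loaded = memory[load_addr]  # re-execute the load to get the new value
    return loaded, memory, replayed


mem = {"X": 1, "Y": 2}
print(speculative(mem, ("X", 10), "Y"))  # different addresses: no replay
print(speculative(mem, ("X", 10), "X"))  # same address: replay needed
```

In both cases the speculative path returns the same loaded value as the in-order path; the replay flag shows when the early load had to be redone.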
Intel® Advanced Digital Media Boost: Single Cycle SSE
[Figure: an SSE operation (SSE/SSE2/SSE3) on 128-bit operands, bits 127 to 0. SOURCE: X4, X3, X2, X1 and Y4, Y3, Y2, Y1. DEST: X4opY4, X3opY3, X2opY2, X1opY1. Previous: after decode, execution of the SSE/2/3 op takes clock cycle 1 and clock cycle 2. Core™ µarch: after decode, all four results are produced in clock cycle 1, in each core, with fusion support.]
ADVANTAGE
• Increased Performance
• 128 bit Single Cycle in each core
• Improved Energy Efficiency
Benefit: Energy, Perf
*Graphics not representative of actual die photo or relative size
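Single-cycle SSE can be pictured as applying one operation to all four lanes of a 128-bit register at once, rather than in two passes. A toy model, assuming lists of four numbers stand in for registers, that the earlier design handles two lanes (64 bits) per pass, and that a "cycle" is one pass over the lanes; `packed_op` and its `lanes_per_cycle` parameter are illustrative, not a hardware description:

```python
import operator


def packed_op(op, x, y, lanes_per_cycle):
    """Apply op lane-by-lane, returning (result, cycles used)."""
    result = []
    cycles = 0
    for start in range(0, len(x), lanes_per_cycle):
        stop = start + lanes_per_cycle
        result.extend(op(a, b) for a, b in zip(x[start:stop], y[start:stop]))
        cycles += 1
    return result, cycles


x = [1, 2, 3, 4]      # X1..X4
y = [10, 20, 30, 40]  # Y1..Y4

previous = packed_op(operator.add, x, y, lanes_per_cycle=2)  # two passes
core = packed_op(operator.add, x, y, lanes_per_cycle=4)      # one pass
print(previous, core)
```

Both calls produce the same four results; only the number of passes differs.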
Backup: Next Gen Technologies
Traditional Operating Systems (Time-mux)
What is Virtualization?
[Figure: Without VMs: Single OS owns all hardware resources. Apps run on one Operating System on the Physical Host Hardware (processors, memory, graphics, keyboard / mouse, storage, network). With VMs: Multiple OSes share hardware resources. Apps run on Guest OS0 in VM0 and on Guest OS1 in VM1, above a VM Monitor (VMM), a new layer of software, on the Physical Host Hardware.]
Virtualization enables multiple operating systems to run on the same platform
Types of Virtualization
Hosted VMM
• launched from within an OS, e.g., VMplayer, WSX, GSX, Virtual PC, Virtual Server
– Cheap but lower performance
Hypervisor: A bootable layer on Bios
• Thick: embeds all the drivers, e.g., ESX
• Thin: has a service VM, e.g., Xen derivatives
Virtual Appliances: dedicated Virtual machines, e.g., MojoPC
Intel® Virtualization Technology (VT)
Intel® VT
First to market with native virtualization support
Broadest HW and SW ecosystem support
Core™ Microarchitecture based systems
• Significant increase in performance and improved VT performance across all segments
• Mobile - Intel® Core™2 Duo Mobile Processor for Intel® Centrino® Duo Mobile Technology
• Desktop - Intel® Core™2 Duo Desktop Processor E6000 sequence
• Server - Dual and Quad Core Intel® Xeon® Processor 5000 series
Get More Done On Every Server
Get More Capabilities On Client
[Figure: several OS and App stacks running on a Virtual Machine Monitor, on top of Processors with Intel® Virtualization Technology; logos of the 1st VT-based SW solutions, and others …]
Trusted Execution Technology
LT Hardware Ingredients
LT = CPU + Chipset + TPM + Protected I/O
[Figure: platform block diagram showing Intel CPU, Intel (G)MCH, RAM, ICH, USB, LPC and TPM; marked blocks = LT-specific enhancement]
CPU Extensions
• Enables domain separation
• Sets policy for protected memory
Protected Graphics
• Trusted channel between graphics and trusted SW
• Integrated or third party discrete graphics
Protected Keyboard & Mouse
• Trusted channel between keyboard/mouse and trusted SW
Protected Memory Mgmt
• Enforces access policy to protected memory
Trusted Platform Module v1.2
• Protects keys, digital certificates & attestation credentials
• Provides platform authentication
Backup: Misc
Moore’s Law Moving Forward
|---------------------ACTUAL---------------------------|--FORECAST-|
Production             1995     1997     1999     2001    2003    2005    2007    2009    2011
Generation             0.35µm   0.25µm   0.18µm   130nm   90nm    65nm    45nm    35nm    22nm
Gate Length            0.35µm   0.20µm   0.13µm   <70nm   <50nm   <35nm   <35nm   <35nm   <22nm
Wafer Size (mm)        200      200      200      300     300     300     300     300?    300?
Integration Capacity   <100M    100M     200M     500M    1B      >1B     >2B     >4B     >8B
“Another decade is probably straight-forward … There is certainly no end to creativity.”
- Gordon Moore, speaking of extending Moore’s Law at ISSCC, Feb 2003
Multi-Core Power Efficiency
[Figure: a die with one big core and its cache, next to a die with four small cores (C1 to C4) and their cache, with bar charts of Power and Performance. Each small core, relative to the big core: Power = 1/4, Performance = 1/2]
Power ~ area
Single thread performance ~ area**.5
Many core is more power efficient
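The two scaling rules on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration under the slide's own assumptions (power proportional to area, single-thread performance proportional to the square root of area); the function names and the "ideal aggregate throughput" figure, which assumes perfectly parallel work, are not from the original deck.

```python
import math

def core_power(area):
    """Power taken as proportional to core area (the slide's 'Power ~ area')."""
    return area

def core_performance(area):
    """Single-thread performance ~ area**.5, as stated on the slide."""
    return math.sqrt(area)

def chip_summary(num_cores, area_per_core):
    """Return (total power, single-thread performance, ideal aggregate throughput)."""
    power = num_cores * core_power(area_per_core)
    single = core_performance(area_per_core)
    throughput = num_cores * single  # assumption: the workload is perfectly parallel
    return power, single, throughput

big = chip_summary(num_cores=1, area_per_core=1.0)     # one big core
small = chip_summary(num_cores=4, area_per_core=0.25)  # four cores, 1/4 area each

print("one big core    :", big)    # (1.0, 1.0, 1.0)
print("four small cores:", small)  # (1.0, 0.5, 2.0)
```

Under these assumptions the four small cores draw the same total power as the big core, each runs a single thread at half the speed, and together they can reach up to twice the throughput when the work parallelizes.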
Multi-Core and Memory Gap
[Figure: Growing Performance Gap between LOGIC and MEMORY. Peak Instructions Per DRAM Access (axis 0 to 700), plotted over 1992 to 2002 for Pentium 66MHz, Pentium-Pro 200MHz, PentiumIII 1100MHz and Pentium4 2 GHz]
Reduce DRAM access with large caches
• Extra benefit: power savings. Cache is lower power than logic
Tolerate memory latency with multiple threads
• Multiple cores
• Hyper-threading
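As a rough illustration of why fewer DRAM accesses matter, the sketch below estimates the cycles a core spends stalled on DRAM per 1000 instructions when no latency is hidden. All numbers (the 200-cycle DRAM latency and the miss counts for the "small" and "large" cache) are hypothetical values chosen for the example, not measurements from the deck.

```python
def stall_cycles_per_kilo_instr(misses_per_kilo_instr, dram_latency_cycles):
    """Cycles spent waiting on DRAM per 1000 instructions, with no latency hiding."""
    return misses_per_kilo_instr * dram_latency_cycles

DRAM_LATENCY = 200  # hypothetical DRAM latency in core cycles

# Hypothetical miss counts: a larger cache turns more accesses into hits.
small_cache = stall_cycles_per_kilo_instr(15, DRAM_LATENCY)
large_cache = stall_cycles_per_kilo_instr(3, DRAM_LATENCY)

print(small_cache, large_cache)  # 3000 600
```

The second technique on the slide, running multiple threads, does not remove these stall cycles; it overlaps them with useful work from another thread.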
Multi-threading tolerates memory latency
[Figure: execution timelines for threads A and B.
Serial Execution: Ai, Idle, Ai+1, Idle, then Bi, Idle, Bi+1.
Multi-threaded Execution: Bi and Bi+1 run in the slots where thread A is idle.]
Execute thread B while thread A waits for memory
Multi-core has a similar effect
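The timelines above can be mimicked with a small deterministic model. This is a sketch under simplifying assumptions that are not in the original deck: every thread runs the same number of compute bursts, each burst is followed by one memory wait of fixed length, and switching threads on a single core costs nothing. The function names are illustrative.

```python
def serial_time(num_threads, bursts, compute, wait):
    """Run threads one after another; the core idles during every memory wait."""
    return num_threads * bursts * (compute + wait)

def multithreaded_time(num_threads, bursts, compute, wait):
    """One core; when the running thread waits on memory, run another ready thread."""
    ready_at = [0] * num_threads        # time at which each thread may run again
    remaining = [bursts] * num_threads  # compute bursts left per thread
    now = 0
    while any(remaining):
        candidates = [t for t in range(num_threads) if remaining[t]]
        t = min(candidates, key=lambda i: ready_at[i])  # earliest-ready thread
        now = max(now, ready_at[t])     # core idles only if no thread is ready
        now += compute                  # execute one burst
        remaining[t] -= 1
        ready_at[t] = now + wait        # thread now waits for memory
    return max(ready_at)                # finished when the last wait completes

serial = serial_time(num_threads=2, bursts=2, compute=2, wait=2)
threaded = multithreaded_time(num_threads=2, bursts=2, compute=2, wait=2)
print(serial, threaded)  # 16 10
```

With two threads whose compute bursts are as long as their memory waits, the idle slots of one thread are filled entirely by the other, so the core is busy until both threads have issued their last access.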
Multi-core tolerates memory latency
[Figure: execution timelines for threads A and B.
Serial Execution: Ai, Idle, Ai+1, then Bi, Idle, Bi+1 on a single core.
Multi-core Execution: A and B each run on their own core, each with its own Idle periods.]
Execute threads A and B simultaneously
How does Multicore Change Parallel Programming?
No change in fundamental programming model
Synchronization and communication costs greatly reduced
• Makes it practical to parallelize more programs
Resources now shared
• Caches
• Memory interface
• Optimization choices may be different
[Figure: SMP: four processors (P1 to P4), each with its own cache, attached to Memory. CMP: four cores (C1 to C4), each with its own cache, attached to Memory.]
Art of the Possible
Billion transistors realized in 65nm Si process
Multi-Billion transistors possible in future Si process
Large die sizes can be built
– 400 to 600 square millimeters
What can fit on a single die?
– For 65nm (rough est)
• 30 mm2 per proc.
• 15 mm2 per MB
Die size (core + cache only) in mm2
                2 cores   4 cores   8 cores
16 MB cache       300       360       480
32 MB cache       540       600       720
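The die sizes in the table follow directly from the two rough 65nm estimates on the slide (30 mm2 per processor core, 15 mm2 per MB of cache). A minimal sketch of that arithmetic, with illustrative names:

```python
CORE_AREA_MM2 = 30          # rough 65nm estimate from the slide, per core
CACHE_AREA_MM2_PER_MB = 15  # rough 65nm estimate from the slide, per MB of cache

def die_size_mm2(cores, cache_mb):
    """Die area for cores plus cache only, in square millimeters."""
    return cores * CORE_AREA_MM2 + cache_mb * CACHE_AREA_MM2_PER_MB

for cache_mb in (16, 32):
    sizes = [die_size_mm2(cores, cache_mb) for cores in (2, 4, 8)]
    print(f"{cache_mb} MB cache:", sizes)
# 16 MB cache: [300, 360, 480]
# 32 MB cache: [540, 600, 720]
```

Only the 16 MB configurations with 2 or 4 cores fall below the 400 mm2 lower bound quoted for large dies; 8 cores with 32 MB exceeds the 600 mm2 upper bound.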
Quad Cores – here a quarter ago already!
Multi-Core