New Intel® Core™ Microarchitecture
March 8, 2006
Stephen L. Smith, Vice President, Digital Enterprise Group
Bob Valentine, Architect, Intel Architecture Group
Agenda
• Multi-core Update and New Microarchitecture Level Set
• New Intel® Core™ Microarchitecture
• Wrap Up
Intel Multi-core Roadmap – Updates since Fall IDF
Ramping Multi-core Everywhere
All products and dates are preliminary and subject to change without notice.
Refresher: What is Multi-Core?
Two or more independent execution cores in the same processor
Specific implementations will vary over time - driven by product implementation and manufacturing efficiencies
• Best mix of product architecture and volume mfg capabilities
– Architecture: Shared Caches vs. Independent Caches
– Mfg capabilities: volume packaging technology
• Designed to deliver performance, OEM and end user experience
[Figure: single die (monolithic) based and multi-chip dual-core processors, each drawn as Core0 and Core1 attached to a Front Side Bus. Examples: 90nm Pentium® D Processor (Smithfield), Intel Core™ Duo Processor (Yonah), 65nm Pentium D Processor (Presler)]
*Not representative of actual die photos or relative size
Intel® Core™ Micro-architecture
*Not representative of actual die photo or relative size
Intel Multi-core Roadmap
Intel Multi-core Roadmap
Intel® Core™ Microarchitecture Based Platforms
Segment: Platform; processors (2006, 2007)
MP Servers: Caneland Platform (2007); Tigerton (QC) (2007)
DP Servers/DP Workstation: Bensley Platform (Q2’06)/Glidewell Platform (Q2’06); Woodcrest (Q3’06), Clovertown (QC) (Q1’07)
UP Servers/UP Workstation: Kaylo Platform (Q3’06)/Wyloway Platform (Q3’06); Conroe (Q3’06), Kentsfield (QC) (Q1’07)
Mobile Client: Napa Platform (Q1’06); Merom (2H’06)
Desktop - Home: Bridge Creek Platform (Mid’06); Conroe (Q3’06), Kentsfield (QC) (Q1’07)
Desktop - Office: Averill Platform (Mid’06); Conroe (Q3’06)
All products and dates are preliminary and subject to change without notice.
Note: only Intel® Core™ microarchitecture based processors listed
QC refers to Quad-Core
Intel® Core™ Microarchitecture Performance
Delivering both industry leading performance and performance/watt
• Conroe: >40% improvement in performance¹ and >40% reduction in power²
– As compared to today’s high-end Pentium® D processor 950 (formerly Presler)
• Woodcrest: >80% improvement in performance¹ and >35% reduction in power²
– As compared to today’s high-end Dual-Core Intel® Xeon® processor 2.8GHz (formerly Paxville DP)
• Merom: Extends the already significant performance and performance/watt leadership delivered with today's Intel® Core™ Duo processor with greater than 20% additional performance¹ improvement
– As compared to today’s high-end Intel® Core™ Duo processor (formerly Yonah)
1 - Estimated SPECint*_rate_base2000
2 - Expected reduction in TDP
Agenda
• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
• Summary
Inside the Intel® Core™ Microarchitecture
Agenda
• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
– Intel Microarchitecture History
– Intel® Core™ Microarchitecture Design Goals and Roadmap
– Processor Architecture 101
– Intel® Core™ Microarchitecture
– Software Implications
• Wrap Up
Microarchitecture History
New Microarchitecture Coming in 2006
Intel® Core™ Microarchitecture: Design Goals
Deliver world class performance combined with superior energy/power efficiency
– Existing and emerging applications and uses
– Greater performance and performance/watt
– Optimized for Intel Multi-core platforms
Deliver single foundation for optimized processors across each segment and power envelope
– Optimized for mobile, desktop and server segments
Driving Performance and Performance/Watt Leadership
Processor Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Goal is higher performance and lower power
Power ∝ Cdynamic * V * V * Frequency
Cdynamic is roughly a product of area and activity: “how many bits” * “how much do they toggle”
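The two relations on this slide can be tried out numerically. The following is a minimal sketch in normalized, arbitrary units. The function names and the example design points (25% more IPC for 10% more Cdynamic) are hypothetical and chosen only for illustration; they are not figures from the presentation.

```python
def relative_performance(frequency, ipc):
    """Delivered performance = frequency * IPC (arbitrary units)."""
    return frequency * ipc


def relative_power(c_dynamic, voltage, frequency):
    """Power is proportional to Cdynamic * V * V * frequency (arbitrary units)."""
    return c_dynamic * voltage ** 2 * frequency


# Hypothetical baseline design, every quantity normalized to 1.0.
base_perf = relative_performance(frequency=1.0, ipc=1.0)
base_power = relative_power(c_dynamic=1.0, voltage=1.0, frequency=1.0)

# Hypothetical alternative: 25% higher IPC bought with 10% more Cdynamic,
# at the same voltage and frequency.
wide_perf = relative_performance(frequency=1.0, ipc=1.25)
wide_power = relative_power(c_dynamic=1.1, voltage=1.0, frequency=1.0)

perf_per_watt_gain = (wide_perf / wide_power) / (base_perf / base_power)
print(f"performance x{wide_perf / base_perf:.2f}, "
      f"power x{wide_power / base_power:.2f}, "
      f"perf/watt x{perf_per_watt_gain:.2f}")
```

In this toy model, performance/watt improves only when IPC grows faster than Cdynamic, which is the trade-off the slide is setting up.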
Processor Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Frequency is proportional to voltage, so frequency reduction coupled with voltage reduction results in cubic reduction in power.
Power ∝ Cdynamic * V * V * Frequency
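The cubic claim follows from substituting V ∝ Frequency into the power relation. A small sketch, assuming voltage tracks frequency exactly linearly and that IPC and Cdynamic stay constant (the function name and the scale factors are illustrative only):

```python
def scaled_power(scale, c_dynamic=1.0):
    """Relative power when frequency and voltage are both multiplied by `scale`.

    Assumes voltage tracks frequency linearly, as the slide states, so
    power ~ Cdynamic * (scale * V)^2 * (scale * F) = Cdynamic * scale^3.
    """
    voltage = scale
    frequency = scale
    return c_dynamic * voltage * voltage * frequency


for scale in (1.0, 0.9, 0.8, 0.5):
    # With IPC held constant, performance scales linearly with frequency.
    print(f"frequency x{scale:.1f} -> performance x{scale:.1f}, "
          f"power x{scaled_power(scale):.3f}")
```

Under these assumptions, giving up 20% of frequency costs 20% of performance but cuts power roughly in half.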
Processor Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Higher IPC usually results in wider data paths and/or more speculation: directly increasing Cdynamic
Power ∝ Cdynamic * V * V * Frequency
Intel® Core™ Microarchitecture
Block Diagram Walkthrough
[Block diagram. Components shown: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; execution units (ALU, Branch, MMX/SSE, FPmove; ALU, FAdd, MMX/SSE, FPmove; ALU, FMul, MMX/SSE, FPmove); Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s]
Intel® Core™ Microarchitecture
in order
– instruction fetch
– instruction decode
– micro-op rename
– micro-op allocate
Intel® Core™ Microarchitecture
out of order
– micro-op schedule
– micro-op execute
Intel® Core™ Microarchitecture
out of order
– memory pipelines
memory order unit maintains architectural ordering requirements
Intel® Core™ Microarchitecture
in order
– micro-op retirement
– fault handling
Retirement Unit maintains illusion of in order instruction retirement
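How a reorder buffer preserves in-order retirement while execution completes out of order can be sketched in a few lines. This is a toy model, not a description of the actual hardware: the function name, the instruction names, and the completion order are all made up for illustration.

```python
from collections import deque


def retire_in_order(program, completion_order):
    """Return the order in which instructions retire.

    program: instruction names in program (allocation) order.
    completion_order: the same names in the order execution finishes,
        which may differ from program order.
    """
    rob = deque(program)   # reorder buffer, oldest entry at the left
    done = set()           # instructions that have finished executing
    retired = []
    for name in completion_order:
        done.add(name)
        # Retire only from the head: a finished instruction waits until
        # every older instruction has finished too.
        while rob and rob[0] in done:
            retired.append(rob.popleft())
    return retired


program = ["load", "add", "mul", "store"]
out_of_order = ["mul", "load", "store", "add"]   # hypothetical finish order
print(retire_in_order(program, out_of_order))
```

Whatever completion order is supplied, the retired sequence matches program order, which is the "illusion" the slide refers to.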
Intel® Core™ Microarchitecture
– Wide Dynamic Execution
– Advanced Digital Media Boost
– Smart Memory Access
– Advanced Smart Cache
– Intelligent Power Capability
New, State-of-the-Art Microarchitecture
Wide Dynamic Execution
Start with Instruction Fetch
four(+) instructions / cycle
>33% increase over other x86 processors
Instructions converted to micro-ops (uops)
~1 uop per x86 instruction
Micro-op Reduction (recall Processor 101)
Fewer uops per instruction allows IPC to be increased while lowering Cdynamic (fewer bits and less toggling)
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Power ∝ Cdynamic * V * V * Frequency
Techniques for Micro-op Reduction
ESP Tracker (Extended Stack Pointer)
– Execute Stack Pointer updates in dedicated hardware
– Intel® Core™ microarchitecture increases BW by 33%*
Micro-Op Micro-Fusion
– Single uop representation of “multi-uop” instruction
– Intel® Core™ microarchitecture increases # instructions*
Macro-Fusion
– New technique in Intel® Core™ microarchitecture (more on next pages)
* Techniques pioneered on Intel® Pentium® M processors
New: Macro-Fusion
Represent common x86 instruction pairs in single micro-op
– CMP or TEST + Conditional Branch (Jcc)
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion
– Single dispatch - efficiency
– Single cycle execution - performance
Without Macro-Fusion
Read four instructions from Instruction Queue
Each instruction gets decoded into separate uops
Instruction Queue: load eax, [mem1]; cmp eax, [mem2]; jne targ; inc ecx; store [mem3], ebx
Cycle 1 (decoders dec0-dec3): load eax, [mem1]; cmp eax, [mem2]; jne targ; inc ecx
Cycle 2: store [mem3], ebx
With Intel’s New Macro-Fusion
Read five instructions from Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions
Instruction Queue: load eax, [mem1]; cmp eax, [mem2]; jne targ; inc ecx; store [mem3], ebx
Cycle 1 (decoders dec0-dec3): load eax, [mem1]; cmpjne eax, [mem2], targ; inc ecx; store [mem3], ebx
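The decode example on these two slides can be reproduced with a small counting model. This is a sketch under simplifying assumptions: four decoders, any decoder may take a fused pair, and the per-cycle read limit from the instruction queue is ignored. The helper names (`fuse`, `decode_cycles`, `is_cond_branch`) are hypothetical, and the mnemonics are the slide's own pseudo-instructions.

```python
FUSABLE_FIRST = {"cmp", "test"}   # first instruction of a fusable pair


def is_cond_branch(instr):
    """Treat any mnemonic starting with 'j' other than 'jmp' as a Jcc."""
    op = instr.split()[0]
    return op.startswith("j") and op != "jmp"


def fuse(instrs):
    """Merge each CMP/TEST immediately followed by a Jcc into one entry."""
    slots, i = [], 0
    while i < len(instrs):
        op = instrs[i].split()[0]
        if (op in FUSABLE_FIRST and i + 1 < len(instrs)
                and is_cond_branch(instrs[i + 1])):
            slots.append(instrs[i] + " + " + instrs[i + 1])
            i += 2
        else:
            slots.append(instrs[i])
            i += 1
    return slots


def decode_cycles(instrs, decoders=4, macro_fusion=False):
    """Cycles needed to decode `instrs` with the given number of decoders."""
    slots = fuse(instrs) if macro_fusion else list(instrs)
    return -(-len(slots) // decoders)   # ceiling division


queue = [
    "load eax, [mem1]",
    "cmp eax, [mem2]",
    "jne targ",
    "inc ecx",
    "store [mem3], ebx",
]
print(decode_cycles(queue), decode_cycles(queue, macro_fusion=True))
```

For the five-instruction queue from the slides, the model needs two decode cycles without fusion and one with it, matching the Cycle 1 / Cycle 2 picture.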
Macro-Fusion (cont)
[Figure: the fused uop cmpjne eax, [mem2], targ passes from the Scheduler to Execution (Branch Eval), which sends flags and target to Write back]
– Lower latency
– Increased bandwidth
– “virtually” increase storage
Macro-fusion makes the machine behave as if it is wider and deeper, without the additional cost
Enabling Greater Performance & Efficiency
Wide Dynamic Execution
– 4 wide rename
– 4 wide micro-op execution
– 4 wide retire
Deeper out of order storage
32 discontiguous micro-ops considered for dispatch per cycle
33% Wider Than Previous Generation
Advanced Digital Media Boost
128-bit packed Multiply, plus
128-bit packed Add, plus
128-bit packed Load, plus
128-bit packed Store, plus
(how about a CMPJCC)
2x Compute Throughput / Clock
Advanced Digital Media Boost
Let's scale a vector: B[i] := A[i] * C
[Figure: vectors A and B, processed side by side on an existing processor and on the Intel® Core™ µarch with Advanced Digital Media Boost]
2x Compute Throughput / Clock
Advanced Digital Media Boost
Assume both microarchitectures have a 128-bit path from L1 to the processor
Advanced Digital Media Boost
...handles all the memory data
– Existing processor: multiply can’t keep up with load bandwidth
– Intel® Core™ µarch with Advanced Digital Media Boost: multiplier operates on all data
Advanced Digital Media Boost
Existing implementations eventually stall the load pipe waiting for multiplier
– Existing processor: load eventually stalls waiting for multiplier
– Intel® Core™ µarch with Advanced Digital Media Boost: load pipe is free to advance
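The stall described here can be approximated with a steady-state throughput model for B[i] := A[i] * C. This is only a sketch: the 64-bit multiplier width for the "existing processor" is an assumption made to reproduce the 2x figure, not a number stated on the slides, and pipeline fill latency is ignored. The function name is hypothetical.

```python
def cycles_to_scale(chunks, multiplier_bits, load_bits=128, chunk_bits=128):
    """Steady-state cycles to process `chunks` 128-bit pieces of A.

    Each cycle the load pipe can deliver load_bits and the multiplier can
    consume multiplier_bits; the slower of the two sets the pace.
    Pipeline fill latency is ignored.
    """
    load_cycles_per_chunk = -(-chunk_bits // load_bits)        # ceiling division
    mul_cycles_per_chunk = -(-chunk_bits // multiplier_bits)   # ceiling division
    return chunks * max(load_cycles_per_chunk, mul_cycles_per_chunk)


# Both designs load 128 bits per cycle from L1, as the slides assume.
existing = cycles_to_scale(chunks=1000, multiplier_bits=64)   # assumed 64-bit multiplier
core = cycles_to_scale(chunks=1000, multiplier_bits=128)      # full-width multiplier
print(existing, core, existing / core)
```

In the model the narrower multiplier is the bottleneck, so the load pipe has to wait for it; with a full-width multiplier the load pipe and multiplier advance in step.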
Advanced Digital Media Boost
...keeps pipeline free for computations
Advanced Digital Media Boost
...maintains 2X throughput compared to prior implementations
![Page 44: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/44.jpg)
44
Advanced Digital Media Boost
8 Single Precision Flops/cycle
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
![Page 45: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/45.jpg)
45
Advanced Digital Media Boost
4 Double Precision Flops/cycle
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
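The two peak figures on these slides (8 single-precision and 4 double-precision flops per cycle) are consistent with simple lane arithmetic. The sketch below assumes 128-bit SSE registers and two floating-point pipes (the FAdd and FMul units named in the block diagram), each starting one packed operation per cycle. The function name and parameters are illustrative and do not come from the slides.

```python
def peak_flops_per_cycle(vector_bits: int, element_bits: int, fp_pipes: int) -> int:
    """Peak flops/cycle = lanes per vector * number of FP pipes.

    Assumes every pipe starts one packed operation on a full vector each cycle.
    """
    lanes = vector_bits // element_bits  # elements that fit in one vector register
    return lanes * fp_pipes


# Assumed configuration: 128-bit vectors, one add pipe plus one multiply pipe.
single_precision = peak_flops_per_cycle(128, 32, fp_pipes=2)
double_precision = peak_flops_per_cycle(128, 64, fp_pipes=2)
print(single_precision, double_precision)
```

Under these assumptions a single pipe working on the same vectors gives half of each figure, which is one way to read the "2x compute throughput / clock" claim.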
![Page 46: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/46.jpg)
46
Advanced Digital Media Boost
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
![Page 47: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/47.jpg)
47
Advanced Digital Media Boost
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
![Page 48: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/48.jpg)
48
Advanced Digital Media Boost
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
![Page 49: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/49.jpg)
49
Advanced Digital Media Boost
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
![Page 50: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/50.jpg)
50
Advanced Digital Media Boost
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
2x Compute Throughput / Clock
![Page 51: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/51.jpg)
51
Advanced Digital Media Boost
[Diagram: Existing Processor (load eventually stalls waiting for multiplier) vs. Intel® Core™ µarch (load pipe is free to advance); operands A and B]
Leading Compute Density
2x Compute Throughput / Clock
![Page 52: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/52.jpg)
52
[Block diagram: Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers; execution ports ALU/Branch, ALU/FAdd and ALU/FMul, each with MMX/SSE and FPmove; Load and Store ports; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s]
Smart Memory Access
Memory Disambiguation
Improved Prefetchers
Hiding Latency to Memory Subsystem
![Page 53: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/53.jpg)
53
Smart Memory Access – Goal
[Diagram: Core 1 and Core 2, each with its own L1 Data Cache, above the Smart-Shared L2 Cache and the System Bus]
WHEN: Ensure data can be used as early as possible
WHERE: Ensure user of data has it as close as possible
Hiding Latency to Memory Subsystem
![Page 54: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/54.jpg)
54
Subsequent Loads Must Wait
Without Memory Disambiguation
[Diagram: instruction sequence Store1, Load2, Store3, Load4 accessing Data W, X, Y and Z in memory, with execution-order markers 1 to 4]
Load4 must WAIT until previous stores complete
Waits for Data X before it can execute
![Page 55: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/55.jpg)
55
Solving the Problem of When
With Intel’s New Memory Disambiguation
[Diagram: instruction sequence Store1, Load2, Store3, Load4 accessing Data W, X, Y and Z in memory, with execution-order markers 1 to 4]
Loads can decouple from Stores
Load4 can get its data FIRST
Smart Memory Access
![Page 56: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/56.jpg)
56
Memory Disambiguation
Memory Disambiguation predictor
– Loads that are predicted NOT to forward from preceding store are allowed to schedule as early as possible
– increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
– Extension to existing coherency mechanism
– Invisible to software and system
Smart Memory Access
Hiding Latency to Memory Subsystem
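The predict-then-verify flow on this slide can be sketched with a toy model. The slide does not say how the predictor is built, so everything here is an assumption made for illustration: the table of saturating counters, the indexing by a load's instruction address, the table size, and the threshold. It is not a description of Intel's hardware.

```python
class DisambiguationPredictor:
    """Toy predictor: one saturating counter per table entry (hypothetical design)."""

    def __init__(self, entries: int = 256, threshold: int = 15):
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries

    def _index(self, load_ip: int) -> int:
        # Hypothetical hash: low bits of the load's instruction address.
        return load_ip % self.entries

    def may_hoist(self, load_ip: int) -> bool:
        """Predict that the load will NOT forward from an earlier store."""
        return self.counters[self._index(load_ip)] >= self.threshold

    def retire(self, load_ip: int, conflicted: bool) -> None:
        """Check at retirement: train up on success, reset on a real conflict."""
        i = self._index(load_ip)
        if conflicted:
            self.counters[i] = 0
        else:
            self.counters[i] = min(self.counters[i] + 1, self.threshold)


predictor = DisambiguationPredictor(entries=16, threshold=3)
for _ in range(3):
    predictor.retire(0x40, conflicted=False)
print(predictor.may_hoist(0x40))
```

A load is only hoisted above earlier stores once it has a history of not conflicting, and a single detected conflict at retirement sends it back to the conservative path.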
![Page 57: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/57.jpg)
57
Smart Memory Access
Prefetchers
[Diagram: Load1, Load2, Load3 and Load4 (oldest to youngest) accessing the L1 Data Cache, backed by the Shared L2 Data Cache]
![Page 58: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/58.jpg)
58
Smart Memory Access: Prefetchers
Memory is too far away
[Diagram: Load1, Load2, Load3 and Load4 (oldest to youngest) accessing the L1 Data Cache, backed by the Shared L2 Data Cache]
![Page 59: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/59.jpg)
59
Smart Memory Access: Prefetchers
Caches are closer when they have the data
[Diagram: Load1, Load2, Load3 and Load4 (oldest to youngest) accessing the L1 Data Cache, backed by the Shared L2 Data Cache]
![Page 60: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/60.jpg)
60
Smart Memory Access: Prefetchers
Prefetchers detect applications’ data reference patterns
[Diagram: Load1, Load2, Load3 and Load4 (oldest to youngest) accessing the L1 Data Cache, backed by the Shared L2 Data Cache]
![Page 61: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/61.jpg)
61
Smart Memory Access: Prefetchers
And bring the data closer to data consumer
[Diagram: Load1, Load2, Load3 and Load4 (oldest to youngest) accessing the L1 Data Cache, backed by the Shared L2 Data Cache]
![Page 62: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/62.jpg)
62
Smart Memory Access: Prefetchers
[Diagram: Load1, Load2, Load3 and Load4 (oldest to youngest) accessing the L1 Data Cache, backed by the Shared L2 Data Cache]
Solving the Problem of Where
Minimizing Memory Latency
![Page 63: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/63.jpg)
63
Prefetchers and Multi-Core
[Block diagram: Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers; execution ports ALU/Branch, ALU/FAdd and ALU/FMul, each with MMX/SSE and FPmove; Load and Store ports; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s]
![Page 64: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/64.jpg)
64
Prefetchers and Multi-Core
![Page 65: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/65.jpg)
65
Prefetchers and Multi-Core
[Block diagram: Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers; execution ports ALU/Branch, ALU/FAdd and ALU/FMul, each with MMX/SSE and FPmove; Load and Store ports; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s]
Three Individual Prefetchers per Core
Two L2 Prefetchers dynamically shared
![Page 66: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/66.jpg)
66
Smart Memory Access
8 Prefetchers per two-core processor
– 2 data and 1 instruction prefetcher per core
  – able to handle multiple simultaneous patterns
– 2 prefetchers in the L2 cache
  – tracking multiple patterns per core
Prefetchers monitor demand traffic and regulate “aggression”
Implementation “knobs” allow platform and segment specific settings tailored to applications and usage models
Data Is Where You Need It, When You Need It
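As a rough picture of what "detecting a data reference pattern" can mean, here is a minimal stride-detection sketch. The slides do not describe the detection algorithm, so the single-stream design, the `degree` parameter, and the class name are all assumptions chosen to keep the example small.

```python
class StridePrefetcher:
    """Toy single-stream stride detector (illustrative only)."""

    def __init__(self, degree: int = 2):
        self.degree = degree          # how many lines ahead to request (hypothetical knob)
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr: int) -> list:
        """Record one demand access; return the addresses to prefetch, if any."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only prefetch once the same non-zero stride has been seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


pf = StridePrefetcher(degree=2)
for address in (0, 64, 128):
    print(address, pf.observe(address))
```

The `degree` parameter stands in for the "aggression" the slide mentions: a larger value fetches further ahead at the cost of more bus traffic.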
![Page 67: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/67.jpg)
67
Advanced Smart Cache
Multi-core Optimized
All the Smart Cache benefits:
• L2 can adapt to each core’s load
• Fast data sharing
• No replicated data
Plus:
• 2X BW to L1 caches
[Diagram: Core 1 and Core 2 sharing one L2 Cache]
Shared & Multi-Core Optimized, with 2x Bandwidth
![Page 68: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/68.jpg)
68
Advanced Smart Cache
Dynamic Cache Allocation
[Diagram: Advanced Smart Cache (Core1 and Core2 sharing one L2 Cache) vs. Independent Cache (today) (Core1 and Core2 each with its own L2 Cache)]
Shared Cache adapts to mismatched loads. Independent Cache can thrash heavy app even when other cache is under-utilized
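The thrashing claim can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not from the slides: a fully associative LRU cache, one heavy and one light core taking turns, and tiny made-up working-set sizes.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that only counts misses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, key) -> None:
        if key in self.lines:
            self.lines.move_to_end(key)          # mark as most recently used
        else:
            self.misses += 1
            self.lines[key] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)   # evict least recently used


def total_misses(heavy_set: int, light_set: int, rounds: int,
                 total_lines: int, shared: bool) -> int:
    """Run a heavy core and a light core against one shared or two split caches."""
    if shared:
        l2 = LRUCache(total_lines)
        core1 = core2 = l2
        caches = [l2]
    else:
        core1 = LRUCache(total_lines // 2)
        core2 = LRUCache(total_lines // 2)
        caches = [core1, core2]
    for _ in range(rounds):
        for line in range(heavy_set):
            core1.access(("core1", line))
        for line in range(light_set):
            core2.access(("core2", line))
    return sum(cache.misses for cache in caches)


print(total_misses(6, 1, rounds=10, total_lines=8, shared=True))
print(total_misses(6, 1, rounds=10, total_lines=8, shared=False))
```

With these numbers the heavy core's six-line working set fits in the eight shared lines but not in its four-line private half, so the split configuration misses on every heavy access while the light core's cache sits mostly empty.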
![Page 69: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/69.jpg)
69
Advanced Smart Cache
Efficient Data Sharing
[Diagram: Independent Cache (Core1 and Core2 each with its own L2 Cache, reaching main memory over the FSB) vs. Advanced Smart Cache (Core1 and Core2 sharing one L2 Cache, reaching main memory over the FSB)]
2X L2 to L1 Bandwidth
![Page 70: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/70.jpg)
70
Intelligent Power Capability
Extending the power management architecture
• Intel® Pentium® M processor innovated a new power management architecture
• Intel® Core™ Duo extended the Pentium® M processor capability to multi-core
New Power Features within each processor core
• Ultra fine-grained power control
• Split Busses
• Platformization of Power Management Architecture
Enhancing Energy Efficiency
![Page 71: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/71.jpg)
71
Ultra Fine Grained Power Control
Even during periods of high performance execution, many parts of the chip core can be shut off.
Example could be a SW memory initialization executing from front end with IQ operating as loop cache.
[Block diagram: Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers; execution ports ALU/Branch, ALU/FAdd and ALU/FMul, each with MMX/SSE and FPmove; Load and Store ports; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s; three blocks are marked FP]
Intelligent Power Capability
![Page 72: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/72.jpg)
72
Intelligent Power Capability
Split Busses (core power feature)
Many buses are sized for worst-case data
(x86 instruction of 15 bytes)
(ALU can write-back 128 bits)
Improved Energy Efficiency
![Page 73: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/73.jpg)
73
Intelligent Power Capability
Split Busses (core power feature)
By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining dynamic capacitance (C dynamic) closer to that of thinner buses
Improved Energy Efficiency
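One way to picture the split-bus idea is to count how many bus bits must be driven for a transfer of a given width. The slides do not say how the buses are segmented, so the 64-bit segment size and the function below are purely hypothetical; driven bits are used here only as a crude stand-in for switched capacitance.

```python
def driven_bits(data_bits: int, bus_bits: int = 128, segment_bits: int = 64) -> int:
    """Bits of bus that must be driven for one transfer on a segmented bus.

    Only as many segments as the data needs are enabled. Setting
    segment_bits equal to bus_bits models an unsplit bus.
    """
    segments = -(-data_bits // segment_bits)     # ceiling division
    return min(segments * segment_bits, bus_bits)


for width in (32, 64, 128):
    print(width, driven_bits(width), driven_bits(width, segment_bits=128))
```

In this model a narrow result drives only half of the split bus, while a full 128-bit write-back still gets the whole width.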
![Page 74: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/74.jpg)
74
Platformization of Power Management Architecture
Integrating best features from Server and Mobile products
Exposing more to the system
PSI-2: Power Status Indicator (Mobile)
DTS: Digital Thermal Sensors
PECI: Platform Environment Control Interface
![Page 75: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/75.jpg)
75
Power Status Indicator (Mobile)
Processor communicates power consumption to external platform components
– Optimization of voltage regulator efficiency
– Load line and power delivery efficiency
[Diagram: processor connected to the VR via PSI-2 / VID]
![Page 76: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/76.jpg)
76
Enabling Efficient Processor and Platform Thermal Control…
DTS – Digital Thermal Sensor
Several thermal sensors are located within the Processor to cover all possible hot spots
Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time
Accurately reporting Processor temperature enables advanced thermal control schemes
[Diagram: Core 1 DTS Logic and Core 2 DTS Logic, each with an LPF, feeding DTS control and status]
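The scan-and-take-the-maximum behaviour described above can be sketched in a few lines. The smoothing step is a guess at the role of the LPF blocks in the diagram: the slide does not say where the filter sits or how it is built, so the exponential moving average and its `alpha` value are assumptions.

```python
class DtsLogic:
    """Toy per-core DTS logic: scan sensors, keep the hottest, then smooth it."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha        # hypothetical filter coefficient
        self.filtered = None

    def sample(self, sensor_temps_c: list) -> float:
        hottest = max(sensor_temps_c)            # maximum temperature across hot spots
        if self.filtered is None:
            self.filtered = hottest              # first scan seeds the filter
        else:
            self.filtered = self.alpha * hottest + (1 - self.alpha) * self.filtered
        return self.filtered


core1 = DtsLogic(alpha=0.5)
print(core1.sample([50.0, 60.0, 55.0]))
print(core1.sample([70.0, 40.0, 45.0]))
```

Reporting the maximum rather than an average means a single hot spot is never hidden by cooler regions of the die.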
![Page 77: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/77.jpg)
77
Platform Environment Control Interface (PECI)
Processor provides its temperature reading over a multi-drop single-wire bus, allowing efficient platform thermal control
[Diagram: PROC #1, PROC #2 and PROC #3 on the PECI bus to a Manager that drives the Processor Fan, Auxiliary Fan, Chassis Fan 1 and Chassis Fan 2]
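A platform manager that collects one temperature per processor might turn those readings into a fan setting along these lines. This sketch does not model the PECI bus or its protocol; the linear fan curve, the threshold temperatures, and the function name are hypothetical.

```python
def fan_duty_percent(processor_temps_c: list,
                     idle_c: float = 40.0, max_c: float = 80.0) -> int:
    """Map the hottest processor reading onto a 0-100% fan duty cycle.

    idle_c and max_c are made-up thresholds: at or below idle_c the fan is
    off, at or above max_c it runs at full speed, linear in between.
    """
    hottest = max(processor_temps_c)
    fraction = (hottest - idle_c) / (max_c - idle_c)
    return round(100 * min(1.0, max(0.0, fraction)))


# One reading per processor on the shared bus (values are illustrative).
print(fan_duty_percent([45.0, 60.0, 50.0]))
```

Driving the fans from the hottest processor keeps the policy simple; a real manager could equally use per-zone curves for the chassis and auxiliary fans.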
![Page 78: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/78.jpg)
78
Microarchitecture Feature Summary
[Block diagram: Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (ReOrder Buffer), Schedulers; execution ports ALU/Branch, ALU/FAdd and ALU/FMul, each with MMX/SSE and FPmove; Load and Store ports; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s]
New, State-of-the-Art Microarchitecture
Wide Dynamic Execution: 33% wider pipes (4 vs. 3) and greater efficiency
Advanced Digital Media Boost: 2x compute throughput / clock
Smart Memory Access: Minimizing latency – Data Where & When needed
Advanced Smart Cache: Multi-Core optimized, shared with 2x bandwidth
Intelligent Power Capability: Improved energy efficient performance
![Page 79: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/79.jpg)
79
Agenda
– Multi-core Update and New Micro-architecture level set
– New Intel® Core™ Microarchitecture
  – Intel Microarchitecture History
  – Intel® Core™ Microarchitecture Design Goals and Roadmap
  – Processor Architecture 101
  – Intel® Core™ Microarchitecture
  – Software Implications
– Summary
![Page 80: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/80.jpg)
80
Intel® Core™ Microarchitecture and Software
Software consistency across application space
– Wide Dynamic Execution will provide generic performance gains
– Smart Memory Access targets memory intensive apps
– Advanced Digital Media Boost provides a leap in capability for media and floating point apps
– Multi-Core and Advanced Smart Cache further improve the growing number of multi-threaded applications
Software consistency across market segments
– New apps and optimizations can target a single microarchitecture
Immediate Performance Increase Across Applications and Segments
![Page 81: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/81.jpg)
81
Agenda
• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
• Summary
![Page 82: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth](https://reader033.fdocuments.net/reader033/viewer/2022043006/5f90e32c4a47b23257498c24/html5/thumbnails/82.jpg)
82
Summary
Continuing to drive aggressive multi-core ramp
– Dual-core ramp in 2006, quad-core starts in early 2007
Intel® Core™ microarchitecture delivers leading performance and performance/watt
– Conroe: >40% performance increase¹ / >40% less power
– Woodcrest: >80% performance increase¹ / >35% less power
– Mobile: Extending leadership delivered with Intel® Core™ Duo with >20% performance increase¹
On track for product introductions starting in Q3’06
– Based upon new Intel® Core™ microarchitecture
¹ Estimated SPECint* rate