CS 15-447: Computer Architecture, Fall 2008
Lecture 26: Emerging Architectures
November 19, 2007
Nael Abu-Ghazaleh
[email protected]
http://www.qatar.cmu.edu/~msakr/15447-f08
Last Time: Buses and I/O
• Buses: a bunch of wires
• Shared interconnect: multiple “devices” connect to the same bus
• Versatile: new devices can connect (even ones we didn’t know existed when the bus was designed)
• Can become a bottleneck
  – Shorter -> faster; fewer devices -> faster
• Have to:
  – Define the protocol that lets devices communicate
  – Come up with an arbitration mechanism
(Figure: a shared bus, showing its data lines and control lines)
Types of Buses
• System bus
  – Connects processor and memory
  – Short, fast, synchronous, design specific
• I/O bus
  – Usually lengthy and slower; industry standard
  – Needs to match a wide range of I/O devices
  – Connects to the processor-memory bus or backplane bus
(Figure: processor and memory on a processor-memory bus; bus adaptors connect it to a backplane bus and to I/O buses)
Bus “Mechanics”
• Master/slave organization
• Have to define how devices handshake
  – Depends on whether the bus is synchronous or not
• Bus arbitration protocol
  – Contention vs. reservation; centralized vs. distributed
• I/O model
  – Programmed I/O; interrupt-driven I/O; DMA (see the polling sketch below)
• Increasing performance (mainly bandwidth)
  – Shorter; closer; wider
  – Block transfers (instead of byte transfers)
  – Split-transaction buses
  – …
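To make the programmed-I/O model concrete, here is a minimal polling sketch in C. The register addresses, names, and ready bit are invented for illustration (real memory-mapped devices define their own); only the structure — poll the status register, then move the data yourself — is the point.

#include <stdint.h>

/* Hypothetical memory-mapped device registers; addresses and the
 * READY bit are made up for this sketch, not from any real device. */
#define DEV_STATUS ((volatile uint32_t *)0xFFFF0000u)
#define DEV_DATA   ((volatile uint32_t *)0xFFFF0004u)
#define DEV_READY  0x1u

/* Programmed I/O: the CPU busy-waits on the status register, doing no
 * useful work, then reads the data register itself. Interrupt-driven
 * I/O replaces the loop with an interrupt; DMA moves the data without
 * involving the CPU at all. */
uint32_t pio_read_word(void)
{
    while ((*DEV_STATUS & DEV_READY) == 0)
        ;                       /* poll until the device signals ready */
    return *DEV_DATA;           /* the CPU itself moves the data */
}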
Today: Emerging Architectures
• We are at an interesting point in computer architecture evolution
• What is emerging and why is it emerging?
Uniprocessor Performance (SPECint)

(Figure: performance relative to the VAX-11/780, 1978-2006, log scale from 1 to 10,000; growth of 25%/year through 1986, then 52%/year through 2002, then a flattening to ??%/year that leaves roughly a 3X gap versus the old trend)

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006

Sea change in chip design: what is emerging?
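A quick sanity check on what those rates compound to over the sixteen years from 1986 to 2002 (our arithmetic, not from the slide):

\[
1.52^{16} \approx 812 \qquad\text{vs.}\qquad 1.25^{16} \approx 36
\]

So the post-1986 growth rate bought nearly three orders of magnitude of performance; at the old VAX rate it would have been only about 36x.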
How did we get there?
• First, what allowed the ridiculous 52% improvement per year to continue for around 20 years?
  – If cars had improved as much, we would have million-km/hr cars!
• Is it just the number of transistors / the clock rate?
• No! It’s also all the stuff that we’ve been learning about!
Walk down memory lane
• What was the first processor organization we looked at?
  – Single-cycle processors
• How did multi-cycle processors improve on those?
• What did we do after that to improve performance?
  – Pipelining; why does that help? What are the limitations?
• From there we discussed superscalar architectures
  – Out-of-order execution; multiple ALUs
  – This is basically the state of the art in uniprocessors
  – What gave us problems there?
Detour: a couple of other design points

• Very Long Instruction Word (VLIW) architectures: let the compiler do the work
• Great for energy efficiency: less hardware spent hunting for instruction-level parallelism
• Not binary compatible? Transmeta Crusoe processor
SIMD ISA Extensions: Parallelism from the Data?
• The same instruction applied to multiple data elements at the same time
  – How can this help?
• MMX (Intel) and 3DNow! (AMD) ISA extensions
• Great for graphics; originally invented for scientific codes (vector processors)
  – Not a general solution
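As a concrete sketch of the idea (using SSE rather than the MMX/3DNow! extensions named above, since SSE’s float intrinsics make the cleanest example; the function name is ours):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One vector add instruction processes four floats at once:
 * the same instruction, applied to multiple data elements. */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* 4 adds, 1 instruction */
    }
    for (; i < n; i++)          /* scalar cleanup for leftover elements */
        c[i] = a[i] + b[i];
}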
• End of detour!
Back to Moore’s law
• Why are the “good times” over?
  – Three walls

1. The Instruction-Level Parallelism (ILP) Wall
  – Less parallelism available in programs (2 -> 4 -> 8 -> 16 issue width)
  – Tremendous increase in complexity to get more
  – Does VLIW help?
  – What can help?
  – Conclusion: standard architectures cannot continue to do their part in sustaining Moore’s law
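A small sketch of where the parallelism runs out (our example): in the first loop each iteration needs the previous iteration’s result, so no issue width helps; in the second the iterations are independent and a superscalar core can overlap them.

/* Serial dependence chain: iteration i+1 cannot start its multiply-add
 * until iteration i has produced s. ILP is ~1 no matter how wide the core. */
double chained(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = s * x[i] + 1.0;
    return s;
}

/* Independent iterations: a wide out-of-order core can execute
 * several of these multiply-adds in the same cycle. */
void independent(const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] * x[i] + 1.0;
}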
Wall 2: Memory Wall
• What did we do to help with this?
  – It is still very, very expensive to access memory
• How do we see the impact in practice?
• Very different from when I learned architecture!
(Figure, labeled “Moore’s Law”: processor vs. DRAM performance, 1980-2000, log scale; µProc improves at 52%/yr (2X every 1.5 years) while DRAM improves at 9%/yr (2X every 10 years), so the processor-memory performance gap grows about 50% per year)
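One way to see the memory wall in practice (a sketch, with sizes chosen by us): both functions below do identical arithmetic, but the column-order walk strides through memory and misses the cache on nearly every access, so on a typical machine it runs several times slower.

#define ROWS 1024
#define COLS 1024
static double m[ROWS][COLS];   /* 8 MB: far larger than typical caches */

double sum_row_order(void)     /* unit stride: streams through memory */
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

double sum_col_order(void)     /* stride of one full row: cache hostile */
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}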
Ways out? Multithreaded Processors
• Can we switch to other threads if we need to access memory?
  – When do we need to access memory?
• What support is needed?
• Can I use it to help with the ILP wall as well?
Simultaneous Multithreaded (SMT) Processors
• How do I switch between threads?
• Hardware support for that
• How does this help?
• But, increased contention for everything (BW, TLB, caches…)
Third Wall: Physics/Power wall
• We’re down to the level of playing with a few atoms
• More error-prone; lower yield
• But also soft errors and wear-out
  – Logic that sometimes works!
  – Can we do something in architecture to recover?
Power! Our topic next class
So, what is our way out? Any ideas?
• Maybe architecture becomes a commodity; this is the best we can do
  – This happens to a lot of technologies: why don’t we have the million-km/hr car?
• Do we actually need more processing power?
  – 8-bit embedded processors are good enough for calculators; 4-bit ones are probably good enough for elevators
  – Is there any sense in continuing to invest so much time and energy in this stuff?

Power Wall + Memory Wall + ILP Wall = Brick Wall
A lifeline? Multi-core architectures
• How does this help?
• Think of the three walls
• The new Moore’s law:
  – The number of cores will double every 3 years!
  – Many-core architectures
Overcoming the three walls
• ILP wall?
  – No need to restrict ourselves to a single thread
  – Natural parallelism available across threads/programs (see the sketch after this list)
• Memory wall?
  – Hmm, that is a tough one; on the surface, it seems like we made it worse
  – Maybe help is coming from industry
• Physics/power wall?
  – Use less aggressive core technology
    • Simpler processors, shallower pipelines
    • But more processors
  – Throw-away cores to improve yield
• Do you buy it?
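A minimal sketch of the thread-level parallelism being appealed to here (our example, using POSIX threads; compile with -lpthread): the array sum is split across one worker per core, and each worker’s loop is independent, sidestepping the single-thread ILP limit.

#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4                /* e.g., one worker per core */

static double a[N];
static double partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;              /* private slot: no locking needed */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double sum = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];
    }
    printf("sum = %.0f\n", sum);
    return 0;
}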
7 Questions for Parallelism
• Applications:
  1. What are the apps?
  2. What are the kernels of the apps?
• Hardware:
  3. What are the HW building blocks?
  4. How do we connect them?
• Programming models:
  5. How do we describe the apps and kernels?
  6. How do we program the HW?
• Evaluation:
  7. How do we measure success?

(Inspired by a view of the Golden Gate Bridge from Berkeley)
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10-micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3-micron NMOS, 60 mm² chip
• A 125 mm² chip in 0.065-micron CMOS holds 2312 copies of RISC II + FPU + I-cache + D-cache
  – RISC II shrinks to 0.02 mm² at 65 nm

Processor is the new transistor!
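The shrink arithmetic behind that last bullet, roughly (our back-of-the-envelope; area scales with the square of the feature size, ignoring wiring and layout differences):

\[
60\ \mathrm{mm}^2 \times \left(\frac{0.065\ \mu\mathrm{m}}{3\ \mu\mathrm{m}}\right)^{2} \approx 0.03\ \mathrm{mm}^2
\]

which is in the same ballpark as the 0.02 mm² quoted above.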
Architecture Design space
• What should each core look like?
• Should all cores look the same?
• How should the chip interconnect between them look?
• What level of the cache should they share?
  – And what are the implications of that?
• Are there new security issues?
  – Side-channel attacks; denial-of-service attacks
• Many other questions…

Brand new playground; exciting time to do architecture research
Hardware Building Blocks: Small is Beautiful
• Given the difficulty of design/validation of large designs…
• Given power limits on what we can build, parallelism is the energy-efficient way to achieve performance
  – Lower threshold voltage means much lower power
• Given that redundant processors can improve chip yield
  – Cisco Metro: 188 processors + 4 spares
  – Sun sells 6- or 8-processor versions of Niagara
• Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
• One size fits all?
  – Amdahl’s Law suggests a few fast cores + many small cores (worked out below)
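To see why Amdahl’s Law points toward a few fast cores plus many small ones, a worked instance (the 95% parallel fraction is our assumption, for illustration):

\[
\mathrm{Speedup}(N) = \frac{1}{(1-p) + p/N}, \qquad
\mathrm{Speedup}(1024)\Big|_{p=0.95} = \frac{1}{0.05 + 0.95/1024} \approx 19.6
\]

The serial 5% caps the speedup at 20x no matter how many small cores we add, so speeding up the serial part with a few fast cores pays off disproportionately.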
Elephant in the room
• We tried this parallel-processing thing before
  – Very difficult
• It failed, pretty much
  – A lot of academic progress and neat algorithms, but little commercial impact
• We actually have to do new programming
  – A lot of effort to develop; error prone; etc.
  – The La-Z-Boy programming era is over
  – Need new programming models
• Amdahl’s law
• Applications: what will you use 1024 cores for?
• These concerns are being voiced by a substantial segment of academia/industry
  – What do you think?
  – It’s coming, no matter what