Crash course in CPUs


description

Light a candle and bake a cake, then pop down to Clintons to pick up an hilarious card – for your CPU has just turned 30! While its best years aren’t behind it quite yet, it could do with cheering up. In 1978, Intel released its first 16-bit microprocessor, the 8086. Although it was the cheaper, cut down 8-bit version – the 8088 – that made it into the IBM PC and quite literally changed the world as we know it, today’s Core 2 and Phenom chips are designed to run code based on what’s still called the x86 instruction set. In fact, they still share some important common core characteristics with the venerable 8086.

Transcript of Crash course in CPUs

Page 1: Crash course in CPUs

PLUS!

HOTWIRED HARDCORE PC ADVICE ¤ HACKING ¤ OVERCLOCKING ¤ MODDING

NINJA RIG SPECIAL!

SILENCE YOUR PC Blazing fast and dead quiet yours from as little as £29

THERE’S MORE… Meet the incredible £108 PC • Solid state drive breakthrough • First Person Shooter Hall of Fame

AMD BACK FROM THE DEAD IN OUR GRAPHICS SUPERTEST SEE P60

Smoking performance from just £58!

ISSUE 220

DEC 2008

CPU TECH EXPLAINED! AMAZING FREE DVD INSIDE! NEHALEM, MICRO-OP FUSION, IMC, FCLGA AND MUCH MORE!

59 MUST-PLAY GAMES • 50 POWER TOOLS • 69 ESSENTIAL FREE APPS Turn to page 118

PERFORMANCE GEAR & GAMING

ISSUE 220 AU REVOIR, GOPHER

WWW.PCFORMAT.CO.UK

Issue 220 Dec 2008 £5.99 Outside UK & ROI £6.49

Steam Back-up your games

WARHAMMER: RVR MASTERCLASS

FAR CRY 2 DEFINITIVE REVIEW

PCF220.cover 1 6/10/08 11:19:46 am

Page 2: Crash course in CPUs

The future of CPUs

72 December 2008

Crash course in CPUs

How do tiny grains of sand turn numbers into stunning 3D games anyway? Adam Oxford goes under the ceramic of the CPU to find out


Page 3: Crash course in CPUs

tried. The 8086 ran at 4MHz, had a total transistor count of less than 30,000 and was packaged in a 40-pin dual in-line chip: physically, it was one of those long black things with the legs sticking out from the sides like an evil metal spider.

The Core i7, by contrast, is a two-, four- or eight-core beast, with up to 1.4 billion transistors in its largest variety. At launch, it will be clocked at well over the 3GHz mark. It has 1567 pin outs, and comes in the flat FCLGA (flip chip land grid array) packaging that will be familiar from the Core 2 line. That means that balls of solder meet the circuit board head on, and end in simple pads which are then laid on top of the pins in the motherboard socket.

We’ve come a long way, clearly. The CPUs of the seventies look like single-celled organisms in primordial processor sludge by comparison to the staggering complexity of today’s chips. It takes teams of hundreds of people several years to design a new CPU, and it’s unlikely that any individual could completely navigate the finished silicon topography by hand.

INSIDE THE SHELL

We can, however, do our bit to improve general understanding by looking at certain core principles of CPU design. Technically speaking, a CPU is any processor that can execute programmable code, but for the purposes of our sanity, we’ll stick to a discussion of modern day x86 chips here.

Though the layout of today’s chips bears as much resemblance to the original 8086s as a dog does to its jellyfish ancestors, the core operational procedure nevertheless follows the same cycle. The CPU’s task is broken into four stages: fetch, decode, execute and writeback.

Instructions are called from a memory store to the registers. These are then interpreted, processed and a result is written back. This result can be output to, say, a graphics card or hard drive, or called back into the CPU for processing again. However intricate a processor is, these basic four steps are a good way to understand how they work and why they are designed the way they are.
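Those four stages can be sketched as a toy interpreter. The instruction format and register names below are invented purely for illustration – no real x86 semantics are modelled:

```python
# Toy illustration of the fetch-decode-execute-writeback cycle.
def run(program, registers):
    """Step through a list of (op, dest, src1, src2) tuples."""
    pc = 0
    while pc < len(program):
        instruction = program[pc]          # FETCH: pull the next instruction
        op, dest, a, b = instruction       # DECODE: work out what to do
        if op == "ADD":                    # EXECUTE: perform the operation
            result = registers[a] + registers[b]
        elif op == "MUL":
            result = registers[a] * registers[b]
        elif op == "XOR":
            result = registers[a] ^ registers[b]
        else:
            raise ValueError(f"unknown op {op}")
        registers[dest] = result           # WRITEBACK: store the result
        pc += 1
    return registers

regs = run([("ADD", "ax", "bx", "cx"),
            ("MUL", "dx", "ax", "ax")],
           {"ax": 0, "bx": 2, "cx": 3, "dx": 0})
print(regs["ax"], regs["dx"])  # 5 25
```

A real pipeline overlaps these stages for many instructions at once, as we'll see, but the cycle itself is exactly this simple.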

That cycle can be sped up, of course, by increasing the clockspeed of the CPU and the number of cycles it performs per second. Intel learnt the hard way that the

Light a candle and bake a cake, then pop down to Clintons to pick up an hilarious card – for your CPU has just turned 30! While its best years aren’t behind it quite yet, it could do with

cheering up. In 1978, Intel released its first 16-bit microprocessor, the 8086. Although it was the cheaper, cut down 8-bit version – the 8088 – that made it into the IBM PC and quite literally changed the world as we know it, today’s Core 2 and Phenom chips are designed to run code based on what’s still called the x86 instruction set. In fact, they still share some important common core characteristics with the venerable 8086.

Quite why it should have been the x86 family is a different story for another time. Intel’s chips were far from the most advanced, cleverest or cheapest available at the end of the 1970s, and had some fairly serious design bugs – faulty chips that IBM had to replace free of charge some years later. In the annals of our times, though, that will be deemed irrelevant: this was the general purpose processor that drove the desktop revolution.

Curiously, one of its competitors – the Zilog Z80 which powered Sinclair’s home computer of (almost) the same name – is actually still manufactured and used today. The 8086, however, has been consigned to history.

Why do we bring these curious factoids up? Because later this month also sees the launch of Intel’s seventh generation of x86 CPUs, the Core i7 (Nehalem). Intel is touting it as the biggest architectural change in the company’s history; and for once we’re actually prepared to believe it.

The success of x86 is, of course, backwards compatibility. Somewhere in the Core i7’s infinitely more complex design are the same 116 instructions that the 8086 could execute, albeit substantially enhanced with later additions, and the same is true of the AMD Phenom. These are the basic arithmetic and logic commands – like ADD, MUL, OR and XOR – along with a few more specific instructions for deciding which bit of data belongs in which block of memory or system register.

In reality, of course, the things couldn’t be more different today if they

December 2008 73

“Layout of today’s chips bears as much resemblance to the original 8086s

as a dog does to its jellyfish ancestors”

The future of CPUs


Page 4: Crash course in CPUs

CPUs explained

74 December 2008

INTEL NEHALEM MICROARCHITECTURE

[Diagram: the front end feeds a predecode and instruction length decoder, an 18-entry x86 instruction queue with macro-op fusion, one complex and three simple decoders, and a 28-entry decoded instruction queue with micro-op fusion and a loop stream decoder. The out-of-order engine comprises two register allocation tables (RAT), a 128-entry reorder buffer, twin retirement register files and a 128-entry reservation station issuing across six ports to the execution units: integer/MMX ALUs, SSE add/move, SSE mul/div, FP add, FP mul, branch, and the store/load address generation units (AGUs), backed by a memory order buffer (MOB), a 512-entry L2 TLB and branch prediction (global/bimodal, loop, indirect). The ‘uncore’ holds the QuickPath Interconnect and the DDR3 memory controller. GT/s: gigatransfers per second. Diagram: Appaloosa, http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)]


Page 5: Crash course in CPUs

key to building a really fast processor isn’t just raw gigahertz. If a single cycle requires a certain amount of electricity to be performed, more cycles per second means increasing the power consumed and – importantly – the heat produced. The theoretically scalable NetBurst architecture of the Pentium 4 came a cropper when it hit an unexpected top speed barrier beyond which trying to cool the chip was impracticable for most. Which tells us that processor designers may be very clever, but they can’t foresee everything.

In the same way that graphics technology has moved to unified shading in order to make more efficient use of the processing power available, today’s design goals are to keep all the various parts of the CPU working on useful information. Note the inclusion of the word ‘useful’ there.

FETCH, DECODE, ETC

The simplest form of CPU takes one piece of data, works out what to do with it, does it and then outputs the result. The inherent problem is that it can only work on one piece of data at a time, and while that’s being passed through to the part of the execution engine that’s designed to perform the requested operation, the rest of the CPU is sitting idle.

The solution to this is to introduce some form of parallelism to the ‘pipeline’. To start with, this might have been simply to have the ‘fetch’ part of the CPU grabbing a new piece of data while the ‘decode’ bit is working on another. That’s been developed somewhat, mind you, and the last iteration of Pentium 4 had a whopping 31 stages to its pipeline.
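The appeal of pipelining is easy to see with some idealised arithmetic, assuming a perfectly full pipeline with no stalls:

```python
# Back-of-envelope pipelining arithmetic (idealised, no stalls).
# A k-stage pipeline finishes its first instruction after k cycles,
# then retires one instruction every cycle thereafter.

def cycles_unpipelined(n_instructions, n_stages):
    # Each instruction occupies the whole CPU for all k stages
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # Fill the pipe once, then one result per cycle
    return n_stages + (n_instructions - 1)

n, k = 1000, 5
print(cycles_unpipelined(n, k))  # 5000
print(cycles_pipelined(n, k))    # 1004
```

Real pipelines never hit this ideal, of course – which is exactly the problem the next paragraph describes.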

The problem with long pipelines, however, is that they aren’t always terribly efficient, because they’re not always full of useful information. On its journey through the pipeline, a piece of data may return an error or become reliant on other information being drawn from the registers – if it isn’t there, the

result will have to be written out while the new piece of data is fetched and the rest of the pipeline will stand idle.

The key workaround for this in today’s CPUs is to build logical areas that are dedicated to ‘branch prediction’ – in other words, guessing what bits of data are going to be needed next and getting them ready for insertion into the pipe. Of course, branch predictors aren’t infallible, and if the wrong information is called then you’re back to having large amounts of wasted die area. A large part of processor design is finding a happy balance between length of pipeline and CPU cycles lost to such stalling.
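A classic textbook scheme – not Intel’s actual predictor, which is far more elaborate – is the two-bit saturating counter, which takes two wrong guesses in a row to change its mind:

```python
# Minimal two-bit saturating-counter branch predictor.
# Counter values run 0-3; predict 'taken' when the counter is 2 or 3.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start in the weakly-taken state

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Nudge the counter towards the actual outcome, saturating at 0 and 3
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
history = [True, True, False, True, True, True]  # a loop-like branch
hits = 0
for outcome in history:
    if p.predict() == outcome:
        hits += 1
    p.update(outcome)
print(hits, "of", len(history))  # 5 of 6
```

Note how the single not-taken outcome costs one misprediction but doesn’t flip the predictor – exactly the behaviour you want for a loop that exits occasionally.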

Part of Core i7’s secret to success, for example, is using relatively medium-length pipelines, and including a ‘Second-Level Branch Target Buffer’: an extra bit of memory to cache information and

allow the branch to double back on itself if a problem arises.

Below Two dual-core dies mounted on a PCB. The lid you see is the Penryn’s heatsink

There are other ways to speed up data throughput too. Your CPU’s inbox is always overflowing with work to be done, but it will rifle through the pages to take the best job next, not necessarily the first one it was given. The order in which instructions are executed is decided by a scheduler, which independently assesses the most efficient way to do them.

That might mean looking ahead in the currently running thread and pulling out commands that aren’t dependent on the current operation – known as ‘out of order’ processing – or, in the case of a processor core capable of working on more than one thread at once, starting to work through an entirely different instruction loop that just happens not to

December 2008 75

Chip design seems very hard work these days. Is there anyone left who actually understands a CPU in its entirety?

At a high level, a lot of people can explain what we’re doing here, here and here. It’s becoming much harder for any one person to grasp all the internal details from an architecture standpoint, as there are so many components integrated on the die. It’s hard to find folks who understand execution units and memory controllers at a very detailed level, and it’s going to become more complicated going forward. The team that developed Nehalem developed the first Pentium Pro processor in the mid-1990s. Back then each of the architects was able to hold all the key details at a fairly low level. We’re way beyond that now: complexity has grown substantially.

How many people does it take?

It depends how you define it – there are some activities where people are there for four or five years, others where people will complete their work in 12 months then move on to another project. If you look at the peak number of people it’s several

hundred, but how many, I don’t know. Where do you draw the line?

What’s the best new feature in Core i7?

With each processor generation there’s something that’s evolutionary and something that’s revolutionary. What’s really different with Nehalem is the power management stuff. The concepts of power gating and turbo mode are what will be remembered as revolutionary.

Is your experience with Pentium 4 why you put so much work into this area?

You may read into that the fact that the needle has gone so far in the other direction, that we’ve been burnt by that before. If you’ve had an issue before you make sure that it’s never going to be an issue again. And we all understand the benefits of conserving power.

Is Hyperthreading your baby too?

This is the only team at Intel that could have resurrected Hyperthreading and implemented it. We’ve done it before, we understand the good and the bad, and we’re willing to take on that challenge. We have people on the team who’ve been working on hyperthreading since day one of the Pentium 4.

Is Nehalem enough to see off the threat of GP-GPUs in the HPC market?

The big benefit of these versus GP-GPUs is the programming environments. You can use standard tools, rather than specialised kits. The x86 tools are extremely mature. The second advantage is the legacy of backward compatibility we have. You know that apps you write today will run on the next-gen processors; with the graphics cards, you haven’t had that yet.

INTEL INTERVIEW WHO’S AFRAID OF THOSE GP-GPUs?

Ronak Singhal is the Chief Architect for Nehalem. His US-based team was also responsible for the original Pentium 4 design.

The future of CPUs

PCF220.feature2 075PCF220.feature2 075 3/10/08 1:28:6 pm3/10/08 1:28:6 pm

Page 6: Crash course in CPUs

need the same parts of the pipeline as the currently running one. To speed things up further, Core i7 can execute up to four instructions per cycle.
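The out-of-order idea can be sketched as a simple list scheduler: each cycle, pick up to four instructions whose inputs are ready, regardless of their program order. The instruction format here is invented for the sketch:

```python
# Sketch of out-of-order issue with a four-wide issue width.
# Each instruction is (name, destination, source_names).

def schedule(instructions, width=4):
    done = set()              # destinations already computed
    pending = list(instructions)
    cycles = []
    while pending:
        # An instruction is ready once all its sources are computed
        ready = [ins for ins in pending
                 if all(src in done for src in ins[2])][:width]
        if not ready:
            raise RuntimeError("circular dependency")
        for ins in ready:
            pending.remove(ins)
        done.update(ins[1] for ins in ready)
        cycles.append([ins[0] for ins in ready])
    return cycles

program = [
    ("i0", "a", ()),          # independent
    ("i1", "b", ("a",)),      # must wait for i0
    ("i2", "c", ()),          # independent: issues alongside i0
    ("i3", "d", ("b", "c")),  # must wait for i1 and i2
]
plan = schedule(program)
print(plan)  # [['i0', 'i2'], ['i1'], ['i3']]
```

Notice that i2 jumps the queue past i1 – exactly the ‘pulling out commands that aren’t dependent on the current operation’ described above.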

Incidentally, it’s also interesting to note that a CPU’s instruction set – the programming language into which all commands are eventually decoded and compiled – isn’t completely hard-wired into the design. There’s a software layer that handles most of the interpretation known as the ‘microcode’ – a form of non-upgradable firmware stored in an on-board ROM, which works as a mini-operating system. It’s a useful tool for chip builders: because the microcode isn’t finalised until the chip goes into production – and can be rewritten for a new manufacturing run – any problems or improvements that are discovered after the silicon has been laid out can be changed in the software stack. This is, of course, easier than going back to the drawing board and laying out another million or two transistors.
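A microcode store is, in effect, a lookup table from visible instructions to sequences of simpler micro-ops. The entries below are invented for illustration and bear no relation to real x86 microcode:

```python
# Toy 'microcode ROM': each architecturally visible instruction
# expands into a sequence of internal micro-ops.

MICROCODE_ROM = {
    "PUSH": ["decrement_stack_pointer", "store_to_memory"],
    "POP":  ["load_from_memory", "increment_stack_pointer"],
    "INC":  ["add_immediate_1"],
}

def decode(instruction):
    """Expand one instruction into micro-ops via the lookup table."""
    return MICROCODE_ROM[instruction]

micro_ops = [op for ins in ["PUSH", "INC", "POP"] for op in decode(ins)]
print(micro_ops)
```

Repointing a table entry is clearly cheaper than re-laying silicon, which is the whole attraction.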

DEDICATED BITS

If you have a look at the Core i7 block diagram on page 74, you can see that the execution engine is also broken down further into dedicated areas for tasks like integer operations, floating point calculation and SSE instructions. The latter is an acronym of an acronym – Streaming SIMD Extensions, where SIMD stands for Single Instruction Multiple Data. It’s an on-board vector processor capable of performing the same transformation on several pieces of information at once. It’s included on Intel and AMD chips for speeding up things like video processing, where the same

command must be performed on, say, all the pixels on a screen simultaneously.

There’s also, of course, the one important part of a CPU that we haven’t talked about yet: the memory.
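The SIMD idea is simple when sketched in code: one ‘instruction’ that operates on every lane of a vector at once. Plain Python lists stand in here for the 128-bit SSE registers, which hold four 32-bit values each:

```python
# SIMD in spirit: one operation applied across a whole vector of data.

def simd_add(vec_a, vec_b):
    """One 'instruction' that adds four lanes at once."""
    return [a + b for a, b in zip(vec_a, vec_b)]

# Brighten four pixels in one go rather than looping one by one
pixels     = [10, 20, 30, 40]
brightness = [5, 5, 5, 5]
print(simd_add(pixels, brightness))  # [15, 25, 35, 45]
```

On real hardware the four additions happen in a single instruction, which is where the video-processing speed-up comes from.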

Closest to the actual instruction pipeline are the registers: there are 32 of these on a 64-bit chip, and each can

either store a general piece of information or has a specific task or overlapping tasks. In order to help out those prefetch engines we mentioned earlier, though, there are two levels of fast cache memory to store the data which might be needed for the current process, or that has been written out but may be called again. The cache memory is much faster than system memory and prevents the whole system bottlenecking while the RAM is slowly scanned for instructions and data. For multi-core chips, where two or more processors are packaged onto the same die, there’s often a third cache area that is structured to allow the different cores to swap information quickly.
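The effect of a cache is easy to model. In a minimal direct-mapped design, each address maps to exactly one slot, and a repeat visit to a resident address is a fast hit; the sizes and access pattern below are made up for the sketch:

```python
# Minimal direct-mapped cache model: an address maps to exactly one
# slot; a hit means the slot already holds that address.

class DirectMappedCache:
    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.slots = [None] * n_slots
        self.hits = 0
        self.misses = 0

    def access(self, address):
        slot = address % self.n_slots     # each address has one home slot
        if self.slots[slot] == address:
            self.hits += 1
        else:
            self.misses += 1              # fetch from slower memory...
            self.slots[slot] = address    # ...and keep a copy in the cache

cache = DirectMappedCache(4)
for addr in [0, 1, 0, 1, 0, 5, 1]:        # 5 collides with 1 (5 % 4 == 1)
    cache.access(addr)
print(cache.hits, cache.misses)  # 3 4
```

The collision between addresses 1 and 5 shows why real caches add associativity: two busy addresses fighting over one slot evict each other endlessly.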

AMD has long led the way in memory access speed: since the introduction of the Athlon 64, its CPUs have been able to talk directly to the memory via a fast proprietary bus. Meanwhile, Intel chips have had to share access to the memory on the same general system bus – the FSB – as all other information travels. With Core i7, however, Intel has finally introduced a technology it calls ‘QuickPath Interconnect’, or QPI. Broadly analogous to Hypertransport, it allows the CPU to talk directly to components like memory without going via the northbridge on the motherboard, which otherwise acts rather like the router in your home network as a central hub for data transportation and can get quite congested. This should prevent Core i7 bottlenecking despite its enormous demand for incoming information.

POWER MAD

There have been many other improvements to CPUs since the humble 8086. One, for example, is that power management systems are now built onto the die. These serve several purposes – shutting the chip down to protect it from damage when it gets too hot, say, or turning off areas that aren’t being used to conserve electricity. The latter is particularly useful for extending

The future of CPUs

76 December 2008

Above PhysX acceleration is also available in some ATI cards thanks to the CUDA SDK

Above Silicon wafers are carved up into individual processors by tiny saws – not all will survive

Much has been made of the latest graphics cards being able to do more than graphics. Both NVIDIA and AMD have been touting the GP-GPU (General Purpose GPU) properties of their DirectX 10 chips and proprietary programming languages which coders can use to unlock said features. In NVIDIA’s case, this is the CUDA development environment; more phonetically pleasing, AMD calls its GP-GPU technology FireStream.

Both are very promising technologies, but one thing should be clear: no matter what happens, they’re no threat to the current CPU architecture of your PC. A processor like the Core 2 or Phenom is designed to be able to do lots of different things at different speeds at the same time. It might be running Windows, Outlook and a heavily branched AI routine in a game program at the same time. Or they may be running the user interface of a hosted telephone exchange. That takes a particular type of chip, and it isn’t a graphics one. It’s long been suspected that NVIDIA wants to get into the processor market: if it does, it won’t be with a GeForce-derived product.

But with the introduction of unified shaders, graphics companies have found themselves with some interesting designs in their inventories. Essentially, the DX10 GPUs are incredible at parallel processing. They can be configured on the fly to take in a lot of similar pieces of data and perform the same operation on all of them. It may have been designed to take all the pixels beneath a light source and add a green glow to them, but it’s also useful for the prosaically named High Performance Computing (HPC) applications.

These are areas like weather simulators, medical imaging systems, molecular modelling, geophysical analysis and even the financial systems of insurance firms and the Stock Exchange. Previously, they’ve all relied on enormous server farms filled with x86 processors to perform vector calculations quickly: but a single graphics card is, potentially, a hundred times faster than a blade server at these operations.

There will be no GPU-CPU, but NVIDIA has already released the first of its CUDA applications for its desktop cards: making them able to simulate PhysX hardware on a GeForce G80.

THE GPU CHALLENGE


Page 7: Crash course in CPUs

the battery life on notebooks, but in these times of rising energy costs is also handy for datacentres, where a 50W saving per chip over a thousand servers can add up to a serious amount of money per year. Especially when it means you can turn the air-conditioning down a notch or two as well.
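The datacentre arithmetic is easy to check. The electricity price below is an assumption for illustration, not a quoted figure:

```python
# Rough datacentre arithmetic for the 50W-per-chip saving above.

watts_saved_per_chip = 50
n_servers = 1000
hours_per_year = 24 * 365
price_per_kwh = 0.10          # assumed price in GBP, for illustration

kwh_saved = watts_saved_per_chip * n_servers * hours_per_year / 1000
saving = kwh_saved * price_per_kwh
print(f"{kwh_saved:.0f} kWh/year -> £{saving:.0f}/year")
```

That is before counting the air-conditioning: every watt not burnt in the chip is also a watt the cooling plant doesn’t have to pump back out.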

Perhaps the biggest ongoing technological advances are in how these complex designs actually get turned into transistors on a silicon die.

Most of us will be familiar with the mind-roastingly tiny figures that are quoted by CPU manufacturers for their manufacturing processes – 45nm, 65nm and so on. These refer to the basic size of components on a chip, and are unimaginably small. Reducing them further has several advantages: performance-wise, the same chip on a smaller process can run cooler and faster; but more importantly – because they take up less physical space – more can be squeezed onto a single silicon wafer. So they’re cheaper too.
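A crude dies-per-wafer estimate shows why shrinks cut costs. The die size below is loosely based on a 45nm-class quad-core chip, and defect rates and edge losses are ignored:

```python
# Why smaller processes cut costs: a crude dies-per-wafer estimate.
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    return int(wafer_area / die_area_mm2)

# Shrinking a 45nm design to 32nm scales linear features by 32/45,
# so die area scales by (32/45)^2 - roughly half the silicon per chip.
shrink = (32 / 45) ** 2
print(round(shrink, 2))                    # 0.51
print(dies_per_wafer(300, 263))            # 45nm-class die on a 300mm wafer
print(dies_per_wafer(300, 263 * shrink))   # the same design, shrunk
```

Roughly twice the chips from the same wafer, for much the same wafer cost – which is the economic engine behind every process shrink.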

The basic manufacturing process hasn’t changed much in 30 years: take a large disc of purified silicon, and use a photolithographic process to build layers of extra materials onto it which connect together to create data pathways and logic gates. The materials and tools, however, are constantly being refined to increase the accuracy needed to achieve these tiny dimensions, and reduce the effects of ‘leakage’. Put

simply, this is when electrons begin hopping out over the boundaries between interconnects that they’re not supposed to be able to mount, and becomes more of a problem the smaller the manufacturing process becomes. At the moment, Intel leads the way with its 45nm process, which is made possible thanks to a hafnium-derived material used for the transistor gates. Intel states that the next generation of Core i7s will be produced on an even smaller 32nm process.

MOORE TO COME

CPU design and manufacture isn’t showing any signs of slowing. The infamous ‘Moore’s Law’ – a prediction by Intel founder Gordon Moore that the number of transistors that can be placed on a circuit will double every two years – may not be based on any scientific assessment of the manufacturing capabilities of the future, but it has remained peculiarly true for the last forty-three years. Indeed, it could be that we’re on the cusp of a far bigger architectural change than even Core i7 augurs. AMD and Intel are keen to move more functions onto the CPU, starting with a basic graphics processor, with the end goal of creating a simple, power-efficient system-on-a-chip that will, essentially, put a desktop PC on your fingernail.
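It’s worth checking the prediction against the two chips in this article – roughly 29,000 transistors in 1978 against 1.4 billion in 2008:

```python
# Moore's Law as arithmetic: doubling every two years.

def moores_law(start_count, years, doubling_period=2):
    return start_count * 2 ** (years / doubling_period)

# 8086 (1978) projected forward to the Core i7 era (2008)
predicted = moores_law(29_000, 2008 - 1978)
print(f"{predicted:,.0f}")  # 950,272,000 - same order as the real ~1.4bn
```

Fifteen doublings lands within a factor of two of the real transistor count, which for a forty-year extrapolation is remarkable.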

NVIDIA and the ex-ATI part of AMD, meanwhile, seem to recognise that the

The future of CPUs

78 December 2008

next big jump in real-time graphics engines is a little further off than previously supposed, and their hugely parallel GPUs are capable of performing important tasks like medical imaging and financial reporting better than an entire farm of CPU servers (see The GPU Challenge on the previous page).

Perhaps more likely to yield results faster, though, are the hardware hooks for virtualisation which are being built into CPU cores, allowing several operating systems to run at once without a performance penalty. Many speculate that ‘cloud computing’ – starting an instanced desktop from a web-based grid server – is the way forward, turning all our computing into one big Gmail-type application.

Quite where these developments – and the hundreds of others that are going on simultaneously – will lead, though, is anyone’s guess. But before gazing too far into the future, bear this in mind: there’s another, even bigger and more significant birthday than the 8086’s this year. In September 1958, Texas Instruments welcomed the very first integrated circuit – just a single transistor on a germanium strip – off its production line and set the ball rolling for the information age. Did anyone, fifty years ago, predict World of Warcraft or even Microsoft Word? Happy birthday, computers. PCF raises a glass substrate to your future. ¤

TAKE SOLACE IN QUANTUM

Want to read about the CPUs of the future in a human, digestible format? Read Charles Stross’ book Halting State. It’s a whodunnit starring a quantum processor as one of the main protagonists.

Quantum has been much vaunted as the future of computing, and not just because it sounds very clever. While manufacturers like Intel are doing well at finding ways around the physical limits of current CPU design to make their transistors ever smaller, there will one day come an impassable boundary for current engineering techniques. You can only make things so small when you’re playing around with silicon molecules.

Manipulating the quantum characteristics of atoms to

represent data stores is, theoretically, one way to go beyond conventional computers – albeit a highly complicated way that no-one has perfected yet. The idea is that data would be stored in ‘Qubits’, rather than bits. These could use particle spin to represent a one, a zero or – crucially – a quantum superposition, allowing for calculations of such fiendish complexity to be carried out that it hurts our collective brains to think about. Quite how it would be implemented is another matter – the University of Michigan has demonstrated a proof of concept, but there’s little sign that it’s on its way desktopwards any time soon.

More likely, in the short term at least, are optical replacements for current microprocessors, using photons instead of electrons to carry data and refracting materials instead of silicon for logic gates.
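A single qubit can be sketched as a pair of complex amplitudes whose squared magnitudes give the measurement probabilities – a toy state-vector model, nothing like real quantum hardware:

```python
# One-qubit state-vector sketch: a qubit is a pair of amplitudes;
# measurement probabilities are their squared magnitudes.
import math

def superposition():
    """Equal superposition of |0> and |1> (a Hadamard applied to |0>)."""
    amp = 1 / math.sqrt(2)
    return [amp, amp]

def probabilities(state):
    return [abs(a) ** 2 for a in state]

state = superposition()
print(probabilities(state))  # approximately [0.5, 0.5]
```

A classical bit is always one amplitude or the other; the superposition state is genuinely both until measured, which is what makes qubit registers so powerful – and so hard to build.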

Above Phenom: four central rectangles are cores; transistors on the edge are the shared resources

“Electrons hop over boundaries they’re not supposed to mount…”
