Some Hardware Fundamentals and an Introduction to Software

Tom Butler

In order to comprehend fully the function of system software, it is vital to understand the operation of computer hardware and peripherals. The reason for this is that software and hardware are inextricably connected in a symbiotic relationship. First, however, we need to identify types of software and their relationship to each other and, ultimately, to the hardware.

Figure 1 Software Hierarchy
[Figure: inverted pyramid whose layers are labelled, in order: Application Software; System Software; Firmware; Hardware; Computer Logic Circuits]

Figure 1 represents the relationship between the various types of software and hardware. The figure appears as an inverted pyramid to reflect the relative size and number of the various types of software, on one hand, and their proximity to computer hardware, on the other.

First, application software is remote from, and rarely interacts with, the computer's hardware. This is particularly true of applications that run on modern operating systems such as Windows NT 4.0, 2000, XP and UNIX variants. By 2000, with the advent of the .NET and Java paradigms, applications became even further removed from hardware, as .NET's Common Language Runtime (CLR) and the Java Virtual Machine (JVM) provide operating system and hardware access.

Indeed, from the operating system's perspective, the JVM and the CLR are merely applications. Older operating systems, such as MS-DOS, permitted some direct interaction between applications (chiefly computer games) and the hardware; however, this meant that vendors of such applications had to write code that would interact with the computer's BIOS (Basic Input/Output System, the firmware held in the computer's read only memory or ROM integrated circuits). Indeed, when computers first appeared on the market this practice was the norm, rather than the exception. Application programmers soon tired of reinventing the wheel every time they wrote an application, as they would have to include software routines that helped the application software communicate with and control hardware devices, including the CPU (central processing unit). In order to overcome this, computer scientists focused on developing a new type of software (operating or system software) whose sole purpose was to provide an environment or interface for applications such that the burden of managing and communicating with the computer hardware was removed from application programs. This proved important as technological advances resulted in computer hardware becoming more sophisticated and difficult to manage. Thus, operating systems were developed to manage a computer's hardware resources and provide an application programming interface, as well as a user or administrator interface, to permit access to the hardware for use and configuration by application software programmers and systems administrators.

In early computer systems, bootstrap code, which was loaded into the system manually via switches and/or pre-coded punched cards or teletype tape, was required to load the operating system program and boot the system so that an application program could be loaded and run. The advent of read only memory (ROM) in the 1[...] However, firmware came into its own in microprocessor systems and, later, personal computers. By the turn of the new millennium, entire operating systems, such as Windows NT and Linux, appeared in the firmware of embedded systems. The most recent advances in this area have been in the PDA or Pocket PC market, where Palm OS and Microsoft CE are competing for dominance. That said, while almost every type of electronic device possesses firmware of one form or other, the most prevalent appears in personal computers (PCs). Likewise, PCs dominate the computer market due to their presence in all areas of human activity. Hence, understanding PC hardware has become a sine qua non for all who call themselves IT professionals. The remainder of this chapter therefore focuses on delineating the basic architecture of today's PC.

A Brief Look Under the Hood of Today's PC

This section provides a brief examination of the major components of the PC.

The Power Supply

The most often ignored of the PC's components is the system power supply. Most household electrical appliances operate on alternating current (AC): 110 Volt 60 Hz AC (e.g. USA) or 220 Volt 50 Hz AC (Europe). However, electronic subassemblies or entire devices with embedded logic circuitry, whether microprocessor-based or not, operate exclusively on direct current (DC). The job of a PC's power supply is to transform and rectify the external AC commercial supply to the range of DC voltages required by the computer logic, associated electronic components, the DC motors in the hard disk, floppy, CD-ROM and DVD drives, and the system fans. Typical DC power supplies in a PC are rated at 1.5, 3.3, 5, -5, 12 and -12 volts. Also note that, as Notebook and Laptop computers have a rechargeable DC battery, they require special DC-DC converters to generate the required range of DC voltages. Several colour-designated cables emanate from a computer's power supply unit, the largest of which is connected to the computer's main circuit board, called the motherboard. The various DC voltages are distributed via the power supply rails printed onto the circuit board.

The Basic Input/Output System

The Basic Input/Output System (BIOS) is system software and is a collection of hardware-related software routines embedded as firmware on a read-only memory (ROM) integrated circuit (IC) 'chip', which is typically housed on a computer's motherboard. Usually referred to as ROM BIOS, this software component provides the most fundamental means for the operating system to communicate with the hardware. However, most BIOSs are 16 bit programs and must operate in real mode (footnote 1) on machines with Intel processors. While this does not cause performance problems during the boot-up phase, it means a degradation in PC performance as the CPU switches from protected to real mode when BIOS routines are referenced by an operating system. 32 bit BIOSs are presently in use, but are not widespread. Modern 32 bit operating systems such as LINUX do not use the BIOS after bootup, as the designers of LINUX integrated 32 bit versions of the BIOS routines into the LINUX kernel. Hence, the limitations of real mode switches in the CPU are avoided. Nevertheless, the BIOS plays a critical role during the boot-up phase, as it performs the power-on self test (POST) for the computer and then loads the boot code from the hard disk's master boot record (MBR), which in turn copies the system software into RAM for execution by the CPU.

When a computer is first turned on, DC voltages are applied to the CPU and associated electrical and logic circuits. This would lead to electronic mayhem if the CPU did not assert control. However, a CPU is merely a collection of hundreds of thousands (and now millions) of logic circuits. CPU designers therefore built in a predetermined sequence of programmed electronic events, which are triggered when a signal appears on the CPU's reset pin. This has the CPU's control unit use the memory address in the instruction counter (IC) register to fetch the first instruction to be executed. The 32 bit value placed in the IC is the address of the first byte of the final 64 KB segment in the first 1 MB of the computer's address space (this is a hangover from the early days of the PC, when the last 384 KB of the first 1 MB of RAM was reserved for the system and peripheral BIOS routines, each of which was 64 KB in length). This is the address of the first of the many 16 [...]

[Footnote 1] 16 bit applications operate in real mode on all Intel CPUs. This effectively limits the address space to 1 MB, by using 16 x 64 KB program segments. Each 16 bit application can only address 64 KB (2^16 = 65,536 locations); however, the CPU manages and uses an extra 4 bits (address lines) to provide the BIOS, OS and applications with 16 (2^4 = 16) segment addresses.
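The address arithmetic described above can be checked with a short sketch. It only restates the figures given in the text (1 MB address space, 64 KB segments, 384 KB reserved area); the variable names are hypothetical and chosen for illustration.

```python
# Real-mode address space arithmetic, using only the figures quoted in the text.
KB = 1024
SEGMENT_SIZE = 64 * KB        # 2**16 locations per 16 bit segment
ADDRESS_SPACE = 1024 * KB     # the first 1 MB

# Number of 64 KB segments that fit in 1 MB (2**4 segment addresses).
segment_count = ADDRESS_SPACE // SEGMENT_SIZE

# First byte of the final 64 KB segment in the first 1 MB.
final_segment_start = ADDRESS_SPACE - SEGMENT_SIZE

# Start of the last 384 KB, reserved for system and peripheral BIOS routines.
reserved_start = ADDRESS_SPACE - 384 * KB

print(segment_count)              # 16
print(hex(final_segment_start))   # 0xf0000
print(hex(reserved_start))        # 0xa0000
```

The reserved 384 KB area therefore holds exactly six 64 KB segments, the last of which is the one referred to in the reset sequence above.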


Figure 2 The Intel 845 Chipset

The Motherboard and the Chipset

The motherboard or system board houses all system components, from the CPU, RAM and expansion slots (e.g. ISA and PCI) to the I/O controllers. However, the key component on a motherboard is the chipset. While motherboards are identified physically by their form factor, the chipset designation indicates the capability of the motherboard to house system components. The most popular form factor is ATX. This motherboard layout was designed by Intel to increase air movement for cooling on-board components, and to allow easier access to the CPU and RAM. While the motherboard contains many chips or ICs, such as the CPU, RAM, BIOS, and a variety of smaller chips, two chips now handle most of the I/O functionality of a PC. The first is the Northbridge chip, which handles all communication (address, data and control) to the CPU, RAM, Accelerated Graphics Port (AGP) and PCI devices. The frontside system bus (FSB) terminates on the Northbridge chip and permits the CPU to access the RAM, AGP and PCI devices and those serviced by the Southbridge chip (and vice versa). The Southbridge chip permits communication with slow peripherals such as the floppy disk drive, the hard disk drive/CD-ROMs, ISA devices, the parallel, serial, mouse and keyboard ports, and the Flash ROM BIOS.

Figure 3 The Intel 850 Chipset

Intel and VIA are the leaders in chipset manufacture as of 2002, although there are several other manufacturers, such as ALi and SiS. While Intel services its own CPUs, VIA manufactures for both Intel and its major competitor AMD. In 2002, the basic Intel i850 chipset consisted of the 82850 Northbridge MCH (Memory Controller Hub) and an ICH2 (I/O Controller Hub) Southbridge. The chipset also contains a Firmware Hub (FWH) that provides access to the Flash ROM BIOS. This permits up to 4 GB of RAM with ECC (error correction), 4X AGP Mode, 4 Ultra ATA 100 IDE disk drives, and four USB ports. ISA is not supported. Different chipset designs support different RAM types and speeds (e.g. DDR SDRAM or RAMBus DRAM), CPU types and packaging, system bus speeds, and so on.

In 2000, Intel announced that the future of RAM in the PC industry was RAMBus DRAM (RDRAM). This heralded the release of the Intel 820 'Camino' chipset, which supported three RAMBus memory slots. However, errors in the design meant that only two memory slots could be used. A loss of confidence in the marketplace led to the withdrawal of the ill-fated Camino and its replacement with the Intel 840 'Carmel' chipset. This includes a 64 bit PCI controller, a redesigned and improved RDRAM memory repeater, and an SDRAM memory repeater that converts the RDRAM protocol to SDRAM. This was a smart move by Intel, but it backfired terribly, as the SDRAM hub had design errors that limited the number of SDRAMs that could be used. In addition, the RDRAM to SDRAM conversion protocol impaired overall memory throughput when using SDRAM. Consequently, faster memory performance on Intel's Pentium III Coppermine CPUs with a 133 MHz Frontside Bus could only be achieved using VIA's Apollo Pro 133 A. To make matters worse, the Intel 815 Solano chipset, which was introduced to support PC 133 DIMMs (SDRAM memory modules) and to help regain market share from VIA, would not allow SDRAM modules to work at 133 MHz if CPUs rated for a 100 MHz external clock rate (such as certain variants of Intel's Pentium III) were fitted on the motherboard. This particularly applies to the Celeron family, which ran at a 66 MHz external clock rate. It is significant that many of Intel's competitors promoted the PC133 and PC 266 DIMM standards over the more expensive RAMBus DRAM. This further impeded the acceptance of RDRAM; however, by late 2002, RDRAM had its own market niche as the price of SDRAM increased once more.

Intel learned from its experience with the Camino and Carmel chipsets. Bowing to market pressure, it designed two new chipset families for use with its new Pentium IV CPU. The first of these, the i845 (see Figure 2), was targeted at systems based on the Pentium IV and synchronous DRAM memory such as the PC133, 233, and 333, with up to 3 GB of memory. The i850 (see Figure 3) was targeted at RDRAM-based systems of up to 4 GB, which supported the PC 800, 1033 and 1066 RAMBus memory. In late 2002, the Intel 845GE chipset was released to support PC333 DDR SDRAM and the Pentium 4 processor. The chipset also included Intel's Extreme Graphics technology, which ran at a 266 MHz core speed. The basic member of the Intel 850 chipset family had support for PC800 RDRAM memory and provided a balanced performance platform for the Pentium 4 processor with 400 MHz system bus and NetBurst Architecture. It also supports dual channel access to RDRAM RIMMs, which increases overall throughput to 3.2 GB/s. Subsequent developments in this chipset family provided support for RDRAM running at 1033 MHz and 1066 MHz, and a 533 MHz FSB. Further advances in DDR SDRAM technologies saw DDR SDRAM-based Intel and VIA chipsets which accommodated PC2400 and PC2700 DDR SDRAM running at 150 MHz and 166 MHz respectively, which is double clocked to 300 and 333 MHz (so called DDR 300 and 333). However, the evolution of DDR366 and chipset design led to the PC3000 DDR SDRAM being released with even higher bandwidth.
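The relationship between the base clock, double clocking and the module names can be sketched as a small calculation. The sketch assumes a 64 bit wide memory module (a property of standard DIMMs that the text does not state) and uses a hypothetical function name; it is an illustration of the arithmetic, not a specification.

```python
def ddr_peak_mb_per_s(base_clock_mhz, bus_width_bits=64):
    """Peak transfer rate of a double clocked (DDR) module in MB/s.

    Assumption: the module is 64 bits wide and moves data twice per clock.
    """
    transfers_per_clock = 2
    return base_clock_mhz * transfers_per_clock * bus_width_bits / 8

print(ddr_peak_mb_per_s(150))            # 2400.0
print(round(ddr_peak_mb_per_s(166.67)))  # 2667
```

Under these assumptions the 150 MHz part yields the 2400 in 'PC2400', and the 166 MHz part yields roughly the 2700 in 'PC2700'.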

Basic CPU Architectures

CISC vs. RISC

There are two fundamental types of CPU architecture: complex instruction set computers (CISC) and reduced instruction set computers (RISC). CISC is the most prevalent and established microprocessor architecture, while RISC is a relative newcomer. Intel's 80x86 and Pentium microprocessor families are CISC-based, although RISC-type functionality has been incorporated into Pentium CPUs. Motorola's 68000 family of microprocessors is another example of this type of architecture. Sun Microsystems' SPARC microprocessors and the MIPS R2000, R3000 and R4000 families dominate the RISC end of the market; however, Motorola's PowerPC, G4, Intel's i860, and Analog Devices Inc.'s digital signal processors (DSP) are in wide use. In the PC/Workstation market, Apple Computers and Sun employ RISC microprocessors as their choice of CPU.

Table 1 CISC and RISC

CISC                                                  RISC
Large instruction set                                 Compact instruction set
Complex, powerful instructions                        Simple hard-wired machine code and control unit
Instruction sub-commands microcoded in on board ROM   Pipelining of instructions
Compact and versatile register set                    Numerous registers
Numerous memory addressing options for operands       Compiler and IC developed simultaneously

The difference between the two architectures is the relative complexity of the instruction sets and underlying electronic and logic circuits in CISC microprocessors. For example, the original RISC I prototype had just 31 instructions, while the RISC II had 3[...]

Figure 4 Typical Microprocessor Architectures
[Figure: block diagram. Recoverable labels: Bus Interface Unit; Program Counter; Stack Pointer; registers AX, BX, CX, DX, BP, SI, DI and Flag (general purpose registers; AX is the Accumulator); Instruction Register; Internal Bus; Decode Unit; Arithmetic and Logic Unit; Control Unit; Address Bus; Data Bus; Control Bus (includes read/write, interrupt, clock and reset)]

the 1 [...]

[...] several integrated transistors, which are configured as flip-flop circuits, each of which can be switched into a 1 or 0 state. They remain in that state until changed under control of the CPU or until the power is removed from the processor. Each register has a specific name and is addressable; some, however, are dedicated to specific tasks, while the majority are 'general purpose'. The width of a register depends on the type of CPU, e.g., a 16, 32 or 64 bit microprocessor. In order to provide backward compatibility, registers may be sub-divided. For example, the Pentium processor is a 32 bit CPU, and its registers are 32 bits wide. Some of these are sub-divided and named as 8 and 16 bit registers in order to run 8 and 16 bit applications designed for earlier x86 microprocessors.
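A minimal sketch of this sub-division, using bit masks, is given below. The register names EAX, AX, AH and AL follow the usual x86 convention and are an assumption here (the chapter's figure names only AX); the function name is hypothetical.

```python
def split_register(eax):
    """Return the 16 bit and 8 bit views of a 32 bit register value."""
    eax &= 0xFFFFFFFF        # keep the value within 32 bits
    ax = eax & 0xFFFF        # low 16 bits (AX)
    ah = (ax >> 8) & 0xFF    # high byte of AX (AH)
    al = ax & 0xFF           # low byte of AX (AL)
    return ax, ah, al

ax, ah, al = split_register(0x12345678)
print(hex(ax), hex(ah), hex(al))   # 0x5678 0x56 0x78
```

An 8 or 16 bit application sees only the smaller views, while 32 bit code uses the full register.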

Instruction Register

When the Bus Interface Unit receives an instruction, it transfers it to the Instruction Register for temporary storage. In Pentium processors the Bus Interface Unit transfers instructions to the L1 I-Cache; there is no instruction register as such.

Stack Pointer

A 'stack' is a small area of reserved memory used to store the data in the CPU's registers when: (1) system calls are made by a process to operating system routines; (2) hardware interrupts are generated by input/output (I/O) transactions on peripheral devices; (3) a process initiates an I/O transfer; (4) a process rescheduling event occurs on foot of a hardware timer interrupt. This transfer of register contents is called a 'context switch'. The stack pointer is the register which holds the address of the most recent 'stack' entry. Hence, when a system call is made by a process (say, to print a document) and its context is stored on the stack, the called system routine uses the stack pointer to reload the register contents when it is finished printing. Thus the process can continue where it left off.
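The save and restore sequence can be sketched as follows. This is a simplified model, not how any particular CPU or operating system lays out its stack: the stack is a Python list, the 'stack pointer' is the index of the most recent entry, and the register names and values are hypothetical.

```python
stack = []   # reserved memory area, modelled as a list

def save_context(registers):
    """Push a copy of the register contents; return the new stack pointer."""
    stack.append(dict(registers))
    return len(stack) - 1          # index of the most recent entry

def restore_context():
    """Pop the most recent entry so the process can resume."""
    return stack.pop()

registers = {"AX": 7, "BX": 2, "PC": 0x00B0}
stack_pointer = save_context(registers)

# The called system routine overwrites the registers while it runs.
registers.update(AX=0, BX=0, PC=0x4000)

registers = restore_context()
print(stack_pointer, registers["AX"], hex(registers["PC"]))   # 0 7 0xb0
```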

Instruction Decoder

The Instruction Decoder is an arrangement of logic elements which act on the bits that constitute the instruction. Simple instructions with corresponding logic hard-wired into the execution unit are simply passed to the Execution Unit (and/or the MMX in the Pentium II, III and IV); complex instructions are decoded so that related microcode modules can be transferred from the CPU's microcode ROM to the execution unit. The Instruction Decoder will also store referenced operands in appropriate registers so that data at the memory locations referenced can be fetched.

Program or Instruction Counter

The Program Counter (PC) is the register that stores the address in primary memory (RAM or ROM) of the next instruction to be executed. In 32 bit systems, this is a 32 bit linear or virtual memory address that references a byte (the first of 4 required to store the 32 bit instruction) in the process's virtual memory address space. This value is translated to determine the real memory address in which the instruction is stored. When the referenced instruction is fetched, the address in the PC is incremented to the address of the next instruction to be executed. If the current address is 00B0 hex, then the next address will be 00B4 hex. Remember that each byte in RAM is individually addressable; however, each complete instruction is 32 bits or 4 bytes, so the address of the next instruction in the process will be 4 bytes on.
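The increment in the example above amounts to adding the instruction length in bytes to the current address. A minimal sketch, with hypothetical variable names and the fixed 4 byte instruction length used in the text:

```python
INSTRUCTION_BYTES = 4        # one 32 bit instruction occupies 4 addressable bytes

program_counter = 0x00B0     # address of the instruction just fetched
program_counter += INSTRUCTION_BYTES

print(f"{program_counter:04X}")   # 00B4
```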

Accumulator

The accumulator may contain data to be used in a mathematical or logical operation, or it may contain the result of an operation. General purpose registers are used to support the accumulator by holding data to be loaded to/from the accumulator.

Computer Status Word or Flag Register

The result of an ALU operation may have consequences for subsequent operations; for example, changing the path of execution. Individual bits in this register are set or reset in accordance with the result of mathematical or logical operations. Each bit in the register, also called a flag, has a preassigned meaning, and the contents are monitored by the control unit to help control CPU related actions.
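As an illustration of how flags follow from a result, the sketch below performs an 8 bit addition and derives three flags. The flag names (carry, zero, sign) follow common x86 usage and are an assumption, since the text does not name individual flags; the function name is hypothetical.

```python
def add8(a, b):
    """Add two 8 bit values and report the result with its flags."""
    total = a + b
    result = total & 0xFF                # keep only the low 8 bits
    flags = {
        "carry": total > 0xFF,           # result did not fit in 8 bits
        "zero": result == 0,             # result is zero
        "sign": bool(result & 0x80),     # top bit of the result is set
    }
    return result, flags

print(add8(200, 56))   # (0, {'carry': True, 'zero': True, 'sign': False})
```

A conditional branch instruction would then test one of these bits to decide the path of execution.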

Arithmetic and Logic Unit

The Arithmetic and Logic Unit (ALU) performs all arithmetic and logic operations in a microprocessor, viz. addition, subtraction, logical AND, OR, EX-OR, etc. A typical ALU is connected to the accumulator, the general purpose registers and other CPU components that help transfer the result of its operations to RAM via the Bus Interface Unit and the system bus. The results may also be written into internal or external caches.

Control Unit

The control unit coordinates and manages CPU activities, in particular the execution of instructions by the arithmetic and logic unit (ALU). In Pentium processors its role is complex, as microcode from decoded instructions is pipelined for execution by two ALUs.

The System Clock

The Intel 8088 had a clock speed of 4.77 MHz; that is, its internal logic gates were opened and closed under the control of a square wave pulsed signal that had a frequency of 4.77 million cycles per second. Alternatively put, the logic gates opened and closed 4.77 million times per second. Thus, instructions and data were pumped through the integrated transistor logic circuits at a rate of 4.77 million bits per second. Later designs ran at higher speeds, viz. the i286 at 8-20 MHz, the i386 at 16-33 MHz and the i486 at 25-50 MHz. Where does this clock signal come from? Each motherboard is fitted with a quartz oscillator in a metal package that generates a square wave clock pulse of a certain frequency. In i8088 systems the crystal oscillator ran at 14.318 MHz and this was fed to the i8284 to generate the system clock frequency of 4.77 MHz in earlier systems, and up to 10 MHz in later designs. Later, the i286 PCs had a 12 MHz crystal which provided the i82284 IC multiplier/divider with the primary clock signal. This then divided/multiplied the basic 12 MHz to generate the system clock signal of 8-20 MHz. With the advent of the i486DX, the system clock signal, which ran at 25 or 33 MHz, was effectively multiplied by factors of 2 and 3 to deliver an internal CPU clock speed of 50, 66, 75 or 100 MHz. This approach is used in Pentium IV architectures, where the primary crystal source delivers a relatively slow 50 MHz clock signal that is then multiplied to the system clock speed of 100-133 MHz. The internal multiplier in the Pentium then multiplies this by a factor of 20+ to obtain speeds of 2 GHz and above.
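The clock figures above are related by simple division and multiplication, which the following sketch reproduces. The divide-by-three step for the 8088 is not stated in the text; it is included here as an assumption that matches the quoted 14.318 MHz and 4.77 MHz figures. Function and variable names are hypothetical.

```python
def derived_clock_mhz(source_mhz, multiplier=1, divider=1):
    """Clock obtained by multiplying and/or dividing a source clock."""
    return source_mhz * multiplier / divider

# i8088: 14.318 MHz crystal divided down (divide-by-3 is an assumption).
print(round(derived_clock_mhz(14.318, divider=3), 2))   # 4.77

# i486DX: 25 or 33 MHz system clock multiplied by 2 or 3 (33 is nominal).
for system_clock in (25, 33):
    for multiplier in (2, 3):
        print(system_clock, multiplier, derived_clock_mhz(system_clock, multiplier))

# Pentium IV: a 100 MHz system clock multiplied by 20 gives 2 GHz.
print(derived_clock_mhz(100, multiplier=20))             # 2000.0
```

With the nominal 33 MHz figure the products come out as 66 and 99 MHz; the quoted 100 MHz corresponds to a system clock of about 33.3 MHz.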


Instruction Cycle

An instruction cycle consists of the activities required to fetch and execute an instruction. The length of time taken to fetch and execute is measured in clock cycles. In CISC processors this will take many clock cycles, depending on the complexity of the instruction and the number of memory references made to load operands. In RISC computers the number of clock cycles is reduced significantly. When the CPU finishes the execution of an instruction, it transfers the content of the program (or instruction) counter into the Bus Interface Unit (1 clock cycle). This is then gated onto the system address bus and the read signal is asserted on the control bus (1 clock cycle). This is a signal to the RAM controller that the value at this address is to be read from memory and loaded onto the data bus (4+ clock cycles). The instruction is read in from the data bus and decoded (2+ clock cycles). The fetch and decode activities constitute the first machine cycle of the instruction cycle. The second machine cycle begins when the instruction's operand is read from RAM and ends when the instruction is executed and the result written back to memory. This will take at least another 8+ clock cycles, depending on the complexity of the instruction. Thus an instruction cycle will take at least 16 clock cycles, a considerable length of time. Together, RISC processors and fast RAM can keep this to a minimum. However, Intel made advances by super pipelining instructions, that is, by interleaving fetch, decode, operand read, execute, and retire (i.e. write the result of the instruction to RAM) activities into two separate pipelines serving two ALUs. Hence, instructions are not executed sequentially, but concurrently and in parallel (more about pipelining later).
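The minimum cycle count quoted above is the sum of the individual steps. The sketch below adds them up and, purely as an illustration, converts the total to time at the 8088's 4.77 MHz clock; pairing these cycle counts with that particular clock is an assumption made for the example, and the step labels are paraphrases.

```python
# Minimum clock cycles per step, as quoted in the text.
minimum_cycles = {
    "program counter to Bus Interface Unit": 1,
    "address gated onto bus, read asserted": 1,
    "instruction read from RAM": 4,
    "instruction decoded": 2,
    "operand read, execute, write back": 8,
}

total_cycles = sum(minimum_cycles.values())
print(total_cycles)                          # 16

clock_mhz = 4.77                             # illustrative clock speed
microseconds = total_cycles / clock_mhz      # cycles / (cycles per microsecond)
print(round(microseconds, 2))                # 3.35
```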

5th and 6th Generation Intel CPU Architecture

The Pentium microprocessor was the last of Intel's 5th generation microprocessors and had several basic units: the Bus Interface Unit (BIU); the I-Cache (8 KB of write-through Static RAM, or SRAM); the Instruction Translation Lookaside Buffer (TLB); the D-Cache (8 KB of write-back SRAM); the Data TLB; the Clock Driver/Multiplier; the Instruction Fetch Unit; the Branch Prediction Unit; the Instruction Decode Unit; the Complex Instruction Support Unit; the Superscalar Integer Execution Unit; and the Pipelined Floating Point Unit. Figure 5 presents a block diagram of the original Pentium.

The Pentium was the first Intel chip to have a 64 bit external data bus, which was split internally into two separate pipelines, each 32 bits wide. This allowed the Pentium to execute two instructions simultaneously; however, more than one instruction could be in each pipeline, thus increasing instruction throughput.

Heat dissipation is the enemy of chip designers: the greater the number of integrated transistors, and the higher the speed of operation and the operating voltage, the more power is consumed and the more heat is generated. The first two Pentium versions ran at 60 and 66 MHz respectively with an operating voltage of 5 V DC. Hence they ran quite hot. However, a change in package design (from Socket 5 to 7, Pin Grid Array or PGA) and a reduction in operating voltage to 3.3 Volts lowered power consumption and heat dissipation. Intel also introduced a clock multiplier which multiplied the external clock signal and enabled the Pentium to run at 1.5, 2, 2.5 and finally 3 times this speed. Thus, while the system bus ran at 50, 60 and 66 MHz, the CPU ran at 75-200 MHz.

In 1[...] However, major design changes came with the Pentium IV. Modifications and design changes centered on (a) the physical package; (b) the process by which instructions were decoded and executed; (c) support for memory beyond the 4 GB limit; (d) the integration and enhancement of L1 and L2 cache performance and size; (e) the addition of a new cache; (f) the speed of internal and external operation. Each of these issues receives attention in the following subsections.

Figure 5 Pentium CPU Block Diagram
[Figure: block diagram. Recoverable labels: Bus Interface Unit; Address Bus; Data Bus; Control Bus; two caches; Branch Target Buffer; Prefetch Buffer; Control Unit; Fetch and Decode Unit; Microcode ROM; Clock Multiplier; Dual Pipeline Execution Unit (two pipelines); Registers; Floating Point Unit; Advanced Programmable Interrupt Controller]

Physical Packaging

Two terms are employed to describe the packaging used for the Pentium family of processors: the first refers to the motherboard connection, and the second to the actual package itself. For example, the original Pentium P5 was fitted to the Socket 5 type connection on the motherboard using a Staggered Pin Grid Array (SPGA) for the die's I/O (die is the technical term for the physical structure that incorporates the chip). Later variants used the Socket 7 connector. The Pin Grid Array (PGA) family of packages is associated with different Socket types, which are numbered. A pin grid array is simply an array of metal pin connectors used to form an electrical connection between the internal electronics of the CPU (packaged on the die) and other system components like the system chipsets. The pins plug into corresponding receptacle pinholes in the CPU's socket on the motherboard. The different types of PGA reflect the type of packaging (e.g. ceramic or plastic), the number of pins, and how they are arrayed. The Pentium Pro used an SPGA with a staggering 387 pins for connection to the motherboard socket, called Socket 8. The Pentium Pro was the first Intel processor to have an L2 cache connected to the CPU via a backside bus, but on a separate die. This was a significant technical achievement in packaging. When Intel designed the Pentium II they decided to change the packaging significantly and introduced a Single Edge Contact Connector (SECC) package with three variants (SECC for the Pentium II, SECC2 for the Pentium II and SEPP for the Celeron), each of which plugged into the Slot 1 connector on the motherboard. However, later variants of the Celeron and Pentium III used PGA packaging for certain applications: the Celeron uses the Plastic PGA, and the Celeron III and Pentium III the Flip-Chip Pin Grid Array (FC-PGA). Both use the 370-pin Socket. The Pentium IV saw a full return to the PGA for all chips. Here a Flip-Chip Pin Grid Array (FC-PGA) was employed in a 478 PCPGA package.

Overall Architectural Comparison of the Pentium Family of Microprocessors

The Pentium (P54) first shipped in 1[...]

[...] variants, and the Pentium III. As indicated, the physical package was also a significant advance, as was the incorporation of additional RISC features. However, aimed as it was at the server market, the Pentium Pro did not incorporate MMX technology. It was expensive to produce, as it included the L2 cache on its substrate (but on a separate die) and had 5.5 million transistors at its core and over 8 million in its L2 cache. Its core logic operated at 3.3 Volts. The microprocessor was still, however, chiefly CISC in design, and optimized for 32 bit operation. The chief features of the Pentium Pro were:

• A partly integrated L2 cache of up to 512 KB (on a specially manufactured, separate SRAM die) that was connected via a dedicated 'backside' bus that ran at full CPU speed.

• Three 12 staged pipelines

• Speculative execution of instructions

• Out-of-order completion of instructions

• 40 renamed registers

• Dynamic branch prediction

• Multiprocessing with up to 4 Pentium Pros

• An increased address bus size of 36 bits (from 32) to enable up to 64 GB of memory to be used. (Please note that the 4 extra address bits give 2^4 = 16 times as many addressable locations; this gives 4 GB x 16 = 64 GB of memory.)
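The arithmetic in the last item can be verified directly, assuming one byte per addressable location as elsewhere in this chapter. The variable names are hypothetical.

```python
GB = 2 ** 30   # bytes in one gigabyte

addressable_32 = 2 ** 32            # locations reachable with 32 address bits
addressable_36 = 2 ** 36            # locations reachable with 36 address bits
extra_factor = 2 ** (36 - 32)       # effect of the 4 extra address bits

print(addressable_32 // GB)         # 4
print(extra_factor)                 # 16
print(addressable_36 // GB)         # 64
```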

The following description, taken from Intel's introduction to its microprocessor architecture, is relevant to all members of the P6 family, including the Celeron, Pentium II and III.

The Intel Pentium Pro processor has three-way superscalar architecture. The term "three-way superscalar" means that using parallel processing techniques, the processor is able on average to decode, dispatch, and complete execution of (retire) three instructions per clock cycle. To handle this level of instruction throughput, the Pentium Pro processor uses a decoupled, 12-stage superpipeline that supports out-of-order instruction execution. It does this by incorporating even more parallelism than the Pentium processor. The Pentium Pro processor provides Dynamic Execution (micro-data flow analysis, out-of-order execution, superior branch prediction, and speculative execution) in a superscalar implementation.

The centerpiece of the Pentium Pro processor architecture is an innovative out-of-order execution mechanism called "dynamic execution." Dynamic execution incorporates three data-processing concepts:

• Deep branch prediction.

• Dynamic data flow analysis.

• Speculative execution.

Branch prediction is a concept found in most mainframe and high-speed RISC microprocessor architectures. It allows the processor to decode instructions beyond branches to keep the instruction pipeline full. In the Pentium Pro processor, the instruction fetch/decode unit uses a highly optimized branch prediction algorithm to predict the direction of the instruction stream through multiple levels of branches, procedure calls, and returns.

Figure 6 Functional Block Diagram of the Pentium Pro Processor Microarchitecture

Dynamic data flow analysis involves real-time analysis of the flow of data through the processor to determine data and register dependencies and to detect opportunities for out-of-order instruction execution. The Pentium Pro processor dispatch/execute unit can simultaneously monitor many instructions and execute these instructions in the order that optimizes the use of the processor's multiple execution units, while maintaining the integrity of the data being operated on. This out-of-order execution keeps the execution units busy even when cache misses and data dependencies among instructions occur.

Speculative execution refers to the processor's ability to execute instructions ahead of the program counter but ultimately to commit the results in the order of the original instruction stream. To make speculative execution possible, the Pentium Pro processor microarchitecture decouples the dispatching and executing of instructions from the commitment of results. The processor's dispatch/execute unit uses data-flow analysis to execute all available instructions in the instruction pool and store the results in temporary registers. The retirement unit then linearly searches the instruction pool for completed instructions that no longer have data dependencies with other instructions or unresolved branch predictions. When completed instructions are found, the retirement unit commits the results of these instructions to memory and/or the Intel Architecture registers (the processor's eight general-purpose registers and eight floating-point unit data registers) in the order they were originally issued and retires the instructions from the instruction pool.

Through deep branch prediction, dynamic data-flow analysis, and speculative execution, dynamic execution removes the constraint of linear instruction sequencing between the traditional fetch and execute phases of instruction execution. It allows instructions to be decoded deep into multi-level branches to keep the instruction pipeline full. It promotes out-of-order instruction execution to keep the processor's six instruction execution units running at full capacity. And finally it commits the results of executed instructions in original program order to maintain data integrity and program coherency.

Three instruction decode units work in parallel to decode object code into smaller operations called "micro-ops" (microcode). These go into an instruction pool, and (when interdependencies don't prevent) can be executed out of order by the five parallel execution units (two integer, two FPU and one memory interface unit). The Retirement Unit retires completed micro-ops in their original program order, taking account of any branches.
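The interplay of out-of-order completion and in-order retirement can be sketched with a toy model. The sketch below assumes unlimited execution units and perfect register renaming, and the instruction names, registers and latencies are invented for illustration; it is not a model of the real Pentium Pro instruction pool.

```python
def dynamic_execution(program):
    """Toy data-flow scheduler.

    program is a list of (name, dest_register, source_registers, latency).
    Returns two dicts: the cycle each result is ready, and the cycle each
    instruction retires (retirement is forced into program order).
    """
    finish = {}        # instruction name -> cycle its result becomes available
    last_writer = {}   # register -> name of the latest instruction writing it
    retire = {}
    previous_retire = 0
    for name, dest, sources, latency in program:
        # An instruction may start once every producer of its sources is done.
        start = max(
            (finish[last_writer[r]] for r in sources if r in last_writer),
            default=0,
        )
        finish[name] = start + latency
        last_writer[dest] = name
        # Results are committed no earlier than all preceding instructions.
        previous_retire = max(previous_retire, finish[name])
        retire[name] = previous_retire
    return finish, retire


program = [
    ("load", "r1", [], 10),      # slow, e.g. a cache miss
    ("add", "r2", ["r1"], 1),    # must wait for the load
    ("mul", "r3", ["r4"], 3),    # independent of the load
    ("sub", "r5", ["r3"], 1),    # depends only on mul
]
finish, retire = dynamic_execution(program)
completion_order = sorted(finish, key=finish.get)
```

Here mul and sub complete while the load is still outstanding, yet neither retires before the load and add ahead of them in program order.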

The power of the Pentium Pro processor is further enhanced by its caches: it has the same two on-chip 8-KByte L1 caches as does the Pentium processor, and also has a 256-512 KByte L2 cache that's in the same package as, and closely coupled to, the CPU, using a dedicated 64-bit ("backside") full clock speed bus. The L1 cache is dual ported, the L2 cache supports up to 4 concurrent accesses, and the 64-bit external data bus is transaction-oriented, meaning that each access is handled as a separate request and response, with numerous requests allowed while awaiting a response. These parallel features for data access work with the parallel execution capabilities to provide a "non-blocking" architecture in which the processor is more fully utilized and performance is enhanced.

Pentium Pro Modes of Operation

The Intel Architecture supports three operating modes: protected mode, real-address mode, and system management mode. The operating mode determines which instructions and architectural features are accessible:

Protected mode. The native state of the processor. In this mode all instructions and architectural features are available, providing the highest performance and capability. This is the recommended mode for all new applications and operating systems. Among the capabilities of protected mode is the ability to directly execute "real-address mode" 8086 software in a protected, multi-tasking environment. This feature is called virtual-8086 mode, although it is not actually a processor mode. Virtual-8086 mode is actually a protected mode attribute that can be enabled for any task.


Real-address mode. Provides the programming environment of the Intel 8086 processor with a few extensions (such as the ability to switch to protected or system management mode). The processor is placed in real-address mode following power-up or a reset.

System management mode. A standard architectural feature unique to all Intel processors, beginning with the Intel386 SL processor. This mode provides an operating system or executive with a transparent mechanism for implementing platform-specific functions such as power management and system security. The processor enters SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received from the advanced programmable interrupt controller (APIC). In SMM, the processor switches to a separate address space while saving the entire context of the currently running program or task. SMM-specific code may then be executed transparently. Upon returning from SMM, the processor is placed back into its state prior to the system management interrupt.

The basic execution environment is the same for each of these operating modes.

Basic Pentium Execution Environment

Any program or task running on an Intel Architecture processor is given a set of resources for executing instructions and for storing code, data, and state information. These resources (shown in Figure 6) include an address space of up to 2^32 bytes, a set of general data registers, a set of segment registers, and a set of status and control registers. When a program calls a procedure, a procedure stack is added to the execution environment. (Procedure calls and the procedure stack implementation are described in Chapter 4, Procedure Calls, Interrupts, and Exceptions.)

Figure 6 Basic Execution Environment


Pentium Pro Memory Organization

The memory that the processor addresses on its bus is called physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte is assigned a unique address, called a physical address. The physical address space ranges from zero to a maximum of 2^32 - 1 (4 gigabytes). Virtually any operating system or executive designed to work with an Intel Architecture processor will use the processor's memory management facilities to access memory. These facilities provide features such as segmentation and paging, which allow memory to be managed efficiently and reliably. Memory management is described in detail later. The following paragraphs describe the basic methods of addressing memory when memory management is used. When employing the processor's memory management facilities, programs do not directly address physical memory. Instead, they access memory using any of three memory models: flat, segmented, or real-address mode.

With the flat memory model (see Figure 3-2), memory appears to a program as a single, continuous address space, called a linear address space. Code (a program's instructions), data, and the procedure stack are all contained in this address space. The linear address space is byte addressable, with addresses running contiguously from 0 to 2^32 - 1. An address for any byte in the linear address space is called a linear address.

With the segmented memory model, memory appears to a program as a group of independent address spaces called segments. When using this model, code, data, and stacks are typically contained in separate segments. To address a byte in a segment, a program must issue a logical address, which consists of a segment selector and an offset. (A logical address is often referred to as a far pointer.) The segment selector identifies the segment to be accessed and the offset identifies a byte in the address space of the segment. The programs running on an Intel Architecture processor can address up to 16,383 segments of different sizes and types, and each segment can be as large as 2^32 bytes (4 GB).

Internally, all the segments that are defined for a system are mapped into the processor's linear address space. So, the processor translates each logical address into a linear address to access a memory location. This translation is transparent to the application program.
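The translation from a logical address to a linear address can be sketched as follows. The selector layout used here (a 13-bit index, a table-indicator bit and a 2-bit requested privilege level) is the standard Intel Architecture layout rather than something stated in these notes, and the descriptor table contents, the Segment type and the function names are hypothetical. Only the global table is modelled.

```python
from collections import namedtuple

# Hypothetical, much-simplified segment descriptor: base address and limit
# (the limit is taken to be the highest valid offset within the segment).
Segment = namedtuple("Segment", "base limit")


def decode_selector(selector):
    """Split a 16-bit segment selector into its three fields."""
    return {
        "index": selector >> 3,                       # bits 3-15
        "table": "LDT" if selector & 0x4 else "GDT",  # bit 2
        "rpl": selector & 0x3,                        # bits 0-1
    }


def logical_to_linear(selector, offset, gdt):
    """Translate (selector, offset) to a linear address using a toy GDT."""
    fields = decode_selector(selector)
    segment = gdt[fields["index"]]
    if offset > segment.limit:
        raise ValueError("offset lies beyond the segment limit")
    return (segment.base + offset) & 0xFFFFFFFF


# Illustrative table: entry 1 is a flat 4 GB segment, entry 2 a 64 KB segment.
gdt = {
    1: Segment(base=0x00000000, limit=0xFFFFFFFF),
    2: Segment(base=0x00400000, limit=0x0000FFFF),
}
linear = logical_to_linear(0x0010, 0x1234, gdt)
```

The limit check is what lets a separate stack or data segment catch an out-of-range access instead of silently overwriting a neighbouring segment.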

The primary reason for using segmented memory is to increase the reliability of programs and systems. For example, placing a program's stack in a separate segment prevents the stack from growing into the code or data space and overwriting instructions or data, respectively. And placing the operating system's or executive's code, data, and stack in separate segments protects them from the application program and vice versa.

With either the flat or segmented model, the Intel Architecture provides facilities for dividing the linear address space into pages and mapping the pages into virtual memory. If an operating system/executive uses the Intel Architecture's paging mechanism, the existence of the pages is transparent to an application program.
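As an illustration of how a linear address is carved up for paging, the sketch below assumes the common 32-bit arrangement of 4 KB pages with a two-level table (a 10-bit directory index, a 10-bit table index and a 12-bit byte offset). The notes do not state a page size, so this split is an assumption, and the function name is made up for the example.

```python
PAGE_OFFSET_BITS = 12   # 4 KB pages (assumed)
TABLE_INDEX_BITS = 10   # 1024 entries per page table (assumed)


def split_linear_address(linear):
    """Return (directory index, table index, byte offset) for a 32-bit address."""
    directory = (linear >> (PAGE_OFFSET_BITS + TABLE_INDEX_BITS)) & 0x3FF
    table = (linear >> PAGE_OFFSET_BITS) & 0x3FF
    offset = linear & 0xFFF
    return directory, table, offset


parts = split_linear_address(0x00403ABC)
```

The application only ever sees the linear address; the operating system fills in the tables that these indices select.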

The real-address mode model uses the memory model for the Intel 8086 processor, the first Intel Architecture processor. It was provided in all the subsequent Intel Architecture processors for compatibility with existing programs written to run on the Intel 8086 processor. The real-address mode uses a specific implementation of segmented memory in which the linear address space for the program and the operating system/executive consists of an array of segments of up to 64K bytes in size each. The maximum size of the linear address space in real-address mode is 2^20 bytes.
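The 2^20-byte limit follows from the way the 8086 forms an address: the 16-bit segment value is shifted left by four bits and added to the 16-bit offset. A minimal sketch, with a hypothetical function name, and with the 20-bit mask modelling the 8086's 20 address lines:

```python
def real_mode_linear(segment, offset):
    """Real-address mode: linear = segment * 16 + offset, kept to 20 bits."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return ((segment << 4) + offset) & 0xFFFFF


text_buffer = real_mode_linear(0xB800, 0x0010)
reset_vector = real_mode_linear(0xF000, 0xFFF0)
```

Note that many different segment:offset pairs name the same byte, since segments start every 16 bytes and overlap.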

Figure 7 Three Memory Management Models


32-bit vs. 16-bit Address and Operand Sizes

The processor can be configured for 32-bit or 16-bit address and operand sizes. With 32-bit address and operand sizes, the maximum linear address or segment offset is FFFFFFFFH (2^32 - 1), and operand sizes are typically 8 bits or 32 bits. With 16-bit address and operand sizes, the maximum linear address or segment offset is FFFFH (2^16 - 1), and operand sizes are typically 8 bits or 16 bits. When using 32-bit addressing, a logical address (or far pointer) consists of a 16-bit segment selector and a 32-bit offset; when using 16-bit addressing, it consists of a 16-bit segment selector and a 16-bit offset.

Instruction prefixes allow temporary overrides of the default address and/or operand sizes from within a program. When operating in protected mode, the segment descriptor for the currently executing code segment defines the default address and operand size. A segment descriptor is a system data structure not normally visible to application code. Assembler directives allow the default addressing and operand size to be chosen for a program. The assembler and other tools then set up the segment descriptor for the code segment appropriately. When operating in real-address mode, the default addressing and operand size is 16 bits. An address-size override can be used in real-address mode to enable 32-bit addressing; however, the maximum allowable 32-bit address is still 0000FFFFH (2^16 - 1).

Figure 8 Application Programming Registers

REGISTERS

The processor provides 16 registers for use in general system and application programming. As shown in Figure 8, these registers can be grouped as follows:

• General-purpose data registers. These eight registers are available for storing operands and pointers.

• Segment registers. These registers hold up to six segment selectors.

• Status and control registers. These registers report and allow modification of the state of the processor and of the program being executed.

General-Purpose Data Registers

The 32-bit general-purpose data registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:

• Operands for logical and arithmetic operations

• Operands for address calculations

Although all of these registers are available for general storage of operands, results, and pointers, caution should be used when referencing the ESP register. The ESP register holds the stack pointer and as a general rule should not be used for any other purpose.

Many instructions assign specific registers to hold operands. For example, string instructions use the contents of the ECX, ESI, and EDI registers as operands. When using a segmented memory model, some instructions assume that pointers in certain registers are relative to specific segments. For instance, some instructions assume that a pointer in the EBX register points to a memory location in the DS segment.

The following is a summary of these special uses:

EAX: Accumulator for operands and results data.

EBX: Pointer to data in the DS segment.

ECX: Counter for string and loop operations.

EDX: I/O pointer.

ESI: Pointer to data in the segment pointed to by the DS register; source pointer for string operations.

EDI: Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations.

ESP: Stack pointer (in the SS segment).

EBP: Pointer to data on the stack (in the SS segment).

As shown in Figure, the lower 16 bits of the general-purpose registers map directly to the register set found in the 8086 and Intel 286 processors and can be referenced with the names AX, BX, CX, DX, BP, SP, SI, and DI. Each of the lower two bytes of the EAX, EBX, ECX, and EDX registers can be referenced by the names AH, BH, CH, and DH (high bytes) and AL, BL, CL, and DL (low bytes).
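The aliasing of EAX, AX, AH and AL can be pictured as plain bit masking on one 32-bit value. The helper names below are invented for this sketch; on the processor itself the narrower names simply address parts of the same register.

```python
def sub_registers(eax):
    """Return (AX, AH, AL) as seen through a 32-bit EAX value."""
    ax = eax & 0xFFFF        # low 16 bits
    ah = (ax >> 8) & 0xFF    # high byte of AX
    al = ax & 0xFF           # low byte of AX
    return ax, ah, al


def write_al(eax, value):
    """Writing AL changes only the lowest byte of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


ax, ah, al = sub_registers(0x12345678)
updated = write_al(0x12345678, 0xFF)
```

The same masks apply to EBX, ECX and EDX; the remaining four registers have 16-bit names (BP, SP, SI, DI) but no byte-sized names in this scheme.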


Segment Registers

The segment registers (CS, DS, SS, ES, FS, and GS) hold 16-bit segment selectors. A segment selector is a special pointer that identifies a segment in memory. To access a particular segment in memory, the segment selector for that segment must be present in the appropriate segment register. When writing application code, you generally create segment selectors with assembler directives and symbols. The assembler and other tools then create the actual segment selector values associated with these directives and symbols. If you are writing system code, you may need to create segment selectors directly.

How segment registers are used depends on the type of memory management model that the operating system or executive is using. When using the flat (unsegmented) memory model, the segment registers are loaded with segment selectors that point to overlapping segments, each of which begins at address 0 of the linear address space (as shown in Figure). These overlapping segments then comprise the linear-address space for the program. (Typically, two overlapping segments are defined: one for code and another for data and stacks. The CS segment register points to the code segment and all the other segment registers point to the data and stack segment.)

When using the segmented memory model, each segment register is ordinarily loaded with a different segment selector so that each segment register points to a different segment within the linear-address space (as shown in Figure 11).

Figure 11 Use of Segment Registers in Segmented Memory Model

Each of the segment registers is associated with one of three types of storage: code, data, or stack. For example, the CS register contains the segment selector for the code segment, where the instructions being executed are stored. The processor fetches instructions from the code segment, using a logical address that consists of the segment selector in the CS register and the contents of the EIP register. The EIP register contains the offset within the code segment of the next instruction to be executed. The CS register cannot be loaded explicitly by an application program. Instead, it is loaded implicitly by instructions or internal processor operations that change program control (such as procedure calls, interrupt handling, or task switching).

The DS, ES, FS, and GS registers point to four data segments. The availability of four data segments permits efficient and secure access to different types of data structures. For example, four separate data segments might be created: one for the data structures of the current module, another for the data exported from a higher-level module, a third for a dynamically created data structure, and a fourth for data shared with another program. To access additional data segments, the application program must load segment selectors for these segments into the DS, ES, FS, and GS registers, as needed.

The SS register contains the segment selector for a stack segment, where the procedure stack is stored for the program, task, or handler currently being executed. All stack operations use the SS register to find the stack segment. Unlike the CS register, the SS register can be loaded explicitly, which permits application programs to set up multiple stacks and switch among them.

The four segment registers CS, DS, SS, and ES are the same as the segment registers found in the Intel 8086 and Intel 286 processors; the FS and GS registers were introduced into the Intel Architecture with the Intel386 family of processors.

EFLAGS Register

The 32-bit EFLAGS register contains a group of status flags, a control flag, and a group of system flags. Figure 3-7 defines the flags within this register. Following initialization of the processor (either by asserting the RESET pin or the INIT pin), the state of the EFLAGS register is 00000002H. Bits 1, 3, 5, 15, and 22 through 31 of this register are reserved. Software should not use or depend on the states of any of these bits.

Some of the flags in the EFLAGS register can be modified directly, using special-purpose instructions (described in the following sections). There are no instructions that allow the whole register to be examined or modified directly. However, the following instructions can be used to move groups of flags to and from the procedure stack or the EAX register: LAHF, SAHF, PUSHF, PUSHFD, POPF, and POPFD. After the contents of the EFLAGS register have been transferred to the procedure stack or EAX register, the flags can be examined and modified using the processor's bit manipulation instructions (BT, BTS, BTR, and BTC).
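Once a copy of EFLAGS is held in a general register, examining it is ordinary bit manipulation. The sketch below mimics what BT, BTS, BTR and BTC do to a single bit. The flag bit positions are the standard Intel Architecture ones and are not listed in these notes, so treat the table as an assumption to check against the figure; the function names are illustrative.

```python
# Bit positions of the commonly used flags (assumed from the standard layout).
FLAG_BITS = {0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
             8: "TF", 9: "IF", 10: "DF", 11: "OF"}
# Reserved bits as listed in the text: 1, 3, 5, 15 and 22 through 31.
RESERVED_BITS = [1, 3, 5, 15] + list(range(22, 32))


def bit_test(value, bit):        # like BT: read one bit
    return (value >> bit) & 1


def bit_set(value, bit):         # like BTS: set one bit
    return value | (1 << bit)


def bit_reset(value, bit):       # like BTR: clear one bit
    return value & ~(1 << bit)


def bit_complement(value, bit):  # like BTC: toggle one bit
    return value ^ (1 << bit)


def flags_set(eflags):
    """Names of the flags that are set, in ascending bit order."""
    return [name for bit, name in sorted(FLAG_BITS.items())
            if bit_test(eflags, bit)]


after_reset = flags_set(0x00000002)
example = flags_set(0x00000246)
```

The reset value 00000002H has only reserved bit 1 set, so no named flag is reported for it.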

When suspending a task (using the processor's multitasking facilities), the processor automatically saves the state of the EFLAGS register in the task state segment (TSS) for the task being suspended. When binding itself to a new task, the processor loads the EFLAGS register with data from the new task's TSS.


When a call is made to an interrupt or exception handler procedure, the processor automatically saves the state of the EFLAGS register on the procedure stack. When an interrupt or exception is handled with a task switch, the state of the EFLAGS register is saved in the TSS for the task being suspended.

Instruction Pointer

The instruction pointer (EIP) register contains the offset in the current code segment for the next instruction to be executed. It is advanced from one instruction boundary to the next in straight-line code or it is moved ahead or backwards by a number of instructions when executing JMP, Jcc, CALL, RET, and IRET instructions.

The EIP register cannot be accessed directly by software; it is controlled implicitly by control-transfer instructions (such as JMP, Jcc, CALL, and RET), interrupts, and exceptions. The only way to read the EIP register is to execute a CALL instruction and then read the value of the return instruction pointer from the procedure stack. The EIP register can be loaded indirectly by modifying the value of a return instruction pointer on the procedure stack and executing a return instruction (RET or IRET).

All Intel Architecture processors prefetch instructions. Because of instruction prefetching, an instruction address read from the bus during an instruction load does not match the value in the EIP register. Even though different processor generations use different prefetching mechanisms, the function of the EIP register to direct program flow remains fully compatible with all software written to run on Intel Architecture processors.

Operand-size and Address-size Attributes

When the processor is executing in protected mode, every code segment has a default operand-size attribute and address-size attribute. These attributes are selected with the D (default size) flag in the segment descriptor for the code segment. When the D flag is set, the 32-bit operand-size and address-size attributes are selected; when the flag is clear, the 16-bit size attributes are selected. When the processor is executing in real-address mode, virtual-8086 mode, or SMM, the default operand-size and address-size attributes are always 16 bits.


The operand-size attribute selects the sizes of operands that instructions operate on. When the 16-bit operand-size attribute is in force, operands can generally be either 8 bits or 16 bits, and when the 32-bit operand-size attribute is in force, operands can generally be 8 bits or 32 bits. The address-size attribute selects the sizes of addresses used to address memory: 16 bits or 32 bits. When the 16-bit address-size attribute is in force, segment offsets and displacements are 16 bits. This restriction limits the size of a segment that can be addressed to 64 KBytes. When the 32-bit address-size attribute is in force, segment offsets and displacements are 32 bits, allowing segments of up to 4 GBytes to be addressed. The default operand-size attribute and/or address-size attribute can be overridden for a particular instruction by adding an operand-size and/or address-size prefix to an instruction. The effect of this prefix applies only to the instruction it is attached to.
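The rules above reduce to a small decision table plus a power of two. The sketch below restates them; the mode strings and function names are made up for the example.

```python
def default_attribute_bits(mode, d_flag=False):
    """Default operand/address size in bits for a code segment."""
    if mode in ("real-address", "virtual-8086", "smm"):
        return 16                      # always 16 bits in these modes
    if mode == "protected":
        return 32 if d_flag else 16    # selected by the descriptor's D flag
    raise ValueError("unknown operating mode: " + mode)


def max_segment_bytes(address_bits):
    """Largest segment reachable with offsets of the given width."""
    return 2 ** address_bits


small = max_segment_bytes(default_attribute_bits("protected", d_flag=False))
large = max_segment_bytes(default_attribute_bits("protected", d_flag=True))
```

A size prefix does not change these defaults; it flips the size for the one instruction it precedes.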

Pentium II

The Pentium II incorporates many of the salient features of the Pentium Pro and Pentium MMX; however, its physical package was based on the SECC/Slot 1 interface and its 512 KB L2 cache ran at only half the processor's internal clock rate. First generation Pentium II Klamath CPUs operated at 233, 266, 300 and 333 MHz with an FSB of 66 MHz and a core voltage of 2.8 Volts. In 1z, FSB and at 2.0 Volts at the core. Its major improvements were:

• 16 KB L1 instruction and data caches

• L2 cache with non-proprietary commercially available SRAM

• Improved 16-bit capability through segment register caches

• MMX unit.

Standard Pentium II could only be used in dual multiprocessor configurations; however, Pentium Xeon CPUs had up to 2 MB of L2 cache and could be used in multiprocessor configurations of up to 4 processors.


Celeron

The Celeron began as a scaled-down version of the Pentium II and was designed to compete against similar offerings from Intel's competitors. The Klamath-based Covington core ran at 266 and 300 MHz and was constructed without an L2 cache. However, adverse market reaction saw the Deschutes-based Mendocino core introduced with a 128 KB L2 cache; it ran at 300, 333, 400, 433, 466, 500 and 533 MHz. Celerons have the same L1 cache as their bigger brothers, the Pentium II and III. The important distinction is that the L2 cache operates at full CPU clock rates, unlike the Pentium II and the SECC packaged Pentium III. (Later variants of the Pentium III had an on-die L2 cache which ran at full CPU clock rate.) The Celeron III (Coppermine128 core) has the same internal features as the Pentium III, but has reduced functionality: a 66 MHz clock rate, no error correction codes for the data bus or parity creation for the address bus, and a maximum of 4 GB of address space. Celeron III Coppermine128s with a 1.6 V core and a 100 MHz FSB were produced in 2001 and operated at core speeds of up to 1.1 GHz. Tualatin-core Celerons were put on the market in late 2001 and ran at 1.2 GHz. 2002 saw the final versions produced, running at 1.3 and 1.4 GHz.

Pentium III

The only significant difference between the Pentium III and its predecessor was the inclusion of 72 MMX instructions, known as the Internet Streaming Single Instruction Multiple Data Extensions (ISSE); they include integer and floating point operations. However, like the original MMX instructions, application programmers must include the corresponding extensions if any use is to be made of these instructions. The most controversial and short-lived addition was the CPU ID number, which could be used for software licensing and e-commerce. After protest from various sources, Intel disabled it as default, but did not remove it. Depending on the BIOS and motherboard manufacturer, it may remain as such but it can be enabled via the BIOS. In reality, Pentium III performance was based. The three variants of Pentium III were the Katmai, Coppermine, and Tualatin. Katmai first introduced the ISSE (MMX/2) as described with an FSB of 100 MHz. The Coppermine also introduced Advanced Transfer Cache (ATC) for the L2 cache, which reduced cache capacity to 256 KB but saw the cache run at full processor speed. Also the 64-bit Katmai cache bus was quadrupled to 256 bits. Coppermine also uses an 8-way set associative cache, rather than the 4-way set associative cache in the Katmai and older Pentiums. Bringing the cache on-die also increased the transistor count to 30 million, from the 10 million on the Katmai. Another advance in the Coppermine was Advanced System Buffering (ASB), which simply increased the number of buffers to account for the increased FSB speed of 133 MHz. The Pentium III Tualatin had a reduced die size that allowed it to run at higher speeds. Tualatins use a 133 MHz FSB and have ATC and ASB.

Pentium IV: The Next Generation

The release of the Pentium IV in 2000 heralded the seventh generation of Intel microprocessors. The release was premature, however, because the Pentium III Coppermine, with its 1 GHz performance threshold, was being outperformed by the AMD Athlon from Intel's major competitor in the microprocessor market. Intel was not ready to answer the competition through the early release of the next member of its Pentium III family, the Pentium III Tualatin, which was designed to break the 1 GHz barrier. Previous attempts to do so with the Pentium III Coppermine 1.13 GHz met with failure due to design flaws. Paradoxically, however, Intel was in a position to release the first of the Pentium IV family, the Willamette, which ran at 1.3, 1.4 and 1.5 GHz, using an FC-PGA package on the short-lived Socket 423, which was a design dead end for motherboard manufacturers and consumers. Worse still, the only Intel chipset available for the Pentium IV could only house the highly expensive Rambus DRAM. In addition, the early versions of the Pentium IV CPU were outperformed by slower AMD Athlons. Nevertheless, the core capability of Intel's seventh generation processors is that they can run at ever-higher speeds. For example, Intel's sixth generation Pentiums began at 120 MHz with the Pentium Pro and ended at over 1.2 GHz, a tenfold increase. The bottom line here is that Intel's seventh generation chips could end up running at speeds of 10 GHz or more. How has Intel achieved this? Through a radical redesign of the Pentium's core architecture. The following sections illustrate the major advances.

The most visible feature of the new Pentium IV is the Front Side Bus (FSB), which initially operated at an equivalent speed of 400 MHz as compared to 133 MHz on the Pentium III. The Pentium III has a 64-bit data bus that delivered a data throughput of 1.066 GB/s (64 bits × 133 MHz ÷ 8 bits per byte = 1.066 GB/s). The Pentium IV FSB is also 64 bits wide; however, its 100 MHz bus clock is 'quad-pumped', giving an effective bus speed of 400 MHz and a data transfer rate of 3.2 GB/s. The newer (as of late 2002) Pentium IVs and chipsets operate at 133 MHz and deliver an effective bus speed of 533 MHz and a data transfer rate of 4.2 GB/s. Thus, the Pentium IV exchanges data with the i845 and i850 chipsets faster than any other processor, removing the Pentium III's most significant bottleneck. Intel's 850 chipset for the Pentium IV uses two Rambus channels to 2-4 RDRAM RIMMs. Together, these two RDRAM channels are able to deliver the same data bandwidth as the Pentium IV FSB. As the later discussion on DRAM indicates, similar transfer rates are delivered using the i845 chipset and DDR DRAM. This combination enables Pentium 4 systems to have the highest data transfer rates between processor, system and main memory, which is a clear benefit.
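The bus figures quoted above all come from the same piece of arithmetic: bytes per transfer, times bus clock, times transfers per clock. A small sketch follows, with a hypothetical function name, using decimal gigabytes (10^9 bytes). The nominal 133 MHz clock is really 133.33 MHz, which is why the quoted values differ slightly from what the rounded clock gives.

```python
def bus_bandwidth_gb_per_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bus bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock / 1000


pentium_iii = bus_bandwidth_gb_per_s(64, 133)           # single-pumped
pentium_iv_400 = bus_bandwidth_gb_per_s(64, 100, 4)     # quad-pumped 100 MHz
pentium_iv_533 = bus_bandwidth_gb_per_s(64, 133, 4)     # quad-pumped 133 MHz
```

Quad-pumping leaves the bus width and base clock unchanged and simply moves four transfers per clock, so the bandwidth scales by exactly four.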

Advanced Transfer Cache

The first major improvement is the integration of the L2 cache and the evolution of the Advanced Transfer Cache introduced in the Pentium III Coppermine, which had just 256 KB of L2 cache. The first Pentium IV, the Willamette, had a similar sized cache, but could transfer data at 48 GB per second at a CPU clock speed of 1.5 GHz into the CPU's core logic. In comparison, the Coppermine could only transfer 16 GB/s at 1 GHz to its L1 Instruction Cache. Note also that the Front Side Bus speed of the Pentium III was 133 MHz, while the Pentium IV Willamette had an FSB speed of 400 MHz. In addition, the Pentium IV L2 cache has 128-byte cache lines, which are divided in two 64-byte segments. For example, when the Pentium IV fetches data from the RAM, it does so in 64-byte burst transfers. If just four bytes (32 bits) are required, this block transfer becomes inefficient. However, the cache has advanced Data Prefetch Logic that predicts the data required by the cache and loads it into the L2 cache in advance. The Pentium IV's hardware prefetch logic significantly accelerates the execution of processes that operate on large data arrays. The read latency (the time it takes the cache to transfer data into the pipeline) of the Pentium 4's L2 cache is 7 clock pulses. However, its connection to the core logic (the Translation Lookaside Buffer in this case, as there is no I-Cache in the Pentium IV) is 256 bits wide and clocked at the full processor speed. The second member of the Pentium IV family was the Northwood, which had a 512 KB L2 cache running at the processor's clock speed.


L1 Data Cache

The second major development in cache technology is that the Pentium IV has only one 8 KB L1 data cache. In place of the L1 instruction cache (I-Cache) in the 6th generation Pentiums it has a much more efficient Execution Trace Cache.

Intel reduced the size of its L1 data cache to enable a very low latency of only 2 clock cycles. This results in an overall read latency (the time it takes to read data from cache memory) of less than half of the Pentium III's L1 data cache.

7th Generation NetBurst Micro-Architecture

Intel's NetBurst Micro-Architecture provides a firm foundation for future advances in processor performance, particularly where speed of operation is concerned. The NetBurst micro-architecture has four major components: Hyper Pipelined Technology, Rapid Execution Engine, Execution Trace Cache and a 400 MHz system bus. Also incorporated are four significant improvements over sixth generation architecture: Advanced Dynamic Execution, Advanced Transfer Cache, Enhanced Floating Point & Multimedia Unit, and Streaming SIMD Extensions 2.

Hyper Pipelined Technology

The traditional approach to increasing a CPU's clock speed was to make smaller processors by shrinking the die. An alternative strategy evident in RISC processors is to make the CPU more efficient: do less per clock cycle and have more cycles. To do this in a CISC-based processor, Intel simply increased the number of stages in the processor's pipeline. The upshot of this is that less is accomplished per clock cycle. This is akin to a 'bucket-brigade' passing smaller buckets rapidly down a chain, rather than larger buckets at a slower rate. For example, the U and V integer pipelines in the original Pentium each had just five stages: instruction fetch, decode 1, decode 2, execute and write-back. The Pentium Pro introduced a P6 architecture with a pipeline consisting of 10 stages. The P7 NetBurst micro-architecture in the Pentium IV increased the number of stages to 20. This, Intel terms its Hyper Pipelined Technology.
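The trade-off behind deeper pipelines can be seen in an idealised textbook model. The sketch assumes one instruction enters the pipeline per clock, that filling the pipeline costs one cycle per stage, and that each mispredicted branch throws away a full pipeline's worth of work. These are simplifying assumptions for illustration, not measured Pentium behaviour, and the function name is invented.

```python
def pipeline_cycles(instructions, stages, mispredictions=0):
    """Idealised cycle count for a scalar pipeline.

    The first instruction takes `stages` cycles to emerge; each later one
    completes a cycle after its predecessor. Every misprediction is assumed
    to cost a refill of (stages - 1) cycles.
    """
    fill_and_drain = stages + instructions - 1
    flush_penalty = mispredictions * (stages - 1)
    return fill_and_drain + flush_penalty


five_stage = pipeline_cycles(1000, 5, mispredictions=10)
twenty_stage = pipeline_cycles(1000, 20, mispredictions=10)
```

The deeper pipeline needs more cycles for the same work, so it only wins if each of its shorter stages allows a proportionally faster clock, and it pays more for every wrong branch guess.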

Enhanced Branch Prediction

The key to pipeline efficiency and operation is effective branch prediction, hence the much improved branch prediction logic in the Pentium IV's Advanced Dynamic


point operations, which are not prone to the same type of branch prediction inefficiencies as integer-based instructions.

Streaming SIMD Extensions 2

SSE2 is the follow-up to Intel's Streaming SIMD (Single Instruction Multiple Data)
Extensions (SSE). SIMD is a technology that allows a single instruction to be applied to
multiple datasets at the same time. This is especially useful when processing 3D graphics.
SIMD-FP (Floating Point) extensions help speed up graphics processing by taking the
multiplication, addition and reciprocal functions and applying them to multiple datasets
simultaneously. Recall that SIMD first appeared with the Pentium MMX, which incorporated
57 MMX instructions. These are essentially SIMD-Int (integer) instructions. Intel first
introduced SIMD-FP extensions in the Pentium III with 72 Streaming SIMD Extensions
(SSE). Intel introduced 144 new instructions in the Pentium IV that enable it to handle
two 64-bit SIMD-INT operations and two double precision 64-bit SIMD-FP operations.
This is in contrast to the two 32-bit operations the Pentium MMX and III (under SSE)
handle. The major benefit of SSE2 is greater performance, particularly with
SIMD-FP instructions, as it increases the processor's ability to handle greater precision
floating point calculations. As with MMX and SSE, these instructions require software
support.
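
The idea of one instruction acting on several data items at once can be sketched in
software. The example below is a conceptual model only: it packs two double precision
64-bit values into a single 16-byte value and applies one "packed" operation to both
lanes. It does not execute real SSE2 instructions, and the function names (pack_lanes,
unpack_lanes, packed_add, packed_mul) are hypothetical, introduced purely for illustration.

```python
# Conceptual model of SIMD: one operation applied to two 64-bit lanes held
# in a single 128-bit value. This is plain Python, not real SSE2 code.
import struct

LANES = "<2d"  # two little-endian 64-bit doubles, 16 bytes in total


def pack_lanes(a: float, b: float) -> bytes:
    """Pack two doubles into one 16-byte 'register'."""
    return struct.pack(LANES, a, b)


def unpack_lanes(reg: bytes) -> tuple:
    """Return the two doubles stored in a 16-byte 'register'."""
    return struct.unpack(LANES, reg)


def packed_add(x: bytes, y: bytes) -> bytes:
    """One 'instruction': add the corresponding lanes of x and y."""
    sums = [p + q for p, q in zip(unpack_lanes(x), unpack_lanes(y))]
    return struct.pack(LANES, *sums)


def packed_mul(x: bytes, y: bytes) -> bytes:
    """One 'instruction': multiply the corresponding lanes of x and y."""
    products = [p * q for p, q in zip(unpack_lanes(x), unpack_lanes(y))]
    return struct.pack(LANES, *products)


if __name__ == "__main__":
    x = pack_lanes(1.5, 2.0)
    y = pack_lanes(10.0, 20.0)
    print(unpack_lanes(packed_add(x, y)))
    print(unpack_lanes(packed_mul(x, y)))
```

A scalar version would need one add per pair of values; the packed version expresses both
additions as a single operation, which is the saving that SIMD hardware exploits.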

Celeron IV

The Celeron IV first appeared in 2002. These processors were based on the Pentium IV and
could be accommodated on Socket 478 motherboards. Based on the Willamette, the first
models had the L2 cache halved to 128 KB and ran at 1.7 GHz. Later models ran at speeds
including 1.8 and 2 GHz. The next member was based on the Northwood and had 256 KB L2
cache. Paired with the i845 chipset, the new Celerons are now good-value entry-level
processors.

Additional Resources

The following diagrams of the Pentium III, IV and AMD Athlon CPUs are provided to
highlight the architectural features of these microprocessors and enhance the foregoing
text. The following figures have been obtained from Tom's Hardware Guide (NOT this
Tom); further insights into the Intel architectures may be found at:

[email protected]$###11$#7index.html/.
