Solutions to Fundamentals of Computer Organization and Design


Solutions to Fundamentals of Computer Organization and Design, a book written by S. Dandamudi about computer architecture. It contains valuable information for those who want to learn this domain.

Transcript of Solutions to Fundamentals of Computer Organization and Design

  • Chapter 1

    Introduction

1–1 The instruction execution cycle consists of the following phases (see the figure below):

    1. Fetch an instruction from the memory,

    2. Decode the instruction (i.e., determine the instruction type),

    3. Execute the instruction (i.e., perform the action specified by the instruction).

[Figure: Execution cycle. Instruction execution proceeds in five phases: instruction fetch (IF), instruction decode (ID), operand fetch (OF), instruction execute (IE), and result write back (WB); the last three make up the instruction execution phase.]

    The instruction execution phase involves

    1. Fetching any required operands;

    2. Performing the specified operation;

    3. Writing the results back.

    This process is often referred to as the fetch-execute cycle, or simply the execution cycle.
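To make the cycle concrete, here is a minimal sketch of a fetch-decode-execute loop for a hypothetical accumulator machine (Python; the opcodes and memory layout are invented for illustration and are not from the book):

    # Minimal sketch of the fetch-decode-execute cycle (illustrative only).
    memory = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("STORE", 102), 3: ("HALT", 0),
              100: 5, 101: 7, 102: 0}
    pc, acc = 0, 0
    while True:
        opcode, operand = memory[pc]      # 1. fetch the instruction
        pc += 1
        if opcode == "HALT":              # 2. decode: determine instruction type
            break
        elif opcode == "LOAD":            # 3. execute: fetch operand, perform the
            acc = memory[operand]         #    operation, and write the result back
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
    print(memory[102])                    # 12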

1–2 The system bus consists of three major components: an address bus, a data bus, and a control bus (see the figure below).


[Figure: System bus connecting the processor, memory, and the I/O subsystem (I/O devices), with separate address, data, and control buses.]

The address bus width determines the amount of physical memory addressable by the processor. The data bus width indicates the size of the data transferred between the processor and memory or an I/O device. For example, the Pentium processor has 32 address lines and 64 data lines. Thus, the Pentium can address up to 2^32 bytes, or 4 GB, of memory. Furthermore, each data transfer can move 64 bits of data. The Intel Itanium processor uses address and data buses that are twice the size of the Pentium buses (i.e., 64-bit address bus and 128-bit data bus). The Itanium, therefore, can address up to 2^64 bytes of memory, and each data transfer can move 128 bits.

The control bus consists of a set of control signals. Typical control signals include memory read, memory write, I/O read, I/O write, interrupt, interrupt acknowledge, bus request, and bus grant. These control signals indicate the type of action taking place on the system bus. For example, when the processor is writing data into the memory, the memory write signal is generated. Similarly, when the processor is reading from an I/O device, it generates the I/O read signal.
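The quoted capacities follow directly from the bus widths; a quick check (Python, using the figures from the text):

    # Addressable memory = 2**(address bus width) bytes;
    # bytes moved per transfer = data bus width / 8.
    def bus_capacity(addr_lines, data_lines):
        return 2**addr_lines, data_lines // 8

    print(bus_capacity(32, 64))    # Pentium: 4294967296 bytes (4 GB), 8 bytes/transfer
    print(bus_capacity(64, 128))   # Itanium: 2**64 bytes, 16 bytes/transfer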

1–3 Registers are used as a processor's scratchpad to store data and instructions temporarily. Because accessing data stored in the registers is faster than going to the memory, optimized code tends to put most-often accessed data in processor registers. Obviously, we would like to have as many registers as possible, the more the better. In general, all registers are of the same size. For example, registers in a 32-bit processor like the Pentium are all 32 bits wide. Similarly, 64-bit registers are used in 64-bit processors like the Itanium.

The number of processor registers varies widely. Some processors may have only about 10 registers, and others may have 100+ registers. For example, the Pentium has about 8 data registers and 8 other registers, whereas the Itanium has 128 registers just for integer data. There are an equal number of floating-point and application registers.

Some of the registers contain special values. For example, all processors have a register called the program counter (PC). The PC register maintains a pointer to the instruction that the processor is supposed to execute next. Some processors refer to the PC register as the instruction pointer (IP) register. There is also an instruction register (IR) that keeps the instruction currently being executed. Although some of these registers are not directly accessible, most processor registers can be used by the programmer.

1–4 Machine language is the native language of the machine. This is the language understood by the machine's hardware. Since digital computers use 0 and 1 as their alphabet, machine language naturally uses 1s and 0s to encode the instructions.

Assembly language, on the other hand, does not use 1s and 0s; instead, it uses mnemonics to express the instructions. Assembly language is closely related to the machine language. In the Pentium, there is a one-to-one correspondence between the instructions of the assembly language and its machine language. Compared to the machine language, assembly language is far easier for understanding programs. Since there is a one-to-one correspondence between many assembly and machine language instructions, it is fairly straightforward to translate instructions from assembly language to machine language. The assembler is the software that achieves this code translation.

Although the Pentium's assembly language is close to its machine language, other processors like the MIPS use the assembly language to implement a virtual instruction set that is more powerful and useful than the native machine language. In this case, an assembly language instruction may be translated into a sequence of machine language instructions.

1–5 Assembly language is considered a low-level language because each assembly language instruction performs a much lower-level task compared to an instruction in a high-level language. For example, the following C statement, which assigns the sum of four count variables to result,

    result = count1 + count2 + count3 + count4;

    is implemented in the Pentium assembly language as

mov  AX, count1
add  AX, count2
add  AX, count3
add  AX, count4
mov  result, AX

1–6 The advantages of programming in a high-level language rather than in an assembly language include the following:

1. Program development is faster in a high-level language. Many high-level languages provide structures (sequential, selection, iterative) that facilitate program development. Programs written in a high-level language are relatively small and easier to code and debug.

2. Programs written in a high-level language are easier to maintain. Programming for a new application can take several weeks to several months, and the lifecycle of such application software can be several years. Therefore, it is critical that software development be done with a view toward software maintainability, which involves activities ranging from fixing bugs to generating the next version of the software. Programs written in a high-level language are easier to understand and, when good programming practices are followed, easier to maintain. Assembly language programs tend to be lengthy and take more time to code and debug. As a result, they are also difficult to maintain.

3. Programs written in a high-level language are portable. High-level language programs contain very few machine-specific details, and they can be used with little or no modification on different computer systems. In contrast, assembly language programs are written for a particular system and cannot be used on a different system.

1–7 There are two main reasons for programming in assembly language: efficiency and accessibility to system hardware. Efficiency refers to how good a program is in achieving a given objective. Here we consider two objectives based on space (space-efficiency) and time (time-efficiency). Space-efficiency refers to the memory requirements of a program (i.e., the size of the code). Program A is said to be more space-efficient than program B if it takes less memory space to perform the same task. Very often, programs written in an assembly language tend to generate more compact executable code than the corresponding high-level language version. You should not confuse the size of the source code with that of the executable code.

Time-efficiency refers to the time taken to execute a program. Clearly, a program that runs faster is said to be better from the time-efficiency point of view. Programs written in an assembly language tend to run faster than those written in a high-level language. However, sometimes compiler-generated code executes faster than handcrafted assembly language code!

The superiority of assembly language in generating compact code is becoming increasingly less important, for several reasons.

1. First, the savings in space pertain only to the program code and not to its data space. Thus, depending on the application, the savings in space obtained by converting an application program from some high-level language to an assembly language may not be substantial.

2. Second, the cost of memory (i.e., cost per bit) has been decreasing and memory capacity has been increasing. Thus, the size of a program is not a major hurdle anymore.

3. Finally, compilers are becoming smarter in generating code that competes well with handcrafted assembly code. However, there are systems such as mobile devices and embedded controllers in which space-efficiency is still important.

One of the main reasons for writing programs in assembly language is to generate code that is time-efficient. The superiority of assembly language programs in producing code that runs faster is a direct manifestation of specificity. That is, handcrafted assembly language programs tend to contain only the necessary code to perform the given task. Even here, a smart compiler can optimize the code so that it competes well with its equivalent written in assembly language.

Perhaps the main reason for still programming in an assembly language is to have direct control over the system hardware. High-level languages, on purpose, provide a restricted (abstract) view of the underlying hardware. Because of this, it is almost impossible to perform certain tasks that require access to the system hardware. Since assembly language does not impose any restrictions, you can have direct control over all of the system hardware.

1–8 The datapath provides the necessary hardware to facilitate instruction execution. It is controlled by the control unit, as shown in the following figure. The datapath consists of a set of registers, one or more arithmetic and logic units (ALUs), and their associated interconnections.

[Figure: Processor internals. The control unit controls the datapath, which consists of registers and one or more ALUs.]

1–9 A microprogram is a small run-time interpreter that takes a complex instruction and generates a sequence of simple instructions that can be executed by the hardware. Thus the hardware need not be complex.

An advantage of using microprogrammed control is that we can implement variations on the basic ISA by simply modifying the microprogram; there is no need to change the underlying hardware, as shown in the following figure. Thus, it is possible to come up with cheaper versions as well as high-performance processors for the same family.

[Figure: Several instruction sets (ISA 1, ISA 2, ISA 3) implemented on the same hardware by using different microprograms (Microprogram 1, 2, and 3).]

1–10 Pipelining helps us improve efficiency by overlapping execution. For example, as the following figure shows, instruction execution can be divided into five parts: IF, ID, OF, IE, and WB.

[Figure: Execution cycle divided into five phases: instruction fetch (IF), instruction decode (ID), operand fetch (OF), instruction execute (IE), and result write back (WB).]

A pipelined execution of the basic execution cycle, shown in the following figure, clearly shows that the execution of instruction I1 is completed in Cycle 5. However, after Cycle 5, notice that one instruction is completed in each cycle. Thus, executing six instructions takes only 10 cycles. Without pipelining, it would have taken 30 cycles.

Time (cycles)

Instruction   1    2    3    4    5    6    7    8    9    10
I1            IF   ID   OF   IE   WB
I2                 IF   ID   OF   IE   WB
I3                      IF   ID   OF   IE   WB
I4                           IF   ID   OF   IE   WB
I5                                IF   ID   OF   IE   WB
I6                                     IF   ID   OF   IE   WB

Notice from this description that pipelining does not speed up the execution of individual instructions; each instruction still takes five cycles to execute. However, pipelining increases the number of instructions executed per unit time; that is, instruction throughput increases.
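The cycle counts in the figure follow the usual pipeline relation: an n-instruction program on a k-stage pipeline takes k + (n − 1) cycles, versus n × k cycles without pipelining. A quick check (Python):

    def pipelined_cycles(n, k):     # n instructions, k pipeline stages
        return k + (n - 1)

    n, k = 6, 5
    print(pipelined_cycles(n, k))   # 10 cycles with pipelining
    print(n * k)                    # 30 cycles without pipelining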

1–11 As the name suggests, CISC systems use complex instructions. RISC systems use only simple instructions. Furthermore, RISC systems assume that the required operands are in the processor's registers, not in main memory. A CISC processor does not impose such restrictions.

  • Chapter 1 7

Complex instructions are expensive to implement in hardware. The introduction of microprogrammed control solved this implementation dilemma. As a result, most CISC designs use microprogrammed control, as shown in the following figure. RISC designs, on the other hand, eliminate the microprogram layer and use the hardware to directly execute instructions.

[Figure: (a) CISC implementation: the ISA level sits on a microprogram control layer above the hardware. (b) RISC implementation: the ISA level is implemented directly by the hardware.]

1–12 When we want to store a data item that needs more than one byte, we can store it in one of two ways, as shown in the following figure. In the little-endian scheme, multibyte data is stored from the least significant byte to the most significant byte (vice versa for the big-endian scheme).

[Figure: (a) a 32-bit data item; (b) little-endian byte ordering; (c) big-endian byte ordering. The four bytes are stored at addresses 100 through 103, in opposite orders in the two schemes.]

Pentium processors use the little-endian byte ordering. However, most processors leave it up to the system designer to configure the processor. For example, the MIPS and PowerPC processors use the big-endian byte ordering by default, but these processors can be configured to use the little-endian byte order.
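The two byte orders are easy to see with Python's struct module; the 32-bit value below is an arbitrary example:

    import struct

    value = 0x12345678                      # arbitrary 32-bit data item
    print(struct.pack("<I", value).hex())   # little-endian: 78563412 (LSB at lowest address)
    print(struct.pack(">I", value).hex())   # big-endian:    12345678 (MSB at lowest address)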


1–13 Memory can be viewed as consisting of an ordered sequence of bytes. In byte-addressable memory, each byte is identified by its sequence number starting with 0, as shown in the following figure. This is referred to as the memory address of the byte. Most processors support byte-addressable memories. For example, the Pentium can address up to 4 GB (2^32 bytes) of main memory (see the figure below).

[Figure: byte-addressable memory. Addresses run from 0 (00000000H) to 2^32 − 1 (FFFFFFFFH).]

1–14 The memory address space of a system is determined by the address bus width of the processor used in the system. If the processor has 16 address lines, its memory address space is 2^16 locations. The addresses of the first and last locations are 0000H and FFFFH, respectively. (Note: the suffix H indicates that the number is in the hexadecimal number system.)

1–15 Advances in technology and processor architecture led to extremely fast processors. Technological advances pushed the basic clock rate into the gigahertz range. Simultaneously, architectural advances such as multiple pipelines and superscalar designs reduced the number of clock cycles required to execute an instruction. Thus, there is a lot of pressure on the memory unit to supply instructions and data at faster rates. If the memory can't supply the instructions and data at the rate required by the processor, what is the use of designing faster processors? To improve overall system performance, ideally, we would like to have lots of fast memory. Of course, we don't want to pay for it. Designers have proposed cache memories to satisfy these requirements.

Cache memory successfully bridges the speed gap between the processor and memory. The cache is a small amount of fast memory that sits between the processor and the main memory. Cache memory is implemented by using faster memory technology compared to the technology used for the main memory.

1–16 Even though processors have a large memory address space, only a fraction of this address space actually contains memory. For example, even though the Pentium has 4 GB of address space, most PCs now have between 128 MB and 256 MB of memory. Furthermore, this memory is shared between the system and application software. Thus, the amount of memory available to run a program is considerably smaller. In addition, if you run more programs simultaneously, each application gets an even smaller amount of memory. You might have experienced the result of this: terrible performance.

Apart from the performance issue, this scenario also causes another, more important problem: What if your application does not fit into its allotted memory space? How do you run such an application program? This is the motivation for proposing virtual memory.

Virtual memory was developed to eliminate the physical memory size restriction mentioned before. There are some similarities between cache memory and virtual memory. Just as with the cache memory, we would like to use the relatively small main memory and create the illusion (to the programmer) of a much larger memory, as shown in the following figure. The programmer is concerned only with the virtual address space. Programs use virtual addresses, and when these programs are run, their virtual addresses are mapped to physical addresses at run time.

[Figure: mapping from the virtual memory address space to the physical memory.]

The illusion of the larger address space is realized by using much slower disk storage. Virtual memory can be implemented by devising an appropriate mapping function between the virtual and physical address spaces.


1–17 An I/O port is a fancy name for the address of a register in an I/O controller. I/O controllers have three types of internal registers: a data register, a command register, and a status register, as shown in the following figure. When the processor wants to interact with an I/O device, it communicates only with the associated I/O controller. A processor can access the internal registers of an I/O controller through I/O ports.

[Figure: an I/O controller connecting an I/O device to the system bus (data, address, and control buses). The controller contains status, command, and data registers.]

1–18 Bus systems with more than one potential bus master need a bus arbitration mechanism to allocate the bus to a bus master. The processor is the bus master most of the time, but the DMA controller acts as the bus master during DMA transfers. In principle, bus arbitration can be done either statically or dynamically. In the static scheme, bus allocation among the potential masters is done in a predetermined way. For example, we might use a round-robin allocation that rotates the bus among the potential masters. The main advantage of a static mechanism is that it is easy to implement. However, since bus allocation follows a predetermined pattern rather than the actual need, a master may be given the bus even if it does not need it. This kind of allocation leads to inefficient use of the bus. Consequently, most bus arbitration implementations use a dynamic scheme, in which the bus is allocated on demand.
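A minimal sketch contrasting the two policies (Python; the master names and request sets are invented for illustration):

    from itertools import cycle

    masters = ["CPU", "DMA"]
    static_order = cycle(masters)

    # Static: round-robin allocation, regardless of who actually wants the bus.
    def static_grant(_requests):
        return next(static_order)        # may grant to a master with no request

    # Dynamic: demand-driven allocation (grant only among current requesters).
    def dynamic_grant(requests):
        return requests[0] if requests else None

    print(static_grant(["CPU"]))         # could be "CPU" or "DMA"
    print(dynamic_grant(["CPU"]))        # always a requester: "CPU"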

  • Chapter 2

    Digital Logic Basics

2–1 Implementation using NAND gates: We can write the XOR logical expression A'B + AB' using double negation as

A'B + AB' = ((A'B + AB')')'
          = ((A'B)' (AB')')'

    From this logical expression, we can derive the following NAND gate implementation:


    Figure 2.1: 2-input XOR gate using only NAND gates.
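The derivation can be checked exhaustively; the four-NAND structure below follows the standard circuit (a sketch in Python, not reproduced from the figure):

    def nand(a, b):
        return 1 - (a & b)

    def xor_from_nands(a, b):        # classic 4-gate NAND realization of XOR
        t = nand(a, b)
        return nand(nand(a, t), nand(t, b))

    for a in (0, 1):
        for b in (0, 1):
            assert xor_from_nands(a, b) == a ^ b
    print("XOR truth table verified")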

Implementation using NOR gates: We can write the XOR logical expression as

A'B + AB' = ((A'B)')' + ((AB')')'
          = (A + B')' + (A' + B)'

    From this logical expression, we can derive the following NOR gate implementation:



    Figure 2.2: 2-input XOR gate using only NOR gates.

2–2 Implementation using NAND gates: We can write the exclusive-NOR logical expression A'B' + AB using double negation as

A'B' + AB = ((A'B' + AB)')'
          = ((A'B')' (AB)')'

    From this logical expression, we can derive the following NAND gate implementation:

[Figure: 2-input exclusive-NOR gate using only NAND gates.]

Implementation using NOR gates: We can write the exclusive-NOR logical expression as

A'B' + AB = ((A'B')')' + ((AB)')'
          = (A + B)' + (A' + B')'

    From this logical expression, we can derive the following NOR gate implementation:

[Figure: 2-input exclusive-NOR gate using only NOR gates.]

Alternative implementations: Alternatively, we can derive the following NAND implementation by modifying the logic circuit in Figure 2.1 by adding an output inverter:

[Figure: exclusive-NOR implementation derived from Figure 2.1 by adding an output inverter.]

Similarly, we derive the following NOR implementation by modifying the logic circuit in Figure 2.2 by deleting the output inverter:

[Figure: exclusive-NOR implementation derived from Figure 2.2 by deleting the output inverter.]

2–3 A NOT gate can be implemented by holding one input at 1, as shown below:

[Figure: XOR gate with one input held at 1, producing A' from input A.]

2–4 By keeping one input at 0, we can turn an XOR gate into a buffer that passes the input to the output, as shown below:

[Figure: XOR gate with one input held at 0, passing input A through to the output.]

It is clear from this and the last exercise that by controlling one input (call it the control input), we can turn an XOR gate into either an inverter or a buffer. If the control input is 1, the XOR gate acts as an inverter; if the control input is 0, it acts as a buffer.
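This controlled-inverter behavior is just the XOR truth table (Python):

    for control in (0, 1):
        outputs = [control ^ a for a in (0, 1)]
        role = "inverter" if control else "buffer"
        print(f"control={control}: A=0,1 -> {outputs} ({role})")
    # control=0: [0, 1] (buffer); control=1: [1, 0] (inverter)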

2–5 We can write the AND logical expression (A B) using double negation as

A B = ((A B)')'
    = (A' + B')'

    From this logical expression, we can derive the following implementation:

[Figure: AND gate implemented using only NOR gates (A and B inverted, then NORed).]

2–6 We can write the OR logical expression (A + B) using double negation as

A + B = ((A + B)')'
      = (A' B')'

    From this logical expression, we can derive the following implementation:

[Figure: OR gate implemented using only NAND gates (A and B inverted, then NANDed).]

2–7 The two transistors are in series. Vout is low only when both transistors are turned on. This happens only when both Vin1 and Vin2 are high, as shown below:

Vin1   Vin2   Vout
low    low    high
low    high   high
high   low    high
high   high   low

    As in the text, when we interpret low as 0 and high as 1, it implements the NAND function.

2–8 In this example, the two transistors are in parallel. Vout is low when either of the two transistors is turned on. This happens when either Vin1 or Vin2 (or both) is high, as shown below:

Vin1   Vin2   Vout
low    low    high
low    high   low
high   low    low
high   high   low

    As in the text, when we interpret low as 0 and high as 1, it implements the NOR function.

2–9 We assume that input A has 50% weight. The truth table is shown below:

A B C   F
0 0 0   0
0 0 1   0
0 1 0   0
0 1 1   1
1 0 0   1
1 0 1   1
1 1 0   1
1 1 1   1

    We use the Karnaugh map to derive the simplified logical expression.

[Karnaugh map for F]

    From this K-map, we get the following logical expression:

    A + BC

    The following logic circuit implements this function:

[Figure: implementation of F = A + BC using an AND gate and an OR gate.]
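A quick exhaustive check (Python) that F = A + BC reproduces the truth table above:

    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                f = a | (b & c)   # simplified expression A + BC
                # A carries 50% weight, B and C 25% each; F = 1 when weight >= 50%
                assert f == (1 if 2*a + b + c >= 2 else 0)
    print("F = A + BC verified")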

2–10 We assume that input A has the veto power. The truth table is shown below:

A B C   F
0 0 0   0
0 0 1   0
0 1 0   0
0 1 1   0
1 0 0   0
1 0 1   1
1 1 0   1
1 1 1   1

The sum-of-products expression for F can be simplified by replicating the term ABC as shown below:

F = AB'C + ABC' + ABC
  = AB'C + ABC + ABC' + ABC
  = AC + AB
  = A(B + C)

    You can also use the Karnaugh map method to derive the same logical expression.

    The following logic circuit implements this function:

[Figure: implementation of F = A(B + C) using an OR gate and an AND gate.]

2–11 (a) x x = x. Let us start with x and show that it is equivalent to x x.

x = x 1              (Identity)
  = x (x + x')       (Complement)
  = (x x) + (x x')   (Distribution)
  = (x x) + 0        (Complement)
  = x x              (Identity)

(b) x + x = x. Let us start with x and show that it is equivalent to x + x (very similar to the last exercise).

x = x + 0             (Identity)
  = x + (x x')        (Complement)
  = (x + x) (x + x')  (Distribution)
  = (x + x) 1         (Complement)
  = x + x             (Identity)

(c) x 0 = 0. As in the previous examples, we start with the right-hand side (0) and show that it is equivalent to x 0.

0 = x x'             (Complement)
  = x (x' + 0)       (Identity)
  = (x x') + (x 0)   (Distribution)
  = 0 + (x 0)        (Complement)
  = x 0              (Identity)

(d) x + 1 = 1. This is the dual of the last exercise.

1 = x + x'            (Complement)
  = x + (x' 1)        (Identity)
  = (x + x') (x + 1)  (Distribution)
  = 1 (x + 1)         (Complement)
  = x + 1             (Identity)

2–12 We have to show (x'y) (x + y') = 0 and (x'y) + (x + y') = 1.

(x'y) (x + y') = x'yx + x'yy'
               = 0 + 0
               = 0

(x'y) + (x + y') = x'y + x(y + y') + y'(x + x')
                 = x'y + xy + xy' + y'x + y'x'
                 = (x'y + xy) + (xy' + x'y')
                 = y + y'
                 = 1

2–13 We have to show (x' + y) (x y') = 0 and (x' + y) + (x y') = 1.

(x' + y) (x y') = x'xy' + yxy'
                = 0 + 0
                = 0

(x' + y) + (x y') = x'(y + y') + y(x + x') + xy'
                  = x'y + x'y' + xy + x'y + xy'
                  = x'(y + y') + x(y + y')
                  = x' + x
                  = 1

2–14 AND version: (A B C)' = A' + B' + C'. The truth table below verifies the AND version.

A B C   (A B C)'   A' + B' + C'
0 0 0   1          1
0 0 1   1          1
0 1 0   1          1
0 1 1   1          1
1 0 0   1          1
1 0 1   1          1
1 1 0   1          1
1 1 1   0          0

OR version: (A + B + C)' = A' B' C'. The truth table below verifies the OR version.

A B C   (A + B + C)'   A' B' C'
0 0 0   1              1
0 0 1   0              0
0 1 0   0              0
0 1 1   0              0
1 0 0   0              0
1 0 1   0              0
1 1 0   0              0
1 1 1   0              0

2–15 From the 3-input NAND gate shown in Figure 2.23b, we can see that each additional input needs an inverter and a 2-input NAND gate. Since we implement the inverter with a 2-input NAND gate as well, we need two 2-input NAND gates for each additional input. Thus, for an n-input NAND gate, we need

1 + 2(n − 2)

2-input NAND gates. To build an 8-input NAND gate we need

1 + 2(8 − 2) = 13 gates

    Since there are four gates in the 7400 chip, we need four 7400 chips.
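As a quick check of the gate count (Python; the 7400 provides four 2-input NAND gates per chip):

    import math

    def nand2_gates(n):              # 2-input NANDs needed for an n-input NAND
        return 1 + 2 * (n - 2)

    gates = nand2_gates(8)
    print(gates)                     # 13
    print(math.ceil(gates / 4))      # 4 chips of 7400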

2–16 (a)

(x + y)' (x + y) = (x' y') (x + y)   (de Morgan's law)
                 = 0

(b)

x + x'y = x(1 + y) + x'y
        = x + xy + x'y
        = x + y(x + x')
        = x + y

(c)

(AB)' (A'B')' = (A' + B')(A + B)
              = A'A + A'B + B'A + B'B
              = A'B + AB'

2–17 Truth table:

A B C   F
0 0 0   1
0 0 1   0
0 1 0   0
0 1 1   1
1 0 0   0
1 0 1   1
1 1 0   1
1 1 1   0

Sum-of-products form:

A'B'C' + A'BC + AB'C + ABC'

Product-of-sums form:

(A + B + C')(A + B' + C)(A' + B + C)(A' + B' + C')

2–18 We start with the product-of-sums expression and derive the sum-of-products expression.

(A + B + C')(A + B' + C)(A' + B + C)(A' + B' + C')
= [A + (B + C')(B' + C)] [A' + (B + C)(B' + C')]
= (A + BC + B'C')(A' + BC' + B'C)
= A'BC + A'B'C' + ABC' + AB'C

2–19 The logic expression for Figure 2.10a is (A B)'. The logic expression for Figure 2.10b is A' + B'. We show that this expression is equivalent to (A B)':

A' + B' = (A B)'   (de Morgan's law)

2–20 We start with the product-of-sums expression and derive the sum-of-products expression.

(A + B + C)(A + B' + C')(A' + B + C')(A' + B' + C)
= [A + (B + C)(B' + C')] [A' + (B + C')(B' + C)]
= (A + BC' + B'C)(A' + BC + B'C')
= A'BC' + A'B'C + ABC + AB'C'

2–21 We start with the product-of-sums expression and derive the sum-of-products expression.

(A + B + C')(A + B' + C)(A' + B + C)(A' + B' + C')
= [A + (B + C')(B' + C)] [A' + (B + C)(B' + C')]
= (A + BC + B'C')(A' + BC' + B'C)
= A'BC + A'B'C' + ABC' + AB'C

2–22 Replace the exercise in the book by the following: Using Boolean algebra, show that the following two expressions are equivalent:

ABC + A'CD + AB'C + A'BD + A'BC + AC' + A'BCD

A + BD + CD + BC + BCD

Solution:

ABC + AB'C + A'CD + A'BC + A'BD + AC' + A'BCD
= AC(B + B') + A'CD + A'BC + A'BD + AC' + A'BCD
= (AC + AC') + A'CD + A'BC + A'BD + A'BCD
= A + A'CD + A'BC + A'BD + A'BCD
= A + CD + BC + BD + BCD      (using x + x'y = x + y)

2–23 The logic circuit is shown below:

2–24 We need a 7-input XOR gate to derive the parity bit. We can construct a 7-input XOR using 2-input XOR gates as shown below:

[Figure: 7-input XOR built from a tree of 2-input XOR gates; inputs A0–A6, output P.]

We need to add an inverter at the output to generate the odd parity bit.
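The XOR tree computes even parity; inverting gives odd parity. A sketch (Python; the data bits are an arbitrary example):

    from functools import reduce

    def even_parity(bits):           # XOR of all inputs (the XOR tree's output P)
        return reduce(lambda x, y: x ^ y, bits)

    def odd_parity(bits):            # inverter added at the output
        return 1 - even_parity(bits)

    data = [1, 0, 1, 1, 0, 0, 1]     # A0..A6
    print(even_parity(data), odd_parity(data))   # 0 1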

2–25

    B D + A C D + A B D = B D (1 + A C) + A C D (B + B) + A B D

    = B D + B D A C + A B C D + A B C D + A B D

    = B D + A B C (D + D) + A B D (1 + C)

    = B D + A B C + A B D

2–26 The truth table is shown below:

A B C   F
0 0 0   1
0 0 1   0
0 1 0   1
0 1 1   0
1 0 0   1
1 0 1   0
1 1 0   1
1 1 1   0

A'B'C' + A'BC' + AB'C' + ABC' = A'C'(B + B') + AC'(B + B')
                              = C'(A' + A)
                              = C'

    Clearly, we just need one inverter to implement this simplified logical expression.

2–27 The truth table is shown below:

A B C D   F
0 0 0 0   1
0 0 0 1   0
0 0 1 0   0
0 0 1 1   0
0 1 0 0   1
0 1 0 1   0
0 1 1 0   0
0 1 1 1   0
1 0 0 0   1
1 0 0 1   0
1 0 1 0   0
1 0 1 1   0
1 1 0 0   1
1 1 0 1   0
1 1 1 0   0
1 1 1 1   0

    From the following Karnaugh map

[Karnaugh map for F]

we get the simplified logical expression as C'D'. We just need a single NOR gate to implement this.

2–28 The following table finds the prime implicants (a check mark indicates a term that combines with another term):

Column 1         Column 2       Column 3
A'B'C'D'  (x)    A'C'D'  (x)    C'D'
A'BC'D'   (x)    B'C'D'  (x)
AB'C'D'   (x)    BC'D'   (x)
ABC'D'    (x)    AC'D'   (x)

There is no need for Step 2. The simplified expression is C'D'.

2–29 The truth table is shown below:

A B C D   F
0 0 0 0   0
0 0 0 1   0
0 0 1 0   0
0 0 1 1   0
0 1 0 0   0
0 1 0 1   1
0 1 1 0   1
0 1 1 1   1
1 0 0 0   1
1 0 0 1   1
1 0 1 0   1
1 0 1 1   0
1 1 0 0   0
1 1 0 1   0
1 1 1 0   0
1 1 1 1   0

    From the following Karnaugh map

[Karnaugh map for F]

    we derive the simplified expression as

A'BD + A'BC + AB'C' + AB'D' = A'B(C + D) + AB'(C' + D')

    The following circuit implements this logic expression:

[Figure: implementation of F = A'B(C + D) + AB'(C' + D').]

2–30 The following table finds the prime implicants (a check mark indicates a term that combines with another term):

Column 1        Column 2
A'BC'D   (x)    A'BD
A'BCD'   (x)    A'BC
A'BCD    (x)    AB'D'
AB'C'D'  (x)    AB'C'
AB'C'D   (x)
AB'CD'   (x)

Step 2 (N marks a minterm covered by only one prime implicant):

Prime        Input product terms
implicants   A'BC'D  A'BCD'  A'BCD  AB'C'D'  AB'C'D  AB'CD'
A'BD         N               x
A'BC                 N       x
AB'D'                               x                N
AB'C'                               x        N

The minimal expression is

A'BD + A'BC + AB'D' + AB'C' = A'B(C + D) + AB'(C' + D')

2–31 The truth table is shown below:

A B C D   F
0 0 0 0   0
0 0 0 1   0
0 0 1 0   1
0 0 1 1   0
0 1 0 0   1
0 1 0 1   1
0 1 1 0   0
0 1 1 1   0
1 0 0 0   0
1 0 0 1   0
1 0 1 0   0
1 0 1 1   1
1 1 0 0   1
1 1 0 1   0
1 1 1 0   1
1 1 1 1   0

    From the following Karnaugh map

[Karnaugh map for F]

    we derive the following simplified logic expression:

A'B'CD' + AB'CD + A'BC' + ABD' = B'C(A'D' + AD) + A'BC' + ABD'

    An implementation of this logic expression is shown below:

[Figure: implementation of F = B'C(A'D' + AD) + A'BC' + ABD'.]

2–32 The following table finds the prime implicants (a check mark indicates a term that combines with another term; the two unmarked minterms do not combine and are themselves prime implicants):

Column 1        Column 2
A'B'CD'         A'BC'
A'BC'D'  (x)    BC'D'
A'BC'D   (x)    ABD'
AB'CD
ABC'D'   (x)
ABCD'    (x)

Step 2 (N marks a minterm covered by only one prime implicant):

Prime        Input product terms
implicants   A'B'CD'  A'BC'D'  A'BC'D  AB'CD  ABC'D'  ABCD'
A'B'CD'      N
A'BC'                 x        N
BC'D'                 x                       x
ABD'                                          x       N
AB'CD                                  N

Since BC'D' covers no minterm uniquely, it is not needed. We derive the following simplified logic expression:

A'B'CD' + AB'CD + A'BC' + ABD' = B'C(A'D' + AD) + A'BC' + ABD'

  • Chapter 4

    Sequential Logic Circuits

4–1 The defining characteristic of a combinational circuit is that its output depends only on the current inputs applied to the circuit. The output of a sequential circuit, on the other hand, depends both on the current input values and on the past inputs. This dependence on past inputs gives the property of memory to sequential circuits.

4–2 The sequence of past inputs is encoded into a set of state variables. The feedback circuit stores this state information and feeds it back to the input.

4–3 When the propagation delay is zero, theoretically, signals at the input and output of the inverter change at the same time. This means the output of the AND gate is always zero.

4–4 When S and R are 1, both outputs are forced to 0. To see why this combination is undesirable, consider what happens when the S and R inputs are changed from S = R = 1 to S = R = 0. It is only in theory that we can assume that both inputs change simultaneously. In practice, there is always some finite time difference between the two signal changes. If the S input goes low earlier than the R signal, the sequence of input changes is SR = 11 → 01 → 00. Because of the intermediate state SR = 01, the output will be Q = 0 and Q' = 1. If, on the other hand, the R signal goes low before the S signal does, the sequence of input changes is SR = 11 → 10 → 00. Because the transition goes through the SR = 10 intermediate state, the output will be Q = 1 and Q' = 0. Thus, when the input changes from 11 to 00, the output is indeterminate. This is the reason for avoiding this state.
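The indeterminacy can be reproduced with a small gate-level simulation (a Python sketch, not from the book; the latch is the cross-coupled NOR pair with Q = NOR(R, Q') and Q' = NOR(S, Q)):

    def nor(a, b):
        return int(not (a or b))

    def settle(s, r, q, qb):
        for _ in range(10):               # iterate until the latch stabilizes
            q, qb = nor(r, qb), nor(s, q)
        return q, qb

    for seq in ([(1, 1), (0, 1), (0, 0)],     # S falls first: 11 -> 01 -> 00
                [(1, 1), (1, 0), (0, 0)]):    # R falls first: 11 -> 10 -> 00
        q, qb = 0, 1
        for s, r in seq:
            q, qb = settle(s, r, q, qb)
        print(seq, "-> Q =", q)   # Q = 0 in the first case, Q = 1 in the second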

4–5 The truth table is shown below:

S R   Qn+1
0 0   1 (both Q and Q' are forced to 1)
0 1   1
1 0   0
1 1   Qn

It can be seen from this truth table that it is not exactly the same as the one given for the NOR gate version. However, it is closely related in the sense that it is the dual of the other truth table.

4–6 The D latch avoids the SR = 11 input combination by using a single inverter to provide only complementary inputs at the S and R inputs of the clocked SR latch, as shown below:

[Figure: D latch built from a clocked SR latch; the D input drives S directly and R through an inverter, with clock CP and outputs Q and Q'.]

4–7 Flip-flops are edge-triggered devices, whereas latches are level sensitive.

4–8 The circuit is shown below:

[Figure: four-bit ripple counter built from JK flip-flops FF0–FF3 (outputs Q0–Q3), with the J and K inputs held high and a common Reset signal.]

4–9 The circuit is shown below:

[Figure: counter circuit built from five JK flip-flops (outputs Q0–Q4), with the J and K inputs held high.]

4–10 The circuit is shown below:

[Figure: counter circuit built from four JK flip-flops (A, B, C, and D), with the J and K inputs held high.]

4–11 We need four JK flip-flops to implement this four-bit counter. The design table is shown below:

Present state   Next state    JK flip-flop inputs
A B C D         A B C D       JA KA  JB KB  JC KC  JD KD
0 0 0 0         0 0 1 0       0  d   0  d   1  d   0  d
0 0 0 1         d d d d       d  d   d  d   d  d   d  d
0 0 1 0         0 1 0 0       0  d   1  d   d  1   0  d
0 0 1 1         d d d d       d  d   d  d   d  d   d  d
0 1 0 0         0 1 1 0       0  d   d  0   1  d   0  d
0 1 0 1         d d d d       d  d   d  d   d  d   d  d
0 1 1 0         1 0 0 0       1  d   d  1   d  1   0  d
0 1 1 1         d d d d       d  d   d  d   d  d   d  d
1 0 0 0         1 0 1 0       d  0   0  d   1  d   0  d
1 0 0 1         d d d d       d  d   d  d   d  d   d  d
1 0 1 0         1 1 0 0       d  0   1  d   d  1   0  d
1 0 1 1         d d d d       d  d   d  d   d  d   d  d
1 1 0 0         1 1 1 0       d  0   d  0   1  d   0  d
1 1 0 1         d d d d       d  d   d  d   d  d   d  d
1 1 1 0         0 0 0 0       d  1   d  1   d  1   0  d
1 1 1 1         d d d d       d  d   d  d   d  d   d  d

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs as follows:

JA = BC; KA = BC
JB = C;  KB = C
JC = 1;  KC = 1
JD = 0;  KD = d

Notice that the D flip-flop is not necessary, as its output is always 0. The circuit is shown below:

[Figure: the counter circuit using JK flip-flops A, B, and C; flip-flop D is omitted.]

4–12 We need three JK flip-flops to implement this counter. The design table is shown below:

Present state   Next state   JK flip-flop inputs
A B C           A B C        JA KA  JB KB  JC KC

    0 0 0 0 0 1 0 d 0 d 1 d

    0 0 1 0 1 1 0 d 1 d d 0

    0 1 0 1 1 0 1 d d 0 0 d

    0 1 1 0 1 0 0 d d 0 d 1

    1 0 0 0 0 0 d 1 0 d 0 d

    1 0 1 1 0 0 d 0 0 d d 1

    1 1 0 1 1 1 d 0 d 0 1 d

    1 1 1 1 0 1 d 0 d 1 d 0

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs as follows:

JA = BC'; KA = B'C'
JB = A'C; KB = AC
JC = A'B' + AB; KC = A'B + AB'

The circuit is shown below:

[Figure: the counter circuit using JK flip-flops A, B, and C.]
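The derived J/K expressions can be sanity-checked by simulating the three flip-flops (a Python sketch, not from the book; Q+ = JQ' + K'Q is the JK next-state relation):

    def jk(q, j, k):                  # JK flip-flop next state: Q+ = J~Q + ~K Q
        return (j & ~q | ~k & q) & 1

    a = b = c = 0
    for _ in range(8):
        print(a, b, c)
        ja, ka = b & ~c & 1, ~b & ~c & 1   # JA = BC',  KA = B'C'
        jb, kb = ~a & c & 1, a & c         # JB = A'C,  KB = AC
        jc = (~a & ~b | a & b) & 1         # JC = A'B' + AB
        kc = (~a & b | a & ~b) & 1         # KC = A'B + AB'
        a, b, c = jk(a, ja, ka), jk(b, jb, kb), jk(c, jc, kc)
    # prints 000, 001, 011, 010, 110, 111, 101, 100 and then repeats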

4–13 The state table is shown below:

                 Next state      Output
Present state    X = 0   X = 1   X = 0   X = 1
S0               S0      S1      0       1
S1               S1      S0      1       0

    Simple state assignment: S0 = 0 and S1 = 1.

Present state   Input   Next state   Output   JK flip-flop inputs
A               X       A            Y        JA KA
0               0       0            0        0  d
0               1       1            1        1  d
1               0       1            1        d  0
1               1       0            0        d  1

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs and the output Y as follows:

JA = X; KA = X
Y = A'X + AX'

    The circuit is shown below:

[Figure: circuit with one JK flip-flop (A); X drives both J and K, and the output is Y = A'X + AX'.]

4–14 The Karnaugh map for the state assignment is shown below:

[Karnaugh map used for the state assignment of states S0–S6.]

    Final state assignment is shown below:

    State A B C

    S0 = 0 0 0

    S1 = 1 0 0

    S2 = 1 1 0

    S3 = 0 0 1

    S4 = 1 1 1

    S5 = 0 1 0

    S6 = 1 0 1

    The design table is shown below:

Present state        Next state        JK flip-flop inputs
A B C  X             A B C  Y          JA KA  JB KB  JC KC

    0 0 0 0 1 1 0 0 1 d 1 d 0 d

    0 0 0 1 1 0 0 0 1 d 0 d 0 d

    0 0 1 0 1 1 1 0 1 d 1 d d 0

    0 0 1 1 0 0 1 0 0 d 0 d d 0

    0 1 0 0 1 1 1 1 1 d d 0 1 d

    0 1 0 1 0 0 1 0 0 d d 1 1 d

    1 0 0 0 1 1 1 0 d 0 1 d 1 d

    1 0 0 1 0 0 1 0 d 1 0 d 1 d

    1 0 1 0 1 0 1 0 d 0 0 d d 0

    1 0 1 1 0 1 0 1 d 1 1 d d 1

    1 1 0 0 1 0 1 0 d 0 d 1 1 d

    1 1 0 1 0 1 0 0 d 1 d 0 0 d

    1 1 1 0 1 0 1 1 d 0 d 1 d 0

    1 1 1 1 0 1 0 0 d 1 d 0 d 1

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs as follows:

JA = X' + B'C'; KA = X
JB = C'X' + A'X' + ACX; KB = A'X + AX'
JC = A'B + AB' + AX'; KC = AX

The Y output logical expression is:

Y = AB'CX + BCX' + A'BX'

It is straightforward to complete the solution using these expressions (similar to what is shown in Figures 4.27 and 4.28).

4–15 We can use the same circuit; all we have to do is invert the input.

4–16 The state diagram is shown below:

You can see from this state diagram that the design remains the same as that for the pattern recognition example on page 134 (see Example 2). However, we need to modify the output Y. In the Y column in Table 4.8, the last two 1s should be zero. This gives us the following expression for the only 1 in that column: Y = A B C X. The implementation is as shown in Figure 4.28 (substitute the following circuit for the Y logic circuit given in Figure 4.8b):

[Figure: AND gate producing Y from inputs A, B, C, and X.]

4–17 The state diagram is shown below:

[State diagram: states S0–S2 with edges labeled X/Y.]

The state table is:

                 Next state      Output
Present state    X = 0   X = 1   X = 0   X = 1
S0               S1      S0      0       0
S1               S1      S2      0       0
S2               S0      S0      1       0

    Heuristic 1 groupings: (S0, S1) (S0, S2)

    Heuristic 2 groupings: (S0, S1) (S0, S2)

    These groupings suggest the following state assignment:

[Karnaugh map for the state assignment: S0 at AB = 00, S1 at AB = 01, S2 at AB = 10.]

    Final state assignment is shown below:

    State A B

    S0 = 0 0

    S1 = 0 1

    S2 = 1 0

    The design table is shown below:

Present state   Input   Next state   Output   JK flip-flop inputs
A B             X       A B          Y        JA KA  JB KB

    0 0 0 0 1 0 0 d 1 d

    0 0 1 0 0 0 0 d 0 d

    0 1 0 0 1 0 0 d d 0

    0 1 1 1 0 0 1 d d 1

    1 0 0 0 0 1 d 1 0 d

    1 0 1 0 0 0 d 1 0 d

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs as follows:

JA = BX; KA = 1
JB = A'X'; KB = X

The Y output logical expression is:

Y = AX'

    The implementation is shown below:

[Figure: implementation using two JK flip-flops (A and B), with input X, clock, and output Y = AX'.]

4–18 The state diagram is shown below:

[State diagram: states S0–S5 with edges labeled XY/Z.]

    The state table is shown below:

                 Next state                  Output Z
Present state    XY = 00  XY = 01  XY = 10   XY = 00  XY = 01  XY = 10
S0               S0       S1       S2        0        0        0
S1               S1       S2       S3        0        0        0
S2               S2       S3       S4        0        0        0
S3               S3       S4       S5        0        0        0
S4               S4       S5       S5        0        1        1
S5               S5       S1       S2        0        0        0

Heuristic 1 groupings: (S0, S5)2 (S3, S4)
Heuristic 2 groupings: (S1, S2)3 (S2, S3)2 (S4, S5)2 (S3, S4)2

    Heuristic 3 groupings: None

    These groupings suggest the following state assignment:

[Karnaugh map used for the state assignment.]

    Final state assignment is shown below:

    State A B C

    S0 = 0 0 0

    S1 = 1 1 1

    S2 = 0 1 1

    S3 = 0 0 1

    S4 = 1 0 1

    S5 = 1 0 0

    The design table is shown below:

Present state        Next state        JK flip-flop inputs
A B C  XY            A B C  Z          JA KA  JB KB  JC KC

    0 0 0 00 0 0 0 0 0 d 0 d 0 d

    0 0 0 01 1 1 1 0 1 d 1 d 1 d

    0 0 0 10 0 1 1 0 0 d 1 d 1 d

    0 0 1 00 0 0 1 0 0 d 0 d d 0

    0 0 1 01 1 0 1 0 1 d 0 d d 0

    0 0 1 10 1 0 0 0 1 d 0 d d 1

    0 1 1 00 0 1 1 0 0 d d 0 d 0

    0 1 1 01 0 0 1 0 0 d d 1 d 0

    0 1 1 10 1 0 1 0 1 d d 1 d 0

    1 0 0 00 1 0 0 0 d 0 0 d 0 d

    1 0 0 01 1 1 1 0 d 0 1 d 1 d

    1 0 0 10 0 1 1 0 d 1 1 d 1 d

    1 0 1 00 1 0 1 0 d 0 0 d d 0

    1 0 1 01 1 0 0 1 d 0 0 d d 1

    1 0 1 10 1 0 0 1 d 0 0 d d 1

    1 1 1 00 1 1 1 0 d 0 d 0 d 0

    1 1 1 01 0 1 1 0 d 1 d 0 d 0

    1 1 1 10 0 0 1 0 d 1 d 1 d 0

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs as follows:

JA = B'Y + CX; KA = BX + BY + C'X
JB = C'X + C'Y; KB = X + A'Y
JC = X + Y; KC = B'CXY' + AB'CX'Y

The Z output logical expression is:

Z = AB'C(X'Y + XY')

It is straightforward to complete the solution using these expressions (similar to what is shown in Figures 4.27 and 4.28).

4–19 The state diagram is shown below:

[State diagram: states S0–S6 with edges labeled XY/CZ.]

Note that the output is represented by two bits: CZ. The C bit indicates "change due" and the Z bit indicates activation of the selection circuit (as in the last exercise). The state table is shown below:

                 Next state                  Output CZ
Present state    XY = 00  XY = 01  XY = 10   XY = 00  XY = 01  XY = 10
S0               S0       S1       S2        00       00       00
S1               S1       S2       S3        00       00       00
S2               S2       S3       S4        00       00       00
S3               S3       S4       S5        00       00       00
S4               S4       S5       S5        00       01       11
S5               S5       S1       S2        00       00       00
S6               S6       S1       S2        00       00       00

Heuristic 1 groupings: (S0, S5, S6)2 (S3, S4)
Heuristic 2 groupings: (S1, S2)4 (S2, S3)2 (S4, S5)2 (S3, S4)2

Heuristic 3 groupings: None

    These groupings suggest the following state assignment:

[Karnaugh map used for the state assignment.]

    Final state assignment is shown below:

    State A B C

    S0 = 0 0 0

    S1 = 1 1 1

    S2 = 0 1 1

    S3 = 0 0 1

    S4 = 1 0 1

    S5 = 1 0 0

    S6 = 1 1 0

    The design table is shown below:

Present state        Next state         JK flip-flop inputs
A B C  XY            A B C  CZ          JA KA  JB KB  JC KC

    0 0 0 00 0 0 0 00 0 d 0 d 0 d

    0 0 0 01 1 1 1 00 1 d 1 d 1 d

    0 0 0 10 0 1 1 00 0 d 1 d 1 d

    0 0 1 00 0 0 1 00 0 d 0 d d 0

    0 0 1 01 1 0 1 00 1 d 0 d d 0

    0 0 1 10 1 0 0 00 1 d 0 d d 1

    0 1 1 00 0 1 1 00 0 d d 0 d 0

    0 1 1 01 0 0 1 00 0 d d 1 d 0

    0 1 1 10 1 0 1 00 1 d d 1 d 0

    1 0 0 00 1 0 0 00 d 0 0 d 0 d

    1 0 0 01 1 1 1 00 d 0 1 d 1 d

    1 0 0 10 0 1 1 00 d 1 1 d 1 d

    1 0 1 00 1 0 1 00 d 0 0 d d 0

    1 0 1 01 1 0 0 01 d 0 0 d d 1

    1 0 1 10 1 0 0 11 d 0 0 d d 1

    1 1 0 00 1 1 0 00 d 0 d 0 0 d

    1 1 0 01 1 1 1 00 d 0 d 0 1 d

    1 1 0 10 0 1 1 00 d 1 d 0 1 d

    1 1 1 00 1 1 1 00 d 0 d 0 d 0

    1 1 1 01 0 1 1 00 d 1 d 0 d 0

    1 1 1 10 0 0 1 00 d 1 d 1 d 0

Using the Karnaugh map method, we can get the simplified logical expressions for the J and K inputs as follows:

JA = B'Y + CX; KA = BX + BCY + C'X
JB = C'X + C'Y; KB = CX + A'Y
JC = X + Y; KC = B'CXY' + AB'CX'Y

The C and Z output logical expressions are:

C = AB'CXY'
Z = AB'C(X'Y + XY')

It is straightforward to complete the solution using these expressions (similar to what is shown in Figures 4.27 and 4.28).

  • Chapter 5

    System Buses

5–1 A bus transaction is a sequence of actions to complete a well-defined activity. Some examples of such activities are memory read, memory write, I/O read, burst read, and so on.

5–2 A bus transaction may perform one or more bus operations. For example, a Pentium burst read transfers four words. Thus this bus transaction consists of four memory read operations.

5–3 Address bus width determines the memory-addressing capacity of the system. Typically, 32-bit processors such as the Pentium use 32-bit addresses, and 64-bit processors use 64-bit addresses.

5–4 System performance improves with a wider data bus, as we can move more bytes in parallel. Thus, higher data bus width increases the data transfer rate. For example, the Pentium uses a 64-bit data bus, whereas the Itanium uses a 128-bit data bus. Therefore, if all other parameters are the same, the Itanium has double the bandwidth of the Pentium processor.

5–5 As the name implies, dedicated buses have separate buses dedicated to carrying data and address information. For example, a 64-bit processor with 64 data and address lines requires 128 pins just for these two buses. If we want to move 128 bits of data like the Itanium, we need 192 pins! The obvious advantage of these designs is the performance we can get out of them. To reduce the cost of such systems, we might use multiplexed bus designs, in which buses are not dedicated to a function. Instead, both data and address information is time multiplexed on a shared bus. Multiplexed bus designs reduce the cost, but they also reduce the system performance.

5–6 In synchronous buses, a bus clock signal provides the timing information for all actions on the bus. Changes in other signals are relative to the falling or rising edge of the clock. In asynchronous buses, there is no clock signal. Instead, they use four-way handshaking to perform a bus transaction. Asynchronous buses allow more flexibility in timing. The main advantage of asynchronous buses is that they eliminate the dependence on the bus clock. However, synchronous buses are easier to implement, as they do not use handshaking. Almost all system buses are synchronous, partly for historical reasons. In the early days, the difference between the speeds of various devices was not as great as it is now. Since synchronous buses are simpler to implement, designers chose them.

5–7 In synchronous buses, all timing must be a multiple of the bus clock. For example, if memory requires slightly more time than the default amount, we have to add a complete bus cycle. Of course, we can increase the bus frequency to counter this problem. But that introduces problems such as bus skew, increased power consumption, the need for faster circuits, and so on. Thus, choosing an appropriate bus clock frequency is very important for synchronous buses. In determining the clock frequency, we have to consider all devices that will be attached to the bus. When these devices are heterogeneous in the sense that their operating speeds are different, we have to operate the bus at the speed of the slowest device in the system.

5–8 The READY signal is needed in synchronous buses because the default timing allowed by the processor is sometimes insufficient for a slow device to respond. The processor samples the READY line when it expects data on the data bus. It reads the data on the data bus only if the READY signal is active; otherwise, it waits. Thus, slower devices can use this line to indicate that they need more time to respond to the request.

For example, in a memory read cycle, if we have a slow memory it may not be able to supply data when the processor expects it. In this case, the processor should not presume that whatever is present on the data bus is the actual data supplied by memory. That is why the processor always reads the value of the READY line to see if the memory has actually placed the data on the data bus. If this line is inactive (indicating not-ready status), the processor waits one more cycle and samples the READY line again. The processor inserts wait states as long as the READY line is inactive. Once this line is active, it reads the data and terminates the read cycle.

Asynchronous buses do not require the READY signal, as they use handshaking to perform a bus transaction (i.e., there is no default timing as in the synchronous buses).

5–9 Processors provide block transfer operations that read or write several contiguous locations of a memory block. Such block transfers are more efficient than transferring each individual word. The cache line fill is an example that requires reading several bytes of contiguous memory locations. Data movement between the cache and main memory is in units of the cache line size. If the cache line size is 32 bytes, each cache line fill requires 32 bytes of data from memory. The Pentium uses 32-byte cache lines. It provides a block transfer operation that transfers four 64-bit data items from memory. Thus, by using this block transfer, we can fill a 32-byte cache line. Without block transfer, we would need four memory read cycles to fill a cache line, which takes more time than the block transfer operation.

5–10 The main disadvantage of the static mechanism is that its bus allocation follows a predetermined pattern rather than the actual need. In this mechanism, a master may be given the bus even if it does not need it. This kind of allocation leads to inefficient use of the bus. This inefficiency is avoided by dynamic bus arbitration, which uses a demand-driven allocation scheme.

5–11 A fair allocation policy does not allow starvation. Fairness can be defined in several ways. For example, fairness can be defined to handle bus requests within a priority class, or requests from several priority classes. Some examples of fairness are: (i) all bus requests in a predefined window must be satisfied before granting requests from the next window; (ii) a bus request should not be pending for more than M milliseconds. For example, in the PCI bus, we can specify fairness by indicating the maximum delay to grant a request.

5–12 A potential disadvantage of the nonpreemptive policies is that a bus master may hold the bus for a long time, depending on the transaction type. For example, long block transfers can hold the bus for extended periods of time. This may cause problems for some types of services where the bus is needed immediately. Preemptive policies force the current master to release the bus without completing its current bus transaction.

5–13 A drawback of the transaction-based release policy is that if there is only one master that is requesting the bus most of the time, we unnecessarily incur arbitration overhead for each bus transaction. This is typically the case in single-processor systems. In these systems, the CPU uses the bus most of the time; DMA requests are relatively infrequent. In demand-based release, the current master releases the bus only if there is a request from another bus master; otherwise, it continues to use the bus. Typically, this check is done at the completion of each transaction. This policy leads to more efficient use of the bus.

5–14 Centralized implementations suffer from single-point failures due to the presence of the central arbiter. This causes two main problems:

1. If the central arbiter fails, there will be no arbitration.

2. The central arbiter can become a bottleneck, limiting the performance of the whole system.

The distributed implementation avoids these problems. However, the arbitration logic has to be distributed among the masters. In contrast, in the centralized organization, the bus masters don't have the arbitration logic.

5–15 The daisy-chaining scheme has three potential problems:

1. It implements a fixed-priority policy, which can lead to starvation problems. From our discussion, it should be clear that a master closer to the arbiter (in the chain) has a higher priority.

2. The bus arbitration time varies and is proportional to the number of masters. The reason is that the grant signal has to propagate from master to master. If each master takes T time units to propagate the bus grant signal from its input to output, a master that is in the ith position in the chain would experience a delay of (i − 1)T time units before receiving the grant signal.

3. This scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the master down the chain.

5–16 The hybrid scheme limits the three potential problems associated with the daisy-chaining scheme, as explained below:

1. The hybrid scheme limits the fixed-priority policy of the daisy-chaining scheme to a class.

2. In the daisy-chaining scheme, the bus arbitration time varies and is proportional to the number of masters. The hybrid scheme limits this delay, as the class size is small.

3. The daisy-chaining scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the master down the chain. However, in the hybrid scheme, this problem is limited to the class in which the node failure occurs.

The independent request/grant lines scheme rectifies the problems associated with the daisy-chaining scheme. However, it is expensive to implement. The hybrid scheme reduces the cost by applying this scheme at the class level.

5–17 The ISA bus was closely associated with the system bus used by the IBM PC. It operates at an 8.33 MHz clock with a maximum bandwidth of about 8 MB/s. This bandwidth was sufficient at that time, as memories were slower and there were no multimedia, windowed GUIs, and the like to worry about. However, in current systems, the ISA bus can only support slow I/O devices. Even this limited use of the ISA bus is disappearing due to the presence of the USB. Current systems use the PCI bus, which is processor independent. It can provide a peak bandwidth of up to 528 MB/s.

5–18 The main reason is to save connector pins. Even with multiplexing, the 32-bit PCI uses 120-pin connectors, while the 64-bit version needs an additional 64 pins. A drawback of a multiplexed address/data bus is that it needs additional time to turn the bus around.

5–19 The four byte enable lines identify the bytes of data to be transferred. Each BE# line identifies one byte of the 32-bit data: BE0# identifies byte 0, BE1# identifies byte 1, and so on. Thus, we can specify any combination of the bytes to be transferred. The two extreme cases are: C/BE# = 0000 indicates transfer of all four bytes, and C/BE# = 1111 indicates a null data phase (no byte transfer). In a multiple data phase bus transaction, the byte enable value can be specified for each data phase. Thus, we can transfer the bytes of interest in each data phase. The extreme case of a null data phase is useful, for example, if you want to skip one or more 32-bit values in the middle of a burst data transfer. If the null data phase were not allowed, we would have to terminate the current bus transaction, request the bus again via the arbiter, and restart the transfer with a new address.
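To make the encoding concrete, here is a minimal sketch of how a target could turn the active-low C/BE# value into a byte-lane mask; the function and variable names are illustrative, not from the PCI specification:

    #include <stdint.h>
    #include <stdio.h>

    /* Decode the 4-bit active-low C/BE# value into a 32-bit lane mask:
       BEi# = 0 means byte i participates in the data phase. */
    static uint32_t byte_lane_mask(unsigned cbe) {
        uint32_t mask = 0;
        for (int i = 0; i < 4; i++)
            if (!((cbe >> i) & 1))                /* active low */
                mask |= (uint32_t)0xFF << (8 * i);
        return mask;
    }

    int main(void) {
        printf("%08x\n", (unsigned)byte_lane_mask(0x0)); /* 0000: all four bytes  */
        printf("%08x\n", (unsigned)byte_lane_mask(0xF)); /* 1111: null data phase */
        printf("%08x\n", (unsigned)byte_lane_mask(0xC)); /* 1100: bytes 0 and 1   */
        return 0;
    }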

5–20 The current bus master drives the FRAME# signal low to indicate the start of a bus transaction. This signal is also used to indicate the length of the bus transaction cycle: it is held low until the final data phase of the transaction.

5–21 PCI uses centralized bus arbitration with independent grant and request lines. Each device has separate grant (GNT#) and request (REQ#) lines connected to the central arbiter. The PCI specification does not mandate a particular arbitration policy. However, it mandates that the policy should be a fair one to avoid starvation.
A device that is not the current master can request the bus by asserting its REQ# line. The arbitration takes place while the current bus master is using the bus. When the arbiter notifies a master that it can use the bus for the next transaction, the master must wait until the current bus


master has released the bus (i.e., the bus is idle). The bus idle condition is indicated when both FRAME# and IRDY# are high.
PCI uses hidden bus arbitration in the sense that the arbiter works while another bus master is running its transaction on the PCI bus. This overlapped arbitration increases PCI bus utilization by not keeping the bus idle during arbitration.
PCI devices must request the bus for each transaction. However, a transaction may consist of an address phase and one or more data phases. For efficiency, data should be transferred in burst mode. The PCI specification has safeguards to prevent a single master from monopolizing the bus and to force a master to release the bus.

5–22 PCI allows hierarchical PCI bus systems, which are typically built using PCI-to-PCI bridges (e.g., the Intel 21152 PCI-to-PCI bridge chip). This chip connects two independent PCI buses: a primary and a secondary. The bridge improves performance for the following reasons:

1. It allows concurrent operation of the two PCI buses. For example, a master and a target on the same PCI bus can communicate while the other PCI bus is busy.

2. The bridge also provides traffic filtering, which minimizes the traffic crossing over to the other side.

Obviously, this traffic separation, along with the concurrent operation, improves overall system performance for bandwidth-hungry applications such as multimedia.

5–23 Most PCI buses tend to operate at a 33 MHz clock speed because of the serious challenges in implementing the 66 MHz design. To understand the problem, look at the timing of the two buses. The 33 MHz bus cycle of 30 ns leaves about 7 ns of setup time for the target. When we double the clock frequency, all values are cut in half, and the reduction in setup time matters: the target then has only 3 ns to respond. It is this difficulty that keeps most PCI buses at 33 MHz.
PCI-X solves this problem by using a register-to-register protocol, as opposed to the immediate protocol implemented by PCI. In the PCI-X register-to-register protocol, the signal sent by the master device is stored in a register until the next clock. Thus, the receiver has one full clock cycle to respond to the master's request. This makes it possible to increase the frequency to 133 MHz. At this frequency, one clock period corresponds to about 7.5 ns, about the same period allowed for the decode phase in the 33 MHz PCI implementation. We gain this increase in frequency by adding one additional cycle to each bus transaction; this overhead is more than compensated for by the increase in frequency.
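The timing arithmetic can be checked quickly; a minimal sketch (the 7 ns and 3 ns setup figures come from the text, the helper is illustrative):

    #include <stdio.h>

    /* Clock period in nanoseconds for a given frequency in MHz. */
    static double period_ns(double mhz) { return 1000.0 / mhz; }

    int main(void) {
        printf("33 MHz PCI:    %.1f ns cycle\n", period_ns(33.0));  /* ~30 ns  */
        printf("66 MHz PCI:    %.1f ns cycle\n", period_ns(66.0));  /* ~15 ns  */
        printf("133 MHz PCI-X: %.1f ns cycle\n", period_ns(133.0)); /* ~7.5 ns */
        return 0;
    }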

5–24 With the increasing demand for high-performance video due to applications such as 3D graphics and full-motion video, the PCI bus was reaching its performance limit. In response to these demands, Intel introduced the AGP exclusively to support high-performance 3D graphics and full-motion video applications. The AGP is not a bus in the sense that it does not connect multiple devices: it is a port that connects exactly two devices, the CPU and the video card.


To see the bandwidth demand of full-motion video, consider a 640 × 480 resolution screen. For true color, we need three bytes per pixel. Thus, each frame requires 640 * 480 * 3 = 920 KB. Full-motion video uses a frame rate of 30 frames/second. Therefore, the required bandwidth is 920 * 30/1000 = 27.6 MB/s. For a higher resolution of 1024 × 768, it goes up to 70.7 MB/s. We actually need twice this bandwidth when displaying video from hard disks or DVDs, because the data traverse the bus twice: once from the disk to the system memory and again from the memory to the graphics adapter. The 32-bit, 33 MHz PCI, with a 133 MB/s bandwidth, can barely support this data transfer rate. The 64-bit PCI can comfortably handle full-motion video, but the video data transfer then consumes half its bandwidth. Since the video unit is a specialized subsystem, there is no reason for it to be attached to a general-purpose bus like the PCI. We can solve many of the bandwidth problems by designing a special interconnect to supply the video data. By taking the video load off the PCI bus, we also improve the performance of the overall system. Intel proposed the AGP precisely for these reasons.
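The bandwidth arithmetic as a quick check (a minimal sketch; decimal megabytes, as in the text):

    #include <stdio.h>

    /* Uncompressed video bandwidth: width * height * bytes/pixel * frames/s. */
    static double bandwidth_mbs(int w, int h, int bpp, int fps) {
        return (double)w * h * bpp * fps / 1e6;
    }

    int main(void) {
        printf("640x480:  %.1f MB/s\n", bandwidth_mbs(640, 480, 3, 30));  /* 27.6 */
        printf("1024x768: %.1f MB/s\n", bandwidth_mbs(1024, 768, 3, 30)); /* ~70.8; the text rounds to 70.7 */
        return 0;
    }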

5–25 As we have seen in the text, the AGP is targeted at 3D graphical display applications, which have high memory bandwidth requirements. One of the performance enhancements the AGP uses is pipelining. AGP pipelined transfers can be interrupted by PCI transactions. This ability to intervene in a pipelined AGP transfer allows the bus master to maintain a high pipeline depth for improved performance.

5–26 The STSCHG# signal reports I/O status change information for multifunction PC cards. In a pure I/O PC card, we do not normally require this signal. However, in multifunction PC cards containing memory and I/O functions, this signal is needed to report the status signals removed from the memory interface (READY, WP, BVD1, and BVD2). A configuration register (called the pin replacement register) in the attribute memory maintains the status of the signals removed from the memory interface. For example, since the BVD signals are removed from the memory interface, this register keeps the BVD information to report the status of the battery. When a status change occurs, this signal is asserted, and the host can read the pin replacement register to get the status.

Chapter 6

Processor Organization and Performance

6–1 The three-address format gives the addresses required by most operations: two addresses for the two input operands and one address for the result. However, some processors like the Pentium compromise by using the two-address format because operands in these processors can be located in memory (leading to longer addresses). This is not a problem with modern RISC processors, as they use the load/store architecture. In these processors, most instructions find their operands in registers, and the result is also placed in a register. Since registers can be identified with a shorter address, using the three-address format does not unduly impact the instruction length. The following comparison shows the difference in instruction sizes when we use register-based versus memory-based operands. We assume that there are 32 registers and that a memory address is 32 bits long.

Memory format: 8-bit opcode + three 32-bit addresses (destination, source1, source2) = 104 bits.
Register format: 8-bit opcode + three 5-bit register fields (Rdest, Rsrc1, Rsrc2) = 23 bits.

6–2 Yes, the Pentium's use of the two-address format is justified for the following reason: operands in the Pentium can be located in memory, which implies longer addresses for these operands. Comparing the following formats, we see that we reduce the instruction length from 104 bits to 72 bits by moving from the three-address to the two-address format.



Three-address format:
Memory format: 8-bit opcode + three 32-bit addresses (destination, source1, source2) = 104 bits.
Register format: 8-bit opcode + three 5-bit register fields (Rdest, Rsrc1, Rsrc2) = 23 bits.

Two-address format:
Memory format: 8-bit opcode + two 32-bit addresses (destination, source) = 72 bits.
Register format: 8-bit opcode + two 5-bit register fields (Rdest, Rsrc) = 18 bits.

A further reason is that most instructions end up using an address twice. Here is the example we discussed in Section 6.2.1. Using the three-address format, the C statement

    A = B + C * D - E + F + A

    is converted to the following code:

mult  T,C,D    ; T = C*D
add   T,T,B    ; T = B + C*D
sub   T,T,E    ; T = B + C*D - E
add   T,T,F    ; T = B + C*D - E + F
add   A,T,A    ; A = B + C*D - E + F + A

Notice that all instructions, barring the first one, use an address twice: in the middle three instructions it is the temporary T, and in the last one it is A. This also supports using two addresses.
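The same address-reuse pattern is what C's compound assignments express; a minimal sketch with illustrative values (the mnemonics in the comments are hypothetical two-address forms):

    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6; /* illustrative values */
        int t;        /* the temporary acts as both source and destination  */
        t  = c;       /* mov t,c                           */
        t *= d;       /* mul t,d : t = C*D                 */
        t += b;       /* add t,b : t = B + C*D             */
        t -= e;       /* sub t,e : t = B + C*D - E         */
        t += f;       /* add t,f : t = B + C*D - E + F     */
        a += t;       /* add a,t : A = B + C*D - E + F + A */
        printf("%d\n", a);  /* 16 for these values */
        return 0;
    }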

6–3 In the load/store architecture, all instructions except the load and store get their operands from the registers; the results produced by these instructions also go into registers. This results in several advantages. The main ones discussed in this chapter are the following:


1. Since the operands come from the internal registers and the results are stored in registers, the load/store architecture speeds up instruction execution.

2. The load/store architecture also reduces the instruction length, as addressing registers takes far fewer bits than addressing a memory location.

3. Reduced processor complexity allows these processors to have a large number of registers, which improves performance.

There are some other advantages (such as fixed instruction length) that are discussed in Chapter 14.

6–4 In Section 6.2.5, we assumed that the stack operation (push or pop) does not require a memory access. Thus, we used two memory accesses for each push/pop instruction (one to read the instruction and the other to get the value to be pushed/popped). If the push/pop operations require a memory access, we need to add one additional memory access for each push/pop instruction. This implies 7 more memory accesses, leading to 19 + 7 = 26 memory accesses.

6–5 RISC processors use the load/store architecture, which assumes that the operands required by most instructions are in the internal registers. Load and store instructions are the only exceptions; these instructions move data between memory and registers. If we have few registers, we cannot keep the operands and results that could be used by other instructions (we will be overwriting them frequently with data from memory). This does not exploit the basic feature of the load/store architecture. If we have more registers, we can keep data longer in the registers (e.g., a result produced by an arithmetic instruction that is required by another instruction), which reduces the number of memory accesses. Otherwise, we will be reading and writing data using the load and store instructions and lose the main advantage of the load/store architecture.

6–6 In normal branch execution, shown in the figure below, when the branch instruction is executed, control is transferred to the target immediately. The Pentium, for example, uses this type of branching. In delayed branch execution, control is transferred to the target after executing the instruction that follows the branch instruction. In the figure below, before control is transferred, instruction y (shown shaded) is executed. This instruction slot is called the delay slot. For example, the SPARC uses delayed branch execution; in fact, it also uses delayed execution for procedure calls. Why does this help? By the time the processor decodes the branch instruction, the next instruction has already been fetched. Thus, instead of throwing it away, we improve efficiency by executing it. This strategy requires reordering of some instructions.


[Figure: (a) Normal branch execution: after instruction x, the jump transfers control to the target (instructions a, b, c), and the following instructions y and z are skipped. (b) Delayed branch execution: instruction y, in the delay slot after the jump, is executed before control transfers to the target.]

6–7 In the set-then-jump design, condition testing and branching are separated (for example, the Pentium uses this design). A condition code register communicates the test results to the branch instruction. The test-and-jump design, on the other hand, combines testing and branching into a single instruction.
The first design is more general purpose in the sense that all branching can be handled using this separation. The disadvantage is that two separate instructions need to be executed. For example, in the Pentium, a cmp (compare) and a conditional jump instruction are used to implement a conditional branch. Furthermore, this design needs condition code registers to carry the test result.
The test-and-jump design is useful only for certain types of branches where testing can be part of the instruction. However, there are situations where testing cannot be done as part of the branch instruction. For example, consider the overflow condition that results from an add operation. The status result of the addition must be stored in something like a condition code register or a flag for later use by a branch instruction. Processors like the MIPS, which follow the test-and-jump design, must handle such scenarios differently; the MIPS processor, for example, uses exceptions to flag these conditions.

6–8 The main advantage of storing the return address in a register is that simple procedure calls do not have to access memory. Thus, the overhead associated with a procedure invocation is reduced compared to processors like the Pentium that store the return address on the stack. However, the stack-based mechanism used by the Pentium is more general purpose in that it can handle any type of procedure call. In contrast, the register-based scheme can only handle simple procedure invocations; recursive procedures, for example, cause problems for the register-based scheme.

6–9 The size of an instruction depends on the number of addresses and whether these addresses identify registers or memory locations. Since RISC processors use register-based instructions and simple addressing modes, there is no variation in the type of information carried from instruction to instruction. This leads to fixed-size instructions.
The Pentium, which is a CISC processor, encodes instructions that vary from one byte to several bytes. Part of the reason for using variable-length instructions is that CISC processors tend to provide complex addressing modes. For example, in the Pentium, if we use register-based operands, we need just 3 bits to identify a register. On the other hand, if we use a memory-based operand, we


need up to 32 bits. In addition, if we use an immediate operand, we need a further 32 bits to encode this value into the instruction. Thus, an instruction that uses a memory address and an immediate operand needs 8 bytes just for these two components. You can see from this description that providing flexibility in specifying an operand leads to dramatic variations in instruction sizes.

6–10 There are two main reasons for this. First, allowing both operands to be in memory leads to even greater variations in instruction lengths: a register in the Pentium can typically be identified using 3 bits, whereas a memory address takes 32 bits. This complicates the encoding and decoding of instructions further. Second, no one would want to work with all memory-based operands; registers are extensively used by compilers to optimize code. By not allowing both operands in memory, inefficient code will not be executed.

6–11 If the PC and IR are not connected to the system bus, we have to move the contents of the PC to the MAR using the A bus. Similarly, the instruction read from memory is placed in the MDR register and must then be moved to the IR register. In both cases, one additional cycle is needed. This degrades processor performance. The amount of increase in overhead depends on the instruction being executed. For example, in the instruction fetch discussed in Section 6.5.2 (page 226), we need two additional cycles for the movement of data between the PC and MAR and between the MDR and IR. This accounts for an increase of 50%.

6–12 We assume that shl works on the B input of the ALU and shifts left by one bit position. To implement shl4, we need to execute shl four times, as shown in the following table:

Instruction      Step   Control signals
shl4 %G7,%G5     S1     G5out: ALU=shl: Cin;
                 S2     Cout: ALU=shl: Cin;
                 S3     Cout: ALU=shl: Cin;
                 S4     Cout: ALU=shl: Cin;
                 S5     Cout: G7in: end;

6–13 We use add to perform the multiplication by 10. Our algorithm to multiply X by 10 is given below:
X + X = 2X (store this result; we need it in the last step)
2X + 2X = 4X
4X + 4X = 8X
8X + 2X = 10X
This algorithm is implemented as shown in the following table:


Instruction      Step   Control signals
mul10 %G7,%G5    S1     G5out: Ain;
                 S2     G5out: ALU=add: Cin;
                 S3     Cout: Ain: G5in;
                 S4     Cout: ALU=add: Cin;
                 S5     Cout: Ain;
                 S6     Cout: ALU=add: Cin;
                 S7     G5out: Ain;
                 S8     Cout: ALU=add: Cin;
                 S9     Cout: G7in: end;

    As shown in this table, we need 9 cycles.
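As a quick check, the same add-only algorithm in C (a minimal sketch; overflow aside):

    #include <stdio.h>

    /* Multiply by 10 using only additions, mirroring the microprogram:
       2X = X + X (saved), 4X = 2X + 2X, 8X = 4X + 4X, 10X = 8X + 2X. */
    static unsigned mul10(unsigned x) {
        unsigned twice = x + x;          /* 2X, kept for the final step */
        unsigned four  = twice + twice;  /* 4X */
        unsigned eight = four + four;    /* 8X */
        return eight + twice;            /* 10X */
    }

    int main(void) {
        printf("%u\n", mul10(7));  /* 70 */
        return 0;
    }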

6–14 The implementation is shown below:

Instruction      Step   Control signals
mov %G7,%G5      S1     G5out: ALU=BtoC: G7in;

6–15 MIPS stands for millions of instructions per second. Although it is a simple metric, it is practically useless for expressing the performance of a system. Since instructions vary widely among processors, a simple instruction execution rate will not tell us anything about the system. For example, complex instructions take more clocks than simple instructions, so the instruction rate for complex instructions will be lower than that for simple instructions. The MIPS metric does not capture the actual work done by these instructions. It is perhaps useful in comparing various versions of processors derived from the same instruction set.

6–16 Synthetic benchmarks are programs specifically written for performance testing; the Whetstone and Dhrystone benchmarks are examples. Real benchmarks, on the other hand, use actual programs of the intended application to capture system performance. Therefore, they capture the system performance more accurately.

6–17 Whetstone is a synthetic benchmark in which performance is expressed in MWIPS, millions of Whetstone instructions per second. This benchmark is a small program, which may not measure the system performance for all applications. Another drawback of this benchmark is that it encouraged excessive optimization by compilers, distorting the performance results.

6–18 Computer systems are no longer limited to number crunching. Modern computer systems are more complex, and they run a variety of different applications (3D rendering, string processing, number crunching, and so on). Performance measured for one type of application may be


inappropriate for some other application. Thus, it is important to measure the performance of various components for different types of applications.

Chapter 7

The Pentium Processor

7–1 The main purpose of registers is to provide a scratchpad so that the processor can keep data on a temporary basis. For example, the processor may keep the procedure return address, stack pointer, instruction pointer, and so on. Registers are also used to keep data handy in order to avoid costly memory accesses. Keeping frequently accessed data in registers is a common compiler optimization technique.

7–2 The Pentium supports the following three address spaces:
1. Linear address space
2. Physical address space
3. I/O address space (from the discussion in Section 1.7)

7–3 In the segmented memory organization, memory is partitioned into segments, where each segment is a small part of the memory. In the real mode, each segment of memory is a linear contiguous sequence of up to 64 KB. In the protected mode, it can be up to 4 GB.

The Pentium supports segmentation largely to provide backward compatibility with the 8086. Note that the 8086 is a 16-bit processor with 20 address lines. This mismatch between the processor's 16-bit registers and 20-bit addresses is solved by using the segmented memory architecture. This segmented architecture has been carried over to the Pentium. However, in the protected mode, it is possible to consider the entire memory as a single segment; thus, segmentation can be completely turned off.

7–4 In the real mode, a segment is limited to 64 KB because 16 bits are used to indicate the offset value into a segment. This magic number 16 is due to the 16-bit registers of the 8086 processor. Note that the Pentium emulates the 8086 in the real mode.

7–5 In the real mode, the Pentium emulates the 8086 processor, which is a 16-bit processor (i.e., all its internal registers are 16 bits wide). Since the 8086's address bus is 20 bits wide but its internal registers are all 16 bits in size, a segment's start address is stored in a 16-bit segment register with the assumption that the least significant four bits of the 20-bit address are zeros. Thus, segments can only start at



addresses that have the least significant four bits as zero (that is, the address must be a multiple of 16). For example, a segment can start at address 16, 32, 48, and so on.

7–6 (a) Since the least significant four bits (A = 1010) are not zero, a segment cannot be placed at this address.

(b) Since the least significant four bits (5 = 0101) are not zero, a segment cannot be placed at this address.

    (c) Since the least significant four bits are zero, a segment can be placed at this address.

    (d) Since the least significant four bits are zero, a segment can be placed at this address.

7–7 In the protected mode, a segment can be up to 4 GB.

7–8 This limitation is due to the number of segment registers available in the Pentium.

7–9 This is due to the 20-bit value used to specify the segment limit. If the granularity bit is zero, the segment size is interpreted in bytes, which allows segments of at most 2^20 bytes (1 MB). To specify larger segment sizes, the granularity is increased to 4 KB. Why 4 KB? Because the difference 32 - 20 = 12 bits corresponds to 4 KB, so a 20-bit limit counted in 4 KB units spans the full 2^32-byte (4 GB) address space.
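A minimal sketch of how the 20-bit limit field scales with the granularity bit (the function name is illustrative; the limit is taken as inclusive, as on the x86):

    #include <stdint.h>
    #include <stdio.h>

    /* Effective segment size from the 20-bit limit field:
       granularity 0 -> limit counted in bytes; 1 -> in 4 KB units. */
    static uint64_t segment_bytes(uint32_t limit20, int granularity) {
        uint64_t units = (uint64_t)limit20 + 1;   /* inclusive limit */
        return granularity ? units * 4096 : units;
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)segment_bytes(0xFFFFF, 0)); /* 1 MB */
        printf("%llu\n", (unsigned long long)segment_bytes(0xFFFFF, 1)); /* 4 GB */
        return 0;
    }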

7–10 The Table Indicator (TI) bit indicates whether the local or the global descriptor table should be used:

0 = Global descriptor table,
1 = Local descriptor table.

The global descriptor table contains descriptors that are available to all tasks within the system. There is only one GDT in the system. Typically, the GDT contains code and data used by the operating system. The local descriptor table contains descriptors for a given program. There can be several LDTs, each of which may contain descriptors for code, data, stack, and so on. A program cannot access a segment unless there is a descriptor for the segment in either the current LDT or the GDT.

7–11 Both the LDT and the GDT can contain up to 2^13 = 8192 8-byte descriptors. The reason is that only 13 bits of the segment selector are used to select a segment descriptor, as shown in the following figure:


[Figure: protected-mode segment translation. The 16-bit segment selector consists of a 13-bit INDEX (bits 15-3), the TI bit (bit 2), and a 2-bit RPL field (bits 1-0). The index selects a segment descriptor (32-bit base address, limit, access rights) from the descriptor table; the base address is added to the offset to form the linear address.]
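A minimal sketch of extracting the selector fields shown in the figure (the struct and function names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* Decompose a 16-bit segment selector into its fields. */
    struct selector { unsigned index, ti, rpl; };

    static struct selector decode(uint16_t sel) {
        struct selector s;
        s.index = sel >> 3;         /* 13-bit descriptor table index */
        s.ti    = (sel >> 2) & 1;   /* 0 = GDT, 1 = LDT              */
        s.rpl   = sel & 3;          /* requested privilege level     */
        return s;
    }

    int main(void) {
        struct selector s = decode(0x001F);  /* illustrative selector value */
        printf("index=%u ti=%u rpl=%u\n", s.index, s.ti, s.rpl); /* 3 1 3 */
        return 0;
    }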

7–12 The conversion of a logical address to a physical address is straightforward. The translation process is shown below:


[Figure: real-mode address translation. The 16-bit segment register value is extended with four zero bits on the right (i.e., multiplied by 16) and added to the 16-bit offset to produce the 20-bit physical memory address.]

The translation process involves appending four least significant zero bits to the segment base value and then adding the offset value. When using the hexadecimal number system, simply append a zero digit to the segment base address on the right and add the offset value. As an example, consider the logical address 1100:450H. The physical address is computed as follows:

  11000  (append 0 to the 16-bit segment base value)
+   450  (offset value)
  11450  (physical address).
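The same computation in C, as a quick check (a minimal sketch; the function name is illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode translation: physical = (segment << 4) + offset,
       truncated to the 8086's 20-bit address space. */
    static uint32_t real_mode_address(uint16_t seg, uint16_t off) {
        return (((uint32_t)seg << 4) + off) & 0xFFFFF;
    }

    int main(void) {
        printf("%05X\n", (unsigned)real_mode_address(0x1100, 0x450)); /* 11450 */
        return 0;
    }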

7–13 In the protected mode, the contents of the segment register are taken as an index into a segment descriptor table to get a descriptor. The segment translation process is shown in the following figure:


[Figure: protected-mode segment translation, as in the figure for 7–11. The selector's 13-bit index picks a segment descriptor (base address, limit, access rights) from the descriptor table, and the descriptor's 32-bit base address is added to the offset to form the linear address.]

Segment descriptors provide the 32-bit segment base address, its size, and access rights. To translate a logical address into the corresponding linear address, the offset is added to the 32-bit base address. The offset value can be either a 16-bit or a 32-bit number.

7–14 In the real mode:

• Segment size is limited to 64 KB;
• No explicit segment size indication;
• Segments must begin on 16-byte boundaries;
• No segment descriptors;
• Segmentation cannot be turned off.

    In the protected mode:

• Segment size can be up to 4 GB (limited by the memory address space).
• The segment descriptor contains explicit segment size information.
• Segments may be placed anywhere in memory; there is no restriction that they begin on 16-byte boundaries.
• Segment descriptors provide information on the segment (including segment size, segment type, DPL, and so on).


• If the operating system does not use segmentation, it can be turned off; in essence, the entire memory is treated as a single segment.

7–15 A processor with 16-bit addresses supports 2^16 bytes = 64 KB of memory. The first address is 0000H and the last address is FFFFH.

7–16 All numbers are in hex; the following additions are done in hexadecimal.
(a) 1A2B0 + 019A = 1A44A
(b) 39110 + 200 = 39310
(c) 25910 + 10B5 = 269C5
(d) 11000 + ABCD = 1BBCD
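A quick machine check of these sums (a minimal sketch):

    #include <stdio.h>

    int main(void) {
        printf("%X\n", 0x1A2B0u + 0x019Au); /* 1A44A */
        printf("%X\n", 0x39110u + 0x200u);  /* 39310 */
        printf("%X\n", 0x25910u + 0x10B5u); /* 269C5 */
        printf("%X\n", 0x11000u + 0xABCDu); /* 1BBCD */
        return 0;
    }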

Chapter 8

Pipelining and Vector Processing

8–1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline, leaving the other stages idling part of the time. For example, if four stages take 5 ns each but the fifth takes 10 ns, the clock period must be 10 ns, and the faster stages sit idle for half of every cycle.

8–2 Pipeline stalls can be caused by three types of hazards: resource, data, and control hazards. Resource hazards result when two or more instructions in the pipeline want to use the same resource. Such resource conflicts can result in serialized execution, reducing the scope for overlapped execution.

Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input. If the pipeline is not designed properly, data hazards can produce wrong results by using incorrect operands. Therefore, we have to worry about correctness first.

Control hazards are caused by control dependencies. As an example, consider the flow control altered by a branch instruction. If the branch is not taken, we can proceed with the instructions in the pipeline. But if the branch is taken, we have to throw away all the instructions that are in the pipeline and fill the pipeline with instructions at the branch target.

8–3 Prefetching is a technique used to handle resource conflicts. Pipelining typically uses a just-in-time mechanism so that only a simple buffer is needed between the stages. We can minimize the performance impact if we relax this constraint by allowing a queue instead of a single buffer. The instruction fetch unit can prefetch instructions and place them in the instruction queue. The decoding unit will then have ample instructions even if instruction fetch is occasionally delayed because of a cache miss or a resource conflict.

8–4 Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input.



There are two techniques used to handle data dependencies: register interlocking and register forwarding. Register forwarding works if the two instructions involved in the dependency are in the pipeline. The basic idea is to provide the output result as soon as it is available in the datapath. This technique is demonstrated in the following figure. For example, if we provide the output of I1 to I2 as we write into the destination register of I1, we reduce the number of stall cycles by one (see Figure a). We can do even better if we feed the output from the IE stage, as shown in Figure b. In this case, we completely eliminate the pipeline stalls.

[Figure: instructions I1–I4 flowing through the IF, ID, OF, IE, WB stages over clock cycles 1–9. (a) Forwarding scheme 1: I1's result is forwarded as it is written to the destination register, saving one stall cycle for I2. (b) Forwarding scheme 2: the result is forwarded directly from I1's IE stage, eliminating the stalls entirely.]

Register interlocking is a general technique to solve the correctness problem associated with data dependencies. In this method, a bit is associated with each register to specify whether the contents are correct. If the bit is 0, the contents of the register can be used. Instructions should not read the contents of a register when its interlocking bit is 1, as the register is locked by another instruction. The following figure shows how register interlocking works for the example below:

I1: add R2,R3,R4   /* R2 = R3 + R4 */
I2: sub R5,R6,R2   /* R5 = R6 - R2 */

I1 locks the R2 register for clock cycles 3 to 5 so that I2 cannot proceed with reading an incorrect R2 value. Clearly, register forwarding is more efficient than the interlocking method.


[Figure: instructions I1–I4 in the pipeline over clock cycles 1–10, with R2 locked during cycles 3 to 5; I2 stalls until I1 releases R2.]
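A minimal sketch of the interlock check a decode stage might perform; the data structures and names are illustrative, not from the text:

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_REGS 32

    /* One interlock bit per register: true = locked (result not yet written). */
    static bool locked[NUM_REGS];

    /* An instruction may read its source registers only if they are unlocked. */
    static bool can_issue(int src1, int src2) {
        return !locked[src1] && !locked[src2];
    }

    int main(void) {
        locked[2] = true;                 /* I1 (add R2,R3,R4) locks R2    */
        printf("%d\n", can_issue(6, 2));  /* 0: I2 (sub R5,R6,R2) stalls   */
        locked[2] = false;                /* I1 writes back and unlocks R2 */
        printf("%d\n", can_issue(6, 2));  /* 1: I2 may proceed             */
        return 0;
    }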

8–5 Flow-altering instructions such as branches require special handling in pipelined processors. In the following figure, Figure a shows the impact of a branch instruction on our pipeline. Here we are assuming that instruction Ib is a branch instruction; if the branch is taken, it transfers control to instruction It. If the branch is not taken, the instructions in the pipeline are useful. However, for a taken branch, we have to discard all the instructions that are in the pipeline at various stages. In our example, we have to discard instructions I2, I3, and I4. We start fetching instructions at the target