INSA-3IF Architecture des Ordinateurs - Centre national de la … · 2019-12-05 · CM2 ISA:...

Post on 11-Jul-2020


C. Wolf# 1

INSA-3IF

Computer Architecture (Architecture des Ordinateurs)

Christian Wolf, INSA-Lyon, Dép. IF

Session 3

On the menu (CM = lecture, TD = tutorial, TP = lab)

CM1 ISA: Introduction. Instruction sets, assembly.
TD The "Micromachine": building a simple processor on paper
CM2 ISA: Encoding; history
TP Building the micromachine in "Digital"
CM3 Data path of a RISC CPU. Instruction-level parallelism
CM4 Instruction-level parallelism + multi-core, multi-threading. ISAs/architectures revisited
CM5 Memory hierarchy
TP Getting started with the MSP 430 controller
CM6 GPUs
TD Micromachine on paper: interrupts
TP MSP 430: stack, timer, interrupts

Session outline

- RISC CPU: the data path; the von Neumann cycle
- Instruction-level parallelism

Example of a RISC processor

- A RISC processor:
– All operations are performed on registers
– Memory is accessed only by load and store instructions
– Fixed-size instructions
– Few, simple instructions
– In our example: branches compare a register with zero (register == 0)

- These properties simplify:
– the design of the von Neumann cycle
– instruction-level parallelism

Bibliography

Much of this section draws heavily on:

Hennessy and Patterson, Architecture des Ordinateurs

Recap: the von Neumann cycle

- The cycle is run by the control automaton, which drives the data paths inside the processor
- Duration of one instruction = clock period × number of cycles (in the five-stage example below: 5 cycles, i.e. a CPI, "cycles per instruction", of 5)

IF → ID → EX → MEM → WB

The von Neumann cycle: example

- IF (Instruction Fetch)
– Read the instruction at the address held in the PC
- ID (Instruction Decode)
– Register access; computation of the target of a conditional branch
– Decoding and register access at the same time: requires orthogonality!
- EX (Execute)
– ALU work: an ALU operation, or the address computation for a memory access
– Requires a load-store architecture!
- MEM (Memory access)
– Memory load or store
- WB (Write Back)
– Write to the register file (ALU result or value returned from memory)

IF → ID → EX → MEM → WB

The "IF" cycle ("Instruction fetch")

[Figure: data path of the IF stage, with labels: PC, instruction memory, IR, "+4" adder, NPC]

IR ← Mem[PC]
NPC ← PC + 4

The new value of the PC is kept in a dedicated register (NPC).

The "ID" cycle ("Instruction decode")

[Figure: data path of the IF and ID stages, with labels: PC, instruction memory, IR, "+4" adder, NPC, registers, A, B, sign extension, Imm]

A ← Regs[rs]
B ← Regs[rt]
Imm ← extension/mask[IR]

Performed in parallel:
- Decoding
- Register access

This relies on the orthogonality of the instructions (the register selections are encoded at fixed positions)!
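As a rough illustration of the register transfers of the IF and ID stages, the following Python sketch models both steps. The 32-bit layout used here (rs in bits 25-21, rt in bits 20-16, a 16-bit immediate in bits 15-0) is an assumption borrowed from MIPS-style encodings, not the encoding used in the course, and `fetch`, `decode` and `sign_extend` are hypothetical helper names.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def fetch(mem, pc):
    """IF: IR <- Mem[PC], NPC <- PC + 4."""
    ir = mem[pc]
    npc = pc + 4
    return ir, npc


def decode(ir, regs):
    """ID: read both source registers and extend the immediate.

    Because rs and rt always sit at the same bit positions (assumed layout),
    the registers can be read without knowing the opcode yet.
    """
    rs = (ir >> 21) & 0x1F
    rt = (ir >> 16) & 0x1F
    a = regs[rs]
    b = regs[rt]
    imm = sign_extend(ir, 16)
    return a, b, imm


# Hypothetical instruction word: rs = 2, rt = 3, immediate = -4.
regs = [0] * 32
regs[2], regs[3] = 40, 2
word = (2 << 21) | (3 << 16) | (-4 & 0xFFFF)
mem = {0: word}

ir, npc = fetch(mem, 0)
a, b, imm = decode(ir, regs)
print(npc, a, b, imm)
```

The point of the sketch is the one made on the slide: with fixed register fields, decoding and register access do not depend on each other and can proceed side by side.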

The "EX" cycle

[Figure: data path of the IF, ID and EX stages, with labels: PC, instruction memory, IR, "+4" adder, NPC, registers, A, B, sign extension, Imm, ALU, ALUout, Cond (== 0?)]

Case 1: add r1, r2, r3
Case 2: add r1, r2, #12
Case 3: ldr r3, [r4, r5]
Case 4: beq .L1

The "MEM" cycle

[Figure: data path of the IF, ID, EX and MEM stages, with labels: PC, instruction memory, IR, "+4" adder, NPC, registers, A, B, sign extension, Imm, ALU, ALUout, Cond (== 0?), data memory, LMD]

Case 1: ldr r1, #8001
Case 2: str r1, #8001
Case 3: beq .L1

The "WB" cycle

[Figure: complete data path (IF, ID, EX, MEM, WB), with labels: PC, instruction memory, IR, "+4" adder, NPC, registers, A, B, sign extension, Imm, ALU, ALUout, Cond (== 0?), data memory, LMD]

Case 1: ldr r1, #8001
Case 2: add r1, r2, r3

Performance

- Measured in instructions per unit of time (second)
- Depends on:
– Clock frequency (Hz)
– CPI (cycles per instruction)

- Trade-off:
– Lowering the CPI may require lowering the frequency
– Raising the frequency may require "shorter" stages, and hence a higher CPI

- How can the maximum frequency be determined?
- How can it be raised without changing the CPI?
- By identifying the critical path in the circuit
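The relation between frequency, CPI and performance can be checked with a few lines of Python. This is only a sketch: the 100 MHz clock is a made-up figure, and the CPI of 5 is the five-cycle, non-pipelined case described earlier.

```python
def instructions_per_second(clock_hz, cpi):
    """Throughput = clock frequency divided by cycles per instruction."""
    return clock_hz / cpi


def instruction_time(clock_hz, cpi):
    """Duration of one instruction = clock period * number of cycles."""
    return cpi / clock_hz


clock_hz = 100e6  # hypothetical 100 MHz clock
cpi = 5           # five cycles per instruction, as in the non-pipelined cycle

print(instructions_per_second(clock_hz, cpi))  # instructions per second
print(instruction_time(clock_hz, cpi))         # seconds per instruction
```

Halving the CPI at the same frequency doubles the throughput, which is why the trade-off between the two quantities matters.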

Example: the "EX" cycle

[Figure: the EX stage, with labels: A, B, NPC, Imm, ALU, ALUout, Cond (== 0?)]

Several paths are active in parallel in the circuit. How long is the longest one?

[ 3IF::S2 Task graph ]

See also the 3IF course « Algorithmie avancée pour l’IA et les graphes » by Christine Solnon and Pierre-Edouard Portier (semester 2).

- What is the minimum duration of the project?
- What is the critical path, i.e. the chain of tasks that cannot be lengthened without increasing the duration of the project?
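The critical-path question can be made concrete with a small longest-path computation on a task graph. The sketch below is an illustration only: the node names follow the EX-stage elements (MUXA, MUXB, ALU, "=="), but the edges and the delays in nanoseconds are invented for the example.

```python
from functools import lru_cache

# Hypothetical propagation delays (ns) of the EX-stage elements.
delay = {"MUXA": 1, "MUXB": 1, "ALU": 4, "==": 2}
# Assumed dependencies: both multiplexers feed the ALU; "==" works in parallel.
preds = {"MUXA": [], "MUXB": [], "==": [], "ALU": ["MUXA", "MUXB"]}


def critical_path(delay, preds):
    """Return (duration, path) of the longest path in the task graph."""

    @lru_cache(maxsize=None)
    def finish(node):
        # Earliest finish time of `node` and the path that imposes it.
        best_time, best_path = 0, ()
        for p in preds[node]:
            t, path = finish(p)
            if t > best_time:
                best_time, best_path = t, path
        return best_time + delay[node], best_path + (node,)

    return max((finish(n) for n in delay), key=lambda r: r[0])


duration, path = critical_path(delay, preds)
print(duration, path)
```

Under these assumed delays the stage cannot be clocked faster than the duration returned, and only speeding up an element on the returned path shortens it.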

Graph for the "EX" cycle

[Figure: the EX stage (labels: A, B, NPC, Imm, ALU, ALUout, Cond (== 0?)) redrawn as a task graph with the nodes MUXA, MUXB, ALU and "=="]

Graph for the complete cycle

[Figure: task graph with the nodes Read, "+", register file, sign extension, MUXA, MUXB, "==" and ALU, grouped into the stages IF, ID, EX, MEM]

Question: the stages execute sequentially (one after the other). Why are they connected in parallel in this graph?

Optimization

- Optimization:
– Find the critical path
– Optimize the elements on that path
– Possibly rearrange the graph

Session outline

- RISC CPU: the data path; the von Neumann cycle
- Instruction-level parallelism

Pipeline: motivation

Recap:
- Number of instructions per second =
– clock frequency (Hz), divided by
– CPI (cycles per instruction)

- Trade-off:
– Lowering the CPI may require lowering the frequency
– Raising the frequency may require "shorter" stages, and hence a higher CPI

- Idea: carry out the stages of the von Neumann cycle in parallel

A pipeline

- Works like an assembly line
- The processor works on several instructions in parallel, each of them being in a different stage
- Duration of one instruction: the same as in the non-pipelined version
- (Maximum) throughput:
– that of the non-pipelined version × number of stages
– equal to the clock frequency (one instruction per cycle!) -> constant!

[Figure: pipeline diagram of five instructions, each going through IF, ID, EX, MEM, WB]
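To see where the throughput gain comes from, the following Python sketch lays out an ideal pipeline (no stalls) and counts cycles. It assumes that a new instruction enters the pipeline on every clock cycle; `pipeline_diagram` and `total_cycles` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions, stages=STAGES):
    """One row per instruction; instruction i starts in clock cycle i + 1."""
    rows = []
    for i in range(n_instructions):
        idle_before = ["--"] * i
        idle_after = ["--"] * (n_instructions - 1 - i)
        rows.append(idle_before + list(stages) + idle_after)
    return rows


def total_cycles(n_instructions, n_stages, pipelined=True):
    """Cycles needed to run n instructions, with or without pipelining."""
    if pipelined:
        # Fill the pipeline once, then complete one instruction per cycle.
        return n_stages + n_instructions - 1
    return n_stages * n_instructions


for row in pipeline_diagram(5):
    print(" ".join(f"{stage:>3}" for stage in row))

print(total_cycles(5, 5), "cycles pipelined,",
      total_cycles(5, 5, pipelined=False), "cycles non-pipelined")
```

Each instruction still takes five cycles from IF to WB; only the number of cycles between two completed instructions drops to one.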

Data path: usage

C-8 ■ Appendix C Pipelining: Basic and Intermediate Concepts

clock cycle. To handle reads and a write to the same register (and for another reason, which will become obvious shortly), we perform the register write in the first half of the clock cycle and the read in the second half.

Third, Figure C.2 does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. Furthermore, we must also have an adder to compute the potential branch target during ID. One further problem is that a branch does not change the PC until the ID stage. This causes a problem, which we ignore for now, but will handle shortly.

Although it is critical to ensure that instructions in the pipeline do not attempt to use the hardware resources at the same time, we must also ensure that instructions in different stages of the pipeline do not interfere with one another. This separation is done by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle. Figure C.3 shows the pipeline drawn with these pipeline registers.

Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows the overlap among the parts of the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID stage and as a destination in the WB stage, it appears twice. We show that it is read in one part of the stage and written in another by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for instruction memory, DM for data memory, and CC for clock cycle.

[Figure C.2: five instructions in program execution order, each passing through IM, Reg, ALU, DM, Reg, over clock cycles CC 1 to CC 9. Slide annotations: stages IF, ID, EX, MEM, WB; instruction memory; data memory; register read; register write]

- The split into IM and DM may be realised by the cache
- Register writes and register reads take place in different half-cycles

General case

[Figure C.2 repeated]

Some problems remain … to be dealt with.

Observation

- We have made sure that the hardware is not over-used when executing in parallel.
- What about the results and their consistency?
- Additional registers are used to carry the intermediate results.

Recap: the data path

[Figure: complete non-pipelined data path, with labels: PC, instruction memory, IR, "+4" adder, NPC, registers, A, B, sign extension, Imm, ALU, ALUout, Cond (== 0?), data memory, LMD]

Additional registers

[Figure: the data path (PC, "+4" adder, instruction memory, registers, sign extension, "== 0?" test, ALU, data memory) with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages]

C.1 Introduction ■ C-9

Although many figures will omit such registers for simplicity, they are required to make the pipeline operate properly and must be present. Of course, similar registers would be needed even in a multicycle data path that had no pipelining (since only values in registers are preserved across clock boundaries). In the case of a pipelined processor, the pipeline registers also play the key role of carrying intermediate results from one stage to another where the source and destination may not be directly adjacent. For example, the register value to be stored

Figure C.3 A pipeline showing the pipeline registers between successive pipeline stages. Notice that the registers prevent interference between two different instructions in adjacent stages in the pipeline. The registers also play the critical role of carrying data for a given instruction from one stage to the other. The edge-triggered property of registers (that is, that the values change instantaneously on a clock edge) is critical. Otherwise, the data from one instruction could interfere with the execution of another!

[Figure C.3: five instructions over clock cycles CC 1 to CC 6, each passing through IM, Reg, ALU, DM, Reg, with pipeline registers drawn between the stages]

[Figure C.3 repeated, with the pipeline registers labelled IF/ID, ID/EX, EX/MEM and MEM/WB]

First assessment: limitations

- Throughput increases, but …
- … no individual instruction is executed any faster!

Limitations:
- Latency
- Imbalance between the stages (the slowest stage sets the frequency)
- Overhead of managing the pipeline (registers, etc.)
- Dependencies between instructions
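The effect of unbalanced stages and of the pipeline-register overhead can be estimated as follows. This is a sketch under stated assumptions: the stage delays and the 1 ns register overhead are invented numbers, and the non-pipelined reference is assumed to take the sum of the stage delays per instruction.

```python
def pipelined_clock_period(stage_delays, register_overhead):
    """The slowest stage, plus the pipeline-register overhead, sets the clock."""
    return max(stage_delays) + register_overhead


def speedup(stage_delays, register_overhead):
    """Throughput gain over a non-pipelined version (assumed: sum of delays)."""
    unpipelined_time = sum(stage_delays)
    return unpipelined_time / pipelined_clock_period(stage_delays, register_overhead)


stage_delays_ns = [10, 8, 10, 10, 7]  # hypothetical IF, ID, EX, MEM, WB delays
overhead_ns = 1                       # hypothetical pipeline-register overhead

print(pipelined_clock_period(stage_delays_ns, overhead_ns))  # ns per cycle
print(speedup(stage_delays_ns, overhead_ns))
```

With these numbers the five-stage pipeline gains roughly a factor of four rather than five: the faster stages wait for the slowest one, and every cycle pays the register overhead.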

Hazards

Three types of hazards (conflicts):
- Structural: when there is not enough hardware for every combination of instructions.
- Data: when an instruction uses the result of a preceding instruction that is still executing.
- Control: when the PC is modified by an instruction.

"Solution":
– shorten the paths, or
– suspend the activity of part of the pipeline.

Structural hazards

Examples:
- A single access path to memory
- Arithmetic/logic operations that take longer than one cycle (floating point?)

C-14 ■ Appendix C Pipelining: Basic and Intermediate Concepts

the right (which delays its execution start and finish by 1 cycle). The effect of the pipeline bubble is actually to occupy the resources for that instruction slot as it travels through the pipeline.

Example Let’s see how much the load structural hazard might cost. Suppose that data references constitute 40% of the mix, and that the ideal CPI of the pipelined processor, ignoring the structural hazard, is 1. Assume that the processor with the structural hazard has a clock rate that is 1.05 times higher than the clock rate of the processor without the hazard. Disregarding any other performance losses, is the pipeline with or without the structural hazard faster, and by how much?

Answer There are several ways we could solve this problem. Perhaps the simplest is to compute the average instruction time on the two processors:

Average instruction time = CPI × Clock cycle time

Figure C.4 A processor with only one memory port will generate a conflict whenever a memory reference occurs. In this example the load instruction uses the memory for a data access at the same time instruction 3 wants to fetch an instruction from memory.

[Figure C.4: Load and Instructions 1 to 4 over clock cycles CC 1 to CC 8, each passing through Mem, Reg, ALU, Mem, Reg]

[Figure: IF ID EX EX EX MEM WB, an instruction whose EX stage lasts several cycles]
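The quoted example stops before the numbers are worked out, so here is one way to carry the calculation through in Python. It uses only the figures given in the excerpt (40% data references, ideal CPI of 1, clock rate 1.05 times higher) and assumes, as the excerpt suggests, that each data reference costs one stall cycle on the machine with the hazard. The clock cycle time of the hazard-free machine is taken as the unit.

```python
def average_instruction_time(cpi, clock_cycle_time):
    """Average instruction time = CPI x clock cycle time."""
    return cpi * clock_cycle_time


data_ref_fraction = 0.40       # from the example
ideal_cpi = 1.0                # from the example
stall_cycles_per_data_ref = 1  # assumption: one stall cycle per data reference
clock_ratio = 1.05             # from the example: faster clock with the hazard

cycle_time = 1.0               # cycle time without the hazard, arbitrary unit

time_without_hazard = average_instruction_time(ideal_cpi, cycle_time)
time_with_hazard = average_instruction_time(
    ideal_cpi + data_ref_fraction * stall_cycles_per_data_ref,
    cycle_time / clock_ratio,
)

ratio = time_with_hazard / time_without_hazard
print(f"with hazard: {time_with_hazard:.3f}, without: {time_without_hazard:.3f}")
print(f"the machine without the hazard is about {ratio:.2f} times faster")
```

Under these assumptions the slightly faster clock does not make up for the stalls: the processor without the structural hazard comes out ahead.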

Bubbles

Optimal execution:

[Figure: pipeline diagram of six instructions, each going through IF, ID, EX, MEM, WB, without any stall]

Bubble / hazard:

[Figure: pipeline diagram of six instructions with a bubble inserted]

Data hazards (1)

All the operations after "add" use its result.

add r1, r2, r3
sub r4, r1, r5
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
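The dependencies in this sequence can be found mechanically. The Python sketch below scans a program for read-after-write dependencies; it assumes the simple three-operand, register-only syntax of the example (`op dest, src1, src2`), and `parse` and `raw_dependencies` are hypothetical helper names.

```python
def parse(line):
    """'add r1, r2, r3' -> ('add', 'r1', ['r2', 'r3'])."""
    op, rest = line.split(None, 1)
    dest, *sources = [field.strip() for field in rest.split(",")]
    return op, dest, sources


def raw_dependencies(program):
    """List (producer index, consumer index, register) for each RAW dependency."""
    deps = []
    last_writer = {}  # register name -> index of the last instruction writing it
    for i, line in enumerate(program):
        _, dest, sources = parse(line)
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], i, src))
        last_writer[dest] = i
    return deps


program = [
    "add r1, r2, r3",
    "sub r4, r1, r5",
    "and r6, r1, r7",
    "or r8, r1, r9",
    "xor r10, r1, r11",
]

for producer, consumer, reg in raw_dependencies(program):
    print(f"{program[consumer]!r} reads {reg}, written by {program[producer]!r}")
```

Whether a dependency actually costs cycles depends on how far apart the two instructions are in the pipeline.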

C.2 The Major Hurdle of Pipelining: Pipeline Hazards ■ C-17

after the DADD actually produces it. If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, then the need for a stall can be avoided. Using this observation, forwarding works as follows:

1. The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.

2. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Notice that with forwarding, if the DSUB is stalled, the DADD will be completed and the bypass will not be activated. This relationship is also true for the case of an interrupt between the two instructions.

As the example in Figure C.6 shows, we need to forward results not only from the immediately previous instruction but also possibly from an instruction

Figure C.6 The use of the result of the DADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

[Figure C.6: DADD R1, R2, R3; DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11 in program execution order over clock cycles CC 1 to CC 6, each passing through IM, Reg, ALU, DM, Reg]

The "OR" instruction is fine if the register read takes place in the second half-cycle.

Bubbles (without countermeasures)

[Figure: pipeline diagram of five instructions (IF, ID, EX, MEM, WB) with two bubbles. Annotations: write to the register file (first half-cycle); read from the register file (second half-cycle)]

Bubbles: countermeasures?

There is no "theoretical" reason for a hazard, since the requested value is already available. It simply has not yet been written to the register file.

[Figure: pipeline diagram of two consecutive instructions. Annotations: computation of the result; use of the result]

Forwarding of values (1)

C-18 ■ Appendix C Pipelining: Basic and Intermediate Concepts

that started 2 cycles earlier. Figure C.7 shows our example with the bypass paths in place and highlighting the timing of the register read and writes. This code sequence can be executed without stalls.

Forwarding can be generalized to include passing a result directly to the functional unit that requires it: A result is forwarded from the pipeline register corresponding to the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit. Take, for example, the following sequence:

DADD R1,R2,R3
LD R4,0(R1)
SD R4,12(R1)

Figure C.7 A set of instructions that depends on the DADD result uses forwarding paths to avoid the data hazard. The inputs for the DSUB and AND instructions forward from the pipeline registers to the first ALU input. The OR receives its result by forwarding through the register file, which is easily accomplished by reading the registers in the second half of the cycle and writing in the first half, as the dashed lines on the registers indicate. Notice that the forwarded result can go to either ALU input; in fact, both ALU inputs could use forwarded inputs from either the same pipeline register or from different pipeline registers. This would occur, for example, if the AND instruction was AND R6,R1,R4.

[Figure C.7: DADD R1, R2, R3; DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11 in program execution order over clock cycles CC 1 to CC 6, with the forwarding paths drawn]

The control automaton detects the presence of a hazard and selects the source of the operand.
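The selection made by the control logic can be sketched in a few lines of Python. This is a simplified model, not the actual hardware description: the pipeline registers are represented as dictionaries with a hypothetical `dest` field (destination register name, or `None`) and a `value` field, and the function decides where one ALU operand comes from.

```python
def select_alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Choose the source of one ALU operand.

    ex_mem / mem_wb: contents of the EX/MEM and MEM/WB pipeline registers,
    modelled as {'dest': register name or None, 'value': result}.
    """
    # The most recent result (EX/MEM) takes priority over the older one.
    if ex_mem["dest"] == src_reg:
        return ex_mem["value"], "EX/MEM"
    if mem_wb["dest"] == src_reg:
        return mem_wb["value"], "MEM/WB"
    # No hazard: use the value read from the register file during ID.
    return regfile_value, "register file"


# 'sub r4, r1, r5' is in EX while 'add r1, r2, r3' has just left EX:
ex_mem = {"dest": "r1", "value": 7}
mem_wb = {"dest": None, "value": 0}

print(select_alu_operand("r1", 99, ex_mem, mem_wb))  # stale register value 99 is bypassed
print(select_alu_operand("r5", 3, ex_mem, mem_wb))   # no dependency on r5
```

In hardware, each comparison corresponds to a comparator on the register numbers, and the return value to an extra input on the multiplexer in front of the ALU.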

Pipeline: implementation

[Figure: non-pipelined EX stage, with labels: A, B, Imm, ALU, ALUout]

Recap: in the non-pipelined version, each MUX in front of the ALU has two inputs.

Pipeline: implementation

Forwarding values requires additional inputs on the multiplexers.

[Figure: pipelined EX and MEM stages, with labels: "== 0?" test, ALU, data memory]

Data hazards (2)

add r1, r2, r3
ldr r4, [r1, #20]
str r4, [r1, #64]

Forwarding of values (2)

C.2 The Major Hurdle of Pipelining: Pipeline Hazards ■ C-19

To prevent a stall in this sequence, we would need to forward the values of the ALU output and memory unit output from the pipeline registers to the ALU and data memory inputs. Figure C.8 shows all the forwarding paths for this example.

Data Hazards Requiring Stalls

Unfortunately, not all potential data hazards can be handled by bypassing. Consider the following sequence of instructions:

LD R1,0(R2)
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9

The pipelined data path with the bypass paths for this example is shown in Figure C.9. This case is different from the situation with back-to-back ALU operations. The LD instruction does not have the data until the end of clock cycle 4 (its MEM cycle), while the DSUB instruction needs to have the data by the beginning of that clock cycle. Thus, the data hazard from using the result of a load instruction cannot be completely eliminated with simple hardware. As Figure C.9 shows, such a forwarding path would have to operate backward

Figure C.8 Forwarding of operand required by stores during MEM. The result of the load is forwarded from the memory output to the memory input to be stored. In addition, the ALU output is forwarded to the ALU input for the address calculation of both the load and the store (this is no different than forwarding to another ALU operation). If the store depended on an immediately preceding ALU operation (not shown above), the result would need to be forwarded to prevent a stall.

[Figure C.8: DADD R1, R2, R3; LD R4, 0(R1); SD R4, 12(R1) in program execution order over clock cycles CC 1 to CC 6. Slide annotations give the equivalent instructions: add r1, r2, r3; ldr r4, [r1, #20]; str r4, [r1, #64]]

Data hazards (3)

ldr r1, [r2, #0]
sub r4, r1, r5
and r6, r1, r7
or r8, r1, r9

A real data hazard

C-20 ■ Appendix C Pipelining: Basic and Intermediate Concepts

in time, a capability not yet available to computer designers! We can forward the result immediately to the ALU from the pipeline registers for use in the AND operation, which begins 2 clock cycles after the load. Likewise, the OR instruction has no problem, since it receives the value through the register file. For the DSUB instruction, the forwarded result arrives too late: at the end of a clock cycle, when it is needed at the beginning.

The load instruction has a delay or latency that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared. In this case, the interlock stalls the pipeline, beginning with the instruction that wants to use the data until the source instruction produces it. This pipeline interlock introduces a stall or bubble, just as it did for the structural hazard. The CPI for the stalled instruction increases by the length of the stall (1 clock cycle in this case).

Figure C.10 shows the pipeline before and after the stall using the names of the pipeline stages. Because the stall causes the instructions starting with the DSUB to move 1 cycle later in time, the forwarding to the AND instruction now goes through the register file, and no forwarding at all is needed for the OR instruction. The insertion of the bubble causes the number of cycles to complete this sequence to grow by one. No instruction is started during clock cycle 4 (and none finishes during cycle 6).

Figure C.9 The load instruction can bypass its results to the AND and OR instructions, but not to the DSUB, since that would mean forwarding the result in “negative time.”

[Figure C.9: LD R1, 0(R2); DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9 in program execution order over clock cycles CC 1 to CC 5. Slide annotations give the equivalent instructions: ldr r1, [r2, #0]; sub r4, r1, r5; and r6, r1, r7; or r8, r1, r9; the forwarding path to the sub instruction is marked "!"]
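The number of bubbles in the cases discussed so far can be summarised in a small Python function. It is a sketch of the classic five-stage pipeline only, under these assumptions: the register file is written in the first half of WB and read in the second half of ID, an ALU result is available at the end of EX, a loaded value at the end of MEM, and the consumer needs its operand at the start of its own EX stage. `stall_cycles` is a hypothetical helper name.

```python
def stall_cycles(producer_is_load, distance, forwarding=True):
    """Bubbles needed by a consumer `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    """
    if not forwarding:
        # The consumer's ID must not come before the producer's WB:
        # WB is 3 cycles after ID, so up to 2 bubbles are needed.
        return max(0, 3 - distance)
    if producer_is_load:
        # The loaded value exists only at the end of MEM: load-use hazard.
        return max(0, 2 - distance)
    # ALU results are forwarded from EX/MEM or MEM/WB in time.
    return 0


print(stall_cycles(producer_is_load=False, distance=1, forwarding=False))  # add then sub, no forwarding
print(stall_cycles(producer_is_load=False, distance=1))                    # add then sub, with forwarding
print(stall_cycles(producer_is_load=True, distance=1))                     # ldr then sub: the "real" hazard
print(stall_cycles(producer_is_load=True, distance=2))                     # ldr then and (one instruction later)
```

This is why a compiler tries to place an independent instruction directly after a load: moving the consumer one slot further away removes the bubble.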

Bubble

[Figure: pipeline diagram of five instructions (IF, ID, EX, MEM, WB) with three slots marked as bubbles]

A real data hazard

[Figure C.9 repeated, with the equivalent instructions ldr r1, [r2, #0]; sub r4, r1, r5; and r6, r1, r7; or r8, r1, r9, and with three bubbles marked]

Control hazards (branches)

bne .L1
add r1, r2, r3
.L1:

[Figure: data path of the IF, ID and EX stages, with labels: PC, instruction memory, IR, "+4" adder, NPC, registers, A, B, sign extension, Imm, ALU, ALUout, Cond (== 0?)]

The branch target is
- available at the end of the "EX" cycle
- written to the PC at the end of the "MEM" cycle

Control hazards (branches)

bne .L1
add r1, r2, r3
.L1:

[Figure: pipeline diagram of two consecutive instructions. Annotations: new PC available; fetch of the new instruction]

A "very aggressive" implementation makes the branch outcome available at the end of the "ID" cycle (not detailed here).
A problem therefore remains …

Solution (1): simple

Systematically repeat the "IF" cycle after a branch.

[Figure: pipeline diagram in which the instruction after the branch performs IF twice]

The first fetch is "useless" if the branch was taken.

Equivalent to a bubble.

Solution (2): constant prediction

bne .L1
add r1, r2, r3
.L1:

Branch not taken:

[Figure: pipeline diagram in which bne .L1, add r1, r2, r3 and the following instructions go through IF, ID, EX, MEM, WB without a stall]

Branch taken:

[Figure: pipeline diagram in which add r1, r2, r3 is fetched (IF) after bne .L1 and its four remaining stages are replaced by bubbles; the following instructions go through IF, ID, EX, MEM, WB]
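The average cost of this "always predict not taken" scheme can be estimated with a short calculation. The Python sketch below assumes that only taken branches cost extra cycles; the branch frequency, the taken rate and the one-cycle penalty are hypothetical values chosen for the illustration.

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, taken_penalty):
    """CPI when only taken branches pay a penalty (predict-not-taken)."""
    return base_cpi + branch_fraction * taken_fraction * taken_penalty


base_cpi = 1.0         # ideal pipelined CPI
branch_fraction = 0.2  # hypothetical: 20% of the instructions are branches
taken_fraction = 0.6   # hypothetical: 60% of the branches are taken
taken_penalty = 1      # hypothetical: one lost cycle per taken branch

print(effective_cpi(base_cpi, branch_fraction, taken_fraction, taken_penalty))
```

The same formula shows why deep pipelines care about prediction: with a penalty of several cycles, the branch term quickly dominates the ideal CPI of 1.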

Solution (3): delayed branch

Systematically place one instruction between the branch instruction and the point where the branch takes effect:

bne .L1
add r1, r2, r3 (executed whatever the result of bne)
add r2, r1, r1 (branch not taken)

.L1: sub r2, r1, r1 (branch taken)

Solution (3): delayed branch

The compiler optimises the code so as to place "useful" instructions in the delay slot.

add r1, r2, r3
cmp r4, r5
bne .L1
<slot>

.L1:

is transformed into:

cmp r4, r5
bne .L1
add r1, r2, r3

.L1:

Note: current processors use an "out-of-order" pipeline, in which the instructions are reordered to maximise performance.

Dynamic prediction

On some architectures with deep pipelines, branches can cost several cycles.
Solution: try to predict the destination of the branch.
A table records whether the branch at a given address was taken previously.
Several addresses share one entry, selected by their low-order bits.

[Figure: a 32-bit word 1110 0001 1010 0000 0011 0000 0000 0111, the instruction bne .L1, the prediction buffer, and the two possible next instructions add r4, r5, r6 and sub r4, r5, r6. Annotations: "IF" according to the prediction; the buffer is updated if the prediction was wrong]
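A one-bit version of such a prediction buffer can be sketched in Python as follows. The table size (16 entries), the 4-byte instruction alignment and the class and function names are assumptions made for the illustration; real predictors are usually more elaborate.

```python
class BranchPredictionBuffer:
    """One bit per entry: was the last branch mapped to this entry taken?"""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.taken = [False] * (1 << index_bits)

    def _index(self, address):
        # Drop the 2 byte-offset bits (assumes 4-byte instructions),
        # then keep the low-order bits: several addresses share an entry.
        return (address >> 2) & self.mask

    def predict(self, address):
        return self.taken[self._index(address)]

    def update(self, address, was_taken):
        self.taken[self._index(address)] = was_taken


def count_mispredictions(buffer, address, outcomes):
    """Run a sequence of branch outcomes; update the buffer on each miss."""
    misses = 0
    for was_taken in outcomes:
        if buffer.predict(address) != was_taken:
            misses += 1
            buffer.update(address, was_taken)
    return misses


# A loop branch taken 9 times and then not taken, executed twice.
buffer = BranchPredictionBuffer()
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(buffer, 0x8000, outcomes), "mispredictions out of", len(outcomes))
```

With a single bit per entry, a loop branch is mispredicted twice per execution of the loop: once on the first iteration and once on exit.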

Multi-cycle operations

Some operations last several cycles, for example operations on floating-point numbers.

[Figure: pipeline IF, ID, MEM, WB with several execution units in place of a single EX stage: integer unit; FP/integer multiplication; FP addition; FP/integer division (24 cycles, not pipelined)]

Multi-cycle operations

Non-pipelined operations can cause very long bubbles.

[Figure: pipeline diagram of three consecutive instructions]

Multi-cycle operations

Operations may complete in a different order.
Read-after-write problems.

[Figure: pipeline diagram of five instructions, one of them with six EX cycles, and with eight slots marked as bubbles]

The problem of interrupts

- Reactions to events that are
– internal (arithmetic exceptions, forbidden memory accesses)
– external (hardware, task switch, etc.)
- The current program is interrupted, and a subroutine is then called
- As soon as an interrupt arrives, all writes (memory, registers) are disabled.
- The instructions that have not completed are restarted after the interrupt routine returns.

Interrupts and multi-cycle operations

An instruction may already have completed while an earlier instruction is still executing.
If "precise interrupts" are required, this situation must be avoided at all costs.
Solution: delay the writes.

[Figure: pipeline diagram of five instructions, one of them with six EX cycles. Annotation: interrupt here]

Summary: historical development

- 1980: pipelines are used only in supercomputers
- ~1985: pipelines arrive in (desktop) microprocessors
- Today: microcontrollers costing 1-2 € are pipelined