CA226 — Advanced Computer Architecture (ray/teaching/CA226/06-scheduling.pdf)
Load RAW Stalls

        ld      r1,0(r2)
        dadd    r4,r3,r1     ; unavoidable stall (on r1)

The value needed by the second instruction in Ex is available only after the first instruction has completed Mem.
Branch RAW Stalls

        dsub    r1,r2,r3
        beqz    r1,target

We can’t forward the necessary value until after Ex:
• hence, a stall of one cycle (whether the branch is taken or not)
A "Double Whammy" Stall

        ld      r1,0(r2)
        beqz    r1,target

Stall of two cycles. This is a combination of the two previous stalls.
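
The three stall counts above follow from one rule: the consumer waits until the producer's result can be forwarded. A minimal sketch of that rule, assuming a classic five-stage pipeline (IF, ID, EX, MEM, WB) with full forwarding, and assuming that branches read their operand in ID (which is what the one-cycle branch stall suggests). The stage numbers and the helper name `raw_stall` are illustrative, not taken from the slides.

```python
# Stage numbers of a classic five-stage pipeline (an assumption of this sketch).
IF, ID, EX, MEM, WB = 1, 2, 3, 4, 5

# Stage at whose END each kind of producer has its result ready to forward.
READY_AFTER = {"alu": EX, "load": MEM}

# Stage at whose START each kind of consumer needs its operand.
# Assumption: branches are resolved in ID; ALU operations read operands in EX.
NEEDED_IN = {"alu": EX, "branch": ID}


def raw_stall(producer, consumer, distance=1):
    """Stall cycles when `consumer` is issued `distance` instructions after `producer`."""
    # With the producer fetched in cycle 1, it occupies stage s in cycle s,
    # so its result is usable from cycle READY_AFTER + 1 onwards.
    usable_from = READY_AFTER[producer] + 1
    # Without stalls, the consumer occupies stage s in cycle s + distance.
    wanted_in = NEEDED_IN[consumer] + distance
    return max(0, usable_from - wanted_in)


print(raw_stall("load", "alu"))      # ld   followed by dadd
print(raw_stall("alu", "branch"))    # dsub followed by beqz
print(raw_stall("load", "branch"))   # ld   followed by beqz
```

Under these assumptions, placing independent instructions between producer and consumer (a larger `distance`) reduces the stall, which is the idea behind the scheduling examples later in the deck.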
The Pipeline

The pipeline:
• is essentially a miniature graph of parallel-processing elements
• instructions flow from node to node
The Pipeline
Consider this …

        dadd    r3,r1,r2
        dadd    r4,r1,r2

They flow:
• smoothly through the pipeline, no stalls
Now consider this …

        dadd    r3,r1,r2
        dmul    r4,r1,r2

They flow:
• more slowly (more cycles), but still no stalls (multiplication is expensive)
And this …

        dmul    r3,r1,r2
        dmul    r4,r1,r2
        dmul    r5,r1,r2

They flow:
• again, more slowly (more cycles), but still no stalls
• all three instructions flow through the multiplier
And this …

        dmul    r4,r1,r2
        dadd    r3,r1,r2

The dadd is not blocked by the dmul:
• the dadd overtakes the dmul in the pipeline
• still no stalls
Write-After-Write (WAW) Stalls

        dmul    r3,r1,r2
        dadd    r3,r1,r2

The dadd is now blocked by the dmul:
• were the dadd to overtake the dmul, r3 would have the incorrect final value
This is known as a:
• write-after-write (WAW) stall
Write-After-Write (WAW) Stalls

        dmul    r3,r1,r2
        dadd    r3,r1,r2
        daddi   r5,r0,100

Note:
• subsequent, independent instructions are also blocked!
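
Why the overtaking dadd would leave the wrong value in r3 can be shown with a toy model. This is a sketch only: the latencies are assumed for illustration, and `final_value` is a hypothetical helper, not part of any simulator.

```python
def final_value(writes, in_order):
    """Return the value left in a register after all writes have landed.

    `writes` is a list of (issue_position, latency, value) tuples.
    With in_order=True, writes land in program order (the hardware stalls
    to enforce this). With in_order=False, each write lands as soon as its
    functional unit finishes, so a short instruction can overtake a long one.
    """
    if in_order:
        landing_order = sorted(writes, key=lambda w: w[0])
    else:
        # Completion time = issue position + functional-unit latency.
        landing_order = sorted(writes, key=lambda w: w[0] + w[1])
    return landing_order[-1][2]


r1, r2 = 3, 4
MUL_LATENCY, ADD_LATENCY = 7, 1      # assumed latencies, for illustration only

writes_to_r3 = [
    (0, MUL_LATENCY, r1 * r2),       # dmul r3,r1,r2
    (1, ADD_LATENCY, r1 + r2),       # dadd r3,r1,r2
]

print(final_value(writes_to_r3, in_order=True))    # result of the dadd, as the program intends
print(final_value(writes_to_r3, in_order=False))   # result of the dmul: the WAW hazard
```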
Another topic…
Example

Consider:

        for (i=0; i<1000; i+=1)
            a[i] += 1;          // where a[i] is an integer
Example

        .data                ; psched3.s
N:      .word   8000         ; N = 8000 bytes (1000 iterations x 8 bytes)
a:      .space  8000         ; 1000 64-bit words

        .text
        daddi   r1,r0,0      ; r1 = i = 0 (a byte offset into a)
        ld      r2,N(r0)     ; r2 = N

loop:   ld      r3,a(r1)     ; r3 = a[i/8]
        daddi   r3,r3,1      ; r3 = r3 + 1
        sd      r3,a(r1)     ; a[i/8] = r3
        daddi   r1,r1,8      ; i = i + 8
        bne     r1,r2,loop   ; repeat, unless i == N
        nop                  ; branch delay slot
        halt
So …

Stalls per iteration:
• one load RAW stall on r3
• one branch RAW stall on r1

Over the whole run:
• plus 999 wasted nop cycles (in delay slot)
• 8007 cycles in total
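
One accounting that reproduces the 8007 figure: count every instruction executed, add the stall cycles, and add the cycles needed to fill the pipeline. This is a sketch under stated assumptions (a five-stage pipeline, so four fill cycles; instruction counts read off the psched3.s listing); `total_cycles` is a hypothetical helper.

```python
def total_cycles(setup, per_iteration, iterations, stalls_per_iteration,
                 tail=1, depth=5):
    """Estimate total cycles for a simple counted loop.

    setup                 instructions executed before the loop
    per_iteration         instructions in the loop body (including any nop)
    stalls_per_iteration  stall cycles incurred in each iteration
    tail                  instructions after the loop (the halt)
    depth                 pipeline depth; depth - 1 cycles are spent filling it
    """
    instructions = setup + per_iteration * iterations + tail
    stalls = stalls_per_iteration * iterations
    return instructions + stalls + (depth - 1)


# psched3.s: 2 setup instructions, 6 per iteration, 2 stalls per iteration.
print(total_cycles(setup=2, per_iteration=6, iterations=1000,
                   stalls_per_iteration=2))
```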
After pipeline scheduling, …

        .data                ; psched4.s
N:      .word   8000
b:      .word   0            ; Address of a, minus 8 (hack)
a:      .space  8000

        .text
        daddi   r1,r0,0
        ld      r2,N(r0)

loop:   ld      r3,a(r1)
        daddi   r1,r1,8      ; moved up
        daddi   r3,r3,1
        bne     r1,r2,loop
        sd      r3,b(r1)     ; moved down, and adjusted
        halt
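
That the reordered loop still computes the same result can be checked with a small model. This is a sketch, not a simulator: memory is a Python list indexed by element, `i` plays the role of r1 (a byte offset), and the store through `b` is modelled as a store to the address 8 bytes below `a + i`. The function names are hypothetical.

```python
WORD = 8  # bytes per 64-bit word


def original_loop(a):
    """Model of the unscheduled loop: load, add, store, then increment."""
    mem = list(a)
    n_bytes = WORD * len(mem)
    i = 0
    while True:
        r3 = mem[i // WORD]            # ld    r3,a(r1)
        r3 += 1                        # daddi r3,r3,1
        mem[i // WORD] = r3            # sd    r3,a(r1)
        i += WORD                      # daddi r1,r1,8
        if i == n_bytes:               # bne   r1,r2,loop
            break
    return mem


def scheduled_loop(a):
    """Model of the scheduled loop: increment moved up, store in the delay slot."""
    mem = list(a)
    n_bytes = WORD * len(mem)
    i = 0
    while True:
        r3 = mem[i // WORD]            # ld    r3,a(r1)
        i += WORD                      # daddi r1,r1,8   (moved up)
        r3 += 1                        # daddi r3,r3,1
        taken = i != n_bytes           # bne   r1,r2,loop
        mem[(i - WORD) // WORD] = r3   # sd    r3,b(r1)  (delay slot; b is a minus 8)
        if not taken:
            break
    return mem


data = list(range(1000))
print(original_loop(data) == scheduled_loop(data))
```

The store must use `b` rather than `a` because r1 has already been incremented by the time the store executes.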
So …

No stalls!
• just 5007 cycles, a substantial improvement
• CPI of 1.001

I suspect:
• we can’t do much better than that! (every instruction/cycle does something which has to be done)
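
The CPI figure can be reproduced from the cycle count. A small sketch, assuming the instruction count is 2 setup instructions, 5 per iteration and the final halt (read off the psched4.s listing); the variable names are illustrative.

```python
iterations = 1000
instructions = 2 + 5 * iterations + 1   # setup + loop body + halt
cycles = 5007                           # figure reported on the slide

cpi = cycles / instructions
print(round(cpi, 3))

# The gap between cycles and instructions is what a five-stage pipeline
# needs to fill before the first instruction completes.
print(cycles - instructions)
```

With no stalls left, the only overhead is the pipeline fill, so the CPI cannot get meaningfully closer to 1 for this instruction sequence.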
Now, …

Let’s try the same thing:
• but with floating point numbers
Example — Floating Point

        .data                ; psched5.s
one:    .double 1.0          ; one
N:      .word   8000         ; N = 8000 bytes (1000 iterations x 8 bytes)
a:      .space  8000         ; 1000 64-bit floating point values

        .text
        l.d     f10,one(r0)  ; f10 = 1.0 (one)
        daddi   r1,r0,0      ; r1 = i = 0
        ld      r2,N(r0)     ; r2 = N

loop:   l.d     f3,a(r1)     ; f3 = a[i/8]
        add.d   f3,f3,f10    ; f3 = f3 + one
        s.d     f3,a(r1)     ; a[i/8] = f3
        daddi   r1,r1,8      ; i = i + 8
        bne     r1,r2,loop   ; repeat, unless i == N
        nop
        halt
Stalls?

Stalls:
• 5000 RAW stalls
• 1000 structural stalls
Overall: 11008 cycles.
As Before: Reorder Operations

        .data                ; psched6.s
one:    .double 1.0          ; one
N:      .word   8000         ; N = 8000 bytes (1000 iterations x 8 bytes)
b:      .word   0            ; Address of a, minus 8
a:      .space  8000         ; 1000 64-bit floating point values

        .text
        l.d     f10,one(r0)  ; f10 = 1.0 (one)
        daddi   r1,r0,0      ; r1 = i = 0
        ld      r2,N(r0)     ; r2 = N

loop:   l.d     f3,a(r1)     ; f3 = a[i/8]
        daddi   r1,r1,8      ; i = i + 8
        add.d   f3,f3,f10    ; f3 = f3 + one
        bne     r1,r2,loop   ; repeat, unless i == N
        s.d     f3,b(r1)     ; a[(i-8)/8] = f3 (delay slot; b is a minus 8)
        halt
Stalls?

Stalls:
• 1000 RAW stalls
• 1000 structural stalls
Overall: 7008 cycles; better than 11008, previously.
Stalls?

Remaining stalls, both RAW and structural, are from:

        add.d   f3,f3,f10    ; f3 = f3 + one
        ...
        s.d     f3,b(r1)     ; a[i] = f3

It takes four cycles for the add.d to move through the floating point adder.
The s.d (a read after write) arrives too soon, and is blocked for two cycles (per iteration).
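
The "two cycles per iteration" can be cross-checked against the reported totals by working backwards from the cycle count. A sketch, assuming a five-stage pipeline (four fill cycles) and instruction counts read off the listings (3 setup instructions, the loop body, and the halt); `implied_stalls` is a hypothetical helper.

```python
def implied_stalls(cycles, instructions, depth=5):
    """Stall cycles implied by a cycle count, once pipeline fill is removed."""
    return cycles - instructions - (depth - 1)


iterations = 1000

# psched5.s: 6 instructions per iteration (including the nop), 11008 cycles.
unscheduled = implied_stalls(11008, 3 + 6 * iterations + 1)

# psched6.s: 5 instructions per iteration (store in the delay slot), 7008 cycles.
scheduled = implied_stalls(7008, 3 + 5 * iterations + 1)

print(unscheduled / iterations)   # stall cycles per iteration before reordering
print(scheduled / iterations)     # stall cycles per iteration after reordering
```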
So, …

How can we eliminate the remaining stalls?
Loop Unrolling

Originally:

        for (i=0; i<1000; i+=1)
            a[i] += 1;

Unroll the loop:

        for (i=0; i<1000; i+=4) {
            a[i+0] += 1;
            a[i+1] += 1;
            a[i+2] += 1;
            a[i+3] += 1;
        }
Example

        .data                ; psched7.s
one:    .double 1.0          ; one
N:      .word   8000         ; N = 8000 bytes (1000 elements x 8 bytes)
b:      .word   0            ; Address of a, minus 8
a:      .space  8000         ; 1000 64-bit floating point values

        .text
        l.d     f10,one(r0)  ; f10 = 1.0 (one)
        daddi   r1,r0,0      ; r1 = i = 0
        ld      r2,N(r0)     ; r2 = N

loop:   l.d     f3,a(r1)     ; 1
        daddi   r1,r1,8
        add.d   f3,f3,f10
        s.d     f3,b(r1)

        l.d     f4,a(r1)     ; 2 - now using f4, instead of f3
        daddi   r1,r1,8
        add.d   f4,f4,f10
        s.d     f4,b(r1)

        l.d     f5,a(r1)     ; 3 - now using f5, instead of f3
        daddi   r1,r1,8
        add.d   f5,f5,f10
        s.d     f5,b(r1)

        l.d     f6,a(r1)     ; 4 - now using f6, instead of f3
        daddi   r1,r1,8
        add.d   f6,f6,f10
        s.d     f6,b(r1)

        bne     r1,r2,loop   ; repeat, unless i == N
        nop
        halt
Now, …

That doesn’t help:
• but many of these operations are now independent
• they can be reordered
First:
• we don’t need all those daddi instructions
• and we can use the delay slot for something useful
Example

        .data                ; psched8.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld      r2,N(r0)
        l.d     f10,one(r0)
        daddi   r11,r0,a
        dadd    r12,r11,r2

loop:   l.d     f3,0(r11)
        add.d   f3,f3,f10
        s.d     f3,0(r11)

        l.d     f4,8(r11)    ; adjust displacement
        add.d   f4,f4,f10
        s.d     f4,8(r11)    ; adjust displacement

        l.d     f5,16(r11)   ; adjust displacement
        add.d   f5,f5,f10
        s.d     f5,16(r11)   ; adjust displacement

        l.d     f6,24(r11)   ; adjust displacement
        add.d   f6,f6,f10

        daddi   r11,r11,32   ; collect all four daddi-s into one

        bne     r11,r12,loop
        s.d     f6,-8(r11)   ; 24 - 32 == -8, adjust displacement
        halt
Hmm, …

Still:
• 7009 cycles
• fewer instructions, more stalls, same number of cycles
We need to:
• look more carefully at how the pipeline is operating
• explore more options for reordering operations
Best we can do?

        .data                ; psched9.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld      r2,N(r0)
        l.d     f10,one(r0)
        daddi   r11,r0,a
        dadd    r12,r11,r2

loop:   l.d     f3,0(r11)    ; i%4==0 load
        l.d     f4,8(r11)    ; i%4==1 load

        add.d   f3,f3,f10    ; i%4==0 add
        l.d     f5,16(r11)   ; i%4==2 load
        l.d     f6,24(r11)   ; i%4==3 load

        add.d   f4,f4,f10    ; i%4==1 add
        s.d     f3,0(r11)    ; i%4==0 store
        daddi   r11,r11,32   ; increment loop counter

        add.d   f5,f5,f10    ; i%4==2 add
        add.d   f6,f6,f10    ; i%4==3 add

        s.d     f4,-24(r11)  ; i%4==1 store (8-32 == -24)
        s.d     f5,-16(r11)  ; i%4==2 store (16-32 == -16)

        bne     r11,r12,loop
        s.d     f6,-8(r11)   ; i%4==3 store (24-32 == -8)
        halt
Differences …

Differences to previous version:
• reorder independent instructions; preserve the order of dependent instructions
• increment the loop counter sooner and adjust subsequent offsets
• carefully interleave FP and non-FP instructions; allow non-FP instructions to flow around FP instructions in the FP adder
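
The adjusted offsets are easy to get wrong, so it is worth checking that the reordered body still updates every element exactly once. A sketch that replays the psched9.s instruction order in Python: memory is a list of doubles, `r11` is a byte offset relative to `a`, and the helper names are hypothetical. It models only the data flow, not the timing.

```python
def run_psched9(a):
    """Replay the psched9.s loop body; len(a) must be a multiple of 4."""
    mem = list(a)                    # mem[k] holds the double at byte offset 8*k
    f = {}                           # floating point registers
    one = 1.0                        # f10
    r11, r12 = 0, 8 * len(mem)       # start and end byte offsets

    def load(reg, disp):
        f[reg] = mem[(r11 + disp) // 8]

    def store(reg, disp):
        mem[(r11 + disp) // 8] = f[reg]

    def add(reg):
        f[reg] = f[reg] + one

    while True:
        load("f3", 0)                # i%4==0 load
        load("f4", 8)                # i%4==1 load
        add("f3")                    # i%4==0 add
        load("f5", 16)               # i%4==2 load
        load("f6", 24)               # i%4==3 load
        add("f4")                    # i%4==1 add
        store("f3", 0)               # i%4==0 store
        r11 += 32                    # daddi r11,r11,32
        add("f5")                    # i%4==2 add
        add("f6")                    # i%4==3 add
        store("f4", -24)             # i%4==1 store
        store("f5", -16)             # i%4==2 store
        taken = r11 != r12           # bne r11,r12,loop
        store("f6", -8)              # delay slot: executes whether or not taken
        if not taken:
            break
    return mem


values = [float(k) for k in range(1000)]
print(run_psched9(values) == [v + 1.0 for v in values])
```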
Performance

Now:
• 4009 cycles, CPI of 1.144
• just 500 structural stalls
Previously:
• 11008 cycles, CPI of 1.833
• so roughly 64% fewer cycles (a speedup of about 2.75x)
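
The figures on this slide can be recomputed from the cycle counts. A sketch, assuming the instruction counts are read off the listings (psched5.s: 3 setup, 6 per iteration for 1000 iterations, plus halt; psched9.s: 4 setup, 14 per iteration for 250 iterations, plus halt); the variable names are illustrative.

```python
before_cycles, after_cycles = 11008, 4009          # figures from the slides

before_instructions = 3 + 6 * 1000 + 1             # psched5.s
after_instructions = 4 + 14 * 250 + 1              # psched9.s

print(round(before_cycles / before_instructions, 3))   # CPI before
print(round(after_cycles / after_instructions, 3))     # CPI after

speedup = before_cycles / after_cycles              # how many times faster
reduction = 1 - after_cycles / before_cycles        # fraction of cycles saved
print(round(speedup, 2), round(100 * reduction, 1))
```

Note that the improvement comes from two sources: fewer stall cycles, and fewer instructions executed (one daddi and one bne per four elements instead of per element).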
More Generally

Unroll the loop some number of times (four, here):

        for (i=0; i<N-N%4; i+=4) {
            a[i+0] += 1;
            a[i+1] += 1;
            a[i+2] += 1;
            a[i+3] += 1;
        }

        // Perform any remaining iterations...
        for (i=N-N%4; i<N; i+=1)
            a[i] += 1;
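
The same pattern as a runnable sketch, useful for convincing yourself that the clean-up loop handles every length correctly. The function name `increment_all` is hypothetical.

```python
def increment_all(a):
    """Add 1 to every element, using a loop unrolled four times."""
    n = len(a)
    main = n - n % 4                 # largest multiple of 4 not exceeding n

    # Unrolled part: four elements per iteration.
    for i in range(0, main, 4):
        a[i + 0] += 1
        a[i + 1] += 1
        a[i + 2] += 1
        a[i + 3] += 1

    # Perform any remaining iterations (at most three).
    for i in range(main, n):
        a[i] += 1
    return a


for n in (0, 1, 4, 7, 10):
    print(n, increment_all(list(range(n))) == [k + 1 for k in range(n)])
```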
Costs?

Costs:
• increase in code size (so this is a space-time trade-off)
• requires more registers (a limited resource)

In fact:
• this type of pipeline scheduling is only possible because we have a large number of general-purpose registers
Done