Embedded Software in Real-Time Signal Processing Systems: Application and Architecture Trends
Real-time Signal Processing on Embedded Systems
![Page 1: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/1.jpg)
Real-time Signal Processing on Embedded Systems
Advanced Cutting-edge Research Seminar I&III
![Page 2: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/2.jpg)
Advances in Microprocessor Technology
![Page 3: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/3.jpg)
Architectural improvements of microprocessors
Pipelining
Parallel processing exploiting ILP: superscalar, VLIW
SIMD
![Page 4: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/4.jpg)
Procedure of instruction execution on a processor
Instruction Fetch (IF): fetches an instruction from main memory.
Instruction Decode (ID): decodes the fetched instruction.
Execution (EX): executes the decoded instruction.
Memory Access (MA): accesses main memory.
Write Back (WB): writes data back to the registers.
![Page 5: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/5.jpg)
Operation cycles on a processor
Single-cycle machine: this kind of machine executes all steps from IF to WB in one cycle. The operating speed is determined by the slowest instruction, because every instruction must complete within one cycle.
Multi-cycle machine: this kind of machine executes an instruction over several cycles (IF, ID, EX, MA, WB).
![Page 6: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/6.jpg)
Pipelined operation can improve the throughput of instructions.
[Figure: pipeline timing diagrams in which successive instructions overlap, each passing through IF, ID, EX, MA, WB]
To realize pipelined operation, several techniques are required.
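The throughput gain can be illustrated with a minimal Python sketch. It assumes an ideal 5-stage pipeline with no hazards and one cycle per stage; the function names are hypothetical and chosen only for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def pipeline_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed on an ideal pipeline (no hazards, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one adds one cycle.
    return num_stages + num_instructions - 1

def multicycle_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed on a non-pipelined multi-cycle machine."""
    return num_stages * num_instructions

def timing_diagram(num_instructions):
    """One text row per instruction; instruction k starts in cycle k."""
    return ["   " * k + " ".join(STAGES) for k in range(num_instructions)]

if __name__ == "__main__":
    print("\n".join(timing_diagram(5)))
    print("pipelined:", pipeline_cycles(5), "cycles")
    print("multi-cycle:", multicycle_cycles(5), "cycles")
```

Under these assumptions, five instructions finish in 9 cycles instead of 25, although each individual instruction still takes 5 cycles.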
![Page 7: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/7.jpg)
Causes of pipeline hazards
Structural hazard: the hardware cannot support the combination of instructions that are in flight in the same cycle.
Data hazard: a later instruction must wait for an earlier instruction to complete because it uses the result of the earlier one.
Control hazard: whether an instruction is executed depends on the result of an earlier instruction.
![Page 8: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/8.jpg)
Structural hazard
[Figure: timing diagram of four overlapped instructions (IF, ID, EX, MA, WB) and CPU block diagram: PC, memory, instruction register, instruction decoder, ALU, registers]
![Page 12: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/12.jpg)
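Structural hazard
[Figure: timing diagram of four overlapped instructions and CPU block diagram (PC, memory, instruction register, instruction decoder, ALU, registers)]
MA/IF conflict: the MA stage of one instruction and the IF stage of a later instruction access the single memory in the same cycle.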
![Page 13: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/13.jpg)
Structural hazard
[Figure: timing diagram in which the last instruction is delayed, and CPU block diagram]
Resolution 1: stall the next instruction.
![Page 15: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/15.jpg)
Structural hazard
[Figure: timing diagram of four overlapped instructions with the MA/IF conflict marked, and CPU block diagram]
Resolution 2: add another bus so that the instruction memory can be accessed separately.
![Page 16: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/16.jpg)
Structural hazard
[Figure: timing diagram of four overlapped instructions and CPU block diagram with a separate instruction memory (Inst Mem) and data memory (Data Mem)]
Harvard architecture
Resolution 2: add another bus so that the instruction memory can be accessed separately.
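The MA/IF conflict can be located with a short Python sketch. It assumes an unstalled 5-stage pipeline in which instruction k is fetched in cycle k and reaches MA in cycle k + 3; the function name and the `harvard` flag are hypothetical, introduced only for this illustration.

```python
def memory_conflicts(uses_data_memory, harvard=False):
    """Return (cycle, ma_instruction, if_instruction) for each memory collision.

    uses_data_memory[k] is True when instruction k accesses memory in its MA stage.
    With separate instruction and data memories (harvard=True), IF and MA never
    compete for the same memory, so no conflict is reported.
    """
    if harvard:
        return []
    conflicts = []
    for k, needs_ma in enumerate(uses_data_memory):
        later = k + 3  # instruction whose IF falls in the same cycle as this MA
        if needs_ma and later < len(uses_data_memory):
            conflicts.append((k + 3, k, later))
    return conflicts

if __name__ == "__main__":
    program = [True, False, False, False]  # only the first instruction is a load/store
    print(memory_conflicts(program))
    print(memory_conflicts(program, harvard=True))
```

With a single memory, the first instruction's MA collides with the fourth instruction's IF in cycle 3 (counting from 0); with separate memories the list is empty.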
![Page 17: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/17.jpg)
Data hazard
[Figure: timing diagram of two overlapped instructions and CPU block diagram (PC, memory, instruction register, instruction decoder, ALU, registers)]
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
sub $t2,$s0,$t3    ($t2 = $s0 - $t3)
Registers: t0 = 5, t1 = 4, t2 = 3, t3 = 2, t4 = 1; s0 = s1 = s2 = s3 = s4 = 0
![Page 20: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/20.jpg)
Data hazard
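[Figure: timing diagram of two overlapped instructions and CPU block diagram]
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
sub $t2,$s0,$t3    ($t2 = $s0 - $t3)
Registers: t0 = 5, t1 = 4, t2 = 3, t3 = 2, t4 = 1; s0 = s1 = s2 = s3 = s4 = 0
The add instruction computes $s0 = $t0 + $t1, but sub reads $s0 before the result has been written back, so it uses the old value: $t2 = $s0 - $t3 gives -2 = 0 - 2.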
![Page 21: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/21.jpg)
Data hazard
[Figure: timing diagram in which the sub instruction is delayed until add has written back $s0, and CPU block diagram]
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
sub $t2,$s0,$t3    ($t2 = $s0 - $t3)
Waiting by stalls: this consumes 3 cycles.
![Page 24: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/24.jpg)
Data hazard
[Figure: timing diagram of two overlapped instructions and CPU block diagram]
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
sub $t2,$s0,$t3    ($t2 = $s0 - $t3)
Resolution: forwarding
The result of add is forwarded to the ALU, so sub computes $t2 = 9 - $t3, that is, 7 = 9 - 2.
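The effect of forwarding on the add/sub pair can be reproduced with a small Python sketch. The register values (t0 = 5, t1 = 4, t3 = 2, s0 = 0) are read off the register figure on the slides, which is an interpretation; the function `run` is hypothetical and models only the two values involved, not a full pipeline.

```python
def run(forwarding):
    """Execute add $s0,$t0,$t1 followed by sub $t2,$s0,$t3 with back-to-back timing."""
    regs = {"t0": 5, "t1": 4, "t2": 3, "t3": 2, "t4": 1, "s0": 0}

    # add: the result is computed in EX but reaches the register file only at WB.
    add_result = regs["t0"] + regs["t1"]

    # sub reads its operands before add's WB. Without forwarding it sees the
    # stale $s0; with forwarding the ALU input takes add's result directly.
    s0_seen_by_sub = add_result if forwarding else regs["s0"]
    sub_result = s0_seen_by_sub - regs["t3"]

    # Both instructions eventually write back.
    regs["s0"] = add_result
    regs["t2"] = sub_result
    return regs

if __name__ == "__main__":
    print("without forwarding: t2 =", run(False)["t2"])
    print("with forwarding:    t2 =", run(True)["t2"])
```

Without forwarding (and without stalls) the sketch yields t2 = -2; with forwarding it yields t2 = 7, matching the values on the slides.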
![Page 25: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/25.jpg)
Control hazard
An instruction sequence including a branch:
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram of the three overlapped instructions and CPU block diagram; PC: 10]
※ In this explanation, the PC uses word addresses for simplicity.
![Page 28: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/28.jpg)
Control hazard
An instruction sequence including a branch:
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram of the three overlapped instructions and CPU block diagram; PC: 12]
The PC value of the next instruction depends on the branch condition.
Branch is taken: PC = 40
Branch is not taken: PC = 12
![Page 29: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/29.jpg)
Control hazard
Resolution 1: stall
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram in which the instruction after beq is delayed by a 2-cycle stall]
The number of required stall cycles is determined by the architecture.
![Page 30: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/30.jpg)
Control hazard
Resolution 1: stall
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram in which the instruction after beq is delayed by a 1-cycle stall]
A 1-cycle stall is enough if the processor can calculate the branch target address at the ID stage.
![Page 31: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/31.jpg)
Control hazard
Resolution 2: branch prediction
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram of the three overlapped instructions and CPU block diagram; the PC advances 10, 11, 12]
In this example, the next PC is predicted as if the branch is never taken.
![Page 34: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/34.jpg)
Control hazard
Resolution 2: branch prediction
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram in which the instruction after beq is stalled, and CPU block diagram; PC: 40]
If the prediction is wrong, in other words if the branch is taken, a stall occurs and the PC is set to 40.
![Page 35: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/35.jpg)
Control hazard
More practical scheme: dynamic branch prediction
n-bit counter-based prediction: the lower i bits of the address of a branch instruction select an entry of the Branch History Table, and each entry is an n-bit saturating up/down counter.
![Page 36: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/36.jpg)
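1-bit counter-based prediction
State 1: predict that the branch will be taken
State 0: predict that the branch will not be taken
Transitions: when the branch is taken, move to state 1; when the branch is not taken, move to state 0.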
![Page 37: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/37.jpg)
2-bit counter-based prediction
States 11 and 10: predict that the branch will be taken
States 01 and 00: predict that the branch will not be taken
Transitions: when the branch is taken, the counter is incremented (saturating at 11); when the branch is not taken, the counter is decremented (saturating at 00).
This scheme is adopted in the Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
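A counter-based predictor of this kind can be sketched in a few lines of Python. This is a minimal sketch, assuming a table indexed by the lower bits of the branch address, 2-bit saturating counters, and an initial counter value of 00; the class name, the table size and the initial state are assumptions made for illustration and are not taken from any particular processor.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating up/down counters."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.table = [0] * (1 << index_bits)  # every counter starts at 00 (assumption)

    def _index(self, branch_address):
        # The lower index_bits bits of the branch address select the counter.
        return branch_address & ((1 << self.index_bits) - 1)

    def predict(self, branch_address):
        # Counter values 10 and 11 predict taken; 00 and 01 predict not taken.
        return self.table[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, branch_address, outcomes):
    """Run the predictor over a sequence of actual outcomes for one branch."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_address) != taken:
            misses += 1
        predictor.update(branch_address, taken)
    return misses


if __name__ == "__main__":
    # A loop branch: taken four times, not taken once at loop exit, then taken again.
    outcomes = [True] * 4 + [False] + [True] * 4
    print(count_mispredictions(TwoBitPredictor(), 11, outcomes))
```

The point of using two bits is visible in the loop example: a single not-taken outcome at the loop exit moves the counter from 11 to 10, so the next execution of the branch is still predicted as taken.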
![Page 38: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/38.jpg)
Control hazard
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
Inserted instruction
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram of the sequence with the inserted instruction, and CPU block diagram; PC: 11]
Resolution 3: delayed branch
An instruction that has no dependency is inserted after the branch.
![Page 40: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/40.jpg)
Control hazard
Resolution 3: delayed branch
add $s0,$t0,$t1    ($s0 = $t0 + $t1)
beq $s1,$s2,40     (if ($s1 == $s2) { goto 40 })
Inserted instruction
or  $s3,$s4,$t2    ($s3 = $s4 | $t2)
[Figure: timing diagram of the sequence with the inserted instruction, and CPU block diagram; PC: 13 or 40]
An instruction that has no dependency is inserted after the branch.
After it, the instruction at the determined address (13 or 40) is executed.
![Page 41: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/41.jpg)
Exploiting ILP (Instruction Level Parallelism)
Superscalar: issues multiple instructions per cycle with hardware support. Advantage: binary compatibility.
VLIW: issues multiple instructions per cycle with compiler support. Advantage: simple hardware.
![Page 42: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/42.jpg)
Types of data dependence
True data dependence (RAW: Read After Write): difficult to remove.
i1: r2 = r1 + r3
i2: r4 = r2 + 1
Anti-dependence (WAR: Write After Read) and output dependence (WAW: Write After Write): can be removed by register renaming. They are called artificial dependences.
i1: r1 = r2 + r3
i2: r2 = r4 + 1    (anti-dependence on i1 through r2)
i3: r1 = r4 + 2    (output dependence on i1 through r1)
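The three dependence types can be detected mechanically from the destination and source registers of two instructions. The following Python sketch assumes each instruction is represented as a (destination, sources) pair; this representation and the function name are hypothetical and used only for illustration.

```python
def dependences(earlier, later):
    """Return the kinds of dependence the later instruction has on the earlier one.

    Each instruction is a pair (destination_register, list_of_source_registers).
    """
    earlier_dst, earlier_srcs = earlier
    later_dst, later_srcs = later
    kinds = set()
    if earlier_dst in later_srcs:
        kinds.add("RAW")  # true dependence: later reads what earlier writes
    if later_dst in earlier_srcs:
        kinds.add("WAR")  # anti-dependence: later overwrites what earlier reads
    if later_dst == earlier_dst:
        kinds.add("WAW")  # output dependence: both write the same register
    return kinds

if __name__ == "__main__":
    # i1: r2 = r1 + r3 ; i2: r4 = r2 + 1
    print(dependences(("r2", ["r1", "r3"]), ("r4", ["r2"])))
    # i1: r1 = r2 + r3 ; i2: r2 = r4 + 1 ; i3: r1 = r4 + 2
    i1, i2, i3 = ("r1", ["r2", "r3"]), ("r2", ["r4"]), ("r1", ["r4"])
    print(dependences(i1, i2), dependences(i1, i3))
```

Renaming the destination of i2 and i3 to fresh registers would make both WAR and WAW checks come back empty, which is why these two are called artificial.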
![Page 43: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/43.jpg)
Basic Architecture of Superscalar Processor
[Figure: block diagram. Frontend: instruction cache, branch prediction, instruction decode and register renaming; instructions are dispatched to the instruction window. Ex-core: instruction window, from which instructions are issued to the function units; registers; data cache. Backend: reorder buffer, which commits results.]
![Page 44: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/44.jpg)
Basic function of Frontend
• provides enough instructions.
• predicts the next instruction address if a branch instruction appears.
• resolves artificial dependences by register renaming.
• analyzes true data dependences after register renaming.
• transfers instructions after the above operations. This operation is called “dispatch”.
![Page 45: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/45.jpg)
Basic function of Ex-core
• finds as many independent instructions as possible among those stored in the “instruction window”. In this operation, dynamic scheduling is performed to satisfy several restrictions: data dependence, resources, predefined priority, etc.
• executes independent instructions in parallel. The operation that transfers an instruction to a function unit is called “issue”.
![Page 46: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/46.jpg)
Basic function of Backend
• updates the processor state.
• Results obtained out of order are reordered into program order.
• The processor state is updated precisely.
• Updating the processor state based on the execution result is called “commit”. Removal of an instruction from the processor is called “retire”.
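In-order commit of out-of-order results can be sketched with a queue. This is a minimal sketch, assuming a reorder buffer that holds instructions in program order and commits only from its head; the class and method names are hypothetical and not taken from the slides.

```python
from collections import deque

class ReorderBuffer:
    """Holds instructions in program order and commits them in that order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction first
        self.registers = {}     # committed (architectural) register state

    def dispatch(self, name, dst):
        entry = {"name": name, "dst": dst, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        # Execution may finish in any order; the result waits in the buffer.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Commit finished instructions from the head only; return their names."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dst"]] = entry["value"]
            committed.append(entry["name"])
        return committed

if __name__ == "__main__":
    rob = ReorderBuffer()
    i1 = rob.dispatch("i1", "r1")
    i2 = rob.dispatch("i2", "r2")
    rob.finish(i2, 7)    # i2 finishes first
    print(rob.commit())  # nothing commits while i1 is unfinished
    rob.finish(i1, 3)
    print(rob.commit(), rob.registers)
```

Because the processor state changes only at the head of the queue, a result that finished early never becomes visible before all older instructions have committed.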
![Page 47: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/47.jpg)
Dynamic instruction scheduling
Instruction scheduling means determining the order in which instructions are issued and when they are issued.
In superscalar processors, dynamic instruction scheduling is performed on the instructions stored in the instruction buffer.
In the following slides, dynamic scheduling is explained using several types of processors: a 1-way in-order processor, an i-way in-order processor, and an i-way out-of-order processor.
![Page 48: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/48.jpg)
1-way in-order issue
• At most 1 instruction is issued per cycle.
• The size of the instruction window is 1, because if an instruction cannot be issued, none of the subsequent instructions can be issued either.
• Only true and output dependences need to be checked, because anti-dependence is always resolved.
![Page 49: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/49.jpg)
Control by R flag
The R flag is used to check true and output dependences.
[Figure: an instruction (op, dst, src1, src2) whose register numbers select entries of the register file; each register holds an R flag and a value]
An instruction is issued only when R(dst) == true && R(src1) == true && R(src2) == true. (This condition is called “ready”.)
R == false means that the register is reserved but the result has not been stored yet. In this case, the operand is not available.
![Page 50: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/50.jpg)
Update sequence of the R flag
• The R bit of the destination becomes false when an instruction is issued.
• The R bit of the destination becomes true when the result is stored in the destination.
By the above update:
• Instructions using unavailable registers as source registers are not issued; true dependence is resolved.
• Instructions using an unavailable register as a destination register are not issued; output dependence is resolved.
In practice, resource restrictions must also be satisfied to issue instructions, in addition to the dependency check. In this lecture, only the restriction on function units is considered, to simplify the discussion.
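The R flag protocol can be sketched as a small Python class. This is a minimal sketch that tracks only the R flags (not the register values or function units); the class and method names are hypothetical.

```python
class RFlagFile:
    """Tracks one R (ready) flag per register for in-order issue."""

    def __init__(self, register_names):
        self.ready = {name: True for name in register_names}

    def can_issue(self, dst, src1, src2):
        # "Ready" condition: R(dst), R(src1) and R(src2) are all true.
        return self.ready[dst] and self.ready[src1] and self.ready[src2]

    def issue(self, dst, src1, src2):
        """Issue if ready; the destination is reserved (R becomes false)."""
        if not self.can_issue(dst, src1, src2):
            return False
        self.ready[dst] = False
        return True

    def write_back(self, dst):
        """The result has been stored; R becomes true again."""
        self.ready[dst] = True

if __name__ == "__main__":
    flags = RFlagFile(["r1", "r2", "r3", "r4"])
    print(flags.issue("r2", "r1", "r3"))  # issued, r2 is now reserved
    print(flags.issue("r4", "r2", "r1"))  # blocked: true dependence on r2
    print(flags.issue("r2", "r3", "r3"))  # blocked: output dependence on r2
    flags.write_back("r2")
    print(flags.issue("r4", "r2", "r1"))  # issued after r2 is written
```

Checking R(dst) as well as the source flags is what blocks a second writer of the same register, so a single flag per register covers both true and output dependences.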
![Page 51: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/51.jpg)
i-way in-order issue
We consider how the following 4 instructions are executed on this processor.
i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1

In-order scheduling

| Cycle | Function Unit 0 | Function Unit 1 |
|---|---|---|
| 0 | i1: r1 = r5 | |
| 1 | i2: r2 = r1 + 1 | i3: r3 = r6 |
| 2 | i4: r4 = r3 + 1 | |

IPC becomes 1.3 (4 instructions / 3 cycles).
![Page 52: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/52.jpg)
How to check dependency of instructions?
True and output dependence must be checked.
[Figure: an instruction window holding instruction 0 to instruction i-1 sends register numbers to the register file (R flag and value per register) through 3 × i ports]
![Page 53: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/53.jpg)
How to allocate resources (function units)?
Allocation of $R_0$ is performed as follows.
• Check whether any of the preceding ready instructions refers to $R_0$. If no instruction refers to $R_0$, the function unit is available.
Repeat the above procedure from $R_1$ to $R_{r-1}$, where $r$ is the number of function units.
![Page 54: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/54.jpg)
Complexity of i-way in-order issue
Ready detection: $3i$ ports are required.
$\sum_{k=1}^{i} 3(k-1) = \frac{3}{2} i(i-1)$ comparators are required to check operand dependency.
Resource allocation: an $(i-1)$-input NOR gate is required.
Complexity increases by $O(i)$ to $O(i^2)$.
![Page 55: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/55.jpg)
i-way out-of-order issue
Out-of-order scheduling of the same code used in the previous i-way in-order case.
i1: r1 = r5
i2: r2 = r1 + 1
i3: r3 = r6
i4: r4 = r3 + 1

Out-of-order scheduling

| Cycle | Function Unit 0 | Function Unit 1 |
|---|---|---|
| 0 | i1: r1 = r5 | i3: r3 = r6 |
| 1 | i2: r2 = r1 + 1 | i4: r4 = r3 + 1 |

IPC becomes 2.0 (4 instructions / 2 cycles).
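Both schedules can be reproduced with a small Python sketch. It assumes two function units, a latency of one cycle per instruction, results that become usable in the following cycle, and that only true dependences constrain issue (as if artificial dependences had been renamed away); the function `schedule` and the tuple representation are hypothetical.

```python
PROGRAM = [
    ("i1", "r1", ["r5"]),  # i1: r1 = r5
    ("i2", "r2", ["r1"]),  # i2: r2 = r1 + 1
    ("i3", "r3", ["r6"]),  # i3: r3 = r6
    ("i4", "r4", ["r3"]),  # i4: r4 = r3 + 1
]

def schedule(program, num_units=2, out_of_order=False):
    """Return, per cycle, the names of the instructions issued in that cycle."""
    pending = list(program)
    cycles = []
    while pending:
        issued = []
        for idx, (name, dst, srcs) in enumerate(pending):
            if len(issued) == num_units:
                break
            # Registers still to be produced by earlier instructions, including
            # those issued in this same cycle (their results arrive next cycle).
            earlier_dsts = {d for (_, d, _) in pending[:idx]}
            if not (set(srcs) & earlier_dsts):
                issued.append((name, dst, srcs))
            elif not out_of_order:
                break  # in-order: a blocked instruction blocks all later ones
        pending = [ins for ins in pending if ins not in issued]
        cycles.append([name for (name, _, _) in issued])
    return cycles

if __name__ == "__main__":
    for mode in (False, True):
        result = schedule(PROGRAM, out_of_order=mode)
        print("out-of-order" if mode else "in-order", result,
              "IPC = %.1f" % (len(PROGRAM) / len(result)))
```

The only difference between the two modes is the `break` on a blocked instruction: in-order issue stops at i2 in cycle 0, while out-of-order issue skips it and picks up i3.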
![Page 56: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/56.jpg)
Architectural requirements for out-of-order execution
• The depth of the instruction window should be increased to $n$.
• The number of register ports must be $3n$ for the dependence check.
• Anti-dependence must be checked, in addition to the checks of the i-way in-order case.
• Resource allocation can be performed in the same way as in the i-way in-order case.
![Page 57: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/57.jpg)
Complexity of i-way out-of-order issue
Ready detection: $3n$ ports are required.
$\sum_{k=1}^{n} 5(k-1) = \frac{5}{2} n(n-1)$ comparators are required to check operand dependency.
Resource allocation: an $(n-1)$-input NOR gate is required.
Complexity increases by $O(n)$ to $O(n^2)$.
The increase in hardware complexity is more significant than in the in-order case because n >> i in general.
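The quadratic growth of the comparator count can be checked by enumeration. The sketch below rests on an interpretation of the slide formulas: each instruction in the window is compared against every preceding instruction, with 3 register comparisons per pair for in-order issue (two sources and the destination against the earlier destination) and 5 per pair for out-of-order issue (additionally the destination against the earlier instruction's two sources). These per-pair counts and the function names are assumptions.

```python
def comparator_count(window_size, comparisons_per_pair):
    """Count comparators by pairing each instruction with all its predecessors."""
    total = 0
    for k in range(1, window_size + 1):
        total += comparisons_per_pair * (k - 1)  # instruction k has k-1 predecessors
    return total

def closed_form(window_size, comparisons_per_pair):
    """Closed form c * w * (w - 1) / 2 of the sum above."""
    return comparisons_per_pair * window_size * (window_size - 1) // 2

if __name__ == "__main__":
    for w in (2, 4, 8, 16, 32):
        print(w, comparator_count(w, 3), comparator_count(w, 5))
```

Doubling the window size roughly quadruples the comparator count, which is why a large out-of-order window is much more expensive than a narrow in-order one.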
![Page 58: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/58.jpg)
Tomasulo’s Algorithm
• was proposed by R.M. Tomasulo in 1967.
• was originally adopted in the floating-point unit of the IBM 360/91.
• Performance was drastically improved.
• Similar algorithms are used in the latest microprocessors.
![Page 59: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/59.jpg)
Superscalar arch using Tomasulo
[Figure: block diagram. Frontend: instruction cache, branch prediction, instruction decode and tag allocation; instructions are dispatched to the reservation stations. Ex-core: reservation stations, from which instructions are issued to the function units; registers; data cache.]
![Page 60: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/60.jpg)
Contents of reservation station and register
Register: R | tag | value. The tag is used for register renaming.
Reservation station: op | dtag | Source 1 (R, stag, value) | Source 2 (R, stag, value)
op: opcode; dtag: destination tag; stag: source tag; R: ready flag; value: operand’s value
![Page 61: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/61.jpg)
Operation on the arch: Dispatch, Issue, Execution, Finalization
![Page 62: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/62.jpg)
Operation on the arch: Dispatch
• A dtag is assigned to the destination operand from the tag pool, which holds unassigned tags.
• Source operands are obtained by reading the registers using each register number. If R is true, the value is read; otherwise the tag is read from the register.
• Then the instruction is stored in the reservation station corresponding to the function unit it uses.
![Page 63: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/63.jpg)
Operation on the arch: Issue
• A ready instruction in a reservation station is sent to the corresponding function unit, if the function unit is available.
• The issued instruction is deleted from the reservation station.
Execution
• Issued instructions are executed on the corresponding function units.
![Page 64: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/64.jpg)
Operation on the arch: Finalize
• When execution produces a result, the dtag and the result value are broadcast on the result bus.
• If an instruction holds the broadcast dtag as an stag, the R flag and value of that operand are replaced by true and the broadcast result value, respectively.
• Only when a register holds a tag matching the broadcast dtag is the broadcast result stored in that register.
• Finally, the broadcast tag is returned to the tag pool.
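The Finalize step can be sketched as a single Python function. This is a minimal sketch, assuming reservation-station entries and registers are plain dictionaries with the fields named on the slides (R, stag/tag, value); the dictionary layout and the function name `broadcast` are hypothetical.

```python
def broadcast(dtag, value, reservation_station, registers, tag_pool):
    """Publish (dtag, value) on the result bus and update all listeners."""
    # Waiting source operands that hold this tag capture the value and become ready.
    for entry in reservation_station:
        for src in entry["sources"]:
            if not src["R"] and src["stag"] == dtag:
                src["R"] = True
                src["value"] = value
                src["stag"] = None
    # A register is written only if it still holds this tag; a register that has
    # been renamed again by a later instruction ignores the broadcast.
    for reg in registers.values():
        if not reg["R"] and reg["tag"] == dtag:
            reg["R"] = True
            reg["value"] = value
            reg["tag"] = None
    # The tag becomes available for reuse.
    tag_pool.append(dtag)

if __name__ == "__main__":
    station = [{"op": "add", "dtag": 51, "sources": [
        {"R": False, "stag": 50, "value": None},
        {"R": True, "stag": None, "value": 7},
    ]}]
    regs = {
        "r1": {"R": False, "tag": 50, "value": None},
        "r2": {"R": False, "tag": 53, "value": None},
    }
    pool = [54]
    broadcast(50, 15, station, regs, pool)
    print(station[0]["sources"][0], regs["r1"], regs["r2"], pool)
```

The tag comparison on the register side is what makes renaming safe: r2 in the usage example waits for tag 53, so an older result carrying a different tag cannot overwrite it.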
![Page 65: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/65.jpg)
An example of Tomasulo
The superscalar processor used in this example has the following 5-stage pipeline, and the number of ways is 2.
• IF: fetches 2 instructions.
• ID: decodes, allocates tags, and dispatches.
• RS: waits for operands until an instruction becomes ready.
• EX: executes an instruction.
• WB: writes a result.
i1: r1 = load A
i2: r2 = r1 + r3
i3: r4 = r2 + 1
i4: r2 = load B
# A and B are constants
![Page 66: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/66.jpg)
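Cycle 0

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | |
| i2: r2 = r1 + r3 | |
| i3: r4 = r2 + 1 | |
| i4: r2 = load B | |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 1 | X | 2 |
| 2 | 1 | X | 4 |
| 3 | 1 | X | 7 |
| 4 | 1 | X | 9 |

Tag pool: 30 ・・・・・・ 54 53 52 51 50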
![Page 67: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/67.jpg)
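Cycle 1

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | IF |
| i2: r2 = r1 + r3 | IF |
| i3: r4 = r2 + 1 | |
| i4: r2 = load B | |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 1 | X | 2 |
| 2 | 1 | X | 4 |
| 3 | 1 | X | 7 |
| 4 | 1 | X | 9 |

Tag pool: 30 ・・・・・・ 54 53 52 51 50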
![Page 68: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/68.jpg)
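Cycle 2

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | 0 | 50 | X | 1 | X | A | 1 | X | 0 |
| add | 0 | 51 | X | 0 | 50 | X | 1 | X | 7 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | ID |
| i2: r2 = r1 + r3 | ID |
| i3: r4 = r2 + 1 | IF |
| i4: r2 = load B | IF |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 0 | 50 | X |
| 2 | 0 | 51 | X |
| 3 | 1 | X | 7 |
| 4 | 1 | X | 9 |

Tag pool: 30 ・・・・・・ 54 53 52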
![Page 69: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/69.jpg)
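Cycle 3

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | 1 | 50 | 15 | 1 | X | A | 1 | X | 0 |
| add | 0 | 51 | X | 0 | 50 | X | 1 | X | 7 |
| add | 0 | 52 | X | | | | | | |
| load | 0 | 53 | X | | | | | | |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | EX |
| i2: r2 = r1 + r3 | RS |
| i3: r4 = r2 + 1 | ID |
| i4: r2 = load B | ID |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 0 | 50 | X |
| 2 | 0 | 53 | X |
| 3 | 1 | X | 7 |
| 4 | 0 | 52 | X |

Tag pool: 30 ・・・・・・ 54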
![Page 70: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/70.jpg)
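Cycle 4

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | 1 | 50 | 15 | 1 | X | A | 1 | X | 0 |
| add | 1 | 51 | 22 | 1 | 50 | 15 | 1 | X | 7 |
| add | 0 | 52 | X | 0 | 51 | X | 1 | X | 1 |
| load | 1 | 53 | 16 | 1 | X | B | 1 | X | 0 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | WB |
| i2: r2 = r1 + r3 | EX |
| i3: r4 = r2 + 1 | RS |
| i4: r2 = load B | EX |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 1 | X | 15 |
| 2 | 0 | 53 | X |
| 3 | 1 | X | 7 |
| 4 | 0 | 52 | X |

Tag pool: 50 30 ・・・・・・ 54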
![Page 71: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/71.jpg)
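Cycle 5

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| add | 1 | 51 | 22 | 1 | 50 | 15 | 1 | X | 7 |
| add | 1 | 52 | 23 | 1 | 51 | 22 | 1 | X | 1 |
| load | 1 | 53 | 16 | 1 | X | B | 1 | X | 0 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | |
| i2: r2 = r1 + r3 | WB |
| i3: r4 = r2 + 1 | EX |
| i4: r2 = load B | WB |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 1 | X | 15 |
| 2 | 1 | X | 16 |
| 3 | 1 | X | 7 |
| 4 | 0 | 52 | X |

Tag pool: 53 51 50 30 ・・・・・・ 54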
![Page 72: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/72.jpg)
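Cycle 6

| op | Dest R | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| add | 1 | 52 | 23 | 1 | 51 | 22 | 1 | X | 1 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A | |
| i2: r2 = r1 + r3 | |
| i3: r4 = r2 + 1 | WB |
| i4: r2 = load B | |

Registers

| # | R | tag | Val |
|---|---|---|---|
| 1 | 1 | X | 15 |
| 2 | 1 | X | 16 |
| 3 | 1 | X | 7 |
| 4 | 1 | X | 23 |

Tag pool: 52 53 51 50 30 ・・・ 54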
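The tag bookkeeping in the trace above can be sketched in C as follows. This is only a minimal illustration of renaming at decode and of the tag check at write back, not a full Tomasulo model. The names `RegEntry`, `Operand`, `set_reg`, `read_source`, `rename_dest` and `writeback` are hypothetical, and the tag pool is simplified to a counter starting at 50 (tags are never returned to the pool here, unlike in the slides).

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 8

/* One register entry, as in the "Registers" table:
   R = 1 means val is valid, R = 0 means the value will arrive under tag. */
typedef struct {
    bool ready;
    int  tag;
    int  val;
} RegEntry;

/* A source operand as captured by a reservation station entry. */
typedef struct {
    bool ready;
    int  stag;
    int  val;
} Operand;

static RegEntry regs[NUM_REGS];
static int next_tag = 50;   /* assumption: simplified tag pool, no recycling */

/* Give a register a valid initial value. */
void set_reg(int r, int val) {
    assert(r >= 0 && r < NUM_REGS);
    regs[r].ready = true;
    regs[r].tag = 0;
    regs[r].val = val;
}

/* Decode: read a source register. The value is taken if it is ready,
   otherwise the tag of the instruction that will produce it. */
Operand read_source(int r) {
    assert(r >= 0 && r < NUM_REGS);
    Operand op = { regs[r].ready, regs[r].tag, regs[r].val };
    return op;
}

/* Decode: allocate a fresh tag for the destination register. */
int rename_dest(int r) {
    assert(r >= 0 && r < NUM_REGS);
    regs[r].ready = false;
    regs[r].tag = next_tag++;
    return regs[r].tag;
}

/* Write back: a register is updated only if it still waits for this tag. */
void writeback(int tag, int val) {
    for (int r = 0; r < NUM_REGS; r++) {
        if (!regs[r].ready && regs[r].tag == tag) {
            regs[r].ready = true;
            regs[r].val = val;
        }
    }
}
```

Sources should be read before the destination is renamed, so that an instruction such as r2 = r2 + 1 picks up the old tag of r2. In the trace, the result of i2 (tag 51) is not written to r2 because r2 has been renamed to tag 53 by i4; `writeback(51, 22)` behaves the same way in this sketch.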
![Page 73: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/73.jpg)
Problem of out-of-order execution
It is difficult to update the processor state precisely when an exception occurs.

(Fin: finished, E: exception raised)

| Instruction | In-order execution | Out-of-order execution |
|---|---|---|
| i0: ・・・・ | Fin | Fin |
| i1: ・・・・ | Fin | |
| i2: r1 = load r1 | Fin | Fin |
| i3: r2 = load r3 | E | E |
| i4: ・・・・ | | |
| i5: r3 = r4 << r2 | | Fin |
| i6: ・・・・ | | |
![Page 74: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/74.jpg)
Flow of exception handling
• Unfinished instructions, including the instruction that caused the exception, are invalidated.
• Control is transferred to the OS, which saves the current state to main memory and handles the exception.
• After the exception has been handled, the CPU resumes execution from the instruction that caused the exception.
![Page 75: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/75.jpg)
Problem of out-of-order execution
It is difficult to update the processor state precisely when an exception occurs.

In-order execution

| Instruction | State |
|---|---|
| i0: ・・・・ | Fin |
| i1: ・・・・ | Fin |
| i2: r1 = load r1 | Fin |
| i3: r2 = load r3 | E |
| i4: ・・・・ | |
| i5: r3 = r4 << r2 | |
| i6: ・・・・ | |

• Save the current state.
• OS handles the exception.
• CPU restarts from i3.
![Page 76: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/76.jpg)
Problem of out-of-order execution
It is difficult to update the processor state precisely when an exception occurs.

Out-of-order execution

| Instruction | State |
|---|---|
| i0: ・・・・ | Fin |
| i1: ・・・・ | |
| i2: r1 = load r1 | Fin |
| i3: r2 = load r3 | E |
| i4: ・・・・ | |
| i5: r3 = r4 << r2 | Fin |
| i6: ・・・・ | |

• Save the current state.
  • i5 has finished before i3.
  • i1 has not finished.
  • The data of r3 has been lost.
• OS handles the exception.
• CPU cannot restart from i3.

A reorder buffer is used for precise exception handling.
![Page 77: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/77.jpg)
Reorder buffer

• Updates the CPU's state in the original program order by reordering results.
• Handles exceptions at the time of the state update.

[Figure: results and information about exceptions enter the reorder buffer, which stores the results in the original program order and detects exceptions; the results are then committed to the registers.]
![Page 78: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/78.jpg)
Superscalar architecture using Tomasulo's algorithm and a reorder buffer

[Figure: block diagram divided into Frontend, Ex-core and Backend. Components: instruction cache, branch prediction, instruction decode / tag allocation, reservation station, function units, data cache, reorder buffer, registers. Arrows: dispatch, issue, commit.]
![Page 79: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/79.jpg)
Behaviour of reorder buffer
• If a result is available without an exception, it is written to the register and the corresponding entry is removed.
• If a result is available with an exception, the pipeline and the reorder buffer are cleared.
• If the result has not been stored yet, the reorder buffer waits until it is obtained.
![Page 80: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/80.jpg)
Contents of reorder buffer
• PC: instruction address
• R: ready flag
• dreg: register number of the destination
• dtag: operand tag of the destination
• E: exception flag
• result: result

Entry layout: result | E | dtag | dreg | R | PC
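The entry fields and the commit behaviour described above can be sketched in C as follows. This is a minimal sketch under stated assumptions: the names `RobEntry`, `Rob`, `CommitStatus` and `rob_commit`, the buffer size of 6 and the 8-entry register file are all hypothetical, and clearing the pipeline itself is not modelled (only the reorder buffer is emptied).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROB_SIZE 6   /* assumption: six entries */
#define NUM_REGS 8   /* assumption: eight registers */

typedef struct {
    uint32_t pc;      /* PC: instruction address */
    bool     ready;   /* R: the result has been written back */
    int      dreg;    /* dreg: register number of the destination */
    int      dtag;    /* dtag: operand tag of the destination */
    bool     exc;     /* E: exception flag */
    int32_t  result;  /* result */
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    int head;         /* index of the oldest entry */
    int count;        /* number of valid entries */
} Rob;

typedef enum {
    COMMIT_DONE,       /* oldest result written to its register */
    COMMIT_WAIT,       /* oldest result not yet available */
    COMMIT_EXCEPTION,  /* oldest entry raised an exception, buffer cleared */
    COMMIT_EMPTY       /* nothing to commit */
} CommitStatus;

/* Try to commit the oldest entry, in the original program order. */
CommitStatus rob_commit(Rob *rob, int32_t *regs, uint32_t *exc_pc) {
    assert(rob != NULL && regs != NULL && exc_pc != NULL);
    if (rob->count == 0)
        return COMMIT_EMPTY;

    RobEntry *e = &rob->entry[rob->head];
    if (!e->ready)
        return COMMIT_WAIT;            /* wait until the result is obtained */

    if (e->exc) {
        *exc_pc = e->pc;               /* address of the faulting instruction */
        rob->count = 0;                /* clear the reorder buffer */
        return COMMIT_EXCEPTION;
    }

    assert(e->dreg >= 0 && e->dreg < NUM_REGS);
    regs[e->dreg] = e->result;         /* update the architectural state */
    rob->head = (rob->head + 1) % ROB_SIZE;
    rob->count--;
    return COMMIT_DONE;
}
```

Because only the oldest entry is ever committed, a younger instruction that has already finished cannot change the registers before an older instruction that raises an exception.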
![Page 81: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/81.jpg)
Operand bypass and supply of source operand tag
• Tomasulo: operand values are obtained from the registers, which hold the latest values.
• Reorder buffer: the latest values are stored in the reorder buffer (not in the registers).

Procedure for obtaining operands:
• Check for dependencies on instructions decoded concurrently. If there is a dependency, stag becomes the dtag of the instruction depended on.
• Otherwise, the reorder buffer is searched by source register number to obtain the value (when R=1) or the tag (when R=0).
• If the reorder buffer has neither a value nor a tag for the register number, the value is obtained from the registers.
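The procedure above can be sketched in C as follows. This is a minimal sketch, not the author's implementation: `RobEntry`, `Operand` and `fetch_operand` are hypothetical names, the reorder buffer is passed as a plain array ordered from oldest to newest instead of a circular buffer, and the search is assumed to go from the newest entry backwards so that the latest producer of a register is found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_REGS 8   /* assumption: eight registers */

typedef struct {
    bool ready;   /* R */
    int  dreg;    /* destination register number */
    int  dtag;    /* destination tag */
    int  result;  /* valid when ready is true */
} RobEntry;

typedef struct {
    bool ready;   /* R: val is valid */
    int  stag;    /* tag to wait for when R = 0 */
    int  val;
} Operand;

/* src        : source register number
   older_dreg : destination register of the instruction decoded in the same
                cycle and earlier in program order, or -1 if there is none
   older_dtag : destination tag of that instruction
   rob[0..n-1]: valid reorder buffer entries, oldest first
   regs       : register file */
Operand fetch_operand(int src, int older_dreg, int older_dtag,
                      const RobEntry *rob, int n, const int *regs)
{
    assert(src >= 0 && src < NUM_REGS);
    assert(rob != NULL && regs != NULL && n >= 0);
    Operand op = { false, 0, 0 };

    /* 1. Dependency on an instruction decoded concurrently. */
    if (older_dreg == src) {
        op.stag = older_dtag;
        return op;
    }

    /* 2. Search the reorder buffer, newest entry first. */
    for (int i = n - 1; i >= 0; i--) {
        if (rob[i].dreg == src) {
            if (rob[i].ready) {
                op.ready = true;
                op.val = rob[i].result;
            } else {
                op.stag = rob[i].dtag;
            }
            return op;
        }
    }

    /* 3. No pending producer: read the register file. */
    op.ready = true;
    op.val = regs[src];
    return op;
}
```

For example, with pending entries for r1 (ready, result 15), r2 (tag 21), r2 (tag 22) and r5 (tag 23), a read of r1 returns the value 15, a read of r2 returns tag 22, and a read of r3 falls through to the register file.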
![Page 82: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/82.jpg)
An example of reorder buffer
The superscalar processor used in this example is 2-way and has the following 6-stage pipeline.
• IF: fetches 2 instructions.
• ID: decodes, allocates tags, and dispatches.
• RS: waits for operands until the instruction becomes ready.
• EX: executes the instruction.
• WB: writes the result to the reorder buffer.
• RT: writes the result to the registers.
![Page 83: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/83.jpg)
Code used in the example

| Instruction | Address of instruction | Operation |
|---|---|---|
| i1 | 0x40 | r1 = load A (r0) |
| i2 | 0x44 | r2 = r1 + r3 |
| i3 | 0x48 | r2 = r2 + 16 |
| i4 | 0x4C | r5 = load 0 (r1) |
| i5 | 0x50 | r1 = r1 + 1 |
| i6 | 0x54 | r2 = load 0 (r2) |
![Page 84: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/84.jpg)
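Cycle 0

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | |
| i2: r2 = r1 + r3 | |
| i3: r2 = r2 + 16 | |
| i4: r5 = load 0(r1) | |
| i5: r1 = r1 + 1 | |
| i6: r2 = load 0(r2) | |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| H/T | 20 | | | | | | |
| | 21 | | | | | | |
| | 22 | | | | | | |
| | 23 | | | | | | |
| | 24 | | | | | | |
| | 25 | | | | | | |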
![Page 85: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/85.jpg)
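Cycle 1

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | IF |
| i2: r2 = r1 + r3 | IF |
| i3: r2 = r2 + 16 | |
| i4: r5 = load 0(r1) | |
| i5: r1 = r1 + 1 | |
| i6: r2 = load 0(r2) | |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| H/T | 20 | | | | | | |
| | 21 | | | | | | |
| | 22 | | | | | | |
| | 23 | | | | | | |
| | 24 | | | | | | |
| | 25 | | | | | | |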
![Page 86: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/86.jpg)
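Cycle 2

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | X | 20 | X | 1 | X | A | 1 | X | 0 |
| add | X | 21 | X | 0 | 20 | X | 1 | X | 7 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | ID |
| i2: r2 = r1 + r3 | ID |
| i3: r2 = r2 + 16 | IF |
| i4: r5 = load 0(r1) | IF |
| i5: r1 = r1 + 1 | |
| i6: r2 = load 0(r2) | |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| Head | 20 | 40 | 0 | 1 | 20 | X | X |
| | 21 | 44 | 0 | 2 | 21 | X | X |
| Tail | 22 | | | | | | |
| | 23 | | | | | | |
| | 24 | | | | | | |
| | 25 | | | | | | |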
![Page 87: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/87.jpg)
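Cycle 3

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | 0 | 20 | 15 | 1 | X | A | 1 | X | 0 |
| add | X | 21 | X | 0 | 20 | X | 1 | X | 7 |
| add | X | 22 | X | 0 | 21 | X | 1 | X | 16 |
| load | X | 23 | X | 1 | X | 0 | 0 | 20 | X |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | EX |
| i2: r2 = r1 + r3 | RS |
| i3: r2 = r2 + 16 | ID |
| i4: r5 = load 0(r1) | ID |
| i5: r1 = r1 + 1 | IF |
| i6: r2 = load 0(r2) | IF |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| Head | 20 | 40 | 0 | 1 | 20 | X | X |
| | 21 | 44 | 0 | 2 | 21 | X | X |
| | 22 | 48 | 0 | 2 | 22 | X | X |
| | 23 | 4C | 0 | 5 | 23 | X | X |
| Tail | 24 | | | | | | |
| | 25 | | | | | | |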
![Page 88: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/88.jpg)
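Cycle 4

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | 0 | 20 | 15 | 1 | X | A | 1 | X | 0 |
| add | 0 | 21 | 22 | 1 | 20 | 15 | 1 | X | 7 |
| add | X | 22 | X | 0 | 21 | X | 1 | X | 16 |
| load | 1 | 23 | ? | 1 | X | 0 | 1 | 20 | 15 |
| add | X | 24 | X | 1 | X | 15 | 1 | X | 1 |
| load | X | 25 | X | 1 | X | 0 | 0 | 22 | X |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | WB |
| i2: r2 = r1 + r3 | EX |
| i3: r2 = r2 + 16 | RS |
| i4: r5 = load 0(r1) | EX |
| i5: r1 = r1 + 1 | ID |
| i6: r2 = load 0(r2) | ID |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| Head | 20 | 40 | 1 | 1 | 20 | 0 | 15 |
| | 21 | 44 | 0 | 2 | 21 | X | X |
| | 22 | 48 | 0 | 2 | 22 | X | X |
| | 23 | 4C | 0 | 5 | 23 | X | X |
| | 24 | 50 | 0 | 1 | 24 | X | X |
| | 25 | 54 | 0 | 2 | 25 | X | X |
| Tail | | | | | | | |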
![Page 89: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/89.jpg)
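Cycle 5

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| add | 0 | 21 | 22 | 1 | 20 | 15 | 1 | X | 7 |
| add | 0 | 22 | 38 | 1 | 21 | 22 | 1 | X | 16 |
| load | 1 | 23 | ? | 1 | X | 0 | 1 | 20 | 15 |
| add | 0 | 24 | 16 | 1 | X | 15 | 1 | X | 1 |
| load | X | 25 | X | 1 | X | 0 | 0 | 22 | X |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | RT |
| i2: r2 = r1 + r3 | WB |
| i3: r2 = r2 + 16 | EX |
| i4: r5 = load 0(r1) | WB |
| i5: r1 = r1 + 1 | EX |
| i6: r2 = load 0(r2) | RS |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| | 20 | 40 | 1 | 1 | 20 | 0 | 15 |
| Head | 21 | 44 | 1 | 2 | 21 | 0 | 22 |
| | 22 | 48 | 0 | 2 | 22 | X | X |
| | 23 | 4C | 1 | 5 | 23 | 1 | ? |
| | 24 | 50 | 0 | 1 | 24 | X | X |
| | 25 | 54 | 0 | 2 | 25 | X | X |
| Tail | | | | | | | |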
![Page 90: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/90.jpg)
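Cycle 6

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| add | 0 | 22 | 38 | 1 | 21 | 22 | 1 | X | 16 |
| add | 0 | 24 | 16 | 1 | X | 15 | 1 | X | 1 |
| load | 0 | 25 | 4 | 1 | X | 0 | 1 | 22 | 38 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | |
| i2: r2 = r1 + r3 | RT |
| i3: r2 = r2 + 16 | WB |
| i4: r5 = load 0(r1) | RT |
| i5: r1 = r1 + 1 | WB |
| i6: r2 = load 0(r2) | EX |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| | 20 | 40 | 1 | 1 | 20 | 0 | 15 |
| | 21 | 44 | 1 | 2 | 21 | 0 | 22 |
| Head | 22 | 48 | 1 | 2 | 22 | 0 | 38 |
| | 23 | 4C | 1 | 5 | 23 | 1 | ? |
| | 24 | 50 | 1 | 1 | 24 | 0 | 16 |
| | 25 | 54 | 0 | 2 | 25 | X | X |
| Tail | | | | | | | |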
![Page 91: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/91.jpg)
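Cycle 7

| op | Dest E | Dest dtag | Dest val | Src1 R | Src1 stag | Src1 val | Src2 R | Src2 stag | Src2 val |
|---|---|---|---|---|---|---|---|---|---|
| load | 0 | 25 | 4 | 1 | X | 0 | 1 | 22 | 38 |

State of instructions

| Instruction | Stage |
|---|---|
| i1: r1 = load A(r0) | |
| i2: r2 = r1 + r3 | |
| i3: r2 = r2 + 16 | RT |
| i4: r5 = load 0(r1) | RT |
| i5: r1 = r1 + 1 | RT |
| i6: r2 = load 0(r2) | WB |

Reorder buffer

| pointer | entry | PC | R | dreg | dtag | E | result |
|---|---|---|---|---|---|---|---|
| | 20 | 40 | 1 | 1 | 20 | 0 | 15 |
| | 21 | 44 | 1 | 2 | 21 | 0 | 22 |
| | 22 | 48 | 1 | 2 | 22 | 0 | 38 |
| H/T | 23 | 4C | 1 | 5 | 23 | 1 | ? |
| | 24 | 50 | 1 | 1 | 24 | 0 | 16 |
| | 25 | 54 | 0 | 2 | 25 | X | X |

An exception is detected.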
![Page 92: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/92.jpg)
VLIW (Very Long Instruction Word)
In a VLIW processor, the compiler extracts the parallelism in the code. Therefore, the special hardware support used in a superscalar processor becomes unnecessary.
• Superscalar: dynamic scheduling by hardware support
• VLIW: static scheduling by the compiler
![Page 93: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/93.jpg)
Overview of VLIW

[Figure: comparison of the two flows. Superscalar: the compiler performs code generation (main(){ ・・・ } to add sub ・・・); the processor performs scheduling and execution. VLIW: the compiler performs code generation and scheduling (add sub load add mul load ・・・); the processor performs execution only.]
![Page 94: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/94.jpg)
VLIW code
Sequential code

    i1: r3 = r4 + 1
    i2: r1 = load(r2)
    i3: r1 = r1 << r3
    i4: r5 = r2 + r6
    i5: beq r5, L

VLIW code

| ALU | ALU | MEM | Branch |
|---|---|---|---|
| i1: r3 = r4 + 1 | i4: r5 = r2 + r6 | i2: r1 = load(r2) | nop |
| i3: r1 = r1 << r3 | nop | nop | i5: beq r5, L |
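How a compiler might pack the sequential code above into long instruction words can be sketched with a simple greedy scheduler. This is an illustrative sketch only, not the algorithm of any particular VLIW compiler: `Op`, `OpKind`, `schedule` and the slot layout (two ALU slots, one MEM slot, one Branch slot) are assumptions taken from the slide's example, every operation is assumed to take one cycle, and only true (read-after-write) dependencies are tracked.

```c
#include <assert.h>
#include <stdbool.h>

enum { SLOT_ALU0, SLOT_ALU1, SLOT_MEM, SLOT_BR, NUM_SLOTS };
typedef enum { OP_ALU, OP_MEM, OP_BR } OpKind;

#define MAX_OPS 16
#define NO_DEP  (-1)

typedef struct {
    OpKind kind;
    int dep[2];   /* indices of earlier ops whose results are read, or NO_DEP */
} Op;

/* Return a free slot of the right kind in one bundle, or -1 if none. */
static int free_slot(const bool used[NUM_SLOTS], OpKind kind) {
    if (kind == OP_ALU) {
        if (!used[SLOT_ALU0]) return SLOT_ALU0;
        if (!used[SLOT_ALU1]) return SLOT_ALU1;
        return -1;
    }
    if (kind == OP_MEM) return used[SLOT_MEM] ? -1 : SLOT_MEM;
    return used[SLOT_BR] ? -1 : SLOT_BR;
}

/* Greedy scheduling in program order: each op is placed in the earliest
   bundle that comes after all of its producers and still has a free slot.
   bundle_of[i] receives the bundle index of op i.
   Returns the number of bundles used. */
int schedule(const Op *ops, int n, int *bundle_of) {
    assert(n >= 0 && n <= MAX_OPS);
    bool used[MAX_OPS][NUM_SLOTS] = {{false}};
    int bundles = 0;

    for (int i = 0; i < n; i++) {
        int earliest = 0;
        for (int d = 0; d < 2; d++) {
            int p = ops[i].dep[d];
            assert(p == NO_DEP || (p >= 0 && p < i));
            if (p != NO_DEP && bundle_of[p] + 1 > earliest)
                earliest = bundle_of[p] + 1;
        }
        /* Op i never needs a bundle index above i, so b stays below MAX_OPS. */
        for (int b = earliest; b < MAX_OPS; b++) {
            int slot = free_slot(used[b], ops[i].kind);
            if (slot >= 0) {
                used[b][slot] = true;
                bundle_of[i] = b;
                if (b + 1 > bundles) bundles = b + 1;
                break;
            }
        }
    }
    return bundles;
}
```

For the five instructions of the example (i3 depends on i1 and i2, i5 depends on i4), this sketch places i1, i2 and i4 in the first bundle and i3 and i5 in the second, which is the same packing as the VLIW code on the slide; the unused slots correspond to the nop entries.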
![Page 95: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/95.jpg)
Hardware organization of VLIW

[Figure: block diagram with an instruction cache, the function units ALU, ALU, MEM and Branch, the registers, and a data cache.]
![Page 96: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/96.jpg)
VLIW vs Superscalar
| | Superscalar | VLIW |
|---|---|---|
| Hardware size | Large | Small |
| Hardware complexity | Large | Small |
| Scheduling algorithm | Poor | Rich |
| Instruction window | Small | Large |
| Binary compatibility | Compatible | Not compatible |
![Page 97: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/97.jpg)
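Dynamic vs Static scheduling

Sample code

    i1: r1 = load A
    i2: r2 = load(r1)
    i3: r3 = load B
    i4: r4 = r3 << r2
    i5: r5 = r4 + 1
    i6: r6 = r2 + r5

[Figure: data dependency graph of the code, with nodes i1 to i6.]

Dynamic scheduling

| Cycle | ALU | MEM |
|---|---|---|
| 0 | | i1 |
| 1 | | i2 |
| 2 | | i3 |
| 3 | i4 | |
| 4 | i5 | |
| 5 | i6 | |

Optimal scheduling

| Cycle | ALU | MEM |
|---|---|---|
| 0 | | i3 |
| 1 | i4 | i1 |
| 2 | i5 | i2 |
| 3 | i6 | |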
![Page 98: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/98.jpg)
Advantage of dynamic scheduling
• Scheduling is based on information that can only be obtained at run time.
  • For example, the latency of a cache miss can be hidden.
• Scheduling is based on accurate memory dependencies.
  • Data addresses, which can be obtained only at run time, improve scheduling performance.
![Page 99: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/99.jpg)
Taxonomy of scheduling algorithms
• Local scheduling
• Global scheduling
• Cyclic scheduling
• Acyclic scheduling
• Trace-based scheduling
• DAG-based (directed acyclic graph) scheduling
![Page 100: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/100.jpg)
VLIW-based commercial processors
• Transmeta Crusoe: aimed at mobile computing
• Texas Instruments TMS320C6x series: embedded applications
• Intel Itanium
![Page 101: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/101.jpg)
Parallel operation by SIMD
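What is SIMD?: SIMD (Single Instruction Multiple Data) means that the same operation is applied to several operands at once.

Ex: Addition

    int i;
    int a[4] = {1, 2, 3, 4};
    int b[4] = {5, 6, 7, 8};
    int c[4];
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }

Sequential operation: the four additions are executed one after another.

    c[0] = a[0] + b[0]
    c[1] = a[1] + b[1]
    c[2] = a[2] + b[2]
    c[3] = a[3] + b[3]

SIMD: the four additions are executed by one operation.

[Figure: the vector (a[3], a[2], a[1], a[0]) is added to the vector (b[3], b[2], b[1], b[0]) to give (c[3], c[2], c[1], c[0]).]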
![Page 102: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/102.jpg)
SIMD data types (Cell/B.E.)
| Type | Contents |
|---|---|
| vector unsigned char | 16 unsigned 8-bit values |
| vector signed char | 16 signed 8-bit values |
| vector unsigned short | 8 unsigned 16-bit values |
| vector signed short | 8 signed 16-bit values |
| vector unsigned int | 4 unsigned 32-bit values |
| vector signed int | 4 signed 32-bit values |
| vector unsigned long long | 2 unsigned 64-bit values |
| vector signed long long | 2 signed 64-bit values |
| vector float | 4 32-bit floating-point values |
| vector double | 2 64-bit double-precision floating-point values |
![Page 103: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/103.jpg)
Allocation of vector values
Vector values are stored in memory in big-endian order, as shown in the following figure.
*This figure is adapted from cell.fixstars.com
![Page 104: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/104.jpg)
How to access vector type via normal pointer
    vector signed int va = (vector signed int) { 1, 2, 3, 4 };
    int *a = (int *) &va;
*This figure is adapted from cell.fixstars.com
![Page 105: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/105.jpg)
How to access a normal array from vector type
    int a[8] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8 };
    vector signed int *va = (vector signed int *) a;

*This figure is adapted from cell.fixstars.com

__attribute__((aligned(16))) forces the scalar array to be aligned on a 16-byte boundary, which is required when it is accessed as vector data.
![Page 106: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/106.jpg)
SIMD operation on PPE
    int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
    int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
    int c[4] __attribute__((aligned(16)));
    vector signed int *va = (vector signed int *) a;
    vector signed int *vb = (vector signed int *) b;
    vector signed int *vc = (vector signed int *) c;
    *vc = vec_add(*va, *vb);

[Figure: the vector (a[3], a[2], a[1], a[0]) is added to the vector (b[3], b[2], b[1], b[0]) to give (c[3], c[2], c[1], c[0]).]

vec_add is a SIMD function provided by VMX (Vector Multimedia Extension, also known as AltiVec), proposed by IBM and Motorola.
![Page 107: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/107.jpg)
Entire code for vector addition
    #include <stdio.h>
    #include <altivec.h>

    int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
    int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
    int c[4] __attribute__((aligned(16)));

    int main(int argc, char **argv)
    {
        vector signed int *va = (vector signed int *) a;
        vector signed int *vb = (vector signed int *) b;
        vector signed int *vc = (vector signed int *) c;

        *vc = vec_add(*va, *vb);

        printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
        return 0;
    }
![Page 108: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/108.jpg)
A part of the VMX functions

| Category | Function | Operation |
|---|---|---|
| Arithmetic operation | vec_add(a,b) | a+b |
| | vec_sub(a,b) | a-b |
| | vec_madd(a,b,c) | a*b+c |
| Logical operation | vec_and(a,b) | Logical and |
| | vec_or(a,b) | Logical or |
| Bit operation | vec_perm(a,b,c) | creates a new vector from the bytes of a and b, selected by c[i] |
| | vec_sel(a,b,c) | selects the bits of a[i] or b[i] based on c[i] |
| Branch | vec_cmpeq(a,b) | a[i]==b[i] |
| | vec_cmpgt(a,b) | a[i]>b[i] |
| Type conversion | vec_ctf(a,b) | (float)a[i]/(2^b) |
| | vec_ctu(a,b) | (unsigned int)(a[i]*(2^b)) |
| Generating constant | vec_splat(a,b) | all elements set to a[b] |
| | vec_splat_s32(a) | all elements set to the signed constant a |
![Page 109: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/109.jpg)
How to create dense vector data
In general, vector data is not stored densely. Therefore, dense vector data must be created before a vector operation.

    vc = vec_perm(va, vb, vpat);
*This figure is adapted from cell.fixstars.com
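The byte selection performed by vec_perm can be modelled in plain C as follows. This is a scalar sketch for illustration, not the VMX implementation: the function name `perm16` is hypothetical, and the vectors are represented as ordinary 16-byte arrays. Each result byte is picked from the 32-byte concatenation of a (indices 0 to 15) and b (indices 16 to 31), using the low 5 bits of the corresponding pattern byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 16-byte permute.
   out must not overlap a or b. */
void perm16(const uint8_t *a, const uint8_t *b,
            const uint8_t *pat, uint8_t *out)
{
    assert(a != NULL && b != NULL && pat != NULL && out != NULL);
    for (int i = 0; i < 16; i++) {
        uint8_t idx = pat[i] & 0x1F;              /* only 5 bits are used */
        out[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}
```

With a pattern of 0, 16, 1, 17, ... the result interleaves the first bytes of a and b, which is the kind of gathering needed to turn sparsely stored data into a dense vector.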
![Page 110: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/110.jpg)
Ex of vec_perm: Transpose
*These figures are adapted from cell.fixstars.com
![Page 111: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/111.jpg)
Branch on SIMD
*These figures are adapted from cell.fixstars.com
![Page 112: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/112.jpg)
Procedure of SIMD Branch
*These figures are adapted from cell.fixstars.com
![Page 113: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/113.jpg)
Detail of SIMD Branch
[Figures: operation of vec_cmpgt() and vec_sel().]

*These figures are adapted from cell.fixstars.com
![Page 114: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/114.jpg)
Ex of SIMD Branch

    int a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
    int b[16] = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    int c[16];

    int main(void)
    {
        int i;

        /* Scalar version: one comparison and one branch per element. */
        for (i = 0; i < 16; i++) {
            if (a[i] > b[i]) {
                c[i] = a[i] - b[i];
            } else {
                c[i] = b[i] - a[i];
            }
        }
        return 0;
    }
![Page 115: Real-time Signal Processing on Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062305/56816464550346895dd646ec/html5/thumbnails/115.jpg)
Ex of SIMD Branch

    #include <altivec.h>

    int a[16] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8,
                                               9, 10, 11, 12, 13, 14, 15, 16 };
    int b[16] __attribute__((aligned(16))) = { 16, 15, 14, 13, 12, 11, 10, 9,
                                               8, 7, 6, 5, 4, 3, 2, 1 };
    int c[16] __attribute__((aligned(16)));

    int main(void)
    {
        vector signed int *va = (vector signed int *) a;
        vector signed int *vb = (vector signed int *) b;
        vector signed int *vc = (vector signed int *) c;
        vector signed int vc_true, vc_false;
        vector unsigned int vpat;
        int i;

        /* Each iteration processes four elements. */
        for (i = 0; i < 4; i++) {
            /* All-ones mask in each element where a > b, zero elsewhere. */
            vpat = (vector unsigned int) vec_cmpgt(va[i], vb[i]);
            vc_true = vec_sub(va[i], vb[i]);   /* result if a > b */
            vc_false = vec_sub(vb[i], va[i]);  /* result otherwise */
            /* Take vc_true where the mask is set, vc_false elsewhere. */
            vc[i] = vec_sel(vc_false, vc_true, vpat);
        }
        return 0;
    }
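The compare-and-select idea used by the SIMD version can also be checked in portable C, one element at a time. This is a scalar sketch for illustration only: `cmpgt_mask`, `sel_bits` and `abs_diff` are hypothetical helper names that mimic what vec_cmpgt and vec_sel do for a single 32-bit element, and the sketch assumes the differences fit in a signed 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-element compare: all-ones mask if a > b, zero otherwise. */
static uint32_t cmpgt_mask(int32_t a, int32_t b) {
    return (a > b) ? 0xFFFFFFFFu : 0u;
}

/* Bitwise select: bits of t where mask is 1, bits of f where mask is 0. */
static uint32_t sel_bits(uint32_t f, uint32_t t, uint32_t mask) {
    return (f & ~mask) | (t & mask);
}

/* c[i] = |a[i] - b[i]|: both candidate results are computed,
   then one of them is chosen by the mask. */
void abs_diff(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    assert(a != NULL && b != NULL && c != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        uint32_t mask     = cmpgt_mask(a[i], b[i]);
        uint32_t if_true  = (uint32_t)a[i] - (uint32_t)b[i];
        uint32_t if_false = (uint32_t)b[i] - (uint32_t)a[i];
        c[i] = (int32_t)sel_bits(if_false, if_true, mask);
    }
}
```

With the arrays of the example (a = 1 to 16, b = 16 down to 1), this gives c[0] = 15, c[7] = 1, c[8] = 1 and c[15] = 15, the same values as the scalar if/else loop.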