ARMOR Asynchronous RISC Microprocessor, Technion - Israel Institute of Technology...
Date posted: 19-Dec-2015
ARMOR: Asynchronous RISC Microprocessor
Technion - Israel Institute of Technology, Faculty of Electrical Engineering, High-Speed Digital Systems Laboratory
Submitted by: Tziki Oz-Sinay, Ori Lempel
Supervised by: Rony Mitleman
Final Presentation
Milestones Reached
• First Semester
– Thorough ramp-up made on asynchronous circuit design algorithms and methodologies.
– Development platform selected (Balsa over Petrify).
– ARMOR architecture defined.
– Detailed Micro-Architecture Specification written:
  • Functional block partition, datapath interface defined.
  • Asynchronous handshaking protocol defined.
– Detailed asynchronous pseudo-code implementation written.
• Second Semester
– Balsa code written and compiled for each functional ARMOR unit.
– Light validation started on all units, test harness piloted on IFU.
– Attempts made at full-chip integration.
Main Problems Encountered
• Straightforward implementation of asynchronous pseudo-code in Balsa does not always work:
– Deadlocks originating in forced deviation from the pseudo-code.
– False read-after-write hazards.
• Arbitration in Balsa is limited to two inputs, which necessitates large arbitration trees.
• The Balsa validation environment does not allow the synchronization of events, so customized test harnesses had to be built.
• The Balsa simulator does not filter main channels, so its output is almost impossible to follow. This was a huge obstacle in full-chip integration.
Deadlock Illustration
• Function1
A <- Reg1;
B <- Reg2
• Function2 (same channel order as Function1: the transfers complete)
A -> Reg1;
B -> Reg2
• Function2 (reversed channel order: Function1 waits on A while Function2 waits on B)
B -> Reg2;
A -> Reg1
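The channel-ordering problem illustrated above can be reproduced with a small model. The following Python sketch is an illustration only, not Balsa code: it assumes that every channel transfer is a rendezvous which completes only when both processes are waiting on the same channel, and the name `run_rendezvous` is hypothetical.

```python
def run_rendezvous(sender_order, receiver_order):
    """Simulate two sequential processes talking over rendezvous channels.

    Each list names the channels a process uses, in program order.
    Assumed model: a transfer completes only when both processes are
    waiting on the same channel; otherwise neither can advance.
    """
    i = j = 0
    while i < len(sender_order) and j < len(receiver_order):
        if sender_order[i] != receiver_order[j]:
            # Sender blocks on one channel, receiver on another.
            return "deadlock"
        i += 1
        j += 1
    # A process left with pending transfers has no partner and waits forever.
    if i < len(sender_order) or j < len(receiver_order):
        return "deadlock"
    return "completed"


function1 = ["A", "B"]           # A <- Reg1; B <- Reg2
function2_same = ["A", "B"]      # A -> Reg1; B -> Reg2
function2_reversed = ["B", "A"]  # B -> Reg2; A -> Reg1

print(run_rendezvous(function1, function2_same))      # completed
print(run_rendezvous(function1, function2_reversed))  # deadlock
```

In this model the two programs differ only in the order of two statements, which is why such a deadlock is easy to introduce when deviating from the pseudo-code.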
UArch Deadlock Example
PSrc Data Readiness Determination
• Out-of-order execution implies that not all instructions will be data-ready when entering the execution window (i.e. having been renamed and registered in the ROB).
• PSrc data readiness is determined by reading its corresponding RAT entry ready-bit during the renaming stage:
– If the ready bit is set, the PSrc data is guaranteed to be ready for execution and its valid value can be safely copied to the relevant ROB entry field.
– If the ready bit is clear, the PSrc data is NOT yet ready and its value can NOT be copied until a WB CAM-match occurs vs. that same PSrc.
PSrc Data Readiness Determination: UARCH REQUIREMENT
• Consider the following sequence of events:
– A certain PSrc, whose data is NOT ready, is read from the RAT.
– After that PSrc's (clear!) ready-bit is read, its PDst is written back to the ROB, having completed execution, thus validating its data.
– Only now is the PSrc read from the RAT registered (together with its corresponding instruction) in the ROB, and because its ready bit is clear, its data is deemed not-ready.
– The instruction registered in that ROB entry will forever wait for a WB CAM-match vs. the said PSrc, which had already occurred.
• This results in a machine deadlock!
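The event sequence above can be replayed in a few lines. The Python sketch below is a simplified model under stated assumptions, not the ARMOR logic: a single PSrc, a CAM-match that can only update an entry already registered in the ROB, and hypothetical event names (`read_rat`, `writeback`, `register`).

```python
def entry_becomes_ready(events):
    """Replay an event order for one PSrc and report whether the waiting
    ROB entry ever sees its source data as ready.

    Events (hypothetical names):
      "read_rat"  - rename reads the RAT ready bit for the PSrc
      "writeback" - the producer writes back; CAM-match hits registered entries
      "register"  - the consuming instruction is registered in the ROB
    """
    rat_ready = False      # ready bit stored in the RAT
    sampled_ready = False  # value of the ready bit seen at rename
    registered = False     # is the consumer in the ROB yet?
    entry_ready = False    # ready bit held in the ROB entry

    for event in events:
        if event == "read_rat":
            sampled_ready = rat_ready
        elif event == "writeback":
            rat_ready = True
            if registered:
                entry_ready = True  # CAM-match only sees registered entries
        elif event == "register":
            registered = True
            entry_ready = entry_ready or sampled_ready
    return entry_ready


print(entry_becomes_ready(["read_rat", "register", "writeback"]))  # True
print(entry_becomes_ready(["read_rat", "writeback", "register"]))  # False
```

The second ordering is the problematic one: the write-back falls between the RAT read and the ROB registration, so the entry never observes it.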
PSrc Data Readiness Determination: SYNCHRONOUS LOGIC IMPLEMENTATION
• For simplicity we assume all register files are read during the high clock phase and written during the low clock phase.
• Because WB data always arrives within the same timing window, an override datapath can be designed that meets the ROB array-write timing requirement.
[Figure: pipeline stages Fetch, Decode, Rename, Schedule, Dispatch, Execute, Memory, WB, Retire, with WB data forwarding.]
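The override idea for the clocked design can be written as a one-line condition. This Python sketch is an assumption about how such a bypass behaves, not a transcription of the ARMOR datapath: the ready bit written to the ROB is the RAT ready bit, overridden when the PDst writing back in the same cycle equals the PSrc being renamed. The function and parameter names are hypothetical.

```python
def rename_ready_sync(rat_ready_bit, psrc, wb_pdst):
    """Ready bit written to the ROB in a clocked design (assumed model).

    rat_ready_bit: ready bit read from the RAT for this PSrc
    psrc:          physical source pointer being renamed
    wb_pdst:       PDst on the WB bus in the same cycle, or None if idle
    """
    wb_hit = wb_pdst is not None and wb_pdst == psrc
    return bool(rat_ready_bit) or wb_hit


print(rename_ready_sync(0, 5, 5))     # True: same-cycle write-back overrides
print(rename_ready_sync(0, 5, 6))     # False: write-back is for another register
print(rename_ready_sync(1, 5, None))  # True: already ready in the RAT
```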
[Figure: synchronous implementation of the ready-bit override. Labels: clk, LSrc [2:0], rddec [7:0], rden [7:0], clk#, wrdec [31:0], wren [31:0], RAT Array (ready, PDst [4:0]), WBPDst [4:0], ROB Array wrdata.]
PSrc Data Readiness Determination: ASYNCHRONOUS LOGIC IMPLEMENTATION
• In an asynchronous pipeline WB can occur at any given time and cannot be contained within a predetermined timing window, so the previous override approach will not work!
• Instead, after completing an instruction's registration in the ROB, another check will be performed on the data readiness of each PSrc whose RAT data-ready bit was clear during the renaming process.
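A minimal sketch of the second check follows, in Python. It assumes the set of already-completed PDsts can be queried once registration is done; that bookkeeping (`completed_pdsts`) and the function name are hypothetical and stand in for whatever state the real design consults.

```python
def issue_with_recheck(sampled_ready, completed_pdsts, psrc):
    """Return the ready bit held by the ROB entry after registration.

    sampled_ready:   ready bit read from the RAT during renaming
    completed_pdsts: set of PDsts that have written back by the time
                     registration completes (hypothetical bookkeeping)
    psrc:            physical source pointer of the new instruction
    """
    entry_ready = bool(sampled_ready)  # value registered in the ROB
    if not entry_ready:
        # Second check, made only for sources whose ready bit was clear.
        entry_ready = psrc in completed_pdsts
    return entry_ready


# PSrc 7 wrote back between the RAT read and the ROB registration.
print(issue_with_recheck(False, {7}, 7))    # True: the late write-back is caught
print(issue_with_recheck(False, set(), 7))  # False: still waiting for a CAM-match
```

The re-check costs one extra lookup per not-ready source, but it does not depend on when the write-back happens relative to renaming.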
[Figure: issue handshake between the RAT and the ROB. Labels: Head Pointer, Issue Req, Issue Ack, and the payload OpCode, LDst, Imm, PSrc1, PSrc1Ready, PSrc2, PSrc2Ready.]
[Figure sequence: one ROB entry with its PSrc1 and PSrc2 pointers, Ready and Valid bits, PSrc1Value and PSrc2Value fields, and Result, shown in three steps.]
• PSrc1 ready-bit was set and PSrc2 ready-bit was clear during renaming.
• PSrc1 result is valid and thus copied to the PSrc1Value field. At the same time, PSrc2 writes back.
• A final check made on the status of PSrc2 deems it valid, thus its result is copied into the PSrc2Value field and its ready-bit is set.
False Read After Write Hazard and Resulting Arbitration Overhead
ROB Interface Arbitration
ROB Interface Arbitration: UARCH REQUIREMENT
[Figure: block diagram of the out-of-order engine. Blocks: ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, DATA CACHE. Labels: Inst from ID, Branch Decision to IFU, branches, non-mem inst, mem inst, non-branch inst.]
ROB Interface Arbitration: UARCH REQUIREMENT
• The ROB constitutes the heart of the out-of-order execution engine. Its functions include:
– Holding all instructions currently in the execution window (from issue to retirement).
– Determining data-readiness of each instruction by CAM-matching the 3 WB busses vs. each entry's PSrc pointers.
– Dispatching data-ready instructions out-of-order to the appropriate RS.
– Retiring PDsts of executed instructions in-order to the Real Register File (RRF).
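The CAM-matching step can be sketched as a comparison of every write-back bus against every entry's source pointers. The Python below is an illustrative model only; the entry layout (`psrc1`, `rdy1`, `val1`, and so on) uses hypothetical field names, and the buses are passed as a plain list rather than modeled as three concurrent processes.

```python
def cam_match(entries, wb_buses):
    """Update ROB entries from the write-back buses.

    entries:  list of dicts with keys psrc1, psrc2, rdy1, rdy2, val1, val2
              (hypothetical field names)
    wb_buses: list of (pdst, value) pairs, one per active WB bus
    """
    for entry in entries:
        for pdst, value in wb_buses:
            for n in ("1", "2"):
                if not entry["rdy" + n] and entry["psrc" + n] == pdst:
                    entry["val" + n] = value  # capture the written-back data
                    entry["rdy" + n] = True
    return entries


def data_ready(entry):
    """An instruction is data-ready once both sources are ready."""
    return entry["rdy1"] and entry["rdy2"]


rob = [{"psrc1": 3, "psrc2": 9, "rdy1": True, "rdy2": False,
        "val1": 7, "val2": None}]
cam_match(rob, [(9, 5), (4, 1)])
print(data_ready(rob[0]), rob[0]["val2"])  # True 5
```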
ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
• Our initial logic design comprised 8 independent concurrent asynchronous logical processes:
– Issue process – requests a new instruction from the RAT and registers it in the ROB head entry.
– Three RS dispatch processes scanning the ROB from tail to head, scheduling and dispatching instructions for execution:
  • Branch Iterator – searches for the oldest branch instruction yet to be dispatched.
  • Memory Iterator – searches for the oldest memory instruction yet to be dispatched.
  • RegOp Iterator – searches for the oldest data-ready, non-branch/memory instruction yet to be dispatched.
– Three CAM-match processes per entry, comparing the entry's PDst and PSrc pointers vs. the 3 PDsts writing back from the ALUs and DCache.
– Retirement process – serially copies the result of completed instructions from the ROB's tail to the RRF.
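The tail-to-head scan performed by the three RS iterators can be sketched as one search routine with a different selection predicate per iterator. This Python sketch assumes a circular buffer in which `tail` is the oldest entry and `head` is one past the newest (so a completely full buffer is not representable here); the entry fields `op`, `ready` and `disp` are hypothetical names.

```python
def find_oldest(rob, tail, head, wanted):
    """Scan a circular ROB from tail (oldest) toward head (newest).

    rob:    list of entries; each is a dict with 'op', 'ready', 'disp'
    tail:   index of the oldest entry
    head:   index one past the newest entry (assumed convention)
    wanted: predicate selecting the entries this iterator serves
    Returns the index of the oldest matching undispatched entry, or None.
    """
    size = len(rob)
    count = (head - tail) % size  # number of occupied entries
    for k in range(count):
        i = (tail + k) % size
        entry = rob[i]
        if not entry["disp"] and wanted(entry):
            return i
    return None


def is_branch(e):
    return e["op"] == "BEZ"


def is_memory(e):
    return e["op"] in ("LW", "SW")


def is_regop(e):
    # Only the RegOp iterator requires data-readiness in the description above.
    return e["ready"] and e["op"] not in ("BEZ", "LW", "SW")


empty = {"op": None, "ready": False, "disp": False}
rob = [
    {"op": "ADD", "ready": True, "disp": False},   # index 0
    {"op": "BEZ", "ready": False, "disp": False},  # index 1
    empty, empty,                                   # indices 2, 3 unused
    {"op": "ADD", "ready": False, "disp": False},  # index 4 (oldest)
    {"op": "LW", "ready": True, "disp": False},    # index 5
]
print(find_oldest(rob, 4, 2, is_regop))   # 0
print(find_oldest(rob, 4, 2, is_memory))  # 5
print(find_oldest(rob, 4, 2, is_branch))  # 1
```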
ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
• These processes were conceived and designed so as to eliminate any potential read-write or write-write logical hazards:
– Issue process – writes to entries within the ROB head-to-tail region, reads from valid entries only.
– Branch RS iterator – reads/writes from/to invalid entries within the ROB tail-to-head region whose op-code is BEZ and which have yet to be dispatched.
– Memory RS iterator – reads/writes from/to invalid entries within the ROB tail-to-head region whose op-code is LW/SW and which have yet to be dispatched.
– RegOp RS iterator – reads/writes from/to invalid entries within the ROB tail-to-head region whose op-code is NOT BEZ/LW/SW and which have yet to be dispatched.
– CAM-match processes – write to DIFFERENT, invalid entries within the ROB tail-to-head region, which have already been dispatched.
– Retirement process – reads from valid entries within the ROB tail-to-head region.
[Figure: ROB array layout, entries PDst0 to PDst23 with head and tail pointers. Entry fields: PDst, OpCode, LDst, PSrc1, PSrc1Rdy, PSrc1Val, PSrc2, PSrc2Rdy, PSrc2Val, Imm/Res, Valid, Disp. Bit widths shown: [4:0], [3:0], [2:0], [15:0]. Example op-codes shown: BEZ, LW/SW, MOV, ADD. Processes annotated on the array: Issue Process, Branch RS Iterator, Memory RS Iterator, RegOp Iterator, CAM-Match Processes, Retirement Process.]
Arbitrate Tree
Problem: A function may have more than one input, and each input can arrive at any time, so two inputs might arrive at the same time. Balsa does not allow this!
Solution: We arbitrate the inputs.
Another problem: Balsa cannot arbitrate between more than two inputs.
Solution: We build an arbitration tree.
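The tree construction can be sketched as repeated pairing of inputs through a two-input element. The Python below is a behavioral illustration, not Balsa: `arbiter2` is a hypothetical stand-in for the two-input arbiter, and its fixed left-first tie-break is an assumption (a real arbiter may grant either side).

```python
def arbiter2(left, right):
    """Two-input arbiter: forward one pending request (None = no request).

    Ties go to the left input in this sketch; a real arbiter may pick either.
    """
    return left if left is not None else right


def arbitrate_tree(requests):
    """Reduce any number of request inputs using two-input arbiters only.

    Returns (winner, arbiters_used, levels).
    """
    level = list(requests)
    arbiters_used = 0
    levels = 0
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            nxt.append(arbiter2(level[k], level[k + 1]))
            arbiters_used += 1
        if len(level) % 2:
            nxt.append(level[-1])  # odd input passes up unpaired
        level = nxt
        levels += 1
    return (level[0] if level else None), arbiters_used, levels


print(arbitrate_tree([None, None, "memory", None, "retire", None]))
# ('memory', 5, 3)
```

For N inputs the tree needs N-1 two-input arbiters, which is where the arbitration overhead comes from as the number of competing processes grows.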
How does it work?
Arbitrate Tree
[Figure: the base element, a two-input arbiter with inputs labeled 1 and 0, and an arbitrate tree built from such elements.]
Balsa Arch
[Figure: ROB access flow in the Balsa implementation, with two-input arbiters (inputs labeled 0 and 1), annotated with the following steps.]
• Check the tail position.
• Ask to read one of the entries.
• Every entry checks whether the call is addressed to it, who the sender is, and what the request is.
• We arbitrate the results and send them to all the iterators.
• We check where to send the command in case it is ready to go, and send it.
• Update the entry and move on.
Test Harnesses and Dynamic Simulations
Test Harness
• Problem:
– The Balsa-mgr supplies a test environment which does not enable synchronization of the different inputs, so we cannot control the sequence of events. For example, input A logically cannot come after input B; we know this, but the Balsa environment does not.
– As a result the test won't work properly.
• Solution:
– We had to build a test harness so that we can control the sequence of events.
IFU Test Harness: What are we going to see?
[Figure: IFU test harness simulation, with the following callouts.]
• IFU sends the command cache the first address: 0.
• IFU sends REQUEST to the command cache.
• Command cache responds with "ack" and receives the data.
• IFU sends a request, signaling it is ready to receive data.
• Command cache sends the data with an "ack" signal.
• IFU checks the data and (in this case) sends it to the ID.
• Validation signal, external to the harness.
• The same scenario repeats from the next address.
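The point of the harness is that the exchange happens in a fixed, scripted order. The Python sketch below illustrates that idea only; it is not the Balsa harness. The command cache is replaced by a plain list, and the event names and the function `run_ifu_harness` are hypothetical.

```python
def run_ifu_harness(command_cache, fetches):
    """Drive a scripted IFU/command-cache exchange in a fixed order.

    command_cache: list of instruction words indexed by address
                   (a stand-in for the real cache model)
    fetches:       number of consecutive addresses to fetch, starting at 0
    Returns the ordered event trace and the words forwarded to the ID.
    """
    trace, to_id = [], []
    for address in range(fetches):
        trace.append(("IFU", "address", address))
        trace.append(("IFU", "request", None))
        trace.append(("CACHE", "ack", None))       # address accepted
        trace.append(("IFU", "request", None))     # IFU is ready for data
        data = command_cache[address]
        trace.append(("CACHE", "data+ack", data))
        to_id.append(data)
        trace.append(("IFU", "to_id", data))       # forwarded to the ID
    return trace, to_id


trace, to_id = run_ifu_harness([0x10, 0x20], 2)
print(to_id)      # [16, 32]
print(trace[6])   # ('IFU', 'address', 1)
```

Because every stimulus is appended in program order, an input can never be applied before the event that logically precedes it.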
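ALU Test example
[Figure: ALU test simulation, with the following callouts.]
• The RS sends the data to be executed, and the ALU receives it.
• The ALU "calls" the relevant PDst.
• The ALU calculates the result and sends it to the ROB: 5+7=12.
• Command sent: {ADD,3,7,5,0}, i.e. Command: add, PDst: 3, Src1: 7, Src2: 5, Imm: 0.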
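The test vector above can be checked with a few lines of Python. This is a sketch under stated assumptions: only ADD is modeled, Src1 and Src2 are treated as operand values (as the 5+7=12 callout suggests), and the function name `alu_execute` is hypothetical.

```python
def alu_execute(command):
    """Execute one command tuple (op, PDst, Src1, Src2, Imm).

    Only ADD is modeled here; Src1 and Src2 are treated as operand
    values. Returns (PDst, result), the pair sent back to the ROB.
    """
    op, pdst, src1, src2, imm = command  # imm is unused by ADD
    if op != "ADD":
        raise ValueError("only ADD is modeled in this sketch")
    return pdst, src1 + src2


print(alu_execute(("ADD", 3, 7, 5, 0)))  # (3, 12)
```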
Conclusion and Future Plans
Conclusions
• Asynchronous Circuit Design
– Theoretical advantages of asynchronous circuits (reduced power, clock skew elimination, average-case performance) are negated by the huge overhead caused by handshake logic.
– Large-scale systems are almost impossible to implement on standard (synchronous) development platforms (FPGA, etc.).
• Project Concept
– The project scope and definition were overambitious: OOO chip design, implementation, validation and synthesis.
– The Balsa platform was very disappointing in terms of design, simulation and synthesis capabilities, as well as support options (ManU junior faculty only…).
– The main line of the project should have been performed on platforms that are well known in the lab (VHDL!).
Future Plans
• Balsa code documentation and packaging.
• Completion of light validation on all ARMOR logical units via test harnesses for checking/simulating basic functionality.
• Project documentation completion.