RAPTOR-Design: Refactorable Architecture Processor To Optimize Recurrent Design

P. Garcia, T. Gomes, F. Salgado, J. Cabral, J. Monteiro, A. Tavares

Centro Algoritmi-University of Minho

{paulo.garcia, tiago.a.gomes, filipe.salgado, jorge.cabral, joao.monteiro, adriano.tavares}@algoritmi.uminho.pt

Abstract— The growth in embedded systems complexity has created the demand for novel tools which allow rapid systems development and facilitate the designer’s management of complexity. Especially since systems must incorporate a variety of often contradictory characteristics, achieving design metrics in a short development time is an increasing challenge. This paper presents RAPTOR-Design, a framework for System-on-Chip (SoC) design which incorporates a customizable processor architecture and allows rapid software-to-hardware migration, custom hardware integration in a tightly-coupled fashion and seamless Fault Tolerance (FT) capabilities for FPGA platforms. The impact of processor customization, FT capabilities and custom hardware integration on design metrics is presented, as well as an overview of the design process using RAPTOR-Design.

Keywords- FPGA, Microprocessor, Custom Computational Units

I. INTRODUCTION AND RELATED WORK

The advent of low-cost, high-density Field Programmable Gate Arrays (FPGAs) in the last decade has led to their widespread use as prototyping and deployment platforms. FPGAs offer designers the ability to produce custom hardware at a productivity level comparable to software. State-of-the-art FPGAs offer thousands of Logic Elements (LEs), embedded arithmetic support (e.g., multipliers) and on-chip memory blocks. Together, these elements allow the implementation of small to medium sized Systems-on-Chip (SoCs) suitable for a wide range of applications [1]. Although FPGAs incur greater power consumption and lower clock speeds than Application Specific Integrated Circuits (ASICs), they offer two main advantages which have contributed to their adoption as the prevailing target platform for embedded systems: (1) the Non-Recurring Engineering (NRE) cost of deploying systems on FPGAs is substantially lower than on ASICs, since ASIC fabrication costs far exceed the cost of Commercial Off-The-Shelf (COTS) FPGAs; and (2) current FPGAs allow runtime re-configuration, lending themselves well to adaptive systems where a small area footprint is required, as long as the re-configuration overhead is acceptable [2]. In addition, the wide availability of Intellectual Property (IP) modules, both commercial and open source, contributes to the design flexibility afforded to FPGA-based systems’ designers. FPGAs are typically used in one of two ways: either as a coprocessor to an ASIC, performing tasks which are highly parallel to the ASIC’s workload, or as the main deployment platform [1, 2].

FPGA-based SoCs consist of one or more processing elements (e.g., general-purpose processors, Application-Specific Instruction-set Processors (ASIPs), Custom Computational Units (CCUs), peripherals, etc.) and embedded memories. Some FPGAs offer hardcore processors, embedded within the configurable fabric, to be used as main system processors aided by coprocessors implemented in LEs (e.g., the Virtex-5 FXT FPGA embeds up to two PowerPC 440 cores). However, most FPGAs, especially when full advantage of re-configuration is intended, provide no hardcore processors and instead rely on the implementation of softcore processors. Hardcore general-purpose processors are faster, but can be “overkill” in terms of performance for some applications. Hence, softcore designs offer several advantages: (1) the exact number and type of processors may be chosen for a given application; (2) customization may be performed at several granularity levels, resulting in a better-optimized design; (3) silicon neutrality blurs the hardware/software boundary, raising hardware design productivity toward software levels; and (4) softcores suffer a much smaller performance loss from off-chip memory latencies due to their reduced clock frequencies [3], if off-chip memory is used at all; if only on-chip memory is used, there is no need to explore the memory hierarchy [4].

The two extremes of the softcore spectrum are general-purpose processors (e.g., OpenSPARC, MIPS) and ASIPs. General-purpose processors are the most convenient implementation paradigm, since the availability of IP designs makes them realistic COTS solutions. However, most of them fail to meet the requirements of many embedded systems, such as real-time predictability [5]. Thus, ASIPs have emerged as processors completely dedicated to the target application. In [5], Oliveira et al present an FPGA implementation of a processor which includes hardware support for a Real-Time Operating System (RTOS). Free from stagnant silicon, ASIPs offer the most comprehensive solution to provide high parallelism [6] and tackle the flexibility/design metrics tradeoffs in SoCs [7]. Two main problems afflict ASIP design: (1) designing an ASIP incurs considerable NRE cost, and (2) there is a limit to how much of an application can be expressed in terms of custom functionality; most ASIPs still contain instructions present in general-purpose processors [8]. Between the two extremes lie extensible processors. These are implemented upon a base general-purpose machine whose Instruction Set Architecture (ISA) can be extended with custom instructions (e.g., Altera’s Nios II, Xilinx’s MicroBlaze), providing the required tradeoff between design flexibility and short time-to-market [8]. Enabling critical parts of the application to be implemented in hardware helps meet the cost and performance demands of embedded systems [9].

Danek et al [10] demonstrate how integrating custom hardware to manage thread scheduling in a soft processor improves performance and determinism. Custom instructions are implemented by CCUs, which can be connected to a processor core in two ways: loosely coupled or tightly coupled. In loosely coupled configurations, a CCU is connected to the core as a coprocessor, i.e., it resides outside the softcore. This approach incurs overhead due to bus arbitration but allows an arbitrary number of CCUs to be connected. Approaches that allow efficient control and communication between cores and loosely coupled CCUs have been presented in the literature, such as the work by Vassiliadis et al [11]. Tightly coupled configurations integrate CCUs in the processor core and treat them as any other datapath functional unit. This approach incurs no overhead between the dispatch and execution of a custom instruction, but limits the number and types of operands to those specified in the base ISA, as well as the number of possible CCUs, a phenomenon known as opcode explosion [11]. Several studies have analyzed the tradeoffs among CCU configurations in terms of performance and area, such as the work by Bordoloi et al [12]. A multitude of studies have also addressed how to efficiently identify custom instruction candidates from application code, as well as methodologies to integrate these techniques into the design flow [13-16].

A technique used extensively in soft processors to increase throughput and maximize efficiency is multithreading. By replicating the architectural elements of a processor, multithreading makes it appear to software as several logical processors [17]. In softcores, Fine-Grained or Interleaved Multithreading (IMT) and Coarse-Grained or Blocking Multithreading (BMT) are the main multithreading architectures [18]. In IMT policies, an instruction from a different thread is executed on each clock cycle. IMT allows the hazard detection and resolution logic to be simplified if the number of threads is equal to or greater than the number of pipeline stages, resulting in area and power savings [19]. In BMT, one thread executes continuously until it is stalled (e.g., due to memory access latency), resulting in better utilization of clock cycles. Both configurations achieve better Instructions Per Cycle (IPC), are more area-efficient than their single-threaded counterparts and provide zero-overhead context switching [20]. Multithreading has also proved to be a successful solution to minimize overheads and stalls related to custom instructions in extensible processors [21, 22].
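
As a purely illustrative aid (not part of the original paper), the following Python sketch contrasts the two scheduling policies at cycle granularity; the thread names, stall pattern and round-robin order are assumptions chosen for the example, not details of any particular softcore.

    # Illustrative cycle-level model of the two multithreading policies above.
    # Thread names and the stall pattern are hypothetical.

    def imt_schedule(threads, cycles):
        # Interleaved (fine-grained) multithreading: a different thread each cycle.
        return [threads[c % len(threads)] for c in range(cycles)]

    def bmt_schedule(threads, stall_cycles, cycles):
        # Blocking (coarse-grained) multithreading: run one thread until it stalls.
        trace, current = [], 0
        for c in range(cycles):
            if c in stall_cycles.get(threads[current], set()):
                current = (current + 1) % len(threads)  # switch only on a stall
            trace.append(threads[current])
        return trace

    threads = ["T0", "T1", "T2", "T3"]
    print("IMT:", imt_schedule(threads, 8))
    # Assume T0 stalls on a memory access at cycle 3.
    print("BMT:", bmt_schedule(threads, {"T0": {3}}, 8))

The IMT trace also shows the basis for the hazard-logic simplification noted above: once the thread count equals or exceeds the pipeline depth, no two in-flight instructions belong to the same thread.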

The remainder of this paper is organized as follows: Section II explains the integration of tightly-coupled custom hardware in the RAPTOR framework and presents the metrics impact of software-to-hardware migration. Section III describes the integration of Fault Tolerance capabilities into the design and, finally, Section IV presents the concluding remarks.

II. CUSTOM HARDWARE INTEGRATION

The RAPTOR framework builds upon previous work presented in [23] and [24]. The framework’s core is the M²µP customizable processor. M²µP can be customized in terms of the number of hardware threads (1 to 8), thread scheduling policy (IMT, BMT or hybrid), and cache sizes and architecture, and allows custom instructions to be integrated into the datapath in a tightly-coupled fashion.
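
To make these customization axes concrete, here is a hypothetical configuration record in Python; the field names and defaults are illustrative assumptions, not the framework’s actual interface.

    # Hypothetical configuration record mirroring the customization axes listed
    # above (thread count, scheduling policy, caches, CCU list). All names are
    # assumptions for illustration only.
    from dataclasses import dataclass, field

    @dataclass
    class CoreConfig:
        hw_threads: int = 4                        # 1 to 8 hardware threads
        scheduling: str = "IMT"                    # "IMT", "BMT" or "hybrid"
        icache_kb: int = 8                         # instruction cache size (KB)
        dcache_kb: int = 8                         # data cache size (KB)
        ccus: list = field(default_factory=list)   # tightly-coupled CCUs to insert

        def validate(self):
            assert 1 <= self.hw_threads <= 8, "1 to 8 hardware threads supported"
            assert self.scheduling in ("IMT", "BMT", "hybrid")

    cfg = CoreConfig(hw_threads=8, scheduling="hybrid", ccus=["pid_controller"])
    cfg.validate()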

CCUs can be inserted into the datapath and activated by custom instructions. The processor’s flexibility comes from the variety of CCU insertion options and the ease of integration, which enables the RAPTOR-Design framework. CCU inputs may come directly from registers, from the ALU output, or from outside the processor (via a dedicated port). Conversely, CCU outputs may be connected to the Register File, directly to the memory hierarchy (data cache) or to a dedicated port. A Scoreboard mechanism allows CCUs of any latency to be integrated without any design effort: while a CCU is computing, any other instruction which does not require it or its output may execute in parallel. If an instruction requires the CCU’s output (e.g., the value of a register to which the CCU will write) before the CCU has finished its computation, the Scoreboard detects that the value is not ready and stalls the requesting thread.
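
The scoreboard behaviour described above can be summarized with a minimal Python sketch; the class, register names and latency model are assumptions for illustration, not the processor’s implementation.

    # Minimal model of the scoreboard: a CCU of arbitrary latency marks its
    # destination register busy; only instructions that need that value stall.

    class Scoreboard:
        def __init__(self):
            self.busy = set()                 # registers awaiting a CCU result

        def issue_ccu(self, dest_reg):
            self.busy.add(dest_reg)           # CCU dispatched, result pending

        def complete_ccu(self, dest_reg):
            self.busy.discard(dest_reg)       # CCU wrote back its result

        def must_stall(self, src_regs):
            # Stall the requesting thread if any source operand is pending.
            return any(r in self.busy for r in src_regs)

    sb = Scoreboard()
    sb.issue_ccu("r7")                        # custom instruction will write r7
    assert sb.must_stall(["r7", "r2"])        # a consumer of r7 is stalled
    assert not sb.must_stall(["r3", "r4"])    # independent instructions proceed
    sb.complete_ccu("r7")
    assert not sb.must_stall(["r7"])          # dependents may now execute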

Due to the processor’s source code style, adding a CCU requires no modification of existing code, only additions. Adding a CCU to the project requires adding a few lines of code to the operand dispatch look-up table, the legal instructions look-up table, the write-back stage look-up table, the ports list and the scoreboard sensitivity look-up table. The RAPTOR-Design framework automates CCU integration through a script which performs the required code additions, so application-specific designs can be customized easily, incorporating the required CCUs while keeping the base core architecture small.
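
The paper only states that the integration script appends entries to these tables; the Python sketch below illustrates that append-only style with entirely hypothetical file names and entry formats.

    # Append-only code generation in the spirit of the description above.
    # File names and entry contents are hypothetical placeholders.

    CCU_NAME = "pid_controller"
    ADDITIONS = {
        "operand_dispatch_lut.vh":   f"// {CCU_NAME}: operand dispatch entry",
        "legal_instructions_lut.vh": f"// {CCU_NAME}: legal instruction entry",
        "writeback_lut.vh":          f"// {CCU_NAME}: write-back source entry",
        "ports_list.vh":             f"// {CCU_NAME}: dedicated port entry",
        "scoreboard_sensitivity.vh": f"// {CCU_NAME}: scoreboard sensitivity entry",
    }

    for path, entry in ADDITIONS.items():
        with open(path, "a") as f:   # additions only, existing code untouched
            f.write(entry + "\n")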

III. FPGA FAULT TOLERANCE

Several studies have proposed FT techniques for FPGA architectures. Xilinx Corp., one of the most successful FPGA vendors, addresses the issue of Single Points of Failure on voters in their chips [26]. Their suggested implementation applies Triple-Modular Redundancy (TMR) even to the voters, leaving no Single Point of Failure. This approach requires triple the inputs and outputs and pushes the binding of signals off-chip, where less sensitive (radiation-hardened) technology can be used. It poses two problems: complex designs may require an excessive amount of I/O, rendering them impractical, and on-chip memories must be triplicated, which may not be feasible either: estimates show that 50% of a current System-on-Chip’s (SoC) area is occupied by memory and, in modern microprocessors, over 30% of the area is occupied by caches, with both ratios increasing [27]. This approach also leaves one problem unsolved: configuration corruption on voters goes undetected. A first corruption on a voter is mitigated by TMR, but if a second voter is corrupted, the design yields incorrect results. Frigerio and Salice [28] exploited Hamming codes on every FPGA module to implement Fault Tolerance; however, their approach is ineffective against Multiple-Bit Upsets (MBUs) and still creates Single Points of Failure. Bastos et al [29] show the area/performance impact of protecting an 8-bit processor, but their approach is not suited to FPGA implementations. Weaver [30] implements an FT processor through online checking, again unsuited to FPGAs. Pascual et al [31] propose fault tolerant techniques for cache coherence in Chip Multi-Processor (CMP) architectures. Tanoue et al [32], Hubner et al [33] and Marconi et al [34] use Dynamic Reconfiguration on FPGA softcores, but disregard the Single Points of Failure in voters. Straka et al [35] implement Fault Tolerance on FPGA internal buses. Garvie et al [36] introduce Lazy Scrubbing: one module is used to detect corruption and trigger partial reconfiguration, and the system relies on genetic algorithms if reconfiguration fails due to corruption of that detector. This approach still incurs a Single Point of Failure at the corruption detector.

The RAPTOR-Design paradigm for implementing a Fault Tolerant architecture dictates the following requirements:

(1) Triplication of all logic.
(2) Insertion of a multiplexer before every register.
(3) Insertion of a voter after every register.
(4) Insertion of a corruption detector before every register; every corruption detector requires an additional output pin.

Fault masking and correction happen as follows: a Single-Event Upset (SEU) which alters the value of any designer-visible register is masked by the voters, so the correct value is propagated to the next stage. On every clock cycle, all registers are written: if a designer-defined write-enable signal is active, the register is written with the appropriate input value; if not, the voters’ output is fed back. Thus, the write-enable signals that controlled the register write operation in the non-FT version now control the inserted multiplexer. This method provides protection against all SEUs at designer-visible registers and eliminates accumulation of errors. All Single-Event Transients (SETs) are corrected by the same mechanism: a transient fault on any combinational path causes an incorrect value to be fed to a register, which is corrected by the voters on the next clock cycle. Even if a SET spans several clock cycles, its effects are masked by the following voter layer. This approach is sufficient to mask and correct all faults in designer-visible logic. However, it is incapable of detecting SEUs in the FPGA configuration RAM. If any combinational block has its configuration altered, it will produce permanently incorrect results. Although these will be masked by voters, if a second module is also corrupted, the voters will be unable to select the correct result. Indeed, voters themselves may be corrupted. Even if the outputs are triplicated and external voting applied, a fault cannot be detected if more than one parallel internal module is corrupted, causing the internal voting scheme to propagate the same incorrect value to the triplicated output. Therefore, corruption detectors are inserted at the end of all combinational paths.
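
A bit-level Python sketch of this per-register scheme may help fix ideas; the voter, multiplexer and detector below model single bits only, and all names are illustrative assumptions rather than the actual HDL.

    # Per-bit model of the protection scheme: triplicated paths, a majority
    # voter, a write-enable multiplexer and a corruption detector.

    def voter(a, b, c):
        # Majority vote over three replicated values (masks a single upset).
        return (a & b) | (a & c) | (b & c)

    def register_next(new_value, write_enable, voted):
        # Mux before the register: the written value when enabled, otherwise
        # the voted value is fed back, preventing error accumulation.
        return new_value if write_enable else voted

    def corruption_detector(path_a, path_b):
        # Flags any disagreement between two combinational paths, which
        # triggers external dynamic reconfiguration.
        return path_a != path_b

    paths = [1, 0, 1]                         # replica 1 hit by an upset
    voted = voter(*paths)                     # majority masks it -> 1
    assert register_next(new_value=0, write_enable=False, voted=voted) == 1
    assert corruption_detector(paths[0], paths[1])   # reconfiguration requested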

Each corruption detector compares the values of two combinational paths. If any two paths disagree, the detector immediately signals an external module which triggers dynamic reconfiguration of the FPGA, the only way to correct a SEU at the configuration level. If a detector itself is corrupted and its operation compromised, two things may happen: either the detector produces a false positive, which causes a reconfiguration, no incorrect behavior is introduced and the detector itself is reconfigured, so the fault is corrected; or the detector produces a false negative, in which case another detector will produce the correct positive and cause the reconfiguration. With this scheme, any possibility of failure requires at least the following conditions: (1) two detectors are compromised, both yielding false negatives; (2) two voters, which propagate the results to the combinational paths checked by the third detector, are compromised; and (3) the corruption of both voters happens at the same time; even if two detectors are compromised, corruption of only one voter will still cause a reconfiguration. Statistically, these conditions are far less probable than the corruption of two voters alone.
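
A back-of-envelope calculation illustrates why this combined event is far less likely than the corruption of two voters alone; the per-module corruption probability below is an arbitrary assumption, not a figure from the paper.

    # Rough probability comparison, assuming each module is independently
    # corrupted with probability p within the same window (p is arbitrary).
    p = 1e-5

    p_two_voters = p ** 2            # plain TMR failure: two voters corrupted
    p_raptor = (p ** 2) * (p ** 2)   # two false-negative detectors AND two voters
    print(p_two_voters, p_raptor)    # e.g. 1e-10 versus 1e-20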

Reconfiguration does not affect registers, only combinational elements; therefore, all state is preserved and execution may resume immediately after reconfiguration. This Fault Tolerance methodology poses one drawback: SETs may be detected by the corruption detectors and cause a reconfiguration. No incorrect behavior is introduced, but a performance penalty may be incurred.

Experiments were performed using the testbench-based simulation environment of Xilinx ISE 12.2. SETs and SEUs in designer-visible logic were injected through non-synthesizable HDL code which can access all design logic without causing overhead to processor execution. The testbench injected faults and compared the injected values with the outputs of the respective voters to determine whether each fault was mitigated. Faults were injected into all design logic on each clock cycle: every element (combinational or sequential) in the pipeline had the same probability of being injected with a fault at each clock cycle, set at 0.1%, yielding an average of 4,320,000,000 faults per bit per day when simulating at 50 MHz. It should be noted that this is a very pessimistic approach, as estimates have shown that, for a device at an altitude of 3,000 km, the average occurrence of SEUs is 100 per day [37]. Since partial reconfiguration itself cannot be simulated, the outputs of the corruption detectors were monitored and configuration faults were removed whenever those outputs were active.
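
The quoted injection rate follows directly from the stated parameters, as the short check below shows (the probability and clock frequency are the ones given above).

    # Injection rate check: 0.1% per bit per cycle at a 50 MHz simulated clock.
    p_fault_per_cycle = 0.001
    clock_hz = 50e6
    seconds_per_day = 24 * 60 * 60

    faults_per_bit_per_day = p_fault_per_cycle * clock_hz * seconds_per_day
    print(int(faults_per_bit_per_day))   # 4320000000, i.e. 4.32e9 faults/bit/day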

IV. CONCLUSIONS

This paper presented the RAPTOR-Design framework, targeting the design of application-specific SoCs. The framework’s core processor was described in terms of its customization capabilities, and the impact of software-to-hardware migration on design metrics was demonstrated. A case study, depicting the full integration of a PID controller module into the datapath using the RAPTOR-Design framework, was presented and simulation results were shown. Additionally, the framework’s capability to implement Fault Tolerance in the designed system was presented, and the results obtained from fault-injection experiments were reported.

RAPTOR-Design allows rapid design and prototyping of FPGA-targeted systems customized for particular applications. Future work will focus on developing a library of hardware modules and on integrating automatic software-to-hardware migration tools.

REFERENCES

[1] Hempel, G.; Hochberger, C.; "A resource optimized Processor Core for FPGA based SoCs", 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, 2007.

[2] Danek, M.; Kadlec, J.; Bartosinski, R.; Kohout, L.; "Increasing the level of abstraction in FPGA-based designs", International Conference on Field Programmable Logic and Applications, 2008.

[3] Labrecque, M.; Yiannacouras, P.; Steffan, J.G.; "Scaling Soft Processor Systems", 16th International Symposium on Field-Programmable Custom Computing Machines, 2008.

[4] Yiannacouras, P.; Rose, J.; Steffan, J. G.; “The Microarchitecture of FPGA-Based Soft Processors”, in Proceedings of the 2005 international conference on Compilers, Architectures and Synthesis for Embedded Systems, 2005.

[5] Oliveira, A.S.R.; Almeida, L.; de Brito Ferrari, A.; "The ARPA-MT Embedded SMT Processor and Its RTOS Hardware Accelerator" IEEE Transactions on Industrial Electronics, vol.58, no.3, March 2011

[6] Muller, O.; Baghdadi, A.; Jezequel, M.; "From Application to ASIP-based FPGA Prototype: a Case Study on Turbo Decoding", The 19th IEEE/IFIP International Symposium on Rapid System Prototyping, June 2008.

[7] Li Zhang; Shuangfei Li; Zan Yin; Wenyuan Zhao; "A Research on an ASIP Processing Element Architecture Suitable for FPGA Implementation", International Conference on Computer Science and Software Engineering, Dec. 2008.

[8] Gorjiara, B.; Gajski, D.; "Automatic architecture refinement techniques for customizing processing elements", 45th ACM/IEEE Design Automation Conference, 2008.

[9] Siew Kei Lam; Srikanthan, T.; Clarke, C.T.; "Architecture-Aware Technique for Mapping Area-Time Efficient Custom Instructions onto FPGAs", IEEE Transactions on Computers, vol.60, no.5, May 2011.

[10] Danek, M.; Kafka, L.; Kohout, L.; Sykora, J.; "Instruction set extensions for multi-threading in LEON3", IEEE 13th International Symposium on Design and Diagnostics of Electronic Circuits and Systems, April 2010.

[11] Vassiliadis, N.; Theodoridis, G.; Nikolaidis, S.; "The ARISE Approach for Extending Embedded Processors With Arbitrary Hardware Accelerators", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.17, no.2, Feb. 2009.

[12] Bordoloi, U.D.; Huynh Phung Huynh; Chakraborty, S.; Mitra, T.; "Evaluating design trade-offs in customizable processors", 46th ACM/IEEE Design Automation Conference, July 2009.

[13] Huynh Phung Huynh; Yun Liang; Mitra, T.; "Efficient custom instructions generation for system-level design", 2010 International Conference on Field-Programmable Technology (FPT), Dec. 2010.

[14] Ya-shuai Lu; Li Shen; Li-bo Huang; Zhi-ying Wang; Nong Xiao; "Customizing computation accelerators for extensible multi-issue processors with effective optimization techniques", 45th ACM/IEEE Design Automation Conference, June 2008.

[15] Pothineni, N.; Brisk, P.; Ienne, P.; Kumar, A.; Paul, K.; "A high-level synthesis flow for custom instruction set extensions for application-specific processors", 15th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2010.

[16] Atasu, K.; Luk, W.; Mencer, O.; Ozturan, C.; Dundar, G.; "FISH: Fast Instruction SyntHesis for Custom Processors", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. PP, no.99.

[17] Marr, D.; Binns, F.; Hill, D.; Hinton, G.; Koufaty, D.; Miller, J.; Upton, M.; “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Corp. White Paper, 2008.

[18] Ye Lu; Sezer, S.; McCanny, J.; "Advanced Multithreading Architecture with Hardware Based Scheduling", 2010 International Conference on Field Programmable Logic and Applications (FPL), Aug./Sept. 2010.

[19] Labrecque, M.; Steffan, J.G.; "Improving Pipelined Soft Processors with Multithreading", International Conference on Field Programmable Logic and Applications, Aug. 2007.

[20] Dimond, R.; Mencer, O.; Luk, W.; "CUSTARD - a customisable threaded FPGA soft processor and tools", International Conference on Field Programmable Logic and Applications, Aug. 2005.

[21] Moussali, R.; Ghanem, N.; Saghir, M.: “Supporting multithreading in configurable soft processor cores”, In Proceedings of the 2007 international conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2007.

[22] Labrecque, M.; Steffan, J.G.; "Fast critical sections via thread scheduling for FPGA-based multithreaded processors", International Conference on Field Programmable Logic and Applications, Aug. 2009

[23] F. Salgado; P. Garcia; T. Gomes; J. Cabral; J. Monteiro; A. Tavares; M. Ekpanyapong: “Exploring Metrics Tradeoffs in a Multithreading Extensible Processor”, 21st IEEE International Symposium on Industrial Electronics, Hangzhou, China, June 2012

[24] P. Garcia; T. Gomes; F. Salgado; J. Cabral; P. Cardoso; A. Tavares; M. Ekpanyapong: “A Fault Tolerant Design Methodology for a FPGA-Based Softcore Processor”, 1st Conference on Embedded Systems, Computational Intelligence and Telematics in Control, Wurzburg, Germany, April 2012

[25] Guthaus, M.R.; Ringenberg, J.S.; Ernst, D.; Austin, T.M.; Mudge, T.; Brown, R.B.; "MiBench: A free, commercially representative embedded benchmark suite", IEEE International Workshop on Workload Characterization, Dec. 2001.

[26] Xilinx, Inc. (2011). Triple Module Redundancy Design Techniques for Virtex Series FPGA. Xilinx Application Note 197, v1.0. Available: www.xilinx.com

[27] Argyrides, C., Vargas, F., Moraes, M., Pradhan, D.K., (2008). Embedding Current Monitoring in H-Tree RAM Architecture for Multiple SEU Tolerance and Reliability Improvement. 14th IEEE International On-Line Testing Symposium, pp.155-160.

[28] Frigerio, L., Salice, F., RAM-based fault tolerant state machines for FPGAs. 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems, pp.312-320.

[29] Bastos, R.P., Kastensmidt, F.L., Reis, R., (2006). Design of a robust 8-bit microprocessor to soft errors. 12th IEEE International On-Line Testing Symposium, pp.2.

[30] Weaver, C., Austin, T., (2001). A fault tolerant approach to microprocessor design. International Conference on Dependable Systems and Networks, pp.411-420.

[31] Ramazani, A., Amin, M., Monteiro, F., Diou, C., Dandache, A., (2009). A fault tolerant journalized stack processor architecture. 15th IEEE International On-Line Testing Symposium, pp.201-202

[32] Tanoue, S., Ishida, T., Ichinomiya, Y., Amagasaki, M., Kuga, M., Sueyoshi, T., (2009). A novel states recovery technique for the TMR softcore processor. International Conference on Field Programmable Logic and Applications, pp.543-546.

[33] Hubner, M., Gohringer, D., Noguera, J., Becker, J., (2010). Fast dynamic and partial reconfiguration data path with low hardware overhead on Xilinx FPGAs. 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pp.1-8.

[34] Marconi, T., Jae Young Hur, Bertels, K., Gaydadjiev, G., (2010). A novel configuration circuit architecture to speedup reconfiguration and relocation for partially reconfigurable devices. IEEE 8th Symposium on Application Specific Processors, pp.87-92.

[35] Straka, M., Kastil, J., Novotny, J., Kotasek, Z., (2011). Advanced fault tolerant bus for multicore system implemented in FPGA. IEEE 14th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp.397-398.

[36] Garvie, M., Thompson, A., (2004). Scrubbing away transients and jiggling around the permanent: long survival of FPGA systems through evolutionary self-repair. 10th IEEE International On-Line Testing Symposium, pp.155-160.

[37] Carmichael, C., Fuller, E., Fabula, J., Lima, F., (2001). Proton Testing of SEU Mitigation Methods for the Virtex FPGA. Proc. of Military and Aerospace Applications of Programmable Logic Devices MAPLD.
