
IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012


BioThreads: A Novel VLIW-Based Chip Multiprocessor for Accelerating Biomedical Image Processing Applications
David Stevens, Vassilios Chouliaras, Vicente Azorin-Peris, Jia Zheng, Angelos Echiadis, and Sijung Hu, Senior Member, IEEE

Abstract: We discuss BioThreads, a novel, configurable, extensible system-on-chip multiprocessor and its use in accelerating biomedical signal processing applications such as imaging photoplethysmography (IPPG). BioThreads is derived from the LE1 open-source VLIW chip multiprocessor and efficiently handles instruction, data and thread-level parallelism. In addition, it supports a novel mechanism for the dynamic creation and allocation of software threads to uncommitted processor cores by implementing key POSIX Threads primitives directly in hardware, as custom instructions. In this study, the BioThreads core is used to accelerate the calculation of the oxygen saturation map of living tissue in an experimental setup consisting of a high speed image acquisition system, connected to an FPGA board and to a host system. Results demonstrate near-linear acceleration of the core kernels of the target blood perfusion assessment with increasing number of hardware threads. The BioThreads processor was implemented on both standard-cell and FPGA technologies; in the first case and for an issue width of two, full real-time performance is achieved with 4 cores whereas on a mid-range Xilinx Virtex6 device this is achieved with 10 dual-issue cores. An 8-core LE1 VLIW FPGA prototype of the system achieved 240 times faster execution time than the scalar Microblaze processor, demonstrating the scalability of the proposed solution relative to a state-of-the-art, FPGA vendor-provided soft CPU core.

Index Terms: Biomedical image processing, field programmable gate arrays (FPGAs), imaging photoplethysmography (IPPG), microprocessors, multicore processing.

I. INTRODUCTION AND MOTIVATION

BIOMEDICAL in-vitro and in-vivo assessment relies on the real-time execution of signal processing codes as a key to enabling safe, accurate and timely decision-making, allowing clinicians to make important decisions and perform

Manuscript received December 20, 2010; revised May 03, 2011; accepted August 15, 2011. This work was supported by Loughborough University, U.K. Date of publication November 04, 2011; date of current version May 22, 2012. This paper was recommended by Associate Editor Patrick Chiang. D. Stevens, V. Chouliaras, and V. Azorin-Peris are with the Department of Electrical Engineering, Loughborough University, Leicestershire LE11 3TU, U.K. J. Zheng is with the National Institute for the Control of Pharmaceutical and Biological Products (NICPBP), China, No.2, Tiantan Xili, Chongwen District, Beijing 100050, China. A. Echiadis is with Dialog Devices Ltd., Loughborough LE11 3EH, U.K. S. Hu is with the Department of Electronic and Electrical Engineering, Loughborough University, Leicestershire LE11 3TU, U.K. (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBCAS.2011.2166962

medical interventions as these are based on hard facts, derived in real-time from physiological data [1], [2].

In the area of biomedical image processing, a number of imaging methods have been proposed over the past few years including laser Doppler [3], optical coherence tomography [4] and, more recently, imaging photoplethysmography (IPPG) [5], [6]; however, none of these techniques can attain their true potential without a real-time biomedical image processing system based on very large scale integration (VLSI) systems technology. For instance, the quality and availability of physiological information from an IPPG system is directly related to the frame size and frame rate used by the system. From a user perspective, the extent to which such a system can run in real-time is a key factor in its usability, and practical implementations of the system ultimately aim to be standalone and portable to achieve its full applicability. This is an area where advanced computer architecture concepts, routinely utilized in high-performance consumer and telecoms systems-on-chip (SoC) [7], can potentially provide the required data streaming and execution bandwidth to allow for the real-time execution of algorithms that would otherwise be executed offline (in batch mode) using more established techniques and platforms (e.g., sequential execution on a PC host). A quantitative comparison in this study (results and discussion) illustrates the foreseen performance gains, showing that a scalar embedded processor is six times slower than the single-core configuration of our research platform.

Such SoC-based architectures typically include scalar embedded processor cores with a fixed instruction-set architecture (ISA) which are widely used in standard-cell (ASIC) [8] and reconfigurable (FPGA)-based embedded systems [9]. These processors present a good compromise for the execution of general-purpose codes such as the user interface, low-level/bandwidth protocol processing, the embedded operating system (eOS) and, occasionally, low-complexity signal processing tasks. However, they lack considerably in the area of high-throughput execution and high-bandwidth data movement as is often required by the core algorithms in most signal processing application domains. An interesting comparison of the capabilities of three such scalar engines targeting field-programmable technologies (FPGAs) is given in [10].

To relieve this constraint, scalar embedded processors have been augmented with DSP coprocessors in both tightly-coupled [11] and loosely-coupled configurations [12] to target performance-critical inner loops of DSP algorithms. A side-effect of this approach is the lack of homogeneity in the SoC platform



programmer's model, which itself necessitates the use of complex mailbox-type [13] communications and the programmer-managed use of multiple address spaces, coherency issues and DMA-driven data flows, typically under the control of the scalar CPU.

Another architectural alternative is the implementation of the core DSP functionality using custom (hardwired) logic. Using established methodologies (register-transfer-level design, RTL), this task involves long development and verification times and results in systems that are of high performance yet tuned only to the task at hand. Also, these solutions tend to offer little or no programmability, making their modification to reflect changes in the input algorithm difficult. In the same architectural domain, the synthesis of such hardwired engines from high level languages (ESL synthesis) is an area of active research in academia [14], [15] (academic efforts targeting ESL synthesis of Ada and C descriptions); industrial tools in this area have matured [16]-[18] (commercial offerings targeting C++, C and UML&C++) to the point of competing favorably with hand-coded RTL implementations, at least for certain types of designs [19].

A potent solution to high performance VLSI systems design is provided by configurable, extensible processors [20]. These CPUs allow the extension of their architecture (programmer model and ISA) and microarchitecture (execution units, streaming engines, coprocessors, local memories) by the system architect. They typically offer high performance, full programmability and good post-fabrication adaptability to evolving algorithms through the careful choice of the custom ISA and execution/storage resources prior to committing to silicon. High performance is achieved through the use of custom instructions which collapse data flow graph (DFG) sub-graphs (especially those repeated many times [21]) into one or more multi-input, multi-output (MIMO) instruction nodes. At the same time, these processors deliver better power efficiency compared to non-extensible processors, via the reduction in the dynamic instruction count of the target application and the use of streaming local memories instead of data caches.

All the solutions mentioned so far for developing high-performance digital engines for consumer and, in this case, biomedical image processing suffer from the need to explicitly specify the software/hardware interface and schedule communications across that boundary. This research proposes an alternative, all-software solution, based on a novel, configurable, extensible VLIW chip-multiprocessor (CMP) built around an open-source VLIW core [22]-[24] and targeting both FPGA and standard-cell (ASIC) silicon. The VLIW architectural paradigm was chosen as such architectures efficiently handle parallelism at the instruction (ILP) and data (DLP) levels. ILP is exploited via the static (compile-time) specification of independent RISC-ops (referred to as syllables or RISCops) per VLIW instruction whereas DLP is exploited via the compiler-directed unrolling and pipelining of inner loops (kernels). Key to this is the use of advanced compilation technology such as Trimaran [25] for fully-predicated EPIC architectures or VEX [26] for the partially-predicated LE1 CPU [27], the core element of the BioThreads CMP used in this work.
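
As a concrete illustration of the kind of DFG sub-graph such custom instructions collapse, the C fragment below shows a three-node multiply-accumulate-saturate sub-graph; the intrinsic named in the trailing comment is purely hypothetical and is not part of the LE1 or any vendor ISA:

static inline int mulacc_sat(int acc, short a, short b)
{
    /* three-node DFG sub-graph: multiply, accumulate, saturate to 32 bits */
    long long t = (long long)acc + (long long)a * (long long)b;
    if (t >  2147483647LL) t =  2147483647LL;
    if (t < -2147483648LL) t = -2147483648LL;
    return (int)t;
}

/* On a configurable, extensible core this sub-graph could be emitted as a
 * single multi-input custom RISCop, e.g. a hypothetical intrinsic
 *   acc = __builtin_custom_mulacc_sat(acc, a, b);
 * cutting the dynamic instruction count of the inner loop that contains it. */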

A third form of parallelism, thread-level parallelism (TLP), can be exploited via the instantiation of multiple such VLIW cores operating in a shared-memory ecosystem. The BioThreads processor addresses all three forms of parallelism and provides a unique hardware mechanism with which software threads are created and allocated to uncommitted LE1 VLIW cores via the use of custom instructions implementing key POSIX Threads (PThreads) primitives directly in hardware.

A. Multithreaded Processors

Multithreaded programming allows for the better utilization of the underlying multiprocessor system by splitting up sequential tasks such that they can be performed concurrently on separate CPUs (processor contexts), resulting in a reduction of the total task execution time and/or the better utilization of the underlying silicon. Such threads are disjoint sections of the control flow graph that can potentially execute concurrently, subject to the lack of data dependencies. Multithreaded programming relies on the availability of multiple CPUs capable of running concurrently in a shared-memory ecosystem (multiprocessor) or as a distributed memory platform (multicomputer). Both multiprocessors and multicomputers fall into two major categories depending on how threads are created and managed: a) programmer-driven multithreading (potentially with OS software support), known as explicit multithreading, and b) hardware-generated threads (implicit multithreading). A very good overview of explicit multithreaded processors is given in [28].

1) Explicit Multithreading: Explicit multithreaded processors are categorized into: a) interleaved multithreading (IMT), in which the CPU switches to another hardware thread at instruction boundaries, thus effectively hiding long-latency operations (memory accesses); b) blocked multithreading (BMT), in which a thread is active until a long-latency operation is encountered; c) simultaneous multithreading (SMT), which relies on a wide (ILP) pipeline to dynamically schedule operations across multiple hardware threads.

Explicit multithreaded architectures have the form of either chip multiprocessors (shared-memory ecosystem) or multicomputers (distributed memory ecosystem). In both cases, special OS thread libraries (APIs) control the creation of threads and make use of the underlying multicore architecture (if one is provided) or time-multiplex the single CPU core. Examples of such APIs are POSIX Threads (PThreads) for shared-memory multithreading and MPI for distributed memory multicomputers. PThreads in particular allows for explicit creation, termination, joining and detaching of multiple threads and provides further support services in the form of mutex and condition variables.

Notable machines supporting IMT include the HEP and the Cray MTA; more recent explicit-multithreading VLIW CMPs include, amongst others, the SiliconHive HIVEFLEX CSL2500 communications processor (multicomputer architecture) [29] and the Fujitsu FR1000 VLIW media multicore (multiprocessor architecture) [30]. In the academic world, the most notable offerings in the reconfigurable/extensible VLIW domain include the tightly-coupled VLIW/datapath architecture of [31] and the ADRES architecture [32].


In the biomedical signal processing domain very few references can be found; a CMP architecture based on a commercial VLIW core was used for the real-time processing of 12-lead ECG signals in [33].

2) Implicit Multithreading: Prior research in hardware-managed threads (implicit multithreading) includes the SPSM and WELD architectures [34]-[36]. The single-program speculative multithreading (SPSM) method uses fork and merge operations to reduce execution time. Extra work by the compiler is required to find code blocks which are data independent; when such blocks are found, the compiler inserts extra instructions to inform the hardware to run the data-independent code concurrently. When the executing thread (master) reaches a fork instruction, a second thread is started at another location in the program. Both threads then execute and, when the master thread reaches the location in the program from which the second thread started, the two threads are merged together.

The WELD architecture uses branch prediction as a method of reducing the impact of pipeline restarts due to control flow changes. Due to the organization of modern processors, a taken branch requires the pipeline to be restarted and the instructions in the branch shadow to be squashed, resulting in wasted issue slots. A way around this inefficiency is to run two or more threads concurrently, with each thread executing the code along the taken or not-taken path (thus following both control flow paths). Later on, when it is discovered whether the branch is definitely taken or not taken, the correct speculative thread is chosen (and becomes definite) whereas the incorrect thread is squashed. This removes the need to re-fill the pipeline with the correct instructions as both branch paths are concurrently executed. This method requires extra work by the compiler, which introduces extra instructions (fork/bork) to inform the processor that it needs to run both branch paths as separate threads.

B. The BioThreads CMP

The BioThreads VLIW CMP is termed a hardware-assisted, explicit multithreaded architecture (software threads are user-specified, thread management is hardware-based) and is differentiated from other offerings in that area via a) its hardware PThreads primitives and b) its massive scalability, which can range from a single-thread, dual-issue core to a theoretical maximum of 4 K (256 contexts × 16 hypercontexts) shared-memory hardware threads in each of the maximum 256 distributed-memory multicomputers, for a theoretical total of 1 M threads, on up to 256-wide (VLIW issue slots) cores. Clearly these are theoretical maxima as, in such massively-parallel configurations, the latency of the memory system (within the same shared-memory multiprocessor) is substantially increased, potentially resulting in sub-optimal single-thread performance unless aggressive compiler-directed loop unrolling and pipelining is performed.

C. Research Contributions

The major contributions of this research are summarized as follows: a) A configurable, extensible chip-multiprocessor has been developed based on the open-source LE1 VLIW CPU, capable of performing key PThreads primitives directly in hardware. This is a unique feature of the LE1 (and BioThreads) engine and uniquely differentiates it from other key research such as hardware primitives for remote memory access [37]. In that respect, the BioThreads core can be thought of as a hybrid between an OS and a collection of processors, delivering services

(execution bandwidth and thread handling) to a higher order system and moving towards the real-time execution of compute-bound biomedical signal processing codes. b) The use of such a complex processing engine is advocated in the biomedical signal processing domain, for tasks such as the real-time blood perfusion calculation. Its inherent, multiparallel scalability allows for the real-time calculation of key computational kernels in this domain. c) A unified software-hardware flow has been developed so that all algorithm development takes place in the MATLAB environment, followed by automatic C-code generation and its introduction to the LE1 tool chain. This is a well encapsulated process which ensures that the biomedical engineer is not exposed to the intricacies of real-time software development for a complex, multicore SoC platform; at the same time, this methodology results in a working embedded system directly implementing the algorithmic functionality specified in the MATLAB input description with minimum user guidance.

II. THE BIOTHREADS ENGINE

The BioThreads CMP is based on the LE1 open-source processor, which it extends with execution primitives to support high speed image processing and dynamic thread allocation and mapping to uncommitted CPU cores. The BioThreads architecture specifies a hybrid, shared-memory multiprocessor/distributed memory multicomputer. The multiprocessor aspect of the BioThreads architecture falls in between the two categories (explicit and implicit) as it requires the user to explicitly identify the software threads in the code but, at the same time, implements hardware support for the creation/management/synchronization/termination of such threads. The thread management in the LE1 provides full hardware support for key PThreads primitives such as pthread_create/join/exit and pthread_mutex_init/lock/trylock/unlock/destroy. This is achieved with a hardware block, the thread control unit (TCU), whose purpose is to service these custom hardware calls and to start and stop execution of the multiple LE1 cores. The TCU is an explicit serialization point for which multiple contexts (cores) compete; PThreads command requests are internally serialized and the requesting contexts served in turn. The use of the TCU removes the overhead of an operating system for the LE1 as low-level PThreads services are provided in hardware; a typical pthread_create instruction completes in less than 20 clocks. This is a unique feature of the LE1 VLIW CMP and the primary differentiator with other VLIW multicore engines.

Fig. 1 depicts a high level overview of the BioThreads engine. The main components are the scalar platform, consisting of a service processor (the Xilinx Microblaze, a 5-stage pipeline, 32-bit CPU) and its subsystem based on the CoreConnect [38] bus architecture, and finally, the LE1 chip multiprocessor (CMP), which executes the signal processing kernels. Fig. 2 depicts the internal organization of a single LE1 context. The CPU consists of the instruction fetch engine (IFE), the execution core (LE1_CORE), the pipeline controller (PIPE_CTRL) and the load/store unit (LSU). The IFE can be configured with an instruction cache or, alternatively, a closely-coupled instruction RAM (IRAM). These are accessed every cycle and return a long instruction word (LIW) consisting of multiple RISCops for decode and


Fig. 1. BioThreads Engine showing LE1 cores, memory subsystem and overall architecture.

dispatch. The IFE controller handles interfacing to the external memory for ICache refills and provides debug capability into the ICache/IRAM. The IFE can also be configured with a branch predictor unit, currently based on the 2-bit saturating counter scheme (Smith predictor) in both set-associative and fully-associative (CAM-based) organizations. The LE1_CORE block includes the main execution datapaths of the CPU. There are a configurable number of clusters, each with its own register set. Each cluster includes an integer core (SCORE), a custom instruction core (CCORE) and, optionally, a floating point core (FPCORE). The integer and floating-point datapaths are of unequal pipeline

Fig. 2. Open-source LE1 Core pipeline organization.

depth; however, they maintain a common exception resolution point to support a precise-exception programmer's model. PIPE_CTRL is the primary control logic. It is a collection of interlocked, pipelined state machines which schedule the execution datapaths and monitor the overall instruction flow down the processing and memory pipelines. PIPE_CTRL maintains the decoding logic and control registers of the CPU and handshakes the host during debug operations. The LSU is the primary path of the LE1_CORE to the system memory. It allows for up to ISSUE_WIDTH (VLIW architectural width) memory operations per cycle and directly communicates with the shared data memory (STRMEM). The latter is a multibank, 2- or 3-stage pipelined cross-bar architecture which scales reasonably well (in terms of speed and area) for up to 8 clients and 8 banks (8 × 8), as shown in Table I and Table II. Note that the number of such banks and the number of LSU clients (LSU_CHANNELS) are not necessarily equal, allowing for further microarchitecture optimizations. This STRMEM block organization is depicted in Fig. 3. Finally, to allow for the exploitation of shared-memory TLP, multiple processing cores can be instantiated in a CMP configuration as shown in Fig. 4. The figure depicts a dual-LE1, single-cluster BioThreads system interfacing to the common streaming data RAM.

III. THE THREAD CONTROL UNIT

Thread management and dynamic allocation to hardware contexts takes place in the TCU. This is a set of hierarchical state machines responsible for the management of software threads and their allocation to execution resources (HCs, hypercontexts). It accepts PThreads requests from either the host or any of the executing hypercontexts. It maintains a series of hardware (state) tables and is a point of synchronization amongst all executing hypercontexts. Due to the need to directly control the operating mode of every hypercontext (HC) while having direct access to the system memory, the TCU resides in the DEBUG_IF (Fig. 1), where it makes use of the existing hardware infrastructure to stop, start, R/M/W memory and communicate with the host.

A critical block in thread management is the Context TCU, which manages locally (per context, in the PIPE_CTRL block) the distribution of PThreads instructions to the centralized TCU. Each clock, one of the active HCs in a context arbitrates for the use of the Context TCU; when granted access, the requested command is passed on to the TCU residing in the DBG_IF for centralized processing. Upon completion of the PThreads command, the Context TCU returns (to the requesting HC) the return values, as specified by that command. Fig. 5 depicts the thread control organization in the context of a single shared-memory system. The figure depicts a system containing a number of contexts (0 through N-1); for simplicity, each context contains two hypercontexts (HC0, HC1) and has direct access to the system-wide STRMEM for host-initiated DMA transfers and/or recovering the argument for void pthread_exit(void *value_ptr). The supported commands are listed in Table III.
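
To make concrete how the primitives of Table III appear to software, the following is a minimal, standard POSIX Threads fragment (plain C against the usual pthread.h API, not code from the BioThreads distribution); per the description above, on BioThreads these calls would be trapped by the Context TCU and serviced in hardware rather than by an operating system:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;          /* mutex init/lock/unlock/destroy appear in Table III */
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);    /* contended access serialized (by the TCU on BioThreads) */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    pthread_exit(NULL);               /* explicit termination primitive */
}

int main(void)
{
    pthread_t t[2];
    pthread_mutex_init(&lock, NULL);
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);   /* thread dispatched to an uncommitted core */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    pthread_mutex_destroy(&lock);
    printf("counter = %ld\n", counter);
    return 0;
}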


TABLE I BIOTHREADS REAL-TIME PERFORMANCE (DUAL-ISSUE LE1 CORES, FPGA AND ASIC)

TABLE II BIOTHREADS REAL-TIME PERFORMANCE (QUAD-ISSUE LE1 CORES, FPGA AND ASIC)

Fig. 3. Multibank streaming memory subsystem of the BioThreads CMP.

IV. BIOMEDICAL SIGNAL PROCESSING APPLICATION: IPPG

The application area selected to deploy the BioThreads VLIW-CMP engine for real-time biomedical signal processing was photoplethysmography (PPG), which is the measurement of blood volume changes in living tissue using optical means. PPG is primarily used in pulse oximetry for the point-measurement of oxygen saturation. In this application, PPG is implemented from an area measurement. The basic concept of this implementation, known as imaging PPG (IPPG), is to illuminate the tissue with a homogeneous, nonionizing light source and to detect the reflected light with a 2D sensor array. This yields a sequence of images (frames) from which a map (over the illuminated area) of the blood volume changes can

Fig. 4. BioThreads CMP, two LE1 cores connected via the streaming memory system.

be generated, for subsequent extraction of physiological parameters. The use of multiple wavelengths in the light source enables the reconstruction of blood volume changes at different depths of the tissue (due to the different penetration depth of


Fig. 5. Thread control organization within a single shared-memory BioThreads system.

Fig. 6. Imaging PPG System Architecture.

TABLE III LE1 PTHREADS HARDWARE SUPPORT

each wavelength), yielding a 3D map of the tissue function. This is the principle of operation of real-time IPPG [39]. Such functional maps have numerous applications in clinical diagnostics, including the assessment of the severity of skin burns or wounds, of cardiovascular surgical interventions and of overall cardiovascular function.

The overall IPPG system architecture is depicted in Fig. 6. A reflection-mode IPPG setup was deployed for the validation experiment in this investigation, the basic elements of which are a ring-light illuminator with arrays of red (660 nm) and infrared (880 nm) LEDs, a lens and a high sensitivity camera [40] as the detecting element, and the target skin tissue as defined by its optical coefficients and geometry. The use of the fast digital camera enables noncontact measurement at a sufficiently high sampling rate to allow PPG signal reconstruction from a large and homogeneously illuminated field of view at more than one wavelength, as shown in Fig. 7. The acquisition hardware synchronizes the illumination unit with the camera in order to perform multiplexed acquisition of a sequence of images of the area of interest. For optimum signal quality during acquisition, the subject is seated comfortably and asked to extend their hand evenly onto a padded surface, and ambient light is kept to a minimum.

The IPPG system typically consists of three processing stages (pre, main, post) as shown in Fig. 8. Preprocessing comprises an optional image stabilization stage, whose use is largely dependent on the quality of the acquisition setting. It is typically performed on each whole raw frame prior to the storage of raw images in memory and it can be implemented using a region of

Fig. 7. Schematic diagram of IPPG setup including the dual wavelength LED ring light, lens, CMOS camera and subject hand.

interest of a fixed size, meaning that its processing time is a function of frame rate but is independent of raw image size. The main processing stage comprises the conversion of raw data into the frequency domain, requiring a minimum number of frames, i.e., samples, to be performed. Conversion to the frequency domain is performed once for every pixel position in the raw image, and the number of data points per second of time-domain data to convert is determined by the frame rate, meaning that the processing time for this stage is a function of both frame rate and frame size. The postprocessing stage comprises the extraction of application-specific physiological parameters from time or frequency domain data and consists of operations such as statistical calculations and unit conversions (scaling and offsetting) of the processed data, which require relatively low processing power.

The ultimate scope of this experiment was to evaluate the performance of the BioThreads engine as a signal processing platform for IPPG, which was achieved by simplifying the processing workflow of the optophysiological assessment system [39]. Having established that image stabilization is easily scalable as it is performed frame-by-frame on a fixed-size segment of the data, the preprocessing stage was disregarded for this


Fig. 8. Complete signal processing workflow in IPPG setup.

study. The FFT cannot be performed point-by-point and thus poses the most significant constraint to the scalability of the system. The main processing stage was therefore targeted in this study as the representative process of the system, and the resultant workflow consisted of the transformation of detected blood volume changes in living tissue to the frequency domain via FFT, followed by extraction of physiological parameters for blood perfusion mapping. By employing the principles relating to photoplethysmography (PPG), blood perfusion maps were generated from the power of the PPG signals in the frequency domain. The acquired image frames were processed as follows: a) Two seconds' worth of image data (60 frames of size 64 × 64 pixels at 8-bit resolution) were recorded with the acquisition system. b) The average fundamental frequency of the PPG signal was manually extracted (1.4 Hz). c) Data were streamed to the BioThreads platform and the 64-point fast Fourier transform (FFT) of each pixel was calculated. This was done by taking the pixel values of all image frames for a particular pixel position to form a pixel value vector in the time domain. d) The power of the FFT tap corresponding to the PPG fundamental frequency was copied into a new matrix at the same coordinates as the pixel (or pixel cluster) under processing. In the presence of blood volume variation at that pixel (or pixel cluster), the power would be larger than if there were no blood volume variation.

Repeating (d) for all the remaining pixels (clusters) provides a new matrix (image) whose elements (pixels) depend on the detected blood volume variation power. This technique allows the generation of a blood perfusion map, as a high PPG power can be attributed to high blood volume variation and ultimately to blood perfusion. Fig. 9 illustrates in a simplified diagrammatic representation the algorithm discussed above, and Fig. 10 illustrates the output frame after the full image processing stage.
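
As a reference for steps (c) and (d) above, the per-pixel computation can be sketched as follows. This is an illustrative single-bin DFT in plain C rather than the 64-point FFT (with custom instructions) actually run on the engine, and the frame-major buffer layout is an assumption of the sketch, not taken from the authors' code:

#include <math.h>

#define N_FRAMES 60                    /* 2 s of data at 30 frames/s */
#define IMG_W    64
#define IMG_H    64

/* Power of DFT bin k of the time series formed by pixel (x, y) across all
 * frames; frames assumed stored frame-major: frames[f*IMG_W*IMG_H + y*IMG_W + x]. */
static double pixel_bin_power(const unsigned char *frames, int x, int y, int k)
{
    double re = 0.0, im = 0.0;
    for (int n = 0; n < N_FRAMES; n++) {
        double s  = (double)frames[n * IMG_W * IMG_H + y * IMG_W + x];
        double ph = 2.0 * 3.14159265358979323846 * (double)(k * n) / (double)N_FRAMES;
        re += s * cos(ph);
        im -= s * sin(ph);
    }
    return re * re + im * im;          /* step (d): power at the chosen tap */
}

/* Build the perfusion map from the bin nearest the PPG fundamental (1.4 Hz). */
void perfusion_map(const unsigned char *frames, double *map, double fps, double f_ppg)
{
    int k = (int)(f_ppg * N_FRAMES / fps + 0.5);   /* 1.4 Hz, 60 samples, 30 fps -> bin 3 */
    for (int y = 0; y < IMG_H; y++)
        for (int x = 0; x < IMG_W; x++)
            map[y * IMG_W + x] = pixel_bin_power(frames, x, y, k);
}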

Fig. 9. High-level view of the signal processing algorithm.

Fig. 10. (A) Original image. (B) Corresponding AC map (mean AC). (C) Corresponding AC power map at heart rate (HR = 1.4 Hz).

The algorithm was prototyped in the MATLAB environment and subsequently translated to C using the Embedded MATLAB compiler (emlc) for compilation on the VLIW CMP. The emlc tool enables the output of C from MATLAB functions: the required function is run with example data and C code is generated by MATLAB. In this case the example data was a full dataset of two seconds' worth of images in a one-dimensional array. The generated C code is a function which can be compiled and run in a handheld (FPGA-based) system. Alongside this function, a driver function was written to set up the input data and call the generated function. The generated function computes the whole frame; to be able to split this over multiple LE1 cores, the code was modified to include a start and an end value. This was a simple change which involved altering the loop variables within the C code (in MATLAB there was an issue exporting a function which used loop variables that were passed to the function). These values are computed by the driver function and passed to the generated function. Example of code (pseudocode):

Generated by MATLAB:
  autogen_func(inputArray, outputArray);
Altered to:
  autogen_func(inputArray, outputArray, start, end);
Driver code:
  main() {
    for (i = 0; i < number_of_threads; i++) {
      start = i * (frame_size / number_of_threads);
      end = start + (frame_size / number_of_threads);
      /* launch autogen_func(inputArray, outputArray, start, end) as thread i */
    }
  }

This way the number_of_threads constant can be easily altered and the code does not need to be rewritten/re-exported. Both the MATLAB-generated function and the driver function are then compiled using the LE1 tool-chain to create the machine code to run in simulation as well as on silicon.
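
For completeness, one plausible shape of such a driver on the BioThreads CMP is sketched below, using the hardware-backed pthread_create/pthread_join calls to hand each core a contiguous [start, end) pixel range. Only the autogen_func(inputArray, outputArray, start, end) interface is taken from the text above; the wrapper, argument structure, buffer types and constants are illustrative assumptions:

#include <pthread.h>

#define NUM_PIXELS  (64 * 64)          /* one 64 x 64 output frame */
#define NUM_THREADS 4                  /* the number_of_threads constant */

/* MATLAB-generated kernel, modified to process pixels [start, end). */
void autogen_func(const unsigned char *inputArray, float *outputArray,
                  int start, int end);

typedef struct {                       /* hypothetical per-thread argument block */
    const unsigned char *in;
    float *out;
    int start, end;
} work_t;

static void *worker(void *p)           /* thin wrapper so pthread_create can invoke the kernel */
{
    work_t *w = (work_t *)p;
    autogen_func(w->in, w->out, w->start, w->end);
    return NULL;
}

void process_frame(const unsigned char *inputArray, float *outputArray)
{
    pthread_t tid[NUM_THREADS];
    work_t    arg[NUM_THREADS];
    int chunk = NUM_PIXELS / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        arg[t].in    = inputArray;
        arg[t].out   = outputArray;
        arg[t].start = t * chunk;
        arg[t].end   = (t == NUM_THREADS - 1) ? NUM_PIXELS : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &arg[t]);  /* serviced by the TCU on BioThreads */
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);                      /* wait for all pixel ranges */
}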


Fig. 11. LabVIEW-based host application for hardware control (left) and analysis (right).

Fig. 12. Speedup of BioThreads performance for 2-wide LE1 subsystem (FPGA and ASIC).

The system front-end of Fig. 11 is implemented in LabVIEW and executes on the host system. The acquisition panel is used to control the acquisition of frames and to send the raw imaging data to the FPGA-resident BioThreads engine for processing. Upon completion, the processed frame is returned for display in the analysis panel, where the raw data is also accessible for further analysis in the time domain.

V. RESULTS AND DISCUSSION

This section presents the results of a number of experiments using the BioThreads platform for the real-time blood-volume change calculation. These results are split into two major sections: A) performance (real-time) results, pertaining to the real-time calculation of the blood perfusion map, and B) SoC platform results. The latter include data such as area and maximum frequency when targeting a Xilinx Virtex6 LX240T FG1156 [41] FPGA and a 0.13 µm, 1-poly, 8-metal (1P8M) standard-cell process. It should be noted that the FPGA device is on a near-state-of-the-art silicon node (40 nm, TSMC) whereas the available standard-cell library in our research lab is rather old. As such, the performance differential (300 MHz for the standard-cell target compared to 100 MHz for the FPGA target in Tables I and II) is certainly not representative of that expected when targeting a standard-cell process at an advanced silicon node (40 nm and below).

A. Performance Results

The 60 frames were streamed onto the target platform and the processors started executing the IPPG algorithms. Upon completion, the computed frame was returned to the host system for display. Table I shows the real execution time for the 2-wide LE1 system (VLIW CMP consisting of dual-static-issue cores) on the FPGA platform; ASIC results were obtained from simulation. The results shown in the tables are arranged in columns under the following headings:

Config: The macroarchitecture of the BioThreads engine, as LE1_CORES × MEMORY_BANKS.
LE1 Cores: Identifies the number of execution cores in the LE1 subsystem.
Data Memory Banks: The number of memory banks as depicted in Fig. 3 (and thus the maximum number of concurrent load/store operations supported by the streaming memory system). This plays a major role in the overall system performance, as will be shown below.

The remaining three parameters have been measured on an FPGA platform (100 MHz LE1 subsystem and service processor, as shown in Fig. 1) or derived from RTL simulations (ASIC implementation). Both sets of results were obtained without and with custom instructions for accelerating the FFT calculation.

Cycles: The number of clock cycles taken when executing the core signal processing algorithm.
Real time (sec): The real time taken to execute the algorithm (measured by the service processor for FPGA targets, calculated by RTL simulation for the ASIC target).
Speedup: The relative speedup of a BioThreads configuration compared to the degenerate case of a single-context, single-bank FPGA solution without the FFT custom instructions (formalized in the short formula below).

From a previous study on signal processing kernel acceleration (FFT) on the LE1 processor [42], it was concluded that the best performance is achieved with user-directed function in-lining, compiler-driven loop unrolling and custom instructions. Using the above methods, an 87% cycle reduction was achieved, thus making possible the execution of the IPPG algorithm steps in real-time. A subset of the speedup results of Table I (dual-issue performance) is plotted in Fig. 12. Speedup was calculated in reference to the 1 × 1 reference configuration with no optimizations. As shown, there is a maximum speedup of 125× on the standard-cell target and 41× on the FPGA target (the ASIC implementation is 3× faster compared to the FPGA), with full optimizations and custom instructions. With respect to the number of memory channels, best performance is achieved when the number of cores in the BioThreads engine equals the number of memory banks. This is expected, as memory bank conflicts increase when the number of cores exceeds the number of banks.

Fig. 13 shows the speedup values for the quad-issue configurations (a subset of the results from Table II). There is a similar trend in the shape of the graph as was seen in the dual-issue results and the same dependency on the number of memory banks is clearly seen. The Real Time (sec) values in Tables I and II highlighted in grey identify the BioThreads configurations whose processing time is less than the acquisition time. These 14 configurations are instances of the BioThreads processor that achieve real-time performance.
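
Expressed as a formula (notation ours), the speedup reported in Tables I and II and plotted in Figs. 12 and 13 is

\[ S_{\mathrm{config}} = \frac{T_{1 \times 1,\ \mathrm{FPGA,\ no\ custom\ instructions}}}{T_{\mathrm{config}}} \]

and, taken on its own, the 87% cycle reduction quoted above for in-lining, unrolling and the FFT custom instructions corresponds to a factor of roughly \(1/(1-0.87) \approx 7.7\) in kernel cycle count.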


TABLE IV LE1 COMPARISON WITH THE VEX VLIW/DSP

Fig. 13. Speedup of BioThreads performance for the 4-wide LE1 subsystem (FPGA and ASIC).

On an FPGA target, real-time is achieved with 8 quad-issue LE1 cores and a 4- or 8-bank memory system. The FPGA device chosen (Virtex6 LX240T) can accommodate only 5 such LE1 cores (along with the service processor system) and thus cannot meet the real-time constraint. For the dual-issue configuration, however, near-real-time performance is achieved with 8 cores and a 4- or 8-bank streaming memory (acquisition time of 2.00 s, processing time of 2.85 s and 2.16 s respectively). Finally, full real-time is achieved with 10 dual-issue cores. These configurations can be accommodated on the target FPGA and thus are the preferred configurations of the BioThreads engine for this study. On the rather old standard-cell technology, both dual- and quad-issue configurations achieve the required post-layout speed of 300 MHz. In the first case, 4 cores and a 2-bank memory system are sufficient; for quad-issue cores, a 4-core by 1-bank system is sufficient. Configurations around the acquisition time of 2 s could be considered as viably real-time as the subsecond time difference may not be noticeable in the clinical environment or be important in the final use of the processed data.

An interesting comparison for the FPGA targets is with the Microblaze processor provided by the FPGA vendor. The latter is a 5-stage, scalar processor (similar to the ARM9 in terms of microarchitecture) with 32 K instruction and data caches. A write-through configuration was selected for the data cache to reduce the PLB interconnect traffic. The code was recompiled for the Microblaze using the gcc toolchain with O3 optimizations and run in single-thread mode as that processor is not easily scalable; running it in PThreads mode involves substantial OS intervention and thus the total runtime is longer than that for a single thread. The application took 864 s (0.07 fps) to execute on a 62.5 MHz FPGA board, which is 9.6 times slower than the dual-issue reference 100 MHz single-core LE1 (Table I, 1 × 1 configuration, 90.17 s). Extrapolating this figure to the LE1 clock of 100 MHz and assuming no memory subsystem degradation (an optimistic assumption) shows that the Microblaze processor is 6 times slower than the reference LE1 configuration (0.11 fps); finally, compared to the maximal dual-issue configuration (8 × 8) of Table I with full optimizations, the scalar processor proved to be more than 240 times slower (525.00 s versus 2.16 s).

A final evaluation of the performance of the BioThreads platform was carried out against the (simulated) VEX VLIW/DSP. The VEX ISA is closely related to the Multiflow TRACE architecture and a commercial implementation of the ISA is the

ST Microelectronics ST210 series of VLIW cores. The VLIW ISA has good DSP support since it includes 16/32-bit multiplications, local data memory, and instruction and data prefetching, and it is supported by a direct descendant of the Multiflow trace scheduling compiler. The simulated VEX processor configuration included a 32 K, 2-way ICache and a 64 K, 4-way, write-through data cache whereas the BioThreads configuration used included a direct instruction and data memory system. Both processors executed the single-threaded version of the workload as there was no library/OS support in VEX for the PThreads primitives included in the BioThreads core. Table IV depicts the comparison, in which it is clear that the local instruction/data memory system of the BioThreads core plays an important role in the latter achieving 62.00% (2-wide) and 54.21% (4-wide) better clocks compared to the VEX CPU.

As with any engineering study, the chosen system very much depends on tradeoffs between speed and area costs. These are detailed in the next section.

1) VLSI Platform Results: This section details the VLSI implementation of a number of configurations of interest of the BioThreads engine, implemented on both standard-cell (TSMC 0.13 µm, 1-poly, 8-metal, high-speed process) and a 40 nm state-of-the-art FPGA device (Xilinx Virtex6 LX240T-FF1156-1).

2) Xilinx Virtex6 LX240T-FF1156-1: To provide further insight on the use of advanced system FPGAs when implementing a CMP platform, the BioThreads processor was targeted to a midrange device from the latest Xilinx Virtex6 family. The following configurations were implemented: a) Dual-issue BioThreads configurations (2 IALUs, 2 IMULTs, 1 LSU_CHANNEL per LE1 core, 4-bank STRMEM): 1, 2, 4 and 8 LE1 cores, each with a private 128 KB IRAM and a shared 256 KB DRAM, for a total memory of 384 KB, 512 KB, 1024 KB, and 1.5 MB respectively. b) Quad-issue BioThreads configurations (4 IALUs, 4 IMULTs, 1 LSU_CHANNEL per LE1 core, 8-bank STRMEM): 1, 2 and 4 LE1 cores. In this case, the same amount of IRAM and DRAM was chosen, for a total memory of 384 KB, 512 KB, and 1024 KB respectively.

Both dual- and quad-issue configurations included a Microblaze 32-bit scalar processor system (5-stage pipeline, no FPU, no MMU, 8 K ICache and DCache/write-through, acting as the service processor) with a 32-bit single PLB backbone, to interface to the camera, stream in frames, initiate execution on the BioThreads processor and extract the processed results (oxygen map) upon completion. Note that the service processor in these implementations is of lower specification than the standalone Microblaze processor used in the performance comparison at the end of the


TABLE V 2-ISSUE BIOTHREADS CONFIGURATIONS ON VIRTEX6 LX240T

TABLE VI 4-ISSUE BIOTHREADS CONFIGURATIONS ON VIRTEX6 LX240T

previous section, as the Microblaze does not participate in the computations in this case. The Xilinx Kernel RTOS (Xilkernel) was used to provide basic software services (stdio) to the service processor and thus to the rest of the BioThreads engine. Finally, the calculated oxygen map was returned to the host system for display purposes in the LabVIEW environment. The dual-issue configurations achieved the requested operating frequency of 100 MHz (this is the limit at which both the Microblaze platform and the BioThreads engine can operate with the external DDR3; however, the 4-wide configurations exposed a critical path at that target frequency, resulting in a lower overall system speed of 83.3 MHz. For the case where only the on-board block RAM is used, this is significantly increased). Table V depicts the post-synthesis results for the dual-issue platforms whereas Table VI depicts the synthesis results for the quad-issue platforms.

The above tables include the number of LE1 cores in the BioThreads engine (Contexts), the absolute and relative (for the given device) numbers of LUTs, flip-flops and block RAMs used, and the number of LUTs and flip-flops per issue slot for the VLIW CMP. The latter two metrics are used to compare the hardware efficiency of the 2-wide and 4-wide configurations. A number of interesting conclusions are drawn from the data: The 2-wide LE1 core is 27.12% to 29.60% more efficient (LUTs per issue slot) compared to the 4-wide core. This is expected and it is attributed to the overhead of the pipeline bypass network. The internal bypassing for one operand is shown in Fig. 14; however, the LSU bypass paths are not depicted in the figure for clarity. The quad-issue configurations exhibit substantial multiplexing overhead as the number of source operands is doubled (to 2 × ISSUE_WIDTH 32-bit operands per VLIW core), with each tapping up to ISSUE_WIDTH × 2 (IALU) + ISSUE_WIDTH × 2 (IMULT) + ISSUE_WIDTH × 2 (LSU) result buses. In terms of absolute silicon area, the Virtex6 device can accommodate up to 11 dual-issue LE1 cores (22 issue slots per cycle) and up to 5 quad-issue cores (20 issue slots per cycle), with the service processor subsystem present in both cases; the quad-issue system however achieves 83 MHz

Fig. 14. Pipeline bypass taps as a function of IALUs and IMULTs.

instead of the required (by Table I) 100 MHz for real-time operation. This suggests that the lower issue-width configurations are more beneficial for applications exhibiting good TLP (many independent software threads) while exploiting a lower amount of ILP. As shown in the performance (real-time) section, the IPPG workload exhibits substantial TLP, thus making the 2-wide configurations more attractive.

3) Standard-Cell (TSMC 0.13LV Process): For the standard-cell (ASIC) target, dual-issue (2 IALUs, 2 IMULTs, 2 LSU_CHANNELS, 4 memory banks) and quad-issue (4 IALUs, 4 IMULTs, 4 LSU_CHANNELS, 8 memory banks) configurations were used in 1-, 2-, 3- and 4-core organizations. To facilitate collection of results after long RTL-simulation runtimes, a scripting mechanism was used to automatically modify the RTL configuration files, execute RTL (presynthesis) regression tests, perform front-end synthesis (Synopsys dc_shell, nontopographical mode) and execute postsynthesis simulations. Fig. 15 shows the postroute (real silicon) area of the configurations studied. In this case, the 4-wide is only 15.59% larger than the 2-wide processor, with that number increasing to 18.96%. The postroute areas of a 4-core, dual-issue and a 3-core, quad-issue system are nearly identical and these are the configurations depicted in Fig. 16.


Fig. 15. VLSI campaign area (µm sq.).

Fig. 16. TSMC 0.13 µm VLSI results. (A) 4 instances of a 2-wide LE1 core. (B) 3 instances of a 4-wide LE1 core.

VI. CONCLUSIONS

This work discussed the methodology and evaluated the feasibility of using a novel, configurable VLIW CMP for accelerating biomedical signal processing codes. Experts in the biomedical signal processing domain require advanced processing capabilities without having to resort to the expertise routinely utilized in the consumer electronics and telecommunications domains, where the silicon platform is designed by one team, benchmarked and optimized by a second team and programmed by a third. A deliberate choice was made to stay within the reach of tools routinely utilized by biomedical signal processing practitioners, such as MATLAB and LabVIEW; the BioThreads tools infrastructure was designed to facilitate C-level programmability in the particular application domain.

Use of the configurable, extensible BioThreads engine was demonstrated for computing, in real time or near real time, the blood perfusion of living tissue, using algorithms developed in the Embedded MATLAB subset. The autogenerated C code was passed on to the toolchain, which compiled it into an application binary and performed a coarse architecture space evaluation to identify the best BioThreads configurations that achieve the required level of performance. Following that, the code was loaded onto the CMP residing on the FPGA and datasets were streamed from the host system (via the LabVIEW front-end) for accelerated calculation.

This work presented a custom signal processing engine capable of a high data-processing throughput at a significantly lower operating frequency than general purpose processors. In the context of imaging PPG, real-time capability was demonstrated for frequency-domain processing of 64 × 64 pixel images at 30 frames per second. The system is easily scalable via its parameterization and allows for real-time execution on larger frame sizes and/or faster frame rates. Further work will investigate the introduction of single-instruction, multiple-data (SIMD) processor state and custom image processing vector instructions and their effect on accelerating algorithmic kernels from the same domain.

REFERENCES

[1] K. Rajan and L. M. Patnaik, "CBP and ART image reconstruction algorithms on media and DSP processors," Microprocess. Microsyst., vol. 25, pp. 233-238, 2001.
[2] O. Dandekar and R. Shekhar, "FPGA-accelerated deformable image registration for improved target-delineation during CT-guided interventions," IEEE Trans. Biomed. Circuits Syst., vol. 1, no. 2, pp. 116-127, 2007.
[3] K. Wardell and G. E. Nilsson, "Duplex laser Doppler perfusion imaging," Microvasc. Res., vol. 52, pp. 171-182, 1996.
[4] S. Srinivasan, B. W. Pogue, S. D. Jiang, H. Dehghani, C. Kogel, S. Soho, J. J. Gibson, T. D. Tosteson, S. P. Poplack, and K. D. Paulsen, "Interpreting haemoglobin and water concentration, oxygen saturation, and scattering measured in vivo by near infrared breast tomography," Proc. Natl. Acad. Sci. USA, vol. 100, no. 21, pp. 12349-12354, 2003.
[5] S. Hu, J. Zheng, V. A. Chouliaras, and R. Summers, "Feasibility of imaging photoplethysmography," in Proc. Conf. BioMedical Engineering and Informatics, Sanya, China, 2008, pp. 72-75.
[6] P. Shi, V. Azorin Peris, A. Echiadis, J. Zheng, Y. Zhu, P. Y. S. Cheang, and S. Hu, "Non-contact reflection photoplethysmography towards effective human physiological monitoring," J. Med. Biol. Eng., vol. 30, no. 30, pp. 161-167, 2010.
[7] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, F. Rovati, and D. Alfonso, "A multi-standard video accelerator based on a vector architecture," IEEE Trans. Consum. Electron., vol. 51, no. 1, pp. 160-167, 2005.
[8] ARM Cortex M3 Processor Specification, Sep. 2010 [Online]. Available: http://www.arm.com/products/processors/cortex-m/cortexm3.php
[9] Microblaze Processor Reference Guide, Doc. UG081 (v10.3), Oct. 2010 [Online]. Available: http://www.xilinx.com
[10] D. Mattson and M. Christensson, "Evaluation of synthesizable CPU cores," Master's thesis, Dept. Computer Engineering, Chalmers Univ. Technology, Goteborg, Sweden, 2004.
[11] V. A. Chouliaras and J. L. Nunez, "Scalar coprocessors for accelerating the G723.1 and G729A speech coders," IEEE Trans. Consum. Electron., vol. 49, no. 3, pp. 703-710, 2003.
[12] N. Vassiliadis, G. Theodoridis, and S. Nikolaidis, "The ARISE reconfigurable instruction set extensions framework," in Proc. Int. Conf. Embedded Computer Systems: Architectures, Modelling and Simulation, Jul. 16-19, 2007, pp. 153-160.
[13] Xilinx XPS Mailbox V1.0a Data Sheet, Oct. 2010 [Online]. Available: http://www.xilinx.com
[14] M. F. Dossis, T. Themelis, and L. Markopoulos, "A web service to generate program coprocessors," in Proc. 4th IEEE Int. Workshop Semantic Media Adaptation and Personalization, Dec. 2009, pp. 121-128.
[15] B. Gorjiara, M. Reshadi, and D. Gajski, "Designing a custom architecture for DCT using NISC technology," in Proc. Design Automation, Asia and South Pacific Conf., 2006, pp. 24-27.
[16] V. Kathail, S. Aditya, R. Schreiber, B. Ramakrishna Rau, D. Cronquist, and M. Sivaraman, "PICO: Automatically designing custom computers," IEEE Comput., vol. 35, pp. 39-47, 2002.
[17] R. Thomson, S. Moyers, D. Mulvaney, and V. A. Chouliaras, "The UML-based design of a hardware H.264/MPEG 4 AVC video decompression core," in Proc. 5th Int. UML-SoC Workshop (in conjunction with 45th DAC), Anaheim, CA, Jun. 2008, pp. 1-6.
[18] The AutoESL AutoPilot High-Level Synthesis Tool, May 2010 [Online]. Available: http://www.autoesl.com/docs/bdti_autopilot_final.pdf
[19] Y. Guo and J. R. Cavallaro, "A low complexity and low power SoC design architecture for adaptive MAI suppression in CDMA systems," J. VLSI Signal Process. Syst. Signal Image Video Technol., vol. 44, pp. 195-217, 2006.
[20] S. Leibson and J. Kim, "Configurable processors: A new era in chip design," IEEE Comput., vol. 38, no. 7, pp. 51-59, 2005.
[21] N. T. Clark, H. Zhong, and S. A. Mahlke, "Automated custom instruction generation for domain-specific processor acceleration," IEEE Trans. Comput., vol. 54, no. 10, pp. 1258-1270, 2005.
[22] D. Stevens and V. A. Chouliaras, "LE1: A parameterizable VLIW chip-multiprocessor with hardware PThreads support," in Proc. IEEE Symp. VLSI, Jul. 2010, pp. 122-126.


REFERENCES

[1] K. Rajan and L. M. Patnaik, "CBP and ART image reconstruction algorithms on media and DSP processors," Microprocess. Microsyst., vol. 25, pp. 233–238, 2001.
[2] O. Dandekar and R. Shekhar, "FPGA-accelerated deformable image registration for improved target-delineation during CT-guided interventions," IEEE Trans. Biomed. Circuits Syst., vol. 1, no. 2, pp. 116–127, 2007.
[3] K. Wardell and G. E. Nilsson, "Duplex laser Doppler perfusion imaging," Microvasc. Res., vol. 52, pp. 171–182, 1996.
[4] S. Srinivasan, B. W. Pogue, S. D. Jiang, H. Dehghani, C. Kogel, S. Soho, J. J. Gibson, T. D. Tosteson, S. P. Poplack, and K. D. Paulsen, "Interpreting haemoglobin and water concentration, oxygen saturation, and scattering measured in vivo by near infrared breast tomography," Proc. Natl. Acad. Sci. USA, vol. 100, no. 21, pp. 12349–12354, 2003.
[5] S. Hu, J. Zheng, V. A. Chouliaras, and R. Summers, "Feasibility of imaging photoplethysmography," in Proc. Conf. Biomedical Engineering and Informatics, Sanya, China, 2008, pp. 72–75.
[6] P. Shi, V. Azorin Peris, A. Echiadis, J. Zheng, Y. Zhu, P. Y. S. Cheang, and S. Hu, "Non-contact reflection photoplethysmography towards effective human physiological monitoring," J. Med. Biol. Eng., vol. 30, no. 30, pp. 161–167, 2010.
[7] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, F. Rovati, and D. Alfonso, "A multi-standard video accelerator based on a vector architecture," IEEE Trans. Consum. Electron., vol. 51, no. 1, pp. 160–167, 2005.
[8] ARM Cortex M3 Processor Specification, Sep. 2010 [Online]. Available: http://www.arm.com/products/processors/cortex-m/cortexm3.php
[9] Microblaze Processor Reference Guide, Doc. UG081 (v10.3), Oct. 2010 [Online]. Available: http://www.xilinx.com
[10] D. Mattson and M. Christensson, "Evaluation of Synthesizable CPU Cores," M.S. thesis, Dept. Computer Engineering, Chalmers Univ. Technology, Goteborg, Sweden, 2004.
[11] V. A. Chouliaras and J. L. Nunez, "Scalar coprocessors for accelerating the G723.1 and G729A speech coders," IEEE Trans. Consum. Electron., vol. 49, no. 3, pp. 703–710, 2003.
[12] N. Vassiliadis, G. Theodoridis, and S. Nikolaidis, "The ARISE reconfigurable instruction set extensions framework," in Proc. Int. Conf. Embedded Computer Systems: Architectures, Modelling and Simulation, Jul. 16–19, 2007, pp. 153–160.
[13] Xilinx XPS Mailbox V1.0a Data Sheet, Oct. 2010 [Online]. Available: http://www.xilinx.com
[14] M. F. Dossis, T. Themelis, and L. Markopoulos, "A web service to generate program coprocessors," in Proc. 4th IEEE Int. Workshop Semantic Media Adaptation and Personalization, Dec. 2009, pp. 121–128.
[15] B. Gorjiara, M. Reshadi, and D. Gajski, "Designing a custom architecture for DCT using NISC technology," in Proc. Design Automation, Asia and South Pacific Conf., 2006, pp. 24–27.
[16] V. Kathail, S. Aditya, R. Schreiber, B. Ramakrishna Rau, D. Cronquist, and M. Sivaraman, "PICO: Automatically designing custom computers," IEEE Comput., vol. 35, pp. 39–47, 2002.
[17] R. Thomson, S. Moyers, D. Mulvaney, and V. A. Chouliaras, "The UML-based design of a hardware H.264/MPEG-4 AVC video decompression core," in Proc. 5th Int. UML-SoC Workshop (in conjunction with 45th DAC), Anaheim, CA, Jun. 2008, pp. 1–6.
[18] The AutoESL AutoPilot High-Level Synthesis Tool, May 2010 [Online]. Available: http://www.autoesl.com/docs/bdti_autopilot_nal.pdf
[19] Y. Guo and J. R. Cavallaro, "A low complexity and low power SoC design architecture for adaptive MAI suppression in CDMA systems," J. VLSI Signal Process. Syst. Signal Image Video Technol., vol. 44, pp. 195–217, 2006.
[20] S. Leibson and J. Kim, "Configurable processors: A new era in chip design," IEEE Comput., vol. 38, no. 7, pp. 51–59, 2005.
[21] N. T. Clark, H. Zhong, and S. A. Mahlke, "Automated custom instruction generation for domain-specific processor acceleration," IEEE Trans. Comput., vol. 54, no. 10, pp. 1258–1270, 2005.
[22] D. Stevens and V. A. Chouliaras, "LE1: A parameterizable VLIW chip-multiprocessor with hardware PThreads support," in Proc. IEEE Symp. VLSI, Jul. 2010, pp. 122–126.
[23] LE1 VLIW Core Source Code, Oct. 2010 [Online]. Available: http://www.vassilios-chouliaras.com/le1
[24] V. A. Chouliaras, G. Lentaris, D. Reisis, and D. Stevens, "Customizing a VLIW chip multiprocessor for motion estimation algorithms," in Proc. 2nd Workshop Parallel Programming and Run-Time Management Techniques for Many-Core Architectures, Lake Como, Italy, Feb. 23, 2011, pp. 178–184.
[25] Trimaran 4.0 Manual, Oct. 2010 [Online]. Available: http://www.trimaran.org
[26] J. A. Fisher and P. Faraboschi, Embedded Computing: A VLIW Approach to Architectures, Compilers and Tools. San Mateo, CA: Morgan Kaufmann, 2005.
[27] R. P. Colwell, O. P. Nix, J. J. O'Donell, D. B. Papworth, and P. K. Rodman, "A VLIW architecture for a trace scheduling compiler," IEEE Trans. Comput., vol. 37, no. 8, pp. 967–979, 1988.
[28] T. Ungerer, B. Robic, and J. Silc, "A survey of processors with explicit multithreading," ACM Comput. Surv., vol. 35, no. 1, pp. 29–63, 2003.
[29] HiveFlex CSP2500 Series Digital RF Processor Databrief, Apr. 2011 [Online]. Available: http://www.siliconhive.com/Flex/Site/Page.aspx?PageID=13326
[30] A. Suga and S. Imai, "FR-V single chip multicore processor: FR1000," Fujitsu Sci. Tech. J., vol. 42, no. 2, pp. 190–199, Apr. 2006.
[31] A. K. Jones, R. Hoare, D. Kusic, J. Stander, G. Mehta, and J. Fazekas, "A VLIW processor with hardware functions: Increasing performance while reducing power," IEEE Trans. Circuits Syst., vol. 53, no. 11, pp. 1250–1254, 2006.
[32] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins, "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in Proc. Field Programmable Logic, 2003, pp. 61–70.
[33] I. Al Khatib, F. Poletti, D. Bertozzi, L. Benini, M. Bechara, H. Khalifeh, A. Jantsch, and R. Nabiev, "A multiprocessor system-on-chip for real-time biomedical monitoring and analysis: Architectural design space exploration," in Proc. IEEE/ACM Design Automation Conf., Sep. 2006, pp. 125–130.
[34] P. K. Dubey, K. O'Brien, K. M. O'Brien, and C. Barton, "Single-program speculative multithreading (SPSM) architecture: Compiler-assisted fine-grained multithreading," in Proc. Int. Conf. Parallel Architecture and Compilation Techniques, Jun. 1995, pp. 109–121.
[35] E. Ozer, T. M. Conte, and S. Sharma, "Weld: A multithreading technique towards latency tolerant VLIW processors," in Lecture Notes in Computer Science. Berlin, Germany: Springer, 2001.
[36] E. Ozer and T. M. Conte, "High-performance and low-cost dual-thread VLIW processor using Weld architecture paradigm," IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 12, pp. 1132–1142, 2005.
[37] S. G. Ziavras, A. V. Gerbessiotis, and R. Bafnaa, "Coprocessor design to support MPI primitives in configurable multiprocessors," Integration, VLSI J., vol. 40, pp. 235–252, 2007.
[38] CoreConnect Architecture, Oct. 2010 [Online]. Available: http://www.xilinx.com
[39] J. Zheng, S. Hu, V. Azorin-Peris, A. Echiadis, V. A. Chouliaras, and R. Summers, "Remote simultaneous dual wavelength imaging photoplethysmography: A further step towards 3-D mapping of skin blood microcirculation," in Proc. BiOS SPIE, 2008, vol. 6850, p. 68500S-1.
[40] MC1311 High-Speed 1.3 MPixel Colour CMOS Imaging Camera, Oct. 2010 [Online]. Available: http://www.mikrotron.de
[41] FG1156 FPGA Datasheet, Oct. 2010 [Online]. Available: http://www.xilinx.com
[42] D. Stevens, N. Glynn, P. Galiatsatos, V. A. Chouliaras, and D. Reisis, "Evaluating the performance of a configurable, extensible VLIW processor in FFT execution," in Proc. Int. Conf. Environmental and Computer Science, 2009, pp. 771–774.

Vassilios Chouliaras was born in Athens, Greece, in 1969. He received the B.Sc. degree in physics and laser science from Heriot-Watt University, Edinburgh, Scotland, the M.Sc. degree in VLSI systems engineering from the University of Manchester Institute of Science and Technology, Manchester, U.K., and the Ph.D. degree from Loughborough University, Leicestershire, U.K., in 1993, 1995, and 2006, respectively. He has worked as an ASIC Design Engineer for INTRACOM SA and as a Senior R&D Engineer/Processor Architect for ARC International. Currently, he is a Senior Lecturer in the Department of Electronic and Electrical Engineering, Loughborough University, where he is leading the research in CPU architecture and microarchitecture, SoC modeling, and ESL methodologies. He is the architect of the BioThreads platform and a Founder of Axilica Ltd.

Vicente Azorin Peris was born in Valencia, Spain, in 1981. He received the B.Eng. degree in electronic and electrical engineering and the Ph.D. degree in biomedical photonics engineering from Loughborough University, Leicestershire, U.K., in 2004 and 2009, respectively. For the past five years, he has worked on functional biomedical sensing and imaging techniques, including finite-element simulation of the interaction between light and human tissue, as a Ph.D. student and as a Research Associate in the Photonics Engineering and Health Technology Research Group at Loughborough University. He is currently employed by Ansivia Solutions Ltd., Loughborough, U.K., where he is involved in research and development of in-vitro diagnostic point-of-care testing instrumentation.

Jia Zheng was born in Heilongjiang, China, in 1980. She received the B.Sc. degree in electrical engineering from the Beijing Institute of Technology, Beijing, China, in 2003, and the M.Sc. degree in digital communication systems and the Ph.D. degree in biomedical photonics engineering from Loughborough University, Leicestershire, U.K., in 2005 and 2010, respectively. During her Ph.D. studies, she specialized in imaging photoplethysmography and opto-physiological monitoring. Recently, she joined the National Institute for the Control of Pharmaceutical and Biological Products (NICPBP), Beijing, China, as a Researcher for the regulation of in-vivo physiological assessment devices and instrumentation.

Angelos Echiadis was born in Orestiada, Greece, in 1977. He received the first-class B.Eng. degree with distinction in electrical and electronic engineering from Heriot-Watt University, Edinburgh, Scotland, and the Ph.D. degree in photonics and health technology from Loughborough University, Leicestershire, U.K., in 2003 and 2007, respectively. Currently, he works as a biomedical systems engineering professional at Dialog Devices Ltd., Loughborough, U.K. His research interests include multidimensional signal acquisition and characterization of live tissues, as well as photoplethysmography in pulse oximetry and in other medical applications.

Sijung Hu (SM'10) was born in Shanghai, China, in 1960. He received the B.Sc. degree in chemical engineering from Donghua University, Shanghai, China, the M.Sc. degree in environmental monitoring from the University of Surrey, Guildford, U.K., in 1995, and the Ph.D. degree in fluorescence spectrophotometry from Loughborough University, Leicestershire, U.K., in 2000. He was appointed as a Senior Scientist at Kalibrant Ltd., Loughborough, U.K., for R&D of medical instruments from 1999 to 2002. Currently, he is a Senior Research Fellow and Leader of the Photonics Engineering Research Group in the Department of Electronic and Electrical Engineering, Loughborough University, and a Visiting Professor of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China, with numerous contributions in opto-physiological assessment, biomedical instrumentation, and signal and image processing.

David Stevens was born in Melton Mowbray, U.K., in 1985. He received the B.Eng. degree in computer systems engineering from Loughborough University, Leicestershire, U.K., in 2007. He began working toward the Ph.D. degree at Loughborough University in 2008, investigating methods of acceleration for a VLIW processor. His research focuses on both the implementation and verification of a tool-chain for an open-source VLIW processor incorporating various methods of acceleration, including hardware threading. Currently, he is a Research Associate at Loughborough University working on a project to develop an integrated framework for embedded system tools.