vliw ieee paper jun2012

12
IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012 257 BioThreads: A Novel VLIW-Based Chip Multiprocessor for Accelerating Biomedical Image Processing Applications David Stevens, Vassilios Chouliaras, Vicente Azorin-Peris, Jia Zheng, Angelos Echiadis, and Sijung Hu, Senior Member, IEEE Abstract—We discuss BioThreads, a novel, configurable, exten- sible system-on-chip multiprocessor and its use in accelerating biomedical signal processing applications such as imaging pho- toplethysmography (IPPG). BioThreads is derived from the LE1 open-source VLIW chip multiprocessor and efficiently handles instruction, data and thread-level parallelism. In addition, it sup- ports a novel mechanism for the dynamic creation, and allocation of software threads to uncommitted processor cores by imple- menting key POSIX Threads primitives directly in hardware, as custom instructions. In this study, the BioThreads core is used to accelerate the calculation of the oxygen saturation map of living tissue in an experimental setup consisting of a high speed image acquisition system, connected to an FPGA board and to a host system. Results demonstrate near-linear acceleration of the core kernels of the target blood perfusion assessment with increasing number of hardware threads. The BioThreads processor was implemented on both standard-cell and FPGA technologies; in the first case and for an issue width of two, full real-time performance is achieved with 4 cores whereas on a mid-range Xilinx Virtex6 device this is achieved with 10 dual-issue cores. An 8-core LE1 VLIW FPGA prototype of the system achieved 240 times faster ex- ecution time than the scalar Microblaze processor demonstrating the scalability of the proposed solution to a state-of-the-art FPGA vendor provided soft CPU core. Index Terms—Biomedical image processing, field program- mable gate arrays (FPGAs), imaging photoplethysmography (IPPG), microprocessors, multicore processing. I. INTRODUCTION AND MOTIVATION B IOMEDICAL in-vitro and in-vivo assessment relies on the real-time execution of signal processing codes as a key to enabling safe, accurate and timely decision-making, allowing clinicians to make important decisions and perform Manuscript received December 20, 2010; revised May 03, 2011; accepted August 15, 2011. This work was supported by Loughborough University, U.K. Date of publication November 04, 2011; date of current version May 22, 2012. This paper was recommended by Associate Editor Patrick Chiang. D. Stevens, V. Chouliaras, and V. Azorin-Peris are with the Department of Electrical Engineering, Loughborough University, Leicestershire LE11 3TU, U.K. J. Zheng is with the National Institute for the Control of Pharmaceutical and Biological Products (NICPBP), China, No.2, Tiantan Xili, Chongwen District, Beijing 100050, China. A. Echiadis is with Dialog Devices Ltd., Loughborough LE11 3EH, U.K. S. Hu is with the Department of Electronic and Electrical Engineering, Loughborough University, Leicestershire LE11 3TU, U.K. (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBCAS.2011.2166962 medical interventions as these are based on hard facts, derived in real-time from physiological data [1], [2]. In the area of biomedical image processing, a number of imaging methods have been proposed over the past few years including laser Doppler [3], optical coherence tomography [4] and more recently, imaging photoplethysmography (IPPG) [5], [6]; How- ever, none of these techniques can attain their true potential without a real-time biomedical image processing system based on very large scale integration (VLSI) systems technology. For instance, the quality and availability of physiological information from an IPPG system is directly related to the frame size and frame rate used by the system. From a user perspective, the extent to which such a system can run in real-time is a key factor in its usability, and practical imple- mentations of the system ultimately aim to be standalone and portable to achieve its full applicability. This is an area where advanced computer architecture concepts, routinely utilized in high-performance consumer and telecoms systems-on-chip (SoC) [7], can potentially provide the required data streaming and execution bandwidth to allow for the real-time execution of algorithms that would otherwise be executed offline (in batch mode) using more established techniques and platforms (e.g., sequential execution on a PC host). A quantitative comparison in this study (results and discussion) illustrates the foreseen performance gains, showing that a scalar embedded processor is six times slower than the single-core configuration of our research platform. Such SoC-based architectures typically include scalar em- bedded processor cores with a fixed instruction-set-architecture (ISA) which are widely used in standard-cell (ASIC) [8] and re- configurable (FPGA)-based embedded systems [9]. These pro- cessors present a good compromise for the execution of general- purpose codes such as the user interface, low-level/bandwidth protocol processing, the embedded operating system (eOS) and occasionally, low-complexity signal processing tasks. However, they lack considerably in the area of high-throughput execu- tion and high-bandwidth data movement as is often required by the core algorithms in most signal processing application do- mains. An interesting comparison of the capabilities of three such scalar engines targeting field-programmable technologies (FPGAs) is given in [10]. To relieve this constraint, scalar embedded processors have been augmented with DSP coprocessors in both tightly-coupled [11] and loosely-coupled configurations [12] to target perfor- mance-critical inner loops of DSP algorithms. A side-effect of this approach is the lack of homogeneity in the SoC platform 1932-4545/$26.00 © 2011 IEEE

Transcript of vliw ieee paper jun2012

Page 1: vliw ieee paper  jun2012

IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012 257

BioThreads: A Novel VLIW-Based ChipMultiprocessor for Accelerating Biomedical

Image Processing ApplicationsDavid Stevens, Vassilios Chouliaras, Vicente Azorin-Peris, Jia Zheng, Angelos Echiadis, and

Sijung Hu, Senior Member, IEEE

Abstract—We discuss BioThreads, a novel, configurable, exten-sible system-on-chip multiprocessor and its use in acceleratingbiomedical signal processing applications such as imaging pho-toplethysmography (IPPG). BioThreads is derived from the LE1open-source VLIW chip multiprocessor and efficiently handlesinstruction, data and thread-level parallelism. In addition, it sup-ports a novel mechanism for the dynamic creation, and allocationof software threads to uncommitted processor cores by imple-menting key POSIX Threads primitives directly in hardware, ascustom instructions. In this study, the BioThreads core is used toaccelerate the calculation of the oxygen saturation map of livingtissue in an experimental setup consisting of a high speed imageacquisition system, connected to an FPGA board and to a hostsystem. Results demonstrate near-linear acceleration of the corekernels of the target blood perfusion assessment with increasingnumber of hardware threads. The BioThreads processor wasimplemented on both standard-cell and FPGA technologies; in thefirst case and for an issue width of two, full real-time performanceis achieved with 4 cores whereas on a mid-range Xilinx Virtex6device this is achieved with 10 dual-issue cores. An 8-core LE1VLIW FPGA prototype of the system achieved 240 times faster ex-ecution time than the scalar Microblaze processor demonstratingthe scalability of the proposed solution to a state-of-the-art FPGAvendor provided soft CPU core.

Index Terms—Biomedical image processing, field program-mable gate arrays (FPGAs), imaging photoplethysmography(IPPG), microprocessors, multicore processing.

I. INTRODUCTION AND MOTIVATION

B IOMEDICAL in-vitro and in-vivo assessment relies onthe real-time execution of signal processing codes as a

key to enabling safe, accurate and timely decision-making,allowing clinicians to make important decisions and perform

Manuscript received December 20, 2010; revised May 03, 2011; acceptedAugust 15, 2011. This work was supported by Loughborough University, U.K.Date of publication November 04, 2011; date of current version May 22, 2012.This paper was recommended by Associate Editor Patrick Chiang.

D. Stevens, V. Chouliaras, and V. Azorin-Peris are with the Department ofElectrical Engineering, Loughborough University, Leicestershire LE11 3TU,U.K.

J. Zheng is with the National Institute for the Control of Pharmaceutical andBiological Products (NICPBP), China, No.2, Tiantan Xili, Chongwen District,Beijing 100050, China.

A. Echiadis is with Dialog Devices Ltd., Loughborough LE11 3EH, U.K.S. Hu is with the Department of Electronic and Electrical Engineering,

Loughborough University, Leicestershire LE11 3TU, U.K. (e-mail:[email protected]).

Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TBCAS.2011.2166962

medical interventions as these are based on hard facts, derivedin real-time from physiological data [1], [2]. In the area ofbiomedical image processing, a number of imaging methodshave been proposed over the past few years including laserDoppler [3], optical coherence tomography [4] and morerecently, imaging photoplethysmography (IPPG) [5], [6]; How-ever, none of these techniques can attain their true potentialwithout a real-time biomedical image processing system basedon very large scale integration (VLSI) systems technology.For instance, the quality and availability of physiologicalinformation from an IPPG system is directly related to theframe size and frame rate used by the system. From a userperspective, the extent to which such a system can run inreal-time is a key factor in its usability, and practical imple-mentations of the system ultimately aim to be standalone andportable to achieve its full applicability. This is an area whereadvanced computer architecture concepts, routinely utilizedin high-performance consumer and telecoms systems-on-chip(SoC) [7], can potentially provide the required data streamingand execution bandwidth to allow for the real-time execution ofalgorithms that would otherwise be executed offline (in batchmode) using more established techniques and platforms (e.g.,sequential execution on a PC host). A quantitative comparisonin this study (results and discussion) illustrates the foreseenperformance gains, showing that a scalar embedded processoris six times slower than the single-core configuration of ourresearch platform.

Such SoC-based architectures typically include scalar em-bedded processor cores with a fixed instruction-set-architecture(ISA) which are widely used in standard-cell (ASIC) [8] and re-configurable (FPGA)-based embedded systems [9]. These pro-cessors present a good compromise for the execution of general-purpose codes such as the user interface, low-level/bandwidthprotocol processing, the embedded operating system (eOS) andoccasionally, low-complexity signal processing tasks. However,they lack considerably in the area of high-throughput execu-tion and high-bandwidth data movement as is often required bythe core algorithms in most signal processing application do-mains. An interesting comparison of the capabilities of threesuch scalar engines targeting field-programmable technologies(FPGAs) is given in [10].

To relieve this constraint, scalar embedded processors havebeen augmented with DSP coprocessors in both tightly-coupled[11] and loosely-coupled configurations [12] to target perfor-mance-critical inner loops of DSP algorithms. A side-effect ofthis approach is the lack of homogeneity in the SoC platform

1932-4545/$26.00 © 2011 IEEE

Page 2: vliw ieee paper  jun2012

258 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012

programmer’s model which itself necessitates the use of com-plex ‘mailbox-type’ [13] communications and the programmer-managed use of multiple address spaces, coherency issues andDMA-driven data flows, typically under the control of the scalarCPU.

Another architectural alternative is the implementation of thecore DSP functionality using custom (hardwired) logic. Usingestablished methodologies (register-transfer-level design, RTL)this task involves long development and verification times andresults in systems that are of high performance yet, they areonly tuned to the task at hand. Also, these solutions tend to offerlittle or no programmability, making difficult their modificationto reflect changes in the input algorithm. In the same architec-tural domain, the synthesis of such hardwired engines from highlevel languages (ESL synthesis) is an area of active researchin academia [14], [15] (academic efforts targeting ESL syn-thesis of Ada and C-descriptions); Industrial tools in this areahave matured [16]–[18] (commercial offerings targeting C++,C and UML&C++) to the point of competing favorably withhand-coded RTL implementations, at least for certain types ofdesigns [19].

A potent solution to high performance VLSI systems de-sign is provided by configurable, extensible processors [20].These CPUs allow the extension of their architecture (pro-grammer model and ISA), and microarchitecture (executionunits, streaming engines, coprocessors, local memories) bythe system architect. They typically offer high performance,full programmability and good post-fabrication adaptability toevolving algorithms through the careful choice of the customISA and execution/storage resources prior to committing to sil-icon. High performance is achieved through the use of custominstructions which collapse data flow graph (DFG) sub-graphs(especially those repeated many times [21]) into one or moremulti-input, multi-output (MIMO) instruction nodes. At thesame time, these processors deliver better power efficiencycompared to non-extensible processors, via the reduction in thedynamic instruction count of the target application and the useof streaming local memories instead of data caches.

All the solutions to develop high-performance digital enginesfor consumer, and in this case, biomedical image processingmentioned so far suffer from the need to explicitly specifythe software/hardware interface and schedule communicationsacross that boundary. This research proposes an alternative,all-software solution, based on a novel, configurable, extensibleVLIW chip-multiprocessor (CMP) based on an open-sourceVLIW core [22]–[24] and targeting both FPGA and stan-dard-cell (ASIC) silicon. The VLIW architectural paradigmwas chosen as such architectures efficiently handle parallelismat the instruction (ILP) and data (DLP) levels. ILP is exploitedvia the static (compile-time) specification of independentRISC-ops (referred to as “syllables” or RISCops) per VLIWinstruction whereas DLP is exploited via the compiler-directedunrolling and pipelining of inner loops (kernels). Key to this isthe use of advanced compilation technology such as Trimaran[25] for fully-predicated EPIC architectures or VEX [26]for the partially-predicated LE1 CPU [27], the core elementof the BioThreads CMP used in this work. A third form ofparallelism, thread level parallelism (TLP) can be explored

via the instantiation of multiple such VLIW cores, operatingin a shared-memory ecosystem. The BioThreads processoraddresses all three forms of parallelism and provides a uniquehardware mechanism with which software threads are createdand allocated to uncommitted LE1 VLIW cores via the useof custom instructions implementing key POSIX Threads(PThreads) primitives directly in hardware.

A. Multithreaded Processors

Multithreaded programming allows for the better utiliza-tion of the underlying multiprocessor system by splitting upsequential tasks such that they can be performed concurrentlyon separate CPUs (processor contexts) resulting in a reductionof the total task execution time and/or the better utilizationof the underlying silicon. Such threads are disjoint sectionsof the control flow graph that can potentially execute concur-rently subject to the lack of data dependencies. Multithreadedprogramming relies on the availability of multiple CPUs ca-pable of running concurrently in a shared-memory ecosystem(multiprocessor) or as a distributed memory platform (multi-computer). Both multiprocessor and multicomputers fall in twomajor categories depending on how threads are created andmanaged: a) programmer-driven multithreading (potentiallywith OS software support), known as explicit multithreadingand b) hardware generated threads (implicit multithreading).A very good overview of explicit multithreaded processors isgiven in [28].

1) Explicit Multithreading: Explicit multithreaded proces-sors are categorized in: a) Interleaved multithreading (IMT)in which the CPU switches to another hardware thread atinstruction boundaries thus, effectively hiding long-latencyoperations (memory accesses); b) Blocked multithreading(BMT) in which a thread is active until a long-latency operationis encountered; c) Simultaneous multithreading (SMT) whichrelies on a wide (ILP) pipeline to dynamically schedule opera-tions across multiple hardware threads. Explicit multithreadedarchitectures have the form of either chip multiprocessors(shared-memory ecosystem) or multicomputers (distributedmemory ecosystem). In both cases, special OS thread libraries(APIs) control the creation of threads and make use of theunderlying multicore architecture (if one is provided) ortime-multiplex the single CPU core. Examples of such APIsfor shared-memory multithreading are the POSIX Threads(PThreads) and MPI for distributed memory multicomputers.PThreads in particular allows for explicit creation, termination,joining and detaching of multiple threads and provides furthersupport services in the form of mutex and conditional variables.Notable machines supporting IMT include the HEP and theCray MTA; More recent explicit-multithreading VLIW CMPsinclude amongst others the SiliconHive HIVEFLEX CSL2500Communications processor (multicomputer architecture) [29]and the Fujitsu FR1000 VLIW media multicore (multiprocessorarchitecture) [30]. In the academic world, the most notable of-ferings in the re-configurable/extensible VLIW domain includethe tightly-coupled VLIW/datapath architecture [31] and theADRES architecture [32]. In the biomedical signal processingdomain very few references can be found; A CMP architecture

Page 3: vliw ieee paper  jun2012

STEVENS et al.: BIOTHREADS: A NOVEL VLIW-BASED CHIP MULTIPROCESSOR 259

based on a commercial VLIW core was used for the real-timeprocessing of 12-lead ECG signals in [33].

2) Implicit Multithreading: Prior research in hardware man-aged threads (implicit multithreading) includes the SPSM andWELD architectures [34]–[36]. The single-program speculativemultithreading (SPSM) method uses fork and merge opera-tions to reduce execution time. Extra work by the compiler isrequired to find code blocks which are data independent; whensuch blocks are found, the compiler inserts extra instructionsto inform the hardware to run the data independent code con-currently. When the executing thread (master) reaches a forkinstruction, a second thread is started at another location in theprogram. Both threads then execute and when the master threadreaches the location in the program from which the secondthread started, the two threads are merged together. The WELDarchitecture uses branch prediction as a method of reducing theimpact of pipeline restarts due to control flow changes. Dueto the organization of modern processors if a branch is takenthis requires the pipeline to be restarted and the instructions inthe branch shadow be squashed, resulting in wasted issue slots.A way around this inefficiency is to run two or more threadsconcurrently and each thread to run the code if a branch is takenor not (thus following both control flow paths). Later on, whenit is discovered whether a branch is definitely taken or not taken,the correct speculative thread is chosen (and becomes definite)whereas the incorrect thread is squashed. This removes theneed to re-fill the pipeline with the correct instructions as bothbranch paths are concurrently executed. This method requiresextra work by the compiler which introduces extra instructions(fork/bork) to inform the processor that it needs to run bothbranch paths as separate threads.

B. The BioThreads CMP

The BioThreads VLIW CMP is termed a hardware-assisted,explicit multithreaded architecture (software threads are userspecified, thread management is hardware based) and is differ-entiated to offerings in that area via a) Its hardware PThreadsprimitives and b) its massive scalability which can range froma single-thread, dual-issue core to a theoretical maximum of4 K (256 contexts 16 hypercontexts) shared memory hard-ware threads in each of the maximum 256 distributed memorymulticomputers, for a theoretical total of 1 M threads, on up to256-wide (VLIW issue slots) cores. Clearly these are theoreticalmaxima as in such massively-parallel configurations, the latencyof the memory system (within the same shared memory mul-tiprocessor) is substantially increased, potentially resulting insub-optimal single-thread performance unless aggressive com-piler-directed loop unrolling and pipelining is performed.

C. Research Contributions

The major contributions of this research are summarized asfollows: a) A configurable, extensible, chip-multiprocessor hasbeen developed based on the open-source LE1 VLIW CPU, ca-pable of performing key PThreads primitives directly in hard-ware. This is a unique feature of the LE1 (and BioThreads) en-gine and uniquely differentiates it from other key research suchas hardware primitives for remote memory access [37]. In thatrespect, the BioThreads core can be thought of as a hybrid be-tween an OS and a collection of processors, delivering services

(execution bandwidth and thread handling) to a higher ordersystem and moving towards the real-time execution of com-pute-bound biomedical signal processing codes. b) The use ofsuch a complex processing engine is advocated in the biomed-ical signal processing domain such as the real-time blood per-fusion calculation. Its inherent, multiparallel scalability allowsfor the real-time calculation of key computational kernels inthis domain. c) A unified, software-hardware flow has been de-veloped so that all algorithm development takes place in theMATLAB environment, followed by automatic C-code gener-ation and its introduction to the LE1 tool chain. This is a wellencapsulated process which ensures that the biomedical engi-neer is not exposed to the intricacies of real-time software de-velopment for a complex, multicore, SoC platform; at the sametime this methodology results in a working embedded systemdirectly implementing the algorithmic functionality specified inthe MATLAB input description with minimum user guidance.

II. THE BIOTHREADS ENGINE

The BioThreads CMP is based on the LE1 open-sourceprocessor which it extends with execution primitives to supporthigh speed image processing and dynamic thread allocationand mapping to uncommitted CPU cores. The BioThreadsarchitecture specifies a hybrid, shared-memory multipro-cessor/distributed memory multicomputer. The multiprocessoraspect of the BioThreads architecture falls in between thetwo categories (explicit and implicit) as it requires the user toexplicitly identify the software threads in the code but at thesame time, implements hardware support for the creation/man-agement/synchronization/termination of such threads. Thethread management in the LE1 provides full hardware supportfor key PThread primitives such as pthread_create/join/exitand pthread_mutex_init/lock/trylock/unlock/destroy. This isachieved with a hardware block, the thread control unit (TCU),whose purpose is the service of these custom hardware calls andthe start and stop execution of multiple LE1 cores. The TCU isan explicit serialization point which multiple contexts (cores)compete for access; PThreads command requests are internallyserialized and the requesting contexts served in turn. The use ofthe TCU removes the overhead of an operating system for theLE1 as low-level PThread services are provided in hardware;a typical pthread_create instruction completes in less than 20clocks. This is a unique feature of the LE1 VLIW CMP and theprimary differentiator with other VLIW multicore engines.

Fig. 1 depicts a high level overview of the BioThreads en-gine. The main components are the scalar platform, consistingof a service processor (the Xilinx Microblaze, 5-stage pipeline32-bit CPU), its subsystem based on the CoreConnect [38] busarchitecture, and finally, the LE1 chip multiprocessor (CMP)which executes the signal processing kernels. Fig. 2 depicts theinternal organization of a single LE1 context.

• The CPU consists of the instruction fetch engine (IFE),the execution core (LE1_CORE), the pipeline controller(PIPE_CTRL) and the load/store unit (LSU). The IFE canbe configured with an instruction cache or alternatively,a closely-coupled instruction RAM (IRAM). These areaccessed every cycle and return a long instruction word(LIW) consisting of multiple RISCops for decode and

Page 4: vliw ieee paper  jun2012

260 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012

Fig. 1. BioThreads Engine showing LE1 cores, memory subsystem and overallarchitecture.

Fig. 2. Open-source LE1 Core pipeline organization.

dispatch. The IFE controller handles interfacing to theexternal memory for ICache refills and provides debugcapability into the ICache/IRAM. The IFE can also beconfigured with a branch predictor unit, currently basedon the 2-bit saturating counter scheme (Smith predictor)in both set-associative and fully-associative (CAM-based)organization.

• The LE1_CORE block includes the main execution data-paths of the CPU. There are a configurable number of clus-ters, each with its own register set. Each cluster includes aninteger core (SCORE), a custom instruction core (CCORE)and optionally, a floating point core (FPCORE). The in-teger and floating-point datapaths are of unequal pipeline

depth; however, they maintain a common exception reso-lution point to support a precise exception programmer’smodel.

• PIPE_CTRL is the primary control logic. It is a collectionof interlocked, pipelined state machines, which schedulethe execution datapaths and monitor the overall instruc-tion flow down the processing and memory pipelines.PIPE_CTRL maintains the decoding logic and controlregisters of the CPU and handshakes the host during debugoperations.

• The LSU is the primary path of the LE1_CORE to thesystem memory. It allows for up to ISSUE_WIDTH(VLIW architectural width) memory operations per cycleand directly communicates with the shared data memory(STRMEM). The latter is a multibank, 2 or 3-stagepipelined cross-bar architecture which scales reasonablywell (in terms of speed and area) for up to 8 clients, 8banks (8 8), as shown in Table I and Table II. Notethat the number of such banks and number of LSUclients (LSU_CHANNELS) are not necessarily equal,allowing for further microarchitecture optimizations. ThisSTRMEM block organization is depicted in Fig. 3.

• Finally, to allow for the exploitation of shared-memoryTLP, multiple processing cores can be instantiated in aCMP configuration as shown in Fig. 4. The figure depictsa dual-LE1, single-cluster BioThreads system interfacingto the common streaming data RAM.

III. THE THREAD CONTROL UNIT

Thread management and dynamic allocation to hardwarecontexts takes place in the TCU. This is a set of hierarchicalstate machines, responsible for the management of softwarethreads and their allocation to execution resources (HC, hy-percontexts). It accepts PThreads requests from either the hostor any of the executing hypercontexts. It maintains a seriesof hardware (state) tables, and is a point of synchronizationamongst all executing hypercontexts. Due to the need to di-rectly control the operating mode of every hypercontext (HC)while having direct access to the system memory, the TCUresides in the DEBUG_IF (Fig. 1) where it makes use of theexisting hardware infrastructure to stop, start, R/M/W memoryand communicate with the host. A critical block in threadmanagement is the Context TCU which manages locally (percontext, in the PIPE_CTRL block) the distribution of PThreadsinstructions to the centralized TCU. Each clock, one of the ac-tive HCs in a context arbitrates for the use of the context TCU;When granted access, the command requested is passed on tothe TCU residing in the DBG_IF for centralized processing.Upon completion of the PThreads command, the Context TCUreturns (to the requesting HC) the return values, as specified bythat command. Fig. 5 depicts the thread control organization inthe context of a single shared-memory system.

The figure depicts a system containing contexts (0 throughto ); For simplicity, each context contains two hyper-contexts (HC0, HC1) and has direct access to the system-wideSTRMEM for host-initiated DMA transfers and/or recoveringthe argument for void pthread_exit(void *value_ptr). The sup-ported commands are listed in Table III.

Page 5: vliw ieee paper  jun2012

STEVENS et al.: BIOTHREADS: A NOVEL VLIW-BASED CHIP MULTIPROCESSOR 261

TABLE IBIOTHREADS REAL-TIME PERFORMANCE (DUAL-ISSUE LE1 CORES, FPGA AND ASIC)

TABLE IIBIOTHREADS REAL-TIME PERFORMANCE (QUAD-ISSUE LE1 CORES, FPGA AND ASIC)

Fig. 3. Multibank streaming memory subsystem of the BioThreads CMP.

IV. BIOMEDICAL SIGNAL PROCESSING APPLICATION: IPPG

The application area selected to deploy the BioThreadsVLIW-CMP engine for real-time biomedical signal processingwas photoplethysmography (PPG), which is the measurementof blood volume changes in living tissue using optical means.PPG is primarily used in Pulse Oximetry for the point-mea-surement of oxygen saturation. In this application, PPG isimplemented from an area measurement. The basic conceptof this implementation, known as imaging PPG (IPPG), is toilluminate the tissue with a homogeneous, nonionizing lightsource and to detect the reflected light with a 2D sensor array.This yields a sequence of images (frames) from which a map(over the illuminated area) of the blood volume changes can

Fig. 4. BioThreads CMP, two LE1 cores connected via the streaming memorysystem.

be generated, for subsequent extraction of physiological pa-rameters. The use of multiple wavelengths in the light sourceenables the reconstruction of blood volume changes at differentdepths of the tissue (due to the different penetration depth of

Page 6: vliw ieee paper  jun2012

262 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012

Fig. 5. Multibank streaming memory subsystem of the BioThreads CMP.

TABLE IIILE1 PTHREADS HARDWARE SUPPORT

each wavelength), yielding a 3D map of the tissue function.This is the principle of operation of real-time IPPG [39].Such functional maps have numerous applications in clinicaldiagnostics, including the assessment of the severity of skinburns or wounds, of cardiovascular surgical interventions andof overall cardiovascular function. The overall IPPG systemarchitecture is depicted in Fig. 6.

A reflection–mode IPPG setup was deployed for the vali-dation experiment in this investigation, the basic elements ofwhich are a ring-light illuminator with arrays of red (660 nm)and infrared (880 nm) LEDS, a lens and a high sensitivitycamera [40] as the detecting element, and the target skin tissueas defined by its optical coefficients and geometry. The use ofthe fast digital camera enables the noncontact measurement at asufficiently high sampling rate to allow PPG signal reconstruc-tion from a large and homogeneously illuminated field of viewat more than one wavelength, as shown in Fig. 7.

The acquisition hardware synchronizes the illumination unitwith the camera in order to perform multiplexed acquisition ofa sequence of images of the area of interest. For optimum signalquality during acquisition, the subject is seated comfortably andasked to extend their hand evenly onto a padded surface; andambient light is kept to a minimum.

The IPPG system typically consists of three processing stages(pre, main, post) as shown in Fig. 8. Preprocessing comprises anoptional image stabilization stage, where its use is largely de-pendent on the quality of the acquisition setting. It is typicallyperformed on each whole raw frame prior to the storage of rawimages in memory and it can be implemented using a region of

Fig. 6. Imaging PPG System Architecture.

Fig. 7. Schematic diagram of IPPG setup including the dual wavelength LEDring light, lens, CMOS camera and subject hand.

interest of a fixed size, meaning that its processing time is a func-tion of frame rate but is independent of raw image size. The mainprocessing stage comprises the conversion of raw data into thefrequency domain, requiring a minimum number of frames, i.e.,samples, to be performed. Conversion to the frequency domainis performed once for every pixel position in the raw image, andthe number of data points per second of time-domain data toconvert is determined by the frame rate, meaning that the pro-cessing time for this stage is a function of both frame rate andsize. The postprocessing stage comprises the extraction of appli-cation-specific physiological parameters from time or frequencydomain data and consists of operations such as statistical calcu-lations and unit conversions (scaling and offsetting) of the pro-cessed data, which require relatively low processing power.

The ultimate scope of this experiment was to evaluate theperformance of the BioThreads engine as a signal processingplatform for IPPG, which was achieved by simplifying the pro-cessing workflow of the optophysiological assessment system[39]. Having established that image stabilization is easily scal-able as it is performed frame-by-frame on a fixed-size segmentof the data, the preprocessing stage was disregarded for this

Page 7: vliw ieee paper  jun2012

STEVENS et al.: BIOTHREADS: A NOVEL VLIW-BASED CHIP MULTIPROCESSOR 263

Fig. 8. Complete signal processing workflow in IPPG setup.

study. The FFT cannot be performed point-by-point, and thusposes the most significant constraint to the scalability of thesystem. The main processing stage was thus targeted in thisstudy as the representative process of the system, and the resul-tant workflow consisted of the transformation of detected bloodvolume changes in living tissue to the frequency domain via FFTfollowed by extraction of physiological parameters for bloodperfusion mapping. By employing the principles relating to pho-toplethysmography (PPG), blood perfusion maps were gener-ated from the power of the PPG signals in the frequency domain.The acquired image frames were processed as follows:

a) Two seconds worth of image data (60 frames of size64 64 pixels at 8-bit resolution) were recorded with theacquisition system.

b) The average fundamental frequency of the PPG signal wasmanually extracted (1.4 Hz).

c) Data were streamed to the BioThreads platform and the64-point fast Fourier transform (FFT) of each pixel wascalculated. This was done by taking the pixel values ofall image frames for a particular pixel position to form apixel value vector in the time domain.

d) The Power of the FFT tap corresponding to the PPG fun-damental frequency was copied into a new matrix at thesame coordinates of the pixel (or pixel cluster) under pro-cessing. In the presence of blood volume variation at thatpixel (or pixel cluster) the power would be larger than ifthere was no blood volume variation.

Repeating (d) for all the remaining pixels (clusters) providesa new matrix (image) whose elements (pixels) depend on the de-tected blood volume variation power. This technique allows thegeneration of a blood perfusion map, as a high PPG power canbe attributed to high blood volume variation and ultimately toblood perfusion. Fig. 9 illustrates in a simplified diagrammaticrepresentation the algorithm discussed above, and Fig. 10 illus-trates the output frame after the full image processing stage.

The algorithm was prototyped in the MATLAB environ-ment and subsequently translated to C using the embeddedMATLAB compiler (emlc) for compilation on the VLIW CMP.The emlc enables the output of C from MATLAB functions.Using emlc, the function required is run with the example dataand C code is generated by MATLAB. In this example the datawas a full dataset of two seconds worth of images in a onedimensional array. This generated C code is a function whichcan be compiled and run in a handheld (FPGA-based) system.Alongside this function a driver function was written to setupthe input data and call the function. The function computes thewhole frame. To be able to split this over multiple LE1 coresthe code was modified to include a start and end value. This

Fig. 9. High-level view of the signal processing algorithm.

Fig. 10. (A) Original image. (B) Corresponding ac map (Mean AC).(C) Corresponding ac power map at heart rate (HR) � ��� �� (� � ��� ��).

was a simple change which included altering the loop variableswithin the C code. In MATLAB there was an issue exportinga function which used loop variables that were passed to thefunction. These values are computed by the driver function andpassed to the generated function.

Example of code (pseudocode):Generated by MATLAB:

autogen_func(inputArray, outputArray);

Altered to:autogen_func(inputArray, outputArray, start, end);

Driver code:

main() {for( ; ; loop++) {

;;

& &}

}

This way the number_of_threads constant can be easily al-tered and the code does not need to be rewritten/reexported.Both the MATLAB generated function and the driver functionare then compiled using the LE1 tool-chain to create the ma-chine code to run in simulation as well as on silicon.

The system front-end of Fig. 11 is implemented in LabVIEWand executes on the host system. The acquisition panel is used

Page 8: vliw ieee paper  jun2012

264 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012

Fig. 11. LabVIEW-based host application for hardware control (left) and anal-ysis (right).

to control the acquisition of frames and to send the raw imagingdata to the FPGA-resident BioThreads engine for processing.Upon completion, the processed frame is returned for displayin the analysis panel, where the raw data is also accessible forfurther analysis in the time-domain.

V. RESULTS AND DISCUSSION

This section presents the results of a number of experi-ments when using the BioThreads platform for the real-timeblood-volume change calculation. These results are split intotwo major sections: A) Performance (real-time) results, per-taining to the real-time calculation of the blood perfusion map,and B) SoC platform results. The latter include data such asarea, maximum frequency, when targeting a Xilinx Virtex6LX240T FG1156 [41] FPGA and a 0.13 , 1-poly, 8-metal(1P8M) standard-cell process. It should be noted that the FPGAdevice is on a near-state-of-the-art silicon node (40 nm, TSMC)whereas the available standard-cell library in our research labis rather old. As such, the performance differential (300 MHzfor the standard-cell compared to the 100 MHz for the FPGAtarget in Tables I and II) is certainly not representative of thatexpected when targeting a standard cell process at an advancedsilicon node (40 nm and below).

A. Performance Results

The 60 frames were streamed onto the target platform andthe processors started executing the IPPG algorithms. Uponcompletion, the computed frame was returned to the hostsystem for display. Table I shows the real execution time for the2-wide LE1 system (VLIW CMP consisting of dual-static-issuecores) of the FPGA platform; ASIC results were obtained fromsimulation.

The results shown in the tables are arranged in columns underthe following headings:

• Config: The macroarchitecture of the BioThreads enginein LE1_CORES MEMORY_BANKS

• LE1 Cores: Identifies the number of execution cores onthe LE1 subsystem.

• Data Memory Banks: The number memory banks as de-picted in Fig. 3 (and thus, the maximum number of con-current load/store operations supported by the streamingmemory system). This plays a major role in the overallsystem performance as will be shown below.

Fig. 12. Speedup of BioThreads performance for 2-wide LE1 subsystem(FPGA and ASIC).

The remaining three parameters have been measured onan FPGA platform (100 MHz LE1 subsystem and serviceprocessor, as shown on Fig. 1) or derived from RTL simulations(ASIC implementation). Both sets of results were obtainedwithout and with custom instructions for accelerating the FFTcalculation.

• Cycles: The number of clock cycles taken when executingthe core signal processing algorithm.

• Real time (sec): The real time taken to execute the algo-rithm (measured by the service processor for FPGA targets,calculated by RTL simulation for the ASIC target).

• Speedup: The relative speedup of BioThreads configu-ration compared to the degenerate case of a single-con-text, single-bank, FPGA solution without the FFT custominstructions.

From a previous study on signal processing kernel acceler-ation (FFT) on the LE1 processor [42], it was concluded thatthe best performance is achieved with user-directed functionin-lining, compiler-driven loop unrolling and custom instruc-tions. Using the above methods an 87% cycle reduction wasachieved thus making possible the execution of the IPPG algo-rithm steps in real-time.

A subset of the speedup results of Table I (dual-issue perfor-mance) are plotted in Fig. 12. Speedup was calculated in ref-erence to the 1 1 reference configuration with no optimiza-tions. As shown, there is a maximum speed up of 125 on thestandard-cell target and 41 on the FPGA target (ASIC imple-mentation is 3 faster compared to the FPGA), with full opti-mizations and custom instructions. With respect to the numberof memory channels, best performance is achieved when thenumber of cores in the BioThreads engine equals the numberof memory banks. This is expected as memory bank conflictsincrease for .

Fig. 13 shows the speedup values for the quad-issue con-figurations (subset of results from Table II). There is a similartrend in the shape of the graph as was seen in the dual-issue re-sults and the same dependency on the number of memory banksis clearly seen. The “Real Time (sec)” values in Tables I andII highlighted in grey identify the BioThreads configurationswhose processing time is less than the acquisition time. These14 configurations are instances of the BioThreads processor thatachieve real-time performance. On an FPGA target, real-timeis achieved with 8 quad-core LE1’s and a 4/8-bank memory

Page 9: vliw ieee paper  jun2012

STEVENS et al.: BIOTHREADS: A NOVEL VLIW-BASED CHIP MULTIPROCESSOR 265

Fig. 13. Speedup BioThreads performance for 4-wide LE1 subsystem (FPGAand ASIC).

system. The FPGA device chosen (Virtex6 LX240T) can ac-commodate only 5 such LE1 cores (along with the service pro-cessor system) and thus, can’t achieve the real-time constraint.For the dual-issue configuration however, near-real-time perfor-mance is achieved with 8 cores and a 4 or 8-bank streamingmemory (acquisition time of 2.00 s, processing of 2.85 s and2.16 s respectively). Finally, full real-time is achieved

cores memory banks. These configurations canbe accommodated on the target FPGA and thus are the preferredconfigurations of the BioThreads engine for this study.

On a rather old standard-cell technology, both dual and quad-issue configurations achieve the required post-layout speed of300 MHz. In the first case, 4 cores and a 2-bank memorysystem are sufficient; for quad-issue cores a 4 core by 1-banksystem is sufficient. Configurations around the acquisition timeof 2 sec could be considered as viably ‘real-time’ as the sub-second time difference may not be noticeable in the clinical en-vironment or be important in the final use of the processed data.

An interesting comparison for the FPGA targets is with theMicroblaze processor provided by the FPGA vendor. The latteris a 5-stage, scalar processor (similar to the ARM9 in termsof microarchitecture) with 32 K Instruction and Data caches.Write-through configuration was selected for the data cache toreduce the PLB interconnect traffic.

The code was recompiled for the Microblaze using the gcctoolchain with –O3 optimizations and run in single-thread modeas that processor is not easily scalable; Running it in PThreadsmode involves substantial OS intervention thus, the total run-time is longer than that for a single thread. The applicationtook 864 sec (0.07 fps) to execute on a 62.5 MHz FPGA board,which is 9.6 times slower than the dual-issue reference 100 MHzsingle-core LE1 (Table I, 1 1 configuration, 90.17 s). Extrap-olating this figure to the LE1 clock of 100 MHz and assumingno memory subsystem degradation (an optimistic assumption),shows that the Microblaze processor is 6 times slower than thereference LE1 configuration (0.11 fps); Finally, compared to themaximal dual-issue (configuration 8 8) of Table I with fulloptimizations, the scalar processor proved to be more than 240times slower (525.00 s versus 2.16 s).

A final evaluation of the performance of the BioThreads plat-form was carried out against the (simulated) VEX VLIW/DSP.The VEX ISA is closely related to the Multiflow TRACE ar-chitecture and a commercial implementation of the ISA is the

TABLE IVLE1 COMPARISON WITH THE VEX VLIW/DSP

ST Microelectronics ST210 series of VLIW cores. The VLIWISA has good DSP support since it includes 16/32-bit multipli-cations, local data memory, instruction and data prefetching andis supported by a direct descendant of the Multiflow trace sched-uling compiler. The simulated VEX processor configuration in-cluded a 32 K, 2-way ICache and a 64 K, 4-way, write-throughdata cache whereas the BioThreads configuration used includeda direct instruction and data memory system. Both processorsexecuted the single-threaded version of the workload as therewas no library/OS support in VEX for the PThreads primitivesincluded in the BioThreads core.

Table IV depicts the comparison in which it is clear that thelocal instruction/data memory system of the Biothreads coreplay an important role in the latter achieving 62.00% (2-wide)and 54.21% (4-wide) better clocks compared to the VEX CPU.

As with any engineering study, the chosen system very muchdepends on tradeoffs between speed and area costs. These aredetailed in the next section.

1) VLSI Platform Results: This section details the VLSIimplementation of a number of configurations of interest ofthe BioThreads engine implemented on both standard-cell(TSMC 0.13 1 Poly, 8 metal, high-speed process)and a 40 nm state-of-the-art FPGA device (Xilinx Virtex6LX240T-FF1156-1).

2) Xilinx Virtex6 LX240T-FF1156-1: To provide further in-sight on the use of advanced system FPGAs when implementinga CMP platform, the BioThreads processor was targeted to amidrange device from the latest Xilinx Virtex6 family. The fol-lowing configurations were implemented:

a) Dual-issue BioThreads configurations (2 IALUs, 2IMULTs, 1 LSU_CHANNELS per LE1 core, 4-banksSTRMEM): 1, 2, 4 and 8 LE1 cores, each with aprivate 128 KB IRAM and a shared 256 KB DRAM, for atotal memory of 384 KB, 512 KB, 1024 KB, and 1.5 MBrespectively.

b) Quad-issue BioThreads configurations (4 IALUs, 4IMULTs, 1 LSU_CHANNEL per LE1 core, 8-banksSTRMEM): 1, 2 and 4 LE1 cores. In this case, thesame amount of IRAM and DRAM was chosen, for a totalmemory of 384 KB, 512 KB, and 1024 KB respectively.

Both dual and quad-issue configurations included a Microb-laze 32-bit scalar processor system (5-stage pipeline, no FPU,no MMU, 8 K ICache and DCache/Write-through, acting as theservice processor) with a 32-bit single PLB backbone, to inter-face to the camera, stream-in frames, initiate execution on theBioThreads processor and extract the processed results (Oxygenmap) upon completion. Note that the service processor in theseimplementations are lower-spec to the standalone Microblazeprocessor used in the performance comparison at the end of the

Page 10: vliw ieee paper  jun2012

266 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012

TABLE V2-ISSUE BIOTHREADS CONFIGURATIONS ON VIRTEX6 LX240T

TABLE VI4-ISSUE BIOTHREADS CONFIGURATIONS ON VIRTEX6 LX240T

previous section as the Microblaze does not participate in thecomputations in this case. The Xilinx Kernel RTOS (Xilkernel)was used to provide basic software services (stdio) to the serviceprocessor and thus, to the rest of the BioThreads engine. Finally,the calculated Oxygen map was returned to the host system fordisplay purposes in the LABVIEW environment.

The dual-issue configurations achieved the requested oper-ating frequency of 100 MHz (this is the limit at which boththe Microblaze platform and the BioThreads engine can operatewith the external DDR3; however, the 4-wide configurations ex-posed a critical path for at that target fre-quency resulting in a lower overall system speed of 83.3 MHz.– For the case where only the on-board block RAM is used, thisis significantly increased). Table V depicts the post-synthesis re-sults for the dual-issue platforms whereas Table VI depicts thesynthesis results for the quad-issue platforms.

The above tables include the number of LE1 cores in the Bio-Threads engine (Contexts), the absolute and relative (for thegiven device) number of LUTs, flops and block RAMs used, andthe number of LUTs and flops per the total number of issue slotsfor the VLIW CMP. The latter two metrics are used to comparethe hardware efficiency of the 2-wide and 4-wide configurations.

A number of interesting conclusions are drawn from the data:The 2-wide LE1 core is 27.12% to 29.60% more efficient (LUTsper issue slot) compared to the 4-wide core. This is expected andit is attributed to the overhead of the pipeline bypass network.The internal bypassing for one operand is shown in Fig. 14 how-ever, the LSU bypass paths are not depicted in the figure forclarity.

The quad-issue configurations exhibit substantial multi-plexing overhead as the number of source operands is doubled( 32-bit operands per VLIW core) with each, tappingto ISSUE_WIDTH * 2 (IALU) + ISSUE_WIDTH * 2 (IMULT)+ ISSUE_WIDTH *2 (LSU) result buses.

In terms of absolute silicon area, the Virtex6 device can ac-commodate up to 11 dual-issue LE1 cores ( issueslots per cycle) and up to 5 ( issue slots per cycle)quad-issue cores (with the service processor subsystem presentin both cases); the quad-issue system however achieves 83 MHz

Fig. 14. Pipeline bypass taps as a function of IALUs and IMULTs.

instead of the required (by Table I) 100 MHz for real-time opera-tion. This suggests that the lower issue-width configurations aremore beneficial for applications exhibiting good TLP (many in-dependent software threads) while exploiting a lower amount ofILP. As shown in the performance (real-time) section, the IPPGworkload exhibits substantial TLP thus making the 2-wide con-figurations more attractive.

3) Standard-Cell (TSMC0.13LV Process): For the stan-dard-cell (ASIC) target, dual-issue (2 IALUs, 2 IMULTs,2 LSU_CHANNELS, 4 memory banks) and quad-issue(4 IALUs, 4 IMULTs, 4 LSU_CHANNELS, 8 memorybanks) configurations were used in 1, 2, 3 and 4cores organizations. To facilitate collection of results after longRTL-simulation runtimes, a scripting mechanism was used toautomatically modify the RTL configuration files, execute RTL(presynthesis) regression tests, perform front-end synthesis(Synopsys dc_shell, nontopographical mode) and executepostsynthesis simulations.

Fig. 15 shows the postroute (real silicon) area of the con-figurations studied. In this case, the 4-wide is only 15.59%larger than the 2-wide processor with that number increasingto 18.96%. The postroute area of a 4-core, dual-issue and a3-core quad-issue system are nearly identical and these are theconfigurations depicted in Fig. 16.

Page 11: vliw ieee paper  jun2012

STEVENS et al.: BIOTHREADS: A NOVEL VLIW-BASED CHIP MULTIPROCESSOR 267

Fig. 15. VLSI campaign area (�� sq.).

Fig. 16. TSMC 0.13 �� VLSI results. (A) 4 instances of a 2-wide LE1 core.(B) 3 instances of a 4-wide LE1 core.

VI. CONCLUSIONS

This work discussed the methodology and evaluated thefeasibility of using a novel, configurable VLIW CMP foraccelerating biomedical signal processing codes. Experts inthe biomedical signal processing domain require advancedprocessing capabilities without having to resort to the expertiseroutinely utilized in the consumer electronics and telecommu-nications domains were the silicon platform is designed byone team, benchmarked and optimized by a second team andprogrammed by a third. A deliberate choice was made to staywithin the reach of tools routinely utilized by biomedical signalprocessing practitioners, such as MATLAB and LABVIEW;The BioThreads tools infrastructure was designed to facilitateC-level programmability in the particular application domain.

Use of the configurable, extensible BioThreads engine wasdemonstrated for computing, in real-time/near-real-time bloodperfusion of living tissue, using algorithms developed in the Em-bedded MATLAB subset; The autogenerated C code was passedon to the toolchain which compiled it into an application bi-nary and performed coarse architecture space evaluation to iden-tify the best BioThreads configurations that achieve the requiredlevel of performance. Following that, the code was loaded ontothe CMP, residing on the FPGA and datasets were streamedfrom the host system (via the LABVIEW front-end) for accel-erated calculation.

This work presented a custom signal processing engine ca-pable of a high data-processing throughput at a significantlylower operating frequency than general purpose processors. Inthe context of imaging PPG, real-time capability was demon-strated for a frequency-domain processing on 64 64 pixel im-ages at 30 frames per second. The system is easily scalablevia its parameterization and allows for real-time execution on

larger frame sizes and/or faster frame rates. Further work willinvestigate the introduction of single-instruction, multiple-data(SIMD) processor state and custom image processing vector in-structions and their effect on accelerating algorithmic kernelsfrom the same domain.

REFERENCES

[1] K. Rajan and L. M. Patnaik, “CBP and ART image reconstruction algo-rithms on media and DSP processors,” Microprocess. Microsyst., vol.25, pp. 233–238, 2001.

[2] O. Dandekar and R. Shekhar, “FPGA-Accelerated deformable imageregistration for improved target-delineation during CT-guided in-terventions,” IEEE Trans. Biomed. Circuits Syst., vol. 1, no. 2, pp.116–127, 2007.

[3] K. Wardell and G. E. Nilsson, “Duplex laser Doppler perfusionimaging,” Microvasc. Res., vol. 52, pp. 171–182, 1996.

[4] S. Srinivasan, B. W. Pogue, S. D. Jiang, H. Dehghani, C. Kogel, S.Soho, J. J. Gibson, T. D. Tosteson, S. P. Poplack, and K. D. Paulsen,“Interpreting haemoglobin and water concentration, oxygen saturation,and scattering measured in vivo by near infrared breast tomography,”Proc. Natl. Academy Sciences USA, vol. 100, no. 21, pp. 12349–12354,2003.

[5] S. Hu, J. Zheng, V. A. Chouliaras, and R. Summers, “Feasibility ofimaging photoplethysmography,” in Proc. Conf. BioMedical Engi-neering and Informatics, Sanya,, China, 2008, pp. 72–75.

[6] P. Shi, V. Azorin Peris, A. Echiadis, J. Zheng, Y. Zhu, P. Y. S. Cheang,and S. Hu, “Non-contact reflection photoplethysmography towards ef-fective human physiological monitoring,” J Med. Biol. Eng., vol. 30,no. 30, pp. 161–167, 2010.

[7] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, F. Rovati, and D. Al-fonso, “A multi-standard video accelerator based on a vector archi-tecture,” IEEE Trans. Consum. Electron., vol. 51, no. 1, pp. 160–167,2005.

[8] ARM Cortex M3 Processor Specification, Sep. 2010 [Online].Available: http://www.arm.com/products/processors/cortex-m/cortex-m3.php

[9] Microblaze Processor Reference Guide, Doc. UG081 (v10.3), Oct.2010 [Online]. Available: http://www.xilinx.com

[10] D. Mattson and M. Christensson, “Evaluation of Synthesizable CPUCores,” Master’s thesis, Dept. Computer Engineering, Chalmers Univ.Technology, Goteborg, Sweden, 2004.

[11] V. A. Chouliaras and J. L. Nunez, “Scalar coprocessors for acceler-ating the G723.1 and G729A speech coders,” IEEE Trans. Consum.Electron., vol. 49, no. 3, pp. 703–710, 2003.

[12] N. Vassiliadis, G. Theodoridis, and S. Nikolaidis, “The ARISE recon-figurable instruction set extensions framework,” in Proc. Intl Conf.Emb. Computer Systems: Architectures, Modelling and Simulation,Jul. 16–19, 2007, pp. 153–160.

[13] Xilinx XPS Mailbox” V1.0a Data Sheet, Oct. 2010 [Online]. Available:http://www.xilinx.com

[14] M. F. Dossis, T. Themelis, and L. Markopoulos, “A web service to gen-erate program coprocessors,” in Proc. 4th IEEE Int. Workshop SemanticMedia Adaptation and Personalization, Dec. 2009, pp. 121–128.

[15] B. Gorjiara, M. Reshadi, and D. Gajski, “Designing a custom archi-tecture for DCT using NISC technology,” in Proc. Design Automation,Asia and South Pacific Conf., 2006, pp. 24–27.

[16] V. Kathail, S. Aditya, R. Schreiber, B. Ramakrishna Rau, D. Cron-quist, and M. Sivaraman, “PICO: Automatically designing customcomputers,” IEEE Comput., vol. 35, pp. 39–47, 2002.

[17] R. Thomson, S. Moyers, D. Mulvaney, and V. A. Chouliaras, “TheUML-based design of a hardware H.264/MPEG 4 AVC video decom-pression core,” in Proc. 5th Int. UML-SoC Workshop (in Conjunctionwith 45th DAC), Anaheim, CA, Jun. 2008, pp. 1–6.

[18] The AutoESL AutoPilot High-Level Synthesis Tool, May 2010 [On-line]. Available: http://www.autoesl.com/docs/bdti_autopilot_final.pdf

[19] Y. Guo and J. R. Cavallaro, “A low complexity and low power SoCdesign architecture for adaptive MAI suppression in CDMA systems,”J. VLSI Signal Process. Syst. Signal Image Video Technol., vol. 44, pp.195–217, 2006.

[20] S. Leibson and J. Kim, “Configurable processors: A new era in chipdesign,” IEEE Comput., vol. 38, no. 7, pp. 51–59, 2005.

[21] N. T. Clark, H. Zhong, and S. A. Mahlke, “Automated custom in-struction generation for domain-specific processor acceleration,” IEEETrans. Comput., vol. 54, no. 10, pp. 1258–1270, 2005.

[22] D. Stevens and V. A. Chouliaras, “LE1: A parameterizable VLIW chip-multiprocessor with hardware PThreads support,” in Proc. IEEE Symp.VLSI, Jul. 2010, pp. 122–126.

Page 12: vliw ieee paper  jun2012

268 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012

[23] LE1 VLIW Core Source Code, Oct. 2010 [Online]. Available: http://www.vassilios-chouliaras.com/le1

[24] V. A. Chouliaras, G. Lentaris, D. Reisis, and D. Stevens, “Customizinga VLIW chip multiprocessor for motion estimation algorithms,” inProc. 2nd Workshop Parallel Programming and Run-Time Manage-ment Techniques for Many-Core Architectures, Lake Como, Italy, Feb.23, 2011, pp. 178–184.

[25] “Trimaran 4.0 Manual,” Oct. 2010 [Online]. Available: http://www.trimaran.org

[26] J. A. Fischer and P. Faraboschi, Embedded Computing. A VLIWApproach to Architectures, Compilers and Tools. San Mateo, CA:Morgan Kaufmann, 2005.

[27] R. P. Colwell, O. P. Nix, J. J. O’Donell, D. B. Papworth, and P. K.Rodman, “A VLIW architecture for a trace scheduling compiler,” IEEETrans. Comput., vol. 37, no. 8, pp. 967–979, 1988.

[28] T. Ungerer, B. Robic, and J. Silc, “A survey of processors with explicitmultithreading,” ACM Comput. Surv., vol. 35, no. 1, pp. 29–63, 2003.

[29] HiveFlex CSP2500 Series Digital RF Processor Databrief, Apr.2011 [Online]. Available: http://www.siliconhive.com/Flex/Site/Page.aspx?PageID=13326

[30] A. Suga and S. Imai, “FR-V Single chip multicore processor: FR1000,”Fujitsu Sci. Tech. J., vol. 42, no. 2, pp. 190–199, Apr. 2006.

[31] A. K. Jones, R. Hoare, D. Kusic, J. Stander, G. Mehta, and J. Fazekas,“A VLIW processor with hardware functions: Increasing performancewhile reducing power,” IEEE Trans. Circuits Syst., vol. 53, no. 11, pp.1250–1254, 2006.

[32] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins,“ADRES: An architecture with tightly coupled VLIW processor andcoarse-grained reconfigurable matrix,” in Proc. Field ProgrammableLogic, 2003, pp. 61–70.

[33] I. Al Khatib, F. Poletti, D. Bertozzi, L. Benini, M. Bechara, H. Khalifeh,A. Jantsch, and R. Nabiev, “A multiprocessor system-on-chip for real-time biomedical monitoring and analysis: Architectural design spaceexploration,” in Proc. IEEE/ACM Design Automation Conf., Sep. 2006,pp. 125–130.

[34] P. K. Dubey, K. O’Brien, K. M. O’Brien, and C. Barton, “Single-pro-gram speculative multithreading (SPSM) architecture: Compiler- as-sisted fine-grained multithreading,” in Proc. Int. Conf. Parallel Archi-tecture and Compilation Techniques, Jun. 1995, pp. 109–121.

[35] E. Ozer, T. M. Conte, and S. Sharma, “Weld: A multithreading tech-nique towards latency tolerant VLIW processors,” in Processors. Lec-ture Notes in Computer Science. Berlin, Germany: Springer, 2001.

[36] E. Ozer and T. M. Conte, “High-performance and low-cost dual-threadVLIW processor using Weld architecture paradigm,” IEEE Trans. Par-allel Distrib. Syst., vol. 16, no. 12, pp. 1132–1142, 2005.

[37] S. G. Ziavras, A. V. Gerbessiotis, and R. Bafnaa, “Coprocessor designto support MPI primitives in configurable multiprocessors,” Integra-tion, VLSI J., vol. 40, pp. 235–252, 2007.

[38] CoreConnect Architecture, Oct. 2010 [Online]. Available: http://www.xilinx.com

[39] J. Zheng, S. Hu, V. Azorin-Peris, A. Echiadis, V. A. Chouliaras, andR. Summers, “Remote simultaneous dual wavelength imaging photo-plethysmography: A further step towards 3-D mapping of skin bloodmicrocirculation,” in Proc. BiOS SPIE, 2008, vol. 6850, pp. 68500S–1.

[40] MC1311 – High-Speed 1,3 MPixel Colour CMOS Imaging Far-bkamera, Oct. 2010 [Online]. Available: http://www.mikrotron.de

[41] FG1156 FPGA Datasheet, Oct. 2010 [Online]. Available: http://www.xilinx.com

[42] D. Stevens, N. Glynn, P. Galiatsatos, V. A. Chouliaras, and D. Reisis,“Evaluating the performance of a configurable, extensible VLIW pro-cessor in FFT execution,” in Proc. Int. Conf. Environmental and Com-puter Science, 2009, pp. 771–774.

David Stevens was born in Melton Mowbray, U.K.,in 1985. He received the B.Eng. degree in computersystems engineering from Loughborough University,Leicestershire, U.K., in 2007.

He began working toward the Ph.D. degree atLoughborough University in 2008, investigatingmethods of acceleration for a VLIW processor. Hisresearch focuses on both the implementation andverification of a tool-chain for an open source VLIWprocessor incorporating various methods of accel-eration, including hardware threading. Currently, he

is a Research Associate at Loughborough University working on a project todevelop an integrated framework for embedded system tools.

Vassilios Chouliaras was born in Athens, Greece,in 1969. He received the B.Sc. degree in physicsand laser science from Heriot-Watt University, Edin-burgh, Scotland, the M.Sc. degree in VLSI systemsengineering from the University of ManchesterInstitute of Science and Technology, Manchester,U.K., and the Ph.D. degree from LoughboroughUniversity, Leicestershire, U.K., in 1993, 1995, and2006, respectively.

He has worked as an ASIC Design Engineer forINTRACOMSA and as a Senior R&D Engineer/Pro-

cessor Architect for ARC International. Currently, he is a Senior Lecturer in theDepartment of Electronic and Electrical Engineering, Loughborough Univer-sity, where he is leading the research in CPU architecture and microarchitecture,SoC modeling, and ESL methodologies. He is the architect of the BioThreadsplatform and a Founder of Axilica Ltd.

Vicente Azorin Peris was born in Valencia, Spainin 1981. He received the B.Eng. degree in electronicand electrical engineering and the Ph.D. degree inbiomedical photonics engineering from Loughbor-ough University, Leicestershire, U.K., in 2004 and2009, respectively.

For the past five years, he has worked on func-tional biomedical sensing and imaging techniques,including finite-element simulation of the interactionbetween light and human tissue as a Ph.D. student andas a Research Associate in the Photonics Engineering

and Health Technology Research Group at Loughborough University. He is cur-rently employed by Ansivia Solutions Ltd., Loughborough, U.K., where he isinvolved in research and development of in-vitro diagnostic point-of-care testinginstrumentation.

Jia Zheng was born in Heilongjiang, China, in1980. She received the B.Sc. degree in electrical en-gineering from the Beijing Institute of Technology,Beijing, China, in 2003, and the M.Sc. degreein digital communication systems and the Ph.D.degree in biomedical photonics engineering fromLoughborough University, Leicestershire, U.K., in2005 and 2010 respectively.

During her Ph.D. studies, she specialized inimaging photoplethysmography and opto-physiolog-ical monitoring. Recently, she joined the National

Institute for the Control of Pharmaceutical and Biological Products (NICPBP),Beijing, China, as Researcher for the regulation of in-vivo physiologicalassessment device and instrumentation.

Angelos Echiadis was born in Orestiada, Greece,in 1977. He received the 1st Class B.Eng. degreein electrical and electronic engineering with dis-tinction from Heriot-Watt University, Edinburgh,Scotland, and the Ph.D. degree in photonics andhealth technology from Loughborough University,Leicestershire, U.K., in 2003 and 2007, respectively.

Currently, he works as a biomedical systems engi-neering professional at Dialog Devices Ltd., Lough-borough, U.K. His research interests include multidi-mensional signal acquisition and characterization of

live tissues, as well as photoplethysmography in pulse oximetry and in othermedical applications.

Sijung Hu (SM’10) was born in Shanghai, China, in1960. He received the B.Sc. degree in chemical engi-neering from Donghua University, Shanghai, China,the M.Sc. degree in environmental monitoring fromthe University of Surrey, Guildford, U.K., in 1995,and the Ph.D. degree in fluorescence spectrophotom-etry from Loughborough University, Leicestershire,U.K., in 2000.

He was appointed as a Senior Scientist at KalibrantLtd., Loughborough, U.K., for R&D of medical in-struments from 1999 to 2002. Currently, he is a Se-

nior Research Fellow and Leader of Photonics Engineering Research Group inthe Department of Electronic and Electrical Engineering, Loughborough Uni-versity, and Visiting Professor of Biomedical Engineering, Shanghai Jiao TongUniversity, Shanghai, China, with numerous contributions in opto-physiologicalassessment, biomedical instrumentation and signal and image processing.