VLSI IEEE Projects

NXFEE INNOVATION IP CORE PRODUCT DEVELOPMENT & PCB DESIGNING

#45, Vivekananda street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry-4 Web: www.nxfee.com Email: [email protected] Ph: +91 9789443203

VLSI IEEE TRANSACTION 2016 PAPERS

PROJECT TITLE TITLE FOR VLSI STUDENT PRICE

LOW POWER

VLSI01_LP1 Title: A Fully Digital Front-End Architecture for ECG Acquisition System With 0.5 V Supply Abstract: This paper presents a new power-efficient electrocardiogram acquisition system that uses a fully digital architecture to reduce the power consumption and chip area. The proposed architecture is compatible with digital CMOS technology and is capable of operating with a low supply voltage of 0.5 V. In this architecture, no analog blocks, e.g., low-noise amplifier (LNA) and filters, and no passive elements, such as ac coupling capacitors, are used. A moving average voltage-to-time converter is used, which takes the place of the LNA and anti-aliasing filter. A digital feedback loop is employed to cancel the impact of the dc offset on the circuit, which eliminates the need for coupling capacitors. The circuit is implemented in a 0.18-µm CMOS process. The simulation results show that the front-end circuit consumes 274 nW of power.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI07_LP2 Title: Low-Cost High-Performance VLSI Architecture for Montgomery Modular Multiplication Abstract: This paper proposes a simple and efficient Montgomery multiplication algorithm such that the low-cost and high-performance Montgomery modular multiplier can be implemented accordingly. The proposed multiplier receives and outputs the data with binary representation and uses only one-level carry-save adder (CSA) to avoid the carry propagation at each addition operation. This CSA is also used to perform operand precomputation and format conversion from the carry save format to the binary representation, leading to a low hardware cost and short critical path delay at the expense of extra clock cycles for completing one modular multiplication. To overcome the weakness, a configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand precomputation and format conversion by half. In addition, a mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed. As a result, the extra clock cycles for operand precomputation and format conversion can be hidden and high throughput can be obtained. Experimental results show that the proposed Montgomery modular multiplier can achieve higher performance and significant area–time product improvement when compared with previous designs.

Simulation Rs.8000 / Hardware Rs.18000 + S6 BOARD
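For readers new to the topic, the basic radix-2 Montgomery multiplication that such hardware builds on can be sketched in software. This is only a textbook bit-serial model, not the paper's CSA/CCSA architecture: it uses ordinary integer addition instead of carry-save adders, and the function names and the parameter k (operand width in bits) are illustrative assumptions.

```python
def montgomery_multiply(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n, for odd n and 0 <= a, b < n < 2^k."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b when bit i of a is set
        if s & 1:                 # make s even by adding the odd modulus
            s += n
        s >>= 1                   # exact division by 2
    if s >= n:                    # single final correction, since s < 2n
        s -= n
    return s


def modmul(a: int, b: int, n: int, k: int) -> int:
    """Plain a * b mod n computed through the Montgomery domain."""
    r2 = pow(2, 2 * k, n)                             # 2^(2k) mod n
    a_bar = montgomery_multiply(a, r2, n, k)          # a * 2^k mod n
    b_bar = montgomery_multiply(b, r2, n, k)          # b * 2^k mod n
    p_bar = montgomery_multiply(a_bar, b_bar, n, k)   # a * b * 2^k mod n
    return montgomery_multiply(p_bar, 1, n, k)        # leave the domain
```

Each loop iteration corresponds to one addition step; in a hardware design that addition is where a carry-save adder would replace the full carry-propagating sum.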

VLSI09_LP3 Title: RF Power Gating: A Low-Power Technique for Adaptive Radios Abstract: In this paper, we propose a low-power technique, called RF power gating, which consists in varying the active time ratio (ATR) of the RF front end at a symbol time scale. This technique is especially well suited to adapt the power consumption of the receiver to the performance needs without changing its architecture. The effect of this technique on the bit error rate (BER) performances is studied for a basic estimator in the specific case of minimum-shift keying signaling. A system-level energy model is also derived and discussed to estimate precisely the power reduction based on the characteristics and the power consumption of each block. This model allows highlighting the different contributors of the power reduction. The BER results and the energy model are finally merged to determine the best ATR meeting the design constraints. Applying this technique to the IEEE 802.15.4 standard, this paper shows that an ATR of 20% is a good tradeoff to meet the packet error rate constraint while maximizing the energy reduction ratio. Using typical block power consumptions, an energy reduction ratio around 20% can be reached. Even better energy reduction ratios (∼60%) are also achievable when most of the blocks are power-gated.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI18_LP4 Title: Low-Power ECG-Based Processor for Predicting Ventricular Arrhythmia Abstract: This paper presents the design of a fully integrated electrocardiogram (ECG) signal processor (ESP) for the prediction of ventricular arrhythmia using a unique set of ECG features and a naive Bayes classifier. Real-time and adaptive techniques for the detection and the delineation of the P-QRS-T waves were investigated to extract the fiducial points. Those techniques are robust to any variations in the ECG signal with high sensitivity and precision. Two databases of the heart signal recordings from the MIT PhysioNet and the American Heart Association were used as a validation set to evaluate the performance of the processor. Based on application-specified integrated circuit (ASIC) simulation results, the overall classification accuracy was found to be 86% on the out-of-sample validation data with 3-s window size. The architecture of the proposed ESP was implemented using 65-nm CMOS process. It occupied 0.112-mm2 area and consumed 2.78-µW power at an operating frequency of 10 kHz and from an operating voltage of 1 V. It is worth mentioning that the proposed ESP is the first ASIC implementation of an ECG-based processor that is used for the prediction of ventricular arrhythmia up to 3 h before the onset.

Simulation Rs.20000 / Hardware (Encoder + Decoder) Rs.35000 + S6 BOARD




VLSI30_LP5 Title: A New Parallel VLSI Architecture for Real-Time Electrical Capacitance Tomography Abstract: This paper presents a fixed-point reconfigurable parallel VLSI hardware architecture for real-time Electrical Capacitance Tomography (ECT). It is modular and consists of a front-end module which performs precise capacitance measurements in a time multiplexed manner using Capacitance to Digital Converter (CDC) technique. Another FPGA module performs the inverse steps of the tomography algorithm. Dual-port built-in memory banks store the sensitivity matrix, the actual value of the capacitances, and the actual image. A two dimensional (2D) core multi-processing elements (PE) engine intercommunicates with these memory banks via parallel buses. A hardware-software co-design methodology was conducted using commercially available tools in order to concurrently tune the algorithms and hardware parameters. Hence, the hardware was designed down to the bit-level in order to reduce both the hardware cost and power consumption, while satisfying real-time constraint. Quantization errors were assessed against the image quality and bit-level simulations demonstrate the correctness of the design. Further simulations indicate that the proposed architecture achieves a speed-up of up to three orders of magnitude over the software version when the reconstruction algorithm runs on 2.53 GHz-based Pentium processor or DSP Ti’s Delphino TMS320F32837 processor. More specifically, a throughput of 17.241 Kframes/sec for both the Linear-Back Projection (LBP) and modified Landweber algorithms and 8.475 Kframes/sec for the Landweber algorithm with 200 iterations could be achieved. This performance was achieved using an array of [22][22] processing units. This satisfies the real-time constraint of many industrial applications. To the best of the authors’ knowledge, this is the first embedded system which explores the intrinsic parallelism which is available in modern FPGA for ECT tomography.

Simulation Rs.8000 / Hardware Rs.25000 + S6 BOARD

VLSI33_LP6 Title: Low-Power FPGA Design Using Memoization-Based Approximate Computing Abstract: Field-programmable gate arrays (FPGAs) are increasingly used as the computing platform for fast and energy efficient execution of recognition, mining, and search applications. Approximate computing is one promising method for achieving energy efficiency. Compared with most prior works on approximate computing, which target approximate processors and arithmetic blocks, this paper presents an approximate computing methodology for FPGA-based design. It studies memoization as a method for approximation on FPGA and analyzes different architectural and design parameters that should be considered. The proposed design flow leverages high-level synthesis to enable memoization-based microarchitecture generation, thus also facilitating a C-to-register-transfer-level synthesis. When compared with the previous approaches of bit-width truncation and approximate multipliers, memoization-based approximate computation on FPGA achieves a significant dynamic power saving (around 20%) with very small area overhead (<5%) and better power-to-signal noise ratio values for the studied image processing benchmarks.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD
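The idea of memoization as an approximation can be illustrated with a small software sketch. The abstract does not say how the lookup key is formed, so the scheme below (dropping the low-order input bits so that nearby inputs share one cached result) is an assumption for illustration only, as are the names make_approx_memo and drop_bits.

```python
def make_approx_memo(func, drop_bits: int):
    """Wrap func so that inputs equal in their upper bits reuse one result."""
    table = {}                         # plays the role of the lookup memory
    stats = {"hits": 0, "misses": 0}

    def wrapper(x: int) -> int:
        key = x >> drop_bits           # ignore the low-order bits (assumed scheme)
        if key in table:
            stats["hits"] += 1         # reuse a stored, possibly inexact, result
            return table[key]
        stats["misses"] += 1
        result = func(x)               # exact computation on a miss
        table[key] = result
        return result

    return wrapper, stats


square, stats = make_approx_memo(lambda x: x * x, drop_bits=2)
first = square(8)     # miss: computed exactly
second = square(9)    # hit: returns the value stored for 8
```

A hit skips the computation entirely, which is where the power saving would come from; the price is the error between the stored and the exact result.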

VLSI35_LP7 Title: Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a low-power FFT processor, because it has the lowest number of arithmetic operations among all the FFT algorithms. In the design of such processors, an efficient addressing scheme for FFT data as well as twiddle factors is required. The signal flow graph of SRFFT is the same as radix-2 FFT, and therefore, the conventional address generation schemes of FFT data could also be applied to SRFFT. However, SRFFT has irregular locations of twiddle factors and forbids the application of radix-2 address generation methods. This brief presents a shared-memory low-power SRFFT processor architecture. We show that SRFFT can be computed by using a modified radix-2 butterfly unit. The butterfly unit exploits the multiplier-gating technique to save dynamic power at the expense of using more hardware resources. In addition, two novel address generation algorithms for both the trivial and nontrivial twiddle factors are developed. Simulation results show that compared with the conventional radix-2 shared-memory implementations, the proposed design achieves over 20% lower power consumption when computing a 1024-point complex-valued transform.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI63_LP8 Title: A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem, known as the utilization wall or dark silicon, is becoming increasingly serious. With the introduction of 3-D integrated circuits (ICs), it is likely to become more severe. Thus, how to take advantage of the extra transistors, made available by Moore’s law and the onset of 3-D ICs, within the power budget poses a significant challenge to system designers. To address this challenge, we propose a 3-D hybrid architecture consisting of a CPU layer with multiple cores, a field programmable gate array (FPGA) layer, and a DRAM layer. The architecture is designed for low power without sacrificing performance. The FPGA layer is capable of supporting a large number of accelerators. It is placed adjacent to the CPU layer, with a communication mechanism that allows it to access CPU data caches directly. This enables fast switches between these two layers. This architecture reduces the power and energy significantly, at better or similar performance. This then alleviates the dark silicon problem by letting us power ON more components to achieve higher performance. We evaluate the proposed architecture through a new framework we have developed. Relative to the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power by 6.9× and energy-delay product (EDP) by 7.2×, and application-level power by 1.9× and EDP by 2.2×, while delivering similar performance. For the entire system, this translates to a 47.5% power reduction relative to a baseline system that consists of a CPU layer and a DRAM layer. This also translates to a 72.9% power reduction relative to an alternative system that consists of a CPU layer, an L3 cache layer, and a DRAM layer.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI65_LP9 Title: Design of a Network of Digital Sensor Macros for Extracting Power Supply Noise Profile in SoCs Abstract: Increased functional density with shrinking technology could result in escalating power supply noise (PSN)-induced failures in the field. Furthermore, the low correlation between system-level functional test and production test is making it difficult to better screen parts that would fail in the field due to PSN. To address these issues, in this paper, we present a fully digital on-chip distributed sensor network to continuously monitor the PSN profile across the chip and generate a trace for diagnosis of any noise-induced failure at silicon validation, structural test, system test, and functional operation phases of system on chips (SoCs). The sensors capture PSN at a fine granularity and store the SoC’s critical status bits. The sensor offers easy access and control with the aid of scan chains. The sensor network has been designed in the 28-nm standard cell library, and its performance is demonstrated in the physical design of the OpenSPARC T1 multicore processor SoC.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI70_LP10 Title: Flexible ECC Management for Low-Cost Transient Error Protection of Last-Level Caches Abstract: The conventional error correcting code (ECC) schemes for caches are based on a fixed mapping between cache data words and ECC check bits, and fixed ECC word granularity. This leads to inefficient usage of the ECC check bits. We propose to manage the check bits flexibly for low-cost error protection of last-level caches. The proposed ECC schemes work at the word level, whereas the conventional ECC schemes work at the cache line or set level. The proposed schemes protect only dirty words with ECC check bits using a flexible mapping. Moreover, the proposed schemes utilize variable ECC word granularities. Dirty (modified) words that are unlikely to be modified further before being evicted are collectively protected with a larger ECC word granularity. The proposed schemes reduce DRAM and data bus energy overheads by 28% and 45%, respectively, with the same area overhead as previously proposed competitive schemes. Our schemes show more energy reduction results for multi core systems without noticeable performance degradation.

Simulation Rs.9000 / Hardware Rs.20000 + S6 BOARD




HIGH SPEED DATA TRANSMISSION

VLSI02_HS1 Title: A High-Speed FPGA Implementation of an RSD-Based ECC Processor Abstract: In this paper, an exportable application-specific instruction-set elliptic curve cryptography processor based on redundant signed digit representation is proposed. The processor employs extensive pipelining techniques for the Karatsuba–Ofman method to achieve high throughput multiplication. The proposed processor performs single-point multiplication employing points in affine coordinates in 2.26 ms and runs at a maximum frequency of 160 MHz in Xilinx Virtex 5 (XC5VLX110T) field-programmable gate array.

Simulation Rs.10000 / Hardware Rs.25000 + S6 BOARD

VLSI06_HS2 Title: High-Speed and Energy-Efficient Carry Skip Adder Operating Under a Wide Range of Supply Voltage Levels Abstract: In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The proposed structures are assessed by comparing their speed, power, and energy parameters with those of other adders using a 45-nm static CMOS technology for a wide range of supply voltages. In addition, the power–delay product was the lowest among the structures considered in this paper, while its energy–delay product was almost the same as that of the Kogge–Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on the proposed hybrid variable latency CSKA reveal reduction in the power consumption compared with the latest works in this field while having a reasonably high speed.

Simulation Rs.9000 / Hardware Rs.18000 + S6 BOARD
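As background for this project, the behaviour of a conventional carry skip adder can be modelled bit by bit. The sketch below is the textbook structure (ripple-carry blocks plus a skip path selected by the block propagate signal), not the concatenation/incrementation scheme or the AOI/OAI skip logic proposed in the paper; the function name and the default width and block size are illustrative assumptions.

```python
def carry_skip_add(a: int, b: int, width: int = 16, block: int = 4, cin: int = 0):
    """Bit-level model of a conventional carry skip adder.

    Returns (sum modulo 2**width, carry_out).
    """
    carry = cin
    total = 0
    for base in range(0, width, block):
        block_cin = carry
        all_propagate = True
        for i in range(base, min(base + block, width)):
            x = (a >> i) & 1
            y = (b >> i) & 1
            p = x ^ y                           # propagate signal of this bit
            all_propagate = all_propagate and p == 1
            total |= (p ^ carry) << i           # sum bit
            carry = (x & y) | (p & carry)       # ripple carry inside the block
        if all_propagate:
            # Skip path: when every bit propagates, the block carry-out equals
            # its carry-in, so hardware forwards it without waiting for the ripple.
            carry = block_cin
    return total, carry
```

The skip path never changes the numerical result; in hardware it only shortens the worst-case carry delay, which is the speed advantage this adder family aims for.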

VLSI23_HS3 Title: A 0.52/1 V Fast Lock-in ADPLL for Supporting Dynamic Voltage and Frequency Scaling Abstract: In energy-efficient processing platforms, such as wearable sensors and implantable medical devices, dynamic voltage and frequency scaling allows optimizing the energy efficiency under various modes of operation. The clock generator used in these platforms should be capable of achieving a faster settling time and should have a wider operating voltage range. In this brief, a fast lock-in all-digital phase-locked loop (ADPLL) with two operation modes (0.52/1 V) is presented. The proposed ADPLL can quickly compute the desired digitally controlled oscillator control code with high accuracy. Therefore, the proposed ADPLL can achieve a fast settling time with frequency errors <5% within four clock cycles. The proposed ADPLL is implemented using a standard performance 90-nm CMOS process. The output frequency of the ADPLL ranges from 60 to 600 MHz at 1 V, and from 30 to 120 MHz at 0.52 V.

Simulation Rs.8000 / Hardware Rs.20000 + S6 BOARD

VLSI26_HS4 Title: Code Compression for Embedded Systems Using Separated Dictionaries Abstract: Engineers must consider performance, power consumption, and cost when designing embedded digital systems; furthermore, memory is a key factor in such systems. Code compression is a technique used in embedded systems to reduce the memory usage. Bit Mask-based code compression is a modified version of dictionary-based code compression. In this paper, we applied a small separated dictionary, and variable mask numbers were used with the Bit Mask algorithm to reduce the codeword length of high frequency instructions. In addition, a novel dictionary selection algorithm was proposed to increase the instruction match rates. The fully separated dictionary method was used to improve the performance of the decompression engine without affecting the compression ratio (CR) (the compressed code size divided by original code size). Based on the experimental results, the proposed method can achieve a 7.5% improvement in the CR with nearly no hardware overhead.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD
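The compression ratio defined in the abstract above (compressed code size divided by original code size) can be worked through with a plain dictionary-based scheme. This sketch is not the Bit Mask or separated-dictionary method of the paper. It assumes a simple encoding chosen for illustration: one flag bit per instruction, followed by either a dictionary index or the raw word, with the dictionary storage counted as part of the compressed size.

```python
from collections import Counter
from math import ceil, log2


def dictionary_compression_ratio(words, dict_size: int, word_bits: int = 32) -> float:
    """Compression ratio (compressed bits / original bits); lower is better."""
    if not words:
        raise ValueError("need at least one instruction word")
    # The dictionary holds the most frequent instruction words.
    dictionary = [w for w, _ in Counter(words).most_common(dict_size)]
    index_bits = max(1, ceil(log2(len(dictionary)))) if dictionary else 0
    in_dictionary = set(dictionary)

    compressed_bits = len(dictionary) * word_bits      # dictionary storage
    for w in words:
        # 1 flag bit, then a short index on a match or the raw word otherwise
        compressed_bits += 1 + (index_bits if w in in_dictionary else word_bits)
    return compressed_bits / (len(words) * word_bits)


program = [0xA] * 6 + [0xB] * 2
ratio = dictionary_compression_ratio(program, dict_size=1)
```

For this toy program the compressed size is 32 + 6·2 + 2·33 = 110 bits against 256 original bits, so the ratio is 110/256.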

VLSI27_HS5 Title: A Dynamically Reconfigurable Multi-ASIP Architecture for Multi-standard and Multimode Turbo Decoding Abstract: The multiplication of wireless communication standards is introducing the need of flexible and reconfigurable multi-standard baseband receivers. In this context, multiprocessor turbo decoders have been recently developed in order to support the increasing flexibility and throughput requirements of emerging applications. However, these solutions do not sufficiently address reconfiguration performance issues, which can be a limiting factor in the future. This brief presents the design of a reconfigurable multiprocessor architecture for turbo decoding achieving very fast reconfiguration without compromising the decoding performances.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI36_HS6 Title: Design and Implementation of High-Speed All-Pass Transformation-Based Variable Digital Filters by Breaking the Dependence of Operating Frequency on Filter Order Abstract: All-pass transformation (APT)-based variable digital filters (VDFs), also known as frequency warped VDFs, are typically used in various audio signal-processing applications. In an APT-based VDF, all-pass filter structures of appropriate order are used to replace the delay elements in a prototype filter structure. The resultant filter can provide variable frequency responses with unabridged control over cutoff frequencies on the fly, without updating the filter coefficients. In this brief, we briefly review the first- and second-order APT-based VDFs along with their hardware implementation architectures, and provide generalized design procedures to realize them as per required specifications. We also propose the modified pipelined hardware implementation architectures for both the first- and second-order APT-based VDFs. Field-programmable gate array implementation results of different first- and second-order APT-based VDF designs for both non-pipelined and pipelined implementation architectures are presented. An analysis of the results shows that the proposed pipelined implementation architectures result in high-speed VDFs, achieving high operating frequencies that are independent of the prototype filter order, for both the first- and second-order.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI61_HS7 Title: Statistical Framework and Built-In Self Speed-Binning System for Speed Binning Using On-Chip Ring Oscillators Abstract: This paper presents a model-fitting framework to correlate the on-chip measured ring-oscillator counts to the chip’s maximum operating speed. This learned model can be included in an automatic test equipment (ATE) software to predict the chip speed for speed binning. Such a speed-binning method can avoid applying any functional test and, hence, result in a third-order test time reduction with a limited portion of chips placed into a slower bin compared with the conventional functional-test binning. This paper further presents a novel built-in self-speed-binning system, which embeds the learned chip speed model with a built-in circuit such that the chip speed can be directly calculated on-chip without going through any offline ATE software, achieving a fourth-order test-time reduction compared with the conventional speed binning. The experiments were conducted based on 360 test chips of a 28-nm, 0.9 V, 1.6-GHz mobile-application system-on-chip.

Simulation Rs.8000 / Hardware Rs.20000 + S6 BOARD

VLSI62_HS8 Title: A Low-Power Broad-Bandwidth Noise Cancellation VLSI Circuit Design for In-Ear Headphones Abstract: Conventional active noise cancelling (ANC) headphones often perform well in reducing the low-frequency noise and isolating the high-frequency noise by earmuffs passively. The existing ANC systems often use high-speed digital signal processors to cancel out disturbing noise, which results in high power consumption for a commercial ANC headphone. The contribution of this paper can be classified into: 1) proper filter length selection; 2) low-power storage mechanism for convolution operation; and 3) high-throughput pipelining architecture. With these novel techniques, we develop an area-/power-efficient ANC circuit by using the TSMC 90-nm CMOS technology for in-ear headphone applications. The proposed feedforward filtered-x least mean square ANC circuit design provides the features of using lower operating frequency and consuming much less power that facilitate better performance than the conventional ANC headphones. To verify the effectiveness of the proposed design, a series of physical measurements is executed in an anechoic chamber. Measurement results show that the proposed high-performance/low-power circuit design can reduce disturbing noise of various frequency bands very well, and outperforms the existing works. The proposed design can attenuate 15 dB for broadband pink noise between 50 and 1500 Hz when operated at 20-MHz clock frequency at the costs of 84.2 k gates and power consumption of 6.59 mW only. Compared with the existing designs, the proposed work achieves a further 3 dB of noise cancellation and saves 97% of the power consumption.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI71_HS9 Title: Source Coding and Preemphasis for Double-Edged Pulse width Modulation Serial Communication Abstract: Double-edged pulse width modulation (DPWM) is less sensitive to frequency-dependent losses in electrical chip-to-chip interconnects. However, the DPWM scheme instantaneously transmits information at a different rate than a synchronous source. This paper presents an 8-/9-bit line-coding scheme to compensate for the timing skew between the DPWM and synchronous clock domains while limiting the size of buffering required in the transmitter and receiver. Furthermore, preemphasis is introduced and analyzed as a means to improve the signal integrity of a DPWM signal. A multiphase-based, time interleaving receiver architecture using a sense amplifier is presented for high-speed data recovery. The DPWM transceiver is implemented in a 45-nm CMOS silicon-on-insulator process and operates at 10 Gbit/s with a bit error rate of 10^−12 and consumes 96 mW. The power consumption of the 8-/9-bit coding hardware is 1.5 mW at 10 Gbit/s, demonstrating low power overhead.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI96_HS10 Title: A Fast-Acquisition All-Digital Delay-Locked Loop Using a Starting-Bit Prediction Algorithm for the Successive-Approximation Register Abstract: This brief presents a fast-acquisition 11-bit all-digital delay-locked loop (ADDLL) using a novel starting-bit prediction algorithm for the successive-approximation register (SBP-SAR). It can effectively eliminate the harmonic lock and the false lock. The achievable acquisition time is within 17.5–23.5 or 17.5–32.5 clock cycles when the ADDLL works at the low or high clock rate, respectively. The digital-controlled delay line and the SBP-SAR of the ADDLL chip are synthesized using Taiwan Semiconductor Manufacturing Company’s (TSMC’s) 0.18-µm CMOS cell library. The proposed ADDLL can operate at a clock frequency from 60 MHz to 1.1 GHz.

Simulation Rs.10000 / Hardware Rs.25000 + S6 BOARD




VLSI87_HS11 Title: GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis Abstract: Lower upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation problems. However, parallelizing LU factorization on the graphic processing units (GPUs) turns out to be a difficult problem due to intrinsic data dependence and irregular memory access, which diminish GPU computing power. In this paper, we propose a new sparse LU solver on GPUs for circuit simulation and more general scientific computing. The new method, which is called GPU accelerated LU factorization (GLU) solver (for GPU LU), is based on a hybrid right-looking LU factorization algorithm for sparse matrices. Experimental results show that the proposed GLU solver can deliver 5.71× and 1.46× speedup over the single-threaded and the 16-threaded PARDISO solvers, respectively, 19.56× speedup over the KLU solver, 47.13× over the UMFPACK solver, and 1.47× speedup over a recently proposed GPU-based left-looking LU solver on the set of typical circuit matrices from the University of Florida (UFL) sparse matrix collection. Furthermore, we also compare the proposed GLU solver on a set of general matrices from the UFL; GLU achieves 6.38× and 1.12× speedup over the single-threaded and the 16-threaded PARDISO solvers, respectively, 39.39× speedup over the KLU solver, 24.04× over the UMFPACK solver, and 2.35× speedup over the same GPU-based left-looking LU solver. In addition, comparison on self-generated RLC mesh networks shows a similar trend, which further validates the advantage of the proposed method over the existing sparse LU solvers.

Simulation Rs.8000 / Hardware Rs.20000 + S6 BOARD
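To show what "right-looking" means in the abstract above, here is a small dense LU factorization in that style: once column k is finished, the whole trailing submatrix to its right is updated immediately. This is only a serial, dense sketch without pivoting, written under those simplifying assumptions; it is not the sparse hybrid GPU algorithm of the paper, and the function name is illustrative.

```python
def lu_right_looking(a):
    """Dense right-looking LU without pivoting; returns (L, U) with A = L * U."""
    n = len(a)
    m = [[float(v) for v in row] for row in a]    # work on a copy
    for k in range(n):
        pivot = m[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            m[i][k] /= pivot                      # column k of L
        # Right-looking step: update the trailing submatrix straight away.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                m[i][j] -= m[i][k] * m[k][j]
    lower = [[m[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[m[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, upper
```

The trailing-submatrix updates for different (i, j) entries are independent of one another within a step, which is the kind of parallelism a GPU implementation would try to exploit.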

VLSI89_HS12 Title: An All-Digital Approach to Supply Noise Cancellation in Digital Phase-Locked Loop Abstract: With increased levels of integration in modern system-on-chips, the coupling of supply noise in a phase-locked loop (PLL) has become the dominant source of performance degradation in many systems. In this paper, an all-digital approach to canceling the effects of supply noise is presented. By sensing the supply noise using an analog-to-digital converter (ADC), an observer–controller loop filter jointly processes the ADC and phase detector outputs to determine the oscillator control signals that minimize the output jitter. The proposed digital PLL is shown to be significantly more robust to supply noise compared with a conventional PLL.

Simulation Rs.8000 / Hardware Rs.25000 + S6 BOARD




VLSI97_HS13 Title: Design of Modified Second-Order Frequency Transformations Based Variable Digital Filters With Large Cutoff Frequency Range and Improved Transition Band Characteristics Abstract: The frequency transformation based filters (FT filters) provide an absolute control over the cutoff frequency. However, the cutoff frequency range (c_range) of the FT filters is limited. The second-order frequency transformations combined with coefficient decimation technique based filter (FTCDM filter) has wider c_range compared with the FT filter; however, the ratio of transition bandwidth of the transformed filter to that of the prototype filter, tbwFT/tbwmod, is large over a significant portion of c_range. In this paper, we propose a novel idea of relaxing the one-to-one mapping condition between the frequency variables, to overcome the issue of limited c_range for tbwFT ≤ tbwmod. In the proposed modified second-order frequency transformation based filter (MSFT filter), we relax the one-to-one mapping condition between the frequency variables and use low-pass to high-pass transformation on the prototype filter to achieve wider c_range with tbwFT ≤ tbwmod. Design example shows that the MSFT filter provides 3 and 1.22 times wider c_range compared to FT and FTCDM filters, respectively.

Simulation Rs.10000 / Hardware Rs.25000 + S6 BOARD

AREA EFFICIENT/ TIMING & DELAY REDUCTION

VLSI03_AE1 Title: A Mixed-Decimation MDF Architecture for Radix-2^k Parallel FFT Abstract: This paper presents a mixed-decimation multipath delay feedback (M2DF) approach for the radix-2^k fast Fourier transform. We employ the principle of folding transformation to derive the proposed architecture, which activates the idle period of arithmetic modules in multipath delay feedback (MDF) architectures by integrating the decimation-in-time operations into the decimation-in-frequency-operated computing units. Furthermore, we compare the proposed design with other efficient schemes, namely, the MDF and the multipath delay commutator (MDC) scheme theoretically and experimentally. Relying on the obtained expressions and statistics, it can be concluded that the M2DF design serves as an efficient alternative to the MDF scheme, since it achieves improved efficiency in the utilization of arithmetic resources without deteriorating the superiorities of feedback structures. In addition, the recommended design performs better in memory requirement and computing delay compared with the MDC approach.

Simulation Rs.8000 / Hardware Rs.20000 + S6 BOARD




VLSI04_AE2 Title: Algorithm and Architecture of Configurable Joint Detection and Decoding for MIMO Wireless Communications With Convolution Codes Abstract: This paper presents an algorithm and a VLSI architecture of a configurable joint detection and decoding (CJDD) scheme for multi-input multi-output (MIMO) wireless communication systems with convolutional codes. A novel tree-enumeration strategy is proposed such that the MIMO detection and decoding of convolutional codes can be conducted in single stage using a tree-searching engine. Moreover, this design can be configured to support different combinations of quadrature amplitude modulation (QAM) schemes as well as encoder code rates, and thus can be more practically deployed to real-world MIMO wireless systems. A formal outline of the proposed algorithm will be given and simulation results for 16-QAM and 64-QAM with rate-1/2 and rate-1/3 codes will be presented showing that, compared with the conventional separate scheme, the CJDD algorithm can greatly improve bit error rate (BER) performance with different system settings. In addition, the VLSI architecture and implementation of the CJDD approach will be illustrated. The architectures and circuits are designed to support configurability and flexibility while maintaining high efficiency and low complexity. The post layout experimental results for 16-QAM and 64-QAM with rate-1/2 and rate-1/3 codes show that, compared with the previous configurable design, this architecture can achieve reduced or comparable complexity with improved BER performance.

Simulation Rs.14000

/ Hardware Rs.25000

+ S6 BOARD

VLSI05_AE3 Title: One-Cycle Correction of Timing Errors in Pipelines With Standard Clocked Elements Abstract: One of the most aggressive uses of dynamic voltage scaling is timing speculation, which in turn requires fast correction of timing errors. The fastest existing error correction technique imposes a one-cycle time penalty only, but it is restricted to two-phase transparent latch-based pipelines. We perform one-cycle error correction by gating only the main latch in each stage of the pipeline that precedes a failed stage. This new method is applicable to widely used clocking elements, such as flip-flops and pulsed latches. Because it prevents inputs arriving at a stage, which is stalled, it can also be used in pipelines with multiple fan-in, fan-out, and looping. Simulations show an energy saving of 8%–12% with a target throughput of 0.9 instructions per cycle, and 15%–18% when the target is 0.8.

Simulation

Rs.8000 /

Hardware Rs.18000

+ S6 BOARD




VLSI10_AE4 Title: Hardware and Energy-Efficient Stochastic LU Decomposition Scheme for MIMO Receivers Abstract: In this paper, we design a hardware and energy-efficient stochastic lower–upper decomposition (LUD) scheme for multiple-input multiple-output receivers. By employing stochastic computation, the complex arithmetic operations in LUD can be performed with simple logic gates. With the proposed dual partition computation method, the stochastic multiplier and divider exhibit high computation accuracy with relatively short stochastic streams. We have designed and synthesized the stochastic LUD with CMOS 130-nm technology. According to the post layout report, the hardware efficiency of the stochastic LUD is as high as 1.5× compared with the existing LUD methods, and the energy efficiency is also higher than the state-of-the-art LUD when the matrix dimension is 8×8 and larger.

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD
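
The statement in the VLSI10_AE4 abstract that stochastic computation lets arithmetic be done "with simple logic gates" can be illustrated with a minimal Python sketch. It assumes the common unipolar encoding, where a value in [0, 1] is the fraction of 1s in a random bit stream, and shows only the textbook AND-gate multiplier. The function names, stream length, and seed are illustrative assumptions; the paper's dual partition method and its divider are not modelled here.

```python
import random


def to_stream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bit stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stochastic_multiply(a, b):
    """Multiply two independent unipolar streams with a bitwise AND gate."""
    return [x & y for x, y in zip(a, b)]


def stream_value(stream):
    """Decode a stream back to a number: the fraction of 1s."""
    return sum(stream) / len(stream)


rng = random.Random(1)  # fixed seed so the run is repeatable
a = to_stream(0.5, 10000, rng)
b = to_stream(0.4, 10000, rng)
estimate = stream_value(stochastic_multiply(a, b))
print(estimate)  # an estimate of 0.5 * 0.4; accuracy improves with stream length
```

The accuracy depends on the stream length, which is why the abstract stresses achieving high accuracy with relatively short streams.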

VLSI12_AE5 Title: Hybrid LUT/Multiplexer FPGA Logic Architectures Abstract: Hybrid configurable logic block architectures for field-programmable gate arrays that contain a mixture of lookup tables and hardened multiplexers are evaluated toward the goal of higher logic density and area reduction. Multiple hybrid configurable logic block architectures, both nonfracturable and fracturable with varying MUX:LUT logic element ratios, are evaluated across two benchmark suites (VTR and CHStone) using a custom tool flow consisting of LegUp-HLS, Odin-II front-end synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing, and architecture exploration. Technology mapping optimizations that target the proposed architectures are also implemented within ABC. Experimentally, we show that for nonfracturable architectures, without any mapper optimizations, we naturally save up to ∼8% area post place and route, accounting for both complex logic block and routing area, while maintaining mapping depth. For fracturable architectures, experiments show that only marginal gains, up to ∼2%, are seen after place and route. For both nonfracturable and fracturable architectures, we see minimal impact on timing performance for the architectures with the best area efficiency.

Simulation Rs.8000

/ Hardware Rs.18000

+ S6 BOARD

VLSI13_AE6 Title: A 520k (18 900, 17 010) Array Dispersion LDPC Decoder Architectures for NAND-Flash Memory Abstract: Although Latin square is a well-known algorithm to construct low-density parity-check (LDPC) codes for satisfying long code length, high code-rate, good correcting capability, and low error floor, it has the drawback of a large sub matrix, so that the hardware implementation suffers from a large barrel shifter and worse routing congestion when fitting NAND flash applications. In this paper, a top-down design methodology, which not only goes through code construction and optimization, but also hardware implementation to meet all the critical requirements, is presented. A two-step array dispersion algorithm is proposed to construct long LDPC codes with a small sub matrix size. Then, the constructed LDPC code is optimized by masking matrix to obtain better bit-error rate (BER) performance and lower error floor. In addition, our LDPC codes have a diagonal-like structure in the parity-check matrix leading to a proposed hybrid storage architecture, which has the advantages of better area efficiency and large enough data bandwidth for high decoding throughput. To be adopted for NAND flash applications, an (18 900, 17 010) LDPC code with a code-rate of 0.9 and sub matrix size of 63 is constructed and the field-programmable gate array simulations show that the error floor is successfully suppressed down to a BER of 10^−12. An LDPC decoder using the normalized min-sum variable-node-centric sequential scheduling decoding algorithm is implemented in UMC 90-nm CMOS process. The post layout result shows that the proposed LDPC decoder can achieve a throughput of 1.58 Gb/s at six iterations with a gate count of 520k under a clock frequency of 166.6 MHz. It meets the throughput requirement of both NAND flash memories with Toggle double data rate 1.0 and open NAND flash interface 2.3 NAND interfaces.

Simulation Rs.12000

/ Hardware (Encoder + Decoder) Rs.20000

+ S6 BOARD
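
The VLSI13_AE6 decoder is described as using a normalized min-sum algorithm. As a rough illustration of what that means at a single check node, the sketch below computes, for each connected edge, the product of the signs and the scaled minimum magnitude of all the other incoming messages. The function name and the normalization factor alpha = 0.75 are assumptions chosen for the example; the abstract does not give the paper's factor, and its variable-node-centric sequential scheduling is not shown.

```python
def check_node_update(llrs, alpha=0.75):
    """Normalized min-sum update for one check node.

    llrs  : incoming log-likelihood ratios, one per connected variable node
    alpha : normalization factor (assumed value, not taken from the paper)
    Returns one outgoing message per edge, computed from all other edges.
    """
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1
        for v in others:
            if v < 0:
                sign = -sign
        out.append(sign * alpha * min(abs(v) for v in others))
    return out


messages = check_node_update([2.0, -3.0, 1.0, -4.0])
print(messages)  # [0.75, -0.75, 1.5, -0.75]
```

Because only signs, comparisons, and one scaling are needed, this update is much cheaper in hardware than the exact sum-product rule it approximates.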

VLSI14_AE7 Title: Implementing Minimum-Energy-Point Systems With Adaptive Logic Abstract: Timing-error-detection (TED)-based systems have been shown to reduce power consumption or increase yield due to reduced margins. This paper shows that the increased adaptability can be a great advantage in the system design in addition to the well-known mitigated susceptibility to ambient and internal variations. Specifically, the design tolerances of the power management are relaxed to enable even greater system-level energy savings than what can be achieved in the logic alone. In addition, the system is simultaneously able to operate near the minimum error point. Here, the power management is a simplified dc–dc converter and the TED is based on time borrowing. The target application is a single-chip system on chip without external discrete components; thus, switched capacitors are used for the dc–dc. The system achieves 7.9% energy reduction at the minimum energy point simultaneously with a 36.4% energy–delay product decrease and a 15% increase in dc–dc efficiency. In addition, the effect of local variations on average system performance is reduced by 12%.

Simulation

Rs.8000 /

Hardware Rs.15000

+ S6 BOARD

VLSI15_AE8 Title: High-Performance Pipelined Architecture of Elliptic Curve Scalar Multiplication Over GF(2^m) Abstract: This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF(2^m). The architecture uses a bit-parallel finite field (FF) multiplier accumulator (MAC) based on the Karatsuba–Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF(2^163) in 6.1 µs using 7354 Slices on Virtex-4. Using Virtex-5, the scalar multiplication for m = 163, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 µs, respectively, which are faster than previous results.

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD

VLSI17_AE9 Title: High-Performance NB-LDPC Decoder With Reduction of Message Exchange Abstract: This paper presents a novel algorithm based on trellis min–max for decoding non-binary low-density parity check (NB-LDPC) codes. This decoder reduces the number of messages exchanged between check node and variable node processors, which decreases the storage resources and the wiring congestion and, thus, increases the throughput of the decoder. Our frame error rate performance simulations show that the proposed algorithm has a negligible performance loss for high rate codes with GF(16) and GF(32), and a performance loss smaller than 0.07 dB for high-rate codes over GF(64). In addition, a layered decoder architecture is presented and implemented on a 90-nm CMOS process for the following high-rate NB-LDPC codes: (2304, 2048) over GF(16), (837, 726) over GF(32), and (1536, 1344) over GF(64). In all cases, the achieved throughput is higher than 1 Gb/s.

Simulation Rs.12000

/ Hardware (Encoder

+ Decoder) Rs.20000

+ S6 BOARD

VLSI19_AE10 Title: LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter Abstract: In this paper, we analyze the contents of lookup tables (LUTs) of distributed arithmetic (DA)-based block least mean square (BLMS) adaptive filter (ADF) and based on that we propose intra-iteration LUT sharing to reduce its hardware resources, energy consumption, and iteration period. The proposed LUT optimization scheme offers a saving of 60% LUT content for block size 8 and still higher saving for larger block sizes over the conventional design approach. It offers a saving of 60% LUT-update per output and 59% LUT access per output over the recently proposed DA-based BLMS ADF structure for block size 8 and filter length 64. Besides, the proposed structure involves nearly 30% saving in the iteration period over the other for 16-bit coefficient word length. Application specific integrated circuit (ASIC) synthesis result shows that the proposed structure for block size 8 offers a saving of 48% area-delay product (ADP) and 53% energy per sample (EPS) over the existing DA-based BLMS ADF structure on average for different filter lengths, and offers 30% higher sampling rate due to its shorter iteration period. Compared with the existing DA-based LMS ADF structure, the proposed structure involves 68% less ADP and 1.6× less EPS.

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD
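
The lookup tables discussed in VLSI19_AE10 come from distributed arithmetic (DA), which replaces the multipliers of an inner product with a table of precomputed coefficient sums addressed by one bit from each input sample. The sketch below shows plain DA for unsigned samples only; the coefficient values, sample values, and function names are illustrative assumptions, and the paper's intra-iteration LUT sharing and BLMS weight update are not reproduced.

```python
def build_lut(weights):
    """LUT entry at address addr holds the sum of weights whose bit is set in addr."""
    n = len(weights)
    return [
        sum(w for k, w in enumerate(weights) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(lut, samples, bits):
    """Inner product of weights and unsigned samples using only LUT reads and shifts."""
    acc = 0
    for b in range(bits):
        # Form the address from bit b of every sample (bit-serial access).
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        acc += lut[addr] << b  # shift-accumulate the partial sum
    return acc


weights = [3, -1, 4, 2]   # example filter coefficients (assumed)
samples = [5, 9, 2, 7]    # unsigned 4-bit input samples (assumed)
lut = build_lut(weights)
result = da_inner_product(lut, samples, bits=4)
print(result)  # 28, the same as 3*5 - 1*9 + 4*2 + 2*7
```

The table has 2^N entries for N coefficients, which is why reducing LUT content and LUT updates is the main lever for area and energy in DA-based adaptive filters.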




VLSI20_AE11 Title: Graph-Based Transistor Network Generation Method for Supergate Design Abstract: Transistor network optimization represents an effective way of improving VLSI circuits. This paper proposes a novel method to automatically generate networks with minimal transistor count, starting from an irredundant sum-of-products expression as the input. The method is able to deliver both series–parallel (SP) and non-SP switch arrangements, improving speed, power dissipation, and area of CMOS gates. Experimental results demonstrate expected gains in comparison with related approaches.

Simulation

Rs.8000

VLSI21_AE12 Title: Flexible DSP Accelerator Architecture Exploiting Carry-Save Arithmetic Abstract: Hardware acceleration has proven to be an extremely promising implementation strategy for the digital signal processing (DSP) domain. Rather than adopting a monolithic application-specific integrated circuit design approach, in this brief, we present a novel accelerator architecture comprising flexible computational units that support the execution of a large set of operation templates found in DSP kernels. We differentiate from previous works on flexible accelerators by enabling computations to be aggressively performed with carry-save (CS) formatted data. Advanced arithmetic design concepts, i.e., recoding techniques, are utilized, enabling CS optimizations to be performed in a larger scope than in previous approaches. Extensive experimental evaluations show that the proposed accelerator architecture delivers average gains of up to 61.91% in area-delay product and 54.43% in energy consumption compared with the state-of-the-art flexible data paths.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD
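
Carry-save (CS) format, which VLSI21_AE12 relies on, keeps a running result as two words (a sum word and a carry word) so that each addition needs no carry propagation. The sketch below models a 3:2 compressor on non-negative Python integers and defers the single carry-propagate addition to the end. The function names are assumptions for illustration, and the paper's recoding techniques and flexible computational units are not modelled.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples.
    """
    s = a ^ b ^ c                              # bitwise sum without carries
    cy = ((a & b) | (b & c) | (a & c)) << 1    # majority bits, weighted by 2
    return s, cy


def sum_many(values):
    """Accumulate non-negative integers in carry-save form."""
    s, cy = 0, 0
    for v in values:
        s, cy = carry_save_add(s, cy, v)
    return s + cy  # one final carry-propagate addition


total = sum_many([13, 7, 22, 5, 31])
print(total)  # 78
```

In hardware, the loop body corresponds to one row of full adders with constant delay, which is what shortens the critical path of chained additions.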

VLSI25_AE13 Title: A Cellular Network Architecture With Polynomial Weight Functions Abstract: Emulations of cellular nonlinear networks on digital reconfigurable hardware are renowned for an efficient computation of massive data, exceeding the accuracy and flexibility of full-custom designs. In this contribution, a digital implementation with polynomial coupling weight functions is proposed for the first time, establishing novel fields of application, e.g., in the medical signal processing and in the solution of partial differential equations. We present an architecture that is capable of processing large-scale networks with a high degree of parallelism, implemented on state-of-the-art field-programmable gate arrays.

Simulation Rs.10000

/ Hardware Rs.25000

+ S6 BOARD




VLSI28_AE14 Title: A High-Performance FIR Filter Architecture for Fixed and Reconfigurable Applications Abstract: Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant multiplications (MCM) technique that results in significant saving of computation. However, transpose form configuration does not directly support the block processing unlike direct form configuration. In this paper, we explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for transpose form block FIR filter with optimized register complexity. A generalized block formulation is presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves significantly less area delay product (ADP) and less energy per sample (EPS) than the existing block implementation of direct-form structure for medium or large filter lengths, while for the short-length filters, the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure. Application specific integrated circuit synthesis result shows that the proposed structure for block size 4 and filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.

Simulation Rs.8000

/ Hardware Rs.25000

+ S6 BOARD

VLSI29_AE15 Title: Fault Tolerant Parallel FFTs Using Error Correction Codes and Parseval Checks Abstract: Soft errors pose a reliability threat to modern electronic circuits. This makes protection against soft errors a requirement for many applications. Communications and signal processing systems are no exceptions to this trend. For some applications, an interesting option is to use algorithmic-based fault tolerance (ABFT) techniques that try to exploit the algorithmic properties to detect and correct errors. Signal processing and communication applications are well suited for ABFT. One example is fast Fourier transforms (FFTs) that are a key building block in many systems. Several protection schemes have been proposed to detect and correct errors in FFTs. Among those, probably the use of the Parseval or sum of squares check is the most widely known. In modern communication systems, it is increasingly common to find several blocks operating in parallel. Recently, a technique that exploits this fact to implement fault tolerance on parallel filters has been proposed. In this brief, this technique is first applied to protect FFTs. Then, two improved protection schemes that combine the use of error correction codes and Parseval checks are proposed and evaluated. The results show that the proposed schemes can further reduce the implementation cost of protection.

Simulation Rs.10000

/ Hardware Rs.25000

+ S6 BOARD
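
The Parseval (sum of squares) check named in VLSI29_AE15 rests on the identity that the energy of the input equals the energy of the transform output divided by the transform length N. A minimal sketch of the check on a single transform follows; it uses a direct DFT for brevity, and the test signal, tolerance, and function names are assumptions. It illustrates detection only and does not implement the paper's combination with error correction codes across parallel FFTs.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform (stands in for an FFT block)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def parseval_ok(x, spectrum, tol=1e-9):
    """Return True if sum |x|^2 matches (1/N) * sum |X|^2 within tolerance."""
    n = len(x)
    e_time = sum(abs(v) ** 2 for v in x)
    e_freq = sum(abs(v) ** 2 for v in spectrum) / n
    return abs(e_time - e_freq) <= tol * max(1.0, e_time)


x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5]
X = dft(x)
clean = parseval_ok(x, X)            # fault-free output passes the check

X_faulty = list(X)
X_faulty[3] += 1.0                   # inject an error into one output bin
faulty = parseval_ok(x, X_faulty)    # energy no longer matches

print(clean, faulty)  # True False
```

The check detects an error but does not locate or correct it, and an error that happens to leave the output energy unchanged would go unnoticed.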

VLSI34_AE16 Title: Exploiting Intracell Bit-Error Characteristics to Improve Min-Sum LDPC Decoding for MLC NAND Flash-Based Storage in Mobile Device Abstract: A multilevel per cell (MLC) technique significantly improves the storage density, but also poses serious data integrity challenge for NAND flash memory. This consequently makes the low-density parity-check (LDPC) code and the soft-decision memory sensing become indispensable in the next-generation flash-based solid-state storage devices. However, the use of LDPC codes inevitably increases memory read latency and, hence, degrades speed performance. Motivated by the observation of intracell unbalanced bit error probability and data dependence in the MLC NAND flash memory, this paper proposes two techniques, i.e., intracell data placement interleaving and intracell data dependence aware LDPC decoding, to efficiently improve the LDPC decoding throughput and energy efficiency for the MLC NAND flash-based storage in a mobile device. Experimental results show that, by exploiting the intracell bit-error characteristics, the proposed techniques together can improve the LDPC decoding throughput by up to 84.6% and reduce the energy consumption by up to 33.2% while only incurring less than 0.2% silicon area overhead.

Simulation Rs.12000

/ Hardware Rs.25000

+ S6 BOARD

VLSI37_AE17 Title: Unequal-Error-Protection Error Correction Codes for the Embedded Memories in Digital Signal Processors Abstract: In many digital signal processing applications, some parts of a word stored in the embedded static random access memories (SRAMs) are more important than other parts of the word. Due to the differences in importance, memory failures that occur in more important bit locations generally give rise to relatively larger system performance degradation than those in less important locations. This brief presents a low-complexity unequal-error-protection error correcting code (UEEP-ECC) approach for the embedded memories in digital signal processor. In the proposed UEEP-ECC, repetition code is combined with the Bose–Chaudhuri–Hocquenghem code to selectively provide stronger error correction capabilities on more important data portions without a large hardware overhead. An efficient UEEP-ECC generation algorithm that can find the UEEP-ECC code with a minimum power of memory core and ECC logics is also presented. The experimental results show that the UEEP-ECC scheme achieves considerable power savings and data quality improvements in both of the H.264 and fast Fourier transform applications.

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD




VLSI38_AE18 Title: A High Throughput List Decoder Architecture for Polar Codes Abstract: While long polar codes can achieve the capacity of arbitrary binary-input discrete memoryless channels when decoded by a low complexity successive-cancellation (SC) algorithm, the error performance of the SC algorithm is inferior for polar codes with finite block lengths. The cyclic redundancy check (CRC)-aided SC list (SCL) decoding algorithm has better error performance than the SC algorithm. However, current CRC-aided SCL decoders still suffer from long decoding latency and limited throughput. In this paper, a reduced latency list decoding (RLLD) algorithm for polar codes is proposed. Our RLLD algorithm performs the list decoding on a binary tree, whose leaves correspond to the bits of a polar code. In existing SCL decoding algorithms, all the nodes in the tree are traversed, and all possibilities of the information bits are considered. Instead, our RLLD algorithm visits much fewer nodes in the tree and considers fewer possibilities of the information bits. When configured properly, our RLLD algorithm significantly reduces the decoding latency and, hence, improves throughput, while introducing little performance degradation. Based on our RLLD algorithm, we also propose a high throughput list decoder architecture, which is suitable for larger block lengths due to its scalable partial sum computation unit. Our decoder architecture has been implemented for different block lengths and list sizes using the TSMC 90-nm CMOS technology. The implementation results demonstrate that our decoders achieve significant latency reduction and area efficiency improvement compared with the other list polar decoders in the literature.

Simulation Rs.12000

/ Hardware Rs.20000

+ S6 BOARD

VLSI39_AE19 Title: A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Abstract: Nowadays, many applications require simultaneous computation of multiple independent fast Fourier transform (FFT) operations with their outputs in natural order. Therefore, this brief presents a novel pipelined FFT processor for the FFT computation of two independent data streams. The proposed architecture is based on the multipath delay commutator FFT architecture. It has an N/2-point decimation in time FFT and an N/2-point decimation in frequency FFT to process the odd and even samples of two data streams separately. The main feature of the architecture is that the bit reversal operation is performed by the architecture itself, so the outputs are generated in normal order without any dedicated bit reversal circuit. The bit reversal operation is performed by the shift registers in the FFT architecture by interleaving the data. Therefore, the proposed architecture requires a lower number of registers and has high throughput.

Simulation Rs.10000

/ Hardware Rs.25000

+ S6 BOARD




VLSI40_AE20 Title: Design and FPGA Implementation of a Reconfigurable 1024-Channel Channelization Architecture for SDR Application Abstract: In this paper, we present a novel channelization architecture, which can simultaneously process two channels of complex input data and provide up to 1024 independent channels of complex output data. The proposed architecture is highly modular and generic, so that parameters of each output channel can be dynamically changed even at runtime in terms of the bandwidth, center frequency, output sampling rate, and so on. It consists of one tunable pipelined frequency transform (TPFT)-based coarse channelization block, one tuning unit, and one resampling filter. Based on the analysis of the data dependence between the subbands, a novel channel splitting scheme is proposed to enable multiple subbands to share the proposed TPFT block. The proposed Farrow-based resampling filter does not require division operation and dual-port RAMs resulting in significant area saving. Finally, we implement the proposed channelization architecture in a single field-programmable gate array. The experiment results indicate that our design provides the flexibility associated with the existing works, but with greater resource efficiency.

Simulation Rs.15000

/ Hardware Rs.25000

+ S6 BOARD

VLSI64_AE21 Title: Low-Power/Cost RNS Comparison via Partitioning the Dynamic Range Abstract: Residue number systems (RNSs) are the main choice in many comparison- and division-free applications (e.g., digital signal processing). However, the development of efficient RNS comparators can widen the spectrum of RNS applications. Such comparators can replace the straightforward, but slow and costly, practice of converting the comparison operands to binary, as inputs to a wide word binary comparator. This has motivated some researchers to design shortcut RNS comparison methods that obviate the need for full reverse conversions. However, the few actual realizations that we have encountered are based on the moduli set τ = {2^n−1, 2^n, 2^n+1}. In this paper, after a brief review and performance evaluation of the previous methods, we present a new τ-comparator with considerably reduced cost and power dissipation, with no delay penalty. The underlying comparison algorithm is based on ordering the dynamic range into consecutive partitions, and locating the partitions that own the corresponding comparison operands. The required circuitry includes two n-bit adders, which are replaced by one compound parallel prefix architecture, in order to save area and power. Post layout performance evaluations of the proposed work and the best previous one show a small latency improvement and reductions of 17% (46%) in area consumption, 30% (41%) in power dissipation, and 31% (47%) in power-delay product for n = 8 (22).

Simulation

Rs.8000 /

Hardware Rs.16000

+ S6 BOARD
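
To see why comparison is awkward in a residue number system, it helps to look at the baseline that VLSI64_AE21 calls "slow and costly": convert both operands back to binary and compare. The sketch below does this for the moduli set {2^n−1, 2^n, 2^n+1} using the Chinese remainder theorem. The choice n = 3, the operand values, and the function names are assumptions; the paper's partition-based comparator is not reproduced here. The modular inverse via pow(x, -1, m) needs Python 3.8 or later.

```python
def to_rns(x, moduli):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """Full reverse conversion via the Chinese remainder theorem."""
    dynamic_range = 1
    for m in moduli:
        dynamic_range *= m
    x = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        x += r * partial * pow(partial, -1, m)  # modular inverse of partial mod m
    return x % dynamic_range


def rns_less_than(a, b, moduli):
    """Baseline comparator: reverse-convert both operands, then compare."""
    return from_rns(a, moduli) < from_rns(b, moduli)


n = 3
moduli = (2**n - 1, 2**n, 2**n + 1)   # (7, 8, 9), dynamic range 504
a = to_rns(100, moduli)
b = to_rns(321, moduli)
print(a, b, rns_less_than(a, b, moduli))  # (2, 4, 1) (6, 1, 6) True
```

Note that no single residue reveals which operand is larger, so some form of magnitude information has to be reconstructed; shortcut methods aim to do this without the full conversion shown here.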




VLSI66_AE22 Title: Understanding the Relation Between the Performance and Reliability of NAND Flash/SCM Hybrid Solid-State Drive Abstract: A NAND flash memory/storage-class memory (SCM) hybrid solid-state drive (SSD) can achieve higher performance than the conventional NAND flash-only SSD. Error-correcting codes (ECCs) are applied to the SSD to correct bit errors occurring inside the NAND flash and SCM. To correct more bit errors, the stronger ECC is required and the ECC latency increases. This paper evaluates the relation between the performance and the reliability of the NAND flash/SCM hybrid SSD. First, how the ECC latency impacts the SSD performance is analyzed. Then, the SSD performances are evaluated with various data-access patterns. The ECC effect is significantly different among the data access patterns. Moreover, four scenarios of the SCM reliability are established and the performances are evaluated with the four data-access patterns. When the SCM reliability becomes high, the decrease in the throughput due to the ECC for SCM becomes significantly small. Finally, by setting the acceptable SSD performance, the acceptable bit-error rate (BER) of the SCM is evaluated. The SCM BER can be as high as around 0.9%.

Simulation

Rs.8000 /

Hardware Rs.16000

+ S6 BOARD

VLSI68_AE23 Title: Optimized Built-In Self-Repair for Multiple Memories Abstract: A new built-in self-repair (BISR) scheme is proposed for multiple embedded memories to find the optimum point of BISR performance. All memories are concurrently tested by a small dedicated built-in self-test to identify the faulty memories, the number of faults, and irreparability. After all memories are tested, only the faulty memories are serially tested and repaired by the shared built-in redundancy analysis, in descending order of memory size. Thus, fast test and repair are performed with low area overhead. To accomplish an optimal repair rate and a fast analysis speed, an exhaustive search for all combinations of spare rows and columns is proposed based on the optimized fault collection. Experimental results show that the proposed BISR has the optimal repair rate because of the exhaustive search. The performance of the proposed BISR is located at the optimum point between the test and repair time and the area overhead. For example, the proposed BISR requires 49.6% of the area and 1.3 times the test and repair time in comparison with a parallel BISR scheme for four memories (one 128 K, two 256 K, and one 512 K memories). Furthermore, the more memories there are, the better the performance in terms of the test and repair time and the area overhead.

Simulation Rs.8000

/ Hardware Rs.16000

+ S6 BOARD




VLSI69_AE24 Title: Measuring Improvement When Using HUB Formats to Implement Floating-Point Systems Under Round-to-Nearest Abstract: This paper analyzes the benefits of using half-unit-biased (HUB) formats to implement floating-point (FP) arithmetic under a round-to-nearest mode from a quantitative point of view. Using the HUB formats to represent numbers allows the removal of the rounding logic of arithmetic units, including sticky-bit computation. This is shown for FP adders, multipliers, and converters. Experimental analysis demonstrates that the HUB formats and the corresponding arithmetic units maintain the same accuracy as the conventional ones. On the other hand, the implementation of these units, based on basic architectures, shows that the HUB formats simultaneously improve area, speed, and power consumption. In addition, based on the data obtained from the synthesis, an HUB single-precision adder is ∼14% faster but consumes 38% less area and 26% less power than the conventional adder. Similarly, an HUB single-precision multiplier is 17% faster, uses 22% less area, and consumes slightly less power than the conventional multiplier. At the same speed, the adder and the multiplier achieve area and power reductions of up to 50% and 40%, respectively.

Simulation Rs.8000

/ Hardware Rs.16000

+ S6 BOARD

VLSI72_AE25 Title: A High-Throughput Hardware Design of a One-Dimensional SPIHT Algorithm Abstract: Video display systems include frame memory, which stores video data for display. To reduce system cost, video data are often compressed for storage in frame memory. A desirable characteristic for display memory compression is support for the raster-scan processing order and the fixed target compression ratio. Set partitioning in hierarchical trees (SPIHT) is an efficient two-dimensional compression algorithm that guarantees a fixed target compression ratio, but its one-dimensional (1D) variation has received little attention, even though its 1D nature supports the raster-scan processing order. This paper proposes a novel hardware design for 1D SPIHT. The algorithm is modified to exploit parallelism for effective hardware implementation. For the encoder, dependences that prohibit parallel execution are resolved and a pipelined schedule is proposed. For the parallel execution of the decoder, the algorithm is modified to enable estimation of the bit stream length of each pass prior to decoding. This modification allows parallel and pipelined decoding operations, leading to a high-throughput design for both encoder and decoder. Although the modifications slightly decrease compression efficiency, additional optimizations are proposed to improve such efficiency. As a result, the peak signal-to-noise ratio drop is reduced from 1.40 dB to 0.44 dB. The throughputs of the proposed encoder and decoder are 7.04 Gbps and 7.63 Gbps, respectively, and their respective gate counts are 37.2K and 54.1K.

Simulation Rs.8000

/ Hardware Rs.16000

+ S6 BOARD




VLSI73_AE26 Title: Network-on-Chip for Turbo Decoders Abstract: The multi-application specific instruction processor (ASIP) architecture is a promising candidate for flexible high-throughput turbo decoders. This brief proposes a network-on-chip (NoC) structure for multi-ASIP turbo decoders. The process of turbo decoding is studied, and the addressing patterns for turbo codes in long term evolution (LTE) and High Speed Downlink Packet Access (HSDPA) are analyzed. Based on this analysis, two techniques, subnetworking and calculation sequence, are proposed for reducing the complexity of the NoC. The implementation results show that the proposed structure gives an improvement of 53% for HSDPA and 133% for LTE in throughput/area efficiency compared with state-of-the-art NoC solutions.

Simulation Rs.12000

/ Hardware Rs.24000

+ S6 BOARD

VLSI74_AE27 Title: Enhanced Wear-Rate Leveling for PRAM Lifetime Improvement Considering Process Variation Abstract: The limited write endurance is one of the major obstacles for phase-change random access memory (PRAM)-based main memory. Traditionally, wear leveling (WL) techniques were proposed to enhance its lifetime by balancing write traffic. However, these techniques do not consider the endurance variation in PRAM chips. When different PRAM cells have distinct endurance, balanced writes result in lifetime degradation due to the weakest cells. In this paper, we first define a new metric, wear rate (i.e., writes/endurance), considering both the write traffic and endurance distribution from application and hardware, respectively. After investigating the writing behavior of applications and endurance variation, we propose an architecture-level leveling mechanism to balance the wear rate of cells across the PRAM chip. Hardware and algorithm to support the proposed leveling mechanism are presented. Moreover, there is an important tradeoff between endurance improvement and swapping data volume. To co-optimize endurance and swapping, this situation is formulated as a maximum weight perfect matching problem in a bipartite graph. Thereafter, a novel algorithm that minimizes wear rate and swapping by employing the Kuhn–Munkres algorithm is proposed to maximize PRAM lifetime and minimize performance degradation. The experimental results show ∼17× lifetime improvement over prior WL.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD
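
As a rough illustration of the wear-rate metric named in the VLSI74_AE27 abstract (writes/endurance), the sketch below compares evenly balanced writes with writes distributed in proportion to cell endurance. The endurance values, the write budget, and the function names are hypothetical; this is not the paper's leveling hardware or its matching algorithm, only the arithmetic behind the metric.

```python
def wear_rates(writes, endurance):
    """Wear rate of each cell: writes per interval divided by endurance."""
    return [w / e for w, e in zip(writes, endurance)]


def lifetime(writes, endurance):
    """Intervals until the first cell wears out (set by the highest wear rate)."""
    return 1.0 / max(wear_rates(writes, endurance))


# Hypothetical endurance (maximum writes) of three cells under process variation.
endurance = [1_000_000, 2_000_000, 4_000_000]
total_writes = 700  # hypothetical writes per interval across all cells

# Conventional wear leveling: every cell receives the same number of writes.
balanced = [total_writes / len(endurance)] * len(endurance)

# Wear-rate leveling: writes proportional to endurance, so all wear rates match.
total_endurance = sum(endurance)
proportional = [total_writes * e / total_endurance for e in endurance]

balanced_life = lifetime(balanced, endurance)
leveled_life = lifetime(proportional, endurance)
print(balanced_life, leveled_life)
```

With balanced writes the weakest cell sets the lifetime; equalizing the wear rate instead lets every cell wear out at about the same time.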




VLSI75_AE28 Title: Speculative Lookahead for Energy-Efficient Microprocessors Abstract: In addition to being the in situ performance monitor for adaptive voltage scaling (AVS), timing speculation mechanisms (e.g., Razor) featuring dynamic timing fault detection and correction help to relax timing constraints for simple logic structures and low-power cells. Conventional timing fault detection mechanisms require substantial buffers to prevent race conditions on short paths for double sampling, which can overwhelm energy savings from timing relaxation and voltage scaling. This paper proposes a novel timing speculation scheme, speculative lookahead (SL), comprising duplicate timing-relaxed data paths, the short paths of which do not introduce race conditions and thus require no additional buffer insertion. In experiments using a 40-nm CMOS technology, SL occupied 54.89% of the area of a Razor-based 32-bit multiplier, and conserved 59.77% energy per operation at nominal 1.1 V and 53.49% when AVS was applied. An ARM Cortex M0-like microprocessor unit (MPU) was designed using an SL-based data path, the timing fault detection and correction mechanism of which can be dynamically deactivated for latency-tolerant instructions [i.e., on-demand timing speculation (ODTS)] to further conserve up to 31.08% energy in the execution unit. In addition, a field-programmable gate array prototype of the SL/ODTS MPU was constructed to demonstrate the effectiveness of delay variation tolerance and implementation flexibility.

Simulation Rs.8000 / Hardware Rs.25000 + S6 BOARD

VLSI78_AE29 Title: Efficient Synchronization for Distributed Embedded Multiprocessors Abstract: In multiprocessor systems, low-latency synchronization is extremely important to effectively exploit fine-grain data parallelism and improve overall performance. This brief presents an efficient synchronization scheme for embedded distributed multiprocessors. The proposed solution works in a completely decentralized request–response manner via explicit message exchange among the processing elements. Scalable lock and barrier synchronization algorithms, which are derived from the inherent distributed characteristics of the underlying architecture, are proposed to enable fair, orderly, and contention-free synchronization. We implement the proposed synchronization model in a distributed 32-core architecture with a commercial cycle-accurate SystemC simulation platform. Experimental results show that our proposed approach achieves ultralow synchronization latency and almost ideal scalability when the core count scales.

Simulation Rs.8000 / Hardware Rs.22000 + S6 BOARD




VLSI79_AE30 Title: NAND Flash Memory With Multiple Page Sizes for High-Performance Storage Devices Abstract: In recent years, the demand for NAND flash-based storage devices has rapidly increased because of the popularization of various portable devices. NAND flash memory (NFM) offers many advantages, such as non-volatility, high performance, a small form factor, and low power consumption, while achieving high chip integration with a specialized architecture for bulk data access. The page, the unit of NFM's read and program operations, has continuously grown. Although increasing page size reduces costs, it adversely affects performance because of the resultant side effects, such as fragmentation and wasted space, caused by the incongruity of data and page sizes. To address this issue, we propose a multiple-page-size NFM architecture and its management. Our method dramatically improves write performance by adopting multiple page sizes without requiring additional area overhead or manufacturing processes. Based on the experimental results, the proposed NFM improves write latency and NFM lifetime by up to 65% and 62%, respectively, compared with the single-page-size NFM.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI80_AE31 Title: A Performance Degradation Tolerable Cache Design by Exploiting Memory Hierarchies Abstract: Performance degradation tolerance (PDT) has been shown to be able to effectively improve the yield, reliability, and lifetime of an electronic product. The focus of PDT is on the particular performance degrading faults (pdef) that only incur some performance degradation of a system without inducing any computation errors. The basic idea is that as long as the defective chips containing only the pdef can provide acceptable performance for some applications, they may still be marketable. Critical issues of PDT to be addressed include the portion of the pdef in a faulty chip and their induced performance degradation. For a typical cache design, most of the possible faults are not pdef. In this brief, we propose a cache redesign method, called PDT cache, where all functional faults in the data-storage cells of a cache (major part of the cache) can be transformed into pdef. By transforming this large number of faults into pdef, a faulty cache becomes much more likely to be still marketable. The proposed design exploits the existing hardware resources and the inherent error resilience scheme to reduce the incurred hardware overhead. The logic synthesis results show that the incurred hardware overhead is only 6.29% for a 32-kB cache. We also evaluate the induced performance degradation under various fault densities using the CPU2000 and CPU2006 benchmark programs. The results show that for a 32-kB cache design, when the fault density is <1%, only 0.31% performance degradation is incurred. In addition, the scalability of the PDT cache is also evaluated. The results show that a smaller hardware overhead is required for a larger cache, and the performance degradation is independent of the cache associativity and can even be smaller for a larger cache under a given fault density.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI81_AE32 Title: Knowledge-Based Neural Network Model for FPGA Logical Architecture Development Abstract: This paper proposes a knowledge-based neural network (KBNN) modeling approach for field-programmable gate array (FPGA) logical architecture design. The KBNN embeds the existing FPGA analytical models (AMs) into an NN. The NN can complement the AMs according to their needs to provide further increased model accuracy, while maintaining the meaningful trends successfully captured in the AMs. The obtained KBNN predicts the routing channel width required by circuit implementations on various FPGA architectures, which can be used by architects to quickly and accurately evaluate various FPGA architectures in early development stages. Experimental results show that the KBNN-based approach achieves an average error of 2%, which shows 75% accuracy enhancement over the existing AMs for routing channel width estimation of a set of benchmark circuits and FPGA architectures. The KBNN model has been applied to three FPGA architecture development scenarios to demonstrate its practical application and effectiveness.

Simulation Rs.15000 / Hardware Rs.30000 + S6 BOARD

VLSI83_AE33 Title: A New Optimal Algorithm for Energy Saving in Embedded System With Multiple Sleep Modes Abstract: For embedded systems with multiple sleep modes, it is interesting to understand how to maximize the energy saving potential by choosing the suitable sleep mode(s) during the idle period. In this paper, we establish a sufficient condition to narrow down the search space of sleep policy and propose a new algorithm, the optimal-idle-threshold-policy algorithm, under a more realistic setting than existing works. Theoretical proofs and experimental results justify the benefits of our approach.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI84_AE34 Title: A Fast Fault-Tolerant Architecture for Sauvola Local Image Thresholding Algorithm Using Stochastic Computing Abstract: Binarization plays an important role in document image processing, particularly in degraded document images. Among all local image thresholding algorithms, Sauvola has excellent binarization performance for degraded document images. However, this algorithm is computationally intensive and sensitive to noise from the internal computational circuits. In this paper, we present a stochastic implementation of the Sauvola algorithm. Our experimental results show that the stochastic implementation of Sauvola needs much less time and area and can tolerate more faults, while consuming less power in comparison with its conventional implementation.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD
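
For readers unfamiliar with the algorithm named in VLSI84_AE34, the sketch below shows the conventional (deterministic) Sauvola threshold, T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation. It is not the stochastic-computing implementation the paper proposes. The window radius, the parameter values k = 0.5 and R = 128, the sample image, and the function names are assumptions chosen for illustration.

```python
import math


def sauvola_threshold(window, k=0.5, R=128.0):
    """Threshold for one local window of gray levels: T = m * (1 + k * (s / R - 1))."""
    n = len(window)
    mean = sum(window) / n
    variance = sum((p - mean) ** 2 for p in window) / n
    std = math.sqrt(variance)
    return mean * (1.0 + k * (std / R - 1.0))


def binarize(image, radius=1, k=0.5, R=128.0):
    """Binarize a 2-D list of gray levels (0-255); windows are clipped at the borders."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            t = sauvola_threshold(window, k, R)
            row.append(1 if image[y][x] > t else 0)  # 1 = background, 0 = ink
        out.append(row)
    return out


# Hypothetical 3x4 patch: bright paper with two dark ink pixels in the middle.
page = [
    [200, 200, 200, 200],
    [200,  30,  30, 200],
    [200, 200, 200, 200],
]
result = binarize(page)
print(result)
```

Every pixel needs a local mean and standard deviation, which is the computational cost the abstract refers to.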




VLSI85_AE35 Title: Efficiency Enablers of Lightweight SDR for MIMO Baseband Processing Abstract: The flexibility and programmability of an application-specific instruction-set processor (ASIP) come at the expense of reduced area and energy efficiency compared to application-specific integrated circuit (ASIC) solutions. Nevertheless, ASIPs are desirable for versatile application domains like wireless communications and software defined radio (SDR). Typically, ASIP designers reduce the ASIC-ASIP efficiency gap by increasingly complex architectures with decreasing flexibility and usability. This paper takes the opposite approach and presents concepts for a highly efficient, lightweight SDR ASIP. Efficiency enablers include simple but effective measures like a carefully chosen instruction set, optimized data access techniques for efficient utilization of functional units, and the use of flexible floating-point arithmetic with runtime-adaptive numerical precision. We present a conceptual processor core to show the impact of these measures and discuss its potential as well as limitations compared to tailored ASIC solutions. For demonstration, we choose the field of linear multiple-input multiple-output (MIMO) detection. We present synthesis results for several design versions in 90 nm CMOS technology and the corresponding energy benchmarks. Also, we show post-layout results for a selected design to demonstrate the feasibility of our concept.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI86_AE36 Title: A Novel Quantum-Dot Cellular Automata X-bit × 32-bit SRAM Abstract: Application of quantum-dot cellular automata (QCA) technology as an alternative to CMOS technology on the nanoscale has a promising future; QCA is an interesting technology for building memory. The proposed design and simulation of a new memory cell structure based on QCA with a minimum delay, area, and complexity is presented to implement a static random access memory (SRAM). This paper presents the design and simulation of a 16-bit × 32-bit SRAM with a new structure in QCA. Since QCA is a pipeline, this SRAM has a high operating speed. The 16-bit × 32-bit SRAM has a new structure with a 32-bit width designed and implemented in QCA. It has the ability of a conventional logic SRAM that can provide read/write operations frequently with minimum delay. The 16-bit × 32-bit SRAM is generalized and an n × 16-bit × 32-bit SRAM is implemented in QCA. Novel 16-bit decoders and multiplexers (MUXs) in QCA are presented that have been designed with a minimum number of majority gates and cells. The new SRAM, decoders, and MUXs are designed, implemented, and simulated in QCA using a signal distribution network to avoid the coplanar problem of crossing wires. The QCA-based SRAM cell was compared with the SRAM cell based on CMOS. Results show that the proposed SRAM is more efficient in terms of area, complexity, clock frequency, latency, throughput, and power consumption.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD




VLSI88_AE37 Title: Ultralow-Energy Variation-Aware Design: Adder Architecture Study Abstract: Power consumption of digital systems is an important issue in nanoscale technologies and growth of process variation makes the problem more challenging. In this brief, we have analyzed the latency, energy consumption, and effects of process variation on different structures with respect to the design structure and logic depth to propose architectures with higher throughput, lower energy consumption, and smaller performance loss caused by process variation in application specific integrated circuit design. We have exploited adders as different implementations of a processing unit, and propose architectural guidelines for finer technologies in subthreshold which are applicable to any other architecture. The results show that smaller computing building blocks have better energy efficiency and less performance degradation because of variation effects. In contrast, their computation throughput will be mid or less unless proper solutions, such as pipelined or parallel structures, are used. Therefore, our proposed solution to improve the throughput loss while reducing sensitivity to process variations is using simpler elements in deep pipelined designs or massively parallel structures.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI90_AE38 Title: Write Buffer-Oriented Energy Reduction in the L1 Data Cache for Embedded Systems Abstract: In resource-constrained embedded systems, on-chip cache memories play an important role in both performance and energy consumption. In contrast to read operations, scant regard has been paid to optimizing write operations even though the energy consumed by write operations in the data cache constitutes a large portion of the total energy consumption. Consequently, this paper proposes a write buffer-oriented (WO) cache architecture that reduces energy consumption in the L1 data cache. Observing that write operations are very likely to be merged in the write buffer because of their high localities, we construct the proposed WO cache architecture to utilize two schemes. First, the write operations update the write buffer but not the L1 data cache, which is updated later by the write buffer after the write operations are merged. Write merging significantly reduces write accesses to the data cache and, consequently, energy consumption. Second, we further reduce energy consumption in the write buffer by filtering out unnecessary read accesses to the write buffer using a read hit predictor. In this paper, we show that the proposed WO cache architecture is applicable to the conventional embedded processors that support both write-through and write-back policies. Further, the experimental results verify that the proposed cache architecture reduces energy consumption in data caches up to 14%.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD




VLSI92_AE39 Title: Toward Solving Multichannel RF-SoC Integration Issues Through Digital Fractional Division Abstract: In modern RF system on chips (SoCs), the digital content consumes up to 85% of the IC chip area. The recent push to integrate multiple RF-SoC cores is met with heavy resistance by the remaining RF/analog circuitry, which creates numerous strong aggressors and weak victims leading to RF performance degradation. A key such mechanism is injection pulling through parasitic coupling between various LC-tank oscillators as well as between them and strong transmitter (TX) outputs. Any static or dynamic frequency proximity between aggressors (i.e., oscillators and TX outputs) and victims (i.e., oscillators) that share the same die causes injection pulling, which produces unwanted spurs and/or modulation distortion. In this paper, we propose and demonstrate a new frequency planning technique of a multicore TX where each LC-tank oscillator is separated from other aggressors beyond its pulling range. This is done by breaking the integer harmonic frequency relationship of victims/aggressors within and between the RF transmission channels using digital fractional divider based on a phase rotation. Each oscillator's center frequency can be fractionally separated by ∼28% but, at the same time, both producing closely spaced frequencies at the phase rotator outputs. The injection-pulling spurs are so far away that they are insignificantly small (−80 dBc) and coincide with the second harmonic of the carrier. This method is experimentally verified in a two-channel system in 65-nm digital CMOS, each channel comprising a high-swing class-C oscillator, frequency divider, and phase rotator.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI93_AE40 Title: Error Resilient and Energy Efficient MRF Message-Passing-Based Stereo Matching Abstract: Message-passing-based inference algorithms have immense importance in real-world applications. In this paper, error resiliency of a message-passing-based Markov random field (MRF) stereo matching hardware is explored and enhanced through the application of statistical error compensation. Error resiliency is of particular interest for sub-nanometer and post-silicon devices. The inherent robustness of iteration-based MRF inference algorithms is explored and shows that small errors are tolerable, while large errors degrade the performance significantly. Based on these error characteristics, algorithmic noise tolerance (ANT) has been applied at the arithmetic, iteration, and system levels. Introducing timing errors via voltage overscaling, at the arithmetic level, results show that the ANT-based hardware can tolerate an error rate of 21.3%, with performance degradation of only 3.5% at an overhead of 97.4%, compared with an error-free hardware with an energy savings of 39.7%. To reduce compensation complexity, iteration- and system-level compensation was explored. Results show that, compared with arithmetic level, system-level compensation reduces overhead to 59%, while maintaining stereo matching performance with only 2.5% degradation with 16% additional power savings. These results are verified via FPGA emulation with timing errors induced within the message passing unit via relaxed synthesis.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI94_AE41 Title: Floating-Point Butterfly Architecture Based on Binary Signed-Digit Representation Abstract: Fast Fourier transform (FFT) coprocessor, having a significant impact on the performance of communication systems, has been a hot topic of research for many years. The FFT function consists of consecutive multiply-add operations over complex numbers, dubbed as butterfly units. Applying floating-point (FP) arithmetic to FFT architectures, specifically butterfly units, has become more popular recently. It offloads compute-intensive tasks from general-purpose processors by dismissing FP concerns (e.g., scaling and overflow/underflow). However, the major downside of FP butterfly is its slowness in comparison with its fixed-point counterpart. This reveals the incentive to develop a high-speed FP butterfly architecture to mitigate FP slowness. This brief proposes a fast FP butterfly unit using a devised FP fused-dot-product-add (FDPA) unit, to compute AB±CD±E, based on binary signed-digit (BSD) representation. The FP three-operand BSD adder and the FP BSD constant multiplier are the constituents of the proposed FDPA unit. A carry-limited BSD adder is proposed and used in the three-operand adder and the parallel BSD multiplier so as to improve the speed of the FDPA unit. Moreover, modified Booth encoding is used to accelerate the BSD multiplier. The synthesis results show that the proposed FP butterfly architecture is much faster than previous counterparts but at the cost of more area.

Simulation Rs.8000 / Hardware Rs.20000 + S6 BOARD

VLSI95_AE42 Title: On Efficient Retiming of Fixed-Point Circuits Abstract: Retiming of digital circuits is conventionally based on the estimates of propagation delays across different paths in the data-flow graphs (DFGs) obtained by discrete component timing model, which implicitly assumes that operation of a node can begin only after the completion of the operation(s) of its preceding node(s) to obey the data dependence requirement. Such a discrete component timing model very often gives much higher estimates of the propagation delays than the actuals, particularly when the computations in the DFG nodes correspond to fixed-point arithmetic operations like additions and multiplications. On the other hand, very often it is imperative to deal with the DFGs of such higher granularity at the architecture-level abstraction of digital system design for mapping an algorithm to the desired architecture, where the overestimation of propagation delay leads to unwanted pipelining and undesirable increase in pipeline overheads. In this paper, we propose the connected component timing model to obtain adequately precise estimates of propagation delays across different combinational paths in a DFG easily, for efficient cutset-retiming in order to reduce the critical path substantially without significant increase in register complexity and latency. Apart from that, we propose novel node-splitting and node-merging techniques that can be used in combination with the existing retiming methods to achieve reduction of critical path to a fraction of that of the original DFG with a small increase in overall register complexity.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI99_AE43 Title: Trigger-Centric Loop Mapping on CGRAs Abstract: A coarse-grained reconfigurable architecture (CGRA) is a promising platform based on considerations for both performance and power efficiency. One of the primary obstacles that CGRAs might face is how to accelerate loops with if–then–else (ITE) structures. A recent control paradigm for CGRAs named triggered instruction architecture (TIA) can provide an efficient scheme to accelerate loops with ITE structures. Yet common loop mapping frameworks cannot leverage this scheme autonomously. To this end, this brief makes two contributions: 1) identify and remove redundant nodes from a data flow graph and 2) propose an integrated approach, TRMap, which consists of operations merging, Boolean operations offloading, and transformation of triggers. Our experimental results from some vital kernels extracted from SPEC2006 benchmarks and digital signal processing applications show that by using the TIA scheme, TRMap is able to accelerate loops with ITE structures to an execution that is 1.38× and 1.64× faster than that achieved by a full predication scheme (FP-Choi) and a state-of-the-art method (BRMap), respectively.

Simulation Rs.10000 / Hardware Rs.20000 + S6 BOARD

VLSI100_AE44 Title: Area-Aware Cache Update Trackers for Postsilicon Validation Abstract: The internal state of the complex modern processors often needs to be dumped out frequently during postsilicon validation. Since the caches hold most of the state, the volume of data dumped and the transfer time are dominated by the large caches present in the architecture. The limited bandwidth to transfer data present in these large caches off-chip results in stalling the processor for long durations when dumping the cache contents off-chip. To alleviate this, we propose to transfer only those cache lines that were updated since the previous dump. Since maintaining a bit-vector with a separate bit to track the status of individual cache lines is expensive, we propose two methods: 1) where a bit tracks multiple cache lines and 2) an Interval Table which stores only the starting and ending addresses of continuous runs of updated cache lines. Both methods require significantly less space compared with a bit-vector, and allow the designer to choose the amount of space to allocate for this design-for-debug feature. The impact of reducing storage space is that some nonupdated cache lines are dumped too. We attempt to minimize such overheads. We propose a scheme to share such cache update tracking hardware (or Update Trackers) across multiple caches in case of physically distributed caches so that they are replicated fewer times, thereby limiting the area overhead. We show that the proposed Update Trackers occupy less than 1% of cache area for both the shared and distributed caches.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD
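
The Interval Table idea in VLSI100_AE44 (store only the start and end of each run of updated cache lines) can be sketched in software as follows. The function name, the `max_entries` parameter, and the policy of merging the two closest runs when the table overflows are assumptions made for illustration; the abstract does not describe the paper's actual overflow handling.

```python
def build_interval_table(updated_lines, max_entries):
    """Record runs of updated cache-line indices as (start, end) pairs.

    max_entries must be at least 1. If more runs exist than the table can
    hold, the two closest runs are merged (an assumed policy), which marks
    the non-updated lines between them for dumping as well.
    """
    intervals = []
    for line in sorted(set(updated_lines)):
        if intervals and line == intervals[-1][1] + 1:
            intervals[-1][1] = line          # extend the current run
        else:
            intervals.append([line, line])   # start a new run
    while len(intervals) > max_entries:
        # Find the adjacent pair separated by the smallest gap and merge it.
        gaps = [intervals[i + 1][0] - intervals[i][1]
                for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))
        intervals[i][1] = intervals[i + 1][1]
        del intervals[i + 1]
    return [tuple(iv) for iv in intervals]


# Hypothetical set of cache lines written since the previous dump.
updated = [3, 4, 5, 9, 20, 21]
exact = build_interval_table(updated, max_entries=4)
compact = build_interval_table(updated, max_entries=2)
print(exact, compact)
```

With only two entries the table covers lines 3 to 9 in one interval, so lines 6, 7, and 8 are dumped although they were not updated. This is the space-versus-overhead tradeoff the abstract mentions.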

VLSI101_AE45 Title: PEVA: A Page Endurance Variance Aware Strategy for the Lifetime Extension of NAND Flash Abstract: With aggressive scaling and multilevel cell technology, the reliability of NAND flash continuously degrades. The lifetime of NAND flash is highly restricted by the bit error rate (BER), and error-correcting codes (ECCs) can provide only limited error correction capability to tolerate increasing bit errors. To cope with this issue, a novel page endurance variance aware (PEVA) strategy is proposed to extend the lifetime of NAND flash based on the experimental observations from our hardware–software co-designed experimental platform. The experimental observations indicate that the BER distribution of retention error shows distinct variances in different pages. The key purpose of PEVA is to exploit the lifetime potency of every page in a block by introducing fine-grained bad page management instead of coarse-grained bad block management (BBM). The experimental results show that the PEVA can extend the lifetime of 2×-nm NAND flash by 9.8× compared with the conventional BBM and that there is at most an 8.7% degradation in writing speed compared with the traditional sector mapping technology. In addition, the maximum writing response time increased by at most 5.9% during the operation of the PEVA strategy.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD

VLSI102_AE46 Title: Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Abstract: The coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures with the advantages of high performance and high power efficiency. The compute-intensive parts of an application (e.g., loops) are often mapped onto the CGRA for acceleration. Due to the extra overhead of memory access and the limited communication bandwidth between the processing element (PE) array and local memory, previous works trying to solve the routing problem are mainly confined to the internal resources of PE arrays (e.g., PEs and registers). Inevitably, routing with PEs or registers will consume a lot of computational resources and cause the increase of the initiation interval. To solve this problem, this paper makes two contributions: 1) establishing a precise formulation for the CGRA mapping problem while using shared local data memory as a routing resource and 2) extracting an effective approach for mapping loops to CGRAs. The experimental results on loops of the SPEC2006, Livermore, and MiBench show that our approach (called MEMMap) can improve the performance of the kernels on CGRA up to 1.62×, 1.58×, 1.28×, and 1.23× compared with the edge-centric modulo scheduling, EPIMap, REGIMap, and force-directed map, respectively, with an acceptable increase in compilation time.

Simulation Rs.8000 / Hardware Rs.16000 + S6 BOARD




VLSI104_AE47 Title: An Efficient Single and Double-Adjacent Error Correcting Parallel Decoder for the (24,12) Extended Golay Code Abstract: Memories that operate in harsh environments, like for example space, suffer a significant number of errors. The error correction codes (ECCs) are routinely used to ensure that those errors do not cause data corruption. However, ECCs introduce overheads both in terms of memory bits and decoding time that limit speed. In particular, this is an issue for applications that require strong error correction capabilities. A number of recent works have proposed advanced ECCs, such as orthogonal Latin squares or difference set codes that can be decoded with relatively low delay. The price paid for the low decoding time is that in most cases, the codes are not optimal in terms of memory overhead and require more parity check bits. On the other hand, codes like the (24,12) Golay code that minimize the number of parity check bits have a more complex decoding. A compromise solution has been recently explored for Bose–Chaudhuri–Hocquenghem codes. The idea is to implement a fast parallel decoder to correct the most common error patterns (single and double adjacent) and use a slower serial decoder for the rest of the patterns. In this brief, it is shown that the same scheme can be efficiently implemented for the (24,12) Golay code. In this case, the properties of the Golay code can be exploited to implement a parallel decoder that corrects single- and double-adjacent errors that is faster and simpler than a single-error correction decoder. The evaluation results using a 65-nm library show significant reductions in area, power, and delay compared with the traditional decoder that can correct single and double-adjacent errors. In addition, the proposed decoder is also able to correct some triple-adjacent errors, thus covering the most common error patterns.

Simulation Rs.10000 / Hardware Rs.18000 + S6 BOARD

VLSI105_AE48 Title: Concept, Design, and Implementation of Reconfigurable CORDIC Abstract: This brief presents the key concept, design strategy, and implementation of reconfigurable coordinate rotation digital computer (CORDIC) architectures that can be configured to operate either for circular or for hyperbolic trajectories in rotation as well as vectoring modes. It can, therefore, be used to perform all the functions of both circular and hyperbolic CORDIC. We propose three reconfigurable CORDIC designs: 1) a reconfigurable rotation-mode CORDIC that operates either for circular or for hyperbolic trajectory; 2) a reconfigurable vectoring-mode CORDIC for circular and hyperbolic trajectories; and 3) a generalized reconfigurable CORDIC that can operate in any of the modes for both circular and hyperbolic trajectories. The reconfigurable CORDIC can perform the computation of various trigonometric and exponential functions, logarithms, square-root, and so on of circular and hyperbolic CORDIC using either rotation-mode or vectoring-mode CORDIC in one single circuit. It can be used in digital synchronizers, graphics processors, scientific calculators, and so on. It offers substantial saving of area complexity over the conventional design for reconfigurable applications.

Simulation Rs.10000 / Hardware Rs.18000 + S6 BOARD
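
To make the circular/hyperbolic distinction in VLSI105_AE48 concrete, here is a minimal floating-point sketch of the textbook rotation-mode CORDIC iteration with a mode switch. It is not the reconfigurable hardware architecture from the paper. The function name, the `mode` argument, and the iteration count are assumptions, and the input angle must lie inside the usual convergence range (roughly |angle| < 1.74 for circular and |angle| < 1.11 for hyperbolic).

```python
import math


def cordic_rotate(angle, mode="circular", iterations=32):
    """Rotation-mode CORDIC.

    Returns (cos, sin) of `angle` for mode="circular" and (cosh, sinh) for
    mode="hyperbolic", using shift-and-add style micro-rotations.
    """
    m = 1 if mode == "circular" else -1
    # Shift sequence: 0, 1, 2, ... for circular; 1, 2, 3, 4, 4, 5, ... for
    # hyperbolic, where shifts 4, 13, 40, ... are repeated for convergence.
    shifts = []
    if m == 1:
        shifts = list(range(iterations))
    else:
        i, repeat = 1, 4
        while len(shifts) < iterations:
            shifts.append(i)
            if i == repeat and len(shifts) < iterations:
                shifts.append(i)
                repeat = 3 * repeat + 1
            i += 1
    # Scale factor accumulated by the micro-rotations.
    gain = 1.0
    for s in shifts:
        gain *= math.sqrt(1.0 + m * 2.0 ** (-2 * s))
    # Pre-scaling x by 1/gain makes the outputs come out unscaled.
    x, y, z = 1.0 / gain, 0.0, angle
    for s in shifts:
        d = 1.0 if z >= 0 else -1.0
        step = math.atan(2.0 ** -s) if m == 1 else math.atanh(2.0 ** -s)
        x, y = x - m * d * y * 2.0 ** -s, y + d * x * 2.0 ** -s
        z -= d * step
    return x, y


cos_val, sin_val = cordic_rotate(0.5, "circular")
cosh_val, sinh_val = cordic_rotate(0.5, "hyperbolic")
print(cos_val, sin_val, cosh_val, sinh_val)
```

The two modes differ only in the sign `m`, the elementary angle table, and the shift sequence, which is why a single datapath can be shared between them.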

Audio, Image and Video Processing

VLSI08_IM1 Title: Input-Based Dynamic Reconfiguration of Approximate Arithmetic Units for Video Encoding Abstract: The field of approximate computing has received significant attention from the research community in the past few years, especially in the context of various signal processing applications. Image and video compression algorithms, such as JPEG, MPEG, and so on, are particularly attractive candidates for approximate computing, since they are tolerant of computing imprecision due to human imperceptibility, which can be exploited to realize highly power-efficient implementations of these algorithms. However, existing approximate architectures typically fix the level of hardware approximation statically and are not adaptive to input data. For example, if a fixed approximate hardware configuration is used for an MPEG encoder (i.e., a fixed level of approximation), the output quality varies greatly for different input videos. This paper addresses this issue by proposing a reconfigurable approximate architecture for MPEG encoders that optimizes power consumption with the goal of maintaining a particular Peak Signal-to-Noise Ratio (PSNR) threshold for any video. We propose two heuristics for automatically tuning the approximation degree of the RABs in these two modules during runtime based on the characteristics of each individual video. Experimental results show that our approach of dynamically adjusting the degree of hardware approximation based on the input video respects the given quality bound (PSNR degradation of 1%–10%) across different videos while achieving a power saving up to 38% over a conventional nonapproximated MPEG encoder architecture. Note that although the proposed reconfigurable approximate architecture is presented for the specific case of an MPEG encoder, it can be easily extended to other DSP applications.

(Image) Simulation Rs. 10000

/ (Video)

Hardware Rs.30000

+ S6 BOARD
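The quality-bounded tuning idea in VLSI08_IM1 can be illustrated with a small sketch. It is not the paper's heuristics, which adapt at runtime to each video; it is a brute-force sweep under stated assumptions. The truncating adder `approx_add`, the function `pick_degree`, and the peak value of 510 (the largest sum of two 8-bit pixels) are all hypothetical names and choices introduced here for illustration.

```python
import math

def psnr(ref, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length value lists."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def approx_add(a, b, k):
    """Hypothetical approximate adder: drops the k low-order bits of each operand."""
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)

def pick_degree(pairs, min_psnr, max_k=8):
    """Largest approximation degree k whose PSNR still meets min_psnr (0 = exact)."""
    exact = [a + b for a, b in pairs]
    best = 0
    for k in range(1, max_k + 1):
        approx = [approx_add(a, b, k) for a, b in pairs]
        if psnr(exact, approx, peak=510) >= min_psnr:
            best = k
        else:
            break  # stop at the first degree that violates the quality bound
    return best

# Example input: pairs of 8-bit values that always sum to 255.
pairs = [(i, 255 - i) for i in range(0, 256, 5)]
print(pick_degree(pairs, min_psnr=30))
```

A larger degree stands in for a cheaper, lower-power adder, so the sweep returns the most aggressive setting that still respects the PSNR bound for the given input.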

VLSI11_IM2 Title: A Configurable Parallel Hardware Architecture for Efficient Integral Histogram Image Computing Abstract: Integral histogram image can accelerate the computing process of feature algorithm in computer vision, but exhibits high computation complexity and inefficient memory access. In this paper, we propose a configurable parallel architecture to improve the computing efficiency of integral histogram. Based on the configurable design in the architecture, multiple integral objects for integral histogram image, such as image intensity, image gradient, and local binary pattern, are well supported. Meanwhile, by means of the proposed strip-based memory partitioning mechanism, this architecture processes the integral histogram quickly with maximal parallelism in a pipeline manner. Besides, in this architecture, the proposed data correlation memory compression mechanism effectively solves the expansion problem of integral histogram memory caused by storing the histogram data. It fully reduces the data redundancy in the integral histograms, and saves a lot of memory resources. Experiments using Cyclone IV-based field-programmable gate array platform and 65-nm technology-based post synthesis show that our architecture improves the average computing speed by 8.6 times with high power efficiency compared with the state-of-the-art works.

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD
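As background for VLSI11_IM2, the following is a minimal software sketch of an integral histogram over image intensity. It shows only the basic recurrence and the four-lookup region query; the strip-based partitioning and memory compression of the paper are not modelled. The function names and the uniform binning rule are assumptions made for illustration.

```python
def integral_histogram(img, bins, max_val=256):
    """H[y][x][b] counts the pixels in bin b inside rows 0..y-1 and columns 0..x-1."""
    h, w = len(img), len(img[0])
    H = [[[0] * bins for _ in range(w + 1)] for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            b = img[y][x] * bins // max_val  # uniform intensity binning (assumed)
            for k in range(bins):
                # Same recurrence as an integral image, applied per bin.
                H[y + 1][x + 1][k] = H[y][x + 1][k] + H[y + 1][x][k] - H[y][x][k]
            H[y + 1][x + 1][b] += 1
    return H

def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1 from four lookups."""
    return [H[bottom][right][k] - H[top][right][k]
            - H[bottom][left][k] + H[top][left][k]
            for k in range(len(H[0][0]))]

img = [[0, 64, 128],
       [192, 255, 10]]
H = integral_histogram(img, bins=4)
print(region_histogram(H, 0, 0, 2, 3))  # whole image
print(region_histogram(H, 0, 1, 2, 3))  # right two columns
```

The table stores one full histogram per pixel position, which is the memory expansion the abstract refers to.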

VLSI16_IM3 Title: A New Binary-Halved Clustering Method and ERT Processor for ASSR System Abstract: This paper presents an automatic speech–speaker recognition (ASSR) system implemented in a chip which includes a built-in extraction, recognition, and training (ERT) core. For VLSI design (here, the ASSR system), hardware cost and time complexity are always important issues, and the proposed design improves both at two levels: 1) algorithm and 2) architecture. At the algorithm level, a new binary-halved clustering (BHC) method is proposed to achieve low time complexity and low memory requirement. At the architecture level, a new ERT core is proposed and implemented based on data dependence and a reuse mechanism to reduce the time and hardware cost as well. Finally, the chip implementation is synthesized, placed, and routed using the TSMC 90-nm technology library. To verify the performance of the proposed BHC method, a case study is performed based on nine speakers. Moreover, the ASSR system is validated in two parts: 1) speech recognition and 2) speaker recognition. The results show that the proposed system achieves recognition rates of 93.38% and 87.56% for speech and speaker recognition, respectively. Furthermore, the proposed ASSR chip has a gate count of 396k and consumes 8.74 mW of power. Such results demonstrate that the performance of the proposed ASSR system is superior to the conventional systems.

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD

VLSI31_IM4 Title: The VLSI Architecture of a Highly Efficient De-blocking Filter for HEVC Systems Abstract: This paper presents the VLSI architecture and hardware implementation of a highly efficient De-blocking Filter for High Efficiency Video Coding (HEVC) systems. In order to reduce the number of data accesses and thus to enhance the timing efficiency, novel data structures and memory access schemes for image pixels are proposed. Furthermore, a novel edge-fetching order is presented to strike a balance between the processing throughput and complexity. Based on the proposed structure and access pattern, a six-stage pipelined, two-line De-blocking Filter engine with low-latency data access sequence is designed, aiming to achieve high processing throughput while at the same time maintaining low complexity. The detailed storage structure and data access scheme are illustrated and VLSI architecture for the De-blocking Filter engine is depicted in this paper. In addition, the proposed De-blocking Filter is implemented using TSMC 90-nm standard cell library. Experimental results based on post-layout estimations show that the proposed design can achieve 60 frames per second for frame resolution of 4096×2048 pixels (Ultra HD resolution) assuming an operating frequency of 100 MHz. Moreover, this design occupies area complexity of 466.5 kGE with power consumption of 26.26 mW. In comparison with prior arts targeting on similar system specification and throughput, the proposed design results in a significantly reduced area complexity.

(Image)

Simulation Rs.10000

/ Hardware Rs.25000

+ S6 BOARD
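The throughput figures quoted for VLSI31_IM4 imply a per-cycle pixel rate that is easy to check. The short calculation below uses only the numbers stated in the abstract (4096×2048 pixels, 60 frames per second, 100 MHz); the function name is hypothetical, and the result is a raw luma-sample rate that ignores chroma and any pipeline stalls.

```python
def deblocking_throughput(width, height, fps, clock_hz):
    """Return (pixels per second, pixels per clock cycle, clock cycles per frame)."""
    pixels_per_second = width * height * fps
    return pixels_per_second, pixels_per_second / clock_hz, clock_hz / fps

pps, ppc, cpf = deblocking_throughput(4096, 2048, 60, 100_000_000)
print(pps)            # 503316480
print(round(ppc, 2))  # 5.03
```

In other words, sustaining the stated frame rate at 100 MHz means processing roughly five pixels per clock cycle on average.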

VLSI32_IM5 Title: Low-Power System for Detection of Symptomatic Patterns in Audio Biological Signals Abstract: In this paper, we present a low-power, efficacious, and scalable system for the detection of symptomatic patterns in biological audio signals. The digital audio recordings of various symptoms, such as cough, sneeze, and so on, are spectrally analyzed using a discrete wavelet transform. Subsequently, we use simple mathematical metrics, such as energy, quasi average, and coastline parameter for various wavelet coefficients of interest depending on the type of pattern to be detected. Furthermore, a mel frequency cepstrum-based analysis is applied to distinguish between signals, such as cough and sneeze, which have a similar frequency response and, hence, occur in common wavelet coefficients. Algorithm-circuit codesign methodology is utilized in order to optimize the system at algorithm and circuit levels of design abstraction. This helps in implementing a low-power system as well as maintaining the efficacy of detection. The system is scalable in terms of user specificity as well as the type of signal to be analyzed for an audio symptomatic pattern. We utilize multiplierless implementation circuit strategies and the algorithmic modification of mel cepstrum computation to implement a low-power system in the 65-nm bulk Si technology. It is observed that the pattern detection system achieves about 90% correct classification of five types of audio health symptoms. We also scale the supply voltage due to lower frequency of operation and report a total power consumption of ∼184 µW at 700 mV supply.

Simulation Rs.12000

/ Hardware Rs.30000

+ S6 BOARD
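The feature pipeline in VLSI32_IM5 (wavelet decomposition followed by simple metrics) can be sketched in a few lines. The abstract does not name the wavelet or define the metrics, so everything here is an assumption: a one-level unnormalized Haar step stands in for the discrete wavelet transform, energy is taken as the sum of squares, and the coastline parameter is taken as the sum of absolute differences between consecutive coefficients. The quasi average is omitted because the source gives no definition.

```python
def haar_step(signal):
    """One level of an unnormalized Haar DWT: (approximation, detail) coefficients."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in pairs]
    return approx, detail

def energy(coeffs):
    """Sum of squared coefficients (assumed definition)."""
    return sum(c * c for c in coeffs)

def coastline(coeffs):
    """Sum of absolute differences between consecutive coefficients (assumed definition)."""
    return sum(abs(b - a) for a, b in zip(coeffs, coeffs[1:]))

approx, detail = haar_step([4, 2, 6, 6])
print(approx, detail)
print(energy(detail), coastline([3.0, 6.0, 2.0]))
```

Both metrics need only additions, subtractions, and squaring, which fits the low-power intent described in the abstract.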

VLSI82_IM6 Title: Energy-Efficient Floating-Point MFCC Extraction Architecture for Speech Recognition Systems Abstract: This brief presents an energy-efficient architecture to extract mel-frequency cepstrum coefficients (MFCCs) for real-time speech recognition systems. Based on the algorithmic property of MFCC feature extraction, the architecture is designed with floating-point arithmetic units to cover a wide dynamic range with a small bit-width. Moreover, various operations required in the MFCC extraction are examined to optimize operational bit-width and lookup tables needed to compute nonlinear functions, such as trigonometric and logarithmic functions. In addition, the dataflow of MFCC extraction is tailored to minimize the computation time. As a result, the energy consumption is considerably reduced compared with previous MFCC extraction systems.

Simulation Rs.12000

/ Hardware Rs.30000

+ S6 BOARD

VLSI98_IM7 Title: Fixed-Point Computing Element Design for Transcendental Functions and Primary Operations in Speech Processing Abstract: This brief presents a fixed-point architecture based on a reconfigurable scheme for integrating several commonly used mathematical operations of speech signal processing. The proposed design can perform two transcendental mathematical operations called logarithm and powering, and three commonly used computations with similar operations named polynomial calculation, filtering, and windowing. By analyzing the adopted algorithms of the above five operations, a simplified computing unit is designed. This unit can combine six types of operations by reconfiguring the data paths, and the same multiply–add architecture can be reused for reducing the redundant usage of logic gates. The experimental results reveal that the proposed design can work at a 200-MHz clock rate, and its gate count is only 11.9k. Compared with the results of the floating-point function, the median errors of the proposed design for computing the powering and logarithmic functions are 0.57% and 0.11%, respectively. Such results indicate that this simple architecture can be effectively used in most speech processing applications.

Simulation Rs.12000

/ Hardware Rs.30000

+ S6 BOARD

NETWORKING

VLSI22_NW1 Title: In-Field Test for Permanent Faults in FIFO Buffers of NoC Routers Abstract: This brief proposes an on-line transparent test technique for detection of latent hard faults which develop in first-in first-out (FIFO) buffers of routers during field operation of a network-on-chip (NoC). The technique involves repeating tests periodically to prevent accumulation of faults. A prototype implementation of the proposed test algorithm has been integrated into the router-channel interface and on-line test has been performed with synthetic self-similar data traffic. The performance of the NoC after addition of the test circuit has been investigated in terms of throughput while the area overhead has been studied by synthesizing the test hardware. In addition, an on-line test technique for the routing logic has been proposed which considers utilizing the header flits of the data traffic movement in transporting the test patterns.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD

VLSI67_NW2 Title: FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it can reduce wire routing congestion, and has higher operating frequencies and better scalability for adding new nodes. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a VC709 Xilinx evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient and we improve the system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single hardware core implementation.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD

VLSI91_NW3 Title: Process Variation Delay and Congestion Aware Routing Algorithm for Asynchronous NoC Design Abstract: The effect of process variation (PV) on delay is a major cause of performance deterioration in advanced technologies. The performance of different routing algorithms is determined with and without PV for various traffic patterns. The saturation throughput and average message delay are used as performance metrics. PV decreases the saturation throughput and increases the average message delay. PV increases the average message delay by up to 90% and decreases the saturation throughput by up to 29% compared with nominal characteristics of different routing algorithms. Adaptive routing algorithms should therefore take PV into account. A novel PV delay and congestion aware routing (PDCR) algorithm is proposed for asynchronous network-on-chip design. PDCR is adaptive, low cost, and scalable. The novel routing algorithm outperforms different adaptive routing algorithms in the average delay and saturation throughput for various traffic patterns. PDCR achieves an average message delay up to 12%–32% lower than that of other routing algorithms. Moreover, the proposed scheme improves saturation throughput by up to 11%–82% compared with other adaptive routing algorithms.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD




VLSI76_NW4 Title: Argo: A Real-Time Network-on-Chip Architecture With an Efficient GALS Implementation Abstract: In this paper, we present an area-efficient, globally asynchronous, locally synchronous network-on-chip (NoC) architecture for a hard real-time multiprocessor platform. The NoC implements message-passing communication between processor cores. It uses statically scheduled time-division multiplexing (TDM) to control the communication over a structure of routers, links, and network interfaces (NIs) to offer real-time guarantees. The area-efficient design is a result of two contributions: 1) asynchronous routers combined with TDM scheduling and 2) a novel NI micro architecture. Together they result in a design in which data are transferred in a pipelined fashion, from the local memory of the sending core to the local memory of the receiving core, without any dynamic arbitration, buffering, and clock synchronization. The routers use two-phase bundled-data handshake latches based on the Mousetrap latch controller and are extended with a clock gating mechanism to reduce the energy consumption. The NIs integrate the direct memory access functionality and the TDM schedule, and use dual-ported local memories to avoid buffering, flow-control, and synchronization. To verify the design, we have implemented a 4×4 bitorus NoC in 65-nm CMOS technology and we present results on area, speed, and energy consumption for the router, NI, NoC, and post layout.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD

VLSI77_NW5 Title: Efficient Dynamic Virtual Channel Organization and Architecture for NoC Systems Abstract: A growing number of processing cores on a chip require an efficient and scalable communication structure such as network on chip (NoC). The channel buffer organization of NoC uses virtual channels (VCs) to improve data flow and performance of the NoC system. Dynamically allocated multiqueues (DAMQs) are an effective mechanism to achieve VC flow control with maximum buffer utilization. In this model, VCs employ a variable number of buffer slots depending on the traffic. Despite their performance merits, DAMQs have some limitations. We propose a new input-port micro architecture to support our efficient dynamic VC (EDVC) approach that is built on DAMQ buffers. To demonstrate the advantages of EDVC, we compare its micro architecture with that of the conventional dynamic VC (CDVC), which also employs link-list tables for buffer organization. In terms of hardware, the EDVC input-port organization consumes on average 61% less power for application-specific integrated circuit design when compared with the CDVC input port. The saving is even better when compared with VC regulator methodology. The EDVC approach can improve NoC latency by 48%–50% and throughput by 100% on average as compared with the CDVC mechanism.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD
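The buffer organization that VLSI77_NW5 builds on, where virtual channels share one pool of slots through linked lists, can be modelled in software as follows. This is a generic DAMQ-style sketch and not the EDVC micro architecture; the class name, method names, and the use of -1 as a null pointer are assumptions made for illustration.

```python
class SharedVCBuffer:
    """Slot pool shared by several virtual channels, with one linked list per VC."""

    def __init__(self, slots, num_vcs):
        self.data = [None] * slots
        self.next = list(range(1, slots)) + [-1]  # free list threaded through all slots
        self.free_head = 0
        self.head = [-1] * num_vcs
        self.tail = [-1] * num_vcs

    def push(self, vc, flit):
        """Store a flit for the given VC; returns False when no slot is free."""
        if self.free_head == -1:
            return False
        slot = self.free_head
        self.free_head = self.next[slot]
        self.data[slot], self.next[slot] = flit, -1
        if self.head[vc] == -1:
            self.head[vc] = slot
        else:
            self.next[self.tail[vc]] = slot
        self.tail[vc] = slot
        return True

    def pop(self, vc):
        """Remove and return the oldest flit of the given VC, or None if it is empty."""
        slot = self.head[vc]
        if slot == -1:
            return None
        flit = self.data[slot]
        self.head[vc] = self.next[slot]
        if self.head[vc] == -1:
            self.tail[vc] = -1
        self.next[slot] = self.free_head  # return the slot to the free list
        self.free_head = slot
        return flit

buf = SharedVCBuffer(slots=3, num_vcs=2)
buf.push(0, "a"); buf.push(0, "b"); buf.push(1, "c")
print(buf.push(1, "d"))  # pool is full
print(buf.pop(0))
```

Because slots are claimed on demand, a busy channel can use most of the pool while an idle one uses none, which is the variable slot allocation the abstract describes.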




VLSI103_NW6 Title: A New CDMA Encoding/Decoding Method for on-Chip Communication Network Abstract: As a high performance on-chip communication method, the code division multiple access (CDMA) technique has recently been applied to networks on chip (NoCs). We propose a new standard-basis based encoding/decoding method to leverage the performance and cost of CDMA NoCs in area, power consumption, and network throughput. In the transmitter module, source data from different senders are separately encoded with an orthogonal code of a standard basis and these coded data are mixed together by an XOR operation. Then, the sums of data can be transmitted to their destinations through the on-chip communication infrastructure. In the receiver module, a sequence of chips is retrieved by taking an AND operation between the sums of data and the corresponding orthogonal code. After a simple accumulation of these chips, original data can be reconstructed. We implement our encoding/decoding method and apply it to a CDMA NoC with a star topology. Compared with the state-of-the-art Walsh-code-based (WB) encoding/decoding technique, our method achieves up to 67.46% power saving and 81.24% area saving together with a 30%–50% decrease in encoding/decoding latency. Moreover, CDMA NoCs of different sizes applying our encoding/decoding method gain power saving, area saving, and maximal throughput improvement of up to 20.25%, 22.91%, and 103.26%, respectively, over the WB CDMA NoC.

Simulation

Rs.8000 /

Hardware Rs.20000

+ S6 BOARD
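The encode, mix, and decode steps described in VLSI103_NW6 can be traced with a toy model. This is one reading of the abstract rather than the paper's circuit: each sender is assumed to contribute a single bit, the standard-basis code of sender i is assumed to be the one-hot vector with a 1 in position i, and the function names are hypothetical.

```python
def one_hot(i, n):
    """Standard-basis code of length n with a single 1 at position i."""
    return [1 if j == i else 0 for j in range(n)]

def encode(bits):
    """Spread each sender's bit with its one-hot code, then XOR all spread words."""
    n = len(bits)
    mixed = [0] * n
    for i, bit in enumerate(bits):
        spread = [bit & c for c in one_hot(i, n)]
        mixed = [m ^ s for m, s in zip(mixed, spread)]
    return mixed

def decode(mixed, i):
    """AND the mixed word with sender i's code and accumulate the resulting chips."""
    chips = [m & c for m, c in zip(mixed, one_hot(i, len(mixed)))]
    return sum(chips)

sent = [1, 0, 1, 1]
mixed = encode(sent)
print(mixed)
print([decode(mixed, i) for i in range(len(sent))])
```

Since the one-hot codes never overlap, the XOR mix cannot cancel any sender's chip, and the AND in the receiver isolates exactly one position.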

VERIFICATION

VLSI24_VR1 Title: Source Code Error Detection in High-Level Synthesis Functional Verification Abstract: A dynamic functional verification method that compares untimed simulations versus timed simulations for synthesizable [high-level synthesis (HLS)] behavioral descriptions (ANSI-C) is presented in this paper. This paper proposes a method that automatically inserts a set of probes into the untimed behavioral description. These probes record the status of internal signals of the behavioral description during an initial untimed simulation. These simulation results are subsequently used as golden outputs for the verification of the internal signals during a timed simulation once the behavioral description has been synthesized using HLS. Our proposed method reports any simulation mismatches and accurately pinpoints any discrepancies between the functional Software (SW) simulation and the timed simulation at the original behavioral description (source code). Our method does not only determine where to place the probes, but is also able to insert different types of probes based on the specified HLS synthesis options in order not to interfere with the HLS process, minimizing the total number of probes and the size of the data to be stored in the trace file in order to minimize the running time. Results show that our proposed method is very effective and extremely simple to use as it is fully automated.

(Design

+ Verification)

Simulation Rs.10000

/ Hardware Rs.20000

+ S6 BOARD
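The probe-and-compare flow of VLSI24_VR1 can be mimicked in plain software. The sketch below is an analogy under stated assumptions, not the paper's tool: two Python functions stand in for the untimed description and the synthesized design, probes are inserted by hand rather than automatically, and the 8-bit truncation bug is invented for the example.

```python
def run_with_probes(func, inputs):
    """Run func on each input and collect the (probe name, value) records it reports."""
    trace = []

    def probe(name, value):
        trace.append((name, value))
        return value

    for x in inputs:
        func(x, probe)
    return trace

def first_mismatch(golden, observed):
    """Index and contents of the first differing probe record, or None if none differ."""
    for index, (g, o) in enumerate(zip(golden, observed)):
        if g != o:
            return index, g, o
    return None

def untimed_model(x, probe):
    s = probe("sum", x + 200)
    return probe("out", s * 2)

def timed_model(x, probe):
    s = probe("sum", (x + 200) & 0xFF)  # hypothetical bug: 8-bit truncation
    return probe("out", s * 2)

golden = run_with_probes(untimed_model, [10, 100])
observed = run_with_probes(timed_model, [10, 100])
print(first_mismatch(golden, observed))
```

The first mismatching record names the internal signal where the two runs diverge, which is the kind of source-level pinpointing the abstract describes.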

TANNER / MICROWIND – (AREA EFFICIENT)

VLSI41_TAE1 Title: A Single-Ended With Dynamic Feedback Control 8T Subthreshold SRAM Cell Abstract: A novel 8-transistor (8T) static random access memory cell with improved data stability in subthreshold operation is designed. The proposed single-ended with dynamic feedback control 8T static RAM (SRAM) cell enhances the static noise margin (SNM) for ultralow power supply. It achieves write SNM of 1.4× and 1.28× as that of isoarea 6T and read-decoupled 8T (RD-8T), respectively, at 300 mV. The standard deviation of write SNM for 8T cell is reduced to 0.4× and 0.56× as that for 6T and RD-8T, respectively. It also possesses another striking feature of high read SNM ∼2.33×, 1.23×, and 0.89× as that of 5T, 6T, and RD-8T, respectively. The cell has hold SNM of 1.43×, 1.23×, and 1.05× as that of 5T, 6T, and RD-8T, respectively. The write time is 71% less than that of single-ended asymmetrical 8T cell. The proposed 8T consumes less write power 0.72×, 0.6×, and 0.85× as that of 5T, 6T, and isoarea RD-8T, respectively. The read power is 0.49× of 5T, 0.48× of 6T, and 0.64× of RD-8T. The power/energy consumption of 1-kb 8T SRAM array during read and write operations is 0.43× and 0.34×, respectively, of 1-kb 6T array. These features enable ultralow power applications of 8T.

TANNER

Simulation Rs.5000

VLSI43_TAE2 Title: OTA-Based Logarithmic Circuit for Arbitrary Input Signal and Its Application Abstract: In this paper, a new design procedure has been proposed for realization of logarithmic function via three phases: 1) differentiation; 2) division; and 3) integration for any arbitrary analog signal. All the basic building blocks, i.e., differentiator, divider, and integrator, are realized by operational transconductance amplifier, a current mode device. Realization of exponential, power law and hyperbolic function as the design examples claims that the proposed synthesis procedure has the potential to design a log-based nonlinear system in a systematic and hierarchical manner. The performance of all the proposed circuits has been verified with SPICE simulation.

TANNER

Simulation Rs.10000
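The three-phase procedure in VLSI43_TAE2 rests on the identity d/dt ln x(t) = x'(t)/x(t), so integrating the quotient recovers the logarithm. The numerical check below follows the same three steps in software; it does not model the OTA circuits. The function name, the central-difference derivative, and the trapezoidal integrator are assumptions chosen for the illustration, and the input signal must stay positive.

```python
import math

def log_by_integration(x, t0, t1, steps=10000):
    """Approximate ln(x(t1) / x(t0)) as the integral of x'(t) / x(t), for x(t) > 0."""
    dt = (t1 - t0) / steps

    def ratio(t):
        deriv = (x(t + dt / 2) - x(t - dt / 2)) / dt  # 1) differentiation
        return deriv / x(t)                           # 2) division

    total = 0.5 * (ratio(t0) + ratio(t1))             # 3) integration (trapezoidal rule)
    for i in range(1, steps):
        total += ratio(t0 + i * dt)
    return total * dt

signal = lambda t: 1 + t * t
estimate = log_by_integration(signal, 0.0, 2.0)
print(estimate, math.log(signal(2.0) / signal(0.0)))
```

The two printed values agree to several decimal places, since the signal starts at 1 and its logarithm there is zero.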

VLSI44_TAE3 Title: A Robust Energy/Area-Efficient Forwarded-Clock Receiver With All-Digital Clock and Data Recovery in 28-nm CMOS for High-Density Interconnects Abstract: This paper presents a robust energy/area-efficient receiver fabricated in a 28-nm CMOS process. The receiver consists of eight data lanes plus one forwarded-clock lane supporting the hypertransport standard for high-density chip-to-chip links. The proposed all-digital clock and data recovery (ADCDR) circuit, which is well suited for today’s CMOS process scaling, enables the receiver to achieve low power and area consumption. The ADCDR can enter into open loop after lock-in to save power and avoid clock dithering phenomenon. Moreover, to compensate the open loop, a phase tracking procedure is proposed to enable the ADCDR to track the phase drift due to the voltage and temperature variations. Furthermore, the all-digital delay-locked loop circuit integrated in the ADCDR can generate accurate multiphase clocks with the proposed calibrated locking algorithm in the presence of process variations. The precise multiphase clocks are essential for the half-rate sampling and Alexander-type phase detecting. Measurement results show that the receiver can operate at a data rate of 6.4 Gbits/s with a bit error rate.

TANNER

Simulation Rs.8000

VLSI46_TAE4 Title: Full-Swing Local Bitline SRAM Architecture Based on the 22-nm FinFET Technology for Low-Voltage Operation Abstract: The previously proposed average-8T static random access memory (SRAM) has a competitive area and does not require a write-back scheme. In the case of average-8T SRAM architecture, a full-swing local bit line (BL) that is connected to the gate of the read buffer can be achieved with a boosted word line (WL) voltage. However, in the case of an average-8T SRAM based on an advanced technology, such as a 22-nm FinFET technology, where the variation in threshold voltage is large, the boosted WL voltage cannot be used, because it degrades the read stability of the SRAM. Thus, a full-swing local BL cannot be achieved, and the gate of the read buffer cannot be driven by the full supply voltage (VDD), resulting in a considerably large read delay. To overcome the above disadvantage, in this paper, a differential SRAM architecture with a full-swing local BL is proposed. In the proposed SRAM architecture, full swing of the local BL is ensured by the use of cross-coupled pMOSs, and the gate of the read buffer is driven by a full VDD, without the need for the boosted WL voltage. Various configurations of the proposed SRAM architecture, which stores multiple bits, are analyzed in terms of the minimum operating voltage and area per bit. The proposed SRAM that stores four bits in one block can achieve a minimum voltage of 0.42 V and a read delay that is 62.6 times shorter than that of the average-8T SRAM based on the 22-nm FinFET technology.

TANNER

Simulation Rs.8000

VLSI48_TAE5 Title: A 0.1–3.5-GHz Duty-Cycle Measurement and Correction Technique in 130-nm CMOS Abstract: A duty-cycle correction technique using a novel pulse width modification cell is demonstrated across a frequency range of 100 MHz–3.5 GHz. The technique works at frequencies where most digital techniques implemented in the same technology node fail. An alternative method of making time domain measurements such as duty cycle and rise/fall times from the frequency domain data is introduced. The data are obtained from the equipment that has significantly lower bandwidth than required for measurements in the time domain. An algorithm for the same has been developed and experimentally verified. The correction circuit is implemented in a 0.13-µm CMOS technology and occupies an area of 0.011 mm². It corrects to a residual error of less than 1%. The extent of correction is limited by the technology at higher frequencies.

TANNER

Simulation Rs.7000

VLSI51_TAE6 Title: A Systematic Design Methodology of Asynchronous SAR ADCs Abstract: Successive approximation register (SAR) analog-to-digital converters (ADCs) are widely used in biomedical and portable/wearable electronic systems due to their excellent power efficiency. However, both the design and the optimization of high-performance SAR ADCs are time consuming, even for well-experienced circuit designers. For system designers, it is also hard to quickly evaluate the feasibility of a given specification in a process node. This paper presents a systematic sizing procedure for asynchronous SAR ADCs based on design considerations. A sizing tool based on the proposed design procedure is also implemented, the sizing results of which are highly competitive in comparison with other state-of-the-art manual works. Moreover, the sizing time is relatively short due to the efficient and effective search algorithms employed. In addition to the simulation results, two silicon proofs with different specifications and process nodes are provided to demonstrate the feasibility of this design methodology.

TANNER

Simulation Rs.10000

VLSI52_TAE7 Title: Read Bitline Sensing and Fast Local Write-Back Techniques in Hierarchical Bitline Architecture for Ultralow-Voltage SRAMs Abstract: Voltage scalable decoupled SRAMs operating at a subthreshold region have various challenges, such as deteriorated read bitline (RBL) swing resulting in read sensing failure and degraded cell stability due to the half-select write. This paper proposes an equalized bitline scheme to eliminate the leakage dependence on data pattern and thus improves RBL sensing and its resilience against process, voltage, and temperature variations. In addition, we propose a fast local write-back (WB) technique to implement a half-select-free write operation. With hierarchical bitline architecture, it facilitates a local read and a subsequent fast WB action to secure the original data without performance degradation. A 16-kb SRAM test chip has been fabricated in a 65-nm CMOS technology and achieved the minimum operating voltage of 0.24 V with a read access time of 4.88 µs.

TANNER

Simulation Rs.10000

VLSI53_TAE8 Title: Online Measurement of Degradation Due to Bias Temperature Instability in SRAMs Abstract: A method is proposed to detect failing cells due to bias temperature instability (BTI) in Static Random Access Memories (SRAMs) en route to failure. If potentially failing cells are detected prior to failure, SRAMs can be operated without failures, since the detection of potentially failing cells can trigger re-configuration, given available memory redundancy. Using an experimentally verified BTI model, we study the performances of conventional 6T SRAM cells as a function of BTI degradation, in the presence of process variations. An on-chip monitoring scheme is presented that can be embedded within conventional SRAM designs without affecting normal device operation and with minimal overhead.

TANNER

Simulation Rs.8000

VLSI54_TAE9 Title: Incorporating Process Variations Into SRAM Electromigration Reliability Assessment Using Atomic Flux Divergence Abstract: Electromigration (EM) greatly affects the long-term reliability of VLSI chips. Not only power/ground lines but also bitlines of SRAM arrays may be damaged by EM. In this paper, we analyze current flow on SRAM bitline, demonstrate that it may suffer EM due to the pulsed dc pattern, and conclude that bitline’s EM reliability can dramatically be worsened by process variation due to a significant increase of subthreshold leakage current. We statistically model the effects of process variation that includes both transistor parameter fluctuation and interconnect line roughness, propose an atomic flux divergence-based current conversion scheme for applying Blech criterion, and develop a procedure for preventing EM failure by modifying the width of bitlines. Considering the effect of bitline width modification on cell stability and performance, we propose a tradeoff between functional and EM failures and indicate an optimal bitline width that maximizes the yield of SRAM arrays.

TANNER

Simulation Rs.8000

VLSI58_TAE10 Title: Integrated Floating-Gate Programming Environment for System-Level ICs Abstract: We present the first integrated system to handle heterogeneously used and programmed floating-gate (FG) elements in a single modular approach. We focus on IC design, integration, characterization, and algorithmic development of an integrated FG programming system for a large scale field-programmable analog array. We work through tunneling approaches to initialize the FG devices for precision programming, as well as hot-electron injection approaches for precise device programming.

TANNER

Simulation Rs.15000

VLSI107_TAE11 Title: PROCEED: A Pareto Optimization-Based Circuit-Level Evaluator for Emerging Devices Abstract: Evaluation of novel devices in the context of circuits is crucial to identifying and maximizing their value. We propose a new framework, Pareto optimization-based circuit-level evaluator for emerging device (PROCEED), that uses comprehensive performance, power, and area metrics for accurate device-circuit co-evaluation through optimization of digital circuit benchmarks. The PROCEED assesses technology suitability over a wide operating region (megahertz to gigahertz) by leveraging available circuit knobs (threshold voltage assignment, power management, sizing, and so on). It improves the benchmark accuracy by 3× to 115× compared with the existing methods while offering orders of magnitude improvements in runtime over full physical design implementation flows. To illustrate the PROCEED’s capabilities, we deploy it to assess emerging technologies, including novel tunneling field-effect transistors, compared with conventional silicon CMOS. As a further illustration, we extend PROCEED to evaluate future heterogeneous integration of varied devices onto the same silicon substrate.

TANNER

Simulation Rs.8000
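The Pareto optimization at the heart of VLSI107_TAE11 depends on the notion of a non-dominated design point. The sketch below shows that filtering step only; it is not the PROCEED framework. The function name, the two-metric (delay, power) tuples, and the sample values are assumptions for illustration, with lower taken as better in every metric.

```python
def pareto_front(points):
    """Keep the points that no other point dominates (lower is better in every metric)."""
    def dominates(p, q):
        no_worse = all(a <= b for a, b in zip(p, q))
        strictly_better = any(a < b for a, b in zip(p, q))
        return no_worse and strictly_better

    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (delay, power) results for four circuit configurations.
designs = [(1, 9), (2, 4), (3, 5), (4, 1)]
print(pareto_front(designs))
```

Here (3, 5) is dropped because (2, 4) is both faster and lower power, while the other three each represent a different delay-power tradeoff.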

VLSI108_TAE12 Title: Design of a CMOS System-on-Chip for Passive, Near-Field Ultrasonic Energy Harvesting and Back-Telemetry Abstract: Many packaging and structural materials are made of conductive materials such as metal or carbon-fiber composites, which limits the use of embedded radio frequency-based telemetry systems for sensing. In this paper, we present the design of a complete passive ultrasonic energy harvesting and back-telemetry system that exploits near-field acoustic coupling to wirelessly transfer energy and data across conductive barriers. The use of near-field operation makes the telemetry robust to multipath reflections that occur at barrier discontinuities and robust to crosstalk when multiple sensors are simultaneously interrogated. Underlying the proposed architecture is a system-on-chip (SoC) that integrates different ultrasonic energy harvesting and telemetry modules. The operation of the system has been verified using SoC prototypes fabricated in a 0.5-µm CMOS process which have been integrated with a piezoelectric transducer attached to an aerospace-grade aluminum substrate. Measured results show that the proposed near-field ultrasonic telemetry system can effectively operate across a 2-mm-thick metallic barrier at a frequency of 13.56 MHz with the SoC consuming 22.3 µW of power.

TANNER

Simulation Rs.12000

VLSI109_TAE13 Title: A Fast-Transient Wide-Voltage-Range Digital-Controlled Buck Converter With Cycle-Controlled DPWM Abstract: This paper presents a wide-voltage-range, fast-transient all-digital buck converter using a high-resolution digital pulse width modulator (DPWM). The converter employs the multithreshold-voltage band-control technique to shorten its transient response. The DPWM uses an all-digital delay-locked loop (ADDLL) to control its cycle. The use of the ADDLL gives the DPWM a small area while maintaining high cycle resolution. Moreover, the proposed ADDLL-based cycle-controlled DPWM can achieve synchronization between its input and output. This decreases the loop delay of the proposed converter so that the system is easy to stabilize. The prototype chips of both the ADDLL-based cycle-controlled DPWM and the all-digital buck converter are fabricated in a 0.35-µm CMOS process. Measurement results of the cycle-controlled DPWM show that the duty cycle of its output is adjustable from 1% to 99% in a 0.78% increment per step when operating at 1 MHz. The measured transition time of the all-digital buck converter is <3.5 µs when the load current changes from 50 to 500 mA, and vice versa.

TANNER

Simulation Rs.8000
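The duty-cycle figures quoted above can be turned into a small worked example. The sketch below assumes that the 0.78% step corresponds to 1/128 of the switching period (a 7-bit code). That is an inference from the number, not something the abstract states, and the function names are hypothetical.

```python
STEPS_PER_PERIOD = 128      # assumption: 0.78% is roughly 1/128 of a period
F_SW_HZ = 1_000_000         # 1 MHz operating point quoted in the abstract


def duty_code(target_duty: float) -> int:
    """Nearest step count for a target duty cycle, clamped to the 1%-99% range."""
    lowest = -(-STEPS_PER_PERIOD // 100)       # ceil(128 * 0.01) = 2
    highest = STEPS_PER_PERIOD * 99 // 100     # floor(128 * 0.99) = 126
    code = round(target_duty * STEPS_PER_PERIOD)
    return max(lowest, min(highest, code))


def on_time_ns(code: int) -> float:
    """On-time of the PWM output in nanoseconds for a given step count."""
    return code * 1e9 / (STEPS_PER_PERIOD * F_SW_HZ)


step_percent = 100 / STEPS_PER_PERIOD          # 0.78125 under the 1/128 assumption
print(duty_code(0.5), on_time_ns(duty_code(0.5)), step_percent)
```

Under this assumption one step is about 7.8 ns of on-time at 1 MHz, which gives a feel for the time resolution the delay line has to provide.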

VLSI110_TAE14 Title: Designing Tunable Subthreshold Logic Circuits Using Adaptive Feedback Equalization Abstract: Ultralow-power subthreshold logic circuits are becoming prominent in embedded applications with limited energy budgets. Minimum energy consumption of digital logic circuits can be obtained by operating in the subthreshold regime. However, in this regime process variations can result in up to an order of magnitude variations in ION/IOFF ratios leading to timing errors, which can have a destructive effect on the functionality of the subthreshold circuits. These timing errors become more frequent in scaled technology nodes where process variations are highly prevalent. Therefore, mechanisms to mitigate these timing errors while minimizing the energy consumption are required. In this paper, we propose a tunable adaptive feedback equalizer circuit that can be used with sequential digital logic to mitigate the process variation effects and reduce the dominant leakage energy component in the subthreshold digital logic circuits. We also present detailed energy-performance models of the adaptive feedback equalizer circuit. As part of the modeling approach, we also develop an analytical methodology to estimate the equivalent resistance of MOSFET devices in the subthreshold regime. For a 64-bit adder designed in 130 nm, our proposed approach can reduce the normalized variation of the critical path delay from 16.1% to 11.4% while reducing the energy-delay product by 25.83% at minimum energy supply voltage.

TANNER

Simulation Rs.8000

VLSI111_TAE15 Title: Dual-Calibration Technique for Improving Static Linearity of Thermometer DACs for I/O Abstract: In this paper, we propose a dual-calibration technique to improve the matching accuracy of digital-to-analog converter (DAC) elements and improve nonlinearity-induced static errors in a current-steering thermometer DAC. The novelty of the proposed dual-calibration scheme lies in obtaining best samples from the error distribution using redundancy for improved matching followed by adaptively reordering these samples to reduce error accumulation. This technique exploits the 2-D nature of the DAC to achieve lower calibration time. We consider the statistical basis for each of these methods and demonstrate statistical modeling of the proposed technique. We demonstrate a 38% reduction in differential nonlinearity (DNL) and 55% reduction in integral nonlinearity (INL) through simulations. We fabricated an 8-bit current-steering thermometer DAC in Taiwan Semiconductor Manufacturing Company 65-nm CMOS process. With only 2 redundant cells per row, we show an improvement of 36% in DNL and 50% in INL from the measurement of 16 chips over the baseline DAC.

TANNER

Simulation Rs.8000
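For readers unfamiliar with the DNL and INL figures quoted in the DAC abstract above, the sketch below computes both with the common endpoint-fit definitions. It does not implement the dual-calibration scheme itself. The unit-cell values and the function name are hypothetical.

```python
from typing import List, Tuple


def dnl_inl(levels: List[float]) -> Tuple[List[float], List[float]]:
    """Endpoint-fit DNL and INL, in LSB, from measured DAC output levels."""
    n_steps = len(levels) - 1
    lsb = (levels[-1] - levels[0]) / n_steps           # average step size
    dnl = [(levels[k + 1] - levels[k]) / lsb - 1.0 for k in range(n_steps)]
    inl = [(levels[k] - levels[0]) / lsb - k for k in range(len(levels))]
    return dnl, inl


# Hypothetical 2-bit thermometer DAC: three unit cells with small mismatch.
unit_cells = [1.0, 1.25, 0.75]
levels = [0.0]
for cell in unit_cells:
    levels.append(levels[-1] + cell)   # a thermometer code turns on one more cell per step

dnl, inl = dnl_inl(levels)
print(dnl, inl)
```

Because each thermometer step adds exactly one cell, DNL reflects the mismatch of individual cells while INL reflects how those errors accumulate along the code range, which is the accumulation the abstract aims to reduce by reordering.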




VLSI112_TAE16 Title: An Add-On Type Real-Time Jitter Tolerance Enhancer for Digital Communication Receivers Abstract: An add-on type real-time jitter tolerance enhancer (JTE) is presented in this paper. The proposed JTE can improve high-frequency jitter tolerance (JTOL) by using a real-time phase alignment scheme. A mathematical analysis for an advanced bit error rate (BER) prediction method is also introduced. The proposed circuit is applicable to various types of receivers, such as referenceless receivers, receivers with a reference clock source, and source-synchronous receivers. The referenceless receiver with the proposed JTE achieved an out-of-band JTOL of 0.71 UIpp at 100 MHz with <10⁻¹² BER. This is 196% higher than a conventional receiver without the JTE. The source-synchronous receiver with the proposed JTE achieved 0.92 UIpp at 300 MHz with <10⁻¹² BER. Total core areas of the receiver and JTE are 0.19 and 0.07 mm² in a 0.13-µm CMOS process, respectively. The power consumption of the receiver is 38 mW at 5.4 Gbit/s, and the JTE dissipates 22 mW.

TANNER

Simulation Rs.10000

VLSI113_TAE17 Title: SRAM-Based Unique Chip Identifier Techniques Abstract: Integrated circuit (IC) identification using unclonable digital fingerprints facilitates the authentication of ICs, device tracking, and cryptographic functions. In this paper, we present two hardware methods exploiting the inherent process-induced mismatch of SRAM cells. The proposed circuits improve upon those previously published by reducing the number of bits that vary from trial to trial, and can be used at times other than just IC power-up. The proposed circuits and methods are compared with the previous power-up approach using the experimental results from a 90-nm test chip. The required SRAM array periphery circuit changes allow the use of standard foundry SRAM cells and do not impact the memory access time.

TANNER

Simulation Rs.8000
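The SRAM identifier abstract above judges quality by the number of bits that vary from trial to trial. A minimal sketch of how that metric could be counted from repeated reads is given below. The 8-bit words and function names are hypothetical and are not drawn from the paper's test chip.

```python
from typing import List


def unstable_bits(reads: List[int], width: int) -> int:
    """Count bit positions that do not hold the same value in every read."""
    all_ones = (1 << width) - 1
    always_one = all_ones
    ever_one = 0
    for word in reads:
        always_one &= word      # stays 1 only where every read returned 1
        ever_one |= word        # becomes 1 where any read returned 1
    return bin(ever_one & ~always_one & all_ones).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two ID words differ."""
    return bin(a ^ b).count("1")


# Hypothetical 8-bit ID read three times from the same chip.
reads = [0b10110010, 0b10110110, 0b10100010]
print(unstable_bits(reads, 8), hamming_distance(reads[0], reads[1]))
```

A lower unstable-bit count means less error correction is needed before the ID can be used for authentication or key generation.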

TANNER /MICROWIND – (LOW POWER)

VLSI42_TLP1 Title: A Low-Power Robust Easily Cascaded PentaMTJ-Based Combinational and Sequential Circuits Abstract: Advanced computing systems embed spintronic devices to improve the leakage performance of conventional CMOS systems. High speed, low power, and infinite endurance are important properties of the magnetic tunnel junction (MTJ), a spintronic device, which assure its use in memories and logic circuits. This paper presents a PentaMTJ-based logic gate, which provides easy cascading, self-referencing, a reduced voltage headroom problem in the precharge sense amplifier, and low area overhead, in contrast to existing MTJ-based gates. PentaMTJ is used here because it provides guaranteed disturbance-free reading and increased tolerance to process variations along with compatibility with the CMOS process. The logic gate is validated by simulation at the 45-nm technology node using a Verilog-A model of the PentaMTJ.

TANNER

Simulation Rs.8000

VLSI45_TLP2 Title: Low-Power Variation-Tolerant Nonvolatile Lookup Table Design Abstract: Emerging nonvolatile memories (NVMs), such as MRAM, PRAM, and RRAM, have been widely investigated to replace SRAM as the configuration bits in field-programmable gate arrays (FPGAs) for high security and instant power ON. However, the variations inherent in NVMs and advanced logic processes bring reliability issues to FPGAs. This brief introduces a low-power variation-tolerant nonvolatile lookup table (nvLUT) circuit to overcome these reliability issues. Because of its large ROFF/RON, the 1T1R RRAM cell provides sufficient sense margin as a configuration bit and a reference resistor. A single-stage sense amplifier with voltage clamp is employed to reduce the power and area without impairing the reliability. A matched reference path is proposed to reduce the parasitic RC mismatch for reliable sensing. Evaluation shows that a 22% reduction in delay, a 38% reduction in power, and tolerance of variations of 2.5× typical RON or ROFF in reliability are achieved for the proposed nvLUT with six inputs.

TANNER

Simulation Rs.8000

VLSI49_TLP3 Title: Low-Energy Power-ON-Reset Circuit for Dual Supply SRAM Abstract: Design of a low-energy power-ON reset (POR) circuit is proposed to reduce the energy consumed by the stable supply of the dual supply static random access memory (SRAM), as the other supply is ramping up. The proposed POR circuit, when embedded inside dual supply SRAM, removes its ramp-up constraints related to voltage sequencing and pin states. The circuit consumes negligible energy during ramp-up, does not consume dynamic power during operations, and includes hysteresis to improve noise immunity against voltage fluctuations on the power supply. The POR circuit, designed in the 40-nm CMOS technology within 10.6-µm² area, enabled 27× reduction in the energy consumed by the SRAM array supply during periphery power-up in typical conditions.

TANNER

Simulation Rs.8000

VLSI50_TLP4 Title: Frequency-Boost Jitter Reduction for Voltage-Controlled Ring Oscillators Abstract: Ring oscillators (ROs) are popular due to their small area, modest power, wide tuning range, and ease of scaling with process technology. However, their use in many applications is limited due to poor phase noise and jitter performance. Thermal noise and flicker noise contribute jitter that decreases inversely with oscillation frequency. This paper describes a frequency boost technique to reduce jitter in ROs. We boost the internal oscillation frequency and introduce a frequency divider following the oscillator to maintain the desired output frequency. This approach offers reduced jitter as well as the opportunity to trade off output jitter with power for dynamic performance management. The oscillator has 32 operating modes, corresponding to different values for the ring size and frequency division. In a 0.5-µm CMOS process, the highest oscillation frequency achieved is 25 MHz with a root-mean-square period jitter of 54 ps and a power consumption of 817 µW at 5 V supply. A jitter model for current-starved oscillators was derived and verified by measurement; a direct relationship between oscillation frequency and jitter was derived and measured. Compared with other oscillators, this design achieves the highest performance in terms of jitter per unit interval and figure-of-merit. The performance is expected to improve in more advanced technologies. The results are summarized to offer design guidance based on the frequency boost technique.

TANNER

Simulation Rs.8000

VLSI47_TLP5 Title: High-Speed, Low-Power, and Highly Reliable Frequency Multiplier for DLL-Based Clock Generator Abstract: A high-speed, low-power, and highly reliable frequency multiplier is proposed for a delay-locked loop-based clock generator to generate a multiplied clock with a high frequency and wide frequency range. The proposed edge combiner achieves a high-speed and highly reliable operation using a hierarchical structure and an overlap canceller. In addition, by applying the logical effort to the pulse generator and multiplication-ratio control logic design, the proposed frequency multiplier minimizes the delay difference between positive- and negative-edge generation paths, which causes a deterministic jitter. Finally, a numerical analysis is performed to analyze and compare the performance of the proposed frequency multiplier with that of previous frequency multipliers. The proposed frequency multiplier is fabricated using a 0.13-µm CMOS process technology, and has the multiplication ratios of 1, 2, 4, 8, and 16, and an output range of 100 MHz–3.3 GHz. The frequency multiplier achieves a power consumption to a frequency ratio of 2.9 µW/MHz.

TANNER

Simulation Rs.10000

VLSI55_TLP6 Title: EMDBAM: A Low-Power Dual Bit Associative Memory With Match Error and Mask Control Abstract: A ternary content addressable memory (TCAM) speeds up the search process in the memory by searching through pre-stored contents rather than addresses. The additional don’t care (X) state makes the TCAM suitable for many network applications, but the large number of cells required for storage consumes high power and takes a large design area. This paper presents a novel architecture of TCAM, which pre-stores 2 bits of data in an up–down manner and provides multiple masking operations through a single-control multimasking circuit. The proposed dual bit associative memory with match error and mask control (EMDBAM) consumes low power and selects the valid value on the match line through a match error controller. The proposed design has been implemented using a standard 45-nm CMOS technology, and the extracted layout has been simulated using SPECTRE with the supply voltage at 1 V. The proposed EMDBAM can reduce the cell area by 39% compared with a basic TCAM design with a reduction of 9.6% in the energy delay product.

TANNER

Simulation Rs.10000
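The content-based search with a don't care state described in the TCAM abstract above can be sketched in software as follows. This models only generic ternary matching, not the EMDBAM cell or its masking circuit. The `(value, care_mask)` encoding and the table contents are assumptions for illustration, and real hardware compares all entries in parallel rather than in a loop.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, int]  # (value, care_mask); a 0 bit in care_mask means don't care (X)


def tcam_search(table: List[Entry], key: int) -> Optional[int]:
    """Return the index of the first entry that matches key, or None."""
    for index, (value, care_mask) in enumerate(table):
        if ((key ^ value) & care_mask) == 0:   # only the cared-for bits must agree
            return index
    return None


# Hypothetical 4-bit table: exact 1010, then 10XX, then XXXX as a catch-all.
table = [(0b1010, 0b1111), (0b1000, 0b1100), (0b0000, 0b0000)]
print(tcam_search(table, 0b1010), tcam_search(table, 0b1001), tcam_search(table, 0b0111))
```

Placing the most specific entries first and returning the first hit mirrors the priority ordering typically used when a TCAM holds network prefixes.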




VLSI56_TLP7 Title: A Single-Stage Low-Dropout Regulator With a Wide Dynamic Range for Generic Applications Abstract: Single-stage regulator topologies are often preferred in embedded applications due to their low power consumption with a single-pole behavior, resulting in easy frequency compensation. Since the achievable differential gain from a single stage is low, the dc load regulation is poor over a wide dynamic range. This paper presents a single-stage, adaptively biased, low-dropout regulator that achieves a dc load regulation comparable to that of multistage topologies. This is achieved mainly by modifying the adaptive bias loop, which amplifies both the common-mode and differential-mode signals. In addition, as the proposed regulator is stable for a wide range of output capacitors, including the capacitor-less (on-chip) and with-capacitor (off-chip) conditions, it is suitable for more generic applications. The proposed regulator is implemented in a standard 0.18-µm CMOS technology. The experimental results show that the regulator is capable of delivering up to 100 mA with a dc load regulation of 0.140 mV/mA and is stable with Co ≤ 3.3 nF (capacitor-less) and Co ≥ 1 µF (with-capacitor).

TANNER

Simulation Rs.10000

VLSI57_TLP8 Title: Glitch Energy Reduction and SFDR Enhancement Techniques for Low-Power Binary-Weighted Current-Steering DAC Abstract: This brief proposes a glitch reduction approach by dynamic capacitance compensation of binary-weighted current switches in a current-steering digital-to-analog converter (DAC). The method was successfully demonstrated on a 10-bit 400-MHz pure binary-weighted current-steering DAC with a minimum number of retiming latches. The experimental results show very low glitch energy during major carry transitions at the output, which is <1 pVs. This brief utilizes a layout structure to improve the spurious-free dynamic range at high signal frequencies. This chip was implemented in a standard 0.18-µm CMOS technology and consumes 20.7 mW at 400 MS/s.

TANNER

Simulation Rs.12000

VLSI59_TLP9 Title: Design of Silicon Photonic Interconnect ICs in 65-nm CMOS Technology Abstract: This paper describes a design methodology for CMOS silicon photonic interconnect ICs according to CMOS technology scaling. As the CMOS process is scaled, the endurable voltage stress and the intrinsic gain of the CMOS devices are reduced; therefore, a design of the high-swing transmitter and high-gain receiver required at the silicon photonic interface becomes much more challenging. In this paper, a triple-stacked Mach–Zehnder modulator driver and an inverter-based transimpedance amplifier with inductive feedback are proposed, and the robustness of the proposed designs is verified through Monte Carlo analyses. The prototype ICs are fabricated using a 65-nm CMOS technology. The transmitter exhibits a 6 Vpp output swing, 98-mW power consumption, and 0.04-mm² active area at 10 Gb/s. The receiver was verified with a commercial photodetector, and it exhibits a 78-dB gain, 25.3-mW power consumption, and 0.18-mm² active area at 20 Gb/s.

TANNER

Simulation Rs.12000

VLSI60_TLP10 Title: Test Escapes of Stuck-Open Faults Caused by Parasitic Capacitances and Leakage Currents Abstract: Intragate open defects are responsible for a significant percentage of defects in present technologies. A majority of these defects cause the logic gate to become stuck open, and this is why they are traditionally modeled as stuck-open faults (SOFs). The classical approach to detect the SOFs is based on a two-vector sequence, and has been proved effective for a wide range of technologies. However, factors typically neglected in past technologies have become a major concern in nanometer technologies, i.e., leakage currents and downstream parasitic capacitances. Some recent works have examined the influence of leakage currents. However, to the best of our knowledge, no one has considered the influence of downstream parasitic capacitances. In this paper, the influence of both factors is investigated and experimentally measured with a test chip built on a 65-nm technology. An analysis based on the electrical simulations is performed to quantify the number of test escapes in the presence of SOFs. Test recommendations are derived from the analysis results to maximize the detectability of these faults in present and future technologies.

TANNER

Simulation Rs.12000
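The classical two-vector test mentioned in the stuck-open fault abstract above can be illustrated with an idealized switch-level model of a 2-input CMOS NAND. The sketch assumes a floating output node holds its previous value perfectly, so it deliberately ignores the leakage currents and parasitic capacitances that the paper identifies as causes of test escapes. The class and function names are hypothetical.

```python
from typing import Optional


class Nand2:
    """Switch-level 2-input CMOS NAND whose output node holds charge when floating."""

    def __init__(self, open_pmos_a: bool = False) -> None:
        self.open_pmos_a = open_pmos_a    # True models a stuck-open pull-up on input A
        self.out: Optional[int] = None    # unknown until the node is first driven

    def apply(self, a: int, b: int) -> Optional[int]:
        pull_down = a == 1 and b == 1                             # series NMOS stack
        pull_up = (a == 0 and not self.open_pmos_a) or b == 0     # parallel PMOS pair
        if pull_down:
            self.out = 0
        elif pull_up:
            self.out = 1
        # otherwise the node floats and keeps its previous value
        return self.out


def two_vector_test(gate: Nand2) -> bool:
    """Return True if a stuck-open fault on PMOS A is detected."""
    gate.apply(1, 1)                # initialization vector: drive the output to 0
    return gate.apply(0, 1) != 1    # test vector: only PMOS A can pull the output up


print(two_vector_test(Nand2()), two_vector_test(Nand2(open_pmos_a=True)))
```

If the faulty gate is first driven high (for example with inputs 0, 0) instead of being initialized to 0, the test vector reads back the stored 1 and the fault goes unnoticed, which is why the initialization vector matters.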

VLSI106_TLP11 Title: Power Efficient Level Shifter for 16 nm FinFET Near Threshold Circuits Abstract: Since the minimum feature size has shrunk beyond the sub-30-nm node, power density has become the major factor in modern microprocessors. Techniques such as dynamic voltage scaling operating down to near threshold voltage levels and supporting multiple voltage domains have become necessary to reduce dynamic as well as static power. A key component of these techniques is a level shifter that serves different voltage domains. This level shifter must be high speed and power efficient. The proposed level shifter translates voltages ranging from 250 to 790 mV, and exhibits 42% shorter delay, 45% lower energy consumption, and 48% lower static power dissipation. In addition, the proposed level shifter exhibits symmetric rise and fall transition times with up to 12% skew at the extreme conditions over the maximum range of voltages.

TANNER

Simulation Rs.10000
