
Comparing FPGA vs. Custom CMOS and the Impact on Processor Microarchitecture

Henry Wong
Department of Electrical and Computer Engineering, University of Toronto
[email protected]

Vaughn Betz
Altera Corp.
[email protected]

Jonathan Rose
Department of Electrical and Computer Engineering, University of Toronto
[email protected]

ABSTRACT

As soft processors are increasingly used in diverse applications, there is a need to evolve their microarchitectures in a way that suits the FPGA implementation substrate. This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates. We then use the results of these comparisons to infer how the microarchitecture of soft processors on FPGAs should differ from that of hard processors on custom CMOS.

We find that the ratios of the area required by an FPGA to that of custom CMOS for different building blocks vary significantly more than the speed ratios. As area is often a key design constraint in FPGA circuits, area ratios have the most impact on microarchitecture choices. Complete processor cores have area ratios of 17-27× and delay ratios of 18-26×. Building blocks that have dedicated hardware support on FPGAs, such as SRAMs, adders, and multipliers, are particularly area-efficient (2-7× area ratio), while multiplexers and CAMs are particularly area-inefficient (>100× area ratio), leading to cheaper ALUs, larger caches of low associativity, and more expensive bypass networks than on similar hard processors. We also find that a low delay ratio for pipeline latches (12-19×) suggests soft processors should have pipeline depths 20% greater than hard processors of similar complexity.

Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles

General Terms
Design, Measurement

Keywords
FPGA, CMOS, Area, Delay, Soft Processor

1. INTRODUCTION

Custom CMOS silicon logic processes, standard cell ASICs, and FPGAs are commonly-used substrates for implementing digital circuits, with circuits implemented on each one having different characteristics such as delay, area, and power. In this paper, we compare custom CMOS and FPGA substrates for implementing processors and explore the microarchitecture trade-off space of soft processors in light of the differences.

FPGA soft processors have mostly employed single-issue in-order microarchitectures due to the limited area budget available on FPGAs [1-3]. We believe that future soft processor microarchitecture design decisions can benefit from a quantitative understanding of the differences between custom CMOS and FPGA substrates. Another reason to revisit the soft processor microarchitecture trade-off space is that the increased capacity of modern FPGAs can accommodate much larger soft processors, if the extra area brings a performance benefit.

Previous studies have measured the relative delay and area of FPGA, standard cell, and custom CMOS substrates as an average across a large set of benchmark circuits [4, 5]. While these earlier results are useful for estimating the size and speed of circuit designs that can be implemented on FPGAs, we need to compare the relative performance of various types of "building block" circuits in order to have enough detail to guide microarchitecture design decisions.

This paper makes two contributions:

• We compare delay and area between FPGA and custom CMOS implementations of a set of building block circuits. While prior work gave results for complete systems, we show specific strengths and weaknesses of the FPGA substrate for the different building blocks.

• Based on the delay and area ratios of building block circuits on FPGAs vs. custom CMOS, we discuss how processor microarchitecture design trade-offs change on an FPGA substrate, and the suitability of existing processor microarchitecture techniques when used on FPGAs.

We begin with background in Section 2 and methodology in Section 3. We then present our comparison results and their impact on microarchitecture in Sections 4 and 5, and conclude in Section 6.

2. BACKGROUND

2.1 Technology Impact on Microarchitecture

Studies of how process technology trends impact microarchitecture are essential for designing effective microarchitectures that take advantage of ever-changing manufacturing processes.


Issues currently facing CMOS technology include poor wire delay scaling, high power consumption, and, more recently, process variation. Microarchitectural techniques that respond to these challenges include clustered processor microarchitectures and chip multiprocessors [6, 7].

FPGA-implemented circuits face a very different set of constraints. Power consumption is not currently the dominant design constraint, due to lower clock speeds, while area is often the primary constraint due to the high area overheads of FPGA circuits. Characteristics such as inefficient multiplexers and the need to map RAM structures onto the FPGA's hard SRAM blocks are known, and are generally adjusted for by modifying circuit-level, but not microarchitecture-level, design [8-11].

2.2 Measurement of FPGAs

Kuon and Rose have measured the area, delay, and power overheads of FPGAs compared to a standard cell ASIC flow in 90 nm processes [4], using a benchmark set of complete circuits to measure the overall impact of using FPGAs compared to ASICs and the effect of FPGA hard blocks. They found that circuits implemented on FPGAs consumed 35× more area than on standard cell ASIC when they used no hard memory or multiplier blocks, falling to a low of 18× for those that used both block types. The minimum cycle time of the FPGA circuits was not significantly affected by hard blocks, ranging from 3.0 to 3.5× greater than ASIC. Chinnery and Keutzer [5] made similar comparisons between standard cell and custom CMOS and reported a delay ratio of 3 to 8×. Combined (3.0-3.5× multiplied by 3-8×), these reports suggest that the delay of circuits implemented on an FPGA would be 9 to 28× greater than on custom CMOS. However, data for full circuits are insufficiently detailed to guide microarchitecture-level decisions.

3. METHODOLOGY

We seek to measure the delay and area of FPGA building block circuits and compare them against their custom CMOS counterparts, resulting in area ratios and delay ratios. We define these ratios to be the area or delay of an FPGA circuit divided by the area or delay of the custom CMOS circuit; a higher ratio means the FPGA implementation has more overhead. We compare several complete processor cores and a set of building block circuits against their custom CMOS implementations, then observe which types of building block circuits have particularly high or low overhead on the FPGA.

As we do not have the expertise to implement highly-optimized custom CMOS circuits, our building block circuit comparisons use data from custom CMOS implementations found in the literature. The set of building block circuits is generic enough that sufficient published delay and area data exist. We focus mainly on custom designs built in 65 nm processes, because it is the most recent process for which design examples are readily available in the literature, and compare them to the Altera Stratix III 65 nm FPGA. In most cases, we implemented the equivalent FPGA circuits for comparison. Power consumption is not compared due to the scarcity of data in the literature and the difficulty of standardizing testing conditions such as test vectors, voltage, and temperature.

We normalize area measurements to a 65 nm process using an ideal scale factor of 0.5× area between process nodes. We normalize delay using published ring oscillator data, with the understanding that these reflect gate delay scaling more than interconnect scaling. Intel reports a 29% FO1 delay improvement between 90 nm and 65 nm, and a 23% FO2 delay improvement between 65 nm and 45 nm [12, 13]. The area and delay scaling factors used are summarized in Table 1.

         90 nm   65 nm   45 nm
Area     0.5     1.0     2.0
Delay    0.78    1.0     1.23

Table 1: Normalization Factors Between Processes

Resource          Relative Area   Tile Area
                  (Equiv. LABs)   (mm2)
LAB               1               0.0221
ALUT (half-ALM)   0.05            0.0011
M9K memory        2.87            0.0635
M144K memory      26.7            0.5897
DSP block         11.9            0.2623
Total core area   18 621          412

Table 2: Estimated FPGA Resource Area Usage
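As an illustration of how these factors are applied throughout Section 4, the following sketch (Python; the helper name is ours) converts a measurement taken at another process node into 65 nm-equivalent terms using the Table 1 multipliers:

    # Normalization factors from Table 1, used as multipliers into 65 nm terms.
    AREA_TO_65NM  = {"90nm": 0.5, "65nm": 1.0, "45nm": 2.0}    # ideal 0.5x per node
    DELAY_TO_65NM = {"90nm": 0.78, "65nm": 1.0, "45nm": 1.23}  # ring oscillator based

    def normalize_to_65nm(area_mm2, delay_ns, node):
        """Express an (area, delay) pair measured at `node` in 65 nm terms."""
        return area_mm2 * AREA_TO_65NM[node], delay_ns * DELAY_TO_65NM[node]

    print(normalize_to_65nm(0.10, 1.0, "90nm"))   # -> (0.05, 0.78)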

Delay is measured as the longest register-to-register path or input-to-output path in a circuit. In papers that describe CMOS circuits embedded in a larger unit (e.g., a shifter inside an ALU), we conservatively assume that the subcircuit has the same cycle time as the larger unit. In FPGA circuits, delay is measured using register-to-register paths, with the register delay subtracted out when comparing circuits that do not include a register (e.g., wire delay).

To measure FPGA resource usage, we implement circuits on the Stratix III FPGA and use the "logic utilization" metric reported by Quartus II to account for lookup tables (LUTs) and registers; we count partially-used memory and multiplier blocks as entirely used, since it is unlikely another part of the design can use them. Table 2 shows the areas of the Stratix III FPGA resources. The FPGA tile areas include routing area, so we do not track routing separately. The core of the largest Stratix III FPGA (EP3SL340) contains 13 500 LABs of 10 ALMs each, 1 040 M9K memories, 48 M144K memories, and 72 DSP blocks, for a total of 18 621 LAB-equivalent areas and 412 mm2 of core area.
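The totals quoted above follow from the per-resource areas; a quick arithmetic check (Python; numbers copied from Table 2 and the text):

    # Relative areas (equivalent LABs) and resource counts for the largest Stratix III.
    lab_equiv = {"LAB": 1.0, "M9K": 2.87, "M144K": 26.7, "DSP": 11.9}
    counts    = {"LAB": 13500, "M9K": 1040, "M144K": 48, "DSP": 72}
    LAB_AREA_MM2 = 0.0221

    total = sum(lab_equiv[r] * counts[r] for r in counts)
    print(round(total), round(total * LAB_AREA_MM2))
    # ~18 623 LAB equivalents and ~412 mm2; the small difference from the
    # quoted 18 621 comes from rounding of the per-tile areas.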

We implemented FPGA circuits using Altera Quartus II 10.0 SP1, targeting the fastest speed grade of the largest Stratix III. Circuits containing MLABs used Stratix IV due to hardware issues (1). Registers delimit the inputs and outputs of our circuits under test. We set timing constraints to maximize clock speed, reflecting the use of these circuits as part of a larger circuit in the FPGA core.

3.1 Additional Limitations

The trade-off space for a given circuit structure on custom CMOS is huge, and our comparison circuit examples may not all have the same optimization targets. Delay, area, power, and design effort can all be traded off, resulting in vastly different circuit performance. We assume that designs published in the literature are optimized primarily for delay with reasonable values for the other metrics, and we implement our FPGA equivalents with the same approach.

Another source of imprecision results from the differences in performance between chip manufacturers. For example, transistor drive strength impacts logic gate delay and is voltage- and process-dependent. At 100 nA/µm leakage current, Intel's 65 nm process [14] achieves 1.46 and 0.88 mA/µm at 1.2 V for NMOS and PMOS, respectively, and 1.1 and 0.66 mA/µm at 1.0 V, while TSMC's 65 nm process [15] achieves 1.01 and 0.48 mA/µm at 1.0 V. The choice of supply voltage results from both the power-performance trade-off and the transistor reliability the manufacturer can achieve.

(1) A Stratix III erratum halves the MLAB capacity to 320 bits; it is fixed in Stratix IV. See the Stratix III Device Family Errata Sheet.

Finally, we note that high-performance microprocessor designs tend to have generous power budgets that FPGAs do not enjoy, and this has an impact that is embedded in the area and speed comparisons.

4. FPGA VS. CUSTOM CMOS

4.1 Complete Processor Cores

We begin by comparing complete processor cores implemented on an FPGA vs. custom CMOS to provide context for the building block measurements. Table 3 compares the area and delay of four commercial processors that have both custom CMOS and FPGA implementations, and includes in-order, multithreaded, and complex out-of-order processors. The FPGA implementations are synthesized from the RTL code of the custom CMOS processors, with some FPGA-specific circuit-level optimizations. However, the FPGA-specific optimization effort is smaller than for the custom CMOS designs, which could inflate the area and delay ratios slightly. Overall, the processors have delay ratios of 18-26× and area ratios of 17-27×, with no obvious trend with processor complexity.

The OpenSPARC T1 and T2 cores are derived from the Sun UltraSPARC T1 and T2, respectively [21]. Both cores are in-order multithreaded designs that use the 64-bit SPARC V9 instruction set. The OpenSPARC T2 has the encryption unit from the UltraSPARC T2 removed, although we believe the change is small enough not to significantly skew our measurements.

The Intel Atom is a reduced-voltage, dual-issue, in-order 64-bit x86 processor with two-way multithreading. The FPGA synthesis by Wang et al. includes only the processor core without the L2 cache, and occupies 85% of the largest 65 nm Virtex-5 FPGA (XC5VLX330) [10]. Our FPGA area metric assumes the core area of the largest Virtex-5 is the same as that of the Stratix III, 412 mm2.

The Intel Nehalem is a high-performance, multiple-issue, out-of-order 64-bit x86 processor with two-way multithreading. The Nehalem's core area was estimated from a die photo. The FPGA synthesis by Schelle et al. does not include the per-core L2 cache, and occupies roughly 300% of the largest Virtex-5 FPGA [19]. Because the FPGA synthesis is split over multiple chips and runs at 520 kHz, it is not useful for estimating the delay ratio.

4.2 SRAM Blocks

SRAM blocks are commonly used in processors for building caches and register files. SRAM performance can be characterized by latency and throughput. Custom SRAM designs can trade off latency and throughput by pipelining, while FPGA designs are typically limited to the types of hard SRAM blocks that exist on the FPGA. SRAMs on the Stratix III FPGA can be implemented using two sizes of hard block SRAM, LUTRAM, or registers and LUTs. Their throughput and density are compared in Table 4 to five high-performance custom SRAMs in 65 nm processes.

Figure 1: SRAM Throughput

Figure 2: SRAM Density

The density and throughput of FPGA and custom SRAMs are plotted against memory size in Figures 1 and 2. The plots include data from CACTI 5.3, a CMOS memory performance and area model [27]. The throughput ratio between FPGA and custom memories is 7-10×, lower than the overall delay ratio of 18-26×, showing that SRAMs are relatively fast on FPGAs. It is surprising that this ratio is not lower still, given that FPGA SRAM blocks have little programmability. The FPGA data use 32-bit wide data ports that slightly underutilize the native 36-bit ports. The raw density of an FPGA SRAM block is listed in Table 4.

Below 9 kbit, the bit density of FPGA RAMs falls off nearly linearly with decreasing RAM size because M9Ks are underutilized and MLAB density is low. For larger arrays with good utilization, FPGA SRAM arrays have a density ratio of only 2-5× vs. single read-write port (1rw) custom CMOS (and CACTI) SRAMs, far below the full-processor area ratio of 17-27×.

As FPGA SRAMs use dual-ported (2rw) arrays, we also plotted CACTI's 2rw model for comparison. For arrays of similar size, the bit density of CACTI's 2rw models is 1.9× and 1.5× the raw bit density of fully-utilized M9K and M144K memory blocks, respectively. This suggests that half of the bit density gap between FPGA and custom CMOS in our single-ported test is due to FPGA memories paying the overhead of dual ports.

For register file use, where latency may be more important than memory density, custom processors have the option of trading throughput for area and power by using faster and larger storage cells. The 65 nm Pentium 4 register file trades 5× bit density for 9 GHz single-cycle performance [26]. FPGA RAMs lack this flexibility, and the delay ratio is even greater (18×) for this specific use.


Processor                   Custom CMOS      FPGA                          Ratios
                            fmax    Area     fmax    Area    Logic         fmax   Area
                            (MHz)   (mm2)    (MHz)   (mm2)   Utilization
OpenSPARC T1 (90 nm) [16]   1800    6.0      79      100     86 597 ALUT   23     17
OpenSPARC T2 (65 nm) [17]   1600    11.7     88      294     250 235 ALUT  18     25
Atom (45 nm) [10, 18]       >1300   12.8     50      350     85%           26     27
Nehalem (45 nm) [19, 20]    3000    51       -       1240    300%          -      24
Geometric Mean                                                             22     23

Table 3: Complete Processor Cores. Area and delay normalized to 65 nm.

Design                    Ports     Size     fmax             Area    Bit Density   Ratios
                                    (kbit)   (MHz)            (mm2)   (kbit/mm2)    fmax   Density
IBM 6T 65 nm [22]         2r or 1w  128      5600 (1.2 V)     0.276   464           9.5    2.1
Intel 6T 65 nm [23]       1rw       256      4200 (1.2 V)     0.3     853           7.6    3.9
Intel 6T 65 nm [14]       1rw       70 Mb    3430 (1.2 V)     110 [24]  820         -      -
IBM 8T 65 nm SOI [25]     1r1w      32       5300 (1.2 V)     -       -             9.0    -
Intel 65 nm Regfile [26]  1r1w      1        8800 (1.2 V)     0.017   59            18     3.3
Registers                 1r1w      -        -                -       0.77
MLAB                      1r1w      640 b    ~600             0.033   23
M9K                       1r1w      9        590              0.064   142
M144K                     1r1w      144      590              0.59    244

Table 4: Custom and FPGA SRAM Blocks

            CACTI 5.3             FPGA                   Ratios
Ports       fmax    Density       fmax    Density        fmax   Density
            (MHz)   (kbit/mm2)    (MHz)   (kbit/mm2)
2r1w        4430    109           501     15.7           9      7
4r2w        4200    35            320     0.25           13     143
6r3w        3800    17            286     0.20           13     89
8r4w        3970    9.8           269     0.15           15     65
10r5w       3740    6.3           266     0.10           14     61
12r6w       3520    4.5           249     0.090          14     51
14r7w       3330    3.4           237     0.080          14     43
16r8w       3160    2.7           224     0.068          14     39

Live Value Table (LVT)
4r2w                              420     1.50           10     23
8r4w                              345     0.41           12     24

Table 5: Multiported 2 kbit SRAM. LVT data from LaForest et al. [28].


4.3 Multiported SRAM Blocks

FPGA hard SRAM blocks can typically implement up to two read-write ports (2rw). Increasing the number of read ports can be achieved reasonably efficiently by replicating the memory blocks, but increasing the number of write ports cannot. A multi-write-port RAM can be implemented using registers for storage, but this is inefficient. A more efficient method uses hard RAM blocks for most of the storage, replicating the memory blocks for each write and read port and using a live value table (LVT) to indicate, for each word, which of the replicated memories holds the most recent copy [28]. We present data for multiported RAMs implemented using registers, LVT-based multiported memories [28], and CACTI 5.3 models of custom CMOS multiported RAMs. We focus on a 64×32-bit (2 kbit) memory array with twice as many read ports as write ports (2N read, N write) because it is similar to register files in processors. Table 5 shows the throughput and density comparisons.
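To make the LVT organization concrete, here is a minimal behavioral model (Python; a software sketch of the scheme from [28], not an HDL implementation, and the names are ours). It shows why added read ports are cheap (replication) while each added write port multiplies the number of banks and widens the register-based LVT:

    class LVTMultiportRAM:
        """Behavioral model of an LVT-based multiported RAM."""

        def __init__(self, depth, n_write, n_read):
            # banks[w][r]: replica written by write port w, read by read port r
            self.banks = [[[0] * depth for _ in range(n_read)]
                          for _ in range(n_write)]
            self.lvt = [0] * depth  # small, register-based; itself needs n_write write ports

        def write(self, port, addr, value):
            for replica in self.banks[port]:  # broadcast to this write port's replicas
                replica[addr] = value
            self.lvt[addr] = port             # remember which port wrote last

        def read(self, port, addr):
            # the LVT selects which write port's replica holds the live value
            return self.banks[self.lvt[addr]][port][addr]

    ram = LVTMultiportRAM(depth=64, n_write=2, n_read=4)
    ram.write(0, 5, 123); ram.write(1, 5, 456)
    assert ram.read(2, 5) == 456   # write port 1 wrote address 5 last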

The custom vs. FPGA bit density ratio is 7× for 2r1w, and increases to 23× and 143× for 4r2w LVT-based and register-based memories, respectively. The delay ratio is 9× for 2r1w, and increases to 10× and 13× for 4r2w LVT-based and register-based memories, respectively, a smaller impact than the area ratio increase.

4.4 Content-Addressable Memories

A content-addressable memory (CAM) is a logic circuit that allows associative searches of its stored contents. Custom CMOS CAMs are implemented as dense arrays using 9T to 11T cells, compared to 6T in SRAM, and are about 2-3× less dense than SRAMs. Ternary CAMs use two storage cells per "bit" to store three states (0, 1, and don't-care), a capability often used for longest-prefix searches in network packet routers. In processors, CAMs are used in tag arrays for high-associativity caches and TLBs. CAM-like structures are also used in out-of-order instruction schedulers. CAMs in processors require both frequent read and write capability, but not large capacities. Pagiamtzis and Sheikholeslami give a good overview of the CAM design space [29].

There are several methods of implementing CAM functionality on FPGAs [30]. CAMs implemented in soft logic use registers for storage and LUTs to read, write, and search the stored bits. Another proposal, which we call BRAM-CAM, stores one-hot encoded match-line values in block RAM to provide the functionality of a w×b-bit CAM using a 2^b×w-bit block RAM [31]. The soft logic CAM is the only design that provides one-cycle writes, while the BRAM-CAM offers improved bit density at the cost of two-cycle writes. We do not consider CAM implementations with even longer write times.
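A minimal behavioral model of the BRAM-CAM idea (Python; our own illustrative sketch of the organization described in [31], not the vendor implementation) makes the single-read search and the two-access write explicit:

    class BRAMCAM:
        """w-entry, b-bit CAM stored as a 2^b x w-bit RAM of one-hot match lines."""

        def __init__(self, w_entries, b_bits):
            self.ram = [0] * (1 << b_bits)     # one row per possible key value
            self.stored = [None] * w_entries   # shadow copy of each entry's key

        def search(self, key):
            return self.ram[key]               # one RAM read -> one-hot match vector

        def write(self, entry, key):
            old = self.stored[entry]
            if old is not None:
                self.ram[old] &= ~(1 << entry)  # access 1: clear the old match bit
            self.ram[key] |= (1 << entry)       # access 2: set the new match bit
            self.stored[entry] = key            # hence the two-cycle write

    cam = BRAMCAM(w_entries=64, b_bits=16)
    cam.write(3, 0xBEEF)
    assert cam.search(0xBEEF) == 1 << 3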

Table 6 shows a variety of custom CMOS and FPGA CAM designs. Figures 3 and 4 plot these designs, together with 8-bit and 128-bit soft logic CAMs of varying depth.


Design                 Size     Search Time   Bit Density   Ratios
                       (bits)   (ns)          (kbit/mm2)    Delay   Density

Ternary CAMs
IBM 64×72 [32]         4 608    0.6           -             5.0     -
IBM 64×240 [32]        15 360   2.2           167           1.7     205

Binary CAMs
POWER6 8×60 [33]       480      <0.2          -             14      -
Godson-3 64×64 [34]    4 096    0.55          76            5       99
Intel 64×128 [35]      8 192    0.25          167           14      209

FPGA Binary CAMs
Soft logic 64×72       4 608    3.0           0.82
Soft logic 64×240      15 360   3.8           0.81
Soft logic 8×60        480      2.1           0.83
Soft logic 64×64       4 096    2.9           0.77
Soft logic 64×128      8 192    3.4           0.80
MLAB-CAM 64×20         1 280    4.5           1.0
M9K-CAM 64×16          1 024    2.0           2.0

Table 6: CAM Designs

Figure 3: CAM Search Speed

Because of high power consumption, CAMs often have three competing design goals: delay, density, and energy per bit per search. CAMs can achieve delay comparable to SRAMs, but at a high cost in power. Both the POWER6 TLB and Intel's 64×128 BCAM achieve at least 4 GHz, but the latter uses 13 fJ/bit/search while IBM's 450 MHz 64×240 TCAM uses 1 fJ/bit/search. While Intel's BCAM and the Godson-3 TLB are both 64-entry full-custom designs, Intel achieves twice the bit density and half the cycle time at equal energy per bit per search, highlighting the impact of good circuit design, layout, and process technology.

As shown in Table 6, soft logic CAMs have poor bit density ratios of 100-210× vs. custom CAMs. Despite the low density, the delay ratio is only 14×. BRAM-CAMs built from M9Ks offer 2.4× better density than soft logic CAMs, but with halved write bandwidth. Despite their higher port-width/depth aspect ratio, MLAB BRAM-CAMs have worse bit density than M9K-CAMs because of control logic overhead. The halved write bandwidth of BRAM-CAMs makes them unsuitable for performance-critical uses, such as tag matching in instruction schedulers and L1 caches.

We observe that the bit density of soft logic CAMs is nearly the same as that of using registers to implement RAM (Table 4), suggesting that the area inefficiency comes from using registers for storage, not from the associative search itself.

Figure 4: CAM Bit Density

Figure 5: Multiplier Latency

4.5 Multipliers

Multiplication is an operation performed frequently in signal processing applications, but not used as often in processors. Multiplier blocks can also be used to inefficiently implement shifters and multiplexers [39].

Figure 5 shows the latency of multiplier circuits on custom CMOS and on the FPGA using hard DSP blocks. Latency is the product of the cycle time and the number of pipeline stages, and does not adjust for unbalanced pipeline stages or pipeline latch overheads. Table 7 shows details of the design examples. The two IBM multipliers have latency ratios comparable to full processor cores. Intel's 16-bit multiplier design has much lower latency ratios, as it appears to target low power instead of delay.

Figure 6: Multiplier Area


Design                    Size    Stages   Latency   Area    Ratios
                                           (ns)      (mm2)   Latency   Area
Intel 90 nm 1.3 V [36]    16×16   1        0.81      0.014   3.4       4.7
IBM 90 nm SOI 1.4 V [37]  54×54   4        0.41      0.062   22        7.0
IBM 90 nm SOI 1.3 V [38]  53×53   3        0.51      0.095   17        4.5
Stratix III               16×16   1        2.8       0.066
Stratix III               54×54   1        8.8       0.43

Table 7: Multiplier area and delay, normalized to a 65 nm process. Unpipelined latency is pipelined cycle time × stages.

Figure 7: Adder Delay

Because custom multiplier designs are often more deeply pipelined (3 and 4 stages) than the hard multipliers on FPGAs (2 stages), throughput ratios are higher than latency ratios.

The areas of the custom CMOS and FPGA multipliers are plotted in Figure 6. The area ratios of 4.5-7.0× are much lower than those for full processor cores.

4.6 Adders

Custom CMOS adder circuit designs can span the area-delay trade-off space from slow ripple-carry adders to fast parallel adders. On an FPGA, adders are usually implemented using hard carry chains that implement variations of the ripple-carry adder, although carry-select adders have also been used. Although parallel adders can be implemented with soft logic and routing, the lack of dedicated circuitry means parallel adders are bigger and slower than ripple-carry adders using the hard carry chains [44].

Figure 7 plots a comparison of adder delay, with details in Table 8.

Design                 Size    fmax              Area    Delay   Area
                       (bit)   (MHz)             (mm2)   Ratio   Ratio
Agah et al. [40]       32      12 000 (1.3 V)    -       20      -
Kao et al. 90 nm [41]  64      7 100 (1.3 V)     0.016   19      4.5
Pentium 4 [42]         32      9 000 (1.3 V)     -       16      -
IBM [43]               108     3 700 (1.0 V)     0.017   15      6.9
Stratix III            32      593               0.035
Stratix III            64      374               0.071
Stratix III            108     242               0.119

Table 8: Adders. Area and delay normalized to a 65 nm process.

Mux       FPGA                    Custom       Delay
Inputs    Area (mm2)  Delay (ps)  Delay (ps)   Ratio
2         0.0011      210         2.8          74
4         0.0011      260         4.9          53
8         0.0022      500         9.1          54
16        0.0055      680         18           37
32        0.0100      940         29           32
64        0.0232      1200        54           21

Table 9: Analytical model of passive tree multiplexers [45], normalized to a 65 nm process.

The Pentium 4 delay is conservative, as the delay given is for the full integer ALU. FPGA adders achieve delay ratios of 15-20× and a low area ratio of around 4.5-7×.

4.7 Multiplexers

Multiplexers are found in many circuits, yet we have found little literature that provides their area and delay in custom CMOS. Instead, we estimate the delays of small multiplexers using an RC analytical model and the delays of the Pentium 4 shifter unit and the Stratix III ALM. Our area ratio estimate comes from an indirect measurement using the ALM.

Table 9 shows a delay comparison between the FPGA and an analytical model of switch-based tree multiplexers [45]. This passive switch model is pessimistic for larger multiplexers, as active buffer elements can reduce delay. On an FPGA, small multiplexers can often be combined with other logic at minimal extra delay and area, so multiplexers measured in isolation are likely pessimistic as well. For small multiplexers, the delay ratio is high, roughly 40-75×. Larger multiplexers appear to have decreasing delay ratios, but we believe this is largely due to the unsuitability of passive designs at these sizes.
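For intuition about why passive designs degrade with size, the following toy estimate (Python) computes the Elmore delay of the selected path through an unbuffered pass-switch tree: log2(n) series switches, with each tree level adding node capacitance. This is far cruder than the model of [45], and the per-switch resistance and capacitance values are invented for illustration:

    from math import log2

    def tree_mux_elmore(n_inputs, r_sw=1e3, c_node=2e-15):
        """Elmore delay (s): node i of the path sees i series switch resistances."""
        levels = int(log2(n_inputs))
        return sum(i * r_sw * c_node for i in range(1, levels + 1))

    for n in (2, 4, 8, 16, 32, 64):
        print(f"{n:3d}-to-1: {tree_mux_elmore(n) * 1e12:5.1f} ps")
    # grows roughly quadratically in log2(n) until active buffers are inserted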

The 65 nm Pentium 4 integer shifter datapath [42] is dominated by small multiplexers (sizes 3, 4, and 8), but not in isolation. We implemented the datapath (Figure 8), excluding control logic, on the Stratix III, with results shown in Table 10. The delay ratio of 20× is smaller than suggested by the isolated multiplexer comparison, but may be optimistic if Intel omitted details from their shifter circuit diagram. Another delay ratio estimate can be made by examining the Stratix III Adaptive Logic Module (ALM) itself, as its delay consists mainly of multiplexers. Configuration RAM bits hold static values and do not affect delay. We implemented a circuit equivalent to an ALM as described in the Stratix III Handbook [46], comparing the delays of the FPGA implementation to the custom CMOS delays of the ALM given by the Quartus timing models. Internal LUTs are modeled as multiplexers. Each ALM input pin is modeled as a 21-to-1 multiplexer, as 21 to 30 is a reasonable size range according to Lewis et al. [47].

We examined one long path and one short path out of the many possible timing paths through the ALM. The long and short timing paths begin after the input multiplexers for pins datab0 and dataf0, respectively, both terminating at the LUT register. We placed these two paths into separate clock domains to ensure they were independently optimized. Table 10 shows delay ratios of 7.1× and 11.7× for the long and short paths, respectively. These delay ratios are lower than in the previous examples, due both to the lower power and area budgets of FPGAs, which prevent them from being as aggressively delay-optimized as custom processors, and to extra circuit complexity not shown in the Stratix III Handbook.


Circuit                     FPGA Delay (ps)   Custom Delay (ps)   Ratio
65 nm Pentium 4 Shifter     2260              111                 20
Stratix III ALM long path   2500              350                 7.1
Stratix III ALM short path  800               68                  11.7

Table 10: Delay of Multiplexer-Dominated Circuits

Figure 8: 65 nm Pentium 4 Integer ALU Shifter [42]. (Datapath stages: byte rotate by 0-7 bits; word rotate by 0/8/16/24 bits; set/zero/true/negate with fill value; adder result.)


We estimate the multiplexer area ratio indirectly by again implementing the equivalent circuit of an ALM on a Stratix III. We find that our equivalent ALM consumes 104 ALUTs, or roughly 52 ALMs. An area ratio of 52× is the ratio of the area of the FPGA equivalent circuit (52 ALMs) to the silicon area of the custom circuit (1 ALM equivalent). The real area ratio is substantially greater, as we implemented only the ALM's input and internal multiplexers and did not include global routing resources or configuration RAM. A rule of thumb is that half of an FPGA's core area is spent on the programmable global routing network, giving an area ratio estimate of 104×. We expect this to be even higher once configuration RAM and other ALM complexities are accounted for.

Groups of multiplexers have delay ratios below 20×, with small isolated multiplexers being worse (40-75×). However, multiplexers are particularly area-intensive, with an area ratio greater than 100×. Thus we find that the intuition that multiplexers are expensive is justified, especially from an area perspective.

4.8 Pipeline Latches

In synchronous circuits, the maximum clock speed of a circuit is typically limited by a register-to-register delay path from a pipeline latch (2), through a pipeline stage's combinational logic, to the next set of pipeline latches. The delay of a pipeline latch (setup and clock-to-output times) impacts the speed of a circuit and the clock speed improvement when increasing pipeline depth.

The "effective" cost of inserting an extra pipeline register into LUT-based pipeline logic is measured by observing the increase in delay as the number of LUTs between registers increases, then extrapolating the delay to zero LUTs. Table 11 shows estimates of the delay cost of a custom CMOS pipeline stage. The 180 nm Pentium 4 design assumed 90 ps of pipeline latch delay including clock skew [48], which we scaled according to the FO1 ring oscillator delays for Intel's processes (11 ps at 180 nm to 4.25 ps at 65 nm) [14].
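The extrapolation just described amounts to a linear fit of path delay against LUT depth; a sketch of the procedure (Python/NumPy; the data points below are invented, while the Stratix III entry in Table 11 comes from real measurements):

    import numpy as np

    # Hypothetical register-to-register delays vs. number of LUTs on the path.
    luts  = np.array([1, 2, 4, 8, 16])
    delay = np.array([900, 1350, 2250, 4050, 7650])  # ps

    per_lut, register_cost = np.polyfit(luts, delay, 1)
    print(f"{per_lut:.0f} ps per LUT; effective register cost {register_cost:.0f} ps")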

(2) Latch here refers to pipeline storage elements; these can be latches, flip-flops, or other implementations.

Design                  Register Delay (ps)     Delay Ratio
Sprangle et al. [48]    35 (90 ps in 180 nm)    12
Hartstein et al. [49]   32 (2.5 FO4)            14
Hrishikesh et al. [50]  23 (1.8 FO4)            19
Geometric Mean          29.5                    15
Stratix III             436                     -

Table 11: Pipeline Latch Delay

Hartstein et al. and Hrishikesh et al. present estimates expressed in FO4 delays, which we scaled using an estimated FO4 delay of 12.8 ps for Intel's 65 nm process.

The delay ratio for a pipeline latch is 12-19×. Although we do not have area comparisons, registers are considered to occupy very little FPGA area: most FPGA circuits use more LUTs than registers, yet FPGA logic elements include at least one register for every LUT.

4.9 Point-to-Point Routing

Interconnect delay comprises a significant portion of the total delay in both FPGAs and modern CMOS processes. Figure 9 plots point-to-point wire delays for three classes of CMOS interconnect (local, intermediate, and global), and FPGA routing delay both in absolute distance and normalized for decreased FPGA area efficiency.

We model each buffered segment as an RC delay, as illustrated in Figure 10 and Equation 1, where h is the buffer transistor size and Rtr is the linearized buffer resistance. Buffered wires are modeled as multiple identical segments, with the number of segments chosen to minimize delay; the minimum-delay buffer transistor size can be found by differentiating Equation 1 with respect to h. Equation 1 is modified from Bakoglu and Meindl to include transistor gate (Cg) and junction (Cd) lumped capacitances and to model 50%-to-50% delay instead of 0-to-90% delay [51].

T = Rtr·Cd + (Rtr/h + Rint/2)·Cint + (Rtr/h + Rint)·h·Cg    (1)

We use interconnect and transistor parameters from the International Technology Roadmap for Semiconductors (ITRS) 2007 report [52]. We model the minimum pitch for each class of wiring (local, intermediate, global). Buffers are assumed to be inverters with a 2:1 PMOS-to-NMOS transistor size ratio, and junction capacitance is approximated as half of gate capacitance. Gate and junction capacitances are modeled as lumped capacitances, and wires are modeled as distributed RC. Interconnect delays using ITRS 2007 data may be pessimistic, as Intel's 65 nm process reports lower delays using larger wire pitch and thickness [14].
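The delay model is easy to transcribe; a sketch (Python) of Equation 1 and the minimum-delay buffer size obtained by setting dT/dh = 0 (the parameter values below are placeholders, not the ITRS numbers used for Figure 9):

    from math import sqrt

    def segment_delay(h, Rtr, Cd, Cg, Rint, Cint):
        """Equation 1: 50%-to-50% delay of one buffered wire segment."""
        return Rtr * Cd + (Rtr / h + Rint / 2) * Cint + (Rtr / h + Rint) * h * Cg

    def optimal_buffer_size(Rtr, Cg, Rint, Cint):
        """dT/dh = 0 gives h = sqrt(Rtr * Cint / (Rint * Cg))."""
        return sqrt((Rtr * Cint) / (Rint * Cg))

    # Placeholder values (ohms, farads) purely to exercise the formulas:
    Rtr, Cd, Cg, Rint, Cint = 5e3, 1e-15, 2e-15, 2e3, 20e-15
    h = optimal_buffer_size(Rtr, Cg, Rint, Cint)           # -> 5.0
    print(h, segment_delay(h, Rtr, Cd, Cg, Rint, Cint))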

The point-to-point delay on the FPGA is measured as the delay between two manually-placed registers, with the delay of the register itself subtracted out; it includes the overhead of routing programmability. Assuming LABs have an aspect ratio (the vertical/horizontal ratio of delay for each LAB) of 1.6 gives a good fit of delay vs. Manhattan distance. The physical distance is calculated from LAB coordinates using the aspect ratio and area of a LAB (Table 2).

In Figure 9, the delay of the CMOS load and driving buffers (one FO1) dominates the delay for CMOS wire lengths under 20 µm. FPGA short local wires (100 µm) have a delay ratio around 9×. Long global wire delay is quite close (2×) to custom CMOS for the same length of wire.


Figure 9: Point-to-Point Routing Delay

Figure 10: RC Model of Wire Delay (driver: Rtr and Cd; interconnect: distributed Rint and Cint; load: Cg)

Routing delays are more meaningful when "distance" is normalized to the amount of "logic" that can be reached. To approximate this metric, we adjust the FPGA routing distance by the square root of the FPGA's overall area overhead vs. custom CMOS (√23 ≈ 4.8×). Short local FPGA wires (100 µm) then have a logic density-normalized delay ratio of 20×, while long global wires (7 500 µm) have a delay ratio of only 9×. The short wire delay ratio is comparable to the overall delay overhead for full processors, but the long wire delay ratio is half that, suggesting that FPGAs are less affected by long wire delays than custom CMOS.

4.10 Off-Chip Memory

Table 12 gives a brief overview of off-chip DRAM latency and bandwidth as commonly used in processor systems. Random read latency is measured on Intel DDR2 and DDR3 systems with off-chip (65 ns) and on-die (55 ns) memory controllers. FPGA memory latency is the sum of the memory controller latency and the closed-page DRAM access time. While these estimates do not account for real access patterns, they are enough to show that off-chip latency and throughput ratios between custom CMOS and FPGA are far lower than those for in-core circuits like complete processors.

4.11 Summary of Building Block Circuits

A summary of our estimates of the FPGA vs. custom CMOS delay and area ratios can be found in Table 13. The range of delay ratios (8-75×) is smaller than the range of area ratios (2-210×). Multiplexers have the highest delay ratios, and hard blocks have only a small impact on delay ratios. Hard blocks are particularly area-efficient (SRAM, adders, multipliers), while multiplexers and CAMs are particularly area-inefficient.

In previous work, Kuon and Rose reported an average 3.0-3.5× delay ratio and 18-35× area ratio for FPGA vs. standard cell ASIC over a set of complete circuits [4]. Although we expect both ratios to be higher when comparing FPGA against custom CMOS, our processor core delay ratios are higher but our area ratios are slightly lower, likely because custom processors are optimized more for delay at the expense of area than typical standard cell circuits are.

                      Custom CMOS   FPGA [53]   Ratio
DDR2 Frequency (MHz)  533           400         1.3
DDR3 Frequency (MHz)  800           533         1.5
Read Latency (ns)     55-65         100-133     2

Table 12: Off-Chip DRAM Latency and Throughput. Latency assumes closed-page random accesses.

Design                 Delay Ratio   Area Ratio
Processor Cores        18 - 26       17 - 27
SRAM 1rw               7 - 10        2 - 5
SRAM 4r2w LUTs / LVT   13 / 10       143 / 23
CAM                    14            100 - 210
Multiplier             17 - 22       4.5 - 7.0
Adder                  15 - 20       4.5 - 7.0
Multiplexer            20 - 75       > 100
Pipeline latch         12 - 19       -
Routing                9 - 20        -
Off-Chip Memory        1.3 - 2       -

Table 13: Delay and Area Ratio Summary


5. IMPACT ON MICROARCHITECTURE

The area and delay differences between circuit types and substrates measured in Section 4 affect the microarchitectural design of circuits targeting custom CMOS vs. those targeting FPGAs. As FPGA designs often have logic utilization (area) as a primary design constraint, we expect area ratios to have more impact on microarchitecture than delay ratios.

5.1 Pipeline Depth

Pipeline depth is one of the fundamental choices in the design of a processor microarchitecture. Increasing pipeline depth results in higher clock speeds, but with diminishing returns due to pipeline latch delays. The analysis by Hartstein et al. shows that the optimal processor pipeline depth for performance is proportional to √(tp/to), where tp is the total logic delay of the processor pipeline and to is the delay overhead of a pipeline latch [49].

Section 4.8 showed that the delay ratio of registers (i.e., to on FPGA vs. to on custom CMOS, ~15×) is lower than that of a complete processor (approximately tp on FPGA vs. tp on custom CMOS, ~22×), increasing tp/to on the FPGA. The change in tp/to is roughly 22/15, so the optimal depth scales by √(22/15) ≈ 1.2, suggesting soft processors should have pipeline depths roughly 20% greater than an equivalent microarchitecture implemented in custom CMOS. Today's soft processors prefer short pipelines [54] because soft processors are simple and have low tp, not because of a property of the FPGA substrate.
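The 20% figure follows directly from the two measured ratios; a one-line check (Python):

    from math import sqrt

    tp_ratio = 22   # processor logic delay ratio (geometric mean, Table 3)
    to_ratio = 15   # pipeline latch delay ratio (geometric mean, Table 11)

    # optimal depth scales with sqrt(tp/to), so the FPGA optimum shifts by:
    print(sqrt(tp_ratio / to_ratio))   # ~1.21, i.e., ~20% deeper pipelines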

5.2 Partitioning of Structures

The portion of a chip that can be reached in a single clock cycle is decreasing with each newer process, while transistor switching speeds continue to improve. This leads to microarchitectures that partition large structures into smaller ones, and that partition the design into clusters or multiple cores to avoid global communication [7].

In Section 4.9, we observed that, after adjusting for the reduced logic density of FPGAs, long wires have a delay ratio roughly half that of a processor core. The relatively faster long wires lessen the impact of global communication, reducing the need for aggressive partitioning of designs on FPGAs. In practice, FPGA processors also have less logic complexity than high-performance custom processors, further reducing the need to partition.

5.3 ALUs and Bypassing

Multiplexers consume much more area on FPGAs than on custom CMOS (Section 4.7), making the bypass networks that shuffle operands between functional units more expensive on FPGAs. The functional units themselves are often composed of adders and multipliers and have a lower, 4.5-7×, area ratio. The high cost of multiplexers also reduces the area benefit of using multiplexers to share these functional units.

There are processor microarchitecture techniques that reduce the size of operand-shuffling networks relative to the number of ALUs. "Fused" ALUs that perform two or more dependent operations at a time increase the amount of computation relative to operand shuffling; examples include the common fused multiply-accumulate unit and interlock-collapsing ALUs [55, 56]. Other proposals cluster instructions together to reduce the communication of operand values to instructions outside the group [57, 58]. These techniques may benefit soft processors even more than hard processors.

5.4 Cache Organization

Set-associative caches have two common implementation styles: low-associativity caches replicate the cache tag RAM and access the copies in parallel, while high-associativity caches store tags in CAMs. High-associativity caches are more expensive on FPGAs because of the high area cost of CAMs (100-210× bit density ratio). In addition, custom CMOS caches built from tag CAM and data RAM blocks can have the CAM's decoded match lines directly drive the RAM's word lines, while an FPGA CAM must produce encoded outputs that are then decoded by the SRAM, adding a redundant encode-decode operation that was not included in the circuits in Section 4.4. In comparison, custom CMOS CAMs have minimal delay overhead and only 2-3× area overhead compared to RAM, allowing custom high-associativity caches to have an amortized area overhead of around 10%, with minimal change in delay compared to set-associative caches [59].
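A behavioral sketch of the replicated-tag-RAM style described above (Python; our own illustration, with the hardware's parallel reads expressed as a loop):

    class SetAssociativeTags:
        """n-way set-associative tag store: one tag RAM per way (cheap block
        RAM on an FPGA), all indexed by the same set bits."""

        def __init__(self, n_sets, n_ways):
            self.ways = [[None] * n_sets for _ in range(n_ways)]

        def lookup(self, set_index, tag):
            # In hardware, every way's tag RAM is read in parallel and compared;
            # the matching way number drives the data array's output multiplexer.
            for way, tag_ram in enumerate(self.ways):
                if tag_ram[set_index] == tag:
                    return way
            return None  # miss

A CAM-based high-associativity cache would instead search all stored tags associatively, which is cheap in custom CMOS but, per Section 4.4, costly in FPGA soft logic.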

CAM-based high-associativity caches are therefore not area-efficient in FPGA soft processors, and soft processor caches should have lower associativity than those of similar hard processors. Soft processor caches should also have higher capacity than those of similar hard processors, because of the area efficiency of FPGA SRAMs (2-5× density ratio).

5.5 Memory System Design

The lower area cost of block RAM encourages the use of larger caches, reducing cache miss rates and lowering the demand for off-chip bandwidth to DRAM main memory. The lower clock speeds of FPGA circuits further reduce off-chip bandwidth demand. Meanwhile, the latency and bandwidth of off-chip memory are only slightly worse on FPGAs than on custom CMOS processors.

Low off-chip memory system demands suggest that more resources should be dedicated to improving the performance of the processor core than to improving memory bandwidth or tolerating latency. Techniques used to improve the memory system in hard processors include DRAM access scheduling, non-blocking caches, prefetching, memory dependence speculation, and out-of-order memory accesses.

5.6 Out-of-Order Microarchitecture

Section 4.1 suggests that processor complexity does not have a strong correlation with area ratio, so out-of-order microarchitectures seem to be a reasonable method for improving soft processor performance. There are several styles of microarchitecture commonly used to implement precise interrupt support in pipelined or out-of-order processors, and many variations are used in modern processors [60, 61]. The main variations between the microarchitecture styles concern the organization of the reorder buffer, register renaming logic, register file, and instruction scheduler, and whether each component uses a RAM- or CAM-based implementation.

FPGA RAMs have particularly low area cost (Section 4.2), but CAMs are area-expensive (Section 4.4). These characteristics favour microarchitecture styles that minimize the use of CAM-like structures. Reorder buffers, register renaming logic, and register files are commonly implemented without CAMs. There are also CAM-free instruction scheduler techniques that are not widely implemented [6, 62], but which may become more favourable in soft processors.

If a traditional CAM-based scheduler is used in a soft processor, its capacity will tend to be smaller than on hard processors due to area, but the delay ratio of CAMs (15×) is not particularly poor. Scheduler area can be reduced by reducing the number of entries or the amount of storage required per entry. Schedulers can be data-capturing, where operand values are captured and stored in the scheduler, or non-data-capturing, where the scheduler tracks only the availability of operands, with values fetched from the register file or bypass networks when an instruction finally issues. Non-data-capturing schedulers reduce the amount of data that must be stored in each scheduler entry.
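The per-entry storage difference between the two styles can be seen by listing the fields each entry must hold (Python; field widths are illustrative, not taken from any design discussed above):

    from dataclasses import dataclass

    @dataclass
    class DataCapturingEntry:         # operand values live in the scheduler
        opcode: int
        src1_value: int               # e.g., a captured 64-bit operand
        src1_ready: bool
        src2_value: int               # another 64 bits of storage per entry
        src2_ready: bool

    @dataclass
    class NonDataCapturingEntry:      # only operand availability is tracked
        opcode: int
        src1_tag: int                 # physical register number, ~8 bits
        src1_ready: bool
        src2_tag: int                 # values are read from the register file
        src2_ready: bool              # or bypass network at issue time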

On FPGAs, block RAMs come in a limited selection of sizes, the smallest commonly being 4.5 kbit to 18 kbit. Reorder buffers and register files usually need even less capacity and are instead limited by port width or port count, so processors on FPGAs can have larger-capacity ROBs, register files, and other port-limited RAM structures at little extra cost. In contrast, expensive CAMs limit soft processors to small scheduling windows (instruction scheduler sizes). Microarchitectures that address this particular problem, providing large instruction windows with small scheduling windows, may be useful in soft processors [63].

6. CONCLUSIONS

We have presented area and delay comparisons of processors and their building block circuits implemented on custom CMOS and FPGA substrates. In 65 nm processes, we find that FPGA implementations of processor cores have 18-26× greater delay and 17-27× greater area usage than the same processors implemented in custom CMOS. We find that the FPGA vs. custom CMOS delay ratios of most processor building block circuits fall within the delay ratio range for complete processor cores, but that area ratios vary much more. Building block circuits such as adders and SRAMs that have dedicated hardware support on FPGAs are particularly area-efficient, while multiplexers and CAMs are particularly area-inefficient.

The measurements' precision is limited by the availability of custom CMOS design examples in the literature. The measurements can also change as both CMOS technology and FPGA architectures evolve, and may then lead to different design choices.

We have discussed how our measurements impact microarchitecture design choices: differences in the FPGA substrate encourage soft processors to have larger, lower-associativity caches, deeper pipelines, and fewer bypass networks than similar hard processors. Also, out-of-order execution is a valid design option for soft processors, although scheduling windows should be kept small.

7. REFERENCES

[1] Altera. Nios II processor. http://www.altera.com/products/ip/processors/nios2/.
[2] Xilinx. MicroBlaze soft processor. http://www.xilinx.com/tools/microblaze.htm.
[3] ARM. Cortex-M1 processor. http://www.arm.com/products/processors/cortex-m/cortex-m1.php.
[4] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 26(2), Feb. 2007.
[5] D. Chinnery and K. Keutzer. Closing the Gap Between ASIC & Custom: Tools and Techniques for High-Performance ASIC Design. Kluwer Academic Publishers, 2002.
[6] S. Palacharla et al. Complexity-Effective Superscalar Processors. SIGARCH Comp. Arch. News, 25(2), 1997.
[7] V. Agarwal et al. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. SIGARCH Comp. Arch. News, 28(2), 2000.
[8] P. Metzgen and D. Nancekievill. Multiplexer Restructuring for FPGA Implementation Cost Reduction. In Proc. DAC, 2005.
[9] P. Metzgen. A High Performance 32-bit ALU for Programmable Logic. In Proc. FPGA, 2004.
[10] P. H. Wang et al. Intel Atom Processor Core Made FPGA-Synthesizable. In Proc. FPGA, 2009.
[11] S.-L. Lu et al. An FPGA-based Pentium in a Complete Desktop System. In Proc. FPGA, 2007.
[12] S. Tyagi et al. An Advanced Low Power, High Performance, Strained Channel 65nm Technology. In Proc. IEDM, 2005.
[13] K. Mistry et al. A 45nm Logic Technology with High-k+Metal Gate Transistors, Strained Silicon, 9 Cu Interconnect Layers, 193nm Dry Patterning, and 100% Pb-free Packaging. In Proc. IEDM, 2007.
[14] P. Bai et al. A 65nm Logic Technology Featuring 35nm Gate Lengths, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57 µm2 SRAM Cell. In Proc. IEDM, 2004.
[15] S. K. H. Fung et al. 65nm CMOS High Speed, General Purpose and Low Power Transistor Technology for High Volume Foundry Application. In Proc. VLSI, 2004.
[16] A. S. Leon et al. A Power-Efficient High-Throughput 32-Thread SPARC Processor. IEEE Journal of Solid-State Circuits, 42(1), 2007.
[17] U. G. Nawathe et al. Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip. IEEE Journal of Solid-State Circuits, 43(1), 2008.
[18] G. Gerosa et al. A Sub-2 W Low Power IA Processor for Mobile Internet Devices in 45 nm High-k Metal Gate CMOS. IEEE Journal of Solid-State Circuits, 44(1), 2009.
[19] G. Schelle et al. Intel Nehalem Processor Core Made FPGA Synthesizable. In Proc. FPGA, 2010.
[20] R. Kumar and G. Hinton. A Family of 45nm IA Processors. In Proc. ISSCC, 2009.
[21] Sun Microsystems. OpenSPARC. http://www.opensparc.net/.
[22] J. Davis et al. A 5.6GHz 64kB Dual-Read Data Cache for the POWER6 Processor. In Proc. ISSCC, 2006.
[23] M. Khellah et al. A 4.2GHz 0.3mm2 256kb Dual-Vcc SRAM Building Block in 65nm CMOS. In Proc. ISSCC, 2006.
[24] P. Bai. Foils from "A 65nm Logic Technology Featuring 35nm Gate Lengths, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57 µm2 SRAM Cell". IEDM, 2004.
[25] L. Chang et al. A 5.3GHz 8T-SRAM with Operation Down to 0.41V in 65nm CMOS. In Proc. VLSI, 2007.
[26] S. Hsu et al. An 8.8GHz 198mW 16x64b 1R/1W Variation-Tolerant Register File in 65nm CMOS. In Proc. ISSCC, 2006.
[27] S. Thoziyoor et al. CACTI 5.1. Technical report, HP Laboratories, Palo Alto, 2008.
[28] C. E. LaForest and J. G. Steffan. Efficient Multi-Ported Memories for FPGAs. In Proc. FPGA, 2010.
[29] K. Pagiamtzis and A. Sheikholeslami. Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey. IEEE Journal of Solid-State Circuits, 2006.
[30] K. McLaughlin et al. Exploring CAM Design for Network Processing Using FPGA Technology. In Proc. AICT-ICIW, 2006.
[31] J.-L. Brelet and L. Gopalakrishnan. Using Virtex-II Block RAM for High Performance Read/Write CAMs. http://www.xilinx.com/support/documentation/application_notes/xapp260.pdf, 2002.
[32] I. Arsovski and R. Wistort. Self-Referenced Sense Amplifier for Across-Chip-Variation Immune Sensing in High-Performance Content-Addressable Memories. In Proc. CICC, 2006.
[33] D. W. Plass and Y. H. Chan. IBM POWER6 SRAM Arrays. IBM Journal of Research and Development, 51(6), 2007.
[34] W. Hu et al. Godson-3: A Scalable Multicore RISC Processor with x86 Emulation. IEEE Micro, 29(2), 2009.
[35] A. Agarwal et al. A Dual-Supply 4GHz 13fJ/bit/search 64×128b CAM in 65nm CMOS. In Proc. ESSCIRC 32, 2006.
[36] S. K. Hsu et al. A 110 GOPS/W 16-bit Multiplier and Reconfigurable PLA Loop in 90-nm CMOS. IEEE Journal of Solid-State Circuits, 41(1), 2006.
[37] W. Belluomini et al. An 8GHz Floating-Point Multiply. In Proc. ISSCC, 2005.
[38] J. B. Kuang et al. The Design and Implementation of Double-Precision Multiplier in a First-Generation CELL Processor. In Proc. ICIDT, 2005.
[39] P. Jamieson and J. Rose. Mapping Multiplexers onto Hard Multipliers in FPGAs. In Proc. IEEE-NEWCAS, 2005.
[40] A. Agah et al. Tertiary-Tree 12-GHz 32-bit Adder in 65nm Technology. In Proc. ISCAS, 2007.
[41] S. Kao et al. A 240ps 64b Carry-Lookahead Adder in 90nm CMOS. In Proc. ISSCC, 2006.
[42] S. B. Wijeratne et al. A 9-GHz 65-nm Intel Pentium 4 Processor Integer Execution Unit. IEEE Journal of Solid-State Circuits, 42(1), 2007.
[43] X. Y. Zhang et al. A 270ps 20mW 108-bit End-around Carry Adder for Multiply-Add Fused Floating Point Unit. Signal Processing Systems, 58(2), 2010.
[44] K. Vitoroulis and A. J. Al-Khalili. Performance of Parallel Prefix Adders Implemented with FPGA Technology. In Proc. NEWCAS Workshop, 2007.
[45] M. Alioto and G. Palumbo. Interconnect-Aware Design of Fast Large Fan-In CMOS Multiplexers. IEEE Trans. Circuits and Systems II, 2007.
[46] Altera. Stratix III Device Handbook, Volume 1. 2009.
[47] D. Lewis et al. The Stratix II Logic and Routing Architecture. In Proc. FPGA, 2005.
[48] E. Sprangle and D. Carmean. Increasing Processor Performance by Implementing Deeper Pipelines. SIGARCH Comp. Arch. News, 30(2), 2002.
[49] A. Hartstein and T. R. Puzak. The Optimum Pipeline Depth for a Microprocessor. SIGARCH Comp. Arch. News, 30(2), 2002.
[50] M. S. Hrishikesh et al. The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays. In Proc. ISCA 29, 2002.
[51] H. B. Bakoglu and J. D. Meindl. Optimal Interconnection Circuits for VLSI. IEEE Trans. Electron Devices, 32(5), 1985.
[52] International Technology Roadmap for Semiconductors. http://www.itrs.net/Links/2007ITRS/Home2007.htm, 2007.
[53] Altera. External Memory Interface Handbook, Volume 3. 2010.
[54] P. Yiannacouras et al. The Microarchitecture of FPGA-based Soft Processors. In Proc. CASES, 2005.
[55] N. Malik et al. Interlock Collapsing ALU for Increased Instruction-Level Parallelism. SIGMICRO Newsl., 23(1-2), 1992.
[56] J. Phillips and S. Vassiliadis. High-Performance 3-1 Interlock Collapsing ALUs. IEEE Trans. Computers, 43(3), 1994.
[57] P. G. Sassone and D. S. Wills. Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication. In Proc. MICRO 37, 2004.
[58] A. W. Bracy. Mini-Graph Processing. PhD thesis, University of Pennsylvania, 2008.
[59] M. Zhang and K. Asanovic. Highly-Associative Caches for Low-Power Processors. In Kool Chips Workshop, MICRO-33, 2000.
[60] J. E. Smith and A. R. Pleszkun. Implementing Precise Interrupts in Pipelined Processors. IEEE Trans. Computers, 37(5), 1988.
[61] G. S. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Trans. Computers, 39:349-359, 1990.
[62] F. J. Mesa-Martínez et al. SEED: Scalable, Efficient Enforcement of Dependences. In Proc. PACT, 2006.
[63] M. P. et al. A Decoupled KILO-Instruction Processor. In Proc. HPCA, 2006.