© 2013 Jikai Chen - University of...
LOW-POWER HIGH-SPEED SERIAL LINK DESIGN
By
JIKAI CHEN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
ACKNOWLEDGEMENTS
During the seven years as a PhD student at the University of Florida, I received
much help from many people. Although there is only one person listed as the author, the
work presented in this Dissertation would not have been possible without them. To each
one of them I owe many thanks.
I want to thank my advisor, Dr. Rizwan Bashirullah, for his encouragement when
things might go wrong, his tolerance and patience when things did go wrong, and his
high standard which I will carry through the rest of my life.
I want to thank Dr. Jenshan Lin, Dr. Robert Fox, and Dr. Sanjay Ranka for being
on my committee and spending their precious time on this Dissertation.
My special thanks go to my friends at ICR. Walker Turner, Qiuzhong Wu, Hang
Yu, Chris Dougherty, Chun-ming Tang, Lin Xue, Zhiming Xiao, Chun-chin Peng, Yan
Hu, Pawan Sabharwal, Deepak Bhatia, Lawrence Fomundam, and Felipe Garay offered
me help when I needed it the most, and brought fun to my supposedly dull PhD life. I
will miss the basketball games that we played on those hot summer days.
I want to thank Professor Paul Kohl and his group at Georgia Institute of
Technology for their wonderful cooperation, especially Brad Chen and Todd Spencer.
I feel blessed to have such wonderful friends outside ICR, including Shuo Cheng,
Mingqi Chen, Changzhi Li, Xiaogang Yu and Yan Yan. There is no doubt I enjoyed and
will always cherish our friendship.
I am grateful to my manager, Yanli Fan, and my colleagues, Karl Muth, Archie Hu
and Huawen Jin, at Texas Instruments. Yanli has been very supportive when I needed
to take time off for my defense. I learnt a lot from each one of them, and look forward to
making my own contribution to the team.
I want to thank my parents, my parents-in-law, and my sister. Throughout the ups
and downs in the past years, they supported me with their love without condition. If
there is only one thing that I want to achieve in my life, I want to make them proud.
Finally I want to thank my dear wife, Yuan Rao, the most caring and lovely
woman in my life. I cannot thank her enough for her love, encouragement, patience, and
everything she has done for me. Marrying her is by far the best thing that ever
happened to me. I won’t hesitate a moment to give everything in the world for my wife,
and dedicating this Dissertation to her is the least I can do.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER
1 INTRODUCTION
1.1 Research Motivation
1.2 Dissertation Organization
2 HIGH-SPEED SERIAL LINK OVERVIEW
2.1 Chapter Overview
2.2 The Channel
2.3 Equalization
2.3.1 FFE
2.3.2 CTLE
2.3.3 DFE
2.4 Clocking
2.4.1 Clock Generation
2.4.2 Clock Recovery
2.5 Signaling
2.5.1 Signaling Efficiency
2.5.2 Effects of Channel Loss
2.5.3 Effects of FFE and DFE
2.5.4 Effects of Back Termination
2.5.5 Effects of Signaling and Termination Modes
2.6 Summary
3 AN ACTIVE LINK WITH AIR-CAVITY TRANSMISSION LINES
3.1 Chapter Overview
3.2 Transmission Line Design
3.3 Fabrication
3.4 Link Implementation
3.4.1 Link Architecture
3.4.2 TX Design
3.4.3 RX Design
3.4.3.1 Preamp design
3.4.3.2 DFE design
3.5 Experimental Results
3.5.1 Air-Cavity Transmission Line Measurement
3.5.2 Link Measurement
3.6 Summary
4 A 4.5-Gb/s 12.4-mW RX WITH BAUD-RATE CDR
4.1 Chapter Overview
4.2 Baud-Rate CDR
4.3 Majority-Voting DFE
4.4 Chip Implementation
4.4.1 Architecture
4.4.2 Slicer
4.4.3 DMUX
4.4.4 Clocking
4.5 Experimental Results
4.6 Summary
5 A 5-Gb/s 0.75-pJ/BIT VOLTAGE-MODE TRANSCEIVER
5.1 Chapter Overview
5.2 TX Implementation
5.2.1 TX Architecture
5.2.2 PRBS Generator
5.2.3 LDO
5.2.4 TX Driver
5.3 RX Implementation
5.3.1 RX Architecture
5.3.2 Slicer Design
5.3.3 Level Shifting and DFE Tap Generation
5.3.4 DFE with Look-Ahead Selection Tree
5.3.5 Decimated Baud-Rate CDR
5.4 Injection-Locking-Based Clock Generation
5.4.1 Clock Generation Overview
5.4.2 ILRO Core
5.4.3 Delay Line
5.5 Experimental Results
5.5.1 TX Measurement
5.5.2 Clocking Measurement
5.5.3 RX Measurement
5.5.4 Transceiver Measurement
5.6 Summary
6 A DIGITAL BACKGROUND ADC CALIBRATION TECHNIQUE
6.1 Chapter Overview
6.2 Background Calibration
6.2.1 Review of Prior Art
6.2.2 Proposed Background Calibration Scheme
6.2.2.1 Calibration accuracy
6.2.2.2 Convergence speed
6.2.2.3 Calibration overhead and performance considerations
6.3 Chip Implementation
6.3.1 ADC Architecture
6.3.2 Resistor Ladder
6.3.3 T/H
6.3.4 Comparator
6.3.5 Digital Backend
6.3.6 Reference ADC
6.3.7 Calibration Engine and Supporting Circuitry
6.3.8 Clock and Power Distribution
6.4 Experimental Results
6.5 Summary
7 CONCLUSIONS
LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
2-1 Summary of signaling and termination modes
3-1 Final air-cavity microstrip dimensions
3-2 Performance summary
4-1 CDR truth table
4-2 update
4-3 Clock phase update
4-4 Selector truth table
4-5 Majority-voter truth table
4-6 Performance summary
5-1 Performance summary of the receiver
5-2 Performance summary of the transceiver
6-1 Comparison of proposed and existing background calibration schemes
6-2 Comparison with recently published work
LIST OF FIGURES
1-1 Evolution of Intel Microprocessors.
1-2 ITRS predictions for transistor count and on-chip clock frequency for the next decade.
1-3 ITRS predictions of I/O and power for the next decade
1-4 Power efficiency of high-speed links vs. year
2-1 A typical high-speed serial link
2-2 Conductor loss.
2-3 Physical mechanism of dielectric loss
2-4 Channel loss
2-5 A sample SBR
2-6 Main cursor vs. Nyquist loss
2-7 Eye degradation due to channel loss
2-8 FFE.
2-9 CTLE.
2-10 DFE block diagrams.
2-11 Block diagrams of a PLL and a DLL.
2-12 Block diagrams of an injection-locked 5-stage ring oscillator
2-13 Simulated phase noise suppression with injection-locking
2-14 CDR block diagram
2-15 Block diagram and principle of Alexander PD
2-16 Simulated performances of an inverter in a 0.13-μm CMOS technology
2-17 A typical link frontend
2-18 Main cursor amplitude and signaling power penalty vs. channel loss
2-19 Post-cursor amplitudes vs. channel loss
2-20 The effects of channel loss and equalization on
2-21 Effects of FFE and DFE in frequency domain
2-22 Lattice diagram for reflection calculation
2-23 Eye opening vs. RX mismatch
2-24 CM signaling.
2-25 VM signaling
3-1 Cross-sections of microstrips.
3-2 Simulated of conventional and air-cavity microstrip
3-3 Simulated of conventional and air-cavity microstrip
3-4 Simulated dielectric loss of conventional and air-cavity microstrip
3-5 Picture of the 3D model and simulated loss at various line widths
3-6 Simulated dielectric loss of air-cavity and conventional transmission lines
3-7 Improvement with air-cavity transmission line
3-8 Signaling power reduction with air-cavity.
3-9 Fabrication process for the air-cavity structure
3-10 Picture and cross-section of the fabricated air-cavity structure
3-11 Link block diagram
3-12 Schematics of the latch and multiplexer.
3-13 Schematic of the 5-b DAC
3-14 Preamp model for gain optimization
3-15 Preamp design.
3-16 Input impedance tuning.
3-17 Simulated RX eye diagrams.
3-19 Layout of the test board with the air-cavity active link
3-20 Measured performances of a 5-cm air-cavity microstrip.
3-21 Loss of the air-cavity line
3-22 Chip micrographs of the TX and the RX
3-23 Picture of the populated test board
3-24 Test setup
3-25 Measured waveforms
3-26 Measured link performances.
4-1 Different ISI seen by the edge and data samples
4-2 CDR block diagrams.
4-3 Operation principle of the proposed baud-rate CDR
4-4 Block diagram of a 1-tap speculative DFE
4-6 Proposed majority voter schematic
4-7 Simulated delay
4-8 Simulated selector and majority-voter performances.
4-9 Block diagram of the RX
4-10 Schematic of the slicer with threshold control
4-11 Simulated slicer performances.
4-12 Schematics of the CML and CMOS DMUX cells
4-13 Schematic of the divider for I/Q generation
4-14 Principle of PI
4-15 Schematic of the phase interpolator
4-16 Level-converter schematic.
4-17 Die micrograph and board picture
4-18 Test setup
4-19 Measured 20” channel performances.
4-20 Measured DFE performances.
4-21 CDR measurement results.
4-22 Measured CDR jitter tolerance
5-1 TX block diagram
5-2 PRBS block diagram
5-3 All-zero detector
5-4 Schematic of the self-biased comparator with offset
5-5 Simulated waveforms confirming the function of the all-zero detector
5-6 Stability of the LDO
5-7 RX block diagram
5-8 Schematic of the slicer
5-9 Level shifters.
5-10 Detailed schematic of the level shifter
5-11 Simulated frequency response of the level shifter at different gain settings
5-12 Simulated pre-layout selector delay vs. power supply
5-13 DFE selection tree.
5-14 Block diagram of the injection-locking-based clock generation
5-15 Schematic of the ILRO core
5-16 Start-up issue of the pseudo-differential oscillator
5-17 Schematic of the current-starved delay line
5-18 Simulated delay line tuning curve
5-19 Chip micrograph and transceiver layout
5-20 TX measurement results at 6.25 Gb/s.
5-21 ILRO measurement results.
5-22 Measured phase noise with and without injection locking
5-23 Measured CDR delay line tuning curve showing >2-UI tuning range
5-24 Measured loss characteristics of the 20” channel
5-25 Measured 4-Gb/s eye diagrams before and after the 20” channel
5-26 RX bathtubs with and without DFE
5-27 Jitter histogram of the recovered clock
5-28 Measured 5-Gb/s TX eye diagrams.
5-29 Measured CDR waveforms.
5-30 RX bathtubs with and without DFE
6-1 An ADC-based serial link
6-2 Schematic of a preamp
6-3 Correlation-based calibration
6-4 Redundancy-based calibration
6-5 Reference-ADC-based calibration
6-6 Principle of reference-ADC-based calibration.
6-7 Proposed reconfigurable-comparator-based calibration
6-9 Mechanism of noise-induced calibration error
6-10 Required conversions for convergence with different resolutions
6-11 Block diagram of the ADC
6-12 T/H Design.
6-13 T/H Bandwidth vs. switch width
6-14 Comparator block diagram.
6-15 Schematics of the first two stages of the preamplifier
6-16 Effects of M3.
6-19 Current-steering DAC and the DAC bias generator. The bias generator is shared by all the comparators.
6-20 Simulated comparator performances.
6-21 Block diagram of the digital backend
6-22 FSM flow chart. N is the calibration index, which is also the SRAM address.
6-23 Chip micrograph.
6-24 Measured ADC linearity.
6-25 Test setup for dynamic performance evaluation
6-26 Output spectrums.
6-27 ENOB w/ and w/o calibration
LIST OF ABBREVIATIONS
Term: Definition
ADC Analog-to-digital converter
CDR Clock and data recovery
CG Common-gate
CM Current mode
CML Current-mode logic
CTLE Continuous-time linear equalization
DFE Decision-feedback equalization
DLL Delay-locked loop
DMUX De-multiplexer
DNL Differential non-linearity
DSP Digital signal processor
ENOB Effective number of bits
FFE Feedforward equalization
FSM Finite-state machine
ILRO Injection-locked ring oscillator
INL Integral non-linearity
ISI Inter-symbol interference
ITRS International Technology Roadmap for Semiconductors
I/O Input/output
LFSR Linear-feedback shift register
LPF Low-pass filter
LSB Least significant bit
MUX Multiplexer
NRZ Non-return-to-zero
PD Phase detector
PFD Phase-and-frequency detector
PI Phase interpolator
PLL Phase-locked loop
PM Phase modulation
PRBS Pseudo-random bit sequence
RX Receiver
SAFF Sense-amplifier flip-flop
SBR Single-bit response
SNR Signal-to-noise ratio
TX Transmitter
UI Unit interval
VCDL Voltage-controlled delay line
VCO Voltage-controlled oscillator
VM Voltage mode
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
LOW-POWER HIGH-SPEED SERIAL LINK DESIGN
By
Jikai Chen
May 2013
Chair: Rizwan Bashirullah
Major: Electrical and Computer Engineering
With ever-increasing integrated functionality and on-chip clock frequency in
processors, the off-chip bandwidth is increasing at an even higher rate. The ITRS predicts
that the aggregate off-chip bandwidth of future processors will reach 100 Tb/s in the
next ten years, delivered by multiple high-speed serial links in parallel, each running at
multi-Gb/s. At the same time, the total power budget of a processor is practically flat
due to package and cooling technology limitations. To accommodate the increase of off-
chip bandwidth, the power efficiency of high-speed interconnects must be dramatically
improved over the next decade.
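The scaling argument in this paragraph can be illustrated with the relation between I/O power, aggregate bandwidth, and energy per bit. This is only a sketch: the 10 W I/O power budget is a hypothetical figure chosen for illustration, and only the 100 Tb/s target comes from the text.

```latex
P_{\mathrm{I/O}} = E_b \cdot B
\quad\Longrightarrow\quad
E_b = \frac{P_{\mathrm{I/O}}}{B}
    = \frac{10\ \mathrm{W}}{100 \times 10^{12}\ \mathrm{b/s}}
    = 0.1\ \mathrm{pJ/bit}
```

Under that assumed budget, each link would have to operate at roughly 0.1 pJ/bit, which suggests why a flat power budget forces a large improvement in link efficiency.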
Various factors come into play when improving the power efficiency of high-
speed serial links. For multi-Gb/s off-chip signaling, the electrical channel presents the
most difficult challenge with its latency and frequency-dependent attenuation. As a
result, clock and data recovery (CDR) and channel equalization have become essential
functions in all high-speed off-chip serial links. To truly optimize the link power
efficiency, the impact of channel condition, CDR and equalization on the link power
must be well understood, in addition to that of such design choices as signaling mode
and termination topology. This Dissertation is the result of such an effort.
The Dissertation starts with an overview of the high-speed serial link. The
channel loss mechanisms are first reviewed and dielectric loss is shown to be the
dominant factor in future high-speed channels. The dependence of the signaling power
on signaling modes, termination topologies and equalization techniques is analyzed to
identify power-efficient solutions. CDR is also briefly reviewed, revealing the need for a
better baud-rate scheme than existing ones.
To reduce the dielectric loss, a low-power active link is presented in Chapter 3
with an air-cavity transmission line which reduces the channel latency and the dielectric
loss by replacing the dielectric material between the signal lines and the ground plane
with air. Other techniques include the use of DFE, a current-sharing frontend, and the
removal of back termination for better power efficiency. The link works up to 6.25 Gb/s
with a power efficiency of 0.6 pJ/bit.
Clock recovery is addressed in Chapter 4. A novel digital baud-rate CDR scheme
is proposed which automatically tracks the maximum eye-opening. Chapter 4 also
proposes replacing the selectors in a traditional speculative DFE with majority voters,
which are faster and more power-efficient. A receiver that incorporates the proposed
baud-rate CDR and majority-voting DFE works at 4.5 Gb/s while consuming 12.4 mW,
yielding a power efficiency of 2.8 pJ/bit.
Building upon the results of Chapters 3 and 4, Chapter 5 presents a complete 5-
Gb/s transceiver which dissipates only 3.7 mW. To improve the power efficiency, the
transceiver uses exclusively static CMOS logic gates instead of the CML gates in
Chapters 3 and 4, and employs injection-locking based clock generation. Heavy
parallelism and speculation in the DFE selection tree further reduce the power
consumption. The measured 0.75-pJ/bit power efficiency is among the best reported to
date.
While currently most serial links still rely on some analog signal processing, the
continuous scaling of CMOS technology has recently made an ADC-based serial link
attractive in which equalization and timing recovery are all carried out in the digital
domain. One of the key challenges in this ADC-based architecture is the power
consumption of the high-speed ADC. Chapter 6 presents a novel digital background
calibration scheme suitable for high-speed ADCs which features negligible hardware
and power overhead. The efficacy of the proposed calibration scheme is experimentally
confirmed with a 50-mW 2.5-GS/s 5-bit full-flash ADC.
All the test chips in this Dissertation are in a 0.13-µm bulk CMOS technology.
However, the proposed techniques are readily applicable to more advanced technologies. It is therefore
expected that techniques proposed in this Dissertation should help enable future off-
chip serial links with high aggregate bandwidth and low power consumption.
CHAPTER 1 INTRODUCTION
1.1 Research Motivation
The past few decades have witnessed the tremendous advancement of the
semiconductor technology. Governed by Moore’s Law [1] [2], the functionality
(represented by the number of transistors) integrated on a single chip and the on-chip
clock frequency both grew exponentially, as can be observed in Figure 1-1, which
shows the transistor number and on-chip clock frequency of Intel’s microprocessors
over the past 40 years. Consequently, higher and higher I/O bandwidth is needed for
the communication between microprocessors, accelerators, and memories [3].
Recently, the aggregate off-chip bandwidth has entered the Tb/s range, necessitating
the integration of multiple (tens or even hundreds of) high-speed serial-link transceivers
on the same chip, each operating at multi-Gb/s. For example, in [4], a 16-core SPARC
processor has 1.1 Tb/s aggregate I/O bandwidth provided by 112 transmitters and 176
receivers with peak signaling rate of 4.08 Gb/s each.
Such exponential growth of functionality and clock frequency is expected to
continue in the coming decade, as predicted by ITRS [5] and shown in Figure 1-2(A)
and (B), giving rise to even faster increase of the I/O bandwidth over the same period.
Figure 1-3(A) and Figure 1-3(B) show the predicted off-chip clock frequency and the
total number of pads, while the resulting aggregate off-chip bandwidth is plotted in
Figure 1-3(C), assuming that differential NRZ signaling is used and that 50% of the
pads are dedicated to off-chip signaling. It can be seen that within 10 years, the total
bandwidth will extend to the hundred Tb/s range.
(A) (B)
Figure 1-1. Evolution of Intel microprocessors. A) Transistor count. B) On-chip clock frequency. Labeled data points: 4004, 8086, Pentium, Pentium III, and 10-core Xeon.
(A) (B)
Figure 1-2. ITRS predictions for transistor count and on-chip clock frequency for the next decade. A) Transistor count. B) On-chip clock frequency.
However, due to packaging and cooling limitations, it is also predicted that the
total power consumption of a processor will be kept practically flat at about 140 W over the
same period, as shown in Figure 1-3(D) [5]. State-of-the-art power efficiency of high-
speed serial-link transceivers is around 1 pJ/bit (1 mW/Gb/s), which means 100 W I/O
power consumption if 100 Tb/s aggregate bandwidth is desired. Clearly, the power
efficiency of high-speed transceivers must be greatly improved in order to maintain such
a growth of I/O bandwidth. For example, if the I/O power is to be kept around 20% of the
whole chip, the power efficiency should improve to approximately 0.2 pJ/bit in 2022.
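The budget arithmetic in this paragraph can be reproduced in a few lines. The sketch below assumes exactly 100 Tb/s of aggregate bandwidth and a 140-W chip budget; the function names are illustrative only and do not come from the Dissertation.

```python
def io_power_watts(bandwidth_tbps: float, efficiency_pj_per_bit: float) -> float:
    """I/O power in watts; the 1e12 (Tb/s) and 1e-12 (pJ) factors cancel."""
    return bandwidth_tbps * efficiency_pj_per_bit


def required_efficiency_pj_per_bit(chip_power_w: float, io_fraction: float,
                                   bandwidth_tbps: float) -> float:
    """Energy per bit (pJ) that fits the I/O into a given share of chip power."""
    return chip_power_w * io_fraction / bandwidth_tbps


print(io_power_watts(100.0, 1.0))                          # I/O power at 1 pJ/bit
print(required_efficiency_pj_per_bit(140.0, 0.20, 100.0))  # budgeted efficiency
```

With these round numbers the budget works out to roughly 0.28 pJ/bit, in the same range as the approximately 0.2 pJ/bit quoted above; the exact target depends on the bandwidth projected for 2022.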
(A) (B)
(C) (D)
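Figure 1-3. ITRS predictions of I/O and power for the next decade. A) Off-chip clock frequency (GHz), annotated with a 4.6× increase. B) Number of I/O pads, annotated with a 1.4× increase. C) Aggregate I/O bandwidth (Tb/s), annotated with a 6.4× increase. D) Power (W).
Figure 1-4. Power efficiency of high-speed links (pJ/bit) vs. year, with a trend of about -20%/year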
In response, the power efficiency of high-speed serial links has been steadily
improving at about 20% each year [6] [7], driven by the joint effort of
technology scaling and design innovations. Figure 1-4 shows the power efficiency of the
high-speed serial links published in ISSCC and the VLSI Symposium since 2000.
Extrapolating this trend to 2022 gives about 0.7 pJ/bit, which is more than 3× the 0.2-pJ/bit goal.
This clearly indicates that more drastic improvement is needed in the future and is the
motivation behind the research work presented in this Dissertation.
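To put the gap in perspective, the yearly improvement that would be needed to reach the goal can be estimated. The sketch below assumes that both scenarios start from the same 2012 value and compound over ten years; the helper name and its parameters are hypothetical.

```python
def required_annual_improvement(trend_rate: float, years: int,
                                trend_result: float, target: float) -> float:
    """Yearly improvement needed to end at `target` instead of `trend_result`.

    Both scenarios start from the same value; the trend compounds
    (1 - trend_rate) per year for `years` years.
    """
    trend_factor = (1.0 - trend_rate) ** years           # total shrink along the trend
    needed_factor = trend_factor * target / trend_result  # total shrink to hit the target
    return 1.0 - needed_factor ** (1.0 / years)


rate = required_annual_improvement(0.20, 10, trend_result=0.7, target=0.2)
print(f"{rate:.1%} per year")
```

Under these assumptions the required rate is roughly 29% per year instead of the historical 20%.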
1.2 Dissertation Organization
A high-speed serial link involves functions such as equalization, clocking, and
signaling. To improve the power efficiency of the whole link, it is vital to understand
each of these components and their inter-dependencies, which is the topic of Chapter 2.
Chapter 2 starts with the channel, with special emphasis on the intrinsic loss of
transmission lines. It then introduces a few popular equalization techniques to
compensate channel loss. The important topic of clock generation and recovery follows,
revealing the attractiveness of injection-locking-based clock generation and baud-rate
CDR. After that, the signaling power is related to channel loss, equalization, termination,
and signaling modes. The advantages of DFE and voltage mode signaling with
differential termination are demonstrated.
Chapter 3 focuses on reducing the signaling power by joint channel and circuit
optimization. An air-cavity transmission line structure is proposed to reduce the
dielectric loss which dominates at high frequencies. To further reduce the power
dissipation, the link also features speculative DFE and a current-sharing frontend
without back termination. The active link dissipates 3.7 mW at 6.25 Gb/s, which
translates to a power efficiency of 0.6 pJ/bit.
A digital eye-tracking baud-rate CDR scheme is proposed in Chapter 4. The
baud-rate CDR automatically tracks the maximum eye-opening while reducing the
clocking power by more than 50% compared to a conventional oversampling-based
CDR. A majority-voting 1-tap speculative DFE is also proposed which is more amenable
to low-power and high-speed designs than the selectors in conventional speculative
DFEs. Implemented with CML gates, a receiver with the proposed baud-rate CDR and
majority-voting DFE consumes 12.4 mW at 4.5 Gb/s including the clocking circuitry.
To further improve the power efficiency, Chapter 5 presents a complete
transceiver built exclusively with static CMOS gates. The RX employs heavy parallelism to
reduce the power supply from the nominal 1.2 V to 1.0 V. Other design features include
a speculative DFE with a look-ahead selection tree, a decimated baud-rate eye-tracking
CDR, and an injection-locked ring oscillator for multi-phase clock generation. The TX
uses a voltage-mode driver with differential termination to reduce the signaling power.
The transceiver consumes 3.7 mW at 5 Gb/s. At 0.75 pJ/bit, the power efficiency is
among the best to date.
With advanced CMOS technologies offering transistors with cut-off frequencies
above 100 GHz and gate delays of around 10 ps, it is now possible for the RX to directly
digitize the incoming signal and perform equalization and timing recovery in the digital
domain [8]. One of the key challenges, however, is the ADC’s power consumption. With
a given architecture, an ADC’s power consumption is limited by mismatch which
prevents the use of small transistors. In response, Chapter 6 describes a novel
background ADC calibration scheme that is suitable for high-speed ADCs and incurs
negligible hardware and power overhead. The proposed calibration scheme is
implemented in a 50-mW 2.5-GS/s 5-bit flash ADC and its effectiveness is
demonstrated with experimental results.
All the reported results are in 0.13-μm bulk CMOS technology. It is expected that
the migration to more advanced technologies will lead to even better performance. The
proposed techniques should therefore help pave the way toward low-power high-speed
serial links to meet the requirements of future high-performance electronic systems.
CHAPTER 2 HIGH-SPEED SERIAL LINK OVERVIEW
2.1 Chapter Overview
Figure 2-1 shows a typical high-speed serial link, which consists of a TX, a
channel, and an RX. The TX multiplexes a low-speed parallel bus into a high-speed
serial stream and drives it toward the channel. The RX resolves the stream into digital
bits with a slicer and de-multiplexes them back to a parallel format. The equalizer (EQ)
compensates the frequency-dependent loss of the channel, and the clock and data
recovery (CDR) unit adaptively adjusts the RX clock phase so that the slicer digitizes
the incoming stream with enough timing margin.
Figure 2-1. A typical high-speed serial link
To improve the power efficiency of a serial link, the various parts of the link must
be well understood. We first examine the channel, with emphasis on transmission line
loss because it plays a vital role in determining the link performance. We then introduce
some popular equalization techniques to compensate the channel loss, including FFE,
CTLE, and DFE. Clocking, including clock generation and clock recovery, is presented
next. We show in this part that injection-locking is an attractive clock-generation
technique, and that baud-rate CDR schemes are generally preferred over their
oversampling counterparts. In the end, we relate the signaling power to channel loss,
equalization, impedance mismatch, signaling modes, and termination schemes. We
demonstrate that DFE usually gives better signaling efficiency than FFE, and that
voltage-mode signaling with differential termination reduces the signaling power
significantly.
2.2 The Channel
At multi-Gb/s, the channel delay is comparable to or even larger than the bit time,
rendering the signaling sensitive to reflections due to impedance mismatch. For this
reason, the channel is usually a transmission line with controlled 50-Ω impedance to
accommodate measurement equipment and properly terminated at both the TX and RX.
Discontinuities along the channel such as vias, packages, and connectors should all be
carefully evaluated and controlled.
However, even a perfectly uniform transmission line with proper termination presents
challenges to high-speed signaling. At multi-Gb/s, the channel suffers from two
frequency-dependent loss mechanisms, and it is the channel rather than the transistors
that limits the total signaling bandwidth. For example, it is shown in [9] that, in theory, an
NMOS in 0.8-μm technology is able to resolve a 48-Gb/s binary bit stream. However, the
experimental results fall far short of the theoretical prediction due to the channel
bottleneck (including the pads and packages).
The first loss mechanism is the conductor resistance. At low frequencies, the
current flows evenly through the conductor cross-sectional area. At high frequencies,
however, the current tends to follow the path with least inductance, flowing only in a
shallow band underneath the conductor surface, a phenomenon known as skin effect,
as shown in Figure 2-2(A). The skin depth, the depth at which the current density
decays to $e^{-1}$ of that at the surface, is given by [10]
$$\delta = \frac{1}{\sqrt{\pi f \mu \sigma}}$$
where $\delta$ is the skin depth, $f$ is the frequency, $\mu$ is the permeability, and $\sigma$ is the
conductivity. Figure 2-2(B) plots the skin depth in copper as a function of frequency. In
the GHz range, the skin depth is only on the order of micrometers.
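As a quick numerical check of the skin-depth expression $\delta = 1/\sqrt{\pi f \mu \sigma}$, the sketch below evaluates it for copper. The conductivity of 5.8e7 S/m and the use of the free-space permeability are assumed typical values, not figures taken from the Dissertation.

```python
import math

MU_0 = 4.0e-7 * math.pi   # permeability of free space, H/m
SIGMA_CU = 5.8e7          # assumed conductivity of copper, S/m


def skin_depth_m(freq_hz: float, mu: float = MU_0, sigma: float = SIGMA_CU) -> float:
    """Skin depth delta = 1 / sqrt(pi * f * mu * sigma), in metres."""
    return 1.0 / math.sqrt(math.pi * freq_hz * mu * sigma)


for f_ghz in (0.1, 1.0, 10.0):
    print(f"{f_ghz:5.1f} GHz: {skin_depth_m(f_ghz * 1e9) * 1e6:.2f} um")
```

With these values the skin depth at 1 GHz is about 2 μm, and it halves for every fourfold increase in frequency.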
(A)
(B)
Figure 2-2. Conductor loss. A) Skin effect. B) Skin depth vs. frequency in copper
The crowding of current to the conductor surface increases the effective
resistance at high frequencies. Since the skin depth is inversely proportional to $\sqrt{f}$, the
conductor loss (in dB) increases proportionally to $\sqrt{f}$.
The second loss mechanism is the dielectric dissipation, which originates from
the polarization of the molecules in the dielectric material. As illustrated in Figure 2-3,
when an alternating electric field is applied to a dielectric material, the molecules rotate
to align with the external field and in doing so rub against each other and convert some
of the electric energy into heat [11]. Because the molecules rotate every time the field
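polarity changes, the dielectric loss (in dB) is proportional to frequency, and is given
by [12]
$$\alpha_D = \frac{\pi \sqrt{\varepsilon_r}\, f \tan\delta_D}{c}$$
where $\tan\delta_D$ is the loss tangent of the dielectric material, $\varepsilon_r$ is its relative permittivity, and $c$ is the speed of light in vacuum.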
Figure 2-3. Physical mechanism of dielectric loss
The total loss is the combined effects of $\alpha_R$ and $\alpha_D$, and can be expressed as
$$\alpha(f) = \alpha_R + \alpha_D = k_R \sqrt{f} + k_D f$$
where $\alpha_R$ is the conductor loss, $\alpha_D$ is the dielectric loss, and $k_R$ and $k_D$
are constants determined by the transmission line construction. Since
both $\alpha_R$ and $\alpha_D$ increase with frequency, the channel displays a low-pass profile. Figure
2-4 shows an example channel loss, where $f_{DR}$ is the data rate. The loss at half the data
rate, $\alpha(0.5 f_{DR})$, is also known as the Nyquist loss. $f_C$ denotes the frequency at which
the two loss mechanisms contribute the same and is given by
$$f_C = \left(\frac{k_R}{k_D}\right)^2$$
For a differential 100-Ω 8-mil 0.5-oz microstrip line on FR4, $f_C$ is around 2 GHz. For
high-quality cables, $f_C$ may be much higher. For example, a 50-Ω RG-58 cable with
polyethylene dielectric material may have an $f_C$ around 100 GHz.
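The two-term loss model lends itself to a short numerical illustration. The sketch below assumes a loss of the form $k_R\sqrt{f} + k_D f$ in dB; the constants are hypothetical and chosen so that each mechanism contributes 3 dB at 2 GHz.

```python
import math


def total_loss_db(f_hz: float, k_r: float, k_d: float) -> float:
    """Channel loss in dB: skin-effect term (sqrt(f)) plus dielectric term (f)."""
    return k_r * math.sqrt(f_hz) + k_d * f_hz


def crossover_frequency_hz(k_r: float, k_d: float) -> float:
    """Frequency where the two loss terms are equal: f_C = (k_R / k_D)**2."""
    return (k_r / k_d) ** 2


# Hypothetical constants: each mechanism contributes 3 dB at 2 GHz.
F_C = 2.0e9
K_R = 3.0 / math.sqrt(F_C)   # dB per sqrt(Hz)
K_D = 3.0 / F_C              # dB per Hz

data_rate = 4.0e9            # bit/s, so the Nyquist frequency is 2 GHz
nyquist_loss_db = total_loss_db(0.5 * data_rate, K_R, K_D)
print(crossover_frequency_hz(K_R, K_D), nyquist_loss_db)
```

Below the crossover frequency the conductor term dominates; above it the dielectric term grows faster, which is why dielectric loss matters more as data rates rise.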
Figure 2-4. Channel loss
In the time domain, this low-pass characteristic can be captured by the channel’s
single-bit response (SBR) $h_{ch}(n)$. Figure 2-5 shows a sample SBR, where $h_{ch}(0)$ is the
main cursor, those with negative index are pre-cursors, and those with positive index
are post-cursors. It can be seen that due to the limited channel bandwidth, a single bit
spans more than one UI and interferes with neighboring bits, a phenomenon known as
inter-symbol-interference (ISI).
To evaluate the impact of channel loss on the link performance, it is desirable to
establish a relationship between the Nyquist loss and the SBR. However, since the
Nyquist loss does not completely characterize the channel, an exact mapping between
the Nyquist loss and the SBR is not possible. Figure 2-6 shows the main cursor
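amplitude at different Nyquist losses. Depending on the relationship between the crossover frequency $f_C$ and
the data rate $f_{DR}$, the conductor loss $\alpha_R$ and the dielectric loss $\alpha_D$ may have varying significances, and channels with the same Nyquist loss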
may have different SBRs. Without loss of generality, the discussion in this chapter
considers the case .
Figure 2-5. A sample SBR
Figure 2-6. Main cursor vs. Nyquist loss
2.3 Equalization
Figure 2-7 shows the simulated eye diagrams for channels with 6-, 12-, and 18-
dB Nyquist losses. The channel loss degrades both the voltage and timing margins
seen by the RX. When the Nyquist loss is about 12 dB, the eye completely closes. To
extend the bandwidth of the channel, equalization is often employed in high-speed
serial links. This section reviews some of the most popular techniques.
Figure 2-7. Eye degradation due to channel loss
2.3.1 FFE
Since the ISI originates from the channel’s low-pass characteristic, it is possible
to reverse it with a linear high-pass filter. One way of doing this is through a discrete-
time FIR filter [13] [14] at the TX or RX, of which TX feedforward equalization (FFE) is
the most popular, as Figure 2-8(A) shows. By adjusting the tap weights, a relatively flat
composite frequency response can be obtained, as shown in Figure 2-8(B).
(A)
(B)
Figure 2-8. FFE. A) Block diagram. B) Working principle
Although more drivers are used for FFE, their total size is the same as the driver
without FFE if the same peak gain is maintained. The electronic power overhead of FFE
stems mainly from the additional flip-flops and the associated wiring.
2.3.2 CTLE
(A)
(B)
Figure 2-9. CTLE. A) Circuit detail. B) Frequency response
Another linear equalization technique is the continuous-time linear equalizer
(CTLE) [6] [7]. Figure 2-9(A) and (B) show the schematic and transfer function of such a
CTLE. The transfer function has two poles and one zero, which are given by
The product of the gain, the peaking factor, and the bandwidth satisfies [15]
which means the performance of the CTLE is limited by the cut-off frequency of the
technology. Due to the high bandwidth and linearity requirements, a CTLE tends to be
power hungry. For example, implemented in 90-nm CMOS, the CTLE in [6] provides
8.7-dB peaking and accounts for 27% of the total RX power at 6.25 Gb/s. For a 12.5-
Gb/s link implemented in 65-nm CMOS, the CTLE provides 7.5-dB peaking and
represents 38% of the RX power [7].
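For reference, the pole, zero, and gain expressions of one common CTLE topology can be written out. The following is a sketch that assumes a source-degenerated differential pair with transconductance $g_m$, degeneration resistor $R_S$ and capacitor $C_S$, load resistor $R_D$, and load capacitance $C_L$. The topology and component names are assumptions and may not match the circuit of Figure 2-9(A).

```latex
\begin{align*}
\omega_z &= \frac{1}{R_S C_S}, &
\omega_{p1} &= \frac{1 + g_m R_S/2}{R_S C_S}, &
\omega_{p2} &= \frac{1}{R_D C_L}, \\
A_0 &= \frac{g_m R_D}{1 + g_m R_S/2}, &
\frac{\omega_{p1}}{\omega_z} &= 1 + \frac{g_m R_S}{2}, &
A_0 \cdot \frac{\omega_{p1}}{\omega_z} \cdot \omega_{p2} &= \frac{g_m}{C_L}.
\end{align*}
```

Under these assumptions the product of DC gain $A_0$, peaking factor $\omega_{p1}/\omega_z$, and bandwidth $\omega_{p2}$ depends only on $g_m/C_L$, which is bounded by the transistor speed.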
2.3.3 DFE
Besides the linear equalizers discussed above, a non-linear equalization
technique, known as decision-feedback equalization (DFE), has found interest in recent
high-speed serial links [16] [17] [18]. A 1-tap DFE is depicted in Figure 2-10(A). It works
by directly removing the ISI of the previous bit from the current analog sample. Another
way of viewing it is that the DFE adjusts the slicer threshold depending on the previous
bit. The power overhead of the DFE shown in Figure 2-10(A) consists mainly of the
summer.
The feedback path in Figure 2-10(A) must settle within one UI, a difficult design
challenge at high data rates. To relax this stringent timing requirement, speculative DFE
can be used, where possible results are pre-computed and then selected by the
previous bits [19], as shown in Figure 2-10(B). The power overhead of speculative DFE
consists of the additional slicers.
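The behaviour of the two DFE variants can be illustrated with a noise-free behavioural model. The sketch below assumes a channel with a main cursor and a single post-cursor, bits encoded as +1/-1, and a known initial bit; the cursor values and function names are hypothetical.

```python
def sign(x: float) -> int:
    return 1 if x >= 0 else -1


def channel(bits, h0, h1, initial=-1):
    """Received samples y[n] = h0*b[n] + h1*b[n-1] (main cursor plus one post-cursor)."""
    prev = initial
    out = []
    for b in bits:
        out.append(h0 * b + h1 * prev)
        prev = b
    return out


def slicer_only(samples):
    """No equalization: slice every sample against a fixed zero threshold."""
    return [sign(y) for y in samples]


def dfe_direct(samples, h1, initial=-1):
    """Direct feedback: subtract the post-cursor ISI of the previous decision."""
    prev = initial
    out = []
    for y in samples:
        d = sign(y - h1 * prev)
        out.append(d)
        prev = d
    return out


def dfe_speculative(samples, h1, initial=-1):
    """Speculative: slice against both thresholds, then select with the previous bit."""
    prev = initial
    out = []
    for y in samples:
        if_prev_high = sign(y - h1)   # valid if the previous bit was +1
        if_prev_low = sign(y + h1)    # valid if the previous bit was -1
        d = if_prev_high if prev == 1 else if_prev_low
        out.append(d)
        prev = d
    return out


BITS = [1, -1, -1, 1, 1, 1, -1, 1, -1, -1, 1, -1]
H0, H1 = 0.5, 0.6   # hypothetical cursors; the eye is closed without equalization
RX = channel(BITS, H0, H1)
print(slicer_only(RX) == BITS, dfe_direct(RX, H1) == BITS, dfe_speculative(RX, H1) == BITS)
```

In this model both variants recover the bit pattern that the plain slicer cannot. The speculative version moves the subtraction out of the feedback path, so only the selection has to complete within one UI, at the cost of a second slicer.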
(A)
(B)
Figure 2-10. DFE block diagrams. A) Conventional DFE. B) Speculative DFE.
2.4 Clocking
At multi-Gb/s, both the timing offset and uncertainty must be well controlled, and
clocking, including clock generation and clock recovery, may constitute a significant or
even dominant portion of the total link power [6] [20]. This section looks at both clock
generation and clock recovery, and identifies ways to reduce the clocking power.
2.4.1 Clock Generation
Clock generation in high-speed serial links is usually done with a PLL or a DLL.
Figure 2-11(A) depicts a PLL block diagram, which consists of a phase detector (PD), a
low-pass loop filter (LPF), a voltage-controlled oscillator (VCO), and an optional divider.
At steady state, the negative feedback loop ensures that the VCO output phase is
aligned with that of the reference clock.
A DLL block diagram is shown in Figure 2-11(B), where the VCO in a PLL is
replaced with a voltage-controlled delay line (VCDL). Under locked condition, the delay
of the VCDL is equal to one reference clock cycle. Compared to a PLL, a DLL is usually
easier to design because the loop is of first order.
While the cores of a PLL and a DLL are the VCO and VCDL, the other loop
components may consume significant power. For example, in [6], the VCO consumes
only 12% of the total PLL power. Besides, the PD and loop filter also occupy
considerable area.
(A)
(B)
Figure 2-11. Block diagrams of a PLL and a DLL. A) PLL. B) DLL
Another clock generation technique that is found in some recent serial links is the
injection-locked oscillator [21] [22]. Figure 2-12 depicts the block diagram of an
injection-locked 5-stage ring oscillator. In the absence of injection signal, each stage of
the oscillator contributes a delay of $T_D$, resulting in a free-running frequency of
$$f_0 = \frac{1}{10 T_D}$$
When a clock with frequency $f_{inj}$ is injected to one of the nodes, the delay of the
injected stage changes by $\Delta T_r$ and $\Delta T_f$ at rising and falling edges respectively.
Designating $\Delta T = \Delta T_r + \Delta T_f$, under locked condition, the oscillation is sustained at $f_{inj}$,
and the following equation holds:
$$\frac{1}{f_{inj}} = 10 T_D + \Delta T$$
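The locking condition can be illustrated numerically. The sketch below assumes that the locked period equals the free-running period of an N-stage ring (2·N stage delays) plus the total delay change introduced at the injected stage; the stage delay and injection frequency are hypothetical values.

```python
def free_running_frequency_hz(n_stages: int, stage_delay_s: float) -> float:
    """An N-stage ring oscillator has a period of 2 * N stage delays."""
    return 1.0 / (2 * n_stages * stage_delay_s)


def required_delay_shift_s(f_inj_hz: float, n_stages: int, stage_delay_s: float) -> float:
    """Total per-period delay change the injected stage must supply to lock."""
    return 1.0 / f_inj_hz - 2 * n_stages * stage_delay_s


N_STAGES = 5
T_D = 40e-12                                   # hypothetical stage delay, 40 ps
f0 = free_running_frequency_hz(N_STAGES, T_D)  # free-running frequency
shift = required_delay_shift_s(2.4e9, N_STAGES, T_D)
print(f0 / 1e9, shift * 1e12)                  # GHz, ps
```

A positive shift means the injected stage must slow down to follow a reference below the free-running frequency; a negative shift means it must speed up.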
Injection-locking a ring oscillator to a clean reference clock can dramatically
improve its noise performance because periodic correction by the injected clock
prevents jitter from accumulating indefinitely [23]. This can be observed in the frequency
domain as a reduction in the phase noise with injection-locking, as illustrated in Figure 2-13.
Compared to a PLL or a DLL, an injection-locked oscillator avoids the power and
area overhead of the PD, the LPF and the dividers, while still offering good jitter
performance [24] [23] [25]. Besides, since no feedback loop is involved, an injection-
locking-based clock generation does not have the stability issue of a PLL or DLL.
Figure 2-12. Block diagram of an injection-locked 5-stage ring oscillator
Figure 2-13. Simulated phase noise suppression with injection-locking
2.4.2 Clock Recovery
A clock recovery unit is essentially a feedback system consisting of three basic
blocks, namely a phase detector (PD), a phase shifter or rotator, and a loop filter, as
shown in Figure 2-14. The PD determines whether the sampling clock is too early or too
late. The early/late information, after being processed by a loop filter, is used to control
the phase shifter or rotator toward the desired position.
Figure 2-14. CDR block diagram
Various architectures exist for clock recovery [26]. The PD can be either linear
[27] or non-linear [28], with the former giving both the direction and magnitude of the
phase deviation, while the latter gives only the direction. In high-speed serial links, a non-linear
PD is more popular because it does not require processing of narrow pulses [29]. The
loop filter can be analog [30], digital [31], or hybrid [32]. The phase shifter or rotator can
be implemented with an oscillator, a delay line, or a phase interpolator (PI) etc.
Non-linear phase detection is usually achieved via oversampling. Figure 2-15(A)
shows the block diagram of an Alexander PD [28]. The input signal is sliced twice for
each UI, once at the eye center (data) and once at the eye boundary (edge). Whenever a data
transition is detected, the edge sample in between is compared with the two data
samples to determine whether the sampling clock is too early or too late, as illustrated in
Figure 2-15(B). Assuming the clock phases are evenly spaced, at locked condition, the
data-sampling phase is automatically placed at the center.
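The early/late decision described above can be written as a small function. This is a minimal sketch of the decision logic only, assuming binary samples taken in the order data, edge, next data; the function name and the +1/-1/0 return convention are illustrative.

```python
def alexander_pd(d0: int, e0: int, d1: int) -> int:
    """Return +1 if the clock is early, -1 if it is late, 0 if undecided.

    d0, e0 and d1 are consecutive binary samples: data, edge, next data.
    """
    if d0 == d1:
        return 0    # no data transition, hence no timing information
    if e0 == d0:
        return +1   # edge sample still equals the old bit: clock too early
    return -1       # edge sample already equals the new bit: clock too late


for samples in ((0, 0, 1), (0, 1, 1), (1, 1, 1)):
    print(samples, alexander_pd(*samples))
```

A loop filter would accumulate these votes and step the phase rotator in the direction that drives the average vote toward zero.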
(A)
(B) D0 = E0 ≠ D1: CK too early. D0 ≠ E0 = D1: CK too late. D0 = E0 = D1: no transition.
Figure 2-15. Block diagram and principle of Alexander PD
The power overhead of oversampling CDR consists of the additional slicers and
clocking circuitry. While the additional slicers may be disabled to reduce their power
consumption if a low CDR bandwidth is acceptable [6], it is still necessary to generate
the extra clock phases. Moreover, since oversampling requires timing resolution better
than the bit time, the clocking power overhead is more than it appears because doubling
the timing resolution requires more than doubling the clocking power. This can be
observed in Figure 2-16, which shows the delay and energy of an inverter in a 0.13-μm
CMOS technology. For this reason, baud-rate CDR is preferred to reduce clocking
power.
(A)
(B)
Figure 2-16. Simulated performances of an inverter in a 0.13-μm CMOS technology. A) Delay. B) Energy.
2.5 Signaling
In a high-speed serial link, the TX driver needs to produce a large enough
voltage swing over the low channel impedance. The power consumed by the TX driver,
also known as the signaling power, may constitute a significant portion of the total link
power. For instance, in [7], nearly 40% of the link power is consumed by the TX driver.
To improve the power efficiency of the whole link, it is imperative to gain insight into
the various factors that affect the signaling power.
2.5.1 Signaling Efficiency
Figure 2-17 shows a typical frontend found in high-speed links [17]. The analysis
in this section assumes that the DC loss of the channel is negligible. Without DC loss,
the signal swings at the TX and RX are the same, as shown in Figure 2-17. For the ideal
case with lossless channel and perfect termination, the eye opening $V_{eye}$ is the same as
the signal swing $V_{sw}$, and the corresponding signaling power is denoted $P_{SIG0}$.
Figure 2-17. A typical link frontend
Factors such as channel loss, equalization, termination, and signaling modes
cause $V_{eye}$ to deviate from $V_{sw}$. If we define the signaling efficiency as
$$\eta = \frac{V_{eye}}{V_{sw}}$$
the signaling power now becomes
$$P_{SIG} = \frac{P_{SIG0}}{\eta}$$
By studying the relationship between $\eta$ and the various factors such as channel loss,
equalization, termination, and signaling mode, their impacts on the signaling power can
be understood.
2.5.2 Effects of Channel Loss
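With the SBR given, the worst-case eye opening can be found using the
peak-distortion technique [33]. Normalized to the signal swing, it gives the signaling efficiency $\eta$ and is calculated to be
$$\eta = h_{ch}(0) - \sum_{n \neq 0} |h_{ch}(n)| \qquad (2\text{-}9)$$
For a uniform channel with perfect matching, all the cursors are positive. Since
the DC loss is negligible, i.e.
$$\sum_{n} h_{ch}(n) = 1, \qquad (2\text{-}10)$$
Equation 2-9 can be simplified to
$$\eta = 2 h_{ch}(0) - 1. \qquad (2\text{-}11)$$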
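The peak-distortion calculation can be illustrated with a toy channel. The sketch below assumes a first-order low-pass channel, which is a simplification and not the skin-effect and dielectric loss model of this chapter; its cursors are all positive and sum to one, so the worst-case eye reduces to twice the main cursor minus one. Function names are illustrative.

```python
import math


def rc_channel_sbr(ui_over_tau: float, n_post: int = 40):
    """Cursors of a first-order low-pass channel sampled once per UI.

    h(0) = 1 - exp(-T/tau) and h(n) = h(0) * exp(-n*T/tau); no pre-cursors.
    """
    a = math.exp(-ui_over_tau)
    return [(1.0 - a) * a ** n for n in range(n_post + 1)]


def worst_case_eye(sbr, main_index: int = 0) -> float:
    """Peak-distortion eye opening: main cursor minus the sum of |ISI| terms."""
    isi = sum(abs(h) for i, h in enumerate(sbr) if i != main_index)
    return sbr[main_index] - isi


sbr = rc_channel_sbr(2.0)   # hypothetical channel: one UI equals two time constants
eye = worst_case_eye(sbr)
print(round(sbr[0], 4), round(eye, 4))
```

A negative result from `worst_case_eye` indicates a closed eye, which happens in this model once the main cursor drops below one half.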
Figure 2-18. Main cursor amplitude and signaling power penalty vs. channel loss
Figure 2-18 shows the simulated amplitudes of the main cursor as a function of
the channel Nyquist loss. Assuming the post-cursors are completely removed by DFE,
the main cursor amplitude equals the RX eye opening. The signaling power penalty of
the channel loss is therefore calculated accordingly and is plotted also in Figure 2-18. It
can be seen that when the Nyquist loss exceeds about 9 dB, 50% more signaling power
is needed to restore the eye opening seen by the RX slicers.
Besides mandating more signaling power, higher channel loss also necessitates
more equalization and induces power penalty for signal processing thereof. This is
explained with the help of Figure 2-19, which shows the amplitudes of the first three
post-cursors normalized to the main cursor. Generally speaking, with increasing
channel loss, the post-cursors become more and more significant compared to the main
cursor. Specifically, when the Nyquist loss is 9 dB, the second post-cursor is around
10% of the main cursor. While 1-tap DFE may be enough when the Nyquist loss is less
than about 6 to 9 dB, extra DFE taps are desired beyond that, incurring a power penalty for
the extra latches etc.
Figure 2-19. Post-cursor amplitudes vs. channel loss
Figure 2-20 plots the normalized signaling power $P_{SIG}/P_{SIG0}$ for different channel losses. When the Nyquist loss
goes beyond 9 dB, the eye opening quickly degrades and error-free signaling without
equalization becomes impractical or even impossible near 12 dB.
Figure 2-20. The effects of channel loss and equalization on the signaling power
2.5.3 Effects of FFE and DFE
To facilitate signaling over lossy channels, equalization is often employed in high-
speed serial links. The impacts on the signaling power depend on the specific
equalization scheme.
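The FFE operates with an FIR filter in cascade with the channel. With proper tap
weights, the FIR filter inverts the channel response so that the composite frequency
response is flat up to the Nyquist frequency, i.e.
$$|H_{FFE}(f)|\,|H_{ch}(f)| = A, \quad 0 \le f \le 0.5 f_{DR}, \qquad (2\text{-}12)$$
where $H_{FFE}(f)$ and $H_{ch}(f)$ are the frequency responses of the FIR filter and the channel, and $A$ is a constant.
The peak gain of the FIR filter occurs at the Nyquist frequency, and is kept at
unity for fair comparison, i.e.
$$|H_{FFE}(0.5 f_{DR})| = 1. \qquad (2\text{-}13)$$
Equation 2-12 can then be simplified to
$$|H_{FFE}(f)|\,|H_{ch}(f)| = |H_{ch}(0.5 f_{DR})|. \qquad (2\text{-}14)$$
The signaling efficiency with FFE is then given by [34]
$$\eta_{FFE} = |H_{ch}(0.5 f_{DR})|. \qquad (2\text{-}15)$$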
The DFE, on the other hand, directly removes the ISI of the previous bits and is
better understood in the time domain. In the absence of detection errors (no error
propagation), the DFE can be analyzed in a linear fashion and the composite SBR is
$$h_{DFE}(n) = \begin{cases} 0, & 1 \le n \le N \\ h_{ch}(n), & \text{otherwise} \end{cases}$$
where $N$ is the number of DFE taps. The signaling efficiency with DFE is then given by
$$\eta_{DFE} = h_{DFE}(0) - \sum_{n \neq 0} |h_{DFE}(n)|.$$
The normalized signaling power with FFE and DFE is also plotted in Figure 2-20.
While both FFE and DFE extend the achievable data rate, DFE always yields the lowest
signaling power. For example, when the Nyquist loss is 9 dB, the signaling power with
DFE is 40% lower than that with FFE.
Figure 2-21. Effects of FFE and DFE in frequency domain
Intuitively, this benefit of DFE stems from the fact that DFE boosts the high-
frequency component [16]. This is in contrast to FFE, which merely attenuates the low-
frequency component of the signal so that the high- and low-frequency components
have the same amplitude when arriving at the RX. This is shown in Figure 2-21, which
compares the composite frequency responses with FFE and DFE of a hypothetical
channel which has an SBR of [0.8, 0.2].
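The comparison can be made concrete for the hypothetical channel with an SBR of [0.8, 0.2]. The sketch below assumes that the signaling efficiency with FFE equals the channel magnitude at the Nyquist frequency when the FFE peak gain is unity, that a 1-tap DFE cancels the single post-cursor exactly, and that the signaling power scales as the inverse of the efficiency; the variable names are illustrative.

```python
import cmath
import math


def magnitude_response(sbr, f_over_fdr: float) -> float:
    """|H(f)| of the discrete-time channel sum_n h(n) * exp(-j*2*pi*n*f/f_DR)."""
    return abs(sum(h * cmath.exp(-2j * math.pi * n * f_over_fdr)
                   for n, h in enumerate(sbr)))


SBR = [0.8, 0.2]                                    # main cursor and one post-cursor
eta_none = SBR[0] - sum(abs(h) for h in SBR[1:])    # peak distortion, no equalization
eta_ffe = magnitude_response(SBR, 0.5)              # channel gain at the Nyquist frequency
eta_dfe = SBR[0]                                    # post-cursor removed by the DFE
nyquist_loss_db = -20.0 * math.log10(eta_ffe)

for name, eta in (("none", eta_none), ("FFE", eta_ffe), ("DFE", eta_dfe)):
    print(f"{name:4s} efficiency {eta:.2f}, relative signaling power {1.0 / eta:.2f}")
print(f"Nyquist loss {nyquist_loss_db:.2f} dB")
```

For this channel the FFE brings the efficiency to the same 0.6 as the unequalized worst case, while the DFE keeps the full main cursor of 0.8, which corresponds to 25% less signaling power than with FFE under the stated assumptions.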
2.5.4 Effects of Back Termination
As shown in Figure 2-17, a typical link has termination at both the TX and RX.
Although the TX back termination helps mitigate reflections, it reduces the signal swing
by 50%, which must be compensated for by doubling the signaling power.
[Figure 2-21 plot: attenuation (dB) versus frequency, 0 to 0.5 fDR; curves: W/O EQ, W/ DFE, W/ FFE; annotation: Boosting]
Note, however, that this back termination is not necessary if the channel is relatively uniform
and good impedance matching is ensured at the RX. With the back termination
removed and assuming perfect RX matching, the signaling power now becomes
. )
Comparing Equation 2-17 to Equation 2-7, removing the back termination reduces the
signaling power by half because it doubles the impedance seen by the TX driver [35].
However, without the damping of the back termination, reflections due to RX
impedance mismatch may make multiple trips along the channel before dying out. The
resulting degradation of the eye opening must be evaluated.
The effect of RX impedance mismatch can be studied with the help of the lattice
diagram [36], as shown in Figure 2-22, where $\Gamma_{TX}$ and $\Gamma_{RX}$ are the reflection coefficients at
the TX and RX respectively. When a pulse first arrives at the RX, the transmitted pulse
is given by
. )
The reflected pulse travels back and gets fully reflected at the TX. When it arrives again
at the RX, the transmitted pulse is
, )
where $*$ denotes convolution. Since the channel DC loss is negligible, the worst case
eye opening degradation due to the first reflected pulse is
∑| |
| |. )
Similarly, the degradation due to the nth reflection is
∑| |
| | . )
The total effect is obtained by taking the sum and is
∑∑| |
∑ | |
| |. )
The signaling efficiency without back termination is therefore
2( ∑| |
| |) , )
where the factor 2 accounts for the amplitude doubling due to the removal of the back
termination.
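The lattice-diagram bookkeeping can be sketched numerically. This is a simplified illustration under stated assumptions: the channel is treated as lossless for the reflected pulses (the worst case when DC loss is negligible), the TX reflects fully (reflection coefficient of 1, as with the back termination removed), and the pulse shape is ignored so only amplitudes are tracked. The function names and the 50 Ω line impedance are assumptions.

```python
def reflection_coefficient(z_load, z0=50.0):
    """Reflection coefficient of a load z_load on a line of impedance z0."""
    return (z_load - z0) / (z_load + z0)


def reflection_amplitudes(gamma_rx, gamma_tx=1.0, n_trips=5):
    """Amplitude delivered to the RX load on the first arrival (index 0) and
    after each extra round trip, relative to the incident pulse."""
    first = 1.0 + gamma_rx  # transmitted fraction at the RX termination
    return [first * (gamma_rx * gamma_tx) ** n for n in range(n_trips + 1)]


gamma = reflection_coefficient(55.0)  # RX termination 10% above Z0
amps = reflection_amplitudes(gamma)
worst_case_isi = sum(abs(a) for a in amps[1:])  # all echoes adding against the eye
print(gamma, amps[0], worst_case_isi)
```

For a 10% mismatch the reflection coefficient is about 0.048, so each successive echo is roughly 5% of the previous one before any channel attenuation is applied.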
Figure 2-22. Lattice diagram for reflection calculation
Figure 2-23 depicts the eye opening improvement with the back termination
removed as a function of RX impedance mismatch. With 9-dB Nyquist loss and 10%
impedance mismatch, the signaling power is reduced by nearly 40%.
Figure 2-23. Eye opening vs. RX mismatch
[Figure 2-23 plot: normalized eye opening (0% to 250%) versus RX mismatch (-40% to 40%); curves: Nyquist loss = 9 dB, 12 dB]
Also plotted in Figure 2-23 is the effect of RX mismatch when the Nyquist loss is
12 dB. The eye degradation becomes more sensitive to RX mismatch without back
termination when the channel loss increases. Intuitively, this is because the main cursor
decreases with increasing channel loss, while the reflection remains the same as long
as the DC loss is negligible.
Note the above discussion assumes negligible DC loss of the channel. If the
channel has substantial DC loss, the reflections may be heavily attenuated and good
termination may not be required at either TX or RX [37].
2.5.5 Effects of Signaling and Termination Modes
The above discussion considers exclusively current-mode signaling. However,
both current-mode (CM) [16] [20] and voltage-mode (VM) [6] [38] [39] signaling have
been used for high-speed serial links. Besides, the termination may be single-ended or
differential. Their signaling powers are analyzed below.
Figure 2-24(A) shows the schematic of a current-mode frontend with single-
ended termination. The differential pair works in saturation region and steers the tail
current to either branch according to the bit being transmitted. The voltage levels at the
TX outputs are
The voltage swing and the signaling power are therefore
When the termination is differential, as shown in Figure 2-24(B), the voltage
levels become
while the single-ended voltage swing and the signaling power are the same as with
single-ended termination.
Figure 2-24. CM signaling. A) Single-ended termination. B) Differential termination.
Figure 2-25 shows the schematic for VM signaling. The transistors work in the linear
region and connect the outputs to either voltage rail according to the bit being
transmitted. Termination is provided by series resistors, either by the on-resistance of
the transistors or by explicit resistors in series with the transistors. With single-ended
termination, the voltage levels at the TX outputs are
The single-ended voltage swing and the signaling power are
For the case of differential termination, the voltage levels become
The single-ended voltage swing and the signaling power now are
It can be seen that using differential termination reduces the signaling power by 50% for
VM signaling.
Figure 2-25. VM signaling. A) Single-ended termination. B) Differential termination.
Table 2-1 summarizes the performance of current-mode and voltage-mode
drivers with single-ended and differential terminations. It can be seen that even with a
linear regulator to generate VDRV, a VM signaling with differential termination consumes
only 25% of CM signaling power.
Table 2-1. Summary of signaling and termination modes
Mode    CM    CM     VM    VM
Term.   SE    Diff.  SE    Diff.
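The relative signaling powers quoted above (50% for differential versus single-ended VM, 25% for differential VM versus CM) can be reproduced with a small sketch. It is a reconstruction from standard circuit analysis, not the original equations: it assumes R = Z0 terminations, a CM driver whose tail current sees the TX termination in parallel with the channel, a VM driver supplied through a linear regulator so that its current is drawn from the full supply, and the same single-ended swing in all cases. The function name and default values are hypothetical.

```python
def signaling_power(mode, termination, v_swing, z0=50.0, v_supply=1.2):
    """Supply power for a given single-ended swing at the TX output."""
    if mode == "CM":
        # Tail current drives the TX termination in parallel with the channel (Z0/2);
        # the text states swing and power are the same for SE and Diff. termination.
        current = v_swing / (z0 / 2.0)
    elif mode == "VM" and termination == "SE":
        # Series R = Z0 into an RX termination Z0 to ground: swing = VDRV/2.
        current = v_swing / z0
    elif mode == "VM" and termination == "Diff":
        # Single current loop through R, the 2*Z0 differential termination, and R.
        current = v_swing / (2.0 * z0)
    else:
        raise ValueError("unknown mode or termination")
    return v_supply * current


swing = 0.2  # hypothetical single-ended swing in volts
reference = signaling_power("CM", "SE", swing)
for mode, term in [("CM", "SE"), ("CM", "Diff"), ("VM", "SE"), ("VM", "Diff")]:
    print(mode, term, signaling_power(mode, term, swing) / reference)
```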
2.6 Summary
Various factors come into play when one tries to improve the power efficiency of
a high-speed serial link, with the channel posing the most difficult challenge. At
multi-Gb/s, conductor loss and dielectric loss limit the channel bandwidth and cause temporal
spreading of the transmitted pulses. To compensate for the resulting ISI, high-speed
serial links usually employ equalization such as FFE, CTLE and DFE, with each
involving a different level of complexity.
Clocking, including clock generation and clock recovery, is challenging at high
data rates and sometimes may dominate the total link power budget. Conventional
solutions such as PLL and DLL entail considerable area and power overhead due to the
PD and LPF. Injection-locking based clock generation, on the other hand, is a promising
technique because it avoids such overhead while still featuring low jitter. To reduce the
clocking power, baud-rate CDR is preferred over its oversampling counterpart, such as
the Alexander type CDR, which has found popular use in recent high-speed serial links.
Due to the low channel impedance, the signaling power, the power dissipated by
the TX driver, consumes a considerable percentage of the link power. Using the peak
distortion technique and the concept of signaling efficiency, this chapter shows the
attractiveness of DFE and VM signaling with differential termination. It is also shown
that with moderate channel loss and reasonable termination tolerance, back termination
can be removed to further reduce the signaling power.
The rest of this Dissertation reports a few TX and RX implementations
that embed the analysis results presented in this chapter. Their usefulness is
demonstrated with experimental results.
CHAPTER 3 AN ACTIVE LINK WITH AIR-CAVITY TRANSMISSION LINES
3.1 Chapter Overview
As discussed in Chapter 2, the bandwidth of transmission lines is limited primarily
by conductor loss $\alpha_c$ and dielectric loss $\alpha_d$. Because $\alpha_c$ is proportional to $\sqrt{f}$ while $\alpha_d$ is
proportional to $f$ [36], the latter mechanism dominates at high frequencies. For
conventional dielectric materials such as FR4, the dielectric loss significantly degrades
the channel bandwidth for multi-Gb/s signaling. While resorting to materials with low
loss tangents or even optics is possible, such solutions incur significant cost overhead.
Figure 3-1(A) shows the cross-section of a conventional microstrip on FR4.
Since the field of a microstrip transmission line resides in
both the air and the FR4, the effective dielectric constant $\varepsilon_{r,eff}$ lies somewhere between
$\varepsilon_{r,air}$ and $\varepsilon_{r,FR4}$. The extent to which $\varepsilon_{r,FR4}$ dominates is characterized by a
so-called filling factor $\alpha_f$ [40], which satisfies
The effective loss tangent $\tan\delta_{eff}$ can also be related to the filling factor by [40]
Because the dielectric loss $\alpha_d$ is determined by both $\varepsilon_{r,eff}$ and $\tan\delta_{eff}$ through [12]
√
reduction of the filling factor will reduce the dielectric loss.
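The dependence of the dielectric loss on the filling factor can be illustrated with a short sketch. The relations used here are the standard textbook forms for a quasi-TEM microstrip and are an assumption; they may differ in detail from the expressions in [40] and [12]. The effective permittivity is taken as 1 + q(ε_r - 1), the effective loss tangent as q·ε_r·tanδ/ε_eff, and the attenuation as π·f·sqrt(ε_eff)·tanδ_eff/c. The loss tangent of 0.02 is a hypothetical value.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def effective_permittivity(eps_r, q):
    """Effective dielectric constant for filling factor q (0 = all air, 1 = all dielectric)."""
    return 1.0 + q * (eps_r - 1.0)


def effective_loss_tangent(eps_r, tan_d, q):
    """Effective loss tangent seen by the line (assumed textbook form)."""
    return q * eps_r * tan_d / effective_permittivity(eps_r, q)


def dielectric_loss_db_per_cm(freq_hz, eps_r, tan_d, q):
    """Dielectric attenuation in dB/cm."""
    eps_eff = effective_permittivity(eps_r, q)
    tan_eff = effective_loss_tangent(eps_r, tan_d, q)
    alpha_np_per_m = math.pi * freq_hz * math.sqrt(eps_eff) * tan_eff / C0
    return alpha_np_per_m * 8.686 / 100.0  # Np/m to dB/cm


# eps_r = 4.4 for FR4; tan_d = 0.02 is a hypothetical value
for q in (0.3, 0.5, 0.7):
    print(q, dielectric_loss_db_per_cm(5e9, 4.4, 0.02, q))
```

In this model the loss grows monotonically with the filling factor, which is the motivation for moving field energy out of the FR4 and into air.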
Intuitively, since most of the field energy is confined between the signal lines and
the ground plane, if we can somehow fill the space between them with air, the filling
factor will be reduced. This can be done by employing the air-cavity microstrip structure
(also known as inverted microstrip [41]), as shown in Figure 3-1(B). Air-cavity microstrips
can be formed by selectively post-processing the FR4 boards for high-speed
interconnects. This avoids the cost overhead associated with expensive substrate
materials for non-critical signals.
Figure 3-1. Cross-sections of microstrips. A) Conventional. B) Air-cavity.
Figure 3-2 shows the simulated $\varepsilon_{r,eff}$ of conventional and air-cavity differential
microstrips, with the conductor thickness kept at 5 µm. The calculated filling factor $\alpha_f$ is
shown in Figure 3-3. It can be seen that the air-cavity microstrip has lower $\varepsilon_{r,eff}$ and $\alpha_f$,
and that when
, is reduced by 30% by employing the air-cavity. According to
Equation 3-7, such reductions translate to an improvement of 36% of $\alpha_d$, as shown in
Figure 3-4.
It should be noted that not only is the air-cavity structure attractive for low loss, it
also features lower latency for the same channel length because $\varepsilon_{r,eff}$ is reduced.
Encouraged by these results, this Chapter presents the design and fabrication of
air-cavity transmission lines, and their use in an active link. The active link features a
current-sharing frontend and speculative DFE to reduce the signaling power. Back
termination at the TX is also removed for further power saving. Experimental results
confirm the dielectric loss is reduced by 26% by the air-cavity structure. Operating at
6.25 Gb/s, the link consumes 3.7 mW, yielding a 0.6 pJ/bit power efficiency.
Figure 3-2. Simulated $\varepsilon_{r,eff}$ of conventional and air-cavity microstrip
[Plot: $\varepsilon_{r,eff}$ versus W/H; curves: Conventional, Air-cavity]
Figure 3-3. Simulated $\alpha_f$ of conventional and air-cavity microstrip
[Plot: $\alpha_f$ versus W/H; curves: Conventional, Air-cavity]
Figure 3-4. Simulated dielectric loss of conventional and air-cavity microstrip
[Plot: $\alpha_d$ (dB/cm) versus W/H; curves: Conventional, Air-cavity]
3.2 Transmission Line Design
The main design parameters of the proposed air-cavity structure include the
signal line width W and spacing S, the conductor thickness t, and the height of the air-
cavity H. The design goals include 100 Ω differential impedance, low loss and high
density. Considering the process capability, the conductor thickness is chosen to be 5
μm. For simplicity, the signal line width W and spacing S are assumed to be the same.
A meandered transmission line length of 20 cm is used as a representative channel
length for chip-to-chip interconnects [42]. The channel loss is evaluated at 5 GHz with a
target of 10 dB or less, or equivalently an attenuation constant of 0.5 dB/cm at this
frequency.
The transmission line is simulated in a 3D electromagnetic simulator. Figure
3-5(A) shows the picture of the 3D model. To reduce the requirement on computation
resources, a short line of 1 cm is simulated. The obtained S-parameters are then
cascaded to get the characteristics of longer lines.
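The cascading step can be sketched as follows. This is a minimal illustration, not the simulator flow actually used: it converts the two-port S-parameters of the 1-cm cell to transfer (T) parameters, multiplies the T-matrices, and converts back. The helper names and the 0.4 dB, 0.5 rad unit cell are hypothetical.

```python
import cmath
import math


def s_to_t(s):
    """Two-port S-parameters to transfer parameters, with [b1, a1] = T [a2, b2]."""
    (s11, s12), (s21, s22) = s
    det = s11 * s22 - s12 * s21
    return ((-det / s21, s11 / s21), (-s22 / s21, 1.0 / s21))


def t_to_s(t):
    """Inverse of s_to_t."""
    (t11, t12), (t21, t22) = t
    return ((t12 / t22, t11 - t12 * t21 / t22), (1.0 / t22, -t21 / t22))


def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def cascade(s_unit, n):
    """S-parameters of n identical cells connected in series."""
    t_unit = s_to_t(s_unit)
    t_total = t_unit
    for _ in range(n - 1):
        t_total = matmul2(t_total, t_unit)
    return t_to_s(t_total)


# Hypothetical matched 1-cm cell: 0.4 dB loss and 0.5 rad phase delay
s21 = 10 ** (-0.4 / 20) * cmath.exp(-0.5j)
unit = ((0j, s21), (s21, 0j))
line = cascade(unit, 20)
print(-20 * math.log10(abs(line[1][0])))  # total insertion loss in dB
```

For a perfectly matched cell the losses simply add in dB; the T-matrix product also captures the ripple that appears once the cell has nonzero S11 and S22.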
Figure 3-5(B) shows the simulated air-cavity loss performance at 5 GHz at
various signal line widths. While the conductor loss decreases with increasing conductor
sizes due to larger effective conducting surface area, the dielectric loss stays relatively
constant since it is primarily determined by the material properties. From a loss
reduction perspective, it is desirable to use as large a W as possible. However, to achieve
the desired impedance, a proper W/H ratio must be maintained. The fabrication process limits
the air-cavity height to about 20 μm. Accordingly, the final W is chosen to be 40 μm,
which gives an 8 dB total loss for a 20 cm channel at 5 GHz. The transmission line
dimensions are listed in Table 3-1.
Figure 3-5. Picture of the 3D model and simulated loss at various line widths
Table 3-1. Final air-cavity microstrip dimensions
W        S        t       H
40 µm    40 µm    5 µm    19 µm
Figure 3-6 compares the dielectric loss in the proposed air-cavity transmission
line and the conventional FR4-based microstrip transmission line (in dB/cm) with the
same conductor width and spacing. The air-cavity structure reduces the dielectric loss
by around 26%.
Figure 3-6. Simulated dielectric loss of air-cavity and conventional transmission lines
The effective dielectric constants are calculated from the simulated phase
characteristic. The air-cavity structure reduces the effective dielectric constant by 25%
from 2.75 to 2.07.
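The phase-to-permittivity step can be sketched as follows. This is a minimal illustration assuming a quasi-TEM line, where the unwrapped S21 phase delay at a given frequency and line length sets the propagation delay and hence the effective dielectric constant. The function names are hypothetical; the same relation also gives the latency benefit mentioned earlier.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def eps_eff_from_phase(phase_rad, freq_hz, length_m):
    """Effective dielectric constant from the unwrapped S21 phase delay (positive radians)."""
    delay = phase_rad / (2.0 * math.pi * freq_hz)
    return (C0 * delay / length_m) ** 2


def delay_ps_per_cm(eps_eff):
    """Propagation delay per centimeter of line, in picoseconds."""
    return math.sqrt(eps_eff) / C0 * 0.01 * 1e12


for eps in (2.75, 2.07):  # simulated values quoted in the text
    print(eps, delay_ps_per_cm(eps))
```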
[Figure 3-5 labels: A) 3D model with Signal P, Signal N, Ground, FR4. B) S21 (dB/cm) versus width (μm); curves: Total Loss, Dielectric Loss, Conductor Loss]
[Figure 3-6 labels: dielectric loss (dB/cm) versus frequency (GHz), 0 to 20 GHz; curves: Conventional (FR4 tanD), Air-cavity (Airgap tanD)]
Figure 3-7 compares the simulated losses of conventional and air-cavity
transmission lines. The loss of the air-cavity transmission line is 0.25 dB/cm at 3.125
GHz and is 8% less than the conventional structure. Figure 3-8 shows the signaling
power reduction with the air-cavity structure assuming FFE and DFE respectively. The
improvement of air-cavity topology becomes more pronounced at higher frequencies as
the dielectric loss becomes more significant. For example, at 10 GHz, the loss
improvement is nearly 15%, and for a 20-cm channel the signaling power is reduced by
more than 10% with DFE and 16% with FFE. It is therefore expected that the air-cavity
structure is especially attractive for future high-speed interconnects.
Figure 3-7. Improvement with air-cavity transmission line
[Plot: loss (dB/cm) versus frequency (GHz), 0 to 20 GHz; curves: Conventional, Air-cavity]
Figure 3-8. Signaling power reduction with air-cavity. A) With FFE. B) With DFE.
[Contour plots: channel length (cm) versus data rate (Gb/s); reduction bands from 10%-15% to 50%-55% in A) and from 0%-5% to 25%-30% in B)]
3.3 Fabrication
Figure 3-9 illustrates the process flow for fabricating the proposed air-cavity
interconnects. The process begins with electroplating the first copper pattern on an FR4
substrate representing the differential signal lines (Figure 3-9(A)). Following this step, a
sacrificial polymer layer is spin-coated with desired thickness and patterned to act as a
temporary placeholder in the formation of the air-cavity (Figure 3-9(B)). The sacrificial
polymer contains poly-propylene carbonate (PPC) (Novomer Inc., Ithaca, NY). A
photoacid generator is added in order to obtain a photosensitive polymer mixture, and
γ-butyrolactone (GBL) serves as the solvent. A similar formulation is available as Unity
2203P from Promerus LLC, Brecksville, OH. Two different approaches for patterning
the PPC layer are studied: photo-patterning and self-patterning [43]. For
photo-patterning, a photo mask is used. With the PPC self-patterning process, no
photo mask is needed, and the slightly sloped sidewalls of the PPC patterns give the
subsequent layers better step coverage. The copper ground layer
is then patterned on top of the PPC patterns. The entire surface is then overcoated with
Avatrel 8000P (functionalized polynorbornene) to hermetically seal the transmission line
and provide mechanical support for the top ground copper layer (Figure 3-9(C)). PPC
polymer backbone unzipping occurs upon heating up to 220°C during Avatrel overcoat
curing, during which the solid PPC is converted to gaseous products. The
gaseous products gradually permeate through the overcoat sidewalls and the openings in
the ground layer patterns, leaving an air-cavity region of the same physical shape as the
patterned PPC with little residue (Figure 3-9(D)), thus forming the air-cavity transmission line
structure. The overcoat also serves as solder mask for later die and cable
attachments.
Figure 3-9. Fabrication process for the air-cavity structure
[Panels (A) to (D); layer labels: FR4, PPC, Avatrel, Air]
Figure 3-10. Picture and cross-section of the fabricated air-cavity structure
[Labels: Signal, Ground, Air-cavity]
Figure 3-10(A) shows the picture of the finished air-cavity differential
transmission lines. The ground plane is patterned in a grid style, with holes for gas
release during PPC evaporation. Figure 3-10(B) shows a cross-section of the finished
air-cavity structure.
3.4 Link Implementation
3.4.1 Link Architecture
Figure 3-11 shows the block diagram of the link. The RX has a common-gate
(CG) preamp and a half-rate 1-tap speculative DFE. The TX consists of a half-rate $2^7-1$
PRBS core, a MUX, and an open-drain driver. To reduce signaling power, the back
termination at the TX output usually found in high-speed serial links is removed in this
design. For the same voltage swing seen by the RX, removing the back termination
reduces the required signaling power by 50% because it doubles the impedance seen
by the TX.
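For readers unfamiliar with the test pattern, a behavioral sketch of a PRBS generator of length 2^7-1 follows. The feedback polynomial x^7 + x^6 + 1 is the commonly used PRBS7 choice and is an assumption here, since the text does not state which polynomial the on-chip core uses. The split into two half-rate streams only illustrates how a 2:1 MUX would re-interleave them and is not a model of the actual circuit.

```python
from itertools import islice


def prbs7(seed=0x7F):
    """Yield bits from a 7-bit Fibonacci LFSR (assumed polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be nonzero")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def half_rate_streams(bits):
    """Split a full-rate sequence into even and odd streams for a 2:1 MUX."""
    return bits[0::2], bits[1::2]


sequence = list(islice(prbs7(), 254))
d0, d1 = half_rate_streams(sequence)
print(len(sequence), sum(sequence[:127]))
```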
Figure 3-11. Link block diagram
[Block diagram labels: TX with $2^7-1$ PRBS, TX MUX and driver (D0, D1, CKTX); 20-cm air-cavity channel; RX with CG amp, offset control, impedance control and DFE with +h/-h slicers (Q0, Q1, CKRX); current-sharing frontend]
Channel equalization is primarily done by the DFE for better power efficiency as
discussed above. However, because DFE only cancels post-cursors, a 2-tap FFE is still
built in the TX driver for pre-cursor cancellation. Note that this TX FFE can also be
configured for post-cursor cancellation, and facilitates the comparison between FFE and
DFE in terms of power efficiency.
3.4.2 TX Design
The latches, multiplexers and drivers in the TX are all implemented in current-mode
logic (CML) for fast operation and good power-supply noise immunity, as shown in
Figure 3-12. Considering the fact that the pre-cursor is usually only a fraction of the
main cursor, the pre-cursor driver is sized at half of the main cursor driver. The
multiplexers are sized in such a manner that the signal path comprised of the latch, the
multiplexer and the driver has a uniform fan-out.
Figure 3-12. Schematics of the latch and multiplexer. A) Latch. B) Multiplexer
Figure 3-13. Schematic of the 5-b DAC
To facilitate debugging and testing, a serial interface is integrated on-chip. The
bias currents of all the gates are controlled with 5-b DACs, the schematic of which is
shown in Figure 3-13.
3.4.3 RX Design
3.4.3.1 Preamp design
The RX consists of the CG preamp and the DFE. The CG frontend at the RX side
serves multiple purposes. First, it provides low-to-high impedance transformation and
increases the voltage swing seen by the following DFE stage. This accommodates a
smaller input voltage swing, which is important for high power efficiency as discussed
before. Second, it accomplishes level-shifting of the input signal so that NMOS input
stages can be used in the DFE. Third, the input impedance looking into the source of
the CG amplifier provides partial impedance matching for the channel.
The most important design metrics of the CG preamp are bandwidth and gain,
which are both closely related to power. With the bandwidth design target set to 67% of
the data rate, or 4.2 GHz for 6.25 Gb/s NRZ signaling, the gain of the CG preamp is
optimized for minimum link power. A higher preamp gain yields better RX sensitivity and
lower signaling power, but requires more power for the preamp. For a given channel
condition and technology, an optimum gain therefore exists that minimizes the total
frontend power PFE.
Figure 3-14. Preamp model for gain optimization
Figure 3-14 shows a preamp model for gain optimization. For a given load
capacitance $C_L$, gain A and 3-dB bandwidth $\omega_{3dB}$, the following equations hold:
where W is the transistor width, $g_m'$ is the transistor transconductance per unit width, R
is the load resistance, and $c_d'$ is the transistor drain capacitance per unit width. For each
transistor current density $J$, W and R can be solved and the amplifier current is found to
be
Figure 3-15. Preamp design. A) Amplifier current vs. current density. B) Frontend power vs. preamp gain
[Plot A: IAMP (mA) versus current density (uA/um), 1 to 1000 on a log axis; one curve per gain A = 1 to 8. Plot B: power (mW) versus preamp gain]
Figure 3-15(A) plots the amplifier current as a function of current density at different gains in the
target 0.13-μm CMOS technology when driving the four slicers of the DFE. For each
gain, there exists an optimum current density, and the optimum current density
increases with increasing gain.
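The existence of an optimum current density can be illustrated with a small numerical sketch. It is a reconstruction under stated assumptions rather than the original derivation: the preamp is modeled as a single-pole stage with gain A = g_m'·W·R and bandwidth 1/(R·(C_L + c_d'·W)), the device follows a hypothetical square-law model g_m' = sqrt(2·k·J), and the values of k, c_d' and C_L are placeholders, not extracted from the 0.13-μm process.

```python
import math


def preamp_design(gain, bw_rad, c_load, gm_per_w, cd_per_w):
    """Solve A = gm'*W*R and w3dB = 1/(R*(C_L + cd'*W)) for width W and load R."""
    denom = gm_per_w - gain * bw_rad * cd_per_w
    if denom <= 0:
        return None  # gain-bandwidth target not reachable at this current density
    width = gain * bw_rad * c_load / denom
    r_load = gain / (gm_per_w * width)
    return width, r_load


def gm_per_width(j, k=2.0e-3):
    """Hypothetical square-law device: gm' = sqrt(2*k*J), J in A/um, gm' in S/um."""
    return math.sqrt(2.0 * k * j)


def amp_current(gain, j, bw_rad, c_load, cd_per_w=1.0e-15):
    """Amplifier current J*W in amperes, or None if the target is infeasible."""
    sol = preamp_design(gain, bw_rad, c_load, gm_per_width(j), cd_per_w)
    return None if sol is None else j * sol[0]


def optimum_density(gain, bw_rad, c_load, densities):
    """Current density from the sweep that minimizes the amplifier current."""
    feasible = [(amp_current(gain, j, bw_rad, c_load), j) for j in densities]
    return min((i, j) for i, j in feasible if i is not None)[1]


bw = 2.0 * math.pi * 4.2e9                                   # 4.2 GHz target from the text
sweep = [1e-6 * 10 ** (n / 50) for n in range(151)]          # 1 to 1000 uA/um
for a in (2, 4, 8):
    print(a, optimum_density(a, bw, 50e-15, sweep))
```

With the square-law assumption the optimum density scales with the square of the gain, which is consistent with the trend described for Figure 3-15(A), although the real device curves will differ.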
Figure 3-15(B) shows the signaling power, the preamp power and the frontend
power at different gain with optimum current density over a channel with 9-dB Nyquist
loss. The slicer sensitivity is 100 mV, and it is assumed that DFE is used and that back
termination is removed. The minimum frontend power is attained when the preamp gain
is around 4, and is about 50% lower than the case without the preamp.
The frontend power is further reduced with a current-sharing frontend, as shown
in Figure 3-11. By stacking the CG preamp and the open-drain TX driver, the tail current
of the TX driver is reused by the RX amplifier. According to Figure 3-15(B), this reduces
the frontend power by nearly 50%. The fact that the TX driver is powered from the RX
supply also helps to suppress the noise coupling from the TX supply.
Back termination is removed in this work to reduce signaling power. The
downside of this practice is the risk of potential reflections due to TX impedance
mismatch. To mitigate the effect of reflections, good impedance matching at the RX
side must be maintained. Since the input impedance of the CG frontend is bias
dependent and non-linear, a programmable resistor is connected across the RX inputs
to provide a better matching, as shown in Figure 3-16(A). The programmable range of
the resistor is chosen so that a differential input impedance of 100 Ω is maintained over
a wide bias range between 0.5 mA and 5 mA, as shown in Figure 3-16(B). Figure 3-17
compares the RX eye diagrams with and without back termination. It can be seen that,
as expected, removing the back termination nearly doubles the RX eye opening without
any noticeable degradation of the eye quality. Given the same RX sensitivity, this
means the signaling power is reduced by nearly 50%.
Figure 3-16. Input impedance tuning. A) Schematic. B) Simulated result.
[Plot B: ZDM (Ω) versus tail current (mA)]
Figure 3-17. Simulated RX eye diagrams. A) With back termination. B) Without back termination.
[Eye diagrams: voltage (mV) versus time (UI); annotated eye openings of 64 mV in A) and 126 mV in B)]
To prevent the RX sensitivity degradation due to small transistor sizes, offset
cancellation is also built into the CG amplifier, as shown in Figure 3-11. The polarity and
magnitude of the offset cancellation are all adjustable via digital control.
3.4.3.2 DFE design
The DFE employs a speculative architecture and half-rate clocking to ease timing
requirement. The slicers are implemented as CML latches with adjustable built-in offset,
as shown in Figure 3-18. When the latch is in its amplification phase (CKP is HIGH), an
auxiliary differential amplifier injects static current into the output nodes to introduce a
desired offset. This is in contrast to [44], where the offset is introduced during the
regeneration phase. This leads to more robust latch operation since the regenerative
gain is not affected by the offset injecting differential pair. Another highlight of the DFE
design is that a single latch stage is employed before the selector, unlike [44] where a
complete flip-flop is used. To account for different channel profiles, both the polarity and
the magnitude of the offset injecting current are programmable via an on-chip serial
interface. The programmable range of the slicer threshold is simulated to be ±140 mV,
which is large enough to account for different DFE tap weights required by different
channel profiles.
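The speculative (loop-unrolled) decision process can be sketched behaviorally. This is a simplified full-rate model with ideal slicers and hypothetical names; the half-rate interleaving, the CML latches and the offset-injection circuit of the actual design are not modeled. Each sample is sliced against both +h and -h, and the previous decision selects which result is kept.

```python
def channel(bits, h1, first_prev=-1):
    """Toy channel: main cursor of 1 plus one post-cursor of weight h1."""
    out, prev = [], first_prev
    for d in bits:
        out.append(d + h1 * prev)
        prev = d
    return out


def speculative_dfe(samples, h1, first_prev=-1):
    """1-tap speculative DFE: slice against +h1 and -h1, then select by the previous bit."""
    decisions, prev = [], first_prev
    for y in samples:
        d_if_prev_high = 1 if y > +h1 else -1  # slicer offset for previous bit = +1
        d_if_prev_low = 1 if y > -h1 else -1   # slicer offset for previous bit = -1
        prev = d_if_prev_high if prev > 0 else d_if_prev_low
        decisions.append(prev)
    return decisions


bits = [1, -1, -1, 1, 1, 1, -1, 1]
received = channel(bits, h1=0.5)
print(speculative_dfe(received, h1=0.5) == bits)
```

Because both candidate decisions are available before the previous bit resolves, the feedback loop only has to close through the selector, which is what relaxes the timing requirement.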
Figure 3-18. Slicer schematic
The designs of the CML latches and multiplexers in the DFE are the same as those in the
TX except for sizing. Unlike the multiplexers in the TX, which see the large input
capacitances of the pre-cursor and main cursor drivers, the multiplexers in the RX only
see the CML latch inputs. Accordingly, they are sized the same as the latches to save
power.
3.5 Experimental Results
To evaluate the performance of the proposed air-cavity structure, a test board is
designed. The layout of the test board is shown in Figure 3-19. The center area is
occupied by the active link, which includes footprints for a TX chip and an RX chip, and the
air-cavity transmission lines. The rectangular board for the active link is cut using a dicing
saw and interfaced with test equipment to evaluate overall link performance. CPW lines
are used to connect the SMA connectors to the chip footprint.
Figure 3-19. Layout of the test board with the air-cavity active link
[Layout labels: Active Link (TX, RX) in the center; Test Structures at the top and bottom; 4” board diameter]
The top and bottom areas of the test board are used to implement air-cavity test
structures of various lengths. To improve measurement accuracy, open-short-thru de-
embedding structures are also implemented. To facilitate processing, custom alignment
marks are placed at multiple locations. The entire board footprint is designed to fit into a
circular area with a diameter of 4” to accommodate the in-house fabrication capabilities.
3.5.1 Air-Cavity Transmission Line Measurement
The performance of the air-cavity transmission line was obtained by measuring a
5-cm test structure using a vector network analyzer with high-frequency probes. Figure
3-20 shows the measured loss and phase responses. The effective dielectric constant is
calculated to be 1.7 from the measured phase, which is lower than predicted by simulation.
This is probably because the dielectric constant of the base material is lower (~3.9) than
the 4.4 used in the previous simulations. The lower dielectric constant also leads to higher
line impedance, which causes ripples in the measured loss due to impedance mismatch
[12].
Figure 3-20. Measured performances of a 5-cm air-cavity microstrip. A) Loss. B) Phase.
[Plot A: loss (dB) versus frequency (GHz), 0 to 20 GHz. Plot B: phase (degree) versus frequency (GHz), 0 to 20 GHz]
The true loss of the line (excluding the effects of impedance mismatch) is
calculated from extracted propagation constant using the technique in [45], and the
result is shown in Figure 3-21. The loss is 0.28 dB/cm at 3.125 GHz, which readily
meets our design goal. Simulation result (with ) is also overlaid for comparison,
demonstrating good agreement between measurement and simulation.
Figure 3-21. Loss of the air-cavity line
3.5.2 Link Measurement
The TX and RX test chips are fabricated in a 0.13-μm 1.2-V CMOS process. Figure
3-22 shows the chip micrographs. The TX and RX cores occupy 0.03 mm² and 0.02
mm², respectively. The test chips are wire-bonded to QFN packages and mounted on
the test board with a 20-cm (8”) air-cavity interconnect. Figure 3-23 shows the picture of
the populated test board with air-cavity lines in the center of the board.
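[Figure 3-21 plot: loss (dB/cm) versus frequency (GHz), 0 to 20 GHz; curves: Measurement, Simulation]
Figure 3-22. Chip micrographs of the TX and the RX
[Micrograph labels: each chip 1.3 mm × 1.5 mm; Transmitter core 200 μm × 150 μm; Receiver core 200 μm × 100 μm]
Figure 3-23. Picture of the populated test board
Figure 3-24. Test setup
[Setup labels: 3.125 GHz, Splitter, Delay, TX Balun, RX Balun, CK TX, CK RX, TX OUT, RX IN, RX OUT, Air-Cavity, Scope or BERT, Trigger]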
The test setup is depicted in Figure 3-24. The TX and RX work mesochronously,
deriving their clocks from the same signal generator, with their phase relationship
adjusted by a mechanically-tunable delay-line.
The full link operates successfully at 6.25 Gb/s with a half-rate input clock of
3.125 GHz. Figure 3-25(A) and Figure 3-25(B) show the measured single-ended eye-
diagrams at the outputs of the RX CG amplifier (driven off-chip for testing purpose)
before and after enabling the TX FFE respectively. The closed eye-diagram is
successfully opened by enabling TX FFE. Figure 3-25(C) shows the eye-diagram at the
output of the DFE for a $2^7-1$ PRBS pattern, with the corresponding transient waveform
shown in Figure 3-25(D). The correct $2^7-1$ PRBS sequence is verified with both visual
inspection and BER measurements.
Figure 3-25. Measured waveforms
Figure 3-26 shows the measured RX bathtub curves and energy-per-bit
performance with different equalization settings. At 6.25 Gb/s and a BER of $10^{-12}$ with
only the TX FFE enabled, the eye opening is 30% UI. Enabling the RX DFE and
disabling the TX FFE improves the eye opening to 37% UI, while the overall power efficiency
improves from 0.9 to 0.6 mW/(Gb/s). Enabling both FFE and DFE further
improves the horizontal eye opening to 56% UI but decreases the power efficiency.
When the link is operated at 6.25 Gb/s with only the DFE enabled, the TX core, the
current-sharing front-end, and the DFE dissipate 1.44 mW, 1.2 mW and 1.06 mW,
respectively.
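The energy-per-bit figure follows directly from the measured block powers; the short check below only restates that arithmetic. The function and dictionary names are illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Power in mW divided by data rate in Gb/s gives energy in pJ/bit."""
    return power_mw / data_rate_gbps


# Measured block powers at 6.25 Gb/s with only the DFE enabled (from the text)
blocks_mw = {"TX core": 1.44, "current-sharing front-end": 1.2, "DFE": 1.06}
total_mw = sum(blocks_mw.values())
print(round(total_mw, 2), round(energy_per_bit_pj(total_mw, 6.25), 3))
```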
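Figure 3-26. Measured link performances. A) RX bathtub curves. B) Power efficiency.
[Plot A: BER ($10^{-12}$ to $10^{0}$) versus time (UI) at 6.25 Gb/s; curves: FFE, DFE, FFE+DFE. Plot B: power efficiency (pJ/bit) versus data rate (Gb/s); curves: FFE Only, DFE Only, FFE+DFE]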
Table 3-2 summarizes the link performance in relation to a recently published
paper. Compared to previously published results, a large portion of the TX and RX
power is decreased using the current-sharing frontend.
Table 3-2. Performance summary
                    This work                [7]
Technology          0.13 μm                  65 nm
Supply voltage      1.2 V                    1.0 V
Data rate           6.25 Gb/s                12.5 Gb/s
Front-end swing     125 mV                   100 mV
BER                 1e-12                    1e-12
Horizontal eye      56% UI @ 6.25 Gb/s       -
Power               3.7 mW                   12 mW
Energy-per-bit      0.6 pJ/bit               0.98 pJ/bit
TX/RX core area     0.03 mm² / 0.02 mm²      0.24 mm² / 0.24 mm²
3.6 Summary
The bandwidth of the channel poses difficult challenges for high-speed serial
links. At high frequencies, dielectric loss dominates over conductor loss. The design and
fabrication of the air-cavity transmission line structure are presented in this Chapter to
reduce the dielectric loss. The measured effective dielectric constant is 1.73 and the
loss is about 0.4 dB/cm.
The air-cavity transmission lines are used in an active link. The active link
features a low-power current-sharing frontend with a 1-tap speculative DFE. To further
reduce power consumption, the back termination is also removed. The active link
achieves successful 6.25 Gb/s operation and consumes 3.7 mW off a 1.2 V power
supply, demonstrating the potential of the techniques for future low-power high-speed
interconnects.
CHAPTER 4 A 4.5-Gb/s 12.4-mW RX WITH BAUD-RATE CDR
4.1 Chapter Overview
The receiver presented in Chapter 3 does not include CDR, an essential function
in high-speed receivers as discussed in Chapter 2. CDR in high-speed serial links is
usually achieved with oversampling. However, oversampling CDRs have a few issues.
One of the issues is explained in Chapter 2, which is the requirement for power-hungry
clock generation and distribution with sub-bit-time resolution. The second issue with
oversampling CDR lies in its assumption that the maximum voltage margin occurs at the
eye center [31]. When the input eye is horizontally asymmetric, locking to the eye center
may lead to sub-optimal voltage margin. The third issue with oversampling CDR is that
it tightens the already challenging settling-time requirement for the DFE [17] [46]. Because
the input signal is oversampled, the time allowed for the DFE to settle is now less than
one UI. Sampling at the eye edges may also require
dedicated edge equalization, since the edge samples experience different ISI than the
data samples, as shown in Figure 4-1. For low-power high-speed serial link design, a
baud-rate CDR that circumvents these issues is therefore of interest.
Figure 4-1. Different ISI seen by the edge and data samples
In this Chapter we present an RX with a novel digital baud-rate eye-tracking CDR
which employs an auxiliary slicer (CDR slicer) with adjustable threshold voltage. By
jointly updating the sampling phase and the threshold voltage of the CDR slicer, the
CDR loop drives the decision point of the CDR slicer to the peak of the eye opening,
and thus automatically locks to the maximum voltage-margin point. Because the CDR
slicer samples at exactly the same instant as the main data slicers, it does not interfere
with DFE operation.
We also present a majority-voting DFE architecture that replaces the selectors in
a traditional speculative DFE with majority-voters. Compared to a selector, a majority
voter is more amenable to low-power and high-speed designs because it reduces the
transistor stacking levels and features equal delay to all data inputs. A majority-voter
also eliminates the need for a level shifter in bipolar designs.
A receiver was implemented with the proposed CDR scheme and the majority-voting
DFE. Details of the RX implementation are given in this Chapter, together with
measurement results that confirm the correct function of both techniques.
4.2 Baud-Rate CDR
A few baud-rate CDR schemes have been proposed in the past. The Mueller-
Muller CDR [47], used in several recently published serial link receivers [8] [46],
operates by adjusting the clock phase so that the sampled pulse response satisfies a
predefined timing criterion. However, this type of CDR does not necessarily ensure
maximum voltage margin of the sampled eye at lock. The CDR in [48] improves the
voltage sampling margin but is only suitable for integrating-type RX frontends. The
baud-rate CDR in [7] relies on auxiliary slicers that have a larger sampling window than
the main data slicers to keep the sampling phase away from the eye edges, but it does
not take into account the voltage margin. Another baud-rate CDR reported in [49] locks
to the maximum voltage-margin point, but requires analog slope detection circuitry and
is therefore not as amenable to technology scaling and migration as digital solutions.
Figure 4-2. CDR block diagrams. A) Alexander CDR. B) Proposed baud-rate CDR.
Figure 4-2 shows the block diagram of the Alexander CDR and the proposed
baud-rate CDR. The Alexander CDR employs two slicers, sampling half-UI away from
each other, hence 2× oversampling. The PD in an Alexander CDR only produces
information for updating the clock phase. The proposed CDR also employs two slicers
(main and CDR slicers). However, unlike the Alexander CDR, these two slicers sample
the input signal at the same time, therefore no oversampling is involved. The PD in the
proposed CDR not only controls the clock phase, but also the offset of the CDR slicer.
The algorithm of the proposed CDR drives the sampling point of the CDR
slicer to the position with maximum vertical eye opening. Since the CDR slicer and the
main slicer are triggered by the same clock phase, this automatically locks the clock
phase to the point with maximum voltage margin.
The operation principle of the proposed CDR is explained with the help of the
CDR truth table shown in Table 4-1, where an “×” denotes “don’t care”. $D_{n-1}$, $D_n$ and
$D_{n+1}$ are three consecutive outputs of the data slicer, $D_{CDR}$ is the output of the CDR
slicer sampled at the same time as $D_n$, and $V_{TH}$ is the threshold voltage of the CDR
slicer. The CDR takes action only when $D_n$ is 1, tracking only the upper part of the eye.
The discussion below therefore considers the case $D_n = 1$ exclusively. If higher
CDR bandwidth is desired, the lower portion of the eye can also be utilized using an
additional CDR slicer.
Table 4-1. CDR truth table
$D_{n-1}$  $D_n$  $D_{n+1}$  $D_{CDR}$  $V_{TH}$ update  Clock phase update
0          1      0          0          ↓                --
0          1      0          1          ↑                --
0          1      1          0          ↓                →
0          1      1          1          ↑                ←
1          1      0          0          ↓                ←
1          1      0          1          ↑                →
1          1      1          0          --               --
1          1      1          1          --               --
×          0      ×          ×          --               --
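The truth table can also be read as a small lookup function. The Python sketch below is only an illustration of Table 4-1, not the synthesized CDR logic. The function name, the argument names, and the encoding of the arrows as +1 (↑ or →) and -1 (↓ or ←) are assumptions made here for illustration.

```python
def cdr_update(d_prev, d_curr, d_next, d_cdr):
    """Return (threshold_step, phase_step) for one CDR slicer sample.

    threshold_step: +1 for an up arrow, -1 for a down arrow, 0 for no update.
    phase_step:     +1 for a right arrow, -1 for a left arrow, 0 for no update.
    All inputs are bits (0 or 1), following the columns of Table 4-1.
    """
    # Lower half of the eye (current bit is 0) or pattern (111): no update.
    if d_curr != 1 or (d_prev == 1 and d_next == 1):
        return 0, 0

    # The threshold follows the CDR slicer decision.
    threshold_step = 1 if d_cdr == 1 else -1

    # Pattern (010) carries no phase information.
    if d_prev == 0 and d_next == 0:
        return threshold_step, 0

    if d_prev == 0:  # pattern (011)
        phase_step = 1 if d_cdr == 0 else -1
    else:            # pattern (110)
        phase_step = -1 if d_cdr == 0 else 1
    return threshold_step, phase_step
```

Each call corresponds to one row of the table; a loop filter would accumulate these steps before moving the DAC code or the phase control word.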
Figure 4-3(A) illustrates an example eye diagram. The upper portion of the eye is
divided into five numbered regions by the different waveform trajectories corresponding
to input patterns (010), (011), (110) and (111). According to Table 4-1, the CDR updates
only when the data pattern $D_{n-1}D_nD_{n+1}$ equals (010), (011) or (110), since pattern (111)
does not contain any timing information. Assuming equal probability for pattern
occurrences, the CDR behavior is summarized in Table 4-2 and Table 4-3 and is
graphically depicted in Figure 4-3(B), where the circles indicate possible decision points
of the CDR slicer, the vertical arrows indicate the $V_{TH}$ updating direction, and the
horizontal arrows indicate the clock phase updating direction. By inspecting Figure
4-3(B), it can be seen that the CDR drives the CDR slicer’s decision point until it dithers
around the maximum eye-opening position (denoted by a star). Since the CDR slicer
and the DFE are clocked at the same phase, this automatically locks the DFE to the
maximum voltage-margin point.
The proposed CDR has a few noteworthy advantages. First, baud-rate operation
saves clocking power by eliminating the need to generate extra clock phases for
oversampling. Second, the CDR automatically locks to the point with maximum voltage
margin without using any eye-opening monitor circuits. Third, the proposed CDR does
not constrain the frontend interface to any particular architecture. Moreover, decimation
of the CDR slicer output is easily accommodated in this CDR, whereas in some other
schemes this may be constrained because they require consecutive CDR slicer results
[46]. It should also be noted that the CDR slicer can be reused for equalization
adaptation to reduce hardware and power overhead.
Figure 4-3. Operation principle of the proposed baud-rate CDR
Table 4-2. $V_{TH}$ update
Region  (010)  (011)  (110)  (111)  Total
1       ↑      ↑      ↑      --     ↑
2       ↓      ↑      ↑      --     ↑
3       ↓      ↓      ↑      --     ↓
4       ↓      ↑      ↓      --     ↓
5       ↓      ↓      ↓      --     ↓
Table 4-3. Clock phase update
Region  (010)  (011)  (110)  (111)  Total
1       --     ←      →      --     --
2       --     ←      →      --     --
3       --     →      →      --     →
4       --     ←      ←      --     ←
5       --     →      ←      --     --
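The "Total" columns of Table 4-2 and Table 4-3 can be cross-checked against Table 4-1 by summing the per-pattern votes in each region. The sketch below is an illustrative check, not part of the design. It assumes equal pattern probabilities, encodes ↑ and → as +1 and ↓ and ← as -1, and reads the CDR slicer output in each region off Table 4-2 (an up arrow means the slicer output was 1). All names are hypothetical.

```python
# (D_prev, D_curr, D_next, D_CDR) -> (threshold vote, phase vote), per Table 4-1.
TRUTH = {
    (0, 1, 0, 0): (-1, 0),  (0, 1, 0, 1): (+1, 0),
    (0, 1, 1, 0): (-1, +1), (0, 1, 1, 1): (+1, -1),
    (1, 1, 0, 0): (-1, -1), (1, 1, 0, 1): (+1, +1),
}

# Patterns that trigger an update; (111) never does, so it is omitted.
PATTERNS = [(0, 1, 0), (0, 1, 1), (1, 1, 0)]

# CDR slicer output per region for patterns (010), (011), (110).
REGION_DCDR = {
    1: (1, 1, 1),
    2: (0, 1, 1),
    3: (0, 0, 1),
    4: (0, 1, 0),
    5: (0, 0, 0),
}


def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def net_update(region):
    """Net (threshold, phase) direction in a region after summing all votes."""
    votes = [TRUTH[p + (d,)] for p, d in zip(PATTERNS, REGION_DCDR[region])]
    threshold = sign(sum(v[0] for v in votes))
    phase = sign(sum(v[1] for v in votes))
    return threshold, phase
```

With this encoding, regions 1 and 2 give a net threshold increase with no net phase change, regions 3 and 4 give opposite phase corrections, and region 5 only lowers the threshold, which matches the two tables.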
4.3 Majority-Voting DFE
DFE has been used extensively in high-speed links to compensate for inter-symbol
interference (ISI) in band-limited electrical channels [12] [17] [16] due to its
noise immunity and high signaling power efficiency, as explained in Chapter 2. To relax the
stringent timing requirement, a speculative DFE architecture [19] [50] is often used. As
shown in Figure 4-4, a 1-tap speculative DFE makes two tentative decisions, $D_n^{+}$ and
$D_n^{-}$, assuming the previous bit is $+1$ and $-1$ respectively, and then the correct decision
is selected by $D_{n-1}$. The timing requirement for the DFE loop can be written as
$$t_{sel} + t_{ck\text{-}q} + t_{setup} < 1\ \mathrm{UI} \qquad (4\text{-}1)$$
where $t_{sel}$ is the selector delay, and $t_{ck\text{-}q}$ and $t_{setup}$ are the delay and setup time of
the CML DFF.
Figure 4-4. Block diagram of a 1-tap speculative DFE
From Equation 4-1, the selector and flip-flop delays in the critical timing path
determine the maximum operating speed of the 1-tap speculative DFE. While significant
work has been published on CML latches/FFs [51] [52], the following observations can
be made regarding the operation of a CML selector, which is shown in Figure 4-5. First,
because the selection of the current bit decision is made by series-connecting the
previous bit $D_{n-1}$, the CML selector employs three transistors in the stack (including the
tail current source), and is therefore not optimal for low-voltage/low-power designs. Second, to
maximize the timing margin of the critical DFE feedback loop, it is desirable to minimize
the delay from $D_{n-1}$ to $D_n$, yet in Figure 4-5, $D_{n-1}$ experiences the largest delay among
the three inputs. The third issue concerns the common-mode level of $D_{n-1}$: since $D_{n-1}$ is
supplied from a CML latch, its common-mode level is close to VDD and this may
necessitate an explicit level-shifting stage, which incurs power and speed overhead
(especially in bipolar implementations [53]).
Figure 4-5. Schematic of a CML selector
Table 4-4. Selector truth table
$D_n^{+}$  $D_n^{-}$  $D_{n-1}$  $-D_{n-1}$  $D_n$
-1         -1         -1         +1          -1
-1         -1         +1         -1          -1
-1         +1         -1         +1          +1
-1         +1         +1         -1          -1
+1         -1         -1         +1          -1
+1         -1         +1         -1          +1
+1         +1         -1         +1          +1
+1         +1         +1         -1          +1
Table 4-4 shows the truth table of the CML selector in a speculative DFE. Note,
however, that in a low-pass electrical channel with a pulse response of [$h_0$, $h_1$], both
coefficients $h_0$ and $h_1$ are positive, and thus the feedback tap weight in the DFE
is always negative. This implies that the combination $D_n^{+} = +1$ and $D_n^{-} = -1$ in
Table 4-4 does not occur (indicated in gray), and inverting the
corresponding row outputs therefore does not affect the DFE function. Thus, the truth
table can be rewritten as shown in Table 4-5, and $D_n$ can be expressed as
$$D_n = \mathrm{sgn}\left(D_n^{+} + D_n^{-} - D_{n-1}\right) \qquad (4\text{-}2)$$
where $\mathrm{sgn}(\cdot)$ is the sign of the operand.
Figure 4-6. Proposed majority voter schematic
Equation 4-2 can be readily implemented with a majority-voter, as shown in
Figure 4-6. Compared to the selector in Figure 4-5, the majority-voter avoids the
disadvantages mentioned previously. The number of transistors in the stack is reduced from
three to two, making the majority-voter more amenable to low-voltage designs. The
majority-voter is fully symmetric with respect to the three inputs, and as a result, the
critical delay from $D_{n-1}$ to $D_n$ is the same as the delay from the other two inputs.
Moreover, no level-shifting is required for $D_{n-1}$.
Table 4-5. Majority-voter truth table
$D_n^{+}$  $D_n^{-}$  $D_{n-1}$  $-D_{n-1}$  $D_n$
-1         -1         -1         +1          -1
-1         -1         +1         -1          -1
-1         +1         -1         +1          +1
-1         +1         +1         -1          -1
+1         -1         -1         +1          +1
+1         -1         +1         -1          -1
+1         +1         -1         +1          +1
+1         +1         +1         -1          +1
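The equivalence between the selector and the majority-voter can be checked exhaustively. The sketch below is a behavioral illustration only. The function names are hypothetical, and the arguments stand for the two speculative decisions (previous bit assumed +1 or -1) and the previous bit, all taking values +1 or -1.

```python
from itertools import product


def selector(d_plus, d_minus, d_prev):
    """Conventional speculative DFE: the previous bit picks one decision."""
    return d_plus if d_prev == +1 else d_minus


def majority_voter(d_plus, d_minus, d_prev):
    """Sign of the sum of both decisions and the inverted previous bit."""
    total = d_plus + d_minus - d_prev  # sum of three odd terms, never zero
    return +1 if total > 0 else -1


def mismatches():
    """Input combinations where the two circuits disagree."""
    return [
        (p, m, d)
        for p, m, d in product((-1, +1), repeat=3)
        if selector(p, m, d) != majority_voter(p, m, d)
    ]
```

The only disagreements are the two rows in which the decision assuming a +1 previous bit is +1 while the decision assuming a -1 previous bit is -1. For positive $h_1$ those rows cannot occur, since a sample that exceeds the higher threshold also exceeds the lower one.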
Figure 4-7(A) compares the simulated $D_{n-1}$-to-$D_n$ delay for a selector and a
majority-voter as a function of the input transistors’ current density. For comparison, the
input transistors are of the same size, the single-ended input swing is 300 mV, the fan-out
is assumed to be two, and the supply is set to 1.2 V. The load resistors are adjusted
so that both the selector and the majority-voter have a small-signal gain of one. The
delay of both the selector and the majority-voter decreases with larger current densities and
higher transistor $f_T$, and saturates as $f_T$ reaches its maximum. For equal current
densities, the majority-voter exhibits ~50% less delay.
Figure 4-7(B) shows the overall DFE loop delay using the proposed majority
voter and the traditional selector. In this comparison, the latches in the DFF are biased
with equal current density in both cases. The majority voter based DFE shows >10%
improvement in delay over a wide range of current densities. Further improvement can
be achieved by increasing the current-density bias point and speed of the CML DFFs.
Figure 4-7. Simulated delay. A) Selector and majority-voter. B) Overall DFE loop.
Figure 4-8(A) shows the selector and the majority-voter delay as a function of
bias current. Although the majority-voter has three static tail current paths compared to
the single current bias leg of the selector, the overall current consumption to achieve the
same delay is comparable. This is due to the fact that the majority-voter requires a
lower current density than the selector to achieve the same speed. That is, the majority-
voter has a lower effort delay [54], and thus it exhibits higher power efficiency. This can
be related to the majority-voter having one transistor less in the stack, which also
enables operation at lower supply voltages as shown in Figure 4-8(B). A comparison of
the selector and majority-voter delay normalized to their respective delays at the
nominal supply voltage of 1.2 V shows that 1) the majority-voter is significantly less
sensitive to supply voltage variation and 2) it can operate at a lower supply voltage. For
instance, the selector delay degrades quickly below 0.8 V, while the majority-voter
exhibits a more gradual degradation below 0.6 V.
Figure 4-8. Simulated selector and majority-voter performances. A) Delay vs. total bias current. B) Normalized delay variation with supply voltage (VDD) for a current density of 100 µA/µm.
4.4 Chip Implementation
4.4.1 Architecture
Figure 4-9 shows the block diagram of the RX core. The input data is sampled by
a half-rate 1-tap speculative DFE and a CDR slicer. The DFE output is then de-
multiplexed by 8, whereas the CDR slicer output is decimated by 8. A CDR logic block
processes the outputs of the DFE and the CDR slicer according to the CDR algorithm
described above, and updates both the threshold of the CDR slicer through a 6-b DAC and
the clock phase through a phase interpolator (PI). The I/Q inputs to the PI are generated by
dividing down a full-rate external clock.
To minimize power, the RX employs high-speed CML circuits only in the first two
stages and static CMOS logic for the later stages, as shown in Figure 4-9. In addition,
the data output of the CDR slicer is decimated by 8 instead of being fully de-
multiplexed. Although this decimation reduces the CDR bandwidth, experimental results
reported in the following sections confirm that the CDR bandwidth is sufficiently large for
plesiochronous chip-to-chip interconnects. All blocks are built with custom layout except
the CDR logic block which is synthesized with standard cells.
Figure 4-9. Block diagram of the RX
4.4.2 Slicer
The slicer is implemented as a CML latch with digital offset control, as shown in
Figure 4-10, where all transistors without length annotation are of minimum channel
length. During pre-amplification mode, a current is injected to the output nodes to
introduce a desired offset. To reduce power supply noise, the offset-injection current is
kept active even when the slicer is in regeneration mode. Both the polarity and
magnitude of the injected current are controlled through the serial interface. An
important design parameter of the slicer is the offset tuning range, which must be large
enough to override the intrinsic slicer offset while generating the desired DFE tap
weight. Figure 4-11(A) shows the simulated offset of the slicer, while the simulated
offset tuning characteristic of the slicer is shown in Figure 4-11(B) when the sign of
offset is set to 1. The slicer offset is 34 mV, and the offset tuning range is ±220 mV.
With 6-b digital control, this gives a maximum DFE tap weight of nearly 200 mV with a
nominal step of 3 mV.
Figure 4-10. Schematic of the slicer with threshold control
Figure 4-11. Simulated slicer performances. A) Slicer offset. B) Offset tuning.
4.4.3 DMUX
The DMUX is constructed from cascading 1:2 DMUX cells. Figure 4-12 illustrates
the schematics of the latch-based CML and CMOS 1:2 DMUX cells, together with their
transistor-level details. The CML latch has the same topology as the slicer, except that it
does not have the offset adjustment. Also note that the bias current and the transistor
sizes are reduced by 50% since offset is not critical. The CMOS latches are
implemented as sense-amplifier flip-flops (SAFFs).
Figure 4-12. Schematics of the CML and CMOS DMUX cells
4.4.4 Clocking
The clocking circuitry generates clocks for the DFE and the DMUX. A full-rate
external clock is first divided down by a CML divider to obtain I/Q clocks, as shown in
Figure 4-13. Since phase inversion is simply swapping the differential signal polarity, I
and IB are obtained simultaneously. The same is true for Q and QB.
Figure 4-13. Schematic of the divider for I/Q generation
A phase interpolator (PI) combines the I/Q clocks with digitally-controlled weights
to adjust the receiver sampling phase. The principle of PI is depicted in Figure 4-14.
Phase interpolation is achieved by combining the I/Q clock phases with different
weightings. Figure 4-15 shows the schematic of the PI, which consists of four differential
pairs. Phase tuning is achieved by adjusting the tail currents of the four differential pairs.
To guarantee monotonicity, the tail current in each differential pair is split into eight
identical current sources, and the binary phase control word PI[5:0] is converted to
thermometer code W[0:31] to control the 32 current sources. With this half-rate
architecture, one clock period spans 2 UI, so the 32 steps give a PI phase resolution of
1/16 UI.
Figure 4-14. Principle of PI
Figure 4-15. Schematic of the phase interpolator
The output of the PI is further divided down to clock the DMUX. Figure 4-16
shows the level-converter schematic used to convert CML logic levels to full-swing
CMOS for clocking the SAFFs in the last two DMUX stages. The CML clock is AC-coupled
to inverters with resistive feedback. The feedback resistor and coupling
capacitor values are chosen so that the lower cut-off frequency is well below the target
clock frequency.
Figure 4-16. Level-converter schematic.
4.5 Experimental Results
The receiver chip was implemented in 0.13-μm bulk CMOS technology, packaged
in a QFN package and assembled on an FR4 test board. Figure 4-17 shows the die
micrograph along with a picture of the test board. The receiver core occupies an area of
0.14 mm².
Figure 4-17. Die micrograph and board picture
Figure 4-18 depicts the measurement setup. A PRBS generator and a 20-inch
differential microstrip FR4 channel were used to validate the receiver. The PRBS
generator and the RX were clocked by two different RF sources. When evaluating the
DFE, the two RF sources were synchronized and the RX CDR was disabled. When the CDR
loop was enabled, the two sources ran independently. Phase modulation (PM) was
added for the jitter tolerance measurement. The recovered data was monitored using a
BERT and a high-speed sampling oscilloscope. Measurements were performed up to
4.5 Gb/s with a $2^7-1$ PRBS pattern; higher data rates were limited by equipment capability.
Figure 4-18. Test setup
Figure 4-19 shows the measured channel insertion loss and the resulting eye
diagram at 4.5 Gb/s, showing complete eye closure due to severe ISI. The loss at
Nyquist frequency is 22 dB. The measured bathtub curves at different DFE settings are shown
in Figure 4-20; they were obtained by sweeping the PI control code while monitoring
the receiver BER. Without DFE, error-free operation was not possible. The eye opening
enlarges with increasing DFE settings, and decreases due to over-equalization after
reaching the maximum eye-opening. The peak eye-opening is 0.5 UI.
Figure 4-21(A) shows the measured PI linearity. The minimum DNL of -0.64 LSB
indicates monotonic operation, as guaranteed by the thermometer coding. The
maximum DNL is 1.5 LSB, giving a maximum phase step of 0.09 (=1.5/16) UI. The
repetitive DNL and INL patterns are due to the use of a simple I/Q interpolation scheme
[55].
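The repetition of the DNL pattern in every quadrant can be illustrated with an idealized model of linear I/Q weighting. The sketch below assumes sinusoidal I and Q clocks of equal amplitude and eight unit current steps per quadrant, so the output phase is the arctangent of the weight ratio. It does not model the measured PI, whose DNL also includes mismatch and waveform-shape effects; the function names are hypothetical.

```python
import math


def iq_phase_steps(units_per_quadrant=8):
    """Phase steps (degrees) across one quadrant for linear I/Q weighting."""
    n = units_per_quadrant
    # With k units steered to Q and n - k to I, the phase is atan2(k, n - k).
    phases = [math.degrees(math.atan2(k, n - k)) for k in range(n + 1)]
    return [later - earlier for earlier, later in zip(phases, phases[1:])]


def dnl_lsb(units_per_quadrant=8):
    """DNL in LSB, where one ideal LSB is 90 degrees / units_per_quadrant."""
    ideal = 90.0 / units_per_quadrant
    return [step / ideal - 1.0 for step in iq_phase_steps(units_per_quadrant)]
```

In this model the steps are smallest near the quadrant boundaries and largest near 45 degrees, and the same profile repeats in each of the four quadrants, which is the kind of periodic DNL/INL signature attributed above to simple I/Q interpolation.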
Figure 4-19. Measured 20” channel performances. A) Loss. B) Eye diagram.
The CDR function was evaluated by setting the frequency of the PRBS generator
slightly different from the RX clock source. The CDR lock range was measured to be
±100 ppm, confirming plesiochronous operation even though the CDR bandwidth is low
due to decimation. The histogram of the recovered clock at the limit of the lock range is
shown in Figure 4-21(B). The RMS jitter is 13 ps. The jitter is relatively high because
the clock output buffer chain shares the same power domain with the noisy digital
circuitry.
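A rough feel for why a decimated loop can still track a ±100 ppm offset comes from simple arithmetic. The sketch below assumes a PI step of 1/16 UI and, as a simplification, one CDR slicer sample every 8 UI. It ignores the loop filter and the fraction of samples that carry no timing information, and the function names are hypothetical.

```python
def bits_per_pi_step(offset_ppm, pi_step_ui):
    """Bit periods over which a frequency offset drifts by one PI step."""
    drift_per_bit_ui = offset_ppm * 1e-6
    return pi_step_ui / drift_per_bit_ui


def cdr_samples_per_pi_step(offset_ppm, pi_step_ui, ui_per_cdr_sample):
    """CDR slicer samples available while the phase drifts by one PI step."""
    return bits_per_pi_step(offset_ppm, pi_step_ui) / ui_per_cdr_sample
```

For a 100 ppm offset the phase drifts by one 1/16-UI step in about 625 bit periods, which under the 8-UI sampling assumption leaves roughly 78 CDR slicer samples per required phase update.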
Figure 4-20. Measured DFE performances. A) Bathtub curves. B) Eye openings.
Figure 4-21. CDR measurement results. A) PI linearity. B) Recovered clock.
Jitter tolerance of the CDR was measured by phase modulating the clock of the
PRBS generator and recording the modulation depth at which bit errors occurred. The
measured jitter tolerance is shown in Figure 4-22. Below a jitter frequency of 30 kHz, the
jitter tolerance is larger than 1 UI.
Figure 4-22. Measured CDR jitter tolerance
The RX core consumes 12.4 mW from a 1.2V supply, which translates to an
FOM of 2.75 pJ/bit. Table 4-6 shows the performance summary.
Table 4-6. Performance summary
Input Data Rate    4.5 Gb/s
De-multiplexing    1:16
Equalization       1-tap speculative DFE
Clock Recovery     Baud-rate eye-tracking
Power Supply       1.2 V
Power              12.4 mW
Process            0.13 μm CMOS
Area               360 μm × 400 μm
FoM                2.8 pJ/bit
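The figure of merit in the summary is simply power divided by data rate. The helper below is a minimal sketch of that arithmetic, with hypothetical function names, applied to the numbers in Table 4-6.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/bit; mW divided by Gb/s is numerically pJ/bit."""
    return power_mw / data_rate_gbps


def core_area_mm2(width_um, height_um):
    """Rectangular core area in square millimetres."""
    return width_um * height_um * 1e-6
```

For 12.4 mW at 4.5 Gb/s this gives roughly 2.76 pJ/bit, and a 360 μm × 400 μm core corresponds to 0.144 mm², in line with the rounded values quoted in the text and the table.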
4.6 Summary
Traditional oversampling CDR involves a few design issues, including the
requirement of power-hungry generation and distribution of clocks with sub-bit-time
resolution, the stringent constraint on the settling time of the DFE, and the possibility of
sub-optimal equalization of edge samples. It also locks to the center of the eye regardless of
the specific eye shape, potentially leading to degraded voltage margins. Various baud-
rate CDRs have been proposed over the years. However, they either do not take into
account the voltage margin, still require sampling at instants other than the data
sampling instants, entail analog circuitry for slope detection, or are only suitable for
integrating-type frontends.
In this Chapter, we propose a novel digital baud-rate eye-tracking CDR scheme
that obviates the above disadvantages. It employs a CDR slicer in parallel with the main
slicers, and the CDR algorithm controls both the clock phase and the threshold voltage
of the CDR slicer to drive the decision point of the CDR slicer to the peak of the eye
opening. Since the CDR slicer shares the same clock phase as the main slicer, this
automatically locks the RX to the point with the maximum eye-opening.
A majority-voting DFE architecture is also presented in this Chapter wherein the
selectors in a speculative DFE are replaced with majority-voters. The majority-voter has
one less level of transistors in the stack, and is therefore more amenable to low-power
and high-speed designs compared to a selector. It also reduces the DFE loop delay due
to its structural simplicity. Furthermore, the majority-voting DFE obviates the need for a
level shifter in bipolar designs.
Experimental results confirm the effectiveness of the proposed CDR scheme and
the majority-voting DFE. Implemented in 0.13-μm CMOS, the RX works reliably at 4.5
Gb/s while consuming 12.4 mW. Higher data rate is limited by the measurement
equipment. The CDR displays a lock range of ±100 ppm, and the DFE is able to
equalize a channel with 22 dB Nyquist loss while producing a 50% UI equalized eye-
opening.
CHAPTER 5 A 5-Gb/s 0.75-pJ/BIT VOLTAGE-MODE TRANSCEIVER
5.1 Chapter Overview
Chapter 3 and Chapter 4 apply some of the results from Chapter 2 to improve the
link power efficiency on the architecture level, namely the removal of back termination,
the channel loss reduction with air-cavity transmission lines, and the use of DFE and
baud-rate CDR. A few circuit techniques are also employed in Chapters 3 and 4, such
as the current-sharing frontend and the majority-voting speculative DFE. The 6.25-Gb/s
transceiver in Chapter 3 achieves 0.6-pJ/bit power efficiency without CDR, whereas the
4.5-Gb/s RX in Chapter 4 achieves 2.8-pJ/bit including CDR and clocking circuitry.
Based upon these results, this Chapter attempts to build a complete transceiver with
better power efficiency in the same technology. To attain this goal, the transceiver
employs a combination of architectural improvements and circuit techniques.
One major improvement is the signaling mode. The transceiver uses voltage-
mode signaling with differential termination in place of the current-mode signaling used
in the air-cavity active link in Chapter 3. According to Chapter 2, this reduces the
signaling power by 75%.
The other major improvement is the exclusive use of static CMOS logic gates
instead of the CML logic gates in Chapters 3 and 4. This avoids the static current
consumption of the CML gates since the CMOS gates only consume power during state
transitions. To further improve the power efficiency, the RX operates from a 1-V power
supply, instead of the nominal 1.2-V power supply. To cope with the resulting speed
degradation of the gates, the slicers are heavily parallelized and a look-ahead selection tree
is used in the DFE. Heavy parallelism in the frontend also saves power by eliminating
the need for an explicit DMUX.
The RX in this Chapter uses the same baud-rate CDR algorithm as presented in
Chapter 4. However, further decimation is applied to reduce the power consumption. An
injection-locked ring oscillator is used for clock generation to avoid the power overhead
of a PLL or DLL. In place of the PI for phase rotation in Chapter 4, a delay line is used
to adjust the injection clock phase so that the RX clock phases can be moved
simultaneously.
The result is a complete 5-Gb/s transceiver in 0.13-µm bulk CMOS process with
3.7-mW power consumption. This translates to a power efficiency of 0.75-pJ/bit, which
is among the best reported to date.
5.2 TX Implementation
5.2.1 TX Architecture
Figure 5-1 shows the TX block diagram. A full-swing restorer (FSR) converts the
output from a CML PRBS generator (reused from a previous design) to full swing
CMOS logic levels. A tapered inverter chain acts as a pre-driver between the FSR and
the VM driver. To preserve high speed, the fan-out of the predriver is designed to be
two. An on-chip LDO generates the supply VDRV for the VM driver from the un-regulated
chip supply.
Figure 5-1. TX block diagram
5.2.2 PRBS Generator
Figure 5-2 shows the block diagram of the PRBS generator. It consists of a clock
buffer, a PRBS core, a buffer, and an all-zero detector. This PRBS generator is reused
from a previous design, and all the buffers and gates are implemented in fully-
differential CML although the drawing is single-ended for simplicity.
The PRBS core is a linear feedback shift register (LFSR) comprised of 14 D
latches clocked at 2.5 GHz. The linear feedback through the XOR gates implements the
polynomial $x^7+x^6+1$ to generate a $2^7-1$ maximum-length sequence. A half-rate
architecture is chosen for easier clock distribution [56]. The two 2.5-Gb/s PRBS streams
with proper phase shift are multiplexed to obtain the 5-Gb/s PRBS.
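The sequence produced by this polynomial can be reproduced with a serial software model. The sketch below is a bit-serial Fibonacci LFSR for $x^7+x^6+1$, not the half-rate, latch-based circuit described here. The function names and the seed value are assumptions made for illustration.

```python
def lfsr_step(state):
    """Advance a 7-bit Fibonacci LFSR for x^7 + x^6 + 1 by one bit."""
    new_bit = ((state >> 6) ^ (state >> 5)) & 1  # XOR of stages 7 and 6
    return ((state << 1) | new_bit) & 0x7F


def prbs7(seed=0x7F, nbits=127):
    """Return nbits of the output sequence, taken from stage 7."""
    state = seed & 0x7F
    bits = []
    for _ in range(nbits):
        bits.append((state >> 6) & 1)
        state = lfsr_step(state)
    return bits


def lfsr_period(seed=0x7F):
    """Number of steps until the register returns to its seed state."""
    start = seed & 0x7F
    state, count = lfsr_step(start), 1
    while state != start:
        state = lfsr_step(state)
        count += 1
    return count
```

Any non-zero seed cycles through all 127 non-zero states, while a seed of zero maps to itself, which is the lock-up state that the all-zero detector is meant to escape.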
Figure 5-2. PRBS block diagram
One well-known design issue in PRBS generators is the all-zero state of the LFSR,
which circulates indefinitely once the LFSR falls into it. To prevent this from
happening, the designs in [57] [58] use a reset signal to manually insert a one into the LFSR. This
solution does not work if the LFSR accidentally falls into the all-zero state during normal
operation (for instance due to a power supply disturbance). A better solution is to monitor
the LFSR and automatically reset it if an all-zero state is detected. The design in [59] uses logic
gates to detect the all-zero state, which is complex and timing-critical. The design in [60] instead
detects the average DC level of the LFSR outputs. Although this solution is not timing-critical,
it still needs additional routing for all the LFSR outputs and thereby incurs extra
loading and complicates the layout.
Note, however, that it is not necessary to monitor all the LFSR outputs to detect
the all-zero state. Instead, monitoring the final generator output would suffice. This
avoids the loading and layout complication. Figure 5-3 shows the all-zero detection
used in this work. The RC filter has a cut-off frequency of 2 MHz and filters out the high-
frequency component. Since a PRBS is nearly DC balanced, P and N should have
nearly the same DC voltages. When the LFSR falls into the all-zero state, however, P
will have a lower DC voltage than N, and the comparator senses such a condition and
resets the LFSR. Figure 5-4 shows the schematic of the self-biased comparator. For
robust operation, the comparator has a built-in offset of roughly 60 mV so that it will not
activate reset during normal operation. Figure 5-5 shows the simulated waveforms of
the all-zero detector. At start-up, the PRBS is stuck at the all-zero state. The detector
senses this state and inserts ones into the LFSR so that a proper PRBS pattern can be
initiated.
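The quoted cut-off frequency can be checked from the first-order RC formula. The sketch below uses the component values labeled in Figure 5-3 (42 kΩ and 1.7 pF) and assumes a single-pole low-pass filter. The function names are hypothetical, and the ones density is that of an ideal $2^7-1$ sequence.

```python
import math


def rc_cutoff_hz(r_ohm, c_farad):
    """-3 dB frequency of a single-pole RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def prbs_ones_density(order=7):
    """Fraction of ones in one period of a maximum-length sequence."""
    period = 2 ** order - 1
    return (2 ** (order - 1)) / period
```

With these values the cut-off evaluates to about 2.2 MHz, consistent with the roughly 2 MHz stated above. A healthy PRBS averages to slightly more than half of the logic swing, whereas the all-zero state averages to the low level, which is the difference the comparator detects.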
Figure 5-3. All-zero detector
Figure 5-4. Schematic of the self-biased comparator with offset
Figure 5-5. Simulated waveforms confirming the function of the all-zero detector
5.2.3 LDO
The LDO powers the TX driver for better supply noise rejection and also provides
a convenient means for adjusting the TX output swing. For a single-ended output swing
of 100 mV, the driver current consumption is 1 mA with differential RX termination. With
a width of , the pass element is large enough to source 10 mA to support larger
swings in measurement. The error amplifier is a simple two-stage opamp. The dominant
pole is located at the VDRV node due to the large decoupling capacitor. Figure 5-6 shows
the stability simulation results. The phase margin is 72 degrees.
Figure 5-6. Stability of the LDO
5.2.4 TX Driver
Since the targeted TX swing is less than 100 mV, the TX employs an N-over-N
VM driver [6] [39], as shown in Figure 5-1. Exclusive use of NMOS in the driver reduces
the input capacitance and therefore the predriver power consumption compared to an
inverter driver [61]. The transistors are sized for 50-Ω Ron for proper channel back
termination. Note the top NMOS is sized slightly larger than the bottom one since it sees
less overdrive voltage.
5.3 RX Implementation
5.3.1 RX Architecture
Figure 5-7 depicts the receiver block diagram with differential termination.
Because the TX output has low common-mode voltage, the input signals VP and VN are
first shifted up to enable NMOS transistors at the input of the slicers. Equalization is
done with 1-tap speculative DFE for its high signaling efficiency compared to TX FFE
[16]. A bank of 32 slicers performs digitization and direct 1:16 de-multiplexing. Two
additional CDR slicers facilitate timing recovery. The slicer bank’s 34 output bits are
synchronized, and 17 of them are selected to accomplish DFE. The ILRO, locked to a
312.5-MHz external source, generates 16 clock phases CK[0:15] for the slicer bank.
The CDR logic extracts timing information from the 17 bits and adjusts the phase of the
injection clock to track the maximum eye opening.
Figure 5-7. RX block diagram
5.3.2 Slicer Design
The most important design goals of the slicer include power, speed and
sensitivity. The slicers are implemented as SAFFs to avoid static power consumption,
as shown in Figure 5-8. With 16-way interleaving, the speed requirement on the slicer is
much relaxed, leaving its sensitivity the focus of design optimization.
One factor that impacts the slicer sensitivity is transistor mismatch. To reduce the
input capacitance and power consumption, the slicers are sized to near minimum. As a
result, the simulated 1-σ slicer offset is 38 mV. To improve RX sensitivity, all the slicers
have 8-b offset trimming. The trimming range is designed to be ±160 mV, yielding a
trimming resolution of 1.25 mV.
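The trimming resolution follows directly from the trimming range and the number of trim bits. The short Python sketch below reproduces that arithmetic; the function name is illustrative only, and the comparison against the simulated 1-σ offset simply restates the numbers quoted above under the assumption of a uniform trim DAC.

```python
def trim_resolution_mv(full_range_mv: float, bits: int) -> float:
    """LSB of a uniform trim DAC that spans full_range_mv with 2**bits codes."""
    return full_range_mv / (2 ** bits)

# Values quoted in the text: +/-160 mV range, 8-b trimming, 38-mV 1-sigma offset.
RANGE_MV = 2 * 160.0
resolution = trim_resolution_mv(RANGE_MV, 8)

# Number of sigma covered by the one-sided trimming range.
coverage_in_sigma = 160.0 / 38.0

print(f"trim resolution = {resolution:.2f} mV")
print(f"range covers +/-{coverage_in_sigma:.1f} sigma of offset")
```

A one-sided range of roughly four standard deviations leaves only a small fraction of slicers outside the trimmable window.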
Figure 5-8. Schematic of the slicer
Another factor that impacts the slicer sensitivity is hysteresis, including the
hysteresis due to incomplete resetting of the SA core, and the hysteresis due to the
imbalanced input capacitances of the RS latch that follows the SA core [62]. With heavy
front-end parallelism, the SA core has enough time to completely reset and no
hysteresis is observed due to the SA core. To remove the hysteresis due to the
imbalanced RS latch input capacitance, a buffer stage is inserted between the SA core
and the RS latch, as shown in Figure 5-8. Simulation indicates that without this buffer
stage, the slicer has a hysteresis of 30 mV, whereas inserting the buffer stage makes
the hysteresis negligible.
5.3.3 Level Shifting and DFE Tap Generation
The slicers use NMOS input transistors for faster operation. However, the RX
input has a common-mode level close to ground due to the use of the VM signaling. A
level shifter is therefore required before the slicers to shift up the input signals by VLS.
Level-shifting can be accomplished with an AC-coupling capacitor [63] or a
common-gate (CG) amplifier [64], as shown in Figure 5-9(A) and (B). AC-coupling does
not consume power but cuts off the low frequency component of the input signal. On the
other hand, a CG amplifier provides DC coverage but dissipates excessive power due
to the stringent bandwidth requirement. This is especially true when driving the large
input capacitance of the heavily-parallelized slicer bank. Figure 5-9(C) shows the basic
idea of the proposed level-shifter, which combines the advantages of both - a capacitor
provides a high-frequency signal path while a source-follower enables DC coverage.
(A) (B) (C)
Figure 5-9. Level shifters. A) Capacitor-based. B) CG-amp-based. C) Proposed.
Figure 5-10 shows the detailed schematic of the level shifter. The AC-coupling
capacitor is implemented as an NMOS transistor with source and drain shorted to the
input. The shifting voltage is adjusted by tuning VB. To control the low frequency gain,
the source follower is broken into 4 identical segments, with the input of each segment
switchable between the input and the common mode voltage by GAIN[3:0]. When all the
four inputs are switched to the common mode voltage (GAIN=0), the DC path of the
level shifter is disabled. Figure 5-11 shows the simulated frequency response of the
level shifter at different gain settings. When the DC path is disabled, the level shifter has
a low cutoff frequency of 3 MHz. Because of its much relaxed bandwidth requirement,
the source follower consumes negligible (<10 μW) power.
Figure 5-10. Detailed schematic of the level shifter
Figure 5-11. Simulated frequency response of the level shifter at different gain settings
The level shifters also provide a convenient means of generating the DFE tap.
This is achieved by introducing an offset in the shifting voltages of VP and VN, as
shown in Figure 5-7. Although it’s possible to embed the DFE tap into the slicer offset,
doing so would have required too large a slicer trimming range when the input swing is
high.
5.3.4 DFE with Look-Ahead Selection Tree
The slicer bank is implemented using a 16-way parallel architecture to relax
speed requirement and avoid the added power consumption by an explicit de-
multiplexer. A critical issue in the speculative DFE is the stringent timing constraint,
which occurs when decisions are selected based on previously received bits. For a
straightforward implementation of the DFE selection tree shown in Figure 5-13(A), the
previous bits must ripple through all 16 selectors under worst-case conditions, and the
resulting timing constraint is
$$t_{CQ} + 16\,t_{SEL} + t_{SU} < 16\,T_b$$
where $t_{CQ}$ and $t_{SU}$ are the delay and set-up times of the D flip-flop, $t_{SEL}$ is the
selector delay, and $T_b$ is the bit time. Figure 5-12 shows the simulated $t_{SEL}$ as a
function of VDD before layout extraction. At 1.0 V, the delay is about 120 ps. Considering
the parasitics due to wiring, such a delay is marginal for 5 Gb/s operation ($T_b$ = 200 ps).
Figure 5-12. Simulated pre-layout selector delay vs. power supply
This work uses a look-ahead selection tree to expedite the selection process.
Two possible sets of decisions for Q[8:15] are pre-computed and then selected, as
shown in Figure 5-13(B). The timing constraint now becomes
$$t_{CQ} + 9\,t_{SEL} + t_{SU} < 16\,T_b$$
with only nine selector delays on the critical path, which is relaxed by nearly 50% compared to the straightforward implementation.
(A) (B)
Figure 5-13. DFE selection tree. A) Conventional. B) Look-ahead.
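The equivalence between the two selection trees can be checked with a small behavioral model. The Python sketch below is an illustration under stated assumptions, not the implemented circuit: each bit has two speculative decisions (`hi` if the previous bit was 1, `lo` if it was 0), the ripple tree resolves them serially, and the look-ahead tree pre-computes the upper half for both possible values of the last lower-half bit. All function names are hypothetical, and the exact partitioning of the actual circuit is the one drawn in Figure 5-13(B).

```python
import random

def ripple_select(hi, lo, prev):
    """Serial resolution: bit i takes hi[i] if the previous bit was 1, else lo[i]."""
    out = []
    for h, l in zip(hi, lo):
        prev = h if prev else l  # one 2:1 selector per bit
        out.append(prev)
    return out

def lookahead_select(hi, lo, prev):
    """Resolve the lower half serially; pre-compute the upper half for both cases."""
    half = len(hi) // 2
    lower = ripple_select(hi[:half], lo[:half], prev)
    upper_if_0 = ripple_select(hi[half:], lo[half:], 0)  # computed in parallel
    upper_if_1 = ripple_select(hi[half:], lo[half:], 1)  # computed in parallel
    return lower + (upper_if_1 if lower[-1] else upper_if_0)

def critical_path_selectors(n_bits, lookahead):
    """Selectors on the longest path in this model (half the chain plus a final mux)."""
    return n_bits // 2 + 1 if lookahead else n_bits

rng = random.Random(1)
for _ in range(1000):
    hi = [rng.randint(0, 1) for _ in range(16)]
    lo = [rng.randint(0, 1) for _ in range(16)]
    prev = rng.randint(0, 1)
    assert ripple_select(hi, lo, prev) == lookahead_select(hi, lo, prev)

print(critical_path_selectors(16, False), critical_path_selectors(16, True))
```

In this model the longest path shrinks from 16 to 9 selector delays for a 16-bit block, which is in line with the nearly 50% relaxation stated above, at the cost of duplicating the upper-half selectors.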
5.3.5 Decimated Baud-Rate CDR
The RX employs the same baud-rate CDR scheme as that in Chapter 4 to reduce
the clocking power compared to Alexander-type CDRs [64]. Monitoring all of
CK[0:15] would require 32 more slicers, leading to considerable power and area
overhead. To further reduce power consumption, only CK[8] is monitored in this work.
This reduces the number of CDR slicers by more than 90%, from 32 to 2.
Although this decimation reduces CDR bandwidth, it is generally acceptable for
mesochronous chip-to-chip links [64]. Note that because of heavy parallelism, the
reduction in input capacitance and area is more pronounced compared to the
decimation in [64].
5.4 Injection-Locking-Based Clock Generation
5.4.1 Clock Generation Overview
Despite a 50% reduction in the number of clock phases by the baud-rate CDR,
generating the required 16 phases for the slicer bank is still non-trivial. Injection-locking
based clock generation is chosen in place of PLL- or DLL-based schemes for its low
power and superior jitter performance. Figure 5-14 shows the block diagram of the clock
generation circuitry. At the core lie two cascaded (master and slave) low-power
injection-locked ring oscillators (ILROs). Both ILROs are digitally trimmed to ensure
reliable locking. The slave ILRO helps correct the master ILRO’s phase mismatch and
duty-cycle distortion due to injection locking [65]. A bank of current-starved delay lines
facilitates further phase calibration.
Phase tuning of an ILRO is usually done by adjusting the free-run frequency of the
ILRO [66] [22] [67]. However, tuning the free-run frequency of the ILRO may change the
phase relationship between its outputs and degrade the RX timing margin. In this work,
the phases of the ILRO outputs are tuned by adjusting the injection clock phase with an
additional delay line controlled by the CDR logic, as shown in Figure 5-14.
Figure 5-14. Block diagram of the injection-locking-based clock generation
5.4.2 ILRO Core
The master and slave ILROs are of the same design. Figure 5-15 shows the
ILRO core schematic. Eight pseudo-differential delay cells constructed from inverters
are used instead of CML delay cells to avoid static current consumption. The input clock
phases are injected through NMOS transistors. To ensure locking, the free-run
frequency of the oscillator is digitally trimmed.
Figure 5-15. Schematic of the ILRO core
One design issue of the pseudo-differential oscillator is its start-up. Because
there is an even number of delay-cell stages, a stable DC solution exists where the whole ring
behaves like a latch, as shown in Figure 5-16. To prevent that from happening, the
cross-coupled inverters must be sized large enough compared to the main inverters. In
this design, the cross-coupled inverters are sized of the main inverters for reliable
start-up, as annotated in Figure 5-15.
Figure 5-16. Start-up issue of the pseudo-differential oscillator
5.4.3 Delay Line
The delay lines are constructed from cascading current-starved delay cells, the
schematic of which is shown in Figure 5-17, where a 4-b digitally controlled current sets
the bias current of the inverters. Figure 5-18 shows the simulated tuning curve of one
delay cell. The tuning range is 30 ps. The CDR delay line consists of 8 delay cells. The
total tuning range of 240 ps is larger than 1 UI (200 ps at 5 Gb/s) for reliable CDR operation, especially
when the extra delay caused by parasitics is considered.
Figure 5-17. Schematic of the current-starved delay line
Figure 5-18. Simulated delay line tuning curve
5.5 Experimental Results
The transceiver was fabricated in a 0.13-μm bulk CMOS process using only
nominal-VT devices. The test chip was assembled in a 32-pin QFN package and
mounted on an FR4 board. Figure 5-19 shows the chip micrograph. The RX measures
, while the TX occupies .
5.5.1 TX Measurement
The TX is measured at different supply voltages. With a 1.5-V supply the TX is
able to work up to 6.25 Gb/s, whereas at 1.2 V the TX is able to work at 5 Gb/s. Below
1.2 V the TX does not work properly, probably limited by the CML PRBS core. Figure
5-20(A) shows the measured TX eye diagrams at 6.25 Gb/s. The RMS jitter is 11 ps.
Figure 5-20(B) shows the captured transient of the TX output, which confirms correct
$2^7-1$ PRBS pattern generation.
Figure 5-19. Chip micrograph and transceiver layout
(A)
(B)
Figure 5-20. TX measurement results at 6.25 Gb/s. A) Output eye diagram. B) TX transient showing correct $2^7-1$ PRBS pattern.
[Figure 5-19 layout annotations: 500 μm × 300 μm and 500 μm × 230 μm.]
5.5.2 Clocking Measurement
Figure 5-21 shows the measured tuning curve and locking range of the ILRO.
The ILRO has a tuning range of more than 500 MHz, and the locking range is larger
than 10% when the free-run frequency is 312.5 MHz.
(A)
(B)
Figure 5-21. ILRO measurement results. A) Frequency tuning. B) Locking range.
Figure 5-22 shows the measured phase noise with and without injection. At a 100-kHz
offset, injection locking suppresses the phase noise by more than 70 dB.
Figure 5-22. Measured phase noise with and without injection locking
The measured CDR delay line tuning curve is shown in Figure 5-23. The tuning
range is 400 ps, which covers 2 UI when the data rate is 5 Gb/s. The measured tuning
range is more than 60% larger than simulation results, indicating heavy parasitics due to
routing.
Figure 5-23. Measured CDR delay line tuning curve showing >2-UI tuning range
5.5.3 RX Measurement
Standalone RX measurement is done up to 4 Gb/s due to equipment limitations. Figure
5-24 shows the measured loss profile of the 20” channel. The loss is 19.2 dB at 2 GHz.
Figure 5-25 shows the 4 Gb/s eye diagrams before and after the channel. Due to severe
channel loss, the eye is completely closed after the channel.
Figure 5-24. Measured loss characteristics of the 20” channel
Figure 5-25. Measured 4-Gb/s eye diagrams before and after the 20” channel
Figure 5-26 shows the measured bathtub curves with and without DFE. Error-free
operation cannot be attained without DFE, while the eye opening is 30% when DFE is
enabled. Figure 5-27 shows the jitter histogram of the recovered clock. The RMS jitter is
4.85 ps, while the peak-to-peak jitter is 42 ps.
Figure 5-26. RX bathtubs with and without DFE
Figure 5-27. Jitter histogram of the recovered clock
The receiver core is powered from a 1-V supply, and dissipates 1.1 mW, which
translates to a power efficiency of 0.28 pJ/bit. Table 5-1 compares the performance to
some recently published work. The power efficiency is nearly a 2× improvement over
the best result of previously published complete receivers.
Table 5-1. Performance summary of the receiver
                     [6]        [7]        [22]         This work
Data rate (Gb/s)     6.25       12.5       8            4
Equalization         CTLE       CTLE       CTLE         DFE
Nyquist loss (dB)    15         12         9.7          19
Sub-rate             1/2        1/2        1/10         1/16
Clock generation     PLL        PLL        ILRO         ILRO
CDR                  Alexander  Baud-rate  NA           Baud-rate eye-tracking
Jrms (ps)            NA         2.2        4            4.85
Technology           90-nm      65-nm      65-nm        0.13-μm
VDD (V)              1.0        1.0        0.6/1.0      1.0
Power (mW)           8.22       6.6        1.3-1.98     1.1
Area (mm^2)          0.15       0.24       0.014-0.018  0.15
FoM (pJ/bit)         1.31       0.53       0.16-0.25    0.28
5.5.4 Transceiver Measurement
The whole link is then tested with a 10” channel on FR4 at 5 Gb/s, although the
TX is capable of operating at 6.25 Gb/s.
Figure 5-28 shows the TX eye diagrams before and after passing the 10”
channel. Although the Nyquist channel loss is less than the standalone RX
measurement, the eye is still completely closed due to the bandwidth and jitter of the
TX. The near-end TX RMS jitter is 13 ps.
(A)
(B)
Figure 5-28. Measured 5-Gb/s TX eye diagrams. A) Before the channel. B) After the 10” channel
Figure 5-29 shows the recovered data and clock of the RX. The recovered clock
has an RMS jitter of 6.9 ps. Figure 5-30 shows the RX bathtubs before and after
enabling the DFE. The eye opening with DFE enabled is 18%.
(A)
(B)
Figure 5-29. Measured CDR waveforms. A) Recovered 312.5-Mb/s data. B) Recovered 312.5-MHz clock.
Figure 5-30. RX bathtubs with and without DFE
The TX works from a 1.2-V supply and consumes 2.1 mW, while the RX
consumes 1.6 mW from a 1-V supply. The total power consumption of the transceiver is
3.7 mW, and the power efficiency is 0.75 pJ/bit. Table 5-2 compares the transceiver
performance with some recent publications. Even though we use a relatively less
advanced technology, the power efficiency is among the best.
Table 5-2. Performance summary of the transceiver
                            [42]       [6]        [7]        [68]       This work
Technology                  65 nm      90 nm      65 nm      45 nm      0.13 μm
TX VDD (V)                  0.68       1.0        1.0        0.8        1.2
RX VDD (V)                  0.68       1.0        1.0        0.8        1.0
Data rate (Gb/s)            5          6.25       12.5       10         5
Nyquist loss (dB)           4          15         12         8          12
TX swing (mVpp)             100        200        150        150        160
BER                         1e-12      1e-15      1e-12      1e-14      1e-12
Eye opening (UI)            -          30%        43%        -          18%
Power (mW)                  13.5       14         12         14         3.7
Energy efficiency (pJ/bit)  2.7        2.24       0.98       1.4        0.75
TX/RX area (mm^2)           0.03/0.06  0.31/0.31  0.24/0.24  0.07/0.07  0.15/0.12
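The energy-efficiency figures in this chapter are simply power divided by data rate. The Python sketch below, with an illustrative helper name, recomputes the two figures reported for this work from the measured power numbers quoted in the text (1.1 mW at 4 Gb/s for the standalone RX, and 2.1 mW plus 1.6 mW at 5 Gb/s for the full link).

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """mW divided by Gb/s gives pJ/bit (1e-3 W / 1e9 b/s = 1e-12 J/b)."""
    return power_mw / data_rate_gbps

rx_only = energy_per_bit_pj(1.1, 4.0)     # standalone RX measurement
link = energy_per_bit_pj(2.1 + 1.6, 5.0)  # TX plus RX in the link measurement

print(f"RX only: {rx_only:.3f} pJ/bit")
print(f"TX + RX: {link:.3f} pJ/bit")
```

The results, 0.275 pJ/bit and 0.74 pJ/bit, agree closely with the reported 0.28 pJ/bit and 0.75 pJ/bit.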
5.6 Summary
Building on the results in Chapter 3 and Chapter 4, this Chapter presents a
5-Gb/s 0.75-pJ/bit transceiver in 0.13-μm bulk CMOS technology. Various design
techniques are combined to attain this high power efficiency, including the VM signaling
with differential termination to reduce the signaling power by 75% compared to CM
signaling, the exclusive use of static CMOS gates to avoid the static power consumption
of CML gates, the injection-locking-based clock generation, decimation in the CDR
circuitry, and low-voltage RX operation enabled by the heavy frontend parallelism and
the look-ahead DFE selection tree. The heavy parallelism also eliminates the need for
an explicit DMUX, leading to further power reduction.
Even though the transceiver is implemented in a less advanced 0.13-μm CMOS
technology, the achieved power efficiency of 0.75 pJ/bit is among the best reported to
date at comparable data rates. It’s therefore believed that the techniques presented in
this Chapter will help enable the Tb/s aggregate off-chip signaling of future electronic
systems.
CHAPTER 6 A DIGITAL BACKGROUND ADC CALIBRATION TECHNIQUE
6.1 Chapter Overview
The continuous scaling of CMOS technology has made digital signal processing
more powerful and affordable. Compared to analog signal processing, digital solutions
have the advantages of greater flexibility and better scalability. As a result, there is a
trend of moving more and more signal processing into the digital domain. This trend is
also reflected in high-speed serial links [8] [69] [70], where an ADC digitizes the
distorted incoming bit stream and a DSP carries out the signal processing such as
equalization and timing recovery in the digital domain, as shown in Figure 6-1.
Figure 6-1. An ADC-based serial link
One of the key challenges in such ADC-based serial links is the design of a high-
speed low-power ADC. Due to its high speed, a flash ADC is often the architecture of
choice. For low power consumption, it is desirable to use small transistors in the flash
ADC. However, the mismatch between transistors becomes worse with small transistor
sizes, which will degrade the linearity of the ADC if left unaddressed.
For example, consider the preamp in Figure 6-2 often found in flash ADCs.
Around the balanced condition, the input and output are related by
$$V_O = A_V\,(V_{IN} - V_R + V_{OS}) \tag{6-1}$$
where $A_V$ is the preamp gain, and $V_O = V_{OP} - V_{ON}$, $V_{IN} = V_{INP} - V_{INN}$, and $V_R = V_{RP} - V_{RN}$
are the differential output, input and reference voltages respectively. The last term, $V_{OS}$,
is the offset voltage of the preamp due to device mismatches. With proper design and
layout, $V_{OS}$ has a zero mean (no systematic offset) and a certain spread determined by
circuit details and the fabrication technology. For typical bias conditions, $V_{OS}$ is
dominated by transistor threshold voltage mismatch [71] and its standard deviation can be expressed as
$$\sigma_{V_{OS}} = \frac{A_{VT}}{\sqrt{WL}} \tag{6-2}$$
where $A_{VT}$ is a parameter determined by the technology, and $WL$ is the gate area of
the transistors. To satisfy the linearity requirement, the transistors must be sized large
enough so that $\sigma_{V_{OS}}$ is kept within a fraction of the ADC step size. With the transistor
length $L$ and current density $I_D/W$ largely determined by the speed requirement, $W$ is the only
design variable that can be exploited to reduce $\sigma_{V_{OS}}$. According to Equation 6-2, to
decrease $\sigma_{V_{OS}}$ by half, the transistor width and therefore the current consumption
must be increased by 4×, a very unfavorable tradeoff for low power designs. As
technology scales down, this tradeoff is expected to become more and more
challenging due to effects such as random dopant fluctuation (RDF) and line-edge
roughness (LER) [72].
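The cost of reducing offset through sizing alone can be illustrated numerically. The Python sketch below assumes the usual area-scaling mismatch model, in which the offset standard deviation is inversely proportional to the square root of the gate area; the matching coefficient `A_VT` and the device dimensions are hypothetical values chosen only for illustration.

```python
import math

def sigma_vos_mv(a_vt_mv_um: float, w_um: float, l_um: float) -> float:
    """Offset standard deviation under an area-scaling mismatch model."""
    return a_vt_mv_um / math.sqrt(w_um * l_um)

A_VT = 5.0    # mV*um, hypothetical matching coefficient
LENGTH = 0.13 # um, fixed by the speed requirement in this example

base = sigma_vos_mv(A_VT, 2.0, LENGTH)  # reference width of 2 um (assumed)
wide = sigma_vos_mv(A_VT, 8.0, LENGTH)  # four times the width

print(f"sigma at W = 2 um: {base:.2f} mV")
print(f"sigma at W = 8 um: {wide:.2f} mV")
print(f"improvement for 4x width (and current): {base / wide:.1f}x")
```

With the length and current density fixed, quadrupling the width quadruples the bias current while only halving the offset spread.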
Figure 6-2. Schematic of a preamp
Since offset changes slowly over time with environmental (supply voltage and
temperature) variations and device aging, it can be effectively cancelled with some form of
calibration. Various calibration schemes have been proposed in the past for
flash ADCs, which all fall into either the foreground [73] [74] [75] or the background
categories [76] [77]. A foreground calibration scheme mandates temporarily interrupting
the normal ADC operation and is therefore usually done at power-up or during certain
idle times when allowed by the system. However, as the supply voltage and
temperature change over time, the calibration results may no longer be optimum,
leading to degraded performance [78]. In contrast, a background calibration scheme
does not require interrupting the ADC operation and can run continuously to track
environmental variations and device aging. Thus, background calibration schemes are
generally preferred.
Some of the critical challenges in background calibration for high-speed ADCs
are accuracy, convergence speed, area/power overhead, and performance penalty.
Despite the many background calibration techniques proposed in the past, a quick
literature review demonstrates the need for an improved background calibration scheme
that is suitable for high-speed ADCs. In response, this Chapter describes a novel
background calibration scheme for ADCs which features negligible hardware and power
overhead. The proposed calibration scheme is implemented in a 50-mW 2.5-GS/s 5-bit
flash ADC and its effectiveness is verified with experimental results.
6.2 Background Calibration
6.2.1 Review of Prior Art
Several background calibration schemes for flash ADCs have been reported in
literature, and are briefly reviewed here. Correlation-based calibration operates by
modulating the analog input signal with pseudo-random sequences to extract offset
information from the resulting statistics of the digital output, and has been proposed for
both pipeline and flash ADCs [79] [80] [81] [82]. In [79] and [80], the analog input is
converted to a white signal with little energy at DC by chopping it with a pseudo-random
binary sequence. The DC component in the resulting signal stems mainly from the ADC
offset. By forcing this DC component to zero, the comparator offset can be effectively
removed. A more general approach is proposed in [81], where the offset of a
comparator is detected by chopping the analog input with a sequence from an on-chip
random-number generator (RNG) and observing the code distribution of the digital
outputs, as illustrated in Figure 6-3 (drawn single-ended for simplicity). The chopping
operation degrades the ADC sample rate because it needs finite time to settle. Due to
this approach’s statistical nature, the analog input must be uncorrelated with the on-chip
generated random sequence and the calibration results are prone to fluctuation, which
can only be minimized at the cost of the convergence speed [81]. Furthermore,
correlation-based calibration invariably introduces a performance penalty because it
interferes with the analog signal path through chopping or noise injection. For fast and robust
calibration, deterministic schemes are generally preferred.
Figure 6-3. Correlation-based calibration
[Figure 6-3 annotations: when RNG = 1, Q = sgn(VIN - VR - Vos); when RNG = 0, Q = sgn(VIN - VR + Vos).]
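The statistical nature of correlation-based calibration can be illustrated with a small Monte Carlo model. The Python sketch below is only an illustration under stated assumptions: the input is uniformly distributed over the full-scale range, the chopping sequence flips the sign with which the offset appears at the comparator (following the two relations annotated in Figure 6-3), and the offset is estimated from the difference in the probability of a '1' output between the two chopping states. The function name and all numeric values are hypothetical.

```python
import random

def estimate_offset(v_os, v_ref=0.0, v_fs=1.0, n=400_000, seed=0):
    """Estimate a comparator offset from output statistics under random chopping."""
    rng = random.Random(seed)
    ones = {0: 0, 1: 0}
    count = {0: 0, 1: 0}
    for _ in range(n):
        v_in = rng.uniform(-v_fs / 2, v_fs / 2)
        chop = rng.getrandbits(1)
        # RNG = 1: Q = sgn(VIN - VR - Vos); RNG = 0: Q = sgn(VIN - VR + Vos)
        seen_offset = v_os if chop else -v_os
        q = 1 if v_in - v_ref - seen_offset > 0 else 0
        ones[chop] += q
        count[chop] += 1
    p1_rng0 = ones[0] / count[0]
    p1_rng1 = ones[1] / count[1]
    # For a uniform input the two probabilities differ by 2*Vos/VFS.
    return (p1_rng0 - p1_rng1) * v_fs / 2

estimate = estimate_offset(0.02)
print(f"true offset = 0.0200, estimate = {estimate:.4f}")
```

The estimate carries statistical noise that shrinks only as the number of samples grows, which is the accuracy versus convergence-speed tradeoff noted above.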
Redundancy-based calibration [83] [77] [84] achieves deterministic operation by
employing redundant elements to enable un-interrupted ADC operation when some of
the elements undergo calibration. Figure 6-4 shows the 6-b ADC block diagram with
background calibration as reported in [76], where 64 instead of 63 comparators (C1-
C64) are employed in parallel. When C1 is being calibrated, the other 63 comparators
(C2-C64) work together as a normal ADC. After C1’s calibration is done, the comparator
array is reconfigured so that C1 and C3-C64 work together as a normal ADC and C2
undergoes calibration, with the ADC operation un-interrupted. This process repeats
continuously and in the end all the comparators are calibrated. The advantage of this
technique is its low hardware overhead. However, this technique still incurs speed
penalty because it needs to reconfigure the ADC during its normal operation.
Figure 6-4. Redundancy-based calibration
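The rotation described above can be summarized as a simple schedule. The Python sketch below is a behavioral illustration with a hypothetical function name: at each round one comparator is taken out for calibration while the remaining 63 form the working 6-b flash ADC.

```python
def rotation_schedule(n_comparators=64):
    """Yield (comparator under calibration, comparators serving as the ADC)."""
    for under_cal in range(1, n_comparators + 1):
        active = [c for c in range(1, n_comparators + 1) if c != under_cal]
        yield under_cal, active

schedule = list(rotation_schedule())
first_cal, first_active = schedule[0]
print(f"round 1: calibrate C{first_cal}, ADC uses C{first_active[0]}-C{first_active[-1]}")
print(f"rounds per full pass: {len(schedule)}")
```

Each round requires re-mapping the comparators to the reference levels, which is the reconfiguration that gives rise to the speed penalty mentioned above.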
Reference-ADC based calibration schemes proposed in [85] [86] [87] employ a
slow but accurate reference ADC to improve the linearity of the fast but inaccurate main
ADC. Figure 6-5 shows a simplified block diagram of the reference-ADC based
calibration scheme, while Figure 6-6 shows its working principle. For simplicity, we
assume that the main ADC has 3-b resolution. In Figure 6-6, the transfer curves of the
main ADC and the ideal reference ADC are overlaid. Denoting the transition levels of
the main and reference ADCs as $V_{TH}[k]$ and $V_{TH,ref}[k]$ respectively, any offset will
cause $V_{TH}[k]$ to differ from $V_{TH,ref}[k]$. These differences are marked by gray bars in Figure 6-6
and are referred to as calibration windows hereafter. Whenever $V_{IN}$ falls within the
calibration windows, a discrepancy occurs between the reference and main ADC
outputs. The calibration engine then examines such discrepancies and drives $V_{TH}[k]$
toward the ideal $V_{TH,ref}[k]$.
Figure 6-5. Reference-ADC-based calibration
Figure 6-6. Principle of reference-ADC-based calibration.
Although reference-ADC-based calibration is deterministic and incurs negligible
performance penalty, there is considerable design overhead when the reference and
main ADCs are entirely different – for example, a Σ-Δ ADC is used to calibrate a
pipeline ADC in [87]. Furthermore, because the main and reference ADCs operate from
different sampling clocks, mismatch in their track-and-hold (T/H) circuits can degrade
the calibration accuracy. To alleviate this problem, one has to resort to either power-hungry
T/H circuits to drive both ADCs [86] or dedicated timing calibration for the two
sampling clocks [88], both of which are very challenging at high speeds. These
disadvantages can be avoided with the so-called “split-ADC” architecture, where the
reference ADC is simply a replica of the main ADC and operates at the same speed [78]
[89]. The replica ADC, however, incurs significant area, input capacitance and power
overhead.
6.2.2 Proposed Background Calibration Scheme
In the reference-ADC based calibration scheme, all the transition levels are
calibrated simultaneously. This necessitates a reference ADC with at least the same
resolution as the main ADC, and thus high overhead seems inevitable. However,
because offset varies slowly over time, the transition levels can be calibrated
sequentially instead of simultaneously. The benefit of this sequential calibration is the
greatly reduced complexity of the reference ADC. In the extreme case, as in our
proposed calibration scheme, 1-b resolution is sufficient, and the reference ADC
degenerates to a single comparator.
Figure 6-7 shows a block diagram of the proposed calibration scheme. The
reference ADC is now replaced with a single comparator, whose threshold voltage is
reconfigurable through a digital-to-analog converter (DAC). At the beginning, the
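calibration engine sets the comparator’s threshold voltage to the ideal value of the first transition level, as shown in Figure
6-8(A). By monitoring the outputs of the ADC and the comparator, the calibration engine
adjusts $V_{TH}[1]$ until it matches the comparator’s threshold. After calibrating $V_{TH}[1]$, the comparator’s threshold
voltage is set to the ideal value of the second transition level and calibration of $V_{TH}[2]$ begins, as shown in Figure 6-8(B). By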
iterating the same process, all the transition levels of the main ADC can be calibrated.
The resulting fully-calibrated transfer curve of the ADC is shown in Figure 6-8(H). The
performance metrics of the proposed calibration scheme are discussed below.
Figure 6-7. Proposed reconfigurable-comparator-based calibration
(A) (B) (C) (D)
(E) (F) (G) (H)
Figure 6-8. Principle of the proposed calibration scheme. The transition levels are
calibrated sequentially in A)-G), and the resulting transfer curve is shown in H).
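The sequential procedure can be illustrated with a behavioral model. The Python sketch below is a minimal simulation under stated assumptions, not the implemented calibration engine: the reference comparator and DAC are ideal, the input is uniformly distributed over the full scale, each transition level is given a fixed number of conversions (`dwell`), and every observed discrepancy moves the level by one step. The offset spread, step size and dwell count are hypothetical values, with all voltages expressed in LSB.

```python
import random

def calibrate_sequentially(n_bits=5, sigma_lsb=1.0, step_lsb=0.125,
                           dwell=20_000, seed=0):
    """Return the residual offset (in LSB) of every transition level."""
    rng = random.Random(seed)
    levels = 2 ** n_bits - 1
    v_fs = float(2 ** n_bits)                  # full-scale range in LSB
    ideal = [k + 1.0 for k in range(levels)]   # ideal transition levels
    actual = [v + rng.gauss(0.0, sigma_lsb) for v in ideal]
    for k in range(levels):                    # one transition level at a time
        v_ref = ideal[k]                       # DAC sets the comparator threshold
        for _ in range(dwell):
            v_in = rng.uniform(0.0, v_fs)
            main_bit = v_in > actual[k]
            ref_bit = v_in > v_ref
            if main_bit != ref_bit:            # input fell in the calibration window
                # main high, reference low: the main level is too low, so raise it
                actual[k] += step_lsb if main_bit else -step_lsb
    return [a - i for a, i in zip(actual, ideal)]

residuals = calibrate_sequentially()
print(f"worst residual offset: {max(abs(r) for r in residuals):.3f} LSB")
```

Once a level is within one step of its target, any further discrepancy only toggles it across the target, so the residual error stays bounded by the step size in this idealized model.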
6.2.2.1 Calibration accuracy
The calibration accuracy is determined by a few factors, including the reference
ADC accuracy, the calibration step size, and noise. The discussion above assumes an
ideal reference ADC. In reality, however, both the DAC and the comparator in the
reference ADC introduce errors and ultimately limit the calibration accuracy. Moreover,
due to the digital nature of the calibration scheme, the main ADC can only be adjusted
in discrete steps. The reference ADC accuracy, together with the finite calibration step
size, limits the overall calibration accuracy. Once the ADC is calibrated, the residual
error in the transition level is bounded by
- )
where is the DAC error, is the offset of the comparator in the reference
ADC, and is the calibration step size. The calibrated INL and DNL are bounded by
| | )
and
| | )
respectively. Notice that does not impact the calibrated DNL. This is because
appears in all the calibrated transition levels and merely causes a DC shift in the
calibrated transfer curve.
The effect of noise on calibration accuracy is shown in Figure 6-9 for the case
, where denotes the mean of a random variable. For convenience,
the noise is lumped to in Figure 6-9. Ideally, whenever a discrepancy occurs, it
should indicate and correct calibration can be made. However, due to noise,
may be temporarily higher than , as indicated by the dashed line in Figure 6-9,
and this may cause incorrect calibration to occur. To improve immunity to noise, the
calibration engine can average multiple discrepancies before making a decision.
Figure 6-9. Mechanism of noise-induced calibration error
Because the reference ADC shares the same T/H and sampling clock as the
main ADC, the calibration accuracy of the proposed scheme does not suffer from the
T/H mismatch issue as the conventional reference-ADC based approach does. Nor is it
sensitive to the statistics of the input signal since it does not rely on the correlation
between the input signal and an on-chip pseudo-random sequence.
6.2.2.2 Convergence speed
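To calculate the convergence speed, we assume $V_{IN}$ is uniformly distributed within
the full-scale input range $V_{FS}$. Similar calculations can be carried out for other input
distributions, such as those of sine waves. Suppose the initial offset of a certain
transition level is $V_{OS}$. The probability that the input produces a discrepancy is $|V_{OS}|/V_{FS}$, and
on average $\lceil V_{FS}/V_{OS} \rceil$ conversions are needed to reduce the offset by one step, where $\lceil x \rceil$
is the smallest integer that is not smaller than $|x|$. Approximating the remaining offset by integer
multiples of the calibration step size $\Delta$, the number of conversions to
calibrate the offset is
$$N_{conv}(V_{OS}) = \sum_{i=1}^{\lceil V_{OS}/\Delta \rceil} \left\lceil \frac{V_{FS}}{i\,\Delta} \right\rceil \tag{6-6}$$
If we assume the offset is normally distributed with a mean of zero and a
standard deviation of σ, then the average number of conversions required to calibrate a
particular transition level is
$$\bar{N}_{conv} = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-V_{OS}^2/(2\sigma^2)}\, N_{conv}(V_{OS})\, dV_{OS} \tag{6-7}$$
Exploiting the symmetry of the integrand and assuming the offset is within [-3σ,
3σ], we can approximate the above integral as
$$\bar{N}_{conv} \approx 2\int_{0}^{3\sigma} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-V_{OS}^2/(2\sigma^2)}\, N_{conv}(V_{OS})\, dV_{OS} \tag{6-8}$$
For an N-bit ADC, there are $2^N-1$ transition levels. The total number of
conversions for the calibration to converge is
$$N_{total} = (2^N-1)\,\bar{N}_{conv} \tag{6-9}$$
Since $V_{FS} = 2^N V_{LSB}$, Equations 6-8 and 6-9 are combined to yield
$$N_{total} \approx 2\,(2^N-1)\int_{0}^{3\sigma} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-V_{OS}^2/(2\sigma^2)} \sum_{i=1}^{\lceil V_{OS}/\Delta \rceil} \left\lceil \frac{2^N V_{LSB}}{i\,\Delta} \right\rceil dV_{OS} \tag{6-10}$$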
Figure 6-10 plots as a function of the ADC resolution with different σ when
. For a 5-bit ADC, when , the calibration takes about
conversions to converge. Note that while the total number of conversions grows at a rate of $2^{2N}$, it is a relatively
weak function of σ. For example, tripling σ from $1\,V_{LSB}$ to $3\,V_{LSB}$ increases the required
number of conversions by only 37%. This is because calibrating small offsets takes
more conversions as the input has a lower chance of producing a discrepancy when the
offset is small.
Figure 6-10. Required conversions for convergence with different resolutions
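The scaling trends discussed above can be explored numerically. The Python sketch below is an illustration under stated assumptions rather than a reproduction of Figure 6-10: the input is uniform, the offset is Gaussian and truncated to three standard deviations, the remaining offset is approximated by integer multiples of the calibration step, and the step size of one eighth of an LSB is a hypothetical choice. Function names are illustrative, and all voltages are in LSB.

```python
import math

def conversions_for_offset(v_os, v_fs, step):
    """Conversions to trim one level, visiting remaining offsets n*step, ..., 1*step."""
    n_steps = math.ceil(abs(v_os) / step)
    return sum(math.ceil(v_fs / (i * step)) for i in range(1, n_steps + 1))

def total_conversions(n_bits, sigma_lsb, step_lsb=0.125, points=2000):
    """Expected conversions to calibrate all 2**n_bits - 1 levels sequentially."""
    v_fs = float(2 ** n_bits)
    dv = 3.0 * sigma_lsb / points
    norm = math.sqrt(2.0 * math.pi) * sigma_lsb
    avg = 0.0
    for j in range(points):  # midpoint rule over [0, 3*sigma], doubled by symmetry
        v = (j + 0.5) * dv
        pdf = math.exp(-0.5 * (v / sigma_lsb) ** 2) / norm
        avg += 2.0 * pdf * conversions_for_offset(v, v_fs, step_lsb) * dv
    return (2 ** n_bits - 1) * avg

for bits in (4, 5, 6):
    print(bits, round(total_conversions(bits, 1.0)), round(total_conversions(bits, 3.0)))
```

Under these assumptions each additional bit of resolution should increase the total by roughly a factor of four, while tripling σ increases it far less than threefold; the exact percentages depend on the step size assumed.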
6.2.2.3 Calibration overhead and performance considerations
The calibration overhead consists mainly of the reference ADC, the calibration
engine, the memory to store the offset control words, and the circuitry to adjust the main
ADC offset. With the calibration engine, the memory and the adjustment circuitry being
common to all digital calibration schemes, the major overhead advantage of the
proposed scheme lies in the simplicity of the reference ADC. The comparator in the
reference ADC can reuse the design available in the main ADC and entails no extra
design effort. The DAC in the reference ADC is only used to set the threshold voltage
and its speed requirement is much relaxed compared to the main ADC’s sample rate.
The power, area, and design overhead of the reference ADC are therefore trivial.
The proposed calibration scheme does not require noise injection or chopping as
seen in correlation-based calibrations. While redundancy-based calibration reconfigures
the main ADC during normal operation, the calibration scheme herein does not.
[Figure 6-10 axes: number of conversions (log scale, 10^3 to 10^8) vs. resolution (4 to 10 bit); curves for σ = 1 VLSB and σ = 3 VLSB]
Moreover, it does not insert extra conversion cycles thereby avoiding any speed
penalty. Although the reference ADC does increase the input capacitance, this penalty
is minimal because only a single comparator is used. For example, calibrating a 5-b
flash ADC with the proposed scheme increases the input capacitance by less than 4%.
This is in stark contrast to the split-ADC architecture, which increases the input
capacitance by .
Table 6-1 shows a comparison of various background calibration schemes. The
proposed calibration engine achieves deterministic operation, introduces little
performance penalty, and incurs low hardware and design overhead. Because the
calibration is sequential, its convergence is slower than that of the split-ADC architecture. This
usually is not detrimental since environmental variations are slow. When fast
convergence is desired (for example, to reduce the test time during mass production),
foreground calibration can be performed at power up before the background calibration
is enabled.
Table 6-1. Comparison of proposed and existing background calibration schemes
Scheme             Deterministic  Performance penalty  Hardware overhead  Design effort  Convergence speed
Correlation-based  No             Yes                  Medium             Medium         Low
Redundancy-based   Yes            Yes                  Low                Low            High
Ref.-ADC-based     Yes            No                   High               High           Medium
Split-ADC          Yes            Yes                  High               Low            High
This work          Yes            No                   Low                Low            Medium
6.3 Chip Implementation
6.3.1 ADC Architecture
Figure 6-11 depicts a block diagram of the implemented 5-bit flash ADC with the
calibration circuitry (drawn single-ended for simplicity, though the real implementation is
differential). The main ADC consists of a track-and-hold (T/H), a resistor ladder, a
comparator array, and a digital backend. The comparator array is comprised of
comparators C[1:31], which digitize the sampled analog input against 31 evenly-spaced
reference voltages VR[1:31] from the resistor ladder. The resulting thermometer codes
are then converted to binary format by the digital backend which also corrects first-order
bubble errors.
Figure 6-11. Block diagram of the ADC
The calibration circuitry consists of the resistor ladder and the shaded blocks in
Figure 6-11. The switch bank SR, the resistor ladder and the comparator C[0] make up
the reference ADC. The SRAM stores the offset control words W[1:31] for
C[1:31]. The finite-state machine (FSM) communicates with the SRAM through the
address decoder and serves as the calibration engine.
The chip also houses a serial interface. This facilitates digital control of the bias
generator and allows clearing the SRAM content to disable calibration.
[Figure 6-11 labels: VIN; T/H; resistor ladder between VRP and VRN with taps VR[1] to VR[31]; comparators C[0]~C[31] with outputs Q[0] to Q[31]; switch banks SR and SQ controlled by S[1] to S[31]; offset control words W[1]~W[31]; SRAM (31×5b); address decoder (ADDR, DATA); FSM; digital backend; serial interface; bias generator]
6.3.2 Resistor Ladder
Since the resistor ladder generates the reference voltages for the reference ADC,
its linearity ultimately determines the achievable calibration accuracy. For an N-bit ADC,
the requirement on the resistors used in the ladder is [90]
√
where R is the nominal resistance and is the variance. The resistor ladder consists of
identical poly resistor units with W/L of 8μm/4μm with estimated mismatch <0.35%,
which is better than 8-bit accuracy [91]. To stabilize the reference voltages and
suppress input feedthrough, decoupling PMOS capacitors are connected to all resistor
ladder output taps [92]. The resistor ladder consumes 0.21 mW.
6.3.3 T/H
A passive T/H precedes the comparator array, the schematic of which is shown
in Figure 6-12(A). By presenting a static signal to the comparator array during
quantization, the T/H helps minimize linearity degradation due to signal dependent
comparator delays and the clock and signal skew between comparators. Since the input
voltage swing is from VDD-0.4V to VDD, PMOS transistors are used. This also eliminates
the need for a buffer to shift the input common mode level [93] [92].
The bandwidth of the T/H is determined by the on-resistance of the switch and
the sampling capacitor. Figure 6-12(B) shows the small signal model of the T/H, where
CPAD is the pad parasitic capacitance, Csample is the sampling capacitance, and the 25Ω
resistor is the parallel combination of the channel impedance and the on-chip
termination resistor. A simple π model is used in the transistors’ places, with R’ and C’
being the channel resistance and the gate capacitance of a unit width transistor
respectively. A larger transistor has a lower on-resistance and thus tends to give a
higher bandwidth. However, when the on-resistance is comparable to 25Ω, the
bandwidth will drop with increasing transistor width because the parasitic capacitance
begins to dominate. An optimum transistor size therefore exists which maximizes the
total T/H bandwidth. Figure 6-13 plots the T/H bandwidth as a function of the transistor
width. It can be seen that a width of 28 μm gives the highest bandwidth. However, the
optimum is not a very sharp one. A transistor width of 14 μm is chosen instead, with only
a 10% drop in bandwidth, while saving about 0.2 mW on clocking.
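The existence of an optimum switch width can be illustrated with a first-order (Elmore) time-constant estimate of the π model in Figure 6-12(B). This is only a sketch: the single-pole approximation and every numeric value except the 25 Ω source resistance (unit-width resistance, unit-width capacitance, pad and sampling capacitances) are hypothetical placeholders, not the design values of this work.

```python
import math

R_SRC = 25.0        # ohm, channel in parallel with the termination (from the text)
R_UNIT = 500.0      # ohm*um, hypothetical channel resistance R' of a 1-um-wide switch
C_UNIT = 1.5e-15    # F/um, hypothetical capacitance C' per um at each switch terminal
C_PAD = 1.0e-12     # F, hypothetical pad capacitance
C_SAMPLE = 100e-15  # F, hypothetical sampling capacitance

def bandwidth_hz(width_um):
    """Single-pole bandwidth estimate from the Elmore delay of the pi model."""
    c_in = C_PAD + width_um * C_UNIT        # node on the pad side of the switch
    c_out = C_SAMPLE + width_um * C_UNIT    # node on the sampling-capacitor side
    r_sw = R_UNIT / width_um                # switch on-resistance
    tau = R_SRC * (c_in + c_out) + r_sw * c_out
    return 1.0 / (2.0 * math.pi * tau)

def optimum_width_um():
    """Width that minimises tau: 2*R_SRC*C_UNIT - R_UNIT*C_SAMPLE/W**2 = 0."""
    return math.sqrt(R_UNIT * C_SAMPLE / (2.0 * R_SRC * C_UNIT))

# Brute-force sweep from 1.0 um to 50.0 um in 0.1-um steps.
widths = [w / 10.0 for w in range(10, 501)]
best = max(widths, key=bandwidth_hz)
```

In this model the wider switch lowers the series resistance but adds capacitance that the 25 Ω source must charge, so the time constant has a minimum. With these placeholder values the peak is shallow, which is in line with the observation that a narrower switch costs little bandwidth.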
Figure 6-12. T/H design. A) Schematic. B) Its small-signal model.
[Figure 6-12 labels: A) switch widths 7 µm, 14 µm, 7 µm with clocks CKB, CK, CKB; VDD. B) 25 Ω source resistance, CPAD, switch resistance R'/W with WC' on each side, Csample]
Figure 6-13. T/H bandwidth vs. switch width
[Figure 6-13 axes: bandwidth (GHz), 2.0 to 4.5, vs. width (µm), 0 to 50]
A few mechanisms limit the T/H linearity, including signal-dependent charge
injection, clock feedthrough, and nonlinear channel resistance during track-mode [94].
Dummy switches driven by a delayed complementary clock are used at both sides of
the sampling switch to cancel the charge injection [92]. With second order distortion
largely removed by differential signaling, the third order term dominates the distortion
performance. Simulation shows that, when sampling a 1.4GHz full scale sine wave at
2.5GS/s, the T/H achieves -45dBc third order harmonic distortion, with 1.5dB
improvement by the dummy switches.
6.3.4 Comparator
Figure 6-14 shows the block diagram of the comparator. A three-stage
preamplifier followed by a regenerative latch digitizes the difference between input and
reference voltages. Another two latch stages reduce metastability and convert current-
mode-logic (CML) levels to full-swing CMOS logic levels. A current steering DAC
accepts the control word from the SRAM and injects static current into the output of the
first preamplifier stage to cancel the offset of the whole comparator.
Figure 6-14. Comparator block diagram.
Compared to a dynamic comparator [74], the preamplifier expedites the
regeneration in the latch [95], suppresses charge kickback, and provides better power
supply and common-mode rejections. The preamplifier consists of three stages (P1~P3)
for fast overdrive recovery [90] [93]. Figure 6-15 shows the schematics of P1, P2, and
the DAC. Resistor loads are used instead of diode-connected transistors to avoid the
voltage headroom penalty due to the transistor VT [93].
Figure 6-15. Schematics of the first two stages of the preamplifier
[Figure 6-15 labels: DAC, P1, P2; inputs VINP, VRP, VRN, VINN; bias VB. Device sizes: M1A, M1B 1µ/0.12µ; M2A, M2B 1µ/0.12µ; M3A, M3B 1µ/0.12µ; M4A, M4B 0.4µ/0.12µ; M5A, M5B 1µ/0.12µ; R1A, R1B 6 kΩ; R2A, R2B 6 kΩ; IT1, IT2, IT3 100 µA; IDAC 0~40 µA]
For high speed operation, the bandwidth of the preamplifiers must be maximized.
For that reason, it is desirable to bias the transistors at high current densities. However,
this practice is limited by two factors. First, the transit frequency of a transistor
increases slowly at high current densities, as shown in Figure 6-16(A), which means the
current efficiency drops at high current densities, even without considering the fT drop
caused by velocity saturation. Second, the highest current density is limited by the
supply voltage due to voltage headroom issues. For P1, ignoring the currents through
M3, the gain is given by

$$A_{P1} = \frac{1}{2}\,\frac{g_m}{I_1}\,V_R$$

where $g_m$ is the transconductance of M1 and M2, $I_1$ is the current through M1 and M2,
and $V_R$ is the voltage drop on R1 when the differential pair is balanced. The factor $1/2$ is
due to the fact that half of the bias current flows through M1B and M2B and does not
produce any gain. Since VINP, VINN, VRP and VRN all vary between VDD-0.4V and VDD, to
prevent M1 and M2 from entering the linear region, $V_R$ must be kept below about 0.25
V considering the body effect. Figure 6-16(B) plots $V_R$ as a function of the current
density, assuming a moderate gain of 2. It can be seen that the speed and gain
requirements can’t be met without violating the limit. To solve this problem, two
transistors biased in the saturation region (M3A and M3B) are used to bypass half of the
current to reduce the voltage headroom on R1A and R1B by half [96], as also shown in
Figure 6-16(B). The chosen current density is 50μA/μm.
Since P2 has less self-loading, it can achieve a larger GBW than P1 given the
same bias condition and fanout. The gain of P2 is therefore designed 70% higher than
P1, while the bandwidths of P1 and P2 are kept the same. To save area, no inductive
peaking is used.
Figure 6-16. Effects of M3. A) Transit frequency vs. current density. B) Required voltage drop on the load resistor vs. current density.
[Figure 6-16 axes: A) fT (GHz), 0 to 120, vs. current density (μA/μm), 0 to 300. B) VR (V), 0 to 1.6, vs. current density (μA/μm), 0 to 300; curves without and with M3]
Figure 6-17. Schematic of the CML latches
A CML flip-flop and a sense-amplifier flip-flop (SAFF) complete the comparator.
Figure 6-17 shows the CML flip-flop, which is constructed with the conventional master-
slave topology. Figure 6-18 shows the SAFF schematic. It consists of a sense-amplifier
(SA) and a set-reset (SR) latch. The SAFF provides additional gain to suppress
metastability errors and convert CML levels to full-swing CMOS levels. With the
additional gains of the latches, the ADC’s BER is estimated to be better than [97].
Figure 6-18. Schematic of the SAFF
Figure 6-19 shows the current-steering DAC. A bias generator shared by all the
comparators generates three bias voltages. The offset control word W[N] selects from
these three bias voltages and VSS to inject an appropriate current to comparator C[N]
and cancel its offset.
[Figure 6-17 labels: stages P3, L1, P4, L2 clocked by CK and CKB. Device sizes: M1A, M1B 1µ/0.12µ; M2A, M2B 0.8µ/0.12µ; M3A, M3B 1µ/0.12µ; M4A, M4B 0.8µ/0.12µ; M5A, M5B 2µ/0.12µ; M6A, M6B 2µ/0.12µ; R1A, R1B 8 kΩ; R2A, R2B 8 kΩ; IT1, IT2 60 µA. Figure 6-18 labels: L3, SR latch, CKB]
Figure 6-19. Current-steering DAC and the DAC bias generator. The bias generator is shared by all the comparators.
One important design parameter of the current-steering DAC is its calibration
range $V_{cal}$. This range is selected based on the comparator offset and the yield target.
To reduce area and power consumption, the transistors in the comparators are sized
close to the minimum. Figure 6-20(A) shows the simulated comparator offset $\sigma_{OS}$, which
is 22.5 mV (0.9 LSB) and is dominated by the preamplifier. For a given calibration
range, the yield is the probability that the offsets of all 32 comparators fall within this
range and, assuming a Gaussian distribution for the comparator offset, is given by

$$\text{Yield} = \left[ \operatorname{erf}\left( \frac{V_{cal}}{2\sqrt{2}\,\sigma_{OS}} \right) \right]^{32}$$

Figure 6-20(B) shows the yield as a function of the normalized calibration range. To
achieve a yield higher than 90%, the normalized calibration range $V_{cal}/\sigma_{OS}$ should be higher
than 6. In this prototype, the maximum IDAC is programmable through the serial
interface, and Figure 6-20(C) shows the simulated $V_{cal}$ as a function of the maximum IDAC.
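The yield estimate described above can be checked numerically. The sketch below assumes independent zero-mean Gaussian offsets and a calibration range that is symmetric about zero, so a normalized range of 6 corresponds to ±3 standard deviations; the function name is illustrative only.

```python
import math

N_COMPARATORS = 32   # number of comparators stated in the text

def calibration_yield(normalized_range):
    """Probability that every comparator offset lies within +/- range/2.

    normalized_range is the full calibration range divided by the offset
    standard deviation (assumed symmetric about zero).
    """
    half_range_in_sigma = normalized_range / 2.0
    p_single = math.erf(half_range_in_sigma / math.sqrt(2.0))
    return p_single ** N_COMPARATORS

yield_at_6 = calibration_yield(6.0)
yield_at_5 = calibration_yield(5.0)
```

Under these assumptions a normalized range of 6 gives a yield just above 90%, while a range of 5 falls well short of it, which agrees with the threshold quoted in the text.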
The other key parameter of the current-steering DAC is its resolution, which
determines the calibration step and the achievable calibration accuracy as discussed
previously. In this prototype, 5-b resolution is chosen. When the calibration range is
programmed to 5.4 LSB, the calibration step is 0.19 LSB. With the resistor
ladder providing higher than 8-b linearity, this guarantees a calibration accuracy of 0.5
LSB according to Equation 6-3.
[Figure 6-19 labels: shared bias generator with currents IB×1, IB×2, IB×3 producing VB[1], VB[2], VB[3]; current-steering DAC in which W[N][1:0] and W[N][3:2] each select among VB[3], VB[2], VB[1] and VSS, plus W[N][4]. Device sizes: M6A, M6B, M6C 2µ/0.16µ; M4A, M4B 0.4µ/0.12µ; M3A 1.6µ/0.16µ; M3B 0.4µ/0.16µ; IB 13 µA]
Figure 6-20. Simulated comparator performances. A) Offset. B) Yield vs. normalized calibration range. C) Calibration range.
[Figure 6-20 axes: A) offset (mV), -60 to 60, on the horizontal axis. B) yield, 0% to 100%, vs. normalized DAC range, 0 to 12. C) Vcal (mV), 0 to 200, vs. maximum IDAC (µA), 0 to 50]
6.3.5 Digital Backend
A digital backend converts the output thermometer codes of the comparator array
to binary format. It also provides the capability of correcting or minimizing errors due to
bubbles or metastabilities. Figure 6-21 shows the block diagram of the digital backend.
The three-input AND gate array converts the thermometer codes to one-hot codes and
provides 1st order bubble error correction. The one-hot codes are then used to address
a quasi-gray-code ROM encoder [98]. Simple XOR gates convert the quasi-gray code to
binary codes. The binary codes are then decimated by 64 to accommodate the limited
bandwidth of the test equipment.
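The decoding chain described above can be sketched behaviorally as follows. Two points are assumptions of this sketch rather than details confirmed by the text: the three-input AND is taken in one common form (row i fires when Q[i-1] = 1, Q[i] = 1 and Q[i+1] = 0), and a plain reflected Gray code stands in for the quasi-gray code of [98]. All function names are illustrative.

```python
N_COMP = 31  # comparators C[1..31] of the 5-bit flash ADC

def thermometer(level):
    """Ideal comparator outputs Q[1..31] for an input at code `level` (0..31)."""
    return [1 if i <= level else 0 for i in range(1, N_COMP + 1)]

def to_one_hot(q):
    """Three-input AND per row: row i fires when Q[i-1]=1, Q[i]=1 and Q[i+1]=0.

    The list is padded with two ones below Q[1] and one zero above Q[31]
    so that rows 0 and 31 have well-defined neighbours.
    """
    ext = [1, 1] + list(q) + [0]
    return [ext[i] & ext[i + 1] & (1 - ext[i + 2]) for i in range(N_COMP + 1)]

def gray_rom_encode(one_hot):
    """Wired-OR of the Gray-coded ROM rows selected by the one-hot lines."""
    gray = 0
    for code, hot in enumerate(one_hot):
        if hot:
            gray |= code ^ (code >> 1)
    return gray

def gray_to_binary(gray):
    """XOR chain that converts a Gray code word to binary."""
    binary = 0
    while gray:
        binary ^= gray
        gray >>= 1
    return binary

def decode(q):
    return gray_to_binary(gray_rom_encode(to_one_hot(q)))

# A single bubble: the correct level is 10, but Q[12] has fired spuriously.
bubbled = thermometer(10)
bubbled[11] = 1
```

With a two-input transition detector the stray 1 in `bubbled` would select two ROM rows at once; the third AND input suppresses the spurious row so that only one row is addressed.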
Figure 6-21. Block diagram of the digital backend
6.3.6 Reference ADC
The reference ADC is comprised of the resistor ladder, the switch bank SR, and
the comparator C[0]. The resistor ladder is reused from the main ADC to reduce the
calibration overhead. The switch bank SR is built with simple CMOS transmission gates and is
controlled by the one-hot code S[1:31] to select the desired reference voltage for C[0]
from the resistor ladder. C[0] shares the same design as C[1:31] and does not involve any
extra design effort.
6.3.7 Calibration Engine and Supporting Circuitry
The other calibration circuitry includes the FSM as the calibration engine, the
SRAM to store the offset control words, the address decoder to facilitate the
communication between the FSM and SRAM, and the switch bank SQ. The FSM, the
SRAM, and the address decoder are all built with standard cells, while the switch bank
SQ is implemented with CMOS transmission gates, same as SR.
Figure 6-22. FSM flow chart. N is the calibration index, which is also the SRAM address.
Figure 6-22 shows the flow chart of the FSM operation. At the beginning, the
FSM sets N to 1. This sets S[1] to HIGH so that both C[0] and C[1]’s reference voltages
are connected to VR[1]. Meanwhile, C[1]’s output is also selected. To improve noise
immunity, the FSM then accumulates the results of 128 comparisons between C[0] and
C[1]’s outputs before updating the control word W[1] in the SRAM. After that, the FSM
sets N to 2 and calibrates C[2]. This process repeats cyclically for C[1:31] so that the
comparators are all continuously calibrated in the background.
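The calibration loop described above can be mimicked with a small behavioral model. This is a sketch under stated assumptions, not the implemented FSM: the input is uniformly distributed and noise-free, the reference comparator is ideal, the control word moves by one step per 128-comparison window in the direction indicated by the sign of the error counter, and the word is assumed to span ±15 steps. The 0.19-LSB step and the 0.9-LSB offset spread are taken from the text; the offsets are clamped to ±2.7 LSB so that they stay inside the assumed correction range.

```python
import random

N_LEVELS = 31   # comparators C[1..31]
WINDOW = 128    # comparisons accumulated before each update (from the text)
STEP = 0.19     # calibration step in LSB (from the text)
W_MAX = 15      # assumed control-word range of +/-15 steps

def calibrate(offsets, rounds, rng):
    """Return the residual offsets (LSB) after `rounds` passes over C[1..31]."""
    words = [0] * (N_LEVELS + 1)
    for _ in range(rounds):
        for n in range(1, N_LEVELS + 1):
            err = 0
            for _ in range(WINDOW):
                vin = rng.uniform(0.0, 32.0)           # input in LSB
                q_ref = vin > n                         # C[0] at the ideal VR[N]
                q_main = vin > n + offsets[n] + words[n] * STEP
                if q_main and not q_ref:
                    err += 1                            # main threshold too low
                elif q_ref and not q_main:
                    err -= 1                            # main threshold too high
            if err > 0:
                words[n] = min(W_MAX, words[n] + 1)
            elif err < 0:
                words[n] = max(-W_MAX, words[n] - 1)
    return [offsets[n] + words[n] * STEP for n in range(1, N_LEVELS + 1)]

rng = random.Random(0)
offsets = [0.0] + [max(-2.7, min(2.7, rng.gauss(0.0, 0.9))) for _ in range(N_LEVELS)]
residuals = calibrate(offsets, rounds=100, rng=rng)
```

In this model each residual offset settles to within one calibration step of zero and then stays there, since an update is triggered only when the input falls between the actual and the ideal transition level.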
[Figure 6-22 flow: N = 1; clear error counter; compare Q[N] and Q[0]; update error counter; if fewer than 128 comparisons have been made, compare again, otherwise update W[N]; if N = 31, return to N = 1, otherwise N = N+1; repeat]
Note that, with the help of SQ, the FSM directly reads the ADC’s raw
thermometer output instead of its decoded binary output. This eliminates the need for a
5-b digital comparator and bypasses the possible complication introduced by bubble
error correction.
6.3.8 Clock and Power Distribution
Clock distribution is of crucial importance in high speed ADC design. The clock
buffers are sized for the same fan-out. Dummy loads are inserted in the clock tree to
compensate for unbalanced loads. To account for the finite delay through the
preamplifier, the clock of the T/H leads that of the comparators by one inverter delay.
Since the clock of the FSM and the decimator is divided down from the full-speed clock
and its phase relationship with the full-speed clock is unknown, multiple phases are
generated for selection through the on-chip serial interface.
The power is split into analog and digital domains. Decoupling capacitors are
inserted wherever there is spare area. To prevent noise coupling through the substrate,
a guard ring is inserted between the analog part and the digital part. The guard ring is
connected to a dedicated ground pad, separate from the analog and digital ground pads
[99].
6.4 Experimental Results
The prototype 5-bit flash ADC was fabricated in a 0.13-μm 1-poly 8-metal bulk
CMOS process and was measured in a QFN package. Figure 6-23 shows the chip
micrograph. The ADC core occupies an active area of 0.24 mm². Even without any
layout optimization, the calibration circuitry takes less than 10% of the core area.
Figure 6-23. Chip micrograph.
The ADC was powered from a 1.2-V supply. The reference voltages VRP and VRN
were set to 1.2 V and 0.8 V respectively, giving a differential full-scale input range of 0.8
V. The ADC’s decimated digital output was captured by a mixed-signal oscilloscope and
post-processed in Matlab.
The ADC’s static performance was evaluated by stepping the DC input voltage to
the ADC and recording the levels at which the output toggles. The peak-to-peak noise
observed during DC measurement is 2.5 mV, or roughly 0.1 LSB. To remove the effect
of noise during the DC measurement, the output codes were averaged to find the
transition levels. Figure 6-24 shows the measured INL and DNL with and without
calibration. When calibration is disabled, i.e., when all the SRAM bits are cleared to 0
through the serial interface, the ADC has an INL of -1.85/1.48 LSB and a DNL of
-1.00/2.75 LSB. Enabling calibration improves the INL to -0.21/0.17 LSB and the DNL to
-0.07/0.04 LSB. The low calibrated DNL and INL clearly demonstrate the efficacy of
the proposed calibration scheme.
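For reference, the LSB size implied by the stated 0.8-V differential full-scale range of the 5-bit ADC, and the conversion of the observed peak-to-peak noise into LSB, work out as follows (a worked check of the quoted numbers, not additional measured data):

$$V_{LSB} = \frac{V_{FS}}{2^{N}} = \frac{0.8\ \text{V}}{2^{5}} = 25\ \text{mV}, \qquad \frac{2.5\ \text{mV}}{25\ \text{mV}} = 0.1\ \text{LSB}$$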
[Figure 6-23 labels: FSM, SRAM, comparators, digital backend, bias, clock tree, resistor ladder; scale bar 100 μm]
Figure 6-24. Measured ADC linearity. A) INL. B) DNL.
Figure 6-25 shows the test setup for dynamic performance evaluation. The single-ended
input signal from a signal generator is first converted to differential by a passive balun
before being fed to the ADC. Figure 6-26 shows the output spectrums before and after
enabling the calibration. The input signal is a full-scale 1.172-GHz sine wave, and the
sample rate is 2.5 GS/s. Note that due to the decimation, the fundamental tone is
aliased to 0.3 MHz and the frequency spans from DC to 19.53125 MHz. The SFDR
improves by nearly 12 dB from 27.3 dB to 39.2 dB with calibration.
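The frequency span and the aliased tone position follow from the decimated sample rate, as the short sketch below illustrates (function names are illustrative). With the rounded 1.172-GHz input frequency it places the tone at 0.125 MHz rather than the quoted 0.3 MHz. The alias position moves one-for-one with the input frequency, so the difference may simply reflect rounding of the quoted input frequency.

```python
def decimated_span_hz(sample_rate_hz, decimation):
    """Highest frequency visible after keeping every `decimation`-th sample."""
    return sample_rate_hz / decimation / 2.0

def aliased_tone_hz(input_hz, sample_rate_hz, decimation):
    """Frequency at which a tone appears in the decimated output spectrum."""
    rate = sample_rate_hz / decimation   # effective sample rate after decimation
    folded = input_hz % rate
    return min(folded, rate - folded)

span = decimated_span_hz(2.5e9, 64)            # 19.53125 MHz
tone = aliased_tone_hz(1.172e9, 2.5e9, 64)     # alias of the rounded input frequency
```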
Figure 6-25. Test setup for dynamic performance evaluation
[Figure 6-25 labels: power supply (VDD, VRP, VRN), CK, balun, test board with ADC and LVDS driver, mixed-signal scope, Matlab]
Figure 6-26. Output spectrums. A) W/ calibration. B) W/o calibration
[Figure 6-26 axes: magnitude (dB), -70 to 0, vs. frequency (MHz), 0 to 18; annotations: 39.2 dB and ENOB = 4.4035 with calibration, 27.3 dB and ENOB = 3.1763 without calibration]
Figure 6-27. ENOB w/ and w/o calibration
[Figure 6-27 axes: ENOB (bit), 2.0 to 5.0, vs. sample rate (GS/s), 1.0 to 3.0; curves with and without calibration, annotated 1.2 b]
Figure 6-27 shows the measured ENOB at various sample rates with the input
frequency kept at around 1.2 GHz. Without calibration, the highest ENOB is below 3.5
bits. With calibration, the ENOB improves to 4.7 bits below 2 GS/s and remains above
4.4 bits until 2.5 GS/s. For all sample rates, the calibration improves the ENOB by more
than 1.2 bits.
The ADC core (excluding peripheral IO and termination) consumes 50mW, of
which about 34 mW is consumed by the digital backend and the clocking circuitry. Even
without resorting to power-saving architectures such as interpolation and folding, our
design achieves a competitive figure-of-merit (FoM) of 0.95 pJ/conversion. Table 6-2
shows our design’s performance summary alongside some recently published flash
ADCs. Note that designs with similar or better FoM all employ interpolating or folding
techniques except [74], which uses fully dynamic comparators and a more advanced
technology.
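The quoted figure-of-merit is consistent with the common definition FoM = P / (2^ENOB · fs). That this is the definition used here is an assumption, supported by the fact that it reproduces the quoted value from the power, ENOB and sample rate given in the text. A minimal check:

```python
def walden_fom_pj(power_w, enob_bits, sample_rate_hz):
    """Energy per conversion step in pJ: P / (2**ENOB * fs)."""
    return power_w / (2.0 ** enob_bits * sample_rate_hz) * 1e12

# This work: 50 mW, ENOB of 4.4 bits at 2.5 GS/s.
fom_this_work = walden_fom_pj(50e-3, 4.4, 2.5e9)
```

The result rounds to the 0.95 pJ/conversion reported for this design.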
Table 6-2. Comparison with recently published work
Reference        [77]     [78]     [100]    [74]     [92]     [101]    [102]    [103]    This work
Interpolating    Yes      Yes      No       No       Yes      Yes      No       Yes      No
Folding          No       Yes      No       No       No       No       No       No       No
Resolution       6        6        4        5        6        6        6        6        5
Fs (GS/s)        3        2.7      4        1.75     3.5      1.6      5        1.2      2.5
INL (LSB)        0.2      0.73     0.24     0.39     1        0.42     0.7      0.6      -0.21/0.17
DNL (LSB)        0.2      0.53     0.15     0.38     0.5      0.49     0.6      0.4      -0.07/0.04
ENOB             5.8(1)   5.3      3.5      4.7      4.9      5.4      5.0      5.7      4.4
Process (nm)     90       90       180      90       90       130      65       130      130
VDD (V)          1.2      1        1.8/2.5  1        0.9      1.5      1.3      1.5      1.2
Power (mW)       90       50       608      7.6      98       180      320      90       50
Calibration      BG(2)    BG       FG(3)    FG       No       No       No       No       BG
Area (mm²)       0.28     0.36     0.88     0.03     0.15     0.42     0.3      0.12     0.24
FoM (pJ/Conv.)   2.3      0.47     13.6     0.17     0.95     2.6      1.97     1.4      0.95
(1) With 10-MHz input. (2) Background. (3) Foreground.
6.5 Summary
As technology scales, ADC-based serial links are becoming attractive for their
flexibility and scalability, where a flash ADC architecture is usually used for its high
speed capability. One of the key challenges in ADC-based serial links is the power
consumption of high-speed ADCs, reduction of which is limited by the mismatch
between components. By compensating for the offset due to mismatch, calibration
allows the use of small components in the ADC without performance degradation, thus
enabling low-power designs. Running the calibration in the background provides the
additional benefit of tracking environmental changes and device aging.
Key metrics for background calibration techniques include accuracy,
convergence speed, area/power overhead, and performance penalty. A brief survey of
currently available background calibration techniques against these metrics suggests
the need for improvement, especially for high-speed ADCs. A novel digital background
ADC calibration scheme has been proposed in this Chapter. By employing a single
reference comparator and reconfiguring its threshold voltage, the proposed scheme
calibrates the transition levels of the main ADC sequentially. Compared to the
simultaneous calibration of existing solutions, this sequential operation leads to
extremely low hardware and design overhead. Its impact on the ADC performance is
also minimal.
The effectiveness of the proposed calibration scheme is experimentally
demonstrated by the significant improvements in the static and dynamic performance of
a 50-mW 2.5-GS/s 5-bit full-flash ADC in 0.13-μm CMOS technology. Although a flash
ADC is used as a prototype in this work, the concept can be readily extended to other
architectures. This technique should help pave the way for future low-power ADC-based
serial links.
CHAPTER 7 CONCLUSIONS
The exponential increase of functionality integrated on a single microprocessor
requires ever higher aggregate I/O bandwidth. Meanwhile, the whole chip power budget
has been kept practically flat at around 140 W due to packaging and thermal
management limitations. As a result, the power efficiency of off-chip signaling must be
greatly improved to maintain the scaling of microprocessors.
At multi-Gb/s, the channel imposes a challenging bandwidth bottleneck because
of its frequency-dependent loss induced by skin effect and dielectric dissipation. As a
result, high-speed signaling usually resorts to sophisticated equalization such as FFE
and DFE to compensate for the channel loss. Besides equalization, other essential
functions in a high-speed link include clocking and signaling. To improve the link power
efficiency, the implementation options for each function must be carefully evaluated in
terms of their impact on the total link power so that informed tradeoffs can be made.
This Dissertation represents such an effort from both the circuit and channel
perspectives. On the circuit side, different schemes for equalization, clock generation
and recovery, and signaling modes are compared. The advantages of DFE, injection-
locking-based clock generation, baud-rate CDR, and voltage-mode signaling with
differential termination are identified. On the channel side, air-cavity transmission-lines
are proposed to reduce the dielectric loss of electrical channels at high frequencies. The
results of this effort include a 6.25-Gb/s 0.6-pJ/bit active with a current-sharing frontend
and an air-cavity channel, a 4.5-Gb/s 3.2-pJ/bit receiver with baud-rate eye-tracking
CDR and majority-voting DFE, and a 5-Gb/s 0.75-pJ/bit transceiver in exclusive static
CMOS logic style, which is among the best reported to date.
As semiconductor technology scales, digital signal processing has become
more and more power efficient compared to its analog counterpart. In the field of high-
speed off-chip signaling, this has recently led to the interest in ADC-based links. One
critical challenge in the ADC-based link architecture is to reduce the power consumption
of the high-speed ADC, which is limited by the component mismatches among other
factors. This Dissertation presents a digital background calibration technique that
features minimal overhead and performance penalty. The efficacy of the calibration
scheme is experimentally confirmed with a 50-mW 2.5-GS/s 5-b full-flash ADC.
All the silicon results in this Dissertation are based on a 0.13-µm bulk CMOS
technology. However, there are no fundamental reasons that prevent the presented
techniques from being extended to more advanced technologies. The work in this
Dissertation should therefore help pave the way toward more power-efficient off-chip
signaling in future electronic systems.
LIST OF REFERENCES
[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, pp. 114-117, April 1965.
[2] G. Moore, "Progress in Digital Electronics," in IEEE Technical Digest of the Int’l Electron Devices Meeting, 1975.
[3] B. Casper, G. Balamurugan, J. Jaussi, J. Kennedy and M. Mansuri, "Future microprocessor interfaces: analysis, design and optimization," in IEEE Custom Integrated Circuit Conf., 2007.
[4] J. Nasrullah, A. Amin, W. Ahmad, Z. Qin, Z. Mushtaq, O. Javed, J. Yoon, L. Chua, D. Huang, B. Huang, M. Vichare, K. Ho and M. Rashid, "A terabit/s-throughput; SerDes-based interface for a third-generation 16 Core 32 thread chip-multithreading SPARC processor," in IEEE Symp. VLSI Circuits, 2008.
[5] "The International Technology Roadmap for Semiconductors (ITRS)," 2011. [Online]. Available: http://public.itrs.net/. [Accessed 2011].
[6] J. Poulton, R. Palmer, A. M. Fuller, T. Greer, J. Eyles, W. J. Dally and M. Horowitz, "A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 42, no. 12, pp. 2745-2757, December 2007.
[7] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, T. Takemoto, F. Yuki and T. Saito, "A 12.3 mW 12.5 Gb/s complete transceiver in 65nm CMOS," in ISSCC Dig. Tech. Papers, San Francisco, 2010.
[8] M. Harwood, N. Warke, R. Simpson, T. Leslie, A. Amerasekera, S. Batty, D. Colman, E. Carr, V. Gopinathan, S. Hubbins, P. Hunt, A. Joy, P. Khandelwal, B. Killips, T. Krause, S. Lytollis, A. Pickering, M. Saxton, D. Sebastio and G. Swanson, "A 12.5Gb/s SerDes in 65nm CMOS Using a Baud-Rate ADC with Digital receiver Equalization and Clock Recovery," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2007.
[9] H. Johansson and C. Svensson, "Time resolution of NMOS sampling switches used on low-swing signals," IEEE J. Solid-State Circuits, vol. 33, no. 2, pp. 237-245, February 1998.
[10] H. Johnson and M. Graham, High-speed digital design: a handbook of black magic, New Jersey: Prentice-Hall, 1993.
[11] E. Bogatin, "Essential principles of signal integrity," IEEE Microwave Magazine, vol. 12, no. 5, pp. 34-41, August 2011.
[12] E. Bogatin, Signal integrity: simplified, New Jersey: Prentice Hall, 2003.
[13] W. J. Dally and J. Poulton, "Transmitter equalization for 4-Gbps signaling," Micro, vol. 17, no. 1, pp. 48-56, 1997.
[14] J. Jaussi, G. Balamurugan, D. Johnson, B. Casper, A. Martin, J. Kennedy, N. Shanbhag and R. Mooney, "8-Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation, and clock de-skew," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 80 - 88, January 2005.
[15] S. Gondi and B. Razavi, "Equalization and clock and data recovery techniques for 10-Gb/s CMOS serial-link receivers," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 1999-2011, 2007.
[16] T. Beukema, M. Sorna, K. Selander, S. Zier, B. Ji, P. Murfet, J. Mason, W. Rhee, H. Ainspan, B. Parker and M. Beakes, "A 6.4Gb/s CMOS SerDes core with feed-forward and decision-feedback equalization," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2633-2645, 2005.
[17] R. Payne, P. Landman, B. Bhakta, S. Ramaswamy, S. Wu, J. D. Powers, M. U. Erdogan, A. Yee, R. Gu, L. Wu, Y. Xie, B. Parthasarathy, K. Brouse, W. Mohammed, K. Heragu, V. Gupta, L. Dyson and W. Lee, "A 6.25-Gb/s binary transceiver in 0.13-um CMOS for serial data transmission across high loss legacy backplane channels," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2646-2657, December 2005.
[18] A. Emami-Neyestanak, A. Varzaghani, J. Bulzacchelli, A. Rylyakov, C.-K. Yang and D. Friedman, "A 6.0 mW 10.0Gb/s receiver with switched-capacitor summation DFE," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 889-896, 2007.
[19] S. Kasturia and J. H. Winters, "Techniques for high-speed implementation of nonlinear cancellation," IEEE J. Sel. Areas Commun., vol. 9, no. 5, pp. 711-717, June 1991.
[20] G. Balamurugan, J. Kennedy, G. Banerjee, J. Jaussi, M. Mansuri, F. O'Mahony, B. Casper and R. Mooney, "A scalable 5-15Gbps, 14-75mW low power I/O transceiver in 65nm CMOS," in IEEE Symp. VLSI Circuits, 2007.
[21] F. O'Mahony, S. Shekhar, M. Mansuri, G. Balamurugan, J. E. Jaussi, J. Kennedy, B. Casper, D. J. Allstot and R. Mooney, "A 27Gb/s forwarded-clock I/O receiver using an injection-locked LC-DCO in 45nm CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2008.
[22] K. Hu, R. Bai, T. Jiang, C. Ma, A. Ragab, S. Palermo and P. Y. Chiang, "0.16-0.25 pJ/bit, 8 Gb/s near-threshold serial link receiver with super-harmonic injection-locking," IEEE J. Solid-State Circuits, vol. 47, no. 8, pp. 1842-1853, 2012.
[23] B. Razavi, "A study of injection locking and pulling in oscillators," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1415-1424, 2004.
[24] J. Lee and H. Wang, "Study of subharmonically injection-locked PLLs," IEEE J. Solid-State Circuits, vol. 44, no. 5, pp. 1539-1553, 2009.
[25] J. Chen, A. Hu, Y. Fan and R. Bashirullah, "Noise suppression in injection-locked ring oscillators," Electronics Letters, vol. 48, no. 6, pp. 323-324, 2012.
[26] M. Hsieh and G. Sobelman, "Architectures for multi-gigabit wire-linked clock and data recovery," IEEE Circuits and Systems Magazine, vol. 8, no. 4, pp. 45-57, 2008.
[27] C. R. Hogge, "A self-correcting clock recovery circuit," IEEE J. Lightwave Tech., vol. 3, no. 12, pp. 1312-1314, 1985.
[28] J. D. H. Alexander, "Clock recovery from binary signals," Electronics Letters, vol. 11, no. 22, pp. 541-542, 30 October 1975.
[29] Y. M. Greshishchev, P. Schvan, J. L. Showell, M. Xu, J. J. Ojha and J. E. Rogers, "A fully integrated SiGe receiver IC for 10-Gb/s data rate," IEEE J. Solid-State Circuits, vol. 35, no. 12, pp. 1949-1957, 2000.
[30] J. Lee and B. Razavi, "A 40 Gb/s clock and data recovery circuit in 0.18um CMOS technology," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2003.
[31] T. Toifl, C. Menolfi, P. Buchmann, C. Hagleitner, M. Kossel, T. Morf, J. Weiss and M. Schmatz, "A 72mW 0.03mm2 inductorless 40Gb/s CDR in 65nm SOI CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2007.
[32] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger and H. Jackel, "A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2921-2929, December 2006.
[33] B. K. Casper, M. Haycock and R. Mooney, "An accurate and efficient analysis method for multi-Gb/s chip-to-chip signaling schemes," in IEEE Symp. VLSI Circuits, 2002.
[34] H. Hatamkhani and C.-K. K. Yang, "A study of the optimal data rate for minimum power of I/Os," IEEE Trans. Circuits and Syst. II, vol. 53, no. 11, pp. 1230-1234, 2006.
[35] M.-S. Chen, Y.-N. Shih, C.-L. Lin, H.-W. Hung and J. Lee, "A fully-integrated 40-Gb/s transceiver in 65-nm CMOS technology," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 627-640, March 2012.
[36] S. Hall and H. Heck, Advanced signal integrity for high-speed digital designs, New Jersey: John Wiley & Sons, 2009.
[37] B. Kim, Y. Liu, T. Dickson, J. Bulzacchelli and D. Friedman, "A 10-Gb/s Compact Low-Power Serial I/O With DFE-IIR Equalization in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3526-3538, 2009.
[38] T. Tanahashi, M. Kurisu, H. Yamaguchi, T. Nedachi, M. Arai, S. Tomari, T. Matsuzaki, K. Nakamura, M. Fukaishi, S. Naramoto and T. Sato, "A 2 Gb/s 21 CH low-latency transceiver circuit for inter-processor communication," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2001.
[39] K.-L. Wong, H. Hatamkhani, M. Mansuri and C.-K. Yang, "A 27-mW 3.6-Gb/s I/O transceiver," IEEE J. Solid-State Circuits, vol. 39, no. 4, pp. 602-612, April 2004.
[40] D. M. Pozar, Microwave engineering, New Jersey: John Wiley & Sons, 1998.
[41] M. V. Schneider, "Microstrip lines for microwave integrated circuits," Bell Syst. Tech. Journal, vol. 48, no. 5, pp. 1421-1444, 1969.
[42] G. Balamurugan, J. Kennedy, G. Banerjee, J. Jaussi, M. Mansuri, F. O'Mahony, B. Casper and R. Mooney, "A scalable 5–15 Gbps, 14–75 mW low-power I/O transceiver in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 1010-1019, 2008.
[43] T. Spencer, Y. Chen, R. Saha and P. Kohl, "Stabilization of the thermal decomposition of poly(propylene carbonate) through copper ion incorporation and use in self-patterning," Journal of Electronic Materials, pp. 1350-1363, 2011.
[44] D. Z. Turker, A. Rylyakov, D. Friedman, S. Gowda and E. Sanchez-Sinencio, "A 19Gb/s 38mW 1-tap speculative DFE receiver in 90nm CMOS," in IEEE Symp. VLSI Circuits, 2009.
[45] W. R. Eisenstadt and Y. Eo, "S-parameter-based IC interconnect transmission line characterization," IEEE Trans. Components, Hybrids, and Manufacturing Technology, vol. 15, no. 4, pp. 483-490, 1992.
[46] V. Balan, J. Caroselli, J.-G. Chern, C. Chow, R. Dadi, C. Desai, L. Fang, D. Hsu, P. Joshi, H. Kimura, C. Liu, T.-W. Pan, R. Park, C. You, Y. Zeng, E. Zhang and F. Zhong, "A 4.8-6.4-Gb/s serial link for backplane applications using decision feedback equalization," IEEE J. Solid-State Circuits, vol. 40, no. 9, pp. 1957-1967, 2005.
[47] K. H. Mueller and M. Muller, "Timing recovery in digital synchronous data receivers," IEEE Trans. on Communications, vol. 24, no. 5, pp. 516-531, May 1976.
[48] A. Emami-Neyestanak, S. Palermo, H.-C. Lee and M. Horowitz, "CMOS transceiver with baud rate clock recovery for optical interconnects," in IEEE Symp. VLSI Circuits, 2004.
[49] F. Musa and A. C. Carusone, "A baud-rate timing recovery scheme with a dual-function analog filter," IEEE Trans. Circuits Syst. II, vol. 53, no. 12, pp. 1393-1397, December 2006.
[50] R. S. Kajley, P. Hurst and J. E. C. Brown, "A mixed-signal decision-feedback equalizer that uses a look-ahead architecture," IEEE J. Solid-State Circuits, vol. 32, no. 3, pp. 450-459, 1997.
[51] W. Fang, "Accurate analytical delay expressions for ECL and CML circuits and their applications to optimizing high-speed bipolar circuits," IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 572-583, 1990.
[52] T. E. Collins, V. Manan and S. I. Long, "Design analysis and circuit enhancements for high-speed bipolar flip-flops," IEEE J. Solid-State Circuits, vol. 40, no. 5, pp. 1166-1174, 2005.
[53] A. Garg, A. C. Carusone and S. P. Voinigescu, "A 1-tap 40-Gb/s look-ahead decision feedback equalizer in 0.18-um SiGe BiCMOS technology," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2224-2232, October 2006.
[54] A. Kapoor, Y. Hu and R. Bashirullah, "Design and optimization of high-speed CML gates using a current-centric LE model," to appear in IEEE Trans. Circuits & Syst. I.
[55] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger and H. Jackel, "A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2921-2929, December 2006.
[56] M. G. Chen and J. K. Notthoff, "A 3.3-V 21-Gb/s PRBS generator in AlGaAs/GaAs HBT technology," IEEE J. Solid-State Circuits, vol. 35, no. 9, pp. 1266-1270, 2000.
[57] E. Laskin and S. P. Voinigescu, "A 60 mW per lane, 4x23-Gb/s 2^7-1 PRBS generator," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2198-2208, 2006.
[58] T. O. Dickson, E. Laskin, I. Khalid, R. Beerkens, J. Xie, B. Karajica and S. P. Voinigescu, "An 80-Gb/s 2^31-1 pseudorandom binary sequence generator in SiGe BiCMOS technology," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2735-2745, 2005.
[59] H. Knapp, M. Wurzer, T. F. Meister, J. Bock and K. Aufinger, "40-Gbit/s 2^7-1 PRBS generator IC in SiGe bipolar technology," in Proc. Bipolar/BiCMOS Circuits and Technology Meeting, Monterey, CA, 2002.
[60] H. Knapp, M. Wurzer, W. Perndl, K. Aufinger, J. Bock and T. F. Meister, "100-Gb/s 2^7-1 and 54-Gb/s 2^11-1 PRBS generators in SiGe bipolar technology," IEEE J. Solid-State Circuits, vol. 40, no. 10, pp. 2118-2125, 2005.
[61] K. Fukuda, H. Yamashita, F. Yuki, M. Yagyu, R. Nemoto, T. Takemoto, T. Saito, N. Chujo, K. Yamamoto, H. Yanai and A. Hayashi, "An 8Gb/s transceiver with 3X-oversampling 2-threshold eye-tracking CDR circuit for -36.8dB-loss backplane," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2008.
[62] M.-J. E. Lee, W. J. Dally and P. Chiang, "Low-power area-efficient high-speed I/O circuit techniques," IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1591-1599, 2000.
[63] S. Quan, F. Zhong, W. L., et al., "A 1.0625-to-14.025Gb/s multimedia transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40nm CMOS," in ISSCC Dig. Tech. Papers, San Francisco, 2011.
[64] R. Palmer, J. Poulton, W. J. Dally, J. Eyles, A. M. Fuller, T. Greer, M. Horowitz, M. Kellam, F. Quan and F. Zarkeshvari, "A 14mW 6.25Gb/s transceiver in 90nm CMOS for serial chip-to-chip communications," in ISSCC Dig. Tech. Papers, San Francisco, 2007.
[65] R. Farjad-Rad, A. Nguyen, J. M. Tran, T. Greer, J. Poulton, W. J. Dally, J. H. Edmondson, R. Senthinathan, R. Rathi, M.-J. E. Lee and H. Ng, "A 33-mW 8-Gb/s CMOS clock multiplier and CDR for highly integrated I/Os," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1553-1561, 2004.
[66] K. Hu, T. Jiang, J. Wang, F. O'Mahony and P. Y. Chiang, "A 0.6 mW/Gb/s, 6.4-7.2 Gb/s serial link receiver using local injection-locked ring oscillators in 90 nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 899-908, 2010.
[67] S. Shekhar, M. Mansuri, F. O'Mahony, G. Balamurugan, J. E. Jaussi, J. Kennedy, D. J. Allstot, R. Mooney and B. Casper, "Strong injection locking in low-Q LC oscillators: modeling and application in a forwarded-clock I/O receiver," IEEE Trans. Circuits and Syst. -I: Regular Papers, vol. 56, no. 8, pp. 1818-1829, 2009.
[68] F. O'Mahony, J. E. Jaussi, J. Kennedy, G. Balamurugan, M. Mansuri, C. Roberts, S. Shekhar, R. Mooney and B. Casper, "A 47x10 Gb/s 1.4mW/Gb/s parallel interface in 45 nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2828-2837, 2010.
[69] J. Cao, B. Zhang, U. Singh, D. Cui, A. Vasani, A. Garg, W. Zhang, N. Kocaman, D. Pi, B. Raghavan, H. Pan, I. Fujimori and A. Momtaz, "A 500mW digitally-calibrated AFE in 65nm CMOS for 10Gb/s serial links over backplane and multimode fiber," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2009.
[70] H. Yamaguchi, H. Tamura, Y. Doi, Y. Tomita, T. Hamada, M. Kibune, S. Ohmoto, K. Tateishi, O. Tyshchenko, A. Sheikholeslami, T. Higuchi, J. Ogawa, T. Saito, H. Ishida and K. Gotoh, "A 5Gb/s transceiver with an ADC-based feedforward CDR and CMA adaptive equalizer in 65nm CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2010.
[71] P. Kinget, "Device mismatch and tradeoffs in the design of analog circuits," IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212 - 1224, June 2005.
[72] I. Young, "Analog mixed-signal circuits in advanced nano-scale CMOS technology for microprocessors and SoCs," in Proceedings of the ESSCIRC, 2010.
[73] C. Chen, M. Le and K. Kim, "A low power 6-bit flash ADC with reference voltage and common-mode calibration," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1041-1046, 2009.
[74] B. Verbruggen, P. Wambacq, M. Kuijk and G. Van der Plas, "A 7.6 mW 1.75 GS/s 5 bit flash A/D converter in 90 nm digital CMOS," in IEEE Symp. VLSI Circuits, 2008.
[75] M. Flynn, C. Donovan and L. Sattler, "Digital calibration incorporating redundancy of flash ADCs," IEEE Trans. Circuits Syst. II, vol. 50, no. 5, pp. 205 - 213, May 2003.
[76] S. Tsukamoto, I. Dedic, T. Endo, K. Kikuta, K. Goto and O. Kobayashi, "A CMOS 6-b, 200 MSample/s, 3 V-supply A/D converter for a PRML read channel LSI," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1831-1836, 1996.
[77] M. Kijima, K. Ito, K. Kamei and S. Tsukamoto, "A 6b 3GS/s Flash ADC with Background Calibration," in IEEE Custom Integrated Circuits Conf., 2009.
[78] Y. Nakajima, A. Sakaguchi, T. Ohkido, N. Kato, T. Matsumoto and M. Yotsuyanagi, "A background self-calibrated 6b 2.7 GS/s ADC with cascade-calibrated folding-interpolating architecture," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 707-718, April 2010.
[79] H. Ploeg, G. Hoogzaad, H. Termeer, M. Vertregt and R. Roovers, "A 2.5-V 12-b 54-Msample/s 0.25-um CMOS ADC in 1-mm2 with mixed-signal chopping and calibration," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1859-1867, December 2001.
[80] S. Jamal, D. Fu, N. Chang, P. Hurst and S. Lewis, "A 10-b 120-Msample/s time-interleaved analog-to-digital converter with digital background calibration," IEEE J. Solid-State Circuits, vol. 37, no. 12, pp. 1618-1627, December 2002.
[81] C. Huang and J. Wu, "A background comparator calibration technique for flash analog-to-digital converters," IEEE Trans. Circuits Syst., vol. 52, no. 9, pp. 1732-1740, September 2005.
[82] D. Fu, K. C. Dyer, S. H. Lewis and P. J. Hurst, "A digital background calibration technique for time-interleaved analog-to-digital converters," IEEE J. Solid-State Circuits, vol. 33, no. 12, pp. 1904 - 1911, 1998.
[83] S. Tsukamoto, I. Dedic, T. Endo, K. Kikuta, K. Goto and O. Kobayashi, "A CMOS 6-b, 200 MSample/s, 3 V-supply A/D converter for a PRML read channel LSI," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1831 - 1836, 1996.
[84] J. Ingino and B. Wooley, "A continuously calibrated 12-b, 10-MS/s, 3.3-V A/D converter," IEEE J. Solid-State Circuits, vol. 33, no. 12, pp. 1920 - 1931, 1998.
[85] Y. Chiu, C. Tsang, B. Nikolic and P. Gray, "Least-mean-square adaptive digital background calibration of pipelined analog-to-digital converters," IEEE Trans. Circuits Syst., vol. 51, no. 1, pp. 38-46, 2004.
[86] X. Wang, P. J. Hurst and S. H. Lewis, "A 12-bit 20-MSample/s pipelined analog-to-digital converter with nested digital background calibration," IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 1799-1808, November 2004.
[87] C. Tsang, Y. Chiu, J. Vanderhaegen, S. Hoyos, C. Chen, R. Brodersen and B. Nikolic, "Background ADC calibration in digital domain," in IEEE Custom Integrated Circuits Conf., 2008.
[88] H. Wang, X. Wang, P. J. Hurst and S. H. Lewis, "Nested digital background calibration of a 12-bit pipelined ADC without an input SHA," IEEE J. Solid-State Circuits, vol. 44, no. 10, pp. 2780-2789, 2009.
[89] J. McNeill, M. C. W. Coln and B. J. Larivee, "'Split ADC' architecture for deterministic digital background calibration of a 16-bit 1-MS/s ADC," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2437-2445, 2005.
[90] J. Doernberg, P. Gray and D. Hodges, "A 10-bit 5-Msample/s CMOS two-step flash ADC," IEEE J. Solid-State Circuits, vol. 24, no. 4, pp. 241-249, 1989.
[91] K. Uyttenhove and M. Steyaert, "A 1.8-V 6-bit 1.3-GHz flash ADC in 0.25-μm CMOS," IEEE J. Solid-State Circuits, vol. 38, no. 7, pp. 1115 - 1122, July 2003.
[92] K. Deguchi, N. Suwa, M. Ito, T. Kumamoto and T. Miki, "A 6b 3.5GS/s 0.9V 98mW flash ADC in 90nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 10, pp. 2303-2310, 2008.
[93] M. Choi and A. Abidi, "A 6b 1.3GS/s A/D converter in 0.35um CMOS," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1847-1858, 2001.
[94] R. J. van de Plassche, Integrated analog-to-digital and digital-to-analog converters, Boston: Kluwer, 1994.
[95] P. Allen and D. Holberg, CMOS analog circuit design, New York: Oxford, 2002.
[96] B. Razavi, Design of analog CMOS integrated circuits, New York: McGraw-Hill, 2001.
[97] W. Evans, E. Naviasky, H. Tang and B. Allison, "Comparator metastability analysis," 1 January 2011. [Online]. Available: http://www.designers-guide.org/Analysis/metastability.pdf. [Accessed 1 July 2012].
[98] Y. Akazawa, A. Iwata, T. Wakimoto, T. Kamato, H. Nakamura and H. Ikawa, "A 400MSPS 8b flash AD conversion LSI," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 1987.
[99] M. Ingels and M. S. J. Steyaert, Integrated CMOS circuits for optical communications, New York: Springer-Verlag, 2004.
[100] S. Park, Y. Palaskas and M. Flynn, "A 4GS/s 4b flash ADC in 0.18μm CMOS," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 1865-1872, September 2007.
[101] A. Ismail and M. Elmasry, "A 6bit 1.6GS/s low power wideband flash ADC converter in 0.13um CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 9, pp. 1982-1990, September 2008.
[102] M. Choi, J. Lee, J. Lee and H. Son, "A 6-bit 5-GSample/s Nyquist A/D Converter in 65nm CMOS," in Symp. VLSI Circuits, 2008.
[103] C. Sandner, M. Clara, A. Santner, T. Hartig and F. Kuttner, "A 6bit 1.2GS/s low power flash ADC in 0.13um CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1499-1505, July 2005.
[104] H. Hatamkhani and C.-K. K. Yang, "A study of the optimal data rate for minimum power of I/Os," IEEE Trans. Circuits and Systems II, vol. 53, no. 11, pp. 1230-1234, November 2006.
[105] A. Deutsch, C. Surovic, R. Krabbenhoft, G. Kopcsay and B. Chamberlin, "Prediction of losses caused by roughness of metallization in printed-circuit boards," IEEE Trans. Advanced Packaging, vol. 30, no. 2, pp. 279-287, 2007.
[106] P. M. Figueiredo, P. Cardoso, A. Lopes, C. Fachada, N. Hamanishi, K. Tanabe and J. Vital, "A 90 nm CMOS 1.2 V 6b 1 GS/s two-step subranging ADC," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2006.
[107] X. Wang, P. Hurst and S. Lewis, "A 12-bit 20-MSample/s pipelined analog-to-digital converter with nested digital background calibration," IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 1799-1808, November 2004.
[108] W. Evans, E. Naviasky, H. Tang and B. Allison, "Comparator metastability analysis," 1 January 2011. [Online]. Available: http://www.designers-guide.org/Analysis/metastability.pdf. [Accessed 1 October 2011].
[109] H. Chen, I. Chen, H. Tseng and H. Chen, "1-GS/s 6-bit two-channel two-step ADC in 0.13-μm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 3051-3059, 2009.
[110] G. Balamurugan, F. O'Mahony, M. Mansuri, J. E. Jaussi, J. T. Kennedy and B. Casper, "A 5-to-25Gb/s 1.6-to-3.8mW/(Gb/s) reconfigurable transceiver in 45nm CMOS," in ISSCC Dig. Tech. Papers, San Francisco, 2010.
BIOGRAPHICAL SKETCH
Jikai Chen received the BSEE and MSEE degrees from East China Normal University,
Shanghai, China, and Zhejiang University, Hangzhou, China, respectively. He received
his PhD from the University of Florida, Gainesville, FL, in 2013. From 2003 to 2004, he
was an analog IC design engineer with Realsil Microelectronics, working on PLL-based
clock buffers. From 2004 to 2006, he was a senior analog IC design engineer with
Philips Semiconductors (now NXP), designing high-voltage LCD drivers. From 2006 to
2012, he was a research assistant with the Integrated Circuit Research lab at the
University of Florida, where his research focused on low-power circuit design for
high-speed serial links. Since 2012, he has been with Texas Instruments as an analog circuit
designer working on high-speed circuit design for optical communications.