
1894 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 12, DECEMBER 2018

A 4266 Mb/s/pin LPDDR4 Interface With An Asynchronous Feedback CTLE and An Adaptive 3-Step Eye Detection Algorithm for Memory Controller

Mino Kim, Joo-Hyung Chae, Student Member, IEEE, Sungphil Choi, Gi-Moon Hong, Hyeongjun Ko, Deog-Kyoon Jeong, Fellow, IEEE, and Suhwan Kim, Senior Member, IEEE

Abstract: A 4266 Mb/s/pin LPDDR4 interface with an asynchronous feedback continuous-time linear equalizer (AF-CTLE) and an adaptive 3-step eye detection algorithm for a memory controller is presented. The AF-CTLE removes the glitch of DQS without training by applying an offset larger than the noise, and improves the read margin by operating as a decision feedback equalizer in the DQ path. The adaptive 3-step eye detection algorithm reduces power consumption and black-out time in the initialization sequence and in retraining compared with 2-D full scanning. A prototype chip was implemented in a 65-nm CMOS process with a fine-pitch ball grid array package and tested with commodity LPDDR4 memory. The write margin was 0.36 UI and 148 mV, and the read margin was enhanced from 0.30 UI and 76 mV without the AF-CTLE to 0.47 UI and 80 mV with the AF-CTLE. The power efficiency during burst write and read was 5.68 pJ/bit and 1.83 pJ/bit, respectively.

Index Terms: Adaptive eye detection, LPDDR4, memory controller, pseudo-decision feedback equalizer, transceiver.

I. INTRODUCTION

High-end mobile devices such as smartphones, tablet PCs, and wearable devices now have multi-core CPUs and high-quality graphic engines, increasing the demand for low-power, high-bandwidth mobile memory interfaces. To meet this demand, LPDDR was introduced in 2005 [1], LPDDR2 in 2009, LPDDR3 in 2012 [2], and LPDDR4, with a maximum bandwidth of 4266 Mbps/pin, in 2015 [3].

Achieving both low power consumption and high bandwidth requires not only a mobile DRAM with an advanced architecture and an appropriate interface but also the assistance of the memory controller in the form of calibration or training. The clocking scheme was changed from a source synchronous matched scheme to a source synchronous unmatched scheme (SSUS), which makes it possible to use a simple receiver and decision feedback equalizer by eliminating the matching delay line. However, removing this delay line requires training to find the optimum sampling point. Mobile DRAM also lacks the delay-locked loop (DLL) that compensates for the skew, tDQSCK, between the read DQS and CK in computing DRAM [4]. This reduces power but also requires training.

Manuscript received March 11, 2018; revised March 21, 2018; accepted March 22, 2018. Date of publication March 26, 2018; date of current version November 23, 2018. This work was supported in part by the Future Semiconductor Device Technology Development Program through the Ministry of Trade, Industry and Energy under Grant 10080570, and in part by the Korea Semiconductor Research Consortium. This brief was recommended by Associate Editor F. C. M. Lau. (Corresponding author: Suhwan Kim.)

M. Kim, G.-M. Hong, and H. Ko are with the DRAM Product Development Division, SK Hynix, Icheon 17336, South Korea.

J.-H. Chae, S. Choi, D.-K. Jeong, and S. Kim are with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSII.2018.2819430

While controller training allows the mobile DRAM to draw less power and achieve high bandwidth, it increases power consumption and the black-out time during which the controller is unable to access the DRAM. Eye monitoring circuitry can reduce the black-out time [5]. However, the additional hardware may be undesirable in terms of power and area. In addition, periodic retraining is required because voltage and temperature change over time. The training time should therefore be kept as short as possible and the retraining interval should be maximized. A free-running oscillator and on-chip temperature or voltage sensors have been proposed to predict the need for retraining, rather than retraining the controller at regular intervals [6]. Recently, it has been suggested that retraining to compensate for a large variation of tDQSCK could be eliminated by generating a read data gating signal using an additional receiver that compares DQSN and VREFDQ [7]. This technique eliminates the black-out time from retraining for tDQSCK compensation but requires an additional receiver and control logic.

We propose an LPDDR4 memory controller that adopts an asynchronous feedback continuous-time linear equalizer (AF-CTLE), without additional receivers, to skip tDQSCK training. The AF-CTLE also improves the read margin by operating as a decision feedback equalizer (DFE) in the DQ path. In addition, an adaptive 3-step eye detection algorithm reduces the residual training time.

II. ARCHITECTURE

Fig. 1 shows the overall architecture of our LPDDR4 memory controller, which communicates with a single channel of the LPDDR4. It consists of an all-digital phase-locked loop (ADPLL), an all-digital delay-locked loop (ADDLL), a phase interpolator (PI), a clock distribution circuit, a link training finite-state machine (LTFSM), 7 transmitters for CK and CA, 18 transceivers for DQ and DMI, and 2 transceivers for DQS. The ADPLL generates the global clock signal (PHYCLK) used by the transceivers, and the system clock (SYSCLK) used by the LTFSM. The PHYCLK has an operating frequency of 266 to 2133 MHz, and the frequency of SYSCLK is an eighth of that. The ADDLL consists of a global ADDLL and a local ADDLL: the global ADDLL promotes fast locking of the local ADDLL, which generates the multi-phase clock. The PI adjusts the clock phase in 1/64 UI steps. The LTFSM generates DQ, CA, and control codes for the memory controller. The transmitters and receivers are configured through training to provide high bandwidth. The clocking architecture is also designed for low power consumption in idle periods, which are commonly lengthy in mobile applications. The architecture of the memory controller maintains the training behavior of the entire system and the source synchronous clocking scheme to increase the tolerance to high-frequency transmit-source jitter [8]. At the end of the clock tree, a metal wire shorts the input clocks of the TXs and RXs to reduce pin-to-pin skew, and dummy loads reduce load mismatch.

Fig. 1. Architecture of our LPDDR4 memory controller.

1549-7747 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

https://orcid.org/0000-0003-0436-703X

https://orcid.org/0000-0001-9107-2963
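The clock relations in this section can be checked with a few lines of arithmetic. The sketch below assumes that one PHYCLK cycle carries two bits per pin (so 2133 MHz corresponds to 4266 Mb/s/pin); the function names are hypothetical and do not come from the design.

```python
# Clock arithmetic for the figures quoted in the text.
SYSCLK_DIVIDER = 8      # SYSCLK is an eighth of PHYCLK
PI_STEPS_PER_UI = 64    # PI resolution is 1/64 UI


def sysclk_mhz(phyclk_mhz: float) -> float:
    """SYSCLK frequency derived from PHYCLK."""
    return phyclk_mhz / SYSCLK_DIVIDER


def ui_ps(phyclk_mhz: float) -> float:
    """Unit interval in ps, assuming two bits per PHYCLK cycle."""
    return 1e6 / (2.0 * phyclk_mhz)


def pi_step_ps(phyclk_mhz: float) -> float:
    """Time resolution of one PI code."""
    return ui_ps(phyclk_mhz) / PI_STEPS_PER_UI


for f in (266.0, 2133.0):
    print(f, sysclk_mhz(f), round(ui_ps(f), 1), round(pi_step_ps(f), 2))
```

At the top frequency this gives a UI of about 234.4 ps and a PI step of about 3.66 ps, with SYSCLK running between 33.25 MHz and roughly 266.6 MHz.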

III. CIRCUIT IMPLEMENTATION

A. Transmitter

Fig. 2 is a block diagram of the transmitter, which consists of a PI, a digitally controlled delay line (DCDL), a 16:1 serializer (SER), a pre-driver (pre-DRV), and low-voltage swing terminated logic (LVSTL) [9]. The PI adjusts the phase of the output signal in 1/64 UI steps and has a range from 0 to 1 tCK [10]. However, tDQS2DQ, the skew between DQ and DQS due to the unmatched scheme, exceeds 1 tCK at 4266 Mbps. For that reason, the PI is used to find the eye sampling point, and the DCDL, which is built from NAND gates, compensates for both tDQS2DQ and per-pin skew. When write operations are not being performed, the power consumption of the DCDL, SER, pre-DRV, and LVSTL (i.e., the blocks shown in gray in Fig. 2) is reduced by blocking the PI output clock. The 16:1 SER is used to maintain the same command-to-data latency despite the timing differences between the DQ TX and the CA TX introduced by tDQS2DQ and the PI. The LVSTL supports 1-tap de-emphasis for channel loss compensation and has accurate pull-up and pull-down drive strength through ZQ calibration. The local ADDLL tracks PVT variation even in the idle state, and the PI tracks PVT variation through periodic retraining.

Fig. 2. Block diagram of a transmitter.

Fig. 3. Block diagram of a receiver.

B. Receiver

Fig. 3 is a block diagram of the receiver, which consists of a reference voltage generator (VREF GEN), a CTLE, a PI, a DCDL, a 1:4 de-serializer (DES), and a 4:16 DES. The CTLE is used to compensate for channel loss. The reference voltage VREFDQ generated by VREF GEN is shared by all DQs. The DCDL compensates for per-pin skew between the DQs and DQS. The 1:4 DES is used to ensure a wide timing margin at the clock domain crossing point. The clock domain crossing, read latency training, and byte alignment from DQS to PHYCLK between the 1:4 DES and the 4:16 DES are all performed by the LTFSM. Since DQS is received irregularly in the read path, the power consumption of the CTLE, PI, DCDL, and 1:4 DES (i.e., the blocks shown in gray in Fig. 3) is reduced.

C. Asynchronous Feedback CTLE

Since both DQSP and DQSN are ground-terminated before and after a read operation, noise can cause glitches in the shaded region of YDQS, the output of the receiver, as shown in Fig. 4. The controller can remove glitches by training the gate with pulses, but periodic retraining is required to compensate for changes in tDQSCK caused by voltage and temperature variations, and this training uses power and increases black-out time. We therefore introduce an AF-CTLE that can remove glitches without the need for a gate signal that is vulnerable to timing variation, as shown in Fig. 5. The feedback loop in this AF-CTLE consists of an SR latch which feeds back the value of YDQS to adjust VOS, the offset of the CTLE. Fig. 6 shows a circuit diagram of the AF-CTLE. Since the LPDDR4 interface uses LVSTL, it is advantageous to receive the signal with PMOS transistors. The bandwidth loss caused by using PMOS transistors instead of NMOS transistors is overcome by applying a Cherry-Hooper topology [11]. The sign of the offset VOS between INP and INN is controlled by FBP and FBN. Fig. 7 shows the timing diagram of this CTLE, which has an offset VOS larger than the noise causing the glitch, so that a clean YDQS can be obtained. YDQSP and YDQSN show the YDQS waveform when the offset of the CTLE is biased to −VOS and +VOS. The shaded region where glitches can occur is removed without gate signals, but a fixed offset produces a duty-cycle error in YDQS. Therefore, we change the sign of VOS in response to the value of YDQS. The simulation results shown in Fig. 8 illustrate the removal of a DQS glitch. The AF-CTLE successfully prevents glitches without a gate signal, unlike a CTLE without an asynchronous feedback (AF) loop.

Fig. 4. Timing diagram of YDQS showing glitches and gate training failure caused by voltage and temperature variations.

Fig. 5. Block diagram of an AF-CTLE.

Fig. 6. Circuit diagram of an AF-CTLE.
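The effect of switching the offset sign with the output can be illustrated with a small behavioral model. The sketch below is not the circuit of Fig. 6. It assumes an ideal slicer, an instantaneous feedback path, and a hysteresis-style polarity (threshold at +VOS while the output is low, −VOS while it is high); the sample values, the 40 mV offset, and all function names are hypothetical.

```python
def slice_dqs(samples_mv, vos_mv=0.0, feedback=False):
    """Slice a differential DQS input (mV) into 0/1 decisions.

    feedback=False: fixed threshold at vos_mv.
    feedback=True : threshold is +vos_mv while the output is low and
                    -vos_mv while it is high (assumed polarity).
    """
    out, y = [], 0  # idle output assumed low
    for v in samples_mv:
        if feedback:
            threshold = vos_mv if y == 0 else -vos_mv
        else:
            threshold = vos_mv
        y = 1 if v > threshold else 0
        out.append(y)
    return out


def rising_edges(bits):
    """Count 0 -> 1 transitions, starting from a low output."""
    prev, count = 0, 0
    for b in bits:
        if b == 1 and prev == 0:
            count += 1
        prev = b
    return count


# Idle noise within +/-20 mV, then two DQS periods, then idle noise again.
idle = [15, -20, 10, -5, 20, -15]
period = [-60, -20, 20, 60, 100, 60, 20, -20, -60, -100]
samples = idle + period * 2 + idle

plain = slice_dqs(samples)                               # no offset
fixed = slice_dqs(samples, vos_mv=40.0)                  # fixed +VOS
fed_back = slice_dqs(samples, vos_mv=40.0, feedback=True)

for name, bits in (("plain", plain), ("fixed", fixed), ("feedback", fed_back)):
    print(name, "edges:", rising_edges(bits), "high samples:", sum(bits))
```

In this toy waveform the plain slicer toggles on idle noise, the fixed offset removes those toggles but shortens the high phase, and the output-dependent offset removes the toggles while keeping the high and low phases equal in length. A real loop has finite delay, which this model ignores.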

In addition, the AF-CTLE in the DQ path effectively operates as an unclocked DFE by changing the decision threshold with the data, without a clock signal [12]. The loop delay of an unclocked DFE should be designed to be half a clock period for accurate error cancellation. At operating frequencies between 1866 MHz and 2133 MHz, our AF-CTLE supports equalization by controlling the loading capacitance of the feedback loop to adjust the latency from 234 ps to 268 ps. In our case, the latency is set according to the operating frequency in the initialization step.

Fig. 7. Timing diagram of a glitch-free AF-CTLE.

Fig. 8. Simulation result from our glitch-free AF-CTLE.
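The quoted latency range is consistent with a loop delay of half a clock period, which equals one bit time at double data rate. As a worked check, with $t_{loop}$ and $f_{CK}$ as labels introduced here for illustration:

```latex
t_{loop} = \frac{1}{2 f_{CK}}, \qquad
\frac{1}{2 \times 2133\,\mathrm{MHz}} \approx 234\,\mathrm{ps}, \qquad
\frac{1}{2 \times 1866\,\mathrm{MHz}} \approx 268\,\mathrm{ps}
```

The two end points of the operating range therefore map onto the 234 ps and 268 ps limits of the adjustable feedback latency.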

D. LTFSM With Adaptive 3-Step Eye Detection

The link training finite-state machine uses an adaptive gain control scheme for eye detection to reduce training time. During training, a PRBS7 pattern is written to memory and read back, and the error is used to adjust the voltage or time control code. When consecutive passes or fails are detected, the LTFSM increases the gain of the training step. If the training result differs from that of the previous control code, it reduces the adaptive gain and detects the boundary point with a binary search algorithm to maintain accuracy. Fig. 9 shows an example of this adaptive 3-step eye detection algorithm. Thanks to the LPDDR4 specification, the reference voltage of the first step can be started at half the value of VOH, that is, VDDQ/5 or VDDQ/6 [13]. This method requires far fewer search points than a 2-dimensional full scanning scheme, which examines all combinations of voltage and time. For example, the number of training points of the full scanning scheme is

$$N_{TR} = N_T N_V \tag{1}$$

where $N_{TR}$ is the total number of training steps, $N_T$ is the number of time steps, and $N_V$ is the number of voltage steps. Because $N_{TR}$ is proportional to both the number of time steps and the number of voltage steps, performing the training more precisely and over a wide range requires a lot of black-out time and power for a single training. Assuming two boundary points, the number of training steps of our adaptive 3-step scheme is

$$N_{TR} = \frac{2N_T}{K} + \frac{N_V}{K} + 3N_{\min} \tag{2}$$

where $K$ is the maximum gain of the training step and $N_{\min}$ is the minimum number of consecutive passes or fails to reach the maximum gain. Assuming that $N_{\min}$ is sufficiently small compared to $N_T/K$ and $N_V/K$, (2) can be simplified as

$$N_{TR} = \frac{2N_T}{K} + \frac{N_V}{K}. \tag{3}$$

Fig. 9. Timing diagram of the adaptive 3-step eye detection algorithm.

In our controller, $N_T = 256$, $N_V = 72$, and $K = 2$. Our algorithm examines 292 search points, while the conventional 2-dimensional full scanning method has to examine 18,432 points. Using the adaptive 3-step eye detection algorithm, the simulated calibration times of read training and write training at 4266 Mbps/pin were 11.0 μs and 15.3 μs, respectively.
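The two counts above follow directly from (1) and (3), and the idea of raising the step gain and then refining with a binary search can be sketched in one dimension. The code below is an illustration only: `find_boundary`, the pass/fail callback, and the synthetic eye are hypothetical, and the LTFSM's actual step schedule is not specified beyond the description in the text. The binary search also assumes a single pass/fail transition inside the last stride.

```python
from typing import Callable, Optional, Tuple


def full_scan_points(n_t: int, n_v: int) -> int:
    """Eq. (1): every time code is tried against every voltage code."""
    return n_t * n_v


def adaptive_points(n_t: int, n_v: int, k: int, n_min: int = 0) -> int:
    """Eq. (2); with n_min = 0 this reduces to Eq. (3)."""
    return 2 * n_t // k + n_v // k + 3 * n_min


def find_boundary(passes: Callable[[int], bool], start: int, stop: int,
                  k: int, n_min: int) -> Tuple[Optional[int], int]:
    """Return (first code after start whose result differs from
    passes(start), or None if there is none; number of evaluations)."""
    ref = passes(start)
    evals = 1
    code, step, run = start, 1, 1
    while code < stop - 1:
        nxt = min(code + step, stop - 1)
        evals += 1
        if passes(nxt) == ref:
            code = nxt
            run += 1
            if run >= n_min:
                step = k  # enough identical results: raise the gain
            continue
        # The result flipped somewhere in (code, nxt]: binary search.
        lo, hi = code, nxt
        while hi - lo > 1:
            mid = (lo + hi) // 2
            evals += 1
            if passes(mid) == ref:
                lo = mid
            else:
                hi = mid
        return hi, evals
    return None, evals


def eye(code: int) -> bool:
    """Synthetic passing window used only for this example."""
    return 100 <= code < 180


print(full_scan_points(256, 72))    # 18432
print(adaptive_points(256, 72, 2))  # 292
left, left_evals = find_boundary(eye, 0, 256, k=4, n_min=3)
right, right_evals = find_boundary(eye, 100, 256, k=4, n_min=3)
print(left, right, left_evals + right_evals)
```

With the synthetic window, both edges are located with far fewer evaluations than the 256 codes a linear sweep would visit, which is the saving that (3) expresses for the full two-dimensional search.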

IV. EXPERIMENTAL RESULTS

A prototype of our LPDDR4 memory controller was implemented in a 65 nm CMOS process in a fine-pitch ball grid array (FBGA) package. Fig. 10 shows the chip microphotograph with the measurement setup. All trainings, including margin tests, were performed by sending and receiving PRBS7 patterns over the chip-to-chip link in the measurement setup. The clock generator generates the 66 MHz reference clock for the ADPLL. The measurement setup is configured with a chip-to-chip printed circuit board (PCB) platform of our LPDDR4 memory controller and commodity LPDDR4 memory. The active areas of the transmitter, receiver, ADPLL, global ADDLL, and local ADDLL are 0.054 mm², 0.045 mm², 0.39 mm², 0.047 mm², and 0.027 mm², respectively. The experimental results of training and margin tests are transmitted to a PC through an I2C module. The ADPLL consumes 17.47 mW at 2133 MHz. Fig. 11 shows the output waveform of the ADDLL; we verified with an oscilloscope that the CK0 and CK180 signals were locked 180° out of phase. The PI uses a multi-phase clock including CK0 and CK180 to generate a clock phase with 1/64 UI steps and adjusts the timing of the output signal. The global ADDLL powers off after locking, and the local ADDLL then consumes only 3.71 mW at 2133 MHz. Fig. 12 shows the output eye diagram of DQS and DQ, measured using a 50 Ω-terminated oscilloscope with no memory connected. The DQ and DQS signals are generated from the ADPLL clock through the on-chip clock tree and the transmitter path, so the measurement includes the power noise picked up along that path. The eye width and height of DQ are 168 ps and 183 mV, respectively. Fig. 13 shows the write and read margins at 4266 Mbps/pin. Both margin tests were performed by our LPDDR4 memory controller after completing the LPDDR4 memory initialization sequence, which includes power-up, MRS initialization, ZQ calibration, command bus training, write leveling, read training, and write training. The PI changes the timing of the DQ in 1/64 UI steps, and the MRS command and the internal reference voltage generator control the reference voltage of the DQ in 4 mV steps. The measured write margin is 0.36 UI and 148 mV. The measured read margin without the feedback loop in the AF-CTLE is 0.30 UI and 76 mV, and the AF-CTLE improves it to 0.47 UI and 80 mV. The power efficiencies of write and read operations at 4266 Mbps/pin are 5.68 pJ/bit and 1.83 pJ/bit, respectively. Table I compares this brief with other mobile memory interfaces.

Fig. 10. Measurement setup and enlarged microphotograph.

Fig. 11. Measured output waveform of the ADDLL.

Fig. 12. Measured output eye diagram of DQS and DQ at 4266 Mbps.

Fig. 13. Measured write and read margin at 4266 Mbps/pin.

TABLE I. PERFORMANCE COMPARISON OF MOBILE MEMORY INTERFACE
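To relate the measured margins to the training codes, the UI figures can be converted to picoseconds and to PI or VREF steps. This is plain arithmetic on numbers quoted in the text; the helper names are hypothetical, and the UI is assumed to be one bit time at 4266 Mb/s.

```python
DATA_RATE_MBPS = 4266.0
UI_PS = 1e6 / DATA_RATE_MBPS   # about 234.4 ps per bit
PI_STEPS_PER_UI = 64           # timing resolution of the PI
VREF_STEP_MV = 4.0             # reference voltage resolution


def ui_to_ps(margin_ui: float) -> float:
    """Timing margin in picoseconds."""
    return margin_ui * UI_PS


def ui_to_pi_steps(margin_ui: float) -> float:
    """Timing margin expressed in PI codes."""
    return margin_ui * PI_STEPS_PER_UI


def mv_to_vref_steps(margin_mv: float) -> float:
    """Voltage margin expressed in VREF codes."""
    return margin_mv / VREF_STEP_MV


for label, ui, mv in (("write", 0.36, 148.0),
                      ("read, no feedback", 0.30, 76.0),
                      ("read, AF-CTLE", 0.47, 80.0)):
    print(label, round(ui_to_ps(ui), 1), "ps,",
          round(ui_to_pi_steps(ui), 1), "PI steps,",
          mv_to_vref_steps(mv), "VREF steps")

print("TX eye width:", round(168.0 / UI_PS, 2), "UI")
```

On these assumptions the read timing margin grows from about 70 ps to about 110 ps with the feedback loop enabled, and the 168 ps transmitter eye corresponds to roughly 0.72 UI.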

V. CONCLUSION

We have presented a low-power, high-bandwidth LPDDR4 memory controller with an asynchronous feedback continuous-time linear equalizer and an adaptive 3-step eye detection algorithm. The AF-CTLE eliminates glitches in the idle state of the DQS path without the need for additional receivers, and improves the read margin by operating as a decision-feedback equalizer in the DQ path. The adaptive 3-step eye detection algorithm reduces the power and time required for training through adaptive gain control. The prototype chip was verified using a test board connected to a commodity LPDDR4 memory, and its performance was verified in write and read margin tests.

REFERENCES

[1] S.-H. Kim et al., “A low power and highly reliable 400 Mbps mobile DDR SDRAM with on-chip distributed ECC,” in Proc. IEEE Asian Solid-State Circuits Conf., Jeju Province, South Korea, 2007, pp. 34–37.

[2] Y.-C. Bae et al., “A 1.2V 30nm 1.6Gb/s/pin 4Gb LPDDR3 SDRAM with input skew calibration and enhanced control scheme,” in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, 2012, pp. 44–46.

[3] K. Song et al., “A 1.1V 2y-nm 4.35 Gb/s/pin 8 Gb LPDDR4 mobile device with bandwidth improvement techniques,” IEEE J. Solid-State Circuits, vol. 50, no. 8, pp. 1945–1959, Aug. 2015.

[4] T.-Y. Oh et al., “25.1 A 3.2Gb/s/pin 8Gb 1.0V LPDDR4 SDRAM with integrated ECC engine for sub-1V DRAM core operation,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, San Francisco, CA, USA, 2014, pp. 430–431.

[5] B. Analui, A. Rylyakov, S. Rylov, M. Meghelli, and A. Hajimiri, “A 10-Gb/s two-dimensional eye-opening monitor in 0.13-μm standard CMOS,” IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2689–2699, Dec. 2005.

[6] H. Lee, D. Shim, C. Rhee, M. Kim, and S. Kim, “A sub-1.0V on-chip CMOS thermometer with a folded temperature sensor for low-power mobile DRAM,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 6, pp. 553–557, Jun. 2016.

[7] S.-M. Lee et al., “23.6 A 0.6V 4.266Gb/s/pin LPDDR4X interface with auto-DQS cleaning and write-VWM training for memory controller,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, San Francisco, CA, USA, 2017, pp. 398–399.

[8] J. Zerbe et al., “A 5Gb/s link with matched source synchronous and common-mode clocking techniques,” IEEE J. Solid-State Circuits, vol. 46, no. 4, pp. 974–985, Apr. 2011.

[9] Y.-C. Cho et al., “A sub-1.0V 20 nm 5 Gb/s/pin post-LPDDR3 I/O interface with low voltage-swing terminated logic and adaptive calibration scheme for mobile application,” in Proc. Symp. VLSI Circuits, Kyoto, Japan, 2013, pp. C240–C241.

[10] J.-H. Chae et al., “266–2133 MHz phase shifter using all-digital delay-locked loop and triangular-modulated phase interpolator for LPDDR4X interface,” Electron. Lett., vol. 53, no. 12, pp. 766–768, Jun. 2017.

[11] R. Shabanpour et al., “Cherry-Hooper amplifiers with 33 dB gain at 400 kHz BW and 10 dB gain at 3.5 MHz BW in flexible self-aligned a-IGZO TFT technology,” in Proc. Int. Symp. Intell. Signal Process. Commun. Syst., Kuching, Malaysia, 2014, pp. 271–274.

[12] S. Chandramouli et al., “10-Gb/s optical fiber transmission using a fully analog electronic dispersion compensator (EDC) with unclocked decision-feedback equalization,” IEEE Trans. Microw. Theory Techn., vol. 55, no. 12, pp. 2740–2746, Dec. 2007.

[13] K. Song et al., “A 1.1V 2y-nm 4.35Gb/s/pin 8Gb LPDDR4 mobile device with bandwidth improvement techniques,” in Proc. IEEE Custom Integr. Circuits Conf., San Jose, CA, USA, 2014, pp. 1–4.

[14] H.-K. Jung et al., “A 4.35Gb/s/pin LPDDR4 I/O interface with multi-VOH level, equalization scheme, and duty-training circuit for mobile applications,” in Proc. Symp. VLSI Circuits, Kyoto, Japan, 2015, pp. C184–C185.
