
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2016

An efficient Hardware implementation of the Peak Cancellation Crest Factor Reduction Algorithm

MATTEO BERNINI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Master’s Thesis at KTH Information and Communication Technology
Supervisor: Shafqat Ullah
Examiner: Johnny Öberg

TRITA-ICT-EX-2016:187

Abstract

An important component of the cost of a radio base station comes from the Power Amplifier driving the array of antennas. The cost can be split into Capital and Operational expenditure, due to the high design and realization costs and the low energy efficiency of the Power Amplifier, respectively. Both these cost components are related to the Crest Factor of the input signal. In order to reduce both costs, it would be possible to lower the average power level of the transmitted signal, whereas in order to obtain a more efficient transmission, a more energized signal would allow the receiver to better distinguish the message from noise and interference. These opposing needs motivate the research and development of solutions aiming at reducing the excursion of the signal without sacrificing its average power level. One of the algorithms addressing this problem is the Peak Cancellation Crest Factor Reduction. This work documents the design of a hardware implementation of this method, targeting a possible future ASIC for Ericsson AB. SystemVerilog is the Hardware Description Language used for both the design and the verification of the project, together with a MATLAB model used both to explore some design choices and to validate the design against the output of the simulation. The two main goals of the design have been efficient hardware exploitation, aiming at a smaller area footprint on the integrated circuit, and the adoption of some innovative design solutions in the control part of the design, for example the management of the cancelling pulse coefficients and the use of a time-division multiplexing strategy to further save area on the chip. For the contexts where both solutions could compete, the proposed one shows better results in terms of area and delay compared to the current methods in use at Ericsson, and also provides innovative suggestions and ideas for further improvements.

Keywords: CFR, PC-CFR, PAPR Reduction, OFDM

Sammanfattning (Summary)

An efficient hardware implementation of the Peak Cancellation algorithm for crest factor reduction

A component that is important to consider when it comes to the cost of a radio base station is the amplifier used to drive the antennas. The cost of the amplifier can be divided into an initial cost, related to the development and manufacturing of the circuit, and a running cost, related to the energy efficiency of the circuit. Both costs are linked to a property of the amplifier's input signal, namely the ratio between the maximum power of the signal and its average power, the so-called crest factor. To reduce these costs it is possible to lower the average power of the signal, but a high average power improves the radio transmission, since it is easier for the receiver to distinguish a high-energy signal from noise and interference. These two opposing requirements motivate research and development of solutions for reducing the maximum value of the signal without reducing its average power. One algorithm that can be used to reduce the crest factor of the signal is Peak Cancellation. This report presents the design and hardware implementation of Peak Cancellation, intended for use by Ericsson AB in future integrated circuits. The hardware description language SystemVerilog was used for both design and testing in the project. MATLAB was used to explore design alternatives, to model the algorithm, and to compare its output with that of the hardware implementation in simulations. The two main goals of the design were to use the hardware efficiently, in order to reach as small a circuit area as possible, and to use a number of innovative solutions for the control part of the design. Examples of the innovative design solutions used are the handling of the coefficients of the pulses used to reduce peaks in the signal, and the use of time-division multiplexing to further reduce the circuit area. In usage scenarios where both solutions can compete, the proposed solution shows better results in terms of circuit area and latency than the current solutions used by Ericsson. Suggestions for further future improvements of the implementation are also given.

Keywords: CFR, PC-CFR, PAPR Reduction, OFDM

List of Acronyms and Abbreviations

ACLR Adjacent Channel Leakage Ratio

AM Amplitude Modulation

ASIC Application Specific Integrated Circuit

ASM Algorithmic State Machine

BPSK Binary Phase Shift Keying

CAF Clipping and Filtering Technique

CapEx Capital Expenditure

CCDF Complementary Cumulative Distribution Function

CF Crest Factor

CORDIC Coordinate Rotation Digital Computer

CS Clip Stage

EVM Error Vector Magnitude

FDM Frequency Division Multiplexing

FIR Finite Impulse Response

FM Frequency Modulation

FPGA Field Programmable Gate Array

GSM Global System for Mobile communication

(H)PA (High) Power Amplifier

(I)DCT (Inverse) Discrete Cosine Transform

IFFT Inverse Fast Fourier Transform

I/Q In-phase / Quadrature signal

LTE Long Term Evolution

MSR Multi Standard Radio

NS Noise Shaping

OFDM Orthogonal Frequency Division Multiplexing

OOB Out Of Band

OpEx Operating Expenditure

PA(P)R Peak to Average (Power) Ratio

PC, PC-CFR Peak Cancellation Crest Factor Reduction

PCU Peak Cancelling Unit

PDF Probability Density Function

PF Peak Filtering

PM Phase Modulation

PM Peak Manager

PTS Partial Transmit Sequence

PW Peak Windowing

QPSK Quadrature Phase-Shift Keying

RMS Root Mean Square

RTL Register Transfer Level

SLM SeLective Mapping

SV SystemVerilog

TC Turbo Clipping

TDM Time Division Multiplexing

TI Tone Injection

TR Tone Reservation

WCDMA Wideband Code Division Multiple Access

Contents

1 Introduction
  1.1 Background and statement of the problem
  1.2 Purpose of the design project

2 Background and related work
  2.1 Background
    2.1.1 Orthogonal Frequency Division Multiplexing (OFDM)
    2.1.2 Definitions: CF, PAPR, EVM and ACLR
    2.1.3 Overview of the main CFR methods
  2.2 Related Work

3 The proposed implementation of the PC-CFR
  3.1 General description of the PC-CFR algorithm
  3.2 Structural description of the proposed implementation
    3.2.1 The Clip Stage
    3.2.2 The Peak Manager

4 Future work and suggested improvements
  4.1 Programmable or dynamic CS–PCU mapping
  4.2 Bypassable PC-CFR module
  4.3 Clip Stages with different delay memories and cancelling pulses length
  4.4 Truncation of cancelling pulses
  4.5 Variable length Peak Search Window
  4.6 Priority-based acceptance of peaks
  4.7 Generation of multiple cancelling pulses from the same time slot

5 Results and conclusions
  5.1 Comparative synthesis results
  5.2 Some input and model configuration exploration
    5.2.1 Observations

Bibliography

Appendices

A The MATLAB golden model

Chapter 1

Introduction

If the cost of a typical transmitting radio base station is analyzed, we discover that the Capital Expenditure (CapEx)¹ and the Operating Expenditure (OpEx)² relative to the radio cards alone cover roughly 50% of the total cost [1]. The radio cards house the Power Amplifier (PA), whose low efficiency is the main culprit for the OpEx part of the overall costs. In fact, only a small fraction of the power consumed by the radio cards becomes transmitted power. Similar considerations are valid for the consumer electronics market: every mobile device relying on wireless communications suffers from the non-optimal efficiency of the PA, which has a substantial negative effect on battery lifetime. In many low-cost applications, this issue alone might prevent the whole system from being considered convenient or even feasible to design. The efficiency of the PA is a function of the characteristics of the input signal, in particular of its Peak to Average Power Ratio (PAPR, or PAR) or Crest Factor (CF), which are the ratios between the powers or the magnitudes, respectively, associated with the largest and the average values of the signal.

In Figure 1.1, we can see a small segment of data in a typical scenario. The maximum values, that is the peaks (a more accurate definition of peaks will be given in Section 3.1; for now an intuitive understanding is sufficient), are responsible for the high PAPR of a given signal. It is not surprising that the industry is striving to reduce this phenomenon, and thus the costs and inefficiencies, by investigating several alternatives. Basically, the two most relevant ways to deal with the problem are: 1) introducing some changes in the signal to be transmitted (without, of course, compromising its information content) in order to prevent the occurrence of high peaks, at the cost of an increased complexity of the transmitter and/or of sacrificing some data rate for the transmission of the side information needed on the receiver side to reconstruct the information, or 2) digitally processing the signal as it is (either in the time or frequency domain) in order to limit the occurrence and magnitude of the unavoidable peaks, at the cost of some introduced distortion.

This thesis work focuses on the design, modeling and verification of an algorithm belonging to the digital processing category, namely the Peak Cancellation Crest Factor Reduction (PC-CFR), targeted at an Application Specific Integrated Circuit (ASIC). The thesis project was performed at Ericsson AB in Kista, Stockholm.

Figure 1.1: A segment of a typical signal amplitude showing high variability and, as a consequence, a high ratio between the maximum and average values.

¹ Resources invested by a company to buy or upgrade fixed, physical, non-consumable assets.
² Day-to-day costs of operation.

1.1 Background and statement of the problem

Very widely used multi-carrier signals such as Orthogonal Frequency Division Multiplexing (OFDM) show a higher PAPR than single-carrier systems. Also, several radio access technologies such as Long Term Evolution (LTE), Wideband Code Division Multiple Access (WCDMA), etc. are used in Multi Standard Radio (MSR) transmitters situated in base stations. These signals do not exhibit a constant envelope, but show instead a fluctuating envelope with a high CF (see Figure 1.2, [2]). The main reason is the fact that the sum of multiple sub-carriers creates a compound signal whose real and imaginary parts approach a Gaussian Probability Density Function (PDF), due to the Central Limit Theorem, whereas the amplitude approaches a Rayleigh PDF. On the other hand, the Global System for Mobile communication (GSM) uses constant envelope Gaussian modulation.

The input-output static characteristics of a PA show a linear region bounded by a non-linear part (see Figure 1.3). The part of the PA input signal that falls outside the linear region causes significant Out Of Band (OOB) emissions, due to the inter-modulation products on the adjacent channels. Therefore the linear part of the PA’s characteristics needs to be wide enough to contain the dynamic range of the input signal that has to be amplified and fed to the antenna(s). In order for the PAs to accommodate signals with such a high voltage swing, either they have to be dimensioned for the maximum peak value (thus increasing the CapEx), or they are operated with more back-off³ from the most convenient operating point, which translates to a less efficient usage of energy (thus increasing the OpEx). In other words, PAs with larger linear ranges are more expensive and make worse use of electric power than those with a smaller linear input range.

Figure 1.2: Comparative view of PAPR for different transmission protocols (source: [2]).

Figure 1.3: Power Amplifier characteristics before PAR reduction (source: [3]).

Figure 1.4: Power Amplifier characteristics after PAR reduction. Note the increased average output voltage (thus power) available thanks to the reduction of the PAR (source: [3]).

What is desirable, instead, is to deal with signals with a limited PAPR (or CF), because then it is possible to increase their average power level without the risk of falling into the saturation region of the PA. The increased transmitting power guarantees a higher strength of the signal with respect to the unavoidable noise and thus an overall more efficient transmission of information. In Figure 1.4, the input-output characteristics of a PA after a 6 dB reduction of PAPR are shown. Notice that it is now possible to place the operating point of the signal at a higher power level thanks to the reduction of the PAPR.

1.2 Purpose of the design project

The purpose of the project described in this report is the design, verification and performance test of an innovative implementation of the Peak Cancellation (PC) algorithm, which will possibly be implemented in one of Ericsson’s ASICs in the future. The design is as generic and configurable as possible, in order for the user to be able to compare different parameter options against existing solutions already implemented at Ericsson. The programmability of the PC-CFR module is another desirable characteristic of the project because, as a consequence of changes in the input signal properties, some actions might be taken accordingly, for example a change of the length of the search window (related to the granularity of the detection of the peaks).

³ The back-off is the deliberate reduction of the average input power to the PA.

One of the most attractive aspects of the PC algorithm, as opposed to other solutions, is the low complexity in terms of hardware, which translates to a smaller area occupancy on the ASIC and to a lower power consumption of the module. The drawback of the PC is that each peak must be treated separately by dedicating hardware resources to it for the entire duration of the corresponding cancelling pulse. When the detected peaks in the input signal exhibit a density such that the available hardware resources are insufficient in number to cancel them all, some of them pass untouched and eventually reach the PA.

The PC-CFR architecture proposed in this thesis report is new and possibly innovative in some aspects, compared to the documented existing implementations [1][3][4][5]. The aspect of the design that required most of the effort was the optimization of the hardware resources and, at the same time, the minimization of the probability of a peak leak. In order to fulfill these requirements, most of the hardware resources are not used exclusively but are shared more efficiently in a Time Division Multiplexing (TDM) configuration, thanks to the availability of a second, faster clock and several design expedients.

The Register Transfer Level (RTL) design and the testbench are written in the SystemVerilog (SV) language and simulated and synthesized via the software tools made available by Ericsson. A MATLAB golden model has been written in a way to match both the expected behaviour of the PC-CFR algorithm and, as accurately as possible, all the elaborations of the data taking place in the target hardware implementation. This model was used to compare its output against the RTL version when driven with the same input data: the target RTL implementation is considered compliant with the model when the two outputs match sample by sample.


Chapter 2

Background and related work

2.1 Background

2.1.1 Orthogonal Frequency Division Multiplexing (OFDM)

Communication systems use a physical channel to provide a reliable means to transfer information by the use of a technique called modulation: by superimposing some coded version of the information over one or more of the characteristics of a properly chosen sinusoidal signal, called the carrier, it is possible to overcome the physical limits of the communication channel in terms of available bandwidth and maximum power. Depending on whether the modulated carrier characteristic is frequency, phase or amplitude (or a combination of them), we have several types of modulation (such as Amplitude Modulation (AM), Phase Modulation (PM), Frequency Modulation (FM), Binary Phase Shift Keying (BPSK), Quadrature Phase-Shift Keying (QPSK), etc.), each with different advantages and drawbacks. If more than one line of communication needs to be established over the same physical channel, then some means to share it must be employed, such as multiplexing (we might think of these independent paths of communication as logical channels, as well as pairs of users). In Time Division Multiplexing (TDM) each user occupies the entire bandwidth of the channel for a given time frame in a round-robin fashion, with some silence time between two successive frames, whereas in Frequency Division Multiplexing (FDM) the whole channel bandwidth is divided into segments separated by guard intervals, and each user has at its disposal a specific bandwidth arranged around a carrier for the entire duration of the communication. The carriers can be chosen freely, the only constraint being that the frequency bands of the channels do not overlap. In OFDM there is a specific relationship among the carrier frequencies, i.e. they are all multiples of a single frequency. This simple expedient allows the relaxation of the non-overlapping requirement on the various bands, thus packing them together in order to make better use of the channel resource. The fact that all carriers are multiples of a common frequency entails their orthogonality¹, which makes the recovery of the transmitted information on the receiver side much easier and, most of all, possible even if the signals overlap in frequency. OFDM (which can be considered a special case of FDM) is a so-called multi-carrier modulation technique, because it makes use of several carriers at the same time, each capable of conveying information modulated according to different mappings (BPSK, QPSK, etc.).

Figure 2.1: Block diagram of the generation of an OFDM signal. The sinusoidal carriers are orthogonal (source: [6]).

The communication quality through channels affected by frequency selective fading² benefits from OFDM, in the sense that the fading can be more easily compensated at the receiver side: with OFDM, instead of compensating for the fading of the channel as a continuous function of frequency over a large range (which is a more involved operation), the receiver can divide the frequency range into small segments, each corresponding to a sub-carrier, and approximate the fading as a constant in each one of such segments. The advantage is that constant fading can be fought more easily by using error correction and other techniques. A block-level diagram is shown in Figure 2.1 (see also [6]).
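The orthogonality property that OFDM relies on can be checked numerically. The following is a minimal Python sketch, not taken from the thesis: the helper names `carrier` and `scalar_product`, the 64-sample period and the two example symbols are illustrative assumptions. It samples two harmonics of a common base frequency over one period, shows that their scalar product vanishes, and uses that fact to recover the symbol carried by one harmonic from the sum of both.

```python
import cmath
import math

def carrier(k, num_samples):
    """k-th harmonic of the base frequency, sampled over one base period."""
    return [cmath.exp(2j * math.pi * k * n / num_samples) for n in range(num_samples)]

def scalar_product(a, b):
    """Normalized discrete scalar product of two complex sequences."""
    return sum(x * y.conjugate() for x, y in zip(a, b)) / len(a)

N = 64  # assumed number of samples per base period
c3, c5 = carrier(3, N), carrier(5, N)

cross = abs(scalar_product(c3, c5))  # close to 0: distinct harmonics are orthogonal
auto = abs(scalar_product(c3, c3))   # close to 1: a carrier correlates with itself

# Two symbols (hypothetical values) are sent on the two carriers at the same time.
a3, a5 = 1 + 1j, -1 + 1j
mixed = [a3 * x + a5 * y for x, y in zip(c3, c5)]

# Projecting the sum onto carrier 3 isolates its symbol despite the superposition.
recovered = scalar_product(mixed, c3)
print(cross, auto, recovered)
```

The projection step is, in essence, what a Fourier-transform-based receiver does for every sub-carrier at once.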

¹ Two signals are said to be orthogonal if their scalar product is zero.
² Frequency selective fading is a radio propagation anomaly due to the partial cancellation of a signal by itself, because the signal arrives from at least two different directions and one or more of such paths is lengthening or shortening.

2.1.2 Definitions: CF, PAPR, EVM and ACLR

As already stated, the problem with non-constant envelope signals is the presence of too large a variability in the amplitude, which is harmful for the design and power efficiency of the PA. This phenomenon is closely related to the presence of groups of samples whose magnitude exceeds a certain desired value, called the threshold.

Some of the techniques proposed to mitigate this behavior are briefly listed in the following, but first more quantitative definitions of Crest Factor and Peak to Average (Power) Ratio are presented. We define the Crest Factor as the ratio between the maximum value of the magnitude and the root-mean-square (RMS) value of a signal, observed in a certain temporal window:

\[ \mathrm{CF} = \frac{\|s(n)\|_{\max}}{s_{\mathrm{rms}}} \]

We also define the more commonly used Peak to Average (Power) Ratio, again for a given interval of time or a certain number of samples, for discrete-time contexts:

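\[ \mathrm{PAPR} = \frac{\|s(n)\|^{2}_{\max}}{s^{2}_{\mathrm{rms}}}, \quad \text{or} \quad \mathrm{PAPR}_{\mathrm{dB}} = 10 \log_{10} \frac{\|s(n)\|^{2}_{\max}}{s^{2}_{\mathrm{rms}}} \]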

Note that $\mathrm{PAPR} = \mathrm{CF}^{2}$. The desired effect of the various CF reduction techniques is to reduce the PAR of the signal without introducing too much distortion. Some of the techniques will not introduce any distortion at all, at the price of a greater complexity and/or a reduction of data rate, whereas some others will inject some unavoidable distortion both in-band (the bandwidth occupied by the signal being transmitted) and out of band (in the adjacent bands). Both of these side effects are of course undesirable and, in order to quantify them, two parameters exist: Error Vector Magnitude (EVM) and Adjacent Channel Leakage Ratio (ACLR).
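The definitions of CF and PAPR translate directly into a few lines of code. The sketch below is illustrative only: the function names and the test signal (eight equal-amplitude harmonics that add in phase at the first sample) are assumptions, chosen so that the result can be checked by hand. The peak magnitude is 8 and the RMS value is the square root of 8, so the PAPR equals 8 (about 9 dB).

```python
import cmath
import math

def rms(signal):
    """Root-mean-square value of a complex sequence."""
    return math.sqrt(sum(abs(s) ** 2 for s in signal) / len(signal))

def crest_factor(signal):
    """CF: largest magnitude divided by the RMS value."""
    return max(abs(s) for s in signal) / rms(signal)

def papr(signal):
    """PAPR as a linear ratio; equal to the square of the crest factor."""
    return crest_factor(signal) ** 2

def papr_db(signal):
    """PAPR expressed in decibels."""
    return 10 * math.log10(papr(signal))

N = 64            # assumed number of samples in the observation window
NUM_CARRIERS = 8  # assumed number of equal-amplitude harmonics

# All harmonics have zero phase, so they add coherently at n = 0.
signal = [
    sum(cmath.exp(2j * math.pi * k * n / N) for k in range(NUM_CARRIERS))
    for n in range(N)
]

print(crest_factor(signal), papr(signal), papr_db(signal))
```

With a single harmonic the same functions return a PAPR of 1 (0 dB), which is the constant-envelope case.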

EVM is a measurement that quantifies the global displacement of the received (output) signal compared to the expected ideal one, due to any disturbances (such as noise) and, as in our case, to the CFR intervention too. We define it as (see Figure 2.2):

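\[ \mathrm{EVM} = 10 \log_{10} \frac{P_{\mathrm{error}}}{P_{\mathrm{ref}}} \quad \text{or} \quad \mathrm{EVM}(\%) = \sqrt{\frac{P_{\mathrm{error}}}{P_{\mathrm{ref}}}} \cdot 100 \]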

where $P_{\mathrm{error}}$ is the sum of all the error vector powers and $P_{\mathrm{ref}}$ is the sum of all the reference (expected) signal powers. The error vector is the vector in the I/Q plane that connects the received symbol with the ideal, expected position in the plane (the position corresponding to the exact transmitted symbol). For each received symbol, the corresponding power is computed and averaged, then divided by a properly chosen value representative of the modulation scheme. The result is a cumulative measure of how close the whole transmitter-receiver chain is to the ideal from the accuracy point of view. In an ideal transmission system, each received waveform would fall exactly on one of the possible points in the plane corresponding to the coding of the sent symbol. The less ideal the communication system, the more pronounced the scattering of the received waveforms around the constellation of the expected symbols. In the present case, the in-band distortion introduced by the CFR algorithm has a direct effect on the EVM which, as a consequence, is considered a measurement of the performance of the method.

Figure 2.2: I/Q plane with representations of the reference and the measured (or received, in a communication channel) vectors. The powers of the error and the reference vectors are used to compute the EVM (source: [7]).

Figure 2.3: The components at the base of the definition of the ACLR (source: [8]).
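The EVM definition given above can be sketched as follows. This is an illustrative example, not the measurement procedure used in the thesis: the function name `evm`, the QPSK reference points and the fixed displacement of 0.1 applied to every symbol are assumptions. The function fails if the received symbols match the reference exactly, because the logarithm of zero is undefined.

```python
import math

def evm(received, reference):
    """Return (EVM in dB, EVM in percent) for paired received/reference symbols."""
    # Power of the error vectors: received position minus ideal position.
    p_error = sum(abs(r - s) ** 2 for r, s in zip(received, reference))
    # Power of the reference (ideal) vectors.
    p_ref = sum(abs(s) ** 2 for s in reference)
    ratio = p_error / p_ref
    return 10 * math.log10(ratio), math.sqrt(ratio) * 100

# Hypothetical QPSK constellation points, each with |s|^2 = 2.
reference = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
# Every received symbol is displaced by an error vector of magnitude 0.1.
received = [s + 0.1 for s in reference]

evm_db, evm_percent = evm(received, reference)
print(evm_db, evm_percent)
```

Here the ratio of error power to reference power is 0.04 / 8 = 0.005, which gives roughly 7.07 % or about -23 dB.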

Adjacent Channel Leakage Ratio (ACLR) is the measurement concerning the out-of-band distortion. It is defined as the ratio of the power leaked into the adjacent channels to the power in the main (carrier) channel (see also Figure 2.3 and [8]):

\[ \mathrm{ACLR} = \frac{\text{Adjacent Channel Power}}{\text{Main Channel Power}} \]


The most important reason behind the desire to keep the ACLR at a low level is that otherwise unexpected and unwanted power will leak outside of the frequency band of interest. If the adjacent frequency intervals are used as the main channels of other communication systems, it means that we are injecting interference into them. The second reason driving the effort to keep the ACLR as low as possible is simply the fact that a high ACLR translates to some energy (supposed to be in the main channel) being wasted over adjacent channels, therefore reducing the efficiency of transmission.
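As a rough illustration of the ACLR ratio, the sketch below estimates the power in two groups of frequency bins with a plain discrete Fourier transform. Everything specific here is an assumption made for the example: the function names, the bin ranges standing in for the main and adjacent channels, the synthetic signal (a unit tone plus a leak of 1% amplitude), and the choice to express the result in dB. Real measurements use the channel bandwidths and filters defined by the relevant standard.

```python
import cmath
import math

def dft(signal):
    """Plain O(N^2) discrete Fourier transform, adequate for a short example."""
    total = len(signal)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * n / total) for n, s in enumerate(signal))
        for k in range(total)
    ]

def band_power(spectrum, bins):
    """Sum of the squared magnitudes of the given frequency bins."""
    return sum(abs(spectrum[k]) ** 2 for k in bins)

def aclr_db(signal, main_bins, adjacent_bins):
    """Adjacent channel power over main channel power, expressed in dB."""
    spectrum = dft(signal)
    ratio = band_power(spectrum, adjacent_bins) / band_power(spectrum, main_bins)
    return 10 * math.log10(ratio)

def tone(k, amplitude, total):
    """Complex tone located exactly on frequency bin k."""
    return [amplitude * cmath.exp(2j * math.pi * k * n / total) for n in range(total)]

N = 64
# Unit tone inside the main channel plus a small leak inside the adjacent one.
signal = [a + b for a, b in zip(tone(6, 1.0, N), tone(12, 0.01, N))]
result = aclr_db(signal, range(4, 9), range(10, 15))
print(round(result, 3))  # about -40 dB: amplitude ratio 0.01, power ratio 1e-4
```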

2.1.3 Overview of the main CFR methods

Several techniques have been proposed to mitigate the PAPR problem of the OFDM signals. These techniques can be roughly and partially categorized in: coding technique, probabilistic (scrambling) technique, adaptive pre-distortion technique and clipping technique. This last category will be further explored given its importance to this thesis work.

Coding technique

The coding technique pursues PAPR reduction via an appropriate choice of the codes of the modulation to be transmitted for each sub-carrier. This method causes no distortion, either in-band or OOB, but it suffers from non-optimal bandwidth usage because a smaller number of data words is mapped to a greater number of code words. The complexity of the algorithm is also non-negligible, because both the computational effort needed to choose the most appropriate symbol to send and the area required to store the look-up tables grow rapidly with the number of sub-carriers, up to the point of becoming computationally intractable for common useful signals.

Probabilistic (scrambling) technique

This technique entails the scrambling (meaning, in this context, the act of manipulating a signal with a well-known sequence to alter its properties without introducing distortion) of the OFDM input signal with several scrambling sequences, one block of samples at a time, and then choosing among the resulting sequences the one exhibiting the lowest PAPR. This approach cannot guarantee a desired PAPR level (it will, though, provide the minimum among the candidate sequences), reduces bandwidth utilization because of the additional information to be sent to the receiver, and its complexity increases rapidly with the number of sub-carriers. This category includes the SLM (SeLective Mapping), PTS (Partial Transmit Sequence), TI (Tone Injection) and TR (Tone Reservation) techniques.

Figure 2.4: Block diagram of the selective mapping technique for PAPR reduction (source: [9]).

As an example we might very briefly consider SeLective Mapping (see Figure 2.4). This technique requires the OFDM signal to be independently multiplied by $U$ phase sequences $P^{u}_{v} = e^{j\varphi^{u}_{v}}$, $u = 1, 2, \ldots, U$. The $U$ resulting sequences are passed through $U$ IFFT (Inverse Fast Fourier Transform) blocks and the output sequences $x^{u}$ are compared in order to determine the one yielding the lowest PAPR. The side information about the selected sequence needs to be sent into the channel for the receiver to be able to reconstruct the original OFDM message. Therefore, the SLM algorithm requires $U$ IFFT blocks, the sending of the side information, and a block that chooses the version of the OFDM signal with the smallest PAPR through a proper measurement and comparison.
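The selective mapping procedure just described can be sketched in a few lines. This is an illustration under stated assumptions, not code from the thesis or from [9]: the function names, the use of a plain inverse DFT instead of an IFFT, the zero-based index `u`, the restriction of the phases to 0 and pi, the all-zero first phase sequence and the worst-case input (all sub-carriers carrying the same symbol) are all choices made for the example.

```python
import cmath
import math
import random

def idft(freq):
    """Plain inverse DFT, standing in for the IFFT blocks of the figure."""
    total = len(freq)
    return [
        sum(x * cmath.exp(2j * math.pi * k * n / total) for k, x in enumerate(freq)) / total
        for n in range(total)
    ]

def papr(signal):
    """Peak power divided by mean power."""
    powers = [abs(s) ** 2 for s in signal]
    return max(powers) / (sum(powers) / len(powers))

def selective_mapping(symbols, phase_sequences):
    """Return (u, PAPR, x_u) for the candidate with the lowest PAPR."""
    candidates = []
    for u, phases in enumerate(phase_sequences):
        # Rotate each sub-carrier symbol by the u-th phase sequence.
        rotated = [s * cmath.exp(1j * phi) for s, phi in zip(symbols, phases)]
        x_u = idft(rotated)
        candidates.append((papr(x_u), u, x_u))
    best_papr, best_u, best_x = min(candidates, key=lambda c: c[0])
    return best_u, best_papr, best_x

rng = random.Random(1)  # fixed seed so that the run is repeatable
NUM_CARRIERS, U = 16, 4
symbols = [1 + 0j] * NUM_CARRIERS  # worst case: every sub-carrier in phase
phase_sequences = [[0.0] * NUM_CARRIERS] + [
    [rng.choice([0.0, math.pi]) for _ in range(NUM_CARRIERS)] for _ in range(U - 1)
]

best_u, best_papr, best_x = selective_mapping(symbols, phase_sequences)
print(best_u, best_papr)  # best_u is the side information sent to the receiver
```

Because the unmodified signal is one of the candidates, the selected PAPR can never be worse than the original one; how much better it gets depends on the phase sequences available.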

Adaptive pre-distortion

The idea behind the adaptive pre-distortion is to distort the signal according to a non-linear function in order to compensate for the successive, well known, non-linear characteristics of the PA. Some solutions are capable of dealing with time-variable characteristics of the PA by dynamically and efficiently changing the input constellation.

Clipping technique

This technique has the advantage of being the simplest to implement, but incurs in-band distortion, out-of-band interference, and the disruption of the orthogonality of the sub-carriers. The method requires some sort of digital processing in the time and/or frequency domain. Among others, this technique includes: Clipping and Filtering Technique (CAF), block-scaling technique, Peak Windowing technique (PW), Peak Cancellation technique (PC), and Fourier projection technique.


In order to introduce the scope of this work, a brief description of some of the algorithms belonging to this category follows. The algorithms have been chosen because of their conceptual and practical affinities with the approach proposed in this work. For all the following algorithms (Peak Filtering, Peak Cancellation and Peak Windowing), the concept of threshold is of utmost importance. The threshold is the desired maximum value for the magnitude of the input signal. It can be either hardwired inside the algorithm or programmed during its operating life. In any case, by setting a certain value for the threshold, we also inherently program a desired PAPR, because the magnitude of the signal is monotonically related to the power. The three described algorithms differ in the way they obtain the reduction of the maximum magnitude of the signal (and thus the PAPR) to the desired level, but they all apply digital processing to it, thus introducing some distortion, whose size they try to minimize.
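As a worked illustration of the link between threshold and PAPR stated above: if every sample is limited to a magnitude $T$, and if one assumes that the RMS level $s_{\mathrm{rms}}$ is essentially unchanged by the processing (only approximately true, since removing peaks lowers the average power slightly), then the definition of PAPR gives

\[ \mathrm{PAPR}_{\mathrm{target}} = \frac{T^{2}}{s^{2}_{\mathrm{rms}}} \quad \Longleftrightarrow \quad T = s_{\mathrm{rms}} \cdot 10^{\mathrm{PAPR}_{\mathrm{dB}}/20} \]

For example, under this assumption a target of 6 dB corresponds to $T = s_{\mathrm{rms}} \cdot 10^{0.3} \approx 2.0\, s_{\mathrm{rms}}$.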

Peak Filtering (PF)

The Peak Filtering algorithm, sometimes referred to as Noise Shaping (NS), consists of extracting the part of the input signal whose magnitude exceeds the threshold, called the clip error sequence, then filtering it and finally subtracting it from a properly delayed version of the original signal itself. The purpose of the delay is to compensate for all the latencies generated during the detection and extraction of the clip error and filtering. The clip error generation consists first of the generation of a clipped version of the signal, B(n), according to the formula (note that the clipped signal retains its complex nature, see also Figure 2.5):

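\[ B(n) = \begin{cases} x(n) & \text{if } \|x(n)\| \le \text{threshold} \\ \dfrac{x(n) \cdot \text{threshold}}{\|x(n)\|} & \text{otherwise} \end{cases} \]

and second, of the successive subtraction of such a generated signal from the original one:

\[ e(n) = x(n) - B(n) \]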

where x(n) is the original signal and e(n) is the clip error (see Figure 2.6). The clip error signal e(n) is then filtered by a filter whose coefficients are computed off-line and stored in a memory. The filter design is tailored for the specific type of signal the algorithm will work with (i.e. number and bandwidth of the carriers). After each iteration of the algorithm, it is possible that some peaks will be created by the filtering operation itself (the so-called peak regrowth phenomenon), so successive applications of the algorithm might be necessary, and this is accomplished by cascading several stages of PF.
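The two formulas for B(n) and e(n) can be written down almost literally. The sketch below covers only the clip error generation; the filtering and delay compensation stages are not modelled. The function names, the sample values and the threshold of 2.5 are assumptions chosen so that the numbers are easy to verify: the sample 3 + 4j has magnitude 5, so it is scaled to 1.5 + 2j, which has magnitude 2.5 and the same phase.

```python
def clip(x, threshold):
    """B(n): keep the phase of each sample, limit its magnitude to the threshold."""
    return [s if abs(s) <= threshold else s * threshold / abs(s) for s in x]

def clip_error(x, threshold):
    """e(n) = x(n) - B(n): non-zero only where the magnitude exceeds the threshold."""
    return [s - b for s, b in zip(x, clip(x, threshold))]

# Hypothetical complex samples; only the second one exceeds the threshold.
x = [0.5 + 0.5j, 3 + 4j, -1 + 0j, 0 - 2j]
threshold = 2.5

b = clip(x, threshold)
e = clip_error(x, threshold)
print(b)
print(e)
```

In Peak Filtering the sequence `e` would next be passed through the pre-computed filter before being subtracted from the delayed input.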

Figure 2.5: Reduction of a complex sample to a version with the same phase and magnitude equal to a set threshold. Axes: Re, Im.

Another reason justifying the cascading of several PF stages is the fact that a discrete-time signal does not necessarily exhibit the maxima of the true analog signal that it samples and that will reach the Power Amplifier [9]. It is indeed possible for two successive elements of the discrete-time signal to both have a lower amplitude than the peak of the analog signal lying between them, because of the very nature of the discrete-time representation of a continuous-time signal. In order to expose these hidden peaks, fractional delay filters are often interposed between successive stages of the PF. The effect of these filters is equivalent to a conversion from digital to analog followed by a slightly time-shifted sampling process at the same sample rate as the original.
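The hidden-peak effect is easy to reproduce with a waveform whose analog form is known. The sketch below does not implement a fractional delay filter; as a simplifying assumption it evaluates a cosine directly on two sampling grids that differ by half a sample, which is the effect the text attributes to such a filter. The function name, the period of four samples and the sample count are illustrative choices.

```python
import math

def sample_cosine(offset, period=4.0, count=8):
    """Sample cos(2*pi*t/period) at t = n + offset, for n = 0 .. count-1."""
    return [math.cos(2 * math.pi * (n + offset) / period) for n in range(count)]

# On this grid every true peak of the cosine falls exactly between two samples.
hidden = max(abs(v) for v in sample_cosine(0.5))

# Shifting the grid by half a sample makes the peaks coincide with samples.
exposed = max(abs(v) for v in sample_cosine(0.0))

print(hidden, exposed)
```

The largest sample on the first grid is about 0.707, although the underlying waveform reaches 1, so a threshold anywhere between those two values would be exceeded by the analog signal without any sample revealing it.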

Figure 2.6: The generation of the clip error from the original signal. Panels: Input, Output; axes: amplitude versus samples, with the threshold marked.

Peak Cancellation (PC)

Figure 2.7: A very simplified top-level architecture of the Peak Cancellation algorithm.

Unlike the PF, the Peak Cancellation algorithm (see Figure 2.7) does not filter the clip error sequence, but explicitly isolates a single input sample among those identified within a certain Peak Search Window interval (a more formal definition will be given when the algorithm is described in more depth). Each time the algorithm detects these elements, called peaks, it cancels them individually by subtracting a properly shaped cancelling pulse from the signal, one for each peak. The major advantage of the PC is the reduced complexity of the algorithm compared to the PF, because of the lack of actual filtering of a clip error. In Figure 2.7, the Peak Extractor is the block that detects the samples whose magnitude is greater than the threshold, and it is basically the same in PF and in PW, whereas the Peak Detector, present only in the PC algorithm, isolates the maximum of these samples, which, as said, is defined as the peak. The reduction of the PAPR via the PC algorithm is achieved by cancelling these detected peaks via cancelling pulses that are generated only when the peaks are detected. The stored pulse is, similarly to the PF filter, a combined impulse response of all the input carrier filters modulated to the correct frequency within the multi-carrier frequency band. Such a cancelling pulse can be generated in advance (off-line) and depends only on the carrier configuration of the input signal. For each peak, an impulse with the correct amplitude and phase is generated and subtracted. Some peak regrowth can occur as a consequence of the subtraction of the cancelling pulses from the input signal, therefore the algorithm has to be run several times. For example, in Figure 2.8 it can be seen that, because of the application of the cancelling pulse (in red), the two minima surrounding the peak add in phase with the pulse itself, thus generating two more peaks.

Figure 2.8: The effect of the cancelling pulse on the samples adjacent to the targeted peak. Note the regrowth of the peaks as a consequence.

Peak Windowing (PW)

Figure 2.9: Top-level architecture of the Peak Windowing algorithm. (Blocks: Peak Extractor, Window generator, Delay.)

The Peak Windowing algorithm (see Figure 2.9) is based on multiplying the signal by an attenuating window W(k) rather than adding a correction to the signal. When a peak is detected in the input signal, a set of coefficients (a window, see Figure 2.10) is either generated at run-time or read from a memory where it is stored after being pre-computed off-line. Before the window is applied to the signal, the coefficients are scaled by a real number C, chosen in such a way that the peaks are attenuated to the desired level (the threshold). The signal around the maximum peak sample n_p is multiplied by the attenuating window according to:

y(n) = x(n) · (1 − C · W(n − n_p + K/2))

where K is the number of window coefficients. The input signal is delayed to compensate for the delay of the peak search part of the algorithm and to make the peak sample coincide with the maximum of the window. In the time domain, the windowing operation corresponds to subtracting from the original signal a windowed part of itself, whereas in the frequency domain it corresponds to the convolution of the signal spectrum with the Fourier transform of the window. One advantage of the algorithm is that, if the window amplitude changes smoothly, not much OOB emission is expected to appear; on the other hand, the lack of knowledge about the exact frequency characteristics of the attenuating window (because it is tailored to the particular input segment around the peak) makes it harder to guarantee a required or specified OOB performance. It would be desirable to minimize both the EVM and the OOB emission, but a trade-off must be chosen for the length of the window because, as will be clarified further on, the longer the window is, the worse its impact on the in-band distortion (and thus on the EVM) but the better its effect on the adjacent channel (and thus on the OOB emission), and vice versa. Furthermore, if closely spaced peaks are detected, the algorithm tends to overcompensate, and this again has a negative effect on the EVM.

Figure 2.10: Window to be multiplied with the signal in order to reduce the magnitude of the peaks.
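The windowing formula y(n) = x(n) · (1 − C · W(n − n_p + K/2)) can be tried out with the following Python sketch. Two assumptions are made that the text does not state: the window is a periodic Hann window whose maximum W(K/2) equals one, and the scaling factor is chosen as C = 1 − threshold/|x(n_p)| so that the peak sample lands exactly on the threshold. All names and values are hypothetical.

```python
import math

def hann_window(K: int):
    """Periodic Hann window with K coefficients (assumed shape); W[K//2] = 1."""
    return [0.5 * (1 - math.cos(2 * math.pi * k / K)) for k in range(K)]

def apply_peak_window(x, n_p, threshold, window):
    """Attenuate the samples around the peak at index n_p with the window."""
    K = len(window)
    C = 1 - threshold / abs(x[n_p])   # assumed rule: peak is reduced to the threshold
    y = list(x)
    for k in range(K):
        n = n_p - K // 2 + k          # so that k = n - n_p + K/2
        if 0 <= n < len(x):
            y[n] = x[n] * (1 - C * window[k])
    return y

x = [1 + 0j, 1.2 + 0j, 2 + 0j, 1.2 + 0j, 1 + 0j, 0.5 + 0j]
y = apply_peak_window(x, n_p=2, threshold=1.5, window=hann_window(4))
```

In this example the peak of magnitude 2 is attenuated to 1.5, its two neighbours are attenuated by a smaller factor, and the samples outside the window are left untouched.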

Figure 2.11 shows the effect of the windowing on a segment of the input signal. Processing the signal repeatedly in this way, when successive windows overlap, has the unfortunate effect of reducing the overall average power instead of the PAPR. This can be partially mitigated by introducing some more complexity in the algorithm, such as coefficients that take into account the presence of earlier windows, or the search for and detection of closely spaced peaks followed by the generation of a single window. The best way to reduce the risk of excessive attenuation is to cascade several PW stages, each attenuating the peaks to a lesser extent. This, of course, introduces a longer delay as well. The PW is the least complex of the presented algorithms but also the one with the worst (and least predictable) performance in terms of in-band distortion and out-of-band emissions.



Figure 2.11: Effect of the application of the window on a segment of the signal containing peaks.

2.2 Related Work

In Xilinx application note 1033 (XAPP1033 [1]), the company proposes a PC-CFR algorithm, together with an implementation for their Virtex-4 and Virtex-5 families of FPGAs, based on a simple architecture featuring a peak detector and four cancelling pulse generators. The coefficients of the unscaled cancelling pulse are generated off-line by superposing as many prototype filter masks, properly shifted in frequency, as the number of carriers the input signal is made of. The algorithm is compared against a Peak Windowing CFR (PW-CFR) and a Noise Shaping CFR (NS-CFR). With the frequency and number of coefficients chosen by the authors of the application note for the comparison, the PC-CFR outperforms both the NS-CFR and the PW-CFR solutions in terms of ACLR and EVM.

In [5], Song and Ochiai propose a Field Programmable Gate Array (FPGA) implementation of the PC-CFR. The added value of their solution is a workaround for the problem of overlapping cancelling pulses caused by the cancellation of too closely spaced peaks. When the detected peaks are too closely spaced (in terms of number of samples), the corresponding cancelling pulses might overlap and add in phase, thus both reducing the effect of the peak reduction and generating peak regrowth. When the measured distance between successive peaks falls under a certain value, the authors propose generating a truncated version of the cancelling pulses in order to avoid the overlap. Such a truncation introduces discontinuities in the signal and, as a consequence, OOB emission. The authors state that a simple moving-average filter is good enough to take care of these emissions and satisfy the ACLR requirements. Results show that the proposed solution is satisfactory in terms of both EVM and ACLR, although the hardware complexity is higher than that of the plain PC-CFR solution because of the added circuitry for the detection and truncation of the pulses.

In [10], Schmidt and Schlee propose a PC method that generates a cancelling pulse shaped only on the carrier that, at the moment of the peak detection, contributes the most to the aggregated signal. By doing so, the algorithm should minimize both the in-band distortion and the OOB emissions. The knowledge about which sub-carrier is responsible for the largest part of the peak should be available from the measurement of the time at which the peak is detected. The cancelling pulses are also dynamically conditioned by a set of weights that may change according to several scenarios that might occur (e.g. if a carrier is idle for a certain amount of time, the corresponding spectral range could be "occupied" without any risk of introducing distortion).

In [11], Bauml et al. use the term selected mapping for the first time. The selected mapping algorithm can be used to mitigate the PAPR of signals consisting of an arbitrary number of carriers and any signal constellation. This method provides significant advantages at the cost of a moderate additional complexity.

In [12], Wang et al. describe the first nonlinear companding³ transform (NCT) for PAPR reduction, derived from the µ-law speech processing algorithm. It showed better performance than the clipping algorithm.

In [13], Jean Armstrong transforms the OFDM signal into the time domain via an oversized IDFT, which results in trigonometric interpolation. The signal is then clipped and filtered via a forward and an inverse DFT in order to remove OOB emissions. These results are further improved by the same author (see [14]) by repeatedly clipping and filtering. In particular, the author claims that this method causes no increase in OOB emissions.

In [15], unlike the µ-law companding scheme, which reduces the PAPR by enlarging only the small portions of the signal, Jiang et al. propose a solution based on the exponential companding technique, which adjusts both small and large signal samples, keeping the average power unchanged but transforming the power density distribution from Rayleigh to uniform and also generating fewer spectral side-lobes. A similar approach is pursued in [16] by Al-Azzo et al., where the distribution density is instead transformed from Rayleigh to Gaussian and, as a consequence, the peak and average values change so that the overall PAPR decreases. Improvements are shown in the in-band distortion too.

In 2008, Carole et al. [17] present a method that exploits the unused carriers in OFDM systems in order to decrease the PAPR of the signal without introducing significant OOB and in-band distortions (compared to clipping and windowing techniques), because there is no interference with the proper data channels.

In 2013, Sroy et al. [18] propose a version of the Iterative Clipping and Filtering (ICF) algorithm for the PAPR reduction of OFDM signals using the (Inverse) Discrete Cosine Transform (IDCT/DCT), showing better results than the regular DFT/IDFT based approach in [14].

³ From the combination of the words compressing and expanding.


Chapter 3

The proposed implementation of the PC-CFR

3.1 General description of the PC-CFR algorithm

A detailed description of the implementation of the PC-CFR algorithm is given in the following section of this chapter, but first a more in-depth general discussion of the algorithm is necessary in order to better understand the design choices that have been made.

Figure 3.1: Typical positioning of the CFR inside the communication chain. (Labels in the diagram: x1, x2, xK; h1, h2, hK; e^{iω1Ts}, e^{iω2Ts}, e^{iωKTs}; Σ; CFR; DPD; DAC; e^{iωt}; HPA; Antenna.)

The PC-CFR module is usually placed after the aggregator (which combines all the signals coming from the different channels) and before the Digital Pre-Distorter (DPD), when present (see Figure 3.1). The input of the system is a fixed-point signal made of two parts (in-phase and quadrature¹). It is the sum of all the components relative to the various carriers, and the result is a high-PAPR discrete-time signal. The output of the PC-CFR is a delayed, lower-PAPR signal in the same format. The purpose of the algorithm is to reduce the PAPR of the input signal to a desired value, and this is achieved by monitoring and, when necessary, reducing the values of the samples exceeding a certain threshold. The value of this threshold is directly related to the final desired PAPR.
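The relation between the threshold and the desired PAPR can be made concrete with a small Python sketch. The conversion below is an assumption, not a formula taken from the design: it supposes that the average power of the signal is not significantly changed by the peak cancellation, so that the threshold is simply the magnitude whose power exceeds the average power by the target PAPR. The function names and the sample values are hypothetical.

```python
import math

def papr_db(signal) -> float:
    """Peak-to-Average Power Ratio of a complex signal, in dB."""
    powers = [abs(s) ** 2 for s in signal]
    return 10 * math.log10(max(powers) / (sum(powers) / len(powers)))

def threshold_for_target_papr(signal, target_papr_db: float) -> float:
    """Magnitude threshold for a target PAPR, assuming (as a simplification)
    that the average power stays the same after the peaks are reduced."""
    mean_power = sum(abs(s) ** 2 for s in signal) / len(signal)
    return math.sqrt(mean_power * 10 ** (target_papr_db / 10))

signal = [1 + 0j, 1j, -1 + 0j, 3 + 0j]          # hypothetical samples
input_papr = papr_db(signal)                    # 10*log10(9/3) dB
thr = threshold_for_target_papr(signal, 3.0)    # threshold for a 3 dB target
```

In practice the average power decreases slightly when peaks are cancelled, so a threshold computed this way should be read as a first approximation.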

The PC-CFR performs time-domain signal processing on limited, selected portions of the input signal. These portions are selected according to the presence of peaks, which can be defined as follows: given the interval of samples of the input signal starting from the first one having magnitude greater than the threshold and finishing after a fixed number of samples, the peak is the element having the maximum magnitude inside this interval. Because the detection of the peaks is based on the magnitude of the input samples, a conversion from rectangular to polar form, or some other means of exposing the magnitude of the input samples, is needed as one of the first steps of the algorithm. For each detected peak, a cancelling pulse is generated and subtracted from the input signal in order to reduce the value of the peak to the value of the threshold. The complex coefficients of the cancelling pulse are stored in a memory; these coefficients are the same for each peak being cancelled, but in order for the cancelling pulse to be shaped accurately after the peak it is expected to cancel, they are multiplied by the peak characteristics before being subtracted from the input signal. More precisely, the cancelling pulse, used to cancel the peak from the complex input signal by subtraction, is generated by a simple complex multiplication between each of the coefficients of the stored unscaled cancelling pulse and a single complex number coming from the peak detection part of the algorithm; this operation is performed for each peak independently.

The characteristics of the peak p that are needed for the generation of the cancelling pulse are the difference between the magnitude of the sample selected as the peak (s_k, for some k) and the threshold, and the phase of that sample:

p = ρ_P · e^{iθ_P}, where ρ_P = ‖s_k‖ − threshold

The elements c[n] of the cancelling pulse are generated according to this formula:

c[n] = ρ_P · ρ[n] · e^{i(θ_P + θ[n])}

where ρ[n] · e^{iθ[n]} are the coefficients of the unscaled cancelling pulse, for all values of n. It should be noted that this operation is much less computationally intensive (i.e. it requires a much lower amount of hardware resources) than other filtering-based CFR signal processing algorithms. At the output of the multiplier, the complex data is converted back to rectangular form², ready to be subtracted from the input signal, thus finally cancelling the peaks. Of course, it may happen that more than one cancelling pulse needs to be generated at the same time, so that portions of their intervals overlap. In order to provide the cumulative effect of all the cancelling pulses, the coefficients of all the active pulses must be added together and then subtracted from the signal at each sample of interest. Another observation is that the cancelling pulse is exact for the peak element and for that element only: the central element of the unscaled cancelling pulse is the one that, when multiplied by the peak characteristics and subtracted from the signal, yields an element whose magnitude matches the threshold value exactly. It follows that the value of this central element must be real and equal to one. Figure 2.5 shows, on the complex plane, the effect of the subtraction and the consequent reduction of the peak to a magnitude matching the threshold. All the neighboring input samples are modified as well, as already explained, in such a way that their magnitude is generally reduced too, but it should be noted that the algorithm has no accurate control over these elements; therefore some undesirable phenomena are unavoidable, as will be illustrated shortly.

¹ The in-phase/quadrature format can be formally considered as a complex signal, with the real and imaginary parts corresponding to the in-phase and quadrature components respectively. In the rest of the text the two formalisms (complex and I/Q) will be used interchangeably.
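The generation of the cancelling pulse and its subtraction from the signal, as described above, can be sketched in Python as follows. This is an illustrative floating-point model, not the fixed-point hardware: the multiplication is carried out directly on complex numbers instead of in polar form followed by a CORDIC conversion, the peak positions are assumed to be already known, and the coefficients of the unscaled pulse are hypothetical (only the central one, real and equal to one, follows from the text).

```python
import cmath

def peak_descriptor(sample: complex, threshold: float) -> complex:
    """p = rho_P * e^{i*theta_P}, with rho_P = |s_k| - threshold."""
    return cmath.rect(abs(sample) - threshold, cmath.phase(sample))

def cancel_peaks(signal, peak_indices, threshold, unscaled_pulse):
    """Subtract one scaled cancelling pulse per peak, centred on the peak.
    Overlapping pulses are summed before the subtraction."""
    center = len(unscaled_pulse) // 2
    correction = [0j] * len(signal)
    for k in peak_indices:
        p = peak_descriptor(signal[k], threshold)
        for n, coeff in enumerate(unscaled_pulse):
            idx = k - center + n
            if 0 <= idx < len(signal):
                correction[idx] += p * coeff    # c[n] = p * rho[n] e^{i*theta[n]}
    return [s - c for s, c in zip(signal, correction)]

# Hypothetical unscaled pulse: the central coefficient is real and equal to one.
unscaled = [0.1 + 0j, -0.3 + 0j, 1 + 0j, -0.3 + 0j, 0.1 + 0j]
x = [0.2j, 0.5 + 0j, 0.4 - 0.3j, 3 + 4j, -0.6 + 0j, 0.1j, 0.3 + 0j]
y = cancel_peaks(x, peak_indices=[3], threshold=2.0, unscaled_pulse=unscaled)
```

After the subtraction, the peak sample keeps its phase and has a magnitude equal to the threshold, whereas the neighbouring samples covered by the pulse are modified without any direct control over their final magnitude.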

The algorithm is usually applied more than once to the signal. This is done by letting the output of one module, or stage, become the input of the next one, in a cascaded structure (see Figure 3.2). The reasons for doing so are the following:

• Peak Leak. If an implementation of the PC-CFR algorithm poses an upper limit on the number of simultaneous cancelling pulses that can be generated by a single stage, then, when this limit is reached and a new peak is detected, the peak will simply pass uncancelled through the stage and, in the case of the last stage, it will reach the Power Amplifier, which is the event we strive to avoid in the first place. By cascading several stages, the probability of such an event decreases. The scenario depicted should not be considered unlikely, because peaks may come in bursts separated by relatively long periods of inactivity, so the utilization of the resources of the module is not uniform over time, alternating between periods of high intensity and long idle periods. It is crucial to understand that what is considered a peak, and therefore the presence, density and magnitude of the peaks, depends on the parameter values the Clip Stage is configured with. For example, if no peaks are detected for a certain value of the threshold, the same set of input values may exhibit one or more peaks for a lower threshold.

The number of closely spaced detected peaks also depends on the Peak Search Window length (i.e. how many samples are observed in search of the peak): the same set of input elements could give rise to a larger or smaller number of detected peaks according to the length of this interval (the longer the interval, the fewer the detected peaks, because larger numbers of samples will be associated with single peaks).

² The rectangular form of complex numbers is much more suitable than the polar form for performing additions and subtractions.

• Peak Regrowth. It can be observed (Figure 2.8) that, because the subtraction of the cancelling pulses from the signal affects a much larger number of samples than the peak alone, some of the samples that were smaller than the threshold before the cancellation of a peak may rise above it because of the constructive summation of the cancelling pulses, thus becoming peaks themselves although they were not in the beginning, and creating the so-called peak regrowth phenomenon. The magnitude of the regrown peaks is correlated with the height of the original cancelled peak, in the sense that the greater a peak is, the more likely and the higher the regrown peaks are after its cancellation. By cascading several stages, the regrown peaks can be taken care of as well.

• Gradual peak reduction. It may happen that, in an interval where several cancelling pulses operate simultaneously, one or more peaks are not cancelled accurately (either not completely or too much) because of the mutual interactions among the cancelling pulses. This is an unavoidable phenomenon, which is more likely to happen and has more severe effects the greater the peak to cancel is. In order to mitigate this and the effect of the peak regrowth phenomenon, a smart strategy consists of gradually reducing the magnitude of the peaks by applying progressively decreasing thresholds to successive stages of the PC-CFR, instead of trying to cancel them completely in one pass. This can be easily achieved with a cascaded architecture because each iteration of the PC-CFR may be independently configured with a different set of parameters, such as the threshold.
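The cascading of stages with progressively decreasing thresholds can be sketched in Python as follows. The stage model is deliberately simplified and is an assumption of this example: each stage cancels only the largest sample of the whole signal, without a Peak Search Window and without any limit on hardware resources. The pulse coefficients, the thresholds and the function names are hypothetical.

```python
import cmath

def clip_stage(signal, threshold, unscaled_pulse):
    """Simplified stage: cancel only the largest sample, if above the threshold."""
    k = max(range(len(signal)), key=lambda n: abs(signal[n]))
    if abs(signal[k]) <= threshold:
        return list(signal)
    p = cmath.rect(abs(signal[k]) - threshold, cmath.phase(signal[k]))
    center = len(unscaled_pulse) // 2
    out = list(signal)
    for n, coeff in enumerate(unscaled_pulse):
        idx = k - center + n
        if 0 <= idx < len(out):
            out[idx] -= p * coeff
    return out

def cascade(signal, thresholds, unscaled_pulse):
    """Run one simplified stage per threshold; record the largest magnitude
    found at the output of each stage."""
    history = []
    for thr in thresholds:
        signal = clip_stage(signal, thr, unscaled_pulse)
        history.append(max(abs(s) for s in signal))
    return signal, history

x = [0.3, -0.2, 0.5, 4.0, 0.4, -0.1, 0.2]     # one large peak at index 3
pulse = [-0.2, 1.0, -0.2]                     # hypothetical, centre equal to one
final, history = cascade(x, thresholds=[3.0, 2.0, 1.5], unscaled_pulse=pulse)
```

In this example the peak is brought down in three steps (3.0, 2.0 and 1.5), while the sample at index 2 grows from 0.5 to about 1.0 because of the negative side lobes of the pulse, which is a small-scale instance of peak regrowth.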

The implementation of numerous clip stages not only requires a larger area on the chip (and thus a higher power consumption), but also introduces a longer delay on the signal, which in general is undesirable, especially for the more recent communication protocols. The delay in the signal data path is purposely introduced so that all the computations constituting the algorithm have the time they need to execute. The largest portion of the delay is by far the group delay of the cancelling pulse itself, which obviously cannot start before the actual detection of the peak. As previously stated, this algorithm involves some signal processing, which in turn modifies the characteristics of the input signal, thus introducing both in-band and out-of-band distortion. In order to reduce this undesirable consequence, the unscaled cancelling pulse is chosen so that its frequency spectrum matches that of the input signal as closely as possible. The spectrum of the input signal depends on the number, bandwidth and relative positions of the carriers and is either known or estimable. A trade-off must nevertheless be chosen for the length of the cancelling pulse. The longer the cancelling pulse is (that is, the more coefficients it is made of), the more severe its effect on the input signal when it is subtracted, because the operation affects a larger number of elements, with a negative impact on the EVM; also, longer cancelling pulses require larger memories for their storage and impose longer delays. On the other hand, the steeper the frequency response of the cancelling pulse³ is, the more accurately we can intervene on the signal spectrum while at the same time reducing the consequences on the frequency intervals that do not belong to the input signal, yielding lower OOB emissions. This is a desirable behaviour because the total frequency bandwidth is a resource shared among several users, and its integrity must therefore be preserved.

Figure 3.2: Simplified block-level view of the architecture of the PC-CFR module, with two cascaded Clip Stages as an example. (Labels in the diagram: Peak Manager (management of HW resources), Clip Stage 1, Clip Stage 2, peak detection notification and peak characteristics, cancelling pulses, higher PAPR I/Q signal, lower PAPR I/Q signal.)

One notable limitation of the PC-CFR algorithm compared with other signal processing algorithms for CFR is that every cancelling pulse being generated requires the exclusive use of some hardware resources, which are of course available in a finite quantity. Other algorithms, based essentially on filtering the signal or portions of it, do not suffer from this limitation but, on the other hand, the complexity of their filters (which translates to a higher area occupancy and, in general, more power consumed by the ASIC) limits their attractiveness.

On the other hand, a notable advantage of the PC-CFR over the Turbo Clipping (TC) and other filter-based algorithms is its inherent flexibility with respect to changes in the input signal characteristics. In order to adapt the PC-CFR algorithm to a completely different configuration of the input signal carriers, it is in fact sufficient to change the coefficients of the unscaled cancelling pulse via a re-configuration of the pulse memory, which makes the module useful in several contexts. The TC algorithm, instead, operates on each carrier independently via a properly designed branch consisting of one or more decimators, Finite Impulse Response (FIR) filters and interpolators. It follows that the entire hardware architecture of the TC is shaped upon a particular configuration of the input signal carriers, and it cannot be reconfigured as easily. On the other hand, the per-carrier filtering of the TC allows a more accurate, and therefore more effective, intervention on the input signal, whereas the cancelling pulse in the PC-CFR is generally obtained from the cumulative characteristics of the entire input signal carrier configuration and is therefore sub-optimal with respect to each carrier.

³ According to the theory of digital signal processing, longer sequences in the discrete-time domain correspond to steeper profiles in the frequency domain.

3.2 Structural description of the proposed implementation

The proposed architecture consists of a parameterizable number of cascaded Clip Stages (CSs), each of them communicating with a centralized controlling module called the Peak Manager (PM) (see Figure 3.2 for an example scenario with two Clip Stages). The cascade of CSs constitutes the data path of the signal and allows the algorithm to be iterated the desired number of times, though not necessarily with the same set of configuration values (every CS can be configured with a local threshold and Peak Search Window length, for example). In each CS, the following operations are performed: the conversion of the input signal from rectangular to polar form, the peak detection, the delaying of the input signal and the subtraction of the cancelling pulse from it.

The PM is responsible for dispatching the detected peaks coming from the various Clip Stages to the available Peak Cancelling Units (PCUs)⁴ by implementing a dispatching policy. The generation of a cancelling pulse requires the availability of a PCU for the entire duration of the pulse itself. Such a PCU will appear busy, and therefore unavailable for the generation of other cancelling pulses, for the entire period. There is a finite number of PCUs in the Peak Manager. The PM receives the notifications about (and the characteristics of) the detected peaks from all the connected CSs, and then generates the cancelling pulses and dispatches them to the CSs (again, if at least one PCU is available). The PM is made of several components: one memory to store the coefficients of the unscaled cancelling pulse, a complex multiplier, a Coordinate Rotation Digital Computer (CORDIC) unit dedicated to the conversion of the data from the polar back to the rectangular form, and an adder to combine all the cancelling pulses before sending them to the various CSs for the final cancellations. A controlling unit and a pulse generator are responsible for the overall management of the whole subsystem.
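The occupation of the PCUs can be modelled with a few lines of Python. The dispatching policy used here (assign each peak to the first free PCU, in order of arrival) is an assumption made for the sake of the example, since the actual policy is not specified at this point; the peak arrival times, the number of PCUs and the pulse length are hypothetical values expressed in samples.

```python
def dispatch_peaks(peak_times, num_pcus, pulse_length):
    """Assign each detected peak to the first free PCU (assumed policy).
    A PCU stays busy for pulse_length samples after it has been assigned.
    Returns the {peak_time: pcu_index} assignments and the list of leaked
    peaks, i.e. those that found every PCU busy."""
    free_at = [0] * num_pcus          # first sample at which each PCU is free again
    assignments = {}
    leaked = []
    for t in sorted(peak_times):
        for pcu in range(num_pcus):
            if free_at[pcu] <= t:
                free_at[pcu] = t + pulse_length
                assignments[t] = pcu
                break
        else:
            leaked.append(t)          # no PCU available: the peak passes uncancelled
    return assignments, leaked

# A burst of three close peaks followed by an isolated one, with two PCUs.
assignments, leaked = dispatch_peaks([0, 3, 5, 40], num_pcus=2, pulse_length=32)
```

With two PCUs and a pulse lasting 32 samples, the third peak of the burst finds both units busy and leaks through the stage, whereas the isolated peak at sample 40 is served again by the first unit.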

Figure 3.3 presents the top-level diagram of the entire PC-CFR, with the names of the input/output ports and the principal configurable parameters as they appear in the SV code. The following is a list describing each of these signals.

⁴ What is referred to here as a PCU is the set of hardware and physical resources (a time-slot in the Time-Division Multiplexing rotation is a physical resource) needed for the generation of a cancelling pulse.

Figure 3.3: Top-level view of the input/output signals and parameters of the PC-CFR module. (Labels in the diagram: i_data_real, i_data_imag, o_data_real, o_data_imag, o_data_stats, clk, clk_1G, rst_n, i_dtg, i_thr_lvs, i_cmd_strst, cr_thr_c, cr_psw_length_c.)

• i_data_real. Data, input. In-phase component of the input data.
• i_data_imag. Data, input. Quadrature component of the input data.
• i_dtg. Configuration register, input. Input data toggle. At each toggle of this signal the module processes one data sample.
• i_thr_lvs. Configuration register, input. This input provides the mapping between the peak scale values and the length of the cancelling pulses, as explained in the report.
• i_cmd_strst. Configuration register, input. Synchronous reset of the peak statistics.
• clk. Clock. Main clock of the module. Its frequency is 250 MHz.
• clk_1G. Clock. Secondary, faster clock of the module, used for time-division multiplexing. Its frequency is clk*4 = 1 GHz.
• rst_n. Reset. Active-low, asynchronous reset.
• cr_thr_c. Configuration register, input. Values of the thresholds for the Clip Stages.
• cr_psw_length_c. Configuration register, input. Values of the PSW length for the Clip Stages.
• o_data_real. Data, output. In-phase component of the output data.
• o_data_imag. Data, output. Quadrature component of the output data.
• o_data_stats. Status register, output. Statistics about the peak height distribution.

Figure 3.4: Block-level diagram of the Clip Stage. The Peak Detector isolates the peaks in the input signal and collects statistics on them. (Labels in the diagram: CORDIC, Peak Detector, Delay (CORDIC + PSW + group delay + various), Threshold, Peak Search Window (PSW) length, Peak statistics, Mag/phase, higher PAPR I/Q data, lower PAPR I/Q data, cancelling pulses from the Peak Manager; peak scale, peak phase and displacement to the Peak Manager.)

3.2.1 The Clip Stage

Each Clip Stage (Figure 3.4) receives the data to be processed from the previous CS (or from the previous module in the processing chain, in the case of the first Clip Stage), in the form of an I/Q fixed-point signal. The inputs of the CS are the clock, the active-low reset, the input signal (real and imaginary parts), the input data toggle command, the synchronous reset of the peak statistics and the cancelling pulse(s) coming from the PM. The output is the registered difference between the (delayed) input signal and the cancelling pulse(s).

CORDIC

The first module encountered by the signal inside the CS is the CORDIC. The CORDIC is a flexible iterative algorithm capable of computing approximations of several transcendental functions without the need for multipliers, so it is conveniently used in hardware design in order to minimize the area. Inside the CS, it is used to convert the complex input signal from the rectangular to the polar form, so that the magnitude of the input samples is exposed and the peaks can be detected. The implemented CORDIC can be configured to be synthesized in a pipelined or a non-pipelined version. The latter performs all the iterations combinatorially in a single clock cycle, thus offering a significantly lower delay, but it might not be synthesizable at the higher frequencies. With a view to making the PC-CFR design as configurable as possible, both choices are available.
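The rectangular-to-polar conversion performed by the CORDIC can be illustrated with the following Python model of the vectoring mode. It is a floating-point sketch, not the fixed-point SV module: the multiplications by powers of two stand for the shifts of the hardware, the arctangent values and the gain would be pre-computed constants, and the number of iterations and the quadrant pre-rotation are choices of this example.

```python
import math

def cordic_rect_to_polar(x: float, y: float, iterations: int = 16):
    """Vectoring-mode CORDIC: return (magnitude, phase) of x + iy by
    rotating the vector onto the positive real axis in micro-rotations."""
    angle = 0.0
    if x < 0:
        # Pre-rotate by +/-90 degrees to bring the vector into the right half-plane.
        if y >= 0:
            x, y, angle = y, -x, math.pi / 2
        else:
            x, y, angle = -y, x, -math.pi / 2
    gain = 1.0
    for i in range(iterations):
        shift = 2.0 ** -i                  # a right shift by i bits in hardware
        if y >= 0:                         # rotate clockwise
            x, y = x + y * shift, y - x * shift
            angle += math.atan(shift)
        else:                              # rotate counter-clockwise
            x, y = x - y * shift, y + x * shift
            angle -= math.atan(shift)
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, angle                 # compensate for the CORDIC gain

magnitude, phase = cordic_rect_to_polar(3.0, 4.0)
```

Each iteration uses only additions, subtractions and shifts, which is the reason why the algorithm is attractive when the area must be minimized; the accuracy of the phase improves by roughly one bit per iteration.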



Peak Detector

In the following, the term Peak Detector refers to what, with reference to Figure 2.7, corresponds to the combined functions of the Peak Detector and the Peak Extractor. Therefore, the goals of the Peak Detector are:

• To identify the peaks in the input signal. Every time a new peak is detected, the module sends the peak characteristics and a notification pulse to the Peak Manager (PM).

• To collect information on the height of the detected peaks. This data is collected either for statistical purposes or with a view to using it to adjust the threshold and the Peak Search Window (PSW) length (not yet possible in the present implementation).

The Peak Detector can be configured with two values, the threshold and the PSW length (cr_thr_c and cr_psw_length_c in the SV code, respectively), which can be set independently for each CS. The module is implemented as a two-state Finite State Machine (FSM) (see Figure 3.5 for the Algorithmic State Machine (ASM) of the Peak Detector, with pseudo-code or plain English in place of the actual SV statements or variable identifiers, in order to favor clarity over formality). In the IDLE state, the input samples pass unaffected and no action is taken until a sample exceeds the programmed threshold. The state machine then moves to the PEAK_SEARCH state, during which, for the fixed number of samples dictated by the PSW length register, successive input samples are compared to the last detected maximum in order to find the maximum sample within the entire interval (the definition of peak). This is performed simply by comparing the magnitude of each new input sample with the current maximum, which is stored in a register together with the corresponding phase. At the end of this interval, the value of the threshold parameter is subtracted from the magnitude of the maximum input sample found, thus defining what will be referred to as the peak scale in the rest of the report. The peak scale, the corresponding phase and a trigger signal are sent to the PM, and the statistics of the peaks are updated with the new arrival.

A fundamental aspect of the peak detection process has been neglected so far: within the PSW interval, the sample that will be elected as the peak can be found at any position (i.e. it could be the first, the second or the last sample in the interval), and this positional information is necessary for the proper alignment between the input signal and the cancelling pulse that will be generated by the PM. The Peak Detector keeps track of this displacement of the peak inside the PSW interval via a counter (reported as displacement in Figure 3.5), and this is the last piece of information sent by the Peak Detector to the PM when a new peak is detected.
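The behaviour of the two-state machine, including the tracking of the displacement, can be summarized by the following Python model. It is a behavioural sketch and not the SV code: it assumes that the Peak Search Window includes the sample that first exceeds the threshold and spans exactly psw_length samples, and that the displacement is counted from that first sample. The collection of the statistics is omitted, and the input values are hypothetical.

```python
def detect_peaks(magnitudes, phases, threshold, psw_length):
    """Return one (pk_scale, pk_pha, displacement) tuple per detected peak."""
    peaks = []
    searching = False                 # False: IDLE state, True: PEAK_SEARCH state
    max_mag = max_pha = 0.0
    displacement = count = 0
    for mag, pha in zip(magnitudes, phases):
        if not searching:
            if mag <= threshold:
                continue              # IDLE: the sample passes, nothing to do
            # The first sample above the threshold opens the search window.
            searching = True
            max_mag, max_pha = mag, pha
            displacement, count = 0, 1
        else:
            if mag > max_mag:         # new maximum inside the window
                max_mag, max_pha = mag, pha
                displacement = count
            count += 1
        if count == psw_length:       # end of the PSW: notify the Peak Manager
            peaks.append((max_mag - threshold, max_pha, displacement))
            searching = False
    return peaks

mags = [0.5, 1.2, 1.8, 1.5, 0.4, 0.3, 1.1, 0.2]
phas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
peaks = detect_peaks(mags, phas, threshold=1.0, psw_length=4)
```

In this example the window opened by the sample of magnitude 1.2 elects the following sample (magnitude 1.8) as the peak, with a peak scale of about 0.8 and a displacement of 1; the window opened by the sample of magnitude 1.1 is still in progress when the input ends, so it produces no notification.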

Delay Memory

The input signal to the CS is sent to both the CORDIC and a delay memory whose purpose is to compensate for the delays due to the various aforementioned steps of

[Figure 3.5 (ASM chart). IDLE: if i_data_mag > thr., save i_data_mag and i_data_pha, reset the PSW counter and move to PEAK_SEARCH; otherwise remain in IDLE. PEAK_SEARCH: if i_data_mag > current max, update the current max magnitude, phase and displacement; increment the PSW counter (samples count); when the end of the PSW is reached, notify the Peak Manager (create and send pk_scale = max mag - thr., pk_pha and displacement), update the peak statistics and return to IDLE.]

Figure 3.5: ASM of the Peak Detector. Please note that "thr." and "End of PSW" correspond to the cr_thr_c and cr_psw_length_c parameters respectively.


the processing on the signal in the CS and in the PM. The delay can be split into two components, which gives a clearer understanding of their origins and relative sizes (expressed in data rate periods). The smaller component compensates for the CORDIC (one period if it has been configured as non-pipelined, eleven otherwise; see footnote 5), the Peak Detector (a number of periods equal to the PSW length), and the whole processing chain of the PM. The larger component by far is the group delay of the cancelling pulse generation, which is approximately half the number of coefficients of the pulse.
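As a rough illustration of how the two components add up, the depth of the delay memory could be estimated as follows. This is a sketch under stated assumptions: the function name is hypothetical, and pm_latency (the latency of the PM processing chain, in data rate periods) is a placeholder parameter whose actual value is not given in this section.

```python
def delay_memory_depth(psw_length: int,
                       num_coeffs: int,
                       cordic_pipelined: bool,
                       pm_latency: int) -> int:
    """Estimate the delay memory depth in data rate periods."""
    # Smaller component: CORDIC, Peak Detector and PM processing chain.
    cordic_latency = 11 if cordic_pipelined else 1
    control_delay = cordic_latency + psw_length + pm_latency  # pm_latency: assumed value
    # Larger component: group delay of the cancelling pulse,
    # approximately half of the number of coefficients.
    group_delay = num_coeffs // 2
    return control_delay + group_delay


if __name__ == "__main__":
    # Hypothetical configuration: PSW of 16 samples, 1023 coefficients,
    # non-pipelined CORDIC, 20 periods of PM latency.
    print(delay_memory_depth(16, 1023, cordic_pipelined=False, pm_latency=20))
```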

Final registered subtractor

The output of the CS is generated as the registered difference between the delayed signal and the sum of the cancelling pulses coming from the PM. This costs another data period of delay per Clip Stage, which is not compensated by the delay memory because the final registered subtraction is the very last operation applied to the signal.

3.2.2 The Peak Manager

The Peak Manager (see Figure 3.6) is the centralized unit that receives the notifications and the characteristics of the detected peaks from all the Clip Stages, generates the cancelling pulses accordingly and sends them back to the appropriate Clip Stage, where they finally cancel the peaks. One of the most crucial tasks of the PM is the management of the Peak Cancelling Units (PCUs), whose optimal utilization has been the main effort in this design. In the most naive approach, the availability of N PCUs would require N replicas of all the resources needed to generate a single cancelling pulse: N memories storing the cancelling pulse coefficients, N complex multipliers, N CORDICs for the conversion from polar to rectangular form and N accesses to an adder to combine all the cancelling pulses. In order to minimise the area occupancy of the PC-CFR module, as anticipated in the introduction, the present implementation uses a time-division multiplexing approach for a more efficient exploitation of the described hardware resources. To make this possible, a second, faster clock is used, and the ratio between the faster and the slower clock frequencies is set as the parameter num_ts_c (number of time slots, see Figure 3.7). As in every time-division multiplexing scenario, a single resource is shared among several users in different intervals or slots of time forming a partition (that is, without any overlap) of a longer interval of time, which repeats periodically. In the present implementation the shared resource is the mentioned set of hardware resources (coefficient memory, multiplier, etc.), the slot of time is the period of the faster clock and the longer interval is the

5 The number eleven comes from the precision of the data processed by the CORDIC. The number of iterations of this algorithm is roughly the same as the number of bits used to represent the data.

[Figure 3.6 (block diagram): Clip Stages 0 to 3 in cascade on the I/Q signal path, each connected to the Peak Manager, which searches for and allocates HW resources to peaks through its Peak Cancelling Units (PCUs); the numbers 1, 2 and 3 mark the steps listed in the caption.]

Figure 3.6: Basic conceptual scheme of the relation between Clip Stages and Peak Manager. 1. detected peaks are notified and sent to the PM; 2. the PM searches for available HW resources in order to generate the cancelling pulses and, if available, 3. sends the cancelling pulses to the Clip Stages for the cancellation of the peaks.

period of the slower clock. Ultimately, the PM shares the hardware among as many cancelling pulses as the ratio of the slower clock period to the faster clock period (equivalently, the ratio of the faster to the slower clock frequency). The higher this ratio, the more efficiently the available hardware resources are used, because the hardware can be seen as exclusively available for the generation of multiple cancelling pulses, with the only constraint that they alternate in time to access it, in a non-overlapping way. The present PC-CFR module has been simulated and synthesised with the slower clock frequency constrained to 250 MHz (corresponding to the input data rate) and with a faster clock, generated internally to the ASIC, four times faster (1 GHz), thus giving a ratio of four. In order not to limit the reusability and flexibility of the project, however, the parameter num_ts_c can be set to any other positive integer value whenever a different ratio is convenient and/or physically available. In conclusion, the PM has num_ts_c PCUs available instead of only one, which translates to the possibility of cancelling up to num_ts_c peaks simultaneously.

The PM maintains a table holding the information about the peaks being dealt with, and in doing so it also keeps track of the availability of the resources for the generation of the cancelling pulses when a new peak is detected; in particular, each row of the table corresponds to a PCU (see footnote 6). Each PCU is also associated with a specific,

6 The terms rows and PCUs will be used interchangeably in the following.

[Figure 3.7 (timing diagram): fCLK = 250 MHz and fCLK_1G = 1 GHz; each period of the slower clock is divided into the time slots TS0, TS1, TS2 and TS3, which then repeat.]

Figure 3.7: The relation between the two clocks determines the definition of the time slots. It can be seen that in this time-division multiplexing exactly four time slots are available.

and only one, CS, this information being stored in a separate field, CS (see Table 3.1 and footnote 7). So, for example, if the PC-CFR has four Clip Stages and the PM is capable of generating up to eight cancelling pulses simultaneously, the first four PCUs may be dedicated to the first CS, the next two to the second one and each of the last two to the third and fourth. The mapping between the PCUs and the Clip Stages is currently not configurable at run time, but it can be changed by modifying the SV code. The way the Clip Stages are mapped to the rows, and therefore to the PCUs, has however been designed in such a way that it can easily be made configurable in a possible future improvement (see Section 4.1).

When a CS notifies the PM of a detected peak, the PM checks whether an available PCU exists for that particular CS by scanning the busy bit field of the rows (the complete condition for a peak to be accepted is that, among the rows associated with the CS the notification comes from, at least one row has the busy bit at zero). If it does, the peak scale and the peak phase arriving from the CS are stored in the table at the corresponding row, the displacement is sent to another part of the PM that will be discussed shortly, and the busy bit is asserted; otherwise the peak is ignored and leaks from the present Clip Stage. The busy bit is deasserted and the pk_scale and pk_pha fields are reset at the end of the generation of the corresponding cancelling pulse, thus making the PCU available for the generation of another cancelling pulse.
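The acceptance rule just described can be illustrated with a small software model of the PCU table. This is a sketch only: the names PcuRow, make_table, accept_peak and release are hypothetical, and the scan order (lowest row index first) is an assumption, since the text only requires that at least one free row exists for the notifying CS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PcuRow:
    cs: int             # Clip Stage this PCU is statically mapped to
    busy: bool = False
    pk_scale: int = 0
    pk_pha: int = 0


def make_table(pcus_per_cs: List[int]) -> List[PcuRow]:
    """Build the table from a mapping such as [4, 2, 1, 1]."""
    return [PcuRow(cs) for cs, count in enumerate(pcus_per_cs) for _ in range(count)]


def accept_peak(table: List[PcuRow], cs: int,
                pk_scale: int, pk_pha: int) -> Optional[int]:
    """Store the peak in the first free row mapped to `cs`; return the row index."""
    for index, row in enumerate(table):   # assumption: lowest index wins
        if row.cs == cs and not row.busy:
            row.busy, row.pk_scale, row.pk_pha = True, pk_scale, pk_pha
            return index
    return None                           # no free PCU: the peak leaks


def release(table: List[PcuRow], index: int) -> None:
    """End of the cancelling pulse: clear the row and free the PCU."""
    row = table[index]
    row.busy, row.pk_scale, row.pk_pha = False, 0, 0


if __name__ == "__main__":
    table = make_table([4, 2, 1, 1])
    print(accept_peak(table, cs=2, pk_scale=100, pk_pha=5))   # row mapped to CS 2
    print(accept_peak(table, cs=2, pk_scale=50, pk_pha=1))    # CS 2 has no free row left
```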

Whenever a detected peak is accepted and inserted into the table, the generation of the corresponding cancelling pulse starts. As we have seen in the general description of the algorithm, a set of operations has to be performed to do this, the first being the complex multiplication. The two operands are the (peak_scale, peak_phase) pair (characterizing the specific detected peak and coming from the PCU table) and each of the coefficients of the unscaled cancelling pulse (the same for all the peaks, coming from the data bus of the pulse memory). Some meta-data are generated, attached to the data and consumed along the processing chain of the cancelling pulses. See Figure 3.8 for a possible representation of the data and meta-data at the output of the complex multiplier.

7 Although the technology constraint for this design project imposes only four time slots, Table 3.1 shows eight PCUs available. This is not for explanatory reasons only: as will be shown further on, the limit of num_ts_c time slots can be bypassed with some hardware redundancy.

PCU n.   busy   CS   pk_scale   pk_pha
0        0      0    0          0
1        1      0    2345       221
2        1      0    1632       -110
3        1      0    995        1298
4        0      1    0          0
5        1      1    4632       -1121
6        0      2    0          0
7        1      3    832        -450

Table 3.1: An example of a possible PCU table at a certain instant. In this example we have 8 PCUs mapped on 4 CSs in this order: 4, 2, 1, 1. The choice of mapping more resources to earlier CSs is generally considered a convenient rule of thumb.

[Figure 3.8 (data word layout): Magnitude component | Phase component | V | D | CS, where V is the valid bit, D is the direction bit and CS is the destination Clip Stage.]

Figure 3.8: For each unit of data some meta-data are associated. These meta-data are consumed at various points during the elaboration chain.

The valid bit is a flag used to notify the following stages that the corresponding data should actually be processed. The direction bit identifies the data as being part of the first or the second half of the cancelling pulse, in order to take advantage of the symmetry of the pulse itself (see Section 3.2.2). The information about the Clip Stage is sent together with the data because, when the data reach the adder and dispatcher, it will be used to send them to the correct destination CS. Figure 3.9 shows the detailed steps; the design choices that have been made are described in the following, together with the rationale behind them, for each of the two data paths providing the operands of the first step in the chain, the complex multiplier.

The PCU table datapath

At the faster clock rate, the rows of the PCU table are read continuously, starting from the first, reaching the last and starting from the first again; in doing so, the values of pk_scale and pk_pha are sent to the complex

[Figure 3.9 (block diagram): address generators 0 to 3 drive, through a MUX, the cancelling pulse memory, which provides the coefficient ρC·e^(jϑC) in mag/phase form to the complex multiplier (CMUL); the PCU table (fields: busy bit, Clip Stage, peak scale, peak phase; one row per time slot 0 to 3, filled by the peak detection notifications and characteristics coming from the Clip Stages) provides, through a second MUX, the operand ρP·e^(jϑP); the product ρC·ρP·e^(j(ϑC + ϑP)) is converted to the rectangular form a + jb by the CORDIC and accumulated by the adder (Σ), whose outputs go to CS0, CS1, CS2 and CS3. Example table rows (busy bit, Clip Stage, peak scale, peak phase): (1, 0, 241, -25), (1, 1, 125, 119), (0, 2, 0, 0), (1, 3, 78, -66).]

Figure 3.9: The schematic shows the most important parts constituting the Peak Manager. On the left, the unscaled cancelling pulse datapath is composed of the address generators and the pulse memory, and in the bottom part the PCU table is filled with some example values. The rows and address generators that have the same colors are matched.

multiplier regardless of whether a peak is actually stored and being cancelled at a specific row of the table (as can be seen from Table 3.1, the rows associated with an idle PCU hold 0 for both the peak scale and phase anyway, so even if they were processed they would give a zero result). Each row reading corresponds to a time slot, and the reading of the entire table corresponds to the period of the slower clock. At this point, together with the peak scale and phase, the CS information is also sent, to be used later by the adder.

The unscaled cancelling pulse datapath

The second operand of the complex multiplier comes from the pulse memory, which is shared among all the PCUs. As discussed, in each time slot a different row of the table is read and the peak data are sent to the multiplier, so analogously the correct coefficient must be fetched from the memory and sent to it as well.

The purpose of the address generators is firstly to generate the addresses for the cancelling pulse memory, and secondly to manage the valid and direction bits. The address generators are implemented as programmable up-down counters with an enable input. The cancelling pulse coefficients are stored sequentially in the pulse memory, so the counting outputs of the address generators are connected directly



to the address bus of the memory via a multiplexer that alternates the addresses coming from the various counters in a time-shared fashion. The PM contains as many address generators as PCUs, which in turn is the same as the number of time slots available inside a sample rate period. The PCU table rows and the address generators are matched, in the sense that they always work in pairs (i.e. the first row with the first address generator and so on). Because the various simultaneous cancelling pulses must be generated independently, the address generators must work independently as well: each one must keep track of the address of its specific coefficient inside the shared pulse memory independently of the other pulses, resuming the count from the point where it stopped at its previous time slot. Therefore a single address generator cannot be shared among the various time slots; this is why a number of address generators equal to the number of PCUs (stored in the parameter num_PCU_c) has necessarily been instantiated in the code.

The insertion of a new peak into the PCU table, when an available CS/busy bit combination is found, is immediately followed by the programming of the corresponding address generator with the following information: the displacement coming from the Clip Stage (see 3.2.1), a start address and a finish address. The start address can be 0 in case of complete cancelling pulse generation but, as will be discussed further on, it could also be a larger number (i.e. the cancelling pulse is not generated completely but only a portion of it, still symmetric around the central element). In either case, the address follows the same progression: it increases up to the value of 511 (the central element of the cancelling pulse, responsible for the peak cancellation), then it decreases down to the start value (see footnote 8). As soon as this information is made available, an enable signal triggers the counting, at the end of which the address generator asserts a pulse that notifies the PM of the end of the cancelling pulse. This in turn clears the corresponding busy bit in the appropriate PCU table row and makes the PCU available to process another peak.

At every rising edge of the faster clock, the system moves to a different time slot. The effect on the PCU table is that the information of one of the rows is sent to the multiplier and, on the matched address generator only, the address driving the address bus of the pulse memory evolves (increases or decreases). All the other address generators, not involved in the present time slot, hold their state waiting for their turn to change the count. The displacement information about the detected peak (see 3.2.1) is used by the address generators to delay the start of the counting, and thus the generation of the addresses towards the pulse memory, by this amount; in this way, the displacement of the peak sample inside the PSW is taken care of and the central element of the cancelling pulse is aligned with the position of the peak. More precisely, during the initial

8 This addressing scheme is justified by the symmetry properties of the cancelling pulse, which are illustrated later in this report.

[Figure 3.10 (two plots): magnitude (vertical axis from 0 to 1) and phase (vertical axis from -4000 to 4000) of the unscaled cancelling pulse versus coefficient number (350 to 650); the central element has magnitude = 1 and phase = 0.]

Figure 3.10: Magnitude and phase of the unscaled cancelling pulse (that is, before the multiplication with the peak characteristics). For clarity of representation, only the central part is represented, but the symmetry of the former and the anti-symmetry of the latter can be seen.

delay phase, the address generator is loaded with the value of the displacement and commanded to count down; when zero is reached, the proper address generation starts. The reason behind the need for the valid bit should now be clearer: during the down-counting required to compensate for the displacement, addresses to the pulse memory are generated anyway, but the coefficients that consequently appear on the data bus of the memory must not be taken into consideration for the computation of any cancelling pulse. By associating a zero valid bit with these data, they are excluded from the computation of the cancelling pulse.
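The sequence produced by one address generator, including the initial delay phase, can be sketched as a Python generator. The function name and the tuple layout are hypothetical; the sketch assumes that one item is produced each time the generator owns its time slot, and that during the delay phase the counter value itself appears on the address bus (the text only states that addresses are generated anyway and marked as not valid).

```python
from typing import Iterator, Tuple

CENTER = 511  # address of the central coefficient, as stated in the text


def address_generator(start: int, displacement: int,
                      center: int = CENTER) -> Iterator[Tuple[int, int, int]]:
    """Yield (address, valid, direction) once per owned time slot."""
    # Delay phase: count the displacement down to zero. The addresses are
    # meaningless here, so the valid bit is 0 (assumption: counter value on the bus).
    for remaining in range(displacement, 0, -1):
        yield remaining, 0, 0
    # Ascending half, central element included: direction bit 0.
    for address in range(start, center + 1):
        yield address, 1, 0
    # Descending half, back to the start address: direction bit 1.
    for address in range(center - 1, start - 1, -1):
        yield address, 1, 1
    # Exhaustion of the generator plays the role of the end pulse sent to the PM.


if __name__ == "__main__":
    items = list(address_generator(start=509, displacement=2))
    for address, valid, direction in items:
        print(address, valid, direction)
```

With start = 0 the generator produces 2 * 511 + 1 = 1023 valid addresses, that is the complete pulse; a larger start value gives a shorter pulse that is still centred on address 511.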

A notable saving of memory area has been made possible by the symmetry properties of the unscaled cancelling pulse (see Figure 3.10): the magnitude of the pulse is symmetric with respect to the central element, whereas the phase is anti-symmetric, again with respect to the central element. This simple property makes it possible to store only half of the coefficients, thus implementing a smaller memory (see footnote 9).

9The memory area is an important component of the overall area footprint of an ASIC.

[Figure 3.11 (two plots): magnitude (vertical axis from 0 to 1) and phase (vertical axis from -4000 to 4000) of the stored coefficients versus coefficient number (50 to 500); direction bit = 0 while the address generators count up, direction bit = 1 while they count down.]

Figure 3.11: The portion of the coefficients that is actually stored in the pulse memory. During the increasing and the decreasing part of the address generation, the direction bit is 0 and 1 respectively, in order to "fix" the anti-symmetry of the phase for the complex multiplication.

This is the reason why the address generators are up-down counters: when the central address 511 has been reached (out of a total of 1023 elements), the counting direction is inverted. From the point of view of the memory data bus, a complete symmetric cancelling pulse appears, coefficient after coefficient, with the only difference that during the descending part the phase read from the memory has the opposite sign with respect to the true pulse. This is the rationale behind the direction bit: in order to exploit the anti-symmetry of the phase of the cancelling pulse, the phases read during the descending part of the address generation must be inverted in sign. The direction bit is sent together with the data, with the value 0 or 1, to mark whether the phase has to be added or subtracted, respectively, by the multiplier (see Figure 3.11).
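The following sketch checks, on a made-up pulse, that storing only the first 512 coefficients and inverting the phase sign during the descending part rebuilds the complete 1023-element pulse. The pulse shape returned by example_pulse is purely hypothetical (it only has the required symmetric magnitude and anti-symmetric phase), and for simplicity the sign inversion is applied when the coefficient is read, whereas in the design it is applied by the multiplier through the direction bit; the result is the same.

```python
N_FULL = 1023
CENTER = N_FULL // 2   # 511, the central element


def example_pulse(n: int):
    """Hypothetical unscaled pulse: symmetric magnitude, anti-symmetric phase."""
    k = n - CENTER
    magnitude = 1.0 / (1.0 + abs(k) / 64.0)
    phase = 0.01 * k
    return magnitude, phase


# Complete pulse, used here only as the reference.
full = [example_pulse(n) for n in range(N_FULL)]

# Only the first half, central element included, is stored: 512 entries.
half_memory = full[:CENTER + 1]


def read_coefficient(address: int, direction: int):
    """Read the half memory; direction = 1 inverts the sign of the phase."""
    magnitude, phase = half_memory[address]
    return magnitude, (-phase if direction else phase)


# Up-counting with direction 0, then down-counting with direction 1.
rebuilt = [read_coefficient(a, 0) for a in range(CENTER + 1)]
rebuilt += [read_coefficient(a, 1) for a in range(CENTER - 1, -1, -1)]

if __name__ == "__main__":
    print(len(half_memory), len(rebuilt))
```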

Another opportunity for simplification emerges from the study of the various ways a complex multiplication can be performed. In rectangular form it consists of four real multiplications and two additions, whereas in polar form it only requires one real multiplication and one addition, with an evident saving of area (see Equations 3.1 and 3.2 for a comparison of the complex multiplication in rectangular and polar forms,



respectively). As hinted repeatedly, this second approach has been chosen: the coefficients of the cancelling pulse are stored in the memory in magnitude/phase form, whereas the peak scale and phase are already in the convenient form.

$$(a + ib) \cdot (c + id) = (ac - bd) + i(ad + bc) \tag{3.1}$$

$$\rho_A e^{i\theta_A} \cdot \rho_B e^{i\theta_B} = \rho_A \rho_B \, e^{i(\theta_A + \theta_B)} \tag{3.2}$$

The multiplier basically performs Equation 3.3 using the valid and direction bits as follows: if the valid bit is 0, no multiplication is performed (the clock is gated in order to save power); otherwise the multiplication is executed with the phases added together if the direction bit is 0, subtracted otherwise (that is, the phase coming from the cancelling pulse branch is inverted in sign). The CS information of each data item passes through untouched.

$$c[n] = \begin{cases} \rho_P \, \rho[n] \cdot e^{i(\theta_P + \theta[n])} & \text{if direction bit} = 0 \\ \rho_P \, \rho[n] \cdot e^{i(\theta_P - \theta[n])} & \text{otherwise} \end{cases} \tag{3.3}$$
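As a numerical check of Equation 3.3, the polar multiplication with valid and direction bits can be modelled in a few lines and compared against the rectangular product of Equation 3.1. The function name polar_multiply is hypothetical, floating point values replace the fixed point data of the hardware, and returning None stands in for the gated clock when the valid bit is 0.

```python
import cmath


def polar_multiply(pk_scale: float, pk_pha: float,
                   coeff_mag: float, coeff_pha: float,
                   valid: int, direction: int):
    """Equation 3.3: one real multiplication and one addition (or subtraction)."""
    if not valid:
        return None                      # nothing is computed for invalid data
    if direction:
        phase = pk_pha - coeff_pha       # descending half: phase subtracted
    else:
        phase = pk_pha + coeff_pha       # ascending half: phase added
    return pk_scale * coeff_mag, phase


if __name__ == "__main__":
    mag, pha = polar_multiply(2.0, 0.3, 0.5, 0.4, valid=1, direction=0)
    polar_result = cmath.rect(mag, pha)
    # Reference: the same product computed in rectangular form (Equation 3.1).
    reference = cmath.rect(2.0, 0.3) * cmath.rect(0.5, 0.4)
    print(abs(polar_result - reference) < 1e-12)
```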

The next step in the creation of the cancelling pulse is the conversion of the data from polar to rectangular form. To do this, the CORDIC algorithm has been used again, this time in its 11-stage pipelined form, given that the whole computation chain of the cancelling pulses works at the higher clock frequency.

The final stage of the processing chain for the generation of the cancelling pulses, and the last part working in time-division multiplexing mode, is the adder/dispatcher. The complex data, now in rectangular form, are easily added together on a per-Clip-Stage basis. The meta-data that play a role here are the valid bit and the CS information: the adder adds together all the data that have the valid bit set and belong to the same CS. At the end of each period of the slower clock, all the results are dispatched to the respective Clip Stages.
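The accumulation rule of the adder/dispatcher can be summarised by the sketch below, which processes the data of one slower clock period. The function name and the (data, valid, cs) tuple layout are hypothetical, chosen only to mirror the meta-data named in the text.

```python
from typing import Iterable, List, Tuple


def add_and_dispatch(slots: Iterable[Tuple[complex, int, int]],
                     num_cs: int) -> List[complex]:
    """Sum, per Clip Stage, the valid data produced during one slow clock period.

    Each item is (data, valid, cs): rectangular data, valid bit, destination CS.
    The returned list holds the value dispatched to each Clip Stage.
    """
    totals = [0j] * num_cs
    for data, valid, cs in slots:
        if valid:                 # data with valid = 0 are simply skipped
            totals[cs] += data
    return totals


if __name__ == "__main__":
    period = [(1 + 1j, 1, 0), (2 - 1j, 1, 0), (5 + 5j, 0, 1), (0.5j, 1, 3)]
    print(add_and_dispatch(period, num_cs=4))
```

In this example the two valid items addressed to CS 0 are summed, the invalid item addressed to CS 1 is discarded, and CS 3 receives its single contribution unchanged.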

Implemented improvements

The duration of the cancelling pulses is the principal obstacle preventing the PC-CFR algorithm from processing a higher number of peaks, because the longer the pulse, the longer the corresponding PCU stays busy. On the other hand, we have discussed in Section 3.1 that shortening the cancelling pulses might have harmful effects on the OOB distortion, but this observation does not take into consideration the magnitude (basically the associated energy) of the cancelling pulses. It is reasonable to think that, for the cancellation of a very small peak, the effect of the cancelling pulse on the energy of the signal is smaller than that of the cancelling pulses of bigger peaks. This is because the cancelling pulse is the product of the unscaled cancelling pulse and the detected peak scale; therefore cancelling pulses associated with smaller peaks are smaller in magnitude and in energy as well. It is therefore conceivable to use shorter cancelling pulses to cancel smaller peaks,

Height        max   90+% of max   50+% of max   20+% of max   else
Start addr.   0     75            182           293           402

Table 3.2: The table maps the height of the detected peaks to the start address in the pulse memory.

because the side lobes of the cancelling pulse, which are small anyway, will be even more negligible after the multiplication with a small peak scale. The advantage of adopting this strategy is that the PCUs associated with the smaller peaks are kept busy for less time, thus increasing the overall availability of the PCUs for new peaks.

With this in mind from the earliest design phases of the project, the address generators have been designed so that this strategy could be easily implemented. Indeed, the generation of shorter pulses requires nothing more than programming the start address of the address generator not with 0 (the very first coefficient of the complete cancelling pulse) but with a greater value, corresponding to a later element, which will also be the final address when the up/down counting completes its descending part. In this way the central element of the pulse, corresponding to the actual peak to cancel, is still present and is still the central element of the shorter pulse, and the address generator sends the end pulse to the PM earlier, thus freeing the corresponding PCU sooner. In order to further lower the risk of OOB distortion, several peak scale/pulse length pairs are provided to the PC-CFR module via the i_thr_lvs input port (see Table 3.2 for a possible mapping, corresponding to that of Figure 3.12). In the present implementation, the choice of the mapping between peak magnitude and pulse length is left to the software, in the sense that it is held in a configuration register. Figure 3.12 shows some possible examples of reduced length cancelling pulses.
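The selection of the start address from the peak height, and the resulting pulse length, can be sketched as follows. The start addresses are those of Table 3.2, but the reading of the column headers as thresholds on the ratio between the peak scale and a maximum expected scale is an assumption of this sketch, as are the names LEVELS, start_address and pulse_length; the actual format of the i_thr_lvs port is not described here.

```python
CENTER = 511  # address of the central coefficient

# (minimum fraction of the maximum peak scale, start address), from Table 3.2.
# Assumed interpretation: "max" means the peak reaches the maximum expected scale.
LEVELS = [(1.0, 0), (0.9, 75), (0.5, 182), (0.2, 293)]
DEFAULT_START = 402   # the "else" column


def start_address(pk_scale: float, max_scale: float) -> int:
    """Pick the start address for a peak of the given scale."""
    ratio = pk_scale / max_scale
    for fraction, start in LEVELS:
        if ratio >= fraction:
            return start
    return DEFAULT_START


def pulse_length(start: int) -> int:
    """Coefficients generated when counting start -> 511 -> start."""
    return 2 * (CENTER - start) + 1


if __name__ == "__main__":
    for scale in (100, 95, 60, 25, 5):
        start = start_address(scale, max_scale=100)
        print(scale, start, pulse_length(start))
```

Under these assumptions, the smallest peaks occupy a PCU for 219 time slots instead of the 1023 needed by a complete pulse.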

Another opportunity for improvement comes from the observation that, with the constraint imposed on the ratio between the faster and the slower clocks, only num_ts_c = 4 time slots are available (which, as is known by now, provide as many PCUs and therefore as many simultaneous cancelling pulses). In order to improve this aspect, some hardware redundancy had to be introduced, thus using some more area on the ASIC (see Figure 3.13). The number of rows in the PCU table has been increased to 8, and so has the number of address generators. At every time slot, two paths now operate in parallel: two rows are read from the PCU table (the first and the fifth, and so on) and the two matched address generators are enabled in parallel, being connected to two independent pulse memories storing the same coefficients. Basically, the hardware resources in the Peak Manager have been doubled. This approach is parameterizable by changing some constants in the pc_cfr_pk.sv package file in order to increase the number of PCUs even further at the expense of more hardware, so it is possible to have any multiple of num_ts_c PCUs in the PC-CFR module, up to the point at which the hardware complexity of the module is no longer justified by the expected density of peaks to cancel.
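The pairing of rows read in the same time slot ("the first and the fifth, and so on") follows a simple index rule, sketched below for a generic number of parallel branches. The function name rows_for_slot and the parameter num_branches are hypothetical; only num_ts corresponds to a parameter named in the text (num_ts_c).

```python
from typing import List


def rows_for_slot(time_slot: int, num_ts: int = 4, num_branches: int = 2) -> List[int]:
    """PCU table rows (and matched address generators) served in one time slot.

    Branch b handles rows b * num_ts ... b * num_ts + num_ts - 1, so in time
    slot t the rows t, t + num_ts, t + 2 * num_ts, ... are read in parallel.
    """
    return [time_slot + branch * num_ts for branch in range(num_branches)]


if __name__ == "__main__":
    for slot in range(4):
        print(slot, rows_for_slot(slot))
```

With the default values the total number of PCUs is num_branches * num_ts = 8, as in the implemented configuration.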

[Figure 3.12 (four plots, A to D): magnitude (vertical axis from 0 to 1) versus coefficient number of progressively shorter cancelling pulses; the horizontal tick labels run from 200 to 800 (A and B), from 300 to 700 (C) and from 450 to 600 (D).]

Figure 3.12: Several possible choices for incomplete cancelling pulses, showing magnitudes only. Note that they are progressively shorter going from A to D, corresponding to peaks from the highest to the smallest respectively. The start and finish points of the shortened pulses are chosen at zero crossings or at very small values of the magnitude in order to minimize the discontinuities and the OOB emissions.

[Figure 3.13 (block diagram): the PCU table has eight rows; address generators 0 to 3 are connected to cancelling pulse memory 0, followed by CMUL0, CORDIC0 and Σ0, while address generators 4 to 7 are connected to cancelling pulse memory 1, followed by CMUL1, CORDIC1 and Σ1; each branch serves time slots 0 to 3, and the results are sent to CS0, CS1, CS2 and CS3.]

Figure 3.13: Still four time slots are available, but this time up to two cancelling pulses can be generated at each time slot by the two branches working in parallel, for a maximum of eight simultaneous cancelling pulses (the idea can be extended to any number of branches N, yielding N*4 cancelling pulses).

Chapter 4

Future work and suggested improvements

4.1 Programmable or dynamic CS–PCU mapping

The mapping between the Clip Stages and the PCUs in the PCU table of the Peak Manager (stored in the CS field) is, in the presented implementation, static, although it can be changed by modifying the SV code. This lack of flexibility could easily be overcome by having the mapping read from a configuration register. So, for example, an initial configuration such as 4, 2, 1, 1 (4 PCUs for the first CS, 2 for the second and so on) might change at some point at run time to, for example, 2, 2, 2, 2 (see Figure 4.1). The usefulness of such a change is related to the density of the detected peaks at the various Clip Stages, to the probability of peak regrowth, and to the need or decision to reduce the peaks progressively instead of trying to cancel them in a single pass (see 3.1). The reconfiguration could be controlled by software or performed automatically, based on the evaluation of the statistics collected by the various Clip Stages.
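A possible way to derive the CS field from such a configuration register, and to check whether a new configuration can be applied safely, is sketched below. Both function names are hypothetical, and the safety rule used here (a row may change its Clip Stage only while its busy bit is 0) is an assumption of this sketch, not a feature of the presented implementation.

```python
from typing import List


def cs_field(pcus_per_cs: List[int]) -> List[int]:
    """Expand a configuration such as [4, 2, 1, 1] into the CS field of each row."""
    return [cs for cs, count in enumerate(pcus_per_cs) for _ in range(count)]


def can_reconfigure(old_counts: List[int], new_counts: List[int],
                    busy: List[bool]) -> bool:
    """True if every row that would change its Clip Stage is currently idle."""
    old, new = cs_field(old_counts), cs_field(new_counts)
    if len(old) != len(new):          # the number of PCUs is fixed in hardware
        return False
    return all(o == n or not b for o, n, b in zip(old, new, busy))


if __name__ == "__main__":
    print(cs_field([4, 2, 1, 1]))
    print(cs_field([2, 2, 2, 2]))
    print(can_reconfigure([4, 2, 1, 1], [2, 2, 2, 2], busy=[False] * 8))
```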

4.2 Bypassable PC-CFR module

The delay introduced by every module in a communication chain is of course an undesirable but unavoidable characteristic. As has been shown, the PC-CFR is no exception to this rule: every Clip Stage of the PC-CFR introduces a significant latency whose main component is the group delay of the cancelling pulse. The delay is present even when no peaks are detected, and the periods without peaks are (hopefully) significantly longer than those with some peak activity. A possible improvement to the basic PC-CFR module that could partially mitigate this problem is the insertion of an observer stage before the PC-CFR, with the task of inserting or bypassing the module, totally or partially, according to the presence of peak activity. The detection of such peak activity is nothing more than the task performed by the already discussed Peak Detector present inside every Clip

[Figure 4.1 (diagram): PCU0 to PCU7 assigned to CS0, CS1, CS2 and CS3 in the initial arrangement, and the same eight PCUs redistributed among CS0, CS1, CS2 and CS3 after automatic reconfiguration or reprogramming.]

Figure 4.1: A mapping between PCUs and Clip Stages at a certain instant is reconfigured in order to redistribute the PCUs among the Clip Stages, according for example to some changes in the input signal.

[Figure 4.2 (block diagram): PD0 observes the I/Q input and drives both the turn on-off control of the PC-CFR module and the select signal that inserts the module into, or removes it from, the I/Q signal path.]

Figure 4.2: A possible insertion of a mechanism to bypass the PC-CFR module if a period of absence of peaks is detected, in order to reduce the delay on the signal path and save power. PD0 is the Peak Detector of the first Clip Stage (CS0).

Stage (preceded by a CORDIC, of course, which will not be mentioned again); therefore the Peak Detector in the first CS (PD0 in Figure 4.2) could be used as the observer of the input signal. During the periods in which the PC-CFR is bypassed, the module (with the exclusion of PD0, which must stay active in order to be able to resume the operation) could also be clock-gated in order to save power. On the other hand, it should be noted that switching between configurations with different delays might not be tolerated well by some of the more recent communication protocols; therefore more research is definitely needed in order to evaluate the feasibility or convenience of this feature.


4.3 Clip Stages with different delay memories and cancelling pulses length

Another opportunity to reduce the delay of the PC-CFR comes from the analysis of the components of that delay. As discussed, the greatest part of it is due to the generation of the cancelling pulse itself, and this group delay is directly proportional to the number of coefficients constituting the pulse. In general, later Clip Stages of a well balanced CFR are expected to deal with smaller peaks, because of the effect already applied by the previous Clip Stages. As a consequence, smaller cancelling pulses will be generated, and they can also be made shorter (see the observations behind the choice of implementing variable length cancelling pulses in 3.2.2) in order to reduce the overall delay of the PC-CFR module. In 3.2.2 the motivation is not the reduction of the delay but the increased availability of the PCUs due to the shorter occupancy of the HW resources: in the implemented solution the delay of a Clip Stage is always the maximum delay needed by the longest (complete) cancelling pulse, corresponding to the highest peak that the module is expected to receive, even when shorter pulses are generated. The reason is that the delay memory is statically configured and cannot change its delay as a function of the pulses (consider that the same Clip Stage can generate several cancelling pulses of different lengths at the same time, so it could not provide different delays at the same time in any case). In the case proposed here, instead, successive CSs might feature shorter delay memories because they implement shorter maximum cancelling pulses. They therefore introduce less net delay, but also greater OOB emissions because, as stated repeatedly, the shorter the cancelling pulse, the broader the flanks of its Fourier spectrum, which will consequently match the spectrum of the signal more poorly. Eventually, a final FIR filter will take care of these emissions (see Figure 4.3 for two comparative scenarios). Of course, several combinations of shorter and normal length pulses may be explored in order to find an optimal configuration.
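The delay budget of the two scenarios of Figure 4.3 can be checked with a few lines of MATLAB. This is only an illustrative sketch: the value of N and the helper stage_delay are assumptions introduced here, and each stage is assumed to delay the signal by half the length of its longest cancelling pulse, as the figure suggests.

```matlab
% Sketch: cumulative delay of a chain of Clip Stages (illustrative values).
N = 512;                          % assumed length of the full cancelling pulse
stage_delay = @(len) len / 2;     % assumption: a stage delays by half its pulse length

base_lens = [N, N, N, N];         % four stages, all using the full pulse
alt_lens  = [N, N/2, N/2, N/2];   % full pulse first, half-length pulses afterwards
fir_delay = N / 2;                % group delay of the final FIR filter, as in Figure 4.3

base_delay = sum(stage_delay(base_lens));             % 4 * N/2 = 2N
alt_delay  = sum(stage_delay(alt_lens)) + fir_delay;  % N/2 + 3*N/4 + N/2 = 7N/4

fprintf('base: %d samples, alternative: %d samples\n', base_delay, alt_delay);
```

Changing alt_lens is a quick way to compare other mixes of short and full length pulses before committing to a hardware configuration.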

[Figure 4.3 diagram. Base configuration: CS0, CS1, CS2, CS3 with delays N/2, N/2, N/2, N/2; each stage uses the full pulse => total delay ≈ 2N. Alternative configuration: CS0, CS1, CS2, CS3, FIR with delays N/2, N/4, N/4, N/4, N/2; one full pulse, the others use half width + FIR => total delay ≈ 7/4 N.]

Figure 4.3: In the base configuration, four Clip Stages use a full length cancelling pulse of N elements, yielding a total delay of approximately 2N time units. The alternative configuration uses the full cancelling pulse only for the first stage and half length pulses for the successive stages, so that the overall delay (delay of the CSs plus the group delay of the FIR filter) is only 7/4 N. A final FIR filter can be inserted to reduce the emissions introduced by the shorter cancelling pulses; its delay has been taken into consideration.

4.4 Truncation of cancelling pulses

When two or more closely spaced peaks are being cancelled by their respective pulses, the pulses might overlap considerably, because the length of the cancelling pulses is usually much greater than the distance that might be observed between successive peaks, especially when short PSWs are used. The consequence is that the peaks and the neighboring input samples are lowered more than required (recall that ideally the peaks should settle exactly at the threshold value after the cancellation), thus introducing considerable in-band distortion. In Figure 4.4, two closely spaced peaks are detected and cancelled (PSW1 and PSW2 are the two corresponding, contiguous Peak Search Windows). A possible expedient to mitigate this problem could be the truncation of the already in-progress cancelling pulse somewhere close to the middle point between two successive peaks and the immediate start of the generation of the partial pulse for the next one (not starting from the beginning, but from a proper point). Particular care should be taken in choosing the "breaking" point between the cancelling pulses, because this discontinuity will translate into OOB emissions. This procedure also provides the benefit of using a single PCU for two cancelling pulses, whereas the careful management of the PCU table, and especially of the address generators, is the more involved part.
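The effect of the overlap, and of truncating at the midpoint, can be illustrated with a toy example. This is a sketch under stated assumptions, not the implemented design: the pulse is a real raised cosine used as a stand-in for the actual complex cancelling pulse, and the peak positions and amplitudes are made up.

```matlab
% Sketch: overlap of two cancelling pulses versus truncation at the midpoint.
L  = 64;                              % assumed half-length of the pulse
k  = -L:L;
p  = 0.5 * (1 + cos(pi * k / L));     % stand-in pulse (raised cosine), p = 1 at k = 0
n  = 1:400;                           % sample indexes of the toy interval
n1 = 180;  n2 = 200;                  % two closely spaced peaks (hypothetical)
a1 = 0.30; a2 = 0.25;                 % excess of each peak over the threshold

full1 = zeros(size(n));  full1(n1 + k) = a1 * p;   % complete pulse for peak 1
full2 = zeros(size(n));  full2(n2 + k) = a2 * p;   % complete pulse for peak 2
overlapped = full1 + full2;           % both pulses generated completely

mid = floor((n1 + n2) / 2);           % "breaking" point between the peaks
truncated = full1 .* (n <= mid) + full2 .* (n > mid);

fprintf('removed at peak 1: %.3f with overlap, %.3f with truncation (needed: %.3f)\n', ...
        overlapped(n1), truncated(n1), a1);
```

With overlap, peak 1 is lowered by more than a1 because the tail of the second pulse adds to the first one; with truncation each peak receives exactly its own amount, at the price of a discontinuity at mid.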

[Figure 4.4 plot: "Original signal (red) vs peak-cancelled signal (black)". x-axis: sample index, about 450 to 650; y-axis: magnitude, about 1.7 to 2.7 ×10^4. Annotations: Peak 1, Peak 2, Threshold, PSW1, PSW2.]

Figure 4.4: This interval shows the input (red) and the output (black) signals of the peak cancellation algorithm. PSW1 and PSW2 are the two (contiguous) Peak Search Windows associated with the two closely spaced detected peaks, so the respective cancelling pulses overlap for a considerable number of samples. The notable effect is the excessive reduction of the power of the signal in the affected interval.

4.5 Variable length Peak Search Window

Sometimes the peaks arrive in bursts. If these peaks are tackled individually, as in the implemented design, the pool of available PCUs will soon be depleted, with the consequence of a possible peak leak. If this case is detected, it might be convenient to let the search window length increase (up to a certain maximum) so that it can embrace a larger number of input samples and detect and cancel a single peak (or fewer peaks, anyway) among them instead of isolating more peaks and cancelling them individually (see Figure 4.5). It is useful to remember that the Peak Cancellation algorithm exactly cancels only the peak sample, not its neighboring elements; therefore, by enlarging the PSW too much and thus cancelling fewer peaks (given the same number of input elements for comparison), the resulting cancellation will be less accurate and will eventually require more passes through the algorithm. As for the CS/PCU mapping already discussed, the configuration of the PSW of the various stages may be software programmable, or it can be given some degree of automatism according to the observation of the peak occurrence at the first Clip Stage.
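The trade-off can be reproduced on a toy magnitude trace. The detector below is a simplified stand-in, assumed here for illustration only: a window of psw samples opens at the first sample over the threshold, the largest sample inside it is reported as the peak, and the search resumes after the window. The trace and the threshold are made-up values.

```matlab
% Sketch: number of peaks reported for two Peak Search Window lengths.
mag = ones(1, 128);                            % toy magnitudes, below the threshold
mag(10:10:70) = [2.5 3.0 2.7 3.4 2.6 3.1 2.8]; % a burst of samples over the threshold
thr  = 2;                                      % hypothetical threshold
lens = [16 32];                                % PSW lengths to compare
n_peaks = zeros(size(lens));

for c = 1:numel(lens)
    psw = lens(c);
    peaks = [];
    i = 1;
    while i <= numel(mag)
        if mag(i) > thr
            last = min(i + psw - 1, numel(mag));  % window opened by this sample
            [~, off] = max(mag(i:last));          % largest sample inside the window
            peaks(end + 1) = i + off - 1;         %#ok<AGROW>
            i = last + 1;                         % resume after the window
        else
            i = i + 1;
        end
    end
    n_peaks(c) = numel(peaks);
    fprintf('PSW length = %d => %d peaks at [%s]\n', psw, n_peaks(c), num2str(peaks));
end
```

The longer window halves the number of PCUs needed for this burst, but several samples over the threshold are then left to the tails of a single cancelling pulse, which is the loss of accuracy described above.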

[Figure 4.5 plots: two panels of the same input interval (x-axis: sample index, 480 to 600; y-axis: magnitude, 2 to 2.6 ×10^4). Left panel, "PSW length = 16 => 4 peaks detected": Peak 1, Peak 2, Peak 3, Peak 4. Right panel, "PSW length = 32 => 2 peaks detected": Peak 1, Peak 2.]

Figure 4.5: On the left, a Peak Search Window of 16 elements is configured. As a consequence, four peaks are detected and as many PCUs are used. On the right, the same input interval is passed through a PSW of 32 elements, and only two peaks are detected. In this case, half as many PCUs are enough compared to the first scenario.

4.6 Priority-based acceptance of peaks

In a further effort to optimize the utilization of the PCUs, it could be convenient to interrupt the generation of a cancelling pulse, according to some criteria, if all the PCUs are busy and a new peak is detected. It is reasonable to accept for now that higher peaks are more "dangerous" than smaller ones, in the sense that if they leaked and reached the Power Amplifier, they would cause a larger non-linear distortion than the smaller ones. But when peaks are detected by the Peak Detector of the Clip Stages, they are given the same importance: if PCUs are available, they are all accepted and cancelled completely (with longer or shorter cancelling pulses according to their magnitude, as discussed in 3.2.2). It is reasonable, instead, to give priority to higher peaks over smaller ones. The difficulty here is that the presence and height of a peak are unknown until it is detected, and at that point it is too late to start cancelling it if all the PCUs are already busy. So, if a very high peak is detected just after the last PCU is assigned to a much smaller one, it will leak and eventually reach the PA. The proposed improvement is instead based on a priority attributed to the newly detected peaks in relation to the "importance" of the cancelling pulses that are being generated, all of this as a function of time. If the detected peak is "more important" than at least one of the cancelling pulses that are keeping the PCUs busy, the least "important" of those pulses could be interrupted to accommodate the newly arrived peak.

The condition could be the comparison between two numbers. On the new peak side, the height of the peak itself is the metric that, as highlighted, determines the severity of the consequences on the PA. On the cancelling pulse side, the priority could be defined as the product of the height of the peak being cancelled and a number that depends on the portion of the cancelling pulse being generated at the moment the new peak is detected. This can be justified by observing that interrupting a cancelling pulse at its central point entails more severe consequences in terms of OOB distortion than interrupting it over one of the tails; if the newly arrived peak is high enough, however, it might still be worth truncating a cancelling pulse even in the middle, so a product gives a measure that takes both aspects into consideration. In Figure 4.6, some "weights" for the various parts of the cancelling pulse are proposed; the fine tuning of these weights could be an object of further study. The priority so defined, the product of the height of the cancelled peak and these weights, is therefore a function of time. It could be stored as an additional field of the PCU table and updated every time the generation of the corresponding cancelling pulse passes from one of the regions of Figure 4.6 to another. Note that the weights have been chosen as powers of two in order to minimize the effort needed to compute the priorities (only a shift of the respective peak scale is needed).
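A minimal sketch of this comparison is given below. All the names and values are hypothetical: the pulse is assumed to be split into five regions with weights 1, 4, 8, 4, 1 (the ordering is an assumption about Figure 4.6), and the weights are stored as shift amounts so that the product reduces to a left shift.

```matlab
% Sketch: priority-based pre-emption of a busy PCU (all values hypothetical).
pcu_height  = [5200 3100 4000 2600];  % heights of the peaks being cancelled
pcu_region  = [3 1 2 5];              % region each pulse is currently in, 1..5 from tail to tail
weight_log2 = [0 2 3 2 0];            % assumed weights 1, 4, 8, 4, 1 as shift amounts

% priority = height * weight, computed as a left shift of the height
priority = bitshift(pcu_height, weight_log2(pcu_region));

new_peak = 6000;                      % height of the newly detected peak
[lowest, victim] = min(priority);     % least "important" running pulse
if new_peak > lowest
    fprintf('PCU %d is interrupted (priority %d < new peak %d)\n', victim, lowest, new_peak);
else
    fprintf('new peak %d leaks: every running pulse has a higher priority\n', new_peak);
end
```

In this example the pulse in its last region loses its PCU, while the pulse in its central region is protected by a priority eight times its peak height.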

[Figure 4.6 plot: a cancelling pulse (x-axis: coefficient index, 100 to 1000; y-axis: amplitude, -0.1 to 0.9) divided into regions, each labelled with a weight; the recoverable weight labels are 1, 4 and 8.]

Figure 4.6: A possible association between portions of the cancelling pulse being generated and weight/importance. Interrupting a cancelling pulse in mid-generation has more severe consequences than stopping it at a very early stage or when it is about to end anyway.

4.7 Generation of multiple cancelling pulses from the same time slot

The input signal passes through all the Clip Stages of the PC-CFR module. It is useful to think of each CS as monitoring the signal at a different instant of time; when two or more of them are in the PEAK_SEARCH state of their respective Peak Detectors, they will eventually notify the PM about the detected peaks. Now, if the displacements inside the corresponding search windows are the same, the generation of the cancelling pulses will use exactly the same addresses for all the peaks. The peak scales and phases will of course still be different for each detected peak, but the unscaled cancelling pulse path will output exactly the same coefficients for all of them at the same instants. Therefore, as depicted in Figure 4.7, a single time slot could be used in these cases for the generation of multiple pulses. Note that, in order to accommodate this solution, each row of the PCU table must hold the information about as many peaks as the number of Clip Stages present in the system, because in the best case all the Clip Stages will detect a peak at the same time. When the time slot relative to the multiple peaks arrives, the entire row is read so that all the peak information is sent to as many complex multipliers (so some hardware redundancy is needed, namely in the complex multipliers and the CORDICs), and only one address generator is used. So basically, up to num_CS_c cancelling pulses can be generated for every time slot.

The event this solution is based upon is the simultaneous detection of a peak by more than one Clip Stage in the same position of their respective search windows. This event might not happen with a high enough probability to justify the increase in hardware complexity and design effort, so the condition needed for the mechanism to trigger may be relaxed: even when the peaks are not detected with exactly the same displacement (in other words, not at the same relative distance from the beginning of the search window), but are still spaced within some programmable interval ("tolerance"), the system will still produce multiple cancelling pulses from a single PCU. Clearly, not all the peaks will then be cancelled exactly, because the central element of the cancelling pulse will be synchronized with one peak only, and the other detected peaks will receive a cancelling pulse displaced by a number of samples equal (in the worst case) to the length of the tolerance interval, which could be a few elements. This approximation can be tolerated if we observe that the characteristics of the input signal samples are similar for closely spaced elements; therefore the cancellation of a peak with a not perfectly aligned cancelling pulse may still provide a substantial benefit.
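The relaxed trigger condition can be sketched as follows. The displacements and the tolerance are made-up values, and taking the first Clip Stage as the reference for the alignment of the pulse is an assumption of this example, not a choice made in the design.

```matlab
% Sketch: which Clip Stages can share one time slot (hypothetical values).
displacement = [11 13 12 27];   % peak position inside the PSW, one entry per CS
tolerance    = 2;               % programmable tolerance, in samples

ref    = displacement(1);                       % the pulse is centred on this peak
shared = abs(displacement - ref) <= tolerance;  % stages served by the same slot
misalignment = displacement(shared) - ref;      % residual offset of each shared pulse

fprintf('Clip Stages sharing the slot: %s\n', num2str(find(shared) - 1));
fprintf('misalignment in samples:      %s\n', num2str(misalignment));
```

Here CS0, CS1 and CS2 share one time slot with a misalignment of at most two samples, while CS3 needs a slot of its own.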

[Figure 4.7 diagram. Left: the input signal as seen by CS0, CS1, CS2 and CS3, each with a detected peak. Right: the PCU table (four rows, with the labels Time slot 0 to Time slot 3), the address generators and the cancelling pulse; the peak information is sent to the CORDICs and Adders. PCU table contents:

busy  sca0  pha0  sca1  pha1  sca2  pha2  sca3  pha3
1     123   -45   229   78    93    12    312   -321
0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0
1     43    55    148   -654  345   -221  432   331
]

Figure 4.7: On the left of the Figure, the input signal as it appears to the Clip Stages at the moment of the peak detection. Note that the peaks are not perfectly aligned to the same relative position. All the information about the four detected peaks is sent to the generation of the cancelling pulses during a single time slot.

Chapter 5

Results and conclusions

5.1 Comparative synthesis results

Several configurations (i.e. different numbers of Clip Stages with different numbers of PCUs) of the designed PC-CFR have been successfully synthesized in order to explore how the logic area, the memory area and the gate count scale. The results are compared with a particular configuration of the Ericsson Turbo Clipping. In Figure 5.1, the much lower area occupancy of the PC-CFR for both the memory and the logic gate parts makes it an interesting alternative for the CFR module in future Ericsson ASICs.

Figure 5.1: Comparative results between one Turbo Clipping and three PC-CFR configurations.

The data shown in the table should be considered, though, only after some observations. The most complete configuration of the PC-CFR that has been taken into consideration (and synthesized) provides only 8 PCUs, which translates to only 8 simultaneous cancelling pulses. On the other hand, the SV code is parameterized so that it is possible to include as many PCUs as needed, in groups of num_ts_c elements at a time, of course at the price of increased gate count and memory area. The fact that the design has been successfully synthesized with 8 PCUs (that is, with 2 time-division shared PCUs) gives reasonable confidence that it could be synthesized with the same frequency constraint even with more PCUs, because such modules operate in parallel, so they do not contribute to the cumulative delay of the critical path of the design.¹ A reasonable total number of PCUs to be included in the PC-CFR in order to be comparable with the TC for some realistic and useful carrier configurations of the input signal would be 32, which means 8 time-division shared PCUs, or 8 HW structures like those in Figure 3.9.

Another important aspect that has been neglected so far is that the input and output data passing through the various Clip Stages are discrete-time, quantized signals originating from the sampling of analog signals, and the final output of the processing chain of which the PC-CFR is part will be converted back to analog form before being sent to the Power Amplifier and transmitted through the antenna array. Although the sampling frequency of the input signals has been chosen so that the Nyquist-Shannon theorem is satisfied, the PC-CFR has no certainty that the samples it sees and uses to detect peaks correspond to the actual maxima and minima (and therefore the maximum magnitudes) of the underlying analog signal. In other words, the algorithm bases the detection of peaks and the computation of the peak scales on the values of the samples of the true analog signal, then generates the cancelling pulses and cancels the detected peaks according to these values; but the signal reaching the Power Amplifier is the analog one, whose true peaks might pass undetected between two successive samples. The result is that the true PAPR of the analog signal is always greater than or equal to the PAPR of the discrete-time signal the PC-CFR has worked on. In order to address this issue, two solutions are usually taken into consideration: fractional delay filters can be interposed between every pair of successive Clip Stages, or the Peak Detection module can be made to work not on the data signal as it enters the Clip Stage, but on an interpolated version of it (therefore at a higher sample rate). Without entering into details, both solutions entail an increase in the complexity of the CFR, translating as usual into higher gate count and larger area. The Ericsson Turbo Clipping, being a FIR-based CFR, does not suffer from the shortage of PCUs as the PC-CFR does, and it fully takes into account the problem of the "hidden" peaks between samples; the higher area values of the TC are therefore also due to this added complexity. Nevertheless, the numeric differences between TC and PC-CFR in Figure 5.1 are so prominent that, even taking into account all the observations discussed, the PC-CFR is still an attractive solution.
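The "hidden" peak problem is easy to reproduce on a toy signal. The sketch below is only an illustration with an assumed test tone: the tone is sampled so that its true maximum falls halfway between two samples, and interpft (FFT-based interpolation, exact for this periodic band-limited tone) plays the role of the interpolated detection path.

```matlab
% Sketch: a peak of the underlying waveform hidden between two samples.
n  = 0:15;                         % 16 samples, two periods of the tone
x  = cos(pi * n / 4 + pi / 8);     % toy band-limited signal, true peak value 1
xi = interpft(x, 4 * numel(x));    % 4x interpolated version of the same signal

fprintf('largest sample: %.4f, largest interpolated value: %.4f\n', max(x), max(xi));
```

The largest sample is cos(pi/8), about 0.92, while the interpolated signal reaches the true peak value of 1: a detector working on the original samples would underestimate the peak by roughly 8%.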

5.2 Some input and model configuration exploration

This section explores the behaviour of the PC-CFR algorithm under several configuration scenarios (see Table 5.2). A MATLAB model (see Appendix A) has been written in such a way that it is sample- and bit-accurate with respect to the SV design.

¹ The critical path is the slowest point-to-point path present in a given circuit. If the cumulative delay of such a path fits into the smallest clock period (that is, the fastest clock), then the synthesizer is capable of synthesizing all the parts of the digital subsystem with the given clock frequency.


Figure 5.2: The table summarizing the configuration values and parameters used to test the PC-CFR. The last two columns are the two output values taken into consideration.

This means that the outputs of the MATLAB model are exactly the same as those of the SV simulation, the only difference being the timing aspects (data rate of the input and output signals, delays, etc.), which are not modeled. This model has been used in this section for the purposes mentioned. For all the configurations, the following characteristics are fixed:

• All the Clip Stages have the same threshold

• All the Clip Stages have the same Peak Search Window length

• The PCUs are evenly distributed among the various Clip Stages

One of the objectives of the present section is to expose the limitations of the algorithm, especially as a consequence of the lack of PCUs; the other is to help gain a deeper insight into the relationships among the various aspects of the algorithm. In Table 5.2, 8 configurations are listed. The parameters are set in the MATLAB model and then the script is run. The outputs of the simulations that have been reported are the EVM, the "target" PAPR, the effective PAPR of the output signal and the CCDF diagram.

The EVM is computed inside the script as:

evm = sqrt(var(canc_pulse_tot)/var(sig_inout_fix(:,1))) * 100;

where canc_pulse_tot is the cumulative cancelling pulse obtained as the sum of all the cancelling pulses of every Clip Stage. It has no physical counterpart inside the algorithm, since each CS applies its own cancelling sequence individually, but it is needed to compute the EVM because canc_pulse_tot is the only "perturbation" applied to the input signal; it therefore carries the information about how much the algorithm has changed the original signal in order to perform the peak cancellation. All digital signal processing algorithms make some changes to the signal, and the EVM is the quantity that measures such change, which is desirably small. It should be noted, though, that this measurement carries very little (if not misleading) information when the algorithm does not achieve the required PAPR reduction. As can easily be seen in Figures 5.3 and 5.4, only configurations 2, 4 and 6 fulfill the requirements in a satisfactory way. For the cases in which the PC-CFR cannot cancel all the peaks, the EVM measurement makes little sense and should be neglected. It is, on the other hand, helpful as a comparative gauge among configurations that achieve the desired PAPR reduction, in order to guide the choice towards a more convenient configuration, or among several CFR methods, to estimate the in-band distortion of each.
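To make the EVM expression concrete, it can be applied to a toy case in which the cumulative cancelling pulse is a scaled copy of the input. The vectors sig_in and canc_tot are made-up stand-ins for sig_inout_fix(:,1) and canc_pulse_tot; they are not data from the model.

```matlab
% Sketch: the EVM expression of the script applied to toy vectors.
sig_in   = [1+1i; -1-1i; 1-1i; -1+1i] * 1000;  % toy complex input samples
canc_tot = 0.05 * sig_in;                      % toy perturbation: 5% of the input

evm = sqrt(var(canc_tot) / var(sig_in)) * 100; % same formula as in the script
fprintf('EVM = %.2f %%\n', evm);
```

Since the perturbation is 5% of the input in amplitude, the ratio of the variances is 0.0025 and the EVM is 5%.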

The target PAPR parameter is computed as follows, in the script:

PAPR_tar = 10*log10(((threshold(1)/1.6474) .^ 2) / ...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)));

This means that it is defined as the ratio between the power corresponding to the threshold value (which is the same for all the CSs in this set-up) and the average power of the output signal of the PC-CFR. The division by 1.6474 removes the CORDIC scaling factor that is included in the configured threshold (see Listing A.1). This parameter is just a reference value, because a true target value for the output PAPR is hard to define. Perhaps the ratio between the power corresponding to the threshold value and the average power of the input signal clipped to that threshold could be considered a theoretical limit, although it would correspond to an unrealistic and undesired scenario (the clipping of the input signal as a means of managing the PAPR was discarded immediately because of the OOB emissions).

The definition of the output PAPR is straightforward:

PAPR_out = 10*log10((max(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)) / ...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)));
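As a quick numerical check of the two definitions, they can be evaluated on a toy output signal. The values are made up, and the threshold is given here without the CORDIC scaling factor, so no division by 1.6474 is needed in this sketch.

```matlab
% Sketch: target and output PAPR on a toy signal with one leaked peak.
sig_out   = [1; 1; 1; 3];               % made-up output samples
threshold = 2;                          % hypothetical threshold, unscaled

avg_pow  = mean(abs(sig_out) .^ 2);     % (1 + 1 + 1 + 9) / 4 = 3
PAPR_tar = 10*log10(threshold^2 / avg_pow);              % 10*log10(4/3)
PAPR_out = 10*log10(max(abs(sig_out) .^ 2) / avg_pow);   % 10*log10(9/3)

fprintf('target PAPR = %.2f dB, output PAPR = %.2f dB\n', PAPR_tar, PAPR_out);
```

The output PAPR (about 4.77 dB) exceeds the target (about 1.25 dB) because the sample of magnitude 3 lies above the threshold, which is the signature of a leaked peak.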

The CCDF (Complementary Cumulative Distribution Function) is a representation of the distribution of the instantaneous-to-average power ratio in a given signal interval. It can be interpreted as follows: for each value on the x-axis (representing the ratio between the instantaneous and the average power of the signal, expressed in dB), the corresponding value on the y-axis is the relative frequency² of the signal elements having a ratio equal to or greater than the value on the x-axis (in other words, the fraction of samples whose instantaneous-to-average power ratio is greater than or equal to the value on the x-axis). Each graph of Figures 5.3 and 5.4 shows in red and in black the CCDFs of the input and the output signals respectively. The blue vertical line represents the target PAPR as discussed earlier. It should be kept in mind that the target PAPR is a function not only of the threshold but also of the average power of the output signal after the application of the algorithm; therefore we cannot expect it to be the same among configurations having the same threshold.

² Intended as the ratio between the number of occurrences of an event and the total number of trials.
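The construction of a CCDF curve can be sketched in a few lines. This is an illustrative example on made-up magnitudes, not the plotting code of the model; the variable names are assumptions of this example.

```matlab
% Sketch: empirical CCDF of the instantaneous-to-average power ratio.
mag    = [1 1 1 3 2 1 1 2];                 % made-up sample magnitudes
pw     = abs(mag) .^ 2;                     % instantaneous power
ratio  = 10*log10(pw / mean(pw));           % ratio to the average power, in dB
levels = -5:1:6;                            % x-axis of the CCDF, in dB

% fraction of samples whose ratio is greater than or equal to each level
ccdf = arrayfun(@(t) mean(ratio >= t), levels);

disp([levels; ccdf]);
```

The curve starts at 1 (every sample is at or above the lowest level) and decreases monotonically to 0 beyond the largest ratio, which is the PAPR of the interval.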

A well working PC-CFR should yield an output CCDF clearly to the left of the input CCDF and approaching the blue line. When this does not happen, it is because one or more peaks have leaked and contribute to raising the values of the corresponding relative frequency function (so they "push" the black graph to the right). The peaks leak because, in most of the configurations, the chosen values of thresholds and PSW lengths yield a density of peaks that cannot be tackled by the limited number of PCUs. In Figure 5.3, 2 Clip Stages are used. The top-left image refers to the configuration with the smallest number of PCUs, which as a consequence completely fails to cancel all the peaks. In addition, the overlapping of the cancelling pulses actually makes things worse, so that the output signal shows an even higher PAPR than the input signal. The top-right image refers to the case with 8 PCUs, which indeed successfully reaches the desired PAPR. But when the threshold is lowered (that is, when a stricter requirement is imposed on the desired PAPR reduction), this configuration also fails (bottom-left image of Figure 5.3). Simply enlarging the Peak Search Window from 32 to 64 samples reduces the density of the detected peaks to the point that the available PCUs are again sufficient to deal with the peaks, and so the configuration performs decently (bottom-right image). It should be noted, though, that the enlargement of the PSW comes at the price of a less accurate cancellation of the input elements that are close to the peak, so their magnitude does not always fall under the threshold and the desired PAPR requirement is not exactly met.

5.2.1 Observations

Some observations arise from Figure 5.4. Each image in the Figure has the same model parameters and configuration as the image in the same position in Figure 5.3, the only difference being the number of Clip Stages (this time there are 4 CSs). The combination of the number of PCUs per CS (two) and the length of the PSW yields the unexpected result of the bottom-right image in the Figure. The large PSW cancels the peak element but, as already observed, leaves some elements over the threshold, which are detected as peaks by the following stages. These very small peaks occupy all the PCUs and prevent the algorithm from taking care of the bigger one, which is essentially never cancelled and leaks to the end of the algorithm. This did not happen with only 2 Clip Stages with 4 PCUs each, because 4 PCUs were enough to deal with both the smallest peaks and the big one. The input and output signals of configuration 8 are shown in Figure 5.5. The experiment makes it clear that the configuration of the PC-CFR is a delicate operation in which some non-independent parameters interact in ways that are not obvious.

(a) threshold = 22500, PSW length = 32, 4 PCUs for the left image, 8 PCUs for the right image

(b) 8 PCUs, threshold = 22000, PSW length = 32 for the left image, 64 for the right image

Figure 5.3: Configurations 1 to 4 according to Table 5.2. All the configurations use 2 CSs; the two top images use a threshold of 22500, the two bottom images use 22000.

(a) threshold = 22500, PSW length = 32, 4 PCUs for the left image, 8 PCUs for the right image

(b) 8 PCUs, threshold = 22000, PSW length = 32 for the left image, 64 for the right image

Figure 5.4: Configurations 5 to 8 according to Table 5.2. All the configurations use 4 CSs; the two top images use a threshold of 22500, the two bottom images use 22000.

Figure 5.5: Input and output signals for configuration 8, with an uncancelled peak clearly visible (peak leak).

Bibliography

[1] Ed Hemphill et al. Peak Cancellation Crest Factor Reduction Reference Design(XAPP1033). Xilinx Inc.

[2] Adrio Communications Ltd Ian Poole. Radio-Electronics.com, Resources andanalysis for electronics engineers. 2016. url: http://www.radio-electronics.com/ (visited on 08/29/2016).

[3] Altera Corporation. Crest Factor Reduction (Application Note 396-1.0). Al-tera Corporation.

[4] Lattice Semiconductor. Peak Cancellation Crest Factor Reduction IP CoreUser’s Guide. Lattice Semiconductor.

[5] Jiajia Song and Hideki Ochiai. “A low-complexity peak cancellation schemeand its FPGA implementation for peak-to-average power ratio reduction”. In:EURASIP Journal onWireless Communications and Networking (2015). doi:10.1186/s13638-015-0319-0.

[6] Mathuranathan Viswanathan. Introduction to OFDM - orthogonal Frequencydivision multiplexing. 2011. url: http://www.gaussianwaves.com/2011/05/introduction-to-ofdm-orthogonal-frequency-division-multiplexing-2/ (visited on 04/09/2016).

[7] Electronicdesign. url: http://electronicdesign.com/engineering-essentials/understanding-error-vector-magnitude (visited on 03/19/2017).

[8] ShareTechnote. url: http://www.sharetechnote.com/html/RF_Handbook_ACLR_ACPR.html (visited on 04/09/2016).

[9] Y. S. Cho et al. MIMO OFDM wireless communications with MATLAB. JohnWiley & Sons, 2010, pp. 218–221.

[10] G. Schmidt and J. Schlee. “Crest factor reduction for a multicarrier-signalwith spectrally shaped single-carrier cancelation pulses”. Patent US 8,619,903(US). Dec. 2013. url: https://www.google.se/patents/US8619903.

[11] Bauml Robert, F Robert, and BH Johannes. “Reducing the peak-to-averagepower ratio of multicarrier modulation by selected mapping”. In: Electron.lett 32 (1996), pp. 2056–2057.

[12] Hemdutta Joshi. “Performance augmentation of OFDM system”. In: (2013).

61

http://www.radio-electronics.com/

http://www.radio-electronics.com/

http://dx.doi.org/10.1186/s13638-015-0319-0

http://www.gaussianwaves.com/2011/05/introduction-to-ofdm-orthogonal-frequency-division-multiplexing-2/



http://electronicdesign.com/engineering-essentials/understanding-error-vector-magnitude

http://electronicdesign.com/engineering-essentials/understanding-error-vector-magnitude

http://www.sharetechnote.com/html/RF_Handbook_ACLR_ACPR.html

http://www.sharetechnote.com/html/RF_Handbook_ACLR_ACPR.html

https://www.google.se/patents/US8619903

BIBLIOGRAPHY

[13] Jean Armstrong. “New OFDM peak-to-average power reduction scheme”. In:Vehicular Technology Conference, 2001. VTC 2001 Spring. IEEE VTS 53rd.Vol. 1. IEEE. 2001, pp. 756–760.

[14] Jean Armstrong. “Peak-to-average power reduction for OFDM by repeatedclipping and frequency domain filtering”. In: Electronics letters 38.5 (2002),p. 1.

[15] Tao Jiang, Yang Yang, and Yong-Hua Song. “Companding technique forPAPR reduction in OFDM systems based on an exponential function”. In:Global Telecommunications Conference, 2005. GLOBECOM’05. IEEE. Vol. 5.IEEE. 2005, 4–pp.

[16] Wisam F Al-Azzo et al. “Time domain statistical control for PAPR reduc-tion in OFDM system”. In: Communications, 2007. APCC 2007. Asia-PacificConference on. IEEE. 2007, pp. 141–144.

[17] Carole A Devlin, Anding Zhu, and Thomas J Brazil. “Peak to average powerratio reduction technique for OFDM using pilot tones and unused carriers”.In: Radio and Wireless Symposium, 2008 IEEE. IEEE. 2008, pp. 33–36.

[18] Sroy Abouty et al. “A novel iterative clipping and filtering technique for PAPRreduction of OFDM signals: system using DCT/IDCT transform”. In: Inter-national Journal of Future Generation Communication and Networking 6.1(2013), pp. 1–8.

62

Appendix A

The MATLAB golden model

A model of the described RTL implementation of the PC-CFR has been written in MATLAB. The purpose of the model is to replicate the behavior of the design as faithfully as possible, so no priority has been given to performance or memory efficiency in its development, although pre-allocation is used and the two CORDIC functions are vectorized.

A properly chosen interval of the data file 2carriers.dat has been used as input for both the model and the SV testbench, and the outputs of the model and of the simulation have been compared in order to check for a full 100% match. This condition has been considered as proof of the consistency between the model and the RTL implementation. The code of the script, as well as of the functions it depends upon, is listed in this Appendix. The only missing files are the input data file and the cancelling pulse coefficient file. The segment isolated from the much larger 2carriers.dat has been chosen because it exhibits a variety of peak densities and envelope shapes useful to show several characteristics and limitations of the PC-CFR (see Figure A.2 for the data segment before and after the PC-CFR).

The script starts in Listing A.1, where some parameters for the PC-CFR configuration (thresholds, PSW lengths, number of CSs and PCUs) are set. Experimenting by modifying these parameters and running the script makes it possible to explore the performance of the PC-CFR both for different PAPR reduction requirements (by changing the threshold) and for several HW configurations (by selecting the number of CSs, the distribution of PCUs among them, etc.).

Input data and coefficients are loaded from external files and are closely related to each other (if the carrier configuration of the input data stream changes, a new, corresponding set of coefficients must be used for the cancelling pulse). Finally, the signals sig_inout_fix and canc_pulse_tot are defined. The first holds all the successive signals of the data path chain, from the input up to the output of the last stage (i.e. the output of the PC-CFR); the second accumulates the cancelling pulses of all the stages and is needed for the computation of the EVM.


Listing A.1: Part 1 of the MATLAB script. Configuration of the model and loading of the input data segment and pulse coefficients.

%% This is the golden model for the PC-CFR
%*********
% Part 1 *
%*********
% Parameters for the model
% num_CS_c is number of Clip Stages. It translates into number of
% iterations of the PC-CFR algorithm
% threshold is an array of thresholds for each Clip Stage. The desired
% values are to be multiplied by the value 1.6474 to take into account the
% scaling effect introduced by the CORDIC during the conversion from the
% rectangular to the polar forms.
% psw is an array of Peak Search Window lengths for each Clip Stage
% num_PCU_c represents how the PCUs are distributed among the Clip
% Stages (it is actually not parameterized, since it describes a 4 Clip
% Stages case).
% num_cdc_iter is the number of iterations the CORDIC will do. The value
% must match the number of elements of the array lut_table, which contains
% the amount of rotations for each iteration (in radians * 2^10). These two
% last parameters have been inserted directly into the code of the two
% CORDIC functions in order to better vectorize them.
% pk_table maps the peak scales with the starting elements of the
% cancelling pulse. This models the variable length cancelling pulses
clear;

num_CS_c = 4;
threshold = zeros(1, num_CS_c);
psw = zeros(1, num_CS_c);
num_PCU_c = [4, 2, 1, 1];
for i = 1:num_CS_c
    threshold(i) = 37066; % = 22500*1.6474;
    psw(i) = 32;
end
pk_table = [1000, 2000, 4000, 5000, 131071; 403, 294, 183, 76, 1];

%% Read the input data file
% a subset of 5600 elements is isolated and used as input of the model
fid = fopen('input_data/2carriers.dat', 'r');
data = textscan(fid, '%f %f');
fclose(fid);
i_data = data{1};
i_data = i_data(278751:284350);
q_data = data{2};
q_data = q_data(278751:284350);
iq_data = complex(i_data, q_data);
data_length = length(iq_data);
clear data i_data q_data fid;

%% Load the cancelling pulse coefficients, in mag/phase form
load('pulse_coeffs');
clear coeffs;

%% The matrix sig_inout_fix stores all the signals of the chain from the
% input to the output, included the outputs of the intermediate CSs
sig_inout_fix = zeros(data_length, num_CS_c+1);
sig_inout_fix(:, 1) = iq_data;

%% We iterate the algorithm over num_CS_c times. Each iteration uses the
% output of the previous iteration as its input, in order to model a
% cascade-like structure. canc_pulse_tot stores the cumulative
% cancelling pulse signals, composition of all the cancelling pulses of
% all the stages. It is needed for the computation of EVM at the end
% of the algorithm
canc_pulse_tot = zeros(data_length, 1);

In Listing A.2, the actual iterations start, in order to model the cascade-like structure of the Clip Stages. First, the CORDIC exposes the magnitude of the samples so that the subsequent Peak Detector can find the peaks (this snippet of code populates a table with all the peak characteristics). The following snippet defines, according to the detected peaks, the starting and ending indexes of the corresponding cancelling pulses’ intervals (including the special cases in which one of the interval limits falls outside the range of the input signal). Then a very important snippet of code follows: here the algorithm checks whether a peak can be cancelled according to the availability of PCUs. If it cannot, the peak simply leaks, which translates into not generating a cancelling pulse in the following part of the script. If it can, a counter is incremented for the whole duration of the interval of the corresponding cancelling pulse, so that the PCU availability for the following peaks takes this into account (one less PCU available).
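The PCU availability check can be illustrated in isolation with a toy example. The sketch below is not part of the golden model: the signal length, the single PCU and the three busy intervals are hypothetical values chosen so that the second peak finds the only PCU occupied and therefore leaks.

```matlab
% Toy model of the PCU occupancy counter (all numbers are hypothetical).
num_pcu = 1;                       % PCUs available in this Clip Stage
sig_len = 100;                     % length of the toy signal
busy    = [10 40; 30 60; 50 80];   % [start, end] of the interval each peak would occupy

occupancy = zeros(sig_len, 1);     % PCUs in use at each sample
leak      = zeros(size(busy, 1), 1);

for n = 1:size(busy, 1)
    if occupancy(busy(n, 1)) + 1 > num_pcu
        % no free PCU when the interval starts: the peak leaks
        leak(n) = 1;
    else
        % reserve one PCU for the whole duration of the interval
        idx = busy(n, 1):busy(n, 2);
        occupancy(idx) = occupancy(idx) + 1;
    end
end
disp(leak.');
```

As in the model, availability is checked only at the first sample of the interval, and the occupancy never exceeds the number of PCUs.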

Listing A.2: Part 2 of the MATLAB script. In order: conversion of input samples to polar form, peak detection, definition of cancelling pulses on the input signal and flagging of the peaks that will leak and thus will not be cancelled. The very last part generates a graphical representation of the PCU occupancy divided per CS, as can be seen in Figure A.1.

%*********
% Part 2 *
%*********
for m = 1:num_CS_c
    % CORDIC
    [data_mag_array, data_pha_array] = arrayfun(@cordic_c2p,...
        real(sig_inout_fix(:, m)), imag(sig_inout_fix(:, m)));

    %% Peak detection
    % in the first column we put the absolute index of the detected
    % peak, in the second the value of the scale, in the third the
    % phase, in the fourth the starting address of the relative
    % cancelling pulse according to the scale (check pk_table array).
    % In the fifth, later in the algorithm, will be set a 1 or a 0
    % according to the fact that the relative peak leaks or not
    % (if all the resources of the present CS are already busy).
    % In the sixth we put the number of elements from the peak position
    % to the end of the PSW
    peak_info = zeros(250, 6);
    state = 'IDLE';
    j = 1; % counter inside the PSW
    k = 1; % peak table index
    for i = 1:data_length
        switch state
            case 'IDLE'
                if (data_mag_array(i) > threshold(m))
                    state = 'PEAK_SEARCH';
                    index_temp = i;
                    mag_temp = data_mag_array(i);
                    pha_temp = data_pha_array(i);
                    index_in_psw = 1;
                end
            case 'PEAK_SEARCH'
                j = j + 1;
                if (data_mag_array(i) > mag_temp)
                    index_temp = i;
                    index_in_psw = j;
                    mag_temp = data_mag_array(i);
                    pha_temp = data_pha_array(i);
                end
                if j == psw(m)
                    peak_info(k,1) = index_temp;
                    peak_info(k,2) = mag_temp - threshold(m);
                    peak_info(k,3) = pha_temp;
                    peak_info(k,6) = psw(m) - index_in_psw;
                    index_in_psw = psw(m);
                    j = 1;
                    % Choice of the pulse length according to the
                    % magnitude of the peak scale
                    for q = 1:5
                        if peak_info(k,2) < pk_table(1,q)
                            peak_info(k,4) = pk_table(2,q);
                            break;
                        end
                    end
                    k = k + 1;
                    state = 'IDLE';
                    mag_temp = 0;
                    pha_temp = 0;
                    index_temp = 1;
                end
        end
    end
    peak_info = peak_info(1:k-1,:);

    %% Study the density of cancelling pulses needed to deal with peaks
    % for each detected peak, start_end_idx stores the initial and the
    % final indexes of the relative cancelling pulses, if they will be
    % generated (the peak leak condition is checked further in the code)
    start_end_idx = zeros(size(peak_info, 1), 2);
    for n = 1:size(peak_info, 1)
        delay = 512 - peak_info(n,4);
        if ((peak_info(n,1) - delay) <= 0)
            start_end_idx(n,1) = 1;
            start_end_idx(n,2) = peak_info(n,1) + delay;
        elseif ((peak_info(n,1) + delay) >= data_length)
            start_end_idx(n,1) = peak_info(n,1) - delay;
            start_end_idx(n,2) = data_length;
        else
            start_end_idx(n,1) = peak_info(n,1) - delay;
            start_end_idx(n,2) = peak_info(n,1) + delay;
        end
    end

    % This part of code detects whether a peak can be cancelled
    % according to the availability of PCUs for the actual CS (i.e.
    % iteration) or it will leak. In the latter case, no corresponding
    % cancelling pulse will be generated
    peak_leak = zeros(size(peak_info, 1), 1);
    pulse_intervals = zeros(length(iq_data) + 2*psw(m) + 1023, 1);
    for n = 1:(size(peak_info, 1))
        % we manage a counter in a proper interval of the input
        % signal. The counting represents how many PCUs are already in
        % use at that point. The interval starts just after the end of
        % the PSW and lasts for: the displacement between the actual
        % peak index position and the start of the PSW plus 512
        % elements needed to reach the central element of the
        % cancelling pulse plus the remaining elements of the
        % cancelling pulse before the PCU will be free again
        if pulse_intervals(peak_info(n,1) + peak_info(n,6)) + 1 >...
                num_PCU_c(m)
            peak_leak(n) = 1;
        else
            for t = peak_info(n,1)+peak_info(n,6):...
                    peak_info(n,1)+peak_info(n,6) +...
                    (psw(m) - peak_info(n,6)) + 512 + (512-peak_info(n,4))
                pulse_intervals(t) = pulse_intervals(t) + 1;
            end
        end
    end
    peak_info(:,5) = peak_leak;

    figure(1);
    subplot(num_CS_c, 1, m);
    plot(pulse_intervals);
    xlabel('Input signal element index');
    ylabel('PCUs');
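The pulse-length selection and the interval computation of Listing A.2 can also be tried on a single peak. The following sketch reuses pk_table and the half-length of 512 from the model, while the peak position and scale are hypothetical. The find call and the max/min clamping are compact stand-ins for the loop and the if/elseif chain of the listing, not the author's formulation.

```matlab
% Sketch: pulse selection and interval limits for one hypothetical peak.
pk_table    = [1000, 2000, 4000, 5000, 131071; 403, 294, 183, 76, 1];
data_length = 5600;
peak_idx    = 100;    % hypothetical position of the peak
peak_scale  = 1500;   % hypothetical magnitude above the threshold

% First column whose upper bound exceeds the scale (same choice as the loop).
col      = find(peak_scale < pk_table(1, :), 1);
start_el = pk_table(2, col);    % first cancelling pulse element that is used
delay    = 512 - start_el;      % samples covered on each side of the peak

% Clamp the interval to the range of the input signal.
interval = [max(1, peak_idx - delay), min(data_length, peak_idx + delay)];
fprintf('start element %d, delay %d, interval [%d, %d]\n', ...
    start_el, delay, interval(1), interval(2));
```

With these values the pulse would start 218 samples before the peak, so the lower limit is clamped to the first sample of the signal.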

In Listing A.3, the cancelling pulses are finally generated by modeling the same sequence of operations implemented in the RTL design: complex multiplication in polar form, followed by conversion to rectangular form and subtraction from the input signal. The results are computed and displayed in the final Part 4.
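The sequence of operations can be sketched in floating point, leaving aside the integer scaling of the model. All numeric values below are hypothetical, and pol2cart stands in for the integer CORDIC that the model uses for the polar to rectangular conversion.

```matlab
% Floating-point sketch of the cancelling pulse generation (hypothetical data).
peak_mag = 2.0;                   % peak scale
peak_pha = 0.3;                   % peak phase, in radians
coef_mag = [0.5; 1.0; 0.5];       % pulse coefficient magnitudes
coef_pha = [-0.1; 0.0; 0.1];      % pulse coefficient phases

% Complex multiplication in polar form: multiply magnitudes, add phases.
pulse_mag = peak_mag * coef_mag;
pulse_pha = peak_pha + coef_pha;

% Conversion to rectangular form.
[pulse_i, pulse_q] = pol2cart(pulse_pha, pulse_mag);
pulse = complex(pulse_i, pulse_q);

% Reference: the same product computed directly on complex numbers.
ref = (peak_mag * exp(1j * peak_pha)) * (coef_mag .* exp(1j * coef_pha));
err = max(abs(pulse - ref));

% Subtraction from the (hypothetical) input samples.
x_in  = [1 + 1j; 2 + 0.5j; 1 - 1j];
x_out = x_in - pulse;
```

Working in polar form turns the complex product into one real multiplication and one addition per sample, at the price of the final conversion.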



Listing A.3: Parts 3 and 4 of the MATLAB script. Generation of the cancelling pulses and their subtraction from the input signal, followed by the computation and display of the results.

%% Creation and application of the cancelling pulses to the signal

    %*********
    % Part 3 *
    %*********
    canc_pulse_i = zeros(data_length,1);
    canc_pulse_q = zeros(data_length,1);
    for j = 1:size(peak_info, 1) % for every detected peak
        if peak_info(j, 5) == 0 % if the peak did not leak
            range = start_end_idx(j,1):start_end_idx(j,2);
            pulseRange = range - peak_info(j,1) + 1 + 511;
            canc_pulse_mag = zeros(data_length, 1);
            canc_pulse_pha = zeros(data_length, 1);
            canc_i = zeros(data_length, 1);
            canc_q = zeros(data_length, 1);
            % 48334 is to compensate for the gain of the two CORDICs
            canc_pulse_mag = 48334 * peak_info(j,2) *...
                coeffs_mag(pulseRange);
            canc_pulse_pha = peak_info(j,3) + coeffs_pha(pulseRange);
            [canc_i(range), canc_q(range)] =...
                arrayfun(@cordic_p2c, canc_pulse_mag, canc_pulse_pha);
            canc_pulse_i(range) = canc_pulse_i(range) + canc_i(range);
            canc_pulse_q(range) = canc_pulse_q(range) + canc_q(range);
        end
    end
    % The right shift by 34 positions is to compensate
    % for the cancelling pulse coefficients (which were multiplied
    % by 2^17-1) and for the constant 48334 (= 0.3687562... * 2^17)
    temp_i = floor(bitshift(canc_pulse_i, -34, 'int64'));
    temp_q = floor(bitshift(canc_pulse_q, -34, 'int64'));
    canc_pulse = complex(temp_i, temp_q);
    canc_pulse_tot = canc_pulse_tot + canc_pulse;
    % actual cancellation of the peaks
    sig_inout_fix(:, m+1) = sig_inout_fix(:, m) - canc_pulse;
end % end of algorithm iterations loop

%*********
% Part 4 *
%*********
figure;
plot(abs(iq_data), 'r');
hold on;
plot(abs(sig_inout_fix(:,num_CS_c+1)), 'k');
plot(ones(1, data_length)*threshold(m)/1.6474, 'k');
title('Input signal (red) vs peak-cancelled signal(black)');

%% Calculate the FFT of the input signal and the output of the PC-CFR
% X = (fftshift(fft(iq_data, data_length)));
% XX = X .* conj(X)/(data_length^2);
% Y = (fftshift(fft(sig_inout_fix(:,num_CS_c+1), data_length)));
% YY = Y .* conj(Y)/(data_length^2);
% figure;
% % plot input power spectrum
% subplot(2,1,1);
% plot(10*log10(XX));
% title('Power Spectrum Using Log Scale, input sequence');
% ylabel('Power in DB');
% % plot output power spectrum
% subplot(2,1,2);
% plot(10*log10(YY));
% title('Power Spectrum Using Log Scale, output sequence');
% ylabel('Power in DB');

%% computation of PAPR
PAPR_in = 10*log10(max(abs(iq_data) .^ 2) / mean(abs(iq_data) .^ 2));
PAPR_out = 10*log10((max(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)) /...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)));
% target PAPR, computed on threshold of CS0
PAPR_tar = 10*log10(((threshold(1)/1.6474) .^ 2) /...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)).^ 2)));
str_in = sprintf('PAPR of input signal = %f dB', PAPR_in);
str_out = sprintf('PAPR of output signal = %f dB', PAPR_out);
str_tar = sprintf('Target PAPR = %f dB', PAPR_tar);
disp(str_in);
disp(str_out);
disp(str_tar);

%% computation of EVM
evm = sqrt(var(canc_pulse_tot)/var(sig_inout_fix(:,1))) * 100;
fprintf('Percent EVM = %f %%\n', evm);

%% computation of CCDF
figure;
CDF_plot(iq_data, 'r');
hold on;
CDF_plot(sig_inout_fix(:,num_CS_c+1), 'k');
hold on
line([PAPR_tar PAPR_tar], ylim);
legend('PC-CFR input', 'PC-CFR output', 'Target PAPR');
title('CCDF of input and output signals');
xlabel('PAR');
ylabel('Relative frequency');

% end of PC-CFR model
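The PAPR and EVM expressions of Part 4 can be checked by hand on a very small example. The two vectors below are hypothetical and real-valued, chosen only so that the results are easy to verify; the formulas are the ones used in the listing, with the removed signal taken as the error.

```matlab
% Hand-checkable PAPR and EVM example (hypothetical samples).
x = [1; 1; 1; 3];     % input: one sample well above the others
y = [1; 1; 1; 2];     % output: the peak has been reduced

% Peak-to-average power ratio in dB, as in Part 4.
papr = @(s) 10*log10(max(abs(s).^2) / mean(abs(s).^2));
papr_in  = papr(x);   % peak power 9, average power 3
papr_out = papr(y);   % peak power 4, average power 1.75

% EVM in percent: removed signal (x - y) against the input.
evm = sqrt(var(x - y) / var(x)) * 100;

fprintf('PAPR in = %.3f dB, PAPR out = %.3f dB, EVM = %.1f %%\n', ...
    papr_in, papr_out, evm);
```

The example shows the trade-off that the model quantifies: lowering the peak reduces the PAPR but introduces an error with respect to the original signal.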

In Listings A.4, A.5 and A.6 the MATLAB code for the two CORDICs and for the function that computes and displays the Complementary Cumulative Distribution Function (CCDF) is provided (see Appendix 5.2 for a brief explanation of the CCDF representation). Note that in each part of the MATLAB code, fixed-point arithmetic and representation have been "implemented" by the careful use of integers and scaling. In Figure A.3, the comparative CCDF of the input and output signals is shown.
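As an example of this kind of scaling, the constant 48334 that Listing A.3 describes as 0.3687562... * 2^17 can be reproduced from the theoretical gain of an 11-iteration CORDIC. The sketch below is one plausible reading of where the constant comes from, under the assumption that it is the inverse of the gain of two cascaded CORDICs expressed with 17 fractional bits; it is not taken from the thesis.

```matlab
% Sketch: gain of an 11-iteration CORDIC and the matching compensation constant.
n_iter = 11;                                   % iterations used in the listings

% Each micro-rotation i stretches the vector by sqrt(1 + 2^(-2i)).
K = prod(sqrt(1 + 2.^(-2 * (0:n_iter-1))));    % gain of one CORDIC

% Inverse of the gain of two CORDICs, scaled by 2^17 and rounded.
comp = round(2^17 / K^2);

fprintf('K = %.6f, 1/K^2 = %.7f, constant = %d\n', K, 1 / K^2, comp);
```

Multiplying by this integer and shifting right afterwards replaces a division by the squared gain, which is how the integer model avoids fractional arithmetic.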



Listing A.4: The CORDIC function used to convert from rectangular to polar

function [mag, pha] = cordic_c2p(x, y)
    pi_div_2 = 1571;
    n = 11;
    inpLUT = [804, 475, 251, 127, 64, 32, 16, 8, 4, 2, 1];

    if x < 0
        if y < 0
            tmp = x;
            x = -y;
            y = tmp;
            z = -pi_div_2;
        else
            tmp = x;
            x = y;
            y = -tmp;
            z = pi_div_2;
        end
    else
        z = 0;
    end

    for idx = 1:n
        xtmp = floor(bitshift(x, -(idx-1), 'int32'));
        ytmp = floor(bitshift(y, -(idx-1), 'int32'));
        %xtmp = floor(bitshift(x, -(idx-1))); % octave version
        %ytmp = floor(bitshift(y, -(idx-1))); % octave version
        if y < 0
            z = z - inpLUT(idx);
            x = x - ytmp;
            y = y + xtmp;
        else
            z = z + inpLUT(idx);
            x = x + ytmp;
            y = y - xtmp;
        end
    end
    mag = x;
    pha = z;
end
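The entries of inpLUT are consistent with the elementary CORDIC rotation angles atan(2^-i) expressed in radians * 2^10, as stated in the comments of Listing A.1. The sketch below recomputes the table under that assumption; rounding to the nearest integer is also an assumption, although it reproduces the listed values.

```matlab
% Sketch: recompute the CORDIC angle table in units of radians * 2^10.
n_iter = 11;
lut = round(atan(2.^(-(0:n_iter-1))) * 2^10);
disp(lut);

% Entries of the table used in the listings, for comparison.
inpLUT = [804, 475, 251, 127, 64, 32, 16, 8, 4, 2, 1];
same = isequal(lut, inpLUT);
```

On the same scale pi/2 would round to 1608, whereas the listings use pi_div_2 = 1571, which matches pi/2 * 1000; the phase constants may therefore follow a different convention than the table.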

Listing A.5: The CORDIC function used to convert from polar to rectangular

function [x0, y0] = cordic_p2c(mag, pha)
    pi_div_2 = 1571;
    pi = 3142;
    two_pi = 6284;
    n = 11;
    inpLUT = [804, 475, 251, 127, 64, 32, 16, 8, 4, 2, 1];

    if pha < -pi
        pha = pha + two_pi;
    elseif pha > pi
        pha = pha - two_pi;
    end

    if pha < 0
        pha = pha + pi_div_2;
        pos_rot = 0;
    else
        pha = pha - pi_div_2;
        pos_rot = 1;
    end

    x = mag;
    y = 0;
    z = pha;

    for idx = 1:n
        xtmp = floor(bitshift(x, -(idx-1), 'int64'));
        ytmp = floor(bitshift(y, -(idx-1), 'int64'));
        %xtmp = floor(bitshift(x, -(idx-1))); % octave version
        %ytmp = floor(bitshift(y, -(idx-1))); % octave version
        if z < 0
            z = z + inpLUT(idx);
            x = x + ytmp;
            y = y - xtmp;
        else
            z = z - inpLUT(idx);
            x = x - ytmp;
            y = y + xtmp;
        end
    end

    if pos_rot == 1
        x0 = -y;
        y0 = x;
    else
        x0 = y;
        y0 = -x;
    end
end

Listing A.6: Function used to plot the CCDF of the input and output signals

function [tmp] = CDF_plot(Y, color)
    P_average = mean(abs(Y) .^ 2);
    Instantaneous_power = abs(Y) .^ 2;
    [n, X] = hist(Instantaneous_power, length(Y));
    m = cumsum(n);
    semilogy(10*log10(X/P_average), 1 - m/max(m), color);
    grid on;
    axis([-2 12 2e-4 1])
end
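Listing A.6 builds the curve from a histogram with one bin per sample. An alternative that avoids binning altogether is to sort the instantaneous power values, as in the sketch below. This is a swapped-in technique shown for comparison, not the method used in the thesis, and the four complex samples are hypothetical.

```matlab
% Sketch: empirical CCDF obtained by sorting instead of binning.
y = [1; 2; 3; 4] .* exp(1j * [0; 1; 2; 3]);   % hypothetical complex samples

p      = abs(y).^2;                        % instantaneous power
par_db = 10*log10(sort(p) / mean(p));      % power over average power, in dB
ccdf   = 1 - (1:numel(p)).' / numel(p);    % fraction of samples above each level

% semilogy(par_db, ccdf) would draw the curve; the last point is zero
% and therefore does not appear on a logarithmic axis.
disp([par_db, ccdf]);
```

Each point states which fraction of the samples exceeds the corresponding power level, which is the quantity plotted in Figure A.3.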



Figure A.1: PCU usage for each Clip Stage. Note that the maximum number of PCUs per stage (in this example 4, 2, 1, 1) cannot be exceeded.

Figure A.2: The input data segment is compared with the output of the PC-CFR. Note that for this configuration of CSs, threshold, PSW length and distribution and number of PCUs, the algorithm is fully capable of satisfying the PAPR requirement.


Figure A.3: The CCDF of input and output signal. The output signal fully satisfies the target PAPR requirement.
