
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Application of machine learning in 5G to extract prior knowledge of the underlying structure in the interference channel matrices

DANILO PENG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Mathematical Statistics (30 ECTS credits)

Master's Programme in Applied and Computational Mathematics (120 credits)

KTH Royal Institute of Technology year 2019

Supervisor at Huawei: Jinliang Huang

Supervisor at KTH: Henrik Hult

Examiner at KTH: Henrik Hult


TRITA-SCI-GRU 2019:071

MAT-E 2019:28

KTH Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

Data traffic has been growing drastically over the past few years due to digitization and new technologies that are introduced to the market, such as autonomous cars. In order to meet this demand, the MIMO-OFDM system is used in the fifth generation wireless network, 5G. Designing the optimal wireless network is currently the main research focus within the area of telecommunication. In order to achieve such a system, multiple factors have to be taken into account, such as the suppression of interference from other users. A traditional method called the linear minimum mean square error filter is currently used to suppress the interference. To derive such a filter, a selection of parameters has to be estimated. One of these parameters is the ideal interference plus noise covariance matrix. By gathering prior knowledge of the underlying structure of the interference channel matrices, in terms of the number of interferers and their corresponding bandwidths, the estimation of the ideal covariance matrix could be facilitated. In this thesis, machine learning algorithms were used to extract this prior knowledge. More specifically, a feedforward neural network with two or three hidden layers and a support vector machine with a linear kernel were used. The empirical findings imply promising results, with accuracies above 95% for each model.

Keywords: Machine learning, 5G, interference channel, blind source estimation, MIMO, OFDM, bandwidth prediction, support vector machines, artificial neural network.


Sammanfattning

Over the past few years, data usage has increased drastically due to digitization and as new technologies, such as self-driving cars, are introduced to the market. To meet this demand, a so-called MIMO-OFDM system is used in the fifth generation wireless network, 5G. Designing the optimal wireless network is currently the main research focus within telecommunication, and to achieve such a system several factors must be considered, among them interference from other users. A traditional method used to suppress the interference is called the linear minimum mean square error filter. To find such a filter, several different parameters must be estimated, one of them being the ideal interference plus noise covariance matrix. Determining the underlying structure of the interference matrices, in terms of the number of interferers and their corresponding bandwidths, facilitates the estimation of the ideal covariance matrix. In this thesis, various machine learning algorithms have been applied to extract this information. More specifically, a neural network with two or three hidden layers and a support vector machine with a linear kernel have been used. The final results are promising, with an accuracy of at least 95% for each model.

Swedish Title: Applikation av maskininlärning inom 5G för att extrahera information av den underliggande strukturen i interferenskanalmatriserna


Acknowledgements

Firstly, I want to start by giving my deepest gratitude to my supervisor Dr. Jinliang Huang for offering me the opportunity to carry out my master thesis project at Huawei. More importantly, I would like to thank him for all of the support and guidance that he has given me, not just as a supervisor but also as a true friend. It has been a privilege to work with you and this project. Furthermore, I would like to thank Karl Gäfvert for all the helpful discussions we have had throughout this project. I also would like to express my gratitude to my supervisor Prof. Henrik Hult at KTH Royal Institute of Technology for his availability and guidance. Last but not least, the following thesis is dedicated to my family and lovely girlfriend for always supporting me and pushing me towards the best version of myself; without you guys this would have never been possible.


Table of Contents

List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Limitations
  1.4 Outline

2 Telecommunication Concepts
  2.1 Channel
  2.2 Multiple Input Multiple Output
  2.3 Orthogonal Frequency Division Multiplexing
  2.4 Physical Resource Block
  2.5 System Model
    2.5.1 Linear Minimum Mean Square Error
    2.5.2 Interference

3 Background on Machine Learning
  3.1 Supervised vs Unsupervised Learning
  3.2 Support Vector Machines
    3.2.1 Maximum Margin Classifier
    3.2.2 Support Vector Classifier
    3.2.3 Support Vector Machine
    3.2.4 SVM for Multi-classification
  3.3 Artificial Neural Network
    3.3.1 Feedforward Neural Network
    3.3.2 Activation Function
    3.3.3 Training Neural Network
    3.3.4 Additional Concepts of Neural Network
  3.4 Performance Measurements
  3.5 Singular Value Decomposition

4 Experimental Setup
  4.1 Data Generation
  4.2 Bandwidth Detection
    4.2.1 Data Preparation
    4.2.2 Feedforward Neural Network
  4.3 Source Estimation
    4.3.1 Data Preparation
    4.3.2 Support Vector Machine

5 Results
  5.1 Binary Classification
  5.2 Multi-classification

6 Discussion
  6.1 Critical Reflection

7 Conclusion
  7.1 Further Research

References


List of Figures

2.1 A wireless channel model with a signal x propagating through channel H with an output signal y.

2.2 Demonstration of the idea behind MIMO, where Tx stands for the transmitting antennas, Rx for the receiver antennas and $h_{i,j}$ corresponds to the channel state information for the UE. Reproduced from Baumgärtner [2].

2.3 Visualization of OFDM given in the frequency domain: a signal with bandwidth B divided into smaller bandwidths b. Redrawn from Lin [18].

2.4 Illustration of interference in MIMO.

3.1 Illustration of the concept behind MMC. Reproduced from Rajput [23].

3.2 Illustration of the concept behind SVC. Reproduced and adapted from Dey [7].

3.3 Illustration of the concept of enlarging the feature space. Reproduced from Kim [14].

3.4 A feedforward neural network with one single hidden layer.

3.5 A computational graph of a FNN with one hidden layer.

3.6 Geometrical interpretation of each component in the singular value decomposition. Reproduced and adapted from Johann [13].

4.1 Illustration of the simulated datasets $S_d$ for d = 1, 2, 3, 4, 5, each containing unique angular information.

4.2 Plot of the power of 24 subcarriers, or two RBs, in the observed matrix $V_l$, illustrating the edge and normal cases.

4.3 Eigenvalue distribution for different numbers of sources at zero SNR.

5.1 Summary of the FNN model with one hidden layer.

5.2 Summary of the FNN model with two hidden layers.

5.3 Summary of the FNN model with three hidden layers.

5.4 Training history of the FNN models with one, two and three hidden layers.

5.5 Comparison between the FNN models with different numbers of hidden layers and a simple threshold method. Evaluation based on dataset S5 with mixed test samples.

5.6 Comparison between the FNN models with different numbers of hidden layers and a simple threshold method. Evaluation based on dataset S5 with edge case test samples.

5.7 Confusion matrices for FNN models L2 and L3 based on mixed test samples for the ideal as well as zero SNR.

5.8 Confusion matrices for FNN models L2 and L3 based on edge case test samples for the ideal as well as zero SNR.

5.9 Comparison between the different kernels used in SVM. A baseline method based on logistic regression is also included. Evaluation based on dataset S5.

5.10 Confusion matrices for ideal SNR with different kernel functions in SVM.

5.11 Confusion matrices for zero SNR with different kernel functions in SVM.


List of Tables

3.1 Examples of commonly used kernel functions in SVM.

3.2 Examples of activation functions that can be used when building a feedforward neural network.

5.1 Accuracy of each model for five different SNRs evaluated on dataset S5 with mixed test samples.

5.2 Accuracy of each model for five different SNRs evaluated on dataset S5 with edge case test samples.

5.3 Grid search history of SVM applying the linear kernel function. Mean accuracy based on the training data samples S1-S4 with k-fold cross validation, k = 3.

5.4 Grid search history of SVM applying the RBF kernel function. Mean accuracy based on the training data samples S1-S4 with k-fold cross validation, k = 3.

5.5 Grid search history of SVM applying the sigmoid kernel function. Mean accuracy based on the training data samples S1-S4 with k-fold cross validation, k = 3.

5.6 Finalized models based on the grid search of the linear, RBF and sigmoid kernels.

5.7 Accuracy for different kernels used in SVM as well as a logistic regression classifier. Evaluated on dataset S5 for five different SNRs.


List of Abbreviations

4G: Fourth generation wireless system
5G: Fifth generation wireless system
ADAM: Adaptive moment estimate optimizer
ANN: Artificial neural network
AoD: Angle of departure
CCI: Co-channel interference
FNN: Feedforward neural network
LMMSE: Linear minimum mean square error
MIMO: Multiple input multiple output
MMC: Maximum margin classifier
OFDM: Orthogonal frequency division multiplexing
RB: Resource block
RBF: Radial basis function
ReLU: Rectified linear unit
SNR: Signal to noise ratio
SVC: Support vector classifier
SVD: Singular value decomposition
SVM: Support vector machine
TTI: Transmission time interval
UE: User (user equipment)
ZoD: Zenith angle of departure


Chapter 1

Introduction

1.1 Motivation

During the past few years, the fourth generation wireless system (4G) has been dominating the wireless technology industry, enabling everything from higher-quality video streaming to simple phone calls through WhatsApp. However, as new technologies are introduced into the market, such as autonomous cars, the demand for improved wireless communications has increased dramatically. Not to mention, data traffic is expected to grow by a factor of 9 between 2014 and 2020 [21]. In order to meet this demand, a fifth generation wireless system (5G) is introduced. It provides a more efficient solution than previous generations of wireless technology by offering higher data rates and speed, lower latency, wider bandwidths and greater coverage [16]. Simply put, 5G will enable new opportunities in all kinds of industries. As of now, one of the commonly used techniques in wireless communication is called multiple input multiple output (MIMO). This technique takes advantage of an increased number of antennas at the transmitter as well as the receiver side. A major challenge that still remains is the detection of the signal at the receiver end, i.e. how the information from spot A can be transferred and interpreted correctly at spot B. This is due to the fact that there are multiple users (UEs) transmitting signals at the same time, which interfere with each other. Therefore, in order to design an optimal 5G network, key factors such as the interference from other UEs have to be taken into account. Unfortunately, accurate estimation of the interference channel remains a bottleneck that affects the system performance. A conventional algorithm used to suppress this kind of interference is the linear minimum mean square error (LMMSE) filter applied at the receiver side; the advantage of the algorithm is its low complexity. To derive such a filter, multiple parameters have to be estimated, one of them being the ideal covariance matrix of the interference plus noise. By obtaining information about the underlying structure of the interference channel matrices, the estimation of the covariance matrix can be improved. This thesis will showcase how machine learning can be applied to extract this type of prior knowledge.

1.2 Research Questions

As mentioned in the previous section, the aim of the thesis is to retrieve information about the underlying structure of the interference channel matrices. In other words, this problem will be cast as a machine learning problem, and various algorithms will be applied in order to answer the questions below:

1. How can the bandwidth of the interference channel matrices be detected?

2. How many interferers are there within each resource block (RB)?

1.3 Limitations

In order to adapt the complexity of the thesis to an appropriate level, the following limitations have been made:

• The maximum number of interferers is four UEs.

• All the data is simulated in MATLAB and based on one specific scenario with five different signal to noise ratios (SNRs).

• The reconstruction of the interference plus noise covariance matrix is not considered, rather the underlying structure of the components that form it.

• The channel state information of the main UE, not the interferers, is assumed to be estimated by some arbitrary method and is thus known.

1.4 Outline

The thesis is organized as follows. Chapter 2 gives the reader a short introduction to basic concepts within telecommunication, while Chapter 3 focuses on explaining the mathematical theory of the machine learning methods used in the thesis. Furthermore, the experimental setups are introduced in Chapter 4. The results are then presented in Chapter 5 and further discussed in Chapter 6. Finally, in Chapter 7, a conclusion is drawn and suggestions for further work are considered.


Chapter 2

Telecommunication Concepts

In order to understand and grasp the main problem, some basic and fundamental concepts within telecommunication are presented.

2.1 Channel

A channel in the context of telecommunication usually refers to the medium used to transmit the data to the receiver, for example a wire or air, as seen in Figure 2.1.

Figure 2.1 A wireless channel model with a signal x propagating through channel H with anoutput signal y.

Fading is a well known phenomenon in telecommunication which describes the variation of the attenuation of the transmitted signal. The concept of a multipath fading channel refers to a channel that experiences fading. This occurs when the transmitted signal takes different paths, resulting in changes of strength and phase. Therefore, a signal x that propagates through a multipath fading channel H often results in a different output signal y. The cause of such a phenomenon can depend on multiple variables, such as the environment and temperature [1].


2.2 Multiple Input Multiple Output

MIMO has been dominating different kinds of wireless applications, such as 4G and 5G, over the past few years. The main idea is to place multiple antennas on both the transmitter and the receiver end. By doing so, the antennas will behave differently, for instance by having different fading amplitudes. Thus, it reduces the probability of a signal not reaching the UE due to channel fading and other factors that can deteriorate or weaken the signal. MIMO provides greater spectral efficiency and reliability and is able to handle larger data traffic compared to the traditional single input single output (SISO) setup, where only one antenna is used at each end [16]. Figure 2.2 demonstrates the concept behind MIMO.

Figure 2.2 Demonstration of the idea behind MIMO, where Tx stands for the transmitting antennas, Rx for the receiver antennas and $h_{i,j}$ corresponds to the channel state information for the UE. Reproduced from Baumgärtner [2].

2.3 Orthogonal Frequency Division Multiplexing

Orthogonal frequency division multiplexing (OFDM) is one of the most commonly used modulation techniques. In this concept, rather than transmitting a signal with one carrier of bandwidth B, the signal is divided and transmitted through a set of orthogonal subcarriers with a smaller bandwidth b [27, 3], as shown in Figure 2.3.


Figure 2.3 Visualization of OFDM given in the frequency domain: a signal with bandwidth B divided into smaller bandwidths b. Redrawn from Lin [18].

Advantages of OFDM include the elimination of inter-symbol interference, computational efficiency through the fast Fourier transform, and increased robustness against fading [27].

2.4 Physical Resource Block

One of the main challenges in telecommunication is the allocation of resources. Simply put, it can be described as where to allocate data in the frequency band for each UE. To quantify the amount of resources allocated to each UE, a RB is defined. Each RB contains 12 subcarriers, where each subcarrier carries a data symbol. Moreover, one RB is the smallest unit of resource that can be allocated to a UE [17]. However, this is viewed from the downlink perspective, where the UE can be seen as the receiver while the base station corresponds to the transmitting side. Similarly, the UE can transmit data stored in one or multiple RBs to the base station in the uplink case.

2.5 System Model

Another challenge in the area of telecommunication is the reconstruction at the receiver end of a signal that has been sent from the transmitter. Usually, if the channel state information for each UE is known, traditional methods can be applied. Unfortunately, this does not occur in reality, since there are multiple UEs transmitting data signals at the same time and frequency; thus, new challenges arise in reconstructing the original data signal.

Assume a MIMO-OFDM system with $N_k$ subcarriers, $N_r$ antennas at the base station and $N_u$ interferers. The received signal at the base station for the k-th subcarrier of UE l is


modelled as

$$\mathbf{y}_{k,l} = \mathbf{h}_{k,l}s_{k,l} + \sum_{\substack{i=1 \\ i \neq l}}^{N_u} \mathbf{h}_{k,i}s_{k,i} + \mathbf{n}_k, \qquad k = 1,2,\ldots,N_k, \tag{2.1}$$

where $\mathbf{y}_{k,l} \in \mathbb{C}^{N_r}$, $s_{k,l} \in \mathbb{C}$ and $s_{k,i} \in \mathbb{C}$ denote the received signal, the normalized signal from UE l and that of the i-th interferer, respectively; $\mathbf{h}_{k,l} \in \mathbb{C}^{N_r}$ represents the k-th column of the channel matrix $\mathbf{H}_l \in \mathbb{C}^{N_r \times N_k}$ between the base station and UE l, while $\mathbf{h}_{k,i} \in \mathbb{C}^{N_r}$ is the k-th column of the channel matrix $\mathbf{H}_i \in \mathbb{C}^{N_r \times N_k}$ between the base station and the i-th interferer, and $\mathbf{n}_k \in \mathbb{C}^{N_r}$ is additive white Gaussian noise [26]. Note that the system model shown in (2.1) is expressed per subcarrier but is actually observed in terms of RBs, i.e. multiples of 12 subcarriers, resulting in the matrices

$$\mathbf{Y}_l = \begin{bmatrix} \mathbf{y}_{1,l} & \mathbf{y}_{2,l} & \ldots & \mathbf{y}_{12,l} & \mathbf{y}_{13,l} & \ldots & \mathbf{y}_{12 \cdot m,l} \end{bmatrix},$$

$$\mathbf{H}_l = \begin{bmatrix} \mathbf{h}_{1,l} & \mathbf{h}_{2,l} & \ldots & \mathbf{h}_{12,l} & \mathbf{h}_{13,l} & \ldots & \mathbf{h}_{12 \cdot m,l} \end{bmatrix},$$

$$\mathbf{H}_i = \begin{bmatrix} \mathbf{h}_{1,i} & \mathbf{h}_{2,i} & \ldots & \mathbf{h}_{12,i} & \mathbf{h}_{13,i} & \ldots & \mathbf{h}_{12 \cdot m,i} \end{bmatrix},$$

$$\mathbf{N} = \begin{bmatrix} \mathbf{n}_1 & \mathbf{n}_2 & \ldots & \mathbf{n}_{12} & \mathbf{n}_{13} & \ldots & \mathbf{n}_{12 \cdot m} \end{bmatrix},$$

where m is the number of RBs and each column is a vector in $\mathbb{C}^{N_r}$. The matrices $\mathbf{H}_i$ will be referred to as the interference channel matrices. Furthermore, the known variables consist of $\mathbf{Y}_l$ and $s_{k,l}$, while $\mathbf{H}_l$ can be estimated.

The ultimate goal is to reconstruct the signals $s_{k,l}$, which can be done by applying a filter at the receiver side that suppresses the interference and noise. Such an algorithm is presented in the next section.

2.5.1 Linear Minimum Mean Square Error

The LMMSE algorithm aims to minimize the mean square error

$$E[(x - \hat{x})^2],$$

where $x$ is the desired signal and $\hat{x}$ is the estimated one. For a system of the form

$$\mathbf{y} = \mathbf{h}x + \mathbf{n},$$

it is well known that the LMMSE estimator is

$$\hat{x} = \left(\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}\right)^T \mathbf{y}, \tag{2.2}$$

where $\mathbf{R}_{yy}$ is the covariance matrix of $\mathbf{y}$ and $\mathbf{R}_{yx}$ is the cross covariance matrix between $\mathbf{y}$ and $x$ [24].

The system model shown in (2.1) can be rewritten using the notation

$$\mathbf{y}_{k,l} = \mathbf{h}_{k,l}s_{k,l} + \mathbf{v}_{k,l},$$

where $\mathbf{v}_{k,l} = \sum_{\substack{i=1 \\ i \neq l}}^{N_u} \mathbf{h}_{k,i}s_{k,i} + \mathbf{n}_k$ is the contribution of all the interference and noise on the k-th subcarrier, and let $\mathbf{R}_{n_k n_k}$ be the covariance matrix of the white Gaussian noise. Moreover, applying the LMMSE estimator results in the following covariance matrices

$$\mathbf{R}_{y_{k,l},y_{k,l}} = \mathbf{R}_{v_{k,l},\mathrm{ideal}} + \mathbf{h}_{k,l}\mathbf{h}_{k,l}^H, \tag{2.3}$$

$$\mathbf{R}_{y_{k,l},s_{k,l}} = \mathbf{h}_{k,l}, \tag{2.4}$$

where $(\cdot)^H$ is the Hermitian operator and $\mathbf{R}_{v_{k,l},\mathrm{ideal}}$ is the ideal covariance matrix of the interference plus noise term, defined as

$$\mathbf{R}_{v_{k,l},\mathrm{ideal}} = \mathbf{R}_{n_k n_k} + \sum_{\substack{i=1 \\ i \neq l}}^{N_u} \mathbf{h}_{k,i}\mathbf{h}_{k,i}^H,$$

under the assumption that all the UEs are independent of each other as well as of the white noise. Hence, the LMMSE filter at the k-th subcarrier is calculated by combining (2.2), (2.3) and (2.4) to obtain

$$\mathbf{F}_{\mathrm{LMMSE}} = \left(\mathbf{R}_{v_{k,l},\mathrm{ideal}} + \mathbf{h}_{k,l}\mathbf{h}_{k,l}^H\right)^{-1}\mathbf{h}_{k,l}, \tag{2.5}$$

which suppresses the noise plus interference at the receiver end [26, 20].
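As a concrete illustration of (2.5), the following is a minimal NumPy sketch (not from the thesis) of how the per-subcarrier LMMSE filter could be computed, assuming the noise covariance is $\sigma^2\mathbf{I}$ and that the interferers' channel vectors are known; the function name and toy dimensions are hypothetical.

```python
import numpy as np

def lmmse_filter(h_l, H_int, noise_var):
    """Sketch of the LMMSE filter in Eq. (2.5) for one subcarrier.

    h_l       : (Nr,) complex channel vector of the desired UE
    H_int     : (Nr, Nu) matrix whose columns are the interferers' h_{k,i}
    noise_var : variance of the additive white Gaussian noise
    """
    Nr = h_l.shape[0]
    # Ideal interference-plus-noise covariance: R_nn + sum_i h_i h_i^H
    R_ideal = noise_var * np.eye(Nr) + H_int @ H_int.conj().T
    # Eq. (2.5): F = (R_ideal + h h^H)^{-1} h
    return np.linalg.solve(R_ideal + np.outer(h_l, h_l.conj()), h_l)

# Toy usage with random Rayleigh-like channels (Nr = 4, Nu = 2)
rng = np.random.default_rng(0)
h = (rng.standard_normal(4) + 1j * rng.standard_normal(4)) / np.sqrt(2)
H_i = (rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))) / np.sqrt(2)
F = lmmse_filter(h, H_i, noise_var=0.1)
```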


2.5.2 Interference

Interference in telecommunication is defined as contributions of unwanted signals, including noise. There are different types of interference, one of which is co-channel interference (CCI). CCI is caused by frequency reuse as a consequence of several UEs using the same set of frequencies [22]. Besides the desired signal, other unwanted signals arrive at the receiving end as well. Sources leading to this phenomenon could be another mobile or call nearby and/or base stations that operate at the same frequency. This occurs more frequently in crowded areas, especially when the UEs are close to each other. Figure 2.4 is an example where UEs 2, 3 and 4 interfere with UE 1.

Figure 2.4 Illustration of interference in MIMO.

Once again, consider the system model shown in (2.1). It was previously mentioned that the only known variables are the received signals $\mathbf{Y}_l$, the normalized signals $s_{k,l}$ and the channel matrix $\mathbf{H}_l$. Therefore, (2.1) can be rewritten as

$$\mathbf{v}_{k,l} = \mathbf{y}_{k,l} - \hat{\mathbf{h}}_{k,l}s_{k,l} = \sum_{\substack{i=1 \\ i \neq l}}^{N_u} \mathbf{h}_{k,i}s_{k,i} + \mathbf{n}_k,$$

where $\hat{\mathbf{h}}_{k,l}$ is an estimate of $\mathbf{h}_{k,l}$. The $\mathbf{v}_{k,l}$ terms are expressed with the following matrix notation

$$\mathbf{V}_l = \begin{bmatrix} \mathbf{v}_{1,l} & \mathbf{v}_{2,l} & \ldots & \mathbf{v}_{12,l} & \mathbf{v}_{13,l} & \ldots & \mathbf{v}_{12 \cdot m,l} \end{bmatrix}, \tag{2.6}$$


where m is the number of RBs and each column is the total interference plus noise contribution in each subcarrier from interferers $i = 1,2,\ldots,N_u$. Matrix (2.6) contains the actual values observed in reality.

Previously, it was shown that the LMMSE filter depends on the ideal covariance matrix of $\mathbf{v}_{k,l}$. However, by calculating it explicitly, the following is obtained

$$\mathbf{R}_{v_{k,l}} = \mathbf{R}_{n_k n_k} + \sum_{\substack{i=1 \\ i \neq l}}^{N_u} \mathbf{h}_{k,i}\mathbf{h}_{k,i}^H + 2\sum_{\substack{j=1 \\ j \neq i,l}}^{N_u}\sum_{\substack{i=1 \\ i \neq j,l}}^{N_u} \mathbf{h}_{k,i}s_{k,i}s_{k,j}^H\mathbf{h}_{k,j}^H = \mathbf{R}_{v_{k,l},\mathrm{ideal}} + \mathbf{c}, \tag{2.7}$$

where $\mathbf{c}$ contains undesired cross terms of the k-th columns of the interference channel matrices. The idea is to reconstruct $\mathbf{R}_{v_{k,l},\mathrm{ideal}}$ only by observing (2.6); therefore, one would like to use the number of interferers $N_u$ in each RB and the underlying structure of $\mathbf{H}_i$ as prior information for the reconstruction. Hence, it becomes a blind estimation problem. To clarify this, consider a case with two interferers, where interferer 1 has a bandwidth of 2 RBs (subcarriers 1-24) while interferer 2 has a bandwidth of 1 RB (subcarriers 13-24). This would yield the following interference channel matrices

$$\mathbf{H}_1 = \begin{bmatrix} \mathbf{h}_{1,1} & \ldots & \mathbf{h}_{12,1} & \mathbf{h}_{13,1} & \ldots & \mathbf{h}_{24,1} \end{bmatrix},$$

$$\mathbf{H}_2 = \begin{bmatrix} \mathbf{0} & \ldots & \mathbf{0} & \mathbf{h}_{13,2} & \ldots & \mathbf{h}_{24,2} \end{bmatrix}.$$

Then, the observed matrix would be

$$\mathbf{V}_l = \begin{bmatrix} \mathbf{v}_{1,l} & \ldots & \mathbf{v}_{13,l} & \ldots & \mathbf{v}_{24,l} \end{bmatrix}, \tag{2.8}$$

where the k-th column corresponds to the summation

$$\mathbf{V}_l(k) = \mathbf{h}_{k,1}s_{k,1} + \mathbf{h}_{k,2}s_{k,2} + \mathbf{n}_k, \qquad k = 1,2,\ldots,24, \tag{2.9}$$

where $\mathbf{n}_k$ is white Gaussian noise with zero mean and variance $\sigma_s^2$. From (2.9), it is clear that there is a different number of sources in each RB of matrix (2.8): $N_u = 1$ in the first RB and $N_u = 2$ in the second one. Furthermore, the contribution of the interference depends on two different sources, each of which typically has its own parameters. Examples of such parameters are the angular information, angle of departure (AoD) and zenith angle of departure (ZoD). If $\mathbf{H}_1$ instead had a bandwidth of 1 RB (subcarriers 1-12), $\mathbf{V}_l$ would have the following decomposition

$$\mathbf{V}_l = \begin{bmatrix} \mathbf{h}_{1,1}s_{1,1} & \ldots & \mathbf{h}_{12,1}s_{12,1} & \mathbf{0} & \ldots & \mathbf{0} \end{bmatrix} + \begin{bmatrix} \mathbf{0} & \ldots & \mathbf{0} & \mathbf{h}_{13,2}s_{13,2} & \ldots & \mathbf{h}_{24,2}s_{24,2} \end{bmatrix} + \mathbf{N}.$$

Thus, the two RBs have the same number of interferers ($N_u = 1$) while the sources come from different interferers. It would be desirable to capture such source switches as well, since the RBs will then contain different statistical information, such as power and angular information. This prior knowledge can help estimate the ideal interference plus noise covariance matrix in the LMMSE filter.
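To make the example concrete, below is a small NumPy sketch (an assumption-laden illustration, not the thesis' MATLAB generator) that simulates the observed matrix $\mathbf{V}_l$ of (2.9) for the case where interferer 1 occupies subcarriers 1-12 and interferer 2 subcarriers 13-24:

```python
import numpy as np

rng = np.random.default_rng(1)
Nr, K, sigma = 4, 24, 0.1   # antennas, subcarriers (2 RBs), noise std

def rand_c(*shape):
    """Unit-variance circularly symmetric complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Interferer 1 active on RB 1 (subcarriers 1-12), interferer 2 on RB 2 (13-24)
H1 = np.concatenate([rand_c(Nr, 12), np.zeros((Nr, 12))], axis=1)
H2 = np.concatenate([np.zeros((Nr, 12)), rand_c(Nr, 12)], axis=1)
s1, s2 = rand_c(K), rand_c(K)   # normalized transmitted symbols
N = sigma * rand_c(Nr, K)       # white Gaussian noise matrix

# Column k of V is h_{k,1} s_{k,1} + h_{k,2} s_{k,2} + n_k, per Eq. (2.9)
V = H1 * s1 + H2 * s2 + N
```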


Chapter 3

Background on Machine Learning

The mathematics and background behind the machine learning algorithms used in the thesis are explained in the following chapter.

3.1 Supervised vs Unsupervised Learning

There are mainly two different types of problems in machine learning: supervised and unsupervised. In supervised learning, the label corresponding to a specific observation is known. The idea is to use these observations and labels to train a model in order to predict the outcome of future observations. In unsupervised learning, on the other hand, only a set of features, i.e. observations, is available, without labels or measurements of the corresponding outcome. Thus, the aim of unsupervised learning is to find the underlying structure of the features, for instance by applying different clustering methods [12]. In this thesis, the problems are modelled as supervised, and the algorithms used in the supervised setting are described.

3.2 Support Vector Machines

A class of well known algorithms for classification problems is the support vector machines, which are based on constructing a decision boundary that separates the classes as well as possible [12].

Consider a binary classification problem where $\mathbf{x}_i \in \mathbb{R}^p$ are the feature vectors with corresponding class labels $y_i \in \{-1,1\}$ for $i = 1,2,\ldots,N$. Moreover, a hyperplane is defined as

$$f(\mathbf{x}_i) = \beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i.$$


Assuming that $\beta_0 = \hat{\beta}_0$ and $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$ have been calculated, the label $y_j \in \{-1,1\}$ of a newly observed data point $\mathbf{x}_j$ can be predicted. A decision rule $\delta(\mathbf{x}_j)$ is derived by looking at the sign of $f(\mathbf{x}_j) = \hat{\beta}_0 + \hat{\boldsymbol{\beta}}^T\mathbf{x}_j$, as shown below

$$\delta(\mathbf{x}_j) = \begin{cases} 1 & f(\mathbf{x}_j) > 0 \\ -1 & f(\mathbf{x}_j) < 0. \end{cases}$$

A model that bases its decision on the sign of a specific function is usually referred to as a perceptron [9]. The challenge is to find estimates of the parameters $\beta_0$ and $\boldsymbol{\beta}$ that yield the optimal hyperplane, which forms the basic idea of the support vector machine algorithms.

3.2.1 Maximum Margin Classifier

Maximum margin classifier (MMC), also referred to as hard margin SVM, can be seen as the simplest form amongst the SVM classifiers. It is formulated as the following convex optimization problem

$$\begin{aligned} \min_{\boldsymbol{\beta},\beta_0} \quad & \frac{1}{2}\|\boldsymbol{\beta}\|^2 \\ \text{s.t.} \quad & y_i(\beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i) \geq 1 \quad \text{for } i = 1,2,\ldots,N. \end{aligned} \tag{3.1}$$

Translating the optimization problem (3.1), it can be interpreted as finding a hyperplane that correctly classifies all of the observations $\mathbf{x}_i$, i.e. such that the classes are on the correct side of the hyperplane. Therefore, an assumption that has to be made about the observations is linear separability, since a hyperplane creates a linear boundary. In addition, each of the points $\mathbf{x}_i$ is at least a distance M from the hyperplane. The variable M is often referred to as the margin [12]. Figure 3.1 is a graphical demonstration of the binary classification problem, described using the MMC where $\mathbf{x}_i \in \mathbb{R}^2$.


Figure 3.1 Illustration of the concept behind MMC. Reproduced from Rajput [23].
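As an aside, a hard margin can be approximated numerically by a soft-margin solver with a very large cost parameter; the snippet below is a hedged scikit-learn sketch on a tiny, linearly separable toy set (the data and parameter values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data in R^2, labels in {-1, 1}
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# A very large C effectively forbids margin violations (hard margin)
mmc = SVC(kernel="linear", C=1e9).fit(X, y)
print(mmc.coef_, mmc.intercept_)   # beta and beta_0 of the hyperplane
print(mmc.support_vectors_)        # the points that define the margin
```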

3.2.2 Support Vector Classifier

As mentioned in the previous section, a necessary assumption for MMC is that the classes are linearly separable. However, this does not often occur in reality. A generalization of MMC is the support vector classifier (SVC), where the main difference is that it allows observations to violate the margin and even be misclassified [12]. Mathematically, it is similar to the optimization problem (3.1), aside from the fact that two other variables are introduced, i.e. $\boldsymbol{\varepsilon} \in \mathbb{R}^N$, which contains slack variables $\varepsilon_i$, and $C \in \mathbb{R}$, which is a tuning parameter. These variables fulfill the constraints below

$$\varepsilon_i \geq 0, \quad \sum_{i=1}^N \varepsilon_i \leq C \quad \text{for } i = 1,2,\ldots,N.$$

Hence, the optimization problem (3.1) can be modified into

$$\begin{aligned} \min_{\boldsymbol{\beta},\beta_0} \quad & \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\sum_{i=1}^N \varepsilon_i \\ \text{s.t.} \quad & y_i(\beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i) \geq 1 - \varepsilon_i \quad \text{for } i = 1,2,\ldots,N, \\ & \varepsilon_i \geq 0, \quad \sum_{i=1}^N \varepsilon_i \leq C \quad \text{for } i = 1,2,\ldots,N, \end{aligned} \tag{3.2}$$

resulting in the SVC. Note that setting C = 0 yields the MMC. The newly introduced variables, $\varepsilon_i$ and C, are interpreted as follows:

• $\varepsilon_i$: shows how the i-th observation is located relative to the hyperplane and margin.


1. $\varepsilon_i = 0 \iff$ the i-th observation does not violate the margin and is correctly classified.

2. $0 < \varepsilon_i < 1 \iff$ the observation is on the wrong side of the margin but correctly classified.

3. $\varepsilon_i > 1 \iff$ the observation is misclassified.

• C: a budget for the N observations to violate the margin.

The concept behind SVC, as well as the variables $\boldsymbol{\varepsilon}$ and C for the binary classification problem, is illustrated in Figure 3.2.

Figure 3.2 Illustration of the concept behind SVC. Reproduced and adapted from Dey [7].

Instead of solving (3.2) directly, it can be viewed in its dual formulation, which benefits from the fact that only the inner products of the observations have to be considered. Additionally, this is the key component in creating non-linear decision boundaries. The Lagrangian primal function that has to be minimized with respect to the parameters $\beta_0$ and $\boldsymbol{\beta}$ is

$$L_P = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\sum_{i=1}^N \varepsilon_i - \sum_{i=1}^N \lambda_i\left(y_i\left(\beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i\right) - (1 - \varepsilon_i)\right) - \sum_{i=1}^N \mu_i\varepsilon_i, \tag{3.3}$$


where $\lambda_i \geq 0$, $\mu_i \geq 0$. This can be done by setting the derivatives equal to zero to obtain the equations

$$\boldsymbol{\beta} = \sum_{i=1}^N \lambda_i y_i \mathbf{x}_i, \tag{3.4}$$

$$0 = \sum_{i=1}^N \lambda_i y_i, \tag{3.5}$$

$$\lambda_i = C - \mu_i, \quad i = 1,2,\ldots,N. \tag{3.6}$$

By inserting (3.4) - (3.6) into (3.3), the corresponding dual form is obtained

$$\begin{aligned} \max_{\lambda} \quad & \sum_{i=1}^N \lambda_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \lambda_i\lambda_k y_i y_k \mathbf{x}_i^T\mathbf{x}_k \\ \text{s.t.} \quad & 0 \leq \lambda_i \leq C \quad \text{for } i = 1,2,\ldots,N, \\ & \sum_{i=1}^N \lambda_i y_i = 0. \end{aligned} \tag{3.7}$$

Furthermore, the Karush-Kuhn-Tucker conditions are included

$$\lambda_i\left(y_i\left(\beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i\right) - (1 - \varepsilon_i)\right) = 0, \tag{3.8}$$

$$\mu_i\varepsilon_i = 0, \tag{3.9}$$

$$y_i\left(\beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i\right) - (1 - \varepsilon_i) \geq 0 \quad \text{for } i = 1,2,\ldots,N.$$

The dual problem (3.7) is solved using traditional methods such as quadratic programming. After solving the dual problem, the parameter $\boldsymbol{\beta}$ is obtained according to (3.4). Clearly, the optimal estimate of $\boldsymbol{\beta}$ includes the Lagrange multipliers $\lambda_i$, each with a corresponding observation $\mathbf{x}_i$. These observations are often called the support vectors [12]. The optimal value of the parameter $\beta_0 = \hat{\beta}_0$ is calculated from (3.8), using that the observations on the margin have $\varepsilon_i = 0$ and $0 < \lambda_i < C$ due to (3.6) and (3.9).

Classification of a new observation $\mathbf{x}_j$, where $j \neq i$, is based on the sign of the function

$$f\left(\mathbf{x}_j\right) = \sum_{i=1}^N \hat{\lambda}_i y_i \mathbf{x}_j^T\mathbf{x}_i + \hat{\beta}_0. \tag{3.10}$$


However, (3.10) still results in a linear decision boundary. In reality, problems are often non-linear, and even though SVC allows for some misclassification, it would be more feasible to introduce an alternative method that can eliminate such errors. Hence, an approach for creating non-linear decision boundaries is introduced in the next section.

3.2.3 Support Vector Machine

Support vector machine (SVM) is an extension of SVC based on the concept of enlarging the feature space with the help of kernels. The basic idea is to map the input vectors into some high dimensional space by a function $g: \mathbb{R}^p \to \mathbb{R}^L$, where $L > p$. The observations in the enlarged feature space are used to compute a linear decision boundary that corresponds to a non-linear one in the original feature space [12]. Figure 3.3 showcases the procedure in SVM.

Figure 3.3 Illustration of the concept of enlarging the feature space. Reproduced from Kim [14].

Assuming that the SVC is performed in the feature space enlarged by the function g results in the optimization problem

$$\begin{aligned} \max_{\lambda} \quad & \sum_{i=1}^N \lambda_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \lambda_i\lambda_k y_i y_k \langle g(\mathbf{x}_i), g(\mathbf{x}_k)\rangle \\ \text{s.t.} \quad & 0 \leq \lambda_i \leq C \quad \text{for } i = 1,2,\ldots,N, \\ & \sum_{i=1}^N \lambda_i y_i = 0, \end{aligned}$$

where ⟨·, ·⟩ is the inner product operation.


As shown in (3.10), the optimal decision boundary is given by

$$\sum_{i=1}^N \hat{\lambda}_i y_i \langle g(\mathbf{x}_i), g(\mathbf{x}_j)\rangle + \hat{\beta}_0, \tag{3.11}$$

which implies that it depends on the inner product of the enlarged feature vectors. However, the transformation function g does not have to be explicitly defined to calculate the inner product; instead, different kernel functions in terms of $\mathbf{x}_i$ and $\mathbf{x}_j$ are used. The notation for a kernel K evaluated at $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^p$ is

$$K(\mathbf{x}_i, \mathbf{x}_j),$$

where K is a function $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ for all $\mathbf{x}_i$ and $\mathbf{x}_j$. Table 3.1 shows popular choices of kernels K.

Table 3.1 Examples of commonly used kernel functions in SVM.

Kernel                         $K(\mathbf{x}_i, \mathbf{x}_j)$
Polynomial of degree d         $(1 + \langle \mathbf{x}_i, \mathbf{x}_j \rangle)^d$
Radial basis function (RBF)    $\exp\{-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\}$
Hyperbolic tangent/sigmoid     $\tanh(\gamma \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \kappa)$

Equation (3.11) can therefore be written as

$$f\left(\mathbf{x}_j\right) = \sum_{i=1}^N \hat{\lambda}_i y_i K(\mathbf{x}_j, \mathbf{x}_i) + \hat{\beta}_0.$$

In order to construct a non-linear decision boundary, a non-linear kernel such as the RBF has to be applied; selecting a linear kernel results in a linear decision boundary in the original feature space. However, using a non-linear kernel is not always the best approach, since it can be very computationally expensive. The running time complexity for the RBF kernel is $O(n_{SV} \times p)$, where $n_{SV}$ is the number of support vectors and p is the dimension of the feature vectors $\mathbf{x}_i$. On the other hand, the linear kernel has a time complexity of only $O(p)$ [4]. If the aim is to get fast results, the linear kernel should be considered, often at the cost of reduced accuracy.
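As an illustration of this trade-off, the following scikit-learn sketch (synthetic data, illustrative parameters) fits an SVM with a linear and an RBF kernel and compares their test accuracies:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one SVM per kernel and report held-out accuracy
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))
```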


3.2.4 SVM for Multi-classification

Compared to the binary setting, in multi-classification the number of classes K is greater than two, meaning that $y_i \in \{1,2,\ldots,K\}$. The two most commonly used approaches are one-versus-one and one-versus-all [12].

One-versus-One

In this approach, the idea is to construct $\binom{K}{2}$ SVMs, where each one corresponds to a distinct pair of classes, i.e. 1-vs-2, ..., 1-vs-K, ..., (K-1)-vs-K. The $\binom{K}{2}$ SVMs are applied to the observed data when predicting a new observation. The data is then assigned to the class to which it has been most frequently classified amongst all the SVMs.

One-versus-All

Compared to the previous approach, the principle behind one-versus-all is that each of the K classes is compared against the K-1 remaining ones, instead of creating SVMs for each distinct pair of classes. Hence, K SVMs are computed, where the k-th class ($k \in \{1,2,\ldots,K\}$) is labeled as 1 and the rest as -1. For each class, a corresponding decision boundary is found by the function $f_k$. Prediction of a new observation $\mathbf{x}_j$ is then based on

$$y_{j,\mathrm{predict}} = \underset{k \in \{1,2,\ldots,K\}}{\arg\max} \; f_k(\mathbf{x}_j).$$
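A hedged scikit-learn sketch of the one-versus-all rule is shown below: K binary SVMs are fitted, and a new observation is assigned to the class whose decision function $f_k$ is largest (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with K = 3 classes
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One binary SVM per class, each trained as class k versus the rest
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
scores = ova.decision_function(X[:5])   # shape (5, K): one column per f_k
print(np.argmax(scores, axis=1))        # argmax_k f_k(x_j)
print(ova.predict(X[:5]))               # applies the same rule internally
```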

3.3 Artificial Neural Network

Artificial neural network (ANN) is a mathematical attempt to mimic the behavior of the human brain. It can be seen as an abstract network of nonlinear transforming elements. ANNs are suitable for problems containing high-dimensional data and large sample sizes. Thus, they are often used in areas such as image analysis and applied to regression as well as classification problems. In 1943, the neurophysiologist Warren McCulloch and the mathematician Walter Pitts were the first to model a neural network using electrical circuits [11]. In order to fully understand the reasoning behind ANNs, key components such as hidden layers, activation functions and loss functions have to be understood.

3.3.1 Feedforward Neural Network

Feedforward neural network (FNN) is one type of ANN and contains neurons that are connected to each other and form different types of layers. A FNN typically consists of an input layer, N hidden layers, and an output layer, where the hidden layers are defined as all the layers in between the input and output layers. The simplest structure of a FNN is when N = 1, given by Figure 3.4. When the number of hidden layers is N > 1, it is usually referred to as a deep FNN [11].

Figure 3.4 A feedforward neural network with one single hidden layer.

The basic idea of a FNN is to apply nonlinear transformations between the layers to linear combinations of the inputs, in order to generate a desired output at the end. To demonstrate this, consider the case with one single hidden layer, i.e. N = 1, containing S hidden nodes. The transformation applied to the input vector $\mathbf{x} = (x_1, x_2, \ldots, x_p)^T$ between the input and the hidden layer is

$$Z_s = \sigma_s(\beta_{0,s} + \mathbf{w}_s^T\mathbf{x}), \quad s = 1,2,\ldots,S,$$

where $\sigma_s(\cdot)$ is an activation function between the layers, $\mathbf{w}_s$ is the weight vector connecting $\mathbf{x}$ and node $Z_s$, while $\beta_{0,s}$ are bias terms. Furthermore, $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_S)$ will act as the input between the hidden and output layer of the network according to

$$f_k(\mathbf{x}) = g_k(\alpha_{0,k} + \boldsymbol{\alpha}_k^T\mathbf{Z}), \quad k = 1,2,\ldots,K, \tag{3.12}$$

where $g_k(\cdot)$ is an activation function, $\boldsymbol{\alpha}_k$ is a weight vector, $\alpha_{0,k}$ are bias terms and $f_k(\mathbf{x})$ corresponds to the output of the k-th node in the network. Equation (3.12) is expressed using the following matrix notation

$$\mathbf{f}(\mathbf{x}) = \mathbf{g}\left(\boldsymbol{\alpha}_0 + \mathbf{A}\boldsymbol{\sigma}(\boldsymbol{\beta}_0 + \mathbf{W}\mathbf{x})\right), \tag{3.13}$$


where $\mathbf{f} = (f_1, \ldots, f_K)^T$ is the output vector, $\mathbf{W}$ is a $(S \times p)$ weight matrix, $\mathbf{A}$ is a $(K \times S)$ weight matrix, $\boldsymbol{\beta}_0 = (\beta_{0,1}, \ldots, \beta_{0,S})^T$ and $\boldsymbol{\alpha}_0 = (\alpha_{0,1}, \ldots, \alpha_{0,K})^T$ are the bias vectors, while $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_S)^T$ and $\mathbf{g} = (g_1, \ldots, g_K)^T$ are vectors of activation functions.

This can easily be extended to the case where N > 1 by using (3.12) as the input for the next hidden layer in the network and thereafter applying the same reasoning. The objective is to derive the weight matrices $\mathbf{W}$ and $\mathbf{A}$ as well as the bias vectors $\boldsymbol{\beta}_0$ and $\boldsymbol{\alpha}_0$ such that the optimal network is obtained [11].
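A minimal NumPy sketch of the forward pass in (3.13) for one input vector is given below, assuming ReLU hidden activations and a softmax output (the dimensions and initialization are illustrative):

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def softmax(t):
    e = np.exp(t - t.max())          # shift for numerical stability
    return e / e.sum()

def forward(x, W, b0, A, a0):
    """One-hidden-layer FNN, Eq. (3.13): f(x) = g(alpha_0 + A sigma(beta_0 + W x))."""
    Z = relu(b0 + W @ x)             # hidden layer, sigma = ReLU here
    return softmax(a0 + A @ Z)       # output probabilities over K classes

# Toy dimensions: p = 4 inputs, S = 8 hidden nodes, K = 3 classes
rng = np.random.default_rng(0)
W, b0 = rng.standard_normal((8, 4)), np.zeros(8)
A, a0 = rng.standard_normal((3, 8)), np.zeros(3)
print(forward(rng.standard_normal(4), W, b0, A, a0))
```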

3.3.2 Activation Function

The purpose of the activation functions $\sigma(\cdot)$ and $g(\cdot)$ is to introduce non-linearity into the network to manage more complex problems [11]. Linear activation functions can also be applied, which is equivalent to outputting linear combinations of the inputs and therefore limits the complexity of the network. Usually, the same activation function is applied at each node in the same layer; however, different ones can also be chosen. Table 3.2 shows various activation functions that can be used when building a FNN.

Table 3.2 Examples of activation functions that can be used when building a feedforward neural network.

Activation function                        $\sigma(t)$
Gaussian radial basis                      $\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$
Rectified linear unit (ReLU)               $\max(0, t)$
Leaky rectified linear unit (Leaky ReLU)   $\max(0, t) + I\{t \leq 0\} \cdot at$
Sigmoid                                    $1/(1 + e^{-t})$

In classification problems, the desired output is usually a vector containing probabilities for each one of the classes, i.e. values between zero and one. Thus, the activation function between the last hidden layer and the output layer should achieve such a mapping, for instance the softmax function

$$\mathrm{softmax}(t_i) = \frac{e^{t_i}}{\sum_{j=1}^K e^{t_j}}.$$


3.3.3 Training Neural Network

When creating a neural network, the objective is to find a set of parameters, connection weights and biases, such that an optimal network is obtained. For simplicity, consider the case illustrated before with N = 1 in the classification setting and denote the set of parameters as

$$\Theta = \{\boldsymbol{\alpha}_0, \mathbf{A}, \boldsymbol{\beta}_0, \mathbf{W}\}.$$

Furthermore, consider J observed vectors $\mathbf{x}_j \in \mathbb{R}^p$, $j = 1,2,\ldots,J$, each with a corresponding class label $\mathbf{y}_j$ expressed as a one-hot vector over K classes. The aim is to find the parameters in $\Theta$ that minimize the empirical risk, defined as

$$R(\Theta) = -\frac{1}{J}\sum_{j=1}^J\sum_{k=1}^K y_j^k \log f_k(\mathbf{x}_j), \tag{3.14}$$

where $y_j^k = 1$ for the correct class and $f_k(\mathbf{x}_j)$ is given by (3.13) with $\mathbf{g}(\cdot) = \mathrm{softmax}(\cdot)$. The general approach used to find the $\Theta$ that minimizes the empirical risk is known as gradient descent and is an iterative process. In order to perform the algorithm, the gradient of the empirical risk with respect to the different parameters has to be calculated. An efficient way of doing this is to use a method called backpropagation, which takes advantage of the chain rule. Firstly, (3.14) is rewritten as

$$R(\Theta) = -\frac{1}{J}\sum_{j=1}^J\sum_{k=1}^K y_j^k \log f_k(\mathbf{x}_j) = \frac{1}{J}\sum_{j=1}^J L(\mathbf{f}(\mathbf{x}_j), \mathbf{y}_j; \Theta),$$

where $L(\mathbf{f}(\mathbf{x}_j), \mathbf{y}_j; \Theta)$ is a loss function, being the cross-entropy in this case. Consequently, the gradient of $R(\Theta)$ is

$$\nabla_\Theta R(\Theta) = \frac{1}{J}\sum_{j=1}^J \nabla_\Theta L(\mathbf{f}(\mathbf{x}_j), \mathbf{y}_j; \Theta). \tag{3.15}$$

Therefore, minimizing the empirical risk is equivalent to minimizing the loss function over all observed $\mathbf{x}_j$. Looking at (3.15), it can be seen that the number of gradient terms increases with the number of observations J. Hence, this can be very computationally expensive. However, there are two other options that can be used: mini batch and stochastic gradient descent. The idea behind these alternative approaches is to select a subset of the J observations (mini batch gradient descent) or to choose one random observation (stochastic gradient descent). The corresponding gradient then acts as an approximation of (3.15).


Assume that $\mathbf{W}^{(q)} = (w_{p,s}^{(q)})$ is the weight matrix between the input and hidden layer and $\mathbf{A}^{(q)} = (a_{s,k}^{(q)})$ is the weight matrix between the hidden and output layer at the q-th iteration. The following

$$\begin{aligned} \mathbf{u}_1^{(q)} &= \mathbf{W}^{(q)}\mathbf{x}_j, \\ \mathbf{u}_2^{(q)} &= \mathbf{u}_1^{(q)} + \boldsymbol{\beta}_0^{(q)}, \\ \mathbf{H}_1^{(q)} &= \sigma(\mathbf{u}_2^{(q)}), \\ \mathbf{u}_3^{(q)} &= \mathbf{A}^{(q)}\mathbf{H}_1^{(q)}, \\ \mathbf{u}_4^{(q)} &= \mathbf{u}_3^{(q)} + \boldsymbol{\alpha}_0^{(q)}, \\ \mathbf{f}(\mathbf{x}_j)^{(q)} &= \mathbf{g}(\mathbf{u}_4^{(q)}), \end{aligned}$$

is used as notation for the values at each node in the different layers, with the corresponding computational graph shown in Figure 3.5.

Figure 3.5 A computational graph of a FNN with one hidden layer.

The parameters in the gradient descent are updated according to

$$w_{p,s}^{(q+1)} = w_{p,s}^{(q)} - \lambda_q \sum_{j=1}^J \frac{\partial L}{\partial w_{p,s}^{(q)}}, \qquad a_{s,k}^{(q+1)} = a_{s,k}^{(q)} - \lambda_q \sum_{j=1}^J \frac{\partial L}{\partial a_{s,k}^{(q)}},$$

$$\beta_{0,s}^{(q+1)} = \beta_{0,s}^{(q)} - \lambda_q \sum_{j=1}^J \frac{\partial L}{\partial \beta_{0,s}^{(q)}}, \qquad \alpha_{0,k}^{(q+1)} = \alpha_{0,k}^{(q)} - \lambda_q \sum_{j=1}^J \frac{\partial L}{\partial \alpha_{0,k}^{(q)}},$$

where $\lambda_q$ is a parameter known as the learning rate, which specifies the step size within each iteration, and the derivatives $\frac{\partial L}{\partial w_{p,s}^{(q)}}$, $\frac{\partial L}{\partial a_{s,k}^{(q)}}$, $\frac{\partial L}{\partial \beta_{0,s}^{(q)}}$ and $\frac{\partial L}{\partial \alpha_{0,k}^{(q)}}$ are calculated using the backpropagation method

$$\frac{\partial L}{\partial w_{p,s}^{(q)}} = \frac{\partial L}{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)} \frac{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)}{\partial \mathbf{u}_4^{(q)}} \frac{\partial \mathbf{u}_4^{(q)}}{\partial \mathbf{u}_3^{(q)}} \frac{\partial \mathbf{u}_3^{(q)}}{\partial \mathbf{H}_1^{(q)}} \frac{\partial \mathbf{H}_1^{(q)}}{\partial \mathbf{u}_2^{(q)}} \frac{\partial \mathbf{u}_2^{(q)}}{\partial \mathbf{u}_1^{(q)}} \frac{\partial \mathbf{u}_1^{(q)}}{\partial w_{p,s}^{(q)}},$$

$$\frac{\partial L}{\partial a_{s,k}^{(q)}} = \frac{\partial L}{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)} \frac{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)}{\partial \mathbf{u}_4^{(q)}} \frac{\partial \mathbf{u}_4^{(q)}}{\partial \mathbf{u}_3^{(q)}} \frac{\partial \mathbf{u}_3^{(q)}}{\partial a_{s,k}^{(q)}},$$

$$\frac{\partial L}{\partial \beta_{0,s}^{(q)}} = \frac{\partial L}{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)} \frac{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)}{\partial \mathbf{u}_4^{(q)}} \frac{\partial \mathbf{u}_4^{(q)}}{\partial \mathbf{u}_3^{(q)}} \frac{\partial \mathbf{u}_3^{(q)}}{\partial \mathbf{H}_1^{(q)}} \frac{\partial \mathbf{H}_1^{(q)}}{\partial \mathbf{u}_2^{(q)}} \frac{\partial \mathbf{u}_2^{(q)}}{\partial \beta_{0,s}^{(q)}},$$

$$\frac{\partial L}{\partial \alpha_{0,k}^{(q)}} = \frac{\partial L}{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)} \frac{\partial \mathbf{f}^{(q)}(\mathbf{x}_j)}{\partial \mathbf{u}_4^{(q)}} \frac{\partial \mathbf{u}_4^{(q)}}{\partial \alpha_{0,k}^{(q)}}.$$

For q = 0, the parameters in $\Theta$ are usually randomized or assigned from a specific probability distribution, such as the Gaussian distribution. Summarizing, the training of a FNN can be divided into two parts (a small sketch follows the list):

1. Forward pass: The initial weights, biases and input data are passed through the network to calculate the values at each node of the layers.

2. Backward pass: The derivatives with respect to the parameters are calculated by applying backpropagation, starting from the last layer, and the parameters are then updated using gradient descent.
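The sketch below ties the two passes together for a single observation, assuming ReLU hidden activations and the softmax/cross-entropy pair (which gives the convenient gradient $\partial L/\partial \mathbf{u}_4 = \mathbf{f} - \mathbf{y}$); it is an illustrative NumPy implementation, not the thesis' training code:

```python
import numpy as np

def train_step(x, y_onehot, W, b0, A, a0, lr=0.1):
    """One gradient-descent step for a one-hidden-layer FNN."""
    # Forward pass
    u2 = b0 + W @ x
    H = np.maximum(0.0, u2)                  # sigma = ReLU
    u4 = a0 + A @ H
    e = np.exp(u4 - u4.max())
    f = e / e.sum()                          # softmax output

    # Backward pass: softmax + cross-entropy yields dL/du4 = f - y
    d4 = f - y_onehot
    dA, da0 = np.outer(d4, H), d4
    d2 = (A.T @ d4) * (u2 > 0)               # chain rule through ReLU
    dW, db0 = np.outer(d2, x), d2

    # Gradient-descent update with learning rate lr
    return W - lr * dW, b0 - lr * db0, A - lr * dA, a0 - lr * da0
```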


3.3.4 Additional Concepts of Neural Network

Besides obtaining the optimal weights and biases, another crucial factor in achieving an optimal FNN is the hyperparameters of the network. These include the design of the hidden layers, regularization, starting values, the choice of learning rate and the preprocessing of the input data.

Design of Hidden Layer

In general, there is no common rule for the number of hidden layers or nodes that a network should have, since it is problem dependent. However, various guidelines can be found based on the experience of others. One guideline is that the number of hidden neurons should be around 2/3 or 50% of the input size [10]. Furthermore, the universal approximation theorem states the following:

“A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^p$ under mild assumptions on the activation function.” [6]

Therefore, it is beneficial to start with one single hidden layer and gradually increase the depth if the problem is too complex.

Regularization

Overfitting is usually one of the main problems when designing a classifier; that is, the model fits the training data perfectly while failing to predict untrained or future observations. In order to avoid this phenomenon, various regularization techniques are used, e.g. dropout. Dropout is a commonly used and simple approach to avoid overfitting, where the basic idea is to specify a probability p with which each node in a hidden layer is deactivated by setting it to zero. Adding dropout is similar to creating multiple neural networks with different architectures, called ensembling, which is known for combating overfitting [5]. The idea is that every iteration during training corresponds to a new architecture of the network, since neurons are dropped randomly. Therefore, the finalized model can be seen as an average over a large number of networks, which is the concept behind ensembling. In addition, the neurons are forced to learn features independently of the presence of other neurons, thus creating a more robust network.
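As an illustration, dropout layers can be inserted between dense layers in, for example, Keras; the sketch below is a hypothetical two-hidden-layer FNN (the framework, layer sizes and rates are assumptions for illustration, not the thesis' architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two-hidden-layer FNN with dropout after each hidden layer
model = keras.Sequential([
    keras.Input(shape=(24,)),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),              # each hidden node dropped with p = 0.5
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```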

Starting Value

When the empirical risk is nonconvex, it often has multiple local minima. Hence, the solution likely depends on the initial weights. In order to avoid getting stuck at a poor local minimum, it is favorable to try different initial weights and investigate how the model performs.

Learning Rate

Choosing the correct learning rate is a challenge in itself. The learning rate is the amount by which the weights in the network are adjusted with respect to the gradient of the loss function, i.e. the step size taken towards the minimum of the empirical risk in each iteration. Too large a learning rate might cause the training to diverge, while a low learning rate results in a very long training time. Typically, setting the learning rate is an iterative process based on previous experience. A well-known approach, however, is to adapt the learning rate of each parameter during training, as done by the adaptive moment estimation (Adam) optimizer [15].
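A minimal sketch of selecting Adam with an explicit learning rate in Keras (the model and rate below are placeholders, not a configuration from this thesis):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Adam adapts the effective step size per parameter; the learning_rate
# argument sets the global base rate.
model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(8,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```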

Preprocessing of Input Data

Preprocessing of the input data is a strategy that can improve the performance and running time of a network [19]. One technique is dimension reduction, e.g. principal component analysis, resulting in fewer parameters for the network to estimate. The input data is also commonly scaled, since the samples might come in different units or scales. Standardizing the input to have zero mean and unit variance is frequently used when training neural networks.
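A sketch of the standardization step with sklearn (the data matrices are hypothetical); note that the scaler is fitted on the training data only, so no test-set statistics leak into the preprocessing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randn(100, 3072)   # hypothetical training inputs
X_test = np.random.randn(20, 3072)     # hypothetical test inputs

scaler = StandardScaler().fit(X_train)    # learns per-feature mean and std
X_train_std = scaler.transform(X_train)   # zero mean, unit variance
X_test_std = scaler.transform(X_test)
```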

3.4 Performance Measurements

One way to evaluate the results in the classification setting is to make use of the confusion matrix. A confusion matrix allows one to visualize how well a classifier is performing. In the matrix, the rows represent the actual class and the columns correspond to the predicted class. An example of a confusion matrix in the multi-classification setting with K classes has the following form:

\[
\begin{pmatrix}
c_{1,1} & c_{1,2} & \dots & c_{1,K-1} & c_{1,K} \\
c_{2,1} & c_{2,2} & \dots & c_{2,K-1} & c_{2,K} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
c_{K-1,1} & c_{K-1,2} & \dots & c_{K-1,K-1} & c_{K-1,K} \\
c_{K,1} & c_{K,2} & \dots & c_{K,K-1} & c_{K,K}
\end{pmatrix}, \tag{3.16}
\]


where element $c_{i,j}$ counts predictions of class j, for j = 1, 2, ..., K, when the true class label is i, for i = 1, 2, ..., K. Whenever i = j, a correct classification is made; these elements lie on the main diagonal of the confusion matrix. In contrast, if i ≠ j, a misclassification has been made. From the confusion matrix, multiple performance metrics can be obtained, one of them being the accuracy, defined as

\[
\frac{\sum_{i=1}^{K} c_{i,i}}{\sum_{i=1}^{K} \sum_{j=1}^{K} c_{i,j}},
\]

which is a commonly used evaluation metric under the assumption that the class labels are balanced, i.e. that the data set contains an approximately equal number of samples per class. A special case is the binary classification setting, with one class labeled as 1 (negative) and the other as 2 (positive). Matrix (3.16) then reduces to

\[
\begin{bmatrix}
c_{1,1} & c_{1,2} \\
c_{2,1} & c_{2,2}
\end{bmatrix},
\]

where $c_{1,1}$ is the number of true negatives, $c_{2,2}$ the number of true positives, $c_{1,2}$ the number of false positives and $c_{2,1}$ the number of false negatives [25].
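A small sketch of computing the confusion matrix and the accuracy defined above with sklearn (the labels are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])  # actual classes (rows)
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])  # predicted classes (columns)

C = confusion_matrix(y_true, y_pred)
accuracy = np.trace(C) / C.sum()   # sum of diagonal c_ii over all entries
print(C)
print(accuracy)
```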

3.5 Singular Value Decomposition

Singular value decomposition (SVD) is a linear-algebraic technique which states that any matrix M can be decomposed into simpler pieces according to

\[
M = U \Sigma V^{T}, \tag{3.17}
\]

where U and V are orthogonal matrices, while $\Sigma$ is a diagonal matrix holding the singular values $s_i$, which are the square roots of the eigenvalues $\lambda_i$ of $M^T M$ [8]. The term "simpler" is used in the sense that the matrices obtained after the decomposition have fewer parameters to infer. Geometrically, the three components of (3.17) correspond to the linear operations below, also illustrated in Figure 3.6.

• Matrices U and $V^T$ are rotation operations.

• $\Sigma$ corresponds to a scaling ("stretch") along the coordinate axes.


Figure 3.6 Geometrical interpretation of each component in the singular value decomposition. Reproduced and adapted from Johann [13].
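As a quick numerical check of (3.17) and of the relation between singular values and eigenvalues, consider the following numpy sketch (the matrix is an arbitrary random example):

```python
import numpy as np

M = np.random.randn(64, 12)   # an arbitrary real matrix

# Thin SVD: U and V have orthonormal columns, s holds the singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# The squared singular values equal the eigenvalues of M^T M.
eigenvalues = np.linalg.eigvalsh(M.T @ M)[::-1]  # sorted descending
assert np.allclose(s**2, eigenvalues)
assert np.allclose(M, U @ np.diag(s) @ Vt)       # M = U Sigma V^T
```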


Chapter 4

Experimental Setup

In this chapter, the methods used to answer the research questions are presented. These include the generation of the data, the formulation of the problem in a machine learning setting, and the implementation and evaluation of the algorithms.

4.1 Data Generation

The data was simulated using MATLAB with a scenario named Urban-Macro, which models all the necessary factors, such as the number of paths that can be taken and the environment. The outputs were the interference channel matrices $H_i$ and their corresponding normalized signals $s_{k,i}$ for four different interferers. Each interference channel matrix had a dimension of 64 × 2040, where 64 is the number of receiver antennas and 2040 is the number of subcarriers (170 RBs), and was generated over multiple transmission time intervals (TTI). In addition, these matrices could be divided into smaller interference channel matrices of dimension 64 × 12t, i.e. t RBs, where t is an integer. Five datasets were created,

\[
S_d = \{A_{d,i}, Z_{d,i}\}_{i=1}^{4}, \qquad d = 1, 2, \dots, 5,
\]

where $A_{d,i}$ is the AoD and $Z_{d,i}$ the ZoD of the i-th interferer in the d-th dataset. Furthermore, the parameters AoD and ZoD were chosen uniquely for each interferer and dataset,

\[
A_{d,i} \neq A_{l,i}, \quad Z_{d,i} \neq Z_{l,i}, \quad d \neq l,
\]
\[
A_{d,i} \neq A_{d,j}, \quad Z_{d,i} \neq Z_{d,j}, \quad i \neq j,
\]

which is illustrated in Figure 4.1.


Figure 4.1 Illustration of the simulated datasets $S_d$ for d = 1, 2, 3, 4, 5, each containing unique angular information.

The reason for creating multiple sets is to evaluate the generalization of the final machine learning models. Instead of evaluating the models on datasets with similar parameters, i.e. similar angular information, the confidence in a model can be increased by testing it on a set with completely different underlying parameters. Thus, sets S1–S4 were used during training, while S5 acted as the test set once the final model was obtained.

4.2 Bandwidth Detection

The aim of the first topic was to detect when a switch of source occurs in $V_l$, that is, whenever a new interferer appears or drops out between the RBs. As illustrated in Section 2.5.2, the observed matrix $V_l$ given by two interferers was expressed as

\[
V_l =
\begin{pmatrix}
\vdots & & \vdots & 0 & \cdots & 0 \\
h_{1,1} \cdot s_{1,1} & \cdots & h_{12,1} \cdot s_{12,1} & 0 & \cdots & 0 \\
\vdots & & \vdots & 0 & \cdots & 0
\end{pmatrix}
+
\begin{pmatrix}
0 & \cdots & 0 & \vdots & & \vdots \\
0 & \cdots & 0 & h_{13,2} \cdot s_{13,2} & \cdots & h_{24,2} \cdot s_{24,2} \\
0 & \cdots & 0 & \vdots & & \vdots
\end{pmatrix}
+ N.
\]


Clearly, a switch of source occurs between the first and second RB. In the opposite case, on the other hand,

\[
V_l =
\begin{pmatrix}
\vdots & & \vdots & & \vdots \\
h_{1,1} \cdot s_{1,1} & \cdots & h_{12,1} \cdot s_{12,1} & \cdots & h_{24,1} \cdot s_{24,1} \\
\vdots & & \vdots & & \vdots
\end{pmatrix}
+
\begin{pmatrix}
\vdots & & \vdots & & \vdots \\
h_{1,2} \cdot s_{1,2} & \cdots & h_{12,2} \cdot s_{12,2} & \cdots & h_{24,2} \cdot s_{24,2} \\
\vdots & & \vdots & & \vdots
\end{pmatrix}
+ N,
\]

the number of sources does not change and the signal comes from the same interferers in both RBs. Hence, the task was modeled as a binary classification problem with the following class labels:

• Class 0: No source switch.

• Class 1: A source switch.

A naive approach to the classification is to calculate the power of each subcarrier, defined as the Euclidean norm $\|v_{k,l}\|$, and then the mean power difference $\Delta P$ between the RBs. If $\Delta P$ is above a pre-defined threshold $\Delta P_{\text{threshold}}$, the sample is classified as class 1; otherwise it is classified as class 0. Generally, $\Delta P$ should be above $\Delta P_{\text{threshold}}$ for class 1. However, there are cases where this does not hold, and these data samples will be referred to as edge cases. This is also shown in Figure 4.2.

Figure 4.2 Power of 24 subcarriers, i.e. two RBs, in the observed matrix $V_l$, illustrating (a) normal cases and (b) edge cases.


The downside of such an approach is that important statistical information other than the power might be excluded, leading to undesirable results. The threshold method was therefore used only as a baseline for comparison.
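A sketch of this baseline is given below (assuming a 64 × 24 complex block spanning two RBs; the threshold value is a free parameter that has to be tuned):

```python
import numpy as np

def threshold_classifier(V, threshold):
    """Naive baseline: classify a source switch between two RBs.

    V is a 64 x 24 observed block (two RBs of 12 subcarriers each).
    The power of each subcarrier is its Euclidean norm; if the mean
    power difference between the RBs exceeds the threshold, predict
    class 1 (source switch), otherwise class 0.
    """
    power = np.linalg.norm(V, axis=0)          # per-subcarrier power
    delta_p = abs(power[:12].mean() - power[12:].mean())
    return int(delta_p > threshold)
```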

4.2.1 Data Preparation

Since the objective was to determine whether a source switch occurs between two RBs, it was natural to construct data samples of said RBs. The simulated interference channel matrices introduced in Section 4.1 were divided into 85 samples $H_i \in \mathbb{C}^{64 \times 24}$, where i = 1, 2, 3, 4, for each of the different data sets. Furthermore, the k-th column of $V_l$ was created according to

\[
V_l(k) = h_{k,1} s_{k,1} + h_{k,2} s_{k,2} + h_{k,3} s_{k,3} + h_{k,4} s_{k,4} + n_k, \qquad k = 1, 2, \dots, 24, \tag{4.1}
\]

where $h_{k,i}$ is the k-th column of $H_i$ with a randomly assigned bandwidth of zero, one or two RBs, $s_{k,i}$ is the normalized signal for the k-th subcarrier from the i-th interferer and $n_k \in \mathbb{C}^{64}$ is white Gaussian noise with zero mean and variance $\sigma_s^2$. The noise variance was set such that the following SNRs, expressed in dB, were obtained: ideal (no noise), ten, five, one and zero. Each sample $V_l$ was then labelled as either class 1 or class 0. In addition, some of the $V_l$ belonging to class 1 were constructed such that the mean power difference between the RBs would be low on purpose, i.e. additional edge cases were generated. Class imbalance was taken into account; hence, the training and test data comprised approximately 50% of each class. At the time of writing, there exists no Python machine learning library that can train models directly on complex numbers. To work around this, the input data $V_l$ was split into its real and imaginary parts, which together were treated as the final input to the model. The input was also standardized to have zero mean and unit variance.
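A sketch of this conversion (the concatenation order of the real and imaginary parts is an assumption); for a 64 × 24 block it yields a real vector of size 2 · 64 · 24 = 3072, matching the FNN input dimension reported later:

```python
import numpy as np

def to_real_input(V):
    """Flatten a complex observation V_l into a real-valued input vector."""
    v = V.ravel()                              # flatten the 64 x 24 block
    return np.concatenate([v.real, v.imag])    # real part, then imaginary
```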

4.2.2 Feedforward Neural Network

The machine learning algorithm applied to the first topic was a FNN. This choice was motivated by the wish to capture not only information in the power domain, but also other hidden statistical information, such as the angular information. As a result, the model should handle the edge cases better than the naive threshold method. The FNN was implemented in Python using the libraries TensorFlow and Keras.

Designing the architecture of a robust FNN requires a lot of testing, since it involves tuning multiple hyperparameters, such as the depth of the network, the number of nodes, the activation functions and the learning rate. Unfortunately, no common way of choosing such hyperparameters exists. In this thesis, multiple configurations were constructed, trained and evaluated based on the accuracy on the test sample. The initial weights were randomly initialized with a predetermined seed in order to make the experiment repeatable. Various numbers of hidden layers, one to three, were combined with different learning rates ranging from 0.1 to 0.0001 using the Adam optimizer. The batch size was 128 and the number of neurons was set arbitrarily. To avoid overfitting, dropout layers with a probability of 50% were included between the layers. Since it was a binary classification problem, the natural choice of activation function between the last hidden layer and the output layer was the sigmoid function, a special case of the softmax function. For the other layers, the activation functions ReLU, LeakyReLU and sigmoid were tested.
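For illustration, a sketch of one such configuration in Keras is shown below. The hidden-layer widths are assumptions, chosen so that the parameter count lands near the 8.4 million reported for the two-hidden-layer model L2 in Chapter 6; the exact architectures are given in Figures 5.1–5.3.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two hidden layers with ReLU, 50% dropout in between, sigmoid output.
model = keras.Sequential([
    layers.Dense(2048, activation="relu", input_shape=(3072,)),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# Training as described: batch size 128, 20% of S1-S4 held out for validation.
# model.fit(X_train, y_train, batch_size=128, validation_split=0.2)
```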

4.3 Source Estimation

As for the second topic, the objective was to estimate the number of sources in each RB. The method used was inspired by Yamamoto et al. [28], who estimated the number of sources in a reverberant sound field. The main idea is to explicitly calculate the covariance matrix of the observed data, as in (2.7). SVD is then applied to obtain the eigenvalues, i.e. the squares of the diagonal values of $\Sigma$ (Section 3.5). The number of dominant eigenvalues is equivalent to the number of sources. The challenge, however, is to find an algorithm that can define what dominant means. Figure 4.3 shows an instance taken from the training data to illustrate this issue. Logistic regression was used as a baseline method for comparison with the finalized models.

Figure 4.3 Eigenvalue distributions for different numbers of sources at zero SNR.


4.3.1 Data Preparation

Compared to the data preparation for the bandwidth detection, the input here was the eigenvalues of the averaged covariance matrix over 12 subcarriers, i.e. one RB, since the number of sources within one RB is the same. Thus, 170 samples $H_i \in \mathbb{C}^{64 \times 12}$ with i = 1, 2, 3, 4 were created from each of the five data sets. Similarly, each $H_i$ was assigned a bandwidth of zero or one RB. Equation (4.1) was then used to generate $V_l$ with the same SNRs as before, but for k = 1, 2, ..., 12. The covariance matrix of each column $v_{k,l}$ of $V_l$ was calculated and then averaged. SVD was applied to the averaged covariance matrix and the eigenvalues were identified. The eigenvalues were then standardized and became the input data, with a corresponding label c ∈ {0, 1, 2, 3, 4}.
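A sketch of this feature extraction is given below. Since the averaged covariance matrix is Hermitian and positive semidefinite, its singular values coincide with its eigenvalues; whether the standardization is done per sample, as here, or over the whole data set is an assumption.

```python
import numpy as np

def rb_eigenvalues(V):
    """Extract standardized eigenvalue features from one RB.

    V is a 64 x 12 complex block (one RB). The covariance matrix
    v_k v_k^H of each subcarrier column is averaged over the RB,
    and SVD of the averaged matrix yields its eigenvalues.
    """
    R = sum(np.outer(v, v.conj()) for v in V.T) / V.shape[1]
    eig = np.linalg.svd(R, compute_uv=False)   # descending eigenvalues
    return (eig - eig.mean()) / eig.std()      # standardized features
```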

4.3.2 Support Vector Machine

Following a previously reported protocol [28], the algorithm chosen to identify the dominant eigenvalues was the SVM. The method was implemented in Python using the library sklearn. An advantage of this library is the built-in class GridSearchCV, which trains SVMs for different combinations of hyperparameters, such as the tuning parameter C and the kernel, and returns the best model with its corresponding parameters. What constitutes the best model depends on which performance metric one aims to optimize; in this case, the mean accuracy based on a 3-fold cross-validation over the training sets S1–S4 was used. Three different kernels were tested: linear (polynomial of degree one), RBF and sigmoid, shown in Table 3.1. These were combined with the parameters C ∈ {0.01, 0.1, 1, 10} and γ ∈ {0.001, 0.01, 0.1, 1}, where γ is a parameter of the RBF and sigmoid kernels. Since it was a multi-classification task, the one-vs-one approach was applied, which is also the default in sklearn. The finalized model was evaluated on the untrained dataset S5, based on the accuracy.
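A sketch of the grid search with sklearn is shown below (the feature matrix and labels are hypothetical placeholders); the grid matches the C, γ and kernel values listed above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical standardized eigenvalue features and source-count labels.
X = np.random.randn(200, 64)
y = np.random.randint(0, 5, size=200)

param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10]},
    {"kernel": ["rbf", "sigmoid"], "C": [0.01, 0.1, 1, 10],
     "gamma": [0.001, 0.01, 0.1, 1]},
]
# 3-fold cross-validation over the grid; SVC handles multi-class
# classification with the one-vs-one approach by default.
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```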


Chapter 5

Results

In the following chapter, the results are presented. These include the bandwidth detection, modelled as a binary classification problem, and the source estimation, modelled as a multi-classification problem.

5.1 Binary Classification

To begin with, three configurations of the FNN models are presented. In the models, all the hyperparameters are fixed aside from the number of hidden layers. The activation function between the hidden layers is ReLU, while sigmoid is used for the output layer. Moreover, the learning rate is set to 0.0001 with Adam as the optimizer and accuracy as the validation metric. The models are trained on data sets S1–S4, described in Section 4.1, with input vectors of size 3072. A summary of each configuration, including the number of estimated parameters as well as the number and type of layers, is shown in Figures 5.1–5.3.

Figure 5.1 Summary of the FNN model with one hidden layer.


Figure 5.2 Summary of the FNN model with two hidden layers.

Figure 5.3 Summary of the FNN model with three hidden layers.

Furthermore, the training history of each FNN is shown in Figure 5.4. Note that the validation set in the figures is drawn from data sets S1–S4, more precisely 20% of them.


(a) Accuracy for one hidden layer. (b) Loss for one hidden layer.

(c) Accuracy for two hidden layers. (d) Loss for two hidden layers.

(e) Accuracy for three hidden layers. (f) Loss for three hidden layers.

Figure 5.4 Training history of the different FNN models with one, two and three hidden layers.


The test accuracy for the different configurations, with the threshold method as the baseline, is shown in Figure 5.5 with the corresponding numerical values in Table 5.1. The evaluation is based on data set S5 and contains mixed data samples of edge and normal cases for different SNRs.

Figure 5.5 Comparison between the FNN models with different numbers of hidden layers and a simple threshold method. Evaluation based on dataset S5 with mixed test samples.

Table 5.1 Accuracy of each model for five different SNRs, evaluated on dataset S5 with mixed test samples.

Model        SNR 0     SNR 1     SNR 5     SNR 10    Ideal
Threshold    78.19%    77.93%    75.72%    73.54%    71.78%
L1           83.62%    83.16%    79.33%    75.44%    74.24%
L2           95.40%    95.86%    96.77%    97.14%    97.25%
L3           95.71%    96.16%    97.03%    97.37%    97.49%


Similarly, the same results are shown for data samples with only edge cases, evaluated at different SNRs, as seen in Figure 5.6 and Table 5.2.

Figure 5.6 Comparison between the FNN models with different numbers of hidden layers and a simple threshold method. Evaluation based on dataset S5 with edge case test samples.

Table 5.2 Accuracy of each model for five different SNRs, evaluated on dataset S5 with edge case test samples.

Model        SNR 0     SNR 1     SNR 5     SNR 10    Ideal
Threshold    63.20%    63.20%    61.50%    59.40%    57.40%
L1           90.37%    89.39%    84.57%    80.62%    79.59%
L2           96.83%    97.28%    98.13%    98.48%    98.60%
L3           96.98%    97.44%    98.26%    98.56%    98.69%

Also, four confusion matrices for the L2 and L3 configurations, two for the edge and two for the mixed test samples, are shown in Figures 5.7–5.8. Only the ideal and zero SNRs are presented, as the other SNRs show similar behavior.


(a) L3 with ideal SNR. (b) L3 with zero SNR.

(c) L2 with ideal SNR. (d) L2 with zero SNR.

Figure 5.7 Confusion matrices for the FNN models L2 and L3, based on mixed test samples, for the ideal and zero SNR.


(a) L3 with ideal SNR. (b) L3 with zero SNR.

(c) L2 with ideal SNR. (d) L2 with zero SNR.

Figure 5.8 Confusion matrices for the FNN models L2 and L3, based on edge case test samples, for the ideal and zero SNR.


5.2 Multi-classification

The results from the grid search of the SVM with the linear, RBF and sigmoid kernel functions are shown in Tables 5.3–5.5.

Table 5.3 Grid search history of the SVM with the linear kernel function. Mean accuracy based on the training data samples S1–S4 with 3-fold cross-validation.

Configuration  Mean accuracy  St. deviation  C     γ
0              96.89%         0.20%          0.01  –
1              98.55%         0.12%          0.1   –
2              98.88%         0.14%          1     –
3              98.98%         0.16%          10    –

Table 5.4 Grid search history of the SVM with the RBF kernel function. Mean accuracy based on the training data samples S1–S4 with 3-fold cross-validation.

Configuration  Mean accuracy  St. deviation  C     γ
0              46.20%         0.64%          0.01  0.001
1              63.54%         0.21%          0.01  0.01
2              74.07%         0.16%          0.01  0.1
3              48.78%         0.46%          0.01  1
4              70.63%         0.20%          0.1   0.001
5              89.25%         0.11%          0.1   0.01
6              92.65%         0.14%          0.1   0.1
7              86.87%         0.20%          0.1   1
8              92.49%         0.16%          1     0.001
9              97.03%         0.16%          1     0.01
10             97.60%         0.19%          1     0.1
11             95.40%         0.22%          1     1
12             97.70%         0.20%          10    0.001
13             98.58%         0.18%          10    0.01
14             98.47%         0.07%          10    0.1
15             95.94%         0.21%          10    1


Table 5.5 Grid search history of the SVM with the sigmoid kernel function. Mean accuracy based on the training data samples S1–S4 with 3-fold cross-validation.

Configuration  Mean accuracy  St. deviation  C     γ
0              39.21%         0.38%          0.01  0.001
1              55.12%         0.11%          0.01  0.01
2              34.43%         0.12%          0.01  0.1
3              33.06%         1.08%          0.01  1
4              62.96%         0.48%          0.1   0.001
5              56.67%         3.82%          0.1   0.01
6              25.51%         4.03%          0.1   0.1
7              27.21%         0.51%          0.1   1
8              87.71%         0.28%          1     0.001
9              56.92%         4.72%          1     0.01
10             24.98%         3.60%          1     0.1
11             27.17%         0.65%          1     1
12             90.37%         0.41%          10    0.001
13             56.33%         3.99%          10    0.01
14             23.78%         4.52%          10    0.1
15             26.98%         0.64%          10    1

In addition, the best configuration for each kernel, with its corresponding hyperparameters, is shown in Table 5.6; these were used as the finalized models.

Table 5.6 Finalized models based on the grid search of the linear, RBF and sigmoid kernels.

Kernel   Configuration  Mean accuracy  St. deviation  C   γ
Linear   3              98.98%         0.16%          10  –
RBF      13             98.58%         0.18%          10  0.01
Sigmoid  12             90.37%         0.41%          10  0.001

Based on the final models from the grid search, the accuracy of each, including the logistic regression classifier, is shown in Figure 5.9 and Table 5.7.


Figure 5.9 Comparison between the different kernels used in the SVM. A baseline method based on logistic regression is also included. Evaluation based on dataset S5.

Table 5.7 Accuracy for the different kernels used in the SVM as well as a logistic regression classifier, evaluated on dataset S5 for five different SNRs.

Model                SNR 0     SNR 1     SNR 5     SNR 10    Ideal
Logistic Regression  97.77%    98.05%    98.21%    96.67%    94.98%
Linear               97.94%    98.74%    99.48%    99.68%    99.66%
RBF                  97.82%    98.13%    99.41%    99.65%    99.72%
Sigmoid              87.45%    88.80%    91.57%    92.29%    92.48%

As in the previous section, the confusion matrices for the ideal and zero SNR are shown in Figures 5.10–5.11.


(a) Linear kernel function. (b) RBF kernel function.

(c) Sigmoid kernel function.

Figure 5.10 Confusion matrices for the ideal SNR with different kernel functions in the SVM.


(a) Linear kernel function. (b) RBF kernel function.

(c) Sigmoid kernel function.

Figure 5.11 Confusion matrices for zero SNR with different kernel functions in the SVM.


Chapter 6

Discussion

The results obtained for each topic seem very promising. To begin with, regardless of the configuration used in the bandwidth prediction, each FNN model exceeds the threshold approach. In addition, there is a large gap in accuracy, at least 6%, between L1 and the deeper models L2 and L3. This could be explained by the fact that with only one hidden layer, the FNN is not able to capture the important features and interpret the underlying structure correctly, leading to a lower accuracy. Another interesting aspect is that L1 classifies data with lower SNRs better than data with higher SNRs, in contrast to L2 and L3, as seen in Tables 5.1–5.2. Intuitively, the FNN models should be better at classifying data with higher SNRs, as L2 and L3 are, since that data is less noisy, thus allowing easier identification of the underlying structure and important features. An explanation for this phenomenon could be that the pretrained model L1 subtracts or removes too much noise when a prediction of a new data point is made. As a result, the model interprets the low SNR data samples as high and the higher SNRs as low. Note that this type of behavior is seen in the threshold method as well. Consequently, only L2 and L3 are discussed further.

The difference in accuracy between L2 and L3 is minimal: around 0.5% for the mixed samples and approximately 0.1% for the edge samples, as shown in Tables 5.1–5.2. The number of parameters to be estimated differs between L2 and L3, the former having 8.4 million and the latter 8.9 million, corresponding to an increase of approximately 6%. Thus, there is a trade-off between the number of parameters and the accuracy: an improvement of approximately 0.5% or 0.1% in accuracy costs a 6% increase in the number of parameters to be estimated. Moreover, the finalized models L2 and L3 are vastly better at classifying class 0 than class 1, as can be seen in Figures 5.7–5.8. This could be because it is much harder for the models to distinguish the data points in class 1, due to their complex structure, while class 0 is more straightforward, surprisingly even in the cases of pure noise.


The models are actually better at predicting the edge cases than the mixed ones. However, these edge cases were defined by observing the power only. On the other hand, there might be other crucial factors in the mixed samples that are hard for the models to capture.

As for the second topic, the results obtained via the grid search implied that the linear kernel yields the highest accuracies, with the RBF kernel close behind. Both of these kernels surpass the baseline method, i.e. logistic regression. The sigmoid kernel, however, is not able to exceed the baseline and thus performs the worst among the kernel functions. Two arguments for why the linear kernel outperforms the sigmoid and RBF kernels are given below:

1. The data sample is almost linearly separable, so applying a nonlinear kernel would result in overfitting.

2. The grid search over the RBF and sigmoid kernels did not include the optimal combination of the hyperparameters γ and C; thus, the models' global optimum was not found.

Building on the second argument, "only" four different values of each of the hyperparameters γ and C were tested, so there might exist a combination (γ′, C′) that yields a better accuracy for the RBF and sigmoid kernels. However, it is infeasible to try all possible combinations.

Observing the history of the grid search, it is clear that the linear kernel yields similar results regardless of the C value, while the other kernels deviate a lot. In addition, applying a nonlinear kernel to data points that are almost linearly separable not only introduces the risk of overfitting, but also increases the computational time: the prediction cost of the RBF kernel exceeds that of the linear kernel by a factor of $n_{SV}$, the number of support vectors.

Another interesting aspect is that the models are much better at predicting the classes with a lower number of sources, as shown in the confusion matrices in Figures 5.10–5.11. As the number of interferers increases, the underlying parameters of the individual interference matrices add up to a mixture. For instance, some interference powers might be much weaker than others. As a result, the dominant sources (eigenvalues) might not be as clear as for a single interferer, which makes the classification harder. The results with respect to the SNRs seem reasonable and imply that the classifiers are more accurate for higher SNRs. The explanation is that for lower SNRs, the dominant eigenvalues corresponding to the sources are much harder to identify, since they are masked by the noise.

Based on the experimental and empirical findings, we can answer the research questions as follows:


1. A feedforward neural network with either two or three hidden layers, with the architectures shown in Figures 5.2–5.3, can be used to identify the bandwidth of the interference channel matrices.

2. Estimating the number of sources can be done by applying the support vector machinewith a linear kernel.

6.1 Critical Reflection

Firstly, all of the data samples used in this thesis were simulated. Even though the results seem promising, it can be argued that the models have to be tested on real data to be fully evaluated. In Section 4.1, it is mentioned that the aim is to generalize the final models. One data set (S5) with unique angular information was created and tested on. However, it only contained five different SNRs, while in reality there might be an infinite number of SNRs. In addition, the simulator used to generate the data was based on the specific scenario Urban-Macro, which has its own fixed parameters; this is not the case in reality. In order to truly evaluate the generalization of the models, these factors should also have been included. Moreover, the inputs were split from complex form into a real and an imaginary part (first topic), and by doing so, important information about the underlying structure could be lost. Finally, the results were only based on the accuracy of the obtained models; other performance metrics, such as the F1-score, precision and recall, could have been added to get a better understanding of how the models perform for specific classes.


Chapter 7

Conclusion

In conclusion, even though machine learning is a newly introduced concept within telecommunications, this thesis shows that it can be applied in the field to gain more knowledge about the underlying structure of the interference channel matrices. Bandwidth detection and estimation of the number of sources were investigated, for the purpose of using the results as prior knowledge for reconstructing the ideal interference plus noise covariance matrix. Bandwidth detection was done by applying a feedforward neural network with either two or three hidden layers, reaching accuracies of at least 95%. For the estimation of the number of sources, a support vector machine with a linear kernel was used and showed even better accuracies, ≥ 97%.

7.1 Further Research

As for further research, the topics could be extended to $N_u > 4$ interferers. Other types of generalization tests should also be performed, including different SNRs and scenarios. If possible, real data should be gathered, trained and evaluated on. As for the final goal, it should be investigated how this type of prior information about the interference channel matrices can be applied to reconstruct the ideal interference plus noise covariance matrix. This could be achieved by combining the prior knowledge with a denoising autoencoder or independent component analysis.


References

[1] Alfred, H. O. (2014). Telecommunications media. Encyclopedia Britannica.

[2] Baumgärtner, B. (2006). Multiple input multiple output. Wikipedia, webpage viewed 2019-05-03 at https://sv.wikipedia.org/wiki/Multiple_Input_Multiple_Output.

[3] Bhardwaj, M., Gangwar, A., and Soni, D. (2012). A review on OFDM: concept, scope and its applications. IOSR Journal of Mechanical and Civil Engineering, 1(1).

[4] Claesen, M., Smet, F. D., Suykens, A. K. J., and Moor, B. D. (2014). Fast prediction with SVM models containing RBF kernels. arXiv, 3(1).

[5] Clarke, J. and Seo, D. (2008). An ensemble approach to improved prediction from multitype data. arXiv, 3(1).

[6] Csáji, B. C. (2001). Approximation with artificial neural networks. Eötvös Loránd University, Hungary.

[7] Dey, S. (2018). Implementing a soft-margin kernelized support vector machine binary classifier with quadratic programming in R and Python. Webpage viewed 2019-05-03 at https://www.datasciencecentral.com/profiles/blogs/implementing-a-soft-margin-kernelized-support-vector-machine.

[8] Guruswami, V. and Kannan, R. (2012). Singular value decomposition. Lecture notes, Carnegie Mellon School of Computer Science.

[9] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.

[10] Heaton, J. (2008). Introduction to Neural Networks for Java. Heaton Research, Inc., Washington.

[11] Izenman, A. J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, Pennsylvania.

[12] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2015). An Introduction to Statistical Learning with Applications in R. Springer, New York.

[13] Johann, G. (2010). Singular value decomposition. Wikipedia, webpage viewed 2019-05-03 at https://en.wikipedia.org/wiki/Singular_value_decomposition.

[14] Kim, E. (2017). Everything you wanted to know about the kernel trick. Webpage viewed 2019-05-03 at http://www.eric-kim.net/eric-kim-net/posts/1/kerneltrick.html.

[15] Kingma, P. D. and Ba, J. L. (2015). Adam: A method for stochastic optimization. ICLR Conference Publication, San Diego, USA.

[16] Krumbein, A. (2016). Understanding the basics of MIMO communication technology. Southwest Antennas.

[17] Ku, G. (2011). Resource allocation in LTE. Adaptive Signal Processing and Information Theory Research Group.

[18] Lin, X. Y. (2017). A machine learning based approach for the link-to-system mapping problem. KTH Royal Institute of Technology, Stockholm.

[19] Nawi, M. N., Atomi, H. W., and Rehman, M. Z. (2013). The effect of data pre-processing on optimized training of artificial neural networks. ICEEI Conference Publication, 11(1):32–39.

[20] Negro, F., Shenoy, P. S., Ghauri, I., and Slock, T. D. (2010). On the MIMO interference channel. IEEE Conference Publication, San Diego, USA.

[21] Odlyzko, A. (2015). The growth rate and the nature of internet traffic. University of Minnesota, School of Mathematics.

[22] Rahman, I. M., Carvalho, E., and Prasad, R. (2007). Impact of MIMO co-channel interference. IEEE Conference Publication, Athens, Greece.

[23] Rajput, M., Chakravarti, M., and Kothari, T. (2015). A comprehensive study on the applications of machine learning for diagnosis of cancer. arXiv, 4(1).

[24] Saleem, S. and Islam, U. L. (2011). Optimization of LSE and LMMSE channel estimation algorithms based on CIR samples and channel taps. IJCSI International Journal of Computer Science, 8(1).

[25] Sammut, C. (2010). Encyclopedia of Machine Learning and Data Mining. Springer, Boston.

[26] Thobaben, R. (2016). MIMO architectures: theoretical foundations of wireless communications. Lecture notes, KTH Royal Institute of Technology, Stockholm.

[27] Tse, D. and Viswanath, P. (2005). Fundamentals of Wireless Communication, pages 148–150. Cambridge University Press, Cambridge.

[28] Yamamoto, K., Asano, F., Rooijen, W., Ling, E., Yamada, T., and Kitawaki, N. (2003). Estimation of the number of sound sources using support vector machines and its application to sound source separation. IEEE Conference Publication, Hong Kong, China.

