Exploration of 5G Traffic Models using Machine...



  • Linköpings universitet
    SE–581 83 Linköping
    +46 13 28 10 00, www.liu.se

    Linköping University | Department of Computer and Information Science
    Master’s thesis, 30 ECTS | Datateknik
    2020 | LIU-IDA/LITH-EX-A--20/52--SE

    Exploration of 5G Traffic Models using Machine Learning
    Analys av trafikmodeller i 5G-nätverk med maskininlärning

    Aron Gosch

    Supervisor: Patrick Lambrix
    Examiner: Niklas Carlsson

    http://www.liu.se

  • Upphovsrätt

    Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

    Copyright

    The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

    © Aron Gosch


  • Abstract

    The Internet is a major communication tool that handles massive information exchanges, sees rapidly increasing usage, and offers an increasingly wide variety of services. In addition to these trends, the services themselves have highly varying quality of service (QoS) requirements, and network providers must take into account the frequent releases of new network standards like 5G. This has resulted in a significant need for new theoretical models that can capture different network traffic characteristics. Such models are important both for understanding the existing traffic in networks and for generating better synthetic traffic workloads. These workloads can be used to evaluate future generations of network solutions using realistic workload patterns under a broad range of assumptions, based on how the popularity of existing and future application classes may change over time. To better meet these changes, new flexible methods are required.

    In this thesis, a new framework aimed at analyzing large quantities of traffic data is developed and used to discover key characteristics of application behavior for IP network traffic. Traffic models are created by breaking down IP log traffic data into different abstraction layers with descriptive values. The aggregated statistics are then clustered using the K-means algorithm, which results in groups with closely related behaviors. Lastly, the model is evaluated with cluster analysis and three different machine learning algorithms to classify the network behavior of traffic flows. From the analysis framework, a set of observed traffic models with distinct behaviors is derived that may be used as building blocks for traffic simulations in the future. Based on the framework, we have seen that machine learning achieves high performance on the classification of network traffic, with a Multilayer Perceptron getting the best results. Furthermore, the study has produced a set of ten traffic models that have been demonstrated to be able to reconstruct traffic for various network entities.

  • Acknowledgments

    Thanks to my supervisor Georgios Almyras, Vengatanathan Krishnamoorthi, Vivek Dasari and the traffic modeling team at Ericsson for technical support during this thesis. I am very grateful to Niklas Carlsson and Pontus Sandberg for providing me with this thesis opportunity. Also, many thanks to my friends and family.


  • Contents

    Abstract

    Acknowledgments

    Contents

    List of Figures

    List of Tables

    1 Introduction
        1.1 Background
        1.2 Motivation
        1.3 Aim
        1.4 Research questions
        1.5 Delimitations

    2 Background
        2.1 Preliminaries
        2.2 Radio Access Network
        2.3 Network protocols
        2.4 Internet communication
        2.5 Traffic packet flow
        2.6 Machine learning algorithms
        2.7 Related work

    3 Method
        3.1 Investigation of application behavior
        3.2 Network traffic definition
        3.3 Data analysis
        3.4 PySpark
        3.5 System setup

    4 Results
        4.1 Data exploration
        4.2 Cluster analysis
        4.3 Classification
        4.4 Parameter study
        4.5 Traffic models
        4.6 Network content analysis

    5 Discussion
        5.1 Results
        5.2 Method
        5.3 The work in a wider context

    6 Conclusion
        6.1 Future work

    Bibliography

    A Appendix A

    B Appendix B
        B.1 Volume
        B.2 Duration
        B.3 IAT
        B.4 Overlap
        B.5 Packets
        B.6 Packet bursts for UL/DL
        B.7 Counters


  • List of Figures

    2.1 Traffic Forecast Update from the 2019 Cisco VNI Forecast.
    2.2 RAN-architectures.
    2.3 Interconnected mobile traffic network.
    2.4 The OSI reference model and protocol stack for TCP/IP.
    2.5 Data transmission with TCP and UDP.
    2.6 Datagram format for IPv4 protocol.
    2.7 Packet flow for network traffic.

    3.1 Proposed stack layers in network model.
    3.2 UE and application traffic for network levels.
    3.3 Visualization of how thresholds are used to divide packets into network-layer segments.
    3.4 CRISP-DM process [CRISP_DM_WIKI].
    3.5 Analysis framework for traffic data.
    3.6 Spark execution process.
    3.7 Spark and HDFS configurations.

    4.1 Fraction of service providers and application-clients from input data.
    4.2 CDF of packet characteristics.
    4.3 Elbow and Silhouette score for K-means model.
    4.4 Clustering with K-means model using t-SNE with ten clusters labeled with numbers zero to nine.
    4.5 Heat-maps for t-SNE representation of clustering model. The scatter-plots depict parameter intensity from low to high.
    4.6 Heat-map of average parameter values per cluster. Black boxes indicate missing values for a parameter. Distinct parameter values are separated with a horizontal line.
    4.7 CDF for cluster volume in Bytes. Input data is signified by thicker line.
    4.8 CDF of duration for bursts and connections in seconds.
    4.9 CDF of IAT in seconds.
    4.10 CDF for number of packets and volume for up/downlink.
    4.11 Cluster fraction of server IP addresses.
    4.12 Fraction of total percentage from clusters for application-clients and service providers.

    A.1 Correlation-matrix for all features in the input data.


  • List of Tables

    3.1 Format of statistics aggregated from IP-log data.
    3.2 Hardware specifications
    3.3 Software specifications

    4.1 Highly correlated feature groups.
    4.2 Conclusion from network layer study.
    4.3 Results for network service classifiers.
    4.4 Traffic model characteristic behaviors. Abbreviations in the table: Packet (P), Packet Bursts (PB), Connection Sessions (CS), Flow Sessions (FS), Uplink (UL) and Downlink (DL).
    4.5 Server IP connection to application-client and service provider.


  • 1 Introduction

    Understanding the characteristics of network traffic and their patterns in cellular radio access systems is vital to aspects ranging from design and implementation to testing and development of new network features and concepts. In contrast to the preceding network architectures, the new 5G-networks are predicted to be highly modular and are designed to meet future demands [1]. Furthermore, the introduction of 5G is foreseen to result in a massive shift in traffic behavior. Adopting viable strategies to meet these changes in network behavior will consequently be of critical importance in order to enable further technological advancement for the mobile network operators [2].

    Live networks are inherently dynamic environments where several traffic-generating applications evolve and run side-by-side with multiple users. The practical aspects of the characterization of an observed traffic behavior manifest themselves in the development of a mathematical representation that is known as a traffic model. Advanced traffic models that build upon a relevant set of traffic behaviors are thus necessary to provide a better understanding of a network and a significant basis from which to evaluate and improve the system in terms of design, troubleshooting, performance and dimensioning [3].

    Understanding the behavior of network applications is especially important since different applications have different demands on the quality of service. The interaction with an application in this context could, for instance, refer to Voice over IP (VoIP) calls with Skype, watching a video on YouTube or a user accessing emails with Outlook. Today, there exist many possible network analysis methods, such as Deep Packet Inspection (DPI) and IP port-lookup, but this master's thesis instead adopts another popular approach, which is to use statistical classification based on network packet traces.

    The goal of the thesis is to determine if this approach is viable to unearth application-level network traffic behavior in a 5G context. Data from the network is stored in IP packet header logs, where a line in the log contains all recorded information for a packet. Based on the data in the IP packet header logs, traffic models will be identified by investigating characteristic statistical distributions of the network data. This framework will provide a basis for using machine learning models to categorize application and user equipment (UE) behavior, with features such as burstiness, call duration or activity-inactivity intervals, and find significant correlations in the traffic of the network.


    1.1 Background

    Ericsson is a major telecommunications company with its headquarters based in Stockholm, Sweden. The new 5G-network is predicted to be more service-centric and thus requires a deeper understanding of the underlying network traffic behavior. For this reason, telecom providers such as Ericsson desire solutions that can be used to analyze massive data-flows and create reliable traffic models for the relatively small user base before the large-scale launch of the 5G-network.

    1.2 Motivation

    Generating application-level traffic models is a non-trivial task as there is an unknown mix of device types and configurations as well as different variations in user and application behavior. In accordance with traditional traffic modeling studies, real-life network data from customer measurements can be used in advanced data analytics methodologies to utilize the full potential of the data. This is beneficial both from the perspective of subscribers, who may get better service, and from that of the service provider, which can optimize the network based on indications from the traffic models.

    1.3 Aim

    The thesis aims to examine how complex behaviors in a communications network appear in IP packet header log data and to explore the patterns that may emerge from the analysis of user data. This kind of exploration is essential to construct reliable network traffic models that may be used by mobile network operators in the future.

    1.4 Research questions

    The following research questions are answered in the thesis:

    1. How can traffic characteristics be used for the design of traffic models that mimic characteristic behaviors of applications?

    2. Which machine learning clustering model works best to define groups of traffic characteristics that are most representative of the traffic behavior variety in a network?

    3. What traffic models can be defined by the identified groups of behaviors and how are these correlated with real-life applications?

    1.5 Delimitations

    The results in this thesis are based on user data provided by Ericsson and the traffic models generated may not be representative of other network types. Restrictions that can arise from big data processing are hardware limitations and data complexity, which may limit the amount of data that can be processed. Additionally, we have limited knowledge of packet content and of how network conditions may affect the results. For example, the same video watched by two UEs in the network can be played back at various degrees of quality even though both devices utilize the same application, e.g. due to limitations in bandwidth and access to the network. Lastly, the theoretical network model might be a simplification of reality and may not be precise enough to yield satisfactory results.


  • 2 Background

    This section introduces the relevant background necessary to understand the most critical concepts of the thesis.

    2.1 Preliminaries

    Today, mobile connectivity has become an essential part of our everyday life and most people already consider mobile services a necessity. Global mobile data traffic grew 71 percent, compared to the previous year, and reached 11.5 exabytes per month in 2017, which means that mobile data traffic had grown 17-fold over 5 years [4]. Thus, the task of creating reliable models that can be used in network analysis often involves processing of massive amounts of data, which is a considerable undertaking and of critical importance for any modern mobile network service provider. Additionally, the introduction of 5G is predicted to induce massive changes in data generated from the mobile networks. By 2022, 5G-traffic is predicted to make up more than ten percent of the total generated mobile traffic. A visualization of the predicted traffic growth can be found in Figure 2.1.

    In order to meet the growing demands, new network technology standards must evolve to address challenges in areas such as energy consumption, speed and connectivity [2]. The new development also involves changes in terms of the volume of data generated from the networks, the variety of the data from a larger number of different kinds of connected devices, and velocity, meaning the frequency of incoming data that needs to be processed. A key component to finding solutions to these problems is to explore and analyze the data generated from the network to be able to create reliable models. Understanding the behavior of traffic is essential to develop models that can be used to support the evolution of the network utilization.

    Figure 2.1: Traffic Forecast Update from the 2019 Cisco VNI Forecast.

    Due to the increasing complexity of networks and Internet traffic, big data-driven development and large-scale simulations are becoming increasingly relevant in the network field [5]. A popular method used to analyze network behavior is to perform simulations based on traffic generators. The generators use traffic models to generate series of packet flows that populate the network with traffic. Input to the models can be based on parameters such as data volume, packet burst patterns and inter-arrival time (IAT), which are aggregated from IP packet header logs that store massive amounts of information about network traffic. Establishing reliable models for traffic is essential in the dimensioning, design and test phases to discover new trends and use-cases and to optimize the performance of an existing system [3].
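    As an illustration of how parameters such as IAT, burst patterns and data volume could be aggregated from packet records, the following minimal sketch may help. It is not the framework developed in this thesis: the function names, the `(timestamp, size_bytes)` record layout, the gap threshold of 0.5 seconds and the sample packets are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Tuple

Packet = Tuple[float, int]  # (arrival time in seconds, size in bytes); hypothetical layout


def inter_arrival_times(timestamps: List[float]) -> List[float]:
    """Return the gaps between consecutive packet arrival times."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def split_into_bursts(packets: List[Packet], gap_threshold: float) -> List[List[Packet]]:
    """Group packets into bursts; a gap above gap_threshold starts a new burst."""
    bursts: List[List[Packet]] = []
    for packet in sorted(packets):
        if bursts and packet[0] - bursts[-1][-1][0] <= gap_threshold:
            bursts[-1].append(packet)  # still inside the current burst
        else:
            bursts.append([packet])  # first packet, or the gap was too long
    return bursts


def burst_summary(burst: List[Packet]) -> Dict[str, float]:
    """Aggregate simple per-burst statistics: packet count, volume and duration."""
    return {
        "packets": len(burst),
        "volume_bytes": sum(size for _, size in burst),
        "duration_s": burst[-1][0] - burst[0][0],
    }


# Hypothetical excerpt from a packet log.
packets = [(0.00, 1500), (0.01, 1500), (0.02, 400), (1.50, 100), (1.52, 1200)]
iats = inter_arrival_times([t for t, _ in packets])
bursts = split_into_bursts(packets, gap_threshold=0.5)  # threshold is an assumption
summaries = [burst_summary(b) for b in bursts]
```

    In this sketch the long gap between the third and fourth packet splits the trace into two bursts, and the per-burst summaries are the kind of aggregated values that could serve as input to a traffic model.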

    In a global scheme, as well as within companies such as Ericsson, traffic models are traditionally developed based on the UE behavior, without taking into account the app perspective, where many devices can be connected to the network and run different kinds of applications in parallel. This perspective is predicted to play a much larger role with the evolution of network technology [3]. It is thus necessary to investigate which parameters are crucial to include when designing new traffic models and selecting input for traffic generators. In this thesis, a four-layered network model is proposed with the goal of increasing granularity and capturing complex network behaviors. To solve this problem and identify the most critical parameters for a traffic model, we develop a framework for processing large quantities of network traffic data with PySpark and Hadoop.

    2.2 Radio Access Network

    Network traffic can be defined as communication between end-hosts that leads to an information exchange, which is packetized to be carried over digital networks and routed by equipment in the network. The devices in a mobile network are connected through a Radio Access Network (RAN), which consists of several base stations that provide over-the-air connectivity for the nearby area. Several UEs of different types (phones, computers, tablets, etc.) may be connected to a base station, which in turn connects to the core network. At the core level, critical decisions regarding UE access rights and connectivity to the Internet or other networks are handled [3]. There exist many different kinds of RAN-architectures based on various technology and application areas. Today, cellular networks are in the process of transitioning from Long Term Evolution (LTE) with 4G towards New Radio (NR) 5G. A high-level illustration of the most commonly used RAN-architectures can be found in Figure 2.2.

    The architectures for 4G LTE, 4G-5G dual connectivity and standalone 5G connect a user device to the core network through a series of intermediary steps. LTE in 4G communicates directly over an evolved Node B (eNB). Dual connectivity is a solution that enables interactions between 4G and 5G by establishing a link between a Master evolved Node B (MeNB) and a Secondary next generation Node B (SgNB). This makes it possible to boost the performance of a standard 4G network and is a quick route for an operator to get some 5G connectivity benefits before full standalone deployment [6]. The final RAN deployment mode is standalone 5G, which connects a device to the 5G core network directly through the gNB. The traffic in the network can be divided into two different categories: a user plane for data traffic and a control plane that contains administrative traffic from routing protocols. Together the base stations make up a large-scale mobile network. Figure 2.3 contains a small representation of the mobile network structure.

    Figure 2.2: RAN-architectures.

    Figure 2.3: Interconnected mobile traffic network.


    2.3 Network protocols

    Network communication is built on the utilization of several protocols, which are models connected to different system structures within the network. A protocol defines the format and the order of messages exchanged between two or more communicating entities, as well as the actions taken on the transmission or receipt of a message or other event [7]. Different protocols are used to accomplish various communication tasks and all activity over the Internet that involves two or more communicating remote entities is governed by a protocol. A communication network can be organized as a stack of layers that can be implemented in either software or hardware. The layered architecture has several benefits and can provide an abstraction of a complex system that is relatively modular and easy to understand. The combination of all the protocols used by the network layers is called a protocol stack.

    Protocol layering has several conceptual and structural advantages and increases the modularity of network functionality at the cost of some overhead for the lower layers. Each layer in the protocol stack implements a service via its own internal actions, which rely on services provided by the layers below in the protocol stack. Two of the most commonly studied protocol stacks are the Open Systems Interconnection (OSI) protocols and the TCP/IP family, which are used exclusively for Internet communication. Each protocol stack model consists of a number of layers that are connected to a certain network-level functionality [8]. The structure of the OSI and TCP/IP network protocol models can be found in Figure 2.4. As seen in Figure 2.4, the traditional OSI model uses seven abstraction layers while the TCP/IP family only uses five layers. The main difference between the two models is that TCP/IP reduces the number of layers, which leads to lower complexity. More specifically, it does not have separate presentation and session layers. Below, a brief explanation of each of the layers present in the TCP/IP model is provided.

    Figure 2.4: The OSI reference model and protocol stack for TCP/IP. OSI model layers (top to bottom): Application, Presentation, Session, Transport, Network, Data link, Physical. TCP/IP model layers (top to bottom): Application, Transport, Network, Data link, Physical.

    Application layer

    The top layer in the TCP/IP protocol stack is called the application layer, which defines the format in which the data should be received from or handed over to the applications [8]. There exists a vast variety of network applications with different purposes. Many of the popular Internet applications like the World Wide Web (WWW), P2P file sharing, e-mail, VoIP and social networking applications like Facebook and Twitter have become an integral part of our society. The architecture of these applications can be divided into two groups: the client-server architecture and the peer-to-peer (P2P) architecture [7]. For client-server applications, such as web-based applications, there always exists an active host, which is a device with a unique IP address, called a server, which provides services for many other hosts called clients. P2P applications, like BitTorrent and Skype, on the other hand do not rely on servers and instead use direct communication between hosts, which are called peers.

    Network applications consist of processes that send and receive messages through a socket, which works like a door that a message must pass through before it can be delivered. This message exchange procedure is controlled by the application-layer protocols, for example HyperText Transfer Protocol (HTTP), File Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP) and Domain Name System (DNS). Here each protocol is connected to a specific application type: HTTP is connected to web document request and response, FTP to file transfer between two end-systems, SMTP to email transfer and DNS to domain name lookup [7]. Based on specific application quality requirements on data integrity, throughput, timing and security, a transport protocol is also needed, which leads us to the next layer in the TCP/IP protocol stack.

    Transport layer

    The transport layer handles communication services for message exchanges between application processes that are running on different hosts [7]. This is done by applying protocols to establish a logical communication where the sending host breaks down the application message into segments that are passed to the network layer. On the receiver side, the segments are then reassembled into a message that is passed up to the application layer and delivered to the intended application. For the Internet, there are two main transport-layer protocols: Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). TCP transports data using TCP segments that are addressed to individual applications, while UDP transports data using UDP packets. One distinguishing factor between TCP and UDP is that TCP is a connection-oriented service, where the destination confirms the data received. If some data gets lost, the destination requests a retransmission of the lost data. In contrast, UDP is a connectionless service that transports data using packets where the delivery is not guaranteed [8]. Figure 2.5 depicts the data exchange process between a sender and receiver for the TCP and UDP protocols.

    Figure 2.5: Data transmission with TCP and UDP.

    At the beginning of each connection between a sender and receiver, TCP performs a handshake where three segments are exchanged. First, the connection is initiated by a device sending a Synchronize Sequence Number (SYN) segment, which informs the receiver that a sender is ready to initiate communication. In response, the receiver sends back a SYN-ACK signal, which acknowledges that it has received the message. Lastly, the sender sends out an acknowledgment that it has detected the response from the receiver. Once both devices are synced, the actual data transfer can begin. This contrasts with the exchange process of the UDP protocol, where transmission is initiated by a request and data exchanges can start immediately. Generally, this structure makes UDP more common in smaller applications with low energy cost, where speed is essential, while TCP is more common for larger applications where high transmission quality is necessary. For example, this could mean that TCP is used for websites and UDP for streaming a Skype conversation [7].
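    The three segments of the handshake can be sketched as follows. This is a simplified illustration rather than a protocol implementation: the `Segment` class, the function name and the initial sequence numbers 100 and 300 are hypothetical, and details such as 32-bit sequence number wraparound, TCP options and timeouts are ignored. The sketch relies on the fact that a SYN consumes one sequence number, so each side acknowledges the other side's initial sequence number plus one.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Segment:
    """A heavily simplified TCP segment: control flags plus seq/ack numbers."""
    flags: FrozenSet[str]
    seq: int
    ack: int = 0


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three segments exchanged before data transfer can begin."""
    # 1. The client announces its initial sequence number (ISN).
    syn = Segment(frozenset({"SYN"}), seq=client_isn)
    # 2. The server replies with its own ISN and acknowledges the client's SYN.
    syn_ack = Segment(frozenset({"SYN", "ACK"}), seq=server_isn, ack=syn.seq + 1)
    # 3. The client acknowledges the server's SYN; both sides are now synced.
    ack = Segment(frozenset({"ACK"}), seq=syn.seq + 1, ack=syn_ack.seq + 1)
    return [syn, syn_ack, ack]


# Hypothetical initial sequence numbers, chosen only for readability.
segments = three_way_handshake(client_isn=100, server_isn=300)
```

    With UDP there is no counterpart to this exchange, which is why the first UDP packet can already carry application data.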

    Additionally, the TCP protocol is further differentiated from UDP by how it handles congestion control and flow control. These are two fundamental concepts that have a large impact and need to be taken into consideration for network problems with data transactions between multiple devices, such as the one studied in this thesis. Congestion occurs in a network when too many sources are sending too much data at too high a speed for the network to handle. It can result in lost packets, due to buffer overflow in routers, and long delays from queuing in router buffers [7]. In order to reduce data loss, a window (WIN) specifies how much unconfirmed data the source can send before the network gets congested. The source-side window is called the congestion window (CWND), and the amount of unconfirmed data that the source has in flight must never exceed the smaller of the CWND and the receiver window (RWND) advertised by the receiver [8].

    The congestion control mechanism for TCP is often referred to as Additive-Increase, Multiplicative-Decrease (AIMD), which produces a sharp wave-like (sawtooth) pattern. With this congestion control algorithm, the CWND is increased additively while no loss is detected and decreased by a factor of two when loss occurs. In this way the algorithm works toward connections that have an equal share of bandwidth within the network and limits potential congestion from developing. If a TCP sender detects that very little congestion is occurring between itself and the receiver, it can increase its transmission rate, and conversely, if there is much congestion, it can reduce its transmission rate. The control algorithm follows a fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average data rate of R/K. Flow control is a process where the receiver controls the sender, so that the sender will not overflow the buffer on the receiver side by transmitting too much data at too high a speed.
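    The AIMD behavior and the fairness goal can be illustrated with a minimal sketch. It is a toy model, not the TCP implementation: the function names, the unit increase per round and the rule that a loss happens whenever the window exceeds a fixed `capacity` are assumptions made for illustration only.

```python
def aimd(rounds, capacity, cwnd=1.0, increase=1.0, decrease=0.5):
    """Trace the congestion window of one sender over a number of rounds.

    Assumption for this sketch: a loss occurs whenever cwnd exceeds `capacity`.
    """
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:
            cwnd = cwnd * decrease  # multiplicative decrease: halve on loss
        else:
            cwnd = cwnd + increase  # additive increase while no loss is seen
    return history


def fair_share(bandwidth, sessions):
    """Fairness goal: K sessions on a bottleneck of bandwidth R get R/K each."""
    return bandwidth / sessions


trace = aimd(rounds=12, capacity=8)
print(trace)                                   # sawtooth-shaped window trace
print(fair_share(bandwidth=100.0, sessions=4))
```

    The printed trace climbs by one unit per round, drops to half after the window passes the capacity, and then starts climbing again, which is the wave-like pattern described above.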

    Network layer

    The network layer, for the Internet, is responsible for transferring packets known as IP datagrams from one host to another [7]. The network layer then provides the service of delivering the segments to the transport layer in the destination host. A difference between the transport and network layers is that the transport layer focuses on logical communication between processes, while the network layer provides logical communication between hosts. On the sending side, transport segments are encapsulated into datagrams, which are reassembled on the receiving side, where the segments are delivered to the transport layer. In order to accomplish this task, every host has network-layer IP and routing protocols. A router examines header fields in all IP datagrams passing through it. This procedure can be compared to putting a letter in an envelope with a specific address and dropping it into a mailbox [7]. The structure and content of the IPv4 datagram format can be found in Figure 2.6.

    The functionality of the network layer can be divided into two main categories: forwarding, which moves packets from router input to router output, and routing, which determines the route that packets will take from source to destination. Routing protocols are often based on algorithms that provide solutions to the shortest-path problem. Some common routing protocols are Routing Information Protocol (RIP), Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP). The network layer also includes the essential Internet Protocol (IP), which determines the structure of the datagram that is sent. There is only one IP protocol, and all Internet components that have a network layer must run it. There exist different versions of IP, such as IPv4, as seen in Figure 2.6, and the more recent IPv6, which has a different datagram format.


  • 2.4. Internet communication

    Figure 2.6: Datagram format for IPv4 protocol. [Diagram of the header, 32 bits per row (bits 0 to 31). Row 1: ver, header len, type of service, total length. Row 2: 16-bit identifier, flags, fragment offset. Row 3: time to live, upper layer, header checksum. Row 4: 32-bit source IP address. Row 5: 32-bit destination IP address. Then: options (if any), followed by data (typically a TCP or UDP segment).]
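    As an illustration of the datagram format, the fixed 20-byte part of an IPv4 header can be unpacked with Python's standard `struct` module. This is a minimal sketch: the function name, the dictionary keys and the example addresses are chosen for illustration, options are ignored, and the checksum is not verified.

```python
import socket
import struct


def parse_ipv4_header(raw):
    """Unpack the 20-byte fixed part of an IPv4 header into a dict."""
    (ver_ihl, tos, total_length, identifier, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,                    # upper 4 bits
        "header_len_bytes": (ver_ihl & 0x0F) * 4,   # lower 4 bits, in 32-bit words
        "type_of_service": tos,
        "total_length": total_length,
        "identifier": identifier,
        "flags": flags_frag >> 13,                  # upper 3 bits
        "fragment_offset": flags_frag & 0x1FFF,     # lower 13 bits
        "time_to_live": ttl,
        "upper_layer": protocol,                    # 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }


# Hand-built example header with made-up addresses; checksum left at zero.
example = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 1, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.1"), socket.inet_aton("192.0.2.7"),
)
header = parse_ipv4_header(example)
print(header)
```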

    Data link layer

    In the network, hosts and routers can be considered nodes that are connected for every data exchange. The data-link layer has the responsibility of transferring datagrams handed down from the network layer from one node to a physically adjacent node over a link [7]. Datagrams often need to traverse several nodes on their way from source to destination, which means that they may pass several links along the route that utilize different link-layer protocols. The protocols on the link layer may include wired links like Ethernet or wireless links such as WiFi. A link can be as simple as a single sender and receiver, or it can have multiple senders and receivers, in which case it utilizes a multiple access protocol.

    The link layer provides a number of services that facilitate communication between nodes. Similar to the upper layers in the protocol stack, the link layer encapsulates the data to be transmitted between nodes into frames, a service called framing. A frame consists of a data field, in which the network-layer datagram is inserted, and a number of header fields. Link access is controlled by a Medium Access Control (MAC) protocol, which specifies the rules by which a frame is transmitted onto the link. The link layer can provide reliable delivery, which ensures that the datagram is delivered without error. It can also detect and correct errors with methods such as parity-bit checks or checksums.

    Physical layer

    The physical layer describes the low-level electric or optical signals used for communicating between two computers [8]. It consists of protocols that operate only on a physical link and is dependent on an actual transmission medium like twisted-pair copper wire, coaxial cable or fiber. The goal of the services provided by the layer is to enable bits to travel over the wire to their intended destination.

    2.4 Internet communication

    Information exchange over the Internet is based on a technique called packet switching, which refers to the hand-over of data packets. To send a message from a source end system to a destination end system, the source breaks long messages into smaller chunks of data known as packets [7]. Between source and destination, each packet travels through communication links and packet switches. Each network interface on the Internet has one or more IP addresses that are unique worldwide. One network interface can have several IP


  • 2.5. Traffic packet flow

    addresses, but one IP address cannot be used by many network interfaces [8]. The Internet is composed of individual networks that are interconnected via routers. Each IP packet header contains the destination address, which is the complete routing information used for delivering the packet to its destination. Different links can transmit data at different rates, with the transmission rate of a link measured in bits per second. Messages can perform a control function or contain data like an email, a PNG image, or an MP4 video file [7].

    2.5 Traffic packet flow

    Devices can generate network traffic by sending packets at defined time intervals. Packets in the network can have varying sizes and distributions over a time-based transfer interval for a particular device configuration. Burst traffic patterns are characterized by a sequence of packets sent in rapid succession, with an Inter-Arrival Time (IAT) between them lower than 1 ms, which is a common time interval used in network communication to reduce the impact of network conditions. The time between bursts is called Idle Time (IT), and a device will be disconnected if it exceeds 10 seconds. It is also possible to measure time in terms of Connection Time (CT), which is the total amount of time that a device has been connected to the network. Additionally, each packet has a direction, either Uplink (Ul) or Downlink (Dl), that indicates whether the packet is sent or received by a device [9]. Figure 2.7 contains a graphical representation of the traffic packet flow definition.

    Figure 2.7: Packet flow for network traffic. [Plot with axes Time [s] and Size [B]; labels: Uplink, Downlink, Burst1, Burst2, IAT, IT.]

    2.6 Machine learning algorithms

    Machine learning (ML) refers to a field within computer science closely related to mathematics and statistics. The purpose of a model in machine learning is to simulate processes and solve complex problems. A model is based on data that is used to train the model and data used to test and validate the model. In machine learning there exist two different methods: supervised learning, where input is mapped against an expected output, and unsupervised learning, which can be used to discover patterns in input data. The algorithms described in this section are all based on the implementations available in the ML framework for PySpark, which is a development platform that can be used to process large amounts of data in parallel with ML algorithms.

    Unsupervised learning

    Unsupervised learning is most commonly used in clustering problems. Clustering refers to creating groups of data points based on a specific criterion or methodology. The groups can then be studied in more detail with cluster analysis.


  • 2.6. Machine learning algorithms

    K-means

    K-means is part of a family of unsupervised learning techniques where n data points are assigned to k different clusters. A cluster is a group created by choosing an arbitrary center point and evaluating the Sum of Squared Errors (SSE) from the Euclidean distances to all other points within the cluster. The algorithm stops when the best center, which minimizes the cost in terms of SSE within the cluster, has been found [10]. The algorithm has a number of drawbacks: the number of clusters k has to be specified by the user beforehand, and the initial choice of centroids will influence the outcome of the clustering. Another limitation is that it depends on the geometric distance between points and is hence sensitive to the problems of high-dimensional vector spaces known as the curse of dimensionality [11]. The algorithm has the following steps:

    1. Specify number of clusters k.

    2. Initialize centroids by randomly selecting k data points for the centroid placement.

    3. For each center, identify the subset of data points that are closer to it than to any other center.

    4. Calculate the mean of each feature for the data points in each cluster; this mean vector becomes the new center for that cluster.

    The algorithm continues to iterate between steps 3 and 4 until convergence is reached. A possible improvement on the algorithm is the K-means++ method, which is an initialization technique with the goal of avoiding the selection of sub-optimal cluster centers. It addresses this issue by choosing the first centroid at random and subsequently selecting each new centroid with a probability proportional to its squared distance from the nearest already chosen center. In PySpark the ML framework utilizes a scalable version of K-means++, called K-means||, introduced by Bahmani et al. [12]. Two popular methods for finding an appropriate value of k for a data set are the Silhouette score and the Elbow method. The Silhouette score is a measure of similarity within clusters (cohesion) and distance between clusters (separation), where high cohesion and separation are sought after. The metric ranges over [-1, 1], where a high value indicates good clustering. The Elbow method, in turn, works by finding a good cut-off point using the "elbow" of a curve, e.g. the within-cluster SSE (WSSE), after which the returns from adding more clusters diminish. In this sense, using the number of clusters at the elbow reduces the risk of over- or under-fitting the model to a particular data set.
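    The four steps can be sketched in plain Python as follows. This is a minimal single-machine illustration with random initialization, not the K-means|| implementation used by PySpark; the function names, the seed and the six example points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, clusters, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: k random data points
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 4: the per-feature mean of each cluster becomes its new centroid.
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members
            else centroids[c]  # an empty cluster keeps its old centroid
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # convergence: no centroid moved
            break
        centroids = new_centroids
    sse = sum(squared_distance(p, centroids[c])
              for c, members in enumerate(clusters) for p in members)
    return centroids, clusters, sse


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters, sse = kmeans(points, k=2)
print(centroids, sse)
```

    Running the function for several values of k and plotting the returned SSE against k gives the curve on which the Elbow method looks for its cut-off point.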

    t-SNE

    t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique primarily used in data exploration that provides support for interpreting and understanding distributions [13]. It works by creating a probability distribution, which is then translated into coordinates in a vector space with a selected number of dimensions. This means that high-dimensional problems can be visualized in two- or three-dimensional plots. A drawback of traditional SNE is that it is complicated and is generally only applied to a smaller sample of a data set. t-SNE addresses this issue by using a Student t-distribution in the low-dimensional space alongside the Gaussian distribution employed in the data representation [7]. The Student t-distribution has a heavier tail than the Gaussian distribution, which helps the model map faraway points in the high-dimensional space to faraway points in the embedded space as well. Another significant difference between t-SNE and SNE is that t-SNE uses a symmetric version of the SNE cost function [14].

    PCA

    Principal Component Analysis (PCA) is a common method to process a large amount of data and extract the essential information necessary to work with algorithms and analysis. It works by trying to find the new axes, the lines, along which the data variance is the highest. The data is then projected onto those axes, which are the principal components (PCs) themselves. The best line is the first principal component and explains the most variance of the total data. Each PC is a linear combination of variables and weights that determine the impact of a parameter on the component [11]. The variables either increase together (correlated) or one increases while the other decreases (inversely correlated). These relations between variables can be shown by visualizing the covariance between each pair of variables with a covariance matrix, where a positive value indicates correlation between two variables, while a negative value indicates inverse correlation. The algorithm has the following steps:

    1. Normalize the variables so that they have the same amount of impact.

    2. Calculate the covariance matrix to explain correlation between variables.

    3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the impact from parameters on principal components.
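    The three steps can be followed by hand for two variables, where the eigenvalues of the 2x2 covariance matrix have a closed form. The sketch below is restricted to this two-variable case for readability; the function names and the example data are illustrative assumptions, and a real analysis would rely on a library routine.

```python
import math


def standardize(column):
    """Step 1: rescale a variable to zero mean and unit sample variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / std for v in column]


def covariance(a, b):
    """Sample covariance of two variables that are already centered."""
    return sum(x * y for x, y in zip(a, b)) / (len(a) - 1)


def pca_2d(xs, ys):
    zx, zy = standardize(xs), standardize(ys)
    # Step 2: covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx, cyy, cxy = covariance(zx, zx), covariance(zy, zy), covariance(zx, zy)
    # Step 3: closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    centre = (cxx + cyy) / 2
    spread = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    eigenvalues = [centre + spread, centre - spread]
    eigenvectors = []
    for lam in eigenvalues:
        if abs(cxy) > 1e-12:
            v = (cxy, lam - cxx)  # solves (C - lam * I) v = 0 when cxy != 0
        else:
            # Uncorrelated standardized variables: the coordinate axes serve.
            v = (1.0, 0.0) if not eigenvectors else (0.0, 1.0)
        norm = math.hypot(*v)
        eigenvectors.append((v[0] / norm, v[1] / norm))
    explained = [lam / sum(eigenvalues) for lam in eigenvalues]
    return eigenvalues, eigenvectors, explained


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
eigenvalues, eigenvectors, explained = pca_2d(xs, ys)
print(eigenvectors[0], explained)
```

    Because the two example variables are almost perfectly correlated, nearly all of the variance ends up on the first principal component.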

    Supervised learning

    Supervised learning is most commonly used in classification and prediction problems. Supervised learning has two sub-categories: regression and classification. Regression encapsulates several different methods where the output is continuous, while the output of classification models is discrete.

    Random forest

    Random Forest is a relatively modern tree-based classification technique, which was introduced by Breiman in 2001 [15]. Random Forest is an ensemble learning method that consists of multiple decision trees, where every tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [11]. The randomness injected into the training process means that the resulting trees will be somewhat different from one another. After training a certain number of trees, their outputs are combined to create a robust model with a low chance of over-fitting to the data. For the classification problem, Random Forest receives a class vote from each tree and then makes a final decision based on the majority vote. The algorithm has the following steps:

    1. Create sets of random samples from the input data.

    2. Construct a decision-tree for every sample.

    3. Make predictions and gather votes from each tree.

    4. Select the majority vote as the final model prediction.

    The model is initiated by selecting a desired number of trees t and maximum tree depth d. In general, increasing these parameters will lead to a more complex model with higher accuracy at the cost of longer training time. It has been shown that Random Forest stabilizes at approximately 200 trees, after which the accuracy gains become negligible [11]. Another common application of Random Forest is feature selection, where the goal is to find the most important features in a data set by evaluating the average impact of the features on every tree in the model. The structure of Random Forest lends itself well to parallel execution and can also be modified to increase scalability [16].
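    The sample, train, vote and decide steps can be shown with a heavily simplified sketch. It is not the PySpark implementation: every "tree" here is a depth-one decision stump on one randomly chosen feature, and the function names, the number of trees and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(samples, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({s[feature] for s in samples}):
        left = [l for s, l in zip(samples, labels) if s[feature] <= threshold]
        right = [l for s, l in zip(samples, labels) if s[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def train_forest(samples, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (drawn with replacement) of the input data.
        idx = [rng.randrange(len(samples)) for _ in samples]
        feature = rng.randrange(len(samples[0]))  # random feature for this tree
        # Step 2: construct one (very shallow) tree for this sample.
        forest.append(train_stump([samples[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def predict(forest, sample):
    # Steps 3 and 4: gather one vote per tree and return the majority class.
    votes = Counter()
    for feature, threshold, left_label, right_label in forest:
        votes[left_label if sample[feature] <= threshold else right_label] += 1
    return votes.most_common(1)[0][0]


samples = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = train_forest(samples, labels)
print(predict(forest, (1, 1)), predict(forest, (9, 9)))
```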


    Multinomial logistic regression

    Multinomial logistic regression is a classification method that extends standard logistic regression to instances with more than two classes. The model is based on the posterior probabilities of K classes, with a separate linear function for each class in which a vector of items x is multiplied with a corresponding regression coefficient β, and the probabilities together sum to one [11]. For the multi-class problem the model utilizes the softmax function, which is a generalization of the sigmoid function and is used to compute the probability of the item y belonging to a class c. One of the properties of the softmax function is that values close to the maximum value get pushed towards 1 while values far from the maximum get pushed towards 0. It is possible to formulate multinomial logistic regression with the following log-linear model:

    $$P(y_i = c) = \operatorname{softmax}(c, \beta_1 \cdot x_i, \ldots, \beta_K \cdot x_i), \tag{2.1}$$

    where the softmax function is defined as

    $$\operatorname{softmax}(c, x_1, \ldots, x_n) = \frac{e^{x_c}}{\sum_{i=1}^{n} e^{x_i}}. \tag{2.2}$$

    For classification, the model chooses a class based on maximum likelihood, thus the value closest to one is the best possible choice. The model is based on the assumption that each variable in the model can be considered independent, meaning that the variables are not linear combinations of each other.
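    A short sketch of the softmax-based class choice is given below. The coefficient vectors and the input are made-up numbers, and the function names are hypothetical; subtracting the maximum score before exponentiating is a common numerical safeguard that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(x, betas):
    """Score x against one coefficient vector per class, then apply softmax."""
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


betas = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]  # made-up coefficients, K = 3
label, probs = predict_class((2.0, 0.5), betas)
print(label, probs)
```

    The class with the largest linear score receives the probability closest to one, which is the class the model selects.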

    Multilayer perceptron

    A multilayer perceptron is a type of feed-forward neural network, which is a non-linear classification method [11]. These kinds of models are inspired by biological phenomena, where artificial neurons mimic the way that biological neurons are connected to process information in a brain. The multilayer perceptron is organized into distinct nodes and layers, where each node is activated by an activation function f, e.g. a Heaviside or sigmoid function. Each node in the network is connected with a specific weight w, and depending on the activation function the multilayer perceptron can perform different tasks such as classification or regression. The network is structured in three main kinds of layers: input layer, hidden layers and output layer. Nodes in the input layer correspond to the provided input data x, while all other layers consist of linear combinations of the input with the individual weights w and bias b for the nodes. This can be described as

    $$y(\hat{x}) = f_K(\ldots f_2(\hat{w}_2^T \cdot f_1(\hat{w}_1^T \hat{x} + b_1) + b_2) \ldots + b_K). \tag{2.3}$$

    In classification the multilayer perceptron uses back-propagation to train a model, which is a generalization of least-squares minimization. A general drawback of neural networks is that they are often complex and difficult to train due to having many parameters, which may lead to unstable optimization if initialized incorrectly [11]. An overly complex model with too many weights might also lead to over-fitting the model to the input data [11]. Thus, it is desirable to strike a balance with a model that has enough hidden layers to be flexible enough to fit a specific problem and input. The network architecture used in this thesis is inspired by the U-net architecture, where the number of parameters is doubled for each level [17]. This process results in a network with four layers and the following structure:

    1. First layer: Same amount of parameters as input variables n.

    2. Second layer: 2*n.

    3. Third layer: n.

    4. Fourth layer: Same amount as number of expected output classes k.
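    The layer structure above and the nested form of the forward computation can be sketched as follows. This only shows a forward pass with randomly initialized weights; training with back-propagation is left out. The function names, the sigmoid activation in every layer and the example input are assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_sizes(n_inputs, n_classes):
    """Four layers: n, 2n, n and k nodes, following the list above."""
    return [n_inputs, 2 * n_inputs, n_inputs, n_classes]


def init_network(sizes, seed=0):
    """One (weights, biases) pair per transition between consecutive layers."""
    rng = random.Random(seed)
    network = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(fan_in)] for _ in range(fan_out)]
        biases = [0.0] * fan_out
        network.append((weights, biases))
    return network


def forward(network, x):
    """Apply f(w^T a + b) layer by layer, starting from the input x."""
    activation = list(x)
    for weights, biases in network:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


sizes = layer_sizes(n_inputs=4, n_classes=3)
network = init_network(sizes)
output = forward(network, [0.5, -1.2, 3.0, 0.0])
print(sizes, output)
```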


  • 2.7. Related work

    Performance metrics

    In classification the performance of the machine learning algorithms is evaluated by calculating the confusion matrix, which is a table containing cells for True Positives (TP), False Negatives (FN), False Positives (FP) and True Negatives (TN). Here TP and TN correspond to the number of positive and negative instances that are correctly classified, while FP and FN denote the number of misclassified negative and positive instances [18]. These values can be used to calculate precision and recall:

    $$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}. \tag{2.4}$$

    Lastly, these two measures can be combined to calculate the harmonic mean between precision and recall, called the F1-score measurement:

    $$F1_{score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{2.5}$$
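    As a worked example, the sketch below computes the three measures from confusion matrix counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are made-up numbers and the function names are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


tp, fp, fn, tn = 40, 10, 20, 30  # made-up confusion matrix counts
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
print(p, r, f1)
```

    Note that TN does not enter any of the three measures, which is why they remain informative when the negative class dominates the data.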

    Data normalization

    Normalization is a common technique in data preprocessing for machine learning algorithms. This step is necessary for many problems where the features contribute differently to the ML model due to high variety in range and distribution.

    Percentile scaling

    In this thesis we have both binomial and heavy-tailed distributions for some features. In order to get a fair contribution to our ML models we employ percentile scaling, which scales all features to the range between zero and one. The new value is based on the rank of an element e and the total number of elements in a feature vector. The mathematical formula used for feature normalization can be found in equation 2.6, where r is the number of elements that are less than e, s is the number of elements that are equal to e, N is the total number of values for a specific feature and x̂ is the normalized output from the scaling:

    $$\hat{x}_k = \frac{r + 0.5 \cdot (s - 1)}{N}, \qquad k = 0 \ldots N. \tag{2.6}$$
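    The scaling can be sketched directly from the definitions of r, s and N. The function name and the example feature vector are illustrative assumptions, and s is taken to count the element e itself.

```python
from bisect import bisect_left, bisect_right


def percentile_scale(values):
    """Map each value e to (r + 0.5 * (s - 1)) / N."""
    ordered = sorted(values)
    n = len(ordered)
    scaled = []
    for e in values:
        r = bisect_left(ordered, e)       # number of elements less than e
        s = bisect_right(ordered, e) - r  # number of elements equal to e
        scaled.append((r + 0.5 * (s - 1)) / n)
    return scaled


feature = [10, 20, 20, 1000]  # heavy-tailed toy feature
print(percentile_scale(feature))
```

    The outlier 1000 ends up only one rank step above the tied values, so a heavy tail no longer dominates the feature.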

    2.7 Related work

    This section covers the related work that has been done in the relevant fields for this thesis. This thesis expands on the findings from previous research in the area and proposes a more complex network model. The motivation behind this strategy is to be able to adapt to changes in network communication towards a more complex service-oriented structure and to be able to identify a set of characteristic behaviors that may be used to simulate different traffic types. In this procedure we take advantage of the IP packet header log analysis framework suggested by Lin et al. We also adopt a combination of the traffic classification procedures used by Rojas et al. and Erman et al., where a four-layered model is created from time-based features. From the analysis framework a set of statistics is aggregated, which is processed with a feature selection technique and unsupervised learning to create groups that can be used to analyze and discover trends and characteristic behaviors for traffic flows in the network.

    Big data analysis

    Big data analysis is becoming an increasingly important way to discover network patterns and gather the knowledge needed to construct and verify the quality of traffic models. To meet the growing demands of 5G technology, Kan Zhen et al. propose a framework for big data-driven (BDD) optimization [19]. The framework consists of four parts: data collection, storage management, data analytics and network optimization. The authors present three case studies where BDD schemes can be utilized to improve the performance of mobile networks toward 5G. The paper also discusses the potential challenges of adopting BDD, such as data collection and the processing of large amounts of data given the limitations of available techniques.

    IP packet header log analysis

    Traffic logs contain much interesting information that is relevant in many contexts, such as security and user and application behavior in a system [20, 21, 22]. However, as the complexity and amount of data increase, an efficient solution for processing the logs generated by the system is necessary. A log analysis system not only needs to be able to handle massive and stable data processing but also be flexible enough to account for different scenarios. Lin et al. propose a framework for log analysis that consists of a combination of Apache Spark and Hadoop [23]. Being able to process the data with a distributed file system and in smaller partitions in parallel is crucial for obtaining an acceptable execution time and not running out of memory when applying more complex algorithms. This is corroborated by a study by Ilias Mavridis et al., who show how execution time decreases with the number of nodes in the cluster [24].

    Machine learning in network analysis

    There have been many prior studies with the aim of analyzing the behavior of network traffic. One example is the approach described by Rojas et al., where the authors propose a method to identify over-the-top (OTT) applications with statistical classification based on IP log data, with the goal of establishing a personalized service degradation policy that limits the amount of data that can be transferred over a certain period [25]. The authors apply K-means to the data, which is clustered into three groups based on the level of energy consumption: low, medium and high consumption, where each cluster is assigned a specific policy. The results are validated using eight different machine learning classification algorithms. From the analysis, the authors present a table with policy recommendations for each application.

    Another approach for traffic classification, described by Li Wei et al., utilizes a semi-automated machine learning approach. The proposed framework consists of a series of mechanisms that are applied to find the most relevant parameters from network data to use in classification [26]. The data set used contains 248 different features, which are reduced to a subset by applying a correlation-filtering method to the data. The resulting subset has ten behavior features and two features for specific port numbers. These parameters are used as input for classification models on a hand-labeled data set with ten different application classes. All of the models tested achieve an accuracy of over 90 percent on the classification problem for all except one of the application classes.

    Others have used multi-fractal analysis to extract timing-based features to better classify individual encrypted traffic flows [27, 28]. The evolution in network technology means that many previous methods such as Deep Packet Inspection (DPI) and port-based classification are no longer reliable options, due to applications using dynamic port numbers and encryption to avoid detection. The authors show that it is possible to apply a feature selection technique like PCA and man-in-the-middle (MITM) based flow labeling to achieve high accuracy for traffic type classification with time-based information for traffic flows. Furthermore, Erman et al. suggest that an alternative approach in traffic classification is to exploit the distinctive characteristics of applications [29]. For this purpose the authors show that the clustering algorithms K-means and DBSCAN can be used to identify these behaviors in a network environment.


  • 3 Method

    This chapter describes the method that was used during the thesis work to answer the research questions.

    3.1 Investigation of application behavior

    A starting point for this thesis is the basic assumption that mobile applications generate traffic patterns that may be utilized as a basis for similarity measures. To confirm the validity of this approach to network modeling, it is thus necessary to find traffic models that can be used to describe the behavior of an application in the network over time. This is closely connected to the first research question from section 1.4 in the introduction, and it is a non-trivial task that requires several steps to investigate:

    1. Data-processing from IP packet header logs.

    2. Feature selection to determine the importance of each characteristic for the network behavior definition.

    3. Clustering of traffic data based on characteristic behaviors and analysis of the resulting clusters.

    4. Definition of traffic models based on the clusters' characteristic behaviors and their correlation with application types.

    Traffic characteristics

    A proposed holistic model of traffic characteristics is to investigate behavior at the packet, packet burst, connection and flow levels that occur in the lifetime of an application. Each of these levels can be considered a building block, i.e. a burst contains a number of packets and the size of the burst is the sum of the packet sizes in that burst. Additionally, we consider the possible connections for an application, where a connection is defined as a link between an IP address and port. An application may have several connections and traffic flows active during its lifespan, and each one can be further divided into sections based on specified threshold values.


  • 3.2. Network traffic definition

    3.2 Network traffic definition

    In communication networks, traffic consists of a mixture of both devices and services. A UE can be any kind of device that connects to the network (phone, iPad, computer, etc.), while a service is something that can run on the device, like Chrome, YouTube, or Skype. Going even further, a service may have multiple applications running in the background; for example, Skype can be used for both VoIP and chat at the same time. It is possible that each of these underlying functions affects the traffic pattern of the service as a whole. Finding similar patterns in traffic flow between applications therefore first requires establishing characteristics that are sufficient for describing the behavior of an application in the network. Since data from IP packet header logs only contain a single line of information per packet, these layers have to be created by a traffic model designer. For this reason, the model that we propose to capture application behavior utilizes several different abstraction layers. The model used in this thesis has a four-layered structure, which can be seen in Figure 3.1.

    Figure 3.1: Proposed stack layers in network model.

    The connection between UE, application and each network layer is depicted in Figure 3.2. Here we can see the process of running multiple applications in parallel from the UE perspective. In order to accurately model the traffic flow for these separate entities, it is thus necessary to examine network behavior with increased granularity to separate the ongoing traffic-generating processes from one another. Traffic for the network layers is separated by grouping on the unique combinations of IP addresses and ports. The interaction between a UE and server consists of many flows and connections. Applications in the network are characterized by utilizing multiple flows for underlying processes, where a UE IP address can be matched with various server IPs that carry out specific tasks.

    A flow combination is created from an IP tuple, (IPn, IPm), where the UE can be connected to several different servers for the same application, which results in 1, ..., n possible flows per UE. The second group of network traffic in our model is IP-port connections, referred to as connections in this thesis, which in addition to IP addresses also include ports for both UE and server. This results in a unique four-tuple, ((IPi, Pj), (IPm, Pn)), which allows for more variation. A connection has an exact origin and destination, and it is possible to have many connections for a UE that only differ in one of the elements of the four-tuple.
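    The two grouping keys can be illustrated with a small sketch that sorts packet records into flows (IP pair) and connections (IP and port pair). The record layout, the field names and the example addresses are hypothetical and do not reflect the actual log format used in the thesis.

```python
from collections import defaultdict


def group_packets(packets):
    """Group packet records by flow (IP pair) and by connection (IP-port pair)."""
    flows = defaultdict(list)
    connections = defaultdict(list)
    for pkt in packets:
        flow_key = (pkt["ue_ip"], pkt["server_ip"])
        conn_key = ((pkt["ue_ip"], pkt["ue_port"]),
                    (pkt["server_ip"], pkt["server_port"]))
        flows[flow_key].append(pkt)
        connections[conn_key].append(pkt)
    return flows, connections


# Made-up packet records; only the grouping fields and a size are included.
packets = [
    {"ue_ip": "10.0.0.5", "ue_port": 50001, "server_ip": "203.0.113.10", "server_port": 443, "size": 120},
    {"ue_ip": "10.0.0.5", "ue_port": 50002, "server_ip": "203.0.113.10", "server_port": 443, "size": 1500},
    {"ue_ip": "10.0.0.5", "ue_port": 50001, "server_ip": "203.0.113.10", "server_port": 443, "size": 60},
    {"ue_ip": "10.0.0.5", "ue_port": 50003, "server_ip": "198.51.100.7", "server_port": 80, "size": 400},
]
flows, connections = group_packets(packets)
print(len(flows), len(connections))
```

    In the example, one flow to the first server carries two connections that differ only in the UE port, which is the kind of variation the four-tuple captures.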

    Proceeding to an even deeper network level, we can zoom into each flow and connection. Here we see how information exchange is made up of packet deliveries between a UE and server. Each set of packet transmissions exists within a packet burst, and it is possible to have an arbitrary number of bursts and packets for a flow or connection; this is denoted as 1, ..., n in the figure. The full perspective of the network model is gained by combining the information from each level. This results in a four-layered model consisting of packets, packet bursts, connections and flows, which can be used to investigate traffic characteristics for different abstraction levels of the network. Information for the packet level is directly available in traffic logs and can be used as a starting point to aggregate information for each of the higher levels in the hierarchy of the model.

    The structure of the proposed network model is highly flexible, and layers may be added or removed based on preference. The goal of creating this kind of network model is that traffic models may be created based on different priorities of network properties. An example of a network property that is specific to this thesis is the connection layer, which is based exclusively on the TCP protocol connection for device communications over the Internet and does not include UDP logic.

    Figure 3.2: UE and application traffic for network levels. [Diagram: a UE with Applications A1, ..., An; Flows F1, ..., Fn, labeled (IPn, IPm); Connections C1, ..., Cn, labeled ((IPi, Pj), (IPm, Pn)); Packet Bursts PB1, ..., PBn, each containing Packets P1, ..., Pn.]

    Time perspective

    Each packet has a certain size in bytes and follows the structure described in Section 2.5. A packet burst consists of a sequence of consecutive packet exchanges for a UE and server combination and is concluded when a threshold on the idle time between packets is met. More specifically, we compare the inter-arrival time (IAT) between two packets with arrival times $t_i$ and $t_{i+1}$ against a threshold $\delta$. Whenever $t_{i+1} - t_i \le \delta$, we say that both packets belong to the same packet burst. With this definition, packet bursts (numbered with index $j$) are separated by an IAT greater than $\delta$. Each such arrival therefore increases the burst counter $j$, which results in a time section that contains a number of packets. The threshold definition can be found in Equation 3.1, where $t$ is time and PB is short for packet burst.

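    \begin{equation}
    \begin{aligned}
    &\mathrm{IAT} = t_{i+1} - t_i, \qquad t_i = \text{time of reception of packet } i, \qquad j = 0 \\
    &\forall\, \text{packet}: \quad \text{if } \mathrm{IAT} > \delta: \; j \mathrel{+}= 1; \qquad \text{packet} \rightarrow PB_j
    \end{aligned}
    \tag{3.1}
    \end{equation}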
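    The burst rule in Equation 3.1 can be sketched in a few lines of Python. This is a minimal illustration under the assumption that arrival times are given as a sorted list of seconds; the function name `assign_bursts` and the sample timestamps are hypothetical.

```python
def assign_bursts(arrival_times, delta):
    """Return one burst index per packet.

    arrival_times must be sorted in ascending order (seconds).
    A gap larger than delta starts a new burst, as in Equation 3.1.
    """
    burst_ids = []
    j = 0  # burst counter
    for i, t in enumerate(arrival_times):
        if i > 0 and t - arrival_times[i - 1] > delta:
            j += 1  # IAT exceeded the threshold: next burst
        burst_ids.append(j)
    return burst_ids


# Made-up timestamps: gaps of 0.38 s and 1.45 s exceed delta = 0.1 s.
times = [0.00, 0.05, 0.12, 0.50, 0.55, 2.00]
bursts = assign_bursts(times, delta=0.1)
print(bursts)  # [0, 0, 0, 1, 1, 2]
```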

    In a similar fashion, thresholds can be used to define more complex structures and capture network behavior over a longer interval. By extending the above burst logic to higher levels in our network model, we get a definition for "sessions", which are time slices connected to a unique identifier and a threshold. The idea is that these time intervals can be used to divide the packet exchange into sections that should preserve some specific network characteristics. For the connection and flow levels, the thresholds are approximated from a designer's perspective with background from network theory. Figure 3.3 illustrates how we define packet bursts, connection sessions and flows based on inactive time at the packet level. We next describe each of these key concepts one by one:


    1. Flow: As we only collect packet-level data, it is difficult to determine the exact lifetime of an end-to-end flow. Technically, an application for a UE may remain inactive in the background for long periods of time. The flow lifetime is therefore unknown.

    2. Flow session: We define a flow session using a single flow session threshold $t_{fs}$. In particular, a flow session is deemed to terminate as soon as no packets have been observed for the flow over the past $t_{fs} = 80$ s. The decision to use 80 s as the threshold for flow sessions is based on observations from the Cumulative Distribution Function (CDF) of the packet IAT, which can be found in Figure 4.2b. From this study we aim to select the threshold that captures 98 percent of the total packet IAT values.

    3. Connection session: A connection session is defined by applying a connection session threshold $t_{cs}$. For this threshold we use $t_{cs} = 0.5$ s, which is based on the assumption that whenever a TCP connection is idle for longer than the Retransmission Timeout (RTO), it re-enters slow start, which is also done at the beginning of every connection. With this assumption, we make the distinction that whenever a connection is inactive for more than 0.5 s, it has been idle for longer than an RTO and can be considered a new network connection.

    4. Packet burst: We define a packet burst to last for as long as packets are exchanged at a certain rate. A packet burst ends when the IAT between two consecutive packets exceeds the threshold $t_b = 0.1$ s. The motivation for using a threshold of 0.1 s is that this is a very commonly used interval in network analysis, mainly because buffering and radio-condition issues make it necessary to bundle packets into bursts.
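    The three thresholds listed above nest inside one another: every gap that ends a connection session also ends a packet burst, and every gap that ends a flow session also ends a connection session. The sketch below illustrates this on made-up timestamps, under the simplifying assumption that all packets belong to a single connection of a single flow. The constant and function names are hypothetical.

```python
# Thresholds from the list above, in seconds.
T_FS = 80.0   # flow session
T_CS = 0.5    # connection session
T_B = 0.1     # packet burst


def count_segments(times, threshold):
    """Count segments when a gap larger than `threshold` starts a new one.

    `times` must be sorted in ascending order.
    """
    if not times:
        return 0
    gaps = (later - earlier for earlier, later in zip(times, times[1:]))
    return 1 + sum(1 for gap in gaps if gap > threshold)


# Made-up arrival times; the gaps are 0.05, 0.25, 0.7, 0.05 and 98.95 s.
times = [0.0, 0.05, 0.3, 1.0, 1.05, 100.0]

n_bursts = count_segments(times, T_B)
n_connection_sessions = count_segments(times, T_CS)
n_flow_sessions = count_segments(times, T_FS)
print(n_bursts, n_connection_sessions, n_flow_sessions)  # 4 3 2
```

    In the model itself the levels are additionally keyed by different identifiers (IP tuple for flow sessions, four-tuple for connection sessions), which this single-connection example leaves out.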

    Figure 3.3: Visualization of how thresholds are used to divide packets into network-layer segments. [Timeline diagram with four rows over Time [s]: Packets; Packet Bursts (b1 to b4); Connection Sessions (cs1 to cs3); Flow (fs1, fs2).]

    Data pre-processing

    Each event in the communication network is documented in IP packet header logs, which contain all of the data necessary to monitor network traffic. These measurements are made at the packet level and include recordings of timestamps, IP addresses, size, transport direction, protocol and more. Below we provide a small sample with the first five lines from an IP packet header log test file, where all IP addresses have been replaced to preserve user anonymity.

    1. 1509373884.998817,u,"xx.xx.xxx.xxx",1024,"yy.yyy.yy.yyy",1133,61,u,454063850073515,1234567,geran,,454:6:40:11492

    2. 1509373886.128775,u,"xx.xxx.xx.xxx",1133,"yy.yyy.yy.yyy",1024,47,d,454063850073515,1234567,geran,,454:6:40:11492

    3. 1509373887.022907,u,"xx.xx.xxx.xxx",1024,"yy.yyy.yy.yyy",1133,61,u,454063850073515,1234567,geran,,454:6:40:11492

    4. 1509373887.319837,u,"xx.xxx.xx.xxx",1133,"yy.yyy.yy.yyy",1024,47,d,454063850073515,1234567,geran,,454:6:40:11492

    5. 1509373888.199975,u,"xx.xx.xxx.xxx",1024,"yy.yyy.yy.yyy",1133,42,u,454063850073515,1234567,geran,,454:6:40:11492

    The server-client mapping that we are using is dependent on the RAN direction, which means that uplink is up to the RAN and downlink is down to the UE. The logs contain much information, but not everything is important in terms of traffic modeling. To get data entries matching the established traffic characteristics that are interesting to investigate, we remove the columns that are not relevant to the theoretical model established in Section 3.1. Each layer has a number of statistics that are used to model traffic behavior; a brief explanation of the parameter categories is listed below:

    1. Flow session level info: Identified by the UE and server IP tuple. Statistics are aggregated after a flow session ends, i.e. when the inter-arrival time exceeds the flow session threshold $t_{fs}$.

    2. Connection session level info: Identified by the UE and server IP and port four-tuple. Statistics are aggregated after a connection session ends, i.e. when the inter-arrival time exceeds the connection session threshold $t_{cs}$.

    3. Dl packet burst level info: Statistics aggregated at the packet-burst level in the downlink direction when the inter-arrival time exceeds the packet-burst threshold $t_b$.

    4. Ul packet burst level info: Statistics aggregated at the packet-burst level in the uplink direction when the inter-arrival time exceeds the packet-burst threshold $t_b$.

    5. Packet burst level info: Statistics aggregated after a packet burst, i.e. when the inter-arrival time exceeds the packet-burst threshold $t_b$.

    6. Packet level info: Parameters gathered directly from packets in the IP packet header logs: timestamp, packet size and IAT.

    After choosing threshold values to split the flows into flow sessions, the data is loaded into PySpark and statistics are aggregated for each parameter. To ensure that the entire spectrum of parameter characteristics is preserved, we store values in a four-item array containing the 20th, 50th, 80th and 100th percentiles of a specific feature. This approach is necessary since most parameters from the network have distributions that are either bimodal, with only a few peaks at distinct values, or heavy-tailed, with a couple of values that are significantly larger than the rest. Using the average or standard deviation for a feature would, in these instances, misrepresent the parameter behavior. Furthermore, the proposed method allows for more complex behavior combinations for different parameter percentiles. The statistics are collected as separate parquet files that are combined at the end of the pre-processing script, and the combined output is used as input for the data analysis. This procedure results in the set of statistics seen in Table 3.1.
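    The motivation for the four-item percentile array can be illustrated with a small pure-Python sketch. It uses the nearest-rank percentile definition, which is an assumption, since the text does not state which percentile definition the PySpark aggregation uses. The function names and the packet sizes are made up for illustration.

```python
def percentile(sorted_values, percent):
    """Nearest-rank percentile of an already sorted, non-empty list.

    `percent` is an integer between 1 and 100.
    """
    n = len(sorted_values)
    rank = max(1, (percent * n + 99) // 100)  # integer ceiling of percent*n/100
    return sorted_values[rank - 1]


def summarize(values, percents=(20, 50, 80, 100)):
    """Four-item summary as described in the text: 20/50/80/100 percentiles."""
    ordered = sorted(values)
    return [percentile(ordered, p) for p in percents]


# Made-up bimodal packet sizes: many small packets and a few large ones.
packet_sizes = [40, 40, 40, 52, 52, 60, 1400, 1400, 1500, 1500]

summary = summarize(packet_sizes)
mean = sum(packet_sizes) / len(packet_sizes)
print(summary)  # [40, 52, 1400, 1500]
print(mean)     # 608.4
```

    The mean of 608.4 bytes does not correspond to any packet that was actually observed, whereas the percentile array keeps both the small-packet and the large-packet mode visible.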

    3.3 Data analysis

    To ensure that the network traffic definition from Section 3.1 approximates network traffic behavior with sufficient accuracy, a thorough analysis of a large amount of the IP packet header log data is needed. To do this we employ CRISP-DM, a well-established data-mining framework that has been used to solve similar problems in the past [30].


    Table 3.1: Format of statistics aggregated from IP-log data.

    Flow identifiers | Description | Type
    UE_IP | UE IP-address | String
    srv_IP | Server IP-address | String
    fs_ID | Unique identifier for a UE and server combination | String

    Flow parameters | Description | Type
    prov | List of service providers | Array[String]
    app | List of client-applications | Array[String]
    pr | Transport protocol | String
    O | List of overlapping flow sessions | Array[String]
    V | Flow session total volume | Int
    Dv | List of durations for volume ramp-up of the flow session | Array[Float]
    cs_N | Number of connection sessions in a flow session | Int
    cs_V | List of total volumes of the underlying connection sessions | Array[Int]
    cs_D | List of session durations of the underlying connection sessions | Array[Float]
    cs_P | List of number of packets in the underlying connection sessions | Array[Int]
    cs_O | List of total volumes for underlying connection sessions | Array[Int]
    cs_IAT | List of inter-arrival times for underlying connection sessions | Array[Float]
    pb_N | Number of packet bursts in the particular flow session | Int
    pb_V | List of total volumes of the underlying packet bursts | Array[Int]
    pb_D | List of session durations of the underlying packet bursts | Array[Float]
    pb_P | List of number of packets in the underlying packet bursts | Array[Int]
    pb_Dl | List of number of dl bursts in the underlying packet bursts | Array[Int]
    pb_Ul | List of number of ul bursts in the underlying packet bursts | Array[Int]
    pb_IAT | List of inter-arrival times for underlying packet bursts | Array[Float]
    dl_N | Number of dl bursts in the particular flow session | Int
    dl_V | List of total volumes of the underlying dl bursts | Array[Int]
    dl_D | List of session durations of the underlying dl bursts | Array[Float]
    dl_P | List of number of packets in the underlying dl bursts | Array[Int]
    dl_O | List of dl overlapping bursts | Array[Int]
    ul_N | Number of ul bursts in the particular flow session | Int
    ul_V | List of total volumes of the underlying ul bursts | Array[Int]
    ul_D | List of session durations of the underlying ul bursts | Array[Float]
    ul_P | List of number of packets in the underlying ul bursts | Array[Int]
    ul_O | List of ul overlapping bursts | Array[Int]
    p_N | Number of packets in the particular flow session | Int
    p_S | List of packet sizes | Array[Int]
    p_IAT | List of inter-arrival times for packets | Array[Float]
    O_N | Number of overlapping flow sessions | Short

    CRISP-DM

    Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is among the most widely used process models in data analytics. CRISP-DM [31] breaks down the data-mining process into six different phases, which can be seen in Figure 3.4.


    Figure 3.4: CRISP-DM process [31].

    The list below describes how each step relates to the analysis process in this thesis:

    1. Business Understanding: Introduction to company, colleagues and problem.

    2. Data Understanding: Explore content of data, create plots, interpret results.

    3. Data Preparation: Preprocessing in PySpark to aggregate statistics for network data.

    4. Modeling: Sample data, apply machine learning models, plot and interpret results.

    5. Evaluation: Create groupings based on observed behavior, evaluate how well machine learning classifiers perform with this gold standard.

    6. Deployment: Deliver results from analysis, motivate choice of parameters for creating a traffic model.

    Analysis method

    After aggregating values for characteristics at the different levels of the network model, the data is normalized with percentile scaling, which adjusts the data to the range of zero to one. The mathematical formula used for normalization can be found in Equation 2.6. This is done to ensure that different metrics and scales of individual features do not affect the results. Exceptions to this method are protocols, which are encoded as zero for UDP and one for TCP, and null values, which are represented by minus one. Using extreme values enables us to more easily identify these tendencies with machine learning algorithms; an example of this is the heat map in Figure 4.6, where null values result in black boxes. Packet volume is especially heavy-tailed with large values and needs to be normalized for a fair comparison with the other features. Following the feature selection process, we split the data into two halves, where one is used in clustering analysis and the other in traffic classification.
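    The special encodings described above can be sketched as follows. Equation 2.6 is not reproduced in this section, so the sketch uses plain min-max scaling as a stand-in for the thesis's percentile scaling; this substitution, the function names and the treatment of unknown protocols as null are all assumptions made for illustration.

```python
def scale_feature(values):
    """Scale non-null values to [0, 1]; nulls (None) become -1.

    Assumption: min-max scaling stands in for Equation 2.6.
    """
    present = [v for v in values if v is not None]
    if not present:
        return [-1.0] * len(values)
    low, high = min(present), max(present)
    span = high - low
    scaled = []
    for v in values:
        if v is None:
            scaled.append(-1.0)          # null marker described in the text
        elif span == 0:
            scaled.append(0.0)           # constant feature: no spread to scale
        else:
            scaled.append((v - low) / span)
    return scaled


def encode_protocol(name):
    """UDP -> 0 and TCP -> 1 as in the text; anything else is treated as null."""
    return {"UDP": 0, "TCP": 1}.get(name, -1)


volumes = [10, None, 20, 30]
print(scale_feature(volumes))   # [0.0, -1.0, 0.5, 1.0]
print(encode_protocol("TCP"))   # 1
```

    Because valid values stay within [0, 1], the marker -1 is guaranteed to lie outside the normal range, which is what makes null entries easy to spot.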

    Analysis of traffic behavior is done by applying a K-means clustering model to the data structure. If the modeling approach works as expected, groupings with similar characteristics will be formed. To find the optimal number of clusters for the data, we utilize the Elbow method, which aims to minimize the within-cluster Sum of Squared Errors (SSE), and maximize the Silhouette Score, which ranges over [-1, 1]. From the clustering we receive groups of packets from the network data with similar characteristic behavior. Validation of the process is done by applying three different machine learning models to the other half of the data: Random Forest, Multinomial Logistic Regression and a Multilayer Perceptron. Here 70 percent of the validation data is used for training and 30 percent for testing. The models are evaluated using the standard performance metrics precision, recall and F1-score.
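    The Elbow method mentioned above can be illustrated with a toy K-means written in plain Python. This is a sketch for intuition only and not the clustering code used in the thesis: the data points are made up, the function names are hypothetical, and the centroids are initialized at evenly spaced input points (an assumption chosen to keep the example deterministic, whereas practical implementations usually randomize the start).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans_sse(points, k, iterations=20):
    """Run Lloyd's algorithm and return the within-cluster SSE."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]  # deterministic start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(members) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Three well-separated pairs of made-up, already normalized-looking points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.1, 0.0)]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

    For this input the SSE drops sharply up to k = 3 and barely improves afterwards, which is the "elbow" that suggests three clusters.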

    The next step of the analysis is to determine if these groups are coherent by investigating cluster content and comparing the CDF graphs for the parameters against the input data. This is the parameter study of the thesis goal, where we take an in-depth look at each characteristic behavior of the clusters. The conclusion of this study results in a set of traffic models with distinct behaviors. Lastly, if the clustering yields satisfactory results, we conduct an investigation of the connection between the traffic models and real-life network content. In order to confirm that the proposed traffic analysis procedure is accurate enough, we expect to see that network entities, like applications or services, can be built from the defined model; otherwise we go back, adjust our input and iterate until we reach a satisfactory outcome with sufficiently high accuracy. Figure 3.5 highlights all the steps used for processing and analyzing the aggregated statistics from the network.

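    Figure 3.5: Analysis framework for traffic data. [Flowchart with the stages Data, Standardize data, Feature Selection, Clustering (select k), Classification, Cluster Analysis and the decision "Good enough?" (Yes: Done; No: iterate); intermediate outputs are labeled Standardized data, Features, Plots, Classes and Output.]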

    3.4 PySpark

    PySpark is a development tool for processing large amounts of data in Python and is the Python interface to the popular open-source framework Apache Spark [32]. Spark provides libraries like ML and MLlib that contain most of the common machine learning and statistical algorithms used in data analysis. Spark can be deployed in the cloud, but can also be used on local clusters of computers. Spark jobs often employ a resource manager, like Yet Another Resource Negotiator (YARN), to ensure that worker nodes and memory are efficiently utilized [33]. During execution, Spark distributes the data into several smaller partitions that are processed in parallel by multiple nodes. Every Spark application has a master node that directs and keeps track of each job and a number of worker nodes that execute sub-tasks on a partition of the data [34]. A visualization of this process can be found in Figure 3.6.

    Spark introduces a data abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. It is immutable and has two types of operations available: transformations, which derive a new RDD from an existing one, and actions, which run a computation and pass the result back to the master node. Operations in Spark follow lazy evaluation, meaning that no execution starts until an action is triggered [35]. The abbreviation RDD describes the three main features of the data structure:

    1. Resilient: It is fault tolerant and capable of rebuilding data on failure.

    2. Distributed: Data is distributed among multiple nodes in a cluster.

    3. Dataset: Collection of partitioned data with values.

    Figure 3.6: Spark execution process. [Diagram: a master node splits each job into tasks that are executed on worker nodes.]
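    The difference between transformations and actions under lazy evaluation can be illustrated without a Spark installation. The class below is a toy stand-in written for this explanation; `LazyDataset` and its methods are hypothetical and are not the Spark API, and the sketch covers only immutability and deferred execution, not distribution or fault tolerance.

```python
class LazyDataset:
    """Toy stand-in for an RDD; NOT the Spark API."""

    def __init__(self, items, steps=()):
        self._items = tuple(items)   # immutable source data
        self._steps = tuple(steps)   # recorded, not yet executed, transformations

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._items, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._items, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replays the recorded steps and returns the result.
        result = list(self._items)
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return 2 * x


sizes = LazyDataset([61, 47, 61, 42])
doubled_large = sizes.filter(lambda s: s > 45).map(double)
calls_before_action = len(calls)   # still 0: only the plan exists
result = doubled_large.collect()   # the action triggers execution
print(calls_before_action, result)  # 0 [122, 94, 122]
```

    The original `sizes` object is left unchanged by the chained calls, mirroring the read-only nature of an RDD.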

    Another common Spark data structure is data-frames, which are tables that consist of rows and columns. A data-frame can be initialized by loading data from many different file types, such as Parquet, JSON, CSV or TXT, or it can be created by explicitly specifying the schema, which describes the internal structure of the data-frame. Similarly to RDDs, data-frames are also immutable and distributed. The Spark SQL module enables a programmer to use SQL queries on data-frames and perform operations as one would in an SQL database. It is also possible to convert an RDD to the data-frame format and vice versa [3].

    Spark can be integrated with a wide variety of distributed storage systems like the Hadoop Distributed File System (HDFS), Amazon S3 and OpenStack Swift. Hadoop utilizes the MapReduce pattern to enable scalable processing of big data [36]. This provides users with the opportunity to expand their processing capabilities by combining Spark with Hadoop MapReduce, HBase and other big data frameworks. There are three main ways to deploy Spark in a Hadoop cluster environment, which can be seen in Figure 3.7.

    Figure 3.7: Spark and HDFS configurations. [Three stack diagrams: (a) Standalone: Spark on top of HDFS; (b) Over YARN: Spark on YARN on HDFS; (c) Spark in MR: Spark inside Hadoop MR on HDFS.]

    1. Standalone deployment: Resources can be statically allocated, which allows Spark and HDFS to run side by side.


    2. Hadoop YARN deployment: Spark runs on HDFS with the help of YARN, which is a resource manager. This allows multiple users to access the cluster and configure the set-up used in data processing.

    3. Spark In MapReduce (SIMR): Spark runs directly on Hadoop MapReduce, which means that there is no need for users to install Spark locally.

    3.5 System setup

    The cluster used for analysis consists of seven servers with a total of 184 CPUs of the Intel Xeon E5-2658 architecture and 1120 GB of RAM. The hard drive has ten TB available for storage. For the Spark configuration, we use Hadoop YARN deployment, which provides us with flexibility regarding how many resources we want the cluster to utilize. The software used is Jupyter notebook 6.0.3, Apache Spark version 3.0.0 and Python version 3.0. Additionally, the servers have the Red Hat Enterprise Linux Server 7.7 operating system installed. A full description of the system specifications can be found in Tables 3.2 and 3.3.

    Table 3.2: Hardware specifications

    Name | CPU | RAM (GB)
    Master | 8 | 64
    Worker 1 | 8 | 64
    Worker 2 | 8 | 64
    Worker 3 | 8 | 64
    Worker 4 | 8 | 64
    Worker 5 | 72 | 400
    Worker 6 | 72 | 400

    Table 3.3: Software specifications

    Name | Version
    Python | 3.0
    Apache Spark | 3.0
    Jupyter notebook | 6.0.3


  • 4 Results

    This chapter presents the results generated during the thesis and provides the foundation needed to answer the research questions. The chapter includes a pre-study, called data exploration, in which the data used in the study is investigated, and an analysis portion which follows the structure described in Section 3.3. The analysis framework used to derive traffic models contains the following subsections: data exploration, feature analysis, cluster analysis, classification and parameter study. Lastly, the traffic models are validated in a network content analysis that investigates how the traffic models can be used to build traffic for entities in the network.

    4.1 Data exploration

    As a first step in our analysis we start by looking closer at the data, which is important in order to gain the understanding necessary to use the data with the analysis framework effectively. We divide this subsection into exploration of the raw IP packet header logs and multivariate analysis of correlation between the aggregated statistics from the network data.

    IP packet header log

    The data in this study is exclusively based on recordings of network traffic from East Asia during 2016. The data was accumulated over a time span of approximately one hour. The data is labeled by applying a Deep Packet Inspection (DPI) probe engine, which usually manages to classify a large share of network traffic, but there will still be occasions where no classification can be made, e.g. with encrypted traffic and unsupported applications and services. The tags that are provided by the engine are: protocol, encryption, encapsulation, client-application and service-provider. This technique has proven to yield good results for Europe and America, where most of the identified servers are known, but is less accurate for Asia, where less is known about the services that are used. The distributions for the various categories of the IP packet header log data can be found in Figure 4.1. The figure shows the two labeling categories that are used: Figure 4.1a shows the distribution of service providers and Figure 4.1b shows the distribution of application-clients for all packets in the data.


    Figure 4.1: Fraction of service providers and application-clients from input data. [Bar charts, x-axis: percent from 0 to 100. (a) Top ten service providers: Google Play, iTunes, Amazon, Apple, WhatsApp, Instagram, Google, YouTube, Facebook, Unknown. (b) Top ten application-clients: Skype, MMessenger, LINE, Viber, onLive, AMPlayer, iTunes, iOSMPlayer, Chrome, Unknown.]

    The dataset has a total of 99 different kinds of labeled services, and of the total labeled data Facebook, YouTube, Google, Instagram and WhatsApp constitute more than 50 percent of the total services, while the most common application-clients are Chrome, iOSMediaPlayer and iTunes. In Figure 4.1b we observe that a majority of the data is unlabeled or has a presence lower than one percent of the total dataset. This is also true to a lesser extent for service providers, where only 40 percent are unknown. These results are to be expected since the labeling process for network data from Asia is often poor in quality. In addition to providing some insight into user service preferences and habits, this visualization shows that most of the data gathered directly from the IP packet header logs is ambiguous and that we need to dig deeper to actually find interesting relationships. In Figure 4.2 we also plot the CDF of the IAT and size for packet transmissions.

    Figure 4.2: CDF of packet characteristics. [(a) Packet size CDF in bytes, x-axis: Size [B] from 0 to 2000; (b) Packet IAT in seconds, x-axis: Time [s], logarithmic from 10^-5 to 10^3; y-axis in both panels: CDF from 0 to 1.]

    In Figure 4.2a we can see that the CDF for packet size has two distinct peaks: one for very small values of less than ten bytes and one for large values of around a couple of kilobytes. This is a typical example of the bimodal distribution used as motivation for the percentile approach described in Section 3.2. Figure 4.2b shows that the packet IAT ranges from very small time intervals of less than ten microseconds to a thousand seconds. Here we note that 80 s captures more than 98 percent of the total IAT values, which is the motivation for using it as the threshold definition for flow sessions in our network model. Additionally, we also investigate the direction of the information exchange and note that the observed traffic consists of a fairly even split, with 40 percent uplink and 60 percent downlink packets. It is worth noting that the volume is significantly higher for downlink, which makes up 91 percent of the total packet volume in comparison to 9 percent for uplink.

    Feature analysis

    From the IP packet header log data a set of different statis