Multi-Source Learning in a 3G Network

YLVA ERSVIK

Master’s Thesis at CSC, KTH
Supervisor: Jens Lagergren
Examiner: Anders Lansner

TRITA xxx yyyy-nn


Abstract

By 2020 the world is expected to generate 50 times the amount of data it did in 2011, and much of this increased information will be carried over a mobile network. Understanding the data in the network can assist in mitigating threats to network performance such as congestion and help in network management and the allocation of resources. This master’s thesis aims to investigate to what extent the data carried through the mobile network can be understood in its real-world context, and whether anomalous patterns in the network data profile can be explained using external data sources. We constructed topic models using LDA for a Twitter stream in London and modeled how the topics’ relative importance changed over time. We examined three anomalous points in the network data profile and studied their correlation with the topic proportions and current weather information. The topic model for Twitter performed poorly due to the difficulty in processing the multifaceted Twitter corpus. We acknowledge the need to refine the LDA model, to include additional textual data sources, and to understand the different types of anomalies present in the network together with their causes. Such an understanding would allow for a more targeted analysis of anomalous patterns in the network and their relation to the real world.


Referat

Machine Learning with Multiple Sources in a 3G Network

By 2020 the world will generate 50 times the amount of data it did in 2011, and a large part of this increase will travel over a mobile network. Understanding the data in the network can help us safeguard network performance and inform network management and resource allocation. This thesis aims to investigate to what degree we can understand how data in a 3G network relates to its real-world context, and whether anomalous behavior in the network can be explained using external data sources. We created an LDA model for Twitter data for London and modeled how the themes of the content changed over time. We examined the relationship between three anomalous points in the network, the themes of the content, and weather information. The LDA model turned out to perform poorly because of the difficulty of handling the multifaceted Twitter content. We see a need to refine this LDA model, to include additional text sources, and to understand the different anomalous behaviors that occur in a network and their causes. Such an understanding would allow a more careful analysis of these behaviors and their relation to their real-world context.


Preface

This master’s thesis is written as a part of a degree project at the School of Computer Science and Communication at the Royal Institute of Technology (KTH) in collaboration with Ericsson.

Many thanks to my supervisor Jens Lagergren at KTH who guided me through the process, kept me focused on the task at hand and challenged me to achieve results.

Many thanks to Martin Svensson and Richard Cöster at Ericsson for their support, never-ending enthusiasm and belief in the project idea.

Ylva


Contents

1 Introduction
  1.1 What is Big Data?
  1.2 The Mobile Network and Society
  1.3 Purpose
  1.4 Goal
  1.5 Definitions
  1.6 Delimitations

2 Anomaly Detection
  2.1 Detection Techniques
  2.2 Anomaly Detection in Network Traffic

3 Learning from Text
  3.1 Topic Models
    3.1.1 Latent Dirichlet Allocation (LDA)
    3.1.2 Dynamic Topic Models
    3.1.3 Topic Models for Microblogs
  3.2 Text for Prediction
  3.3 Beyond the Text of Social Media

4 Learning in the Temporal Domain
  4.1 Time Series
    4.1.1 Similarity Measure
  4.2 Clustering
    4.2.1 Clustering of Time Series
    4.2.2 Agglomerate Hierarchical Clustering
  4.3 Prediction
  4.4 Spatiotemporal Learning

5 Data Acquisition
  5.1 Risks in the Data Collection Process

6 Method and Implementation

7 Data Overview

8 Results

9 Discussion
  9.1 Significant Correlations
  9.2 Topic Modeling for Twitter
  9.3 Future Work
    9.3.1 Model
    9.3.2 Anomalous Events
    9.3.3 Choice of Resolution

Bibliography


Chapter 1

Introduction

"If we have data, let’s look at data. If all we have are opinions, let’s go with mine."

– Jim Barksdale, former Netscape CEO

1.1 What is Big Data?

The amount and availability of data provided openly, not least on the internet, is ever increasing; we are generating massive amounts of data through countless devices that make measurements or that allow the user to share information. By 2020 the world is expected to generate 50 times the amount of data it did in 2011 [Matti and Kvernvik, 2012].

Data is often made publicly available in real time, and the opportunities for data miners seem endless. Research in the field is extensive, and we can expect knowledge to grow as it continues, not least through advances in technology, since the machine learning methods used to explore the data place extreme capacity and memory requirements on our systems.

Increasingly, data is being given a geospatial component: users send, via their mobile devices, information not only about themselves, their sentiments, thoughts, and opinions, but also about their exact geographical location. This permits us not only to perform socio-geographical analytics using the thoughts, experiences and sentiments that people express, but also to exploit mass crowd behavior in trying to understand what is going on in our societies.


What is less frequently mentioned in the context of big data are questions of information security, the ethics of collecting and storing data, as well as integrity and the right of an individual to her own data.

1.2 The Mobile Network and Society

Much of the increased information that we and our devices generate will be carried over a mobile network. When data volumes in the network increase, this can reduce performance, cause congestion, and degrade the user experience [Matti and Kvernvik, 2012].

Understanding the data in the network can assist in mitigating these threats through better network management, allocation of resources, and insight generation through global comparisons. It may also serve as a generator of ideas for the utilization of unused resources, such as recognizing additional revenue potential, ideas for new subscription plans, and optimization of roaming opportunities. The real-time information we can extract from data in the network can also be used for urban planning, such as efficient transportation and smart distribution of electricity and water, by forecasting demand and meeting it with minimal waste [Matti and Kvernvik, 2012].

Finally, understanding data is crucial for business innovation and the development of new business models, with historical examples from India and Ghana including leasing contracts for capacity in the network during peak hours [Moritz, 2012]. In short, it allows us to be proactive rather than reactive to change in the social patterns of society, something that must go hand in hand with innovation in mobile communications.

1.3 Purpose

The aim of this thesis project is to investigate how the abundance of data generated by us and by our objects and devices, and carried through the mobile network, relates to the physical, real world where it is created. Working under the assumption that the network and other data are different descriptions of the same reality, can we correlate the network profile with other variables that provide a description of our everyday behavior, social patterns, needs, events and circumstances? To what extent can network data patterns be explained by other data sources that are descriptive of the physical world? Subsequently, can other sources of data help us understand behavior in the network that would otherwise be considered anomalous or unexpected?


1.4 Goal

The goal is to develop a model that provides us with an understanding of whether and how aggregate traffic statistics¹ can be understood by looking at data generated outside the network. The model will be based on historical traffic statistics, Twitter streams and weather reports for the same time period. More specifically, the goal is to investigate whether we can find any correlation or dependency between variables at time points that seem abnormal or particularly interesting to us.

1.5 Definitions

A cellular network or mobile network is a wireless network distributed over land areas called cells.

A cell is a land area served by one or more radio base stations. Joined together, the cells provide radio coverage over a wide geographic area.

3G is the third generation of mobile telecommunications technology. Via 3G, users get access to telephony, mobile internet, mobile TV, and more.

1.6 Delimitations

We look at 3G data from a single operator per network cell.

We do not aim to be comprehensive in the choice of data sources added to the model, but rather to get an idea of the potential value of adding further sources for explaining abnormal events in the network.

It should be possible to add pre-processed data from an additional source to the model afterwards. Hence the model should be generic.

¹ Traffic volume and the number of active users within the network cell.


Chapter 2

Anomaly Detection

"To study the abnormal is the best way of understanding the normal."

– William James

Anomaly or outlier detection refers to the problem of finding patterns in data that do not conform to expected normal behavior [Chandola et al., 2009]. Such tasks have become important in many industrial and financial applications [Pokrajac et al., 2007]. Anomaly detection has been studied in various research areas for diverse application domains. Many anomaly detection techniques are generic, while others have been developed for a specific application domain.

The exact notion of an anomaly differs between application domains, and hence a technique developed in one domain cannot be applied in another without modification. Anomalies may be present in data for a variety of reasons, including media events, malicious activity, intrusion or terrorist activity, system breakdown, natural disaster or crisis events. In this chapter, we provide a survey of anomaly detection techniques as well as an overview of recent research on anomaly detection in network traffic.

2.1 Detection Techniques

Chandola et al. [2009] distinguish between classification-based, nearest-neighbor-based, clustering-based, statistical, information-theoretic, and spectral anomaly detection techniques.


Classification-based anomaly detection techniques rely on machine learning techniques for classification, such as neural networks, Bayesian networks, support vector machines, and rule-based techniques. They operate under the assumption that the training data consists of one or multiple normal classes, and any test instance that does not fall into one of those is considered anomalous.

Nearest-neighbor-based anomaly detection techniques assign to each data instance an anomaly score computed either as the distance to the kth nearest neighbor or as the relative density of the instance. Distance-based techniques usually apply a threshold to the anomaly score to determine whether an instance is anomalous or not. Density-based techniques instead use the density of the neighborhood of a data instance to determine whether it is to be declared anomalous or normal. The key advantage of nearest-neighbor techniques is that they are unsupervised. A key disadvantage is that density-based techniques perform poorly if the data has regions of varying densities; also, unsupervised techniques miss anomalies that have close neighbors and misclassify normal instances that do not have enough close neighbors.
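As an illustration of the distance-based variant, the following minimal sketch scores each point by the Euclidean distance to its kth nearest neighbor and flags scores above a cut-off. The function name `knn_scores`, the toy points and the threshold value are assumptions made for this example and are not taken from the surveyed work.

```python
import math

def knn_scores(points, k):
    """Return, for each point, the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D observations: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_scores(points, k=2)

threshold = 3.0  # assumed cut-off; in practice this has to be tuned
flags = [s > threshold for s in scores]
```

With these toy values only the isolated point receives a score above the threshold. The brute-force distance computation is quadratic in the number of points and is meant only to make the idea concrete.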

Clustering-based techniques apply clustering algorithms to find clusters in data, and operate under the assumption that normal data instances lie close to their closest cluster centroid, while anomalies are far away. The key advantage of clustering-based techniques is that they are unsupervised. The disadvantage of such techniques is that they are designed to find clusters, not anomalies, and any anomalies that form clusters will be missed.

Statistical anomaly detection techniques assume a generative stochastic model for the data. Such techniques fit a statistical model to the data, and statistical inference determines whether a particular data instance belongs to the model or not. Statistical anomaly detection techniques can be parametric or non-parametric. The key advantage of statistical techniques is that they are associated with a confidence interval; the key disadvantage is that we need to assume that the data is generated from a particular distribution.
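A parametric technique can be sketched in a few lines, assuming (purely for illustration) that normal data follows a Gaussian distribution: the mean and standard deviation are estimated from training data, and a test value is flagged when it lies more than `z_max` standard deviations from the mean. The function name, the data and the choice of `z_max = 3.0` are hypothetical.

```python
import statistics

def zscore_anomalies(train, test, z_max=3.0):
    """Flag test values whose distance from the training mean exceeds z_max standard deviations."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)  # sample standard deviation
    return [abs(x - mu) / sigma > z_max for x in test]

# Hypothetical training measurements assumed to represent normal behavior.
train = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0]
flags = zscore_anomalies(train, [10.5, 25.0])
```

Here the value 10.5 equals the training mean and is not flagged, while 25.0 lies far outside the fitted distribution. If the Gaussian assumption does not hold, the threshold loses its probabilistic interpretation, which is the disadvantage noted above.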

Information-theoretic techniques make use of measures of the information content of data, such as entropy and Kolmogorov complexity. The advantage is that we do not need to make any assumption about the generative distribution; the disadvantage is that we are highly dependent upon our choice of information-theoretic measure.

Spectral techniques try to identify subspaces where anomalous instances are more easily identified. Such techniques are suitable for high-dimensional data sets; the disadvantage is that they are useful only if there exist subspaces where normal and anomalous instances are separable.


2.2 Anomaly Detection in Network Traffic

Real network traffic has intrinsic characteristics that must be taken into account in any attempt to model normal traffic. Given these properties, an anomaly detection system must learn and consider normal behavior and hence take into account the periodicity, nonstationarity, and seasonality observed in aggregate traffic variables [Coluccia et al., 2013].

Coluccia et al. [2013] illustrate how a reference set S for normal traffic can be constructed, taking into account the nonstationarity and the daily and weekly seasonality of real traffic behavior. They further detect traffic anomalies using two different distribution-based detection approaches on an operational 3G network, where a distribution is the network-wide distribution of a network variable across individual users. To identify the reference distribution, the authors use a sliding window approach. Letting k be the current time bin, they consider the $N_w$ past time bins $k-1, k-2, \ldots, k-N_w$ and assume that the most correlated information is contained in the most recent observations. This simple method does not take into account behavior that is repeated systematically over time or that is tied to certain times of the day. The simple sliding window approach can be extended to a dual-window approach, which also takes into account the time bins corresponding to the same times on past days. Further, they consider two ways of developing a detector from the reference set: a heuristic approach that involves computation of internal and external divergence metrics, and one based on a generalized likelihood ratio test (GLRT). The detector based on divergence metrics compares the internal dispersion (the set of divergences between all pairs of distributions in the reference set) with the external dispersion (the divergences between the current distribution and those in the reference set).
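The difference between the two window types can be made concrete with a small sketch that returns the indices of the time bins used as reference for the current bin k. This is an illustration under simplifying assumptions, not the authors' implementation: the function names are hypothetical, and the dual window here adds only the single bin at the same time of day on each of `n_days` previous days.

```python
def sliding_window(k, n_w):
    """Indices of the n_w most recent time bins before bin k."""
    return [k - i for i in range(1, n_w + 1) if k - i >= 0]

def dual_window(k, n_w, n_days, bins_per_day):
    """Recent bins plus the bins at the same time of day on the previous n_days."""
    recent = sliding_window(k, n_w)
    same_time = [k - d * bins_per_day
                 for d in range(1, n_days + 1)
                 if k - d * bins_per_day >= 0]
    return sorted(set(recent + same_time))

# Hourly bins (24 per day), current bin 100, three recent bins, two past days.
reference = dual_window(k=100, n_w=3, n_days=2, bins_per_day=24)
```

For these values the reference bins are 52 and 76 (the same hour on the two previous days) together with the recent bins 97, 98 and 99.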

Similarly, D’Alconzo et al. [2010] present a statistical method for anomaly detection in 3G networks, designed to take into account the nonstationary nature of network traffic. They conclude by experiment that the assumption that the most recent samples have maximum correlation with the current sample does not hold. Instead they identify a reference set where they exclude the most recent samples from the observation window. Since traffic distributions at the same hour of different days tend to be similar, they include n previous days in the observation window, letting the reference set identification algorithm search for same-hour samples. The reference set $I_0(t)$ consists of selected past distributions observed in the current observation window $W(t)$. The observation window is a set of time bins $W(t) = \{t_j : a(t) \le t_j \le b(t)\}$, where $a(t)$ and $b(t)$ are the lower and upper bounds that define the window for the distribution $X(t)$ at time t. The distributions in $I_0(t)$ are selected based on their similarity with the current distribution. The detector is again developed as a comparison of the internal and external dispersion metrics. D’Alconzo et al. [2010] further find that detected anomalies tend to be present across variables and time scales in the network traffic, and conclude that human supervision seems unavoidable and that the effectiveness of fully automated anomaly detection systems is questionable.
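The comparison of internal and external dispersion can be sketched as follows. The choices made here are assumptions for illustration only: distributions are discrete probability vectors, the divergence is a symmetrised Kullback-Leibler divergence, and the decision rule (mean external divergence larger than the largest internal divergence) is a simple stand-in for the detectors described in the cited papers.

```python
import math
from itertools import combinations

def sym_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler divergence between two discrete distributions."""
    return sum((a - b) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def is_anomalous(current, reference):
    """Compare the current distribution against a reference set of distributions."""
    # Internal dispersion: divergences between all pairs in the reference set.
    internal = [sym_kl(p, q) for p, q in combinations(reference, 2)]
    # External dispersion: divergences between the current distribution and the set.
    external = [sym_kl(current, p) for p in reference]
    return sum(external) / len(external) > max(internal)

# Hypothetical reference set and two candidate distributions.
reference = [[0.50, 0.30, 0.20], [0.45, 0.35, 0.20], [0.50, 0.25, 0.25]]
typical = [0.48, 0.32, 0.20]
shifted = [0.10, 0.20, 0.70]
```

With these toy values `is_anomalous(typical, reference)` is false and `is_anomalous(shifted, reference)` is true, because the shifted distribution diverges from the reference set far more than the reference distributions diverge from each other.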

Contrary to Coluccia et al. [2013] and D’Alconzo et al. [2010], Siris and Papagalou [2006] use a scalar measure of traffic flow, the number of TCP SYN packets, for detecting SYN flooding¹ anomalies. They evaluate two adaptive algorithms: the adaptive threshold algorithm and the cumulative sum algorithm. The adaptive threshold algorithm is a simple algorithm that signals an alarm when k consecutive measurements exceed a threshold (α + 1)µ, where µ is the measured average and α determines the sensitivity of the detector. The cumulative sum algorithm signals an alarm when the accumulated volume of measurements that are above a threshold (α + 1)µ exceeds some threshold h.
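Both alarm rules are simple enough to sketch directly. The version below is a simplified illustration, not the algorithms as published: it assumes that the average µ is supplied as a fixed baseline rather than updated adaptively, and that the cumulative sum is reset to zero whenever it would become negative. The function names and the sample counts are hypothetical.

```python
def adaptive_threshold(xs, mu, alpha, k):
    """Index of the first alarm: k consecutive values above (alpha + 1) * mu, else None."""
    run = 0
    for i, x in enumerate(xs):
        run = run + 1 if x > (alpha + 1) * mu else 0
        if run >= k:
            return i
    return None

def cusum(xs, mu, alpha, h):
    """Index of the first alarm: accumulated excess over (alpha + 1) * mu above h, else None."""
    g = 0.0
    for i, x in enumerate(xs):
        # Accumulate the excess over the threshold; never let the sum go below zero.
        g = max(0.0, g + x - (alpha + 1) * mu)
        if g > h:
            return i
    return None

# Hypothetical SYN packet counts per time interval, with a burst in the middle.
syn_counts = [100, 105, 98, 160, 170, 180, 175, 102]
alarm_threshold = adaptive_threshold(syn_counts, mu=100, alpha=0.5, k=3)
alarm_cusum = cusum(syn_counts, mu=100, alpha=0.5, h=25)
```

With a baseline of 100 and α = 0.5 the threshold is 150. The adaptive threshold rule with k = 3 raises its alarm at index 5, the third consecutive value above 150, whereas the cumulative sum with h = 25 already raises it at index 4, once the accumulated excess (10 + 20) exceeds h.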

Other recent work includes Mackrell et al. [2013], who look at the frequency spectra of internet traffic for a selection of crisis events. They use discrete Fourier transforms to transform the time series of traffic variables to the frequency domain. They use an event model and a control model and compare their respective correlation coefficients with a test set. A binary indicator that identifies cases where the correlation coefficient of the event model exceeds that of the control model serves as an indicator of an anomaly.

¹ A SYN flood is a series of connection requests sent in an attempt to consume enough of a system’s resources to make it unavailable to its intended users.


Chapter 3

Learning from Text

Computational linguistics, or natural language processing, is a broad scientific research field that aims to find sophisticated methods to analyze the large and ever-growing corpora of natural language (such as words and text in English, Norwegian, Portuguese, or any other natural language) that are stored digitally, sometimes on the web. Though natural language processing covers many more aspects than fit into this section, we will here give a brief introduction to natural language processing for topic modeling and prediction, with a focus on microblogs and news.

3.1 Topic Models

Topic models are a suite of algorithms that aim to uncover the hidden thematic structure in a collection of documents. The models can help us understand not only what the themes are but also how they are connected and how they evolve over time [Blei, 2012].

3.1.1 Latent Dirichlet Allocation (LDA)

Latent Dirichlet allocation (LDA) is a highly successful topic modeling method first introduced by Blei et al. [2003]. Blei [2012] presents the intuitive idea behind LDA: documents are collections of words $\vec{x}_{1:D}$ that have arisen from one or more topics. Blei gives as an example a scientific article that blends evolutionary biology, genetics and data analysis. Latent Dirichlet allocation turns this intuition into a generative probabilistic process for a collection of D documents, a corpus. We assume that each of the D documents is a mixture of K topics, each topic being a distribution over the fixed vocabulary of W words. All documents share the same K topics, though they exhibit them in different proportions.

LDA treats the observed words and documents as being generated by a hidden topic structure [Blei et al., 2003]. The hidden variables are the mixing proportions $\vec{\theta}_{1:K}$ of topics per document d, the topics $\vec{\beta}_{1:K}$, and the per-word topic assignments $z_{1:D,1:W}$. The topic proportions $\vec{\theta}$ are distributions over topic indices 1, ..., K and the topics $\vec{\beta}$ are distributions over word indices 1, ..., W. Both of these are Dirichlet random variables. Now let $\vec{\alpha}$ be a positive K-vector and $\eta$ a scalar; the LDA generative process can then be written

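\begin{align}
\text{topic proportions} \quad & \vec{\theta}_d \sim \mathrm{Dir}(\vec{\alpha}) \tag{3.1} \\
\text{topics} \quad & \vec{\beta}_k \sim \mathrm{Dir}(\eta) \tag{3.2} \\
\text{topic assignments} \quad & z_{d,w} \sim \mathrm{Mult}(\vec{\theta}_d) \tag{3.3} \\
\text{words} \quad & x_{d,w} \sim \mathrm{Mult}(\vec{\beta}_{z_{d,w}}) \tag{3.4}
\end{align}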

Each $\alpha_k$ is the prior weight of topic k in a document and $\eta$ is the prior weight of each word in a topic. The problem of inferring the hidden topic structure corresponds to computing the posterior distribution of the hidden variables $\vec{\theta}_{1:K}$, $\vec{\beta}_{1:K}$ and $z_{1:D,1:W}$, given the observed documents. The posterior distribution is computationally intractable, and therefore a variety of techniques have been developed for approximate inference, including variational inference [Blei et al., 2003] and Gibbs sampling [Steyvers and Griffiths, 2006], each with advantages and disadvantages subject to trade-offs between speed, complexity, accuracy and simplicity [Blei and Lafferty, 2009].
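To make the generative process in (3.1) to (3.4) concrete, the sketch below samples a small synthetic corpus with the Python standard library. It only simulates the forward process and performs no inference. The function names, the fixed document length `n_words` and the parameter values are assumptions made for this illustration; Dirichlet samples are obtained by normalising independent Gamma draws.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def generate_corpus(D, K, W, n_words, alpha, eta, seed=0):
    """Sample K topics and D documents of n_words word indices each."""
    rng = random.Random(seed)
    beta = [dirichlet([eta] * W, rng) for _ in range(K)]       # topics, (3.2)
    docs = []
    for _ in range(D):
        theta = dirichlet(alpha, rng)                          # topic proportions, (3.1)
        words = []
        for _ in range(n_words):
            z = rng.choices(range(K), weights=theta)[0]        # topic assignment, (3.3)
            x = rng.choices(range(W), weights=beta[z])[0]      # word, (3.4)
            words.append(x)
        docs.append(words)
    return beta, docs

# Hypothetical sizes: 3 documents, 2 topics, vocabulary of 6 words.
beta, docs = generate_corpus(D=3, K=2, W=6, n_words=8, alpha=[0.5, 0.5], eta=0.5)
```

Each entry of `docs` is a list of word indices, and each row of `beta` is a probability vector over the vocabulary. Inference runs in the opposite direction: given only `docs`, it estimates the hidden quantities that this sketch draws explicitly.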

LDA assumes that words are exchangeable within documents, meaning that the order of words does not affect the probability of their topic assignments; each document is treated as a bag of words, that is, as a vector of word counts. LDA further assumes that documents are exchangeable within the corpus, meaning that their order does not affect the probability of them being generated by the model. This assumption is a simplification; many sets of documents, such as scientific journals, newspapers and emails, clearly have content that evolves over time [Blei and Lafferty, 2009, Blei, 2007].
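The bag-of-words assumption can be illustrated in a few lines: two texts containing the same words in a different order map to the same vector of word counts. The helper name and the whitespace tokenisation are simplifications assumed for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text by its word counts, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("the network carries the data")
b = bag_of_words("the data the network carries")
same_representation = (a == b)
```

Both texts yield identical counts, so a model that sees only these counts cannot distinguish between them.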

3.1.2 Dynamic Topic Models

The dynamic topic model [Blei and Lafferty, 2009] can be used to model how topics or trends within a topic evolve over time. In the dynamic topic model, documents in the corpus are sequentially organized. Here we assume that documents are exchangeable within each sequential slice. In this way, the model allows topic distributions to evolve from slice to slice, where each time slice is a separate LDA model [Blei, 2007].


3.1.3 Topic Models for Microblogs

Hong and Davison [2010] try to address the problem of topic modeling in the short text context by aggregating tweets by the same author into documents, as an extension to the standard LDA model. They find that the effectiveness of topic models can increase by aggregating short messages, and that different aggregation methods of the short tweets yield different topics.
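Aggregation by author amounts to a simple grouping step before the topic model is fitted. The sketch below assumes that tweets are available as (author, text) pairs; the function name and the sample tweets are hypothetical.

```python
from collections import defaultdict

def aggregate_by_author(tweets):
    """Concatenate each author's tweets into one pseudo-document."""
    grouped = defaultdict(list)
    for author, text in tweets:
        grouped[author].append(text)
    return {author: " ".join(texts) for author, texts in grouped.items()}

tweets = [
    ("alice", "rain in london"),
    ("bob", "match tonight"),
    ("alice", "tube delays again"),
]
docs = aggregate_by_author(tweets)
```

The three tweets become two longer documents, one per author, which gives the topic model more words per document to work with.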

Zhao et al. [2011] compare the content of a Twitter corpus with that of the New York Times, as a source of news articles, to understand topical differences and the different nature of the two sources as information providers. Zhao et al. [2011] use a standard LDA for the New York Times data set. For Twitter, where tweets are very short in comparison to a news article, they propose a Twitter-LDA model that recognizes that each tweet is usually about a single topic, with each user choosing a topic for her tweet based on her topic distribution. Their approach differs from that of Hong and Davison [2010], where tweets from a single user are treated as one document, which comes with the assumption that each document has a single author. They find that the Twitter-LDA model outperforms standard LDA for discovering topics from Twitter. They also find that Twitter and news cover similar topics but that the distributions differ; family, life and arts are given more attention than world events in the Twitter feed, while news tends to have more of a balance between world, arts and business events.

3.2 Text for Prediction

Twitter as a source of data has been extensively explored to predict a number of real-world phenomena such as flu trends [Achrekar et al., 2011], criminal incidents [Wang et al., 2012], stock market trends [Bollen et al., 2011] and state-level polling in the US [Beauchamp, 2013]. Related studies have been performed with other media content, for example stock market prediction using financial news articles [Schumaker and Chen, 2009].

Achrekar et al. [2011] present a framework for predicting the emergence and spread of influenza based on messages posted on Twitter. They use a data set consisting of a real-time stream of tweets containing textual indicators of flu, and an auto-regressive model that takes the number of unique Twitter users with flu as input.

Wang et al. [2012] study prediction of criminal incidents using historical patterns of criminal incidents in combination with data from Twitter. They show that a model that incorporates information from Twitter increases prediction accuracy of future incidents, compared to a model that relies solely on historical information on criminal incident patterns. They first extract current events from the Twitter feed of day d that they hypothesize correlate with future criminal incidents. Semantic role labeling (SRL) is used for event extraction, after which they extract topics from the events using LDA. The distribution of topics from the LDA is then used in a linear regression model that determines the probability of an incident happening on day d + 1. The parameters of the regression model are estimated using historical incident data.

Beauchamp [2013] uses a linear regression model, refitted each day, to study the relationship between polls and text in Twitter feeds, more specifically how frequencies of words are associated with changes in pro-Obama and pro-Romney vote intentions.

3.3 Beyond the Text of Social Media

Although Twitter has been used extensively for content analysis, it has also beenexplored for more generic attempts to use the geo-spatial and textual informationprovided.

Twitter has been used to build a real-time earthquake reporting system in Japan[Sakaki et al., 2010]. They detect target events, which they define as large scaleevents that affect people’s daily life, with a spatial and temporal location. Theyfirst searching among tweets that mention the target event, and classify them intoa positive or negative class. In this way, Sakaki et al. [2010] consider the tweetsas sensors with a time and location. They develop a temporal model where theycalculate the probability of an event occurring, given the sensor signals, and aspatial model where they estimate the earthquake center and trajectory estimationof a typhoon using Kalman and particle filters.

In an attempt to determine the occurrence of local events, Lee and Sumiya [2010] study crowd behavior patterns using geo-tagged microblog data. They attempt to model the regular behavior of local crowds as a geographical regularity that describes the usual crowd patterns. This geographical regularity defines the normal status of a given region, and unusual crowd activities are detected as deviations from it. If a region is unusually crowded, specific messages from within the region can be examined; the event itself, however, is defined by its location and geographical crowdedness rather than by keywords.


Chapter 4

Learning in the Temporal Domain

"The only reason for time is so that everything doesn’t happen at once."

– Albert Einstein

In the previous chapter we explored how Twitter, microblogs and other textual data sources have been exploited for learning. Could we generalize this knowledge to data sources in other domains? Acknowledging that our textual data sources as well as the data in the mobile network are indexed by time, this chapter aims to provide an overview of methodologies for knowledge discovery from time series or other temporal data.

Temporal data mining involves slightly different objectives and constraints and is concerned with the mining of large sets of data ordered with respect to some index. Such data sets could be text, time series, moves in a game, or any other sequence of data where the ordering of records is crucial. Temporal data mining tasks include prediction, classification, clustering, search, and pattern discovery [Laxman and Sastry, 2006], as well as anomaly detection [Esling and Agon, 2012]. Clustering and prediction will be briefly introduced here.

4.1 Time Series

Time series analysis has a long history, in which weather forecasting and stock market prediction are among the oldest and most studied applications. When speech recognition research emerged, matching and classification of time series started to receive attention. With that also came an increased interest in machine learning techniques such as Hidden Markov Models and Neural Networks for time series analysis [Laxman and Sastry, 2006, Esling and Agon, 2012].

4.1.1 Similarity Measure

Comparing time series and their movement patterns is central in many fields, ranging from data mining tasks such as clustering, classification and rule discovery [Kvernvik, 2013] to applications in pattern and speech recognition, surveillance, and computer animation [Soatto, 2007]. Clustering of time series, or any other type of analysis that requires a comparison between two or more series, must consider how to measure the similarity between the series. The choice of similarity measure is crucial since it affects the final classification or clustering outcome [Soatto, 2007, Lhermitte et al., 2011].

Data of the same event, object or feature can show a large degree of variability, and depending on the nature of the time series (discrete-valued or real-valued, of equal or unequal length, univariate or multivariate [Liao, 2005], non-stationary or stationary [Soatto, 2007]), one measure may be more applicable than another. The type of application may also pose additional requirements on the type of distance desired [Liao, 2005, Soatto, 2007, Esling and Agon, 2012].

Let $Q = q_1, q_2, \ldots, q_i, \ldots, q_n$ and $R = r_1, r_2, \ldots, r_j, \ldots, r_m$ be two time series. The Euclidean distance between Q and R is defined as

$$d_E(Q,R) = \sqrt{\sum_{k=1}^{n} (q_k - r_k)^2} \tag{4.1}$$

where Q and R in this case are time series of equal length n = m. Computing the Euclidean distance involves aligning the sequences one-to-one; $q_i$ is necessarily aligned with $r_i$ [Kvernvik, 2013]. Similarly, we can compute the Minkowski distance $d_M$, which generalizes the Euclidean distance to

$$d_M = \sqrt[q]{\sum_{k=1}^{n} |q_k - r_k|^q} \tag{4.2}$$

for q > 0. The main advantages of these distances are their ease of calculation and their interpretability, while their limitations include a stationarity requirement for the time series and the assumption of zero cross-correlation between the data sets. A solution to this is the use of the Mahalanobis distance [Lhermitte et al., 2011]. These distance measures, however, can be distinguished from correlation-based measures, of which the best known is Pearson’s correlation coefficient

$$d_{CC} = \frac{\sum_{k=1}^{n} (q_k - \bar{q})(r_{k-s} - \bar{r})}{\sqrt{\sum_{k=1}^{n} (q_k - \bar{q})^2}\,\sqrt{\sum_{k=1}^{n} (r_{k-s} - \bar{r})^2}} \tag{4.3}$$


that describes the linear relationship between Q and R, s being the lag between the two series [Lhermitte et al., 2011].
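As an illustration, Equations (4.1) to (4.3) can be sketched in a few lines of Python. This is a minimal sketch and not code used in this thesis. The function names are our own, the order of the Minkowski distance is called `order` to avoid a clash with the series Q, and we assume that the sums and means in the lagged correlation are taken over the overlapping part of the two series only.

```python
import math


def euclidean(q, r):
    """Euclidean distance, Eq. (4.1); the series must have equal length."""
    if len(q) != len(r):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))


def minkowski(q, r, order):
    """Minkowski distance, Eq. (4.2); order = 2 recovers the Euclidean distance."""
    if len(q) != len(r):
        raise ValueError("series must have equal length")
    if order <= 0:
        raise ValueError("order must be positive")
    return sum(abs(a - b) ** order for a, b in zip(q, r)) ** (1.0 / order)


def lagged_pearson(q, r, lag=0):
    """Pearson correlation, Eq. (4.3), between q_k and r_{k-lag}.

    Assumption: sums and means are computed over the overlapping part only.
    """
    if len(q) != len(r):
        raise ValueError("series must have equal length")
    n = len(q)
    if abs(lag) >= n - 1:
        raise ValueError("lag leaves fewer than two overlapping points")
    if lag >= 0:
        q_part, r_part = q[lag:], r[:n - lag]
    else:
        q_part, r_part = q[:n + lag], r[-lag:]
    q_mean = sum(q_part) / len(q_part)
    r_mean = sum(r_part) / len(r_part)
    cov = sum((a - q_mean) * (b - r_mean) for a, b in zip(q_part, r_part))
    q_var = sum((a - q_mean) ** 2 for a in q_part)
    r_var = sum((b - r_mean) ** 2 for b in r_part)
    if q_var == 0 or r_var == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(q_var * r_var)
```

For two series where one is a copy of the other delayed by one step, the correlation at the matching lag is 1 while the correlation at lag 0 is lower, which is the lag sensitivity discussed in this section.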

Lhermitte et al. [2011] evaluate the performance of similarity measures between an original time series and a simulated series that has been subjected to effects on amplitude, scaling, noise and time translation. Esling and Agon [2012] contribute a study of the robustness of similarity measures to scale, time warps, noise and outliers. The Euclidean distance and other distance-based measures are sensitive to distortions and unable to achieve a level of abstraction that can account for noise, outliers, amplitude and scaling effects [Esling and Agon, 2012].

Correlation-based measures evaluate the relationship between the series and do not account for distance; they thus provide a better picture of how two time series move together, without being affected by amplitude shifts. Correlation-based measures are nevertheless sensitive to noise and to time scaling and translation effects that create a lag between the original and the simulated time series. In particular, Lhermitte et al. [2011] show that an increase in the lag between series results in a decrease in the correlation. Similarly, they show that noise results in decreased correlation.

The sensitivity to time lags can be addressed by Dynamic Time Warping (DTW) [Berndt and Clifford, 1994], which allows for time shifting and for the comparison of time series of unequal lengths [Liao, 2005]. The DTW algorithm finds the optimal alignment by computing a warping path that minimizes the distance between the two time series. The first step involves computing a distance matrix with n × m elements, each representing the Euclidean distance between the points $q_i$ and $r_j$. Each possible warping between the time series is a path through the matrix; the path $W = w_1, w_2, \ldots, w_k, \ldots, w_K$ that minimizes the distance between the two series is called the warping path, and the resulting distance can be defined as

$$d_{DTW} = \min \frac{\sum_{k=1}^{K} w_k}{K} \tag{4.4}$$

for $\max(m,n) \le K \le m+n-1$ [Liao, 2005, Al-Naymat et al., 2009]. The problem with DTW is that it still suffers from sensitivity to amplitude effects due to its dependency on the Euclidean distance, as well as from a lack of robustness to noise and outliers [Esling and Agon, 2012].
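The warping path is usually found by dynamic programming over the distance matrix. The sketch below is illustrative only and was not used in this thesis. As a simplifying assumption it minimizes the cumulative cost, i.e. the sum of the $w_k$ in Equation (4.4), and omits the normalization by the path length K; the function name `dtw_distance` is our own.

```python
import math


def dtw_distance(q, r):
    """Cumulative cost of the optimal warping path between series q and r.

    Simplification: returns the minimal sum of the w_k, without dividing by K.
    """
    n, m = len(q), len(r)
    # cost[i][j] is the cheapest way to align q[:i] with r[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # For scalar points the Euclidean distance is the absolute difference.
            d = abs(q[i - 1] - r[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # q advances, r repeats
                cost[i][j - 1],      # r advances, q repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two series that differ only by a time shift or a repeated value are aligned at zero cost, whereas a constant amplitude offset is not absorbed, which reflects the amplitude sensitivity noted above.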

4.2 Clustering

The purpose of clustering is to find a structure in data by organizing it into natural groups. In clustering of temporal data, this involves grouping time series or sequences based on their similarity. Formally, the grouping should maximize the intercluster variance while minimizing the intracluster variance; the objective is to find homogeneous clusters that are as distinct as possible from each other [Esling and Agon, 2012]. In short, the time series clustering task usually involves three components: an algorithm to perform the clustering, a measure of similarity or distance between the time series, and a criterion for evaluation [Liao, 2005].

4.2.1 Clustering of Time Series

Esling and Agon [2012] divide the time series clustering task into whole-series clustering and subsequence clustering, where whole-series clustering refers to grouping entire time series into clusters, and subsequence clustering refers to grouping subsequences of a single or of multiple longer time series. Subsequence clustering therefore involves slicing time series into non-overlapping windows, where the width is chosen by investigating the periodicity of the time series, e.g. by an autocorrelation study. The limitation of this approach is that in the absence of a strong periodicity, the slicing may miss structures in the series. Several methods have been proposed to overcome this problem, including clustering algorithms that are not forced to use all available slices and algorithms that let subsequences overlap [Esling and Agon, 2012].
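The slicing step can be sketched as follows. This is a hedged illustration rather than a method from the cited work: we assume that the window width is taken as the lag with the highest sample autocorrelation up to a chosen maximum lag, and the function names are our own.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return sum((x[k] - mean) * (x[k - lag] - mean) for k in range(lag, n)) / var


def dominant_period(x, max_lag):
    """Lag in 1..max_lag with the highest autocorrelation (assumed window width)."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(x, lag))


def slice_windows(x, width):
    """Non-overlapping windows of the given width; a trailing partial window is dropped."""
    return [x[i:i + width] for i in range(0, len(x) - width + 1, width)]
```

For a series with a repeating pattern of four values, `dominant_period` returns 4 and `slice_windows` yields one window per repetition. Without strong periodicity the chosen width is arbitrary, which is the limitation described above.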

After choosing a suitable distance measure, almost any generic clustering algorithm can be adapted to fit the task at hand. Esling and Agon [2012] mention whole-series clustering methods using Self-Organizing Maps, Hidden Markov Models and Support Vector Machines. Liao [2005] provides a survey of clustering algorithms for static data and distinguishes between partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Widiputra et al. [2011] propose a clustering method for time series that models whole-series and subsequence relationships and predicts the values of the time series simultaneously. To do so, they perform whole-series clustering using Pearson’s correlation coefficient as the similarity measure, and cluster recurring trends of a time series using kernel regression.

4.2.2 Agglomerate Hierarchical Clustering

Hierarchical clustering is a method for grouping time series or other data objects into a tree of clusters. The agglomerate clustering algorithm starts by placing each object in its own cluster at the bottom of the tree and then successively merges the closest pair of clusters until all objects are in a single cluster at the top. It is therefore possible to follow the merging process. The criterion for the fusion of two clusters is a so-called linkage function, which is usually one of the following: single, complete, average, or Ward’s linkage. The single and complete linkage algorithms measure the similarity between the closest and the farthest pair of data objects, respectively. Average linkage uses the average similarity over all pairs, and Ward’s linkage algorithm merges clusters based on the increase in the sum-of-squares variance [Liao, 2005].

Agglomerate hierarchical clustering can be used with any choice of similarity measure and for time series of unequal lengths.
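The merging process can be sketched as below. This is a naive illustration, not an implementation used in this thesis: it recomputes all pairwise cluster distances at every step, covers single, complete and average linkage only (Ward’s linkage is omitted), and takes the distance function as a parameter so that any measure from Section 4.1.1 can be plugged in. All names are our own.

```python
import math


def euclidean(q, r):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))


def agglomerative(items, distance, linkage="average", n_clusters=1):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns the final clusters (lists of item indices) and the merge history.
    """
    clusters = [[i] for i in range(len(items))]
    merges = []

    def cluster_distance(a, b):
        pairwise = [distance(items[i], items[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges
```

Passing a distance that accepts series of unequal length, such as a DTW distance, instead of `euclidean` is what allows the method to handle time series of different lengths.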

4.3 Prediction

We have already seen in Section 3.2 how prediction can be achieved by building a predictive model of textual data. Prediction of time series is a major research area in several fields and aims to model dependencies between subsequent values. It has received so much attention in research that there are numerous surveys focused only on specific applications or on a specific family of methods [Esling and Agon, 2012]. In classical time series analysis, the family of autoregressive (AR) models uses a linear combination of earlier values to predict a future value. Here, the ARMA model assumes linear stationarity of the series, and the ARIMA model targets processes where the difference between successive terms can be assumed to be stationary [Laxman and Sastry, 2006].
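As a minimal illustration of the autoregressive idea, the sketch below fits a first-order model $x_t = c + \phi x_{t-1}$ by ordinary least squares and iterates it forward. It is a simplification under our own assumptions (one lag only, no moving-average term) and was not used in this thesis. The `difference` helper shows the differencing step that ARIMA-type models apply before fitting.

```python
def difference(x):
    """First differences x_t - x_{t-1}, as used to remove non-stationarity."""
    return [b - a for a, b in zip(x, x[1:])]


def fit_ar1(x):
    """Least-squares estimates (c, phi) for x_t = c + phi * x_{t-1}."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    prev_mean = sum(prev) / n
    curr_mean = sum(curr) / n
    sxx = sum((p - prev_mean) ** 2 for p in prev)
    if sxx == 0:
        raise ValueError("cannot fit an AR(1) model to a constant series")
    sxy = sum((p - prev_mean) * (c - curr_mean) for p, c in zip(prev, curr))
    phi = sxy / sxx
    return curr_mean - phi * prev_mean, phi


def forecast(x, c, phi, steps):
    """Iterate the fitted recursion forward from the last observation."""
    out, last = [], x[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out
```

For example, `fit_ar1` applied to a series generated exactly by $x_t = 2 + 0.5\,x_{t-1}$ recovers c = 2 and φ = 0.5 up to rounding.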

Also popular for the prediction of time series are machine learning methods, including neural networks, support vector machines, self-organizing maps and cluster function approximation [Esling and Agon, 2012].

The model proposed by Widiputra et al. [2011] for simultaneous prediction of multiple time series recognizes that different clusters of whole series or subsequences of time series depict different phenomena, and shows that local regressions developed for each cluster provide better prediction accuracy than global models such as linear regressions for multiple time series and the multi-layer perceptron (a neural network).

4.4 Spatiotemporal Learning

We have now seen how we can explore various types of data in the temporal domain. In the mobile network, we are also concerned with the geo-spatial dimension of our problem. How can we relate to this? There exists little work on the analysis of geo-spatial time series; the methods that exist tend to be heavily adapted to a specific problem, application, or relationships between only a subset of available locations.

Chandra and Al-Deek [2008] use cross-correlation analysis to study the dependencies of the traffic speed at a location on the traffic speeds at upstream and downstream locations. They find that past values at upstream and downstream locations influence the value at a location, and that a vector autoregressive model works better than a traditional ARIMA model for the prediction of traffic speeds at this location.

This is an example of what Rinzivillo et al. [2008] describe as an approach where spatial relationships are made explicit before modeling; the authors have a clear idea that they want to study the effect of data at two locations on a third. This approach to dealing with the geo-spatial domain is advantageous in that we can afterwards apply any standard technique for data mining, in this case autoregressive models. Rinzivillo et al. [2008] describe a second approach to geo-spatial data mining where the spatial domain can be explored during the data mining process, but where the data mining techniques have to be reinvented to suit a specific purpose.

We saw in Section 3.3 how Sakaki et al. [2010] propose temporal and spatial models for detecting earthquake events. These authors’ temporal and spatial models are largely separate. The temporal model detects tweets written about the target event (an earthquake) and classifies such tweets as positive or negative; the spatial model uses Kalman and particle filtering to locate the centers and trajectories of the targeted events.


Chapter 5

Data Acquisition

"With data collection, ’The sooner the better’ is always the best answer."

– Marissa Mayer, President and CEO of Yahoo!

Since the provision of historical data is limited, we collected the data in real time for our regions of interest at suitable aggregation levels. This proved to be quite a complex process.

The process consisted of writing scripts in Python that could collect Twitter posts in real time as they were written, and weather reports every half hour, since weather updates were made at this frequency. We collected data for the period between 1 am on 21 November 2013 and 12 am on 6 December 2013. Aggregate network statistics were extracted for this period at a resolution of 15 minutes. We chose this frequency because we wanted high-frequency data to better capture events in the network, and 15 minutes was the highest frequency available for the aggregate network statistics.

All geo-tagged Twitter posts sent in Greater London were acquired. In this region the availability of English tweets was assumed to be high, which should aid us in natural language processing and in the interpretation of the results at a later stage. This region of interest can be specified by the bounding box¹ shown in Table 5.1.

The aggregate network statistics collected included the total traffic volume and the number of active users, per cell and for the network as a whole, for one telecommunications operator in the region.

¹ A bounding box describes a land area on Earth by a bounding rectangle, defined by its latitude and longitude coordinates.

Table 5.1. Bounding box for London.

Location   Top Left (lat, lon)   Bottom Right (lat, lon)
London     51.671, -0.423        51.334, 0.159

The weather conditions collected included temperature, perceived temperature, air pressure, precipitation, and wind speed. They were collected in real time every 30 minutes² from Weather Underground’s API [Underground, 2013]. The API requests were sent over HTTP using Python and the responses were returned in JSON format. Data were then saved into files on the local drive.

Twitter posts (tweets) were streamed from Twitter’s Streaming API [Twitter, 2013b] as they were written, using a Python wrapper extending code written by Geduldig [2013]. Connecting to the Streaming API requires keeping an HTTP connection open that returns data from the Twitter API incrementally in the form of JSON-encoded Tweet objects. Twitter users can provide information about their location on their profile page but can also make the exact location of their mobile devices public. Tweets from users who have enabled geo-location have the geo encoding attached to the message, with the location specified as precise longitude and latitude coordinates. Data were then stored in JSON files on the local drive.

5.1 Risks in the Data Collection Process

The data collection process involved considerations that required particular attention. The primary risks were that the scripts collecting the data would crash due to bugs, power outages or computer failures, or that the Twitter or Wunderground servers would deny access or interrupt an existing connection; such errors would damage the data and create missing values. The Python scripts used for data acquisition had to be developed accordingly.
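One common way to guard against such failures is to retry failed requests with an increasing delay and to write each record to disk as soon as it arrives. The sketch below illustrates this pattern only; it is not the collection script used in this thesis. The `fetch` callable is a hypothetical stand-in for a single API request, and the record layout (`ok`, `data`, `error`, `collected_at`) is our own assumption.

```python
import json
import time


def fetch_with_retry(fetch, max_retries=5, base_delay_s=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff; re-raise after the last attempt."""
    delay = base_delay_s
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(delay)
            delay *= 2


def collect(fetch, path, polls, interval_s, sleep=time.sleep):
    """Poll fetch() a fixed number of times, appending one JSON line per poll.

    A failed poll is recorded explicitly so that gaps are visible in the data.
    """
    for i in range(polls):
        try:
            record = {"ok": True, "data": fetch_with_retry(fetch, sleep=sleep)}
        except Exception as exc:
            record = {"ok": False, "error": str(exc)}
        record["collected_at"] = time.time()
        # Re-open in append mode for every record so a crash loses at most one poll.
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        if i < polls - 1:
            sleep(interval_s)
```

Recording failed polls explicitly, rather than skipping them, makes missing values easy to locate when the time series is assembled later.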

² Sending requests more often would violate the API restrictions on the number of allowed requests per day.


Chapter 6

Method and Implementation

As we know from Section 3.2, numerous attempts have been made to use Twitter and other textual data sources for the prediction of seemingly unrelated variables or events. Often the prediction is targeted at a well-defined variable or event. In some cases the prediction is binary [Wang et al., 2012], or targeted at previously defined keywords in the Twitter stream [Achrekar et al., 2011, Sakaki et al., 2010]. Other attempts, described in Section 3.3, try to use the geographical information provided in tweets. More rarely have these attempts been combined; the probabilistic spatiotemporal model by Sakaki et al. [2010] is a unique attempt to make use of the whole spectrum of information in the Twitter stream for knowledge discovery. Even more rarely have attempts been made to make predictions using more than one additional source of information.

While most research in the field has focused on combining two data sets as an intersection of social data or media and the physical world, this thesis aims to explore the possibility of using multiple sets of data to generate knowledge. We work under the assumption that data sources are diverse and that we know little about the nature of the data. Therefore we want to refrain from making assumptions of stationarity, of autocorrelation, of the generative distribution, and other assumptions underlying many of the similarity measures for comparing time series and for modeling the process. As also mentioned in Section 1, we seek a model that can work for a variety of data sets.

In order to compare our data sets, we need to process them and represent them with objects that are comparable. In this case, we represent each variable with a time series, enabling us to extend the comparison to an unlimited number of variables. To achieve this, we first need to process the Twitter corpus to represent it as a time series. Following Blei [2012] we therefore implement a probabilistic topic model with LDA. Since one time point in our case corresponds to one document, we do not need to implement the dynamic topic model [Blei and Lafferty] but can implement standard LDA using the lda package for R [Chang, 2012]. As mentioned in Section 3.1.3, Twitter posts (tweets) are normally considered too short to form documents on their own. For this reason, and in order to reduce the dimensionality of the data, we aggregate all tweets written in a 15-minute interval into one document. Following [Blei and Lafferty], we preprocess the corpus by tokenization, stemming, and the removal of stop words and of words that appear fewer than five times. In addition, we only consider tweets written in English. We further remove all hashtags and words starting with @ (references to other Twitter users). A number of tokenization implementations and stemmers were tried; the main challenge lay in the lemmatization of words, since most implementations are made for clean text, whereas misspellings, abbreviations, and website addresses are plentiful in the Twitter corpus. We ended up with a corpus of 5,697 unique tokens. After preprocessing, we create a corpus of documents using the gensim package [Řehuřek, 2013] in Python, which serves as input to the LDA model in R.
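The aggregation and filtering steps described above can be sketched with the Python standard library alone. This is an illustrative simplification, not the pipeline used in this thesis: stemming and language detection are left out, the stop word list is a small placeholder, and the function names and the `(timestamp, text)` input format are our own assumptions.

```python
import re
from collections import Counter, defaultdict

# Placeholder stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it"}
TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase, drop URLs, hashtags and @-mentions, then keep alphabetic tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[#@]\w+", " ", text)
    return [t for t in TOKEN.findall(text) if t not in STOP_WORDS]


def window_start(ts, minutes=15):
    """Round a datetime down to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def build_documents(tweets, min_count=5):
    """Group (timestamp, text) pairs into one token list per 15-minute interval.

    Tokens that occur fewer than min_count times in the whole corpus are removed.
    """
    docs = defaultdict(list)
    for ts, text in tweets:
        docs[window_start(ts)].extend(tokenize(text))
    counts = Counter(tok for toks in docs.values() for tok in toks)
    return {
        start: [t for t in toks if counts[t] >= min_count]
        for start, toks in sorted(docs.items())
    }
```

Each resulting token list plays the role of one document, so the sequence of documents is itself indexed by time.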

Setting the parameters for LDA involves a few considerations. First, the LDA model requires as input the number of topics desired. Given that we know from [Zhao et al., 2011] that topics in microblogs are quite homogeneous, covering mainly topics such as arts, business, family and life, we assign a low number of topics to the model, and settle for K = 10. Second, we need to provide values for α and η, the prior weights of the per-document topics and of the per-topic words respectively. Usually α is set to a number less than 1 to achieve sparse distributions of topics per document, and η is set to a number much less than 1 to prefer sparse distributions where there are only a few words per topic. We note that Griffiths and Steyvers [2004] suggest α = 50/K and η = 0.1. We settled for α = 1, η = 0.1 and ran LDA with 1,000 iterations.
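To make the role of the two priors concrete, the standard LDA generative process can be written as below. This is a sketch for orientation; the symbols $\theta_d$, $\beta_k$, $z_{d,n}$ and $w_{d,n}$ are notation introduced here and do not appear elsewhere in this chapter.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \ldots, D
  && \text{(topic proportions of document } d\text{)} \\
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \ldots, K
  && \text{(word distribution of topic } k\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  &&&& \text{(topic of word } n \text{ in document } d\text{)} \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
  &&&& \text{(observed word)}
\end{align*}
```

For K = 10, the heuristic α = 50/K gives α = 5, so the chosen α = 1 is a considerably weaker prior than the one suggested by that heuristic.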

Aggregate network statistics are already naturally in the form of time series with 15-minute intervals. The weather information, collected every half hour, was duplicated to create a time series with 15-minute intervals between data points.
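The duplication step amounts to repeating each half-hourly reading once, 15 minutes later. A minimal sketch, assuming the readings are available as `(timestamp, value)` pairs at an exact 30-minute spacing (the function name is our own):

```python
from datetime import timedelta


def duplicate_to_15_min(readings):
    """Turn 30-minute (timestamp, value) readings into a 15-minute series.

    Each reading is repeated once, 15 minutes after its original timestamp.
    """
    out = []
    for ts, value in readings:
        out.append((ts, value))
        out.append((ts + timedelta(minutes=15), value))
    return out
```

This keeps the weather series aligned with the network statistics without introducing interpolated values that were never observed.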

In order to study dependencies between variables, we study the linear correlations between traffic volume and the external variables at three anomalous and three normal time points, for one network cell in a localized region of Greater London. Due to the limited amount of data, we choose to pick the anomalies manually. The size of the complete data set for Greater London proved to place extreme requirements on our computer system, and the lack of anomalies in the aggregated data set also reduces the relevance of such computations. Furthermore, studying all network cells and their interdependencies at the same time would result in an enormous model that would not allow for the necessary interpretation. We therefore choose to study one central cell location; we assume that it provides more variety in the data than the aggregate network does as a whole, and a centrally located cell should still be able to provide us with a reasonable though limited amount of data.


Chapter 7

Data Overview

Let us look at the time series of the variables for traffic volume and the number of active users, collected from the 3G network, in Figure 7.1 and Figure 7.2 respectively. The data concern one cell of the network for Greater London located in the proximity of Piccadilly Circus and Green Park. As expected, there is a notable periodicity seen in the seasonal component but also temporal variations depicted in the random component. No trend can be distinguished for any of the variables.

The weather data for London turned out to be of poor quality due to infrequent value updates from the weather station, as well as extreme and missing values. Precipitation seems to have been missed completely in the weather updates and we therefore remove this variable from the data set. Extreme values were removed, and missing values were taken as the average of the neighboring values. Temperature in degrees Celsius temp_c, the perceived temperature in degrees Celsius feelslike_c, the wind speed in miles per hour wind_mph and air pressure in inches of mercury pressure_in are shown in Figure 7.3.
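The cleaning step can be sketched as follows. The text does not specify how extreme values were identified or how gaps of several consecutive points were treated, so both choices here are assumptions: the bounds `low` and `high` are hypothetical parameters, and a missing point is filled with the average of its nearest non-missing neighbors on either side.

```python
def mask_extremes(values, low, high):
    """Replace values outside [low, high] with None (bounds are hypothetical)."""
    return [v if v is not None and low <= v <= high else None for v in values]


def fill_missing(values):
    """Fill each None with the mean of its nearest non-missing neighbors.

    At the ends of the series only one neighbor exists and is used as is.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = next((values[j] for j in range(i - 1, -1, -1) if values[j] is not None), None)
        right = next((values[j] for j in range(i + 1, len(values)) if values[j] is not None), None)
        known = [x for x in (left, right) if x is not None]
        if known:
            filled[i] = sum(known) / len(known)
    return filled
```

For example, `fill_missing(mask_extremes([5.0, 99.0, 7.0], -20.0, 40.0))` replaces the implausible middle reading with 6.0.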

Only tweets that are geo-encoded with coordinates within the bounding box in Table 5.1 were requested from the API. Nevertheless, when filtering on this location, the streaming API also returned tweets with no exact geo-location coordinates but with an attached place [Twitter, 2013a] that lies within the filtered region. Tweets with places are not necessarily written at that location but could also be written about the location. We therefore remove such tweets manually from the data set. In order to obtain an understanding of the Twitter stream, let us study the number of geo-tagged tweets in the target network cell. The total number of tweets, as well as the seasonal, trend and random components of the time series, are shown in Figure 7.4. Note that there is a clear seasonal component, no distinguishable trend for the time period, and random noise that shows no evidence of incidents or interesting events.
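A filter of this kind can be sketched as below. It assumes that each tweet is a parsed JSON object whose exact location, if present, is stored under a `coordinates` field as a GeoJSON point with longitude first, and whose place, if present, is stored under `place`. This layout should be checked against the API documentation [Twitter, 2013a]. The function names are our own, and the box values are those of Table 5.1.

```python
# Bounding box for London from Table 5.1 (top left and bottom right corners).
LONDON = {"north": 51.671, "west": -0.423, "south": 51.334, "east": 0.159}


def exact_point(tweet):
    """Return (lat, lon) if the tweet carries exact coordinates, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]  # assumed GeoJSON order: longitude first
    return lat, lon


def in_box(lat, lon, box=LONDON):
    """True if the point lies inside the bounding box (edges included)."""
    return box["south"] <= lat <= box["north"] and box["west"] <= lon <= box["east"]


def keep(tweet):
    """Keep only tweets with exact coordinates inside the box; place-only tweets are dropped."""
    point = exact_point(tweet)
    return point is not None and in_box(*point)
```

A tweet that only has a place attached is rejected even if the place lies inside the box, which mirrors the manual removal described above.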


[Figure 7.1: four panels (traffic_volume, Seasonal, Trend, Random); x-axis: date, 2013-11-21 to 2013-12-07; y-axis: 0e+00 to 3e+06.]

Figure 7.1. Traffic volume at the target location, with the seasonal, trend and random components.

[Figure 7.2: four panels (active_users, Seasonal, Trend, Random); x-axis: date, 2013-11-21 to 2013-12-07; y-axis: −5000 to 10000.]

Figure 7.2. Active users at the target location, with the seasonal, trend and random components.


[Figure 7.3: four panels (temp_c, feelslike_c, wind_mph, pressure_in); x-axis: date, 2013-11-22 to 2013-12-06; y-axis tick ranges: 0 to 9, −4 to 8, 0 to 30, and 29.6 to 30.4.]

Figure 7.3. The weather in London depicted by the temperature in degrees Celsius temp_c, the perceived temperature in degrees Celsius feelslike_c, the wind speed in miles per hour wind_mph and air pressure in inches of mercury pressure_in.

[Figure 7.4: four panels (Total, Seasonal, Trend, Random); x-axis: date, 2013-11-21 to 2013-12-07; y-axis: −30 to 90.]

Figure 7.4. The total number of tweets and the seasonal, trend and random components of the number of geo-tagged tweets sent near the target location.


Chapter 8

Results

The results consist of the inferred topics and the linear regression results. We study the results of regressing traffic volume on the number of active users, the number of tweets, the topic proportions and the weather variables. We first study the results of the regression run on the whole time series from 21 November through 6 December, and then the results of running the regression for three days on which we have noted an anomalous or different time series profile: 22 November, 25 November and 3 December. These results are compared with those for three normal days: 26 November, 28 November and 2 December.

Table 8.1 presents the topics resulting from the LDA iterations for the Twitter corpus. Figure 8.1 visualizes how the proportion of each topic has changed over time. These results illustrate the sensitivity of the LDA process. Although the top words of the topics present certain distinguishable characteristics, the resulting topic proportions show neither clear changes over time nor a clear difference between topics.

Table 8.2 presents the results of regressing traffic volume on the number of active users, the number of tweets, the topic proportions and the weather variables for the whole time series profile. Table 8.3 presents the results of regressing the random component of traffic volume on the random component of the number of active users, with the rest of the variables unchanged, for the same time period. Table 8.4 presents the result of the regression run on 22 November, where we have noted an anomalous profile. Likewise, Table 8.5 and Table 8.6 present the regression results for 25 November and 3 December respectively. These can be compared with the results of the same regression run for the three seemingly normal days of the network, presented in Tables 8.7 to 8.9.


Table 8.1. Top words for each topic X1-X10.

X1         X2         X3          X4         X5
lol        today      world       london     tonight
don        good       time        morning    night
im         happy      god         train      amazing
fucking    birthday   mandela     hate       coming
gonna      morning    christmas   station    coys
make       back       man         work       good
girls      work       nelson      end        dinner
watching   give       great       bus        tomorrow
lool       day        night       stop       evening
ah         hope       true        ve         weekend
fuck       enjoy      day         early      hours
sleep      st         people      victoria   watch
shit       working    rip         talk       glad
feel       home       things      means      bed
loool      coffee     found       people     follow
stop       album      np          trains     win

X6         X7          X8            X9       X10
amp        christmas   london        big      amp
great      don         britain       ff       lol
year       time        house         ll       life
news       love        greater       ve       great
nice       show        party         thing    good
young      nice        uk            back     ve
event      boy         art           half     time
england    sounds      ll            woman    tweet
today      won         pic           hear     people
health     bit         didn          cold     love
day        live        chelsea       girl     rt
people     yeah        city          music    day
change     song        westminster   fun      miss
channel    hour        mi            eating   lunch
women      made        photo         fuck     xxx
support    thought     museum        isn      mum


[Figure: ten panels, one per topic X1–X10; x-axis: time, 2013-11-22 to 2013-12-06; y-axis: proportion, 0.00 to 1.00]

Figure 8.1. The proportions of each topic X1-X10 visualized over time.


Table 8.2. Linear regression results for traffic volume: November 21 – December 6.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users          6.404e+01   2.396e+00  26.725       <2e-16 ***
no_tweets             3.266e+03   4.061e+02  8.042        1.88e-15 ***
x1_proportion         3.699e+04   6.127e+04  0.604        0.5462
x2_proportion         4.976e+04   6.140e+04  0.810        0.4179
x3_proportion         5.266e+04   6.101e+04  0.863        0.3882
x4_proportion         1.064e+05   6.328e+04  1.681        0.0930 .
x5_proportion         1.023e+05   6.345e+04  1.613        0.1071
x6_proportion         7.858e+04   6.138e+04  1.280        0.2007
x7_proportion         3.648e+04   6.286e+04  0.580        0.5618
x8_proportion         5.011e+04   6.149e+04  0.815        0.4153
x9_proportion         6.873e+04   6.436e+04  1.068        0.2857
x10_proportion        NA          NA         NA           NA
temp_c                2.134e+04   1.054e+04  2.024        0.0432 *
feelslike_c           -1.720e+04  7.495e+03  -2.294       0.0219 *
wind_mph              1.714e+03   2.101e+03  0.816        0.4147
pressure_in           6.491e+04   2.807e+04  2.312        0.0209 *
(Intercept)           -2.081e+06  8.613e+05  -2.416       0.0158 *

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
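One possible reading of the NA row for x10_proportion in Table 8.2: if the ten topic proportions sum to one in every time window, they are perfectly collinear with the intercept, so a least-squares fit cannot identify all ten coefficients, and R's `lm` reports NA for such an aliased term. This is an assumption about the cause, not something stated in the thesis; Table 8.5, for instance, does report an estimate for x10_proportion, so the sum-to-one condition may not hold in every subset. The sketch below checks the rank argument on made-up proportions for three topics; `matrix_rank` is a hypothetical helper written for this illustration.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # column is (numerically) dependent on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Made-up topic proportions for five time windows; each row sums to one.
proportions = [
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
]

# Design matrix with an intercept column and all three proportions.
full_design = [[1.0] + row for row in proportions]
# Same design with the last proportion dropped.
reduced_design = [[1.0] + row[:-1] for row in proportions]

full_rank = matrix_rank(full_design)        # 4 columns, but rank 3
reduced_rank = matrix_rank(reduced_design)  # 3 columns, rank 3
```

With four columns but rank three, one coefficient in the full design is not identifiable; dropping one proportion restores full column rank, and the remaining coefficients are then read relative to the dropped topic.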

Table 8.3. Linear regression results for the random component of traffic volume: November 21 – December 6.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   3.291e+01   4.899e+00  6.717        2.72e-11 ***
no_tweets             4.030e+02   3.623e+02  1.112        0.266
x1_proportion         3.318e+04   5.790e+04  0.573        0.567
x2_proportion         6.326e+04   5.800e+04  1.091        0.2769
x3_proportion         6.326e+04   5.800e+04  1.091        0.276
x4_proportion         9.362e+04   5.973e+04  1.567        0.117
x5_proportion         9.006e+04   5.997e+04  1.502        0.133
x6_proportion         8.220e+04   5.803e+04  1.417        0.157
x7_proportion         6.693e+04   5.972e+04  1.121        0.263
x8_proportion         3.007e+04   5.903e+04  0.509        0.611
x9_proportion         6.843e+04   6.062e+04  1.129        0.259
x10_proportion        NA          NA         NA           NA
temp_c                -2.827e+03  1.019e+04  -0.278       0.781
feelslike_c           -3.768e+02  7.253e+03  -0.052       0.959
wind_mph              2.627e+03   1.921e+03  1.367        0.172
pressure_in           4.545e+04   3.375e+04  1.347        0.178
(Intercept)           -1.448e+06  1.037e+06  -1.396       0.163

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Table 8.4. Linear regression results for the random component of traffic volume on November 22: anomalous day.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   -2.631e+02  9.358e+01  -2.812       0.00623 **
no_tweets             3.629e+02   4.354e+03  0.083        0.93379
x1_proportion         3.314e+05   3.731e+05  0.888        0.37710
x2_proportion         1.256e+05   4.657e+05  0.270        0.78814
x3_proportion         -3.760e+03  4.316e+05  -0.009       0.99307
x4_proportion         5.627e+05   3.999e+05  1.407        0.16340
x5_proportion         1.801e+05   4.105e+05  0.439        0.66208
x6_proportion         1.818e+05   3.672e+05  0.495        0.62197
x7_proportion         2.884e+05   4.138e+05  0.697        0.48785
x8_proportion         -1.225e+05  3.477e+05  -0.352       0.72556
x9_proportion         6.904e+05   4.331e+05  1.594        0.11497
x10_proportion        NA          NA         NA           NA
temp_c                6.178e+03   1.477e+05  0.042        0.96674
feelslike_c           -8.406e+04  1.203e+05  -0.699       0.48693
wind_mph              3.912e+04   3.942e+04  0.992        0.32416
pressure_in           2.015e+06   7.343e+05  2.744        0.00753 **
(Intercept)           -6.086e+07  2.203e+07  -2.763       0.00715 **

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Table 8.5. Linear regression results for the random component of traffic volume on November 25: anomalous day.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   -1.104e+02  7.392e+01  -1.494       0.1393
no_tweets             7.256e+03   4.012e+03  1.809        0.0743 .
x1_proportion         -2.833e+05  3.935e+05  -0.720       0.4737
x2_proportion         -6.566e+05  3.151e+05  -2.084       0.0405 *
x3_proportion         -1.486e+05  3.628e+05  -0.410       0.6832
x4_proportion         -4.555e+05  2.548e+05  -1.788       0.0777 .
x5_proportion         3.578e+05   3.559e+05  1.005        0.3178
x6_proportion         -1.655e+05  3.432e+05  -0.482       0.6309
x7_proportion         2.519e+05   4.378e+05  0.575        0.5667
x8_proportion         -1.812e+05  3.801e+05  -0.477       0.6349
x9_proportion         -2.683e+05  4.654e+05  -0.576       0.5660
x10_proportion        -4.808e+05  3.788e+05  -1.269       0.2082
temp_c                1.285e+05   1.731e+05  0.742        0.4601
feelslike_c           -1.834e+05  1.424e+05  -1.288       0.2017
wind_mph              -8.261e+04  6.808e+04  -1.214       0.2286
pressure_in           -3.724e+06  1.688e+06  -2.207       0.0303 *
(Intercept)           1.147e+08   5.156e+07  2.224        0.0290 *

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Table 8.6. Linear regression results for the random component of traffic volume on December 3: anomalous day.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   -1.020e+02  4.423e+01  -2.305       0.0238 *
no_tweets             3.866e+03   2.552e+03  1.515        0.1337
x1_proportion         -1.548e+05  3.032e+05  -0.511       0.6110
x2_proportion         3.444e+05   2.882e+05  1.195        0.2356
x3_proportion         -9.362e+04  2.521e+05  -0.371       0.7113
x4_proportion         -3.351e+04  2.535e+05  -0.132       0.8952
x5_proportion         1.676e+05   3.068e+05  0.546        0.5865
x6_proportion         1.813e+05   2.672e+05  0.678        0.4995
x7_proportion         1.354e+05   2.528e+05  0.535        0.5939
x8_proportion         1.476e+06   2.817e+05  5.241        1.3e-06 ***
x9_proportion         2.397e+05   2.313e+05  1.036        0.3033
x10_proportion        NA          NA         NA           NA
temp_c                -1.430e+04  1.415e+05  -0.101       0.9198
feelslike_c           9.461e+04   1.012e+05  0.935        0.3525
wind_mph              -1.776e+04  4.589e+04  -0.387       0.6997
pressure_in           -3.620e+06  1.432e+06  -2.528       0.0135 *
(Intercept)           1.095e+08   4.356e+07  2.514        0.0140 *

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Table 8.7. Linear regression results for the random component of traffic volume on November 26: normal day.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   -5.345e+01  4.902e+01  -1.090       0.2807
no_tweets             5.021e+03   2.084e+03  2.409        0.0197 *
x1_proportion         -7.692e+04  1.554e+05  -0.495       0.6227
x2_proportion         1.029e+05   2.076e+05  0.496        0.6223
x3_proportion         -4.262e+05  2.029e+05  -2.100       0.0408 *
x4_proportion         2.126e+05   1.527e+05  1.392        0.1700
x5_proportion         -7.169e+04  1.607e+05  -0.446       0.6575
x6_proportion         8.472e+03   1.593e+05  0.053        0.9578
x7_proportion         -9.491e+04  1.617e+05  -0.587       0.5599
x8_proportion         3.840e+04   1.705e+05  0.225        0.8227
x9_proportion         2.493e+04   2.076e+05  0.120        0.9049
x10_proportion        NA          NA         NA           NA
temp_c                9.198e+04   8.965e+04  1.026        0.3098
feelslike_c           -8.512e+04  7.895e+04  -1.078       0.2861
wind_mph              -7.228e+04  4.647e+04  -1.555       0.1262
pressure_in           -4.711e+06  2.048e+06  -2.300       0.0257 *
(Intercept)           1.447e+08   6.287e+07  2.302        0.0255 *

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Table 8.8. Linear regression results for the random component of traffic volume on November 28: normal day.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   2.554e+01   2.286e+01  1.117        0.267
no_tweets             -2.267e+02  8.682e+02  -0.261       0.795
x1_proportion         3.838e+04   1.076e+05  0.357        0.722
x2_proportion         5.458e+00   1.374e+05  0.000        1.000
x3_proportion         6.666e+04   1.156e+05  0.577        0.566
x4_proportion         -1.805e+05  1.138e+05  -1.586       0.117
x5_proportion         1.087e+05   1.566e+05  0.694        0.490
x6_proportion         -2.953e+04  1.101e+05  -0.268       0.789
x7_proportion         6.829e+03   1.189e+05  0.057        0.954
x8_proportion         -9.717e+04  1.578e+05  -0.616       0.540
x9_proportion         1.605e+05   1.249e+05  1.285        0.202
x10_proportion        NA          NA         NA           NA
temp_c                -6.647e+04  7.534e+04  -0.882       0.380
feelslike_c           4.543e+03   4.107e+04  0.111        0.912
wind_mph              -1.486e+04  1.351e+04  -1.100       0.274
pressure_in           5.810e+05   1.014e+06  0.573        0.568
(Intercept)           -1.715e+07  3.113e+07  -0.551       0.583

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Table 8.9. Linear regression results for the random component of traffic volume on December 2: normal day.

Explanatory Variable  Estimate    Std. Err.  t statistic  Pr(>|t|)
active_users$random   -4.733e+01  4.474e+01  -1.058       0.2933
no_tweets             2.209e+03   2.369e+03  0.932        0.3539
x1_proportion         -2.113e+05  1.861e+05  -1.136       0.2595
x2_proportion         -3.508e+05  2.163e+05  -1.622       0.1088
x3_proportion         -3.068e+04  2.206e+05  -0.139       0.8897
x4_proportion         -2.787e+05  2.183e+05  -1.277       0.2055
x5_proportion         -3.451e+05  2.128e+05  -1.621       0.1089
x6_proportion         -1.655e+05  2.080e+05  -0.796       0.4285
x7_proportion         -1.852e+05  2.105e+05  -0.880       0.3816
x8_proportion         -1.795e+05  2.366e+05  -0.759       0.4503
x9_proportion         -1.587e+05  1.693e+05  -0.938       0.3513
x10_proportion        NA          NA         NA           NA
temp_c                3.277e+04   5.991e+04  0.547        0.5859
feelslike_c           3.543e+03   4.461e+04  0.079        0.9369
wind_mph              -1.022e+04  1.303e+04  -0.784       0.4353
pressure_in           2.421e+06   1.120e+06  2.161        0.0337 *
(Intercept)           -7.400e+07  3.426e+07  -2.160       0.0338 *

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Chapter 9

Discussion

"Don’t spend time beating on a wall, hoping to transform it into a door."

– Coco Chanel

While the original ambition was to create a model that could depict the network with its geo-spatial dependencies, we have instead studied how the network behavior at a local geographic point correlates with variables from external data sources. The level of resolution targeted in this project had to be reduced after the system requirements proved infeasibly high; the dimensionality of the problem had to be reduced in order to answer the research questions. The lack of system capacity for the original idea illustrates the challenge that the big data community still battles with, and suggests that machine learning research can look forward to much further advancement as technology progresses.

9.1 Significant Correlations

In Tables 8.4–8.6 we note that traffic volume on a so-called anomalous day varies significantly with the number of active users in the network cell in two of the three cases, and with the number of tweets in one of the cases (at the 10% level only). Traffic volume varies significantly with air pressure in all three cases.

From the results in Tables 8.7–8.9 it is apparent that traffic volume on a normal day varies significantly with air pressure and the intercept in two of the cases, and with nothing at all in the third case.


The presence of significant coefficients suggests that modeling dependencies between the different data sets and variables may be possible, in particular for days where we have noted an anomalous time series profile. An important remaining problem is that we do not understand the anomalous events in terms of their real-world causes, and hence it is difficult to judge whether the correlations found are plausible. Nor can we evaluate the appropriateness of the chosen model.

9.2 Topic Modeling for Twitter

Capturing the topic dynamics of a Twitter stream using LDA proved particularly challenging. The main difficulty lay in finding a suitable lemmatizer for the tweets, which do not adhere to the characteristics and semantic structure that can normally be assumed for literature or other edited text. Indeed, tweets have numerous features that do not exist in normal text, such as hashtags, website addresses, slang and social abbreviations. The failure of lemmatization and stemming resulted in poor pre-processing of the Twitter data and affected the topic modeling process negatively. This can be seen in the topics and word contexts in Table 8.1 and in the topic proportions as a function of time, visualized in Figure 8.1, in both of which noise seems to be present. The lack of relative change in topic proportions could also be due to the Twitter stream itself not changing much over time. A better indicator of changing topics may therefore be found in news articles or other textual data sources, although these have the disadvantage that they cannot capture localized events within regions of a city.
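Several of the top words in Table 8.1 ("amp", "rt", "ve", "ll", "don") look like fragments of HTML entities, retweet markers and split contractions. The sketch below shows one way such tweet-specific features could be handled before topic modeling. It is not the pre-processing used in this thesis; the function name and the regular expressions are assumptions made for illustration.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A word, optionally followed by an apostrophe part ("don't", "we've").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def normalize_tweet(text):
    """Return lower-case word tokens with tweet-specific markup removed."""
    text = html.unescape(text)          # "&amp;" becomes "&", not "amp"
    text = text.replace("\u2019", "'")  # curly apostrophe to straight
    text = URL_RE.sub(" ", text)        # drop website addresses
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = text.replace("#", " ")       # keep the hashtag word, drop "#"
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t != "rt"]  # drop the retweet marker


example = "RT @bob: Don't miss #London trains &amp; buses http://t.co/abc"
tokens = normalize_tweet(example)
# tokens == ["don't", "miss", "london", "trains", "buses"]
```

Keeping contractions whole means a later stop-word list can remove "don't" as one unit instead of leaving "don" and "t" behind; slang and elongated spellings such as "lool" and "loool" would still need separate handling.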

The poor results of the topic modeling procedure could also partly be explained by what we know from Zhao et al. [2011], that tweets capture much less of the spectrum of society than news articles do. Further research would benefit from including textual data sources with a richer variety of content than the Twitter stream.

An additional challenge in modeling local Twitter trends is the choice of resolution, or target geographical area. A small region may capture the local trends in the network cell better than a large area, but a small area nevertheless suffers from inconclusiveness: it yields many missing values and very short documents, which affects significance negatively. With increasing size of the target area, however, the correlation between the local Twitter stream and the behavior of the local cell decreases, as network behavior is then shared between cells. This trade-off must be considered in future work. It relates to the problem that the Twitter corpus used in this study was found to be sparse; aggregation of tweets over 15-minute windows yielded many missing values. For a sparse data set, a better technique may be to aggregate over a longer time period. Such an approach could potentially provide a more stable outcome where topic proportions do not fluctuate wildly between near-zero and near-one values. In addition, it would be sensible to follow Sakaki et al. [2010] and choose a target area with a track record of numerous and geographically dispersed Twitter users.
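The effect of the window length on missing values can be checked directly before fitting any topic model. The sketch below counts the fraction of aggregation windows that contain no tweets at all; the function name, its parameters and the toy timestamps are hypothetical and only illustrate the trade-off described above.

```python
from datetime import datetime, timedelta


def empty_window_fraction(timestamps, start, end, window):
    """Fraction of fixed-length windows in [start, end) without any tweet."""
    n_windows = (end - start) // window  # timedelta // timedelta gives an int
    counts = [0] * n_windows
    for ts in timestamps:
        if start <= ts < start + n_windows * window:
            counts[(ts - start) // window] += 1
    return sum(1 for c in counts if c == 0) / n_windows


start = datetime(2013, 11, 22, 0, 0)
end = start + timedelta(hours=2)
# Toy data: three tweets in two hours.
tweets = [
    start + timedelta(minutes=5),
    start + timedelta(minutes=7),
    start + timedelta(minutes=100),
]

sparse = empty_window_fraction(tweets, start, end, timedelta(minutes=15))
coarse = empty_window_fraction(tweets, start, end, timedelta(minutes=60))
# sparse == 0.75 (6 of 8 windows empty), coarse == 0.0
```

Longer windows reduce the number of empty documents at the cost of temporal resolution, so the window length would have to be chosen relative to the duration of the anomalies of interest.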

It remains an open question whether topic modeling for Twitter can effectively capture changes in the 3G network. A better method could be to follow, for example, Sakaki et al. [2010] and Achrekar et al. [2011] and look for target words in the Twitter stream rather than topics. This leads us to a discussion of the model in the next section.
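A target-word approach replaces the topic proportions with a much simpler signal: the number of tweets per time window that mention any word from a fixed list. The sketch below is a minimal illustration of that idea, not the method of Sakaki et al. or Achrekar et al.; the function name, the target words and the toy tweets are all made up.

```python
import re
from collections import Counter
from datetime import datetime, timedelta


def target_word_counts(tweets, targets, start, window):
    """Count tweets per window that contain any target word (whole words)."""
    alternatives = "|".join(re.escape(word) for word in targets)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    counts = Counter()
    for ts, text in tweets:
        if ts >= start and pattern.search(text):
            counts[(ts - start) // window] += 1  # window index as key
    return counts


start = datetime(2013, 11, 22, 0, 0)
toy_tweets = [
    (start + timedelta(minutes=3), "Tube strike again"),
    (start + timedelta(minutes=10), "Striking views from the bridge"),
    (start + timedelta(minutes=20), "Long DELAY at Victoria"),
    (start + timedelta(minutes=25), "delays everywhere"),
]

counts = target_word_counts(
    toy_tweets, ["strike", "delay"], start, timedelta(minutes=15)
)
# dict(counts) == {0: 1, 1: 1}
```

Whole-word matching misses inflected forms such as "delays", so in practice the target list would need variants or stemming; the resulting counts could enter the regression in place of the ten topic proportions.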

9.3 Future Work

9.3.1 Model

Despite the lack of a convincing result, the regression analysis indicates that fusing several sources of data could help us understand the time series profiles from the network, in particular in situations where we observe events that cannot be considered normal in the network. This conclusion requires verification, but it is good news for future research, which could benefit from extending the regression scheme into a predictive model with a weighting scheme based on a knowledge repository of clustered recurring trends and profiles of time series. In this way, the model could capture non-linear as well as lagged dependencies between variables, which a linear regression model fails to consider. Given that dynamic time warping can capture interdependencies of time series irrespective of time lags, clustering with dynamic time warping could help us understand non-linear dependencies and use them for prediction. It will be interesting to see whether such a predictive model can compete with linear regression for these types of problems, where the data sets are numerous and of great variety. At this stage, it remains an open question whether such a model can be generic across the different data sets, or whether targeted models need to be developed for each data set and thereafter combined with ensemble methods.
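To make the point about time lags concrete, the sketch below implements the textbook dynamic time warping distance with an absolute-difference cost and no window constraint, and compares it with a point-by-point distance on a series and a lagged copy of it. The function names and the toy series are illustrative assumptions, not part of the thesis.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] aligned with an already used b point
                cost[i][j - 1],      # b[j-1] aligned with an already used a point
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def pointwise_distance(a, b):
    """Sum of absolute differences at equal time indices."""
    return sum(abs(x - y) for x, y in zip(a, b))


peak = [0, 0, 1, 2, 1, 0]
lagged_peak = [0, 1, 2, 1, 0, 0]  # same shape, one step earlier

warped = dtw_distance(peak, lagged_peak)        # 0.0
rigid = pointwise_distance(peak, lagged_peak)   # 4
```

The warped distance is zero because the alignment absorbs the one-step lag, while the rigid comparison treats the two series as different; this is the property that makes dynamic time warping a candidate distance for clustering lagged profiles.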

9.3.2 Anomalous Events

Future work would further require a rigorous study of anomalous events in a network: which types of anomalies exist in the network, and why? A more precise idea of the targeted events in the network would provide a better outlook for research in the field. Our understanding of the different anomalies and their relations to the real world could at a later stage be aggregated in order to achieve a more complete understanding of the network.


Understanding anomalies is crucial in that it helps us define the target event, which in turn can help us break down the problem into parts. It is reasonable to believe that certain anomalies in the network are more interesting than others from the network provider’s point of view, and that not all anomalies are equally harmful or undesired. An understanding of the anomalies can help the researcher focus and should provide results that allow interpretation and evaluation. Further, with a target event, the model becomes less generic and our wide search for answers is reduced to a problem that can be solved with a systematic problem-solving process. For example, Sakaki et al. [2010] study the geo-spatial dependencies of a target event. It should prove much more difficult to study the geo-spatial dependencies of all types of events; the problem becomes too complex and interpretations difficult.

9.3.3 Choice of Resolution

Regardless of the definition, it proved difficult to find anomalous or abnormal events in the network data with the data set available. A low level of resolution, for example looking at the aggregate traffic in Greater London, would require anomalies of larger scale, which in turn are less frequent, for us to be able to perform our study. Such anomalies cannot be observed in our available data set (i.e. for Greater London). With a higher level of resolution, anomalies can be observed but their significance becomes increasingly questionable. Also, the size of the external (Twitter) data set has to be reduced, which poses a threat to the scientific process. In this case we found that the Twitter stream filtered on the single network cell near Piccadilly proved undesirably limited.

A greater amount of data at an increased level of resolution would increase the number of anomalies, but with that comes the need for an automatic anomaly detection system, which extends beyond the scope of this thesis, and for improved data capacity, which was a problem already in this case. This relates to the fact that we still lack a structured and systematic method to model dependencies between spatial locations; a greater understanding of network anomalies would permit us to develop such a method.

Furthermore, the study would have benefited from a more rigorous formulation of the research questions. Especially in machine learning, which is increasingly looked upon as the solution to any business intelligence problem, research questions need to be posed with care to achieve tangible results.

Lastly, the credibility of our results would benefit from longer time series that enable more anomalous as well as normal points to be present in the data.


Bibliography

H. Achrekar, A. Gandhe, R. Lazarus, S. Yu, and B. Liu. Predicting flu trends using twitter data. In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on, pages 702–707. IEEE, 2011.

G. Al-Naymat, S. Chawla, and J. Taheri. Sparsedtw: A novel approach to speed up dynamic time warping. 8th Australian Data Mining Conference, 101:117–127, 2009.

N. Beauchamp. Predicting and interpolating state-level polling using twitter textual data. In 2013 Australian Political Studies Association Annual Conference (APSA 2013). 2013.

D. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In AAAI-94 workshop on Knowledge Discovery in Databases, pages 229–248. 1994.

D. Blei. Modeling science: Dynamic topic models of scholarly research. Google Tech Talks, May 2007. http://www.youtube.com/watch?v=8nBE5Qm8y6I Blei.

D. Blei. Probabilistic topic models. In Communications of the ACM, volume 55, pages 77–84. 2012.

D. Blei and J. Lafferty. Dynamic topic models. In ICML 2006.

D. Blei and J. Lafferty. Topic models. In Text Mining: Theory and Applications. Taylor and Francis, 2009.

D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. In ACM Computing Surveys (CSUR), volume 41. 2009.


S. Chandra and H. Al-Deek. Cross-correlation analysis and multivariate prediction of spatial time series of freeway traffic speeds. Transportation Research Record: Journal of the Transportation Research Board, (2061):64–76, 2008.

J. Chang. lda: Collapsed gibbs sampling methods for topic models, 2012. http://cran.r-project.org/web/packages/lda/.

A. Coluccia, A. D’Alconzo, and F. Ricciato. Distribution-based anomaly detection in network traffic. In Data Traffic Monitoring and Analysis: From Measurement, Classification, and Anomaly Detection to Quality of Experience, volume 7754 of Lecture Notes in Computer Science, pages 202–216. Springer, 2013.

A. D’Alconzo, A. Coluccia, and P. Romirer-Maierhofer. Distribution-based anomaly detection in 3g mobile networks: from theory to practice. International Journal of Network Management, 20(5):245–269, 2010.

P. Esling and C. Agon. Time-series data mining. ACM Computing Surveys (CSUR), 45(1):12:1–12:34, 2012.

J. Geduldig. Python wrapper for twitter’s rest and streaming apis. Website, October 2013. https://github.com/geduldig/TwitterAPI.

T. Griffiths and M. Steyvers. Finding scientific topics. In Proceedings of the National Academy of Sciences of the USA, volume 101. 2004.

L. Hong and B.D. Davison. Empirical study of topic modeling in twitter. In SOMA’10 Proceedings of the First Workshop on Social Media Analytics, pages 80–88. ACM, 2010.

T. Kvernvik. Dynamic time warping. Unpublished, 2013.

S. Laxman and P. Sastry. A survey of temporal data mining. Sadhana, 31(2):173–198, 2006.

R. Lee and K. Sumiya. Measuring geographical regularities of crowd behaviors for twitter-based geo-social event detection. In Proceedings of ACM LBSN 2010, pages 1–10. 2010.

S. Lhermitte, J. Verbesselt, W. W. Verstraeten, and P. Coppin. A comparison of time series similarity measures for classification and change detection of ecosystem dynamics. Remote Sensing of Environment, 115(12):3129–3152, 2011.

T. Liao. Clustering of time series data – a survey. Pattern Recognition, 38(11):1857–1874, 2005.

M. Mackrell, K. Twilley, W. Kirk, L. Lu, J. Underhill, and L. Barnes. Discovering anomalous patterns in network traffic data during crisis events. In 2013 IEEE Systems and Information Engineering Design Symposium (SIEDS), pages 52–57. 2013.


M. Matti and T. Kvernvik. Applying big-data technologies to network architecture. Ericsson Review, (284 23-3181), 2012.

S. Moritz. Exploring data. Unpublished, 2012.

D. Pokrajac, A. Lazarevic, and L. Latecki. Incremental local outlier detection for data streams. In 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), pages 504–515. 2007.

S. Rinzivillo, F. Turini, V. Bogorny, C. Körner, B. Kuijpers, and M. May. Knowledge discovery from geographical data. In F. Giannotti and D. Pedreschi, editors, Mobility, Data Mining and Privacy, pages 243–265. Springer, 2008.

T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW 2010), WWW 2010, pages 851–860. ACM, 2010.

R. Schumaker and H. Chen. Textual analysis of stock market prediction using breaking financial news: The azfintext system. In ACM Transactions on Information Systems (TOIS), volume 27, pages 12:1–12:19. ACM, 2009.

V. Siris and F. Papagalou. Application of anomaly detection algorithms for detecting syn flooding attacks. In 12th IEEE International Conference on Network (ICON 2004), volume 29, pages 1433–1442. 2006.

S. Soatto. On the distance between non-stationary time series. In Modeling, Estimation and Control: Festschrift in Honor of Giorgio Picci on the Occasion of his Sixty-Fifth Birthday, volume 364 of Lecture Notes in Control and Information Sciences, pages 285–299. Springer, 2007.

M. Steyvers and T. Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006.

Twitter. Places. Website, November 2013a. https://dev.twitter.com/docs/platform-objects/places.

Twitter. The streaming apis. Website, October 2013b. https://dev.twitter.com/docs/streaming-apis/.

Weather Underground. A weather api designed for developers. Website, October 2013. http://www.wunderground.com/weather/api/.

R. Řehuřek. gensim: Topic modelling for humans. Website, December 2013. http://radimrehurek.com/gensim/.

X. Wang, M. Gerber, and D. Brown. Automatic crime prediction using events extracted from twitter posts. In S.J. Yang, A.M. Greenberg, and M. Endsley, editors, 2012 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction (SBP12), volume 7227, pages 231–238. Springer, 2012.

H. Widiputra, R. Pears, and N. Kasabov. Multiple time-series prediction through multiple time-series relationships profiling and clustered recurring trends. In J. Huang, L. Cao, and J. Srivastava, editors, Advances in Knowledge Discovery and Data Mining: 15th Pacific-Asia Conference (PAKDD 2011), volume 6635 of Lecture Notes in Computer Science, pages 303–347. Springer, 2011.

W. Zhao, J. Jiang, J. Weng, J. He, E. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. In Advances in Information Retrieval, Proceedings of ECIR 2011, volume 6611 of Lecture Notes in Computer Science, pages 338–349. Springer, 2011.
