Topology and Word Spaces Mapper and Betti 0 Barcodes Applied to Random Indexing Word-Spaces - a First Survey DAVID NILSSON ARIEL EKGREN Bachelor’s Thesis at CSC Supervisors: Hedvig Kjellström Jussi Karlgren Mikael Vejdemo-Johansson Examiner: Mårten Olsson


Abstract

This paper introduces analytic methods for linguistic data represented as word-spaces constructed with the random indexing model. It presents two methods: a visualisation method derived from an algorithm called Mapper, and a word-space property measure derived from Betti numbers. The methods are explained and then implemented in order to demonstrate their behaviour on a smaller set of linguistic data. The implementations serve as a foundation for future research.


Contents

1 Introduction
    1.1 Problem
    1.2 Thesis
    1.3 Goal
    1.4 Background
        1.4.1 Computational Linguistics
        1.4.2 Topological data analysis
        1.4.3 Topology and Algebraic Topology
        1.4.4 Mapper

2 The Topology of Text
    2.1 Data Extraction
    2.2 Barcodes
        2.2.1 The Algorithm
    2.3 Mapper

3 Experimental Results
    3.1 Barcodes
    3.2 Mapper

4 Conclusions
    4.1 Discussion
        4.1.1 Betti 0
        4.1.2 Mapper
    4.2 Summary
    4.3 Future Work
        4.3.1 Barcodes
        4.3.2 Mapper

Bibliography


1. Introduction

Analysis of vast amounts of high dimensional data is a growing field in both science and industry and is often referred to as Big Data analytics. Information generated by enterprises, social media, the Internet of Things and many other entities is increasing in volume and detail and will fuel exponential growth in data for the foreseeable future [14]. The data analytic tools in use today are not always suited for vast amounts of high dimensional data. Thus there exists a demand for novel analytical methods.

Computational linguistics is one of the areas of research faced with the task of handling the ever growing streams of data. Automated and scalable methods for analysing dynamic data are of high interest in a world where newspapers, blogs, Facebook, Twitter and the rest of the internet generate increasingly large amounts of text every second. In collaboration with the Swedish company Gavagai AB, which works with high performance dynamic text analytics, a set of interesting linguistic data analysis problems was defined.

This paper will focus on the analysis of a certain type of linguistic data. The data will be in the form of mathematical representations of words from a large source of text and their surrounding contexts. The tools already present for analysis and visualisation of this type of data are limited, mainly due to the amount and high dimensionality of the data. Thus we turn to novel approaches of analysis and the field of topology. Our work will hopefully contribute to further understanding of the intersection between topology and linguistics and inspire future research on the subject.

1.1 Problem

One way that Gavagai processes its linguistic data is through special high dimensional vector spaces. Distinguishing changes in the vector spaces as new data is processed and determining classes of subsets of the data in the vector spaces are two interesting problems from a data analytic, linguistic and semantic point of view.


Questions of interest such as “Is this data set generated from this or that source?” or “What type of distinguishing features does this data set contain?” are hard to answer because of the size and high dimensionality of the data. Today's tools include manually picking out subsets of interest and keeping track of them as the data changes over time, and more or less manually examining relations between data points in the vector spaces. Identifying subtle changes in the signals and hidden features of this type of data is an open problem.

1.2 Thesis

We believe that it is possible to classify linguistic vector space data by applying ideas from topological data analysis (TDA). We think that by using TDA algorithms and concepts it will be possible to develop diagnostic tools and to visualise high dimensional data in ways that show relevant features and properties as well as reveal interesting subsets of linguistic vector space data. We think that two key concepts will be applicable: Betti 0 barcodes and the Mapper algorithm [4] [21].

1.3 Goal

The aim of this paper is to introduce two data analytic methodologies and apply them to computational linguistic data. The aim of this thesis is not to develop a complete and rigorous method, but rather to bridge work done in topological data analysis [4] to computational linguistics [11]. The two methodologies we aim to introduce and implement are a global measure of connectedness and a visualisation attached to point cloud data which allows for a qualitative understanding through direct visualisation. They are Betti 0 barcodes and the Mapper algorithm. We wish to apply these to a specific type of high dimensional vector space, a Random Indexing word-space (RI).

1.4 Background

Our work combines ideas from two different academic disciplines, computational linguistics and topological data analysis, in order to perform topological data analysis on computational linguistic data sets. We will start by covering some basic and some not so basic concepts in a lighter manner. For a more thorough explanation of the concepts and subjects we direct the reader to our references.

The idea of applying topological data analysis to computational linguistics arose from the need to find coordinate invariant properties in semantic word-spaces. A novel approach to the problem was needed in order to identify and compare both global and small scale features in a high dimensional vector space with a lot of data points.


1.4.1 Computational Linguistics

Computational linguistics is the academic discipline where language is examined with the aid of computers, statistics and math [8]. One way to study text is to use a word-space model, which is a linguistic model used to make mathematical representations of written language. Word-spaces are often high-dimensional vector spaces where words are represented as points. One important property of these vector spaces is that words with similar meaning are located closer to each other than words with no similarity at all.

Word-spaces and Random Indexing

There are different ways to construct a word-space, but the one referred to throughout this paper is the Random Indexing approach. The following explanation of RI word-spaces is prerequisite knowledge for later sections of the paper. For a more detailed theory of the word-space model and RI we refer to [19].

A corpus $\mathcal{T}$ is a set of sequences of words. Given a corpus $\mathcal{T}$ we can define

$$W = \{\, w : w \in T,\ \forall T \in \mathcal{T} \,\}$$

$$C_n = \{\, (w_1, \ldots, w_n) \text{ subsequence of } T,\ \forall T \in \mathcal{T} \,\}$$

We then define a function $f$ with domain $W$ and co-domain the $N$-dimensional vector space $K^N$. $f$ maps each word to a unique index vector; when the vector is created, its entries are randomly generated as described in [19].

$$f : W \to K^N, \qquad w \mapsto v_{\text{index}}$$

We then define a function $g$ such that $g(w)((w_1, w_2, \ldots, w_m))$ returns the number of times the context $(w_1, w_2, \ldots, w_m)$ is a context centered on $w$.

$$g : W \to \operatorname{Hom}(C_n, \mathbb{N})$$

Lastly, we define a function that maps each word $w$ to a context vector $c$ that can be described as the sum of all the word's different contexts.

$$h : W \to K^N, \qquad w \mapsto \sum_{c \in C_n} g(w)(c) \cdot \Bigl( \sum_{w' \in c} f(w') \Bigr)$$

Finally we have obtained the set of context vectors $\{c\}$ derived from $\mathcal{T}$, our Random Indexing word-space.
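The construction above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names, the dimension, the number of non-zero entries, the sparse ±1 index vectors and the truncation of context windows at sequence boundaries are all assumptions made for illustration. The sketch follows the formula for h as written, so the centre word's own index vector is included in each context sum.

```python
import numpy as np

def index_vectors(vocab, dim=512, nonzeros=8, seed=0):
    """f: map every word to a sparse random index vector (assumed +1/-1 entries)."""
    rng = np.random.default_rng(seed)
    vectors = {}
    for word in vocab:
        v = np.zeros(dim)
        positions = rng.choice(dim, size=nonzeros, replace=False)
        v[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        vectors[word] = v
    return vectors

def context_vectors(corpus, window=2, dim=512, nonzeros=8, seed=0):
    """h: sum the index vectors of all words in every context centred on a word."""
    vocab = sorted({w for sequence in corpus for w in sequence})
    f = index_vectors(vocab, dim, nonzeros, seed)
    h = {w: np.zeros(dim) for w in vocab}
    for sequence in corpus:
        for pos, word in enumerate(sequence):
            # Context of up to 2 * window + 1 words, truncated at the sequence edges.
            lo = max(0, pos - window)
            hi = min(len(sequence), pos + window + 1)
            for neighbour in sequence[lo:hi]:
                h[word] += f[neighbour]
    return f, h

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
f, h = context_vectors(corpus, window=1)
```

Each occurrence of a word adds one context sum to its context vector, which corresponds to weighting every distinct context by its count g(w)(c).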


1.4.2 Topological data analysis

In the past ten years the field of topology has been developed to fit the needs of data analysis and has been applied in widely differing fields, ranging from identification of subgroups of breast cancer patients [16] to computer vision recognition [5]. Topology is well suited for data analysis in the sense that it allows us to look past disturbances in the form of coordinate dependence and instead distinguish qualitative features of the data.

1.4.3 Topology and Algebraic Topology

Topology is the branch of mathematics concerned with the general study of continuity and closeness in topological spaces. A topological space is the most general notion of a mathematical “space”. It consists of a set of points, along with a set of neighbourhoods for each point, that satisfy a set of axioms relating points and neighbourhoods. A key concept in topology is the equivalence relation called homeomorphism. A homeomorphism is a one-to-one continuous mapping of one topological space onto another whose inverse is also continuous. If there exists a homeomorphism between two spaces, A and B, they are said to be homeomorphic (equivalent in a topological sense) [20].

Showing that two topological spaces are homeomorphic requires verifying constraints that can be complicated to check. An easier approach is via algebraic topology, which provides homotopy and homology, two notions of equivalence that are coarser than homeomorphism. Homology can be computed with linear algebra and is therefore preferable to homotopy when working with large amounts of data [10].

To begin the topological examination of point cloud data, the data has to be converted to a topological space. This is achieved by creating simplicial complexes from the data.

Simplicial Complexes

One can think of simplicial complexes as triangular structures connected under certain constraints. In the word-space case, these triangular shapes are built by letting edges connect points in the point cloud under the constraints defined in [15]. A simplex is the generalization of a tetrahedral region of space to n dimensions. For example, a 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, a 3-simplex a tetrahedron and so on. A lot more can be said about the simplex, but for our general and brief explanation of the simplicial complex this will suffice. The following definition of a simplicial complex is cited from [15].

Definition. A simplicial complex is a finite collection of simplices $K$ such that $\sigma \in K$ and $\tau \le \sigma$ implies $\tau \in K$, and $\sigma_i, \sigma_j \in K$ implies $\sigma_i \cap \sigma_j$ is either empty or a face of both.

A practical method of turning point cloud data into a simplicial complex is to use the Vietoris-Rips complex. The following definition of the Rips complex is cited from [7].

Definition. Given a collection of points $\{x_\alpha\}$ in Euclidean space $E^n$, the Rips complex, $R_\epsilon$, is the abstract simplicial complex whose $k$-simplices are determined by unordered $(k + 1)$-tuples of points $\{x_\alpha\}_0^k$ which are pairwise within distance $\epsilon$.
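To make the definition concrete, here is a small brute-force sketch that enumerates the simplices of a Rips complex up to a chosen dimension. The function name `rips_complex` and the parameter `max_dim` are hypothetical, “within distance ε” is read as “at most ε”, and the enumeration is only practical for small point sets.

```python
from itertools import combinations
from math import dist

def rips_complex(points, eps, max_dim=2):
    """Return {k: list of k-simplices}, each simplex a tuple of point indices."""
    n = len(points)
    # Pairs of points that are within distance eps of each other.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= eps
    }
    simplices = {0: [(i,) for i in range(n)]}
    for k in range(1, max_dim + 1):
        # A (k+1)-tuple is a k-simplex when all of its pairs are close.
        simplices[k] = [
            s
            for s in combinations(range(n), k + 1)
            if all(pair in close for pair in combinations(s, 2))
        ]
    return simplices

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
complex_at_eps = rips_complex(points, eps=1.5)
```

In this toy example the first three points form a triangle (a 2-simplex) at ε = 1.5, while the fourth point stays an isolated 0-simplex.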

Betti Numbers

Betti numbers are a way to describe connectivity in a topological space and are often denoted $\beta_k$. Informally speaking, the $k$-th Betti number counts the independent $k$-dimensional holes of the space: $\beta_0$ is the number of connected components, $\beta_1$ the number of loops, and so on. Homotopy equivalent spaces have the same Betti numbers, although the converse does not hold in general [4]. This thesis focuses only on $\beta_0$, which describes how the points are connected.

Barcodes

A barcode is a parametrised version of a Betti number. As Robert Ghrist points out in his 2008 survey of research on barcodes, we can motivate the use of barcodes and partially explain them in the following way [7]:

1. It is beneficial to replace a set of data points with a family of simplicial complexes, indexed by a proximity parameter. This converts the data set into global topological objects.

2. It is beneficial to view these topological complexes through the lens of algebraic topology, specifically, via a novel theory of persistent homology adapted to parameterized families.

3. It is beneficial to encode the persistent homology of a data set in the form of a parameterized version of a Betti number, a barcode.

Example. A barcode example from artificially generated point cloud data. The data consists of 30 points in three clusters, randomly generated with centers at [x, y] coordinates [−5, −5], [−5, 10] and [5, −5] and a standard deviation of 0.5. The points are shown in a scatter plot, together with a Betti 0 barcode showing how they connect.


Figure 1.1: Point cloud data

Figure 1.2: Barcode data

Figure 1.3: Barcode example with artificially generated 2D point cloud data
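An example of this kind can be reproduced approximately with NumPy and SciPy. The sketch below is not the code behind the figures. It assumes 10 points per cluster and a fixed random seed, and it relies on the fact that the finite Betti 0 bars of a Rips filtration end at the merge heights of single-linkage clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [-5.0, 10.0], [5.0, -5.0]])
# Assumption: 10 points per cluster, 30 points in total.
points = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Column 2 of the linkage matrix holds the distance at which two components merge.
bars = np.sort(linkage(points, method="single")[:, 2])

def betti0(eps):
    """Number of connected components of the Rips complex at scale eps."""
    # One component never dies (the infinite bar); the others die at their merge height.
    return 1 + int(np.sum(bars > eps))

for eps in (0.0, 5.0, 20.0):
    print(eps, betti0(eps))
```

The 29 finite bars are all born at 0. Most of them end at small values, when points within a cluster connect, and two long bars end only when the clusters themselves merge.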

1.4.4 Mapper

Mapper, first described in [21], is a method for constructing combinatorial representations of geometric information about high dimensional point cloud data. By filtering and clustering the data set you get a different representation of the data set than if you had acted upon it directly. The method bears similarities to density clustering trees, disconnectivity graphs and Reeb graphs but is a more generalised approach.

The method consists of a number of steps, given a point cloud with $N$ points $x \in X$:

1. We start with a function $f : X \to \mathbb{R}$ whose value is known for the $N$ data points. We call this function a filter. The function should convey some interesting geometric property of the data, or some other property relevant to the task at hand.

2. Citing from [21]: “Finding the range ($I$) of the filter $f$ restricted to the set $X$ and creating a cover of $X$ by dividing $I$ into a set of smaller intervals ($S$) which overlap. This gives us two parameters which can be used to control resolution namely the length of the smaller intervals ($l$) and the percentage overlap between successive intervals ($p$).”

3. Citing from [21]: “Now, for each interval $I_j \in S$, we find the set $X_j = \{x \mid f(x) \in I_j\}$ of points which form its domain. The set $\{X_j\}$ forms a cover of $X$, and $X \subseteq \bigcup_j X_j$.”

4. Choose a metric $d(\cdot, \cdot)$ to get the set of all interpoint distances $D_j = \{d(x_a, x_b) \mid x_a, x_b \in X_j\}$.

5. For each $X_j$ together with the set of distances $D_j$, find clusters $\{X_{jk}\}$.

6. Each cluster then becomes a vertex in our complex, and an edge is created between two vertices if $X_{jk} \cap X_{lm} \neq \emptyset$, meaning that the two clusters share a common point.
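The six steps can be sketched compactly in Python. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function name `mapper_graph` and its parameters are hypothetical, the cover is specified by a number of equal-length intervals and a fractional overlap, the metric is Euclidean, and clustering is done by taking connected components of the graph that links points at distance at most `eps`.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def mapper_graph(points, filter_values, n_intervals=7, overlap=0.4, eps=1.0):
    """Return (nodes, edges): nodes are sets of point indices, edges link overlapping nodes."""
    X = np.asarray(points, dtype=float)
    fv = np.asarray(filter_values, dtype=float)
    lo, hi = fv.min(), fv.max()
    # Step 2: interval length chosen so that the overlapping intervals cover [lo, hi].
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)
    nodes = []
    for k in range(n_intervals):
        start = lo + k * step
        end = hi if k == n_intervals - 1 else start + length
        # Step 3: points whose filter value falls in the k-th interval.
        idx = np.flatnonzero((fv >= start) & (fv <= end))
        if idx.size == 0:
            continue
        # Step 4: interpoint distances within the preimage.
        sub = X[idx]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        # Step 5: clusters are connected components of the eps-neighbourhood graph.
        n_comp, labels = connected_components((d <= eps).astype(int), directed=False)
        for c in range(n_comp):
            nodes.append(frozenset(idx[labels == c].tolist()))
    # Step 6: connect clusters that share at least one point.
    edges = {
        (a, b)
        for a in range(len(nodes))
        for b in range(a + 1, len(nodes))
        if nodes[a] & nodes[b]
    }
    return nodes, edges

line = np.arange(7, dtype=float).reshape(-1, 1)
nodes, edges = mapper_graph(line, line[:, 0], n_intervals=2, overlap=0.5, eps=1.0)
```

For seven evenly spaced points on a line filtered by their coordinate, the two overlapping intervals each yield one cluster, and the shared points produce a single edge between them.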

Example. A Mapper example from artificially generated point cloud data. The data consists of 5000 points randomly generated from a Gaussian distribution surrounding three centroids at [x, y] coordinates [10, 20], [−10, −10] and [17, −10], with a standard deviation of 9. The filter function f was chosen to be Gaussian kernel density estimation, and the Mapper parameters were set to 7 intervals and an overlap of 10 percent. See Figure 1.4.

Figure 1.4: Mapper example with artificially generated 2D point cloud data with corresponding Mapper output


2. The Topology of Text

Section 2.1 describes the linguistic data. It is followed by a description of our Betti 0 barcode algorithm and of our implementation of the Mapper algorithm.

2.1 Data Extraction

The data in the study were RI word-spaces constructed from three different corpora:

• British National Corpus, short BNC. 82 070 645 words [6].

• Touchstone Applied Science Associates, Inc., short TASA. 10 861 774 words [22].

• Reuters Corpus, Volume 1, short Reuters. 200 144 390 words [18].

Subsets of words were then extracted from these word-spaces, using word lists, in order to get a more manageable set of words that would work as a representation of each word-space. A short explanation of these word lists follows; for more detailed information, see the references.

• Swadesh is a classic compilation of basic concepts for the purposes of historical-comparative linguistics.

• Abstract terms. Many academic writing guides advise against using these terms in academic reports because they are considered abstract. The list was compiled by Karlgren [12].


2.2 Barcodes

Despite existing libraries for calculation of simplicial complexes and barcodes, the following barcode algorithm was developed from scratch and implemented in Python. The Dionysus C++ library for computing persistent homology [1] was considered but rejected, due to its installation dependencies and the project’s need for cross-platform compatibility (Windows 7 and Mac OS X 10.7).

Since the algorithm only derives Betti 0 barcodes, some underlying mathematical concepts could be disregarded, giving rise to simplifications that would not be possible for calculations of Betti 1 barcodes or higher.

The algorithm is first described in plain text, followed by a pseudocode description. The barcode plots can be viewed in Chapter 3, Experimental Results.

Every barcode contains one bar that goes to infinity: it represents the single component that remains once all points have coalesced. This infinite bar is omitted from all the barcode plots presented in this paper.

2.2.1 The Algorithm

After a set of word vectors was loaded into a NumPy array, the pairwise distances were calculated and stored in a pairwise distance matrix $M$, where $M_{i,j}$ is the distance between vectors $i$ and $j$.

Each row in the distance matrix was then examined in order to determine the closest neighbour $j$ of vector $i$. A bar $b_i$ in the barcode was then generated, letting the barcode value be the distance between $i$ and $j$ and, by convention, letting the higher indexed vector “die”. Each distance in the row of the lower indexed vector was then compared to the corresponding distance in the row of the other vector, and larger values were replaced with smaller ones. In this way the distance information of the higher indexed vector was transferred to the lower indexed vector. This operation was repeated until every index except one was “dead”. The only surviving index got the value infinity.

After the iteration was done, every bar $b_i \in \beta_0$ had a value $0 < b_i \le \infty$. The bars were visualised in bar plots, Figures 3.1 to 3.6. Algorithm 1 shows pseudocode for how the algorithm was implemented.
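The merging procedure can also be written as a short runnable sketch. The version below is a simplification and not the authors' code: instead of examining rows for mutual nearest neighbours it repeatedly merges the globally closest pair of living components, it uses Euclidean distances, and the name `betti0_barcode` is hypothetical.

```python
import numpy as np

def betti0_barcode(points):
    """Return the end points of the Betti 0 bars; the last entry is the infinite bar."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise distance matrix M[i, j]; the diagonal is masked with infinity.
    M = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(M, np.inf)
    bars = []
    for _ in range(n - 1):
        # Closest pair among the components that are still alive.
        i, j = np.unravel_index(np.argmin(M), M.shape)
        i, j = min(i, j), max(i, j)
        bars.append(M[i, j])
        # Transfer the distance information of j to the lower index i.
        merged = np.minimum(M[i], M[j])
        M[i, :] = merged
        M[:, i] = merged
        M[i, i] = np.inf
        # The higher indexed component "dies".
        M[j, :] = np.inf
        M[:, j] = np.inf
    bars.append(np.inf)
    return np.array(bars)

bars = betti0_barcode([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
```

For three collinear points at 0, 1 and 5, the first two merge at distance 1 and the third joins the merged component at distance 4, which leaves one infinite bar.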

2.3 Mapper

The implementation of Mapper was done in Python, utilising the already available packages NumPy, SciPy, Modular toolkit for Data Processing (MDP) [23], scikit-learn [17], NetworkX [9] and matplotlib. The final visualisations were done in Gephi [3], a graph visualisation software.


Algorithm 1 Calculate Barcodes

    distance_matrix = pairwise_distances(vectors)
    for i in distance_matrix do
        j = i.mindex()
        if j.mindex() == i.mindex() then
            bl = i.min()
            if j < i and value in jk < value in ik then
                β0j = bl
                replace values and kill j
            else if j > i and value in jk > value in ik then
                β0i = bl
                replace values and kill i
            end if
        end if
    end for
    return β0

Each of the nine sets of text, described in Section 2.1, Data Extraction, was examined with the authors' implementation of Mapper.

The filter function $f : X \to \mathbb{R}$ was chosen to be the projection of the data points onto the first eigenvector of a singular value decomposition (SVD) based principal component analysis (PCA); we call this filter PCA1. There were two reasons for choosing PCA1. First, when examining new data, looking at the components with high variance might reveal interesting properties of the data. Second, the high dimensionality of the data does not allow for a more basic approach such as filtering by density. There are many interesting filter functions that can be used, but due to time constraints the examination of the data was restricted to PCA1 only.
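A PCA1 filter of this kind can be sketched with NumPy alone. The thesis lists MDP and scikit-learn among its packages, so the function below (hypothetical name `pca1_filter`) is only an assumption about what such a filter computes, not the code that was used.

```python
import numpy as np

def pca1_filter(vectors):
    """Project each row of `vectors` onto the first principal component."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

values = pca1_filter([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
```

The sign of a principal direction is arbitrary, so the filter values are only defined up to a global sign flip. This does not affect Mapper, since the cover of the filter range is simply mirrored.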

The clustering algorithm was chosen to be DBSCAN. DBSCAN seemed like a good candidate for the Mapper implementation because the number of clusters does not have to be pre-specified, the metric may be defined by the user, and it is readily available in the Python package scikit-learn [17]. For the metric, the cosine distance $d(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \|v\|_2}$ was used. The cosine distance was chosen because it depends only on the directions of the vectors, which amounts to projecting the word-space points onto the unit sphere. This property is particularly well suited for our high dimensional word-spaces because it eliminates unwanted word-frequency effects.
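The effect of the cosine distance can be seen in a few lines of NumPy. The helper name `cosine_distance_matrix` is hypothetical, and the sketch assumes that no vector is the zero vector.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Pairwise cosine distances d(u, v) = 1 - u.v / (|u| |v|) between rows."""
    X = np.asarray(vectors, dtype=float)
    # Normalising every row projects the points onto the unit sphere.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Clip to guard against rounding slightly outside [-1, 1].
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - similarity

D = cosine_distance_matrix([[1.0, 0.0], [10.0, 0.0], [0.0, 3.0]])
```

The first two vectors point in the same direction and differ only in length, as context vectors of a rare and a frequent word with similar usage might, so their distance is 0. The third vector is orthogonal to both and lies at distance 1.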

The adjustable parameters in the filter and clustering function were tuned for TASA with the abstract terms filtering, Figure 3.13. These parameter settings were then used for all word-spaces, with the expectation that they would give rise to meaningful visualisations for all six plots. The Mapper plots, Figures 3.7 to 3.9 and 3.11 to 3.13, suggest otherwise.

Adjustable Mapper parameters:

• The number of segments $I_j$ into which the range $I$ was divided was set to 7.

• The overlap $p$ between successive segments was set to $p = 0.4$ (40 percent).

• The ε parameter of the DBSCAN clustering algorithm was set to ε = 0.51.

• The minimum number of points required by DBSCAN to form a cluster was set to 1.
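As a worked example of how the first two parameters interact, assume (the convention is not stated in the text) that the overlap $p$ is the fraction of an interval's length $l$ that it shares with its successor, and that $n$ intervals of equal length exactly cover the filter range $[a, b]$. Each interval after the first then adds $(1 - p)\,l$ of new range, so

$$l + (n - 1)(1 - p)\,l = b - a \quad\Longrightarrow\quad l = \frac{b - a}{n - (n - 1)\,p}.$$

With $n = 7$ and $p = 0.4$ this gives $l = (b - a)/4.6 \approx 0.217\,(b - a)$, and the $k$-th interval starts at $a + k\,(1 - p)\,l$ for $k = 0, \ldots, 6$.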

After the Mapper algorithm was implemented, PCA1 was compared with word frequency and the comparison was plotted for all three Swadesh subsets. This was done because PCA1 seemed to give a representation similar to what a word-frequency mapping would have given. The correlation in Figure 3.10 confirms this assumption for low frequency words. The outliers in this correlation plot are all high-frequency words; they are not shown, as they make up only a very small fraction of the words.


3. Experimental Results

3.1 Barcodes

The following plots show the normed Betti 0 barcode for texts from Reuters, BNC and TASA. The words are filtered with the Swadesh and abstract terms lists.

Swadesh Words

Figure 3.1: Swadesh in Reuters texts.

Figure 3.2: Swadesh in BNC texts.

Figure 3.3: Swadesh in TASA texts.


Abstract terms

Figure 3.4: Abstract terms in Reuters texts.

Figure 3.5: Abstract terms in BNC texts.

Figure 3.6: Abstract terms in TASA texts.

3.2 Mapper

The following plots are the result of applying the Mapper algorithm to the word-spaces filtered with the Swadesh and abstract terms lists. The plot of PCA1 against word frequency is also presented in this section.


Swadesh Words

Figure 3.7: Swadesh in Reuters texts.

Figure 3.8: Swadesh in BNC texts.

Figure 3.9: Swadesh in TASA texts.

Figure 3.10: Word frequency vs PCA1 for the Swadesh subsets of TASA, Reuters and BNC.


Abstract Terms

Figure 3.11: Abstract terms in Reuters texts.

Figure 3.12: Abstract terms in BNC texts.

Figure 3.13: Abstract terms in TASA texts.


4. Conclusions

4.1 Discussion

4.1.1 Betti 0

The barcode plots show the connectedness of each word-space subset. A curvature can be seen by looking at the ends of the bars, and each barcode plot has a curvature that reflects the relative positioning of the corresponding points in their word-space. A possible conclusion is that the more curvature, the more separation between the points.

4.1.2 Mapper

We believe that the Mapper algorithm might be useful to represent the topological structure of word-spaces, to find interesting subspaces and to display relations among topics. Since the combination of Mapper and RI word-spaces is previously uncharted territory, it is hard to draw any general conclusions about the feasibility of the approach without further examination. The questions are many and the answers are few.

The filter function was chosen to be the projection of the data onto the first eigenvector of a principal component analysis. Figure 3.10 shows an almost linear correlation between the PCA projection and word frequency for “low” frequencies, which raises doubts about the semantic relevance of filtering by PCA. Further investigation regarding properties of relevant filter functions for semantic word-spaces is thus needed.

The clustering algorithm DBSCAN is most likely not well suited for the Mapper algorithm. The positive features of DBSCAN are outweighed by the lack of control over the process and poor performance for high dimensional clustering. Since the initial implementation was done with DBSCAN the research had to be carried out with it, but for future research single linkage clustering is proposed as a more suitable alternative.


The cosine distance $d(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \|v\|_2}$ was used for the distance matrix fed to the clustering algorithm. The cosine distance seems to be a good choice when working with RI word-spaces. With the chosen filter, distance metric and clustering algorithm it was hard to fine tune the parameters of the Mapper algorithm to convey the underlying structures of the data in a relevant way. The results are either large clusters connected one to one or many small disconnected clusters. It may be that a different choice of metric and clustering algorithm makes it easier to home in on favourable parameter choices. But if one looks at the barcode plots provided in the results section it is also evident that there is no single parameter choice that will catch all of the underlying structure, since the curves are all quite smooth from the smallest to the largest barcode bar.

4.2 Summary

Even though no clear verdict on Mapper and Betti 0 barcodes applied to linguistic data could be reached, the foundation for future work has been laid. This thesis provides insight into topological data analysis and computational linguistics, and more specifically into Betti 0 barcodes, Mapper and Random Indexing word-spaces. It can hopefully supply enough of the prerequisites in these fields for a continuation of this work.

4.3 Future Work

4.3.1 Barcodes

Extracting features, in the machine learning sense, from the barcodes in line withthe work described in [2]. Examining if barcodes for text can act as features forrecognising writing styles or authors etc.

In order to confirm our assertion about Betti 0 barcodes, further tests are required.Many different types of word-spaces should be considered but with known linguis-tic structure in order to classify the barcode-shape with corresponding word-space.We then believe that a proper machine learning algorithm can be applied to thesebarcodes in order to find how different barcodes “should” look for texts of a cer-tain type. One important tool that has to be considered is the filtering method.This work was restricted to Swadesh, abstract terms and food filtrations, but otherfiltrations might reveal new aspects that were not mentioned in this thesis.

As for the Betti 0 algorithm described in chapter 2, it has one weakness that makes calculations prohibitively time-consuming for larger data sets, namely the pairwise distance calculation. For each point, the distance to all other points is calculated. This method was sufficient for this thesis because of the limited size of the test sets, but a better alternative is needed for larger data sets. The other aspect of this algorithm is that many simplifications have been made because this work was concerned only with Betti 0 barcodes and not with higher Betti numbers. This implies that, to compute Betti numbers higher than Betti 0, one might need to extend the code on a much wider foundation in algebraic topology.
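To make the cost of the pairwise distance calculation concrete, the following is a minimal Python sketch of one standard way to obtain Betti 0 bars: sort all pairwise distances and merge components with a union-find structure. It is not the implementation from chapter 2, and the names cosine_distance and betti0_barcode are hypothetical. It assumes that every bar is born at 0 and that no vector is all zeros.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cos(u, v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def betti0_barcode(points, distance=cosine_distance):
    """Return the sorted lengths of the finite Betti 0 bars.

    Every bar is born at 0 and dies at the distance at which its
    component merges into another one. The single component that
    never dies is not included in the returned list.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, sorted ascending: O(n^2) time and memory.
    edges = sorted(
        (distance(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    deaths = []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one bar dies
            deaths.append(d)
            if len(deaths) == len(points) - 1:
                break  # only one component is left
    return deaths


# Toy example on the real line with absolute difference as the distance.
bars = betti0_barcode([0.0, 1.0, 3.0, 7.0], distance=lambda a, b: abs(a - b))
print(bars)  # -> [1.0, 2.0, 4.0]
```

For n points the sketch still evaluates n(n-1)/2 distances, which is the bottleneck described above. Avoiding it would require a different strategy, for example restricting the candidate edges to approximate nearest neighbours, which is outside the scope of this sketch.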

4.3.2 Mapper

There are many possible directions for future work on applying the Mapper algorithm to Random Indexing word-spaces, or to word-spaces in general. We list a few that are crucial for determining whether Mapper and word-spaces are a fruitful combination.

• Finding novel filter functions: finding semantically relevant filter functions might prove both challenging and extremely useful for displaying the true topology of word-spaces and, by extension, the topology of language. An interesting approach that was not tested is to project the word-space points onto the first two or three PCA vectors and then take a point cloud density measure. We believe that the Mapper output would then be more closely related to the intrinsic dimension of the word-space.

• Breaking up clustering: investigating the possibility of using a recursive strategy of breaking up large clusters and applying Mapper to them again, to allow for a more complete and adaptive strategy of visualising the word-spaces.

• Quantifying results: whether examining subspaces alone, connections between subspaces, topological changes over time, comparing topologies between different spaces or other interesting investigations, the need for a quantifiable result is of the utmost importance. Yet in these early investigations it has been hard to define what properties to examine. This is easier with data of a raw statistical nature. But how do we measure the relevance of one set of words compared to another? How should the graph distance between two sets of words be interpreted? Mapping the results from the intersection between Mapper and word-spaces to the real world leaves a lot of future work.

• Showing RI topology: examining if Mapper can be configured to show the filament structure of the Random Indexing word-spaces described in [13].
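The untested filter function from the first item in this list, a PCA projection followed by a point cloud density measure, could be sketched in Python as follows. This is a sketch under stated assumptions and has not been evaluated on word-spaces: the Gaussian kernel and its bandwidth are one possible reading of "density measure", the principal directions are approximated by power iteration only to keep the example free of dependencies (a library such as scikit-learn [17] would normally be used), the data is assumed to have at least k directions of non-zero variance, and all function names are hypothetical.

```python
import math
import random


def principal_directions(points, k=2, iterations=200, seed=0):
    """Approximate the first k PCA directions by power iteration."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    mean = [sum(p[c] for p in points) / n for c in range(dim)]
    centred = [[p[c] - mean[c] for c in range(dim)] for p in points]
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _step in range(iterations):
            # Apply the unnormalised covariance: w = X^T (X v).
            scores = [sum(a * b for a, b in zip(row, v)) for row in centred]
            w = [sum(s * row[c] for s, row in zip(scores, centred))
                 for c in range(dim)]
            # Remove the directions already found (deflation).
            for comp in components:
                overlap = sum(a * b for a, b in zip(w, comp))
                w = [a - overlap * b for a, b in zip(w, comp)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        components.append(v)
    return mean, components


def project(points, mean, components):
    """Coordinates of each point along the given directions."""
    return [
        [sum((p[c] - mean[c]) * comp[c] for c in range(len(mean)))
         for comp in components]
        for p in points
    ]


def density_filter(projected, bandwidth=1.0):
    """Gaussian kernel density (up to a constant) at each projected point."""
    values = []
    for p in projected:
        total = 0.0
        for q in projected:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        values.append(total / len(projected))
    return values


# Four nearby points and one outlier: the outlier gets the lowest value.
pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
       (0.1, 0.1, 0.0), (5.0, 5.0, 0.0)]
mean, comps = principal_directions(pts, k=2)
filter_values = density_filter(project(pts, mean, comps), bandwidth=1.0)
```

The resulting list has one value per word-space point and could be passed to Mapper as a one-dimensional filter. The bandwidth would have to be tuned to the scale of the projected data.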

Bibliography

[1] Dionysus. http://www.mrzv.org/software/dionysus/#, 06 2012.

[2] Aaron Adcock, Erik Carlsson, and Gunnar Carlsson. The ring of algebraic functions on persistence bar codes. 2012.

[3] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks, 2009.

[4] Gunnar Carlsson. Topology and data. Bulletin of The American Mathematical Society, 46(2):255–308, January 29 2009.

[5] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International journal of computer vision, 76(1):1–12, 2008.

[6] The British National Corpus. The british national corpus.

[7] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008.

[8] Ralph Grishman. Computational linguistics: an introduction. Cambridge University Press, 1986.

[9] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11–15, Pasadena, CA USA, August 2008.

[10] Allen Hatcher. Algebraic topology. Cambridge UP, Cambridge, 2002.

[11] Pentti Kanerva, Jan Kristofersson, and Anders Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd annual conference of the cognitive science society, volume 1036. Citeseer, 2000.

[12] Jussi Karlgren. New measures to investigate term typology by distributional data. 19th Nordic Conference on Computational Linguistics. Oslo, 2013.

[13] Jussi Karlgren, Anders Holst, and Magnus Sahlgren. Filaments of meaning in word space. In Advances in Information Retrieval, pages 531–538. Springer, 2008.

[14] J Manyika, M Chui, B Brown, J Bughin, R Dobbs, C Roxburgh, and AH Byers. Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.

[15] James R Munkres. Elements of algebraic topology, volume 2. Addison-Wesley Reading, 1984.

[16] Monica Nicolau, Arnold J Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[18] Reuters. Reuters corpus, volume 1, english language, 1996-08-20 to 1997-08-19.

[19] Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm, 2006.

[20] G.F. Simmons. Introduction to topology and modern analysis. International series in pure and applied mathematics. R.E. Krieger Pub. Co., 1983.

[21] Gurjeet Singh, Facundo Mémoli, and Gunnar Carlsson. Topological methods for the analysis of high dimensional data sets and 3d object recognition. In Eurographics Symposium on Point-Based Graphics, volume 22. The Eurographics Association, 2007.

[22] Inc. Touchstone Applied Science Associates. Touchstone applied science associates, inc., 05 2013.

[23] Tiziano Zito, Niko Wilbert, Laurenz Wiskott, and Pietro Berkes. Modular toolkit for data processing (mdp): a python data processing framework. Frontiers in neuroinformatics, 2, 2008.
