UNIVERSITA DEGLI STUDI DI ROMA TOR VERGATA
DIPARTIMENTO DI INFORMATICA, SISTEMI E PRODUZIONE
Dottorato di Ricerca in Informatica e Ingegneria dell’Automazione
Ciclo XXV
EXPLOITING STRUCTURED DATA FOR
MACHINE LEARNING: ENHANCEMENTS IN
EXPRESSIVE POWER AND COMPUTATIONAL
COMPLEXITY
Lorenzo Dell’Arciprete
Supervisor: Prof. Fabio Massimo Zanzotto
Rome, September 2013
Contents
1 Introduction
1.1 Machine Learning
1.2 Data Representation and Kernel Functions
1.3 Thesis Contributions
1.4 Thesis Outline
2 Machine Learning and Structured Data
2.1 Classification in Machine Learning
2.2 Kernel Machines and Kernel Functions
2.3 Kernel Functions on Structured Data
2.3.1 Model-Driven Kernels
2.3.1.1 Spectral Kernels
2.3.1.2 Diffusion Kernels
2.3.2 Syntax-Driven Kernels
2.3.2.1 Convolution Kernels
2.3.2.2 String Kernels
2.3.2.3 Tree Kernels
2.3.2.4 Graph Kernels
2.4 Tree Kernels: Potential and Limitations
2.4.1 Expressive Power
2.4.1.1 Extensions of the Subtree Feature Space
2.4.1.2 Other Feature Spaces
2.4.2 Computational Complexity
3 Improving Expressive Power: Kernels on tDAGs
3.1 Machine Learning for Textual Entailment Recognition
3.2 Representing First-order Rules and Sentence Pairs as Tripartite Directed Acyclic Graphs
3.3 An Efficient Algorithm for Computing the First-order Rule Space Kernel
3.3.1 Kernel Functions over First-order Rule Feature Spaces
3.3.2 Isomorphism between tDAGs
3.3.3 General Idea for an Efficient Kernel Function
3.3.3.1 Intuitive Explanation
3.3.3.2 Formalization
3.3.4 Enabling the Efficient Kernel Function
3.3.4.1 Unification of Constraints
3.3.4.2 Determining the Set of Alternative Constraints
3.3.4.3 Determining the Set C∗
3.3.4.4 Determining Coefficients N(c)
3.4 Worst-case Complexity and Average Computation Time Analysis
3.5 Performance Evaluation
4 Improving Computational Complexity: Distributed Tree Kernels
4.1 Preliminaries
4.1.1 Idea
4.1.2 Description of the Challenges
4.2 Theoretical Limits for Distributed Representations
4.2.1 Existence and Properties of Function f
4.2.2 Properties of the Vector Space
4.3 Compositionally Representing Structures as Vectors
4.3.1 Structures as Distributed Vectors
4.3.2 An Ideal Vector Composition Function
4.3.3 Proving the Basic Properties for Compositionally-obtained Vectors
4.3.4 Approximating the Ideal Vector Composition Function
4.3.4.1 Transformation Functions
4.3.4.2 Composition Functions
4.3.4.3 Empirical Analysis of the Approximation Properties
4.4 Approximating Traditional Tree Kernels with Distributed Trees
4.4.1 Distributed Collins and Duffy’s Tree Kernels
4.4.1.1 Distributed Tree Fragments
4.4.1.2 Recursively Computing Distributed Trees
4.4.2 Distributed Subpath Tree Kernel
4.4.2.1 Distributed Tree Fragments for the Subpath Tree Kernel
4.4.2.2 Recursively Computing Distributed Trees for the Subpath Tree Kernel
4.4.3 Distributed Route Tree Kernel
4.4.3.1 Distributed Tree Fragments for the Route Tree Kernel
4.4.3.2 Recursively Computing Distributed Trees for the Route Tree Kernel
4.5 Evaluation and Experiments
4.5.1 Trees for the Experiments
4.5.1.1 Linguistic Parse Trees and Linguistic Tasks
4.5.1.2 Artificial Trees
4.5.2 Complexity Comparison
4.5.2.1 Analysis of the Worst-case Complexity
4.5.2.2 Average Computation Time
4.5.3 Experimental Evaluation
4.5.3.1 Direct Comparison
4.5.3.2 Task-based Experiments
5 A Distributed Approach to a Symbolic Task: Distributed Representation Parsing
5.1 Distributed Representation Parsers
5.1.1 The Idea
5.1.2 Building the Final Function
5.1.2.1 Sentence Encoders
5.1.2.2 Learning Transformers with Linear Regression
5.2 Experiments
5.2.1 Experimental Set-up
5.2.2 Parsing Performance
5.2.3 Kernel-based Performance
5.2.4 Running Time
6 Conclusions and Future Work
6.1 Future Work
List of Tables
3.1 Comparative performances of Kmax and K
4.1 Relation between d, m and ε
4.2 Dot product between two sums of k random vectors, with h vectors in common
4.3 Computational time and space complexities for several tree kernel techniques
4.4 Spearman’s correlation of DTK values with respect to TK values
4.5 Spearman’s correlation of SDTK values with respect to STK values
4.6 Spearman’s correlation of RDTK values with respect to RTK values
5.1 Pseudo f-measure of the DRPs and the DSP on the non-lexicalized data sets
5.2 Pseudo f-measure of the DRP3 and the DSPlex on the lexicalized data sets
5.3 Spearman’s correlation between the oracle’s vector space and the systems’ vector spaces
List of Figures
2.1 Routes in trees: an example
3.1 A simple rule and a simple pair as a graph
3.2 Two tripartite DAGs
3.3 Simple non-linguistic tDAGs
3.4 Intuitive idea for the kernel computation
3.5 Algorithm for computing LC for a pair of nodes
3.6 Algorithm for computing C∗
3.7 Comparison of the execution times
4.1 Map of the used spaces and functions
4.2 A sample tree
4.3 Norm of the vector obtained as combination of different numbers of basic random vectors
4.4 Dot product between two combinations of basic random vectors, identical apart from one vector
4.5 Variance for the values of Fig. 4.4
4.6 Tree Fragments for Collins and Duffy (2002)’s tree kernel
4.7 Tree Fragments for the subpath tree kernel
4.8 Tree Fragments for the Route Tree Kernel
4.9 Computation time of FTK and DTK
4.10 Performance on the Question Classification task of TK and the two DTK variants
4.11 Performance on the Question Classification task of STK and the two SDTK variants
4.12 Performance on the Question Classification task of RTK and the two RDTK variants
4.13 Performance on the Recognizing Textual Entailment task of TK and the two DTK variants
4.14 Performance on the Recognizing Textual Entailment task of STK and the two SDTK variants
4.15 Performance on the Recognizing Textual Entailment task of RTK and the two RDTK variants
5.1 “Parsing” with distributed structures in perspective
5.2 Subtrees of the tree t in Fig. 5.1
5.3 Processing chains for the production of the distributed trees
5.4 Topology of the resulting spaces derived with the three different methods
5.5 Performance with respect to the sentence length
1 Introduction
Learning, like intelligence, covers such a broad range of processes that it is difficult
to define it precisely. Zoologists and psychologists study learning in animals and humans, while computer scientists are concerned with learning in machines. There are
several parallels between human and machine learning. Certainly, many techniques in
machine learning derive from the efforts of psychologists to make more precise theo-
ries of animal and human learning through computational models. It seems likely also
that the concepts and techniques being explored by researchers in machine learning
may illuminate certain aspects of biological learning.
With regard to machines, we might say that a machine learns whenever it changes
its structure, program, or data, based on its inputs or in response to external informa-
tion, in such a manner that its expected future performance improves. To put it in more
formal terms, “a computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E”(Mitchell, 1997). Some of these changes,
such as the addition of a record to a database, fall comfortably within the field of other
disciplines and may not necessarily be defined as learning. But when, for example, the
performance of a speech-recognition machine improves after hearing several samples
of a person’s speech, we feel justified in saying that the machine has learned.
1.1 Machine Learning
There are several reasons why machine learning is important. Some of these are the
following.
• Some tasks cannot be defined well except by example, meaning that we might be
able to specify input-output pairs but not a concise relationship between inputs
and desired outputs. We would like machines to be able to adjust their internal
structure to produce correct outputs for a large number of sample inputs and thus
suitably constrain their input-output function to approximate the relationship im-
plicit in the examples, so that it could be applied to new cases as well.
• It is possible that hidden among large piles of data are important relationships
and correlations. Machine learning methods can often be used to extract these
relationships (Data Mining).
• Human designers often produce machines that do not work as well as desired in
the environments in which they are used. In fact, certain characteristics of the
working environment might not be completely known at design time. Machine
learning methods can be used for on-the-job improvement of existing machine
designs.
• The amount of knowledge available about certain tasks might be too large for ex-
plicit encoding by humans. Machines that learn this knowledge gradually might
be able to capture more of it than humans would want to write down.
• Environments change over time. Machines that can adapt to a changing environ-
ment would reduce the need for constant redesign.
• New knowledge about tasks is constantly being discovered by humans. Vocabu-
laries change. There is a constant stream of new events in the world. Continuing
redesign of AI systems to conform to new knowledge is impractical, but machine
learning methods might be able to track much of it.
To better explain what machine learning is about, let us consider a simple but well-known example. In mathematics and statistics, we encounter techniques that, given a
set of points, e.g. ~xi, and the values associated with them, e.g. yi, attempt to derive
the function φ that best interpolates the relation between ~x and y, for example by means
of linear or polynomial regression. These can be regarded as early examples of machine
learning algorithms. The case we want to focus on is when the output values of the
target function are finite and discrete; then the regression problem can be regarded as a
classification problem.
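As an illustrative sketch of this example (the data points are invented, not taken from the thesis): polynomial regression with NumPy, followed by thresholding the fitted values to obtain discrete class labels.

```python
import numpy as np

# Sample points ~x_i and the values y_i associated with them
# (hypothetical data, roughly following y = x^2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1])

# Polynomial regression: derive the degree-2 polynomial that best
# interpolates the relation between x and y (least-squares fit).
coeffs = np.polyfit(x, y, deg=2)
phi = np.poly1d(coeffs)

# When the target values are finite and discrete, the regression
# problem becomes a classification problem: here, thresholding the
# fitted value at 5 yields a binary label for a new point.
label = 1 if phi(2.5) > 5.0 else -1
```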
1.2 Data Representation and Kernel Functions
In the interpolation example, the data is represented by points in a vector space. This is
the common setting for most machine learning algorithms, such as decision tree learn-
ers (Quinlan, 1993), Bayesian networks (John and Langley, 1995), support vector ma-
chines (Cristianini and Shawe-Taylor, 2000) or artificial neural networks (Aleksander
and Morton, 1995). In general, the data points represent some real world entities by
means of their peculiar features; as such, the vector spaces used to represent data are
called feature spaces. The issue of determining an adequate feature space in order to
apply some learning algorithm is central to the machine learning problem.
Establishing a feature vector representation for a data object is a concern that has
been widely studied, independently of its implications for the machine learning con-
text. A whole literature exists on the topic of distributed representations (Hinton et al.,
1986; Rumelhart and McClelland, 1986). Inspired by the inherently distributed mechanisms taking place in the human brain, these studies aim at analyzing the possibility of
representing symbolic information in a distributed form. This objective is particularly
interesting, and challenging, when considering structured data, such as strings, trees or
graphs. A distributed representation is then expected to preserve information about the
structure of the object, i.e. about how its components are composed to form the whole
object (Plate, 1994).
Kernel functions have emerged as an alternative to the explicit distributed repre-
sentation of symbolic data. An interesting class of learning algorithms, the kernel
machines (Muller et al., 2001), deal with data only in terms of pairwise similarities. A
kernel function k(oi, oj) is then a function performing an implicit mapping φ of ob-
jects oi, oj into feature vectors ~xi, ~xj , so that k(oi, oj) = φ(oi) · φ(oj) = ~xi · ~xj . By
keeping the target feature space implicit, kernel functions allow for the use of huge,
possibly infinite feature spaces, overcoming the troubles of producing and dealing with
the corresponding feature vectors. As such, kernel functions over structured data have
gained large popularity. Many of these functions have been proposed to model a wide
range of feature spaces, capturing structure information at different levels of detail (see
Chapter 2).
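The substring feature space of string kernels gives a concrete feel for this implicit mapping. As a minimal sketch (not from the thesis, and much simpler than the string kernels surveyed in Chapter 2): a kernel counting common contiguous substrings can be computed without ever materializing the huge substring feature vectors, and the two computations agree.

```python
from collections import Counter

def substring_kernel_explicit(s, t):
    # Explicit feature map: phi(s)[u] = number of occurrences of the
    # contiguous substring u in s. The kernel is the dot product of
    # these (potentially huge) feature vectors.
    def phi(x):
        return Counter(x[i:j] for i in range(len(x))
                              for j in range(i + 1, len(x) + 1))
    ps, pt = phi(s), phi(t)
    return sum(ps[u] * pt[u] for u in ps if u in pt)

def substring_kernel_implicit(s, t):
    # Same value, with the feature space kept implicit: each pair of
    # starting positions (i, j) contributes the length of the longest
    # common prefix of s[i:] and t[j:], i.e. one matching substring
    # occurrence pair per shared prefix length.
    total = 0
    for i in range(len(s)):
        for j in range(len(t)):
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            total += k
    return total
```

For example, `substring_kernel_implicit("aba", "ab")` equals the explicit dot product: the shared substrings are "a" (2×1 occurrence pairs), "b" (1) and "ab" (1), for a kernel value of 4.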
1.3 Thesis Contributions
The aim of this thesis is to analyze the limits and possible enhancements of machine
learning techniques used to exploit structured data. We will focus on the framework
of kernel functions, and in particular on its application to tree structures. Trees are
a fundamental type of structure, widely used to represent objects in a broad range of
research fields, such as proteins in biology, HTML documents in computer security
and syntactic interpretations in natural language processing. The perspective of the
present work is mainly oriented towards natural language processing tasks, though the
techniques introduced are relevant and useful for the other research areas involving tree
structures as well.
The analysis of the state of the art highlights two major lines of evolution for tree
kernel techniques. The first one is aimed at exploring feature spaces different from
the original one by Collins and Duffy (2002). This is necessary in order to define
more expressive tree kernels, often tailoring new feature spaces to the specific needs
of particular tasks. The second line of research tries to tackle the limitations deriving
from the tree kernels computational complexity. Having a complexity quadratic in the
size of the involved trees, tree kernels can hardly be applied to very large data sets or
data instances. In this regard, optimizations are needed, possibly allowing for some
approximation of the kernel results.
Regarding tree kernel expressiveness, we propose a kernel able to deal with struc-
tures more complex than trees (Zanzotto and Dell’Arciprete, 2009; Zanzotto et al.,
2011). These structures, called tDAGs, are composed of two trees linked by a set of
intermediate nodes, acting as variable names. The proposed kernel implements the fea-
ture space of first order rules between trees. This kind of space is inspired by the com-
putational linguistics task of textual entailment recognition, where the data instances
are pairs of sentences, and the task is to determine if the first one entails the second at a
linguistic level. The sentence pairs are represented as pairs of syntactic trees, possibly
sharing a common set of terms or phrases, thus constituting a tripartite directed acyclic
graph (tDAG). Though it has been shown that a complete kernel on graphs is NP-hard
to compute, we present an efficient computation for the kernel on tDAGs.
We then introduce a framework for the efficient computation of tree kernels, allow-
ing for some degree of approximation. The distributed tree kernels framework (Zan-
zotto and Dell’Arciprete, 2011a, 2012; Dell’Arciprete and Zanzotto, 2013) is based on
the explicit representation of trees in a distributed form, i.e. as low-dimensional vec-
tors. As long as these distributed representations for trees are built according to certain
criteria, the kernel computation can be approximated by a simple dot product in the
final vector space. This drastically reduces the computation time for tree kernels, since
a linear time algorithm is proposed for the construction of distributed trees. It is shown
how such a framework can be applied to different instances of tree kernels, leaving
open the possibility of applying it to other kinds of kernels and structures as well.
Finally, an application of the distributed tree kernels is proposed, in the task of syn-
tactic parsing of sentences (Zanzotto and Dell’Arciprete, 2013). The distributed repre-
sentation parser is a way to short-circuit the expensive and error-prone parsing phase,
in processes that apply kernel learning methods to natural language sentences. Such a
parser can be trained to produce the final distributed representation for a syntactic tree,
without explicitly producing the symbolic one.
1.4 Thesis Outline
The thesis outline is as follows.
In Chapter 2 we introduce the kernel machines approach to machine learning and
explain the use of kernel functions. We then provide a survey of kernels over several
kinds of structured data. In particular, we analyze the importance and limitations of
tree kernel functions. We give a more detailed survey of tree kernels, focusing on the
two aspects of expressive power and computational complexity.
In Chapter 3 we present our kernel on tDAGs for tree pairs classification. We ex-
plain the significance of the introduced feature space, and show the efficient algorithm
used to compute the kernel. We report experimental results on the task of textual en-
tailment recognition.
In Chapter 4 we present the distributed tree kernels framework. We show its the-
oretical foundations and our proposed implementation. We perform a wide empirical
analysis of the degree of approximation introduced by the distributed tree kernel. Then,
we report experimental comparisons on the tasks of question classification and textual
entailment recognition.
In Chapter 5 we present the distributed representation parser. We explain the learn-
ing process for the parser and we report several experimental results measuring its
correlation with respect to traditional symbolic parsers.
In Chapter 6, finally, we draw some conclusions and we outline future research
directions.
2 Machine Learning and Structured Data
As one of the peculiar activities of the human mind, the ability to learn is a fundamental
part of what can be defined as an artificial intelligence. The field of machine learning
includes many different approaches, whose common aim is to produce systems able to
accurately perform a task on new, unseen examples, after having trained on a learning
data set. In other words, the objective of a machine learning algorithm is to generalize
from experience. Several kinds of algorithms have been developed, usually divided into
categories depending on the degree of human or external support given to the machine
learning system, in the learning process or as a feedback to the system behavior.
2.1 Classification in Machine Learning
The task of classification is one of the most important among machine learning activi-
ties. At its broadest, the term could cover any context in which some decision or fore-
cast is made on the basis of currently available information. A classification procedure
is then some formal method for repeatedly making such judgments in new situations.
Considering a more restricted interpretation, the problem concerns the construction of
a procedure that will be applied to a sequence of cases, in which each new case must
be assigned to one of a set of pre-defined classes on the basis of observed attributes or
features. The construction of a classification procedure from a set of data for which the
true classes are known has also been called supervised learning (in order to distinguish
it from unsupervised learning, in which the classes are inferred from the data).
The approach of machine learning to classification problems is thus to determine
algorithms that take as input a set of conveniently annotated examples, and return as
output a program, written according to some specific format. The output program
should be generated in such a way that it performs as accurately as possible on the
training examples. The effectiveness of a machine learning technique could be assessed
according to the two following properties:
• generalization: the degree to which the generated program can be successfully
applied to new examples. It obviously depends on both the complexity of the
program and the number of training examples used to generate it;
• computational tractability: the ability to find a good program in a short time.
When this is not the case, it should be possible to determine a useful approxima-
tion, requiring a smaller computational effort.
Clearly it is difficult to satisfy both properties at the same time. In fact, a more complex
program, built on the basis of a large set of training examples, will guarantee a better
generalization, but may take a large amount of time to be written. On the other hand, a
simpler program, generated from a small set of examples, can be produced in a short
time, but will probably generalize poorly to new cases.
2.2 Kernel Machines and Kernel Functions
One of the most useful learning methods for classification is the Support Vector
Machine (SVM) (Cortes and Vapnik, 1995; Scholkopf, 1997). Suppose we are given
a set of data points, each belonging to one of two classes, and the goal is to decide
which class a new data point will be in. The approach of support vector machines is
to view a data point as an n-dimensional vector, and to look for an (n − 1)-dimensional
hyperplane that separates the points belonging to the two classes. Such a
classification method is called a linear classifier.
It is then necessary to define a multidimensional space able to represent the relevant
characteristics of the data objects taken into consideration. Such a space is called a
feature space, and its modeling is fundamental in the construction of a good learning
mechanism. In fact, the ability to split the data points in classes depends on the features
selected to represent the objects as vectors. Too small a number of features (i.e. of
dimensions in the feature space) could lead to inseparability of the data; but this could
happen even with a very large set of features, if the characteristics chosen are
irrelevant to the problem in question. At the opposite extreme, a feature space with an extremely
high number of dimensions could pose tractability issues, and for some problems no
feature space at all can be found that allows linear separability of the data.
Assuming we can design an adequate feature space, there might be many hyper-
planes able to classify the data. However, we are additionally interested in finding out
if we can achieve maximum separation (margin) between the two classes. By this we
mean that the hyperplane should be picked so that the distance from the nearest data
points to the hyperplane itself is maximized. Now, if such a hyperplane exists, it is
clearly of interest and is known as the maximum-margin hyperplane, and such a linear
classifier is known as a maximum-margin classifier.
The simplest form of SVM is the algorithm known as Perceptron (Rosenblatt,
1958), that can be seen as an artificial version of the human brain neurons. The Per-
ceptron classification function is of the form:
f(~x) = sgn(~w · ~x + b)
where ~w · ~x + b represents a simple hyperplane and the signum function divides the
data points in two sets: those that are above and those that are below the hyperplane.
The major advantage of making use of linear functions only is that, given a set of
training points, S = { ~x1, ..., ~xm}, each one associated with a classification label yi ∈
{+1,−1}, we can apply a learning algorithm that derives the vector ~w and the scalar b
of a separating hyperplane, provided that at least one exists.
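A minimal sketch of such a learning algorithm, the classical Perceptron update rule (illustrative code; the toy data set is invented, not from the thesis):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Derive ~w and b of a separating hyperplane, provided one exists.

    X: (m, n) array of training points; y: labels in {+1, -1}.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # Misclassified point: sgn(w·x + b) != y, i.e. y(w·x + b) <= 0.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi   # move the hyperplane toward the point
                b += yi
                errors += 1
        if errors == 0:        # converged: all points correctly classified
            break
    return w, b

# Linearly separable toy data: +1 above the line x1 + x2 = 3, -1 below.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
predictions = np.sign(X @ w + b)
```

On linearly separable data the update rule is guaranteed to converge to some separating hyperplane, though not necessarily the maximum-margin one discussed next.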
Since we are interested in finding the maximum-margin hyperplane, it is possible
to demonstrate that the objective of learning is reduced to an optimization problem of
the form:

min ||~w||
subject to yi(~w · ~xi + b) ≥ 1, ∀~xi ∈ S
In real scenario applications, training data is often affected by noise due to several
reasons, e.g. classification mistakes of the annotators. These may cause the data not
to be separable by any linear function. Additionally, as we already pointed out, the
target problem itself may not be separable in the designed feature space. As a result,
the simplest version of SVM (called Hard Margin SVM), as described above, will fail
to converge. In order to solve such a critical aspect, a more flexible design is proposed
with Soft Margin Support Vector Machines. The main idea is that the optimization
problem is allowed to provide solutions that can violate a certain number of constraints.
Obviously, to remain as consistent as possible with the training data, the number of
such errors should be kept as low as possible.
One of the most interesting properties we can observe about SVMs is that the gra-
dient ~w is obtained by a summation of vectors proportional to the examples ~xi. This
means that ~w can be written as a linear combination of training points, i.e.:

~w = ∑i=1..m αi yi ~xi
where the coefficients αi can be seen as the alternative coordinates for representing
the vector ~w in a dual space, whose dimensions are the training data vectors. This
also means that every scalar product between vector ~w and a data vector ~x can be
decomposed in a summation of scalar products between data vectors.
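A quick numerical check of this dual decomposition (with arbitrary illustrative coefficients, not learned by an SVM):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))          # training vectors ~x_i
y = rng.choice([-1.0, 1.0], size=m)      # labels y_i
alpha = rng.random(m)                    # dual coefficients alpha_i

# ~w as a linear combination of the training points:
# ~w = sum_i alpha_i y_i ~x_i
w = (alpha * y) @ X

# Any scalar product ~w · ~x decomposes into a summation of scalar
# products between data vectors -- the property that makes the
# kernel substitution below possible.
x_new = rng.standard_normal(n)
primal = np.dot(w, x_new)
dual = sum(alpha[i] * y[i] * np.dot(X[i], x_new) for i in range(m))
```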
One of the most difficult tasks for applying machine learning is the features design.
Features should represent data in a way that allows learning algorithms to separate
positive from negative examples. In SVMs, features are used to build the vector rep-
resentation of data examples, and the scalar product between example pairs quantifies
how much they are similar (sometimes simply counting the number of common fea-
tures). Instead of encoding data in feature vectors, we may design kernel functions
(Vapnik, 1995) that provide such similarity between example pairs without using an
explicit feature representation.
In this way, a linear classifier algorithm can solve also a non-linear problem by
mapping the original non-linear observations into a higher-dimensional space, where
the linear classifier is subsequently used. This process, also known as the kernel trick,
makes a linear classification in the new space equivalent to non-linear classification in
the original space.
In the optimization problem used to learn SVMs, the feature vectors always appear
in a scalar product; consequently, the feature vectors ~xi can be replaced with the data
objects oi, by substituting the scalar product ~xi · ~xj with a kernel function k(oi, oj). The
initial objects oi can be mapped into the vectors ~xi by using a feature representation,
φ(.), so that ~xi · ~xj = φ(oi) · φ(oj) = k(oi, oj).
The idea of a feature extraction procedure φ : o → (x1, ..., xn) = ~x allows us to
define a kernel as a function k such that ∀~x, ~z ∈ X
k(~x, ~z) = φ(~x) · φ(~z)
where φ is a mapping from X to an (inner product) feature space.
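A standard small instance of such a pair (k, φ), sketched here for illustration: over R², the homogeneous polynomial kernel k(~x, ~z) = (~x · ~z)² corresponds to the explicit mapping φ(x1, x2) = (x1², √2 x1x2, x2²), so the kernel computes a scalar product in a 3-dimensional feature space without ever building the feature vectors.

```python
import math

def poly_kernel(x, z):
    # k(x, z) = (x · z)^2, computed directly on the inputs.
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # The explicit mapping the kernel implicitly performs:
    # phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2).
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 0.5)
# k(x, z) == phi(x) · phi(z): here x · z = 4, so both sides equal 16.
```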
Notice that, once we have defined a kernel function that is effective for a given
learning problem, we do not need to find explicitly which mapping φ it corresponds
to. It is enough to know that such a mapping exists. This is guaranteed by Mercer’s
theorem (Mercer, 1909), stating that any continuous, symmetric, positive semi-definite
kernel function k(x, y) can be expressed as a scalar product in a high-dimensional
space.
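To make the correspondence concrete, consider the quadratic kernel k(~x, ~z) = (~x · ~z)² over R², whose mapping φ sends (x1, x2) to (x1², √2·x1x2, x2²). The following minimal sketch (illustrative, not part of the original text) verifies that the kernel value equals the scalar product of the explicit feature vectors:

```python
import math

def k(x, z):
    """Quadratic kernel: squared scalar product, no explicit features."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for the quadratic kernel over R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 0.5)
# The kernel value equals the scalar product in the feature space.
assert abs(k(x, z) - dot(phi(x), phi(z))) < 1e-9
```

Computing k(x, z) costs one scalar product in R², while the equivalent φ(x) · φ(z) already needs three feature dimensions; for higher polynomial degrees the gap grows quickly, which is the practical point of the kernel trick.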
The use of kernel functions allows SVMs to solve non-linear classification prob-
lems. Learning algorithms that build on kernel functions, such as SVMs, are called
kernel machines.
2.3 Kernel Functions on Structured Data
Real world tasks often deal with data that is not represented as mere attribute-value
tuples. Strings, trees and graphs are extensively used to represent different kinds of
objects, in several areas such as natural language processing, biology and computer
security. The application of machine learning methods to classification tasks in these
fields has led to a wide development of kernel functions able to deal with such kinds
of structured data. It should be noted that, by talking about kernels for structured data,
one could refer to two different families of kernel functions: model-driven kernels and
syntax-driven kernels (Gartner, 2003).
2.3.1 Model-Driven Kernels
Model-driven kernels are kernels defined on the structure of the instance space, such as
the spectral kernels and the diffusion kernels.
2.3.1.1 Spectral Kernels
Spectral kernels (Li et al., 2005) support kernel-based learning methodologies by
operating directly on kernel matrices. Given a set of n samples, a kernel matrix,
or Gram matrix, K can be defined as an n × n matrix whose element Ki,j contains
the value of the kernel function for samples i and j. Spectral kernels stem
from spectral graph theory, since their functioning is based on the analysis of kernel
matrices in terms of their characteristic properties, like their eigenvalues and eigenvec-
tors. These properties can be used, for example, for determining a clustering of the
samples, by finding some optimum cut in the graph whose adjacency matrix is given
by the kernel matrix.
Spectral kernels work as follows. Firstly, they may apply a transformation to the
n × n kernel matrix (e.g. considering the Laplacian matrix of the corresponding graph).
Then, they perform an eigen-decomposition of the transformed matrix, and use it to
extract feature vectors of length k for the n objects. Finally, the kernel is computed by
classic similarity measures over Rk.
New input, considered as a vector of original kernel values with respect to the
training examples, is firstly transformed in a manner dependent on the transformation
previously used, and then is projected onto the spectral embedding space given by the
training examples.
Following a similar principle, the Latent Semantic Kernel (Cristianini et al., 2002)
can be viewed as a specific instance of the spectral kernel framework. In this case,
starting from a generic kernel matrix, the LSK works by manipulating the kernel ma-
trices through Latent Semantic Indexing techniques (Deerwester et al., 1990), which
are successfully used in the context of Information Retrieval to capture semantic rela-
tions between terms and documents.
2.3.1.2 Diffusion Kernels
Diffusion kernels (Kondor and Lafferty, 2002) can be applied to data sets that can be
regarded as vertices of a graph (e.g. documents linked in the Web). The idea comes
from the equations used to describe the diffusion of heat through a medium. Diffusion
kernels are related to the Gaussian kernel over Rn, which gives a measure of similarity
according to the Gaussian function with parameter σ:

$$k(\vec{x}, \vec{z}) = e^{-\frac{\|\vec{x}-\vec{z}\|^2}{2\sigma^2}}$$
As a more generic approach, exponential kernels are defined by means of a Gram
matrix

$$K = e^{\beta H} = \lim_{t \to \infty} \left(I + \frac{\beta H}{t}\right)^t$$

where β is a “bandwidth” parameter, with a meaning similar to that of parameter σ in
Gaussian kernels, and H, the “generator”, is a symmetric square matrix.
Diffusion kernels on graphs are obtained by choosing matrix H to represent the
structure of the considered graph. In particular, H is taken to be the negative of the
Laplacian matrix, i.e. its elements Hi,j are defined as −degree(vi) if i = j, 1 if
(vi, vj) ∈ E and 0 otherwise.
Intuitively, the diffusion kernel K(x, x′) represents the heat found at point x at
time tβ if all the heat of the system was concentrated in x′ at time 0. This is also
related to random walks on the graph, defining the probability distribution of finding the
walker in vertex x at some step, if starting at vertex x′. While random walks consider a
discrete series of steps, diffusion kernels can be seen as considering an infinite number
of infinitesimal steps. At each step the walker in vertex vi will take each of the edges
emanating from vi with fixed probability β and will remain in place with probability
1− degree(vi)β.
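The construction above can be sketched programmatically. The following is an illustrative implementation (not from the original text) that builds H as the negative Laplacian of a small path graph and approximates K = e^{βH} by truncating the power series of the matrix exponential:

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(H, beta, terms=30):
    """Truncated power series for K = e^{beta * H}."""
    n = len(H)
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    P = [row[:] for row in K]       # running power (beta * H)^t
    fact = 1.0
    for t in range(1, terms):
        P = mat_mul(P, [[beta * h for h in row] for row in H])
        fact *= t
        for i in range(n):
            for j in range(n):
                K[i][j] += P[i][j] / fact
    return K

# Path graph 0 - 1 - 2: H is the negative of the Laplacian matrix.
n = 3
H = [[0.0] * n for _ in range(n)]
for i, j in [(0, 1), (1, 2)]:
    H[i][j] = H[j][i] = 1.0
    H[i][i] -= 1.0
    H[j][j] -= 1.0

K = mat_exp(H, beta=0.5)
# Heat starting at vertex 0 reaches the adjacent vertex 1 more than the
# distant vertex 2, and some heat reaches every vertex of a connected graph.
assert K[0][1] > K[0][2] > 0.0
```

Since the rows of H sum to zero, each row of K sums to one, matching the interpretation of K(x, x') as the distribution of heat (or of a continuous-time random walker) over the graph.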
2.3.2 Syntax-Driven Kernels
Syntax-driven kernels are kernels defined on the structure of the instances. They deal
with instances belonging to families of structured data such as strings, trees and graphs.
Since they are the main focus of the present work, in the following sections and chapters
the expression kernels on structured data will always refer to syntax-driven kernels.
2.3.2.1 Convolution Kernels
The vast majority of kernels on structured data stem from the convolution kernel (Haus-
sler, 1999), whose key idea is to define a kernel on a composite object by means of
kernels on the parts of the objects. This originates from the assumption that often the
semantics of structured objects can be captured by a relation R between the object and
its parts.
Let x, x′ ∈ X be the composite objects and ~x, ~x′ ∈ X1 × · · · × XD be tuples of parts
of these objects. Given the relation R ⊆ (X1 × · · · × XD) × X, the decomposition R−1
can be defined as R−1(x) = {~x : R(~x, x)}. Then the convolution kernel is defined as:

$$k_{conv}(x, x') = \sum_{\vec{x} \in R^{-1}(x),\ \vec{x}' \in R^{-1}(x')} \ \prod_{d=1}^{D} k_d(x_d, x'_d)$$
Convolution kernels are then a class of kernels that can be formulated in the above
way. Their advantage is that they are very general and can be applied to many different
problems. The work required to adapt the general formulation to a specific problem
consists in choosing an adequate relation R. Simpler and more complex kinds of de-
composition relations have been studied for structures such as strings, trees and graphs,
to define several kernels based on the general framework of the convolution kernel.
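The framework can be sketched in a few lines; the string decomposition used below is a hypothetical toy example, chosen only to show how the part kernels combine:

```python
def conv_kernel(x, y, decompose, part_kernels):
    """Generic convolution kernel: sum, over all pairs of decompositions,
    of the product of the D part kernels."""
    total = 0.0
    for xs in decompose(x):
        for ys in decompose(y):
            prod = 1.0
            for kd, xd, yd in zip(part_kernels, xs, ys):
                prod *= kd(xd, yd)
            total += prod
    return total

# Toy decomposition relation R^{-1}: a string is split into
# (character, remainder-of-string) pairs, one tuple per position.
def splits(s):
    return [(s[i], s[i + 1:]) for i in range(len(s))]

delta = lambda a, b: 1.0 if a == b else 0.0  # exact-match part kernel
k_val = conv_kernel("abc", "xbc", splits, [delta, delta])
assert k_val == 2.0  # splits ('b','c') and ('c','') match in both parts
```

Adapting the framework to a concrete problem amounts to choosing `decompose` (the relation R) and the part kernels; the string, tree and graph kernels of the next sections are all instances of this scheme.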
2.3.2.2 String Kernels
The traditional model for text classification is based on the bag-of-words representa-
tion, which associates a text with a vector indicating the number of occurrences of terms
in the text. Text similarity is then computed as a simple scalar product between these
vectors. Kernels on strings try to define a more sophisticated approach to the problem
of text classification, though they can be applied also to other sequences of symbols,
e.g. the amino acids describing a protein or the phonemes constituting spoken text.
The first kernel function defined on strings can be found in Lodhi et al. (2002), and it
is based on a notion of string similarity given by the number of common subsequences.
These subsequences need not be contiguous, but their relevance is weighted according
to the number of gaps occurring in the subsequence, so that the more gaps it contains,
the less weight it is given in the kernel function.
Consider a string to be a finite sequence of characters from a finite alphabet Σ. Then
Σn is the set of strings of length n and Σ∗ is the set of all strings, including the empty
string. Let |s| denote the length of string s = s1, ..., s|s|, and s[i] the subsequence of s
induced by the set of indices i. The total length l(i) of subsequence s[i] in s is defined
as i|i| − i1 + 1, where the indices in i are ordered so that 1 ≤ i1 < ... < i|i| ≤ |s|.
Then, the mapping φ underlying the string kernel can be defined for each element of
the feature space, i.e. the space of all possible substrings Σ∗. For any substring u, the
value of feature φu(s) is:

$$\phi_u(s) = \sum_{\vec{i}\,:\,u = s[\vec{i}\,]} \lambda^{l(\vec{i}\,)}$$
where λ ≤ 1 is a decay factor that penalizes long and gap-filled subsequences. Then,
the kernel between strings s and t is the inner product of the feature vectors for the two
strings, computing a weighted sum over all common subsequences:
$$k(s, t) = \sum_{u \in \Sigma^*} \phi_u(s)\,\phi_u(t) = \sum_{u \in \Sigma^*} \sum_{\vec{i}:u=s[\vec{i}\,]} \sum_{\vec{j}:u=t[\vec{j}\,]} \lambda^{l(\vec{i}\,)+l(\vec{j}\,)}$$
In Lodhi et al. (2002), a restricted formulation is given, considering as the feature space
only the subsequences of length n, i.e. Σn:
$$k_n(s, t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t) = \sum_{u \in \Sigma^n} \sum_{\vec{i}:u=s[\vec{i}\,]} \sum_{\vec{j}:u=t[\vec{j}\,]} \lambda^{l(\vec{i}\,)+l(\vec{j}\,)}$$
and an efficient recursive algorithm is given to reduce the computational complexity to
O(n|s||t|). Rousu and Shawe-Taylor (2005) introduce a further optimization, reducing
the complexity to O(n|M | log min(|s|, |t|)), where M = {(i, j) | si = tj} is the set of
character matches in the two sequences.
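A naive implementation of this feature space can make the weighting concrete. The sketch below is illustrative only: it enumerates index tuples directly from the definition, with exponential cost, rather than using the efficient recursive algorithm of Lodhi et al. (2002):

```python
from itertools import combinations

def phi(s, n, lam):
    """Feature vector of s over subsequences of length n: for each u,
    the sum of lam^l(i) over index tuples i with s[i] = u."""
    feats = {}
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1          # total length l(i), gaps included
        feats[u] = feats.get(u, 0.0) + lam ** span
    return feats

def k_n(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel: scalar product of the two feature
    vectors, i.e. a weighted count of common subsequences."""
    fs, ft = phi(s, n, lam), phi(t, n, lam)
    return sum(v * ft.get(u, 0.0) for u, v in fs.items())

# "cat" and "car" share only the subsequence "ca" (length 2 in both),
# contributing lam^2 * lam^2.
assert abs(k_n("cat", "car", 2, 0.5) - 0.5 ** 4) < 1e-12
```

Note how the contiguous subsequence "ca" of "cat" receives weight λ², while the gapped "ct" (spanning three positions) receives only λ³, realizing the penalization described above.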
Leslie et al. (2002) and Paass et al. (2002) use an alternative kernel in the context
of protein and spoken text classification, considering only contiguous substrings. A
string is then represented by the number of times each unique substring of length n
occurs in the sequence. This way of representing a string as its n-grams is also known
as the spectrum of a string. The kernel function is then simply the scalar product of
these representations, and can be computed in time linear in n and in the length of the
strings.
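The spectrum kernel is simple enough to sketch directly (an illustrative implementation, assuming plain Python strings as input):

```python
from collections import Counter

def spectrum(s, n):
    """n-gram (contiguous substring) counts: the spectrum of a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(s, t, n):
    """Scalar product of the two spectra."""
    fs, ft = spectrum(s, n), spectrum(t, n)
    return sum(c * ft[g] for g, c in fs.items())

# "abab" has spectrum {ab: 2, ba: 1}; "bab" has {ba: 1, ab: 1}.
assert spectrum_kernel("abab", "bab", 2) == 3
```

Extracting the n-grams is linear in the length of each string, and the final product only touches n-grams actually occurring in the first string, in line with the complexity claim above.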
String kernels can also be seen as a specific instance of more generic sequence
kernels, where the symbols of the string are not characters but more complex objects,
even strings themselves. As an example, Bunescu and Mooney (2006) proposed a
subsequence kernel for the task of extracting relations among entities from texts. Their
kernel applies to sequences of objects taken from a set Σ× = Σ1 × Σ2 × ... × Σk,
where each object includes several features from feature sets Σ1,Σ2, ...,Σk, e.g. a
word, a POS tag, etc. Then, if we consider the set of all possible features Σ∪ =
Σ1 ∪ Σ2 ∪ ... ∪ Σk, a sequence u ∈ Σ∗∪ is a subsequence of sequence s ∈ Σ∗× if there
is a sequence of |u| indices ~i such that uj ∈ sij for all j = 1, ..., |u|.
2.3.2.3 Tree Kernels
The study of kernel functions for trees has been very popular and led to several different
tree kernel formulations. The differences among the various tree kernels are related
to both the feature spaces covered and the kind of trees considered (e.g. ordered or
unordered, labeled or unlabeled edges). Since they constitute the focus of this work,
an extensive overview of tree kernel functions can be found in Section 2.4.
2.3.2.4 Graph Kernels
Graphs are the most complex of the presented structures. In fact, both string and tree
kernels can be seen as kernels on some restricted set of graphs. A theoretical limit
arises when trying to define a complete graph kernel, i.e. a kernel capable of counting
common isomorphic subgraphs. It has been shown, in fact, that such a kernel would
be NP-hard to compute (Gartner et al., 2003). To see this, consider a feature space that
has one feature ΦH for each possible graph H , and a graph kernel where each feature
ΦH(G) measures how many subgraphs of G are isomorphic to graph H . Graphs satis-
fying certain properties could be identified using the inner product in this feature space.
In particular, one could decide whether a graph has a Hamiltonian path, i.e. a sequence
of adjacent vertices containing every vertex exactly once. Since this problem is known
to be NP-hard, the same can be concluded for the computation of such a graph kernel.
Some work has been devoted to developing alternative approaches to the definition of a
graph kernel. With respect to a complete graph kernel, these alternative kernels are less
expressive and therefore less expensive to compute. The common idea behind these
works is that features are not subgraphs but walks in the graphs, having some or all
labels in common. In Gartner (2002), a walk is characterized by the labels of the initial
and terminal vertices. The kernel proposed by Kashima et al. (2003) computes the
probability of random walks with equal sequences of vertex and edge labels. In Gartner
et al. (2003), equal label sequences are counted, allowing the presence of some gaps.
Since these features may belong to an infinite space, in the case of cyclic graphs, non-
trivial computation algorithms are needed. The strategy for efficiently computing all
of these kernels is based on exploiting structural information of the considered graph,
such as the adjacency matrix, the transition probability matrix or the topological order
of nodes for acyclic graphs. The actual computation consists then in solving a linear
equation system or computing the limit of a matrix power series.
A different approach to the development of graph kernels is the one that limits the
kernel to a particular subset of graphs. For example, Suzuki et al. (2003) proposed a
kernel that can only be applied to a class of graphs used to represent syntactic informa-
tion of natural language sentences, i.e. the hierarchical directed acyclic graphs.
2.4 Tree Kernels: Potential and Limitations
Trees are fundamental data structures used to represent very different objects such as
proteins, HTML documents, or interpretations of natural language utterances (e.g. syn-
tactic analysis). Thus, many research areas – for example, biology, computer security
and natural language processing – fostered extensive studies on methods for learning
classifiers that leverage these data structures.
Tree kernels were firstly introduced in Collins and Duffy (2001) as specific con-
volution kernels (see Sec. 2.3.2.1), and are widely used to fully exploit tree structured
data when learning classifiers. The kernel by Collins and Duffy (2001) considers the
feature space of subtrees, intended as any subgraph which includes more than one node,
with the restriction that entire (not partial) rule productions must be included. In other
words, when a node is included in a subtree, either it is included as a leaf node, or all of
its children in the original tree are also included in the subtree. The kernel computation
is performed by means of a recursive function, according to the convolution kernels
framework, so that the tree kernel is defined as follows:
$$K(T_1, T_2) = \sum_{n_1 \in N(T_1)} \sum_{n_2 \in N(T_2)} \Delta(n_1, n_2)$$
where N(T ) is the set of nodes of the tree T . The recursive function ∆(n1, n2) is the
core of the kernel function and of the computation algorithm. Denoting by ch(n, j) the
j-th child of node n, the definition of function ∆ is as follows:
• ∆(n1, n2) = 1 if n1 and n2 are two terminal nodes and their labels are the same;

• ∆(n1, n2) = ∏j (1 + λ∆(ch(n1, j), ch(n2, j))) if the productions rooted in n1
and n2 are the same;

• ∆(n1, n2) = 0 otherwise.
Parameter λ is a decay factor, introduced to reduce the contribution of larger trees. By
setting 0 < λ < 1, the larger a tree is, the lower its weight will be in the final kernel
measure.
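The recursion above can be turned into a short program. The sketch below follows the document's definition of ∆; encoding trees as (label, children) pairs is an assumption of this example, not a notation from the thesis:

```python
def nodes(t):
    """All nodes of a tree encoded as (label, [children])."""
    label, children = t
    result = [t]
    for c in children:
        result.extend(nodes(c))
    return result

def production(t):
    """Production rule rooted at a node: label plus ordered child labels."""
    label, children = t
    return (label, tuple(c[0] for c in children))

def delta(n1, n2, lam):
    l1, c1 = n1
    l2, c2 = n2
    if not c1 and not c2:                 # two terminal nodes
        return 1.0 if l1 == l2 else 0.0
    if production(n1) != production(n2):  # different productions
        return 0.0
    prod = 1.0                            # same production: recurse on children
    for a, b in zip(c1, c2):
        prod *= 1.0 + lam * delta(a, b, lam)
    return prod

def tree_kernel(t1, t2, lam=1.0):
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

# Identical trees S -> A B: K = (1 + lam)^2 + 1 + 1 = 6 for lam = 1.
t = ("S", [("A", []), ("B", [])])
assert tree_kernel(t, t) == 6.0
```

The double loop over node pairs makes the quadratic cost discussed in Section 2.4.2 explicit; the dynamic programming of the original algorithm memoizes ∆ over these pairs.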
Following the work of Collins and Duffy (2001), tree kernels have been applied to
use tree structured data in many areas, such as biology (Vert, 2002; Hashimoto et al.,
2008), computer security (Dussel et al., 2008), and natural language processing (Gildea
and Jurafsky, 2002; Pradhan et al., 2005; MacCartney et al., 2006; Zhang and Lee,
2003; Moschitti et al., 2008; Zanzotto et al., 2009). Different tree kernels modeling
different tree fragment feature spaces have been proposed, in order to enhance the tree
kernels' expressive power and to exploit different features of the data. At the same time,
another primary research focus has been the reduction of the tree kernel execution time,
in order to allow for the application to wider data sets and larger trees.
2.4.1 Expressive Power
The automatic design of classifiers using machine learning and linguistically anno-
tated data is a widespread trend in the Natural Language Processing (NLP) community.
Part-of-speech tagging, named entity recognition, information extraction, and syntactic
parsing are NLP tasks that can be modeled as classification problems, where manually
tagged sets of examples are used to train the corresponding classifiers. The training
algorithms have their foundation in machine learning research but, to induce better
classifiers for complex NLP problems, like for example, question-answering, textual
entailment recognition (Dagan and Glickman, 2004; Dagan et al., 2006), and semantic
role labeling (Gildea and Jurafsky, 2002), syntactic and/or semantic representations of
text fragments have to be modeled as well. Kernel-based machines can be used for this
purpose, as kernel functions make it possible to directly describe the similarity between
two text fragments (or their representations) instead of explicitly describing them in
terms of feature vectors.
Many linguistic theories (Chomsky, 1957; Marcus et al., 1993; Charniak, 2000;
Collins, 2003) express syntactic and semantic information with trees. This kind of
information can also be encoded in projective and non-projective graphs (Tesniere,
1959; Grinberg et al., 1996; Nivre et al., 2007a), directed-acyclic graphs (Pollard and
Sag, 1994), or generic graphs for which the available tree kernels are inapplicable. In
fact, algorithms for computing the similarity between two general graphs in terms of
common subgraphs are exponential (see Sec. 2.3.2.4). Then, a great amount of work
has been devoted to kernels for trees (Collins and Duffy, 2002; Moschitti, 2004), to
extend the basic model that measures the similarity between two trees by counting the
common subtrees. Different and more expressive feature spaces were defined in order
to capture deeper layers of syntactic or semantic information, and to highlight aspects
more relevant for the specific tasks faced.
2.4.1.1 Extensions of the Subtree Feature Space
Many of the tree kernels proposed following the work of Collins and Duffy (2001) tried
to leverage on its principles to define more complex feature spaces. These kernels often
originated as variants of the tree kernel by Collins and Duffy (2001). This section will
briefly present some of these works.
Tree Sequence Kernel The tree sequence kernel (Sun et al., 2011) adopts the struc-
ture of a sequence of subtrees instead of the single subtree structure. This kernel lever-
ages on the subsequence kernel (Sec. 2.3.2.2) and the tree kernel, enriching the former
with syntactic structure information and the latter with disconnected subtree sequence
structures. Clearly, the tree kernel by Collins and Duffy (2001) is a special case of the
tree sequence kernel, where the number of subtrees in the tree sequence is restricted to
1.
To define the tree sequence kernel, Sun et al. (2011) first define a set sequence
kernel, which allows multiple choices of symbols in any position of a sequence. This
kernel is defined on set sequences S, whose items Si are ordered symbol
sets, belonging to an alphabet Σ. Then, S[(~i, ~i′)] ∈ Σm denotes the subsequence
S(i1,i′1)S(i2,i′2)...S(im,i′m), where S(i,i′) represents the i′-th symbol of the i-th symbol
set in S. The set sequence kernel is defined, for subsequences of length m, as:
$$K_m(S, S') = \sum_{u \in \Sigma^m} \sum_{(\vec{i},\vec{i}'):\,u=S[(\vec{i},\vec{i}')]} p(u,\vec{i}\,) \cdot \sum_{(\vec{j},\vec{j}'):\,u=S'[(\vec{j},\vec{j}')]} p(u,\vec{j}\,)$$
where p(u,~i) is a penalization function that may be based on the count of matching
symbols or on the count of gaps.
The tree sequence kernel is then defined by integrating the algorithms of the set se-
quence kernel and of the tree kernel. This is achieved by transforming the tree structure
into a set sequence structure, and then matching the subtrees in a subtree sequence from
left to right and from top to bottom. An efficient approach to computing the kernel is
provided by Sun et al. (2011), in a similar manner to the approach used to compute the
string kernel.
Partial Tree Kernel The work of Moschitti (2006a) proposed a variant of the orig-
inal tree kernel by Collins and Duffy (2001). In this variant, the notion of subtree is
extended to include a larger feature space. This is done by relaxing the constraint on
the integrity of the production rules appearing in a subtree. Thus, partial production
rules may be included in a subtree, i.e. a subtree may contain any subset of the original
children for each one of its nodes. This feature space is clearly much larger than the
original subtree feature space. The definition of the partial tree kernel is the same as
the one in Collins and Duffy (2001), but recursive function ∆ is modified as follows:
• ∆(n1, n2) = 0 if n1 and n2 have different labels;
• $\Delta(n_1, n_2) = 1 + \sum_{\vec{J}_1, \vec{J}_2,\, |\vec{J}_1| = |\vec{J}_2|}\ \prod_{i=1}^{|\vec{J}_1|} \Delta(ch(n_1, \vec{J}_{1i}),\, ch(n_2, \vec{J}_{2i}))$ otherwise,
where ~J1 and ~J2 are index sequences associated with the ordered child sequences of n1
and n2 respectively, so that ~J1i and ~J2i point to the i-th children in the two sequences.
Moreover, two decay factors are introduced: λ, having the same function as the
parameter by Collins and Duffy (2001); and µ, used to take into account the
presence of gaps in the productions of the subtrees. The latter parameter highlights the
fact that the partial tree kernel, like the tree sequence kernel, is inspired by the
use of both tree and string kernels at the same time. In fact, Moschitti (2006a) proposes
an efficient way of computing the partial tree kernel that defines a recursive formulation
for function ∆, analogous to the one used by the string kernel (Sec. 2.3.2.2).
Elastic Subtree Kernel In Kashima and Koyanagi (2002) a tree kernel for labeled
ordered trees is proposed. This variant on the tree kernel is very similar in principle
to that of the partial tree kernel, in that the feature space includes subtrees with
partial production rules. The kernel is defined as the one by Collins and Duffy (2001),
but function ∆ is defined by means of another recursive function, so that ∆(n1, n2) =
Sn1,n2(nc(n1), nc(n2)), where nc(n) is the number of children of node n. Function S
is then defined as follows:
$$S_{n_1,n_2}(i,j) = S_{n_1,n_2}(i-1,j) + S_{n_1,n_2}(i,j-1) - S_{n_1,n_2}(i-1,j-1) + S_{n_1,n_2}(i-1,j-1) \cdot \Delta(ch(n_1,i), ch(n_2,j))$$
An interesting point in the work of Kashima and Koyanagi (2002) is the introduc-
tion of two extensions for their kernel. In the first one, label mutations are allowed.
This means that, given a mutation score function f : Σ×Σ→ [0, 1], subtrees differing
in some labels are also included in the kernel computation, with a weight depending
on the score of the occurring mutations. The second extension of the kernel allows for
the matching of elastic tree structures. In other words, a subtree is considered to appear
in a tree as long as the relative positions of its nodes are preserved in the tree. This
allows for the inclusion of non-contiguous subtrees along with the contiguous ones.
This is an idea further explored in the framework of the mapping kernels.
Mapping Kernels The mapping kernels framework (Shin and Kuboyama, 2010; Shin
et al., 2011) has been proposed as a generalization of Haussler’s convolution kernel
(Sec. 2.3.2.1). In particular, it has been extensively applied to the study of existing tree
kernels and the engineering of new ones. The convolution kernel assumes that each
data point x in a space χ is associated with a finite subset χ′x of a common space χ′,
and that a kernel k : χ′ × χ′ → R is given. Then, the convolution kernel is defined by:
$$K(x, y) = \sum_{(x',y') \in \chi'_x \times \chi'_y} k(x', y')$$
The mapping kernel differs from the convolution kernel in two aspects. Firstly,
instead of evaluating every pair (x′, y′) ∈ χ′x × χ′y , it evaluates only the pairs in a
predetermined subsetMx,y of χ′x×χ′y . Then, the mapping kernel relaxes the constraint
that χ′x must be a subset of χ′, by introducing a mapping γx : χ′x → χ′. So, the
mapping kernel is defined as:
$$K(x, y) = \sum_{(x',y') \in M_{x,y}} k(\gamma_x(x'), \gamma_y(y'))$$
Shin and Kuboyama (2010) show that this is a positive semidefinite kernel if and only
if the mapping system Mx,y is transitive. Moreover, they show how most of the existing
tree kernels can be reduced to the framework of the mapping kernels, by appropriately
defining the spaces χ′x and Mx,y, the mapping γx and the kernel k.
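As a toy instance of the framework (purely illustrative): take strings as objects, let χ′x be the set of positions of x, let γx map a position to the character found there, and let Mx,y pair equal positions only, which is a transitive mapping system. The resulting kernel counts the positions where the two strings agree:

```python
def mapping_kernel(x, y, M, gamma_x, gamma_y, k):
    """Generic mapping kernel: sum the local kernel k over the
    selected pairs in M only, after applying the mappings gamma."""
    return sum(k(gamma_x(a), gamma_y(b)) for a, b in M)

x, y = "cat", "car"
M = [(i, i) for i in range(min(len(x), len(y)))]  # equal-position pairs only
char_eq = lambda a, b: 1.0 if a == b else 0.0
score = mapping_kernel(x, y, M, lambda i: x[i], lambda j: y[j], char_eq)
assert score == 2.0  # 'c' and 'a' match; 't' vs 'r' does not
```

Replacing M with the full cross product χ′x × χ′y recovers Haussler's convolution kernel, which is what makes the mapping kernel a strict generalization.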
2.4.1.2 Other Feature Spaces
Together with the development of tree kernels based on the work of Collins and Duffy
(2001), other kinds of feature spaces have been explored. These kinds of tree kernels
are not strictly related to the subtree framework, and propose simpler features such
as paths or different ones such as logic descriptions. This section will present a brief
summary of some of these works.
Subpath Tree Kernel The subpath tree kernel (Kimura et al., 2011) uses very simple
tree fragments: chains of nodes. Given a context-free grammar G = (N,Σ, P, S), any
sequence of non-terminal symbols N , possibly closed by one terminal symbol in Σ, is
a valid tree fragment.
The kernel function between two trees T1 and T2 is then defined as:
$$K(T_1, T_2) = \sum_{p \in P} \lambda^{|p|}\, num(T_1, p)\, num(T_2, p) \qquad (2.1)$$

where P is the set of all subpaths in T1 and T2 and num(T, p) is the number of times
a subpath p appears in tree T. λ is a parameter, similar to the one of the classic tree
kernel, assigning an exponentially decaying weight to a subpath p according to its
length |p|.
A simple algorithm for the computation of the subpath tree kernel is the recursive
formulation that follows:
$$K(T_1, T_2) = \sum_{n_1 \in N(T_1)} \sum_{n_2 \in N(T_2)} \Delta(n_1, n_2) \qquad (2.2)$$
Function ∆(n1, n2) is defined as:
• ∆(n1, n2) = λ if n1 or n2 is a terminal node and their labels are the same;

• ∆(n1, n2) = λ(1 + Σi,j ∆(ch(n1, i), ch(n2, j))) if n1 and n2 are two non-
terminal nodes with the same label;

• ∆(n1, n2) = 0 otherwise

where, as usual, ch(n, i) is the i-th child of node n in tree T. More efficient algorithms
are provided in Kimura and Kashima (2012); Kimura et al. (2011).
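The recursion can be sketched as follows (illustrative; trees are encoded as (label, children) pairs, an assumption of this example, and the second case is applied only when the two labels coincide, consistently with Equation 2.1):

```python
def nodes(t):
    """All nodes of a tree encoded as (label, [children])."""
    label, children = t
    out = [t]
    for c in children:
        out.extend(nodes(c))
    return out

def delta(n1, n2, lam):
    """Weighted count of common subpaths starting at n1 and n2."""
    l1, c1 = n1
    l2, c2 = n2
    if l1 != l2:
        return 0.0
    if not c1 or not c2:              # a terminal node: the subpath stops here
        return lam
    # Same non-terminal label: the single-node subpath plus every
    # extension through a pair of children.
    return lam * (1.0 + sum(delta(a, b, lam) for a in c1 for b in c2))

def subpath_kernel(t1, t2, lam=0.5):
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

# Chain A -> b shares subpaths "A", "b" and "A b" with itself:
# K = 2*lam + lam^2 = 1.25 for lam = 0.5.
t = ("A", [("b", [])])
assert abs(subpath_kernel(t, t, 0.5) - 1.25) < 1e-12
```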
Route Kernel Large tree structures with many symbols may produce feature spaces
of tree fragments that are very sparse. This may affect the final performance of the
classification function, as discussed in Suzuki and Isozaki (2006). Route kernels for
trees (Aiolli et al., 2009) are introduced to address this issue. Instead of encoding a
29
Chapter 2. Machine Learning and Structured Data
Figure 2.1: Routes in trees: an example.
path between two nodes in the tree using the node labels, route kernels use the relative
position of the edges in the production originated in a node. As shown in Aiolli et al.
(2009), this reduces the sparsity and has a positive effect on the final performance of
the classifiers.
Route kernels for trees deal with positional ρ-ary trees, i.e. trees where a unique
positional index Pn[e] ∈ {1, · · · , ρ} is assigned to each edge e leaving from node n.
Figure 2.1 reports an example tree with positional indexes as edge labels. Route kernels
introduce the notion of route π(ni, nj) between nodes ni and nj as the sequence of
indexes of the edges that constitute the shortest path between the two nodes. The
definition follows:
$$\pi(n_1, n_k) = P_{n_1}[(n_1, n_2)]\, P_{n_2}[(n_2, n_3)] \ldots P_{n_{k-1}}[(n_{k-1}, n_k)]$$
In the general case, a route may contain both positive and negative indexes, for edges
that are traversed away from or towards the root, respectively. For example, the route
from node B to node D is π(B,D) = [−1, 2, 1], as the edge (A,B) is traversed
towards the root of the tree.
In this setting, a generalized route kernel takes the form of:
$$K(T_1, T_2) = \sum_{n_i,n_j \in T_1} \sum_{n_l,n_m \in T_2} k_\pi((n_i, n_j), (n_l, n_m))\, k_\xi((n_i, n_j), (n_l, n_m)) \qquad (2.3)$$
where kπ is a local kernel defined on the routes and kξ is some other local kernel used
to add expressiveness to the kernel.
Aiolli et al. (2009) define an instantiation of the generalized route kernel, for which
an efficient implementation is proposed. This kernel restricts the set of feasible routes
to those between a node and any of its descendants. The empty route π(n, n) is in-
cluded, with |π(n, n)| = 0. A decay factor λ is introduced to reduce the influence of
larger routes, leading to the following formulation for kπ:
$$k_\pi((n_i, n_j), (n_l, n_m)) = \delta(\pi(n_i, n_j), \pi(n_l, n_m))\, \lambda^{|\pi(n_i,n_j)|} \qquad (2.4)$$
where δ is the usual Kronecker comparison function. Finally, kξ is defined as δ(l(nj), l(nm)),
i.e. 1 if nj and nm have the same label, 0 otherwise. A variant is also proposed for kξ,
where the whole productions at nj and nm are compared instead.
Relational Kernel In Cumby and Roth (2003) a family of kernel functions is pro-
posed, built up from a description language of limited expressivity, tailored for rela-
tional domains. Relational learning problems include learning to identify functional
phrases and named entities from linguistic parse trees, learning to classify molecules
for mutagenicity from atom-bond data, or learning a policy to map goals to actions in
planning domains.
The proposed relational kernel is specified through the use of a previously intro-
duced feature description language (Cumby and Roth, 2002). An interesting aspect of
this language is that it provides a framework for representing the properties of nodes
in a concept graph. Thus, the relational kernel may be applied to more generic struc-
tures than trees. Features for this kernel are described by propositions like “(AND
phrase(NP) (contains word(boy)))”, essentially meaning that in the given data instance
∃x, y such that phrase(x,NP ) ∧ contains(x, y) ∧ word(y, boy).
Then, for any two graphs G1, G2 and feature description D, the kernel function is
defined as:
$$K_D(G_1, G_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} k_D(n_1, n_2)$$
where N1, N2 are the node sets of G1, G2 respectively, and function kD is defined
inductively on the structure of the feature description D. More complex kernels can be
defined by considering a set of feature descriptions and combining the corresponding
kernels.
2.4.2 Computational Complexity
Since kernel machines perform many tree kernel computations during learning and
classification, the research in efficient tree kernel algorithms has always been a key
issue. The original tree kernel algorithm by Collins and Duffy (2001), which relies on
dynamic programming techniques, has a quadratic time and space complexity with
respect to the size of input trees. Execution time and space occupation are still affordable
for parse trees of natural language sentences, which hardly go beyond a few hundred
nodes. But these tree kernels hardly scale to large training and application sets, and
moreover have several limitations when dealing with large trees, such as HTML
documents or other structured network data. Then, several attempts at reducing the tree
kernels' computational complexity have been pursued. Since the worst-case complexity
of tree kernels is hard to improve, the biggest effort has been devoted to controlling the
average execution time of tree kernel algorithms. Three directions have been mainly
explored.
The first direction is the exploitation of some specific characteristics of trees, as in
the fast tree kernel by Moschitti (2006b). Prior to the actual kernel computation, this
algorithm efficiently builds a node pair set $N_p = \{\langle n_1, n_2 \rangle \in N_{T_1} \times N_{T_2} : p(n_1) = p(n_2)\}$,
where $N_T$ is the set of nodes of tree T and p(n) returns the production rule
associated with node n. Then, the kernel is computed as:

$$K(T_1, T_2) = \sum_{\langle n_1, n_2 \rangle \in N_p} \Delta(n_1, n_2)$$
where function ∆ is the same as in Collins and Duffy (2001). The result is preserved,
though, since only pairs of nodes 〈n1, n2〉 such that ∆(n1, n2) = 0 are omitted. Mos-
chitti (2006b) demonstrated that, by using the fast tree kernel, the execution time of
the original algorithm becomes linear in average for parse trees of natural language
sentences. Yet, the tree kernel has still to be computed over the full underlying feature
space and the space occupation is still quadratic.
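As an illustration, the node-pair construction above can be sketched as follows. The encoding of trees as nested tuples (label, child, ...), the decay parameter `lam`, and all function names are our own assumptions for this sketch, not the original implementation.

```python
# Sketch of the node-pair set Np used by the fast tree kernel.
# A tree is a nested tuple (label, child, child, ...); leaves are plain strings.

from collections import defaultdict

def nodes(t):
    """Yield every node (subtuple) of tree t."""
    yield t
    for c in t[1:]:
        if isinstance(c, tuple):
            yield from nodes(c)

def production(n):
    """Production rule at node n: parent label plus child labels."""
    return (n[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in n[1:])

def delta(n1, n2, lam=1.0):
    """Collins-Duffy Delta: weighted count of common fragments rooted at n1, n2."""
    if production(n1) != production(n2):
        return 0.0
    kids1 = [c for c in n1[1:] if isinstance(c, tuple)]
    kids2 = [c for c in n2[1:] if isinstance(c, tuple)]
    if not kids1:                      # pre-terminal: a single common fragment
        return lam
    prod = lam
    for c1, c2 in zip(kids1, kids2):   # same production => same number of children
        prod *= 1.0 + delta(c1, c2, lam)
    return prod

def fast_tree_kernel(t1, t2, lam=1.0):
    """Sum Delta only over pairs in Np, i.e. nodes sharing a production."""
    by_prod = defaultdict(list)
    for n in nodes(t2):
        by_prod[production(n)].append(n)
    return sum(delta(n1, n2, lam)
               for n1 in nodes(t1)
               for n2 in by_prod.get(production(n1), []))
```

Nodes with distinct productions never enter Np, which is what keeps the average-case running time low on natural language parse trees.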
The second explored direction is the reduction of the underlying feature space of
tree fragments, in order to control the execution time by introducing an approximation
of the kernel function. The approximate tree kernel (Rieck et al., 2010) is based on the
introduction of a feature selection function ω : Σ → {0, 1}, where Σ is the set of node
labels. The approximate tree kernel is then defined as:

Kω(T1, T2) = ∑_{s ∈ Σ} ω(s) ∑_{n1 ∈ N_T1 : l(n1)=s} ∑_{n2 ∈ N_T2 : l(n2)=s} Δ̃(n1, n2)

where the function Δ̃(n1, n2) is the same as Δ(n1, n2), but returns 0 if either n1 or n2
has not been selected, i.e. ω(l(n1)) = 0 or ω(l(n2)) = 0. The feature selection
is done in the learning phase by solving an optimization problem, so as to maximally
preserve the discriminative power of the kernel. Then, for the classification
phase, the selection is directly encoded in the kernel computation, by considering only the
subtrees headed by the selected node labels. A similar approach is used by Pighin and
Moschitti (2010), where a smaller feature space is explicitly selected by discarding the
features whose weights contribute less to the kernel machine gradient w. In both
cases, the beneficial effect is only obtained during the classification phase, while the
learning phase is burdened with feature selection algorithms.
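The effect of the selection function ω can be sketched as follows; the tree encoding, the simplified Δ (which omits decay factors), and the application of the selection only at the roots of the summed pairs are illustrative assumptions rather than the exact algorithm of Rieck et al. (2010).

```python
# Sketch of the approximate tree kernel: a 0/1 selection function omega over
# node labels gates which node pairs contribute to the sum.

def nodes(t):
    """Yield every node (subtuple) of tree t."""
    yield t
    for c in t[1:]:
        if isinstance(c, tuple):
            yield from nodes(c)

def production(n):
    """Production rule at node n: parent label plus child labels."""
    return (n[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in n[1:])

def delta(n1, n2):
    """Simplified Collins-Duffy Delta (no decay factor)."""
    if production(n1) != production(n2):
        return 0.0
    kids1 = [c for c in n1[1:] if isinstance(c, tuple)]
    kids2 = [c for c in n2[1:] if isinstance(c, tuple)]
    if not kids1:
        return 1.0
    prod = 1.0
    for c1, c2 in zip(kids1, kids2):
        prod *= 1.0 + delta(c1, c2)
    return prod

def approximate_tree_kernel(t1, t2, omega):
    """Sum Delta over node pairs whose shared label s has omega(s) = 1."""
    return sum(delta(n1, n2)
               for n1 in nodes(t1) for n2 in nodes(t2)
               if n1[0] == n2[0] and omega(n1[0]) == 1)
```

Selecting every label recovers the exact kernel (pairs with different labels have different productions, so their Δ is 0 anyway), while a sparse ω restricts the sum to fragments rooted in the selected labels.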
A third approach is the one of Shin et al. (2011). In the framework of the mapping
kernels (Sec. 2.4.1.1), they exploit dynamic programming on the whole training and
application sets of instances. Kernel functions are then reformulated so that they can
reuse partial kernel computations previously performed on other pairs of trees.
As with any dynamic programming technique, this approach trades time
complexity for space complexity.
3 Improving Expressive Power: Kernels on tDAGs
One of the most important research areas in Natural Language Processing concerns the
modeling of the semantics expressed in text. Since foundational work in natural language
understanding has shown that a deep semantic approach is still not feasible, current
research focuses on shallow methods that combine linguistic models and machine
learning techniques. These aim at learning semantic models, like those that can detect
the entailment between the meanings of two text fragments, by means of training
examples described by specific features. Such features are rather difficult to design, since
there is no linguistic model that can effectively encode the lexico-syntactic level of a
sentence and its corresponding semantic model. Thus, the adopted solution consists of
exhaustively describing training examples by means of all possible combinations of
sentence words and syntactic information. The latter, typically expressed as parse trees
of text fragments, is often encoded in the learning process using graph algorithms. As
the general problem of common subgraph counting is NP-hard (see Sec. 2.3.2.4), a good
strategy is to find relevant classes of graphs, more general than trees, for which
efficient algorithms can be found.
In this chapter, a specific class of graphs, the tripartite directed acyclic graphs
(tDAGs), is defined. We show that the similarity between tDAGs in terms of subgraphs
can be used as a kernel function in Support Vector Machines (see Sec. 2.2) to
derive semantic implications between pairs of sentences. We show that such a model can
capture first-order rules (FOR), i.e. rules that can be expressed in first-order logic, for
textual entailment recognition (at least at the syntactic level). Most importantly, we
provide an algorithm for efficiently computing the kernel on tDAGs.
The chapter is organized as follows. In Section 3.1, we introduce some background
on the task of Textual Entailment Recognition. In Section 3.2, we describe tDAGs and
their use for modeling FOR. In Section 3.3, we introduce the similarity function for
FOR spaces, and we then introduce our efficient algorithm for computing the similarity
among tDAGs. In Section 3.5, we empirically analyze the computational efficiency of
our algorithm and compare it against the analogous approach proposed by Moschitti
and Zanzotto (2007).
3.1 Machine Learning for Textual Entailment Recognition
In Natural Language Processing, the kernel trick is widely used to represent structures
in the huge space of their substructures, e.g. to represent the syntactic structure of
sentences. The first and most popular example is the tree kernel defined by Collins and
Duffy (2002) (see Section 2.4). In this case a feature j is a syntactic tree fragment,
e.g. (S (NP) (VP))¹. Thus, in the feature vector of an instance (a tree) t, the feature j
assumes a value different from 0 if the subtree (S (NP) (VP)) belongs to t. The subtree
space is very large, but the scalar product just counts the common subtrees between the
two syntactic trees, i.e.:

¹A sentence S composed of a noun phrase NP and a verb phrase VP.
K(t1, t2) = F(t1) · F(t2) = |S(t1) ∩ S(t2)|     (3.1)
where S(t) is the set of subtrees of tree t. Yet, some important NLP tasks,
such as Recognition of Textual Entailment (Dagan and Glickman, 2004; Dagan et al.,
2006), and some linguistic theories, such as HPSG (Pollard and Sag, 1994), require more
general graphs and, thus, more general algorithms for computing similarity among
graphs.
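The set-intersection view of Eq. 3.1 can be sketched with a deliberately simplified feature space of complete subtrees (each node together with all of its descendants); the full Collins-Duffy space of arbitrary tree fragments is much larger, so this is an illustration of the counting idea only, with an assumed tuple encoding of trees.

```python
# Minimal sketch of |S(t1) ∩ S(t2)|, restricted to complete subtrees.
# A tree is a nested tuple (label, child, ...); leaves are plain strings.

def subtrees(t):
    """Set of complete subtrees of t, each encoded as a nested tuple."""
    out = {t}
    for c in t[1:]:
        if isinstance(c, tuple):
            out |= subtrees(c)
    return out

def subtree_kernel(t1, t2):
    """Count the subtrees shared by t1 and t2 via plain set intersection."""
    return len(subtrees(t1) & subtrees(t2))
```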
Recognition of Textual Entailment (RTE) is an important basic task in natural language
processing and understanding. The task is defined as follows: given a text T and
a hypothesis H, we need to determine whether sentence T implies sentence H. For
example, we need to determine whether or not "Farmers feed cows animal extracts"
entails "Cows eat animal extracts" (T1, H1). It should be noted that a model suitable for
approaching the complex natural language understanding task must also be capable of
recognizing textual entailment (Chierchia and McConnell-Ginet, 2001). Moreover, in more
specific NLP challenges, where we want to build models for specific tasks, systems
and models solving RTE can play a very important role.
RTE has been proposed as a generic task tackled by systems for open domain
question-answering (Voorhees, 2001), multi-document summarization (Dang, 2005),
information extraction (MUC-7, 1997), and machine translation. In question-answering,
a subtask of the problem of finding answers to questions can be rephrased as an RTE
task. A system could answer the question "Who played in the 2006 Soccer World
Cup?" using a retrieved text snippet "The Italian Soccer team won the World Championship
in 2006". Yet, knowing that "The Italian soccer team" is a candidate answer,
the system has to solve the problem of deciding whether or not the sentence "The
Italian football team won the World Championship in 2006" entails the sentence "The
Italian football team played in the 2006 Soccer World Cup". The system proposed in
Harabagiu and Hickl (2006), the answer validation exercise (Peñas et al., 2007), and
the correlated systems (e.g. Zanzotto and Moschitti (2007)) use this reformulation of
the question-answering problem. In multi-document summarization (extremely useful
for intelligence activities), again, part of the problem, i.e. the detection of redundant
sentences, can be framed as an RTE task (Harabagiu et al., 2007). The detection of
redundant or implied sentences is a very important task, as it is the way of correctly
reducing the size of the documents.
RTE models are thus extremely important, as they enable the building of final NLP
applications. Yet, as any NLP model, textual entailment recognizers need a large
amount of knowledge. This knowledge ranges from simple equivalence, similarity,
or relatedness between words to more complex relations between generalized text
fragments. For example, to deal with the above example, an RTE system should have:

• a similarity relationship between the words soccer and football, even if this similarity
is valid only under specific conditions;

• the entailment relation between the words win and play;

• the entailment rule X won Y in Z → X played Y in Z.

This knowledge is generally extracted in a supervised setting using annotated training
examples (e.g. Zanzotto et al. (2009)) or in an unsupervised setting using large corpora
(e.g. Lin and Pantel (2001); Pantel and Pennacchiotti (2006); Zanzotto et al. (2006)).
The kind of knowledge that can be extracted by the two methods is extremely different,
as unsupervised methods can induce positive entailment rules, whereas supervised
learning methods can learn both positive and negative entailment rules. A rule such as
"tall does not entail short", even if the two words are related, can be learned only using
supervised machine learning approaches.
To use supervised machine learning approaches, we have to frame the RTE task
as a classification problem (Zanzotto et al., 2009). This is in fact possible, as an RTE
system can be seen as a classifier that, given a (T, H) pair, outputs one of two
classes: entails if T entails H, or not-entails if T does not entail H. Yet, this classifier,
as well as its learning algorithm, has to deal with an extremely complex feature space
in order to be effective. If we represent T and H as graphs, the classifier and the learning
algorithm have to deal with two interconnected graphs since, to model the relation
between T and H, we need to connect words in T and words in H.
In Raina et al. (2005); Haghighi et al. (2005); Hickl et al. (2006), the problem of
dealing with interconnected graphs is solved outside the learning algorithm and the
classifier. The two connected graphs, representing the two texts T and H, are used
to compute similarity features, i.e. features representing the similarity between T and
H. The underlying idea is that lexical, syntactic, and semantic similarities between the
sentences in a pair are relevant features for classifying sentence pairs into classes such as
entails and not-entails. In this case, features are not subgraphs. Yet, these models can
easily fail, as two similar sentences may in one case form an entailment pair and in
another case not. For example, the sentence "All companies pay dividends" (A) entails
"All insurance companies pay dividends" (B) but does not entail "All companies pay
cash dividends" (C). In terms of the number of different words, the difference between (A)
and (B) is the same as that between (A) and (C).
If we want to better exploit training examples to learn textual entailment classifiers,
we need to use first-order rules (FOR) that describe entailment in the training instances.
Suppose that the instance "Pediatricians suggest women to feed newborns breast milk"
entails "Pediatricians suggest that newborns eat breast milk" (T2, H2), and that this
instance is contained in the training data. For classifying (T1, H1), the first-order rule
ρ = feed Y Z → Y eat Z must be learned from (T2, H2). The feature space describing
first-order rules, introduced in Zanzotto and Moschitti (2006), allows for more accurate
textual entailment recognition than traditional feature spaces. Unfortunately,
this model, as well as the one proposed in Moschitti and Zanzotto (2007), shows two
major limitations: it can only represent rules with fewer than seven variables, and its
similarity function is not a valid kernel.
In de Marneffe et al. (2006), first-order rules have been explored. Yet, the associated
spaces are extremely small: only some features representing first-order rules were
explored. Pairs of graphs are used there to determine whether a feature is active or not, i.e.
whether the rule fires or not. A larger feature space of rewrite rules was implicitly explored
in Wang and Neumann (2007a), but they considered only ground rewrite rules. In
machine translation as well, some methods, such as Eisner (2003), learn graph-based rewrite
rules for generative purposes. Yet, the method presented in Eisner (2003) can model
first-order rewrite rules only with a very small number of variables, i.e. two or three.
3.2 Representing First-order Rules and Sentence Pairs as Tripartite Directed Acyclic Graphs
To define and build feature spaces for first-order rules we cannot rely on existing kernel
functions over tree fragment feature spaces (Collins and Duffy, 2002; Moschitti, 2004).
These feature spaces are not sufficiently expressive for describing rules with variables.
In this section, we explain through an example why we cannot use tree fragments,
and we then introduce the tripartite directed acyclic graphs (tDAGs) as a subclass of
graphs useful for modeling first-order rules. We intuitively show that, if sentence pairs are
described by tDAGs, determining whether or not a pair triggers a first-order rewrite
rule is a graph matching problem.

To explore the problem of defining feature spaces for first-order rules, we can consider
the rule ρ = feed Y Z → Y eat Z and the sentence pair (T1, H1). The rule ρ encodes
the entailment relation between the verb to feed and the verb to eat. Represented over
a syntactic interpretation, the rule takes the following form:
ρ = (VP (VB feed) (NP Y) (NP Z)) → (S (NP Y) (VP (VB eat) (NP Z)))
A similar tree-based representation can be derived for the pair (T1, H1), where the
syntactic interpretations of both sentences in the pair are represented and the connections
between the text T and the hypothesis H are made explicit in the structure. This
representation of the pair (T1, H1) takes the following form:
P1 = ⟨ (S (NP (NNS Farmers))
          (VP (VB feed) (NP 1 (NNS 1 cows)) (NP 3 (NN 2 animal) (NNS 3 extracts)))),
       (S (NP 1 (NNS 1 Cows))
          (VP (VB eat) (NP 3 (NN 2 animal) (NNS 3 extracts)))) ⟩
Augmenting node labels with numbers is one of the ways of co-indexing parts of the
trees. In this case, co-indexes indicate that a part of one tree is significantly related
to a part of the other tree, e.g. the co-index 1 on the NNS nodes describes the
relation between the two nodes describing the plural common noun (NNS) cows in
the two trees, and the same co-index on the NP nodes indicates the relation between
the noun phrases (NP) having cows as semantic head (Pollard and Sag, 1994). Such
co-indexes are frequently used as additional parts of node labels in computational
linguistics, to indicate relations among different parts of a syntactic tree (e.g. Marcus
et al. (1993)). Yet, the names used for the co-indexes have a precise meaning only
within the trees where they are used. Then, having a similar representation for the rule
ρ and the pair P1, we need to determine whether or not the pair P1 triggers the rule ρ.
Considering both the variables in the rule ρ and the co-indexes in the pair P1 as extensions
of the node labels, we would like to see this as a tree matching problem. In this case, we
could easily apply existing kernels for tree fragment feature spaces (Collins and Duffy,
2002; Moschitti, 2004). However, this simple example shows that this is not the case,
as the two trees representing the rule ρ cannot be matched with the two subtrees:
(VP (VB feed) (NP 1) (NP 3))          (S (NP 1) (VP (VB eat) (NP 3)))
as the node label NP 1 is not equal to the node label NP Y.
To solve the above problem, similarly to the case of feature structures (Carpenter,
1992), we can represent the rule ρ and the pair P1 as graphs. We start the discussion by
describing the graph for the rule ρ. Since we are interested in the relation between the
right-hand side and the left-hand side of the rule, we can substitute each variable with an unlabeled
Figure 3.1: A simple rule (a) and a simple pair (b) as graphs.
node. We then connect the tree nodes bearing variables to the corresponding unlabeled
node. The result is the graph in Figure 3.1(a): variables Y and Z are represented by
the unlabeled nodes between the trees.

In the same way, we can represent the sentence pair (T1, H1) using a graph with
explicit links between related words and nodes (see Figure 3.1(b)). We can link words
using anchoring methods as in Raina et al. (2005). These links can then be propagated
up the syntactic trees using the semantic heads of the constituents (Pollard and Sag, 1994).
The rule ρ matches over the pair (T1, H1) if the graph for ρ (Figure 3.1(a)) is among the
subgraphs of the graph in Figure 3.1(b).
Both rules and sentence pairs are graphs of the same type. These graphs are basi-
cally two trees connected through an intermediate set of nodes, representing variables
in the rules and relations between nodes in the sentence pairs. We will hereafter call
these graphs tripartite directed acyclic graphs (tDAGs). The formal definition follows.
Definition 3.2.1. tDAG: A tripartite directed acyclic graph is a graph G = (N, E)
where

• the set of nodes N is partitioned into three sets Nt, Ng, and A;

• the set of edges E is partitioned into four sets Et, Eg, EAt, and EAg;

such that t = (Nt, Et) and g = (Ng, Eg) are two trees, and EAt = {(x, y) | x ∈
Nt and y ∈ A} and EAg = {(x, y) | x ∈ Ng and y ∈ A} are the edges connecting the
two trees.
A tDAG is a partially labeled graph. The labeling function L only applies to the
subsets of nodes related to the two trees, i.e. L : Nt ∪Ng → L. Nodes in set A are not
labeled.
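For illustration, Definition 3.2.1 can be encoded as two trees plus a shared map of anchor nodes; the field names and the Python encoding are our own assumptions, not notation from this chapter.

```python
# Illustrative encoding of a tDAG (a sketch, not the thesis implementation).

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: Optional[str]              # None would model an unlabeled node
    children: list = field(default_factory=list)

@dataclass
class TDAG:
    t_root: Node                      # tree t = (Nt, Et)
    g_root: Node                      # tree g = (Ng, Eg)
    # anchors maps an anchor name (e.g. "1") to the pair of nodes of t and g
    # connected to that unlabeled node via E_At and E_Ag.
    anchors: dict = field(default_factory=dict)
```

A node of t and a node of g are co-indexed exactly when they appear together in an `anchors` entry, mirroring the boxed placeholders used in the figures.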
The explicit representation of the tDAG in Figure 3.1(b) shows that determining the
rules fired by a sentence pair is a graph matching problem. To simplify our explanation,
we will then describe a tDAG with an alternative and more convenient representation:
a tDAG G = (N, E) can be seen as a pair G = (τ, γ) of extended trees τ and γ, where
τ = (Nt ∪ A, Et ∪ EAt) and γ = (Ng ∪ A, Eg ∪ EAg). These are extended trees in
the sense that each tree contains the relations with the other tree.

As in the case of feature structures, we will graphically represent (x, y) ∈ EAt
and (z, y) ∈ EAg as boxes y on nodes x and z, respectively. These nodes will then
appear as L(x) y and L(z) y, e.g. NP 1. The name y is not a label but a placeholder,
or anchor, representing an unlabeled node. This representation is used both for rules and
for sentence pairs. The sentence pair of Figure 3.1(b) is then represented as reported in
pair P1 of Figure 3.2.
3.3 An Efficient Algorithm for Computing the First-order Rule Space Kernel
In this section, we present an efficient algorithm implementing feature spaces for
deriving first-order rules (FOR). In Section 3.3.1, we first define the similarity function,
P1 = ⟨ (S (NP (NNS Farmers))
          (VP (VB feed) (NP 1 (NNS 1 cows)) (NP 3 (NN 2 animal) (NNS 3 extracts)))),
       (S (NP 1 (NNS 1 Cows))
          (VP (VB eat) (NP 3 (NN 2 animal) (NNS 3 extracts)))) ⟩

P2 = ⟨ (S 2 (NP 1 (NNS 1 Pediatricians))
            (VP 2 (VB 2 suggest)
                  (S (NP (NNS women))
                     (VP (TO to)
                         (VP (VB feed) (NP 3 (NNS 3 newborns))
                             (NP 4 (NNS 5 breast) (NN 4 milk))))))),
       (S 2 (NP 1 (NNS 1 Pediatricians))
            (VP 2 (VB 2 suggest)
                  (SBAR (IN that)
                        (S (NP 3 (NNS 3 newborns))
                           (VP (VB eat) (NP 4 (NN 5 breast) (NN 4 milk))))))) ⟩

Figure 3.2: Two tripartite DAGs.
i.e. the kernel K(G1, G2), that implements the feature spaces for learning first-order
rules. This kernel is based on the definition of isomorphism between graphs and on our
efficient approach for detecting the isomorphism between tDAGs (Section 3.3.2). Then,
we present the basic idea and the formalization of our efficient algorithm for computing
K(G1, G2), based on the properties of tDAG isomorphism (Section 3.3.3). We
demonstrate that our algorithm, and thus our kernel function, computes the FOR
feature space. We finally describe the ancillary algorithms and properties that make the
computation possible (Section 3.3.4).
3.3.1 Kernel Functions over First-order Rule Feature Spaces
In this section we introduce the FOR space and we then define the prototypical kernel
function that implicitly defines it. The FOR space is, in general, the space of all possible
first-order rules defined as tDAGs. Within this space, it is possible to define a function
S(G) that computes all the subgraphs (features) of a tDAG G. Therefore, we need to
take into account the subgraphs of G that represent first-order rules.

Definition 3.3.1. S(G): Given a tDAG G = (τ, γ), S(G) is the set of subgraphs of G
of the form (t, g), where t and g are extended subtrees of τ and γ, respectively.
For example, the subgraphs of P1 and P2 in Figure 3.2 are hereafter partially
represented:

S(P1) = { ⟨(S (NP) (VP)), (S (NP 1) (VP))⟩,
          ⟨(NP 1 (NNS 1)), (NP 1 (NNS 1))⟩,
          ⟨(S (NP) (VP (VB feed) (NP 1) (NP 3))), (S (NP 1) (VP (VB eat) (NP 3)))⟩,
          ⟨(VP (VB feed) (NP 1) (NP 3)), (S (NP 1) (VP (VB eat) (NP 3)))⟩, ... }

and

S(P2) = { ⟨(S 2 (NP 1) (VP 2)), (S 2 (NP 1) (VP 2))⟩,
          ⟨(NP 1 (NNS 1)), (NP 1 (NNS 1))⟩,
          ⟨(VP (VB feed) (NP 3) (NP 4)), (S (NP 3) (VP (VB eat) (NP 4)))⟩, ... }
In the FOR space, the kernel function K should then compute the number of subgraphs
in common between two tDAGs G1 and G2. The trivial way to describe K is
using the intersection operator, i.e.:

K(G1, G2) = |S(G1) ∩ S(G2)|,     (3.2)

where a graph g is in the intersection S(G1) ∩ S(G2) if it belongs to both S(G1) and
S(G2).
We point out that determining whether two graphs g1 and g2 are the same graph,
g1 = g2, is not trivial. For example, it is not sufficient to naively compare graphs
to determine that a given rule belongs to both S(G1) and S(G2). If we compare the string
representations of the fourth tDAG in S(P1) and the third in S(P2), we cannot derive
that the two graphs are the same graph.

We need a correct comparison for g1 = g2, i.e. the isomorphism between the two
graphs. Let us define Iso(g1, g2) as the predicate indicating the isomorphism between
the two graphs. When Iso(g1, g2) is true, either g1 or g2 can represent the graph.
Unfortunately, computing Iso(g1, g2) has exponential complexity (Köbler et al.,
1993).

To solve this complexity problem, we need to define the intersection operator
between sets of graphs differently. We will use the same symbol, but in prefix notation.

Definition 3.3.2. Given two tDAGs G1 and G2, we define the intersection between the
two sets of subgraphs S(G1) and S(G2) as:

∩(S(G1), S(G2)) = {g1 | g1 ∈ S(G1), ∃g2 ∈ S(G2), Iso(g1, g2)}
3.3.2 Isomorphism between tDAGs

Isomorphism between graphs is the critical point in defining an effective graph kernel,
so we here review its definition and adapt it to tDAGs. Detecting an isomorphism between
two tDAGs can be divided into two sub-problems:

• finding a partial isomorphism between the two pairs of extended trees;

• checking whether the partial isomorphism found between the two pairs of extended
trees is compatible with the sets of anchor nodes.
Consider the general definition of graph isomorphism.

Definition 3.3.3. Two graphs G1 = (N1, E1) and G2 = (N2, E2) are isomorphic (or
match) if |N1| = |N2|, |E1| = |E2|, and a bijective function f : N1 → N2 exists such
that, given the node labeling function L, these properties hold:

• for each node n ∈ N1, L(f(n)) = L(n);

• for each edge (n1, n2) ∈ E1, the edge (f(n1), f(n2)) is in E2.

The bijective function f is a member of the combinatorial set F of all possible
bijective functions between the two sets N1 and N2.
The trivial algorithm for detecting whether two graphs are isomorphic, which explores the
whole set F, is exponential (Köbler et al., 1993). It is still undetermined whether the general
graph isomorphism problem is NP-complete. Yet, we can exploit the fact that tDAGs
are pairs of extended trees to build an efficient algorithm, since efficient algorithms
exist for trees (such as the one used in Collins and Duffy (2002)).
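The trivial algorithm can be sketched as a brute-force search over all bijections in F; the encoding of a graph as a label dictionary plus a set of directed edges is our own assumption. The factorial running time makes the exponential cost concrete.

```python
# Naive isomorphism test from Definition 3.3.3: enumerate every bijection f
# between the node sets and check label and edge preservation. O(|N|!) time.

from itertools import permutations

def isomorphic(labels1, edges1, labels2, edges2):
    """labels*: dict node -> label; edges*: set of (node, node) pairs."""
    n1, n2 = sorted(labels1), sorted(labels2)
    if len(n1) != len(n2) or len(edges1) != len(edges2):
        return False
    for perm in permutations(n2):
        f = dict(zip(n1, perm))        # one candidate bijection from F
        if all(labels1[v] == labels2[f[v]] for v in n1) and \
           all((f[a], f[b]) in edges2 for (a, b) in edges1):
            return True
    return False
```

Even for graphs with a few dozen nodes this search is hopeless, which motivates exploiting the tree structure of tDAGs instead.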
Given two tDAGs G1 = (τ1, γ1) and G2 = (τ2, γ2), the isomorphism detection can be
reduced to the problem of detecting two properties:

1. Partial isomorphism. Two tDAGs G1 and G2 are partially isomorphic if τ1 and
τ2 are isomorphic and γ1 and γ2 are isomorphic. The partial isomorphism
produces two bijective functions fτ and fγ.

2. Constraint compatibility. Two bijective functions fτ and fγ are compatible on
the sets of nodes A1 and A2 if, for each n ∈ A1, it holds that fτ(n) = fγ(n).
We can rephrase the second property, i.e. the constraint compatibility, as follows. We
define two constraints c(τ1, τ2) and c(γ1, γ2) representing the functions fτ and fγ
restricted to the sets A1 and A2. The two constraints are defined as c(τ1, τ2) =
{(n, fτ(n)) | n ∈ A1} and c(γ1, γ2) = {(n, fγ(n)) | n ∈ A1}. Then, two partially
isomorphic tDAGs are isomorphic if the constraints match, i.e. c(τ1, τ2) = c(γ1, γ2).

For example, the fourth pair of S(P1) and the third pair of S(P2) are isomorphic
as: (1) they are partially isomorphic, i.e. their right-hand sides τ and their left-hand
sides γ are isomorphic; (2) both pairs of extended trees generate the constraint c1 =
{(1, 3), (3, 4)}. In the same way, the second pair of S(P1) and the second pair of
S(P2) generate c2 = {(1, 1)}.
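Constraint extraction and compatibility checking reduce to restricting the two bijections to the anchor set and comparing the results; the dictionary encoding of fτ and fγ below is an assumption for illustration, with the bijections themselves assumed to come from a prior tree-matching step.

```python
# Sketch of constraint compatibility over anchor nodes.

def constraint(f, anchors):
    """c = f restricted to the anchor set A1, as a set of (n, f(n)) pairs."""
    return {(n, f[n]) for n in anchors}

def compatible(f_tau, f_gamma, anchors):
    """Two bijections are compatible iff their restrictions to A1 coincide."""
    return constraint(f_tau, anchors) == constraint(f_gamma, anchors)
```

With anchors {1, 3} and both bijections mapping 1 to 3 and 3 to 4, this reproduces the constraint c1 = {(1, 3), (3, 4)} of the example above.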
Given the above considerations, we need to define what a constraint is, and we need
to demonstrate that two tDAGs satisfying the two properties are isomorphic.

Definition 3.3.4. Given two tDAGs G1 = (Nt1 ∪ Ng1 ∪ A1, E1) and G2 = (Nt2 ∪
Ng2 ∪ A2, E2), a constraint c is a bijective function between the sets A1 and A2.

We can then state the following theorem.

Theorem 3.3.1. Two tDAGs G1 = (N1, E1) = (τ1, γ1) and G2 = (N2, E2) =
(τ2, γ2) are isomorphic if they are partially isomorphic and constraint compatibility
holds for the two functions fτ and fγ induced by the partial isomorphism.
Proof. First we show that |N1| = |N2|. Since partial isomorphism holds, we have
that ∀n ∈ τ1. L(n) = L(fτ(n)). Since nodes in Nt1 and Nt2 are labeled
whereas nodes in A1 and A2 are unlabeled, it follows that ∀n ∈ Nt1. fτ(n) ∈ Nt2 and
∀n ∈ A1. fτ(n) ∈ A2. Thus we have that |Nt1| = |Nt2| and |A1| = |A2|. Similarly,
we can show that |Ng1| = |Ng2|, and since Nt, Ng and A are disjoint sets, we can
conclude that |Nt1 ∪ Ng1 ∪ A1| = |Nt2 ∪ Ng2 ∪ A2|, i.e. |N1| = |N2|.

Now we show that |E1| = |E2|. By partial isomorphism we know that |Et1 ∪
EAt1| = |Et2 ∪ EAt2| and |Eg1 ∪ EAg1| = |Eg2 ∪ EAg2|, so |Et1 ∪ EAt1| + |Eg1 ∪
EAg1| = |Et2 ∪ EAt2| + |Eg2 ∪ EAg2|. Since these are all disjoint sets, it trivially
follows that |Et1 ∪ EAt1 ∪ Eg1 ∪ EAg1| = |Et2 ∪ EAt2 ∪ Eg2 ∪ EAg2|, i.e. |E1| = |E2|.

Finally, we have to show the existence of a bijective function f : N1 → N2 such
as the one described in the definition of graph isomorphism. Consider the following
restrictions of fτ and fγ: fτ|Nt1 : Nt1 → Nt2, fγ|Ng1 : Ng1 → Ng2,
fτ|A1 : A1 → A2, fγ|A1 : A1 → A2. By constraint compatibility, we have that
fτ|A1 = fγ|A1. We can then define the function f as follows:

f(n) = fτ(n) if n ∈ Nt1;  f(n) = fγ(n) if n ∈ Ng1;  f(n) = fτ(n) = fγ(n) if n ∈ A1.

Since the properties described in the definition of graph isomorphism hold for both fτ
and fγ, they hold for f as well.
3.3.3 General Idea for an Efficient Kernel Function
As discussed above, two tDAGs are isomorphic if two properties hold: partial
isomorphism and constraint compatibility. To compute the kernel function
K(G1, G2) defined in Section 3.3.1, we can exploit these properties in reverse
order. Given a constraint c, we can select all the graphs that meet the constraint c
(constraint compatibility). Having determined the set of all the tDAGs meeting the
constraint, we can then detect the partial isomorphism: we split each pair of tDAGs into its
four extended trees and determine whether these extended trees are compatible.

We introduce this method to compute the kernel K(G1, G2) in the FOR space in
Pa = ⟨ (A 1 (B 1 (B 1) (B 2)) (C 1 (C 1) (C 2))),
       (L 1 (M 1 (M 2) (M 1)) (N 1 (N 2) (N 1))) ⟩

Pb = ⟨ (A 1 (B 1 (B 1) (B 2)) (C 1 (C 1) (C 3))),
       (L 1 (M 1 (M 3) (M 1)) (N 1 (N 2) (N 1))) ⟩

Figure 3.3: Simple non-linguistic tDAGs.
two steps: firstly, we give an intuitive explanation, and we then formally define the
kernel.

3.3.3.1 Intuitive Explanation

To give an intuition of the kernel computation, without loss of generality and for the
sake of simplicity, we use two non-linguistic tDAGs, Pa and Pb (see Figure 3.3), and
an approximate version of the subgraph function S(θ), where θ is one of the extended
trees of a pair, i.e. τ or γ, that only selects the subgraphs rooted in the root of θ.

To exploit the constraint compatibility property, we define C as the set of all the
relevant alternative constraints, i.e. the constraints c that could be generated when
detecting the partial isomorphism. For Pa and Pb, this set is C = {c1, c2} =
{{(1, 1), (2, 2)}, {(1, 1), (2, 3)}}.

We can informally define ∩(S(Pa), S(Pb))|c as the set of common subgraphs that meet
the constraint c. For example, in Figure 3.4, the first tDAG of the set ∩(S(Pa), S(Pb))|c1
belongs to the set, as its constraint c′ = {(1, 1)} is a subset of c1. Then, we can obtain
∩(S(Pa), S(Pb))|c1 = { ⟨(A 1 (B 1) (C 1)), (L 1 (M 1) (N 1))⟩,
                       ⟨(A 1 (B 1 (B 1) (B 2)) (C 1)), (L 1 (M 1) (N 1))⟩,
                       ⟨(A 1 (B 1 (B 1) (B 2)) (C 1)), (L 1 (M 1) (N 1 (N 2) (N 1)))⟩,
                       ⟨(A 1 (B 1) (C 1)), (L 1 (M 1) (N 1 (N 2) (N 1)))⟩ } =
= { (A 1 (B 1) (C 1)), (A 1 (B 1 (B 1) (B 2)) (C 1)) } ×
  { (L 1 (M 1) (N 1)), (L 1 (M 1) (N 1 (N 2) (N 1))) } =
= ∩(S(τa), S(τb))|c1 × ∩(S(γa), S(γb))|c1

∩(S(Pa), S(Pb))|c2 = { ⟨(A 1 (B 1) (C 1)), (L 1 (M 1) (N 1))⟩,
                       ⟨(A 1 (B 1) (C 1 (C 1) (C 2))), (L 1 (M 1) (N 1))⟩,
                       ⟨(A 1 (B 1) (C 1 (C 1) (C 2))), (L 1 (M 1 (M 2) (M 1)) (N 1))⟩,
                       ⟨(A 1 (B 1) (C 1)), (L 1 (M 1 (M 2) (M 1)) (N 1))⟩ } =
= { (A 1 (B 1) (C 1)), (A 1 (B 1) (C 1 (C 1) (C 2))) } ×
  { (L 1 (M 1) (N 1)), (L 1 (M 1 (M 2) (M 1)) (N 1)) } =
= ∩(S(τa), S(τb))|c2 × ∩(S(γa), S(γb))|c2

Figure 3.4: Intuitive idea for the kernel computation.
the kernel K(Pa, Pb) as:

K(Pa, Pb) = |∩(S(Pa), S(Pb))| = |∩(S(Pa), S(Pb))|c1 ∪ ∩(S(Pa), S(Pb))|c2|     (3.3)

Looking at Figure 3.4, we compute the value of the kernel for the two pairs as
K(Pa, Pb) = 7. To better compute the cardinality of the union of the sets, it is possible
to use the inclusion-exclusion principle. The value of the kernel for the example can
then be derived as:

K(Pa, Pb) = |∩(S(Pa), S(Pb))|c1 ∪ ∩(S(Pa), S(Pb))|c2| =
= |∩(S(Pa), S(Pb))|c1| + |∩(S(Pa), S(Pb))|c2| − |∩(S(Pa), S(Pb))|c1 ∩ ∩(S(Pa), S(Pb))|c2|     (3.4)

A nice property, which can be easily demonstrated, is that:

∩(S(Pa), S(Pb))|c1 ∩ ∩(S(Pa), S(Pb))|c2 = ∩(S(Pa), S(Pb))|c1∩c2     (3.5)

Expressing the kernel computation in this way is important, since the elements in
∩(S(Pa), S(Pb))|c already satisfy the property of constraint compatibility. We can now
exploit the partial isomorphism property to find the elements in ∩(S(Pa), S(Pb))|c.
Then, we can write the following equivalence:

∩(S(Pa), S(Pb))|c = ∩(S(τa), S(τb))|c × ∩(S(γa), S(γb))|c     (3.6)

Figure 3.4 reports this equivalence for the two sets derived using constraints c1 and c2.
Note that this equivalence is not valid if no constraint is applied, i.e. ∩(S(Pa), S(Pb)) ≠
∩(S(τa), S(τb)) × ∩(S(γa), S(γb)): the pair Pa itself does not belong to ∩(S(Pa), S(Pb)),
but it does belong to ∩(S(τa), S(τb)) × ∩(S(γa), S(γb)).
Chapter 3. Improving Expressive Power: Kernels on tDAGs
Equivalence 3.6 allows us to compute the cardinality of ∩(S(Pa), S(Pb))|c using the cardinalities of ∩(S(τa), S(τb))|c and ∩(S(γa), S(γb))|c. The latter sets contain only extended trees where the equivalences between unlabeled nodes are given by c. We can then compute the cardinalities of these two sets using methods developed for trees (e.g. the kernel function KS(θ1, θ2) proposed in Collins and Duffy (2002) and refined into KS(θ1, θ2, c) for extended trees in Moschitti and Zanzotto (2007); Zanzotto et al. (2009)). The cardinality of ∩(S(Pa), S(Pb))|c is then computed as:

| ∩(S(Pa), S(Pb))|c | = | ∩(S(τa), S(τb))|c | · | ∩(S(γa), S(γb))|c | = KS(τa, τb, c) KS(γa, γb, c)   (3.7)
3.3.3.2 Formalization
The intuitive explanation, along with the associated examples, suggests the following
steps for computing the desired kernel function:
• Given a set of alternative constraints C, we can divide the original intersection into a union of intersections over the projection of the original set on the constraints (Eq. 3.3). This is the application of the constraint compatibility.
• The cardinality of the union of intersections can be computed using the inclusion-exclusion principle (Eq. 3.4). Given the property in Eq. 3.5, we can transfer the intersections from the sets to the constraints.
• Applying the partial isomorphism detection, we can transfer the computation of the intersection from tDAGs to the extended trees (Eq. 3.6) and then apply efficient algorithms for computing the cardinality of these intersections between extended trees (Collins and Duffy, 2002; Moschitti and Zanzotto, 2007; Zanzotto et al., 2009).
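As a concrete illustration of the second step, the inclusion-exclusion principle can be sketched in a few lines of Python. This is purely illustrative: plain Python sets stand in for the sets of common subgraphs, and the function name is ours.

```python
from itertools import combinations

def union_cardinality_inclusion_exclusion(sets):
    """|A1 ∪ ... ∪ An| via inclusion-exclusion: sum over non-empty
    index subsets J of (-1)^(|J|-1) |A_J|, where A_J is the
    intersection of the A_i with i in J."""
    total = 0
    for r in range(1, len(sets) + 1):
        for J in combinations(sets, r):
            inter = set(J[0])
            for A in J[1:]:
                inter &= A
            total += (-1) ** (r - 1) * len(inter)
    return total

# Toy check against a directly computed union.
A1 = {"g1", "g2", "g3", "g4"}
A2 = {"g3", "g4", "g5", "g6", "g7"}
assert union_cardinality_inclusion_exclusion([A1, A2]) == len(A1 | A2) == 7
```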
In the rest of the chapter, we will again use the general formulation of function S(G), instead of the simpler S version.
To provide the theorem proving the validity of the algorithm, we need to introduce
some definitions. Firstly, we define the projection operator of an intersection of tDAGs
or extended trees given a constraint c.
Definition 3.3.5. Given two tDAGs G1 and G2, the set ∩(S(G1), S(G2))|c is the intersection of the related sets S(G1) and S(G2) projected on constraint c. A tDAG g′ = (τ′, γ′) ∈ S(G1) is in ∩(S(G1), S(G2))|c if ∃g″ = (τ″, γ″) ∈ S(G2) such that g′ is partially isomorphic to g″, and c′ = c(τ′, τ″) = c(γ′, γ″) is covered by and compatible with constraint c, i.e. c′ ⊆ c.
We can then generalize Property 3.5 as follows.
Lemma 3.3.1. Given two tDAGs G1 and G2, the following property holds:

⋂_{c∈C} ∩(S(G1), S(G2))|c = ∩(S(G1), S(G2))|_{⋂_{c∈C} c}

We omit the proof, which is straightforward.
Secondly, we can generalize Equivalence 3.6 in the following form.
Lemma 3.3.2. Let G1 = (τ1, γ1) and G2 = (τ2, γ2) be two tDAGs. Then:
∩(S(G1),S(G2))|c = ∩(S(τ1),S(τ2))|c × ∩(S(γ1),S(γ2))|c
Proof. First we show that if g = (τ, γ) ∈ ∩(S(G1), S(G2))|c, then τ ∈ ∩(S(τ1), S(τ2))|c and γ ∈ ∩(S(γ1), S(γ2))|c. To show that a tree τ belongs to ∩(S(τ1), S(τ2))|c, we have to show that ∃τ′ ∈ S(τ1), τ″ ∈ S(τ2) such that τ, τ′ and τ″ are isomorphic and c(τ′, τ″) ⊆ c. Since g = (τ, γ) ∈ ∩(S(G1), S(G2))|c, we have that ∃g′ = (τ′, γ′) ∈ S(G1), g″ = (τ″, γ″) ∈ S(G2) such that τ, τ′ and τ″ are isomorphic, γ, γ′ and γ″ are isomorphic, c(τ′, τ″) ⊆ c and c(γ′, γ″) ⊆ c. It follows by definition that τ ∈ ∩(S(τ1), S(τ2))|c and γ ∈ ∩(S(γ1), S(γ2))|c.
It is then trivial to show that if τ ∈ ∩(S(τ1), S(τ2))|c and γ ∈ ∩(S(γ1), S(γ2))|c, then g = (τ, γ) ∈ ∩(S(G1), S(G2))|c.
Given the nature of the constraint set C, we can compute the previous equation efficiently, as two different J1 and J2 in 2^{1,...,|C|} often generate the same c, i.e.:

c = ⋂_{i∈J1} ci = ⋂_{i∈J2} ci   (3.8)
Then, we can define the set C∗ of all intersections of constraints in C.
Definition 3.3.6. Given the set of alternative constraints C = {c1, ..., cn}, set C∗ is the set of all the possible intersections of elements of set C:

C∗ = {c(J) | J ∈ 2^{1,...,|C|}}   (3.9)

where c(J) = ⋂_{i∈J} ci.
The previous lemmas and definitions are used to formulate the main theorem that
can be used to build the algorithm for counting the subgraphs in common between two
tDAGs and, then, computing the related kernel function.
Theorem 3.3.2. Given two tDAGs G1 and G2, the kernel K(G1, G2) that counts the
common subgraphs of the set S(G1) ∩ S(G2) follows this equation:
K(G1, G2) = Σ_{c∈C∗} KS(τ1, τ2, c) KS(γ1, γ2, c) N(c)   (3.10)
where

N(c) = Σ_{J∈2^{1,...,|C|}, c=c(J)} (−1)^{|J|−1}   (3.11)

and

KS(θ1, θ2, c) = | ∩(S(θ1), S(θ2))|c |   (3.12)
Proof. Given Lemma 3.3.2, K(G1, G2) can be written as:
K(G1, G2) = | ⋃_{c∈C} ∩(S(τ1), S(τ2))|c × ∩(S(γ1), S(γ2))|c |   (3.13)
The cardinality of the set can be computed using the inclusion-exclusion property, i.e.:
|A1 ∪ · · · ∪ An| = Σ_{J∈2^{1,...,n}} (−1)^{|J|−1} |AJ|   (3.14)

where 2^{1,...,n} is the set of all the subsets of {1, . . . , n} and AJ = ⋂_{i∈J} Ai. Given
Eq. 3.13, 3.14, and 3.12, we can rewrite K(G1, G2) as:
K(G1, G2) = Σ_{J∈2^{1,...,|C|}} (−1)^{|J|−1} KS(τ1, τ2, c(J)) KS(γ1, γ2, c(J))   (3.15)
Finally, defining N(c) as in Eq. 3.11, Eq. 3.10 can be derived from Eq. 3.15.
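The regrouping that turns Eq. 3.15 into Eq. 3.10 can be checked on a toy instance. The sketch below uses random stub values in place of the products KS(τ1, τ2, c) KS(γ1, γ2, c), so it illustrates only the bookkeeping, not the actual tree kernels; the toy constraint set is hypothetical.

```python
from itertools import combinations
import random

random.seed(3)

# A toy set of alternative constraints C (placeholder pairs).
C = [frozenset({(1, 1), (2, 2)}),
     frozenset({(1, 1), (3, 3)}),
     frozenset({(2, 2), (3, 3)})]

KS_stub = {}
def ks_product(c):
    # Random stand-in for KS(τ1, τ2, c) · KS(γ1, γ2, c).
    return KS_stub.setdefault(c, random.randint(1, 5))

# Eq. 3.15: inclusion-exclusion sum over index subsets J.
lhs = sum((-1) ** (len(J) - 1) * ks_product(frozenset.intersection(*J))
          for r in range(1, len(C) + 1) for J in combinations(C, r))

# Eq. 3.10: the same sum, grouped by c = c(J) with multiplicity N(c) (Eq. 3.11).
N = {}
for r in range(1, len(C) + 1):
    for J in combinations(C, r):
        c = frozenset.intersection(*J)
        N[c] = N.get(c, 0) + (-1) ** (r - 1)
rhs = sum(n * ks_product(c) for c, n in N.items())

assert lhs == rhs
```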
3.3.4 Enabling the Efficient Kernel Function
The above idea for computing the kernel function is promising, but we need to make it viable by showing how to efficiently determine the three main parts of Eq. 3.10: 1) the set of alternative constraints C (Sec. 3.3.4.2); 2) the set C∗ of all the possible intersections of constraints in C (Sec. 3.3.4.3); and, finally, 3) the coefficients N(c) (Sec. 3.3.4.4). Before describing the above steps, we need to point out some properties of constraints and introduce a new operator.
3.3.4.1 Unification of Constraints
In the previous sections we manipulated constraints as sets but, since they represent restrictions on bijective functions, they must be treated carefully. In particular, the union of two constraints may generate a semantically meaningless result. For example, the union of c1 = {(1, 1), (2, 2)} and c2 = {(1, 2), (2, 1)} would produce the set c = c1 ∪ c2 = {(1, 1), (2, 2), (1, 2), (2, 1)}, but c is clearly contradictory and not a valid constraint. Thus we introduce a more useful partial operator.
Definition 3.3.7. Unification (⊔): Given two constraints c1 = {(p′1, p″1), . . . , (p′n, p″n)} and c2 = {(q′1, q″1), . . . , (q′m, q″m)}, their unification is c1 ⊔ c2 = c1 ∪ c2 if ∄(p′, p″) ∈ c1, (q′, q″) ∈ c2 such that p′ = q′ and p″ ≠ q″, or vice versa; otherwise it is undefined and we write c1 ⊔ c2 = ⊥.
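A minimal sketch of the unification operator, with constraints represented as Python sets of placeholder pairs (the function name is ours, and None stands in for the undefined value ⊥):

```python
def unify(c1, c2):
    """Unification ⊔ of Definition 3.3.7: the union of the two
    constraints, unless some placeholder is mapped inconsistently,
    in which case the result is undefined (None stands for ⊥)."""
    for (p1, p2) in c1:
        for (q1, q2) in c2:
            if (p1 == q1) != (p2 == q2):   # p' = q' xor p'' = q'' -> clash
                return None
    return frozenset(c1 | c2)

c1 = frozenset({(1, 1), (2, 2)})
c2 = frozenset({(1, 2), (2, 1)})
assert unify(c1, c2) is None                        # contradictory, as in the text
assert unify(c1, frozenset({(3, 3)})) == {(1, 1), (2, 2), (3, 3)}
```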
3.3.4.2 Determining the Set of Alternative Constraints
The first step of Eq. 3.10 is to determine the set of alternative constraints C. We can exploit the possibility of dividing tDAGs into two trees. We build C starting from the sets Cτ and Cγ, which are, respectively, the constraints obtained from pairs of isomorphic extended trees t1 ∈ S(τ1) and t2 ∈ S(τ2), and the constraints obtained from pairs of isomorphic extended trees t1 ∈ S(γ1) and t2 ∈ S(γ2). The idea for an efficient algorithm is that we can compute set C without explicitly looking at all the involved subgraphs. We instead use and combine the constraints derived from the comparison between the production rules of the extended trees. We can then compute Cτ from the productions of τ1 and τ2, and Cγ from the productions of γ1 and γ2. For example (see Fig. 3.2), focusing on τ, the rules NP 3 → NN 2 NNS 3 of τ1 and NP 4 → NN 5 NNS 4 of τ2 generate the constraint c = {(3, 4), (2, 5)}.
Algorithm Procedure getLC(n′, n″)
  LC ← ∅
  c ← constraint according to which the productions in n′ and n″ are equivalent
  IF no such constraint exists RETURN ∅
  ELSE add c to LC
  FORALL pairs of children ch′i, ch″i of n′, n″
    LCi ← getLC(ch′i, ch″i)
    FORALL c′ ∈ LCi
      IF c ⊔ c′ ≠ ⊥ add c ⊔ c′ to LC
  FORALL ci, cj ∈ LC such that i ≠ j
    IF ci ⊔ cj ≠ ⊥ add ci ⊔ cj to LC
  RETURN LC
Figure 3.5: Algorithm for computing LC for a pair of nodes.
To express the above idea formally, for each pair of nodes n1 ∈ τ1, n2 ∈ τ2 (the same holds when considering γ1 and γ2), we need to determine a set of constraints LC = {ci | ∃t1, t2, subtrees rooted in n1 and n2 respectively, such that t1 and t2 are isomorphic according to ci}. This can be done by applying the procedure described in Figure 3.5 to all pairs of nodes.
Although the procedure shows a recursive structure, adopting a dynamic programming technique, i.e. storing the results of the procedure in a persistent table, allows the number of executions to be limited to the number of node pairs, |Nτ1| × |Nτ2|.
Once we have obtained the sets of local alternative constraints LCij for each node pair, we can simply merge the sets to produce the final set:

Cτ = ⋃_{1≤i≤|Nτ1|, 1≤j≤|Nτ2|} LCij
The same procedure is applied to produce Cγ .
The alternative constraint set C is then obtained as {c′ ⊔ c″ | c′ ∈ Cτ, c″ ∈ Cγ, c′ ⊔ c″ ≠ ⊥}, so that each constraint in C contains at least one of the constraints in Cτ and one of the constraints in Cγ. In the last step, we reduce the size of the final set: we remove from C all constraints c such that ∃c′ ∈ C with c′ ⊃ c, since their presence is made redundant by the use of the inclusion-exclusion property.
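The construction of C from Cτ and Cγ can be sketched as follows. This is an illustrative reading of the step above, with hypothetical toy data: `unify` implements ⊔ of Definition 3.3.7, and the final comprehension removes subsumed constraints.

```python
def unify(c1, c2):
    # Unification ⊔ of Definition 3.3.7; None stands for ⊥.
    for (p1, p2) in c1:
        for (q1, q2) in c2:
            if (p1 == q1) != (p2 == q2):
                return None
    return frozenset(c1 | c2)

def build_alternative_constraints(C_tau, C_gamma):
    """C = {c' ⊔ c'' | c' ∈ Cτ, c'' ∈ Cγ, c' ⊔ c'' ≠ ⊥}, with
    constraints strictly included in another one removed."""
    C = {unify(ct, cg) for ct in C_tau for cg in C_gamma}
    C.discard(None)
    return {c for c in C if not any(c < d for d in C)}

C_tau = {frozenset({(1, 1)}), frozenset({(2, 2)})}
C_gamma = {frozenset({(1, 1), (2, 2)})}
assert build_alternative_constraints(C_tau, C_gamma) == {frozenset({(1, 1), (2, 2)})}
```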
Lemma 3.3.3. The alternative constraint set C obtained by the above procedures satisfies the following two properties:
1. for each sub-tDAG isomorphic according to a constraint c, ∃c′ ∈ C such that c ⊆ c′;
2. ∄c′, c″ ∈ C such that c′ ⊂ c″ and c′ ≠ ∅.
Proof. Property 2 is trivially ensured by the last described step. As for property 1, let G = (t, g) be the sub-tDAG isomorphic according to constraint c; then ∃t1 ∈ S(τ1), t2 ∈ S(τ2), g1 ∈ S(γ1), g2 ∈ S(γ2) such that t, t1, t2 are isomorphic, g, g1, g2 are isomorphic, ct = c(t1, t2) ⊆ c, cg = c(g1, g2) ⊆ c, and ct ⊔ cg = c. By definition of LC, we have that ct ∈ LCij for some ni ∈ τ1, nj ∈ τ2 and cg ∈ LCkl for some nk ∈ γ1, nl ∈ γ2. Thus ct ∈ Cτ and cg ∈ Cγ, and then ∃c′ ∈ C | c′ ⊇ ct ⊔ cg = c.
3.3.4.3 Determining the Set C∗

The set C∗ is defined as the set of all possible intersections of alternative constraints in C. Figure 3.6 presents the algorithm that determines C∗. Due to Property 3.5 discussed in Section 3.3.3, we can empirically show that, although the worst-case complexity of the algorithm is exponential, the average complexity is no higher than O(|C|²).

Algorithm Build the set C∗ from the set C
  C+ ← C ; C1 ← C ; C2 ← ∅
  WHILE |C1| > 1
    FORALL c′ ∈ C1
      FORALL c″ ∈ C1 such that c′ ≠ c″
        c ← c′ ∩ c″
        IF c ∉ C+ add c to C2
    C+ ← C+ ∪ C2 ; C1 ← C2 ; C2 ← ∅
  C∗ ← C ∪ C+ ∪ {∅}

Figure 3.6: Algorithm for computing C∗.
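A direct transcription of the algorithm of Figure 3.6 in Python (illustrative; constraints are represented as frozensets of placeholder pairs, and the toy input is ours):

```python
def constraint_closure(C):
    """C* (Definition 3.3.6) computed as in Figure 3.6: iteratively
    close the set of alternative constraints under pairwise
    intersection, then add the empty constraint."""
    C = {frozenset(c) for c in C}
    C_plus, C1 = set(C), set(C)
    while len(C1) > 1:
        C2 = set()
        for a in C1:
            for b in C1:
                if a != b:
                    c = a & b
                    if c not in C_plus:
                        C2.add(c)
        C_plus |= C2
        C1 = C2
    return C | C_plus | {frozenset()}

C = [{(1, 1), (2, 2)}, {(1, 1), (3, 3)}, {(2, 2), (3, 3)}]
Cstar = constraint_closure(C)
assert frozenset({(1, 1)}) in Cstar and frozenset() in Cstar
```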
3.3.4.4 Determining Coefficients N(c)
Coefficient N(c) (Eq. 3.11) represents the number of times constraint c is considered in the sum of Eq. 3.10, taking into account the sign of the corresponding addend. To determine its value, we exploit the following property.

Lemma 3.3.4. For coefficient N(c), the following recursive equation holds:

N(c) = 1 − Σ_{c′∈C∗, c′⊃c} N(c′)   (3.16)
Proof. Let us call Nn(c) the cardinality of the set {J ∈ 2^{1,...,|C|} : c(J) = c, |J| = n}. We can rewrite Eq. 3.11 as:

N(c) = Σ_{n=1}^{|C|} (−1)^{n−1} Nn(c)   (3.17)
61
Chapter 3. Improving Expressive Power: Kernels on tDAGs
We note that the following properties hold:

Nn(c) = |{J ∈ 2^{1,...,|C|} : c(J) = c, |J| = n}| =
= |{J ∈ 2^{1,...,|C|} : c(J) ⊇ c, |J| = n}| − |{J ∈ 2^{1,...,|C|} : c(J) ⊃ c, |J| = n}|

Now let xc be the number of alternative constraints which include the constraint c, i.e. xc = |{c′ ∈ C : c′ ⊇ c}|. Then, by combinatorial properties and by the definition of Nn(c), the previous equation becomes:

Nn(c) = C(xc, n) − Σ_{c′∈C∗, c′⊃c} Nn(c′)   (3.18)

where C(xc, n) denotes the binomial coefficient. From Eq. 3.17 and 3.18, it follows that N(c) can be written as:

N(c) = Σ_{n=1}^{|C|} (−1)^{n−1} ( C(xc, n) − Σ_{c′∈C∗, c′⊃c} Nn(c′) ) =
= Σ_{n=1}^{|C|} (−1)^{n−1} C(xc, n) − Σ_{n=1}^{|C|} (−1)^{n−1} Σ_{c′∈C∗, c′⊃c} Nn(c′) =
= Σ_{n=0}^{xc} (−1)^{n−1} C(xc, n) + C(xc, 0) − Σ_{c′∈C∗, c′⊃c} Σ_{n=1}^{|C|} (−1)^{n−1} Nn(c′)

We now observe that, exploiting the binomial theorem, we can write:

Σ_{K=0}^{N} (−1)^K C(N, K) = Σ_{K=0}^{N} 1^{N−K} (−1)^K C(N, K) = (1 − 1)^N = 0

thus

Σ_{n=0}^{xc} (−1)^{n−1} C(xc, n) = −Σ_{n=0}^{xc} (−1)^n C(xc, n) = 0
Finally, since C(xc, 0) = 1, and according to the definition of N(c) in Eq. 3.17, we can derive the property in Eq. 3.16, i.e.:

N(c) = 1 − Σ_{c′∈C∗, c′⊃c} N(c′)
This recursive formulation of the equation allows us to easily determine the value
of N(c) for every c belonging to C∗.
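The recursion of Eq. 3.16 can be checked against the direct definition of Eq. 3.11 on a toy constraint set. The sketch below is illustrative only: `C` and `Cstar` are hypothetical toy data, with C∗ being the closure of C under intersection.

```python
from itertools import combinations

def N_direct(c, C):
    """N(c) from Eq. 3.11: signed count of the index subsets J
    whose intersection c(J) equals c."""
    total = 0
    for r in range(1, len(C) + 1):
        for J in combinations(C, r):
            if frozenset.intersection(*J) == c:
                total += (-1) ** (r - 1)
    return total

def N_recursive(c, Cstar, C):
    """N(c) from Eq. 3.16: 1 minus the sum of N(c') over the
    strict supersets c' of c in C*."""
    return 1 - sum(N_recursive(cp, Cstar, C) for cp in Cstar if cp > c)

# Hypothetical toy data; C* is the closure of C under ∩ (plus ∅).
C = [frozenset({(1, 1), (2, 2)}), frozenset({(1, 1), (3, 3)})]
Cstar = {C[0], C[1], frozenset({(1, 1)}), frozenset()}
for c in Cstar:
    assert N_direct(c, C) == N_recursive(c, Cstar, C)
```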
3.4 Worst-case Complexity and Average Computation Time Analysis
We can now analyze both the worst-case complexity and the average computation time of the algorithm we proposed with Theorem 3.3.2. The computation of Eq. 3.10 strongly depends on the cardinality of C and the related cardinality of C∗. The worst-case complexity is O(|C∗| · n² · |C|), where n is the cardinality of the node sets of the extended trees. The worst-case computational complexity is thus still exponential with respect to the size of the sets of anchors of the two tDAGs, A1 and A2. In the worst case, C is equal to F(A1, A2), i.e. the set of the possible correspondences between the nodes in A1 and A2, which grows combinatorially. The worst-case complexity is then O(2^|A| n²).

Yet, there are some hints suggesting that the average-case complexity (Wang, 1997) and the average computation time can be promising. Set C is generally very small with respect to the worst case: it happens that |C| ≪ |F(A1, A2)|, where |F(A1, A2)| is the worst case. For example, in the case of P1 and P2, the cardinality
[Plots omitted.]
(a) Mean execution time in milliseconds (ms) of the two algorithms wrt. n × m, where n and m are the numbers of placeholders of the two tDAGs.
(b) Total execution time in seconds (s) of the training phase on RTE2 wrt. different numbers of allowed placeholders.

Figure 3.7: Comparison of the execution times.
of C = {{(1, 1)}, {(1, 3), (3, 4), (2, 5)}} is far smaller than that of F(A1, A2) = {{(1, 1), (2, 2), (3, 3)}, {(1, 2), (2, 1), (3, 3)}, {(1, 2), (2, 3), (3, 1)}, ..., {(1, 3), (2, 4), (3, 5)}}. Moreover, set C∗ is far smaller than 2^{1,...,|C|} due to Property 3.5.
We estimated the behavior of the algorithms on a large distribution of cases. We compared the computation times of our algorithm with the worst case, i.e. C = F(A1, A2). We refer to our algorithm as K and to the worst case as Kworst. We implemented both algorithms K(G1, G2) and Kworst(G1, G2) in SVMs and we experimented with both implementations on the same machine.

For the first set of experiments, the source of examples is the second Recognizing Textual Entailment challenge, RTE2 (Bar-Haim et al., 2006). The dataset of the challenge has 1,600 sentence pairs. To derive tDAGs for sentence pairs, we used the
following resources:
• The Charniak parser (Charniak, 2000) and the morpha lemmatiser (Minnen
et al., 2001) to carry out the syntactic and morphological analysis. These have
been used to build the initial syntactic trees.
• The wn::similarity package (Pedersen et al., 2004) to compute the Jiang & Conrath (J&C) distance (Jiang and Conrath, 1997), as in Corley and Mihalcea (2005), for finding relations between similar words, in order to find co-indexes between trees H and T.
The computational cost of both K(G1, G2) and Kworst(G1, G2) depends on the numbers of placeholders n = |A1| of G1 and m = |A2| of G2. In the first experiment we therefore focused on determining the relation between the computation time and the factor n × m. The results are reported in Figure 3.7(a), where the computation times are plotted with respect to n × m. Each point in the curve represents the average execution time for the pairs of instances having n × m placeholders. As expected, the computation of function K is more efficient than that of Kworst. The difference between the two execution times increases with n × m.
We then performed a second experiment to determine the relation between the total execution time and the maximum number of placeholders in the examples. This is useful to estimate the behavior of the algorithm with respect to its application in learning models. Using the RTE2 data, we artificially built different versions of the dataset with an increasing number of allowed placeholders, i.e. with at most one, two, three, and so on, placeholders in each pair. In other words, the number of pairs is the same, whereas the maximal number of placeholders changes. The results are reported in Figure 3.7(b),
where the execution time of the training phase (in seconds) is plotted for each different set. We see that the computation of Kworst appears exponential with respect to the number of placeholders, and it becomes intractable beyond 7 placeholders. The plot associated with the computation of K is instead much flatter. This can be explained by the fact that the computation of K depends on the real alternative constraints that appear in the dataset. Therefore, the computation time of K is far shorter than that of Kworst.
3.5 Performance Evaluation
To better show the benefit of our approach in terms of efficiency and effectiveness, we compared it to the algorithm presented in Moschitti and Zanzotto (2007). We will hereafter call that algorithm Kmax; it induces an approximation of FOR, and it is not difficult to demonstrate that Kmax(G1, G2) ≤ K(G1, G2). The Kmax approximation is based on a maximization over the set of possible correspondences of the placeholders, i.e.:

Kmax(G1, G2) = max_{c∈F(A1,A2)} KS(τ1, τ2, c) KS(γ1, γ2, c)   (3.19)
where F(A1, A2) is the set of all the possible correspondences between the nodes A1 and A2 of the two tDAGs, as presented in Section 3.3.3. This formulation has the same worst-case computational complexity as our method (Kmax behaves exactly as Kworst).
Moschitti and Zanzotto (2007) showed that Kmax is very accurate for RTE (Bar-Haim et al., 2006) but, since K computes a slightly different similarity function, we need to show that its accuracy is comparable with that of Kmax. Thus, we performed an experiment using all the data derived from RTE1, RTE2, and RTE3 for training (i.e. 4567 training examples) and the RTE4 data for testing (i.e. 1000 testing examples).
Kernel   Accuracy   Used training examples   Support Vectors
Kmax     59.32      4223                     4206
K        60.04      4567                     4544
Table 3.1: Comparative performances of Kmax and K.
The results are reported in Table 3.1. The table shows that the accuracy of K is higher than the accuracy of Kmax. Our explanation for this result is that (a) Kmax is an approximation of K and (b) K can use sentence pairs with more than 7 placeholders, i.e. the complete training set, as shown in the third column of the table.
4 Improving Computational Complexity: Distributed Tree Kernels
Reducing the computational complexity of tree kernels has been a long-standing research interest. Most of the tree kernels proposed in the literature have a worst-case computation time that is quadratic in the size of the involved trees. This has hindered the application of tree kernel methods to corpora and trees of large size.
In this chapter, we propose the distributed tree kernels framework as a novel opportunity to use tree structured data in learning classification functions, in learning regressors, or in clustering algorithms. The key idea is to transform trees into explicit low dimensional vectors by embedding the huge feature spaces of tree fragments. Vectors in these low dimensional spaces can then be directly used in learning algorithms. This linearization dramatically reduces the complexity of tree kernel computation. In the initial formulation (Collins and Duffy, 2002) and in more recent studies (Moschitti, 2006b; Rieck et al., 2010; Pighin and Moschitti, 2010; Shin et al., 2011), the complexity depends on the size of the trees involved in the computation. With the distributed tree kernels, the computation is reduced to a constant complexity, depending on the size chosen for the low dimensional space.
Linearized trees also have other advantages. The approach makes it possible to use linear support vector machines on tree structured input data. This is not possible with the traditional tree kernels (e.g. Collins and Duffy, 2002), even if classes are linearly separable in the explicit feature spaces of the tree fragments. Linearized trees allow both kernel-based and non-kernel-based machine learning algorithms to exploit tree structured data. For example, probabilistic classifiers, such as naive Bayes or maximum entropy classifiers, as well as decision tree learners (Quinlan, 1993), can use tree structured data.
The distributed tree kernel framework can potentially be applied to many kernels defined over trees. In this work, we show how the idea can be applied to the tree kernel by Collins and Duffy (2002), to subpath tree kernels (Kimura et al., 2011), and to route tree kernels (Aiolli et al., 2009). Existing tree kernels are transformed into distributed counterparts: distributed tree kernels (DTK), distributed subpath tree kernels (DSTK), and distributed route tree kernels (DRTK). Moreover, the proposed approach suggests that the framework could be extended to more complex structures, such as graphs, or at least some specific families of graphs.
The rest of the chapter is organized as follows. Section 4.1 introduces the idea
and the related challenges. Section 4.2 analyzes the theoretical limits of embedding
a large vector space into a smaller space. Section 4.3 describes the compositional
approach to derive vectors for trees by combining vectors for nodes, and the ideal vector
composition function, with its expected properties and some approximate realizations.
Section 4.4 introduces the novel class of tree kernels by analyzing three instances: the
distributed tree kernel, the distributed subpath tree kernel and the distributed route tree
kernel. Section 4.5 formally and empirically investigates the complexity of the derived
distributed tree kernels and how well they approximate the corresponding original tree
kernels.
4.1 Preliminaries
To explain the results of this chapter, this section introduces the idea of linearizing tree structures, clarifies the notation used and, finally, poses the challenges that need to be solved to demonstrate the theoretical soundness and the feasibility of the proposed approach.
4.1.1 Idea
Tree kernels (for example, Collins and Duffy, 2002; Aiolli et al., 2009; Kimura et al., 2011) are defined to use tree structured data in kernel machines. They directly compute the similarity between trees T ∈ T by counting their common tree fragments τ.¹ These kernels are valid, as the underlying feature spaces are clearly defined. Different tree kernels rely on different feature spaces, i.e. different classes of tree fragments, and these kernels can ultimately be seen as dot products over the underlying feature spaces. Figure 4.1.(a) shows an example to establish the notation, with tree T on the left and two of its possible tree fragments τi and τj on the right. The two tree fragments are two dimensions of the underlying vector space. In this space, a tree T is then represented as a vector ~T = I(T) ∈ Rm, where each dimension ~τi corresponds to a tree fragment τi (see Figure 4.1.(b)). Function I(·) is the mapping function between spaces T and Rm. The trivial weighting scheme assigns ωi = 1 to dimension ~τi if tree fragment τi is present in the original tree T and ωi = 0 otherwise. Different weighting schemes are possible and are used. The count of common tree fragments performed by a tree kernel TK(T1, T2) is, by construction, the dot product of the two vectors ~T1 · ~T2 representing

¹ For the sake of simplicity, in this section we will consider the subtrees of a tree to be its tree fragments, unless otherwise specified. This will allow the space of tree fragments to coincide with the space of trees T.
[Figure content omitted: panel (a) shows tree T and two of its tree fragments τi and τj; panel (b) their 0/1 vectors in Rm; panel (c) their distributed vectors in Rd.]

Figure 4.1: Map of the used spaces and functions: the tree fragments T, the full tree fragment feature space Rm and the reduced space Rd; the function I(·) that maps trees T into vectors of Rm; the function I : T → E, where E is the standard orthonormal basis of Rm; the space reduction function f : Rm → Rd; and the direct function f̃ : T → Rd. Examples are given for trees, tree fragments, and vectors in the two different spaces.
the trees in the feature space Rm of tree fragments, i.e.:

TK(T1, T2) = ~T1 · ~T2

As these tree fragment feature spaces Rm are huge, kernel functions K(T1, T2) are used to implicitly compute the similarity ~T1 · ~T2 without explicitly representing the vectors ~T1 and ~T2. But these kernel functions are generally computationally expensive.
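With the trivial 0/1 weighting, the dot product over the explicit space is just the count of shared fragments, as the following toy sketch shows (fragments are plain strings here, purely for illustration; the function name is ours):

```python
from collections import Counter

def tree_kernel_explicit(frags1, frags2):
    """TK(T1, T2) = ~T1 · ~T2 in the explicit fragment space: with the
    trivial 0/1 weighting scheme, the dot product is the number of
    shared tree fragments."""
    w1, w2 = Counter(set(frags1)), Counter(set(frags2))
    return sum(w1[t] * w2[t] for t in w1)

T1 = ["(S (NP we) (VP looked))", "(NP we)", "(VP looked)"]
T2 = ["(S (NP we) (VP ran))", "(NP we)", "(VP ran)"]
assert tree_kernel_explicit(T1, T2) == 1   # only "(NP we)" is shared
```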
Our aim is to map vectors ~T in the explicit space Rm into a lower dimensional space Rd, with d ≪ m (see Fig. 4.1.(c)), to allow for an approximate but faster and explicit computation of the kernel functions. The idea is that, in lower dimensional spaces, the kernel computation, being a simple dot product, is extremely efficient. The direct mapping f : Rm → Rd is, in principle, possible with techniques like singular value decomposition or random indexing (see Sahlgren, 2005), but it is impractical due to the huge dimension of Rm.
To map vectors ~T in the explicit space Rm into a lower dimensional space Rd, we then need a function f̃ that directly maps trees T into vectors ⇀T, much smaller than the implicit vectors ~T used by classic tree kernels. This function acts from the set of trees to the space Rd, i.e. f̃ : T → Rd (see Fig. 4.1). For an assonance with Distributed Representations (Plate, 1994), we call ⇀τi a Distributed Tree Fragment (DTF), whereas ⇀T is a Distributed Tree (DT). We then define the Distributed Tree Kernel (DTK) between two trees as the dot product between the two Distributed Trees, i.e.:

DTK(T1, T2) ≜ ⇀T1 · ⇀T2 = f̃(T1) · f̃(T2)
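A minimal numerical sketch of this definition, in which the direct tree-to-vector function is approximated by hashing each fragment to a pseudo-random unit vector. This hashing scheme is our illustrative choice, not the construction developed later in the chapter, and the fragment strings are hypothetical.

```python
import math, random, zlib

def fragment_vector(fragment, d=2048):
    """Pseudo-random nearly-unit vector for a tree fragment, seeded by a
    stable hash of the fragment: a crude stand-in for the direct
    tree-to-vector function."""
    rng = random.Random(zlib.crc32(fragment.encode()))
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distributed_tree(fragments, d=2048):
    # Distributed tree: the sum of its fragment vectors (trivial weights 1).
    T = [0.0] * d
    for frag in fragments:
        for i, x in enumerate(fragment_vector(frag, d)):
            T[i] += x
    return T

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

T1 = ["(S (NP we) (VP looked))", "(NP we)", "(VP looked)"]
T2 = ["(S (NP we) (VP ran))", "(NP we)", "(VP ran)"]
# The DTK approximates the number of shared fragments (here 1).
assert abs(dot(distributed_tree(T1), distributed_tree(T2)) - 1.0) < 0.3
```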
4.1.2 Description of the Challenges
Function f̃ : T → Rd linearizes trees into low dimensional vectors and thus has a crucial role in the overall picture. We then need to clearly define which properties should be satisfied by this function.

To derive the properties of function f̃, we need to examine the relation between the traditional tree kernel mapping I : T → Rm, which maps trees into the tree fragment feature space; the mapping I : T → E, which maps tree fragments into the standard orthonormal basis E of Rm; the additional function f : Rm → Rd, which maps ~T into a smaller vector ⇀T = f(~T); and our newly defined function f̃.
The first function to be examined is function f. It works as a sort of approximate transformation of the basis of Rm into Rd, by embedding the first space into the second. This is the function that could be obtained using techniques like singular value decomposition or random indexing on the original feature space of tree fragments. But, as already observed, this is impractical due to the huge dimension of the tree fragment feature space Rm. Yet, introducing this function is useful in order to formally justify the objective of building the function f̃ (see again Figure 4.1 as a reference).
We can start by observing that each vector ~T ∈ Rm can be trivially represented as:

~T = Σi ωi ~τi

where each ~τi represents the unit vector corresponding to tree fragment τi, i.e. ~τi = I(τi), where I(·) is the mapping function from tree fragments to vectors of the standard basis. In other words, the set {~τ1, . . . , ~τm} corresponds to the standard basis E = {~e1, . . . , ~em} of Rm, whose vectors ~ei have elements eii = 1 and eij = 0 for i ≠ j.
Then, the dot product between two vectors ~T1 and ~T2 is:

~T1 · ~T2 = Σi,j ωi(1) ωj(2) (~τi · ~τj) = Σi ωi(1) ωi(2)   (4.1)

where ωi(k) is the weight of the i-th dimension of vector ~Tk. This interpretation is trivial, but it is useful for better explaining the other functions.
The approximate vector ⇀T ∈ Rd can be rewritten as:

⇀T = f(~T) = f(Σi ωi ~τi) = Σi ωi f(~τi) = Σi ωi ⇀τi
where each ⇀τi represents the tree fragment τi in the new space. Function f then maps vectors ~τ of the standard basis E into corresponding vectors ⇀τ ∈ Rd. To preserve, to some extent, the properties of vectors ~τ, the set of vectors Ē = {⇀τ1, . . . , ⇀τm} should be a sort of approximate basis for Rd. The final aim is to approximate the dot product of two vectors ~T1 and ~T2 with the dot product of the two approximate vectors ⇀T1 and ⇀T2:

⇀T1 · ⇀T2 = Σi,j ωi(1) ωj(2) (⇀τi · ⇀τj) ≈ Σi ωi(1) ωi(2) = ~T1 · ~T2   (4.2)
Since Ē should be an approximate basis and the space transformation should preserve the dot product, the following two properties are required to hold for the vectors in Ē:

Property 1. (Nearly Unit Vectors) A distributed tree fragment ⇀τ representing a tree fragment τ is a nearly unit vector: 1 − ε < ||⇀τ|| < 1 + ε.

Property 2. (Nearly Orthogonal Vectors) Given two different tree fragments τ1 and τ2, their distributed vectors are nearly orthogonal: if τ1 ≠ τ2, then |⇀τ1 · ⇀τ2| < ε.
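Both properties are exhibited, with high probability, by random Gaussian vectors with entries of variance 1/d, as a quick numerical check suggests (illustrative only; the values of d, ε and m are arbitrary choices of ours):

```python
import math, random

random.seed(0)
d, eps, m = 4096, 0.1, 30

# m random Gaussian vectors in R^d with entries of variance 1/d.
E_bar = [[random.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
         for _ in range(m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Property 1: nearly unit vectors.
assert all(1 - eps < math.sqrt(dot(t, t)) < 1 + eps for t in E_bar)
# Property 2: pairwise nearly orthogonal vectors.
assert all(abs(dot(E_bar[i], E_bar[j])) < eps
           for i in range(m) for j in range(i + 1, m))
```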
We discussed function f and its expected properties, but a direct realization of this function is impractical, though possible. We can now introduce the role of function f̃. As vectors ⇀τ ∈ Ē represent tree fragments τ, the idea is that ⇀τ can be obtained directly from tree fragments τ by means of a function f̃(τ) = f(I(τ)) that composes f and I. Using this function to produce distributed tree fragments ⇀τ, distributed trees ⇀T can be obtained as follows:

⇀T = f̃(T) = Σi ωi f̃(τi) = Σi ωi ⇀τi   (4.3)
To provide a concrete framework for the implementation of the proposed approach,
the following issues must be tackled:
• First, we need to show that a function f : Rm → Rd exists. This function should have the property of keeping the vectors ⇀τi nearly orthogonal: for each pair ~τi, ~τj ∈ E, with i ≠ j, |f(~τi) · f(~τj)| < ε with a high probability, where 0 < ε < 1. Using the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984), it is possible to show that such a function exists. The relation between ε, d, and m can also be found. This is useful to estimate how many nearly orthogonal distributed tree fragments can be encoded in space Rd, given the expected error ε.
• Second, we want to define a function f̃ that directly computes ⇀τi using the structure of tree fragment τi. Function f̃ ideally merges the two mapping functions f and I, as ⇀τi = f(~τi) = f(I(τi)) = f̃(τi). Function f̃(τi) is based on the use of a set of nearly orthogonal vectors for the nodes of the tree fragments and an ideal vector composition function. We need to show that, given specific properties of the composition function and given two trees τi and τj, f̃(τi) and f̃(τj) are statistically nearly orthogonal, i.e. |f̃(τi) · f̃(τj)| < ε with a high probability.
• Finally, we need to show that the vectors ⇀T can be efficiently computed, using dynamic programming, for the spaces of tree fragments underlying the selected tree kernels: Collins and Duffy (2002)'s tree kernel, the subpath tree kernel (Kimura et al., 2011), and the route tree kernel (Aiolli et al., 2009).
Once these three issues are solved, we need to demonstrate that the related dis-
tributed versions of the kernels approximate the kernels in the original space. Eq. 4.2
should hold, with a high probability, for the distributed Collins and Duffy (2002)’s tree
kernel (DTK), the distributed subpath tree kernel (DSTK), and the distributed route tree
kernel (DRTK). We also need to demonstrate that their computation is more efficient.
This last issue will be discussed in the experimental section, whereas the above points
are addressed in the following sections.
4.2 Theoretical Limits for Distributed Representations
Understanding the theoretical limits of the proposed approach is extremely relevant,
as it is useful to know to which extent low dimensional spaces can embed high di-
mensional ones. Hecht-Nielsen (1994) introduced the conjecture that there is a strict
relation between the dimensions of the space and the number of nearly-orthonormal
vectors that it can host. This is exactly what is needed to demonstrate the existence of
function f and to describe the limits of the approach. But the proof of the conjecture
should have appeared in a later paper that, to the best of our knowledge, has never been
published. A previous result (Kainen and Kurkova, 1993) provides some theoretical
lower-bounds on the number of nearly-orthonormal vectors, but these lower-bounds
are not satisfactory, as large vector spaces are still needed to ensure that the set of
nearly-orthonormal vectors required by the distributed tree kernels is covered.
Section 4.2.1 reports a corollary based on the Johnson and Lindenstrauss (1984) lemma, which is useful to determine the theoretical limits of the approach based on distributed representations. Then, Section 4.2.2 empirically investigates the properties of
the possible nearly-orthonormal basis E.
4.2.1 Existence and Properties of Function f
In this section, we want to show that a vector transformation function f : Rm → Rd exists, satisfying Property 1 and Property 2. This function guarantees that the vectors ⇀τi in
the transformed basis are still nearly-orthogonal. It is also useful to know the relation
between the approximation ε, the dimension d of the low dimensional target space, and
the dimension m of the original high dimensional space of tree fragments.
The starting point for showing these results, which will be formalized in Lemma 4.2.1, is the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984). The theorem is reported here as stated in Dasgupta and Gupta (1999):
Theorem 4.2.1. (Johnson-Lindenstrauss Lemma) For any 0 < ε < 1 and any integer m, let d be a positive integer such that

d ≥ 4(ε²/2 − ε³/3)⁻¹ ln m.

Then, for any set V of m points in Rk, there is a map f : Rk → Rd such that, for all ~u, ~v ∈ V,

(1 − ε)||~u − ~v||² ≤ ||f(~u) − f(~v)||² ≤ (1 + ε)||~u − ~v||².
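As a concrete sketch (the function name is our own), the bound can be evaluated directly; note how the required dimension d grows only logarithmically with the number of points m:

```python
import math

def jl_min_dim(m: int, eps: float) -> int:
    """Smallest integer d with d >= 4 * (eps^2/2 - eps^3/3)^(-1) * ln(m)."""
    c = eps ** 2 / 2 - eps ** 3 / 3
    return math.ceil(4 * math.log(m) / c)

d1 = jl_min_dim(10_000, 0.1)      # 10^4 points, distortion eps = 0.1
d2 = jl_min_dim(1_000_000, 0.1)   # 100x more points: only modestly larger d
print(d1, d2)
```

Since the bound scales essentially as 1/ε², halving ε roughly quadruples the required d.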
This theorem can be used to show that it is possible to project the orthonormal
basis E of space Rm into the reduced space Rd, satisfying Properties 1 and 2 with
high probability. To do this, we start by observing that the mapped vectors f(~τ) can be assumed to be approximately unitary, i.e. ||f(~τ)||² ≈ 1, with high probability. This is because Dasgupta and Gupta (1999) showed that the mapping can take the form f(~τ) = √(m/d) ~τ′, where ~τ′ is the projection of ~τ onto a random d-dimensional subspace of Rm. Moreover, they showed that the squared norm of ~τ′, besides having expected value μ = d/m, is fairly tightly concentrated around μ. Specifically:

Pr[ d/m − Δ⁻ ≤ ||~τ′||² ≤ d/m + Δ⁺ ] ≥ 1 − e^{(d/2)(ln(1−Δ⁻)+Δ⁻)} − e^{(d/2)(ln(1−Δ⁺)−Δ⁺)}

thus:

Pr[ 1 − δ⁻ ≤ ||f(~τ)||² ≤ 1 + δ⁺ ] ≥ 1 − e^{(d/2)(ln(1−(d/m)δ⁻)+(d/m)δ⁻)} − e^{(d/2)(ln(1−(d/m)δ⁺)−(d/m)δ⁺)}

where δ⁻ = (m/d)Δ⁻ and δ⁺ = (m/d)Δ⁺. So, if we express ||f(~τ)||² as 1 + δ, we obtain:

Pr[ −δ⁻ ≤ δ ≤ δ⁺ ] ≥ 1 − e^{(d/2)(ln(1−(d/m)δ⁻)+(d/m)δ⁻)} − e^{(d/2)(ln(1−(d/m)δ⁺)−(d/m)δ⁺)}    (4.4)
i.e. the difference between the squared norm of a mapped vector f(~τ) and 1 is statistically very small. We can now demonstrate that the following lemma holds:
Lemma 4.2.1. For any 0 < ε < 1 and any integer m, let d be a positive integer such that

d ≥ 4(ε²/2 − ε³/3)⁻¹ ln m.

Then, given the standard basis E of Rm, there is a map f : Rm → Rd such that, for all ~τi, ~τj ∈ E,

|f(~τi) · f(~τj) − δij| < ε

where δij ≈ 0 with high probability.
Proof. First, we can observe that ||~τi − ~τj||² = ||~τi||² + ||~τj||² − 2 ~τi · ~τj = 2, as ~τi and ~τj are unitary and orthogonal. Then, we can see that ||f(~τi) − f(~τj)||² = ||f(~τi)||² + ||f(~τj)||² − 2 f(~τi) · f(~τj) = 2(1 + δij − f(~τi) · f(~τj)), assuming ||f(~τi)||² = 1 + δi, ||f(~τj)||² = 1 + δj and δij = (δi + δj)/2. The inequality of the Johnson-Lindenstrauss Lemma then becomes: (1 − ε) ≤ (1 + δij − f(~τi) · f(~τj)) ≤ (1 + ε). This reduces to: −ε < f(~τi) · f(~τj) − δij < ε. By Eq. 4.4 and the definition of δij, we can conclude that δij ≈ 0 with high probability.
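The lemma can be checked numerically. The sketch below (parameters are our own choice) maps the standard basis of Rm through a random Gaussian matrix with N(0, 1/d) entries, a standard concrete form of the Johnson-Lindenstrauss map, and verifies that the images are nearly unit and pairwise nearly orthogonal:

```python
import math
import random

random.seed(0)
m, d = 200, 1000

# f(e_i) is column i of a d x m matrix with N(0, 1/d) entries,
# so E[||f(e_i)||^2] = 1 and E[f(e_i) . f(e_j)] = 0 for i != j.
cols = [[random.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
        for _ in range(m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

norms = [math.sqrt(dot(c, c)) for c in cols]                    # Property 1
cross = [abs(dot(cols[i], cols[j]))                             # Property 2
         for i in range(0, m, 20) for j in range(i + 1, m, 20)]
print(min(norms), max(norms), max(cross))
```

The norms concentrate around 1 and the sampled cross dot products stay close to 0, as the lemma predicts.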
This result is also important as it gives a relation, though approximate, between ε,
d, and m. Table 4.1.(a) shows the relation between d and m, for a fixed ε = 0.05.
Table 4.1.(b) shows the relation between ε and m, for a fixed d = 2500. The tables
(a) fixed ε = 0.05              (b) fixed d = 2500
    d       m                       ε       m
  1500      1408                  0.01      1
  2000      15782                 0.02      7
  2500      176898                0.03      82
  3000      1982759               0.04      2408
  3500      22223629              0.05      176898
  4000      249092116             0.06      31960138
  4500      2791932932            0.07      13921032552
  5000      31293200256           0.08      1.43294 · 10¹³
  5500      3.50748 · 10¹¹        0.09      3.41657 · 10¹⁶
  6000      3.93133 · 10¹²        0.1       1.84959 · 10²⁰

Table 4.1: Relation between d, m and ε with respect to the result of the lemma. Table (a) has a fixed ε = 0.05. Table (b) has a fixed d = 2500.
show that the dimension m of a space Rm that can be encoded in a smaller space Rd grows rapidly with d (for a fixed ε) and, for a fixed d, grows rapidly as the requirement on ε is relaxed. This result shows that the strategy of encoding large tree fragment feature spaces in much smaller spaces is viable.
4.2.2 Properties of the Vector Space
In this section, we empirically investigate the theoretical result obtained in the previous
section, as we need to determine how the nearly-orthonormal vectors work when the
dot product is applied. To better explain the objectives of this analysis, we assume
the trivial weighting scheme that assigns ωi = 1 if τi is a subtree of T and ωi = 0
otherwise. Distributed trees can then be rewritten as:

⇀T = Σ_{τ∈S(T)} ⇀τ    (4.5)
Dim. 512        h=0             h=1             h=2             h=5             h=10
             Avg     Var     Avg     Var     Avg     Var     Avg     Var     Avg     Var
k = 20    -0.0446  0.7183  0.9739  1.0211  2.0138  0.5777  4.9507  0.6272  9.8645  1.0127
k = 50    -0.0545  5.4629  1.1306  3.4711  2.2941  4.4938  5.0171  4.824   9.6618  4.2196
k = 100   -0.6819  21.09   0.9822 16.6965  1.8942 20.1004  5.2458 16.3342 10.2801 20.3961
k = 200   -0.4052 77.8553  1.9928 53.8576  2.1421 76.4839  5.0398 78.0001  9.0137 78.9879
k = 500   -1.5108 417.1121 2.6724 491.0844 5.0566 489.0258 7.3068 481.6177 10.6926 360.5955

Dim. 1024
k = 20     0.0002  0.49    0.9851  0.4446  1.9378  0.3432  4.8732  0.3592  9.8996  0.4976
k = 50    -0.2511  2.3229  0.9788  2.4572  1.9985  1.8425  5.1292  2.2691 10.0537  2.1102
k = 100   -0.0327 10.0111  1.37    8.2847  2.5441  8.9007  5.5761  7.9167  9.8143  8.529
k = 200    0.4292 40.0213  1.6357 36.9718  2.6673 38.6285  4.8888 36.5791 10.5477 32.6902
k = 500    0.961 240.6677 -0.2899 197.0764 2.1809 238.7662 6.1283 247.0507 10.5271 243.4888

Dim. 2048
k = 20    -0.0528  0.2225  0.9848  0.1895  1.943   0.1712  4.995   0.2328  9.9534  0.2106
k = 50    -0.0289  1.0154  0.9423  1.2872  2.0609  1.1389  5.2488  1.5417 10.0115  1.3548
k = 100    0.145   4.5321  1.1818  4.25    2.2062  5.3105  5.074   4.1688 10.1047  4.7287
k = 200    0.3227 20.7004  1.7306 21.3282  1.0251 20.681   4.4617 21.263  10.3172 25.5613
k = 500    0.8879 107.0174 2.5353 111.5448 0.3597 136.8164 5.7405 145.3144 9.8646 133.716

Dim. 4096
k = 20    -0.0079  0.0959  1.0025  0.0916  2.0257  0.068   4.9981  0.0884  9.9922  0.1027
k = 50     0.0441  0.5899  1.0628  0.5569  2.0566  0.6462  4.9818  0.6404  9.9896  0.6424
k = 100    0.172   2.2318  1.0608  1.6847  2.0725  2.5823  5.0671  2.506   9.9597  2.244
k = 200   -0.2724 10.2017  1.1402  8.715   2.2495 10.1631  5.3923  7.8346 10.4583  9.428
k = 500   -0.4444 66.6689  1.3839 83.4934  1.8321 73.682   5.3868 46.6318 10.2459 71.5771

Dim. 8192
k = 20     0.0181  0.0559  1.0237  0.0411  1.9857  0.0441  5.0078  0.0524  9.9892  0.0529
k = 50     0.0165  0.3067  0.969   0.304   1.9878  0.3611  5.0606  0.3586 10.0249  0.3336
k = 100   -0.0316  1.2879  0.8976  1.074   1.9485  1.0715  4.9804  1.3136  9.899   1.0581
k = 200    0.0654  3.4304  1.1079  4.4238  1.6926  4.9392  4.7375  4.1595 10.299   4.7273
k = 500    0.6881 31.5398  0.7341 29.1581  2.467  23.417   5.4105 34.0299  9.6346 31.7414

Table 4.2: Average value and variance over 100 samples of the dot product between two sums of k random vectors, with h vectors in common, for several vector sizes.
where S(T) is the set of the subtrees of T, τ is a subtree, and ⇀τ is its DTF vector. Given Eq. 4.2 and Properties 1 and 2, expected to hold for the vectors ⇀τ, we can derive the following:

~T1 · ~T2 − |S(T1)||S(T2)|ε < ⇀T1 · ⇀T2 < ~T1 · ~T2 + |S(T1)||S(T2)|ε
where |S(T )| is the cardinality of set S(T ). We want to empirically determine whether
this variability can be a real problem when applying this approach to real cases.
To experiment with the above issue, we produced two sums of k basic random vectors ⇀v, with h < k vectors in common. To ensure that these basic vectors are nearly orthonormal, their elements vi are randomly drawn from a normal distribution N(0, 1) and each vector is normalized so that ||⇀v|| = 1 (as done in the proof of the Johnson-Lindenstrauss Lemma by Dasgupta and Gupta (1999)). The sums were
then compared by means of their dot product. Ideally, the result is expected to be as
close as possible to h. We repeated the experiment for several values of k and h, and
for different vector space sizes. The results in terms of average value and variance are
shown in Table 4.2.
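The experiment can be reproduced with a short script; this is a sketch using a smaller sample count than Table 4.2 (dimension 1024, k = 20, h = 5):

```python
import math
import random

random.seed(1)
d, k, h, samples = 1024, 20, 5, 20

def random_versor(d):
    """Basic nearly-orthonormal vector: N(0,1) elements, normalized to unit norm."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def vec_sum(vectors):
    out = [0.0] * d
    for v in vectors:
        for i, x in enumerate(v):
            out[i] += x
    return out

dots = []
for _ in range(samples):
    shared = [random_versor(d) for _ in range(h)]       # the h common vectors
    s1 = vec_sum(shared + [random_versor(d) for _ in range(k - h)])
    s2 = vec_sum(shared + [random_versor(d) for _ in range(k - h)])
    dots.append(sum(a * b for a, b in zip(s1, s2)))

avg = sum(dots) / samples
print(avg)  # expected to be close to h = 5
```

The noise comes from the roughly k² cross pairs of distinct vectors, so the per-sample variance grows roughly as k²/d, which is consistent with the figures in Table 4.2.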
Some observations follow:
• for small values of k, the actual results are close, on average, to the expected
ones, independently of the value of h considered;
• for larger values of k, the noise grows and we observe a noticeable degradation of the results, especially with respect to their variance;
• use of larger vector spaces scarcely affects the average values (that are already
close to the expected ones), but it has a large positive impact on the variances.
These results suggest that the adopted approach has some structural limits in its ability to scale up to highly complex structures, i.e. structures that can be decomposed into many substructures. To overcome these limits, the dimension of the target vector space must be chosen according to the expected number of active substructures, and not only according to the size of the initial space of tree fragments.
4.3 Compositionally Representing Structures as Vectors
In this section, we will show how to compute nearly-orthonormal vectors ⇀τi in the reduced space Rd, starting from the original trees τi. As already described, these vectors represent dimensions fi of the original space Rm and, thus, tree fragments τi. We then want to define the mapping function f:

f(τi) = f(I(τi)) = f(~τi) = ⇀τi

The mapping function f is built on top of a set N of nearly-orthonormal vectors for tree node labels and a vector composition function □.

We first introduce the mapping function f(τi). Then we introduce the properties of the ideal vector composition function □. We then show that, given the properties of the function □, the proposed function f(τi) satisfies Properties 1 and 2. Finally, we analyze whether it is possible to define a concrete function satisfying the properties of the ideal function □.
4.3.1 Structures as Distributed Vectors
The first issue to address is how to directly obtain a vector from a tree. Assigning a
random vector to each tree as a random indexing (Sahlgren, 2005) of the space of the
[Figure 4.2 shows a sample tree with root A and children B and C; B has child W1; C has children D and E; D has child W2; E has child W3.]

Figure 4.2: A sample tree.
tree fragments is infeasible, as this space is combinatorial with respect to the node labels.
The proposed method follows a different approach. In line with the compositional
approach in the distributed representations (Plate, 1994), we define function f in a
compositional way: the vector of the tree is obtained by composing some basic vectors
for the nodes.
The basic blocks needed to represent a tree are its nodes. Each node class (i.e. each possible label) l can be mapped to a random vector in Rd; the set of these node-label vectors is N. These basic vectors should be nearly orthonormal, so we can obtain them using the same method adopted in the above experiments: the elements of each ⇀l are randomly drawn from a normal distribution N(0, 1) and the vector is normalized so that ||⇀l|| = 1. These conditions are sufficient to guarantee that each node vector is statistically nearly-orthogonal with respect to the others.
The vector representing a node n will simply be denoted by ⇀n ∈ N, keeping in mind that the actual vector depends on the node label, so that ⇀n1 = ⇀n2 if the two nodes share the same label l = l(n1) = l(n2).
A tree structure can be unambiguously represented in a 'flat' format using a parenthetical
notation. For example, the tree in Figure 4.2 could be represented by the sequence:
(A (B W1)(C (D W2)(E W3)))
This notation corresponds to a depth-first visit of the tree, augmented with parentheses so that the tree structure is delineated as well.
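This depth-first serialization is straightforward to implement. A minimal sketch (the tuple-based tree encoding is our own choice):

```python
# A tree is a pair (label, children); leaves have an empty children list.
sample = ("A", [("B", [("W1", [])]),
                ("C", [("D", [("W2", [])]),
                       ("E", [("W3", [])])])])

def to_paren(tree):
    """Depth-first visit producing the parenthetical notation."""
    label, children = tree
    if not children:
        return label
    return "(%s %s)" % (label, "".join(to_paren(c) for c in children))

print(to_paren(sample))  # (A (B W1)(C (D W2)(E W3)))
```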
If we replace the nodes with their corresponding vectors and introduce the composition function □ : Rd × Rd → Rd, we can regard the above formulation as a mathematical expression that defines a representative vector for a whole tree, taking into account its nodes and structure. The example tree (Fig. 4.2) would then be represented by the vector obtained as:

⇀τ = (⇀A □ (⇀B □ ⇀W1) □ (⇀C □ (⇀D □ ⇀W2) □ (⇀E □ ⇀W3)))
We formally define the function f(τ) as follows:

Definition 4.3.1. Let τ be a tree and N the set of nearly-orthogonal vectors for node labels. We recursively define f(τ) as:

• f(n) = ⇀n if n is a terminal node, where ⇀n ∈ N

• f(τ) = (⇀n □ f(τc1 . . . τck)) if n is the root of τ and τc1 . . . τck are its children subtrees

• f(τ1 . . . τk) = (f(τ1) □ f(τ2 . . . τk)) if τ1 . . . τk is a sequence of trees
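Definition 4.3.1 maps directly onto a recursive procedure. In this sketch the composition function □ is a parameter `box`, so any candidate from Section 4.3.4 can be plugged in; the placeholder used here is a γ-product-style scaling without a transformation function, so it is commutative, does not satisfy the ideal properties, and serves only to make the sketch runnable:

```python
import math
import random

random.seed(2)
d = 64
labels = ["A", "B", "C", "D", "E", "W1", "W2", "W3"]

def random_versor(d):
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# N maps each node label to a nearly-orthonormal random versor.
N = {l: random_versor(d) for l in labels}

def f(tree, box):
    """Distributed tree fragment of `tree`, following Definition 4.3.1."""
    label, children = tree
    if not children:                     # terminal node
        return N[label]
    return box(N[label], f_seq(children, box))

def f_seq(trees, box):
    if len(trees) == 1:
        return f(trees[0], box)
    return box(f(trees[0], box), f_seq(trees[1:], box))

# Commutative placeholder composition (NOT a valid ideal function).
toy_box = lambda x, y: [math.sqrt(d) * a * b for a, b in zip(x, y)]

sample = ("A", [("B", [("W1", [])]),
                ("C", [("D", [("W2", [])]),
                       ("E", [("W3", [])])])])
vec = f(sample, toy_box)
print(len(vec))  # one d-dimensional vector for the whole tree
```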
4.3.2 An Ideal Vector Composition Function
We here introduce the ideal properties of the vector composition function □, such that the function f(τi) has the two desired properties. We first state the properties of the composition function and then prove the properties of f(τi).
The definition of the ideal composition function follows:
Definition 4.3.2. The ideal composition function is □ : Rd × Rd → Rd such that, given ⇀a, ⇀b, ⇀c, ⇀d ∈ N, a scalar s, and a vector ⇀t obtained by composing an arbitrary number of vectors in N via □, the following properties hold:

4.3.2.1 Non-commutativity with a very high degree k [2]

4.3.2.2 Non-associativity: ⇀a □ (⇀b □ ⇀c) ≠ (⇀a □ ⇀b) □ ⇀c

4.3.2.3 Bilinearity:

I) (⇀a + ⇀b) □ ⇀c = ⇀a □ ⇀c + ⇀b □ ⇀c

II) ⇀c □ (⇀a + ⇀b) = ⇀c □ ⇀a + ⇀c □ ⇀b

III) (s⇀a) □ ⇀b = ⇀a □ (s⇀b) = s(⇀a □ ⇀b)

Approximation Properties

4.3.2.4 ||⇀a □ ⇀b|| = ||⇀a|| · ||⇀b||

4.3.2.5 |⇀a · ⇀t| < ε if ⇀t ≠ ⇀a

4.3.2.6 |(⇀a □ ⇀b) · (⇀c □ ⇀d)| < ε if |⇀a · ⇀c| < ε or |⇀b · ⇀d| < ε
4.3.3 Proving the Basic Properties for Compositionally-obtained Vectors
Having defined the ideal vector composition function □, we can now focus on the two
properties needed to have DTFs as a nearly-orthonormal basis of Rm embedded in Rd,
i.e. Property 1 and Property 2.
Property 1 (Nearly Unit Vectors) is realized by the following lemma:
Lemma 4.3.1. Given a tree τ , the norm of the vector f(τ) is unitary.
[2] By degree of commutativity we refer to the lowest number k such that □ is non-commutative, i.e. ⇀a □ ⇀b ≠ ⇀b □ ⇀a, and, for any j < k, ⇀a □ ⇀c1 □ . . . □ ⇀cj □ ⇀b ≠ ⇀b □ ⇀c1 □ . . . □ ⇀cj □ ⇀a.
This lemma can be easily proven using Property 4.3.2.4 and knowing that vectors
in N are versors.
For Property 2 (Nearly Orthogonal Vectors), we first need to observe that, due to
Properties 4.3.2.1 and 4.3.2.2, a tree τ generates a unique sequence of applications
of function � in f(τ), representing its structure. We can now address the following
lemma:
Lemma 4.3.2. Given two different trees τa and τb, the corresponding DTFs are nearly-
orthogonal: |f(τa) · f(τb)| < ε.
Proof. The proof is done by induction on the structure of τa and τb.
Basic step
Let τa be the single node a. Two cases are possible. If τb is the single node b ≠ a, then, by the properties of the vectors in N, |f(τa) · f(τb)| = |⇀a · ⇀b| < ε. Otherwise, by Property 4.3.2.5, |f(τa) · f(τb)| = |⇀a · f(τb)| < ε.
Induction step
Case 1 Let τa be a tree with root production a → a1 . . . ak and τb be a tree with root production b → b1 . . . bh. The expected property becomes |f(τa) · f(τb)| = |(⇀a □ f(τa1 . . . τak)) · (⇀b □ f(τb1 . . . τbh))| < ε. We have two cases. If a ≠ b, then |⇀a · ⇀b| < ε and |f(τa) · f(τb)| < ε by Property 4.3.2.6. Else, if a = b, then τa1 . . . τak ≠ τb1 . . . τbh as τa ≠ τb; then, as |f(τa1 . . . τak) · f(τb1 . . . τbh)| < ε is true by inductive hypothesis, |f(τa) · f(τb)| < ε by Property 4.3.2.6.
Case 2 Let τa be a tree with root production a → a1 . . . ak and τb = τb1 . . . τbh be a sequence of trees. The expected property becomes |f(τa) · f(τb)| = |(⇀a □ f(τa1 . . . τak)) · (f(τb1) □ f(τb2 . . . τbh))| < ε. Since |⇀a · f(τb1)| < ε is true by inductive hypothesis, |f(τa) · f(τb)| < ε by Property 4.3.2.6.
Case 3 Let τa = τa1 . . . τak and τb = τb1 . . . τbh be two sequences of trees. The expected property becomes |f(τa) · f(τb)| = |(f(τa1) □ f(τa2 . . . τak)) · (f(τb1) □ f(τb2 . . . τbh))| < ε. We have two cases. If τa1 ≠ τb1, then |f(τa1) · f(τb1)| < ε by inductive hypothesis, and |f(τa) · f(τb)| < ε by Property 4.3.2.6. Else, if τa1 = τb1, then τa2 . . . τak ≠ τb2 . . . τbh as τa ≠ τb; then, as |f(τa2 . . . τak) · f(τb2 . . . τbh)| < ε is true by inductive hypothesis, |f(τa) · f(τb)| < ε by Property 4.3.2.6.
4.3.4 Approximating the Ideal Vector Composition Function
The computation of DTs and, consequently, of the kernel function DTK strongly depends on the availability of a vector composition function □ with the required ideal properties. Such a function is hard to define, and can only be approximated.

The proposed general approach is to consider a function of the form:

~x □ ~y ≈ c(ta(~x), tb(~y))

where ta, tb : Rn → Rn are two fixed, analogous but different vector transformations, used to ensure that the final operation is not commutative. The transformed vectors are then processed by the actual composition function c : Rn × Rn → Rn. We considered several possible combinations of vector transformation and composition functions.
4.3.4.1 Transformation Functions
We will present here the different transformation functions that were considered. The transformed vectors should preserve the properties of the original vectors (norm, statistical distribution of the element values), so that the composition function can be defined independently of the actual transformation applied to the vectors. In the following, we will consider n to be the size of the vectors, whose indices are in the range 0, . . . , n − 1.
The transformation functions are used to introduce Property 4.3.2.2 and Property 4.3.2.1, with different values of k, into the final function.
Reversing The simplest transformation function is the one that reverses the order of the vector elements, i.e. rev(~x)_i = x_(n−1−i). In this case, we define tb = rev and ta = I (the identity function). For reversing, Property 4.3.2.1 holds for k = 1.
Shifting Another simple transformation function is the circular shifting of the vector elements by h positions, i.e. shift_h(~x)_i = x_((i−h) mod n). We can use two different values ha and hb for the two transformations ta and tb. Notice that for h = 0 we have shift_0 = I. For shifting, Property 4.3.2.1 holds for k = n, where n is the size of the vectors.
Shuffling A slightly more complex transformation is the shuffling of the order of the elements in the vector, i.e. shuf_K(~x)_i = x_(K_i), where K = (K_0, . . . , K_(n−1)), with K_i ∈ {0, . . . , n − 1}, is a random permutation of the n elements of the vector. We can define two different permutations K_a and K_b for the two transformations ta and tb. Notice that for the identity permutation K = (0, . . . , n − 1) we have shuf_K = I. This transformation function has also been used to encode word order in random indexing models (Sahlgren et al., 2008). For shuffling, Property 4.3.2.1 holds for a value of k proportional to n!, where n is the size of the vectors.
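The three transformations are one-liners over index manipulations (a sketch; the shift amount and the permutation seed below are illustrative choices of ours):

```python
import random

def rev(x):
    """Reversing: rev(x)_i = x_{n-1-i}."""
    return x[::-1]

def shift(x, h):
    """Circular shifting by h positions: shift_h(x)_i = x_{(i-h) mod n}."""
    n = len(x)
    return [x[(i - h) % n] for i in range(n)]

def make_shuffle(n, seed):
    """Shuffling by a fixed random permutation K: shuf_K(x)_i = x_{K_i}."""
    K = list(range(n))
    random.Random(seed).shuffle(K)
    return lambda x: [x[k] for k in K]

x = list(range(8))              # a stand-in vector to show the index moves
shuf = make_shuffle(8, seed=42)
print(rev(x))
print(shift(x, 2))
print(shuf(x))
```

All three are permutations of the elements, so they trivially preserve the norm and the element distribution, as required above.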
4.3.4.2 Composition Functions
We considered three possible composition functions that are bilinear (Property 4.3.2.3).
These functions are also commutative, and thus need to be used in conjunction with one
of the above vector transformation methods.
The functions are used to progressively introduce the Approximation Properties of the ideal function □. We first examine how the proposed functions behave with respect to Property 4.3.2.4; then, Section 4.3.4.3 reports an empirical investigation of whether the other Approximation Properties statistically hold.
Element-wise product The element-wise product ~v = ~x × ~y is a well-known operation on vectors, where the elements of the resulting vector are obtained as the product of the corresponding elements of the original vectors, i.e. v_i = x_i y_i. This operation does not guarantee any relation between the norm of the original vectors and the norm of the resulting vector. Property 4.3.2.4 cannot hold, thus this function is not adequate for our model.
γ-product In order to overcome the issue with the element-wise product, we can introduce a normalization parameter γ, depending on the size of the vector space, that can be estimated as the reciprocal of the average norm of the element-wise product between two random versors. Thus, we define the γ-product as ~v = ~x ⊗γ ~y with v_i = γ x_i y_i. This operation approximates Property 4.3.2.4, but applying it in long chains of compositions introduces a degradation, whose magnitude has to be estimated.
Circular convolution Circular convolution has been used for purposes similar to
ours by Plate (1994), in the context of the distributed representations. It is defined as
~v = ~a ∗ ~b with:

v_i = Σ_(k=0..n−1) a_k b_((i−k) mod n)

Notice that, unlike the element-wise product, circular convolution does not require the introduction of a normalization parameter to approximate Property 4.3.2.4. Notice also that circular convolution has a higher computational complexity, in the order of O(n²). This complexity can be reduced to O(n log n) by using a Fast Fourier Transform algorithm.
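Both surviving composition functions can be sketched as follows (γ is estimated empirically, as described above; the convolution is the naive O(n²) version, and the dimensions and sample counts are our own choices):

```python
import math
import random

random.seed(4)

def random_versor(n):
    v = [random.gauss(0.0, 1.0) for _ in range(n)]
    s = math.sqrt(sum(e * e for e in v))
    return [e / s for e in v]

def norm(v):
    return math.sqrt(sum(e * e for e in v))

def estimate_gamma(n, samples=50):
    """Reciprocal of the average norm of the element-wise product of two versors."""
    avg = sum(norm([x * y for x, y in zip(random_versor(n), random_versor(n))])
              for _ in range(samples)) / samples
    return 1.0 / avg

def gamma_product(x, y, gamma):
    """v_i = gamma * x_i * y_i."""
    return [gamma * a * b for a, b in zip(x, y)]

def circular_convolution(a, b):
    """v_i = sum_k a_k * b_{(i-k) mod n}; O(n^2), O(n log n) with an FFT."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]

n = 256
gamma = estimate_gamma(n)   # close to sqrt(n) for this vector distribution
a, b = random_versor(n), random_versor(n)
print(norm(gamma_product(a, b, gamma)), norm(circular_convolution(a, b)))
```

For two random versors both compositions return a vector whose norm is close to 1, in line with the approximate Property 4.3.2.4.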
4.3.4.3 Empirical Analysis of the Approximation Properties
Several tests were performed on the proposed composition functions, to determine if
and to what degree they can satisfy the Approximation Properties of the ideal compo-
sition function �. We repeated the tests using vectors of different sizes, to verify the
impact of the vector space dimension on the effectiveness of the different functions.
We considered all the possible combinations of transformation functions and composition functions, except those including the non-normalized element-wise product as composition function. The latter does not even approximate the required norm-preservation property, as previously pointed out.
Two categories of experiments were performed, to test for the following aspects:
• norm preservation, as in Property 4.3.2.4;
• orthogonality of compositions, as in Properties 4.3.2.5 and 4.3.2.6.
Norm preservation We tried composing an increasing number of basic vectors (with
unit norm) and measuring the norm of the resulting vector. Each plot in Figure 4.3
[Figure 4.3, panels (a)–(d): (a) Dimension 1024, circular convolution; (b) Dimension 1024, γ-product; (c) Dimension 2048, circular convolution; (d) Dimension 2048, γ-product. Each panel plots curves for the Shuffled, Shifted and Reverse transformations.]

Figure 4.3: Norm of the vector obtained as combination of different numbers of basic random vectors, for various combination functions. The values are averaged on 100 samples.
shows the norm of the combination of n vectors under one composition function (γ-product or circular convolution) for the three transformation functions; the norm is on the y-axis and n is on the x-axis.
These results show that, using γ-product for the composition function, the norm
is mostly preserved up to a certain number of composed vectors, and then decreases
rapidly. Increasing the vector dimension allows for a larger number of compositions
before the norm starts degrading. Using different transformation functions seems not
[Figure 4.3, continued, panels (e)–(h): (e) Dimension 4096, circular convolution; (f) Dimension 4096, γ-product; (g) Dimension 8192, circular convolution; (h) Dimension 8192, γ-product.]
to have a relevant impact.
Circular convolution, instead, guarantees very good norm preservation, as long as shuffling is used. Shifting and reversing yield worse results, but still behave better than the product-based composition functions. It should be noted that the measured variance (not reported) increases with the number of vectors in the composition, both in the case of the γ-product and in that of circular convolution.
We also repeated the tests composing sums of two and three vectors, instead of
unitary vectors. The results are very similar, allowing us to postulate that the previ-
ous observations can be generalized to the composition of vectors of any norm, as in
Property 4.3.2.4.
[Figure 4.4, panels (a)–(d): (a) Dimension 1024, circular convolution; (b) Dimension 1024, γ-product; (c) Dimension 2048, circular convolution; (d) Dimension 2048, γ-product. Each panel plots curves for the Shuffled, Shifted and Reverse transformations.]

Figure 4.4: Dot product between two combinations of basic random vectors, identical apart from one vector, for various combination functions. The values are averaged on 100 samples, and the absolute value is taken.
Orthogonality of compositions Properties 4.3.2.5 and 4.3.2.6 can easily be shown to hold for a single application. But the degradation introduced can become substantial when the concrete vector composition function □ is recursively applied. We therefore want to
[Figure 4.4, continued, panels (e)–(h): (e) Dimension 4096, circular convolution; (f) Dimension 4096, γ-product; (g) Dimension 8192, circular convolution; (h) Dimension 8192, γ-product.]
investigate how the functions behave in their repeated application. We measured, by dot product, the similarity of two compositions of up to 20 basic random vectors, where all but one of the vectors in the compositions were the same, i.e.:

(⇀x □ ⇀a1 □ . . . □ ⇀an) · (⇀y □ ⇀a1 □ . . . □ ⇀an)
This is strictly related to the repeated application of Properties 4.3.2.5 and 4.3.2.6.
Similarly to what was done before, the plots in Figure 4.4 represent the absolute value of
[Figure 4.5, panels (a)–(d): (a) Dimension 1024, circular convolution; (b) Dimension 1024, γ-product; (c) Dimension 2048, circular convolution; (d) Dimension 2048, γ-product. Each panel plots curves for the Shuffled, Shifted and Reverse transformations.]

Figure 4.5: Variance for the values of Fig. 4.4.
the average on 100 samples of these dot products, for a given composition function
combined with the three transformation functions. Results are encouraging, as the
similarity absolute value is never over 1%. Using a larger vector size results in even
lower similarities, mostly below 0.5%. Circular convolution still seems to yield slightly
better results than γ-product.
In Figure 4.5 we also reported the variance measured for this experiment. These
values highlight more clearly the better behavior of circular convolution with respect to
γ-product, and of shuffled circular convolution with respect to the shifted and reverse
[Figure 4.5, continued, panels (e)–(h): (e) Dimension 4096, circular convolution; (f) Dimension 4096, γ-product; (g) Dimension 8192, circular convolution; (h) Dimension 8192, γ-product.]
variants. We should also point out that shifted and reverse circular convolution yield
exactly the same results, both in this experiment and in the one of Figure 4.3. This is
due to the nature of circular convolution. Notice that, though sharing these properties,
the vectors obtained by the two composition operations are not the same.
We also tried composing vectors in different fashions, i.e. comparing just one vec-
tor against the composition of several vectors. The results measured are substantially
analogous and thus are not reported.
Conclusions The analysis of the vector composition function properties leads to some conclusions. First, circular convolution seems to guarantee a better approximation of Properties 4.3.2.4, 4.3.2.5 and 4.3.2.6 than the γ-product. Second, when using circular convolution as the composition function, shuffling is the transformation function that yields the better approximation of the ideal properties. Thus, we performed the following experiments using shuffled circular convolution as the vector composition function □, unless otherwise specified.
As a final note, we point out that, as expected, increasing the vector size always
results in a better approximation of the desired properties.
4.4 Approximating Traditional Tree Kernels with Distributed Trees
Starting with the definitions introduced in the previous sections, it is now possible to
define the class of the distributed tree kernels, which behave approximately like the original tree kernels. This work is focused on three sample tree kernels: the original tree kernel
by Collins and Duffy (2002), the subpath tree kernel (Kimura et al., 2011), and the
route tree kernel (Aiolli et al., 2009). The distributed version of each tree kernel is
presented in one separate section. Each section contains the analysis of the feature
space of the tree fragments, along with the definition of the corresponding distributed
tree fragments, and the definition of a structurally recursive algorithm for efficiently
computing the distributed trees (cf. Eq. 4.5), without having to enumerate the whole
set of tree fragments.
4.4.1 Distributed Collins and Duffy’s Tree Kernels
Collins and Duffy (2002) introduced the first definition of tree kernel applied to syntac-
tic trees, stemming from the notion of convolution kernels. This section describes the
distributed tree kernel that emulates their Tree Kernel (TK). For a review of this kernel
function see Section 2.4.
4.4.1.1 Distributed Tree Fragments
The feature space of the tree fragments used by Collins and Duffy (2002) can be described as follows. Given a context-free grammar G = (N, Σ, P, S), where N is the set of non-terminal symbols, Σ is the set of terminal symbols, P is the set of production rules, and S is the start symbol, the valid tree fragments for the feature space are all the trees obtained by any derivation starting from any non-terminal symbol in N.
S( (A (B W1) (C (D W2) (E W3))) ) =
  { (A B C), (B W1), (A (B W1) C), (A B (C D E)), (A B (C (D W2) E)),
    (A B (C D (E W3))), (A B (C (D W2) (E W3))), (A (B W1) (C (D W2) (E W3))),
    (C (D W2) E), (C D (E W3)), (C (D W2) (E W3)), (E W3), (D W2) }

Figure 4.6: Tree fragments for Collins and Duffy (2002)'s tree kernel, in bracketed notation.
The above definition is impractical, as it describes a possibly infinite set of features.
Thus, a description of a function S(T) that extracts the active subtrees from a given tree T is generally preferred. Given a tree T, S(T) contains every subtree of T that includes more than one node, with the restriction that entire (not partial) rule productions must be included. In other words, if node n in the original tree has children c1, . . . , cm, every subtree containing n must include either the whole set of children c1, . . . , cm or none of them (i.e., leaving n as a leaf of the fragment). Figure 4.6 gives an example.
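As an illustrative sketch (the tuple-based tree encoding and the function names are ours), S(T) can be enumerated by choosing, at each node a fragment expands, either the full production over its children or nothing:

```python
from itertools import product

def fragments(tree):
    """Tree fragments in the sense of Collins and Duffy (2002): every
    fragment has more than one node and, at each node it expands,
    includes either all of that node's children or none of them.
    A tree is encoded as (label, [children]); terminals have no children."""

    def rooted(node):
        """All fragments rooted at `node`, plus the bare (unexpanded) label."""
        label, children = node
        if not children:
            return [(label, [])]            # terminal: only the bare label
        # Each child is either left as a bare label or replaced by any
        # fragment rooted at it.
        options = [rooted(c) for c in children]
        result = [(label, [])]              # bare label: node not expanded
        for combo in product(*options):
            result.append((label, list(combo)))
        return result

    out = []
    def visit(node):
        label, children = node
        if children:
            out.extend(f for f in rooted(node) if f[1])   # keep > 1 node
        for c in children:
            visit(c)
    visit(tree)
    return out
```

For the example tree used in Figure 4.6 this enumerates 17 fragments in total.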
For the formulation described in Section 2.4, the feature representing tree fragment τ for a given tree T has weight:

ω = √λ^(|τ|−1)

where |τ| is the number of non-terminal nodes of τ. In the original formulation (Collins and Duffy, 2001), the contribution of tree fragment τ to the TK is λ^(|τ|), giving ω = √λ^(|τ|) (see also Pighin and Moschitti (2010)). This difference is not relevant in the overall theory of the distributed tree kernels.
The distributed tree fragments for the reduced version of this space are the depth-first visits of the above subtrees. For example, given the tree fragments in Figure 4.6, the corresponding distributed tree fragments for the first, the second, and the third tree are: A⃗ ⊙ (B⃗ ⊙ C⃗), B⃗ ⊙ W⃗1, and A⃗ ⊙ ((B⃗ ⊙ W⃗1) ⊙ C⃗), where x⃗ denotes the distributed vector for label x and ⊙ the vector composition function.
4.4.1.2 Recursively Computing Distributed Trees

The structurally recursive formulation for the computation of distributed trees T⃗ is the following:

T⃗ = ∑_{n ∈ N(T)} s(n)    (4.6)

where N(T) is the node set of tree T and s(n) represents the sum of distributed vectors for the subtrees of T rooted in node n. Function s(n) is recursively defined as follows:
• s(n) = 0⃗ if n is a terminal node.

• s(n) = n⃗ ⊙ (c⃗1 + √λ s(c1)) ⊙ . . . ⊙ (c⃗m + √λ s(cm)) if n is a node with children c1 . . . cm.
As in the TK, the decay factor λ decreases the weight of large tree fragments in the final kernel value. With dynamic programming, the time complexity of this function is linear, O(|N(T)|), and the space complexity is O(d) (where d is the size of the vectors in R^d).
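The recursion can be sketched as follows. The tree encoding, the label-vector lookup `vec` and the composition function `comp` are illustrative parameters (any approximated composition, such as shuffled circular convolution, can be plugged in; since composition is not associative, we apply it left to right over the children):

```python
import numpy as np

def distributed_tree(tree, vec, comp, lam):
    """T_vec = sum over nodes n of s(n)  (Eq. 4.6), with
    s(n) = 0 at terminals, and otherwise
    s(n) = n_vec o (c1_vec + sqrt(lam)*s(c1)) o ... o (cm_vec + sqrt(lam)*s(cm)),
    where `o` is the composition function comp applied left to right."""
    total = None

    def s(node):
        nonlocal total
        label, children = node
        if not children:
            sn = np.zeros_like(vec(label))       # s(n) = 0 for terminals
        else:
            sn = vec(label)
            for c in children:
                sn = comp(sn, vec(c[0]) + np.sqrt(lam) * s(c))
        total = sn if total is None else total + sn
        return sn

    s(tree)
    return total
```

Each node is visited exactly once, which is the linear O(|N(T)|) bound noted above.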
The overall theorem we need is the following.

Theorem 4.4.1. Given the ideal vector composition function ⊙, the equivalence between Eq. 4.5 and Eq. 4.6 holds, i.e.:

T⃗ = ∑_{n ∈ N(T)} s(n) = ∑_{τi ∈ S(T)} ωi f(τi)
We demonstrate Theorem 4.4.1 by showing that s(n) computes the weighted sum of vectors for the subtrees rooted in n (see Theorem 4.4.2).

Definition 4.4.1. Let n be a node of tree T. We define R(n) = {τ | τ is a subtree of T rooted in n}.

We need to introduce a simple lemma, whose proof is trivial.

Lemma 4.4.1. Let τ be a tree with root node n. Let c1, . . . , cm be the children of n. Then R(n) is the set of all trees τ′ = (n, τ1, . . . , τm) such that τi ∈ R(ci) ∪ {ci}.

Now we can show that function s(n) computes exactly the weighted sum of the distributed tree fragments for all the subtrees rooted in n.

Theorem 4.4.2. Let n be a node of tree T. Then s(n) = ∑_{τ ∈ R(n)} √λ^(|τ|−1) f(τ).
Proof. The theorem is proved by structural induction.

Basis. Let n be a terminal node. Then we have R(n) = ∅. Thus, by its definition, s(n) = 0⃗ = ∑_{τ ∈ R(n)} √λ^(|τ|−1) f(τ).

Step. Let n be a node with children c1, . . . , cm. The inductive hypothesis is then s(ci) = ∑_{τ ∈ R(ci)} √λ^(|τ|−1) f(τ). Applying the inductive hypothesis, the definition of s(n) and Property 4.3.2.3, we have

s(n) = n⃗ ⊙ (c⃗1 + √λ s(c1)) ⊙ . . . ⊙ (c⃗m + √λ s(cm))
     = n⃗ ⊙ (c⃗1 + ∑_{τ1 ∈ R(c1)} √λ^(|τ1|) f(τ1)) ⊙ . . . ⊙ (c⃗m + ∑_{τm ∈ R(cm)} √λ^(|τm|) f(τm))
     = n⃗ ⊙ (∑_{τ1 ∈ T1} √λ^(|τ1|) f(τ1)) ⊙ . . . ⊙ (∑_{τm ∈ Tm} √λ^(|τm|) f(τm))
     = ∑_{(n,τ1,...,τm) ∈ {n}×T1×...×Tm} √λ^(|τ1|+...+|τm|) n⃗ ⊙ f(τ1) ⊙ . . . ⊙ f(τm)

where Ti is the set R(ci) ∪ {ci}. Thus, by means of Lemma 4.4.1 and the definition of f, we can conclude that s(n) = ∑_{τ ∈ R(n)} √λ^(|τ|−1) f(τ).
4.4.2 Distributed Subpath Tree Kernel
In this section, the Distributed Tree framework is applied to the Subpath Tree Kernel
(STK) for unordered trees. For a review of this kernel function see Section 2.4.1.2.
4.4.2.1 Distributed Tree Fragments for the Subpath Tree Kernel
A subpath of a tree is formally defined as a substring of a path from the root to one of
the leaves of the tree.
S( (A (B W1) (C (D W2) (E W3))) ) =
  { A-B, A-B-W1, B-W1, A-C-D-W2, C-D-W2, A-C-D, A-C, C-D, D-W2,
    A-C-E-W3, C-E-W3, A-C-E, C-E, E-W3, A, B, W1, C, D, E, W2, W3 }

Figure 4.7: Tree fragments for the subpath tree kernel; each subpath is written as a dash-separated sequence of node labels.
Function S(T) can be defined accordingly. Given a tree T, S(T) contains every sequence of symbols a0 . . . an where ai is a direct descendant of ai−1 for any i > 0. The distributed tree fragments are then a⃗0 ⊙ (a⃗1 ⊙ (. . . ⊙ a⃗n) . . .).
Figure 4.7 proposes an example of the above function. In Kimura et al. (2011), the weight w for the subpath feature p for a given tree T is:

w = num(T_p) √λ^(|p|)

where |p| is the length of subpath p and num(T_p) is the number of times subpath p appears in tree T. The related distributed tree fragments can be easily derived.
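The subpath multiset, from which num(T_p) and the weights can be read off, can be enumerated with a sketch like the following (the tree encoding and names are ours):

```python
from collections import Counter

def subpaths(tree):
    """Count every subpath of a tree, i.e. every substring of a
    root-to-leaf path, keyed by its tuple of node labels."""
    counts = Counter()

    def walk(node, open_prefixes):
        label, children = node
        # Extend every subpath ending at the parent, and start a new one here.
        here = [p + (label,) for p in open_prefixes] + [(label,)]
        counts.update(here)
        for c in children:
            walk(c, here)

    walk(tree, [])
    return counts
```

For the example tree of Figure 4.7 this yields the 22 subpaths shown in the figure, each occurring once.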
4.4.2.2 Recursively Computing Distributed Trees for the Subpath Tree Kernel

As in the case of the TK, we can define a distributed tree representation T⃗ for tree T such that the kernel function can be approximated by the explicit dot product, i.e. STK(T1, T2) ≈ T⃗1 · T⃗2. In this case, each standard versor p⃗i of the implicit feature space R^m corresponds to a possible subpath pi. Thus, the distributed tree is:

T⃗ = ∑_{p ∈ P(T)} √λ^(|p|) p⃗    (4.7)

where P(T) is the set of subpaths of T and p⃗ is the distributed tree fragment vector for subpath p. Subpaths can be seen as trees where each node has at most one child, thus their DTF representation is the same.
Again, an explicit enumeration of the subpaths of a tree T is impractical, so an efficient way to compute T⃗ is needed. The formulation in Eq. 4.6 is still valid, as long as we define the recursive function s(n) as follows:

s(n) = √λ (n⃗ + n⃗ ⊙ ∑_{c ∈ C(n)} s(c))    (4.8)

where C(n) is the set of children of node n.
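A sketch of this recursion, with the same illustrative encoding and parameters as before; `comp` must distribute over sums, which is what Property 4.3.2.3 guarantees for the ideal composition:

```python
import numpy as np

def distributed_subpath_tree(tree, vec, comp, lam):
    """T_vec = sum over nodes n of s(n), with
    s(n) = sqrt(lam) * (n_vec + n_vec o sum_{c in C(n)} s(c))  (Eq. 4.8)."""
    total = None

    def s(node):
        nonlocal total
        label, children = node
        n = vec(label)
        if children:
            sn = np.sqrt(lam) * (n + comp(n, sum(s(c) for c in children)))
        else:
            sn = np.sqrt(lam) * n                # a terminal is a length-1 path
        total = sn if total is None else total + sn
        return sn

    s(tree)
    return total
```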
A theorem like Theorem 4.4.1 must be proved:

Theorem 4.4.3. Given the ideal vector composition function ⊙, the following equivalence holds:

T⃗ = ∑_{n ∈ N(T)} s(n) = ∑_{p ∈ P(T)} √λ^(|p|) p⃗    (4.9)
To prove Theorem 4.4.3, we introduce a definition and two simple lemmas, whose proofs are trivial. In the following, we will denote by (n|p) the concatenation of a node n with a path p.

Definition 4.4.2. Let n be a node of a tree T. We define P(n) = {p | p is a subpath of T starting with n}.

Lemma 4.4.2. Let n be a tree node and C(n) the (possibly empty) set of its children. Then P(n) = {n} ∪ ⋃_{c ∈ C(n)} ⋃_{p′ ∈ P(c)} {(n|p′)}.
Lemma 4.4.3. Let p = (n|p′) be the path given by the concatenation of node n and path p′. Then p⃗ = n⃗ ⊙ p⃗′.

Now we can show that function s(n) computes exactly the sum of the DTFs for all the possible subpaths starting with n.

Theorem 4.4.4. Let n be a node of tree T. Then s(n) = ∑_{p ∈ P(n)} √λ^(|p|) p⃗.
Proof. The theorem is proved by structural induction.

Basis. Let n be a terminal node. Then we have P(n) = {n}. Thus, by its definition, s(n) = √λ n⃗ = ∑_{p ∈ P(n)} √λ^(|p|) p⃗.

Step. Let n be a node with children set C(n). The inductive hypothesis is then ∀c ∈ C(n). s(c) = ∑_{p ∈ P(c)} √λ^(|p|) p⃗. Applying the inductive hypothesis, the definition of s(n) and Property 4.3.2.3, we have

s(n) = √λ (n⃗ + n⃗ ⊙ ∑_{c ∈ C(n)} s(c))
     = √λ (n⃗ + n⃗ ⊙ ∑_{c ∈ C(n)} ∑_{p ∈ P(c)} √λ^(|p|) p⃗)
     = √λ n⃗ + ∑_{c ∈ C(n)} ∑_{p ∈ P(c)} √λ^(|p|+1) n⃗ ⊙ p⃗

Thus, by means of Lemmas 4.4.2 and 4.4.3, we can conclude that s(n) = ∑_{p ∈ P(n)} √λ^(|p|) p⃗.
Figure 4.8: Tree Fragments for the Route Tree Kernel.
4.4.3 Distributed Route Tree Kernel

The aim of this section is to introduce the distributed version of the Route Tree Kernel (RTK) for positional trees. For a review of this kernel function see Section 2.4.1.2.

4.4.3.1 Distributed Tree Fragments for the Route Tree Kernel

The features considered by the RTK are the routes between a node ni and any of its descendants nj, together with the label of node nj. An example is given in Figure 4.8, which shows the example tree in this new form. The last node of each route carries its label, whereas all the other nodes do not. The weight ω of route π(ni, nj) is:

ω = √λ^(|π(ni,nj)|)

The transposition of these routes to distributed routes is straightforward. The indexes Pn[e] are treated as new node labels. Thus, the distributed tree fragment associated with the route π(n1, nk) ended by node nk is:

π⃗_{n1,nk} = (P⃗_{n1}[(n1, n2)] ⊙ (P⃗_{n2}[(n2, n3)] ⊙ . . . ⊙ P⃗_{nk−1}[(nk−1, nk)]) . . .) ⊙ n⃗k
The distributed tree is then:

T⃗ = ∑_{π ∈ Π(T)} √λ^(|π|) π⃗    (4.10)

where Π(T) is the set of valid routes of T.
4.4.3.2 Recursively Computing Distributed Trees for the Route Tree Kernel

An efficient way to compute T⃗ is needed also in the case of the route features of a tree T. The formulation in Eq. 4.6 is still valid, as long as we define the recursive function s(n) as follows:

s(n) = n⃗ + √λ ∑_{c ∈ C(n)} P⃗_n[(n, c)] ⊙ s(c)    (4.11)
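A sketch of the route recursion; `pos_vec(i)` stands for the vector assigned to the positional index P_n[(n, c)] of the i-th child (an illustrative choice, as are the other names):

```python
import numpy as np

def distributed_route_tree(tree, vec, pos_vec, comp, lam):
    """T_vec = sum over nodes n of s(n), with
    s(n) = n_vec + sqrt(lam) * sum_c pos_vec o s(c)   (Eq. 4.11)."""
    total = None

    def s(node):
        nonlocal total
        label, children = node
        sn = vec(label)                       # the empty route pi(n, n)
        for i, c in enumerate(children):
            sn = sn + np.sqrt(lam) * comp(pos_vec(i), s(c))
        total = sn if total is None else total + sn
        return sn

    s(tree)
    return total
```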
A theorem like Theorem 4.4.1 can be demonstrated to show that Eq. 4.6 with the above definition of s(n) computes exactly the same as Eq. 4.5, where the tree fragments are the routes of the tree.

Theorem 4.4.5. Given the ideal vector composition function ⊙, the following equivalence holds:

T⃗ = ∑_{n ∈ N(T)} s(n) = ∑_{π ∈ Π(T)} √λ^(|π|) π⃗    (4.12)
To prove Theorem 4.4.5, we introduce a definition and a simple lemma, whose proof is trivial.

Definition 4.4.3. Let n be a node of a tree T. We define Π(n) = {π(n, m) | m is a descendant of n}.

Lemma 4.4.4. Let n be a tree node, m a child of n, and l a descendant of m. Then π⃗_{n,l} = P⃗_n[(n, m)] ⊙ π⃗_{m,l}.

Now we can show that function s(n) computes exactly the sum of the DTFs for all the possible routes starting in n.

Theorem 4.4.6. Let n be a node of tree T. Then s(n) = ∑_{π ∈ Π(n)} √λ^(|π|) π⃗.
Proof. The theorem is proved by structural induction.

Basis. Let n be a terminal node. Then we have Π(n) = {π_{n,n}}. Thus, by its definition, s(n) = n⃗ = π⃗_{n,n} = ∑_{π ∈ Π(n)} √λ^(|π|) π⃗.

Step. Let n be a node with children set C(n). The inductive hypothesis is then ∀c ∈ C(n). s(c) = ∑_{π ∈ Π(c)} √λ^(|π|) π⃗. Applying the inductive hypothesis, the definition of s(n) and Property 4.3.2.3, we have

s(n) = n⃗ + √λ ∑_{c ∈ C(n)} P⃗_n[(n, c)] ⊙ s(c)
     = π⃗_{n,n} + √λ ∑_{c ∈ C(n)} P⃗_n[(n, c)] ⊙ ∑_{π ∈ Π(c)} √λ^(|π|) π⃗
     = π⃗_{n,n} + ∑_{c ∈ C(n)} ∑_{π ∈ Π(c)} √λ^(|π|+1) P⃗_n[(n, c)] ⊙ π⃗

Thus, by means of Lemma 4.4.4 and the definition of tree node descendants, we can conclude that s(n) = ∑_{π ∈ Π(n)} √λ^(|π|) π⃗.
4.5 Evaluation and Experiments

Distributed tree kernels are an attractive counterpart to traditional tree kernels, being linear and thus much faster to compute. But these kernels are only an approximation of the original ones. In the previous sections, we demonstrated that, given an ideal vector composition function, the approximation is exact, and we showed that good approximations of the ideal vector composition functions exist. In this section, we first investigate how fast the distributed tree kernels are, and then how well they approximate the original kernels.
The rest of the section is organized as follows. Section 4.5.2 analyzes the theoretical
and practical complexity of the distributed tree kernel, compared with the original tree
kernel (Collins and Duffy, 2002) and some of its improvements (Moschitti, 2006b;
Rieck et al., 2010; Pighin and Moschitti, 2010). Section 4.5.3 then proposes a direct
and task-based evaluation of how well DTKs approximate TKs.
4.5.1 Trees for the Experiments

The following experiments were performed using both trees taken from actual linguistic corpora and artificially generated trees. This section describes these two data classes.
4.5.1.1 Linguistic Parse Trees and Linguistic Tasks
For the task-based experiments, we used standard datasets for the two NLP tasks of
Question Classification (QC) and Recognizing Textual Entailment (RTE).
For QC, we used a standard training and test set3, where the test set is the set of 500 TREC 2001 test questions. To measure the task performance, we used a question multi-classifier built by combining n binary SVMs according to the ONE-vs-ALL scheme, where the final output class is the one associated with the most probable prediction.
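The ONE-vs-ALL combination can be sketched as follows; the names are ours, and the per-class scores would come from the n binary SVMs' decision values:

```python
import numpy as np

def one_vs_all_predict(scores):
    """Combine n binary classifiers ONE-vs-ALL style: scores[i, c] is
    classifier c's decision value for example i; the predicted class
    of each example is the classifier with the highest score."""
    return np.argmax(np.asarray(scores), axis=1)
```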
For RTE, we considered the corpora from the first challenge to the fifth (Dagan et al., 2006), except for the fourth, which has no training set. These sets are referred to as RTE1-5. The dev/test distribution for RTE1-3 and RTE5 is respectively 567/800, 800/800, 800/800, and 600/600 T-H pairs. We used these sets for the traditional task of pair-based entailment recognition, where a text-hypothesis pair p = (t, h) is assigned a positive or negative entailment class.
As a final specification of the experimental setting, Charniak's parser (Charniak, 2000) was used to produce the syntactic interpretations of the sentences.
4.5.1.2 Artificial Trees

As in Rieck et al. (2010), many of the following experiments considered artificial trees along with linguistic parse trees. The artificial trees are generated from a set of n node labels, divided into terminal and non-terminal labels. A maximum out-degree d for the tree nodes is chosen. The trees are generated recursively by building tree nodes whose labels and numbers of children are picked at random, according to a uniform distribution, until all tree branches end in a terminal node.

The Artificial Corpus of trees used in the following experiments is a set of 1000 trees generated randomly according to the described procedure. The label set contains 6 terminal and 6 non-terminal labels. The maximum out-degree is 3, and the trees are composed of 30 nodes on average.
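The generation procedure can be sketched as follows; the depth cap is our own safeguard against unbounded recursion and is not part of the procedure described above:

```python
import random

def random_tree(nonterminals, terminals, max_degree, rng,
                depth=0, max_depth=50):
    """Grow one artificial tree top-down: labels are drawn uniformly,
    and a branch ends as soon as a terminal label is drawn."""
    pool = terminals if depth >= max_depth else nonterminals + terminals
    label = rng.choice(pool)
    if label in terminals:
        return (label, [])                      # branch ends at a terminal
    n_children = rng.randint(1, max_degree)     # uniform number of children
    return (label, [random_tree(nonterminals, terminals, max_degree, rng,
                                depth + 1, max_depth)
                    for _ in range(n_children)])
```

A corpus like the one above would repeat this 1000 times, with 6 terminal and 6 non-terminal labels and max_degree = 3.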
3The QC set is available at http://l2r.cs.uiuc.edu/˜cogcomp/Data/QA/QC/
Algorithm     | Tree preparation | Learning          | Classification
              | Time     Space   | Time       Space  | Time       Space
TK            | -        -       | O(n²)      O(n²)  | O(n²)      O(n²)
FTK           | -        -       | A(n)       O(n²)  | A(n)       O(n²)
FTK-with-FS   | -        -       | A(n)       O(n²)  | k          k
ATK           | -        -       | O(n²/qω)   O(n²)  | O(n²/qω)   O(n²)
DTK           | O(n)     d       | d          d      | d          d

Table 4.3: Computational time and space complexities for several tree kernel techniques: n is the tree dimension, qω is a speed-up factor, k is the size of the selected feature set, d is the dimension of space R^d, O(·) is the worst-case complexity, and A(·) is the average-case complexity.
4.5.2 Complexity Comparison

In this section, we focus on how the distributed tree kernel affects the worst-case complexity and the practical average computation time, when applied to the original tree kernel (Collins and Duffy, 2002). Similar conclusions can be drawn for the other two kernels.
4.5.2.1 Analysis of the Worst-case Complexity

The initial formulation of the tree kernel by Collins and Duffy (2002) has a worst-case complexity that is quadratic in the number of nodes, in both time and space. Several studies have proposed methods for controlling the average execution time. Among them are the Fast Tree Kernels (FTK) by Moschitti (2006b) and the Approximate Tree Kernels (ATK) by Rieck et al. (2010) (see Sec. 2.4.2).

With respect to these methods, distributed tree kernels change the perspective, since each kernel computation consists only of a vector dot product of constant complexity, proportional to the dimension of the reduced space. The vector preparation still has linear complexity with respect to the number of nodes, but this computation is performed only once for each tree in the corpus. The differences between these approaches are summarized in Table 4.3.
4.5.2.2 Average Computation Time

Since it is a relevant matter also for the Subpath Tree Kernel and the Route Tree Kernel, and in order to get an idea of the practical applicability of the distributed tree kernels, we measured the average computation time of FTK (Moschitti, 2006b) and DTK (with vector size 8192) on a set of trees derived from the Question Classification corpus. As these trees are parse trees, the FTK has an average linear complexity. The reported results are thus useful also for understanding the behavior of the other two kernels. In fact, the best implementations of the STK behave linearly with respect to the number of nodes of the trees (Kimura and Kashima, 2012).

Figure 4.9 shows the relation between the computation time and the size of the trees, computed as the total number of nodes in the two trees. As expected, DTK has a constant computation time, since it is independent of the size of the trees. On the other hand, the computation time for FTK, while lower for smaller trees, grows very quickly with the tree size. The larger the trees considered, the greater the computational advantage offered by using DTK instead of FTK.
4.5.3 Experimental Evaluation
In this section, we report on two experimental evaluations performed to estimate how
well the distributed tree kernels approximate the original tree kernels. The first set
of experiments is a direct evaluation: we compared the similarities computed by the
DTKs to those computed by the TKs. The second set of experiments is a task-based
[Plot omitted: computation time (ms, log scale) vs. sum of nodes in the trees, for FTK and DTK.]

Figure 4.9: Computation time of FTK and DTK (with d = 8192) for tree pairs with an increasing total number of nodes, on a 1.6 GHz CPU.
evaluation, where we investigated the behavior of the distributed tree kernels in two
NLP tasks.
4.5.3.1 Direct Comparison
These experiments test the ability of DTKs to emulate the corresponding TKs. We
compared the similarities derived by the traditional tree kernels with those derived
by the distributed versions of the kernels. For each set of trees (see Sec. 4.5.1), we
considered the Spearman’s correlation of DTK values with respect to TK values. Each
table reports the correlations for the three sets with different values of λ and with
                     Dim. 512  Dim. 1024  Dim. 2048  Dim. 4096  Dim. 8192
Artificial  λ=0.20   0.792     0.869      0.931      0.96       0.978
Corpus      λ=0.40   0.669     0.782      0.867      0.925      0.956
            λ=0.60   0.34      0.454      0.59       0.701      0.812
            λ=0.80   0.06      0.058      0.075      0.141      0.306
            λ=1.00   0.018     0.017      -0.043     -0.019     0.112
QC          λ=0.20   0.943     0.961      0.981      0.99       0.994
Corpus      λ=0.40   0.894     0.925      0.961      0.978      0.989
            λ=0.60   0.571     0.621      0.73       0.804      0.88
            λ=0.80   0.165     0.148      0.246      0.299      0.377
            λ=1.00   0.037     0.014      0.06       0.108      0.107
RTE         λ=0.20   0.969     0.983      0.991      0.996      0.998
Corpus      λ=0.40   0.849     0.888      0.919      0.943      0.961
            λ=0.60   0.152     0.207      0.245      0.299      0.343
            λ=0.80   0.002     0.027      0.021      0.041      0.026
            λ=1.00   0.018     0.023      0.018      0.003      0

Table 4.4: Spearman's correlation of DTK values with respect to TK values, on trees taken from the three data sets.
different vector space dimensions (d = 512, 1024, 2048, 4096, and 8192).
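The direct comparison can be sketched as follows; this tie-free Spearman implementation and the function name are ours (a library routine such as scipy.stats.spearmanr would do the same job):

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation as the Pearson correlation of the
    ranks (this simple version assumes no ties in x or y)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Applied to the two lists of kernel values computed on the same tree pairs, one by a traditional kernel and one by its distributed version, it produces the entries of these tables.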
Distributed Tree Kernel The Spearman's correlation results in Table 4.4 show that DTK does not adequately approximate TK for λ = 1. Nonetheless, DTK's performance improves dramatically when parameter λ is reduced. The difference is most notable on the linguistic corpora, thus highlighting once more DTK's difficulty in correctly handling large trees. This is probably due to the influence of noise in the distributed tree representations.
Distributed Subpath Tree Kernel The Spearman’s correlation results for the Sub-
path Tree Kernel are reported in Table 4.5. In this case, the correlation is extremely
promising even for high values of λ. This is most likely because the feature space
                     Dim. 1024  Dim. 2048  Dim. 4096  Dim. 8192
Artificial  λ=0.20   0.958      0.979      0.992      0.995
Corpus      λ=0.40   0.945      0.968      0.988      0.993
            λ=0.60   0.919      0.952      0.981      0.989
            λ=0.80   0.876      0.926      0.968      0.983
            λ=1.00   0.804      0.878      0.944      0.971
QC          λ=0.20   0.994      0.997      0.999      0.999
Corpus      λ=0.40   0.991      0.997      0.998      0.999
            λ=0.60   0.986      0.994      0.997      0.998
            λ=0.80   0.975      0.99       0.994      0.996
            λ=1.00   0.953      0.98       0.989      0.993
RTE         λ=0.20   0.995      0.998      0.999      0.999
Corpus      λ=0.40   0.995      0.998      0.999      0.999
            λ=0.60   0.994      0.998      0.999      0.999
            λ=0.80   0.992      0.996      0.998      0.998
            λ=1.00   0.986      0.993      0.996      0.997

Table 4.5: Spearman's correlation of SDTK values with respect to STK values, on trees taken from the three data sets.
of STK is smaller than that of TK, for trees of the same size. It should be noted, however, that STK is commonly used on trees larger than the ones considered for this experiment. Interestingly, though, the best performances are achieved on the linguistic corpora, where DTK showed the most limitations.
Distributed Route Tree Kernel The Spearman’s correlation results for the Route
Tree Kernel are reported in Table 4.6. The results are analogous to those obtained for
STK. This is not surprising, since their feature spaces are quite similar.
4.5.3.2 Task-based Experiments
In this section, we report on the experiments aimed at comparing the performance of
the distributed tree kernels with respect to the corresponding tree kernels on actual
                     Dim. 1024  Dim. 2048  Dim. 4096  Dim. 8192
Artificial  λ=0.20   0.962      0.982      0.994      0.995
Corpus      λ=0.40   0.953      0.974      0.991      0.992
            λ=0.60   0.932      0.962      0.985      0.988
            λ=0.80   0.897      0.945      0.976      0.983
            λ=1.00   0.846      0.919      0.964      0.975
QC          λ=0.20   0.992      0.998      0.999      1
Corpus      λ=0.40   0.99       0.997      0.999      0.999
            λ=0.60   0.988      0.996      0.999      0.999
            λ=0.80   0.986      0.994      0.998      0.999
            λ=1.00   0.98       0.99       0.996      0.998
RTE         λ=0.20   0.991      0.997      0.998      0.999
Corpus      λ=0.40   0.987      0.995      0.998      0.999
            λ=0.60   0.981      0.991      0.996      0.998
            λ=0.80   0.97       0.984      0.992      0.996
            λ=1.00   0.948      0.968      0.985      0.992

Table 4.6: Spearman's correlation of RDTK values with respect to RTK values, on trees taken from the three data sets.
linguistic tasks. The tasks considered are Question Classification (QC) and Recognizing Textual Entailment (RTE). For these experiments, we considered two versions of the distributed tree kernels, using two composition functions: we denote by DTK_γ and DTK_⊛ the DTKs that use shuffled γ-product and shuffled circular convolution, respectively. An analogous notation is used for SDTK and RDTK. All distributed tree kernels rely on vectors of dimension 8192.
Performance on Question Classification task This experiment compared the performance of DTKs with respect to TK on the actual task of Question Classification. The experimental setting is the one described in Section 4.5.1.1.

The results for the DTKs are shown in Figure 4.10. DTKs lead to worse performance than TK, but the gap is narrower for small values of λ (λ ≤ 0.4).
[Plot omitted: accuracy vs. λ for TK, DTK_γ and DTK_⊛.]

Figure 4.10: Performance on the Question Classification task of TK, DTK_γ and DTK_⊛ (d = 8192) for several values of λ.
These are the values usually adopted, since they produce better performances for the task. Moreover, it can be noted that DTK_⊛ behaves better than DTK_γ for smaller values of λ, while the opposite is true for larger values of λ. An explanation of this phenomenon may be given in light of the results of the experiments in Section 4.3.4.3. Since the norm of large vector compositions under the γ-product tends to drop greatly, their final weight is smaller than expected. In other words, DTK_γ adds an implicit decay factor to the explicit one. Thus, adopting larger values of λ affects DTK_γ less heavily than it does DTK_⊛ and TK itself.
[Plot omitted: accuracy vs. λ for STK, SDTK_γ and SDTK_⊛.]

Figure 4.11: Performance on the Question Classification task of STK, SDTK_γ and SDTK_⊛ (d = 8192) for several values of λ.
The results for the SDTKs and the RDTKs are shown in Figure 4.11 and Figure 4.12 respectively. SDTKs and RDTKs behave similarly to DTKs. Their performances are very similar for small values of λ, while the gap increases for higher values of λ. In these cases, STK and RTK seem to constantly gain accuracy as λ increases. The γ-product variants achieve mostly stable performances, while the circular-convolution variants show a performance decay for higher values of λ. It should be noted, though, that the range of accuracy obtained by the Subpath and Route Tree Kernels is much narrower than the one obtained by the Tree Kernel.
[Plot omitted: accuracy vs. λ for RTK, RDTK_γ and RDTK_⊛.]

Figure 4.12: Performance on the Question Classification task of RTK, RDTK_γ and RDTK_⊛ (d = 8192) for several values of λ.
Performance on Textual Entailment Recognition task For this experiment, the setting is again the one described in Section 4.5.1.1. For our comparative analysis, we used the syntax-based approach described in Moschitti and Zanzotto (2007) with two kernel function schemes: (1) PK_S(p1, p2) = K_S(t1, t2) + K_S(h1, h2); and (2) PK_{S+Lex}(p1, p2) = Lex(t1, h1)Lex(t2, h2) + K_S(t1, t2) + K_S(h1, h2). Lex is the lexical similarity between T and H, computed using WordNet-based metrics as in Corley and Mihalcea (2005). This feature is used in combination with the basic kernels and gives an important boost to their performances. K_S is realized with TK, DTK_γ, and DTK_⊛. In the plots, the different PK_S kernels are referred to as TK, DTK_γ, and DTK_⊛, whereas the different PK_{S+Lex} kernels are referred to as TK + Lex, DTK_γ + Lex, and DTK_⊛ + Lex. Analogous notations are used for the Subpath and Route Tree Kernels. For the computation of the similarity feature Lex (Corley and Mihalcea, 2005), we exploited the Jiang&Conrath distance (Jiang and Conrath, 1997) computed using the wn::similarity package (Pedersen et al., 2004). As for the QC task, we considered several values of λ.
[Plot omitted: accuracy vs. λ for TK, DTK_γ and DTK_⊛, with and without the Lex feature.]

Figure 4.13: Performance on the Recognizing Textual Entailment task of TK, DTK_γ and DTK_⊛ (d = 8192) for several values of λ. Each point is the average accuracy on the 4 data sets.
Accuracy results for DTKs are reported in Figure 4.13. The results lead to conclusions similar to the ones drawn from the QC experiments. For λ ≤ 0.4, DTK_γ and DTK_⊛ are similar to TK. Differences are not statistically significant, except for λ = 0.4, where DTK_⊛ behaves better than TK (with p < 0.1). Statistical significance is computed using the two-sample Student's t-test. DTK_γ + Lex and DTK_⊛ + Lex are statistically similar to TK + Lex for any value of λ.
[Plot omitted: accuracy vs. λ for STK, SDTK_γ and SDTK_⊛, with and without the Lex feature.]

Figure 4.14: Performance on the Recognizing Textual Entailment task of STK, SDTK_γ and SDTK_⊛ (d = 8192) for several values of λ. Each point is the average accuracy on the 4 data sets.
Accuracy results for SDTKs and RDTKs are reported in Figure 4.14 and Fig-
ure 4.15 respectively. The behavior is very similar to that of DTKs. In this case, the
performances are even slightly higher for the Distributed Kernels than for the original
[Plot omitted: accuracy vs. λ for RTK, RDTK_γ and RDTK_⊛, with and without the Lex feature.]

Figure 4.15: Performance on the Recognizing Textual Entailment task of RTK, RDTK_γ and RDTK_⊛ (d = 8192) for several values of λ. Each point is the average accuracy on the 4 data sets.
Subpath and Route Kernels, and they seem to be less heavily affected by the values of
parameter λ.
5 A Distributed Approach to a Symbolic Task: Distributed Representation Parsing
Syntactic processing is widely considered an important activity in natural language
understanding (Chomsky, 1957). Research in natural language processing (NLP) pos-
itively exploits this hypothesis in models and systems. Syntactic features improve per-
formances in high level tasks such as question answering (Zhang and Lee, 2003), se-
mantic role labeling (Gildea and Jurafsky, 2002; Pradhan et al., 2005; Moschitti et al.,
2008; Collobert et al., 2011), paraphrase detection (Socher et al., 2011), and textual en-
tailment recognition (MacCartney et al., 2006; Wang and Neumann, 2007b; Zanzotto
et al., 2009).
Classification and learning algorithms are key components in the above models and in current NLP systems. But these algorithms cannot directly use syntactic structures: the relevant parts of phrase structure trees or dependency graphs are explicitly or implicitly stored in feature vectors, as explained in Chapter 4. Structural kernels make it possible to exploit high-dimensional spaces of syntactic tree fragments while concealing their complexity. Even in kernel machines, then, symbolic syntactic structures act only as proxies between the source sentences and the syntactic feature vectors: syntactic structures are exploited to build, or stand for, the vectors used by the final algorithms when learning and applying classifiers for high-level tasks.
In this chapter, we explore an alternative way to use syntax in feature spaces: the Distributed Representation Parsers (DRP). The core idea is straightforward: DRPs directly bridge the gap between sentences and syntactic feature spaces, acting as syntactic parsers and feature extractors at the same time. We leverage the distributed tree framework, introduced in Chapter 4, and multiple linear regression models to learn linear DRPs from training data. DRPs are compared to the traditional processing chain, i.e. a symbolic parser followed by the construction of distributed trees, in experiments on the Penn Treebank data set (Marcus et al., 1993). Results show that DRPs produce distributed trees significantly better than those obtained by traditional methods in the same non-lexicalized conditions, and competitive with those obtained by traditional lexicalized methods. Moreover, DRPs are shown to be dramatically faster than traditional methods.
The rest of the chapter is organized as follows. First, we introduce and describe the
DRP model, as a follow-up to the distributed trees framework (Section 5.1). Then, we
report on the experiments (Section 5.2).
5.1 Distributed Representation Parsers
In Chapter 4 we showed how the widespread use of tree kernels has obscured the fact that syntactic trees are ultimately used as vectors in learning algorithms. Stemming from the research on Distributed Representations (Hinton et al., 1986; Bengio, 2009; Collobert et al., 2011; Socher et al., 2011), we proposed the distributed trees (DT) framework as a solution to the problem of representing high-dimensional implicit feature vectors through smaller but explicit vectors.
We showed through experimental results that distributed trees are good representations of syntactic trees, which we can use in the definition of distributed representation parsers.

Figure 5.1: "Parsing" with distributed structures in perspective.
In the following, we sketch the idea of Distributed Representation “Parsers”. Then,
we describe how to build DRPs by combining a function that encodes sentences into
vectors and a linear regressor that can be induced from training data.
5.1.1 The Idea
The standard approach to using syntax in learning algorithms follows two steps: first, sentences s are parsed with a symbolic parser (e.g. Collins (2003); Charniak (2000); Nivre et al. (2007b)), producing symbolic trees t; second, tree kernels are used to exploit implicit syntactic feature vectors, or an encoder is used to build explicit ones. Figure 5.1 sketches this idea for the case where the final vectors are the distributed trees $\vec{t} \in \mathbb{R}^d$ introduced in Chapter 4. The function that builds a distributed tree $\vec{t}$ from a tree $t$ (see $f(T)$ in Eq. 4.3) will be referred to as the Distributed Tree Encoder (DT).
Our proposal is to build a Distributed Representation "Parser" that directly maps sentences s into the final vectors, i.e. the distributed trees. A DRP acts as follows (see Fig. 5.1): first, a function D encodes the sentence s into a distributed representation vector $\vec{s} \in \mathbb{R}^d$; second, a function P transforms the input vector $\vec{s}$ into a distributed tree $\vec{t}$. This second step is a vector-to-vector transformation and, in a wide sense, "parses" the input sentence.

Given an input sentence s, a DRP is then a function defined as follows:

\vec{t} = DRP(s) = P(D(s))    (5.1)

In this study, we design several functions D and propose a linear function P, designed to be a regressor that can be induced from training data. The vector space used has d dimensions for both sentences $\vec{s}$ and distributed trees $\vec{t}$, but, in general, the two spaces could have different sizes.
The chosen distributed trees implementation is the one referring to the feature space induced by the Collins and Duffy (2002) tree kernel, as described in Section 4.4.1. The shuffled circular convolution $\odot$ is selected as the vector composition function. We experiment with two tree fragment sets: the non-lexicalized set S_no-lex(t), where tree fragments do not contain words, and the lexicalized set S_lex(t), including all the tree fragments. An example is given in Figure 5.2.
5.1.2 Building the Final Function
To build a DRP, the encoder D and the transformer P must be defined. In the following, we present a non-lexicalized and a lexicalized model for the encoder D, and we describe how the transformer P can be learned by means of a linear regression model.
Figure 5.2: Subtrees of the tree t in Fig. 5.1. The non-lexicalized set S_no-lex(t) contains tree fragments without words, e.g. (S NP VP), (VP V NP), (NP PRP), (S (NP PRP) VP), (S (NP PRP) (VP V NP)), (VP V (NP DT NN)), ...; the lexicalized set S_lex(t) = S_no-lex(t) ∪ {...} adds fragments containing words, e.g. (S (NP (PRP We)) VP), (VP (V booked) (NP DT NN)), (VP (V booked) (NP (DT the) NN)), (VP (V booked) (NP DT (NN flight))), ...
5.1.2.1 Sentence Encoders
Establishing good models to encode input sentences into vectors is the most difficult challenge. These models should capture the kind of information that can lead to a correct syntactic interpretation: only in this way can the distributed representation parser act as a vector transformation module. Unlike models such as Socher et al. (2011), our encoder is required to represent the whole sentence as a fixed-size vector.
In the following, a non-lexicalized model and a lexicalized model are proposed.
Non-lexicalized model   The non-lexicalized model relies only on the pos-tags of the sentence s: s = p_1 ... p_n, where p_i is the pos-tag associated with the i-th token of the sentence. In the following we discuss how to encode this information in an $\mathbb{R}^d$ space. The basic model $D_1(s)$ considers the bag-of-postags, that is:

D_1(s) = \sum_i \vec{p}_i    (5.2)

where $\vec{p}_i \in N$ is the vector for label $p_i$, taken from the set of nearly orthonormal random vectors N, as defined in Section 4.3.1. This is basically in line with the bag-of-words model used in random indexing (Sahlgren, 2005). Due to the commutative property of the sum, and since vectors in N are nearly orthonormal: (1) two sentences with the same set of pos-tags have the same vector; and (2) the dot product between two vectors, $D_1(s_1)$ and $D_1(s_2)$, representing sentences $s_1$ and $s_2$, approximately counts how many pos-tags the two sentences have in common. The vector for the sentence in Figure 5.1 is then:

D_1(s) = \vec{PRP} + \vec{V} + \vec{DT} + \vec{NN}
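To make the encoding concrete, the bag-of-postags model of Eq. 5.2 can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation: the dimension d, the tag inventory, and the use of normalized Gaussian vectors as stand-ins for the nearly orthonormal random set N are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # dimension of the distributed space (4096, as in the experiments below)

def random_versor():
    # High-dimensional random unit vectors are nearly orthogonal to each other.
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

# One nearly orthonormal random vector per pos-tag (standing in for the set N).
N = {p: random_versor() for p in ["PRP", "V", "DT", "NN", "JJ"]}

def D1(pos_tags):
    # D_1(s) = sum_i p_i  (Eq. 5.2): the bag-of-postags encoding of a sentence.
    return sum(N[p] for p in pos_tags)

s1 = ["PRP", "V", "DT", "NN"]  # pos-tags of "We booked the flight"
s2 = ["PRP", "V", "JJ"]
# The dot product approximately counts the pos-tags the sentences share (PRP, V):
print(float(D1(s1) @ D1(s2)))  # ≈ 2
```

The approximation error comes only from the residual dot products between different random vectors, which shrink as d grows.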
The general non-lexicalized model that takes into account all n-grams of pos-tags, up to length j, is then the following:

D_j(s) = D_{j-1}(s) + \sum_i \vec{p}_i \odot \ldots \odot \vec{p}_{i+j-1}

where $\odot$ is again the shuffled circular convolution. An n-gram $p_i \ldots p_{i+j-1}$ of pos-tags is represented as $\vec{p}_i \odot \ldots \odot \vec{p}_{i+j-1}$. Given the properties of the shuffled circular convolution, an n-gram of pos-tags is associated with a versor, as it composes j versors, and two different n-grams have nearly orthogonal vectors. For example, the vector $D_3(s)$ for
the sentence in Figure 5.1 is:
D_3(s) = \vec{PRP} + \vec{V} + \vec{DT} + \vec{NN} + \vec{PRP} \odot \vec{V} + \vec{V} \odot \vec{DT} + \vec{DT} \odot \vec{NN} + \vec{PRP} \odot \vec{V} \odot \vec{DT} + \vec{V} \odot \vec{DT} \odot \vec{NN}
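The n-gram extension D_j can be sketched similarly. The implementation of the shuffled circular convolution below is a plausible reading of the operation $\odot$ (two fixed random permutations followed by circular convolution, computed via FFT); the permutations, dimension, and tag vectors are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4096
perm1, perm2 = rng.permutation(d), rng.permutation(d)

def versor():
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def scc(a, b):
    # Shuffled circular convolution: permute both operands, then circular
    # convolution via FFT. The permutations make the operation non-commutative.
    return np.real(np.fft.ifft(np.fft.fft(a[perm1]) * np.fft.fft(b[perm2])))

N = {p: versor() for p in ["PRP", "V", "DT", "NN"]}

def ngram_vector(gram):
    # An n-gram p_i ... p_{i+j-1} is represented as p_i ⊙ ... ⊙ p_{i+j-1}.
    g = N[gram[0]]
    for p in gram[1:]:
        g = scc(g, N[p])
    return g

def Dj(pos_tags, j):
    # D_j(s): sum of the composed vectors of all pos-tag n-grams of length 1..j.
    v = np.zeros(d)
    for n in range(1, j + 1):
        for i in range(len(pos_tags) - n + 1):
            v += ngram_vector(pos_tags[i:i + n])
    return v

# Different n-grams get nearly orthogonal, near-unit vectors; order matters:
ab, ba = ngram_vector(["PRP", "V"]), ngram_vector(["V", "PRP"])
print(abs(float(ab @ ba)))  # small: the composition is non-commutative
```

Because each composed n-gram vector is a near-versor nearly orthogonal to every other one, the dot product of two D_j encodings approximately counts shared pos-tag n-grams, mirroring the bag-of-postags case.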
Lexicalized model   Including lexical information is the hardest part of the overall model, as it makes vectors denser in information. Here we propose an initial model that is basically the same as the non-lexicalized model, but includes a vector representing the word in each unigram. The equation representing sentences as unigrams is:

D_1^{lex}(s) = \sum_i \vec{p}_i \odot \vec{w}_i

Vector $\vec{w}_i$ represents word $w_i$ and is taken from the set N of nearly orthonormal random vectors. This guarantees that $D_1^{lex}(s)$ is not lossy: given a word–postag pair (w, p), it is possible to know whether the sentence contains the pair, as $D_1^{lex}(s) \cdot (\vec{p} \odot \vec{w}) \approx 1$ if (w, p) is in sentence s and $D_1^{lex}(s) \cdot (\vec{p} \odot \vec{w}) \approx 0$ otherwise. Other vectors for representing words, e.g. distributional vectors or those obtained as look-up tables in deep learning architectures (Collobert and Weston, 2008), do not offer this guarantee.
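The non-lossiness claim can be checked numerically. As above, the shuffled circular convolution and the random vectors are illustrative assumptions; the point is only that $D_1^{lex}(s) \cdot (\vec{p} \odot \vec{w})$ behaves as an approximate membership test.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4096
perm1, perm2 = rng.permutation(d), rng.permutation(d)

def scc(a, b):
    # Shuffled circular convolution (illustrative): permute, then convolve via FFT.
    return np.real(np.fft.ifft(np.fft.fft(a[perm1]) * np.fft.fft(b[perm2])))

vecs = {}
def vec(symbol):
    # One nearly orthonormal random vector per symbol (pos-tag or word).
    if symbol not in vecs:
        v = rng.standard_normal(d)
        vecs[symbol] = v / np.linalg.norm(v)
    return vecs[symbol]

def D1_lex(sentence):
    # D_1^lex(s) = sum_i p_i ⊙ w_i over the (pos-tag, word) pairs of the sentence.
    return sum(scc(vec(p), vec(w)) for p, w in sentence)

s = [("PRP", "We"), ("V", "booked"), ("DT", "the"), ("NN", "flight")]
enc = D1_lex(s)
# Approximate membership test for a (pos-tag, word) pair:
print(float(enc @ scc(vec("V"), vec("booked"))))  # ≈ 1: the pair occurs in s
print(float(enc @ scc(vec("V"), vec("flight"))))  # ≈ 0: the pair does not occur
```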
The general equation for the lexicalized version of the sentence encoder follows:

D_j^{lex}(s) = D_{j-1}^{lex}(s) + \sum_i \vec{p}_i \odot \ldots \odot \vec{p}_{i+j-1}

This model is only an initial proposal to take lexical information into account.
5.1.2.2 Learning Transformers with Linear Regression
The transformer P of the DRP (see Eq. 5.1) can be seen as a linear regressor:

\vec{t} = \mathbf{D}\vec{s}    (5.3)

where $\mathbf{D}$ is a square matrix. This matrix can be estimated given a training set (T, S) of pairs $(\vec{t}_i, \vec{s}_i)$ of oracle vectors and sentence input vectors for sentences $s_i$. Oracle vectors are obtained by applying the Distributed Tree Encoder to the correct syntactic tree
for the sentence, provided by an oracle. Interpreting these sets as matrices, we need to solve a linear system of equations, i.e. $T = \mathbf{D}S$.

An approximate solution can be computed using Principal Component Analysis and Partial Least Squares Regression¹. This method relies on Moore–Penrose pseudo-inversion (Penrose, 1955). The pseudo-inverse matrix $S^+$ is obtained using singular value decomposition (SVD) and has the property $SS^+ = I$. Using the iterative method for computing the SVD (Golub and Kahan, 1965), we can obtain different approximations $S^+(k)$ of $S^+$ by considering k singular values. The final approximations of the DRPs are then $\mathbf{D}(k) = TS^+(k)$.

Matrices $\mathbf{D}$ are estimated by pseudo-inverting the matrices S of input vectors for sentences. Given the different input representations for sentences, we can estimate different DRPs: $DRP_1 = TS_1^+$, $DRP_2 = TS_2^+$, and so on. The best value for k is estimated on a separate parameter estimation set.
5.2 Experiments
We evaluated three issues for assessing DRP models: first, the performance of DRPs in reproducing oracle distributed trees (Section 5.2.2); second, the quality of the topology of the vector spaces of distributed trees produced by DRPs (Section 5.2.3); and, finally, the running time of DRPs (Section 5.2.4). Section 5.2.1 describes the experimental set-up.
¹An implementation of this method is available within the R statistical package (Mevik and Wehrens, 2007).
5.2.1 Experimental Set-up
Data   The data sets were derived from the Wall Street Journal (WSJ) portion of the English Penn Treebank (Marcus et al., 1993), using a standard data split: sections 2–21 for training (PT_train, 39,832 trees) and section 23 for testing (PT_23, 2,416 trees). Section 24 (PT_24, 1,346 trees) was used for parameter estimation. We produced the final data sets of distributed trees with three different λ values: λ=0, λ=0.2, and λ=0.4. For each λ, we have two versions of the data sets: a non-lexicalized version (no-lex), where syntactic trees are considered without words, and a lexicalized version (lex), where words are considered. Oracle trees t are transformed into oracle distributed trees $\vec{o}$ using the Distributed Tree Encoder DT (see Fig. 5.1). We experimented with two sizes of the distributed tree space $\mathbb{R}^d$: 4096 and 8192.
We have designed the data sets to determine how DRPs behave with λ values relevant for syntax-sensitive NLP tasks. Both tree kernels and distributed tree kernels achieve their best performance in tasks such as question classification, semantic role labeling, or textual entailment recognition with λ values in the range 0–0.4.
System Comparison   We compared the DRPs against the original way of producing distributed trees, in which the output of a symbolic parser (SP) is transformed into a distributed tree by the DT with the appropriate λ. We refer to this chain as the Distributed Symbolic Parser (DSP): DSP(s) = DT(SP(s)). Figure 5.3 reports the definitions of the original and the DRP processing chains, along with the procedure leading to oracle trees.

Figure 5.3: Processing chains for the production of the distributed trees: translation of oracle trees, distributed trees with symbolic parsers, and distributed representation parsing.

As the symbolic parser, we used Bikel's version (Bikel, 2004) of Collins' head-driven statistical parser (Collins, 2003). For a correct comparison, we used Bikel's parser with oracle part-of-speech tags. We experimented with two versions: (1) the original lexicalized method DSP_lex, i.e. the natural setting of the Collins/Bikel parser, and (2) a fully non-lexicalized version DSP_no-lex that exploits only part-of-speech tags. The latter version is obtained by removing words from the input sentences, leaving only the part-of-speech tags. We trained these DSPs on PT_train.
Parameter estimation   DRPs have two basic parameters: (1) the parameter k of the pseudo-inverse, that is, the number of considered singular values (see Sec. 5.1.2.2), and (2) the maximum length j of the n-grams considered by the encoder D_j (see Sec. 5.1.2.1). Parameter estimation was performed on the data sets derived from section PT_24 by maximizing a pseudo f-measure. Section 5.2.2 reports both the definition of the measure and the results of the parameter estimation.
5.2.2 Parsing Performance
The first issue to explore is whether DRPs are actually good "distributed syntactic parsers". We compare DRPs against the distributed symbolic parsers by evaluating how well these "distributed syntactic parsers" reproduce oracle distributed trees.
Method   We define the pseudo f-measure, a parsing performance measure that aims to reproduce the traditional f-measure on distributed trees. The pseudo f-measure is defined as follows:

f(\vec{t}, \vec{o}) = \frac{2\,\vec{t} \cdot \vec{o}}{\|\vec{t}\| + \|\vec{o}\|}

where $\vec{t}$ is the system's distributed tree and $\vec{o}$ is the oracle distributed tree. This measure computes a score that is similar to the traditional f-measure: $\vec{t} \cdot \vec{o}$ approximates the true positives, $\|\vec{t}\|$ approximates the number of observations, and $\|\vec{o}\|$ approximates the number of expectations. We compute the measure at a sentence-based (i.e. vector-based) granularity; results report average values.
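The pseudo f-measure is straightforward to compute; a minimal sketch follows (the example vectors are arbitrary):

```python
import numpy as np

def pseudo_f(t, o):
    # Pseudo f-measure: 2 (t . o) / (||t|| + ||o||).
    return 2 * float(t @ o) / (np.linalg.norm(t) + np.linalg.norm(o))

o = np.array([1.0, 0.0, 0.0])
print(pseudo_f(o, o))                         # 1.0: a perfect reconstruction
print(pseudo_f(np.array([0.0, 1.0, 0.0]), o)) # 0.0: an orthogonal vector
```

For unit-norm vectors the measure reduces to the cosine similarity, which is why it rewards distributed trees pointing in the oracle's direction.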
Estimated parameters   We estimated parameters k and j by training the different DRPs on the PT_train set and maximizing the pseudo f-measure of the DRPs on PT_24. The best pair of parameters is j=3 and k=3000. For completeness, we also report the best k values for the five different j we experimented with: k=47 for j=1 (the number of linearly independent vectors representing pos-tags), k=1300 for j=2, k=3000 for j=3, k=4000 for j=4, and k=4000 for j=5. For comparison, some result tables report results for the different values of j.
dim   | Model       | λ = 0  | λ = 0.2 | λ = 0.4
4096  | DRP_1       | 0.6449 | 0.5697  | 0.4596
      | DRP_2       | 0.7843 | 0.7014  | 0.5766
      | DRP_3       | 0.8167 | 0.7335  | 0.6084
      | DRP_4       | 0.8039 | 0.7217  | 0.5966
      | DRP_5       | 0.7892 | 0.7069  | 0.5831
      | DSP_no-lex  | 0.6504 | 0.5850  | 0.4806
      | DSP_lex     | 0.8129 | 0.7793  | 0.7099
8192  | DRP_3       | 0.8228 | 0.7392  | 0.6139
      | DSP_no-lex  | 0.6547 | 0.5891  | 0.4843
      | DSP_lex     | 0.8136 | 0.7795  | 0.7102

Table 5.1: Pseudo f-measure on PT_23 of the DRPs (with different j) and the DSPs on the non-lexicalized data sets, with different λ values and with the two dimensions of the distributed tree space (4096 and 8192).
Model    | λ = 0  | λ = 0.2 | λ = 0.4
DRP_3    | 0.6957 | 0.5997  | 0.0411
DSP_lex  | 0.9068 | 0.8558  | 0.6438

Table 5.2: Pseudo f-measure on PT_23 of DRP_3 and DSP_lex on the lexicalized data sets, with different λ values, on the distributed tree space with 4096 dimensions.
Results   Table 5.1 reports the results of the first set of experiments on the non-lexicalized data sets. The first block of rows (seven rows) reports the pseudo f-measure of the different methods on the distributed tree spaces with 4096 dimensions. The second block (the last three rows) reports the performance on the space with 8192 dimensions. The pseudo f-measure is computed on the PT_23 set. Although we already selected j=3 as the best parametrization (i.e. DRP_3), the first five rows of the first block report the results of the DRPs for five values of j, to give an idea of how the different DRPs behave. The last two rows of this block report the results of the two DSPs.

We can observe some important facts. First, DRPs exploiting 2-grams, 3-grams, 4-grams, and 5-grams of part-of-speech tags behave significantly better than the one using only 1-grams, for all values of λ. Distributed representation parsers need inputs that keep track of the sequences of pos-tags in sentences, but these sequences tend to confuse the model when too long. As expected, DRP_3 behaves better than all the other DRPs. Second, DRP_3 behaves significantly better than the comparable original parsing chain DSP_no-lex, which uses only part-of-speech tags and no lexical information. This happens for all values of λ. Third, DRP_3 behaves similarly to DSP_lex for λ=0. Both parsers use oracle pos-tags to emit sentence interpretations, but DSP_lex also exploits lexical information that DRP_3 cannot access. For λ=0.2 and λ=0.4, the more informed DSP_lex behaves significantly better than DRP_3, but DRP_3 still behaves significantly better than the comparable DSP_no-lex. All these observations also hold for the results obtained with 8192 dimensions.
Table 5.2 reports the results of the second set of experiments on the lexicalized data sets, performed on the 4096-dimension space. The first row reports the pseudo f-measure of DRP_3 trained on the lexicalized model and the second row reports the
Output  | Model       | λ = 0  | λ = 0.2 | λ = 0.4
No lex  | DRP_3       | 0.9490 | 0.9465  | 0.9408
        | DSP_no-lex  | 0.9033 | 0.9001  | 0.8932
        | DSP_lex     | 0.9627 | 0.9610  | 0.9566
Lex     | DRP_3       | 0.9642 | 0.9599  | 0.0025
        | DSP_lex     | 0.9845 | 0.9817  | 0.9451

Table 5.3: Spearman's correlation between the oracle's vector space and the systems' vector spaces, with dimension 4096. Average and standard deviation are computed over 100 trials on lists of 1000 sentence pairs.
results of DSP_lex. In this case, DRP_3 does not perform as well as DSP_lex. The additional difficulty for DRP_3 is that it has to reproduce the input words in the output, which greatly complicates the work of the distributed representation parser. But, as we report in the next section, this preliminary result may still be satisfactory for λ=0 and λ=0.2.
5.2.3 Kernel-based Performance
This experiment investigates how well DRPs preserve the topology of the oracle vector space. This is an important quality factor of a distributed tree space: when using distributed tree vectors in learning classifiers, whether $\vec{o}_i \cdot \vec{o}_j$ in the oracle's vector space is similar to $\vec{t}_i \cdot \vec{t}_j$ in the DRP's vector space is more important than whether $\vec{o}_i$ is similar to $\vec{t}_i$ (see Fig. 5.4). Sentences that are close according to the oracle syntactic interpretations should also be close according to the DRP vectors. The topology of the vector space is more relevant than the actual quality of the vectors. The experiment on parsing quality in the previous section does not properly investigate this property, as the performance of DRPs might not be sufficient to preserve distances among sentences.
Figure 5.4: Topology of the resulting spaces derived with the three different methods: similarities between sentences.
Method   We evaluate the coherence of the topology of two distributed tree spaces by measuring the Spearman's correlation between two lists of pairs of sentences $(s_i, s_j)$, each ranked according to the similarity between the two sentences in the pair. If the two lists are highly correlated, the topology of the two spaces is similar. The different methods, and thus the different distributed tree spaces, are compared against the oracle vector space (see Fig. 5.4). The first list always represents the oracle vector space and ranks pairs $(s_i, s_j)$ according to $\vec{o}_i \cdot \vec{o}_j$; the second list represents the space obtained with a DSP or a DRP, and is ranked according to the corresponding $\vec{t}_i \cdot \vec{t}_j$. In this way, we can comparatively evaluate the quality of the distributed tree vectors of our DRPs with respect to the other methods. We report the average and standard deviation of the Spearman's correlation over 100 runs on lists of 1000 pairs. We used the testing set PT_23 for extracting vectors.
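The evaluation protocol above can be sketched as follows: rank the same sentence pairs by similarity in two spaces and compute Spearman's correlation between the rankings. This is a small self-contained sketch with made-up similarity values (and no tie handling); a library routine such as scipy's `spearmanr` would do the same.

```python
import numpy as np

def ranks(x):
    # Rank transform of a list of similarity scores (assumes no ties).
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    # Spearman's correlation = Pearson correlation of the two rank vectors.
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Similarities of the same sentence pairs in the oracle space and a system space:
oracle_sims = [0.9, 0.1, 0.5, 0.7]
system_sims = [0.8, 0.2, 0.4, 0.6]   # different values, identical ranking
print(spearman(oracle_sims, system_sims))  # ≈ 1.0: the topology is preserved
```

Because only the ranking matters, a system can score high here even when its vectors differ from the oracle's in scale or direction, which is exactly the property this experiment isolates.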
Figure 5.5: Performance with respect to the sentence length, with space dimension 4096: (a) running time; (b) pseudo f-measure with λ=0.4.
Results   Table 5.3 reports results on both the non-lexicalized and the lexicalized data sets. For the non-lexicalized data set we report three methods (DRP_3, DSP_no-lex, and DSP_lex); for the lexicalized data set we report two methods (DRP_3 and DSP_lex). Columns represent different values of λ. Experiments are carried out on the 4096-dimension space. For the non-lexicalized data set, distributed representation parsers behave significantly better than DSP_no-lex for all values of λ, and the upper bound represented by DSP_lex is not far off. For the harder lexicalized data set, the difference between DRP_3 and DSP_lex is smaller than the one observed for the parsing performance. Thus, we have more evidence that we are on the right track: DRPs can substitute the DSP in generating vector spaces of distributed trees that adequately approximate the space defined by an oracle.
5.2.4 Running Time
In this last experiment, we compared the running time of the DRP with that of the DSP. The analysis was done on a dual-core processor, and both systems are implemented in the same programming language, i.e. Java. Figure 5.5a plots the running time of the DRP, the SP, and the full DSP = DT ∘ SP. The x-axis represents the sentence length in words and the y-axis represents the running time in milliseconds. The distance between SP and DSP shrinks because the plot is on a logarithmic scale. Figure 5.5b reports the pseudo f-measure of the DRP, DSP_lex, and DSP_no-lex, with respect to the sentence length, on the non-lexicalized data set with λ=0.4.

We observe that the DRP becomes extremely convenient for sentences longer than 10 words (see Fig. 5.5a), and that the pseudo f-measure difference between the methods is nearly constant across sentence lengths (see Fig. 5.5b). This already makes DRPs very appealing for real-time applications. Moreover, if we consider that DRPs can run entirely on Graphical Processing Units (GPUs), as they involve only matrix products, fast Fourier transforms, and random generators, we can better appreciate the potential of the proposed methods.
6 Conclusions and Future Work
Many fields of research require machine learning algorithms to work on structured data. Kernel methods such as kernel machines are thus very useful, since they allow for the definition of kernel functions. Kernel functions define an implicit feature space, usually of very high dimensionality, and measure similarity as the dot product of data instances mapped into that implicit feature space. One of the most frequently recurring families of structures is that of trees. The main interest of this thesis is the use of syntactic trees in natural language processing tasks, but trees are used to represent a wide variety of entities in several different fields, such as HTML documents in computer security and proteins in biology.
Since their introduction, tree kernels have been very popular in the machine learning research community. Many different tree kernels have been proposed, defining new feature spaces tailored to specific tasks and theoretical issues. At the same time, new tree kernel algorithms have been proposed to tackle the high computational complexity, which is quadratic in the size of the involved trees. After providing a brief survey of kernels for structured data and, more specifically, of tree kernels, this thesis proposed improvements along both of these research lines.
First, we introduced a new kind of kernel, tailored to tasks where tree pairs instead of single trees are considered, such as the textual entailment recognition task of natural language processing. This kernel is built on top of the concept of tDAGs (tripartite directed acyclic graphs), data structures used to represent a pair of trees as a single graph, preserving information about correlated nodes. The modeled feature space is that of first order rules between trees. The proposed kernel on tDAGs is, technically, a kernel on graphs. Complete kernels on generic graphs have been proven to be NP-hard to compute. Nonetheless, we proposed an efficient algorithm to compute the kernel, by exploiting the peculiar characteristics of tDAG structures. We showed that the kernel on tDAGs is much more efficient than previously proposed kernels for tree pairs. At the same time, it allows for a better use of the available information, leading to better performance on the textual entailment recognition task.
Second, we proposed the distributed tree kernels framework, as a means to strongly reduce the computational complexity of tree kernel functions. The idea for this framework stems from the research field of distributed representations, which concerns the representation of symbolic structures as distributed entities, i.e. points in vector spaces. The main issue in distributed tree kernels is thus the process of transforming trees into distributed trees, whereas the actual kernel computation consists of a simple dot product between vectors of limited dimensionality. By introducing an ideal vector composition function satisfying some specific properties, we showed that the dot product between two distributed trees computes an approximation of the corresponding tree kernel, as long as the distributed trees are built in an appropriate manner. We showed that this approach can be applied to different kinds of tree kernels, defining different feature spaces, and we introduced efficient algorithms to build distributed trees for the different kernels. Then, we presented a broad empirical analysis of the ability of concrete functions to approximate the ideal function properties, and of the degree of approximation of the resulting distributed tree kernels with respect to the original ones. Distributed tree kernels could allow on-line learning algorithms, such as those based on the perceptron (Rosenblatt, 1958) (e.g. the shifting perceptron model (Cavallanti et al., 2007)), to use tree structures without resorting to bounded on-line learning models that select and discard vectors due to memory constraints or time complexity (Cavallanti et al., 2007; Dekel et al., 2005; Orabona et al., 2008).
Finally, we introduced a possible application of the distributed tree kernels framework to the field of syntactic parsing. Syntactic parsing is a preliminary step needed by every machine learning algorithm that makes use of syntactic trees. Nonetheless, parsing is an expensive and error-prone process. Since kernel methods ultimately work on the implicit feature space representation of trees, we proposed a way of directly obtaining the distributed representation of a syntactic tree, without going through the intermediate symbolic form. This process, named distributed representation parsing, requires sentences to be represented by a first, easy to obtain, distributed representation. Then, an appropriately learned linear regressor is applied to produce the final distributed tree representation. We presented an extensive analysis of the degree of correlation between the syntactic tree space produced by the distributed representation parser and the space produced by applying a traditional symbolic parser followed by the original algorithm to build distributed trees. This novel path to using syntactic structures in feature spaces opens interesting and unexplored possibilities. The tight integration of parsing and feature vector generation lowers the computational cost of producing distributed representations from trees, as the circular convolution is not applied on-line. Moreover, distributed representation parsers can contribute to treating syntax uniformly in deep learning models. Deep learning models (Bengio, 2009) are
completely based on distributed representations. However, when they are applied to natural language processing tasks (e.g. Collobert et al. (2011); Socher et al. (2011)), syntactic structures are not represented in the neural networks in a distributed way. Syntactic information is generally exploited via symbolic parse trees, and this information positively impacts performance in final applications, e.g. in paraphrase detection (Socher et al., 2011) and in semantic role labeling (Collobert et al., 2011).
6.1 Future Work
The themes introduced by this thesis open several lines of research.

The algorithm introduced for the efficient computation of the kernel on tDAGs may be adapted to work on more complex kinds of structures. Although the computation of complete kernels on generic graphs is NP-hard, efficient algorithms might be found that allow for the introduction of a kernel on generic directed acyclic graphs.
We have shown how the distributed tree kernels framework can be applied to reproduce different tree kernels. Beyond applying the framework to further tree kernels, a broader generalization can be explored, producing a distributed kernels framework applicable to a wider range of structures, from strings to graphs.

Moreover, the representation of syntactic trees as distributed structures resembles the distributional representation of the semantics of words. An integration of the two approaches might lead to interesting advances in the popular research field of compositional distributional semantics.
Distributed representation parsing could also benefit from the use of distributional semantics information. In fact, our first approach does not adequately consider lexical information when building the initial distributed representation of a sentence. Using a more complex representation model, possibly including semantic information, could lead to better performance and to a fairer comparison with traditional parsers. Finally, the integration of distributed representation parsers and deep learning models could constitute an interesting line of research.
Publications
Zanzotto, F. M. and Dell'Arciprete, L. (2009). Efficient kernels for sentence pair classification. In Conference on Empirical Methods in Natural Language Processing, pages 91–100.

Zanzotto, F. M., Dell'Arciprete, L., and Korkontzelos, Y. (2010). Rappresentazione distribuita e semantica distribuzionale dalla prospettiva dell'intelligenza artificiale. Teorie & Modelli, XV, II-III, 107–122.

Zanzotto, F. M., Dell'Arciprete, L., and Moschitti, A. (2011). Efficient graph kernels for textual entailment recognition. Fundamenta Informaticae, 107(2-3), 199–222.

Zanzotto, F. M. and Dell'Arciprete, L. (2011a). Distributed structures and distributional meaning. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 10–15, Portland, Oregon, USA. Association for Computational Linguistics.

Zanzotto, F. M. and Dell'Arciprete, L. (2011b). Distributed tree kernels rivaling tree kernels in entailment recognition. In AI*IA Workshop on "Learning by Reading in the Real World".

Dell'Arciprete, L., Murphy, B., and Zanzotto, F. (2012). Parallels between machine and brain decoding. In F. Zanzotto, S. Tsumoto, N. Taatgen, and Y. Yao, editors, Brain Informatics, volume 7670 of Lecture Notes in Computer Science, pages 162–174. Springer Berlin Heidelberg.

Zanzotto, F. M. and Dell'Arciprete, L. (2012). Distributed tree kernels. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 193–200. Omnipress.

Zanzotto, F. M. and Dell'Arciprete, L. (2013). Transducing sentences to syntactic feature vectors: an alternative way to "parse"? In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 40–49, Sofia, Bulgaria. Association for Computational Linguistics.

Dell'Arciprete, L. and Zanzotto, F. M. (2013). Distributed convolution kernels on countable sets. Submitted to a journal.
Bibliography
Aiolli, F., Da San Martino, G., and Sperduti, A. (2009). Route kernels for trees. In Pro-
ceedings of the 26th Annual International Conference on Machine Learning, ICML
’09, pages 17–24, New York, NY, USA. ACM.
Aleksander, I. and Morton, H. (1995). An introduction to neural computing. Interna-
tional Thomson Computer Press.
Bar-Haim, R., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B., and Szpektor, I. (2006). The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in
Machine Learning, 2(1), 1–127.
Bikel, D. M. (2004). Intricacies of Collins’ parsing model. Comput. Linguist., 30,
479–511.
Bunescu, R. and Mooney, R. J. (2006). Subsequence kernels for relation extraction. Submitted to the Ninth Conference on Natural Language Learning (CoNLL-2005), Ann Arbor, MI. Available at http://www.cs.utexas.edu/users/ml/publication/ie.html.
Carpenter, B. (1992). The Logic of Typed Feature Structures. Cambridge University
Press, Cambridge, England.
Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2007). Tracking the best hyperplane
with a simple budget perceptron. Machine Learning, 69(2-3), 143–167.
Charniak, E. (2000). A maximum-entropy-inspired parser. In Proc. of the 1st NAACL,
pages 132–139, Seattle, Washington.
Chierchia, G. and McConnell-Ginet, S. (2001). Meaning and Grammar: An introduc-
tion to Semantics. MIT press, Cambridge, MA.
Chomsky, N. (1957). Aspects of the Theory of Syntax. MIT Press, Cambridge, Massachusetts.
Collins, M. (2003). Head-driven statistical models for natural language parsing. Com-
put. Linguist., 29(4), 589–637.
Collins, M. and Duffy, N. (2001). Convolution kernels for natural language. In NIPS,
pages 625–632.
Collins, M. and Duffy, N. (2002). New ranking algorithms for parsing and tagging:
Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL02.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language pro-
cessing: Deep neural networks with multitask learning. In International Conference
on Machine Learning, ICML.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.
(2011). Natural language processing (almost) from scratch. J. Mach. Learn. Res.,
12, 2493–2537.
Corley, C. and Mihalcea, R. (2005). Measuring the semantic similarity of texts. In
Proc. of the ACL Workshop on Empirical Modeling of Semantic Equivalence and
Entailment, pages 13–18. Association for Computational Linguistics, Ann Arbor,
Michigan.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Ma-
chines and Other Kernel-based Learning Methods. Cambridge University Press.
Cristianini, N., Shawe-Taylor, J., and Lodhi, H. (2002). Latent semantic kernels. J.
Intell. Inf. Syst., 18(2-3), 127–152.
Cumby, C. and Roth, D. (2002). Learning with feature description logics. In ILP, pages
32–47.
Cumby, C. and Roth, D. (2003). On kernel methods for relational learning. In ICML,
pages 107–114.
Dagan, I. and Glickman, O. (2004). Probabilistic textual entailment: Generic applied
modeling of language variability. In Proceedings of the Workshop on Learning Meth-
ods for Text Understanding and Mining, Grenoble, France.
Dagan, I., Glickman, O., and Magnini, B. (2006). The PASCAL recognising textual entailment challenge. In J. Quiñonero-Candela et al., editors, LNAI 3944: MLCW 2005, pages 177–190, Milan, Italy. Springer-Verlag.
Dang, H. T. (2005). Overview of DUC 2005. In Proceedings of the 2005 Document
Understanding Workshop.
Dasgupta, S. and Gupta, A. (1999). An elementary proof of the Johnson-Lindenstrauss
lemma. Technical Report TR-99-006, ICSI, Berkeley, California.
de Marneffe, M.-C., MacCartney, B., Grenager, T., Cer, D., Rafferty, A., and Manning, C. D. (2006). Learning to distinguish valid textual entailments. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407.
Dekel, O., Shalev-Shwartz, S., and Singer, Y. (2005). The forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18, pages 259–266. MIT Press.
Dell’Arciprete, L. and Zanzotto, F. M. (2013). Distributed convolution kernels on
countable sets. Submitted to a journal.
Dell'Arciprete, L., Murphy, B., and Zanzotto, F. (2012). Parallels between machine and
brain decoding. In F. Zanzotto, S. Tsumoto, N. Taatgen, and Y. Yao, editors, Brain
Informatics, volume 7670 of Lecture Notes in Computer Science, pages 162–174.
Springer Berlin Heidelberg.
Düssel, P., Gehl, C., Laskov, P., and Rieck, K. (2008). Incorporation of application
layer protocol syntax into anomaly detection. In Proceedings of the 4th International
Conference on Information Systems Security, ICISS ’08, pages 188–202, Berlin,
Heidelberg. Springer-Verlag.
Eisner, J. (2003). Learning non-isomorphic tree mappings for machine translation.
In Proceedings of the 41st Annual Meeting of the Association for Computational
Linguistics (ACL), Companion Volume, pages 205–208, Sapporo.
Gärtner, T. (2002). Exponential and geometric kernels for graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data.
Gärtner, T. (2003). A survey of kernels for structured data. SIGKDD Explorations, 5(1), 49–58.
Gärtner, T., Flach, P., and Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. Lecture Notes in Computer Science, pages 129–143.
Gildea, D. and Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28(3), 245–288.
Golub, G. and Kahan, W. (1965). Calculating the singular values and pseudo-inverse
of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series
B: Numerical Analysis, 2(2), 205–224.
Grinberg, D., Lafferty, J., and Sleator, D. (1996). A robust parsing algorithm for link grammar. In 4th International Workshop on Parsing Technologies, Prague.
Haghighi, A. D., Ng, A. Y., and Manning, C. D. (2005). Robust textual inference via
graph matching. In Proceedings of the conference on Human Language Technology
and Empirical Methods in Natural Language Processing, HLT ’05, pages 387–394,
Stroudsburg, PA, USA. Association for Computational Linguistics.
Harabagiu, S. and Hickl, A. (2006). Methods for using textual entailment in open-
domain question answering. In Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of the Association for Compu-
tational Linguistics, pages 905–912, Sydney, Australia. Association for Computa-
tional Linguistics.
Harabagiu, S., Hickl, A., and Lacatusu, F. (2007). Satisfying information needs with multi-document summaries. Information Processing & Management, 43(6), 1619–1642.
Hashimoto, K., Takigawa, I., Shiga, M., Kanehisa, M., and Mamitsuka, H. (2008).
Mining significant tree patterns in carbohydrate sugar chains. Bioinformatics, 24,
i167–i173.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz.
Hecht-Nielsen, R. (1994). Context vectors: general purpose approximate meaning
representations self-organized from raw data. Computational Intelligence: Imitating
Life, IEEE Press, pages 43–56.
Hickl, A., Williams, J., Bensley, J., Roberts, K., Rink, B., and Shi, Y. (2006). Recognizing textual entailment with LCC's GROUNDHOG system. In B. Magnini and
I. Dagan, editors, Proceedings of the Second PASCAL Recognizing Textual Entail-
ment Challenge, Venice, Italy. Springer-Verlag.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986). Distributed represen-
tations. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Pro-
cessing: Explorations in the Microstructure of Cognition. Volume 1: Foundations.
MIT Press, Cambridge, MA.
Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the 10th ROCLING, pages 132–139. Taipei, Taiwan.
John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh conference on Uncertainty in artificial
intelligence, UAI’95, pages 338–345, San Francisco, CA, USA. Morgan Kaufmann
Publishers Inc.
Johnson, W. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a
Hilbert space. Contemp. Math., 26, 189–206.
Kainen, P. C. and Kůrková, V. (1993). Quasiorthogonal dimension of Euclidean spaces. Applied Mathematics Letters, 6(3), 7–10.
Kashima, H. and Koyanagi, T. (2002). Kernels for semi-structured data. In Proceedings
of the Nineteenth International Conference on Machine Learning, ICML ’02, pages
291–298, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized kernels between la-
beled graphs. In Proceedings of the Twentieth International Conference on Machine
Learning, pages 321–328. AAAI Press.
Kimura, D. and Kashima, H. (2012). Fast computation of subpath kernel for trees.
CoRR, abs/1206.4642.
Kimura, D., Kuboyama, T., Shibuya, T., and Kashima, H. (2011). A subpath kernel for
rooted unordered trees. In J. Huang, L. Cao, and J. Srivastava, editors, Advances in
Knowledge Discovery and Data Mining, volume 6634 of Lecture Notes in Computer
Science, pages 62–74. Springer Berlin / Heidelberg.
Köbler, J., Schöning, U., and Torán, J. (1993). The Graph Isomorphism Problem: Its Structural Complexity. Birkhäuser Verlag, Basel, Switzerland.
Kondor, R. I. and Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. In Proceedings of the ICML, pages 315–322.
Leslie, C., Eskin, E., and Noble, W. S. (2002). The spectrum kernel: a string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 564–575.
Li, W., Ong, K.-L., Ng, W.-K., and Sun, A. (2005). Spectral kernels for classification.
In Proceedings of the 7th international conference on Data Warehousing and Knowl-
edge Discovery, DaWaK’05, pages 520–529, Berlin, Heidelberg. Springer-Verlag.
Lin, D. and Pantel, P. (2001). DIRT: discovery of inference rules from text. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD-01), San Francisco, CA.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text
classification using string kernels. J. Mach. Learn. Res., 2, 419–444.
MacCartney, B., Grenager, T., de Marneffe, M.-C., Cer, D., and Manning, C. D. (2006).
Learning to recognize features of valid textual entailments. In Proceedings of the
Human Language Technology Conference of the NAACL, Main Conference, pages
41–48, New York City, USA. Association for Computational Linguistics.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large an-
notated corpus of English: The Penn Treebank. Computational Linguistics, 19,
313–330.
Mercer, J. (1909). Functions of positive and negative type, and their connection with
the theory of integral equations. Philosophical Transactions of the Royal Society
of London. Series A, Containing Papers of a Mathematical or Physical Character,
209(441-458), 415–446.
Mevik, B.-H. and Wehrens, R. (2007). The pls package: Principal component and
partial least squares regression in R. Journal of Statistical Software, 18(2), 1–24.
Minnen, G., Carroll, J., and Pearce, D. (2001). Applied morphological processing of
English. Natural Language Engineering, 7(3), 207–223.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, Inc., New York, NY, USA,
1 edition.
Moschitti, A. (2004). A study on convolution kernels for shallow semantic parsing. In Proceedings of the ACL, Barcelona, Spain.
Moschitti, A. (2006a). Efficient convolution kernels for dependency and constituent syntactic trees. In Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany.
Moschitti, A. (2006b). Making tree kernels practical for natural language learning. In
Proceedings of EACL’06, Trento, Italy.
Moschitti, A. and Zanzotto, F. M. (2007). Fast and effective kernels for relational
learning from texts. In Proceedings of the International Conference of Machine
Learning (ICML), Corvallis, Oregon.
Moschitti, A., Pighin, D., and Basili, R. (2008). Tree kernels for semantic role labeling.
Computational Linguistics, 34(2), 193–224.
MUC-7 (1997). Proceedings of the Seventh Message Understanding Conference (MUC-7), Columbia, MD. Morgan Kaufmann.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., and Yuret, D.
(2007a). The CoNLL 2007 shared task on dependency parsing. In Proceedings of the
CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932. Association
for Computational Linguistics, Prague, Czech Republic.
Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryiğit, G., Kübler, S., Marinov, S., and Marsi, E. (2007b). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2), 95–135.
Orabona, F., Keshet, J., and Caputo, B. (2008). The projectron: a bounded kernel-based
perceptron. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine
Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008),
Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference
Proceeding Series, pages 720–727. ACM.
Paass, G., Leopold, E., Larson, M., Kindermann, J., and Eickeler, S. (2002). SVM classification using sequences of phonemes and syllables. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '02, pages 373–384, London, UK. Springer-Verlag.
Pantel, P. and Pennacchiotti, M. (2006). Espresso: A bootstrapping algorithm for
automatically harvesting semantic relations. In Proceedings of the 21st Coling and
44th ACL, Sydney, Australia.
Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004). WordNet::Similarity - measur-
ing the relatedness of concepts. In Proc. of 5th NAACL. Boston, MA.
Penrose, R. (1955). A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society, 51, 406–413.
Peñas, A., Rodrigo, Á., and Verdejo, F. (2007). Overview of the answer validation exercise 2007. In C. Peters, V. Jijkoun, T. Mandl, H. Müller, D. W. Oard, A. Peñas, V. Petras, and D. Santos, editors, CLEF, volume 5152 of Lecture Notes in Computer Science, pages 237–248. Springer.
Pighin, D. and Moschitti, A. (2010). On reverse feature engineering of syntactic tree
kernels. In Conference on Natural Language Learning (CoNLL-2010), Uppsala,
Sweden.
Plate, T. A. (1994). Distributed Representations and Nested Compositional Structure. Ph.D. thesis, University of Toronto.
Pollard, C. and Sag, I. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press and CSLI, Stanford.
Pradhan, S., Ward, W., Hacioglu, K., Martin, J. H., and Jurafsky, D. (2005). Semantic
role labeling using different syntactic views. In ACL ’05: Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 581–588. As-
sociation for Computational Linguistics, Morristown, NJ, USA.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann.
Raina, R., Haghighi, A., Cox, C., Finkel, J., Michels, J., Toutanova, K., MacCartney, B., de Marneffe, M.-C., Manning, C., and Ng, A. Y. (2005). Robust textual inference using diverse knowledge sources. In Proceedings of the 1st Pascal Challenge Workshop, Southampton, UK.
Rieck, K., Krueger, T., Brefeld, U., and Müller, K.-R. (2010). Approximate tree kernels. J. Mach. Learn. Res., 11, 555–580.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Rousu, J. and Shawe-Taylor, J. (2005). Efficient computation of gapped substring
kernels on large alphabets. J. Mach. Learn. Res., 6, 1323–1344.
Rumelhart, D. E. and McClelland, J. L. (1986). Parallel Distributed Processing: Ex-
plorations in the Microstructure of Cognition : Foundations (Parallel Distributed
Processing). MIT Press.
Sahlgren, M. (2005). An introduction to random indexing. In Proceedings of the
Methods and Applications of Semantic Indexing Workshop at the 7th International
Conference on Terminology and Knowledge Engineering (TKE), Copenhagen, Den-
mark.
Sahlgren, M., Holst, A., and Kanerva, P. (2008). Permutations as a means to encode
order in word space. In V. Sloutsky, B. Love, and K. Mcrae, editors, Proceedings
of the 30th Annual Conference of the Cognitive Science Society, pages 1300–1305.
Cognitive Science Society, Austin, TX.
Schölkopf, B. (1997). Support Vector Learning. Ph.D. thesis, Technische Universität Berlin.
Shin, K. and Kuboyama, T. (2010). A generalization of Haussler’s convolution kernel:
mapping kernel and its application to tree kernels. J. Comput. Sci. Technol., 25(5),
1040–1054.
Shin, K., Cuturi, M., and Kuboyama, T. (2011). Mapping kernels for trees. In L. Getoor
and T. Scheffer, editors, Proceedings of the 28th International Conference on Ma-
chine Learning (ICML-11), ICML ’11, pages 961–968, New York, NY, USA. ACM.
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011). Dy-
namic pooling and unfolding recursive autoencoders for paraphrase detection. In
Advances in Neural Information Processing Systems 24.
Sun, J., Zhang, M., and Tan, C. L. (2011). Tree sequence kernel for natural language. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI-11).
Suzuki, J. and Isozaki, H. (2006). Sequence and tree kernels with statistical feature
mining. In Advances in Neural Information Processing Systems 18, pages 1321–
1328. MIT Press.
Suzuki, J., Hirao, T., Sasaki, Y., and Maeda, E. (2003). Hierarchical directed acyclic graph kernel: Methods for structured natural language data. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 32–39.
Tesnière, L. (1959). Éléments de syntaxe structurale. Klincksieck, Paris, France.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
Vert, J.-P. (2002). A tree kernel to analyse phylogenetic profiles. Bioinformatics,
18(suppl 1), S276–S284.
Voorhees, E. M. (2001). The TREC question answering track. Nat. Lang. Eng., 7(4),
361–378.
Wang, J. (1997). Average-case computational complexity theory. In Complexity Theory Retrospective II, pages 295–328. Springer.
Wang, R. and Neumann, G. (2007a). Recognizing textual entailment using a subse-
quence kernel method. In Proceedings of the Twenty-Second AAAI Conference on
Artificial Intelligence (AAAI-07), July 22-26, Vancouver, Canada.
Wang, R. and Neumann, G. (2007b). Recognizing textual entailment using sentence
similarity based on dependency tree skeletons. In Proceedings of the ACL-PASCAL
Workshop on Textual Entailment and Paraphrasing, pages 36–41, Prague. Associa-
tion for Computational Linguistics.
Zanzotto, F. M. and Dell'Arciprete, L. (2009). Efficient kernels for sentence pair classification. In Conference on Empirical Methods in Natural Language Processing, pages 91–100.
Zanzotto, F. M. and Dell’Arciprete, L. (2011a). Distributed structures and distribu-
tional meaning. In Proceedings of the Workshop on Distributional Semantics and
Compositionality, pages 10–15, Portland, Oregon, USA. Association for Computa-
tional Linguistics.
Zanzotto, F. M. and Dell'Arciprete, L. (2011b). Distributed tree kernels rivaling tree kernels in entailment recognition. In AI*IA Workshop on "Learning by Reading in the Real World".
Zanzotto, F. M. and Dell’Arciprete, L. (2012). Distributed tree kernels. In Proceedings
of the 29th International Conference on Machine Learning (ICML-12), pages 193–
200. Omnipress.
Zanzotto, F. M. and Dell'Arciprete, L. (2013). Transducing sentences to syntactic feature vectors: an alternative way to "parse"? In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 40–49, Sofia, Bulgaria. Association for Computational Linguistics.
Zanzotto, F. M. and Moschitti, A. (2006). Automatic learning of textual entailments
with cross-pair similarities. In Proceedings of the 21st Coling and 44th ACL, pages
401–408. Sydney, Australia.
Zanzotto, F. M. and Moschitti, A. (2007). Experimenting a "General Purpose" Textual Entailment Learner in AVE, volume 4730 of Lecture Notes in Computer Science, pages 510–517. Springer.
Zanzotto, F. M., Pennacchiotti, M., and Pazienza, M. T. (2006). Discovering asymmet-
ric entailment relations between verbs using selectional preferences. In Proceedings
of the 21st Coling and 44th ACL, Sydney, Australia.
Zanzotto, F. M., Pennacchiotti, M., and Moschitti, A. (2009). A machine learning approach to textual entailment recognition. Natural Language Engineering, 15(4), 551–582.
Zanzotto, F. M., Dell'Arciprete, L., and Korkontzelos, Y. (2010). Rappresentazione distribuita e semantica distribuzionale dalla prospettiva dell'intelligenza artificiale [Distributed representation and distributional semantics from the perspective of artificial intelligence]. Teorie & Modelli, XV(II–III), 107–122.
Zanzotto, F. M., Dell'Arciprete, L., and Moschitti, A. (2011). Efficient graph kernels for textual entailment recognition. Fundamenta Informaticae, 107(2-3), 199–222.
Zhang, D. and Lee, W. S. (2003). Question classification using support vector ma-
chines. In Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval, SIGIR ’03, pages 26–32, New
York, NY, USA. ACM.