Semi Supervised Learning
ABSTRACT
As a supervised learning algorithm, the standard Gaussian process (GP) classifier
performs well on classification tasks. In this report, we present a semi-supervised
algorithm for learning a Gaussian process classifier. It incorporates a graph-based
construction of semi-supervised kernels in the presence of labelled and unlabeled
data, and extends the standard Gaussian process algorithm to the semi-supervised
learning framework. Our algorithm uses spectral decomposition to obtain the kernel
matrices and employs a convex optimization method to learn an optimal
semi-supervised kernel, which is incorporated into the Gaussian process model. In
Gaussian process classification, the expectation propagation algorithm is applied
to obtain a Gaussian approximation of the posterior distribution. The main
characteristic of the proposed algorithm is that it incorporates the geometric
properties of unlabeled data through globally defined kernel functions. The
semi-supervised Gaussian process model has an explicit probabilistic
interpretation, can model the uncertainty in the data, and can solve complex
non-linear inference problems. When few labelled examples are available, the
proposed algorithm outperforms cross-validation methods, and we present
experimental results demonstrating its effectiveness in comparison with related
work in the literature.
CHAPTER 1
INTRODUCTION
Semi-supervised learning [1] has attracted increasing attention in recent years. It covers
many research areas, such as semi-supervised classification, semi-supervised regression,
semi-supervised clustering, and co-training. In this report, we primarily consider
semi-supervised classification. Standard supervised learning methods use only labelled data
(or features) to train classifiers. Because data are diverse, labelled instances are often
difficult, expensive, and time-consuming to obtain, whereas unlabeled data may be relatively
easy to collect in practice. Compared with supervised learning, semi-supervised learning can
build better classifiers by using a large amount of unlabeled data together with a few
labelled data.
Statistics and machine learning share much of their basic theory and many algorithms. The
primary differences between the two fields are the goal of learning and the type of problem
solved. Statistics mainly considers how to understand the relationships between data and
models, such as linearity or independence. In contrast, machine learning primarily focuses on
how to give accurate predictions and understand the behaviour of algorithms. Because of these
different objectives, the two fields have developed differently. For example, learning
algorithms in machine learning are widely used as black boxes, where only the input and
output matter, whereas in statistics it is usually difficult to describe such models and
obtain satisfactory results. To some extent, the Gaussian process model [2-9] bridges the two
fields: it has an explicit probabilistic interpretation that facilitates modelling the
uncertainty of complex data sets, and it provides a complete theoretical framework for model
selection and probabilistic prediction simultaneously.
As a supervised learning algorithm, the standard Gaussian process has a posterior
distribution that is not affected by unlabeled data, so unlabeled data do not influence the
location of the decision boundary. In this report, we show how to extend the Gaussian process
model to the semi-supervised framework by incorporating unlabeled data, and thereby improve
the performance of Gaussian process classifiers. Because Gaussian processes are based on the
Bayesian framework, we can address this problem through either the likelihood function or the
prior distribution: (1) Combine a Gaussian process prior with a suitable likelihood function,
so that the posterior distribution incorporates the cluster assumption and influences the
location of the decision boundary. Lawrence [4] proposed the Null-Category Noise Model
(NCNM), which plays the role of a probabilistic margin. Rogers [5] replaced the NCNM with a
multinomial probit likelihood function, which generalizes the binary setting to the
multi-class setting. (2) Directly modify the kernel function of the Gaussian process prior so
that it has the properties of a semi-supervised kernel and incorporates the information of
both labelled and unlabeled data. Spectral clustering [11], diffusion kernels [12], and
Gaussian random fields [13] are semi-supervised kernel methods. These methods are parametric
approaches, for which it is difficult to choose an appropriate function family and to model
the data accurately without enough degrees of freedom. In this report, we address
semi-supervised learning through the Gaussian process prior distribution, i.e., approach (2)
above. The proposed algorithm incorporates the geometric properties of unlabeled data through
a graph-based spectral decomposition and obtains an optimal non-parametric semi-supervised
kernel, which is combined with the Gaussian process model.
1.1 SUPERVISED, UNSUPERVISED, AND SEMI-SUPERVISED
LEARNING
In order to understand the nature of semi-supervised learning, it will be useful first to take a
look at supervised and unsupervised learning.
1.1.1 SUPERVISED AND UNSUPERVISED LEARNING
Traditionally, there have been two fundamentally different types of tasks in machine learning.
The first one is unsupervised learning. Let X = (x1, . . . , xn) be a set of n examples (or
points), where xi ∈ X for all i ∈ [n] := {1, . . . , n}. Typically it is assumed that the points are
drawn i.i.d. (independently and identically distributed) from a common distribution on X. It is
often convenient to define the (n × d)-matrix X = (xi)i∈[n] that contains the data points as its
rows. The goal of unsupervised learning is to find interesting structure in the data X. It has
been argued that the problem of unsupervised learning is fundamentally that of estimating a
density which is likely to have generated X. However, there are also weaker forms of
unsupervised learning, such as quantile estimation, clustering, outlier detection, and
dimensionality reduction.
The second task is supervised learning. The goal is to learn a mapping from x to y, given a
training set made of pairs (xi, yi). Here, the yi ∈ Y are called the labels or targets of the
examples xi. If the labels are numbers, y = (yi)i∈[n] denotes the column vector of labels. Again,
a standard requirement is that the pairs (xi, yi) are sampled i.i.d. from some distribution which
here ranges over X × Y. The task is well defined, since a mapping can be evaluated through
its predictive performance on test examples. When Y = R or Y = Rd (or more generally, when
the labels are continuous), the task is called regression. Most of this book will focus on
classification (there is some work on regression in chapter 23), i.e., the case where y takes
values in a finite set (discrete labels). There are two families of algorithms for supervised
learning. Generative algorithms try to model the class-conditional density p(x|y) by some
unsupervised learning procedure. A predictive density can then be inferred by applying Bayes
theorem:
p(y|x) = \frac{p(x|y)\,p(y)}{\int_{Y} p(x|y)\,p(y)\,dy}    (1.1)
In fact, p(x|y)p(y) = p(x, y) is the joint density of the data, from which pairs (xi, yi) could be
generated. Discriminative algorithms do not try to estimate how the xi have been generated,
but instead concentrate on estimating p(y|x). Some discriminative methods even limit
themselves to modelling whether p(y|x) is greater than or less than 0.5; an example of this is
the support vector machine (SVM). It has been argued that discriminative models are more
directly aligned with the goal of supervised learning and therefore tend to be more efficient in
practice.
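As a small illustration of the generative route, the sketch below applies Bayes' theorem to a discrete label set, where the normalizing integral becomes a sum over classes. The one-dimensional Gaussian class-conditional densities, their parameters, and the function names are assumptions made for this example only.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, class_params, priors):
    """Return p(y | x) for every class y via Bayes' theorem.

    class_params: dict mapping label -> (mean, std) of p(x | y)
    priors:       dict mapping label -> p(y)
    """
    # Joint density p(x, y) = p(x | y) p(y) for each class.
    joint = {y: gaussian_pdf(x, *class_params[y]) * priors[y] for y in priors}
    # Evidence p(x) = sum_y p(x | y) p(y) normalizes the posterior.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# Hypothetical two-class problem with equal priors.
params = {-1: (-1.0, 1.0), +1: (1.0, 1.0)}
priors = {-1: 0.5, +1: 0.5}
post_at_zero = posterior(0.0, params, priors)
```

A discriminative method would skip the densities p(x|y) entirely and model p(y|x) (or only its decision threshold) directly.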
1.1.2 SEMI-SUPERVISED LEARNING
Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In
addition to unlabeled data, the algorithm is provided with some supervision information – but
not necessarily for all examples. Often, this information will be the targets associated with
some of the examples. In this case, the data set X = (xi)i∈[n] can be divided into two parts: the
points Xl := (x1, . . . , xl), for which labels Yl := (y1, . . . , yl) are provided, and the points Xu :=
(xl+1, . . . , xl+u), the labels of which are not known. This is “standard” semi-supervised
learning as investigated in this book; most chapters will refer to this setting. Other forms of
partial supervision are possible. For example, there may be constraints such as “these points
have (or do not have) the same target” (cf. Abu-Mostafa, 1995). The different setting
corresponds to a different view of semi-supervised learning: In chapter 5, SSL is seen as
unsupervised learning guided by constraints. In contrast, most other approaches see SSL as
supervised learning with additional information on the distribution of the examples x. The
latter interpretation seems to be more in line with most applications, where the goal is the
same as in supervised learning: to predict a target value for a given xi. However, this view
does not readily apply if the number and nature of the classes are not known in advance but
have to be inferred from the data. In contrast, SSL as unsupervised learning with constraints
may still remain applicable in such situations. A problem related to SSL was introduced by
Vapnik already several decades ago: so-called transductive learning. In this setting, one is
given a (labelled) training set and an (unlabeled) test set. The idea of transduction is to
perform predictions only for the test points. This is in contrast to inductive learning, where
the goal is to output a prediction function which is defined on the entire space X. Many
methods described in this book will be transductive; in particular, this is rather natural for
inference based on graph representations of the data.
1.1.3 BRIEF HISTORY OF SEMI-SUPERVISED LEARNING
Probably the earliest idea about using unlabeled data in classification is self learning, which
is also known as self-training, self-labeling, or decision-directed learning. This is a wrapper-
algorithm that repeatedly uses a supervised learning method. It starts by training on the
labeled data only. In each step a part of the unlabeled points is labeled according to the
current decision function; then the supervised method is retrained using its own predictions as
additional labelled points. This idea has appeared in the literature already for some time (e.g.,
Scudder (1965); Fralick (1967); Agrawala (1970)).
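The wrapper idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: the base learner is a hypothetical one-dimensional nearest-centroid classifier, and in each round the single most confident unlabeled point (largest margin between the two nearest centroids) receives a label. None of these choices come from the cited works.

```python
def fit_centroids(xs, ys):
    """Supervised base learner: one centroid (mean) per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict(centroids, x):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    dists = sorted((abs(x - c), label) for label, c in centroids.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def self_train(x_lab, y_lab, x_unlab, per_round=1):
    """Self-training wrapper: repeatedly label the most confident unlabeled
    points with the current model, then retrain on the enlarged labelled set."""
    x_lab, y_lab, pool = list(x_lab), list(y_lab), list(x_unlab)
    while pool:
        model = fit_centroids(x_lab, y_lab)
        ranked = sorted(pool, key=lambda x: predict(model, x)[1], reverse=True)
        for x in ranked[:per_round]:
            x_lab.append(x)
            y_lab.append(predict(model, x)[0])  # the model's own prediction becomes a label
            pool.remove(x)
    return fit_centroids(x_lab, y_lab)

# Two labelled points (one per class) and four unlabeled points.
final_model = self_train([0.0, 4.0], [0, 1], [0.5, 1.0, 3.0, 3.5])
```

Swapping in a different base learner changes what the wrapper does to the decision boundary.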
An unsatisfactory aspect of self-learning is that the effect of the wrapper depends on the
supervised method used inside it. If self-learning is used with empirical risk minimization and
1-0-loss, the unlabeled data will have no effect on the solution at all. If instead a margin
maximizing method is used, as a result the decision boundary is pushed away from the
unlabeled points. In other cases it seems to be unclear what the self-learning is really doing,
and which assumption it corresponds to.
Closely related to semi-supervised learning is the concept of transductive inference, or
transduction, pioneered by Vapnik (Vapnik and Chervonenkis, 1974; Vapnik and Sterin,
1977). In contrast to inductive inference, no general decision rule is inferred, but only the
labels of the unlabeled (or test) points are predicted. An early instance of transduction (albeit
without explicitly considering it as a concept) was already proposed by Hartley and Rao
(1968). They suggested a combinatorial optimization on the labels of the test points in order
to maximize the likelihood of their model.
It seems that semi-supervised learning really took off in the 1970s when the problem of
estimating the Fisher linear discriminant rule with unlabeled data was considered (Hosmer,
1973; McLachlan, 1977; O’Neill, 1978; McLachlan and Ganesalingam, 1982). More
precisely, the setting was in the case where each class-conditional density is Gaussian with
equal covariance matrix. The likelihood of the model is then maximized using the labeled and
unlabeled data with the help of an iterative algorithm such as the expectation-maximization
(EM) algorithm (Dempster et al., 1977). Instead of a mixture of Gaussians, the use of a
mixture of multinomial distributions estimated with labeled and unlabeled data has been
investigated in (Cooper and Freeman, 1970).
Later, this one component per class setting has been extended to several components per class
(Shahshahani and Landgrebe, 1994) and further generalized by Miller and Uyar (1997).
Learning rates in a probably approximately correct (PAC) framework (Valiant, 1984) have
been derived for the semi-supervised learning of a mixture of two Gaussians by Ratsaby and
Venkatesh (1995). In the case of an identifiable mixture, Castelli and Cover (1995) showed
that with an infinite number of unlabeled points, the probability of error has an exponential
convergence (w.r.t. the number of labelled examples) to the Bayes risk. Identifiable means
that given P(x), the decomposition in ∑y P(y) P(x|y) is unique. This seems a relatively strong
assumption, but it is satisfied, for instance, by mixtures of Gaussians. Related is the analysis
in (Castelli and Cover, 1996) in which the class-conditional densities are known but the class
priors are not. Finally, the interest in semi-supervised learning increased in the 1990s, mostly
due to applications in natural language problems and text classification (Yarowsky, 1995;
Nigam et al., 1998; Blum and Mitchell, 1998; Collins and Singer, 1999; Joachims, 1999).
Note that, to our knowledge, Merz et al. (1992) were the first to use the term “semi-
supervised” for classification with both labelled and unlabeled data. It has in fact been used
before, but in a different context than what is developed in this book; see, for instance,
(Board and Pitt, 1989).
1.1.4 SEMI-SUPERVISED LEARNING IN PRACTICE
Semi-supervised learning will be most useful whenever there are far more unlabeled data
than labelled. This is likely to occur if obtaining data points is cheap, but obtaining the labels
costs a lot of time, effort, or money. This is the case in many application areas of machine
learning, for example:
- In speech recognition, it costs almost nothing to record huge amounts of speech, but
  labelling it requires some human to listen to it and type a transcript.
- Billions of web pages are directly available for automated processing, but to classify
  them reliably, humans have to read them.
- Protein sequences are nowadays acquired at industrial speed (by genome sequencing,
  computational gene finding, and automatic translation), but to resolve a
  three-dimensional (3D) structure or to determine the functions of a single protein may
  require years of scientific work.
Since unlabeled data carry less information than labelled data, they are required in large
amounts in order to increase prediction accuracy significantly. This implies the need for fast
and efficient SSL algorithms.
CHAPTER 2
GAUSSIAN PROCESSES
A Gaussian process (GP) [2] is a generalization of the multivariate Gaussian distribution to
functions and has the marginalization property. A GP models the data x through a random
process f(x) and describes this random process by a probability distribution. A GP defines a
distribution over functions and is fully specified by the mean function m(x) and the
covariance function (kernel function) K(x, x′) of the random process f:
f(x) ~ GP(m(x), K(x, x′)) (2.1)
where the kernel K(x, x′) is usually chosen to be a Mercer kernel. For example, the
RBF kernel function has the following form:
(2.2)
where θ1 and θ2 are the hyperparameters of the RBF kernel, which are generally selected by
maximizing the marginal likelihood (evidence).
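To make the role of the two hyperparameters concrete, here is a minimal NumPy sketch of an RBF kernel matrix. The parameterization is an assumption, since the exact form of Equation (2.2) did not survive in this transcript: theta1 is taken as the signal variance and theta2 as the length scale, K(x, x′) = θ1 exp(−‖x − x′‖² / (2 θ2²)). The function name `rbf_kernel` is likewise hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, theta1=1.0, theta2=1.0):
    """RBF kernel matrix between the rows of X and the rows of Z.

    Assumed form: theta1 * exp(-||x - z||^2 / (2 * theta2^2)).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Z = np.atleast_2d(np.asarray(Z, dtype=float))
    # Squared Euclidean distances via ||x||^2 + ||z||^2 - 2 x.z
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from round-off
    return theta1 * np.exp(-sq / (2.0 * theta2**2))

X_rbf = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_rbf = rbf_kernel(X_rbf, X_rbf, theta1=2.0, theta2=1.5)
```

Under this parameterization the diagonal equals theta1, and a larger theta2 makes distant points more strongly correlated.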
2.1 GAUSSIAN PROCESSES CLASSIFICATION
In this report, we consider only binary classification. We assume we are given a dataset
X = {Xl, Xu}, where Xl = {x1, ..., xm} is the labelled dataset with associated label set
Y = {y1, ..., ym}, yi ∈ {−1, +1}, and Xu = {xm+1, ..., xn} is the unlabeled dataset. The main
idea is to assume that there is an unobservable latent function f(x) with a Gaussian process
prior p(f) ~ GP(0, K), and that the latent function f captures the mapping between the
dataset X and the label set Y.
The likelihood function (class probability) given the latent function is:
p(y_i | f(x_i)) = \Theta(y_i f(x_i)) (2.3)
where Θ is a sigmoid function, such as the logistic function or the cumulative Gaussian
function. By Bayes' theorem, the posterior probability can be written as:
(2.4)
where θ denotes the hyperparameters of the kernel function K and P(X | θ) is the
normalization factor, known as the evidence for the hyperparameters. As a discriminative
model, the Gaussian process has the graphical representation shown in Fig. 2.1, where the
nodes are shaded to indicate how they are treated: white nodes are unobserved variables, grey
nodes are observed variables, and black nodes are optimized.
Fig. 2.1: The graphical representation of Gaussian Processes in the discriminative framework
For a labelled point xi ∈ Xl, xi is not d-separated [14] (conditionally independent) from K,
because they have a common descendant yi which is observed. For an unlabeled point xj ∈ Xu,
xj is d-separated from K because yj is unobserved. Hence the unlabeled points xj have no
effect on the posterior distribution of the latent function f, and therefore do not influence
the location of the decision boundary. In the following chapter, we present how to learn a
semi-supervised kernel that captures information from unlabeled data and thereby influences
the location of the decision boundary.
CHAPTER 3
SEMI-SUPERVISED KERNEL
The prior p(f | X) plays a significant role in semi-supervised learning, especially when only
a small amount of labelled data is available. The prior can be constructed by forming an
undirected graph G = {V, E} over the data points. The data points are the nodes V of the
graph, and the weights W = {wij} of the edges E between the nodes are based on similarity.
The prior imposes a smoothness constraint over the data points, giving higher probability to
labelings that respect the similarity structure of the graph. The similarity is usually
captured by the kernel matrix K. Given the diagonal degree matrix D with Dii = Σj wij, we can
construct the normalized Laplacian Δ = I − D^{−1/2} W D^{−1/2} of the graph, which is a
symmetric positive semi-definite matrix. Consider the spectral decomposition of the
normalized Laplacian:
\Delta = \sum_{i=1}^{n} \lambda_i v_i v_i^{T} (3.1)
where {vi} denote the eigenvectors of the normalized Laplacian and {λi} the corresponding
eigenvalues. By applying a transformation r(λ) to the eigenvalues {λi}, we obtain the
semi-supervised kernel:
K = \sum_{i=1}^{n} r(\lambda_i) v_i v_i^{T} (3.2)
where λ1 ≤ ⋅⋅⋅ ≤ λn. Eigenvectors vi with large λi correspond to rather uneven functions on
the graph and are regarded as noise, so in the semi-supervised learning framework we should
penalize them more strongly than eigenvectors vi with small λi, which represent large cluster
structures within the data. For this reason, we choose the transformation r(λ) to be a
decreasing (non-increasing) function, r(λi) ≥ r(λi+1), which reverses the order of the
eigenvalues. In the following section, we apply the Kernel Alignment algorithm to learn
the transformation function r(λ).
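The construction can be sketched with NumPy as follows. This is an illustration under assumptions, not the report's implementation: the toy graph (two triangles joined by one edge), the function names, and the fixed parametric transform r(λ) = exp(−2λ) are chosen here only to show the mechanics, whereas the report learns the weights non-parametrically.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized graph Laplacian I - D^{-1/2} W D^{-1/2} (assumes no isolated nodes)."""
    W = np.asarray(W, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_kernel(W, r):
    """Kernel sum_i r(lambda_i) v_i v_i^T built from the Laplacian eigen-system."""
    lam, V = np.linalg.eigh(normalized_laplacian(W))  # eigenvalues in ascending order
    weights = r(lam)
    # Scale column i of V by weights[i], then recombine.
    return (V * weights) @ V.T, lam

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
W_graph = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W_graph[i, j] = W_graph[j, i] = 1.0

# Hypothetical decreasing transform; smooth eigenvectors get the largest weights.
K_graph, lam_graph = spectral_kernel(W_graph, lambda lam: np.exp(-2.0 * lam))
```

With a decreasing, non-negative transform, points in the same cluster (for example nodes 0 and 1) end up more similar under the kernel than points in different clusters (nodes 0 and 5).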
3.1 KERNEL ALIGNMENT:
Kernel Alignment [15, 16] is used to evaluate the similarity between the kernel matrix
induced by the labelled dataset and the target matrix induced by the labels. To obtain the
optimal semi-supervised kernel, we should maximize the kernel alignment score:
A(\tilde{K}, T) = \frac{\langle \tilde{K}, T \rangle_F}{\sqrt{\langle \tilde{K}, \tilde{K} \rangle_F \, \langle T, T \rangle_F}} (3.3)
where K~ is the sub-matrix, restricted to the labelled points, of the semi-supervised kernel
K computed on the whole dataset, T with Tij = yi yj is the target matrix preserving the
labels of the training data, and ⟨·,·⟩_F is the Frobenius inner product. Kernel Alignment is
also a measure of clustering quality:
\langle \tilde{K}, T \rangle_F = \sum_{y_i = y_j} \tilde{K}_{ij} - \sum_{y_i \neq y_j} \tilde{K}_{ij} (3.4)
where the first term on the right-hand side of Equation (3.4) measures the similarity within
classes and the second term measures the similarity between classes. Maximizing the Kernel
Alignment score therefore means maximizing the within-class similarity and minimizing the
between-class similarity.
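A direct NumPy computation of the alignment score may help. The sketch assumes labels in {−1, +1} and uses the standard kernel-target alignment, i.e. the Frobenius inner product of the labelled kernel block with the target matrix y yᵀ, normalized by both Frobenius norms; the function and variable names are hypothetical.

```python
import numpy as np

def kernel_alignment(K_lab, y):
    """Alignment between a kernel block on labelled points and the target y y^T."""
    K_lab = np.asarray(K_lab, dtype=float)
    y = np.asarray(y, dtype=float)
    T = np.outer(y, y)                      # T_ij = y_i * y_j
    numerator = np.sum(K_lab * T)           # Frobenius inner product <K, T>
    denominator = np.sqrt(np.sum(K_lab * K_lab) * np.sum(T * T))
    return numerator / denominator

y_align = np.array([1, 1, -1, -1])
# A kernel equal to the target matrix is perfectly aligned.
score_ideal = kernel_alignment(np.outer(y_align, y_align), y_align)
# The identity kernel carries no between-point similarity information.
score_identity = kernel_alignment(np.eye(4), y_align)
```

Because the score is normalized, rescaling the kernel by a positive constant leaves it unchanged.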
3.2 CONVEX OPTIMIZATION ALGORITHM:
Different choices of the transformation function r(λ) lead to different semi-supervised
learning algorithms, and the function is often chosen from a parametric family [11, 12, 13].
In practice, however, it is difficult to choose an appropriate function family and to model
the data accurately without enough degrees of freedom. In this report, instead of a
parametric method, we apply a convex optimization algorithm to learn the transform vector
(weights) {ri} of the semi-supervised kernel over the whole data space. The convex
optimization problem [10] is:
Maximize the objective A(K~, T) (3.5)
Subject to K* = \sum_{i=1}^{n} r_i v_i v_i^{T} (3.6)
Trace(K*) = 1 (3.7)
ri ≥ 0 (3.8)
ri ≥ ri+1, i = 1, 2, ..., n − 1 (3.9)
where K* is the optimal semi-supervised kernel, Equation (3.7) ensures the scale invariance
of Kernel Alignment, Equation (3.8) ensures that K* is a positive semi-definite matrix, and
Equation (3.9) gives the order constraints, which provide a valid penalty.
CHAPTER 4
SEMI-SUPERVISED LEARNING WITH GP
Based on the ideas of the previous chapters, we propose the following algorithm:
1. Extract the features from the data space and form the graph with weight matrix W;
2. Compute the normalized Laplacian Δ = I − D^{−1/2} W D^{−1/2} and its spectral
decomposition to obtain the eigenvectors {vi};
3. Learn the optimal transform vector (weights) {ri} of the semi-supervised kernel K* by
maximizing the Kernel Alignment score;
4. Incorporate the semi-supervised kernel K* into the GP classification framework.
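In this report, we use the cumulative Gaussian likelihood function:
p(y_i | f_i) = \Phi(y_i f_i) = \int_{-\infty}^{y_i f_i} N(t | 0, 1) \, dt (4.1)
where fi = f(xi) and Φ is the standard normal cumulative distribution function.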
Given the GP prior p( f ) and the likelihood function p( y | f ) , the posterior distribution over
the latent function is obtained:
(4.2)
Given a test point xt, we obtain the predictive class probability:
(4.3)
Because a sigmoid function is chosen as the likelihood, the non-Gaussian likelihood in
Equation (4.1) makes the posterior distribution p(f | X, K*) and the predictive distribution
p(yt | X, K*, xt) analytically intractable, so analytic approximations of the integrals
are needed.
In this report, we apply the Expectation Propagation (EP) [17] algorithm to find a Gaussian
approximation q(f | X, K*) = N(f | m, Σ) of the non-Gaussian posterior p(f | X, K*) by
moment matching of the approximate marginal distributions. Given a test point xt, we obtain
the approximate posterior over the latent function value at xt:
(4.4)
(4.5)
where kt is the vector of prior covariances between the test point xt and the training data
X. The approximate posterior produced by the EP algorithm is global because the latent
function values are coupled through the GP prior.
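Once a Gaussian approximation of the latent value at the test point is available, the class probability under a cumulative Gaussian likelihood has a closed form: averaging Φ(f) over N(f | μ, σ²) gives Φ(μ / √(1 + σ²)). The sketch below evaluates it with the standard library; `mean_t` and `var_t` are hypothetical names for the approximate posterior mean and variance at xt, and the numbers are illustrative only.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predictive_probability(mean_t, var_t):
    """p(y_t = +1) when the latent value is N(mean_t, var_t) and the
    likelihood is the cumulative Gaussian: Phi(mean / sqrt(1 + var))."""
    return std_normal_cdf(mean_t / math.sqrt(1.0 + var_t))

# Same latent mean, different latent uncertainty.
p_confident = predictive_probability(2.0, 0.1)
p_uncertain = predictive_probability(2.0, 10.0)
```

A larger latent variance pulls the predicted probability towards 0.5, which is how the model expresses uncertainty about points far from the labelled data.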
CHAPTER 5
EXPERIMENTS
5.1 EXPERIMENTAL DATA
To evaluate the semi-supervised learning algorithm with Gaussian processes, we use four
datasets [10], shown in Table 5.1. The ‘One vs. Two’ and ‘Odd vs. Even’ datasets are
handwritten digit recognition tasks. ‘One vs. Two’ is to classify the digit “1” vs. “2”.
‘Odd vs. Even’ is an artificial task: classify the odd digits “1, 3, 5, 7, 9” vs. the even
digits “0, 2, 4, 6, 8”. The ‘Pc vs. Mac’ and ‘Baseball vs. Hockey’ datasets are taken from
the 20-newsgroups dataset for binary document categorization. To construct the graphs, we
use a Euclidean 10-nearest-neighbor (10NN) unweighted graph on ‘One vs. Two’ and
‘Odd vs. Even’, and the cosine similarity of TF-IDF vectors on ‘Pc vs. Mac’ and
‘Baseball vs. Hockey’.
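An unweighted k-nearest-neighbor graph of the kind used for the digit datasets can be built as in the sketch below. It is only an illustration: the symmetrization rule (connect two points if either is among the other's k nearest neighbors) is an assumption, since the text does not say how the graph is made undirected, and the six toy points stand in for real feature vectors.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric, unweighted k-nearest-neighbor adjacency matrix (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise squared Euclidean distances.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(sq, np.inf)              # a point is not its own neighbor
    nearest = np.argsort(sq, axis=1)[:, :k]   # indices of the k closest points per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    # Assumed symmetrization: keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T)

# Two well-separated groups of three points each.
X_pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
W_knn = knn_graph(X_pts, k=2)
```

The resulting matrix can serve as the weight matrix W from which the normalized Laplacian is computed.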
Table 5.1: The Information of Database
5.2 EXPERIMENTAL RESULTS AND ANALYSIS
In the experiments, we choose five different sizes of training (labelled) data for each
dataset and use the rest as test (unlabeled) data. For each fixed labelled-set size, we
perform 20 random trials and, in each trial, optimize the Kernel Alignment score to learn the
optimal semi-supervised kernel, which is incorporated into the Gaussian process model. To
verify the effectiveness of the proposed algorithm, we compare it with two algorithms: (1)
the algorithm proposed by Zhu [10], which uses an SVM as the classifier and chooses the bound
on the SVM slack variables by 5-fold cross validation; (2) a GP model combined with the RBF
kernel of Equation (2.2), which is a standard supervised learning method. The hyperparameters
θ of the RBF kernel are obtained by maximizing the marginal likelihood function q(y | X, θ):
(5.1)
The experimental results are shown in Tables 5.2 to 5.5. The results of the proposed
algorithm in this report are shown in the second and fourth columns of each table, and the
results of Zhu are shown in the third and fifth columns of each table. From the comparison of
the second and third columns in each table, the proposed algorithm outperforms the SVM with
cross validation when few labelled data are available, which demonstrates that the GP model
reliably describes and models the data space in the Bayesian framework. The comparison of the
fourth and fifth columns in each table shows that the classifiers improve their performance
by incorporating the information of unlabeled data together with labelled data, which
confirms the effectiveness of the proposed semi-supervised learning algorithm.
Table 5.2: The Average Accuracy on the One vs. Two Datasets
Table 5.3: The Average Accuracy on the Odd vs. Even Datasets
Table 5.4: The Average Accuracy on the Pc vs. Mac Datasets
Table 5.5: The Average Accuracy on the Baseball vs. Hockey Datasets
CONCLUSION
In this report we have presented a semi-supervised learning algorithm based on the Gaussian
process model, which combines a graph-based construction of a semi-supervised kernel with the
GP model. We have empirically demonstrated the reliability of the proposed algorithm, which
builds better classifiers by exploiting the information in unlabeled data.
REFERENCES
[1] Alexander Zien, Bernhard Schölkopf, Olivier Chapelle. Semi-Supervised Learning.
Cambridge: MIT Press, 2006
[2] Carl Edward Rasmussen, Christopher K. I. Williams. Gaussian Processes for Machine
Learning. Cambridge: MIT Press, 2006.
[3] Carl Edward Rasmussen. Advances in Gaussian Processes. Advances in Neural
Information Processing Systems, 2006.
[4] N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian
Process latent variable models. Journal of Machine Learning Research, 2005, 6: 1783-
1816.
[5] Simon Rogers, Mark Girolami. Multi-class Semi-supervised Learning with the
ε-truncated Multinomial Probit Gaussian Process. Journal of Machine Learning Research,
2007,1: 17-32.
[6] N. D. Lawrence. Learning for larger datasets with the Gaussian process latent variable
model. Proceedings of the Eleventh International Workshop on Artificial Intelligence and
Statistics, 2007.
[7] N. D. Lawrence, A. J. Moore. Hierarchical Gaussian process latent variable models.
Proceedings of the International Conference in Machine Learning, 2007. 481-488.
[8] Raquel Urtasun, Trevor Darrell. Discriminative Gaussian Process Latent Variable Model
for Classification. Proceedings of the International Conference in Machine Learning,
2007. 927-934
[9] T. Joachims. Transductive Inference for Text Classification using Support Vector
Machines. Proceedings of the International Conference on Machine Learning, 1999
[10] Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John Lafferty. Nonparametric
transforms of graph kernels for semi-supervised learning. Advances in Neural
Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.
[11] O. Chapelle, J. Weston, B. Schölkopf. Cluster kernels for semi-supervised learning.
Advances in Neural Information Processing Systems, 2002, 15(15).
[12] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces.
Proc. 19th International Conf. on Machine Learning, 2002.
[13] X. Zhu, Z. Ghahramani, J. Lafferty. Semi-supervised learning using Gaussian fields and
harmonic functions. 20th International Conference on Machine Learning, 2003.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, San Mateo, CA, 1988.
[15] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, J. Kandola. On kernel-target alignment. In
Advances in Neural Information Processing Systems, 2002a
[16] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan. Learning the
Kernel matrix with semi definite programming. Journal of Machine Learning Research,
2004, 5:27–72.
[17] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference.
[Ph.D.Thesis]. Department of Electrical Engineering and Computer Science, MIT, 2001.
[18] Hongwei Li, Yakui Li, Hanqing Lu. Semi-supervised Learning with Gaussian Processes.
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China