
ORIGINAL ARTICLE

Semi-supervised classification with privileged information

Zhiquan Qi · Yingjie Tian · Lingfeng Niu · Bo Wang

Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
Corresponding author: Yingjie Tian ([email protected])

Received: 25 December 2014 / Accepted: 12 June 2015 / Published online: 30 June 2015

© Springer-Verlag Berlin Heidelberg 2015

Abstract Privileged information, which is available only for the training examples and not for the test examples, is a concept proposed by Vapnik and Vashist (Neural Netw 22(5–6):544–557, 2009). With the help of privileged information, learning using privileged information (LUPI) (Neural Netw 22(5–6):544–557, 2009) can significantly accelerate the speed of learning. However, LUPI is a standard supervised learning method. In many real-world problems there are also large amounts of unlabeled data, which drives us to solve such problems under a semi-supervised learning framework. In this paper, we propose semi-supervised learning using privileged information (called Semi-LUPI), which exploits both the distribution information in unlabeled data and the privileged information to improve the efficiency of learning. Furthermore, we compare the relative importance of both types of information for the learning model. All experiments verify the effectiveness of the proposed method, and show that Semi-LUPI obtains superior performance over traditional supervised and semi-supervised methods.

Keywords Classification · Support vector machine · Privileged information

1 Introduction

Consider the classical learning model on the training data [2]

$\{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x_i \in \mathcal{X} \subseteq R^n,\ y_i \in \mathcal{Y} = \{-1, 1\},$  (1)

where $x_i$ denotes the $i$th training point and $y_i$ is its class label. The learner's aim is to select a suitable classifier from a given collection of functions $f(x, \alpha),\ \alpha \in \Lambda$, which minimizes the number of misclassified points [3].

However, in the human learning process, teachers play an important role. They teach students knowledge through all kinds of information, such as comments, comparisons, explanations, and logical, emotional or metaphorical reasoning. Likewise, during the machine learning process, a teacher may describe training examples with such additional information. Vapnik et al. [1, 4–6] called this kind of additional information privileged information, which is available only at the training stage and never for test samples, and then gave a new learning model, learning using privileged information (called LUPI), which has been proven, through statistical learning theory, to significantly increase the speed of learning [1, 4, 5].

Recently, semi-supervised learning has attracted increasing interest [7–11]. One important reason is that labeled examples are often rare, while large amounts of unlabeled examples are available in many practical problems. Graph-based methods are a very important branch of this field: nodes in the graph are the labeled and unlabeled points, and weighted edges reflect the similarities between nodes. The underlying assumption of these methods is that all points are located on a low-dimensional


manifold, and the graph is used as an approximation of the underlying manifold. Neighboring point pairs connected by large-weight edges tend to have the same labels, and vice versa. By this means, the labels associated with the data can be propagated throughout the graph. Using the graph Laplacian, [12] proposed the Laplacian support vector machine (Lap-SVM). Unlike other graph-based methods [13–15], Lap-SVM has a natural out-of-sample extension: it can classify data that become available after the training process, without having to retrain the classifier or resort to various heuristics [12].

In this paper, we propose a novel semi-supervised learning method using general privileged information (called Semi-LUPI), which can effectively exploit labeled data, unlabeled data, and privileged information to improve the performance of the classifier, and which is a useful extension of LUPI. Moreover, Semi-LUPI can be solved efficiently as a standard quadratic programming problem.

The remaining parts of the paper are organized as follows. Section 2 briefly introduces the background of LUPI; Sect. 3 describes the proposed method, Semi-LUPI; Sect. 4 gives various extensions of Semi-LUPI; all experimental results are shown in Sect. 5; the last section gives the conclusions.

2 Background

First, we give the mathematical formulation of the privileged classification problem [1, 16] as follows.

Privileged classification problem [1]: Given a training set

$T = \{(x_1, x_1^*, y_1), \ldots, (x_l, x_l^*, y_l)\},\ \ x_i \in R^n,\ x_i^* \in R^m,\ y_i \in \{-1, 1\},\ i = 1, \ldots, l,$  (2)

where $x_i$ denotes the $i$th training point, $x_i^*$ denotes the additional information about the $i$th training point, and $y_i$ is its class label. The goal is to find a real-valued function $g(x)$ on $R^n$ such that the value of $y$ for any $x$ can be predicted by the decision function

$f(x) = \mathrm{sgn}(g(x)).$  (3)

Since the additional information $x_i^* \in X^*$ is included in the training input $(x_i, x_i^*)$ but not in any test input $x$, Vapnik et al. [1] call it the privileged information.

In order to explain the basic idea of LUPI, we first introduce the definition of the oracle function.

Definition 1 (Oracle function) [1] Given a traditional classification problem with the training set

$T = \{(x_1, y_1), \ldots, (x_l, y_l)\}.$  (4)

Suppose there exists a best but unknown linear hyperplane

$(w_0 \cdot x) + b_0 = 0.$  (5)

The oracle function $\xi(x)$ of the input $x$ is defined as follows:

$\xi^0 = \xi(x) = [1 - y((w_0 \cdot x) + b_0)]_+,$  (6)

where

$[\eta]_+ = \begin{cases} \eta, & \text{if } \eta > 0, \\ 0, & \text{otherwise.} \end{cases}$  (7)
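To make the definition concrete, here is a minimal Python sketch that computes the oracle slacks of Eqs. (6)–(7) for a given hyperplane; since the best hyperplane $(w_0, b_0)$ is unknown in practice, this is purely illustrative.

import numpy as np

def oracle_slacks(X, y, w0, b0):
    """Oracle slacks xi(x) = [1 - y((w0 . x) + b0)]_+  (Eqs. (6)-(7)).

    X  : (l, n) array of training inputs
    y  : (l,) array of labels in {-1, +1}
    w0 : (n,) weight vector of the (unknown) best hyperplane
    b0 : its bias
    """
    margins = y * (X @ w0 + b0)
    return np.maximum(0.0, 1.0 - margins)  # [.]_+ is the positive part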

If we knew the value of the oracle function on each training input $x_i$, so that we had the triplets $(x_i, \xi_i^0, y_i)$ with $\xi_i^0 = \xi(x_i),\ i = 1, \ldots, l$, we could accelerate the learning rate. In practice, however, a teacher does not know the values of the slacks. Instead, Vapnik et al. [1] use a so-called correcting function to approximate the oracle function. In the linear case,

$\phi(x^*) = (w^* \cdot x^*) + b^*.$  (8)

Replacing $\xi_i\ (i = 1, \ldots, l)$ by $\phi(x_i^*)$ in the primal problem of the SVM, we get the following primal problem:

$\min_{w, w^*, b, b^*} \;\; \frac{1}{2}\left(\|w\|^2 + \gamma\|w^*\|^2\right) + C\sum_{i=1}^{l}\left[(w^* \cdot x_i^*) + b^*\right],$
$\text{s.t.} \;\; y_i\left[(w \cdot x_i) + b\right] \ge 1 - \left[(w^* \cdot x_i^*) + b^*\right],$
$\qquad (w^* \cdot x_i^*) + b^* \ge 0, \quad i = 1, \ldots, l.$  (9)

The corresponding dual problem is as follows:

$\max_{\alpha, \beta} \;\; \sum_{j=1}^{l}\alpha_j - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - \frac{1}{2\gamma}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i + \beta_i - C)(\alpha_j + \beta_j - C)(x_i^* \cdot x_j^*),$
$\text{s.t.} \;\; \sum_{i=1}^{l}\alpha_i y_i = 0, \quad \sum_{i=1}^{l}(\alpha_i + \beta_i - C) = 0,$
$\qquad \alpha_i \ge 0,\ \beta_i \ge 0, \quad i = 1, \ldots, l.$  (10)

For the nonlinear case, we introduce two transformations, $x \mapsto \Phi(x): R^n \to \mathcal{H}$ and $x^* \mapsto \Phi^*(x^*): R^m \to \mathcal{H}^*$, and the primal problem is constructed as follows:


$\min_{w, w^*, b, b^*} \;\; \frac{1}{2}\left(\|w\|^2 + \gamma\|w^*\|^2\right) + C\sum_{i=1}^{l}\left[(w^* \cdot \Phi^*(x_i^*)) + b^*\right],$
$\text{s.t.} \;\; y_i\left[(w \cdot \Phi(x_i)) + b\right] \ge 1 - \left[(w^* \cdot \Phi^*(x_i^*)) + b^*\right],$
$\qquad (w^* \cdot \Phi^*(x_i^*)) + b^* \ge 0, \quad i = 1, \ldots, l.$  (11)

Similarly, we can give its dual programming:

$\max_{\alpha, \beta} \;\; \sum_{j=1}^{l}\alpha_j - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \frac{1}{2\gamma}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i + \beta_i - C)(\alpha_j + \beta_j - C)K^*(x_i^*, x_j^*),$
$\text{s.t.} \;\; \sum_{i=1}^{l}\alpha_i y_i = 0, \quad \sum_{i=1}^{l}(\alpha_i + \beta_i - C) = 0,$
$\qquad \alpha_i \ge 0,\ \beta_i \ge 0, \quad i = 1, \ldots, l.$  (12)
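In the experiments of this paper, both $K$ and $K^*$ are instantiated with RBF kernels (see Sect. 5). A small generic Python helper for computing such kernel matrices (an illustrative sketch, not code from the paper):

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))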

3 Semi-LUPI

In this section, we elaborate on our proposed method, Semi-LUPI.

First, we are given the set of labeled data (1) and a set of unlabeled data

$(x_{l+1}, \ldots, x_{l+u}),$  (13)

where $x_{l+i} \in R^n,\ i = 1, \ldots, u$. Suppose the labeled data are generated according to a distribution $P$ on $\mathcal{X} \times R$, whereas the unlabeled examples are drawn according to the marginal distribution $P_X$ of $P$. Labels can be obtained from the conditional probability distribution $P(y|x)$. According to [12], the semi-supervised learning framework can be expressed as

$\min_{f \in \mathcal{H}_K} \;\; \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_H \|f\|_H^2 + \gamma_M \|f\|_M^2,$  (14)

where $\mathcal{H}_K$ is a reproducing kernel Hilbert space, $f$ is a classifier defined on a manifold $\mathcal{M}$, and $V$ is a loss function on the labeled data. $\gamma_H$ is the weight of $\|f\|_H^2$ and controls the complexity of $f$ in the reproducing kernel Hilbert space; $\gamma_M$ is the weight of $\|f\|_M^2$ and controls the complexity of the function in the intrinsic geometry of the marginal distribution, $\|f\|_M^2$ penalizing $f$ along the Riemannian manifold $\mathcal{M}$.

Now, our goal is to make use of the labeled data with privileged information and the unlabeled data together to infer labels. By the Representer Theorem, the weights $w$ can be expressed as $w = \sum_{i=1}^{l+u} \alpha_i \Phi(x_i)$, and $K$ denotes the kernel matrix formed by the kernel function $K(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))$. So the regularization term $\|f\|_H^2$ can be rewritten as

$\|f\|_H^2 = \alpha^\top K \alpha.$  (15)

Similarly, the correcting function can be rewritten as

$\phi(x^*) = \sum_{j=1}^{l} \alpha_j^* K^*(x_j^*, x^*) + b^*,$  (16)

where $\alpha^* = (\alpha_1^*, \ldots, \alpha_l^*)$ and $K^* = (\Phi^*(x_i^*) \cdot \Phi^*(x_j^*))_{l \times l}$. Replacing $\|f\|_H^2$ by (15) and introducing the correcting function in place of the slacks, the formulation of Semi-LUPI can be expressed as

$\min_{\alpha, \alpha^*, b, b^*} \;\; \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \gamma_3 \|f\|_M^2,$
$\text{s.t.} \;\; y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \left[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\right],$
$\qquad \sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = 1, \ldots, l.$  (17)

An important premise of this kind of approach is the assumption that the probability distribution of the data has the geometric structure of a Riemannian manifold $\mathcal{M}$: the labels of two points that are close in the intrinsic geometry of $P_X$ should be the same or similar. [12] applied the intrinsic regularizer $\|f\|_M^2$ to encode this constraint:

$\|f\|_M^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} W_{ij}\,(f(x_i) - f(x_j))^2 = f^\top L f,$  (18)

where $L$ is the graph Laplacian. In practice, a data adjacency graph over the $l+u$ input samples is built, with a weight matrix $W \in R^{(l+u)\times(l+u)}$ whose entries $W_{ij}$ represent the similarity of each pair of input samples. The weight matrix $W$ may be defined by $k$-nearest neighbors or graph kernels as follows [12]:

$W_{ij} = \begin{cases} \exp\left(-\|x_i - x_j\|_2^2 / 2\sigma^2\right), & \text{if } x_i, x_j \text{ are neighbors}, \\ 0, & \text{otherwise}, \end{cases}$  (19)

where $\|x_i - x_j\|_2$ denotes the Euclidean norm in $R^n$. $L = D - W$ is the graph Laplacian, $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$, and $f = [f(x_1), \ldots, f(x_{l+u})]^\top = K\alpha$.
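As an illustration, the following Python sketch builds the adjacency matrix $W$ of Eq. (19) with a $k$-nearest-neighbor rule and returns the Laplacian $L = D - W$; the function name and the choice of a symmetrized kNN graph are our own illustrative assumptions, not prescribed by [12].

import numpy as np

def graph_laplacian(X, k=5, sigma=1.0):
    """Build W (Eq. (19)) over all l+u points and return L = D - W and W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(sq[i])[1:k + 1]        # k nearest neighbors (skip self)
        W[i, idx] = np.exp(-sq[i, idx] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                      # symmetrize the kNN graph
    D = np.diag(W.sum(axis=1))
    return D - W, W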

When (18) is used as a penalty term in problem (17), it can be understood as follows: if a neighboring pair $x_i, x_j$ has high similarity ($W_{ij}$ is large), a large difference between $f(x_i)$ and $f(x_j)$ incurs a big penalty. More intuitively, the smaller $|f(x_i) - f(x_j)|$ is, the smoother $f(x)$ is over the data adjacency graph. So (17) can be translated into the following optimization problem:


$\min_{\alpha, \alpha^*, b, b^*} \;\; \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$
$\text{s.t.} \;\; y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \left[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\right],$
$\qquad \sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = 1, \ldots, l.$  (20)

The Lagrangian corresponding to problem (20) is given by

$L(\Theta) = \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha$
$\qquad - \sum_{i=1}^{l} \beta_i \left( y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] - 1 + \left[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\right] \right)$
$\qquad - \sum_{i=1}^{l} \eta_i \left( \sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \right),$  (21)

where $\Theta = \{\alpha, \alpha^*, b, b^*, \beta, \eta\}$, and $\beta = (\beta_1, \ldots, \beta_l)^\top$, $\eta = (\eta_1, \ldots, \eta_l)^\top$ are the Lagrange multipliers. So the dual problem can be formulated as

$\max_{\Theta} \;\; L(\Theta)$
$\text{s.t.} \;\; \nabla_{\alpha, \alpha^*, b, b^*} L(\Theta) = 0, \quad \beta, \eta \ge 0.$  (22)

From Eq. (22), we get

$\nabla_\alpha L = \left(2\gamma_1 K + \frac{2\gamma_3}{(l+u)^2} K L K\right)\alpha - K J^\top Y \beta = 0,$  (23)

$\nabla_{\alpha^*} L = 2\gamma_2 K^* \alpha^* + \frac{1}{l} K^* e - K^*(\beta + \eta) = 0,$  (24)

$\nabla_b L = \sum_{i=1}^{l} y_i \beta_i = 0,$  (25)

$\nabla_{b^*} L = 1 - \sum_{i=1}^{l} \beta_i - \sum_{i=1}^{l} \eta_i = 0,$  (26)

where $J = [I\ 0]$ is an $l \times (l+u)$ matrix with $I$ the $l \times l$ identity matrix, and $Y = \mathrm{diag}(y_1, \ldots, y_l)$ is the diagonal matrix of labels. Now, substituting (23)–(26) into the dual (22), we obtain the Wolfe dual of problem (20) as follows:

$\max_{\beta, \eta} \;\; \sum_{i=1}^{l} \beta_i - \frac{1}{2}\beta^\top Q \beta - \frac{1}{4\gamma_2}\left(\beta + \eta - \frac{1}{l}e\right)^\top K^* \left(\beta + \eta - \frac{1}{l}e\right),$
$\text{s.t.} \;\; \sum_{i=1}^{l} y_i \beta_i = 0, \quad 1 - \sum_{i=1}^{l} \beta_i - \sum_{i=1}^{l} \eta_i = 0,$
$\qquad \beta \ge 0,\ \eta \ge 0,$  (27)

where

$Q = Y J K \left(2\gamma_1 I + \frac{2\gamma_3}{(l+u)^2} L K\right)^{-1} J^\top Y.$  (28)

From (27), it is easy to see that this is a standard convex quadratic programming problem, and we do not need to solve for the additional variables $\alpha^*$ and $b^*$. Finally, Semi-LUPI can be summarized as the following Algorithm 1.

Algorithm 1 Semi-LUPI

• Input the training set $T$ given by (1) and (13);
• Choose two appropriate kernels $K(\cdot,\cdot)$ and $K^*(\cdot,\cdot)$, and parameters $\gamma_1, \gamma_2, \gamma_3 > 0$;
• Construct and solve the convex quadratic programming problem (27), obtaining the solution $\beta^*, \eta^*$;
• Select a component index $j$ such that $\beta_j^* > 0$ and $\eta_j^* > 0$, and compute $b = y_j - \sum_{i=1}^{l+u} \alpha_i^* K(x_i, x_j)$, where $\alpha^* = \left(2\gamma_1 I + \frac{2\gamma_3}{(l+u)^2} L K\right)^{-1} J^\top Y \beta^*$;
• Construct the decision function $f(x) = \mathrm{sgn}(g(x))$, where $g(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x_i, x) + b$.
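To make the training procedure concrete, here is a minimal Python sketch of Algorithm 1 that assembles $Q$ from Eq. (28) and solves the dual (27) with the off-the-shelf QP solver cvxopt, stacking $z = (\beta, \eta)$. The variable names, the use of cvxopt, and the small ridge added for numerical stability are our own illustrative choices, not part of the paper.

import numpy as np
from cvxopt import matrix, solvers

def train_semi_lupi(K, Ks, L, y, g1, g2, g3):
    """Sketch of Algorithm 1.  K: (l+u)x(l+u) kernel, Ks: l x l privileged
    kernel, L: graph Laplacian, y: labels (+-1) of the first l points."""
    l, n = len(y), K.shape[0]
    J = np.hstack([np.eye(l), np.zeros((l, n - l))])     # J = [I 0]
    Y = np.diag(y.astype(float))
    M = 2 * g1 * np.eye(n) + (2 * g3 / n ** 2) * (L @ K)
    Q = Y @ J @ K @ np.linalg.solve(M, J.T @ Y)          # Eq. (28)
    c = np.full(l, 1.0 / l)
    # rewrite dual (27) as: min (1/2) z'Pz + q'z over z = (beta, eta) >= 0
    P = np.zeros((2 * l, 2 * l))
    P[:l, :l] = Q + Ks / (2 * g2)
    P[:l, l:] = P[l:, :l] = Ks / (2 * g2)
    P[l:, l:] = Ks / (2 * g2)
    P += 1e-8 * np.eye(2 * l)                            # numerical ridge
    q = np.concatenate([-1.0 - (Ks @ c) / (2 * g2), -(Ks @ c) / (2 * g2)])
    A = np.vstack([np.concatenate([y.astype(float), np.zeros(l)]),  # y'beta = 0
                   np.ones(2 * l)])                      # sum beta + sum eta = 1
    b = np.array([0.0, 1.0])
    G, h = -np.eye(2 * l), np.zeros(2 * l)               # z >= 0
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h),
                     matrix(A), matrix(b))
    z = np.array(sol['x']).ravel()
    beta, eta = z[:l], z[l:]
    alpha = np.linalg.solve(M, J.T @ Y @ beta)           # expansion coefficients
    j = int(np.argmax(np.minimum(beta, eta)))            # index with beta_j, eta_j > 0
    bias = y[j] - K[:, j] @ alpha
    return alpha, bias

Prediction on a new point then evaluates $g(x) = \sum_i \alpha_i K(x_i, x) + b$, which is the natural out-of-sample extension inherited from Lap-SVM.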

4 Other extensions of Semi-LUPI

In this section, we give some extensions of Semi-LUPI.

4.1 Mixture model of slacks

Modeling the slacks by the values of some smooth function is not always the best choice [1]. Let us instead model the slacks by a mixture of the values of a smooth function $\phi(x_i^*) = \sum_{j=1}^{l} \alpha_j^* K^*(x_j^*, x_i^*) + b^*$ and slack variables $\xi_i^*,\ i = 1, \ldots, l$. The primal optimization problem (20) then becomes


$\min_{\alpha, \alpha^*, b, b^*, \xi^*} \;\; \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \frac{\theta}{l} \sum_{i=1}^{l} \xi_i^* + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$
$\text{s.t.} \;\; y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \left[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\right] - \xi_i^*, \quad i = 1, \ldots, l,$
$\qquad \sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0,\ \xi_i^* \ge 0, \quad i = 1, \ldots, l,$  (29)

where $\theta > 0$ is a pre-specified penalty factor. The dual problem of this algorithm is almost the same as (27); the only difference is that the dual gains the extra box constraint $\beta \le \frac{\theta}{l} e$.
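In the QP sketch given after Algorithm 1 (our illustrative implementation), this extension amounts to appending an upper bound on the $\beta$ block of the inequality system, for example:

import numpy as np

def add_beta_upper_bound(G, h, l, theta):
    """Augment G z <= h (z = (beta, eta), as in the earlier sketch)
    with the extra dual constraint beta <= (theta / l) e from Sect. 4.1."""
    G_extra = np.hstack([np.eye(l), np.zeros((l, l))])
    h_extra = np.full(l, theta / l)
    return np.vstack([G, G_extra]), np.concatenate([h, h_extra])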

4.2 The case where only some samples possess privileged information

When only some of the samples come with corresponding privileged information, the given training data can be expressed as

$(x_1, y_1), \ldots, (x_n, y_n),\ (x_{n+1}, x_{n+1}^*, y_{n+1}), \ldots, (x_l, x_l^*, y_l),\ x_{l+1}, \ldots, x_{l+u}.$  (30)

In this situation, (20) can be rewritten as

$\min_{\alpha, \alpha^*, b, b^*, \xi} \;\; \gamma_1 \alpha^\top K \alpha + \gamma_2 \sum_{i=n+1}^{l}\sum_{j=n+1}^{l} \alpha_{i-n}^* \alpha_{j-n}^* K^*(x_i^*, x_j^*) + \frac{1}{l-n} \sum_{i=n+1}^{l}\sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_i^*, x_j^*) + b^* + \frac{\theta}{n} \sum_{i=1}^{n} \xi_i + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$
$\text{s.t.} \;\; y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$
$\qquad y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \left[\sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_i^*, x_j^*) + b^*\right], \quad i = n+1, \ldots, l,$
$\qquad \sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = n+1, \ldots, l.$  (31)

The corresponding correcting function becomes

$\phi(x^*) = \sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_j^*, x^*) + b^*,$  (32)

and the dual problem for this case can be written as

$\max_{\beta, \eta} \;\; \sum_{i=1}^{l} \beta_i - \frac{1}{2}\beta^\top Q \beta - \frac{1}{4\gamma_2}\sum_{i=n+1}^{l}\sum_{j=n+1}^{l}\left(\beta_i + \eta_i - \frac{1}{l-n}\right)\left(\beta_j + \eta_j - \frac{1}{l-n}\right)K^*(x_i^*, x_j^*),$
$\text{s.t.} \;\; \sum_{i=1}^{l} y_i \beta_i = 0, \quad 1 - \sum_{i=n+1}^{l} \beta_i - \sum_{i=n+1}^{l} \eta_i = 0,$
$\qquad 0 \le \beta_i \le \frac{\theta}{n}, \quad i = 1, \ldots, n,$
$\qquad \beta_i \ge 0,\ \eta_i \ge 0, \quad i = n+1, \ldots, l.$  (33)

4.3 The privileged information with different dimensions

Suppose the privileged information is described in different spaces. For simplicity, we consider only two spaces, $X^*$ and $X^{**}$. The given training data are

$(x_1, x_1^*, y_1), \ldots, (x_n, x_n^*, y_n),\ (x_{n+1}, x_{n+1}^{**}, y_{n+1}), \ldots, (x_l, x_l^{**}, y_l),\ x_{l+1}, \ldots, x_{l+u}.$  (34)

In this case, the primal problem may be expressed as

$\min_{\alpha, \alpha^*, \alpha^{**}, b, b^*, b^{**}} \;\; \gamma_1 \alpha^\top K \alpha + \gamma_2\left(\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i^* \alpha_j^* K^*(x_i^*, x_j^*) + \sum_{i=n+1}^{l}\sum_{j=n+1}^{l} \alpha_{i-n}^{**} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**})\right)$
$\qquad + \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_j^* K^*(x_i^*, x_j^*) + b^* + \frac{1}{l-n}\sum_{i=n+1}^{l}\sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) + b^{**} + \frac{\gamma_3}{(l+u)^2}\alpha^\top K L K \alpha,$
$\text{s.t.} \;\; y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \left[\sum_{j=1}^{n} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\right], \quad i = 1, \ldots, n,$
$\qquad \sum_{j=1}^{n} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = 1, \ldots, n,$
$\qquad y_i\left[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\right] \ge 1 - \left[\sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) + b^{**}\right], \quad i = n+1, \ldots, l,$
$\qquad \sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) + b^{**} \ge 0, \quad i = n+1, \ldots, l.$  (35)

The corresponding correcting functions become

$\phi(x^*) = \sum_{j=1}^{n} \alpha_j^* K^*(x_j^*, x^*) + b^*,$  (36)


$\phi(x^{**}) = \sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_j^{**}, x^{**}) + b^{**},$  (37)

and the dual problem is

$\max_{\beta, \eta} \;\; \sum_{i=1}^{l} \beta_i - \frac{1}{2}\beta^\top Q \beta - \frac{1}{4\gamma_2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\beta_i + \eta_i - \frac{1}{n}\right)\left(\beta_j + \eta_j - \frac{1}{n}\right)K^*(x_i^*, x_j^*)$
$\qquad - \frac{1}{4\gamma_2}\sum_{i=n+1}^{l}\sum_{j=n+1}^{l}\left(\beta_i + \eta_i - \frac{1}{l-n}\right)\left(\beta_j + \eta_j - \frac{1}{l-n}\right)K^{**}(x_i^{**}, x_j^{**}),$
$\text{s.t.} \;\; \sum_{i=1}^{l} y_i \beta_i = 0,$
$\qquad 1 - \sum_{i=1}^{n} \beta_i - \sum_{i=1}^{n} \eta_i = 0, \quad 1 - \sum_{i=n+1}^{l} \beta_i - \sum_{i=n+1}^{l} \eta_i = 0,$
$\qquad \beta_i \ge 0,\ \eta_i \ge 0, \quad i = 1, \ldots, l.$  (38)

5 Experiments

In this section, we compare Semi-LUPI against LUPI [1] and Lap-SVM [12] on time series prediction datasets and the MNIST dataset. For simplicity, we set $\gamma_2 = 1$; $\gamma_1, \gamma_3$ and the RBF kernel parameter $\sigma$ are all selected from the set $\{2^i \mid i = -7, \ldots, 7\}$ [16, 17].

All algorithms are implemented in MATLAB 2010. The experimental environment is an Intel Core i7-2600 CPU with 4 GB of memory. For comparison purposes, the quadprog function of MATLAB is employed to solve the quadratic programming problems related to this paper.
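For reproducibility, the model selection described above can be sketched as a simple grid search against the validation set; the helper signature below is our own illustrative assumption, not code from the paper.

import itertools

GRID = [2.0 ** i for i in range(-7, 8)]   # {2^i | i = -7, ..., 7}

def select_parameters(evaluate):
    """evaluate(g1, g3, sigma) -> validation error; gamma2 is fixed to 1.

    Returns the (gamma1, gamma3, sigma) triple with the lowest validation error.
    """
    return min(itertools.product(GRID, GRID, GRID),
               key=lambda p: evaluate(*p))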

5.1 Time series prediction

The time series prediction datasets are obtained from the Mackey–Glass time series [18], which is described by the equation

$\frac{dx(t)}{dt} = -a\,x(t) + \frac{b\,x(t-\tau)}{1 + x^{10}(t-\tau)},$  (39)

where $t > 0$ and $a, b, \tau$ are parameters of the equation. The experiment's goal is to predict whether the value of the time series at moment $t + \Delta$ will be larger or smaller than the value at $t$, given the historical values of the time series up to moment $t$. Many similar prediction problems arise in financial markets.

Specifically, examples of the time series before $t^*$ are taken as the standard input data, and values at moments $t'$ between $t^*$ and $t$ can be taken as the privileged information (the "future in the past"), which is not available for testing but is obtainable for training (see Fig. 1).

Similar to [1], we use the Mackey–Glass series with parameters $a = 0.1,\ b = 0.2,\ \tau = 17$ and $x(\tau) = 1.1$, and construct the training data as follows:

$x_t = (x(t-3), x(t-2), x(t-1), x(t)).$  (40)

The corresponding privileged information can be expressed as

$x_t^* = (x(t+\Delta-2), x(t+\Delta-1), x(t+\Delta+1), x(t+\Delta+2)).$  (41)
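A minimal Python sketch of this data construction, assuming a simple Euler discretization of Eq. (39) with unit step and a constant initial history (the discretization details are our own illustrative choices):

import numpy as np

def mackey_glass(n, a=0.1, b=0.2, tau=17):
    """Euler-discretized Mackey-Glass series (Eq. (39)) with step size 1."""
    x = np.full(n + tau, 1.1)                    # constant history, x(tau) = 1.1
    for t in range(tau, n + tau - 1):
        x[t + 1] = x[t] - a * x[t] + b * x[t - tau] / (1 + x[t - tau] ** 10)
    return x[tau:]

def make_dataset(x, delta):
    """Inputs (40), privileged features (41), labels y = sign(x(t+d) - x(t))."""
    X, Xs, y = [], [], []
    for t in range(3, len(x) - delta - 2):
        X.append([x[t - 3], x[t - 2], x[t - 1], x[t]])
        Xs.append([x[t + delta - 2], x[t + delta - 1],
                   x[t + delta + 1], x[t + delta + 2]])
        y.append(1 if x[t + delta] > x[t] else -1)
    return np.array(X), np.array(Xs), np.array(y)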

As in [1], we set $\Delta = 1, 5, 8$ to generate three classification problems. The sizes of the training sets are 100, 200, 400 and 500, respectively. A validation set of 500 examples is used to select the model parameters; 500 examples are treated as unlabeled data, and 500 are used for testing.

Fig. 1 The interpretation of the privileged information in the time series prediction problem

Table 1 Error rates (%) of Semi-LUPI, LUPI and Lap-SVM on the Mackey–Glass series

Model      Size of training set   Δ = 1   Δ = 5   Δ = 8
Lap-SVM    100                    3.82    6.17    8.11
LUPI       100                    3.12    5.43    7.73
Semi-LUPI  100                    2.64    4.78    6.43
Lap-SVM    200                    3.62    6.24    7.64
LUPI       200                    2.43    4.68    7.21
Semi-LUPI  200                    2.12    3.97    5.98
Lap-SVM    400                    3.34    4.65    6.33
LUPI       400                    1.91    3.64    5.56
Semi-LUPI  400                    1.68    3.31    4.17
Lap-SVM    500                    2.24    4.55    5.12
LUPI       500                    1.81    2.92    4.43
Semi-LUPI  500                    1.42    2.18    3.44


Table 1 and Fig. 2 give the final results of the three methods.

From the results, we can draw the following conclusions: 1) As $\Delta$ changes, the privileged information has a different impact on the final error rates for the Mackey–Glass series classification. The case $\Delta = 1$ is superior to the cases $\Delta = 5$ and $\Delta = 8$. This shows that the closer the data points taken as privileged information are to the corresponding training points, the more obvious the effect of the privileged information. 2) Semi-LUPI performs better than LUPI and Lap-SVM. This result is not surprising, because Semi-LUPI uses more prior information to improve the quality of learning. 3) LUPI outperforms Lap-SVM in all cases. This suggests that the prior information supplied by the teacher is far richer than the distribution information supplied by the unlabeled samples.

The first panel of Fig. 5 shows how the error rate changes with the number of unlabeled data for $\Delta = 1$ and 500 labeled samples as training data. With the increase of unlabeled data, our algorithm's performance improves gradually.

5.2 Digits recognition

In the second experiment, we use the MNIST dataset [1]. Similar to [1], we consider only the binary classification

Fig. 2 The results of Lap-SVM (a), LUPI (b) and Semi-LUPI (c) for $\Delta = 5$; each panel plots $x(t)$ against $t$. All data are split into four parts (left to right): the first part (yellow) is for training; the second part (purple) is for verification; the third part (gray) is unlabeled data; the last part is for testing (correctly predicted points are shown in cyan, and erroneously predicted points in red) (color figure online)


problem of "5" versus "8" on $28 \times 28$ pixel images (the database contains 5522 and 5652 images of "5" and "8", respectively). In order to make the problem more challenging, these digits are further resized to $10 \times 10$ pixel images (see Fig. 3).

Each training point with privileged information was supplied with a holistic description of the corresponding image [1]. These holistic descriptions are translated into 21-dimensional features such as two-part-ness (0–5), tilting to the right (0–3), aggressiveness (0–2), stability (0–3), uniformity (0–3), and so on. This privileged information was created prior to the learning process by an independent expert (more details can be found at the NEC lab website¹).

Figure 4 illustrates the results obtained by varying the number of training data over a wider range. From all samples of "5" and "8", we pick the first 50 images of "5" and the first 50 images of "8" as the training set with privileged information. Training sets of smaller sizes are randomly extracted from these 100 selected images. 2000 digits are used as a fixed validation set, 2000 digits are used as the unlabeled set, and 1766 digits are used as a fixed test set. Lap-SVM1 in Fig. 4 uses samples with a resolution of $10 \times 10$, and Lap-SVM2 uses $28 \times 28$ (note: the privileged information was not used as part of the unlabeled dataset in this experiment or the next one).

From the results, we find that when the number of samples of "5" and "8" is small, the error rates of LUPI, Semi-LUPI and Lap-SVM are very close, but when the number of samples is larger than 35, Semi-LUPI and LUPI perform better than Lap-SVM. This shows that the privileged information can significantly increase the speed of learning. In addition, with the help of unlabeled data, the average error rate of Semi-LUPI is 1.459% lower than that of LUPI, and Semi-LUPI outperforms LUPI in all cases. Note that although Lap-SVM2 uses digits at the higher resolution, its accuracy is still lower than that of LUPI and Semi-LUPI. This suggests that the information contained in the poetic descriptions is even greater than the total obtained from both the unlabeled data and the high-resolution images.

The second panel of Fig. 5 shows how the error rate changes with the number of unlabeled data. For each class, 50 samples are randomly selected as labeled ones. The number of unlabeled samples ranges from 100 to 2000. As can be seen, the performance improves with more unlabeled data.

5.3 Image classification

In this subsection, we apply the proposed method to image classification on the PASCAL 2006 dataset (see Fig. 6) [19]. The dataset contains 10 object categories (cats, bicycles, cows, motorbikes, cars, dogs, buses, sheep, people, horses) and 5304 images. For simplicity, we pick only 50 images of "cat" and 50 images of "dog" as the training set. 50 samples are used as a fixed validation set, 50 images are used as the unlabeled set, and 100 images are used as a fixed test set. The color representation method of [20] is used to extract image features. All images are resized to gray images of $80 \times 100$.

In order to obtain the corresponding privileged information, we created a holistic description for each training sample. A holistic description of a "cat" is as follows

Fig. 3 Samples of "5" and "8" at different image resolutions. Images in the first and third rows are $28 \times 28$; images in the second and fourth rows are $10 \times 10$. When the resolution is reduced, some of the images become vague and incomplete, and it is even hard to recognize them by human eyes


Fig. 4 Error rates (%) versus sizes of training data for Lap-SVM1 (using $10 \times 10$ digits), Lap-SVM2 (using $28 \times 28$ digits), LUPI and Semi-LUPI

¹ http://www.nec-labs.com/research/machine/ml_website/department/software/learning-with-teacher


(see Fig. 6a): the ear is small in proportion to the face; the mouth is narrow and non-prominent; the nose is small and its color is light; the head is short and rounded; the lip is hardly visible on the face; the color of the whole body is very bright and rich; the whole body is visible; there are several cats in the picture; the image is clear. A holistic description of a "dog" is as follows (see Fig. 6b): the ear is large in proportion to the face; the mouth is wide and prominent; the nose is large and black; the face is very long; the lip is also long, just like a zipper on the face; the color of the whole body is very dark and lacks diversity; only part of the body is visible; there is only one dog in the picture; the image is clear.

We translate these holistic descriptions into 11-dimensional feature vectors: the length of the ear in proportion to the face (0–5)²; the width of the mouth (0–5); the prominence of the mouth (0–6); the size of the nose (0–6); the color of the nose (0–4); the length of the head (0–5); the appearance of the head (0–4); the length of the lip (0–6); whether the whole body is visible (1–2); the number of animals (0–6); the clearness of the image (0–5).

² 0–5 is the range of possible values.
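Such hand-coded descriptions map straightforwardly to privileged feature vectors $x^* \in R^{11}$; a tiny illustrative Python sketch (the attribute names and the example scores are hypothetical, chosen only to mirror the list above):

import numpy as np

ATTRIBUTES = ["ear_length", "mouth_width", "mouth_prominence", "nose_size",
              "nose_color", "head_length", "head_appearance", "lip_length",
              "whole_body_visible", "animal_count", "image_clearness"]

def encode_description(scores):
    """Map a holistic description (attribute -> integer score) to x* in R^11."""
    return np.array([float(scores[a]) for a in ATTRIBUTES])

# hypothetical scores for a 'cat' description like the one above
cat_xstar = encode_description({
    "ear_length": 1, "mouth_width": 1, "mouth_prominence": 0, "nose_size": 1,
    "nose_color": 1, "head_length": 1, "head_appearance": 3, "lip_length": 0,
    "whole_body_visible": 2, "animal_count": 3, "image_clearness": 5})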

Fig. 5 Error rates (%) versus sizes of unlabeled data: the comparison between Lap-SVM and Semi-LUPI on the various datasets. (a) Time series prediction; (b) digits recognition; (c) image classification

Fig. 6 Image classification on PASCAL 2006 dataset

Fig. 7 Accuracy versus dimensions of privileged information for Lap-SVM, LUPI and Semi-LUPI: the final results of "cat" versus "dog" on the PASCAL 2006 dataset


Figure 7 gives the results obtained by varying the dimension of the privileged information vectors. Since Lap-SVM cannot use the privileged information, its accuracy does not change. The performance of LUPI and Semi-LUPI steadily improves as more privileged information is added. Because Semi-LUPI uses more additional information, its accuracy is 0.76% higher than that of LUPI and 2.18% higher than that of Lap-SVM. The third panel of Fig. 5 shows how the error rate changes with the number of unlabeled data. The result is similar to the above two experiments: with the increase of unlabeled data, Semi-LUPI achieves better performance.

6 Conclusion

In human behavior and cognition, the teacher always plays an important role. In the field of machine learning, however, the information offered by a teacher is seldom exploited. Recently, Vapnik et al. introduced a new learning paradigm called learning using privileged information (LUPI), which mainly considers how to include a "teacher" in the learning process. Theory and experiments show that LUPI can accelerate the convergence rate of learning, especially when the learning problem itself is hard. In this paper, we have proposed a novel semi-supervised classification method (called Semi-LUPI) and its various extensions, which can simultaneously utilize the geometric information of the marginal distribution embedded in unlabeled data and the additional (privileged) information offered by a teacher to improve classification performance. In order to deal with different forms of privileged information, we have also given several extensions of Semi-LUPI. All experiments confirm the effectiveness of our method. In future work, we will consider how to further accelerate the algorithm; extensions to online learning and multi-class classification are also of interest.

Acknowledgments This work has been partially supported by grants from the National Natural Science Foundation of China (Nos. 61472390, 61402429, 11271361, 11201472, 11331012), a key project of the National Natural Science Foundation of China (No. 71331005), and a Major International (Regional) Joint Research Project (No. 71110107026).

References

1. Vapnik V, Vashist A (2009) A new learning paradigm: learning using privileged information. Neural Netw 22(5–6):544–557
2. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
3. Vapnik V (1996) The nature of statistical learning theory. Springer, New York
4. Vapnik V (2006) Estimation of dependences based on empirical data (information science and statistics). Springer, Berlin
5. Pechyony D, Vapnik V (2010) On the theory of learning with privileged information. In: Advances in neural information processing systems, vol 23
6. Pechyony D, Izmailov R, Vashist A, Vapnik V (2010) SMO-style algorithms for learning using privileged information. In: DMIN. CSREA Press, Providence, pp 235–241
7. Seeger M (2001) Learning with labeled and unlabeled data. Technical report
8. Chapelle O, Scholkopf B, Zien A (eds) (2006) Semi-supervised learning (adaptive computation and machine learning). The MIT Press, Cambridge
9. Zhu X (2006) Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin, Madison
10. Belkin M, Matveeva I, Niyogi P (2004) Regularization and semi-supervised learning on large graphs. In: COLT. Springer, Berlin, pp 624–638
11. Grandvalet Y, Bengio Y (2005) Semi-supervised learning by entropy minimization. In: CAP, PUG, pp 281–296
12. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
13. Joachims T (2003) Transductive learning via spectral graph partitioning. In: ICML, pp 290–297
14. Belkin M, Niyogi P (2002) Using manifold structure for partially labelled classification. In: NIPS, pp 953–960
15. Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML, pp 912–919
16. Deng N, Tian Y, Zhang C (2011) Optimization based data mining: theory and applications. Springer, Berlin
17. Tian Y, Shi Y, Liu X (2012) Recent advances on support vector machines research. Technol Econ Dev Econ 18(1):5–33
18. Mackey MC, Glass L (1977) Oscillation and chaos in physiological control systems. Science 197(4300):287–289
19. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The PASCAL visual object classes challenge 2006 (VOC 2006) results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
20. Deng Y, Manjunath BS, Kenney C, Moore MS, Shin H (2001) An efficient color representation for image retrieval. IEEE Trans Image Process 10:140–147
