http://lamda.nju.edu.cn
Towards Bridging the Gap between Theory and Practice in Semi-supervised Learning
Zhi-Hua Zhou
http://cs.nju.edu.cn/zhouzh/
Email: [email protected]
LAMDA Group
National Key Laboratory for Novel Software Technology, Nanjing University, China
Labeled vs. Unlabeled
In many practical applications, unlabeled training data are readily available, while labeled data are expensive to obtain because labeling requires human effort
[Illustration: one web page labeled class = “war”, against an (almost) infinite number of unlabeled web pages on the Internet]
SSL: Why can unlabeled data be helpful?
[Illustration: blue or red? Once the unlabeled data reveal the cluster structure, the intuitive answer is: blue!]
Multiple views
Each view is a feature set
Many real tasks involve more than one feature set, e.g.,
Video features
Audio features
Image features
Text features
……
Ideally, the views should be
Sufficient: Each view contains sufficient information for training a good learner
Redundant: The views are conditionally independent given the class label
[Diagram: learner1 is trained on the X1 view and learner2 on the X2 view of the labeled training data; each learner labels unlabeled instances and feeds them to the other]
A representative semi-supervised learning approach, motivated by the availability of multiple views; simple but effective
Co-training
[Blum & Mitchell, COLT’98]
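To make the loop concrete, here is a minimal runnable sketch of co-training on synthetic two-view data. The toy nearest-centroid base learner, the shared labeled pool, and all parameter values are illustrative choices of this sketch, not details from [Blum & Mitchell, COLT'98]:

```python
import numpy as np

class NearestCentroid:
    """Toy base learner: predict the class of the nearest class centroid."""
    def fit(self, X, y):
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        return self

    def decision(self, X):
        # Signed score: positive means closer to the class-1 centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return d[:, 0] - d[:, 1]

    def predict(self, X):
        return (self.decision(X) > 0).astype(int)

def co_train(X1, X2, y, labeled, unlabeled, rounds=5, per_round=2):
    """Each round, train one learner per view on the labeled pool; each
    learner then pseudo-labels its most confident unlabeled instances and
    adds them to the shared labeled pool."""
    L, U = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for X in (X1, X2):
            h = NearestCentroid().fit(X[L], y[L])
            if not U:
                break
            conf = np.abs(h.decision(X[U]))
            # Pop positions in descending order so earlier pops do not
            # invalidate the remaining indices into U.
            for i in sorted(np.argsort(conf)[-per_round:], reverse=True):
                idx = U.pop(i)
                y[idx] = h.predict(X[idx:idx + 1])[0]  # pseudo-label
                L.append(idx)
    return (NearestCentroid().fit(X1[L], y[L]),
            NearestCentroid().fit(X2[L], y[L]))

# Demo on synthetic two-view data: two well-separated blobs per view.
rng = np.random.default_rng(0)
y_true = np.array([0] * 20 + [1] * 20)
X1 = rng.normal(0, 0.3, (40, 2)) + 2.0 * y_true[:, None]
X2 = rng.normal(0, 0.3, (40, 2)) + 2.0 * y_true[:, None]
labeled = [0, 1, 20, 21]                     # only 4 labeled examples
unlabeled = [i for i in range(40) if i not in labeled]
y = y_true.copy()
y[unlabeled] = -1                            # hide the unlabeled labels
h1, h2 = co_train(X1, X2, y, labeled, unlabeled)
acc = float((h1.predict(X1) == y_true).mean())
```

In the original algorithm, each learner labels examples specifically for the other learner and stronger base learners (e.g. Naïve Bayes over each view) are used; the shared-pool variant above is a common simplification.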
• Statistical parsing [Sarkar, NAACL'01; Steedman et al., EACL'03; Hwa et al., ICML'03w]
• Noun phrase identification [Pierce & Cardie, EMNLP'01]
• Image retrieval [Zhou et al., ECML'04; TOIS'06]
• … …
Widely applied to many domains, e.g.,
Wide applications
See more in:
Zhou & Li, Semi-supervised learning by disagreement, Knowledge and Information Systems, 2010, vol.24, no.3, pp.415-439
[Blum & Mitchell, COLT’98] - Given a conditional independence assumption on the distribution D, if the target class is learnable from random classification noise in the standard PAC model, then any initial weak predictor can be boosted to arbitrarily high accuracy by co-training
[Dasgupta, Littman & McAllester, NIPS'01] – When the requirement of sufficient and redundant views is met, the co-trained classifiers can achieve low generalization error by maximizing their agreement over the unlabeled data
[Balcan, Blum & Yang, NIPS’04] - Given appropriately strong PAC-learners on each view, a weaker “expansion” assumption on the underlying data distribution is sufficient for iterative co-training to succeed
… …
Theoretical studies
All assumed “multi-views”
• Using two different types of decision trees [Goldman & Zhou, ICML'00]
• Using classifiers constructed from different data samples [Zhou & Li, TKDE04]
• Using regression learners constructed with different distance metrics and/or k values [Zhou & Li, IJCAI’05]
• … …
Effective single-view algorithms
See more in:
Zhou & Li, Semi-supervised learning by disagreement, Knowledge and Information Systems, 2010, vol.24, no.3, pp.415-439
[Wang & Zhou, ECML’07]
A theoretical result:
“Large diversity” is sufficient !
Roughly speaking, the key requirement of co-training is that the two learners have large diversity and that each learner is better than a weak learner
Performance of the learners observed in experiments: the performance cannot be improved further after a number of rounds
Previous theoretical studies suggested that the performance could always be improved
Performance of Co-training
The empirical/theoretical gap
[Wang & Zhou, ECML’07]
A theoretical result:
The gap will definitely occur
Roughly speaking, as the co-training process continues, the learners become more and more similar, so it is inevitable that co-training cannot improve performance further after a number of iterations
Based on this result, we obtain a method for roughly estimating the appropriate round at which to terminate the co-training process
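The paper derives its estimate from the theoretical analysis; purely as an illustration of the underlying idea, one could monitor how fast the two learners' disagreement on unlabeled data shrinks and stop when it plateaus. The stopping rule and threshold below are assumptions of this sketch, not the paper's estimator:

```python
import numpy as np

def disagreement(p1, p2):
    """Fraction of unlabeled instances the two learners label differently."""
    return float(np.mean(np.asarray(p1) != np.asarray(p2)))

def should_stop(history, eps=0.01):
    """Illustrative stopping rule: terminate once the round-to-round drop in
    disagreement falls below eps, i.e. the learners have become too similar
    for further rounds to help."""
    return len(history) >= 2 and history[-2] - history[-1] < eps

# Example: disagreement measured after each co-training round.
history = [disagreement([0, 1, 1, 0, 1], [1, 0, 1, 0, 0]),  # 3/5 differ
           0.2, 0.1, 0.095]                                  # later rounds
stop = should_stop(history)
```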
Estimating the round for termination
[Results: data with two views; data with a single view]
In either case, the performance at the estimated round is close to that at the final learning round
Several sufficient conditions for co-training:
Conditional independence [Blum & Mitchell, COLT’98]
Weak dependence [Abney, ACL’02]
Expansion [Balcan, Blum & Yang, NIPS’04]
Large diversity [Wang & Zhou, ECML’07]
Theoretical status
What is the necessary condition? A sufficient and necessary condition was proved by [Wang & Zhou, ICML'10]
Our theoretical results show that
“multi-views” are not really needed!
Can we really do … without “multi-views”?
… an application example -->
If we really have two sufficient and redundant views, we can even do SSL with a single labeled example. See: Zhou, Zhan & Yang. Semi-supervised learning with very few labeled training examples. AAAI'07.
Microprocessor design space exploration
Design Space Exploration (DSE) aims to determine a promising processor architecture that meets the design specification
Many design parameters
• Core number, Frequency, Cache Size, Buffer Size
• Cache Replacement Policy
Design Space consists of billions of design configurations (i.e., combinations of design parameters)
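The combinatorial blow-up is easy to see: the design space size is the product of the per-parameter option counts. The parameter names and option counts below are made up purely for illustration:

```python
# Hypothetical option counts per design parameter (illustrative only).
design_parameters = {
    "core_number": 8,
    "frequency": 10,
    "cache_size": 12,
    "buffer_size": 12,
    "cache_replacement_policy": 4,
}

# The design space size is the product of the option counts.
space_size = 1
for options in design_parameters.values():
    space_size *= options

# Already 46,080 configurations for just five parameters; with a few dozen
# parameters the product quickly reaches billions of configurations.
```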
To use unlabeled data for DSE
There are no “two views”
Comprehensibility is important
COMT (CO-training Model Trees),
which uses two model trees, without requiring two views
[Qi, Chen, Chen, Zhou, Hu & Xu, IJCAI’11]
[Bar chart: mean squared errors (MSEs) of ANNs, M5P, and COMT on SPEC CPU2000 benchmarks (applu, art, bzip2, crafty, equake, galgel, gcc, lucas, mcf, swim, twolf, vpr), with improvements ranging from 30% to 84%]
Compared to the DSE state of the art, 30% to 84% improvement
[Qi, Chen, Chen, Zhou, Hu & Xu, IJCAI’11]
Industry application: Godson-3B
Vendor  Chip          No. of Cores  Frequency (GHz)  Peak Performance (GFlops)  Power (W)  Performance/Power (GFlops/W)
ICT     Godson-3B     8             1.0              128                        40         3.2
Intel   Xeon X5600    6             3.0              72                         95         0.76
Intel   Xeon X5570    4             2.9              46.4                       95         0.5
AMD     Opteron 8384  4             2.7              43.2                       75         0.6
IBM     Power6+       2             4.7              37.6                       120        0.3
IBM     Power7        8             3.3              211.2                      140        1.5
IEEE Spectrum (May 2011) reports
“Chinese Chip Wins Energy-Efficiency Crown”
What is crucial?
Diversity!
Diversity is well known to be crucial for ensemble methods
See Chapter 5 (“Diversity”) of Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, Boca Raton, FL: Chapman & Hall/CRC, Jun. 2012.
So nice to have semi-supervised learning!
However, nothing can be perfect
An observation
The use of unlabeled data may degrade performance
[Cozman, Cohen & Cirelo, ICML’03]
Similar observations have been reported in the literature [Nigam et al., MLJ'00; Zhang & Oles, ICML'00; Blum & Chawla, ICML'01; Zhou & Li, TKDE'05; Balcan et al., ICML'05w; Chapelle et al., JMLR'08; Jebara et al., ICML'09]
Why degeneration?
[Plots: performance of Naïve Bayes classifiers [Cozman, Cohen & Cirelo, ICML'03], trained from data generated by a Naïve Bayes model vs. data generated by a TAN model]
For generative methods, mismatch of the model assumption is one apparent reason
What about general SSL approaches ?
Safe SSL – the holy grail of current SSL research
by using unlabeled data, the performance is never statistically significantly worse than that obtained by using only the original (limited) labeled data
For disagreement-based SSL
During disagreement-based SSL process, the learners will provide pseudo-labels to some unlabeled instances
Incorrect pseudo-labels lead to performance degradation
How about editing the pseudo-labeled data to avoid unreliable instances?
Data editing: a technique that attempts to improve the quality of the training set by identifying and eliminating training examples that were incorrectly labeled
Some effective methods: [Wilson, TSMC'72; Koplowitz & Brown, PR'81; Sánchez et al., PRL'03; Jiang & Zhou, ISNN'05]
Editing pseudo-labeled data
The self-training approach [Nigam & Ghani, CIKM’00] suffers seriously from incorrect pseudo-labels
[Diagram: the learner is trained on the labeled data, labels unlabeled data, and adds the self-labeled data back into the training set]
Editing pseudo-labeled data
[Diagram: as above, but a data-editing step cleans the self-labeled data before it is added back to the training set]
SETRED (SElf-TRaining with Editing)
[Li & Zhou, PAKDD’05]
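A minimal editing filter in the spirit of Wilson-style data editing can illustrate the idea. SETRED itself uses a statistical test on cut-edge weights in a neighborhood graph; the kNN-agreement rule below is a simplified stand-in, and its function name and parameters are mine:

```python
import numpy as np

def edit_pseudo_labels(X_lab, y_lab, X_new, y_new, k=3):
    """Keep a pseudo-labeled point only if at least half of its k nearest
    labeled neighbours carry the same label as its pseudo-label."""
    keep = []
    for x, y in zip(X_new, y_new):
        dist = np.linalg.norm(X_lab - x, axis=1)
        neighbours = y_lab[np.argsort(dist)[:k]]
        keep.append(bool(np.mean(neighbours == y) >= 0.5))
    return np.array(keep)

# Demo: two labeled clusters; one plausible and one suspicious pseudo-label.
X_lab = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_lab = np.array([0, 0, 1, 1])
X_new = np.array([[0.0, 0.5],   # sits in cluster 0
                  [5.0, 5.5]])  # sits in cluster 1
y_new = np.array([0, 0])        # the second pseudo-label is suspicious
keep = edit_pseudo_labels(X_lab, y_lab, X_new, y_new)
```

The first point agrees with its labeled neighbourhood and is kept; the second contradicts its neighbourhood and is rejected before it can pollute the training set.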
[Li & Zhou, PAKDD’05]
UCI datasets, 25% test, 75% training (unlabeled rate 90%)
Nearest Neighbor (NN) classifiers are used
Baselines: NN-L (NN trained from the labeled data L only) and NN-A (NN trained from L+U with all labels given)
SETRED performance
For S3VMs
From the large-margin separator to the low-density separator [Chapelle & Zien, ICML'05]
[Figure: labeled and unlabeled examples, with the separator f placed through a low-density region]
The objective combines a loss on the labeled examples with a loss on the unlabeled examples, in the standard form
min_f  ||f||^2 / 2  +  C1 * Σ_i max(0, 1 - y_i f(x_i))  +  C2 * Σ_j max(0, 1 - |f(x_j)|)
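The two loss terms can be sketched numerically. The function names are mine; the hinge and “hat” forms are the standard S3VM choices:

```python
import numpy as np

def hinge_loss(y, f):
    """Loss on a labeled point: max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def hat_loss(f):
    """Loss on an unlabeled point: max(0, 1 - |f(x)|). It is zero when the
    point lies outside the margin, so minimizing it pushes the decision
    boundary through low-density regions."""
    return np.maximum(0.0, 1.0 - np.abs(f))

# A labeled point with margin 0.3 is penalized; an unlabeled point near the
# boundary (f = 0.1) is penalized, while one far from it (f = 2.0) is not.
l_labeled = hinge_loss(1, 0.3)
l_near = hat_loss(0.1)
l_far = hat_loss(2.0)
```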
Low-density separation
Low-density separation is the fundamental assumption of S3VMs
There is often more than one low-density separator
Selecting an incorrect low-density separator leads to performance degradation
[Li & Zhou, ICML’11]
S4VM: Safe S3VM
Instead of generating a single low-density separator, S4VM:
• generates multiple diverse low-density separators
• makes the optimal prediction
[Li & Zhou, ICML’11]
Make the optimal prediction
Given a set of candidate low-density separators, and the inductive SVM trained from the labeled data only,
the S4VM prediction must not be worse than that of the inductive SVM
Make the optimal prediction (cont'd)
Gains against the inductive SVM: instances the new prediction gets right but the inductive SVM gets wrong
Losses against the inductive SVM: instances the new prediction gets wrong but the inductive SVM gets right
However, the ground truth is unknown!
A simple idea: get closer to the ground truth
How about considering the worst case over the candidate separators?
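The worst-case reasoning can be sketched as follows. The paper optimizes the prediction over all labelings by exploiting the structure of gain and loss; restricting the search to the candidate labelings themselves, as below, is my simplification, and the function names are illustrative:

```python
import numpy as np

def gain_loss(y_bar, y_star, y_svm):
    """Gain: points y_bar gets right but the inductive SVM gets wrong.
    Loss: points y_bar gets wrong but the inductive SVM gets right."""
    gain = int(np.sum((y_bar == y_star) & (y_svm != y_star)))
    loss = int(np.sum((y_bar != y_star) & (y_svm == y_star)))
    return gain, loss

def worst_case_predict(candidates, y_svm, lam=1.0):
    """Pick the candidate labeling whose worst-case (gain - lam * loss),
    taken over all candidates as possible ground truths, is largest."""
    def worst(y_bar):
        return min(g - lam * l
                   for y_star in candidates
                   for g, l in [gain_loss(y_bar, y_star, y_svm)])
    return max(candidates, key=worst), max(worst(c) for c in candidates)

# Demo: the SVM labeling itself is one of the candidates.
y_svm = np.array([1, 1, -1])
candidates = [np.array([1, 1, -1]),    # identical to the SVM labeling
              np.array([1, -1, -1])]
y_pred, value = worst_case_predict(candidates, y_svm)
```

Note that the labeling identical to the inductive SVM always has gain = loss = 0 against every possible ground truth, so the maximin value is at least 0; this is the intuition behind “never worse than the inductive SVM”.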
S4VM objective
The gains and losses are linear functions of the prediction
Note: S4VM does not rely on a single low-density separator
Theoretical property
S4VM is safe under the same assumption as S3VM (that is, that the ground truth is a low-density separator)
[Li & Zhou, ICML’11]
Multiple diverse separators
Generating multiple diverse low-density separators:
minimize the S3VM objective functional over the separators, plus a penalty term quantifying the mutual agreement of the separators, to encourage diversity
Multiple diverse separators (cont'd)
This is a non-convex task. We provide two implementations, resulting in two versions of S4VM:
• S4VMa: global simulated annealing search
• S4VMs: sampling
[Li & Zhou, ICML’11]
S4VMa
Simulated annealing (SA) is a popular probabilistic method for approaching the global solution
S4VMa is based almost purely on SA, with local search for speedup
S4VMs
Find N > T low-density separators
Then select T representative separators with large diversity
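One simple way to realize the selection step is greedy farthest-point selection under Hamming distance between labelings. This is an illustrative stand-in for the paper's procedure, not its actual algorithm:

```python
import numpy as np

def select_diverse(labelings, T):
    """Greedily pick T labelings: start from the first, then repeatedly add
    the candidate whose minimum Hamming distance to the chosen set is
    largest, so the kept separators disagree as much as possible."""
    labelings = [np.asarray(l) for l in labelings]
    chosen = [0]
    while len(chosen) < T:
        best_i, best_d = None, -1
        for i in range(len(labelings)):
            if i in chosen:
                continue
            d = min(int(np.sum(labelings[i] != labelings[j])) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return chosen

# Demo: three candidate labelings over four unlabeled points.
candidates = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1]]
picked = select_diverse(candidates, T=2)
```

Starting from the first labeling, the third is picked over the second because it disagrees on all four points rather than just one.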
S4VM performance
TSVM often significantly degrades performance; S4VM never significantly degrades performance
S4VM's accuracy is highly competitive with TSVM's
S4VM beats TSVM on 28/46 cases with 10 labeled examples and on 28/46 cases with 100 labeled examples
Setup: 12 splits for the benchmark datasets, 30 splits for the UCI datasets; 10 separators; S4VMa gives similar observations (though it is less efficient than S4VMs)
[Li & Zhou, ICML’11]
Some comparisons
• S3VMmin: selecting the separator with the smallest objective value
• S3VMcom: combining all separators using uniform weights
S3VMmin and S3VMcom often degrade performance
The “best single”
• S3VMbest: selecting the best separator using test set (by cheating)
The “best singles” are far from the ground truth; selecting the “best single” is not safe
S4VM does not rely on a single separator
It is quite robust with respect to the assumption that the ground truth is among the low-density separators
The talk involves joint work with
My students: Ming Li, Yu-Feng Li, Wei Wang
My collaborators: Tianshi Chen, Yunji Chen, Qi Guo, Weiwu Hu, Ling Li, Zhiwei Xu, ……
For details
Theoretical studies on disagreement-based SSL:
W. Wang and Z.-H. Zhou. Analyzing co-training style algorithms. In: Proceedings of the 18th European Conference on Machine Learning (ECML'07), Warsaw, Poland, 2007, pp.454-465.
W. Wang and Z.-H. Zhou. A new analysis of co-training. In: Proceedings of the 27th International Conference on Machine Learning (ICML'10), Haifa, Israel, 2010, pp.1135-1142.
SSL for CPU design:
Q. Guo, T. Chen, Y. Chen, Z.-H. Zhou, W. Hu, and Z. Xu. Effective and efficient microprocessor design space exploration using unlabeled design configurations. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11), Barcelona, Spain, 2011, pp.1671-1677.
T. Chen, Y. Chen, Q. Guo, Z.-H. Zhou, L. Li, and Z. Xu. Effective and efficient microprocessor design space exploration using unlabeled design configurations. ACM Transactions on Intelligent Systems and Technology, in press.
For details
Towards Safe SSL:
M. Li and Z.-H. Zhou. SETRED: Self-training with editing. In: Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'05), Hanoi, Vietnam, 2005, pp.611-621.
Y.-F. Li and Z.-H. Zhou. Improving semi-supervised support vector machines through unlabeled instances selection. In: Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI'11), San Francisco, CA, 2011, pp.386-391.
Y.-F. Li and Z.-H. Zhou. Towards making unlabeled data never hurt. In: Proceedings of the 28th International Conference on Machine Learning (ICML'11), Bellevue, WA, 2011, pp.1081-1088.
Code: http://lamda.nju.edu.cn/code_S4VM.ashx
Thanks!