http://lamda.nju.edu.cn
Towards Bridging the Gap between Theory and Practice in Semi-supervised Learning
Zhi-Hua Zhou
http://cs.nju.edu.cn/zhouzh/
Email: [email protected]
LAMDA Group
National Key Laboratory for Novel Software Technology, Nanjing University, China
Labeled vs. Unlabeled
In many practical applications, unlabeled training data are readily available, while labeled data are expensive to obtain because labeling requires human effort
[Illustration: one web page labeled class = “war”, against an (almost) infinite number of unlabeled web pages on the Internet]
SSL: Why can unlabeled data be helpful?
[Illustration: blue or red? Once the unlabeled data reveal the cluster structure, the intuitive answer is: blue!]
Multiple views
Each view is a feature set
Many real tasks involve more than one feature set, e.g.,
Video features
Audio features
Image features
Text features
……
Ideally, the views should be
Sufficient: Each view contains sufficient information for training a good learner
Redundant: The views are conditionally independent given the class label
[Diagram: learner1 is trained on the X1 view and learner2 on the X2 view of the labeled training data; each learner labels unlabeled instances and feeds them to the other]
A representative semi-supervised learning approach, motivated by the availability of multiple views; simple but effective
Co-training
[Blum & Mitchell, COLT’98]
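To make the loop concrete, here is a minimal runnable sketch of co-training on synthetic two-view data. The toy nearest-centroid base learner, the shared labeled pool, and all parameter values are illustrative choices of this sketch, not details from [Blum & Mitchell, COLT'98]:

```python
import numpy as np

class NearestCentroid:
    """Toy base learner: predict the class of the nearest class centroid."""
    def fit(self, X, y):
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        return self

    def decision(self, X):
        # Signed score: positive means closer to the class-1 centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return d[:, 0] - d[:, 1]

    def predict(self, X):
        return (self.decision(X) > 0).astype(int)

def co_train(X1, X2, y, labeled, unlabeled, rounds=5, per_round=2):
    """Each round, train one learner per view on the labeled pool; each
    learner then pseudo-labels its most confident unlabeled instances and
    adds them to the shared labeled pool."""
    L, U = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for X in (X1, X2):
            h = NearestCentroid().fit(X[L], y[L])
            if not U:
                break
            conf = np.abs(h.decision(X[U]))
            # Pop positions in descending order so earlier pops do not
            # invalidate the remaining indices into U.
            for i in sorted(np.argsort(conf)[-per_round:], reverse=True):
                idx = U.pop(i)
                y[idx] = h.predict(X[idx:idx + 1])[0]  # pseudo-label
                L.append(idx)
    return (NearestCentroid().fit(X1[L], y[L]),
            NearestCentroid().fit(X2[L], y[L]))

# Demo on synthetic two-view data: two well-separated blobs per view.
rng = np.random.default_rng(0)
y_true = np.array([0] * 20 + [1] * 20)
X1 = rng.normal(0, 0.3, (40, 2)) + 2.0 * y_true[:, None]
X2 = rng.normal(0, 0.3, (40, 2)) + 2.0 * y_true[:, None]
labeled = [0, 1, 20, 21]                     # only 4 labeled examples
unlabeled = [i for i in range(40) if i not in labeled]
y = y_true.copy()
y[unlabeled] = -1                            # hide the unlabeled labels
h1, h2 = co_train(X1, X2, y, labeled, unlabeled)
acc = float((h1.predict(X1) == y_true).mean())
```

In the original algorithm, each learner labels examples specifically for the other learner and stronger base learners (e.g. Naïve Bayes over each view) are used; the shared-pool variant above is a common simplification.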
• Statistical parsing [Sarkar, NAACL'01; Steedman et al., EACL'03; Hwa et al., ICML'03w]
• Noun phrase identification [Pierce & Cardie, EMNLP'01]
• Image retrieval [Zhou et al., ECML'04; TOIS'06]
• … …
Widely applied to many domains, e.g.,
Wide applications
See more in:
Zhou & Li, Semi-supervised learning by disagreement, Knowledge and Information Systems, 2010, vol.24, no.3, pp.415-439
[Blum & Mitchell, COLT’98] - Given a conditional independence assumption on the distribution D, if the target class is learnable from random classification noise in the standard PAC model, then any initial weak predictor can be boosted to arbitrarily high accuracy by co-training
[Dasgupta, Littman & McAllester, NIPS'01] – When the requirement of sufficient and redundant views is met, the co-trained classifiers can achieve low generalization error by maximizing their agreement over the unlabeled data
[Balcan, Blum & Yang, NIPS’04] - Given appropriately strong PAC-learners on each view, a weaker “expansion” assumption on the underlying data distribution is sufficient for iterative co-training to succeed
… …
Theoretical studies
All assumed “multi-views”
• Using two different types of decision trees [Goldman & Zhou, ICML'00]
• Using classifiers constructed from different data samples [Zhou & Li, TKDE04]
• Using regression learners constructed with different distance metrics and/or k values [Zhou & Li, IJCAI’05]
• … …
Effective single-view algorithms
See more in:
Zhou & Li, Semi-supervised learning by disagreement, Knowledge and Information Systems, 2010, vol.24, no.3, pp.415-439
[Wang & Zhou, ECML’07]
A theoretical result:
“Large diversity” is sufficient !
Roughly speaking, the key requirement of co-training is that the two learners have large diversity and that each learner is better than a weak learner
Performance of the learners observed in experiments: the performance cannot be improved further after a number of rounds
Previous theoretical studies suggested that the performance could always be improved
Performance of Co-training
The empirical/theoretical gap
[Wang & Zhou, ECML’07]
A theoretical result:
The gap will definitely occur
Roughly speaking, as the co-training process continues, the learners become more and more similar, so it is inevitable that co-training cannot improve performance further after a number of iterations
Based on this result, we obtain a method for roughly estimating the appropriate round at which to terminate the co-training process
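The paper derives its estimate from the theoretical analysis; purely as an illustration of the underlying idea, one could monitor how fast the two learners' disagreement on unlabeled data shrinks and stop when it plateaus. The stopping rule and threshold below are assumptions of this sketch, not the paper's estimator:

```python
import numpy as np

def disagreement(p1, p2):
    """Fraction of unlabeled instances the two learners label differently."""
    return float(np.mean(np.asarray(p1) != np.asarray(p2)))

def should_stop(history, eps=0.01):
    """Illustrative stopping rule: terminate once the round-to-round drop in
    disagreement falls below eps, i.e. the learners have become too similar
    for further rounds to help."""
    return len(history) >= 2 and history[-2] - history[-1] < eps

# Example: disagreement measured after each co-training round.
history = [disagreement([0, 1, 1, 0, 1], [1, 0, 1, 0, 0]),  # 3/5 differ
           0.2, 0.1, 0.095]                                  # later rounds
stop = should_stop(history)
```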
Estimating the round for termination
[Results: data with two views; data with a single view]
In either case, the performance at the estimated round is close to that at the final learning round
Several sufficient conditions for co-training:
Conditional independence [Blum & Mitchell, COLT’98]
Weak dependence [Abney, ACL’02]
Expansion [Balcan, Blum & Yang, NIPS’04]
Large diversity [Wang & Zhou, ECML’07]
Theoretical status
What is the necessary condition? A sufficient and necessary condition was proved by [Wang & Zhou, ICML'10]
Our theoretical results show that
“multi-views” are not really needed!
Can we really do … without “multi-views”?
… an application example -->
If we really have two sufficient and redundant views, we can even do SSL with a single labeled example. See: Zhou, Zhan & Yang. Semi-supervised learning with very few labeled training examples. AAAI'07.
Microprocessor design space exploration
Design Space Exploration (DSE) aims to determine a promising processor architecture that meets the design specification
Many design parameters
• Core number, Frequency, Cache Size, Buffer Size
• Cache Replacement Policy
Design Space consists of billions of design configurations (i.e., combinations of design parameters)
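The combinatorial blow-up is easy to see: the design space size is the product of the per-parameter option counts. The parameter names and option counts below are made up purely for illustration:

```python
# Hypothetical option counts per design parameter (illustrative only).
design_parameters = {
    "core_number": 8,
    "frequency": 10,
    "cache_size": 12,
    "buffer_size": 12,
    "cache_replacement_policy": 4,
}

# The design space size is the product of the option counts.
space_size = 1
for options in design_parameters.values():
    space_size *= options

# Already 46,080 configurations for just five parameters; with a few dozen
# parameters the product quickly reaches billions of configurations.
```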
To use unlabeled data for DSE
There are no “two views”
Comprehensibility is important
COMT (CO-training Model Trees),
which uses two model trees, without requiring two views
[Qi, Chen, Chen, Zhou, Hu & Xu, IJCAI’11]
[Bar chart: mean squared errors (MSEs) of ANNs, M5P, and COMT on SPEC CPU2000 benchmarks (applu, art, bzip2, crafty, equake, galgel, gcc, lucas, mcf, swim, twolf, vpr), with improvements ranging from 30% to 84%]
Compared to the DSE state of the art, 30% to 84% improvement
[Qi, Chen, Chen, Zhou, Hu & Xu, IJCAI’11]
Industry application: Godson-3B
Vendor  Chip          No. of Cores  Frequency (GHz)  Peak Performance (GFlops)  Power (W)  Performance/Power (GFlops/W)
ICT     Godson-3B     8             1.0              128                        40         3.2
Intel   Xeon X5600    6             3.0              72                         95         0.76
Intel   Xeon X5570    4             2.9              46.4                       95         0.5
AMD     Opteron 8384  4             2.7              43.2                       75         0.6
IBM     Power6+       2             4.7              37.6                       120        0.3
IBM     Power7        8             3.3              211.2                      140        1.5
IEEE Spectrum (May 2011) reports
“Chinese Chip Wins Energy-Efficiency Crown”
What is crucial?
Diversity!
Diversity is well known to be crucial for ensemble methods
See Chapter 5 (“Diversity”) of Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, Boca Raton, FL: Chapman & Hall/CRC, Jun. 2012.
So nice to have semi-supervised learning!
However, nothing can be perfect
An observation
The use of unlabeled data may degrade performance
[Cozman, Cohen & Cirelo, ICML’03]
Similar observations have been reported in the literature [Nigam et al., MLJ'00; Zhang & Oles, ICML'00; Blum & Chawla, ICML'01; Zhou & Li, TKDE'05; Balcan et al., ICML'05w; Chapelle et al., JMLR'08; Jebara et al., ICML'09]
Why degeneration?
[Plots: performance of Naïve Bayes classifiers [Cozman, Cohen & Cirelo, ICML'03], trained from data generated by a Naïve Bayes model vs. data generated by a TAN model]
For generative methods, mismatch of the model assumption is one apparent reason
What about general SSL approaches ?
Safe SSL – the holy grail of current SSL research
by using unlabeled data, the performance is never statistically significantly worse than that obtained by using only the original (limited) labeled data
For disagreement-based SSL
During disagreement-based SSL process, the learners will provide pseudo-labels to some unlabeled instances
Incorrect pseudo-labels lead to performance degradation
How about editing the pseudo-labeled data to avoid unreliable instances?
Data editing: a technique that attempts to improve the quality of the training set by identifying and eliminating training examples that were incorrectly labeled
Some effective methods: [Wilson, TSMC'72; Koplowitz & Brown, PR'81; Sánchez et al., PRL'03; Jiang & Zhou, ISNN'05]
Editing pseudo-labeled data
The self-training approach [Nigam & Ghani, CIKM’00] suffers seriously from incorrect pseudo-labels
[Diagram: the learner is trained on the labeled data, labels unlabeled data, and adds the self-labeled data back into the training set]
Editing pseudo-labeled data
[Diagram: as above, but a data-editing step cleans the self-labeled data before it is added back to the training set]
SETRED (SElf-TRaining with Editing)
[Li & Zhou, PAKDD’05]
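A minimal editing filter in the spirit of Wilson-style data editing can illustrate the idea. SETRED itself uses a statistical test on cut-edge weights in a neighborhood graph; the kNN-agreement rule below is a simplified stand-in, and its function name and parameters are mine:

```python
import numpy as np

def edit_pseudo_labels(X_lab, y_lab, X_new, y_new, k=3):
    """Keep a pseudo-labeled point only if at least half of its k nearest
    labeled neighbours carry the same label as its pseudo-label."""
    keep = []
    for x, y in zip(X_new, y_new):
        dist = np.linalg.norm(X_lab - x, axis=1)
        neighbours = y_lab[np.argsort(dist)[:k]]
        keep.append(bool(np.mean(neighbours == y) >= 0.5))
    return np.array(keep)

# Demo: two labeled clusters; one plausible and one suspicious pseudo-label.
X_lab = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_lab = np.array([0, 0, 1, 1])
X_new = np.array([[0.0, 0.5],   # sits in cluster 0
                  [5.0, 5.5]])  # sits in cluster 1
y_new = np.array([0, 0])        # the second pseudo-label is suspicious
keep = edit_pseudo_labels(X_lab, y_lab, X_new, y_new)
```

The first point agrees with its labeled neighbourhood and is kept; the second contradicts its neighbourhood and is rejected before it can pollute the training set.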
[Li & Zhou, PAKDD’05]
UCI datasets, 25% test, 75% training (unlabeled rate 90%)
Nearest Neighbor (NN) classifiers are used
Baselines: NN-L (NN trained from the labeled data L only) and NN-A (NN trained from L+U with all labels given)
SETRED performance
For S3VMs
From the large-margin separator to the low-density separator [Chapelle & Zien, ICML'05]
[Figure: labeled and unlabeled examples, with the separator f placed through a low-density region]
The objective combines a loss on the labeled examples with a loss on the unlabeled examples, in the standard form
min_f  ||f||^2 / 2  +  C1 * Σ_i max(0, 1 - y_i f(x_i))  +  C2 * Σ_j max(0, 1 - |f(x_j)|)
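The two loss terms can be sketched numerically. The function names are mine; the hinge and “hat” forms are the standard S3VM choices:

```python
import numpy as np

def hinge_loss(y, f):
    """Loss on a labeled point: max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def hat_loss(f):
    """Loss on an unlabeled point: max(0, 1 - |f(x)|). It is zero when the
    point lies outside the margin, so minimizing it pushes the decision
    boundary through low-density regions."""
    return np.maximum(0.0, 1.0 - np.abs(f))

# A labeled point with margin 0.3 is penalized; an unlabeled point near the
# boundary (f = 0.1) is penalized, while one far from it (f = 2.0) is not.
l_labeled = hinge_loss(1, 0.3)
l_near = hat_loss(0.1)
l_far = hat_loss(2.0)
```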
Low-density separation
Low-density separation is the fundamental assumption of S3VMs
There is often more than one low-density separator
Selecting an incorrect low-density separator leads to performance degradation
[Li & Zhou, ICML’11]
S4VM: Safe S3VM
Instead of generating a single low-density separator, S4VM:
• generates multiple diverse low-density separators
• makes the optimal prediction
[Li & Zhou, ICML’11]
Make the optimal prediction
Given a set of candidate low-density separators, and the inductive SVM trained from the labeled data only,
the S4VM prediction must not be worse than that of the inductive SVM
Make the optimal prediction (cont'd)
Gains against the inductive SVM: instances the new prediction gets right but the inductive SVM gets wrong
Losses against the inductive SVM: instances the new prediction gets wrong but the inductive SVM gets right
However, the ground truth is unknown!
A simple idea: get closer to the ground truth
How about considering the worst case over the candidate separators?
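The worst-case reasoning can be sketched as follows. The paper optimizes the prediction over all labelings by exploiting the structure of gain and loss; restricting the search to the candidate labelings themselves, as below, is my simplification, and the function names are illustrative:

```python
import numpy as np

def gain_loss(y_bar, y_star, y_svm):
    """Gain: points y_bar gets right but the inductive SVM gets wrong.
    Loss: points y_bar gets wrong but the inductive SVM gets right."""
    gain = int(np.sum((y_bar == y_star) & (y_svm != y_star)))
    loss = int(np.sum((y_bar != y_star) & (y_svm == y_star)))
    return gain, loss

def worst_case_predict(candidates, y_svm, lam=1.0):
    """Pick the candidate labeling whose worst-case (gain - lam * loss),
    taken over all candidates as possible ground truths, is largest."""
    def worst(y_bar):
        return min(g - lam * l
                   for y_star in candidates
                   for g, l in [gain_loss(y_bar, y_star, y_svm)])
    return max(candidates, key=worst), max(worst(c) for c in candidates)

# Demo: the SVM labeling itself is one of the candidates.
y_svm = np.array([1, 1, -1])
candidates = [np.array([1, 1, -1]),    # identical to the SVM labeling
              np.array([1, -1, -1])]
y_pred, value = worst_case_predict(candidates, y_svm)
```

Note that the labeling identical to the inductive SVM always has gain = loss = 0 against every possible ground truth, so the maximin value is at least 0; this is the intuition behind “never worse than the inductive SVM”.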
S4VM objective
The gains and losses are linear functions of the prediction
Note: S4VM does not rely on a single low-density separator
Theoretical property
S4VM is safe under the same assumption as S3VM (that is, that the ground truth is a low-density separator)
[Li & Zhou, ICML’11]
Multiple diverse separators
Generating multiple diverse low-density separators:
minimize the S3VM objective functional over the separators, plus a penalty term quantifying the mutual agreement of the separators, to encourage diversity
Multiple diverse separators (cont'd)
This is a non-convex task. We provide two implementations, resulting in two versions of S4VM:
• S4VMa: global simulated annealing search
• S4VMs: sampling
[Li & Zhou, ICML’11]
S4VMa
Simulated annealing (SA) is a popular probabilistic method for approaching the global solution
S4VMa is based almost purely on SA, with local search for speedup
S4VMs
Find N > T low-density separators
Then select T representative separators with large diversity
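One simple way to realize the selection step is greedy farthest-point selection under Hamming distance between labelings. This is an illustrative stand-in for the paper's procedure, not its actual algorithm:

```python
import numpy as np

def select_diverse(labelings, T):
    """Greedily pick T labelings: start from the first, then repeatedly add
    the candidate whose minimum Hamming distance to the chosen set is
    largest, so the kept separators disagree as much as possible."""
    labelings = [np.asarray(l) for l in labelings]
    chosen = [0]
    while len(chosen) < T:
        best_i, best_d = None, -1
        for i in range(len(labelings)):
            if i in chosen:
                continue
            d = min(int(np.sum(labelings[i] != labelings[j])) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return chosen

# Demo: three candidate labelings over four unlabeled points.
candidates = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1]]
picked = select_diverse(candidates, T=2)
```

Starting from the first labeling, the third is picked over the second because it disagrees on all four points rather than just one.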
S4VM performance
TSVM often significantly degrades performance; S4VM never significantly degrades performance
S4VM's accuracy is highly competitive with TSVM's
S4VM beats TSVM on 28/46 cases with 10 labeled examples and on 28/46 cases with 100 labeled examples
Setup: 12 splits for the benchmark datasets, 30 splits for the UCI datasets; 10 separators; S4VMa gives similar observations (though it is less efficient than S4VMs)
[Li & Zhou, ICML’11]
Some comparisons
• S3VMmin: selecting the separator with the smallest objective value
• S3VMcom: combining all separators using uniform weights
S3VMmin and S3VMcom often degrade performance
The “best single”
• S3VMbest: selecting the best separator using test set (by cheating)
The “best singles” are far from the ground truth; selecting the “best single” is not safe
S4VM does not rely on a single separator
It is quite robust with respect to the assumption that the ground truth is among the low-density separators
The talk involves joint work with
My students: Ming Li, Yu-Feng Li, Wei Wang
My collaborators: Tianshi Chen, Yunji Chen, Qi Guo, Weiwu Hu, Ling Li, Zhiwei Xu, ……
For details
Theoretical studies on disagreement-based SSL:
W. Wang and Z.-H. Zhou. Analyzing co-training style algorithms. In: Proceedings of the 18th European Conference on Machine Learning (ECML'07), Warsaw, Poland, 2007, pp.454-465.
W. Wang and Z.-H. Zhou. A new analysis of co-training. In: Proceedings of the 27th International Conference on Machine Learning (ICML'10), Haifa, Israel, 2010, pp.1135-1142.
SSL for CPU design:
Q. Guo, T. Chen, Y. Chen, Z.-H. Zhou, W. Hu, and Z. Xu. Effective and efficient microprocessor design space exploration using unlabeled design configurations. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11), Barcelona, Spain, 2011, pp.1671-1677.
T. Chen, Y. Chen, Q. Guo, Z.-H. Zhou, L. Li, and Z. Xu. Effective and efficient microprocessor design space exploration using unlabeled design configurations. ACM Transactions on Intelligent Systems and Technology, in press.
For details
Towards Safe SSL:
M. Li and Z.-H. Zhou. SETRED: Self-training with editing. In: Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'05), Hanoi, Vietnam, 2005, pp.611-621.
Y.-F. Li and Z.-H. Zhou. Improving semi-supervised support vector machines through unlabeled instances selection. In: Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI'11), San Francisco, CA, 2011, pp.386-391.
Y.-F. Li and Z.-H. Zhou. Towards making unlabeled data never hurt. In: Proceedings of the 28th International Conference on Machine Learning (ICML'11), Bellevue, WA, 2011, pp.1081-1088.
Code: http://lamda.nju.edu.cn/code_S4VM.ashx
Thanks!