Asymmetric Tri-training for Unsupervised Domain Adaptation
Asymmetric Tri-training
for Unsupervised Domain Adaptation
Kuniaki Saito1, Yoshitaka Ushiku1 and Tatsuya Harada1,2
1: The University of Tokyo, 2:RIKEN
ICML 2017 (Aug. 6-11), Sydney
Background: Domain Adaptation (DA)
[Figure: example classes (rucksack, keyboard, bicycle) shown in a source domain and a target domain]
• Supervised learning requires a lot of labeled samples
– Collecting samples in various domains is costly
– Classifiers suffer from changes of domain
• The purpose of DA
– Training a classifier on the source domain that works well on the target domain
• Unsupervised Domain Adaptation
– Labeled source samples and unlabeled target samples
Related Work
• Applications in computer vision
– Domain transfer + Generative Adversarial Networks
• Real faces to illustrations [Taigman+, ICLR 2017]; artificial images to real images [Bousmalis+, CVPR 2017]
– This paper: a novel approach w/o generative models
• Training CNNs for domain adaptation
– Matching hidden features of different domains [Long+, ICML 2015] [Ganin+, ICML 2014]
[Figure: Class A and Class B features of the source and target domains, before (No Adapt) and after (Adapted) adaptation]
Theoretical Insight
• Theorem [Ben-David+, Machine Learning 2010]
R_T(h) ≤ R_S(h) + (1/2) d_HΔH(S, T) + λ
– R_S(h): error on the source domain; R_T(h): error on the target domain
– d_HΔH(S, T): divergence between domains
– λ: how discriminative the features are (error of the ideal joint hypothesis)
• Related work regards λ as being sufficiently small
– Distribution-matching approaches aim to minimize d_HΔH(S, T)
– There is no guarantee that λ is small enough
• Proposed method: minimizes λ by reducing the error on target samples
– In the absence of labeled target samples,
→ we propose to give pseudo-labels to target samples
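The bound can be written out as follows (standard notation from Ben-David et al., 2010; the definition of λ is not spelled out on the slide and is filled in from that paper):

```latex
% Target error is bounded by source error, domain divergence,
% and the error \lambda of the ideal joint hypothesis.
\forall h \in \mathcal{H}:\quad
R_T(h) \le R_S(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda,
\qquad
\lambda = \min_{h \in \mathcal{H}} \bigl[\, R_S(h) + R_T(h) \,\bigr]
```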
Proposed Architecture
[Figure: input X passes through a shared network F; three classifiers branch from F: F1 and F2 (outputs p1, p2, trained on S + Tl) and Ft (output pt, trained on Tl)]
– S: source samples; Tl: pseudo-labeled target samples
– y: label for a source sample; ŷ: pseudo-label for a target sample
– F1, F2: labeling networks; Ft: target-specific network; F: shared network
Proposed Architecture
[Figure: same architecture diagram]
– F is updated using gradients from F1, F2, and Ft.
1. Initial training
[Figure: architecture diagram with every branch fed by source samples S]
– All networks are trained using only source samples.
2. Labeling target samples
[Figure: F1 and F2 predict on unlabeled target samples T]
– T: target samples
– If F1 and F2 agree on their predictions, and either of their probabilities is larger than a threshold value, the corresponding label is given to the target sample.
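The labeling rule can be sketched as follows; the function name, array shapes, and the 0.95 threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def assign_pseudo_labels(p1, p2, threshold=0.95):
    """Assign pseudo-labels where F1 and F2 agree and either is confident.

    p1, p2 : (n_samples, n_classes) class-probability outputs of the two
             labeling networks on unlabeled target samples.
    Returns (indices, labels) of the samples that receive a pseudo-label.
    """
    pred1 = p1.argmax(axis=1)
    pred2 = p2.argmax(axis=1)
    agree = pred1 == pred2                       # F1 and F2 predict the same class
    confident = (p1.max(axis=1) > threshold) | (p2.max(axis=1) > threshold)
    mask = agree & confident
    return np.where(mask)[0], pred1[mask]

# Example: 3 target samples, 2 classes
p1 = np.array([[0.97, 0.03], [0.60, 0.40], [0.20, 0.80]])
p2 = np.array([[0.90, 0.10], [0.30, 0.70], [0.10, 0.90]])
idx, labels = assign_pseudo_labels(p1, p2, threshold=0.95)
# Sample 0: both predict class 0 and p1 is confident -> pseudo-labeled 0
# Sample 1: the networks disagree -> skipped
# Sample 2: both predict class 1 but neither exceeds the threshold -> skipped
```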
3. Retraining the networks using pseudo-labeled target samples
[Figure: architecture diagram; S: source samples, Tl: pseudo-labeled target samples]
– F1, F2: trained on source and pseudo-labeled samples (S + Tl)
– Ft: trained on pseudo-labeled samples only (Tl)
– F: learns from the gradients of all three networks
– Repeat the 2nd step and the 3rd step until convergence!
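The three-step schedule can be sketched as a loop over two callables; `train_step` and `label_step` are hypothetical placeholders for the actual training and labeling routines, not names from the paper.

```python
def asymmetric_tri_training(train_step, label_step, source_data, target_data,
                            num_rounds=3):
    """Skeleton of the training schedule.

    train_step(labeled_source, pseudo_labeled) -- one round of training
        F1/F2 on S + Tl and Ft on Tl (the initial round uses an empty Tl).
    label_step(unlabeled_target) -> pseudo_labeled -- the labeling step.
    """
    pseudo_labeled = []                        # Tl starts empty
    train_step(source_data, pseudo_labeled)    # 1. initial training on S only
    for _ in range(num_rounds):                # repeat steps 2 and 3
        pseudo_labeled = label_step(target_data)  # 2. label target samples
        train_step(source_data, pseudo_labeled)   # 3. retrain with S + Tl
    return pseudo_labeled

# Dummy callables just to show the call pattern:
calls = []
dummy_train = lambda s, tl: calls.append(("train", len(tl)))
dummy_label = lambda t: t  # pretend every target sample gets a pseudo-label
result = asymmetric_tri_training(dummy_train, dummy_label,
                                 source_data=[1, 2, 3],
                                 target_data=["a", "b"], num_rounds=2)
```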
Overall Objective
λ |W1ᵀ W2| + L1 + L2 + L3
[Figure: each loss attached to its branch of the architecture]
– L1, L2: cross-entropy losses of F1 and F2 on S + Tl
– L3: cross-entropy loss of Ft on Tl
– W1, W2: weights of F1 and F2
– The term |W1ᵀ W2| forces F1 and F2 to learn from different features.
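A minimal numpy sketch of this objective; the helper names, shapes, and the `lambda_w` value are illustrative assumptions.

```python
import numpy as np

def weight_constraint(W1, W2):
    """|W1^T W2|: sum of absolute entries of W1^T W2. It is zero when the
    columns of W1 and W2 are orthogonal, which pushes F1 and F2 toward
    using different features of the shared representation."""
    return np.abs(W1.T @ W2).sum()

def cross_entropy(p, y):
    """Mean cross-entropy of predicted class probabilities p (n x k)
    against integer labels y (n,)."""
    return -np.mean(np.log(p[np.arange(len(y)), y]))

def overall_objective(p1, p2, pt, y_label, y_pseudo, W1, W2, lambda_w=0.01):
    L1 = cross_entropy(p1, y_label)    # F1 on S + Tl
    L2 = cross_entropy(p2, y_label)    # F2 on S + Tl
    L3 = cross_entropy(pt, y_pseudo)   # Ft on Tl
    return lambda_w * weight_constraint(W1, W2) + L1 + L2 + L3

# Orthogonal first-layer weights incur no penalty:
W1 = np.array([[1.0], [0.0]])
W2 = np.array([[0.0], [1.0]])
```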
Experiments
• Four adaptation scenarios between digits datasets
– MNIST, MNIST-M, SVHN, SYN DIGITS (synthesized digits)
• One adaptation scenario between traffic-sign datasets
– GTSRB (real traffic signs), SYN SIGNS (synthesized signs)
• Other experiments are omitted due to the time limit…
– Adaptation on Amazon Reviews
[Figure: sample images from MNIST, MNIST-M, SVHN, SYN DIGITS, SYN SIGNS, and GTSRB]
Accuracy on Target Domain
• Our method outperformed the other methods.
– The effect of BN is obvious in some settings.
– The effect of the weight constraint is not obvious.

Method \ Source→Target                   MNIST→MNIST-M  MNIST→SVHN  SVHN→MNIST  SYN DIGITS→SVHN  SYN SIGNS→GTSRB
Source Only (w/o BN)                     59.1           37.2        68.1        84.1             79.2
Source Only (with BN)                    57.1           34.9        70.1        85.5             75.7
DANN [Ganin et al., 2014]                81.5           35.7        71.1        90.3             88.7
MMD [Long et al., ICML 2015]             76.9           -           71.1        88.0             91.1
DSN [Bousmalis et al., NIPS 2016]        83.2           -           82.7        91.2             93.1
K-NN Labeling [Sener et al., NIPS 2016]  86.7           40.3        78.8        -                -
Ours (w/o BN)                            85.3           39.8        79.8        93.1             96.2
Ours (w/o weight constraint)             94.2           49.7        86.0        92.4             94.0
Ours                                     94.0           52.8        86.8        92.9             96.2
Summary and Future Work
• Summary
– Problem formulation for domain adaptation
– Proposal of asymmetric tri-training
– Effectiveness is shown in experiments
• Future work
– Evaluate our method on fine-tuning of pre-trained models
For more details, please refer to:
Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric Tri-training for Unsupervised Domain Adaptation. International Conference on Machine Learning (ICML), 2017.
Supplemental materials
Relationship with Tri-training
• Tri-training [Zhou et al., 2005]
– Uses three classifiers equally
• Two classifiers give labels to unlabeled samples
• One classifier is trained on the labeled samples
• Repeat for all combinations of classifiers
• Our proposed method
– Uses three classifiers asymmetrically
• Two fixed classifiers give labels
• One fixed classifier is trained using the pseudo-labeled samples
Accuracy during training
[Plot] Blue: (correctly labeled samples) / (labeled samples); initially the accuracy is high and gradually decreases.
Red: accuracy of the learned network; it gradually increases.
Green: the number of labeled samples.
A-distance between domains
• A-distance
– Calculated from the domain classifier's error
• The proposed method does not make the divergence small.
– Minimizing the divergence is not the only way to achieve good adaptation!
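The relationship between the domain classifier's error and the divergence measure is commonly written as the proxy A-distance 2(1 − 2ε), where ε is the test error of a binary source-vs-target domain classifier; a minimal sketch:

```python
def proxy_a_distance(domain_error):
    """Proxy A-distance from the error of a source-vs-target domain classifier.

    domain_error = 0.5 (chance level: domains indistinguishable) -> distance 0.0
    domain_error = 0.0 (domains perfectly separable)             -> distance 2.0
    """
    return 2.0 * (1.0 - 2.0 * domain_error)
```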
Analysis by gradient stopping
[Figure: architecture diagrams in which the gradient from one branch (F1, F2, or Ft) to the shared network F is stopped]