Unsupervised Domain Adaptation with Hierarchical Gradient Synchronization
Lanqing Hu1,2 Meina Kan1,2 Shiguang Shan1,2,3 Xilin Chen1,2
1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, 200031, China
[email protected] {kanmeina,sgshan,xlchen}@ict.ac.cn
Abstract
Domain adaptation attempts to boost the performance
on a target domain by borrowing knowledge from a well
established source domain. To handle the distribution gap
between two domains, the prominent approaches endeavor to extract domain-invariant features. It is known that after a perfect domain alignment, the domain-invariant representations of the two domains should share the same characteristics both globally and in any local piece. Inspired by this, we propose a novel method
called Hierarchical Gradient Synchronization to model the
synchronization relationship among the local distribution
pieces and global distribution, aiming for more precise
domain-invariant features. Specifically, the hierarchical
domain alignments including class-wise alignment, group-
wise alignment and global alignment are first constructed.
Then, these three types of alignment are constrained to be
consistent to ensure better structure preservation. As a re-
sult, the obtained features are domain invariant and intrin-
sically structure preserved. As evaluated on extensive do-
main adaptation tasks, our proposed method achieves state-
of-the-art classification performance on both vanilla unsu-
pervised domain adaptation and partial domain adaptation.
1. Introduction
A general hypothesis of machine learning is that the training and testing data share a similar distribution, so that a model trained on large-scale labeled data performs well on the test data. However, in many real-world applications, only a limited amount of labeled training data sharing a similar distribution with the testing data is available, which is insufficient for training a good enough model. Domain adaptation has shown promising results on this challenge by borrowing knowledge from a well-established set (i.e., the source domain), which has a large amount of labeled data but follows a different distribution from the test data (i.e., the target domain).
According to the scale of labeled data in target do-
main, domain adaptation can be categorized into super-
vised, semi-supervised and unsupervised domain adapta-
tion. This paper mainly concentrates on the unsupervised
domain adaptation where there is only unlabeled data in
target domain. Most existing works deal with the domain
adaptation problem by alleviating marginal distribution dis-
crepancy (i.e., the distribution of data X) or conditional
distribution discrepancy (i.e., distribution of data X given
classes labeled with Y ). Besides, there are also some works
attempting to tackle both the marginal and conditional dis-
tribution simultaneously.
In the early days, most methods endeavored to align the marginal distributions of the source and target domains by instance re-weighting, addressing sample selection bias [45, 7, 19] and covariate shift [39, 1]. These approaches are only suitable for scenarios where the source and target domains share the same support, and thus cannot achieve satisfactory performance in in-the-wild scenarios.
For better handling the complicated scenarios, the com-
mon subspace methods focusing on extracting domain in-
variant representation came up [14, 13, 37, 12, 38]. These
methods mainly attempt to minimize the gap between
marginal distributions of two domains. In the approach of
Geodesic Flow Kernel (GFK) [13], an infinite number of
the subspaces is integrated to model domain shift between
source and target domain. In [12], a set of landmarks, i.e.,
a subset of labeled data from the source domain that have
the most similar distribution as the target domain, are un-
covered to bridge the source and target domain. The meth-
ods proposed in [24] and [28] embed deep features into Re-
producing Kernel Hilbert spaces (RKHS) and minimize the
maximum mean discrepancy (MMD) of the features for dis-
tribution adaptation. JGSA [46] and PUnDA [11] mitigate
the geometrical structure gap and distribution shift jointly.
The method in [48] handles the domain shift by aligning the RKHS covariance matrix across domains.
[Figure 1 panels: (a) Local misalignment; (b) Expected consistent alignment]
Figure 1. Illustration of (a) local misalignment in methods with only global distribution alignment, and (b) expected alignment on both global domain and local classes. Best viewed in color.
In these conventional approaches, the distribution discrepancy is usually measured by metrics such as MMD, KL divergence and Bregman divergence. Recently, the adversarial loss, as a more powerful metric, has attracted a lot of attention. The works in [9, 10, 41] handle the domain shift by
augmenting a gradient reversal layer or employing adversar-
ial objective on target domain features. As a result, the fea-
tures confusing the domain classifier are generally domain
invariant. Afterwards, many methods based on domain
transformation via adversarial learning [40, 23, 22, 49, 35]
attain quite promising performance on distribution align-
ment and domain invariant feature extraction.
The above methods only consider the gap between the marginal distributions of two domains. In other words, these methods only align the two domains globally, without considering whether the alignment of each local piece is correct. As a result, two domains may be well aligned globally while their local pieces (e.g., categories) are mismatched, as shown in Figure 1(a).
In recent years, a few methods attempt to minimize the
gap between conditional distributions (i.e., class-wise dis-
tribution) of two domains, for better alignment of the cate-
gories between two domains. Specifically, in WDAN [44],
class-specific auxiliary weight for each class is introduced
into the original MMD metric for utilizing the class prior
on source and target domains. MADA [31] exploits multiple adversarial discriminators, one for each class, gaining a larger performance improvement on the target domain. Further, based on this multiple adversarial framework, CDAN [25] designs multi-linear conditioning, i.e., conducting adversarial learning on the covariance between feature representations and classifier predictions, to implicitly align the conditional distributions of the source and target domains, which handles the domain distribution alignment more elaborately.
Similarly, the methods specialized for partial domain adap-
tation including SAN [2] and PADA [3] also show the ad-
vantages of considering class-wise distribution alignment.
Some other methods [26, 17, 36, 33, 5, 18, 30, 47] directly predict the category labels of unlabeled target domain samples during training and use them as pseudo labels. With the pseudo category labels of target
domain samples and those known true labels of source do-
main samples, the samples from distinct domains but the
same category are implicitly pulled close to share the same
distribution. In the proposed SymNets in [47], the domain
discrimination and confusion are stacked upon the concate-
nated classifiers of source and target domains, thus facilitat-
ing the domain-level and category-level feature distribution
confusion. MCDDA [34] and CAN [21] are both approach-
es concentrating on explicitly calibrating the category-level
distribution of both domains. MCDDA [34] plays the min-
max game between feature encoder and two different clas-
sifiers to optimize the decision boundary and then allevi-
ate the intra-class domain discrepancy. CAN [21] explicitly
minimizes the intra-class discrepancy and simultaneously
maximizes the inter-class discrepancy between domains ac-
cording to the labels of source and predicted labels of target
domain.
Generally, these recently proposed methods consider both the global (domain-level) and the local (category-level) distribution discrepancy, thus achieving promising performance. However, in these methods the global alignment and local alignment are implemented separately, e.g., by minimizing a weighted sum of domain-level and category-level discrepancy [6, 31, 25]. As a result, the obtained results are only a trade-off between the global and local distribution alignment, and inconsistent distribution alignment still exists.
As observed from Figure 1(b), in a perfect domain alignment the calibration of each local category and that of the global domain distribution are consistent, i.e., their calibration directions are roughly the same. To carefully consider the intrinsic relation between local and global distribution alignment, in this work we propose a new method that consistently aligns the local and global distributions by constraining the gradients of local and global alignment to be synchronous, referred to as Domain Adaptation with Hierarchical Gradient Synchronization (GSDA).
Briefly, the contributions of this work are threefold: (1) We propose a novel method that considers the consistency of the global and local distribution alignment, to preserve the intrinsic structures of both domain distributions for better domain adaptation. To the best of our knowledge, it is the first work to explicitly model the intrinsic relation between global and local distribution alignment. (2) The consistency of the global and local distribution alignment is achieved by a newly designed hierarchical gradient synchronization module. (3) The method experimentally achieves state-of-the-art classification accuracy in unsupervised domain adaptation and partial domain adaptation scenarios.
2. Method
For clear description, we first give some definitions.
The labeled source domain images and the unlabeled target domain images are denoted as $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$ and $X_t = \{x_j^t\}_{j=1}^{m}$, respectively. In unsupervised domain adaptation, the source and target domains, i.e., $X_s$ and $X_t$, generally follow different distributions but share the same categories. The samples in the source domain are labeled, with the category label denoted as $y_i^s \in C_s = \{1, 2, \cdots, r\}$, while the samples in the target domain are unlabeled. In unsupervised domain adaptation the source and target domains share exactly the same categories, i.e., $C_t = C_s$, where $C_t$ and $C_s$ are the sets of $r$ classes in the target and source domains. There is also a special scenario where $C_t$ is a subset of $C_s$, i.e., $C_t \subset C_s$, called partial unsupervised domain adaptation. Our method is applicable to both unsupervised domain adaptation and partial unsupervised domain adaptation. For easier understanding, we introduce the formulation in the scenario of unsupervised domain adaptation, while both tasks are evaluated in the experiments section. Unless otherwise specified, the symbols $s$ and $t$ used in a superscript or subscript denote the source domain and target domain, respectively.
The whole framework of our method is shown in Figure 2. It is equipped with a feature extractor $\mathcal{E}$, an object classifier $\mathcal{C}$ and three types of adversarial discriminators $\mathcal{D} = \{\mathcal{D}^{dom}, \mathcal{D}^{grp}, \mathcal{D}^{cls}\}$. Here, $\mathcal{D}^{dom}$ denotes the adversarial discriminator for global domain distribution alignment, namely, the domain-level alignment. $\mathcal{D}^{cls}$ denotes the adversarial discriminators for local class-wise distribution alignment, and $\mathcal{D}^{grp}$ represents the adversarial discriminators for group-wise alignment, where each group is composed of several classes. The feature extractor $\mathcal{E}$ is fed with both the source and target domain data and outputs the features $f$, which are expected to be domain invariant. Afterwards, the features are fed into the classifier $\mathcal{C}$ for classification and also into the adversarial discriminators $\mathcal{D}$ for domain shift reduction. The feature extractor $\mathcal{E}$ and the discriminators $\mathcal{D}$ play a two-player min-max game to make the features from $\mathcal{E}$ domain invariant. In other words, the features from $\mathcal{E}$ should be domain invariant if they successfully fool the domain discriminators $\mathcal{D}$.
2.1. Feature Extraction and Classification
The feature extractor $\mathcal{E}$ encodes the input source or target samples $x^s$ and $x^t$ into a common feature space as follows:
$$f^s = \mathcal{E}(x^s), \quad f^t = \mathcal{E}(x^t), \qquad (1)$$
where $\mathcal{E}$ can be any kind of network architecture, such as several successive convolutional layers. Then $f \in \{f^s, f^t\}$ is fed into the classifier $\mathcal{C}$ to ensure that the feature $f$ is discriminative. The parameters of the feature extractor $\mathcal{E}$ and the classifier $\mathcal{C}$ are denoted as $\theta_{\mathcal{E}}$ and $\theta_{\mathcal{C}}$, respectively. The output of the object classifier $\mathcal{C}$ is denoted as below:
$$p_i^s = \mathcal{C}(f_i^s), \quad p_j^t = \mathcal{C}(f_j^t), \qquad (2)$$
where $p_i^s$ is the softmax output of $\mathcal{C}$ with $x_i^s$ as input, and $p_j^t$ is the softmax output of $\mathcal{C}$ with $x_j^t$ as input. Considering that true category labels are available for the source domain, the cross entropy loss of classification is directly applied and formulated as below:
$$\mathcal{L}_c^s = \sum_{x_i^s \in X_s} H\big(\mathcal{C}(\mathcal{E}(x_i^s)), y_i^s\big), \qquad (3)$$
where $H(\cdot, \cdot)$ represents the cross entropy loss.
For target domain samples, the category labels are unavailable, and thus the conventional cross entropy loss is inapplicable. Therefore, following [15], the conditional entropy loss is exploited to enhance the certainty of prediction, i.e., to force only one element in $p_j^t$ to be dominant while the rest are suppressed. Formally, the conditional entropy loss $\mathcal{L}_c^t$ for unlabeled target domain samples is as below:
$$\mathcal{L}_c^t = \sum_{x_j^t \in X_t} H\big(\mathcal{C}(\mathcal{E}(x_j^t))\big), \qquad (4)$$
where $H(\cdot)$ is the conditional entropy loss with $H(p_j^t) = -\sum_{k=1}^{r} p_j^t(k) \log p_j^t(k)$. The $k$th element $p_j^t(k)$ in $p_j^t$ indicates the probability of $x_j^t$ being assigned to the $k$th class.
Figure 2. Illustration of the overall framework of our GSDA method. An input sample xi from source or target domain is firstly encoded
by the common feature extractor E . Based on the extracted feature, the classifier C is designed for object classification, and the adversarial
discriminators including Ddom, Dgrp and Dcls are designed for distribution alignment from perspective of domain-level, group-level and
category-level respectively. Furthermore, a hierarchical gradient synchronization between the three types of adversarial discriminators is
constructed to constrain the consistency between global and local alignment for better structure preservation. Best viewed in color.
Overall, the object classification loss of both domains is obtained as below:
$$\mathcal{L}_c = \mathcal{L}_c^s + \alpha \mathcal{L}_c^t, \qquad (5)$$
which constrains the common feature $f$ to be discriminative, benefiting the classification task.
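As a rough illustration of Equations (3) to (5), the sketch below evaluates the source cross entropy, the target conditional entropy and their weighted sum on hand-written softmax outputs. The function names, the toy probabilities and the small constant `EPS` are assumptions made for this example and are not part of the method description; the value α = 0.02 is the one reported later for Office-31-DA.

```python
import math

EPS = 1e-12  # assumed numerical guard against log(0); not from the paper


def cross_entropy(p, y):
    """H(p, y): negative log-probability that p assigns to the true class y."""
    return -math.log(p[y] + EPS)


def entropy(p):
    """H(p) = -sum_k p(k) log p(k), the conditional entropy of one prediction."""
    return -sum(pk * math.log(pk + EPS) for pk in p)


def classification_loss(src_probs, src_labels, tgt_probs, alpha):
    """Return (L_c, source term, target term) following Equations (3)-(5)."""
    l_src = sum(cross_entropy(p, y) for p, y in zip(src_probs, src_labels))
    l_tgt = sum(entropy(p) for p in tgt_probs)
    return l_src + alpha * l_tgt, l_src, l_tgt


# Toy softmax outputs for r = 3 classes (illustrative values only,
# class indices are 0-based here).
src_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
src_labels = [0, 1]
tgt_probs = [[0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3]]

l_c, l_src, l_tgt = classification_loss(src_probs, src_labels, tgt_probs, alpha=0.02)
```

In this toy input the confident target prediction contributes far less entropy than the uniform one, which is the behaviour the conditional entropy term is meant to encourage.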
2.2. Domain Distribution Alignment
Besides being discriminative across categories, the feature $f$ from $\mathcal{E}$ should also be domain invariant to enable knowledge transfer from the source domain to the target domain. In a perfect domain-invariant feature space, not only the global structure of both domains but also any local piece, such as every group or even every class, should be well aligned. Aiming for this goal, three types of adversarial discriminators are introduced for domain-level, group-level, and class-level distribution alignment, respectively. Furthermore, the consistency of the three types of alignment is constrained by a novel hierarchical gradient synchronization module. This synchronization module ensures that the alignment of any local piece is consistent with the global alignment structure, leading to a more informative domain alignment.
Global Adversarial Discriminator The global adversarial discriminator, i.e., the domain-level adversarial discriminator $\mathcal{D}^{dom}$, is designed to distinguish the source domain from the target domain with the cross entropy loss as follows:
$$\mathcal{L}_g = \sum_{x_i \in X_s \cup X_t} H\big(\mathcal{D}^{dom}(\mathcal{E}(x_i)), d_i\big), \quad \text{with} \quad d_i = \begin{cases} 1, & \text{if } x_i \in X_s, \\ 0, & \text{if } x_i \in X_t, \end{cases} \qquad (6)$$
where $d_i$ represents the domain label of each sample $x_i$. By playing min-max adversarial optimization between $\mathcal{E}$ and this discriminator $\mathcal{D}^{dom}$, whose parameter is denoted as $\theta_{\mathcal{D}^{dom}}$, the whole distributions of the two domains from $\mathcal{E}$ will become nonseparable globally.
Local Adversarial Discriminators Even if the global distribution is well aligned, the distribution of each class in the two domains may be misaligned as shown in Figure 1(a), e.g., the $i$th category of the source domain may be aligned to the $k$th ($i \neq k$) category of the target domain although the two domains are globally well aligned. This is because the global domain alignment constraint merely considers the discrepancy of the whole domain but not the discrepancy in any local piece. Therefore, local adversarial discriminators are established to deal with the distribution discrepancy in local regions of the source and target domains. They consist of two kinds of local adversarial discriminators: class-wise ones and group-wise ones.
Firstly and straightforwardly, class-wise adversarial discriminators are constructed to tackle the discrepancy within each category between the source and target domains, i.e., the $i$th category of the source domain should be aligned to the $i$th category of the target domain rather than to other categories in the target domain. Formally, the class-wise adversarial discriminator for the $k$th category is denoted as $\mathcal{D}^{cls}_k$ and its domain discrimination loss is formulated as follows:
$$\mathcal{L}^{cls}_k = \sum_{x_i \in X_s \cup X_t} p_i^k H\big(\mathcal{D}^{cls}_k(\mathcal{E}(x_i)), d_i\big), \quad \text{with} \quad d_i = \begin{cases} 1, & \text{if } x_i \in X_s, \\ 0, & \text{if } x_i \in X_t, \end{cases} \qquad (7)$$
where $d_i$ is the domain label, as in the global adversarial discriminator, $k \in \{1, 2, \cdots, r\}$ denotes the index of the $k$th class-wise adversarial discriminator, and $p_i^k$ is the loss weight of sample $x_i$ representing its probability of belonging to the $k$th class, i.e., the $k$th dimension of the output $p_i^s$ or $p_i^t$ in Equation (2). Note that if $x_i \in X_s$ and it belongs to the $k$th class, $p_i^k = 1$ and $p_i^j |_{j \neq k} = 0$ because the label of $x_i \in X_s$ is definite. For $x_i \in X_t$, as its label is unavailable, the corresponding $p_i^k$ is the predicted probability of $x_i \in X_t$ being classified into the $k$th class by the classifier $\mathcal{C}$ in Equation (2).
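The weighting scheme of Equation (7) can be sketched as follows: a source sample contributes only to the discriminator of its own class (one-hot weight), while a target sample contributes to every class-wise discriminator in proportion to its predicted probability. The helper names, the tuple layout of `samples` and the discriminator outputs are hypothetical and chosen only for illustration; class indices are 0-based in the code.

```python
import math


def class_weights(is_source, label, probs, r):
    """Return [p_i^1, ..., p_i^r]: one-hot for a labeled source sample,
    the classifier's softmax output for an unlabeled target sample."""
    if is_source:
        return [1.0 if k == label else 0.0 for k in range(r)]
    return list(probs)


def domain_bce(d_out, d):
    """Binary cross entropy H(D(f), d) for a discriminator output in (0, 1)."""
    return -(d * math.log(d_out) + (1 - d) * math.log(1 - d_out))


def classwise_loss(k, samples, disc_outputs, r):
    """Sum over all samples of p_i^k * H(D_k(E(x_i)), d_i), as in Equation (7)."""
    total = 0.0
    for (is_source, label, probs), d_out in zip(samples, disc_outputs):
        weight = class_weights(is_source, label, probs, r)[k]
        d = 1 if is_source else 0  # domain label: 1 = source, 0 = target
        total += weight * domain_bce(d_out, d)
    return total


# Two source samples (labels 0 and 2) and one target sample, r = 3 classes.
samples = [
    (True, 0, None),
    (True, 2, None),
    (False, None, [0.6, 0.3, 0.1]),
]
# Hypothetical outputs of the class-0 discriminator for the three samples.
disc_outputs = [0.8, 0.7, 0.4]

loss_cls0 = classwise_loss(0, samples, disc_outputs, r=3)
```

Here the second source sample (label 2) has zero weight for class 0, so only the first source sample and 60% of the target sample's loss reach the class-0 discriminator.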
Likewise, by playing min-max adversarial optimization with the objective above, the distributions of the two domains are well aligned for each category. The parameter of each class-wise local discriminator $\mathcal{D}^{cls}_k$ is denoted as $\theta_{\mathcal{D}^{cls}_k}$.
Besides each class, any local group consisting of several classes should also be well aligned in a perfect domain alignment. Thus, the local alignment can be reinforced by establishing group-level adversarial discriminators. Similar to the class-wise adversarial discriminators, the domain discrimination loss of the group-wise adversarial discriminator $\mathcal{D}^{grp}_q$ for the $q$th group is formulated as follows:
$$\mathcal{L}^{grp}_q = \sum_{x_i \in X_s \cup X_t} p_i^q H\big(\mathcal{D}^{grp}_q(\mathcal{E}(x_i)), d_i\big), \quad \text{with} \quad d_i = \begin{cases} 1, & \text{if } x_i \in X_s, \\ 0, & \text{if } x_i \in X_t, \end{cases} \qquad (8)$$
where $q \in \{1, 2, \cdots, b\}$ denotes the index of the $q$th group-wise adversarial discriminator and $p_i^q$ denotes the probability of $x_i$ belonging to the $q$th group. The groups here are simply obtained as random divisions of all the classes used in Equation (7). Correspondingly, the grouping probability of the $q$th group can be easily obtained as $p_i^q = \sum_{k \in q} p_i^k$. Generally, the classes in different groups are allowed to overlap with each other, while in this work all groups are simply randomly divided without overlap. It is worth mentioning that, when the number of classes is large, these groups could be hierarchically structured rather than flat.
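The random, non-overlapping grouping and the group probability $p_i^q = \sum_{k \in q} p_i^k$ can be sketched as below. The text only states that classes are divided randomly without overlap; the near-equal group sizes, the fixed seed and the function names are assumptions of this sketch.

```python
import random


def random_groups(num_classes, num_groups, seed=0):
    """Randomly split class indices into non-overlapping groups.

    Near-equal group sizes and the seed are assumptions of this sketch.
    """
    classes = list(range(num_classes))
    random.Random(seed).shuffle(classes)
    return [sorted(classes[g::num_groups]) for g in range(num_groups)]


def group_probs(class_probs, groups):
    """p_i^q: sum of the class probabilities p_i^k over the classes k in group q."""
    return [sum(class_probs[k] for k in members) for members in groups]


# Setting reported for Office-31: 31 classes divided into 6 groups.
groups = random_groups(num_classes=31, num_groups=6)

# A uniform class prediction, used only to check that group probabilities sum to 1.
uniform = [1.0 / 31] * 31
p_groups = group_probs(uniform, groups)
```

Because the groups form a partition of the classes, the group probabilities of any sample sum to one, just as its class probabilities do.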
Similarly, by playing min-max adversarial optimization with the objective above, the distributions of the two domains are well aligned locally in each group. The parameter of each group-wise local discriminator $\mathcal{D}^{grp}_q$ is denoted as $\theta_{\mathcal{D}^{grp}_q}$.
Then the overall parameters of all discriminators are denoted as $\theta_{\mathcal{D}} = \{\theta_{\mathcal{D}^{dom}}, \theta_{\mathcal{D}^{cls}_k}, \theta_{\mathcal{D}^{grp}_q}\}$. By summing up the losses of all the local adversarial discriminators, the objective for local distribution alignment is obtained as:
$$\mathcal{L}_l = \sum_{q=1}^{b} \mathcal{L}^{grp}_q + \sum_{k=1}^{r} \mathcal{L}^{cls}_k, \qquad (9)$$
where $b$ stands for the number of groups and $r$ represents the number of classes.
Overall, the three types of distribution alignment includ-
ing domain-level, group-wise, and class-wise domain dis-
tribution alignment form a hierarchical aligning structure,
aiming for better alignment between source and target do-
mains globally as well as locally.
2.3. Hierarchical Gradient Synchronization
The preceding global and local adversarial discriminators deal with the distribution alignment between domains from global and local perspectives, but in an independent manner. This may cause inconsistency among the global and local alignments, which would compromise the aligning direction of global and local alignment, leading to inaccurate distribution alignment.
Figure 3. Illustration of the hierarchical distribution alignments and hierarchical gradient synchronization among them.
Actually, in a perfect global alignment, any local piece
should be also well aligned, or vice versa: a perfect align-
ment of each local piece also forms an optimal global align-
ment. Specifically, the aligning direction and magnitude of
each local piece should be consistent with that of the whole
domain. So intuitively the consistency between the global
and local domain alignment could be used to verify if two
domains are well aligned or not. In return, it would benefit
the domain alignment if this consistency is formulated into
the process of distribution alignment. With this in mind, a
novel constraint on the gradient is designed as the Hierar-
chical Gradient Synchronization term, which is presented in
Figure 3 and specifically formulated in Equations (10) and
(11) below.
Specifically, Hierarchical Gradient Synchronization con-
sists of gradient synchronization among the three levels of
adversarial discriminators, i.e., domain-level, group-level,
and class-level discriminators, forming a hierarchical man-
ner. The gradient synchronization between class-wise align-
ment and group-wise alignment is designed as below:
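$$\mathcal{L}^{syn}_{grp \sim cls} = \left| \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial \mathcal{L}^{grp}_q}{\partial \mathcal{E}(x_i)} \right\|_2 - \sum_{k \in grp_q} \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial \mathcal{L}^{cls}_k}{\partial \mathcal{E}(x_i)} \right\|_2 \right|. \qquad (10)$$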
The gradient synchronization objective in Equation (10) attempts to make the magnitude of the aligning direction of each group consistent with the sum of those of the classes within this group. Here, the first term denotes the gradient magnitude of the discriminator for the $q$th group, and the second term denotes the gradient magnitudes of the discriminators for each class in the $q$th group in the domain. Note that here we only constrain the magnitude, as it can affect both the direction and magnitude, while the sum of gradient directions would neutralize the difference. Note also that in the second term, $x_i \in X_s \cup X_t$ still refers to the samples from the $k$th class because the sampling probability is included in $\mathcal{L}^{cls}_k$.
Similarly, the gradient synchronization between group-
wise and the whole domain alignment is formulated as fol-
lows:
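$$\mathcal{L}^{syn}_{dom \sim grp} = \left| \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial \mathcal{L}_g}{\partial \mathcal{E}(x_i)} \right\|_2 - \sum_{q=1}^{b} \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial \mathcal{L}^{grp}_q}{\partial \mathcal{E}(x_i)} \right\|_2 \right|, \qquad (11)$$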
where the first term denotes the gradients magnitude of dis-
criminator for the whole domain, and the second term de-
notes the gradients magnitude of discriminators for each
group. Note that in the second term, xi ∈ Xs ∪ Xt still
means the samples from the qth group because the sampling
probability is included in Lgrpq .
Note that although Equations (10) and (11) are losses defined on gradients, their optimization involves first-order rather than second-order derivatives and is therefore efficient. This is because the gradients in Equations (10) and (11) are taken with regard to the input features, not with regard to the network parameters.
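To make the notion of a gradient with regard to the input features concrete, the following worked case assumes a single-layer logistic discriminator. This is a simplification introduced here for illustration only, not the discriminator architecture of the method, and the symbols $w$ (weight vector) and $c$ (bias) are likewise assumptions. In this simple case the feature gradient of the cross entropy and its norm have a closed form:

```latex
\begin{aligned}
D(f) &= \sigma(w^\top f + c), \qquad
H\bigl(D(f), d\bigr) = -d \log D(f) - (1 - d) \log\bigl(1 - D(f)\bigr), \\
\frac{\partial H}{\partial f} &= \bigl(D(f) - d\bigr)\, w, \qquad
\left\| \frac{\partial H}{\partial f} \right\|_2 = \bigl| D(f) - d \bigr| \, \| w \|_2 .
\end{aligned}
```

Under this assumption, each per-sample term in Equations (10) and (11) is, up to the sample weight, the discriminator's prediction error scaled by the norm of its weights.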
Afterwards, stacking all the layers together, the overall 3-layer hierarchical gradient synchronization constraint is naturally obtained as below:
$$\mathcal{L}_{syn} = \frac{1}{b} \sum_{q=1}^{b} \mathcal{L}^{syn}_{grp \sim cls} + \mathcal{L}^{syn}_{dom \sim grp}. \qquad (12)$$
With this constraint, the directions and magnitudes of gradient descent for both global and local alignment are expected to be kept in synchronization with each other. As a result, the distributions of the two domains can be aligned more accurately.
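A minimal numerical sketch of Equations (10) to (12) is given below. It assumes linear logistic discriminators, for which the feature gradient of the weighted cross entropy is available in closed form, and it treats the sample weights as constants; both are simplifications made here, not details given in the text. All names (`grad_norm`, `sync_loss`, the toy features and parameters) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_norm(w, bias, f, d, weight):
    """L2 norm of the feature gradient of weight * BCE(sigmoid(w.f + bias), d).

    For a linear logistic discriminator this gradient equals
    weight * (sigmoid(w.f + bias) - d) * w.
    """
    z = sum(wi * fi for wi, fi in zip(w, f)) + bias
    scale = weight * (sigmoid(z) - d)
    return math.sqrt(sum((scale * wi) ** 2 for wi in w))


def total_grad_norm(disc, feats, domains, weights):
    """Sum over samples of the per-sample feature-gradient norm of one discriminator."""
    w, bias = disc
    return sum(grad_norm(w, bias, f, d, p) for f, d, p in zip(feats, domains, weights))


def sync_loss(dom, grp_discs, cls_discs, groups, feats, domains, cls_w):
    """L_syn = (1/b) * sum_q L_syn(grp_q ~ cls) + L_syn(dom ~ grp)."""
    n = len(feats)
    grp_norms, grp_cls_terms = [], []
    for q, members in enumerate(groups):
        grp_w = [sum(cls_w[i][k] for k in members) for i in range(n)]
        g_q = total_grad_norm(grp_discs[q], feats, domains, grp_w)
        c_q = sum(
            total_grad_norm(cls_discs[k], feats, domains, [cls_w[i][k] for i in range(n)])
            for k in members
        )
        grp_norms.append(g_q)
        grp_cls_terms.append(abs(g_q - c_q))  # Equation (10) for group q
    g_dom = total_grad_norm(dom, feats, domains, [1.0] * n)
    dom_grp = abs(g_dom - sum(grp_norms))  # Equation (11)
    return sum(grp_cls_terms) / len(groups) + dom_grp  # Equation (12)


# Toy setup: one source and one target feature, 4 classes in 2 groups.
feats = [[1.0, 2.0], [-1.0, 0.5]]
domains = [1, 0]  # 1 = source, 0 = target
cls_w = [[1.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.3, 0.4]]
groups = [[0, 1], [2, 3]]
shared = ([0.5, -1.0], 0.1)  # (weights, bias) used by every discriminator
other = ([1.0, 0.0], 0.0)    # a domain discriminator that disagrees

loss_same = sync_loss(shared, [shared] * 2, [shared] * 4, groups, feats, domains, cls_w)
loss_diff = sync_loss(other, [shared] * 2, [shared] * 4, groups, feats, domains, cls_w)
```

When every discriminator shares the same parameters, the class, group and domain gradient magnitudes add up exactly and the synchronization loss vanishes; replacing the domain discriminator with a different one makes the loss positive, which is the kind of inconsistency the constraint penalizes.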
With the global alignment, local alignment, and gradient synchronization defined in Equations (6), (9) and (12), the overall objective function of the discriminators $\mathcal{D}$ is finally formulated as follows:
$$\mathcal{L}_d = \mathcal{L}_g + \mathcal{L}_l + \beta \mathcal{L}_{syn}. \qquad (13)$$
With the objective in Equation (13), the source and target domains are aligned globally and locally, with consistency between the global and local distribution alignment. As a result, the two domains are well aligned and the discriminative structure is well preserved.
2.4. Overall Objective and Optimization
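The overall objective function is optimized by alternately optimizing $\{\mathcal{E}, \mathcal{C}\}$ and $\mathcal{D}$ following the adversarial learning mechanism, as detailed in the following.
Given $\{\mathcal{E}, \mathcal{C}\}$, the adversarial discriminators $\mathcal{D}$ are optimized to distinguish the source domain from the target domain by minimizing the domain discrimination loss:
$$\min_{\theta_{\mathcal{D}^g}, \theta_{\mathcal{D}^l}} \mathcal{L}_d = \mathcal{L}_g + \mathcal{L}_l + \beta \mathcal{L}_{syn}, \qquad (14)$$
with the parameters updated as below:
$$\theta_{\mathcal{D}^g} \leftarrow \theta_{\mathcal{D}^g} - \eta \frac{\partial (\mathcal{L}_g + \beta \mathcal{L}_{syn})}{\partial \theta_{\mathcal{D}^g}}, \qquad \theta_{\mathcal{D}^l_k} \leftarrow \theta_{\mathcal{D}^l_k} - \eta \frac{\partial (\mathcal{L}_l + \beta \mathcal{L}_{syn})}{\partial \theta_{\mathcal{D}^l_k}}, \qquad (15)$$
where $\eta$ is the learning rate, and $\theta_{\mathcal{D}^g}$ and $\theta_{\mathcal{D}^l_k}$ denote the parameters of the global discriminator and of the $k$th local discriminator, respectively.
Given $\mathcal{D}$, the feature extractor $\mathcal{E}$ and the classifier $\mathcal{C}$ are optimized to make the features from $\mathcal{E}$ discriminative and domain invariant. This is achieved by minimizing the object classification loss and confusing the adversarial discriminators through the min-max game as follows:
$$\min_{\theta_{\mathcal{C}}, \theta_{\mathcal{E}}} \big( \mathcal{L}_c + \beta \mathcal{L}_{syn} - (\mathcal{L}_g + \mathcal{L}_l) \big), \qquad (16)$$
with the parameters updated as:
$$\theta_{\mathcal{C}} \leftarrow \theta_{\mathcal{C}} - \eta \frac{\partial \mathcal{L}_c}{\partial \theta_{\mathcal{C}}}, \qquad \theta_{\mathcal{E}} \leftarrow \theta_{\mathcal{E}} - \eta \left( \frac{\partial \mathcal{L}_c}{\partial \theta_{\mathcal{C}}} \times \frac{\partial \theta_{\mathcal{C}}}{\partial \theta_{\mathcal{E}}} + \beta \frac{\partial \mathcal{L}_{syn}}{\partial \theta_{\mathcal{E}}} - \frac{\partial (\mathcal{L}_g + \mathcal{L}_l)}{\partial \theta_{\mathcal{D}}} \times \frac{\partial \theta_{\mathcal{D}}}{\partial \theta_{\mathcal{E}}} \right). \qquad (17)$$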
3. Experiments
We evaluate the proposed method and other related works on both unsupervised domain adaptation (source and target domains share the same categories) and partial domain adaptation (the categories of the target domain are a subset of those of the source domain) benchmarks of object classification; the partial domain adaptation results are given in the supplementary materials. Besides, an ablation study is carefully conducted to analyse the contribution of each part of the proposed method.
3.1. Datasets and Experimental Setting
Three standard benchmarks for unsupervised domain
adaptation and one for partial domain adaptation, respec-
tively, are employed for the evaluation.
Office-31-DA Office-31 [20] is a classical and widely
used benchmark for domain adaptation with 31 categories,
consisting of 3 different domains including Amazon (A)
with 2817 images, Webcam (W) with 795 images, and DSLR (D) with 498 images. Following the commonly used
protocol defined in [13, 26, 28, 31], all 31 categories from
the three domains are used for evaluation of unsupervised
domain adaptation, forming 6 transfer tasks.
Office-Home [42] Office-Home is another classical
dataset with 65 categories, consisting of 4 different domains
including Artistic images (Ar), Clip Art images (Cl), Prod-
uct images (Pr) and Real-World images (Rw). Following
the commonly utilized protocol defined in [13, 26, 28, 31],
all 65 categories from the four domains are used for evalua-
tion of unsupervised domain adaptation, forming 12 transfer
tasks.
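The task counts quoted for the two benchmarks follow from taking every ordered (source, target) pair of distinct domains. A short check, with the helper name `transfer_tasks` chosen here for illustration:

```python
from itertools import permutations


def transfer_tasks(domains):
    """All ordered source->target pairs of distinct domains."""
    return [f"{src}->{tgt}" for src, tgt in permutations(domains, 2)]


office31 = transfer_tasks(["A", "W", "D"])
office_home = transfer_tasks(["Ar", "Cl", "Pr", "Rw"])
```

With 3 domains this gives 3 × 2 = 6 tasks, and with 4 domains 4 × 3 = 12 tasks.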
Table 1. Ablation study of our GSDA for domain adaptation on Office-31 (ResNet50).
Glb  Cls  Grp  Grad Sync   A→W    D→W    W→D     A→D    D→A    W→A    Avg
✓    -    -    -           87.9   98.2   100.0   85.5   66.4   64.1   83.7
✓    ✓    -    -           91.7   98.4   100.0   87.1   68.9   67.2   85.6
✓    ✓    ✓    -           93.1   99.0   100.0   91.4   71.5   67.0   87.0
✓    ✓    ✓    ✓           95.7   99.1   100.0   94.8   73.5   74.9   89.7
VisDA-2017 VisDA-2017 [32] is a more challenging
simulation-to-real task, with two distinct domains: synthet-
ic object images rendered from 3D models and real object
images. It contains 152397 training images and 55388 vali-
dation images across 12 classes. Following the training and
testing protocol in [34, 25], the model is trained on labeled
training and unlabeled validation set and tested on the vali-
dation set in unsupervised domain adaptation.
Office-31-PDA Recently, a new protocol for partial do-
main adaptation is built on Office-31 [20]. As defined in
[2, 3], the same three domains as that for the standard un-
supervised domain adaptation are used but with different
categories for source and target domains: all 31 categories
from the three domains are used as source domains, denoted
as A31, D31, and W31, respectively, while the 10 common
categories shared between Office-31 and Caltech-256 are
used as target domains, denoted as A10 (958 images), W10 (295 images) and D10 (157 images), respectively.
Implementation Details For fair comparison, in each setting we use the same network architecture as the compared methods. Specifically, we use ResNet50 as the backbone in all the experiments. In Office-31-DA and Office-Home, the hyper-parameters α in Equation (5) and β in Equation (13) are set to 0.02 and 1.0, respectively. In VisDA-2017 and Office-31-PDA, α and β are set to 0.2 and 10.0, respectively. In Office-31-DA and Office-31-PDA, the classes are divided into 6 groups; in Office-Home, into 13 groups; and in VisDA-2017, into 4 groups. To clarify the hyper-parameter selection, a sensitivity analysis of the hyper-parameters is presented in the supplementary materials. For stable training of GSDA, categories with fewer samples are augmented by randomly re-sampling images so that all categories have roughly the same number of images, avoiding the data imbalance problem stated in [50]. For the target domain, the class labels are unavailable, so only those samples with highly confident pseudo labels are used as training samples.
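The settings above can be collected into a small configuration table, together with a sketch of the two data-handling steps (confidence filtering of target pseudo labels and re-sampling of small categories). The values of α, β and the group counts are those stated in the text; the confidence threshold, the exact balancing rule (oversampling every class up to the largest one) and all function names are assumptions, since these details are not specified here.

```python
import random

# Hyper-parameters as stated in the text (alpha: Equation (5), beta: Equation (13)).
HYPERPARAMS = {
    "Office-31-DA": {"alpha": 0.02, "beta": 1.0, "groups": 6},
    "Office-Home": {"alpha": 0.02, "beta": 1.0, "groups": 13},
    "VisDA-2017": {"alpha": 0.2, "beta": 10.0, "groups": 4},
    "Office-31-PDA": {"alpha": 0.2, "beta": 10.0, "groups": 6},
}


def confident_pseudo_labels(tgt_probs, threshold):
    """Keep (sample index, predicted class) where the top probability >= threshold.

    The threshold value is a free parameter of this sketch.
    """
    kept = []
    for i, probs in enumerate(tgt_probs):
        k = max(range(len(probs)), key=probs.__getitem__)
        if probs[k] >= threshold:
            kept.append((i, k))
    return kept


def balance_by_resampling(indices_by_class, seed=0):
    """Oversample smaller (non-empty) classes with replacement up to the largest class."""
    rng = random.Random(seed)
    target = max(len(idx) for idx in indices_by_class.values())
    return {
        c: idx + [rng.choice(idx) for _ in range(target - len(idx))]
        for c, idx in indices_by_class.items()
    }


# Toy usage with made-up predictions and an assumed threshold of 0.9.
kept = confident_pseudo_labels([[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]], threshold=0.9)
balanced = balance_by_resampling({0: [1, 2, 3], 1: [4]})
```

In the toy usage, the middle sample is dropped because its top probability is below the assumed threshold, and the single-image class is repeated until both classes have three entries.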
3.2. Ablation Study
The ablation study is conducted on the unsupervised domain adaptation setting (Office-31-DA) to investigate the necessity of each component in GSDA. Briefly, our GSDA consists of three parts: global alignment, local alignment, and the hierarchical gradient synchronization between them. As shown in Table 1, the method with only global domain alignment (Glb) performs worse than the one with class-wise alignment (Cls) added, showing that the local alignment is
Table 2. Object classification accuracy on Office-31-DA (ResNet50). All methods follow the same settings, so most results are directly from the original works except MCDDA which is tuned using the released codes.
Method          A→W    D→W    W→D     A→D    D→A    W→A    Avg
ResNet50 [16]   68.4   96.7   99.3    68.9   62.5   60.7   76.1
TCA [29]        72.7   96.7   99.6    74.1   61.7   60.9   77.6
GFK [13]        72.8   95.0   98.2    74.5   63.4   61.0   77.5
DAN [24]        80.5   97.1   99.6    78.6   63.6   62.8   80.4
RTN [28]        84.5   96.8   99.4    77.5   66.2   64.8   81.6
JAN [27]        85.4   97.4   99.8    84.7   68.6   70.0   84.3
DANN [10]       82.0   96.9   99.1    79.7   68.2   67.4   82.2
ADDA [41]       86.2   96.2   98.4    77.8   69.5   68.9   82.9
MCDDA [34]      82.6   98.9   99.8    84.3   66.2   66.3   83.0
MADA [31]       90.0   97.4   99.6    87.8   70.3   66.4   85.2
CDAN [25]       94.1   98.6   100.0   92.9   71.0   69.3   87.7
SymNets [47]    90.8   98.8   100.0   93.9   74.6   72.5   88.4
SAFN [43]       90.3   98.7   100.0   92.1   73.4   71.2   87.6
BSP [4]         93.3   98.2   100.0   93.0   73.6   72.6   88.5
GSDA (Ours)     95.7   99.1   100.0   94.8   73.5   74.9   89.7
important for keeping the discriminative structure during
adaptation. When group-wise alignment (Grp) is further added,
the model improves again, because the discriminative structure
is captured more elaborately by randomly combining several
classes into a group. Furthermore, by considering hierarchical
gradient synchronization between global alignment and local
alignment, our GSDA (with gradient synchronization denoted as
Grad Sync) achieves a significant improvement, which indicates
its effectiveness and also illustrates the necessity of
consistency between global and local distribution alignment.
Clearly, our main contributions, i.e., group-wise alignment and
hierarchical alignment synchronization, show promising benefits
for domain adaptation.
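To make the phrase "randomly combining several classes as a group" concrete, the following is a minimal sketch of one way such groups could be drawn. It is an illustration under stated assumptions, not the grouping procedure of GSDA: the function name random_class_groups, the group size of 4, and the choice to partition the classes into disjoint groups are all hypothetical. Only the class count of 31 comes from the Office-31 dataset.

```python
import random


def random_class_groups(num_classes, group_size, seed=None):
    """Randomly partition class indices 0..num_classes-1 into disjoint groups.

    Assumption: groups are disjoint and cover every class; the last group
    may be smaller when group_size does not divide num_classes.
    """
    if num_classes <= 0 or group_size <= 0:
        raise ValueError("num_classes and group_size must be positive")
    rng = random.Random(seed)  # local generator, so the global RNG is untouched
    classes = list(range(num_classes))
    rng.shuffle(classes)
    return [
        sorted(classes[start:start + group_size])
        for start in range(0, num_classes, group_size)
    ]


# Office-31 has 31 classes; the group size of 4 is a hypothetical choice.
groups = random_class_groups(num_classes=31, group_size=4, seed=0)
for index, group in enumerate(groups):
    print(f"group {index}: classes {group}")
```

Each group could then be treated as one coarse category for a group-wise alignment term, sitting between the class-wise and global levels.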
3.3. Unsupervised Domain Adaptation
Unsupervised domain adaptation is the most typical setting
for domain adaptation, and there are many related works, such
as the conventional methods TCA [29] and GFK [13], the deep
adaptation works based on the MMD criterion like DAN [24],
RTN [28] and JAN [27], and the adversarial learning based
approaches including DANN [10], ADDA [41], MADA [31],
CDAN [25] and SymNets [47]. All these methods are compared
with our method on Office-31-DA, Office-Home and VisDA-2017,
introduced in Section 3.1. The experimental results are shown
in Tables 2, 3 and 4.
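As a small worked check of how the Avg column of Table 2 can be read, the sketch below recomputes the unweighted mean over the six Office-31 tasks for three rows and compares it with the reported value. The per-task numbers are copied from Table 2. The helper names and the 0.05 tolerance (half of the last reported digit) are assumptions made for this illustration.

```python
# Per-task accuracies in the order A→W, D→W, W→D, A→D, D→A, W→A,
# followed by the reported Avg, copied from Table 2.
TABLE2_ROWS = {
    "ResNet50 [16]": ([68.4, 96.7, 99.3, 68.9, 62.5, 60.7], 76.1),
    "SymNets [47]": ([90.8, 98.8, 100.0, 93.9, 74.6, 72.5], 88.4),
    "GSDA (Ours)": ([95.7, 99.1, 100.0, 94.8, 73.5, 74.9], 89.7),
}


def mean_accuracy(per_task):
    """Unweighted mean over the adaptation tasks."""
    return sum(per_task) / len(per_task)


def matches_reported(per_task, reported, tolerance=0.05):
    """True if the recomputed mean agrees with the reported one-decimal Avg."""
    return abs(mean_accuracy(per_task) - reported) <= tolerance


for method, (per_task, reported) in TABLE2_ROWS.items():
    status = "consistent" if matches_reported(per_task, reported) else "mismatch"
    print(f"{method}: mean={mean_accuracy(per_task):.2f}, reported={reported} ({status})")
```

For these rows the recomputed means agree with the reported averages to within rounding, which suggests that Avg is the plain mean over tasks rather than a sample-weighted one.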
As can be seen, the baseline without adaptation and the
conventional non-deep methods perform the worst, while
Table 3. Object classification accuracy on Office-Home dataset (ResNet50). All methods follow the same settings, so all the results are
directly copied from the original works.
Method          Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Avg
ResNet50 [16]   34.9   50.0   58.0   37.4   41.9   46.2   38.5   31.2   60.4   53.9   41.2   59.9   46.1
DAN [24]        43.6   57.0   67.9   45.8   56.5   60.4   44.0   43.6   67.7   63.1   51.5   74.3   56.3
DANN [10]       45.6   59.3   70.1   47.0   58.5   60.9   46.1   43.7   68.5   63.2   51.8   76.8   57.6
JAN [27]        45.9   61.2   68.9   50.4   59.7   61.0   45.8   43.4   70.3   63.9   52.4   76.8   58.3
CDAN [25]       50.7   70.6   76.0   57.6   70.0   70.0   57.4   50.9   77.3   70.9   56.7   81.6   65.8
SymNets [47]    47.7   72.9   78.5   64.2   71.3   74.2   64.2   48.8   79.5   74.5   52.6   82.7   67.6
SAFN [43]       54.4   73.3   77.9   65.2   71.5   73.2   63.6   52.6   78.2   72.3   58.0   82.1   68.5
GSDA (Ours)     61.3   76.1   79.4   65.4   73.3   74.3   65.0   53.2   80.0   72.2   60.6   83.1   70.3
Table 4. Object classification accuracy on VisDA-2017 task (ResNet50). All methods follow the same settings and encoder architecture
except the methods marked with † using ResNet101. The results are directly copied from the original works. The underlined results mean
the highest accuracies of the four marked methods with deeper networks or multiple data augmentations (S-En).
Method          plane   bcycl   bus     car     horse   knife   mcycl   person  plant   sktbrd  train   truck   Avg
ResNet50 [16]   70.6    51.8    55.8    68.9    77.9    7.6     93.3    34.5    81.1    27.9    88.6    5.6     55.3
DAN [24]        61.7    54.8    77.7    32.2    75.0    80.8    78.3    46.9    66.9    34.5    79.6    29.1    59.8
DANN [10]       75.9    70.5    65.3    17.3    72.8    38.6    58.0    77.2    72.5    40.4    70.4    44.7    58.6
MCDDA† [34]     87.0    60.9    83.7    64.0    88.9    79.6    84.7    76.9    88.6    40.3    83.0    25.8    71.9
TPN [30]        93.7    85.1    69.2    81.6    93.5    61.9    89.3    81.4    93.5    81.6    84.5    49.9    80.4
S-En* [8]       96.3    87.9    84.7    55.7    95.9    95.2    88.6    77.4    93.3    92.8    87.5    38.2    82.8
BSP† [4]        92.4    61.0    81.0    57.5    89.0    80.6    90.1    77.0    84.2    77.9    82.1    38.4    75.9
SAFN† [43]      93.6    61.3    84.1    70.6    94.1    79.0    91.8    79.6    89.9    55.6    89.0    24.4    76.1
GSDA (Ours)     93.1    67.8    83.1    83.4    94.7    93.4    93.4    79.5    93.0    88.8    83.4    36.7    81.5
the deep methods with the MMD criterion, such as DAN [24]
and JAN [27], perform much better, benefiting from the
favorable non-linearity of deep networks. Furthermore, the
adversarial learning based methods, including DANN [10],
ADDA [41], MADA [31] and ours, perform even better than
those MMD-based deep methods, owing to the more powerful
capability of adversarial learning for reducing distribution
discrepancy.
Among the adversarial-based methods, DANN [10] and
ADDA [41] are early ones that concentrate only on global
distribution alignment; they outperform the MMD-based
methods, but with limited improvement. MADA [31],
CDAN [25] and SymNets [47] further consider class-level
alignment, achieving more promising adaptation. However,
they do not consider the intrinsic relation between local and
global alignment, so some misalignment may still appear.
Going a step further, our proposed method GSDA considers
not only global and local (i.e., class-wise and group-wise)
alignment, but also the hierarchical gradient synchronization
relation between them, leading to better adaptation.
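To give a numerical feel for what "consistent" gradients from global and local alignment could mean, the sketch below compares two gradient vectors by direction and by magnitude. It is only an illustration under stated assumptions: the function gradient_agreement, the use of cosine similarity, and the norm gap are hypothetical choices made here, and they are not the synchronization loss defined in the method section of the paper.

```python
import math


def gradient_agreement(grad_global, grad_local):
    """Compare two flattened gradient vectors of equal length.

    Returns (cosine, norm_gap): cosine similarity of the directions and the
    absolute difference of the Euclidean norms. Both measures are assumptions
    made for illustration, not the formulation used by GSDA.
    """
    if len(grad_global) != len(grad_local):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(grad_global, grad_local))
    norm_global = math.sqrt(sum(a * a for a in grad_global))
    norm_local = math.sqrt(sum(b * b for b in grad_local))
    if norm_global == 0.0 or norm_local == 0.0:
        cosine = 0.0  # direction is undefined for a zero gradient
    else:
        cosine = dot / (norm_global * norm_local)
    return cosine, abs(norm_global - norm_local)


# Hypothetical gradients of a global and a local alignment loss with respect
# to the same feature vector: same direction, different magnitude.
g_global = [0.2, -0.1, 0.4]
g_local = [0.1, -0.05, 0.2]
cosine, norm_gap = gradient_agreement(g_global, g_local)
print(f"cosine={cosine:.3f}, norm_gap={norm_gap:.3f}")
```

Under this reading, a cosine near 1 with a small norm gap would indicate that the two alignment levels pull the features in a consistent way, whereas a negative cosine would signal conflicting updates.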
Moreover, BSP [4] and SAFN [43] are recently proposed
methods that take perspectives different from feature
distribution alignment. BSP penalizes the largest singular
values of feature eigenvectors to enhance discriminability,
and SAFN improves transferability by magnifying the norm of
features. Compared with these two novel methods, our method
still achieves better average accuracy (Tables 2, 3 and 4),
demonstrating the advantage and necessity of considering the
relation between the global and local distribution alignment.
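The two quantities that BSP and SAFN act on can be computed for a batch of features as sketched below. This is a minimal illustration, assuming NumPy and a feature matrix of shape (batch, dim); the function names, the squaring of the singular values, and the parameter k are assumptions made here, and the exact losses are those defined in the original works [4] and [43].

```python
import numpy as np


def top_singular_value_penalty(features, k=1):
    """Sum of the squared k largest singular values of a (batch, dim) matrix.

    Assumption: squaring and the choice of k are illustrative; see [4] for
    the penalty actually used by BSP.
    """
    matrix = np.asarray(features, dtype=float)
    # With compute_uv=False, svd returns singular values in descending order.
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sum(singular_values[:k] ** 2))


def mean_feature_norm(features):
    """Average Euclidean norm of the per-sample feature vectors.

    This is the statistic that a norm-enlarging method such as SAFN [43]
    would push upward; the enlarging step itself is not shown here.
    """
    matrix = np.asarray(features, dtype=float)
    return float(np.mean(np.linalg.norm(matrix, axis=1)))


# Toy batch of three 3-dimensional features with singular values 3, 2 and 1.
batch = np.diag([3.0, 2.0, 1.0])
print("top-1 penalty:", top_singular_value_penalty(batch, k=1))
print("top-2 penalty:", top_singular_value_penalty(batch, k=2))
print("mean feature norm:", mean_feature_norm(batch))
```

Both quantities are computed from the features alone, without a domain discriminator, which is one way to see why the text describes these methods as taking a perspective different from distribution alignment.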
4. Conclusion and Future Work
Aiming for better unsupervised domain adaptation, we
propose a novel method named GSDA, which aligns the
distributions of two different domains both globally and
locally, with gradient synchronization between the two levels.
The hierarchical gradient synchronization module is established
to ensure the consistency between global and local distribution
alignment for better structure preservation. Extensive
experiments verify the superiority of our method.
The gradient synchronization between global and local
domain alignment has achieved promising improvement in
this work, which also implies that the relation between
global and local distribution alignment deserves deeper
analysis and exploration in the future.
ACKNOWLEDGEMENT
This work is partially supported by the National Key R&D
Program of China (No. 2017YFA0700800), the Natural Science
Foundation of China (No. 61772496) and the UCAS Joint
PhD Training Program.
References
[1] S. Bickel, M. Bruckner, and T. Scheffer. Discriminative
learning under covariate shift. Journal of Machine Learn-
ing Research (JMLR), 10(9):2137–2155, 2009.
[2] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and
Michael I. Jordan. Partial transfer learning with selective ad-
versarial networks. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2018.
[3] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin
Wang. Partial adversarial domain adaptation. In European
Conference on Computer Vision (ECCV), 2018.
[4] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin
Wang. Transferability vs. discriminability: Batch spectral
penalization for adversarial domain adaptation. In Interna-
tional Conference on Machine Learning (ICML), 2019.
[5] Z. Ding and Y. Fu. Robust transfer metric learning for im-
age classification. IEEE Transactions on Image Processing
(TIP), 26(2):660–670, 2016.
[6] Zhengming Ding, Sheng Li, Ming Shao, and Yun Fu. Graph
adaptive knowledge transfer for unsupervised domain
adaptation. In European Conference on Computer Vision
(ECCV), 2018.
[7] M. Dudík, R. E. Schapire, and S. J. Phillips. Correcting
sample selection bias in maximum entropy density estimation. In
Neural Information Processing Systems (NeurIPS), 2005.
[8] Geoffrey French, Michal Mackiewicz, and Mark Fisher.
Self-ensembling for visual domain adaptation. In International
Conference on Learning Representations (ICLR), 2018.
[9] Y. Ganin and V. Lempitsky. Unsupervised domain adap-
tation by backpropagation. In International Conference on
Machine learning (ICML), 2015.
[10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, et al.
Domain-adversarial training of neural networks. Journal
of Machine Learning Research (JMLR), 17(1):2096–2030,
2016.
[11] Behnam Gholami, Ognjen Rudovic, and Vladimir Pavlovic.
Punda: Probabilistic unsupervised domain adaptation for
knowledge transfer across visual categories. In IEEE Inter-
national Conference on Computer Vision (ICCV), 2017.
[12] B. Gong, K. Grauman, and F. Sha. Connecting the dots with
landmarks: Discriminatively learning domain-invariant fea-
tures for unsupervised domain adaptation. In International
Conference on Machine learning (ICML), 2013.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow
kernel for unsupervised domain adaptation. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
2012.
[14] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation
for object recognition: An unsupervised approach. In IEEE
International Conference on Computer Vision (ICCV), 2011.
[15] Yves Grandvalet and Yoshua Bengio. Semi-supervised
learning by entropy minimization. In Neural Information
Processing Systems (NeurIPS), 2005.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[17] Cheng An Hou, Yao Hung Hubert Tsai, Yi Ren Yeh, and
Yu Chiang Frank Wang. Unsupervised domain adaptation
with label and structural consistency. IEEE Transactions on
Image Processing (TIP), 25(12):5552–5562, 2016.
[18] Lanqing Hu, Meina Kan, Shiguang Shan, and Xilin Chen.
Duplex generative adversarial network for unsupervised do-
main adaptation. In IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 2018.
[19] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, B.
Schölkopf, et al. Correcting sample selection bias by
unlabeled data. In Neural Information Processing Systems
(NeurIPS), 2007.
[20] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual
category models to new domains. In European Conference
on Computer Vision (ECCV), 2010.
[21] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Haupt-
mann. Contrastive adaptation network for unsupervised do-
main adaptation. In IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 2019.
[22] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-
image translation networks. In Neural Information Process-
ing Systems (NeurIPS), 2017.
[23] M. Liu and O. Tuzel. Coupled generative adversarial
networks. In Neural Information Processing Systems
(NeurIPS), 2016.
[24] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning trans-
ferable features with deep adaptation networks. In Interna-
tional Conference on Machine learning (ICML), 2015.
[25] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and
Michael I Jordan. Conditional adversarial domain adapta-
tion. In Neural Information Processing Systems (NeurIPS),
2018.
[26] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang
Sun, and Philip S. Yu. Transfer feature learning with joint
distribution adaptation. In IEEE International Conference
on Computer Vision (ICCV), 2014.
[27] Mingsheng Long, Jianmin Wang, and Michael I. Jordan.
Deep transfer learning with joint adaptation networks. In In-
ternational Conference on Machine learning (ICML), 2017.
[28] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised
domain adaptation with residual transfer networks. In Neural
Information Processing Systems (NeurIPS), 2016.
[29] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain
adaptation via transfer component analysis. IEEE Transac-
tions on Neural Networks (TNN), 22(2):199–210, 2010.
[30] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah
Ngo, and Tao Mei. Transferrable prototypical networks for
unsupervised domain adaptation. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2019.
[31] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial
domain adaptation. In AAAI Conference on Artificial Intelli-
gence (AAAI), 2018.
[32] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman,
Dequan Wang, and Kate Saenko. Visda: The visual domain
adaptation challenge. CoRR, abs/1710.06924, 2017.
[33] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training
for unsupervised domain adaptation. In International Con-
ference on Machine learning (ICML), 2017.
[34] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tat-
suya Harada. Maximum classifier discrepancy for unsuper-
vised domain adaptation. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2018.
[35] Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo,
and Rama Chellappa. Generate to adapt: Aligning domains
using generative adversarial networks. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning
transferrable representations for unsupervised domain adap-
tation. In Neural Information Processing Systems (NeurIPS),
2016.
[37] M. Shao, C. Castillo, Z. Gu, and Y. Fu. Low-rank trans-
fer subspace learning. In International Conference on Data
Mining (ICDM), 2012.
[38] M. Shao, D. Kit, and Y. Fu. Generalized transfer subspace
learning through low-rank constraint. International Journal
of Computer Vision (IJCV), 109(1-2):74–93, 2014.
[39] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate
shift adaptation by importance weighted cross validation.
Journal of Machine Learning Research (JMLR), 8(5):985–1005,
2007.
[40] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-
domain image generation. 2017.
[41] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial
discriminative domain adaptation. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017.
[42] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty,
and Sethuraman Panchanathan. Deep hashing network for
unsupervised domain adaptation. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017.
[43] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Larger
norm more transferable: An adaptive feature norm approach
for unsupervised domain adaptation. In IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), 2019.
[44] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang,
Yong Xu, and Wangmeng Zuo. Mind the class weight bias:
Weighted maximum mean discrepancy for unsupervised
domain adaptation. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017.
[45] B. Zadrozny. Learning and evaluating classifiers under sam-
ple selection bias. In International Conference on Machine
learning (ICML), 2004.
[46] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geo-
metrical and statistical alignment for visual domain adapta-
tion. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017.
[47] Yabin Zhang, Hui Tang, Kui Jia, and Mingkui Tan. Domain-
symmetric networks for adversarial domain adaptation. In
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019.
[48] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye
Nehorai. Aligning infinite-dimensional covariance matrices
in reproducing kernel Hilbert spaces for domain adaptation.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2018.
[49] Jun Yan Zhu, Taesung Park, Phillip Isola, and Alexei A.
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017.
[50] Yang Zou, Zhiding Yu, B.V.K. Vijaya Kumar, and Jinsong
Wang. Unsupervised domain adaptation for semantic seg-
mentation via class-balanced self-training. In European Con-
ference on Computer Vision (ECCV), 2018.