Unsupervised Domain Adaptation With Hierarchical Gradient...


Page 1: Unsupervised Domain Adaptation With Hierarchical Gradient …openaccess.thecvf.com/content_CVPR_2020/papers/Hu... · 2020-06-28 · Unsupervised Domain Adaptation with Hierarchical

Unsupervised Domain Adaptation with Hierarchical Gradient Synchronization

Lanqing Hu 1,2, Meina Kan 1,2, Shiguang Shan 1,2,3, Xilin Chen 1,2

1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China

2 University of Chinese Academy of Sciences, Beijing 100049, China

3 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai 200031, China

[email protected] {kanmeina,sgshan,xlchen}@ict.ac.cn

Abstract

Domain adaptation attempts to boost performance on a target domain by borrowing knowledge from a well-established source domain. To handle the distribution gap between the two domains, prominent approaches endeavor to extract domain-invariant features. After a perfect domain alignment, the domain-invariant representations of the two domains should share the same characteristics both globally and in any local piece. Inspired by this, we propose a novel method called Hierarchical Gradient Synchronization to model the synchronization relationship among the local distribution pieces and the global distribution, aiming for more precise domain-invariant features. Specifically, hierarchical domain alignments, including class-wise, group-wise, and global alignment, are first constructed. Then, these three types of alignment are constrained to be consistent to ensure better structure preservation. As a result, the obtained features are domain invariant and intrinsically structure preserving. Evaluated on extensive domain adaptation tasks, our proposed method achieves state-of-the-art classification performance on both vanilla unsupervised domain adaptation and partial domain adaptation.

1. Introduction

The general hypothesis of machine learning is that the training and testing data share a similar distribution, which makes a model trained on large-scale labeled data perform well on the test data. However, in many real-world applications, we usually only have access to a limited amount of labeled training data sharing a similar distribution with the testing data, which is insufficient for training a good enough model. Domain adaptation has shown promise on this challenge by borrowing knowledge from a sophisticated set (i.e., the source domain) that has a large amount of labeled data but follows a different distribution from the test data (i.e., the target domain).

According to the amount of labeled data in the target domain, domain adaptation can be categorized into supervised, semi-supervised, and unsupervised domain adaptation. This paper concentrates on unsupervised domain adaptation, where there is only unlabeled data in the target domain. Most existing works deal with the domain adaptation problem by reducing the marginal distribution discrepancy (i.e., the distribution of data X) or the conditional distribution discrepancy (i.e., the distribution of data X given class labels Y). Some works also attempt to tackle both the marginal and conditional distributions simultaneously.

In the early days, most methods endeavored to align the marginal distributions of the source and target domains by instance re-weighting, addressing sample selection bias [45, 7, 19] and covariate shift [39, 1]. These approaches are suitable for scenarios where the source and target domains share the same support, so they cannot achieve satisfactory performance in in-the-wild scenarios.

To better handle complicated scenarios, common subspace methods that focus on extracting domain-invariant representations emerged [14, 13, 37, 12, 38]. These methods mainly attempt to minimize the gap between the marginal distributions of the two domains. In the Geodesic Flow Kernel (GFK) approach [13], an infinite number of subspaces are integrated to model the domain shift between the source and target domains. In [12], a set of landmarks, i.e., a subset of labeled data from the source domain whose distribution is most similar to that of the target domain, is uncovered to bridge the source and target domains. The methods proposed in [24] and [28] embed deep features into reproducing kernel Hilbert spaces (RKHS) and minimize the maximum mean discrepancy (MMD) of the features for distribution adaptation. JGSA [46] and PUnDA [11] mitigate the geometrical structure gap and the distribution shift jointly. The method in [48] handles the domain shift by aligning the RKHS covariance matrix across domains.

Figure 1. Illustration of (a) local misalignment in methods with only global distribution alignment, and (b) the expected consistent alignment of both the global domain and the local classes. Best viewed in color.

In these conventional approaches, the distribution discrepancy is usually measured by metrics such as MMD, K-L divergence, and Bregman divergence. Recently, the adversarial loss has attracted a lot of attention as a more powerful metric. The works in [9, 10, 41] handle the domain shift by adding a gradient reversal layer or employing an adversarial objective on target domain features. As a result, features that confuse the domain classifier are generally domain invariant. Subsequently, many methods based on domain transformation via adversarial learning [40, 23, 22, 49, 35] attained quite promising performance on distribution alignment and domain-invariant feature extraction.

The above methods only consider the gap between the marginal distributions of the two domains. In other words, these methods only align the two domains globally, without considering whether the alignment of local pieces is correct. As a result, two domains may be well aligned globally while their local pieces (e.g., categories) are mismatched, as shown in Figure 1(a).

In recent years, a few methods have attempted to minimize the gap between the conditional distributions (i.e., class-wise distributions) of the two domains, for better alignment of the categories between them. Specifically, in WDAN [44], a class-specific auxiliary weight for each class is introduced into the original MMD metric to exploit the class prior on the source and target domains. MADA [31] applies adversarial learning multiple times, once for each class, gaining further performance improvement on the target domain. Further, based on this multiple adversarial framework, CDAN [25] designs multi-linear conditioning, i.e., conducting adversarial learning on the covariance between feature representations and classifier predictions, to implicitly align the conditional distributions of the source and target domains, which handles the domain distribution alignment more elaborately. Similarly, methods specialized for partial domain adaptation, including SAN [2] and PADA [3], also show the advantages of considering class-wise distribution alignment.

Some other methods [26, 17, 36, 33, 5, 18, 30, 47] directly predict the category labels of unlabeled target domain samples during training and use them as pseudo-labels. With the pseudo category labels of target domain samples and the known true labels of source domain samples, samples from distinct domains but the same category are implicitly pulled close to share the same distribution. In SymNets [47], domain discrimination and confusion are stacked upon the concatenated classifiers of the source and target domains, thus facilitating both domain-level and category-level feature distribution confusion. MCDDA [34] and CAN [21] both concentrate on explicitly calibrating the category-level distributions of the two domains. MCDDA [34] plays a min-max game between the feature encoder and two different classifiers to optimize the decision boundary and thereby alleviate the intra-class domain discrepancy. CAN [21] explicitly minimizes the intra-class discrepancy and simultaneously maximizes the inter-class discrepancy between domains according to the source labels and the predicted target labels.

Generally, these recently proposed methods consider both the global (domain-level) distribution discrepancy and the local (category-level) distribution discrepancy, thus achieving promising performance. However, in these methods the global and local alignments are implemented separately, e.g., by minimizing a weighted sum of the domain-level and category-level discrepancies [6, 31, 25]. As a result, the obtained solution is only a trade-off between the global and local distribution alignments, and inconsistent distribution alignment still exists.

As observed from Figure 1(b), in a perfect domain alignment, the calibration of the local categories and that of the global domain distribution are consistent, i.e., the calibration directions are roughly the same. To account for the intrinsic relation between local and global distribution alignment, in this work we propose a new method that consistently aligns the local and global distributions by constraining the gradients of local and global alignment to be synchronous, referred to as Domain Adaptation with Hierarchical Gradient Synchronization (GSDA).

Briefly, the contributions of this work are three-fold: (1) We propose a novel method that considers the consistency of the global and local distribution alignment, to preserve the intrinsic structures of both domain distributions for better domain adaptation. To the best of our knowledge, it is the first work to explicitly model the intrinsic relation between global and local distribution alignment. (2) The consistency of the global and local distribution alignment is achieved by a newly designed hierarchical gradient synchronization module. (3) Experimentally, the method achieves state-of-the-art classification accuracy in unsupervised domain adaptation and partial domain adaptation scenarios.

2. Method

For clarity, we first give some definitions. The labeled source domain images and the unlabeled target domain images are denoted as $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$ and $X_t = \{x_j^t\}_{j=1}^{m}$, respectively. In unsupervised domain adaptation, the source and target domains, i.e., $X_s$ and $X_t$, generally follow different distributions but share the same categories. The samples in the source domain are labeled, with category labels denoted as $y_i^s \in C_s = \{1, 2, \cdots, r\}$, while the samples in the target domain are unlabeled. In unsupervised domain adaptation the source and target domains share exactly the same categories, i.e., $C_t = C_s$, where $C_t$ and $C_s$ are the sets of $r$ classes in the target and source domains. There is also a special scenario in which $C_t$ is a subset of $C_s$, i.e., $C_t \subset C_s$, called partial unsupervised domain adaptation. Our method is applicable to both unsupervised domain adaptation and partial unsupervised domain adaptation. For ease of understanding, we introduce the formulation in the unsupervised domain adaptation scenario, and evaluate both tasks in the experiments section. Unless otherwise specified, the symbols $s$ and $t$ used in superscripts or subscripts denote the source domain and target domain, respectively.

The whole framework of our method is shown in Figure 2. It is equipped with a feature extractor $E$, an object classifier $C$, and three types of adversarial discriminators $D = \{D^{dom}, D^{grp}, D^{cls}\}$. Here, $D^{dom}$ denotes the adversarial discriminator for global domain distribution alignment, namely the domain-level alignment; $D^{cls}$ denotes the adversarial discriminators for local class-wise distribution alignment; and $D^{grp}$ denotes the adversarial discriminators for group-wise alignment, where each group is composed of several classes. The feature extractor $E$ is fed with both source and target domain data and outputs features $f$ that are expected to be domain invariant. The features are then fed into the classifier $C$ for classification and into the adversarial discriminators $D$ for domain shift reduction. The feature extractor $E$ and the discriminators $D$ play a two-player min-max game to make the features from $E$ domain invariant. In other words, the features from $E$ should be domain invariant if they successfully fool the domain discriminators $D$.

2.1. Feature Extraction and Classification

The feature extractor $E$ encodes the input source and target samples $x^s$ and $x^t$ into a common feature space as follows:

$$f^s = E(x^s), \quad f^t = E(x^t), \tag{1}$$

where $E$ can be any kind of network architecture, such as several successive convolutional layers. Then $f \in \{f^s, f^t\}$ is fed into the classifier $C$ to ensure that the feature $f$ is discriminative. The parameters of the feature extractor $E$ and the classifier $C$ are denoted as $\theta_E$ and $\theta_C$, respectively. The output of the object classifier $C$ is:

$$p_i^s = C(f_i^s), \quad p_j^t = C(f_j^t), \tag{2}$$

where $p_i^s$ is the softmax output of $C$ with $x_i^s$ as input, and $p_j^t$ is the softmax output of $C$ with $x_j^t$ as input. Since true category labels are available for the source domain, the cross-entropy classification loss is directly applied:

$$L_c^s = \sum_{x_i^s \in X_s} H\big(C(E(x_i^s)),\, y_i^s\big), \tag{3}$$

where $H(\cdot, \cdot)$ represents the cross-entropy loss.

For target domain samples, the category labels are unavailable, and thus the conventional cross-entropy loss is inapplicable. Therefore, following [15], the conditional entropy loss is exploited to enhance the certainty of the prediction, i.e., to force only one element of $p_j^t$ to be dominant while the rest are suppressed. Formally, the conditional entropy loss $L_c^t$ for unlabeled target domain samples is:

$$L_c^t = \sum_{x_j^t \in X_t} H\big(C(E(x_j^t))\big), \tag{4}$$

where $H(\cdot)$ is the conditional entropy loss with $H(p_j^t) = -\sum_{k=1}^{r} p_j^t(k) \log p_j^t(k)$. The $k$-th element $p_j^t(k)$ of $p_j^t$ indicates the probability of $x_j^t$ being assigned to the $k$-th class.


Figure 2. Illustration of the overall framework of our GSDA method. An input sample xi from source or target domain is firstly encoded

by the common feature extractor E . Based on the extracted feature, the classifier C is designed for object classification, and the adversarial

discriminators including Ddom, Dgrp and Dcls are designed for distribution alignment from perspective of domain-level, group-level and

category-level respectively. Furthermore, a hierarchical gradient synchronization between the three types of adversarial discriminators is

constructed to constrain the consistency between global and local alignment for better structure preservation. Best viewed in color.

Overall, the object classification loss over both domains is:

$$L_c = L_c^s + \alpha L_c^t, \tag{5}$$

which constrains the common feature $f$ to be discriminative, benefiting the classification task.
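As a minimal sketch of Equations (3) to (5), the NumPy snippet below computes the source cross-entropy, the target conditional entropy, and their weighted sum from raw classifier logits. The function names, the `eps` constant, and the default value of `alpha` are illustrative assumptions and are not taken from the paper; sums rather than means are used to mirror the equations.

```python
import numpy as np

EPS = 1e-12  # assumed numerical guard against log(0)

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def source_cross_entropy(probs, labels):
    """Eq. (3): summed cross-entropy against the true source labels."""
    n = probs.shape[0]
    return float(-np.log(probs[np.arange(n), labels] + EPS).sum())

def target_conditional_entropy(probs):
    """Eq. (4): summed entropy of the target predictions."""
    return float(-(probs * np.log(probs + EPS)).sum())

def classification_loss(src_logits, src_labels, tgt_logits, alpha=0.1):
    """Eq. (5): L_c = L_c^s + alpha * L_c^t (alpha=0.1 is a placeholder)."""
    l_src = source_cross_entropy(softmax(src_logits), src_labels)
    l_tgt = target_conditional_entropy(softmax(tgt_logits))
    return l_src + alpha * l_tgt
```

The entropy term is largest for a uniform prediction and close to zero for a one-hot prediction, which is why minimizing it pushes target predictions toward a single dominant class.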

2.2. Domain Distribution Alignment

Besides being discriminative across categories, the feature $f$ from $E$ should also be domain invariant to enable knowledge transfer from the source domain to the target domain. In a perfect domain-invariant feature space, not only the global structure of both domains but also any local piece, such as every group or even every class, should be well aligned. To this end, three types of adversarial discriminators are introduced for domain-level, group-level, and class-level distribution alignment, respectively. Furthermore, the consistency of the three types of alignment is constrained by a novel hierarchical gradient synchronization module. This synchronization module ensures that the alignment of any local piece is consistent with the global alignment structure, leading to a more informative domain alignment.

Global Adversarial Discriminator. The global adversarial discriminator, i.e., the domain-level adversarial discriminator $D^{dom}$, is designed to distinguish the source domain from the target domain with the cross-entropy loss as follows:

$$L_g = \sum_{x_i \in X_s \cup X_t} H\big(D^{dom}(E(x_i)),\, d_i\big), \quad \text{with } d_i = \begin{cases} 1, & \text{if } x_i \in X_s, \\ 0, & \text{if } x_i \in X_t, \end{cases} \tag{6}$$

where $d_i$ represents the domain label of each sample $x_i$. By playing the min-max adversarial optimization between $E$ and this discriminator $D^{dom}$, whose parameters are denoted as $\theta_{D^{dom}}$, the whole distributions of the two domains produced by $E$ become globally inseparable.

Local Adversarial Discriminators. Even if the global distribution is well aligned, the distribution of each class in the two domains may be misaligned, as shown in Figure 1(a); e.g., the $i$-th category of the source domain may be aligned to the $k$-th ($i \neq k$) category of the target domain although the two domains are globally well aligned. This is because the global domain alignment constraint considers only the discrepancy of the whole domain and not the discrepancy in any local piece. Therefore, local adversarial discriminators are established to deal with the distribution discrepancy in local regions of the source and target domains. They are of two kinds: class-wise ones and group-wise ones.

First and most straightforwardly, class-wise adversarial discriminators are constructed to tackle the discrepancy within each category between the source and target domains, i.e., the $i$-th category of the source domain should be aligned to the $i$-th category of the target domain rather than to other categories in the target domain. Formally, the class-wise adversarial discriminator for the $k$-th category is denoted as $D^{cls}_k$, and its domain discrimination loss is formulated as follows:

$$L^{cls}_k = \sum_{x_i \in X_s \cup X_t} p_i^k\, H\big(D^{cls}_k(E(x_i)),\, d_i\big), \quad \text{with } d_i = \begin{cases} 1, & \text{if } x_i \in X_s, \\ 0, & \text{if } x_i \in X_t, \end{cases} \tag{7}$$

where $d_i$ is the domain label, as in the global adversarial discriminator, $k \in \{1, 2, \cdots, r\}$ denotes the index of the $k$-th class-wise adversarial discriminator, and $p_i^k$ is the loss weight of sample $x_i$, representing its probability of belonging to the $k$-th class, i.e., the $k$-th dimension of the outputs $p_i^s$ and $p_i^t$ in Equation (2). Note that if $x_i \in X_s$ and it belongs to the $k$-th class, then $p_i^k = 1$ and $p_i^j|_{j \neq k} = 0$, because the label of $x_i \in X_s$ is known. For $x_i \in X_t$, whose label is unavailable, the corresponding $p_i^k$ is the predicted probability of $x_i$ being classified into the $k$-th class by the classifier $C$ in Equation (2).

Likewise, by playing the min-max adversarial optimization with the objective above, the distributions of the two domains become well aligned for each category. The parameters of each class-wise local discriminator $D^{cls}_k$ are denoted as $\theta_{D^{cls}_k}$.
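The weighting in Equation (7) can be sketched as follows. This is an illustrative NumPy example, not the authors' implementation: the function names and array layout are assumptions, and the discriminator outputs are passed in as plain probabilities rather than produced by a network. Source samples get one-hot weights from their labels, and target samples get the classifier's predicted probabilities.

```python
import numpy as np

EPS = 1e-12  # assumed numerical guard against log(0)

def binary_cross_entropy(d_pred, d_true):
    """Per-element cross-entropy between predicted and true domain labels."""
    return -(d_true * np.log(d_pred + EPS)
             + (1.0 - d_true) * np.log(1.0 - d_pred + EPS))

def sample_class_weights(src_labels, tgt_probs, num_classes):
    """Weights p_i^k: one-hot rows for source, predicted rows for target."""
    src_w = np.eye(num_classes)[src_labels]
    return np.vstack([src_w, tgt_probs])

def classwise_losses(d_pred, d_true, weights):
    """Eq. (7) for all k at once.

    d_pred[i, k] is the output of the k-th class-wise discriminator on
    sample i, d_true[i] is its domain label (1 = source, 0 = target),
    and weights[i, k] is p_i^k. Returns a length-r vector of L_k^cls.
    """
    bce = binary_cross_entropy(d_pred, d_true[:, None])
    return (weights * bce).sum(axis=0)
```

With every weight set to 1 and a single discriminator column, the same function reduces to the global loss of Equation (6).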

Besides each class, any local group consisting of several classes should also be well aligned in a perfect domain alignment. Thus, the local alignment can be reinforced by establishing group-level adversarial discriminators. Similar to the class-wise adversarial discriminators, the domain discrimination loss of the group-wise adversarial discriminator $D^{grp}_q$ for the $q$-th group is formulated as follows:

$$L^{grp}_q = \sum_{x_i \in X_s \cup X_t} p_i^q\, H\big(D^{grp}_q(E(x_i)),\, d_i\big), \quad \text{with } d_i = \begin{cases} 1, & \text{if } x_i \in X_s, \\ 0, & \text{if } x_i \in X_t, \end{cases} \tag{8}$$

where $q \in \{1, 2, \cdots, b\}$ denotes the index of the $q$-th group-wise adversarial discriminator, and $p_i^q$ denotes the probability of $x_i$ belonging to the $q$-th group. The groups here are simply obtained as random divisions of all the classes defined in Equation (7). Correspondingly, the probability of belonging to the $q$-th group can be easily obtained as $p_i^q = \sum_{k \in q} p_i^k$. Generally, the classes in different groups are allowed to overlap with each other, while in this work all groups are simply randomly divided without overlap. It is worth mentioning that, when the number of classes is large, these groups could be hierarchically structured rather than flat.
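The random, non-overlapping grouping and the group probability $p_i^q = \sum_{k \in q} p_i^k$ can be sketched as below. The helper names and the use of a seeded permutation are assumptions made for illustration; the paper states only that groups are random divisions of the classes without overlap.

```python
import numpy as np

def random_disjoint_groups(num_classes, num_groups, seed=0):
    """Randomly partition class indices into non-overlapping groups."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_classes)
    return [np.sort(g) for g in np.array_split(perm, num_groups)]

def group_probabilities(class_probs, groups):
    """p_i^q: sum the class probabilities of the classes in each group.

    class_probs has shape (N, r); the result has shape (N, b).
    """
    return np.stack([class_probs[:, g].sum(axis=1) for g in groups], axis=1)
```

Because the groups partition the classes, each row of the group probabilities still sums to 1 whenever the class probabilities do.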

Similarly, by playing the min-max adversarial optimization with the objective above, the distributions of the two domains become well aligned locally in each group. The parameters of each group-wise local discriminator $D^{grp}_q$ are denoted as $\theta_{D^{grp}_q}$.

Then the overall parameters of all discriminators are denoted as $\theta_D = \{\theta_{D^{dom}}, \theta_{D^{cls}}, \theta_{D^{grp}}\}$. By summing up all the local adversarial discriminators, the objective for local distribution alignment is obtained as:

$$L_l = \sum_{q=1}^{b} L^{grp}_q + \sum_{k=1}^{r} L^{cls}_k, \tag{9}$$

where $b$ stands for the number of groups and $r$ represents the number of classes.

Overall, the three types of distribution alignment, including domain-level, group-wise, and class-wise alignment, form a hierarchical aligning structure, aiming for better alignment between the source and target domains both globally and locally.

2.3. Hierarchical Gradient Synchronization

The preceding global and local adversarial discriminators deal with the distribution alignment between domains from global and local perspectives, but independently. This may cause inconsistency among the global and local alignments, which would compromise their aligning directions and lead to inaccurate distribution alignment.

Figure 3. Illustration of the hierarchical distribution alignments and hierarchical gradient synchronization among them.

In a perfect global alignment, any local piece should also be well aligned, and vice versa: a perfect alignment of each local piece also forms an optimal global alignment. Specifically, the aligning direction and magnitude of each local piece should be consistent with those of the whole domain. So, intuitively, the consistency between the global and local domain alignment could be used to verify whether two domains are well aligned. In turn, it would benefit the domain alignment if this consistency were built into the process of distribution alignment. With this in mind, a novel constraint on the gradient is designed as the Hierarchical Gradient Synchronization term, which is illustrated in Figure 3 and formulated in Equations (10) and (11) below.

Specifically, Hierarchical Gradient Synchronization consists of gradient synchronization among the three levels of adversarial discriminators, i.e., domain-level, group-level, and class-level discriminators, forming a hierarchy. The gradient synchronization between class-wise alignment and group-wise alignment is designed as:

$$L^{syn}_{grp \sim cls} = \left| \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial L^{grp}_q}{\partial E(x_i)} \right\|_2 - \sum_{k \in grp_q} \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial L^{cls}_k}{\partial E(x_i)} \right\|_2 \right|. \tag{10}$$

The gradient synchronization objective in Equation (10) attempts to make the magnitude of the aligning direction of each group consistent with the sum of those of the classes within this group. Here, the first term denotes the gradient magnitude of the discriminator for the $q$-th group, and the second term denotes the gradient magnitudes of the discriminators for each class in the $q$-th group. Note that we constrain only the magnitude, as it can affect both the direction and the magnitude, while summing gradient directions would neutralize the difference. Note also that in the second term, $x_i \in X_s \cup X_t$ still refers to the samples from the $k$-th class, because the sampling probability is included in $L^{cls}_k$.

Similarly, the gradient synchronization between group-wise alignment and the alignment of the whole domain is formulated as follows:

$$L^{syn}_{dom \sim grp} = \left| \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial L_g}{\partial E(x_i)} \right\|_2 - \sum_{q=1}^{b} \sum_{x_i \in X_s \cup X_t} \left\| \frac{\partial L^{grp}_q}{\partial E(x_i)} \right\|_2 \right|, \tag{11}$$

where the first term denotes the gradient magnitude of the discriminator for the whole domain, and the second term denotes the gradient magnitudes of the discriminators for each group. Note that in the second term, $x_i \in X_s \cup X_t$ still refers to the samples from the $q$-th group, because the sampling probability is included in $L^{grp}_q$.

Note that although Equations (10) and (11) are losses defined on gradients, they involve first-order rather than second-order derivative optimization, which is efficient. This is because the gradients in Equations (10) and (11) are taken with respect to the input features, not with respect to the network parameters.

Stacking all the levels together, the overall 3-layer hierarchical gradient synchronization constraint is naturally obtained as:

$$L_{syn} = \frac{1}{b} \sum_{q=1}^{b} L^{syn}_{grp \sim cls} + L^{syn}_{dom \sim grp}. \tag{12}$$

With this constraint, the directions and magnitudes of gradient descent for both global and local alignment are expected to stay synchronized with each other. As a result, the distributions of the two domains can be aligned more accurately.
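To make the structure of Equations (10) to (12) concrete, the sketch below evaluates the synchronization terms for toy discriminators. Several things here are assumptions for illustration only: each discriminator is a logistic model $D(f) = \sigma(w \cdot f + b)$, so the gradient of the weighted cross-entropy with respect to a feature has the closed form $p_i(\sigma(w \cdot f_i + b) - d_i)\,w$; the difference of the two summed magnitudes is wrapped in an absolute value; and all function names are hypothetical. A practical implementation would obtain these feature gradients from an automatic differentiation framework instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_norm_sum(feats, d_true, weights, disc):
    """Sum over samples of ||d(weights_i * BCE(D(f_i), d_i)) / d f_i||_2
    for a logistic discriminator disc = (w, b)."""
    w, b = disc
    residual = sigmoid(feats @ w + b) - d_true  # dBCE/dlogit per sample
    return float((weights * np.abs(residual)).sum() * np.linalg.norm(w))

def sync_grp_cls(feats, d_true, class_w, group, grp_disc, cls_discs):
    """Eq. (10) for one group: group magnitude vs. summed class magnitudes."""
    grp_w = class_w[:, group].sum(axis=1)
    grp_term = grad_norm_sum(feats, d_true, grp_w, grp_disc)
    cls_term = sum(grad_norm_sum(feats, d_true, class_w[:, k], cls_discs[k])
                   for k in group)
    return abs(grp_term - cls_term)

def sync_dom_grp(feats, d_true, class_w, groups, dom_disc, grp_discs):
    """Eq. (11): domain magnitude vs. summed group magnitudes."""
    dom_term = grad_norm_sum(feats, d_true, np.ones(len(feats)), dom_disc)
    grp_term = sum(
        grad_norm_sum(feats, d_true, class_w[:, g].sum(axis=1), grp_discs[q])
        for q, g in enumerate(groups))
    return abs(dom_term - grp_term)

def hierarchical_sync(feats, d_true, class_w, groups,
                      dom_disc, grp_discs, cls_discs):
    """Eq. (12): mean of the group/class terms plus the domain/group term."""
    grp_cls = np.mean([
        sync_grp_cls(feats, d_true, class_w, g, grp_discs[q], cls_discs)
        for q, g in enumerate(groups)])
    dom_grp = sync_dom_grp(feats, d_true, class_w, groups, dom_disc, grp_discs)
    return float(grp_cls + dom_grp)
```

In this toy setting the loss vanishes when all discriminators coincide and the groups partition the classes, since the class weights within a group add up to the group weight; it becomes positive as soon as the levels disagree in gradient magnitude.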

With the global alignment, local alignment, and gradient synchronization defined in Equations (6), (9), and (12), the overall objective function of the discriminators $D$ is finally formulated as follows:

$$L_d = L_g + L_l + \beta L_{syn}. \tag{13}$$

With the objective in Equation (13), the source and target domains are aligned globally and locally, with consistency between the global and local distribution alignment. As a result, the two domains are well aligned and the discriminative structure is well preserved.

2.4. Overall Objective and Optimization

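The overall objective function is optimized by alternately optimizing $\{E, C\}$ and $D$ following the adversarial learning mechanism, as detailed below.

Given $\{E, C\}$, the adversarial discriminators $D$ are optimized to distinguish the source domain from the target domain by minimizing the domain discrimination loss:

$$\min_{\theta_{D^g},\, \theta_{D^l}} L_d = L_g + L_l + \beta L_{syn}, \tag{14}$$

with the parameters updated as:

$$\theta_{D^g} \leftarrow \theta_{D^g} - \eta \frac{\partial (L_g + \beta L_{syn})}{\partial \theta_{D^g}}, \qquad \theta_{D^l_k} \leftarrow \theta_{D^l_k} - \eta \frac{\partial (L_l + \beta L_{syn})}{\partial \theta_{D^l_k}}, \tag{15}$$

where $\eta$ is the learning rate, and $D^g$ and $D^l_k$ denote the global discriminator and the $k$-th local discriminator, matching the subscripts of $L_g$ and $L_l$.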

Given D, the feature extractor E and classifier C are op-

timized to make the features from E are discriminative and

domain invariant. This is achieved by minimizing the object

classification loss and confusing the adversarial discrimina-

tors by the min-max game as follows:

minθC,θE

(

Lc + βLsyn − (Lg + Ll))

, (16)

with the parameters updated as:

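$$\theta_C \leftarrow \theta_C - \eta\,\frac{\partial L_c}{\partial \theta_C}, \qquad \theta_E \leftarrow \theta_E - \eta\left( \frac{\partial L_c}{\partial \theta_C} \times \frac{\partial \theta_C}{\partial \theta_E} + \beta\,\frac{\partial L_{syn}}{\partial \theta_E} - \frac{\partial (L_g + L_l)}{\partial \theta_D} \times \frac{\partial \theta_D}{\partial \theta_E} \right). \tag{17}$$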
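The sign pattern in Equation (17) is the key point: the feature extractor descends on the classification and synchronization losses but ascends on the discriminator losses. A minimal sketch, assuming all gradients have already been back-propagated to the parameters of E and are given as flat lists with made-up values (`extractor_step` and `classifier_step` are hypothetical names):

```python
from typing import List


def extractor_step(theta_e: List[float], grad_cls: List[float], grad_syn: List[float],
                   grad_adv: List[float], beta: float, eta: float) -> List[float]:
    """theta_E <- theta_E - eta * (dL_c + beta * dL_syn - d(L_g + L_l)).

    The minus sign on grad_adv makes E ascend the discriminator loss,
    i.e. E tries to confuse the discriminators.
    """
    return [t - eta * (gc + beta * gs - ga)
            for t, gc, gs, ga in zip(theta_e, grad_cls, grad_syn, grad_adv)]


def classifier_step(theta_c: List[float], grad_cls: List[float], eta: float) -> List[float]:
    """theta_C <- theta_C - eta * dL_c/dtheta_C (the classifier only sees L_c)."""
    return [t - eta * g for t, g in zip(theta_c, grad_cls)]


theta_e = extractor_step([1.0], grad_cls=[0.5], grad_syn=[0.25], grad_adv=[2.0],
                         beta=2.0, eta=0.5)
theta_c = classifier_step([1.0], grad_cls=[0.5], eta=0.5)
```

With these numbers the adversarial gradient dominates, so the extractor parameter moves in the direction that increases the discriminator loss, while the classifier parameter simply follows the classification gradient.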

3. Experiments

We evaluate the proposed method and other related works on object classification benchmarks for both unsupervised domain adaptation (source and target domains share the same categories) and partial domain adaptation (the categories of the target domain are a subset of those of the source domain); the partial domain adaptation results are given in the supplementary materials. In addition, an ablation study is conducted to analyze the contribution of each part of the proposed method.

3.1. Datasets and Experimental Setting

Three standard benchmarks for unsupervised domain adaptation and one for partial domain adaptation are employed for the evaluation.

Office-31-DA Office-31 [20] is a classical and widely

used benchmark for domain adaptation with 31 categories,

consisting of 3 different domains including Amazon (A)

with 2817 images, Webcam (W) with 795 images, and DSLR

(D) with 498 images. Following the commonly used

protocol defined in [13, 26, 28, 31], all 31 categories from

the three domains are used for evaluation of unsupervised

domain adaptation, forming 6 transfer tasks.

Office-Home [42] Office-Home is another classical

dataset with 65 categories, consisting of 4 different domains

including Artistic images (Ar), Clip Art images (Cl), Prod-

uct images (Pr) and Real-World images (Rw). Following

the commonly utilized protocol defined in [13, 26, 28, 31],

all 65 categories from the four domains are used for evalua-

tion of unsupervised domain adaptation, forming 12 transfer

tasks.
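The task counts above follow from taking every ordered (source, target) pair of distinct domains. A short sketch of this enumeration, where the function name `transfer_tasks` is an illustrative choice:

```python
from itertools import permutations
from typing import List, Sequence


def transfer_tasks(domains: Sequence[str]) -> List[str]:
    """All ordered source->target pairs of distinct domains."""
    return [f"{src}->{tgt}" for src, tgt in permutations(domains, 2)]


office31_tasks = transfer_tasks(["A", "D", "W"])
office_home_tasks = transfer_tasks(["Ar", "Cl", "Pr", "Rw"])

print(len(office31_tasks), office31_tasks)
print(len(office_home_tasks), office_home_tasks)
```

Three domains give 3 × 2 = 6 tasks and four domains give 4 × 3 = 12 tasks, matching the counts stated for Office-31 and Office-Home.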


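Table 1. Ablation study of our GSDA for domain adaptation on Office-31 (ResNet50).

| Glb | Cls | Grp | Grad Sync | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 87.9 | 98.2 | 100.0 | 85.5 | 66.4 | 64.1 | 83.7 |
| ✓ | ✓ | | | 91.7 | 98.4 | 100.0 | 87.1 | 68.9 | 67.2 | 85.6 |
| ✓ | ✓ | ✓ | | 93.1 | 99.0 | 100.0 | 91.4 | 71.5 | 67.0 | 87.0 |
| ✓ | ✓ | ✓ | ✓ | 95.7 | 99.1 | 100.0 | 94.8 | 73.5 | 74.9 | 89.7 |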
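The Avg column of Table 1 allows a quick check of how much each added component contributes. The following sketch only re-uses numbers reported in the table; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Average accuracies (%) copied from the Avg column of Table 1.
avg = {
    "Glb": 83.7,
    "Glb+Cls": 85.6,
    "Glb+Cls+Grp": 87.0,
    "Glb+Cls+Grp+GradSync": 89.7,
}

# Gain of each configuration over the previous one, in percentage points.
names = list(avg)
gains = {
    names[i]: round(avg[names[i]] - avg[names[i - 1]], 1)
    for i in range(1, len(names))
}

# Per-task accuracies of the full model (last row of Table 1).
full_model = [95.7, 99.1, 100.0, 94.8, 73.5, 74.9]
full_avg = round(sum(full_model) / len(full_model), 1)

print(gains)
print(full_avg)
```

The increments are 1.9, 1.4 and 2.7 points for class-wise alignment, group-wise alignment and gradient synchronization, respectively, so gradient synchronization gives the largest single gain in this table.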

VisDA-2017 VisDA-2017 [32] is a more challenging simulation-to-real task with two distinct domains: synthetic object images rendered from 3D models and real object images. It contains 152397 training images and 55388 validation images across 12 classes. Following the training and testing protocol in [34, 25], the model is trained on the labeled training set and the unlabeled validation set, and tested on the validation set for unsupervised domain adaptation.

Office-31-PDA Recently, a new protocol for partial domain adaptation was built on Office-31 [20]. As defined in [2, 3], the same three domains as in standard unsupervised domain adaptation are used, but with different categories for the source and target domains: all 31 categories from the three domains are used as source domains, denoted as A31, D31, and W31, respectively, while the 10 categories shared between Office-31 and Caltech-256 are used as target domains, denoted as A10 (958 images), W10 (295 images) and D10 (157 images), respectively.

Implementation Details For fair comparison, in each setting we use the same network architecture as the compared methods. Specifically, we use ResNet50 as the backbone in all the experiments. In Office-31-DA and Office-Home, the hyper-parameters α in Equation (5) and β in Equation (13) are set to 0.02 and 1.0, respectively. In VisDA-2017 and Office-31-PDA, α and β are set to 0.2 and 10.0, respectively. Classes are divided into 6 groups in Office-31-DA and Office-31-PDA, 13 groups in Office-Home, and 4 groups in VisDA-2017. The sensitivity analysis of these hyper-parameters is presented in the supplementary materials. For stable training of GSDA, categories with fewer samples are augmented by randomly re-sampling images so that all categories have roughly the same number of images, which avoids the data imbalance problem stated in [50]. For the target domain, the class labels are unavailable, so only samples with highly confident pseudo labels are used as training samples.
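The two data-handling steps described above, keeping only confidently pseudo-labeled target samples and re-sampling small categories, can be sketched as follows. This is a sketch under assumptions: the confidence threshold is a hypothetical parameter (no value is stated here), the sketch balances every class exactly to the size of the largest one whereas the text only requires roughly equal sizes, and the function names are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def select_confident(probs: Sequence[Sequence[float]],
                     threshold: float) -> List[Tuple[int, int]]:
    """Return (sample index, pseudo label) pairs whose top class probability >= threshold."""
    selected = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=lambda j: p[j])
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


def balance_by_resampling(labels: Sequence[int], rng: random.Random) -> List[int]:
    """Return sample indices in which small classes are re-sampled with replacement
    until every class is as large as the largest class."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced: List[int] = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


# Target-domain predictions (made-up softmax outputs) and a hypothetical threshold.
target_probs = [[0.9, 0.1], [0.6, 0.4], [0.05, 0.95]]
confident = select_confident(target_probs, threshold=0.8)

# Source-domain labels with an under-represented class 1.
source_labels = [0, 0, 0, 1]
balanced_idx = balance_by_resampling(source_labels, random.Random(0))
```

Here the second target sample is discarded because its top probability is below the threshold, and the single class-1 source sample is repeated until both classes contribute three indices.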

3.2. Ablation Study

The ablation study is conducted on the unsupervised domain adaptation setting (Office-31-DA) to investigate the necessity of each component in GSDA. Briefly, our GSDA consists of three parts: global alignment, local alignment, and the hierarchical gradient synchronization between them. As shown in Table 1, the method with only global domain alignment (Glb) performs worse than that with class-wise alignment added (Cls), showing that the local alignment is

Table 2. Object classification accuracy on Office-31-DA (ResNet50). All methods follow the same settings, so most results are directly from the original works except MCDDA which is tuned using the released codes.

| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|---|
| ResNet50 [16] | 68.4 | 96.7 | 99.3 | 68.9 | 62.5 | 60.7 | 76.1 |
| TCA [29] | 72.7 | 96.7 | 99.6 | 74.1 | 61.7 | 60.9 | 77.6 |
| GFK [13] | 72.8 | 95.0 | 98.2 | 74.5 | 63.4 | 61.0 | 77.5 |
| DAN [24] | 80.5 | 97.1 | 99.6 | 78.6 | 63.6 | 62.8 | 80.4 |
| RTN [28] | 84.5 | 96.8 | 99.4 | 77.5 | 66.2 | 64.8 | 81.6 |
| JAN [27] | 85.4 | 97.4 | 99.8 | 84.7 | 68.6 | 70.0 | 84.3 |
| DANN [10] | 82.0 | 96.9 | 99.1 | 79.7 | 68.2 | 67.4 | 82.2 |
| ADDA [41] | 86.2 | 96.2 | 98.4 | 77.8 | 69.5 | 68.9 | 82.9 |
| MCDDA [34] | 82.6 | 98.9 | 99.8 | 84.3 | 66.2 | 66.3 | 83.0 |
| MADA [31] | 90.0 | 97.4 | 99.6 | 87.8 | 70.3 | 66.4 | 85.2 |
| CDAN [25] | 94.1 | 98.6 | 100.0 | 92.9 | 71.0 | 69.3 | 87.7 |
| SymNets [47] | 90.8 | 98.8 | 100.0 | 93.9 | 74.6 | 72.5 | 88.4 |
| SAFN [43] | 90.3 | 98.7 | 100.0 | 92.1 | 73.4 | 71.2 | 87.6 |
| BSP [4] | 93.3 | 98.2 | 100.0 | 93.0 | 73.6 | 72.6 | 88.5 |
| GSDA (Ours) | 95.7 | 99.1 | 100.0 | 94.8 | 73.5 | 74.9 | 89.7 |

important for keeping the discriminative structure during adaptation. With group-wise alignment (Grp) added, the model improves further because the discriminative structure is captured more elaborately by randomly combining several classes as a group. Furthermore, by considering hierarchical gradient synchronization between global alignment and local alignment, our GSDA (with gradient synchronization denoted as Grad Sync) achieves a significant improvement (average accuracy rising from 87.0 to 89.7 in Table 1), indicating its effectiveness and illustrating the necessity of consistency between global and local distribution alignment. Clearly, our main contributions, i.e., group-wise alignment and hierarchical alignment synchronization, show promising benefits for domain adaptation.
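The idea of randomly combining several classes as a group can be sketched as a random partition of the class indices. This is only an illustration: it assumes disjoint groups of near-equal size, which is one plausible reading, and `random_groups` is a hypothetical helper name.

```python
import random
from typing import List


def random_groups(num_classes: int, num_groups: int, rng: random.Random) -> List[List[int]]:
    """Shuffle the class indices and deal them into num_groups disjoint groups."""
    classes = list(range(num_classes))
    rng.shuffle(classes)
    # Striding over the shuffled list keeps group sizes within one of each other.
    return [sorted(classes[g::num_groups]) for g in range(num_groups)]


rng = random.Random(0)
office31_groups = random_groups(31, 6, rng)      # 31 classes, 6 groups
office_home_groups = random_groups(65, 13, rng)  # 65 classes, 13 groups
visda_groups = random_groups(12, 4, rng)         # 12 classes, 4 groups

print([len(g) for g in office31_groups])
```

With the group counts reported in the implementation details, this partition yields groups of 5 or 6 classes on Office-31, 5 classes on Office-Home, and 3 classes on VisDA-2017.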

3.3. Unsupervised Domain Adaptation

Unsupervised domain adaptation is the most typical set-

ting for domain adaptation, and there are many related

works such as the conventional methods TCA [29] and GFK

[13], the deep adaptation works based on MMD criterion

like DAN [24], RTN [28] and JAN [27], and the adversari-

al learning based approaches including DANN [10], ADDA

[41], MADA [31], CDAN [25] and SymNets [47]. All these

methods are compared with our method on Office-31-DA,

Office-Home and VisDA-2017 introduced in Section 3.1.

The experiment results are shown in Tables 2, 3 and 4.

As can be seen, the baseline without adaptation and the

conventional non-deep methods perform the worst, while


Table 3. Object classification accuracy on Office-Home dataset (ResNet50). All methods follow the same settings, so all the results are directly copied from the original works.

| Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 [16] | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| DAN [24] | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| DANN [10] | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| JAN [27] | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
| CDAN [25] | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| SymNets [47] | 47.7 | 72.9 | 78.5 | 64.2 | 71.3 | 74.2 | 64.2 | 48.8 | 79.5 | 74.5 | 52.6 | 82.7 | 67.6 |
| SAFN [43] | 54.4 | 73.3 | 77.9 | 65.2 | 71.5 | 73.2 | 63.6 | 52.6 | 78.2 | 72.3 | 58.0 | 82.1 | 68.5 |
| GSDA (Ours) | 61.3 | 76.1 | 79.4 | 65.4 | 73.3 | 74.3 | 65.0 | 53.2 | 80.0 | 72.2 | 60.6 | 83.1 | 70.3 |

Table 4. Object classification accuracy on VisDA-2017 task (ResNet50). All methods follow the same settings and encoder architecture except the methods marked with † using ResNet101. The results are directly copied from the original works. The underlined results mean the highest accuracies of the four marked methods with deeper networks or multiple data augmentations (S-En).

| Method | plane | bcycl | bus | car | horse | knife | mcycl | person | plant | sktbrd | train | truck | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 [16] | 70.6 | 51.8 | 55.8 | 68.9 | 77.9 | 7.6 | 93.3 | 34.5 | 81.1 | 27.9 | 88.6 | 5.6 | 55.3 |
| DAN [24] | 61.7 | 54.8 | 77.7 | 32.2 | 75.0 | 80.8 | 78.3 | 46.9 | 66.9 | 34.5 | 79.6 | 29.1 | 59.8 |
| DANN [10] | 75.9 | 70.5 | 65.3 | 17.3 | 72.8 | 38.6 | 58.0 | 77.2 | 72.5 | 40.4 | 70.4 | 44.7 | 58.6 |
| MCDDA† [34] | 87.0 | 60.9 | 83.7 | 64.0 | 88.9 | 79.6 | 84.7 | 76.9 | 88.6 | 40.3 | 83.0 | 25.8 | 71.9 |
| TPN [30] | 93.7 | 85.1 | 69.2 | 81.6 | 93.5 | 61.9 | 89.3 | 81.4 | 93.5 | 81.6 | 84.5 | 49.9 | 80.4 |
| S-En* [8] | 96.3 | 87.9 | 84.7 | 55.7 | 95.9 | 95.2 | 88.6 | 77.4 | 93.3 | 92.8 | 87.5 | 38.2 | 82.8 |
| BSP† [4] | 92.4 | 61.0 | 81.0 | 57.5 | 89.0 | 80.6 | 90.1 | 77.0 | 84.2 | 77.9 | 82.1 | 38.4 | 75.9 |
| SAFN† [43] | 93.6 | 61.3 | 84.1 | 70.6 | 94.1 | 79.0 | 91.8 | 79.6 | 89.9 | 55.6 | 89.0 | 24.4 | 76.1 |
| GSDA (Ours) | 93.1 | 67.8 | 83.1 | 83.4 | 94.7 | 93.4 | 93.4 | 79.5 | 93.0 | 88.8 | 83.4 | 36.7 | 81.5 |

the deep methods with the MMD criterion such as DAN [24] and JAN [27] perform much better, benefiting from the favorable non-linearity of deep networks. Furthermore, the adversarial learning based methods including DANN [10], ADDA [41], MADA [31] and ours generally perform even better than those MMD based deep methods, owing to the more powerful capability of adversarial learning for reducing distribution discrepancy.

Among the adversarial-based methods, DANN [10] and ADDA [41] are early ones that concentrate only on global distribution alignment; they outperform the MMD-based methods, but with limited improvement. MADA [31], CDAN [25] and SymNets [47] further consider class-level alignment, achieving more promising adaptation. However, they do not consider the intrinsic relation between local and global alignment, so some misalignment may still appear. Going a step further, our proposed method GSDA considers not only global and local (i.e., class-wise and group-wise) alignment, but also the hierarchical gradient synchronization relation between them, leading to better adaptation.

Moreover, BSP [4] and SAFN [43] are recently proposed methods that take perspectives different from feature distribution alignment. BSP penalizes the largest singular values of feature eigenvectors to enhance discriminability, and SAFN improves transferability by enlarging the norm of features. Compared with these two methods, our method still achieves the best average accuracy, demonstrating the advantage and necessity of considering the relation between the global and local distribution alignment.
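The comparison with BSP and SAFN can be made concrete from the Avg columns of Tables 2, 3 and 4. The sketch below only re-uses reported numbers; `None` marks a result that is not listed in the corresponding table, and the dictionary layout is an illustrative choice. Note that the BSP and SAFN entries in Table 4 are marked as using ResNet101.

```python
# Avg columns copied from Tables 2, 3 and 4; None marks a result not reported there.
avg = {
    "Office-31": {"GSDA": 89.7, "BSP": 88.5, "SAFN": 87.6},
    "Office-Home": {"GSDA": 70.3, "BSP": None, "SAFN": 68.5},
    "VisDA-2017": {"GSDA": 81.5, "BSP": 75.9, "SAFN": 76.1},
}

# Margin of GSDA over each compared method, in percentage points.
margins = {
    dataset: {
        method: round(scores["GSDA"] - value, 1)
        for method, value in scores.items()
        if method != "GSDA" and value is not None
    }
    for dataset, scores in avg.items()
}

for dataset, result in margins.items():
    print(dataset, result)
```

The margins are 1.2 and 2.1 points over BSP and SAFN on Office-31, 1.8 points over SAFN on Office-Home, and 5.6 and 5.4 points on VisDA-2017.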

4. Conclusion and Future Work

Aiming for better unsupervised domain adaptation, we propose a novel method named GSDA, which aligns the distributions of two different domains both globally and locally, with gradient synchronization between the two levels. The hierarchical gradient synchronization module is established to ensure the consistency between global and local distribution alignment for better structure preservation. Extensive experiments verify the superiority of our method.

The gradient synchronization between global and local domain alignment has achieved promising improvement in this work, which also implies that the relation between global and local distribution alignment deserves deeper analysis and exploration in the future.

ACKNOWLEDGEMENT

This work is partially supported by National Key R&D

Program of China (No. 2017YFA0700800), Natural Sci-

ence Foundation of China (No. 61772496) and UCAS Joint

PhD Training Program.


References

[1] S. Bickel, M. Bruckner, and T. Scheffer. Discriminative

learning under covariate shift. Journal of Machine Learn-

ing Research (JMLR), 10(9):2137–2155, 2009.

[2] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and

Michael I. Jordan. Partial transfer learning with selective ad-

versarial networks. In IEEE/CVF Conference on Computer

Vision and Pattern Recognition (CVPR), 2018.

[3] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin

Wang. Partial adversarial domain adaptation. In European

Conference on Computer Vision (ECCV), 2018.

[4] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin

Wang. Transferability vs. discriminability: Batch spectral

penalization for adversarial domain adaptation. In Interna-

tional Conference on Machine Learning (ICML), 2019.

[5] Z. Ding and Y. Fu. Robust transfer metric learning for im-

age classification. IEEE Transactions on Image Processing

(TIP), 26(2):660–670, 2016.

[6] Zhengming Ding, Sheng Li, Ming Shao, and Yun Fu. Graph adaptive knowledge transfer for unsupervised domain adaptation. In European Conference on Computer Vision (ECCV), 2018.

[7] M. Dudík, R. E. Schapire, and S. J. Phillips. Correcting sample selection bias in maximum entropy density estimation. In Neural Information Processing Systems (NeurIPS), 2005.

[8] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations (ICLR), 2018.

[9] Y. Ganin and V. Lempitsky. Unsupervised domain adap-

tation by backpropagation. In International Conference on

Machine learning (ICML), 2015.

[10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, et al. Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR), 17(1):2096–2030, 2016.

[11] Behnam Gholami, Ognjen Rudovic, and Vladimir Pavlovic.

Punda: Probabilistic unsupervised domain adaptation for

knowledge transfer across visual categories. In IEEE Inter-

national Conference on Computer Vision (ICCV), 2017.

[12] B. Gong, K. Grauman, and F. Sha. Connecting the dots with

landmarks: Discriminatively learning domain-invariant fea-

tures for unsupervised domain adaptation. In International

Conference on Machine learning (ICML), 2013.

[13] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow

kernel for unsupervised domain adaptation. In IEEE Confer-

ence on Computer Vision and Pattern Recognition (CVPR),

2012.

[14] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation

for object recognition: An unsupervised approach. In IEEE

International Conference on Computer Vision (ICCV), 2011.

[15] Yves Grandvalet and Yoshua Bengio. Semi-supervised

learning by entropy minimization. In Neural Information

Processing Systems (NeurIPS), 2005.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning

for image recognition. In IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), 2016.

[17] Cheng An Hou, Yao Hung Hubert Tsai, Yi Ren Yeh, and

Yu Chiang Frank Wang. Unsupervised domain adaptation

with label and structural consistency. IEEE Transactions on

Image Processing (TIP), 25(12):5552–5562, 2016.

[18] Lanqing Hu, Meina Kan, Shiguang Shan, and Xilin Chen.

Duplex generative adversarial network for unsupervised do-

main adaptation. In IEEE/CVF Conference on Computer Vi-

sion and Pattern Recognition (CVPR), 2018.

[19] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, B. Schölkopf, et al. Correcting sample selection bias by unlabeled data. In Neural Information Processing Systems (NeurIPS), 2007.

[20] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.

[21] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Haupt-

mann. Contrastive adaptation network for unsupervised do-

main adaptation. In IEEE/CVF Conference on Computer Vi-

sion and Pattern Recognition (CVPR), 2019.

[22] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-

image translation networks. In Neural Information Process-

ing Systems (NeurIPS), 2017.

[23] M. Liu and O. Tuzel. Coupled generative adversarial networks. In Neural Information Processing Systems (NeurIPS), 2016.

[24] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning trans-

ferable features with deep adaptation networks. In Interna-

tional Conference on Machine learning (ICML), 2015.

[25] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and

Michael I Jordan. Conditional adversarial domain adapta-

tion. In Neural Information Processing Systems (NeurIPS),

2018.

[26] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang

Sun, and Philip S. Yu. Transfer feature learning with joint

distribution adaptation. In IEEE International Conference

on Computer Vision (ICCV), 2014.

[27] Mingsheng Long, Jianmin Wang, and Michael I. Jordan.

Deep transfer learning with joint adaptation networks. In In-

ternational Conference on Machine learning (ICML), 2017.

[28] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised

domain adaptation with residual transfer networks. In Neural

Information Processing Systems (NeurIPS), 2016.

[29] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain

adaptation via transfer component analysis. IEEE Transac-

tions on Neural Networks (TNN), 22(2):199–210, 2010.

[30] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah

Ngo, and Tao Mei. Transferrable prototypical networks for

unsupervised domain adaptation. In IEEE/CVF Conference

on Computer Vision and Pattern Recognition (CVPR), 2019.

[31] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial

domain adaptation. In AAAI Conference on Artificial Intelli-

gence (AAAI), 2018.

[32] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman,

Dequan Wang, and Kate Saenko. Visda: The visual domain

adaptation challenge. CoRR, abs/1710.06924, 2017.

[33] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training

for unsupervised domain adaptation. In International Con-

ference on Machine learning (ICML), 2017.


[34] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tat-

suya Harada. Maximum classifier discrepancy for unsuper-

vised domain adaptation. In IEEE/CVF Conference on Com-

puter Vision and Pattern Recognition (CVPR), 2018.

[35] Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo,

and Rama Chellappa. Generate to adapt: Aligning domains

using generative adversarial networks. In IEEE Conference

on Computer Vision and Pattern Recognition (CVPR), 2017.

[36] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning

transferrable representations for unsupervised domain adap-

tation. In Neural Information Processing Systems (NeurIPS),

2016.

[37] M. Shao, C. Castillo, Z. Gu, and Y. Fu. Low-rank trans-

fer subspace learning. In International Conference on Data

Mining (ICDM), 2012.

[38] M. Shao, D. Kit, and Y. Fu. Generalized transfer subspace

learning through low-rank constraint. International Journal

of Computer Vision (IJCV), 109(1-2):74–93, 2014.

[39] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research (JMLR), 8(5):985–1005, 2007.

[40] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-

domain image generation. 2017.

[41] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial

discriminative domain adaptation. In IEEE Conference on

Computer Vision and Pattern Recognition (CVPR), 2017.

[42] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty,

and Sethuraman Panchanathan. Deep hashing network for

unsupervised domain adaptation. In IEEE Conference on

Computer Vision and Pattern Recognition (CVPR), 2017.

[43] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Larger

norm more transferable: An adaptive feature norm approach

for unsupervised domain adaptation. In IEEE/CVF Interna-

tional Conference on Computer Vision (ICCV), 2019.

[44] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[45] B. Zadrozny. Learning and evaluating classifiers under sam-

ple selection bias. In International Conference on Machine

learning (ICML), 2004.

[46] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geo-

metrical and statistical alignment for visual domain adapta-

tion. In IEEE Conference on Computer Vision and Pattern

Recognition (CVPR), 2017.

[47] Yabin Zhang, Hui Tang, Kui Jia, and Mingkui Tan. Domain-

symmetric networks for adversarial domain adaptation. In

IEEE/CVF Conference on Computer Vision and Pattern

Recognition (CVPR), 2019.

[48] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel Hilbert spaces for domain adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[49] Jun Yan Zhu, Taesung Park, Phillip Isola, and Alexei A.

Efros. Unpaired image-to-image translation using cycle-

consistent adversarial networks. In IEEE Conference on

Computer Vision and Pattern Recognition (CVPR), 2017.

[50] Yang Zou, Zhiding Yu, B.V.K. Vijaya Kumar, and Jinsong

Wang. Unsupervised domain adaptation for semantic seg-

mentation via class-balanced self-training. In European Con-

ference on Computer Vision (ECCV), 2018.
