Domain Agnostic Learning for Unbiased [email protected],[email protected],...

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1

Domain Agnostic Learning for UnbiasedAuthentication

Jian Liang, Yuren Cao, Shuang Li, Bing Bai, Hao Li, Fei Wang, Kun Bai

Abstract—Authentication is the task of confirming the matching relationship between a data instance and a given identity. Typicalexamples of authentication problems include face recognition and person re-identification. Data-driven authentication could be affectedby undesired biases, i.e., the models are often trained in one domain (e.g., for people wearing spring outfits) while applied in otherdomains (e.g., they change the clothes to summer outfits). Previous works have made efforts to eliminate domain-difference. Theytypically assume domain annotations are provided, and all the domains share classes. However, for authentication, there could be alarge number of domains shared by different identities/classes, and it is impossible to annotate these domains exhaustively. It couldmake domain-difference challenging to model and eliminate. In this paper, we propose a domain-agnostic method that eliminatesdomain-difference without domain labels. We alternately perform latent domain discovery and domain-difference elimination until ourmodel no longer detects domain-difference. In our approach, the latent domains are discovered by learning the heterogeneouspredictive relationships between inputs and outputs. Then domain-difference is eliminated in both class-dependent andclass-independent spaces to improve robustness of elimination. We further extend our method to a meta-learning framework to pursuemore thorough domain-difference elimination. Comprehensive empirical evaluation results are provided to demonstrate theeffectiveness and superiority of our proposed method.

Index Terms—Domain Agnostic Learning, Unbiased Authentication, Generalized Cross-latent-domain Recognition, PredictiveRelationships, Meta Learning.

F

1 INTRODUCTION

AUTHENTICATION is the problem of confirming whetherthe data instances match personal identities. There

is a variety of authentication applications including facerecognition [76], fingerprint verification [74] and person re-identification [4, 77]. However, the data-driven authentica-tion process often suffers from undesired biases. In particu-lar, the verification model is usually trained in one domainand tested and verified in other domains, which couldcause inconsistent prediction results due to domain dif-ference/shift. For example, for person re-identification [4],the prediction could be compromised due to the seasonaloutfits changing or the angle variation between a camera

• J. Liang is with AI for international Department, Alibaba Group, Beijing,100102, China.E-mail: [email protected].

• Y. Cao is with the Cloud and Smart Industries Group, Tencent, Beijing,100089, China.E-mail: [email protected].

• S. Li is with the school of Computer Science and Technology, BeijingInstitute of Technology, Beijing, 100081, China.E-mail: [email protected].

• B. Bai is with the Cloud and Smart Industries Group, Tencent, Beijing,100089, China.E-mail: [email protected].

• H. Li is with the Cloud and Smart Industries Group, Tencent, Beijing,100089, China.E-mail: [email protected].

• F. Wang is with the Department of Population Health Sciences, WeillCornell Medical College, New York, NY, 10065, USA.E-mail: [email protected].

• K. Bai is with the Cloud and Smart Industries Group, Tencent,Guangzhou, 510630, China.E-mail: [email protected].

Manuscript received April 19, 2005; revised August 26, 2015.

Table 1An example of the assumptions of our proposed GCLDR problem.

Each class set includes its unique classes. “Train”/“Test” denotes thedata for training/testing. “Latent” suggests domain labels are absent.

Class Set 1 Class Set 2 Class Set 3

Latent Domain 1 Train Test TestLatent Domain 2 Test Train TestLatent Domain 3 Test Test Train

and a pedestrian.1 Domain difference/shift can take manyforms, including covariate shift (distribution difference inp(x), where x denotes the feature) [61], target/prior prob-ability shift (difference in p(y), where y denotes the outputtarget) [75, 62], conditional shift (difference in p(y | x)) [75],and joint shift (difference in p(y,x)) [48]. Domain transfermethods can be categorized into two types [14, 19]: 1)symmetric methods that unify multiple domains into onecommon space; 2) asymmetric methods that map data fromone domain to another. To understand how we can alleviatethe aforementioned problem, we study the learning task forunbiased authentication. Specifically, we treat authentica-tion as a recognition problem so that each identity corre-sponds to a class. For model efficiency, this paper focuseson the symmetric methods, eliminating domain-differenceto unify domains.

The existing research on domain generalization [73, 26,31, 22, 49, 32, 35, 34] or multi-domain adaptation [63, 16, 47,16, 17] typically aims at learning domain-transfer from mul-

1. The seasonal outfits include four domains: spring, summer,autumn, and winter. The outfit and the shooting angle can be re-garded as two types of domain-difference.

arX

iv:2

010.

0525

0v2

[st

at.M

L]

23

Nov

202

0


(a) Training

(b) Testing

Figure 1. An experiment setting of C-MNIST with the background coloras the domain-difference. Best viewed in color.

tiple training domains. One limitation of these approaches isthat they assume domain labels are available. However, inreal applications, it is labor-intensive and time-consumingto provide annotations of all domains, especially whenthe number of domains is massive. Therefore, researchersrecently propose to detect the latent domains whose labelsare absent [25, 33, 24, 45, 71, 50, 46]. These methods aretypically purely based on features. However, the originalreason why domain generalization/adaptation is essentialis that the learned predictive relationship between featuresand targets, modeled by p(y | x), on training data mightchange on testing data. Therefore, the key to understanddomain difference is why the predictive relationship p(y | x)is different in different domains. Consequently, to preciselylearn and unify latent domains, we should exploit the het-erogeneity in p(y | x), in addition to the heterogeneity infeatures, as in most of the existing research.

Another limitation of existing approaches is that theyrequire the classes for recognition to be shared across all thedomains. However, since the classes are the identities, it isimpossible to collect all domain data for every individualin the authentication task. The real scenario is that onlythe data from one domain are collected for each subset ofindividuals, and this phenomenon is characterized in thegeneralized cross-domain recognition (GCDR) problem [37],although they assume that domain labels are provided. Inthis paper, we propose to study the problem where domainlabels are absent, which is referred to as a generalized cross-latent-domain recognition (GCLDR) problem. We present atoy example with only one type of latent domain differencein Table 1. Unlike the setups for standard domain gener-alization or adaptation, for training data in our example,different latent domains do not share any classes. For acolored-digit-recognition example shown in Fig. 1, in thetraining data set, one can only observe images of digits0 ∼ 4 with a green background, and digits 5 ∼ 9 with apink background, while the images of 5 ∼ 9 with the greenbackground are not observed. The background colors are thelatent domains and do not share any digit in the training set.Consequently, for any testing data sample, a trained modelcould generate wrong prediction due to the misleading of itsdomain information. Thus, we should transfer knowledgeacross domains, then the recognition model trained on otherlatent domains can be used.

To address the above issues, we propose a novel domain-agnostic method to learn and unify latent domains to tacklethe GCLDR problem. Specifically, we propose a latent-domain discovery (LDD) module to capture the heteroge-neous predictive relationships between features and targets

from different latent domains, where each predictive rela-tionship comes from one latent domain. The LDD modelalso includes a domain-discrimination component, whichdiscriminates latent domains based on features. The pos-terior distribution of latent domains given features andtargets naturally integrates the recognition and domain-discrimination components via the Bayes rule such that p(z |y,x) ∝ p(y | x)p(z | x), where z denotes the latent domain.Thus, leveraging the predictive relationships p(y | x)s, wediscover latent domains by the joint distribution p(y,x).By forcing the posterior probability of every latent domainto be equal, we can eliminate domain difference by thejoint distribution p(y,x). We alternately perform the latent-domain discovery and unification processes. Therefore, ev-ery possible type of domain difference (e.g., season, andshooting angle) can be learned and eliminated successively.As a consequence, the number of latent domains, which isa hyper-parameter, can be robustly fixed to be two: as longas separated latent domains exist, they can be organizedinto two different groups. On the other hand, inspired byLiang et al. [37], we propose to eliminate latent domain-difference in both class-dependent and class-independentspaces in two branches of our network, respectively. Thisarchitecture can be more robust for domain-difference elim-ination. Finally, inspired by Li et al. [32], we provide a meta-learning extension of our method to encourage commonalityamong latent domains by splitting latent domains into meta-train domains and meta-test domains and optimizing viaa learn-to-learn procedure: learning how to minimize thelosses on the meta-train domains to minimize the losseson the meta-test domains [32]. The experimental resultson benchmark and real-world data sets demonstrate theeffectiveness and superiority of our method. We also con-duct ablation experiments to show the contribution of eachcomponent of our proposed framework.

Our contributions are highlighted as follows.

• We propose to address a generalized cross-latent-domain recognition (GCLDR) problem, where do-main labels are absent, and domains do not shareclasses. This problem is challenging and common inreal applications of authentication, e.g., face recogni-tion authentication.

• We provide valuable insights that 1) alternatelyperforming latent-domain discovery and domain-difference elimination can boost generalization tounseen 〈class, domain〉 combinations; 2) both proce-dures should be based on predictive relationships be-tween features and targets via posterior probabilities,which is novel compared with existing domain adap-tation/generalization methods; and 3) eliminatingdomain-difference in both class-dependent and class-independent spaces can boost robustness of domain-difference elimination.

• Our method achieves significant improvements com-pared with baselines on one benchmark and twoauthentication datasets, and even rivals the state-of-the-art methods using extra domain labels on bothauthentication datasets.

The remainder of this paper is organized as follows.Section 2 discusses related works. The proposed methods


are presented in Section 3. Section 4 presents an empiricalevaluation of the proposed approaches. Section 5 discusseswhy related works failed in our setting, and Section 6provides conclusions.

2 RELATED WORKS

Domain Generalization/Adaption Domain generalizationapproaches [73, 26, 31, 22, 49, 32, 35, 34, 60] typically trainmodels on single/multi-domain data with shared classes forrecognition on an unseen domain, while domain adaptationapproaches [52, 21, 6, 42, 12, 64, 68, 43, 11, 69, 13, 9, 30]typically train models on source domains and recognize ontarget domains which share classes with source domainswithout class labels. Domain generalization and adaptationboth typically assume that on training data, classes areshared across domains and domain labels are provided,which do not hold in our GCLDR problem.Domain Agnostic Learning Recently, several domain-agnostic learning approaches [53, 10, 41, 56] emerge totypically handle the domain-adaptation problem where thetarget domain may contain several sub-domains withoutdomain labels [53]. DADA [53] and OCDA [41] proposenovel mechanisms and achieve effective domain-adaptationperformances, but do not discover latent domains in thetarget domain and exploit the information. By contrast,BTDA [10] clusters raw and deep features to discoverlatent domains. However, its latent-domain discovery isonly feature-based and does not exploit the heterogeneouspredictive relationships of p(y | x). DANL [56] learnsa normalization layer, but may be limited for more so-phisticated domain-difference [53]. Except for BTDA, thesemethods work on the extra knowledge of domain labels(source/target).Latent Domain Discovery Existing explicit latent-domaindiscovery approaches [25, 33, 24, 45, 71, 50, 46] typicallybuild special models to learn latent domains explicitlybased on features only. As an exception, Xiong et al. [71]propose to learn latent domains via a conditional distri-bution p(z | y,x), which is not based on the predictiverelationship p(y | x), but based on linear addition of deepfeatures of y and x. When latent domains are discovered,a main-stream of these approaches does not perform do-main transfer. However, as explained in introduction, itis not appropriate for our GCLDR problem. In contrast,mDA [45] (an improved version of DANL [56]) and its im-proved version CmDA [46] unify domains by normalizingthe hidden space for each domain to have zero mean andunit standard deviation, which may be ineffective for moresophisticated domain-difference. In addition to the aboveexplicit discovery methods, there are also several methodsthat learn latent domains implicitly, including ML-VAE [5],MCD [57] and MCD-SWD [28], which may encounter sub-optimal solutions due to not explicitly modeling multiplelatent domains thus may ignore fine-grained information.Meta Learning Meta-learning’s idea is learning tolearn [65, 59], which recently gains great popularity [18,58, 67, 31, 55, 2]. Recently, a few meta-learning domaingeneralization approaches [32, 3, 36] have been proposed,which learn a set of domains to learn another set of domains.However, these methods require domain labels for training.

Nonetheless, inspired by these methods, we propose a meta-learning framework when latent domains are discovered.Self-Supervised Learning Self-supervised learning typi-cally formulates a auxiliary learning task [27, 23, 51, 8, 29, 23]to improve supervised-learning without class labels, andrecently is found to be effective for generalization [38] ordomain generalization/adaptation [7, 72]. However, it maysuffer sub-optimal solutions in our GCLDR problem due tonot explicitly modeling or unifying latent domains.

3 METHODOLOGY

This section lays out the details of our proposed networkby first defining notations and problem settings. Considera data set D = {(xi, yi)}ni=1 consisting of n independentsamples. For the ith sample, xi ∈ Rd is a feature vector withd dimensions, and yi ∈ Z+ is a categorical class label of therecognition task. The data set contains no domain labels. Inother words, the setting of our proposed GCLDR problemextends the GCDR problem [37] such that no domain labelsare given. Throughout the paper, we denote [k] as the indexset {1, 2, . . . , k}.

3.1 Heterogeneous Predictive Relationships Discoveryand UnificationOur LDD module discovers multiple predictive relation-ships for p(y | x). The discovery process utilizes the hiddenfeatures of a deep neural network. Here, we assume thatwe determine to discover k ∈ Z+ latent domains. Note thatk = 2 is adequate, since we will discover and unify the restof the latent domains successively. Given a hidden featurevector f i for the ith data sample xi (i ∈ [n]), we build klocal recognition networks R1

l , . . . , Rkl to learn k conditional

distributions of p(yi | f i). Each conditional distributioncorresponds to a subset of samples, and is denoted byp(yi | f , Rrl ) which follows a categorical distribution:

p(yi | f i, Rrl ) =c∏j=1

p(yi = j | f i, Rrl )I(yi=j), r ∈ [k], (1)

where c denotes the number of classes. We further builda domain-discrimination network D to discriminate whichdomain does f i belong to. Then D aims to learn p(zi = r |f , D) for all r ∈ [k], where zi denotes the latent domainfor current (yi, f i). Then via the Bayes rule, the posteriorprobability of (yi, f i) belongs to the rth domain is

ρi,r = p(zi = r | yi, f i, {Rrl }kr=1, D)

=p(zi = r | f i, D)

∏cj=1 p(y

i = j | f i, Rrl )I(yi=j)∑k

r′=1 p(zi = r′ | f i, D)

∏cj=1 p(y

i = j | f i, Rr′l )I(yi=j).

(2)

The detailed derivation is in Appendix A. We can observe inEq. (2) that, since f is based on x, the posterior probabilitymodels p(z | y,x) which discover latent domains basedon the joint distribution p(y,x). Therefore, given class-labelinformation, the posterior probability can provide moreaccurate domain-discrimination than the feature-based dis-criminative probability p(zi = r | f i, D) which modelsp(z | x) using the information of p(x) only.

We aim to provide an end-to-end optimization schemeso that we discover latent domains on each mini-batch of


Figure 2. The architecture of our framework. P,Gcd, Gci are feature ex-tractors. Rg,cd learns class-dependent features fcds, while Rg,ci learnsclass-independent features fcis. The LDD modules independently learnlatent domains in each space. The posteriors of LDD are forced to beequal across domains to eliminate domain-difference.

data samples in a common mini-batch based optimizationprocedure. Given the posterior probabilities {ρi,r} — softselection of domains, we optimize:

`d =− 1

b

b∑i=1

k∑r=1

ρi,rc∑j=1

I(yi = j) log p(yi = j | f i, Rrl )

− 1

b

b∑i=1

k∑r=1

ρi,r log p(zi = r | f i, D),

(3)

where b denotes the batch size, which is resulted from theExpectation-Maximization derivation (see Appendix A fordetails).

To unify latent domains, we propose to force the pos-terior probabilities to be equal across domains to eliminatedomain-difference by the joint distribution p(y,x):

`e =1

b

b∑i=1

k∑r=1

(p(zi = r | yi, f i, {Rrl }kr=1, D)− 1/k)2. (4)

Alternately computing the posteriors by Eq. (2) andminimizing losses in Eq. (3) and (4), we can discover andthen unify all the latent domains successively, until ourmodel no longer detects domain-difference.

3.2 Double-Space Domain-Difference Elimination

Based on the latent-domain discovery and unification mod-ule introduced in Section 3.1, we propose to eliminatedomain-difference both in a class-dependent space (whereclasses can be recognized) and a class-independent space(where classes cannot be recognized).

We first introduce our model structure to learn hiddenfeatures in the above two spaces. As shown in Fig. 2, an

(a) Raw input images as xs

(b) Generated hidden feature maps as fcds

Figure 3. Examples of fcds on C-MNIST resulted from our method. Bestviewed in color.

input sample xi is transformed by a mapping network Pinto a hidden feature vector fc, which is further transformedby two feature-extraction networks Gcd, Gci to obtain aclass-dependent feature vector f icd and a class-independentfeature vector f ici, respectively. We let f icd be class-dependentby using a global recognition network Rg,cd to recognize theclass from f icd by minimizing:

Lcd = `c({f icd}bi=1, Rg,cd),

`c = −1

b

b∑i=1

c∑j=1

I(yi = j) log p(yi = j | f i· , Rg,·).(5)

For f ici to be class-independent, we learn by an adversariallearning process: we first learn a global recognition networkRg,ci to recognize the class from f ici by minimizing:

Lci = `c({f ici}bi=1, Rg,ci), (6)

then learn the feature-extraction network Gci to eliminateclass-dependence features by minimizing:

Lac =1

bc

b∑i=1

c∑j=1

(p(yi = j | f ici, Rrg,ci)− p(y = j))2, (7)

where p(y = j) denotes the frequency of class j on thetraining data.

The domain-difference elimination is also realized viaadversarial learning. According to Section 3.1, we discoverlatent domains in both class-specific and class-independencespaces by minimizing

Ld = `d({f icd}bi=1, {Rrl,cd}kr=1, Dcd)+`d({f ici}bi=1, {Rrl,ci}kr=1, Dci),(8)


Algorithm 1 Learning Algorithm for GCLDR

Input: Data set D = {(xi, yi)}ni=1, where ∀i ∈ [n], xi ∈ Rdand yi ∈ [c]. Number of latent domains k ∈ Z+. Batch sizeb ∈ Z+.

Output: Recognition model: Rg,cd(Gcd(P (·))).1: while not converge do2: Sample a mini-batch {xi, yi}bi=1.3: Forward the min-batch to obtain f icd = Gcd(P (xi)) and

f ici = Gci(P (xi)) for all i ∈ [b].4: Compute posteriors based on {f icd}bi=1 and {f ici}bi=1 by

Eq. (2), respectively.5: Optimize recognition and discrimination networks, i.e.,

Rg,cd, Rg,ci, {Rrl,cd}kr=1, {Rrl,ci}kr=1, Dcd, Dci:

min Lcd + Lci + Ld, (10)

where Lcd,Lci,Ld,ci are defined in Eq. (5), (6), and (8),respectively.

6: Optimize mapping and feature-extraction networks, i.e.,P,Gcd, Gci.

min Lcd + Lac + Lu, (11)

where Lac,Lu are defined in Eq. (7) and (9), respectively.7: end while

where `d is defined in Eq. (3), D·s are domain-discrimination networks, and {Rrl,·}kr=1s are groups of localrecognition networks. Then we eliminate domain-differencein both class-dependence and class-independence spaces.We fix the LDD modules and learn P,Gcd, Gci by minimiz-ing:

Lu = `e({f icd}bi=1, {Rrl,cd}kr=1, Dcd) + `e({f ici}bi=1, {Rrl,ci}kr=1, Dci),(9)

where Le is defined in Eq. (4).We summarize our latent-domain discovery and unifica-

tion in double spaces in Algorithm 1. For inference, we stackP,Gcd and Rg,cd to predict the class label yi for each samplexi.

3.3 Meta-Learning Paradigm Extension

Recall that in Eq. (4), when latent domains are discovered,class predictions between different domains are equalized toeliminate domain-differences on p(y | x). Inspired by Li etal. [32], we propose an alternative approach to encourageconsistency of p(y | x) across latent domains by a meta-learning mechanism: we split latent domains into meta-traindomains and meta-test domains, and learn how to minimizethe losses on the meta-train domains in order to minimizethe losses on the meta-test domains [32].

Specifically, for the loss of each domain, we use the localrecognition loss soft-selected by the posteriors:

`s(r) = −1

b

b∑i=1

ρi,rc∑j=1

I(yi = j) log p(yi = j | f i, Rrl ), (12)

where ρi,r is the posterior defined in Eq. (2). The loss inEq. (12) can be based on either LDD modules in class-dependent or class-independent spaces. So we merge thelosses for both spaces:

Ls(r) =`s(r; {f icd}bi=1, {Rrl,cd}kr=1, Dcd)

+ `s(r; {f ici}bi=1, {Rrl,ci}kr=1, Dci).(13)

Algorithm 2 Meta-Learning for Conditional UnificationInput: θ that collects the model-parameters for P,Gcd, Gci only.

Hyper-parameters γ, α ≥ 0.Output: Updated model parameters θ.

1: Randomly split latent domains to two sets: S1 ∪ S2 =[k],S1 ∩ S2 = ∅.

2: Meta-train 1: ∇1 = 1|S1|

∑r∈S1 ∇θLs(r), where Ls(r) is

defined in Eq. (13).3: Meta-train 2: ∇2 = 1

|S2|∑r∈S2 ∇θLs(r).

4: Learn class-dependent and class-independent feature-extraction, and eliminate domain-differences:

minθLcd + Lac + Lu + Lmeta,

Lmeta =γ

2

[1

|S1|∑r∈S1

Ls(r; θ − α∇2)

+1

|S2|∑r∈S2

Ls(r; θ − α∇1)

].

(14)

Then Step 6 in Algorithm 1 can be replaced by Algorithm 2.

We follow Li et al. [32] to analyze our Algorithm 2 viafirst order Taylor expansion, and obtain similar results:

Observation 1. The meta learning loss Lmeta in Eq. (14) ofAlgorithm 2 can be approximated as:

Lmeta ≈γ

2

[1

|S1|∑r∈S1

Ls(r)+1

|S2|∑r∈S2

Ls(r)]−γα∇T

1∇2. (15)

Proof. The proof is in Appendix B.

Observation 1 suggests that our meta-learning basedunification encourages the concordance of gradients be-tween different domains. Since the gradient of a deep neuralnetwork contains the information of all the related layers, itis possible to obtain superior transfer performance, compar-ing with Step 6 of Algorithm 1 which only encourages theclass prediction to be consistent across domains.

4 EXPERIMENTS

In this section, we evaluate our proposed framework. Bothsynthetic and real-world data sets are used for extensiveevaluations. Our implementation uses Keras with Tensor-flow [1] backends.

We evaluate in the setting proposed by Liang et al. [37]:training domains do not share classes, and testing combina-tions of 〈class, domain〉 are different from those of trainingdata. We conduct comprehensive evaluations on three datasets used by Liang et al. [37]: (1) the C-MNIST data set [44]with 10 classes and the background color as the domain-difference, (2) the re-organized CelebA data set [40] with211 classes and whether wearing eyeglasses as the domain-difference, and (3) the authentication data set based onmobile sensors developed by Liang et al. [37] with 29 classesand the OS types as the domain-difference. The detailed re-organization process for each data set has been describedby Liang et al. [37] and will be introduced in the followingsections. Domain labels are not used. 10% data of the testingset are randomly selected for validation. For each method oneach data set, we repeat 20 times and report the averagedresults.


Methods for Comparison We compare our method inAlgorithm 1 (Ours) and our method replacing the Step 6 ofAlgorithm 1 with Algorithm 2 (Ours-Meta) with the state-of-the-art baselines by firstly comparing the direct learningstrategy (Direct) that stacks P,Gcd, and Rg,cd only. Thenwe compare the methods that use extra domain labelsto evaluate to what extent the provided domain labelscan help to boost the transfer performances. These meth-ods include ABS-Net [44], ELEGANT [70], RevGrad [20],CDRD [39], SE-GZSL [66], and AAL-UA [37]. We presenttheir results reported by Liang et al. [37]. For the methodsthat do not use extra domain labels for training, we comparethree methods that implicitly discover and unify latentdomains: MCD [57], MCD-SWD [28], and ML-VAE [5];and two methods that explicitly perform latent-domaindiscovery-and-unification: mDA [45] and CmDA [46]. Forthe domain-agnostic domain-adaptation methods, we com-pare DADA [53] and BTDA [10] only, because that mDA canbe regarded as an improved version of DANL [56] for ourproblem, and that OCDA [41] is elaborate, whose code hasnot been released yet; for each data set, we treat the trainingdata as both the required source and target domains forthese methods to train. For image data sets, we involve theself-supervised methods JiGen [7], Rot [72] and MAXL [38].We build the base modules (e.g., feature extractors, andclassifiers) with the same structure as ours to conduct fairexperiments with the hyper-parameters optimized on thedata sets and settings in this paper.Evaluation Metrics We follow the settings of Liang etal. [37] to evaluate prediction performances for both multi-label and multi-class types of recognition. For the multi-label type, we use average AUC (aAUC) which is defined asthe average of the area under the ROC curve for every class,the average false acceptance rate (aFAR), and the averagefalse rejection rate (aFRR). We report aAUC and averagebalanced false rate (aBFR = (aFAR + aFRR)/2) as balancedscores since the negative samples dominate for each class.For the multi-class type, we report top-1 accuracy (ACC@1).Implementation Details We constrain our model-capacityto be the same as Liang et al. [37] to acquire fair comparisons.For all experiments,Gcd andGci are built by a single hiddenlayer with hyperbolic-tangent activation function, respec-tively. Rg,cd, Rg,ci, {Rrl,cd}kr=1, {Rrl,ci}kr=1, Dcd and Dci arebuilt by generalized linear layers, and using softmax as theactivation function. We build input mapping networks P swith the same structures as those designed by Liang etal. [37] to acquire fair comparisons. For image data sets,a simple Convolutional Neural Network (CNN) shown inTable 2 is built as the network P . A dropout layer with drop-rate of 0.25 is used after the max-pooling layer. The outputof P is flattened for subsequent full-connected layers. For

Table 2The CNN model used as P for image data sets.

Layer # Filters Kernel Size Stride # Padding Activation

Convolution 32 3 3 0 ReLuConvolution 32 3 3 0 ReLuMax-pooling - 2 2 0 -

vector-based data sets, P is built by a fully-connected neural

network as shown in Table 3. A dropout layer with drop-rateof 0.5 is used after the max-pooling layer. The output of P isflattened for subsequent full-connected layers. We set k = 2

Table 3The fully-connected model used as P for vector-based data sets.

Layer # Filters

Fully-Connected 512Batch-Normalization -Swish Activation [54] -

as discussed in the introduction. For the C-MNIST and theMobile data set, the batch size is set to b = 512, while for theCelebA data set, b = 128. We approximate the meta learningloss Lmeta by Eq. (15), according to Observation 1, and setγ = 0.01, α = 1.

4.1 Handwritten Digital ExperimentsThe C-MNIST data set is originally built by Lu et al. [44]. Itconsists of 70k colored RGB digital images with resolutionof 28 × 28 (60k for training and 10k for testing). It is builtfrom the original gray images of MNIST by adding 10background colors (b-colors) and other 10 foreground colors(f-colors), resulting in 1k possible combinations (10 digits× 10 b-colors × 10 f-colors). Examples from C-MNIST areshown in Fig. 6 of Lu et al. [44]. Liang et al. [37] re-constructthe C-MNIST data set for their GCDR problem. As shownin Fig. 1, training digits 0 ∼ 4 have a green b-color, while5 ∼ 9 have a pink b-color. On the contrary, testing digits0 ∼ 4 have a pink b-color, while 5 ∼ 9 have a green b-color.Other data are dropped. The re-construction results in 5970training instances and 1003 testing instances in total.

Table 4 summarizes the performance comparisons onC-MNIST. The results clearly show that our methods sig-nificantly outperform the direct learning method, whichproves the effectiveness of our methods. Furthermore, ourmethods significantly outperform the baseline methods thatdo not use domain labels for training, which shows thesuperiority of our proposed methods. Moreover, our meth-ods even significantly outperform the majority of the base-line methods that use domain labels for training, exceptfor the AAL-UA and the SE-GZSL methods. Consideringthese methods use additional domain labels, which rendersit much easier for cross-domain recognition, these resultsprovide sound evidence for the effectiveness and superiorityof our methods. Fig. 3 shows some examples of generatedhidden feature maps, in which the background differenceis eliminated. Our meta-learning extension obtains compa-rable performances, compared with our best results, whichdemonstrates that our meta-learning framework is promis-ing to handle our GCLDR problem. On the other hand,the MCD based methods show severe negative transfer. Weconjecture that the training objective of MCD is not suitablefor our GCLDR problem. Detailed discussion will be presentin Section 5.

4.2 Face RecognitionWe use the aligned, cropped and scaled version of theCelebA data set [40] with image-size of 64 × 64. Liang et


Table 4Performances (%) comparison on the C-MNIST data set. “∗” denotes

the methods that use extra domain labels for training.

Methods aAUC aBFR ACC@1

Direct 78.67 26.32 20.88∗ABS-Net [44] 77.69 27.41 15.92∗ELEGANT [70] 79.94 24.61 10.68∗RevGrad [20] 80.71 24.45 21.68∗CDRD [39] 84.83 35.79 33.49∗AAL-UA [37] 98.42 6.14 84.27∗SE-GZSL [66] 99.79 2.72 94.83MCD [57] 49.90 50.09 10.89MCD-SWD [28] 50.12 49.89 10.69ML-VAE [5] 77.26 28.06 18.73DADA [53] 83.90 22.29 15.83BTDA [10] 85.14 20.82 26.43JiGen [7] 82.44 24.44 33.33Rot [72] 75.52 36.41 18.42MAXL [38] 78.87 25.28 21.31mDA [45] 83.34 21.98 24.26CmDA [46] 86.52 21.00 43.21Ours 93.46 13.26 60.63Ours-Meta 91.06 16.13 53.16

Table 5Performances (%) comparison on the CelebA data set. “∗” denotes the

methods that use extra domain labels for training.


Direct 78.74 41.58 11.49∗ABS-Net [44] 75.80 34.90 8.09∗ELEGANT [70] 75.88 32.02 10.05∗RevGrad [20] 80.12 31.18 10.96∗CDRD [39] 80.20 39.90 16.47∗SE-GZSL [66] 84.96 26.62 12.76∗AAL-UA [37] 87.07 22.19 14.99MCD [57] 50.16 49.98 0.45MCD-SWD [28] 50.23 50.15 0.37ML-VAE [5] 75.29 36.07 7.97DADA [53] 83.37 28.06 11.36BTDA [10] 78.23 28.53 8.30JiGen [7] 81.94 31.75 10.57Rot [72] 76.02 34.37 7.55MAXL [38] 81.02 28.26 11.05mDA [45] 80.19 28.91 10.90CmDA [46] 83.57 27.98 11.79Ours 88.52 22.74 22.31Ours-Meta 85.41 25.03 15.61

al. [37] chose the Eyeglasses attribute as the domain-difference, selected individuals with at least 20 images,and balanced the data set such that #(Eyeglasses =0)/#(Eyeglasses = 1) ∈ [3/7, 7/3], resulting in 211 in-dividuals. Half of the individuals wear glasses only dur-ing training, while the other half wear glasses only dur-ing testing. Table 5 shows the comparisons conducted onCelebA. We achieve consistent results with those in Table 4.Our methods significantly outperform the baseline methodswithout domain-label supervision and most baseline meth-ods supervised by extra domain labels, which demonstratesthe effectiveness and superiority of our methods. Note thatour method even outperforms the best (AAL-UA) of themethods with domain-label supervision.

Table 6The authentication problem on mobile devices. The numbers in the first

row indicate groups of subjects. “×” means there are no data for thiscondition.

No. 1-6 No. 7-12 No. 13-15 No. 16-29

IOS Train Test × TrainAndroid Test Train Train ×

Table 7Performances (%) comparison on the Mobile data set. “∗” denotes the

methods that use extra domain labels for training.


Direct 76.53 28.64 3.79∗RevGrad [20] 75.88 32.38 0.38∗ABS-Net [44] 76.58 28.09 5.13∗SE-GZSL [66] 78.83 26.12 20.54∗CDRD [39] 89.17 20.26 46.05∗AAL-UA [37] 93.40 13.59 46.37MCD [57] 83.12 21.37 24.35MCD-SWD [28] 84.49 19.89 25.35ML-VAE [5] 77.16 27.18 4.68DADA [53] 77.77 27.87 22.40BTDA [10] 86.96 17.85 30.21MAXL [38] 77.02 27.26 5.05mDA [45] 81.43 26.38 18.80CmDA [46] 82.22 21.84 20.42Ours 90.72 16.04 35.49Ours-Meta 91.55 14.87 35.84

4.3 Authentication on Mobile DevicesWe use the mobile data set built by Liang et al. [37] who col-lect smart-phone sensor information from 29 subjects, whichrecords two-second time-series data from multiple sensors,such as accelerometer, gyroscope, gravimeter, etc. They ex-tracted statistical features from both time and spectrumdomains, resulting in 5144 data samples with the featuredimension of 191. They treated the OS types (IOS/Android)as the domain-difference and constructed a biased learningtask, as shown in Table 6. The results are reported in Table 7,in which our methods still achieve consistent results. Wecan see that our methods significantly outperform the base-line methods without domain-label supervision and mostbaseline methods supervised by domain labels. On thisdata set, our meta-learning method outperforms our base-line method, which demonstrates the effectiveness of ourextended framework. Note that the results of MCD basedmethods do not show significant negative transfer here,we conjecture that it is because over-fitting the domain-difference in this data set (OS type) is not very straight-forward, compared with learning to recognize individuals.We will also discuss this phenomenon in Section 5.

4.4 Ablative StudyWe conduct a series of ablation experiments on the threedata sets mentioned above to demonstrate how the hetero-geneous predictive relationship discovery and the double-space domain-difference elimination mechanisms contributeto the performance. Specifically, we compare the followingfour model variants of our method.

Single-Space. We learn and unify latent domains inthe class-dependent space only, i.e., the branches of net-


Table 8Performances (%) comparison on the C-MNIST data set for different

variants of our method.


Ours 93.46 13.26 60.63Single-Space 86.51 21.81 39.76Feature-Based 83.80 23.33 35.95Class-Confuse 77.48 28.54 24.22No-Unification 50.64 57.93 0.02Direct 78.67 26.32 20.88

Table 9Performances (%) comparison on the CelebA data set for different




Table 10Performances (%) comparison on the Mobile data set for different




works for the class-independent space are removed, i.e.,Gci, Rg,ci, {Rrl,ci}kr=1 and Dci.

Feature-Based. We learn and unify latent domains basedon features only, i.e., local recognition networks {Rrl,cd}kr=1

and {Rrl,ci}kr=1 are removed.Class-Confuse. No latent-domain discovery and unifica-

tion. But we still learn the class-independent space. Specifi-cally, {Rrl,ci}kr=1, Dcd, {Rrl,ci}kr=1 and Dci are removed.

No-Unification. We discover latent domains but do notunify domains. Instead, for a testing sample, we select therecognition model from the most relevant domain to makerecognition.

Specifically, Rg,cd, Gci, Rg,ci, {Rrl,ci}kr=1 and Dci are re-moved. We only use {Rrl,cd}kr=1 and Dcd to make recogni-tion for the ith sample:

j = argmaxj

k∑r=1

p(zi = r | f icd, Dcd)p(yi = j | f icd, Rrl,cd). (16)

The results are presented in Table 8∼10. It is notable thatFeature-Based’s performances drastically decrease compar-ing with our best scores. The results demonstrate that usingthe class label to model the predictive relationship p(y |x) can harness more information, make a more accuratelatent-domain discovery, and result in better class-alignmentacross latent domains. Besides, Our method outperforms

Single-Space significantly, which shows that it is morerobust to eliminate domain-difference both in the class-dependence and class-independence spaces. Liang et al. [37]also finds the effectiveness of such a multi-branch structure.Moreover, the results of Class-Confuse show that learn-ing features in the class-independence space itself cannotcontribute to cross-domain recognition, but rather slightlycompromises the recognition performances comparing withthe Direct method, because after all, its objective functionis contrary to that of recognition. Lastly, the results of No-Unification are significantly worse than the Direct method,especially for the C-MNIST data set. It demonstrates thatfor our GCLDR problem when we only discover latentdomains but do not unify them, a testing sample can onlyfind a poorly trained recognition model from its domain,especially when latent domains are easy to discover.

5 DISCUSSION

In Table 4 and 5, the results of MCD [57] and its updated ver-sion MCD-SWD [28] show severe negative transfer. We con-jecture that it is because MCD admits a “global” optimiza-tion objective: its two classifiers are required to classify allthe samples and classes in training data. MCD first (Step 1)trains the two classifiers such that they are different but stillrequired to correctly recognize all the samples and classes.Then (Step 2) MCD trains the feature-extractors to minimizethe output-distribution discrepancy of the two classifiers.In domain-adaptation problems, Step 1 and 2 perform onthe source and target domains, respectively, thus are notnecessarily contradictory. However, in our GCLDR problem,the two steps can only perform on the same training data.Therefore, it is possible that when the two classifiers aresignificantly different, the feature-extractors will achieve atrivial solution: the outputs (class prediction probabilities)of both classifiers are equal to be 1/c for each class and everysample, where c is the number of class. We checked theclassifier outputs of MCD and MCD-SWD for the C-MNISTand CelebA data sets (where negative transfer occurs), andtruly found that they achieved such a trivial solution. More-over, we conjecture that when domain-difference is easier tolearn than class-difference, since domain-difference enlargesclass-difference in our GCLDR problem, the two classifiersare easier to be significantly different, thus the negativetransfer will be more likely to happen. We think it is thereason why the negative transfer happened only on the C-MNIST and CelebA data sets, because color and eyeglassesare easier to discriminate than digits and individuals, re-spectively. By contrast, our methods learn k classifiers “lo-cally” to recognize the samples/classes in each domain only,thus will mainly concentrate on class-difference instead ofdomain-difference, which may help to avoid the objective-contradiction and the negative transfer.

6 CONCLUSION

In this paper, we investigate a generalized cross-latent-domain recognition problem in the field of authenticationwhere domain labels are absent, and domains do not shareclasses. We recognize the class for unseen 〈class, domain〉combinations of data. We propose an end-to-end domain

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE B.9

agnostic method to tackle the problem. We build a hetero-geneous predictive-relationship discovery and unificationmechanism to discover and unify latent domains succes-sively. Besides, we build a double-space domain-differenceelimination mechanism to eliminate domain-difference inboth class-dependent and class-independent spaces to im-prove robustness of elimination. We also extend our methodinto a meta-learning framework as an alternative elimi-nation approach. The experiments demonstrate that ourmethod significantly outperforms existing state-of-the-artmethods. We also conduct an ablation study to demonstratethe effectiveness of the critical components of our method.Some interesting future directions of research include de-veloping transfer learning algorithms flexible to emergingtypes of domain-difference.

APPENDIX ADERIVATION DETAILS OF POSTERIOR PROBABILI-TIES AND EM PROCEDURES

Recall that by Eq. (1), k local recognition networksR1l , . . . , R

kl aim to learn

p(yi | f i, Rrl ) =c∏j=1

p(yi = j | f i, Rrl )I(yi=j), r ∈ [k], (A.17)

and the domain-discrimination network D aims to learnp(zi = r | f , D). Denote by θ = {{Rrl }kr=1, D}. Then bythe Bayes’ rule, we have

p(zi = r | yi, f i, {Rrl }kr=1, D) = p(zi = r | yi, f i, θ)

=p(zi = r, yi | f i, θ)

p(yi | f i, θ)

=p(zi = r, yi | f i, θ)∑k

r′=1 p(zi = r′, yi | f i, θ)

=p(zi = r | f i, θ)p(yi | zi = r, f i, θ)∑k

r′=1 p(zi = r′ | f i, θ)p(yi | zi = r′, f i, θ)

=p(zi = r | f i, D)p(yi | zi = r, f i, {Rrl }kr=1)∑k

r′=1 p(zi = r′ | f i, D)p(yi | zi = r′, f i, {Rrl }kr=1)

=p(zi = r | f i, D)p(yi | f i, Rrl )∑k

r′=1 p(zi = r′ | f i, D)p(yi | f i, Rr′l )

=p(zi = r | f i, D)

∏cj=1 p(y


r′=1 p(zi = r′ | f i, D)

∏cj=1 p(y

i = j | f i, Rr′l )I(yi=j).

(A.18)

The third equation is due to the law of total probability.The fifth equation is because that given f i, the predictionfor zi = r is related to D only, and given f i and zi = r,the prediction for yi is related to {Rrl }kr=1 only. The sixthequation is because that given zi = r, the prediction foryi is related to Rrl only. The last equation is the result ofEq. (A.17).

We follow the Expectation-Maximization (EM) [15]scheme to solve the problem. For each i = [b], define(δi,1, . . . , δi,k) be a set of latent indicator variables, whereδi,r = 1 if the ith sample (yi, f i) belongs to the rth latentdomain and δi,r = 0 otherwise. So

∑kr=1 δ

i,r = 1, ∀i. Theseindicators are not observed since the domain labels of thesamples are unknown. Let δ denote the collection of allthe indicator variables. By treating δ as missing, the EM

algorithm proceeds by iteratively optimizing the conditionalexpectation of the complete log-likelihood criterion.

The complete likelihood is given byb∏i=1

k∏r=1

c∏j=1

[p(zi = r | f i, D)p(yi = j | f i, Rrl )]δi,rI(yi=j). (A.19)

Then the complete log-likelihood is given by

`c(θ | {(yi, f i)}bi=1, δ)

=

b∑i=1

k∑r=1

c∑j=1

δi,rI(yi = j) log p(zi = r | f i, D)

+

b∑i=1

k∑r=1

c∑j=1

δi,rI(yi = j) log p(yi = j | f i, Rrl ),

(A.20)

where θ = {{Rrl }kr=1, D}. The conditional expectation ofthe complete negative log-likelihood is then given by

Q(θ | θ′) = −E[`c(θ | {(yi, f i)}bi=1, δ) | {(yi, f i)}bi=1, θ′]/b,

(A.21)

It is easy to show that deriving Q(θ | θ′) boils down to thecomputation of E[δi,r | {(yi, f i)}bi=1, θ

′], which admits anexplicit form.

The EM algorithm proceeds as follows.E-Step: Given θ′ = {{Rrl }kr=1, D} computed by the last stepof optimization, compute

ρi,r = E[δi,r | {(yi, f i)}bi=1, θ′]

= p(zi = r | yi, f i, θ′)= p(zi = r | yi, f i, {Rrl }kr=1, D) =

p(zi = r | f i, D)∏cj=1 p(y


r′=1 p(zi = r′ | f i, D)

∏cj=1 p(y

i = j | f i, Rr′l )I(yi=j),

(A.22)

where the last equation is the result of Eq. (A.18).M-Step: Minimize Q(θ | θ′)

Q(θ | θ′) = −1

b

b∑i=1

k∑r=1

c∑j=1

ρi,rI(yi = j) log p(zi = r | f i, D)

− 1

b

b∑i=1

k∑r=1

c∑j=1

ρi,rI(yi = j) log p(yi = j | f i, Rrl ),

(A.23)

a) optimize the recognition model for each domain by:

− 1

b

b∑i=1

k∑r=1

c∑j=1

ρi,rI(yi = j) log p(yi = j | f i, Rrl )

= −1

b

b∑i=1

k∑r=1

ρi,rc∑j=1

I(yi = j) log p(yi = j | f i, Rrl ).

(A.24)

Eq. (A.24) corresponds to Eq. (3).b) optimize the concordance between the feature-based

discriminative probability p(zi = r | f i, D) and the poste-rior probability ρi,r by:

− 1

b

b∑i=1

k∑r=1

c∑j=1

ρi,rI(yi = j) log p(zi = r | f i, D)

= −1

b

b∑i=1

k∑r=1

ρi,r log p(zi = r | f i, D)

c∑j=1

I(yi = j)

= −1

b

b∑i=1

k∑r=1

ρi,r log p(zi = r | f i, D).

(A.25)


Eq. (A.25) corresponds to Eq. (4).

APPENDIX BPROOF OF OBSERVATION 1Proof. By the first order Tyler expansion, we have

Lmeta =γ

2

[1

|S1|∑r∈S1

Ls(r; θ − α∇2)

+1

|S2|∑r∈S2

Ls(r; θ − α∇1)

]≈γ2

[1

|S1|∑r∈S1

Ls(r; θ)− α(∇θLs(r; θ))T∇2

+1

|S2|∑r∈S2

Ls(r; θ)− α(∇θLs(r; θ))T∇1

]=γ

2

[1

|S1|∑r∈S1

Ls(r; θ)− α∇T1∇2

+1

|S2|∑r∈S2

Ls(r; θ)− α∇T2∇1

]=γ

2

[1

|S1|∑r∈S1

Ls(r) +1

|S2|∑r∈S2

Ls(r)]− γα∇T

1∇2.

(B.26)

REFERENCES[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean,

M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensor-flow: a system for large-scale machine learning. In OSDI,volume 16, pages 265–283, 2016.

[2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman,D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas.Learning to learn by gradient descent by gradient descent.In Advances in neural information processing systems, pages3981–3989, 2016.

[3] Y. Balaji, S. Sankaranarayanan, and R. Chellappa.Metareg: Towards domain generalization using meta-regularization. In Advances in Neural Information ProcessingSystems, pages 998–1008, 2018.

[4] A. Bedagkar-Gala and S. K. Shah. A survey of approachesand trends in person re-identification. Image and VisionComputing, 32(4):270–286, 2014.

[5] D. Bouchacourt, R. Tomioka, and S. Nowozin. Multi-level variational autoencoder: Learning disentangled rep-resentations from grouped observations. arXiv preprintarXiv:1705.08841, 2017.

[6] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan,and D. Erhan. Domain separation networks. In Advances inneural information processing systems, pages 343–351, 2016.

[7] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, andT. Tommasi. Domain generalization by solving jigsawpuzzles. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 2229–2238, 2019.

[8] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deepclustering for unsupervised learning of visual features. InProceedings of the European Conference on Computer Vision(ECCV), pages 132–149, 2018.

[9] D.-D. Chen, Y. Wang, J. Yi, Z. Chen, and Z.-H. Zhou. Jointsemantic domain alignment and target classifier learn-ing for unsupervised domain adaptation. arXiv preprintarXiv:1906.04053, 2019.

[10] Z. Chen, J. Zhuang, X. Liang, and L. Lin. Blending-target domain adaptation by adversarial meta-adaptationnetworks. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 2248–2257, 2019.

[11] S. Cicek and S. Soatto. Unsupervised domain adapta-tion via regularized conditional alignment. arXiv preprintarXiv:1905.10885, 2019.

[12] G. Csurka. Domain adaptation for visual applications:A comprehensive survey. arXiv preprint arXiv:1702.05374,2017.

[13] S. Dai, Y. Cheng, Y. Zhang, Z. Gan, J. Liu, and L. Carin.Contrastively smoothed class alignment for unsuperviseddomain adaptation. arXiv preprint arXiv:1909.05288, 2019.

[14] O. Day and T. M. Khoshgoftaar. A survey on heteroge-neous transfer learning. Journal of Big Data, 4(1):29, 2017.

[15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximumlikelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society. Series B (methodolog-ical), pages 1–38, 1977.

[16] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domainadaptation from multiple sources via auxiliary classifiers.In Proceedings of the 26th Annual International Conference onMachine Learning, pages 289–296. ACM, 2009.

[17] L. Duan, D. Xu, and I. W.-H. Tsang. Domain adaptationfrom multiple sources: A domain-dependent regulariza-tion approach. IEEE Transactions on Neural Networks andLearning Systems, 23(3):504–518, 2012.

[18] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceed-ings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.

[19] M. Friedjungova and M. Jirina. Asymmetric heteroge-neous transfer learning: A survey. 2017.

[20] Y. Ganin and V. Lempitsky. Unsupervised domain adap-tation by backpropagation. arXiv preprint arXiv:1409.7495,2014.

[21] Y. Ganin and V. Lempitsky. Unsupervised domain adap-tation by backpropagation. In International Conference onMachine Learning, pages 1180–1189, 2015.

[22] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi.Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE internationalconference on computer vision, pages 2551–2559, 2015.

[23] S. Gidaris, P. Singh, and N. Komodakis. Unsupervisedrepresentation learning by predicting image rotations. InInternational Conference on Learning Representations, 2018.URL https://openreview.net/forum?id=S1v4N2l0-.

[24] B. Gong, K. Grauman, and F. Sha. Reshaping visualdatasets for domain adaptation. In Advances in NeuralInformation Processing Systems, pages 1286–1294, 2013.

[25] J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discoveringlatent domains for multisource domain adaptation. InEuropean Conference on Computer Vision, pages 702–715.Springer, 2012.

[26] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Tor-ralba. Undoing the damage of dataset bias. In EuropeanConference on Computer Vision, pages 158–171. Springer,2012.

[27] A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. arXiv preprintarXiv:1901.09005, 2019.

[28] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht. Slicedwasserstein discrepancy for unsupervised domain adap-tation. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 10285–10295, 2019.

[29] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese,L. Fei-Fei, A. Garg, and J. Bohg. Making sense of vi-sion and touch: Self-supervised learning of multimodalrepresentations for contact-rich tasks. arXiv preprintarXiv:1810.10191, 2018.

https://openreview.net/forum?id=S1v4N2l0-


[30] S. Lee, D. Kim, N. Kim, and S.-G. Jeong. Drop to adapt:Learning discriminative features for unsupervised domainadaptation. In The IEEE International Conference on Com-puter Vision (ICCV), October 2019.

[31] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Deeper,broader and artier domain generalization. In Proceedings ofthe IEEE International Conference on Computer Vision, pages5542–5550, 2017.

[32] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learningto generalize: Meta-learning for domain generalization.In Thirty-Second AAAI Conference on Artificial Intelligence,2018.

[33] W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool. Domaingeneralization and adaptation using low rank exemplarsvms. IEEE transactions on pattern analysis and machineintelligence, 40(5):1114–1127, 2017.

[34] Y. Li, M. Gong, X. Tian, T. Liu, and D. Tao. Domaingeneralization via conditional invariant representations.In Thirty-Second AAAI Conference on Artificial Intelligence,2018.

[35] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao.Deep domain generalization via conditional invariant ad-versarial networks. In Proceedings of the European Conferenceon Computer Vision (ECCV), pages 624–639, 2018.

[36] Y. Li, Y. Yang, W. Zhou, and T. Hospedales. Feature-criticnetworks for heterogeneous domain generalization. InInternational Conference on Machine Learning, pages 3915–3924, 2019.

[37] J. Liang, Y. Cao, C. Zhang, S. Chang, K. Bai, and Z. Xu.Additive adversarial learning for unbiased authentication.In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), June 2019.

[38] S. Liu, A. J. Davison, and E. Johns. Self-supervised gen-eralisation with meta auxiliary learning. arXiv preprintarXiv:1901.08933, 2019.

[39] Y.-C. Liu, Y.-Y. Yeh, T.-C. Fu, S.-D. Wang, W.-C. Chiu,and Y.-C. F. Wang. Detach and adapt: Learning cross-domain disentangled deep representation. arXiv preprintarXiv:1705.01314, 2017.

[40] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learningface attributes in the wild. In Proceedings of the IEEEInternational Conference on Computer Vision, pages 3730–3738, 2015.

[41] Z. Liu, Z. Miao, X. Pan, X. Zhan, S. X. Yu, D. Lin, andB. Gong. Compound domain adaptation in an open world.arXiv preprint arXiv:1909.03403, 2019.

[42] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transferlearning with joint adaptation networks. In Proceedings ofthe 34th International Conference on Machine Learning-Volume70, pages 2208–2217. JMLR. org, 2017.

[43] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditionaladversarial domain adaptation. In Advances in NeuralInformation Processing Systems, pages 1640–1650, 2018.

[44] J. Lu, J. Li, Z. Yan, F. Mei, and C. Zhang. Attribute-basedsynthetic network (abs-net): Learning more from pseudofeature representations. Pattern Recognition, 80:129–142,2018.

[45] M. Mancini, L. Porzi, S. Rota Bulo, B. Caputo, and E. Ricci.Boosting domain adaptation by discovering latent do-mains. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 3771–3780, 2018.

[46] M. Mancini, L. Porzi, F. Cermelli, and B. Caputo. Discov-ering latent domains for unsupervised domain adaptationthrough consistency. In International Conference on ImageAnalysis and Processing, pages 390–401. Springer, 2019.

[47] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domainadaptation with multiple sources. In Advances in neuralinformation processing systems, pages 1041–1048, 2009.

[48] J. G. Moreno-Torres, T. Raeder, R. Alaiz-RodrıGuez, N. V.Chawla, and F. Herrera. A unifying view on dataset shift

in classification. Pattern Recognition, 45(1):521–530, 2012.[49] K. Muandet, D. Balduzzi, and B. Scholkopf. Domain

generalization via invariant feature representation. InInternational Conference on Machine Learning, pages 10–18,2013.

[50] L. Niu, W. Li, and D. Xu. Multi-view domain general-ization for visual recognition. In Proceedings of the IEEEInternational Conference on Computer Vision, pages 4193–4201, 2015.

[51] A. v. d. Oord, Y. Li, and O. Vinyals. Representationlearning with contrastive predictive coding. arXiv preprintarXiv:1807.03748, 2018.

[52] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visualdomain adaptation: A survey of recent advances. IEEEsignal processing magazine, 32(3):53–69, 2015.

[53] X. Peng, Z. Huang, X. Sun, and K. Saenko. Domainagnostic learning with disentangled representations. InInternational Conference on Machine Learning, pages 5102–5112, 2019.

[54] P. Ramachandran, B. Zoph, and Q. V. Le. Searching foractivation functions. 2018.

[55] S. Ravi and H. Larochelle. Optimization as a model forfew-shot learning. 2016.

[56] R. Romijnders, P. Meletis, and G. Dubbelman. A domainagnostic normalization layer for unsupervised adversarialdomain adaptation. In 2019 IEEE Winter Conference onApplications of Computer Vision (WACV), pages 1866–1875.IEEE, 2019.

[57] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Max-imum classifier discrepancy for unsupervised domainadaptation. In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition, pages 3723–3732, 2018.

[58] M. B. Sariyildiz and R. G. Cinbis. Gradient matchinggenerative networks for zero-shot learning. In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recog-nition, pages 2168–2178, 2019.

[59] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting in-ductive bias with success-story algorithm, adaptive levinsearch, and incremental self-improvement. Machine Learn-ing, 28(1):105–130, 1997.

[60] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri,P. Jyothi, and S. Sarawagi. Generalizing across domainsvia cross-gradient training. In International Conference onLearning Representations, 2018. URL https://openreview.net/forum?id=r1Dx7fbCW.

[61] H. Shimodaira. Improving predictive inference undercovariate shift by weighting the log-likelihood function.Journal of statistical planning and inference, 90(2):227–244,2000.

[62] A. Storkey. When training and test sets are different:characterizing learning transfer. Dataset shift in machinelearning, pages 3–28, 2009.

[63] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye. Atwo-stage weighting framework for multi-source domainadaptation. In Advances in neural information processingsystems, pages 505–513, 2011.

[64] S. Sun, H. Shi, and Y. Wu. A survey of multi-sourcedomain adaptation. Information Fusion, 24:84–92, 2015.

[65] S. Thrun and L. Pratt. Learning to learn. Springer Science& Business Media, 2012.

[66] V. K. Verma, G. Arora, A. Mishra, and P. Rai. General-ized zero-shot learning via synthesized examples. In TheIEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2018.

[67] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al.Matching networks for one shot learning. In Advancesin neural information processing systems, pages 3630–3638,2016.

[68] M. Wang and W. Deng. Deep visual domain adaptation: Asurvey. Neurocomputing, 312:135–153, 2018.

https://openreview.net/forum?id=r1Dx7fbCW

https://openreview.net/forum?id=r1Dx7fbCW


[69] J. Wen, N. Zheng, J. Yuan, Z. Gong, and C. Chen. Bayesianuncertainty matching for unsupervised domain adapta-tion. arXiv preprint arXiv:1906.09693, 2019.

[70] T. Xiao, J. Hong, and J. Ma. Elegant: Exchanging latent en-codings with gan for transferring multiple face attributes.In The European Conference on Computer Vision (ECCV),September 2018.

[71] C. Xiong, S. McCloskey, S.-H. Hsieh, and J. J. Corso.Latent domains modeling for visual domain adaptation.In Twenty-Eighth AAAI Conference on Artificial Intelligence,2014.

[72] J. Xu, L. Xiao, and A. M. Lopez. Self-supervised do-main adaptation for computer vision tasks. arXiv preprintarXiv:1907.10915, 2019.

[73] Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rankstructure from latent domains for domain generalization.In European Conference on Computer Vision, pages 628–643.Springer, 2014.

[74] N. Yager and A. Amin. Fingerprint verification based onminutiae features: a review. Pattern Analysis and Applica-tions, 7(1):94–113, 2004.

[75] K. Zhang, B. Scholkopf, K. Muandet, and Z. Wang. Do-main adaptation under target and conditional shift. InInternational Conference on Machine Learning, pages 819–827,2013.

[76] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld.Face recognition: A literature survey. ACM computingsurveys (CSUR), 35(4):399–458, 2003.

[77] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprintarXiv:1610.02984, 2016.

Jian Liang Jian Liang received his Ph.D. de-gree from Tsinghua University, Beijing, China, in2018. During 2018 and 2020 he was a seniorresearcher in the Wireless Security ProductsDepartment of the Cloud and Smart IndustriesGroup at Tencent, Beijing. In 2020 he joinedthe AI for international Department, New RetailIntelligence Engine, Alibaba Group as a senioralgorithm engineer. His paper received the BestShort Paper Award in 2016 IEEE InternationalConference on Healthcare Informatics (ICHI).

Yuren Cao Yuren Cao received his master de-gree from University of Electronic Science andTechnology of China, Chengdu, China, in 2019.He is currently a researcher in the Wireless Se-curity Products Department of the Cloud andSmart Industries at Tencent. His research in-terests include machine learning, deep learningand data mining.

Shuang Li Shuang Li received the Ph.D. de-gree in control science and engineering fromthe Department of Automation, Tsinghua Uni-versity, Beijing, China, in 2018. He was a Vis-iting Research Scholar with the Department ofComputer Science, Cornell University, Ithaca,NY, USA, from November 2015 to June 2016.He is currently an Assistant Professor with theschool of Computer Science and Technology,Beijing Institute of Technology, Beijing. His mainresearch interests include machine learning and

deep learning, especially in transfer learning and domain adaptation.

Bing Bai Bing Bai received his B.S. and Ph.D.degrees in control theory and application fromTsinghua University, China, in 2013 and 2018respectively, and he is currently a senior re-searcher with the Cloud and Smart IndustriesGroup, Tencent, Beijing, China. His research in-terests include natural language processing andrecommender systems.

Hao Li Hao Li is a principle researcher andengineer at Tencent. Hao’s research centers onmaking systems and data reliable, secure, andefficient. Recently he’s been focusing on datasecurity and privacy by distributed and decen-tralized cross-silo/cross-device federated learn-ing, and privacy-preserving machine learning byprogram analysis and secure multi-party compu-tation.

Fei Wang Fei Wang is an Associate Profes-sor in Division of Health Informatics, Depart-ment of Population Health Sciences, Weill Cor-nell Medicine, Cornell University. His major re-search interest is data mining, machine learningand their applications in health data science. Hehas published more than 200 papers on the topvenues of related areas such as ICML, KDD,NeurIPS, AAAI, JAMA Internal Medicine, Annalsof Internal Medicine, etc. His papers have re-ceived over 12,500 citations so far with an H-

index 55. His (or his students’) papers have won 7 best paper (ornomination) awards at international academic conferences. His teamwon the championship of the NIPS/Kaggle Challenge on Classificationof Clinically Actionable Genetic Mutations in 2017 and Parkinson’s Pro-gression Markers’ Initiative data challenge organized by Michael J. FoxFoundation in 2016. Dr. Wang is the recipient of the NSF CAREERAward in 2018, as well as the inaugural research leadership award inIEEE International Conference on Health Informatics (ICHI) 2019. Dr.Wang is the chair of the Knowledge Discovery and Data Mining workinggroup in American Medical Informatics Association (AMIA). Dr. Wangfrequently serves as the program committee chair, general chair andarea chair at international conferences on data mining and medicalinformatics. Dr. Wang is on the editorial board of several prestigiousacademic journals including Scientific Reports, IEEE Transactions onNeural Networks and Learning Systems, Data Mining and KnowledgeDiscovery, etc.

Kun Bai Kun Bai is the Director of Cloud &Smart Industries Group in Tencent. Before join-ing Tencent, he was Research Staff Memberand Manager in IBM TJ Watson Research andwas responsible for developing and leading ad-vanced research for IBM Watson Health and IBMWatson Cloud. He earned a Ph.D. in Informa-tion Science & Technologies from PennsylvaniaState University. He is a senior member of IEEE.

Domain Agnostic Learning for Unbiased [email protected],[email protected],...

Documents

Transcript of Domain Agnostic Learning for Unbiased [email protected],[email protected],...