Latent Space Domain Transfer between High Dimensional Overlapping Distributions
Sihong Xie†  Wei Fan‡  Jing Peng*  Olivier Verscheure‡  Jiangtao Ren†
† Sun Yat-Sen University   ‡ IBM T. J. Watson Research Center   * Montclair State University
Main Challenges:
1. Transfer learning
2. High dimensionality (more than 4000 features)
3. Overlapping feature sets (less than 80% of the features are the same)
4. A solution with performance bounds
Standard Supervised Learning
[Figure: both the labeled training data and the unlabeled test data come from the New York Times; the classifier reaches 85.5% accuracy.]
In Reality…
[Figure: labeled New York Times data is not available, so the classifier is trained on labeled Reuters data and tested on unlabeled New York Times data; accuracy drops to 64.1%.]
Domain Difference Causes a Performance Drop

  train     test    accuracy   setting
  NYT       NYT     85.5%      ideal
  Reuters   NYT     64.1%      realistic
  (NYT = New York Times)
High Dimensional Data Transfer

High dimensional data: text categorization, image classification. The number of features in our experiments is more than 4000.

Challenges:
• High dimensionality: there are more features than training examples, so Euclidean distances between points become similar.
• Are the feature sets completely overlapping? No: in some tasks less than 80% of the features are the same.
• Are the domains only marginally related? Then transferable structures are harder to find.
• A proper similarity definition is needed.
Transfer between High Dimensional Overlapping Distributions

• Overlapping distribution: data from the two domains may not lie on exactly the same feature space, but at most on an overlapping one ("?" marks a missing value).

      x     y    z     label
  A   ?     1    0.2   +1
  B   0.09  ?    0.1   +1
  C   0.01  ?    0.3   -1
Problems with Overlapping Distributions

Using only the overlapping features may lack predictive information, which makes examples hard to predict correctly:

      f1    f2   f3    label
  A   ?     1    0.2   +1
  B   0.09  ?    0.1   +1
  C   0.01  ?    0.3   -1
Overlapping Distribution

What if we use the union of all features and fill in the missing values with zeros?

      f1    f2   f3    label
  A   0     1    0.2   +1
  B   0.09  0    0.1   +1
  C   0.01  0    0.3   -1

Does it help? No:

  D²(A, B) = 1.0181  >  D²(A, C) = 1.0101

so A is misclassified into the same class as C instead of B. A numpy check follows below.
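A minimal numpy sketch of this failure case, using the zero-filled rows of the table above:

```python
import numpy as np

# Zero-filled feature vectors (f1, f2, f3) from the table above.
A = np.array([0.00, 1.0, 0.2])   # f1 was missing, filled with 0
B = np.array([0.09, 0.0, 0.1])   # f2 was missing, filled with 0
C = np.array([0.01, 0.0, 0.3])   # f2 was missing, filled with 0

d2_AB = np.sum((A - B) ** 2)     # 1.0181
d2_AC = np.sum((A - C) ** 2)     # 1.0101
# d2_AB > d2_AC: a nearest-neighbor rule assigns A to C's class (-1),
# even though A's true label (+1) matches B's.
```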
When one uses the union of the overlapping and non-overlapping features and leaves the missing values as zeros, the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features: the zero-filled part becomes a dominant factor in the similarity measure, and high dimensionality can drown out the truly important features.
[Figure: under zero-filling, the "blue" points lie closer to the "green" points than to the "red" ones.]
LatentMap: A Two-Step Correction

1. Missing value regression: brings the marginal distributions closer.
2. Latent space dimensionality reduction:
   - further brings the marginal distributions closer;
   - ignores unimportant, noisy, and "error-imported" features;
   - identifies transferable substructures across the two domains.
Missing Value Regression (recall the previous example)

1. Project onto the overlapping feature z.
2. Map from z back to x, using the relationship found by a regression model.

After imputation:

  D(img(A'), B) = 0.0109  <  D(img(A'), C) = 0.0125

so A is correctly classified into the same class as B.
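A minimal sketch of this regression step with scikit-learn's SVR, on synthetic stand-ins for the real word vectors: Z_out/Z_in denote the two domains' overlapping features and x_out a feature observed only in the out-domain (one regressor would be fit per missing feature):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
Z_out = rng.random((200, 50))        # out-domain: overlapping features
x_out = Z_out @ rng.random(50)       # out-domain: an observed non-overlapping feature
Z_in = rng.random((80, 50))          # in-domain: overlapping features only

# 1. Learn the relationship z -> x on the domain where x is observed.
reg = SVR(kernel="rbf").fit(Z_out, x_out)

# 2. Impute the in-domain missing values instead of filling zeros.
x_in_filled = reg.predict(Z_in)
```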
[Figure: the in-domain and out-domain word vectors are stacked into a single word-vector matrix X, made up of the overlapping features plus the missing values filled by regression; X is then passed to dimensionality reduction.]
Dimensionality Reduction

• Project the word-vector matrix onto its most important, inherent sub-space. With X the d×t word-vector matrix and X = U Σ V^T its SVD, the low-dimensional representation is

  V_k = X^T U_k Σ_k^{-1}
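A sketch of this projection in numpy, with a synthetic X standing in for the word-vector matrix:

```python
import numpy as np

d, t, k = 100, 40, 10
X = np.random.default_rng(1).random((d, t))   # d x t word-vector matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# V_k = X^T U_k Sigma_k^{-1}: one k-dimensional row per document.
V_k = X.T @ U[:, :k] @ np.diag(1.0 / s[:k])   # t x k latent representation
```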
Solution (High Dimensionality)

[Figure, recalling the previous example: in one panel the blue points are closer to the red points than to the green ones; in the other, the blues are closer to the greens than to the reds.]
Properties

• It brings the marginal distributions of the two domains closer:
  - the marginal distributions are brought closer in the high-dimensional space (Section 3.2);
  - the distance between the two marginal distributions is further reduced in the low-dimensional space (Theorem 3.2).
• It brings the two domains' conditional distributions closer: nearby instances from the two domains have similar conditional distributions (Section 3.3).
• It reduces the domain transfer risk: the risk of the nearest neighbor classifier can be bounded in the transfer learning setting (Theorem 3.3).
Experiment (I)

Data sets:
• 20 Newsgroups: 20,000 newsgroup articles
• SRAA (simulated/real auto/aviation): 73,128 articles from 4 discussion groups
• Reuters: 21,578 Reuters news articles

Baseline methods:
• naïve Bayes, logistic regression, SVM
• knnReg: missing values filled, but without SVD
• pLatentMap: SVD, but missing values left as 0
The last two baselines are meant to justify the two steps of our framework.

Our method first fills in the "gap" (the missing values) and then uses a knn classifier for classification; a sketch follows below.
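A minimal end-to-end sketch of the pipeline (illustrative names; it assumes the stacked word-vector matrix already had its missing values filled by the regression step):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def latent_knn(X_out, y_out, X_in, k_dims=10, k_nn=5):
    """Project both domains into a shared latent space, then run knn."""
    X = np.vstack([X_out, X_in])                        # documents x features
    U, s, _ = np.linalg.svd(X.T, full_matrices=False)   # X.T: features x documents
    Z = X @ U[:, :k_dims] @ np.diag(1.0 / s[:k_dims])   # latent coordinates
    knn = KNeighborsClassifier(n_neighbors=k_nn)
    knn.fit(Z[: len(X_out)], y_out)                     # labeled out-domain
    return knn.predict(Z[len(X_out):])                  # predict in-domain labels
```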
20 Newsgroups

[Figure: learning tasks are built from the category hierarchy: the top-level categories (comp, rec) serve as class labels, while different sub-categories (e.g., comp.sys and rec.sport vs. comp.graphics and rec.auto) are assigned to the out-domain and the in-domain respectively.]
Experiment (II)

Overall performance: 10 wins, 1 loss.
Experiment (III)

• Compared with knnReg (missing values filled, but without SVD): 8 wins, 3 losses.
• Compared with pLatentMap (SVD, but without filling missing values): 8 wins, 3 losses.
Conclusion

Problem: high dimensional overlapping domain transfer (text and image categorization).
• Step 1: missing value filling
  --- brings the two domains' marginal distributions closer.
• Step 2: SVD dimensionality reduction
  --- further brings the two marginal distributions closer (Theorem 3.2);
  --- clusters points from the two domains, making the conditional distribution transferable (Theorem 3.3).

Code and data are available from the authors' webpage.
Solution (High Dimensionality)

• Illustration of SVD:

  X = U Σ V^T,  where X is d×t, U is d×d, Σ is d×t, and V is t×t.

• The most important and inherent information lies in the singular vectors corresponding to the top k singular values.
• So we can keep only the top k singular values and their singular vectors.
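A quick numpy illustration of keeping only the top k singular values and vectors (synthetic X):

```python
import numpy as np

X = np.random.default_rng(2).random((100, 40))   # d x t
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 10
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation of X
# The Frobenius error equals the norm of the discarded singular values.
assert np.isclose(np.linalg.norm(X - X_k), np.linalg.norm(s[k:]))
```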
Analysis (I)

SVR (support vector regression) minimizes the distance between the two domains' marginal distributions: the regression loss upper-bounds the distance between the two domains' points on the overlapping features, so minimizing it with SVR brings the marginal distributions closer in the original space.
Analysis (II)

SVD also clusters the data, so that nearby points carry similar concepts: the quantity SVD optimizes is proportional to (a relaxation of) the objective function of k-means, and SVD achieves the optimum of this relaxed problem. A numeric check follows below.
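A small numeric check of the connection between SVD and relaxed k-means, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((60, 20))                  # n x d data points
G = X @ X.T                               # Gram matrix

k = 3
U, s, _ = np.linalg.svd(X, full_matrices=False)
H = U[:, :k]                              # top-k left singular vectors of X

# Relaxing the k-means cluster indicators to an orthonormal matrix H turns
# the objective into maximizing trace(H^T G H); the top-k singular vectors
# attain the optimum, the sum of the top-k eigenvalues of G.
assert np.isclose(np.trace(H.T @ G @ H), np.sum(s[:k] ** 2))
```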
Analysis (III)

• SVD (singular value decomposition) bounds the distance between the two marginal distributions (Theorem 3.2). With X the d×t word-vector matrix, the latent representation

  V_k = X^T U_k Σ_k^{-1}

applies the mapping T = U_k Σ_k^{-1}, whose spectral norm is ||T||_2 = 1/σ_k, where σ_k > 1. So distances shrink in the latent space, and the two marginal distributions are brought closer.
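A quick numeric check of this contraction (synthetic data; the assert reflects the spectral-norm bound ||T||_2 = 1/σ_k, so distances shrink whenever σ_k > 1):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((100, 40))                 # d x t
U, s, _ = np.linalg.svd(X, full_matrices=False)

k = 10
T = U[:, :k] @ np.diag(1.0 / s[:k])       # the d x k latent mapping

x, y = rng.random(100), rng.random(100)   # two points in the original space
ratio = np.linalg.norm((x - y) @ T) / np.linalg.norm(x - y)
assert ratio <= 1.0 / s[k - 1] + 1e-12    # ||T||_2 = 1 / sigma_k
```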
Analysis (IV)

• Bound the risk R of the nearest neighbor classifier under transfer learning settings (Theorem 3.3).
• Cluster the data so that nearest neighbors have similar conditional distributions.
• R ∝ -cov(r1, r2), where r1 and r2 are related to the two domains' conditional distributions: as the covariance goes up, the risk bound goes down.
• The larger the distance between the two conditional distributions, the higher the bound; this justifies the use of SVD.
Experiment (IV)

Parameter sensitivity:
• the number of neighbors to retrieve;
• the number of dimensions of the latent space.
Thank you!