1 Transfer Learning by Mapping and Revising Relational Knowledge Raymond J. Mooney University of...
-
Upload
elwin-ralf-holmes -
Category
Documents
-
view
214 -
download
0
Transcript of 1 Transfer Learning by Mapping and Revising Relational Knowledge Raymond J. Mooney University of...
1
Transfer Learning by Mapping and Revising Relational Knowledge
Raymond J. Mooney
University of Texas at Austinwith acknowledgements to
Lily Mihalkova, Tuyen Huynh
2
Transfer Learning
• Most machine learning methods learn each new task from scratch, failing to utilize previously learned knowledge.
• Transfer learning concerns using knowledge acquired in a previous source task to facilitate learning in a related target task.
• Usually assume significant training data was available in the source domain but limited training data is available in the target domain.
• By exploiting knowledge from the source, learning in the target can be:– More accurate: Learned knowledge makes better predictions.
– Faster: Training time is reduced.
3
Transfer Learning Curves
• Transfer learning increases accuracy in the target domain.
Amount of training data in target domain
Pre
dict
ive
Acc
urac
y
NO TRANSFER
TRANSFER FROM SOURCE
Jumpstart
Advantage from Limited target data
4
Recent Work on Transfer Learning
• Recent DARPA program on Transfer Learning has led to significant recent research in the area.
• Some work focuses on feature-vector classification.– Hierarchical Bayes (Yu et al., 2005; Lawrence & Platt, 2004)
– Informative Bayesian Priors (Raina et al., 2005)
– Boosting for transfer learning (Dai et al., 2007)
– Structural Correspondence Learning (Blitzer et al., 2007)
• Some work focuses on Reinforcement Learning– Value-function transfer (Taylor & Stone, 2005; 2007)
– Advice-based policy transfer (Torrey et al., 2005; 2007)
5
Similar Research Problems
• Multi-Task Learning (Caruana, 1997)
– Learn multiple tasks simultaneously; each one helped by the others.
• Life-Long Learning (Thrun, 1996)
– Transfer learning from a number of prior source problems, picking the correct source problems to use.
6
Logical Paradigm
• Represents knowledge and data in binary symbolic logic such as First Order Predicate Calculus.
+ Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
Unable to handle uncertain knowledge and probabilistic reasoning.
7
Probabilistic Paradigm
• Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
+ Handles uncertain knowledge and probabilistic reasoning.
Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.
8
Statistical Relational Learning (SRL)
• Most machine learning methods assume i.i.d. examples represented as fixed-length feature vectors.
• Many domains require learning and making inferences about unbounded sets of entities that are richly relationally connected.
• SRL methods attempt to integrate methods from predicate logic and probabilistic graphical models to handle such structured, multi-relational data.
9
Statistical Relational Learning
pacino
brando
…
coppola
…
godFather pacino
godFather brando
godFather coppola
streetCar brando
… …
pacino coppola
brando coppola
… …
Actor MovieDirector
WorkedFor
Multi-Relational Data Learning
Algorithm
Probabilistic Graphical Model
10
Multi-Relational Data Challenges
• Examples cannot be effectively represented as feature vectors.
• Predictions for connected facts are not independent. (e.g. WorkedFor(brando, coppolo), Movie(godFather, brando))– Data is not i.i.d.
– Requires collective inference (classification) (Taskar et al., 2001)
• A single independent example (mega-example) often contains information about a large number of interconnected entities and can vary in length.– Leave one university out testing (Craven et al., 1998)
TL and SRL and I.I.D.
Standard Machine Learning assumes examples are:
Independent and Identically Distributed
11
SRL breaks the assumptionthat examples are independent
TL breaks the assumptionthat test examples are drawn from the same distribution as
the training instances
12
Multi-Relational Domains
• Domains about people– Academic departments (UW-CSE)– Movies (IMDB)
• Biochemical domains– Mutagenesis – Alzheimer drug design
• Linked text domains– WebKB– Cora
13
Relational Learning Methods
• Inductive Logic Programming (ILP)– Produces sets of first-order rules– Not appropriate for probabilistic reasoning
• If a student wrote a paper with a professor, then the professor is the student’s advisor
• SRL models & learning algorithms– SLPs (Muggleton, 1996)– PRMs (Koller, 1999)– BLPs (Kersting & De Raedt, 2001)– RMNs (Taskar et al., 2002)– MLNs (Richardson & Domingos, 2006)
Markov logic networks
14
MLN Transfer(Mihalkova, Huynh, & Mooney, 2007)
• Given two multi-relational domains, such as:
• Transfer a Markov logic network learned in the Source to the Target by:– Mapping the Source predicates to the Target– Revising the mapped knowledge
IMDBIMDBDirector(A) Actor(A)
Movie(T, A) WorkedFor(A, B)
…
UW-CSEUW-CSEProfessor(A) Student(A)
Publication(T, A) AdvisedBy(A, B)
…
Source: Target:
15
First-Order Logic Basics
• Literal: A predicate (or its negation) applied to constants and/or variables.– Gliteral: Ground literal; WorkedFor(brando, coppola)– Vliteral: Variablized literal; WorkedFor(A, B)
• We assume predicates have typed arguments.– For example: Movie(godFather, coppola)
movieTitle person
17
Actor(pacino) WorkedFor(pacino, coppola) Movie(godFather, pacino)
Actor(brando) WorkedFor(brando, coppola) Movie(godFather, brando)
Director(coppola) Movie(godFather, coppola)
Representing the Data
• Makes a closed world assumption: – The gliterals listed are true; the rest are false
18
Markov Logic Networks(Richardson & Domingos, 2006)
• Set of first-order clauses, each assigned a weight.• Larger weight indicates stronger belief that the
clause should hold.• The clauses are called the structure of the MLN.
5.3
0.5
3.2
19
Markov Networks(Pearl, 1988)
• A concise representation of the joint probability distribution of a set of random variables using an undirected graph.
Quality of
Paper
Reputationof Author
Algorithm Experiments
R A E Q
e e e e 0.7
e e e p 0.1
e e p e 0.03
…
Same probability distribution can be represented as the
product of a set of functions defined over the cliques of
the graph
Joint distribution
21
Ground Markov Network for an MLN
• MLNs are templates for constructing Markov networks for a given set of constants:– Include a node for each type-consistent grounding
(a gliteral) of each predicate in the MLN.– Two nodes are connected by an edge if their
corresponding gliterals appear together in any grounding of any clause in the MLN.
– Include a feature for each grounding of each clause in the MLN with weight equal to the weight of the clause.
22
constants: coppola, brando,
godFather
1.31.20.5
Movie(godFather, brando) Movie(godFather,coppola)
WorkedFor(coppola, brando) WorkedFor(coppola, coppola)
WorkedFor(brando, brando) WorkedFor(brando, coppola)
Director(brando)
Actor(brando)
Director(coppola)
Actor(coppola)
23
MLN Equations
Compare to log-linear Markov networks:
all groundings of f i
all clauses in the model
24
MLN Equation Intuition
A possible world (a truth assignment to all gliterals) becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
25
MLN Inference
• Given truth assignments for given set of evidence gliterals, infer the probability that each member of set of unknown query gliterals is true.
26
WorkedFor(coppola, brando) WorkedFor(coppola, coppola)
WorkedFor(brando, brando) WorkedFor(brando, coppola)
Director(brando)
Actor(brando)
Director(coppola)
Actor(coppola)
Movie(godFather, brando) Movie(godFather,coppola)
F
TT
T
0.9
0.70.1
0.2
0.20.3
27
MLN Inference Algorithms
• Gibbs Sampling (Richardson & Domingos, 2006)
• MC-SAT (Poon & Domingos, 2006)
28
MLN Learning
• Weight-learning (Richardson & Domingos, 2006; Lowd & Domingos, 2007)– Performed using optimization methods.
• Structure-learning (Kok & Domingos, 2005)
– Proceeds in iterations of beam search, adding the best-performing clause after each iteration to the MLN.
– Clauses are evaluated using WPLL score.
29
WPLL (Kok & Domingos, 2005)
• Weighted pseudo log-likelihood
Compute the likelihood of the data according to the model and take log
For each gliteral, condition on its Markov blanket for tractability
Weight it so that predicates with greater arity do not
dominate
Alchemy
• Open-source package of MLN software provided by UW that includes:– Inference algorithms– Weight learning algorithms– Structure learning algorithm– Sample data sets
• All our software uses and extends Alchemy.
30
31
TAMAR(Transfer via Automatic Mapping And Revision)
Target(IMDB)
Data
M-TAMAR
R-TAMAR
Movie(T,A) WorkedFor(A,B) → Movie(T,B)
Movie(T,A) WorkedFor(A,B) Relative(A,B) → Movie(T,B)
Publication(T,A) AdvisedBy(A,B) → Publication(T,B) (clause from UW-CSE)
Predicate Mapping:Publication MovieAdvisedBy WorkedFor
32
Predicate Mapping
• Each clause is mapped independently of the others.
• The algorithm considers all possible ways to map a clause such that:– Each predicate in the source clause is mapped to
some target predicate.– Each argument type in the source is mapped to
exactly one argument type in the target.
• Each mapped clause is evaluated by measuring its WPLL for the target data, and the most accurate mapping is kept.
33
Predicate Mapping Example
Publication(title, person) Movie(name, person)
AdvisedBy(person, person) WorkedFor(person, person)
Consistent Type Mapping: title → name person → person
34
Predicate Mapping Example 2
Publication(title, person) Gender(person, gend)
AdvisedBy(person, person) SameGender(gend, gend)
Consistent Type Mapping: title → person person → gend
35
TAMAR(Transfer via Automatic Mapping And Revision)
Target(IMDB)
Data
M-TAMAR
R-TAMAR
Movie(T,A) WorkedFor(A,B) → Movie(T,B)
Movie(T,A) WorkedFor(A,B) Relative(A,B) → Movie(T,B)
Publication(T,A) AdvisedBy(A,B) → Publication(T,B) (clause from UW-CSE)
36
Transfer Learning as Revision
• Regard mapped source MLN as an approximate model for the target task that needs to be accurately and efficiently revised.
• Thus our general approach is similar to that taken by theory revision systems (Richards & Mooney, 1995).
• Revisions are proposed in a bottom-up fashion.
Source MLN
Top-down Source MLN
Target TrainingData
Data-DrivenRevisionProposer
Bottom-up
37
R-TAMAR
Relational Data
Clause 1 wtClause 2 wtClause 3 wtClause 4 wtClause 5 wt
Self-diagnosis
0.10.51.3
-0.2Change in WPLL
1.7
Clause 1 wtClause 2 wtClause 3 wtClause 4 wtClause 5 wtToo Long
Too ShortGood
Too LongGood
Directed Beam Search
New Candidate ClausesNew Clause 1 wtNew Clause 3 wt
New Clause 5 wt
New clause
discovery
New Clause 6 wtNew Clause 7 wt
38
R-TAMAR: Self-Diagnosis
• Use mapped source MLN to make inferences in the target and observe the behavior of each clause:– Consider each predicate P in the domain in turn.– Use Gibbs sampling to infer truth values for the
gliterals of P, using the remaining gliterals as evidence.– Bin the clauses containing gliterals of P based on
whether they behave as desired.
• Revisions are focused only on clauses in the “Bad” bins.
39
Self-Diagnosis: Clause BinsActor(brando) Director(coppola)
Movie(godFather, brando) Movie(godFather, coppola)
Movie(rainMaker, coppola) WorkedFor(brando, coppola)
Current gliteral:
Actor(brando)
– Relevant; Good
40
Self-Diagnosis: Clause BinsActor(brando) Director(coppola)
Movie(godFather, brando) Movie(godFather, coppola)
Movie(rainMaker, coppola) WorkedFor(brando, coppola)
Current gliteral:
Actor(brando)
– Relevant; Good
– Relevant; Bad
41
Self-Diagnosis: Clause BinsActor(brando) Director(coppola)
Movie(godFather, brando) Movie(godFather, coppola)
Movie(rainMaker, coppola) WorkedFor(brando, coppola)
Current gliteral:
Actor(brando)
– Relevant; Good
– Relevant; Bad
– Irrelevant; Good
42
Self-Diagnosis: Clause BinsActor(brando) Director(coppola)
Movie(godFather, brando) Movie(godFather, coppola)
Movie(rainMaker, coppola) WorkedFor(brando, coppola)
Current gliteral:
Actor(brando)
– Relevant; Good
– Relevant; Bad
– Irrelevant; Good
– Irrelevant; Bad
Leng
then
Shor
ten
43
Structure Revisions
• Using directed beam search:– Literal deletions attempted only from clauses
marked for shortening.– Literal additions attempted only for clauses marked
for lengthening.
• Training is much faster since search space is constrained by:
1. Limiting the clauses considered for updates.2. Restricting the type of updates allowed.
44
New Clause Discovery
• Uses Relational Pathfinding (Richards & Mooney, 1992)
brando coppola
godFather rainMaker
WorkedFor
Movie M
ovie
Movie
Actor(brando) Director(coppola)Movie(godFather, brando) Movie(godFather, coppola)
Movie(rainMaker, coppola) WorkedFor(brando, coppola)
WorkedFor
45
Weight Revision
Target(IMDB)
Data
M-TAMAR
R-TAMAR
Movie(T,A) WorkedFor(A,B) → Movie(T,B)
Movie(T,A) WorkedFor(A,B) Relative(A,B) → Movie(T,B)
Publication(T,A) AdvisedBy(A,B) → Publication(T,B)
MLN Weight Training
0.8 Movie(T,A) WorkedFor(A,B) Relative(A,B) → Movie(T,B)
46
Experiments: Domains
• UW-CSE – Data about members of the UW CSE department– Predicates include Professor, Student, AdvisedBy, TaughtBy,
Publication, etc.
• IMDB– Data about 20 movies– Predicates include Actor, Director, Movie, WorkedFor, Genre,
etc.
• WebKB– Entity relations from the original WebKB domain
(Craven et al. 1998)– Predicates include Faculty, Student, Project, CourseTA, etc.
47
Dataset Statistics
Data is organized as mega-examples• Each mega-example contains information about a group of related entities.
• Mega-examples are independent and disconnected from each other.
Data Set # Mega-
Examples
#
Constants
#
Types
#
Predicates
# True Gliterals
Total # Gliterals
IMDB 5 316 4 10 1,540 32,615
UW-CSE 5 1,323 9 15 2,673 678,899
WebKB 4 1,700 3 6 2,065 688,193
Manually Developed Source KB
• UW-KB is a hand-built knowledge base (set of clauses) for the UW-CSE domain.
• When used as a source domain, transfer learning is a form of theory refinement that also includes mapping to a new domain with a different representation.
48
49
Systems Compared
• TAMAR: Complete transfer system.
• ScrKD: Algorithm of Kok & Domingos (2005) learning from scratch.
• TrKD: Algorithm of Kok & Domingos (2005) performing transfer, using M-TAMAR to produce a mapping.
50
Methodology: Training & Testing
• Generated learning curves using leave-one-out CV– Each run keeps one mega-example for testing and trains
on the remaining ones, provided one by one.
– Curves are averages over all runs.
• Evaluated learned MLN by performing inference for all gliterals of each predicate in turn, providing the rest as evidence, and averaging the results.
51
Methodology: Metrics Kok & Domingos (2005)
• CLL: Conditional Log Likelihood– The log of the probability predicted by the model that a
gliteral has the correct truth value given in the data.– Averaged over all test gliterals.
• AUC: Area under the precision recall (PR) curve– Produce a PR curve by varying the probability threshold.– Compute the area under this curve.
52
Metrics to Summarize Curves
• Transfer Ratio (Cohen et al. 2007)
– Gives overall idea of improvement achieved over learning from scratch
53
Transfer Scenarios
• Source/target pairs tested:– WebKB IMDB
– UW-CSE IMDB
– UW-KB IMDB
– WebKB UW-CSE
– IMDB UW-CSE
• WebKB not used as a target since one mega-example is sufficient to learn an accurate theory for its limited predicate set.
54
AUC Transfer Ratio
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
WebKB to IMDB UW-CSE to IMDB UW-KB to IMDB WebKB to UW-CSE IMDB to UW-CSE
Experiment
Tran
sfe
r R
ati
o
TrKD
TAMAR
Po
siti
ve
Tra
nsf
er
Ne
gat
ive
Tra
ns
fer
55
CLL Transfer Ratio
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
WebKB to IMDB UW-CSE to IMDB UW-KB to IMDB WebKB to UW-CSE IMDB to UW-CSE
Experiments
Tran
sfe
r R
ati
o
TrKD
TAMAR
Po
siti
ve
Tra
nsf
er
Ne
gat
ive
Tra
ns
fer
57
Total Training Time in Minutes
0.00
200.00
400.00
600.00
800.00
1000.00
1200.00
WebKB to IMDB UW-CSE to IMDB UW-KB to IMDB WebKB to UW-CSE IMDB to UW-CSE
Experiment
Min
ute
s
TrKD
TAMAR
Scr KD
0.89 9.10 9.99 6.75 42.3
Number of seconds to find best mapping:
Future Research Issues
• More realistic application domains.
• Application to other SRL models (e.g. SLPs, BLPs).
• More flexible predicate mapping– Allow argument ordering or arity to change.– Map 1 predicate to conjunction of > 1
predicates• AdvisedBy(X,Y) Movie(M,X) Director(M,Y)
58
Multiple Source Transfer
• Transfer from multiple source problems to a given target problem.
• Determine which clauses to map and revise from different source MLNs.
59
Source Selection
• Select useful source domains from a large number of previously learned tasks.
• Ideally, picking source domain(s) is sub-linear in the number of previously learned tasks.
60
61
Conclusions
• Presented TAMAR, a complete transfer system for SRL that:– Maps relational knowledge in the source to the
target domain.– Revises the mapped knowledge to further
improve accuracy.
• Showed experimentally that TAMAR improves speed and accuracy over existing methods.
63
Why MLNs?
• Inherit the expressivity of first-order logic– Can apply insights from ILP
• Inherit the flexibility of probabilistic graphical models– Can deal with noisy & uncertain environments
• Undirected models– Do not need to learn causal directions
• Subsume all other SRL models that are special cases of first-order logic or probabilistic graphical models [Richardson 04]
• Publicly available software package: Alchemy
64
Predicate Mapping Comments
• A particular source predicate can be mapped to different target predicates in different clauses.– This makes our approach context sensitive.
• More scalable.– In the worst-case, the number of mappings is
exponential in the number of predicates.– The number of predicates in a clause is generally
much smaller than the total number of predicates in a domain.
65
Relationship to Structure Mapping Engine (Falkenheiner et al., 1989)
• A system for mapping relations using analogy based on a psychological theory.
• Mappings are evaluated based only on the structural relational similarity between the two domains.
• Does not consider the accuracy of mapped knowledge in the target when determining the preferred mapping.
• Determines a single global mapping for a given source & target.
66
Summary of Methodology
1. Learn MLNs for each point on learning curve
2. Perform inference over learned models3. Summarize inference results using 2
metrics: CLL and AUC, thus producing two learning curves
4. Summarize each learning curve using transfer ratio and percentage improvement from one mega-example
67
CLL Percent Improvement
0
10
20
30
40
50
60
70
80
WebKB to IMDB UW-CSE to IMDB UW-KB to IMDB WebKB to UW-CSE IMDB to UW-CSE
Experiment
Perc
en
t Im
pro
vem
en
t fr
om
1
Exam
ple
TrKD
TAMAR