Inductive Transfer Retrospective & Review


Transcript of Inductive Transfer Retrospective & Review

Page 1: Inductive Transfer Retrospective & Review

Inductive Transfer Retrospective & Review

Rich Caruana

Computer Science Department

Cornell University

Page 2: Inductive Transfer Retrospective & Review

Inductive Transfer: a.k.a. …

Bias Learning
Multitask learning
Learning (Internal) Representations
Learning-to-learn
Lifelong learning
Continual learning
Speedup learning
Hints
Hierarchical Bayes
…

Page 3: Inductive Transfer Retrospective & Review

Rich Sutton [1994] Constructive Induction Workshop:

“Everyone knows that good representations are key to 99% of good learning performance. Why then has constructive induction, the science of finding good representations, been able to make only incremental improvements in performance?

People can learn amazingly fast because they bring good representations to the problem, representations they learned on previous problems. For people, then, constructive induction does make a large difference in performance. …

The standard machine learning methodology is to consider a single concept to be learned. That itself is the crux of the problem…

This is not the way to study constructive induction! … The standard one-concept learning task will never do this for us and must be abandoned. Instead we should look to natural learning systems, such as people, to get a better sense of the real task facing them. When we do this, I think we find the key difference that, for all practical purposes, people face not one task, but a series of tasks. The different tasks have different solutions, but they often share the same useful representations.

… If you can come to the nth task with an excellent representation learned from the preceding n-1 tasks, then you can learn dramatically faster than a system that does not use constructive induction. A system without constructive induction will learn no faster on the nth task than on the 1st. …”

Page 4: Inductive Transfer Retrospective & Review

1986: Sejnowski & Rosenberg – NETtalk
1990: Dietterich, Hild, Bakiri – ID3 vs. NETtalk
1990: Suddarth, Kergiosen, & Holden – rule injection (ANNs)
1990: Abu-Mostafa – hints (ANNs)
1991: Dean Pomerleau – ALVINN output representation (ANNs)
1991: Lorien Pratt – speedup learning (ANNs)
1992: Sharkey & Sharkey – speedup learning (ANNs)
1992: Mark Ring – continual learning
1993: Rich Caruana – MTL (ANNs, KNN, DT)
1993: Thrun & Mitchell – EBNN
1994: Virginia de Sa – minimizing disagreement
1994: Jonathan Baxter – representation learning (and theory)
1994: Thrun & Mitchell – learning one more thing
1994: J. Schmidhuber – learning how to learn learning strategies

Transfer through the Ages

Page 5: Inductive Transfer Retrospective & Review

1994: Dietterich & Bakiri – ECOC outputs
1995: Breiman & Friedman – Curds & Whey
1995: Sebastian Thrun – LLL (learning-to-learn, lifelong learning)
1996: Danny Silver – parallel transfer (ANNs)
1996: O’Sullivan & Thrun – task clustering (KNN)
1996: Caruana & de Sa – inputs better as outputs (ANNs)
1997: Munro & Parmanto – committee machines (ANNs)
1998: Blum & Mitchell – co-training
2002: Ben-David, Gehrke, Schuller – theoretical framework
2003: Bakker & Heskes – Bayesian MTL (and task clustering)
2004: Tony Jebara – MTL in SVMs (feature and kernel selection)
2004: Pontil & Micchelli – kernels for MTL
2004: Lawrence & Platt – MTL in GP (informative vector machine)
2005: Yu, Tresp, Schwaighofer – MTL in GP
2005: Liao & Carin – MTL for RBF networks

Page 6: Inductive Transfer Retrospective & Review

A Quick Romp Through Some Stuff

Page 7: Inductive Transfer Retrospective & Review

1 Task vs. 2 Tasks vs. 4 Tasks

Page 8: Inductive Transfer Retrospective & Review

STL vs. MTL Learning Curves

courtesy Joseph O’Sullivan

Page 9: Inductive Transfer Retrospective & Review

STL vs. MTL Learning Curves

Page 10: Inductive Transfer Retrospective & Review

A Different Kind of Learning Curve

Page 11: Inductive Transfer Retrospective & Review

MTL for Bayes Net Structure Learning

[Diagram: three overlapping Bayes net structures over nodes A–E, one for each of Yeast 1, Yeast 2, Yeast 3]

Bayes nets for these three species overlap significantly.
Learn structures from data for each species separately? No.
Learn one structure for all three species? No.
Bias learning to favor shared structure while allowing some differences? Yes – makes the most of limited data.
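One way to read "bias learning to favor shared structure" as a concrete objective (my own formulation; the slide does not give one) is to score the three structures jointly, rewarding per-species fit while penalizing edges on which the graphs disagree:

```latex
\max_{G_1,G_2,G_3}\;\sum_{k=1}^{3}\operatorname{score}(G_k;\,D_k)\;-\;\lambda\sum_{k<l}\bigl|\,E(G_k)\,\triangle\,E(G_l)\,\bigr|
```

Here $E(G)$ is the edge set of $G$, $\triangle$ is symmetric difference, $D_k$ is species $k$'s data, and $\lambda$ trades off per-species fit against structural agreement.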

Page 12: Inductive Transfer Retrospective & Review

When to Use Inductive Transfer?

multiple tasks occur naturally
using future to predict present
time series
decomposable tasks
multiple error metrics
focus of attention
different data distributions for same/similar problems
hierarchical tasks
some input features work better as outputs
…

Page 13: Inductive Transfer Retrospective & Review

Multiple Tasks Occur Naturally

Mitchell’s Calendar Apprentice (CAP)
– time-of-day (9:00am, 9:30am, ...)
– day-of-week (M, T, W, ...)
– duration (30min, 60min, ...)
– location (Tom’s office, Dean’s office, 5409, ...)

Page 14: Inductive Transfer Retrospective & Review

Using Future to Predict Present

medical domains
autonomous vehicles and robots
time series
– stock market
– economic forecasting
– weather prediction
– spatial series
many more

Page 15: Inductive Transfer Retrospective & Review

Decomposable Tasks

DireOutcome = ICU ∨ Complication ∨ Death

[Diagram: network predicting DireOutcome from the INPUTS]
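A tiny sketch (mine, not from the talk) of treating the components of a decomposable target as extra MTL outputs rather than learning only the disjunction:

```python
# The component outcomes are assumed to be available as 0/1 labels at
# training time; at run time only DireOutcome needs to be predicted.
def targets(icu, complication, death):
    dire = int(icu or complication or death)   # DireOutcome = ICU v Complication v Death
    return {"DireOutcome": dire, "ICU": int(icu),
            "Complication": int(complication), "Death": int(death)}
```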

Page 16: Inductive Transfer Retrospective & Review

Focus of Attention

Single-Task ALVINN Multi-Task ALVINN

Page 17: Inductive Transfer Retrospective & Review

Different Data Distributions

Hospital 1: 50 cases, rural (Ithaca)
Hospital 2: 500 cases, mature urban (Des Moines)
Hospital 3: 1000 cases, elderly suburbs (Florida)
Hospital 4: 5000 cases, young urban (LA, SF)

Page 18: Inductive Transfer Retrospective & Review

Some Inputs are Better as Outputs

Page 19: Inductive Transfer Retrospective & Review

And many more uses of Xfer…

Page 20: Inductive Transfer Retrospective & Review

A Few Issues That Arise With Xfer

Page 21: Inductive Transfer Retrospective & Review

Issue #1: Interference

Page 22: Inductive Transfer Retrospective & Review

Issue #1: Interference

Page 23: Inductive Transfer Retrospective & Review

Issue #2: Task Selection/Weighting

Analogous to feature selection
Correlation between tasks
– heuristic works well in practice
– very suboptimal

Wrapper-based methods
– expensive
– benefit from single tasks can be too small to detect reliably
– does not examine tasks in sets

Task weighting: MTL ≠ one model for all tasks (sketched below)
– main task vs. all tasks
– even harder than task selection
– but yields best results
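A minimal sketch of what task weighting can look like as a training objective (squared error and this particular form are my assumptions; the slide only argues that weighting the main task differently from the extra tasks works best):

```python
import numpy as np

def weighted_mtl_loss(main_pred, main_y, extra_preds, extra_ys, task_weights):
    """Main task always counts fully; each extra task contributes in
    proportion to its weight (weight 0 drops the task entirely)."""
    loss = np.mean((main_pred - main_y) ** 2)
    for pred, y, w in zip(extra_preds, extra_ys, task_weights):
        loss += w * np.mean((pred - y) ** 2)
    return loss
```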

Page 24: Inductive Transfer Retrospective & Review

Issue #3: Parallel vs. Serial Transfer

Where possible, use parallel transfer
– All info about a task is in the training set, not necessarily in a model trained on that training set
– Information useful to other tasks can be lost training one task at a time
– Tasks often benefit each other mutually

When serial transfer is necessary, implement it via parallel task rehearsal (see the sketch below)

Storing all experience is not always feasible
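A rough sketch of serial transfer implemented via parallel task rehearsal; `build_mtl_net` and `train` are hypothetical helpers, not anything from the talk:

```python
def learn_new_task_with_rehearsal(new_task_data, stored_tasks, build_mtl_net, train):
    """When a new task arrives, train one multitask net on the new task's
    data together with stored (or regenerated) examples of the earlier
    tasks, rather than fine-tuning a model trained only on the old tasks."""
    rehearsal_sets = [task.examples() for task in stored_tasks]  # real or pseudo-examples
    net = build_mtl_net(n_tasks=1 + len(stored_tasks))
    train(net, [new_task_data] + rehearsal_sets)                 # all tasks in parallel
    return net
```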

Page 25: Inductive Transfer Retrospective & Review

Issue #4: Psychological Plausibility

?

Page 26: Inductive Transfer Retrospective & Review

Issue #5: Xfer vs. Hierarchical Bayes

Is Xfer just regularization/smoothing?
Yes and No

Yes:
– Similar models for different problem instances (e.g., similar stocks, data distributions, …)

No:
– Focus of attention
– Task selection/clustering/rehearsal

Page 27: Inductive Transfer Retrospective & Review

related ⇒ helps learning (e.g., copy task)

Issue #6: What does Related Mean?

Page 28: Inductive Transfer Retrospective & Review

related ⇒ helps learning (e.g., copy task)
helps learning ⇏ related (e.g., noise task)

Issue #6: What does Related Mean?

Page 29: Inductive Transfer Retrospective & Review

related ⇒ helps learning (e.g., copy task)
helps learning ⇏ related (e.g., noise task)
related ⇏ correlated (e.g., A+B, A-B)

Issue #6: What does Related Mean?

Page 30: Inductive Transfer Retrospective & Review

Why Doesn’t Xfer Rule the Earth?

Tabula rasa learning surprisingly effective
the UCI problem

Page 31: Inductive Transfer Retrospective & Review

Use Some Features as Outputs

Page 32: Inductive Transfer Retrospective & Review

Why Doesn’t Xfer Rule the Earth?

Xfer opportunities abound in real problems
Somewhat easier with ANNs (and Bayes nets)
Death is in the details
– Xfer often hurts more than it helps if not careful
– Some important tricks counterintuitive
  don’t share too much
  give tasks breathing room
  focus on one task at a time

Tabula rasa learning surprisingly effective
the UCI problem

Page 33: Inductive Transfer Retrospective & Review

What Needs to be Done?

Have algs for ANN, KNN, DT, SVM, GP, BN, …
Better prescription of where to use Xfer
Public data sets
Comparison of methods
Inductive Transfer Competition?
Task selection, task weighting, task clustering
Explicit (TC) vs. Implicit (backprop) Xfer
Theory/definition of task relatedness

Page 34: Inductive Transfer Retrospective & Review
Page 35: Inductive Transfer Retrospective & Review
Page 36: Inductive Transfer Retrospective & Review
Page 37: Inductive Transfer Retrospective & Review
Page 38: Inductive Transfer Retrospective & Review

Kinds of Transfer

Human Expertise
– Constraints
– Hints (monotonicity, smoothness, …)

Parallel
– Multitask Learning

Serial
– Learning-To-Learn
– Serial via parallel (rehearsal)

Page 39: Inductive Transfer Retrospective & Review

Motivating Example

4 tasks defined on eight bits B1-B8:
all tasks ignore input bits B7-B8

Task 1 = B1 ∨ Parity(B2-B6)
Task 2 = ¬B1 ∨ Parity(B2-B6)
Task 3 = B1 ∧ Parity(B2-B6)
Task 4 = ¬B1 ∧ Parity(B2-B6)
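A small Python sketch (mine, not from the slides) that enumerates the four tasks; the indexing convention (B1 is bit 0) and the int/bool conversion are my own choices:

```python
import itertools

def parity(bits):
    """Parity of a tuple of bits: 1 if an odd number of them are set."""
    return sum(bits) % 2

def tasks(b):
    """The four Boolean tasks; B7 and B8 (indices 6, 7) are never used."""
    p = parity(b[1:6])               # Parity(B2-B6)
    return {
        "Task1": b[0] or p,
        "Task2": (not b[0]) or p,
        "Task3": b[0] and p,
        "Task4": (not b[0]) and p,
    }

for b in itertools.product([0, 1], repeat=8):
    labels = {name: int(bool(v)) for name, v in tasks(b).items()}
```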

Page 40: Inductive Transfer Retrospective & Review

Goals of MTL

improve predictive accuracy
– not intelligibility
– not learning speed

exploit “background” knowledge
applicable to many learning methods
exploit strength of current learning methods: surprisingly good tabula rasa performance

Page 41: Inductive Transfer Retrospective & Review

Problem 2: 1D-Doors

color camera on Xavier robot
main tasks: doorknob location and door type
8 extra tasks (training signals collected by mouse):
– doorway width
– location of doorway center
– location of left jamb, right jamb
– location of left and right edges of door

Page 42: Inductive Transfer Retrospective & Review

Predicting Pneumonia Risk

[Diagram: two networks predicting PneumoniaRisk: one uses Pre-Hospital Attributes (Age, Gender, Blood Pressure, Chest X-Ray) plus In-Hospital Attributes (Albumin, Blood pO2, White Count, RBC Count); the other uses the Pre-Hospital Attributes only]

Page 43: Inductive Transfer Retrospective & Review

Predicting Pneumonia Risk

[Diagram: two networks predicting PneumoniaRisk: one uses Pre-Hospital Attributes (Age, Gender, Blood Pressure, Chest X-Ray) plus In-Hospital Attributes (Albumin, Blood pO2, White Count, RBC Count); the other uses the Pre-Hospital Attributes only]

Page 44: Inductive Transfer Retrospective & Review

Pneumonia #1: Medis

Page 45: Inductive Transfer Retrospective & Review

Pneumonia #1: Results

[Results chart: -10.8%, -11.8%, -6.2%, -6.9%, -5.7%]

Page 46: Inductive Transfer Retrospective & Review

Use imputed values for missing lab tests as extra inputs?

Page 47: Inductive Transfer Retrospective & Review

Pneumonia #1: Feature Nets

Page 48: Inductive Transfer Retrospective & Review

Pneumonia #2: Results

MTL reduces error >10%

Page 49: Inductive Transfer Retrospective & Review

Related?

Ideal:
Func(MainTask, ExtraTask, Alg) = 1
iff
Alg(MainTask || ExtraTask) > Alg(MainTask)

unrealistic
try all extra tasks (or all combinations)?
need heuristics to help us find potentially useful extra tasks to use for MTL:

Related Tasks
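A greedy wrapper sketch of the "try the extra tasks and keep what helps" idea; `alg` and `eval_main` are hypothetical interfaces, and as noted earlier, in practice the per-task gains are often too small to detect reliably this way:

```python
def select_extra_tasks(alg, main_task, candidate_tasks, eval_main):
    """Keep an extra task only if training it alongside the main task
    measurably improves the main task on held-out data. `alg` trains a
    model on a set of tasks; `eval_main` scores the main task."""
    kept = []
    baseline = eval_main(alg([main_task]))
    for t in candidate_tasks:
        score = eval_main(alg([main_task] + kept + [t]))
        if score > baseline:
            kept.append(t)
            baseline = score
    return kept
```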

Page 50: Inductive Transfer Retrospective & Review

related ⇒ helps learning (e.g., copy tasks)

Related?

Page 51: Inductive Transfer Retrospective & Review

related ⇒ helps learning (e.g., copy task)
helps learning ⇏ related (e.g., noise task)

Related?

Page 52: Inductive Transfer Retrospective & Review

related ⇒ helps learning (e.g., copy task)
helps learning ⇏ related (e.g., noise task)
related ⇏ correlated (e.g., A+B, A-B)

Related?

Page 53: Inductive Transfer Retrospective & Review

120 Synthetic Tasks

backprop net not told how tasks are related, but ...
120 Peaks Functions: A, B, C, D, E, F ∈ (0.0, 1.0)
– P001 = If (A > 0.5) Then B, Else C
– P002 = If (A > 0.5) Then B, Else D
– P014 = If (A > 0.5) Then E, Else C
– P024 = If (B > 0.5) Then A, Else F
– P120 = If (F > 0.5) Then E, Else D
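A generator for the full family (my reconstruction): every ordered triple of distinct variables (selector, then-branch, else-branch) from {A, …, F} gives one function, and 6·5·4 = 120; numbering the triples in lexicographic order reproduces the examples listed above (P001, P002, P014, P024, P120).

```python
import itertools
import random

VARS = "ABCDEF"

def make_peak(sel, then_v, else_v):
    # P(x) = x[then_v] if x[sel] > 0.5, else x[else_v]
    return lambda x: x[then_v] if x[sel] > 0.5 else x[else_v]

# Every ordered triple of distinct variables gives one task: 6*5*4 = 120.
peaks = {"P%03d" % n: make_peak(*triple)
         for n, triple in enumerate(itertools.permutations(VARS, 3), start=1)}

x = {v: random.random() for v in VARS}            # A..F drawn from (0.0, 1.0)
targets = {name: f(x) for name, f in peaks.items()}
```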

Page 54: Inductive Transfer Retrospective & Review

MTL nets cluster tasks by function

Page 55: Inductive Transfer Retrospective & Review

Peaks Functions: Clustering

Page 56: Inductive Transfer Retrospective & Review

Focus of Attention

1D-ALVINN:
– centerline
– left and right edges of road

removing centerlines from 1D-ALVINN images hurts MTL accuracy more than STL accuracy

Page 57: Inductive Transfer Retrospective & Review

Some Inputs are Better as Outputs

MainTask = Sigmoid(A) + Sigmoid(B)
Inputs A and B coded via 10-bit binary code
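A sketch of generating this synthetic problem (the range of A and B, the rounding, and the exact binary coding are my assumptions): the binary codes of A and B are the net's inputs, while A and B themselves serve as extra output targets.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_data(n=1000, bits=10, seed=0):
    rng = np.random.default_rng(seed)
    A, B = rng.random(n), rng.random(n)            # assumed uniform on (0, 1)
    main = sigmoid(A) + sigmoid(B)                 # MainTask
    def code(v):                                   # value -> 10-bit binary code
        ints = np.round(v * (2 ** bits - 1)).astype(int)
        return ((ints[:, None] >> np.arange(bits)) & 1).astype(float)
    X = np.hstack([code(A), code(B)])              # 20 binary inputs
    extra = np.stack([A, B], axis=1)               # A and B as extra outputs
    return X, main, extra
```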

Page 58: Inductive Transfer Retrospective & Review

Inputs Better as Outputs: Results

Page 59: Inductive Transfer Retrospective & Review

MTL in K-Nearest Neighbor

Most learning methods can MTL:
– shared representation
– combine performance of extra tasks
– control the effect of extra tasks

MTL in K-Nearest Neighbor:
– shared representation: distance metric
– MTLPerf = (1 − λ)·MainPerf + λ·ExtraPerf
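A crude sketch of MTL in kNN along the lines of the slide: the shared representation is a per-feature weighting of the distance metric, and candidate weightings are scored by the combined criterion (1 − λ)·MainPerf + λ·ExtraPerf. The L1 metric, leave-one-out evaluation, and random search are my own simplifications; labels are assumed to be integer numpy arrays.

```python
import numpy as np

def knn_acc(X, y, w, k=3):
    """Leave-one-out accuracy of kNN under a feature-weighted L1 metric."""
    n, correct = len(X), 0
    for i in range(n):
        d = (np.abs(X - X[i]) * w).sum(axis=1)
        d[i] = np.inf                              # exclude the query point
        nbrs = np.argsort(d)[:k]
        correct += (np.bincount(y[nbrs]).argmax() == y[i])
    return correct / n

def mtl_score(X, y_main, y_extra, w, lam=0.25):
    """The slide's combined criterion: (1 - lam)*MainPerf + lam*ExtraPerf."""
    extra = np.mean([knn_acc(X, ye, w) for ye in y_extra])
    return (1 - lam) * knn_acc(X, y_main, w) + lam * extra

def fit_metric(X, y_main, y_extra, n_trials=200, seed=0):
    """Random search over shared metric weights (a sketch only)."""
    rng = np.random.default_rng(seed)
    best_w, best_s = np.ones(X.shape[1]), -np.inf
    for _ in range(n_trials):
        w = rng.random(X.shape[1])
        s = mtl_score(X, y_main, y_extra, w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w
```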

Page 60: Inductive Transfer Retrospective & Review

Summary

inductive transfer improves learning
>15 problem types where MTL is applicable:
– using the future to predict the present
– multiple metrics
– focus of attention
– different data populations
– using inputs as extra tasks
– . . . (at least 10 more)

most real-world problems fit one of these

Page 61: Inductive Transfer Retrospective & Review

Summary/Contributions

applied MTL to a dozen problems, some not created for MTL
– MTL helps most of the time
– benefits range from 5%-40%

ways to improve MTL/Backprop
– learning rate optimization
– private hidden layers
– MTL Feature Nets

MTL nets do unsupervised learning/clustering
algorithms for MTL: ANN, KNN, SVMs, DTs

Page 62: Inductive Transfer Retrospective & Review

Open Problems

output selection
scale to 1000’s of extra tasks
compare to Bayes Nets
theory of MTL
task weighting
features as both inputs and extra outputs

Page 63: Inductive Transfer Retrospective & Review

Features as Both Inputs & Outputs

some features help when used as inputs
some of those also help when used as outputs
get both benefits in one net?

Page 64: Inductive Transfer Retrospective & Review

Summary/Contributions

focus on main task improves performance
>15 problem types where MTL is applicable:
– using the future to predict the present
– multiple metrics
– focus of attention
– different data populations
– using inputs as extra tasks
– . . . (at least 10 more)

most real-world problems fit one of these

Page 65: Inductive Transfer Retrospective & Review

Summary/Contributions

applied MTL to a dozen problems, some not created for MTL
– MTL helps most of the time
– benefits range from 5%-40%

ways to improve MTL/Backprop
– learning rate optimization
– private hidden layers
– MTL Feature Nets

MTL nets do unsupervised clustering
algs for MTL kNN and MTL Decision Trees

Page 66: Inductive Transfer Retrospective & Review

Future MTL Work

output selection
scale to 1000’s of extra tasks
theory of MTL
compare to Bayes Nets
task weighting
“features” as both inputs and extra outputs

Page 67: Inductive Transfer Retrospective & Review

Inputs as Outputs: DNA Domain

given a sequence of 60 DNA nucleotides, predict if the sequence is {IE, EI, neither}

... ACAGTACGTTGCATTACCCTCGTT... → {IE, EI, neither}

nucleotides {A, C, G, T} coded with 3 bits
3 * 60 = 180 inputs; 3 binary outputs
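A sketch of that coding (the particular 3-bit code for each nucleotide is an assumption; the slide only fixes the sizes):

```python
import numpy as np

# Each of the 60 nucleotides becomes 3 bits, giving 3 * 60 = 180 inputs;
# the label is one of {IE, EI, neither}, here as 3 binary outputs.
CODE = {"A": (0, 0, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (1, 1, 0)}
CLASSES = ["IE", "EI", "neither"]

def encode(sequence, label):
    assert len(sequence) == 60
    x = np.array([bit for nt in sequence for bit in CODE[nt]], dtype=float)  # 180 inputs
    y = np.array([float(label == c) for c in CLASSES])                       # 3 outputs
    return x, y
```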

Page 68: Inductive Transfer Retrospective & Review

Making MTL/Backprop Better

Better training algorithm:
– learning rate optimization

Better architectures:
– private hidden layers (overfitting in hidden unit space)
– using features as both inputs and outputs
– combining MTL with Feature Nets

Page 69: Inductive Transfer Retrospective & Review

Private Hidden Layers

many tasks: need many hidden units
many hidden units: “hidden unit selection problem”
allow sharing, but without too many hidden units?
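A modern PyTorch sketch of the private-hidden-layer idea (the original work used plain backprop nets; layer sizes and activations here are arbitrary): every task reads the shared hidden layer, but each task also gets a small hidden layer only it can use, so adding tasks does not force one huge shared layer.

```python
import torch
import torch.nn as nn

class SharedPlusPrivateNet(nn.Module):
    def __init__(self, n_in, n_tasks, n_shared=16, n_private=4):
        super().__init__()
        # One hidden layer shared by every task...
        self.shared = nn.Sequential(nn.Linear(n_in, n_shared), nn.Sigmoid())
        # ...plus a small private hidden layer per task.
        self.private = nn.ModuleList(
            [nn.Sequential(nn.Linear(n_in, n_private), nn.Sigmoid())
             for _ in range(n_tasks)])
        # Each task's output sees the shared units and only its own private units.
        self.heads = nn.ModuleList(
            [nn.Linear(n_shared + n_private, 1) for _ in range(n_tasks)])

    def forward(self, x):
        s = self.shared(x)
        return [head(torch.cat([s, priv(x)], dim=1))
                for head, priv in zip(self.heads, self.private)]
```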

Page 70: Inductive Transfer Retrospective & Review

Related Work

– Sejnowski, Rosenberg [1986]: NETtalk
– Pratt, Mostow [1991-94]: serial transfer in bp nets
– Suddarth, Kergiosen [1990]: 1st MTL in bp nets
– Abu-Mostafa [1990-95]: catalytic hints
– Abu-Mostafa, Baxter [92,95]: transfer PAC models
– Dietterich, Hild, Bakiri [90,95]: bp vs. ID3
– Pomerleau, Baluja: other uses of hidden layers
– Munro [1996]: extra tasks to decorrelate experts
– Breiman [1995]: Curds & Whey
– de Sa [1995]: minimizing disagreement
– Thrun, Mitchell [1994,96]: EBNN
– O’Sullivan, Mitchell [now]: EBNN+MTL+Robot

Page 71: Inductive Transfer Retrospective & Review

MTL vs. EBNN on Robot Problem

courtesy Joseph O’Sullivan

Page 72: Inductive Transfer Retrospective & Review

Theoretical Models of Parallel Xfer

PAC models based on VC-dim or MDL
– unreasonable assumptions:
  fixed size hidden layers
  all tasks generated by one hidden layer
  backprop is ideal search procedure
– predictions do not fit observations:
  have to add hidden units
– main problems:
  can’t take behavior of backprop into account
  not enough is known about capacity of backprop nets

Page 73: Inductive Transfer Retrospective & Review

Learning Rate Optimization

optimize learning rates of extra tasks
goal is to maximize generalization of main task
ignore performance of extra tasks
expensive!

performance on extra tasks improves 9%!
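A sketch of the search loop this implies (`train_net` and `eval_main_on_validation` are hypothetical helpers; the exhaustive product over candidate rates is what makes the approach expensive):

```python
import itertools

def optimize_extra_task_rates(train_net, eval_main_on_validation,
                              candidate_rates, n_extra_tasks):
    """Pick one learning rate per extra task, judging each setting only by
    main-task generalization and ignoring how well the extra tasks are learned."""
    best_rates, best_score = None, float("-inf")
    for rates in itertools.product(candidate_rates, repeat=n_extra_tasks):
        net = train_net(extra_task_rates=rates)
        score = eval_main_on_validation(net)
        if score > best_score:
            best_rates, best_score = rates, score
    return best_rates
```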

Page 74: Inductive Transfer Retrospective & Review

MTL Feature Nets

Page 75: Inductive Transfer Retrospective & Review

Acknowledgements

advisors: Mitchell & Simon
committee: Pomerleau & Dietterich
CEHC: Cooper, Fine, Buchanan, et al.
co-authors: Baluja, de Sa, Freitag
robot Xavier: O’Sullivan, Simmons
discussion: Fahlman, Moore, Touretzky
funding: NSF, ARPA, DEC, CEHC, JPRC
SCS/CMU: a great place to do research
spouse: Diane