Learning the Structure of Related Tasks

Presented by Lihan He

Machine Learning Reading Group

Duke University

02/03/2006

A. Niculescu-Mizil, R. Caruana

Outline

Introduction

Learning single Bayes networks from data

Learning from related tasks

Experimental results

Conclusions

Introduction

Graphical model:

Nodes represent random variables; edges represent dependencies.

Undirected graphical model: Markov network

Directed graphical model: Bayesian network

[Figure: example directed graph over nodes x1, x2, x3, x4]

Causal relationships between nodes;

Directed acyclic graph (DAG): no directed cycles allowed;

B={G,θ}

$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1)\, p(X_4 \mid X_2, X_3)$
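As a small illustration of the factorization above, the sketch below multiplies one conditional probability per node to get the joint probability of a full assignment. The variables are assumed to be binary and every number in the tables is made up for the example; none of them come from the talk.

```python
# Hypothetical conditional probability tables (CPTs) for binary variables.
P_X1 = {0: 0.6, 1: 0.4}
P_X2_GIVEN_X1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
P_X3_GIVEN_X1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
P_X4_GIVEN_X2_X3 = {
    (0, 0): {0: 0.99, 1: 0.01},
    (0, 1): {0: 0.4, 1: 0.6},
    (1, 0): {0: 0.3, 1: 0.7},
    (1, 1): {0: 0.1, 1: 0.9},
}


def joint(x1, x2, x3, x4):
    """p(x1, x2, x3, x4) as a product of one factor per node."""
    return (
        P_X1[x1]
        * P_X2_GIVEN_X1[x1][x2]
        * P_X3_GIVEN_X1[x1][x3]
        * P_X4_GIVEN_X2_X3[(x2, x3)][x4]
    )


# Summing the joint over all 16 assignments should give 1 (up to rounding).
total = sum(
    joint(a, b, c, d)
    for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)
)
```

The four tables need 1 + 2 + 2 + 4 = 9 free parameters, compared with 15 for an unrestricted joint table over four binary variables.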

Introduction

Goal: simultaneously learn Bayes Net structures for multiple tasks.

Different tasks are related;

Structures might be similar, but not identical.

Example: gene expression data.

1) Learn a single structure from data.

2) Generalize to multi-task learning by placing a joint prior on the structures.

Single Bayesian network learning from data

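Bayes network B={G, θ} over a set of n random variables X={X1, X2,…, Xn}

The joint probability P(X) factorizes as

$P(X) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$, where $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in G.

Given a dataset $D=\{x^1, x^2, \ldots, x^m\}$, where each $x^i = (x^i_1, x^i_2, \ldots, x^i_n)$, we can learn the structure G and the parameters θ from D.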


Model selection: find the structure G with the highest posterior P(G|D) among all possible G

Exhaustive search over all possible G is infeasible:

n=4: there are 543 possible DAGs

n=10: there are $O(10^{18})$ possible DAGs
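The counts above can be checked with Robinson's recurrence for the number of labeled DAGs on n nodes, $a(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a(n-k)$ with $a(0)=1$. The recurrence is standard, but it is not part of the talk; the function name `count_dags` is just chosen for this sketch.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in (4, 10):
    print(n, count_dags(n))
```

For n=4 this gives 543, and for n=10 a number on the order of 10^18, which is why exhaustive enumeration is ruled out even for small networks.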

Question: how can we search for the best structure among such a huge number of possible DAGs?

Algorithm:

1) Randomly generate an initial DAG and evaluate its score;

2) Evaluate the scores of all the neighbors of the current DAG;

3) while {some neighbor has a higher score than the current DAG}

move to the neighbor that has the highest score;

evaluate the scores of all the neighbors of the new DAG;

end

4) Repeat (1) - (3) a number of times, starting from a different DAG each time.
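Steps (1) to (4) describe greedy hill climbing with random restarts. The sketch below writes that loop generically: `neighbors`, `score`, and `random_start` are caller-supplied functions (hypothetical names, not from the talk), so the same loop works for DAGs or, as in the toy usage here, for integers.

```python
import random


def hill_climb(initial, neighbors, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current, current_score = initial, score(initial)
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            # Local maximum: no neighbor is strictly better.
            return current, current_score
        current, current_score = best, best_score


def random_restarts(random_start, neighbors, score, restarts=10, seed=0):
    """Run hill_climb from several random starting points and keep the best result."""
    rng = random.Random(seed)
    runs = [
        hill_climb(random_start(rng), neighbors, score)
        for _ in range(restarts)
    ]
    return max(runs, key=lambda run: run[1])


# Toy usage: states are integers, and the score peaks at 3.
toy_best, toy_score = random_restarts(
    random_start=lambda rng: rng.randint(-10, 10),
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
    restarts=5,
)
```

Hill climbing only guarantees a local maximum, which is why the restarts in step (4) matter for structure search.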


Neighbors of a structure G: the set of all the DAGs that can be obtained by adding, removing or reversing an edge in G


Each neighbor must satisfy the acyclicity constraint
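The neighbor definition can be sketched as follows. A DAG is represented here as a set of (parent, child) pairs over nodes 0..n-1, which is an assumption of this example rather than anything specified in the talk; acyclicity is checked with Kahn's topological sort.

```python
from itertools import permutations


def is_acyclic(n, edges):
    """True if the directed graph on nodes 0..n-1 has no directed cycle."""
    indegree = [0] * n
    children = [[] for _ in range(n)]
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    stack = [v for v in range(n) if indegree[v] == 0]
    visited = 0
    while stack:
        u = stack.pop()
        visited += 1
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return visited == n


def dag_neighbors(n, edges):
    """All DAGs reachable by adding, removing, or reversing one edge."""
    edges = frozenset(edges)
    result = set()
    for u, v in permutations(range(n), 2):
        if (u, v) in edges:
            removed = edges - {(u, v)}
            result.add(removed)  # removing an edge never creates a cycle
            flipped = removed | {(v, u)}
            if is_acyclic(n, flipped):
                result.add(flipped)
        elif (v, u) not in edges:
            added = edges | {(u, v)}
            if is_acyclic(n, added):
                result.add(added)
    return result


# Example: 0->1, 0->2, 1->3, 2->3 (a four-node diamond).
diamond = {(0, 1), (0, 2), (1, 3), (2, 3)}
diamond_neighbors = dag_neighbors(4, diamond)
```

Each structure has O(n^2) neighbors, since there are O(n^2) node pairs and a constant number of edits per pair.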

[Figure: five example DAGs over x1, x2, x3, x4 illustrating neighboring structures]

Learning from related tasks

Given iid datasets D1, D2, …, Dk,

simultaneously learn the structures B1={G1, θ1}, B2={G2, θ2}, …, Bk={Gk, θk}

Structures (G1,G2,…,Gk): similar, but not identical


One more assumption: the parameters of different networks are independent.

This is not true in general, but it makes structure learning more efficient. Since the focus is on structure learning, not parameter learning, this is acceptable.


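Prior:

If the structures are not related: G1,…,Gk are independent a priori

Structures are learned independently for each task.

If the structures are identical, $p(G_1, \ldots, G_k) = c \cdot \mathbb{1}(G_1 = \cdots = G_k)$, i.e. the prior is nonzero only when all structures coincide.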

Learning the same structure: augment the variables with a task indicator TSK,

$(X_1, X_2, \ldots, X_n) \rightarrow (X_1, X_2, \ldots, X_n, TSK), \quad TSK \in \{1, 2, \ldots, k\}$

Learn a single structure under the restriction that TSK is always a parent of every other node.

Common structure: remove node TSK and all the edges connected to it.
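The TSK construction amounts to pooling the k datasets into one, with an extra column holding the task index, and then dropping the TSK node from the learned graph. A minimal sketch, with hypothetical helper names and rows represented as tuples:

```python
def pool_with_task_variable(datasets):
    """Concatenate k datasets, appending the task index (1..k) to every row."""
    pooled = []
    for task_id, rows in enumerate(datasets, start=1):
        for row in rows:
            pooled.append(tuple(row) + (task_id,))
    return pooled


def strip_task_node(edges, tsk):
    """Recover the common structure: drop TSK and every edge touching it."""
    return {(u, v) for (u, v) in edges if tsk not in (u, v)}


# Two tiny made-up datasets over (X1, X2).
pooled_rows = pool_with_task_variable([[(0, 1)], [(1, 1), (1, 0)]])

# A made-up learned structure in which TSK is a parent of every other node.
learned = {("TSK", "X1"), ("TSK", "X2"), ("X1", "X2")}
common = strip_task_node(learned, "TSK")
```

Because TSK is a parent of every node, each task keeps its own conditional distributions while the edges among X1,…,Xn are shared.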


Prior:

Between independent and identical:

Penalize each edge (Xi, Xj) that differs between the two DAGs

δ=0: independent

δ=1: identical

0<δ<1: between the two extremes

For the k-task prior
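One way to picture the edge penalty, purely as an assumption and not necessarily the exact form used in the talk, is a multiplicative factor of (1-δ) for every node pair on which two DAGs disagree. That choice reproduces both endpoints: δ=0 gives no penalty (independent structures) and δ=1 gives zero prior mass to any pair of non-identical structures. The sketch computes the corresponding log-penalty; a reversed edge is counted as one difference, which is also an assumption.

```python
import math


def edit_distance(edges_a, edges_b):
    """Number of node pairs whose edge status (none, u->v, v->u) differs."""
    def by_pair(edges):
        return {frozenset(edge): edge for edge in edges}

    a, b = by_pair(edges_a), by_pair(edges_b)
    return sum(1 for pair in a.keys() | b.keys() if a.get(pair) != b.get(pair))


def log_penalty(edges_a, edges_b, delta):
    """Assumed log-prior penalty: d(G_a, G_b) * log(1 - delta)."""
    d = edit_distance(edges_a, edges_b)
    if d == 0:
        return 0.0
    if delta >= 1.0:
        return -math.inf  # non-identical structures are ruled out
    return d * math.log1p(-delta)


g_a = {(0, 1), (1, 2)}
g_b = {(1, 0), (1, 2), (0, 2)}  # one reversed edge, one extra edge
```

Here `edit_distance(g_a, g_b)` is 2, so the log-penalty is `2 * log(1 - δ)` for 0<δ<1.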


Model selection: find the configuration with the highest P(G1,…,Gk|D1,…,Dk)

Same idea as single-task structure learning.

Question: what is a neighbor of (G1,…,Gk)?

Def 1: $\mathrm{neighbor}(G_1) \times \mathrm{neighbor}(G_2) \times \cdots \times \mathrm{neighbor}(G_k)$

Size of the neighborhood: $O(n^{2k})$

Def 2: Def 1 + one more constraint:

All the edge changes happen between the same two nodes in every DAG of (G1,…,Gk)

Size of the neighborhood: $O(n^2 3^k)$
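Def 2 can be read as: pick one pair of nodes, then let each of the k DAGs independently take one of three states on that pair (no edge, one direction, or the other). That gives at most 3^k configurations per pair and O(n^2) pairs. The sketch below enumerates such neighbors under this reading; the edge-set representation and function names are assumptions of the example.

```python
from itertools import combinations, product


def is_acyclic(n, edges):
    """True if the directed graph on nodes 0..n-1 has no directed cycle."""
    indegree = [0] * n
    children = [[] for _ in range(n)]
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    stack = [v for v in range(n) if indegree[v] == 0]
    visited = 0
    while stack:
        u = stack.pop()
        visited += 1
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return visited == n


def joint_neighbors(n, graphs):
    """Yield configurations that differ from `graphs` on a single node pair."""
    original = tuple(frozenset(g) for g in graphs)
    for u, v in combinations(range(n), 2):
        options = (None, (u, v), (v, u))  # no edge, u->v, v->u
        bases = [g - {(u, v), (v, u)} for g in original]
        for choice in product(options, repeat=len(original)):
            candidate = tuple(
                base if edge is None else base | {edge}
                for base, edge in zip(bases, choice)
            )
            if candidate == original:
                continue  # skip the unchanged configuration
            if all(is_acyclic(n, g) for g in candidate):
                yield candidate
```

For two empty graphs on two nodes this yields 3^2 - 1 = 8 neighbors, compared with the product neighborhood of Def 1, which grows as n^(2k).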


Acceleration:

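At each iteration, the algorithm must find the best score in a set of neighbors

It is not necessary to search all the elements of that set

Partial configuration $C_i = (G_1, G_2, \ldots, G_i)$:

The first i tasks are specified and the remaining k-i tasks are not specified.

$\prod_{1 \le p < q \le i} P(G_p, G_q)^{\frac{1}{k-1}} \; \prod_{i+1 \le r < s \le k} P(\hat{G}_r, \hat{G}_s)^{\frac{1}{k-1}} \; \prod_{p=1}^{i} P(D_p \mid G_p) \; \prod_{r=i+1}^{k} P(D_r \mid \hat{G}_r)$

is the upper bound for the neighbor subset $(G_1, G_2, \ldots, G_i, \hat{G}_{i+1}, \ldots, \hat{G}_k)$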
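The idea of fixing the first i tasks and bounding the rest is a branch-and-bound search. The sketch below is a generic illustration, not the bound from the slide: it assumes each task picks one option from a list with log-score `local[t][c]`, and that a coupling term `penalty(c1, c2) <= 0` is added for every pair of tasks. Dropping the coupling terms for unspecified tasks and taking their best local scores gives a simple optimistic bound, and a partial configuration is discarded whenever that bound cannot beat the best complete configuration found so far.

```python
def best_joint_choice(local, penalty):
    """Branch and bound over per-task options.

    local[t][c]: log-score of option c for task t.
    penalty(c1, c2): non-positive coupling between two tasks' options.
    Returns (best choice tuple, best total log-score).
    """
    k = len(local)
    # best_rest[i]: optimistic score of tasks i..k-1 with all penalties ignored.
    best_rest = [0.0] * (k + 1)
    for t in range(k - 1, -1, -1):
        best_rest[t] = best_rest[t + 1] + max(local[t])

    best = {"score": float("-inf"), "choice": None}

    def extend(prefix, score):
        i = len(prefix)
        if score + best_rest[i] <= best["score"]:
            return  # prune: the upper bound cannot beat the incumbent
        if i == k:
            best["score"], best["choice"] = score, tuple(prefix)
            return
        for c in range(len(local[i])):
            gain = local[i][c] + sum(penalty(p, c) for p in prefix)
            extend(prefix + [c], score + gain)

    extend([], 0.0)
    return best["choice"], best["score"]


# Made-up scores for 3 tasks and 3 options (e.g. no edge, u->v, v->u).
example_local = [[0.0, -1.0, -3.0], [-2.0, 0.0, -1.0], [-0.5, -0.4, 0.0]]


def example_penalty(c1, c2):
    return 0.0 if c1 == c2 else -0.7
```

The pruning is exact here because the penalties are never positive, so the bound is never below the true score of any completion.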

Results

Starting from an original network, delete edges with probability Pdel to create 5 tasks.

1000 data points.

10 trials

Compute the KL-divergence and the edit distance between the learned structure and the true structure.
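For reference, the KL-divergence between two discrete distributions p and q is the sum over states of p(x) log(p(x)/q(x)). The sketch below assumes both distributions are given as explicit probability lists over the same states, which is only practical for small networks; how the talk's experiments computed it is not stated in the transcript.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.5, 0.5]
learned_dist = [0.25, 0.75]
divergence = kl_divergence(true_dist, learned_dist)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric in its arguments.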

[Figures: KL-divergence; editing distance]
