Stacked Graphical Models for Efficient Inference in Markov Random Fields


Transcript of Stacked Graphical Models for Efficient Inference in Markov Random Fields

Page 1: Stacked Graphical Models for Efficient Inference in Markov Random Fields

Stacked Graphical Models for Efficient Inference in Markov Random Fields

Zhenzhen Kou, William W. CohenMachine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Abstract

In collective classification, classes are predicted for a group of related instances simultaneously, rather than predicting a class for each instance separately. Collective classification has been widely used for classification on relational datasets. However, the inference procedure used in collective classification usually requires many iterations and thus is expensive. We propose stacked graphical learning, a meta-learning scheme in which a base learner is augmented by expanding one instance's features with predictions on other related instances. Stacked graphical learning is efficient, especially during inference, capable of capturing dependencies easily, and can be implemented with any kind of base learner. In experiments on eight datasets, stacked graphical learning is 40 to 80 times faster than Gibbs sampling during inference.

Introduction

• Traditional machine learning algorithms assume independence among records

• There are many relational datasets in reality, where the instances are not independent of each other

•Web pages linked to each other; Data in a database; Papers with citations, co-authorships; …

• Relational models assume dependence among instances
  • Relational Bayesian networks (RBNs) (Getoor et al. 2001)
  • Relational Markov networks (RMNs) (Taskar et al. 2002)
  • Relational dependency networks (RDNs) (Neville & Jensen 2003, 2004)
  • Markov logic networks (Richardson & Domingos 2004)

•Collective inference predicts the class labels for all instances in a data set simultaneously

• Most existing models are expensive
  • Iterative inference in graphical models
  • An algorithm with efficient inference is important in applications

Relational template for expanding features

•Relational template C finds all the instances relevant to x and returns their indices

• Given predictions $\hat{Y}$ for a set of examples, $C(x, \hat{Y})$ returns the predictions on the instances related to $x$

•Relational template allows aggregation

•Aggregation is necessary because the number of neighbors may vary

• Aggregators: COUNT, AVERAGE, MIN, MAX, EXISTS (see the sketch below)
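To make the aggregation concrete, here is a minimal sketch of a relational template that collects the current predictions on an instance's related instances and summarizes them with COUNT- and EXISTS-style aggregators. This is an illustration under my own assumptions (the neighbors mapping, the label set, and the function name are hypothetical), not the authors' code.

def relational_template(i, neighbors, y_hat, classes):
    """Expand instance i's features with aggregated predictions on related instances.

    neighbors[i] plays the role of C(x_i): the indices of the instances related to x_i.
    y_hat holds the current predicted labels for all instances.
    """
    related = [y_hat[j] for j in neighbors.get(i, [])]
    features = []
    for c in classes:
        count = sum(1 for y in related if y == c)   # COUNT aggregator
        features.append(float(count))
        features.append(1.0 if count > 0 else 0.0)  # EXISTS aggregator
    return features

# Example: instance 0 links to instances 1 and 2, both currently predicted "positive":
# relational_template(0, {0: [1, 2]}, ["?", "positive", "positive"], ["positive", "negative"])
# returns [2.0, 1.0, 0.0, 0.0]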

Evaluation

• Eight real-world datasets

  • Four relational datasets and four name extraction datasets

• Relational templates

  • Collective classification of relational data: Count
  • Name extraction: Exists

    – Include the dependencies among adjacent words and repeated words (see the sketch after this list)

• Models to compare
  • Base learner
  • Stacked models
  • Competitive models: RDNs / stacked sequential models
  • Statistical ceiling for stacked models
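For the name extraction datasets, the EXISTS template described above can be sketched as follows. The token list, the "NAME" label, and the function name are my own illustrative assumptions; the templates actually used may differ in detail.

def name_extraction_template(i, tokens, y_hat):
    """EXISTS-style template for name extraction (a sketch, not the authors' code).

    Expands token i with two binary features: whether any adjacent token is
    currently predicted to be part of a name, and whether any other occurrence
    of the same word is predicted to be part of a name.
    """
    adjacent = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
    repeats = [j for j, w in enumerate(tokens) if w == tokens[i] and j != i]
    adjacent_is_name = any(y_hat[j] == "NAME" for j in adjacent)  # EXISTS over adjacent words
    repeat_is_name = any(y_hat[j] == "NAME" for j in repeats)     # EXISTS over repeated words
    return [1.0 if adjacent_is_name else 0.0, 1.0 if repeat_is_name else 0.0]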

Cross-validated Predictions during Training

Convergence: inference for SGMs vs Gibbs sampling

Efficiency: SGMs are 40 to 80 times faster than Gibbs sampling

Summary

• Stacked graphical learning substantially improves the performance compared to the base learner

• Stacked graphical learning is competitive with other relational models

• Stacked graphical learning is efficient during inference
  • Very few iterations are needed

Figure: Cross-validated predictions during training (shown for J = 3). The original dataset with local features is split into D1, D2, D3. Train f1 on D2 + D3, f2 on D1 + D3, and f3 on D1 + D2; apply f1 to D1, f2 to D2, and f3 to D3 to obtain predicted labels, yielding the extended dataset D'1, D'2, D'3.
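In effect, this step produces leakage-free predictions on the training set itself: every training instance is labeled by a classifier that never saw it. A minimal sketch using scikit-learn (the library choice and the logistic-regression base learner are my own assumptions, not the authors'):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def cross_validated_predictions(X, y, J=5):
    """Train J classifiers, each on J-1 folds, and predict the held-out fold,
    so every training instance gets a prediction from a model that never saw it."""
    base_learner = LogisticRegression(max_iter=1000)
    return cross_val_predict(base_learner, X, y, cv=J)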

Relational template notation: $j_1, \ldots, j_k = C(x)$ are the indices of the instances related to $x$; given predictions $\hat{Y}$ for the dataset, the template returns $C(x, \hat{Y}) = \hat{y}_{j_1}, \ldots, \hat{y}_{j_k}$ (aggregated with the operators above).

Stacked Graphical Models (SGMs)

• Predict the class labels based on local features with a base learning method
• Get an expanded feature vector and train a model with the expanded features

Algorithm (for k = 1)

Input: training data $D = \{(x_t, y_t)\}$, a base learner $A$, a relational template $C$, and a cross-validation parameter $J$.

Learning algorithm:
1. Split the training set into $J$ disjoint subsets, i.e., $D = \{D_1, \ldots, D_J\}$.
2. Train $J$ classifiers: for $j = 1, \ldots, J$, let $f_j = A(D - D_j)$.
3. Get a predicted label $\hat{y}_t$ for each $x_t$: given $x_t \in D_j$, $\hat{y}_t = f_j(x_t)$.
4. Construct an extended dataset $S'$ of instances $(x'_t, y_t)$, where $x'_t = (x_t, C(x_t, \hat{Y}))$.
5. Return two functions: the local model $f = A(S)$ and the stacked model $f' = A(S')$, where $S$ is the original dataset with local features.

Inference algorithm: given an example $x$,
1. Let $\hat{y} = f(x)$.
2. Carry out Step 4 of the learning procedure to produce an extended instance $x'$.
3. Return $\hat{y}' = f'(x')$.
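A minimal end-to-end sketch of the k = 1 algorithm. The use of scikit-learn and NumPy, the logistic-regression base learner, and the user-supplied template(t, y_hat) function (which returns the aggregated neighbor-prediction features for instance t) are my own assumptions; labels are assumed to be integer-encoded.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def expand(X, y_hat, template):
    """Step 4: append the aggregated predictions C(x_t, Y_hat) to each instance x_t."""
    extra = np.array([template(t, y_hat) for t in range(len(X))])
    return np.hstack([X, extra])

def sgm_learn(X, y, template, base_learner=None, J=5):
    """Stacked graphical learning for k = 1 (a sketch, not the authors' code)."""
    base_learner = base_learner or LogisticRegression(max_iter=1000)
    # Steps 1-3: cross-validated predictions on the training set.
    y_hat = np.empty_like(y)
    for train_idx, held_idx in KFold(n_splits=J, shuffle=True, random_state=0).split(X):
        f_j = clone(base_learner).fit(X[train_idx], y[train_idx])
        y_hat[held_idx] = f_j.predict(X[held_idx])
    # Step 4: extended dataset with neighbor-prediction features.
    X_ext = expand(X, y_hat, template)
    # Step 5: local model f and stacked model f'.
    f = clone(base_learner).fit(X, y)
    f_prime = clone(base_learner).fit(X_ext, y)
    return f, f_prime

def sgm_infer(X, f, f_prime, template):
    """SGM inference: one pass of the local model, then one pass of the stacked model."""
    y_hat = f.predict(X)                # local predictions
    X_ext = expand(X, y_hat, template)  # same expansion as Step 4
    return f_prime.predict(X_ext)

For general k the expand-and-retrain step is repeated k times, so inference costs only k + 1 classifier passes rather than many sampling iterations, which is where the reported speed-up over Gibbs sampling comes from.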

Results on the eight datasets:

                            --- Collective classification ---    ------- Name extraction -------
                            SLIF    WebKB   Cora    CiteSeer     UT      Yapex   Genia   CSpace
Competitive model           86.7    74.2    72.9    58.7         76.8    66.8    77.1    81.2
Local model                 77.2    58.3    63.9    55.3         73.1    65.7    72.0    80.3
Stacked model (k=1)         90.1    73.2    73.8    59.8         78.3    69.3    77.9    82.5
Stacked model (k=2)         90.1    72.1    73.9    59.8         78.4    69.2    78.0    82.4
Ceiling for stacked model   96.3    73.6    76.9    62.3         80.5    70.5    80.3    84.6

Speed-up of SGM inference over Gibbs sampling:

                   Gibbs 50   Gibbs 100
SLIF               39.6       79.3
WebKB              43.4       87.0
Cora               42.7       85.4
CiteSeer           43.6       87.3
Average speed-up   42.3       84.8

Figure: Graphical models vs. stacked models. A graphical model couples the labels Y of the instances x1, ..., x5 directly through the graph structure. A stacked model instead builds models on extended instances (x, y'): each instance is augmented with the local predictions y' on its related instances (e.g., x2 is paired with y'1, y'2, y'4).

•Stacked models converge more quickly than Gibbs sampling

• Even when starting from the same initial assignments

• More iterations of stacking are not needed