
Page 1: Mark Goadrich, Louis Oliphant and Jude Shavlik Department of Computer Sciences

Learning Ensembles of First-Order Clauses for Recall-Precision Curves

A Case Study in Biomedical Information Extraction

Mark Goadrich, Louis Oliphant and Jude Shavlik
Department of Computer Sciences
University of Wisconsin – Madison, USA
19 Sept 2004

Page 2:

Talk Outline
- Inductive Logic Programming
- Biomedical Information Extraction
- Our Gleaner Approach
- Aleph Ensembles
- Evaluation and Results
- Future Work

Page 3:

Inductive Logic Programming
- Machine Learning
  - Classify data into categories
  - Divide data into train and test sets
  - Generate hypotheses on the train set, then measure performance on the test set
- In ILP, data are Objects …
  - person, block, molecule, word, phrase, …
- … and Relations between them
  - grandfather, has_bond, is_member, …

Page 4:

Learning daughter(A,B)

Positive examples:
  daughter(mary, ann)   daughter(eve, tom)

Negative examples:
  daughter(tom, ann)   daughter(eve, ann)
  daughter(ian, tom)   daughter(ian, ann)   …

Background Knowledge:
  mother(ann, mary)   mother(ann, tom)
  father(tom, eve)    father(tom, ian)
  female(ann)   female(mary)   female(eve)
  male(tom)     male(ian)

[Family-tree diagram: Ann is the mother of Mary and Tom; Tom is the father of Eve and Ian]

Possible Rules:
  daughter(A,B) :- true.
  daughter(A,B) :- female(A).
  daughter(A,B) :- female(A), male(B).
  daughter(A,B) :- female(A), father(B,A).
  daughter(A,B) :- female(A), mother(B,A).
  …
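The coverage test behind this slide can be sketched in a few lines. This is not the authors' code: it simulates the background knowledge as Python sets and combines the slide's `mother`/`father` predicates into a hypothetical `parent` helper, then counts how many positives and negatives a candidate clause covers.

```python
# Background knowledge from the daughter(A,B) slide, as Python sets.
female = {"ann", "mary", "eve"}
male = {"tom", "ian"}
mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

def parent(b, a):
    # Hypothetical helper merging the slide's mother/father predicates.
    return (b, a) in mother or (b, a) in father

# Candidate clause: daughter(A,B) :- female(A), parent(B,A).
def covers(a, b):
    return a in female and parent(b, a)

tp = sum(covers(a, b) for a, b in positives)
fp = sum(covers(a, b) for a, b in negatives)
print(tp, fp)  # → 2 0: covers both positives and no negatives
```

An ILP search scores each candidate clause this way against the examples, preferring clauses with high positive and low negative coverage.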

Page 5:

ILP Domains
- Object Learning
  - Trains, Carcinogenesis
- Link Learning
  - Binary predicates

Page 6:

Biomedical Information Extraction

*image courtesy of National Human Genome Research Institute

Page 7:

Yeast Protein Database

Page 8:

Biomedical Information Extraction
- Given: Medical journal abstracts tagged with protein-localization relations
- Do: Construct a system to extract protein-localization phrases from unseen text

NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.

Page 9:

Biomedical Information Extraction

NPL3 encodes a nuclear protein with …

[Parse-tree figure: the sentence is segmented into noun, verb, and prepositional phrases, with part-of-speech tags (verb, noun, article, adj, prep) on each word; "NPL3" is marked as alphanumeric and "nuclear" as a marked location]

Page 10:

Sample Extraction Structure

Find structures using ILP

[Diagram: an example clause structure over a sentence S — phrases A (article), B, and C, a protein phrase P (noun, contains alphanumeric), and a location phrase L (noun, contains marked location), with additional constraints such as no verb in between]

Page 11:

Protein Localization Extraction
- Hand-labeled dataset (Ray & Craven ’01)
  - 7,245 sentences from 871 abstracts
- Examples are phrase-phrase combinations
  - 1,810 positive & 279,154 negative
- 1.6 GB of background knowledge
  - Structural, Statistical, Lexical and Ontological
  - In total, 200+ distinct background predicates

Page 12:

Our Generate-and-Test Approach

[Figure: the parsed sentence "NPL3 encodes a nuclear protein with …" (noun phrases highlighted), with candidate rel(Prot, Loc) extractions generated for each protein-phrase/location-phrase pair]

Page 13:

Some Ranking Predicates
- High-scoring words in protein phrases
  - repressor, ypt1p, nucleoporin
- High-scoring words in location phrases
  - cytoskeleton, inner, predominately
- High-scoring words BETWEEN prot & loc
  - cofraction, mainly, primarily, …, locate
- Stemming seemed to hurt here
- Warning: must compute these PER fold

Page 14:

Some Biomedical Predicates
- On-Line Medical Dictionary
  - natural source for semantic classes
  - e.g., word occurs in category ‘cell biology’
- Medical Subject Headings (MeSH)
  - canonical method for indexing biomedical articles
  - ISA hierarchy of words and subcategories
- Gene Ontology (GO)
  - another ISA hierarchy of biological knowledge

Page 15:

Some More Predicates
- Look-ahead Phrase Predicates
  - few_POS_in_phrase(Phrase, POS)
  - phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
  - phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
- Relative Location of Phrases
  - protein_before_location(ExampleID)
  - word_pair_in_between_target_phrases(ExampleID, W1, W2)

Page 16:

Link Learning
- Large skew toward negatives
  - 500 relational objects with 5,000 positive links means 245,000 negative links
- Difficult to measure success
  - An always-negative classifier is 98% accurate
  - ROC curves look overly optimistic
- Enormous quantity of data
  - 4,285,199,774 web pages indexed by Google
  - PubMed includes over 15 million citations
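The skew argument above is easy to verify with arithmetic. A minimal sketch, using the slide's numbers (5,000 positive links among 250,000 candidate pairs for 500 objects), shows why accuracy is a misleading metric here:

```python
# Always-negative classifier under the link-learning skew from the slide.
total = 500 * 500               # candidate links among 500 relational objects
positives = 5_000
negatives = total - positives   # 245,000

tp, fp, fn, tn = 0, 0, positives, negatives  # predict "negative" for everything
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # 0.98 — looks great
print(f"recall   = {recall:.2f}")    # 0.00 — finds no links at all
```

Recall and precision, unlike accuracy, expose this degenerate classifier immediately, which is why the talk evaluates in recall-precision space.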

Page 17:

Our Approach
- Develop fast ensemble algorithms focused on recall and precision evaluation
- Key Ideas of Gleaner
  - Keep a wide range of clauses
  - Create separate theories for different recall ranges
- Evaluation
  - Area Under Recall-Precision Curve (AURPC)
  - Time = Number of clauses considered

Page 19:

Gleaner – Background
- Seed Example
  - A positive example that our clause must cover
- Bottom Clause
  - All predicates which are true about the seed example
- Rapid Random Restart (Zelezny et al., ILP 2002)
  - Stochastic selection of starting clause
  - Time-limited local heuristic search
  - We store a variety of clauses (based on recall)

Page 21:

Gleaner – Combining
- Combine K clauses per bin
  - If at least L of the K clauses match, call the example positive
- How to choose L?
  - L = 1: high recall, low precision
  - L = K: low recall, high precision
- Our method
  - Choose L such that ensemble recall matches bin b
  - Bin b’s precision should be higher than any clause in it
- We should now have a set of high-precision rule sets spanning the space of recall levels
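The "at least L of K" combination can be sketched as follows. This is a toy illustration, not the authors' implementation; the function names and the tiny dataset are hypothetical, but the logic matches the slide: an example is positive when at least L clauses match, and L is chosen so the ensemble's recall meets the bin's target.

```python
def at_least_L_of_K(matches, L):
    """matches: one boolean per clause, for a single example."""
    return sum(matches) >= L

def choose_L(clause_matches, labels, target_recall):
    """Pick the largest L whose ensemble recall still meets the target.

    Larger L means higher precision, so we scan from K downward and stop
    at the first L that achieves the bin's recall."""
    K = len(clause_matches[0])
    for L in range(K, 0, -1):
        preds = [at_least_L_of_K(m, L) for m in clause_matches]
        tp = sum(p and y for p, y in zip(preds, labels))
        pos = sum(labels)
        if pos and tp / pos >= target_recall:
            return L
    return 1

# Toy data: 3 clauses (columns) over 4 examples (rows), 2 positives.
clause_matches = [[True, True, False],   # positive example
                  [True, False, False],  # positive example
                  [True, True, True],    # negative example
                  [False, False, False]] # negative example
labels = [True, True, False, False]
print(choose_L(clause_matches, labels, target_recall=0.5))  # → 2
```

With L = 2 the toy ensemble recovers one of the two positives (recall 0.5) while rejecting the all-false negative, illustrating the precision/recall trade-off the slide describes.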

Page 22:

How to use Gleaner

[Chart: Recall vs. Precision curve]

- Generate curve
- User selects a recall bin
- Return classifications with precision confidence
  - e.g., Recall = 0.50, Precision = 0.70

Page 23:

Aleph – Learning
- Aleph learns theories of clauses (Srinivasan, v4, 2003)
  - Pick a positive seed example, find the bottom clause
  - Use heuristic search to find the best clause
  - Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered
- A theory produces one recall-precision point
- Learning complete theories is time-consuming
- Can produce a ranking with ensembles

Page 24:

Aleph Ensembles
- We compare to ensembles of theories
- Algorithm (Dutra et al., ILP 2002)
  - Use K different initial seeds
  - Learn K theories containing C clauses each
  - Rank examples by the number of theories that cover them
- Need to balance C for high performance
  - Small C leads to low recall
  - Large C leads to converging theories
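The ranking step above is simple voting. A minimal sketch (theories simulated as sets of covered example ids, all names hypothetical) of ranking examples by how many of the K theories cover them:

```python
# K = 3 toy theories; each is the set of example ids it classifies positive.
theories = [{1, 2, 3}, {2, 3}, {3, 4}]
examples = [1, 2, 3, 4, 5]

# Vote count per example: how many theories cover it.
votes = {e: sum(e in t for t in theories) for e in examples}
ranked = sorted(examples, key=lambda e: -votes[e])
print(ranked)  # → [3, 2, 1, 4, 5]: example 3, covered by all three, ranks first
```

Sweeping a threshold down this ranking yields the recall-precision curve that a single theory, which gives only one operating point, cannot provide.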

Page 25:

Aleph Ensembles (100 theories)

[Chart: Testset AURPC (0.00–0.60) as a function of the number of clauses used per theory (0–300)]

Page 26:

Evaluation Metrics
- Area Under Recall-Precision Curve (AURPC)
  - All curves standardized to cover the full recall range
  - AURPC averaged over 5 folds
- Number of clauses considered
  - Rough estimate of time
- Both are “stop anytime” parallel algorithms

[Chart: Recall (0–1.0) vs. Precision (0–1.0)]

Page 27:

AURPC Interpolation
- Convex interpolation in RP space?
- Precision interpolation is counterintuitive
- Example: 1,000 positive & 9,000 negative

  TP     FP     TP Rate   FP Rate   Recall   Prec
  500     500    0.50      0.06      0.50     0.50
  750    4750    0.75      0.53      0.75     0.14
  1000   9000    1.00      1.00      1.00     0.10

(The middle row is the interpolated point: it lies on the straight line between the endpoints in ROC space, but its precision falls far below the straight line between them in RP space.)
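The interpolated row can be reproduced directly. A short sketch (not the authors' code) that interpolates linearly in counts (TP and FP), as the slide's example does, rather than linearly in precision:

```python
def interpolate_fp(tp_a, fp_a, tp_b, fp_b, tp_mid):
    """Linearly interpolate FP as TP goes from tp_a to tp_b."""
    frac = (tp_mid - tp_a) / (tp_b - tp_a)
    return fp_a + frac * (fp_b - fp_a)

# Endpoints from the slide: 1,000 positives, 9,000 negatives.
fp_mid = interpolate_fp(500, 500, 1000, 9000, tp_mid=750)
precision = 750 / (750 + fp_mid)
recall = 750 / 1000

print(round(fp_mid), round(recall, 2), round(precision, 2))  # → 4750 0.75 0.14
```

Naive linear interpolation in RP space would instead give precision 0.30 at recall 0.75 (halfway between 0.50 and 0.10), which is why connecting PR points with straight lines overstates the area under the curve.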

Page 28:

AURPC Interpolation

[Chart: Recall (0.0–1.0) vs. Precision (0.0–1.0), showing correct vs. incorrect interpolation between two points]

Page 29:

Experimental Methodology
- Performed five-fold cross-validation
- Variation of parameters
  - Gleaner (20 recall bins)
    - # seeds = {25, 50, 75, 100}
    - # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
  - Ensembles (0.75 minacc, 35,000 nodes)
    - # theories = {10, 25, 50, 75, 100}
    - # clauses per theory = {1, 5, 10, 15, 20, 25, 50}

Page 30:

Results: Testfold 5 at 1,000,000 clauses

[Chart: Recall (0.0–1.0) vs. Precision (0.0–1.0) curves for Aleph Ensembles and Gleaner]

Page 31:

Results: Gleaner vs Aleph Ensembles

[Chart: Testset AURPC (0.00–0.50) vs. number of clauses generated (logarithmic scale, 10,000 to 100,000,000) for Gleaner and Aleph Ensembles]

Page 32:

Further Results

[Chart: Testset AURPC (0.00–0.50) vs. number of clauses generated (logarithmic scale) for Gleaner, Aleph Ensembles, and Ensembles 1K]

Page 33:

Conclusions
- Gleaner
  - Focuses on recall and precision
  - Keeps a wide spectrum of clauses
  - Good results in few CPU cycles
- Aleph ensembles
  - ‘Early stopping’ helpful
  - Require more CPU cycles
- AURPC
  - Useful metric for comparison
  - Interpolation unintuitive

Page 34:

Future Work
- Improve Gleaner performance over time
- Explore alternate clause combinations
- Better understanding of AURPC
- Search for clauses that optimize AURPC
- Examine more ILP link-learning datasets
- Use Gleaner with other ML algorithms

Page 35:

Acknowledgements
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- Condor Group
- David Page
- Vitor Santos Costa, Ines Dutra
- Soumya Ray, Marios Skounakis, Mark Craven

Dataset available at (URL in proceedings):
ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location

Page 36:

Deleted Scenes
- Clause Weighting
- Gleaner Algorithm
- Director Commentary (on/off)

Page 37:

Take-Home Message
- Definition of Gleaner
  - One who gathers grain left behind by reapers
- Gleaner and ILP
  - Many clauses are constructed and evaluated in ILP hypothesis search
  - We need to make better use of those that aren’t the highest scoring ones

Thanks, Questions?

Page 38:

Clause Weighting
- Single Theory Ensemble
  - rank by how many clauses cover each example
- Weight clauses using tuneset statistics
  - CN2 (average precision of matching clauses)
  - Lowest False Positive Rate score
  - Cumulative F1 score, Recall, Precision, Diversity

Page 39:

Clause Weighting

[Chart: AURPC (0.00–0.45) for weighting schemes: Precision, Equal, Ranked List, CN2]

Page 40:

Gleaner Algorithm
- Create B equal-sized recall bins
- For K different seeds
  - Generate rules using Rapid Random Restart
  - Record the best rule (precision × recall) found for each bin
- For each recall bin b
  - Find threshold L of K clauses such that the recall of “at least L of K clauses match” equals the recall for this bin
- Find recall and precision on the testset using each bin’s “at least L of K” decision process
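The first phase of the algorithm above, filling the B × K table of best clauses, can be sketched as a toy skeleton. This is not the authors' implementation: clauses are simulated as (recall, precision) pairs, and `fake_rrr` is a hypothetical stand-in for Rapid Random Restart.

```python
import random

def gleaner_bins(B, K, search):
    """bins[b][k] = best (recall, precision) clause from seed k in bin b,
    scored by precision * recall, as in the Gleaner algorithm slide."""
    bins = [dict() for _ in range(B)]
    for k in range(K):
        for recall, precision in search(k):
            b = min(int(recall * B), B - 1)   # which recall bin this clause fills
            best = bins[b].get(k)
            if best is None or precision * recall > best[0] * best[1]:
                bins[b][k] = (recall, precision)
    return bins

def fake_rrr(seed, n=200):
    """Stand-in for Rapid Random Restart: n random candidate clauses."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

bins = gleaner_bins(B=20, K=10, search=fake_rrr)
# Each bin holds at most one clause per seed, i.e. at most K = 10 clauses.
print(len(bins), max(len(b) for b in bins))
```

The second phase, choosing the threshold L per bin and evaluating each bin's "at least L of K" decision process on the test set, turns this table into one high-precision classifier per recall level.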