Text Mining Applications for Literature Curation


Page 1: Text Mining Applications for  Literature Curation

Text Mining Applications for Literature Curation

Kimberly Van Auken

WormBase Consortium

Textpresso

Gene Ontology Consortium

Page 2: Text Mining Applications for  Literature Curation

WormBase: A Database for C. elegans and Other Nematodes

www.wormbase.org

Page 3: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms aggregate with other worms

and what contributes to that behavior?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Page 4: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms (Strain) aggregate with

other worms and what contributes to

that behavior?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Page 5: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms (Strain) aggregate with other worms

and what contributes to that behavior?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Strain information:

August 1, 1972

Pineapple field in Hawaii

Page 6: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms aggregate with

other worms (Phenotype) and what contributes

to the behavior?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Page 7: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms aggregate with

other worms (Phenotype) and what contributes to

that behavior?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820)

Life stage ontology, e.g., L3 larval stage

Assay, e.g., food source

Page 8: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms (Strain) aggregate with

other worms (Phenotype) and what contributes to

that behavior (Molecular Basis)?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Page 9: Text Mining Applications for  Literature Curation

Curating Diverse Data Types

Which worms (Strain) aggregate with

other worms (Phenotype) and what contributes

to that behavior (Molecular Basis)?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics

Gene: npr-1

Variation: ad609 (T(83)->I and T(144)->A)

Gene Ontology for npr-1:

Biological Process: feeding behavior

Molecular Function: neuropeptide receptor activity

Cellular Component: integral to plasma membrane

Page 10: Text Mining Applications for  Literature Curation

Literature Curation Workflow

PubMed keyword search – ‘elegans’

Full text paper acquisition

Data type flagging and entity recognition

Detailed curation/Fact extraction

Page 11: Text Mining Applications for  Literature Curation

Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’

PMID, Title, Authors, Abstract, Article type, Journal, Curator actions

Download citation XML
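The daily search step can be sketched as follows. This is a minimal illustration, not the WormBase script: it assumes the public NCBI E-utilities ESearch endpoint, and the parameter choices (`reldate`, `datetype`, `retmax`), the function names, and the sample response with placeholder PMIDs are all assumptions made for the example. The sketch only builds the query URL and parses a response, so it runs offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint; the parameter values are assumptions.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term="elegans", days_back=1, max_results=200):
    """Build an ESearch URL for PubMed records added in the last `days_back` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,  # restrict to the most recent N days
        "datetype": "edat",    # Entrez date, i.e. when the record was added
        "retmax": max_results,
    }
    return ESEARCH_URL + "?" + urlencode(params)


def parse_pmids(esearch_xml):
    """Extract the PMIDs listed in an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hand-written sample in the shape of an ESearch response (placeholder PMIDs).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

print(build_search_url())
print(parse_pmids(SAMPLE_RESPONSE))
```

In a real pipeline the URL would be fetched once a day and the returned PMIDs passed to a second request that downloads the citation XML for curator review.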

Page 13: Text Mining Applications for  Literature Curation

Data Type Flagging/Triage

General classification of papers

What types of experiments are in a paper? e.g. RNAi phenotypes, Variation phenotypes,

Expression patterns, Physical interactions

Page 14: Text Mining Applications for  Literature Curation

Data Type Flagging Methods

Main pipeline:

Support Vector Machines (SVMs)

Other methods:

Textpresso category searches

Hidden Markov models

Pattern matching scripts

Page 15: Text Mining Applications for  Literature Curation

Support Vector Machines: Document Classification

Machine learning models

Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments)

Positives: 100s, Negatives: 1000s

Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
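A toy version of this kind of document classifier is sketched below. It is not the WormBase pipeline: it assumes bag-of-words features and trains a linear SVM objective (hinge loss plus L2 penalty) by plain stochastic subgradient descent, and the training sentences, the hyperparameters, and the score thresholds used for high/medium/low confidence are all invented for illustration.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def score(weights, bias, text):
    """Signed score: positive means the paper likely contains the data type."""
    return sum(weights.get(t, 0.0) * c for t, c in featurize(text).items()) + bias


def train_linear_svm(docs, labels, epochs=50, eta=0.1, lam=0.01):
    """Minimize hinge loss + L2 penalty by stochastic subgradient descent.

    labels are +1 (paper has the data type) or -1 (it does not).
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            margin = y * score(weights, bias, doc)
            for term in weights:  # L2 penalty shrinks every weight a little
                weights[term] *= 1.0 - eta * lam
            if margin < 1.0:      # inside the margin: step toward the label
                for term, count in featurize(doc).items():
                    weights[term] = weights.get(term, 0.0) + eta * y * count
                bias += eta * y
    return weights, bias


def classify(weights, bias, text, high=1.0, medium=0.5):
    """Return (label, confidence); the thresholds are arbitrary assumptions."""
    s = score(weights, bias, text)
    label = "positive" if s > 0 else "negative"
    size = abs(s)
    confidence = "high" if size >= high else "medium" if size >= medium else "low"
    return label, confidence


# Invented stand-ins for gold standard papers with/without RNAi experiments.
DOCS = [
    "rnai knockdown caused lethality",
    "rnai feeding caused sterility",
    "antibody staining showed expression",
    "crystal structure was solved",
    "genome sequence was assembled",
]
LABELS = [1, 1, -1, -1, -1]

WEIGHTS, BIAS = train_linear_svm(DOCS, LABELS)
print(classify(WEIGHTS, BIAS, "rnai caused arrest"))
print(classify(WEIGHTS, BIAS, "antibody staining was performed"))
```

Binning the magnitude of the score is one simple way to get the high, medium, and low confidence classes mentioned above; a production system would calibrate those cut-offs against held-out papers.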

Page 16: Text Mining Applications for  Literature Curation

Data Type Flagging – Support Vector Machines

SVMs trained for ten different data types:

Antibody

Genetic Interactions

Physical Interactions

Gene Expression

Regulation of Gene Expression

Variation Phenotypes

Overexpression Phenotypes

RNAi Phenotypes

Variation Sequence Change

Gene Structure Correction

See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

Page 17: Text Mining Applications for  Literature Curation

Curation from Support Vector Machine Results

SVM results lead directly to manual curation:

e.g. RNAi Phenotypes

Results from SVMs are processed further

e.g. Variation Sequence Change

Pattern Matching Script – regular expressions

New variations (entity recognition)

e.g. mg366, ju43, e1360
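A pattern-matching pass of this kind might look like the sketch below. The regular expression is an assumption based only on the examples above (one to three lowercase letters followed by digits); it is not the WormBase script, and the example sentence and the `known` set are made up. A pattern this loose will also hit unrelated tokens such as `p53`, so its output would still need curator review.

```python
import re

# Assumed allele pattern: 1-3 lowercase letters (lab prefix) followed by digits.
ALLELE_PATTERN = re.compile(r"\b[a-z]{1,3}\d+\b")


def find_candidate_variations(text, known=frozenset()):
    """Return allele-like tokens in order of first appearance, skipping known ones."""
    seen, candidates = set(), []
    for match in ALLELE_PATTERN.finditer(text):
        name = match.group()
        if name not in known and name not in seen:
            seen.add(name)
            candidates.append(name)
    return candidates


# Made-up sentence for illustration.
SENTENCE = "Mutations mg366, ju43 and e1360 were mapped; e1360 was then sequenced."

print(find_candidate_variations(SENTENCE))
# Treat e1360 as already in the database, so only new variations are reported.
print(find_candidate_variations(SENTENCE, known={"e1360"}))
```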

Page 18: Text Mining Applications for  Literature Curation

Data Type Flagging – Textpresso

www.textpresso.org

C. elegans, Mouse, D. melanogaster, Neuroscience, Arabidopsis, Dicty, Wnt Pathway, HIV, Nematodes, S. cerevisiae, RegulonDB, … many others

Full text of articles

Terms, phrases, entities – semantically tagged

Keyword or category search

Match within sentence or entire paper

Page 19: Text Mining Applications for  Literature Curation

Textpresso Categories

Pre-existing dictionaries, vocabularies:

Gene names

ChEBI (Chemical Entities of Biological Interest)

PATO

Sequence Ontology (SO)

Manually constructed by curators using language from published literature:

Sequence similarity – orthologous, conserved

Localization assays – GFP, antibody, fluorescence

Experimental verbs – required, regulates, exhibits

Page 20: Text Mining Applications for  Literature Curation

Data Type Flagging - Textpresso Category Searches

Data Type: C. elegans Human Disease Homologs

Three-category Textpresso search:

C. elegans gene

‘Ortholog’, ‘Homolog’, ‘Similar’, ‘Model’

Human disease

“We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase.”
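The three-category, within-sentence match can be sketched as below. This is not Textpresso's implementation: the tiny regular expressions stand in for its curated category dictionaries, and the gene-name pattern, the disease terms, the naive sentence splitter, and the second test sentence are assumptions made for illustration.

```python
import re

# Toy stand-ins for Textpresso categories (assumptions, far smaller than the real ones).
CATEGORIES = {
    "c_elegans_gene": re.compile(r"\b[a-z]{3,4}-\d+\b"),
    "similarity": re.compile(r"\b(ortholog|homolog|similar|model)\w*", re.IGNORECASE),
    "human_disease": re.compile(r"\b(lymphoma|cancer|diabetes)\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_matching_all(text):
    """Return sentences in which every category has at least one match."""
    return [
        sentence
        for sentence in split_sentences(text)
        if all(pattern.search(sentence) for pattern in CATEGORIES.values())
    ]


TEXT = (
    "We map this defect in dauer response to a mutation in the scd-2 gene, "
    "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
    "homolog, a proto-oncogene receptor tyrosine kinase. "
    "The scd-2 gene is expressed in neurons."
)

for hit in sentences_matching_all(TEXT):
    print(hit)
```

Requiring all three categories in the same sentence, rather than anywhere in the paper, trades recall for precision: the second sentence mentions the gene but is not returned because it names no disease or similarity term.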

Page 21: Text Mining Applications for  Literature Curation

Literature Curation Workflow

PubMed keyword search – ‘elegans’

Full text paper acquisition

Data type flagging and entity recognition

Detailed curation/Fact extraction

Page 22: Text Mining Applications for  Literature Curation

Textpresso: Semi-Automated Fact Extraction

Genetic Interactions

Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1).

Physical Interactions – after SVM document classifier

Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1 (Figure 3A and B, lane 7).

Gene Ontology – Cellular Component Curation

During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).

Page 23: Text Mining Applications for  Literature Curation

Textpresso: Semi-Automated GO Cellular Component Curation

Textpresso Search Results

Suggested GO Annotations

Gene Products

Textpresso Component

See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.

Page 24: Text Mining Applications for  Literature Curation

Future Directions

Textpresso, other methods (HMMs) applied to additional data types

e.g. GO Biological Process curation (Phenotypes)

Focusing triage and fact extraction on novel findings

How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results?

e.g. Commonly used molecular markers

Page 25: Text Mining Applications for  Literature Curation

Literature Annotation Tool – Tracking Evidence

WB, GO Common Annotation Framework, BioCreative

Page 26: Text Mining Applications for  Literature Curation

Summary

Text Mining Applications for Literature Curation:

Paper approval and full text acquisition

Data type flagging and entity recognition

Fact extraction – record evidence

All steps of our pipeline incorporate some form of semi- or fully automated approach:

Scripts for downloads, pattern matching

Support Vector Machines for document classification

Textpresso for flagging and fact extraction

(Hidden Markov Models for flagging, fact extraction)

Page 27: Text Mining Applications for  Literature Curation

The WormBase Consortium, Textpresso

WormBase – Caltech: Paul Sternberg, Juancarlos Chan, Wen Chen, Chris Grove, Ranjana Kishore, Raymond Lee, Cecilia Nakamura, Daniela Raciti, Gary Schindelman, Kimberly Van Auken, Daniel Wang, Xiaodong Wang, Karen Yook. Former member: Ruihua Fang

Textpresso – Caltech: Hans-Michael Müller, Yuling Li, James Done. Former member: Arun Rangarajan

WormBase – OICR, Toronto: Lincoln Stein, Abigail Cabunoc, Todd Harris, JD Wong

WormBase – Washington University: John Spieth, Tamberlyn Bieri, Phil Ozersky

WormBase – EBI, Sanger, Hinxton, UK: Richard Durbin, Paul Kersey, Matt Berriman, Paul Davis, Michael Paulini, Kevin Howe, Mary Ann Tuli, Gary Williams

CGC – Oxford University, Oxford, UK: Jonathan Hodgkin

Page 28: Text Mining Applications for  Literature Curation

Hidden Markov Models: Semi-Automated GO Molecular Function Curation

For each sentence, HMM yields:

True positive score

False positive score

For each sentence, curator assigns:

Fully curatable (entity + indication of enzymatic activity)

Positive (experiment was performed and a result reported, but no entity)

False Positive (not about enzymatic activity at all)