Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation
![Page 1: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/1.jpg)
Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation
SAB 2008
WormBase Literature Curators Textpresso
![Page 2: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/2.jpg)
How does data get into WormBase?
User submission (email, web forms)
First-pass curation
Institution: Sanger Institute
SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......
![Page 3: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/3.jpg)
Current first-pass curation pipeline
Publication
Flagging/Triage
Curation
![Page 4: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/4.jpg)
User submissions: first-pass flagging/triage
Growing desire among biocurators for user submissions
The authors are the first people to know what data are in a paper
TAIR – partnered with Plant Physiology; web interface for data submission (February 2008); voluntary, link included in acceptance letter
Submitter
Paper identifier
Locus name
Term/descriptor, method
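A minimal sketch of how one author-submitted record could be represented, assuming the fields listed on this slide (submitter, paper identifier, locus name, term/descriptor, method). The class name, field types, and the `validate` helper are hypothetical and do not describe the actual TAIR or WormBase forms.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FirstPassSubmission:
    """One author-submitted flag; field names mirror the slide, types are assumed."""
    submitter: str
    paper_id: str
    locus_name: str
    term: str
    method: str = ""  # treated as optional here, which is an assumption


def validate(sub: FirstPassSubmission) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field in ("submitter", "paper_id", "locus_name", "term"):
        if not getattr(sub, field).strip():
            problems.append(f"missing {field}")
    return problems
```

A curator-side script could call `validate` on each incoming record and route only clean records to triage.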
![Page 5: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/5.jpg)
User-submitted first-pass flags - WormBase
![Page 6: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/6.jpg)
User data-submission forms: Expression Pattern
![Page 7: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/7.jpg)
Data extraction: Textpresso
Full-text searching
Keywords and/or categories
Müller, Kenny, and Sternberg. PLoS Biology, November, 2004.
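A minimal sketch of sentence-level searching by keywords and/or categories. It assumes a tiny hand-made category lexicon (`CATEGORIES`) and a naive sentence splitter; neither reflects Textpresso's actual implementation or the contents of its categories.

```python
import re

# Hypothetical category lexicon; real category term lists are far larger.
CATEGORIES = {
    "regulation": {"suppresses", "enhances", "represses"},
    "cellular_component": {"membrane", "membranes", "nucleus", "cytoplasm"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased word tokens; internal hyphens are kept so 'gar-3' stays whole."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower()))


def search(text, keywords=(), categories=()):
    """Return sentences that contain every keyword and at least one term
    from every requested category."""
    hits = []
    for sentence in split_sentences(text):
        words = tokens(sentence)
        has_keywords = all(k.lower() in words for k in keywords)
        has_categories = all(words & CATEGORIES[c] for c in categories)
        if has_keywords and has_categories:
            hits.append(sentence)
    return hits
```

Matching at the sentence level rather than the document level is what lets a curator read a hit and decide on an annotation without opening the full paper.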
![Page 8: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/8.jpg)
Textpresso: What data types?
Paper – entity association: pattern matching
Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12
Fact extraction: specialized categories
Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
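A minimal sketch of paper–entity association by pattern matching, using the transgene example on this slide. The regular expression assumes names follow the form seen in gqIs3 and oxIs12 (a short lowercase lab prefix, then `Is` or `Ex`, then a number); the function name and input layout are hypothetical, and the pattern actually used in the pipeline is not given in the slides.

```python
import re

# Assumed naming pattern: 1-3 lowercase prefix letters, then "Is" or "Ex",
# then a number, e.g. gqIs3.
TRANSGENE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def paper_transgene_links(papers):
    """Map each paper ID to the de-duplicated transgene names found in its
    text, sorted lexicographically. Papers with no match are omitted."""
    links = {}
    for paper_id, text in papers.items():
        names = sorted(set(TRANSGENE.findall(text)))
        if names:
            links[paper_id] = names
    return links
```

Because the pattern is purely lexical, its output would still need the kind of manual spot check the curators describe before connections go into the database.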
![Page 9: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/9.jpg)
Textpresso-mediated CC curation: from sentences to annotations
![Page 10: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/10.jpg)
Textpresso: How much data?
Transgenes: 1,100 new paper-transgene connections; 250 new transgenes
checked manually – 95% accuracy; ultimately, connections will go directly into database
Genetic Interactions: 1,875 (1/2007 – 5/2008); ~5,600 total interactions; keeping current with new papers
GO Cellular Component Annotations: 515 (1/2007 – 5/2008); 2-3X rate prior to categories; nearly complete; keeping up with new data (1-2 hours/week)
![Page 11: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/11.jpg)
Textpresso: Other data types
How else can we use Textpresso?
Other data types: Molecular Function Assays, Gene Product Interactions
Pilot: GO molecular function annotations for protein kinase activity
keyword: phosphorylate
category: C. elegans proteins
13 new GO annotations/hour
Extension of this: protein modifications – not yet captured in WB
Pilot: Gene product interactions for WB and BIND
keywords: physically interact
category: C. elegans proteins
310 matches in 237 documents
22 physical interactions – top 15 papers
![Page 12: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/12.jpg)
Textpresso for triage: Classifying text based on content
Multiple levels:
Organismal triage – C. elegans, Drosophila
Identify, prioritize information-rich papers
Flag for specific data types
Multiple strategies (using existing first-pass papers as training set):
Machine learning – SVM (Support Vector Machine)
Word frequency analysis
Hand-crafted categories
Combine SVM and categories
Supplement with word weighting, contextual analyses
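A minimal sketch of the word-frequency strategy only, assuming already-flagged first-pass papers serve as positive training text. It uses smoothed log-ratio word weights, not an SVM, and the function names and scoring rule are illustrative assumptions rather than the method used in the pipeline.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercased word tokens."""
    return re.findall(r"[a-z][a-z0-9\-]*", text.lower())


def train_word_weights(positive_docs, negative_docs):
    """Weight each word by the add-one-smoothed log-ratio of its relative
    frequency in flagged versus unflagged papers."""
    pos = Counter(w for d in positive_docs for w in words(d))
    neg = Counter(w for d in negative_docs for w in words(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum the weights of known words; a positive score suggests the paper
    resembles the flagged training papers."""
    return sum(weights.get(w, 0.0) for w in words(text))
```

Training one weight table per data type would give a separate flag score for each type, which matches the "flag for specific data types" level of triage.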
![Page 13: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/13.jpg)
Keeping better track of curation statistics.....
![Page 14: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/14.jpg)
.....and making curation statistics more transparent to users.
Users could search for curation status of any paper
Users could search for curation status of a given data type
Each database release would report newly curated papers
Each database release would document increases in data-type curation
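A minimal sketch of the bookkeeping these four points imply: lookup by paper, lookup by data type, and a per-release summary. The class, its method names, and the release labels are hypothetical; the slides do not describe how WormBase stores curation status.

```python
from collections import defaultdict


class CurationTracker:
    """Records which data types have been curated for which papers, and in
    which release each paper/data-type pair first appeared."""

    def __init__(self):
        # (paper_id, data_type) -> release in which it was first curated
        self._curated = {}

    def mark_curated(self, paper_id, data_type, release):
        # setdefault keeps the earliest release if the pair is marked again
        self._curated.setdefault((paper_id, data_type), release)

    def paper_status(self, paper_id):
        """Data types curated so far for one paper."""
        return sorted(dt for (pid, dt) in self._curated if pid == paper_id)

    def data_type_status(self, data_type):
        """Papers curated so far for one data type."""
        return sorted(pid for (pid, dt) in self._curated if dt == data_type)

    def release_report(self, release):
        """Papers with new curation in this release, plus the per-data-type
        increase in curated paper/data-type pairs."""
        papers = set()
        increases = defaultdict(int)
        for (pid, dt), rel in self._curated.items():
            if rel == release:
                papers.add(pid)
                increases[dt] += 1
        return {"papers": sorted(papers), "increases": dict(increases)}
```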
![Page 15: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation](https://reader036.fdocuments.net/reader036/viewer/2022062800/5681404c550346895dabbff3/html5/thumbnails/15.jpg)
WormBase Literature Curation
Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger
Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech
Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech
First Pass, Genetic Interactions: Andrei Petcherski, Caltech
Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech
Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech
Anatomy Ontology, Cell Function: Raymond Lee, Caltech
Microarrays, SAGE: Igor Antoshechkin, Caltech
Sequence, Gene Structures: Sanger, Wash U
Authors, Papers: Cecilia Nakamura, Daniel Wang
Curation Tools, Database: Juancarlos Chan, Caltech