GMOD Meeting, May 2005

Patent Pending, Caltech Proprietary

Textpresso Search engine for Biomedical Literature

~Eimear Kenny~

Born out of frustration…

• Search systems effective at locating interesting papers … BUT … have to read the paper to get to the facts.

• Many data are not contained in abstract or index … therefore, important papers can be missed by search engines.

The Perfect System

Type in question and the search engine tells you the answer!

Full text

“Conceptual search”

Enter Textpresso

• Searches full text
  – returns any sentences that match your query

• Provides two ways to query
  – search raw data – Keyword search
  – search meta-data – Category search
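The two query modes can be illustrated with a toy sentence store. This is only a sketch, not Textpresso's code: the corpus layout, the category names attached to each sentence, the third sentence, and both function names are assumptions made for the example.

```python
import re

# Hypothetical toy corpus: raw sentence text plus the categories its terms
# were annotated with. The category assignments are illustrative only.
SENTENCES = [
    {"text": "lin-39 acts downstream of Ras", "categories": {"gene", "pathway"}},
    {"text": "lin-25 acts indirectly via sur-2", "categories": {"gene"}},
    {"text": "The vulva forms during larval development", "categories": {"cell"}},
]


def keyword_search(sentences, keyword):
    """Search the raw data: return sentences containing the keyword as a whole token."""
    # Hyphens count as part of a token so that "lin" does not match "lin-39".
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(keyword) + r"(?![\w-])", re.IGNORECASE
    )
    return [s["text"] for s in sentences if pattern.search(s["text"])]


def category_search(sentences, category):
    """Search the meta-data: return sentences annotated with the category."""
    return [s["text"] for s in sentences if category in s["categories"]]
```

A keyword query looks only at the words of the sentence, while a category query looks only at the annotations, so the two can return different sentences for related questions.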

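….. activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29.

Category labels on the sentence:

• activation: Biological Process
• let-7: Gene
• expression: Biological Process
• downregulates: Regulation
• LIN-41: Molecular Function
• inhibition: Regulation
• lin-29: Gene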

<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE article SYSTEM "/var/www/html/textpresso.dtd">
<article>
//
<sentence id='s7'>
//
  <process grammar='NN' source='textpresso' type='general' biosynthesis='no'> activation</process>
  <pposition grammar='IN' type='of'> of </pposition>
  <gene grammar='JJ' reference='direct'> let-7 </gene>
  <text>RNA</text>
  <process grammar='NN' source='textpresso' type='molecular' biosynthesis='expression'> expression</process>
  <regulation grammar='NNS' type='negative'> down regulates</regulation>
  <function grammar='NNP' reference='direct' source='textpresso' protein='yes'> LIN-41 </function>
  <pposition grammar='TO' type='to'>to </pposition>
  <text>relieve</text>
  <regulation grammar='NNS' type='negative'> inhibition </regulation>
  <pposition grammar='IN' type='of'> of</pposition>
  <gene grammar='NNP' reference='direct'> lin-29 </gene>
  <text>. </text>
</sentence>
//
</article>
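As a rough illustration of how such an annotated sentence could be read back, the sketch below parses a trimmed copy of the sentence element with Python's standard `xml.etree.ElementTree` and lists the category-tagged terms. The function name and the choice to skip `<text>` and `<pposition>` elements are assumptions for the example, not part of Textpresso.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annotated sentence from the slide (some attributes omitted).
SENTENCE_XML = """
<sentence id='s7'>
  <process grammar='NN' type='general'>activation</process>
  <pposition grammar='IN' type='of'>of</pposition>
  <gene grammar='JJ' reference='direct'>let-7</gene>
  <text>RNA</text>
  <process grammar='NN' type='molecular'>expression</process>
  <regulation grammar='NNS' type='negative'>down regulates</regulation>
  <function grammar='NNP' reference='direct' protein='yes'>LIN-41</function>
  <pposition grammar='TO' type='to'>to</pposition>
  <text>relieve</text>
  <regulation grammar='NNS' type='negative'>inhibition</regulation>
  <pposition grammar='IN' type='of'>of</pposition>
  <gene grammar='NNP' reference='direct'>lin-29</gene>
  <text>.</text>
</sentence>
"""


def tagged_terms(xml_string, skip=("text", "pposition")):
    """Return (category tag, term) pairs in sentence order, skipping filler tags."""
    root = ET.fromstring(xml_string.strip())
    return [
        (child.tag, (child.text or "").strip())
        for child in root
        if child.tag not in skip
    ]
```

Calling `tagged_terms(SENTENCE_XML)` yields the seven category-bearing terms of the sentence, which is the information a category search needs.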

Categories

• GENE: Locus, let-60, eat-4, LIN-12

• REGULATION: repress, enhanced, upregulate, inhibition

• PATHWAY: precursor, upstream, cascade, descendants

• CELL: Neuron, EMS, HSN, AB, Vulva precursor

37 Categories!!!

lin-39 acts downstream of Ras

lin-25 acts indirectly via sur-2

eor-1 and eor-2 are closely involved in Ras signaling

Find sentences from the literature that describe genetic interaction!

>= 2 named “Gene” && (>= 1 “Association” || >= 1 “Regulation”)
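The query above can be written as a predicate over per-sentence category counts. This is a minimal sketch: the function name and the dictionary of counts are assumptions, and how Textpresso itself counts "named" genes is not shown on the slide.

```python
from collections import Counter


def describes_genetic_interaction(category_counts):
    """Apply: >= 2 Gene && (>= 1 Association || >= 1 Regulation)."""
    genes = category_counts.get("Gene", 0)
    associations = category_counts.get("Association", 0)
    regulations = category_counts.get("Regulation", 0)
    return genes >= 2 and (associations >= 1 or regulations >= 1)


# Hypothetical counts for one sentence: two named genes and one regulation term.
example = Counter({"Gene": 2, "Regulation": 1})
```

Here `describes_genetic_interaction(example)` is true, while a sentence with two genes but no Association or Regulation term would be rejected.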

Using Textpresso to expedite curation

Sampling 200 sentences ……

Selection                     Sentences containing gene-gene interactions
Random                        1 (0.5%)
2 named genes                 13 (6.5%)
2 named genes + 1 category    39 (19.5%)

Adding Textpresso category enriches 3-fold!
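The percentages and the 3-fold figure follow directly from the sampled counts. The short check below reproduces them; the variable and function names are made up for the example.

```python
SAMPLE_SIZE = 200  # sentences sampled per selection method (from the slide)

# Sentences found to contain a gene-gene interaction, per selection method.
HITS = {
    "random": 1,
    "2 named genes": 13,
    "2 named genes + 1 category": 39,
}


def percent(hits, total=SAMPLE_SIZE):
    """Share of sampled sentences that contain an interaction, in percent."""
    return 100 * hits / total


def fold_enrichment(hits, baseline_hits):
    """How many times more hits one selection method gives than another."""
    return hits / baseline_hits


category_vs_genes = fold_enrichment(
    HITS["2 named genes + 1 category"], HITS["2 named genes"]
)
```

The 3-fold enrichment compares the category query against the two-gene query (39 versus 13 hits), not against random sampling.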

Installation and Adaptation of Textpresso for your Domain

Dependencies

• Tested on Redhat 9.0 or Debian 3.1 (kernel 2.4.20 or higher)
  – should work on any unix-based system

• Apache (1.3.29), Perl (5.6.1 or higher)

• Perl Modules:
  – XML::Parser, XML::RegExp
  – XML::XQL, XML::Checker
  – XML::DOM, XML::Parser::PerlSAX
  – PDF::Create

• Brill Tagger (C compiler)
  – parts of speech tagger (http://research.microsoft.com/~brill/)

• XPDF
  – pdftotext utility (http://www.foolabs.com/xpdf/)
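Before installing, it can help to check which of these dependencies are already present. The sketch below is one possible pre-flight check, not a script shipped with Textpresso: the function names are invented, Apache is left out because its binary name varies between systems, and the Perl check relies on the standard `perl -MModule -e 1` idiom.

```python
import shutil
import subprocess

# Command-line tools named on the slide.
REQUIRED_TOOLS = ["perl", "pdftotext"]

# Perl modules listed on the slide.
REQUIRED_PERL_MODULES = [
    "XML::Parser", "XML::RegExp", "XML::XQL", "XML::Checker",
    "XML::DOM", "XML::Parser::PerlSAX", "PDF::Create",
]


def missing_tools(tools):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def perl_module_available(module):
    """Return True if `perl -M<module> -e 1` exits cleanly."""
    if shutil.which("perl") is None:
        return False
    result = subprocess.run(
        ["perl", "-M" + module, "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_perl_modules(modules):
    """Return the Perl modules that fail to load."""
    return [m for m in modules if not perl_module_available(m)]
```

Running `missing_tools(REQUIRED_TOOLS)` and `missing_perl_modules(REQUIRED_PERL_MODULES)` gives a list of what still needs to be installed on a given machine.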

Download

http://www.textpresso.org

http://www.gmod.org

Unpack and Install

Web-site

Web Scripts

Database

Build Scripts

[Figure: build pipeline. Sources: Journal Web-sites, Wormbase Database, Textpresso Ontology. Scripts: CollectPapers, CollectAbstracts, PDF2Text, Preprocessor, Text2XML, Index Maker. Data: Electronic PDF, Abstracts, Raw Text, Parts-of-speech Text, Annotated Text, Keywords, Textpresso Database.]

Tailoring Pt 1 – Text Collection

• Abstracts Collection
  – can be downloaded from central resource such as PubMed – PubFetch!

• PDF Collection:
  – limited to open access journals (PLoS Biology) or journals to which you subscribe
  – inject_pmid script from Textpresso web-site (Allen Day)
  – manual download from journal web-site

Tailoring Pt 2 – Adapting Ontology

• Almost all “Relationship and Description” and “Syntax and Grammar” categories and some “Biological Concepts” categories are generic to the biomedical domain.

• Some new categories can use existing category structure (yeast genes replace worm genes)

• Some de novo categories would be useful (Cell Cycle, Chromosomal Aberrations, Disease etc).

Tailoring Pt 3 – Adapting Interface

Textpresso 2.0

Overhaul Code

• Adding another layer of abstraction
  – definition files and modules

    use constant SY_ANNOTATION_FIELDS => { abstract => 'abstract/',
                                           body     => 'body/',
                                           title    => 'title/' };

  … defines which fields are to be annotated during the build process

• Advantages:
  – easy to adapt software (no script tweaking)
  – easy to add new modules
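The idea of a definition file can be shown with a small Python rendering of the same constant. This is an analogy rather than Textpresso code (which is Perl): the base directory, the function name and the extra `caption` field are hypothetical.

```python
from pathlib import PurePosixPath

# Python rendering of the SY_ANNOTATION_FIELDS constant from the slide:
# field name -> sub-directory holding that field's text.
ANNOTATION_FIELDS = {
    "abstract": "abstract/",
    "body": "body/",
    "title": "title/",
}


def annotation_dirs(base_dir, fields=ANNOTATION_FIELDS):
    """Map each field to be annotated to its directory under base_dir."""
    base = PurePosixPath(base_dir)
    return {name: str(base / subdir) for name, subdir in fields.items()}


# Adapting the build means editing the definition, not the build script.
# "caption" is a made-up extra field used only to show the point.
extended = dict(ANNOTATION_FIELDS, caption="caption/")
```

A build script that loops over `annotation_dirs(...)` picks up the extra field without any change to its own code, which is the "no script tweaking" advantage.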

New Features

Distributed Searches

Variable Scope

New Sort Modes

100 sentences per hour!

Search for patterns in sentences

The life-extension phenotype of old-1 was completely suppressed by daf-16 ( m26 ) ( Figure 1e ) .

<determiner> <text> <phenotype> <preposition> <gene> <auxiliary> <effect> <regulation> <preposition> <gene> <bracket> <text> <bracket> <bracket> <text> <text> <bracket> <text>

Developed a hidden Markov model to identify common patterns of text that surround the required entities.

Hidden Markov Model

[Figure: hidden Markov model with Begin and End states, three Match states and insert (I) states. Labels on the figure: <gene>, <gene>, <regulation>.]
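To make the Match/Insert structure concrete, here is a small Viterbi scorer over category-tag sequences. It is a sketch under stated assumptions, not the model from the talk: the order of the match states (gene, regulation, gene) is guessed from the example sentence, every probability is invented rather than trained, and there are no delete states, so every match state must be visited.

```python
import math

# Assumed profile: order of match states guessed from the example sentence.
MATCH_TAGS = ["gene", "regulation", "gene"]

# All numbers below are invented for illustration, not trained values.
VOCAB_SIZE = 20        # assumed number of distinct category tags
P_MATCH_EMIT = 0.9     # a match state emits its own tag with this probability
P_TO_INSERT = 0.5      # Begin or match state -> the insert state that follows it
P_INSERT_STAY = 0.5    # insert state -> itself


def _emit_logp(kind, k, tag):
    """Log emission probability of `tag` in insert ("I") or match ("M") state k."""
    if kind == "I":
        return math.log(1.0 / VOCAB_SIZE)  # inserts emit any tag uniformly
    if tag == MATCH_TAGS[k - 1]:
        return math.log(P_MATCH_EMIT)
    return math.log((1.0 - P_MATCH_EMIT) / (VOCAB_SIZE - 1))


def viterbi_log_score(tags):
    """Log probability of the best Begin-to-End path that emits `tags`."""
    n = len(MATCH_TAGS)
    best = {("M", 0): 0.0}  # ("M", 0) stands for the silent Begin state
    for tag in tags:
        nxt = {}
        for (kind, k), logp in best.items():
            p_ins = P_TO_INSERT if kind == "M" else P_INSERT_STAY
            moves = [(("I", k), math.log(p_ins))]
            if k < n:  # the next match state exists
                moves.append((("M", k + 1), math.log(1.0 - p_ins)))
            for state, trans in moves:
                cand = logp + trans + _emit_logp(state[0], state[1], tag)
                if cand > nxt.get(state, -math.inf):
                    nxt[state] = cand
        best = nxt
    # Only the last match state and its insert state may move to End.
    finals = [
        logp + math.log(1.0 - (P_TO_INSERT if kind == "M" else P_INSERT_STAY))
        for (kind, k), logp in best.items()
        if k == n
    ]
    return max(finals, default=-math.inf)
```

With these toy parameters, a tag sequence that contains gene, regulation and gene in that order scores higher than a sequence of the same length with the gene tags replaced, which is the kind of separation the scoring is meant to show.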

True test sentences have similar score to training sentences

Textpresso Team

Developers:
Eimear Kenny
Hans-Michael Müller

Code Contributors:
Allen Day (many patches including inject_pmid)
Robert Li (alternative pdf2text converter)
Stan Dong and Christopher Lane (code optimization for speed)
Juancarlos Chan (web-site scripting)

Information Extraction Analysis:
Andrei Petcherski

Paper Collection:
Daniel Wang

Principal Investigator:
Paul Sternberg
