Bionlp 07

June 29, 2007 BioNLP ’07, Prague Open Text Mining Initiative Tony Hammond Nature Publishing Group

description

Presentation on OTMI at BioNLP 2007 on June 29, 2007. This was a one-day workshop attached to the ACL 2007 conference (45th Annual Meeting of the Association for Computational Linguistics), held in the quiet outskirts of Prague.

Transcript of Bionlp 07

Page 1: Bionlp 07

June 29, 2007 BioNLP ’07, Prague

Open Text Mining Initiative

Tony Hammond, Nature Publishing Group

Page 2: Bionlp 07

Publishing Opportunity

Opening up sites for text mining can potentially lead to content misuse and lost business

But what if content could be provided openly, in a form fit for purpose, while control is still maintained?

Hence OTMI: a proposed industry standard from Nature

Page 3: Bionlp 07

History

Brief summary of idea at Bio-IT World Conference (Boston, 3-5 Apr. ’06)

Nascent (Apr. & Jun. ’06, Feb. ’07)

Nature’s 27 Apr. ’06 editorial, “Machine Readability”

Discussions also continued on blogs: O’Reilly Radar, HubLog, Open Access News, Ars Technica, LiveSerials

Page 4: Bionlp 07

The Big Ideas

Present full text in nonlinear order (i.e. not in document order)

Keep size of ordered text strings (“snippets”) under publisher control

Streamline content for consumption:
Use standard XML schema
Clean text of extraneous markup

Include word vectors

Page 5: Bionlp 07

Design Goals

Enable text mining on full text

Facilitate document categorization

Allow domain entities (e.g. chemical compounds, genomes, etc.) to be mapped

Encourage published entity maps to reference original document

Page 6: Bionlp 07

Generator

Publisher HTML or XML full-text document -> OTMI generator process -> publisher OTMI document

Conversion tailored to publisher-specific source

Page 7: Bionlp 07

Standards

Core content (XML) presented as Atom “Entry” document (see RFC 4287)

Manifest (XML) in OPML (known format); optional

Metadata uses PRISM (IDEAlliance)

References use DOI (NISO, [ISO])

Stopwords from NLM table

Page 8: Bionlp 07

Anatomy

Basic components:
Document sections
Vectors (word counts)
“Snippets” (units of full text)
Figures
References (with DOI)

Page 9: Bionlp 07

Entry / Data

<atom:entry xmlns:otmi='...' xmlns:prism='...' xmlns:atom='...'>
  <atom:title>Structural biology Dangerous liaisons on neurons</atom:title>
  <atom:author>
    <atom:name>Giampietro Schiavo</atom:name>
  </atom:author>
  <atom:id>info:doi/10.1038/nature05410</atom:id>
  <atom:link href='http://dx.doi.org/10.1038/nature05410' />
  <atom:link href='http://.../nature/journal/v444/n7122/otmi/nature05410.otmi' rel='self' />
  <atom:link href='http://opentextmining.org/' rel='related' />
  <atom:published>2006-12-21T00:00:00Z</atom:published>
  <atom:updated>2006-12-21T00:00:00Z</atom:updated>
  <atom:rights type='html'>(c) 2006 Nature Publishing Group</atom:rights>
  <prism:publicationName>Nature</prism:publicationName>
  <prism:volume>444</prism:volume>
  <prism:number>7122</prism:number>
  <prism:startingPage>1019</prism:startingPage>
  <prism:endingPage>1020</prism:endingPage>
  <prism:issn>0028-0836</prism:issn>
  <prism:eIssn />
  <otmi:data/>
</atom:entry>
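
As a rough illustration of how a consumer might read the entry metadata shown on this slide, here is a minimal Python sketch. The namespace URIs are elided on the slide, so the sample below uses placeholder `urn:example:*` values that are purely hypothetical; to avoid depending on them, the code matches elements by local name only. The function name `entry_metadata` is also an assumption, not part of OTMI.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: the slide elides the real ones ('...'),
# so these values are hypothetical and only make the sample parse.
SAMPLE_ENTRY = """\
<atom:entry xmlns:atom='urn:example:atom'
            xmlns:prism='urn:example:prism'
            xmlns:otmi='urn:example:otmi'>
  <atom:title>Structural biology Dangerous liaisons on neurons</atom:title>
  <atom:id>info:doi/10.1038/nature05410</atom:id>
  <prism:publicationName>Nature</prism:publicationName>
  <prism:volume>444</prism:volume>
  <prism:number>7122</prism:number>
  <prism:startingPage>1019</prism:startingPage>
  <otmi:data/>
</atom:entry>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def entry_metadata(xml_text):
    """Map local element name to text for direct children that carry text."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        text = (child.text or '').strip()
        if text:
            # Keep the first value seen when an element name repeats.
            fields.setdefault(local_name(child.tag), text)
    return fields


meta = entry_metadata(SAMPLE_ENTRY)
print(meta['id'], meta['publicationName'], meta['volume'], meta['startingPage'])
```

Matching on local names is a shortcut for this sketch; a real consumer would bind the actual Atom, PRISM and OTMI namespace URIs.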

Page 10: Bionlp 07

Data / Vectors

<otmi:data>
  <otmi:stoplist href='http://www.nature.com/nature/journal/v444/n7122/otmi/otmi-stoplist.xml' />
  <otmi:section name='body'>
    <otmi:section name='other'>
      <otmi:vectors> ... </otmi:vectors>
      <otmi:snippets> ... </otmi:snippets>
    </otmi:section>
  </otmi:section>
  <otmi:figure> ... </otmi:figure>
  <otmi:references> ... </otmi:references>
</otmi:data>

Page 11: Bionlp 07

Vectors

<otmi:vectors>
  <otmi:split-regex>(?-mix:\s+\W+|\W+\s+|\s+|\/)</otmi:split-regex>
  ...
  <otmi:vector count='9'>vesicles</otmi:vector>
  <otmi:vector count='8'>al</otmi:vector>
  <otmi:vector count='8'>et</otmi:vector>
  <otmi:vector count='8'>protein</otmi:vector>
  <otmi:vector count='8'>synaptic</otmi:vector>
  <otmi:vector count='8'>vesicle</otmi:vector>
  <otmi:vector count='7'>chain</otmi:vector>
  <otmi:vector count='7'>neuron</otmi:vector>
  ...
</otmi:vectors>
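
The word-vector idea on this slide can be sketched in Python as follows. This is not the Ruby generator itself: the pattern is a Python rendering of the slide's split regex, while lowercasing, the stopword filter, and the sort order (count descending, then alphabetical, which matches the ordering visible on the slide) are assumptions. The name `word_vector` and the sample text are invented for illustration.

```python
import re
from collections import Counter

# Python rendering of the slide's Ruby pattern (?-mix:\s+\W+|\W+\s+|\s+|\/).
SPLIT_RE = re.compile(r'\s+\W+|\W+\s+|\s+|/')


def word_vector(text, stopwords=frozenset()):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    # Lowercasing is an assumption; the slide only shows lowercase words.
    tokens = [t.lower() for t in SPLIT_RE.split(text) if t]
    counts = Counter(t for t in tokens if t not in stopwords)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample text, not taken from the article.
sample = "synaptic vesicles and synaptic proteins / vesicles fuse, vesicles reload"
vector = word_vector(sample, stopwords={"and"})
for word, count in vector:
    print(f"<otmi:vector count='{count}'>{word}</otmi:vector>")
```

Note that this pattern only strips punctuation that sits next to whitespace, so a word followed by punctuation at the very end of the input keeps that punctuation.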

Page 12: Bionlp 07

Data / Snippets

<otmi:data>
  <otmi:stoplist href='http://www.nature.com/nature/journal/v444/n7122/otmi/otmi-stoplist.xml' />
  <otmi:section name='body'>
    <otmi:section name='other'>
      <otmi:vectors> ... </otmi:vectors>
      <otmi:snippets> ... </otmi:snippets>
    </otmi:section>
  </otmi:section>
  <otmi:figure> ... </otmi:figure>
  <otmi:references> ... </otmi:references>
</otmi:data>

Page 13: Bionlp 07

Snippets

<otmi:snippets>
  <otmi:split-regex>(?-mix:\.\s+(?=[A-Z]))</otmi:split-regex>
  ...
  <otmi:snippet>The amino acids lining this cleft are very similar to those found in BoNT/G (ref. 10 ), but differ in the other toxin family members, which explains why different BoNTs recognize distinct protein receptors</otmi:snippet>
  <otmi:snippet>The model predicts that the interaction of BoNTs with both PSGs and protein receptors is necessary to explain their awesome potency , with a different protein receptor being recognized by each BoNT</otmi:snippet>
  <otmi:snippet>The rigid character of this interaction might be further enhanced by the association of the toxins heavy chain with nearby negatively charged lipid molecules, which play an accessory role in stabilizing the toxin on membranes </otmi:snippet>
  <otmi:snippet>The simplest possibility is that BoNT/B binds to PSGs and synaptotagmin within the lumen of a synaptic vesicle that is fused to the neuron membrane</otmi:snippet>
  <otmi:snippet>The toxins then escape from the vesicle lumen when the vesicles are acidified as they reload with neurotransmitters</otmi:snippet>
  <otmi:snippet>The two binding sites would firmly anchor the tip of BoNT/B to the vesicles inner surface, constraining the toxins mobility</otmi:snippet>
  ...
</otmi:snippets>
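
A minimal Python sketch of how snippets like these could be produced: split the text with the slide's sentence-boundary pattern, then sort the pieces so that document order is not preserved. The lexical sort is an assumption inferred from the alphabetical ordering of the snippets on the slide; the name `make_snippets` and the sample text are invented for illustration.

```python
import re

# Python rendering of the slide's Ruby pattern (?-mix:\.\s+(?=[A-Z])):
# a full stop, whitespace, and then (not consumed) an uppercase letter.
SNIPPET_RE = re.compile(r'\.\s+(?=[A-Z])')


def make_snippets(text):
    """Split text into sentence-like snippets and discard document order."""
    parts = [p.strip() for p in SNIPPET_RE.split(text)]
    # Sorting lexically is an assumed way of scrambling document order.
    return sorted(p for p in parts if p)


# Invented sample text, not taken from the article.
sample = "The toxins then escape. BoNT/B binds to PSGs. A model predicts this."
snippets = make_snippets(sample)
for s in snippets:
    print(f"<otmi:snippet>{s}</otmi:snippet>")
```

Because the split consumes the full stop, snippets lose their trailing period, as on the slide; only the final sentence of the input keeps its own.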

Page 14: Bionlp 07

Data / Figures

<otmi:data>
  <otmi:stoplist href='http://www.nature.com/nature/journal/v444/n7122/otmi/otmi-stoplist.xml' />
  <otmi:section name='body'>
    <otmi:section name='other'>
      <otmi:vectors> ... </otmi:vectors>
      <otmi:snippets> ... </otmi:snippets>
    </otmi:section>
  </otmi:section>
  <otmi:figure> ... </otmi:figure>
  <otmi:references> ... </otmi:references>
</otmi:data>

Page 15: Bionlp 07

Figures

<otmi:figure>
  <otmi:title>
    <otmi:reduced-text>Possible binding sites botulinum neurotoxin B (BoNT/B) neurons. Crystal studies Jin et al . Chai et al . suggest BoNT/B invades neurons stowing away carriers known synaptic vesicles. forming complex lipid molecules (polysialogangliosides, PSGs) vesicle protein ( synaptotagmin or synaptotagmin II) neuronal membrane. complex stabilized interactions neighbouring acidic lipid molecules (orange). BoNT/B enter open vesicles neurons membrane, one three possible sequences. , BoNT/B enters vesicle directly forms required complex. b , BoNT/B binds first PSGs membrane, transferred synaptic vesicle containing synaptotagmin. c , BoNT/B forms full complex membrane, synaptotagmin left behind inaccurate vesicle recycling. transferred lumen vesicle.</otmi:reduced-text>
  </otmi:title>
  <otmi:caption>
    <otmi:reduced-text>Possible binding sites botulinum neurotoxin B (BoNT/B) neurons. Crystal studies Jin et al . Chai et al . suggest BoNT/B invades neurons stowing away carriers known synaptic vesicles. forming complex lipid molecules (polysialogangliosides, PSGs) vesicle protein ( synaptotagmin or synaptotagmin II) neuronal membrane. complex stabilized interactions neighbouring acidic lipid molecules (orange). BoNT/B enter open vesicles neurons membrane, one three possible sequences. , BoNT/B enters vesicle directly forms required complex. b , BoNT/B binds first PSGs membrane, transferred synaptic vesicle containing synaptotagmin. c , BoNT/B forms full complex membrane, synaptotagmin left behind inaccurate vesicle recycling. transferred lumen vesicle.</otmi:reduced-text>
  </otmi:caption>
</otmi:figure>
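
The reduced text above reads like a caption with stoplist words dropped. A minimal Python sketch of that idea follows, under loud assumptions: the input sentence is a guessed reconstruction of the caption's opening (the original wording is not on the slide), the two-word stoplist is a hypothetical stand-in for the NLM table, and `reduce_text` is an invented name.

```python
import re


def reduce_text(text, stopwords):
    """Drop whitespace-separated words whose letters match a stopword."""
    kept = []
    for word in text.split():
        # Compare on a punctuation-free, lowercase form of the word,
        # but keep the word itself (with its punctuation) in the output.
        bare = re.sub(r'\W+', '', word).lower()
        if bare not in stopwords:
            kept.append(word)
    return ' '.join(kept)


# Guessed reconstruction of the caption's first sentence (assumption).
caption = "Possible binding sites for botulinum neurotoxin B (BoNT/B) on neurons."
# Hypothetical stand-in for the NLM stopword table.
stoplist = {"for", "on"}

reduced = reduce_text(caption, stoplist)
print(reduced)
```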

Page 16: Bionlp 07

Data / References

<otmi:data>
  <otmi:stoplist href='http://www.nature.com/nature/journal/v444/n7122/otmi/otmi-stoplist.xml' />
  <otmi:section name='body'>
    <otmi:section name='other'>
      <otmi:vectors> ... </otmi:vectors>
      <otmi:snippets> ... </otmi:snippets>
    </otmi:section>
  </otmi:section>
  <otmi:figure> ... </otmi:figure>
  <otmi:references> ... </otmi:references>
</otmi:data>

Page 17: Bionlp 07

References

<otmi:references>
  <otmi:ref-id>info:doi/10.1038/nature05387</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1038/nature05411</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1016/0968-0004(86)90282-3</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1016/0014-5793(95)01471-3</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1083/jcb.200305098</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1074/jbc.M403945200</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1126/science.1123654</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1016/j.febslet.2006.02.074</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1083/jcb.200508170</otmi:ref-id>
  <otmi:refs-noid>3</otmi:refs-noid>
</otmi:references>
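
To show how a consumer might use this block, here is a small Python sketch that turns each `info:doi/` identifier into a resolver link of the `http://dx.doi.org/` form used in the entry's `atom:link`. Reading `refs-noid` as the number of references without a DOI is an interpretation of the element name, not something the slide states. The namespace URI in the sample is a hypothetical placeholder, and `reference_links` is an invented name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (hypothetical); the real one is not on the slide.
SAMPLE_REFS = """\
<otmi:references xmlns:otmi='urn:example:otmi'>
  <otmi:ref-id>info:doi/10.1038/nature05387</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1083/jcb.200305098</otmi:ref-id>
  <otmi:refs-noid>3</otmi:refs-noid>
</otmi:references>
"""

DOI_PREFIX = 'info:doi/'


def reference_links(xml_text):
    """Return (resolver URLs, count taken from the refs-noid element)."""
    root = ET.fromstring(xml_text)
    links, without_id = [], 0
    for child in root:
        name = child.tag.rsplit('}', 1)[-1]  # local name, namespace dropped
        text = (child.text or '').strip()
        if name == 'ref-id' and text.startswith(DOI_PREFIX):
            links.append('http://dx.doi.org/' + text[len(DOI_PREFIX):])
        elif name == 'refs-noid' and text:
            without_id = int(text)
    return links, without_id


links, without_id = reference_links(SAMPLE_REFS)
print(links, without_id)
```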

Page 18: Bionlp 07

Repository

http://nature.com/otmi

Discovery / Navigation: *.opml -> *.opml

Content:
“tarballs” - *.tar.gz (issues)
documents - *.otmi (articles)

Page 19: Bionlp 07

Autodiscovery

All content on Nature.com to be linked for autodiscovery:
Abstracts, Full Text (HTML)
Web Feeds (RSS/Atom)

Use link elements such as:

<link rel="otmi" type="application/xml" href="../otmi/nature04614.otmi" />
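
A crawler could pick up such links along these lines. This is a minimal Python sketch using only the standard library; the class name `OtmiLinkFinder`, the page markup, and the `example.org` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OtmiLinkFinder(HTMLParser):
    """Collect absolute URLs of <link rel="otmi"> elements in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed through here.
        if tag != 'link':
            return
        attr = dict(attrs)
        rels = (attr.get('rel') or '').lower().split()
        if 'otmi' in rels and attr.get('href'):
            # Resolve relative hrefs against the page's own URL.
            self.links.append(urljoin(self.base_url, attr['href']))


# Hypothetical page URL and markup, for illustration only.
page_url = 'http://example.org/nature/journal/v440/n7082/abs/nature04614.html'
page_html = '''<html><head>
<link rel="stylesheet" href="/style.css" />
<link rel="otmi" type="application/xml" href="../otmi/nature04614.otmi" />
</head><body></body></html>'''

finder = OtmiLinkFinder(page_url)
finder.feed(page_html)
print(finder.links)
```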

Page 20: Bionlp 07

Tools

Ruby generator script:
Open source
GPL’ed
Modular (Nature-specific code marked out)
Handles multiple DTDs

Page 21: Bionlp 07

Present Status

Now being integrated into production workflow (Jun./Jul. ’07)

Two years of archive (’05, ’06) already available online:
Nature
Nature Genetics
Nature Reviews Drug Discovery
Nature Structural & Molecular Biology
The Pharmacogenomics Journal

Page 22: Bionlp 07

Improvements

Future possibilities:
Add in references to associated data files and/or database entries
For open-access titles, allow text in normal human-readable form
etc. (as feedback indicates - your call)

Page 23: Bionlp 07

More Information

[email protected] (public discussion)

[email protected] (private feedback)

opentextmining.org/
Wiki pages
Resources (draft spec, scripts, etc.)

Page 24: Bionlp 07

Thanks

Tony Hammond <[email protected]>

or <[email protected]>