June 29, 2007 BioNLP ’07, Prague
Open Text Mining Initiative
Tony Hammond, Nature Publishing Group
Publishing Opportunity
Opening up sites for text mining can potentially lead to content misuse and lost business
But what if content could be provided openly, in a form fit for purpose, while control is maintained?
Hence, OTMI - a proposed industry standard from Nature
History
Brief summary of idea at Bio-IT World Conference (Boston, 3-5 Apr. ’06)
Nascent (Apr. & Jun. ’06, Feb. ’07)
Nature’s 27 Apr. ’06 Editorial “Machine Readability”
Discussions also continued on blogs: O’Reilly Radar, HubLog, Open Access News, Ars Technica, LiveSerials
The Big Ideas
Present full text in nonlinear order (i.e. not in document order)
Keep size of ordered text strings (“snippets”) under publisher control
Streamline content for consumption
  Use standard XML schema
  Clean text of extraneous markup
Include word vectors
Design Goals
Enable text mining on full text
Facilitate document categorization
Allow domain entities (e.g. chemical compounds, genomes, etc.) to be mapped
Encourage published entity maps to reference original document
Generator
Publisher HTML or XML Full Text Document -> OTMI Generator Process -> Publisher OTMI Document
Conversion Tailored to Publisher-Specific Source
Standards
Core Content (XML) presented as Atom “Entry” Document (see RFC 4287)
Manifest (XML) in OPML (known format)
  Optional
Metadata uses PRISM (IDEAlliance)
References use DOI (NISO, [ISO])
Stopwords from NLM table
Anatomy
Basic components:
  Document sections
  Vectors (word counts)
  “Snippets” (units of full text)
  Figures
  References (with DOI)
Entry / Data
<atom:entry xmlns:otmi='...' xmlns:prism='...' xmlns:atom='...'>
  <atom:title>Structural biology Dangerous liaisons on neurons</atom:title>
  <atom:author>
    <atom:name>Giampietro Schiavo</atom:name>
  </atom:author>
  <atom:id>info:doi/10.1038/nature05410</atom:id>
  <atom:link href='http://dx.doi.org/10.1038/nature05410' />
  <atom:link href='http://.../nature/journal/v444/n7122/otmi/nature05410.otmi' rel='self' />
  <atom:link href='http://opentextmining.org/' rel='related' />
  <atom:published>2006-12-21T00:00:00Z</atom:published>
  <atom:updated>2006-12-21T00:00:00Z</atom:updated>
  <atom:rights type='html'>(c) 2006 Nature Publishing Group</atom:rights>
  <prism:publicationName>Nature</prism:publicationName>
  <prism:volume>444</prism:volume>
  <prism:number>7122</prism:number>
  <prism:startingPage>1019</prism:startingPage>
  <prism:endingPage>1020</prism:endingPage>
  <prism:issn>0028-0836</prism:issn>
  <prism:eIssn />
  <otmi:data/>
</atom:entry>
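As a rough illustration of how a consumer might read such an entry, here is a minimal Python sketch. The Atom namespace URI is the one defined in RFC 4287. The PRISM and OTMI namespace URIs are elided on the slide, so the `urn:example:...` values below are placeholders, and `read_entry` is a hypothetical helper rather than part of any OTMI tooling.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",  # Atom namespace (RFC 4287)
    "prism": "urn:example:prism",           # placeholder: URI elided on the slide
    "otmi": "urn:example:otmi",             # placeholder: URI elided on the slide
}

# Trimmed copy of the slide's entry, using the placeholder namespaces above.
SAMPLE = """<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:prism="urn:example:prism" xmlns:otmi="urn:example:otmi">
  <atom:title>Structural biology Dangerous liaisons on neurons</atom:title>
  <atom:id>info:doi/10.1038/nature05410</atom:id>
  <prism:publicationName>Nature</prism:publicationName>
  <prism:volume>444</prism:volume>
  <prism:startingPage>1019</prism:startingPage>
  <otmi:data/>
</atom:entry>"""


def read_entry(xml_text):
    """Return a small dict of bibliographic fields from an Atom entry."""
    entry = ET.fromstring(xml_text)

    def text(path):
        node = entry.find(path, NS)
        return node.text if node is not None else None

    doi_uri = text("atom:id") or ""
    prefix = "info:doi/"
    return {
        "title": text("atom:title"),
        "doi": doi_uri[len(prefix):] if doi_uri.startswith(prefix) else None,
        "journal": text("prism:publicationName"),
        "volume": text("prism:volume"),
        "first_page": text("prism:startingPage"),
    }


meta = read_entry(SAMPLE)
print(meta["doi"], meta["journal"], meta["volume"])
```

Against a real OTMI file, the placeholder URIs would need to be replaced with whatever the file itself declares.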
Data / Vectors
<otmi:data>
  <otmi:stoplist href='http://www.nature.com/nature/journal/v444/n7122/otmi/otmi-stoplist.xml' />
  <otmi:section name='body'>
    <otmi:section name='other'>
      <otmi:vectors> ... </otmi:vectors>
      <otmi:snippets> ... </otmi:snippets>
    </otmi:section>
  </otmi:section>
  <otmi:figure> ... </otmi:figure>
  <otmi:references> ... </otmi:references>
</otmi:data>
Vectors
<otmi:vectors>
  <otmi:split-regex>(?-mix:\s+\W+|\W+\s+|\s+|\/)</otmi:split-regex>
  ...
  <otmi:vector count='9'>vesicles</otmi:vector>
  <otmi:vector count='8'>al</otmi:vector>
  <otmi:vector count='8'>et</otmi:vector>
  <otmi:vector count='8'>protein</otmi:vector>
  <otmi:vector count='8'>synaptic</otmi:vector>
  <otmi:vector count='8'>vesicle</otmi:vector>
  <otmi:vector count='7'>chain</otmi:vector>
  <otmi:vector count='7'>neuron</otmi:vector>
  ...
</otmi:vectors>
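The word-count idea can be sketched in a few lines of Python. This is not the Ruby generator itself: it reuses only the inner pattern of the split-regex shown above (the Ruby-style `(?-mix:...)` wrapper is dropped), and the lowercasing, the stopword handling, and the ordering (count descending, then alphabetical, as the sample appears to be) are assumptions. `word_vector` and `to_vector_xml` are hypothetical names.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape

# Inner pattern of the slide's split-regex, without the Ruby flag wrapper.
SPLIT_RE = re.compile(r"\s+\W+|\W+\s+|\s+|/")


def word_vector(text, stopwords=frozenset()):
    """Count tokens, skipping empty strings and stopwords."""
    # Lowercasing is an assumption, not something the slide states.
    tokens = (t.lower() for t in SPLIT_RE.split(text.strip()))
    return Counter(t for t in tokens if t and t not in stopwords)


def to_vector_xml(counts):
    """Render counts as otmi:vector lines, highest count first, then A-Z."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"<otmi:vector count='{n}'>{escape(w)}</otmi:vector>"
            for w, n in ordered]


sample = "Synaptic vesicles fuse. The vesicles reload and/or recycle"
counts = word_vector(sample, stopwords={"the", "and", "or"})
for line in to_vector_xml(counts):
    print(line)
```

Note that the pattern only splits on punctuation that is adjacent to whitespace (or on a slash), so a trailing full stop at the very end of the input would stay attached to the last token.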
Snippets
<otmi:snippets>
  <otmi:split-regex>(?-mix:\.\s+(?=[A-Z]))</otmi:split-regex>
  ...
  <otmi:snippet>The amino acids lining this cleft are very similar to those found in BoNT/G (ref. 10 ), but differ in the other toxin family members, which explains why different BoNTs recognize distinct protein receptors</otmi:snippet>
  <otmi:snippet>The model predicts that the interaction of BoNTs with both PSGs and protein receptors is necessary to explain their awesome potency , with a different protein receptor being recognized by each BoNT</otmi:snippet>
  <otmi:snippet>The rigid character of this interaction might be further enhanced by the association of the toxins heavy chain with nearby negatively charged lipid molecules, which play an accessory role in stabilizing the toxin on membranes </otmi:snippet>
  <otmi:snippet>The simplest possibility is that BoNT/B binds to PSGs and synaptotagmin within the lumen of a synaptic vesicle that is fused to the neuron membrane</otmi:snippet>
  <otmi:snippet>The toxins then escape from the vesicle lumen when the vesicles are acidified as they reload with neurotransmitters</otmi:snippet>
  <otmi:snippet>The two binding sites would firmly anchor the tip of BoNT/B to the vesicles inner surface, constraining the toxins mobility</otmi:snippet>
  ...
</otmi:snippets>
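A minimal Python sketch of the snippet idea follows: split the text at sentence boundaries with the inner pattern of the split-regex above, then emit the pieces out of document order. The alphabetical sort is an assumption based on how the sample snippets appear to be ordered; the slides say only that the order is nonlinear. `make_snippets` is a hypothetical name.

```python
import re

# Inner pattern of the slide's snippet split-regex: a full stop, whitespace,
# and a lookahead for a capital letter (which is not consumed).
SNIPPET_RE = re.compile(r"\.\s+(?=[A-Z])")


def make_snippets(text):
    """Split text into sentence-like units and return them sorted A-Z."""
    parts = (p.strip() for p in SNIPPET_RE.split(text))
    # Sorting discards the original document order (an assumed strategy).
    return sorted(p for p in parts if p)


sample = ("Toxins bind two receptors. The vesicle is then acidified. "
          "A second site anchors the toxin.")
for snippet in make_snippets(sample):
    print(snippet)
```

Because the pattern keys on a full stop followed by a capital letter, an abbreviation such as "et al." before a capitalised word would also trigger a split. The final sentence keeps its full stop, since nothing follows it.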
Figures
<otmi:figure>
  <otmi:title>
    <otmi:reduced-text>Possible binding sites botulinum neurotoxin B (BoNT/B) neurons. Crystal studies Jin et al . Chai et al . suggest BoNT/B invades neurons stowing away carriers known synaptic vesicles. forming complex lipid molecules (polysialogangliosides, PSGs) vesicle protein ( synaptotagmin or synaptotagmin II) neuronal membrane. complex stabilized interactions neighbouring acidic lipid molecules (orange). BoNT/B enter open vesicles neurons membrane, one three possible sequences. , BoNT/B enters vesicle directly forms required complex. b , BoNT/B binds first PSGs membrane, transferred synaptic vesicle containing synaptotagmin. c , BoNT/B forms full complex membrane, synaptotagmin left behind inaccurate vesicle recycling. transferred lumen vesicle.</otmi:reduced-text>
  </otmi:title>
  <otmi:caption>
    <otmi:reduced-text>Possible binding sites botulinum neurotoxin B (BoNT/B) neurons. Crystal studies Jin et al . Chai et al . suggest BoNT/B invades neurons stowing away carriers known synaptic vesicles. forming complex lipid molecules (polysialogangliosides, PSGs) vesicle protein ( synaptotagmin or synaptotagmin II) neuronal membrane. complex stabilized interactions neighbouring acidic lipid molecules (orange). BoNT/B enter open vesicles neurons membrane, one three possible sequences. , BoNT/B enters vesicle directly forms required complex. b , BoNT/B binds first PSGs membrane, transferred synaptic vesicle containing synaptotagmin. c , BoNT/B forms full complex membrane, synaptotagmin left behind inaccurate vesicle recycling. transferred lumen vesicle.</otmi:reduced-text>
  </otmi:caption>
</otmi:figure>
References
<otmi:references>
  <otmi:ref-id>info:doi/10.1038/nature05387</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1038/nature05411</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1016/0968-0004(86)90282-3</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1016/0014-5793(95)01471-3</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1083/jcb.200305098</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1074/jbc.M403945200</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1126/science.1123654</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1016/j.febslet.2006.02.074</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1083/jcb.200508170</otmi:ref-id>
  <otmi:refs-noid>3</otmi:refs-noid>
</otmi:references>
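One way a consumer might use the reference list is to turn each `info:doi/` identifier into a resolvable link, as in the Python sketch below. The `http://dx.doi.org/` resolver prefix is the one used for the article link in the entry sample. The OTMI namespace URI is a placeholder (it is elided on the slides), `read_references` is a hypothetical helper, and `refs-noid` is read here as the number of references lacking an identifier, which is an interpretation rather than something the slides state.

```python
import xml.etree.ElementTree as ET

OTMI_NS = "urn:example:otmi"  # placeholder: the real URI is elided on the slides

# Shortened copy of the slide's reference list.
SAMPLE = f"""<otmi:references xmlns:otmi="{OTMI_NS}">
  <otmi:ref-id>info:doi/10.1038/nature05387</otmi:ref-id>
  <otmi:ref-id>info:doi/10.1083/jcb.200305098</otmi:ref-id>
  <otmi:refs-noid>3</otmi:refs-noid>
</otmi:references>"""


def read_references(xml_text, resolver="http://dx.doi.org/"):
    """Return (list of resolver URLs, count of references without an id)."""
    ns = {"otmi": OTMI_NS}
    root = ET.fromstring(xml_text)
    prefix = "info:doi/"
    links = []
    for node in root.findall("otmi:ref-id", ns):
        ref = (node.text or "").strip()
        if ref.startswith(prefix):
            links.append(resolver + ref[len(prefix):])
    noid = root.findtext("otmi:refs-noid", default="0", namespaces=ns)
    return links, int(noid)


links, without_id = read_references(SAMPLE)
print(links, without_id)
```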
Repository
http://nature.com/otmi
Discovery / Navigation
  *.opml -> *.opml
Content
  “tarballs” - *.tar.gz (issues)
  documents - *.otmi (articles)
Autodiscovery
All content on Nature.com to be linked for autodiscovery
  Abstracts, Full Text (HTML)
  Web Feeds (RSS/Atom)
Use link elements such as:
<link rel="otmi" type="application/xml" href="../otmi/nature04614.otmi" />
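A client could pick up such links with a few lines of Python, sketched here with the standard-library HTML parser. The page URL `http://example.org/...` is made up for illustration, and `OtmiLinkFinder` and `find_otmi_links` are hypothetical names. Only the `rel="otmi"` link element itself comes from the slide.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OtmiLinkFinder(HTMLParser):
    """Collect absolute URLs from <link rel="otmi" href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        href = attributes.get("href")
        if rel == "otmi" and href:
            # Resolve the relative href against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def find_otmi_links(html, base_url):
    finder = OtmiLinkFinder(base_url)
    finder.feed(html)
    return finder.links


page = ('<html><head><link rel="otmi" type="application/xml" '
        'href="../otmi/nature04614.otmi" /></head><body></body></html>')
found = find_otmi_links(page, "http://example.org/journal/full/nature04614.html")
print(found)
```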
Tools
Ruby Generator Script
  Open source, GPL’ed
  Modular (Nature-specific code marked out)
  Handles multiple DTDs
Present Status
Now being integrated into Production Workflow (Jun./Jul. ’07)
Two-year archive (’05, ’06) already available online:
  Nature
  Nature Genetics
  Nature Reviews Drug Discovery
  Nature Structural & Molecular Biology
  The Pharmacogenomics Journal
Improvements
Future possibilities:
  Add in references to associated data files and/or database entries
  For open-access titles, allow text in normal human-readable form
  etc. (as feedback indicates - your call)
More Information
[email protected] (public discussion)
[email protected] (private feedback)
opentextmining.org/
  Wiki pages
  Resources (draft spec, scripts, etc.)