Chemical named entity recognition and literature mark-up Colin Batchelor Informatics Department...
![Page 1: Chemical named entity recognition and literature mark-up Colin Batchelor Informatics Department Royal Society of Chemistry batchelorc@rsc.org.](https://reader035.fdocuments.net/reader035/viewer/2022062511/5513ca0955034679748b4994/html5/thumbnails/1.jpg)
Chemical named entity recognition and literature mark-up
Colin Batchelor
Informatics Department
Royal Society of Chemistry
batchelorc@rsc.org
Overview
Project Prospect: what we find and how we find it.
RDF: How should we be disseminating it?
Next steps: Basics for a chemical ontology.
Project Prospect: What do we find?
Chemical compounds
Chemical terms from the IUPAC Gold Book
Gene products: function, process, location
Nucleotide and polypeptide sequence terms
Cell types
Project Prospect: How do we find it?
For compound names:
~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
~20% PubChem
~20% ChemDraw
For compound numbers:
~70% author ChemDraw
~30% editors
RDF in an RSS reader
RDF: how we do it now
Content module from RSS 1.0
http://web.resource.org/rss/1.0/modules/content
In what sense does an article “contain” pyridine or base pairs?
We would much rather have proper RDF predicates, e.g. “is_about” or “mentions”.
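One way to picture the alternative: each annotation becomes its own triple with an explicit predicate, rather than an entry in a content bag. The Python sketch below is only an illustration. The predicate namespace http://example.org/prospect# and the helper names triple and annotate are hypothetical, not an existing vocabulary or API.

```python
# Hypothetical predicate namespace, used here for illustration only.
EX = "http://example.org/prospect#"


def triple(subject, predicate, obj):
    """Serialise one triple of URIs as an N-Triples line."""
    return f"<{subject}> <{EX}{predicate}> <{obj}> ."


def annotate(article, mentions, about):
    """Emit 'mentions' for every entity found and 'is_about' for main topics."""
    lines = [triple(article, "mentions", uri) for uri in mentions]
    lines += [triple(article, "is_about", uri) for uri in about]
    return lines


# Article and term URIs as they appear in the feed shown in these slides.
article = "http://xlink.rsc.org/?DOI=b716356h&RSS=1"
so_term = "http://purl.org/obo/owl/SO#SO:0000028"

for line in annotate(article, mentions=[so_term], about=[]):
    print(line)
```

Which entities count as "about" rather than merely "mentioned" is left open here, as it is in the slides.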
RDF: what it looks like now
<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
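A consumer of the feed could pull the annotations back out of an item along these lines. This is a minimal sketch using Python's standard xml.etree.ElementTree. The sample item is cut down, the methane InChI is a short stand-in for the real compound, and the function names annotation_uris and kind are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Cut-down stand-in for one feed item; methane's InChI replaces the long one.
ITEM = """<item xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <content:items>
    <rdf:Bag>
      <rdf:li><content:item rdf:about="info:inchi/InChI=1/CH4/h1H4"/></rdf:li>
      <rdf:li><content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/></rdf:li>
    </rdf:Bag>
  </content:items>
</item>"""


def annotation_uris(item_xml):
    """Return the rdf:about URI of every content:item in one feed item."""
    root = ET.fromstring(item_xml)
    about = "{%s}about" % NS["rdf"]
    return [el.get(about) for el in root.iterfind(".//content:item", NS)]


def kind(uri):
    """Crude split: InChI identifiers are compounds, the rest ontology terms."""
    return "compound" if uri.startswith("info:inchi/") else "ontology term"


for uri in annotation_uris(ITEM):
    print(kind(uri), uri)
```

Note that nothing in the bag says how the article relates to each URI; the code can only list what is there, which is the limitation the previous slide points out.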
Basics for a chemical ontology
1. Unambiguous representation of objects of chemical discourse
2. Proper parthood relations
Basics for a chemical ontology: 1. Objects of chemical discourse
Must be able to represent and clearly distinguish:
Compounds
Classes of compound
Parts of molecules
Mixtures
Would be nice to have:
Disambiguation cues for the first three
Imidazole
An imidazole
The imidazole side-chain/group/ring
Can ChEBI handle this?
Imidazoles (!) (CHEBI:24780)
Imidazole (CHEBI:16069)
Imidazole ring: not yet
Imidazolyl group: not yet (but methyl, benzyl, etc.)
… and there are no disambiguation cues
Disambiguation
One Sense per Discourse (Gale et al. 1992)
… this doesn’t hold at all
One Sense per Collocation (Yarowsky 1993)
… matches our intuitions
Disambiguation: What a one sense per collocation feature set might look like
CLASS:
w(–1) = a, an, the, this
w(0) plural (bit of a cheat, as not a collocation)
PART:
w(–1) = bridging, terminal
w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
w(+1)w(+2) = “building block”, “protecting group”, “side chain”
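The feature set above could be turned into a rule-based tagger roughly as follows. This is a sketch, not the project's implementation. The rule order (PART cues before CLASS cues), the fallback label COMPOUND, and the naive "ends in s" plural test are all assumptions made for the example, and the cue lists contain only the words shown on the slide.

```python
# Cue words copied from the slide; its "(and many more)" still applies.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT2 = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def sense(tokens, i):
    """Guess the sense of the chemical name at tokens[i] from its neighbours."""
    w = [t.lower() for t in tokens]
    prev = w[i - 1] if i > 0 else ""
    nxt = w[i + 1] if i + 1 < len(w) else ""
    nxt2 = w[i + 2] if i + 2 < len(w) else ""
    # Assumed ordering: PART cues win, so "the imidazole side chain"
    # is not claimed by the determiner rule.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT2:
        return "PART"
    # Naive plural test; a real system would use a proper morphological check.
    if prev in CLASS_PREV or w[i].endswith("s"):
        return "CLASS"
    # Assumed default when no cue fires.
    return "COMPOUND"


print(sense("the imidazole side chain".split(), 1))  # PART
print(sense("an imidazole".split(), 1))              # CLASS
print(sense("imidazole was added".split(), 0))       # COMPOUND
```

Because each decision depends only on the immediate neighbours of one mention, the same name can receive different senses within a single article, which fits the observation that one sense per discourse does not hold here.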
Basics for a chemical ontology: 2. Parthood relations
Parthood in ChEBI means at least three things:
Is necessarily chemically part of:
carbonyl group part_of carbonyl compounds
Basics for a chemical ontology: 2. Parthood relations
Is possibly chemically part of:
Lead(2+) part_of lead diacetate
(most lead(2+) isn’t)
Electron part_of muonium (!)
Basics for a chemical ontology: 2. Parthood relations
Is part of a mixture
Kanamycin A part_of kanamycin
Basics for a chemical ontology: 2. Parthood relations
Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y. (Smith et al., “Relations in biomedical ontologies”, 2005)
Carbonyl compound has_part carbonyl group
Lead diacetate has_part lead(2+) (?!)
Muonium has_part electron
Kanamycin has_part kanamycin A (?!)
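The all-some pattern can be checked mechanically against instance data. The sketch below uses a toy table invented for illustration (three samples, each with a type and the types of its parts) and two hypothetical helper functions. It shows why "lead(2+) part_of lead diacetate" fails as a class-level statement while the has_part reading holds. It does not address the doubts the slide flags with "(?!)".

```python
# Toy instance data, invented for illustration: each whole maps to its
# type and to the set of types of its parts.
WHOLES = {
    "sample_1": ("lead diacetate", {"lead(2+)", "acetate"}),
    "sample_2": ("lead diacetate", {"lead(2+)", "acetate"}),
    "sample_3": ("lead dichloride", {"lead(2+)", "chloride"}),
}


def all_some_has_part(whole_type, part_type):
    """True if every instance of whole_type has some part of part_type."""
    parts = [p for t, p in WHOLES.values() if t == whole_type]
    return bool(parts) and all(part_type in p for p in parts)


def all_some_part_of(part_type, whole_type):
    """True if every instance of part_type is part of some whole_type."""
    hosts = [t for t, p in WHOLES.values() if part_type in p]
    return bool(hosts) and all(t == whole_type for t in hosts)


print(all_some_has_part("lead diacetate", "lead(2+)"))  # True
print(all_some_part_of("lead(2+)", "lead diacetate"))   # False
```

The second check fails because sample_3 contains lead(2+) without being lead diacetate, the toy counterpart of "most lead(2+) isn't".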
Basics for a chemical ontology: 2. Parthood relations
Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships
Carbonyl compound molecule has_part carbonyl substituent
Muonium atom has_part electron
Kanamycin has_component kanamycin A
Lead diacetate has_component lead(2+) (?!)
Open questions
How do we represent the relationship between named entities and documents?
How do we integrate ontologies and word-sense disambiguation?
What is the best way of distinguishing molecules and samples?
Acknowledgements
University of Cambridge: Peter Corbett
OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)
www.projectprospect.org