High throughput mining of the plant-science literature
Uploaded by petermurrayrust
Transcript of High throughput mining of the plant-science literature
![Page 1: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/1.jpg)
Mining science from the plant literature
ContentMine
Rothamsted Research, Harpenden, UK, 2016-09-12
Peter Murray-Rust
[1] University of Cambridge, [2] TheContentMine
5,000 scholarly publications every day. How many relate to plants?
![Page 2: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/2.jpg)
Overview
• Scholarly literature
• Automation of downloading, normalization
• Discipline-dependent semantics/ontology
• Classification
• Extraction
• Annotation
• Mining diagrams
• Politics of mining
![Page 3: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/3.jpg)
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
![Page 4: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/4.jpg)
(2x digital music industry!)
![Page 5: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/5.jpg)
Output of scholarly publishing
586,364 Crossref DOIs per month (2015-07) [1]
8000 papers/day
2.5-3 million (papers + supplemental data) per year
Each 3 mm thick: 4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
![Page 6: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/6.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 7: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/7.jpg)
MozFest 2015
ContentMine + TGAC / hack
![Page 8: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/8.jpg)
Terpinome Phytochemists!
Salvia officinalis
Salvia microphylla
Origanum vulgare
Ocimum basilicum
Laurus nobilis [1]
[1] Lauraceae
![Page 9: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/9.jpg)
We can search for
• Plants
• Compounds
• Other species
• Diseases
• Frequent terms
We’ll need: sources, dictionaries, software
![Page 10: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/10.jpg)
Europe PubMedCentral
Over 1 million biomedical papers
![Page 11: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/11.jpg)
![Page 12: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/12.jpg)
![Page 13: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/13.jpg)
Dictionaries!
Diseases (WHO)
![Page 14: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/14.jpg)
![Page 15: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/15.jpg)
![Page 16: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/16.jpg)
[Diagram: CONTENTMINE SOFTWARE (latest 20150908)]
• Sources: Crossref, DailyCrawl, EuPMC, arXiv, CORE, HAL, (university repositories); publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• getpapers (queries) and quickscrape (scrapers): build a catalogue of PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma (normalizer, structurer, semantic tagger; taggers): text, data, figures; sections such as abstract, methods, references, captioned figures (Fig. 1), HTML tables
• Output: Semantic ScholarlyHTML (W3C community group), 100,000 pages/day
• ami (search, lookup; CONTENTMINING): community plugins for Chem, Phylo, Trials, Crystal, Plants
• Facts, then visualization and analysis
![Page 17: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/17.jpg)
What plants produce Carvone?
https://en.wikipedia.org/wiki/Carvone
![Page 18: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/18.jpg)
Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits
• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
  Send annotations to hypothes.is
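To give a feel for what the first step asks of the literature source, here is a minimal sketch that builds a Europe PMC REST search URL for the same term and hit limit. It is an illustration only: the function name `build_eupmc_search_url` is hypothetical, and it is an assumption that this mirrors what getpapers sends.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public Europe PMC REST search endpoint.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_eupmc_search_url(term, limit=100):
    """Return a search URL for `term`, asking for at most `limit` JSON hits.

    Hypothetical helper: it only builds the URL and performs no network call.
    """
    params = {"query": term, "format": "json", "pageSize": limit}
    return f"{EUPMC_SEARCH}?{urlencode(params)}"


url = build_eupmc_search_url("carvone", limit=100)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return a JSON result list; downloading and normalizing the full text is what the ContentMine tools add on top.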
![Page 19: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/19.jpg)
Search for carvone
![Page 20: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/20.jpg)
https://en.wikipedia.org/wiki/Carvone
WIKIDATA
![Page 21: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/21.jpg)
Carvone in Wikidata
Also SPARQL endpoint
WP identifier
Chemical type
Chemical identifier
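The SPARQL endpoint mentioned on this slide can be queried for the chemical identifiers of a compound. The sketch below only builds the query and request URL. The helper names are hypothetical, and the choice of identifiers (InChIKey via P235, CAS Registry Number via P231) is an assumption about what is useful here, not something shown on the slide.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def compound_query(label, lang="en"):
    """Build a SPARQL query for items with this exact label.

    Hypothetical helper; `label` is inserted verbatim, so it must not
    contain double quotes.
    """
    return f'''
SELECT ?item ?inchikey ?cas WHERE {{
  ?item rdfs:label "{label}"@{lang} .
  OPTIONAL {{ ?item wdt:P235 ?inchikey . }}   # InChIKey
  OPTIONAL {{ ?item wdt:P231 ?cas . }}        # CAS Registry Number
}}
LIMIT 10
'''.strip()


def sparql_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return f"{WIKIDATA_SPARQL}?{urlencode({'query': query, 'format': 'json'})}"


query = compound_query("carvone")
print(query)
print(sparql_url(query))
```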
![Page 22: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/22.jpg)
ARTICLES FACETS
gene disease drug Phytochem
species genus words
![Page 23: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/23.jpg)
Suggest the title of this article
![Page 24: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/24.jpg)
![Page 25: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/25.jpg)
![Page 26: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/26.jpg)
species words
drug Phytochem disease
![Page 27: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/27.jpg)
species words
drug Phytochem disease
disease
![Page 28: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/28.jpg)
Annotation (entity in context)
prefix
surface
label
location
suffix
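The five parts named on this slide can be captured in a small record. The following is a minimal sketch of dictionary-based annotation, assuming simple case-insensitive whole-word matching; the `Annotation` class and `annotate` function are hypothetical names, not the ContentMine implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    label: str    # dictionary or facet name, e.g. "species"
    surface: str  # exact text matched in the document
    start: int    # location: character offset where the match begins
    end: int      # location: character offset just past the match
    prefix: str   # text immediately before the match
    suffix: str   # text immediately after the match


def annotate(text, terms, label, context=20):
    """Find each dictionary term in `text` and record it with its context."""
    hits = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append(Annotation(
                label=label,
                surface=m.group(0),
                start=m.start(),
                end=m.end(),
                prefix=text[max(0, m.start() - context):m.start()],
                suffix=text[m.end():m.end() + context],
            ))
    return sorted(hits, key=lambda a: a.start)


# Made-up sentence for illustration.
text = "Essential oil of Salvia officinalis contains carvone and camphor."
for hit in annotate(text, ["Salvia officinalis"], "species"):
    print(hit)
```

Keeping the prefix and suffix alongside the offsets lets an annotation be re-anchored if the document text shifts slightly.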
![Page 29: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/29.jpg)
Annotation with Hypothes.is
Original publication “on publisher’s site”
Annotation “on Hypothes.is site”
![Page 30: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/30.jpg)
Amanuens.is
Hypothes.is link
Hypothes.is markup of article
![Page 31: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/31.jpg)
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
![Page 32: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/32.jpg)
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
![Page 33: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/33.jpg)
![Page 34: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/34.jpg)
![Page 35: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/35.jpg)
Automatic extraction of plant species from the literature
Lars Willighagen, ContentMine Fellow 2016, NL
https://larsgw.github.io/contentmine-fellowship/html/card_c03-d.html
![Page 36: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/36.jpg)
Mining diagrams
![Page 37: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/37.jpg)
[Plot: Ln bacterial load per fly (y-axis) against days post-infection (x-axis, 0 to 5)]
Bitmap Image and Tesseract OCR
![Page 38: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/38.jpg)
“Root”
![Page 39: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/39.jpg)
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
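The extracted tree is a Newick string, which is what makes the output computable. As a minimal sketch, the hypothetical helper below pulls leaf names and branch lengths out of such a string with a regular expression; it uses a short fragment of the tree above and is not a full Newick parser.

```python
import re


def leaf_lengths(newick):
    """Return {leaf_name: branch_length} for the leaves of a Newick string.

    A leaf label follows "(" or ","; lengths that follow ")" belong to
    internal nodes and are skipped.
    """
    pairs = re.findall(r"[(,]\s*([^(),:;\s]+):([0-9.]+)", newick)
    return {name: float(length) for name, length in pairs}


# Fragment taken from the first three taxa of the tree on this slide.
tree = "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
for name, length in leaf_lengths(tree).items():
    print(name, length)
```

For real analyses a dedicated phylogenetics library would be a safer choice than a regular expression.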
![Page 40: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/40.jpg)
Supertree created from 4300 papers
![Page 41: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/41.jpg)
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
![Page 42: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/42.jpg)
After AMI2 processing…
… AMI2 has detected a square
![Page 43: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/43.jpg)
![Page 44: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/44.jpg)
![Page 45: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/45.jpg)
https://contentmine-demo.herokuapp.com/
ContentMine data visualizations, Chris Kittel
![Page 46: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/46.jpg)
https://contentmine-demo.herokuapp.com/trending
1 month, commonest disease terms
![Page 47: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/47.jpg)
Terms from dictionaries
![Page 48: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/48.jpg)
Co-occurrence of gene names in same sentence
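Sentence-level co-occurrence can be sketched in a few lines: split the text into sentences, find which dictionary terms appear in each one, and count every pair. This is an illustration under simple assumptions (naive sentence splitting, exact token matching); the function name and the example sentences are made up.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(text, terms):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        present = sorted(t for t in terms if t in tokens)
        counts.update(combinations(present, 2))
    return counts


# Made-up text using three Arabidopsis flowering-time gene names.
text = "FLC represses SOC1. SOC1 and FT promote flowering. FLC also represses SOC1 in leaves."
for pair, n in cooccurrence(text, {"FLC", "SOC1", "FT"}).items():
    print(pair, n)
```

Sorting the terms found in each sentence means a pair is always counted under the same key, whichever gene is mentioned first.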
![Page 49: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/49.jpg)
https://zenodo.org/record/61334#.V9XKT4XerCk
![Page 50: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/50.jpg)
Systematic Reviews
Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?
![Page 51: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/51.jpg)
Polly has 20 seconds to read this paper…
…and 10,000 more
![Page 52: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/52.jpg)
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
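The figures in Polly's account can be checked with a little arithmetic. The sketch below uses only numbers quoted on these slides (10,000 abstracts, 6 researchers, 2-3 days each, 20 seconds per paper); the variable names are illustrative.

```python
abstracts = 10_000
researchers = 6
days_each_low, days_each_high = 2, 3

# Share per researcher: about 1,667, quoted as ~1,600 in the text.
per_researcher = abstracts / researchers

# Total effort in person-days: 12 at a minimum, 18 at the upper end.
person_days_low = researchers * days_each_low
person_days_high = researchers * days_each_high

# Reading time if every abstract takes exactly 20 seconds, back to back.
seconds_per_abstract = 20
reading_hours = abstracts * seconds_per_abstract / 3600

print(round(per_researcher), person_days_low, person_days_high, round(reading_hours, 1))
```

Even at 20 seconds per abstract, reading alone comes to roughly 56 hours, so the quoted 12 person-days is plausible once breaks and decisions are included.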
![Page 53: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/53.jpg)
400,000 clinical trials
in 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature for “2009-0100068-41”
![Page 54: High throughput mining of the plant-science literature](https://reader035.fdocuments.net/reader035/viewer/2022062412/58a2064a1a28ab40098b4e21/html5/thumbnails/54.jpg)
(2x digital music industry!)
Contentmine.org
Non-profit
Collaborations include:
• University of Cambridge Plant Sciences
• TGAC/Open Plant
• EuropePMC
• Wikimedia
• Some publishers