Liberating facts from the scientific literature - Jisc Digifest 2016

Content Mining (TDM). Peter Murray-Rust, ContentMine.org and University of Cambridge. JISC Digifest, Birmingham, UK, 2016-03-02. Invited and Sponsored by JISC. F/OSS tools from contentmine.org. Images from Wikimedia CC-BY-SA.

Transcript of Liberating facts from the scientific literature - Jisc Digifest 2016

Page 1: Liberating facts from the scientific literature - Jisc Digifest 2016

Content Mining (TDM)

Peter Murray-Rust, ContentMine.org and University of Cambridge

JISC Digifest, Birmingham, UK, 2016-03-02

Invited and Sponsored by JISC

F/OSS tools from contentmine.org

Images from Wikimedia CC-BY-SA

Page 2: Liberating facts from the scientific literature - Jisc Digifest 2016

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Page 3: Liberating facts from the scientific literature - Jisc Digifest 2016

Overview
• Open Semistructured Documents are the most exciting underutilised knowledge resource
  – Scholarly literature
  – Theses
  – Clinical trials
  – Government and NGO publications
  – Product information …
• Content Mining can make huge contributions.
• EuropePubMedCentral(*) is the world’s best place to start.
• Socio-politico-legal aspects cannot be ignored.

• (*) Wellcome Trust, RCUK, FWF (Austria), Cancer Research UK, NHS UK ….

Page 4: Liberating facts from the scientific literature - Jisc Digifest 2016

Mining strategy
• Discover, negotiate permissions => bibliography
• Crawl / Scrape (download) documents AND supplemental
• Normalize: PDF => XML
• Index: facets => Facts and snippets (“entities”)
• Interpret/analyze entities => relationships, aggregations (“Transformative”)
• Publish

Page 5: Liberating facts from the scientific literature - Jisc Digifest 2016

[Figure: ContentMine pipeline diagram]

CONTENTMINE Complete OPEN Platform for Mining Scientific Literature
• Sources: EPMC, arXiv, CORE, HAL, (university repositories); ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Discovery: getpapers (query), quickscrape (crawl), daily crawl, catalogue
• Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma (Normalizer, Structurer, Semantic Tagger): text, data, figures => Semantic ScholarlyHTML (abstract, methods, references, captioned figures, HTML tables)
• ami (search, lookup) with plugins: Chem, Phylo, Trials, Crystal, Plants => Facts
• Community: scrapers, queries, taggers; Visualization and Analysis
• Throughput: 30,000 pages/day

Page 6: Liberating facts from the scientific literature - Jisc Digifest 2016

Want to know about Zika?

Just Type:

ZIKA!

Page 7: Liberating facts from the scientific literature - Jisc Digifest 2016

Semantic Fulltext
• EuropePMC: coherent Open Access
• getpapers: query, download (through API).
• AMI filters, checks[1], transforms facts in papers.
• sequences, species, genera, genes, dictionaries

[0] All operations shown run in a total of <3 minutes.
[1] Dictionaries and lookup.
[2] Usable from home by anyone
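The query-and-download step can be sketched in Python. This is not how getpapers itself is implemented; it is a minimal illustration that assumes the Europe PMC REST search endpoint, its `OPEN_ACCESS:y` query field, and the `resultList`/`result`/`pmcid` response layout. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Europe PMC REST search endpoint; check the current API documentation.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term, page_size=100):
    """Build a search URL for open-access records matching `term`."""
    params = {
        "query": f"{term} AND OPEN_ACCESS:y",
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"


def extract_pmcids(response):
    """Pull PMC identifiers out of a decoded JSON search response."""
    results = response.get("resultList", {}).get("result", [])
    return [r["pmcid"] for r in results if "pmcid" in r]


def search(term, page_size=100):
    """Run the query over the network and return the PMCIDs found."""
    with urlopen(build_search_url(term, page_size)) as reply:
        return extract_pmcids(json.load(reply))
```

Calling `search("zika")` would return the PMC identifiers on the first page of results; fetching the full text for each identifier is a separate step not shown here.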

Zika endemic areas, Wikimedia CC-BY-SA

Page 8: Liberating facts from the scientific literature - Jisc Digifest 2016

Download all Open Access “Zika” from EuropePMC in 10 seconds (click below for movie)

Aedes aegypti, Wikimedia CC-BY-SA

Note: movies of this and other slides can be seen at https://vimeo.com/154705161

Page 9: Liberating facts from the scientific literature - Jisc Digifest 2016

Downloaded all Open Access “Zika” from EuropePMC in 10 seconds

Final download screen

Page 10: Liberating facts from the scientific literature - Jisc Digifest 2016

Eyeballing 20/120 Zika papers, click below for movie

Yellow Fever Virus Wikimedia CC-BY-SA

Note: movie of this and other slides can be seen at https://vimeo.com/154705161

Page 11: Liberating facts from the scientific literature - Jisc Digifest 2016

3011 virus
1939 Ae./Aedes
1212 dengue
901 mosquito/es
894 species
791 ZIKV
721 using
716 DENV
567 detection
513 aegypti
484 infection
442 RNA
428 protein
401 albopictus
360 viral

Commonest words in 120 Zika papers
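A word count of this kind can be sketched with the Python standard library. This is an illustration, not the ContentMine implementation; the stop-word list is a tiny placeholder and the function name is hypothetical. Unlike the slide, it does not merge variants such as "Ae." and "Aedes".

```python
import re
from collections import Counter

# A tiny stop-word list for illustration only; a real run would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "by", "was", "were"}


def commonest_words(texts, n=15):
    """Count alphabetic words across all texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[A-Za-z]+", text)
        counts.update(w for w in words if w.lower() not in STOPWORDS)
    return counts.most_common(n)


papers = [
    "Zika virus is carried by Aedes aegypti",
    "dengue virus and Zika virus",
]
print(commonest_words(papers, 2))
```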

Mosquito spp. Wikimedia CC-BY-SA

Page 12: Liberating facts from the scientific literature - Jisc Digifest 2016

Filtering local files for sequence and viruses

AMI (part of ContentMine software)

(click below for movie)

Note: movies of this and other slides can be seen at https://vimeo.com/154705161

Page 13: Liberating facts from the scientific literature - Jisc Digifest 2016

DNA Primers in running text

…the sodium channel voltage dependent gene (Nav). Primers used to amplify this fragment were AaNaA 5’-ACAATGTGGATCGCTTCCC-3’ and AaNaB 5’-TGGACAAAAGCAAGGCTAAG-3’(8). The primers amplify a fragment of approximately 472…

Snippet (quotable under the 2014 UK Statutory Instrument (“Hargreaves”)):

~/PMC4654492/results/sequence/dnaprimer/results.xml

W3C Annotation

[PREFIX] [MATCH] (link to target)[SUFFIX]

CMine structure

plugin option
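Finding primers in running text and recording them with their surrounding context can be sketched with a regular expression. This is an assumption about one workable approach, not AMI's actual pattern; the function name, the 10-base minimum, and the 40-character context window are illustrative choices.

```python
import re

# 5'-SEQUENCE-3', accepting either a straight or a typographic apostrophe.
PRIMER = re.compile(r"5['\u2019]-([ACGT]{10,})-3['\u2019]")


def find_primers(text, context=40):
    """Return each primer match with the text before and after it."""
    hits = []
    for m in PRIMER.finditer(text):
        hits.append({
            "pre": text[max(0, m.start() - context):m.start()],
            "exact": m.group(0),
            "sequence": m.group(1),
            "post": text[m.end():m.end() + context],
        })
    return hits


snippet = (
    "Primers used to amplify this fragment were AaNaA "
    "5\u2019-ACAATGTGGATCGCTTCCC-3\u2019 and AaNaB "
    "5\u2019-TGGACAAAAGCAAGGCTAAG-3\u2019(8)."
)
for hit in find_primers(snippet):
    print(hit["sequence"])
```

The `pre`, `exact` and `post` keys mirror the [PREFIX] [MATCH] [SUFFIX] layout on the slide.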

DNA double stranded fragment Wikimedia CC-BY-SA

Page 14: Liberating facts from the scientific literature - Jisc Digifest 2016

Commonest species in 120 Zika papers
423 Ae./Aedes aegypti
333 Ae./Aedes albopictus
63 Ae. bromeliae
58 Ae. lilii
46 Ae. hensilli
42 Glossina pallidipes
40 Plasmodium vivax
35 Ae. luteocephalus
28 Ae. vittatus
25 Ae. furcifer
22 Plasmodium falciparum
21 Drosophila melanogaster

pre="fever (DHF), are caused by the world's most prevalent mosquito-borne virus. 37 DENV is carried by " exact="Aedes aegypti" post=" mosquito, which is strongly affected by ecological and human drivers, but also influenced by clima" name="binomial"/>
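Counts like those above can be derived from result files that carry `pre`, `exact` and `post` attributes. The sketch below assumes such attributes sit on elements inside a well-formed XML file; the `<results>`/`<result>` element names and the sample content are hypothetical, so the code looks only at the `exact` attribute rather than at tag names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_matches(xml_text):
    """Count how often each `exact` value occurs in a results document."""
    root = ET.fromstring(xml_text)
    return Counter(
        el.get("exact") for el in root.iter() if el.get("exact") is not None
    )


# Hypothetical sample in the shape suggested by the slide.
SAMPLE = """<results>
  <result pre="DENV is carried by " exact="Aedes aegypti" post=" mosquito" name="binomial"/>
  <result pre="and also by " exact="Aedes albopictus" post=" in" name="binomial"/>
  <result pre="larvae of " exact="Aedes aegypti" post=" were" name="binomial"/>
</results>"""

print(count_matches(SAMPLE).most_common())
```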

Page 15: Liberating facts from the scientific literature - Jisc Digifest 2016

183 Wolbachia
70 Aedes
69 Flavivirus/Flaviviridae
30 Glossina
17 Culex

Commonest genera in Zika papers

pre="…-negative endosymbiotic bacterium, is a promising tool against diseases transmitted by mosquitoes. " exact="Wolbachia" post=" can be found worldwide in numerous arthropod species. More than 65% of all insect species are natu…"

Wolbachia in insect cell Wikimedia CC-BY-SA

Page 16: Liberating facts from the scientific literature - Jisc Digifest 2016

38 ITS20 MHC2TA19 COI20 CYPJ9221 CYP6BB222 CYP9J283 MHC

Commonest genes in 120 Zika papers

Page 17: Liberating facts from the scientific literature - Jisc Digifest 2016

• microcephaly 400/2400 papers; 2 mins;

commonest genes:

203 MCPH1
86 MECP2
54 SOX2
49 E2F1
47 SNAP29
40 IKBKG
40 NDE1

N-terminal domain of microcephalin Wikimedia CC-BY-SA

Page 18: Liberating facts from the scientific literature - Jisc Digifest 2016

Systematic Reviews

Researchers and their machines need to “read” hundreds of papers a day or even more.

Page 19: Liberating facts from the scientific literature - Jisc Digifest 2016

Polly has 20 seconds to read this paper…

…and 10,000 more

Page 20: Liberating facts from the scientific literature - Jisc Digifest 2016

ContentMine software can do this in a few minutes

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
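The figures in Polly's account can be checked with a few lines of arithmetic. The inputs are the numbers quoted on the slides (10,000 abstracts, 6 researchers, 2 to 3 days each, 20 seconds per paper); the variable names are only illustrative.

```python
abstracts = 10_000
researchers = 6
days_each_min, days_each_max = 2, 3
seconds_per_paper = 20

per_researcher = abstracts / researchers            # about 1,667 abstracts each
person_days_min = researchers * days_each_min       # 12 person-days
person_days_max = researchers * days_each_max       # 18 person-days
skim_hours = abstracts * seconds_per_paper / 3600   # about 55.6 hours at 20 s per paper

print(round(per_researcher), person_days_min, person_days_max, round(skim_hours, 1))
```

So even at 20 seconds per paper, a single pass over 10,000 abstracts is more than 55 hours of uninterrupted reading.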

Page 21: Liberating facts from the scientific literature - Jisc Digifest 2016

400,000 Clinical Trials in 10 government registries

Mapping trials => papers

http://www.trialsjournal.com/content/16/1/80

2009 => 2015. What’s happened in last 6 years??

Search the whole scientific literature for “2009-0100068-41”
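Searching a set of papers for a trial identifier can be sketched as a plain substring search. The one wrinkle handled here is that text extracted from PDFs often carries typographic dashes instead of hyphens. This is an assumption-laden illustration: the function names and the sample paper texts are hypothetical, and only the identifier string comes from the slide.

```python
# Map common typographic dashes to a plain hyphen before comparing.
DASHES = str.maketrans({"\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-"})


def normalise(text):
    """Replace dash variants so identifiers compare reliably."""
    return text.translate(DASHES)


def papers_mentioning(papers, identifier):
    """Return the IDs of papers whose text contains the identifier."""
    wanted = normalise(identifier)
    return sorted(pid for pid, text in papers.items() if wanted in normalise(text))


# Hypothetical paper texts keyed by a hypothetical paper ID.
papers = {
    "paperA": "The trial was registered as 2009\u20130100068\u201341.",
    "paperB": "No registration number is reported.",
    "paperC": "See trial 2009-0100068-41 for the protocol.",
}
print(papers_mentioning(papers, "2009-0100068-41"))
```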

Page 22: Liberating facts from the scientific literature - Jisc Digifest 2016

Extracting scientific information

Page 23: Liberating facts from the scientific literature - Jisc Digifest 2016

Mining strategy
• Discover, negotiate permissions => bibliography
• Crawl / Scrape (download) documents AND supplemental
• Normalize: PDF => XML
• Index: facets => Facts and snippets (“entities”)
• Interpret/analyze entities => relationships, aggregations (“Transformative”)
• Publish

Page 24: Liberating facts from the scientific literature - Jisc Digifest 2016

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 25: Liberating facts from the scientific literature - Jisc Digifest 2016

[Figure: ContentMine pipeline diagram, repeated from Page 5]

CONTENTMINE Complete OPEN Platform for Mining Scientific Literature

Page 26: Liberating facts from the scientific literature - Jisc Digifest 2016

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 27: Liberating facts from the scientific literature - Jisc Digifest 2016

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 28: Liberating facts from the scientific literature - Jisc Digifest 2016

Facts in context

daily IUCN endangered species news

en.wikipedia.org CC By-SA

Page 29: Liberating facts from the scientific literature - Jisc Digifest 2016

ContentMine Fact of The Day

• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles

Page 30: Liberating facts from the scientific literature - Jisc Digifest 2016

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

Page 31: Liberating facts from the scientific literature - Jisc Digifest 2016

“Root” 4500 papers each with 1 tree

Page 32: Liberating facts from the scientific literature - Jisc Digifest 2016

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
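Output in this parenthesised tree notation is straightforward to consume. The sketch below pulls out leaf names and branch lengths with a regular expression; it does not reconstruct the tree topology. Because the string on the slide is elided, the example tree is a shortened, hypothetical one built from the first three taxa shown, and the function name is illustrative.

```python
import re

# A leaf is a name directly after "(" or ",", followed by ":" and a length.
# Internal nodes follow ")" and are therefore skipped.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+):(\d+(?:\.\d+)?)")


def leaves(newick):
    """Return (name, branch_length) pairs for every leaf in the string."""
    return [(name, float(length)) for name, length in LEAF.findall(newick)]


# Hypothetical, shortened tree using taxa from the slide.
example = (
    "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
    "Synergistes_jonesii:301);"
)
for name, length in leaves(example):
    print(name, length)
```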

Page 33: Liberating facts from the scientific literature - Jisc Digifest 2016

Supertree for 924 species

Tree

Page 34: Liberating facts from the scientific literature - Jisc Digifest 2016

Supertree created from 4300 papers

Page 35: Liberating facts from the scientific literature - Jisc Digifest 2016

Socio-politico-legal

• TDM is one of the most complex, uncertain, confrontational and political areas of human endeavour.

Page 36: Liberating facts from the scientific literature - Jisc Digifest 2016

Copyright and Mining

• PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright.

• UK (“Hargreaves”) 2014 legislation:
  – “personal” “non-commercial*” “research” “data analytics”
  – legitimizes copying (?to disk), but not publishing

*teaching, textbooks, etc. may be “commercial”

Page 37: Liberating facts from the scientific literature - Jisc Digifest 2016

STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC Rightfind)
• Technical obstruction (Wiley Captcha, Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)

[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.

Page 38: Liberating facts from the scientific literature - Jisc Digifest 2016

WILEY … “new security feature… to prevent systematic download of content”

“[limit of] 100 papers per day”

“essential security feature … to protect both parties (sic)”

CAPTCHA: User has to type words

Page 39: Liberating facts from the scientific literature - Jisc Digifest 2016

ContentMine working with Libraries

• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry

• Cochrane Collaboration on Systematic Reviews of Clinical Trials

• FutureTDM (H2020, LIBER)
• Running workshops and training

Page 40: Liberating facts from the scientific literature - Jisc Digifest 2016

CM Future

• Hypothes.is use ContentMine results for annotation
• (with Cambridge Univ Library) extracting daily scientific facts from open and closed literature.
• with EBI, Cochrane Collaborations, JISC, OKF, LIBER, TGAC/John Innes, DNADigest.
• Running workshops, hackdays.
• Planned outreach: MEPs, EC, Slashdot, Reddit, Kickstarter, geekdom

• http://contentmine.org (OpenLock non-profit)

Page 41: Liberating facts from the scientific literature - Jisc Digifest 2016

ContentMine working with Libraries

• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry

• Cochrane Collaboration on Systematic Reviews of Clinical Trials

• FutureTDM (H2020, LIBER)
• Running workshops and training

• Offers services for information extraction and indexing for born-digital documents.

Page 42: Liberating facts from the scientific literature - Jisc Digifest 2016

Tractable Open Repositories

• CORE
• OpenAIRE
• arXiv
• HAL

Page 43: Liberating facts from the scientific literature - Jisc Digifest 2016

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org