Large-scale Information Extraction for Biomedical Literature · 2019-10-17 · Large-scale...



Large-scale Information Extraction for Biomedical Literature

1st Swiss Text Analytics Conference (Swisstext 2016)

Fabio Rinaldi, Lenz Furrer
www.ontogene.org

June 8, 2016

Fabio Rinaldi, Lenz Furrer www.ontogene.org Large-scale Biomedical Information Extraction 1 / 34


Outline

1. Motivation
2. Text mining in a curation workflow
3. Large-scale detection of protein interactions
4. Biomedical text mining: competitive evaluations
5. The OntoGene approach and highlights
6. Preview of recent work


Flow of Information – Role of Curation


Growth of Information

PubMed citations 1986–2010

Source: Zhiyong Lu: PubMed and beyond: a survey of web tools for searching biomedical literature. Database 2011:baq036


Assisted Curation


RegulonDB

Escherichia coli K-12 Transcriptional Regulatory Network

High-throughput literature curation of genetic regulation in bacterial models

Funded by the NIH
Grant ID: GM110597 (NIGMS-NIH)
Funding: $1.6 million
Duration: 4 years (Jan 2015 – Dec 2018)
PI: Dr. Julio Collado-Vides (UNAM)
Collaborators: Dr. Michael Savageau (UC Davis), Dr. Stephen Busby (Univ. of Birmingham), Dr. Fabio Rinaldi (Univ. Zurich)


Assisted Curation: Example

Topic: oxidative stress by OxyR
Corpus: 46 papers, curated in RegDB

Methods: automated annotation of entities via OntoGene, selection of sentences via ODIN filters, manual validation

Results:
100% of regulatory interactions (RIs) retrieved, including the transcription factor (TF), the EFFECT and the target gene (TG)
Growth conditions identified for 15 of the 20 RIs of OxyR while checking only a limited set of sentences (about 10% of the article is read)
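
The sentence-selection step can be pictured with a small sketch. The slides do not describe how ODIN filters are implemented, so everything below is an assumption made for illustration: the `Sentence` class, the `select_sentences` function, the entity-type labels and the three toy sentences are all hypothetical. The idea shown is simply that a curator reads only the sentences carrying every annotation type of interest.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    # Maps an entity type (e.g. "TF") to the mentions annotated in this sentence.
    entities: dict = field(default_factory=dict)


def select_sentences(sentences, required_types):
    """Keep only sentences annotated with every entity type in required_types."""
    required = set(required_types)
    selected = []
    for sentence in sentences:
        present = {etype for etype, mentions in sentence.entities.items() if mentions}
        if required <= present:
            selected.append(sentence)
    return selected


# Toy, invented sentences standing in for an annotated article (not from the corpus).
article = [
    Sentence("OxyR activates katG under oxidative stress.",
             {"TF": ["OxyR"], "EFFECT": ["activates"], "TG": ["katG"]}),
    Sentence("Cells were grown in LB medium.", {}),
    Sentence("OxyR is a LysR-type regulator.", {"TF": ["OxyR"]}),
]

to_read = select_sentences(article, ["TF", "EFFECT", "TG"])
fraction_read = len(to_read) / len(article)  # share of the article the curator reads
```

A stricter or looser filter only changes `required_types`; the reading effort is the ratio of selected to total sentences.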


Assisted Curation: ODIN Interface

Lightweight browser-based graphical interface
Purpose: literature-based curation tasks
Coupled with OntoGene pipeline
Easily customizable

Applications:
Novartis (2008–2012)
PharmGKB (2011)
CTD (2012)
RegulonDB (2013)


Large scale mining of protein interactions

Text mining in an industrial context
Concept filtering and relation ranking
Collection-based ranking

Search interface built on Apache Solr
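
One possible reading of collection-based ranking is sketched below. The slides give no formula, so this is an assumption rather than the OntoGene method: candidate interactions found in individual documents are pooled over the whole collection and ranked by how many distinct documents support them, with the summed per-document score as a tie-breaker. The function name `rank_relations`, the tuple layout and the protein names are hypothetical.

```python
from collections import defaultdict


def rank_relations(candidates):
    """Rank entity pairs by collection-wide support.

    candidates: iterable of (doc_id, entity_a, entity_b, score) tuples,
    one per interaction candidate found in a single document.
    """
    support = defaultdict(set)    # pair -> documents mentioning it
    total = defaultdict(float)    # pair -> summed per-document scores
    for doc_id, a, b, score in candidates:
        pair = tuple(sorted((a, b)))  # treat interactions as undirected
        support[pair].add(doc_id)
        total[pair] += score
    # More supporting documents first, then higher summed score, then name order.
    return sorted(support, key=lambda p: (-len(support[p]), -total[p], p))


# Toy candidates with made-up identifiers.
ranked = rank_relations([
    ("doc1", "P1", "P2", 0.90),
    ("doc2", "P2", "P1", 0.50),
    ("doc1", "P1", "P3", 0.95),
])
```

In this toy input the pair seen in two documents outranks the pair seen once, even though the latter has the higher single-document score.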


Search Interface Apache Solr


Competitive Evaluations

BioCreative
BioNLP
CALBC
CLEF-ER
QA4MRE
DDI @ SemEval
BioASQ
i2b2


BioCreative Shared Task

Source: Martin Krallinger / Florian Leitner


2004 (I) gene mentions, GO annotations
2006 (II) GM, GN, PPI
2009 (II.5) PPI
2010 (III) GN, PPI-ACT, PPI-IMT, IAT
2012 CTD-triage, curation workflow, IAT
2013 (IV) BioC, CHEMDNER, CTD, GO, IAT
2015 (V) BioC, CHEMDNER, Chem/Dis, BEL, IAT


CTD-triage

Purpose: “promotes understanding about the effects of environmental chemicals on human health by integrating data from curated scientific literature”

Task: entity extraction and triage

Best overall results, best detection of genes and diseases


Single Term Resource


Single Term Resource: Entities

Genes and proteins (NCBI gene, UniProt)
Chemicals (MeSH, ChEBI, CTD)
Diseases (MeSH, CTD)
Organism and species (MeSH, NCBI taxonomy)
Cell lines (Cellosaurus)


Term Resource: Some Figures

              genes/proteins  chemicals  diseases  species  cell lines  total
count         10.4 M          979 k      67 k      1.3 M    36 k        12.8 M
avg. length   11.73           37.49      26.98     22.87    7.611       14.92
terms/ID      1.1455          3.545      6.018     1.326    1.000       1.328
IDs/term      1.371           1.049      1.000     1.003    1.004       1.306
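
The slides do not define these figures, so the sketch below rests on stated assumptions: "count" is read as the number of distinct terms, "avg. length" as the mean term length in characters, "terms/ID" as term-ID pairs per distinct ID (synonymy) and "IDs/term" as term-ID pairs per distinct term (ambiguity). The function name, the toy entries and the identifiers `G1` to `G3` are hypothetical.

```python
from collections import defaultdict


def term_resource_stats(entries):
    """Compute summary figures for a list of (term, concept_id) pairs."""
    pairs = set(entries)
    if not pairs:
        raise ValueError("term resource is empty")
    terms_by_id = defaultdict(set)
    ids_by_term = defaultdict(set)
    for term, concept_id in pairs:
        terms_by_id[concept_id].add(term)
        ids_by_term[term].add(concept_id)
    n_terms = len(ids_by_term)
    return {
        "count": n_terms,                                   # distinct terms (assumed meaning)
        "avg_length": sum(len(t) for t in ids_by_term) / n_terms,
        "terms_per_id": len(pairs) / len(terms_by_id),      # synonyms per concept
        "ids_per_term": len(pairs) / n_terms,               # ambiguity per term
    }


# Toy resource: two synonyms for one concept, one term shared by two concepts.
stats = term_resource_stats([
    ("p53", "G1"), ("TP53", "G1"),
    ("CAT", "G2"), ("CAT", "G3"),
])
```

Under this reading, a value of 1.000 in the IDs/term row means that no term in that category maps to more than one identifier.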


Term Resource: Annotated Document in ODIN


Relation Extraction Pipeline


Syntax-based Approach

“The neuronal nicotinic acetylcholine receptor alpha7 (nAChR alpha7) may be involved in cognitive deficits in Schizophrenia and Alzheimer’s disease.” [PMID 15695160]
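
A common way to exploit syntax for relation extraction is to look at the dependency path connecting two entity mentions. The sketch below illustrates that general idea on the example sentence; it is not taken from the slides. The head assignments in `heads` are hand-written and heavily simplified (content words only, one node per word), not the output of the OntoGene parser, and the function names are hypothetical.

```python
def path_to_root(token, heads):
    """Follow head links from a token up to the root of the tree."""
    path = [token]
    while heads.get(path[-1]) is not None:
        path.append(heads[path[-1]])
    return path


def dependency_path(a, b, heads):
    """Return the tokens on the tree path from a to b, or None if unconnected."""
    up_a = path_to_root(a, heads)
    up_b = path_to_root(b, heads)
    position_in_a = {tok: i for i, tok in enumerate(up_a)}
    for j, tok in enumerate(up_b):
        if tok in position_in_a:  # lowest common ancestor found
            i = position_in_a[tok]
            return up_a[:i + 1] + up_b[:j][::-1]
    return None


# Hand-written, simplified heads (child -> head); an assumption for illustration.
heads = {
    "involved": None,            # root of the sentence
    "receptor": "involved",      # subject
    "deficits": "involved",      # attached via "in"
    "Schizophrenia": "deficits", # attached via "in"
    "disease": "Schizophrenia",  # conjunct ("and Alzheimer's disease")
}

path = dependency_path("receptor", "Schizophrenia", heads)
```

The path passes through "involved" and "deficits", which is the kind of evidence a syntax-based filter could use to accept or reject a candidate relation between the receptor and the disease.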


Highlights

[2006] BioCreative II: PPI (3rd), IMT (best)
[2009] BioCreative II.5: PPI (best results); BioNLP
[2010] BioCreative III: ACT, IMT, IAT
[2011] CALBC (large scale entity extraction), BioNLP
[2012] CTD task at BioCreative 2012
[2013] BioCreative IV: BioC, CTD, IAT
80+ publications, 20+ journal articles


Veterinary Pathology Text Mining

Collaboration with the veterinary faculty of the University of Bern

Task: development and evaluation of an automated text-mining and syndrome-classifying tool:
extract relevant information from pathology reports with minimal expert intervention
classify pathology findings into syndromic groups to enhance the efficiency of health event detection


Veterinary TM: Term Variation

Diagnosis: volvulus

Term variants (diagram): Volvulus, Voluvlus, Volvolus, Vovulus, Darmtorsion, Darmtorsionen, Dünndarmtorsion, Dünndarmdrehungen, Dünndarmvolvulus, Caecumtorsion, Zäkumtorsion, Zäkumstorsion, Caecum-Torsion

Relations to the canonical term shown in the diagram: inflection, misspellings, synonyms, hyponyms, spelling variants
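
Misspellings and inflected forms like those above can often be mapped to a known term by approximate string matching. The sketch below uses Python's standard `difflib` for this; it is an illustration under stated assumptions, not the method used in the project. The vocabulary `CANONICAL_TERMS`, the function name `normalize` and the cutoff of 0.8 are arbitrary choices made for the example.

```python
import difflib

# Small illustrative vocabulary; a real system would load a full term list.
CANONICAL_TERMS = ["volvulus", "darmtorsion", "caecumtorsion"]


def normalize(term, vocabulary=CANONICAL_TERMS, cutoff=0.8):
    """Map a surface form to its closest vocabulary entry, or None if nothing is close."""
    key = term.casefold()  # ignore case differences such as "Volvulus" vs "volvulus"
    matches = difflib.get_close_matches(key, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


normalized = {t: normalize(t) for t in ["Voluvlus", "Volvolus", "Vovulus", "Darmtorsionen"]}
```

String similarity only covers misspellings and inflection. Synonyms and hyponyms share little or no surface form with the canonical term, so they still need an explicit dictionary entry.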


Veterinary TM: Enhancing Terminology

Terms: Belastungsmyopathie, Muskeldegeneration, Muskelfaserdegeneration, Muskelfaserndegeneration, Muskelzelldegeneration, Muskelfasernekrose, Muskelläsionen, Muskelnekrose, Muskelnekrosen, Myodegeneration, Myofaserdegeneration, myodegeneration, Myonekrose, myonecrosis, Rhabdomyolyse, Weiss-Muskel-Krankheit, Weiss-Muskelkrankheit, Weissmuskelerkrankung, Weissmuskelkrankeit, Weissmuskelkrankheit

UMLS data:
concept IDs: C0234958, C0235957, C0035410, C0043153
terms: Denegeration des Muskels; MUSKELDEGENERATION; Muskeldegeneration; MUSKELNEKROSE; Muskelnekrose; Myonekrose; RHABDOMYOLYSE; Rhabdomyolyse; Muskeldystrophie, nutritive; Weißmuskelkrankheit
semantic types: Disease or Syndrome, Pathologic Function

syndrome: musculo-skeletal
diagnosis: myodegeneration
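
Once variants are collected under one diagnosis, assigning a syndromic group can be a plain dictionary lookup. The sketch below is a minimal illustration, assuming that the listed terms all belong to the diagnosis "myodegeneration" and the syndrome "musculo-skeletal" as the slide suggests. The two dictionaries contain only a handful of the terms, and the function name and the example report text are invented.

```python
import re

# Subset of the slide's terms, lower-cased; mapping to one diagnosis is an assumption.
TERM_TO_DIAGNOSIS = {
    "muskeldegeneration": "myodegeneration",
    "myodegeneration": "myodegeneration",
    "muskelnekrose": "myodegeneration",
    "rhabdomyolyse": "myodegeneration",
    "weissmuskelkrankheit": "myodegeneration",
    "weiss-muskel-krankheit": "myodegeneration",
}
DIAGNOSIS_TO_SYNDROME = {"myodegeneration": "musculo-skeletal"}


def classify_report(text):
    """Return sorted (syndrome, diagnosis) pairs for known terms found in a report."""
    # casefold() also maps "ß" to "ss", so "Weißmuskelkrankheit" matches the entry above.
    tokens = re.findall(r"[\w-]+", text.casefold())
    found = set()
    for token in tokens:
        diagnosis = TERM_TO_DIAGNOSIS.get(token)
        if diagnosis is not None:
            found.add((DIAGNOSIS_TO_SYNDROME[diagnosis], diagnosis))
    return sorted(found)


# Invented report fragment, for illustration only.
result = classify_report("Hochgradige Muskeldegeneration der Skelettmuskulatur")
```

Exact lookup keeps the classification transparent for the pathologists who validate it, at the cost of needing every variant in the dictionary.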


PsyMine

Text mining in support of psychiatric research: overcoming fragmented knowledge

Collaboration with the Competence Center for Mental Health at the Epidemiology, Biostatistics and Prevention Institute

Goal: identify potential causes of mental diseases
Methods: analyse the whole biomedical literature, identify causes of mental disorders (genetic/disease/social), rank and correlate
Vision: “global overview” of knowledge in literature


MelanoBase

Large-scale automatic extraction of actionable information from the biomedical literature

integration with existing structured knowledge
use-case scenario: melanoma
results to be integrated within the Melanoma Molecular Map repository (S. Mocellin, Padua)
collaborations with clinical researchers (Marisol Soengas, CNIO, Spain)


Summary

Text mining technologies can provide effective support in biomedical curation
ODIN is a user-friendly text-mining tool supporting interactive (collaborative) curation of the biomedical literature
OntoGene provides competitive text mining technologies (quality shown by results in BioCreative and CALBC)
New projects and applications: VetSuisse, PsyMine, MelanoBase
