Biomedical text mining: Automatic processing of unstructured text

Post on 28-Jan-2018

Lars Juhl Jensen

Biomedical text mining: Automatic processing of unstructured text

>10 km

1 paper / 40 seconds

patent literature

grant proposals

FDA product labels

electronic medical records

too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive dictionary

genes/proteins

cyclin dependent kinase 1

CDC2

chemical compounds

diseases

adverse drug reactions

cellular components

tissues

organisms

environments

orthographic variation

flexible matching

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1
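
One way to get flexible matching for spaces and hyphens is to compile each dictionary name into a regular expression in which every separator may be a space, a hyphen, or absent. This is only a minimal sketch; the function name `flexible_pattern` and the sample sentence are illustrative assumptions, not taken from the talk.

```python
import re

def flexible_pattern(name: str) -> re.Pattern:
    """Build a case-insensitive regex in which each space or hyphen of the
    dictionary name matches a space, a hyphen, or nothing at all."""
    tokens = re.split(r"[\s-]+", name.strip())
    body = r"[\s-]?".join(re.escape(token) for token in tokens)
    # The lookarounds stop the name from matching inside a longer word or number.
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)

pattern = flexible_pattern("cyclin dependent kinase 1")
text = "Cyclin-dependent kinase 1 is also written cyclin dependent kinase 1."
matches = [m.group(0) for m in pattern.finditer(text)]
# matches == ["Cyclin-dependent kinase 1", "cyclin dependent kinase 1"]
```

The word-boundary lookarounds are a design choice of this sketch: without them, "cyclin-dependent kinase 1" would also be found inside "cyclin-dependent kinase 10".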

expansion rules

prefixes and suffixes

CDC2

hCdc2

plural/adjective forms

mitochondrion

mitochondria

mitochondrial

abbreviated forms

Saccharomyces cerevisiae

S. cerevisiae
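
Expansion rules of this kind can be sketched as small functions that generate extra dictionary entries from each name. The helper names below and the choice of "h" as an organism prefix are assumptions made for illustration; irregular forms such as mitochondrion/mitochondria/mitochondrial would need their own rules and are not covered here.

```python
def abbreviate_species(name: str) -> str:
    """Abbreviate the genus of a binomial name to its initial,
    e.g. 'Saccharomyces cerevisiae' -> 'S. cerevisiae'."""
    genus, species = name.split(maxsplit=1)
    return f"{genus[0]}. {species}"

def prefixed_variants(symbol: str, prefixes=("h",)) -> set:
    """Return the symbol plus prefixed forms such as 'hCdc2'.
    The prefix list is a hypothetical example, not a curated resource."""
    variants = {symbol}
    for prefix in prefixes:
        variants.add(prefix + symbol)               # e.g. hCDC2
        variants.add(prefix + symbol.capitalize())  # e.g. hCdc2
    return variants

species_short = abbreviate_species("Saccharomyces cerevisiae")
gene_variants = prefixed_variants("CDC2")
```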

“black list”

SDS
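
A black list can be applied as a simple filter after dictionary matching. The sketch below assumes that SDS is blocked because the string usually refers to something other than the dictionary entity (for example the detergent rather than a gene); that reading, and the names `BLACKLIST` and `filter_matches`, are assumptions rather than details from the slides.

```python
# Names that match the dictionary but are too ambiguous to tag reliably.
BLACKLIST = {"SDS"}

def filter_matches(matches, blacklist=BLACKLIST):
    """Drop every matched string that appears on the black list."""
    return [match for match in matches if match not in blacklist]

kept = filter_matches(["CDC2", "SDS", "hCdc2"])
# kept == ["CDC2", "hCdc2"]
```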

use cases

assess studiedness

TIN-X

newdrugtargets.org Cannon et al., Bioinformatics, 2017

interactive annotation

EXTRACT

extract.hcmr.gr Pafilis et al., Database, 2016

implicit relations

Encyclopedia of Life

habitats

environments.hcmr.gr / eol.org Pafilis et al., Bioinformatics, 2016

SIDER

adverse drug reactions

sideeffects.embl.de Kuhn et al., Nucleic Acids Research, 2016

relation extraction

two approaches

natural language processing

part-of-speech tagging

what you learned in school
pronoun pronoun verb preposition noun

sentence parsing

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Saric et al., Proceedings of ACL, 2004

manually crafted rules
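
To give a feel for what a manually crafted rule looks like, here is a toy pattern for the example sentence above. It is a deliberately simplified stand-in written with plain regular expressions, not the grammar of Saric et al.; the gene-symbol pattern `GENE` and the function `extract_regulation` are hypothetical.

```python
import re

# Toy pattern for yeast-style gene symbols such as CYC1 or HAP1 (assumption).
GENE = r"[A-Z]{3}\d+"

# One hand-written rule: "expression of <targets> is controlled by <regulator>".
RULE = re.compile(
    rf"expression of (?P<targets>.+?) is controlled by (?P<regulator>{GENE})"
)

def extract_regulation(sentence: str):
    """Return directed (regulator, target) pairs found by the rule."""
    match = RULE.search(sentence)
    if match is None:
        return []
    regulator = match.group("regulator")
    targets = re.findall(GENE, match.group("targets"))
    return [(regulator, target) for target in targets]

pairs_found = extract_regulation(
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
)
# pairs_found == [("HAP1", "CYC1"), ("HAP1", "CYC7")]
```

The rule yields both an association type (regulation of expression) and a direction, but it fires only on sentences phrased this way.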

machine learning

manually annotated corpus

association type

direction

high precision

poor recall

manual work

co-mentioning

counting

within documents

within paragraphs

within sentences

scoring scheme

weighted counts

normalization
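
The counting and scoring steps above can be sketched as follows. Each pair gets a weighted count over documents, paragraphs and sentences, and the count is then normalized by how often each entity is mentioned overall. The weights, the exponent `ALPHA`, the exact normalization formula, and the input layout are illustrative assumptions; the slides do not specify them.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative weights for co-mentioning at each level (assumed values).
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 0.2}
ALPHA = 0.6  # assumed balance between raw count and observed/expected ratio

def pairs(entities):
    """All unordered entity pairs, each as a sorted tuple."""
    return {tuple(sorted(p)) for p in combinations(set(entities), 2)}

def weighted_counts(documents):
    """documents: list of documents, each a list of paragraphs, each a list
    of sentences, each sentence a set of entity identifiers."""
    counts = defaultdict(float)
    for doc in documents:
        sent_pairs, par_pairs, doc_entities = set(), set(), set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= set(sentence)
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            counts[pair] += (WEIGHTS["document"]
                             + WEIGHTS["paragraph"] * (pair in par_pairs)
                             + WEIGHTS["sentence"] * (pair in sent_pairs))
    return counts

def normalized_scores(counts, alpha=ALPHA):
    """Combine each weighted count with its observed/expected ratio."""
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    return {
        (a, b): c ** alpha * (c * total / (marginal[a] * marginal[b])) ** (1 - alpha)
        for (a, b), c in counts.items()
    }

docs = [
    [[{"CDK1", "CCNB1"}], [{"TP53"}]],   # two paragraphs, one sentence each
    [[{"CDK1"}, {"TP53"}]],              # one paragraph, two sentences
]
counts = weighted_counts(docs)
scores = normalized_scores(counts)
```

In this toy input the CDK1/CCNB1 pair, co-mentioned within a sentence, ends up with a higher score than CCNB1/TP53, which only share a document.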

easy

high recall

high precision

undirected associations

unknown type

use cases

natural language processing

transcription factor targets

kinase substrates

protein–protein interactions

co-mentioning

drug targets

protein function

subcellular localization

compartments.jensenlab.org Binder et al., Database, 2014

tissue expression

tissues.jensenlab.org Santos et al., PeerJ, 2015

disease genes

diseases.jensenlab.org Frankild et al., Methods, 2015

disease mutations