Mining literature and medical records

Transcript of Mining literature and medical records

Page 1: Mining literature and medical records

Lars Juhl Jensen

Mining literature and medical records

>10 km

Page 2: Mining literature and medical records

exponential growth

Page 3: Mining literature and medical records
Page 4: Mining literature and medical records
Page 5: Mining literature and medical records

~45 seconds per paper

Page 6: Mining literature and medical records

outline

Page 7: Mining literature and medical records

information retrieval

Page 8: Mining literature and medical records

named entity recognition

Page 9: Mining literature and medical records

augmented browsing

Page 10: Mining literature and medical records

information extraction

Page 11: Mining literature and medical records

text corpora

Page 12: Mining literature and medical records

web resources

Page 13: Mining literature and medical records

electronic health records

Page 14: Mining literature and medical records

medical text mining

Page 15: Mining literature and medical records

questions

Page 16: Mining literature and medical records

information retrieval

Page 17: Mining literature and medical records

find the relevant papers

Page 18: Mining literature and medical records

ad hoc retrieval

Page 19: Mining literature and medical records

user-specified query

Page 20: Mining literature and medical records

“yeast AND cell cycle”

Page 21: Mining literature and medical records

PubMed

Page 22: Mining literature and medical records
Page 23: Mining literature and medical records

indexing

Page 24: Mining literature and medical records

fast lookup
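As a rough illustration of how indexing gives fast lookup for a query such as “yeast AND cell cycle”, here is a minimal sketch of an inverted index with boolean AND retrieval. The three-document corpus and the function names `build_index` and `search_and` are assumptions made for this example; this is not how PubMed is implemented, and the phrase "cell cycle" is simply treated as two separate terms.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:()")].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents that contain every term (boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document corpus, for illustration only.
docs = {
    1: "Regulation of the yeast cell cycle by Cdc28.",
    2: "Cell cycle checkpoints in human cells.",
    3: "Yeast metabolism under glucose limitation.",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
```

Each query term costs one dictionary lookup plus a set intersection, so the documents themselves are never rescanned at query time.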

Page 25: Mining literature and medical records

stemming

Page 26: Mining literature and medical records

word endings
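Stemming by stripping word endings can be sketched as follows. The suffix list and the minimum stem length are assumptions chosen for this toy example; real search engines use more careful algorithms, and this is not one of them.

```python
# Hypothetical suffix list, checked in order (longer endings first).
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word):
    """Strip the first matching ending, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: crude_stem(w) for w in ["cycle", "cycles", "cycled", "cycling"]}
```

All four word forms collapse to the same stem, so a query for one form would also retrieve documents that use the others.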

Page 27: Mining literature and medical records

dynamic query expansion

Page 28: Mining literature and medical records

MeSH terms

Page 29: Mining literature and medical records

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 30: Mining literature and medical records

no tool will find that

Page 31: Mining literature and medical records

named entity recognition

Page 32: Mining literature and medical records

identify the concepts

Page 33: Mining literature and medical records

computer

Page 34: Mining literature and medical records

as smart as a dog

Page 35: Mining literature and medical records

teach it specific tricks

Page 36: Mining literature and medical records
Page 37: Mining literature and medical records
Page 38: Mining literature and medical records

comprehensive lexicon

Page 39: Mining literature and medical records

small molecules

Page 40: Mining literature and medical records

proteins

Page 41: Mining literature and medical records

cellular components

Page 42: Mining literature and medical records

tissues

Page 43: Mining literature and medical records

organisms

Page 44: Mining literature and medical records

environments

Page 45: Mining literature and medical records

diseases

Page 46: Mining literature and medical records

phenotypes

Page 47: Mining literature and medical records

behaviors

Page 48: Mining literature and medical records

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 49: Mining literature and medical records

orthographic variation

Page 50: Mining literature and medical records

prefixes and postfixes

Page 51: Mining literature and medical records

CDC28 vs. Cdc28p

Page 52: Mining literature and medical records

Myc vs. c-Myc

Page 53: Mining literature and medical records

singular and plural forms

Page 54: Mining literature and medical records

noun and adjective forms

Page 55: Mining literature and medical records

flexible matching

Page 56: Mining literature and medical records

upper- and lower-case

Page 57: Mining literature and medical records

spaces and hyphens
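Flexible matching against a lexicon, ignoring upper- and lower-case as well as spaces and hyphens, might look like the sketch below. The mini-lexicon, its identifiers, and the helper names `normalize` and `tag` are assumptions for illustration; the tagger used in the talk is not shown in the slides.

```python
import re

def normalize(name):
    """Lowercase and drop spaces and hyphens so spelling variants collide."""
    return re.sub(r"[\s\-]+", "", name.lower())

# Hypothetical mini-lexicon: normalized name -> illustrative identifier.
LEXICON = {normalize(name): ident for name, ident in [
    ("Cdc28", "CDC28"), ("Swe1", "SWE1"), ("Cdc5", "CDC5"), ("c-Myc", "MYC"),
]}

def tag(text):
    """Return (matched string, identifier) pairs for lexicon hits in text."""
    spans = [m.span() for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits, i = [], 0
    while i < len(spans):
        for width in (2, 1):  # try the longer two-token candidate first
            if i + width > len(spans):
                continue
            candidate = text[spans[i][0]:spans[i + width - 1][1]]
            ident = LEXICON.get(normalize(candidate))
            if ident:
                hits.append((candidate, ident))
                i += width
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return hits

found = tag("CDC-28 phosphorylates swe1 in a Cdc5-dependent manner")
```

Joining adjacent tokens lets "CDC-28" and "CDC 28" match the same entry as "Cdc28", at the cost of occasional false joins between unrelated neighbouring words.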

Page 58: Mining literature and medical records

disambiguation

Page 59: Mining literature and medical records

homonyms

Page 60: Mining literature and medical records

“black list”

Page 61: Mining literature and medical records

unfortunate names

Page 62: Mining literature and medical records

SDS

Page 63: Mining literature and medical records

a
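A black list can be applied as a simple post-filter on the tagger output. In this sketch the two black-listed names are the ones shown on the slides; the tagger hits and their identifiers are hypothetical.

```python
# Names from the slides that are too ambiguous to tag reliably.
BLACKLIST = {"sds", "a"}

def filter_hits(hits):
    """Drop (name, identifier) pairs whose name is on the black list."""
    return [(name, ident) for name, ident in hits
            if name.lower() not in BLACKLIST]

# Hypothetical tagger output; the identifiers are placeholders.
raw_hits = [("SDS", "GENE_1"), ("Cdc28", "CDC28"), ("a", "GENE_2")]
kept = filter_hits(raw_hits)
```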

Page 64: Mining literature and medical records

scalable implementation

Page 65: Mining literature and medical records

>10 km

<10 hours

Page 66: Mining literature and medical records

augmented browsing

Page 67: Mining literature and medical records

show relevant information

Page 68: Mining literature and medical records

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 69: Mining literature and medical records

Reflect

Page 70: Mining literature and medical records

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

O’Donoghue et al., Journal of Web Semantics, 2010

Page 71: Mining literature and medical records

browser add-on

Page 72: Mining literature and medical records

Firefox

Page 73: Mining literature and medical records

Google Chrome

Page 74: Mining literature and medical records

Safari

Page 75: Mining literature and medical records

Internet Explorer

Page 76: Mining literature and medical records

PDF viewer

Page 77: Mining literature and medical records

Utopia Documents

Page 78: Mining literature and medical records

web services

Page 79: Mining literature and medical records
Page 80: Mining literature and medical records

still too much to read

Page 81: Mining literature and medical records

information extraction

Page 82: Mining literature and medical records

formalize the facts

Page 83: Mining literature and medical records

the starting point

Page 84: Mining literature and medical records

named entity recognition

Page 85: Mining literature and medical records

two approaches

Page 86: Mining literature and medical records

co-mentioning

Page 87: Mining literature and medical records

within documents

Page 88: Mining literature and medical records

within paragraphs

Page 89: Mining literature and medical records

within sentences

Page 90: Mining literature and medical records

weighted counts

Page 91: Mining literature and medical records
Page 92: Mining literature and medical records

co-mentioning score

Page 93: Mining literature and medical records

absolute co-mentionings

Page 94: Mining literature and medical records

relative overrepresentation
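The slides do not give the scoring formula, so the following is only one plausible way to combine weighted counts, absolute co-mentionings and relative overrepresentation into a single co-mentioning score. The weights `W_DOC`, `W_PAR`, `W_SENT`, the exponent `ALPHA`, the corpus layout and the toy data are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations

# Assumed weights for co-mentions within a document, paragraph and sentence.
W_DOC, W_PAR, W_SENT = 1.0, 2.0, 0.2
ALPHA = 0.6  # assumed balance between absolute count and overrepresentation

def pairs(entities):
    """All unordered entity pairs, each as a sorted tuple."""
    return {tuple(sorted(p)) for p in combinations(entities, 2)}

def weighted_counts(corpus):
    """corpus: list of documents; document: list of paragraphs;
    paragraph: list of sentences; sentence: set of tagged entity names."""
    counts = Counter()
    for doc in corpus:
        par_pairs, sent_pairs, doc_entities = set(), set(), set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= sentence
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            counts[pair] += (W_DOC + W_PAR * (pair in par_pairs)
                             + W_SENT * (pair in sent_pairs))
    return counts

def scores(counts):
    """Score = count^ALPHA * (observed / expected)^(1 - ALPHA)."""
    total = sum(counts.values())
    marginal = Counter()
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    return {(a, b): c ** ALPHA
                    * (c * total / (marginal[a] * marginal[b])) ** (1 - ALPHA)
            for (a, b), c in counts.items()}

# Toy corpus: doc 1 has one paragraph with two sentences; doc 2 has two paragraphs.
corpus = [
    [[{"CDC28", "SWE1"}, {"CDC5"}]],
    [[{"CDC28"}], [{"CLB2"}]],
]
counts = weighted_counts(corpus)
result = scores(counts)
```

In this toy corpus the pair mentioned in the same sentence scores higher than the pair that only shares a document, which is the intended effect of weighting the three levels differently.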

Page 95: Mining literature and medical records
Page 96: Mining literature and medical records

NLP

Natural Language Processing

Page 97: Mining literature and medical records

grammatical analysis

Page 98: Mining literature and medical records

part-of-speech tagging

Page 99: Mining literature and medical records

noun, verb, etc.

Page 100: Mining literature and medical records

multiword detection

Page 101: Mining literature and medical records

semantic tagging

Page 102: Mining literature and medical records

binding, regulation, etc.

Page 103: Mining literature and medical records

sentence parsing

Page 104: Mining literature and medical records

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
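To give a feel for rule-based extraction from a sentence like the bracketed one above, here is a deliberately tiny sketch that uses a single regular expression in place of a real grammar. The `GENE` pattern, the rule itself and the output format are assumptions for illustration; an actual NLP pipeline works on part-of-speech tags and parsed chunks rather than on raw strings.

```python
import re

# Crude stand-in for named entity recognition: three capitals plus digits.
GENE = r"[A-Z]{3}\d+"

# One hand-written rule: "expression of <genes> is [not] controlled by <gene>".
PATTERN = re.compile(
    rf"expression of (?:the )?(?:[\w ]+? )?(?P<targets>{GENE}(?: and {GENE})*) "
    rf"is (?P<negated>not )?controlled by (?P<regulator>{GENE})"
)

def extract(sentence):
    """Return (regulator, relation, target) triples stated in the sentence."""
    m = PATTERN.search(sentence)
    if not m:
        return []
    relation = "does not control" if m.group("negated") else "controls"
    targets = re.findall(GENE, m.group("targets"))
    return [(m.group("regulator"), relation, t) for t in targets]

facts = extract(
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
)
```

The rule only fires on sentences phrased exactly this way and returns nothing for any other wording, so every new sentence pattern or domain needs its own rules.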

Page 105: Mining literature and medical records

extract stated facts

Page 106: Mining literature and medical records

handle negations

Page 107: Mining literature and medical records

high precision

Page 108: Mining literature and medical records

poor recall

Page 109: Mining literature and medical records

highly domain specific

Page 110: Mining literature and medical records

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 111: Mining literature and medical records

text corpora

Page 112: Mining literature and medical records

a body of text

Page 113: Mining literature and medical records

most use abstracts

Page 114: Mining literature and medical records

few use full-text articles

Page 115: Mining literature and medical records

no access

Page 116: Mining literature and medical records
Page 117: Mining literature and medical records

~22 million abstracts

Page 118: Mining literature and medical records

~1.8 million free articles

Page 119: Mining literature and medical records

~1.4 million Elsevier articles

Page 120: Mining literature and medical records

~7.5 million patents

Page 121: Mining literature and medical records

web resources

Page 122: Mining literature and medical records

information on proteins

Page 123: Mining literature and medical records

iHOP

Page 124: Mining literature and medical records
Page 125: Mining literature and medical records
Page 126: Mining literature and medical records

STRING

Page 127: Mining literature and medical records

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

Page 128: Mining literature and medical records

text mining channel

Page 129: Mining literature and medical records

what is known

Page 130: Mining literature and medical records

not in databases

Page 131: Mining literature and medical records

human proteins

Page 132: Mining literature and medical records
Page 133: Mining literature and medical records

co-mentioning dominates

Page 134: Mining literature and medical records

NLP provides actions

Page 135: Mining literature and medical records

homology transfer

Page 136: Mining literature and medical records

STITCH

Page 137: Mining literature and medical records

small molecules

Page 138: Mining literature and medical records
Page 139: Mining literature and medical records

COMPARTMENTS

Page 140: Mining literature and medical records

subcellular localization

Page 141: Mining literature and medical records
Page 142: Mining literature and medical records

DISEASES

Page 143: Mining literature and medical records

human diseases

Page 144: Mining literature and medical records

search for a protein

Page 145: Mining literature and medical records
Page 146: Mining literature and medical records

search for a disease

Page 147: Mining literature and medical records
Page 148: Mining literature and medical records

STRING payload

Page 149: Mining literature and medical records

evidence viewers

Page 150: Mining literature and medical records
Page 151: Mining literature and medical records

electronic health records

Page 152: Mining literature and medical records

what happens at a hospital

Page 153: Mining literature and medical records

Jensen et al., Nature Reviews Genetics, 2012

Page 154: Mining literature and medical records

two types of data

Page 155: Mining literature and medical records

structured data

Page 156: Mining literature and medical records

Jensen et al., Nature Reviews Genetics, 2012

Page 157: Mining literature and medical records

unstructured data

Page 158: Mining literature and medical records

clinical narrative

Page 159: Mining literature and medical records
Page 160: Mining literature and medical records

getting access

Page 161: Mining literature and medical records

patient consent

Page 162: Mining literature and medical records

opt-out

Page 163: Mining literature and medical records

opt-in

Page 164: Mining literature and medical records

ethical approval

Page 165: Mining literature and medical records

medical question

Page 166: Mining literature and medical records

no explorative studies

Page 167: Mining literature and medical records

data security

Page 168: Mining literature and medical records

not anonymized

Page 169: Mining literature and medical records

not transferable

Page 170: Mining literature and medical records

hospital IT systems

Page 171: Mining literature and medical records

not standardized

Page 172: Mining literature and medical records

clinical narrative

Page 173: Mining literature and medical records

not normal language

Page 174: Mining literature and medical records

trouble for NLP

Page 175: Mining literature and medical records

in native language

Page 176: Mining literature and medical records

not English

Page 177: Mining literature and medical records

few tools

Page 178: Mining literature and medical records

no dictionaries

Page 179: Mining literature and medical records

by busy doctors and nurses

Page 180: Mining literature and medical records

typos

Page 181: Mining literature and medical records

medical text mining

Page 182: Mining literature and medical records

what is possible?

Page 183: Mining literature and medical records

a psychiatric corpus

Page 184: Mining literature and medical records

clinical narrative

Page 185: Mining literature and medical records
Page 186: Mining literature and medical records

in Danish

Page 187: Mining literature and medical records

dictionaries

Page 188: Mining literature and medical records

diseases

Page 189: Mining literature and medical records

drugs

Page 190: Mining literature and medical records

adverse drug reactions

Page 191: Mining literature and medical records

disease comorbidity

Page 192: Mining literature and medical records

Jensen et al., Nature Reviews Genetics, 2012

Page 193: Mining literature and medical records

multiple testing

Page 194: Mining literature and medical records

comorbidity matrix
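A comorbidity matrix with correction for multiple testing could be sketched as below. This assumes a one-sided Fisher exact test per disease pair and a Bonferroni correction, neither of which is stated on the slides; the patient data are invented and the diagnosis codes are used purely as labels.

```python
from itertools import combinations
from math import comb

def fisher_right_tail(both, only_a, only_b, neither):
    """One-sided Fisher exact p-value for more co-occurrence than expected."""
    n = both + only_a + only_b + neither
    row = both + only_a  # patients with diagnosis A
    col = both + only_b  # patients with diagnosis B
    tail = sum(comb(col, k) * comb(n - col, row - k)
               for k in range(both, min(row, col) + 1))
    return tail / comb(n, row)

def comorbidity_matrix(patients, alpha=0.05):
    """patients: dict of patient id -> set of diagnosis codes.
    Returns {(a, b): (co-occurrence count, p-value, significant?)}."""
    codes = sorted(set().union(*patients.values()))
    tests = list(combinations(codes, 2))
    if not tests:
        return {}
    threshold = alpha / len(tests)  # Bonferroni: one test per disease pair
    n = len(patients)
    matrix = {}
    for a, b in tests:
        has_a = sum(1 for d in patients.values() if a in d)
        has_b = sum(1 for d in patients.values() if b in d)
        both = sum(1 for d in patients.values() if a in d and b in d)
        p = fisher_right_tail(both, has_a - both, has_b - both,
                              n - has_a - has_b + both)
        matrix[(a, b)] = (both, p, p < threshold)
    return matrix

# Invented cohort: six patients share two diagnoses, six have a third one.
patients = {i: {"F20", "E11"} for i in range(6)}
patients.update({i: {"I10"} for i in range(6, 12)})
matrix = comorbidity_matrix(patients)
```

With three diagnosis codes there are three pairwise tests, so each p-value is compared against 0.05 / 3 instead of 0.05.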

Page 195: Mining literature and medical records

Roque et al., PLoS Computational Biology, 2011

Page 196: Mining literature and medical records

patient clustering

Page 197: Mining literature and medical records

Jensen et al., Nature Reviews Genetics, 2012

Page 198: Mining literature and medical records

clustering algorithm

Page 199: Mining literature and medical records

Roque et al., PLoS Computational Biology, 2011

Page 200: Mining literature and medical records

patient stratification

Page 201: Mining literature and medical records

temporal correlation

Page 202: Mining literature and medical records

drug treatment

Page 203: Mining literature and medical records

adverse drug events

Page 204: Mining literature and medical records

Eriksson et al., in preparation, 2012

Page 205: Mining literature and medical records

pharmacovigilance

Page 206: Mining literature and medical records

thank you!

Page 207: Mining literature and medical records

questions?