Biomedical literature mining
![Page 1: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/1.jpg)
Biological Literature Mining
Lars Juhl Jensen
EMBL
![Page 2: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/2.jpg)
Why?
![Page 3: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/3.jpg)
Overview
• Information retrieval and entity recognition
  - Methodologies for finding and classifying texts
  - Identification of gene/protein/drug names in text
• Information extraction and text/data mining
  - Statistical and NLP methods for relation extraction
  - Making discoveries from text alone
  - Integration of text and other data types
![Page 4: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/4.jpg)
Status
• IR, ER, and simple IE methods are fairly well established
• Advanced NLP-based IE systems are rapidly being improved
• Methods for text mining and text/data integration are still in their infancy
![Page 5: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/5.jpg)
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
![Page 6: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/6.jpg)
Information Retrieval and Entity Recognition
Lars Juhl Jensen
EMBL
![Page 7: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/7.jpg)
Overview
• Ad hoc information retrieval
  - The user enters a query or a set of keywords
  - The system attempts to retrieve the relevant texts from a large text corpus (typically Medline)
• Text categorization
  - A training set of texts is created in which texts are manually assigned to classes (often only yes/no)
  - A machine learning method is trained to classify texts
  - This method can subsequently be used to classify a much larger text corpus
![Page 8: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/8.jpg)
Ad hoc IR
• These systems are very useful since the user can provide any query
  - The query is typically Boolean (yeast AND cell cycle)
  - A few systems instead allow the relative weight of each search term to be specified by the user
• The art is to find the relevant papers even if they do not actually match the query
  - Ideally our example sentence should be extracted by the query yeast cell cycle although none of these words are mentioned
![Page 9: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/9.jpg)
![Page 10: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/10.jpg)
![Page 11: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/11.jpg)
![Page 12: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/12.jpg)
![Page 13: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/13.jpg)
Automatic query expansion
• In a typical query, the user will not have provided all relevant words and variants thereof
• By automatically expanding queries with additional search terms, recall can be improved
  - Stemming removes common endings (yeast / yeasts)
  - Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae)
  - The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28)
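The first two expansion steps can be sketched in a few lines of Python. This is only an illustration: the `THESAURUS` contents, the one-rule `stem` function, and the function names are assumptions made for the example, and the ontology-based inference step is not attempted.

```python
import re

# Hypothetical thesaurus; a real system would load curated synonym lists.
THESAURUS = {
    "yeast": ["s. cerevisiae", "saccharomyces cerevisiae"],
}

def stem(word):
    """Crude stemming: strip a trailing plural 's' and nothing else."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def expand_query(query):
    """Return the set of search terms after stemming and synonym expansion."""
    terms = set()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        base = stem(word)
        terms.add(base)
        terms.update(THESAURUS.get(base, []))
    return terms
```

With this sketch, `expand_query("yeasts")` yields the stem "yeast" plus both thesaurus entries, so a document that only mentions S. cerevisiae can still match.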
![Page 14: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/14.jpg)
![Page 15: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/15.jpg)
Document similarity
• The similarity of two documents can be defined based on their word content
  - Each document can be represented by a word vector
  - Words should be weighted based on their frequency and background frequency
  - The most commonly used scheme is tf*idf weighting
• Document similarity can be used in ad hoc IR
  - Rather than matching the query against each document only, the N most similar documents are also considered
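A minimal sketch of tf*idf weighting and document similarity follows. It assumes documents are already tokenized, uses the plain `tf * log(N / df)` variant of the weighting, and uses cosine similarity as the comparison measure; the slides do not specify either choice, and the function names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Words that occur in every document get weight zero, which is how the background frequency suppresses uninformative terms.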
![Page 16: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/16.jpg)
Document clustering
• Unsupervised clustering algorithms can be applied to a document similarity matrix
  - All pairwise document similarities are calculated
  - Clusters of “similar documents” can be constructed using one of numerous standard clustering methods
• Practical uses of document clustering
  - The “related documents” function in PubMed
  - Logical organization of the documents found by IR
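As one example of a standard method, the sketch below groups documents by thresholded single linkage: any two documents whose similarity reaches the threshold end up in the same cluster. The choice of method, the threshold parameter, and the function name are assumptions for illustration only.

```python
def single_linkage_clusters(sim, threshold):
    """sim: symmetric similarity matrix (list of lists).
    Returns clusters as sorted lists of document indices."""
    n = len(sim)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge the clusters of every sufficiently similar pair.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Single linkage tends to produce long chains of loosely related documents; average or complete linkage are common alternatives when tighter clusters are wanted.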
![Page 17: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/17.jpg)
Text categorization
• These systems are a lot less flexible than ad hoc systems but can attain better accuracy
  - Works on a pre-defined set of document classes
  - Each class is defined by manually assigning a number of documents to it
• Method
  - Rules may be manually crafted based on a very small set of manually classified documents
  - Statistical machine learning methods can be trained on a large number of classified documents
![Page 18: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/18.jpg)
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Hints in the text
  - Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”)
  - Weaker: mitotic cyclin, Clb2, and Cdk1 (“cell cycle”)
![Page 19: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/19.jpg)
Machine learning
• Input features
  - Word content or bi-/tri-grams
  - Part-of-speech tags
  - Filtering (stop words, part-of-speech)
  - Singular value decomposition
• Training
  - Support vector machines are best suited
  - Choice of kernel function
  - Separate training and evaluation sets, cross validation
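The feature extraction and evaluation steps can be sketched without any machine learning library. The example below builds word and bi-gram features after stop-word filtering and produces cross-validation splits; the stop-word list and function names are illustrative assumptions, and training the SVM itself is not shown because it would require a third-party package.

```python
STOP_WORDS = {"the", "of", "and", "a", "this", "as", "to"}  # illustrative only

def ngram_features(text, n_max=2):
    """Set of word n-grams (n = 1..n_max), built after stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

Note that filtering before forming bi-grams joins words that were not adjacent in the text; whether that is desirable depends on the task.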
![Page 20: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/20.jpg)
![Page 21: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/21.jpg)
Entity recognition
• An important but boring problem
  - The genes/proteins/drugs mentioned within a given text must be identified
• Recognition vs. identification
  - Recognition: find the words that are names of entities
  - Identification: figure out which entities they refer to
  - Recognition without identification is of limited use
![Page 22: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/22.jpg)
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Entities identified
  - S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
![Page 23: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/23.jpg)
Recognition
• Features
  - Morphological: mixes letters and digits or ends in -ase
  - Context: followed by “protein” or “gene”
  - Grammar: should occur as a noun
• Methodologies
  - Manually crafted rule-based systems
  - Machine learning (SVMs)
• But what can it be used for?
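The morphological and context features listed above can be computed per token as in the sketch below. The feature names and function signature are assumptions for illustration; the grammatical feature is left out because it would need a part-of-speech tagger.

```python
import re

def recognition_features(tokens, i):
    """Features for tokens[i] based on morphological and context cues."""
    word = tokens[i]
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        # Morphological: the token mixes letters and digits (e.g. GAL4)
        "mixes_letters_digits": bool(re.search(r"[A-Za-z]", word)
                                     and re.search(r"\d", word)),
        # Morphological: the token ends in -ase (e.g. kinase)
        "ends_in_ase": word.lower().endswith("ase"),
        # Context: the next token is a cue word
        "followed_by_cue": nxt in {"protein", "gene"},
    }
```

Such a feature dictionary could feed either hand-written rules or a trained classifier.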
![Page 24: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/24.jpg)
Identification
• A good synonyms list is the key
  - Combine many sources
  - Curate to eliminate stop words
• Flexible matching to handle orthographic variation
  - Case variation: CDC28, Cdc28, and cdc28
  - Prefixes: myc and c-myc
  - Postfixes: Cdc28 and Cdc28p
  - Spaces and hyphens: cdc28 and cdc-28
  - Latin vs. Greek letters: TNF-alpha and TNFA
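A minimal sketch of flexible matching against a synonyms list is shown below. It normalizes case, spaces, hyphens, and spelled-out Greek letters, and registers a "p" postfix variant for every name. The `GREEK` table, the function names, and the decision to leave prefixes such as c-myc to the synonyms list itself are assumptions for this example.

```python
import re

GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}  # illustrative subset

def normalize(name):
    """Map orthographic variants of a name to one lookup key."""
    key = name.lower()                 # case variation
    key = re.sub(r"[\s\-]+", "", key)  # spaces and hyphens
    for word, letter in GREEK.items(): # spelled-out Greek letters
        key = key.replace(word, letter)
    return key

def build_index(synonyms):
    """synonyms: {identifier: [names]}. Returns {key: set of identifiers}."""
    index = {}
    for ident, names in synonyms.items():
        for name in names:
            key = normalize(name)
            # Also register the protein postfix variant (Cdc28 / Cdc28p).
            for k in (key, key + "p"):
                index.setdefault(k, set()).add(ident)
    return index

def identify(mention, index):
    """Return the identifiers a mention may refer to (empty set if unknown)."""
    return index.get(normalize(mention), set())
```

For example, an index built from `{"YBR160W": ["CDC28"]}` resolves both "cdc-28" and "Cdc28p" to YBR160W. A lookup that returns more than one identifier signals an ambiguous name.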
![Page 25: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/25.jpg)
Disambiguation
• The same word may mean many different things
  - Entity names may also be common English words (hairy) or technical terms (SDS)
  - Protein names may refer to related or unrelated proteins in other species (cdc2)
• The meaning can be resolved from the context
  - ER can distinguish between names and common words
  - Disambiguating non-unique names is a hard problem
  - Ambiguity between orthologs can safely be ignored
![Page 26: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/26.jpg)
![Page 27: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/27.jpg)
![Page 28: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/28.jpg)
![Page 29: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/29.jpg)
Summary
• Information retrieval
  - Ad hoc IR is more flexible than text categorization as it does not require a separate training set for each topic
  - Some topics are not easily described by a query
  - Text categorization methods can generally attain better recall and accuracy than ad hoc IR methods
• Entity recognition
  - It is not sufficient to recognize names – the entities should also be identified
  - The best methods rely on curated synonyms lists
![Page 30: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/30.jpg)
Information Extraction and Text/Data Mining
Lars Juhl Jensen
EMBL
![Page 31: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/31.jpg)
Overview
• Information extraction (IE)
  - Simple statistical co-occurrence methods
  - Combining co-occurrence and text categorization
  - Natural Language Processing (NLP)
• Text/data mining
  - Discovery of global trends from text alone
  - Mining text for overlooked relations
  - Augmenting text mining with other data types
  - Automated annotation of high-throughput data
![Page 32: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/32.jpg)
Co-occurrence
• Relations are extracted for co-occurring entities
  - Relations are always symmetric
  - The type of relation is not given
• Scoring the relations
  - More co-occurrences → more significant
  - Ubiquitous entities → less significant
  - Same sentence vs. same paragraph
• Simple, good recall, poor precision
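One simple way to realize this scoring is an observed/expected ratio, sketched below. The scoring formula is an assumption chosen for illustration (the slides do not give one): the pair count is divided by the count expected if the two entities occurred independently, so frequent pairs score higher and ubiquitous entities score lower.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(sentences):
    """sentences: list of sets of entity names, one set per sentence.
    Returns {(a, b): observed / expected co-occurrence} with a < b."""
    n = len(sentences)
    single = Counter()  # number of sentences mentioning each entity
    pair = Counter()    # number of sentences mentioning both entities
    for entities in sentences:
        names = sorted(set(entities))
        single.update(names)
        pair.update(combinations(names, 2))
    return {(a, b): c * n / (single[a] * single[b])
            for (a, b), c in pair.items()}
```

Because each pair is stored once in sorted order, the relation is symmetric and untyped, as described above. Running the same function on paragraph-level entity sets gives the looser variant.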
![Page 33: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/33.jpg)
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Relations
  - Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
  - Wrong: Clb2–Cdc5 and Cdc28–Cdc5
![Page 34: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/34.jpg)
![Page 35: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/35.jpg)
Categorization
• Extracting specific types of relations
  - Text categorization methods can be used to identify sentences that mention a certain type of relation
  - Filtering can be done before or after relation extraction
• Well suited for database curation
  - Text categorization can be reused
  - High recall is most important
  - Curators can compensate for the lack of precision
![Page 36: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/36.jpg)
![Page 37: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/37.jpg)
NLP
• Information is extracted based on parsing and interpreting phrases or full sentences
  - Good at extracting specific types of relations
  - Handles directed relations
• Complex, good precision, poor recall
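To illustrate what a directed, typed relation looks like in practice, the toy sketch below matches two surface patterns for phosphorylation. This is plain regular-expression matching, a far simpler stand-in for real parsing; the patterns, the entity regex, and the function name are all assumptions made for the example.

```python
import re

# Toy entity pattern: a capitalized token such as Cdc28, Swe1, or SHP-1.
ENTITY = r"(?P<{}>[A-Z][A-Za-z0-9\-]*)"

PATTERNS = [
    # Active voice: "Lyn phosphorylated CrkL"
    re.compile(ENTITY.format("agent")
               + r" (?:directly )?phosphorylate[sd]? "
               + ENTITY.format("target")),
    # Nominal form: "phosphorylation of Shc by Syk"
    re.compile(r"phosphorylation of " + ENTITY.format("target")
               + r" by " + ENTITY.format("agent")),
]

def extract_phosphorylation(sentence):
    """Return (agent, target) pairs; the direction of the relation is kept."""
    return [(m.group("agent"), m.group("target"))
            for p in PATTERNS for m in p.finditer(sentence)]
```

The sketch also shows why recall is poor: any phrasing not covered by a pattern, such as an intervening clause or modifier, is simply missed.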
![Page 38: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/38.jpg)
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Relations
  - Complex: Clb2–Cdc28
  - Phosphorylation: Clb2→Swe1, Cdc28→Swe1, and Cdc5→Swe1
![Page 39: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/39.jpg)
Architecture
• Tokenization
  - Entity recognition with synonyms list
  - Word boundaries (multi words)
  - Sentence boundaries (abbreviations)
• Part-of-speech tagging
  - TreeTagger trained on GENIA
• Semantic labeling
  - Dictionary of regular expressions
• Entity and relation chunking
  - Rule-based system implemented in CASS
![Page 40: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/40.jpg)
Semantic labeling
  - Gene and protein names
  - Cue words for entity recognition
  - Cue words for relation extraction
Named entity chunking
  - A CASS grammar recognizes noun chunks related to gene expression: [nxgene The GAL4 gene]
Relation chunking
  - Our CASS grammar also extracts relations between entities: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
![Page 41: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/41.jpg)
[expression_repression_active Btk regulates the IL-2 gene]
[dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1]
[phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk]
[phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn]
[phosphorylation_active Lyn also participates in [phosphorylation the tyrosine phosphorylation and activation of syk]]
[phosphorylation_active Lyn, [negation but not Jak2] phosphorylated CrkL]
[expression_repression_active IL-10 also decreased [expression mRNA expression of IL-2 and IL18 cytokine receptors]]
[expression_activation_passive [expression IL-13 expression] induced by IL-2 + IL-18]
![Page 42: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/42.jpg)
![Page 43: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/43.jpg)
Mining text for nuggets
• New relations can be inferred from published ones
  - This can lead to actual discoveries if no person knows all the facts required for making the inference
  - Combining facts from disconnected literatures
• Swanson’s pioneering work
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
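The inference pattern behind this kind of discovery can be sketched as a two-step lookup: if A is linked to B in one literature and B to C in another, but A and C are never linked directly, then A–C is a candidate hypothesis. The function name and the toy relation set used to exercise it are assumptions for illustration, not extracted data.

```python
def abc_hypotheses(relations, a):
    """relations: set of (x, y) pairs reported in the literature,
    treated as symmetric. Returns {c: set of linking terms b} for
    every c reachable from a in two steps but not linked to a directly."""
    neighbors = {}
    for x, y in relations:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)

    direct = neighbors.get(a, set())
    hypotheses = {}
    for b in direct:
        for c in neighbors.get(b, set()):
            if c != a and c not in direct:
                hypotheses.setdefault(c, set()).add(b)
    return hypotheses
```

In practice the number of candidate C terms explodes, so the linking B terms usually need filtering and ranking before the output is useful.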
![Page 44: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/44.jpg)
![Page 45: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/45.jpg)
![Page 46: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/46.jpg)
Trends
• Most similar to existing data mining approaches
  - Although all the detailed data is in the text, people may have missed the big picture
• Temporal trends
  - Historical summaries
  - Forecasting
• Correlations
  - “Customers who bought this item also bought …”
![Page 47: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/47.jpg)
Time
![Page 48: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/48.jpg)
Successful genes
![Page 49: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/49.jpg)
Buzzwords
![Page 50: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/50.jpg)
Correlations
• “Customers who bought this item also bought …”
• Protein networks
  - “Proteins that regulate expression …”
  - “Proteins that control phosphorylation …”
  - “Proteins that are phosphorylated …”
• Co-author networks
![Page 51: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/51.jpg)
Transcriptional networks
Venn diagram: Regulates vs. Regulated (counts shown: 3279, 83, 3592)
P < 9×10⁻⁹
![Page 52: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/52.jpg)
Signaling pathways
Venn diagram: Phosphorylates vs. Phosphorylated (counts shown: 1127, 44, 3704)
P < 2×10⁻⁷
![Page 53: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/53.jpg)
Multiple regulation
Venn diagram: Expression vs. Phosphorylation (counts shown: 8107, 47, 3625)
P < 5×10⁻⁴
![Page 54: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/54.jpg)
![Page 55: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/55.jpg)
Integration
• Automatic annotation of high-throughput data
  - Loads of fairly trivial methods
• Protein interaction networks
  - Can unify many types of interactions
  - Powerful as exploratory visualization tools
• More creative strategies
  - Identification of candidate genes for genetic diseases
  - Linking genes to traits based on species distributions
![Page 56: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/56.jpg)
![Page 57: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/57.jpg)
![Page 58: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/58.jpg)
![Page 59: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/59.jpg)
![Page 60: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/60.jpg)
![Page 61: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/61.jpg)
RCCs
![Page 62: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/62.jpg)
Disease candidate genes
• Rank the genes within a chromosomal region to which a disease has been mapped
• Methods
  - G2D: Gene → Function → Chemical → Phenotype → Disease; uses MEDLINE but not the text
  - BITOLA: Gene → Words → Disease (similar to ARROWSMITH)
  - Hide and co-workers: Gene → Tissue → Disease
![Page 63: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/63.jpg)
G2D
![Page 64: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/64.jpg)
![Page 65: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/65.jpg)
![Page 66: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/66.jpg)
![Page 67: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/67.jpg)
Genotype–phenotype
• Genes can be linked to traits by comparing the species distributions of both
  - Mainly works for prokaryotes
  - Traits are represented by keywords
• Finding the species profiles
  - Gene profiles are found by sequence similarity
  - Keyword profiles are based on co-occurrence with the species name in MEDLINE
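Comparing the two kinds of species profiles can be sketched as a set-overlap ranking. The Jaccard index used here is an assumed similarity measure (the slides do not name one), and the function names and toy species labels are placeholders.

```python
def jaccard(a, b):
    """Overlap between two species sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_keywords_for_gene(gene_species, keyword_species):
    """gene_species: set of species in which the gene is found.
    keyword_species: {keyword: set of species it co-occurs with}.
    Returns keywords ordered from best to worst profile match."""
    scored = [(jaccard(gene_species, species), kw)
              for kw, species in keyword_species.items()]
    return [kw for score, kw in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

A keyword whose species profile closely matches that of a gene is a candidate trait for that gene; ties are broken alphabetically in this sketch.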
![Page 68: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/68.jpg)
![Page 69: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/69.jpg)
![Page 70: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/70.jpg)
Annotation
• Many experiments result in groups of related genes
  - ER is used to find the associated abstracts
  - The frequency of each word is counted in the abstracts
  - Background frequencies of all words are pre-calculated
  - A statistical test is used to rank the words
• The same strategy can be applied to find MeSH terms associated with a gene cluster
• Most people prefer using GO annotation instead
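The ranking step can be sketched with a hypergeometric test, which is one reasonable choice of statistical test; the slides do not say which test is used. The sketch assumes each word is counted once per abstract, and the function names are illustrative.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability that at least k of the n cluster abstracts
    contain a word found in K of the N background abstracts."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total

def rank_words(cluster_counts, n_cluster, background_counts, n_background):
    """Return (p-value, word) pairs sorted from most to least enriched."""
    scored = [(hypergeom_pvalue(k, n_cluster, background_counts[w], n_background), w)
              for w, k in cluster_counts.items()]
    return sorted(scored)
```

Words that are common everywhere receive p-values near 1 and sink to the bottom of the list, while words concentrated in the cluster's abstracts rise to the top. With many words tested, the p-values would need a multiple-testing correction before being interpreted.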
![Page 71: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/71.jpg)
Summary
• Information extraction
  - Co-occurrence methods generally give better recall but worse accuracy than NLP methods
  - Only NLP methods can handle directed interactions
• Text/data mining
  - Few overlooked relations can be found from text alone
  - Methods that combine text and other data types have much better discovery potential
  - Protein networks are useful for structuring other data
  - Literature-based annotation is of limited use
![Page 72: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/72.jpg)
Outlook
Lars Juhl Jensen
EMBL
![Page 73: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/73.jpg)
Death?
• Literature mining will not be made obsolete by <insert your favorite new technology here>
  - Repositories are always made too late
  - There will always be new types of relations
  - Semantically tagged XML may replace ER (hopefully!)
  - Semantically tagged XML will never tag everything
• Specific IE problems will become obsolete
  - Protein function
  - Physical protein interactions
![Page 74: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/74.jpg)
Permission denied
• Open access
  - Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible
  - Restricted access is already the primary problem
• Standard formats
  - Getting the text out of a PDF file is not trivial
  - Many journals now store papers in XML format
• Where do I get all the patent text?!
![Page 75: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/75.jpg)
Innovation
• The basic tools are now in place for IR, ER, and IE
  - Development was driven by computational linguists
• Text- and data-mining
  - Biologists are needed
  - Collaboration with linguists
• Lack of innovation
  - Very few new ideas
  - Text should be combined with other data
![Page 76: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/76.jpg)
Acknowledgments
• EML Research
  - Jasmin Saric
  - Isabel Rojas
• EMBL Heidelberg
  - Peer Bork
  - Miguel Andrade
  - Rossitza Ouzounova
  - Jan Korbel
  - Tobias Doerks
![Page 77: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/77.jpg)
Exercises
Lars Juhl Jensen
EMBL
![Page 78: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/78.jpg)
Information retrieval
• PubFinder
  - http://www.glycosciences.de/tools/PubFinder/
• Ideas
  - Do a very specific search on PubMed that retrieves only around 10–20 relevant papers
  - See if PubFinder is able to retrieve more
  - Compare this with using the “Related Articles” function in PubMed
![Page 79: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/79.jpg)
Entity recognition
• iHOP
  - http://www.pdg.cnb.uam.es/UniPub/iHOP/
• Ideas
  - Compare iHOP vs. PubMed for finding papers related to a particular gene
  - Use iHOP to construct a small literature-based network
![Page 80: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/80.jpg)
Information extraction
• Relation extraction
  - iProLINK (http://pir.georgetown.edu/iprolink/)
  - PreBIND (http://prebind.bind.ca)
  - PubGene (http://www.pubgene.org)
• Ideas
  - Check how complex a sentence iProLINK can handle
  - Check how well PreBIND can discriminate between physical and other interactions (other interactions can be found with PubGene, ProLinks, or STRING)
![Page 81: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/81.jpg)
Text mining
• ARROWSMITH
  - http://arrowsmith.psych.uic.edu
• Ideas
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
  - Arginine and somatomedin C
  - Estrogen and Alzheimer's disease
![Page 82: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/82.jpg)
Integration 1
• Protein networks
  - STRING (http://string.embl.de)
  - ProLinks (http://dip.doe-mbi.ucla.edu/pronav/)
• Ideas
  - Use both tools to find functions for proteins of known and unknown function
  - Use STRING to construct a network for a set of proteins
  - Try to reproduce the Ssn3–Msn2–Hsp104 link
![Page 83: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/83.jpg)
Integration 2
• Finding candidate disease genes
  - G2D (http://www.ogic.ca/projects/g2d_2/)
  - BITOLA (http://www.mf.uni-lj.si/bitola/)
• Ideas
  - Take a look at the G2D results for some diseases where you know which types of genes would be sensible to suggest
  - Compare the results with BITOLA (if you have the patience to figure out their interface!)
![Page 84: Biomedical literature mining](https://reader033.fdocuments.net/reader033/viewer/2022061217/54b3e2ac4a7959855a8b4614/html5/thumbnails/84.jpg)
Integration 3
• Annotation of expression data
  - MedMiner (http://discover.nci.nih.gov/textmining/)
• Ideas
  - Stating the obvious … do the one thing that MedMiner can do …