Michigan, 2005 Alfonso Valencia CNB-CSIC Text Mining ISMB05 Alfonso Valencia CNB-CSIC.
Michigan, 2005 Alfonso Valencia CNB-CSIC
SLIDE WINDOW APPROACH
Krallinger Valencia Drug Discovery Today 2005
ISMB-Biolink
BioLINK SIG: Linking Literature, Information and Knowledge for Biology
A Joint Meeting of the ISMB BioLINK Special Interest Group on Text Data Mining and the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics
Christian Blaschke, Hagit Shatkay, Kevin B. Cohen, Lynette Hirschman
1. InTex: a Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text. S. T. Ahmed, D. Chidambaram, H. Davulcu, C. Baral
2. Corpus Design for Biomedical Natural Language Processing. K. B. Cohen, L. Fox, P. V. Ogren, L. Hunter
3. Unsupervised Gene/Protein Named Entity Normalization using Automatically Extracted Dictionaries. A. M. Cohen
4. Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions. A. Ramani, E. Marcotte, R. Bunescu, R. Mooney
5. MedTag: a Collection of Biomedical Annotations. L. H. Smith, L. Tanabe, T. Rindflesch, W. John Wilbur
6. A Machine Learning Approach to Acronym Generation. Y. Tsuruoka, S. Ananiadou, J. Tsujii
7. Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. B. Wellner
8. Adaptive String Similarity Metrics for Biomedical Reference Resolution. B. Wellner, J. Castaño, J. Pustejovsky
9. A Cross-Domain Application of Natural Language Processing in Biology. I. Chiu, L. H. Shu
10. Functional Annotation of Genes Using Hierarchical Text Categorization. S. Kiritchenko, S. Matwin, A. F. Famili
11. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing. P. Nakov, A. Schwartz, B. Wolf, M. Hearst
12. Searching for High-Utility Text in the Biomedical Literature. H. Shatkay, A. Rzhetsky, W. J. Wilbur
13. Automatic Highlighting of Bioscience Literature. H. Wang, S. Bradshaw, M. Light
BioLINK SIG / BioOntologies in ECCB05 Madrid Sept. www.eccb05.org
Competitions
- BioCreAtIvE
  Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)
  Task 2: GO to protein via text for a collection of human genes.
- TREC I, II
- KDD
- JNLPBA
- others
Text Mining vs. Curation
• Text Mining supports curation
• Curators build and maintain ontologies and databases
• Text Mining profits from data from different resources: ontologies, databases
BioCreAtIvE ©
Text mining in a nutshell
1. Protein / gene names
   Interspecies
   Linking to DBs
2. Relations between entities
   Protein-protein
   Other entities (regulation, drugs)
   Function
3. Type of Relation
   Proteins
   Metabolic pathways

1. 80% prec/recall (BioCreative)
   Far less than that
   Essential (Bioinformatics not NLP)
2. Easy on the surface
   Best known one (accessible?)
   Dictionaries
   Very difficult (i.e. GO in BioCreative)
3. Semantic
   Summaries very difficult
   New challenge, unexplored

Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
Text mining in a nutshell
1. Protein / gene names
   1. Interspecies
   2. Linking to DBs
2. Relations
   1. Protein-protein
   2. Others (regulation, drugs)
   3. Function
3. Type of Relation
   1. Proteins
   2. Metabolic pathways
4. Concepts for groups of genes
   1. Existing
   2. Creating new ones

1. 80% prec/recall (BioCreative)
   1. Far less than that
   2. Essential (not NLP)
2. Easy on the surface
   1. Best known one (accessible?)
   2. Dictionaries
   3. Very difficult (to GO, BioCreative)
3. Semantic
   1. Summaries very difficult
   2. New challenge, unexplored
4. Knowledge discovery
   1. Summaries and generalization
   2. Not yet

Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
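The "80% prec/recall" figure quoted for gene name extraction can be read with the usual definitions of precision and recall. The sketch below is only an illustration: it assumes mentions are compared as exact (document, name) pairs, and the function name and toy data are hypothetical, not taken from BioCreAtIvE.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold-standard items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example (invented): gene mentions as (document id, name) pairs.
gold = {("doc1", "PCNA"), ("doc1", "MSH2"), ("doc2", "CDC2"),
        ("doc2", "LBR"), ("doc3", "TOP2A")}
predicted = {("doc1", "PCNA"), ("doc1", "MSH2"), ("doc2", "CDC2"),
             ("doc2", "LBR"), ("doc3", "CAT")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case four of five predictions are correct and four of five gold mentions are found, so both values are 0.80.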
17 genes (PCNA, CDC2, MSH2, LBR, TOP2A, ...): Cell cycle
  Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
  GO codes: DNA replication, DNA metabolism, Cell Cycle control
  Sentences:
    PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
    LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.
24 genes (ABCA5, CAT, ELF2, PIM1, WNT2, ...): Unknown
  Words: Dipeptidyl, Prolyl, nmr, Collagen-binding
Blaschke et al., Funct. Integ. Genomics 2001
AC Intro (1:30-1:45pm)
Text Mining: Dietrich Rebholz-Schuhmann
7. High-recall Protein Entity Recognition Using a Dictionary. Kou, Cohen, Murphy (1:45-2:10pm)
9. Beyond The Clause: Extraction of Phosphorylation Information from Medline Abstracts. Narayanaswamy, Ravikumar, Vijay-Shanker (2:10-2:35pm)
Exponential Growth in Data
EMBL: Total Entries / year
Medline: Total Articles / year
Medline: New Articles / year
OFFICIAL 62542 44.46 %
ALIAS 51749 36.79 %
PROTEIN 26363 18.74 %
The 2492 selected genes in the year 2002 were cited 140654 times
Tamames et al., 2005
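The percentages above follow from the three counts and the quoted total. A minimal sketch of the arithmetic, using only the numbers on the slide (the variable names are illustrative):

```python
# Citation counts by name type, copied from the slide.
counts = {"OFFICIAL": 62542, "ALIAS": 51749, "PROTEIN": 26363}

# The three counts should add up to the 140654 citations quoted above.
total = sum(counts.values())

# Share of each name type as a percentage of all citations.
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<9}{counts[name]:>7}{share:>8.2f} %")
print(f"{'TOTAL':<9}{total:>7}")
```

The printed percentages use standard rounding, so the last digit may differ from the figures shown on the slide.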
Leon et al., 2004
- 98 pathways with more than one step (information available for 73)
- 2111 individual steps

Protein-compound links in abstracts
Pathway                            Steps   Linked
Total                              2111    856 (40 %)
Bacterial chemotaxis               19      17 (89 %)
Glutathione metabolism             7       6 (85 %)
Fatty acid biosynthesis -path 1-   9       7 (78 %)

Protein-compound links in sentences
Pathway                            Steps   Linked
Total                              2111    611 (29 %)
Bacterial chemotaxis               19      13 (65 %)
Two-component system               85      52 (61 %)
Citrate cycle -TCA cycle-          27      17 (63 %)
KEGG links to literature
Evolution of gene names (axis: Years)
Hoffmann, Valencia TIGs 2003
Gene names
The evolution of gene names over time is a “scale free” process
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names
SOTA clustering versus significance of Geisha terms.
Oliveros, Blaschke, GIW 2000 ©
SOTA and GEISHA mixed information
Blaschke, Herrero, Dopazo, Valencia 2002
Expression based clustering
Weight (expression) + Weight (text)
Term (text) based clustering
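The "Weight (expression) + Weight (text)" scheme can be pictured as a weighted sum of two gene-gene distance matrices, one from expression profiles and one from text terms. This is an illustration only: the function name, the single weight parameter `w_expr` (with the text weight set to `1 - w_expr`), and the toy distances are assumptions, not the published SOTA/GEISHA procedure.

```python
def combine_distances(d_expr, d_text, w_expr):
    """Weighted sum of two square distance matrices (lists of lists).

    w_expr = 1.0 gives purely expression-based distances,
    w_expr = 0.0 gives purely text (term) based distances.
    """
    if not 0.0 <= w_expr <= 1.0:
        raise ValueError("w_expr must be between 0 and 1")
    w_text = 1.0 - w_expr
    return [
        [w_expr * e + w_text * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]


# Toy distances (invented) between three genes.
d_expr = [[0.0, 0.2, 0.9],
          [0.2, 0.0, 0.8],
          [0.9, 0.8, 0.0]]
d_text = [[0.0, 0.6, 0.1],
          [0.6, 0.0, 0.7],
          [0.1, 0.7, 0.0]]

# Equal weight to expression and text information.
mixed = combine_distances(d_expr, d_text, w_expr=0.5)
print(mixed[0])
```

Sweeping `w_expr` from 1.0 down to 0.0 moves the input of any clustering method from the expression-based end to the term-based end of the slide.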
Stable clusters > central processes where expression and functional information agree
Unstable groups > contradictory information
“jumping” genes, divergent expression and functional classifications.
(Genes with very unstable behavior > related to insufficient information)
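One way to make "stable" and "jumping" concrete is to compare, gene by gene, which other genes share its cluster under the expression-based and the text-based classification. The sketch below is a hypothetical illustration, not the method used in the talk: the Jaccard-style score, the 0.5 threshold, and the toy labels are all assumptions, and both classifications are assumed to cover the same genes.

```python
def companions(labels):
    """Map each gene to the set of other genes that share its cluster."""
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, set()).add(gene)
    return {gene: clusters[label] - {gene} for gene, label in labels.items()}


def stability(labels_a, labels_b):
    """Per-gene overlap (Jaccard index) of cluster companions in two classifications."""
    comp_a, comp_b = companions(labels_a), companions(labels_b)
    scores = {}
    for gene in labels_a:
        union = comp_a[gene] | comp_b[gene]
        shared = comp_a[gene] & comp_b[gene]
        # A gene that is alone in both classifications counts as fully stable.
        scores[gene] = len(shared) / len(union) if union else 1.0
    return scores


# Toy classifications (invented): g3 changes group between the two views.
by_expression = {"g1": "A", "g2": "A", "g3": "A", "g4": "B", "g5": "B"}
by_text = {"g1": "X", "g2": "X", "g3": "Y", "g4": "Y", "g5": "Y"}

scores = stability(by_expression, by_text)
jumping = sorted(gene for gene, score in scores.items() if score < 0.5)
print(jumping)
```

Here only g3 falls below the threshold, because none of its expression-based companions remain with it in the text-based grouping.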