Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Text mining in the field of evolutionary biology: facilitating scholarly collaboration Sarah Carrier February 2008


Page 1: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Sarah Carrier

February 2008

Page 2: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

What is text mining?

• Deriving novel, relevant information from unstructured information (text).

• Identification of patterns and trends.

• Typical techniques:

– Clustering

– Categorization

– Concept/entity extraction -> dictionary-based, statistical methods/machine learning

– Document summarization

Page 3: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Long-Term Objective

1. To identify biological entities through text mining methods, then categorize them into predetermined classes of objects

2. To describe biological concepts using simple ontologies - for example, use the controlled vocabulary generated in step 1 to describe results and methods

Page 4: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Semester Objective

1. To categorize evolutionary biology abstracts into 5 different predetermined categories using nouns and noun-phrases associated with the text.

2. To prepare for long-term objectives.

Page 5: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Motivation

• Scholarly collaboration

• Generation of ontologies to describe results of experiments, to enhance meta-analyses for research purposes

• Web publishing

• Indexing by central repositories

Page 6: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Motivation and Current Research

• need in the life sciences for alternatives to keyword-based approaches based in the traditional information retrieval framework

• extensive (text mining) work is being done to identify protein-protein interactions and gene annotations

• extracted entities can be linked to existing ontologies and potentially used to generate new ontologies

• the most common text mining applications in the life sciences tend toward information extraction, as this method produces a potential solution to the deluge of information in the field

Page 7: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Manual Keyword Identification

• 8 categories: concept, field/discipline, gene, habitat, method, place, taxon, time period

• 104 articles, 5 journals, 600 keywords - 551 with duplicates removed, most terms ended up in the “concept” category -> varied sizes

• Manual categorization accomplished with domain experts on the Dryad team, matched with existing terminologies

• 16% were duplicates, avg. 50% matched terminologies - implies that controlled vocabularies should be used for standardization

Page 8: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Some potential challenges

• Evolutionary biology is an interdisciplinary field: ecology, genomics, paleontology, population genetics, physiology, systematics

• A varied and complex terminology for the life sciences

• Incredibly sparse dataset

• Coverage of existing terminologies incomplete (UMLS, Open Biomedical Ontologies)

Page 9: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Methodology

• MEDLINE abstracts from American Naturalist, Ecology, Journal of Evolutionary Biology, Molecular Ecology, Molecular Biology and Evolution, Systematic Biology

• Total: 15,179 abstracts; 227,731 terms extracted from the lists of MeSH terms and 831,245 terms extracted from the abstract text

• Standard preprocessing of abstracts using Perl, including the Porter stemmer and the Brill Tagger

Page 10: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

An Example

PMID- 17206577
TI- Ecological specialization and adaptive decay in digital organisms.
AB- The transition from generalist to specialist may entail the loss of unused traits or abilities, resulting in narrow niche breadth. Here we examine the process of specialization in digital organisms--self-replicating computer programs that mutate, adapt, and evolve. Digital organisms obtain energy by performing computations with numbers they input from their environment. We examined the evolutionary trajectory of generalist organisms in an ecologically narrow environment, where only a single computation yielded energy. CONTINUED…
MH- *Adaptation, Biological, Competitive Behavior, Computer Simulation, Ecology, *Evolution, Molecular, Genotype, *Models, Genetic, Mutation, Phenotype, Software
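A record like the one above can be split into its tagged fields before any term extraction. The following is a minimal sketch, not the project's actual Perl code. It assumes the usual MEDLINE text layout, in which each field starts with a short uppercase tag followed by a hyphen, continuation lines are indented, and each MeSH heading sits on its own MH line. The function name `parse_medline_record` and the shortened sample record are made up for illustration.

```python
import re
from collections import defaultdict

# A field line is an uppercase tag, optional padding, "- ", then the value.
# Both "TI  - ..." (padded) and "TI- ..." (as on the slide) match.
FIELD = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")


def parse_medline_record(text):
    """Return a dict mapping each field tag to the list of its values."""
    record = defaultdict(list)
    tag = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            tag, value = match.groups()
            record[tag].append(value.strip())
        elif tag and line.strip():
            # Continuation line: append to the most recent field value.
            record[tag][-1] += " " + line.strip()
    return dict(record)


# Shortened, hypothetical rendering of the record shown on the slide.
sample = """PMID- 17206577
TI  - Ecological specialization and adaptive decay in digital organisms.
AB  - The transition from generalist to specialist may entail the loss of
      unused traits or abilities, resulting in narrow niche breadth.
MH  - *Adaptation, Biological
MH  - Computer Simulation
MH  - *Evolution, Molecular
"""

record = parse_medline_record(sample)
# A leading asterisk marks a heading as a major topic of the article.
major_topics = [mh.lstrip("*") for mh in record["MH"] if mh.startswith("*")]
print(record["PMID"][0], major_topics)
```

Keeping every field as a list lets repeated tags such as MH accumulate naturally, while single-valued fields are read with `[0]`.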

Page 11: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Preprocessing

17206577|1|transition
17206577|1|specialist
17206577|1|loss of unus trait
17206577|1|trait
17206577|1|generalist
17206577|1|loss
17206577|1|transition from generalist
17206577|1|unus trait
17206577|1|narrow nich breadth
17206577|1|nich breadth
17206577|1|breadth
17206577|2|process
17206577|2|abil
17206577|2|nich

• CONCEPT: regressive evolution, specialization, pleiotropy, adaptation, mutation accumulation

• METHOD: digital evolution

Page 12: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Preprocessing, cont.

The/DET transition/NN from/IN generalist/NN to/TO specialist/NN may/MD entail/VB the/DET loss/NN of/IN unused/JJ traits/NNS or/CC abilities/NNS ,/PPC resulting/VBG in/IN narrow/JJ niche/NN breadth/NN ./PP Here/RB we/PRP examine/VBP the/DET process/NN of/IN specialization/NN in/IN digital/JJ organisms/NNS self-replicating/NN computer/NN programs/NNS that/IN mutate/VB ,/PPC adapt/VBP ,/PPC and/CC evolve/VB ./PP Digital/NNP organisms/NNS obtain/VBP energy/NN by/IN performing/VBG computations/NNS with/IN numbers/NNS they/PRP input/NN from/IN their/PRPS environment/NN ./PP We/PRP examined/VBD the/DET evolutionary/JJ trajectory/NN of/IN generalist/NN organisms/NNS in/IN an/DET ecologically/RB narrow/JJ environment/NN ,/PPC where/WRB only/RB a/DET single/JJ computation/NN yielded/VBD energy/NN ./PP We/PRP determined/VBD the/DET extent/NN to/TO which/WDT
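To illustrate how tagged text of this kind can be turned into pipe-delimited term lines (PMID, sentence number, term), here is a minimal sketch. It is an assumption about the general approach, not the project's Perl pipeline: it keeps only maximal adjective/noun runs, so it does not produce the nested or prepositional phrases seen in the project output (such as "loss of unus trait"), and it skips stemming entirely. The names `parse_tagged` and `noun_phrases` are hypothetical.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def noun_phrases(pairs):
    """Return maximal runs of adjective/noun tokens, trimmed to end in a noun."""
    phrases, chunk = [], []
    for word, tag in pairs + [("", "END")]:  # sentinel flushes the last chunk
        if tag.startswith(("JJ", "NN")):
            chunk.append((word.lower(), tag))
            continue
        while chunk and not chunk[-1][1].startswith("NN"):
            chunk.pop()  # drop trailing adjectives
        if chunk:
            phrases.append(" ".join(w for w, _ in chunk))
        chunk = []
    return phrases


# First sentence of the tagged abstract shown on the slide.
tagged = ("The/DET transition/NN from/IN generalist/NN to/TO specialist/NN "
          "may/MD entail/VB the/DET loss/NN of/IN unused/JJ traits/NNS or/CC "
          "abilities/NNS ,/PPC resulting/VBG in/IN narrow/JJ niche/NN "
          "breadth/NN ./PP")

phrases = noun_phrases(parse_tagged(tagged))
# Same layout as the term lines: PMID | sentence number | term (unstemmed here).
lines = [f"17206577|1|{phrase}" for phrase in phrases]
print("\n".join(lines))
```

Applying a Porter stemmer to each phrase and also emitting its sub-phrases would bring the output closer to the term lines on the Preprocessing slide.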

Page 13: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

An Example

<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Adaptation, Physiological</DescriptorName>
</MeshHeading>
<QualifierName MajorTopicYN="N">genetics</QualifierName>
<QualifierName MajorTopicYN="Y">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Predatory Behavior</DescriptorName>
</MeshHeading>
</MeshHeadingList>
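MeSH headings in this XML form can be read with a standard XML parser. The sketch below is illustrative only. The fragment on the slide closes its first MeshHeading before the qualifiers, so the sample here is rearranged into well-formed XML on the assumption that both qualifiers belong to "Adaptation, Physiological"; that grouping is a guess, not something the slide confirms. The function name `mesh_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed rearrangement of the slide fragment (the grouping is an assumption).
xml_text = """<MeshHeadingList>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">Adaptation, Physiological</DescriptorName>
    <QualifierName MajorTopicYN="N">genetics</QualifierName>
    <QualifierName MajorTopicYN="Y">metabolism</QualifierName>
  </MeshHeading>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">Predatory Behavior</DescriptorName>
  </MeshHeading>
</MeshHeadingList>"""


def mesh_terms(xml_text):
    """Return (descriptor, qualifiers, is_major) for every MeshHeading."""
    root = ET.fromstring(xml_text)
    terms = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName", default="").strip()
        qualifiers = [(q.text or "").strip()
                      for q in heading.findall("QualifierName")]
        # A heading counts as major if any child carries MajorTopicYN="Y".
        is_major = any(child.get("MajorTopicYN") == "Y" for child in heading)
        terms.append((descriptor, qualifiers, is_major))
    return terms


for descriptor, qualifiers, is_major in mesh_terms(xml_text):
    print(descriptor, qualifiers, is_major)
```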

Page 14: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Most Frequent Abstract Terms (collection)

speci, popul, gene, result, sequenc, studi, data, analysi, pattern, evolut, variat, dna, phylogenet, region, model, level, rate, analys, structur, select

Most Frequent MeSH Terms (collection)

genet, sequenc, anim, dna, physiologi, phylogeni, evolut, popul, data, analysi, model, molecular sequenc data, sequenc data, gene, variat, acid, base, base sequenc, classif, protein

Page 15: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Other Steps

• TF*IDF weighting, pruning

– Challenges: skew in category sizes (“concept” being the largest), lack of truly discriminative terms

• Application of a machine-learning model: Hidden Markov Models, Support Vector Machines

– SVMs: outperform HMMs and are also better for large, sparse datasets

• Evaluation:

– Recall, Precision, F-Scores

– Presentation to Dryad domain experts for feedback
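The weighting and evaluation steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the slides do not say which TF*IDF variant was used, so the sketch takes raw term counts multiplied by log(N / document frequency). The toy documents, the category labels and the function names `tf_idf` and `precision_recall_f1` are all made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def precision_recall_f1(predicted, actual, label):
    """Per-category precision, recall and F1 from parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy stemmed documents (hypothetical).
docs = [["speci", "popul", "speci"], ["speci", "gene"], ["speci", "sequenc", "gene"]]
weights = tf_idf(docs)
# Pruning: a term that occurs in every document gets weight 0 and is dropped.
pruned = [{t: w for t, w in doc.items() if w > 0} for doc in weights]

# Toy predictions against toy gold labels (hypothetical).
predicted = ["concept", "concept", "method", "taxon"]
actual = ["concept", "method", "method", "taxon"]
p, r, f = precision_recall_f1(predicted, actual, "concept")
print(pruned[0], p, r, round(f, 3))
```

In the toy data "speci" appears in every document and so carries no weight, which mirrors the lack of discriminative terms listed as a challenge. Reporting the scores per category rather than as one overall figure matters when the category sizes are as skewed as described.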

Page 16: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Future Steps

• Use of existing vocabularies to assist in controlling terminology: NBII thesaurus, MeSH, GTN, WordNet, Gene Ontology, ITIS, UBIO, UMLS, etc.

Page 17: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Ontology generation?

• The POS processing has already been done - the verb is an essential element of the relationship

• Find most common verbs and define them as “relational verbs”

• Methodology: using POS tags, pull out “triplets” or certain sequences of words

– NOUN - VERB - NOUN

– …in some studies, prepositions are also analyzed

Page 18: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Ontology, cont.

Our/PRPS results/NNS show/VBP that/IN as/IN organisms/NNS evolved/VBD improved/VBN performance/NN of/IN the/DET selected/JJ function/NN ,/PPC they/PRP often/RB lost/VBN the/DET ability/NN to/TO perform/VB other/JJ computations/NNS ,/PPC and/CC these/DET losses/NNS resulted/VBD most/JJS often/RB from/IN the/DET accumulation/NN of/IN neutral/JJ and/CC deleterious/JJ mutations/NNS ./PP
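A rough sketch of NOUN - VERB - NOUN triplet extraction from the tagged sentence above follows. It is one possible reading of the method, not the project's code. Two choices in it are assumptions: determiners, adjectives, adverbs and past participles are skipped as modifiers, and at most one IN-tagged word after the verb may be folded into the relation. The names `parse_tagged` and `triplets` are hypothetical.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Tag prefixes skipped between a verb and its nouns (an assumption).
# VBN covers participles used as modifiers, e.g. improved/VBN performance/NN.
MODIFIERS = ("DET", "JJ", "RB", "VBN")


def triplets(pairs, allow_preposition=True):
    """Return (noun, relation, noun) tuples centred on each verb."""
    found = []
    for i, (verb, tag) in enumerate(pairs):
        if not tag.startswith("VB"):
            continue
        # Subject: nearest noun to the left, skipping modifiers.
        j = i - 1
        while j >= 0 and pairs[j][1].startswith(MODIFIERS):
            j -= 1
        if j < 0 or not pairs[j][1].startswith("NN"):
            continue
        # Object: nearest noun to the right, skipping modifiers and,
        # optionally, one preposition that becomes part of the relation.
        relation = verb.lower()
        seen_prep = False
        k = i + 1
        while k < len(pairs):
            word, next_tag = pairs[k]
            if next_tag.startswith(MODIFIERS):
                k += 1
            elif next_tag == "IN" and allow_preposition and not seen_prep:
                relation += " " + word.lower()
                seen_prep = True
                k += 1
            else:
                break
        if k < len(pairs) and pairs[k][1].startswith("NN"):
            found.append((pairs[j][0].lower(), relation, pairs[k][0].lower()))
    return found


sentence = (
    "Our/PRPS results/NNS show/VBP that/IN as/IN organisms/NNS evolved/VBD "
    "improved/VBN performance/NN of/IN the/DET selected/JJ function/NN ,/PPC "
    "they/PRP often/RB lost/VBN the/DET ability/NN to/TO perform/VB other/JJ "
    "computations/NNS ,/PPC and/CC these/DET losses/NNS resulted/VBD most/JJS "
    "often/RB from/IN the/DET accumulation/NN of/IN neutral/JJ and/CC "
    "deleterious/JJ mutations/NNS ./PP"
)

pairs = parse_tagged(sentence)
print(triplets(pairs))
print(triplets(pairs, allow_preposition=False))
```

With prepositions allowed, the relation "resulted from" links "losses" to "accumulation"; without them only the "evolved" triplet survives. Pronoun subjects such as "they" are ignored here, so "lost the ability" yields no triplet without some form of coreference handling.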

Page 19: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Conclusions

• Term variation and ambiguity presented a challenge in my project because it yielded a very sparse data set

• With more time I would have supplemented the dataset I generated this semester with more data from more abstracts, perhaps even the full text, if available

• Although the objective of the project changed over the semester, the results provide valuable insight into the structure and use of evolutionary biology vocabularies

• Potential future developments in the project, namely ontology generation, would have a positive impact on scholarly communication amongst researchers in the field of evolutionary biology

Page 20: Text mining in the field of evolutionary biology: facilitating scholarly collaboration

Thank you!