Analysis Environments For Functional Genomics

Bruce R. Schatz, CANIS Laboratory, Institute for Genomic Biology, University of Illinois at Urbana-Champaign (www.canis.uiuc.edu). Bioinformatics Seminar, Department of Computer Science, UIUC, February 25, 2005.


Page 1: Analysis Environments For Functional Genomics

Bioinformatics Seminar
Department of Computer Science, UIUC

February 25, 2005

Analysis Environments For Functional Genomics

Bruce R. Schatz
CANIS Laboratory

Institute for Genomic Biology
University of Illinois at Urbana-Champaign

www.canis.uiuc.edu

Page 2: Analysis Environments For Functional Genomics

What are Analysis Environments

Functional Analysis
- Find the underlying Mechanisms
- Of Genes, Behaviors, Diseases

Comparative Analysis
- Top-down data mining (vs Bottom-up)
- Multiple Sources, especially literature

Page 3: Analysis Environments For Functional Genomics

Building Analysis Environments

Manual by Humans
- Interaction: user navigation
- Classification: collection indexing

Automatic by Computers
- Federation: search bridges
- Integration: results links

Page 4: Analysis Environments For Functional Genomics

Trends in Analysis Environments

Central versus Distributed Viewpoints

The 90s, Pre-Genome
- Entrez (NIH NCBI) versus WCS (NSF Arizona)

The 00s, Post-Genome
- GO (NIH curators) versus BeeSpace (NSF Illinois)

Page 5: Analysis Environments For Functional Genomics

Pre-Genome Environments

Focused on Syntax, pre-Web

WCS (Worm Community System)
- Search words across sources
- Follow links across sources
- Words automatic, Links manual

Towards Uniform Searching

Page 6: Analysis Environments For Functional Genomics

Post-Genome Environments

Focused on Semantics, post-Web

BeeSpace (Honey Bee Inter Space)
- Navigate concepts across sources
- Integrate data across sources
- Concepts automatic, Links automatic

Towards Question Answering

Page 7: Analysis Environments For Functional Genomics

Paradigm Shift

Towards Dry-Lab Biology, Walter Gilbert (Jan 1991)

“The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis. ...

To use this flood of knowledge [the total sequence of the human and model organisms], which will pour across the computer networks of the world, biologists not only must become computer-literate, but also change their approach to the problem of understanding life. ...”

The Coming of Informational Science
Correlation of Information across Sources

Page 8: Analysis Environments For Functional Genomics

NCBI Entrez

Page 9: Analysis Environments For Functional Genomics

Community Systems

browse and share all the knowledge of a community

- Formal: data (database management), literature (information retrieval)
- Informal: results (electronic mail), news (bulletin boards)
- knowledge (hypertext annotations)

Page 10: Analysis Environments For Functional Genomics

Worm Community System

WCS Information:
- Literature: BIOSIS, MEDLINE, newsletters, meetings
- Data: Genes, Maps, Sequences, strains, cells

WCS Functionality:
- Browsing: search, navigation
- Filtering: selection, analysis
- Sharing: linking, publishing

WCS: 250 users at 50 labs across Internet (1991)

Page 11: Analysis Environments For Functional Genomics

WCS Molecular

Page 12: Analysis Environments For Functional Genomics

WCS Cellular

Page 13: Analysis Environments For Functional Genomics

WCS Publishing

Page 14: Analysis Environments For Functional Genomics

WCS Linking

Page 15: Analysis Environments For Functional Genomics

WCS invokes gm

Page 16: Analysis Environments For Functional Genomics

WCS vis-à-vis acedb

Page 17: Analysis Environments For Functional Genomics

Towards the Interspace

from Objects to Concepts

from Syntax to Semantics

Infrastructure is Interaction with Abstraction

Internet is packet transmission across computers

Interspace is concept navigation across repositories

Page 18: Analysis Environments For Functional Genomics

THE THIRD WAVE OF NET EVOLUTION

PACKETS

OBJECTS

CONCEPTS

Page 19: Analysis Environments For Functional Genomics

LEVELS OF INDEXES

[Figure: index levels running from FORMAL (manual) to INFORMAL (automatic). Labels: Technology, Engineering, Electrical, IEEE, communities, groups, individuals]

Page 20: Analysis Environments For Functional Genomics

COMPUTING CONCEPTS

- ’92: 4,000 (molecular biology)
- ’93: 40,000 (molecular biology)
- ’95: 400,000 (electrical engineering)
- ’96: 4,000,000 (engineering)
- ’98: 40,000,000 (medicine)

Page 21: Analysis Environments For Functional Genomics

Simulating a New World

Obtain discipline-scale collection
- MEDLINE from NLM, 10M bibliographic abstracts
- human classification: Medical Subject Headings

Partition discipline into Community Repositories
- 4 core terms per abstract for MeSH classification
- 32K nodes with core terms (classification tree)

Community is all abstracts classified by core term
- 40M abstracts containing 280M concepts
- concept spaces took 2 days on NCSA Origin 2000

Simulating World of Medical Communities
- 10K repositories with > 1K abstracts (1K with > 10K)
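The partitioning step on this slide can be sketched roughly as follows. Each abstract is filed under every one of its core MeSH terms, so an abstract with 4 core terms lands in 4 community repositories; that is presumably how 10M abstracts become 40M classified abstracts. The record layout, the function names, and the toy data are assumptions made for illustration, not the original Interspace code.

```python
from collections import defaultdict

def partition_by_core_terms(abstracts):
    """Group abstract IDs into community repositories keyed by core term.

    `abstracts` is an iterable of (abstract_id, core_terms) pairs; this
    record layout is an assumption made for the sketch.
    """
    repositories = defaultdict(set)
    for abstract_id, core_terms in abstracts:
        for term in core_terms:
            repositories[term].add(abstract_id)
    return dict(repositories)

def repositories_larger_than(repositories, min_size):
    """Return the core terms whose repository holds more than min_size abstracts."""
    return sorted(t for t, ids in repositories.items() if len(ids) > min_size)

# Toy data: three abstracts with hypothetical core terms.
toy = [
    ("a1", ["Arthritis, Rheumatoid", "Analgesics"]),
    ("a2", ["Analgesics", "Gastrointestinal Hemorrhage"]),
    ("a3", ["Arthritis, Rheumatoid", "Analgesics"]),
]
repos = partition_by_core_terms(toy)
print(repositories_larger_than(repos, 1))
```

The size filter mirrors the slide's count of repositories with more than 1K (or more than 10K) abstracts, applied here to a toy threshold.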

Page 22: Analysis Environments For Functional Genomics

Interspace Remote Access Client

Page 23: Analysis Environments For Functional Genomics

Navigation in MEDSPACE

For a patient with Rheumatoid Arthritis: find a drug that reduces the pain (analgesic) but does not cause stomach (gastrointestinal) bleeding

Choose Domain

Page 24: Analysis Environments For Functional Genomics

Concept Search

Page 25: Analysis Environments For Functional Genomics

Concept Navigation

Page 26: Analysis Environments For Functional Genomics

Retrieve Document

Page 27: Analysis Environments For Functional Genomics

Navigate Document

Page 28: Analysis Environments For Functional Genomics

Retrieve Document

Page 29: Analysis Environments For Functional Genomics

Informational Science

Computational Science is widely accepted as The Third Branch of Science (beyond Experimental and Theoretical)

Genes are Computed, Proteins are Computed, Sequence “equivalences” are Computed.

Informational Science is coming to be accepted as The Fourth Branch of Science

Based on Information Science technologies for Functional Analysis across Information Sources

Page 30: Analysis Environments For Functional Genomics

Post-Genome Informatics I

Comparative Analysis within the Dry Lab of Biological Knowledge

Classical Organisms have Genetic Descriptions. There will be NO more classical organisms beyond Mice and Men, Worms and Flies, Yeasts and Weeds.

Must use comparative genomics on classical organisms via sequence homologies and literature analysis.

Page 31: Analysis Environments For Functional Genomics

Post-Genome Informatics II

Functional Analysis within the Dry Lab of Biological Knowledge

Automatic annotation of genes to standard classifications, e.g. Gene Ontology via homology on computed protein sequences.

Automatic analysis of functions to scientific literature, e.g. concept spaces via text extractions. Thus must use functions in literature descriptions.

Page 32: Analysis Environments For Functional Genomics

Informatics: From Bases to Spaces

data Bases support genome data
- e.g. FlyBase has sequences and maps; Genes annotated by GO and linked to literature
- e.g. BeeBase has computed annotations; Protein homologies for similar Genes via GO

information Spaces support biomedical literature
- e.g. BeeSpace uses automatically generated conceptual relationships to navigate functions

Page 33: Analysis Environments For Functional Genomics

Gene Ontology

Page 34: Analysis Environments For Functional Genomics

Gene Ontology

Gene Symbol | Data Source | Full Name…
Calca | MGI | calcitonin-related polypeptide
Cat-1 | Wormbase | None
Cat-2 | Wormbase | None
CCKR-Human | UniProt | Cholecystokinin receptor
CRF2-Rat | UniProt | Corticotropin releasing factor
Crhr2 | RGD | corticotrophin relse hormone
Egl-10 | Wormbase | None
Egl-30 | Wormbase | None
Feh-1 | Wormbase | None
For | FlyBase | None

Page 35: Analysis Environments For Functional Genomics

Conceptual Navigation in BeeSpace

[Figure: Behavioral Biologist, Molecular Biologist, and Neuroscientist navigating across Neuroscience Literature, Molecular Biology Literature, Bee Literature, FlyBase, WormBase, Bee Genome, Brain Region Localization, and Brain Gene Expression Profiles]

Page 36: Analysis Environments For Functional Genomics

BeeSpace Analysis Environment

Build Concept Space of Biomedical Literature for Functional Analysis of Bee Genes

- Partition Literature into Community Collections
- Extract and Index Concepts within Collections
- Navigate Concepts within Documents
- Follow Links from Documents into Databases

Locate Candidate Genes in Related Literatures, then follow links into Genome Databases

Page 37: Analysis Environments For Functional Genomics

Question Answering

Behaviour | Organism | Gene | Molecular Function | Reference

Foraging
Rover vs sitter phenotype | Drosophila melanogaster | for | Protein kinase G | 8
Roamer vs dweller phenotype | C. elegans | egl-4 | Protein kinase G | 16
Division of labour: age at onset of foraging | Apis mellifera | for | Protein kinase G | 9
Division of labour: age at onset of foraging | Apis mellifera | mvl | Mn transporter | 19
Division of labour: foraging-related? | Apis mellifera | per | Transcription cofactor | 68
Division of labour: foraging-related? | Apis mellifera | ache | Acetylcholine esterase | 69
Division of labour: foraging-related? | Apis mellifera | IP(3)K | Inositol signaling | 70
Foraging specialization: nectar vs. pollen | Apis mellifera | pkc | Protein kinase C | 71
Social feeding | Drosophila melanogaster | dpnf | Neuropeptide Y (NPY) homolog | 21
Social feeding (aggregation) | C. elegans | npr-1 | Receptor for NPY | 22, 23

Page 38: Analysis Environments For Functional Genomics

Functional Phrases

<gene> encodes <chemical>
- Sokolowski and colleagues demonstrated in Drosophila melanogaster that the foraging gene (for) encodes a cGMP dependent protein kinase (PKG).
- The dg2 gene encodes a cyclic guanosine monophosphate (cGMP)-dependent protein kinase (PKG).

<chemical> affects/causes <behavior>
- Thus, PKG levels affected food-search behavior.
- cGMP treatment elevated PKG activity and caused foraging behavior.

<gene> regulates <behavior>
- Amfor, an ortholog of the Drosophila for gene, is involved in the regulation of age at onset of foraging in honey bees.
- This idea is supported by results for malvolio (mvl), which encodes a manganese transporter and is involved in regulating Drosophila feeding and age at onset of foraging in honey bees.
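One way to read the phrase templates on this slide is as patterns over sentences whose slots are filled by recognized entities. The sketch below is a minimal illustration that assumes tiny hand-made lexicons and plain regular expressions; the lexicons, names, and matching strategy are hypothetical and are not the BeeSpace implementation, which the slides describe only at the level of natural language processing.

```python
import re

# Tiny hand-made lexicons (hypothetical); a real system would rely on
# biological entity recognition instead of fixed word lists.
GENES = ["foraging gene", "dg2 gene", "malvolio"]
CHEMICALS = ["PKG", "manganese transporter"]
BEHAVIORS = ["food-search behavior", "foraging behavior"]

def alternation(terms):
    # Longest terms first so multi-word terms win over shorter ones.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))

# One regular expression per template: <subject> ... verb ... <object>.
TEMPLATES = {
    "encodes": re.compile(
        rf"(?P<subj>{alternation(GENES)})\b.*?\bencodes\b.*?"
        rf"(?P<obj>{alternation(CHEMICALS)})"),
    "affects/causes": re.compile(
        rf"(?P<subj>{alternation(CHEMICALS)})\b.*?\b(?:affected|caused)\b.*?"
        rf"(?P<obj>{alternation(BEHAVIORS)})"),
}

def extract_relations(sentence):
    """Return (subject, relation, object) triples found in one sentence."""
    relations = []
    for name, pattern in TEMPLATES.items():
        match = pattern.search(sentence)
        if match:
            relations.append((match.group("subj"), name, match.group("obj")))
    return relations

print(extract_relations("Thus, PKG levels affected food-search behavior."))
```

Chaining the extracted triples (gene encodes chemical, chemical affects behavior) is what would let a system propose a gene-to-behavior link, which is the kind of question answering the earlier table summarizes.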

Page 39: Analysis Environments For Functional Genomics

BeeSpace Software Implementation

Natural Language Processing
- Identify noun and verb phrases
- Recognize biological entities
- Compute biological relations

Statistical Information Retrieval
- Compute statistical contexts
- Support conceptual navigation
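A statistical context can be approximated by counting which concepts co-occur in the same documents. The sketch below assumes concepts have already been extracted per document and uses a simple asymmetric weight (co-occurrence count divided by the document frequency of the source concept). This weighting is a simplification chosen for illustration, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_concept_space(documents):
    """Build asymmetric co-occurrence weights between concepts.

    `documents` is a list of concept lists, one per document.
    space[a][b] = (documents containing both a and b) / (documents containing a).
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for concepts in documents:
        unique = sorted(set(concepts))
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    space = defaultdict(dict)
    for (a, b), n in pair_freq.items():
        space[a][b] = n / doc_freq[a]
        space[b][a] = n / doc_freq[b]
    return dict(space)

def related_concepts(space, concept, top_n=3):
    """Navigate from one concept to its most strongly related neighbours."""
    neighbours = space.get(concept, {})
    return sorted(neighbours, key=lambda c: (-neighbours[c], c))[:top_n]

# Toy collection of four documents with hypothetical extracted concepts.
docs = [
    ["foraging", "PKG", "cGMP"],
    ["foraging", "PKG"],
    ["foraging", "malvolio"],
    ["PKG", "cGMP"],
]
space = build_concept_space(docs)
print(related_concepts(space, "foraging"))
```

The weights are deliberately asymmetric: a rare concept such as "malvolio" points strongly to "foraging", while "foraging" points only weakly back, which is useful when navigating from a specific term toward a broader one.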

Page 40: Analysis Environments For Functional Genomics

Data Integration (FlyBase Gene)

D. melanogaster gene foraging, abbreviated as for, is reported here. It has also been known in FlyBase as BcDNA:GM08338, CG10033 and l(2)06860. It encodes a product with cGMP-dependent protein kinase activity (EC:2.7.1.-) involved in protein amino acid phosphorylation which is a component of the cellular_component unknown. It has been sequenced and its amino acid sequence contains an eukaryotic protein kinase, a protein kinase C-terminal domain, a tyrosine kinase catalytic domain, a serine/Threonine protein kinase family active site, a cAMP-dependent protein kinase and a cGMP-dependent protein kinase. It has been mapped by recombination to 2-10 and cytologically to 24A2--4. It interacts genetically with Csr. There are 27 recorded alleles: 1 in vitro construct (not available from the public stock centers), 25 classical mutants (3 available from the public stock centers) and 1 wild-type. Mutations have been isolated which affect the larval nerve terminal and are behavioral, pupal recessive lethal, hyperactive, larval neurophysiology defective and larval neuroanatomy defective. for is discussed in 80 references (excluding sequence accessions), dated between 1988 and 2003. These include at least 6 studies of mutant phenotypes, 2 studies of wild-type function, 3 studies of natural polymorphisms and 7 molecular studies. Among findings on for function, for activity levels influence adult olfactory trap response to a food medium attractant. Among findings on for polymorphisms, the frequency of for^R and for^s strains in three natural populations are studied to determine the contribution of the local parasitoid community to the differences in for^R and for^s frequencies.

Page 41: Analysis Environments For Functional Genomics

BeeSpace Information Sources

Biomedical Literature
- Medline (medicine)
- Biosis (biology)
- Agricola, CAB Abstracts, Agris (agriculture)

Model Organisms (heredity)
- Gene Descriptions (FlyBase, WormBase)

Natural Histories (environment)
- BeeKeeping Books (Cornell Library, Harvard Press)

Page 42: Analysis Environments For Functional Genomics

Medical Concept Spaces (1998)

Medical Literature (Medline, 10M abstracts)
- Partition with Medical Subject Headings (MeSH)

Community is all abstracts classified by core term
- 40M abstracts containing 280M concepts
- computation is 2 days on NCSA Origin 2000

Simulating World of Medical Communities
- 10K repositories with > 1K abstracts (1K with > 10K)

Page 43: Analysis Environments For Functional Genomics

Biological Concept Spaces (2005)

Compute concept spaces for All of Biology
- BioSpace across entire biomedical literature
- 50M abstracts across 50K repositories

Use Gene Ontology to partition literature into biological communities for functional analysis
- GO same scale as MeSH but adequate coverage?
- GO light on social behavior (biological process)

Page 44: Analysis Environments For Functional Genomics

Paradigm Shift

Dissecting Human Disease, Victor McKusick (Feb 2001)

- Structural genomics → Functional genomics
- Genomics → Proteomics
- Map-based gene discovery → Sequence-based gene discovery
- Monogenic disorders → Multifactorial disorders
- Specific DNA diagnosis → Monitoring susceptibility
- Analysis of one gene → Analysis of multi-gene pathways
- Gene action → Gene regulation
- Etiology (mutation) → Pathogenesis (mechanism)
- One species → Several species

Page 45: Analysis Environments For Functional Genomics

Needles and Haystacks

Genes
- Honey Bees have 13K genes
- Perhaps 100 have known functions

Paths
- Perhaps 30K protein families exist
- KEGG has 200 known pathways

Statistical Clustering for Interactive Discovery Across Two Orders of Magnitude!
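The "two orders of magnitude" claim can be checked directly from the figures on this slide: 13K genes against about 100 with known functions is a ratio of about 130, and 30K protein families against 200 known pathways is a ratio of about 150. The helper below is only a worked arithmetic sketch; its name is an illustrative choice.

```python
import math

def orders_of_magnitude(total, known):
    """Base-10 logarithm of the ratio between the haystack and the needles."""
    return math.log10(total / known)

# Figures quoted on the slide.
genes = orders_of_magnitude(13_000, 100)   # 13K genes, about 100 with known functions
paths = orders_of_magnitude(30_000, 200)   # about 30K protein families, 200 KEGG pathways
print(f"genes: {genes:.1f} orders, paths: {paths:.1f} orders")
```

Both ratios come out slightly above two orders of magnitude, which matches the slide's framing of the gap that interactive discovery has to bridge.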

Page 46: Analysis Environments For Functional Genomics

Concept Switching

In the Interspace…

each Community maintains its own repository

Switching is navigating Across repositories

use your specialty vocabulary to search another specialty

Page 47: Analysis Environments For Functional Genomics

CONCEPT SWITCHING

“Concept” versus “Term”
- set of “semantically” equivalent terms

Concept switching
- region to region (set to set) match

[Figure labels: term, Semantic region, Concept Space]
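The region-to-region match can be illustrated with sets. In the sketch below a semantic region is taken to be a term plus its most strongly related terms in a concept space, and switching ranks the terms of another community's space by how much their regions overlap the source region. The use of Jaccard overlap, the region size, and the toy spaces are all assumptions made for illustration; the slides do not specify the matching function.

```python
def region(space, term, size=3):
    """Semantic region: the term plus its `size` most strongly related terms."""
    neighbours = space.get(term, {})
    top = sorted(neighbours, key=lambda t: (-neighbours[t], t))[:size]
    return {term, *top}

def jaccard(a, b):
    """Set overlap between two regions, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def switch_concept(term, source_space, target_space, size=3):
    """Rank target-space terms by overlap between their region and the source region."""
    source_region = region(source_space, term, size)
    scored = [(jaccard(source_region, region(target_space, t, size)), t)
              for t in target_space]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [t for score, t in ranked if score > 0]

# Hypothetical concept spaces for two communities (term -> related term -> weight).
bee_space = {"foraging": {"division of labor": 0.8, "PKG": 0.6, "nectar": 0.5}}
fly_space = {
    "rover/sitter": {"PKG": 0.9, "foraging": 0.7, "larva": 0.4},
    "courtship": {"song": 0.8, "fruitless": 0.7},
}
print(switch_concept("foraging", bee_space, fly_space))
```

Matching sets rather than single terms is what allows a user to search another specialty with their own vocabulary: the two communities need only share some of the surrounding terms, not the query term itself.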

Page 48: Analysis Environments For Functional Genomics

Biomedical Session

Page 49: Analysis Environments For Functional Genomics

Categories and Concepts

Page 50: Analysis Environments For Functional Genomics

Concept Switching

Page 51: Analysis Environments For Functional Genomics

Document Retrieval

Page 52: Analysis Environments For Functional Genomics

Future Technologies

Concept Switching
- Spreading activation, type tagging

Dynamic Indexing
- On-the-fly collections, during session

Path Matching
- Aggregating indexes, many repositories

Page 53: Analysis Environments For Functional Genomics

THE NET OF THE 21st CENTURY

- Beyond Objects to Concepts
- Beyond Search to Analysis
- Problem Solving via Cross-Correlating Multimedia Information across the Net

- Every community has its own special library
- Every community does semantic indexing

The Interspace approximates Cyberspace

Page 54: Analysis Environments For Functional Genomics

Interactive Functional Analysis

BeeSpace will enable users to navigate a uniform space of diverse databases and literature sources for hypothesis development and testing, with a software system beyond a searchable database, using literature analyses to discover functional relationships between genes and behavior.

- Genes to Behaviors
- Behaviors to Genes
- Concepts to Concepts
- Clusters to Clusters
- Navigation across Sources

Page 55: Analysis Environments For Functional Genomics

XSpace Information Sources

- Organize Genome Databases (XBase)
- Compute Gene Descriptions from Model Organisms
- Partition Scientific Literature for Organism X
- Compute XSpace using Semantic Indexing

Boost the Functional Analysis from Special Sources
- Collecting Useful Data about Natural Histories
- e.g. CowSpace Leverage in AIPL Databases

Page 56: Analysis Environments For Functional Genomics

Towards the Interspace

The Analysis Environment technology is GENERAL!

BirdSpace? BeeSpace?
PigSpace? CowSpace?
BehaviorSpace? BrainSpace?

BioSpace… Interspace