The ONDEX Framework: Uniting Concept-based Data Integration, Text Mining, Biological Homology Searches and Data Analysis

Jacob Köhler
Rothamsted Research

Source: nactem.ac.uk/files/koehler.pdf


Credits

Jan Baumbach (Rothamsted & Bielefeld)
Jessica Butz (Rothamsted & Bielefeld)
Ina Kupp (Rothamsted & Koblenz)
Stephan Philippi (Koblenz)
Chris Rawlings (Rothamsted)
Alexander Rüegg (Bielefeld)
Andre Skusa (Bielefeld)
Michael Specht (Rothamsted & Bielefeld)
Jan Taubert (Rothamsted & Bielefeld)
Paul Verrier (Rothamsted)
Rainer Winnenburg (Rothamsted & Bielefeld)

Rothamsted Research, Harpenden, Hertfordshire, UK
University of Bielefeld, Germany
University of Koblenz, Germany


Overview

• Motivation and Introduction
• Principles
• ONDEX System
• Applications


Motivation and introduction

Data sources
- Text sources: MEDLINE, 15 × 10^6 abstracts
- Databases (100s): KEGG, MetaCyc, AraCyc, Gene Ontology, EC nomenclature, …

Applications
- Support database curation
- Hypothesis generation
- Microarray analysis and interpretation
- Modelling and simulation
- Gene annotation

Connecting the two (shown as "?" on the slide)
- Text mining
- Sequence analysis
- Database integration
- Graph analysis


Motivation and introduction

Combining
- concept based data integration
- concept based text mining
- graph analysis/visualisation
- sequence analysis


Principles

Everything is a network…
- protein interactions
- metabolic pathways
- ontologies

… in which the nodes and edges have different properties


Principles

Think of it as layers which can be added or removed separately



Principles

[Diagram: three Protein nodes linked by "binds" edges]


Principles

[Diagram: Metabolite, Enzyme and Metabolite nodes linked by "catalyses" edges]


Principles

Data Structure

Integrated ontology O(C, R, CA, CV, CC, RT, P, ca, cv, cc, rt, id)

- a finite, not empty, distinct set of Concepts C(O)
- a finite, not empty set of Relations R(O) ⊆ C(O) × C(O)
- a finite set of Concept Accessions CA(O)
- a finite, not empty set of Controlled Vocabularies CV(O)
- a tree consisting of Concept Classes CC(O)
- a tree consisting of Relation Types RT(O)
- the additional properties P(O) of an ontology O, consisting of:
  - a finite set of Concept Names CN(O)
  - a finite set of Sequences SEQ(O)
  - a finite set of Structures STR(O)
- the function ca, which assigns concept accessions to concepts:
  ca: C(O) → {(ca_1 × … × ca_n) | ca_j ∈ CA(O)}
- the totally defined functions cv, cc and rt, which assign CVs, concept classes and relation types to concepts or relations:
  cv: C(O) ∪ R(O) → CV(O)
  cc: C(O) → CC(O)
  rt: R(O) → RT(O)
- the bijective function id, which assigns a unique identifier to every concept and every relation:
  id: C(O) → N
- the functions def, cn, seq and str, which optionally link concept names (terms), definitions, polypeptide or nucleotide sequences and protein structures to concepts:
  def: C(O) → DEF(O)
  cn: C(O) → {(cn_1 × … × cn_n) | cn_j ∈ CN(O)}
  seq: C(O) → {(seq_1 × … × seq_n) | seq_j ∈ SEQ(O)}
  str: C(O) → {(str_1 × … × str_n) | str_j ∈ STR(O)}
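The definition above could be mirrored in code roughly as follows. This is only a minimal sketch, not the ONDEX implementation: the class names `Concept`, `Relation` and `Ontology` and all field names are assumptions made for illustration, and the concept class and relation type trees are simplified to plain string labels.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    id: int                           # id: unique identifier
    cv: str                           # cv: controlled vocabulary (data source)
    concept_class: str                # cc: concept class (tree simplified to a label)
    accessions: tuple = ()            # ca: zero or more concept accessions
    names: tuple = ()                 # cn: optional concept names (terms)
    definition: Optional[str] = None  # def: optional definition
    sequences: tuple = ()             # seq: optional sequences
    structures: tuple = ()            # str: optional structures


@dataclass
class Relation:
    id: int             # id: unique identifier
    source: int         # id of the first concept
    target: int         # id of the second concept
    relation_type: str  # rt: relation type (tree simplified to a label)
    cv: str             # cv: controlled vocabulary (data source)


@dataclass
class Ontology:
    concepts: dict = field(default_factory=dict)   # C(O), keyed by concept id
    relations: list = field(default_factory=list)  # R(O)

    def add_concept(self, concept: Concept) -> None:
        if concept.id in self.concepts:
            raise ValueError(f"duplicate concept id {concept.id}")
        self.concepts[concept.id] = concept

    def add_relation(self, relation: Relation) -> None:
        # R(O) ⊆ C(O) × C(O): both endpoints must already be concepts.
        for endpoint in (relation.source, relation.target):
            if endpoint not in self.concepts:
                raise KeyError(f"unknown concept id {endpoint}")
        self.relations.append(relation)
```

Keeping `cv` on both concepts and relations follows the definition of cv over C(O) ∪ R(O), so every element records which source it came from.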


Principles

Data Structure

Visible graph G(O, CO, colour, size, visibility, x, y)

- an integrated ontology O
- a finite, not empty set of Colours CO(G)
- the functions colour, size, visibility, x and y (coordinates), which affect the way concepts and relations are visualised:
  colour: C(O) ∪ R(O) → CO(G)
  size: C(O) → ℝ
  visibility: C(O) ∪ R(O) → {true, false}
  x: C(O) → ℝ
  y: C(O) → ℝ
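One way to read the visible graph is as a table of display attributes kept separately from the ontology itself. The sketch below assumes that reading; the `Visual` record and the `visible_elements` helper are hypothetical names, and for brevity a single record type is used for both concepts and relations even though the definition gives size, x and y to concepts only.

```python
from dataclasses import dataclass


@dataclass
class Visual:
    colour: str = "grey"   # colour: element -> CO(G)
    size: float = 1.0      # size: concept -> real number
    visible: bool = True   # visibility: element -> {true, false}
    x: float = 0.0         # x coordinate
    y: float = 0.0         # y coordinate


def visible_elements(visuals):
    """Return the sorted ids of all elements whose visibility flag is true."""
    return sorted(eid for eid, visual in visuals.items() if visual.visible)
```

Hiding an element then only flips a flag in the display table; the underlying concepts and relations are left untouched.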


Principles

[Diagram: Metabolite, Enzyme and Metabolite nodes linked by "catalyses" edges]

Concept based data integration and text mining

                     Db integration                               Text mining
Creating concepts    Db import, conversion, extraction            NER, dictionaries
Creating relations   Mapping methods, sequence analysis methods   Relation mining


Principles

Mapping methods: graph alignment (not merging!)

- Only fully automated methods (no manual mapping)
- Use evidence codes to annotate how mappings were generated
- Assign correct semantics (relation type) to the mapping
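The three points on this slide can be sketched as follows: concepts from two sources stay separate, and each mapping becomes a new relation carrying a relation type and an evidence code. This is an illustrative sketch only; `Mapping`, `align`, and the labels used in the usage note are hypothetical, not ONDEX names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source_id: int
    target_id: int
    relation_type: str  # semantics assigned to the mapping
    evidence: str       # evidence code: which method generated the mapping


def align(graph_a, graph_b, matcher, relation_type, evidence):
    """Return mappings between two concept sets without merging them.

    graph_a and graph_b map concept ids to concept records; matcher is any
    automated predicate deciding whether two records correspond.
    """
    return [
        Mapping(a_id, b_id, relation_type, evidence)
        for a_id, a in graph_a.items()
        for b_id, b in graph_b.items()
        if matcher(a, b)
    ]
```

For example, calling `align` with a matcher that compares accessions and the made-up labels `"equivalent"` and `"accession_match"` yields one `Mapping` per matching pair, so the provenance of every link can be inspected or filtered later.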


Principles

Mapping methods: graph alignment (not merging!)

- Import of mapping lists
- Methods based on graph structure (structalign)
- Compare concept names (2syn)
- Sequence analysis: homology, INPARANOID, (motif search)
- Transitive mapping (trans)
- (protein-protein docking methods)
- (protein-ligand docking methods)
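As an illustration of the idea behind transitive mapping, the sketch below derives the pairs implied by chains of existing mappings. It is not the ONDEX "trans" method: it assumes that mappings are symmetric equivalences, uses a union-find structure chosen here for brevity, and the function name and identifiers are made up.

```python
def transitive_mappings(pairs):
    """Return pairs implied by chains of mappings but not listed in the input."""
    parent = {}

    def find(x):
        # Find the representative of x's group, shortening paths on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Join the groups of every mapped pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each group.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    # Every pair inside a group is implied; keep those not already known.
    known = {frozenset(pair) for pair in pairs}
    derived = set()
    for members in groups.values():
        members.sort()
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if frozenset((a, b)) not in known:
                    derived.add((a, b))
    return sorted(derived)
```

Given mappings A to B and B to C, the function returns the single new pair linking A and C. In keeping with the evidence-code principle, such derived pairs would be recorded with their own evidence code rather than treated like directly observed mappings.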


Principles

Text mining

- Concept based approach
- Word stemming, normalisation of concept names, POS tagging
- Concept groups can be defined by
  a) selected subset of concepts (dictionary approach)
  b) regular expression
  c) planned: other NER methods
- Relation mining
  a) co-occurrence of concept groups
  b) planned: deep parsing methods
- Text sources: PubMed
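Co-occurrence of concept groups can be sketched as below. This is a simplified illustration, not the ONDEX text mining code: it omits stemming, name normalisation and POS tagging, counts co-occurrence per abstract, and the names `find_group` and `cooccurrences` are assumptions. A group maps concept ids either to a dictionary term (plain string) or to a compiled regular expression.

```python
import re
from itertools import product


def find_group(text, group):
    """Return the ids of the group's concepts that are mentioned in the text."""
    hits = set()
    lowered = text.lower()
    for concept_id, pattern in group.items():
        if hasattr(pattern, "search"):
            # Regular-expression entry: apply the compiled pattern as given.
            if pattern.search(text):
                hits.add(concept_id)
        elif re.search(r"\b" + re.escape(pattern.lower()) + r"\b", lowered):
            # Dictionary entry: case-insensitive whole-word match.
            hits.add(concept_id)
    return hits


def cooccurrences(abstracts, group_a, group_b):
    """Count the abstracts in which a concept of group A and one of group B co-occur."""
    counts = {}
    for text in abstracts:
        pairs = product(find_group(text, group_a), find_group(text, group_b))
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return counts
```

Each counted pair is a candidate relation between two concepts; a count threshold could be applied before such candidates are added to the graph.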


ONDEX system: latest release


ONDEX system: latest release

[Architecture diagram; components shown:]
- Data source import
- Text import
- Generate concepts and relations
- ONDEX database
- ONDEX server
- Data integration server
- Full text index
- Sequence comparison
- Graph analysis and visualisation


ONDEX system: ongoing work

[Architecture diagram; components shown:]
- Scripting
- Global consistency
- Data source import
- Text import
- Generate concepts and relations
- ONDEX database
- DB abstraction layer
- ONDEX server
- Data integration server
- Full text index
- Sequence comparison
- Graph analysis and visualisation


ONDEX system: ongoing work

[Architecture diagram; components shown:]
- Scripting
- Global consistency
- Data source import
- Text import
- Generate concepts and relations
- ONDEX database
- DB abstraction layer
- ONDEX server
- Data integration server
- Full text index
- Sequence comparison
- Taverna Freefluo
- Graph analysis and visualisation


ONDEX system: ongoing work

[Architecture diagram; components shown:]
- SOAP API
- Scripting
- Global consistency
- Data source import
- Text import
- Generate concepts and relations
- ONDEX database
- DB abstraction layer
- ONDEX server
- Web query
- Data integration server
- Full text index
- Sequence comparison
- Taverna Freefluo
- Graph analysis and visualisation
- Web browser
- Web service clients


ONDEX system: outlook

[Architecture diagram; components shown:]
- SOAP API
- Workflow engine
- Scripting
- Global consistency
- Data source import
- Text import
- Generate concepts and relations
- ONDEX database
- DB abstraction layer
- ONDEX server
- Web query
- Data integration server
- Full text index
- Sequence comparison
- Taverna Freefluo
- Workflow generator and editor
- Graph analysis and visualisation
- ONDEX clients
- Web browser
- Web service clients


Applications

- Annotation pipeline
- Microarray analysis
- Text mining to support database curation
- Intelligibility and circularity of terms and definitions in ontologies and taxonomies
- Pathway modelling and simulation


Applications – microarray analysis

Parani, M., Rudrabhatla, S., Myers, R., Weirich, H., Smith, B., Leaman, D.W. and Goldman, S.L. (2004) Microarray analysis of nitric oxide responsive transcripts in Arabidopsis. Plant Biotechnology Journal, 2, 359-366.

- Arabidopsis data with 120 novel genes
- Provided annotation to 71 “novels”
- Lignin biosynthesis

New observations not in original paper:
- Overexpressed transcription factor but no effect on expected gene
- Drought stress
- Jasmonic acid biosynthesis


[Figure: integrated data from several databases]
- Labels: Environment, Genes, Pathways, Reaction, Enzyme, Protein
- Steps: Filtering, Graph Analysis, Graph Layout


[Figure labels:]
- Associated to stress response genes
- At5g52310: not differentially expressed
- Regulated by transcription factors


3rd Integrative Bioinformatics workshop
4th to 6th September 2006
Rothamsted Research, Harpenden, UK

http://www.rothamsted.bbsrc.ac.uk/bab/conf/ibiof/

8th May 2006: Paper submission deadline
23rd June 2006: Notification of acceptance for papers
17th July 2006: Camera ready paper submission deadline
1st August 2006: Registration deadline
15th August 2006: Poster submission deadline