BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Page 1: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart Federated Database Architecture

Arek Kasprzyk, EBI, 9 June 2005

Page 2: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart

• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a simple and scalable data management system capable of integrating distributed data sources.

Page 3: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Challenges

• Data sources
  – Large
  – Distributed
  – Different data

Page 4: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Requirements

• User
  – All data accessible through a single set of interfaces
  – Suitable for power biologists and bioinformaticians
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Distributed
  – Domain agnostic
  – Platform independent

Page 5: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Query Engine

Federated architecture

Page 6: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart

Data mart

User interfaces

Data sources

Page 7: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data mart and dataset

Dataset

Page 8: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data mart, dataset and schema

Schema

Page 9: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Dataset Configuration

XML

XML

XML

Page 10: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart abstractions

• Dataset
  – A subset of data organized into 1 or more tables
• Attribute
  – A single data point, e.g. gene name
• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’
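As a rough illustration of these three abstractions (not BioMart code), the Python sketch below treats a dataset as a list of rows, an attribute as a column name, and a filter as a condition on an attribute. The gene rows, the operator table and the query() helper are all hypothetical and exist only for this example.

```python
# Hypothetical in-memory illustration of the dataset / attribute / filter
# abstractions; the rows and the query() helper are invented for this sketch.
import operator

# A dataset: a subset of data, here one table held as a list of rows.
gene_dataset = [
    {"gene_name": "geneA", "chromosome": "1", "gene_start": 1000},
    {"gene_name": "geneB", "chromosome": "2", "gene_start": 5000},
    {"gene_name": "geneC", "chromosome": "1", "gene_start": 9000},
]

# Operators a filter may apply to an attribute (an assumed, minimal set).
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def query(dataset, attributes, filters):
    """Return the requested attributes for every row that passes all filters.

    Each filter is a triple (attribute, operator, value),
    e.g. ("chromosome", "=", "1").
    """
    rows = [
        row for row in dataset
        if all(OPS[op](row[attr], value) for attr, op, value in filters)
    ]
    return [tuple(row[a] for a in attributes) for row in rows]


# Attribute: gene_name. Filter: chromosome = 1.
result = query(gene_dataset, ["gene_name"], [("chromosome", "=", "1")])
print(result)
```

The point of the sketch is only that a query names what to return (attributes) separately from how to restrict the rows (filters).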

Page 11: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Datasets, Attributes and Filters

GENE

gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description

Mart

Dataset

Attribute

Filter

Page 12: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Examples

• Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder
• Name, chromosome position, description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes

Page 13: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data model

[Diagram: tables related through primary keys (PK) and foreign keys (FK)]

Page 14: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data model

[Diagram: tables related through primary keys (PK) and foreign keys (FK)]

Page 15: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data model

[Diagram: tables related through primary keys (PK) and foreign keys (FK)]

Page 16: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data model - ‘reversed star’

[Diagram: main tables 1 and 2 carrying primary keys PK1 and PK2, with dimension (dm) tables referencing them through foreign keys FK1 and FK2]
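One way to picture the ‘reversed star’ layout is a main table whose primary key is referenced by foreign keys in several dimension tables. The sqlite3 sketch below is an illustration under stated assumptions, not a BioMart schema: the table names, columns and rows are hypothetical, chosen to echo the __main / __dm / _key naming convention shown later in the talk.

```python
# Hypothetical 'reversed star' layout in an in-memory SQLite database.
# Table names, columns and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: one row per central object, identified by its key.
    CREATE TABLE gene__main (
        gene_id_key INTEGER PRIMARY KEY,
        gene_name   TEXT,
        chromosome  TEXT
    );
    -- Dimension tables: any number of rows per main-table row,
    -- each pointing back to the main table through a foreign key.
    CREATE TABLE gene__xref__dm (
        gene_id_key INTEGER REFERENCES gene__main (gene_id_key),
        accession   TEXT
    );
    CREATE TABLE gene__tissue__dm (
        gene_id_key INTEGER REFERENCES gene__main (gene_id_key),
        tissue      TEXT
    );
    INSERT INTO gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
    INSERT INTO gene__xref__dm VALUES (1, 'ACC001'), (1, 'ACC002'), (2, 'ACC003');
    INSERT INTO gene__tissue__dm VALUES (1, 'lung'), (2, 'brain');
""")

# A filter on one dimension (tissue) and an attribute from another
# (accession) both resolve through the main table's key.
rows = conn.execute("""
    SELECT m.gene_name, x.accession
    FROM gene__main AS m
    JOIN gene__tissue__dm AS t ON t.gene_id_key = m.gene_id_key
    JOIN gene__xref__dm   AS x ON x.gene_id_key = m.gene_id_key
    WHERE t.tissue = 'lung'
    ORDER BY x.accession
""").fetchall()
print(rows)
```

Because every dimension table joins directly to the main table, a query needs at most one join per filter or attribute it touches.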

Page 17: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Dataset

Fixed schema transformation

A

B

TA

TB

C

Page 18: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart abstractions

• Link
  – ‘Common currency’ between two datasets, e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import

Page 19: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Exportables, Importables and Links

Dataset 1

Dataset 2

Links

Page 20: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac

Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac

Links

Page 21: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end

Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)

Links
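The slides above suggest that a link pairs an exportable (attributes of Dataset 1) with an importable of the same name (filters of Dataset 2). A minimal Python sketch of that idea follows. It is not the BioMart implementation: the rows, the link() helper, and the assumption that attributes are matched to filters by position are all hypothetical.

```python
# Hypothetical sketch of linking two datasets through an exportable and an
# importable that share a name. Rows and helper names are invented.
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

exportable = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
importable = {
    "name": "genomic_region",
    "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")],
}

dataset1 = [  # e.g. regions of interest
    {"region": "r1", "chr_name": "1", "chr_start": 100, "chr_end": 500},
]
dataset2 = [  # e.g. features with coordinates
    {"feature": "f1", "chr_name": "1", "chr_start": 150, "chr_end": 300},
    {"feature": "f2", "chr_name": "1", "chr_start": 450, "chr_end": 700},
    {"feature": "f3", "chr_name": "2", "chr_start": 150, "chr_end": 300},
]


def link(rows1, exp, rows2, imp):
    """Pair each row of rows1 with the rows of rows2 that pass the
    importable's filters when given the exportable's attribute values.

    Assumption: the i-th exported attribute feeds the i-th imported filter.
    """
    if exp["name"] != imp["name"]:
        raise ValueError("exportable and importable names must match")
    pairs = []
    for r1 in rows1:
        values = [r1[a] for a in exp["attributes"]]
        for r2 in rows2:
            if all(OPS[op](r2[col], v)
                   for (col, op), v in zip(imp["filters"], values)):
                pairs.append((r1, r2))
    return pairs


matches = [(a["region"], b["feature"])
           for a, b in link(dataset1, exportable, dataset2, importable)]
print(matches)
```

Here only feature f1 lies on the same chromosome and entirely inside region r1, so it is the only feature linked to that region.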

Page 22: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Building BioMart databases

Source databases

Mart

Transformation

MartBuilder

Configuration

XML

MartEditor

Page 23: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

MartEditor

Page 24: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Table naming convention

Naïve configuration

• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
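To make the convention concrete, the Python sketch below classifies table and column names by the patterns listed on this slide. The helper names and the example table names are hypothetical, and the exact rules BioMart applies may differ from this reading of the slide.

```python
# Hypothetical helpers that read the naming convention literally:
# meta_<content>, <dataset>__<content>__main, <dataset>__<content>__dm,
# and key columns ending in _key.

def classify_table(name):
    """Classify a table name as meta, main, dimension or unknown."""
    if name.startswith("meta_"):
        return {"kind": "meta", "content": name[len("meta_"):]}
    parts = name.split("__")
    if len(parts) == 3 and parts[2] in ("main", "dm"):
        kind = "main" if parts[2] == "main" else "dimension"
        return {"kind": kind, "dataset": parts[0], "content": parts[1]}
    return {"kind": "unknown"}


def key_columns(columns):
    """Return the columns that follow the _key suffix convention."""
    return [c for c in columns if c.endswith("_key")]


# Example names below are invented for illustration.
print(classify_table("mydataset__gene__main"))
print(classify_table("mydataset__xref__dm"))
print(classify_table("meta_configuration"))
print(key_columns(["gene_id_key", "gene_name", "transcript_id_key"]))
```

A naïve configuration can be derived this way because the names alone say which tables are main tables, which are dimensions, and which columns join them.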

Page 25: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart architecture

[Diagram labels, grouped by layer]
• Retrieval: MartExplorer, MartShell, MartView
• BioMart API: Java, Perl
• Configuration: MartEditor (XML)
• Schema transformation: MartBuilder
• Databases: myDatabase, SNP, Vega, Ensembl, UniProt, myMart, MSD
• Public data (local or remote)

Page 26: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

MartView

Page 27: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

MartExplorer

Page 28: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

MartShell

Using = dataset
Get = attribute
Where = filter

Page 29: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Mart Query Language (MQL)

● Mart Query Language (MQL) syntax:
    using <dataset> get <attributes> where <filters>
● Can join datasets together:
    using Dataset1 get Attribute1 where Filter1=var1 as q;
    using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
    martshell.sh -E MQLscript.mql > results.txt
    martshell.sh -E MQLscript.mql | wc
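The using / get / where pattern is regular enough to compose programmatically. The Python sketch below only assembles query text in the shape shown on this slide. The build_mql() helper is hypothetical, and the comma used to separate multiple attributes is an assumption, since the slide shows a single attribute per query.

```python
# Hypothetical helper that assembles MQL text in the shape shown on the
# slide: using <dataset> get <attributes> where <filters> [as <name>].

def build_mql(dataset, attributes, filters=(), name=None):
    """Compose one MQL statement as a string.

    Assumption: multiple attributes are comma-separated; multiple
    filters are joined with 'and', as in the slide's second example.
    """
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if name:
        query += f" as {name}"
    return query


# Reproduce the slide's two-dataset join: the first query is named q
# and the second one filters on its results.
q1 = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"], name="q")
q2 = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
script = q1 + ";\n" + q2
print(script)
```

The resulting text could be saved to a file such as MQLscript.mql and passed to martshell.sh -E as shown on the slide.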

Page 30: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Third party software

• Bioconductor (biomaRt)
  – BioMart schema
• Taverna
  – BioMart Java library
• DAS ProServer
  – BioMart Perl library

Page 31: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

biomaRt

Page 32: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Taverna

Page 33: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

ProServer

• No programming
• DAS requests and responses defined by Exportables and Importables and configured by MartEditor
• DAS1

Page 34: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart deployers

• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

Page 35: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

EBI: UniProt, MSD

Sanger: Ensembl, SNP, Vega, Sequence

WWW

Hinxton example

Page 36: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

Page 37: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

WormBase

Page 38: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Ensembl

Page 39: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

ArrayExpress

Page 40: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)

Page 41: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Data sources: dbSNP, HapMap, Ensembl, RefSeq, AceView, Vega

• Give me frequency data from dbSNP
• Give me genotype and frequency data from HapMap
• Give me SNP locations on gene/transcript
• Give me frequency, genotype, location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega

Interfaces: Java graphical user interface, WWW web browser

GMIA_SNP_mart_database

SNP1 T/A AL13929 963253 1
SNP2 C/T AL13929 963255 -1
SNP3 C/G AL13929 963258 1
…

Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.

Page 42: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

… what next ?

Page 43: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart model

• Already applied
  – Ensembl
  – Vega
  – SNP
  – UniProt
  – MSD
  – ArrayExpress
  – WormBase
  – Variety of ‘in house’ projects
• In development
  – HapMap

Page 44: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Summary

• BioMart interface
  – Batch queries
  – ‘Data mining’
  – Large annotation
• BioMart software
  – Set up your own database
  – Make your database scalable and responsive
  – Federate with other data

Page 45: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Where are we?

• 0.2 released in February
• 0.3 to be released in June
  – Platforms
    • MySQL
    • Oracle
    • Postgres

Page 46: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

Acknowledgments

• BioMart
  – Damian Smedley (EBI)
  – Darin London (EBI)
  – Will Spooner (CSHL)
• Contributors
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (UniProt)
  – Paul Donlon (Unilever)

Page 47: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.
Page 48: BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.