The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective
Ugis Sarkans
European Bioinformatics Institute
Outline
• Microarray data and standards overview
• ArrayExpress overall principles
• ArrayExpress architecture
• AE repository
• AE data warehouse
• Future plans and conclusions
Gene expression data and annotation
[Figure: gene expression matrix with genes on one axis and samples on the other. Labels: gene expression levels (problem 2), sample annotations (problem 1), gene annotations.]
Platform comparison (Tan et al, PNAS, 2003)
‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margareth Cam, NIH)
[Figure: microarray experiment workflow. Each hybridisation chain is labelled Sample, RNA extract, labelled nucleic acid, hybridisation, array (microarray), with the array built to an array design and protocols attached along the chain. The hybridisations make up an experiment; normalization and integration lead to the gene expression data matrix (genes).]
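Different processing levels of MA data
[Figure: processing levels of microarray data, with panels labelled A to D. Labels: array scans; quantitations (spots × quantitations); gene expression matrix (genes × samples).]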
MGED standards
• MIAME – minimum information about a microarray experiment
• MAGE-OM and MAGE-ML – microarray gene expression object model and mark-up language
• MO – microarray ontology
• Data normalisation and transformations (and quality control)
UML Packages of MAGE
[Figure: MAGE package diagram. Packages: BioEvent, Experiment, ArrayDesign, BioMaterial, BioAssayData, BioAssay, DesignElement, HigherLevelAnalysis, BioSequence, Array, QuantitationType, Description, Protocol, Measurement, AuditAndSecurity, BQS. Legend groups: what was used, what was done, results, miscellaneous.]
MAGE – an example diagram
ArrayExpress aims
• An archive for microarray data supporting scientific publications
• Providing easy access to public gene expression and other microarray data in a structured format
• Facilitating the sharing of microarray designs and protocols
• Facilitating the establishment of infrastructure for microarray data sharing
AE users
• Experimentalists
• “Single-gene” biologists
• Bioinformaticians; genome-wide studies
• Bioinformaticians – algorithm developers
• Software developers
ArrayExpress infrastructure
[Figure: ArrayExpress infrastructure at EBI. Elements shown: ArrayExpress repository; Warehouse (BioMart); MIAMExpress submissions; external MIAMExpress installations (Camb. U., EMBL); submission tracking/curation tool; MAGE-ML submissions from other microarray databases (SMD, TIGR, Utrecht, RZPD), array manufacturers (Affymetrix, Agilent) and data analysis software (R/Bioconductor, J-Express, Resolver); queries and analysis over the www; data analysis with Expression Profiler; external databases (EMBL, UniProt, Ensembl).]
AE: overall principles
• Adherence to community standards
• Data captured in a granular, formalized manner
• Modern but proven software technologies
• Incremental development
AE design considerations
• Separate data archiving from the query-optimized data warehouse
• Generate default implementation, then refine
– ~2 full-time developers
– pressure to bring system online quickly
• Use object abstraction layer
– deal with performance overhead on case-by-case basis
Repository architecture overview
[Figure: repository architecture. Elements shown: MAGE-ML DTD; MAGE-OM; MAGE-ML documents; MAGE validator (error.log); MAGE loader; MAGE unloader; Castor object/relational mapping; Oracle DB; Java servlets; Velocity; web page templates; Tomcat; curation environment.]
AE schema: why auto-generated?
– AE must be able to import any valid MAGE-ML and not lose information
– good for navigating through data in terms of object model
– if some queries don’t work well, add something to the schema
• e.g. Experiment-Biomaterial, Experiment-Protocol links
– so far works for 400 GB of data
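The idea of generating a default schema from the object model and then adding shortcut links by hand can be sketched as follows. This is only an illustration under stated assumptions: the two classes are hypothetical, heavily simplified stand-ins for MAGE-OM classes, `sqlite3` stands in for the Oracle database, and the generator is a toy, not the Castor-based tooling named on the architecture slide.

```python
import sqlite3
from dataclasses import dataclass, fields

# Hypothetical, simplified stand-ins for object-model classes.
@dataclass
class Experiment:
    identifier: str
    name: str

@dataclass
class BioMaterial:
    identifier: str
    name: str

# Mapping from Python attribute types to SQL column types.
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}

def create_table_sql(cls):
    """Derive a CREATE TABLE statement from a dataclass definition."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({cols})"

# Hand-written addition for a query that was slow on the default schema:
# a direct Experiment-to-BioMaterial link table (hypothetical layout).
EXTRA_LINKS = [
    "CREATE TABLE Experiment_BioMaterial "
    "(experiment_id TEXT, biomaterial_id TEXT)",
]

def build_schema(conn, classes, extras=()):
    """Create the generated tables first, then the hand-added ones."""
    for cls in classes:
        conn.execute(create_table_sql(cls))
    for stmt in extras:
        conn.execute(stmt)

conn = sqlite3.connect(":memory:")
build_schema(conn, [Experiment, BioMaterial], EXTRA_LINKS)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Because every table is derived from a class definition, any attribute in the model has somewhere to go on import; the hand-added link table only shortens a navigation path that the generated tables already support.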
Auto-generated web pages
To ontologize or not to ontologize
At the beginning: BioSource with fixed attributes species, age, sex, cellLine, tissue, color, distanceToSun, weight, favoriteCereal, ...
At the end: BioSource with 0..n OntologyEntry (category, value, description)
Model vs. ontology
• Model – stable; ontologies – flexible
• Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard
• Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model
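A minimal sketch of this split, assuming Python dataclasses as a stand-in for the object model: the class structure (a BioSource holding 0..n OntologyEntry objects with category, value and description) is fixed, while the set of annotation categories is open-ended. The helper method and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OntologyEntry:
    category: str          # e.g. "species"; ideally a term from an ontology
    value: str
    description: str = ""

@dataclass
class BioSource:
    name: str
    # 0..n ontology entries replace a fixed list of attributes.
    characteristics: List[OntologyEntry] = field(default_factory=list)

    def values(self, category):
        """Return all values recorded under one category (hypothetical helper)."""
        return [e.value for e in self.characteristics if e.category == category]

sample = BioSource("sample 1")  # hypothetical example data
sample.characteristics.append(OntologyEntry("species", "Homo sapiens"))
sample.characteristics.append(OntologyEntry("tissue", "liver"))
# A new kind of annotation needs no change to the class definitions.
sample.characteristics.append(OntologyEntry("favoriteCereal", "oats"))

print(sample.values("species"))
```

The cost of this flexibility is that nothing in the model itself constrains which categories or values are used; that control moves to the ontology.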
Experiment 1
• type
• performer
• …
Hybridization data 1
• Experimental factors
• Quantitation type definitions
• …
>15 000 000 000 data points
NetCDF
Data warehouse schema
[Figure: data warehouse schema. Entities shown: expression value (ratio or absolute); gene; sample; bioassay (hybridization); experiment; array design; array element. Property tables: gene property (e.g. GO annot.); experiment property (e.g. type); bioassay property (e.g. exper. factor); sample property (e.g. species, tissue).]
What BioMart gives to AEDW
• Query language abstraction
– Joins automatically generated
• Schema optimized for performance
• Clear database integration roadmap
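What "joins automatically generated" means in practice can be sketched with a toy query builder: the caller names only the attributes and filters, and the joins are derived from a description of how each table links to the central table. This is not BioMart's API; the table names, key names, `build_query` function and data below are all hypothetical, and `sqlite3` is used only to make the sketch runnable.

```python
import sqlite3

# Hypothetical warehouse layout: one central table plus dimension tables,
# each linked to the central table by a shared key column.
MAIN = "expression_value"
JOIN_KEYS = {"gene": "gene_id", "sample": "sample_id"}

def build_query(select, filters):
    """Build SQL from 'table.column' names; joins are added automatically.

    select  -- list of 'table.column' strings to return
    filters -- dict mapping 'table.column' to the required value
    """
    tables = {c.split(".")[0] for c in list(select) + list(filters)} - {MAIN}
    sql = f"SELECT {', '.join(select)} FROM {MAIN}"
    for table in sorted(tables):
        key = JOIN_KEYS[table]
        sql += f" JOIN {table} ON {MAIN}.{key} = {table}.{key}"
    params = list(filters.values())
    if filters:
        sql += " WHERE " + " AND ".join(f"{c} = ?" for c in filters)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER, name TEXT);
CREATE TABLE sample (sample_id INTEGER, tissue TEXT);
CREATE TABLE expression_value (gene_id INTEGER, sample_id INTEGER, value REAL);
INSERT INTO gene VALUES (1, 'geneA');
INSERT INTO gene VALUES (2, 'geneB');
INSERT INTO sample VALUES (10, 'liver');
INSERT INTO sample VALUES (11, 'brain');
INSERT INTO expression_value VALUES (1, 10, 2.5);
INSERT INTO expression_value VALUES (1, 11, 0.4);
INSERT INTO expression_value VALUES (2, 10, 1.1);
""")

sql, params = build_query(
    ["gene.name", "expression_value.value"], {"sample.tissue": "liver"}
)
rows = conn.execute(sql + " ORDER BY gene.name", params).fetchall()
print(sql)
print(rows)
```

The user-facing query mentions no join at all; both joins come from `JOIN_KEYS`, which is the part a warehouse framework can maintain alongside the schema.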
ArrayExpress environment
[Figure: ArrayExpress environment. People: external users, curators, developers. Web tier: web router; production Tomcat 1 and production Tomcat 2 (Linux nodes); curation Tomcat (alpha); developers' Tomcats (PC). Databases: production database; prod. DB clone; curation (data testing) database; dev./test database; prototype DW; development DW. Tools: production, curation and development data management tools. Inputs: MIAMExpress or pipeline MAGE-ML; MAGE-ML from a new pipeline; any MAGE-ML.]
Future plans
• Data management environment automation
• Flexible data warehouse interface
• Programmatic interface (HTTP/XML based)
• Distributed infrastructure??
Distributed data infrastructure
[Figure: distributed data infrastructure. Users send a query to a query broker; the broker finds the resource among ArrayExpress and several local databases; data is delivered.]
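The query, find resource, deliver data flow can be sketched as below. The slides present this only as a possible future direction, so everything here is an assumption made for illustration: the class names, the lookup by accession, and the in-memory dictionaries standing in for ArrayExpress and the local databases.

```python
class Resource:
    """A data source the broker can consult (hypothetical interface)."""

    def __init__(self, name, experiments):
        self.name = name
        self._experiments = dict(experiments)  # accession -> data

    def has(self, accession):
        return accession in self._experiments

    def fetch(self, accession):
        return self._experiments[accession]


class QueryBroker:
    """Routes a query to the first resource that holds the requested data."""

    def __init__(self, resources):
        self.resources = list(resources)

    def find_resource(self, accession):
        for resource in self.resources:
            if resource.has(accession):
                return resource
        return None

    def query(self, accession):
        resource = self.find_resource(accession)
        if resource is None:
            return None
        # Deliver the data together with the name of the source.
        return resource.name, resource.fetch(accession)


# Hypothetical accessions and contents.
broker = QueryBroker([
    Resource("ArrayExpress", {"exp-1": "public data"}),
    Resource("local database A", {"exp-2": "local data"}),
])
print(broker.query("exp-2"))
print(broker.query("exp-9"))
```

A real broker would also have to deal with access control, differing schemas and partial results, none of which this sketch attempts.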
Conclusions
• Conceptual object modeling works well for complex life sciences domains
• Many software infrastructure components can be auto-generated from object models
• A range of approaches can be used for modeling, e.g., UML framework + ontologies
• Repository and data warehouse – different aims and different implementation principles
Acknowledgements
• Gonzalo Garcia Lara – web interface
• Ahmet Oezcimen – DBA
• Anjan Sharma – curation tool
• Sergio Contrino, Richard Coulson – data warehouse
• Niran Abeygunawardena – webmaster
• Mohammadreza Shojatalab – MIAMExpress
• Misha Kapushesky – Expression Profiler
• Curation team:
– Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner
• Domain-specific projects:
– Susanna Sansone, Philippe Rocca-Serra
• Alvis Brazma
• MGED collaborators
– Stanford, TIGR, Affymetrix, EMBL, ….
• BioMart team