ArrayExpress and Expression Atlas: Mining Functional Genomics data
Gabriella Rustici, PhD, Functional Genomics Team, EBI-EMBL
What is functional genomics (FG)?
• The aim of FG is to understand the function of genes and other parts of the genome
• FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions
• High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome
2 ArrayExpress
What biological questions is FG addressing?
• When and where are genes expressed?
• How do gene expression levels differ in various cell types and states?
• What are the functional roles of different genes and in what cellular processes do they participate?
• How are genes regulated?
• How do genes and gene products interact?
• How is gene expression changed in various diseases or following a treatment?
Components of a FG experiment
FG public repositories: ArrayExpress
• Is a public repository for FG data, providing easy access to well-annotated data in a structured and standardized format
• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
• Facilitates the sharing of experimental information associated with the data, such as microarray designs and experimental protocols
• Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)
Community standards for data requirement
MIAME = Minimal Information About a Microarray Experiment
MINSEQE = Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The checklist: requirements (table columns: MIAME, MINSEQE)
1. Experiment design / background description
2. Sample annotation and experimental factor
3. Array design annotation (e.g. probe sequence)
4. All protocols (wet-lab bench and data processing)
5. Raw data files (from scanner or sequencing machine)
6. Processed data files (normalised and/or transformed)
Standards for microarray & sequencing: MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray or HTS experiment:
IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
ADF: Array Design Format file. Describes the probes on an array, e.g. sequence, genomic mapping location (for array data only).
Data files: Raw and processed data files. For HTS experiments, the ‘raw’ data files are the trace data files (.srf or .sff). Fastq format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
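As a rough illustration of how an SDRF ties samples to experimental factors and data files, the sketch below reads a tiny tab-delimited SDRF in Python. The column headings follow common MAGE-TAB usage (Source Name, Characteristics[...], Factor Value[...], Array Data File), but the sample names, file names and values are made up for the example, and read_sdrf and factor_values are hypothetical helper names, not part of any ArrayExpress tool.

```python
import csv
import io

# Made-up SDRF content for illustration; real files have many more columns.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample 1\tHomo sapiens\tsarcoma\tsample1.CEL\n"
    "sample 2\tHomo sapiens\tsarcoma\tsample2.CEL\n"
    "sample 3\tHomo sapiens\tnormal\tsample3.CEL\n"
)


def read_sdrf(handle):
    """Return one dict per SDRF row, keyed by column heading."""
    return list(csv.DictReader(handle, delimiter="\t"))


def factor_values(rows, factor):
    """Map each source (sample) name to its value for one experimental factor."""
    column = f"Factor Value[{factor}]"
    return {row["Source Name"]: row[column] for row in rows}


rows = read_sdrf(io.StringIO(SDRF_TEXT))
print(factor_values(rows, "disease"))
print([row["Array Data File"] for row in rows])
```

Real SDRF files often repeat headings such as Protocol REF, and csv.DictReader keeps only the last column for a repeated heading, so a production parser would probably work with csv.reader and column positions instead.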
ArrayExpress – two databases
What is the difference between them?
ArrayExpress Archive
• Central object: experiment
• Query to retrieve experimental information and associated data
Expression Atlas
• Central object: gene/condition
• Query for gene expression changes across experiments and across platforms
ArrayExpress – two databases
ArrayExpress Archive – when to use it?
• Find FG experiments that might be relevant to your research
• Download data and re-analyze it.
Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments.
• Submit microarray or HTS data that you want to publish.
Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.
How much data in AE Archive? (as of mid-September 2012)
HTS data in AE Archive (as of mid-September 2012)
Microarray vs HTS
RNA-, DNA-, ChIP-seq breakdown
ArrayExpress: www.ebi.ac.uk/arrayexpress/
Browsing the AE Archive
The direct link to raw and processed data. An icon indicates that this type of data is available.
The total number of experiments and assays retrieved
Species investigated
Curated title of experiment
The date when the data were loaded in the Archive
AE unique experiment ID
Number of assays
The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
“Loaded in Atlas” flag
Raw sequencing data available in ENA
Browsing the AE Archive
Searching AE with the Experimental Factor Ontology (EFO)
Application-focused ontology modeling the relationships between experimental factors (EFs) in AE
Developed to:
• increase the richness of annotations that are currently made in AE Archive
• promote consistency
• facilitate automatic annotation and integrate external data
EFs are transformed into an ontological representation, forming classes and relationships between those classes
EFO combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology, NCBI Taxonomy
Building EFO: an example
1. Take all experimental factors: sarcoma, cancer, neoplasm, disease, Kaposi’s sarcoma
2. Find the logical connection between them:
   disease is the parent term
   neoplasm is a type of disease
   cancer is synonym of neoplasm
   sarcoma is a type of cancer
   Kaposi’s sarcoma is a type of sarcoma
3. Organize them in an ontology:
   [-] disease
       [-] neoplasm
           [-] cancer
               [-] sarcoma
                   Kaposi’s sarcoma
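To make the idea of an ontology-backed search concrete, here is a minimal Python sketch that stores the example hierarchy above as child-to-parent links and expands a query term to include all of its child terms. It is only an illustration under simple assumptions: PARENT, descendants and expand_query are hypothetical names, synonyms are left out, and the real EFO is far larger and is not queried this way.

```python
# Child -> parent ("is a type of") links, taken from the slide's example tree.
PARENT = {
    "neoplasm": "disease",
    "cancer": "neoplasm",
    "sarcoma": "cancer",
    "Kaposi's sarcoma": "sarcoma",
}


def descendants(term):
    """Return every term below `term` in the hierarchy, nearest first."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        children = sorted(c for c, p in PARENT.items() if p == current)
        found.extend(children)
        frontier.extend(children)
    return found


def expand_query(term):
    """Return the term itself followed by all of its child terms."""
    return [term] + descendants(term)


print(expand_query("cancer"))
# ['cancer', 'sarcoma', "Kaposi's sarcoma"]
```

With this kind of expansion, a search for "cancer" would also retrieve experiments annotated only with "sarcoma" or "Kaposi's sarcoma".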
Exploring EFO: an example
More information at: http://www.ebi.ac.uk/efo
Searching AE Archive: simple query
AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
AE Archive – experiment view
SDRF file – sample & data relationship
ArrayExpress – two databases
Expression Atlas – when to use it?
• Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments available in the Expression Atlas;
• Discover which genes are differentially expressed in a particular biological condition that you are interested in.
Expression Atlas construction: experiment selection criteria during curation
• Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to enable re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
• At least 3 replicates for each value of the experimental factor
• Maximum 4 experimental factors
• Adequate sample annotation using EFO terms
• Presence of data files: CEL raw data files for Affymetrix assays, processed data files for non-Affymetrix ones
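Two of the selection criteria above (at least 3 replicates per factor value, at most 4 experimental factors) are simple enough to check mechanically. The Python sketch below shows one way this could look; selection_problems and the sample annotations are hypothetical and invented for illustration, and this is not the curators' actual tooling.

```python
from collections import Counter

MIN_REPLICATES = 3  # "At least 3 replicates for each value of the experimental factor"
MAX_FACTORS = 4     # "Maximum 4 experimental factors"


def selection_problems(assays):
    """Check a list of per-assay {factor name: value} dicts against two criteria.

    Returns a list of human-readable problems; an empty list means both pass.
    """
    problems = []
    factor_names = sorted({name for assay in assays for name in assay})
    if len(factor_names) > MAX_FACTORS:
        problems.append(
            f"{len(factor_names)} experimental factors, maximum is {MAX_FACTORS}"
        )
    for name in factor_names:
        counts = Counter(assay[name] for assay in assays if name in assay)
        for value, n in sorted(counts.items()):
            if n < MIN_REPLICATES:
                problems.append(
                    f"{name}={value}: {n} replicate(s), need {MIN_REPLICATES}"
                )
    return problems


# Made-up annotations: three sarcoma assays but only two normal ones.
assays = [{"disease": "sarcoma"}] * 3 + [{"disease": "normal"}] * 2
print(selection_problems(assays))
```

The remaining criteria (array design annotation, EFO-based sample annotation, presence of data files) depend on curated metadata and are not modelled here.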
Expression Atlas construction: analysis pipeline
Input data (Affy CEL, non-Affy processed): genes × conditions (Cond.1, Cond.2, Cond.3)
Linear model* (Bioconductor limma)
Output: 2-D matrix of genes × conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed
* More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full
Expression Atlas construction: analysis pipeline
“Is gene X differentially expressed in condition 1 in this experiment?”
Figure labels: Cond.1 mean, Cond.2 mean, Cond.3 mean, mean of all samples (each marker = a single expression value for gene X)
Compare and calculate statistic
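The comparison described on this slide can be sketched in a few lines of Python: each condition's mean for gene X is compared with the mean of all samples, and a 1 or 0 is recorded per condition. This is a deliberately simplified stand-in, assuming a plain t-like statistic and an arbitrary cut-off of 2.0; the Atlas itself uses a limma linear model, and the function name and expression values below are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev


def condition_vs_overall(values_by_condition, threshold=2.0):
    """Return {condition: 1 or 0} for one gene.

    A condition is called 1 (differentially expressed) when its mean differs
    from the mean of all samples by at least `threshold` standard errors.
    Simplified illustration only; not the limma model used by the Atlas.
    """
    all_values = [v for vals in values_by_condition.values() for v in vals]
    overall = mean(all_values)
    calls = {}
    for condition, vals in values_by_condition.items():
        standard_error = stdev(vals) / sqrt(len(vals))
        if standard_error > 0:
            statistic = (mean(vals) - overall) / standard_error
        else:
            statistic = 0.0
        calls[condition] = 1 if abs(statistic) >= threshold else 0
    return calls


# Made-up expression values for gene X, three replicates per condition.
gene_x = {
    "Cond.1": [10.0, 10.2, 9.8],
    "Cond.2": [6.5, 7.0, 6.0],
    "Cond.3": [3.2, 3.0, 3.4],
}
print(condition_vs_overall(gene_x))
```

Running the same function over every gene in an experiment would give the genes × conditions matrix of 1s and 0s described as the pipeline output.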
Expression Atlas construction
Each experiment is a genes × conditions matrix that goes through its own statistical test:
Exp. 1: Cond.1, Cond.2, Cond.3
Exp. 2: Cond.4, Cond.5, Cond.6
Exp. n: Cond.X, Cond.Y, Cond.Z
Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition
Expression Atlas construction
Summary of the “verdicts” from different experiments
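One simple way to picture the summary step is as vote counting: for each gene and condition, tally how many experiments called the gene up, down, or not differentially expressed. The Python sketch below assumes verdicts arrive as (experiment, gene, condition, call) tuples; that layout, the function name and the example data are all hypothetical.

```python
from collections import Counter


def summarise_votes(verdicts):
    """Tally per-experiment calls for each (gene, condition) pair.

    `verdicts` is an iterable of (experiment, gene, condition, call) tuples,
    where call is "up", "down" or "none".
    """
    summary = {}
    for _experiment, gene, condition, call in verdicts:
        summary.setdefault((gene, condition), Counter())[call] += 1
    return summary


# Made-up verdicts from three experiments that share one condition.
verdicts = [
    ("Exp. 1", "gene X", "condition A", "up"),
    ("Exp. 2", "gene X", "condition A", "up"),
    ("Exp. 3", "gene X", "condition A", "down"),
    ("Exp. 1", "gene Y", "condition A", "none"),
]
for key, counts in summarise_votes(verdicts).items():
    print(key, dict(counts))
```

Keeping the "up" and "down" counts separate is what makes it possible to restrict a query by direction of differential expression.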
Expression Atlas
Atlas home page: http://www.ebi.ac.uk/gxa/
Query for genes
Query for conditions
Restrict query by direction of differential expression
The ‘advanced query’ option allows building more complex queries
Atlas home page: the ‘Genes’ and ‘Conditions’ search boxes
Atlas home page: a single gene query
Atlas gene summary page
Atlas experiment page
Atlas experiment page – HTS data
Atlas home page: a ‘Conditions’ only query
Atlas heatmap view
Atlas gene-condition query
Atlas advanced search
A glimpse of what’s coming… “Differential atlas”
“Is gene X differentially expressed in condition 1 in this experiment?”
Figure labels: Cond.1 mean, Cond.2 mean, Cond.3 mean, mean of all samples (each marker = a single expression value for gene X)
Compare and calculate statistic
A glimpse of what’s coming… “Differential atlas” mock-up (1)
A glimpse of what’s coming… “Differential atlas” mock-up (2)
A glimpse of what’s coming… “Baseline atlas”
• Gene expression in normal tissues, not looking for differentially expressed genes based on different conditions
• E.g. “Give me all the genes expressed in normal human kidney”
• Can also filter genes by expression level (e.g. FPKM values)
• Start with Illumina Body Map 2.0 RNA-seq data
• 16 tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells
• We are working on something similar for mouse
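A baseline query such as “give me all the genes expressed in normal human kidney” amounts to filtering a gene × tissue table by an expression cut-off. The Python sketch below illustrates that idea only; the gene names, FPKM values, the default cut-off of 1.0 and the function name are made up, and none of this reflects how the planned Baseline atlas is implemented.

```python
def expressed_genes(fpkm_by_gene, tissue, min_fpkm=1.0):
    """Return the genes whose FPKM in `tissue` is at least `min_fpkm`, sorted by name."""
    return sorted(
        gene
        for gene, by_tissue in fpkm_by_gene.items()
        if by_tissue.get(tissue, 0.0) >= min_fpkm
    )


# Made-up FPKM values for three placeholder genes in two of the 16 tissues.
fpkm_by_gene = {
    "gene A": {"kidney": 25.0, "liver": 0.2},
    "gene B": {"kidney": 0.4, "liver": 80.0},
    "gene C": {"kidney": 3.1, "liver": 2.9},
}
print(expressed_genes(fpkm_by_gene, "kidney"))
print(expressed_genes(fpkm_by_gene, "kidney", min_fpkm=10.0))
```

Raising min_fpkm narrows the result to more highly expressed genes, which is the kind of filtering by expression level mentioned in the bullets above.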
A glimpse of what’s coming… “Baseline atlas” mock-up display
Data submission to AE
Data submission to AE: www.ebi.ac.uk/microarray/submissions.html
Submission of HTS data
ArrayExpress acts as a “broker” for the submitter.
• Meta-data and processed data: ArrayExpress
• Raw sequence reads* (e.g. fastq, bam): ENA
*See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats
Find out more
• Visit our eLearning portal, Train online, at http://www.ebi.ac.uk/training/online/ for courses on ArrayExpress and Atlas
• Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y
• Email us at: [email protected]
• Atlas mailing list: [email protected]