EBI is an Outstation of the European Molecular Biology Laboratory.
How to store and visualize RNA-seq data
Gabriella Rustici, Functional Genomics Group
Talk summary
• How we archive RNA-seq data in ArrayExpress
• How we process RNA-seq data
• How we display RNA-seq data in the Expression Atlas
HTS data in ArrayExpress and Atlas, 26/08/2011
Components of a functional genomics experiment
ArrayExpress (www.ebi.ac.uk/arrayexpress/)
• Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
• Provides easy access to well-annotated data in a structured and standardized format
• Facilitates the sharing of microarray designs, experimental protocols, …
• Is based on community standards: MIAME guidelines & MAGE-TAB format for microarray, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)
Standards for sequencing: MINSEQE guidelines
Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed guidelines for MINSEQE are (still work in progress):
1. General information about the experiment
2. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
3. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, …)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6. Final processed data for the set of assays in the experiment
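The six proposed components can be read as a checklist. The sketch below is only an illustration of that idea: the dictionary keys and the `missing_components` helper are hypothetical names, not part of MINSEQE or of any ArrayExpress tool.

```python
# Hypothetical keys mirroring the six proposed MINSEQE components, in order.
MINSEQE_COMPONENTS = [
    "general_information",
    "sample_annotation",
    "experimental_design",
    "protocols",
    "sequence_reads",
    "processed_data",
]


def missing_components(submission):
    """Return the components that are absent or empty, in guideline order."""
    return [name for name in MINSEQE_COMPONENTS if not submission.get(name)]


# Made-up example submission; "processed_data" is deliberately left out.
submission = {
    "general_information": {"title": "Example RNA-seq study"},
    "sample_annotation": [{"sample": "s1", "compound": "drugA", "dose": "10 uM"}],
    "experimental_design": {"s1": ["s1.fastq"]},
    "protocols": ["library preparation", "alignment"],
    "sequence_reads": ["s1.fastq"],
}

print(missing_components(submission))
```

A real curation check would look at the content of each component, not just at whether it is present.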
Standards for microarray & sequencing: MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
• IDF: Investigation Description Format file. Contains top-level information about the experiment including title, description, submitter contact details and protocols.
• SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
• Data files: raw and processed data files. The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
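Because the SDRF is a tab-delimited table linking samples to data files, it can be read with standard tools. The following is a minimal sketch, assuming a small made-up SDRF: the column headers follow the general MAGE-TAB pattern but are chosen here for illustration, and the sample and file names are invented.

```python
import csv
import io

# Illustrative SDRF content (assumed columns, invented sample and file names).
SDRF_ROWS = [
    ["Source Name", "Characteristics[organism]", "Factor Value[compound]", "Comment[FASTQ_URI]"],
    ["sample1", "Homo sapiens", "drugA", "sample1_1.fastq"],
    ["sample1", "Homo sapiens", "drugA", "sample1_2.fastq"],
    ["sample2", "Homo sapiens", "none", "sample2_1.fastq"],
]
SDRF_TEXT = "\n".join("\t".join(row) for row in SDRF_ROWS) + "\n"


def files_by_sample(sdrf_text, file_column="Comment[FASTQ_URI]"):
    """Map each sample (Source Name) to the list of data files recorded for it."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping = {}
    for row in reader:
        mapping.setdefault(row["Source Name"], []).append(row[file_column])
    return mapping


print(files_by_sample(SDRF_TEXT))
```

The same mapping is what the MINSEQE "sample data relationships" item asks submitters to make explicit.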
Types of data that can be submitted
ArrayExpress – two databases
What is the difference between Archive and Atlas?
Archive
• Query by experiment, sample and experimental factor annotations
• Filter on species, array platform, molecule assayed and technology used
Atlas
• Gene and/or condition queries
• Query across experiments and across platforms
How much data in AE Archive?
Browsing the AE Archive
Screenshot annotations:
• Direct links to raw and processed data. An icon indicates that this type of data is available.
• The total number of experiments and assays retrieved
• Species investigated
• Curated title of experiment
• The date when the data were loaded in the Archive
• AE unique experiment ID
• Number of assays
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
• ‘Loaded in Atlas’ flag
Raw sequencing data available in ENA
RNA-seq data in AE Archive
HTS data in AE Archive
Link to raw data in ENA
RNA-seq processing pipeline
[Diagram labels: Data acquisition (direct data submissions and GEO import); ArrayExpress Archive; ENA; EGA; short reads (FASTQ files); SDRF; FASTQ; RNA-seq processing pipeline; BAMs; RPKMs; summary-level data; Expression Atlas; Ensembl]
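The RPKM values named in the pipeline diagram are reads per kilobase of transcript per million mapped reads. As a worked illustration of that standard definition (not necessarily the exact computation performed by the pipeline's estimation tools), the function name and the example numbers below are assumptions:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the
    gene, N the total number of mapped reads and L the gene length in bases.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up numbers: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Dividing by gene length and library size is what makes values comparable between genes and between samples.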
RNA-seq processing pipeline: ArrayExpressHTS
• ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets
• The pipeline can be used for analyzing private data, or public data available through ArrayExpress and ENA
• It can be used on a local computer, or remotely on the EBI R Cloud (www.ebi.ac.uk/tools/rcloud)
Goncalves et al., Bioinformatics 2011
ArrayExpressHTS in Bioconductor
ArrayExpressHTS pipeline
• Reference: transcriptome or genome
• Alignment: Bowtie, BWA or TopHat
• Expression estimation: Cufflinks or MMSEQ
• Filtering options (e.g., average base quality, read complexity, …)
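To make the "average base quality" filtering option concrete, here is a minimal sketch of such a filter. It is not the ArrayExpressHTS implementation: the function names, the threshold of 20 and the Phred+33 quality encoding are all assumptions, and the reads are invented.

```python
def mean_phred(quality_string, offset=33):
    """Average Phred score of a read, decoding ASCII quality characters (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def filter_reads(records, min_mean_quality=20.0):
    """Keep (name, sequence, quality) records whose average base quality meets the threshold."""
    return [
        (name, seq, qual)
        for name, seq, qual in records
        if qual and mean_phred(qual) >= min_mean_quality
    ]


# Invented reads: 'I' decodes to 40, '5' to 20 and '!' to 0 under Phred+33.
reads = [
    ("read1", "ACGT", "IIII"),
    ("read2", "ACGT", "!!!!"),
    ("read3", "ACGT", "5555"),
]
print([name for name, _, _ in filter_reads(reads)])
```

Data encoded with a different offset (some older instruments used 64) would need `offset` changed accordingly.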
Using ArrayExpressHTS
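library("ArrayExpressHTS")
# Process the public dataset E-GEOD-16190; usercloud = FALSE runs on a local setup rather than the EBI R Cloud
aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)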
ArrayExpressHTS on the R cloud
[Diagram labels: pipeline tools (tophat, bowtie, bwa, cufflinks, samtools); R-cloud with R-servers; references, index & annotation; ENA; ArrayExpress; SDRF, IDF; raw data, experiment metadata; ArrayExpressHTS R package; user project storage; ExpressionSet, quality reports]
Expression Atlas: experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
• For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
• High MIAME/MINSEQE scores
• Experiment must have 6 or more assays
• Sufficient replication and large sample size
• Experimental factors (EF) and experimental factor values (EFV) must be well annotated
• Adequate sample annotation must be provided
• Processed data must be provided, or raw data which can be renormalized must be available
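The objectively checkable parts of these criteria can be expressed as a simple filter. The sketch below is hypothetical throughout: the `Experiment` fields, the accessions and the `eligible_for_atlas` helper are invented for illustration, and judgment-based criteria such as sufficient replication or MIAME/MINSEQE scores are left out.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    # Hypothetical summary of an Archive experiment; field names are assumptions.
    accession: str
    n_assays: int
    factors_annotated: bool
    samples_annotated: bool
    has_processed_data: bool
    has_renormalizable_raw_data: bool


def eligible_for_atlas(exp, min_assays=6):
    """Apply the machine-checkable subset of the selection criteria."""
    return (
        exp.n_assays >= min_assays
        and exp.factors_annotated
        and exp.samples_annotated
        and (exp.has_processed_data or exp.has_renormalizable_raw_data)
    )


# Invented examples: the second fails because it has fewer than 6 assays.
candidates = [
    Experiment("E-XXXX-1", 12, True, True, True, False),
    Experiment("E-XXXX-2", 4, True, True, True, True),
    Experiment("E-XXXX-3", 8, True, True, False, True),
]
print([exp.accession for exp in candidates if eligible_for_atlas(exp)])
```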
Expression Atlas: Atlas construction
• Data are taken as normalized by the submitter
• Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments
• The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions
• The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
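The resulting data structure, a gene-by-condition matrix of signed p-values, can be sketched as follows. This does not run limma or any statistical test: it assumes the t-statistics and p-values have already been computed, and the gene names, conditions and numbers are invented.

```python
def signed_pvalue_matrix(results):
    """Build a gene x condition table of signed p-values.

    results: list of (gene, condition, t_statistic, p_value) tuples.
    The sign of the t-statistic gives the direction of differential
    expression; the magnitude of each entry is the p-value.
    """
    genes = sorted({gene for gene, _, _, _ in results})
    conditions = sorted({condition for _, condition, _, _ in results})
    matrix = {gene: {condition: None for condition in conditions} for gene in genes}
    for gene, condition, t_stat, p_value in results:
        sign = -1.0 if t_stat < 0 else 1.0
        matrix[gene][condition] = sign * p_value
    return genes, conditions, matrix


# Invented test results for two genes and two conditions.
results = [
    ("geneA", "liver", 4.2, 0.001),
    ("geneA", "brain", -3.1, 0.01),
    ("geneB", "liver", 0.3, 0.8),
]
genes, conditions, matrix = signed_pvalue_matrix(results)
for gene in genes:
    print(gene, [matrix[gene][condition] for condition in conditions])
```

Gene and condition pairs with no test result stay as `None`, which keeps missing data distinct from a non-significant result.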
Gene Expression Atlas
Atlas home page: http://www.ebi.ac.uk/gxa/
• Query for genes
• Query for conditions
• Restrict query by direction of differential expression
• The ‘advanced query’ option allows building more complex queries
Atlas gene summary page
Atlas heatmap view
Atlas experiment page
View of RNA-seq data in Ensembl
Atlas gene-condition query
Data submission to AE
Submission of HTS gene expression data
• Submit via the MAGE-TAB submission route
• Submit:
  • MAGE-TAB spreadsheet containing details of the samples and protocols used
  • Trace data files for each sample (in SRF, FASTQ or SFF format)
  • Processed data files
• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive (EGA) and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
What happens after submission?
• Email confirmation
• Curation
• The curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.
To find out more
Email questions regarding ArrayExpressHTS to:
• Angela Goncalves, [email protected]
• Andrew Tikhonov, [email protected]
Read more at:
• Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment. http://www.ncbi.nlm.nih.gov/pubmed/21233166
• http://www.bioconductor.org/packages/2.9/bioc/html/ArrayExpressHTS.html
• R-cloud: http://www.ebi.ac.uk/Tools/rcloud/
eLearning courses: http://www.ebi.ac.uk/training/online/