Transcript of How to store and visualize RNA-seq data

Page 1

EBI is an Outstation of the European Molecular Biology Laboratory.

How to store and visualize RNA-seq data

Gabriella Rustici
Functional Genomics Group

[email protected]

Page 2

Talk summary

• How we archive RNA-seq data in ArrayExpress

• How we process RNA-seq data

• How we display RNA-seq data in the Expression Atlas

HTS data in ArrayExpress and Atlas, 26/08/2011

Page 3

Components of a functional genomics experiment

Page 4

ArrayExpress
www.ebi.ac.uk/arrayexpress/

Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays

Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ

Provides easy access to well-annotated data in a structured and standardized format

Facilitates the sharing of microarray designs, experimental protocols, …

Based on community standards: MIAME guidelines and MAGE-TAB format for microarrays, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)

Page 5

Standards for sequencing
MINSEQE guidelines

Minimal Information about a high-throughput Nucleotide SEQuencing Experiment

The proposed MINSEQE guidelines (still a work in progress) are:

1. General information about the experiment
2. Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
3. Experimental design, including sample data relationships (e.g. which raw data file relates to which sample, …)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6. Final processed data for the set of assays in the experiment

Page 6

Standards for microarray & sequencing
MAGE-TAB format

MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:

IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.

SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.

Data files: raw and processed data files. The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
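As a rough illustration of how an SDRF links samples to data files, the sketch below parses a small tab-delimited table in Python. The rows, file names and the choice of columns are invented for this example; real SDRF files carry more columns, and the headings used for HTS submissions may differ.

```python
import csv
import io

# Hypothetical SDRF fragment: one row per sample, tab-delimited.
# Sample names, factor values and file names are made up for illustration.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tArray Data File\n"
    "sample_1\tMus musculus\tnone\tsample_1.fastq\n"
    "sample_2\tMus musculus\tdrug_A\tsample_2.fastq\n"
)


def read_sdrf(text):
    """Return one dict per SDRF row, keyed by column heading."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[factor_column], []).append(row[file_column])
    return grouped


rows = read_sdrf(SDRF_TEXT)
grouped = files_by_factor(rows, "Factor Value[compound]", "Array Data File")
print(grouped)
```

Grouping files by factor value is the kind of lookup a downstream pipeline needs before comparing conditions; the IDF would be read separately for the experiment-level description.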

Page 7

Types of data that can be submitted

Page 8

ArrayExpress – two databases

Page 9

What is the difference between Archive and Atlas?

Archive
• Query by experiment, sample and experimental factor annotations
• Filter on species, array platform, molecule assayed and technology used

Atlas
• Gene and/or condition queries
• Query across experiments and across platforms

Page 10

ArrayExpress – two databases

Page 11

How much data in AE Archive?

Page 12

Browsing the AE Archive

Page 13

Browsing the AE Archive

• The total number of experiments and assays retrieved
• AE unique experiment ID
• Curated title of experiment
• Species investigated
• Number of assays
• The date when the data were loaded in the Archive
• The direct link to raw and processed data. An icon indicates that this type of data is available.
• Raw sequencing data available in ENA
• 'Loaded in Atlas' flag
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed

Page 14

Browsing the AE Archive

Page 15

RNA-seq data in AE Archive

Page 16

HTS data in AE Archive

Page 17

HTS data in AE Archive

Page 18

Link to raw data in ENA

Page 19

RNA-seq processing pipeline

[Diagram: RNA-seq data flow. Labels: Data acquisition; Direct data submissions and GEO import; ArrayExpress Archive; ENA; EGA; FASTQ files; Short reads (FASTQ files); SDRF; FASTQ; RNA-seq processing pipeline; BAMs; RPKMs; Summary level data; Expression Atlas; Ensembl]
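The diagram names RPKMs (reads per kilobase per million mapped reads) as one pipeline output. The Python sketch below shows the standard RPKM definition on made-up counts; it is not the pipeline's own code, and the estimation tools it wraps may compute expression differently.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical counts for two genes in one sample.
counts = {"geneA": 500, "geneB": 500}
lengths_bp = {"geneA": 1000, "geneB": 4000}
total_mapped = 10_000_000

values = {g: rpkm(counts[g], lengths_bp[g], total_mapped) for g in counts}
print(values)
```

Both genes receive the same number of reads, but the longer gene gets a lower RPKM because the measure normalizes for transcript length as well as sequencing depth.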

Page 20

RNA-seq processing pipeline: ArrayExpressHTS

• ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets

• The pipeline can be used for analyzing:
  • private data
  • public data, available through ArrayExpress and ENA

• It can be used:
  • on a local computer
  • remotely on the EBI R Cloud, www.ebi.ac.uk/tools/rcloud

Goncalves et al., Bioinformatics 2011

Page 21

ArrayExpressHTS in Bioconductor

Page 22

ArrayExpressHTS pipeline

transcriptome or genome

Bowtie, BWA or TopHat

cufflinks or MMSEQ

filtering options (e.g., average base quality, read complexity, …)
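To make the "average base quality" filter concrete, here is a minimal Python sketch. It assumes Phred+33 quality encoding and an arbitrary threshold of 20; neither value comes from the slides, and ArrayExpressHTS itself is an R package whose actual filtering parameters are not shown here.

```python
def mean_phred(quality_string, offset=33):
    """Average Phred score of one read, given its FASTQ quality line."""
    if not quality_string:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_reads(records, min_mean_quality):
    """Keep (name, sequence, quality) records whose mean quality passes."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_quality]


# Hypothetical reads: 'I' encodes Phred 40 and '#' encodes Phred 2 at offset 33.
reads = [
    ("read_1", "ACGT", "IIII"),
    ("read_2", "ACGT", "####"),
    ("read_3", "ACGT", "II##"),
]
kept = filter_reads(reads, min_mean_quality=20)
print([name for name, _, _ in kept])
```

Older Illumina output often used an offset of 64, so the `offset` argument would need to match the encoding of the files being filtered.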

Page 23

Using ArrayExpressHTS

library("ArrayExpressHTS")
# usercloud = FALSE: run locally rather than on the EBI R Cloud
aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)

Page 24

ArrayExpressHTS on the R cloud

[Diagram: ArrayExpressHTS on the R cloud. Labels: R-cloud; R-server; ArrayExpressHTS R package; Pipeline tools (tophat, bowtie, bwa, cufflinks, samtools); References, Index & Annotation; ENA; ArrayExpress; SDRF; IDF; raw data; experiment metadata; User Project Storage; ExpressionSet; Quality reports]

Page 25

RNA-seq processing pipeline

[Diagram: RNA-seq data flow. Labels: Data acquisition; Direct data submissions and GEO import; ArrayExpress Archive; ENA; EGA; FASTQ files; Short reads (FASTQ files); SDRF; FASTQ; RNA-seq processing pipeline; BAMs; RPKMs; Summary level data; Expression Atlas; Ensembl]

Page 26

ArrayExpress – two databases

Page 27

Expression Atlas
Experiment selection criteria

The criteria we use for selecting experiments for inclusion in the Atlas are as follows:

• For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
• High MIAME/MINSEQE scores
• Experiment must have 6 or more assays
• Sufficient replication and large sample size
• Experimental factors (EF) and experimental factor values (EFV) must be well annotated
• Adequate sample annotation must be provided
• Processed data must be provided, or raw data which can be renormalized must be available
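Some of these criteria can be checked mechanically. The Python sketch below encodes only that subset (assay count, factor annotation, data availability). The record fields and accessions are hypothetical, and criteria such as "sufficient replication" or "high MIAME/MINSEQE scores" are left out because the slides give no thresholds for them.

```python
from dataclasses import dataclass

MIN_ASSAYS = 6  # "6 or more assays", as stated in the criteria


@dataclass
class Experiment:
    accession: str
    n_assays: int
    factors_annotated: bool       # EF and EFV well annotated
    has_processed_data: bool
    has_renormalizable_raw: bool


def eligible_for_atlas(exp):
    """Apply the machine-checkable subset of the selection criteria."""
    has_usable_data = exp.has_processed_data or exp.has_renormalizable_raw
    return exp.n_assays >= MIN_ASSAYS and exp.factors_annotated and has_usable_data


# Hypothetical experiments; accessions are placeholders, not real records.
experiments = [
    Experiment("E-XXXX-1", 12, True, True, False),
    Experiment("E-XXXX-2", 4, True, True, True),
    Experiment("E-XXXX-3", 8, True, False, False),
]
selected = [e.accession for e in experiments if eligible_for_atlas(e)]
print(selected)
```

The second record fails on assay count and the third on data availability, so only the first passes this partial check.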

Page 28

Expression Atlas
Atlas construction

Data are taken as normalized by the submitter

Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments

The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions

The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
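One way to picture the gene-by-condition matrix is sketched below in Python. It assumes the t-statistics and p-values have already been computed (for example by limma in R) and stores direction as the numeric sign of each entry; that encoding is an assumption made for illustration, since the slides say only that each p-value comes "together with a sign". All values are invented.

```python
import math


def signed_p_value(t_statistic, p_value):
    """Attach the direction of differential expression to a p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    return math.copysign(p_value, t_statistic)


# Hypothetical (t-statistic, p-value) results per gene and condition.
results = {
    "geneA": {"liver": (5.2, 0.001), "brain": (-3.1, 0.02)},
    "geneB": {"liver": (-0.4, 0.70), "brain": (4.0, 0.005)},
}
conditions = ["liver", "brain"]

# Rows are genes, columns are biological conditions.
matrix = {
    gene: [signed_p_value(*per_condition[c]) for c in conditions]
    for gene, per_condition in results.items()
}
for gene, row in matrix.items():
    print(gene, row)
```

A positive entry then reads as over-expression and a negative one as under-expression, with the magnitude giving the significance.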

Page 29

Gene Expression Atlas
Atlas construction

Page 30

Gene Expression Atlas

Page 31

Atlas home page
http://www.ebi.ac.uk/gxa/

Query for genes

Query for conditions

Restrict query by direction of differential expression

The ‘advanced query’ option allows building more complex queries

Page 32

Atlas gene summary page

Page 33

Atlas heatmap view

Page 34

Atlas experiment page

Page 35

View of RNA-seq data in Ensembl

Page 36

Atlas gene-condition query

Page 37

Data submission to AE

Page 38

Submission of HTS gene expression data

• Submit via the MAGE-TAB submission route
• Submit:
  • A MAGE-TAB spreadsheet containing details of the samples and protocols used
  • Trace data files for each sample (in SRF, FASTQ or SFF format)
  • Processed data files
• For non-human species, we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data, you need to submit it to the European Genome-phenome Archive (EGA) and not to ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.

Page 39

What happens after submission?

• Email confirmation
• Curation
• The curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.

Page 40

To find out more

Email questions regarding ArrayExpressHTS to:
• Angela Goncalves, [email protected]
• Andrew Tikhonov, [email protected]

Read more at:

• Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment. http://www.ncbi.nlm.nih.gov/pubmed/21233166

• http://www.bioconductor.org/packages/2.9/bioc/html/ArrayExpressHTS.html

• R-cloud: http://www.ebi.ac.uk/Tools/rcloud/

eLearning courses: http://www.ebi.ac.uk/training/online/
