RNA-seq: analysis of raw data and preprocessing - part 2
![Page 1: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/1.jpg)
Raw data investigation
Joachim Jacob
20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
![Page 2: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/2.jpg)
Experimental setup
We have decided on:
● how many samples per condition
● how deep to sequence
These choices determine how reliable the statistics will be. Base them on experience and on tools like Scotty. A wrong experimental design cannot be fixed afterwards. Best approach: pilot data (3 samples per condition, 10M reads).
But there are other sequencing options to choose from!
![Page 3: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/3.jpg)
PE versus SE Illumina
● Single end (SE): from each cDNA fragment only one end is read.
● Paired end (PE): the cDNA fragment is read from both ends.
Purify and fragment
PE
SE
![Page 4: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/4.jpg)
PE versus SE Illumina
Single end (SE):
● Gene level differential expression
Paired end (PE):
● Novel splice junction detection
● De novo assembly of transcriptome
● Helps with correctly positioning reads on the reference genome sequence.
Note: PE is not the same as mate pairs.
![Page 5: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/5.jpg)
Strandedness
● Naive protocols obtain reads from cDNA fragments, but the link with the sense or antisense strand is lost.
● Stranded protocols generate reads from one strand, corresponding to the sense or antisense strand (depending on the protocol).
![Page 6: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/6.jpg)
Strandedness
Not stranded
Stranded
![Page 7: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/7.jpg)
Example of a stranded protocol
● dUTP protocol to generate stranded reads.
![Page 8: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/8.jpg)
Importance of strandedness
● Stranded protocols can bias the read counts compared to non-stranded protocols.
● Whether you should apply one depends on the genome: e.g. when genes overlap, the benefit of assigning reads to the correct gene can outweigh the extra technical variation.
![Page 9: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/9.jpg)
Length of the reads
● Does not matter so much when we want to quantify by aligning to a reference sequence: 50 bp will do.
● The most important point is to be able to accurately position the read on the reference genome sequence, to assign it to the correct gene.
● Length can become important, if you want to assemble the transcriptome.
![Page 10: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/10.jpg)
For DE on the gene level
The 'cheapest' protocol for high-throughput sequencing suffices to achieve DE detection:
● SE
● 50 bp
● Option: strandedness.
Use the money you have left over for increasing the number of replicates.
![Page 11: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/11.jpg)
Illumina Truseq protocol
![Page 12: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/12.jpg)
Raw Illumina data
The data you get arrives as...
barcode
experiment
Compressed, usually with gzip
![Page 13: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/13.jpg)
Raw Illumina data
@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG
CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA
+
@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA
@HWI-ST571:202:D1B86ACXX:2:1102:1073:2240 1:N:0:ACAGTG
CGGAGCTGAAGGAGAAACTGAAATCCCTGCAATGTGAATTGTACGTTCTT
+
CCCFFFFFGGHHHIJJJJJJJIJFHIJIIIJJJJGIIIIIEFGHIFCHJI
@HWI-ST571:202:D1B86ACXX:2:1102:1385:2192 1:N:0:ACAGTG
GTTGGCAGCCCTGGAGCCCTGCCTCGGTGGTTTAGCCAGTACTAGGGGAT
+
CCCFFFFFHHHHHJJJIJJJJJJGIJJCGHFHIGIHJJJBDHGHHJJJIE
@HWI-ST571:202:D1B86ACXX:2:1102:1352:2244 1:N:0:ACAGTG
ATTTCCTCTTATTTACGTTGCTTTAAAGCGAGACTTCAACGCCATTTGAC
+
@@CFFFFFHHFHDFGHIJIIJGIJGGEHGGJB>??FHHGFFFGHIGIECF
@HWI-ST571:202:D1B86ACXX:2:1102:1981:2152 1:N:0:ACAGTG
CATCGAAGCAAAGCATATAAAGTTANTNNTNNCTGAGTTGTACATATTGC
+
??;;D?DB6CDB+<EFE>:AFA443#2##1##11)0:0?9**0??DAGI4
@HWI-ST571:202:D1B86ACXX:2:1102:1877:2165 1:N:0:ACAGTG
GAAGTGCCCCGCTGGCAGCACACAAGGAGCAGCCCGCTGCCGGACCACTC
+
?@@DDDADFFAA:CEGHBFGAHGD?F@BE9BFF?D@F;'-8AG<B92=;;
One read (minimum 4 lines): a header line starting with @, the sequence, a separator line starting with +, and the quality line (the certainty of reading this base at this position).
(this file: 87196924 lines)
http://wiki.bits.vib.be/index.php/.fastq
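To make the four-line layout concrete, here is a minimal Python sketch that reads FASTQ records and turns the quality characters into numbers. It assumes exactly four lines per read (no wrapped sequence lines) and a Phred+33 quality encoding; the names `FastqRecord`, `read_fastq` and `phred_scores` are made up for this illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading '@'
    sequence: str  # line 2
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (assumption: exactly 4 lines per read)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to scores (assumption: Phred+33 encoding)."""
    return [ord(character) - offset for character in quality]


# First read of the example shown on the slide.
example = io.StringIO(
    "@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG\n"
    "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA\n"
    "+\n"
    "@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA\n"
)
record = next(read_fastq(example))
scores = phred_scores(record.quality)
print(record.sequence[25], scores[25])
```

In this read the N bases carry the quality character #, which is a score of 2 under the assumed encoding. If every read takes exactly four lines, a file of 87196924 lines holds 87196924 / 4 = 21799231 reads.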
![Page 14: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/14.jpg)
Exploring the raw data
1) Check whether the FASTQ file is consistent
2) Make graphs of some metrics of the raw data
http://wiki.bits.vib.be/index.php/.fastq
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Quality_control_and_visualization_of_raw_reads
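What a consistency check could look like is sketched below in Python. This is not the tool used in the course, only an illustration under the assumption of four lines per read; `check_fastq` and the file name in the comment are hypothetical.

```python
import io
from typing import List, TextIO


def check_fastq(handle: TextIO, max_problems: int = 10) -> List[str]:
    """Return descriptions of problems found; an empty list means none were detected."""
    problems: List[str] = []
    read_number = 0
    while len(problems) < max_problems:
        block = [handle.readline() for _ in range(4)]
        if block[0] == "":  # end of file, reached at a record boundary
            break
        read_number += 1
        if block[3] == "":  # file ended in the middle of a record
            problems.append(f"read {read_number}: record is truncated")
            break
        header, sequence, separator, quality = (line.rstrip("\n") for line in block)
        if not header.startswith("@"):
            problems.append(f"read {read_number}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"read {read_number}: separator does not start with '+'")
        if len(sequence) != len(quality):
            problems.append(f"read {read_number}: sequence and quality differ in length")
    return problems


# For a gzip-compressed file (hypothetical name) one could open it in text mode:
#   import gzip
#   with gzip.open("sample.fastq.gz", "rt") as handle:
#       print(check_fastq(handle))
good = io.StringIO("@r1\nACGT\n+\nIIII\n")
bad = io.StringIO("@r1\nACGT\n+\nIII\n")
print(check_fastq(good))
print(check_fastq(bad))
```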
![Page 15: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/15.jpg)
FastQC – graphical exploration
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
![Page 16: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/16.jpg)
FastQC – perfect example
Reads have good quality!
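The idea behind a per-position quality plot can be sketched in a few lines of Python: average the quality score at each position over all reads. This is a simplified illustration, not how FastQC itself is implemented; it assumes Phred+33 encoded quality strings, and `mean_quality_per_position` is a made-up name.

```python
from typing import List


def mean_quality_per_position(quality_strings: List[str], offset: int = 33) -> List[float]:
    """Mean quality score per read position (assumption: Phred+33 encoding)."""
    longest = max(len(quality) for quality in quality_strings)
    totals = [0] * longest
    counts = [0] * longest
    for quality in quality_strings:
        for position, character in enumerate(quality):
            totals[position] += ord(character) - offset
            counts[position] += 1
    # Reads may differ in length, so each position has its own count.
    return [total / count for total, count in zip(totals, counts)]


# Toy quality strings: 'I' is score 40, '5' is score 20, '#' is score 2.
qualities = ["IIII#", "IIII5", "III"]
means = mean_quality_per_position(qualities)
print(means)
```

A drop of the mean towards the end of the list corresponds to the declining quality at the 3' end that the plot makes visible.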
![Page 17: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/17.jpg)
FastQC – perfect example
Anna Karenina principle: “There is only one way to be good, but there are many ways to be wrong.”
We will start by showing a good sample. Afterwards we will discuss a less good sample.
http://en.wikipedia.org/wiki/Anna_Karenina_principle
![Page 18: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/18.jpg)
FastQC – perfect example
Smooth histogram/density line, shifted towards the right (high quality scores).
![Page 19: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/19.jpg)
FastQC – perfect example
Steady nucleotide distribution.
Bias typical for Illumina
![Page 20: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/20.jpg)
Not strongly fluctuating GC content
Bias typical for Illumina
FastQC – perfect example
![Page 21: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/21.jpg)
GC-content nicely bell shaped
FastQC – perfect example
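The GC-content distribution is built from one number per read. A minimal Python sketch of that calculation follows; it ignores N bases, which is a choice made for this example and not necessarily what FastQC does, and the names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the called bases (N and other symbols ignored)."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)


def gc_histogram(sequences: Iterable[str]) -> Dict[int, int]:
    """Number of reads per rounded GC percentage."""
    return dict(Counter(round(gc_percent(sequence)) for sequence in sequences))


reads = ["ATGC", "GGCC", "ATAT", "ATGCNN"]
histogram = gc_histogram(reads)
print(sorted(histogram.items()))
```

Plotting such a histogram for all reads gives the curve on the slide: one bell shape for a clean sample, extra peaks or a skew when other sequences are mixed in.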
![Page 22: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/22.jpg)
No N's! (should ring a bell)
FastQC – perfect example
![Page 23: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/23.jpg)
All reads have a length of 50 bp.
FastQC – perfect example
![Page 24: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/24.jpg)
Reads are nicely duplicated: some amount of duplication is to be expected in RNA-seq data.
FastQC – perfect example
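As a rough illustration of what a duplication level means, the Python sketch below counts how many reads are exact copies of a sequence that was already seen. FastQC estimates duplication in its own way, so the numbers will not match its report exactly; `duplication_percent` is a made-up name.

```python
from collections import Counter
from typing import Iterable


def duplication_percent(sequences: Iterable[str]) -> float:
    """Percentage of reads that repeat a sequence already seen in the input."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


reads = ["ACGT", "ACGT", "ACGT", "TTGA"]
print(duplication_percent(reads))
```

With 4 reads and 2 distinct sequences, 2 of the 4 reads are repeats. The same `Counter` also gives the most frequent sequences via `most_common()`, which is the idea behind a list of overrepresented sequences.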
![Page 25: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/25.jpg)
![Page 26: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/26.jpg)
K-mers are short sequence stretches. Sometimes they are overrepresented, but in RNA-seq this is less important, because duplication is expected.
FastQC – perfect example
![Page 27: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/27.jpg)
FastQC – less good RNA-seq sample
A relatively large portion of the reads has mistakes at the 3' end of the read.
![Page 28: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/28.jpg)
FastQC – less good RNA-seq sample
There is an overrepresentation of reads with a low mean quality score.
![Page 29: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/29.jpg)
FastQC – less good RNA-seq sample
No steady level of the different nucleotide fractions.
![Page 30: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/30.jpg)
FastQC – less good RNA-seq sample
Fluctuates
![Page 31: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/31.jpg)
FastQC – less good RNA-seq sample
Heavily skewed towards AT-rich reads
![Page 32: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/32.jpg)
FastQC – less good RNA-seq sample
Apparently a mixture of two sets of reads with different lengths
![Page 33: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/33.jpg)
FastQC – less good RNA-seq sample
Duplication seems a bit on the low side (reported figures range from 60 to 75%).
![Page 34: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/34.jpg)
FastQC – less good RNA-seq sample
Very highly skewed read number.
Often the sequence of a TruSeq adaptor, or of multiplex identifiers, can be found here. BLAST can reveal more information!
![Page 35: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/35.jpg)
FastQC – less good RNA-seq sample
Specific patterns of specific k-mers.
Note: A and T rich
![Page 36: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/36.jpg)
Quality control of raw data
Proceed? Or rerun?
This QC shows which preprocessing steps you definitely need to apply. The extra time and money needed to correct the biases can sometimes justify a rerun of the experiment.
This QC shows which preprocessing steps have already been made by the sequencing provider.
![Page 37: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/37.jpg)
Preprocessing
Removing unwanted parts of the raw data, so that the data helps as much as possible in reaching our goal: identifying differentially expressed genes.
1) Removing technical contamination
● Low quality read parts
● Technical sequences: adaptors
● PhiX internal control sequences
2) Removing biological contamination
● polyA-tails
● rRNA sequences
● mtDNA sequences
After this, we run FastQC again.
![Page 38: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/38.jpg)
Technical contamination
Our goal is to detect differential expression (DE). For this we need to assign reads with high confidence to the correct genomic location.
Removal of low-quality read parts: they are more likely to contain errors, which cause noise in our read counts.
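The removal of low-quality read parts can be illustrated with a very simple rule: cut bases from the 3' end for as long as their quality is below a threshold. Real trimming tools use more refined criteria; the sketch below is only an example, assuming Phred+33 encoding and an arbitrary threshold of 20, with a hypothetical function name.

```python
from typing import Tuple


def trim_low_quality_3prime(
    sequence: str, quality: str, min_quality: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove trailing bases whose quality score is below min_quality."""
    keep = len(sequence)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_quality:
        keep -= 1
    # Sequence and quality must always be cut to the same length.
    return sequence[:keep], quality[:keep]


# 'I' is score 40, '#' is score 2 under the assumed encoding.
trimmed_sequence, trimmed_quality = trim_low_quality_3prime("ACGTACGT", "IIIII###")
print(trimmed_sequence, trimmed_quality)
```

Reads that become very short after trimming are hard to position uniquely on the genome, so trimming is usually combined with a minimum length filter.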
![Page 39: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/39.jpg)
Technical contamination
![Page 40: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/40.jpg)
Technical contamination
![Page 41: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/41.jpg)
Technical contamination
Our goal is to detect differential expression (DE). For this we need to assign reads with high confidence to the correct genomic location.
Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome.
![Page 42: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/42.jpg)
Technical contamination
List of technical sequences
Advised to use defaults
http://code.google.com/p/ea-utils/wiki/FastqMcf
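To show the principle of adaptor removal, here is a small Python sketch that looks for an adaptor at the 3' end of a read, either complete or as a partial prefix at the very end. It uses exact matching only and is not the algorithm of fastq-mcf; the adaptor string, the minimum overlap of 5 and the function name are assumptions made for this illustration.

```python
def find_adapter_start(read: str, adapter: str, min_overlap: int = 5) -> int:
    """Return the position where the adaptor starts, or len(read) if none is found."""
    # Case 1: the complete adaptor occurs somewhere inside the read.
    position = read.find(adapter)
    if position != -1:
        return position
    # Case 2: only the first part of the adaptor fits at the end of the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return len(read) - overlap
    return len(read)


# Illustrative adaptor string; take the real sequences from your library kit.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"

sequence = "ACGTACGTAC" + "AGATCGG"  # read ending in a partial adaptor
quality = "I" * len(sequence)
cut = find_adapter_start(sequence, EXAMPLE_ADAPTER)
sequence, quality = sequence[:cut], quality[:cut]
print(sequence, quality)
```

Real tools additionally allow for mismatches, because sequencing errors also occur inside the adaptor.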
![Page 43: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/43.jpg)
Fastq-mcf output
http://code.google.com/p/ea-utils/wiki/FastqMcf
![Page 44: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/44.jpg)
Technical contamination
● Never remove duplicate reads! Highly expressed genes can have genuine duplicate reads, which are not due to the PCR amplification step in the protocol.
● PhiX sequences: the DNA of Phi X bacteriophage is spiked in to monitor and optimize sequencing on Illumina machines. Your sequencing provider should filter out those sequences before delivery. You can filter them out by aligning your reads to the PhiX genome.
http://en.wikipedia.org/wiki/Phi_X_174
![Page 45: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/45.jpg)
Biological contamination
Mitochondria contain rRNA, mRNA and mtDNA
cell
rRNA and non-coding (95% of RNA)
mRNA (5% of RNA)
nucleus
![Page 46: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/46.jpg)
Biological contamination
mRNAs are captured with oligo-dT coated beads.
Occasionally, non-protein coding sequences are also captured (especially since mtRNA and rRNA can be relatively rich in AT).
We can remove them via homology searching (BLAST) with known non-protein coding sequences.
Mitochondrial
mRNA (5% of RNA)
rRNA and nc
![Page 47: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/47.jpg)
Biological contamination
mRNAs are post-transcriptionally modified: e.g. the addition of a poly-A tail. If our goal is to map the reads to a reference genome sequence, the polyA tails should be removed. This can be viewed as some source of 'biological contamination' in our sequences (…).
AAAAAAAAAAAAA
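Removing a polyA tail from a read can be sketched as follows in Python. The minimum tail length of 10 is an arbitrary assumption for this example, and the function name is hypothetical; short runs of A at the end of a read are left alone because they may be genuine transcript sequence.

```python
from typing import Tuple


def trim_poly_a_tail(sequence: str, quality: str, min_tail: int = 10) -> Tuple[str, str]:
    """Remove a trailing run of A's if it is at least min_tail bases long."""
    tail_length = len(sequence) - len(sequence.rstrip("A"))
    if tail_length < min_tail:
        return sequence, quality
    keep = len(sequence) - tail_length
    # Cut the quality string to the same length as the sequence.
    return sequence[:keep], quality[:keep]


read = "ACGTACGT" + "A" * 13
trimmed_sequence, trimmed_quality = trim_poly_a_tail(read, "I" * len(read))
print(trimmed_sequence, trimmed_quality)
```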
![Page 48: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/48.jpg)
● Get the non-protein coding sequences via Biomart.
Mitochondrial genome sequence also.
Biological contamination
![Page 49: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/49.jpg)
Biological contamination
![Page 50: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/50.jpg)
Biological contamination
![Page 51: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/51.jpg)
Filter the biological contamination
Your reads
The biological contaminant sequences, imported via Biomart
We are interested in the reads that don't map!
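Keeping only the reads that do not map can be illustrated on alignment output in SAM format, where bit 0x4 of the FLAG field marks an unmapped read. The Python sketch below is an illustration on toy data, not the Galaxy tool itself; the function name, read names and the reference name `chrM` are made up.

```python
from typing import Iterable, Iterator


def unmapped_reads_as_fastq(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield a FASTQ record for every SAM line whose FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        sequence, quality = fields[9], fields[10]
        if flag & 0x4:
            yield f"@{name}\n{sequence}\n+\n{quality}\n"


# Toy alignment against the contaminant sequences: read1 maps, read2 does not.
sam = [
    "@HD\tVN:1.0\n",
    "read1\t0\tchrM\t100\t30\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF\n",
]
kept = list(unmapped_reads_as_fastq(sam))
print("".join(kept), end="")
```

Only read2 survives: it did not match the contaminant sequences, so it is a read we want to keep for the actual analysis.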
![Page 52: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/52.jpg)
Filter the biological contamination
![Page 53: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/53.jpg)
Doing this in Galaxy
Useful: take a sample of your reads: fastq-to-tabular, select random lines, tabular-to-fastq
1. Create a new history.
2. Load the sample data in.
3. Run fastqMcf to remove technical sequences.
4. Run bowtie to match against biological sequence databases, and keep reads that don't match.
5. Summarize: fastqc
→ make a workflow of this sample history.
→ run the workflow on all your samples in parallel.
→ store the cleaned reads in a data library.
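Taking a random sample of reads, as suggested at the top of this slide, can also be done outside Galaxy. The Python sketch below uses reservoir sampling, which needs only one pass over the file and keeps only the sample in memory. It is an alternative illustration, not the Galaxy procedure; the function name and the fixed seed are choices made for this example.

```python
import random
from typing import Iterable, List, TypeVar

Record = TypeVar("Record")


def sample_records(records: Iterable[Record], sample_size: int, seed: int = 0) -> List[Record]:
    """Return a uniform random sample of sample_size records (reservoir sampling)."""
    generator = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir: List[Record] = []
    for index, record in enumerate(records):
        if index < sample_size:
            reservoir.append(record)
        else:
            # Replace an element with probability sample_size / (index + 1).
            slot = generator.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = record
    return reservoir


# Stand-in for an iterator over FASTQ records: here simply read numbers.
sample = sample_records(range(1000), 100)
print(len(sample))
```

Each element of `records` should be one complete read (all four lines together), so that sequence and quality stay paired in the sample.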
![Page 54: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/54.jpg)
Summary preprocessing
Your reads
→ Format consistent? Errors in quality?
Your groomed reads
→ Trends in raw data? QC report
→ Get technical contaminants
Your groomed reads without technical contamination
→ Get biological contaminants
Your groomed reads without technical and biological contamination
→ How does your data look now? QC
![Page 55: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/55.jpg)
Keywords
Paired end
Stranded reads
gzip
fastq
Biological contamination
Technical contamination
Adapter sequence
Write in your own words what the terms mean
![Page 56: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/56.jpg)
Exercise
→ investigating and preprocessing raw RNA-seq data
![Page 57: RNA-seq: analysis of raw data and preprocessing - part 2](https://reader038.fdocuments.net/reader038/viewer/2022103114/554e91fbb4c90573338b4e63/html5/thumbnails/57.jpg)
Break