Introduction to Bioinformatics - RNA-seq - Marcel Willemsen
Biology intro - Central dogma of molecular biology
• RNA
  o mRNA
  o tRNA
  o miRNA => Regulation
  o ...
• Proteins (*)
  o Structure
  o Hormones
  o Enzymes
  o ...
Number of mRNA molecules = gene expression measure
(*) protein biosynthesis
General RNA-seq pipeline
• Sample prep
• Sequencing
• QC
• Alignment / re-alignment
• Alignment analysis
• QC filtering
• Quantifying transcript abundance
• Statistical analysis
Applications:
• Whole transcriptome
• Gene expression profiling
• microRNA
Sample preparation (General RNA-seq pipeline)
• Strand-specific
Agarose gel electrophoresis purification (General RNA-seq pipeline - Sample prep)
• Digestion by restriction endonucleases (EcoRI)
• Cutting out desired length
NGS platforms (General RNA-seq pipeline - Sequencing)
• Short reads
  o Illumina-Solexa
  o ABI-SOLiD (Life Technologies)
• Long reads
  o Roche-454
• Third generation (direct/single molecule)
  o Helicos
  o Pacific Biosciences
  o Oxford Nanopore
No amplification, ligation or cDNA synthesis!
Intermezzo: colorspace (General RNA-seq pipeline - QC)
A 1 3 1 3 1 3 2 3
A C G T
C A T G C A T C G
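Intermezzo: colorspace - SNP
Reference (bases):          A C G T A C G A T
Reference (colorspace):     A 1 3 1 3 1 3 2 3
Read with SNP (bases):      A C G T T C G A T
Read with SNP (colorspace): A 1 3 1 0 2 3 2 3
• A SNP changes two adjacent colors (3 1 => 0 2)
Read as sequenced:          G 3 3 1 0 2 3 2 3
After trimming:                 3 1 0 2 3 2 3
• First 2 'bases' are often removed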
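Intermezzo: colorspace - Sequencing error
Reference (bases):           A C G T A C G A T
Reference (colorspace):      A 1 3 1 3 1 3 2 3
Read with error (colorspace): A 1 3 1 0 1 3 2 3
Read with error (decoded):    A C G T T G C T A
• Sequencing error! A single wrong color changes every base downstream of it
• Translate reference to colorspace during alignments!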
Quality Control (General RNA-seq pipeline - QC)
• QC tools -> FastQC, ...
  o Input: FASTQ file
  o Output: Quality report
• Report panels shown:
  o Per base sequence quality
  o Per sequence quality scores
  o Per base sequence content (FASTQ input, double encoded! 3102)
  o Per base GC content
  o Per sequence GC content
Alignment (General RNA-seq pipeline)
• Tool: IGV, Integrative Genomics Viewer (Broad Institute)
• Coverage
• Direction
• Annotation: RefSeq genes
Coverage (General RNA-seq pipeline)
• The more samples, the less coverage!
Alignment (General RNA-seq pipeline - SAGE sample)
• Position
• Introns
[Figure: aligned reads with start and stop positions]
• Mismatches
• Minus reads are reversed!
• Seeding (-l 25 -k 1 -n 100)
SAM/BAM Format (General RNA-seq pipeline)
• Header
  o reference name
  o reference length
• Alignment fields
  o query name = sample name:bead coordinates
  o strand (flag): 0 = plus, 16 = minus, 4 = no match
  o leftmost position: 5' end for plus strands, 3' end for minus strands
  o mapping quality (Phred scaled)
  o CIGAR string
  o mate info (mate pair seq)
  o query sequence on same strand as reference
  o query quality (encoded)
  o optional fields TAG:TYPE:VALUE
• For minus-strand alignments the original RNA read is the reverse complement of the stored query sequence!
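As a minimal sketch of the strand values listed above, the flag can be tested bit by bit, and the original read recovered by reverse-complementing minus-strand sequences. Only the three flag values named on the slide are handled; the helper names are hypothetical.

```python
# Complement table for reverse-complementing a stored query sequence.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def describe_strand(flag):
    """Interpret the flag values from the slide: 4 = no match, 16 = minus, 0 = plus."""
    if flag & 4:
        return "no match"
    if flag & 16:
        return "minus"
    return "plus"


def original_read(query_sequence, flag):
    """Return the read as sequenced.

    SAM stores the query on the same strand as the reference, so a
    minus-strand read has to be reverse-complemented to get it back.
    """
    if flag & 16:
        return query_sequence.translate(COMPLEMENT)[::-1]
    return query_sequence


print(describe_strand(16), original_read("ACGTT", 16))
```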
Re-alignment - RNA splicing
• Alternative splicing => isoforms
• Splice sites: donor, acceptor
• Y = T or C (pyrimidine)
Re-alignment (General RNA-seq pipeline)
• Database available from ALEXA-Seq
Alignment analysis (General RNA-seq pipeline)
• 24 nt (-l 24 -k 1 -n 100 -o 0 -t 4 -c)
• 20 nt
QC filtering (General RNA-seq pipeline)
• BWA default settings
• Unique mappings
• Mapping quality > 0
Alignment analysis (General RNA-seq pipeline)
[Figure: read length vs. mapping % (1 = 100%) for BWA default settings, unique mappings, and mapping quality > 0]
• Error = 1 - (reads mapping in exons / total reads mapping)
Whole transcriptome (General RNA-seq pipeline)
• Total RNA or poly(A) RNA
  o RNAs: mRNA, tRNA, rRNA, pri-miRNA, snRNA ...
  o Which RNA is polyadenylated?
• RNA is fragmented
  o Enzymatic digestion / physical shearing
  o Several reads per transcript possible!
• Sliding window analysis
  o Discover new RNAs
  o Intergenic regions
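Sliding window analysis - Scaling (Whole transcriptome)
$$\frac{\mathrm{mean}_{pos}\left(\mathrm{mean}_{i}(\mathrm{coverage}_{ko})\right)}{\mathrm{mean}_{pos}\left(\mathrm{mean}_{i}(\mathrm{coverage}_{wt})\right)} \times \mathrm{mean}_{i}(\mathrm{coverage}_{wt})$$
[Figure: mean coverage by position; panels: Wildtype vs. Knock out, and Wildtype vs. Knock out scaled]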
Sliding window analysis (Whole transcriptome)
• Sliding window = hypothetical transcriptional unit
• Parameters:
  o Window size: 200, 100 and 20 nt
  o Threshold: log2 mean_window(FoldChange) > 2, 1.6, 1.3, 1
  o Background: (wt < 10 AND ko < 10)
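The parameters above can be sketched as a simple scan over two per-position coverage vectors. Several details are assumptions made for illustration, not taken from the slides: the fold change is computed as wt/ko with a pseudocount of 1, the background filter is applied to the window means, and windows advance one position at a time.

```python
import math


def sliding_window_hits(wt, ko, window_size=20, log2_threshold=1.0, background=10):
    """Return (start, log2 mean fold change) for windows above the threshold.

    wt and ko are per-position coverage lists of equal length.
    """
    hits = []
    for start in range(len(wt) - window_size + 1):
        wt_window = wt[start:start + window_size]
        ko_window = ko[start:start + window_size]
        mean_wt = sum(wt_window) / window_size
        mean_ko = sum(ko_window) / window_size
        # Background: skip windows where both samples are barely covered.
        if mean_wt < background and mean_ko < background:
            continue
        # Pseudocount of 1 avoids division by zero (an assumption of this sketch).
        fold_changes = [(w + 1) / (k + 1) for w, k in zip(wt_window, ko_window)]
        log2_mean_fc = math.log2(sum(fold_changes) / window_size)
        if log2_mean_fc > log2_threshold:
            hits.append((start, log2_mean_fc))
    return hits


# Toy data: a 30 nt region expressed in wildtype and nearly lost in the knockout.
wt_coverage = [40] * 30 + [0] * 30
ko_coverage = [4] * 30 + [0] * 30
hits = sliding_window_hits(wt_coverage, ko_coverage)
print(len(hits), hits[0])
```

Overlapping hits would normally be merged into one candidate transcriptional unit; that step is left out here.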
Sliding window analysisWhole transcriptome
+ strand - strand
Position
index
Sliding window analysisWhole transcriptome
Intermezzo: Post-transcriptional modifications
• 5' Capping of pre-mRNA
• 3' Polyadenylation of pre-mRNA
Intermezzo: 5' Capping
• Stability
• Export out of nucleus
• Promote translation
• Splicing
• Mitochondrial and chloroplast mRNA are not capped => not in CAGE
Intermezzo: 3' Polyadenylation
• Nuclear export, translation, stability
Gene expression profiling (General RNA-seq pipeline)
• Qualitative
  o Which part of the genome is expressed, in which cells, which mRNA isoforms
• Quantitative
  o Compare across conditions, understand biological processes/mechanisms
  o Tumor vs. normal tissue
  o Knock-out vs. wild-type mouse
  o Changing nutrient conditions in yeast
  o Etc.
• DeepSAGE = Digital Gene Expression
• SAGE = Serial Analysis of Gene Expression
• Tag-based: one read per transcript
  o DeepSAGE -> most 3' CATG
  o DeepCAGE -> 5' end
• DE analysis
• GSEA
SAGE (Gene expression profiling)
• Magnetic beads capture poly(A) RNA
• cDNA synthesis with reverse transcriptase (E. coli)
• NlaIII digestion
  o Cuts every ~250 bp
  o ~99% of human transcripts
• Adapter A ligation
  o Complementary overhang
  o EcoP15I recognition site
  o PCR primer site (P2)
• EcoP15I digestion
  o Asymmetric
  o 27 bp downstream from adapter A
• Adapter B ligation
  o PCR primer site (P1)
  o Sequencing initiation site
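To illustrate the "most 3' CATG" idea behind tag-based profiling, the sketch below finds the last NlaIII site (CATG) in a transcript and reports the sequence starting there. The function name and the `tag_length` parameter are hypothetical, and the real tag boundaries depend on the kit chemistry.

```python
def most_3prime_tag(transcript, site="CATG", tag_length=27):
    """Return the tag starting at the most 3' occurrence of `site`.

    Returns None when the transcript has no site, in which case it
    cannot be detected by this tag-based approach.
    """
    position = transcript.rfind(site)  # last occurrence = most 3' site
    if position == -1:
        return None
    return transcript[position:position + tag_length]


transcript = "AAA" + "CATG" + "TTTT" + "CATG" + "GGGCCCAAA"
print(most_3prime_tag(transcript, tag_length=10))  # CATGGGGCCC
print(most_3prime_tag("AAATTTGGGCCC"))             # None
```

Because each transcript yields at most one tag, the counts do not grow with transcript length.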
SAGE
• Mapping against
  o Genome
  o Exons
  o Tag library (SAGE Genie)
• No re-alignment
• No gene length bias
RNA interference
• siRNA
  o Silencing through methylation
• Exogenous
  o Viral dsRNA
• Endogenous
  o miRNA
microRNA
• Short
  o Mature ~ 18 - 26 bases
• Difficult to map uniquely
  o 2 miRNAs may differ by 1 base
  o Adapter removal
• Reads are cut out of gel at desired length
• Mapping against miRBase
• New miRNA (target) discovery / prediction
  o miRDeep
  o miRFinder
  o etc...
Adapter (P2) removal - Before alignment
[Figure: the adapter sequence 2330123231 (colorspace) is moved along the read and the read is cut where it matches]
Adapter (P2) removal - During alignment
[Figures: adapter (P2) removal during alignment; axis labels truncated in the source]
Quantifying transcript abundance
• How many reads?
• Depends on
  o Sequencing method
  o Sequencing depth
  o Cell type
[Figure: genes detected vs. number of aligned reads]
Quantifying transcript abundance
Read counts per strand (+ / -) along a gene, 5' to 3':
            exon 1   exon 2   exon 3   gene total
  + reads   +6       +4       +4       +14
  - reads   -2       -1       -0       -3
• Exons
• Alternative transcripts
Counts (Quantifying transcript abundance)
[Figure: count table with features as rows and libraries as columns (2 wildtype replicates, 2 knockout replicates), plus the total tag count per library]
Normalization (Quantifying transcript abundance)
• Technical bias
  o SOLiD barcodes: random spread
• Sample size bias
• Gene length bias
  o Proportion of significant DE genes increases with transcript length
  o Has particular implications for the ranking of differentially expressed genes => introduces bias in gene set testing
• Normalization:
  o RPKM = Reads per kb per million mapped reads
$$\mathrm{RPKM} = \frac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (kb)}}$$
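A minimal sketch of the RPKM formula as a function; the argument names are chosen here for illustration, and the numbers in the usage lines are made up.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads per kb of exon per million mapped reads."""
    mapped_millions = total_mapped_reads / 1e6
    exon_length_kb = exon_length_bp / 1e3
    return exon_reads / (mapped_millions * exon_length_kb)


# Same read count, same library size, but a gene twice as long:
print(rpkm(500, 2000, 10_000_000))  # 25.0
print(rpkm(500, 4000, 10_000_000))  # 12.5
```

Dividing by both library size and exon length addresses the sample size bias and the gene length bias listed above in one step.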
RNA-seq vs. microarrays (Statistical analysis)
• RNA-seq
  o Counts
  o Absolute abundance of transcript
  o All transcripts present
• Microarray
  o Hybridization signal to complementary probe
  o Relative abundance
  o Content limited
  o Cross-hybridization
Count data - Poisson distribution
• Discrete probability distribution
• Not continuous
• Probability of a number of events occurring in a fixed period of time
• Events occur with a known average rate and independently of the time since the last event
$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
  λ = expected number of occurrences, k = number of occurrences
• DESeq/edgeR: negative binomial distribution (related to Poisson)
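The Poisson probability P(k) = λ^k e^(-λ) / k! can be evaluated directly. The sketch below also checks numerically that the mean and the variance both equal λ; the negative binomial adds an extra dispersion term to relax exactly this property. The value λ = 3 is an arbitrary example.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.0
support = range(60)  # far enough into the tail for lam = 3
mean = sum(k * poisson_pmf(k, lam) for k in support)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in support)
print(round(mean, 6), round(variance, 6))
```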
Multiple Hypothesis testing
• One test:
  o H0: μ1 = μ2
  o H1: μ1 ≠ μ2
• Multiple tests:
                     H0 true               H0 false
  H0 rejected        V (false positives)   S                     <= "Discoveries"
  H0 not rejected    U                     T (false negatives)
• Sensitivity vs. Specificity
  o false negatives => not sensitive enough
  o false positives => not specific enough
Multiple Hypothesis testing
• K tests at level α = 0.05
• Expect 0.05 * K false discoveries when H0 is true for every test
• 1 out of 20
• K = 40,000 => 2,000
Multiple Hypothesis testing
• Bonferroni (Holm) controls the FWER (familywise error rate)
  o α' = α/K
• Benjamini-Hochberg controls the FDR
  o FDR = false discovery rate
  o V = false discoveries
  o V + S = total discoveries
  o FDR = E[V / (V + S)], the expected proportion of false positives among the discoveries
• FDR is less conservative than FWER
                     H0 true   H0 false
  H0 rejected        V         S
  H0 not rejected    U         T
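The two corrections can be sketched as adjusted p-values. This is a plain textbook implementation for illustration, not the code used by DESeq or edgeR, and the four p-values are made up.

```python
def bonferroni(pvalues):
    """Adjusted p-values controlling the FWER: p * K, capped at 1."""
    k = len(pvalues)
    return [min(1.0, p * k) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Adjusted p-values controlling the FDR (step-up procedure)."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * k
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(k, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * k / rank)
        adjusted[index] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

With these values Bonferroni leaves two tests below 0.05 while Benjamini-Hochberg leaves all four, which is what "less conservative" means in practice.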
Differential Expression
• Condition 1: Wildtype vs. Condition 2: Knockout
• Statistical testing (DESeq, etc.)
• Multiple testing! Correcting gives adjusted p-value
design=c("wt","wt","d","d")
Differential Expression
• Results: log2FC / p-value / padj
  o Fold change => log2FC
             wt    wt    ko    ko
  counts:    129   113   120   108
  mean:      121         114
  fold change: 121/114 = 1.06, log2(121/114) ≈ 0.086
  Other way around: 114/121 = 0.94, log2(114/121) ≈ -0.086
  2^0.086 ≈ 1.06
Differential Expression
• Results:
  o M vs. A plot (minus vs. average)
  o M = log2(wt) - log2(ko) = log2(wt/ko) = fold change (log2FC)
  o A = 0.5 (log2(wt) + log2(ko)) = average expression (baseMean)
[Figure: MA plot, M against A; highlighted points = padj < 0.05]
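As a worked example, M and A can be computed for one gene from the replicate counts shown earlier (wt: 129, 113; ko: 120, 108). This sketch uses plain replicate means and skips the library-size normalization a tool like DESeq applies first; the function name is illustrative.

```python
import math


def ma_values(wt_counts, ko_counts):
    """Return (M, A) for one gene from replicate counts of two conditions."""
    mean_wt = sum(wt_counts) / len(wt_counts)
    mean_ko = sum(ko_counts) / len(ko_counts)
    m = math.log2(mean_wt) - math.log2(mean_ko)          # log2 fold change
    a = 0.5 * (math.log2(mean_wt) + math.log2(mean_ko))  # average log2 expression
    return m, a


m, a = ma_values([129, 113], [120, 108])
print(round(m, 3), round(a, 3))

# Swapping the conditions flips the sign of M and leaves A unchanged.
m_swapped, a_swapped = ma_values([120, 108], [129, 113])
print(round(m_swapped, 3), round(a_swapped, 3))
```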
Differential Expression
• Results:
  o Hierarchical clustering of samples and genes - Heatmap
  o 9.5 ED, 7.5 ED, 8.5 ED => replicates
  o Top 100 varying genes (features)
  o F-test / ANOVA
  o Linear regression
  o Variance stabilizing transformation (VST)
Differential Expression
• Variance-stabilizing transformation
  o Find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value
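A textbook illustration of the idea: for Poisson counts the variance grows with the mean, while after ƒ(x) = sqrt(x) it stays close to 1/4 whatever the mean. This is not the transformation DESeq fits, which is derived from the estimated mean-dispersion relation. The sampler, the seed and the sample size below are choices made for this sketch.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(42)  # fixed seed so the run is repeatable
results = {}
for lam in (10, 100):
    counts = [poisson_sample(lam, rng) for _ in range(5000)]
    raw_variance = sample_variance(counts)
    vst_variance = sample_variance([math.sqrt(c) for c in counts])
    results[lam] = (raw_variance, vst_variance)
    print(lam, round(raw_variance, 1), round(vst_variance, 3))
```

The raw variances differ by roughly a factor of ten between the two means, while the transformed variances are nearly equal.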
[Figure: variance stabilizing transformed counts vs. raw counts]
Venn diagram (Differential Expression)
• Intersection of differentially expressed genes
Gene Set Enrichment Analysis (GSEA)
• edgeR, baySeq, DESeq ...
  o Return a list of differentially expressed genes
• Biological theory is not about isolated genes
  o Typical biological research questions and hypotheses are
  o About pathways
  o About biological processes
  o About areas of the genome
  o About sets of related genes
• Question
  o How to analyze RNA-seq data from a gene set perspective?
GSEA
• Which gene sets?
• Any defined set that has something in common
  o Pathways: KEGG, Reactome, BioCarta
  o Gene Ontology terms: Biological process, Molecular function, Cellular component
  o Chromosomal regions: Chromosome arms, cytobands, linkage peaks, genes
  o Published gene sets: Predictive signatures, gene lists
GSEA
• H0: null hypothesis
  o Genes in the gene set are as often differentially expressed as genes outside
• H1: alternative hypothesis
  o Genes in the gene set are more often differentially expressed than genes outside
GSEA - Scoring gene sets
Ordered gene list L:
  Gene       Adjusted p-value   Member of set G
  gene 1     3.44E-006          yes
  gene 2     1.77E-005          no
  gene 3     9.92E-005          yes
  ...
  gene 100   0.49               yes
  gene 101   0.51               no
  ...
  gene n     1                  no
Gene set G: {1, 3, ..., 100, ...}
• Define a cutoff (adjusted p-value = 0.5) and count the members of G above and below the cutoff
GSEA - Fisher's exact test
• Define cutoff (C = 100) and count members of G above and below cutoff
• Genes above C = differentially expressed, genes below C = not differentially expressed
• Fisher's exact test:
                          # genes above C   # genes below C   Total
  # genes in set G        5 (a)             0 (b)             5 (a + b)
  # genes not in set G    1 (c)             4 (d)             5 (c + d)
  Total                   6 (a + c)         4 (b + d)         10 (n = a + b + c + d)
• Probability of the observed table:
$$P_{cutoff} = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{\binom{5}{5}\binom{5}{1}}{\binom{10}{6}} = \frac{5}{210} = \frac{1}{42}$$
• P-value = sum of all probabilities ≤ P_cutoff = 2/42 = 0.0476
• Reject H0 => enriched set!
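The p-value for the 2x2 table (5, 0 / 1, 4) can be checked with a small two-sided Fisher's exact test. The sketch enumerates every table with the same margins and sums the hypergeometric probabilities that do not exceed that of the observed table; it compares integer counts so no floating-point tolerance is needed. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def ways(x):
        # Number of ways to get x set members among the col1 genes above the cutoff.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = ways(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    extreme = sum(ways(x) for x in range(lowest, highest + 1) if ways(x) <= observed)
    return extreme / comb(n, col1)


p_value = fisher_exact_two_sided(5, 0, 1, 4)
print(round(p_value, 4))  # 0.0476
```

For routine use a tested library routine is preferable; this version only makes the arithmetic visible.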
GSEA
• Alternative:
  o Chi-squared test
• Tool: goseq (R package)
  o Correction for length bias
  o Not in SAGE!
GSEA
• Example KEGG: time series 7.5 ED vs. 9.5 ED
[Figure: all genes in the pathway at 7.5 ED and 9.5 ED]
The End