Introduction to Bioinformatics - RNA seq - Marcel Willemsen [email protected].


Biology intro: Central dogma of molecular biology

• RNA
  o mRNA
  o tRNA
  o miRNA => Regulation
  o ...

• (*) Proteins
  o Structure
  o Hormones
  o Enzymes
  o ...

Number of mRNA molecules = gene expression measure

(*) Protein biosynthesis

General RNA-seq pipeline

• Sample prep
• Sequencing
• QC
• Alignment, re-alignment
• Alignment analysis
• QC filtering
• Quantifying transcript abundance
• Statistical analysis

Applications:
• Whole transcriptome
• Gene expression profiling
• microRNA

General RNA-seq pipeline: Sample preparation

Strand-specific

General RNA-seq pipeline - Sample prep: Agarose gel electrophoresis purification

• Digestion by restriction endonucleases (EcoRI)

• Cutting out desired length



General RNA-seq pipeline - Sequencing: NGS platforms

• Short reads
  o Illumina-Solexa
  o ABI-SOLiD (Life Technologies)

• Long reads
  o Roche-454

• Third generation (direct/single molecule)
  o Helicos
  o Pacific Biosciences
  o Oxford Nanopore

No amplification, ligation or cDNA synthesis!

General RNA-seq pipeline - QC: Intermezzo: colorspace

Colorspace read: A 1 3 1 3 1 3 2 3

[Figure: color-code diagram over the bases A, C, G, T]

Intermezzo: colorspace - SNP

Reference: A 1 3 1 3 1 3 2 3 = A C G T A C G A T
SNP:       A 1 3 1 0 2 3 2 3 = A C G T T C G A T

• A SNP changes two adjacent colors

G 3 3 1 0 2 3 2 3 => 3 1 0 2 3 2 3

• First 2 'bases' are often removed

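Intermezzo: colorspace - Sequencing error

Reference:  A 1 3 1 3 1 3 2 3 = A C G T A C G A T
Error read: A 1 3 1 0 1 3 2 3 = A C G T T G C T A

• Sequencing error! A single wrong color changes every base after it once the read is decoded

• Translate reference to colorspace during alignments!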
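The colorspace examples can be reproduced with a few lines of Python. This is a minimal sketch, not the vendor's implementation: the function names `encode` and `decode` are made up for illustration, and the color code is assumed to be the standard two-base encoding (0 = same base, 1 = A<->C or G<->T, 2 = A<->G or C<->T, 3 = A<->T or C<->G), which agrees with the reads shown on the slides.

```python
# Assumed two-base color code. With A=0, C=1, G=2, T=3 the color of a
# dinucleotide is the XOR of the two base indices.
BASES = "ACGT"

def encode(seq):
    """Base sequence -> colorspace string that keeps the leading base."""
    colors = [str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)

def decode(colorspace):
    """Colorspace string (leading base + colors) -> base sequence."""
    bases = [colorspace[0]]
    for color in colorspace[1:]:
        # Each color tells how to step from the previous base to the next one.
        bases.append(BASES[BASES.index(bases[-1]) ^ int(color)])
    return "".join(bases)

reference = encode("ACGTACGAT")   # reference read from the slides
snp = encode("ACGTTCGAT")         # one base changed: two adjacent colors change
error = decode("A13101323")       # one color changed: all later bases change
```

Decoding a read with a single color error shifts every base after the error, which is why aligning in colorspace (translating the reference instead of the reads) is the safer choice.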


General RNA-seq pipeline - QC: Quality control

• QC tools -> FastQC, ...
  o Input: Fastq file
  o Output: Quality report

FastQC report sections:

• Per base sequence quality
• Per sequence quality scores
• Per base sequence content
  o Fastq input
  o Double encoded!
• Per base GC content
• Per sequence GC content


General RNA-seq pipeline: Alignment

• Tool: IGV, Integrative Genomics Viewer (Broad Institute)

• Coverage
• Direction
• Annotation: RefSeq genes

General RNA-seq pipeline: Coverage

[Figure: coverage tracks]

• The more samples, the less coverage!

General RNA-seq pipeline - SAGE sample: Alignment

• Position
• Introns

[Figure: reads aligned between annotated Start and Stop positions]

• Mismatches
• Minus reads are reversed!
• Seeding (-l 25 -k 1 -n 100)

General RNA-seq pipeline: SAM/BAM format

Header:
• reference name
• reference length

Alignment fields:
• query name = sample name:bead coordinates
• strand: 0 = plus, 16 = minus, 4 = no match
• leftmost position
  o 5' for plus strands
  o 3' for minus strands
• mapping quality (Phred scaled)
• CIGAR string
• mate info (mate pair seq)
• query sequence on same strand as reference
• query quality (encoded)
• optional fields TAG:TYPE:VALUE

Original RNA read is reverse complement!
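To make the field list concrete, here is a small sketch that splits one SAM alignment line into the fields named on the slides. The helper name `parse_sam_line` and the example record are hypothetical; only the tab-separated column order and the flag bits 4 (unmapped) and 16 (reverse strand) come from the SAM format itself.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into the fields listed above."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 4:
        strand = "no match"
    elif flag & 16:
        strand = "minus"
    else:
        strand = "plus"
    # Optional fields have the form TAG:TYPE:VALUE.
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return {
        "query_name": fields[0],
        "strand": strand,
        "reference": fields[2],
        "position": int(fields[3]),   # leftmost position, 1-based
        "mapq": int(fields[4]),       # Phred-scaled mapping quality
        "cigar": fields[5],
        "sequence": fields[9],        # always on the reference strand
        "quality": fields[10],
        "tags": tags,
    }

# Hypothetical record, for illustration only.
example = "sample1:12_345_678\t16\tchr1\t100\t37\t8M\t*\t0\t0\tACGTACGA\tIIIIIIII\tNM:i:1"
record = parse_sam_line(example)
```

A filter such as "mapping quality > 0" then becomes a simple test on `record["mapq"]`.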

Alternative splicing => isoforms

Re-alignment: RNA splicing

[Figure: splice donor and acceptor site consensus; Y = T or C (pyrimidine)]

General RNA-seq pipeline: Re-alignment

Database available from ALEXA-Seq



General RNA-seq pipeline: Alignment analysis

24nt (-l 24 -k 1 -n 100 -o 0 -t 4 -c)

20nt

General RNA-seq pipeline: QC filtering

Mapping quality > 0

Unique mappings

BWA default settings

General RNA-seq pipeline: Alignment analysis

Read length vs. Mapping %

[Figure: mapping percentage (1 = 100%) against read length, for BWA default settings, unique mappings and mapping quality > 0]

Error = 1 - (reads mapping in exons / total reads mapping)
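The error measure from this slide is easy to compute once the mapped read positions and the exon coordinates are known. The sketch below is only an illustration: the function name `exon_mapping_error`, the use of leftmost read positions, and the inclusive exon intervals are all assumptions, not the method used for the plot.

```python
def exon_mapping_error(read_positions, exons):
    """Error = 1 - (reads mapping in exons / total reads mapping).

    read_positions: leftmost positions of the mapped reads (assumption).
    exons: list of (start, end) intervals, both ends inclusive (assumption).
    """
    in_exons = sum(
        any(start <= pos <= end for start, end in exons)
        for pos in read_positions
    )
    return 1 - in_exons / len(read_positions)

# Hypothetical numbers: two of the four reads fall inside an exon.
error = exon_mapping_error([5, 15, 25, 50], [(1, 10), (20, 30)])
```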



General RNA-seq pipeline: Whole transcriptome

• Total RNA or poly(A) RNA
  o RNAs: mRNA, tRNA, rRNA, pri-miRNA, snRNA ...
  o Which RNA is polyadenylated?

• RNA is fragmented
  o Enzymatic digestion / physical shearing
  o Several reads per transcript possible!

• Sliding window analysis
  o Discover new RNAs
  o Intergenic regions

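Whole transcriptome: Sliding window analysis - Scaling

• Scaling of the wildtype coverage:

$$\frac{\operatorname{mean}_{pos}\left(\operatorname{mean}_{i}(\mathrm{coverage}_{ko})\right)}{\operatorname{mean}_{pos}\left(\operatorname{mean}_{i}(\mathrm{coverage}_{wt})\right)} \times \operatorname{mean}_{i}(\mathrm{coverage}_{wt})$$

[Figure: mean coverage against position, wildtype vs. knockout, unscaled and scaled]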

Whole transcriptome: Sliding window analysis

• Sliding window = hypothetical transcriptional unit

• Parameters:
  o Window size: 200, 100 and 20 nt
  o Threshold: log2 mean_window(FoldChange) > 2, 1.6, 1.3, 1
  o Background: (wt < 10 AND ko < 10)

[Figure: windows on the + strand and the - strand, by position index]
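The scaling and the window scan can be sketched as follows. This is an illustration under stated assumptions, not the analysis code behind the slides: the function names are hypothetical, the coverage vectors are taken to be already averaged over replicates, the fold change is wt/ko, the background test is applied to the window means, the threshold is applied to the absolute value so both directions are reported, and the pseudocount `pseudo` is added only to avoid division by zero.

```python
import math

def scale_wildtype(wt, ko):
    """Rescale wt coverage so its mean over all positions equals the ko mean."""
    factor = (sum(ko) / len(ko)) / (sum(wt) / len(wt))
    return [c * factor for c in wt]

def sliding_window_hits(wt, ko, size, threshold, background=10, pseudo=1.0):
    """Return (start, log2 mean fold change) for every window above threshold."""
    hits = []
    for start in range(len(wt) - size + 1):
        w = wt[start:start + size]
        k = ko[start:start + size]
        # Background: skip windows where both conditions are barely covered.
        if sum(w) / size < background and sum(k) / size < background:
            continue
        mean_fc = sum((a + pseudo) / (b + pseudo) for a, b in zip(w, k)) / size
        score = math.log2(mean_fc)
        if abs(score) > threshold:
            hits.append((start, score))
    return hits

# Hypothetical coverage: a 4 nt region that is 4x higher in wildtype.
wt = [40, 40, 40, 40, 5, 5, 5, 5]
ko = [10, 10, 10, 10, 5, 5, 5, 5]
hits = sliding_window_hits(wt, ko, size=4, threshold=1, pseudo=0.0)
```

Running the scan with several window sizes and thresholds, as listed in the parameters, trades sensitivity for short units against robustness for long ones.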



Intermezzo: Post-transcriptional modifications

• 5' Capping of pre-mRNA

• 3' Polyadenylation of pre-mRNA

Intermezzo: 5' Capping

- Stability
- Export out of nucleus
- Promote translation
- Splicing

Mitochondrial and chloroplast mRNA are not capped => not in CAGE

Intermezzo: 3' Polyadenylation

Nuclear export, translation, stability

General RNA-seq pipeline: Gene expression profiling

• Qualitative
  o Which part of the genome is expressed, in which cells, which mRNA isoforms

• Quantitative
  o Compare across conditions, understand biological processes/mechanisms
    - Tumor vs. normal tissue
    - Knock-out vs. wild-type mouse
    - Changing nutrient conditions in yeast
    - Etc.

General RNA-seq pipeline: Gene expression profiling

• DeepSAGE = Digital Gene Expression

• SAGE = Serial Analysis of Gene Expression

• Tag-based: one read per transcript
  o DeepSAGE -> most 3' CATG
  o DeepCAGE -> 5' end

• DE analysis

• GSEA

Gene expression profiling: SAGE

• Magnetic beads capture poly(A) RNA

• cDNA synthesis with reverse transcriptase (E.coli)

• NlaIII digestion
  o Every 250 bp
  o ~99% human transcripts

• Adapter A ligation
  o Complementary overhang
  o EcoP15I recognition site
  o PCR primer site (P2)

• EcoP15I digestion
  o Asymmetric
  o 27 bp downstream from adapter A

• Adapter B ligation
  o PCR primer site (P1)
  o Sequencing initiation site

SAGE

• Mapping against
  o Genome
  o Exons
  o Tag library (SAGE Genie)

• No re-alignment

• No gene length bias
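Because DeepSAGE yields one tag per transcript, anchored at the most 3' CATG (the NlaIII site), a tag library can be simulated from transcript sequences. The sketch below is a simplified illustration: `most_3prime_tag` is a hypothetical helper, and the choice to start the tag at the CATG itself with a default length of 27 bases is an assumption based on the EcoP15I cut distance mentioned on the protocol slide.

```python
def most_3prime_tag(transcript, site="CATG", tag_length=27):
    """Return the tag starting at the most 3' occurrence of the NlaIII site.

    Returns None when the transcript has no site (no tag is produced).
    The tag is shorter than tag_length if the transcript ends first.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)
    if pos == -1:
        return None
    return seq[pos:pos + tag_length]

# Hypothetical transcript with two CATG sites; only the 3' one gives a tag.
tag = most_3prime_tag("AACATGTTTTCATGAAACCCGGGTTT", tag_length=10)
```

Every transcript contributes at most one tag regardless of its length, which is why this protocol has no gene length bias.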



RNA interference

• siRNA
  o Silencing through methylation

• Exogenous
  o Viral dsRNA

• Endogenous
  o miRNA

microRNA

• Short
  o Mature ~ 18 - 26 bases

• Difficult to map uniquely
  o Two miRNAs may differ by only 1 base
  o Adapter removal

• Reads are cut out of gel at desired length

• Mapping against miRBase

• New miRNA (target) discovery / prediction
  o miRDeep
  o miRFinder
  o etc...

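Adapter (P2) removal: Before alignment

[Figure: the adapter color sequence 2330123231 is moved along the read and cut off where it matches]

Adapter (P2) removal: During alignment

[Figure: adapter (P2) removal during alignment; axis labels truncated ("Total number of", "Read number of")]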




• How many reads?

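• Depends on
  o Sequencing method
  o Sequencing depth
  o Cell type

[Figure: genes detected against number of aligned reads]

[Figure: reads counted per exon of a three-exon gene drawn 5' to 3', plus and minus strand: +6 -2, +4 -1, +4 -0; total +14 -3]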

• Exons

• Alternative transcripts

Quantifying transcript abundance: Counts

[Table: total tag count per feature (rows) and library (columns); libraries: 2 wildtype replicates, 2 knockout replicates]

Quantifying transcript abundance: Normalization

• Technical bias
  o SOLiD barcodes: random spread

• Sample size bias

• Gene length bias
  o Proportion of significant DE genes increases with transcript length
  o Has in particular implications for the ranking of differentially expressed genes => introduces bias in gene set testing

• Normalization:
  o RPKM = Reads per kb per million mapped reads

$$\mathrm{RPKM} = \frac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (kb)}}$$
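As a worked example of the RPKM formula, the sketch below computes it for one gene. The function name `rpkm` and the numbers are hypothetical; the inputs are raw read counts and the exon length in base pairs, which the function converts to millions and kilobases.

```python
def rpkm(exon_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads."""
    millions = total_mapped_reads / 1e6
    kilobases = exon_length_bp / 1000
    return exon_reads / (millions * kilobases)

# Hypothetical gene: 500 reads on 2 kb of exons, 10 million mapped reads.
value = rpkm(exon_reads=500, total_mapped_reads=10_000_000, exon_length_bp=2000)
```

Doubling either the library size or the exon length halves the value, which is exactly the sample size and gene length correction the slide describes.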



Statistical analysis: RNA-seq vs. microarrays

• RNA-seq
  o Counts
  o Absolute abundance of transcript
  o All transcripts present

• Microarray
  o Hybridization signal to complementary probe
  o Relative abundance
  o Content limited
  o Cross-hybridization

Count data - Poisson distribution

• Discrete probability distribution

• Not continuous

• Probability of a number of events occurring in a fixed period of time

• Events occur with known average rate and independently of the time since the last event

• DESeq/edgeR: negative binomial distribution (related to Poisson)

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

λ = expected number of occurrences, k = number of occurrences
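A short sketch of the Poisson probability for a read count follows. The function name `poisson_pmf` and the example rate are illustrative assumptions; the formula is the standard Poisson probability mass function with λ the expected count and k the observed count.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical gene with 3 expected reads: chance of seeing 0..5 reads.
probabilities = [poisson_pmf(k, 3.0) for k in range(6)]
```

For a Poisson distribution the variance equals the mean. Replicated count data usually vary more than that, which is the reason tools such as DESeq and edgeR use the negative binomial distribution with an extra dispersion parameter.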

Multiple hypothesis testing

• One test:
  H0: μ1 = μ2
  H1: μ1 ≠ μ2

• Multiple tests:

                    H0 true                H0 false
H0 rejected         V (false positives)    S                      "Discoveries"
H0 not rejected     U                      T (false negatives)

• Sensitivity vs. specificity
  o false negatives => not sensitive enough
  o false positives => not specific enough


• K tests at level α = 0.05

• Expect 0.05 * K false discoveries if all null hypotheses are true
  o 1 out of 20
  o K = 40,000 => 2,000

• Bonferroni (Holm) controls the FWER (familywise error rate)

  α' = α/K

• Benjamini–Hochberg controls the FDR
  o FDR = false discovery rate
  o V false discoveries
  o V + S total discoveries
  o Expected proportion of false positives among the discoveries: E[V / (V + S)]
  o Controlling the FDR is less conservative than controlling the FWER
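Both corrections can be written in a few lines. The sketch below is a plain-Python illustration with hypothetical function names and p-values; it returns adjusted p-values (compare them with α directly) rather than an adjusted α, and it uses the usual step-up form of Benjamini-Hochberg.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests K."""
    k = len(pvalues)
    return [min(1.0, p * k) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: p * K / rank, made monotone from the largest p downwards."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    adjusted = [0.0] * k
    running_min = 1.0
    for rank in range(k, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * k / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from four tests.
pvalues = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(pvalues)
fdr_adjusted = benjamini_hochberg(pvalues)

# Expected false discoveries without correction: 0.05 * K.
expected_false = 0.05 * 40000
```

With these four p-values every test stays below 0.05 after Benjamini-Hochberg, while Bonferroni keeps only two of them, which illustrates why FDR control is the less conservative choice.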

Differential Expression

Condition 1: Wildtype vs. Condition 2: Knockout

Statistical testing (DESeq)

etc.

Multiple testing! Correcting gives adjusted p-value

design=c("wt","wt","d","d")


• Results: log2FC / p-value / padj
  o Fold change => log2FC

          wt     wt     ko     ko
counts:   129    113    120    108
mean:        121           114

fold change: 121/114 = 1.06    114/121 = 0.94
log2 1.06 = 0.08               log2 0.94 = -0.08

Other way around: 2^0.08 = 1.06


• Results:
  o M vs. A plot (minus vs. average)

M = log2(wt) - log2(ko) = log2(wt/ko) = log2 fold change (log2FC)

A = 0.5 (log2(wt) + log2(ko)) = average expression (baseMean)

[Figure: M against A; highlighted points = padj < 0.05]
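The M and A values for one gene can be computed directly from the replicate counts. The sketch below uses the counts from the fold-change example (wt 129 and 113, ko 120 and 108); the function name `ma_values` is hypothetical and the replicate means stand in for wt and ko, without any library size normalization.

```python
import math

def ma_values(wt_counts, ko_counts):
    """Return (M, A) for one gene from replicate counts."""
    wt = sum(wt_counts) / len(wt_counts)
    ko = sum(ko_counts) / len(ko_counts)
    m = math.log2(wt) - math.log2(ko)          # log2 fold change
    a = 0.5 * (math.log2(wt) + math.log2(ko))  # average log2 expression
    return m, a

m, a = ma_values([129, 113], [120, 108])
fold_change = 2 ** m  # back from log2FC to the fold change, about 1.06
```

Computed without intermediate rounding, M is about 0.086; the 0.08 in the example comes from taking the logarithm of the already rounded fold change 1.06.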


• Results:
  o Hierarchical clustering of samples and genes - Heatmap

[Figure: heatmap of the top 100 varying genes (features); samples 9.5 ED, 7.5 ED, 8.5 ED => replicates]

F-test / ANOVA, linear regression, variance-stabilizing transformation (VST)

• Variance-stabilizing transformation
  o Find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value

[Figure: variance-stabilizing transformed counts vs. raw counts]

Differential Expression: Venn diagram

• Intersection of differentially expressed genes

Gene Set Enrichment Analysis (GSEA)

• edgeR, baySeq, DESeq ...
  o Return a list of differentially expressed genes

• Biological theory is not about isolated genes
  o Typical biological research questions and hypotheses
    - About pathways
    - About biological processes
    - About areas of the genome
  o About sets of related genes

• Question
  o How to analyze RNA-seq data from a gene set perspective?

GSEA: Which gene sets?

• Any defined set that has something in common
  o Pathways
    - KEGG, Reactome, BioCarta
  o Gene Ontology terms
    - Biological process, Molecular function, Cellular component
  o Chromosomal regions
    - Chromosome arms, cytobands, linkage peaks, genes
  o Published gene sets
    - Predictive signatures, gene lists

GSEA

• H0: null hypothesis
  o Genes in the gene set are as often differentially expressed as genes outside

• H1: alternative hypothesis
  o Genes in the gene set are more often differentially expressed than genes outside

GSEA: Scoring gene sets

Ordered gene list L:

Gene        Adjusted p-value    Member of set G
gene 1      3.44E-006           yes
gene 2      1.77E-005           no
...
gene 100    0.49                yes
gene 101    0.51                no
...
gene n      1                   no

Gene set G: {1, 3, ..., 100, ...}


Define a cutoff and count the members of G above and below the cutoff (= 0.5)

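GSEA: Fisher's exact test

• Define cutoff (C = 100) and count members of G above and below cutoff

• Fisher's exact test:

                        # genes above C    # genes below C    total
# genes in set G        5 (a)              0 (b)              5 (a + b)
# genes not in set G    1 (c)              4 (d)              5 (c + d)
total                   6 (a + c)          4 (b + d)          10 (a + b + c + d = n)

Probability of one table, with n = a + b + c + d:

$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$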

GSEA: Fisher's exact test

• P-value = sum of all table probabilities ≤ P_cutoff (the probability of the observed table)

                   diff. expr. gene    non-diff. expr. gene
in gene set        5                   0
not in gene set    1                   4

P-value = 0.0476 => reject H0 => enriched set!
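The p-value for this 2 x 2 table can be checked with a short script. The sketch below is a plain-Python illustration with hypothetical function names; it enumerates every table with the same row and column totals and sums the probabilities that do not exceed the probability of the observed table (the two-sided test).

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2 x 2 table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables at most as likely as the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        # Small tolerance so that ties with the observed table are included.
        if p <= p_observed * (1 + 1e-9):
            p_value += p
    return p_value

# Table from the slide: a = 5, b = 0, c = 1, d = 4.
p_value = fisher_exact_two_sided(5, 0, 1, 4)
```

For this table the observed configuration and its mirror image each have probability 5/210, so the sum is 10/210, which matches the 0.0476 on the slide.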

GSEA

• Alternative:
  o Chi2 test

• Tool: goseq (R package)
  o Correction for length bias
  o Not in SAGE!

GSEA

• Example KEGG: Time series 7.5 ED vs. 9.5 ED

[Figure: KEGG pathway showing all genes in the pathway, 7.5 ED vs. 9.5 ED]

The End
