Post on 01-Aug-2020
RNA-seq analysis
Alexey Sergushichev
July 30th, MIPT
Intro to RNA-sequencing
RNA-seq quantification: from raw data to gene expression table
RNA-seq analysis: from an expression table to pathway analysis
Overview of the lecture
Intro to RNA-sequencing
RNA-seq quantification: from raw data to gene expression table
RNA-seq analysis in R: from an expression table to pathway analysis
• Getting publicly available expression tables
• Doing differential expression
• Doing pathway analysis
Practical session
Intro to RNA-sequencing
RNA-seq quantification: from raw data to gene expression table
RNA-seq analysis in R: from an expression table to pathway analysis
Overview of the lecture
Central dogma of molecular biology
Information
storage
Function
Mass-spectrometry based high-throughput proteomics:
• Measures what really matters: proteins
• Measures thousands of proteins simultaneously, but not all of them
• Complex instrument: costly maintenance of the instrument, complex
raw data, complex experimental design, complex analysis
Ask Pavel Synitcyn for more about proteomics
Measuring proteins: proteomics
Central dogma of molecular biology
Information
storage
Function
Measuring RNA as a proxy to protein
Correct central dogma of molecular biology
Measuring RNA as a proxy to protein
RNA-seq = RNA->cDNA + DNA-sequencing
http://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf
FASTQ format has four lines per read: name, sequence, comment, base
call qualities
Raw RNA-sequencing data: FASTQ files
@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:
NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA
+
#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD
Read name fields, in order: instrument name, run number, flowcell ID, lane number, tile number, X coordinate of cluster, Y coordinate of cluster
After the space: read number (single/paired), filter flag (Y – filtered, N – not), control number
Fourth line: base call qualities
https://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm
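As a minimal sketch of how the record above breaks down, the Python below splits the Illumina read name into its fields and decodes the base call qualities. It assumes the Phred+33 quality encoding; the function names and field labels are illustrative and do not come from any library.

```python
from typing import Dict, List

# Labels for the colon-separated parts of an Illumina read name (illustrative).
NAME_FIELDS = ["instrument", "run", "flowcell", "lane", "tile", "x", "y"]
EXTRA_FIELDS = ["read", "filtered", "control", "index"]


def parse_read_name(header: str) -> Dict[str, str]:
    """Split '@instr:run:flowcell:lane:tile:x:y read:filtered:control:index'."""
    name, _, extra = header.lstrip("@").partition(" ")
    fields = dict(zip(NAME_FIELDS, name.split(":")))
    fields.update(zip(EXTRA_FIELDS, extra.split(":")))
    return fields


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode ASCII quality characters into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# The four lines of the record shown on the slide.
record = [
    "@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:",
    "NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA",
    "+",
    "#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD",
]
header, sequence, _, quality = record
name = parse_read_name(header)
scores = phred_scores(quality)
print(name["instrument"], name["lane"], name["filtered"], scores[:5])
```

The first base is an N with the lowest quality character (#), which is typical for a base the instrument could not call.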
Model organism (well studied): a good reference genome is available
Non-model organism (not so well studied): no reference genome, or a poor one
Two distinct types of RNA-seq
Human: chr1-22, chrX, chrY, chrM
• 3235 Mb, 19815 genes
Mouse: chr1-19, chrX, chrY, chrM
• 2718 Mb, 21971 genes
Assembly is mostly complete, but not 100%: there are unplaced scaffolds and gaps
There are genes in unplaced/unlocalized sequence, which could be important
Well-defined genomes
http://www.slideshare.net/hhalhaddad/the-human-genome-project-part-iii
Human:
• UCSC notation (hg19, hg38)
• Genome reference consortium notation (major: GRCh37, minor:
GRCh38.p7)
• 1000 genomes notation (b37)
Mouse – same (mm10, GRCm37)
In RNA-seq always use the latest primary assembly:
• hg38/GRCh38 for human
• mm10/GRCm38 for mouse
Popular genome assemblies
Primary assembly: the best known assembly of a haploid genome.
• Chromosome assembly: a sequence with known physical location (e.g.
according to a physical map).
• Unlocalized sequence: a sequence found in an assembly that is
associated with a specific chromosome but cannot be ordered or
oriented on that chromosome.
• Unplaced sequence: a sequence found in an assembly that is not
associated with any chromosome.
Genome Reference Consortium terminology
https://www.ncbi.nlm.nih.gov/grc/help/definitions/
http://lh3lh3.users.sourceforge.net/humanref.shtml
Unlocalized/unplaced sequences and patches can
contain genes!
rRNA – ribosomal RNA: 80% of the cell RNA
tRNA – transfer RNA: 15% of the cell RNA
mRNA – messenger RNA for protein coding genes
Other RNAs: miRNA, lncRNA, …
Some RNAs are short (tRNA, miRNA, …) and are not captured by standard RNA-seq
Main types of RNAs
https://www.ncbi.nlm.nih.gov/books/NBK21729/
Protein-coding RNA
Estimated 10^5 to 10^6 mRNA molecules per animal cell with high dynamic
range for genes: from several copies to 10^4
mRNA levels correlate with protein levels
mRNA
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3129258/
https://bionumbers.hms.harvard.edu/bionumber.aspx?id=111220
polyA selection: most standard, relatively cheap and easy protocol,
selects mRNAs and some non-coding RNAs
riboZero: depletes rRNA, works better for degraded RNA, captures all long
RNAs
Two main approaches for RNA selection
DNA is transcribed a lot, giving multiple types of RNA
Some encode proteins, some do not
Set of transcripts with a similar function = gene
For a canonical protein-coding gene, transcripts = isoforms
What’s a gene?
RefSeq – very conservative
ENSEMBL/Gencode – very inclusive
Practical definition of a gene: genome annotation
Using Gencode is the most practical
https://www.gencodegenes.org
We have raw data: FASTQ-files
We have genome reference and genome annotation
Summary (1)
Intro to RNA-sequencing
RNA-seq quantification: from raw data to gene expression table
RNA-seq analysis in R: from an expression table to pathway analysis
Overview of the lecture
Designed for DNA-seq, so “bad” is not always bad
QCFail: https://sequencing.qcfail.com/
Quality control: FastQC
https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
Alignment
• HISAT2
• STAR
• bowtie/bowtie2
Counting
• featureCounts
• htseq
• mmquant
• RSEM
Alignment + counting pipeline
Typical SAM file with alignment
Highlighted fields: bitwise SAM flag, mapping quality, CIGAR string
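To make the three highlighted fields concrete, here is a small Python sketch that pulls the flag, mapping quality and CIGAR string out of one SAM line. The alignment line itself is made up for illustration, and only a few of the flag bits defined in the SAM specification are decoded.

```python
import re
from typing import List, Tuple

# A hypothetical single-end alignment (tab-separated), invented for this example.
sam_line = "read1\t16\tchr1\t14362\t255\t30M1200N20M\t*\t0\t0\t*\t*"

# A subset of the flag bits defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x100: "secondary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all known bits that are set in the flag."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '30M1200N20M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


fields = sam_line.split("\t")
flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]
flags = decode_flag(flag)
ops = parse_cigar(cigar)
# Operations M, D, N, = and X consume reference bases.
ref_span = sum(n for n, op in ops if op in "MDN=X")
print(flags, mapq, ops, ref_span)
```

The N operation is the one to look for in RNA-seq: it marks a skipped region of the reference, which for a spliced read corresponds to an intron.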
Expectations for genomic RNA-seq alignment
All reads:
• Not mapped anywhere: 2-10%
• Uniquely mapped: 70%
• Multimapped 2-15 times: 10-20%
• Multimapped >15 times: ~5%
Generate read coverage and
visualize in a genome browser
Check alignment rate
Check library strategy (next slides)
Useful to check ribosomal RNA content as well
Tools: RSeQC, Picard/CollectRnaSeqMetrics, QoRTs
RNA-seq quality control
Library strategies: single-end vs paired-end
https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html
Library strategies: stranded vs unstranded
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393
Library strategies: 3’- or 5’- specific, full-length
https://bitesizebio.com/13559/ngs-quality-control-in-rna-sequencing-some-free-tools/
For a stranded experiment, we can distinguish between two different genes if they are on opposite strands
For a non-stranded experiment, we can’t
htseq-count discards reads that overlap 2 or more features (ambiguous)
~50% assignment rate is normal
FeatureCounts
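The effect of strandedness on read assignment can be sketched in a few lines of Python. This is a toy model, not how featureCounts or htseq-count are implemented: genes are plain intervals, the gene coordinates are invented, and the read strand is assumed to equal the transcript strand (in some stranded protocols it is reversed).

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str


def assign_read(start: int, end: int, strand: str,
                genes: List[Gene], stranded: bool) -> str:
    """Assign a read to a gene, or report 'no_feature' / 'ambiguous'."""
    hits = [
        g.name for g in genes
        if g.start <= end and start <= g.end          # intervals overlap
        and (not stranded or g.strand == strand)      # strand check if stranded
    ]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"  # read overlaps 2 or more features, discarded
    return hits[0]


# Two hypothetical genes overlapping on opposite strands.
genes = [Gene("geneA", 100, 500, "+"), Gene("geneB", 400, 900, "-")]

stranded_call = assign_read(420, 470, "+", genes, stranded=True)
unstranded_call = assign_read(420, 470, "+", genes, stranded=False)
print(stranded_call, unstranded_call)
```

The same read is assigned to geneA in the stranded case and discarded as ambiguous in the non-stranded case.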
>5M assigned reads are required for a typical analysis, thus there should
be >10M raw reads
Usually it’s better to increase the number of biological replicates instead of library depth
Library depth
Gene expression table
Very fast pseudo-alignment
No sam/bam output
Transcript level quantification
Expectation-maximization for counting multimappers/ambiguous reads
Very useful for reprocessing of public datasets
Kallisto
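The expectation-maximization idea for ambiguous reads can be illustrated with a toy Python sketch. It is not kallisto's implementation: it assumes all transcripts have the same effective length and that each read comes with a known set of compatible transcripts; the transcript names and read counts are invented.

```python
from typing import Dict, List, Set


def em_abundance(read_compat: List[Set[str]], n_iter: int = 100) -> Dict[str, float]:
    """Estimate relative transcript abundances from read compatibility sets."""
    transcripts = sorted(set().union(*read_compat))
    # Start from equal abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read between its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: new abundances are the normalized fractional counts.
        theta = {t: counts[t] / len(read_compat) for t in transcripts}
    return theta


# 6 reads unique to tx1, 2 unique to tx2, 4 compatible with both.
reads = [{"tx1"}] * 6 + [{"tx2"}] * 2 + [{"tx1", "tx2"}] * 4
theta = em_abundance(reads)
print(theta)
```

The ambiguous reads end up split 3:1, the same ratio as the unique reads, so the estimates converge to 0.75 for tx1 and 0.25 for tx2.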
FPKM = Fragments per Kilobase of gene per Million
• First normalize to library depth, then to gene length
TPM = Transcripts per Million
• First normalize to effective transcript length, then to library depth
• Sum of all TPMs is one million
• Works well with isoforms, ~proportional to concentrations
Effective length for 3’-seq is the same for all genes/transcripts
Units: FPKM vs TPM
https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
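The difference in normalization order is easiest to see on numbers. The Python sketch below computes both units for three hypothetical genes; the counts and lengths are invented, and plain gene length is used in place of effective length for simplicity.

```python
from typing import Dict


def fpkm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """First normalize to library depth (per million), then to length (per kb)."""
    per_million = sum(counts.values()) / 1e6
    return {g: counts[g] / per_million / (lengths[g] / 1e3) for g in counts}


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """First normalize to length (per kb), then rescale so the sum is one million."""
    rate = {g: counts[g] / (lengths[g] / 1e3) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: rate[g] / scale for g in counts}


# Hypothetical fragment counts and gene lengths in bases.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values)
print(tpm_values)
```

geneA and geneB get equal values in both units because geneB has three times the reads but is three times longer. Only the TPM values are guaranteed to sum to one million, which makes them comparable between samples as proportions.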
FeatureCounts vs Kallisto: similar but different
Generate a single report for many tools:
• fastqc
• hisat2
• rseqc
• kallisto
• …
MultiQC
polyA vs riboZero:
• If interested only in protein-coding genes, then polyA-selection, unless
RNA quality is bad
Library strategy:
• Always stranded if possible
• Paired-end and full-length for isoform quantification
• Single-end 3’ is enough for simple mRNA analysis, no bias for transcript length
Library preparation recap
Went from raw data to gene expression tables
• Alignment + quantification pipeline
• Alignment-free analysis with kallisto
QC for every step
Summary so far
Intro to RNA-sequencing
RNA-seq quantification: from raw data to gene expression table
RNA-seq analysis: from an expression table to pathway analysis
Overview of the lecture
Two biological conditions: heart after
myocardial injury or sham treatment
Looking for individual genes
differentially expressed between
conditions
Looking for molecular processes
differentially regulated between
conditions
The simplest experiment design
https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.117.028252
Quality control: principal component analysis
(PCA)
Outlier?
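How an outlier shows up on a PCA plot can be sketched without any libraries. The Python below log-transforms a made-up count table, centers each gene, and finds the first principal component by power iteration. It is an illustration only: the counts are invented, no library-size normalization is applied, and a real analysis would rely on an established PCA implementation.

```python
import math
from typing import List


def pc1_scores(expr: List[List[float]], n_iter: int = 200) -> List[float]:
    """expr: rows are samples, columns are genes. Returns the PC1 score per sample."""
    logged = [[math.log2(v + 1.0) for v in row] for row in expr]
    n, p = len(logged), len(logged[0])
    # Center each gene (column) across samples.
    means = [sum(row[j] for row in logged) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in logged]
    # Sample-by-sample Gram matrix; its leading eigenvector gives PC1 scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    vec = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):  # power iteration
        vec = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    eigenvalue = sum(
        vec[i] * sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)
    )
    return [math.sqrt(eigenvalue) * v for v in vec]


# Hypothetical counts: four similar samples and one that looks very different.
expr = [
    [100, 200, 50, 10],
    [110, 190, 55, 12],
    [95, 210, 45, 9],
    [105, 205, 52, 11],
    [400, 20, 300, 200],
]
scores = pc1_scores(expr)
outlier = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(scores, outlier)
```

The last sample sits far from the other four along PC1, which is the pattern that raises the "Outlier?" question on the plot.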
Multiple pipelines
• DESeq2 – the easiest one
• edgeR
• limma + voom – faster and better for large datasets
Differential gene expression analysis
Differential gene expression: there can be too many genes
~2000 significant genes
Pathway analysis: gene set enrichment analysis
Epithelial–mesenchymal transition pathway
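The core of gene set enrichment analysis is a running sum over the ranked gene list. The Python sketch below computes the unweighted version of that enrichment score for a made-up ranking. It is a simplification: the standard GSEA statistic weights hits by the ranking metric and assesses significance by permutation, neither of which is shown here.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Unweighted running-sum enrichment score of gene_set in ranked_genes."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a pathway gene, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Hypothetical genes ranked from most up-regulated to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]

es_top = enrichment_score(ranked, {"g1", "g2", "g4"})
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})
print(es_top, es_bottom)
```

A pathway whose genes cluster at the top of the ranking gets a positive score, and one whose genes cluster at the bottom gets a negative score.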
Gene set enrichment analysis table
MSigDB
Reactome
KEGG
Enrichr pathways
Gene Ontology
…
Pathway databases
Do QC and visualize data to check that biology worked
Differential gene expression results can be hard to interpret directly
Pathway analysis lets us go from single genes to molecular pathways,
which are much more robust and interpretable
Summary (3)
RNA-seq is a tool, not the result
Generate hypotheses based on RNA-seq data
Validate hypotheses experimentally
Epilogue: RNA-seq is the beginning of beautiful biology
Annual Systems Biology Workshop
(http://bioinf.me/sbw)
International master's program “Bioinformatics and Systems Biology” (https://vk.com/bioinf_itmo)
Advertisement