Introduction to Bioinformatics - RNA seq - Marcel Willemsen a.m.willemsen@amc.uva.nl


Page 1:

Introduction to Bioinformatics - RNA seq

Marcel Willemsen

a.m.willemsen@amc.uva.nl

Page 2:

Biology intro

Central dogma of molecular biology

• RNA
  o mRNA
  o tRNA
  o miRNA => Regulation
  o ...
• (*) Proteins
  o Structure
  o Hormones
  o Enzymes
  o ...

Number of mRNA molecules = gene expression measure

(*) protein biosynthesis

Page 3:

General RNA-seq pipeline

• Sample prep
• Sequencing
• QC
• Alignment / re-alignment
• Alignment analysis
• QC filtering
• Applications:
  o Whole transcriptome
  o Gene expression profiling
  o microRNA
• Quantifying transcript abundance
• Statistical analysis

Page 5:

Sample preparation

General RNA-seq pipeline

Strand-specific

Page 6:

Agarose gel electrophoresis purification

General RNA-seq pipeline - Sample prep

• Digestion by restriction endonucleases (EcoRI)
• Cutting out desired length

Page 8:

NGS platforms

General RNA-seq pipeline - Sequencing

• Short reads
  o Illumina-Solexa
  o ABI-SOLiD (Life Technologies)
• Long reads
  o Roche-454
• Third generation (direct / single molecule)
  o Helicos
  o Pacific Biosciences
  o Oxford Nanopore

No amplification, ligation or cDNA synthesis!

Page 9:

Intermezzo: colorspace

General RNA-seq pipeline - QC

A 1 3 1 3 1 3 2 3

[Figure: color-code matrix over the bases A, C, G, T]

Page 10:

Intermezzo: colorspace

SNP

A C G T A C G A T -> A 1 3 1 3 1 3 2 3

A C G T T C G A T -> A 1 3 1 0 2 3 2 3

SNP: one changed base changes two adjacent colors.

G 3 3 1 0 2 3 2 3 -> 3 1 0 2 3 2 3

• First 2 'bases' are often removed

Page 11:

Intermezzo: colorspace

Sequencing error

A C G T A C G A T -> A 1 3 1 3 1 3 2 3

Sequencing error! A 1 3 1 0 1 3 2 3 -> A C G T T G C T A

A single wrong color changes every base downstream of it when the read is decoded.

• Translate reference to colorspace during alignments!
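The colorspace examples on these slides can be reproduced with a short sketch. It assumes the usual two-base color code, written here as the XOR of two-bit base codes (A=0, C=1, G=2, T=3); the function names are illustrative and not taken from any tool.

```python
# Two-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colorspace(bases: str) -> str:
    """Return the first base followed by one color per adjacent base pair."""
    colors = [
        str(BASE_TO_CODE[a] ^ BASE_TO_CODE[b]) for a, b in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


def decode_colorspace(read: str) -> str:
    """Decode a colorspace read; each base depends on all previous colors."""
    code = BASE_TO_CODE[read[0]]
    bases = [read[0]]
    for color in read[1:]:
        code ^= int(color)
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


reference = encode_colorspace("ACGTACGAT")       # sequence from the slides
snp = encode_colorspace("ACGTTCGAT")             # one base changed (SNP)
error_decoded = decode_colorspace("A13101323")   # one color changed (error)

print(reference, snp, error_decoded)
```

Comparing `reference` and `snp` shows two adjacent color differences, while decoding the read with a single wrong color garbles the rest of the sequence. That is why the reference is translated to colorspace for alignment rather than the reads to bases.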

Page 13:

Quality control

General RNA-seq pipeline - QC

• QC tools -> FastQC, ...
  o Input: FASTQ file
  o Output: quality report

Page 14:

Quality control

General RNA-seq pipeline - QC

Per base sequence quality

Page 15:

Quality control

General RNA-seq pipeline - QC

Per sequence quality scores
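The two quality views above can be sketched in a few lines. This is a minimal illustration, not how FastQC is implemented; it assumes Phred+33 encoded quality strings, and the example strings are made up.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: float) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def per_base_mean_quality(quality_strings, offset: int = 33) -> list:
    """Mean quality at each read position, across all reads."""
    totals, counts = [], []
    for quality in quality_strings:
        for pos, score in enumerate(phred_scores(quality, offset)):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += score
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def per_sequence_mean_quality(quality_strings, offset: int = 33) -> list:
    """Mean quality of each read."""
    return [
        sum(phred_scores(q, offset)) / len(q) for q in quality_strings if q
    ]


# Hypothetical quality strings for two reads of length 4.
qualities = ["IIII", "II5+"]
per_base = per_base_mean_quality(qualities)
per_read = per_sequence_mean_quality(qualities)
print(per_base, per_read)
```

The per-base view typically shows quality dropping toward the end of the read; the per-sequence view flags reads that are poor overall.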

Page 16:

Quality control

General RNA-seq pipeline - QC

Per base sequence content

• FASTQ input
• Double encoded!

3102

Page 17:

Quality control

General RNA-seq pipeline - QC

Per base GC content

Page 18:

Quality control

General RNA-seq pipeline - QC

Per sequence GC content

Page 20:

Alignment

General RNA-seq pipeline

• Tool: IGV, Integrative Genomics Viewer (Broad Institute)
• Coverage
• Direction
• Annotation: RefSeq genes

Page 21:

Coverage

General RNA-seq pipeline

3 24 3 23

• The more samples, the less coverage!

Page 22:

Alignment

General RNA-seq pipeline - SAGE sample

• Position
• Introns

Page 23:

Alignment

General RNA-seq pipeline - SAGE sample

Start Stop

Start Stop

• Mismatches
• Minus reads are reversed!
• Seeding (l 25 k 1 n 100)

Page 24:

SAM/BAM format

General RNA-seq pipeline

• Header
  o Reference name
  o Reference length
• Query name = sample name:bead coordinates
• Strand (flag): 0 = plus, 16 = minus, 4 = no match
• Leftmost position
  o 5' for plus strands
  o 3' for minus strands
• Mapping quality (Phred scaled)
• CIGAR string
• Query sequence on same strand as reference

Page 25:

SAM/BAM format

General RNA-seq pipeline

• Query quality (encoded)
• Optional fields TAG:TYPE:VALUE
• Mate info (mate pair seq)
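As a rough illustration of the fields listed above, the sketch below splits one SAM alignment line into its mandatory columns and interprets the flag values named on the slide (0, 16, 4). The example record, including the read name and the `NM` tag value, is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SamRecord:
    qname: str   # query name
    flag: int    # bitwise flag (strand, mapped or not, ...)
    rname: str   # reference name
    pos: int     # 1-based leftmost position on the reference
    mapq: int    # mapping quality (Phred scaled)
    cigar: str   # CIGAR string
    seq: str     # query sequence, on the same strand as the reference
    qual: str    # encoded query quality
    tags: dict = field(default_factory=dict)  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated SAM alignment line (not a header line)."""
    cols = line.rstrip("\n").split("\t")
    tags = {}
    for optional in cols[11:]:
        tag, value_type, value = optional.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], seq=cols[9], qual=cols[10], tags=tags,
    )


def strand(flag: int) -> str:
    """Bit 4 set: unmapped. Bit 16 set: minus strand. Otherwise plus."""
    if flag & 4:
        return "unmapped"
    return "-" if flag & 16 else "+"


# Hypothetical record: 25 nt read on the minus strand with one mismatch.
example = "\t".join([
    "sample1:12_34_56", "16", "chr1", "1000", "37", "25M",
    "*", "0", "0", "ACGTA" * 5, "I" * 25, "NM:i:1",
])
record = parse_sam_line(example)
print(record.rname, record.pos, strand(record.flag), record.tags)
```

Columns 7 to 9 (the mate information) are skipped here to keep the sketch short.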

Page 26:

Original RNA read is reverse complement!
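Because the query sequence is stored on the reference strand, the original read of a minus-strand alignment has to be recovered by reverse complementing. A minimal sketch, assuming the flag convention from the SAM slide (16 = minus); the helper names are illustrative.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def original_read(stored_seq: str, flag: int) -> str:
    """Return the read as sequenced, given the stored sequence and flag."""
    return reverse_complement(stored_seq) if flag & 16 else stored_seq


print(original_read("ACGTT", 16))  # minus strand: reverse complement
print(original_read("ACGTT", 0))   # plus strand: unchanged
```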

Page 27:

Re-alignment

RNA splicing

Alternative splicing => isoforms

Page 28:

Re-alignment

RNA splicing

donor, acceptor

Page 29:

Re-alignment

RNA splicing

Y = T or C (pyrimidine)

Page 30:

Re-alignment

RNA splicing

Page 31:

Re-alignment

RNA splicing

Page 32:

Re-alignment

General RNA-seq pipeline

Page 33:

Re-alignment

General RNA-seq pipeline

Database available from ALEXA-Seq

Page 35:

Alignment analysis

General RNA-seq pipeline

24nt (-l 24 -k 1 -n 100 -o 0 -t 4 -c)

20nt

Page 36:

QC filtering

General RNA-seq pipeline

• Mapping quality > 0
• Unique mappings
• BWA default settings

Page 37:

Alignment analysis

General RNA-seq pipeline

Read length vs. mapping %

[Figure: mapping percentage (1 = 100%) and error against read length; series: BWA default settings, unique mappings, mapping quality > 0]

Error = 1 - (reads mapping in exons / total reads mapping)
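The filter and the error measure from these slides can be written down directly. This sketch assumes alignments are available as simple `(mapq, in_exon)` pairs, which is a made-up representation for illustration only.

```python
def filter_by_mapq(alignments, min_mapq: int = 1):
    """Keep alignments with mapping quality > 0 (that is, MAPQ >= 1)."""
    return [aln for aln in alignments if aln[0] >= min_mapq]


def mapping_error(reads_in_exons: int, total_mapped: int) -> float:
    """Error = 1 - (reads mapping in exons / total reads mapping)."""
    if total_mapped == 0:
        raise ValueError("no mapped reads")
    return 1 - reads_in_exons / total_mapped


# Hypothetical alignments: (mapping quality, does the read fall in an exon?)
alignments = [(37, True), (0, False), (25, True), (12, False), (0, True)]
kept = filter_by_mapq(alignments)
in_exons = sum(1 for _, in_exon in kept if in_exon)
error = mapping_error(in_exons, len(kept))
print(len(kept), in_exons, error)
```

The error measure treats reads that map outside annotated exons as wrong, which is a proxy rather than a true error rate: novel transcripts and intronic reads also land outside exons.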

Page 39:

Whole transcriptome

General RNA-seq pipeline

• Total RNA or poly(A) RNA
  o RNAs: mRNA, tRNA, rRNA, pri-miRNA, snRNA ...
  o Which RNA is polyadenylated?
• RNA is fragmented
  o Enzymatic digestion / physical shearing
  o Several reads per transcript possible!
• Sliding window analysis
  o Discover new RNAs
  o Intergenic regions

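Page 40:

Sliding window analysis - Scaling

Whole transcriptome

• Scaling of the mean wildtype coverage to the knockout level:

$$\frac{\operatorname{mean}_{pos}\left(\operatorname{mean}_{i}(\mathrm{coverage}_{ko})\right)}{\operatorname{mean}_{pos}\left(\operatorname{mean}_{i}(\mathrm{coverage}_{wt})\right)} \times \operatorname{mean}_{i}(\mathrm{coverage}_{wt})$$

[Figure: mean coverage against position; panels: wildtype vs. knockout, and wildtype vs. knockout scaled]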
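One way to read the scaling expression is sketched below: average the replicates per position, then multiply the wildtype profile by the ratio of the overall knockout and wildtype means. The reading of the index i as "replicate" and the toy coverage values are assumptions.

```python
from statistics import mean


def mean_over_replicates(replicates):
    """mean_i: average coverage per position across replicate tracks."""
    return [mean(values) for values in zip(*replicates)]


def scale_wildtype(wt_replicates, ko_replicates):
    """Scale the mean wildtype coverage to the overall knockout level."""
    wt_mean = mean_over_replicates(wt_replicates)
    ko_mean = mean_over_replicates(ko_replicates)
    factor = mean(ko_mean) / mean(wt_mean)  # ratio of the mean_pos terms
    return [factor * cov for cov in wt_mean], ko_mean


# Hypothetical coverage tracks: two replicates per condition, three positions.
wt = [[10.0, 20.0, 30.0], [30.0, 40.0, 50.0]]
ko = [[5.0, 10.0, 15.0], [15.0, 20.0, 25.0]]
wt_scaled, ko_mean = scale_wildtype(wt, ko)
print(wt_scaled, ko_mean)
```

After scaling, both profiles have the same overall mean, so remaining differences between them are local rather than due to sequencing depth.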

Page 41:

Sliding window analysis

Whole transcriptome

• Sliding window = hypothetical transcriptional unit
• Parameters:
  o Window size: 200, 100 and 20 nt
  o Threshold: log2 mean_window(FoldChange) > 2, 1.6, 1.3, 1
  o Background: (wt < 10 AND ko < 10)
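A minimal sketch of such a sliding window scan follows. Several details are assumptions, not taken from the slides: fold change is taken as ko/wt per position, a pseudocount of 1 avoids division by zero, the window moves one position at a time, and the background rule is applied to the mean coverage of the window.

```python
from math import log2
from statistics import mean


def sliding_window_hits(wt, ko, window_size=20, threshold=1.0,
                        background=10, pseudocount=1.0):
    """Return (start, score) for windows whose log2 mean fold change
    exceeds the threshold, skipping background windows."""
    hits = []
    for start in range(len(wt) - window_size + 1):
        wt_win = wt[start:start + window_size]
        ko_win = ko[start:start + window_size]
        # Background: both conditions have low coverage in this window.
        if mean(wt_win) < background and mean(ko_win) < background:
            continue
        fold_changes = [
            (k + pseudocount) / (w + pseudocount)
            for w, k in zip(wt_win, ko_win)
        ]
        score = log2(mean(fold_changes))
        if score > threshold:
            hits.append((start, score))
    return hits


# Hypothetical coverage: a 10 nt region that is higher in the knockout.
wt_cov = [5] * 10 + [20] * 10 + [5] * 10
ko_cov = [5] * 10 + [100] * 10 + [5] * 10
hits = sliding_window_hits(wt_cov, ko_cov, window_size=10, threshold=2.0)
print([start for start, _ in hits])
```

Overlapping hit windows would normally be merged into one candidate transcriptional unit, and the scan repeated per strand and per window size.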

Page 42:

Sliding window analysis

Whole transcriptome

[Figure: + strand and - strand; axes: position, index]

Page 43:

Sliding window analysis

Whole transcriptome

Page 45:

Intermezzo: post-transcriptional modifications

• 5' capping of pre-mRNA
• 3' polyadenylation of pre-mRNA

Page 46:

Intermezzo: 5' capping

- Stability
- Export out of nucleus
- Promote translation
- Splicing

Mitochondrial and chloroplast mRNA are not capped => not in CAGE

Page 47:

Intermezzo: 3' polyadenylation

Nuclear export, translation, stability

Page 48:

Gene expression profiling

General RNA-seq pipeline

• Qualitative
  o Which part of the genome is expressed, in which cells, which mRNA isoforms
• Quantitative
  o Compare across conditions, understand biological processes/mechanisms
    - Tumor vs. normal tissue
    - Knock-out vs. wild-type mouse
    - Changing nutrient conditions in yeast
    - Etc.

Page 49:

Gene expression profiling

General RNA-seq pipeline

• DeepSAGE = Digital Gene Expression
• SAGE = Serial Analysis of Gene Expression
• Tag-based: one read per transcript
  o DeepSAGE -> most 3' CATG
  o DeepCAGE -> 5' end
• DE analysis
• GSEA

Page 50:

SAGE

Gene expression profiling

• Magnetic beads capture poly(A) RNA
• cDNA synthesis with reverse transcriptase (E.coli)
• NlaIII digestion
  o Every 250 bp
  o ~99% human transcripts
• Adapter A ligation
  o Complementary overhang
  o EcoP15I recognition site
  o PCR primer site (P2)
• EcoP15I digestion
  o Asymmetric
  o 27 bp downstream from adapter A
• Adapter B ligation
  o PCR primer site (P1)
  o Sequencing initiation site

Page 51:

SAGE

• Mapping against
  o Genome
  o Exons
  o Tag library (SAGE Genie)
• No re-alignment
• No gene length bias

Page 53:

RNA interference

• siRNA
  o Silencing through methylation
• Exogenous
• Viral dsRNA
• Endogenous
• miRNA

Page 54:

microRNA

• Short
• Mature ~ 18 - 26 bases
• Difficult to map uniquely
  o 2 miRNAs may differ by 1 base
  o Adapter removal
• Reads are cut out of gel at desired length
• Mapping against miRBase
• New miRNA (target) discovery / prediction
  o miRDeep
  o miRFinder
  o etc...

Page 55:

Adapter (P2) removal

Before alignment

[Figure: the adapter 2330123231 (colorspace) is moved along the read and the read is cut where the adapter matches; labels: cut, cut, move]
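The "move and cut" idea can be sketched as an exact-match 3' adapter trimmer: slide the adapter along the read and cut at the first position where the rest of the read matches the start of the adapter. This is a simplification under stated assumptions (no mismatches allowed, a hypothetical minimum overlap of 3, and no special handling of the color at the read/adapter junction); real trimmers are more tolerant.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Cut the read where a full or partial adapter starts at its 3' end."""
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # remaining tail is too short to call it adapter
        if read[start:start + overlap] == adapter[:overlap]:
            return read[:start]
    return read


P2_ADAPTER = "2330123231"  # adapter sequence shown on the slide (colorspace)

full = trim_adapter("0123012" + P2_ADAPTER, P2_ADAPTER)   # complete adapter
partial = trim_adapter("0123012" + "2330", P2_ADAPTER)    # adapter cut short
untouched = trim_adapter("0123012", P2_ADAPTER)           # no adapter present
print(full, partial, untouched)
```

For microRNA reads this step matters because the insert (about 18 to 26 bases) is shorter than the read, so most reads run into the adapter.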

Page 56:

Adapter (P2) removal

During alignment

Page 57:

Adapter (P2) removal

During alignment

Page 58:

Adapter (P2) removal

Total number of

Read number of

Page 60:

Quantifying transcript abundance

• How many reads?
• Depends on
  o Sequencing method
  o Sequencing depth
  o Cell type

[Figure: genes detected against number of aligned reads]

Page 61:

Quantifying transcript abundance

[Figure: three exons, 5' to 3', with read counts per strand: +6 -2, +4 -1, +4 -0; total +14 -3]

• Exons
• Alternative transcripts

Page 62:

Counts

Quantifying transcript abundance

2 wildtype replicates, 2 knockout replicates

[Table: features (rows) by libraries (columns), with total tag count per library]

Page 63:

Normalization

Quantifying transcript abundance

• Technical bias
  o SOLiD barcodes, random spread
• Sample size bias
• Gene length bias
  o Proportion of significant DE genes increases with transcript length
  o Has in particular implications for the ranking of differentially expressed genes => introduces bias in gene set testing
• Normalization:
  o RPKM = reads per kb per million mapped reads

$$\mathrm{RPKM} = \frac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (kb)}}$$
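The RPKM formula translates directly into code. The function name and the example numbers below are illustrative only.

```python
def rpkm(exon_reads: int, total_mapped_reads: int, exon_length_bp: int) -> float:
    """Reads per kilobase of exon per million mapped reads."""
    mapped_millions = total_mapped_reads / 1e6
    exon_length_kb = exon_length_bp / 1e3
    return exon_reads / (mapped_millions * exon_length_kb)


# Hypothetical gene: 1000 reads on 2 kb of exon, 10 million mapped reads.
value = rpkm(exon_reads=1000, total_mapped_reads=10_000_000, exon_length_bp=2000)
print(value)

# Same read count on a gene twice as long gives half the RPKM.
longer = rpkm(exon_reads=1000, total_mapped_reads=10_000_000, exon_length_bp=4000)
print(longer)
```

Dividing by exon length corrects the gene length bias, and dividing by the number of mapped reads corrects the sample size bias listed above.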

Page 65:

RNA-seq vs. microarrays

Statistical analysis

• RNA-seq
  o Counts
  o Absolute abundance of transcript
  o All transcripts present
• Microarray
  o Hybridization signal to complementary probe
  o Relative abundance
  o Content limited
  o Cross-hybridization

Page 66:

Count data - Poisson distribution

• Discrete probability distribution
• Not continuous
• Probability of a number of events occurring in a fixed period of time
• Events occur with known average rate and independently of the time since the last event
• DESeq/edgeR: negative binomial distribution (related to Poisson)

λ = expected number of occurrences

k = number of occurrences
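With λ and k as defined above, the Poisson probability is P(k; λ) = λ^k e^(-λ) / k!. A short sketch, with an arbitrary example rate of λ = 3:

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when lam are expected."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 3.0  # hypothetical expected count for one feature
probabilities = [poisson_pmf(k, lam) for k in range(50)]

# Mean and variance computed from the distribution itself.
mean_count = sum(k * p for k, p in enumerate(probabilities))
variance = sum((k - mean_count) ** 2 * p for k, p in enumerate(probabilities))
print(round(mean_count, 6), round(variance, 6))
```

For a Poisson distribution the variance equals the mean. Biological replicates usually vary more than that, which is why DESeq and edgeR use the negative binomial, which has an extra dispersion parameter.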

Page 67:

Multiple hypothesis testing

• One test:
  H0: μ1 = μ2
  H1: μ1 <> μ2
• Multiple tests:

                    H0 true                H0 false
  H0 rejected       V (false positives)    S
  H0 not rejected   U                      T (false negatives)

  "Discoveries" = rejected H0 (V + S)

• Sensitivity vs. specificity
  o false negatives => not sensitive enough
  o false positives => not specific enough

Page 68:

Multiple hypothesis testing

• K tests at level α = 0.05
• Expect 0.05 * K false discoveries
• 1 out of 20
• K = 40,000 => 2000

Page 69:

Multiple hypothesis testing

• Bonferroni (Holm) controls the FWER (familywise error rate)
  α’ = α/k
• Benjamini–Hochberg controls the FDR
• FDR = false discovery rate
  o V false discoveries
  o V + S total discoveries
  o Expected proportion of false positives among the discoveries: V / (V + S)
• FDR is less conservative than FWER

                    H0 true   H0 false
  H0 rejected       V         S
  H0 not rejected   U         T
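The numbers and corrections on these two slides can be sketched as follows. The Benjamini–Hochberg function is a plain textbook step-up implementation, not the code of any particular package, and the example p-values are made up.

```python
def expected_false_discoveries(n_tests: int, alpha: float = 0.05) -> float:
    """Expected false positives if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that controls the FWER: alpha / k."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Adjusted p-values (step-up procedure) that control the FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(expected_false_discoveries(40_000))   # about 2000, as on the slide
print(bonferroni_alpha(40_000))             # about 1.25e-06

pvals = [0.01, 0.04, 0.03, 0.005]           # hypothetical raw p-values
padj = benjamini_hochberg(pvals)
print(padj)
```

A feature is then called significant when its adjusted p-value is below the chosen FDR, for example 0.05.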

Page 70:

Differential expression

Condition 1: wildtype vs. condition 2: knockout

design=c("wt","wt","d","d")

Statistical testing (DESeq)

etc.

Multiple testing! Correcting gives adjusted p-value

Page 71:

Differential expression

• Results: log2FC / p-value / padj
  o Fold change => log2FC

            wt    wt    ko    ko
  counts    129   113   120   108
  mean         121         114

  fold change: 121/114 = 1.06    114/121 = 0.94

  log2 1.06 = 0.08    log2 0.94 = -0.08

  Other way around: 2^0.08 = 1.06
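The worked example can be checked in a few lines. The exact value of log2(121/114) is about 0.086, which the slide gives as 0.08; swapping the conditions only flips the sign. Function and variable names are illustrative.

```python
from math import log2
from statistics import mean


def log2_fold_change(numerator_counts, denominator_counts) -> float:
    """log2 of the ratio of mean counts between two conditions."""
    return log2(mean(numerator_counts) / mean(denominator_counts))


wt = [129, 113]   # wildtype replicates from the slide
ko = [120, 108]   # knockout replicates from the slide

wt_over_ko = log2_fold_change(wt, ko)
ko_over_wt = log2_fold_change(ko, wt)
print(mean(wt), mean(ko), wt_over_ko, ko_over_wt)

# Other way around: 2 ** log2FC recovers the fold change.
print(2 ** wt_over_ko)
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, which a plain ratio (1.06 versus 0.94) is not.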

Page 72:

Differential expression

• Results:
  o M vs. A plot (minus vs. average)

M = log2(wt) - log2(ko) = log2(wt/ko) = fold change (log2FC)

A = 0.5 (log2(wt) + log2(ko)) = average expression (baseMean)

[Figure: M against A; highlighted points = padj < 0.05]
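The two plot coordinates follow directly from the definitions above. A minimal sketch; it assumes strictly positive expression values, since zero counts would need a pseudocount that the slide does not discuss.

```python
from math import log2


def ma_values(wt: float, ko: float):
    """Return (M, A): log ratio and average log expression of one feature."""
    m = log2(wt) - log2(ko)           # minus: log2 fold change
    a = 0.5 * (log2(wt) + log2(ko))   # average log2 expression
    return m, a


# Hypothetical mean expression values for three features: (wt, ko).
features = {"gene1": (8.0, 2.0), "gene2": (16.0, 16.0), "gene3": (1.0, 4.0)}
points = {name: ma_values(wt, ko) for name, (wt, ko) in features.items()}
print(points)
```

Unchanged features sit on the line M = 0, and the spread of M at low A shows how noisy fold changes are for weakly expressed features.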

Page 73:

Differential expression

• Results:
  o Hierarchical clustering of samples and genes - heatmap

9.5 ED, 7.5 ED, 8.5 ED => replicates

Top 100 varying genes (features)

• F-test / ANOVA
• Linear regression
• Variance stabilizing transformation (VST)

Page 74:

Differential expression

• Variance-stabilizing transformation
• Find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value

Page 75:

Differential expression

[Figure panels: variance stabilizing transformed, raw counts]

Page 76:

Venn diagram

Differential expression

• Intersection of differentially expressed genes

Page 77:

Gene Set Enrichment Analysis (GSEA)

• edgeR, baySeq, DESeq ...
  o Return a list of differentially expressed genes
• Biological theory is not about isolated genes
  o Typical biological research questions and hypotheses
    - About pathways
    - About biological processes
    - About areas of the genome
  o About sets of related genes
• Question
  o How to analyze RNA-seq data from a gene set perspective?

Page 78:

GSEA

• Which gene sets?
• Any defined set that has something in common
  o Pathways
    - KEGG, Reactome, BioCarta
  o Gene Ontology terms
    - Biological process, molecular function, cellular component
  o Chromosomal regions
    - Chromosome arms, cytobands, linkage peaks, genes
  o Published gene sets
    - Predictive signatures, gene lists

Page 79:

GSEA

• H0: null hypothesis
  o Genes in the gene set are as often differentially expressed as genes outside
• H1: alternative hypothesis
  o Genes in the gene set are more often differentially expressed than genes outside

Page 80:

GSEA

Scoring gene sets

Ordered gene list L:

  Gene       Adjusted p-value   Member of set G
  gene 1     3.44E-006          yes
  gene 2     1.77E-005          no
  gene 3     9.92E-005          yes
  ...
  gene 100   0.49               yes
  gene 101   0.51               no
  ...
  gene n     1                  no

Gene set G: {1, 3, ..., 100, ...}

Page 81:

GSEA

Scoring gene sets

Same ordered gene list L and gene set G as on page 80.

Define cutoff and count members of G above and below cutoff (= 0.5)

Page 82:

GSEA

Fisher’s exact test

• Define cutoff (C=100) and count members of G above and below cutoff
• Fisher’s exact test:

                         # genes above C   # genes below C   total
  # genes in set G       5 (a)             0 (b)             5 (a + b)
  # genes not in set G   1 (c)             4 (d)             5 (c + d)
  total                  6 (a + c)         4 (b + d)         10 (a + b + c + d = n)

n = a + b + c + d

Page 83:

GSEA

Fisher’s exact test

• P-value = sum of all probabilities ≤ Pcutoff

                    diff. expr. gene   non-diff. expr. gene
  in gene set       5                  0
  not in gene set   1                  4

P-value = 0.0476

Reject H0 => enriched set!
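The p-value on this slide can be reproduced from the 2x2 table. The sketch below is a from-scratch two-sided Fisher's exact test using the hypergeometric probability of each table with the same margins; it is meant for illustration, not as a replacement for a statistics library.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Sum the probabilities of all tables at most as likely as the observed."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-9):  # tolerance for float comparison
            p_value += p
    return p_value


# Table from the slide: a = 5, b = 0, c = 1, d = 4.
p_cutoff = table_probability(5, 0, 1, 4)
p_value = fisher_exact_two_sided(5, 0, 1, 4)
print(round(p_cutoff, 4), round(p_value, 4))
```

Here the observed table has probability 5/210, and exactly one other table (a = 1) is equally unlikely, so the p-value is 10/210 ≈ 0.0476.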

Page 84:

GSEA

• Alternative:
  o Chi2 test
• Tool: goseq (R package)
  o Correction for length bias
  o Not in SAGE!

Page 85:

GSEA

• Example KEGG: time series 7.5 ED vs. 9.5 ED

[Figure panels: 7.5 ED, 9.5 ED; all genes in pathway]

Page 86:

The End