INSTITUTE OF BIOSTATISTICS AND ANALYSES
UK SCIENCE & INNOVATION NETWORK, BRITISH EMBASSY
Searching for Differentially Expressed Genes
Eva Budinská
Bioinformatics Conference on Genomics and Proteomics Data Analysis
25-27.11.2009, Brno, Czech Republic
Biological/medical researcher (with an external HDD): "We need you to find the differentially expressed genes in our dataset..."
Data analyst: "Euh???"
Searching for differentially expressed genes
1. WHAT is a differentially expressed gene?
2. WHY should we search for it?
3. WHERE can we search for it?
4. HOW to find it?
5. (WHO should do that? WHY me!...)
WHAT is an expressed gene?
Gene expression:
- a gene is expressed when it is being transcribed into mRNA
- if a gene is expressed, we say it is active
- we can measure gene expression by the amount of mRNA
DNA -> mRNA (transcription ~ expression) -> Protein (translation)
WHAT is a differentially expressed gene?
Differentially expressed gene: compare two samples; the gene is expressed MORE in one sample than in the other.
[Figure: DNA -> mRNA in healthy colon tissue vs. colon cancer tissue]
WHY should we search for it?
In MEDICINE
To understand:
- the mechanism of diseases: DISEASE / HEALTHY TISSUE
- why some patients respond to the therapy and some do not: RESPONDERS / NON-RESPONDERS
=> new therapeutic targets, optimized therapy and prevention
In BIOLOGY
To study:
- mechanisms of adaptation (bacteria in extreme conditions, parasites in host organism, ...)
- ...
WHERE can we search for it?
We can measure gene expression by the amount of mRNA.
mRNA can be extracted from cells of any living organism.
In medicine:
- tissues/organs
- extracted tumors, nodes
- blood
- bone marrow
In biology:
- bacteria
- tissues of plants
- tissues/organs
HOW can we find differentially expressed genes?
Exploring gene by gene: RT-PCR, FISH
OR
Thousands of genes in one experiment: Microarrays
Microarrays
Quantify the pixel intensity of each gene in each channel (green, red)
~ numbers equivalent to the amount of mRNA
~ gene activity
~ gene expression
HOW can we find a differentially expressed gene?
[Figure: DNA -> mRNA in A. healthy colon tissue and B. colon cancer tissue]
FOLD CHANGE: B / A = 9/3 = 3
The gene is 3 times more expressed in colon cancer than in healthy colon tissue.
HOW can we find differentially expressed genes?
METHODS
Fold change rules
Hypothesis testing
Regression strategies
Fold change rules
All genes with at least a 2-fold change in expression (in either direction) are considered differentially expressed between the two samples.
Why to do it:
- EASY
Why NOT:
- Smaller changes can be biologically significant (small effects can be multiplied inside a group of genes from the same pathway)
- The data come with biological and technical variability: what about a fold change of 1.9?
- The fold changes can be biased towards zero (mix of tumor and normal cells)
- No assessment of statistical significance
=> Statistical testing
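The fold change and the 2-fold rule can be sketched in a few lines of Python. This is only an illustration: the function names and the gene measurements are hypothetical, and the rule is applied on the log2 scale so that up- and down-regulation are treated symmetrically.

```python
import math

def fold_change(expr_a, expr_b):
    """Fold change of sample B relative to sample A (both must be positive)."""
    if expr_a <= 0 or expr_b <= 0:
        raise ValueError("expression values must be positive")
    return expr_b / expr_a

def is_differential(fc, threshold=2.0):
    """Fold-change rule: FC >= threshold (up) or FC <= 1/threshold (down)."""
    return abs(math.log2(fc)) >= math.log2(threshold)

# The example from the slides: A = 3 (healthy), B = 9 (cancer)
fc_example = fold_change(3, 9)

# Hypothetical (healthy, cancer) expression values for four genes
measurements = {
    "gene_up": (3.0, 9.0),
    "gene_down": (10.0, 4.0),
    "gene_borderline": (5.0, 9.5),   # fold change 1.9, just below the cut-off
    "gene_flat": (6.0, 6.3),
}
calls = {gene: is_differential(fold_change(a, b))
         for gene, (a, b) in measurements.items()}
print(fc_example, calls)
```

The borderline gene shows the weakness named above: a fold change of 1.9 is rejected although nothing is known about its variability.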
Hypothesis testing I.
Is the mean expression of a gene in group A different from the mean expression in group B?
Conduct a statistical test for each gene g = 1, ..., m, giving test statistics T_g and corresponding p-values p_g.
Choosing a statistical test
Number of groups to compare | Data have Gaussian distribution: YES | Data have Gaussian distribution: NO
2                           | T-test                               | Mann-Whitney test
>2                          | ANOVA                                | Kruskal-Wallis test
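The decision table can be written as a small helper. This is a sketch only: the function name `choose_test` is hypothetical, and whether the data are Gaussian is assumed to be decided beforehand (for example by a normality check).

```python
def choose_test(n_groups, gaussian):
    """Return the test suggested by the decision table above.

    In SciPy the four tests are available as scipy.stats.ttest_ind,
    scipy.stats.mannwhitneyu, scipy.stats.f_oneway and scipy.stats.kruskal.
    """
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if n_groups == 2:
        return "T-test" if gaussian else "Mann-Whitney test"
    return "ANOVA" if gaussian else "Kruskal-Wallis test"

print(choose_test(2, gaussian=True))
print(choose_test(3, gaussian=False))
```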
Hypothesis testing II.
The two-sample T-test can be used to test the equality of the group means μ1, μ2:
$T_g = \frac{\hat{\mu}_{g1} - \hat{\mu}_{g2}}{s_g \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
where s_g measures the variability of gene g and n_1, n_2 are the group sizes.
The p-value p_g is the probability, under the null hypothesis (here: μ1 = μ2), that the test statistic T is at least as extreme as the observed value T_g: Pr(|T| ≥ |T_g|) = p_g.
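A minimal sketch of the two-sample t-statistic for one gene, assuming that s_g is the pooled standard deviation of the two groups (the slide only labels it "variability"). The expression values are hypothetical.

```python
from math import sqrt
from statistics import mean

def pooled_t_statistic(group1, group2):
    """T = (mean1 - mean2) / (s * sqrt(1/n1 + 1/n2)), s = pooled std. dev."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    s = sqrt((ss1 + ss2) / (n1 + n2 - 2))   # assumed definition of s_g
    return (m1 - m2) / (s * sqrt(1 / n1 + 1 / n2))

# Hypothetical expression values of one gene in two groups of four samples
healthy = [3.0, 2.5, 3.5, 3.0]
cancer = [9.0, 8.0, 10.0, 9.0]
t_g = pooled_t_statistic(cancer, healthy)
print(t_g)
```

The p-value is then obtained by comparing T_g with a t-distribution with n_1 + n_2 - 2 degrees of freedom, which a statistics library would normally provide.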
Multiple hypothesis testing problem
Thousands of genes on a microarray slide
=> thousands of hypotheses are tested simultaneously
=> increased chance of false positives
Example: 10 000 genes on a chip, none of them differentially expressed => we still expect 0.05 x 10 000 = 500 genes with p-value < 0.05.
p-values < 0.05 no longer correspond to significant findings.
We need to ADJUST for this multiple testing problem.
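The example can be checked with a small simulation. It relies on the fact that p-values are uniformly distributed on [0, 1] when the null hypothesis is true; the seed and variable names are arbitrary choices for this sketch.

```python
import random

random.seed(0)   # fixed seed so that the run is reproducible
m = 10_000       # genes on the chip, none differentially expressed
alpha = 0.05

# Under the null hypothesis each p-value is a uniform random number
p_values = [random.random() for _ in range(m)]
false_positives = sum(p < alpha for p in p_values)

expected = alpha * m
print(expected, false_positives)
```

The simulated count should land close to the expected 500, all of them false positives.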
Adjustment for multiple hypothesis testing problem
                            | # non-rejected                      | # rejected
# non-diff. expressed genes | True Negatives (TN)                 | False Positives (FP), Type I error
# diff. expressed genes     | False Negatives (FN), Type II error | True Positives (TP)
Type I error rates
1. Family-wise error rate (FWER): the probability of at least one Type I error (false positive): FWER = Pr(FP > 0)
2. False discovery rate (FDR) (Benjamini & Hochberg, 1995): the expected proportion of false positives among all rejected hypotheses.
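To get a feeling for the two error rates, the sketch below computes the FWER for m independent tests when all null hypotheses are true (then FWER = 1 - (1 - α)^m) and the proportion of false positives among the rejected hypotheses for given counts. Both function names are hypothetical, and the independence assumption is ours.

```python
def fwer_independent(alpha, m):
    """Pr(at least one false positive) for m independent true nulls."""
    return 1 - (1 - alpha) ** m

def false_discovery_proportion(fp, tp):
    """Share of false positives among all rejected hypotheses."""
    rejected = fp + tp
    return fp / rejected if rejected else 0.0

print(fwer_independent(0.05, 1))
print(fwer_independent(0.05, 100))
print(false_discovery_proportion(fp=10, tp=90))
```

The FDR is the expected value of this proportion over repeated experiments.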
Adjusting p-values
Controlling the Family-Wise Error Rate (FWER): Bonferroni correction (for independent tests)
Reject when $p < \alpha / m$ (e.g. p < 0.05 / 10 000)
Controlling the False Discovery Rate (FDR): Benjamini/Hochberg procedure
Ordered unadjusted p-values: P(1) ≤ ... ≤ P(m)
To control the FDR at level α, find the largest k such that $P_{(k)} \le \frac{k}{m} \alpha$
Reject the hypotheses H(j) for j = 1, ..., k.
FDR = 10% (out of 100 rejected hypotheses we can expect 10 false positives)
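Both procedures are short enough to sketch directly. The function names and the ten p-values are hypothetical; the logic follows the two rules stated above.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i when p_i < alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with P(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

# Hypothetical unadjusted p-values for ten genes
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni_reject(p)
bh = benjamini_hochberg_reject(p)
print(sum(bonf), sum(bh))
```

On this toy list Bonferroni keeps one gene and Benjamini/Hochberg keeps two, which illustrates that FDR control is the less conservative choice.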
Adjustment for multiple hypothesis testing problem
FWER: if we want ALL selected genes to be truly significant. However, many differentially expressed genes may not appear significant.
FDR: if we prefer to pick up the majority of differentially expressed genes and can tolerate some false positives.
Significance analysis of microarrays (SAM)
- Tusher, Tibshirani and Chu (2001)
- Permutation algorithm for False discovery rate (FDR) estimation.
- Based on a modified t-statistic:
  $d_i = \frac{\hat{\mu}_{i1} - \hat{\mu}_{i2}}{s_i + s_0}$
  where s_0 is a tuning constant (adjustment for variability in data)
- The statistical significance of the observed score d_i is then assessed by permuting the original data and calculating the expected score d_e (d score distribution).
- A gene is considered statistically significant when |d_i - d_e| > Δ.
- Advantage: easy to use, methodologically simple
- Disadvantage: computationally intensive, high memory requirements
[Figure: observed score d_i plotted against expected score d_e]
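The idea behind SAM can be sketched as follows. This is a simplified illustration, not the published algorithm: the function names, the toy data and the values of s_0 and Δ are assumptions, the expected scores are averages of the ordered permutation scores, and the FDR estimation step is left out.

```python
import random
from math import sqrt
from statistics import mean

def d_statistic(values, labels, s0):
    """Modified t-statistic d = (mean1 - mean2) / (s + s0) for one gene."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 0]
    n1, n2 = len(g1), len(g2)
    m1, m2 = mean(g1), mean(g2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def sam_sketch(data, labels, s0=0.1, delta=2.0, n_perm=200, seed=0):
    """Return {gene: (d_i, d_e, significant)} using label permutations."""
    rng = random.Random(seed)
    observed = {gene: d_statistic(v, labels, s0) for gene, v in data.items()}
    ranked = sorted(observed, key=observed.get)      # genes ordered by d_i
    sums = [0.0] * len(ranked)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)                            # permute group labels
        d_perm = sorted(d_statistic(v, perm, s0) for v in data.values())
        sums = [total + d for total, d in zip(sums, d_perm)]
    expected = [total / n_perm for total in sums]    # d_e for each rank
    return {gene: (observed[gene], d_e, abs(observed[gene] - d_e) > delta)
            for gene, d_e in zip(ranked, expected)}

# Hypothetical data: 4 samples of group 1 followed by 4 samples of group 0
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = {
    "gene_diff": [9.1, 8.7, 9.4, 9.0, 3.1, 2.8, 3.3, 3.0],
    "gene_a": [5.0, 5.2, 4.9, 5.1, 5.1, 4.8, 5.0, 5.2],
    "gene_b": [7.3, 7.0, 7.4, 7.1, 7.2, 7.3, 6.9, 7.2],
    "gene_c": [2.0, 2.4, 2.1, 2.2, 2.3, 2.0, 2.2, 2.1],
}
result = sam_sketch(data, labels)
print(result["gene_diff"])
```

The permutation loop recomputes every gene score in each round, which is where the computational cost mentioned above comes from.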
Regression strategies
When more than one variable can affect the gene expression:
  gene expression ~ group + age + gender
  => Linear modelling
When we want to find out how much the gene expression changes when the value of some continuous variable changes:
  gene expression ~ overall survival
  gene expression ~ age
  => Linear modelling, Cox proportional hazards model
When we want to find the probability that the sample belongs to a certain group given the expression level of a gene:
  => Logistic regression
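As a minimal sketch of the simplest case, gene expression ~ age, the code below fits a straight line by ordinary least squares. The function name and the data are hypothetical; models with several covariates, survival times or a binary outcome would normally be fitted with a statistics package.

```python
from statistics import mean

def simple_linear_fit(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical patients: age in years and expression of one gene
age = [30, 40, 50, 60, 70]
expression = [4.0, 4.5, 5.0, 5.5, 6.0]
intercept, slope = simple_linear_fit(age, expression)
print(intercept, slope)
```

The slope answers the question posed above: how much the expression changes per one-year increase in age.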
Searching for differentially expressed genes
Number of factors: 1
  Number of groups to compare: 2
    Data have Gaussian distribution: YES => T-test, Linear models, SAM
    Data have Gaussian distribution: NO => Mann-Whitney test, SAM
  Number of groups to compare: >2
    Data have Gaussian distribution: YES => ANOVA, Linear models, SAM
    Data have Gaussian distribution: NO => Kruskal-Wallis test, SAM
  Continuous response variable => Linear models, Cox proportional hazards models (survival times)
Number of factors: >1 => Linear models
What to do with a list of differentially expressed genes?
Ad-hoc pathway analysis
Clustering genes in order to determine the groups of genes
Clustering samples for control purposes
Compare to other datasets (meta-analysis)
Microsatellite instability (MSI) in colon cancer
• MSI tumors:
– are observed in ~15 % of sporadic colon cancers
– show high microsatellite instability
  • due to epigenetic silencing of mismatch repair genes (hypermethylation of MLH1, MSH2, MSH6)
– show an increased immune response
  • infiltration of tumor epithelium by T-lymphocytes
  • increased expression of HLA class genes
– have better survival
Differential gene-expression analysis of MSI vs. MSS
• Aim:
  • Find MSI gene expression signatures stable across different datasets
• We have analyzed 3 publicly available datasets
  • Affymetrix HG-U133_Plus_2 (54675 probesets)
• Analysis:
  • Significance analysis of microarrays SEPARATELY on each of the datasets
  • Compared lists of differentially expressed genes at FDR = 10%
  • Genes significant in ALL THREE DATASETS were submitted to KEGG pathway analysis
Dataset MSI/MSS
GSE4554 33/51 (39.3% / 60.7%)
GSE13067 11/62 (15.1% / 84.9%)
GSE13294 78/77 (50.3% / 49.7%)
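The comparison step of the analysis, keeping only genes that are significant in all three datasets, amounts to a set intersection. The sketch below uses hypothetical dataset names, gene names and adjusted p-values, not the results of this study.

```python
def significant_genes(adj_p_values, fdr=0.10):
    """Genes whose adjusted p-value is at or below the FDR threshold."""
    return {gene for gene, p in adj_p_values.items() if p <= fdr}

# Hypothetical adjusted p-values from three separate SAM analyses
datasets = {
    "dataset_1": {"gene_a": 0.003, "gene_b": 0.20, "gene_c": 0.01},
    "dataset_2": {"gene_a": 0.001, "gene_b": 0.04, "gene_c": 0.09},
    "dataset_3": {"gene_a": 0.000, "gene_b": 0.00, "gene_c": 0.30},
}

# Keep only genes that pass the threshold in every dataset
common = set.intersection(*(significant_genes(p) for p in datasets.values()))
print(common)
```

Only `gene_a` passes in every dataset here; the resulting list is what would be passed on to the pathway analysis.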
SAM results
[Figure: SAM results for GSE4554, GSE13067 and GSE13294]
685 differentially expressed genes in all three datasets at FDR <= 10%
TOP 15 DOWN-regulated genes in MSI (significant in all 3 datasets)
Gene Symbol | Gene Title | GSE4554: adj.p.val logFCH | GSE13067: adj.p.val logFCH | GSE13294: adj.p.val logFCH
TNNC2 troponin C type 2 (fast) 0.003 -1.01 0.003 -0.49 0.000 -1.12
7A5 metastasis associated in colon cancer 1 0.000 -1.33 0.019 -1.32 0.000 -1.54
ZMYND8 zinc finger, MYND-type containing 8 0.003 -0.88 0.000 -1.04 0.000 -0.98
RNF43 ring finger protein 43 0.007 -0.91 0.000 -1.60 0.000 -1.30
SYT7 synaptotagmin VII 0.306 -0.49 0.065 -0.40 0.000 -0.86
TSPAN6 tetraspanin 6 0.007 -0.93 0.000 -1.50 0.000 -1.01
ASCL2 achaete-scute complex homolog 2 0.002 -1.30 0.001 -1.90 0.000 -1.87
TDGF1 teratocarcinoma-derived growth factor 1 0.000 -1.87 0.000 -2.61 0.000 -2.32
ATP9A ATPase, class II, type 9A 0.000 -1.20 0.000 -1.40 0.000 -1.19
GABRE GABA A receptor, epsilon 0.062 -0.70 0.022 -0.89 0.000 -1.25
PROX1 prospero homeobox 1 0.004 -1.11 0.021 -1.15 0.000 -1.23
VIL1 villin 1 0.034 -0.76 0.057 -0.55 0.000 -1.09
NOX1 NADPH oxidase 1 0.014 -1.17 0.001 -2.44 0.000 -2.03
PLAGL2 pleiomorphic adenoma gene-like 2 0.007 -0.91 0.000 -1.49 0.000 -0.88
A1CF APOBEC1 complementation factor 0.160 -0.52 0.002 -1.18 0.000 -1.27
TOP 15 UP-regulated genes in MSI (significant in all 3 datasets)
Gene Symbol | Gene Title | GSE4554: adj.p.val logFCH | GSE13067: adj.p.val logFCH | GSE13294: adj.p.val logFCH
KDELR3 Homo sapiens KDEL endoplasmic reticulum protein 0.002 1.01 0.000 0.90 0.000 0.90
TRIB2 tribbles homolog 2 (Drosophila) 0.000 1.33 0.000 1.41 0.000 1.69
TFAP2A Homo sapiens AP-2 gene for transcription factor AP-2 0.000 1.52 0.000 2.35 0.000 1.78
TRIM7 tripartite motif-containing 7 0.000 2.46 0.000 1.55 0.000 2.01
KCNK1 potassium channel, subfamily K, member 1 0.041 0.79 0.002 1.33 0.000 0.86
CTSE cathepsin E 0.077 0.96 0.043 1.53 0.000 1.18
CATSPERB cation channel, sperm-associated, beta 0.151 0.55 0.109 0.27 0.000 1.03
DUSP4 dual specificity phosphatase 4 0.000 1.90 0.000 2.73 0.000 1.83
CCDC68 coiled-coil domain containing 68 0.030 0.76 0.002 1.17 0.000 1.29
LSMD1 LSM domain containing 1 0.001 1.03 0.000 1.00 0.000 0.86
SECTM1 secreted and transmembrane 1 0.006 0.94 0.105 0.59 0.000 1.30
LMO4 LIM domain only 4 0.018 0.75 0.001 1.03 0.000 0.79
CD55 Homo sapiens decay accelerating factor for complement (CD55, Cromer blood group system) (DAF) gene, complete cds. 0.024 0.85 0.029 0.94 0.000 1.28
SPATA18 spermatogenesis associated 18 homolog (rat) 0.022 0.92 0.007 1.22 0.000 0.96
RPL22L1 ribosomal protein L22-like 1 0.000 1.32 0.000 1.80 0.000 1.68
Immune response in MSI
– MSI-H is associated with a higher frequency of activated tumour-infiltrating lymphocytes