Bioinformatics Laboratory
Objectives
• Understand the approaches by which microarray technology can be used to identify:
– Prognostic/diagnostic markers of disease
Example
• We will analyze how molecular markers of disease are identified by dissecting the approach presented in:
– PNAS 2005, 102:11023-28
– PNAS 2007, 104:14424-29
The biological question
• Huntington’s disease (HD) is an autosomal dominant disorder caused by an expansion of glutamine repeats in the ubiquitously distributed huntingtin protein.
• Mutant huntingtin interferes with the function of widely expressed transcription factors, suggesting that gene expression may be altered in a variety of tissues in HD, including peripheral blood.
• Highly quantitative biomarkers of neurodegenerative disease remain an important need in the urgent quest for disease-modifying therapies.
• For Huntington’s disease (HD), a genetic test is available (trait marker), but necessary state markers are still in development.
• Tested hypothesis:
– Two studies exist:
  • Borovecki et al.: detection of biomarkers by profiling whole blood from HD patients (hd), presymptomatic HD patients (pre) and normal donors (n).
  • Runne et al.: detection of biomarkers by profiling lymphocytes from HD patients (hd) and normal donors (n).
– Is it possible to identify disease biomarkers using these data sets?
Experimental groups
• Borovecki:
– HD group:
  • 12 HD-affected (stage I-II) subjects
  • 5 early presymptomatic carriers of the gene mutation, as determined by genetic testing
– Normal group:
  • 14 healthy control subjects
– Platform: Affymetrix hgu133a
• Runne:
– HD group:
  • 12 moderate-stage HD subjects
– Normal group:
  • 10 healthy control subjects
– Platform: Affymetrix hgu133plus2
Experimental design
Recognition and statement of the problem
• The problem should be specified precisely enough, and the conditions under which the experiment will be performed understood well enough, that an appropriate design for the experiment can be selected.
Example
• We are investigating the effect of a drug, measured by BrdU incorporation, at three concentrations (10 nM, 100 nM, 1 µM) on 3 different tumor cell lines (CL).
• In this example there are two factors:
– CL, a qualitative factor with 3 levels
– Drug concentration, a quantitative factor with 3 levels
Identify the factors involved in the Borovecki study
• The study includes:
– HD patients, preHD patients and normal donors
• How many factors are involved?
– 1
• Which one?
– Patients
• Are the factors quantitative or qualitative?
– Qualitative
• How many levels are there?
– 3 (HD, preHD, N)
[Figure: one factor with three levels: HD, preHD, N]
How can I obtain the experimental data?
• International journals now commonly require, as a condition for accepting an article, that the data be deposited in public databases:
– Europe: ArrayExpress
– USA: GEO
• The data can be downloaded as:
1. a tab-delimited (Excel-like) file containing all the information about the experiment
2. the array images (in this case the Affymetrix .CEL files)
[Figure: header of a matrix series file]
Affymetrix GeneChips
[Figure: a gene sequence interrogated by probe pairs; each probe pair consists of a perfect match (PM) cell and a mismatch (MM) cell, and the probe pairs for one gene form a probe set]
PM: ACCAGATCTGTAGTCCATGCGATGC
MM: ACCAGATCTGTAATCCATGCGATGC
Dedicated software is needed to analyze microarray data
• Microarray data cannot be analyzed with a simple Excel spreadsheet; they require fairly sophisticated statistical tools.
• Both commercial and open-source software exist.
• In this course the exercises will use an open-source software:
– Bioconductor
Bioconductor
[Figure: Bioconductor and platform-specific devices]
Analysis pipeline
[Figure: pipeline steps: Sample preparation, Array fabrication, Hybridization, Scanning + image analysis, Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]
How do we start analyzing the data?
• If the .CEL files are available, a thorough quality control is performed.
• If the .CEL files are missing and only the matrix series file is available, a more limited set of quality controls can be performed.
Analysis pipeline
[Figure: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]
Why perform quality controls (QC)?
• QC is a very important step in the analysis of microarray data.
• The number of available experiments is usually limited, so one or more arrays carrying many experimental artifacts could invalidate the analysis.
• QC identifies outlier arrays and lets the researcher decide whether they need to be removed.
Quality control to identify outlier arrays
• When only the matrix series file (MSF) is available, outlier arrays are assessed by inspecting:
– Box plots of the intensity distributions of the arrays.
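The numbers behind a box plot can also be inspected directly. The following is a minimal sketch, assuming each array is available as a list of log2 intensities; the array names, the toy values and the "IQR below half the median IQR" rule are illustrative assumptions, not part of the course material.

```python
import statistics

def box_stats(values):
    """Return (q1, median, q3, iqr): the quartiles a box plot draws."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, med, q3, q3 - q1

def narrow_arrays(arrays, fraction=0.5):
    """Flag arrays whose IQR is below `fraction` times the median IQR.

    The cutoff rule is a hypothetical choice made for this sketch.
    """
    iqrs = {name: box_stats(vals)[3] for name, vals in arrays.items()}
    typical = statistics.median(iqrs.values())
    return sorted(name for name, iqr in iqrs.items() if iqr < fraction * typical)

# Toy log2 intensities for three hypothetical arrays.
arrays = {
    "a1": [6.0, 7.0, 8.0, 9.0, 10.0],
    "a2": [6.2, 7.1, 8.1, 9.2, 10.1],
    "a3": [7.9, 8.0, 8.1, 8.2, 8.3],  # very narrow distribution
}
flagged = narrow_arrays(arrays)
print(flagged)
```

An array flagged this way is only a candidate outlier; the decision to remove it remains with the researcher.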
Quality control to assess the homogeneity of the experimental groups
• Principal component analysis
• Hierarchical clustering
Principal component analysis
• Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• The first principal component accounts for as much of the variability in the data as possible.
• Each succeeding component accounts for as much of the remaining variability as possible.
• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
[Figure: data cloud with the PC1 and PC2 axes]
• The 2nd PC is orthogonal to the 1st.
• In many data sets the first three components account for most of the variability, so the PCA result can reasonably be displayed in 3D space.
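The properties listed above (decreasing variance, orthogonal components) can be checked on a toy example. This is a sketch under stated assumptions: it uses power iteration with deflation on the covariance matrix, which is one simple way to obtain the leading components and is not necessarily what oneChannelGUI or Bioconductor use internally; the data values are invented.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (samples x variables)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    p = len(matrix)
    start = [float(k + 1) for k in range(p)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(p) for j in range(p))
    return eigenvalue, v

def principal_components(rows, k=2):
    """First k (variance, direction) pairs, removing each found component."""
    matrix = covariance_matrix(rows)
    components = []
    for _ in range(k):
        eigenvalue, v = leading_eigenpair(matrix)
        components.append((eigenvalue, v))
        p = len(v)
        # Deflation: subtract the variance explained by this component.
        matrix = [[matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
                  for i in range(p)]
    return components

# Five hypothetical samples measured on three correlated variables.
samples = [
    [2.0, 1.9, 0.50],
    [4.0, 4.2, 0.40],
    [6.0, 5.8, 0.60],
    [8.0, 8.1, 0.50],
    [10.0, 9.9, 0.45],
]
(var1, pc1), (var2, pc2) = principal_components(samples)
dot = sum(a * b for a, b in zip(pc1, pc2))
print(var1, var2, dot)
```

In this toy data set the first two variables move together, so nearly all the variance falls on the first component.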
Hierarchical Clustering (HCL)
• HCL is a clustering method that can be agglomerative or divisive.
• The iterative process continues until all groups are connected in a hierarchical tree.
Hierarchical Clustering (agglomerative)
[Figure: samples s1 to s8 are reordered step by step as the most similar groups are joined]
• s1 is most like s8
• s4 is most like {s1, s8}
• s5 is most like s7
• {s5, s7} is most like {s1, s4, s8}
Hierarchical Tree
[Figure: resulting hierarchical tree]
Modified from TMEV presentation (www.tigr.org)
Hierarchical Clustering
• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.
• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.
Agglomerative Linkage Methods
• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
• Three linkage methods that are commonly used are:
– Single Linkage
– Average Linkage
– Complete Linkage
Modified from TMEV presentation (www.tigr.org)
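The three linkage rules differ only in how the distance between two clusters is derived from the pairwise sample distances. A minimal sketch of agglomerative clustering follows, assuming Euclidean distance between samples; the sample names and coordinates are invented for illustration.

```python
import math

def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [math.dist(points[a], points[b]) for a in cluster_a for b in cluster_b]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, method="average"):
    """Join the two closest clusters repeatedly; return the list of merges."""
    clusters = [[name] for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], points, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical one-dimensional samples.
points = {"s1": (0.0,), "s8": (0.1,), "s4": (0.5,), "s2": (5.0,), "s3": (5.2,)}
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
for left, right, distance in single:
    print(left, right, round(distance, 2))
```

With these values s1 joins s8 first, then s4 joins {s1, s8}; the last merge happens at a shorter distance under single linkage than under complete linkage.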
[Figure: clustering example] t4 is clearly an outlier!
Exercise
• Use the target file target.GSE8762.classif.txt and the file esperimental.design.names.gse8762.txt to evaluate with PCA the behavior of the factors disease status and gender in the data set under study.
Exercise
• Open R
• Load oneChannelGUI
• Start a new project:
– Change the working directory to dataset.huntington
– Load the target file
– Set the project name to: ronne
Exercise
• Starting from the data set you have loaded:
– Check the data box plots
• Answer the following questions:
– Is there any array characterized by a very narrow probe intensity distribution?
  • YES (which? …………………………….)  NO
– Is there any array that is markedly different from the others?
  • YES (which? …………………………….)  NO
Exercise
• Inspect whether the experimental groups of our ronne data set (HD, N) are relatively homogeneous using PCA and hierarchical clustering.
• Is it easy to discriminate on the basis of disease status?
– Yes
– No
Analysis pipeline
[Figure: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]
Summarizing the individual probe data into a single value per probe set
• Analysis steps:
– Calculating probe set summaries:
  • RMA
  • GCRMA
– Normalization:
  • Quantile method
• Fluorescence intensity is expressed as log2(intensity).
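The idea behind the quantile method can be shown on a small matrix: every array is forced onto the same intensity distribution, namely the mean of the sorted values across arrays. This is a simplified sketch that ignores ties and is not the RMA implementation; the function names and values are assumptions made for illustration.

```python
import math

def log2_transform(arrays):
    """Express raw intensities on the log2 scale."""
    return [[math.log2(v) for v in arr] for arr in arrays]

def quantile_normalize(arrays):
    """Give every array the same distribution (mean of the sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(arr) for arr in arrays]
    # Mean intensity at each rank across all arrays.
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for arr in arrays:
        order = sorted(range(n), key=lambda idx: arr[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # value keeps its rank within the array
        normalized.append(out)
    return normalized

# Three hypothetical arrays with four probe sets each (already on log2 scale).
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
for row in normalized:
    print([round(v, 2) for v in row])
```

After normalization the arrays contain the same set of values; only the assignment of values to probe sets differs, following each array's original ranking.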
Brief summary of probe set intensity calculation
• The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take nonspecific probe hybridization into account when calculating the probe set background.
• GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).
Why Normalization?
To remove systematic biases, which include:
• Sample preparation
• Variability in hybridization
• Spatial effects
• Scanner settings
• Experimenter bias
Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray
Analysis pipeline
[Figure: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]
Multiple testing errors
• When performing multiple statistical tests, two types of errors can occur:
– Type I error (false positive)
– Type II error (false negative)
• Reducing type I errors increases the number of type II errors.
• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).
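The scale of the problem is easy to quantify. The sketch below assumes independent tests in which no gene is truly changed; the per-test threshold of 0.05 is an illustrative choice, and 22300 is the probe set count reported elsewhere in this deck.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of type I errors when every null hypothesis is true."""
    return n_tests * alpha

def familywise_error_rate(n_tests, alpha):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(round(expected_false_positives(22300, 0.05)))
print(familywise_error_rate(100, 0.05))
```

Testing 22300 probe sets at 0.05 would yield on the order of a thousand false positives by chance alone, which is why filtering and FDR control are applied before interpreting a gene list.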
Filtering can be performed at various levels:
• Annotation features:
– Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.)
• Signal features:
– % of intensities greater than a user-defined value
– Interquartile range (IQR) greater than a defined value
Intensity distributions
[Figure: intensity distributions after RMA and GCRMA, with the background-level probe sets marked]
How to define the efficacy of a filtering procedure?
$$\text{enrichment} = 100 \times \frac{N^{\text{AfterFiltering}}_{\text{spike-in}} \times N_{\text{probesets}}}{N^{\text{AfterFiltering}}_{\text{probesets}} \times N_{\text{spike-in}}}$$
• This enrichment is very similar to that used to evaluate the purification fold of a protein after a chromatographic step:
$$\text{enrichment} = 100 \times \frac{E_{\text{AfterChrom}} \times mgP_{\text{BeforeChrom}}}{mgP_{\text{AfterChrom}} \times E_{\text{BeforeChrom}}}$$
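The enrichment can be checked against the figures reported in this deck (22300 probe sets and 42 spike-ins before filtering; 5553, 68 or 244 probe sets after filtering, with all 42 spike-ins retained). The function name and argument names below are illustrative.

```python
def enrichment(spike_after, probesets_after, spike_before, probesets_before):
    """Percent enrichment of spike-ins among the probe sets kept by a filter."""
    fraction_after = spike_after / probesets_after
    fraction_before = spike_before / probesets_before
    return 100.0 * fraction_after / fraction_before

for kept in (22300, 5553, 244, 68):
    print(kept, int(enrichment(42, kept, 42, 22300)))
```

Truncating the results to whole percentages reproduces the values shown on the slides: 100%, 401%, 9139% and 32794%.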
Filtering by genefilter pOverA
(keep a probe set if ≥ 25% of its intensities are ≥ log2(100))
[Figure: before filtering: 22300 probe sets, 42/42 spike-ins, enrichment 100%; after filtering: 5553 probe sets, 42/42 spike-ins, enrichment 401%]
Filtering by interquartile range
[Figure: intensity distribution with the IQR spanning the 25% to 75% quantiles]
How does filtering by genefilter IQR work?
• The distribution of all intensity values of a differential expression experiment is the summary of the distributions of each gene's expression over the experimental conditions.
• The filter removes genes that show little change across the experimental points.
Filtering by genefilter IQR
(remove a probe set if its intensity IQR is below the threshold: 0.25, 0.5)
[Figure: before filtering: 22300 probe sets, 42/42 spike-ins, enrichment 100%; after filtering: 244 probe sets, 42/42 spike-ins, enrichment 9139%, and 68 probe sets, 42/42 spike-ins, enrichment 32794%]
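Both signal filters reduce to a per-probe-set test on its row of log2 intensities. The sketch below imitates the two rules in plain Python; it does not call the Bioconductor genefilter package, and the probe set names and values are invented.

```python
import math
import statistics

def passes_p_over_a(values, p=0.25, a=math.log2(100)):
    """True if at least a fraction p of the values is >= a."""
    return sum(v >= a for v in values) / len(values) >= p

def iqr(values):
    """Interquartile range (75% quantile minus 25% quantile)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def filter_probesets(data, iqr_cutoff=0.5):
    """Keep probe sets that are expressed and that vary across samples."""
    return {name: values for name, values in data.items()
            if passes_p_over_a(values) and iqr(values) >= iqr_cutoff}

# Hypothetical log2 intensities over six arrays.
data = {
    "ps_flat": [8.0, 8.1, 8.0, 8.1, 8.0, 8.1],      # expressed but unchanged
    "ps_low": [3.0, 3.5, 4.0, 5.0, 5.5, 6.0],       # background level
    "ps_changing": [7.0, 7.2, 7.1, 9.0, 9.3, 9.1],  # expressed and variable
}
kept = filter_probesets(data)
print(sorted(kept))
```

Neither filter uses the group labels, so applying them before the statistical test does not bias the selection toward one experimental group.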
Exercise
• Load the Borovecki data starting from the matrix series file, using:
– target.GSE1751.hd.n.txt
• Evaluate with PCA/HCL how the samples separate.
• Apply an interquartile (IQR) filter at 0.25 and at 0.5.
– How many transcripts remain after each filter?
– Are the two groups still separable with PCA and HCL?
• Apply an interquartile (IQR) filter at 0.5 and an intensity filter (50% > 100).
– What happens to the data distribution?
– Are the data still separable with PCA and HCL?
Analysis pipeline
[Figure: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]
Statistical analysis
• The sensitivity of statistical tests is affected by the number of available replicates.
• Replicates can be:
– Technical
– Biological
• Biological replicates better summarize the variability of samples belonging to a common group.
• The minimum number of replicates is an important issue!
Fold change filtering
• The intensity change between experimental groups (e.g. control versus treated) is known as:
– Fold change.
• Frequently an arbitrary threshold is used to define a significant differential expression:
$$\left|\log_2 \frac{Trtd}{Ctrl}\right| \geq 1$$
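Because the intensities are already on the log2 scale, the log2 ratio becomes a difference of group means. A minimal sketch, assuming replicate values per group and a threshold of 1 (a two-fold change); the function names and values are illustrative.

```python
import statistics

def log2_fold_change(treated_log2, control_log2):
    """Mean log2 intensity of treated minus mean log2 intensity of control."""
    return statistics.mean(treated_log2) - statistics.mean(control_log2)

def passes_fold_change(treated_log2, control_log2, threshold=1.0):
    """True if the absolute log2 fold change reaches the threshold."""
    return abs(log2_fold_change(treated_log2, control_log2)) >= threshold

print(log2_fold_change([9.0, 9.0], [8.0, 8.0]))    # two-fold up
print(passes_fold_change([8.2, 8.4], [8.0, 8.0]))  # below the threshold
```

The difference of mean log2 values corresponds to the log2 ratio of geometric means, so it is less sensitive to a single very bright replicate than a ratio of raw means.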
Statistical analysis
• Intensity changes between experimental groups (e.g. control versus treated) are known as:
– Fold change.
– Ranking genes based on fold change alone implicitly assigns equal variance to every gene.
• Fold change alone is not sufficient to indicate the significance of the expression changes.
• Fold change has to be supported by statistical information.
Statistical validation
• Statistical validation can be performed using parametric and non-parametric tests.
• Parametric tests:
– The populations under analysis are assumed to be normally distributed.
• Non-parametric tests:
– No assumption is made about the distribution of the samples.
• Non-parametric tests are generally less sensitive than parametric tests.
Selecting differentially expressed genes
[Figure: the set of genes whose differential expression is linked to a specific biological event, partially overlapped by the gene sets selected by statistical validation methods I, II and III]
• Each method grasps some true signals but not all.
• Each method catches some false signals.
• The trick is to find the best condition to maximize true signals while minimizing false ones.
SAM (Significance Analysis of Microarrays) (Tusher et al. 2001)
• A fudge factor regularizes the t-statistic by inflating the denominator.
• s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
• Two-class unpaired: picks out genes whose mean expression level is significantly different between two groups of samples (analogous to a between-subjects t-test).
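The regularized statistic for the two-class unpaired case can be sketched as the difference of group means divided by s(i) + s0, where s0 is the fudge factor. The code follows the definition of s(i) given by Tusher et al. (2001); treating s0 as a user-supplied number is a simplification assumed here, since SAM estimates it from the data.

```python
import math
import statistics

def sam_d(control, treated, s0=0.0):
    """Relative difference d(i) for one gene, two-class unpaired design.

    s0 is the fudge factor; it is passed in by hand in this sketch.
    With s0 = 0 and identical replicates the denominator is zero.
    """
    n1, n2 = len(control), len(treated)
    m1, m2 = statistics.mean(control), statistics.mean(treated)
    sum_sq = sum((x - m1) ** 2 for x in control) + sum((x - m2) ** 2 for x in treated)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    s = math.sqrt(a * sum_sq)  # gene-specific pooled scatter s(i)
    return (m2 - m1) / (s + s0)

print(sam_d([1, 2, 3], [4, 5, 6]))          # plain t-like statistic
print(sam_d([1, 2, 3], [4, 5, 6], s0=1.0))  # shrunk by the fudge factor
```

A positive s0 prevents genes with very small variance, often low-intensity genes, from reaching large scores purely because their denominator is close to zero.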
SAM design in oneChannelGUI
• SAM uses data permutations to define a set of significantly differentially expressed genes.
[Figure: original class labels (N N N T T T) and examples of permuted label assignments]
How does SAM calculate the False Discovery Rate for a specific delta?
• FDR is given by p0 × False / Called, where p0 is the prior probability (π0) that a gene is not differentially expressed.
[Figure: table of false calls for permutations 1, 2, 3, 4, … with their mean ("Mean false"); the value 720 also appears]
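The ratio p0 × False / Called can be illustrated with a deliberately simplified sketch. Here genes are called when the absolute score reaches a fixed cutoff, whereas SAM itself derives asymmetric cutoffs from delta; the function name, the scores and the default p0 = 1 are assumptions made for illustration.

```python
def permutation_fdr(observed, permuted, cutoff, p0=1.0):
    """Estimate FDR as p0 * (mean false calls in permutations) / (genes called).

    observed: scores from the real labels, one per gene.
    permuted: one list of scores per label permutation.
    """
    called = sum(1 for d in observed if abs(d) >= cutoff)
    false_counts = [sum(1 for d in scores if abs(d) >= cutoff) for scores in permuted]
    mean_false = sum(false_counts) / len(false_counts)
    if called == 0:
        return 0.0
    return min(1.0, p0 * mean_false / called)

observed = [5.0, 4.0, 0.5, 0.2]
permuted = [[1.0, 0.3, 0.2, 0.1], [4.5, 0.2, 0.1, 0.3]]
print(permutation_fdr(observed, permuted, cutoff=3.0))
```

In this toy case two genes are called and the permutations produce 0 and 1 false calls, so the estimated FDR is 0.5 / 2 = 0.25.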
Rank Product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments.
It is based on the assumption that, under the null hypothesis that the order of all items is random, the probability of finding a specific item among the top r of n items in a list is p = r/n.
Multiplying these probabilities leads to the definition of the rank product:
$$RP = \prod_i \frac{r_i}{n_i}$$
where r_i is the rank of the item in the i-th list and n_i is the total number of items in the i-th list.
The smaller the RP value, the smaller the probability that the observed placement of the item at the top of the lists is due to chance.
[Figure: two ranked lists, A and B, of 7 genes each (ranks 1 to 7). For a gene g, $RP_g = \prod_i r_{i,g}/n_i$, e.g. (1/7)*(2/7) = 0.04. Sorted RP values for AB: 0.04, 0.04, 0.31, 0.33, 0.43, 0.61, 0.86]
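The rank product of the two-list example can be computed directly. The slide gives only the resulting RP values, so the rank assignments below are an assumption: one pair of rankings chosen because it is consistent with those values.

```python
def rank_products(rank_lists):
    """RP per gene: product over lists of (rank in list / size of list)."""
    n_genes = len(rank_lists[0])
    products = [1.0] * n_genes
    for ranks in rank_lists:
        size = len(ranks)
        for gene, rank in enumerate(ranks):
            products[gene] *= rank / size
    return products

# Assumed ranks of 7 genes in lists A and B (not given on the slide).
ranks_a = [1, 2, 3, 4, 7, 5, 6]
ranks_b = [2, 1, 5, 4, 3, 6, 7]
rp = rank_products([ranks_a, ranks_b])
print([round(value, 2) for value in rp])
```

The first gene reproduces the worked value (1/7)*(2/7) = 0.04, and the full list rounds to the AB column shown in the figure.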
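Permuting the genes in the two arrays
[Figure: list A is permuted into a1 and a2, list B into b1 and b2; the sorted rank products RP* of each combination are listed below]
a1b1: 0.04, 0.10, 0.18, 0.29, 0.41, 0.57, 0.73
a1b2: 0.10, 0.12, 0.14, 0.16, 0.43, 0.49, 0.61
a2b1: 0.02, 0.08, 0.37, 0.37, 0.51, 0.57, 0.57
a2b2: 0.08, 0.12, 0.14, 0.24, 0.31, 0.49, 0.71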
$$RP = \prod_i \frac{r_i}{n_i}$$
[Figure: the observed rank products (AB: 0.04, 0.04, 0.31, 0.33, 0.43, 0.61, 0.86) are compared with the permuted rank products RP* of the L = 4 permutations (a1b1, a1b2, a2b1, a2b2) over the G = 7 genes. The indicator I is 1 when $RP^{*}_{g'(l)} \le RP_g$ and 0 otherwise. Indicator columns shown on the slide, one per permutation: (0, 0, 1, 1, 1, 1, 1), (0, 0, 0, 1, 1, 1, 1), (0, 0, 1, 1, 1, 1, 1), (0, 0, 1, 1, 1, 1, 1)]
$$P_g = \frac{1}{G\,L} \sum_{l=1}^{L} \sum_{g'=1}^{G} I\left(RP^{*}_{g'(l)} \le RP_g\right)$$
Values of P as computed on the slide, one line per gene:
(0+0+0+0)/(4*7) = 0
(0+0+0+0)/(4*7) = 0
(1+1+0+1)/(4*7) = 0.10
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
$$FDR_g = \frac{E(RP_g)}{\sum_{g'} I\left(RP_{g'} \le RP_g\right)}, \qquad E(RP_g) = \frac{1}{L} \sum_{l=1}^{L} \sum_{g'=1}^{G} I\left(RP^{*}_{g'(l)} \le RP_g\right)$$
Values of FDR as computed on the slide, one line per gene:
[(0+0+0+0)/4]/(0+0+0+0) = 0
[(0+0+0+0)/4]/(0+0+0+0) = 0
[(1+0+1+1)/4]/(1+0+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[Figure label: Significantly differentially expressed genes!]
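A permutation-based estimate of significance for rank products can be sketched as follows. For each gene it counts how many permuted rank products are at most the observed RP (averaged over permutations, this is E), then derives P = E / G and FDR = E / rank. This is one standard formulation and may differ in detail from the slide's worked arithmetic; the number of permutations, the seed and the rank values are arbitrary assumptions.

```python
import random

def rank_products(rank_lists):
    """RP per gene: product over lists of (rank in list / size of list)."""
    products = [1.0] * len(rank_lists[0])
    for ranks in rank_lists:
        for gene, rank in enumerate(ranks):
            products[gene] *= rank / len(ranks)
    return products

def rank_product_significance(rank_lists, n_perm=200, seed=0):
    """Per-gene RP, permutation P value and FDR estimate."""
    rng = random.Random(seed)
    observed = rank_products(rank_lists)
    n_genes = len(observed)
    tol = 1e-12  # guards against floating point ties
    counts = [0] * n_genes
    for _ in range(n_perm):
        # Shuffle each list independently: ranks become random under the null.
        shuffled = []
        for ranks in rank_lists:
            permuted = list(ranks)
            rng.shuffle(permuted)
            shuffled.append(permuted)
        rp_star = rank_products(shuffled)
        for gene, rp_g in enumerate(observed):
            counts[gene] += sum(1 for value in rp_star if value <= rp_g + tol)
    results = []
    for gene, rp_g in enumerate(observed):
        expected = counts[gene] / n_perm
        rank = sum(1 for value in observed if value <= rp_g + tol)
        results.append({"rp": rp_g, "p": expected / n_genes, "fdr": expected / rank})
    return results

# Hypothetical ranks of 7 genes in two lists.
results = rank_product_significance([[1, 2, 3, 4, 5, 6, 7], [2, 1, 3, 4, 5, 6, 7]])
for row in results:
    print(round(row["rp"], 2), round(row["p"], 3), round(row["fdr"], 3))
```

Genes at the top of both lists receive a small FDR, while the gene ranked last in both lists has RP = 1 and therefore P = 1 and FDR = 1.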
Exercise
• Load the Borovecki data starting from the matrix series file:
– Create the target file target.GSE8762.gender.txt
• Evaluate with PCA/HCL how the samples separate.
• Apply an interquartile (IQR) filter at 0.5.
• Use SAM (FDR < 10%) to identify a set of differentially expressed genes.