Bioinformatics Laboratory

Objectives

• Understand the approaches by which microarray technology can be used to identify:
  – Prognostic/diagnostic markers of disease

Example

• We will analyze how molecular markers of disease are identified by dissecting the approach presented in:
  – PNAS 2005, 102:11023-28
  – PNAS 2007, 104:14424-29

The biological question

• Huntington’s disease (HD) is an autosomal dominant disorder caused by an expansion of glutamine repeats in the ubiquitously distributed huntingtin protein.
• Mutant huntingtin interferes with the function of widely expressed transcription factors, suggesting that gene expression may be altered in a variety of tissues in HD, including peripheral blood.
• Highly quantitative biomarkers of neurodegenerative disease remain an important need in the urgent quest for disease-modifying therapies.
• For HD, a genetic test is available (trait marker), but the necessary state markers are still in development.
• Tested hypothesis:
  – Two studies exist:
    • Borovecki et al.: biomarker detection by profiling whole blood from HD patients (hd), pre-HD patients (pre) and normal donors (n).
    • Runne et al.: biomarker detection by profiling lymphocytes from HD patients (hd) and normal donors (n).
  – Is it possible to identify disease biomarkers using these data sets?

Experimental groups

• Borovecki:
  – HD group:
    • 12 HD-affected (stage I-II) subjects
    • 5 early presymptomatic carriers of the gene mutation, as determined by genetic testing
  – Normal group:
    • 14 healthy control subjects
  – Affymetrix hgu133a
• Runne:
  – HD group:
    • 12 moderate-stage HD subjects
  – Normal group:
    • 10 healthy control subjects
  – Affymetrix hgu133plus2

Experimental design

Recognition and statement of the problem

• The problem should be specified in enough detail, and the conditions under which the experiment will be performed should be understood well enough, that an appropriate design for the experiment can be selected.

Example

• We are investigating the effect of a drug, measured by BrdU incorporation, at three concentrations (10 nM, 100 nM, 1 µM) on 3 different tumor cell lines (CL).
• In this example there are two factors:
  – CL, a qualitative factor with 3 levels
  – Drug concentration, a quantitative factor with 3 levels
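As a minimal sketch of the two-factor example above, the experimental conditions can be enumerated as a full factorial design. The cell-line names are placeholders, and the highest concentration is assumed to be 1 µM (written here as 1000 nM); neither detail is fixed by the slide.

```python
from itertools import product

# Factor levels from the drug example; cell-line names are hypothetical placeholders.
cell_lines = ["CL1", "CL2", "CL3"]      # qualitative factor, 3 levels
concentrations_nM = [10, 100, 1000]     # quantitative factor, 3 levels (assumes 1 uM = 1000 nM)


def design_points(*factors):
    """Return every combination of factor levels (full factorial design)."""
    return list(product(*factors))


conditions = design_points(cell_lines, concentrations_nM)
for cell_line, conc in conditions:
    print(f"{cell_line}\t{conc} nM")
print(len(conditions), "experimental conditions")
```

Two factors with 3 levels each give 3 x 3 = 9 conditions, before any replication is considered.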

Identify the factors involved in the Borovecki study

• The study comprises:
  – HD patients, preHD patients and donors
• How many factors are involved?
  – 1
• Which ones?
  – patients
• Are the factors quantitative or qualitative?
  – Qualitative
• How many levels are there?
  – 3 (HD, preHD, N)

[Figure: one factor with three levels: Pre HD, N, HD]

How can I obtain the experimental data?

• Recently, international journals have started to require, as a condition for accepting an article, that the data be deposited in public databases:
  – Europe: ArrayExpress
  – USA: GEO

The data can be downloaded:
1. in an Excel-like (tab-delimited) format containing all the information about the experiment
2. as the array images (in this case the Affymetrix .CEL files)

[Figure: header of the matrix series file]

Affymetrix GeneChips

Probe set (Affymetrix)

[Figure: a gene sequence is interrogated by a probe set made of probe pairs; each probe pair consists of a perfect match (PM) cell and a mismatch (MM) cell]

PM: ACCAGATCTGTAGTCCATGCGATGC
MM: ACCAGATCTGTAATCCATGCGATGC
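The two probe sequences shown on the slide can be compared directly to see where the mismatch probe differs from the perfect match probe. This is only an illustrative sketch; it assumes the first sequence on the slide is the PM probe and the second is the MM probe, as the order of the labels suggests.

```python
# Sequences copied from the slide; the PM/MM assignment is an assumption based on their order.
PM = "ACCAGATCTGTAGTCCATGCGATGC"
MM = "ACCAGATCTGTAATCCATGCGATGC"


def mismatch_positions(a, b):
    """Return the 1-based positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(len(PM), mismatch_positions(PM, MM))  # prints: 25 [13]
```

The probes are 25 bases long and differ only at position 13, the central base.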

Analyzing microarray data requires dedicated software

• Microarray data cannot be analyzed with a simple Excel spreadsheet; they require rather sophisticated statistical tools.
• Both commercial and open-source software packages exist.
• In this course the exercises will be carried out with an open-source software project:
  – Bioconductor

Bioconductor

[Figure: Bioconductor and platform-specific devices]

Analysis pipeline

[Figure: analysis pipeline. Steps shown: Sample preparation, Array fabrication, Hybridization, Scanning + image analysis, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction, Quality control]

How do we start analyzing the data?

• If the .CEL files are available, a thorough quality control is performed.
• If the .CEL files are missing and only the matrix series file is available, a more limited set of quality controls can be performed.

[Figure: pipeline steps: Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]


Why do we perform quality controls (QC)?

• QC is a very important step in the analysis of microarray data.
• This is because the number of available experiments is usually limited, and the presence of one or more arrays with a high number of experimental artifacts could invalidate the analysis.
• QC identifies outlier arrays and lets the researcher decide whether or not they need to be removed.

Quality control to identify outlier arrays

• When only the matrix series file (MSF) is available, outlier arrays are assessed by inspecting:
  – Box plots of the intensity distributions of the individual arrays.

Quality control to assess the homogeneity of the experimental groups

• Principal component analysis
• Hierarchical clustering

Principal component analysis

• Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• The first principal component accounts for as much of the variability in the data as possible.
• Each succeeding component accounts for as much of the remaining variability as possible.
• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

[Figure: data cloud with axes PCA1 and PCA2; the 2nd PC is orthogonal to the 1st]

In many data sets the first three components account for most of the variability. Therefore, PCA results can reasonably be represented in a 3D space.
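To make the idea of "the axis that accounts for as much variability as possible" concrete, here is a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. It is a teaching sketch under simple assumptions (tiny data, a starting vector that is not orthogonal to the first component), not the routine used by oneChannelGUI; the toy data are invented.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, fraction of variance explained) for the first PC.

    rows: list of samples, each a list of variable values.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # starting vector (assumed not orthogonal to the first PC)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v (the largest eigenvalue) and the total variance.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    total = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total


# Invented toy data: the second variable is roughly twice the first.
toy = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
direction, explained = first_principal_component(toy)
print(direction, explained)
```

For these strongly correlated variables nearly all the variance lies along a single axis, which is why one or two components are often enough to display such data.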

Hierarchical Clustering (HCL)

• HCL is a clustering method that can be agglomerative or divisive.
• The iterative process continues until all groups are connected in a hierarchical tree.

Hierarchical Clustering (agglomerative)

[Figure: agglomerative clustering of samples s1 to s8, joined step by step into a hierarchical tree. Modified from a TMEV presentation (www.tigr.org)]

• s1 is most like s8
• s4 is most like {s1, s8}
• s5 is most like s7
• {s5, s7} is most like {s1, s4, s8}


Hierarchical Clustering

• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.
• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

Agglomerative Linkage Methods

• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
• Three linkage methods that are commonly used are:
  – Single Linkage
  – Average Linkage
  – Complete Linkage
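The three linkage rules listed above can be sketched in a few lines: single linkage uses the smallest pairwise distance between two clusters, complete linkage the largest, and average linkage the mean. The code below is a naive illustration with Euclidean distance and invented one-dimensional sample values; the sample names only echo the figure.

```python
from itertools import combinations


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Each linkage rule reduces the list of pairwise distances between two clusters to one value.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda d: sum(d) / len(d),
}


def agglomerate(points, linkage="average"):
    """Naive agglomerative clustering.

    points: dict mapping sample name -> vector.
    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    combine = LINKAGE[linkage]
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        def cluster_distance(pair):
            a, b = pair
            return combine([euclidean(points[i], points[j]) for i in a for j in b])

        # Join the two closest clusters according to the chosen linkage rule.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Invented values, chosen so that s1 and s8 are the most similar samples.
samples = {"s1": [0.0], "s8": [0.1], "s4": [0.25], "s5": [3.0], "s7": [3.5]}
for a, b, dist in agglomerate(samples, linkage="single"):
    print(a, "+", b, round(dist, 3))
```

With these values s1 joins s8 first, s4 then joins {s1, s8}, and s5 joins s7 before the two groups are finally connected; on less clear-cut data the linkage rule can change the order of the merges.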


[Figure: clustering result in which t4 is clearly an outlier!]

Exercise

• Use the target file target.GSE8762.classif.txt and the file esperimental.design.names.gse8762.txt to evaluate with PCA the behavior of the factors disease status and gender in the data set under study.

Exercise

• Open R
• Load oneChannelGUI
• Start a new project:
  – Change the working dir to dataset.huntington
  – Load the target file
  – Set as project name: ronne

Exercise

• Starting from the data set you have loaded:
  – check the data box plots
• Answer the following questions:
  – Is there any array characterized by a very narrow probe intensity distribution?
    • YES (which? …………………………….)  NO
  – Is there any array that is significantly different from the others?
    • YES (which? …………………………….)  NO

Exercise

• Check whether the experimental groups of our ronne data set (HD, N) are relatively homogeneous using PCA and hierarchical clustering.
• Is it easy to discriminate on the basis of disease status?
  – Yes
  – No

Summarizing the data of the individual probes into a single value for the probe set

• Analysis steps:
  – Calculating probe set summaries:
    • RMA
    • GCRMA
  – Normalization:
    • Quantile method
• FLUORESCENCE INTENSITY IS EXPRESSED AS LOG2(INTENSITY)

Brief summary of probe set intensity calculation

• The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take non-specific probe hybridization into account when calculating the probe set background.
• GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).

Why Normalization?

To remove systematic biases, which include:

• Sample preparation
• Variability in hybridization
• Spatial effects
• Scanner settings
• Experimenter bias

Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray
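The quantile method named earlier among the analysis steps removes array-wide biases by forcing every array to share the same intensity distribution. The following is a simplified sketch of that idea (ties are broken by position rather than averaged), with invented numbers; it is not the Bioconductor implementation.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equally long lists, one list of intensities per array.
    Simplification: tied values are ranked by position, not averaged.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by the reference value of its rank
        normalized.append(out)
    return normalized


# Two invented arrays measuring the same three probe sets.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# prints: [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both arrays contain exactly the same set of values (1.5, 3.5, 5.5); only the order, that is the ranking of the probe sets within each array, is preserved.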

Multiple testing errors

• When performing multiple statistical tests, two types of errors can occur:
  – Type I error (False positive)
  – Type II error (False negative)
• Reducing type I errors increases the number of type II errors.
• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).

Filtering can be performed at various levels:

• Annotation features:
  – Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.)
• Signal features:
  – % of intensities greater than a user-defined value
  – Interquartile range (IQR) greater than a defined value

[Figure: intensity distributions after RMA and GCRMA, with the background-level (Bg) probe sets indicated]

How to define the efficacy of a filtering procedure?

• This enrichment is very similar to that used to evaluate the purification fold of a protein after a chromatographic step.

$$\text{enrichment} = 100 \times \frac{N_{\text{spike-in, AfterFiltering}} \times N_{\text{probesets}}}{N_{\text{spike-in}} \times N_{\text{probesets, AfterFiltering}}}$$

$$\text{enrichment} = 100 \times \frac{E_{\text{AfterChrom}} \times \mu g P_{\text{BeforeChrom}}}{E_{\text{BeforeChrom}} \times \mu g P_{\text{AfterChrom}}}$$

where N is a number of probe sets, counted before or after filtering.
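As a quick arithmetic check, the enrichment can be computed as the fraction of spike-in probe sets after filtering divided by the same fraction before filtering, times 100. The function name is hypothetical; the numbers are the ones these notes report for the pOverA filter (22300 probe sets before, 5553 after, 42 spike-ins retained).

```python
def enrichment(spike_after, probesets_after, spike_before, probesets_before):
    """Spike-in enrichment (%) produced by a filtering step."""
    fraction_after = spike_after / probesets_after
    fraction_before = spike_before / probesets_before
    return 100.0 * fraction_after / fraction_before


print(enrichment(42, 22300, 42, 22300))  # no filtering: 100.0
print(enrichment(42, 5553, 42, 22300))   # about 401.6
```

Keeping all 42 spike-ins while discarding three quarters of the probe sets therefore gives an enrichment of roughly 400%.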

Filtering by genefilter pOverA
(keep a probe set if ≥ 25% of the arrays have intensities ≥ log2(100))

Before filtering: 22300 probe sets, 42/42 spike-ins, enrichment 100%
After filtering: 5553 probe sets, 42/42 spike-ins, enrichment 401%
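The intensity filter can be sketched as follows: a probe set is kept when a sufficient proportion of its arrays reach a minimum log2 intensity. This mimics the rule stated on the slide and is not the genefilter source code; the function name, probe set names and values are invented for illustration.

```python
import math


def p_over_a(expression, p=0.25, a=math.log2(100)):
    """Keep probe sets with at least a proportion p of arrays at intensity >= a.

    expression: dict mapping probe set name -> list of log2 intensities (one per array).
    """
    kept = {}
    for probe_set, values in expression.items():
        fraction = sum(v >= a for v in values) / len(values)
        if fraction >= p:
            kept[probe_set] = values
    return kept


# Hypothetical log2 intensities on four arrays; log2(100) is about 6.64.
toy_expression = {
    "ps_low": [5.0, 5.2, 5.1, 5.3],      # never above the threshold
    "ps_one_hit": [5.0, 5.2, 5.1, 7.0],  # 1 of 4 arrays (25%) above the threshold
    "ps_high": [8.0, 9.1, 7.5, 10.2],    # always above the threshold
}
print(sorted(p_over_a(toy_expression)))  # prints: ['ps_high', 'ps_one_hit']
```

Probe sets that sit at background level on every array are removed, which is what raises the spike-in enrichment.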

Filtering by interquartile range (IQR)

[Figure: intensity distribution with the IQR marked between the 25% and 75% quantiles]

How does filtering by genefilter IQR work?

• The distribution of all intensity values in a differential expression experiment is the summary of the distributions of each gene's expression over the experimental conditions.
• The filter removes genes that show little change across the experimental points.

Filtering by genefilter IQR
(removing probe sets whose intensity IQR is below 0.25 or 0.5)

Before filtering: 22300 probe sets, 42/42 spike-ins, enrichment 100%
After filtering, the enrichments shown for the two thresholds are 32794% and 9139%
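A minimal sketch of an IQR filter, assuming that a probe set is kept when the IQR of its log2 intensities across arrays reaches the threshold. The quartiles are computed with `statistics.quantiles` (Python 3.8+); the function names, probe set names and values are hypothetical, and this is not the genefilter implementation.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: 75% quantile minus 25% quantile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(expression, threshold):
    """Keep probe sets whose intensity IQR across arrays is at least the threshold."""
    return {ps: v for ps, v in expression.items() if iqr(v) >= threshold}


# Hypothetical log2 intensities on five arrays.
toy_expression = {
    "flat": [7.0, 7.1, 7.0, 7.1, 7.05],    # IQR about 0.1
    "mild": [7.0, 7.1, 7.3, 7.4, 7.6],     # IQR about 0.3
    "changing": [6.0, 6.2, 8.0, 8.3, 8.1],  # IQR about 1.9
}
print(sorted(iqr_filter(toy_expression, 0.25)))  # prints: ['changing', 'mild']
print(sorted(iqr_filter(toy_expression, 0.5)))   # prints: ['changing']
```

Raising the threshold from 0.25 to 0.5 removes more probe sets, keeping only those whose expression really varies across the experimental points.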

Exercise

• Load the Borovecki data starting from the matrix series file, using:
  – target.GSE1751.hd.n.txt
• Evaluate with PCA/HCL how the samples separate.
• Apply an interquartile (IQR) filter at 0.25 and at 0.5.
  – How many transcripts remain after each filter?
  – Are the two groups of data still separable with PCA and HCL?
• Apply an IQR filter at 0.5 and an intensity filter of 50% > 100.
  – What happens to the distribution of the data?
  – Are the data still separable with PCA and HCL?

Statistical analysis

• The sensitivity of statistical tests is affected by the number of available replicates.
• Replicates can be:
  – Technical
  – Biological
• Biological replicates better summarize the variability of samples belonging to a common group.
• The minimum number of replicates is an important issue!

Fold change filtering

• The intensity change between experimental groups (e.g. control versus treated) is known as:
  – Fold change.
• Frequently an arbitrary threshold is used to define a significant differential expression, for example:

$$\left|\log_2 \frac{\text{Trtd}}{\text{Ctrl}}\right| \ge 1$$
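Because intensities in these notes are already on the log2 scale, the log2 fold change is simply the difference between the group means. The sketch below assumes a threshold of 1 on the absolute log2 fold change (a two-fold change) as one common arbitrary choice; the function names and values are invented.

```python
from statistics import mean


def log2_fold_change(treated, control):
    """Difference of mean log2 intensities, i.e. log2(Trtd / Ctrl)."""
    return mean(treated) - mean(control)


def passes_fold_change(treated, control, threshold=1.0):
    """True if the absolute log2 fold change reaches the (arbitrary) threshold."""
    return abs(log2_fold_change(treated, control)) >= threshold


# Hypothetical log2 intensities for one probe set in each group.
print(log2_fold_change([10.0, 10.5, 9.5], [8.0, 8.5, 7.5]))  # prints: 2.0
print(passes_fold_change([8.25, 8.0, 8.5], [8.0, 8.0, 8.0]))  # prints: False
```

A log2 fold change of 2.0 corresponds to a four-fold intensity ratio. Note that this filter ignores the variability within each group, which is why it needs statistical support.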

Statistical analysis

• Intensity changes between experimental groups (e.g. control versus treated) are known as:
  – Fold change.
  – Ranking genes based on fold change alone implicitly assigns equal variance to every gene.
• Fold change alone is not sufficient to indicate the significance of the expression changes.
• Fold change has to be supported by statistical information.

Statistical validation

• Statistical validation can be performed using parametric and non-parametric tests.
• Parametric tests:
  – Assume that the populations under analysis are normally distributed.
• Non-parametric tests:
  – Make no assumption about the distribution of the samples.
• Non-parametric tests are generally less sensitive than parametric tests when the normality assumption holds.

Selecting differentially expressed genes

[Figure: differential expression linked to a specific biological event, overlapped by the gene sets selected by statistical validation methods I, II and III]

• Each method captures some true signals, but not all.
• Each method catches some false signals.
• The trick is to find the best condition to maximize true signals while minimizing false ones.

SAM (Significance Analysis of Microarrays) (Tusher et al. 2001)

• The fudge factor regularizes the t-statistic by inflating the denominator.
• s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
• Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to a between-subjects t-test).
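The statistic these bullets describe is, in Tusher et al. (2001), d(i) = (mean of group 2 minus mean of group 1) / (s(i) + s0), where s(i) is the pooled standard deviation of gene i and s0 is the fudge factor. The sketch below computes it for one gene in the two-class unpaired case. In SAM the value of s0 is chosen from the data; here it is simply passed in as a parameter, and the function name and numbers are invented.

```python
from math import sqrt
from statistics import mean


def sam_d(control, treated, s0=0.0):
    """SAM relative difference d for one gene (two-class unpaired design).

    s0 is the fudge factor; with s0 = 0 this is the ordinary pooled-variance t-statistic.
    """
    n1, n2 = len(control), len(treated)
    m1, m2 = mean(control), mean(treated)
    sum_squares = (sum((x - m1) ** 2 for x in control)
                   + sum((x - m2) ** 2 for x in treated))
    # Pooled standard deviation s(i) of the difference between the two means.
    s = sqrt((1 / n1 + 1 / n2) / (n1 + n2 - 2) * sum_squares)
    return (m2 - m1) / (s + s0)


# Hypothetical log2 intensities for one gene.
print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))          # about 3.67
print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0))  # about 1.65
```

Adding s0 to the denominator shrinks the statistic, and it does so most strongly for genes with very small variance, which would otherwise get inflated scores from tiny differences.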

SAM design in oneChannelGUI

• SAM uses data permutations to define a set of significantly differentially expressed genes.

[Figure: the N and T class labels of the samples are permuted; the slide shows permutations 1 to 4, the mean number of false calls ("Mean false") and the number 720]

How does SAM calculate the False Discovery Rate for a specific delta?

• FDR is given by p0 * False / Called.
• p0 is the prior probability (pi0) that a gene is not differentially expressed.
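The FDR expression above can be turned into a short worked example. It assumes, as the "Mean false" label on the slide suggests, that "False" is the mean number of genes called significant in the permuted data sets and "Called" is the number called in the real data; the function name and all numbers are invented.

```python
from statistics import mean


def sam_fdr(pi0, false_calls_per_permutation, called):
    """FDR estimate for one delta: pi0 * (mean false calls) / (genes called)."""
    if called == 0:
        return 0.0  # nothing called, so there are no false discoveries to count
    return pi0 * mean(false_calls_per_permutation) / called


# Hypothetical example: 4 permutations, 30 genes called significant in the real data.
print(sam_fdr(pi0=1.0, false_calls_per_permutation=[2, 4, 3, 3], called=30))
```

Here the permutations give 3 false calls on average, so about 10% of the 30 called genes are expected to be false positives. A value of pi0 below 1 lowers the estimate proportionally.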

Rank Product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments.

It is based on the assumption that, under the null hypothesis that the order of all items is random, the probability of finding a specific item among the top r of n items in a list is p = r/n.

Multiplying these probabilities leads to the definition of the rank product:

$$RP = \prod_i \frac{r_i}{n_i}$$

where r_i is the rank of the item in the i-th list and n_i is the total number of items in the i-th list.

The smaller the RP value, the smaller the probability that the observed placement of the item at the top of the lists is due to chance.
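The rank product definition translates directly into code. This sketch uses `math.prod` (Python 3.8+); the function name is an illustrative choice, and the example reproduces the (1/7)*(2/7) case from the worked example in these notes.

```python
from math import prod


def rank_product(ranks, list_sizes):
    """Rank product of one item: the product of r_i / n_i over all lists.

    ranks: rank of the item in each list (1 = top).
    list_sizes: total number of items in each list.
    """
    return prod(r / n for r, n in zip(ranks, list_sizes))


# A gene ranked 1st in list A and 2nd in list B, both lists holding 7 genes.
print(round(rank_product([1, 2], [7, 7]), 2))  # prints: 0.04
```

A gene ranked last in both lists would get 7/7 * 7/7 = 1, the largest possible value.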

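[Figure: worked example with two replicate lists, A and B, each ranking 7 genes (ranks 1 to 7). The rank product of each gene is computed as $RP_g = \prod_i \frac{r_{g,i}}{n_i}$, for example (1/7)*(2/7) = 0.04]

Rank products for AB, sorted: 0.04, 0.04, 0.31, 0.33, 0.43, 0.61, 0.86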

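Permuting the genes in the two arrays

[Figure: lists A and B are each permuted twice (a1, a2 and b1, b2), and the rank products are recomputed for every pairing]

Sorted rank products of the permuted lists:

a1b1: 0.04, 0.10, 0.18, 0.29, 0.41, 0.57, 0.73
a1b2: 0.10, 0.12, 0.14, 0.16, 0.43, 0.49, 0.61
a2b1: 0.02, 0.08, 0.37, 0.37, 0.51, 0.57, 0.57
a2b2: 0.08, 0.12, 0.14, 0.24, 0.31, 0.49, 0.71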

The significance of each rank product is estimated from the permuted lists:

$$P_g = \frac{1}{G \cdot L} \sum_{l=1}^{L} \sum_{g'=1}^{G} I\left(RP^{*}_{g'(l)} \le RP_g\right)$$

$$E_g = \frac{1}{L} \sum_{l=1}^{L} \sum_{g'=1}^{G} I\left(RP^{*}_{g'(l)} \le RP_g\right), \qquad FDR_g = \frac{E_g}{\sum_{g'=1}^{G} I\left(RP_{g'} \le RP_g\right)}$$

where RP*_{g'(l)} is the rank product of gene g' in the l-th permutation, G is the number of genes, L is the number of permutations and I is the indicator function (1 if the condition holds, 0 otherwise).

[Figure: the indicator is evaluated for each of the 7 rank products of AB (0.04, 0.04, 0.31, 0.33, 0.43, 0.61, 0.86) against the four permuted lists a1b1, a1b2, a2b1 and a2b2, giving columns of 0/1 values]

P for the seven genes of AB, in order:

(0+0+0+0)/(4*7) = 0
(0+0+0+0)/(4*7) = 0
(1+1+0+1)/(4*7) = 0.11
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14

FDR for the seven genes of AB, in order:

[(0+0+0+0)/4]/(0+0+0+0) = 0
[(0+0+0+0)/4]/(0+0+0+0) = 0
[(1+0+1+1)/4]/(1+0+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25

[Callout on the slide: Significantly differentially expressed genes!]

Exercise

• Load the Borovecki data starting from the matrix series file, using:
  – Create the target file target.GSE8762.gender.txt
• Evaluate with PCA/HCL how the samples separate.
• Apply an interquartile (IQR) filter at 0.5.
• Use SAM (FDR < 10%) to identify a set of differentially expressed genes.
