Laboratorio Bioinformatica

Page 1: Laboratorio Bioinformatica

Bioinformatics Laboratory

Page 2: Laboratorio Bioinformatica

Objectives

• Understand the approaches by which microarray technology can be used to identify:
  – Prognostic/diagnostic markers of disease

Page 3: Laboratorio Bioinformatica

Example

• We will examine how molecular markers of disease are identified by dissecting the approach presented in:
  – PNAS 2005, 102:11023-28
  – PNAS 2007, 104:14424-29

Page 4: Laboratorio Bioinformatica

The biological question

• Huntington’s disease (HD) is an autosomal dominant disorder caused by an expansion of glutamine repeats in the ubiquitously distributed huntingtin protein.

• Mutant huntingtin interferes with the function of widely expressed transcription factors, suggesting that gene expression may be altered in a variety of tissues in HD, including peripheral blood.

• Highly quantitative biomarkers of neurodegenerative disease remain an important need in the urgent quest for disease-modifying therapies.

• For HD, a genetic test is available (trait marker), but the necessary state markers are still in development.

• Tested hypothesis:
  – Two studies exist:
    • Borovecki et al.: detecting biomarkers by profiling whole blood from HD patients (hd), pre-HD patients (pre) and normal donors (n).
    • Runne et al.: detecting biomarkers by profiling lymphocytes from HD patients (hd) and normal donors (n).
  – Is it possible to identify disease biomarkers using these data sets?

Page 5: Laboratorio Bioinformatica

Experimental groups

• Borovecki:
  – HD group:
    • 12 HD-affected (stage I-II) subjects
    • 5 early presymptomatic carriers of the gene mutation, as determined by genetic testing
  – Normal group:
    • 14 healthy control subjects
  – Affymetrix hgu133a

• Runne:
  – HD group:
    • 12 moderate-stage HD subjects
  – Normal group:
    • 10 healthy control subjects
  – Affymetrix hgu133plus2

Page 6: Laboratorio Bioinformatica

Experimental design

Page 7: Laboratorio Bioinformatica

Recognition and statement of the problem

• The problem should be specified in enough detail, and the conditions under which the experiment will be performed should be understood, so that the appropriate design for the experiment can be selected.

Page 8: Laboratorio Bioinformatica

Example

• We are investigating the effect of a drug, measured by BrdU incorporation, at three concentrations (10 nM, 100 nM, 1 µM) on 3 different tumor cell lines (CL).

• In this example there are two factors:
  – CL, a qualitative factor with 3 levels
  – Drug concentration, a quantitative factor with 3 levels
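The two-factor design above can be enumerated explicitly. The following is a minimal sketch: the cell line names are hypothetical placeholders, and the third concentration is assumed to be 1 µM (written as 1000 nM).

```python
from itertools import product

# Qualitative factor: 3 tumor cell lines (names are hypothetical placeholders)
cell_lines = ["CL1", "CL2", "CL3"]
# Quantitative factor: 3 drug concentrations in nM (1000 nM = 1 µM, an assumption)
concentrations_nM = [10, 100, 1000]

# A full factorial design crosses every level of one factor with every level of the other
design = list(product(cell_lines, concentrations_nM))

for cell_line, concentration in design:
    print(f"{cell_line}\t{concentration} nM")
print(len(design), "experimental conditions")
```

With 3 levels per factor, the full design has 3 × 3 = 9 conditions, before any replication is considered.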

Page 9: Laboratorio Bioinformatica

Identifying the factors involved in the Borovecki study

• The study consists of:
  – HD patients, preHD patients and donors
• How many factors are involved?
  – 1
• Which ones?
  – Patients
• Are the factors quantitative or qualitative?
  – Qualitative
• How many levels are there?
  – 3 (HD, preHD, N)

[Diagram: one factor with three levels: HD, preHD, N]

Page 10: Laboratorio Bioinformatica

How can I obtain the experimental data?

• Nowadays, for an article to be accepted by an international journal, the data must be deposited in a public database:
  – Europe: arrayexpress
  – USA: GEO

Page 11: Laboratorio Bioinformatica
Page 12: Laboratorio Bioinformatica
Page 13: Laboratorio Bioinformatica
Page 14: Laboratorio Bioinformatica

The data can be downloaded as:
1. an Excel-like (tabular) file containing all the information about the experiment
2. the array images (in this case the Affymetrix .CEL files)

Page 15: Laboratorio Bioinformatica

Matrix series file header

Page 16: Laboratorio Bioinformatica

Affymetrix GeneChips

Page 17: Laboratorio Bioinformatica

Probe set (Affymetrix)

[Figure: a probe set designed on the gene sequence; each probe pair is made of a perfect match (PM) cell and a mismatch (MM) cell]

PM: ACCAGATCTGTAGTCCATGCGATGC
MM: ACCAGATCTGTAATCCATGCGATGC
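To make the PM/MM relationship concrete, the short sketch below compares the two 25-mer sequences shown on the slide and reports where they differ. Only the two sequences come from the slide; the variable names are illustrative.

```python
# Probe pair sequences as shown on the slide
pm = "ACCAGATCTGTAGTCCATGCGATGC"  # perfect match
mm = "ACCAGATCTGTAATCCATGCGATGC"  # mismatch

# Collect (1-based position, PM base, MM base) wherever the two probes differ
differences = [
    (position + 1, pm_base, mm_base)
    for position, (pm_base, mm_base) in enumerate(zip(pm, mm))
    if pm_base != mm_base
]

print("probe length:", len(pm))
print("differences:", differences)
```

The two probes differ at a single base, the central (13th) position of the 25-mer.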

Page 18: Laboratorio Bioinformatica

Analysing microarray data requires dedicated software

• Microarray data cannot be analysed with a simple Excel spreadsheet; they require rather sophisticated statistical tools.
• Both commercial and open-source software exist.
• In this course the exercises will be carried out using an open-source software:
  – Bioconductor

Page 19: Laboratorio Bioinformatica

Analysis pipeline

[Diagram: Sample preparation, Array fabrication, Hybridization, Scanning + Image analysis (platform-specific devices); Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction (Bioconductor)]

Page 20: Laboratorio Bioinformatica

How do we start analysing the data?

• If the .CEL files are available, a thorough quality control is performed.

• If the .CEL files are missing and only the matrix series file is available, a more limited set of quality controls can be performed.

Page 21: Laboratorio Bioinformatica

Analysis pipeline

[Diagram: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]

Page 22: Laboratorio Bioinformatica

Why are quality controls (QC) performed?

• QC is a very important step in the analysis of microarray data.

• This is because the number of available experiments is usually limited, and the presence of one or more arrays with a large number of experimental artifacts could compromise the analysis.

• QC identifies outlier arrays and lets the researcher decide whether or not they need to be removed.

Page 23: Laboratorio Bioinformatica

Quality control to identify outlier arrays

• When only the matrix series file (MSF) is available, the presence of outlier arrays is assessed by inspecting:

• Box plots of the intensity distributions of the individual arrays.
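The numbers behind a box plot can also be inspected directly. The sketch below is only an illustration of the idea, not the procedure used by any specific tool: the toy intensities, the function names and the `tolerance` rule are all assumptions made for this example.

```python
from statistics import median, quantiles

def box_stats(values):
    """Return the quartiles and interquartile range that a box plot draws."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}

def flag_arrays(arrays, tolerance=0.5):
    """Flag arrays whose box is shifted or much narrower than the typical array.

    `tolerance` is an arbitrary illustrative cutoff, expressed as a fraction
    of the typical (median) IQR across arrays.
    """
    stats = {name: box_stats(values) for name, values in arrays.items()}
    typical_median = median(s["median"] for s in stats.values())
    typical_iqr = median(s["iqr"] for s in stats.values())
    flagged = []
    for name, s in stats.items():
        shifted = abs(s["median"] - typical_median) > tolerance * typical_iqr
        narrow = s["iqr"] < tolerance * typical_iqr
        if shifted or narrow:
            flagged.append(name)
    return flagged

# Toy log2 intensities for three arrays (invented values)
arrays = {
    "array1": [6.0, 7.0, 8.0, 9.0, 10.0],
    "array2": [6.1, 7.2, 8.1, 9.0, 10.2],
    "array3": [7.9, 8.0, 8.1, 8.2, 8.3],  # very narrow distribution
}
flagged = flag_arrays(arrays)
print(flagged)
```

A flagged array is only a candidate outlier; whether to remove it remains the researcher's decision.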

Page 24: Laboratorio Bioinformatica
Page 25: Laboratorio Bioinformatica
Page 26: Laboratorio Bioinformatica

Quality control to assess the homogeneity of the experimental groups

• Principal component analysis
• Hierarchical clustering

Page 27: Laboratorio Bioinformatica

Principal component analysis

• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

• The first principal component accounts for as much of the variability in the data as possible.

• Each succeeding component accounts for as much of the remaining variability as possible.

• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
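As an illustration of the first bullet points, the sketch below finds the first principal component of a tiny two-variable data set using power iteration on the covariance matrix. This is one simple way to obtain the leading component, chosen here because it needs no external library; the data values and function name are invented for the example.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: samples x variables. Returns (unit direction, fraction of variance explained)."""
    n, p = len(rows), len(rows[0])
    # Center each variable on its mean
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (p x p)
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Power iteration: repeated multiplication converges to the leading eigenvector
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along the component (v' C v) relative to the total variance (trace of C)
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    return v, eigenvalue / total_variance

# Two strongly correlated variables measured on four samples (invented values)
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
direction, explained = first_principal_component(rows)
print(direction, explained)
```

Because the two variables are almost perfectly correlated, a single component captures nearly all of the variability, which is why the data could be described with one axis instead of two.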

Page 28: Laboratorio Bioinformatica

[Figure: PCA plot with axes PCA1 and PCA2]

Page 29: Laboratorio Bioinformatica

[Figure: components 1 and 2; the 2nd PC is orthogonal to the 1st]

In general, the first three components account for most of the variability. Therefore, PCA results can reasonably be represented in a 3D space.

Page 30: Laboratorio Bioinformatica
Page 31: Laboratorio Bioinformatica

Hierarchical Clustering (HCL)

• HCL is an agglomerative/divisive clustering method.

• The iterative process continues until all groups are connected in a hierarchical tree.

Page 32: Laboratorio Bioinformatica

Hierarchical Clustering (agglomerative)

Starting order: s1 s2 s3 s4 s5 s6 s7 s8
• s1 is most like s8: s1 s8 s2 s3 s4 s5 s6 s7
• s4 is most like {s1, s8}: s1 s8 s4 s2 s3 s5 s6 s7

Modified from a TMEV presentation (www.tigr.org)

Page 33: Laboratorio Bioinformatica

Hierarchical Clustering

Starting order: s1 s8 s4 s2 s3 s5 s6 s7
• s5 is most like s7: s1 s8 s4 s2 s3 s5 s7 s6
• {s5, s7} is most like {s1, s4, s8}: s1 s8 s4 s5 s7 s2 s3 s6

Modified from a TMEV presentation (www.tigr.org)

Page 34: Laboratorio Bioinformatica

Hierarchical Tree

[Figure: hierarchical tree with leaf order s1 s8 s4 s5 s7 s2 s3 s6]

Modified from a TMEV presentation (www.tigr.org)

Page 35: Laboratorio Bioinformatica

Hierarchical Clustering

• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.

• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

Page 36: Laboratorio Bioinformatica

Agglomerative Linkage Methods

• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

• Three linkage methods that are commonly used are:
  – Single Linkage
  – Average Linkage
  – Complete Linkage

Modified from a TMEV presentation (www.tigr.org)
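The three linkage rules differ only in how the pairwise distances between the members of two clusters are combined. A minimal sketch follows, assuming Euclidean distance between samples; the function name and the toy one-dimensional clusters are illustrative.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters of points under a given linkage rule."""
    # All pairwise Euclidean distances between members of the two clusters
    pairwise = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

# Toy one-dimensional clusters (invented values)
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(3.0,), (5.0,)]

single = linkage_distance(cluster_a, cluster_b, "single")
average = linkage_distance(cluster_a, cluster_b, "average")
complete = linkage_distance(cluster_a, cluster_b, "complete")
print(single, average, complete)
```

At each agglomerative step, the pair of clusters with the smallest linkage distance is merged, so the choice of rule can change the shape of the resulting tree.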

Page 37: Laboratorio Bioinformatica

t4 is clearly an outlier!

Page 38: Laboratorio Bioinformatica

Exercise

• Use the target file target.GSE8762.classif.txt and the file esperimental.design.names.gse8762.txt to evaluate with PCA the behaviour of the factors disease status and gender in the data set under examination.

Page 39: Laboratorio Bioinformatica

Exercise

• Open R
• Load oneChannelGUI
• Start a new project:
  – Change the working dir to dataset.huntington
  – Load the target file
  – Set as project name: ronne

Page 40: Laboratorio Bioinformatica

Exercise

• Starting from the data set you have loaded:
  – Check the data box plots
• Answer the following questions:
  – Is there any array characterized by a very narrow probe intensity distribution?
    • YES (which? …………………………….) / NO
  – Is there any array that is significantly different from the others?
    • YES (which? …………………………….) / NO

Page 41: Laboratorio Bioinformatica

Exercise

• Inspect whether the experimental groups of our ronne data set (HD, N) are relatively homogeneous using PCA and hierarchical clustering.

• Is it easy to discriminate on the basis of disease status?
  – Yes
  – No

Page 42: Laboratorio Bioinformatica

Analysis pipeline

[Diagram: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]

Page 43: Laboratorio Bioinformatica

Summarizing the data of the individual probes into a single value per probe set

• Analysis steps:
  – Calculating probe set summaries:
    • RMA
    • GCRMA
  – Normalization:
    • Quantile method

• FLUORESCENCE INTENSITY IS EXPRESSED AS log2(INTENSITY)
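The quantile method and the log2 transformation named above can be sketched in a few lines. This is a simplified illustration, not the RMA implementation: it ignores ties, works on whole arrays rather than probe-level data, and uses invented intensities.

```python
import math

def quantile_normalize(columns):
    """Give every array (column) the same intensity distribution.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are not handled specially in this sketch.
    """
    n = len(columns[0])
    sorted_columns = [sorted(column) for column in columns]
    rank_means = [
        sum(column[i] for column in sorted_columns) / len(columns)
        for i in range(n)
    ]
    normalized = []
    for column in columns:
        # Indices of the column's values, from smallest to largest
        order = sorted(range(n), key=lambda i: column[i])
        result = [0.0] * n
        for rank, index in enumerate(order):
            result[index] = rank_means[rank]
        normalized.append(result)
    return normalized

# Raw intensities for two arrays, three probe sets each (invented values)
raw = [[64.0, 256.0, 1024.0], [128.0, 4096.0, 512.0]]
# Express intensities as log2(intensity)
log2_data = [[math.log2(x) for x in column] for column in raw]
normalized = quantile_normalize(log2_data)
print(log2_data)
print(normalized)
```

After normalization both arrays contain the same set of values; only the order of the probe sets within each array is preserved.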

Page 44: Laboratorio Bioinformatica

Brief summary of probe set intensity calculation

• The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take into account non-specific probe hybridization in the probe set background calculation.

• GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).

Page 45: Laboratorio Bioinformatica

Why Normalization?

To remove systematic biases, which include:

• Sample preparation
• Variability in hybridization
• Spatial effects
• Scanner settings
• Experimenter bias

Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray

Page 46: Laboratorio Bioinformatica

Analysis pipeline

[Diagram: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]

Page 47: Laboratorio Bioinformatica

Multiple testing errors

• When performing multiple statistical tests, two types of errors can occur:
  – Type I error (false positive)
  – Type II error (false negative)

• Reducing type I errors increases the number of type II errors.

• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).

Page 48: Laboratorio Bioinformatica

Filtering can be performed at various levels:

• Annotation features:
  – Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.)

• Signal features:
  – % of intensities greater than a user-defined value
  – Interquartile range (IQR) greater than a defined value
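The two signal-based filters can be sketched as follows. This is plain Python written for illustration, not the genefilter API; the function names, the toy probe sets and the choice of `>=` comparisons are assumptions.

```python
import math
from statistics import quantiles

def passes_intensity_filter(values, fraction, threshold):
    """Keep a probe set if at least `fraction` of its intensities reach `threshold`."""
    return sum(v >= threshold for v in values) / len(values) >= fraction

def interquartile_range(values):
    """Distance between the 25% and 75% quantiles of the intensities."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# log2 intensities of three probe sets across four arrays (invented values)
probe_sets = {
    "ps_low": [5.0, 5.1, 5.2, 5.0],        # below background-like level
    "ps_flat": [9.0, 9.1, 9.0, 9.1],       # expressed but not changing
    "ps_variable": [7.0, 7.2, 9.5, 9.8],   # expressed and changing
}

intensity_threshold = math.log2(100)  # about 6.64 on the log2 scale

kept_by_intensity = [
    name for name, values in probe_sets.items()
    if passes_intensity_filter(values, 0.25, intensity_threshold)
]
kept_by_iqr = [
    name for name, values in probe_sets.items()
    if interquartile_range(values) >= 0.5
]
print(kept_by_intensity)
print(kept_by_iqr)
```

The intensity filter discards the weakly expressed probe set, while the IQR filter additionally discards the probe set that is expressed but does not vary across arrays.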

Page 49: Laboratorio Bioinformatica

Intensity distributions

[Figure: intensity distributions for RMA and GCRMA, with the background-level (Bg) probe sets indicated]

Page 50: Laboratorio Bioinformatica

How to define the efficacy of a filtering procedure?

• This enrichment is very similar to that used to evaluate the purification fold of a protein after a chromatographic step.

$$\text{enrichment} = \frac{N_{\text{spike-in, AfterFiltering}} \times N_{\text{probesets}}}{N_{\text{probesets, AfterFiltering}} \times N_{\text{spike-in}}} \times 100$$

$$\text{enrichment} = \frac{E_{\text{AfterChrom}} \times P_{\text{mg, BeforeChrom}}}{P_{\text{mg, AfterChrom}} \times E_{\text{BeforeChrom}}} \times 100$$

Page 51: Laboratorio Bioinformatica

Filtering by genefilter pOverA
(keep a probe set if ≥ 25% of its intensities across the arrays are ≥ log2(100))

• Before filtering: 22300 probe sets, 42/42 spike-ins, enrichment: 100%
• After filtering: 5553 probe sets, 42/42 spike-ins, enrichment: 401%
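The enrichment figures on this slide can be checked with a few lines of code. The sketch below assumes that enrichment is the ratio between the spike-in fraction after filtering and the spike-in fraction before filtering, expressed as a percentage; the function name is illustrative.

```python
def enrichment(spike_after, probesets_after, spike_before, probesets_before):
    """Spike-in fraction after filtering relative to before filtering, in percent."""
    fraction_after = spike_after / probesets_after
    fraction_before = spike_before / probesets_before
    return 100.0 * fraction_after / fraction_before

# Unfiltered data: all 22300 probe sets, 42 spike-ins
unfiltered = enrichment(42, 22300, 42, 22300)
# After the intensity filter: 5553 probe sets left, all 42 spike-ins retained
filtered = enrichment(42, 5553, 42, 22300)

print(f"{unfiltered:.1f}%")
print(f"{filtered:.1f}%")
```

Since all 42 spike-ins are retained, the enrichment reduces to 22300 / 5553 × 100, which is about 401.6% and matches the 401% reported on the slide after truncation.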

Page 52: Laboratorio Bioinformatica

Filtering by interquartile range (IQR)

[Figure: the IQR spans the interval between the 25% and 75% quantiles of the distribution]

Page 53: Laboratorio Bioinformatica

How does filtering by genefilter IQR work?

The distribution of all intensity values of a differential expression experiment is the summary of the distributions of each gene's expression over the experimental conditions.

Page 54: Laboratorio Bioinformatica

How does filtering by IQR work?

The filter removes genes that show little change across the experimental points.

Page 55: Laboratorio Bioinformatica

Filtering by genefilter IQR
(removing probe sets if intensity IQR < 0.25 or < 0.5)

• Before filtering: 22300 probe sets, 42/42 spike-ins, enrichment: 100%
• IQR filter at 0.25: 244 probe sets, 42/42 spike-ins, enrichment: 9139%
• IQR filter at 0.5: 68 probe sets, 42/42 spike-ins, enrichment: 32794%

Page 56: Laboratorio Bioinformatica

Exercise

• Load the Borovecki data starting from the matrix series file using:
  – target.GSE1751.hd.n.txt

• Evaluate with PCA/HCL how the samples separate.

• Apply an interquartile filter at 0.25 and at 0.5.
  – How many transcripts remain after each of the filters?
  – Are the two groups of data still separable with PCA and HCL?

• Apply an interquartile filter at 0.5 and an intensity filter of 50% > 100.
  – What happens to the distribution of the data?
  – Are the data still separable with PCA and HCL?

Page 57: Laboratorio Bioinformatica

Analysis pipeline

[Diagram: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction]

Page 58: Laboratorio Bioinformatica

Statistical analysis

• The sensitivity of statistical tests is affected by the number of available replicates.
• Replicates can be:
  – Technical
  – Biological
• Biological replicates better summarize the variability of samples belonging to a common group.
• The minimum number of replicates is an important issue!

Page 59: Laboratorio Bioinformatica

Fold change filtering

• The intensity change between experimental groups (e.g. control versus treated) is known as:
  – Fold change.

• Frequently an arbitrary threshold

$$\left|\log_2 \frac{\text{Trtd}}{\text{Ctrl}}\right| \geq 1$$

is used to define a significant differential expression.
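A fold change threshold of this kind is easy to compute. The sketch below assumes intensities on the linear scale and a cutoff of 1 on the absolute log2 ratio (a twofold change in either direction); the function names and values are illustrative.

```python
import math

def log2_fold_change(treated, control):
    """log2 ratio of treated over control intensities (linear scale inputs)."""
    return math.log2(treated / control)

def passes_fold_change(treated, control, threshold=1.0):
    """True when the absolute log2 fold change reaches the threshold."""
    return abs(log2_fold_change(treated, control)) >= threshold

# (treated, control) mean intensities for three genes (invented values)
genes = {"geneA": (400.0, 100.0), "geneB": (150.0, 100.0), "geneC": (50.0, 200.0)}

selected = [
    name for name, (treated, control) in genes.items()
    if passes_fold_change(treated, control)
]
print(selected)
```

When the data are already expressed as log2(intensity), the same quantity is simply the difference between the treated and control values.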

Page 60: Laboratorio Bioinformatica

Statistical analysis

• Intensity changes between experimental groups (e.g. control versus treated) are known as:
  – Fold change.
  – Ranking genes based on fold change alone implicitly assigns equal variance to every gene.
• Fold change alone is not sufficient to indicate the significance of the expression changes.
• Fold change has to be supported by statistical information.

Page 61: Laboratorio Bioinformatica

Statistical validation

• Statistical validation can be performed using parametric and non-parametric tests.

• Parametric tests:
  – The populations under analysis are assumed to be normally distributed.

• Non-parametric tests:
  – There is no assumption about the distribution of the samples.

• Non-parametric tests are generally less sensitive than parametric tests.

Page 62: Laboratorio Bioinformatica

Selecting differentially expressed genes

[Diagram: differential expression linked to a specific biological event, partially captured by statistical validation methods I, II and III]

Page 63: Laboratorio Bioinformatica

Selecting differentially expressed genes

• Each method grasps some true signals but not all.
• Each method catches some false signals.
• The trick is to find the best condition to maximize true signals while minimizing fakes.

Page 64: Laboratorio Bioinformatica

SAM

Significance Analysis of Microarrays

Page 65: Laboratorio Bioinformatica

SAM (Significance Analysis of Microarrays) (Tusher et al. 2001)

• The fudge factor regularizes the t-statistic by inflating the denominator.

• s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
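The effect of the fudge factor can be seen in a small numerical sketch. It assumes the two-class form of the statistic, d(i) = (mean of group 2 − mean of group 1) / (s(i) + s0), with s(i) the pooled standard deviation described above and s0 the fudge factor; the function name, the sample values and the s0 value are invented for the example.

```python
import math
from statistics import mean

def sam_d(group1, group2, s0):
    """Regularized t-like statistic for one gene in a two-class unpaired design."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pooled sum of squared deviations from each group mean
    sum_squares = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    # Pooled standard deviation s(i)
    s = math.sqrt((1 / n1 + 1 / n2) * sum_squares / (n1 + n2 - 2))
    # The fudge factor s0 inflates the denominator
    return (m2 - m1) / (s + s0)

# log2 intensities of one low-variance gene (invented values)
normal = [7.0, 7.1, 6.9]
treated = [9.0, 9.2, 8.8]

d_plain = sam_d(normal, treated, s0=0.0)        # ordinary t-like statistic
d_regularized = sam_d(normal, treated, s0=0.5)  # s0 value chosen arbitrarily
print(d_plain, d_regularized)
```

For a gene with a very small s(i), the unregularized statistic becomes very large; adding s0 to the denominator keeps such genes from dominating the ranking purely because of their low variance.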

Page 66: Laboratorio Bioinformatica

SAM design in oneChannelGUI

Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to a between-subjects t-test).

Page 67: Laboratorio Bioinformatica

• SAM uses data permutations to define a set of significantly differentially expressed genes.

[Figure: the sample labels N N N T T T are permuted across the arrays to generate a set of permuted data sets]

Page 68: Laboratorio Bioinformatica

FDR is given by p0 * False / Called, where p0 is the prior probability (pi0) that a gene is not differentially expressed.
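The FDR expression above is a simple ratio. The sketch below takes "False" as the mean number of genes called significant in the permuted data sets and "Called" as the number called in the real data; the function name and all numbers are invented for illustration.

```python
from statistics import mean

def sam_fdr(false_calls_per_permutation, called, p0=1.0):
    """Estimated FDR = p0 * (mean false calls over permutations) / called."""
    if called == 0:
        return 0.0  # nothing called, so nothing can be falsely discovered
    return p0 * mean(false_calls_per_permutation) / called

# Genes called significant in each of four permuted data sets (invented values)
false_calls = [8, 12, 10, 10]
# Genes called significant in the real data, and an assumed p0
fdr = sam_fdr(false_calls, called=100, p0=0.9)
print(f"{fdr:.2%}")
```

In this toy case, 10 false calls are expected on average among 100 called genes, and scaling by p0 = 0.9 gives an estimated FDR of 9%.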

Page 69: Laboratorio Bioinformatica

How does SAM calculate the False Discovery Rate for a specific delta?

[Figure: number of false calls in permutations 1 to 4 and their mean (mean false)]

Page 70: Laboratorio Bioinformatica

Rank Product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments.

It is based on the assumption that, under the null hypothesis that the order of all items is random, the probability of finding a specific item among the top r of n items in a list is p = r/n.

Page 71: Laboratorio Bioinformatica

Multiplying these probabilities leads to the definition of the rank product:

$$RP = \prod_i \frac{r_i}{n_i}$$

where $r_i$ is the rank of the item in the i-th list and $n_i$ is the total number of items in the i-th list.

The smaller the RP value, the smaller the probability that the observed placement of the item at the top of the lists is due to chance.
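The rank product definition translates directly into code. This minimal sketch computes RP for one item from its rank in each list; the function name is illustrative, and the example uses ranks 1 and 2 in two lists of 7 items.

```python
import math

def rank_product(ranks, list_sizes):
    """Product over lists of (rank of the item in the list / size of the list)."""
    return math.prod(rank / size for rank, size in zip(ranks, list_sizes))

# An item ranked 1st in one list of 7 items and 2nd in another list of 7 items
rp = rank_product([1, 2], [7, 7])
print(round(rp, 2))
```

An item that sits near the top of every list gets a small RP, while an item ranked last in both lists of 7 would get RP = 1.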

Page 72: Laboratorio Bioinformatica

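$$RP = \prod_i \frac{r_i}{n_i}$$

[Table: two arrays, A and B, each ranking 7 genes (ranks 1 to 7); column AB reports the rank product of each gene]

Example: (1/7)*(2/7) = 0.04

AB: 0.04, 0.04, 0.31, 0.33, 0.43, 0.61, 0.86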

Page 73: Laboratorio Bioinformatica

$$RP_g = \prod \frac{r_g}{n_g}$$

Page 74: Laboratorio Bioinformatica

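Permuting the genes in the two arrays

[Table: arrays A and B, 7 genes each (ranks 1 to 7), with permuted versions a1, a2 and b1, b2]

$$RP = \prod_i \frac{r_i}{n_i}$$

Rank products of the permuted arrays:
a1b1: 0.04, 0.10, 0.18, 0.29, 0.41, 0.57, 0.73
a1b2: 0.10, 0.12, 0.14, 0.16, 0.43, 0.49, 0.61
a2b1: 0.02, 0.08, 0.37, 0.37, 0.51, 0.57, 0.57
a2b2: 0.08, 0.12, 0.14, 0.24, 0.31, 0.49, 0.71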

Page 75: Laboratorio Bioinformatica

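$$P_g = \frac{1}{G \cdot L} \sum_{l=1}^{L} I\left(RP^{*(l)}_g \le RP_g\right)$$

$$FDR_g = \frac{\frac{1}{L} \sum_{l=1}^{L} I\left(RP^{*(l)}_g \le RP_g\right)}{\sum_{g'} I\left(RP_{g'} \le RP_g\right)}$$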

Page 76: Laboratorio Bioinformatica

Rank products of the permuted arrays:
a1b1: 0.04, 0.10, 0.18, 0.29, 0.41, 0.57, 0.73
a1b2: 0.10, 0.12, 0.14, 0.16, 0.43, 0.49, 0.61
a2b1: 0.02, 0.08, 0.37, 0.37, 0.51, 0.57, 0.57
a2b2: 0.08, 0.12, 0.14, 0.24, 0.31, 0.49, 0.71

Observed rank products, compared with each permutation:
AB: 0.04, 0.04, 0.31, 0.33, 0.43, 0.61, 0.86

Indicator $I\left(RP^{*(l)}_g \le RP_g\right)$ for the 7 genes, one row per permutation:
0, 0, 1, 1, 1, 1, 1
0, 0, 0, 1, 1, 1, 1
0, 0, 1, 1, 1, 1, 1
0, 0, 1, 1, 1, 1, 1

Page 77: Laboratorio Bioinformatica

[Table: indicator values for the 7 genes of AB in the four permutations]

$$P_g = \frac{1}{G \cdot L} \sum_{l=1}^{L} I\left(RP^{*(l)}_g \le RP_g\right)$$

(0+0+0+0)/(4*7) = 0
(0+0+0+0)/(4*7) = 0
(1+1+0+1)/(4*7) = 0.11
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14
(1+1+1+1)/(4*7) = 0.14

Page 78: Laboratorio Bioinformatica

[Table: indicator values for the 7 genes of AB in the four permutations]

$$FDR_g = \frac{\frac{1}{L} \sum_{l=1}^{L} I\left(RP^{*(l)}_g \le RP_g\right)}{\sum_{g'} I\left(RP_{g'} \le RP_g\right)}$$

[(0+0+0+0)/4]/(0+0+0+0) = 0
[(0+0+0+0)/4]/(0+0+0+0) = 0
[(1+0+1+1)/4]/(1+0+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25
[(1+1+1+1)/4]/(1+1+1+1) = 0.25

Significantly differentially expressed genes!

Page 79: Laboratorio Bioinformatica

Exercise

• Load the Borovecki data starting from the matrix series file using:
  – Create the target.GSE8762.gender.txt

• Evaluate with PCA/HCL how the samples separate.

• Apply an interquartile filter at 0.5.

• Use SAM (FDR < 10%) to identify a set of differentially expressed genes.