Literature Review of Microarray Data Mining


Page 1: Literature Review of Microarray Data Mining

Literature Review of Microarray Data Mining

Xin Anders

March 24th, 2006

Page 2: Literature Review of Microarray Data Mining

Gene Expression

Genes are coding DNA segments which specify the composition and structure of proteins.

DNA is transcribed into mRNA, which in turn is translated into protein.

The process by which the information in a gene is transcribed into mRNA, and from there translated into protein, is known as gene expression.

Advances in microarray technologies have revolutionized the traditional one-gene-at-a-time approach by making it possible to study tens of thousands of genes at once.

Page 3: Literature Review of Microarray Data Mining

Microarray Technologies

There are two types of microarray platforms: spotted arrays (historically called cDNA arrays) and photolithographic synthetic arrays (e.g. Affymetrix).

The fundamental difference between these two platforms lies in the experimental setup: two-dye labeling versus one-dye labeling, and co-hybridization versus individual hybridization.

Although the two platforms require different data pre-processing, most downstream data analyses are similar for both. This review focuses on downstream data analyses.

Page 4: Literature Review of Microarray Data Mining

Spotted Arrays

Figure 1. A diagram of a typical spotted array experiment.

Up-regulation and down-regulation are visualized in a single hybridization.

Only relative expression levels are measured, not absolute ones.

Source: wikipedia.com

Page 5: Literature Review of Microarray Data Mining

Gene Chip (Affymetrix)

Figure 2. Each gene/EST is represented by probe sets scattered across the GeneChip. (A) Each probe set is made up of up to 20 pairs of oligos. (B) Each probe pair consists of a perfect match (PM) and a mismatch (MM) oligo. Source: Saviozzi S. et al. 2004

Page 6: Literature Review of Microarray Data Mining

Statistical Analysis and Data Mining Techniques

Gene selection: identify genes that are differentially expressed in a particular biological problem.

Exploratory data analysis: extract (dis)similarities of the gene expression levels (patterns) among all samples.

Discrimination analysis: train a classifier using gene expression profiles to assign any new sample to its respective class.

Pathway analysis: find how genes interact as part of pathways.

Gene functional annotations: associate functional meaning to genes.

Page 7: Literature Review of Microarray Data Mining

Differentially Expressed Genes

Traditionally, a fixed cut-off threshold is used to infer the increase or decrease of gene expression for a single-slide experiment.

Statistical methods that rank genes using replicate array data are more reliable.

Performing the experiment in biological triplicate increases data reliability (Lee ML et al. 2000, Saviozzi et al. 2004).

Page 8: Literature Review of Microarray Data Mining

Statistical Tools to Rank Genes from Replicated Data

Generally, for a limited number of replicates, a parametric test (Student's t-test) or a non-parametric test (Mann-Whitney test) is adequate.

However, when multiple hypotheses are tested, as with thousands of genes on a single microarray chip, the number of false positives (Type I errors) can increase sharply with the number of hypotheses.

For a 10,000-gene array with a P value threshold of 0.05, about 10,000 * 0.05 = 500 genes can be inferred as differentially expressed even though none is.
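The arithmetic above can be checked with a short sketch. It assumes that under the null hypothesis each gene's P value is uniform on [0, 1]; the function names and the random seed are illustrative and do not come from the original slides.

```python
import random


def expected_false_positives(n_genes, alpha):
    """Expected number of genes called significant when no gene truly differs."""
    return n_genes * alpha


def simulate_null_calls(n_genes, alpha, seed=0):
    """Draw one uniform null P value per gene and count those below alpha."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_genes) if rng.random() < alpha)


expected = expected_false_positives(10_000, 0.05)  # about 500
simulated = simulate_null_calls(10_000, 0.05)      # close to 500, varies with the seed
```

The simulated count fluctuates around the expected value, which is why a per-gene P value threshold alone is a poor guide on an array of this size.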

Page 9: Literature Review of Microarray Data Mining

Statistical Tools to Rank Genes from Replicated Data

It is often acceptable to have a few false positives if the majority of true positives are chosen (Leung YF 2003).

SAM (Significance Analysis of Microarrays), developed by Tusher et al., uses this concept to help determine a cut-off after performing adjusted t-tests.

Page 10: Literature Review of Microarray Data Mining

SAM

SAM measures the strength of the relationship between gene expression and the response variable (e.g. irradiated versus unirradiated) by computing a set of gene-specific adjusted t-tests and using repeated permutations of the data to assess them.

The user can set the acceptable false discovery rate (FDR), significance threshold, and fold-change threshold.

Page 11: Literature Review of Microarray Data Mining

A SAM Example

Experiment setup:
2 states: Unirradiated (U) versus Irradiated (I)
2 biological duplicates: 1 and 2
2 technical duplicates: A and B

8 hybridizations: U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B

Source: Tusher VG et al. 2000

Page 12: Literature Review of Microarray Data Mining

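A SAM Example

The relative difference for gene i is

$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$$

where \bar{x}_I(i) and \bar{x}_U(i) are the mean expression levels of gene i in states I and U, and s_0 is a small positive constant.

s(i) is the standard deviation of repeated expression measurements:

$$s(i) = \sqrt{a \left\{ \sum_m \left[ x_m(i) - \bar{x}_I(i) \right]^2 + \sum_n \left[ x_n(i) - \bar{x}_U(i) \right]^2 \right\}}$$

where the sums over m and n run over the measurements in states I and U respectively, and a = (1/n_1 + 1/n_2)/(n_1 + n_2 - 2), with n_1 and n_2 the numbers of measurements in states I and U.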

Genes are ranked by the magnitude of d(i) so that d(1) is the largest relative difference, d(2) is the second largest relative difference and so on.

Source: Tusher VG et al. 2000
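As a minimal sketch of the relative difference, the function below computes d(i) for one gene from its measurements in the two states. The scaling factor a = (1/n1 + 1/n2)/(n1 + n2 - 2) follows Tusher et al.; the function name, the toy expression values, and the choice s0 = 1.0 are assumptions made only for illustration.

```python
import math


def relative_difference(x_i, x_u, s0=0.0):
    """Relative difference d(i) for one gene.

    x_i, x_u: expression measurements in states I and U.
    s0: small positive constant added to the denominator (assumed value).
    """
    n1, n2 = len(x_i), len(x_u)
    mean_i = sum(x_i) / n1
    mean_u = sum(x_u) / n2
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    squares = sum((x - mean_i) ** 2 for x in x_i) + sum((x - mean_u) ** 2 for x in x_u)
    s = math.sqrt(a * squares)
    return (mean_i - mean_u) / (s + s0)


# Hypothetical measurements for three genes: (state I values, state U values).
genes = {
    "geneA": ([2.0, 4.0], [1.0, 1.0]),
    "geneB": ([1.0, 1.2], [1.1, 1.0]),
    "geneC": ([0.5, 0.7], [3.0, 3.4]),
}
d = {name: relative_difference(xi, xu, s0=1.0) for name, (xi, xu) in genes.items()}

# Rank genes so that the first entry has the largest relative difference.
ranked = sorted(d, key=d.get, reverse=True)
```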

Page 13: Literature Review of Microarray Data Mining

A SAM Example

8 hybridizations: U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B

Permutations balanced on biological duplicates are generated, for example:
U1A I1A U2A I2A
U1B I1B U2B I2B
…

Calculate dp(i) for each permutation.

dE(i): the average of dp(i) over the balanced permutations.

Calculate the observed relative difference d(i).

Source: Tusher VG et al. 2000
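The balanced permutations can be enumerated with a short sketch. It assumes that "balanced" means each of the two groups in a permutation holds two hybridizations from biological duplicate 1 and two from biological duplicate 2; under that assumption there are 6 * 6 = 36 such splits. The helper name is hypothetical.

```python
from itertools import combinations

samples = ["U1A", "U1B", "U2A", "U2B", "I1A", "I1B", "I2A", "I2B"]

# The second character of each label is the biological duplicate (1 or 2).
duplicate_1 = [s for s in samples if s[1] == "1"]
duplicate_2 = [s for s in samples if s[1] == "2"]


def balanced_permutations():
    """Return every split of the 8 hybridizations into two balanced groups."""
    splits = []
    for from_dup_1 in combinations(duplicate_1, 2):
        for from_dup_2 in combinations(duplicate_2, 2):
            group_1 = list(from_dup_1) + list(from_dup_2)
            group_2 = [s for s in samples if s not in group_1]
            splits.append((group_1, group_2))
    return splits


splits = balanced_permutations()
```

Each split plays the role of a relabelled "U" group and "I" group; recomputing the relative differences for every split gives dp(i), and averaging them gives dE(i).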

Page 14: Literature Review of Microarray Data Mining

A SAM Example

Now we have:
Observed relative difference d(i)
Expected relative difference dE(i) calculated from the permutations

A threshold on the difference between d(i) and dE(i) can be chosen to yield significant genes.

Source: Tusher VG et al. 2000

Page 15: Literature Review of Microarray Data Mining

A SAM Example

Now we have: N significant genes

We want to determine the false discovery rate (FDR):
1. Horizontal cutoffs are defined as the smallest d(i) among the significantly induced genes and the least negative d(i) among the significantly repressed genes.
2. For each permutation, the number of falsely significant genes is counted.
3. The estimated number of falsely significant genes, F, is the average of the number of falsely significant genes over all permutations.
4. The FDR is calculated as F/N.

Source: Tusher VG et al. 2000
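The four steps above can be sketched as follows. The sketch assumes that a gene in a permutation counts as falsely significant when its permuted relative difference lies at or beyond either horizontal cutoff, and that both the induced and repressed lists are non-empty; the function name and all numbers are made up for illustration.

```python
def estimate_fdr(induced_d, repressed_d, permuted_d):
    """Estimate F and the FDR from observed and permuted relative differences.

    induced_d: d(i) of the significantly induced genes (positive values).
    repressed_d: d(i) of the significantly repressed genes (negative values).
    permuted_d: one list of relative differences per permutation.
    """
    upper_cut = min(induced_d)    # smallest d(i) among induced genes
    lower_cut = max(repressed_d)  # least negative d(i) among repressed genes
    n_significant = len(induced_d) + len(repressed_d)
    false_counts = [
        sum(1 for d in perm if d >= upper_cut or d <= lower_cut)
        for perm in permuted_d
    ]
    f = sum(false_counts) / len(false_counts)
    return f, f / n_significant


# Toy data: three significant genes and two permutations of five genes each.
f, fdr = estimate_fdr(
    induced_d=[3.0, 2.5],
    repressed_d=[-2.8],
    permuted_d=[[2.6, 0.3, -0.1, 0.2, -0.5], [0.4, -0.3, 0.1, -2.9, 2.7]],
)
```

With these toy numbers the two permutations contribute 1 and 2 falsely significant genes, so F is 1.5 and the FDR is 1.5/3.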

Page 16: Literature Review of Microarray Data Mining

SAM

In the comparison by Marchal K. et al. (2002), SAM outperformed the fold test, the t-test and the ANOVA-based bootstrap method.

The number of possible permutations depends on the number of replicates; where feasible, the user should perform the full set of permutations.

Usually, a significance cutoff is chosen to give less than one false positive (Saviozzi et al. 2004).

Page 17: Literature Review of Microarray Data Mining

Statistical Analysis and Data Mining Techniques

Gene selection: identify genes that are differentially expressed in a particular biological problem.

Exploratory data analysis: extract (dis)similarities of the gene expression levels (patterns) among all samples.

Discrimination analysis: train a classifier using gene expression profiles to assign any new sample to its respective class.

Pathway analysis: find how genes interact as part of pathways.

Gene functional annotations: associate functional meaning to genes.

Page 18: Literature Review of Microarray Data Mining

Exploratory Data Analysis

In a more complex experiment, it is essential to extract gene expression patterns among all samples.

Exploratory data analysis, also known as unsupervised data analysis, is essentially a grouping technique that aims to find genes with similar behaviors and doesn't require prior response measurements for the items to be grouped.

Commonly used techniques include hierarchical clustering, self-organizing maps, k-means clustering, and principal component analysis.

Page 19: Literature Review of Microarray Data Mining

Expression Matrix

To interpret the results from multiple experiments, creating an expression matrix is a common visual representation technique.

Each column of the matrix represents a single experiment and each row of the matrix represents a particular gene. Coloring the matrix provides an intuitive visual representation.

Figure: an example expression matrix with columns Experiment 1, 2, 3 and rows Gene 1, 2.

Each element is log2(ratio). If a value is 0, the color is black. A positive value is red and a negative value is green.
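A small sketch of the matrix and its coloring follows. The intensity pairs are hypothetical, and treating the first value of each pair as the numerator of the ratio is an assumption; the slides do not say which channel goes on top.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """One matrix element: log2 of the sample/reference intensity ratio."""
    return math.log2(sample_intensity / reference_intensity)


def cell_color(value):
    """Map a log2 ratio to the color scheme described above."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "black"


# Rows are genes, columns are experiments; each cell is (sample, reference).
intensities = {
    "Gene 1": [(400.0, 100.0), (100.0, 100.0), (50.0, 100.0)],
    "Gene 2": [(100.0, 200.0), (300.0, 150.0), (80.0, 80.0)],
}
matrix = {g: [log2_ratio(s, r) for s, r in row] for g, row in intensities.items()}
colors = {g: [cell_color(v) for v in row] for g, row in matrix.items()}
```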

Page 20: Literature Review of Microarray Data Mining

Before Clustering the Data

The data may need to be rescaled to prevent dominating values from obscuring other important differences.

Decide what kind of distance measurement should be used.
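To illustrate why both choices matter, the sketch below compares two hypothetical genes whose profiles have the same shape but very different magnitudes. Standardizing each profile (mean 0, standard deviation 1) and using one minus the Pearson correlation as the distance are common options, shown here as assumptions rather than as the methods the slides prescribe.

```python
import math


def standardize(profile):
    """Rescale one expression profile to mean 0 and standard deviation 1."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if sd == 0:
        return [0.0] * n  # a flat profile carries no pattern
    return [(x - mean) / sd for x in profile]


def euclidean_distance(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles have the same shape."""
    za, zb = standardize(a), standardize(b)
    r = sum(x * y for x, y in zip(za, zb)) / len(a)
    return 1.0 - r


gene_a = [1.0, 2.0, 3.0]
gene_b = [10.0, 20.0, 30.0]  # same pattern, ten times the magnitude

raw = euclidean_distance(gene_a, gene_b)
rescaled = euclidean_distance(standardize(gene_a), standardize(gene_b))
shape = pearson_distance(gene_a, gene_b)
```

On the raw values the two genes look far apart; after rescaling, or with a correlation-based distance, they are essentially identical.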

Page 21: Literature Review of Microarray Data Mining

Hierarchical Clustering

It is an agglomerative approach in which single expression profiles are joined to form groups, which are further joined until the completion of the process.

Initially, each cluster contains a single gene.
First, the pairwise distance is calculated for all genes.
Second, the two most similar genes g1 and g2 form a new cluster {g1, g2}.
Third, the distance is calculated between all other clusters and the new cluster.
Repeat steps 2-3 until all objects are in one cluster.
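The procedure above can be sketched in a few lines. This is a deliberately naive version that recomputes all cluster distances at every step instead of updating a distance table; Euclidean distance, the `linkage` parameter (`min` for single linkage, `max` for complete linkage) and the toy profiles are assumptions for illustration.

```python
import math


def agglomerate(profiles, linkage=min):
    """Agglomerative clustering; returns the merges in the order they happen.

    Each merge is (cluster_a, cluster_b, distance), where clusters are lists
    of row indices into `profiles`.
    """
    clusters = [[i] for i in range(len(profiles))]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    math.dist(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


profiles = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
merges = agglomerate(profiles)
```

The list of merges is the information a dendrogram displays: which clusters were joined and at what distance.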

Page 22: Literature Review of Microarray Data Mining

Hierarchical Clustering

There are different methods to calculate the distances between the growing clusters and the other remaining clusters.

1. Single-linkage clustering;

2. Complete-linkage clustering;

3. Average-linkage clustering;

4. Weighted pair-group average;

5. Within-group clustering;

6. Ward's method.

Page 23: Literature Review of Microarray Data Mining

Single Linkage Clustering

The distance between two clusters i and j is calculated as the minimum distance between a member of i and a member of j.

This method tends to produce loose clusters and often results in "chaining": a sequential addition of single samples into an existing cluster.

Page 24: Literature Review of Microarray Data Mining

Complete Linkage Clustering

The distance between two clusters i and j is calculated as the greatest distance between a member of i and a member of j.

This method tends to produce compact clusters, and the clusters are often similar in size.

Page 25: Literature Review of Microarray Data Mining

Average Linkage Clustering

The distance between clusters is calculated with average values.

There are many ways to calculate the average value. The most common one is the unweighted pair-group method with arithmetic mean (UPGMA).

In UPGMA, the distance between two clusters is the average of the distances between every point in one cluster and every point in the other. The two clusters with the lowest average value are joined to form a new cluster.
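The single, complete and average (UPGMA) linkage rules differ only in how they summarize the point-to-point distances between two clusters. A minimal sketch, assuming Euclidean distance and two hypothetical one-dimensional clusters:

```python
import math
from itertools import product


def pair_distances(cluster_i, cluster_j):
    """All distances between a member of cluster_i and a member of cluster_j."""
    return [math.dist(p, q) for p, q in product(cluster_i, cluster_j)]


def single_linkage(cluster_i, cluster_j):
    """Minimum member-to-member distance."""
    return min(pair_distances(cluster_i, cluster_j))


def complete_linkage(cluster_i, cluster_j):
    """Greatest member-to-member distance."""
    return max(pair_distances(cluster_i, cluster_j))


def average_linkage(cluster_i, cluster_j):
    """Mean of all member-to-member distances (UPGMA)."""
    distances = pair_distances(cluster_i, cluster_j)
    return sum(distances) / len(distances)


cluster_i = [[0.0], [1.0]]
cluster_j = [[3.0], [5.0]]
# Pairwise distances are 3, 5, 2 and 4.
summary = (
    single_linkage(cluster_i, cluster_j),
    complete_linkage(cluster_i, cluster_j),
    average_linkage(cluster_i, cluster_j),
)
```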

Page 26: Literature Review of Microarray Data Mining

Average Linkage Clustering

Weighted pair-group average is identical to UPGMA except that the size of the respective cluster is used as a weight. This is useful when the cluster sizes vary greatly.

Within-group clustering is similar to UPGMA except that the cluster average is used instead of all individual elements from a cluster.

Ward's method decides which clusters to join by calculating the total sum of squared deviations from the mean of a cluster, and joins clusters in the way that produces the smallest possible increase in the sum of squared errors.

Page 27: Literature Review of Microarray Data Mining

Hierarchical Clustering

Typically, average linkage clustering is used for gene expression data.

As clusters grow in size, the expression vector representing the cluster may no longer represent any gene in the cluster.

Furthermore, if a mistake is introduced early in the process, it can't be corrected.

Page 28: Literature Review of Microarray Data Mining

K-means/medians Clustering

K-means/medians clustering is a good alternative to hierarchical clustering if there is advance knowledge about the number of clusters that should be represented in the data.

Page 29: Literature Review of Microarray Data Mining

K-means/medians Clustering

1. Specify the fixed number (k) of clusters;
2. Randomly assign genes to clusters;
3. Calculate the mean/median expression vector for each cluster, which is used to calculate the distance between each gene and each cluster;
4. Shuffle genes among clusters so that each gene is now in the cluster whose mean/median expression vector is closest to that gene's expression vector;
5. Repeat steps 3 and 4 until genes can no longer be shuffled.
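The five steps above can be sketched for the k-means case as follows. Euclidean distance, the round-robin random assignment (which keeps every cluster non-empty at the start), the iteration cap and the toy profiles are assumptions added for illustration.

```python
import math
import random


def kmeans(profiles, k, seed=0, max_iter=100):
    """K-means following the steps above; returns (labels, centroids)."""
    rng = random.Random(seed)
    order = list(range(len(profiles)))
    rng.shuffle(order)
    labels = [0] * len(profiles)
    for position, gene in enumerate(order):
        labels[gene] = position % k  # step 2: random initial assignment
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 3: mean expression vector of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Step 4: move each gene to the cluster with the closest mean vector.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:  # step 5: stop when no gene moves
            break
        labels = new_labels
    return labels, centroids


profiles = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
labels, centroids = kmeans(profiles, k=2)
```

Which cluster ends up with label 0 or 1 depends on the random start; only the grouping is meaningful.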

Page 30: Literature Review of Microarray Data Mining

Self-Organizing Map

A self-organizing map (SOM) assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition.

Before genes can be assigned to partitions, the user defines a geometric configuration for the partitions. Random vectors are generated for each partition and then are trained so that the data are most effectively separated.
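A rough sketch of this training loop is given below. It assumes the simplest geometric configuration (partitions arranged on a line), a Gaussian neighborhood, and linearly decaying learning rate and radius; all of these choices, the parameter values and the function names are illustrative assumptions rather than the configuration of any particular SOM package.

```python
import math
import random


def train_som(profiles, n_nodes=3, epochs=50, seed=0):
    """Train reference vectors for partitions laid out on a 1-D line."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    lo = [min(p[d] for p in profiles) for d in range(dim)]
    hi = [max(p[d] for p in profiles) for d in range(dim)]
    # Random initial reference vectors inside the range of the data.
    refs = [[rng.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(n_nodes)]
    total = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        for p in rng.sample(profiles, len(profiles)):
            frac = 1.0 - step / total              # decays from 1 towards 0
            rate = 0.5 * frac                      # learning rate
            radius = max(frac * n_nodes / 2.0, 0.5)
            # Best-matching partition for this gene.
            bmu = min(range(n_nodes), key=lambda j: math.dist(p, refs[j]))
            for j in range(n_nodes):
                influence = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                refs[j] = [r + rate * influence * (x - r) for r, x in zip(refs[j], p)]
            step += 1
    return refs


def assign_partitions(profiles, refs):
    """Assign each gene to the partition with the closest reference vector."""
    return [min(range(len(refs)), key=lambda j: math.dist(p, refs[j])) for p in profiles]


profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
refs = train_som(profiles, n_nodes=2)
partitions = assign_partitions(profiles, refs)
```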

Page 31: Literature Review of Microarray Data Mining

Principal Component Analysis

Some of the data might contain redundant information.

Principal component analysis (PCA) picks out patterns in the data while reducing the effective dimensionality without significant loss of information.

PCA is difficult to use alone but powerful when combined with another classification technique such as k-means clustering or SOM.
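As a minimal illustration of redundancy removal, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix. The toy data, in which the second column is exactly twice the first, and the use of power iteration instead of a full eigendecomposition are assumptions chosen to keep the example dependency-free.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return (direction, scores) for the first principal component."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    v = [1.0] * dim  # starting vector; must not be orthogonal to the component
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # the data have no variance
        v = [x / norm for x in w]
    scores = [sum(c[d] * v[d] for d in range(dim)) for c in centered]
    return v, scores


# Two measurements per sample, the second fully redundant with the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

Here a single score per sample captures all the variation, so the two-dimensional data reduce to one dimension without loss.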

Page 32: Literature Review of Microarray Data Mining

Statistical Analysis and Data Mining Techniques

Gene selection: identify genes that are differentially expressed in a particular biological problem.

Exploratory data analysis: extract (dis)similarities of the gene expression levels (patterns) among all samples.

Discrimination analysis: train a classifier using gene expression profiles to assign any new sample to its respective class.

Pathway analysis: find how genes interact as part of pathways.

Gene functional annotations: associate functional meaning to genes.

Page 33: Literature Review of Microarray Data Mining

Discrimination Analysis

It is also known as supervised data analysis, which trains a classifier algorithm using gene expression profiles to classify samples.

This has great promise in clinical diagnostics and has been used successfully in several recent studies.

Page 34: Literature Review of Microarray Data Mining

Clinical Diagnostics with Supervised Learning

T.R. Golub's group at the Whitehead Institute/MIT had several successful cases of class prediction for certain cancers.

Shipp MA et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat. Med. 8, 68-74.

Pomeroy SL et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436-442.

Page 35: Literature Review of Microarray Data Mining

An Example of Clinical Diagnostics

Experiment setup:

Known classification for Cancer1 (AML) and Cancer2 (ALL)

Known samples: 27 ALL, 11 AML

Affymetrix chips (6817 genes)

Find a set of informative genes whose expression patterns are strongly correlated with the class distinction to be predicted.

Build a classifier based on the set of informative genes.

Source: T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (1999) 531-537.

Page 36: Literature Review of Microarray Data Mining

An Example of Clinical Diagnostics

Neighborhood analysis.

Page 37: Literature Review of Microarray Data Mining

An Example of Clinical Diagnostics

Class predictor.

Page 38: Literature Review of Microarray Data Mining

Discrimination Analysis

The challenge for supervised data analysis is to build a classifier that generalizes to new samples.

Over-training on the same dataset results in over-fitting.

Different cross-validation methods (e.g. leave-one-out) can be used to establish a balance between accuracy and generalizability.
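Leave-one-out cross-validation can be sketched as follows: each sample is held out in turn, the classifier is trained on the rest, and the held-out sample is predicted. The nearest-centroid classifier used here is only a stand-in chosen for brevity, not the predictor used in the studies cited above, and the expression values are invented.

```python
import math


def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the class whose mean training profile is closest to `sample`."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


def leave_one_out_accuracy(xs, ys):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical two-gene profiles for six samples.
xs = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
ys = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
accuracy = leave_one_out_accuracy(xs, ys)
```

Because the held-out sample never contributes to training, the resulting accuracy is a less optimistic estimate than accuracy measured on the training data itself.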

Page 39: Literature Review of Microarray Data Mining

Statistical Analysis and Data Mining Techniques

Gene selection: identify genes that are differentially expressed in a particular biological problem.

Exploratory data analysis: extract (dis)similarities of the gene expression levels (patterns) among all samples.

Discrimination analysis: train a classifier using gene expression profiles to assign any new sample to its respective class.

Pathway analysis: find how genes interact as part of pathways.

Gene functional annotations: associate functional meaning to genes.

Page 40: Literature Review of Microarray Data Mining

Pathway Analysis

Genes never work alone in a biological system. Analyzing microarray data from a pathway perspective can lead to a higher level of understanding of the system.

A natural extension of clustering analysis: if genes are assigned to the same cluster, they may be involved in the same signaling pathway. By analyzing the promoters of genes, a higher level of network may be unveiled (Pilpel Y 2001).

Various models are used to construct networks from microarray data. Bayesian networks and Boolean networks are two commonly used models.

Page 41: Literature Review of Microarray Data Mining

A Genetic Regulatory System

Source: H. de Jong. Modeling and simulation of genetic regulatory systems: a literature review. J. Comp. Biol. 9 (2002) 67-103

Page 42: Literature Review of Microarray Data Mining

A Simple Example of Bayesian Network

Source: H. de Jong. Modeling and simulation of genetic regulatory systems: a literature review. J. Comp. Biol. 9 (2002) 67-103

The example consists of a graph, conditional probability distributions for the random variables, the joint probability distribution, and conditional independencies.

Page 43: Literature Review of Microarray Data Mining

A Simple Example of Boolean Network

Source: H. de Jong. Modeling and simulation of genetic regulatory systems: a literature review. J. Comp. Biol. 9 (2002) 67-103

For example, given a state vector 000 at t = 0, the system will move to a state 011 at the next time point t = 1.

The induction of a gene is a deterministic function of the state of a group of other genes.
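A Boolean network of this kind can be simulated in a few lines. The three update rules below are hypothetical: they were picked only so that the state 000 moves to 011, as in the example above, and they are not claimed to be the rules of the figure in the cited review.

```python
def step(state):
    """Synchronous update of a three-gene Boolean network (hypothetical rules)."""
    x1, x2, x3 = state
    return (
        x2 & x3,  # gene 1 is induced only when genes 2 and 3 are both on
        1 - x1,   # gene 2 is on when gene 1 is off
        1 - x2,   # gene 3 is on when gene 2 is off
    )


def trajectory(state, n_steps):
    """List of states visited, starting from `state`."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states


path = trajectory((0, 0, 0), 3)
```

With these assumed rules the system cycles through 000, 011 and 110 and then returns to 000, which shows how a deterministic update function produces repeating state sequences.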

Page 44: Literature Review of Microarray Data Mining

Pathway Analysis

A free software tool called Pathway Processor, developed by the Bauer Center for Genomics at Harvard, can map expression data onto metabolic pathways and evaluate which metabolic pathways are most affected. Fisher's exact test is used to score pathways according to the probability that as many or more genes in a pathway would be altered in a given experiment by chance alone.
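The scoring idea can be sketched as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. This is a generic illustration and not Pathway Processor's own code; the function name, argument names and counts are assumptions.

```python
from math import comb


def pathway_p_value(total_genes, altered_genes, pathway_size, altered_in_pathway):
    """Probability of seeing at least `altered_in_pathway` altered genes in a
    pathway of `pathway_size` genes by chance, when `altered_genes` of the
    `total_genes` on the array are altered overall."""
    upper = min(pathway_size, altered_genes)
    favourable = sum(
        comb(pathway_size, k) * comb(total_genes - pathway_size, altered_genes - k)
        for k in range(altered_in_pathway, upper + 1)
    )
    return favourable / comb(total_genes, altered_genes)


# Toy array: 10 genes, 5 altered overall, a 5-gene pathway with all 5 altered.
p = pathway_p_value(10, 5, 5, 5)  # 1/252
```

A small P value means the pathway contains more altered genes than chance would explain, so pathways can be ranked from the smallest P value upwards.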

Page 45: Literature Review of Microarray Data Mining

Statistical Analysis and Data Mining Techniques

Gene selection: identify genes that are differentially expressed in a particular biological problem.

Exploratory data analysis: extract (dis)similarities of the gene expression levels (patterns) among all samples.

Discrimination analysis: train a classifier using gene expression profiles to assign any new sample to its respective class.

Pathway analysis: find how genes interact as part of pathways.

Gene functional annotations: associate functional meaning to genes.

Page 46: Literature Review of Microarray Data Mining

Gene Functional Annotation

In order to know whether some specific biological process is strongly affected by transcriptional expression, we have to associate functional meaning to genes by using gene functional annotations.

Researchers rely on robust gene annotations to link function to transcriptional profiles.

Gene Ontology (GO) is a commonly used controlled vocabulary for describing the roles of genes and gene products in any organism.

Page 47: Literature Review of Microarray Data Mining

Gene Ontology

GO is divided into three categories: molecular function, biological process, and cellular component.

[Term]
id: GO:0000786
name: nucleosome
namespace: cellular_component
def: "A complex comprised of DNA wound around a multisubunit core and associated proteins, which forms the primary packing unit of DNA into higher order structures." [GOC:elh]
is_a: GO:0043234 ! protein complex
relationship: part_of GO:0000785 ! chromatin
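A term stanza like the nucleosome entry above is a set of "tag: value" lines, so it can be read with a few lines of code. The sketch below is a simplified reader for a single stanza, not a full OBO parser; the function name `parse_obo_term` is an assumption made for the example.

```python
STANZA = """
[Term]
id: GO:0000786
name: nucleosome
namespace: cellular_component
def: "A complex comprised of DNA wound around a multisubunit core and associated proteins, which forms the primary packing unit of DNA into higher order structures." [GOC:elh]
is_a: GO:0043234 ! protein complex
relationship: part_of GO:0000785 ! chromatin
"""


def parse_obo_term(stanza):
    """Parse one [Term] stanza into a dict mapping each tag to a list of values."""
    term = {}
    for line in stanza.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()  # drop the trailing "! comment"
        term.setdefault(tag, []).append(value)
    return term


term = parse_obo_term(STANZA)
print(term["name"], term["namespace"], term["is_a"], term["relationship"])
```

Values are kept in lists because tags such as is_a may appear more than once in a stanza.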

Page 48: Literature Review of Microarray Data Mining

Gene Ontology

GO terms are organized in directed acyclic graphs, which differ from hierarchies in that a child term can have many parent terms.

[Figure: the child term "hexose biosynthesis" has two parent terms, "hexose metabolism" and "monosaccharide biosynthesis".]
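The hexose biosynthesis example can be written as a small child-to-parents map to show what the directed acyclic graph structure means in practice. This is only a sketch: the shared grandparent "monosaccharide metabolism" is an assumption added for illustration, and the names `PARENTS` and `ancestors` are hypothetical.

```python
# Each term maps to the list of its direct parents (is_a links).
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": ["monosaccharide metabolism"],            # assumed link
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],  # assumed link
    "monosaccharide metabolism": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # a term reached by two paths is visited once
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("hexose biosynthesis", PARENTS)))
```

Because a term can be reached through more than one parent, the traversal keeps a set of visited terms. A gene annotated to the child term is implicitly annotated to all of these ancestors.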

Page 49: Literature Review of Microarray Data Mining

Gene Ontology

GO terms become associated with their appropriate gene products through collaborating databases. These databases annotate genes with GO terms, providing references and indicating what kind of evidence is available to support the annotations.
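An annotation therefore ties together a gene product, a GO term, an evidence code, and a supporting reference. The sketch below shows one way to hold such records and filter them by evidence. The gene names and reference IDs are made up for illustration, while IDA (inferred from direct assay) and IEA (inferred from electronic annotation) are standard GO evidence codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene: str        # gene product identifier (hypothetical here)
    go_id: str       # GO term the gene is annotated with
    evidence: str    # GO evidence code, e.g. IDA or IEA
    reference: str   # supporting reference (hypothetical here)


ANNOTATIONS = [
    Annotation("geneA", "GO:0000786", "IDA", "REF:1"),
    Annotation("geneB", "GO:0000786", "IEA", "REF:2"),
    Annotation("geneA", "GO:0000785", "IEA", "REF:3"),
]


def terms_for_gene(annotations, gene, exclude_evidence=frozenset()):
    """GO terms annotated to `gene`, skipping the listed evidence codes."""
    return {
        a.go_id
        for a in annotations
        if a.gene == gene and a.evidence not in exclude_evidence
    }


all_terms = terms_for_gene(ANNOTATIONS, "geneA")
curated_terms = terms_for_gene(ANNOTATIONS, "geneA", exclude_evidence={"IEA"})
```

Filtering on the evidence code lets an analysis choose between broad coverage and annotations backed by experimental evidence.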

Page 50: Literature Review of Microarray Data Mining

References

Aas K (2001). Microarray data mining: a survey. Norsk Regnesentral: Norwegian Computing Center.

Dudoit S et al. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical report no. 576, University of California, Berkeley.

Saviozzi S et al. (2004). Microarray data analysis and mining. Methods Mol. Med., 94: 67-89.

Lee ML et al. (2000). Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. USA, 97: 9834-39.

Leung YF and Cavalieri D (2003). Fundamentals of cDNA microarray data analysis. Trends Genet., 19(11): 649-59.

Tusher VG et al. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98: 5116-21.

Marchal K et al. (2002). Comparison of different methodologies to identify differentially expressed genes in two-sample cDNA microarrays. J. Biol. Systems, 10: 409-430.

Eisen MB et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95: 14863-14868.

Page 51: Literature Review of Microarray Data Mining

References

Tavazoie S et al. (1999). Systematic determination of genetic network architecture. Nat. Genet., 22: 281-285.

Raychaudhuri S (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac. Symp. Biocomput., 455-466.

Tamayo P et al. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA, 96: 2907-2912.

Quackenbush J (2001). Computational analysis of microarray data. Nat. Rev. Genet., 2: 418-27.

Pomeroy SL et al. (2002). Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature, 415: 436-442.

Shipp MA et al. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med., 8: 68-74.

Golub TR et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286: 531-537.

Pilpel Y et al. (2001). Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet., 29: 153-159.

De Jong H (2002). Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol., 9: 67-103.

Pavlidis P et al. (2004). Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem. Res., 29: 1213-22.

Li SH et al. (2004). Microarray data mining using gene ontology. Medinfo 2004.

Page 52: Literature Review of Microarray Data Mining

References

Goble CA et al. (2001). Transparent access to multiple bioinformatics information sources. IBM Systems Journal, 40: 532-551.