More on Microarrays Chitta Baral Arizona State University.

Some case studies

• Yeast cell cycle

– Goal: Find groups of yeast genes whose expression profiles are similar over a 24-hour period.

– Approach: Obtain gene expression measurements using Affymetrix S98 genome microarrays for a synchronized sample of yeast cells over a 24-hour period by sampling the total RNA population at 30-minute intervals.

• Each of the 48 time points was sampled twice, giving duplicate measurements (T1 … T48 and replicates T1' … T48').

• Drug intervention study

– Goal: Characterize the effect(s) of drug X, three hours after it is introduced into normal adult mice, on the expression levels of liver cell genes.

– Approach: Gene expression profiles of liver cells from normal adult mice that are not treated with drug X are used as the control state.

• Call the pre-intervention (control) state A and the post-intervention state B.

• For replicate measurements, liver samples were obtained from MA adult mice without drug X application, and liver samples from another MB adult mice were obtained after drug X was applied.

Some potential questions when trying to cluster

• What uncategorized genes have an expression pattern similar to these genes that are well-characterized?

• How different is the pattern of expression of gene X from other genes?

• What genes closely share a pattern of expression with gene X?

• What category of function might gene X belong to?

• What are all the pairs of genes that closely share patterns of expression?

• Are there subtypes of disease X discernible by tissue gene expression?

• What tissue is this sample tissue closest to?

Questions – cont.

• What are the different patterns of gene expression?

• Which genes have a pattern that may have been a result of the influence of gene X?

• What are all the gene-gene interactions present among these tissue samples?

• Which genes best differentiate these two groups of tissues?

• Which gene-gene interactions best differentiate these two groups of tissue samples?

• Different algorithms are better suited to answering some of these questions than others.

Bioinformatics algorithms and some known uses -- Unsupervised

• Feature determination: Determining genes with interesting properties, without looking for a particular pattern determined a priori.

– Principal component analysis: determine genes explaining the majority of the variance in the data set.

• Cluster determination: Determine groups of genes or samples with similar patterns of gene expression.

– Nearest neighbour clustering: the number of clusters is decided first, then the clusters are calculated and each gene is assigned to a single cluster.

• Self-organizing maps

– Tamayo et al. used it to functionally cluster genes into various patterned time courses in HL-60 cell macrophage differentiation.

– Törönen et al. used a hierarchical SOM to cluster yeast genes responsible for the diauxic shift.

• K-means clustering

– Soukas et al. used it to cluster genes involved in leptin signalling.

– Tavazoie et al. used it to cluster genes with common regulatory sequences.

k-means clustering: basic idea

• Input: n objects (or points) and a number k

• Output: A set of k clusters that minimizes the squared-error criterion (sum of squared errors $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$, where $m_i$ is the mean of cluster $C_i$)

• Algorithm: complexity is $O(nkt)$, where t is the number of iterations

– Arbitrarily choose k objects as the initial cluster centers.

– Repeat:

– (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.

– Update the cluster means (i.e., calculate the mean value of the objects in each cluster).

– Until no change.
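The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes numeric points stored as tuples, squared Euclidean distance as the similarity measure, and a seeded random choice of the initial centers. The names `kmeans`, `sq_dist`, and `mean_point` are hypothetical helpers introduced here.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance |p - q|^2 between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignment, centers, sse) for a basic k-means run."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda i: sq_dist(p, centers[i])) for p in points
        ]
        if new_assignment == assignment:  # until no change
            break
        assignment = new_assignment
        # Update the cluster means; an empty cluster keeps its old center.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = mean_point(members)
    # Squared-error criterion: sum over clusters of |p - m_i|^2.
    sse = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, sse

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centers, sse = kmeans(points, k=2)
print(assignment, centers, sse)
```

Each pass over the data costs one distance computation per object per cluster, which is where the O(nkt) figure comes from.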

Pluses and minuses of k-means

• Pluses: Low complexity

• Minuses

– Mean of a cluster may not be easy to define (data with categorical attributes)

– Necessity of specifying k

– Not suitable for discovering clusters of non-convex shape or of very different sizes

– Sensitive to noise and outlier data points (a small number of such data can substantially influence the mean value)

– Some of the above objections (especially the last one) can be overcome by the k-medoid algorithm.

• Instead of the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster.
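The outlier sensitivity of the mean, and the robustness of the medoid, can be seen on a handful of numbers. The sketch below uses made-up one-dimensional expression values (they are illustrative only, not data from any study) and takes "most centrally located" to mean the object with the smallest total absolute distance to the others; `mean_value` and `medoid` are hypothetical helper names.

```python
def mean_value(values):
    """Arithmetic mean of the values."""
    return sum(values) / len(values)

def medoid(values):
    """The member of `values` with the smallest total distance to all others."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

# Hypothetical expression levels: four similar values and one outlier.
levels = [1.0, 1.2, 0.9, 1.1, 25.0]

print(mean_value(levels))  # pulled far toward the outlier
print(medoid(levels))      # stays among the four similar values
```

Because the medoid must be one of the actual objects, a single extreme value cannot drag the reference point away from the bulk of the cluster.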

K-Medoid clustering

• Input: Set of objects (often points in a multi-dimensional space)

• Output: These objects clustered into k clusters

• Algorithm: complexity is $O(n^2)$ per iteration when $k \ll n$

– Arbitrarily select k representative objects.

– Mark these objects as selected and mark the remaining objects as non-selected.

– Repeat:

– For all selected objects $O_i$ do (complexity $O(k(n-k)n)$):

• For all non-selected objects $O_h$, compute $C_{ih}$, the total cost of swapping $O_i$ with $O_h$, i.e., $C_{ih} = \sum_j C_{jih}$, where $C_{jih} = d_{hj} - d_{ij}$ and $d_{ij}$ denotes $d(O_i, O_j)$, the distance between $O_i$ and $O_j$.

– Select $i_{min}, h_{min}$ such that $C_{i_{min},h_{min}} = \min_{i,h} C_{ih}$ (complexity $O(k(n-k))$).

– If $C_{i_{min},h_{min}} < 0$, then mark $O_{i_{min}}$ as non-selected and $O_{h_{min}}$ as selected.

– Until no change.

– The selected objects now define the clusters.

• A non-selected object $O_j$ belongs to the cluster represented by an object $O_i$ if $d(O_i, O_j) = \min_{O_e} d(O_j, O_e)$, where the minimum is taken over all selected objects $O_e$.
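A swap-based k-medoid loop in the spirit of the steps above might look as follows. Note the assumptions: the cost of a swap is computed here as the change in the total distance from every object to its nearest selected object (the usual PAM-style cost), which is more general than the per-object difference written in the slide; distances are Euclidean; and the initial selection is a seeded random sample. The names `k_medoids`, `dist`, and `total_cost` are hypothetical.

```python
import random
from itertools import product

def dist(p, q):
    """Euclidean distance d(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, selected):
    """Sum over all objects of the distance to the nearest selected object."""
    return sum(min(dist(p, points[m]) for m in selected) for p in points)

def k_medoids(points, k, seed=0):
    """Return (sorted medoid indices, label per object as a medoid index)."""
    rng = random.Random(seed)
    # Arbitrarily select k representative objects (by index).
    selected = set(rng.sample(range(len(points)), k))
    while True:
        current = total_cost(points, selected)
        non_selected = [h for h in range(len(points)) if h not in selected]
        best_change, best_swap = -1e-12, None  # tolerance against rounding
        # Evaluate every (selected i, non-selected h) swap.
        for i, h in product(sorted(selected), non_selected):
            change = total_cost(points, (selected - {i}) | {h}) - current
            if change < best_change:
                best_change, best_swap = change, (i, h)
        if best_swap is None:  # no swap lowers the cost: until no change
            break
        i, h = best_swap
        selected = (selected - {i}) | {h}
    # Each object belongs to the cluster of its nearest selected object.
    labels = [min(sorted(selected), key=lambda m: dist(p, points[m]))
              for p in points]
    return sorted(selected), labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, labels = k_medoids(points, k=2)
print(medoids, labels)
```

Recomputing the full cost for every candidate swap is simple but slower than the bookkeeping the complexity figures in the slide assume; it is done here only to keep the sketch short.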

Self organizing maps

• A neural network algorithm that has been used for a wide variety of applications, mostly for engineering problems but also for data analysis.

• SOM can be used at the same time both to reduce the amount of data by clustering, and for projecting the data nonlinearly onto a lower-dimensional display.

• SOM vs k-means

– In the SOM, the distance of each input from all of the reference vectors, instead of just the closest one, is taken into account, weighted by the neighborhood kernel h. Thus, the SOM functions as a conventional clustering algorithm if the width of the neighborhood kernel is zero.

– Whereas in the K-means clustering algorithm the number K of clusters should be chosen according to the number of clusters there are in the data, in the SOM the number of reference vectors can be chosen to be much larger, irrespective of the number of clusters. The cluster structures will become visible on the special displays.
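To make the role of the neighborhood kernel concrete, here is a minimal one-dimensional SOM sketch. It is an illustration under assumptions not stated in the slides: the reference vectors sit on a line of `n_units` positions, the kernel h is Gaussian in the grid distance to the best-matching unit, and the learning rate and kernel width shrink linearly over the epochs. All names (`train_som`, `kernel`, `best_matching_unit`) and parameter defaults are hypothetical.

```python
import math
import random

def kernel(j, bmu, sigma):
    """Gaussian neighborhood kernel h between grid positions j and bmu."""
    return math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))

def best_matching_unit(weights, x):
    """Index of the reference vector closest to input x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))

def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D SOM and return its list of reference vectors."""
    rng = random.Random(seed)
    # Start each reference vector at a randomly chosen input.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    sigma0 = n_units / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 1e-3)   # shrinking kernel width
        for x in rng.sample(data, len(data)):
            bmu = best_matching_unit(weights, x)
            # Every unit moves toward x, weighted by its kernel value.
            for j, w in enumerate(weights):
                h = kernel(j, bmu, sigma)
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Hypothetical toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
som = train_som(data, n_units=4)
print(som)
```

As the kernel width approaches zero, `kernel` is 1 for the best-matching unit and effectively 0 for every other unit, so only the closest reference vector moves; that is the sense in which the SOM reduces to a conventional clustering update.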

Bioinformatics algorithms and some known uses – Unsupervised; cont.

• Cluster determination (cont.)

– Agglomerative clustering: bottom-up method, where clusters are built up from individual genes, with genes (and the groups they form) successively joined into larger clusters

• Dendrograms: Groups are defined as sub-trees in a phylogenetic-type tree created by a comprehensive pair-wise dissimilarity measure.

• 2-D Dendrograms

– Divisive or partitional clustering: top-down method, where large clusters are successively broken down into smaller ones, until each sub-cluster contains only one object (gene)

• Dendrograms and 2-D Dendrograms.
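The bottom-up method can be sketched as follows. This assumes the common formulation in which every gene starts as its own cluster and the two closest clusters are merged at each step, with Euclidean distance between expression profiles and average linkage between clusters; the slides do not specify a linkage, so that choice is an assumption. The recorded merge order is what a dendrogram draws. The names `agglomerate` and `average_linkage` and the toy profiles are hypothetical.

```python
def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage(a, b, profiles):
    """Mean pair-wise distance between members of clusters a and b."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(profiles):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    # Start with one singleton cluster per gene (clusters are index tuples).
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, a, b = min(
            (average_linkage(a, b, profiles), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        # Replace the pair by their union and record the merge.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical expression profiles for four genes over three time points.
profiles = [(1.0, 2.0, 3.0), (1.1, 2.1, 3.1),
            (9.0, 8.0, 7.0), (9.2, 8.1, 7.0)]
for a, b, d in agglomerate(profiles):
    print(a, b, round(d, 3))
```

Cutting the merge list at a chosen distance yields the sub-trees that define the groups; a divisive method would produce a comparable tree by splitting from the top instead.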
