Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
-
date post
20-Dec-2015 -
Category
Documents
-
view
217 -
download
1
Transcript of Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Microarrays and CancerSegal et al.
CS 466
Saurabh Sinha
Genomics and pathology
• Genomics provides high-throughput measurements of molecular mechanisms– Microarrays, ChIP-on-chip, etc.
• Genomics may provide the molecular underpinnings of pathology, in a highly comprehensive manner– Revolutionize the diagnosis and management of
diseases, including cancer
Prior applications to cancer
• Gene expression measurements have been applied to cancer diagnosis
• Measure each gene’s expression in several normal tissue samples, and several pathological (diseased) samples
• Find subset of genes differentially expressed in the two sample groups
• If such “gene signatures” of particular cancer types are found, they can become the basis of tests for malignancy
We want better …
• Genes may be differentially expressed, but not enough to cross certain thresholds used in the analysis
• Analyzing the data on a gene-by-gene basis is error prone -- microarray data has inherent noise
• Finding the genes involved in one type of cancer is only the first step; it does not reveal the underlying processes
Part 1: Cancer modules
A “module” level view
• Many methods use “gene modules” (sets of genes) as basic blocks for analysis
• Instead of trying to find changes in individual gene expression profiles, look out for entire sets of genes with changing expression profiles
The study of Mootha et al.
• Showed that expression of “oxidative phosphorylation” genes (a particular set of genes) is reduced in diabetic muscle
• Signal not very strong when looking at individual genes, but highly significant when looking at the “gene module”
Source: Nature Genetics 37, S38 - S45 (2005) D
isea
se t
issu
e(D
iabe
tes
mel
litus
typ
e 2)
Normal tissue(Normal tolerance to glucose)
Grey: all genesRed: oxidative phosphorylation genes
Segal et al.: Methodology
• Compile a large collection of cancer-related microarrays – microarrays measuring gene expression in cancer tissues or
normal tissue
• Compile a large collection of gene sets (modules) from earlier studies
• Identify gene set (modules) induced or repressed in a microarray
• Identify modules induced in several arrays, or repressed in several arrays
• Check if these arrays are enriched in some clinical annotation
Identify gene set (modules) induced or repressed in a microarray
• Given expression value Eg,m of each gene g in the microarray experiment m
• Compute average expression Eg of the gene g over all microarrays
• If Eg,m is 2-fold greater than Eg, call the gene g as induced in array m
• Categorize each gene as being induced or not-induced in the array.
Source: Nature Genetics 36, 1090 - 1098 (2004)
Identify gene set (modules) induced or repressed in a microarray
• |All genes| = N• |Module| = n• |Induced| = m• |Intersection| = k
• Hypergeometric test(N,n,m,k):
• If a set of m genes was chosen at random (sampling w/o replacement), what is the probability that the intersection would be larger than or equal to k?
All genes
Module
Induced
Intersection
€
Pr(
Identify gene set (modules) induced or repressed in a microarray
• |All genes| = N• |Module| = n• |Induced| = m• |Intersection| = k
• Hypergeometric test(N,n,m,k):
• Sum over i>=k: If a set of m genes was chosen at random (sampling w/o replacement), what is the probability that the intersection would be equal to i?
All genes
Module
Induced
Intersection
€
Pr(
Identify gene set (modules) induced or repressed in a microarray
• |All genes| = N• |Module| = n• |Induced| = m• |Intersection| = k
• Hypergeometric test(N,n,m,k):
All genes
Module
Induced
Intersection
€
Pr(
€
n
i
⎛
⎝ ⎜
⎞
⎠ ⎟N − n
m − i
⎛
⎝ ⎜
⎞
⎠ ⎟
N
m
⎛
⎝ ⎜
⎞
⎠ ⎟i≥k
∑
“p-value” of the Hypergeometric test
Identify gene set (modules) induced or repressed in a microarray
• |All genes| = N• |Module| = n• |Induced| = m• |Intersection| = k
• Hypergeometric test(N,n,m,k)
• If the “p-value” is very small, then we infer that the intersection is “statistically significant”, i.e., the module is induced in the microarray
• Similarly define module repressed in microarray
All genes
Module
Induced
Intersection
€
Pr(
Segal et al.: Methodology
• Compile a large collection of cancer-related microarrays – microarrays measuring gene expression in cancer tissues or
normal tissue
• Compile a large collection of gene sets (modules) from earlier studies
• Identify gene set (modules) induced or repressed in a microarray
• Identify modules induced in several arrays, or repressed in several arrays
• Check if these arrays are enriched in some clinical annotation
Source: Nature Genetics 36, 1090 - 1098 (2004)
Segal et al.: Methodology
• Compile a large collection of cancer-related microarrays – microarrays measuring gene expression in cancer tissues or
normal tissue
• Compile a large collection of gene sets (modules) from earlier studies
• Identify gene set (modules) induced or repressed in a microarray
• Identify modules induced in several arrays, or repressed in several arrays
• Check if these arrays are enriched in some clinical annotation
Identify modules induced in several arrays, or repressed in several arrays
Source: Nature Genetics 36, 1090 - 1098 (2004)
Segal et al.: Methodology
• Compile a large collection of cancer-related microarrays – microarrays measuring gene expression in cancer tissues or
normal tissue
• Compile a large collection of gene sets (modules) from earlier studies
• Identify gene set (modules) induced or repressed in a microarray
• Identify modules induced in several arrays, or repressed in several arrays
• Check if these arrays are enriched in some clinical annotation
Check if these arrays are enriched in some clinical annotation
Source: Nature Genetics 36, 1090 - 1098 (2004)
Segal et al: Cancer “module maps”
Source: Nature Genetics 37, S38 - S45 (2005)
Red(m,c): Microarrays in whichmodule m was overexpressed (induced) are enriched in condition c
Green: Microarrays in whichmodule m was underexpressed (repressed) are enriched in condition c
Rows and columns are not inan arbitrary order. They havebeen “clustered” to displaysimilar rows (or columns) together
Insights from cancer module map
• Some modules activated or repressed across many tumor types. Such modules could be related to general tumorogenic processes
• Some modules specifically activated or repressed in certain tumor types or stages of tumor progression
From modules to regulation
• A module map shows the transcriptional changes underlying cancer
• Transcriptional changes are a result of transcription factors and their binding sites
• A deeper understanding of cancer would come from finding out which transcription factors and binding sites led to the transcriptional changes
Part 2: Cis-regulatory elements
Genomics and gene regulation
• Such knowledge comes from genomics data• ChIP-chip studies identify which transcription
factors bind which DNA sequences• Analysis of DNA sequence, using known
binding site motifs, gives us putative binding sites
• Cross-species conservation also tells us something about possible locations of binding sites
Cis-regulatory analysis
• Identify a set of genes whose promoters contain the same binding sites– Such a set of genes is likely to be regulated by the
same TF– Often called a “regulatory module”
• Earlier studies mined microarrays for “co-expressed” genes, then used motif finding algorithms to discover their shared binding sites
Cis-regulatory analysis
• Another approach (Segal et al. 2003) tried to solve the problem in an integrated manner
• Find a set of genes such that– their expression profiles are similar (microarrays)– they share the same binding sites (sequence)
• Joint learning of “regulatory module” from two very different types of data: microarray and sequence– An important theme in current bioinformatics
Cis-regulatory analysis
• Connection between gene expression and cis-regulatory elements (binding sites) also explored in Beer & Tavazoie.
• Found rules on combinations and locations of binding sites that would cause the gene to be over- or under-expressed
• The binding sites “RRPE” and “PAC” must occur within240 bp and 140 bp of gene start• Genes containing both motifs, following certain rules on location, are tightly co-regulated• Genes containing any one motif, or both in incorrect positional configuration, have close to random expression
Source: Nature Genetics 37, S38 - S45 (2005)
Eukaryotes• These studies have mostly focused on yeast
(which is a eukaryote, but has a small, compact genome)
• Not much work of this type in the longer, more complex genomes of metazoans (e.g., humans, rodents, fruitflies)
• The genome is not compact; may not suffice to look at sequence right next to a gene. Intergenic regions are long, and cis-regulatory signals may not be close to gene
One study in humans
• HeLa cells are an “immortal” cell-line derived from cervical cancer cells in a person who died in 1951.– Used extensively in studying cancer
• Method of Segal et al. (joint learning of regulatory modules from gene expression and sequence data) applied to these cells
One study in humans
• Gene expression data used: microarrays measuring genes during cell cycle in HeLa cells
• Sequence: 1000 bp promoters (upstream) of human genes
Result of analysis: Two motifs found to be shared by this set of genes. The genes have similar expression profiles. One of theidentified motifs (NFAT) known to be involved in cell-cycle
Source: Nature Genetics 37, S38 - S45 (2005)
Summary
• The common theme is to analyze sets of genes, and relate their common expression patterns to cancer types or to presence of cis-regulatory motifs
• Search algorithms may be required to identify some of these features