Analysis of Gene Expression and Gene Networks Biclustering 2.
Analysis of Gene Expression and Gene Networks
Biclustering 2
In this lecture
• Two current biclustering methodologies
• Iterative Signature Algorithm (ISA)
– Simple
– Randomized
• SAMBA
– Combinatorial roots
– Fast
• And maybe a little more
What makes a biclustering algorithm?
• A score/definition of what a bicluster is
• An algorithm for finding one bicluster in the data
• An algorithm for finding all (many) biclusters in the data
• Important themes:
– Normalization
– Redundancies
Previously in GE:
• What is a bicluster:
– Cheng and Church
– CTWC
• How to search for a bicluster:
– Cheng and Church
– CTWC
• Normalization
• Redundancies
The Iterative Signature Algorithm
• Developed at Naama Barkai’s Lab at WIS (J. Ihmels, S. Bergmann)
• Motivation:
– A bicluster is a “stable” set of genes and conditions
– It is possible to refine an approximate set of genes by “stabilizing” it
Normalization: ISA
• Can we normalize for both gene and condition dependent trends?
• In the ISA we are not trying to..
• Given a gene expression matrix E on conditions U and genes V, form:
– $E^C$: normalize each column (condition) to mean 0, std 1
– $E^G$: normalize each gene to mean 0, std 1
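The two normalizations can be sketched in a few lines. This is a minimal illustration, assuming the matrix is stored as a list of gene rows and that "std" means the population standard deviation; the function names `standardize` and `normalize_isa` are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift and scale a list of numbers to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)  # population std: an assumption
    if sd == 0:
        return [0.0 for _ in values]  # constant vector: nothing to scale
    return [(x - mu) / sd for x in values]

def normalize_isa(E):
    """Return (EC, EG) for a matrix E given as a list of gene rows."""
    # EG: each gene (row) is normalized across conditions.
    EG = [standardize(row) for row in E]
    # EC: each condition (column) is normalized across genes.
    cols = [standardize(list(col)) for col in zip(*E)]
    EC = [list(row) for row in zip(*cols)]
    return EC, EG

# Toy matrix: 3 genes (rows) x 3 conditions (columns).
E = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 9.0],
     [0.0, 1.0, 5.0]]
EC, EG = normalize_isa(E)
```

Keeping both matrices avoids having to choose between gene-wise and condition-wise normalization: each is used where its null distribution is the relevant one.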
What is a bicluster: ISA
• Observe: assume all columns are independent. What is the distribution of $\sum_{j \in U'} e^G_{ij}$ for a random condition set U’ and gene i?
• Mean = 0, Std = sqrt(|U’|)
• Same for $\sum_{i \in V'} e^C_{ij}$ and a gene set V’ (Std = sqrt(|V’|))
• In a bicluster, we expect independence not to hold.
What is a bicluster: ISA
• Given a set of conditions U’ define:
– ISA(U’) = {v in V s.t. $\sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'}$}
• Given a set of genes V’ define:
– ISA(V’) = {u in U s.t. $\sum_{j \in V'} e^C_{ju} > T_C \sigma_{V'}$}
• $T_G$, $T_C$: threshold parameters; $\sigma_{U'}$, $\sigma_{V'}$: standard deviations
• A (perfect) bicluster is a pair (U’,V’) s.t. ISA(V’) = U’ and ISA(U’) = V’
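The two threshold operators and the fixed-point test can be sketched as below. This is an illustration under assumptions: the standard deviations are taken as sqrt(|U’|) and sqrt(|V’|) from the independence argument, the matrices are lists of gene rows, and the toy scores stand in for properly normalized matrices. All function names are hypothetical.

```python
from math import sqrt

def isa_genes(EG, conds, t_g):
    """ISA(U'): genes whose summed EG score over conds exceeds t_g * sqrt(|U'|)."""
    sigma = sqrt(len(conds))
    return {i for i, row in enumerate(EG)
            if sum(row[j] for j in conds) > t_g * sigma}

def isa_conds(EC, genes, t_c):
    """ISA(V'): conditions whose summed EC score over genes exceeds t_c * sqrt(|V'|)."""
    sigma = sqrt(len(genes))
    return {u for u in range(len(EC[0]))
            if sum(EC[i][u] for i in genes) > t_c * sigma}

def is_perfect_bicluster(EC, EG, conds, genes, t_c=1.0, t_g=1.0):
    """True when ISA(V') = U' and ISA(U') = V'."""
    return isa_conds(EC, genes, t_c) == conds and isa_genes(EG, conds, t_g) == genes

# Toy scores (not truly normalized): genes 0-1 are high in conditions 0-1.
S = [[2.0, 2.0, -0.5, -0.5],
     [2.0, 2.0, -0.5, -0.5],
     [-0.5, -0.5, -0.5, -0.5],
     [-0.5, -0.5, -0.5, -0.5]]
planted = is_perfect_bicluster(S, S, {0, 1}, {0, 1})
too_wide = is_perfect_bicluster(S, S, {0, 1, 2}, {0, 1})
```

Here `planted` is true because each operator maps one side of the planted block exactly onto the other, while `too_wide` fails because condition 2 does not pass the threshold.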
Searching for biclusters: ISA
• ISA defines a directed graph on the subsets of conditions and genes.
• A bicluster is a cycle of two nodes (U’, V’)
• An approximate bicluster is a larger cycle, but not too large.
• The algorithm: start from a random or known gene set and apply ISA repeatedly until converging to an approximate bicluster:
– $U_i = ISA(V_i)$, $V_i = ISA(U_{i-1})$
– Converge at i when for all j > i-m, $|U_i - U_j| / |U_i + U_j| < 1 - \varepsilon$
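The iteration can be sketched as follows. This is a minimal illustration, not the authors' implementation: the convergence test used here (Jaccard similarity of at least 1 − ε against each of the previous m condition sets) is one assumed reading of the criterion on the slide, and the function names, default parameters, and toy scores are hypothetical.

```python
from math import sqrt

def isa_genes(EG, conds, t_g):
    """Genes whose summed score over conds exceeds t_g * sqrt(|conds|)."""
    sigma = sqrt(len(conds))
    return {i for i, row in enumerate(EG) if sum(row[j] for j in conds) > t_g * sigma}

def isa_conds(EC, genes, t_c):
    """Conditions whose summed score over genes exceeds t_c * sqrt(|genes|)."""
    sigma = sqrt(len(genes))
    return {u for u in range(len(EC[0])) if sum(EC[i][u] for i in genes) > t_c * sigma}

def jaccard(a, b):
    return 1.0 if not a and not b else len(a & b) / len(a | b)

def isa_iterate(EC, EG, seed_genes, t_c=1.0, t_g=1.0, m=2, eps=0.05, max_iter=50):
    """Alternate the two ISA steps from a seed gene set until the condition set is stable."""
    genes, history = set(seed_genes), []
    for _ in range(max_iter):
        conds = isa_conds(EC, genes, t_c)          # U_i = ISA(V_i)
        if not conds:
            return set(), set()                    # the seed dissolved
        history.append(conds)
        recent = history[-(m + 1):-1]              # the m condition sets before this one
        if len(recent) == m and all(jaccard(conds, u) >= 1 - eps for u in recent):
            return conds, genes                    # assumed convergence test
        genes = isa_genes(EG, conds, t_g)          # next gene set from U_i
        if not genes:
            return set(), set()
    return isa_conds(EC, genes, t_c), genes

# Toy scores: genes 0-2 are high in conditions 0-1; the seed {0, 3} is noisy.
S = [[2.0, 2.0, -0.5, -0.5]] * 3 + [[-0.5] * 4] * 2
conds, genes = isa_iterate(S, S, seed_genes={0, 3})
```

Starting from the noisy seed, the first step already recovers conditions {0, 1}, and the next gene step replaces the spurious gene 3 with genes 1 and 2.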
Redundancies: ISA
• Starting from different seeds yields different fixed points (biclusters)
• Using different thresholds changes the graph structure and gives more fixed points.
• Need to filter similar solutions and report a short list of significant biclusters
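One simple way to filter similar solutions is a greedy pass over the biclusters in decreasing score order. The sketch below is an assumption, not the filtering used by ISA: the per-bicluster score, the cell-level Jaccard overlap, and the `max_overlap` cutoff are all hypothetical choices.

```python
def jaccard(a, b):
    return 1.0 if not a and not b else len(a & b) / len(a | b)

def filter_redundant(biclusters, max_overlap=0.5):
    """Keep high-scoring biclusters, dropping any that overlap a kept one too much.

    biclusters: list of (score, conds, genes) with conds and genes as sets.
    """
    kept, kept_cells = [], []
    for score, conds, genes in sorted(biclusters, key=lambda b: b[0], reverse=True):
        cells = {(u, v) for u in conds for v in genes}   # matrix cells covered
        if all(jaccard(cells, other) <= max_overlap for other in kept_cells):
            kept.append((score, conds, genes))
            kept_cells.append(cells)
    return kept

found = [
    (10.0, {0, 1}, {0, 1, 2}),
    (9.0, {0, 1}, {0, 1, 2, 3}),   # shares 6 of 8 cells with the first one
    (5.0, {2, 3}, {4, 5}),
]
short_list = filter_redundant(found)
```

With these inputs the second bicluster is dropped (overlap 0.75) and the other two are reported.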
ISA - applications
• Starting from genes with a known functional annotation and refining them to a bicluster
• Starting from genes with known transcription factor binding sites
• Starting from a set of sequence orthologs
• See: Ihmels et al. Nat Genet 2002, Bergmann et al. Phys Rev E 2003, Bergmann et al. PLoS Biol 2004.
ISA – Pros/Cons
• Pros
– Simple, quite fast
– Elegant solution to the normalization problem
– Good empirical results in several cases
• Cons
– Threshold setting
– Finding good seeds
– Redundancies
– Non-normal behaviors
• Assignment 3 will give you more insights
SAMBA
• Developed here
• Motivation:
– Harness efficient combinatorial techniques for biclustering large datasets.
– Couple a statistical model to the biclusters
– Allow integration of heterogeneous data
The SAMBA model
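[Figure: the gene expression matrix (genes × conditions) drawn as a bipartite graph G=(U,V,E); each matrix cell is an edge or a non-edge]
• Goal: find high-similarity submatrices
• Goal: find dense subgraphs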
The SAMBA approach
• Normalization: translate GE matrix to a weighted bipartite graph using a statistical model for the data
• Bicluster model: Heavy subgraphs
• How to find biclusters: Combined hashing and local optimization
• Redundancies: Find many biclusters at once, filter them in post-processing
From a statistical model to edge weights – a simple example
• Background model: Independent edges, each present with prob. p.
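• H: a subgraph of n genes, m conditions, k edges
• P-value = tail of the binomial distribution, bounded (for p < 1/2) as follows:
$$p(H) = \sum_{k'=k}^{nm} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^{k} (1-p)^{nm-k}$$
• Weight the graph (logs base 2):
– edges: (-1 - log p)
– non-edges: (-1 - log(1-p))
• Then the subgraph weight equals -log of this bound on the p-value.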
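The relation between the edge weights and the p-value can be checked numerically. The sketch below assumes base-2 logarithms and p < 1/2, so that the binomial tail is at most 2^(nm) p^k (1-p)^(nm-k); the function names and the example numbers are hypothetical.

```python
from math import comb, log2

def binomial_tail(n, m, k, p):
    """Probability of at least k edges among n*m independent pairs."""
    N = n * m
    return sum(comb(N, j) * p**j * (1 - p)**(N - j) for j in range(k, N + 1))

def tail_bound(n, m, k, p):
    """Upper bound 2^(nm) * p^k * (1-p)^(nm-k), valid for p < 1/2."""
    N = n * m
    return 2**N * p**k * (1 - p)**(N - k)

def subgraph_weight(n, m, k, p):
    """Sum of edge weights (-1 - log p) and non-edge weights (-1 - log(1-p))."""
    N = n * m
    return k * (-1 - log2(p)) + (N - k) * (-1 - log2(1 - p))

n, m, k, p = 3, 4, 10, 0.1       # 3 genes, 4 conditions, 10 of 12 possible edges
tail = binomial_tail(n, m, k, p)
bound = tail_bound(n, m, k, p)
weight = subgraph_weight(n, m, k, p)
```

The weight is exactly -log2 of the bound, so searching for heavy subgraphs is the same as searching for subgraphs with a small bound on the p-value.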
Limitations of the uniform probability model
• Not all dense subgraphs are statistically significant.
• Different genes/conditions have typical noise characteristics.
• Noisy genes/conditions have a high probability of forming dense subgraphs.
• An extended likelihood ratio model: Bicluster Random Subgraph Model vs. Background Random Graph Model
• The likelihood model translates to a sum of weights over edges and non-edges
A Degree Based Random Graph Model
• An edge (u,v) occurs independently with probability p(u,v).
• p(u,v) depends on the degrees of both u and v
• p(u,v) = Pr((u,v) in E’ | all G=(U,V,E’) such that deg(w,E’) = deg(w,E) for all w in U,V)
• Approximated using a hypergeometric calculation
[Figure legend: low-prob, medium-prob, and high-prob edges]
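To give a feel for a degree-dependent background model, the sketch below uses a simple stand-in: p(u,v) ≈ min(1, deg(u)·deg(v)/|E|). This is explicitly not the hypergeometric calculation mentioned on the slide, whose details are not given here; the function name and the toy graph are hypothetical.

```python
def degree_edge_probs(edges, conds, genes):
    """Stand-in background model: p(u,v) = min(1, deg(u) * deg(v) / |E|)."""
    deg = {w: 0 for w in list(conds) + list(genes)}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    total = len(edges)
    return {(u, v): min(1.0, deg[u] * deg[v] / total) for u in conds for v in genes}

conds = ["c0", "c1", "c2"]
genes = ["g0", "g1", "g2"]
edges = [("c0", "g0"), ("c0", "g1"), ("c0", "g2"), ("c1", "g0"), ("c2", "g0")]
prob = degree_edge_probs(edges, conds, genes)
```

In this toy graph the hub pair (c0, g0) gets probability 1.0 while the low-degree pair (c1, g1) gets 0.2, so an edge between two noisy, high-degree nodes counts as less surprising than one between two quiet nodes.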
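Model Likelihood Ratio
• Model assumption: bicluster edges occur independently with probability $p_c$
• Likelihood ratio score for a bicluster B with edge set E’ and non-edge set $\overline{E'}$:
$$L(B) = \prod_{(u,v) \in E'} \frac{p_c}{p(u,v)} \times \prod_{(u,v) \in \overline{E'}} \frac{1-p_c}{1-p(u,v)}$$
$$\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v) \in \overline{E'}} \log \frac{1-p_c}{1-p(u,v)}$$
• Subgraph weight = log likelihood ratio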
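The log likelihood ratio is a sum of per-pair weights: log(p_c / p(u,v)) for an edge and log((1 - p_c) / (1 - p(u,v))) for a non-edge. A minimal sketch, assuming background probabilities strictly between 0 and 1 and hypothetical names and example values:

```python
from math import log

def pair_weights(p_uv, p_c):
    """Return (edge weight, non-edge weight) for one gene-condition pair."""
    return log(p_c / p_uv), log((1 - p_c) / (1 - p_uv))

def log_likelihood_ratio(conds, genes, edges, prob, p_c):
    """Sum the weights over all pairs of the candidate bicluster."""
    score = 0.0
    for u in conds:
        for v in genes:
            w_edge, w_non = pair_weights(prob[(u, v)], p_c)
            score += w_edge if (u, v) in edges else w_non
    return score

conds, genes, p_c = [0, 1], [0, 1], 0.9
prob = {(u, v): 0.2 for u in conds for v in genes}   # uniform background, for illustration
dense = log_likelihood_ratio(conds, genes, {(0, 0), (0, 1), (1, 0)}, prob, p_c)
sparse = log_likelihood_ratio(conds, genes, {(0, 0)}, prob, p_c)
```

With p_c above the background probability, edges get positive weight and non-edges negative weight, so the dense candidate scores above zero and the sparse one below.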
Heaviest bipartite subgraph
• NP-complete (Dawande et al. 97, Hochbaum 98)
• (Recall: node biclique is polynomial!)
• Assumption: degree on V side bounded by d
• Start by finding heavy bicliques.
• Alg: use hashing to discover heavy subsets of conditions. Takes O(n 2^d) time and space.
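The hashing idea can be sketched as follows: every gene contributes all subsets of its (at most d) neighboring conditions to a hash table, and the table entry for a condition subset collects the genes adjacent to all of it. This is an illustration assuming unit edge weights, so a biclique's weight is the number of its edges; genes with more than d neighbors are simply skipped here, which is a simplification. Names and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hash_bicliques(neighbors, d):
    """Map each condition subset to the genes adjacent to all of it.

    neighbors: dict gene -> set of conditions. Uses O(n * 2^d) table entries.
    """
    table = defaultdict(set)
    for gene, conds in neighbors.items():
        if len(conds) > d:
            continue                      # outside the bounded-degree assumption
        ordered = sorted(conds)
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                table[frozenset(subset)].add(gene)
    return table

def heaviest_biclique(neighbors, d):
    """Return (conds, genes) of the biclique with the most edges."""
    table = hash_bicliques(neighbors, d)
    subset, genes = max(table.items(), key=lambda kv: len(kv[0]) * len(kv[1]))
    return set(subset), genes

neighbors = {"g1": {0, 1, 2}, "g2": {0, 1, 2}, "g3": {0, 1, 2, 3}, "g4": {3}}
conds, genes = heaviest_biclique(neighbors, d=4)
```

In the toy graph the subset {0, 1, 2} is shared by three genes (9 edges), which beats both the larger subset {0, 1, 2, 3} (one gene) and the smaller shared subsets.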
Finding Heaviest Biclique
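Using bicliques to find the heaviest biclusters
$$w((U',V')) = \sum_{u \in U'} w((\{u\},V'))$$
Lemma: If B=(U’,V’) is maximal and X ⊆ U’ then there exists v in V’ s.t. |N(v) ∩ X| >= |X|/2.
Pf: Assume edge weight = 1, non-edge weight = -1. Since B is maximal, w((X,V’)) >= 0 (otherwise removing X from U’ would increase the weight). Note that:
$$0 \le w((X,V')) = \sum_{v \in V'} \left( |N(v) \cap X| - |\overline{N}(v) \cap X| \right) = \sum_{v \in V'} \left( 2|N(v) \cap X| - |X| \right)$$
so at least one term of the last sum is non-negative.
Corollary: If B=(U’,V’) is maximal then |U’| <= 2d (take X = U’ and use |N(v)| <= d).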
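The lemma and the bound |U’| <= 2d can be checked by brute force on a small random graph. The sketch below assumes edge weight +1 and non-edge weight -1 and uses the fact that a bicluster's weight is a sum of per-condition terms, so for a fixed gene set the best condition set is every condition with a positive term. The function name, seed, and graph sizes are hypothetical.

```python
import random
from itertools import combinations

def heaviest_bicluster(neighbors, conds):
    """Exhaustive search over gene subsets (only feasible for tiny inputs)."""
    genes = sorted(neighbors)
    best_weight, best_U, best_V = 0, set(), set()
    for r in range(1, len(genes) + 1):
        for V in combinations(genes, r):
            # w(({u}, V')) = (#genes in V' adjacent to u) - (#genes in V' not adjacent)
            gain = {u: sum(1 if u in neighbors[v] else -1 for v in V) for u in conds}
            U = {u for u, g in gain.items() if g > 0}
            weight = sum(gain[u] for u in U)
            if weight > best_weight:
                best_weight, best_U, best_V = weight, U, set(V)
    return best_weight, best_U, best_V

random.seed(0)
d = 3                                           # degree bound on the gene side
conds = range(8)
neighbors = {g: set(random.sample(range(8), random.randint(1, d))) for g in range(6)}
weight, U_best, V_best = heaviest_bicluster(neighbors, conds)
# Lemma with X = U': some gene in V' is adjacent to at least half of U'.
lemma_holds = any(len(neighbors[v] & U_best) >= len(U_best) / 2 for v in V_best)
```

Because every gene has at most d neighbors, the lemma applied to X = U’ forces |U’| <= 2d, which is what makes exhaustive enumeration over condition subsets bounded.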
Using bicliques to find the heaviest biclusters
• The set of conditions in a maximal bicluster is the union of up to log(2d) subsets of gene neighborhoods.
• Exhaustive O((n 2^d)^{log(2d)}) time alg:
– Hash bicliques
– Enumerate all combinations of log(2d) neighborhoods N(v)
• Can be generalized to handle arbitrary edge/non-edge weights.
[Figure: U’ covered by subsets u’’, u’’’, …]
SAMBA’s implementation
• Phase I: find heavy bicliques - hash for each gene of deg<d all subsets of neighbors of size 4-6.
• Phase II: greedy expansion of heaviest bicliques containing each gene/cond
• Phase III: filter overlapping biclusters.
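Phase II can be pictured as a local search that starts from a biclique and repeatedly applies the single vertex addition or removal that most increases the weight. The sketch below is a generic greedy loop under assumed weights (+1 per edge, -1 per non-edge), not SAMBA's actual implementation; names and the toy graph are hypothetical.

```python
def weight(conds, genes, edges, w_edge=1.0, w_non=-1.0):
    """Total weight of the subgraph induced by conds x genes."""
    return sum(w_edge if (u, v) in edges else w_non for u in conds for v in genes)

def greedy_expand(seed_conds, seed_genes, all_conds, all_genes, edges):
    """Toggle one vertex at a time while the weight strictly improves."""
    conds, genes = set(seed_conds), set(seed_genes)
    current = weight(conds, genes, edges)
    while True:
        best_gain, best_move = 0.0, None
        for u in all_conds:                       # try adding/removing a condition
            trial = conds ^ {u}
            gain = weight(trial, genes, edges) - current
            if gain > best_gain:
                best_gain, best_move = gain, (trial, genes)
        for v in all_genes:                       # try adding/removing a gene
            trial = genes ^ {v}
            gain = weight(conds, trial, edges) - current
            if gain > best_gain:
                best_gain, best_move = gain, (conds, trial)
        if best_move is None:
            return conds, genes, current          # local optimum reached
        conds, genes = best_move
        current += best_gain

# Full block conds {0,1,2} x genes {a,b,c}, plus one stray edge (3, a).
edges = {(u, v) for u in (0, 1, 2) for v in "abc"} | {(3, "a")}
conds, genes, total = greedy_expand({0, 1}, {"a", "b"}, [0, 1, 2, 3], list("abcd"), edges)
```

The loop terminates because the weight strictly increases with every move; in the toy graph the 2×2 seed grows to the full 3×3 block and stops there.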
Heterogeneous information sources
• Transcription level: ChIP-chip, mRNA profiling
• Protein level: two-hybrid, protein complex identification using mass spectrometry
• Phenotype level: synthetic lethality ("1 + 1 = 0"), barcoded deletion libraries
• And so many more…
From experiments to properties
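[Figure: each experiment type is discretized into properties p1, p2, … of a gene g]
• Expression: strong induction, medium induction, medium repression, strong repression (p1 to p4)
• TF binding: strong binding to TF T, medium binding to TF T (p1, p2)
• Phenotype: high sensitivity, medium sensitivity (p1, p2)
• Interaction: high confidence interaction, medium confidence interaction (p1, p2)
• Complex: strong complex binding to protein P, medium complex binding to protein P (p1, p2)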
A Heterogeneous Collection of Yeast Genomic Information
• Gene expression: ~1000 conditions, 27 publications
• TF binding profiles: 110 profiles from growth on YPD (Lee et al.)
• Phenotype profiles: 6 (30) profiles (Giaever et al.)
• Two-hybrid interactions: ~1000 (Uetz et al.)
• Protein complex interactions: ~4000 (Ho et al.)
• MIPS interactions: ~1000
A SAMBA module
[Figure: a module shown as genes × properties, with GO annotations; genes shown include CPA1 and CPA2]
Statistical Model Provides High Specificity
[Figure: plot with axes log p-value and log likelihood; "+" marks lymphoma data (Alizadeh et al.), "x" marks shuffled data]
Global View of modular organization in yeast
Inferring functional annotations
• Using SAMBA results for annotating uncharacterized yeast genes
• Performing “guilt by association”
• Same procedure for properties (which reflect poorly characterized conditions)
[Figure: mating genes and uncharacterized putative mating genes; label "Over X%"]
Predictions are highly specific
• 5 mating predictions were tested experimentally
• 4 mutants failed to mate
SAMBA as a universal language for functional genomics databases
[Figure: data sources (gene expression, TF location, proteomics, phenotypes, …) feed SAMBA; a user query returns updated relevant modules]
SAMBA – Pros/Cons
• Pros
– Fast
– Allows simultaneous normalization of genes and conditions
– Allows integration of heterogeneous data
– Well suited for query-based usage
• Cons
– Discretization
Two words on: Probabilistic Models for Biclustering
• Bicluster model: each subcolumn has a typical normal distribution, different from the background
• Model the entire matrix: tile the matrix by biclusters
• Model score: likelihood based
• Avoid overfitting by standard techniques
Two words on: Probabilistic Models for Biclustering
• How to find the biclusters: start by clustering and refine the clusters using an EM algorithm:
– Given a clustering, calculate the model parameters (distributions per bicluster)
– Given the distributions, reassign the biclusters
Biclustering - Summary
• A general data mining problem
• The key point: defining what a bicluster is
• Algorithms vary depending on the nature of the bicluster model
• The future problem: searching for biclusters in really huge matrices.