Analysis of Gene Expression and Gene Networks Biclustering 2.

Analysis of Gene Expression and Gene Networks

Biclustering 2

On this lecture

• Two current biclustering methodologies

• Iterative Signature Algorithm (ISA)– Simple– Randomized

• SAMBA– Combinatorial Roots– Fast

• And maybe a little more

What makes a biclustering algorithm?

• Score/Define what is a bicluster• Algorithm for finding one bicluster in

the data• Algorithm for finding all (many)

biclusters in the data

• Important themes:– Normalization– Redundencies

Previously in GE:

• What is a bicluster:– Cheng church– CTWC

• How to search for a bicluster– Cheng church– CTWC

• Normalization• Redundancies

• Developed at Naama Barkai’s Lab at WIS (I. Ihmels, S. Bergman)

• Motivation: – A bicluster is a “stable” set

of genes and conditions– It is possible to refine

approximate set of genes by “stabalizing” them

The Iterative Signature Algorithm

Normalization: ISA

• Can we normalize for both gene and condition dependent trends?

• In the ISA we are not trying to..

• Given a gene expression matrix E one conditions U and genes V form:– EC : normalize each column to 0 mean, 1

std– EG : normalize each gene to 0 mean, 1 std

What is a bicluster: ISA

• Observe: assume all columns are independent, what is the distribution of

(j in U’) eGij

for a random condition set U’ and gene i?

• Mean = 0, Std=sqrt(|U’|)• Same for (i in V’) eG

ij and gene set V’.• In a bicluster, we like independence

not to hold.

What is a bicluster: ISA

• Given a set of genes U’ define:– ISA(U’) = {v in V s.t. (j in U’) eG

vj > TGσU’}• Given a set of genes V’ define:

– ISA(V’) = {u in U s.t. (j in V’) eCiu > TCσV’}

• TG ,TC – threshold parameters, σU’ ,σV’ standard deviations

• A (perfect) bicluster is a pair (U’,V’) s.t.

ISA(V’) = U’ISA(U’) = V’

Searching for biclusters: ISA

• ISA – defining a directed graph on the set of condition and genes subsets.

• A bicluster is a cycle of two nodes U’• An approximated bicluster is a larger cycle but

not too large.

• The algorithm: start from a random or known gene set, compute ISA until converging to an approximated bicluster:

– Ui = ISA(Vi) , Vi = ISA(Ui-1)– Converge at i when for all j > i-m, |Ui-Uj|/|Ui+Uj| < 1-ε

Redundancies: ISA

• Starting from different seeds yield different fixed points (Biclusters)

• Using different threshold changes the graph structure and give more fixed points.

• Need to filter similar solutions and report a short list of significant biclusters

ISA - applications

• Starting from genes with a known functional annotation and refine them to a bicluster

• Starting from genes with known transcription factor binding sites

• Starting from a set of sequence orthologs

• See: Ihmels et al. Nat Gen 2002, Bergman et al. Phy Rev Letter 2003, Bergman et al. PLoS 2004.

ISA – Pros/Cons

• Pros– Simple, Quite fast– Elegant solution to the normalization problem– Good empirical results in several cases

• Cons– Thresholds setting– Finding good seeds– Redundencies– Non normal behaviors

• Assignment 3 will give you more insights

SAMBA

• Developed here• Motivation:

– Harvest efficient combinatorial techniques for biclustering large datasets.

– Couple a statistical model to the biclusters

– Allow integration of heterogeneous data

The SAMBA model

conditions

gen

es

edge

no edge

G=(U,V,E)Goal : Find high similarity submatrices

Goal : Find dense subgraphs

The SAMBA approach

• Normalization: translate GE matrix to a weighted bipartite graph using a statistical model for the data

• Bicluster model: Heavy subgraphs

• How to find biclusters: Combined hashing and local optimization

• Redundancies: Find many biclusters at once, filter them in post process

From a statistical model to edge weights – a simple example

• Background model: Independent edges, each present with prob. p.

• H – subgraph of n genes, m conds, k edges• P-value = tail of binomial distribution:

• Weight the graph– edges: (-1-log p)– non-edges: (-1-log(1-p)).

then subgraph weight log p-value.

knmknmknmk

kk

ppppk

nmHp

)1(2)1(

')( ''

'

Limitations of the uniform probability model

• Not all dense subgraphs are statistically significant. • Different genes/conds have typical noise

characteristics.• Noisy genes/conds have high probability of forming

dense subgraphs.• An extended likelihood ratio model:

Background Random Graph

Model

Bicluster Random Subgraph Model

Likelihood modeltranslates to sum of weights over edges and non

edges

=

A Degree Based Random Graph Model

• An edge between (u,v) occurs independently with prob p(u,v).• p(u,v) depends on both u and v degrees• P(u,v) = Pr((u,v) in E’ | all G=(U,V,E’) such that

deg(w, E’)=deg(w,E) for all w in U,V)

• Approximated using a hyper-geometric calculation

low-prob edges

medium-prob edges

high-prob edges

Model Likelihood Ratio

'),('),(

'),('),(

),(1

1log

),(log)(log

),(1

1

),()(

Evu

c

Evu

c

Evu

c

Evu

c

vup

p

vup

pBL

vup

p

vup

pBL

Subgraph weight = log likelihood ratio

• Model assumption - bicluster edges occur independently with prob pc

• Likelihood ratio score:

Heaviest bipartite subgraph

• NPC (Dawande et al. 97, Hochbaum 98)• (Recall: node blicque is polynomial!)

• Assumption: degree on V side bounded by d:

• Start by finding heavy bicliques.

• Alg: use hashing to discover heavy subsets of conds. Takes O(n2d) time and space.

Finding Heaviest Biclique432223222

464443224

Using bicliques to find the heaviest biclusters

'

(( ', ')) ((( '), ')u U

w U V w u V

Lemma: If B=(U’,V’) is maximal and XU’ then v s.t. |N(v)X|>=|X|/2.Pf:

Assume edge weight = 1, non-edge weight = -1

Note that:

'

'

0 (( , ')) | ( ) | | ( ) |

2 | ( ) | | |v V

v V

w X V N v X N v X

N v X X

Corrolary: If B=(U’,V’) is maximal then |U’|<= 2d

Using bicliques to find the heaviest biclusters

A set of conditions in a maximal bicluster is the union of up to log(2D) subsets of gene neighborhoods.

• Exhaustive O((n2D)log(2D)) time alg:

•Hash bicliques

•enumerate all log(2D) size N(v) combinations.

• Can be generalized to handle arbitrary edge/nonedge weights.

u’’ u’’’ …U’

SAMBA’s implementation

• Phase I: find heavy bicliques - hash for each gene of deg<d all subsets of neighbors of size 4-6.

• Phase II: greedy expansion of heaviest bicliques containing each gene/cond

• Phase III: filter overlapping biclusters.

Heterogeneous information sources

Transcription Level Protein Level Phenotype Level

1 + 1 = 0

ChIP Chip

mRNA profiling2-Hybrid

Protein ComplexesIdentification usingMass Spec

Syntheticlethality

Barcoded deletion libraries

and so many more…

From experiments to properties

StrongInduction

MediumInduction

MediumRepression

StrongRepression

p1 p2 p3 p4

StrongBinding toTF T

MediumBinding toTF T

HighSensitivity

MediumSensitivity

High ConfidenceInteraction

Medium ConfidenceInteraction

p1

Strong complex binding toprotein Pp2

Medium complex binding toProtein P

p1 p2 p1 p2 p1 p2

gene g

A Heterogeneous Collection of Yeast Genomic

Information• Gene expression: ~1000 conditions, 27

publications• TF binding profiles: 110 profiles from

growth on YPD (Lee et al.)• Phenotype profiles: 6 (30) profiles

(Giaever et al.)• Two hybrid interactions: ~1000

(Uetz et al.)• Protein Complex interaction: ~4000

(Ho et al.)• MIPS interactions: ~1000

A SAMBA moduleG

en

es

Properties

GO annotations

CPA1 CPA2

Statistical Model Provides High Specificity

+ Lymphoma data (Alizadeh et.al)

x Shuffled Data

log p-value

log

lik

elih

ood

Global View of modular organization in yeast

Inferring functional annotations

• Using SAMBA results for annotating uncharacterized yeast genes

• Performing “guilt by association”• Same procedure for properties (which

reflects poorly characterized conditions)

Mating Genes

Uncharacterized Putative Mating

Over

X%

Predictions are highly specific

5 mating predictions were tested experimentally4 mutants failed to mate

SAMBA as a universal language for functional genomics

databases

Gene expressionTF locationProteomicsPhenotypes

…..S

AM

BA

Qu

ery

User

Updated Relevant Modules

SAMBA – Pros/Cons

• Pros– Fast– Allow simultaneous normalization of

genes and conditions– Allow integration of hetergenous data– Well suited for query based usage

• Cons– Discretization

Two words on: Probabilistic Models for Biclsutering

• Bicluster model: each subcolumn have a typical normal distribution ,different from the background

• Model the entire matrix: tile the matrix by biclusters

• Model score: likelihood based• Avoid overfitting by standard

techinuqes

Two words on: Probabilistic Models for Biclsutering

• How to find the biclusters: Start by clustering and refine them using an EM algorithm:– Given a clustering calculate the model

parameters (distirubtions per bicluster)– Given the distributions, reassign the

biclusters

Biclustering - Summary

• A general data mining problem• The key point: defining what is a

bicluster• Algorithms vary depending on the

nature of bicluster model• The future problem: search for

biclusters in a really huge matrices.

Analysis of Gene Expression and Gene Networks Biclustering 2.

Documents

Transcript of Analysis of Gene Expression and Gene Networks Biclustering 2.