A Statistical Method for Finding Transcription Factor Binding Sites

Authors: Saurabh Sinha and Martin Tompa

Presenter: Christopher Schlosberg

CS598ss

Regulation of Gene Expression

Difficulties of Motif Finding

Regulatory sequences don’t follow the same orientation as the coding sequence or as each other

Multiple binding sites might exist for each regulated gene

Large variation in the binding sites of a single factor. Variations are not well understood.

Previous & Proposed Methods for Finding Motifs

Previous Methods:

Find longer, general motifs

Use local search algorithms (Gibbs sampling, Expectation Maximization, greedy algorithms)

Proposed Method:

TFBS is small enough to use enumerative methods

Enumerative statistical methods guarantee global optimality and affordability

Proposed Method Highlights

Allows variations in the binding site instances of a given transcription factor

Allows for motifs to include “spacers”

Allows for overlapping occurrences (in both orientations), which leads to complex dependencies

Statistical significance of a motif (s) is based on the frequencies of shorter (more frequent) oligonucleotides

Use of Markov chain to model background genomic distribution

Use of z-score to measure statistical significance

Allows for multiple binding sites

Characteristics of a Motif

Any single TFBS has significant variation

Many motifs have spacers from 1-11bp

Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)

Variation occurs less between a pair of complementary bases.

Indels are uncommon

Proposed Motif Definition

A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}

A, C, G, T (DNA bases), R (purine: A or G), Y (pyrimidine: C or T), S (strong: C or G), W (weak: A or T), N (spacer: any base)
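The alphabet above can be read as a lookup table from each motif character to the bases it accepts. The following is a minimal sketch of that reading, not the authors' implementation; the names `SYMBOLS` and `matches_at` are hypothetical and chosen only for illustration.

```python
# Bases accepted by each motif character (standard IUPAC meanings).
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def matches_at(motif, seq, pos):
    """Return True if `motif` matches `seq` starting at index `pos`.

    Hypothetical helper: checks one orientation only.
    """
    if pos < 0 or pos + len(motif) > len(seq):
        return False
    return all(seq[pos + i] in SYMBOLS[ch] for i, ch in enumerate(motif))
```

For example, `matches_at("ACRNNT", "ACGTTT", 0)` is true because R accepts G and each N accepts any base, while `matches_at("ACR", "ACC", 0)` is false because R does not accept C.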

TF database (SCPD) confirms this model of variation

Of 50 binding site consensus sequences, 31 are exact fits (62%)

Another 10 fit if slight variations are allowed

Measure of Statistical Significance

Given a set of coregulated S. cerevisiae genes, the input to the problem is the corresponding set of 800bp upstream sequences, each with its 3’ end at the gene’s translation start site.

Model must measure from input sequences:

Absolute number of occurrences (Ns) of motif (s)

Background genomic distribution

X is a set of random DNA sequences, equal in number and lengths to the input sequences

Generated by a Markov chain of order m

Transition probabilities determined by (m+1)-mer frequencies in the full complement of 6000+ upstream sequences (each 800bp in length)

Background model chooses m=3
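One way to picture how (m+1)-mer frequencies determine the transition probabilities of an order-m Markov chain is the sketch below. It is an illustration under simple assumptions (plain maximum-likelihood counts, no smoothing, one strand only); the function name `markov_transitions` is hypothetical and the paper's exact estimation procedure may differ.

```python
from collections import Counter, defaultdict

def markov_transitions(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts.

    Returns a dict: context (length-m string) -> {base: probability}.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Every window of length m+1 is a context plus the base that follows it.
        for i in range(len(seq) - m):
            context, nxt = seq[i:i + m], seq[i + m]
            counts[context][nxt] += 1

    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        probs[context] = {base: n / total for base, n in followers.items()}
    return probs
```

On the toy input `["AAAAC"]` with `m=3`, the context `AAA` is followed once by `A` and once by `C`, so each gets probability 0.5.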

z-score

Xs – random variable: the number of occurrences of motif (s) in X

E(Xs) – expectation, σ(Xs) – standard deviation

zs – number of standard deviations by which the observed value Ns exceeds its expectation
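Written as a single formula, the verbal definition of zs reads as follows. This is only a restatement using the symbols already defined on this slide:

```latex
z_s = \frac{N_s - E(X_s)}{\sigma(X_s)}
```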

Implications

Possibility of overlap of a motif with itself (in either orientation)

Previous study of pattern autocorrelation

Generalized computation of SD, treating motif as a finite set of strings

Higher order Markov chains

Spacers handled at no extra computational cost

Handles motif in either orientation

Algorithm

Enumerates over each input sequence

Tabulates number Ns of occurrences of each motif in either direction

Compute expectation and SD for each motif s.t. Ns>0

Calculate z-score

Rank motifs by z-score
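The last three steps above (skip motifs with Ns = 0, compute the z-score, rank) can be sketched as follows. This assumes the counts, expectations, and standard deviations have already been computed elsewhere; the function name `rank_motifs` and the input layout are hypothetical.

```python
def rank_motifs(stats):
    """Rank motifs by z-score, highest first.

    `stats` maps motif -> (n_s, expectation, std_dev), all assumed to be
    precomputed. Motifs with no occurrences are skipped, as in the steps above.
    """
    scored = []
    for motif, (n_s, expectation, std_dev) in stats.items():
        if n_s == 0 or std_dev <= 0:
            continue  # nothing to score, or z-score undefined
        scored.append((motif, (n_s - expectation) / std_dev))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With made-up toy numbers, a motif observed 12 times against an expectation of 4.0 and SD of 2.0 scores z = 4.0 and ranks above one observed 30 times against an expectation of 28.0 and SD of 4.0 (z = 0.5).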

Algorithm Analysis

For a single motif, complexity is O(c^2 k^2)

k – # of nonspacer characters in motif

c – # of instantiations of R, Y, S, W in motif

Only modest values of k

Linear dependence on genome size

Can trim variance calculation to optimize

Number of Occurrences

Convert motif s into a multiset W

Add reverse complements for each string in W

Motif s occurs at a position in X iff some string in W occurs at the same position

Xs - # of occurrences (in X) of each member of W
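The conversion of a motif into W can be sketched as below. This is an illustration under stated assumptions: R, Y, S, W are instantiated into concrete bases, spacers N are kept as wildcards, and W is held as a list so that duplicates survive. The names `expand`, `reverse_complement`, and `motif_to_multiset` are hypothetical, and the paper's special handling of palindromes is not reproduced here.

```python
from itertools import product

# Instantiations of each motif character; N stays a wildcard (assumption).
CHOICES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "N",
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(s):
    """Reverse complement of a string over {A, C, G, T, N}."""
    return s.translate(COMPLEMENT)[::-1]

def expand(motif):
    """All strings obtained by instantiating R, Y, S, W in `motif`."""
    return ["".join(chars) for chars in product(*(CHOICES[ch] for ch in motif))]

def motif_to_multiset(motif):
    """Instantiated strings followed by their reverse complements."""
    forward = expand(motif)
    return forward + [reverse_complement(w) for w in forward]
```

For the motif `ARC` this gives `AAC`, `AGC` and their reverse complements `GTT`, `GCT`. A string equal to its own reverse complement, such as `ACGT`, appears twice, which is why W is a multiset rather than a set.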

Handling Palindromes

Wi – member of W

|W| = T

Number of Occurrences Cont’d

Expectation

Linearity of Expectation
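Linearity of expectation means the expected count is just a sum, over every string in W and every start position, of the probability that the string occurs there; overlaps between occurrences do not matter. The sketch below illustrates this with a deliberately simplified background in which bases are independent (an order-0 model), not the order-3 Markov chain used in the paper. The names `string_probability` and `expected_occurrences` are hypothetical.

```python
def string_probability(w, base_probs):
    """Probability of string `w` at a fixed position under independent bases.

    Spacer characters N match any base and so contribute a factor of 1.
    """
    p = 1.0
    for ch in w:
        if ch != "N":
            p *= base_probs[ch]
    return p

def expected_occurrences(W, seq_lengths, base_probs):
    """Sum of per-string, per-position occurrence probabilities."""
    total = 0.0
    for w in W:
        p = string_probability(w, base_probs)
        for length in seq_lengths:
            positions = max(0, length - len(w) + 1)  # valid start positions
            total += positions * p
    return total
```

With uniform base probabilities of 0.25, W = [`AAC`, `GTT`], and one sequence of length 10, each string has 8 start positions and probability 1/64, so the expected count is 2 × 8 / 64 = 0.25.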

Variance

B term

C term

C Term

A term

A Term

Overlapping Concatenation

CW (like W) is potentially a multiset

One-to-one correspondence

C Term Simplification

A Term Revisited

Si1Si2 Term & Approximation

Kleffe and Borodovsky (1992) Approximation

B Term

B Term Cont’d

Summary

Higher Order Markov Models

Variance calculations remain the same except for Si1Si2 term

Experimental m = 3

Experimental Results & Future Considerations

17 coregulated sets of genes

Known TF with known binding site consensus

In 9 experiments, known consensus was one of 3 highest scoring motifs

Future Topics:

Non-centered spacers

Enumeration

Loop optimization

Filtering repeats

Question

E(Xs) is more straightforward to calculate than σ(Xs). Under the assumptions given in the paper, name one reason why σ(Xs) is harder to compute.
