Computational Genomics and Proteomics

Page 1: Computational Genomics and Proteomics

Computational Genomics and Proteomics

Lecture 8

Motif Discovery

Centre for Integrative Bioinformatics VU

Page 2: Computational Genomics and Proteomics

Outline

Gene Regulation
    • DNA
    • Transcription factors

Motifs
    • What are they?
    • Binding Sites

Combinatoric Approaches
    • Exhaustive searches
    • Consensus

Comparative Genomics
    • Example

Probabilistic Approaches
    • Statistics
    • EM algorithm
    • Gibbs Sampling

Page 3: Computational Genomics and Proteomics

www.accessexcellence.org

Page 4: Computational Genomics and Proteomics

www.accessexcellence.org

Page 5: Computational Genomics and Proteomics

www.accessexcellence.org

Page 6: Computational Genomics and Proteomics

Four DNA nucleotide building blocks

G-C is more strongly hydrogen-bonded than A-T (three hydrogen bonds versus two)

Page 7: Computational Genomics and Proteomics

Degenerate code

Four bases: A, C, G, T

Two-fold degenerate IUB codes:

R=[AG] -- Purines
Y=[CT] -- Pyrimidines
K=[GT]
M=[AC]
S=[GC]
W=[AT]

Four-fold degenerate: N=[AGCT]
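The degenerate codes above map directly onto regular-expression character classes. The following is a minimal Python sketch, assuming only the codes listed on this slide (the three-fold codes B, D, H and V are left out); the helper names `degenerate_to_regex` and `find_matches` are hypothetical and chosen for illustration.

```python
import re

# The four plain bases plus the two-fold and four-fold codes listed on the slide.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "N": "AGCT",
}


def degenerate_to_regex(pattern):
    """Translate a degenerate pattern such as 'TATRNT' into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUB_CODES[code]  # raises KeyError for codes not in the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead (?=...) lets overlapping occurrences be reported.
    regex = re.compile("(?=(" + degenerate_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]
```

For example, `find_matches("TATRNT", "GGTATAATCCTATGTT")` returns `[2, 10]`, the start positions of TATAAT and TATGTT.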

Page 8: Computational Genomics and Proteomics

Transcription Factors

• Required but not a part of the RNA polymerase complex

• Many different roles in gene regulation:
    • Binding
    • Interaction
    • Initiation
    • Enhancing
    • Repressing

• Various structural classes (e.g. zinc finger domains)

• Consist of both a DNA-binding domain and an interactive domain

Page 9: Computational Genomics and Proteomics

Motifs

• Short sequences of DNA or RNA (or amino acids)
• Often consist of 5-16 nucleotides
• May contain gaps
• Examples include:
    • Splice sites
    • Start/stop codons
    • Transmembrane domains
    • Centromeres
    • Phosphorylation sites
    • Coiled-coil domains
    • Transcription factor binding sites (TFBS – regulatory motifs)

Page 10: Computational Genomics and Proteomics

TFBSs

• Difficult to identify
• Each transcription factor may have more than one binding site
• Degenerate
• Most occur upstream of the transcription start site (TSS) but are known to also occur in:
    • introns
    • exons
    • 3’ UTRs
• Usually occur in clusters, i.e. collections of sites within a region (modules)
• Often repeated
• Sites can be experimentally verified

Page 11: Computational Genomics and Proteomics

Why are TFBSs important?

Aid in identification of gene networks/pathways

Determine correct network structure

Drug discovery

Switch production of gene product on/off

[Figure: Gene A and Gene B]

Page 12: Computational Genomics and Proteomics

Consensus sequences

A consensus sequence matches all of the example sequences closely but not exactly.

A single site:

TACGAT

A set of sites:

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Consensus sequence: TATAAT or TATRNT

Trade-off between the number of mismatches allowed, the ambiguity in the consensus sequence, and the sensitivity and precision of the representation.
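The trade-off can be made concrete by counting mismatches of each site against the two consensus forms. This is a minimal sketch, assuming a simple column-wise majority vote with ties broken alphabetically (an assumption made here so that the result agrees with the slide's TATAAT); the function names are hypothetical.

```python
from collections import Counter

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

# Degenerate codes needed to interpret patterns such as TATRNT.
CODES = {"R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "GC", "W": "AT", "N": "ACGT"}


def majority_consensus(sites):
    """Most frequent base per column; ties are broken alphabetically."""
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        # sorted() puts bases in alphabetical order, and max() keeps the first maximum.
        consensus.append(max(sorted(counts), key=counts.get))
    return "".join(consensus)


def mismatches(pattern, site):
    """Number of positions where the site is not allowed by the pattern."""
    return sum(base not in CODES.get(code, code) for code, base in zip(pattern, site))
```

Here `majority_consensus(SITES)` gives `TATAAT`. Against TATAAT the six sites have 2, 0, 0, 2, 1 and 2 mismatches; against TATRNT they have 1, 0, 0, 1, 0 and 0. The degenerate form fits the known sites better but also matches eight distinct words instead of one, so it will hit more background sequence.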

Page 13: Computational Genomics and Proteomics

Information Content and Entropy
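For reference, the usual definitions are given here as standard background rather than as a transcription of the slide. With p_{b,j} the frequency of base b at motif position j, and taking 0 log 0 = 0, the entropy H_j and information content IC_j of a DNA position are:

```latex
H_j = -\sum_{b \in \{A,C,G,T\}} p_{b,j} \log_2 p_{b,j}
\qquad
IC_j = \log_2 4 - H_j = 2 - H_j \ \text{bits}
```

A fully conserved position (for example the second position of the six sites on the previous slide, which is always A) has H_j = 0 and IC_j = 2 bits; a position with all four bases equally frequent has IC_j = 0 bits. Small-sample corrections are ignored in this form.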

Page 14: Computational Genomics and Proteomics

Sequence Logos

Page 15: Computational Genomics and Proteomics

Frequency Matrices

Given a collection of motifs,

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

create the matrix:

[Matrix with rows T, A, C, G and one column per motif position]
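The six sites can be tabulated with a few lines of Python. This is a minimal sketch: the row order T, A, C, G follows the row labels on the slide, while the function names and the choice to return plain dictionaries are assumptions made for illustration.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "TACG"  # row order taken from the slide's row labels


def count_matrix(sites, bases=BASES):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    counts = {b: [0] * width for b in bases}
    for site in sites:
        for j, base in enumerate(site):
            counts[base][j] += 1
    return counts


def frequency_matrix(sites, bases=BASES):
    """Convert the counts to relative frequencies (each column sums to 1)."""
    counts = count_matrix(sites, bases)
    n = len(sites)
    return {b: [c / n for c in row] for b, row in counts.items()}


if __name__ == "__main__":
    for base, row in count_matrix(SITES).items():
        print(base, row)
```

For these sites the count rows are T: 5 0 5 0 1 6, A: 0 6 0 3 4 0, C: 0 0 1 0 1 0 and G: 1 0 0 3 0 0.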

Page 16: Computational Genomics and Proteomics

Position weight matrices

Page 17: Computational Genomics and Proteomics

Finding Motifs

Two problems:
• Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions.
• Given a collection of genes thought to be related somehow, find the location of the motif common to all and a representation for it.

Two approaches:
• Combinatorial
• Probabilistic
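The first problem (recognising further occurrences of a known motif) is commonly handled by scoring every window of a new sequence against a log-odds position weight matrix. This is a minimal sketch of that idea, assuming a uniform background of 0.25 per base and a pseudocount of 1; both values and all function names are illustrative choices, not taken from the lecture.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "ACGT"


def log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix: one dict of base scores per motif position."""
    n = len(sites)
    pwm = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        scores = {}
        for b in BASES:
            p = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            scores[b] = math.log2(p / background)
        pwm.append(scores)
    return pwm


def scan(pwm, sequence):
    """Score every window of the sequence; returns a list of (start, score)."""
    w = len(pwm)
    return [
        (i, sum(pwm[j][sequence[i + j]] for j in range(w)))
        for i in range(len(sequence) - w + 1)
    ]


def best_hit(pwm, sequence):
    """Return the (start, score) pair with the highest score."""
    return max(scan(pwm, sequence), key=lambda hit: hit[1])
```

With `pwm = log_odds_pwm(SITES)`, calling `best_hit(pwm, "GGCCGTATAATGCGC")` picks the window starting at position 5, which is the embedded TATAAT. In practice a score threshold is needed to decide which windows count as sites.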

Page 18: Computational Genomics and Proteomics

Combinatorial Approach

Page 19: Computational Genomics and Proteomics

Exhaustive Search

Page 20: Computational Genomics and Proteomics

Exhaustive Search

Sample-driven here refers to trying all the words as they occur in the sequences, instead of trying all possible (4^W) words exhaustively
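A sample-driven search can be sketched as follows: the candidate set is the set of width-W words that actually occur in the input, and each candidate is ranked by how many sequences contain it within a given number of mismatches. The scoring rule (number of supporting sequences) and the function names are assumptions made for this illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(word, sequence, max_mismatches):
    """True if some window of the sequence is within max_mismatches of the word."""
    w = len(word)
    return any(
        hamming(word, sequence[i:i + w]) <= max_mismatches
        for i in range(len(sequence) - w + 1)
    )


def sample_driven_search(sequences, width, max_mismatches=1):
    """Return (best words, support), where support = number of sequences matched."""
    # Candidates are only the words observed in the data, not all 4**width words.
    candidates = {
        seq[i:i + width]
        for seq in sequences
        for i in range(len(seq) - width + 1)
    }
    support = {
        word: sum(occurs_within(word, seq, max_mismatches) for seq in sequences)
        for word in candidates
    }
    best = max(support.values())
    return sorted(w for w, s in support.items() if s == best), best
```

For three sequences of length 10 and W = 6 there are at most 15 candidate words, compared with 4^6 = 4096 for the pattern-driven enumeration. The drawback is that the true consensus is missed if it never occurs exactly in any input sequence.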

Page 21: Computational Genomics and Proteomics

Greedy Motif Clustering

Page 22: Computational Genomics and Proteomics

Greedy Motif Clustering

Page 23: Computational Genomics and Proteomics

Greedy Motif Clustering

Page 24: Computational Genomics and Proteomics

Comparative Genomics

Main Idea: Conserved non-coding regions are important
• Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse
• Search for TFBS only in conserved regions

Problems:
• Not all regulatory regions are conserved
• Which genomes to use?

Page 25: Computational Genomics and Proteomics

Phylogenetic Footprinting

Phylogenetic Footprinting refers to the task of finding conserved motifs across different species. Common ancestry and selection on these motifs have resulted in these “footprints”.
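In its simplest form, a footprint is a window of an alignment of orthologous promoters in which every species carries the same word. This is a minimal sketch, assuming the sequences are already aligned to equal length with '-' as the gap character and requiring exact identity, which is much stricter than practical footprinting methods; the function name is hypothetical.

```python
def conserved_windows(aligned, width):
    """Return (start, word) for each gap-free window identical in all sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for i in range(length - width + 1):
        window = aligned[0][i:i + width]
        if "-" in window:
            continue  # skip windows that overlap a gap in the first sequence
        if all(s[i:i + width] == window for s in aligned[1:]):
            hits.append((i, window))
    return hits
```

For the toy alignment `["ACGTATAATGC-A", "TCGTATAATGGCA", "AGGTATAATCC-T"]` and width 6, the function returns `[(2, "GTATAA"), (3, "TATAAT")]`: only alignment columns 2 to 8 are identical in all three sequences.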

Page 26: Computational Genomics and Proteomics

Phylogenetic Footprinting

An Example

Xie et al. 2005, Nature 434, 338-345

• Genome-wide alignments for four species (human, mouse, rat, dog)
• Promoter regions and 3’UTRs then extracted for 17,700 well-annotated genes
• Promoter region taken to be (-2000, +2000) relative to the transcription start site
• This set of sequences then searched exhaustively for motifs

Page 27: Computational Genomics and Proteomics

The Search

Xie et al. 2005

Page 28: Computational Genomics and Proteomics

Expected Rate

Page 29: Computational Genomics and Proteomics

Probabilistic Approach

Page 30: Computational Genomics and Proteomics

Gibbs Sampling (applied to Motif Finding)

Page 31: Computational Genomics and Proteomics

Gibbs Sampling Algorithm
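As an illustration of the general idea, here is a minimal sketch of a Gibbs site sampler. It assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background and a pseudocount of 1; these are simplifying assumptions, the function names are hypothetical, and the variant presented in the lecture (or implemented in AlignACE) may differ.

```python
import random

BASES = "ACGT"


def profile_from(sites, pseudocount=1.0):
    """Column-wise base probabilities with pseudocounts."""
    n = len(sites)
    return [
        {b: (sum(s[j] == b for s in sites) + pseudocount) / (n + 4 * pseudocount)
         for b in BASES}
        for j in range(len(sites[0]))
    ]


def window_weight(profile, window, background=0.25):
    """Likelihood ratio of a window under the motif profile vs. a uniform background."""
    weight = 1.0
    for j, base in enumerate(window):
        weight *= profile[j][base] / background
    return weight


def gibbs_site_sampler(sequences, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Build the profile from all sequences except the held-out one.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(sequences, positions)) if k != i]
            profile = profile_from(others)
            # Sample a new position for the held-out sequence in proportion to its weight.
            starts = list(range(len(seq) - width + 1))
            weights = [window_weight(profile, seq[a:a + width]) for a in starts]
            positions[i] = rng.choices(starts, weights=weights, k=1)[0]
    return positions
```

Because positions are sampled rather than chosen greedily, the procedure can escape poor starting alignments, but a single run returns only one sample; in practice several runs with different seeds are compared.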

Page 32: Computational Genomics and Proteomics

Gibbs Sampling – Motif Positions

Page 33: Computational Genomics and Proteomics

AlignACE - Gibbs Sampling

Page 34: Computational Genomics and Proteomics

Remainder of the lecture: Maximum likelihood and the EM algorithm

The remaining slides are for your information only and will not be part of the exam

Page 35: Computational Genomics and Proteomics

Basic Statistics

Page 36: Computational Genomics and Proteomics

Maximum Likelihood Estimates

Page 37: Computational Genomics and Proteomics

EM Algorithm

Page 38: Computational Genomics and Proteomics

Basic idea (MEME)

http://meme.nbcr.net/meme/meme-intro.html

Page 39: Computational Genomics and Proteomics

Basic idea (MEME)

MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences.

MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.

http://meme.nbcr.net/meme/meme-intro.html
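To make the letter-probability matrix concrete, the sketch below runs expectation-maximisation on a deliberately simplified model: exactly one motif occurrence per sequence, a fixed width, and a fixed background distribution. It is not MEME's implementation (MEME also chooses the width and the number of occurrences automatically); the function names and the pseudocount value are assumptions made for illustration.

```python
BASES = "ACGT"


def e_step(sequences, motif, background):
    """Posterior probability of each motif start position (each row sums to 1)."""
    width = len(motif)
    posteriors = []
    for seq in sequences:
        likelihoods = []
        for start in range(len(seq) - width + 1):
            like = 1.0
            for k, base in enumerate(seq):
                inside = start <= k < start + width
                like *= motif[k - start][base] if inside else background[base]
            likelihoods.append(like)
        total = sum(likelihoods)
        posteriors.append([value / total for value in likelihoods])
    return posteriors


def m_step(sequences, posteriors, width, pseudocount=0.1):
    """Re-estimate the letter-probability matrix from expected base counts."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq, row in zip(sequences, posteriors):
        for start, z in enumerate(row):
            for j in range(width):
                counts[j][seq[start + j]] += z
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]


def em(sequences, motif, background, iterations=20):
    """Alternate E- and M-steps; return the final matrix and position posteriors."""
    for _ in range(iterations):
        posteriors = e_step(sequences, motif, background)
        motif = m_step(sequences, posteriors, len(motif))
    return motif, e_step(sequences, motif, background)
```

The starting matrix matters: EM converges to a local optimum, so a reasonable start (for example a matrix that puts most of its probability on one word observed in the data) is usually tried for several candidate words.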

Page 40: Computational Genomics and Proteomics

Basic MEME Model

Page 41: Computational Genomics and Proteomics

MEME Background frequencies

Page 42: Computational Genomics and Proteomics

MEME – Hidden Variable

Page 43: Computational Genomics and Proteomics

MEME – Conditional Likelihood

Page 44: Computational Genomics and Proteomics

EM algorithm

Page 45: Computational Genomics and Proteomics

Example

Page 46: Computational Genomics and Proteomics

E-step of EM algorithm

Page 47: Computational Genomics and Proteomics

Example

Page 48: Computational Genomics and Proteomics

M-step of EM Algorithm

Page 49: Computational Genomics and Proteomics

Example

Page 50: Computational Genomics and Proteomics

Characteristics of EM

Page 51: Computational Genomics and Proteomics

Gibbs Sampling (versus EM)