Transcription Regulation Transcription Factor Motif Finding Xiaole Shirley Liu STAT115, STAT215.


Transcription Regulation: Transcription Factor Motif Finding

Xiaole Shirley Liu

STAT115, STAT215

Imagine a Chef

Restaurant Dinner Home Lunch

Certain recipes used to make certain dishes

2

Each Cell Is Like a Chef

3

Each Cell Is Like a Chef

Infant Skin Adult Liver

Glucose, Oxygen, Amino Acid

Fat, Alcohol, Nicotine

Skin Cell: Healthy State

Liver Cell: Disease State

Certain genes expressed to make certain proteins

4

Understanding a Genome

Get the complete sequence (encoded cook book)

Observe gene expression at different cell states

(meals prepared at different situations)

Decode gene regulation (decode the book, understand the rules)

5

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

Information in DNA

Milk->Yogurt

Beef->Burger

Egg->Omelet

Fish->Sushi

Flour->Cake

Coding region 2%: what is to be made

6

Information in DNA

Non-coding region 98%

Regulation: When, Where, Amount, Other Conditions, etc.

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Milk->Yogurt

Beef->Burger

Egg->Omelet

Fish->Sushi

Flour->Cake

Morning

Morning

Japanese Restaurant

5 Oz

9 Oz

Butter

Butter

Coding region 2%

7

Measure Gene Expression

• Microarray or SAGE detects the expression of every gene at a certain cell state

• Clustering finds genes that are co-expressed (potentially share regulation)

8

STAT115, 04/01/2008

Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

Look at genes always expressed together:

Upstream Regions

Co-expressed Genes


Morning

Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...

...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...

...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...

[Figure: short words highlighted in the upstream regions above: atttgctt, ttcact, gcaacct, aactccagt, actca, ccagcgccg; the word gcaacct recurs. Labels: Transcription Factor (TF), TF Binding Motif]

Hemoglobin Beta

Hemoglobin Zeta

Hemoglobin Alpha

Hemoglobin Gamma

A motif can only be computationally discovered when there are enough cases for machine learning

12

Computational Motif Finding

• Input data:
– Upstream sequences of gene expression profile cluster
– 20-800 sequences, each 300-5000 bp long
• Output: enriched sequence patterns (motifs)
• Ultimate goals:
– Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?
– Which genes are regulated by this TF, and why is there disease when a TF goes wrong?
– Are there binding partners / competitors for a TF?

13

Challenges: Where/what the signal

The motif should be abundant

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Water

Water

Water

Water

Water

14

Challenges: Where/what the signal

The motif should be abundant

And abundant with significance

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Coconut

Coconut

Coconut

Coconut

Coconut

15

Challenges: Double stranded DNA

Motif appears in both strands

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
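Because the motif can sit on either strand, a scan has to test each window against both the motif and its reverse complement. The following is a minimal sketch of that idea using the three sequences from this slide and the word ATTTACC seen on earlier slides; the function names are illustrative and only exact matches are considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites_both_strands(seq, motif):
    """Return (0-based start, strand) for exact motif matches on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            # The motif lies on the opposite strand at this location.
            hits.append((i, "-"))
    return hits

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC",
    "TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG",
]
for s in seqs:
    print(find_sites_both_strands(s, "ATTTACC"))
```

The first sequence carries the motif on the forward strand, while the other two carry GGTAAAT, its reverse complement.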

16

Challenges: Base substitutions

Sequences do not have to match the motif perfectly; base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATGTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG

CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT

17

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT

18


Sushi

Hand Roll

Sashimi

Tempura

Sake

Fish Fish Fish

Fish Fish Fish Fish

19

Challenges: Two-block motifs

Some motifs have two parts

GACACATTTACCTATGC TGGCCCTACGACCTCTCGC

CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC

GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA

TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG

CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT

AATGCG

GCGTAA

or palindromic patterns

Coconut Milk

20

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision):
  Consensus: CACAAAA
  Degenerate: CRCAAAW (IUPAC: R = A/G, W = A/T)

21

IUPAC for DNA

A adenosine

C cytidine

G guanine

T thymidine

U uridine

R G A (purine)

Y T C (pyrimidine)

K G T (keto)

M A C (amino)

S G C (strong)

W A T (weak)

B C G T (not A)

D A G T (not C)

H A C T (not G)

V A C G (not T)

N A C G T (any)
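A degenerate consensus such as CRCAAAW can be scanned by expanding each IUPAC letter into a character class. This is a small sketch of that binary (match / no match) decision; the helper names and the test sequence are made up for illustration.

```python
import re

# IUPAC nucleotide codes from the table above, as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[ch] for ch in pattern.upper())

def scan(seq, pattern):
    """Return (0-based start, matched word) for every match, overlaps included."""
    # A lookahead keeps the regex engine from skipping overlapping sites.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]

print(iupac_to_regex("CRCAAAW"))
print(scan("TTCACAAAAGGCGCAAATCC", "CRCAAAW"))
```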

22

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR

• Motif representation:
– Regular expression (binary decision): Consensus CACAAAA, Degenerate CRCAAAW
– Position weight matrix (PWM): need score cutoff

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Motif Matrix (computed from the Sites above)

Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score =

$$\frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$

where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$.
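The matrix and the segment score can be reproduced in a few lines. This is a sketch under two assumptions that the slide does not state: the background is uniform (0.25 per base), and an optional pseudocount is offered to avoid zero probabilities. Function names are illustrative.

```python
BASES = "ACGT"
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]

def build_pwm(sites, pseudocount=0.0):
    """Column-wise base frequencies of aligned sites (one dict per position)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return pwm

def likelihood_ratio(segment, pwm, background):
    """p(segment | motif matrix) / p(segment | background)."""
    score = 1.0
    for j, base in enumerate(segment):
        score *= pwm[j][base] / background[base]
    return score

background = {b: 0.25 for b in BASES}  # assumption: uniform background
pwm = build_pwm(SITES)
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus)
print(likelihood_ratio("ATGCAGCT", pwm, background))
print(likelihood_ratio("ATGCAGCT", build_pwm(SITES, 0.5), background))
```

Without pseudocounts the segment ATGCAGCT scores zero, because A never occurs at position 5 of the sites; this is one reason a score cutoff is usually applied to a smoothed matrix.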

23

A Word on Sequence Logo

• SeqLogo consists of stacks of symbols, one stack for each position in the sequence

• The overall height of the stack indicates the sequence conservation at that position

• The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG
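The slide does not give a formula for stack height. A common convention, assumed here, is the information content of the column, 2 minus the Shannon entropy in bits, with each letter drawn at its frequency times that height. The sketch below applies this to the ten sites above and omits the small-sample correction some logo tools use.

```python
import math

BASES = "ACGT"
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]

def column_frequencies(sites, j):
    """Relative frequency of each base at position j of the aligned sites."""
    column = [site[j] for site in sites]
    return {b: column.count(b) / len(column) for b in BASES}

def stack_height(freqs):
    """Information content in bits: 2 - H, where H is the column entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy

def letter_heights(freqs):
    """Height of each letter in the stack: frequency times stack height."""
    total = stack_height(freqs)
    return {b: p * total for b, p in freqs.items() if p > 0}

for j in range(len(SITES[0])):
    freqs = column_frequencies(SITES, j)
    print(j + 1, round(stack_height(freqs), 3), letter_heights(freqs))
```

A fully conserved column would reach 2 bits, and a column with all four bases equally frequent would have height 0.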

24

JASPAR

• User defined cutoff to scan for a particular motif

25

Drawbacks to Known TF Motif Scans

• Limited number of motifs
• Limited number of sites to represent each motif
– Low sensitivity and specificity
• Poor description of motif
– Binding site borders not clear
– Binding sites have many mismatches
• Many motifs look very similar
– E.g. GC-rich motif, E-box (CACGTG)

26

De Novo Motif Finding

27

De novo Sequence Motif Finding

• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

• Regular expression enumeration
– Pattern driven approach
– Enumerate k-mers, check significance in dataset
• Position weight matrix update
– Data driven approach, use data to refine motifs
– EM & Gibbs sampling
– Motif score and Markov background

28

Regular Expression Enumeration

• Oligonucleotide Analysis: check over-representation for every w-mer:
– Expected w occurrence in data
  • Consider genome sequence + current data size
– Observed w occurrence in data
– Over-represented w is potential TF binding motif

Expected occurrence of w in the data = $p_w$ (from genome background) × size of sequence data, to be compared with the observed occurrence of w in the data.
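The enumeration above can be sketched directly: count every w-mer, compute its expected count as p_w times the number of windows, and rank by observed / expected. Two simplifications are assumed here that are not in the slide: p_w comes from independent base frequencies rather than a real genome background, and the ranking uses a plain ratio instead of a significance test. The input is the five upstream sequences from the "Decode Gene Regulation" slide.

```python
from collections import Counter

def count_wmers(seqs, w):
    """Observed count of every w-mer over all sequences (overlaps included)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts

def wmer_probability(wmer, base_freq):
    """p_w under an independent-base background (an assumption of this sketch)."""
    p = 1.0
    for b in wmer:
        p *= base_freq[b]
    return p

def over_represented(seqs, w, base_freq, top=5):
    """Rank w-mers by observed / expected occurrence."""
    n_windows = sum(len(s) - w + 1 for s in seqs)
    rows = []
    for wmer, observed in count_wmers(seqs, w).items():
        expected = wmer_probability(wmer, base_freq) * n_windows
        rows.append((observed / expected, wmer, observed, expected))
    rows.sort(reverse=True)
    return rows[:top]

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]
uniform = {b: 0.25 for b in "ACGT"}
for ratio, wmer, observed, expected in over_represented(seqs, 7, uniform):
    print(wmer, observed, round(expected, 4), round(ratio, 1))
```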

29

Suffix Tree for Fast Search

• Weeder, Pavesi & Pesole 2006

• Construction is linear in time and space in the length of the string S.

• Quickly locating a substring allowing a certain number of mistakes

• Provides first linear-time solutions for the longest common substring problem

• Typically requires significantly more space than storing the string itself.

30

Regular Expression Enumeration

• RE Enumeration Derivatives:
– oligo-analysis, spaced dyads w1.ns.w2
– IUPAC alphabet
– Markov background (later)
– 2-bit encoding, fast index access
– Enumerate limited RE patterns known for a TF protein structure or interaction theme
• Exhaustive, guaranteed to find global optimum, and can find multiple motifs
• Not as flexible with base substitutions, long list of similar good motifs, and limited with motif width

31

Expectation Maximization and Gibbs Sampling Model

• Objects:
– Seq: sequence data to search for motif
– $\theta_0$: non-motif (genome background) probability
– $\theta$: motif probability matrix parameter
– $\pi$: motif site locations
• Problem: $P(\theta, \pi \mid \mathrm{seq}, \theta_0)$
• Approach: alternately estimate
– $\pi$ by $P(\pi \mid \theta, \mathrm{seq}, \theta_0)$
– $\theta$ by $P(\theta \mid \pi, \mathrm{seq}, \theta_0)$
– EM and Gibbs differ in the estimation methods

32

Expectation Maximization

• E step: $\pi \mid \theta, \mathrm{seq}, \theta_0$

TTGACGACTGCACGT

TTGAC p1

TGACG p2

GACGA p3

ACGAC p4

CGACT p5

GACTG p6

ACTGC p7

CTGCA p8

...

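$$p_1 = \text{likelihood ratio} = \frac{P(\mathrm{TTGAC} \mid \theta)}{P(\mathrm{TTGAC} \mid \theta_0)}$$

$$P(\mathrm{TTGAC} \mid \theta_0) = p_{0T}\, p_{0T}\, p_{0G}\, p_{0A}\, p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$$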

33

Expectation Maximization

• E step: $\pi \mid \theta, \mathrm{seq}, \theta_0$

TTGACGACTGCACGT

TTGAC p1

TGACG p2

GACGA p3

ACGAC p4

CGACT p5

GACTG p6

ACTGC p7

CTGCA p8

...

• M step: $\theta \mid \pi, \mathrm{seq}, \theta_0$

p1 TTGAC

p2 TGACG

p3 GACGA

p4 ACGAC

...

• Scale ACGT at each position; $\theta$ reflects the weighted average of $\pi$

34

M Step

TTGACGACTGCACGT

0.8 TTGAC
0.2 TGACG
0.6 GACGA
0.5 ACGAC
0.3 CGACT
0.7 GACTG
0.4 ACTGC
0.1 CTGCA
0.9 TGCAC
…
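One E step and one M step on the example sequence can be sketched as follows. The assumptions are loud ones: segment weights are likelihood ratios normalized to sum to 1 within the sequence (the slide's weights are not normalized this way), a small pseudocount is added in the M step, and the starting matrix simply favors TTGAC with probability 0.85 per position. None of this is claimed to be the exact scheme used by any published EM motif finder.

```python
BASES = "ACGT"

def e_step(seq, theta, theta0):
    """Weight each width-w segment by its normalized likelihood ratio."""
    w = len(theta)
    ratios = []
    for i in range(len(seq) - w + 1):
        r = 1.0
        for j, b in enumerate(seq[i:i + w]):
            r *= theta[j][b] / theta0[b]
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]

def m_step(seq, weights, w, pseudocount=0.1):
    """Re-estimate the motif matrix as a weighted average of all segments."""
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for i, p in enumerate(weights):
        for j, b in enumerate(seq[i:i + w]):
            counts[j][b] += p
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

seq = "TTGACGACTGCACGT"
theta0 = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}  # background from the E step slide
# Hypothetical starting matrix that favors the word TTGAC.
theta = [{b: (0.85 if b == c else 0.05) for b in BASES} for c in "TTGAC"]

weights = e_step(seq, theta, theta0)
new_theta = m_step(seq, weights, w=5)
print([round(p, 3) for p in weights])
print([max(col, key=col.get) for col in new_theta])
```

Iterating the two steps until the matrix stops changing gives the deterministic, locally optimal behavior described for EM.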

35

EM Derivatives

• First EM motif finder (C Lawrence)
– Deterministic algorithm, guarantees local optimum
• MEME (TL Bailey)
– Prior probability allows 0-n sites / sequence
– Runs multiple EMs in parallel with different seeds
– User-friendly results

36

Gibbs Sampling

• Stochastic process, although it may still need multiple initializations
– Sample $\theta$ from $P(\theta \mid \pi, \mathrm{seq}, \theta_0)$
– Sample $\pi$ from $P(\pi \mid \theta, \mathrm{seq}, \theta_0)$
• Collapsed form: $\theta$ estimated with counts, not sampled from a Dirichlet
– Sample site from one seq based on sites from other seqs
• Converged motif matrix $\theta$ and converged motif sites $\pi$ represent the stationary distribution of a Markov chain

37

Gibbs Sampler

• Randomly initialize a probability matrix

[Figure: sequences 1-5, each with one initial site, aligned into the initial motif matrix]

$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$$

(estimated with counts)

38

Gibbs Sampler

• Take out one sequence with its sites from the current motif

[Figure: motif matrix rebuilt without the segment from sequence 1]

39

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: Starting Position of Segment (1-20); y-axis: Segment Score (0-30); Segment (1-8) of Sequence 1 is scored against the motif matrix built without the sequence 1 segment]

40

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: Starting Position of Segment (1-20); y-axis: Segment Score (0-30); Segment (2-9) of Sequence 1 is scored against the motif matrix built without the sequence 1 segment]

41

Segment Score

• Use current motif matrix to score a segment

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Motif Matrix (computed from the Sites above)

Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score =

$$\frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$

where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$.

42

Scoring Segments

Motif 1 2 3 4 5 bg

A 0.4 0.1 0.3 0.4 0.2 0.3

T 0.2 0.5 0.1 0.2 0.2 0.3

G 0.2 0.2 0.2 0.3 0.4 0.2

C 0.2 0.2 0.4 0.1 0.2 0.2

Ignore pseudo counts for now…

Sequence: TTCCATATTAATCAGATTCCG… score

TAATC …

AATCA 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383

ATCAG 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185

TCAGA 0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667

CAGAT …
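The segment scores in this table can be checked with a short script. It takes the motif and background columns exactly as listed above and, like the slide, ignores pseudocounts; the function and variable names are illustrative.

```python
# Motif probabilities per position (columns 1-5) and background, from the table.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
WIDTH = 5

def segment_score(segment):
    """Product over positions of motif probability / background probability."""
    score = 1.0
    for pos, base in enumerate(segment):
        score *= MOTIF[base][pos] / BACKGROUND[base]
    return score

sequence = "TTCCATATTAATCAGATTCCG"
for start in range(len(sequence) - WIDTH + 1):
    segment = sequence[start:start + WIDTH]
    print(start + 1, segment, round(segment_score(segment), 6))
```

Each base is divided by its own background probability, so a C at any position is divided by 0.2 and an A or T by 0.3.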

43

Gibbs Sampler

• Sample site from one seq based on sites from other seqs

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: Starting Position of Segment (1-20); y-axis: Segment Score (0-30); the newly sampled site of sequence 1 is added back, giving the modified motif matrix, estimated with counts]

44

Hill Climbing vs Sampling

45

Pos    1  2  3   4   5   6   7   8   9
Score  3  1  12  5   8   9   1   2   6
SubT   3  4  16  21  29  38  39  41  47

• Rand(subtotal) = X
• Find the first position with subtotal larger than X

Pos    1  2  3   4   5   6   7    8    9
Score  3  1  12  5   8   9   500  2    6
SubT   3  4  16  21  29  38  538  540  546
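The subtotal trick is a cumulative-sum lookup. In the sketch below, hill climbing always takes the highest score, while sampling draws X uniformly from 0 up to (but not including) the final subtotal and returns the first position whose subtotal exceeds X. Function names are illustrative.

```python
import bisect
import itertools
import random

def subtotals(scores):
    """Running sums of the scores (the SubT row)."""
    return list(itertools.accumulate(scores))

def pick_position(scores, x):
    """1-based position of the first subtotal strictly larger than x."""
    return bisect.bisect_right(subtotals(scores), x) + 1

def sample_position(scores, rng=random):
    """Draw a position with probability proportional to its score."""
    x = rng.random() * sum(scores)  # uniform on [0, total)
    return pick_position(scores, x)

def hill_climb_position(scores):
    """Always choose the best-scoring position (1-based)."""
    return scores.index(max(scores)) + 1

scores = [3, 1, 12, 5, 8, 9, 1, 2, 6]
peaked = [3, 1, 12, 5, 8, 9, 500, 2, 6]
print(subtotals(scores), pick_position(scores, 10))
print(subtotals(peaked), hill_climb_position(peaked))
```

With the first score row, sampling still visits many positions; with the second, position 7 holds 500 of the 546 total, so it is drawn about 92% of the time and sampling behaves almost like hill climbing.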

Gibbs Sampler

• Repeat the process until motif converges

[Figure: the next sequence (sequence 2) is taken out and the motif matrix is rebuilt without its segment]
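Putting the steps together (take one sequence out, rebuild the matrix from counts, score every segment, sample a new site, repeat) gives the loop sketched below. It assumes exactly one site per sequence, a fixed width, a uniform background, a pseudocount of 0.5, and a fixed number of sweeps instead of a convergence test; these choices are illustrative and not taken from the slides.

```python
import random

BASES = "ACGT"

def motif_without(seqs, sites, w, skip, pseudocount=0.5):
    """Counts-based motif matrix from the sites of all sequences except `skip`."""
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for k, (seq, start) in enumerate(zip(seqs, sites)):
        if k == skip:
            continue
        for j, b in enumerate(seq[start:start + w]):
            counts[j][b] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def segment_scores(seq, theta, background):
    """Likelihood ratio of every width-w segment of seq."""
    w = len(theta)
    scores = []
    for i in range(len(seq) - w + 1):
        r = 1.0
        for j, b in enumerate(seq[i:i + w]):
            r *= theta[j][b] / background[b]
        scores.append(r)
    return scores

def gibbs_sampler(seqs, w, background, sweeps=100, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    sites = [rng.randrange(len(s) - w + 1) for s in seqs]  # random initialization
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            theta = motif_without(seqs, sites, w, skip=k)
            scores = segment_scores(seq, theta, background)
            # Sample the new site in proportion to its segment score.
            sites[k] = rng.choices(range(len(scores)), weights=scores)[0]
    return sites

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]
background = {b: 0.25 for b in BASES}
sites = gibbs_sampler(seqs, w=7, background=background)
print([(start, seq[start:start + 7]) for start, seq in zip(sites, seqs)])
```

Because the sampler is stochastic, different seeds can end at different alignments, which is why multiple initializations may still be needed.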

46

Gibbs Sampler Intuition

• Beginning:
– Randomly initialized motif
– No preference towards any segment

[Figure: bar chart "Beginning Iterations"; x-axis: Starting position of segments (1-20); y-axis ticks 0-40]

47

Gibbs Sampler Intuition

• Motif appears:
– Motif should have enriched signal (more sites)
– By chance some correct sites come to alignment
– Sites bias motif to attract other similar sites

[Figure: bar chart "Some good aligned segments come"; x-axis: Starting position of segments (1-20); y-axis ticks 0-40]

48

Gibbs Sampler Intuition

• Motif converges:
– All sites come to alignment
– Motif totally biased to sample sites every time

[Figure: bar chart "Motif converges towards the end"; x-axis: Starting position of segments (1-20); y-axis ticks 0-40]

49

Gibbs Sampler

[Figure: sequences 1-5 with their current sites]

• Column shift
• Metropolis algorithm:
– Propose $\pi^*$ as $\pi$ shifted 1 column to left or right
– Calculate motif score $u(\pi)$ and $u(\pi^*)$
– Accept $\pi^*$ with prob = $\min(1, u(\pi^*) / u(\pi))$
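The column-shift move can be sketched as two small helpers: one shifts every site by one column, and one applies the Metropolis acceptance rule to the two motif scores. How the motif score u is computed is not specified on the slide, so it is passed in as a plain positive number; the function names are hypothetical.

```python
import random

def shift_sites(sites, shift, seq_lengths, w):
    """Shift every site by `shift` columns; return None if any site leaves its sequence."""
    shifted = [s + shift for s in sites]
    if any(s < 0 or s + w > n for s, n in zip(shifted, seq_lengths)):
        return None
    return shifted

def metropolis_accept(u_current, u_proposed, rng=random):
    """Accept the proposal with probability min(1, u_proposed / u_current)."""
    if u_proposed >= u_current:
        return True
    return rng.random() < u_proposed / u_current

rng = random.Random(1)
proposal = shift_sites([3, 1, 5], +1, [20, 20, 20], w=4)
print(proposal)
print(metropolis_accept(2.0, 3.0, rng))   # a better score is always accepted
print(sum(metropolis_accept(4.0, 1.0, rng) for _ in range(10000)))
```

A worse alignment is still accepted some of the time (about a quarter of the proposals in the last line), which lets the sampler escape an alignment that is off by one column.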

50

Summary

• Biology and challenge of transcription regulation
• Scan for known TF motif sites: TRANSFAC & JASPAR
• De novo method
– Regular expression enumeration
  • Oligonucleotide analysis
– Position weight matrix update
  • EM (iterate $\theta$, $\pi$; $\theta$ ~ weighted average)
  • Gibbs Sampler (sample $\theta$, $\pi$; Markov chain convergence)

51