Discovering gapped binding sites
![Page 1: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/1.jpg)
Discovering gapped binding sites
Chengwei Lei
Dr. Jianhua Ruan
University of Texas at San Antonio
Department of Computer Science
![Page 2: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/2.jpg)
Outline of Talk
• Motif Finding Background
• Gapped Motif Finding
– Chen’s method
– SPACE
• The PSO-motif algorithm
• Future Work
![Page 3: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/3.jpg)
Introduction/Motivation
• Introduction: Identifying transcription factor binding sites is an important part of analyzing genetic regulation. Many programs have been developed for motif discovery.
• Motivation: Previous algorithms need too much memory or time to produce a result; this work aims to develop a new algorithm that finds motifs using less memory and less time.
![Page 4: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/4.jpg)
What is motif finding
• Motif finding, the process of discovering a meaningful pattern (of nucleotides or amino acids) that is shared by two or more sequences, is an important part of the study of gene function.
![Page 5: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/5.jpg)
Cells respond to environment
Heat
Food supply
Responds to environmental conditions
Various external messages
![Page 6: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/6.jpg)
Regulation of Genes
Gene
Promoter
RNA polymerase (Protein)
Transcription Factor (TF) (Protein)
DNA
![Page 7: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/7.jpg)
Regulation of Genes
Gene
Regulatory Element, TF binding site, TF binding motif, cis-regulatory motif (element)
RNA polymerase (Protein)
Transcription Factor (TF) (Protein)
DNA
![Page 8: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/8.jpg)
Regulation of Genes
Gene
RNA polymerase
Transcription Factor (Protein)
Regulatory Element
DNA
![Page 9: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/9.jpg)
Regulation of Genes
Gene
RNA polymerase
Transcription Factor
Regulatory Element
DNA
New protein
![Page 10: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/10.jpg)
Real example
![Page 11: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/11.jpg)
Real example
![Page 12: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/12.jpg)
Look Like
• I need a refrigerator, so I go to a refrigerator shop, I try to pick a very beautiful refrigerator from a lot of refrigerator(s). Finally I decide that I will buy a GE refrigerator.
![Page 13: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/13.jpg)
Look Like
• I need a refrigeretor, so I go to a rafrigerator shop, I try to pick a very beautiful refragerator from a lot of refrigerater(s). Finally I decide that I will buy a GE refrigarator.
![Page 14: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/14.jpg)
Mismatch
…TACGAT…
…TAAAAT…
…TATACT…
…GATAAT…
…TATAAT…
…TATGTT…
![Page 15: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/15.jpg)
Real example
• …TACGAT…
• …TAAAAT…
• …TATACT…
• …GATAAT…
• …TATAAT…
• …TATGTT…
Consensus: TATAAT
• refrigeretor
• rafrigerator
• refragerator
• refrigerater
• refrigarator
Consensus: refrigerator
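A consensus like the ones above can be obtained by taking the most frequent symbol in each column of the aligned sites. The sketch below is a minimal illustration, assuming the sites are already aligned and of equal length; the function name `consensus` is hypothetical and not part of the talk.

```python
from collections import Counter

def consensus(sites):
    """Return the most common symbol in each column of equal-length sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    # zip(*sites) walks the alignment column by column
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

dna_sites = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATAAT", "TATGTT"]
words = ["refrigeretor", "rafrigerator", "refragerator",
         "refrigerater", "refrigarator"]

print(consensus(dna_sites))  # TATAAT
print(consensus(words))      # refrigerator
```

No single misspelled word equals the consensus, yet every column has a clear majority, which is the point of the refrigerator analogy.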
![Page 16: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/16.jpg)
Gapped Motif
Gene
RNA polymerase
Transcription Factor
Regulatory Element
DNA
New protein
![Page 17: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/17.jpg)
Gapped DNA binding?
![Page 18: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/18.jpg)
Gapped Motif
• Together
• Separate
![Page 19: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/19.jpg)
Together
• Red + blue + green = 5/25 + 15/15 + 5/25 = 25/65
• Red + xxx + green = 5/25 + xxx + 5/25 = 10/50
[Figure labels: mutations, n = 5, L, 5+3+5]
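The arithmetic on this slide appears to pool counts rather than add fractions: numerators and denominators are summed separately. The sketch below assumes each pair is (mismatched positions, total positions) for one colored block across n = 5 sites; that reading is an interpretation of the figure labels, and `pooled_fraction` is a hypothetical helper.

```python
def pooled_fraction(blocks):
    """Pool (mismatches, positions) pairs by summing each component."""
    mismatches = sum(m for m, _ in blocks)
    positions = sum(p for _, p in blocks)
    return mismatches, positions

# Assumed reading of the slide: two 5 bp blocks and a 3 bp gap over 5 sites
red, blue, green = (5, 25), (15, 15), (5, 25)

together = pooled_fraction([red, blue, green])  # gap scored like a block
gapped = pooled_fraction([red, green])          # gap treated as wildcards

print(together, together[0] / together[1])
print(gapped, gapped[0] / gapped[1])
```

Under this reading, scoring the gap as if it were conserved roughly doubles the apparent mismatch rate, which is why an ungapped model struggles with such sites.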
![Page 20: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/20.jpg)
Separate
• Red = 5/25
• Green = 5/25
• Pink = 4/25
[Figure labels: mutations, n = 5, L]
![Page 21: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/21.jpg)
![Page 22: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/22.jpg)
What can we do with the gap?
• Chen’s method
• SPACE
• PSO
![Page 23: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/23.jpg)
Chen’s method
• ChIP-chip experiment
– Get a positive set Ga
– Get a negative set G-a
![Page 24: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/24.jpg)
Compact Blocks
• Patterns that are found in Ga with a proportion larger than a predefined value (25% by default) are included in the pattern list.
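As a rough illustration of this support threshold, the sketch below counts, for every contiguous k-mer, the fraction of sequences that contain it and keeps those above 25%. It assumes compact blocks can be approximated by exact k-mers; the function name `frequent_blocks` and the toy sequences are hypothetical and not taken from Chen's method.

```python
from collections import Counter

def frequent_blocks(sequences, k, min_fraction=0.25):
    """Map each k-mer to its support, keeping those above min_fraction."""
    support = Counter()
    for seq in sequences:
        # a set, so each sequence counts a given k-mer at most once
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    n = len(sequences)
    return {kmer: c / n for kmer, c in support.items() if c / n > min_fraction}

toy_ga = ["ACGTAC", "TTACGT", "GGGGGG", "CCACGA"]  # stand-in for Ga
blocks = frequent_blocks(toy_ga, k=3)
print(blocks)
```

The comparison is strict because the slide says "larger than" the predefined value, so a k-mer found in exactly 25% of the sequences is dropped.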
![Page 25: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/25.jpg)
![Page 26: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/26.jpg)
Compact Blocks
• Long enough patterns (containing at least six nonwildcards) are taken as candidate motifs. Short patterns (blocks of 3 or 4 bp) are filtered
![Page 27: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/27.jpg)
![Page 28: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/28.jpg)
Hit/Seq ratio
• The sequences that match the pattern are called the supporting sequences of a pattern. It is possible that a pattern matches a sequence at more than one position.
• The Hit/Seq ratio of a pattern is the average number of occurrences of a pattern among its supporting sequences.
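The definition above can be sketched directly: count occurrences per sequence, keep only the supporting sequences, and average. The sketch assumes exact, possibly overlapping matches of a pattern without wildcards; `count_hits` and `hit_seq_ratio` are hypothetical names.

```python
def count_hits(pattern, seq):
    """Count (possibly overlapping) exact occurrences of pattern in seq."""
    k = len(pattern)
    return sum(seq[i:i + k] == pattern for i in range(len(seq) - k + 1))

def hit_seq_ratio(pattern, sequences):
    """Average number of occurrences among the supporting sequences."""
    hits = [count_hits(pattern, s) for s in sequences]
    supporting = [h for h in hits if h > 0]
    if not supporting:
        return 0.0  # no supporting sequence, so no ratio to report
    return sum(supporting) / len(supporting)

toy = ["ACGACG", "TTACGT", "GGGGGG"]
print(hit_seq_ratio("ACG", toy))  # 1.5
```

Here "ACG" occurs twice in the first sequence and once in the second, and the third sequence is not a supporting sequence, so the ratio is 3 hits over 2 supporting sequences.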
![Page 29: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/29.jpg)
Block Filtering
• A compact block is filtered out if its Hit/Seq ratio is larger than 15.
• A large Hit/Seq ratio implies that the compact blocks are frequently repeated in a single promoter region.
![Page 30: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/30.jpg)
• In addition to the Hit/Seq ratio, they also use an upper threshold on f-a (the proportion of sequences in G-a that contain a pattern P) to eliminate repetitive elements present across different promoter sequences. A pattern is retained only if f-a is less than 0.16.
![Page 31: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/31.jpg)
Growing Gapped Motifs
• Growing gapped motifs is similar to growing compact motifs.
![Page 32: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/32.jpg)
Pattern Ranking
• An identified pattern is filtered out before ranking if its Hit/Seq ratio is 2 or more, which is considered a reasonable upper bound for selecting reliable patterns.
![Page 33: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/33.jpg)
• Sd is the preferential occurrence of a pattern in Ga relative to G-a.
• Sp is a formula value.
• Sc is the conservation score.
![Page 34: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/34.jpg)
Sd
• The proportions of sequences in Ga and G-a that contain a pattern P are denoted as fa and f-a. The one-tailed two-sample proportion test can be performed as follows:
• Patterns with a z score (Sd) smaller than $z_{1-0.01}$ are treated as nonsignificant and are removed before the ranking process.
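The slide's own formula did not survive in this transcript. As a stand-in, the sketch below uses the textbook pooled form of the one-tailed two-sample proportion test; whether Chen's method uses exactly this variant is an assumption. The function name `proportion_z` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(hits_a, n_a, hits_neg, n_neg):
    """Pooled two-sample z score for fa > f-a (textbook form, assumed)."""
    f_a, f_neg = hits_a / n_a, hits_neg / n_neg
    pooled = (hits_a + hits_neg) / (n_a + n_neg)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_neg))
    if se == 0:
        return 0.0  # pattern is in every sequence or in none
    return (f_a - f_neg) / se

# Critical value z_{1-0.01} of the standard normal distribution
z_crit = NormalDist().inv_cdf(1 - 0.01)

# Hypothetical counts: 30 of 100 sequences in Ga, 50 of 1000 in G-a
s_d = proportion_z(30, 100, 50, 1000)
print(round(z_crit, 3), s_d > z_crit)
```

A pattern would be kept for ranking only when `s_d` is at least `z_crit`, mirroring the filter described in the second bullet.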
![Page 35: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/35.jpg)
Sp
![Page 36: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/36.jpg)
Sc
• Sc is the degree of evolutionary conservation among a set of orthologous sequences.
• (from Saccharomyces paradoxus, Saccharomyces kudriavzevii, Saccharomyces mikatae, and Saccharomyces bayanus)
![Page 37: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/37.jpg)
Result
![Page 38: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/38.jpg)
![Page 39: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/39.jpg)
Key point
• Filter !!
![Page 40: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/40.jpg)
SPACE
• Generation of motif candidates
– Consider L=20
![Page 41: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/41.jpg)
• Consider L=20, r=0.5, l=5, d=1 and q=4.
![Page 42: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/42.jpg)
Refinding Motif
• GAAGAnnnnnnnTAGAAAnn is a spaced motif of five sequences.
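To make the notation concrete, the sketch below scans a sequence for windows that agree with a spaced motif, treating each `n` as a wildcard and allowing a small number of mismatches at the remaining positions. This is only an illustration of the pattern syntax, not SPACE's candidate generation; `mismatches`, `find_spaced`, and the test sequence are hypothetical.

```python
def mismatches(motif, window):
    """Count mismatches at non-wildcard positions; 'n' matches anything."""
    return sum(m != "n" and m != w for m, w in zip(motif, window))

def find_spaced(motif, seq, max_mismatches=0):
    """Return start positions where motif matches seq within the budget."""
    width = len(motif)
    return [i for i in range(len(seq) - width + 1)
            if mismatches(motif, seq[i:i + width]) <= max_mismatches]

motif = "GAAGAnnnnnnnTAGAAAnn"
# Made-up sequence: both conserved blocks with an arbitrary 7 bp spacer
seq = "CC" + "GAAGA" + "ACGTACG" + "TAGAAA" + "TT" + "GG"
mutated = seq[:3] + "T" + seq[4:]  # one substitution inside the first block

print(find_spaced(motif, seq))         # [2]
print(find_spaced(motif, mutated))     # []
print(find_spaced(motif, mutated, 1))  # [2]
```

Only the 11 non-wildcard positions contribute to the mismatch count, so the spacer content never penalizes a site.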
![Page 43: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/43.jpg)
• Motif Score(M) =
• +
• Let E(M, e) be the expected frequency of M with at most e mutations, based on a set of background sequences.
![Page 44: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/44.jpg)
Why PSO method
Background
• Particle swarm optimization (PSO) is a population-based stochastic optimization technique inspired by the social behavior of bird flocking and fish schooling.
• PSO shares many similarities with evolutionary computation techniques such as Genetic Algorithms (GA), but it is simpler and faster than GA.
• It has been shown to be effective in optimizing difficult multidimensional problems in a variety of fields.
• PSO is widely applied in ANN (Artificial Neural Networks), nonlinear control, electromagnetics, antenna design, and bioinformatics.
![Page 45: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/45.jpg)
Some key terms used to describe PSO
• Agent (Particle): One single individual in the swarm.
• Position: An agent’s N-dimensional coordinates, which represent a solution to the problem.
• Swarm: The entire collection of agents.
• Fitness: A single number representing the goodness of a given solution.
• Pbest: The location in parameter space of the best fitness returned for a specific agent.
• Gbest: The location in parameter space of the best fitness returned for the entire swarm.
• V: The velocity of each agent.
![Page 46: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/46.jpg)
gbest
Pbest1
Pbest2
$x_n = x_n + V_n$
$V_n = V_n + C_1 \,\mathrm{rand}()\,(p_{best,n} - x_n) + C_2 \,\mathrm{rand}()\,(g_{best,n} - x_n)$
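The two update rules can be sketched as a small continuous PSO. This is a minimal illustration, not the PSO-motif algorithm: it assumes lower fitness is better, and the velocity clamp `v_max`, the search range, the parameter defaults, and the `sphere` test function are all assumptions added to keep the example stable.

```python
import random

def pso(fitness, dim, n_agents=20, iters=100, c1=2.0, c2=2.0,
        v_max=1.0, seed=0):
    """Minimize fitness with the plain velocity and position updates."""
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_agents)]
    v = [[0.0] * dim for _ in range(n_agents)]
    pbest = [p[:] for p in x]
    pbest_fit = [fitness(p) for p in x]
    g = min(range(n_agents), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    initial_fit = gbest_fit
    for _ in range(iters):
        for i in range(n_agents):
            for d in range(dim):
                # V = V + C1*rand()*(pbest - x) + C2*rand()*(gbest - x)
                v[i][d] += (c1 * rng.random() * (pbest[i][d] - x[i][d])
                            + c2 * rng.random() * (gbest[d] - x[i][d]))
                # clamp is an assumption, not part of the slide's formula
                v[i][d] = max(-v_max, min(v_max, v[i][d]))
                x[i][d] += v[i][d]  # x = x + V
            f = fitness(x[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = x[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = x[i][:], f
    return gbest, gbest_fit, initial_fit

def sphere(p):
    """Toy continuous fitness with its minimum at the origin."""
    return sum(c * c for c in p)

best, best_fit, start_fit = pso(sphere, dim=2)
print(best_fit <= start_fit)  # True: gbest never gets worse
```

Because the step size depends on the distances to pbest and gbest, agents take smaller steps as they close in on those positions, which only makes sense when positions live in a continuous space.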
![Page 47: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/47.jpg)
• One agent’s movement in the PSO algorithm.
![Page 48: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/48.jpg)
Flow chart of the PSO algorithm
![Page 49: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/49.jpg)
• In a typical PSO algorithm, one wishes to control the velocity so that at the beginning stage the particles can fly around quickly inside the search space, and when a particle approaches the optimal solution, it should slow down so it can converge quickly.
![Page 50: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/50.jpg)
• …TACGATA…
• …TAAAAT…
• …TATACT…
• …GATAAT…
• …TATGAT…
• …TATGTT…
• One can achieve this if the fitness function is continuous, since the velocity is updated according to the distances between the current position and the positions of pbest and gbest.
![Page 51: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/51.jpg)
How to solve
• Remap
• Redefine
![Page 52: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/52.jpg)
Remap the neighborhood information
[Figure: candidates numbered 1, 2, …, N; example sequences A C G T T C C A T … A C G T T C C T, with labels "mis is 6" and "mis is 1"]
![Page 53: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/53.jpg)
Redefine
• Green: Current
• Red: Gbest
• Pink: Pbest
• Blue: Random
n = 5
L
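One possible reading of this color legend is that a new candidate site is assembled position by position, each symbol being copied from the agent's current site, its pbest, the swarm's gbest, or drawn at random. The slide gives only the legend, so the sketch below is an assumption throughout: the function `recombine`, the weights, and the example sites are hypothetical.

```python
import random

def recombine(current, pbest, gbest, weights=(0.4, 0.2, 0.3, 0.1),
              alphabet="ACGT", rng=random):
    """Build a new candidate site position by position.

    weights gives the assumed chance of copying from the current site,
    pbest, gbest, or a random nucleotide, in that order.
    """
    new_site = []
    for cur, pb, gb in zip(current, pbest, gbest):
        options = [cur, pb, gb, rng.choice(alphabet)]
        new_site.append(rng.choices(options, weights=weights)[0])
    return "".join(new_site)

rng = random.Random(0)
child = recombine("TACGAT", "TATACT", "TATAAT", rng=rng)
print(child)
```

A rule of this kind needs no distance or velocity between sequences, which would sidestep the continuity problem raised earlier, though the actual update used in the talk may differ.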
![Page 54: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/54.jpg)
Redefine
• Good for gapped motif finding.
– Quick
– Flexible
– High sensitivity
– High extensibility
![Page 55: Discovering gapped binding sites](https://reader036.fdocuments.net/reader036/viewer/2022062304/56812fee550346895d956566/html5/thumbnails/55.jpg)
Thank you!