Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery...
Date posted: 21-Dec-2015
![Page 1: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/1.jpg)
Recap
3 different types of comparisons
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern discovery)
![Page 2: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/2.jpg)
Agenda
More about Shared Pattern Discovery
Edit Distance
– Recap
– What you need to know for the next quiz
Alignment
– More details
– More examples
![Page 3: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/3.jpg)
Shared Pattern Discovery
I have 10 rats that all have green eyes.
I have 10 rats that all have blue eyes.
What exactly do the 10 green-eyed rats have in common that gives them green eyes?
![Page 4: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/4.jpg)
Shared Pattern Discovery
Multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences
– First, completely align the 10 green-eyed rats
– Then, align green-eyed rats with blue-eyed rats
– Finally, compare the statistical difference
Initially, this is how genes were pinpointed
![Page 5: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/5.jpg)
Shared Pattern Discovery
Multiple alignment of 10 green-eyed rats
Alignment of blue-eyed rat and green-eyed rat
99.2% similar
99.4% similar
99.1% similar
94.5% similar
99.3% similar
95.2% similar
99.2% similar
94.7% similar
![Page 6: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/6.jpg)
Recap: Exact string matching
It’s important to know why exact matching doesn’t work.
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA
Problem: the target can NOT be found in the pattern even though there is a near-match
Sequences either match or don’t match
There is no ‘in-between’
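The all-or-nothing behaviour described above can be checked with a short C sketch. It uses the standard library function `strstr`; the helper name `contains_exact` is an assumption made for illustration and does not come from the slides.

```c
#include <string.h>

/* Returns 1 if `target` occurs as an exact, contiguous substring of
 * `pattern`, and 0 otherwise. There is no notion of a near-match. */
int contains_exact(const char *pattern, const char *target) {
    return strstr(pattern, target) != NULL;
}
```

With the sequences from the slide, `contains_exact("CGTACGTACGTACGTTCA", "CGTACGAC")` returns 0, even though the pattern contains `CGTACGTAC`, which differs from the target by a single extra T.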
![Page 7: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/7.jpg)
Recap: Edit Dist. for Local Search
Question: How many edits are needed to exactly match the target with part of the pattern?
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA
Answer: 1 deletion (removing one T from the pattern segment CGTACGTAC gives the target)
Example of local search
Gene finding
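One way to compute this local answer is sketched below in C. It assumes the same edit model as the rest of the deck (insertions and deletions only, no substitutions) and lets a match start and end anywhere in the pattern. The function name `local_edit_distance` is hypothetical, and the variable-length array assumes a C99 compiler.

```c
#include <string.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Fewest insertions/deletions needed to make `target` equal to some
 * contiguous part of `pattern`. */
int local_edit_distance(const char *target, const char *pattern) {
    int n = (int)strlen(target);
    int m = (int)strlen(pattern);
    int d[n + 1][m + 1];

    /* Row 0 is all zeros: a match may start anywhere in the pattern. */
    for (int j = 0; j <= m; j++) d[0][j] = 0;
    /* Column 0: i target symbols with nothing to pair against. */
    for (int i = 1; i <= n; i++) d[i][0] = i;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int best = min2(d[i - 1][j] + 1, d[i][j - 1] + 1);
            if (target[i - 1] == pattern[j - 1])
                best = min2(best, d[i - 1][j - 1]);
            d[i][j] = best;
        }
    }

    /* A match may also end anywhere: take the minimum of the last row. */
    int answer = d[n][0];
    for (int j = 1; j <= m; j++) answer = min2(answer, d[n][j]);
    return answer;
}
```

The only difference from a global comparison is the zero-filled first row and the minimum taken over the last row, which is what makes the unmatched ends of the pattern free.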
![Page 8: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/8.jpg)
Recap: Edit Dist. for Global Comp.
Question: How many edits are needed to exactly match the ENTIRE target with the WHOLE pattern?
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA
Answer: 10 deletions
Example of global comparison (whole genome comparison)
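The figure of 10 can be sanity-checked without a full table. If the target can be obtained from the pattern by deletions alone, the number of deletions is simply the difference in length (18 - 8 = 10). The C sketch below checks that condition; both function names are assumptions made for illustration.

```c
#include <string.h>

/* Returns 1 if `target` can be obtained from `pattern` by deletions only,
 * i.e. if `target` is a subsequence of `pattern`. */
int is_subsequence(const char *target, const char *pattern) {
    while (*target && *pattern) {
        if (*target == *pattern) target++;   /* symbol kept */
        pattern++;                           /* otherwise symbol deleted */
    }
    return *target == '\0';
}

/* Number of deletions needed in that case: the difference in length. */
int deletions_needed(const char *target, const char *pattern) {
    return (int)strlen(pattern) - (int)strlen(target);
}
```

This shortcut only applies when `is_subsequence` returns 1; in general the full dynamic programming table is needed.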
![Page 9: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/9.jpg)
Quiz coming up!
You need to be able to compute optimal edit distance.
You need to fill in the table.
![Page 10: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/10.jpg)
Edit Distance – Dynamic Programming
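| | | A | C | G | T | C | G | C | A | T |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| A | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| C | 2 | 1 | 2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| G | 3 | 2 | 3 | 2 | 1 | 2 | 3 | 4 | 5 | 6 |
| T | 4 | 3 | 2 | 3 | 2 | 1 | 2 | 3 | 4 | 5 |
| G | 5 | 4 | 3 | 2 | 3 | 2 | 1 | 2 | 3 | 4 |
| T | 6 | 5 | 4 | 3 | 4 | 3 | 2 | 1 | 2 | 3 |
| G | 7 | 6 | 5 | 4 | 5 | 4 | 3 | 2 | 3 | 4 |
| C | 8 | 7 | 6 | 5 | 6 | 5 | 4 | 3 | 2 | 3 |

Callouts on the table:
– Optimal edit distance for TG and TCG
– Optimal edit distance for TG and TCGA
– Optimal edit distance for TGA and TCG
– Final Answer: optimal edit distance for TGA and TCGA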
![Page 11: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/11.jpg)
Edit Distance
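/* Edit distance with insertions and deletions only (no substitutions).
   seq1 has length n and seq2 has length m; both are 0-indexed. */
static int min(int a, int b) { return a < b ? a : b; }

int edit_distance(const char *seq1, int n, const char *seq2, int m) {
    int matrix[n+1][m+1];
    for (int x = 0; x <= n; x++) matrix[x][0] = x;   /* delete x symbols */
    for (int y = 1; y <= m; y++) matrix[0][y] = y;   /* insert y symbols */
    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++)
            if (seq1[x-1] == seq2[y-1])              /* match: no edit */
                matrix[x][y] = matrix[x-1][y-1];
            else                                     /* cheaper of insert/delete */
                matrix[x][y] = min(matrix[x][y-1] + 1,
                                   matrix[x-1][y] + 1);
    return matrix[n][m];
}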
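The routine above stores the whole (n+1) by (m+1) table, which is what the quiz asks for. When only the final distance is wanted, two rows are enough, because each cell depends only on the current and the previous row. The following C sketch shows that variant under the same insertion/deletion-only model, taking the cheaper of the two edits on a mismatch. The name `edit_distance_two_rows` and the use of `malloc` are illustrative choices, not part of the original slides.

```c
#include <stdlib.h>
#include <string.h>

/* Insertion/deletion edit distance that keeps only two table rows.
 * Returns -1 if memory cannot be allocated. */
int edit_distance_two_rows(const char *seq1, const char *seq2) {
    size_t n = strlen(seq1), m = strlen(seq2);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return -1; }

    for (size_t y = 0; y <= m; y++) prev[y] = (int)y;   /* row 0 */

    for (size_t x = 1; x <= n; x++) {
        curr[0] = (int)x;                               /* column 0 */
        for (size_t y = 1; y <= m; y++) {
            if (seq1[x - 1] == seq2[y - 1]) {
                curr[y] = prev[y - 1];                  /* match: no edit */
            } else {
                int del = prev[y] + 1;
                int ins = curr[y - 1] + 1;
                curr[y] = del < ins ? del : ins;
            }
        }
        int *tmp = prev; prev = curr; curr = tmp;       /* reuse the rows */
    }

    int result = prev[m];   /* after the final swap, prev is the last row */
    free(prev);
    free(curr);
    return result;
}
```

For TGA and TCGA this gives 1 (insert the C), and for the recap sequences CGTACGAC and CGTACGTACGTACGTTCA it gives 10.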
![Page 12: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/12.jpg)
Why Edit Distance Stinks for Genetic Data
DNA evolves in strange ways
This is a gene in the rat genome: …TAGATCCCAGATCAGTATTCAAGTTATAC…
This is the same gene in the fruit bat: …GATCTCCCAGATAGAAGCAGTATTCAGTCA…
This is a totally unrelated region of the AIDS virus: …CCTATCAGCAGGATCAAGTATGTCATACTAC…
The edit distance between rat and virus is smaller than the edit distance between rat and fruit bat.
![Page 13: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/13.jpg)
Alignment
We need a more robust way to measure similarity
Alignment meets several requirements
1. It rewards matches
2. It penalizes mismatches
3. It supports different strategies for penalizing gaps
4. It helps visualize similarity.
![Page 14: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/14.jpg)
Alignment
Two examples
Seq1 G C T A G T A T G C C G A T A C T G A
Seq2 G C T A G A T G C A G A T A C T T G A
Seq3 G C T A G T A T G C C G A T A C G A
Seq4 G A T A G A C G C A G A T G C T T G T
What’s more similar?
– Seq1 & Seq2, or
– Seq3 & Seq4
![Page 15: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/15.jpg)
Alignment
Three steps in the dynamic programming algorithm for alignment
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
![Page 16: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/16.jpg)
Initialization
![Page 17: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/17.jpg)
Matrix Fill
For each position, M[i][j] is defined to be the maximum score at position i,j
M[i][j] = MAX [ M[i-1][j-1] + S(i,j) (match/mismatch),
M[i][j-1] + w (gap in sequence #1),
M[i-1][j] + w (gap in sequence #2)
]
![Page 18: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/18.jpg)
Matrix Fill
M[i][j] = MAX [ M[i-1][j-1] + S(i,j) (match/mismatch),
M[i][j-1] + w (gap in sequence #1),
M[i-1][j] + w (gap in sequence #2)
]
S(i,j) = 1 if symbols match, otherwise S(i,j) = 0
w = 0 (no gap penalty)
![Page 19: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/19.jpg)
Matrix Fill
The score at position 1,1 can be calculated.
The first residue in both sequences is a G
Thus, S(1,1) = 1
Thus, M[1][1] = MAX[ M[0][0] + 1, M[1][0] + 0, M[0][1] + 0 ] = MAX[1, 0, 0] = 1.
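The matrix fill can be sketched in C as follows. This is a minimal version using the simple scoring from the slides (S = 1 for a match, S = 0 for a mismatch, w = 0). It assumes the first row and column are initialized to zero, which agrees with the values used in the worked cell above but is not spelled out on the initialization slide. The function name `fill_simple_alignment` is hypothetical.

```c
#include <string.h>

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fills the alignment matrix with S = 1 for a match, S = 0 for a mismatch
 * and gap penalty w = 0, then returns the bottom-right score. */
int fill_simple_alignment(const char *seq1, const char *seq2) {
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    int M[n + 1][m + 1];
    const int w = 0;

    /* Initialization (assumed to be all zeros). */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;

    /* Matrix fill, one cell at a time. */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;
            M[i][j] = max3(M[i - 1][j - 1] + s,   /* match/mismatch */
                           M[i][j - 1] + w,       /* gap in sequence #1 */
                           M[i - 1][j] + w);      /* gap in sequence #2 */
        }
    return M[n][m];
}
```

With this scoring the bottom-right value is the number of matched symbols in the best alignment; for GAATTCAGTTA and GGATCGA it is 6.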
![Page 20: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/20.jpg)
Matrix Fill
![Page 21: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/21.jpg)
Matrix Fill
![Page 22: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/22.jpg)
Matrix Fill
![Page 23: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/23.jpg)
Matrix Fill
![Page 24: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/24.jpg)
Tracing Back
(Seq #1) A
|
(Seq #2) A
![Page 25: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/25.jpg)
Tracing Back
(Seq #1) A
|
(Seq #2) A
![Page 26: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/26.jpg)
Tracing back the alignment
(Seq #1) TA
          |
(Seq #2) _A
![Page 27: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/27.jpg)
Tracing Back
(Seq #1) TTA
           |
(Seq #2) __A
![Page 28: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/28.jpg)
Tracing Back
(Seq #1) GAATTCAGTTA
         | | || |  |
(Seq #2) GGA_TC_G__A
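The traceback step can be sketched in C as below, again assuming the simple scoring (match 1, mismatch 0, gap 0) and a zero-initialized first row and column. The function name `align_traceback` and the output buffer convention are assumptions for illustration. When several moves are equally good, this sketch prefers the diagonal, so it may return a different alignment from the one on the slide with the same score.

```c
#include <string.h>

/* Aligns seq1 and seq2 with S = 1 (match), S = 0 (mismatch), w = 0,
 * writing the gapped sequences into out1 and out2 ('_' marks a gap).
 * Each output buffer must hold strlen(seq1) + strlen(seq2) + 1 chars.
 * Returns the alignment score. */
int align_traceback(const char *seq1, const char *seq2,
                    char *out1, char *out2) {
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    int M[n + 1][m + 1];

    /* Initialization and matrix fill. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (seq1[i - 1] == seq2[j - 1]);
            if (M[i][j - 1] > best) best = M[i][j - 1];
            if (M[i - 1][j] > best) best = M[i - 1][j];
            M[i][j] = best;
        }

    /* Traceback from the bottom-right corner; the alignment is built
     * backwards and reversed at the end. */
    int i = n, j = m, k = 0;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            M[i][j] == M[i - 1][j - 1] + (seq1[i - 1] == seq2[j - 1])) {
            out1[k] = seq1[--i]; out2[k] = seq2[--j];   /* aligned pair */
        } else if (i > 0 && M[i][j] == M[i - 1][j]) {
            out1[k] = seq1[--i]; out2[k] = '_';         /* gap in seq2 */
        } else {
            out1[k] = '_'; out2[k] = seq2[--j];         /* gap in seq1 */
        }
        k++;
    }
    out1[k] = out2[k] = '\0';

    /* Reverse both strings so they read left to right. */
    for (int a = 0, b = k - 1; a < b; a++, b--) {
        char t = out1[a]; out1[a] = out1[b]; out1[b] = t;
        t = out2[a]; out2[a] = out2[b]; out2[b] = t;
    }
    return M[n][m];
}
```

Called on GAATTCAGTTA and GGATCGA with two buffers of at least 19 characters, it returns a score of 6 and two equal-length gapped strings.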
![Page 29: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/29.jpg)
Robust Scoring
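M[i][j] = MAX [ M[i-1][j-1] + S(i,j) (match/mismatch),
M[i][j-1] + w1 (gap in sequence #1),
M[i-1][j] + w2 (gap in sequence #2)
]
w1 = -0.5
w2 = -0.7
S(i,j) (symmetric; only the upper triangle is listed):

| | A | C | G | T |
|---|---|---|---|---|
| A | 1.1 | 0.0 | 0.3 | 0.5 |
| C | | 1.3 | 0.1 | 0.0 |
| G | | | 1.0 | 0.0 |
| T | | | | 1.2 |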
![Page 30: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/30.jpg)
Alignment Scoring
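w1 = -0.5
w2 = -0.7
S(i,j) (symmetric; only the upper triangle is listed):

| | A | C | G | T |
|---|---|---|---|---|
| A | 1.1 | 0.0 | 0.3 | 0.5 |
| C | | 1.3 | 0.1 | 0.0 |
| G | | | 1.0 | 0.0 |
| T | | | | 1.2 |

| Seq1 | G | T | A | C | _ | T | A | C | G | A | C |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Seq2 | G | A | A | C | G | T | A | _ | G | A | C |
| score | 1.0 | 0.5 | 1.1 | 1.3 | -0.5 | 1.2 | 1.1 | -0.7 | 1.0 | 1.1 | 1.3 |

Alignment score = 8.4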
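The score of a given alignment is just a column-by-column sum, which the following C sketch makes explicit. It hard-codes the score table and gap penalties from this slide, fills in the lower triangle by symmetry (an assumption, since the slide lists only the upper triangle), and uses `_` for a gap. The names `base_index` and `alignment_score` are hypothetical.

```c
#include <stddef.h>

/* Index of a nucleotide in the score table (A, C, G, T), or -1. */
static int base_index(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

/* Scores two gapped sequences of equal length, where '_' marks a gap.
 * w1 penalizes a gap in sequence #1 and w2 a gap in sequence #2. */
double alignment_score(const char *a1, const char *a2) {
    static const double S[4][4] = {
        /*        A    C    G    T  */
        /* A */ {1.1, 0.0, 0.3, 0.5},
        /* C */ {0.0, 1.3, 0.1, 0.0},
        /* G */ {0.3, 0.1, 1.0, 0.0},
        /* T */ {0.5, 0.0, 0.0, 1.2},
    };
    const double w1 = -0.5, w2 = -0.7;
    double total = 0.0;

    for (size_t k = 0; a1[k] != '\0' && a2[k] != '\0'; k++) {
        if (a1[k] == '_') {
            total += w1;                       /* gap in sequence #1 */
        } else if (a2[k] == '_') {
            total += w2;                       /* gap in sequence #2 */
        } else {
            int x = base_index(a1[k]), y = base_index(a2[k]);
            if (x >= 0 && y >= 0) total += S[x][y];   /* skip unknown symbols */
        }
    }
    return total;
}
```

Scoring `GTAC_TACGAC` against `GAACGTA_GAC` reproduces the 8.4 above (up to floating-point rounding), while the same two sequences with no gaps at all score 7.8.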
![Page 31: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/31.jpg)
Alignment Scoring
Can you find a better alignment?
![Page 32: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/32.jpg)
Alignment Scoring
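| Seq1 | G | T | A | C | T | A | C | G | A | C |
|---|---|---|---|---|---|---|---|---|---|---|
| Seq2 | G | A | A | C | G | T | A | G | A | C |
| score | 1.0 | 0.5 | 1.1 | 1.3 | 0.0 | 0.5 | 0.0 | 1.0 | 1.1 | 1.3 |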
Alignment score = 7.8
![Page 33: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/33.jpg)
Alignment Scoring
Summary:
– We have a way of rewarding different types of matches and mismatches
– We have a separate way of penalizing gaps
– We could choose not to penalize gaps, if we knew they didn’t affect biological similarity
– We could even reward some types of mismatches, if we knew the sequences were still biologically similar
![Page 34: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/34.jpg)
Alignment scoring
Process
1. Experts (chemists or biologists) look at sequence segments that are known to be biologically similar and compare them to sequence segments that are biologically dissimilar.
2. Use direct observation and statistics to develop a scoring scheme
3. Given the scoring scheme, develop an algorithm to compute the maximum scoring alignment.
![Page 35: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/35.jpg)
Alignment – Algorithmic Point of View
Align the symbols of two strings.
– Maximize the number of symbols that match.
– Minimize the number of symbols that do NOT match
Gaps can be inserted to improve alignments.
A scoring system is used to measure the quality of an alignment.
Scoring matrix:

| | A | C | G | T |
|---|---|---|---|---|
| A | 5 | -3 | -4 | -5 |
| C | -3 | 4 | -4 | -4 |
| G | -4 | -4 | 4 | -3 |
| T | -5 | -4 | -3 | 5 |

Gap penalty: -8
In practice:
– Scoring matrices and gap penalties are based on biological knowledge and statistical analysis
![Page 36: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/36.jpg)
Local Alignment and Global Alignment
In Global Alignment the two strings must be entirely aligned (every aligned pair of symbols is scored).
In Local Alignment segments from each string are aligned and the rest of the string can be ignored
Global alignment is used to compare the similarity of entire organisms
Local alignment is used to search for genes
A G A G T A C T C A G T A T C T G A T
A C A T A C T A C A G T A T C C A
A G A G T A C T C A G T A T C T G A T
A C A T A C T A C A G T A T C C A
![Page 37: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/37.jpg)
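Alignment Scoring Revisited
Given a scoring system, the alignment score is the sum of the scores for each aligned pair of symbols plus the gap penalties
Local Alignment
A G A G T A C T C A G T A T C T G A T
A C A T A C T A C T G T A T C C A
Scoring matrix:

| | A | C | G | T |
|---|---|---|---|---|
| A | 3 | -3 | -4 | -5 |
| C | -3 | 2 | -4 | -4 |
| G | -4 | -4 | 2 | -3 |
| T | -5 | -4 | -3 | 3 |

Gap penalty: -6
Aligned segments and the score of each column:

| Seq1 segment | T | A | C | T | _ | C | A | G | T | A | T | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seq2 segment | T | A | C | T | A | C | T | G | T | A | T | C |
| score | 3 | 3 | 2 | 3 | -6 | 2 | -5 | 2 | 3 | 3 | 3 | 2 |

Total Score = 15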
![Page 38: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/38.jpg)
Alignment - Computer Science Perspective
Given two input strings and a scoring system, find the highest scoring local alignment among all possible alignments.
Fact: The number of possible alignments grows exponentially with the length of the input strings
Solving this problem efficiently was an open problem until Smith and Waterman (1981) designed an efficient dynamic programming algorithm
The algorithm takes O(nm) time where n and m are the lengths of the two input strings
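To get a feel for how fast the number of alignments grows, the C sketch below counts them under one common model, which is an assumption here: a global alignment is a path through the (n+1) by (m+1) grid, and each step is either an aligned pair (diagonal) or a gap in one of the two sequences. The function name `count_alignments` is hypothetical. Note that the count itself is computed with an O(nm) table, the same trick the alignment algorithm uses.

```c
/* Counts the global alignments of two strings of lengths n and m, viewing
 * each alignment as a path through the (n+1) x (m+1) grid whose steps are
 * an aligned pair (diagonal) or a gap in either sequence.
 * Overflows unsigned long long for large n and m. */
unsigned long long count_alignments(int n, int m) {
    unsigned long long c[n + 1][m + 1];

    /* Only one way to align a string with the empty string: all gaps. */
    for (int i = 0; i <= n; i++) c[i][0] = 1;
    for (int j = 0; j <= m; j++) c[0][j] = 1;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            c[i][j] = c[i - 1][j - 1]    /* last column is an aligned pair */
                    + c[i - 1][j]        /* last column is a gap in seq 2 */
                    + c[i][j - 1];       /* last column is a gap in seq 1 */
    return c[n][m];
}
```

Under this model two strings of length 3 already have 63 alignments and two strings of length 6 have 8989, so enumerating every alignment is hopeless for gene-sized inputs.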
![Page 39: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/39.jpg)
Interesting History
The Smith Waterman algorithm for computing local alignment is considered one of the most important algorithms in computational biology.
However, the algorithm is merely a generalization of the edit distance algorithm, which was already published and well-known in computer science.
Converting the edit distance algorithm to solve the alignment problem is “trivial.”
Smith and Waterman are considered almost legendary for this accomplishment.
It is a perfect example of “being in the right place at the right time.”
![Page 40: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/40.jpg)
Smith Waterman Algorithm
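Dynamic programming table (partially filled; i indexes rows, j indexes columns)

| | | A | C | G | C |
|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 8 | 3 | 0 | 0 |
| G | 0 | 3 | 4 | | |
| C | 0 | | | | |
| T | 0 | | | | |

M[i][j] = MAX( 0, M[i-1][j-1] + S(i,j), M[i-1][j] + w, M[i][j-1] + w );
Example: M[2][2] = MAX( 0, 8 + (-4), 3 + (-5), 3 + (-5) ) = 4
S(i,j):

| | A | C | G | T |
|---|---|---|---|---|
| A | 8 | -3 | -4 | -5 |
| C | -3 | 7 | -4 | -4 |
| G | -4 | -4 | 7 | -3 |
| T | -5 | -4 | -3 | 8 |

w = -5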
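A single cell update of this recurrence can be written as a small C helper. This is only a sketch of the formula on this slide; the function name `sw_cell` and its parameter order are assumptions. The extra 0 in the maximum is what lets a local alignment restart anywhere instead of carrying a negative score forward.

```c
/* One Smith-Waterman cell update:
 * max(0, diagonal + s, up + w, left + w),
 * where s is the match/mismatch score S(i,j) and w the gap penalty. */
int sw_cell(int diag, int up, int left, int s, int w) {
    int best = 0;                          /* never go below zero */
    if (diag + s > best) best = diag + s;  /* aligned pair */
    if (up + w > best)   best = up + w;    /* gap: move down one row */
    if (left + w > best) best = left + w;  /* gap: move right one column */
    return best;
}
```

With the scores on this slide (8 for A against A, -4 for G against C, w = -5), `sw_cell(0, 0, 0, 8, -5)` gives 8 for the first cell and `sw_cell(8, 3, 3, -4, -5)` gives 4 for the cell in the G row and C column.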
![Page 41: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/41.jpg)
Smith Waterman Algorithm
Sequences: ACGCTA (columns) and AGCTCA (rows)
S(i,j):

| | A | C | G | T |
|---|---|---|---|---|
| A | 6 | -3 | -4 | -5 |
| C | -3 | 5 | -4 | -4 |
| G | -4 | -4 | 4 | -3 |
| T | -5 | -4 | -3 | 7 |

w = -5
The first row and column are initialized to 0. The table is then filled one cell at a time, row by row:

| | | A | C | G | C | T | A |
|---|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 6 | 1 | 0 | 0 | 0 | 6 |
| G | 0 | 1 | 2 | 5 | 0 | 0 | 1 |
| C | 0 | 0 | 6 | 1 | 10 | 5 | 0 |
| T | 0 | 0 | 1 | 3 | 5 | 17 | 12 |
| C | 0 | 0 | 5 | 0 | 8 | 12 | 14 |
| A | 0 | 6 | 1 | 1 | 3 | 7 | 18 |
![Page 42: Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)](https://reader036.fdocuments.net/reader036/viewer/2022062516/56649d615503460f94a42ca8/html5/thumbnails/42.jpg)
Smith Waterman Algorithm
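| | | A | C | G | C | T | A |
|---|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 6 | 1 | 0 | 0 | 0 | 6 |
| G | 0 | 1 | 2 | 5 | 0 | 0 | 1 |
| C | 0 | 0 | 6 | 1 | 10 | 5 | 0 |
| T | 0 | 0 | 1 | 3 | 5 | 17 | 12 |
| C | 0 | 0 | 5 | 0 | 8 | 12 | 14 |
| A | 0 | 6 | 1 | 1 | 3 | 7 | 18 |

Traceback from the highest score (18):
ACGCT_A (top sequence)
| ||| |
A_GCTCA (side sequence)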
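The table fill for this example can be reproduced with the C sketch below. It hard-codes the scoring matrix (6, 5, 4 and 7 on the diagonal) and gap penalty w = -5 from the slides, and returns the highest cell value, which is the score of the best local alignment. The names `base_index` and `smith_waterman_best` are hypothetical, and the input is assumed to contain only A, C, G and T.

```c
#include <string.h>

static int base_index(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;   /* 'T' (input is assumed to be A/C/G/T) */
    }
}

/* Fills the Smith-Waterman table for `rows` (side sequence) and `cols`
 * (top sequence) and returns the highest cell value, i.e. the score of
 * the best local alignment. */
int smith_waterman_best(const char *rows, const char *cols) {
    static const int S[4][4] = {
        /*        A   C   G   T */
        /* A */ { 6, -3, -4, -5},
        /* C */ {-3,  5, -4, -4},
        /* G */ {-4, -4,  4, -3},
        /* T */ {-5, -4, -3,  7},
    };
    const int w = -5;
    int n = (int)strlen(rows), m = (int)strlen(cols);
    int M[n + 1][m + 1];
    int best = 0;

    /* First row and column are zero. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int v = M[i - 1][j - 1]
                  + S[base_index(rows[i - 1])][base_index(cols[j - 1])];
            if (M[i - 1][j] + w > v) v = M[i - 1][j] + w;
            if (M[i][j - 1] + w > v) v = M[i][j - 1] + w;
            if (v < 0) v = 0;              /* local alignment may restart */
            M[i][j] = v;
            if (v > best) best = v;        /* remember the traceback start */
        }
    return best;
}
```

For the side sequence AGCTCA and the top sequence ACGCTA the function returns 18, the value in the bottom-right cell where the traceback begins.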