Date posted: 19-Dec-2015
Where are we going?
Remember the extended analogy?
– Given binary code, what does the program do?
– How does it work?
At the end of the semester, I am going to show you how biologists solved that problem
– Binary code → DNA code
– How the program works → How life works
But we are approaching it from the bottom up
Bottom-up Design
Top Down
– See the big picture first
– Break it into parts
– Analyze each part
– Continue breaking sub-parts down into solvable tasks
Bottom Up
– Identify easily solvable tasks
– Use them to solve larger problems
– Use the solutions to larger and larger problems to solve the BIG problem and see the big picture
Bottom-up Design
Top Down
– Rethinking the design of existing ideas/inventions
– Managing projects that are underway
– Works really well in a utopian world
Bottom Up
– Designing totally new ideas
– Putting together projects from scratch
– Works really well in the real world
Bottom-up Design
Top Down
– Let's build an airplane
– Let's build a steering mechanism
– Let's build a lift mechanism
– Let's build a propulsion mechanism
Bottom Up
– This shape produces lift
– A spinning propeller creates propulsion in the air
– Canvas with a wood frame is light enough
– Perhaps we can build a stable, controllable airplane
Bottom-up Design
Before we can analyze the big picture, we have to
– Look at some of the initial smaller problems
– See how they were solved
– See how they led to new discoveries
Remember
Don't forget to
– pick a paper and
– email me
See the schedule to see what's taken
– http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html
Recap
3 different types of comparisons
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern discovery)
Agenda
Overview of Shared Pattern Discovery
Edit Distance
– How do you compute it
– Why it's not good enough
Alignment
– Why it's better
– How to compute it
Shared Pattern Discovery
I have 100 rats that all have green eyes
I have 1000 rats that all have blue eyes
What exactly do the 100 rats have in common that gives them green eyes?
Shared Pattern Discovery
A technique called multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences (a group of rats)
– You can identify a subset (rats that have green eyes) and
– You can find a sub-region of DNA (a pattern) that the subset shares
– But that isn't shared by any other subset (rats that have blue eyes)
Initially, this is how genes were pinpointed
Shared Pattern Discovery
To understand multiple alignment, one needs to understand pair-wise alignment
Multiple alignment emerged from the successful application of pair-wise alignment
Pair-wise alignment emerged from improvements to traditional string matching algorithms
All of this emerged from a need to compare genetic sequences
Exact string matching
Target: CGTACGAC
Pattern: ACGTACGTACGT
Problem: The target cannot be found in the pattern even though it's really close
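The failure of exact matching on this slide can be checked with the C standard library's `strstr`. This is only a minimal sketch; the helper name `occurs_exactly` is a hypothetical one chosen for illustration and is not part of the lecture.

```c
#include <string.h>

/* Returns 1 if needle occurs as a contiguous substring of text, else 0.
 * occurs_exactly is an illustrative name, not from the slides. */
int occurs_exactly(const char *text, const char *needle)
{
    return strstr(text, needle) != NULL;
}
```

Calling `occurs_exactly("ACGTACGTACGT", "CGTACGAC")` reports no match, even though the target differs from part of the pattern by a single symbol.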
Edit Distance
How many edits are needed to exactly match the target with part of the pattern?
Target: CGTACGAC
Pattern: ACGTACGTACGT
Just one
Edit Distance
How many edits are needed to exactly match the target with the WHOLE pattern?
Target: CGTACGAC
Pattern: ACGTACGTACGT
Four
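The two counts on these slides (one edit against part of the pattern, four against the whole pattern) can be reproduced with a small dynamic-programming sketch. It assumes an edit is the insertion or deletion of one symbol, as the lecture's later code does. The function name `indel_distance` and the `anywhere` flag are hypothetical names introduced here, and the free-start variant (first row of zeros, minimum over the last row) is one common way to match against a substring, not necessarily the method used in class.

```c
#include <string.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Insert/delete edit distance between target and pattern.
 * anywhere == 0: target must match the WHOLE pattern.
 * anywhere != 0: target may match any contiguous part of the pattern
 *                (assumption: starting and ending anywhere is free). */
int indel_distance(const char *target, const char *pattern, int anywhere)
{
    int n = (int)strlen(target);
    int m = (int)strlen(pattern);
    int d[n + 1][m + 1];   /* C99 variable-length array */

    for (int i = 0; i <= n; i++) d[i][0] = i;
    for (int j = 1; j <= m; j++) d[0][j] = anywhere ? 0 : j;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            /* one edit: skip a pattern symbol or skip a target symbol */
            int cost = min2(d[i][j - 1] + 1, d[i - 1][j] + 1);
            /* matching symbols can be paired at no cost */
            if (target[i - 1] == pattern[j - 1])
                cost = min2(cost, d[i - 1][j - 1]);
            d[i][j] = cost;
        }
    }

    if (!anywhere)
        return d[n][m];

    /* best end position in the pattern */
    int best = d[n][0];
    for (int j = 1; j <= m; j++)
        best = min2(best, d[n][j]);
    return best;
}
```

With target `CGTACGAC` and pattern `ACGTACGTACGT`, the sketch returns 1 when `anywhere` is nonzero and 4 when it is zero.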
Edit Distance – Dynamic Programming
[Figure: dynamic programming table for edit distance. Column labels as extracted: A C G T C G C A T; row labels as extracted: A C G T G T G C. Table values as extracted:]
0 1 2 3 4 5 6 7 8
1 2 1 2 3 4 5 6 7
2 3 2 1 2 3 4 5 6
3 2 3 2 1 2 3 4 5
4 3 2 3 2 1 2 3 4
5 4 3 4 3 2 1 2 3
6 5 4 5 4 3 2 3 4
7 6 5 6 5 5 3 2 3
– Optimal edit distance for TG and TCG
– Optimal edit distance for TG and TCGA
– Optimal edit distance for TGA and TCG
– Final answer: optimal edit distance for TGA and TCGA
Edit Distance
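int edit_distance(const char *seq1, int n, const char *seq2, int m)
{
    /* matrix[x][y] = edit distance between the first x symbols of seq1
       and the first y symbols of seq2 */
    int matrix[n+1][m+1];
    int x, y;
    for (x = 0; x <= n; x++) matrix[x][0] = x;
    for (y = 1; y <= m; y++) matrix[0][y] = y;
    for (x = 1; x <= n; x++)
        for (y = 1; y <= m; y++)
            if (seq1[x-1] == seq2[y-1])   /* C strings are 0-indexed */
                matrix[x][y] = matrix[x-1][y-1];
            else {                        /* cheaper of insert and delete */
                int ins = matrix[x][y-1] + 1;
                int del = matrix[x-1][y] + 1;
                matrix[x][y] = ins < del ? ins : del;
            }
    return matrix[n][m];
}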
Edit Distance
int matrix[n+1][m+1];
for (x = 0; x <= n; x++) matrix[x][0] = x;
for (y = 0; y <= m; y++) matrix[0][y] = y;
for (x = 1; x <= n; x++)
    for (y = 1; y <= m; y++)
        if (seq1[x-1] == seq2[y-1])      /* C strings are 0-indexed */
            matrix[x][y] = matrix[x-1][y-1];
        else                             /* min(a, b) returns the smaller value */
            matrix[x][y] = min(matrix[x][y-1] + 1,
                               matrix[x-1][y] + 1);
return matrix[n][m];
How many times is this comparison performed?
How many times is this assignment performed?
How many times is this assignment performed?
How many times is this assignment performed?
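One way to answer these counting questions is to instrument the loops. The sketch below is an illustration only: the struct `op_counts` and the function `count_edit_distance_ops` are hypothetical names, and the loop bounds follow the slide's code (both initialization loops start at 0).

```c
#include <string.h>

struct op_counts {
    long comparisons;       /* symbol comparisons in the inner loop */
    long init_assignments;  /* writes to the first row and first column */
    long fill_assignments;  /* writes to the interior cells */
};

/* Runs the edit distance table fill and counts the operations performed. */
struct op_counts count_edit_distance_ops(const char *seq1, const char *seq2)
{
    int n = (int)strlen(seq1);
    int m = (int)strlen(seq2);
    int matrix[n + 1][m + 1];
    struct op_counts c = {0, 0, 0};

    for (int x = 0; x <= n; x++) { matrix[x][0] = x; c.init_assignments++; }
    for (int y = 0; y <= m; y++) { matrix[0][y] = y; c.init_assignments++; }

    for (int x = 1; x <= n; x++) {
        for (int y = 1; y <= m; y++) {
            c.comparisons++;
            if (seq1[x - 1] == seq2[y - 1]) {
                matrix[x][y] = matrix[x - 1][y - 1];
            } else {
                int ins = matrix[x][y - 1] + 1;
                int del = matrix[x - 1][y] + 1;
                matrix[x][y] = ins < del ? ins : del;
            }
            c.fill_assignments++;  /* exactly one interior write per cell */
        }
    }
    return c;
}
```

For sequences of lengths n and m, the comparison and the interior assignment each run n·m times, while the two initialization loops together perform (n + 1) + (m + 1) assignments.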
Edit Distance – Dynamic Programming
[Figure: the same dynamic programming table, annotated with the callouts below]
To derive the value 7, we need to know that we can match two T's
n = 8
In the worst case, this may take n comparisons
To derive the value 6, we need to know that we can match two C's after matching two T's
To derive the value 5, we need to know that we can match two G's after already matching two C's and previously matching two T's
Edit Distance – Dynamic Programming
[Figure: the same dynamic programming table, annotated with the callouts below]
Given our previous matches, there is no way we can match two A's
Thus, the edit distance is increased
Luckily, we can match these two C's
But now we've matched the last symbol
We can't do any more matching (period!)
Lesson to learn
There is no way to compute the optimal (minimum) edit distance without considering all possible matching combinations.
The only way to do that is to consider all possible sub-problems.
This is the reason the entire table must be considered.
If you can compute the optimal (minimum) edit distance using fewer than O(nm) computations, then you will be renowned!
Why Edit Distance Stinks for Genetic Data
DNA evolves in strange ways
…TAGATCCCAGATCAGTATTCAAGTTATAC…. (this is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA… (this is the same gene in the fruit bat)
… CCTATCAGCAGGATCAAGTATGTCATACTAC… (this is a totally unrelated region of the AIDS virus)
The edit distance between rat and virus is smaller than between rat and fruit bat.
Alignment
We need a more robust way to measure similarity
Alignment meets several requirements
1. It rewards matches
2. It penalizes mismatches
3. It allows for different strategies for penalizing gaps
4. It helps visualize similarity
Alignment
Example
1. G A A T T C A G T T A (sequence #1)
2. G G A T C G A (sequence #2)
One possible alignment:
G A A T T C A G T T A
G G A _ T C _ G _ _ A
Annotations: mismatch (column 2), gap (column 4), gap (column 7), gap of size 2 (columns 9-10)
Alignment
A simple scoring scheme is used where
– S_{i,j} is the score at position i,j
– S_{i,j} = 1 if the residue at position i of sequence #1 matches the residue at position j of sequence #2 (match score); otherwise
– S_{i,j} = 0 (mismatch score)
w is the gap penalty, which we will discuss later
Alignment
Three steps in the dynamic programming algorithm for alignment
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
Alignment
Initialization Step
The first step creates a matrix with
– M + 1 columns and
– N + 1 rows
– where M and N correspond to the sizes of the sequences to be aligned.
The first row and first column of the matrix can be initially filled with 0.
Alignment
Matrix Fill Step
– For each position, M_{i,j} is defined to be the maximum score at position i,j; i.e.
– M_{i,j} = MAX[
    M_{i-1,j-1} + S_{i,j} (match/mismatch),
    M_{i,j-1} + w (gap in sequence #1),
    M_{i-1,j} + w (gap in sequence #2)
  ]
Example
The score at position 1,1 can be calculated.
The first residue in both sequences is a G
Thus, S_{1,1} = 1
Thus, with w = 0, M_{1,1} = MAX[M_{0,0} + 1, M_{1,0} + 0, M_{0,1} + 0] = MAX[1, 0, 0] = 1.
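The initialization and matrix fill steps can be sketched in C as follows. The function name `alignment_score` and its parameters are hypothetical. One labeled assumption: the border cells are set to i·w and j·w, which reduces to the all-zero first row and column described in the slides when w = 0.

```c
#include <string.h>

static int max3(int a, int b, int c)
{
    int best = a > b ? a : b;
    return best > c ? best : c;
}

/* Fills the (N+1) x (M+1) score matrix and returns the bottom-right entry.
 * match and mismatch are the two possible values of S_{i,j};
 * w is the gap penalty. */
int alignment_score(const char *seq1, const char *seq2,
                    int match, int mismatch, int w)
{
    int rows = (int)strlen(seq1);
    int cols = (int)strlen(seq2);
    int M[rows + 1][cols + 1];

    /* Initialization (assumption: borders hold i*w and j*w; zeros if w == 0) */
    for (int i = 0; i <= rows; i++) M[i][0] = i * w;
    for (int j = 0; j <= cols; j++) M[0][j] = j * w;

    /* Matrix fill: each cell is the best of its three predecessors */
    for (int i = 1; i <= rows; i++) {
        for (int j = 1; j <= cols; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? match : mismatch;
            M[i][j] = max3(M[i - 1][j - 1] + s,   /* match/mismatch */
                           M[i][j - 1] + w,       /* gap in sequence #1 */
                           M[i - 1][j] + w);      /* gap in sequence #2 */
        }
    }
    return M[rows][cols];
}
```

For the lecture's sequences `GAATTCAGTTA` and `GGATCGA` with match 1, mismatch 0, and w = 0, the sketch returns 6, which equals the number of matched columns in the example alignment.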
Example
Edit Dist. vs. Alignment Scoring
Note that the metric used in alignment is different from that of edit distance
– Smaller edit distance means more similar
– Higher alignment score means more similar
Also: Edit distance refers specifically to edits
– delete or insert a symbol
– discrete value
– not flexible
Alignment Scoring
M_{i,j} = MAX[
    M_{i-1,j-1} + S_{i,j} (match/mismatch score),
    M_{i,j-1} + w (gap in sequence #1),
    M_{i-1,j} + w (gap in sequence #2)
]
S_{i,j}   A     C     G     T
A        1.1   0.0   0.3   0.5
C              1.3   0.1   0.0
G                    1.0   0.0
T                          1.2
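A table like this can be stored as a small lookup array. The sketch below assumes the table is symmetric, so the empty lower triangle mirrors the upper one; that is an assumption, since the slide shows only the upper half. The names `S`, `base_index`, and `residue_score` are hypothetical.

```c
/* Pairwise scores from the slide, in the order A, C, G, T.
 * Assumption: the lower triangle mirrors the upper triangle. */
static const double S[4][4] = {
    /*         A    C    G    T  */
    /* A */ {1.1, 0.0, 0.3, 0.5},
    /* C */ {0.0, 1.3, 0.1, 0.0},
    /* G */ {0.3, 0.1, 1.0, 0.0},
    /* T */ {0.5, 0.0, 0.0, 1.2},
};

/* Maps a residue letter to its row/column index, or -1 if unknown. */
static int base_index(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Returns the score for aligning residues a and b.
 * Returning 0.0 for letters outside A, C, G, T is an assumption. */
double residue_score(char a, char b)
{
    int i = base_index(a);
    int j = base_index(b);
    if (i < 0 || j < 0)
        return 0.0;
    return S[i][j];
}
```

In the matrix fill recurrence, `residue_score(seq1[i-1], seq2[j-1])` would take the place of the simple 1/0 match score.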
Alignment Scoring
M_{i,j} = MAX[
    M_{i-1,j-1} + S_{i,j} (match/mismatch score),
    M_{i,j-1} + w (gap in sequence #1),
    M_{i-1,j} + w (gap in sequence #2)
]
w = -1
One possible alignment:
G A A T T C A G T T A
G G A _ T C _ G _ _ A
Gap penalties: -1 (column 4), -1 (column 7), -2 (columns 9-10)
Alignment Scoring
Summary:
– We have a way of rewarding different types of matches and mismatches
– We have a separate way of penalizing gaps
– We could choose not to penalize gaps if we had a clue that they weren't harmful
Recall
DNA evolves in strange ways
…TAGATCCCAGATCAGTATTCAAGTTATAC…. (this is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA… (this is the same gene in the fruit bat)
… CCTATCAGCAGGATCAAGTATGTCATACTAC… (this is a totally unrelated region of the AIDS virus)
The edit distance between rat and virus is smaller than between rat and fruit bat.
Tracing back the alignment
(Seq #1) A
         |
(Seq #2) A
Tracing back the alignment
(Seq #1) TA
          |
(Seq #2) _A
Tracing back the alignment
(Seq #1) TTA
           |
(Seq #2) __A
Tracing back the alignment
(Seq #1) GAATTCAGTTA
         | | || |  |
(Seq #2) GGA_TC_G__A
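The traceback step can be sketched by walking from the bottom-right cell back to the top-left, at each cell picking a predecessor that explains its score. The function name `align`, the `'_'` gap character, and the tie-breaking order (diagonal first, then a gap in sequence #2, then a gap in sequence #1) are assumptions made for this sketch. When several alignments share the best score, it may return a different one than the slide shows.

```c
#include <string.h>

/* Aligns seq1 and seq2 with match = 1, mismatch = 0, and gap penalty w.
 * Writes the gapped sequences into out1 and out2 ('_' marks a gap); each
 * buffer must hold strlen(seq1) + strlen(seq2) + 1 chars.
 * Returns the alignment score. */
int align(const char *seq1, const char *seq2, int w, char *out1, char *out2)
{
    int n = (int)strlen(seq1);
    int m = (int)strlen(seq2);
    int M[n + 1][m + 1];

    /* Initialization and matrix fill (borders i*w, j*w: an assumption) */
    for (int i = 0; i <= n; i++) M[i][0] = i * w;
    for (int j = 0; j <= m; j++) M[0][j] = j * w;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (seq1[i - 1] == seq2[j - 1]);
            if (M[i][j - 1] + w > best) best = M[i][j - 1] + w;
            if (M[i - 1][j] + w > best) best = M[i - 1][j] + w;
            M[i][j] = best;
        }
    }

    /* Traceback: build the alignment backwards from cell (n, m) */
    int i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            M[i][j] == M[i - 1][j - 1] + (seq1[i - 1] == seq2[j - 1])) {
            out1[len] = seq1[--i];      /* pair the two residues */
            out2[len] = seq2[--j];
        } else if (i > 0 && M[i][j] == M[i - 1][j] + w) {
            out1[len] = seq1[--i];      /* gap in sequence #2 */
            out2[len] = '_';
        } else {
            out1[len] = '_';            /* gap in sequence #1 */
            out2[len] = seq2[--j];
        }
        len++;
    }
    out1[len] = '\0';
    out2[len] = '\0';

    /* The strings were built end-first, so reverse them in place */
    for (int a = 0, b = len - 1; a < b; a++, b--) {
        char t = out1[a]; out1[a] = out1[b]; out1[b] = t;
        t = out2[a]; out2[a] = out2[b]; out2[b] = t;
    }
    return M[n][m];
}
```

Called with `GAATTCAGTTA`, `GGATCGA`, and w = 0, the sketch returns a score of 6 and two equal-length gapped strings with six matching columns.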