Where are we going? Remember the extended analogy? – Given binary code, what does the program do?...

Where are we going?

Remember the extended analogy?
– Given binary code, what does the program do?
– How does it work?

At the end of the semester, I am going to show you how biologists solved that problem
– Binary code → DNA code
– How the program works → How life works

But we are approaching it from the bottom up

Bottom-up Design

Top Down
– See the big picture first
– Break it into parts
– Analyze each part
– Continue breaking sub-parts down into solvable tasks

Bottom Up
– Identify easily solvable tasks
– Use them to solve larger problems
– Use the solutions to larger and larger problems to solve the BIG problem and see the big picture

Bottom-up Design

Top Down
– Rethinking the design of existing ideas/inventions
– Managing projects that are underway
– Works really well in a utopian world

Bottom Up
– Designing totally new ideas
– Putting together projects from scratch
– Works really well in the real world

Bottom-up Design

Top Down
– Let’s build an airplane
– Let’s build a steering mechanism
– Let’s build a lift mechanism
– Let’s build a propulsion mechanism

Bottom Up
– This shape produces lift
– A spinning propeller creates propulsion in the air
– Canvas with a wood frame is light enough
– Perhaps we can build a stable, controllable airplane

Bottom-up Design

Before we can analyze the big picture, we have to
– Look at some of the initial smaller problems
– See how they were solved
– See how they led to new discoveries

Remember

Don’t forget to
– Pick a paper and
– Email me

See the schedule to see what’s taken
– http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html

Recap

3 different types of comparisons

1. Whole genome comparison

2. Gene search

3. Motif discovery (shared pattern discovery)

Agenda

Overview of Shared Pattern Discovery

Edit Distance
– How do you compute it?
– Why it’s not good enough

Alignment
– Why it’s better
– How to compute it

Shared Pattern Discovery

I have 100 rats that all have green eyes
I have 1000 rats that all have blue eyes
What exactly do the 100 rats have in common that gives them green eyes?


A technique called multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences (a group of rats)
– You can identify a subset (rats that have green eyes) and
– You can find a sub-region of DNA (a pattern) that the subset shares
– But that isn’t shared by any other subset (rats that have blue eyes)

Initially, this is how genes were pinpointed


To understand multiple alignment, one needs to understand pair-wise alignment
– Multiple alignment emerged from the successful application of pair-wise alignment
– Pair-wise alignment emerged from improvements to traditional string matching algorithms
– All of this emerged from a need to compare genetic sequences

Exact string matching

Target: CGTACGAC
Pattern: ACGTACGTACGT
Problem: The target cannot be found in the pattern, even though it’s really close
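To make the problem concrete, here is a minimal sketch of exact matching in C using the standard library function `strstr`. The helper name `find_exact` is an assumption made for this illustration; it is not from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the 0-based position of the first exact
   occurrence of target inside pattern, or -1 if there is none. */
long find_exact(const char *pattern, const char *target)
{
    assert(pattern != NULL && target != NULL);
    const char *hit = strstr(pattern, target);   /* NULL when not found */
    return hit ? (long)(hit - pattern) : -1L;
}
```

Calling `find_exact("ACGTACGTACGT", "CGTACGAC")` reports no occurrence, while the near-identical string `"CGTACGTAC"` is found. Exact matching gives no credit for being close.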

Edit Distance

How many edits are needed to exactly match the target with part of the pattern?

Target: CGTACGAC
Pattern: ACGTACGTACGT

Just one: inserting a T after CGTACG turns the target into CGTACGTAC, which occurs in the pattern

Edit Distance

How many edits are needed to exactly match the target with the WHOLE pattern?

Target: CGTACGAC
Pattern: ACGTACGTACGT

Four: the target has 8 symbols and the pattern has 12, and four insertions are enough

Edit Distance – Dynamic Programming

[Figure: dynamic programming table for the edit distance between two DNA sequences. Row 0 and column 0 are filled with 0, 1, 2, …; each interior cell holds the optimal edit distance between the two prefixes that end at that row and column.]

Callouts on the table:
– Optimal edit distance for TG and TCG
– Optimal edit distance for TG and TCGA
– Optimal edit distance for TGA and TCG
– Final answer: optimal edit distance for TGA and TCGA

Edit Distance

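/* seq1 has n symbols, seq2 has m symbols (C strings are 0-indexed) */
/* min(a, b) is a helper that returns the smaller of a and b */
int matrix[n+1][m+1];

for (x = 0; x <= n; x++) matrix[x][0] = x;   /* delete x symbols */
for (y = 1; y <= m; y++) matrix[0][y] = y;   /* insert y symbols */

for (x = 1; x <= n; x++)
    for (y = 1; y <= m; y++)
        if (seq1[x-1] == seq2[y-1])           /* symbols match: no new edit */
            matrix[x][y] = matrix[x-1][y-1];
        else                                   /* cheaper of insert or delete */
            matrix[x][y] = min(matrix[x][y-1] + 1,
                               matrix[x-1][y] + 1);

return matrix[n][m];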
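The fragment above can be turned into a complete function. The following is a sketch under stated assumptions, not the lecture's own implementation: the name `indel_distance`, the flat heap-allocated table, and the `free_ends` flag are all introduced here for illustration. With `free_ends` set, the target may match any contiguous part of the pattern, which corresponds to the earlier "part of the pattern" question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Insert/delete edit distance between seq1 and seq2.
   If free_ends is nonzero, seq1 may match any contiguous part of seq2:
   skipping a prefix or a suffix of seq2 costs nothing.
   Returns -1 if the table cannot be allocated. */
int indel_distance(const char *seq1, const char *seq2, int free_ends)
{
    assert(seq1 != NULL && seq2 != NULL);
    size_t n = strlen(seq1), m = strlen(seq2);
    /* (n+1) x (m+1) table stored row by row in one block */
    int *t = malloc((n + 1) * (m + 1) * sizeof *t);
    if (t == NULL) return -1;
#define T(x, y) t[(x) * (m + 1) + (y)]
    for (size_t x = 0; x <= n; x++) T(x, 0) = (int)x;
    for (size_t y = 0; y <= m; y++) T(0, y) = free_ends ? 0 : (int)y;
    for (size_t x = 1; x <= n; x++)
        for (size_t y = 1; y <= m; y++)
            if (seq1[x - 1] == seq2[y - 1])
                T(x, y) = T(x - 1, y - 1);              /* match: no new edit */
            else
                T(x, y) = min2(T(x, y - 1) + 1,         /* insert */
                               T(x - 1, y) + 1);        /* delete */
    int best = T(n, m);
    if (free_ends)                       /* best place to stop in seq2 */
        for (size_t y = 0; y <= m; y++) best = min2(best, T(n, y));
#undef T
    free(t);
    return best;
}
```

On the lecture's example, `indel_distance("CGTACGAC", "ACGTACGTACGT", 0)` gives 4 (whole pattern) and the same call with `free_ends` set to 1 gives 1 (part of the pattern).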


Edit Distance

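/* seq1 has n symbols, seq2 has m symbols (C strings are 0-indexed) */
/* min(a, b) is a helper that returns the smaller of a and b */
int matrix[n+1][m+1];

for (x = 0; x <= n; x++) matrix[x][0] = x;   /* delete x symbols */
for (y = 0; y <= m; y++) matrix[0][y] = y;   /* insert y symbols */

for (x = 1; x <= n; x++)
    for (y = 1; y <= m; y++)
        if (seq1[x-1] == seq2[y-1])           /* the comparison */
            matrix[x][y] = matrix[x-1][y-1];
        else                                   /* the assignment */
            matrix[x][y] = min(matrix[x][y-1] + 1,
                               matrix[x-1][y] + 1);

return matrix[n][m];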

How many times is this comparison performed?

How many times is this assignment performed?
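One way to answer both questions, assuming sequences of length n and m and both initialization loops starting at 0: the comparison sits inside both nested loops, so it runs once per interior cell, and exactly one of the two interior assignments runs each time.

```latex
\[
\text{comparisons} = \sum_{x=1}^{n} \sum_{y=1}^{m} 1 = nm,
\qquad
\text{interior assignments} = nm,
\qquad
\text{initialization assignments} = (n+1) + (m+1)
\]
```

The total work therefore grows as O(nm).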




[Figure: the edit-distance table again, with n = 8, annotated as follows.]
– To derive the value 7, we need to know that we can match two T’s. In the worst case, this may take n comparisons.
– To derive the value 6, we need to know that we can match two C’s after matching two T’s
– To derive the value 5, we need to know that we can match two G’s after already matching two C’s and previously matching two T’s


[Figure: the same edit-distance table, annotated as follows.]
– Given our previous matches, there is no way we can match two A’s. Thus, the edit distance is increased.
– Luckily, we can match these two C’s. But now we’ve matched the last symbol.
– We can’t do any more matching (period!)

Lesson to learn

There is no way to compute the optimal (minimum) edit distance without considering all possible matching combinations.

The only way to do that is to consider all possible sub-problems.

This is the reason the entire table must be considered.

If you can compute the optimal (minimum) edit distance using substantially fewer than O(nm) computations in the worst case, you will be renowned!

Why Edit Distance Stinks for Genetic Data

DNA evolves in strange ways
– …TAGATCCCAGATCAGTATTCAAGTTATAC… (a gene in the rat genome)
– …GATCTCCCAGATAGAAGCAGTATTCAGTCA… (the same gene in the fruit bat)
– …CCTATCAGCAGGATCAAGTATGTCATACTAC… (a totally unrelated region of the AIDS virus)

The edit distance between rat and virus is smaller than the edit distance between rat and fruit bat.

Alignment

We need a more robust way to measure similarity

Alignment meets several requirements
1. It rewards matches
2. It penalizes mismatches
3. It allows for different strategies for penalizing gaps
4. It helps visualize similarity

Alignment

Example

1. G A A T T C A G T T A (sequence #1)

2. G G A T C G A (sequence #2)

One possible alignment:

G A A T T C A G T T A

G G A _ T C _ G _ _ A

Position 2 is a mismatch; there are gaps at positions 4 and 7, and a gap of size 2 at positions 9–10

Alignment

A simple scoring scheme is used, where
– $S_{i,j}$ is the score at position $i,j$
– $S_{i,j} = 1$ if the residue at position $i$ of sequence #1 matches the residue at position $j$ of sequence #2 (match score); otherwise
– $S_{i,j} = 0$ (mismatch score)
– $w$ is the gap penalty, which we will discuss later

Alignment

Three steps in the dynamic programming algorithm for alignment

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)

Alignment

Initialization Step

The first step creates a matrix with
– M + 1 columns and
– N + 1 rows
– where M and N are the lengths of the sequences to be aligned.

The first row and first column of the matrix can be initially filled with 0.
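The initialization step can be sketched in a few lines of C. The function name `init_matrix` and the choice to store the matrix as one row-major block are assumptions made for this illustration; the slides do not prescribe a storage layout.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates an (n + 1)-row by (m + 1)-column score
   matrix in one row-major block. calloc zero-fills the block, so the first
   row and first column start at 0. Returns NULL if the allocation fails;
   the caller frees the result. */
int *init_matrix(size_t n, size_t m)
{
    assert(n > 0 && m > 0);   /* both sequences are assumed non-empty */
    return calloc((n + 1) * (m + 1), sizeof(int));
}
```

With this layout, the cell in row i and column j is `mat[i * (m + 1) + j]`.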

Alignment

Alignment

Matrix Fill Step
– For each position, $M_{i,j}$ is defined to be the maximum score at position $i,j$; i.e.

\[
M_{i,j} = \max\left[
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right]
\]

Example

The score at position 1,1 can be calculated.

The first residue in both sequences is a G

Thus, $S_{1,1} = 1$

Thus, with a gap penalty of $w = 0$,

\[
M_{1,1} = \max[\,M_{0,0} + 1,\; M_{1,0} + 0,\; M_{0,1} + 0\,] = \max[\,1,\; 0,\; 0\,] = 1
\]
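The matrix fill step can be sketched in C as follows. This is a minimal illustration under stated assumptions: the function name `fill_matrix`, the row-major storage, and indexing rows by sequence #1 and columns by sequence #2 are choices made here, not taken from the slides. The scoring follows the simple scheme (match 1, mismatch 0) with the gap penalty passed in as `w`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fills the score matrix for seq1 (index i) against seq2 (index j) with
   match score 1, mismatch score 0, and gap penalty w. Cell (i, j) is stored
   at result[i * (strlen(seq2) + 1) + j]. Returns NULL on allocation failure;
   the caller frees the result. */
int *fill_matrix(const char *seq1, const char *seq2, int w)
{
    assert(seq1 != NULL && seq2 != NULL);
    size_t len1 = strlen(seq1), len2 = strlen(seq2), cols = len2 + 1;
    int *M = calloc((len1 + 1) * cols, sizeof *M);  /* row 0, column 0 stay 0 */
    if (M == NULL) return NULL;
    for (size_t i = 1; i <= len1; i++) {
        for (size_t j = 1; j <= len2; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;       /* S(i,j) */
            M[i * cols + j] = max3(M[(i - 1) * cols + (j - 1)] + s,
                                   M[i * cols + (j - 1)] + w,   /* gap in sequence #1 */
                                   M[(i - 1) * cols + j] + w);  /* gap in sequence #2 */
        }
    }
    return M;
}
```

For the two example sequences with w = 0, cell (1, 1) is 1, as in the worked example, and the bottom-right cell is 6, which equals the number of matched columns in the example alignment.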

Example

Edit Dist. vs. Alignment Scoring

Note that the metric used in alignment is different from that of edit distance
– Smaller edit distance → more similar
– Higher alignment score → more similar

Also: edit distance refers specifically to edits
– Delete or insert a symbol
– Discrete value
– Not flexible

Alignment Scoring

\[
M_{i,j} = \max\left[
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch score)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right]
\]

S(i,j)   A     C     G     T
A        1.1   0.0   0.3   0.5
C              1.3   0.1   0.0
G                    1.0   0.0
T                          1.2
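A score table like this one can be stored as a small lookup array. The sketch below assumes the table is symmetric, so the blank lower triangle mirrors the upper one; that assumption, and the names `base_index` and `residue_score`, are introduced here for illustration.

```c
#include <assert.h>

/* Maps a nucleotide letter to a table index (A, C, G, T -> 0..3),
   or -1 for any other character. */
static int base_index(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Upper triangle copied from the slide; lower triangle mirrored
   (assumption: the score of X against Y equals the score of Y against X). */
static const double SCORE[4][4] = {
    /*          A    C    G    T  */
    /* A */ { 1.1, 0.0, 0.3, 0.5 },
    /* C */ { 0.0, 1.3, 0.1, 0.0 },
    /* G */ { 0.3, 0.1, 1.0, 0.0 },
    /* T */ { 0.5, 0.0, 0.0, 1.2 },
};

/* Returns S(i,j) for two residues; both must be one of A, C, G, T. */
double residue_score(char a, char b)
{
    int i = base_index(a), j = base_index(b);
    assert(i >= 0 && j >= 0);
    return SCORE[i][j];
}
```

With a table like this, matches on different letters can earn different rewards, and some mismatches (here A against T) can be treated as milder than others.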

Alignment Scoring

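\[
M_{i,j} = \max\left[
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch score)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right]
\qquad w = -1
\]

One possible alignment:

G A A T T C A G T T A

G G A _ T C _ G _ _ A

Gap penalties: -1 (gap at position 4), -1 (gap at position 7), -2 (gap of size 2 at positions 9–10)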
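To see how the gap penalty changes the total, the alignment above can be scored column by column. This is a small sketch assuming the simple scheme from earlier (match 1, mismatch 0) and `_` as the gap symbol; the function name `score_alignment` is made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Scores two gapped rows of equal length: +1 for a matching column,
   0 for a mismatch, and w for every column that contains a gap symbol. */
int score_alignment(const char *row1, const char *row2, int w)
{
    assert(row1 != NULL && row2 != NULL);
    size_t len = strlen(row1);
    assert(len == strlen(row2));
    int score = 0;
    for (size_t k = 0; k < len; k++) {
        if (row1[k] == '_' || row2[k] == '_')
            score += w;                 /* gap column */
        else if (row1[k] == row2[k])
            score += 1;                 /* match */
        /* a mismatch adds 0 */
    }
    return score;
}
```

For the alignment shown, six columns match and four columns contain a gap, so the score is 6 with w = 0 and 6 - 4 = 2 with w = -1.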

Alignment Scoring

Summary:
– We have a way of rewarding different types of matches and mismatches
– We have a separate way of penalizing gaps
– We could choose not to penalize gaps if we had a clue that they weren’t harmful

Recall

DNA evolves in strange ways
– …TAGATCCCAGATCAGTATTCAAGTTATAC… (a gene in the rat genome)
– …GATCTCCCAGATAGAAGCAGTATTCAGTCA… (the same gene in the fruit bat)
– …CCTATCAGCAGGATCAAGTATGTCATACTAC… (a totally unrelated region of the AIDS virus)

The edit distance between rat and virus is smaller than the edit distance between rat and fruit bat.

Tracing back the alignment

(Seq #1) A
         |
(Seq #2) A

(Seq #1) TA
          |
(Seq #2) _A

(Seq #1) TTA
           |
(Seq #2) __A

(Seq #1) GAATTCAGTTA
         | | || |  |
(Seq #2) GGA_TC_G__A
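The traceback can be sketched in C by filling the matrix and then walking from the bottom-right cell back to the top-left, at each step taking a move that explains the current cell's score. This is an illustration under stated assumptions rather than the lecture's implementation: the function name `align`, the tie-breaking order (diagonal first, then a gap in sequence #2, then a gap in sequence #1), and the `_` gap symbol are choices made here. When several alignments share the best score, a different tie-breaking order may produce a different one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

static void reverse(char *s, size_t len)
{
    for (size_t a = 0; a < len / 2; a++) {
        char tmp = s[a];
        s[a] = s[len - 1 - a];
        s[len - 1 - a] = tmp;
    }
}

/* Aligns seq1 and seq2 (match 1, mismatch 0, gap penalty w).
   out1 and out2 each need strlen(seq1) + strlen(seq2) + 1 bytes.
   Stores the bottom-right score in *score. Returns 0 on success,
   -1 if the matrix cannot be allocated. */
int align(const char *seq1, const char *seq2, int w,
          char *out1, char *out2, int *score)
{
    assert(seq1 && seq2 && out1 && out2 && score);
    size_t len1 = strlen(seq1), len2 = strlen(seq2), cols = len2 + 1;
    int *M = calloc((len1 + 1) * cols, sizeof *M);
    if (M == NULL) return -1;
#define S(i, j) (seq1[(i) - 1] == seq2[(j) - 1] ? 1 : 0)
    /* Matrix fill */
    for (size_t i = 1; i <= len1; i++)
        for (size_t j = 1; j <= len2; j++)
            M[i * cols + j] = max3(M[(i - 1) * cols + (j - 1)] + S(i, j),
                                   M[i * cols + (j - 1)] + w,
                                   M[(i - 1) * cols + j] + w);
    *score = M[len1 * cols + len2];

    /* Traceback: builds both rows backwards, then reverses them */
    size_t i = len1, j = len2, k = 0;
    while (i > 0 || j > 0) {
        int here = M[i * cols + j];
        if (i > 0 && j > 0 && here == M[(i - 1) * cols + (j - 1)] + S(i, j)) {
            out1[k] = seq1[--i]; out2[k] = seq2[--j];   /* match or mismatch */
        } else if (i > 0 && (j == 0 || here == M[(i - 1) * cols + j] + w)) {
            out1[k] = seq1[--i]; out2[k] = '_';         /* gap in sequence #2 */
        } else {
            out1[k] = '_'; out2[k] = seq2[--j];         /* gap in sequence #1 */
        }
        k++;
    }
#undef S
    out1[k] = out2[k] = '\0';
    reverse(out1, k);
    reverse(out2, k);
    free(M);
    return 0;
}
```

For the two example sequences with w = 0, the reported score is 6. The two output rows always have the same length, and removing the `_` symbols from each row recovers the original sequence.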
