Sequencing and Sequence Alignment CIS 667 Bioinformatics Spring 2004.

  • date post

    20-Dec-2015
Sequencing and Sequence Alignment

CIS 667 Bioinformatics, Spring 2004

Protein Sequencing

• Before DNA sequencing, protein sequencing was common
  - Sanger won a Nobel Prize for determining the amino acid sequence of insulin
  - Protein sequences are much shorter than today’s DNA fragments
  - One amino acid (aa) at a time can be removed from the end of the protein
  - The removed aa can then be identified

Protein Sequencing

• Unfortunately, this works only for a few aa’s from the end
  - So insulin was broken up into fragments

Gly Ile Val Glu
Ile Val Glu Gln
Gln Cys Cys Ala

Protein Sequencing

• Then the fragments are sequenced
• Afterwards they are assembled by finding the overlapping regions

Gly Ile Val Glu
    Ile Val Glu Gln
                Gln Cys Cys Ala

Gly Ile Val Glu Gln Cys Cys Ala
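
The assembly step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: fragments are error-free, and a greedy "merge the pair with the largest overlap" strategy is used. The function names `overlap` and `assemble` are hypothetical and do not come from the slides.

```python
def overlap(a, b):
    """Length of the longest suffix of list a that equals a prefix of list b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def assemble(fragments):
    """Greedily merge the two fragments with the largest overlap until one remains."""
    frags = [f.split() for f in fragments]
    while len(frags) > 1:
        # Consider every ordered pair (i, j) and keep the one with the largest overlap.
        k, i, j = max(
            ((overlap(a, b), i, j)
             for i, a in enumerate(frags)
             for j, b in enumerate(frags) if i != j),
            key=lambda x: x[0],
        )
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return " ".join(frags[0])


fragments = ["Gly Ile Val Glu", "Ile Val Glu Gln", "Gln Cys Cys Ala"]
print(assemble(fragments))  # Gly Ile Val Glu Gln Cys Cys Ala
```

Greedy merging is not guaranteed to find the best assembly in general, but it is enough for the three insulin fragments shown here.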

Protein Sequencing

• By the late 1960s, protein sequencing machines were on the market

• RNA sequencing followed the same basic methodology by 1965

DNA Sequencing

• DNA was first sequenced by transcribing DNA to RNA
  - Slow: years to sequence tens of base pairs

• By the mid-1970s, Maxam and Gilbert had learned how to cleave DNA selectively at A, C, G, or T
  - This led to the development of the Maxam-Gilbert sequencing method

Maxam-Gilbert Sequencing

• Single-stranded DNA is labeled with a radioactive tag at the 5’ end

• The sample is quartered and digested in four base-specific reactions
  - Reaction concentrations are such that each strand of DNA in each sample is cut once at a random location

• Gel electrophoresis is used to find the lengths of the tagged fragments
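
To see why fragment lengths reveal the sequence, here is a small idealized simulation. It assumes, as the slide does, one reaction per base that cuts exactly at that base, and it ignores the fact that a zero-length fragment would not be visible on a real gel. The names `tagged_fragment_lengths` and `read_gel` are hypothetical.

```python
def tagged_fragment_lengths(seq):
    """For each base, lengths of the 5'-tagged fragments left after one cut at that base."""
    lanes = {base: [] for base in "ACGT"}
    for pos, base in enumerate(seq):
        # Cutting at 0-based position pos leaves a tagged fragment of pos bases.
        lanes[base].append(pos)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering all bands from shortest to longest fragment."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = tagged_fragment_lengths("GATTACA")
print(lanes["A"])       # [1, 4, 6]
print(read_gel(lanes))  # GATTACA
```

Each lane tells us at which positions one particular base occurs. Merging the four lanes by fragment length gives the full sequence.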

Sanger Sequencing

• Today, an alternative method called Sanger sequencing is generally used
  - A primer bonds to single-stranded DNA near the 3’ end of the target to be sequenced
  - DNA polymerase extends the primer along the target DNA
  - This extension is done separately for each of the 4 bases

Sanger Sequencing

• A small amount of extension-terminating nucleotides is introduced
  - This causes the extension to end randomly at a specific base

• Now use gel electrophoresis and read the sequence as the complement of the bases
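
The readout step can be mimicked with a toy model. The sketch below assumes ideal conditions: every possible termination length shows up as a band, and strand orientation is ignored, so the complement is taken position by position. The names `sanger_lanes` and `read_template` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_lanes(template):
    """Lengths of the terminated extension products in each of the four reactions."""
    new_strand = "".join(COMPLEMENT[b] for b in template)
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        # A product of this length ends with the terminating form of `base`.
        lanes[base].append(length)
    return lanes


def read_template(lanes):
    """Order bands by length, then complement each base to recover the template."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(COMPLEMENT[base] for _, base in bands)


lanes = sanger_lanes("ATGCCGTA")
print(lanes["T"])            # [1, 8]
print(read_template(lanes))  # ATGCCGTA
```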

Sanger Sequencing

Sequence Alignment

• Given two strings, find the optimal alignment of the strings
  - Strings may be of different lengths, and the optimal alignment may include gaps
  - An alignment score is produced

Example:

SHALL WEAR
ALL WE

SHALL WEAR
--ALL WE--

Sequence Alignment

• Alignment score produced by looking at each column in the alignment
  - Match gives column a +1 score
  - Mismatch: -1
  - Space: -2

HELLO THERE
JELLO TEAR-

Score: 7*(+1) + 3*(-1) + 1*(-2) = 2
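
The column-by-column scoring can be written directly as code. This is a minimal sketch using the slide's values (+1, -1, -2). The function name `alignment_score` and the use of `-` as the space character are assumptions made for illustration.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, space=-2):
    """Sum the per-column scores of two already-aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += space      # a character aligned with a space
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # different characters
    return score


print(alignment_score("HELLO THERE", "JELLO TEAR-"))  # 2
```

Note that the blank between the words counts as an ordinary character, which is why the example has 7 matches.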

Sequence Alignment

• In biology, the sequences to be aligned consist of nucleotides or amino acids

• Sufficiently similar sequences can allow us to infer homology
  - Homology: common evolutionary history

• We can also infer the function of a protein or gene given similarity to one with known functionality

Sequence Alignment

• Since homologous sequences share a common evolutionary history, the alignment score should reflect evolutionary processes

• DNA changes over time due to mutations
  - Most mutations are harmful
  - May be due to environmental factors, e.g. radiation

Mutation

• May also be due to problems in the DNA replication process
  - One nucleotide may be substituted for another
  - Deletion of a nucleotide
  - Duplication
  - Insertions
  - Inversions

Mutation

Mutation

• Deletions have different effects depending on the number of nucleotides deleted
  - Deletion of 3 nucleotides in an ORF results in the deletion of a codon, so one amino acid is not produced
    Usually damaging, sometimes lethal
  - Deletion of 1 nucleotide causes a frame shift, which changes all downstream amino acids
    Almost always lethal

Codon Deletion

Original:

ATGATACCGACGTACGGCATTTAA

START IPTYGI STOP

With the codon GGC deleted:

ATGATACCGACGTACATTTAA

START IPTYI STOP

Frame Shift

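ATGATACCGACGTACGGCATTTAA

Original reading frame: START IPTYGI STOP

After deletion of one nucleotide (the C of codon TAC, which turns TAC into the stop codon TAG): START IPT STOP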
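
Both the codon deletion and the frame shift can be reproduced with a small translation routine. The sketch below builds the standard genetic code table from a compact 64-letter string (codons enumerated in T, C, A, G order). The name `translate` and the choice of which nucleotides to delete are assumptions inferred from the protein sequences shown on the slides.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: TTT -> F, TTC -> F, ..., GGG -> G ('*' marks a stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Amino acids between the START codon (ATG) and the first STOP codon."""
    if not dna.startswith("ATG"):
        raise ValueError("expected the ORF to begin with ATG")
    protein = []
    for i in range(3, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


orf = "ATGATACCGACGTACGGCATTTAA"
codon_deleted = orf[:15] + orf[18:]   # remove the codon GGC
frame_shifted = orf[:14] + orf[15:]   # remove the single C of codon TAC

print(translate(orf))            # IPTYGI
print(translate(codon_deleted))  # IPTYI
print(translate(frame_shifted))  # IPT
```

Removing a whole codon drops one amino acid and leaves the rest intact, while removing a single nucleotide changes every codon downstream, here producing an early stop.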

Mutations

• Some notes…
  - A single base substitution may even produce the same amino acid (especially if it is the last base in a codon)
  - May also produce a similar amino acid
  - It is impossible to tell whether a gap in an alignment results from an insertion in one sequence or a deletion from the other
  - After mutation, an organism may be more or less likely to survive natural selection

Alignment Scores

• Based on what we have said about mutations, how should we modify the alignment scores?
  - Note that a single long gap is more likely than several shorter ones…
  - Therefore it should have a smaller penalty
  - Say…
    • Match: +1
    • Mismatch: 0
    • Gap origination: -2
    • Gap extension: -1
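
A short sketch shows how these values favor one long gap over several short ones. It rests on one assumption the slide does not state: the first space of a gap costs the origination penalty (-2) and each additional consecutive space in the same sequence costs the extension penalty (-1). The name `affine_score` is hypothetical.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an alignment where opening a gap costs more than extending it."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    prev_gap_row = None  # which row held the space in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


# One gap of length 2 versus two gaps of length 1 (six matches in both cases).
print(affine_score("AACCAAAA", "AA--AAAA"))  # 3
print(affine_score("AACAACAA", "AA-AA-AA"))  # 2
```

With a flat space penalty both alignments would score the same. The split into origination and extension is what rewards the single longer gap.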

Alignment

• We can have sequences with different sizes

• An alignment is defined to be the insertion of spaces in arbitrary locations along the sequences so that they end up being the same size
  - No space in one sequence can be aligned with a space in the other

GA-CGGATTAG
GATCGGAATAG

Alignment

• Let’s use the following scores for similarity - match: +1; mismatch: -1; space: -2

• Let sim(s, t) denote the similarity score for two sequences s and t

• We want to develop an algorithm to compute the maximum sim(s, t) given s and t

Dynamic Programming

• We will use a technique known as dynamic programming
  - Solve an instance of a problem by using an already solved smaller instance of the same problem
  - In our case, we build up the solution by determining the similarities between arbitrary prefixes of the two sequences
  - Start with shorter prefixes, work towards longer ones

Dynamic Programming

• Let m be the size of s and n the size of t
  - Then there are m + 1 prefixes of s and n + 1 prefixes of t, including the empty string
  - We store the similarities of the prefixes in an (m + 1) × (n + 1) array
  - Entry (i, j) contains the similarity between s[1..i] and t[1..j]

Dynamic Programming

• Let s = AAAC and t = AGC
  - We need to initialize part of the array to get started
  - If one of the sequences is empty, we just add as many spaces as there are characters in the other sequence
  - Correspondingly, we fill in the first row and column with multiples of the space penalty (-2)
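
For s = AAAC and t = AGC the initialization looks as follows. This is a minimal sketch. The name `init_matrix` and the list-of-lists representation are assumptions, and the interior entries are simply left at 0 until they are filled in later.

```python
def init_matrix(s, t, g=-2):
    """Create the (m + 1) x (n + 1) array and fill its first row and column."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g   # s[1..i] aligned against the empty prefix of t
    for j in range(n + 1):
        a[0][j] = j * g   # empty prefix of s aligned against t[1..j]
    return a


a = init_matrix("AAAC", "AGC")
print(a[0])                   # [0, -2, -4, -6]
print([row[0] for row in a])  # [0, -2, -4, -6, -8]
```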

Dynamic Programming

• We can compute the value of entry (i, j) by looking at just three previous entries: (i - 1, j), (i - 1, j - 1), (i, j - 1)
  - These correspond to the following choices:
    • Align s[1..i] with t[1..j - 1] and match a space with t[j]
    • Align s[1..i - 1] with t[1..j - 1] and match s[i] with t[j]
    • Align s[1..i - 1] with t[1..j] and match s[i] with a space

Dynamic Programming

• If we compute entries in a smart way, scores for the best alignments between smaller prefixes have already been stored in the array, so

sim(s[1..i], t[1..j]) = max { sim(s[1..i], t[1..j - 1]) - 2,
                              sim(s[1..i - 1], t[1..j - 1]) + p(i, j),
                              sim(s[1..i - 1], t[1..j]) - 2 }

where p(i, j) = +1 if s[i] = t[j], -1 otherwise
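
The recurrence can be transcribed almost literally as a memoized recursive function. This top-down version is only a sketch to make the formula concrete and is not the array-based procedure the slides go on to describe. The inner helper `best` is hypothetical, and Python's 0-based indexing means s[i] on the slide is `s[i - 1]` in the code.

```python
from functools import lru_cache


def sim(s, t, match=1, mismatch=-1, space=-2):
    """Similarity of s and t, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Base cases: one prefix is empty, so every character faces a space.
        if i == 0:
            return j * space
        if j == 0:
            return i * space
        p = match if s[i - 1] == t[j - 1] else mismatch
        return max(
            best(i, j - 1) + space,   # space aligned with t[j]
            best(i - 1, j - 1) + p,   # s[i] aligned with t[j]
            best(i - 1, j) + space,   # s[i] aligned with a space
        )

    return best(len(s), len(t))


print(sim("AAAC", "AGC"))  # -1
```

The cache plays the role of the array: each pair (i, j) is evaluated once and then looked up.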

Dynamic Programming

• We should fill in the array row by row, left to right

• If we denote the array by a then we have

a[i, j] = max { a[i, j - 1] - 2,
                a[i - 1, j - 1] + p(i, j),
                a[i - 1, j] - 2 }

where p(i, j) = +1 if s[i] = t[j], -1 otherwise

Dynamic Programming

Algorithm Similarity
input: sequences s and t
output: similarity of s and t
(g is the space penalty, -2 in our scoring)

m ← |s|
n ← |t|
for i ← 0 to m do
    a[i, 0] ← i × g
for j ← 0 to n do
    a[0, j] ← j × g
for i ← 1 to m do
    for j ← 1 to n do
        a[i, j] ← max(a[i - 1, j] + g,
                      a[i - 1, j - 1] + p(i, j),
                      a[i, j - 1] + g)
return a[m, n]
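
A direct Python rendering of Algorithm Similarity might look like this. It is a sketch: the function name `similarity_table`, the keyword parameters, and the decision to return the whole array (so it can be inspected) are choices made here, not part of the slides.

```python
def similarity_table(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m + 1) x (n + 1) array row by row, left to right, and return it."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # p(i, j) from the slides; s[i] there is s[i - 1] with 0-based strings.
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)
    return a


table = similarity_table("AAAC", "AGC")
for row in table:
    print(row)
print(table[-1][-1])  # -1, the similarity of AAAC and AGC
```

The algorithm fills (m + 1)(n + 1) entries with constant work each, so it runs in time and space proportional to m × n.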

Optimal Alignments

• So now we know the maximum similarity, but we still need to compute the optimal alignment
  - We will use the array a of similarities previously computed
  - To be continued …