Lecture 3 – Sequence alignment, 15th September 2010



Bioinformatics course

Course webpage: http://courses.cs.ut.ee/2010/bioinformatics/

  Lecture 1 – Introduction to Bioinformatics
  Lecture 2 – Biological Databases, Assignment 1
  Lecture 3 – Sequence alignments, Assignment 2
  Lecture 4 – Sequence alignments, Assignment 3


Outline

  Introduction
  Sequence similarity searches
  Similarity and homology
  Sequence alignment
  Alignment algorithms
  Scores


Search word ??

  Word
  Book name
  Text
  Title
  Reference
  ………


Organisms/Species
Cells (DNA-RNA-Protein)

Stored in databases. What next?


DNA Mutation and Repair

  A mutation, which may arise during replication and/or recombination, is a permanent change in the nucleotide sequence of DNA.
  Damaged DNA can be mutated by substitution, deletion or insertion of base pairs.
  Most mutations are harmless; the harmful ones are those that lead to cell death or tumor formation.
  Because DNA mutations can be lethal, cells have evolved mechanisms for repairing damaged DNA.

Types of Mutations

  There are three types of DNA mutations: base substitutions, deletions and insertions.
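
As a rough illustration of the three mutation types listed above, the sketch below applies each one to a short DNA string. The function names and the example sequence are hypothetical and chosen only for this illustration; positions are 0-based.

```python
def substitute(seq, pos, base):
    """Base substitution: replace the base at position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq, pos, length=1):
    """Deletion: remove `length` bases starting at position pos."""
    return seq[:pos] + seq[pos + length:]


def insert(seq, pos, bases):
    """Insertion: add `bases` in front of position pos."""
    return seq[:pos] + bases + seq[pos:]


original = "ACATG"
print(substitute(original, 2, "T"))  # ACTTG
print(delete(original, 0))           # CATG
print(insert(original, 3, "T"))      # ACATTG
```

A substitution keeps the sequence length unchanged, while deletions and insertions change it, which is why alignments need gap characters.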


Mutations/substitutions


Point mutations


Codon table


Mutations / substitutions

  Synonymous substitutions
    TTC (Phe) >>> TTT (Phe)

  Non-synonymous substitutions
    TTC (Phe) >>> TTA (Leu)
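
The two examples above can be checked against the standard genetic code. The following is a minimal sketch: it builds the codon table from the usual 64-letter amino acid string (bases in T, C, A, G order) and compares the amino acids before and after a substitution. The function name is an assumption made for this example, and amino acids are given as one-letter codes (F = Phe, L = Leu).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are ordered TTT, TTC, TTA, TTG, TCT, ...
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same amino acid."""
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    return "synonymous" if before == after else "non-synonymous"


print(classify_substitution("TTC", "TTT"))  # synonymous (F -> F)
print(classify_substitution("TTC", "TTA"))  # non-synonymous (F -> L)
```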


Why or when to compare two sequences?

  Are they homologous (do they share a common ancestor)?
  Do they share the same domain?
  Identify the exact locations of common features, e.g. active sites
  Compare a gene and its product


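Q. Similarity by chance or ancestral?

  Homology and similarity are often used interchangeably (strictly, homology implies common ancestry, while similarity is only an observed degree of match)
  Alignment can reveal homology
    Orthologous
    Paralogous
  Similarity: a sequence in question shows some degree of match to another sequence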


Sequence

  A sequence in question: Query
  A matching sequence: Hit


Principles of sequence alignment



Key concepts in sequence alignment

To locate equivalent regions of two or more sequences to maximize their similarity

Score : Identity = 85%


Sequences of same length

Score : Identity = 30%


Sequences of different length

Score : Identity = 69%
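
Identity scores like the ones quoted on these slides can be computed from a finished alignment. The sketch below is one possible definition, not necessarily the one used to produce the slide figures: it assumes gaps are written as "-" and counts only columns where both sequences have a residue. Other conventions divide by the full alignment length or by the shorter sequence, and give different percentages.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns containing a gap in either sequence are ignored.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


print(percent_identity("ACATG", "CATTG"))    # 40.0 (2 of 5 columns)
print(percent_identity("ACA-TG", "-CATTG"))  # 100.0 (4 of 4 ungapped columns)
```

The second call shows why the choice of denominator matters: ignoring gapped columns hides the two gaps entirely.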


When do you say sequences are homologous?

  Nucleotide: if the paired sequences share at least 70% identity over more than 100 bases (E-value lower than 10e-4)

  Protein: if the paired sequences share at least 25% identity over more than 100 amino acids (E-value lower than 10e-4)
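
The identity and length thresholds above can be written as a simple check. This is only a sketch of the rule of thumb: the function name and parameters are hypothetical, and the E-value condition is left out because it has to come from a database search tool rather than from the alignment alone.

```python
def looks_homologous(identity_pct, aligned_length, is_protein):
    """Apply the identity/length rule of thumb from the slide.

    Assumption: only identity and aligned length are checked; the E-value
    criterion is not modelled here.
    """
    min_identity = 25.0 if is_protein else 70.0
    return identity_pct >= min_identity and aligned_length > 100


print(looks_homologous(30.0, 150, is_protein=True))   # True
print(looks_homologous(30.0, 150, is_protein=False))  # False
```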


Choosing a method

Pairwise comparisons

Method             Situations

Dot plot           General exploration of your sequence
                   Discovering repeats
                   Finding long insertions/deletions
                   Extracting portions of sequences to make a multiple alignment

Local alignments   Comparing sequences with partial homology
                   Making high-quality alignments
                   Making residue-per-residue analysis

Global alignments  Comparing two sequences over their entire length
                   Identifying long insertions/deletions
                   Checking the quality of your data
                   Identifying every mutation in your sequence

BLAST

FASTA


Dot plot

  Definition: a graphical method that compares two biological sequences and identifies regions of close similarity between them
  Compare each sequence against the other
  Results
    Repeated regions / domains
    Regions with small motifs repeated many times (low complexity)
    Palindromes (portions of DNA repeated in opposite directions)
    Potential secondary structures in RNA
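
A dot plot can be sketched in a few lines: one sequence runs down the rows, the other across the columns, and a mark is placed wherever the two residues are identical. The version below is a minimal text-only illustration with no window or threshold filtering, which real dot plot tools usually add; the function name is an assumption made for this example.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_a; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


seq_a, seq_b = "ACATG", "CATTG"
print("  " + seq_b)
for residue, row in zip(seq_a, dot_plot(seq_a, seq_b)):
    print(residue, row)
```

Similar regions show up as diagonal runs of marks. For these two sequences the longest run is the diagonal C, A, T, which is offset from the main diagonal by one position.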


Aligning text

Raw data: A C A T G and C A T T G

How many possible ways can we align them?


Aligning text

Raw data: A C A T G and C A T T G

2 matches, 0 gaps

  A C A T G
        | |
  C A T T G

3 matches (2 gaps in ends)

  A C A T G .
    | | |
  . C A T T G

4 matches, 1 insertion

  A C A - T G
    | |   | |
  . C A T T G

4 matches, 1 insertion

  A C A T - G
    | | |   |
  . C A T T G
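
The first two alignments above use no internal gaps, so they can be found by sliding one sequence along the other and counting matching columns. The sketch below does exactly that; the function name and the offset convention (positive means the bottom sequence is shifted right) are assumptions made for this example.

```python
def matches_at_offset(top, bottom, offset):
    """Count identical columns when `bottom` is shifted right by `offset`."""
    count = 0
    for i, residue in enumerate(bottom):
        j = i + offset  # column of `top` that faces bottom[i]
        if 0 <= j < len(top) and top[j] == residue:
            count += 1
    return count


top, bottom = "ACATG", "CATTG"
for offset in range(-len(bottom) + 1, len(top)):
    print(offset, matches_at_offset(top, bottom, offset))
```

Offset 0 gives the 2-match alignment and offset 1 gives the 3-match alignment with gaps at the ends. The 4-match alignments need an internal gap and cannot be found by shifting alone.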


Dynamic programming

  What to do if the text is bigger?

SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGGREGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEP

KPNEPRGDILLPTVGHALAFIERLERPELYGVNPEVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRTEDFDGVWAS

  Needleman-Wunsch (1970) provided the first automatic method
  Dynamic programming to find a global alignment
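
A minimal sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch, is given below. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration and are not taken from the lecture. This version also penalizes gaps at the ends, so the score is read from the bottom-right cell of the matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACATG", "CATTG")
print(score)
print(top)
print(bottom)
```

With these assumed scores the best global alignment of ACATG and CATTG has score 2 (four matches and two gaps). Several alignments can share the optimal score, and the traceback returns only one of them. The work grows with the product of the two sequence lengths, which is what makes the approach practical for long sequences like the two proteins shown above.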


Aligning a 4-character sequence

A C B P

A C P M
- A C P M
A C - P M


Global and Local alignments


Local alignment

  The scoring system uses negative scores for mismatches
  The minimum score at a matrix element is zero
  Find the best score anywhere in the matrix (not just last column or row)

These three changes cause the algorithm to seek high-scoring subsequences,
  which are not penalized for their global effects,
  which don't include areas of poor match,
  and which can occur anywhere
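
The three changes listed above can be seen directly in code. The sketch below follows the Smith-Waterman idea with assumed scores (match +2, mismatch -1, gap -1, none of which come from the lecture): cell values are clamped at zero, the best cell anywhere in the matrix is remembered, and the traceback stops when it reaches a zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,  # change 2: a cell never drops below zero
                score[i - 1][j - 1] + sub(i, j),  # change 1: mismatches are negative
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:  # change 3: best score anywhere in the matrix
                best, best_pos = score[i][j], (i, j)

    # Traceback from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACATGCCC", "TTACATGTT"))  # (10, 'ACATG', 'ACATG')
```

The two example strings are made up for this illustration: they share only the segment ACATG, and the poorly matching flanks are left out of the result instead of dragging the score down.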


Scoring matrices (BLOSUM, PAM)

e.g.

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
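
To show how such a matrix is used, the sketch below scores an ungapped protein alignment by summing one matrix entry per column. Only five entries are copied from the matrix above to keep the example short; the dictionary layout, the function names and the two short peptides are assumptions made for this illustration.

```python
# A few entries copied from the matrix above; the full table has 20 x 20 values.
SCORES = {
    ("A", "A"): 4,
    ("W", "W"): 10,
    ("R", "K"): 2,
    ("N", "D"): 1,
    ("A", "R"): -1,
}


def pair_score(x, y):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_score(seq_a, seq_b):
    """Sum the matrix scores over the columns of an ungapped alignment."""
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


print(alignment_score("ARNW", "AKDW"))  # 4 + 2 + 1 + 10 = 17
```

Conservative exchanges such as R/K and N/D still earn positive scores, which is the main difference from simple identity counting.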


Where do matrices come from?

  Manually align protein structures
  Look at frequency of a.a. substitutions at structurally constant sites
  Compute log-odds: S(aa-1, aa-2) = log2( freq(O) / freq(E) ), where O = observed exchanges and E = expected exchanges

  odds = freq(observed) / freq(expected)
  Sij = log odds
  freq(expected) = f(i)*f(j), the chance of getting amino acid i in a column and then having it change to j
  e.g. A-R pair observed only a tenth as often as expected
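
The A-R example can be worked through numerically. In the sketch below the background frequencies are hypothetical placeholders, and the result is the raw log2 value, before any scaling or rounding that published matrices apply.

```python
import math


def log_odds(observed_freq, freq_i, freq_j):
    """S_ij = log2(observed / expected), with expected = f(i) * f(j)."""
    expected = freq_i * freq_j
    return math.log2(observed_freq / expected)


# Hypothetical background frequencies for A and R (illustration only).
f_a, f_r = 0.08, 0.05
expected = f_a * f_r

# A pair observed a tenth as often as expected gets a negative score.
print(round(log_odds(expected / 10, f_a, f_r), 2))  # -3.32
# A pair observed exactly as often as expected scores zero.
print(log_odds(expected, f_a, f_r))                 # 0.0
```

Pairs seen more often than expected get positive scores and pairs seen less often get negative ones, which matches the sign pattern in the matrix.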


Local vs. Global Alignment

GLOBAL
  best alignment of the entirety of both sequences
  For optimum global alignment, we want the best score in the final row or final column (this assumes end gaps are not penalized; when they are, the score is read from the bottom-right cell)
  Are these sequences generally the same?
  Needleman-Wunsch
  find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity

LOCAL
  best alignment of segments, without regard to the rest of the sequence
  For optimum local alignment, we want the best score anywhere in the matrix
  Do these two sequences contain high-scoring subsequences?
  Smith-Waterman
  find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score


Global vs Local alignments

[Figure panels: Global, Local]


BLAST

  Extend hits into High-scoring Segment Pairs (HSPs)
  Stop extension when total score doesn't increase

  Starts with all overlapping words from query
  Calculates "neighborhood" of each word using PAM matrix and a score threshold
  Looks up all words and neighbors from query in database index
  Extends High-scoring Segment Pairs (HSPs) left and right to maximal length
  Finds Maximal Segment Pairs (MSPs) between query and database
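
The word-lookup and extension steps above can be sketched as follows. This is a heavily simplified illustration and not the actual BLAST implementation: it uses exact word matches only (no neighborhood words and no scoring matrix), and it extends a hit only while residues are identical, which is when the score would keep increasing under a simple match/mismatch scheme. All function names and the example strings are hypothetical.

```python
from collections import defaultdict


def build_index(subject, w):
    """Map every overlapping word of length w to its positions in the subject."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index


def extend_hit(query, subject, q, s, w):
    """Extend a word hit left and right without gaps while residues match."""
    q_start, s_start, q_end, s_end = q, s, q + w, s + w
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start


def find_segment_pairs(query, subject, w=3):
    """Return (query_start, subject_start, length) tuples, longest first."""
    index = build_index(subject, w)
    found = set()
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            found.add(extend_hit(query, subject, q, s, w))
    return sorted(found, key=lambda hit: (-hit[2], hit[0], hit[1]))


print(find_segment_pairs("GGGACATGCCC", "TTACATGTT"))  # [(3, 2, 5)]
```

Three different words (ACA, CAT, ATG) seed the same region here, and all of them extend to the same five-residue segment, so it is reported once.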


Types of BLAST


  Practicals and assignments