Lecture 3 – Sequence alignment · Lecture 3 – Sequence alignment 15th September 2010 ....

Lecture 3 – Sequence alignment

15th September 2010

Bioinformatics course

  Course webpage
    http://courses.cs.ut.ee/2010/bioinformatics/

  Lecture 1 – Introduction to Bioinformatics
  Lecture 2 – Biological Databases, Assignment 1
  Lecture 3 – Sequence alignments, Assignment 2
  Lecture 4 – Sequence alignments, Assignment 3

Outline

  Introduction
  Sequence similarity searches
  Similarity and homology
  Sequence alignment
  Alignment algorithms
  Scores

Search word ??

  Word
  Book name
  Text
  Title
  Reference
  ………

Organisms/Species, cells (DNA-RNA-Protein)

Stored in databases

What next ???

DNA Mutation and Repair

  A mutation, which may arise during replication and/or recombination, is a permanent change in the nucleotide sequence of DNA.
  Damaged DNA can be mutated by substitution, deletion or insertion of base pairs.
  Mutations are, for the most part, harmless, except when they lead to cell death or tumor formation.
  Because of the lethal potential of DNA mutations, cells have evolved mechanisms for repairing damaged DNA.

Types of Mutations

  There are three types of DNA mutations: base substitutions, deletions and insertions.

Mutations/substitutions

Point mutations

Codon table

Mutations / substitutions

  Synonymous substitutions
    TTC (Phe) >>> TTT (Phe)

  Non-synonymous substitutions
    TTC (Phe) >>> TTA (Leu)
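The two substitution examples can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and DNA codons written with T; the names `CODON_TABLE` and `classify_substitution` are illustrative and not part of the lecture.

```python
# Standard genetic code, built from the usual compact TCAG ordering.
# One-letter amino acid codes are used: F = Phe, L = Leu, * = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_substitution(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same amino acid."""
    same = CODON_TABLE[codon_before] == CODON_TABLE[codon_after]
    return "synonymous" if same else "non-synonymous"


print(classify_substitution("TTC", "TTT"))  # Phe -> Phe
print(classify_substitution("TTC", "TTA"))  # Phe -> Leu
```

A lookup table is enough here because a point mutation only changes which of the 64 codons is read.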

Why or when to compare two sequences?

  Are they homologous / do they share a common ancestor?
  Do they share the same domain?
  Identify the exact locations to see the common features: active sites
  Compare a gene and its product

Q. Similarity by chance or ancestral?

  Homology and similarity are often used interchangeably
  Alignment can reveal homology
    Orthologous
    Paralogous
  Similarity: the sequence in question shows some degree of match

Sequence

  A sequence in question: Query
  A matching sequence: Hit

Principles of sequence alignment

[Figure: four numbered panels, 1 to 4]

Key concepts in sequence alignment

To locate equivalent regions of two or more sequences to maximize their similarity

Score : Identity = 85%

Sequences of same length


Sequences of different length
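An identity score such as the 85% above can be computed from two aligned strings. This is a minimal sketch, assuming gaps are written as "-" and that identity is taken over the full alignment length; other denominators (for example the shorter sequence) are also in use, so the convention here is an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both symbols agree and are not gaps.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# 17 identical columns out of 20
print(percent_identity("A" * 17 + "CCC", "A" * 20))
```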


When do you say sequences are homologous?

  Nucleotide: if the paired sequences share at least 70% identity over more than 100 bases (E-value lower than 10e-4)

  Protein: if the paired sequences share at least 25% identity over more than 100 amino acids (E-value lower than 10e-4)
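These rules of thumb can be written down as a simple filter. The sketch below assumes that all three conditions (identity, length, E-value) must hold together, and it copies the cutoff literally as written on the slide; the function and constant names are hypothetical.

```python
# Cutoff copied as written on the slide; as a Python literal 10e-4 equals 0.001.
E_VALUE_CUTOFF = 10e-4

# Minimum percent identity per sequence type, taken from the slide.
MIN_IDENTITY = {"nucleotide": 70.0, "protein": 25.0}
MIN_LENGTH = 100  # the match must span MORE than this many residues


def passes_homology_rule(kind, identity_pct, aligned_length, e_value):
    """Apply the slide's rule of thumb (assumes all conditions must hold)."""
    return (
        identity_pct >= MIN_IDENTITY[kind]
        and aligned_length > MIN_LENGTH
        and e_value < E_VALUE_CUTOFF
    )


print(passes_homology_rule("protein", 30.0, 150, 1e-10))
print(passes_homology_rule("nucleotide", 60.0, 150, 1e-10))
```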

Choosing a method

Pairwise comparisons

Method              Situations

Dot plot            General exploration of your sequence
                    Discovering repeats
                    Finding long insertions/deletions
                    Extracting portions of sequences to make a multiple alignment

Local alignments    Comparing sequences with partial homology
                    Making high-quality alignments
                    Making residue-per-residue analysis

Global alignments   Comparing two sequences over their entire length
                    Identifying long insertions/deletions
                    Checking the quality of your data
                    Identifying every mutation in your sequence

BLAST

FASTA

Dot plot

  Definition: a graphical method that allows the comparison of two biological sequences and identifies regions of close similarity between them
  Compare each sequence against the other
  Results
    Repeated regions / domains
    Regions with small motifs repeated many times (low complexity)
    Palindromes (portions of DNA repeated in opposite directions)
    Potential secondary structures in RNA
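A dot plot is simple enough to build by hand. The following is a minimal text-based sketch, assuming exact matching of windows of a fixed size; the `window` parameter and the function name are illustrative choices, and real tools usually add score thresholds and graphical output.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return one text row per position of seq_a.

    A '*' at row i, column j means seq_a[i:i+window] equals seq_b[j:j+window].
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            row += "*" if seq_a[i:i + window] == seq_b[j:j + window] else "."
        rows.append(row)
    return rows


for line in dot_plot("ACATG", "CATTG"):
    print(line)
```

Runs of `*` along a diagonal correspond to shared segments; a larger `window` suppresses isolated chance matches.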

Aligning text

Raw Data ?
A C A T G
C A T T G

How many possible ways can we align ?

2 matches, 0 gaps

A C A T G
      | |
C A T T G

3 matches (2 gaps in ends)

A C A T G .
  | | |
. C A T T G

4 matches, 1 insertion

A C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion

A C A T - G
  | | |   |
. C A T T G
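The match and gap counts for these hand-made alignments can be verified with a few lines of code. This sketch assumes both rows are written with "-" for every gap, including the end gaps shown as "." on the slide; the function name is hypothetical.

```python
def count_matches_and_gaps(top, bottom):
    """Count identical columns and gap-containing columns of an alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    gap_columns = sum(1 for x, y in zip(top, bottom) if x == "-" or y == "-")
    return matches, gap_columns


# The four alignments of ACATG and CATTG, in the order shown above
for top, bottom in [
    ("ACATG", "CATTG"),
    ("ACATG-", "-CATTG"),
    ("ACA-TG", "-CATTG"),
    ("ACAT-G", "-CATTG"),
]:
    print(top, bottom, count_matches_and_gaps(top, bottom))
```

Note that the last two alignments each contain two gap columns: the internal insertion plus one end gap.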

Dynamic programming

  What to do if the text is bigger?

SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGGREGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEP

KPNEPRGDILLPTVGHALAFIERLERPELYGVNPEVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRTEDFDGVWAS

  Needleman-Wunsch (1970) provided the first automatic method
  Dynamic programming to find a global alignment
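The idea behind the Needleman-Wunsch method can be sketched in a few lines. The version below is a simplified illustration, not the lecture's own implementation: it assumes a flat scoring scheme (match +1, mismatch -1, gap -1, all chosen here for illustration) rather than a substitution matrix, and it returns one optimal alignment among possibly several.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, row_a, row_b = needleman_wunsch("ACATG", "CATTG")
print(score)
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths, which is what makes the approach practical for sequences like the two proteins above where enumeration by hand is not.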

Aligning a 4 character A C B P

A C P M - A C P M A C - P M

Global and Local alignments

Local alignment

  The scoring system uses negative scores for mismatches
  The minimum score at a matrix element is zero
  Find the best score anywhere in the matrix (not just the last column or row)
  These three changes cause the algorithm to seek high-scoring subsequences,
    which are not penalized for their global effects,
    which don’t include areas of poor match,
    and which can occur anywhere
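The three changes listed above are visible directly in code. The following is a minimal sketch of a local alignment in the style of Smith-Waterman, assuming a flat scoring scheme (match +2, mismatch -1, gap -2, chosen here only for illustration) instead of a substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, segment_a, segment_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # negative mismatch
            H[i][j] = max(0,                                 # floor at zero
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:                               # best anywhere
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    seg_a, seg_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            seg_a.append(a[i - 1])
            seg_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a.append(a[i - 1])
            seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-")
            seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


print(smith_waterman("AAAGGGTTT", "CCGGGCC"))
```

In this example only the shared GGG segment is reported; the dissimilar flanks do not lower the score because every cell is floored at zero.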

Scoring matrices (BLOSUM, PAM), e.g.

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C   0  -3  -3  -3   8  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H  -2   0   1  -1  -3   0   0  -2   7  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   6  -1  -1  -4  -3  -2
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  10   2  -3
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   6  -1
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
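A scoring matrix is used by looking up each aligned pair of residues and summing the values. The sketch below copies only a handful of entries from the matrix above into a hypothetical `EXCERPT` dictionary, so it can score only alignments made of those pairs; a real program would load the complete matrix.

```python
# A few entries copied from the matrix above; the matrix is symmetric,
# so each pair is stored once and looked up in either order.
EXCERPT = {
    ("A", "A"): 4,
    ("I", "I"): 4,
    ("I", "L"): 2,
    ("I", "V"): 3,
    ("L", "V"): 1,
    ("A", "R"): -1,
    ("W", "W"): 10,
}


def pair_score(x, y):
    """Look up the substitution score for residues x and y."""
    if (x, y) in EXCERPT:
        return EXCERPT[(x, y)]
    return EXCERPT[(y, x)]  # raises KeyError if the pair is not in the excerpt


def ungapped_score(seq_a, seq_b):
    """Sum of pair scores for two equal-length sequences aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


print(ungapped_score("AIW", "ALW"))  # A-A + I-L + W-W
```

Conservative exchanges such as I to L still score positively, which a plain match/mismatch scheme cannot express.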

Where do matrices come from?

  Manually align protein structures
  Look at the frequency of amino acid substitutions at structurally constant sites
  Compute log-odds: S(aa-1, aa-2) = log2( freq(O) / freq(E) ), where O = observed exchanges and E = expected exchanges
    odds = freq(observed) / freq(expected)
    Sij = log odds
    freq(expected) = f(i)*f(j), the chance of getting amino acid i in a column and then having it change to j
    e.g. an A-R pair observed only a tenth as often as expected has odds = 0.1 and therefore a negative score
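The log-odds formula can be evaluated directly. In this minimal sketch the background frequencies 0.08 and 0.05 are made-up illustrative numbers, not measured values; only the ratio of observed to expected matters for the result.

```python
import math


def log_odds_score(observed_freq, freq_i, freq_j):
    """S_ij = log2(observed / expected), with expected = f(i) * f(j)."""
    expected = freq_i * freq_j
    return math.log2(observed_freq / expected)


# Hypothetical background frequencies for the two amino acids
f_a, f_r = 0.08, 0.05
expected = f_a * f_r

# A pair observed only a tenth as often as expected gets a negative score
print(log_odds_score(0.1 * expected, f_a, f_r))   # about -3.32
# A pair observed exactly as often as expected scores about zero
print(log_odds_score(expected, f_a, f_r))
```

Published matrices typically scale and round such values to integers, which is why the table entries are whole numbers.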

Local vs. Global Alignment

  GLOBAL
    best alignment of the entirety of both sequences
    For optimum global alignment, we want the best score in the final row or final column
    Are these sequences generally the same?
    Needleman-Wunsch
    find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity

  LOCAL
    best alignment of segments, without regard to the rest of the sequence
    For optimum local alignment, we want the best score anywhere in the matrix
    Do these two sequences contain high-scoring subsequences?
    Smith-Waterman
    find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score

Global vs Local alignments

[Figure: two panels, Global and Local]

BLAST

  Extend hits into High Scoring Segment Pairs (HSPs)
  Stop extension when total score doesn’t increase

  Starts with all overlapping words from query
  Calculates “neighborhood” of each word using PAM matrix and probability threshold
  Looks up all words and neighbors from query in database index
  Extends High Scoring Pairs (HSPs) left and right to maximal length
  Finds Maximal Segment Pairs (MSPs) between query and database
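The seed-and-extend idea behind these steps can be illustrated with a toy example. This is a heavily simplified sketch and not how BLAST itself is implemented: it uses exact word matches only (no neighborhood words or scoring matrix), a single subject sequence instead of a database, and it extends a hit only while the next residues are identical, so the score stops as soon as it would not increase. All function names are hypothetical.

```python
from collections import defaultdict


def overlapping_words(seq, k):
    """All overlapping words of length k, with their start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(subject, k):
    """Map each word of the subject sequence to the positions where it occurs."""
    index = defaultdict(list)
    for pos, word in overlapping_words(subject, k):
        index[word].append(pos)
    return index


def seed_and_extend(query, subject, k=3):
    """Return (query_start, subject_start, length) for each extended word hit."""
    index = build_index(subject, k)
    segments = set()
    for q, word in overlapping_words(query, k):
        for s in index.get(word, []):
            qs, ss, length = q, s, k
            # Extend to the left while residues are identical.
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs, ss, length = qs - 1, ss - 1, length + 1
            # Extend to the right while residues are identical.
            while (qs + length < len(query) and ss + length < len(subject)
                   and query[qs + length] == subject[ss + length]):
                length += 1
            segments.add((qs, ss, length))
    # Longest segments first
    return sorted(segments, key=lambda seg: -seg[2])


print(seed_and_extend("ACATG", "CATTG"))
```

Indexing words first means only positions that already share a word are examined, which is the main reason this style of search is much faster than filling a full dynamic programming matrix.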

Types of BLAST

  Practicals and assignments
