Lecture 3 – Sequence alignment
15th September 2010
Bioinformatics course
Course webpage:
http://courses.cs.ut.ee/2010/bioinformatics/
Lecture 1 – Introduction to Bioinformatics
Lecture 2 – Biological Databases, Assignment 1
Lecture 3 – Sequence alignments, Assignment 2
Lecture 4 – Sequence alignments, Assignment 3
Outline
Introduction
Sequence similarity searches
Similarity and homology
Sequence alignment
Alignment algorithms
Scores
Search word ?? Word Book name Text Title Reference ………
Organisms/Species, cells (DNA-RNA-Protein)
Stored in databases. What next?
DNA Mutation and Repair
A mutation, which may arise during replication and/or recombination, is a permanent change in the nucleotide sequence of DNA.
Damaged DNA can be mutated by substitution, deletion or insertion of base pairs.
Mutations, for the most part, are harmless except when they lead to cell death or tumor formation.
Because of the lethal potential of DNA mutations, cells have evolved mechanisms for repairing damaged DNA.
Types of Mutations
There are three types of DNA mutations: base substitutions, deletions and insertions.
Mutations/substitutions
Point mutations
Codon table
Mutations / substitutions
Synonymous substitutions: TTC (Phe) >>> TTT (Phe)
Non-synonymous substitutions: TTC (Phe) >>> TTA (Leu)
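The two substitution classes above can be checked mechanically by translating both codons and comparing the amino acids. The sketch below is only illustrative: the name classify_substitution is hypothetical, and the lookup table is deliberately limited to the four codons needed for the slide's examples (their assignments follow the standard genetic code).

```python
# Partial codon table (standard genetic code), limited to the codons
# used in the slide's examples. A real table would have 64 entries.
CODON_TO_AA = {
    "TTT": "Phe",
    "TTC": "Phe",
    "TTA": "Leu",
    "TTG": "Leu",
}


def classify_substitution(before, after, table=CODON_TO_AA):
    """Return 'synonymous' if both codons encode the same amino acid,
    otherwise 'non-synonymous'. Hypothetical helper for illustration."""
    before, after = before.upper(), after.upper()
    if before == after:
        raise ValueError("the two codons are identical: no substitution")
    if before not in table or after not in table:
        raise KeyError("codon not present in the partial table")
    same_amino_acid = table[before] == table[after]
    return "synonymous" if same_amino_acid else "non-synonymous"


if __name__ == "__main__":
    print(classify_substitution("TTC", "TTT"))
    print(classify_substitution("TTC", "TTA"))
```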
Why or when to compare two sequences?
Are they homologous / do they share a common ancestor?
Do they share the same domain?
Identify the exact locations to see the common features (active sites)
Compare a gene and its product
Q. Similarity by chance or ancestral?
Homology and similarity are often used interchangeably
Alignment can reveal homology
Orthologous Paralogous
Similarity: a sequence in question shows some degree of match
Sequence
A sequence in question: Query
A matching sequence: Hit
Principles of sequence alignment
Key concepts in sequence alignment
To locate equivalent regions of two or more sequences to maximize their similarity
Score : Identity = 85%
Sequences of same length
Score : Identity = 30%
Sequence of different length
Score : Identity = 69%
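The identity scores quoted above can be computed from any pair of aligned sequences. The sketch below assumes one common convention, matches divided by the number of alignment columns; other denominators (for example the length of the shorter sequence) are also in use, so the slide figures may have been computed differently. The function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity of two already-aligned sequences.

    Assumption: identity = identical columns / total alignment columns.
    Columns containing a gap character never count as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x == y and x not in gap_chars
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Ungapped comparison of two 5-letter sequences.
    print(percent_identity("ACATG", "CATTG"))
    # Same sequences with one internal gap and one end gap.
    print(round(percent_identity("ACA-TG", ".CATTG"), 1))
```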
When do you say sequences are homologous?
Nucleotide: if the paired sequences share at least 70% identity over more than 100 bases (E-value lower than 10^-4)
Protein: if the paired sequences share at least 25% identity over more than 100 amino acids (E-value lower than 10^-4)
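The rule of thumb above can be written as a small check. This is only a sketch of the slide's thresholds: the function name likely_homologous and its parameters are hypothetical, the E-value cutoff is read as 1e-4, and the E-value itself has to come from a search tool such as BLAST because it is not computed here.

```python
# Minimum percent identity per sequence type, taken from the rule of thumb.
IDENTITY_THRESHOLDS = {"nucleotide": 70.0, "protein": 25.0}


def likely_homologous(identity_pct, aligned_length, seq_type,
                      e_value=None, min_length=100, max_e_value=1e-4):
    """Apply the identity/length rule of thumb (hypothetical helper).

    If an E-value is supplied it must be lower than max_e_value;
    if it is omitted, only identity and length are checked.
    """
    if seq_type not in IDENTITY_THRESHOLDS:
        raise ValueError("seq_type must be 'nucleotide' or 'protein'")
    if e_value is not None and e_value >= max_e_value:
        return False
    enough_identity = identity_pct >= IDENTITY_THRESHOLDS[seq_type]
    long_enough = aligned_length > min_length
    return enough_identity and long_enough
```

A passing check is a hint, not proof of common ancestry; a failing check does not rule homology out either.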
Choosing a method
Pairwise comparisons
Method: Situations
Dot plot: General exploration of your sequence; Discovering repeats; Finding long insertions/deletions; Extracting portions of sequences to make a multiple alignment
Local alignments: Comparing sequences with partial homology; Making high-quality alignments; Making residue-per-residue analysis
Global alignments: Comparing two sequences over their entire length; Identifying long insertions/deletions; Checking the quality of your data; Identifying every mutation in your sequence
BLAST
FASTA
Dot plot
Definition: a graphical method that allows the comparison of two biological sequences and identifies regions of close similarity between them
Compare each sequence against the other
Results
Repeated regions / domains
Regions with small motifs repeated many times (low complexity)
Palindromes (portions of DNA repeated in opposite directions)
Potential secondary structures in RNA
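A dot plot as defined above can be sketched in a few lines: one row per residue of the first sequence, one column per residue of the second, and a mark wherever the two residues are identical. This minimal version compares single residues; real dot-plot tools usually add a sliding window and a score threshold to suppress noise. The name dot_plot is hypothetical.

```python
def dot_plot(seq_a, seq_b, hit="*", miss="."):
    """Return the dot plot as a list of strings, one per residue of seq_a.

    Position [i][j] holds `hit` when seq_a[i] == seq_b[j], else `miss`.
    """
    return [
        "".join(hit if a == b else miss for b in seq_b)
        for a in seq_a
    ]


if __name__ == "__main__":
    a, b = "ACATG", "CATTG"
    print("  " + b)
    for residue, row in zip(a, dot_plot(a, b)):
        print(residue + " " + row)
```

Runs of marks along a diagonal indicate regions where the two sequences match consecutively.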
Aligning text
Raw data:
A C A T G
C A T T G
How many possible ways can we align?
2 matches, 0 gaps
A C A T G
      | |
C A T T G
3 matches (2 gaps in ends)
A C A T G .
  | | |
. C A T T G
4 matches, 1 insertion
A C A - T G
  | |   | |
. C A T T G
4 matches, 1 insertion
A C A T - G
  | | |   |
. C A T T G
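One way to put a number on "how many possible ways can we align" is to count every gapped global alignment of two sequences. Under the assumptions that a column never pairs a gap with a gap and that alignments differing only in the order of adjacent gaps count as distinct, each alignment ends in one of three column types (residue/residue, residue/gap, gap/residue), which gives the recurrence used below. The function name count_alignments is hypothetical.

```python
def count_alignments(len_a, len_b):
    """Count global alignments of sequences with lengths len_a and len_b.

    Recurrence: N(i, j) = N(i-1, j-1) + N(i-1, j) + N(i, j-1),
    with N(i, 0) = N(0, j) = 1 (only gaps remain on one side).
    """
    counts = [[1] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column pairs two residues
                + counts[i - 1][j]    # last column: residue over gap
                + counts[i][j - 1]    # last column: gap over residue
            )
    return counts[len_a][len_b]


if __name__ == "__main__":
    print(count_alignments(5, 5))  # two 5-letter sequences
```

The count grows very quickly with sequence length, which is why trying every alignment by hand stops being practical for longer sequences.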
Dynamic programming
What to do if the text is bigger?
SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGGREGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEP
KPNEPRGDILLPTVGHALAFIERLERPELYGVNPEVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRTEDFDGVWAS
Needleman-Wunsch (1970) provided the first automatic method: dynamic programming to find the global alignment
Aligning a 4 character A C B P
A C P M - A C P M A C - P M
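The dynamic programming idea behind global alignment can be sketched as follows. This is a minimal Needleman-Wunsch-style implementation under assumed scores (match +1, mismatch -1, linear gap penalty -1); these values and the function name needleman_wunsch are illustrative choices, not taken from the lecture, and protein alignments would normally use a substitution matrix instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (illustrative scores).

    Returns (score, aligned_a, aligned_b); '-' marks a gap.
    """
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align two residues
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    # Traceback from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACATG", "CATTG")
    print(score)
    print(top)
    print(bottom)
```

When several alignments share the best score, the traceback order (diagonal first, then gap in b, then gap in a) decides which one is reported.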
Global and Local alignments
Local alignment
The scoring system uses negative scores for mismatches
The minimum score at a matrix element is zero
Find the best score anywhere in the matrix (not just the last column or row)
These three changes cause the algorithm to seek high-scoring subsequences, which are not penalized for their global effects, which don’t include areas of poor match, and which can occur anywhere
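The three changes listed above (negative mismatch scores, a floor of zero in every cell, and taking the best score anywhere in the matrix) can be seen in this minimal Smith-Waterman-style sketch. The scores (match +1, mismatch -1, gap -1) and the function name smith_waterman are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming (illustrative scores).

    Returns (best_score, local_a, local_b) for one best-scoring pair
    of subsequences; '-' marks a gap.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # floor: start afresh
                          H[i - 1][j - 1] + s,  # align two residues
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:                  # best cell anywhere
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Because poorly matching flanks are never forced into the alignment, the shared core is reported on its own.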
Scoring matrices (BLOSUM, PAM), e.g.
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C   0  -3  -3  -3   8  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H  -2   0   1  -1  -3   0   0  -2   7  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   6  -1  -1  -4  -3  -2
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  10   2  -3
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   6  -1
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Where do matrices come from?
Manually align protein structures
Look at the frequency of amino acid substitutions at structurally constant sites
Compute log-odds: S(aa-1, aa-2) = log2( freq(O) / freq(E) ), where O = observed exchanges and E = expected exchanges
odds = freq(observed) / freq(expected)
Sij = log odds
freq(expected) = f(i)*f(j), the chance of getting amino acid i in a column and then having it change to j
e.g. A-R pair observed only a tenth as often as expected
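The log-odds formula above is easy to evaluate directly. The sketch below uses base-2 logarithms as in the formula and takes the expected frequency as f(i)*f(j); the function name log_odds_score and the example frequencies are hypothetical. Published matrices additionally scale and round these values to integers, which is not done here.

```python
import math


def log_odds_score(freq_observed, f_i, f_j):
    """Log-odds score S_ij = log2(freq(observed) / (f(i) * f(j))).

    freq_observed: observed frequency of the i-j exchange.
    f_i, f_j: background frequencies of amino acids i and j.
    """
    freq_expected = f_i * f_j
    if freq_observed <= 0 or freq_expected <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(freq_observed / freq_expected)


if __name__ == "__main__":
    # Hypothetical frequencies: the pair is seen a tenth as often as
    # expected, which yields a negative score.
    f_i, f_j = 0.08, 0.05
    print(round(log_odds_score(0.1 * f_i * f_j, f_i, f_j), 2))
```

Pairs observed more often than expected score above zero, pairs observed less often score below zero, and a pair seen exactly as often as expected scores zero.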
Local vs. Global Alignment
GLOBAL
best alignment of the entirety of both sequences
For optimum global alignment, we want the best score in the final row or final column
Are these sequences generally the same?
Needleman-Wunsch
find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity
LOCAL
best alignment of segments, without regard to the rest of the sequence
For optimum local alignment, we want the best score anywhere in the matrix
Do these two sequences contain high-scoring subsequences?
Smith-Waterman
find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score
Global vs Local alignments (figure panels: Global, Local)
BLAST
Starts with all overlapping words from the query
Calculates the “neighborhood” of each word using a PAM matrix and a probability threshold
Looks up all words and neighbors from the query in the database index
Extends hits into High Scoring Segment Pairs (HSPs), left and right to maximal length
Stops extension when the total score doesn’t increase
Finds Maximal Segment Pairs (MSPs) between query and database
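The word-lookup and extension steps listed above can be illustrated with a heavily simplified sketch. It is not the BLAST implementation: it uses exact word matches only (the scoring-matrix "neighborhood" step is omitted), a single subject sequence instead of a database index, ungapped extension, and an assumed drop-off parameter x_drop to decide when to stop. All function names and scores are hypothetical.

```python
from collections import defaultdict


def overlapping_words(seq, w):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def find_seeds(query, subject, w):
    """Exact word hits as (query_pos, subject_pos) pairs."""
    index = defaultdict(list)
    for s, word in enumerate(overlapping_words(subject, w)):
        index[word].append(s)
    return [(q, s)
            for q, word in enumerate(overlapping_words(query, w))
            for s in index.get(word, [])]


def _extend_one_way(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk in one direction; return (best gain, length reaching it)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        gain += match if query[i] == subject[j] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score has fallen too far below the best seen
        i += step
        j += step
    return best_gain, best_len


def extend_seed(query, subject, q, s, w, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of an exact seed to the left and right.

    Returns (score, q_start, q_end, s_start, s_end), ends exclusive.
    """
    right_gain, right_len = _extend_one_way(
        query, subject, q + w, s + w, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend_one_way(
        query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
    score = w * match + right_gain + left_gain
    return (score, q - left_len, q + w + right_len,
            s - left_len, s + w + right_len)
```

Several seeds on the same diagonal typically extend to the same segment pair, so a fuller version would merge duplicates before reporting the highest-scoring pairs.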
Types of BLAST
Practicals and assignments