Sequence Alignment Part 3 Introduction to bioinformatics.

Date posted: 21-Dec-2015

Page 1: Sequence Alignment Part 3 Introduction to bioinformatics.

Sequence Alignment Part 3

Introduction to bioinformatics

Page 2: Sequence Alignment Part 3 Introduction to bioinformatics.

WHAT'S TODAY?

• MORE BLAST…

- Similarity scores for protein sequences

- Gaps

- Statistical significance (e-value)

Page 3: Sequence Alignment Part 3 Introduction to bioinformatics.

Protein Sequence Alignment

Rule of thumb:

• Proteins are homologous if 25% identical (length > 100)

• DNA sequences are homologous if 70% identical

Page 4: Sequence Alignment Part 3 Introduction to bioinformatics.

Protein Pairwise Sequence Alignment

• The alignment tools are similar to the DNA alignment tools

– BLASTN for nucleotides

– BLASTP for proteins

• Main difference: instead of scoring match (+2) and mismatch (-1) we have similarity scores:

– Score s(i,j) > 0 if amino acids i and j have similar properties

– Score s(i,j) ≤ 0 otherwise

• How should we score s(i,j)?

Page 5: Sequence Alignment Part 3 Introduction to bioinformatics.

The 20 Amino Acids

Page 6: Sequence Alignment Part 3 Introduction to bioinformatics.

Chemical Similarities Between Amino Acids

Acids & Amides DENQ (Asp, Glu, Asn, Gln)

Basic HKR (His, Lys, Arg)

Aromatic FYW (Phe, Tyr, Trp)

Hydrophilic ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)

Hydrophobic ILMV (Ile, Leu, Met, Val)
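The grouping above can be turned into a toy similarity score. The following is a minimal Python sketch, not a real substitution matrix: the function name `toy_score` and the score values (2 for identity, 1 for same group, 0 otherwise) are assumptions chosen only for illustration.

```python
# Chemical groups as listed on the slide, keyed by a descriptive name.
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each one-letter amino acid code to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def toy_score(a: str, b: str) -> int:
    """Illustrative score: 2 if identical, 1 if in the same group, 0 otherwise."""
    if a == b:
        return 2
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return 0


print(toy_score("D", "E"))  # same group (acids & amides)
print(toy_score("D", "K"))  # different groups
```

Real matrices such as PAM and BLOSUM replace these hand-picked values with scores estimated from observed substitutions.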

Page 7: Sequence Alignment Part 3 Introduction to bioinformatics.

Sequence Alignment based on AA similarity

TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
||    +    |||| +|| |||  |   +|         |     |    |       |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| |    | +| | | +|+ ||  || |+                + |  |  ||  | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
       ++|||    |  + ++ |    |  |               + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity + = similarity

Page 8: Sequence Alignment Part 3 Introduction to bioinformatics.

Amino Acid Substitutions Matrices

• When scoring protein sequence alignments it is common to use a 20 × 20 matrix representing all pairwise comparisons:

Substitution Matrix

Page 9: Sequence Alignment Part 3 Introduction to bioinformatics.

Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E and D are found in 7 of the 8 sequences.
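The 7/8 figure can be checked by counting residues in that column. A small Python sketch, assuming the eight sequences are stored as plain strings (the helper name `column_counts` is hypothetical):

```python
from collections import Counter

# The eight aligned sequences from the slide, one string per row.
alignment = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(seqs, col):
    """Count how often each residue appears in one alignment column (0-based)."""
    return Counter(seq[col] for seq in seqs)


counts = column_counts(alignment, 3)  # fourth column
fraction_d_or_e = (counts["D"] + counts["E"]) / len(alignment)
print(counts)
print(fraction_d_or_e)  # 0.875, i.e. 7/8
```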

Page 10: Sequence Alignment Part 3 Introduction to bioinformatics.

Amino Acid Matrices

Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i)

Entry (i,j): the score of aligning amino acid i against amino acid j.

Entry (i,i) is greater than any entry (i,j), j ≠ i.

Page 11: Sequence Alignment Part 3 Introduction to bioinformatics.

PAM - Point Accepted Mutations

• Developed by Margaret Dayhoff, 1978.

• Analyzed very similar protein sequences

– Proteins are evolutionarily close.

– Alignment is easy.

– Point mutations - mainly substitutions

– Accepted mutations - by natural selection.

• Used global alignment.

• Counted the number of substitutions (i,j) per amino acid pair:

Many i<->j substitutions => high score s(i,j)

• Found that common substitutions involved chemically similar amino acids.

Page 12: Sequence Alignment Part 3 Introduction to bioinformatics.

PAM 250

• Similar amino acids are close to each other.

• Regions define conserved substitutions.

Page 13: Sequence Alignment Part 3 Introduction to bioinformatics.

Example: Asp & Glu

[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]

Score = 3

Page 14: Sequence Alignment Part 3 Introduction to bioinformatics.

Selecting a PAM Matrix

• Low PAM numbers: short sequences, strong local similarities.

• High PAM numbers: long sequences, weak similarities.

– PAM120 recommended for general use (40% identity)

– PAM60 for close relations (60% identity)

– PAM250 for distant relations (20% identity)

• If uncertain, try several different matrices

– PAM40, PAM120, PAM250 recommended
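The identity levels quoted above can be read as a simple lookup. A hedged Python sketch: the function `suggest_pam` is hypothetical, and picking the matrix whose quoted identity is nearest to the estimate is an assumption made here for illustration, not a rule stated on the slide.

```python
# Identity levels quoted on the slide for each PAM matrix.
PAM_BY_IDENTITY = {60: "PAM60", 40: "PAM120", 20: "PAM250"}


def suggest_pam(percent_identity: float) -> str:
    """Return the PAM matrix whose quoted identity level is nearest (assumed rule)."""
    nearest = min(PAM_BY_IDENTITY, key=lambda level: abs(level - percent_identity))
    return PAM_BY_IDENTITY[nearest]


print(suggest_pam(55))
print(suggest_pam(25))
```

When the expected identity is unknown, the slide's advice still applies: run the search with several matrices and compare.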

Page 15: Sequence Alignment Part 3 Introduction to bioinformatics.

BLOSUM

• Blocks Substitution Matrix

– Steven Henikoff and Jorja G. Henikoff (1992)

• Based on BLOCKS database (www.blocks.fhcrc.org)

– Families of proteins with identical function

– Highly conserved protein domains

• Ungapped local alignment to identify motifs

– Each motif is a block of local alignment

– Counts amino acids observed in same column

– Symmetrical model of substitution

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
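Counting residues that share a column leads naturally to a log-odds score: pairs seen more often than chance get positive scores. The Python sketch below is a simplified illustration of that idea only. It uses a tiny made-up block, skips the sequence clustering, scaling and rounding used for the real BLOSUM matrices, and the helper names `pair_counts` and `log_odds` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Tiny illustrative block: ungapped, one string per sequence.
block = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def pair_counts(seqs):
    """Count unordered residue pairs observed within each column."""
    counts = Counter()
    for column in zip(*seqs):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(counts):
    """Score each observed pair as log2(observed / expected frequency)."""
    total = sum(counts.values())
    # Residue frequencies implied by the pair counts.
    freq = Counter()
    for (a, b), n in counts.items():
        freq[a] += n
        freq[b] += n
    scores = {}
    for (a, b), n in counts.items():
        pa, pb = freq[a] / (2 * total), freq[b] / (2 * total)
        expected = pa * pb if a == b else 2 * pa * pb
        scores[(a, b)] = math.log2((n / total) / expected)
    return scores


counts = pair_counts(block)
scores = log_odds(counts)
print(counts[("D", "E")], round(scores[("D", "E")], 2))
```

Because pairs are stored as sorted tuples, the score for (D, E) is also the score for (E, D), which matches the symmetrical model of substitution.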

Page 16: Sequence Alignment Part 3 Introduction to bioinformatics.

BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• BLOSUMn is based on sequences that are at most n percent identical.

Page 17: Sequence Alignment Part 3 Introduction to bioinformatics.

Selecting a BLOSUM Matrix

• For BLOSUMn, higher n is suitable for sequences which are more similar

– BLOSUM62 recommended for general use

– BLOSUM80 for close relations

– BLOSUM45 for distant relations

Page 18: Sequence Alignment Part 3 Introduction to bioinformatics.

Summary:

• BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, aligned without gaps

• PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions

Page 19: Sequence Alignment Part 3 Introduction to bioinformatics.

Gap Scores

• Example showed -1 score per indel

– So gap cost is proportional to its length

• Biologically, indels occur in groups

– We want our gap score to reflect this

• Standard solution: affine gap model

– One-off cost for opening a gap

– Lower cost for extending the gap

– Changes required to algorithm
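The difference between the two gap models is easy to see in numbers. A minimal Python sketch, assuming the convention that a gap of length k costs open + k × extend (some tools instead charge open + (k - 1) × extend); the penalty values 5 and 1 are illustrative only.

```python
def linear_gap_cost(length: int, per_residue: int = 1) -> int:
    """Linear model: every gapped position costs the same."""
    return per_residue * length


def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine model (assumed convention): gap_open + gap_extend * length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of length 4 versus four separate gaps of length 1.
print(linear_gap_cost(4), 4 * linear_gap_cost(1))
print(affine_gap_cost(4, 5, 1), 4 * affine_gap_cost(1, 5, 1))
```

Under the linear model both layouts cost the same, while the affine model makes one long gap much cheaper than several scattered ones, which is the grouping behaviour the slide asks for.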

Page 20: Sequence Alignment Part 3 Introduction to bioinformatics.

Scoring system =

Substitution Matrix + Gap Penalty
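To make the combination concrete, the Python sketch below scores an already-aligned pair of sequences by summing substitution scores and subtracting affine gap penalties. It is an illustration under stated assumptions: the function names are hypothetical, the substitution function uses the simple match (+2) / mismatch (-1) scheme mentioned earlier rather than a real matrix, and a gap of length k is assumed to cost open + k × extend.

```python
def score_alignment(a, b, sub_score, gap_open, gap_extend):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    previous_gap = None  # which sequence the previous column gapped, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-":
            current_gap = "a"
        elif y == "-":
            current_gap = "b"
        else:
            current_gap = None
            total += sub_score(x, y)
        if current_gap is not None:
            if current_gap != previous_gap:
                total -= gap_open  # a new gap starts here
            total -= gap_extend
        previous_gap = current_gap
    return total


def match_mismatch(x, y):
    """Simple nucleotide-style scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


print(score_alignment("ACGT--AC", "ACGTTTAC", match_mismatch, gap_open=5, gap_extend=1))  # 5
```

Swapping `match_mismatch` for a lookup into a PAM or BLOSUM table gives the protein version of the same scoring system.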

Page 21: Sequence Alignment Part 3 Introduction to bioinformatics.

Gap penalty

• We expect to penalize gaps

• Scoring for gap opening & for extension

– Insertions and deletions are rare in evolution

– But once they are created, they are easy to extend

– Gap-extension penalty < gap-open penalty

• Default gap parameters are given for each matrix:

– PAM30: open=9, extension=1

– PAM250: open=14, extension=2

Page 22: Sequence Alignment Part 3 Introduction to bioinformatics.

Low Complexity Sequences

• AAAAAAAAAAA

• ATATATATATATA

• CAGCAGCAGCAG

How does BLAST deal with low complexity sequences?

Low complexity sequences can produce significant hits which are not true homologues!

By default, low complexity sequences are filtered out and replaced by XXXXX
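The idea of masking can be sketched with a simple entropy measure. The Python example below is a stand-in for illustration, not the filter BLAST itself uses (BLAST relies on dedicated programs such as DUST for nucleotides and SEG for proteins). The window size, the k-mer length and the threshold are assumptions chosen so that the three example sequences above are masked.

```python
import math
from collections import Counter


def kmer_entropy(window: str, k: int = 3) -> float:
    """Shannon entropy (bits) of the overlapping k-mers in a window."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    counts = Counter(kmers)
    n = len(kmers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.0) -> str:
    """Replace every residue covered by a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if kmer_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


print(mask_low_complexity("CAGCAGCAGCAG"))
print(mask_low_complexity("ACGTTGCATCGA"))
```

A repeat such as CAGCAG contains only three distinct 3-mers, so its entropy stays low, whereas a window with ten distinct 3-mers reaches about 3.3 bits and is left untouched.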

Page 23: Sequence Alignment Part 3 Introduction to bioinformatics.

Statistical significance

Page 24: Sequence Alignment Part 3 Introduction to bioinformatics.

E-value

• The number of hits (with at least the same similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size.

• Higher E-value => lower similarity

– “sequences with E-value of less than 0.01 are almost always found to be homologous”

• The lower bound is normally 0 (we want to find the best)

Page 25: Sequence Alignment Part 3 Introduction to bioinformatics.

Expectation Values

• Increases linearly with length of query sequence

• Increases linearly with length of database

• Decreases exponentially with score of alignment
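These three dependencies are captured by the standard Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length and S the raw alignment score. A minimal Python sketch follows; the function name `expected_hits` is hypothetical, and the default values of K and λ are placeholders for illustration, since the real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expected_hits(raw_score, query_len, db_len, K=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S); K and lam are illustrative placeholders."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


base = expected_hits(raw_score=50, query_len=100, db_len=1_000_000)
print(base)
print(expected_hits(50, 200, 1_000_000) / base)  # doubling the query doubles E
print(expected_hits(60, 100, 1_000_000) / base)  # a higher score shrinks E exponentially
```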

Page 26: Sequence Alignment Part 3 Introduction to bioinformatics.

• Bit score (S)

– Similar to alignment score

– Normalized

– Higher means more significant

• E value: number of hits of score ≥ S expected by chance

– Based on random database of similar size

– Lower means more significant

– Used to assess the statistical significance of the alignment
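The normalization behind the bit score can be written out explicitly. In the usual Karlin-Altschul formulation the bit score is S' = (λS - ln K) / ln 2, and the E-value then simplifies to E = m · n · 2^(-S'). The Python sketch below illustrates this; the function names are hypothetical and the default K and λ are placeholder values, not parameters taken from the slides.

```python
import math


def bit_score(raw_score, K=0.041, lam=0.267):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits, query_len, db_len):
    """E = m * n * 2^(-S'), independent of the scoring system used."""
    return query_len * db_len * 2.0 ** (-bits)


bits = bit_score(50)
print(round(bits, 1))
print(evalue_from_bits(bits, query_len=100, db_len=1_000_000))
```

Because K and λ are absorbed into the bit score, bit scores from searches run with different matrices can be compared directly.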

Page 27: Sequence Alignment Part 3 Introduction to bioinformatics.

Remote homologues

• Sometimes BLAST isn’t enough.

• Large protein family, and BLAST only gives close members. We want more distant members

PSI-BLAST

Page 28: Sequence Alignment Part 3 Introduction to bioinformatics.

PSI-BLAST

• Position Specific Iterated BLAST

Regular BLAST

Construct profile from BLAST results

BLAST profile search

Final results
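The loop in this flowchart can be sketched in a few lines of Python. This is a schematic outline only: `search` and `build_profile` are hypothetical callables supplied by the caller (they stand in for the BLAST search and the profile construction), and the inclusion cutoff and iteration limit are assumed values.

```python
def iterative_search(query, search, build_profile,
                     evalue_cutoff=0.005, max_iterations=5):
    """Repeat search -> profile -> search until no new hits are included.

    search(model) must return a dict mapping hit id -> E-value.
    build_profile(query, hit_ids) must return the model for the next round.
    """
    model = query          # the first round is a regular search with the query
    included = set()
    for _ in range(max_iterations):
        hits = search(model)
        significant = {hit for hit, e in hits.items() if e <= evalue_cutoff}
        if significant <= included:
            break          # converged: nothing new to learn from
        included |= significant
        model = build_profile(query, included)
    return included
```

Each round only admits hits below the cutoff into the profile, which is the main guard against the drift described on the next slide.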

Page 29: Sequence Alignment Part 3 Introduction to bioinformatics.

PSI-BLAST

• Advantage: PSI-BLAST looks for sequences that are close to ours, and learns from them to extend the circle of friends

• Disadvantage: if a WRONG sequence is included, the search drifts to unrelated sequences, and this gets worse with each iteration