Pairwise sequence alignments & BLAST - Read the Docs ·  · 2016-08-24BLAST • Basic Local...

Pairwise sequence alignments & BLAST


The point of sequence alignment

•  If you have two or more sequences, you may want to know
–  How similar are they? (A quantitative measure)
–  Which residues correspond to each other?
–  Is there a pattern to the conservation/variability of the sequences?
–  What are the evolutionary relationships of these sequences?

BLAST

•  Basic Local Alignment Search Tool
•  Altschul et al., 1990
•  Has been cited over 61,000 times according to Google
•  The most highly cited scientific paper in the entire decade of the 1990s

BLAST

•  Compares a QUERY sequence to a DATABASE of sequences (also called SUBJECT sequences)
•  Nucleotide or protein sequences
•  Calculates statistical significance
•  Available as an online web server, for example, at NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi)

BLAST programs

Program   Query                              Database
blastp    protein                            protein
blastn    nucleotide                         nucleotide
blastx    nucleotide translated to protein   protein
tblastn   protein                            nucleotide translated to protein
tblastx   nucleotide translated to protein   nucleotide translated to protein
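The program table can be read as a lookup from the kind of query and the kind of database to a program name. The sketch below encodes only what the table states; the dictionary layout, the function name `choose_blast_program`, and the `translate_both` flag are illustrative assumptions and are not part of BLAST itself.

```python
# Lookup mirroring the program table: (query type, database type) -> program.
# The key layout and names here are illustrative, not a BLAST API.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query is translated to protein
    ("protein", "nucleotide"): "tblastn",   # database is translated to protein
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Return the BLAST program name for a query/database pairing."""
    if translate_both:
        # tblastx translates both the nucleotide query and the nucleotide database.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"no BLAST program for {query_type!r} vs {database_type!r}")


print(choose_blast_program("nucleotide", "protein"))  # blastx
```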

Why would we want to use translated nucleotides?

BLAST

•  Also available as a command line tool (guess which one we'll be using???)
•  Need to conquer some basic concepts
–  Alignment
–  Scoring an alignment
–  Substitution matrices

String A = a b c d e
String B = a c d e f

A (good) alignment would be:

String A = a b c d e -
           |   | | |
String B = a - c d e f

Alignment

Many alignments are possible, we want to find the best

g c t g a a c g
c t a t a a t c

Bad:

g c t g a a c g - - - - - - -
- - - - - - - c t a t a a t c

Many alignments are possible, we want to find the best

g c t g a a c g
c t a t a a t c

Better?

g c t g - a a - c g
  | |     | |   |
- c t a t a a t c

To decide which alignment is best we need
-  A way to examine all possible alignments
-  A way to compute a score that gives the quality of the alignment

Scoring sequence similarity

•  A simple scheme
+1 for a match
-1 for a mismatch

String A = a b c d e
           |   | | |
String B = a c c d e

+ 4
- 1

Total Score: 3
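The simple scheme can be written in a few lines. This is a minimal sketch for two already-aligned strings of equal length with no gaps; the function name `simple_score` is an illustrative choice.

```python
def simple_score(seq_a, seq_b, match=1, mismatch=-1):
    """Score two equal-length, ungapped sequences position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# String A = a b c d e, String B = a c c d e: four matches and one mismatch.
print(simple_score("abcde", "accde"))  # 3
```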

Scoring based on Biology

•  Nucleotides are not mutated randomly
•  Transition mutations are more common
–  Purine (A/G) to purine (A/G)
–  Pyrimidine (C/T) to pyrimidine (C/T)
•  Transversion mutations are less common
•  Can build a scoring scheme to reflect this:
–  Residue is the same = +1
–  Residue undergoes transition = 0
–  Residue undergoes transversion = -1
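As a rough sketch, the transition/transversion scheme above translates into a small scoring function. The function names are illustrative, and the example sequences are made up for demonstration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y):
    """+1 for identical bases, 0 for a transition, -1 for a transversion."""
    x, y = x.upper(), y.upper()
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected DNA bases A, C, G or T")
    if x == y:
        return 1
    # A transition stays within the same chemical class (purine or pyrimidine).
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 0
    return -1


def score_ungapped(seq_a, seq_b):
    """Sum per-position scores for two equal-length, ungapped DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b) for a, b in zip(seq_a, seq_b))


# Made-up example: four identities, two transitions (A/G, T/C), one transversion (A/T).
print(score_ungapped("GATTACA", "GGTCACT"))  # 3
```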

Scoring Based on Biology

•  Amino acids are not mutated at random either
•  Those of similar physicochemical types are more likely to replace each other
•  Instead of guessing what these rates might be, can measure empirically

Scoring Based on Biology

•  Margaret Dayhoff (1978)
–  Collected statistics on protein substitution frequencies
–  Built the first set of protein substitution matrices
–  Point accepted mutation (PAM) matrices

BLOSUM

•  BLOSUM (BLOck SUbstitution Matrix) - Henikoff and Henikoff
•  A new substitution matrix, preferred today
•  Much better for more divergent species (constructed using divergent species alignments)
•  BLOSUM62 is the matrix used by default in most recent alignment applications such as BLAST.

BLOSUM62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Scoring Gaps

•  What about gaps?
•  Usually, a gap opening is more of a penalty than a gap extension
•  Why? A single mutational event may insert more than one base.
•  Commonly used is the affine gap penalty:
Gap opening penalty of 11
Gap extension penalty of 1 for each additional residue
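A minimal sketch of the affine gap penalty, following the wording above: the first gapped residue costs the opening penalty and each additional residue costs the extension penalty. This reading is an assumption; some tools instead charge the extension penalty for every residue in the gap, including the first, so check the convention of the program you use. The function name is illustrative.

```python
def affine_gap_penalty(gap_length, gap_open=11, gap_extend=1):
    """Penalty for one contiguous gap of gap_length residues.

    Assumes the opening penalty covers the first residue and every
    additional residue adds gap_extend.
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0
    return gap_open + gap_extend * (gap_length - 1)


print(affine_gap_penalty(1))      # 11
print(affine_gap_penalty(3))      # 13: one gap of three residues
print(3 * affine_gap_penalty(1))  # 33: three separate one-residue gaps
```

One long gap is much cheaper than several short ones, which matches the idea that a single mutational event can insert or delete several residues at once.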

Scoring Wrap Up

•  Now we have a good way to score a particular alignment
1. Score substitutions appropriately reflecting biology
2. Score gaps appropriately reflecting biology
•  But how to generate all the possible alignments?
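The slides do not name an exact method, but one standard answer to this question is dynamic programming, which finds the best score over all alignments without listing them one by one. The sketch below is a Needleman-Wunsch style global alignment score; it assumes the simple +1/-1 scheme and a flat gap cost of -1 per gapped position (not the affine penalty), purely to keep the example short.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score via dynamic programming.

    Uses a flat per-position gap cost (an assumption for brevity).
    Runs in time proportional to len(seq_a) * len(seq_b).
    """
    # prev[j] = best score aligning the processed prefix of seq_a with seq_b[:j]
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]  # aligning seq_a[:i] against an empty prefix: all gaps
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            gap_in_b = prev[j] + gap      # a is aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b is aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# "abcde" vs "acdef": four matches and two gapped positions give 4 - 2 = 2.
print(global_alignment_score("abcde", "acdef"))  # 2
```

The cost grows with the product of the two sequence lengths, which is why searching a large database this way is slow.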

Approximate Methods

•  Need more speed!
•  Approximate methods have been developed that are
–  Great at detecting close relationships
–  Inferior to exact methods for picking up distant relationships
–  Approximate! (i.e., no guarantee that the optimal match is found)
•  Start with identical "words"
–  Called k-tuples or k-mers
–  Use these words to quickly find perfect matches
–  Then use the slower methods to grow the matches
•  BLAST works this way

Heuristic: any approach that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals
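The word-matching step can be sketched as follows: index every k-mer of the subject, then look up each k-mer of the query to find exact shared words ("seeds"). This is a simplified illustration of the idea only, not BLAST's actual implementation, which is more elaborate; the function names are illustrative, and the extension step that grows each seed is omitted.

```python
from collections import defaultdict


def index_kmers(sequence, k):
    """Map each k-mer in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos, word) for every exact shared k-mer."""
    subject_index = index_kmers(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos, word))
    return seeds


# The two example sequences from the alignment slides, with 2-letter words.
print(find_seeds("gctgaacg", "ctataatc", k=2))
# [(1, 0, 'ct'), (4, 4, 'aa')]
```

Looking up words in an index is fast, so the slower alignment methods only need to run around the few places where a seed was found.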

Significance of Alignments

•  Now we can find the best scoring alignment (or at least approximately if using BLAST)
•  But is it significant in the statistical sense?
–  What is the likelihood that you are observing true biological similarity (evolution) vs random chance?
•  E (expect) value = the number of hits one can "expect" to see by chance when searching a database of a particular size
•  Takes into account the size of the database but not the number of queries (beware of multiple testing!)
•  Lower = more biologically meaningful
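Because the E value is an expected count per search, running many queries multiplies the number of chance hits you should expect overall. The sketch below illustrates the multiple-testing warning with simple arithmetic; the function names and the example numbers are illustrative assumptions, not BLAST output.

```python
def expected_chance_hits(e_value_cutoff, num_queries):
    """Expected number of chance hits across all queries at a given E value cutoff."""
    return e_value_cutoff * num_queries


def per_query_cutoff(overall_expected_hits, num_queries):
    """E value cutoff per query that keeps the overall expected chance hits at a target."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return overall_expected_hits / num_queries


# With a cutoff of 1e-5 and 10,000 queries, roughly 0.1 chance hits are expected in total.
print(expected_chance_hits(1e-5, 10_000))

# To expect about 0.01 chance hits over 10,000 queries, each query needs a cutoff near 1e-6.
print(per_query_cutoff(0.01, 10_000))
```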

E values

E Value   How many random alignments just as good?
1         1 in 1
0.2       1 in 5
1e-5      1 in 100,000
1e-9      1 in 1,000,000,000
0         0%

BLAST

•  High-scoring segment pairs (HSP)
•  A query and a match sequence can have more than one HSP

Review

•  Compares a sequence query to a set of sequences
•  Uses scoring and statistics to find a good alignment
•  Heuristic: approximates the best alignment

Want to learn more about how BLAST works?

•  Wheeler and Bhagwat. BLAST QuickStart http://www.ncbi.nlm.nih.gov/books/NBK1734/
•  Wikipedia https://en.wikipedia.org/wiki/BLAST