NCBI FieldGuide NCBI Molecular Biology Resources January 2008 Peter Cooper Using NCBI BLAST.
NCBI resources II: web-based tools and ftp resourcesbcb.unl.edu/yyin/teach/PBB/ncbi-blast.pdf ·...
NCBI resources II: web-based tools and ftp resources
Yanbin Yin
Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/
Outline
• Tools
  – BLAST
  – Specialized BLAST
  – GEO
• ftp download
• Hands-on exercise
References
• http://homepages.ulb.ac.be/~dgonze/TEACHING/stat_scores.pdf
• http://www.bioinformatics.wsu.edu/bioinfo_course/notes/lecture6.pdf
• NCBI discovery workshops: ftp://ftp.ncbi.nih.gov/pub/education/discovery_workshops/NLM/2012/Sept2012/
Evolution of pairwise alignment tools
• 1970: Needleman-Wunsch algorithm
• 1981: Smith-Waterman algorithm
• 1985: FASTA
• 1990: BLAST
Faster but less accurate (FASTA, BLAST)
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/protein) of query and database
  – DNA vs. DNA
  – DNA translation vs. protein
  – Protein vs. protein
  – Protein vs. DNA translation
  – DNA translation vs. DNA translation
• WWW, standalone, and network client
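BLAST is a heuristic shortcut for the Smith-Waterman local alignment named in the bullets above. For reference, here is a minimal sketch of the exact dynamic-programming score. It assumes a simple match/mismatch scheme and a linear gap penalty; the values +2/-3/-5 are illustrative choices, not BLAST defaults, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-5):
    """Return the best local alignment score between strings a and b.

    Illustrative scoring only: linear gap penalty, no traceback.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0,
                          prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

This fills the full m × n matrix, which is what makes exact search slow on large databases and motivates the heuristic tools.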
(Figures from http://www.bioinformatics.wsu.edu/bioinfo_course/notes/lecture6.pdf and http://homepages.ulb.ac.be/~dgonze/TEACHING/stat_scores.pdf)
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments).
[Figure: number of alignments plotted against score]
Expect Value
E = number of database hits you expect to find by chance
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
• E = expected number of random hits
• m n = size of the search space (query length × size of database)
• S = your score
• K = scale for search space
• $\lambda$ = scale for scoring system
• $S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
Source: http://www.youtube.com/ncbinlm
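The two forms of the expect value above give the same number, which a few lines of Python can show. This is a minimal sketch: the function names are hypothetical, and the values of λ, K, m, n and S are illustrative assumptions, not figures from the slides.

```python
import math

def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expect_value(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)"""
    return K * m * n * math.exp(-lam * raw_score)

def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2 ** (-bits)

# Assumed, illustrative parameters (not taken from the slides).
lam, K = 0.318, 0.134
m, n = 250, 1_000_000_000   # query length, database size in residues
S = 100                     # raw alignment score

e_raw = expect_value(S, m, n, lam, K)
e_bits = expect_from_bits(bit_score(S, lam, K), m, n)
print(e_raw, e_bits)  # the two forms agree up to floating-point rounding
```

Because E scales linearly with m n, the same score is less significant against a larger database.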
Local Alignment Scoring: Protein
• K-K: +5
• K-E: +1
• Q-F: -3
• Gap: -(11 + 4(1)) = -15
Number of chance alignments = $4 \times 10^{-50}$
Scores from BLOSUM62, a position-independent matrix
Local Alignment Scoring: Nucleotide
• Match: +2
• Mismatch: -3
• Gap: -(5 + 4(2)) = -13
Number of chance alignments = $2 \times 10^{-73}$
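The gap lines on the two scoring slides follow the affine form -(opening cost + gap length × extension cost). A minimal sketch of that arithmetic, together with the pair scores listed on the slides, is given here. The function names are hypothetical, and the substitution table holds only the three BLOSUM62 pairs shown above, not the full matrix.

```python
# Only the three residue pairs shown on the protein slide (BLOSUM62 values).
BLOSUM62_SUBSET = {("K", "K"): 5, ("K", "E"): 1, ("Q", "F"): -3}

def affine_gap_score(gap_open, gap_extend, length):
    """Score of one gap of `length` positions: -(open + length * extend)."""
    return -(gap_open + length * gap_extend)

def protein_pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # raises KeyError for pairs outside the subset

def nucleotide_pair_score(x, y, match=2, mismatch=-3):
    """Position-independent match/mismatch score for two bases."""
    return match if x == y else mismatch

# Gaps of length 4 with the costs used on the slides.
protein_gap = affine_gap_score(gap_open=11, gap_extend=1, length=4)
nucleotide_gap = affine_gap_score(gap_open=5, gap_extend=2, length=4)
print(protein_gap, nucleotide_gap)
```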
BLAST and BLAST-like programs
• Traditional BLAST (formerly blastall): nucleotide, protein, translations
  – blastn: nucleotide query vs. nucleotide database
  – blastp: protein query vs. protein database
  – blastx: nucleotide query vs. protein database
  – tblastn: protein query vs. translated nucleotide database
  – tblastx: translated query vs. translated database
• Megablast: nucleotide only
  – Contiguous megablast: nearly identical sequences
  – Discontiguous megablast: cross-species comparison
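The program list above amounts to a lookup on query type and database type. The sketch below encodes that table; the helper name `choose_program` and its `translate_both` flag are hypothetical and are not part of any BLAST distribution.

```python
# (query type, database type) -> traditional BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}

def choose_program(query_type, db_type, translate_both=False):
    """Pick the traditional BLAST program for a query/database combination.

    translate_both=True selects tblastx, which compares the translations
    of a nucleotide query and a nucleotide database.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]

print(choose_program("protein", "nucleotide"))
```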
Position-specific BLAST Programs (protein only)
• Position-Specific Iterative BLAST (PSI-BLAST): automatically generates a position-specific score matrix (PSSM)
• Pattern-Hit Initiated BLAST (PHI-BLAST): focuses search around a pattern (motif)
• Domain Enhanced Lookup Time Accelerated (DELTA) BLAST: uses domain PSSM in first round of search
• Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs (Conserved Domain Database search)
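To make the idea of a position-specific score matrix concrete, here is a simplified sketch that builds log-odds column scores from a small ungapped alignment. It assumes a uniform background frequency and a flat pseudocount, and the function names and toy sequences are hypothetical. PSI-BLAST's own procedure is more elaborate, so treat this only as an illustration of what "position-specific" means.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a list of per-column {residue: log2-odds score} dictionaries.

    Assumes equal-length, ungapped sequences and a uniform background.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score_with_pssm(pssm, seq):
    """Sum the column-specific scores of seq (same length as the PSSM)."""
    return sum(column[aa] for column, aa in zip(pssm, seq))

toy_pssm = build_pssm(["HAAG", "HCAG", "HAAG"])  # hypothetical toy alignment
print(score_with_pssm(toy_pssm, "HAAG"), score_with_pssm(toy_pssm, "WWWW"))
```

Unlike BLOSUM62, the score for a residue here depends on the column it falls in, so a conserved position penalizes substitutions more heavily than a variable one.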
(Figure from http://www.ch.embnet.org/CourseAthens/slides/intro_hmm_profile.pdf)
Non-redundant protein
• nr (non-redundant protein sequences)
  – GenBank CDS translations
  – NP_, XP_ refseq_protein
  – Outside protein
    • PIR, Swiss-Prot, PRF
    • PDB (sequences from structures)
• pat: protein patents
• env_nr: metagenomes (environmental samples)
Services: blastp, blastx
Nucleotide Databases: Traditional
Services: blastn, tblastn, tblastx
• nr (nt)
  – Traditional GenBank
  – NM_ and XM_ RefSeqs
• refseq_rna
• NCBI Genomes
  – NC_ RefSeqs
  – GenBank chromosomes
• dbest
  – EST division
• non-human, non-mouse ESTs
• htgs
  – HTG division
• gss
  – GSS division
• wgs
  – whole genome shotgun contigs
• tsa
  – transcriptome shotgun assembly
• 16S microbial
  – Selected 16S sequences (targeted loci)
Databases are mostly non-overlapping
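The RefSeq prefixes named on the database slides (NM_, XM_, NC_, and NP_, XP_ on the protein side) encode the molecule type, so an accession can be roughly classified from its first three characters. A minimal sketch follows; the helper name is hypothetical, the table covers only the prefixes mentioned in these slides, and the accession in the usage line is just an example string.

```python
# Only the RefSeq prefixes mentioned in the slides.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "XM_": "predicted mRNA (model)",
    "NP_": "protein",
    "XP_": "predicted protein (model)",
    "NC_": "complete genomic molecule",
}

def classify_accession(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "prefix not covered here")

print(classify_accession("NM_000546"))
```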
Specialized BLAST Pages
Hands-on exercise 1
blastn and megablast
Search against human database
A lot of things you may explore
Change here to 1000
Upload a text file with the human tp53 mRNA fasta sequence (download from course webpage)
Question: how many ESTs match tp53 genes?
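Before uploading a file, it can help to confirm that it really is FASTA and to see how many records it holds. The sketch below is a minimal reader; the function name and the two example records are hypothetical placeholders, not the course file.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Hypothetical example input, standing in for the downloaded file.
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2 another hypothetical record
TTGACA
"""

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```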
It took ~1 minute to finish
A lot of things you may explore!!!
Search against other refseq genomes
Hands-on exercise 2
Protein blast (blastp and tblastn)
If you do not select organisms…
You can still specify organisms…
Upload a text file with two Arabidopsis protein fasta sequences (download from course webpage)
Type in populus to choose Populus trichocarpa
You may submit many sequences, but expect it to take time
Question: what are the homologs in the poplar tree?
It took ~1 minute (smaller database)
Click here to choose which query protein to view
How to determine a good e-value cutoff for selecting homologs?
http://www.youtube.com/watch?v=nO0wJgZRZJs&list=PL8FD4CC12DABD6B39&index=6
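Once a cutoff has been chosen, applying it to saved results is mechanical. The sketch below assumes hits were saved in the 12-column tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The default cutoff of 1e-5 is an arbitrary illustration, not a recommendation from the slides, and the two hit lines are made up.

```python
def filter_hits(tabular_text, max_evalue=1e-5):
    """Keep hits whose e-value is at or below max_evalue.

    Returns (query, subject, e-value, bit score) tuples.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])      # column 11: e-value
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue, float(fields[11])))
    return hits

# Made-up example lines in the assumed 12-column layout.
example = "\n".join([
    "queryA\tsubj1\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t180",
    "queryA\tsubj2\t30.0\t50\t35\t0\t10\t59\t100\t149\t2.5\t22.1",
])

kept = filter_hits(example)
print(kept)
```

Rerunning with a looser or stricter `max_evalue` shows how sensitive the homolog list is to the cutoff.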
Type in charoph to choose charophytes
Question: what are the EST homologs in charophytic algae?
Hands-on exercise 3
PHI-BLAST: query protein + short motif/pattern
& PSI-BLAST (iterated BLAST): multi-round BLASTP
Example: plant glycosyltransferase family 8 (GT8) has a signature motif
We want to search with the Arabidopsis GAUT1 protein (gi #: 86611465) and the HXXGXXKPW motif
ProSite-style pattern: H-x(2)-G-x(2)-K-P-W
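To see what the ProSite-style pattern above matches, it can be translated into a regular expression and tried on a sequence. This is a minimal sketch that handles only plain residues, x, repeat counts, and bracket/brace classes. The function names are hypothetical, the test sequence is made up (it is not GAUT1), and PHI-BLAST does its own pattern matching; the code only illustrates the syntax.

```python
import re

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Convert a simple ProSite-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            token = "."                          # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"   # any residue except these
        else:
            token = residue                      # single residue or [ABC] class
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)

def find_motif(sequence, pattern):
    """Return (start, matched text) for every non-overlapping motif match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

gt8_pattern = "H-x(2)-G-x(2)-K-P-W"
print(prosite_to_regex(gt8_pattern))
print(find_motif("MMHAAGCCKPWLL", gt8_pattern))  # made-up sequence
```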
Hands-on exercise 4
RPS-BLAST: given protein sequences, find conserved functional domains
Next class: NCBI GEO and ftp resources (with a brief intro to Linux skills) and practice