EBI is an Outstation of the European Molecular Biology Laboratory.
Sequence Searching Strategies
Jennifer McDowall, EMBL-EBI
A guide to efficient database searching
Sequence Searching Tools
Overview
• Know the data
• The Toolbox
• Search Guidelines
Know the data
• Many databases, each getting bigger
• Efficient searching requires knowledge of what data is stored in a database
  - Don’t assume annotation can be transferred because of a good match
• Databases can contain errors
• Data can change
  - Deletions, sequence modifications
  - Daily updates, identifier changes…
Know the Data… Nucleotides
EMBL-Bank
• Divided into classes and divisions...
• Release and updates
• Supplementary sets: EMBL-CDS, EMBL-MGA
Specialist databases
• Immunoglobulins: IMGT/HLA, IMGT/LIGM…
• Alternative splicing: ASTD…
• Completed genomes: Ensembl, Integr8…
• Variation: HGVBase, dbSNP…
Know the Data… Proteins
UniProt
• Divided into 3 sections
• Release and updates
Specialist databases
• Sequence from structure: PDB, SGT…
• Immunoglobulins: IMGT/HLA…
• Alternative splicing: ASTD…
• Completed proteomes: Ensembl, Integr8…
• Protein interactions: IntAct
• Patent proteins: EPO, USPTO, JPO, KIPO
Homology vs. Similarity
Homology
• Homologous sequences share a common origin
• Presence of similar features because of common descent
• Statistically significant similar sequences are considered ‘homologous’
• Homology is like pregnancy: either one is or one isn’t! (Gribskov, 1999)
Similarity
• Similarity is a measure of the “likeness” of 2 sequences
• Uses statistics to determine ‘significance’ of similarity
  - If significant, considered to be homologous
  - If not significant, uncertain
• Similarity does not necessarily reflect homology
The Toolbox
Sequence Similarity Search Tools
• BLAST: NCBI-BLAST, Wu-BLAST
• FASTA: FASTA, SSEARCH, GGSEARCH, GLSEARCH
• Iterative searches: PSI-BLAST, PSI-SEARCH
Tools: NCBI BLAST
• BLASTP: protein → Protein DB
• BLASTX: DNA → translate → Protein DB
• BLASTN: DNA → DNA DB
[Figure: NCBI BLAST nucleotide search and protein search]
Tools: Wu-BLAST
• BLASTP: protein → Protein DB
• BLASTN: DNA → DNA DB
• BLASTX: DNA → translate → Protein DB
• TBLASTN: protein → Translated DNA DB
• TBLASTX: DNA → translate → Translated DNA DB
[Figure: Wu-BLAST nucleotide search and protein search]
Tools: FASTA
• FASTA: protein → Protein DB, or DNA → DNA DB
• FASTX/Y: DNA → translate → Protein DB
• GLSEARCH: protein → Protein DB
• GGSEARCH: protein → Protein DB
• SSEARCH: protein → Protein DB, or DNA → DNA DB
[Figure: FASTA nucleotide search and protein search]
When to use which search?
[Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH placed by query length and database size]
[Figure: speed of search for FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH across PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]
BLAST vs. FASTA
BLAST
• Fast
• Excels with proteins
• Good local alignments + short global alignments
• Proteins: BLOSUM62 (-11/-1), alignments good at >85% homology
• Good at finding siblings
FASTA
• Slower
• Excels with proteins and DNA (better than BLASTN for DNA)
• Produces S-W alignments
• Proteins: BLOSUM50 (-10/-2), longer alignments good at >70% homology
• Good at finding cousins
GLSEARCH and GGSEARCH
GLSEARCH
• Global (query) - Local (target DB) alignment
• For global query alignments to domains/patterns in target proteins
GGSEARCH
• Global (query) - Global (target DB) alignment
• Specific for searching short sequences against short targets or for gene-to-gene comparisons
What are global and local alignments?
[Figure: query/subject alignment diagrams for the three alignment modes]
• Local - Local: BLAST, FASTA
• Global - Local: GLSEARCH
• Global - Global: GGSEARCH
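The difference between a local and a global alignment can be illustrated with a minimal dynamic-programming sketch in Python. This is not how BLAST, FASTA or GGSEARCH are implemented: the function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, and the real tools use substitution matrices, affine gap penalties and heuristics. The global-local mode is left out to keep the sketch short.

```python
def alignment_score(query, subject, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=True  -> Smith-Waterman style (best matching region, never below 0)
    local=False -> Needleman-Wunsch style (both sequences aligned end to end)
    """
    cols = len(subject) + 1
    # Global: leading gaps are penalised. Local: an alignment may start anywhere.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            same = query[i - 1] == subject[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # restart instead of going negative
                best = max(best, cell)   # best region anywhere in the matrix
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("ACGT", "TTACGTTT", local=True))   # 4
print(alignment_score("ACGT", "TTACGTTT", local=False))  # -4
```

The short query matches perfectly inside the longer subject, so the local score is high, while the global score is dragged down by the gaps needed to span the whole subject.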
Tools: PSI (Position Specific Iterated) Search
[Figure: iterative cycle] Single protein sequence → Search database → Generate alignment → Construct profile → Estimate significance → iterate
Tools: PSI Search
• PSI-BLAST
  - Part of NCBI-BLAST package
  - Automatic iteration service
  - (PSSM = position specific scoring matrix)
  - Manually guided service
• PSI-SEARCH
  - Combines: PSI-BLAST (iterative strategy) + SSEARCH (S&W algorithm)
  - Manually guided service
Let’s look at a FASTA search
FASTA search
Step 1: Select a database
Which database to choose?
• ENA-Annotation >124 million
• UniParc (non-redundant) >24 million
• Databases grow every day
Database size is important
How database size affects results
BLAST search with the sequence gatctccatggg
[Figure: results against four databases of different sizes (>122M, >15M, >700,000, >1.5M): 0 hits, 60 hits, 489 hits, 3 hits; e-values of 100% matches: 789.0, 621.0, (>1000), 0.96]
How database size affects results
• Search smallest database likely to contain your sequence
• Run multiple small searches (can run all ENA/UniParc as well)
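Why database size matters can be sketched with the Karlin-Altschul form of the E-value used in BLAST-style statistics, E = K * m * n * exp(-lambda * S). The values of K and lambda below are placeholders assumed for illustration (real values depend on the scoring system), and the function name is hypothetical; the point is only that the same alignment score becomes less significant as the database grows.

```python
import math


def expected_chance_hits(score, query_length, database_length, k=0.1, lam=0.3):
    """Karlin-Altschul style estimate: E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholders, not parameters of any real matrix.
    """
    return k * query_length * database_length * math.exp(-lam * score)


# Same 12-residue query and same score, two database sizes (total residues).
small = expected_chance_hits(score=40, query_length=12, database_length=1_000_000)
large = expected_chance_hits(score=40, query_length=12, database_length=100_000_000)
print(f"small db: {small:.3g}  large db: {large:.3g}  ratio: {large / small:.0f}")
```

A database 100 times larger gives an E-value 100 times larger for an identical match, which is why a short query can lose all of its hits in the biggest databases.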
Protein or nucleotide database search?
Two issues are worth considering…
Protein or nucleotide database search?
Codon degeneracy
• Nucleotides: UCU vs. AGC → mismatch
• Amino acids: Ser vs. Ser → match
Protein or nucleotide database search?
Over-simple match/mismatch scoring
Amino acids
• Ser vs. Ser: identical (highly conserved)
• Ser vs. Asn: similar (weakly conserved)
• Ser vs. Leu: mismatch (not conserved)
Nucleotides (no distinction)
• UCU vs. AGC: mismatch
• UCU vs. AAC: mismatch
• UCU vs. CUC: mismatch
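The two issues above can be checked with a few lines of Python. This is a small sketch using only the four codons from the slides; the serine scores are the standard BLOSUM62 values (Ser/Ser 4, Ser/Asn 1, Ser/Leu -2), and the function names are made up for the example.

```python
# Only the codons that appear on the slides (standard genetic code).
CODON_TO_AMINO_ACID = {"UCU": "Ser", "AGC": "Ser", "AAC": "Asn", "CUC": "Leu"}
# BLOSUM62 scores for serine against each of these residues.
BLOSUM62_SER = {"Ser": 4, "Asn": 1, "Leu": -2}


def identical_bases(codon_a, codon_b):
    """Number of positions (0-3) where two codons carry the same base."""
    return sum(a == b for a, b in zip(codon_a, codon_b))


def compare_to_ucu(codon):
    """Return (identical bases vs UCU, encoded amino acid, BLOSUM62 score vs Ser)."""
    amino_acid = CODON_TO_AMINO_ACID[codon]
    return identical_bases("UCU", codon), amino_acid, BLOSUM62_SER[amino_acid]


for codon in ("AGC", "AAC", "CUC"):
    print(codon, compare_to_ucu(codon))
# AGC (0, 'Ser', 4)
# AAC (0, 'Asn', 1)
# CUC (0, 'Leu', -2)
```

At the nucleotide level all three comparisons look equally bad (no identical bases), while the protein scores separate an identical residue, a similar one and a true mismatch.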
Protein or nucleotide database search?
[Figure: Human CKS1B kinase vs. Zebra finch CDC28 kinase 1B, compared at the protein level and at the nucleotide level]
Protein or nucleotide search?
[Figure: timeline of life on Earth in billions of years ago, from the formation of Earth to today, showing how far back homologs can be identified when searching with proteins versus DNA]
• Protein comparisons identify homologues 5-10x further back in evolution
Protein or nucleotide database search?
…therefore, searching a protein database could pull out many more homologues than searching a nucleotide database
…if you start with a nucleotide sequence, try BLASTX or FASTX to translate your query sequence and search a protein database
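Conceptually, a translated search first turns the nucleotide query into protein in all six reading frames and then compares those translations with the protein database. The sketch below shows only that translation step, using the standard genetic code and the DNA example sequence from the search guidelines; the function names are illustrative, and it is not the code used inside BLASTX or FASTX.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


frames = six_frame_translations("ATGGCTAGCTTCGACTAGGCGATGCGA")
print(frames["+1"])  # MASFD*AMR
```

Each of the six translations would then be searched against the protein database, with '*' marking stop codons.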
FASTA search
Step 1: Select a database
Step 2: Paste sequence
Step 3: Choose parameters
Choosing parameters
• User manual provides help
Which parameters to choose?
Matrix
• Nucleotide search: ‘simpler’, only match/mismatch
• Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)
Which parameters to choose?
Choice of matrix depends on:
1. strictness of search
2. length of query sequence

QUERY LENGTH  MATRIX    open  ext
>300          BLOSUM50  -10   -2
85-300        BLOSUM62  -7    -1
50-85         BLOSUM80  -16   -4
>300          PAM250    -10   -2
85-300        PAM120    -16   -4
35-85         MDM40     -12   -2
<=35          MDM20     -22   -4
<=10          MDM10     -23   -4
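The table can be read as a simple lookup from query length to a matrix and its gap penalties. The Python sketch below encodes it that way. The function and variable names are hypothetical, and where the ranges in the table overlap (85, 35 and 10 each appear in two rows) the boundaries chosen here are an assumption.

```python
# (minimum query length, matrix, gap open, gap extension), longest queries first.
# Boundary handling is an assumption: 85 is assigned to the 85-300 rows, and
# 35 and 10 to the "<=" rows.
SCORING_TABLE = {
    "BLOSUM": [
        (301, "BLOSUM50", -10, -2),
        (85, "BLOSUM62", -7, -1),
        (50, "BLOSUM80", -16, -4),
    ],
    "PAM": [
        (301, "PAM250", -10, -2),
        (85, "PAM120", -16, -4),
        (36, "MDM40", -12, -2),
        (11, "MDM20", -22, -4),
        (1, "MDM10", -23, -4),
    ],
}


def suggest_scoring(query_length, family="BLOSUM"):
    """Return (matrix, gap_open, gap_ext) for a query length, or None if no row applies."""
    for min_length, matrix, gap_open, gap_ext in SCORING_TABLE[family]:
        if query_length >= min_length:
            return matrix, gap_open, gap_ext
    return None


print(suggest_scoring(400))        # ('BLOSUM50', -10, -2)
print(suggest_scoring(120))        # ('BLOSUM62', -7, -1)
print(suggest_scoring(8, "PAM"))   # ('MDM10', -23, -4)
```

Queries shorter than 50 residues have no BLOSUM row in the table, so the sketch returns None for them and the PAM/MDM series should be consulted instead.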
Matrices - controlling search sensitivity
PAM (point accepted mutation)
• Based on global alignments of related proteins
• 1 substitution in 100 residues = PAM 1
• Other matrices extrapolated from PAM 1
• Model of evolutionary divergence
• Bias against rare substitutions (e.g. Cys → Tyr) due to seed proteins
Matrices - controlling search sensitivity
BLOSUM (BLOcks SUbstitution Matrix)
• Based on protein domain alignments from the BLOCKS database
• Observed substitutions in conserved domains
• Based on percentage identity, so BLOSUM50 is deeper than BLOSUM80
[Figure: effect of applying PAM10 to PAM500 matrices to the human LDL receptor sequence; panels for PAM 10, 100, 200, 300, 400 and 500]
Which parameters to choose?
• Protein searches use a matrix
• Nucleotide searches (FASTA, BLAST) instead have match/mismatch scores
Match/mismatch scores
• “Reward” for match, “penalty” for mismatch
• Reward/penalty ratio: increase ratio to find more divergent sequences
  - Ratio of 0.33 (1/-3) for 99% conserved
  - Ratio of 0.5 (1/-2) for 95% conserved
  - Ratio of 1 (1/-1) for 75% conserved
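A rough way to see why the ratio matters is to compute the average score per aligned position at a given level of identity. The sketch below is plain arithmetic under the assumption of an ungapped alignment. It is not the statistical model the search tools use, and the function name is made up for the example.

```python
def expected_score_per_position(identity, reward, penalty):
    """Average score per ungapped column when a fraction `identity` of columns match."""
    return identity * reward + (1 - identity) * penalty


for reward, penalty in ((1, -3), (1, -2), (1, -1)):
    scores = [
        round(expected_score_per_position(identity, reward, penalty), 2)
        for identity in (0.99, 0.95, 0.75)
    ]
    print(f"{reward}/{penalty}: 99%, 95%, 75% identity ->", scores)
```

With 1/-3 an alignment at 75% identity averages zero per position, so it accumulates no score however long it is, whereas with 1/-1 the same alignment still scores positively.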
Which parameters to choose?
Gap penalties
• Nucleotide search: gap open = -2 to -16, gap extension = 0 to -4
• Protein search: gap open = 0 to -23, gap extension = 0 to -8
Which parameters to choose?
Choice of gap penalties depends on:
1. strictness of search
2. to match scoring matrix
• larger penalty → fewer gaps
Which parameters to choose?
KTUP (word length)
• KTUP = ‘word-length’ of search
• Large word-length → less sensitive, faster
• Nucleotide search: fewer bases than amino acids → higher KTUP
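The effect of word length can be seen by counting the words two sequences share, since word-based searches only extend alignments from shared words. This Python sketch uses two made-up 8-base sequences that differ in the middle; the function names are illustrative, and the real programs add lookup tables and scoring on top of this idea.

```python
def words(sequence, ktup):
    """All distinct overlapping words of length ktup in a sequence."""
    return {sequence[i:i + ktup] for i in range(len(sequence) - ktup + 1)}


def shared_words(query, subject, ktup):
    """Words of length ktup present in both sequences (potential seeds)."""
    return words(query, ktup) & words(subject, ktup)


query, subject = "ACGTTGCA", "ACGATGCA"
for ktup in (2, 4, 6):
    print(ktup, sorted(shared_words(query, subject, ktup)))
# 2 ['AC', 'CA', 'CG', 'GC', 'TG']
# 4 ['TGCA']
# 6 []
```

As the word length grows, fewer words survive the mismatches, so there is less to examine (faster) but related sequences are easier to miss (less sensitive).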
Which parameters to choose?
Do I mask my sequence?
• Low complexity regions should be masked to avoid spurious results
  - CA repeats
  - poly-A tails
  - proline-rich regions
• **Be careful you don’t mask what you are looking for
Which parameters to choose?
What do I use for short sequences?
• use strict matrices
• use high gap penalties
• avoid masking
• allow high e-values
Fasta results
• Matches section
• Do you use e-values or % identity?
E-values or % identity?
E-value
• Estimates statistical significance of matches
• Default = 10: expect 10 matches found by chance
• E() < 0.01: usually homologous
• E() = 1-10: frequently related
% identity
• % of positions identical between query and match sequence
E-values or % identity?
[Figure: matches with similar % identity scores but different e-values]
E-values or % identity?
[Figure: two alignments; in one the pattern of conservation indicates homology, in the other there is no evidence of homology]
E-values or % identity?
Use e-values to estimate likelihood two sequences are homologous
Fasta results
• Check length and alignments in relation to % identity
Length of match
100% identity, but only over 124 / 663 (about 19%) of sequence
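Percent identity and the fraction of the sequence covered are separate numbers, and it helps to compute both. The sketch below assumes the alignment is given as two equal-length strings with '-' for gaps; the function names are hypothetical, and search programs may define identity over slightly different denominators.

```python
def percent_identity(aligned_query, aligned_subject):
    """Identical columns as a percentage of all alignment columns ('-' marks a gap)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * identical / len(aligned_query)


def coverage(aligned_length, sequence_length):
    """Percentage of a sequence that takes part in the alignment."""
    return 100.0 * aligned_length / sequence_length


print(percent_identity("ACGT", "ACCT"))   # 75.0
print(round(coverage(124, 663), 1))       # 18.7
```

For the match above, identity is 100% within the aligned region, yet the region spans a little under a fifth of the 663-residue sequence.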
Fasta results
• Protein and nucleotide search results have additional annotation
• Related EMBL nucleotide entries
• Related genomic information
• Gene ontology (GO) mapping for protein
• InterPro family/domain classification
• Literature
• Functional prediction on ALL proteins
Function Predictions using InterPro
• Visual comparison: find mis- or partial matches
• Prioritize results
• Functional predictions: InterPro family/domain classifications
• Extract information
Function Predictions using InterPro
• 100% ID: matches family signature and 4 domain signatures
• 34% ID: matches family signature and 3 domain signatures
• 28% ID: matches 1 domain signature
• 24% ID: matches no signatures
Sequence Search Summary
• Navigate to search tools
• Select search tool
  (1) Select database
  (2) Copy/paste sequence
  (3) Set parameters
  (4) Submit
• Result summary + annotation
• Functional predictions
Search guidelines
Search Guidelines: #1
• BEST: protein → Protein DB (FASTA, BLASTP)
  AVTEGPIPEVLFNYDAQYTFGHKNSDKSS
• 2nd BEST: DNA → translate → Protein DB (FASTX, BLASTX)
  ATGGCTAGCTTCGACTAGGCGATGCGA
• 3rd BEST: DNA → DNA DB (FASTA, BLASTN)
  ATGGCTAGCTTCGACTAGGCGATGCGA
• WORST: protein → Translated DNA DB (TFASTX, TBLASTN)
  AVTEGPIPEVLFNYDAQYTFGHKNSDKSS
Search Guidelines: #2
• Search smallest database likely to contain your sequence
• Use sequence statistics (E-values) rather than % identity or % similarity, as your primary criterion for sequence homology
Search Guidelines: #3
• Check statistics are likely to be accurate by looking for highest scoring unrelated sequence
  - Examine the histograms
  - Use programs such as prss3 to confirm the E-values
  - Search with shuffled sequences (use MLE/Shuffle in FASTA), which should have an E-value ~1.0
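The idea behind a shuffled-sequence control is that shuffling keeps the length and composition of the query but destroys its order, so any remaining hits reflect chance or compositional bias. The Python sketch below shows only the shuffling step. It is not prss3 or the FASTA shuffle option, and the function name is an assumption.

```python
import random


def shuffle_sequence(sequence, seed=None):
    """Return the sequence with its residues in random order (same composition)."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)  # seed makes the control reproducible
    return "".join(residues)


query = "AVTEGPIPEVLFNYDAQYTFGHKNSDKSS"
control = shuffle_sequence(query, seed=1)
print(control)
print(sorted(control) == sorted(query))  # True: same residues, different order
```

Searching with the control and with the real query under identical parameters lets the two sets of E-values be compared directly.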
Search Guidelines: #4
• Consider searches with different gap penalties and other scoring matrices
  - Use shallower matrices and/or more stringent gaps to uncover or force out relationships in partial sequences
  - Adjust scoring matrix to suit length of query sequence
  - Adjust gap penalties to match scoring matrix
Search Guidelines: #5
• Homology can be reliably inferred from statistically significant similarity
  - Homology = common 3D structure
  - Homology - NOT common function
• Paralogous sequences acquire very different functional roles
• Orthologous sequences have similar functions
Search Guidelines: #6
• Consult motif or fingerprint databases to find evidence for conservation of functionally critical residues
  - Motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!
• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data
  - ClustalW, MUSCLE, T-Coffee, Kalign, MAFFT, Mview, DBClustal (available from EBI FASTA & BLAST services)
Search Guidelines: #7
• Low complexity regions (e.g. CA repeats, poly-A tails and proline-rich regions) give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment
  - Use seg, xnu, dust, CENSOR, etc. BUT be careful about what you filter!!!
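One simple way to picture compositional bias is the Shannon entropy of a sliding window: repeats and homopolymer runs use few distinct letters and score low. The Python sketch below is only an illustration of that idea, not the algorithm used by seg, xnu, dust or CENSOR; the window size, the threshold and the function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a window."""
    total = len(window)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(window).values()
    )


def low_complexity_windows(sequence, window=12, threshold=1.5):
    """Start positions of windows whose entropy falls below the threshold."""
    return [
        start
        for start in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) < threshold
    ]


print(shannon_entropy("CACACACACACA"))               # 1.0
print(shannon_entropy("ACGTACGTACGT"))               # 2.0
print(low_complexity_windows("ACGTACGTACGTACGT"))    # []
```

A CA repeat carries 1 bit per position against 2 bits for an even mix of four bases, so it would be flagged; whether to mask it still depends on what is being searched for.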
Search Guidelines: #8
• What about short sequences?
• Depends on their nature:
• Protein:
  - Reduce word length and/or increase e-value
  - Use shallow matrices
• DNA:
  - Reduce word length (but NOT to 1!)
  - Set Threshold for band optimisation (FASTA) to 0
  - Ignore gap penalties (force local alignments only)
Accessing Tools at EBI
Access Sequence Similarity Search services over various interfaces:
1) Using your browser
2) Over email
3) Using Web Services (SOAP/REST)
  - Perl, Python, C and Java clients available (See: http://www.ebi.ac.uk/Tools/webservices/)
  - Taverna & Triana workflows are fully supported (See: http://www.myexperiment.org/)
Typical workflow
[Figure: workflow diagram with the steps search, review, check stats, compare, evolution, function]
Final remarks
• Don’t assume a single tool will cater for all your search needs
• DO change the parameters of the tools
• Remember where the tool excels and what its limitations are
• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)
• Crazy input will always give crazy results!
Contacts: http://www.ebi.ac.uk/support/