EBI is an Outstation of the European Molecular Biology Laboratory.

Sequence Searching Strategies

Jennifer McDowall, EMBL-EBI

A guide to efficient database searching

Sequence Searching Tools

Overview

• Know the data

• The Toolbox

• Search Guidelines


Know the data


• Many databases, each getting bigger

• Efficient searching requires knowledge of what data is stored in a database
  Don’t assume annotation can be transferred because of a good match

• Databases can contain errors

• Data can change
  Deletions, sequence modifications
  Daily updates, identifier changes…


Know the Data… Nucleotides

EMBL-Bank
• Divided into classes and divisions...
• Release and updates
• Supplementary sets: EMBL-CDS, EMBL-MGA

Specialist databases
• Immunoglobulins: IMGT/HLA, IMGT/LIGM…
• Alternative splicing: ASTD…
• Completed genomes: Ensembl, Integr8…
• Variation: HGVBase, dbSNP…


Know the Data… Proteins

UniProt
• Divided into 3 sections
• Release and updates

Specialist databases
• Sequence from structure: PDB, SGT…
• Immunoglobulins: IMGT/HLA…
• Alternative splicing: ASTD…
• Completed proteomes: Ensembl, Integr8…
• Protein interactions: IntAct
• Patent proteins: EPO, USPTO, JPO, KIPO


Homology vs. Similarity

Homology
• Homologous sequences share a common origin
• Presence of similar features because of common descent
• Statistically significant similar sequences are considered ‘homologous’
• Homology is like pregnancy: either one is or one isn’t! (Gribskov, 1999)

Similarity
• Similarity is a measure of the “likeness” of 2 sequences
• Uses statistics to determine ‘significance’ of similarity
  If significant, considered to be homologous
  If not significant, uncertain
• Similarity does not necessarily reflect homology


The Toolbox


Sequence Similarity Search Tools

BLAST
• NCBI-BLAST
• Wu-BLAST

FASTA
• FASTA
• SSEARCH
• GGSEARCH
• GLSEARCH

Iterative searches
• PSI-BLAST
• PSI-SEARCH


Tools: NCBI BLAST

• BLASTP: protein → Protein DB

• BLASTX: DNA → translate → Protein DB

• BLASTN: DNA → DNA DB


Tools: NCBI BLAST

[Screenshots: nucleotide search and protein search]


Tools: Wu-BLAST

• BLASTP: protein → Protein DB

• BLASTN: DNA → DNA DB

• BLASTX: DNA → translate → Protein DB

• TBLASTN: protein → Translated DNA DB

• TBLASTX: DNA → translate → Translated DNA DB




Tools: FASTA

• FASTA: protein → Protein DB, or DNA → DNA DB

• FASTX/Y: DNA → translate → Protein DB

• GLSEARCH: protein → Protein DB

• GGSEARCH: protein → Protein DB

• SSEARCH: protein → Protein DB, or DNA → DNA DB




When to use which search?

[Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH placed on axes of query length and database size]


When to use which search?

[Figure: speed of search for FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH across the databases PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]


BLAST vs FASTA

BLAST
• Fast
• Excels with proteins
• Good local alignments + short global alignments
• Proteins: BLOSUM62 (-11/-1), alignments good at >85% identity
• Good at finding siblings

FASTA
• Slower
• Excels with proteins and DNA (better than BLASTN for DNA)
• Produces S-W alignments
• Proteins: BLOSUM50 (-10/-2), longer alignments good at >70% identity
• Good at finding cousins


GLSEARCH and GGSEARCH

GLSEARCH
Global (query) - Local (target DB) alignment
For global query alignments to domains/patterns in target proteins

GGSEARCH
Global (query) - Global (target DB) alignment
Specific for searching short sequences against short targets or for gene-to-gene comparisons


What are global and local alignments?

• Local - Local: BLAST, FASTA
• Global - Local: GLSEARCH
• Global - Global: GGSEARCH

[Figure: Query/Subject alignment diagram for each of the three modes]
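The three modes differ only in which sequence ends are allowed to be left out of the alignment for free. A minimal sketch of that idea, assuming simple match/mismatch scores and a linear gap penalty; the function name `align_score` and the default scores are illustrative and are not the parameters or code of BLAST, FASTA, GLSEARCH or GGSEARCH:

```python
def align_score(query, subject, mode="local", match=2, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.

    mode: "local"  = local query, local subject
          "glocal" = global query, local subject
          "global" = global query, global subject
    """
    rows, cols = len(query) + 1, len(subject) + 1
    score = [[0] * cols for _ in range(rows)]
    # Skipping leading query residues is free only in local mode.
    for i in range(1, rows):
        score[i][0] = 0 if mode == "local" else i * gap
    # Skipping leading subject residues is penalised only in global mode.
    for j in range(1, cols):
        score[0][j] = j * gap if mode == "global" else 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            if mode == "local":
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    if mode == "local":
        return best
    if mode == "glocal":
        return max(score[-1])            # whole query used, subject may end anywhere
    return score[-1][-1]                 # both sequences used end to end


for mode in ("local", "glocal", "global"):
    print(mode, align_score("ACGT", "TTACGTTT", mode))
```

A short query embedded in a longer subject scores the same in local and global-local mode, while the global-global score is pulled down by the gaps needed to span the whole subject.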


Tools: PSI (Position Specific Iterated) Search

Single Protein Sequence

Search Database

Generate Alignment

Construct profile

Estimate significance

iterate


Tools: PSI Search

PSI-BLAST
• Part of NCBI-BLAST package
• Automatic iteration service
• (PSSM = position specific scoring matrix)
• Manually guided service

PSI-SEARCH
• Combines: PSI-BLAST (iterative strategy) + SSEARCH (S&W algorithm)
• Manually guided service


Let’s look at a FASTA search


FASTA search

Step 1: Select a database


Which database to choose?

• ENA-Annotation >124 million

• UniParc (non-redundant) >24 million

• Databases grow every day

Database size is important


How database size affects results

BLAST search, sequence: gatctccatggg

Database sizes: >122M, >15M, >700,000, >1.5M

Hits: 0 hits, 60 hits, 489 hits, 3 hits

e-values of 100% matches: 789.0, 621.0, (>1000), 0.96


How database size affects results

• Search smallest database likely to contain your sequence

• Run multiple small searches (can run all ENA/UniParc as well)
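The dependence on database size can be sketched numerically. This assumes the standard Karlin-Altschul relationship E = m × n × 2^(-S'), where m is the query length, n the total database length in residues and S' the bit score; the bit score of 24.0 and the database sizes below are made-up illustrative values, not the ones behind the figures on the slide:

```python
def expected_chance_hits(query_len, db_residues, bit_score):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_residues * 2.0 ** (-bit_score)


# Same 12-base query and same (hypothetical) bit score, different database sizes.
for db_residues in (1e6, 1e8, 1e10):
    e_value = expected_chance_hits(12, db_residues, 24.0)
    print(f"database residues {db_residues:.0e}: E = {e_value:.3g}")
```

Under this relationship the same alignment becomes 100 times less significant each time the database grows 100-fold, which is why a short query can lose its perfect matches in a very large database.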


Protein or nucleotide database search?

Two issues are worth considering…



Codon degeneracy

Amino acids: Ser vs Ser → match

Nucleotides: UCU vs AGC → mismatch



Over-simple match/mismatch scoring

Conservation       Amino acids             Nucleotides
highly conserved   Ser / Ser: identical    UCU / AGC: mismatch
weakly conserved   Ser / Asn: similar      UCU / AAC: mismatch
not conserved      Ser / Leu: mismatch     UCU / CUC: mismatch

Nucleotide scoring makes no distinction between these three cases.
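The two issues above can be checked directly. A minimal sketch using only the four codons from the slides and three serine entries from BLOSUM62; the names `CODON_TO_AA`, `BLOSUM62_SER` and `nucleotide_matches` are illustrative:

```python
# Excerpts only: four codons of the standard genetic code and
# three serine-row scores from the BLOSUM62 matrix.
CODON_TO_AA = {"UCU": "Ser", "AGC": "Ser", "AAC": "Asn", "CUC": "Leu"}
BLOSUM62_SER = {"Ser": 4, "Asn": 1, "Leu": -2}


def nucleotide_matches(codon_a, codon_b):
    """Count positions where the two codons carry the same base."""
    return sum(a == b for a, b in zip(codon_a, codon_b))


for other in ("AGC", "AAC", "CUC"):
    amino_acid = CODON_TO_AA[other]
    print(f"UCU vs {other}: {nucleotide_matches('UCU', other)}/3 bases match; "
          f"Ser vs {amino_acid}: BLOSUM62 score {BLOSUM62_SER[amino_acid]}")
```

All three codon pairs look equally bad at the nucleotide level, while the protein scores separate the identical, similar and dissimilar cases.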



Human CKS1B kinase vs Zebra finch CDC28 kinase 1B

[Figure: protein alignment and nucleotide alignment of the two sequences]


Protein or nucleotide search?

[Figure: timeline in billions of years ago (axis marks 1 to 4, ending at today), from the formation of Earth through chemical evolution, self-replicating cells, photosynthesis, complex cells, multicellular life, the Cambrian explosion and the extinction of dinosaurs to genus Homo. Lineages shown: prokaryotes, archaea, cyanobacteria, eukaryotes, plants, arthropods, fish, land plants, insects, amphibians, reptiles, mammals, birds, flowers. Two ranges are marked under "Identify homologs searching:": Proteins and DNA.]

Protein comparisons identify homologues 5-10x further back in evolution



…therefore, searching a protein database could pull out many more homologues than searching a nucleotide database

…if you start with a nucleotide sequence, try BLASTX or FASTX to translate your query sequence and search a protein database
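Conceptually, a translated search turns the nucleotide query into protein in each reading frame before comparing it with the database. A minimal sketch of that translation step using the standard genetic code; the function names are illustrative, and this is not how BLASTX or FASTX are implemented:

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code built in TCAG order; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Three forward frames followed by three reverse-strand frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand, frame) for strand in strands for frame in range(3)]


for protein in six_frame_translations("ATGGCTAGCTTCGACTAGGCGATGCGA"):
    print(protein)
```

Each of the six translations can then be treated as a protein query, which is what allows a nucleotide sequence to benefit from protein-level scoring.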


FASTA search

Step 2: Paste sequence

Step 3: Choose parameters


Choosing parameters

User manual provides help


Which parameters to choose?

Matrix

Nucleotide search: ‘simpler’ - only match/mismatch

Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)



Choice of matrix depends on:

1. strictness of search

2. length of query sequence

QUERY LENGTH   MATRIX     open   ext
>300           BLOSUM50   -10    -2
85-300         BLOSUM62   -7     -1
50-85          BLOSUM80   -16    -4
>300           PAM250     -10    -2
85-300         PAM120     -16    -4
35-85          MDM40      -12    -2
<=35           MDM20      -22    -4
<=10           MDM10      -23    -4
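The BLOSUM rows of the table can be written as a simple lookup. A minimal sketch; the function name `suggest_blosum` is illustrative, and assigning the shared boundary value 85 to the 85-300 row is an assumption, since the table lists 85 in two ranges:

```python
def suggest_blosum(query_length):
    """Return (matrix, gap_open, gap_extend) for a query length.

    Returns None for lengths below 50, where the table lists no BLOSUM row.
    """
    if query_length > 300:
        return ("BLOSUM50", -10, -2)
    if query_length >= 85:      # assumption: 85 belongs to the 85-300 row
        return ("BLOSUM62", -7, -1)
    if query_length >= 50:
        return ("BLOSUM80", -16, -4)
    return None


for length in (663, 124, 60, 30):
    print(length, suggest_blosum(length))
```

Keeping the matrix and its gap penalties together in one return value reflects the guideline that gap penalties should be chosen to match the scoring matrix.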


Matrices - controlling search sensitivity

PAM (point accepted mutation)

• Based on global alignments of related proteins

• 1 substitution in 100 residues = PAM 1

• Other matrices extrapolated from PAM 1

• Model of evolutionary divergence

• Bias against rare substitutions (e.g. Cys → Tyr) due to seed proteins
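Extrapolation from PAM 1 amounts to repeated multiplication of a mutation probability matrix by itself. A toy sketch on a made-up three-letter alphabet; `TOY_PAM1` is invented for illustration and is not derived from real protein data, and real PAM scoring matrices are log-odds scores computed from such probabilities rather than the probabilities themselves:

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, exponent):
    """Multiply a square matrix by itself `exponent` times."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical one-step mutation probabilities; each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

for distance in (1, 10, 100, 250):
    extrapolated = mat_power(TOY_PAM1, distance)
    print(distance, round(extrapolated[0][0], 3))
```

The probability that a residue is unchanged falls as the matrix is extrapolated further, which is the sense in which higher-numbered PAM matrices model greater evolutionary divergence.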


Matrices - controlling search sensitivity

BLOSUM (BLOcks SUbstitution Matrix)

• Based on protein domain alignments from the BLOCKS database

• Observed substitutions in conserved domains

• Based on percentage identity, so BLOSUM50 is deeper than BLOSUM80


Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence

[Figure: six panels, for PAM 10, 100, 200, 300, 400 and 500]



Matrix - protein

Nucleotide searches (FASTA, BLAST) instead have match/mismatch scores


Match/mismatch scores

• “Reward” for match, “penalty” for mismatch

• Reward/penalty ratio:

Increase ratio to find more divergent sequences:

Ratio of 0.33 (1/-3) for 99% conserved

Ratio of 0.5 (1/-2) for 95% conserved

Ratio of 1 (1/-1) for 75% conserved
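The effect of the ratio can be seen by scoring one ungapped alignment under each of the three settings. A minimal sketch; the function name and the two 20-base sequences (15 matches, 5 mismatches, so 75% identical) are made up for illustration:

```python
def ungapped_score(seq_a, seq_b, reward=1, penalty=-3):
    """Sum of match rewards and mismatch penalties over an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


query = "ACGTACGTACGTACGTACGT"
target = "TCGTTCGTTCGTTCGTTCGT"   # differs from the query at 5 of 20 positions

for reward, penalty in ((1, -3), (1, -2), (1, -1)):
    print(f"{reward}/{penalty}: score {ungapped_score(query, target, reward, penalty)}")
```

With 1/-3 the mismatches cancel the matches completely, whereas 1/-1 leaves a clearly positive score, so a divergent match only stands out once the ratio is increased.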


gap penalties


Nucleotide search

gap open = -2 to -16

Gap extension = 0 to -4

Protein search

gap open = 0 to -23

Gap extension = 0 to -8


Choice of gap penalties depends on:

1. strictness of search

2. to match scoring matrix

• larger penalty → fewer gaps
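Having separate open and extension penalties makes one long gap cheaper than several short ones. A minimal sketch, assuming a gap of length n costs open + n × extension; programs differ in whether the first gapped residue also pays the extension penalty, so treat this convention and the function name as assumptions:

```python
def affine_gap_cost(gap_length, gap_open=-10, gap_extend=-2):
    """Cost of a single gap under the assumed convention open + n * extend."""
    if gap_length <= 0:
        return 0
    return gap_open + gap_extend * gap_length


one_long_gap = affine_gap_cost(4)
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)
```

Raising the open penalty therefore mainly reduces the number of separate gaps in an alignment, while the extension penalty controls how long each gap can grow.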



KTUP (word length)

KTUP = ‘word-length’ of search

Large word-length → less sensitive, faster

Nucleotide search: fewer bases than amino acids → higher KTUP
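The trade-off comes from the first stage of a word-based search, which only looks at places where query and subject share an exact word of length KTUP. A simplified sketch of that lookup; the function names and sequences are illustrative, and real FASTA goes on to score, join and re-align the best diagonals:

```python
from collections import defaultdict


def word_positions(seq, ktup):
    """Map every word of length ktup to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def shared_words_by_diagonal(query, subject, ktup):
    """Count exact shared words on each diagonal (subject position minus query position)."""
    subject_index = word_positions(subject, ktup)
    diagonals = defaultdict(int)
    for i in range(len(query) - ktup + 1):
        for j in subject_index.get(query[i:i + ktup], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


query, subject = "ACGTTGCA", "ACGTAGCA"   # one mismatch in the middle
for ktup in (2, 4, 6):
    print(ktup, shared_words_by_diagonal(query, subject, ktup))
```

A single mismatch in the middle of an 8-base match leaves five shared words at KTUP 2, one at KTUP 4 and none at KTUP 6, so a longer word means fewer lookups but a higher chance of missing a real match.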



Do I mask my sequence?

Low complexity regions should be masked to avoid spurious results

• CA repeats

• poly-A tails

• proline-rich regions

**Be careful you don’t mask what you are looking for
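One way to picture masking is to replace any window whose composition is too uniform. A deliberately simple sketch based on Shannon entropy; it is not the algorithm used by seg, xnu or dust, and the window size, threshold and function names are arbitrary assumptions:

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Composition entropy in bits: 0 for a single repeated letter."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="N"):
    """Mask every window whose entropy falls below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACGTACGTACGT"))   # mixed composition
print(mask_low_complexity("CACACACACACA"))   # CA repeat
print(mask_low_complexity("A" * 20))         # poly-A
```

A filter like this has no idea what the search is for: if the CA repeat or the poly-A run is the feature of interest, masking removes exactly the signal being sought.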



What do I use for short sequences?

• use strict matrices

• use high gap penalties

• avoid masking

• allow high e-values


Fasta results

Matches section

Do you use e-values or % identity?


E-values or % identity?

E-value
• Estimates statistical significance of matches
• Default = 10 → expect 10 matches found by chance
• E() < 0.01: usually homologous
• E() = 1-10: frequently related

% identity
• % of positions identical between query and match sequence



Similar % identity scores, different e-values

• Pattern of conservation indicates homology

• No evidence of homology

Use e-values to estimate likelihood two sequences are homologous


Fasta results

Check length and alignments in relation to % identity


Length of match

100% identity, but only over 124 / 663 (about 19%) of sequence
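Identity and coverage are separate numbers and are worth computing together. A minimal sketch; the function names are illustrative, and programs differ in which columns they count in the identity denominator, so the choice here (all aligned columns, gaps included) is an assumption:

```python
def percent_identity(aligned_query, aligned_subject):
    """% of alignment columns holding the same residue (gaps written as '-')."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must be the same length")
    if not aligned_query:
        return 0.0
    identical = sum(a == b and a != "-"
                    for a, b in zip(aligned_query, aligned_subject))
    return 100.0 * identical / len(aligned_query)


def query_coverage(aligned_length, query_length):
    """% of the query that takes part in the alignment."""
    return 100.0 * aligned_length / query_length


print(percent_identity("ACGT-ACGT", "ACGTTACGA"))
print(round(query_coverage(124, 663), 1))
```

For the match above, 124 aligned residues out of 663 is under a fifth of the query, so a 100% identity figure says nothing about the remaining four fifths.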


Fasta results

Protein and nucleotide search results have additional annotation:

• Related EMBL nucleotide entries

• Related genomic information

• Gene ontology (GO) mapping for protein

• InterPro family/domain classification

• Literature

• Functional prediction on ALL proteins


Function Predictions using InterPro

• Visual comparison: find mis- or partial matches

• Prioritize results

• Extract information

• Functional predictions: InterPro family/domain classifications


Function Predictions using InterPro

• 100% ID matches: family signature, 4 domain signatures

• 34% ID matches: family signature, 3 domain signatures

• 28% ID matches: 1 domain signature

• 24% ID matches: no signatures


Sequence Search Summary

Navigate to search tools

Select search tool

(1) Select database

(2) Copy/paste sequence

(3) Set parameters

(4) Submit

Result summary + annotation

Functional predictions


Search guidelines


Search Guidelines: #1

• BEST: protein (AVTEGPIPEVLFNYDAQYTFGHKNSDKSS) → Protein DB
  FASTA, BLASTP

• 2nd BEST: DNA (ATGGCTAGCTTCGACTAGGCGATGCGA) → translate → Protein DB
  FASTX, BLASTX

• 3rd BEST: DNA (ATGGCTAGCTTCGACTAGGCGATGCGA) → DNA DB
  FASTA, BLASTN

• WORST: protein (AVTEGPIPEVLFNYDAQYTFGHKNSDKSS) → Translated DNA DB
  TFASTX, TBLASTN



• Search smallest database likely to contain your sequence

• Use sequence statistics (E-values) rather than % identity or % similarity, as your primary criterion for sequence homology



• Check statistics are likely to be accurate by looking for highest scoring unrelated sequence

Examine the histograms

Use programs such as prss3 to confirm the E-values

Searching with shuffled sequences (use MLE/Shuffle in FASTA) which should have an E-value ~1.0
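Shuffling keeps a sequence's length and composition but destroys its order, so any match to a shuffled query is a chance match. A minimal sketch of generating such control queries; the function name and the fixed seed are illustrative, and the search itself (or prss3) is not shown:

```python
import random


def shuffled_copies(seq, n=5, seed=0):
    """Return n shuffled versions of seq with identical composition."""
    rng = random.Random(seed)   # fixed seed so the controls are reproducible
    copies = []
    for _ in range(n):
        residues = list(seq)
        rng.shuffle(residues)
        copies.append("".join(residues))
    return copies


query = "AVTEGPIPEVLFNYDAQYTFGHKNSDKSS"
for control in shuffled_copies(query, n=3):
    print(control)
```

If searches with these controls report E-values far below 1, the statistics for the real query are probably not trustworthy either.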


Search Guidelines: #4

• Consider searches with different gap penalties and other scoring matrices

Use shallower matrices and/or more stringent gaps to uncover or force out relationships in partial sequences

Adjust scoring matrix to suit length of query sequence

Adjust gap penalties to match scoring matrix




• Homology can be reliably inferred from statistically significant similarity

Homology = common 3D structure

Homology - NOT common function

• Paralogous sequences acquire very different functional roles

• Orthologous sequences have similar functions



• Consult motif or fingerprint databases to find evidence for conservation of functionally critical residues

Motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!

• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data

ClustalW, MUSCLE, T-Coffee, Kalign, MAFFT, Mview, DBClustal (available from EBI FASTA & BLAST services)



• Low complexity regions (e.g. CA repeats, poly-A tails and proline-rich regions) give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment

Use seg, xnu, dust, CENSOR, etc. BUT be careful about what you filter!!!



• What about short sequences?

• Depends on their nature:

• Protein:

Reduce word length and/or increase e-value
Use shallow matrices

• DNA:

Reduce word length (but NOT to 1!)
Set Threshold for band optimisation (FASTA) to 0
Ignore gap penalties (force local alignments only)


Accessing Tools at EBI

Access Sequence Similarity Search services over various interfaces:

1) Using your browser

2) Over email

3) Using Web Services (SOAP/REST)
Perl, Python, C and Java clients available
Taverna & Triana workflows are fully supported

(See: http://www.ebi.ac.uk/Tools/webservices/)

(See: http://www.myexperiment.org/)


Typical workflow

search, review, check stats

compare, evolution, function


Final remarks

• Don’t assume a single tool will cater for all your search needs

• DO change the parameters of the tools

• Remember where the tool excels and what its limitations are

• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

• Crazy input will always give crazy results!

Contacts: http://www.ebi.ac.uk/support/
