EBI is an Outstation of the European Molecular Biology Laboratory. Sequence Searching Strategies Jennifer McDowall EMBL-EBI A guide to efficient database searching

Page 1:

EBI is an Outstation of the European Molecular Biology Laboratory.

Sequence Searching Strategies

Jennifer McDowall, EMBL-EBI

A guide to efficient database searching

Page 2:

Sequence Searching Tools

Overview

• Know the data

• The Toolbox

• Search Guidelines

Page 3: Know the data

Page 4: Know the Data…

• Many databases, each getting bigger

• Efficient searching requires knowledge of what data is stored in a database: don’t assume annotation can be transferred because of a good match

• Databases can contain errors

• Data can change: deletions, sequence modifications, daily updates, identifier changes…

Page 5: Know the Data… Nucleotides

EMBL-Bank

• Divided into classes and divisions...

• Release and updates

• Supplementary sets: EMBL-CDS, EMBL-MGA

Specialist databases

• Immunoglobulins: IMGT/HLA, IMGT/LIGM…

• Alternative splicing: ASTD…

• Completed genomes: Ensembl, Integr8…

• Variation: HGVBase, dbSNP…

Page 6: Know the Data… Proteins

UniProt

• Divided into 3 sections

• Release and updates

Specialist databases

• Sequence from structure: PDB, SGT…

• Immunoglobulins: IMGT/HLA…

• Alternative splicing: ASTD…

• Completed proteomes: Ensembl, Integr8…

• Protein interactions: IntAct

• Patent proteins: EPO, USPTO, JPO, KIPO

Page 7: Homology vs. Similarity

Homology

• Homologous sequences share a common origin

• Presence of similar features because of common descent

• Statistically significant similar sequences are considered ‘homologous’

• Homology is like pregnancy: either one is or one isn’t! (Gribskov – 1999)

Similarity

• Similarity is a measure of the “likeness” of 2 sequences

• Uses statistics to determine ‘significance’ of similarity: if significant, considered to be homologous; if not significant, uncertain

• Similarity does not necessarily reflect homology

Page 8: The Toolbox

Page 9: Sequence Similarity Search Tools

Page 10: Sequence Similarity Search Tools

BLAST

FASTA

Iterative searches

Page 11: Sequence Similarity Search Tools

BLAST: NCBI-BLAST, Wu-BLAST

FASTA: FASTA, SSEARCH, GGSEARCH, GLSEARCH

Iterative search: PSI-BLAST, PSI-SEARCH

Page 12: Tools: NCBI BLAST

• BLASTP: protein → Protein DB

• BLASTX: DNA → translate → Protein DB

• BLASTN: DNA → DNA DB

Page 13: Tools: NCBI BLAST

[Screenshots: Nucleotide search, Protein search]

Page 14: Tools: Wu-BLAST

• BLASTP: protein → Protein DB

• BLASTN: DNA → DNA DB

• BLASTX: DNA → translate → Protein DB

• TBLASTN: protein → Translated DNA DB

• TBLASTX: DNA → translate → Translated DNA DB

Page 15: Tools: Wu-BLAST

[Screenshots: Nucleotide search, Protein search]

Page 16: Tools: FASTA

• FASTA: protein → Protein DB, or DNA → DNA DB

• FASTX/Y: DNA → translate → Protein DB

• GLSEARCH: protein → Protein DB

• GGSEARCH: protein → Protein DB

• SSEARCH: protein → Protein DB, or DNA → DNA DB

Page 17: Tools: FASTA

[Screenshots: Nucleotide search, Protein search]

Page 18: When to use which search?

[Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH placed by database size and query length]

Page 19: When to use which search?

[Figure: speed of search for FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH across PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]

Page 20: BLAST v FASTA

BLAST

• Fast

• Excels with proteins

• Good local alignments + short global alignments

• Proteins: BLOSUM62 (-11/-1), alignments good at >85% homology

• Good at finding siblings

FASTA

• Slower

• Excels with proteins and DNA (better than BLASTN for DNA)

• Produces S-W (Smith-Waterman) alignments

• Proteins: BLOSUM50 (-10/-2), longer alignments good at >70% homology

• Good at finding cousins

Page 21: GLSEARCH and GGSEARCH

GLSEARCH

• Global (query) - Local (target DB) alignment

• For global query alignments to domains/patterns in target proteins

GGSEARCH

• Global (query) - Global (target DB) alignment

• Specific for searching short sequences against short targets or for gene-to-gene comparisons

Page 22: What are global and local alignments?

[Figure: query and subject bars showing the aligned regions for each alignment type]

• Local - Local: BLAST, FASTA

• Global - Local: GLSEARCH

• Global - Global: GGSEARCH
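The three alignment types differ only in which sequence ends may be left unaligned for free. The sketch below illustrates that idea with a small dynamic-programming scorer. It is not the implementation used by BLAST, FASTA, GLSEARCH or GGSEARCH; the function name, the mode names and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def align_score(query, subject, mode="local", match=2, mismatch=-1, gap=-2):
    """Best alignment score under a simple linear-gap model (illustrative only).

    mode "local":  local-local   (Smith-Waterman style)
    mode "glocal": global query against a local region of the subject
    mode "global": global-global (Needleman-Wunsch style)
    """
    rows, cols = len(query) + 1, len(subject) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leaving query residues unaligned is free only in local mode.
    for i in range(1, rows):
        score[i][0] = 0 if mode == "local" else i * gap
    # Leaving subject residues unaligned is free unless fully global.
    for j in range(1, cols):
        score[0][j] = j * gap if mode == "global" else 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            if mode == "local":
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    if mode == "local":
        return best
    if mode == "glocal":
        return max(score[rows - 1])  # whole query used, subject end is free
    return score[rows - 1][cols - 1]


if __name__ == "__main__":
    for mode in ("local", "glocal", "global"):
        print(mode, align_score("ACGT", "TTACGTTT", mode))
```

With a short query embedded in a longer subject, the local and global-local scores agree, while the global-global score is pulled down by the gaps needed to span the whole subject.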

Page 23: Tools: PSI (Position Specific Iterated) Search

Single protein sequence → Search database → Generate alignment → Construct profile → Estimate significance → iterate

Page 24: Tools: PSI Search

PSI-BLAST

• Part of NCBI-BLAST package

• Automatic iteration service

• (PSSM = position specific scoring matrix)

• Manually guided service

PSI-SEARCH

• Combines: SSEARCH (S&W algorithm) + PSI-BLAST (iterative strategy)

• Manually guided service

Page 25: Let’s look at a FASTA search

Page 26: FASTA search

Step 1: Select a database

Page 27: Which database to choose?

Database size is important

• ENA-Annotation: >124 million

• UniParc (non-redundant): >24 million

• Databases grow every day

Page 28: How database size affects results

BLAST search with the sequence gatctccatggg

Database sizes: >122M, >15M, >700,000, >1.5M

Hits: 0, 60, 489, 3

E-values of 100% matches: 789.0, 621.0, (>1000), 0.96
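Why the same perfect match gets a worse E-value in a bigger database can be seen from the Karlin-Altschul estimate commonly used for local alignment statistics, E = K · m · n · exp(-λ · S), where m is the query length, n the total database length and S the raw score. The sketch below is a rough illustration only: the function name is hypothetical, the K and λ values are placeholder assumptions, and the corrections that real programs apply (such as effective lengths) are ignored.

```python
import math


def expected_hits(raw_score, query_len, db_residues, k=0.13, lam=0.32):
    """Karlin-Altschul style estimate E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholders, not values taken from any
    particular program or scoring matrix.
    """
    return k * query_len * db_residues * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Same query, same raw score, databases of increasing size.
    for db_residues in (1e6, 1e8, 1e10):
        print(f"{db_residues:.0e} residues -> E = "
              f"{expected_hits(24, 12, db_residues):.3g}")
```

Because E is proportional to n, a database 100 times larger gives an E-value 100 times higher for the identical alignment, which is why a short query can drop below the reporting threshold in the largest databases.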

Page 29: How database size affects results

• Search smallest database likely to contain your sequence

• Run multiple small searches (can run all ENA/UniParc as well)

Page 30: Protein or nucleotide database search?

Two issues are worth considering…

Page 31: Protein or nucleotide database search?

Codon degeneracy

              Sequence 1   Sequence 2   Result
Nucleotides   UCU          AGC          mismatch
Amino acids   Ser          Ser          match
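The codon degeneracy example can be checked directly. The sketch below uses only the handful of standard genetic code entries needed for the slides' examples (the six serine codons plus AAC and CUC); the function name is hypothetical.

```python
# Subset of the standard genetic code: the six Ser codons, plus Asn and Leu
# codons used in the examples.
CODON_TABLE = {
    "UCU": "Ser", "UCC": "Ser", "UCA": "Ser", "UCG": "Ser",
    "AGU": "Ser", "AGC": "Ser",
    "AAC": "Asn", "CUC": "Leu",
}


def compare_codons(codon_a, codon_b):
    """Return (number of identical bases, True if both encode the same amino acid)."""
    identical_bases = sum(1 for x, y in zip(codon_a, codon_b) if x == y)
    same_amino_acid = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    return identical_bases, same_amino_acid


if __name__ == "__main__":
    print(compare_codons("UCU", "AGC"))  # no shared bases, same amino acid
```

UCU and AGC share no base at any position, yet both encode serine, so a nucleotide comparison sees a complete mismatch where a protein comparison sees a match.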

Page 32: Protein or nucleotide database search?

Over-simple match/mismatch scoring

Amino acids                                  Nucleotides
Ser / Ser   identical   highly conserved     UCU / AGC   mismatch
Ser / Asn   similar     weakly conserved     UCU / AAC   mismatch
Ser / Leu   mismatch    not conserved        UCU / CUC   mismatch

Nucleotide scoring makes no distinction between these three cases.
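The loss of information can be made concrete by scoring the same three pairs both ways. In the sketch below the amino acid scores are the Ser row entries of the standard BLOSUM62 matrix; the nucleotide reward/penalty of +1/-3 and the function name are assumptions for illustration.

```python
# BLOSUM62 scores for serine against Ser, Asn and Leu.
BLOSUM62_SER = {"Ser": 4, "Asn": 1, "Leu": -2}

# Codon pairs from the slide, keyed by the amino acid the second codon encodes.
CODON_PAIRS = {"Ser": ("UCU", "AGC"), "Asn": ("UCU", "AAC"), "Leu": ("UCU", "CUC")}


def nucleotide_score(codon_a, codon_b, reward=1, penalty=-3):
    """Simple match/mismatch score summed over the three codon positions."""
    return sum(reward if x == y else penalty for x, y in zip(codon_a, codon_b))


if __name__ == "__main__":
    for amino_acid, (codon_a, codon_b) in CODON_PAIRS.items():
        print(f"Ser/{amino_acid}: protein score {BLOSUM62_SER[amino_acid]:+d}, "
              f"nucleotide score {nucleotide_score(codon_a, codon_b):+d}")
```

The substitution matrix grades the three pairs differently, while the nucleotide scores for these particular codon pairs are all the same.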

Page 33: Protein or nucleotide database search?

[Figure: protein and nucleotide alignments of Human CKS1B kinase v Zebra finch CDC28 kinase 1B]

Page 34: Protein or nucleotide search?

[Figure: timeline in billions of years ago (4, 3, 2, 1, today). Labelled events: formation of Earth, chemical evolution, self-replicating cells, photosynthesis, complex cells, multicellular life, Cambrian explosion, extinction of dinosaurs, genus Homo. Labelled groups: prokaryotes, archaea, cyanobacteria, eukaryotes, plants, arthropods, fish, insects, land plants, amphibians, reptiles, mammals, birds, flowers. Legend: Identify homologs searching: Proteins, DNA]

Protein comparisons identify homologues 5-10x further back in evolution

Page 35: Protein or nucleotide database search?

…therefore, searching a protein database could pull out many more homologues than searching a nucleotide database

…if you start with a nucleotide sequence, try BLASTX or FASTX to translate your query sequence and search a protein database

Page 36: FASTA search

Step 1: Select a database

Step 2: Paste sequence

Page 37: FASTA search

Step 1: Select a database

Step 2: Paste sequence

Step 3: Choose parameters

Page 38: Choosing parameters

Page 39: Choosing parameters

User manual provides help

Page 40: Which parameters to choose?

Matrix

• Nucleotide search is ‘simpler’: only match/mismatch

• Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)

Page 41: Which parameters to choose?

Choice of matrix depends on:

1. strictness of search

2. length of query sequence

QUERY LENGTH  MATRIX    open   ext
>300          BLOSUM50  -10    -2
85-300        BLOSUM62   -7    -1
50-85         BLOSUM80  -16    -4
>300          PAM250    -10    -2
85-300        PAM120    -16    -4
35-85         MDM40     -12    -2
<=35          MDM20     -22    -4
<=10          MDM10     -23    -4
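The BLOSUM rows of the table above can be read as a simple lookup from query length to a matrix and its matching gap penalties. The sketch below encodes just those three rows. The function name is hypothetical, and two choices are assumptions: a length of exactly 85 is assigned to the 85-300 row, and queries shorter than 50 return None because the table lists no BLOSUM matrix for them.

```python
# (minimum query length, matrix, gap open, gap extension), longest first.
BLOSUM_ROWS = [
    (301, "BLOSUM50", -10, -2),  # > 300
    (85, "BLOSUM62", -7, -1),    # 85-300
    (50, "BLOSUM80", -16, -4),   # 50-85
]


def suggest_blosum(query_length):
    """Return (matrix, gap_open, gap_ext) for a query length, or None if unlisted."""
    for min_length, matrix, gap_open, gap_ext in BLOSUM_ROWS:
        if query_length >= min_length:
            return matrix, gap_open, gap_ext
    return None


if __name__ == "__main__":
    for length in (500, 120, 60, 20):
        print(length, suggest_blosum(length))
```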

Page 42: Matrices - controlling search sensitivity

PAM (point accepted mutation)

• Based on global alignments of related proteins

• 1 substitution in 100 residues = PAM 1

• Other matrices extrapolated from PAM 1

• Model of evolutionary divergence

• Bias against rare substitutions (e.g. Cys → Tyr) due to seed proteins
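The extrapolation step can be pictured as repeated multiplication: in the Dayhoff model, the mutation probability matrix for PAM n is the PAM 1 probability matrix raised to the n-th power. The sketch below shows this on a made-up three-letter alphabet. TOY_PAM1 and the helper names are hypothetical, and the numbers are not real PAM values; a real PAM matrix is 20 x 20 and is converted to log-odds scores afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 3-letter alphabet: each row gives the probability that a
# residue stays the same or changes after one unit of divergence.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

if __name__ == "__main__":
    for row in mat_pow(TOY_PAM1, 250):
        print([round(p, 3) for p in row])
```

After many steps the diagonal shrinks and the off-diagonal entries grow, which is why high-numbered PAM matrices tolerate more substitutions and suit more divergent sequences.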

Page 43: Matrices - controlling search sensitivity

BLOSUM (BLOCKS amino-acid substitution matrix)

• Based on protein domain alignments from the BLOCKS database

• Observed substitutions in conserved domains

• Based on percentage identity, so BLOSUM50 is deeper than BLOSUM80

Page 44: Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence

[Figure: six panels for PAM 10, 100, 200, 300, 400 and 500]

Page 45: Which parameters to choose?

• Matrix: protein searches

• Nucleotide searches instead have match/mismatch scores

[Screenshots: FASTA, BLAST]

Page 46: Match/mismatch scores

• “Reward” for match, “penalty” for mismatch

• Reward/penalty ratio: increase ratio to find more divergent sequences

Ratio of 0.33 (1/-3) for 99% conserved

Ratio of 0.5 (1/-2) for 95% conserved

Ratio of 1 (1/-1) for 75% conserved
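A rough way to see why the ratio matters is to compute the average score per aligned base at a given level of identity. The sketch below is a simplification that ignores gaps and alignment statistics; the function name is hypothetical.

```python
def expected_score_per_base(identity, reward, penalty):
    """Average score per aligned position at a given fractional identity."""
    return identity * reward + (1.0 - identity) * penalty


if __name__ == "__main__":
    for reward, penalty in ((1, -3), (1, -2), (1, -1)):
        for identity in (0.99, 0.95, 0.75):
            score = expected_score_per_base(identity, reward, penalty)
            print(f"{reward}/{penalty} at {identity:.0%} identity: {score:+.2f}")
```

With 1/-3, an alignment at 75% identity averages zero per base and cannot accumulate a positive score, whereas 1/-1 still rewards it. That is the sense in which a higher ratio finds more divergent sequences.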

Page 47: Which parameters to choose?

Gap penalties

Nucleotide search: gap open = -2 to -16; gap extension = 0 to -4

Protein search: gap open = 0 to -23; gap extension = 0 to -8

Page 48: Which parameters to choose?

Choice of gap penalties depends on:

1. strictness of search

2. the scoring matrix (penalties should match it)

• larger penalty → fewer gaps

QUERY LENGTH  MATRIX    open   ext
>300          BLOSUM50  -10    -2
85-300        BLOSUM62   -7    -1
50-85         BLOSUM80  -16    -4
>300          PAM250    -10    -2
85-300        PAM120    -16    -4
35-85         MDM40     -12    -2
<=35          MDM20     -22    -4
<=10          MDM10     -23    -4
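The open and extension values work together as an affine gap penalty. The sketch below assumes the convention that a gap of length k costs open + k × extension; conventions differ between programs and versions, so treat the formula and the function name as assumptions to check against the manual of the tool you use.

```python
def gap_cost(length, gap_open, gap_ext):
    """Affine gap penalty, assuming total = open + length * extension."""
    if length <= 0:
        return 0
    return gap_open + gap_ext * length


if __name__ == "__main__":
    # One gap of three residues versus three separate single-residue gaps,
    # using the BLOSUM50 row (-10 / -2).
    print(gap_cost(3, -10, -2), 3 * gap_cost(1, -10, -2))
```

Under this model one long gap is much cheaper than several short ones, and raising the open penalty makes every additional gap more expensive, hence fewer gaps.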

Page 49: Which parameters to choose?

KTUP (word length)

• KTUP = ‘word-length’ of search

• Large word-length → less sensitive, faster

• Nucleotide search: fewer bases than amino acids → higher KTUP
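The trade-off can be illustrated by counting the words two sequences share at different word lengths. The sketch below shows only the word-matching idea, not the FASTA algorithm itself; the function names and the second sequence are made up for the example.

```python
def words(sequence, ktup):
    """Set of all overlapping words of length ktup in a sequence."""
    return {sequence[i:i + ktup] for i in range(len(sequence) - ktup + 1)}


def shared_words(query, subject, ktup):
    """Words of length ktup that occur in both sequences."""
    return words(query, ktup) & words(subject, ktup)


if __name__ == "__main__":
    query = "gatctccatggg"
    subject = "gatcaccatggc"  # hypothetical sequence with a few differences
    for ktup in (3, 6):
        print(ktup, sorted(shared_words(query, subject, ktup)))
```

A short word length finds several seed matches between these two similar sequences; a long one finds a single seed, so there is less to examine (faster) but a greater chance of missing a diverged match (less sensitive).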

Page 50: Which parameters to choose?

Do I mask my sequence?

Low complexity regions should be masked to avoid spurious results

• CA repeats

• poly-A tails

• proline-rich regions

**Be careful you don’t mask what you are looking for
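As a rough illustration of what masking does, the sketch below replaces windows of low Shannon entropy with X. This is a toy stand-in and not the algorithm of any real filter; the function names, window size and threshold are assumptions picked for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(window).values())


def mask_low_complexity(sequence, window=12, threshold=1.5, mask_char="X"):
    """Mask every window whose composition entropy falls below the threshold."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ACDEFGHIKLMN" + "P" * 14 + "ACDEFGHIKLMN"))
```

Whole windows are masked, so residues next to a low-complexity stretch can be lost too, which is one way a filter can hide the region you were actually looking for.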

Page 51: Which parameters to choose?

What do I use for short sequences?

• use strict matrices

• use high gap penalties

• avoid masking

• allow high e-values

Page 52: Fasta results

Matches section

Page 53: Fasta results

Do you use e-values or % identity?

Page 54: E-values or % identity?

E-value

• Estimates statistical significance of matches

• Default = 10 → expect 10 matches found by chance

• E() < 0.01: usually homologous

• E() = 1-10: frequently related

% identity

• % of positions identical between query and match sequence

Page 55: E-values or % identity?

[Figure: matches with similar % identity scores but different e-values]

Page 56: E-values or % identity?

[Figure: two alignments; in one the pattern of conservation indicates homology, in the other there is no evidence of homology]

Page 57: E-values or % identity?

Use e-values to estimate the likelihood that two sequences are homologous

Page 58: Fasta results

Check length and alignments in relation to % identity

Page 59: Length of match

100% identity, but only over 124 / 663 (about 19%) of sequence
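Percent identity and the fraction of the sequence covered are separate numbers and are worth computing side by side. The sketch below is a minimal illustration; the function names are hypothetical and gapped columns are simply skipped when counting identities.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of ungapped aligned columns with identical residues."""
    columns = [(a, b) for a, b in zip(aligned_query, aligned_subject)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def percent_coverage(aligned_length, full_length):
    """Percent of the full sequence covered by the aligned region."""
    return 100.0 * aligned_length / full_length


if __name__ == "__main__":
    print(f"{percent_identity('ACGT-A', 'ACCTTA'):.1f}% identity")
    print(f"{percent_coverage(124, 663):.1f}% coverage")
```

A match that is 100% identical over 124 of 663 residues covers under a fifth of the sequence, so the identity figure alone overstates how much of the protein is actually matched.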

Page 60: Fasta results

Protein and nucleotide search results have additional annotation

Page 61: Fasta results

Related EMBL nucleotide entries

Page 62: Fasta results

Related genomic information

Page 63: Fasta results

Gene ontology (GO) mapping for protein

Page 64: Fasta results

InterPro family/domain classification

Page 65: Fasta results

Literature

Page 66: Fasta results

Functional prediction on ALL proteins

Page 67: Function Predictions using InterPro

• Visual comparison: find mis- or partial matches

• Prioritize results

• Functional predictions: InterPro family/domain classifications

• Extract information

Page 68: Function Predictions using InterPro

% ID    Matches
100%    family signature, 4 domain signatures
34%     family signature, 3 domain signatures
28%     1 domain signature
24%     No signatures

Page 69: Sequence Search Summary

Navigate to search tools → Select search tool → (1) Select database → (2) Copy/paste sequence → (3) Set parameters → (4) Submit → Result summary + annotation → Functional predictions

Page 70: Search guidelines

Page 71: Search Guidelines: #1

• BEST: protein (AVTEGPIPEVLFNYDAQYTFGHKNSDKSS) → Protein DB, using FASTA or BLASTP

• 2nd BEST: DNA (ATGGCTAGCTTCGACTAGGCGATGCGA) → translate → Protein DB, using FASTX or BLASTX

• 3rd BEST: DNA (ATGGCTAGCTTCGACTAGGCGATGCGA) → DNA DB, using FASTA or BLASTN

• WORST: protein (AVTEGPIPEVLFNYDAQYTFGHKNSDKSS) → Translated DNA DB, using TFASTX or TBLASTN

Page 72: Search Guidelines: #2

• Search smallest database likely to contain your sequence

• Use sequence statistics (E-values) rather than % identity or % similarity, as your primary criterion for sequence homology

Page 73: Search Guidelines: #3

• Check statistics are likely to be accurate by looking for highest scoring unrelated sequence

Examine the histograms

Use programs such as prss3 to confirm the E-values

Search with shuffled sequences (use MLE/Shuffle in FASTA), which should have an E-value ~1.0
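The idea behind a shuffle test can be sketched in a few lines: score the real pair, then score the query against many shuffled copies of the subject, which keep its composition but destroy its order, and see how often chance does as well. This is not how prss3 or the FASTA MLE/Shuffle option is implemented; the function names, scoring values and number of trials below are assumptions for illustration.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with simple match/mismatch/gap scoring."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, previous[j - 1] + pair,
                       previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def shuffle_test(query, subject, trials=200, seed=0):
    """Return (real score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)
    real = local_score(query, subject)
    letters = list(subject)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, random order
        if local_score(query, "".join(letters)) >= real:
            as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return real, (as_good + 1) / (trials + 1)


if __name__ == "__main__":
    protein = "AVTEGPIPEVLFNYDAQYTFGHKNSDKSS"
    print(shuffle_test(protein, protein))
```

If shuffled sequences regularly score as well as the real match, the reported significance should not be trusted.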

Page 74: Search Guidelines: #4

• Consider searches with different gap penalties and other scoring matrices

Use shallower matrices and/or more stringent gaps to uncover or force out relationships in partial sequences

Adjust scoring matrix to suit length of query sequence

Adjust gap penalties to match scoring matrix

QUERY LENGTH  MATRIX    open   ext
>300          BLOSUM50  -10    -2
85-300        BLOSUM62   -7    -1
50-85         BLOSUM80  -16    -4
>300          PAM250    -10    -2
85-300        PAM120    -16    -4
35-85         MDM40     -12    -2
<=35          MDM20     -22    -4
<=10          MDM10     -23    -4

Page 75: Search Guidelines: #5

• Homology can be reliably inferred from statistically significant similarity

Homology = common 3D structure

Homology - NOT common function

• Paralogous sequences acquire very different functional roles

• Orthologous sequences have similar functions

Page 76: Search Guidelines: #6

• Consult motif or fingerprint databases to find evidence for conservation of critical functional residues

Motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!

• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data

ClustalW, MUSCLE, T-Coffee, Kalign, MAFFT, Mview, DBClustal (available from EBI FASTA & BLAST services)

Page 77: Search Guidelines: #7

• Low complexity regions (e.g. CA repeats, poly-A tails and proline-rich regions) give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment

Use seg, xnu, dust, CENSOR, etc. BUT be careful about what you filter!!!

Page 78: Search Guidelines: #8

• What about short sequences?

• Depends on their nature:

• Protein: reduce word length and/or increase e-value; use shallow matrices

• DNA: reduce word length (but NOT to 1!); set Threshold for band optimisation (FASTA) to 0; ignore gap penalties (force local alignments only)

Page 79: Accessing Tools at EBI

Access Sequence Similarity Search services over various interfaces:

1) Using your browser

2) Over email

3) Using Web Services (SOAP/REST): Perl, Python, C and Java clients available (See: http://www.ebi.ac.uk/Tools/webservices/)

Taverna & Triana workflows are fully supported (See: http://www.myexperiment.org/)

Page 80: Typical workflow

search → review → check stats → compare → evolution → function

Page 81: Final remarks

• Don’t assume a single tool will cater for all your search needs

• DO change the parameters of the tools

• Remember where the tool excels and what its limitations are

• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

• Crazy input will always give crazy results!

Page 82:

Contacts: http://www.ebi.ac.uk/support/