Databases indexation Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node

Page 1

Databases indexation

Laurent Falquet, EPFL, March 2005

Swiss Institute of Bioinformatics, Swiss EMBnet node

Page 2

Overview

Data access concept: sequential, direct

Indexing: EMBOSS, Fetch, other

BLAST: why indexing?, formatdb, parsing output

Excel import/export: tab-delimited, comma-delimited

Page 3

Data access: sequential vs direct

Sequential access: access times vary from very short to very long

Direct access: access times show very small variations

[Figure: disk diagram labelled track, sector, head]

Page 4

Similar concept for databases

Flat files = sequential

Indexing = simulated direct

>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt

ID    Position (byte)  Length (byte)
SEQ1  0                19
SEQ2  19               28
SEQ3  47               23
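
The index table above can be reproduced with a short Python sketch. It is only an illustration of the idea, not the code of any tool named in these slides: the function names build_fasta_index and fetch_entry are made up here, and the sketch assumes Unix line endings (one byte per newline), which is what the byte counts in the table imply.

```python
import os
import tempfile


def build_fasta_index(path):
    """Scan a FASTA file once and record (byte offset, byte length) per entry."""
    index = {}
    current_id = None
    start = 0
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if current_id is not None:
                    index[current_id] = (start, offset - start)
                # Key = first word after '>', upper-cased as in the table above
                current_id = line[1:].split()[0].decode("ascii").upper()
                start = offset
            offset += len(line)
    if current_id is not None:
        index[current_id] = (start, offset - start)
    return index


def fetch_entry(path, index, seq_id):
    """Simulated direct access: seek() to the entry instead of reading the file."""
    position, length = index[seq_id]
    with open(path, "rb") as handle:
        handle.seek(position)
        return handle.read(length).decode("ascii")


if __name__ == "__main__":
    fasta = (b">seq1\ncgatgtcatgtg\n"
             b">seq2\ncgatcgtagctgtagctgtag\n"
             b">seq3\ncatgtgcatgcgacgt\n")
    with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
        tmp.write(fasta)
    try:
        idx = build_fasta_index(tmp.name)
        print(idx)
        print(fetch_entry(tmp.name, idx, "SEQ2"))
    finally:
        os.remove(tmp.name)
```

The flat file is read sequentially only once, when the index is built. Every later lookup costs one dictionary access and one seek, whatever the position of the entry in the file.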

Page 5

Tools

EMBOSS
Indexing: dbiflat, dbifasta, dbiblast
Retrieval: seqret, seqretsplit, entret

Other examples
SRS (Icarus language): http://srs.ebi.ac.uk http://www.lionbioscience.com/
indexer & fetch (warning: local SIB tool)
Relational (MySQL, Oracle…)

Page 6

EMBOSS: how to index?

Where is your file?
What is the format?
Where should the indices be?
Where is the emboss.default file? (.embossrc)

Other EMBOSS tools: textsearch, whichdb

Page 7

EMBOSS example

Input file and directory
~/embossidx/ECOLI.dat
cd embossidx

Index creation
dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC

Generates 4 files: acnum.hit, acnum.trg, division.lkp, entrynam.idx

Don't forget to modify ~/.embossrc

Page 8

.embossrc

Example of queries
seqret ecoli:thio_ecoli
seqret ecoli:P00274
entret ecoli:thio_ecoli
and even
seqret 'ecoli:*_ECOLI'

Content of ~/.embossrc

set emboss_filter 1

# Ecoli
DB ecoli [
  type: P
  comment: "E.coli proteome"
  method: emblcd
  format: swiss
  dir: "~/embossidx"
  file: "ECOLI.dat"
  release: "1.0"
  indexdir: "~/embossidx"
]

Page 9

Indexer & fetch

Warning: this is a local SIB tool!

Input file and directory
~/embossidx/ECOLI.dat
cd embossidx

Index creation
indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx

Generates 1 file: ecoli.idx

Don't forget to modify the config file
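
What a pattern-driven indexer of this kind might do can be sketched in Python. The sketch relies only on guesses drawn from the regular expressions in the command line: '^ID' is taken to mark the first line of an entry, '^//' the last one, and the capture group of '^ID\s+(\S+)' the lookup key. Lower-casing the key is a further guess, suggested by queries such as ecoli:thio_ecoli. The real SIB tool, its flags and its index file format are not documented in these slides, and index_flat_file is a hypothetical name.

```python
import re

# Shortened, made-up sample in Swiss-Prot flat-file style (not complete records)
SAMPLE = (b"ID   THIO_ECOLI     STANDARD;      PRT;   109 AA.\n"
          b"AC   P00274;\n"
          b"//\n"
          b"ID   ACCA_ECOLI     STANDARD;      PRT;   318 AA.\n"
          b"AC   P30867;\n"
          b"//\n")


def index_flat_file(data, header=rb"^ID", terminator=rb"^//",
                    key=rb"^ID\s+(\S+)", ignore_case=True):
    """Return {key: (byte offset, byte length)} for each entry found in `data`."""
    header_re = re.compile(header)
    terminator_re = re.compile(terminator)
    key_re = re.compile(key)
    index = {}
    offset = 0          # byte position of the current line
    start = None        # byte position where the open entry began
    entry_key = None
    for line in data.splitlines(keepends=True):
        if start is None and header_re.search(line):
            start = offset
            match = key_re.search(line)
            entry_key = match.group(1).decode("ascii") if match else None
            if entry_key and ignore_case:
                entry_key = entry_key.lower()   # assumption, see text above
        offset += len(line)
        if start is not None and terminator_re.search(line):
            if entry_key:
                index[entry_key] = (start, offset - start)
            start, entry_key = None, None
    return index


if __name__ == "__main__":
    for name, (position, length) in index_flat_file(SAMPLE).items():
        print(name, position, length)
```

Because the entry boundaries and the key are given as patterns, the same routine could index other flat-file formats by changing only the three regular expressions.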

Page 10

Config file: fetch.conf

Example of queries
fetch -c fetch.conf ecoli:thio_ecoli
fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'

fetch.conf
#dbkey format indexfile datafile
ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat

Page 11

BLAST

Maintained at NCBI

Source distributed freely with several accessory tools
ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz

Requires compilation to install on your local computer

blastall contains: blastp, blastn, blastx, tblastn, tblastx

Other tools: blastpgp, megablast, formatdb

Page 12

Available Blast programs

Program   Query                      Database
blastp    protein                VS  protein
blastn    nucleotide             VS  nucleotide
blastx    nucleotide -> protein  VS  protein
tblastn   protein                VS  protein <- nucleotide
tblastx   nucleotide -> protein  VS  protein <- nucleotide

(-> and <- : nucleotide sequence translated into protein before comparison)

Page 13

What makes BLAST so fast?

Indexing all words of 3 aa or 11 bp in the sequence database

Listing, for each word of the query, all words with a score > T

Searching the indexed database for all perfect matches to these words

Trying to align matches that are on the same diagonal
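
The indexing and diagonal steps can be illustrated with a small Python sketch. It is a deliberate simplification and not BLAST's actual implementation: only the exact words of the query are looked up (the score > T step is left out), and the names index_words and hits_by_diagonal are invented for this example. A diagonal is identified here by the difference between the position in the database sequence and the position in the query.

```python
from collections import defaultdict


def index_words(database, word_size=3):
    """Map each word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - word_size + 1):
            index[sequence[pos:pos + word_size]].append((name, pos))
    return index


def hits_by_diagonal(query, index, word_size=3):
    """Look up every query word and group the perfect matches by (sequence, diagonal)."""
    diagonals = defaultdict(list)
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for name, db_pos in index.get(word, []):
            diagonals[(name, db_pos - q_pos)].append(q_pos)
    return dict(diagonals)


if __name__ == "__main__":
    # Toy database; the sequences are invented for the example
    database = {"s1": "MKRSLTVFAA", "s2": "GGRSLAA"}
    word_index = index_words(database)
    for (name, diagonal), positions in hits_by_diagonal("RSLTVF", word_index).items():
        print(name, diagonal, positions)
```

The database is indexed once. Each query then costs one dictionary lookup per word instead of a scan of every sequence, and several hits sharing one diagonal point to a region worth aligning.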

Page 14

Indexing for Blast (1)

[Figure: generation of the word list from the query]

Query word: REL

List of all possible words with 3 amino acid residues (8000): AAA, AAC, AAD, ..., YYY

List of words matching the query with a score > T: ..., ACT, RSL, TVF, ...

Words with a score < T (e.g. LKP) are not kept

A substitution matrix is used to compute the word scores
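
The selection of words scoring above T can be sketched as follows. Note the loud assumption: toy_score is NOT a real substitution matrix such as BLOSUM62, it simply gives +4 for identical residues and -1 otherwise, so the resulting word list differs from what BLAST would keep. The function names are invented for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(a, b):
    """Toy substitution score (assumption, not BLOSUM62): +4 identity, -1 otherwise."""
    return 4 if a == b else -1


def neighborhood_words(query_word, threshold, score=toy_score):
    """Return every word of the same length that scores > threshold against query_word."""
    words = []
    # For 3 residues this enumerates the 20**3 = 8000 possible words
    for candidate in product(AMINO_ACIDS, repeat=len(query_word)):
        total = sum(score(q, c) for q, c in zip(query_word, candidate))
        if total > threshold:
            words.append("".join(candidate))
    return words


if __name__ == "__main__":
    kept = neighborhood_words("REL", threshold=6)
    print(len(kept), "words kept out of", len(AMINO_ACIDS) ** 3)
    print("RSL" in kept, "LKP" in kept)
```

With a real matrix the score function would return the matrix entry for the residue pair; the enumeration and the threshold test stay the same.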

Page 15

Indexing for Blast (2)

[Figure: lookup of the word list in the database sequences]

List of words matching the query with a score > T: ..., ACT, RSL, TVF, ...

Search for exact matches of these words in the database sequences

List of sequences containing words similar to the query (hits)

Page 16

Indexing for Blast (3)

[Figure: two plots of query against database sequence; A marks the distance between two hits on one diagonal]

Ungapped extension if 2 "hits" are on the same diagonal at a distance less than A

Extension using dynamic programming, limited to a restricted region through a score drop-off threshold
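
The two-hit rule and the idea of a score drop-off can be sketched in Python. This is a simplified illustration under stated assumptions: the scoring function is the toy +4/-1 scheme rather than a real matrix, the extension shown is the ungapped one only (the gapped dynamic-programming step is not reproduced), and the names triggers_extension and extend_right are hypothetical.

```python
def toy_score(a, b):
    """Toy substitution score (assumption, not a real matrix): +4 identity, -1 otherwise."""
    return 4 if a == b else -1


def triggers_extension(positions, max_distance):
    """True if two hits on one diagonal lie less than max_distance (the slide's A) apart."""
    ordered = sorted(positions)
    return any(0 < b - a < max_distance for a, b in zip(ordered, ordered[1:]))


def extend_right(query, subject, q_start, s_start, x_drop, score=toy_score):
    """Ungapped extension to the right along one diagonal.

    Stops as soon as the running score falls more than x_drop below the best
    score seen so far, and returns (best score, length giving that score).
    """
    best = running = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        running += score(query[q_start + i], subject[s_start + i])
        if running > best:
            best, best_len = running, i + 1
        elif best - running > x_drop:
            break
        i += 1
    return best, best_len


if __name__ == "__main__":
    print(triggers_extension([0, 10], max_distance=40))
    print(extend_right("RSLTVFKKKK", "RSLTVFAAAA", 0, 0, x_drop=2))
```

The drop-off threshold is what keeps the extension cheap: residues beyond the point where the score has clearly decayed are never examined.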

Page 17

BLAST indexing with formatdb

formatdb
mydb.seq must contain sequences in FASTA format
formatdb -i mydb.seq -p T -n mydb

Generates 3 files: mydb.psq, mydb.pin, mydb.phr

Then start a Blast:
blastall -p blastp -d mydb -i myseq (-optional parameters)

Page 18

Blast local vs remote

blastall
Executed locally
Slow
No need to transfer the db

blastall.remote
Executed remotely
Fast
Requires special privileges and db transfer

Page 19

Multiple Blasts?

1 seq vs db seq
1 FASTA seq as input

db seq vs db seq
Several single-sequence FASTA files as input, or 1 multiple-sequence FASTA file as input

Possibility to export results as XML

Use Perl to automate the queries and parse the output

Page 20

Parsing Blast output

BLASTP 2.2.10 [Oct-19-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyltransferase subunit alpha (EC 6.4.1.2).
         (325 letters)

Database: ecoli_blast
           4339 sequences; 1,373,039 total letters

Searching.........done

                                                                    Score    E
Sequences producing significant alignments:                        (bits) Value

ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72

Page 21

Parsing Blast output (2)

>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
          Length = 318

 Score = 266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
            AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

Query: 302 VQQRYEKYKAIG 313
             +RY++  + G
Sbjct: 305 KNRRYQRLMSYG 316
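
The statistics lines of such a report (Score, Expect, Identities, Positives, Gaps) are regular enough to be picked apart with one regular expression. The Python sketch below targets only the text layout shown on this slide; other BLAST versions may format these lines differently, and a maintained parser is the safer choice for real work. The name parse_hsp_stats and the dictionary keys are invented for the example.

```python
import re

HSP_STATS = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s+bits\s+\((?P<raw>\d+)\),\s*"
    r"Expect\s*=\s*(?P<evalue>[\d.eE+-]+)\s+"
    r"Identities\s*=\s*(?P<ident>\d+)/(?P<length>\d+)\s+\(\d+%\),\s*"
    r"Positives\s*=\s*(?P<pos>\d+)/\d+\s+\(\d+%\)"
    r"(?:,\s*Gaps\s*=\s*(?P<gaps>\d+)/\d+\s+\(\d+%\))?"
)


def parse_hsp_stats(text):
    """Extract the numeric fields of one alignment header, or None if not found."""
    match = HSP_STATS.search(text)
    if match is None:
        return None
    evalue = match.group("evalue")
    if evalue.startswith(("e", "E")):
        evalue = "1" + evalue  # values printed without mantissa, e.g. "e-100"
    return {
        "bits": float(match.group("bits")),
        "raw_score": int(match.group("raw")),
        "evalue": float(evalue),
        "identities": int(match.group("ident")),
        "positives": int(match.group("pos")),
        "gaps": int(match.group("gaps") or 0),  # the Gaps field is optional
        "align_length": int(match.group("length")),
    }


if __name__ == "__main__":
    report = (" Score = 266 bits (681), Expect = 1e-72\n"
              " Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)\n")
    print(parse_hsp_stats(report))
```

Because the pattern uses \s+ between fields, it accepts the statistics whether they sit on one line or are split over two.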

Page 22

Parsing Blast output (3)

With BioPerl:

#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# Open the BLAST report given as first command-line argument
my $blast_report = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $ARGV[0]);

print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

# One result per query, one hit per database sequence, one HSP per alignment
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    while (my $hit = $result->next_hit()) {
        print "\t\t", $hit->name(), "\t", $hit->description();
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}
exit 0;

Page 23

MS-Excel import/export

Excel can import: tab-delimited, comma-delimited

Excel can export: tab-delimited, space-delimited

AC/ID       desc                          score  e-value
THIO_ECOLI  thioredoxin Escherichia coli  234    2.1e-5
THIO_HUMAN  thioredoxin Homo sapiens      120    0.001

Page 24

MS-Excel import/export

Tab-delimited file:
\t delimits the columns
\n delimits the lines
Optional first line contains the column titles

Example:

AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
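
A tab-delimited file of this shape can be produced with Python's standard csv module, as in the minimal sketch below. The helper name to_tab_delimited is made up for the example, and the values are passed as strings so that figures such as 2.1e-5 are written exactly as given.

```python
import csv
import io


def to_tab_delimited(header, rows):
    """Return header and rows as tab-delimited text with \\n line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    header = ["AC/ID", "desc", "score", "e-value"]
    rows = [
        ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
        ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
    ]
    print(to_tab_delimited(header, rows), end="")
```

Writing to a file instead of a string only requires passing an open file handle (opened with newline="") to csv.writer.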

Page 25

MS-Excel import/export

Comma-delimited file:
, delimits the columns
Each value is surrounded by ' '
\n delimits the lines
Optional first line contains the column titles

Example:

'AC/ID','desc','score','e-value'\n
'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
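
The quoted, comma-delimited layout can be written and read back with Python's csv module, as sketched below. The single quote is used as quote character because that is what this slide shows; Excel's own CSV files normally use double quotes, in which case quotechar='"' would be passed instead. The helper names are invented for the example.

```python
import csv
import io


def to_comma_delimited(header, rows, quotechar="'"):
    """Write every value surrounded by quotechar, columns separated by commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar=quotechar,
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def from_comma_delimited(text, quotechar="'"):
    """Parse the text back into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar=quotechar))


if __name__ == "__main__":
    header = ["AC/ID", "desc", "score", "e-value"]
    rows = [
        ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
        ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
    ]
    text = to_comma_delimited(header, rows)
    print(text, end="")
    print(from_comma_delimited(text))
```

Quoting every value is what allows a description to contain a comma without being split into two columns.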