NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 September 30, 2004 ICGEB.


NCBI Molecular Biology Resources

A Field Guide, part 2

September 30, 2004 ICGEB


Links Between and Within Nodes

[Diagram: Entrez nodes (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structures, Genomes, Taxonomy) joined by computational links: Word weight, BLAST, VAST, Phylogeny]


[Diagram: Text (PubMed), Sequence (BLAST), Structure (VAST)]

PubMed: Computation of Related Articles

The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths.

The value of a term depends on two types of information, global and local:

1) the number of different documents in the database that contain the term (global);

2) the number of times the term occurs in a particular document (local).


Global and local weights

• The global weight of a term is greater for the less frequent terms. The presence of a term that occurred in most of the documents would really tell one very little about a document. On the other hand, a term that occurred in only 100 documents of one million would be very helpful in limiting the set of documents of interest.

• The local weight of a term is the measure of its importance in a particular document. Generally, the more frequent a term is within a document, the more important it is in representing the content of that document. However, this relationship is saturating, i.e., as the frequency continues to go up, the importance of the word increases less rapidly and finally comes to a finite limit.


How we define similar documents

• The similarity between two documents is computed by adding up the weights (local wt1 × local wt2 × global wt) of all of the terms the two documents have in common. This provides an indication of how related two documents are.

• Once the similarity score of a document in relation to each of the other documents in the database has been computed, that document's neighbors are identified as the most similar (highest scoring) documents found. These closely related documents are pre-computed for each document in PubMed.
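The scoring rule described above can be sketched in Python. This is only an illustration, not PubMed's actual implementation: the logarithmic global weight, the saturating local weight `count / (count + k)`, and every function name are assumptions chosen to match the qualitative behaviour on the slides (rare terms weigh more; repeated terms help, but with diminishing returns).

```python
import math
from collections import Counter

def global_weight(term, doc_freq, n_docs):
    """Rarer terms get a larger weight (assumed logarithmic form)."""
    return math.log(n_docs / doc_freq[term])

def local_weight(count, k=1.0):
    """Saturating weight: rises with the count but stays below 1 (assumed form)."""
    return count / (count + k)

def similarity(doc_a, doc_b, doc_freq, n_docs):
    """Sum local_a * local_b * global over the terms both documents contain."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(
        local_weight(counts_a[t]) * local_weight(counts_b[t])
        * global_weight(t, doc_freq, n_docs)
        for t in shared
    )

def neighbors(query_id, docs, top=3):
    """Rank all other documents by similarity to docs[query_id]."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for words in docs.values() for t in set(words))
    scores = {
        other: similarity(docs[query_id], words, doc_freq, n_docs)
        for other, words in docs.items()
        if other != query_id
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Toy corpus (invented for the example)
docs = {
    "d1": "blast sequence alignment score".split(),
    "d2": "blast alignment statistics score".split(),
    "d3": "protein structure vector alignment".split(),
}
print(neighbors("d1", docs))
```

In this toy corpus the word "alignment" occurs in every document, so its global weight is zero and it contributes nothing to any score, which is the behaviour the global-weight bullet describes.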


Related articles: difficult task


E-utilities: Top Level of Entrez


E-utilities course


E-utilities

• A set of seven server-side programs.

• Support a uniform URL syntax.

• Translate a standard set of URL-encoded input parameters into the values required by the programs that make up the Entrez system.


Entrez Functions and E-utilities

• Searches: esearch.fcgi

• DocSums: esummary.fcgi

• Links: elink.fcgi

• Uploads: epost.fcgi

• Downloads: efetch.fcgi

• Global Query: egquery.fcgi

• Information: einfo.fcgi
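Because all seven programs share one URL syntax, a request can be assembled from a function name and a set of parameters. The sketch below is illustrative only: the helper name `eutils_url` and the short function keys are assumptions, and the base URL is the one used in the script on these slides (NCBI now serves the E-utilities over https).

```python
from urllib.parse import urlencode

# Base URL as given on the slides
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Entrez function -> server-side program, as listed above
PROGRAMS = {
    "search": "esearch.fcgi",
    "docsum": "esummary.fcgi",
    "link": "elink.fcgi",
    "upload": "epost.fcgi",
    "download": "efetch.fcgi",
    "global_query": "egquery.fcgi",
    "info": "einfo.fcgi",
}

def eutils_url(function, **params):
    """Build an E-utility URL from a function key and URL-encoded parameters."""
    return EUTILS_BASE + PROGRAMS[function] + "?" + urlencode(params)

print(eutils_url("search", db="pubmed", term="prestin"))
print(eutils_url("docsum", db="protein", id="4758078"))
```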

A DocSum via esummary.fcgi and via the Web

A Simple E-utilities Pipeline

Search for upstream regions of homologous genes

#!/usr/local/bin/perl
# (the line above says where Perl is located)
use LWP::Simple;   # LWP::Simple's get() fetches the content of a URL

# Base URL; each request appends a program name and its parameters
$ebase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

# Read gene names, one per line, from the file named on the command line
while (<>) {
    chomp;
    $gene = $_;
    $term = $gene . "[gene+name]+AND+human[orgn]";   # we are interested in human genes only
    $id = "";   # start each gene with an empty ID list

    # 1. Search in HomoloGene
    $url = $ebase . "esearch.fcgi?db=homologene&term=$term";   # search HomoloGene with the gene name
    $result = get($url);   # download the content of the URL
    # Parse the IDs out of the <Id> lines and join them with commas
    while ($result =~ /<Id>(\d+)<\/Id>/sg) { $id .= "$1,"; }
    chop $id;   # drop the trailing comma

    # 2. Link HomoloGene -> Nucleotide
    $url = $ebase . "elink.fcgi?db=nucleotide&id=$id&dbfrom=homologene";   # link back to nucleotides to get the homolog NM gi's
    $result = get($url);
    $id = "";
    while ($result =~ /<Link>[^<]+<Id>(\d+)<\/Id>/sg) { $id .= "$1,"; }
    chop $id;

    # 3. Link Nucleotide -> Gene

Lots of precomputed data and a little bit of parsing

    $url = $ebase . "elink.fcgi?db=gene&id=$id&dbfrom=nucleotide";   # link to Entrez Gene again to get the genomic coordinates
    $result = get($url);
    @ids = ();   # start each gene with an empty list of Gene IDs
    while ($result =~ /<Link>[^<]+<Id>(\d+)<\/Id>/sg) { push @ids, $1; }
    print STDERR "@ids\n";   # progress output, kept out of the FASTA stream

    foreach $id (@ids) {   # for each linked Gene record

        # 4. Fetch XML document with gene information from Gene
        $url = $ebase . "efetch.fcgi?db=gene&id=$id&retmode=xml";   # the gene report gives the genomic sequence and coordinates
        $result = get($url);
        $result =~ /<Gene-commentary_type value=.genomic.>.+?<Seq-id_gi>(\d+)/s;
        $gi = $1;   # gi of the genomic sequence
        $result =~ /<Seq-interval_from>(\d+)/;  $from = $1;
        $result =~ /<Seq-interval_to>(\d+)/;    $to = $1;
        $result =~ /<Na-strand value="(\w+)"/;  $strand = $1;
        if ($strand eq "minus") { $strand = 2; } else { $strand = 1; }

        # Keep only the 1000 bases upstream of the gene
        if ($strand == 1) { $to = $from; $from -= 1000; }
        else              { $from = $to; $to += 1000; }

        # 5. Fetch upstream sequence from Nucleotide
        $url = $ebase . "efetch.fcgi?db=nucleotide&id=$gi&retmode=text&rettype=fasta&seq_start=$from&seq_stop=$to&strand=$strand";
        $result = get($url);
        $result =~ s/>ref/>lcl|$gene|/;   # label the FASTA header with the gene name
        print "$result";
    }
}
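The coordinate arithmetic in step 4 is the part of the script that is easiest to get wrong, so it may help to see it in isolation. The Python function below mirrors the script's logic only; the name `upstream_window` is hypothetical, and like the script it does not guard against a gene that lies within 1000 bases of the start of the sequence.

```python
def upstream_window(seq_from, seq_to, strand, length=1000):
    """Return (start, stop, strand_code) for the region upstream of a gene.

    strand_code follows the script: 1 = plus strand, 2 = minus strand.
    On the plus strand "upstream" lies before seq_from; on the minus
    strand it lies after seq_to.
    """
    if strand == "minus":
        return seq_to, seq_to + length, 2
    return seq_from - length, seq_from, 1

# Hypothetical coordinates, for illustration only
print(upstream_window(5000, 9000, "plus"))
print(upstream_window(5000, 9000, "minus"))
```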


A General Design Approach

• Know what you want before you begin
– Do I need the full record? (EFetch)
– Will a DocSum be sufficient? (ESummary)

• Know what Entrez database contains the data you want
– If it’s not in Entrez, the eUtils can’t access it

• Try your pipeline in interactive web Entrez first
– Some Entrez queries may surprise you
– Some Entrez data may surprise you
– Some Entrez links may surprise you


Others use E-utilities too: PubCrawler

MedBlast: searching for articles related to a sequence.


Why Regulate?

Fairness issue. The gate is only so wide. Scripts use the resources of many to satisfy a few.


Scripts are like “fat” bunnies!!!


Web Servers and Browsers

• Your browser makes one connection.

• Each server has a finite number of slots.

• A slot is allotted to a connection first come, first served.

• Connections are (typically) not persistent.

• Scripts use more slots, and approach a “persistent” connection.


Normal Use


Scripting


Detection

• Weblogs are monitored by a script.
• Alarm e-mails are sent hourly, and a daily summary is sent once a day.
• Analysis – copyright versus volume. Not automatic!
• Blocking occurs.
– Copyrighted material can be very light in volume.
– BLAST is “sensitive” and can also be light in volume.
– Entrez and PubMed are mostly a “fairness” issue.


How you are blocked.

• The IP address is blacklisted from the main NCBI web servers.

• You get a very obvious error message.

• Remember Spock: “The needs of the many outweigh the needs of the few.”


How to avoid blockage.

• Plan your project.
• Can I use other methods?
– FTP
– Batch Entrez
• Write good scripts.
– Expect errors
– Multiple UIDs
• Follow the E-utils recommendations.
• Ask us for advice.


Recommendations.

• Use ‘eutils.ncbi.nlm.nih.gov’.

• Use the &tool and &email fields.

• Do not submit more than one request every 3 seconds.

• Limit to 9 PM – 5 AM EST (our time).
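One way a script might honour the 3-second rule and identify itself is sketched below. The class name `Throttle`, the helper `tagged`, and the tool and e-mail values are placeholders invented for this example; only the `tool` and `email` parameter names and the 3-second interval come from the recommendations above.

```python
import time

class Throttle:
    """Allow at most one request per min_interval seconds."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable, so the class can be tested without waiting
        self.sleep = sleep
        self.last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now

def tagged(params, tool="my_script", email="user@example.org"):
    """Return a copy of params with the &tool and &email fields added."""
    return {**params, "tool": tool, "email": email}

# Typical use: call throttle.wait() immediately before every request.
throttle = Throttle()
print(tagged({"db": "pubmed", "term": "prestin"}))
```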


[Diagram: Text (PubMed), Sequence (BLAST), Structure (VAST)]


BLAST®

Basic Local Alignment Search Tool

• Why align sequences? Because it is the best way to infer structure-function relationships for unknown biomolecules
• Global vs local alignments
• BLAST basics
• MegaBLAST
• Discontiguous MegaBLAST


Basic Local Alignment Search Tool

• Calculates similarity for biological sequences
• Finds best local alignments
• Heuristic approach based on the Smith-Waterman algorithm
• Searches for matching “words” and then extends the hits
• Uses statistical theory to determine if a match might have occurred by chance

Global Alignment

Local alignment:

H: 15  IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
Worm:  63  VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H: 85  GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
Worm:  472 SDPDKRPTFETLQWKLEDL 492

Global alignment, first and last rows (Align program, Lipman and Pearson):

human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA

human REQLEHI--------KTHELHL
worm  QWKLEDLFNLDSSEYKEASINF

How BLAST Works

Make a lookup table of all “words” in the query

Scan the database for matching words

Initiate extensions from these matches

Words

Query: GTQITVEDLFYNIATRRKALKN

Word Size = 3

GTQ TQI QIT ITV TVE VED EDL DLF LFY …

Neighborhood Words: LTV, MTV, ISV, LSV, etc.

Word size is adjustable: 2 or 3 for protein (3 default); 7 or more for blastn (11 default)

Make a lookup table of words
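The lookup-table step can be sketched in a few lines of Python. This is a simplified illustration with assumed function names: it indexes only the exact words of the query, whereas protein BLAST also adds neighborhood words that score above a threshold against the substitution matrix.

```python
from collections import defaultdict

def build_lookup(query, word_size=3):
    """Map every word of length word_size to the query positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table

def scan(subject, table, word_size=3):
    """Return (query_pos, subject_pos) for every exact word match in subject."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits

query = "GTQITVEDLFYNIATRRKALKN"   # the query from the slide
table = build_lookup(query)
print(list(table)[:9])              # the first nine words, as listed on the slide
print(scan("AAITVEDLF", table))     # "AAITVEDLF" is a made-up database sequence
```

Each hit returned by `scan` is a seed from which an extension would be initiated in the next step.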

Scan Database… Initiate Extensions

Protein BLAST requires two hits: two neighborhood words (threshold score)

GTQITVEDLFYNI
<------ TVE FFN ------>

Nucleotide BLAST requires exact matches: exact word match

ATCGCCATGCTTAATTGGGCTT
<------ CATGCTTAATT ------>

An Alignment That BLAST Can’t Find…

1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

…but the corresponding amino acid sequences are conserved much better


Protein alignment looks good

…and they have the same domains, too

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments).

[Plot: number of alignments versus score]

Expect Value: E = number of database hits you expect to find by chance

$E = K m n e^{-\lambda S}$

$E = m n 2^{-S'}$

K = scale for search space
λ = scale for scoring system
S' = bit score = $(\lambda S - \ln K)/\ln 2$

In these formulas, E is the expected number of random hits, n is the size of the database, m is the length of the query, and S is your score.
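The two forms of the Expect value formula are equivalent, which a short Python check makes concrete. The function names are assumptions, and the values of λ, K, m, n and S below are illustrative placeholders rather than parameters taken from any particular search.

```python
import math

def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expect_from_raw(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)"""
    return K * m * n * math.exp(-lam * raw_score)

def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)

# Placeholder parameters for illustration
lam, K = 0.318, 0.134
m, n = 450, 1_000_000_000   # query length, database size in letters
S = 100                     # raw alignment score

bits = bit_score(S, lam, K)
print(round(bits, 1))
print(expect_from_raw(S, m, n, lam, K))
print(expect_from_bits(bits, m, n))     # same value, up to rounding error
```

The bit score folds λ and K into the score itself, which is why bit scores from searches with different scoring systems can be compared directly.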

Scoring Systems - Nucleotides

Identity matrix

     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
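The raw score on this slide can be reproduced with a few lines of Python. The match and mismatch values come from the identity matrix above; the gap cost of 3 is an assumption, chosen because it is the value that makes the slide's arithmetic (19 matches minus 9) come out with two mismatches and one gapped position. Real BLAST searches use separate gap-opening and gap-extension costs.

```python
def raw_score(seq_a, seq_b, match=1, mismatch=-3, gap=-3):
    """Score two aligned sequences column by column ('-' marks a gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap          # assumed flat cost per gapped column
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))   # 19 matches, 2 mismatches, 1 gap -> 10
```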

Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST


BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Options for Advanced Blast: Protein

Matrix Selection
• PAM30 -- most stringent
• BLOSUM45 -- least stringent

Example Entrez queries
proteins all[Filter] NOT mammalia[Organism]
green plants[Organism]
srcdb refseq[Properties]

Other advanced
-W 2 word size
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

Limit by taxon
Mus musculus[Organism]
Mammalia[Organism]
Viridiplantae[Organism]

Options for Advanced Blasting: Nucleotide

Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
biomol genomic[Properties]

Other Advanced
-W 7 word size
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

Homology Searches

Find a homolog of human CSK in C. elegans

Query = c-src tyrosine kinase (CSK) NP_004374 (450 aa) [Homo sapiens]

Database = NCBI protein nr
Entrez limit: Caenorhabditis elegans [ORGN]

Program = BLASTP

Hits to the Conserved Domain Database:

Query=
>gi|4758078|ref|NP_004374.1| c-src tyrosine kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSIDEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPVKWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWHLDAAMRPSFLQLREQLEHIKTHELHL


BLAST Graphical Overview

SH3 SH2 tyr kinase domain

BLAST Alignments

gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans]

gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans]
Length = 507

Score = 290 bits (742), Expect = 1e-78
Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)

3D Domains

[Figure labels: SH3, SH2, TyrKc]


Low Complexity Filtering

[Figure labels: Filtered, Unfiltered]

sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)

Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
Sbjct: 29  SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88


PSI-BLAST

Position-Specific Iterated BLAST

• Mining for protein domains
• Confirming relationships among related proteins

Position Specific Substitution Rates

[Figure labels: Active site serine; Weakly conserved serine]

Position Specific Score Matrix (PSSM)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 D  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Active site nucleophile

Serine scored differently in these two positions (211 and 216)

PSI-BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

e value cutoff for PSSM

RESULTS: Initial BLASTP

Same results as protein-protein BLAST

Results of First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST


Third PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Check to add to PSSM

MegaBLAST

AI217550
AI251192
AI254381
BE645079

C:\seq\hs.4.fsa

> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT

What is Discontiguous (Cross-species) MegaBLAST?

W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
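In each template above, W is the number of positions that must match (the 1s) and t is the template length. A minimal Python sketch of how such a template picks a discontiguous word out of a sequence window follows; the function name and the example sequence are invented for illustration, and the real MegaBLAST implementation is not shown on the slides.

```python
def discontiguous_word(seq, start, template):
    """Keep the bases of seq[start:start+len(template)] where the template has '1'."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")

CODING_W11_T16 = "1101101101101101"   # first template in the list above

seq = "ATGGCTAAGCTTGACCGT"            # made-up sequence
word = discontiguous_word(seq, 0, CODING_W11_T16)
print(word, len(word))

# Changing a base under a '0' (here the third base, G -> A) leaves the word unchanged
variant = "ATAGCTAAGCTTGACCGT"
print(discontiguous_word(variant, 0, CODING_W11_T16) == word)
```

The coding templates place a 0 at every third position, so differences confined to those positions do not prevent a word match; this is what makes the approach useful for cross-species comparison of coding sequence.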


Neighbors: Precomputed BLAST

Nucleotide

Protein

Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.

BLink – Protein BLAST Alignments

• Lists only 200 hits
• List is nonredundant

BLAST Databases: Non-redundant protein

nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein
  • PIR, Swiss-Prot, PRF
– PDB (sequences from structures)

BLAST Databases: Nucleic Acid

• nr (nt)
– Traditional GenBank Divisions
– NM_ and XM_ RefSeqs

• dbest – EST Division

• htgs – HTG division

• gss – GSS division

• chromosome – NC_ RefSeqs

• wgs – whole genome shotgun


Genomic BLAST

• These pages provide customized nucleotide and protein databases for each genome
• If a Map Viewer is available, the BLAST hits can be viewed on the maps

What if Your Favorite Gene is not found in the latest genome build?

POSSIBLE VARIANTS:

• The gene does not exist;

• It exists, but there is a problem with assembly;

• It exists, but there is a problem with annotation

An example: finding prestin in the human genome

• We start with rat prestin, BLAST it against the human genome, and look for evidence that human prestin exists as well.


Searching the Human Genome

>gi|12188917|emb|AJ303372.1|RNO303372 Rattus norvegicus

ATGGATCATGCTGAAGAAAATGAAATTCCTGCAGAGATCAGAAGTACCTCGTGGAA

GTCATCCGGTCCTCCAGGAGAGGCTGCACGTCAAGGACAAAGTCACAGACTCCATC

GCAGGCATTCACGTGCACTCCTAAAAAAGTAAGAAACATCATCTACATGTTCTTGC

TTGCCAGCATATAAATTCAAGGAGTATGTGCTGGGTGACTTGGTCTCGGGCATAAG

AGCTCCCCCAAGGCTTAGCCTTCGCGATGCTGGCAGCTGTGCCTCCGGTGTTCGGC

On for same species comparisons

BLAST Results

16 hits to one contig

Human Genome Database
953 contigs
2.9 billion letters

Map Viewer: Genomic Context of BLAST Hits

[Map Viewer screenshot; labeled tracks: Genes, Genome Scan, Models, Human EST hits, Mouse EST hits, Contig, GenBank]

Human prestin: now appears in Build 34


Now we can compare genes


Three prestin genes: finally together!


Same prestin, different assemblies

Does homology mean common biological function?

• Not always; the existence of a common ancestor does not guarantee that some function won’t be lost or acquired after the divergence.

An example: zeta-crystallin is a component of the transparent lens matrix of the vertebrate eye. Its homolog in E. coli is the metabolic enzyme quinone oxidoreductase.


[Diagram: Text (Entrez), Sequence (BLAST), Structure (VAST)]

Structure similarity: No More BLASTing!

• Three-dimensional structures are the most conserved during evolution;

• One can still detect the existence of a common ancestor based on structural similarity;

• Spatial similarity is not calculated the same way as sequence similarity

VAST: Structure Neighbors

Vector Alignment Search Tool

For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors.

[Figure: Human IL-4, with SSE vectors numbered 1 to 6]


VAST: Structure Neighbors

Structure Neighbors in Cn3D

[Figure: C-Src kinase, Human vs. Chicken; labeled domains: SH3, SH2]

3D Domain Neighbors

[Figure: Human C-Src Kinase (Tyr) vs. Chk1 kinase (Ser/Thr)]


NCBI is changing

From a sequence data storage facility to a one-stop shop with integrated databases of various kinds.

You can be part of the future – work with us! Your expertise and data are indispensable.


GenBank


Refseq


Entrez Gene


Homologene database


New generation of databases: an example

Protein interaction database: a seed for future precomputed resources


New databases: GenSAT


PubChem


Headache? Take Aspirin


Aspirin has 432 neighbors


Link to 3D protein structures

For More Information…

E-mail addresses
• General Help [email protected]
• BLAST [email protected]

The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html

The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html

The NCBI Handbook
Follow the link from the NCBI Home Page