Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005...
![Page 1: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/1.jpg)
Overview of Biological Databases
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 6, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Most slides are taken from NCBI field guide at the web site http://www.ncbi.nlm.nih.gov/
![Page 2: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/2.jpg)
The Central Dogma & Biological Data
• Original DNA Sequences (Genomes)
• Expressed DNA sequences (= mRNA sequences = cDNA sequences)
• Expressed Sequence Tags (ESTs)
• Protein Sequences
– Inferred
– Direct sequencing
• Protein structures
– Experiments
– Models (homologues)
• Literature information
![Page 3: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/3.jpg)
Entrez Integrates Most of Them!
Entrez
Nucleotide
PubMed
Protein
Taxonomy
Structure
Domains
3D Domains
Journals
PMC
OMIM
Books
PopSet
SNP
UniGene
UniSTS
Genome
Gene
GEO
GEO Datasets
MeSH
Cancer Chromosomes
HomoloGene
![Page 4: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/4.jpg)
Outline
• NCBI & Entrez
• Major Biological Databases
• Using Entrez
![Page 5: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/5.jpg)
Some background about Entrez…
![Page 6: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/6.jpg)
The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD
![Page 7: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/7.jpg)
Web Access: http://www.ncbi.nlm.nih.gov
![Page 9: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/9.jpg)
Number of Users and Hits Per Day
[Chart: number of users per day (y-axis, 0 to 450,000) by year, 1997 to 2003; annotation: Christmas & New Year’s Days]
Currently averaging 10,000,000 to 50,000,000 hits per day!
![Page 10: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/10.jpg)
Major Biological Databases
![Page 11: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/11.jpg)
Entrez: Database Integration
[Diagram: Entrez databases joined by hard links; each core database also has computed neighbors]
• 3-D Structure: neighbors are related structures (VAST)
• Nucleotide sequences: neighbors are related sequences (BLAST)
• Protein sequences: neighbors are related sequences (BLAST); BLink, Domains
• Taxonomy: phylogeny
• PubMed abstracts: neighbors are related articles (word weight)
• Other linked databases: Genome, Gene, OMIM, Cancer Chromosome, CDD, 3D domain, PubChem, Books, PMC, SNP, Genome Project, HomoloGene, UniGene, GEO
![Page 12: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/12.jpg)
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, RefSNP, GEO Datasets, UniGene, TPA, NCBI Protein, Structure, Conserved Domain
![Page 13: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/13.jpg)
Primary vs. Derivative Sequence Databases
[Diagram: sequencing centers and labs submit sequence fragments to GenBank; algorithms build UniGene and curators build RefSeq and genome assemblies from GenBank records]
• GenBank: updated ONLY by submitters
• UniGene, RefSeq, Genome Assembly: updated continually by NCBI
![Page 14: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/14.jpg)
Entrez Nucleotides
[Chart: share of records (0% to 100%) by source: GenBank, RefSeq, TPA, PDB]
Primary
• GenBank / EMBL / DDBJ 57,172,944
Derivative
• RefSeq 1,278,742
• Third Party Annotation 4,653
• PDB 5,973
Total 58,462,312
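The percentages plotted for Entrez Nucleotides can be recomputed from the record counts on the slide. The following is a minimal sketch; the dictionary and function names are illustrative and not part of any NCBI tool.

```python
# Record counts copied from the Entrez Nucleotides slide.
counts = {
    "GenBank / EMBL / DDBJ": 57_172_944,   # primary
    "RefSeq": 1_278_742,                   # derivative
    "Third Party Annotation": 4_653,       # derivative
    "PDB": 5_973,                          # derivative
}


def shares(record_counts):
    """Return each source's percentage of the total record count."""
    total = sum(record_counts.values())
    return {name: 100.0 * n / total for name, n in record_counts.items()}


total = sum(counts.values())
percentages = shares(counts)

for name, pct in percentages.items():
    print(f"{name:25s} {pct:7.3f}%")
print(f"{'Total':25s} {total:,}")
```

The sum of the four counts matches the total printed on the slide, and the primary databases account for the overwhelming majority of records.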
![Page 15: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/15.jpg)
Entrez Protein: Derivative Databases
[Chart: share of records (0% to 100%) by source: GenPept, RefSeq, TPA, SwissProt, PIR, PRF, PDB]
GenPept 3,515,141
RefSeq 1,802,523
Third Party Annotation 4,217
Swiss Prot 189,324
PIR 222,232
PRF 12,079
PDB 68,621
Total 5,814,137
BLAST nr total 2,726,372
![Page 16: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/16.jpg)
Database 1: GenBank NCBI’s Primary Sequence Database
![Page 17: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/17.jpg)
What is GenBank?
• Nucleotide-only sequence database
• Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database
![Page 18: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/18.jpg)
International Sequence Database Collaboration
[Diagram: three databases that each take submissions and updates and exchange them with one another]
• GenBank (NCBI, NIH), searched through Entrez
• EMBL (EBI), searched through SRS
• DDBJ (NIG, CIB), searched through getentry
![Page 19: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/19.jpg)
GenBank Divisions
“Organismal”
PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/BankIt)
• Accurate (~1 error per 10,000 bp)
• Well characterized
“Functional”
EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual
• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized
![Page 20: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/20.jpg)
GenBank Functional (Bulk) Divisions
[Diagram: GenBank with its bulk divisions EST, STS, GSS, HTG]
• Expressed Sequence Tag
– 1st pass single read cDNA
• Genome Survey Sequence
– 1st pass single read gDNA
• High Throughput Genomic
– incomplete sequences of genomic clones
• Sequence Tagged Site
– PCR-based mapping reagents
Whole Genome Shotgun
![Page 21: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/21.jpg)
EST Division: Expressed Sequence Tags
[Diagram: RNA gene products from the nucleus (30,000 genes); make cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5’ and 3’)]
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
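The two EST reads above are single-pass sequences, and the uncalled bases show up as `N`. A minimal sketch of how such FASTA records might be parsed and their `N` content measured follows. The function names are hypothetical, and the example text uses only the first 30 bases of each read from the slide.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (defline, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def n_fraction(seq):
    """Fraction of bases reported as N (uncalled) in a sequence."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


# First 30 bases of each read from the slide (truncated for illustration).
example = """>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTG
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCC
"""

records = parse_fasta(example)
for defline, seq in records:
    print(defline, len(seq), round(n_fraction(seq), 3))
```

Running the same functions over the full reads would give a rough per-read quality indicator, which is one reason the functional divisions are described as less accurate than the organismal ones.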
![Page 22: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/22.jpg)
ESTs in Entrez
Total 28 million records
Human 6.0 million
Mouse 4.3 million
Rat 0.7 million
Zebrafish 0.6 million
Wheat 0.6 million
Barley 0.3 million
Maize 0.4 million
![Page 23: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/23.jpg)
GSS, WGS, HTG
[Diagram labels]
Whole BAC insert (or genome)
shred
isolate clones
sequence
GSS division or trace archive
assembly
Draft sequence (HTG division)
whole genome shotgun assemblies (traditional division)
![Page 24: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/24.jpg)
HTG Example: Honeybee Draft Sequences
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division
LOCUS       AC141845   147720 bp    DNA     linear   HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT
            SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
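The KEYWORDS line of the honeybee record carries the HTG phase, and the slide notes that finished (Phase 3) sequences leave the HTG division. A small sketch of reading the phase from such a line is given below. The helper name is hypothetical, and the pattern assumes the `HTGS_PHASE<n>` keyword style seen in this one record.

```python
import re


def htg_phase(keywords_line):
    """Return the HTGS phase number found in a KEYWORDS line, or None."""
    match = re.search(r"HTGS_PHASE(\d)", keywords_line)
    return int(match.group(1)) if match else None


def is_finished(keywords_line):
    """Per the slide, Phase 3 means finished sequence."""
    return htg_phase(keywords_line) == 3


line = "KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT."
print(htg_phase(line), is_finished(line))
```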
![Page 25: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/25.jpg)
[Chart: sequence records (millions) and total base pairs (billions) by year, ’83 to ’06]
Release 148: 45.2 million records, 49.4 billion nucleotides
Average doubling time ≈ 14 months
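A doubling time of about 14 months corresponds to exponential growth, N(t) = N0 · 2^(t/14) with t in months. The sketch below applies that formula to the Release 148 figures. It is only an extrapolation under the assumption that the doubling time stays constant, which the slide does not claim.

```python
def projected_size(current, months_ahead, doubling_months=14.0):
    """Project a quantity forward assuming a constant doubling time."""
    return current * 2 ** (months_ahead / doubling_months)


def annual_growth_factor(doubling_months=14.0):
    """Multiplicative growth over 12 months for a given doubling time."""
    return 2 ** (12.0 / doubling_months)


# Release 148 figures from the slide.
records_millions = 45.2
bases_billions = 49.4

print(projected_size(bases_billions, 14))    # one doubling period ahead
print(projected_size(records_millions, 28))  # two doubling periods ahead
print(annual_growth_factor())                # growth factor per year
```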
![Page 26: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/26.jpg)
File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
GenBank/GenPept (useful for scientists)
FASTA (the simplest format)
ASN.1 & XML (useful for programmers)
![Page 27: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/27.jpg)
A Traditional GenBank Record
The Flatfile Format
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     [...]
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
      [...]
//
Sections of the record: Header, Feature Table, Sequence
![Page 28: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/28.jpg)
The Header
![Page 29: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/29.jpg)
Header: Locus Line
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
• Locus name: AY182241
• Length: 1931 bp
• Molecule type: mRNA
• Division: PLN
• Modification date: 04-MAY-2004
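The fields labeled on the locus line can be pulled apart programmatically. This is a minimal sketch that simply splits on whitespace. Real GenBank LOCUS lines are column-oriented and may omit or add fields, so the eight-token layout assumed here is only known to fit the two records shown in these slides, and the function and key names are illustrative.

```python
def parse_locus(line):
    """Split a LOCUS line into named fields (whitespace-delimited sketch)."""
    parts = line.split()
    # Assumed layout: LOCUS name length "bp" molecule topology division date
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS line layout")
    return {
        "locus_name": parts[1],
        "length_bp": int(parts[2]),
        "molecule_type": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "modification_date": parts[7],
    }


locus = parse_locus(
    "LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004"
)
print(locus)
```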
![Page 30: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/30.jpg)
Header: Database Identifiers
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use
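The VERSION line combines the three identifiers: the accession, a version suffix after the dot, and the GI number. A short sketch of separating them is shown below; the function names are hypothetical and the parsing assumes the `ACCESSION.VERSION GI:number` layout seen in this record.

```python
def split_version(accession_version):
    """Split 'AY182241.2' into ('AY182241', 2)."""
    accession, _, version = accession_version.partition(".")
    if not version.isdigit():
        raise ValueError("expected ACCESSION.VERSION")
    return accession, int(version)


def parse_version_line(line):
    """Extract accession, version and GI number from a VERSION line."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "VERSION":
        raise ValueError("not a VERSION line")
    accession, version = split_version(parts[1])
    gi = None
    for token in parts[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return {"accession": accession, "version": version, "gi": gi}


ids = parse_version_line("VERSION     AY182241.2  GI:32265057")
print(ids)
```

The accession stays the same when a submitter updates the sequence, while the version increments, which is why the example record is at version 2 after replacing an earlier GI.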
![Page 31: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/31.jpg)
Header: Organism
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
NCBI-controlled taxonomy
![Page 32: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/32.jpg)
The Feature Table
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
• Coding sequence (CDS 54..1784): start (atg), stop (tag)
• Implied protein: the /translation qualifier
• GenPept identifiers: /protein_id and /db_xref
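The CDS location 54..1784 can be checked against the sequence with a little arithmetic: locations are 1-based and inclusive, so the coding region is 1784 − 54 + 1 bases long, and the final codon is the stop. The sketch below does that check and confirms the start codon using the first 60 bases of the record's sequence. The helper name is hypothetical, and only simple `start..end` locations are handled.

```python
def parse_location(location):
    """Parse a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = location.split("..")
    return int(start), int(end)


start, end = parse_location("54..1784")
cds_length = end - start + 1        # bases in the coding region
codons = cds_length // 3            # includes the stop codon
protein_length = codons - 1         # amino acids in the implied protein

# First sequence line of record AY182241 (bases 1 to 60), spaces removed.
first_60 = (
    "ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat"
).replace(" ", "")

# Convert the 1-based start position to a 0-based Python slice.
start_codon = first_60[start - 1:start + 2]

print(cds_length, codons, protein_length, start_codon)
```

The coding region length is a multiple of three, as expected for a complete CDS with `/codon_start=1`.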
![Page 33: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/33.jpg)
The Sequence: 99.99% Accurate
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      [...]
     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
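The "99.99% accurate" figure is the same claim as the roughly 1 error per 10,000 bp quoted for the organismal divisions. A small sketch of the arithmetic, applied to this 1931 bp record, follows. The error rate is the approximate figure from the slides, not a measured value for this sequence.

```python
ERROR_RATE = 1 / 10_000  # about 1 error per 10,000 bp, as stated on the slide


def accuracy_percent(error_rate=ERROR_RATE):
    """Per-base accuracy implied by a per-base error rate."""
    return 100.0 * (1.0 - error_rate)


def expected_errors(length_bp, error_rate=ERROR_RATE):
    """Expected number of erroneous bases in a sequence of given length."""
    return length_bp * error_rate


print(accuracy_percent())        # per-base accuracy in percent
print(expected_errors(1931))     # expected errors in the AFS1 mRNA record
```

For a record of this length the expected number of errors is well below one, whereas a 147,720 bp clone at the same rate would be expected to contain several.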
![Page 34: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/34.jpg)
GenPept: FASTA format
>gi|32265058|gb|AAO22848.2| (E,E)-alpha-farnesene synthase [Malus x domestica]
MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWKNDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLFEKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLENHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHSLELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWWANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGSEEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLTKVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMADFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIKGMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHILSLLFQPLVN
>gi|32265070|gb|AAP75563.1| putative doublecortin domain-containing protein
MAKTGAEDHREALSQSSLSLLTEAMEVLQQSSPEGTLDGNTVNPIYKYILNDLPREFMSSQAKAVIKTTDDYLQSQFGPNRLVHSAAVSEGSGLQDCSTHQTASDHSHDEISDLDSYKSNSKNNSCSISASKRNRPVSAPVGQLRVAEFSSLKFQSARNWQKLSQRHKLQPRVIKVTAYKNGSRTVFARVTAPTITLLLEECTEKLNLNMAARRVFLADGKEALEPEDIPHEADVYVSTGEPFLNPFKKIKDHLLLIKKVTWTMNGLMLPTDIKRRKTKPVLSIRMKKLTERTSVRILFFKNGMGQDGHEITVGKETMKKVLDTCTIRMNLNLPARYFYDLYGRKIEDISKGKH
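The FASTA definition lines above pack several identifiers into pipe-separated fields: the GI number, a source database tag, and the accession with its version. The following sketch splits such a defline. The function name is hypothetical, and it handles only the `gi|...|db|accession|` pattern shown on this slide.

```python
def parse_defline(defline):
    """Split a 'gi|<gi>|<db>|<accession.version>| title' FASTA defline."""
    if defline.startswith(">"):
        defline = defline[1:]
    fields = defline.split("|", 4)
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unexpected defline layout")
    return {
        "gi": int(fields[1]),
        "db": fields[2],
        "accession": fields[3],
        "title": fields[4].strip(),
    }


info = parse_defline(
    ">gi|32265058|gb|AAO22848.2| (E,E)-alpha-farnesene synthase "
    "[Malus x domestica]"
)
print(info)
```

The GI and accession recovered here are the same values that appear as `/db_xref` and `/protein_id` on the CDS feature of the nucleotide record.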
![Page 35: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/35.jpg)
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  class nuc-prot ,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds." ,
    source {
      org {
        taxname "Malus x domestica" ,
        common "cultivated apple" ,
        db {
          {
            db "taxon" ,
            tag id 3750 } } ,
        orgname {
          name binomial {
            genus "Malus" ,
            species "x domestica" } ,
          mod {
            {
              subtype cultivar ,
              subname "'Law Rome'" } ,
            {
              subtype old-name ,
              subname "Malus domestica" ,
              attrib "(10)cultivar='Law Rome'" } } ,
          lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus" ,
          gcode 1 ,,
[Diagram labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein]
![Page 36: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/36.jpg)
Database 2: RefSeq NCBI’s Derivative Sequence Database
![Page 37: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/37.jpg)
What is RefSeq?
• Curated transcripts and proteins (NM_, NP_)
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins (XM_, XP_)
• Assembled Genomic Regions (contigs) (NT_, NW_)
– human genome
– mouse genome
– rat genome
• Chromosome records (NC_)
– Human genome
– microbial
– organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
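The `srcdb_refseq[Properties]` term on this slide is an Entrez filter that restricts a search to RefSeq records. As a sketch of how it might be combined with a search term and sent to Entrez programmatically, the code below only builds a request URL and does not contact any server. It assumes the NCBI E-utilities `esearch` endpoint with `db` and `term` parameters, which the lecture itself does not cover, and `AFS1` is just an example term.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_query(term):
    """Restrict an Entrez query to RefSeq records using the slide's filter."""
    return f"({term}) AND srcdb_refseq[Properties]"


def esearch_url(db, term):
    """Build (but do not send) an esearch URL for a RefSeq-only query."""
    params = {"db": db, "term": refseq_query(term)}
    return EUTILS_ESEARCH + "?" + urlencode(params)


query = refseq_query("AFS1")
url = esearch_url("nucleotide", "AFS1")
print(query)
print(url)
```

The same filter string can be typed directly into the Entrez search box, which is how the slide presents it.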
![Page 38: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/38.jpg)
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
![Page 39: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/39.jpg)
RefSeq Curation Processes
Curated genomic DNA (NC, NT, NW)
Curated Model mRNA (XM) (XR)
Curated mRNA (NM) (NR)
Model protein (XP)
Protein (NP)
Scanning....
![Page 40: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/40.jpg)
RefSeq Accession Numbers
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 Microbial replicons, organelle, viral genomes, human chromosomes
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
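The prefix table above maps directly onto a lookup. The following is a minimal sketch, assuming accessions look like two uppercase letters, an underscore, digits, and an optional ".version" suffix; the function name `classify_refseq` and the regular expression are illustrative assumptions, not part of any NCBI tool.

```python
import re

# Prefix meanings copied from the accession table on this slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Chromosome",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}

# Assumed pattern: two uppercase letters, "_", digits, optional ".version".
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))
```

For example, `classify_refseq("NM_000121")` looks up the `NM` prefix, while a GenBank-style accession without an underscore does not match the pattern and yields `None`.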
![Page 41: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/41.jpg)
From GenBank to RefSeq
![Page 42: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/42.jpg)
NM_000121: Sequence Revision History
![Page 43: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/43.jpg)
![Page 44: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/44.jpg)
Database 3: UniGene
NCBI’s Derivative EST Database
![Page 45: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/45.jpg)
UniGene
• Records are clusters of mRNAs and ESTs that ideally represent single genes
• Records are created automatically by a modified BLAST algorithm
• UniGene provides a means to identify an EST or unannotated mRNA
Clustering Expressed Sequences
![Page 46: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/46.jpg)
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents
UniGene
unique gene
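The idea of gene-oriented clustering can be illustrated with a toy single-linkage grouping. This is only a sketch, not UniGene's actual pipeline: it assumes the pairwise similarity hits (for example, from a MegaBlast run) have already been computed, and the function name `cluster_sequences` and the sequence IDs in the usage note are hypothetical.

```python
def cluster_sequences(sequence_ids, hits):
    """Group sequence IDs into clusters connected by pairwise hits.

    hits is an iterable of (id_a, id_b) pairs, assumed to come from a
    similarity search run beforehand; every ID in a pair must also be
    listed in sequence_ids.
    """
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(seq_id):
        # Follow parent links to the cluster root, shortening the path.
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    for id_a, id_b in hits:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for seq_id in sequence_ids:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())
```

With IDs `est1`, `est2`, `est3`, `est4`, `mrna1` and hits linking `est1` and `est2` to `mrna1` and `est3` to `est4`, the function returns two clusters. Single linkage is a simplifying assumption: one spurious hit is enough to merge two unrelated clusters.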
![Page 47: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/47.jpg)
A Cluster of ESTs
query
5’ EST hits
3’ EST hits
![Page 48: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/48.jpg)
UniGene Collections
![Page 49: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/49.jpg)
Example UniGene Cluster
![Page 50: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/50.jpg)
Histogram of cluster sizes for UniGene Hs Build 177
(Now at Build #186)
![Page 51: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/51.jpg)
UniGene Cluster Hs.95351
SELECTED PROTEIN SIMILARITIES
![Page 52: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/52.jpg)
UniGene Cluster Hs.95351
GENE EXPRESSION
![Page 53: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/53.jpg)
UniGene Cluster Hs.95351: expression
![Page 54: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/54.jpg)
UniGene Cluster Hs.95351: seqs
![Page 55: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/55.jpg)
Download sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
![Page 56: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/56.jpg)
Database 4: MMDB
NCBI’s derivative protein structure database
![Page 57: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/57.jpg)
Indexing into MMDB
Structure
id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,
Add secondary structure
inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,
Add chemical bonds
• Import only experimentally determined structures
• Convert to ASN.1
• Verify sequences
• Create “backbone” model (Cα, P only)
• Create single-conformer model
MMDB: Molecular Modeling Data Base
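The “backbone” model step can be approximated on a PDB-format file by keeping only the Cα and P atoms. The sketch below is an illustration under stated assumptions, not MMDB's own conversion code: it assumes fixed-column PDB ATOM records with the atom name in columns 13 to 16, and the function name `backbone_only` is hypothetical.

```python
BACKBONE_ATOMS = {"CA", "P"}  # C-alpha for proteins, phosphorus for nucleic acids


def backbone_only(pdb_lines):
    """Keep ATOM records whose atom name is CA or P."""
    kept = []
    for line in pdb_lines:
        # HETATM and all other record types are dropped.
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()  # columns 13-16 in the PDB format
        if atom_name in BACKBONE_ATOMS:
            kept.append(line)
    return kept
```

Restricting the filter to ATOM records also avoids confusing a calcium ion (a HETATM record whose atom name is CA) with a Cα atom.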
![Page 58: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/58.jpg)
Structure Summary
Cn3D viewer
Conserved Domains
3D Domain Neighbors
Structure Neighbors
![Page 59: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/59.jpg)
Cn3D 4.1: C-Src
![Page 60: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/60.jpg)
Cn3D 4.1: Structural Alignment
Casein kinase S. pombe
Src Kinase H. sapiens
Conserved ATP binding site
![Page 61: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/61.jpg)
Cn3D: Simple Homology Modeling
human
swordtail
![Page 62: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/62.jpg)
NCBI CD: Tyrosine Kinase
![Page 63: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/63.jpg)
Using Cn3D to model domains
![Page 64: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/64.jpg)
Submitting a PDB File to VAST
• Choose the file format
• Remove all lines except ATOM
This is the best way to convert PDB files to MMDB format for viewing with Cn3D!
![Page 65: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/65.jpg)
Database 5: GEO
NCBI’s Gene Expression Omnibus
![Page 66: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/66.jpg)
GPL: Platform descriptions
GSM: Raw/processed spot intensities from a single slide/chip
GSE: Grouping of slide/chip data, “a single experiment”
GDS: Grouping of experiments
Curated by NCBI
Submitted by Experimentalists
Submitted by Manufacturer*
Entrez GEO
Entrez GEO Datasets
GEO SaMple: experimental conditions
GEO SEries: set of related samples
![Page 67: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/67.jpg)
What’s a DataSet?
Platform (GPL): array definition
Sample (GSM): hyb. measurements
Series (GSE): related Samples
Supplied by submitter
DataSet (GDS)
• A collection of experimentally-related samples processed using the same platform.
• Samples within DataSets are organized into subgroups based on experimental variables.
• Form the basis of GEO’s query, analysis and data display tools.
Assembled by GEO staff
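The relationship between Samples and a DataSet can be written down as a small data structure. This is a sketch for illustration only: the class and field names are assumptions rather than GEO's schema, and the accessions in the usage note are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    accession: str   # GSM identifier
    platform: str    # GPL identifier of the array used
    variables: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataSet:
    accession: str   # GDS identifier
    platform: str
    samples: List[Sample] = field(default_factory=list)

    def add(self, sample):
        # A DataSet only holds samples processed on the same platform.
        if sample.platform != self.platform:
            raise ValueError("sample platform does not match DataSet platform")
        self.samples.append(sample)

    def subgroups(self, variable):
        """Group sample accessions by the value of one experimental variable."""
        groups = {}
        for sample in self.samples:
            value = sample.variables.get(variable, "unspecified")
            groups.setdefault(value, []).append(sample.accession)
        return groups
```

Adding samples tagged `{"agent": "control"}` and `{"agent": "treated"}` to a DataSet and calling `subgroups("agent")` returns one list of accessions per value, which mirrors the subgroup organization described on the slide.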
![Page 69: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/69.jpg)
GEO Dataset Browser
![Page 70: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/70.jpg)
GEO Dataset Report
![Page 71: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/71.jpg)
GEO Profiles… of 12625
![Page 72: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/72.jpg)
Database 6: CDD
NCBI’s Derivative Conserved Domain Database
![Page 73: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/73.jpg)
Entrez CDD
![Page 74: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/74.jpg)
Conserved Domain Database
• Multiple sequence alignments
• Position-specific scoring matrices (PSSM)
• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
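A position-specific scoring matrix can be derived from a multiple sequence alignment by counting residues per column. The sketch below is a simplified illustration, not CDD's procedure: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, and the names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            residue = seq[col]
            if residue in counts:  # gaps and unknown symbols are skipped
                counts[residue] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment of standard amino acids."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

A residue that dominates a column gets a positive score at that position and rare residues get negative scores, so a segment resembling the alignment scores higher than an unrelated one.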
![Page 75: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/75.jpg)
CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE
![Page 76: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/76.jpg)
CDD
CD
Pfam
COG
Click on a colored bar to align your sequence to the CD
![Page 77: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/77.jpg)
Conserved Domain Database: cd00371.1, HMA
![Page 78: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/78.jpg)
CDD
![Page 79: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/79.jpg)
CDART: Conserved Domain Architecture Retrieval Tool
![Page 80: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/80.jpg)
Database 7: NCBI Genome Map
![Page 81: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/81.jpg)
Viewing Complex Genomes
– Map Viewer Home Page
• Shows all supported organisms
• Provides links to genomic BLAST
– Genome Overview Page
• Provides links to individual chromosomes
• Shows hits on a genome graphically
– Chromosome Viewing Page
• Allows interactive views of annotation details
• Provides numerous maps unique to each genome
NCBI Map Viewer
![Page 82: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/82.jpg)
The Map Viewer
Genome BLAST
![Page 83: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/83.jpg)
Map Viewer: Human MLH1
Customizable
NCBI Assembly
EST Hits
Gene Annotations
Models
Transcripts
![Page 84: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/84.jpg)
Maps and Options
![Page 85: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/85.jpg)
Mapped Variations
![Page 86: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/86.jpg)
MLH1 Synteny: Mammalian Genomes
![Page 87: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/87.jpg)
Many Other NCBI Databases…
![Page 88: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/88.jpg)
Other Specialized Databases
• Gene Symbol Database (HUGO Gene Nomenclature)
• KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway
• EPD (Eukaryote Promoter Database)
• Transcription Factor Database (TRANSFAC)
• Many organism-specific databases (e.g., FlyBase, BeeBase)
• …
![Page 89: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/89.jpg)
Access Databases through Entrez
![Page 90: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/90.jpg)
Accessing the Data in Entrez
• Web Tools
– Batch Entrez
• Upload a file of GI or accession numbers to retrieve sequences
– Batch Citation Matcher
• Send citation information to Entrez and retrieve PubMed IDs for linking, citation display or other applications
– Advanced Entrez Searching
• Advanced searching techniques for Web Entrez
– My NCBI
• Includes automatic e-mailing of search updates and filters for search results
• Requires a username and password to access stored searches
• Programming Tools
– E-Utilities
• Run Entrez queries and download data from your own scripts over the Web
– Linking to Entrez
• Link to specific Entrez pages from your own web pages or applications
– Entrez Client/Server
• C language library for embedding Entrez calls into your programs
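As an illustration of the E-Utilities route, a script can build an ESearch request URL and fetch the result over HTTP. This is a hedged sketch: the base URL is the one NCBI currently documents for E-utilities (it does not appear on the slides), the function names are hypothetical, and `run_esearch` needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for NCBI E-utilities; not taken from the slides.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def run_esearch(db, term, retmax=20):
    """Fetch the ESearch result as XML text (needs network access)."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        return response.read().decode("utf-8")
```

For instance, `esearch_url("nucleotide", "srcdb_refseq[Properties]", 5)` produces a URL that asks for up to five nucleotide records limited to RefSeq; the square brackets in the term are percent-encoded by `urlencode`.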
![Page 91: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/91.jpg)
Entrez: Web Access
Default search: Against all databases in Entrez
Interface: Global Entrez
Target database: Adjustable using the pull-down menu
![Page 92: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/92.jpg)
/**************************************************************************
*   asn2ff.c
*   convert an ASN.1 entry to flat file format, using the FFPrintArray.
***************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif
FILE *fpl;
Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
Toolbox Sources
ftp> open ftp.ncbi.nih.gov
..
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools
NCBI Toolbox
![Page 93: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/93.jpg)
![Page 94: Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.](https://reader035.fdocuments.net/reader035/viewer/2022062410/56649e9f5503460f94ba23fc/html5/thumbnails/94.jpg)
Challenges in Bioinformatics
Hard Link
3-D Structure: Neighbors, Related Structures (VAST)
Nucleotide Sequences: Neighbors, Related Sequences (BLAST)
Protein Sequences: Neighbors, Related Sequences, BLink, Domains (BLAST)
Taxonomy: Phylogeny
PubMed Abstracts: Related Articles (Word weight)
Other linked databases: Genome, Gene, OMIM, Cancer Chromosome, CDD, 3D domain, PubChem, Books, PMC, SNP, Genome Project, HomoloGene, UniGene, GEO
How can we help biologists manage and exploit all this rapidly growing, heterogeneous, and inaccurate information both efficiently and effectively?