Bioinformatics Overview, NCBI & GenBank JanPlan 2012.

What is Bioinformatics?

• Find three different definitions of the word “bioinformatics”

• How is “bioinformatics” different from “computational biology”?

• What areas of biological research are dependent on bioinformatics?

What is Bioinformatics Used For?

Introduction to NCBI

• NCBI, EMBL & DDBJ

• What function do these organizations serve in the global society?

• How do their missions differ?

• NCBI Training and Tutorials page

• The NCBI Handbook

• NCBI How-To page

• NCBI Help Manual

GenBank

• Annotated collection of all publicly available nucleotide sequences and their protein translations.

• Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.

• Grows exponentially, doubling every 10 months
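
To make the doubling claim concrete, here is a minimal sketch of the arithmetic, assuming steady exponential growth with the 10-month doubling time quoted above. The function name `growth_factor` is illustrative only and is not part of any NCBI tool.

```python
def growth_factor(months: float, doubling_time_months: float = 10.0) -> float:
    """Fold increase in database size after `months` of steady exponential growth.

    Assumes size(t) = size(0) * 2 ** (t / doubling_time); the 10-month default
    is the figure quoted on the slide, not a measured value.
    """
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    # Print the fold increase after one, two, and five years.
    for years in (1, 2, 5):
        print(f"after {years} year(s): about x{growth_factor(12 * years):.1f}")
```

Under this assumption the database would more than double within a single year and grow 64-fold over five years.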

GenBank

• Initially built and maintained at Los Alamos National Laboratory.

• Transferred to NCBI in the early 1990s by congressional mandate.

• Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited.

• Submitters may keep their data confidential for a specified period of time prior to publication.

Direct Submission

• A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).

• If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated and its span is mapped.

• Example
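
The idea of a conceptual translation over a mapped span can be sketched in a few lines of Python. This is an illustration only: it assumes the standard genetic code, a forward-strand CDS, and 1-based inclusive coordinates in the style of a GenBank feature location. The function name `translate_cds` and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the span start..end (1-based, inclusive) of a nucleotide sequence.

    Translation stops at the first stop codon; unrecognized codons become 'X'.
    """
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Toy record: the CDS occupies positions 3..14 of a 16-base sequence.
    print(translate_cds("CCATGGCTTGGTAAGG", 3, 14))
```

Real GenBank records can also place a CDS on the complementary strand or join several spans, which this sketch does not handle.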

High-Throughput Genomic Sequence (HTGS)

• HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.

• Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.

High-Throughput Genomic Sequence (HTGS)

• Data submitted in 4 phases.

• Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.

• Phase 1: Entries are assembled into contigs that are separated by sequence gaps, the relative order and orientation of which are not known.

• Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.

• Phase 3: Sequences are of finished quality and have no gaps. For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
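
The four phase definitions can be read as a small decision rule. The sketch below encodes that reading in Python; the `HtgsEntry` fields and the `htgs_phase` function are hypothetical names chosen for illustration, not an NCBI data model.

```python
from dataclasses import dataclass


@dataclass
class HtgsEntry:
    """Hypothetical summary of an HTGS entry, using only the properties named above."""
    assembled_into_contigs: bool
    has_gaps: bool
    contigs_ordered_and_oriented: bool
    finished_quality: bool


def htgs_phase(entry: HtgsEntry) -> int:
    """Return the HTGS phase (0-3) implied by the phase definitions."""
    if not entry.assembled_into_contigs:
        return 0  # a few unassembled reads of a single clone
    if entry.finished_quality and not entry.has_gaps:
        return 3  # finished sequence with no gaps
    if entry.has_gaps and not entry.contigs_ordered_and_oriented:
        return 1  # contigs whose order and orientation are unknown
    return 2      # unfinished; any gaps separate ordered, oriented contigs


if __name__ == "__main__":
    draft = HtgsEntry(True, True, False, False)
    print(htgs_phase(draft))
```

What counts as finished quality is set by the group overseeing each organism, so here `finished_quality` is simply taken as given.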

Whole Genome Shotgun Sequences (WGS)

• Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

EST, STS, and GSS

• EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.

• STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR amplification. They define a specific location on the genome and are thus useful for mapping.

• GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.

HTC and FLIC

• HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.

• FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer than ESTs, and sometimes full-length mRNAs. Usually annotated with genes and coding regions. Gene names may be systematic rather than functional.
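
Annotating the longest ORF as a coding region, as mentioned for HTC records, can be illustrated with a simple scan. This is a minimal sketch, assuming an ORF runs from an ATG to the first in-frame stop codon and that only the three forward-strand frames are searched. The function name `longest_orf` is hypothetical, and submitting centers may use different rules.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence: str) -> tuple:
    """Return (start, end), 1-based inclusive, of the longest forward-strand ORF.

    An ORF here begins at ATG and ends with the first in-frame stop codon
    (stop included). Returns (0, 0) when no complete ORF is found.
    """
    seq = sequence.upper()
    best, best_length = (0, 0), 0
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                length = i + 3 - orf_start
                if length > best_length:
                    best, best_length = (orf_start + 1, i + 3), length
                orf_start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGGCTTGGTAAGG"))
```

A fuller version would also scan the reverse complement and handle ORFs that run off the end of a partial cDNA.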

Submission Tools

• BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.

• Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.

Sequence Data Flow and Processing

• Triage: Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an Accession number.

  • All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.

  • GenBank will not accept sequences constructed in silico.

  • GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.

  • GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.

• Submissions are checked to determine whether they are new or updates.
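
The minimal criteria above amount to a short checklist. The sketch below expresses them as a Python function purely for illustration: the real triage is a review by GenBank staff, and the function name `triage_problems` and its boolean parameters are hypothetical stand-ins for facts a reviewer would establish.

```python
MIN_LENGTH_BP = 50  # sequences must be longer than this


def triage_problems(
    sequence: str,
    sequenced_by_submitter: bool = True,
    constructed_in_silico: bool = False,
    has_internal_unsequenced_spacers: bool = False,
    has_physical_counterpart: bool = True,
) -> list:
    """Return the list of minimal criteria that a submission fails (empty if none)."""
    problems = []
    if len(sequence) <= MIN_LENGTH_BP:
        problems.append("sequence must be longer than 50 bp")
    if not sequenced_by_submitter:
        problems.append("sequence must be sequenced by, or on behalf of, the submitter")
    if constructed_in_silico:
        problems.append("sequences constructed in silico are not accepted")
    if has_internal_unsequenced_spacers:
        problems.append("noncontiguous sequences with unsequenced spacers are not accepted")
    if not has_physical_counterpart:
        problems.append("sequence has no physical counterpart")
    return problems


if __name__ == "__main__":
    # A 40-base submission fails only the length criterion.
    print(triage_problems("ACGT" * 10))
```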

Sequence Data Flow and Processing

• Indexing:

  • Biological validity: Translation, organism lineage, BLAST searches

  • Vector contamination: Is there any vector DNA present in the sequence?

  • Publication status: If published, citation is included in annotation and linked to Entrez

  • Formatting and spelling

• Sequences are sent to submitter for final review before release into the public database.

• Sequences must become publicly available once the accession number or the sequence has been published.

• GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.
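
The vector contamination question in the indexing step can be illustrated with a toy check. This sketch simply looks for exact k-mers shared between a submission and a known vector sequence on the forward strand. It is a stand-in for the idea only, not the similarity search that a production screen would run, and the function name `shared_kmers` and the example sequences are made up.

```python
def shared_kmers(submission: str, vector: str, k: int = 20) -> list:
    """Return 1-based start positions in `submission` of k-mers also found in `vector`.

    Exact, forward-strand matching only; an empty list means no shared k-mer.
    """
    sub, vec = submission.upper(), vector.upper()
    vector_kmers = {vec[i:i + k] for i in range(len(vec) - k + 1)}
    return [
        i + 1
        for i in range(len(sub) - k + 1)
        if sub[i:i + k] in vector_kmers
    ]


if __name__ == "__main__":
    # Tiny made-up sequences with k=4 so the match is easy to see by eye.
    print(shared_kmers("TTTTCCGGAAAA", "AACCGGTT", k=4))
```

A run of consecutive hit positions suggests a stretch of vector-derived sequence that should be trimmed before submission.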

RefSeq

• A curated collection of DNA, RNA, and protein sequences built by NCBI.

• Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

• May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.

• Limited to major organisms for which sufficient data is available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).

Third Party Annotation (TPA) database

• Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.

• Two types of records:

  • Experimental: Annotation supported by wet-lab evidence

  • Inferential: Annotation inferred only

• Bridges the gap between GenBank and RefSeq by permitting authors who publish new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

Universal Protein Resource (UniProt)

• Protein sequence database that was formed through the merger of three protein databases:

1. Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute

2. The Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes

3. Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)

Problem Set

• ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf

• Linked on today’s web page