Intro to Bioinformatics - Office of Faculty Affairs
Transcript of Intro to Bioinformatics - Office of Faculty Affairs
Introduction to Bioinformatics:useful tools and tricks
Kristi Holmes, PhD
Bernard Becker Medical Library
Overview
• Introduction to NCBI
• Tour the NCBI site
• Take a look at literature and molecular databases
• Explore search strategies
• MyNCBI
NCBI: The National Center for Biotechnology Information
Bethesda, MDhttp://www.nlm.nih.gov/exhibition/gallery/gallery.html
NCBI homepage: http://www.ncbi.nlm.nih.gov/
• Resource guide
• Help documents
• Info buttons
NCBI literature databases: http://www.ncbi.nlm.nih.gov/Literature/
literature databases
From Using Reference Resources – Tutorials and Tips
• PubMed PubMed Basics (PDF)
– PubMed Quick Reference Card (PDF)
– PubMed Tutorials and Quick Tours
– PubMed Search Tags & Field Qualifiers (PDF)
• Responsible and Efficient Literature Searching (PDF)
• Searching Basics (PDF)
NCBI molecular databases: http://www.ncbi.nlm.nih.gov/Database/
• Entrez is the integrated, text‐based search and retrieval system used at NCBI for the major databases
• Entrez searching strategy is much the same as PubMed!
• Nucleotide, protein, structure, taxonomy, genome, expression, and chemical databases
Let’s get started! 1. Evaluating databases – a general
guide
2. Database searching
3. Limits
4. Entrez and type in a search query• http://www.ncbi.nlm.nih.gov/sites/gquery
• Jump to your database of interest (nucleotide, protein, gene, etc.)
• Once you choose the database, “Limits” are your friends!
http://www.ncbi.nlm.nih.gov/http://www.xkcd.com
Things to keep in mind when looking at databases (and software tools, too!)
• Scope: What are you looking for?• Data quality: Depends heavily on data
curation, validation can be difficult (huge range)
• Freshness of data: Well‐maintained databases with current information can indicate a higher quality database
• Data quantity: Never assume completeness of a data set
• Availability: Download restrictions and IP restrictions
• Technical architecture: Interfaces which promote ease of use
Adapted from Baxevanis & Ouellette, 2005
Boccioni, Umberto: States of Mind: The Farewells1911 - Museum of Modern Art, New York
Database searchingThree easy ways to search within individual Entrez databases:
1. Basic (enter a search term in the text box and press “go”)
2. Advanced Method (do a separate search for each term or phrase and combine searches using History)
3. Complex Boolean Query– Enter your search in command language, indicating field qualifiers in square brackets [].
– Syntax: term[field] BOOLEAN term[field] BOOLEAN term[field] etc.
– Boolean operators (AND, OR, NOT) must be written in UPPER CASE
– Boolean operators are processed from left to right unless parentheses are used for nesting. If parentheses are used, the portions of the query in parentheses will be processed first, then the remaining Boolean operators will be processed from left to right.
– list of Search Fields and Qualifiers and search fields available by database
I'd like to retrieve human nucleotide sequences associated with colon cancer.
Demo:
– open Entrez CoreNucleotide search page
– enter the following query into the text box: colon cancer[title] AND human[orgn]
http://www.ncbi.nlm.nih.gov/Education/
LimitsLimits vary by database. In Entrez Nucleotides, for example, you can limit by:• search field
• source database (subset)
• molecule type
– Try this: starting with the search results for the search: breast cancer[title] AND human[orgn], use the Limits page to find mRNA sequences.
– How many non‐mRNA sequences were in the original search results?
• gene location
• modification date
• exclude certain categories of records
Many Limits are drawn from the Properties [PROP] field. The most commonly used properties are shown on the Limits page as check boxes or in pop‐up menus (see Entrez Help doc for more details).
To see a complete list of properties, browse the index of that field in the database of interest. For example, select the Entrez CoreNucleotide database, follow the link for Preview/Index in the grey bar beneath the search box, select Properties from the Search Field pop‐up menu, and press the Index button. Use the Up and Down buttons to scroll through the index.
http://www.ncbi.nlm.nih.gov/Education/
THOUSANDS of databases…
There are literally thousands of databases available! How do you know which ones to use?
• It is easiest to start with a single search system (such as Entrez) that combines data from the most commonly used comprehensive databases
• If you need additional specialized databases, search the database and software directories (more about this later)
NCBI molecular databases: http://www.ncbi.nlm.nih.gov/Database/
• Nucleotide, protein, structure, taxonomy, genome, expression, and chemical databases
• Entrez is the integrated, text‐based search and retrieval system used at NCBI for the major databases
• Entrez searching strategy is much the same as PubMed!
What is GenBank?NCBI’s Primary Sequence Database
Nucleotide only sequence database Archival in nature
– Historical– Reflective of submitter point of view
(subjective)– Redundant
GenBank Data– Direct submissions (traditional records)– Batch submissions (EST, GSS, STS)– ftp accounts (genome data)
Three collaborating databases– GenBank– DNA Database of Japan (DDBJ) – European Molecular Biology Laboratory (EMBL)
Database
http://www.ncbi.nlm.nih.gov/Education/
Aug-97 Aug-98 Aug-99 Aug-00 Aug-01 Aug-02 Aug-03 Aug-04 Aug-05 Aug-06 Aug-070
20
40
60
80
100
120
140
160
180
200
Bas
es
(bill
ions
)
The Growth of GenBank
As of February 2008
Doubling time = 12‐14 months
GenBank Release: 85.8 billion bases
WGS: 108.6 billion bases
http://www.ncbi.nlm.nih.gov/Education/
GenBank: NCBI’s Primary Sequence Database
Records27,439,206
Whole Genome Shotgun
Bases108,635,736,141
Bases85,759,586,764
Records82,853,685
Total Records
Total Bases
110,292,891
194,395,322,905
February 2008Release 164
ftp.ncbi.nih.gov/genbank/ • full release every two months• incremental updates daily• available only via ftp
http://www.ncbi.nlm.nih.gov/
A TraditionalGenBank Record
LOCUS AF124527 2540 bp mRNA linear PLN 29-JAN-2004DEFINITION Prunus persica ethylene receptor (ETR1) mRNA, complete cds.ACCESSION AF124527VERSION AF124527.1 GI:6841074KEYWORDS .SOURCE Prunus persica (peach)
ORGANISM Prunus persicaEukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;rosids; eurosids I; Rosales; Rosaceae; Amygdaloideae; Prunus.
REFERENCE 1 (bases 1 to 2540)AUTHORS Bassett,C.L., Artlip,T.S. and Callahan,A.M.TITLE Characterization of the peach homologue of the ethylene receptor,
PpETR1, reveals some unusual features regarding transcriptprocessing
JOURNAL Planta 215 (4), 679-688 (2002)PUBMED 12172852
REFERENCE 2 (bases 1 to 2540)AUTHORS Bassett,C.B., Artlip,T.S. and Nickerson,M.L.TITLE Direct SubmissionJOURNAL Submitted (29-JAN-1999) Appalachian Fruit Research Station,
USDA-ARS, 45 Wiltshire Road, Kearneysville, WV 25430, USAFEATURES Location/Qualifiers
source 1..2540/organism="Prunus persica"/mol_type="mRNA"/cultivar="Loring"/db_xref="taxon:3760"/dev_stage="III B/C fruit"
gene 1..2540/gene="ETR1"
CDS 269..2485/gene="ETR1"/codon_start=1/product="ethylene receptor"/protein_id="AAF28893.1"/db_xref="GI:6841075"/translation="MEACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVKKSAVFPYRWVLVQFGAFIVLCGATHLINLWTFSMHSRTVAIVMTTAKVLTAVVSCATALMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQEETGRHVRMLTHEIRSTLDRHTILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIHLPVINQVFSSNRALKISPNSPVARMRPLAGKHMPGEVVAVRVPLLHLSNFQINDWPELSTKRYALMVLMLPSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLARREAETAIRARNDFLAVMNHEMRTPMHAIIALSSLLQETELTPEQRLMVETILKSSHLLATLINDVLDLSRLEDGSLQLEIATFNLHSVFREVHNLIKPVASVKKLSVSLNLAADLPVQAVGDEKRLMQIVLNVVGNAVKFSKEGSISITAFVAKSESLRDFRAPEFFPAQSDNHFYLRVQVKDSGSGINPQDIPKLFTKFAQTQSLATRNSGGSGLGLAICKRFVNLMEGHIWIESEGPGKGCTAIFIVKLGFAERSNESKLPFLTKVQANHVQTNFPGLKVLVMDDNGSVTKGLLVHLGCDVTTVSSIDEFLHVISQEHKVVFMDVCMPGIDGYELAVRIHEKFTKRHERPVLVALTGNIDKMTKENCMRVGMDGVILKPVSVDKMRSVLSELLEHRVLFEAM"
ORIGIN 1 gcacgagggc tcaccgagcg agctagctct tcaggagtca aggcttctgg gtgaggggaa
61 gaagaagaag cttctttgat gtgttggggt gccaatctaa agaggaagaa gaaggcctct121 aatgtattga ggtcggctgt ctgggctgcc gatctgtgtt gaatggatag tttggtagag181 atgcttcaac gacatagggt ggctgaaaag ggtttgaaga aagtgaagga ggaaaccaag
...2401 tatactgaaa cctgtctcag ttgataaaat gaggagtgtt ttatcagaac tgttggagca2461 tcgagtttta tttgaggcta tgtaagatat aggaaaattg ttctagtgaa ggaaagattt2521 aaatggaaaa aaaaaaaaaa
//
Header
Feature Table
Sequence
The Flatfile Format
http://www.ncbi.nlm.nih.gov/Education/
Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession•Stable•Reportable•Universal
VersionTracks changes in sequence
GI numberNCBI internal use
well annotated
the sequence is the datahttp://www.ncbi.nlm.nih.gov/Education/
Check out the sample GenBank record
• http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
RefSeq‐ a database of reference sequences
• Curated, non‐redundant; one record for each gene, or each splice variant, from each organism represented
• A representative GenBank record is used as the source for a RefSeq record
• Value‐added information is added by an expert(s)
• Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article
• RefSeq database includes genomic DNA, mRNA, and protein sequences, so organizes information according to the model of the central dogma of biology
• Accessible through Entrez, BLAST, and FTP site(RefSeq records are available in various Entrez Databases such as Nucleotide, Protein, Genome, and are also accessible from Entrez Gene records)
What are you looking for? Try Entrez
Entrez help document: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp
NCBI: Gene• Entrez Gene is a searchable database of genes, from RefSeq
genomes, and defined by sequence and/or located in the NCBI Map Viewer
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
NCBI: MapViewer
• The Map Viewer provides special browsing capabilities for a subset of organisms in Entrez Genomes.
• MapViewer allows you to:– view and search an organism's
complete genome
– display chromosome maps
– zoom into progressively greater levels of detail, down to the sequence data for a region of interest.
• The number and types of available maps vary by organism, and are described in the "data and search tips" file for each organism.
http://www.ncbi.nlm.nih.gov/mapview/
http://www.ncbi.nlm.nih.gov/Education/
dbSNP
http://www.ncbi.nlm.nih.gov/SNP/index.html• The Single Nucleotide Polymorphism
database (dbSNP) is a public‐ domain archive for a broad collection of simple genetic polymorphisms.
• This collection of polymorphisms includes:– Single‐base nucleotide substitutions (also
known as single nucleotide polymorphisms or SNPs)
– Small‐scale multi‐base deletions or insertions (also called deletion insertion polymorphisms or DIPs),
– Microsatellite repeat variations (also called short tandem repeats or STRs).
• Statistics
http://www.ncbi.nlm.nih.gov/Education/
My NCBI – do YOU have an account?What is My NCBI?
My NCBI is a central place to customize NCBI Web services. To use it, you must first register, and your browser must accept cookies.
You can use My NCBI to:– Save searches
– Set up e‐mail alerts for new content
– Display links to Web resources (LinkOut)
– Choose filters that group search results
Like all NCBI resources, My NCBI is free.
For more information, read My NCBI Help.
My NCBI tutorial from MIT: http://libraries.mit.edu/video/pubmed/myncbi/myncbi.html
http://www.ncbi.nlm.nih.gov/
RESOURCES
• Office of the Vice Chancellor of Research: http://research.wustl.edu/Pages/default.aspx
• Becker Medical Library: http://becker.wustl.edu/
• WUSM Core Research Facilities: http://intramed.wustl.edu/research/corefaci.nsf
Other resources:
• HSLS Online Bioinformatics Resource Collection (OBRC)
• BioMed Central Databases collection
• NAR 2008 Database Issue
• NAR 2008 Web Server Issue