Tri-I Bioinformatics Workshop: Public data and tool repositories Alex Lash & Maureen Higgins...
Tri-I Bioinformatics Workshop: Public data and tool repositories
Alex Lash & Maureen Higgins
Bioinformatics Core
Memorial Sloan-Kettering Cancer Center
Workshop sections
1. Retrieving data from public resources
   • public databases at NCBI, EBI, Ensembl
   • locate and utilize some of the myriad of publicly available bioinformatics tools
   • common data formats
2. Genome Browsers
   • genome build process, ongoing and complete genome projects
   • genome browsers of Ensembl, UCSC and NCBI Mapviewer
   • broad survey of analysis tools and tutorials available on the Web for use directly and after download
Public data and tool repositories: Section 1
Retrieving data from public resources
Goals
A. Understand the scope and organization of the major public databases: NCBI, EBI/Ensembl.
B. Understand the importance of unique identifiers, database fields, logical operators and wildcards.
C. Be able to query, retrieve and display publications and sequences.
D. Be able to visualize/analyze protein structure.
Amyloid Precursor Protein (APP)
[Figure: APP diagram with labels: ß-secretase; -secretase; G-protein coupled receptor that binds heparin and laminin; controls nerve cell growth; interacts with protein-synthesis machinery; amyloid fibril; amyloid plaque]
NCBI
Strengths are data storage, annotation and BLAST:
1. PubMed: Biomedical publications
2. Heritable diseases and syndromes
3. GenBank: Nucleotide and protein sequences
4. BLAST: Pairwise sequence comparison
5. Curated gene-centric data, including reference sequences
6. Genome builds
7. Nucleotide sequence traces
Ex: Finding Entrez Gene record for APP
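Indexing and logical operators
Query: app[Gene Name] AND homo sapiens[Organism]
Index (one bit per record; 1 = term occurs in that record):
Record number: 1 2 3 4 5 6 7 8 …
aardvark: 0 1 1 0 0 0 0 0 …
…
app: 0 1 0 0 0 1 0 0 …
…
homo sapiens: 1 0 0 0 1 1 0 0 …
…
mus musculus: 0 1 0 0 0 0 1 0 …
AND of the two bit maps named in the query:
app: 0 1 0 0 0 1 0 0 …
homo sapiens: 1 0 0 0 1 1 0 0 …
result: 0 0 0 0 0 1 0 0 …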
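The bit-map lookup on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how Entrez is actually implemented: the `index` dictionary and the helper names are hypothetical, and the bit rows are the eight record positions from the slide's index.

```python
# Toy inverted index: one bit per record (records 1..8), as on the slide.
index = {
    "aardvark":     [0, 1, 1, 0, 0, 0, 0, 0],
    "app":          [0, 1, 0, 0, 0, 1, 0, 0],
    "homo sapiens": [1, 0, 0, 0, 1, 1, 0, 0],
    "mus musculus": [0, 1, 0, 0, 0, 0, 1, 0],
}


def bitmap_and(left, right):
    """Combine two bit maps position by position with logical AND."""
    return [a & b for a, b in zip(left, right)]


def records_matching(bitmap):
    """Translate set bits back into 1-based record numbers."""
    return [pos for pos, bit in enumerate(bitmap, start=1) if bit]


result = bitmap_and(index["app"], index["homo sapiens"])
print(result)                    # [0, 0, 0, 0, 0, 1, 0, 0]
print(records_matching(result))  # [6]
```

Only record 6 carries both terms, so it is the single hit for the query. Record 2 carries `app` and `mus musculus`, which is why it drops out when the organism is restricted to human.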
An Entrez Query
1. Query parsed: terms, fields and operators organized in a tree (if syntax incorrect generate error or warning)
2. Unfielded terms matched to synonyms, and extra terms, fields and operators added as needed
3. For each database:
   a) According to order of operations:
      i. Term found in appropriate index (if term not found, then generate warning)
      ii. Bit map pulled and uncompressed
      iii. Pairwise operations performed with previous result (if zero result, then stop)
   b) Number of results generated
4. If Global Query, display results summary and stop
5. List of UIDs generated from final result
6. UIDs sorted by user preference
7. Records pulled and displayed by user preference
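A fielded query can also be submitted programmatically. The sketch below assumes NCBI's E-utilities ESearch service, which the slides do not mention. It only builds the request URL and parses a reply, and sends nothing over the network. The helper names are hypothetical, and the sample reply is hand-written with a placeholder UID.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities ESearch service (an addition, not on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term):
    """Return an ESearch URL for a fielded Entrez query; nothing is sent."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmode": "xml"})


def parse_uids(xml_text):
    """Extract the list of UIDs (step 5 above) from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


url = build_esearch_url("gene", "app[Gene Name] AND homo sapiens[Organism]")
print(url)

# Hand-written sample reply for illustration only; the UID is a placeholder.
sample_reply = """<eSearchResult>
  <Count>1</Count>
  <IdList><Id>12345</Id></IdList>
</eSearchResult>"""
print(parse_uids(sample_reply))
```

Fetching the URL, for example with `urllib.request.urlopen`, would return a reply of the same general shape. The UIDs it lists correspond to step 5, before sorting and record display.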
Gene-centric questions
1. Where is a gene located?
2. What’s its genomic sequence?
3. What variations are associated with it?
4. What’s its exon-intron structure?
5. What are the mRNA sequences of its alternate transcripts?
6. What are the protein sequences of its isoforms?
7. What post-translational modification is possible?
8. What regulates its transcription?
9. What are its co-regulated partners?
10. What’s its normal function?
11. What’s its function in disease?
12. How does it fit into the larger cellular context?
May depend upon cellular “state”
Ex: Looking over the Entrez Gene record for APP
Common id and record formats
1. Ids
   a) GenBank accession
      i. Nucleotide: BI559391, Y00264
      ii. Protein: AAB23646
      iii. RefSeq
   b) Ensembl
   c) UniGene: Hs.651215
   d) PDB Structures: 1iyt
   e) HUGO Gene Names: APP
2. Formats
   a) Flat
      i. GenBank and GenPept
      ii. FASTA
      iii. Multiple FASTA
      iv. Alignment
      v. Multiple alignment
      vi. Tab-delimited
   b) Hierarchical
      i. ASN.1
      ii. XML
      iii. HTML
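Because each resource uses a recognizable identifier shape, a script can often guess which database an id belongs to. The sketch below is a rough heuristic under stated assumptions: the regular expressions are approximations of the common accession layouts, not the official rules, and `classify_id` is a hypothetical helper name.

```python
import re

# Approximate patterns, checked in order; real accession rules have more cases.
ID_PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("Ensembl", re.compile(r"^ENS[A-Z]*[GTEP]\d{11}$")),
    ("UniGene cluster", re.compile(r"^[A-Z][a-z]{1,2}\.\d+$")),
    ("PDB structure", re.compile(r"^\d[A-Za-z0-9]{3}$")),
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$")),
    ("GenBank nucleotide", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?$")),
]


def classify_id(identifier):
    """Return the first id family whose pattern matches, else a fallback label."""
    for family, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return family
    return "unrecognized (possibly a gene symbol)"


for example in ["BI559391", "Y00264", "AAB23646", "Hs.651215", "1iyt", "APP"]:
    print(example, "->", classify_id(example))
```

Gene symbols such as APP have no fixed shape, so they fall through to the fallback label. In practice a symbol is best resolved by querying a gene database rather than by pattern matching.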
NCBI’s RefSeq project
1. Is a project to create curated sequence records for the biopolymers of the Central Dogma: DNA, mRNA and protein
2. First release 2003
3. 4,079 organisms, 3,234,358 proteins
4. Goals
   1. non-redundancy
   2. explicitly linked nucleotide and protein sequences
   3. updates to reflect current knowledge of sequence data and biology
   4. data validation and format consistency; distinct accession series
   5. ongoing curation by NCBI staff and collaborators, with reviewed records indicated
5. What’s its relationship to BLAST database called “nr”?
UniGene versus Entrez Gene
1. UniGene
   1. Automated process that compares and clusters transcript-source sequences (no assembly)
   2. Gene discovery tool: predates Entrez Gene, genome assemblies
   3. Based primarily on EST sequences
   4. ID turn-over and retirement is common
   5. Currently 76 taxa and 1,299,304 clusters
2. Entrez Gene
   1. Curated clearinghouse of gene-centric information
   2. Grew out of LocusLink (eukaryote model organisms) and Entrez Genome (bacteria, viruses, organelles)
   3. ID turn-over and retirement happens, but is less common since it is based primarily on sequenced genomes
   4. Currently 3,882 taxa and 2,479,759 genes
3. Hs: 85,793 UniGene clusters compared to 38,604 Entrez Gene records
EBI/Ensembl
Strengths are data storage and analysis software:
1. Biomedical publications
2. Nucleotide and protein sequences
3. Protein domains/signatures
4. Sequence comparison
5. Sequence analysis
6. Structure analysis
7. Protein function analysis
8. Ensembl genome browser
Ex: Looking at the APP gene in the EBI/Ensembl resources
Ensembl ids
1. Human
   1. ENSG: gene
   2. ENST: transcript
   3. ENSE: exon
   4. ENSP: protein
2. Other organisms
   1. ENS{species 3-letter code}{G|T|P}{11 digits}
   2. RNO=rat
   3. MUS=mouse
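The naming scheme above is regular enough to take apart with one regular expression. The following is a minimal sketch, assuming the layout given on the slide (with the exon letter E allowed for every species). `parse_ensembl_id` and the two lookup tables are hypothetical names, and the identifiers in the usage lines are used only as examples of the format.

```python
import re

# Layout from the slide: ENS, optional 3-letter species code,
# one feature letter, then 11 digits. Human ids carry no species code.
ENSEMBL_ID = re.compile(
    r"^ENS(?P<species>[A-Z]{3})?(?P<feature>[GTEP])(?P<number>\d{11})$"
)

FEATURES = {"G": "gene", "T": "transcript", "E": "exon", "P": "protein"}
SPECIES = {None: "human", "RNO": "rat", "MUS": "mouse"}  # only those on the slide


def parse_ensembl_id(identifier):
    """Split an Ensembl stable id into species, feature type and number."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        raise ValueError("not an Ensembl-style id: " + identifier)
    code = match.group("species")
    return {
        "species": SPECIES.get(code, code),  # unknown codes are passed through
        "feature": FEATURES[match.group("feature")],
        "number": int(match.group("number")),
    }


print(parse_ensembl_id("ENSG00000142192"))
print(parse_ensembl_id("ENSMUSG00000000001"))
```

Species codes outside the small table are returned unchanged, so the function still works for organisms the slide does not list.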
Amyloid Precursor Protein (APP)
[Figure: APP diagram with labels: ß-secretase; -secretase; G-protein coupled receptor that binds heparin and laminin; amyloid fibril; amyloid plaque]
Ex: Viewing the structure of an amyloid fibril
DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA
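The peptide sequence above is a convenient test case for the FASTA format listed among the common record formats. The sketch below writes it as a FASTA record and reads it back. The header `abeta_example` and both helper names are made up for illustration.

```python
ABETA = "DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA"  # sequence from the slide


def to_fasta(header, sequence, width=60):
    """Format one record: a '>' header line, then the sequence in fixed-width lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(chunks) + "\n"


def read_fasta(text):
    """Parse single or multiple FASTA text into a {header: sequence} dict."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


record = to_fasta("abeta_example", ABETA, width=20)
print(record)
print(len(read_fasta(record)["abeta_example"]))  # 42
```

The same `read_fasta` function handles a multiple FASTA file, since each new `>` line simply starts another entry in the dictionary.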
Other structure tools
1. Structure visualization. Free applications:
   a) RasMol
   b) Cn3D
   c) VMD
2. Structure prediction servers/applications
   a) CASP: Critical Assessment of Techniques for Protein Structure Prediction
   b) General method:
      i. Sequence similarity search to identify closest homolog with known structure
      ii. Fit to homolog’s known structure, minimizing some constraint
Problems
1. Query Entrez Gene with the following two queries separately and then explain the differences between the two results using a logical NOT operation:
   a) tyrosine kinase[Gene Ontology] AND human[Organism]
   b) cd00192[Domain] AND human[Organism]
2. Retrieve the APP gene record from NCBI and use the Display dropdown menu to display Conserved Domain Links. Use the ids of the listed domains to query Entrez Gene for records with the same domains.
3. Use the SNP Geneview link at NCBI to identify coding SNPs in the APP gene. Which SNP that was present in the Ensembl APP protein record is missing from this display?
4. Use the Homologene link at NCBI to identify possible functional orthologs for human APP. How does this list compare to the Ensembl list of orthologs that we reviewed previously?
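For Problem 1, the NOT operation amounts to a set difference between two result lists. The sketch below uses small made-up UID sets in place of real query results, so the numbers carry no biological meaning, and `entrez_not` is a hypothetical helper name.

```python
# Hypothetical UID sets standing in for the results of queries (a) and (b).
go_hits = {101, 102, 103, 104}
domain_hits = {103, 104, 105}


def entrez_not(left, right):
    """Records found by the left query but not by the right one (left NOT right)."""
    return left - right


only_go = entrez_not(go_hits, domain_hits)      # annotated, but domain not detected
only_domain = entrez_not(domain_hits, go_hits)  # domain detected, but not annotated
print(sorted(only_go))      # [101, 102]
print(sorted(only_domain))  # [105]
```

Running NOT in both directions separates the records that only one query finds, which is the comparison the problem asks you to explain.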