A guided SQL tour of bioinformatics databases
Lane Medical Library & Knowledge Management Center, http://lane.stanford.edu
A Guided SQL Tour of Bioinformatics Databases
Yannick Pouliot, PhD, Bioresearch Informationist
Lane Medical Library & Knowledge Management Center
2/28/2007
Content
Very abbreviated review of the relational principle
Some of the technology required to connect to a remote database
Walk-through of the database schema for Ensembl
  Hands-on querying
Walk-through of the database schema for BioWarehouse
  Hands-on querying
Resources
  Details on connecting to a remote database
So Why Are We Here?
Bioinformatics Databases: Who Supports Direct Querying?
Relational Database Terms
Database: collection of tables and the relationships between tables
Table: collection of records that share a common fundamental characteristic
  E.g., patients and locations can each be stored in their own table
Record: basic unit of information in a relational database
  E.g., 1 record per person
  A record is composed of columns (“fields”)
Query: set of instructions to a database “engine” to retrieve, sort, and format the returned data
  E.g., “find me all patients in my database”
Main Relational Database “Engines”
FileMaker
MS Access
MS SQL Server
MySQL
Oracle
PostgreSQL
Sybase
Structure of Relational DB Tables
Data values live in rows
Understanding the Relational Principle: A Simple Database
Every patient gets ONE record in the Patients table
Every visit gets ONE record in the Visits table
Rows in different tables can be related to one another using a shared key (identifier)
There can be multiple Visits records for a given patient
There can be multiple tissue records for a given patient
“join”
The Relational Principle at Work
Related records can be found using a shared key
Example: Patients.ID = Visits.PatientID
[Figure callouts: table name; primary key]
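A minimal sketch of the shared-key idea, assuming a toy in-memory SQLite database driven from Python rather than the class databases. The Patients and Visits names and the Patients.ID = Visits.PatientID key come from the slide; every other column and all of the rows are made up for illustration.

```python
import sqlite3

# Toy in-memory database; only the table names and the shared key come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (ID INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT);
    CREATE TABLE Visits (
        VisitID   INTEGER PRIMARY KEY,
        PatientID INTEGER REFERENCES Patients(ID),  -- shared key
        VisitDate TEXT
    );
    INSERT INTO Patients VALUES (1, 'Smith', 'Ann'), (2, 'Jones', 'Bob');
    INSERT INTO Visits VALUES (10, 1, '2007-01-15'), (11, 1, '2007-02-02'), (12, 2, '2007-02-20');
""")

# Relate the two tables through the shared key: Patients.ID = Visits.PatientID.
joined_rows = conn.execute("""
    SELECT Patients.LastName, Visits.VisitDate
    FROM Patients, Visits
    WHERE Patients.ID = Visits.PatientID
    ORDER BY Visits.VisitDate
""").fetchall()

for last_name, visit_date in joined_rows:
    print(last_name, visit_date)
```

Patient 1 has two visit records and patient 2 has one, so the join returns three rows: one per visit, each carrying the matching patient's name.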
SQL Querying…With What?
Query browsers used here:
  MySQL Query Browser
  WinSQL
Other query browsers exist but are more sophisticated
  Often more expensive or more complex
  Example: PL/SQL Developer, from Allround Automations
Example: Network Querying of Ensembl Database Using MySQL Query Browser
What happens when you query a remote database?
DEMO
Of note:
  May take some time
    Big database, lots of data to return from far away…
  Easy to write queries with voluminous output
  May have to kill the query…
Setting up ODBC: not discussed here, but cheat-sheet instructions are in the handout. Location will also be mailed
The Database Schema: Your Roadmap For Querying
The schema describes all tables and all fields
Used to determine how to inter-relate tables to retrieve the desired data
Very important:
  Must understand schema for accurate querying
  Wrong understanding = wrong results
Introducing The SQL Select Statement
Good news: this is the only SQL statement you need to understand for querying
SELECT LastName, FirstName
FROM Patients
Basic Syntax of Select Statement
SELECT field_name FROM table [WHERE condition]
Example:
SELECT LastName, FirstName FROM Patients WHERE Alive = 'Y';
Note: SQL keywords are not case sensitive, but table and field names can be, depending on the engine and server (e.g., MySQL table names on Unix-like servers)
Query statements are written in a tool such as MS Query or MySQL Query Browser
[ ] = optional
Handout: p. 2
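To try the basic syntax without any server, here is a small sketch that runs the slide's example against a toy in-memory SQLite database from Python. The Patients table, its fields, and the Alive = 'Y' condition follow the slide; the rows are invented.

```python
import sqlite3

# Toy stand-in for the Patients table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (ID INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT, Alive TEXT);
    INSERT INTO Patients VALUES
        (1, 'Smith', 'Ann', 'Y'),
        (2, 'Jones', 'Bob', 'N'),
        (3, 'Lee',   'Kim', 'Y');
""")

# SELECT without the optional WHERE clause: every record is returned.
all_names = conn.execute("SELECT LastName, FirstName FROM Patients").fetchall()

# Adding WHERE keeps only the records that satisfy the condition.
living = conn.execute(
    "SELECT LastName, FirstName FROM Patients WHERE Alive = 'Y'"
).fetchall()

print(len(all_names), "patients in total;", len(living), "match Alive = 'Y'")
```

Without an ORDER BY clause the order of the returned rows is not guaranteed, so code that depends on order should request one explicitly.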
SELECT – (Some) Details
Moving On: Real Biodatabase Schemas
Schemas We’ll Look At…
Remember: Schemas…
  describe all tables and all fields
  are used to determine how to inter-relate tables to retrieve the desired data
Our schemas today:
  Ensembl
  BioWarehouse
Ensembl
Produced by EMBL-EBI and the Sanger Institute
Collection of genome databases for many different organisms
Free, open source
Web querying: http://www.ensembl.org/
FAQ: What is Ensembl?
All PubMed references pertaining to Ensembl and written by the Ensembl group
Exploring the Ensembl Schema
Ensembl CORE schema documentation
  First place to go to answer: “what does this table store?”
Problem: no graphical representation of overall schema
  Relationships harder to appreciate
Use Catalog function and go from there…
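When no diagram is available, the catalog can also be read with queries. The sketch below assumes a toy in-memory SQLite database with two tables whose names echo Ensembl's; it is not the real Ensembl schema, and the columns are illustrative. On a MySQL server the counterparts of the two catalog lookups are SHOW TABLES and DESCRIBE table_name.

```python
import sqlite3

# Toy schema; table names echo Ensembl's, columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, seq_region_id INTEGER);
    CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
""")

# SQLite keeps its catalog in sqlite_master (MySQL counterpart: SHOW TABLES).
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# PRAGMA table_info lists a table's fields (MySQL counterpart: DESCRIBE gene).
# Each returned row is (cid, name, type, notnull, dflt_value, pk).
gene_fields = [row[1] for row in conn.execute("PRAGMA table_info(gene)")]

print(table_names)
print(gene_fields)
```

A field name that appears in two tables, such as gene_id here, is a strong hint about how those tables can be joined.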
“Fundamental” Tables
Fundamental tables: assembly, assembly_exception, attrib_type, coord_system, dna, dnac, exon, exon_stable_id, exon_transcript, gene, gene_stable_id, karyotype, meta, meta_coord, prediction_exon, prediction_transcript, seq_region, seq_region_attrib, supporting_feature, transcript, transcript_attrib, transcript_stable_id, translation, translation_attrib, translation_stable_id
Features and analyses: alt_allele, analysis, analysis_description, density_feature, density_type, dna_align_feature, map, marker, marker_feature, marker_map_location, marker_synonym, misc_attrib, misc_feature, misc_feature_misc_set, misc_set, prediction_transcript, protein_align_feature, protein_feature, qtl, qtl_feature, qtl_synonym, regulatory_factor, regulatory_factor_coding, regulatory_feature, regulatory_feature_object, regulatory_search_region, repeat_consensus, repeat_feature, simple_feature
ID mapping (map identifiers between releases): gene_archive, mapping_session, peptide_archive, stable_id_event
External references (IDs to objects in other dbs): external_db, external_synonym, go_xref, identity_xref, object_xref, xref
Miscellaneous: interpro
Understanding The Ensembl Schema Using The Catalog
Querying Ensembl
Ensembl runs on the MySQL database engine
We’ll use WinSQL
  MySQL Query Browser can also be used, as well as lots of other querying tools
Before Proceeding: A Word of Caution
Easy to write queries that…
  Retrieve nonsense
  Never complete
Scotty to Captain Kirk: “We’re going in circles, and at warp 6 we’re going mighty fast…”
Understanding the schema is the only way to prevent this
Tips:
  Use “count” to determine the number of rows in a table BEFORE returning large datasets
  Remember: the more tables are joined, the slower the query tends to be
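The "count first" tip can be sketched as follows, assuming a toy in-memory SQLite table named gene with a made-up biotype column and 1,000 invented rows. The LIMIT clause used for the preview is accepted by MySQL and SQLite; other engines spell row limits differently.

```python
import sqlite3

# Toy table standing in for a large remote table; contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT)")
conn.executemany(
    "INSERT INTO gene (biotype) VALUES (?)",
    [("protein_coding",)] * 900 + [("pseudogene",)] * 100,
)

# Step 1: ask how many rows there are. The answer is a single small row.
(row_count,) = conn.execute("SELECT COUNT(*) FROM gene").fetchone()

# Step 2: preview a handful of rows instead of pulling the whole table.
preview = conn.execute("SELECT gene_id, biotype FROM gene LIMIT 5").fetchall()

print(row_count, "rows in table;", len(preview), "rows previewed")
```

Only once the count looks manageable is it worth dropping the LIMIT and returning the full result.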
Demo Queries… To Get You Started
Query 1: return number of genes stored in Ensembl Human
Query 2: return number of transcripts produced by genes stored in Ensembl Human
Demonstrates JOINing of tables
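The shape of the two demo queries can be sketched on a toy in-memory SQLite database. The gene and transcript table names appear in the Ensembl table list; the assumption here is that transcript rows point to their gene through a gene_id field, and the rows themselves are invented, so the counts say nothing about Ensembl Human.

```python
import sqlite3

# Toy stand-in: 3 genes and 5 transcripts, linked by an assumed gene_id field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY);
    CREATE TABLE transcript (
        transcript_id INTEGER PRIMARY KEY,
        gene_id       INTEGER REFERENCES gene(gene_id)
    );
    INSERT INTO gene VALUES (1), (2), (3);
    INSERT INTO transcript VALUES (101, 1), (102, 1), (103, 2), (104, 3), (105, 3);
""")

# Query 1: number of genes (single table, no join needed).
(gene_count,) = conn.execute("SELECT COUNT(*) FROM gene").fetchone()

# Query 2: number of transcripts produced by those genes (two tables joined).
(transcript_count,) = conn.execute("""
    SELECT COUNT(*)
    FROM gene, transcript
    WHERE gene.gene_id = transcript.gene_id
""").fetchone()

print(gene_count, "genes;", transcript_count, "transcripts")
```

Leaving out the WHERE condition in the second query would pair every gene with every transcript (15 rows here), which is one way to "retrieve nonsense".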
Exercises
Together:
  1. the number of genes stored in Ensembl Human
  2. the number of transcripts produced by genes stored in Ensembl Human
  (10 min)
On your own:
  3. the types of analyses that Ensembl provides
  4. the number of types of markers
  5. the number of markers per chromosome for all chromosomes
  6. Extra points: the minimum and maximum marker distances for markers on chromosome 19
  (20 min)
SELECT Statement: A Refresher
SELECT [DISTINCT] select_list
FROM table_list
[WHERE conditions]
[START WITH] [CONNECT BY]   (Oracle only: hierarchical queries)
[GROUP BY group_by_list]
[HAVING search_conditions]
[ORDER BY order_list [ASC | DESC] ]
“Modifiers” of select list: DISTINCT, COUNT, SUM, MIN, MAX
Also: ORDER BY, LIKE (used in WHERE clause)
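A small sketch exercising most of these clauses together, assuming a toy in-memory SQLite table. The marker_location table, its columns, and the marker names are hypothetical and do not correspond to the Ensembl schema.

```python
import sqlite3

# Hypothetical table: one row per marker with its chromosome and position.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marker_location (marker_name TEXT, chromosome TEXT, position INTEGER);
    INSERT INTO marker_location VALUES
        ('D1S100', '1', 1500), ('D1S200', '1', 9000), ('D1S300', '1', 42000),
        ('D19S10', '19', 700), ('D19S20', '19', 88000),
        ('RH1234', 'X', 5000);
""")

# DISTINCT collapses duplicate values; ORDER BY sorts the result.
distinct_chromosomes = conn.execute(
    "SELECT DISTINCT chromosome FROM marker_location ORDER BY chromosome"
).fetchall()

# LIKE filters rows, GROUP BY forms one group per chromosome,
# COUNT/MIN/MAX summarize each group, HAVING filters whole groups.
per_chromosome = conn.execute("""
    SELECT chromosome, COUNT(*) AS n_markers, MIN(position), MAX(position)
    FROM marker_location
    WHERE marker_name LIKE 'D%'
    GROUP BY chromosome
    HAVING COUNT(*) >= 2
    ORDER BY n_markers DESC
""").fetchall()

print(distinct_chromosomes)
print(per_chromosome)
```

WHERE is applied to individual rows before grouping, while HAVING is applied to the groups afterwards; here the X chromosome drops out at the WHERE step because its only marker name does not start with D.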
Example Of A Biologically-Useful Query: All Markers on Chromosome 1
Now We’re Talking: Returning Results into Your Favorite Tool
SQL query results returned to…
MS Excel
  … using Data/Import External Data/New Database Query
  Details: Excel Advanced Report Development, Zapawa 2005 (in Lane catalog)
Spotfire
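Besides the Excel menu route, query results can be handed to a spreadsheet as CSV text. This is a minimal sketch of that alternative, not the procedure shown in class; it assumes a toy in-memory SQLite table with invented rows and writes to an in-memory buffer, which could be swapped for a file opened with newline="".

```python
import csv
import io
import sqlite3

# Toy table with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (ID INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT);
    INSERT INTO Patients VALUES (1, 'Smith', 'Ann'), (2, 'Jones', 'Bob');
""")

cursor = conn.execute("SELECT ID, LastName, FirstName FROM Patients ORDER BY ID")

# Write a header row taken from the cursor metadata, then every result row.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text opens directly in Excel or any other tool that reads comma-separated files.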
Next: BioWarehouse
Produced by SRI International
Integration of genome, biochemical reaction, pathway, etc. databases from many different organisms
Free, open source
Accessing PublicHouse
FAQ
Schema
All PubMed references pertaining to BioWarehouse and written by the BioWarehouse group
Conceptual Views of the BioWarehouse Database
Querying BioWarehouse
We’ll query using MySQL Query Browser
Caveats:
  Lots of datasets supported by BioWarehouse…
  … but some critical ones are missing from PublicHouse due to licensing requirements, e.g., MetaCyc, UniProt
  Also: need to request an account to query
    Anonymous user not supported
Resource: MySQL v5 Reference Manual
BioWarehouse Demo Queries…to get you started
Query 1: What are the datasets available in PublicHouse?
Query 2: How many pathways are there for the EcoCyc dataset?
Example Biologically Meaningful Query Of BioWarehouse: For a Given Pathway, Return the Proteins Involved in the Pathway and Their Molecular Weights
SELECT D.Name AS PathwayName, J.WID AS ProteinWID, J.Name AS ProteinName, J.MolecularWeightCalc AS MolecularWeightCalc
FROM Pathway D, PathwayReaction F, Reaction G, EnzymaticReaction H, Protein J
WHERE D.WID = F.PathwayWID          -- pathway -> its reactions
  AND F.ReactionWID = G.WID
  AND G.WID = H.ReactionWID         -- reaction -> enzymatic reaction
  AND H.ProteinWID = J.WID          -- enzymatic reaction -> protein
  AND D.DataSetWID = 19             -- restrict to a single dataset
  AND D.Name LIKE "%lipopolysaccharide%"
ORDER BY ProteinName
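The same five-table query can be tried on toy data. The sketch below assumes an in-memory SQLite database that reproduces only the tables and fields named in the query above; the rows, protein names, and weights are invented, and the real BioWarehouse tables have more fields. It also writes the joins with explicit JOIN … ON, an equivalent spelling that keeps each join condition next to the table it introduces.

```python
import sqlite3

# Toy copies of the five tables used by the query; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, Name TEXT, DataSetWID INTEGER);
    CREATE TABLE Reaction (WID INTEGER PRIMARY KEY);
    CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
    CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT, MolecularWeightCalc REAL);
    CREATE TABLE EnzymaticReaction (ReactionWID INTEGER, ProteinWID INTEGER);

    INSERT INTO Pathway VALUES (1, 'lipopolysaccharide biosynthesis', 19),
                               (2, 'glycolysis', 19);
    INSERT INTO Reaction VALUES (10), (11), (12);
    INSERT INTO PathwayReaction VALUES (1, 10), (1, 11), (2, 12);
    INSERT INTO Protein VALUES (100, 'enzyme B', 34.5), (101, 'enzyme A', 28.1),
                               (102, 'enzyme C', 51.0);
    INSERT INTO EnzymaticReaction VALUES (10, 100), (11, 101), (12, 102);
""")

# Same logic as the slide's query; values are passed as parameters (?).
rows = conn.execute("""
    SELECT D.Name AS PathwayName, J.WID AS ProteinWID,
           J.Name AS ProteinName, J.MolecularWeightCalc
    FROM Pathway D
    JOIN PathwayReaction F   ON D.WID = F.PathwayWID
    JOIN Reaction G          ON F.ReactionWID = G.WID
    JOIN EnzymaticReaction H ON G.WID = H.ReactionWID
    JOIN Protein J           ON H.ProteinWID = J.WID
    WHERE D.DataSetWID = ? AND D.Name LIKE ?
    ORDER BY ProteinName
""", (19, "%lipopolysaccharide%")).fetchall()

for row in rows:
    print(row)
```

Only the two proteins attached to the lipopolysaccharide pathway come back; the glycolysis protein is excluded by the LIKE condition.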
Exercises
Together:
  1. How many datasets are there in PublicHouse?
  2. What is the number of genes in S. aureus (SAUR158878Cyc)?
  (10 min)
On your own:
  3. List the coding region starts and ends for all genes that code for proteins in the SAUR158878Cyc dataset
  4. How many biochemical reactions are there in each pathway (of any type) in the EcoCyc (= E. coli) dataset?
  (20 min)
In Summary…
Knowing the db schema is essential
The SELECT statement is all you need to know
Remote databases are good for exploring a schema at low cost
  No installation…
But:
  Performance can be poor
  Restrictions on data set
  Better to install locally if “real work” is to be performed
Remember: SQL gives you the power to return results directly into your favorite tool!
Don’t Forget The Class Evaluation
Resources
Setting-Up for Internet SQL Querying
Setting Up Data Source Names
Steps
1. Make sure you have the requisite driver (next slide)
2. Create a Data Source Name (Windows only)
3. Write your query
4. Get the results back into Excel!
See Lane videorecorded class Managing Experiment Data Using Excel and Friends: Digging Out from Under the Avalanche for lots more details.
Step 1: Getting Drivers
Essential for SQL Querying
A driver is a piece of software that lets your operating system talk to a database
  Installed drivers visible in ODBC manager (“data connectivity” tool)
Each database engine (Oracle, MySQL, etc.) requires its own driver
  Generally must be installed by user
Drivers are needed by Data Source Name tool and querying programs
  Require (simple) installation
MySQL Driver: Needed to Query MySQL Databases
Windows: Download MySQL Connector/ODBC 3.51 here
Must be installed for direct querying using, e.g., Excel
Not necessary if you are using the MySQL Query Browser
Oracle Driver: Needed to Query Oracle Databases
Installing “client” software will also install driver
Windows: Download 10g Client here
Mac: Download 10g Client here
Free Oracle user account required to download
Must be installed if you are querying using MS Query or any other query browser involving Oracle
Step 2: Creating a Data Source Name
A Data Source Name (DSN) tells programs on your PC where and how to query a database
Populating the fields:
  Data Source Name: unique name of your choice
  Description: anything
  Server: exactly as given by the database provider
  Port number: as specified by database provider
    Defaults: MySQL: 3306; Oracle: 1521; MS Access: N/A
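Some tools accept the same information as a single connection string instead of a saved DSN. The helper below is a hypothetical sketch of how the DSN fields map onto such a string; it only builds text and connects to nothing. The keyword names (DRIVER, SERVER, PORT, DATABASE, UID, PWD) and the driver name are assumptions based on common MySQL Connector/ODBC usage and should be checked against your driver's documentation, and the server, database, and user are placeholders.

```python
# Default ports as listed on the slide.
DEFAULT_PORTS = {"mysql": 3306, "oracle": 1521}


def build_odbc_connection_string(driver, server, database, user,
                                 password="", engine="mysql", port=None):
    """Assemble a DSN-less ODBC connection string (hypothetical helper).

    Keyword names follow common MySQL Connector/ODBC usage; other
    engines may expect different keywords.
    """
    fields = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,  # exactly as given by the database provider
        "PORT": str(port if port is not None else DEFAULT_PORTS[engine]),
        "DATABASE": database,
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{key}={value}" for key, value in fields.items()) + ";"


# Placeholder values only; substitute those supplied by the database provider.
example = build_odbc_connection_string(
    driver="MySQL ODBC 3.51 Driver",
    server="db.example.org",
    database="example_db",
    user="guest",
)
print(example)
```

When no port is given, the engine's default from the slide is filled in, which mirrors leaving the port field at its default in the DSN dialog.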
Resources – SQL
eBook: Beginning SQL
eBook: Learning SQL
Lots More Resources From Lane
How To Get Accounts for Direct SQL Querying
Direct Querying of Selected Bioinformatics Databases
Database | How? | DB Engine
BioWarehouse | http://biowarehouse.ai.sri.com/ : get an account for access to PublicHouse (publicly accessible installation of BioWarehouse); see http://biowarehouse.ai.sri.com/PublicHouseOverview.html | MySQL
Ensembl | http://www.ensembl.org/info/data/download.html | MySQL
Mouse Genome Database | Mail [email protected] to ask for an account | Sybase
Example Querying with MySQL Query Browser
Free
MySQL only
Facilitates writing of a SQL query
  graphical
Get it at http://www.mysql.com/products/tools/query-browser/
[Screenshot callouts: query statement; execute statement; table descriptions]