SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman Mario Latendresse Bioinformatics Research Group SRI International April 25, 2011


The Structured Advanced Query Page

Tomer Altman, Mario Latendresse

Bioinformatics Research Group, SRI International

April 25, 2011


Introduction

BioVelo is a query language: like SQL, but simpler and with no updates allowed.

BioVelo documentation: http://biocyc.org/bioveloLanguage.shtml

The Free-Form Advanced Query Page (FFAQP) allows Web submission of BioVelo queries.

The Structured Advanced Query Page (SAQP) is a Web page for interactively constructing advanced and precise queries to PGDBs.

Queries are translated to BioVelo and sent to the server for processing.

SAQP: http://biocyc.org/query.shtml

SAQP documentation: http://biocyc.org/webQueryDoc.shtml


Why a query interface?

Allow a structured way to access the rich data representation stored in a PGDB.

Most advanced databases have a high-level, declarative method of access (e.g., SQL).

Provides an intermediate level of access between graphically browsing the PGDB and programmatically processing the data using Lisp.


The Structured Advanced Query Page

'Advanced', in that it allows you to ask more advanced and complicated queries than the basic search interface.

In other words, the SAQP allows you to search for a precise set of answers given simple or complex conditions.

'Structured', in that it is a dynamic HTML form that makes queries easier to craft but, compared with the FFAQP, trades flexibility and power for simplicity.

'Page', in that it is accessed via the Web interface for BioCyc (www.biocyc.org/query.shtml), or from your own Pathway Tools Web server.


SAQP Architecture

The SAQP is built on top of a high-level, functional, declarative language called BioVelo (Mario Latendresse, SRI), which is built on top of Pathway Tools.

On every result page, you will see the equivalent BioVelo code that was generated from the SAQP, which, in turn, generated the results.

You don't need to know anything about BioVelo to use the SAQP, but it might be helpful later if you need the ability to write even more complicated queries using the Free Form Advanced Query Page (FFAQP).


The Structure of the SAQP:

Database specification

Class specification

'Where' constraints on attributes of classes

Output attributes description

Data format (HTML vs TXT)
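The way these form sections map onto a BioVelo expression can be sketched in Python. This is only an illustration: the `StructuredQuery` class and the `to_biovelo` function are hypothetical names, not part of Pathway Tools, and the real SAQP translator may work quite differently. The output syntax follows the BioVelo examples given later in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    """Hypothetical container mirroring the sections of the SAQP form."""
    database: str                                 # database specification, e.g. "ecoli"
    klass: str                                    # class specification, e.g. "proteins"
    where: list = field(default_factory=list)     # (attribute, operator, value) constraints
    outputs: list = field(default_factory=list)   # attributes to display
    data_format: str = "HTML"                     # "HTML" or "TXT"; presentation only


def to_biovelo(q: StructuredQuery, var: str = "p") -> str:
    """Render the query as a BioVelo-style comprehension string."""
    # With no output attributes, output the object itself.
    cols = [f"{var}^?{attr}" for attr in q.outputs] or [var]
    head = cols[0] if len(cols) == 1 else "(" + ", ".join(cols) + ")"
    generator = f"{var} <- {q.database}^^{q.klass}"
    conditions = [f"{var}^{attr} {op} {value}" for attr, op, value in q.where]
    return f"[{head} : " + ", ".join([generator] + conditions) + "]"


query = StructuredQuery(
    database="ecoli",
    klass="proteins",
    where=[("dna-footprint-size", "<", 10)],
    outputs=["name"],
)
expression = to_biovelo(query)
```

The data format is deliberately left out of the generated string, since it only changes how the server presents the results.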


Example #1:

A simple query usually consists of querying a particular database about a particular class.

Find all the proteins in E. coli K-12.

Display the protein names.


Structure of the Results

A line that shows the equivalent BioVelo expression that the SAQP generated to answer the query.

An HTML table of the results, with the corresponding entries hyperlinked to the matching Pathway Tools Web pages.

If a text data format was requested, then a tab-delimited text file is generated, with just the table data.
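A tab-delimited result file can be read with any tab-aware parser. Below is a minimal Python sketch, assuming one result row per line and no header row (the slides do not say whether a header is present). The function name `read_result_table` is hypothetical and the sample text is made up for illustration; it is not real query output.

```python
import csv
import io


def read_result_table(text):
    """Parse tab-delimited text into a list of rows (each a list of strings)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Skip completely empty lines, which csv.reader yields as empty lists.
    return [row for row in reader if row]


# Made-up sample standing in for a downloaded TXT result.
sample = "protein-A\t8\nprotein-B\t6\n"
rows = read_result_table(sample)
```

Each row comes back as a list of strings, so numeric columns such as a footprint size still need an explicit conversion (for example `int(row[1])`).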


Example #2:

Find all the proteins of E. coli K-12 for which the DNA-FOOTPRINT-SIZE is smaller than 10.

Display the protein name and the DNA footprint size.


Example #3:

In EcoCyc, display polypeptides constrained by experimentally determined molecular weight and isoelectric point (pI).

The experimental molecular weight should be between 50 and 100 kD.

The pI should be less than 7.

Display the polypeptide name, the experimental molecular weight, and the pI.


Example #4:

The SAQP allows for specifying quantifiers on relations between PGDB classes.

Extending example #3, we now want only proteins for which at least one of the genes encoding the protein lies within the first 500 kilobases of the E. coli chromosome.
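The "at least one" quantifier corresponds to an existential test over the related objects. A rough Python analogue using `any()` is shown below. The record layout and the helper name `encoded_early` are hypothetical, and "within the first 500 kilobases" is interpreted here as a left-end position below 500,000 bp, which is an assumption rather than something the slide specifies.

```python
def encoded_early(protein, limit_bp=500_000):
    """True if at least one gene encoding the protein starts before limit_bp."""
    return any(g["left_end_position"] < limit_bp for g in protein["genes"])


# Toy records for illustration only; not real EcoCyc data.
proteins = [
    {"name": "P1", "genes": [{"left_end_position": 120_000}]},
    {"name": "P2", "genes": [{"left_end_position": 900_000},
                             {"left_end_position": 450_000}]},
    {"name": "P3", "genes": [{"left_end_position": 2_000_000}]},
]

selected = [p["name"] for p in proteins if encoded_early(p)]
```

Replacing `any()` with `all()` would express the stricter "every gene" quantifier, under which P2 would no longer qualify.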


Example #5: Queries with Several Components

A second search component will search potentially another database and another class of objects for each element found in the first search component.

It is called a 'cross-product' search.

Any number of search components can be added. In general, each new search component is evaluated for each set of objects found in the previous components.

Some restraint is needed to avoid building a query that takes too long to answer. (The server imposes a limit of a few minutes per query.)

Example: Search for MetaCyc pathways in the taxonomic range of Bacteria that also exist in E. coli K-12 using the common-name attribute.
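The cross-product behaviour can be pictured as a nested loop: the second component is evaluated once per object found by the first. The Python sketch below pairs pathways from two collections by common name. The records, identifiers, and the function name `shared_by_name` are made up for illustration, and the taxonomic-range filter from the example is left out for brevity.

```python
def shared_by_name(first, second):
    """Pair every object in `first` with each object in `second` sharing its common name."""
    return [
        (a["id"], b["id"])
        for a in first          # first search component
        for b in second         # second component, evaluated for each `a`
        if a["common_name"] == b["common_name"]
    ]


# Toy stand-ins for two databases; not real MetaCyc or EcoCyc content.
metacyc_pathways = [
    {"id": "M1", "common_name": "pathway alpha"},
    {"id": "M2", "common_name": "pathway beta"},
    {"id": "M3", "common_name": "pathway gamma"},
]
ecoli_pathways = [
    {"id": "E1", "common_name": "pathway beta"},
    {"id": "E2", "common_name": "pathway gamma"},
]

pairs = shared_by_name(metacyc_pathways, ecoli_pathways)
comparisons = len(metacyc_pathways) * len(ecoli_pathways)
```

The number of comparisons grows as the product of the component sizes, which is why each additional component can push a query toward the server's time limit.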


Exercises

1) Find all genes of E. coli that contain “trp” in their name.

2) Find all genes in MetaCyc that have more than one product. Output the gene names and product names.

3) Find all reactions in E. coli which have the reactant (i.e., the left side) “acetaldehyde”.

4) Find all monomers in E. coli. A monomer has no components.

5) Find all reactions in MetaCyc that have more than 4 reactants.

6) Find all metabolic pathways, in MetaCyc, that have more than 5 reactions. Output the reaction lists as well as the pathway names.


Introduction to BioVelo

BioVelo is based on set and list comprehension.

In mathematics, a set comprehension describes a set of values, as in: {x | x in Prime, x > 100}

The output is 'x', the body has a generator 'x in Prime' and a condition 'x > 100'. Several conditions and several generators can be used.

BioVelo uses a concise syntax:

1) [ output-expression : generator, condition, ... ]

2) a generator has the form v <- database^^class

3) a condition uses logical and relational operators
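Python has the same construct, which may make the output / generator / condition split easier to see. The sketch below mirrors {x | x in Prime, x > 100}. Because Python cannot enumerate all primes, the generator is capped at an arbitrary bound of 120; the helper `is_prime` is written here only for the example.

```python
def is_prime(n):
    """Trial-division primality test; adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


# Generator: x drawn from the primes below 120 (arbitrary cap for the example).
primes = [x for x in range(120) if is_prime(x)]

# Output expression `x`, generator `x in primes`, condition `x > 100`.
big_primes = {x for x in primes if x > 100}
```

In BioVelo terms, `x` plays the role of the output expression, `for x in primes` is the generator, and `if x > 100` is the condition.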


Examples of BioVelo Queries

[r : r <- ecoli^^reactions]

[p^name : p <- ecoli^^proteins]

[p^?name : p <- ecoli^^proteins]

[p^?name : p <- ecoli^^proteins, p^dna-footprint-size < 10]

[(g^?name, g^left-end-position) : g <- ecoli^^genes, g^left-end-position < 153000]

[(g^?name, k) : g <- ecoli^^genes, k := abs(g^left-end-position - g^right-end-position)+1, k < 200]

[(r^?name, c^?name) : r <- ecoli^^reactions, c <- r^left, c in r^right]
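For readers more familiar with Python, the last two queries can be approximated with list comprehensions over plain dictionaries. This is an analogy only: the records are toy data, the dictionary keys are stand-ins for the PGDB attributes, and no BioVelo semantics beyond what the queries above show are implied.

```python
# Toy records for illustration; not real EcoCyc data.
genes = [
    {"name": "g1", "left_end_position": 1000, "right_end_position": 1149},
    {"name": "g2", "left_end_position": 5000, "right_end_position": 5899},
    {"name": "g3", "left_end_position": 7300, "right_end_position": 7150},
]
reactions = [
    {"name": "r1", "left": ["A", "B"], "right": ["B", "C"]},
    {"name": "r2", "left": ["A"], "right": ["D"]},
]

# Analogue of: k := abs(left-end - right-end) + 1, k < 200
# The single-element inner loop binds k once per gene, like `k := ...`.
short_genes = [
    (g["name"], k)
    for g in genes
    for k in [abs(g["left_end_position"] - g["right_end_position"]) + 1]
    if k < 200
]

# Analogue of: c <- r^left, c in r^right
# The second generator ranges over an attribute of the first variable.
on_both_sides = [
    (r["name"], c)
    for r in reactions
    for c in r["left"]
    if c in r["right"]
]
```

The second comprehension shows that a generator does not have to range over a whole class; it can also iterate over the values of an attribute of a previously bound variable.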