Source Code for Biology and Medicine BioMed CentralSource Code for Biology and Medicine Research...

BioMed Central

Source Code for Biology and Medicine

ss
Open AcceResearchBioinformatics process management: information flow via a computational journalLance Feagan†, Justin Rohrer†, Alexander Garrett†, Heather Amthauer†, Ed Komp†, David Johnson†, Adam Hock†, Terry Clark†, Gerald Lushington†, Gary Minden† and Victor Frost*†
Address: Information and Telecommunication Technology Center, University of Kansas, Lawrence, Kansas, USA

Email: Lance Feagan - [email protected]; Justin Rohrer - [email protected]; Alexander Garrett - [email protected]; Heather Amthauer - [email protected]; Ed Komp - [email protected]; David Johnson - [email protected]; Adam Hock - [email protected]; Terry Clark - [email protected]; Gerald Lushington - [email protected]; Gary Minden - [email protected]; Victor Frost* - [email protected]

* Corresponding author †Equal contributors

AbstractThis paper presents the Bioinformatics Computational Journal (BCJ), a framework for conductingand managing computational experiments in bioinformatics and computational biology. Theseexperiments often involve series of computations, data searches, filters, and annotations which canbenefit from a structured environment. Systems to manage computational experiments exist,ranging from libraries with standard data models to elaborate schemes to chain together input andoutput between applications. Yet, although such frameworks are available, their use is notwidespread–ad hoc scripts are often required to bind applications together. The BCJ exploresanother solution to this problem through a computer based environment suitable for on-site use,which builds on the traditional laboratory notebook paradigm. It provides an intuitive, extensibleparadigm designed for expressive composition of applications. Extensive features facilitate sharingdata, computational methods, and entire experiments. By focusing on the bioinformatics andcomputational biology domain, the scope of the computational framework was narrowed,permitting us to implement a capable set of features for this domain. This report discusses thefeatures determined critical by our system and other projects, along with design issues. Weillustrate the use of our implementation of the BCJ on two domain-specific examples.

IntroductionThe Bioinformatics Computational Journal (BCJ) is anextensible environment that integrates computationalresources, methods, and data. Bioinformatics and compu-tational biology span a wide variety of applications rang-ing from interpretation of gene expression data to proteinstructure prediction (For brevity, in this report we oftendescribe bioinformatics and computational biology as bioin-

formatics.). However, within this range of applicationsthere is considerable common ground that is pivotal to adomain-oriented approach. Many bioinformatics applica-tions are centered on the rapidly growing sequence dataarchived at national centers [1]. These data can be seen asenablers of bioinformatics approaches. As a result, stand-ard approaches to process these data have appeared, butspecifics of the applications can vary widely. Foremost in

Published: 3 December 2007

Source Code for Biology and Medicine 2007, 2:9 doi:10.1186/1751-0473-2-9

Received: 21 January 2007Accepted: 3 December 2007

This article is available from: http://www.scfbm.org/content/2/1/9

© 2007 Feagan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

of 15(page number not for citation purposes)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=18053179

http://www.scfbm.org/content/2/1/9

http://creativecommons.org/licenses/by/2.0

http://www.biomedcentral.com/

http://www.biomedcentral.com/info/about/charter/

Source Code for Biology and Medicine 2007, 2:9 http://www.scfbm.org/content/2/1/9

common is sequence alignment [2]. Many other applica-tions are widely used (for example, see EMBOSS [3]) andmore are under development.

The BCJ is designed around characteristics of bioinformat-ics research, which often involves: (1) stepwise applica-tion of a series of methods, (2) re-application of the seriesof methods in search and retry iterations using differentparameters, (3) development of new procedures and com-binations in the rapidly growing discipline, (4) an openenvironment crucially dependent on web-based applica-tion servers and public databases, and (5) rapidly growingdata collections and evolving vocabularies. The volume ofdata and the number of computational experiments createsignificant processing and management complications.

The BCJ framework reported here addresses these prob-lems through data and process management, with mech-anisms for experiment specification and composingcomputations, tracking data provenance and retaining thecontext of computations. Activities such as entering andlocating data, applications, and results are facilitated bythe framework. The BCJ fosters collaboration amongproject teams through structured privacy mechanisms andnaming constructs, increasing productivity in computa-tional experiments for both computationally expert andnon-expert researchers.

BackgroundA computational environment is a coherent interface to aset of applications that integrates them under a single par-adigm. The design goal is to simplify the use and manage-ment of applications and their output. Here, we areconcerned about bioinformatics applications. Many bio-informatics applications are based on scanning sequencesfor features relative to some database. This activity is a sig-nificant application in the field spurred by an exponen-tially growing sequence collection archived at nationalcenters, now at around 65 billion bases at the NationalCenter for Biotechnology Information (NCBI) [1].Numerous applications are standard in these studies, forexample, BLAST for sequence alignment [4], Markovmodels for profiling related sequences [5], and GrailEXPfor searching for predicting genome features [6]. Theseand other programs are often integrated together in differ-ent ways, raising the challenge in bioinformatics of man-aging trails of data and methods applied to them. Oftenprograms are composed together in ad hoc ways. This canlead to errors in interpreting results, lost time in rewritingthe software glue to compose the programs, and overallpoor solutions. The problem is not insignificant, andincludes scientists from numerous disciplines involved inthe interdisciplinary bioinformatics field. There are inter-mediate solutions to unified approaches to using bioin-formatics software such as the somewhat standardized

interfaces and data models provided by Bioperl and Bio-java [4], but these appropriately do not attempt a user par-adigm.

The BCJ is geared for bioinformatics applications wherecomputations are often organized as workflows. A work-flow consists of a series of computations connected by aflow of information between them (cf. task/control-basedworkflows [5]). Computational experiments are com-posed of workflows. As in the laboratory, computationsare often repeated or modified to create new experimentsto test additional hypotheses. When considering compu-tational environments, there are several distinct factors tobe considered: the execution environment, computationdesign, and the management of experiments and data.

A number of execution environment issues need to beconsidered. First, the target computing platform may be astandalone workstation, a local-area cluster, or a wide-area grid. In the wide-area grid scenario, integration ofresources may be difficult because of decentralized secu-rity and administrative tasks. Second, the native executionenvironment may support simple, linear execution pipe-lines where the output of each stage drives only one input,or more complicated execution graphs where one outputcan branch to provide multiple inputs. Iteration and con-ditional branching are challenging execution features toinclude because there are many iteration and conditionalbranching strategies to choose from. Third, extension ofthe environment may be done with internal tools that arepart of the (target) execution environment, or with exter-nal tools that a user runs on his/her local workstation sep-arately from the execution environment. Fourth, webservices can form the basis for an execution environment.

The graphical user interface paradigm for specifying theexperiment design is central to the usability of the envi-ronment. Ease of use is achieved with an intuitive andconsistent interface to applications, data, and results.Experiment-management issues including archiving,searching, modifying and retrieving existing workflows,searching stored data, searching annotated workflows/experiments, and the ability to export data to and importdata from external sources impact the usability of theenvironment. Finally, other miscellaneous features, suchas access controls on data, data provenance, experimentrepeatability, collaboration, and annotation of experi-ments are important.

Computational journal overviewThe BCJ documents the entire computational experiment.The BCJ has mechanisms to record the complete contextof the experiment to enable its review and re-execution.Furthermore, by recording this information, the ability tointerpret results is enhanced. Although the range of com-



putational experiments in bioinformatics research isextensive, the experimental processes have much in com-mon. We leveraged the similarities to arrive at our frame-work for this domain. We chose the traditional laboratorynotebook, long used by biologists to record wet lab exper-iments, as the base paradigm. However, the BCJ is distinctfrom a lab notebook. The BCJ's task is limited to provid-ing support for managing and performing computationalexperiments with a complete electronic record to enableadvanced functions like searching and collaboration.From the user perspective, the top-level structure of thecomputational journal is a collection of journal entriesand relationships among these entries. More formally, anentry is the basic unit of information provided by the enduser for the BCJ. Each entry is characterized by a single con-tent type that identifies the format of this information. Thenumber of content types is user extensible, allowing theBCJ to support new data formats and programs. Relationsamong entries provide an organizational structure to ena-ble tracking entry dependencies, e.g., entry A "is an execu-tion input" for entry B. These relationships are definedautomatically as the entries are created and then they aremaintained by the BCJ environment.

Our implementation of the BCJ uses the Eclipse infra-structure [7]. The Eclipse project is an open source soft-

ware framework that provides a development platform forbuilding applications. Eclipse is based on a plug-in frame-work that supports the requirementsdiscussed above.Eclipse provides a rigorously tested, refined, and docu-mented infrastructure for the management of plug-ins. Inaddition, we were able touse several Eclipse sub-projects,such as the Graphical Editor Framework (GEF), to provideinfrastructure for BCJ – specific editors. By using Eclipse,we have been able to focus directly on the development ofa set of plug-ins and functionality particular to theBCJ.

The BCJ is composed of a two-tier architecture with a sin-gle server and multiple clients. The server is located on acluster while clients run a multi-platform application sup-ported on Windows, Linux, or OS X. The security modelprovides encryption between the client's computer, theBCJ database, and the cluster, as well as the concept ofgroup-based access control. Workflows are defined graph-ically with the client, and the requested computations areloaded into the computational resources for execution.Tools can be added using XML and an execution wrapperscript (for example, specific bioinformatics applicationslike BLAST). The BCJ, similar to Wildfire (see Appendix),offers a drag-and-drop interface for workflow creation(Figure 1). The BCJ does not provide iterations and condi-tional branches, but these are planned.

Sample BCJ SessionFigure 1Sample BCJ Session.



In addition to providing program integration, the BCJmanages workflows, data, and experiment definitions.Data that are input to and output from workflows arestored and linked with the corresponding execution spec-ification, thus tracking all information related to a com-putational experiment. For example, users can search forexperiments that used specific data as an input or createdspecific data as an output. Each execution of a workflowcan be identified, along with the input, output, andparameters. There are also options to search based on con-tent type, owner, journal, and user annotations of work-flows and experiments to facilitate experiment creationand management. The BCJ can import data from externalsources such as local files, program output, or webresources; and it can export data to a file, for other use.

The BCJ provides four of the five types of provenancedescribed in the taxonomy of data provenance techniquesdeveloped in [8]. These four types are: (1) audit trail(complete traces of the execution can be stored); (2) rep-lication recipes (the steps, parameters, and other transfor-mations are recorded in experiments); (3) attribution(authentication and database control over ownershipmetadata ensures proper pedigree can be established);and (4) informational (annotations of varying types alongwith indexing of data). (The fifth type of provenance isdata quality, i.e., fitness of the data for an application.)Data, workflows, and experiments can be shared throughgroup-level access control lists. Scientists working on thesame project can publish their work to other usersthrough the BCJ. Annotation of experiments, workflow,and data can be textual or may be of other types, forinstance, a Microsoft Word® document or a picture of aplot related to an experiment.

ComparisonsFour aspects were used to analyze the environments dis-cussed above: execution environment, workflow design,data management, and additional features. A comparisonbased on these aspects with other computations is dis-cussed below. (See the appendix for introductions to the

systems discussed in this section: PathPort, Pegasys, PISE,BioCoRE, MIGenAS, Taverna, and Wildfire.)

ExecutionMost surveyed solutions were designed to be run locallyor on the grid–few were designed for cluster-based sys-tems. Neither PathPort nor PISE were implemented forcluster-based execution. MIGenAS was written to be runon only one specific cluster. BioCoRE, Pegasys, Wildfire,Taverna, and BCJ allowed execution on a local cluster.Only Pegasys, Wildfire, Taverna, and the BCJ offer work-flow capabilities. PISE and MIGenAS allow for linkingprogram input and output, but neither goes beyond aserial pipeline. Furthermore, BioCoRE, MIGenAS, andWildfire do not offer the ability to add tools to the server'sexecution environment, while only PathPort, Wildfire,and the BCJ allow client-side, external tools to be added.Only the BCJ and PathPort allow for both internal andexternal extensibility. Table 1 summarizes the executioncapabilities of the environments.

Workflow DesignAll of the surveyed solutions provide a GUI, however onlyPegasys, Wildfire, and the BCJ provide an interactive,drag-and-drop interface. Unlike the BCJ, all surveyed solu-tions use input forms that are loosely coupled to the work-flow on a separate page from the workflow specification.The input of parameters for each element in the BCJ isclosely coupled to the associated workflow, as is discussedbelow. Wildfire and Taverna provide iterations andbranching based on user specifications. Wildfire iterationsare essentially while loops based on user-supplied imper-ative programs; Taverna iterations are defined graphically.Table 2 summarizes the workflow design capabilities ofthese environments.

Data ManagementPathPort, BioCoRE, and the BCJ are the only environ-ments that provide data management. Additionally, onlyPegasys, Wildfire, Taverna, and the BCJ provide workflowmanagement. The combination of workflow and data

Table 1: Execution capabilities of analyzed systems

Tool Cluster-Based Execution Workflow/Pipeline Web Services Internal Tool Extensibility

External Tool Extensibility

PathPort No No Yes Yes (Web Services) YesBioCoRE Yes No No No No

PISE No Pipelines No Yes (XML) NoMIGenAS Not locally Pipelines No No NoPegasys Yes Workflow No Yes (XML) NoWildfire Yes Workflow Yes No YesTaverna Yes (Grid Based) Workflow Yes Yes (Web Services) No

BCJ Yes Workflow No Yes (XML) Yes



management increases the usefulness of the environment.Table 3 summarizes the data management capabilities.

Additional FeaturesOf the systems reviewed, only the BCJ and Taverna pro-vided the ability to track the provenance of data createdwithin it and provided the ability to repeat an experimentunder the same conditions. Note that for the BCJ to sup-port experiment repeatability, program versioning needsto be implemented. This provides the ability to re-runcomputational experiments. The BCJ and BioCoREencrypt information exchanges between the client and theserver. Additionally, the collaborative features in severalreviewed systems consisted of data sharing by exportingdata/workflows to a shared file and sharing the filedirectly. Collaboration at the system level is offered by theBCJ. BioCoRE offered many collaborative features, includ-ing message boards and chatting, while the BCJ nativelyhandled collaboration through the sharing of data, work-flows, and experiment journals. Annotation of data andexperiments, like in BioCoRE, Wildfire, Taverna, and theBCJ, is a valuable feature and is an integral part of work-flow and data management. Table 4 summarizes thesefeatures.

Design and implementation of the BCJAbstract data interfaceThere has been a proliferation of specialized tools, inter-faces, and ad hoc scripts to handle specific, repetitive tasksin narrow domains. The generality of these tools is some-times limited by implicit data formatting, organization,and naming conventions. The presence of these low-leveldetails can impede a system's usefulness. In the BCJ envi-ronment, the interface has been designed to shield theuser from these details; every data element is simply anentry, regardless of its specific content type, providing auniform, high-level abstraction to the user. Users are insu-lated from tedious low-level abstractions such as files sys-tems and their computational hardware resources.

The BCJ environment maintains every entry created byevery user. An entry can be a bioinformatics application, aworkflow, an experiment, or a data set. The underlyingframework collects and maintains a set of meta-data foreach entry, including its name, author, creation and mod-ification dates, and content type. Every application thatplugs into the BCJ environment is an entry and is accessi-ble through the standard entry interface. Many BCJ func-tions manipulate only meta-data fields of an entry, otherBCJ functions are designed to manipulate only a specificcontent type, and are enabled only when an entry contain-ing this content type is selected.

The BCJ environment provides a variety of navigators.These are functions that provide a hierarchical view of acollection of entries. Navigators establish a virtual organi-zation of the entries. This allows the user to see the dataorganized to match a specific activity, by selecting the nav-igator most appropriate for the task. The most basic navi-gator, the Journal navigator, displays entries according tothe journal in which entries are created. This navigatorprovides a view of entries, similar to a common file hier-archy with journals corresponding to file directories.Other navigators display entries related to a specific entry.

Table 2: Workflow design of analyzed systems

Tool Web Interface Capabilities

Drag-and-Drop Creation

Iteration/Conditional Branching

PathPort No No NoBioCoRE Yes No No

PISE Yes No NoMIGenAS Yes No NoPegasys No No NoWildfire Yes (Client Applet) Yes YesTaverna No No Yes

BCJ No Yes No

Table 3: Data management properties of analyzed systems

Searching

Tool Data Management Workflow (Experiment) Management

Data Workflows Data Export/Import to/from Other Tools

PathPort VBI Curated Data No Yes No YesBioCoRE Yes (BioFS) No No No Yes

PISE No No No No YesMIGenAS No No No No YesPegasys No Yes No No YesWildfire No Yes No No YesTaverna No Yes No No Yes

BCJ Yes Yes Yes Yes Yes



For example, the dependency navigator displays entriesthat depend on the selected entry.

The BCJ can be extended through the addition of naviga-tors. As new tools and techniques are incorporated intothe environment, it is likely that some will implementnovel data organization schemes requiring a new naviga-tor.

Shared access to all entries must be balanced with theindividual's need for security and privacy. The BCJ securitymodel only allows the creator of an entry to modify it.This simple model is consistent with the larger computa-tional journal paradigm. The journal is primarily a recordof experimental steps. Once completed, a step may nolonger be modified, however, it may be followed by addi-tional experimental steps to refine results and investigatevariations on the base experiment. This characteristic aidsin maintaining the replication recipe provenance. If anotheruser wants to make changes to data or an experiment cre-ated by another user, the user may create a new entry as acopy of the original one using the SaveAs operation,thereby maintaining data pedigree.

Read access to every entry is controlled by the group accessattribute of the entry, where a group is a set of users. Onlymembers of the associated group are able to view an entryand its contents. All manner of access to an entry is con-trolled by this attribute. Search results will only containentries for which the initiator has group access. The BCJenvironment provides administrative tools to managegroups.

The BCJ environment provides another form of accesscontrol. Every entry contains a committed attribute. Onceset, the content of the entry may no longer be modified byany user, including the entry's creator. The committedattribute of an entry can never be reset. The committedattribute is fundamental both to encourage sharingamong users and to ensure reproducibility of experimen-tal results.

Single framework for diverse programsA key feature of bioinformatics research is the breadth ofprograms that are being used by team members to addressa shared goal. Frequently, specialized user interfaces areconstructed around a set of closely related programs tofacilitate their usage and to help perform repetitive tasks.However, integration of the process and results amongthese specialized interfaces is often difficult. The BCJ fun-damental paradigm leverages the similar structures ofcomputational experiments in biological experiments.These often are chained computations where the outputof one is the input of another. However, the program'sdata formats and graphical interfaces can vary widely.Thus, within its common framework, the BCJ provides theflexibility for the problem – specific components. The BCJenvironment provides common support for the experi-mental process, and flexibility and extensibility in the frame-work in these three key areas.

Designing a database schema broad enough to handle allforms of biological data can be a daunting undertaking.Specialized systems such as the Genomic Unified Schema(GUS) are designed for this [9]. For the BCJ role in per-forming and recording computational experiments, thisgenerality is not a critical requirement. We have chosen adata model that simply stores the content of an entry as anuninterpreted stream of bytes. The BCJ domain does notattempt a structured biological database schema. Othersystems exist for managing this problem with issues dis-cussed elsewhere [9]. Instead, the BCJ uses a simplegeneric entry. Every entry is stored in the same table witha very simple schema. This schema consists of a fixednumber of fields such as name, author, creation time, datatype information, and a pointer to actual content which isstored as a vector of bytes. With this simple structure, sup-port for a new data format simply requires the definitionof a new content type.

No update or extension of a database schema is required.All BCJ environment functions that operate on theabstract entry interface, such as navigators and search

Table 4: Additional features of analyzed systems

Tool Data Provenance Experiment Repeatability

Group Access Encryption Collaboration Annotation

PathPort No No No No Yes (Exporting Data) NoBioCoRE No No Yes Yes Yes Yes

PISE No No No No Yes (Exporting Workflows) NoMIGenAS No No No No No NoPegasys No No No No No NoWildfire No No No No Yes (Exporting Workflows) YesTaverna Yes No No No Yes (Exporting Workflows) Yes

BCJ Yes Yes Yes Yes Yes (Natively) Yes



tools, implicitly handle entries with content in the newformat.

The BCJ supports a wide variety of raw data types. This isachieved in the storage architecture through an extensibleset of content types. There are several parallels betweenrequirements for Internet browsers and the BCJ environ-ment:

1. The number of content types will grow after the productis delivered; and reissuing the product to handle new con-tent type(s) is undesirable;

2. Each user does not require handlers for all contenttypes;

3. There may be multiple handlers for the same contenttype;

4. It is desirable to distribute development of contenttype-specific handlers to external groups with specificinterest and knowledge related to the content type.

The content types are presented and manipulated througha plug-in framework that has become ubiquitous in Inter-net browsers. Although a BCJ environment may containmany plug-ins, the user sees a relatively simple, coherentinterface (Figure 1). The functionality that a user interfaceplug-in provides fits into one of four categories whereeach member of a category adheres to a common inter-face: entry navigation, entry editing, auxiliary views, andexperiment execution.

From the user perspective, the BCJ provides a mechanismto navigate among the potentially large number of entries.Each navigator provides a virtual organization of theentries, based on one or more relations among the entries.This allows the user to see the data organized to match aspecific activity, by selecting the navigator most appropri-ate for the task. At creation time, each entry is assigned aparent journal. This provides an immediate and naturalhierarchical view of all entries. This journal relationshipprovides a view of the entries that closely resembles thefile directory hierarchy familiar to all computer users. Thisis the most basic of the BCJ navigators. Graphically, it usesa tree-list display that is widely used on computers topresent such hierarchies, but with annotations that pro-vide more detail than a basic hierarchical file system inter-face.

When an entry is selected, a plug-in associated with theentry's content type is selected to edit or view the content.Graphical editors are provided for workflow definitionsand experiment definitions. When a new content type isdefined in the BCJ environment, it is natural to incorpo-

rate a new editor plug-in tailored for the type. Many of theprogram inputs and outputs in the bioinformaticsdomain are textual. For these types, a generic text editorplug-in can be used. The editing area in the BCJ environ-ment is subdivided into multiple panes so that the usercan conveniently compare the content of two or moreentries.

Some data formats, particularly binary data formats, aresupported by sophisticated and optimized viewer pro-grams. For example, three-dimensional graphical viewersare available for molecular dynamics simulation outputs.It is not practical to convert these programs to operatedirectly inside the framework. The BCJ environment pro-vides an interface that allows a user to associate an exter-nal viewer, a program installed on his/her client machine,as the default viewer for a specified content type. When-ever an entry of one of these content types is selected forviewing, the BCJ environment automatically generates alocal copy of the data on the client machine and invokesthe associated program with this input.

In addition to editors and navigators, the BCJ environ-ment provides a variety of additional views that presentauxiliary information to support these primary tasks, suchas the Queue Manager view, which displays the progressof currently executing experiments.

Precise descriptions of BCJ programs are kept in XML doc-uments. This description is called a resource definition inthe BCJ environment. The Document Type Definition(DTD) for the XML data can be found in [10]. The BCJenvironment currently contains about sixty programs/resource definitions (see Table 5). Many resource defini-tions were derived from the XML descriptions for the cor-responding programs in PISE [11].

Graphical interface for experiment definitionExperiments are defined with simple block diagrams in ahierarchical fashion. Each block represents a program, orsequence of programs, to be executed. Input and outputconnections are made between blocks. In the BCJ environ-ment, a block diagram is called a workflow. Figure 2 is anexample of a simple workflow. Workflows provide a con-venient and intuitive way for any user to define a custom-ized task. Users are not required to be familiar withprogramming languages to define complex tasks. Oncedefined, a workflow may be reused. Workflows are justanother entry content type, so they can be shared amongall BCJ users. Experiments can be joined by connectingresources to form workflows. A data source can be used asan input to one BCJ resource with the output directed toanother resource. The BCJ also provides a mechanism toannotate workflows. Figure 3 provides an example of asimple workflow definition.



Management of data provenanceThe details of the experimental process are nearly asimportant as the experimental results themselves. The BCJenvironment collects and maintains this informationtransparently, providing organizational views of data andexperiments that focus on data provenance. With theprovenance data maintained in the BCJ environmentusers can:

1. Review all the details of an experiment at anytime,

2. Regenerate the results by re-executing the experiment,

3. Run similar experiments and compare with the originalresults.

When reviewing an experiment, the dependency navigatorin the BCJ environment is particularly useful. The depend-ency navigator collects the set of entries upon which aselected entry is dependent, in a concise hierarchical view.The first level of the hierarchy contains the immediatedependencies of the selected entry. The second level of the

hierarchy contains the dependencies for each of theseentries, and so on. Figure 4 displays a sample dependencyview.

Computational journal experiment examplesIn this section we describe two computational experi-ments performed in the BCJ environment, highlightingmany of the features described above. The two exampleswere chosen from significantly different domains in orderto demonstrate that the BCJ environment is sufficientlyflexible to support a wide variety of applications. A usermanual for the BCJ can be found in [12].

Sequence-based investigationTo demonstrate sequence analysis in the BCJ environ-ment, a simple example experiment was implementedthat tries to identify a protein that may be a potential tar-get for drug therapy in treating malaria. If there is suffi-cient divergence between some protein of the parasite andthe host, then inhibition of the parasite protein is a poten-tial target for drug therapy. In this experiment, we areinterested in the divergence between the mosquito Plas-

Complete Experimental WorkflowFigure 2Complete Experimental Workflow.

Table 5: Resources currently defined in BCJ

Suite/Class Resource Definitions

Sequence search and alignment SimpleBlastall, blastn, blastp, ClustalW, extractSequencesHMMER (biosequence analysis using hidden Markov models) hmmAlign, hmmBuild, hmmCalibrate, hmmSearch, hmmit, hmmer2sam

GROMACS (a molecular dynamics package) editconf, genbox, grompp, mdrun, pdb2gmxGLIMMER (a system for finding genes in microbial DNA) glimmer

GrailEXP (predicts exons, genes, repeats, and CpG islands) grailAlign, grailCPG, grailExon, grailGeneAssembly, grailRepeatsEMBOSS (The European Molecular Biology Open Software Suite) backtranseq, banana, bl2seq, btwisted, cai, chips, codcmp, coderet, compseq,

cpgreport, distmat, einverted, equicktandem, fuzznuc, garnier, geecee, iep, marscan, msbar, newcoils, newcpgseek, octanol, palindrome, pepcoil, pepinfo, pepstats, primersearch, profit, prophecy, prophet, recoder, redata, shuffleseq,

silent, water



modium falciparum casein kinase and its vertebrate hosts.There are three applications used by the experiment:BLAST, CLUSTALW, and HMMER. The experimentinvolves the following computations: (1) identify similarproteins to the human casein kinase enzyme using BLAST,(2) create a global alignment using the sequences of thesimilar proteins that were extracted from the BLAST reportusing CLUSTALW, (3) construct a hidden Markov model(HMM) based on this multiple alignment, using compo-nents of the HMMER program suite, (4) run the modelagainst the proteins in the Plasmodium falciparum genomicdatabase (PlasmoDB) to identify proteins that may befunctionally equivalent to human casein kinase.

In this experiment, we build two workflows. The firstworkflow contains the HMMER components. The secondworkflow, which will contain the complete experiment,will incorporate the HMMER workflow and the BLASTand CLUSTALW components.

The HMMER [9] program suite for biosequence analysisusing profile HMM contains several components that arefrequently used in a fixed order. In this workflow, wedefine a reusable component that may find value in otherstudies involving HMM. This workflow will construct anHMM based on the multiple alignment input, calibratethe model, and finally search a sequence database usingthis model.

To increase the flexibility of this computational compo-nent, the database is provided as a second input. As adesign alternative, we could have chosen to specificallyinclude PlasmoDB in this workflow (as a data sourceblock) and have a single input to this workflow. A pictureof the final workflow can be seen in Figure 3.

This workflow could be useful to a wide variety ofresearchers and does not contain any sensitive data ornovel computation techniques, thus we set the groupaccess control for this entry to ALL, so any user in the BCJenvironment will have access to this workflow. To makethe entry easier for others to find we could place the entryin a journal designated for shared workflows, and/or addan annotation briefly describing its functionality, so thatother users can locate it with a search query.

In the complete workflow, we use the previously definedHMMER workflow as a resource. It also includes BLAST, tofind proteins similar to the input protein, and CLUS-TALW, to perform a multiple alignment of these similarproteins. The CLUSTALW program does not accept inputin the format generated by the BLAST program, so theblock, Extract Sequence, was inserted between them toperform the appropriate data conversion. Resource defini-tions include type information for each of their ports; theWorkflow Editor does not permit connections betweenports with incompatible types. These constraints helpusers construct only meaningful and valid workflows.Resources such as Extract Sequences serve the necessaryrole of automatic resolution of data reformatting. In addi-

Dependency ViewFigure 4Dependency View.

HMMER WorkflowFigure 3HMMER Workflow.



tion, users can define new resource definitions that per-form data format conversions to meet their specificrequirements. The completed workflow is shown in Fig-ure 2.

To define an experiment based on this workflow, the userprovides a protein sequence input and a database to besearched with HMMER. For this example, Amino AcidSequence was used, for the input is human casein kinase.The Search DB input is PlasmoDB. The output file, RelatedProteins, will contain the sequences from PlasmoDB thatare most similar to the profile HMM built from thehuman casein kinase related proteins. Figure 5 shows thecompleted experiment definition.

When defining the experiment, the user is queried to spec-ify access control for the experiment definition. For exam-ple, the user may want to restrict any other access to theexperiment and results, by setting the access control toNONE, pending validation of results. Later, the owner canmodify access control to allow access by a wider range ofusers. At the completion of the experiment definition, acollection of jobs to perform the tasks defined by theworkflow are submitted to the computational cluster forexecution. The user tracks the progress of the experimentusing the Queue Manager view.

The output entry generated by the experiment containsthe top sequences from PlasmoDB that match the profilecreated by HMMER. From the results (Figure 6), we seethat the top result has the gene ID of PF11_0377. Thisgene happens to encode P. falciparum's casein kinase 1.Based on these results, we see that this kinase would notbe a good candidate for drug therapy.

The user can summarize what has been learned and anno-tate the experiment as shown in Figure 7. These annota-tions communicate reviews to other users and provide alog for the author. Annotations also enable search capa-

bilities, since neither the experiment definition nor theoutput entry directly contain the lessons learned from theexperiment.

If the researcher has other candidate proteins to study,another experiment can reuse this workflow definition.Here, the user simply repeats the experiment definitionstep, and supplies another protein for the Amino AcidSequence input. The BCJ environment emphasizes thatproviding a different sequence input is a new experimentdefinition, rather than a modification of the originalexperiment. It will be useful to keep the original experi-ment, even though it was not successful in locating apromising drug target. These results may be helpfulmonths later if someone asks, "Have you consideredcasein kinase as a possible target for malaria?" The BCJenvironment provides three alternatives to find an answerfor this question:

1. The dependent viewer in the BCJ environment is partic-ularly helpful for answering such questions. Theresearcher could select the casein kinase sequence entry,and find all experiments in which it is used as an input;

2. Alternatively, the researcher could use the dependencyviewer, which is effectively the inverse of the dependentviewer. With this viewer, the user can select the workflowdefinition and immediately find all experiments, basedon the workflow, that have been executed;

3. Perform a content-based search to locate an annotationthat summarizes the result of a previous experiment.

Molecular dynamics experimentTypically, molecular dynamics (MD) simulations involvea large number of simulations, where each varies from theprevious only in the input data. The organization of thecomputations is often tailored to some specific applica-tion such as drug design or methodology development.

Experiment DefinitionFigure 5Experiment Definition.



Analysis of the simulation results will vary widely betweensimulation goal and investigators' styles. In addition,although the simulations are repetitive, there are numer-ous places during the study where interactions with theprocess and output are needed. Here, we demonstrate theuse of the BCJ on such a representative MD simulation.For this experiment, we use the GROMACS package [13]and an example based on a tutorial [14].

There are five major steps in this experiment: (1) convert-ing the protein structure from a common database format,PDB (Protein Data Bank), into a format suitable forGROMACS, (2) minimizing the energy of the system, (3)solvating the protein in a simulation box with water, (4)performing an MD simulation from the initial state, and(5) analyzing the simulation results. (Often, proteins aresolvated then minimized, but for purposes of demonstra-tion, the protein is minimized first.)

Adding an Annotation to the ExperimentFigure 7Adding an Annotation to the Experiment.

Experiment ResultsFigure 6Experiment Results.



Each step can be run as an independent experiment withthe outputs of each step appearing as entries in the user'sjournal. This stepwise procedure is useful for reviewingintermediate results. Alternatively, the steps can be incor-porated into a single workflow, which is useful for a testedsimulation procedure. The current example represents anintermediate approach in order to demonstrate aspects ofthe BCJ environment; two sub-workflows are combinedinto another workflow to perform the end-to-end experi-ment.

The workflow "Energy Min" shown in Figure 1 combinesthe first two steps of the procedure, since these two stepsare standard initialization steps. This workflow has beendesigned with reuse in mind. The precise parameter set-tings for minimization may depend on the simulatedmolecule (one of the inputs); thus, a second input isdefined for this workflow "MD params," which will trans-late the simulation parameters into GROMACS format. Inaddition, several output ports are defined for this work-flow that are not specifically required in the example here.However, these additional outputs may be of use inanother application that chooses to reuse this workflow.

The workflow "MD Sim" shown in Figure 8 combinessteps three through five of the procedure outlined above.

As with the previous workflow, care was taken so that theworkflow could be reused. The "Solvent" input for theresource "genbox" is provided by the data source labeled"spc216.gro." An entry for this data in the BCJ environ-ment was created by using the "Create New Entry" inter-face to import this information. It contains a GROMACS-specific description of "Simple Point Charge water" that isoften used as the solvent in GROMACS experiments.

The workflow "Tutorial Example" shown in Figure 9 rep-resents the complete process from the GROMACS tutorial.It makes use of the two supporting workflows described inthe previous sections, demonstrating the hierarchicalworkflow definition integral to the BCJ environment. Asdescribed above, the supporting workflows were notdesigned only for this example tutorial. Each wasdesigned generally, so that it could be used as a compo-nent in other molecular dynamics studies. Outputs notneeded can be discarded by connecting each of them to a"Sink" block. The Workflow Editor requires these explicitconnections since the ports to which they are connectedwere required. This consistency check by the WorkflowEditor helps to ensure workflow correctness.

The workflow "Tutorial Example" defines only the processabstracted from the GROMACS tutorial. To begin a new

MD Simulation WorkflowFigure 8MD Simulation Workflow.



project (or experiment), the user must import the originalinput data from an external source such as a public data-base. When data is imported to the BCJ, an entry is createdand its content is initialized to that of a file provided bythe user. The BCJ physically copies the data into the envi-ronment with subsequent references to the data in the BCJthrough the entry. Thus, this experiment began by havingthe user import a number of files into the BCJ. Conse-quently, the workflow, Tutorial Example, allows one toperform this experiment for any protein from the PDB.Figure 10 displays the specific experiment definition cor-responding to the example in the tutorial. The proteinsupplied as input is lysozyme, a serine protease. The MD

parameters for the energy minimization are those sup-plied in the tutorial.

Numerous specialized programs have been developed formolecular visualization and interaction with moleculardynamics simulation. Rather than attempt to developanother molecular visualization subsystem directly in theBCJ environment (and then require BCJ users to becomefamiliar with a new user interface), the BCJ environmentinvokes an existing tool, integrated as an external viewer,to display these results. For example, VMD, Visual Molec-ular Dynamics [15], can be used to display the outputstructure.

Experiment DefinitionFigure 10Experiment Definition.

Tutorial Experiment WorkflowFigure 9Tutorial Experiment Workflow.



ConclusionBioinformatics research involves computational process-ing of multi faceted biological data at scales from atomiclevel to cellular. Sophisticated environments are neededto support the diverse and expanding range of computa-tional methods, scales, models, and data. The inherentlycollaborative work requires managing data and methodsbetween colleagues and public-domain databases. Com-plex experiments are often involved, which need to bechecked, altered, and documented. The BioinformaticsComputational Journal (BCJ) reported here provides sup-port for these activities. The BCJ handles data manage-ment and process specification in a single framework,demonstrated here with two representative bioinformat-ics problems. The system was founded on Eclipse, provid-ing a rigorously tested and maintained softwareinfrastructure. The user paradigm was designed to resem-ble a laboratory notebook, inline with the target users. Wefound this paradigm a natural basis for describing compu-tational experiments. Features are provided in the BCJ toauthenticate and establish pedigree of experiments. This isa crucial feature toward verifying and documenting com-putational results. Future efforts include adding iterationand branching functionality and enabling for grid collab-orations.

Competing interestsThe author(s) declare that they have no competing inter-ests.

Authors' contributionsLF designed and developed the data management archi-tecture for the BCJ. JR contributed to database develop-ment. AG and HA contributed the computationalexamples. EK was senior software architect for the BCJ. DJcontributed to the database and computing architecture.AH provided cluster computing support for the BCJ. TCand GL provided bioinformatics direction. GM conceivedof the Computational Journal architecture and directedthe software development team. VF was the project princi-pal investigator, and coordinated and contributed to thedevelopment of the BCJ and the writing of this paper. Allauthors read and approved the final manuscript.

Appendix of computational environmentsPathPort [16], developed by the Virginia BioinformaticsInstitute (VBI), is designed to combine information aboutpathogens with analysis and visualization tools. It usesweb services to provide remote execution of independent,single process jobs in a non-cluster environment (it isgrid-based, hence designed for execution over a heteroge-neous, geographically distributed network of individualcomputers shared by users).

BioCoRE [17], developed by the Theoretical and Compu-tational Biophysics Group at the University of Illinois atUrbana-Champlain, is designed as a biological collabora-tive environment. Through a combination of web pagesand Java applets, BioCoRE combines the capabilities ofproject management, instant message, message board, labnotebook, and file sharing software with several specificapplications, all designed for the simulation and visuali-zation of molecular dynamics experiments, includingNAMD [18], VMD (Visual Molecular Dynamics) [19], andMDTools [11]. BioCoRE also allows the submission ofjobs to a cluster through a web page.

PISE [11], developed by the Pasteur Institute, is designedto offer a web interface to bioinformatics programsdescribed in XML. The tools in PISE were not designed forexecution on a cluster. (However, as part of this worksome of the PISE applications have been modified to facil-itate cluster computing.) The basic functionality of PISE isdesigned to execute programs sequentially. However,there is a plug-in called G-Pipe, which allows pipelines tobe defined.

MIGenAS [20], the Max-Planck Integrated Gene AnalysisSystem, is similar to PISE in that it offers a web page pro-viding the ability to graphically enter program parametersand execute linear pipelines. Unlike PISE, which can bedownloaded and run locally, MIGenAS performs all oper-ations on servers at the Max-Planck Institute. It comeswith a set of tools, and at this time does not provide amechanism to add external or internal tools.

Pegasys [21], developed by the University of BritishColumbia's Bioinformatics Center, is designed to graphi-cally create and execute bioinformatics workflows forhigh-throughput analysis. It offers clients for Linux, Win-dows, and Macintosh. It comes with a set of analysis pro-grams, but allows internal tool expansion through thecreation of XML tool definitions.

Wildfire [22], like Pegasys, is developed to create and exe-cute bioinformatics workflows. Wildfire is available asboth a standalone and a web-based workflow applicationthat runs on a local machine, on the Grid, or on a cluster.It allows drag-and-drop creation of both simple linearpipelines and complex branching workflows. Wildfireprovides some forms of iteration and conditional branch-ing.

Taverna [23], a commonly used standalone workflow sys-tem, was developed to provide language and softwaretools for workflows as a component of the myGridproject. Taverna provides the ability to create complexworkflows including iteration and conditional branching,as well as a scripting language.



Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

Other systems include the commercial software PipelinePilot [24], Ergatis [25] (which is still in development),and Biopipe [26] (which is no longer actively supported).The SciRun software has capabilities that are conceptuallyrelevant to this work (workflow editing, procedure repli-cation, job control, data management, etc.), but isdesigned for imaging rather than molecular computa-tional biology [27]. A commercial data mining work-bench from SPSS that includes a workflow editor has alsobeen developed[28].

AcknowledgementsThis work was supported in part by the U.S. Department of Commerce under Grant No. BS123456 and by U.S. Army Contract W911SR-04-C-0087.

References1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL:

GenBank. Nucleic Acids Research 2007, 35(suppl_1):D21-25.2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local

alignment search tool. Journal of Molecular Biology 1990,215(3):403-410.

3. Rice P, Longden I, Bleasby A: EMBOSS: the European MolecularBiology Open Software Suite. Trends Genet 2000,16(6):276-277.

4. Mangalam H: The Bio* toolkits--a brief overview. Briefings in Bio-informatics 2003, 3(3):296-302.

5. Ludascher B and Groble: Introduction to the Special Section onScientific Workflows. Sigmod Record 2005, 34(3):3-4.

6. Xu Y, Uberbacher EC: Gene prediction by pattern recognitionand homology search. Proceedings of the Fourth International Confer-ence on Intelligent Systems for Molecular Biology; ISMB 1996, 4:241-251.

7. D'Anjou J, Fairbrother S, Kehn D, Kellerman J, McCarthy P: The JavaDeveloper's Guide to ELIPSE. 2nd edition. Boston , Addison-Wesley; 2005.

8. Simmhan YL, Plale B, Gannon D: A survey of data provenance ine-science. In SIGMOD Record Volume 34. Issue 3 ACM; 2005:31-36.

9. Davidson SB, Crabtree J, Brunk BP, Schug J, Tannen V, Overton GC,Stoeckert CJ Jr: K2/Kleisli and GUS: experiments in integratedaccess to genomic data sources. IBM Systems Journal 2001,40(2):512-531.

10. Document Type Definition for XML data [http://silk.bioinfo.ittc.ku.edu/cj/schema/ResourceDef.dtd]

11. Letondal C: A Web interface generator for molecular biologyprograms in Unix. Bioinformatics 2001, 17(1):73-82.

12. Frost V, Clark T, Gauch S, Lushington G, Minden G, Komp E, HockA, Johnson D, Feagan L, Garrett A, Rohrer J, Amthauer H, Ozor A:Bioinformatics Computational Journal: User Guide, Techni-cal Report ITTC-FY2008-TR-38270-04. 2007 [http://www.ittc.ku.edu/publications/documents/Bioinfo_Comp_Journal_User_Manual.pdf].

13. GROMACS: Fast, Free and Flexible MD [http://www.gromacs.org]

14. Molecular Dynamics Tutorial (GROMACS) [http://md.chem.rug.nl/education/mdcourse/index.html]

15. Humphrey W, Dalke A, Schulten K: VMD: visual moleculardynamics. Journal of Molecular Graphics 1996, 14(1):33-38.

16. Eckart JD, Sobral B: A Life Scientists Gateway to DistributedData Management and Computing: The PathPort/ToolBusFramework. OMICS: A Journal of Integrative Biology 2003, 7(1):79-88.

17. Bhandarkar M, Budescu G, Humphrey WF, Izaguirre JA, Izrailev S, KalLV, Kosztin D, Molnar F, Phillips JC, Schulten K: BioCoRE: a collab-oratory for structural biology: San Francisco, California. ;1999:242-251.

18. Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E,Chipot C, Skeel RD, Kale L, Schulten K: Scalable moleculardynamics with NAMD. J Comput Chem 2005, 26(16):1781-1802.

19. VMD - Visual Molecular Dynamics [http://www.ks.uiuc.edu/Research/vmd/]

20. Letondal C: A web interface generator for molecular biologyprograms in Unix. Bioinformatics (Oxford, England) 2001,17(1):73-82.

21. Rampp M Soddemann T, Lederer H.: The MIGenAS integratedbioinformatics toolkit for web-based sequence analysis.Nucleic Acids Research 2006, 1(34W1):5-9.

22. Shah SP, He DY, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GX,Xu T, Ouellette BF: Pegasys: software for executing and inte-grating analyses of biological sequences. BMC Bioinformatics2004, 5:40.

23. Stevens RD, Robinson AJ, Goble CA: myGrid: personalised bioin-formatics on the information grid. Bioinformatics (Oxford, Eng-land) 2003, 19 Suppl 1:i302-4.

24. SciTegic: SciTegic - PP Overview. [http://www.scitegic.com/products/overview/].

25. Angiuol SV: Workflow / Ergatis: A Data Management Systemfor Automated Genome Annotation and Comparative Anal-ysis (poster). Genome Informatics 2005 (Cold Spring Harbor Labora-tory, New York) 2006:Poster.

26. Hoon S, Ratnapu KK, Chia JM, Kumarasamy B, Juguang X, Clamp M,Stabenau A, Potter S, Clarke L, Stupka E: Biopipe: a flexible frame-work for protocol-based bioinformatics analysis. Genome Res2003, 13(8):1904-1915.

27. SCIRun - SCI Software - Scientific Computing and ImagingInstitute [http://software.sci.utah.edu/scirun.html]

28. Data mining, Clementine, Predictive Modeling, PredictiveAnalytics, Predictive Analysis [http://www.spss.com/clementine]






http://silk.bioinfo.ittc.ku.edu/cj/schema/ResourceDef.dtd



http://www.ittc.ku.edu/publications/documents/Bioinfo_Comp_Journal_User_Manual.pdf

http://www.ittc.ku.edu/publications/documents/Bioinfo_Comp_Journal_User_Manual.pdf

http://www.gromacs.org

http://md.chem.rug.nl/education/mdcourse/index.html








http://www.ks.uiuc.edu/Research/vmd/







http://www.scitegic.com/products/overview/



http://software.sci.utah.edu/scirun.html

http://www.spss.com/clementine


http://www.biomedcentral.com/info/publishing_adv.asp


Source Code for Biology and Medicine BioMed CentralSource Code for Biology and Medicine Research...

Documents

Transcript of Source Code for Biology and Medicine BioMed CentralSource Code for Biology and Medicine Research...