
An Object-Oriented Genetics Information System

Elizabeth Shoop¹, Jaideep Srivastava, Paul Bieganski², John Riedl

Computer Science Department, University of Minnesota

Ernest Retzel

Medical School, University of Minnesota

Keywords: computational molecular biology, genome sequencing, object-oriented database, graphical user interface, suffix tree

1. Supported in part by the Army Research Office contract number DAAL03-89-C-0038 with the University of Minnesota Army High Performance Computing Research Center
2. Supported in part by the Medical School, University of Minnesota

Abstract

Sequence data is being produced by genomic sequencing laboratories at ever-increasing rates, making it impossible for individual researchers to keep track of all the new data that might affect their research. Computer systems are needed so that researchers can access this data. The systems must support high-level interfaces that communicate in the language of the researchers, database systems that guarantee availability and consistency of the data, and powerful search systems that rapidly scan for similarities between sequences. We have developed a prototype system that includes a graphical user interface, an object-oriented database management system, and high-performance similarity search algorithms. The prototype has the potential to increase researchers’ productivity by automating entry of annotated sequence fragments as they are produced by sequencing machines, storing the fragments in the database, and automatically producing and displaying similarity search results of new sequences against the large public sequence databases GenBank and PIR. This paper describes the prototype, discusses the benefits of object-oriented databases for complex and changing sequence data, and presents an object-oriented schema for genetic information. Graphical tools for annotating sequences, storing them in the database, automating similarity searches, and viewing similarity search results are presented. A new suffix tree-based data structure that supports rapid similarity searches on sequence data is introduced. Finally, future plans for the system are discussed.

Introduction

Researchers involved in genome and expressed sequence tag (EST) sequencing projects are producing DNA sequence data at such a rapid rate that they need computer systems that will assist them in the management and manipulation of that data. Motivated by this need, we have developed a prototype software system consisting of a graphical user interface (GUI) software layer built on top of an object-oriented database management system (OODBMS) and similarity search algorithms. We have completed prototype tools for phase 1 of the system, which is depicted in Figure 1. The figure shows the database, GUI, and algorithmic (both traditional and high-performance computing) components of the system design. From left to right, the flow of sequence and related information can be seen. The raw sequence is read by the data entry tools on the left. The new sequence and existing public databases are used as inputs into the row of various algorithms for sequence similarity determination. The output of these algorithms is stored in the OODBMS and written to ASCII text reports. The data in the OODBMS can be further manipulated by some of the algorithms. Contigs are produced by the contig re-assembly algorithm. The HTree and Gotoh alignment algorithms are used to refine the functional similarity search for those sequences for which no similarity has yet been found. The HTree alignment algorithm is used to test incoming sequences against existing ones for possible duplication. The HTree repetitive sequence pattern extraction engine performs common sub-sequence analysis on the data, specifically locating ‘unusual’ sequence patterns. Several GUI tools for displaying contigs, similarity results, and pattern matches are also planned and shown in the upper right portion of Figure 1. The goal for the system is a fully automated process in which sequence data is entered by users, automatically searched for any similarity against existing public sequence data and sequence data that has already been produced for a project, and processed into the GenBank public database. The results of the similarity searches will be stored in the database for later inspection by researchers.

The contributions made in this paper are the following:

● The introduction of a database schema for cDNA sequencing projects, based on the object-oriented data model;

● The introduction of GUI tools for annotating and depositing sequence into a database, automating the similarity search process, and presentation of search results; and

● The introduction of a dynamic data structure for sequence data, based on the familiar suffix tree structure, and new algorithms which use this data structure.

In section 2.0, we describe the informatics needs of laboratory sequencing projects. In section 3.0, we outline what has been done in the past to meet those needs. In section 4.0, we discuss the scope of the project and the open issues we intend to address. In section 5.0, we describe the approach and how the system can further enhance molecular biologists’ productivity. In section 6.0, we describe the future plans we have for the system.

© 1993 ACM


Figure 1. High-level Schematic of Project Components [figure not reproduced; legible labels include Raw Sequence, Translation, Protein search, BLASTX, Reports, Tools, Sequence Annotation Tool, Multiple Sequence, Database Components]

Motivation

A large-scale sequencing project involves many researchers, each working on different portions of the genome. Currently, genomic and EST sequencing projects use automated machines to produce sequence on the order of 1 million or more base pairs (bp) per machine-year. However, scientists want to increase this rate, since covering all cDNAs of most genomes in a timely fashion requires considerably more than 1 million bp per year. A rate of at least 4 million bp per year is being sought. At this rate, 8000 fragments of length 500 bp would be generated per machine per year. This corresponds to more than 150 fragments (approximately 75,000 bp) per week.
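The throughput figures above can be checked with a few lines of arithmetic. The sketch below uses only the rate and fragment length quoted in the text; the 52-week year is an assumption made here for illustration.

```python
# Worked arithmetic for the target sequencing rate quoted in the text.
TARGET_BP_PER_YEAR = 4_000_000  # target rate per machine (from the text)
FRAGMENT_LENGTH_BP = 500        # fragment length (from the text)
WEEKS_PER_YEAR = 52             # assumption: output spread over 52 weeks

fragments_per_year = TARGET_BP_PER_YEAR // FRAGMENT_LENGTH_BP
fragments_per_week = fragments_per_year / WEEKS_PER_YEAR
bp_per_week = TARGET_BP_PER_YEAR / WEEKS_PER_YEAR

print(fragments_per_year)         # 8000
print(round(fragments_per_week))  # 154
print(round(bp_per_week))         # 76923
```

The weekly values are consistent with the rounded figures in the text: more than 150 fragments, or roughly 75,000 bp, per machine per week.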

As an example of the activities of sequencing lab personnel, Figure 2 depicts a high-level overview of the process used in a typical cDNA sequencing project. The ‘wet lab’ portion of the experiment conducted by the biologists is shown, along with those tasks that are conducted with the help of a computer system. Since individual labs (and groups of labs participating in a project) have so much sequence data to organize into contigs and explore for possible genes and their function, we propose that these sequences be managed by a database management system (DBMS) and that entry of the sequences themselves into the database be automated. It is not sufficient for the DBMS to be a repository for the sequence information; rather, it must support modeling of the hierarchical and composite structure of genetic information, and be capable of complex data manipulation for creation of contigs and queries for the determination of genes and their function. Molecular biologists need the ability to explore their sequence fragment data for interesting intervals, comparing it to both their own data and that generated by others. The possibility of different labs participating in a project and generating large volumes of total sequence data means that the automated entry and management functions must be distributed across a set of machines on a network and available to participating labs. Additionally, since several researchers at participating labs will be querying and manipulating the data simultaneously, efficient support for distributed querying and concurrent data access is required.

Developing a model for full genome sequencing projects is beyond the scope of this paper, but the concepts presented in Figure 2 can be extended to include them. For full genome sequencing, the larger questions of a sequence fragment’s possible influence on control, expression, development and evolution might be asked in addition to whether the fragment contains a gene and what the function of that gene might be.

Background and Related Work

Past informatics research for managing genomic information encompasses a wide spectrum of various types of databases, algorithmic techniques, and graphical user interface (GUI) applications. First we will introduce a taxonomy of genomic databases and discuss some related work on database systems, then we will discuss other work related to algorithmic techniques and GUI-based applications.

Genomic databases can be categorized by the types of data they contain and the user access and manipulation allowed. A taxonomy of genomic databases is summarized in Figure 3, showing three categories of databases: public, specialized and private laboratory databases, with the arrows depicting the flow of data between them. Public databases are those from which the entire biology research community extracts data, and into which only authorized submissions or changes are accepted. We therefore categorize their access (from the user’s point of view) as read-only. Specialized databases are also read-only for the same reasons, though they contain more specific information of interest to a smaller community of researchers. The data in specialized databases which is of interest to the larger community will sometimes be duplicated in the public databases. Private laboratory databases are used by individual research labs for storage and manipulation of their data as the researchers are generating it. These databases are used by a smaller group of people who need to create and update information that includes more specific day-to-day laboratory activity data, in addition to sequence and map data. It is these types of databases that we wish to address in this paper.

Figure 2. Portions of cDNA Sequencing Lab Efforts [figure not reproduced; legible labels: cloning, machine sequencing, ‘Wet’ Lab Portion, ‘Computer’ Lab Portion, DNA Sequence Fragments; key: Biological Objects, Data Objects, Processes]

There are many current genomic databases that fit into this taxonomy. The public databases include the GenBank ([Burks90], [GenBank92]) and European Molecular Biology Laboratory (EMBL) DNA sequence databases, the Protein Identification Resource (PIR) amino acid sequence database, and the Genome Database (GDB), which contains non-sequence data relevant to the Human Genome Project ([HG92]).

The Chromosome Information System (CIS) ([Johnston91], [HG92]) is a specialized database system that has a GUI layer on top of a relational DBMS containing many types of map data for individual chromosomes. This system makes its contribution by incorporating a DBMS with a GUI that makes the underlying database transparent to the users. CIS includes a schema editor for biological data called ERDRAW, which uses an extended entity-relationship model to help biologists describe their biological objects. These object descriptions are then automatically translated to relational DBMS table descriptions.

The Laboratory Notebook Database project at Los Alamos National Laboratory [Fickett91] is a private database project. This system is designed to handle the raw acquisition and initial processing of sequence data. The system developed is an example of an underlying relational DBMS with an additional layer of software built on it to provide a user interface designed to take the place of a traditional laboratory notebook. It has been successful in helping researchers manage the large amount of physical map data being generated for chromosome 16 for the Human Genome Project.

The C. elegans database (ACEDB) ([Durbin92]) does not fit directly into the taxonomy of Figure 3, because it provides more than one type of data. This is an example of an attempt to incorporate data from several database sources into one system. ACEDB is specialized in that it contains data for one organism. Other specialized data can be formatted and then used by this system, however. ACEDB has an attractive X-window based interface to the underlying data, which allows the user to view information from the chromosome map level down to the physical sequence level. The disadvantage of ACEDB is that it does not use an underlying database management system (DBMS), so that the features typical to a DBMS such as concurrency control, recovery, and access restriction are not available. ACEDB is currently best used as a single-user standalone system for retrieval; manipulation and update of the data is not possible at this time.

In [Shin92], the authors present an experimental study in which they converted a small set of existing data files and genomic software tools into persistent data in an object-oriented DBMS (OODBMS) and procedures (“methods” in object-oriented terminology) that act on the data, which are also stored in the OODBMS. The data used consists of DNA sequences and restriction enzymes, and the methods were for developing restriction enzyme maps and determining probes for the represented sequence information. This work is significant in that it demonstrates the viability of the object-oriented database paradigm. They were able to illustrate the advantages of the increased data modeling power of the object-oriented model, the increased support for referential integrity provided by OODBMS, and the advantages of having functions encapsulated within class definitions. This work is a first step towards future database systems for genomic information, where complex data and frequently-used operations on it are stored and maintained in a database management system.

In addition to various databases of genetic information, there are a variety of algorithmic techniques that have been developed to aid in the exploration of genetic data for biological significance and function. These include sequence similarity search and alignment programs, contig assembly programs, gene modeling algorithms, and secondary structure prediction algorithms. The current versions of programs that incorporate these algorithmic techniques are traditional, in that they require users to provide input data in a specified format, and print output text files that the user must peruse after the program has completed execution.

Figure 3. Typical Genomic Database Architecture [figure not reproduced; legible labels: User group 1, User group 2]

All genetic database similarity search and alignment algorithms face the same underlying challenge: balancing speed against the quality of the results produced by the algorithm. Possible trade-offs form a full spectrum of solutions. They range from slower, more accurate brute-force alignment algorithms, to fast supercomputer or hardware implementations of those brute-force methods, through probabilistic algorithms, to algorithms using special sequence encodings and pre-computed information about the search space to speed up the execution. The more costly alignment algorithms (in terms of computation time) are often used when the faster solutions fail to detect similarities. The dynamic programming algorithms in [Need70], [Wat76], and [Got82] are representative of these rigorous methods. The biological information signal processor ([Hunk92]) project is an example of using special hardware to increase the speed of search execution, in which a systolic implementation of a dynamic programming algorithm is being studied. The FASTA ([Pearson88], [Pearson90]) and BLAST ([Karlin90], [Altschul90], [Altschul91]) algorithms are examples of probabilistic implementations. Algorithms based on suffix tree indexing structures ([Powell89a]) are of the special sequence encoding type. The suffix tree algorithms attempt to reduce the cost of the search by generating a structure representing the sequences being compared in a position-independent manner, thus reducing the time-complexity of the search from O(n²) to O(n), where n is the length of the sequences involved. Data structures providing such representations and algorithms for their generation have been fairly well known and used for sequence search applications ([Powell90], [Gonnet92]). The challenge of using these structures for large-scale (entire GenBank) multiple-sequence searches lies in the ability to construct and maintain the structures effectively. In its basic form, the suffix tree structure has been defined for a single sequence. Extensions allow multiple sequences to be represented in a single structure, and the effective use of suffix trees for large-scale genetic sequence searches depends on the ability to represent a large number of sequences in a single structure.
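To make the idea of a single position-independent index over several sequences concrete, here is a minimal sketch of a naive generalized suffix trie. It is an illustration only and not one of the cited structures: the class name `SuffixTrie` and its methods are hypothetical, and a naive trie uses space quadratic in sequence length, which a true suffix tree avoids. What it does show is that, once the index is built, a lookup walks one node per pattern character, independent of how much sequence is indexed.

```python
from typing import Dict, Set


class SuffixTrie:
    """Naive generalized suffix trie over several named sequences (illustrative)."""

    def __init__(self) -> None:
        self.children: Dict[str, "SuffixTrie"] = {}
        self.sequence_ids: Set[str] = set()  # sequences containing this path

    def add_sequence(self, seq_id: str, sequence: str) -> None:
        # Insert every suffix, tagging each visited node with the sequence id.
        for start in range(len(sequence)):
            node = self
            for base in sequence[start:]:
                node = node.children.setdefault(base, SuffixTrie())
                node.sequence_ids.add(seq_id)

    def find(self, pattern: str) -> Set[str]:
        # One step per pattern character, regardless of index size.
        node = self
        for base in pattern:
            child = node.children.get(base)
            if child is None:
                return set()
            node = child
        return set(node.sequence_ids)


index = SuffixTrie()
index.add_sequence("frag_a", "GATTACA")
index.add_sequence("frag_b", "TTACAGG")
print(sorted(index.find("TTACA")))  # ['frag_a', 'frag_b']
print(sorted(index.find("GATT")))   # ['frag_a']
print(sorted(index.find("CCC")))    # []
```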

Several high-level application programs with GUIs have been developed for molecular biology researchers. Some applications provide specific functionality, such as the X-window based gene detection program, Gene Modeler ([Fields90]). An example of an application that has gone a step further and provided a user interface to several different types of algorithms is the X-window based application called Genetic Data Environment (GDE) ([GDE92]). The designers of GDE are attempting to provide a framework in which new algorithms can be integrated into a common user environment. GDE does not use an underlying DBMS, however, and requires input data to be in specific formats for the individual algorithms, which the user must be aware of. A more general application program that is suited for local research laboratories is the Virtual Notebook System (VNS) ([Gorry91]). The VNS, which is also based on the laboratory notebook paradigm, was developed specifically for collaboration among a group of users. It is designed to provide basic functionality and can be extended to provide access to applications needed for a particular user community. VNS goes further than any of the systems mentioned above in that it has an underlying relational database management system for storing the basic notebook information as well as hooks to user-defined applications.

Scope and Open Issues

This project addresses the issues in developing a private database for use by local sequencing labs and groups of labs involved in a sequencing project. Operating on the premise that the researchers in these labs must have a computer system whose crucial member is a database management system with complex modeling and querying capabilities, these researchers are faced with the following issues:

● how can the raw sequence data and necessary annotation to it be quickly entered and stored in the database?

● how should the genetic information in the database be modelled and in what format should it be stored?

● how do we address the issue of access to data being stored on different computers for different labs?

● how do researchers keep abreast of new data entering the system, whether it be data they are producing or that which is submitted to public databases by the research community at large?

● exactly what types of evolution of the database schema are feasible?

● how do researchers use the DBMS to effectively query genetic data for relevant information?

● what are the most appropriate algorithms to use for a particular query?

Approach

As part of the on-going research in computational molecular biology, we are investigating the suitability of employing existing object-oriented database technology, graphical user interfaces, and high-performance computing techniques to address the above issues. Our prototype system consists of an underlying database using the ITASCA¹ Distributed Object Database Management System (ITASCA ODBMS) [Itasca92] and a graphical interface to the ITASCA ODBMS based on the Open Software Foundation (OSF) Motif² toolkit. The system is currently running on a network of Sun SPARCstations³. ITASCA is a commercial DBMS based on ORION ([Kim89], [Kim90]).

1. ITASCA is a trademark of Itasca Systems, Inc.
2. Motif is a trademark of Open Software Foundation.
3. SPARCstation is a trademark of Sun Microsystems, Inc.

We have chosen to use current object-oriented database technology in order to capitalize on two of its most important features:

1). The complex modeling capability of object-oriented systems, in the form of hierarchical and composite objects.

2). The extensibility of object-oriented systems.

Figure 4. Preliminary Schema Under Itasca Schema Editor And DNA Sequences Portion of Complete Schema [figure not reproduced; legible class and attribute labels include DNA_fragment, sequence_fragment, protein_sequence, nucleic_acid_sequence, similarities, DNA_string; legible method labels include DisplayMatch, DisplayAlignment, CreateContig]


We will show how genetic data can be easily modelled using an object-oriented data model, and we will describe how we can take advantage of the ability to incorporate new biological data operators into the ITASCA ODBMS.

Using an object-oriented data model as proposed in [Kim91], we have developed a preliminary database schema for genetic data. This schema is shown and described in Appendix A. It forms the basis for the class definitions for the underlying ITASCA ODBMS for this project.

We are currently implementing the schema in Appendix A in the ITASCA ODBMS. The top of Figure 4 shows the ITASCA Schema Editor tool ([Itasca92]) with the schema as it currently exists, and the lower half of Figure 4 depicts the portion of the schema in Appendix A that describes the sequence fragment data (the DNA sequences subschema). In the Schema Editor Tool, all the classes that we have defined for the database are shown in a list along the left side and specific information about the class sequence_fragment is given in the three boxes along the upper right portion. Below these in the lower left corner of the Schema Editor Tool is specific information about the attribute sequence of this class. Note that the domain of sequence is nucleic_acid_sequence, which is another class we have defined, as shown in the DNA sequences subschema. This is an example of the composite hierarchy and the allowance for complex data objects in the object-oriented model. Our experience thus far has been that this type of complex modeling capability has allowed us to create schemas that are intuitive for molecular biologists to understand, as compared to relational schemas where connections between object classes (modelled as tables) are not explicit.
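The composite relationship described above, where the domain of one class's attribute is another class, can be sketched outside the ITASCA ODBMS with ordinary Python dataclasses. This is an illustration under stated assumptions, not the schema of Appendix A: the class names mirror sequence_fragment, nucleic_acid_sequence and similarity_result from the paper, while the attribute names `name`, `dna_string`, `algorithm` and `report` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NucleicAcidSequence:
    """Stand-in for the nucleic_acid_sequence class."""
    dna_string: str  # hypothetical attribute name


@dataclass
class SimilarityResult:
    """Stand-in for similarity_result; both attributes are hypothetical."""
    algorithm: str
    report: str


@dataclass
class SequenceFragment:
    """Stand-in for sequence_fragment."""
    name: str  # hypothetical attribute name
    # Composite attribute: its domain is another class, not a primitive.
    sequence: NucleicAcidSequence
    # Multi-valued attribute holding one result per executed algorithm.
    similarities: List[SimilarityResult] = field(default_factory=list)


fragment = SequenceFragment(
    name="frag_001",
    sequence=NucleicAcidSequence(dna_string="GATTACA"),
)
fragment.similarities.append(
    SimilarityResult(algorithm="blastn", report="(report text)")
)
print(fragment.sequence.dna_string)  # GATTACA
print(len(fragment.similarities))    # 1
```

The link from a fragment to its sequence is an explicit object reference, which is the property the text contrasts with relational schemas, where the same connection would be implied by matching key columns across tables.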

With the preliminary schema for the database defined and implemented, we developed a set of tools for rapidly inserting data from sequencing machines and associated annotation into the database. The Sequence Annotation Tool, shown in Figure 5, is designed to assist lab personnel, who are generating sequences from the sequencing machines, in entering them and the appropriate annotation into their private database. The tool reads the raw sequence data file. The file name is used for the sequence name. The user can choose a library that the sequence is part of from a persistent list, and can add or subtract from that list. Similarly, a list of vectors used by the lab is displayed for the user to choose from as the one used for sequencing this fragment. The user can indicate the polarity of the sequence by choosing between plus-sense and minus-sense, and either a custom primer can be entered or a universal primer is used. The user can choose the name of the previous sequence (or supply it by typing), if known, so that an ordering of the sequences can be determined later when attempting to create (and later view) contig assemblies. The user can also add any additional comments required in the comment section shown at the bottom of Figure 5. All of the information entered by the user is saved to an annotated file. New files can be opened in succession to annotate each of the raw sequences as they come off the machine in preparation for storing the information in the database, or previously saved annotated files can be opened and modified. The ability to quickly describe the annotation for each sequence through several mouse clicks and a few keystrokes means that a uniform set of annotation can be saved for each sequence and a huge number of sequences can be processed in a relatively short amount of time.
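The previous-sequence annotation described above is enough to recover an ordering of fragments later. A minimal sketch, assuming each annotated fragment records at most one predecessor and that the fragments form a single unbranched chain; the function name `order_fragments` and the dictionary layout are hypothetical, not part of the tool.

```python
from typing import Dict, List, Optional


def order_fragments(previous_of: Dict[str, Optional[str]]) -> List[str]:
    """Return fragment names in chain order, given each fragment's predecessor.

    Assumes a single unbranched chain: exactly one fragment has no
    predecessor, and no fragment is named as a predecessor twice.
    """
    # Invert the predecessor links into successor links.
    next_of = {prev: name for name, prev in previous_of.items() if prev is not None}
    starts = [name for name, prev in previous_of.items() if prev is None]
    if len(starts) != 1:
        raise ValueError("expected exactly one fragment without a predecessor")
    ordered = [starts[0]]
    while ordered[-1] in next_of:
        ordered.append(next_of[ordered[-1]])
    if len(ordered) != len(previous_of):
        raise ValueError("predecessor links do not form a single chain")
    return ordered


annotations = {"frag_2": "frag_1", "frag_1": None, "frag_3": "frag_2"}
print(order_fragments(annotations))  # ['frag_1', 'frag_2', 'frag_3']
```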

A second prototype tool we have developed stores any number of the annotated sequences created by the Sequence Annotation Tool into the database and initiates similarity searches of these sequences against existing sequence information. This Multiple Sequence Manager Tool is shown in Figure 6. A directory where previously annotated sequences are located is chosen and the names of the sequences it contains are shown in the list on the left. The dialog box for saving these sequences is shown to the right, where the user can enter whether or not a similarity search should be done for each sequence, which similarity search algorithm to use, and appropriate database access information. We currently have the most recent versions of FASTA and BLAST installed on a Sun workstation network, and have the Multiple Sequence Manager Tool interfacing with these implementations and storing the results from them in the database. If BLAST is chosen (the default), the blastx algorithm is executed, in which each sequence fragment is translated in all six reading frames of amino acid sequence and compared against the Protein Identification Resource (PIR) database, and the blastn algorithm is also executed, in which each sequence fragment is compared to all sequences in GenBank. If FASTA is chosen, this algorithm is executed for each sequence against all of the GenBank database.
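The six-reading-frame translation that blastx performs internally can be illustrated with a short sketch. This is not BLAST code: the function names are hypothetical, the standard genetic code is assumed, and codons containing characters other than A, C, G and T are rendered as 'X'.

```python
from itertools import product
from typing import List

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> List[str]:
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frame_translations("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Each of the six protein strings would then be compared against a protein database, which is why a single DNA fragment yields several candidate protein-level matches.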

If HTree is chosen, an algorithm we have developed that uses a suffix tree structure for similarity search is executed. Because of the volume of sequence data involved, a suffix tree structure must lend itself to secondary storage (disk as opposed to main memory); that is, it must be possible to add sequences to the structure with limited primary storage (main memory), and the search algorithms using the structure should maintain locality of access. We have developed an indexing structure with these properties, called Huge Tree (HTree). This multiple-sequence index structure is designed specifically for large volumes of sequence data. It supports incremental updating in limited on-line memory as new sequences become available. The HTree has been designed to support binary merging: two HTrees can be merged in limited space to produce an HTree for the union of the two sets of sequences represented by the original trees. This property of HTree construction permits parallelization, and a system that constructs HTrees for large subsets of GenBank using a network of Sun workstations has been implemented. We have also constructed HTree structures for the entire GenBank using a large disk farm, and have implemented and tested a search algorithm that uses the HTree structure for a section of GenBank. The algorithm can determine the presence of a sequence fragment in the entire GenBank in time proportional to the length of the fragment. This feature makes it particularly useful for fast screening of incoming ('raw') sequence data.

Figure 5. The Sequence Annotation Tool

Figure 6. Multiple Sequence Manager Tool

The Multiple Sequence Manager Tool can save an unlimited number of sequences and their annotation in the database at one time, and can initiate similarity searches for them. This is an important improvement because it automates the previously manual process used by researchers to manage their sequences and to determine whether any of the sequences being generated has a possible function or other importance.

After the Multiple Sequence Manager Tool has initiated similarity searches, the user must have a means of reviewing the results of these searches at a later time. The initial prototype software tool developed to aid in this process is the Similarity Result Viewer, shown in Figure 7. The tool allows the user to retrieve the similarity_result objects from the similarities attribute of a chosen sequence_fragment object and view the ASCII text reports for a particular sequence, which were generated by the similarity search algorithms. The user can also type in and search for key patterns in the text. This gives the user a rudimentary browsing capability on these search results.

Currently, the directory path of the output file from the similarity search execution and the type of algorithm used are stored in the database for a sequence fragment as attributes of a similarity_result object. Each sequence_fragment object has a multi-valued attribute called similarities, which is a set of similarity_result objects. A new similarity_result object is placed in this set for each algorithm executed for a particular sequence_fragment object.

The FASTA, BLAST, and HTree algorithms for sequence similarity represent a portion of the set of operations required by molecular biologists for manipulation of the data contained in genetic databases. We have determined that we can incorporate these operations into an automated, GUI-based system for management of large-scale sequencing projects.

Future Plans

The future plans for the system include the following:

● incorporating contig assembly algorithms

● adding direct submission of data to GenBank

● incorporating triggers (spontaneous action as the result of updates) into the database

● adding high performance computing techniques

● further modeling of similarity search results

● distribution of data on a network

[Screenshot: a BLAST report against the PIR database, translating both strands of the query sequence in all six reading frames and listing high-scoring segment pairs with their reading frame, score, and P(N) values.]

Figure 7. Similarity Result Viewer

Applications for contig assembly and for direct submission of sequences to GenBank are currently being developed. The HTree structure is being used to form the core of a contig assembly algorithm, which will be used to update the local database with contiguous sequence information. The longer contiguous sequences and their associated annotation will be chosen by users and electronically submitted to GenBank with the aid of a Sequence Submission Tool.
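To make the idea of contiguous sequence information concrete, the following minimal sketch merges two fragments whose ends overlap exactly. It is a naive illustration only and not the HTree-based assembly algorithm under development; the function names and the `min_overlap` parameter are assumptions, and real fragments would need tolerance for sequencing errors.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left, right, min_overlap=3):
    """Merge two fragments into one contig if they overlap, else return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep the left fragment whole and append the non-overlapping tail of right.
    return left + right[size:]
```

The overlap length found here plays the role that the overlap position attributes of a fragment would play when an ordering of fragments is recorded.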

As researchers generate sequence data at their site, it would be beneficial for them to recognize quickly whether a fragment has already been sequenced. If duplication of effort can be reduced, the complete sequence will be finished faster, and duplication in the public database will be avoided. We intend to create an HTree structure of all sequences generated by a project and to add to this structure as new ones are produced. Each new sequence entering the project management environment will automatically pass through the HTree filtering algorithm, using the HTree structure of all current project sequences. If hits are detected, the users will be alerted that the new sequence may be a duplicate of a previous sequence fragment.
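The screening step can be sketched with a naive in-memory suffix trie. This is not the HTree, which is disk-based and supports merging; it only illustrates why a suffix index answers a presence query in time that grows with the fragment length. The class and function names are hypothetical, and only exact containment is detected.

```python
class SuffixTrie:
    """Naive in-memory suffix trie over several sequences (illustration only)."""

    def __init__(self):
        self.root = {}

    def add(self, name, sequence):
        # Insert every suffix; each node records which sequences pass through it.
        for start in range(len(sequence)):
            node = self.root
            for base in sequence[start:]:
                node = node.setdefault(base, {"#": set()})
                node["#"].add(name)

    def find(self, fragment):
        """Return the names of indexed sequences that contain fragment.

        The walk visits one node per character, so the cost depends on the
        fragment length rather than on the amount of indexed sequence.
        """
        node = self.root
        for base in fragment:
            node = node.get(base)
            if node is None:
                return set()
        return set(node.get("#", set()))


def screen_new_fragment(index, name, sequence):
    """Report earlier sequences containing the new one, then index it."""
    hits = index.find(sequence)
    index.add(name, sequence)
    return hits
```

A non-empty result from `screen_new_fragment` corresponds to the alert that the new sequence may duplicate earlier work.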

A related problem is that of new sequences being generated by other researchers in the community and submitted to the public databases, as well as new findings of specific genes and their function. Triggers are spontaneous actions taken as the result of updates to data in the database. After receiving periodic updates of the public databases, we will 'trigger' new similarity searches for all those sequence fragments for which possible function was not found previously. The extensibility of the OODBMS we have chosen will allow us to explore ways to add triggers and to report any information obtained to users.
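The trigger behaviour described above can be sketched as follows. This is a plain Python illustration under stated assumptions, not the trigger facility of any particular database product: the `Fragment` and `ProjectDatabase` classes, the `search` callback, and the notification list are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Fragment:
    name: str
    sequence: str
    possible_function: Optional[str] = None  # None until a search finds one


class ProjectDatabase:
    def __init__(self, search: Callable[[str], Optional[str]]):
        self.fragments: Dict[str, Fragment] = {}
        self.search = search              # stand-in for a similarity search
        self.notifications: List[str] = []

    def store(self, fragment):
        self.fragments[fragment.name] = fragment

    def on_public_update(self):
        """Trigger body: re-search only fragments with no function found so far."""
        for fragment in self.fragments.values():
            if fragment.possible_function is None:
                result = self.search(fragment.sequence)
                if result is not None:
                    fragment.possible_function = result
                    self.notifications.append(f"{fragment.name}: {result}")
```

Calling `on_public_update` after each periodic refresh of the public data re-runs searches only where nothing was found before, and records a message for the user when something new turns up.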

The extensibility of the underlying object-oriented database system will also allow us to incorporate operations on the data directly into the database object class descriptions, thereby hiding their implementation details from the user and allowing them to be used directly within the database query language. A method called IsSimilarTo can be described on the class dna_sequence, for example, which could then be used in subsequent queries on the sequence_fragment objects. A typical query that uses this operator would be "Select each sequence that is similar to sequence X." The schema in Appendix A names several other methods that will be incorporated into the ITASCA ODBMS so that they may be used as operations in the query language.
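The idea of attaching an operation to a class and then using it in a query can be sketched in Python. This is only an analogy and not the ITASCA query language: the k-mer based `is_similar_to` is a hypothetical stand-in for a real similarity algorithm, its `k` and `threshold` defaults are assumptions, and a list stands in for the set-valued similarities attribute.

```python
from dataclasses import dataclass, field
from typing import List


def kmers(sequence, k):
    """Set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


@dataclass
class SimilarityResult:
    algorithm_type: str   # e.g. "blastx" or "fasta"
    output_path: str      # path of the report file written by the search


@dataclass
class SequenceFragment:
    name: str
    dna_string: str
    similarities: List[SimilarityResult] = field(default_factory=list)

    def is_similar_to(self, other, k=4, threshold=0.5):
        """Stand-in for IsSimilarTo: Jaccard similarity of k-mer sets."""
        mine, theirs = kmers(self.dna_string, k), kmers(other.dna_string, k)
        if not mine or not theirs:
            return False
        return len(mine & theirs) / len(mine | theirs) >= threshold


def select_similar(fragments, x):
    """'Select each sequence that is similar to sequence X.'"""
    return [f for f in fragments if f is not x and f.is_similar_to(x)]
```

The caller of `select_similar` never sees how similarity is computed, which is the hiding of implementation details that the paragraph above describes.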

High performance computing techniques will be incorporated into the system. We have developed a high performance implementation of the Gotoh dynamic programming-based sequence alignment algorithm [Got82]. In its sequential form this algorithm requires O(MN) steps to find an optimal sequence alignment, where M and N are the lengths of the two sequences being aligned. Our implementation exploits the fine-grain parallelism of the algorithm by mapping it onto an array of parallel processors of the CM200 supercomputer. This CM200 implementation achieves near-linear speedups (the algorithm execution time decreases proportionally to the number of CM200 processors used, up to the maximum of 32K processors). A CM5 implementation which can be incorporated into the system as the AlignmentOf operator is under development. In the process of developing the CM200 version of the algorithm we have also found that some of the fine-grain parallelism of the algorithm can be mapped onto vector machines. We are also developing a vectorized version of the Gotoh algorithm for a Cray-2 supercomputer.

CM200 and CM5 are trademarks of Thinking Machines, Inc.
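For reference, the sequential O(MN) form of an affine-gap alignment in the style of Gotoh can be sketched as below. This is a textbook-style sketch, not the CM200 or Cray-2 implementation; the function name and the scoring values (match, mismatch, gap open, gap extend) are assumptions chosen for illustration, and only the optimal global score is computed, not the alignment itself.

```python
def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Optimal global alignment score with affine gap penalties.

    A gap of length k scores gap_open + k * gap_extend.
    Runs in O(len(a) * len(b)) time and O(len(b)) memory.
    """
    neg_inf = float("-inf")
    n = len(b)
    # d[j]: best score for a[:i] against b[:j] (row i, updated in place).
    # p[j]: best score for a[:i] against b[:j] ending with a gap in b.
    d = [0] + [gap_open + gap_extend * j for j in range(1, n + 1)]
    p = [neg_inf] * (n + 1)
    for i in range(1, len(a) + 1):
        diag = d[0]                       # value of row i-1, column 0
        d[0] = gap_open + gap_extend * i
        q = neg_inf                       # best score ending with a gap in a
        for j in range(1, n + 1):
            p[j] = max(d[j] + gap_open + gap_extend, p[j] + gap_extend)
            q = max(d[j - 1] + gap_open + gap_extend, q + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best = max(diag + sub, p[j], q)
            diag = d[j]                   # keep row i-1 value for the next column
            d[j] = best
    return d[n]
```

Each cell depends only on its left, upper, and upper-left neighbours, which is the kind of regular dependency pattern that parallel and vector implementations can exploit.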

As these types of queries allow researchers to discover more information, the schema representing the data will undoubtedly have to change. The ITASCA ODBMS we have chosen allows for dynamic schema modification, in that entirely new classes can be described, new attributes can be added to existing classes, and existing attributes can be changed. We intend to study which types of schema modification will be needed by genomic researchers and whether or not they are allowed by the ITASCA ODBMS.

The type of query that utilizes a new operation such as IsSimilarTo or AlignmentOf can be characterized as a long-duration transaction, because its execution may span a few minutes and need not be atomic. The ITASCA ODBMS provides support for this type of transaction. Long-duration transactions will be demonstrated during the further study and implementation of these operators.

Public sequence databases will be needed by researchers across the country, and private and specialized databases will be shared by many different laboratories. To ensure adequate performance, multiple copies of the data will be needed. To ensure high availability, the multiple copies will be scattered geographically, at sites that are unlikely to fail simultaneously. Maintaining consistency of the data will require new algorithms that are tuned for the largely read-only nature of the data, but that guarantee the database is never in an inconsistent state during updates. The ITASCA ODBMS product is a fully distributed system through which databases can be accessed across a network. We intend to extend ITASCA with support for replication to increase performance and fault-tolerance.

Conclusion

A schema has been presented for data generated by a genetic sequencing project, based on the object-oriented data model, and implemented using a commercially available object-oriented database management system. A prototype has been described for GUI-based software applications that decrease the time needed to enter sequence data and its annotation into the database, along with applications that automate the process of determining possible function of sequences as they are generated and stored in the database. To depict the even greater potential of this system, we have described the features of its components that will be exploited as research and development progresses. These features include complex modeling capabilities, dynamic schema evolution, application-oriented queries through object-oriented database extensibility, the automation of input and similarity searching tasks as new data becomes available, the inclusion of high performance computing techniques, and distribution of data across a network of machines.

Acknowledgments

This work was supported in part by funds from the Medical School, University of Minnesota, a grant from Sun Microsystems, Inc., and resources from Itasca Systems, Inc.

References

[Altschul90] Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic Local Alignment Search Tool, Journal of Molecular Biology, vol. 215, pp. 403-410, 1990.

[Altschul91] Altschul, Stephen F. Amino Acid Substitution Matrices from an Information Theoretic Perspective, Journal of Molecular Biology, vol. 219, pp. 555-565, 1991.

[Bay73] Bays, J.C. The Complete PATRICIA, Ph.D. dissertation, U. of Oklahoma, 1974.

[Burks90] Burks, Christian, et al. GenBank: Current Status and Future Directions, Methods in Enzymology, vol. 183, Academic Press, 1990.

[Durbin92] Durbin, Richard and Jean Thierry-Mieg. acedb - A C. elegans Database, User's Guide, Installation Guide, and Configuration Guide, March 1992.

[Fickett91] Fickett, J.W., M.J. Cinkosky, C. Burks, P.E. Hempfner, D. Nelson, R.M. Pecherer, P. Sgro, D.M. Sorenson, R.D. Sutherland, and B.C. Yantis. Human Genome Information Resource, LANL Technical Report, obtained through personal communication, 1991.

[Fields90] Fields, C.A. and C.A. Soderlund. gm: A Practical Tool for Automating DNA Sequence Analysis, Computer Applications in the Biological Sciences (CABIOS), vol. 6, pp. 263-270, 1990.

[GDE92] GDE, Genetic Data Environment, version 2.0 Users Guide, Harvard Genome Laboratory, 1992.

[Genbank92] Relational Schema of Genbank Database. Personal communication with Los Alamos National Laboratory, 1992.

[Gonnet92] Gonnet, Gaston H., Mark A. Cohen, and Stephen A. Benner. Exhaustive Matching of the Entire Protein Sequence Database, Science, vol. 256, pp. 1443-1445, June 5, 1992.

[Gorry91] Gorry, G. Anthony, Kevin B. Long, Andrew M. Burger, Cynthia P. Jung, Barry D. Meyer. The Virtual Notebook System: An Architecture for Collaborative Work, Journal of Organizational Computing, 1(3), pp. 233-250, 1991.

[Got82] Gotoh, O. An Improved Algorithm for Matching Biological Sequences, J. Mol. Biol., vol. 162, pp. 705-708, 1982.

[HG92] Human Genome 1991-1992 Program Report, U.S. Dept. of Energy publication DOE/ER-0544P, June 1992.

[Hunk92] Hunkapiller, Tim, Leroy Hood, Ed Chen, Michael Waterman. BISP: VLSI Solutions to Sequence-Comparison Problems, Human Genome 1991-1992 Program Report, U.S. Dept. of Energy publication DOE/ER-0544P, June 1992, p. 143.

[Itasca92] ITASCA Distributed Object Database Management System User Manual, Release 2.0, Itasca Systems, Inc., 1992.

[Johnston91] Johnston, William, Suzanna Lewis, Victor Markowitz, John McCarthy, Frank Olken, and Manfred Zorn. The Chromosome Information System, Lawrence Berkeley Laboratory Technical Report 29675, May 17, 1991.

[Karlin90] Karlin, Samuel and Stephen F. Altschul. Methods for Assessing the Statistical Significance of Molecular Sequence Features Using General Scoring Schemes, Proceedings of the National Academy of Sciences USA, vol. 87, pp. 2264-2268, 1990.

[Kim91] Kim, Won. Introduction to Object-Oriented Databases, MIT Press, 1991.

[Hughes91] Hughes, J.G. Object-Oriented Databases, Prentice Hall International, Hemel Hempstead, England, 1991.

[Kim89] Kim, Won, et al. "Features of the ORION Object-Oriented Database System," Object-Oriented Concepts, Applications, and Databases, ed. W. Kim and F. Lochovsky, Addison-Wesley, 1989.

[Kim90] Kim, Won. "Architecture of the ORION Next-Generation Database System," IEEE Transactions on Knowledge and Data Engineering, March 1990.

[McC76] McCreight, Edward M. A Space-Economical Suffix Tree Construction Algorithm, JACM, 23(2), pp. 262-272, April 1976.

[Ned70] Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453.

[Pearson88] Pearson, W.R. and D.J. Lipman. Improved Tools for Biological Sequence Analysis, Proceedings of the National Academy of Sciences USA, vol. 85, pp. 2444-2448, 1988.

[Pearson90] Pearson, W.R. Rapid and Sensitive Sequence Comparison with FASTP and FASTA, Methods in Enzymology, vol. 183, pp. 63-98, 1990.

[Powell89a] Powell, Patrick A. Using and Constructing the Suffix Tree Index Structure, U. of Minnesota Computer Science Department, Technical Report TR 89-90.

[Powell89b] Powell, P.A., P. Bieganski, E. Shoop. X11-Based Tools for Network Access to and Comparison of DNA Sequence Data. Presentation at MacroMolecules, Genes and Computers, Chapter Two, Waterville Valley, New Hampshire, 1989.

[Powell90] Powell, P.A. FASTSIM: A New Algorithm for Rapid Sequence Similarity Determination. Presentation at American Medical Informatics Association First Annual Research Conference: Computers, Molecular Biology and Medicine, Snowbird, Utah, 1990.

[Shin92] Shin, Dong-Guk, Changwan Lee, Jinghui Zhang, Kenneth E. Rudd, and Claire M. Berg. Redesigning, Implementing, and Integrating Escherichia Coli Genome Software Tools with an Object-Oriented Database System, Computer Applications in the Biological Sciences (CABIOS), vol. 8, no. 3, pp. 227-238, 1992.

[Shoop90] Shoop, E., E.F. Retzel, P.A. Powell, P. Bieganski and S. Kemp. The MOLBIO Project: A System for Access, Comparison and Display of Molecular Information. Presentation at American Medical Informatics Association First Annual Research Conference: Computers, Molecular Biology and Medicine, Snowbird, Utah, 1990.

[Wat76] Waterman, M.S., Smith, T.F. and Beyer, W.A. (1976). Advan. Math. 20, 367-387.

[Zoeller92] Zoeller, Randal and Douglas Barry. Dynamic Self-Configuring Methods for Graphical Presentation of ODBMS Objects, Proceedings, 8th International Conference on Data Engineering, IEEE Computer Society Press, pp. 136-143, 1992.

Cray-2 is a trademark of Cray Research, Inc.

Appendix A

A Preliminary Object-Oriented Database Schema for cDNA Sequencing

Figure A1 is a schema for a cDNA sequencing project database that can easily be generalized for any DNA sequencing project database. The schema diagram is based on the object-oriented data model described in [Kim91]. The bold lines represent the class-subclass hierarchy relationships, and the thin lines represent the composite hierarchy relationships. These composite links can be of several types, but it is assumed that they are all exclusive independent references initially. Further study may reveal that some of them should be dependent or shared, but this is not addressed at this time. The names of the classes are in bold-face type above each box, with the attributes in plain type inside each box (multi-valued attributes are plural), and the methods associated with the class below them in italics. It is these methods that represent the necessary special operations that will be provided for these types of databases. It is the ability to add these special operations that makes the object-oriented data model so attractive.

The schema has several major components that can be categorized as follows:

1). The DNA sequences that are being created and the connections between them.

2). The record of experimental methods used to obtain the sequences.

3). The special operations that help the biologists organize the connections between the sequences and search for possible meaning of these sequences.

4). The results of the search exploration for possible meaning.

5). A dictionary of known sequences and their meanings (e.g. transcription control elements).

The class description that is central to the schema is the sequence_fragment, which is the major type of DNA_fragment object that is being stored in the database that this schema describes. The methods shown in italics in Figure A1 represent the special operations mentioned in (3) above. We chose to model the DNA sequence itself as a string upon which various methods could be used. The sequence fragments are smaller overlapping pieces that can be thought of as having some ordering; thus the preceding_fragment, following_fragment, left_overlap_pos, and right_overlap_pos attributes. These attributes are computed using the CreateContig method. Also depicted is the type of information that needs to be kept about the experimental process of DNA sequencing, which is representative of what is now kept in an individual researcher's laboratory notebook. Most of the attributes for library, clone, vector, and sequence_fragment objects are entered when a new sequence is to be stored in the database. This preliminary schema will evolve as the system progresses.


[Figure A1 diagram not reproduced. Legible class and attribute names include DNA_fragment, cDNA_fragment, sequence_fragment, gene, pattern, vector, DNA_string, nucleic_acid_sequence, amino_acid_sequence, restriction_map, rest_map_site, restriction_enzyme, transcription_control_element, similarity_result, fasta_similarity_input, and fasta_similarity_output. Legible method names include IsHomologousTo, CreateContig, TranslateSequence, IsSimilarTo, and AlignmentOf.]

Legend: bold = class; regular = attribute; italic = method; thin line = composite hierarchy (IS-PART-OF); bold line = class/subclass hierarchy (IS-A, inheritance).

Figure A1. cDNA Sequencing Project Object-Oriented Schema