
AGENAE Program

Installing several Ensembl and BioMart instances on a single Apache daemon

Version 2, Christophe KLOPP, August 2005

The SIGENAE group (http://www.sigenae.org/) provides services to laboratories financed by the AGENAE program (http://www.inra.fr/agenae). The services are linked to two genomic techniques : EST sequencing and micro-arrays. They cover data processing, data storage, training and expertise. At the end of 2003, while looking for an environment to present EST contigs, we tested Ensembl and EnsMart and found them simple to adapt to our needs. For EST libraries, the service is provided mainly before publication of the sequences, in order to verify the quality of the libraries; the biologists therefore asked for a limited access environment. Recently, coordinators of new projects asked us for the same service, also with limited access. Ensembl and BioMart were meant to be run on a standalone apache daemon (httpd). We modified them to be able to run several instances on one apache daemon using virtual hosts. This work was done with the Ensembl 31 source code (CVS extraction of June 2005).

NB : this document has not been written by members of the Ensembl development team and may contain misunderstandings. Please send comments, clarifications or proposed modifications to [email protected]. This document has not been written by native English speakers. Please send us all comments and typos using the address above.

The document is split into three parts. The first part presents the Ensembl/BioMart launching process, the second describes the elements, such as data structures, needed to build a multi instance implementation, and the last gives an overview of the parameter and code modifications made for Ensembl and BioMart in this new configuration.

I The launching process :

The first step of our work was to install Ensembl as presented in the Ensembl Website Installation Instructions, found in the information, software, Ensembl website part of the Ensembl web site (http://www.ensembl.org). To avoid uploading and storing a lot of data we installed only Saccharomyces cerevisiae. The result of the apache launch is presented below.

Fig. 1 Screen shot of Ensembl launch process.

Let's see what is going on during the apache launch process. First, which files are used by Ensembl during the process ? In the apache conf directory you will find the following files :

1. httpd.conf : apache configuration file using SiteDefs.pm data to set up the parameters of apache.

2. SiteDefs.pm : a perl module exporting parameters set by the user. On the one hand, these parameters are used by apache through an import command in httpd.conf, and on the other hand by the SpeciesDefs.pm module to create the config.packed configuration file. The usage of this file will be explained later on.

3. perl.startup : this perl script is launched at the end of the configuration part of httpd.conf. It generates the configuration files of Ensembl (config.packed) and BioMart (martconfig.packed) using the parameters set in SiteDefs.pm and the ini files listed below.

4. DEFAULT.ini : initialization file used to set the default values of the connection parameters,

SIGENAE Ensembl and BioMart multi-instances page 1/11


5. MULTI.ini : initialization file used to set the parameters of multi species modules such as BioMart or Compara,

6. Saccharomyces_Cerevisiae.ini : species specific initialization file. There must be one file of this type per selected species listed in SiteDefs.pm.

7. martRegistry.xml : configuration file of BioMart indicating which databases to access to collect the dataset configuration data.

The diagram below presents the organization of the elements and, from top to bottom, the sequence of the process.

Fig 2 : Ensembl and BioMart launch process organization

This diagram introduces two new elements : vars and Registry, which will be explained in the following paragraphs.

First let's describe the process :

1. apache uses the httpd.conf configuration file,

2. in the perl section of this file the SiteDefs.pm module is called,

3. this module has several parts mixing web site parameters and Ensembl parameters. It exports blocks of parameters. The WEB: block is imported and used in the perl section of httpd.conf. The module also fills the @ENSEMBL_LIB_DIRS array used by httpd.conf to add entries to the @INC perl library path array.

4. Having imported the needed WEB: parameters, the httpd.conf file sets the apache directives. At the end of this process it launches the perl.startup script.

5. perl.startup checks whether the packed configurations of Ensembl and BioMart have to be generated. If they exist, it tries to load them into the registries. If not, it launches the data collection process and the file generation.
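The decision described in step 5 can be sketched as follows. This is a Python illustration of the logic only, not the Perl code of perl.startup : the function names, the sample data and the use of pickle as the packed format are assumptions made for the example, and the `reuse_packed` flag simply stands for the corresponding SiteDefs.pm setting.

```python
import pickle
import tempfile
from pathlib import Path


def collect_configuration():
    """Stand-in for the ini file parsing done at start-up (invented content)."""
    return {"Saccharomyces_Cerevisiae": {"database": "yeast_core"}}


def load_or_build(packed_path, reuse_packed=True):
    """Return (configuration, source).

    Reload the packed file when it exists and reuse is allowed,
    otherwise collect the data again and rewrite the packed file.
    """
    packed_path = Path(packed_path)
    if reuse_packed and packed_path.exists():
        with packed_path.open("rb") as handle:
            return pickle.load(handle), "loaded"
    config = collect_configuration()
    with packed_path.open("wb") as handle:
        pickle.dump(config, handle)
    return config, "generated"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as conf_dir:
        packed = Path(conf_dir) / "config.packed"
        print(load_or_build(packed)[1])  # no file yet, so it is generated
        print(load_or_build(packed)[1])  # the file now exists, so it is reloaded
```

Deleting the packed file, as suggested later when adding an instance, forces the "generated" branch at the next start.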

Here is the Ensembl output of the launch process when the config.packed file already exists and the parameter in SiteDefs.pm is set to try to reload it.

Fig. 3 Screen shot of Ensembl launch process with file

The two new elements shown in red in the second diagram correspond to data loaded in apache's memory during the process. The first one is a set of global variables (vars) defined and exported by SiteDefs.pm; the other is the Registry structure, generated using the data of all the initialization files for Ensembl and the XML files located in the BioMart databases. Both data structures are used by the perl modules and scripts while Ensembl and BioMart are running. The Registry is a way to bring persistence to the Ensembl configuration data.

Elements shown in Fig 2 : httpd.conf, SiteDefs.pm, perl.startup, DEFAULT.ini, MULTI.ini, Species*.ini, martRegistry.xml, config.packed, martconf.packed, vars, Ensembl Registry, BioMart Registry.

NB : the SiteDefs.pm file contains a species alias configuration part. This hash table links the usual species name to several other derived names. For example, the aliases for Saccharomyces_Cerevisiae are :
• S_Cerevisiae
• s_Cerevisiae
• SCerevisiae
• scerevisiae
• SaccharomycesCerevisiae
• saccharomycescerevisiae
• ...

These aliases are used in the web site URL so that different names, written in upper or lower case, point to the same usual species name. The Handler.pm module uses the alias table when processing a request. You can test this on the Ensembl web site.
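The alias mechanism can be illustrated with a small lookup table. The sketch below is in Python and is not the code of SiteDefs.pm or Handler.pm; the function names are hypothetical, and matching aliases case-insensitively is a simplification chosen for the example (the real table lists each spelling explicitly).

```python
def build_alias_table(species_aliases):
    """Invert {usual name: [aliases]} into {lowercased alias: usual name}."""
    table = {}
    for usual_name, aliases in species_aliases.items():
        for name in [usual_name, *aliases]:
            table[name.lower()] = usual_name
    return table


def resolve_species(url_path, table):
    """Replace the first path element by the usual species name if it is an alias."""
    parts = url_path.lstrip("/").split("/")
    usual_name = table.get(parts[0].lower())
    if usual_name is None:
        return url_path  # not a species URL, leave it untouched
    return "/" + "/".join([usual_name, *parts[1:]])


ALIASES = {
    "Saccharomyces_Cerevisiae": [
        "S_Cerevisiae",
        "SCerevisiae",
        "SaccharomycesCerevisiae",
    ],
}

if __name__ == "__main__":
    table = build_alias_table(ALIASES)
    print(resolve_species("/scerevisiae/index.html", table))
```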

II Multi instance needs :

To create multi instances we need a way to separate the queries sent to each server. This is provided by the virtual hosting capabilities of apache and the global variables linked to each host. The output below shows all the apache environment variables.

Fig. 4 apache env variables

This output has been generated by a very simple perl script, env.cgi, found on the web (http://www.cgi101.com/book/ch10/env-cgi.html). It shows two ENV variables containing the server name : SERVER_NAME and HTTP_HOST. These can be used in a script or module to separate queries coming from different hosts. We used SERVER_NAME.

Let's quickly look at how apache virtual hosts are configured in a classical httpd.conf file. The output below corresponds to the tail of our httpd.conf file. The first directive, NameVirtualHost, indicates that the virtual hosts will be separated by their names on the localhost. Then there is a block of directives for each virtual host. This block contains straightforward directives such as DocumentRoot, ServerName or ErrorLog.

Fig. 5 virtual host configuration in httpd.conf

If we launch apache with such a configuration file we will have two web sites (ensembl1, ensembl2), each pointing to the index.html page found in the htdocs directory of the site.
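As an illustration of what such a tail of httpd.conf looks like, the Python helper below renders one name-based virtual host block per site, using only the directives named above. The install root, the listening address and the error log file name are assumptions made for the example and are not taken from our configuration.

```python
def render_virtual_hosts(server_root, sites, address="*"):
    """Return name-based virtual host directives for the given site names."""
    lines = [f"NameVirtualHost {address}", ""]
    for site in sites:
        lines += [
            f"<VirtualHost {address}>",
            f"    ServerName {site}",
            f"    DocumentRoot {server_root}/{site}/htdocs",
            f"    ErrorLog {server_root}/{site}/logs/error_log",
            "</VirtualHost>",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # "/usr/local/ensembl" is a hypothetical install root.
    print(render_virtual_hosts("/usr/local/ensembl", ["ensembl1", "ensembl2"]))
```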

Now, to run two Ensembl web sites on the same httpd, the configuration files and some perl modules have to be modified so that the queries can be separated between the servers.


III Multi instance implementation :

This part describes how the multi instance can be implemented in Ensembl and BioMart.

III.1 In Ensembl :

The aim was to create a multi instance implementation while keeping the directory and code organization as close as possible to that of Ensembl and BioMart.

III.1.a Directories tree organization :

To create a new site we need the specific parts of the site, meaning configuration (/conf) and static pages (/htdocs), and we can have local logs (/logs). The screen shot on the right presents an example of a directory tree for two servers.

Fig. 6 multi instance directory organization

All virtual hosts use the same modules and scripts; the specificities come from the called URL and the configuration data used.

III.1.b Databases :

Each site has its own databases, which are specified in the different ini files.

III.1.c Configuration files modifications :

The modifications of the configuration take place in the httpd.conf and SiteDefs.pm files.

httpd.conf : as presented in the previous part, this file will contain one virtual host entry for each site. This virtual host entry specifies the document root and the log file location.

SiteDefs.pm : in this file we have to separate the parts used by all web sites from the ones specific to each of them. Several choices are possible depending on one's strategy. For this first implementation we chose to keep the specific part as small as possible. The specific part therefore contains :
• the site name
• the configuration directory path
• the static pages directory path
• the species information part

Fig. 7 SiteDefs.pm multi instance parameter organization

The modules, the scripts, the temporary file storage directories and the other configuration parameters are shared by all instances. These modifications are made by replacing the corresponding $ENSEMBL_ variables by %ENSEMBL_ hash tables, the keys of the hash tables being the web site names ('ensembl1', 'ensembl2', ... in our example).
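The move from single variables to tables keyed by web site name can be pictured as below. This is a Python sketch of the data layout, not an extract of SiteDefs.pm : all parameter names, paths and species assignments are hypothetical.

```python
SERVER_ROOT = "/usr/local/ensembl"  # hypothetical install root shared by all sites
SITES = ("ensembl1", "ensembl2")

# Shared by every instance: a plain value, as the $ENSEMBL_ scalars were.
SHARED_TMP_DIR = f"{SERVER_ROOT}/tmp"

# Site specific: one table per parameter, keyed by web site name.
CONF_DIR = {site: f"{SERVER_ROOT}/{site}/conf" for site in SITES}
HTDOCS_DIR = {site: f"{SERVER_ROOT}/{site}/htdocs" for site in SITES}
SPECIES = {
    "ensembl1": ["Saccharomyces_Cerevisiae"],
    "ensembl2": ["Anophele_gambiae"],
}


def site_parameters(site):
    """Gather the shared and the site specific parameters of one web site."""
    if site not in SPECIES:
        raise KeyError(f"unknown site: {site}")
    return {
        "site": site,
        "tmp_dir": SHARED_TMP_DIR,
        "conf_dir": CONF_DIR[site],
        "htdocs_dir": HTDOCS_DIR[site],
        "species": SPECIES[site],
    }


if __name__ == "__main__":
    print(site_parameters("ensembl2"))
```

Keeping the shared values outside the tables mirrors the choice of limiting the site specific part to the four items listed earlier.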


NB : the $DEFAULT_SERVER variable is used in the launch process because at that point no $ENV{SERVER_NAME} variable is set yet.
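The fallback described in this note amounts to the following lookup, shown here as a Python sketch. The function name and the default value are hypothetical; only SERVER_NAME and the idea of a default server come from the text.

```python
import os

DEFAULT_SERVER = "ensembl1"  # hypothetical value, used while apache is starting


def current_server(environ=None, default=DEFAULT_SERVER):
    """Return the virtual host name of the request, or the default at launch time."""
    environ = os.environ if environ is None else environ
    # During the launch process no request exists, so SERVER_NAME is absent.
    return environ.get("SERVER_NAME") or default


if __name__ == "__main__":
    print(current_server({}))                            # launch time
    print(current_server({"SERVER_NAME": "ensembl2"}))   # request time
```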

Fig. 8 SiteDefs.pm multi instance species organization

The two previous screen shots and the next one present the modifications made in SiteDefs.pm.

Fig. 9 SiteDefs.pm new export hash tables

III.1.d Registry modification :

The mart registry stores all the configuration parameters of the ini files and offers a very simple mechanism to add and retrieve this data through the AUTOLOAD function. The structure of the data tree is built by the SpeciesDefs.pm module while parsing the files and is stored on the hard drive in the config.packed file (conf directory). The generation of the file depends on a parameter set in SiteDefs.pm ($ENSEMBL_CONFIG_BUILD). If this parameter is set to 1, SpeciesDefs.pm will try to load the Registry with the config.packed file data. If it is set to 0, the config.packed file will be regenerated at each apache start.

Fig 10 : usual config.packed data organization

In fact there are two data structures, named _storage and _multi, in the config.packed file. The first stores all species data and multi data. The second stores only multi data. These structures are built as trees and the root of the tree is the species name defined in SiteDefs.pm. To adapt the system to multi instance we have to add a new level in the tree : the site or server level. To limit the amount of changes we decided simply to add this new level to the tree root, turning Anophele_gambiae into ensembl1|Anophele_gambiae.


The data structure now looks like this :

Fig 11 : multi site config.packed data organization
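The effect of adding the site level to the tree root can be sketched in Python as below. The `|` separator follows the ensembl1|Anophele_gambiae example of the text; the function names and the content of the trees are invented for the illustration and do not reproduce the real _storage structure.

```python
SEPARATOR = "|"


def storage_key(site, species):
    """Build the root key of one site and species pair."""
    return f"{site}{SEPARATOR}{species}"


def add_site_level(per_site_trees):
    """Merge {site: {species: data}} into one tree keyed by 'site|species'."""
    storage = {}
    for site, species_tree in per_site_trees.items():
        for species, data in species_tree.items():
            storage[storage_key(site, species)] = data
    return storage


if __name__ == "__main__":
    storage = add_site_level(
        {
            "ensembl1": {"Anophele_gambiae": {"database": "site1_core"}},
            "ensembl2": {"Anophele_gambiae": {"database": "site2_core"}},
        }
    )
    # The same species can now be configured differently on each site.
    print(sorted(storage))
```

Because only the root key changes, the code reading the branches below it can stay as it was, which is why this option limits the amount of changes.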

We now have a multi instance SiteDefs.pm and Registry. We therefore have to modify the modules and scripts using these data structures. Fortunately, Ensembl and BioMart are object oriented and well coded, which greatly simplified our work.

III.1.e Modules modifications :

The modifications in the perl modules have different aims :
- to generate registry entries specific to each site,
- to redirect the requests of static pages,
- to modify the species name using the species aliases,
- to access the registry for configuration parameters.

SpeciesDefs.pm : configuration data storage :

This module, through its _parse function, generates the data structure loaded in the registry. We added a new loop creating the _storage tree for each site|species instance.

Fig 12 : SpeciesDefs.pm loop

Fig 13 : SpeciesDefs.pm data storage

As you can see in this screen shot, the _storage dataset name is now composed of the server name and the species name. Here we call dataset a specific branch of the tree located directly below the root.

Handler.pm : redirecting the requests of static pages :

One function of the Handler.pm module is to redirect the web query to the right location, static page or perl script. This is done by analyzing the query. The modification here was just to limit the search space to the server specific dataset.

Handler.pm : using species aliases tables :

The same module also handles the species aliases. It modifies the URL if the species name corresponds to a given alias.


NB : strangely, the alias structure is created during the launch process by lines located in the module but outside of the functions. This was dealt with using the $DEFAULT_SERVER variable.

Registry.pm : configuration parameter access :

Depending on which dataset we wanted to access, we had to modify modules such as HTML.pm, Page.pm or Registry.pm to add the $ENV{SERVER_NAME} variable to the species name.

Fig 14 : Ensembl modules modifications

Once these modifications were done, the multi instance was working fine for Ensembl. The next part explains briefly how to add a new instance.

III.1.f adding a new Ensembl instance :

The steps to add a new instance are :
- create and fill the corresponding databases,
- create the root directory of the instance,
- create the conf, htdocs and logs directories of the instance and fill the first two with the needed files. For the conf directory you should have one ini file per species, a DEFAULT.ini and a MULTI.ini file. For the htdocs directory you should have an index.html file and one complete directory per species,
- modify the httpd.conf to add the new virtual host,
- modify the SiteDefs.pm file to add the site specific parameters,
- delete the config.packed file,
- stop and start apache httpd.
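The directory related steps of this list can be scripted. The Python sketch below only creates an empty skeleton (directories and placeholder files) for a hypothetical site; the function name and the site name are invented, and the databases, httpd.conf, SiteDefs.pm, the removal of config.packed and the apache restart still have to be handled by hand.

```python
import tempfile
from pathlib import Path


def create_instance_skeleton(server_root, site, species):
    """Create conf, htdocs and logs for one site and return the created paths."""
    root = Path(server_root) / site
    for sub in ("conf", "htdocs", "logs"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # conf: DEFAULT.ini, MULTI.ini and one ini file per species (empty placeholders)
    for name in ["DEFAULT.ini", "MULTI.ini"] + [f"{s}.ini" for s in species]:
        (root / "conf" / name).touch()
    # htdocs: an index.html and one directory per species
    (root / "htdocs" / "index.html").touch()
    for s in species:
        (root / "htdocs" / s).mkdir(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as server_root:
        created = create_instance_skeleton(
            server_root, "ensembl3", ["Saccharomyces_Cerevisiae"]
        )
        for path in created:
            print(path)
```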

III.2 In BioMart :

The organization of BioMart is a bit different from Ensembl. First, the Registry structure is defined in the meta_configuration table of the database. The database location and access parameters are themselves defined in the martRegistry.xml file. This explains why the perl.startup file first checks whether the martRegistry.xml file exists; if not, it creates it from a template file using the data present in the MULTI.ini configuration file.

Then, using the connection information of the martRegistry.xml file, perl.startup launches the Registry building process, as in Ensembl. The registry is then dumped in a configuration file named martconf.packed. This file is stored in the conf directory.

The 'name' parameter of the martRegistry.xml file was used to specify the server name of the BioMart instance. This is not very effective because it limits the mart capabilities quite a lot : you can have only one mart database per instance, and this instance is named after the server. Nevertheless this works.

Another difference between BioMart and Ensembl is the use of a cache. Here, once more, we had to define the server to which the cache belongs, by adding the server name to the cache file name.

The following sections present the modifications we have made.

III.2.a Directories tree organization :

It is the same as in Ensembl. Only the conf directory is used to store the martRegistry.xml file of each server.

III.2.b Databases :

Each BioMart site has its own database defined in the different MULTI.ini files.

NB : the configuration of the database is done in the MULTI.ini file, then reused to generate the martRegistry.xml file. This is done in Registry.pm of Ensembl. This did not work on our local instance. We had to configure the martRegistry.xml manually.


Here is the martRegistry.xml we used for server ensembl1.

Fig 15 : martRegistry.xml

III.2.c Configuration files modifications :

No other configuration file modification is needed.

III.2.d modules modification :

As in Ensembl, the registry information must be stored using a server specific tag. This is done by adding the server name to the dataset name when creating the registry data. Some functions are used both for building the registry data and for querying it, with or without a dataset name; therefore the module has to test whether or not to add the server name. In perl.startup the processing of the single martRegistry.xml is replaced by a loop over all sites and the generation of a %MR_file hash table instead of a variable. See the code below.

Fig 16. perl.startup martRegistry generation loop
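In outline, such a loop builds one entry per site instead of a single file name. The Python sketch below illustrates the idea and is not the code of perl.startup : the function name is hypothetical, and the site/conf/martRegistry.xml layout is assumed from the directory organization described earlier.

```python
import tempfile
from pathlib import Path


def collect_registry_files(server_root, sites):
    """Return {site: path} for every site whose conf/martRegistry.xml exists."""
    mr_file = {}
    for site in sites:
        candidate = Path(server_root) / site / "conf" / "martRegistry.xml"
        if candidate.is_file():
            mr_file[site] = str(candidate)
    return mr_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as server_root:
        conf = Path(server_root) / "ensembl1" / "conf"
        conf.mkdir(parents=True)
        (conf / "martRegistry.xml").write_text("")  # empty placeholder file
        # ensembl2 has no registry file here, so only ensembl1 is listed.
        print(sorted(collect_registry_files(server_root, ["ensembl1", "ensembl2"])))
```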

The generated table is used to create a new initializer.

Fig 17. perl.startup new BioMart initializer

The new constructor of the initializer is modified to work on a table instead of a single element. This hash table will be processed to build the martconf.packed file. This is done in the Biomart::Initializer module.

Fig 18. Initializer modifications

The initializer creates a registry which is retrieved in perl.startup, and the configuration trees are built using this registry. In the getRegistry function of Biomart::Initializer.pm the name of the dataset is modified to add the name of the database (same as the virtual server). See the code below.

Fig 19. Initializer modifications 2


The datasets are then parsed to retrieve the configuration trees. The result can be seen in the martconfig.packed generation output.

Fig 20. BioMart configuration output

The registry data structure is now ready and the dataset names can be separated by the $ENV{SERVER_NAME} environment variable.

III.2.e perl script modification :

The modifications made in martview have different purposes :
- list the databases and datasets of the corresponding server,
- separate cache files for each server,
- redirect the query to the corresponding datasets.

Let's see how we have done this.

The way cache file names are constructed makes it very easy to separate cache files. We just add the server name to the cache file naming array. See the line marked #CK in the next script.

Fig 21. specific cache file names
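The principle is simply to make the server name one more element of the name. A minimal Python sketch, in which the other name parts, the separator and the file extension are all invented for the example :

```python
def cache_file_name(name_parts, server_name, separator="_", extension=".cache"):
    """Join the usual name parts, with the server name added in front."""
    return separator.join([server_name, *name_parts]) + extension


if __name__ == "__main__":
    parts = ["martview", "datasets"]  # hypothetical naming elements
    print(cache_file_name(parts, "ensembl1"))
    print(cache_file_name(parts, "ensembl2"))
```

Two servers asking for the same data therefore never read or overwrite each other's cache file.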

To list the databases and datasets corresponding to the server on the start page of MartSearch, we modified the following lines.

Fig 22. functions retrieving server specific data

We had to create new functions in the registry.pm module to retrieve only the dataset names and display names of the server. We also created a function to retrieve the database names of a given server. An example of these functions is presented below.


To do this we have added a simple test in the new function (see the script on the right, between the comment lines).

Fig 23. getAllDatatsetNamesServer function
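The kind of test involved can be sketched as a filter on the server part of each dataset name. This Python illustration is not the registry.pm code : the function name, the `|` separator between server and dataset name, and the choice of returning the names without their server prefix are assumptions.

```python
SEPARATOR = "|"  # assumed separator between server name and dataset name


def dataset_names_for_server(all_names, server):
    """Keep only the datasets registered for `server`, without the prefix."""
    prefix = server + SEPARATOR
    return [name[len(prefix):] for name in all_names if name.startswith(prefix)]


if __name__ == "__main__":
    registered = [
        "ensembl1|contigs",       # invented dataset names
        "ensembl1|libraries",
        "ensembl2|contigs",
    ]
    print(dataset_names_for_server(registered, "ensembl1"))
```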

To redirect the query to the right dataset we have added a name modification line. This line is included in a test because at the start of martview the dataset is empty.

Fig 24. Dataset renaming code
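A guarded renaming of this kind can be written as below. This is a Python sketch under the same assumption of a `|` separator; the function name is hypothetical and the check that avoids prefixing a name twice is an addition of the example, not something stated about martview.

```python
def qualify_dataset(dataset, server, separator="|"):
    """Prefix the dataset name with the server name, unless it is empty."""
    if not dataset:
        return dataset  # nothing selected yet, as on the first martview page
    prefix = server + separator
    if dataset.startswith(prefix):
        return dataset  # already qualified, do not prefix twice
    return prefix + dataset


if __name__ == "__main__":
    print(repr(qualify_dataset("", "ensembl1")))
    print(qualify_dataset("contigs", "ensembl1"))
```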

Now MartSearch on server ensembl1 gives access to the datasets of database ensembl1 and MartSearch on server ensembl2 gives access to the datasets of database ensembl2.

NB : we have not yet implemented the second dataset choice and the links, but this could probably be done in the same way.

III.2.f adding a new BioMart instance :

We expect the corresponding Ensembl server to be already working.

The steps to add a new instance are :
- create and fill the corresponding databases,
- set the right values in the MULTI.ini file of this server,
- modify the httpd.conf to add the new virtual host,
- delete the martconf.packed file,
- stop and start apache.


IV Conclusion :

This work has been a very good way for us to understand how the launching process of Ensembl and BioMart works. We hope it will help others to understand it as well. The possibility to run Ensembl or BioMart like any other web application, meaning without having to run it in a new Apache instance, could help to install it without root privileges [OK, you can also use ports over 1024 and have several httpd running]. It may also help developers to look at Ensembl as a platform which could more easily be integrated in their existing architecture.

This work also showed us the new organization and features of BioMart; our web site is still using EnsMart. These new features permit access to multiple databases, but through a unique front page (MartSearch).

NB : the modified code is at the disposal of the community (47 MB).

PS : thanks to Nathanael Veillith for the discussions we have had.
