CBS-KNAW
The development of the MIRRI ICT
infrastructure for microbial resources
Paolo Romano, Boyke Bunk, Anna Klindworth, David Smith, Alexander
Vasilenko, Frank Oliver Glöckner and Vincent Robert
A common situation …
MS-Access, MySQL, MS-Excel
Outlook
1. Management system for curators
2. Publication of data for third parties
3. Interoperability
1. Management system for curators
A. MANAGE COLLECTION’S DATA USING WEB BASED APPLICATIONS
Pros:
• Accessibility to databases from anywhere
• Accessibility to databases from any device
• Possibly easy to use for basic operations
• Maintenance is easy for IT departments since the software is centrally installed and maintained
• No need for installations on curators’, researchers’ or technicians’ devices (desktop, laptop, tablet, smartphone, etc.) since access is through a browser
• The same software might be used for the management and the publication of data
Cons:
• Development costs are usually higher
• Development can be significantly more complex, since all browsers and their versions must be supported
• Some advanced or even basic functionalities might be much more difficult or impossible to program
• Rich interfaces or memory-demanding operations might be impossible
• The interface can be much slower than that of a desktop application
• Interactions with other software might be more difficult or impossible
• Maintenance might be more intensive, since the software must keep working with new browser versions
• Security issues are more complex to handle with web apps than with desktop applications, since the application is potentially accessible from any device by anyone
• Stable Internet connections are needed
1. Management system for curators
B. MANAGE COLLECTION’S DATA USING DESKTOP APPLICATIONS
Pros:
• Rich software interface
• Easy to use
• Fast response to the user’s commands
• Memory-demanding or interface-rich operations can easily be performed (within the technical limits of the OS, computer, etc., of course)
• Relatively easy to develop (for basic functionalities at least)
• Interactions with other software can be easy to establish. Pipelines can be created, and import-export functionalities are easy to implement or to use
• Data access security can easily be ensured
Cons:
• Installation can be problematic (different operating system (OS) versions, missing DLLs, etc.)
• Desktop applications (DA) are usually made for one OS (Windows, Mac or Linux) and won’t work with the others
• When installed on different computers, updates and upgrades of the software must be re-installed everywhere, which makes bug fixes and new versions harder to roll out
• DA are usually not accessible from a remote computer or device
• For software with limited installation options (fixed number of licenses), DA might become expensive and/or difficult to update/upgrade
• Can be heavy to manage for IT departments
1. Management system for curators
C. CREATE MANAGEMENT SOFTWARE USING IN-HOUSE RESOURCES
Pros:
• Tailor-made application fitting perfectly with the needs of the curators (at design time at least)
• Fast response when implementing new features and fixing bugs
• This solution can be quite cheap if the software remains simple
• Feasible if a strong and stable team of developers is available
Cons:
• Curators or researchers are rarely good software designers or programmers, which makes the resulting solution hard to use, maintain and develop further
• Professional developers are rarely available in culture collections (CC) because they are expensive
• Good developers tend to leave the CC for better-paid positions, leaving the software unmaintained and hardly usable by newly recruited developers
• This option can be extremely expensive when the desired functionalities are complex and extensive
• Most in-house solutions are not (easily at least) scalable (add/modify/remove tables, fields, operations, etc.), so redesign or complete rewriting of the software is often needed. This leads to an unstable interface for the end-users, which is a key issue
• Development takes a long time before the software is usable and stable, especially for single developers or small teams
• Many software packages were abandoned after a few months or years because they were too slow, difficult to use, user-unfriendly, buggy or unstable. This is a common situation in a CC
“If you think that professionals are expensive, wait until you work with amateurs …” Red Adair
1. Management system for curators
E. USE EXISTING OPEN-SOURCE OR FREE SOFTWARE
Pros:
• Large offer
• Very good and advanced software available
• Free of charge
• Unlimited use
• Good for collections with strong IT support and software developers
• Extensions possible by local developers (not always)
Cons:
• No complete solution for culture collections available
• Pipelines must be created, which can be difficult to achieve in a user-friendly way
• Using open-source software is far from easy, and in practice it may be impossible to work with other people’s code. Access to code can be an illusion
• Support might be a serious issue in case of problems … and there are always problems with any software …
1. Management system for curators
D. USE EXISTING COMMERCIAL SOFTWARE
Pros:
• Large offer
• Very good and advanced software available
• Support available
• Custom developments can be made by professionals
• Complete or near-complete solutions are available for culture collections, so why reinvent the wheel?
Cons:
• Few complete solutions for culture collections available
• Some solutions are not extensible/flexible or adapted to all collections
• Pure software companies have little biological background, making it difficult to communicate
• Costs associated with software can be significant
• Maintenance costs should remain under control
“If you think that professionals are expensive, wait until you work with amateurs …” Red Adair
But select the right professionals too …
1. Management system for curators
F. DATABASE TYPES
Good:
• Relational databases:
  • MySQL
  • PostgreSQL
  • MSSQL
  • Oracle
• Document-based or other advanced databases:
  • MongoDB
  • Vertica
  • etc.
• All data should be in databases
Not good:
• Proprietary databases
• Catalogs on paper
• Word
• Excel
• MS-Access
• Filemaker Pro
1. Management system for curators
G. DATABASE ACCESS & BACKUP
Good:
• Backup twice a day
• Sharding, which is the process of storing data records across multiple servers
• Live replication
• Databases should be physically close to the application, especially for large data exchanges or sequence alignments (for example)
Not good:
• No backup
• Remote databases are slow
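As an illustration of the sharding idea mentioned on this slide, here is a minimal sketch of hash-based shard routing. The server names and the strain identifier format are hypothetical, and real database engines ship their own sharding mechanisms; this only shows how a record key can be mapped to one of several servers in a stable way.

```python
import hashlib

# Hypothetical server names, for illustration only.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]


def shard_for(record_key: str, shard_count: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % shard_count


def server_for(record_key: str) -> str:
    """Return the (hypothetical) server that stores the given record."""
    return SHARDS[shard_for(record_key, len(SHARDS))]
```

The same key always lands on the same server, so reads and writes for one strain record stay together. Note that this simple modulo scheme moves most records when the number of servers changes.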
1. Management system for curators
H. INSTALLATION OF SOFTWARE, VERSIONING INFORMATION AND
TECHNOLOGY (IT) RESOURCES NEEDS
Good:
• No installation, or a simple or minimal one (true for web apps, less so for desktop apps if installed on all computers)
• Hosted solutions are super easy for both IT and users
Not good:
• Very complex installations or parameter settings
• Some LIMS software can be extremely hard, slow or expensive to set up
• Client-server apps are more difficult to maintain if installed on all computers, and updates can be challenging
• IT costs can be high:
  • Salaries
  • Servers, hardware, firewalls, SAN, etc.
  • Management software like VMware, etc.
1. Management system for curators
I. HOSTED SOLUTIONS
Pros:
• No installation
• Super easy for both IT and users
• Available anywhere, anytime on any device (computer, smartphone, tablet, etc.)
• Fast and reliable if backed by a good IT infrastructure and using Citrix
• Easy maintenance of software/databases
• No need to buy hardware (server, SAN, firewalls, etc.)
• No need to buy and maintain expensive and sophisticated software for the management and monitoring of the system (VMware vSphere, for example)
• No need to hire IT staff
• Continuous monitoring and support
• Given the number of services provided, hosted solutions are often much cheaper than running a complete infrastructure in-house
• Management of CC software and the associated database can be directly connected to the website used for publication of CC data
Cons:
• Require recurrent payments (monthly or annually), which means that these costs must be part of the annual budget of the CC
• Access to the database engine might not be possible (only backups of databases could be requested from time to time)
• Dependency on the hosting company
• Need an Internet connection to work
• Not possible with extremely slow or erratic Internet connections/networks
1. Management system for curators
J. MOST WANTED FUNCTIONALITIES
We love:
• Collection maintenance
• Strain distribution
• Research
• Screening
• Dynamic system (curators/researchers can change the system without the need for IT or developers)
• Advanced security and access management
• Tracking of database modifications by each user
• Ability to import and export data as text, images, DNA trace files, microplate reader data, MS-Excel, HTML, XML, FASTA, NCBI and more
• Linking or export of data to other websites such as GBIF, StrainInfo, NCBI, etc.
• Ability to create custom layouts such as invoices, catalogs, sample labels
• Strain stock management
• Customer information management
• Order and invoice management
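The "tracking of database modifications by each user" item could look roughly like the sketch below. It is an in-memory illustration only: the class names, field names and record identifiers are hypothetical, and a production system would store the log in the database itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    user: str
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    timestamp: datetime


@dataclass
class AuditedStore:
    """Toy record store that logs who changed which field, and when."""
    records: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def update(self, user, record_id, field_name, new_value):
        record = self.records.setdefault(record_id, {})
        old_value = record.get(field_name)
        if old_value == new_value:
            return  # nothing changed, so nothing is logged
        record[field_name] = new_value
        self.log.append(AuditEntry(user, record_id, field_name,
                                   old_value, new_value,
                                   datetime.now(timezone.utc)))

    def changes_by(self, user):
        """Return all logged modifications made by one user."""
        return [entry for entry in self.log if entry.user == user]
```

Keeping the old value next to the new one is what makes it possible to review or undo a curator's change later.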
1. Management system for curators
J. MOST WANTED FUNCTIONALITIES
We love:
• LIMS module to manage and track DNA sequencing projects, including revival of strains from collection stocks, DNA prep, PCR, gels, viewing, aligning and editing DNA sequences, and depositing consensus DNA sequences into the database and online catalog
• Scripting and debugging tools to automate routine tasks and extend the functionalities of the software
• Integration of scripts within existing menus of the software
• Reporting functions that allow export of data in many formats including tab-delimited, text, MS-Word, MS-Excel, HTML, FASTA, NCBI, etc.
• Integrated content management system for the administration of CC websites and associated communication devices
• Polyphasic identification and classification, to identify and classify strains based on a custom weighted combination of DNA sequence, physiological, morphological and other data
• Species determinations
• Cluster analysis using various algorithms such as UPGMA, WPGMA, Single and Complete Linkage, Ward’s Minimum Variance, and Neighbor Joining
• Dendrogram generation
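To make the cluster analysis item concrete, here is a minimal sketch of UPGMA, one of the algorithms named above. The input format (a dictionary keyed by pairs of labels) and the nested-tuple output are assumptions chosen for brevity; they are not taken from any particular package.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster labels with UPGMA.

    labels: list of strain names.
    dist:   dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a nested tuple (left, right, height) describing the dendrogram.
    """
    clusters = {label: 1 for label in labels}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b, d[frozenset((a, b))] / 2)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))
```

WPGMA differs only in the update step, where the two distances are averaged without weighting by cluster size. The nested tuples can be walked recursively to draw a dendrogram.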
1. Management system for curators
J. MOST WANTED FUNCTIONALITIES
We love:
• Pairwise DNA sequence alignment
• Multiple DNA sequence alignment
• Storage of data in many formats including text, dates, calculations, literature references, DNA sequence trace files, electrophoresis gel photos, GPS coordinates, microplate reader data (96 or 384 wells), and photos. Data types can thus include morphological, physiological, molecular, chemical, ecological, geographic, and literature reference data
• DNA gel analysis
• Cell size determination
• Import, manage, analyze and export spectral data such as MALDI-TOF or other systems
• Generation of dynamic geographic distribution maps using Google Maps
• etc.
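For the pairwise DNA sequence alignment item, a global alignment can be sketched with the Needleman-Wunsch algorithm. The scoring values below (match, mismatch, gap) are arbitrary assumptions for illustration; real tools use tuned scoring schemes and far faster implementations.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table costs time and memory proportional to the product of the two sequence lengths, which is one reason the earlier slide recommends keeping databases physically close to the application for sequence alignments.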
2. Publication of data for third parties
Curators want:
• Direct access to published data
• Easy/live release of new strains and associated data
• Restrict data access for Internet users/clients if needed
• Easy/live adaptation of webpages and website content
• Websites should be seen as a way to communicate with clients and end-users. This could be done by:
  • simple webpages
  • forums
  • news systems
• Change the look and some functionalities of the website on the fly without the intervention of website developers
• Allow deposit forms to be filled in by depositors of strains, so that data need not be re-typed manually
• Allow clients to easily select strains to be ordered via a cart system
• Know pending orders, payments and data associated with any client
• Allow end-users to search their databases according to the specificities of their collection
• Allow third parties to take advantage of their CC’s data to increase traffic to their websites. This can be done via friendly URLs, simple or advanced web services (REST, SOAP, etc.)
• etc.
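As a rough illustration of "friendly URLs" backing a REST-style service, the sketch below builds a readable address for a strain record and serializes the record as JSON. The host name, URL layout and record fields are all hypothetical; they do not describe any existing collection website.

```python
import json
from urllib.parse import quote

BASE_URL = "https://collection.example.org"  # hypothetical host


def strain_url(collection_acronym: str, strain_number: str) -> str:
    """Build a human-readable URL for one strain record."""
    return (f"{BASE_URL}/strains/"
            f"{quote(collection_acronym, safe='')}/"
            f"{quote(strain_number, safe='')}")


def strain_json(record: dict) -> str:
    """Serialize a strain record as the JSON body a REST service could return."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)
```

A third party can bookmark or link to such an address directly, and software can request the same address to obtain the record in a machine-readable form.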
2. Publication of data for third parties
Clients want:
• Easy search system covering as many features as possible
• Simple cart system allowing easy (de-)selection of strains to be ordered
• Not having to retype all personal or institutional information each time they order strains
• Fast and easy communication with curators or sales departments of the CC
• Frequently asked questions (FAQ) section answering most of their questions
• etc.
2. Publication of data for third parties
End-users want:
• Easy search system covering as many features as possible
• Advanced query system that combines simple queries into complex ones using AND, OR and NOT operators (including brackets to group conditions)
• Easy copy-pasting of data
• Easy export of selected data, manually or via software (web services)
• Pairwise DNA or protein sequence alignments against reference databases
• Polyphasic identifications and/or classifications against reference databases
• MLST (or similar methods) allowing identification or typing of strains
• etc.
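The advanced query system with AND, OR, NOT and brackets can be sketched as a small recursive-descent evaluator. The query syntax (uppercase operators, single-word terms) and the representation of a record as a set of lowercase keywords are assumptions made for this example.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")


def matches(query: str, terms: set) -> bool:
    """Evaluate a boolean query against a record's set of lowercase keywords."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing bracket")
            take()
            return value
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token: {tok}")
        return tok.lower() in terms

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return result
```

Here NOT binds tighter than AND, which binds tighter than OR, and brackets override that order. A real system would translate the parsed query into a database query rather than test records one by one.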
3. Interoperability
DATA STANDARDS AND PROTOCOLS
• BioSharing (http://biosharing.org/)
• Biodiversity Information Standards (TDWG; http://www.tdwg.org/)
• Genomic Standards Consortium (GSC; http://en.wikipedia.org/wiki/Genomic_Standards_Consortium)
• etc
LINKS TO EXISTING RESOURCES
• STRAININFO
• WDCM
• TAXONOMIC DATABASES (MYCOBANK, DSMZ, ETC)
• GBIF
• INSDC (NCBI, EMBL, DDBJ, ETC)
• BOLD
• LIFEWATCH, BIOVEL, VIBRANT, LIFELINK, ELIXIR, Q-BANK, ETC
• MANY MORE …
Work in progress
We need your help, opinions, suggestions and criticism
Contact us : [email protected]