Free software and bioinformatics

Free software and biomedical research Alberto Labarga [email protected]

description

An overview of the free software philosophy as it has been applied to the field of bioinformatics.

Transcript of Free software and bioinformatics

Page 1: Free software and bioinformatics

Free software and biomedical research

Alberto Labarga

[email protected]

Page 2: Free software and bioinformatics

When Craig Venter was asked, “What makes you think you can do a better job with life and genetics than God?”, he answered…

Page 3: Free software and bioinformatics

we have computers

Page 4: Free software and bioinformatics

and software too

Page 5: Free software and bioinformatics

free software!

Page 6: Free software and bioinformatics

biology is a data intensive science

Page 7: Free software and bioinformatics
Page 8: Free software and bioinformatics

Scientific information available in 2010 will double every 72 hours

Page 9: Free software and bioinformatics
Page 10: Free software and bioinformatics

data mining

Page 11: Free software and bioinformatics

my data is mine!

Page 12: Free software and bioinformatics

and your data is mine, too!

Page 13: Free software and bioinformatics

open source, open data, open access

Page 14: Free software and bioinformatics

open science

Page 15: Free software and bioinformatics

Comparative genomics

Sequence (DNA/RNA) & phylogeny

Regulation of gene expression; transcription factors & micro RNAs

Protein sequence analysis & evolution

Protein families, motifs and domains

Protein structure & function: computational crystallography

Protein interactions & complexes: modelling and prediction

Chemical biology

Pathway analysis

Systems modelling

Image analysis

Data integration & literature mining

Page 16: Free software and bioinformatics

The first Atlas of Protein Sequence and Structure presented information about 65 proteins.

Page 17: Free software and bioinformatics

In 1981 the EMBL Nucleotide Sequence Data Library was created. Version 2 comprised 811 sequences, around 1 million bases, entered by hand.

Page 18: Free software and bioinformatics

Smith TF, Waterman MS (1981). "Identification of common molecular subsequences." J Mol Biol. 147 (1): 195-7.
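The idea behind the Smith and Waterman paper, local alignment by dynamic programming, can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: it assumes a simple linear gap penalty, and the scoring values (match 2, mismatch -1, gap -1) are arbitrary choices made for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "GGACGGG"))
```

Heuristic tools such as BLAST exist because this exhaustive table costs time proportional to the product of the two sequence lengths.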

Page 19: Free software and bioinformatics

S.F. Altschul, et al. (1990), "Basic Local Alignment Search Tool," J. Molec. Biol., 215(3): 403-10. 15,306 citations.

Page 20: Free software and bioinformatics

J. Thompson, T. Gibson, D. Higgins (1994), "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment," Nucleic Acids Res. 22: 4673-4680.

Page 21: Free software and bioinformatics

In 1995 the European Bioinformatics Institute was created.

Page 22: Free software and bioinformatics

EMBOSS (The European Molecular Biology Open Software Suite) is a free, open source software package that provides a comprehensive set of sequence analysis programs specially developed for the needs of the molecular biology user community.

The first requirements were based on a list of long-standing problems in existing commercial software (GCG) and on the need for public source code.

Within EMBOSS you will find around 200 programs (applications).

Current version is 6.0.1 

http://emboss.sourceforge.net/

Page 23: Free software and bioinformatics

Main Programs in EMBOSS

• Retrieve sequences from database

• Sequence alignment

• Nucleic gene finding and translation

• Protein secondary structure prediction

• Rapid database searching with sequence patterns

• Protein motif identification, including domain analysis

• Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats

• Codon usage analysis for small genomes

• Rapid identification of sequence patterns in large scale sequence sets

• Presentation tools for publication

Page 24: Free software and bioinformatics
Page 25: Free software and bioinformatics

open‐bio.org

• The Open Bioinformatics Foundation is a non-profit, volunteer-run organization focused on supporting open source programming in bioinformatics.

• Its main activities are:

– Underwriting and supporting the BOSC conferences

– Organizing and supporting developer‐centric "hackathon" events (Bio*)


Page 26: Free software and bioinformatics
Page 27: Free software and bioinformatics

O’Reilly Books and Conferences

Page 28: Free software and bioinformatics
Page 29: Free software and bioinformatics

http://www.ensembl.org

Page 30: Free software and bioinformatics

http://www.uniprot.org

Page 31: Free software and bioinformatics

Generic Model Organism Database project

http://gmod.org

Page 32: Free software and bioinformatics

DAS Concept

Reference server

Annotation server A

Annotation server B

Annotation server C

Client

http://www.biodas.org

Page 33: Free software and bioinformatics

DAS Server

• DAS request to retrieve features on a segment:

• http://das.ensembl.org/das/ens_36_omim_genes/features?segment=1:1,1000000

• Result:
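A sketch of how a client might build the request above and read features out of a DASGFF-style response. The XML sample is hand-written and hypothetical (it is not output from the Ensembl server), the helper names are invented for this example, and only a few of the elements a real response carries are shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def das_features_url(server, source, chrom, start, stop):
    """Build a DAS 'features' request for one segment."""
    segment = f"{chrom}:{start},{stop}"
    # Keep ':' and ',' readable, as in the URL shown on the slide.
    return f"{server}/{source}/features?" + urlencode({"segment": segment}, safe=":,")


# Hypothetical response, written by hand for illustration only.
SAMPLE_RESPONSE = """
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features">
    <SEGMENT id="1" start="1" stop="1000000">
      <FEATURE id="demo_feature_1" label="Demo gene">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>3400</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_features(xml_text):
    """Return one dict per FEATURE element found in the response."""
    root = ET.fromstring(xml_text.strip())
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "label": feature.get("label"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


url = das_features_url("http://das.ensembl.org/das", "ens_36_omim_genes", 1, 1, 1000000)
print(url)
print(parse_features(SAMPLE_RESPONSE))
```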

Page 34: Free software and bioinformatics

DAS viewer

Page 35: Free software and bioinformatics

http://www.ebi.ac.uk/dasty/

Page 36: Free software and bioinformatics

Applied Biosystems ABI 3730XL

Illumina / Solexa Genetic Analyzer

Applied Biosystems SOLiD

Roche / 454 Genome Sequencer

1 Mb/day 100 Mb/run 3000 Mb/run

Page 37: Free software and bioinformatics

Sequencing: fragment assembly problem; the Shortest Superstring Problem; Velvet (Zerbino, 2008)

Gene finding: Hidden Markov Models, pattern recognition methods; GenScan (Burge & Karlin, 1997)

Sequence comparison: pairwise and multiple sequence alignments; dynamic programming, heuristic methods; PSI-BLAST (Altschul et al., 1997), SSAHA (2001), MUMmerGPU (2008)
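To make the Shortest Superstring view of fragment assembly concrete, here is a greedy sketch: repeatedly merge the two reads with the largest suffix/prefix overlap. This is a textbook approximation chosen for illustration; it is not how Velvet works (Velvet builds de Bruijn graphs), and the function names and reads are made up for the example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedily merge reads into one string that contains every read."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Reads fully contained in another read add nothing to the superstring.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


fragments = ["ATGCG", "GCGTA", "GTACC"]
assembled = greedy_superstring(fragments)
print(assembled)
```

The greedy choice is not guaranteed to give the shortest superstring, and repeats in real genomes are one reason practical assemblers use graph-based methods instead.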

Page 38: Free software and bioinformatics
Page 39: Free software and bioinformatics

Genomes

Nucleotides

Proteins

Structures

Other molecules

Interactions

Experiments

Literature

Ontologies

http://www.ebi.ac.uk/Databases/

Page 40: Free software and bioinformatics

Practical course on databases and the integration of biological information

Page 41: Free software and bioinformatics

Challenges of Data Integration

• Different types of data (sequence, function, literature etc.)

• Different data formats (FASTA, EMBL, Genbank, tab delimited etc.)

• Different storage formats (ASCII flatfile, XML, RDBMS)

• No standard formats for common fields (citations, descriptions, dates etc.)

• Volume and size of data
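The format problem in the list above can be made concrete with a small sketch: read FASTA text into a neutral record structure, then write it back out as tab-delimited rows. The function names and the sample records are invented for this example, and real FASTA headers are far less regular than the ones assumed here.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_tab_delimited(records):
    """Write records as 'identifier<TAB>description<TAB>sequence' rows."""
    rows = []
    for header, sequence in records:
        identifier, _, description = header.partition(" ")
        rows.append("\t".join([identifier, description, sequence]))
    return "\n".join(rows)


sample = ">seq1 demo protein\nMRCS\nISLV\n>seq2\nACGT\n"
records = parse_fasta(sample)
print(to_tab_delimited(records))
```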

Page 42: Free software and bioinformatics

BioMart is a simple and robust data integration system for large scale data querying, providing researchers with fast and flexible access to biological databases

http://www.biomart.org/

Page 43: Free software and bioinformatics

Web Services

http://www.ebi.ac.uk/Tools/

Page 44: Free software and bioinformatics

Challenges when using tools in unison

• Manually transfer data from one application to another

• Understand disparate data formats

• Convert file formats where appropriate

• Manage and understand disparate application environments e.g. web browser, desktop application

Page 45: Free software and bioinformatics
Page 46: Free software and bioinformatics

[Diagram: dataflow and workflow across a chain of web services (ws), with curation and submission steps]

Page 47: Free software and bioinformatics

REST: REpresentational State Transfer

http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human

GET, POST

HTML,XML,PNG

RESTful web services

Page 48: Free software and bioinformatics

Any web page is a web service

http://www.ebi.ac.uk/cgi-bin/dbfetch?db=uniprot&id=alk1_human&style=html&format=default

Page 49: Free software and bioinformatics

Friendly URL and XML documents

• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human

• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human/xml

• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human/fasta
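The pattern behind these friendly URLs is base/database/identifier, with an optional format at the end. A small sketch that only assembles the strings; the helper name is invented, and the endpoint is the one shown on the slide, which may no longer be served at this address.

```python
from urllib.parse import quote

# Base address as it appears on the slide.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch"


def dbfetch_url(db, entry_id, fmt=None):
    """Build a dbfetch-style REST URL: base/db/id[/format]."""
    parts = [db, entry_id]
    if fmt:
        parts.append(fmt)
    # Escape each path component so unusual identifiers cannot break the path.
    return DBFETCH_BASE + "/" + "/".join(quote(p, safe="") for p in parts)


for fmt in (None, "xml", "fasta"):
    print(dbfetch_url("uniprot", "slpi_human", fmt))
```

Fetching an entry is then a plain HTTP GET on the resulting URL, which is what makes this style easy to use from any language or from a browser.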

Page 50: Free software and bioinformatics

Biomart query

<Query virtualSchemaName="central_server_1">
  <Dataset name="hsapiens_gene_ensembl">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_transcript_id"/>
    <Filter name="chromosome_name" value="1"/>
    <Filter name="band_end" value="p36.33"/>
    <Filter name="band_start" value="q44"/>
  </Dataset>
  <Dataset name="msd">
    <Attribute name="pdb_id"/>
    <Attribute name="experiment_type"/>
    <Filter name="experiment_type" value="NMR"/>
  </Dataset>
</Query>
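Writing such a query by hand invites quoting mistakes, so a client would typically generate it. A minimal sketch using Python's standard XML library; the helper name and its argument layout are assumptions made for this example, and the dataset, attribute and filter names are copied from the slide rather than checked against a live BioMart server.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(schema, datasets):
    """Build a BioMart-style <Query> document.

    datasets: list of (dataset_name, attribute_names, [(filter_name, value), ...])
    """
    query = ET.Element("Query", virtualSchemaName=schema)
    for name, attributes, filters in datasets:
        dataset = ET.SubElement(query, "Dataset", name=name)
        for attribute in attributes:
            ET.SubElement(dataset, "Attribute", name=attribute)
        for filter_name, value in filters:
            ET.SubElement(dataset, "Filter", name=filter_name, value=value)
    return ET.tostring(query, encoding="unicode")


query_xml = build_biomart_query("central_server_1", [
    ("hsapiens_gene_ensembl",
     ["ensembl_gene_id", "ensembl_transcript_id"],
     [("chromosome_name", "1"), ("band_end", "p36.33"), ("band_start", "q44")]),
    ("msd",
     ["pdb_id", "experiment_type"],
     [("experiment_type", "NMR")]),
])
print(query_xml)
```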

Page 51: Free software and bioinformatics

SOAP: Simple Object Access Protocol

fetchData(uniprot,wap_rat,default,xml)

SOAP services

Page 52: Free software and bioinformatics

fetchData (db, id, format, style)

entry

wsdbfetch

Page 53: Free software and bioinformatics

Perl client

use strict;
use warnings;
use SOAP::Lite;

# Build a client from the WSDbfetch service description
my $WSDL = 'http://www.ebi.ac.uk/Tools/webservices/wsdl/WSDbfetch.wsdl';
my $soap = SOAP::Lite->service($WSDL);

# fetchData dbName:id <format> <style>
my $result = $soap->fetchData('uniprot:wap_rat', 'default', 'raw');
die $soap->call->faultstring if $soap->call->fault;

# Print the returned entry line by line
foreach my $i (@$result) { print "$i\n"; }

Page 54: Free software and bioinformatics

EBI web services (analysis tools)

run(params, data) -> jobid

checkStatus(jobid) -> status

getResults(jobid) -> results available

poll(jobid, type) -> result file

Page 55: Free software and bioinformatics

Perl client

use strict;
use warnings;
use SOAP::Lite;

my $WSDL = 'http://www.ebi.ac.uk/Tools/webservices/wsdl/WSFasta.wsdl';
my $fasta_client = SOAP::Lite->service($WSDL);

# Job parameters; async => 1 submits the job without waiting for it to finish
my %params = ();
$params{'program'}  = 'fasta3';
$params{'database'} = 'uniprot';
$params{'email'}    = '[email protected]';
$params{'async'}    = 1;

# Query given as a raw sequence...
my $data = { type => "sequence", content => "MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECI" };

# ...or as a database identifier:
# my $data = { type => "sequence", content => "uniprot:slpi_human" };

my $jobid = $fasta_client->runFasta(
    SOAP::Data->name('params')->type(map => \%params),
    SOAP::Data->name(content => [$data]));

print $fasta_client->poll($jobid);

Page 56: Free software and bioinformatics

Perl client (cont.)

# set a loop for checking job submission status
# RUNNING, NOT_FOUND, ERROR, DONE

my $status = $fasta_client->checkStatus($jobid);
while ($status eq "RUNNING") {
    sleep 10;
    $status = $fasta_client->checkStatus($jobid);
}

# when job is done, poll for the results

my $result = ($status eq "DONE") ? $fasta_client->poll($jobid) : undef;

print $result if defined $result;
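The submit, check, fetch pattern used by the Perl client does not depend on SOAP. The sketch below shows the same loop against an in-memory stand-in, so it runs without a network. `FakeJobService`, `wait_for_result` and their parameters are hypothetical names invented for this example and are not part of any EBI client library; the status strings are the ones listed on the slide.

```python
import time


class FakeJobService:
    """Hypothetical stand-in for a remote analysis service."""

    def __init__(self, checks_until_done=3):
        self._checks_until_done = checks_until_done
        self._remaining = {}
        self._counter = 0

    def run(self, params, data):
        """Submit a job and return its identifier immediately."""
        self._counter += 1
        job_id = f"job-{self._counter}"
        self._remaining[job_id] = self._checks_until_done
        return job_id

    def check_status(self, job_id):
        if job_id not in self._remaining:
            return "NOT_FOUND"
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "RUNNING"
        return "DONE"

    def poll(self, job_id):
        return f"results for {job_id}"


def wait_for_result(service, job_id, interval=10, max_checks=100, sleep=time.sleep):
    """Check the job status until it is DONE, then fetch the result."""
    for _ in range(max_checks):
        status = service.check_status(job_id)
        if status == "DONE":
            return service.poll(job_id)
        if status != "RUNNING":
            raise RuntimeError(f"job {job_id} ended with status {status}")
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_checks} checks")


service = FakeJobService(checks_until_done=2)
job = service.run({"program": "fasta3", "database": "uniprot"}, "MRCSISLV")
print(wait_for_result(service, job, interval=0))
```

Unlike the slide's loop, this version gives up after a fixed number of checks and treats any status other than RUNNING or DONE as an error, which avoids waiting forever on a failed job.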

Page 57: Free software and bioinformatics

http://taverna.sourceforge.net/

Page 58: Free software and bioinformatics

http://www.myexperiment.org/users/471

Page 59: Free software and bioinformatics
Page 60: Free software and bioinformatics

high throughput genomics

Page 61: Free software and bioinformatics

data management

Page 62: Free software and bioinformatics

https://carmaweb.genome.tugraz.at/

http://base.thep.lu.se/

Page 63: Free software and bioinformatics

Why must we support standards?

• Unambiguous representation, description and communication

– Final results and metadata

• Interoperability

– Data management and analysis

• Integration of OMICS -> systems biology

Page 64: Free software and bioinformatics

What to standardize?

• CONTENT: Minimal/Core Information to be reported -> MIBBI (http://www.mibbi.org)

• SEMANTIC: Terminology Used -> Ontologies, OBI (http://obi-ontology.org)

• SYNTAX: Data Model, Data Exchange -> FuGE (http://fuge.sourceforge.net/), ISA-TAB, MAGE-TAB, PRIDE

Page 65: Free software and bioinformatics

MIBBI: Standard Content

Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotechnology

Page 66: Free software and bioinformatics

data analysis

Page 67: Free software and bioinformatics

[Diagram: microarray analysis workflow. Labels: Biological question; Experimental design; Microarray experiment; Microarray; RT-PCR; Image analysis; Expression quantification; Pre-processing; Normalization; Analysis; Estimation; Testing; Clustering; Prediction; Biological verification and interpretation]

Page 68: Free software and bioinformatics

r‐project.org

• R is an open source implementation of the S Language

• Many statistical and machine learning algorithms

• Good visualization capabilities

• Possible to write scripts that can be reused

• Sophisticated package creation and distribution system

• Supports many data technologies: XML, DBI, SOAP

• Interacts with other languages: C; Perl; Python; Java

• R is largely platform independent: Unix; Windows; OSX

• R has an active user community

cran.r‐project.org

Page 69: Free software and bioinformatics

BioConductor

• Access a wide range of powerful statistical and graphical tools

• Facilitate the integration of biological metadata in the analysis of experimental data

• Allow the rapid development of extensible, scalable, and interoperable software

• Promote high-quality documentation and reproducible research

• Provide training in computational and statistical methods for the analysis of genomic data

http://www.bioconductor.org/

Page 70: Free software and bioinformatics

Bioconductor Packages/libraries

Two releases each year that follow the biannual releases of R 

294 software packages

490 Metadata packages

>700 citations

[Chart: number of software packages per Bioconductor release, from release 1.1 through release 2.3 (294 packages)]

Page 71: Free software and bioinformatics

Bioconductor for Microarray Analysis

• Quickly becoming the accepted approach

• Open source

• Flexible

• (fairly) simple to use ‐ intuitive

• Wide applications – many packages

Page 72: Free software and bioinformatics

affy package

Pre-processing oligonucleotide chip data:

• diagnostic plots

• background correction

• probe-level normalization

• computation of expression measures

image

plotDensity

plotAffyRNADeg

barplot.ProbeSet
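Of the pre-processing steps listed for the affy package, normalization is the easiest to show in isolation. The sketch below uses quantile normalization as an illustrative choice; the slide does not say which method is meant, and this plain Python version is not the affy implementation. Tied values are ranked by position rather than averaged, and the intensities are made-up numbers.

```python
def quantile_normalize(chips):
    """Give every chip the same intensity distribution.

    chips: list of equal-length lists, one list of probe intensities per chip.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]

    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [
        sum(chip[k] for chip in sorted_chips) / len(chips)
        for k in range(n_probes)
    ]

    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda idx: chip[idx])
        result = [0.0] * n_probes
        for rank, idx in enumerate(order):
            result[idx] = reference[rank]
        normalized.append(result)
    return normalized


raw = [[5, 2, 3], [4, 1, 6]]
print(quantile_normalize(raw))
```

After normalization each chip holds the same set of values, and only the ranking of probes within a chip is preserved.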

Page 73: Free software and bioinformatics

heatmap

mva package

Page 74: Free software and bioinformatics
Page 75: Free software and bioinformatics

proteomics

Page 76: Free software and bioinformatics

http://www.agml.org/

Page 77: Free software and bioinformatics

Trans‐Proteomic Pipeline (TPP) is a collection of integrated tools for MS/MS proteomics

http://tools.proteomecenter.org

http://proteowizard.sourceforge.net

http://www.thegpm.org/TANDEM

Page 78: Free software and bioinformatics

Bioclipse

View

View

Editor

Console

Properties

http://www.bioclipse.net/

Page 79: Free software and bioinformatics

Work with spectra: Spectrum plugin

Page 80: Free software and bioinformatics

Work with sequences: BioJava plugin

Page 81: Free software and bioinformatics

CMLRSS plugin: Chemistry on the web

Page 82: Free software and bioinformatics

Cytoscape

http://www.cytoscape.org

Page 83: Free software and bioinformatics

PyMOL

http://www.pymol.org

Page 84: Free software and bioinformatics

image processing

Page 85: Free software and bioinformatics

Open Microscopy Environment

• OME is a multi‐site collaborative effort among academic laboratories and a number of commercial entities that produces open tools to support data management for biological light microscopy.

• The original OME server is an application written in Perl running under Apache. It is accessed using a Web User Interface, via a Java API, or using a plugin for ImageJ.

• The server can support images in a wide range of file formats. This model is also extendable allowing custom data to be stored in the server.

• It supports multiple users and provides appropriate security for private research and collaboration.

http://openmicroscopy.org

Page 86: Free software and bioinformatics

OMERO

Page 87: Free software and bioinformatics
Page 88: Free software and bioinformatics

OMERO

Page 89: Free software and bioinformatics

beyond software

Page 90: Free software and bioinformatics

At $150,000, the Polonator is the cheapest instrument on the market, says Harvard University's George Church, whose lab developed the technology in conjunction with Dover Systems. Plus, the tool uses five-fold less reagents than other platforms, and is the smallest instrument available.

http://www.polonator.org/

Page 91: Free software and bioinformatics

http://www.igem.org

http://www.bioparts.org/

Page 92: Free software and bioinformatics

where is the stuff

Page 93: Free software and bioinformatics

http://bioinformatics.oxfordjournals.org

Page 94: Free software and bioinformatics

http://nar.oxfordjournals.org

Page 95: Free software and bioinformatics

http://www.biomedcentral.com/bmcbioinformatics/

Page 96: Free software and bioinformatics

http://genomebiology.com/software/

Page 97: Free software and bioinformatics

the future

Page 98: Free software and bioinformatics

Growth of open access scientists

digital natives, always online, hybrids

catalysts for change

[Phil Bourne]

Page 99: Free software and bioinformatics
Page 100: Free software and bioinformatics

• Making scientific research “re‐useful”—We help people and organizations open and mark their research and data for reuse. 

• Enabling “one‐click” access to research materials—We help streamline the materials‐transfer process so researchers can easily replicate, verify and extend research. 

• Integrating fragmented information sources—We help researchers find, analyze and use data from disparate sources by marking and integrating the information with a common, computer‐readable language. 

Page 101: Free software and bioinformatics