Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects:...

Data management for large collaborative projects: challenges and solutions. Arek Kasprzyk, Head of Data Management, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute. March 27, 2014

description

Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, so the data models used to represent them change constantly as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because correctly interpreting the experimental data located in one database frequently requires additional information from other databases, which forces the user to learn multiple systems.

The BioMart project (www.biomart.org) was initiated to address these challenges. BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart offers different types of access tailored to different groups of users. For biologists, BioMart offers a number of interactive and customizable web-based graphical user interfaces. For bioinformaticians, BioMart provides data access through a range of application programming interfaces. For service providers, BioMart offers a highly customizable system that can be installed locally and tailored to support different types of data management needs.

In this talk I will share my experiences in managing data for large international collaborations involving academic and industry partners. I will also outline the current status of BioMart’s software and services, and describe its new features, such as tools for analysing next-generation sequencing data.

Transcript of Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects:...

Page 1: Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects: Challenges and Solutions

Data management for large collaborative projects: challenges and solutions

Arek Kasprzyk

Head of Data Management

Center for Translational Genomics and Bioinformatics

San Raffaele Scientific Institute

March 27, 2014

Page 2

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

Big Data

Page 3

2000 – 1 genome: Human Genome Project

2008 – 1,000 genomes: 1000 Genomes Project

2008 – 25,000 genomes: ICGC

2012 – 100,000 genomes: UK Genomes

Big Data?

Page 4

Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, and clinical effects of mutations, as well as similarities of biological sequences and structures.

Biomedical databases

Page 5

www.biomart.org

Page 6

Egyptian Hieroglyphs

Page 7

Phoenician Alphabet

Page 8

Biological abstractions

Page 9

Query abstractions

Dataset

Filter

Attribute

Page 10

Examples

human genes

located on chromosome 1, expressed in lungs

name, chromosome, description

rat genes

up-regulated in brain and associated with a QTL for a neurological disorder

Upstream sequences

Rihanna songs

released before 2012

UK top 10
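The examples above each pair a dataset with filters (which records to keep) and attributes (which fields to return). A minimal Python sketch of that abstraction follows. It is an illustration only, not BioMart code: the `Query` class, the field names and the toy gene records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """A query expressed only as a dataset, filters and attributes (hypothetical class)."""
    dataset: str
    filters: dict = field(default_factory=dict)      # restrict which records match
    attributes: list = field(default_factory=list)   # choose which fields are returned

    def run(self, rows):
        """Keep rows matching every filter, then project the requested attributes."""
        matched = (r for r in rows
                   if all(r.get(k) == v for k, v in self.filters.items()))
        return [{a: r.get(a) for a in self.attributes} for r in matched]


# Hypothetical toy records standing in for a "human genes" dataset.
HUMAN_GENES = [
    {"name": "GENE_A", "chromosome": "1", "expressed_in": "lung", "description": "toy record A"},
    {"name": "GENE_B", "chromosome": "2", "expressed_in": "lung", "description": "toy record B"},
    {"name": "GENE_C", "chromosome": "1", "expressed_in": "brain", "description": "toy record C"},
]

query = Query(
    dataset="human genes",
    filters={"chromosome": "1", "expressed_in": "lung"},
    attributes=["name", "chromosome", "description"],
)
result = query.run(HUMAN_GENES)
print(result)
```

The same three-part shape applies to any dataset, which is the point the non-biological "Rihanna songs" example makes.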

Page 11

Graphical User Interface

Dataset

Filter

Attribute

Page 12

SOAP/REST

<Query>
  <Dataset name="hsapiens_gene_ensembl">
    <Filter name="chromosome_name" value="1"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_transcript_id"/>
    <Attribute name="biotype"/>
  </Dataset>
</Query>
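The query document on this slide can also be assembled programmatically. The sketch below uses only the Python standard library and reproduces the element and attribute names shown on the slide. The helper `build_query` is hypothetical. The slide does not give a service URL or say whether the `Query` element needs further attributes, so the sketch only builds the XML and does not submit it.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Return a Query/Dataset/Filter/Attribute XML string (hypothetical helper)."""
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    # One Filter element per (name, value) restriction.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=value)
    # One Attribute element per requested output column.
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


xml_text = build_query(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "1"},
    ["ensembl_gene_id", "ensembl_transcript_id", "biotype"],
)
print(xml_text)
```

Building the document from a dictionary of filters and a list of attributes keeps the dataset, filter and attribute roles explicit in client code.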

Page 13

What percentage of patients with primary breast cancer relapsed within 5 years of surgery?

What is the average survival of patients with Chronic Myeloid Leukaemia (CML), both with and without splenomegaly at diagnosis?

Find the age and gender of patients who have been diagnosed with Hodgkin's disease, where the initial diagnosis occurred between the ages of 50 and 70 inclusive.

What is the percentage of patients diagnosed with primary breast cancer in the age range 30 to 70 who were surgically treated and had post-operative haematoma/seroma?

Examples

Page 14

?

Page 15

BioMart schema – “reversed star”

Filter

Attribute

Dataset
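The schema diagram itself is not part of this transcript. As a rough illustration, a reversed star schema is usually described as one central "main" table with one row per object, surrounded by "dimension" tables that point back to it by key. The SQLite sketch below makes that assumption. All table names, column names and rows are hypothetical, and the layout should not be read as BioMart's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central "main" table: one row per gene (hypothetical layout).
    CREATE TABLE gene__main (
        gene_id_key INTEGER PRIMARY KEY,
        name        TEXT,
        chromosome  TEXT
    );
    -- Satellite "dimension" table: many rows per gene, keyed back to the main table.
    CREATE TABLE gene__expression__dm (
        gene_id_key INTEGER REFERENCES gene__main(gene_id_key),
        tissue      TEXT
    );
    INSERT INTO gene__main VALUES (1, 'GENE_A', '1'), (2, 'GENE_B', '2'), (3, 'GENE_C', '1');
    INSERT INTO gene__expression__dm VALUES (1, 'lung'), (2, 'lung'), (3, 'brain');
""")

# Filters become WHERE conditions on main or dimension columns;
# attributes become the selected columns.
rows = conn.execute(
    """
    SELECT m.name, m.chromosome
    FROM gene__main AS m
    JOIN gene__expression__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = ? AND d.tissue = ?
    ORDER BY m.name
    """,
    ("1", "lung"),
).fetchall()
print(rows)
```

In this layout a filter or attribute maps to a single column, so a dataset/filter/attribute query translates to a join between the main table and only the dimension tables it touches.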

Page 16

BioMart Architecture

Page 17

Mart Configurator

Add new data sources

Federate data sources

Edit metadata

Convert relational databases into mart schemas (‘virtual marts’)

Website with a click of a button

Page 18

BioMart “out of the box” website

Multiple Graphical User Interfaces

Multiple Application Programming Interfaces

Page 19

What is BioMart?

BioMart

Page 20

BioMart

URL SOAP

REST JAVA

What is BioMart?

SPARQL

Page 21

BioMart

Bioclipse Taverna

Galaxy

Cytoscape

BioConductor

WebLab

What is BioMart?

Page 22

BioMart Central Portal: An Open Database Network for the Biological Community. Guberman et al., Database, Vol. 2011, doi:10.1093/database/bar041

BioMart Central Portal

Page 23

BioMart Community

Page 24

BioMart Community

Page 25

BioMart Central Portal

Page 26

BioMart Central Portal

Page 27

The G-protein coupled receptor domain (GPCR) has the InterPro ID of IPR000276. Find the human protein-coding genes in Ensembl that code for this domain, and investigate whether any of them are detectable with the Affy HuGene 1_0 st v1 array

esv263 is the DGVa accession number of a structural variation from Redon et al. (20). What genomic region does this copy number variation span?

Find the genes from Escherichia coli strain K12 that are found within the region ‘360473–365601’ and discover whether there are any orthologs in the related strains E. coli O157:H7 EC4115 and E. coli DH10B

The three-gene APL1 locus encodes essential components of the mosquito immune defense against malaria parasites. Find the variations within the APL1A, APL1B and APL1C genes as well as the strain name, strain genotype, allele and biotype

Ensembl

Page 28

Find all IKMC resources for genes encoding transcription factors on chromosome 1 between 180-190 Mbp

Find all IKMC resources for genes expressed in heart

Find all IKMC mice available from the EMMA Repository with information on the vector used to make the mutation

Show me all the distributed EMMA lines that have passed Southern blot quality control at a distribution center

Is there any existing phenotype data for other mouse knockouts of the same gene for mouse lines produced from EUCOMM ES resources?

International Knockout Mouse Consortium (IKMC)

Page 29

Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information

Find variation information for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information

Check the transcriptomic alteration status of the genes gained in lung cancer

Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information

COSMIC

Page 30

Find genes commonly deregulated in pancreatic cancer precursor lesions, pancreatic intraepithelial neoplasia (PanIN) samples and display gene information, comparison and direction of regulation

Find genes differentially expressed in the serum of pancreatic cancer patients when compared to the serum of patients with benign pancreatic diseases (chronic pancreatitis and pancreatic pseudocyst). Find associated pathways via query integration with Reactome. Display gene and protein information, experimental details and pathway information

Find DNA copy number high-level amplifications in PDAC samples that also contain genes differentially expressed in PDAC versus chronic pancreatitis (CP) and display copy number information, gene information and differential expression experimental details

Find miRNAs differentially expressed in PDAC versus CP whose expression has been confirmed by RT–PCR techniques and display miRNA attributes and study information

Pancreatic Expression Database

Page 31

Scales better

No central funding

No admin overhead

Very green footprint

Maintained by experts

“Virtual Bioinformatics Institute”

Page 32

ICGC Data Portal

International Cancer Genome Consortium Data Portal: A One-Stop Shop for Cancer Genomic Data. Zhang et al., Database, Vol. 2011, doi:10.1093/database/bar038

Page 33

International Cancer Genome Consortium

Goals

Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe

Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors

Make the data available to the entire research community as rapidly as possible and with minimal restrictions to accelerate research into the causes and control of cancer

50 different tumor types and/or subtypes

500 samples per tumor

25,000 Human Genome Projects!

Page 34

ICGC members

Page 35

Data models

Genes

Samples

Simple mutations

Copy number mutations

Structural rearrangements

Gene expression

DNA methylation

miRNA

Exon junction

Page 36

Architecture

Page 37

“Parallel” Query Engine
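The transcript preserves only the title of this slide, so it does not show how the engine works. One plausible reading, offered purely as an assumption, is that the same request is sent to several federated sources at once and the partial results are merged by a shared key. The sketch below shows that pattern with Python's standard thread pool. The source functions and their data are hypothetical stand-ins for remote databases.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for remote data sources; each returns fields keyed by gene id.
def query_source_a(gene_ids):
    data = {"GENE_A": {"chromosome": "1"}, "GENE_B": {"chromosome": "2"}}
    return {g: data[g] for g in gene_ids if g in data}


def query_source_b(gene_ids):
    data = {"GENE_A": {"pathway": "toy pathway X"}}
    return {g: data[g] for g in gene_ids if g in data}


def federated_query(gene_ids, sources):
    """Send the same request to every source concurrently and merge rows by gene id."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # pool.map preserves the order of `sources` in its results.
        partials = list(pool.map(lambda src: src(gene_ids), sources))
    merged = {}
    for partial in partials:
        for gene, fields in partial.items():
            merged.setdefault(gene, {}).update(fields)
    return merged


merged = federated_query(["GENE_A", "GENE_B"], [query_source_a, query_source_b])
print(merged)
```

With real network sources the total wait is roughly that of the slowest source rather than the sum of all of them, which is the usual motivation for fanning requests out in parallel.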

Page 38

Gene Report

Ensembl

KEGG

ICGC

Pancreatic Expression Database

Page 39

Quick Search

Ensembl

KEGG

ICGC

Ensembl

Page 40

Page 41

Retrieve clinical staging data for colorectal cancer patients with non-synonymous simple mutations in genes that are involved in the WNT signaling pathway

Search for genes affected by copy number loss and also detected as deletions in structural rearrangement analysis

In the pancreatic cancer data set, retrieve all RNA-seq expression data for genes that are affected by copy number gains

ICGC Query Examples

Page 42

“Digital Medicine” University Health Network (UHN)

Page 43

“Digital Medicine” Pilot

Page 44

“Digital Medicine” Architecture

KEGG Reactome COSMIC Ensembl

BioMart Central Portal

HICT

COSMIC Ensembl

HICT

KEGG

Reactome

TCGA - Ovarian

ICGC

ICGC - Pancreas

TCGA

Cancer Portal (ICGC) Clinical Trials

Page 45

Collaboration between UHN, OICR and Pfizer on colorectal cancer

Sequencing

Stem cells

Clinical data

Pop-cure project

Page 46

PopCure data management architecture

KEGG Reactome COSMIC Ensembl

BioMart Central Portal

COSMIC Ensembl

HICT

KEGG

Reactome

TCGA - Colorectal

ICGC

ICGC - Colorectal

TCGA

Cancer Portal (ICGC) Pfizer internal data UHN internal data

Page 47

Collaboration between BioMart and Pfizer on their internal data management infrastructure

BioMart technology provided a single access point to:

BioMart community portal (Ensembl, Reactome etc)

ICGC Portal

Internal Pfizer resources

The “La Jolla” Project

Page 48

Institut National de la Recherche Agronomique (INRA)

Page 49

San Raffaele Scientific Institute (SRSI)

SRSI is one of the principal research institutes in Italy by volume and profile of scientific output

Italy’s leading center for translational medicine

Page 50

Center for Translational Genomics and Bioinformatics

NGS and Disease

Support, at national and international level, the use of genomic technologies and bioinformatics methodologies to improve our understanding of human diseases, enabling better prevention, diagnosis and treatment.

Interdisciplinarity

Offer the Institute and the Hospital, as well as collaborators and clients, an integrated and interdisciplinary platform that spans molecular biology to medicine, statistics to computer science, and mathematics to ethics

Dissemination

Contribute to the genomic and bioinformatic sciences with publications of international standing

Service

Offer professional, qualified support in delivering genomics and bioinformatics services, ensuring timely and accurate results while retaining the natural scientific curiosity and collaborative attitude that characterize research

Page 51

“Consumer driven market”

Plethora of web tools

The usual problems

Incompatibility of the websites

Incompatibility of technologies

Etc

The reality of translational bioinformatics

Page 52

BioMart v 0.9

Data analysis and visualization framework

Enrichment Tool

All Ensembl species

Plethora of identifiers

Homology

BED files (CNVs, DMRs)

Full programmatic access
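The slide names an enrichment tool but does not say which statistic it uses. A common choice for gene-set enrichment is a one-sided hypergeometric test, sketched below as an assumption rather than a description of BioMart's implementation. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the annotation of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass of every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes in total, 5 annotated, 4 in the user's list, 2 of them annotated.
p = hypergeom_enrichment_p(population=20, annotated=5, sample=4, hits=2)
print(round(p, 4))
```

A small p-value suggests the annotation is over-represented in the submitted gene list. Real tools typically also correct for testing many annotations at once, which this sketch omits.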

Page 53

All data and analytics under one roof

All publicly available disease data

Cancer

Mendelian

Complex

Analysis

Enrichment

Prioritizer

“Disease report” for your experimental data

BioMart Disease Portal

Page 54

BioMart

Services: Single access point to biomedical data

Software: Support for collaborative efforts

Data federation

Scalability

Data agnostic

Summary

Page 55

Host institutions

European Bioinformatics Institute (EBI)

Ontario Institute for Cancer Research (OICR)

San Raffaele Scientific Institute (SRSI)

BioMart community

28 organizations

50 database projects

BioMart developers

Acknowledgments

Page 56

Center for Translational Genomics and Bioinformatics, SRSI, Milan, Italy

Director: Professor Giorgio Casari

Co-Director: Dr Elia Stupka

Page 57

www.biomart.org