Laura Furlong. Big Data in Biomedicine debate. Barcelona, Nov 11 2014
Transcript of Laura Furlong. Big Data in Biomedicine debate. Barcelona, Nov 11 2014
Five years of DisGeNET: Lessons learned and challenges ahead
Laura I. Furlong IMIM-‐UPF Big Data in Biomedicine Barcelona November 11, 2014
• Knowledge plaDorm on human diseases and their genes
• Aims to cover all disease therapeuHc areas
• Developed by integraHon of data from expert-‐curated
resources and from the literature by text mining
• Centered on gene-‐disease associaHon (GDA) and its supporHng evidence and provenance
11/11/14 Laura I. Furlong 2
Re# Syndrome UMLS:C0035372
MCEP2 NCBI:4204
11/11/14 Laura I. Furlong 3
Re# Syndrome UMLS:C0035372
MCEP2 NCBI:4204
ü
ü
ü
x
ü 11/11/14 Laura I. Furlong 4
CTD
UniProt LHGDN MGD
CTD
Curated Predicted Literature
RGD
BEFREE
GAD
11/11/14 Laura I. Furlong 5
CTD
UniProt LHGDN MGD
CTD
Curated Predicted Literature
RGD
BEFREE
GAD
11/11/14 Laura I. Furlong 6
381,056 GDAs 16,666 genes
13,172 diseases
Key points
1. FragmentaPon of informaPon as a barrier to knowledge on the mechanisms of human diseases
11/11/14 Laura I. Furlong 7
11/11/14 Laura I. Furlong 8
hVp://lod-‐cloud.net/
StandardizaPon • Controlled vocabularies and ontologies • DisGeNET associaHon type ontology
Open access • RDF and nanopublicaHons • Data distributed under the Open database commons license
hVp://opendatacommons.org/
hVp://semanHcscience.org
Key points
2. High rate of data generaPon on GDAs poses challenges to biocuraPon pipelines
11/11/14 Laura I. Furlong 10
11/11/14 Laura I. Furlong 11
PublicaHons on diseases and genes from 1980 (737,712 publicaHons)
530,347 associaPons
14,777 genes 12,650 diseases 355,976 publicaHons
Aprox. 30,000 GDAs
Supervised machine-‐learning approach for relaHon extracHon between genes and diseases Text Mining
11/11/14 Laura I. Furlong 12
• Nearly 70 % of GDAs supported by only one publicaHon!
• 900 GDAs supported by > 200 publicaHons • Average 2.8 publicaHons/GDA
11/11/14 Laura I. Furlong 13
GDAs supported by only one publicaHon
Li#le overlap between text mined data and data present in curated repositories
• Other publicaHons, full text, etc
• FN from TM
• Non-‐relevant data for DBs • Relevant data but sHll not
curated • FP from TM
11/11/14 Laura I. Furlong 14
11/11/14 Laura I. Furlong 15
• This is not raw data (e.g. from NGS analysis), is data already collated and filtered that have passed peer review before publicaHon
• Text mined from abstracts:
• Not mining full text, nor supplementary material
• Not mining tables, figures • Focus on relaHons stated on sentences, not handling anaphoras
• We are just looking into a small frac)on of all the available data!!!
Key points
2. High rate of data generaPon on GDAs poses challenges to biocuraPon pipelines
• Need to find alternaHve strategies to expert curaHon • “wisdom of the crowds” approaches (e.g.
crowdsourcing) • ?
11/11/14 Laura I. Furlong 16
Key points
3. Data prioriPzaPon to support interpretaPon of data on the genePc determinants of human diseases
11/11/14 Laura I. Furlong 17
Ranks gene-‐disease associaHons based on the supporHng evidence
DisGeNET score = SCURATED + SPREDICTED + SLITERATURE
GDAs prioriPzaPon
SCURATED = WUNIPROT + WCTD
SPREDICTED = WRat + WMouse
SLITERATURE = WGAD + WLHGDN + WBeFree
11/11/14 Laura I. Furlong 18
Disease Gene Score UniProt CTD Rat Mouse Number of publicaPons
BeFree GAD LHGDN
Wilson’s Disease
ATP7B 0.99 0.3 0.3 0.1 0.1 174 31 23
ReV Syndrome
MECP2 0.9 0.3 0.3 0 0.1 438 27 43
CysHc Fibrosis
CFTR 0.9 0.3 0.3 0 0.1 1429 150 78
Obesity MC4R 0.94 0.3 0.3 0.1 0.1 220 46 0
Alzheimer Disease
APP 0.88 0.3 0.3 0 0.1 1096 18 81
11/11/14 Laura I. Furlong 19
Key points
3. Data prioriPzaPon to support interpretaPon of data on the genePc determinants of human diseases
• SHll we have a large dataset with low score (350,000 GDAs)
• Other approaches based on: • type of associaHon of the gene to the disease • experimental evidence for the GDAs • network-‐based gene-‐prioriHzaHon algorithms
• Take into account contradictory findings • Different prioriHzaHon approaches for different
purposes? 11/11/14 Laura I. Furlong 20
Key points
4. Large number of genes associated to some diseases might reflect phenotypic diversity
11/11/14 Laura I. Furlong 21
11/11/14 Laura I. Furlong 22
Need for deep phenotyping of diseases to idenHfy disease subtypes associated to different gene networks Human Phenotype Ontology (HPO)
hVp://www.human-‐phenotype-‐ontology.org/
• Ontology of phenotypic abnormaliHes of human diseases • Based on OMIM, Orphanet and DECIPHER
11/11/14 Laura I. Furlong 23
DisGeNET-‐HPO annotaPons 33 % of DisGeNET diseases annotated with HPO terms Mainly come from OMIM diseases (28 % of DisGeNET diseases come from OMIM) Need for other approaches to annotate the full spectrum of diseases at phenotypic level
11/11/14 Laura I. Furlong 24
Key points
1. FragmentaHon of informaHon as a barrier to knowledge on the mechanisms of human diseases
2. High rate of data generaHon on GDAs poses challenges to biocuraHon pipelines
3. Data prioriHzaHon to support interpretaHon of data on the geneHc determinants of human diseases
4. Large number of genes associated to some diseases might
reflect phenotypic diversity
11/11/14 Laura I. Furlong 26
IBI Group Alba GuHérrez Àlex Bravo Janet Piñero Núria Queralt-‐Rosiñach Miguel A. Mayer Pablo Carbonell Laura I. Furlong Ferran Sanz
IBI Past members Montserrat Cases Michael Rautschka Anna Bauer-‐Mehren Solène Grosdidier