Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications

Jim McCusker (@jpmccu), Timothy Lebo (@timrdf), Michael Krauthammer, and Deborah McGuinness (@dlmcguinness)

description

To encourage data sharing in the life sciences, supporting tools need to minimize effort and maximize incentives. We have created infrastructure that makes it easy to create portals that support dataset sharing and simplified publishing of the datasets as high-quality linked data. We report here on our infrastructure and its use in the creation of a melanoma dataset portal. This portal is based on the Comprehensive Knowledge Archive Network (CKAN) and Prizms, an infrastructure to acquire, integrate, and publish data using Linked Data principles. In addition, we introduce an extension to CKAN that makes it easy for others to cite datasets from within both publications and subsequently derived datasets using the emerging nanopublication and World Wide Web Consortium provenance standards.

Transcript of Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications

Page 1: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications

Jim McCusker (@jpmccu), Timothy Lebo (@timrdf), Michael Krauthammer, and Deborah McGuinness (@dlmcguinness)

Page 2

What we’re trying to fix

From: Data Sharing and Management SNAFU in 3 Acts

Page 3

What we’re trying to fix

What is the content of the field called “SAM1”?

Ah yes, SAM1 is the level of CXCR4 expression.

From: Data Sharing and Management SNAFU in 3 Acts

Page 4

What we’re trying to fix

That is logical if you think about it.

And what is the content of the field called “SAM2”?

From: Data Sharing and Management SNAFU in 3 Acts

Page 5

What we’re trying to fix

… What is the content of the field called “SAM2”?

I don’t remember.

From: Data Sharing and Management SNAFU in 3 Acts

Page 6

Life science data seems to start its life very scruffy.

Page 7

5 Levels of Data Sharing, from scruffy to neat

Level 1: Basic data sharing (who, what, when, where, why)

Level 2: Automated Conversion (computable RDF representations)

Level 3: Semantic enhancement (human-enhanced RDF representations)

Level 4: Semantic eScience (use of vocabularies with formal semantics)

Level 5: Community-Based Standards (consensus use of preferred ontologies)

Page 8

The Prizms Architecture

Page 9

Prizms User Interactions

Page 10

Provenance of Prizms

[Figure: provenance graph connecting Linking Open Govt. Data, Prizms, healthdata.tw.rpi.edu, and lod.melagrid.org through three prov:wasDerivedFrom links]

More Prizms Nodes: https://github.com/timrdf/prizms/wiki/Prizms-Nodes

Page 11

5 Levels in Prizms

Level 1: Basic data sharing (CKAN dataset metadata + datapubs)

Level 2: Automated Conversion (Prizms raw conversions)

Level 3: Semantic Conversion (Prizms enhanced conversions)

Level 4: Semantic eScience (Level 3 + NCBO ontology recommender + similar tools)

Level 5: Community-Based Standards (Level 4 + vocabulary reuse analysis)

Page 12

Level 1: Basic Data Sharing

CKAN (Comprehensive Knowledge Archive Network) and Datapubs

Page 13

What is CKAN?

•  A data portal for all kinds of data
•  Link or upload
•  Linked Data-friendly
•  Link to:
   o  Files
   o  APIs
   o  SPARQL endpoints
   o  Metadata
   o  Publications
   o  Visualizations…

Page 14

data.melagrid.org: A portal for melanoma data
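The kinds of things a CKAN dataset can link to are also visible to programs through CKAN's Action API. The following is a minimal sketch, assuming a CKAN instance that serves the version 3 Action API with its `package_show` action. The helper names `package_show_url`, `summarize`, and `fetch_dataset`, the choice of summary fields, and the use of `http://data.melagrid.org` as a site URL are assumptions made for illustration; whether that portal is still reachable is not something the slides tell us.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_show_url(site: str, dataset_id: str) -> str:
    """Build the Action API URL that returns one dataset's metadata."""
    query = urlencode({"id": dataset_id})
    return f"{site.rstrip('/')}/api/3/action/package_show?{query}"


def summarize(payload: dict) -> dict:
    """Pull a few fields out of a package_show response body."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    result = payload["result"]
    return {
        "name": result["name"],
        "title": result.get("title"),
        "license": result.get("license_id"),
        "keywords": [tag["name"] for tag in result.get("tags", [])],
        # Each resource is a linked or uploaded file, API, endpoint, etc.
        "resources": [
            (res.get("format"), res.get("url"))
            for res in result.get("resources", [])
        ],
    }


def fetch_dataset(site: str, dataset_id: str) -> dict:
    """Network call; only works against a live CKAN site."""
    with urlopen(package_show_url(site, dataset_id)) as response:
        return summarize(json.load(response))
```

Against a live portal, `fetch_dataset("http://data.melagrid.org", "exome-variants-in-melanoma")` would be the call to try; the dataset identifier is the one shown on the assertion slide.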

Page 15

What is a Datapub?

Viewing Relations, Attributes, and Entities in RDF (VRAER): dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma.ttl

hasAttribution

hasSupporting

hasAssertion

hasProvenance

exome-variants-in-melanoma a Nanopublication

provenance a Provenance

attribution a Attribution

supporting a Supporting

assertion a Assertion

Groth et al., 2010
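The graph on this slide can be read as a small data structure: one nanopublication resource that points at named parts. Below is a minimal sketch of that skeleton in Python. Several things are assumptions rather than facts from the slide: the `np:` prefix, the `<identifier>-<part>` naming of the parts (suggested only by the `-assertion` and `-attribution` file names on the following slides), and the choice to hang `hasAttribution` and `hasSupporting` off the provenance part. The slide shows the four link labels but the transcript does not preserve which node each link starts from.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Datapub:
    """Skeleton of one datapub; the triples inside each part are omitted."""

    identifier: str

    def part(self, name: str) -> str:
        # Hypothetical naming scheme: <identifier>-<part>
        return f":{self.identifier}-{name}"

    def structure(self) -> List[Triple]:
        pub = f":{self.identifier}"
        provenance = self.part("provenance")
        return [
            (pub, "a", "np:Nanopublication"),
            (pub, "np:hasAssertion", self.part("assertion")),
            (pub, "np:hasProvenance", provenance),
            # Assumption: attribution and supporting belong to provenance.
            (provenance, "np:hasAttribution", self.part("attribution")),
            (provenance, "np:hasSupporting", self.part("supporting")),
        ]


def to_turtle(triples: List[Triple]) -> str:
    """Write one Turtle-style statement per line (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_turtle(Datapub("exome-variants-in-melanoma").structure()))
```

Keeping the skeleton separate from the content of each part mirrors the slides that follow, which look at the assertion and the attribution one at a time.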

Page 16

Anatomy of a Datapub: Assertion

Viewing Relations, Attributes, and Entities in RDF (VRAER): http://dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma-assertion.ttl

IMT

homepage

distribution

exome_aa_variants_final.xls a Distribution
accessURL: exome_aa_variants_final.xls

xls
value: xls

Variant data from "Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma" a Dataset

description: Variant data from M. Krauthammer, Y. Kong, B. Ha, P. Evans, A. Bacchiocchi, J.P. McCusker, E. Cheng, M.J. Davis, G. Goh, M. Choi, S. Ariyan, D. Narayan, K. Dutton-Regester, A. Capatana, E.C. Holman, M. Bosenberg, M. Sznol, H.M. Kluger, D.E. Brash, D.F. Stern, M.A. Materin, R.S. Lo, S. Mane, S. Ma, K.K. Kidd, N.K. Hayward, R.P. Lifton, J. Schlessinger, T.J. Boggon, and R. Halaban, Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Nature Genetics, 2012. in press.

**Tab 1: Description** This worksheet contains a description of the variant calling method.
**Tab 2: SNVs** This worksheet contains automatically called somatic non-silent SNVs in matched melanoma samples. Annotations from MU2A.
**Tab 3: InDels** This worksheet contains automatically called somatic InDels in matched melanoma samples. Annotations from VEP.
**Tab 4: Splice Site Variants** This worksheet contains automatically called somatic splice site variants in matched melanoma samples. Annotations from VEP.
**Tab 5: Additional mutations** This worksheet contains additional somatic mutations. These mutations are either inferred in unmatched samples (see Methods overview above), or have been Sanger-validated via PCR amplified products, after manual inspections of sequencing reads. Annotations from MU2A/VEP.

Nomenclature --------
**SNV:** Single Nucleotide Variant
**DNV:** Dinucleotide Variant
**DNV*: ** Two SNVs affecting the same codon, at positions 1 and 3 of the codon
**TNV:** Trinucleotide Variant
**Parentheses in genotype calls:** Nucleotides that appear in parentheses are true variant calls in tumor which have not been called somatic by the automatic pipeline. These variants are shown if another position in the same codon has a somatic call. The corresponding SNP position, if known, is also shown.
**InDel:** Insertions and Deletions
**HGVS:** Human Genome Variation Society variant format
**COSMIC:** Catalogue of Somatic Mutations - http://www.sanger.ac.uk/perl/genetics/CGP/cosmic/
**SNP:** This column provides SNP-IDs if available for any the mutated positions in tumors
**PhyoP:** Computation of p-values for conservation or acceleration (http://compgen.bscb.cornell.edu/phast/faq.php). Data from UCSC genome browser.

References ------
**MU2A:** Garla V, Kong Y, Szpakowski S, Krauthammer M. MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics. 2011 Feb 1;27(3):416-8. Epub 2010 Dec 12. PubMed PMID: 21149339; PubMed Central PMCID: PMC3031033.
**VEP:** McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010 Aug 15;26(16):2069-70. Epub 2010 Jun 18. PubMed PMID: 20562413; PubMed Central PMCID: PMC2916720.

keyword: exome-sequencing, homo-sapiens
identifier: exome-variants-in-melanoma

Page 17

Anatomy of a Datapub: Attribution, Evidence

Viewing Relations, Attributes, and Entities in RDF (VRAER): http://dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma-attribution.ttl

contributor

creator

exome-variants-in-melanoma

rights: cc-by

James McCusker

mbox: mailto:[email protected]

Michael Krauthammer

mbox: mailto:[email protected]

Attribution

Evidence

Page 18

Citing a Dataset using Datapubs

Page 19

Citing a Dataset using Datapubs

Page 20

Levels 2-3: Automated Conversion, Semantic Conversion

Prizms raw conversions, enhanced conversions

Page 21

Prizms RDF Converter

smart, naïve bootstrap

"Hawaii","Alii Garden Market Place", "75-6129 Alii Drive", "Kailua-Kona", "96740", "-155.9819183", "19.61436844"

ds4383:thing_1367 raw:column_1 "Hawaii"; raw:column_2 "Alii Garden Market Place"; raw:column_3 "75-6129 Alii Drive"; raw:column_4 "Kailua-Kona"; raw:column_5 "96740"; raw:column_6 "-155.9819183"; raw:column_7 "19.61436844" .

ds4383:thing_1367 con:preferredURI ds4383:farmersMarket_1367 .

ds4383:farmersMarket_1367 a ds4383_vocab:FarmersMarket; con:address :address_1367; dcterms:title "Alii Garden Market Place"; wgs:lat 19.6; wgs:long -155.9 .

:address_1367 a con:Address; con:stateOrProvince typed_state:Hawaii; con:street "75-6129 Alii Drive"; con:city "Kailua-Kona"; con:zip "96740" .

typed_state:Hawaii a ds4383_vocab:State; dcterms:identifier "Hawaii"; rdfs:label "Hawaii"; owl:sameAs <http://sws.geonames.org/5855797/>, govtrackusgov:HI, dbpedia:Hawaii .

[Figure labels: enhancement; Time; Domain Expertise; SemWeb Expertise]

Lebo et al., 2012
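The raw ("naïve") step on this slide is mechanical enough to sketch: every row becomes one subject, and every cell becomes a `raw:column_N` literal. The sketch below reproduces the slide's raw output from the slide's CSV line. It is an illustration of the idea only, not the Prizms converter itself; the function names, the `first_row_number` parameter, and the literal escaping are assumptions.

```python
import csv
import io
from typing import List


def turtle_literal(value: str) -> str:
    """Quote a cell as a plain Turtle string literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def raw_conversion(csv_text: str, dataset_prefix: str,
                   first_row_number: int = 1) -> List[str]:
    """Turn each CSV row into one statement with raw:column_N predicates."""
    # skipinitialspace tolerates the ', ' separators used on the slide.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    statements = []
    row_number = first_row_number
    for row in reader:
        if not row:
            continue  # ignore blank lines
        subject = f"{dataset_prefix}:thing_{row_number}"
        pairs = "; ".join(
            f"raw:column_{index} {turtle_literal(cell)}"
            for index, cell in enumerate(row, start=1)
        )
        statements.append(f"{subject} {pairs} .")
        row_number += 1
    return statements


if __name__ == "__main__":
    line = ('"Hawaii","Alii Garden Market Place", "75-6129 Alii Drive", '
            '"Kailua-Kona", "96740", "-155.9819183", "19.61436844"')
    print(raw_conversion(line, "ds4383", first_row_number=1367)[0])
```

Nothing here needs domain knowledge, which is the point of the raw level: the enhanced statements on the slide (typed resources, `owl:sameAs` links, numeric coordinates) are where human-supplied enhancement parameters come in.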

Page 22

Prizms Benefits

Prizms has worked with:
• BFO/IAO/OBI
• SIO
• RDF Data Cube Vocabulary
• PROV
• VOID
• FOAF
• etc.

For free, you get:
• Provenance at dataset and triple levels
• Automatic source/dataset/version URI generation
• Automated conversion as data changes
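To make "source/dataset/version URI generation" concrete, here is a minimal sketch that assembles a version URI from three identifiers. It assumes a nested `source/.../dataset/.../version/...` path layout; that layout, the function name, and the example base URI and identifiers are assumptions for illustration, not values taken from a Prizms installation.

```python
from urllib.parse import quote


def version_uri(base: str, source: str, dataset: str, version: str) -> str:
    """Compose a URI that names one version of one dataset from one source."""
    segments = ["source", source, "dataset", dataset, "version", version]
    # Percent-encode each identifier so it stays a single path segment.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base.rstrip('/')}/{path}"


if __name__ == "__main__":
    # Hypothetical identifiers.
    print(version_uri("http://example.org", "melagrid-org",
                      "exome-variants-in-melanoma", "2013-Jul-11"))
```

Deriving the URI from the three identifiers alone means that a new version of the same dataset gets a predictable sibling URI, which fits the slide's point about automated conversion as data changes.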

Page 23

Future Work: Supporting Levels 4-5

Level 1: Basic data sharing (CKAN dataset metadata + datapubs)

Level 2: Automated Conversion (Prizms raw conversions)

Level 3: Semantic Conversion (Prizms enhanced conversions)

Level 4: Semantic eScience (Level 3 + NCBO ontology recommender + similar tools)

Level 5: Community-Based Standards (Level 4 + vocabulary reuse analysis)

✔✔

Page 24

Publishing Custom Linked Data Using LODSPeaKr

•  Custom templates for RDF and HTML

•  Templates driven by rdf:type

•  Web-based template editor

•  Embed easy-to-generate visualizations
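"Templates driven by rdf:type" can be pictured as a lookup from a resource's types to a page template. The sketch below shows that dispatch idea in plain Python. It is not LODSPeaKr's API or template language; the template table, the resource dictionary shape, and the function name are all hypothetical.

```python
from html import escape
from string import Template
from typing import Dict, List

# Hypothetical templates keyed by rdf:type.
TEMPLATES: Dict[str, Template] = {
    "dcat:Dataset": Template("<h1>$label</h1><p>Dataset page</p>"),
    "foaf:Person": Template("<h1>$label</h1><p>Person page</p>"),
}
DEFAULT_TEMPLATE = Template("<h1>$label</h1>")


def render(label: str, rdf_types: List[str]) -> str:
    """Render a resource with the first template that matches one of its types."""
    safe_label = escape(label)
    for rdf_type in rdf_types:
        template = TEMPLATES.get(rdf_type)
        if template is not None:
            return template.substitute(label=safe_label)
    # No type-specific template: fall back to a generic page.
    return DEFAULT_TEMPLATE.substitute(label=safe_label)


if __name__ == "__main__":
    print(render("exome-variants-in-melanoma", ["dcat:Dataset"]))
```

Because the choice depends only on the type, adding a page for a new kind of resource means adding one template rather than changing the publishing code.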

Page 25

Conclusions

•  Prizms is an infrastructure for sharing data on many levels of sophistication
•  Good support for Level 1-3 Data Sharing
•  Initial support for Level 4-5 Data Sharing
•  Didn't just make life science data better, it made future Linked Data better!
•  More to be done, but lots of progress

Page 26

Thanks!

• Rensselaer Polytechnic (Tetherless World):
   o  Alvaro Graves
   o  John Erickson
   o  The LOGD Team
• The Open Knowledge Foundation Network (OKFN)
• Yale University:
   o  Ruth Halaban
   o  Tobias Kuhn
• Grant support from:
   o  Yale SPORE in Skin Cancer
   o  Semantic Sea Ice Interoperability Initiative