Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications
Jim McCusker (@jpmccu), Timothy Lebo (@timrdf), Michael Krauthammer, and Deborah McGuinness (@dlmcguinness)
What we’re trying to fix
What is the content of the field called “SAM1”?
Ah yes, SAM1 is the level of CXCR4 expression.
That is logical if you think about it.
And what is the content of the field called “SAM2”?
I don’t remember.
From: Data Sharing and Management SNAFU in 3 Acts
Life Science data seems to start its life very scruffy.
5 Levels of Data Sharing, from scruffy to neat
• Level 1: Basic data sharing (who, what, when, where, why)
• Level 2: Automated Conversion (computable RDF representations)
• Level 3: Semantic enhancement (human-enhanced RDF representations)
• Level 4: Semantic eScience (use of vocabularies with formal semantics)
• Level 5: Community-Based Standards (consensus use of preferred ontologies)
The Prizms Architecture
Prizms User Interactions
Provenance of Prizms
[Figure: provenance graph with prov:wasDerivedFrom edges relating Linking Open Govt. Data, Prizms, and the Prizms nodes healthdata.tw.rpi.edu and lod.melagrid.org]
More Prizms Nodes: https://github.com/timrdf/prizms/wiki/Prizms-Nodes
5 Levels in Prizms
• Level 1: Basic data sharing (CKAN dataset metadata + datapubs)
• Level 2: Automated Conversion (Prizms raw conversions)
• Level 3: Semantic Conversion (Prizms enhanced conversions)
• Level 4: Semantic eScience (Level 3 + NCBO ontology recommender + similar tools)
• Level 5: Community-Based Standards (Level 4 + vocabulary reuse analysis)
Level 1: Basic Data Sharing
CKAN (Comprehensive Knowledge Archive Network) and Datapubs
What is CKAN?
• A data portal for all kinds of data
• Link or upload
• Linked Data-friendly
• Link to:
  o Files
  o APIs
  o SPARQL endpoints
  o Metadata
  o Publications
  o Visualizations…
data.melagrid.org: a portal for melanoma data
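Since CKAN stores dataset metadata behind a web API, the Level 1 description of a dataset can also be read by a program. The following is a minimal sketch, assuming the portal exposes CKAN's version 3 Action API under /api/3/action/ (the slides do not confirm which CKAN version the portal runs). The dataset id is borrowed from the datapub identifier and may differ on the live portal, and SAMPLE is a hand-written, trimmed stand-in for a real response, so no network access is needed.

```python
import json
from urllib.parse import urlencode


def package_show_url(portal, dataset_id):
    """Build the Action API URL that returns one dataset's metadata."""
    query = urlencode({"id": dataset_id})
    return portal.rstrip("/") + "/api/3/action/package_show?" + query


def list_resources(response_text):
    """Extract (format, url) pairs from a package_show JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    resources = payload["result"].get("resources", [])
    return [(r.get("format"), r.get("url")) for r in resources]


# Hand-written stand-in for a response (assumption, not fetched from the portal).
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "name": "exome-variants-in-melanoma",
        "resources": [
            {"format": "xls", "url": "exome_aa_variants_final.xls"},
        ],
    },
})

url = package_show_url("http://data.melagrid.org", "exome-variants-in-melanoma")
resources = list_resources(SAMPLE)
print(url)
print(resources)
```

Keeping URL construction separate from response parsing lets the parsing step be checked against a canned response before any request is made.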
What is a Datapub?
[Figure: RDF graph of the datapub, drawn with Viewing Relations, Attributes, and Entities in RDF (VRAER) from dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma.ttl]
Nodes:
• exome-variants-in-melanoma a Nanopublication
• assertion a Assertion
• provenance a Provenance
• attribution a Attribution
• supporting a Supporting
Edges: hasAssertion, hasProvenance, hasAttribution, hasSupporting
Groth et al., 2010
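The layout of a datapub can be sketched as one named graph per part, plus a head graph that ties the parts together. This is only an illustration: the example.org namespaces are placeholders because the slide does not show the real vocabulary URIs, and attaching all four has* edges directly to the nanopublication node is an assumption about how the figure was wired.

```python
# Placeholder namespaces (assumptions): the real URIs are not shown on the slide.
NP = "http://example.org/np#"
BASE = "http://example.org/datapub/exome-variants-in-melanoma#"

# Part names as they appear in the datapub figure.
PARTS = ["Assertion", "Provenance", "Attribution", "Supporting"]


def datapub_skeleton(base=BASE):
    """Return TriG text: a head graph plus one empty named graph per part."""
    lines = ["@prefix np: <" + NP + "> .", "@prefix : <" + base + "> .", ""]
    lines.append(":head {")
    lines.append("    :nanopub a np:Nanopublication ;")
    links = ["        np:has" + p + " :" + p.lower() for p in PARTS]
    lines.append(" ;\n".join(links) + " .")
    for p in PARTS:
        lines.append("    :" + p.lower() + " a np:" + p + " .")
    lines.append("}")
    for p in PARTS:
        lines.append(":" + p.lower() + " {")
        lines.append("    # statements of the " + p.lower() + " part go here")
        lines.append("}")
    return "\n".join(lines)


trig = datapub_skeleton()
print(trig)
```

Separating the parts into named graphs means the assertion (the dataset description) can be cited or replaced without touching the statements about who made it.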
Anatomy of a Datapub: Assertion
[Figure: RDF graph of the assertion, drawn with Viewing Relations, Attributes, and Entities in RDF (VRAER) from http://dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma-assertion.ttl]
Edges: IMT, homepage, distribution
exome_aa_variants_final.xls a Distribution
accessURL: exome_aa_variants_final.xls
xls
value: xls
Variant data from "Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma" a Dataset
description: Variant data from M. Krauthammer, Y. Kong, B. Ha, P. Evans, A. Bacchiocchi, J.P. McCusker, E. Cheng, M.J. Davis, G. Goh, M. Choi, S. Ariyan, D. Narayan, K. Dutton-Regester, A. Capatana, E.C. Holman, M. Bosenberg, M. Sznol, H.M. Kluger, D.E. Brash, D.F. Stern, M.A. Materin, R.S. Lo, S. Mane, S. Ma, K.K. Kidd, N.K. Hayward, R.P. Lifton, J. Schlessinger, T.J. Boggon, and R. Halaban, Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Nature Genetics, 2012. in press.
**Tab 1: Description** This worksheet contains a description of the variant calling method.
**Tab 2: SNVs** This worksheet contains automatically called somatic non-silent SNVs in matched melanoma samples. Annotations from MU2A.
**Tab 3: InDels** This worksheet contains automatically called somatic InDels in matched melanoma samples. Annotations from VEP.
**Tab 4: Splice Site Variants** This worksheet contains automatically called somatic splice site variants in matched melanoma samples. Annotations from VEP.
**Tab 5: Additional mutations** This worksheet contains additional somatic mutations. These mutations are either inferred in unmatched samples (see Methods overview above), or have been Sanger-validated via PCR amplified products, after manual inspections of sequencing reads. Annotations from MU2A/VEP.
Nomenclature
--------
**SNV:** Single Nucleotide Variant
**DNV:** Dinucleotide Variant
**DNV*:** Two SNVs affecting the same codon, at positions 1 and 3 of the codon
**TNV:** Trinucleotide Variant
**Parentheses in genotype calls:** Nucleotides that appear in parentheses are true variant calls in tumor which have not been called somatic by the automatic pipeline. These variants are shown if another position in the same codon has a somatic call. The corresponding SNP position, if known, is also shown.
**InDel:** Insertions and Deletions
**HGVS:** Human Genome Variation Society variant format
**COSMIC:** Catalogue of Somatic Mutations - http://www.sanger.ac.uk/perl/genetics/CGP/cosmic/
**SNP:** This column provides SNP-IDs if available for any the mutated positions in tumors
**PhyoP:** Computation of p-values for conservation or acceleration (http://compgen.bscb.cornell.edu/phast/faq.php). Data from UCSC genome browser.
References
------
**MU2A:** Garla V, Kong Y, Szpakowski S, Krauthammer M. MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics. 2011 Feb 1;27(3):416-8. Epub 2010 Dec 12. PubMed PMID: 21149339; PubMed Central PMCID: PMC3031033.
**VEP:** McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010 Aug 15;26(16):2069-70. Epub 2010 Jun 18. PubMed PMID: 20562413; PubMed Central PMCID: PMC2916720.
keyword: exome-sequencing, homo-sapiens
identifier: exome-variants-in-melanoma
Anatomy of a Datapub: Attribution, Evidence
[Figure: RDF graph of the attribution, drawn with Viewing Relations, Attributes, and Entities in RDF (VRAER) from http://dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma-attribution.ttl]
exome-variants-in-melanoma
rights: cc-by
Edges: creator, contributor
• James McCusker (mbox: mailto:[email protected])
• Michael Krauthammer (mbox: mailto:[email protected])
Attribution
Evidence
Citing a Dataset using Datapubs
Levels 2-3: Automated Conversion, Semantic Conversion
Prizms raw conversions, enhanced conversions
Prizms RDF Converter
smart, naïve bootstrap
"Hawaii","Alii Garden Market Place", "75-6129 Alii Drive", "Kailua-Kona", "96740", "-155.9819183", "19.61436844"
# Raw conversion: one raw:column_N predicate per CSV column
ds4383:thing_1367
    raw:column_1 "Hawaii";
    raw:column_2 "Alii Garden Market Place";
    raw:column_3 "75-6129 Alii Drive";
    raw:column_4 "Kailua-Kona";
    raw:column_5 "96740";
    raw:column_6 "-155.9819183";
    raw:column_7 "19.61436844" .

# Enhanced conversion: typed resources, reused vocabularies, links to other datasets
ds4383:thing_1367 con:preferredURI ds4383:farmersMarket_1367 .

ds4383:farmersMarket_1367 a ds4383_vocab:FarmersMarket;
    con:address :address_1367;
    dcterms:title "Alii Garden Market Place";
    wgs:lat 19.6;
    wgs:long -155.9 .

:address_1367 a con:Address;
    con:stateOrProvince typed_state:Hawaii;
    con:street "75-6129 Alii Drive";
    con:city "Kailua-Kona";
    con:zip "96740" .

typed_state:Hawaii a ds4383_vocab:State;
    dcterms:identifier "Hawaii";
    rdfs:label "Hawaii";
    owl:sameAs <http://sws.geonames.org/5855797/>, govtrackusgov:HI, dbpedia:Hawaii .
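The raw conversion step can be illustrated with a few lines of Python: each CSV row becomes one subject, and each cell becomes a string literal on a raw:column_N predicate. This is a sketch of the idea only, not the Prizms converter itself; the function names, the row-numbering argument, and the literal escaping are assumptions made for the example.

```python
import csv
import io

# The farmers' market row from the slide.
ROW = ('"Hawaii","Alii Garden Market Place","75-6129 Alii Drive",'
       '"Kailua-Kona","96740","-155.9819183","19.61436844"')


def turtle_literal(value):
    """Quote a cell as a plain Turtle string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def raw_convert(csv_text, first_row_id=1, prefix="ds4383"):
    """Naive conversion: one subject per row, one predicate per column."""
    statements = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for offset, cells in enumerate(reader):
        subject = prefix + ":thing_" + str(first_row_id + offset)
        pairs = "; ".join(
            "raw:column_" + str(i) + " " + turtle_literal(cell)
            for i, cell in enumerate(cells, start=1)
        )
        statements.append(subject + " " + pairs + " .")
    return statements


statements = raw_convert(ROW, first_row_id=1367)
print(statements[0])
```

Because nothing about the data's meaning is needed at this stage, the step can run unattended; everything domain-specific is deferred to the enhanced conversion.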
[Figure labels: enhancement; Time; Domain Expertise; SemWeb Expertise]
Lebo et al., 2012
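An enhanced conversion adds human knowledge about what each column means. One way to picture it is a small declarative mapping from column numbers to predicates and datatypes, applied to the same cells the raw conversion saw. The ENHANCEMENTS table and the enhance function below are hypothetical and do not reproduce Prizms' actual enhancement parameters. Reading column 6 as longitude and column 7 as latitude is an assumption based on the value ranges (a latitude must lie between -90 and 90).

```python
# Hypothetical enhancement spec: column number -> (predicate, caster).
ENHANCEMENTS = {
    2: ("dcterms:title", str),
    6: ("wgs:long", float),  # assumed longitude: -155.98 is out of range for a latitude
    7: ("wgs:lat", float),   # assumed latitude
}

CELLS = ["Hawaii", "Alii Garden Market Place", "75-6129 Alii Drive",
         "Kailua-Kona", "96740", "-155.9819183", "19.61436844"]


def enhance(cells, row_id, spec=ENHANCEMENTS, prefix="ds4383",
            type_name="FarmersMarket"):
    """Promote selected raw columns to typed properties on a typed resource."""
    local = type_name[0].lower() + type_name[1:]
    subject = prefix + ":" + local + "_" + str(row_id)
    parts = ["a " + prefix + "_vocab:" + type_name]
    for col in sorted(spec):
        predicate, cast = spec[col]
        value = cast(cells[col - 1])
        if isinstance(value, str):
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        else:
            obj = repr(value)  # numeric literal written without quotes
        parts.append(predicate + " " + obj)
    return subject + " " + "; ".join(parts) + " ."


enhanced = enhance(CELLS, 1367)
print(enhanced)
```

Keeping the mapping as data rather than code means a domain expert can adjust it without touching the conversion logic, and the conversion can be rerun when the source data changes.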
Prizms Benefits
Prizms has worked with:
• BFO/IAO/OBI
• SIO
• RDF Data Cube Vocabulary
• PROV
• VOID
• FOAF
• etc.
For free, you get:
• Provenance at dataset and triple levels
• Automatic source/dataset/version URI generation
• Automated conversion as data changes
Future Work: Supporting Levels 4-5
✔ Level 1: Basic data sharing (CKAN dataset metadata + datapubs)
✔ Level 2: Automated Conversion (Prizms raw conversions)
✔ Level 3: Semantic Conversion (Prizms enhanced conversions)
✔ Level 4: Semantic eScience (Level 3 + NCBO ontology recommender + similar tools)
✔ Level 5: Community-Based Standards (Level 4 + vocabulary reuse analysis)
Publishing Custom Linked Data Using LODSPeaKr
• Custom templates for RDF and HTML
• Templates driven by rdf:type
• Web-based template editor
• Embed easy-to-generate visualizations
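The idea of templates driven by rdf:type can be sketched as a lookup from a resource's types to a template, with a generic fallback. This is an illustration in Python of the dispatch pattern only. It is not LODSPeaKr's API or template language, and the type names, template strings, and the render function are hypothetical.

```python
from string import Template

# Hypothetical templates keyed by rdf:type (type names borrowed from the slides).
TEMPLATES = {
    "Dataset": Template("<h1>$label</h1><p>Dataset</p>"),
    "Nanopublication": Template("<h1>$label</h1><p>Nanopublication</p>"),
}
DEFAULT = Template("<h1>$label</h1>")


def render(resource):
    """Pick the first template matching one of the resource's types."""
    for rdf_type in resource.get("types", []):
        if rdf_type in TEMPLATES:
            return TEMPLATES[rdf_type].substitute(label=resource["label"])
    return DEFAULT.substitute(label=resource["label"])


page = render({"label": "exome-variants-in-melanoma",
               "types": ["Nanopublication"]})
fallback = render({"label": "Hawaii", "types": ["State"]})
print(page)
print(fallback)
```

Dispatching on type means a new kind of resource gets a custom page by adding one template, while everything else keeps rendering through the default.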
Conclusions
• Prizms is an infrastructure for sharing data on many levels of sophistication
• Good support for Level 1-3 Data Sharing
• Initial support for Level 4-5 Data Sharing
• Didn't just make life science data better, it made future Linked Data better!
• More to be done, but lots of progress
Thanks!
• Rensselaer Polytechnic (Tetherless World):
  o Alvaro Graves
  o John Erickson
  o The LOGD Team
• The Open Knowledge Foundation Network (OKFN)
• Yale University:
  o Ruth Halaban
  o Tobias Kuhn
• Grant support from:
  o Yale SPORE in Skin Cancer
  o Semantic Sea Ice Interoperability Initiative