Standardized Metadata for Human Pathogen/Vector Genomic Sequences
Richard H. Scheuermann, Ph.D.
Director of Informatics
J. Craig Venter Institute
On behalf of the GSC-BRC Metadata Working Group
Genome Sequencing Centers for Infectious Disease (GSCID)
Bioinformatics Resource Centers (BRC)
www.viprbrc.org www.fludb.org
High Throughput Sequencing
• Enabling technology
  – Epidemiology of outbreaks
  – Pathogen evolution
  – Host range restriction
  – Genetic determinants of virulence and pathogenicity
• Metadata requirements
  – Temporal-spatial information about isolates
  – Selective pressures
  – Host species of specimen source
  – Disease severity and clinical manifestations
Metadata Submission Spreadsheets
Complex Query Interface
Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
• Required extensive custom bioinformatics system development
GSC-BRC Metadata Standards Working Group
• NIAID assembled representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
• Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
• Bottom-up approach to capture the data that users consider important
• Compatible with data standards and submission requirements
Metadata Standardization Process
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen or project specific
• For each data field, provide common set of attributes, including preferred term, definition, synonyms, allowed value sets preferably using controlled vocabularies, expected syntax, etc.
• Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)
• Compare, map, and harmonize to other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProjects/BioSamples
• Draft data submission spreadsheets
• Beta test version 1.0 standard with new GSCID white paper projects, collecting feedback
• Adopt version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
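The per-field attributes listed in the process above (preferred term, definition, synonyms, allowed values, expected syntax) can be pictured as a small record type. The Python sketch below is illustrative only: the class name `MetadataField`, the definition wording, the allowed-value set for CS4, and the date pattern for CS8 are assumptions made for the example, not the working group's actual specification. The synonyms reuse the mapped names shown for these fields in the Core Sample table.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataField:
    """Hypothetical container for the attributes of one metadata field."""
    field_id: str
    preferred_term: str
    definition: str
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None   # controlled vocabulary, if any
    syntax: Optional[str] = None           # regular expression, if any

    def is_valid(self, value: str) -> bool:
        """Check a submitted value against the vocabulary and syntax rules."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Illustrative definitions; the vocabulary and the pattern are assumed.
gender = MetadataField(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    definition="Sex of the specimen source (assumed wording).",
    synonyms=["host-sex", "sex"],
    allowed_values={"male", "female", "unknown"},
)
collection_date = MetadataField(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    definition="Date the specimen was collected (assumed wording).",
    synonyms=["collection_date", "collection date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
)

print(gender.is_valid("female"))
print(collection_date.is_valid("2012-03-15"))
print(collection_date.is_valid("15 March 2012"))
```

With these assumed rules, the first two values are accepted and the free-text date is rejected, which is the kind of check a submission spreadsheet could run before data reach a BRC.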
Core Project
Metadata Field ID | Metadata Field Descriptor | OBO Foundry ID | BioProject/BioSample; MIxS
CP1 | Project Title | http://purl.obolibrary.org/obo/OBI_0001622 | Title; project name
CP2 | Project ID | http://purl.obolibrary.org/obo/OBI_0001628 | -
CP3 | Project Description | http://purl.obolibrary.org/obo/OBI_0001615 | Description
CP4 | Supporting Grants/Contract ID | http://purl.obolibrary.org/obo/OBI_0001629 | Grant Agency
CP5 | Publication Citation | http://purl.obolibrary.org/obo/OBI_0001617 | PubMed ID; ref_biomaterial
CP6 | Sample Provider Principal Investigator (PI) Name | - | -
CP7 | Sample Provider PI's Institution | - | -
CP8 | Sample Provider PI's email | - | -
CP9 | Sequencing Facility | - | -
CP10 | Sequencing Facility Contact Name | - | -
CP11 | Sequencing Facility Contact's Institution | - | -
CP12 | Sequencing Facility Contact's email | - | -
CP13 | Bioinformatics Resource Center | http://purl.obolibrary.org/obo/OBI_0001626 | -
CP14 | Bioinformatics Resource Center Contact Name | - | -
CP15 | Bioinformatics Resource Center Contact's Institution | - | -
CP16 | Bioinformatics Resource Center Contact's email | - | -
CP17 | Target Material | - | Material
CP18 | Project Method | - | Methodology
CP19 | Project Objectives | - | Objective
CP20 | Sample Scope | - | Sample Scope
CP21 | Target Capture | - | Capture
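Each OBO Foundry ID in the Core Project table is a PURL whose last path segment names the ontology and the term number. As a rough sketch of how a tool might turn those PURLs into the compact `PREFIX:number` form often used for ontology terms, the Python below parses a few entries copied from the table. The function name `purl_to_curie` and the dictionary name are hypothetical.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"

# A few Core Project fields and their OBO Foundry PURLs, copied from the table.
CORE_PROJECT_TERMS = {
    "CP1": OBO_PREFIX + "OBI_0001622",   # Project Title
    "CP2": OBO_PREFIX + "OBI_0001628",   # Project ID
    "CP3": OBO_PREFIX + "OBI_0001615",   # Project Description
    "CP13": OBO_PREFIX + "OBI_0001626",  # Bioinformatics Resource Center
}


def purl_to_curie(purl: str) -> str:
    """Convert an OBO PURL such as .../obo/OBI_0001622 to 'OBI:0001622'."""
    if not purl.startswith(OBO_PREFIX):
        raise ValueError(f"not an OBO PURL: {purl}")
    local_id = purl[len(OBO_PREFIX):]
    ontology, separator, number = local_id.partition("_")
    if not separator or not number:
        raise ValueError(f"unexpected term identifier: {local_id}")
    return f"{ontology}:{number}"


for field_id, purl in CORE_PROJECT_TERMS.items():
    print(field_id, purl_to_curie(purl))
```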
Core Sample
Metadata Field ID | Metadata Field Descriptor | OBO Foundry ID | NCBI BioSample; MIxS
CS1 | Specimen Source ID | http://purl.obolibrary.org/obo/OBI_0001141 | host-subject-id; host_subject_id
CS2 | Specimen Source Species | http://purl.obolibrary.org/obo/OBI_0100026 | specific_host; host_taxid
CS3 | Species Source Common Name | - | host-common-name; host_common_name
CS4 | Specimen Source Gender | http://purl.obolibrary.org/obo/PATO_0000047 | host-sex; sex
CS5 | Specimen Source Age - Value | http://purl.obolibrary.org/obo/OBI_0001167 | host-age; age
CS6 | Specimen Source Age - Unit | http://purl.obolibrary.org/obo/UO_0000003 | host-age
CS7 | Specimen Source Health Status | http://purl.obolibrary.org/obo/OGMS_0000022 | host-health-state; disease status
CS8 | Specimen Collection Date | http://purl.obolibrary.org/obo/OBI_0001619 | collection_date; collection date
CS9 | Specimen Collection Location - Latitude | http://purl.obolibrary.org/obo/OBI_0001620 | lat_lon; geographic location (lat and long)
CS10 | Specimen Collection Location - Longitude | http://purl.obolibrary.org/obo/OBI_0001621 | lat_lon; geographic location (lat and long)
CS11 | Specimen Collection Location - Location | http://purl.obolibrary.org/obo/GAZ_00000448 | geo_loc_name
CS12 | Specimen Collection Location - Country | http://purl.obolibrary.org/obo/OBI_0001627 | geo_loc_name; geographic location (country and/or sea)
CS13 | Specimen ID | http://purl.obolibrary.org/obo/OBI_0001616 | sample name
CS14 | Specimen Type | http://purl.obolibrary.org/obo/OBI_0001479 | host-tissue-sampled; body habitat, body site, body product
CS15 | Suspected Organism(s) in Specimen - Species | http://purl.obolibrary.org/obo/OBI_0000925 | organism
CS16 | Suspected Organism(s) in Specimen - Subclass | - | strain; subspecific genetic lineage
CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | http://purl.obolibrary.org/obo/OBI_0000925 | phenotype
CS18 | Environmental Material | http://purl.obolibrary.org/obo/ENVO_00010483 | isolation-source; environment (material)
CS19 | Organism Detection Method | http://purl.obolibrary.org/obo/OBI_0001624 | sample collection device or method
CS20 | Specimen Repository | - | culture-collection; source material identifiers
CS21 | Specimen Repository Sample ID | - | culture-collection; source material identifiers
CS22 | Sample ID - Sequencing Facility | - | -
CS23 | Nucleic Acid Extraction Method | http://purl.obolibrary.org/obo/OBI_0666667 | samp_mat_process; sample material processing
CS24 | Nucleic Acid Preparation Method | - | samp_mat_process; sample material processing
CS25 | Sequencing Method | http://purl.obolibrary.org/obo/OBI_0600047 | sequencing method
CS26 | Assembly Algorithm | http://purl.obolibrary.org/obo/OBI_0001522 | assembly
CS27 | Depth of Coverage - Average | http://purl.obolibrary.org/obo/OBI_0001618 | finishing strategy
CS28 | Annotation Algorithm | http://purl.obolibrary.org/obo/OBI_0001625 | -
CS29 | GenBank Record ID | http://purl.obolibrary.org/obo/OBI_0001614 | -
CS30 | Comments | http://purl.obolibrary.org/obo/IAO_0000300 | host-description
CS31 | Specimen Collector Name | - | collected-by
CS32 | Specimen Collector's Institution | - | -
CS33 | Specimen Collector's email | - | -
CS34 | Sample Category | - | attribute_package
CS35 | Host Disease | - | host-disease
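In the Core Sample table, several fields share one target name: latitude (CS9) and longitude (CS10) both map to `lat_lon`, and location (CS11) and country (CS12) both map to `geo_loc_name`. The Python sketch below shows one way a submission tool might collapse a record keyed by field ID into those names. The function names, the example record, and the exact string formats chosen for `lat_lon` and `geo_loc_name` are assumptions for illustration; the slides do not specify them.

```python
# Field ID -> mapped attribute name, copied from the Core Sample table.
BIOSAMPLE_NAME = {
    "CS1": "host-subject-id",
    "CS2": "specific_host",
    "CS4": "host-sex",
    "CS8": "collection_date",
    "CS14": "host-tissue-sampled",
}


def format_lat_lon(lat: float, lon: float) -> str:
    """Render signed decimal degrees as 'dd.dddd N|S ddd.dddd E|W' (assumed format)."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.4f} {ns} {abs(lon):.4f} {ew}"


def to_biosample(record: dict) -> dict:
    """Translate a record keyed by CS field IDs into mapped attribute names."""
    attrs = {BIOSAMPLE_NAME[k]: v for k, v in record.items() if k in BIOSAMPLE_NAME}
    # CS9 and CS10 are merged into the single lat_lon attribute.
    if "CS9" in record and "CS10" in record:
        attrs["lat_lon"] = format_lat_lon(float(record["CS9"]), float(record["CS10"]))
    # CS12 (country) and CS11 (location) are merged into geo_loc_name.
    place = [record[k] for k in ("CS12", "CS11") if record.get(k)]
    if place:
        attrs["geo_loc_name"] = ": ".join(place)
    return attrs


# Made-up example record, not real project data.
example = {
    "CS1": "subject-001",
    "CS2": "Homo sapiens",
    "CS8": "2012-03-15",
    "CS9": "38.9807",
    "CS10": "-77.1003",
    "CS11": "Maryland",
    "CS12": "USA",
}
print(to_biosample(example))
```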
Metadata Processes
[Figure: semantic network of the metadata processes. Panel labels: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Host Characterization. Process nodes: specimen isolation process, sample processing, sequencing assay, quality assessment assay, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), data archiving process. Material and data nodes: specimen source (organism or environmental), specimen, microorganism, enriched NA sample, microorganism genomic NA, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID. Other nodes: specimen collector, input sample, reagents, technician, equipment, isolation protocol, type ID, qualities, temporal-spatial region. Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]
Specimen Isolation
[Figure: detail of the specimen isolation part of the network, annotated with Core Sample field IDs (CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18). Nodes: organism, environmental material, environment, equipment, person, specimen X, specimen isolation procedure X, isolation protocol, specimen type, specimen isolation procedure type, organism part, hypothesis, IRB/IACUC approval, organism pathogenic disposition, ID, name, affiliation, temporal-spatial region, spatial region, temporal interval, geographic location, GPS location, date/time. Roles: specimen source role, specimen capture role, specimen collector role. Qualities: gender, age, health status. Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality, has_disposition.]
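The has_input and has_output relations in the network lend themselves to simple provenance queries. The Python sketch below stores a handful of edges as (subject, relation, object) triples and walks from a starting entity to everything derived from it. The specific edges are an assumed reading of the diagram labels, and the function name `downstream` is hypothetical; this is not the working group's implementation.

```python
from collections import deque

# Assumed edges: each process has_input one entity and has_output another.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
]


def downstream(start, triples):
    """Return entities derived from `start`, in breadth-first order."""
    consumers = {}  # entity -> processes that take it as input
    products = {}   # process -> entities it outputs
    for subject, relation, obj in triples:
        if relation == "has_input":
            consumers.setdefault(obj, []).append(subject)
        elif relation == "has_output":
            products.setdefault(subject, []).append(obj)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        entity = queue.popleft()
        for process in consumers.get(entity, []):
            for product in products.get(process, []):
                if product not in seen:
                    seen.add(product)
                    order.append(product)
                    queue.append(product)
    return order


print(downstream("specimen source", TRIPLES))
```

Starting from "specimen source", the walk passes through the specimen, the enriched NA sample, the primary data and the sequence data, and ends at the sequence data record. A query of this kind is what links an archived sequence back to the host and collection metadata.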
Core Project Semantics
Outcome of Metadata Standards WG
• Consistent metadata captured across GSCID
• Bottom-up approach focuses standard on important features
• Support more standardized BRC interface development
• Harmonization with related stakeholders – Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample
• Represented in the context of an extensible semantic framework
• Identified gaps in data field list (e.g. temporal components)
• Includes logical structure for other, project-specific, data fields - extensible
• Identified gaps in ontology data standards (use case-driven standard development)
• Identified commonalities in data structures (reusable)
• Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
  – Sequencing => “omics”
Utility of semantic representation
Acknowledgements
Bruce Birren2,b, Lauren Brinkac1,a, Vincent Bruno3,c, Elizabeth Caler1,a, Ishwar Chandramouliswaran1,a, Sinéad Chapman2,b, Frank Collins8,h, Christina Cuomo2,b, Joana Carneiro Da Silva3,c, Valentina Di Francesco4, Vivien Dugan1,a, Scott Emrich8,h, Mark Eppinger3,c, Michael Feldgarden2,b, Claire Fraser3,c, W. Florian Fricke3,c, Maria Giovanni4, Gloria Giraldo-Calderon8,h, Omar S. Harb5,g, Matt Henn2,b, Erin Hine3,c, Julie Dunning Hotopp3,c, Jessica C. Kissinger6,g, Eun Mi Lee4, Punam Mathur4, Garry Myers3,c, Emmanuel Mongodin3,c, Cheryl Murphy2,b, Dan Neafsey2,b, Karen Nelson1,a, Ruchi Newman2,b, William Nierman1,a, Brett E. Pickett1,d,e, Julia Puzak4, David Rasko3,c, David S. Roos5,g, Lisa Sadzewica3,c, Richard H. Scheuermann1,d,e, Lynn M. Schriml3,c, Bruno Sobral7,f, Tim Stockwell1,a, Chris Stoeckert5,g, Dan Sullivan7,f, Luke Tallon3,c, Herve Tettelin3,c, Doyle V. Ward2,b, David Wentworth1,a, Owen White3,c, Rebecca Will7,f, Jennifer Wortman2,b, Alison Yao4, Jie Zheng5,g
1J. Craig Venter Institute, Rockville, MD and San Diego, CA, 2Broad Institute, Cambridge, MA, 3Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 4National Institute of Allergy and Infectious Diseases, Rockville, MD, 5University of Pennsylvania, Philadelphia, PA, 6University of Georgia, Athens, GA, 7Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA, 8University of Notre Dame, South Bend, IN, aJ. Craig Venter Institute Genome Sequencing Center for Infectious Diseases, bBroad Institute Genome Sequencing Center for Infectious Diseases, cInstitute for Genome Sciences Genome Sequencing Center for Infectious Diseases, dInfluenza Research Database Bioinformatics Resource Center, eVirus Pathogen Resource Bioinformatics Resource Center, fPATRIC Bioinformatics Resource Center, gEuPathDB Bioinformatics Resource Center, hVectorBase Bioinformatics Resource Center
Tanya Barrett – NCBI
Pelin Yilmaz – Genomic Standards Consortium
N01AI2008038 /N01AI40041