The NIDDK Repositories – Adding value to shared … Session 5...
Central repository components
Contract funded since 2003.
Biosample repository (Fisher): archival storage of biological specimens
Database repository (RTI): maintain archival datasets, respond to queries about data and stored samples
Genetics repository (Rutgers Univ.): create immortalized cell lines, DNA extraction
Samples and data stored from >50 major multi-site clinical studies in diabetes, digestive, kidney, liver, and urologic diseases
Each study collects according to its own protocols
43 datasets available for sharing
23 GWAS datasets available for sharing through dbGAP
DNA and/or biosamples available for sharing from 28 studies
The NIDDK Central Repositories’ holdings (total samples):
Biosample Repository: 7,384,858
Genetics Repository: 113,057
Affiliated Repositories: 560,191
Types of studies
Diabetes and Obesity Studies
• DCCT/EDIC (The Type 1 Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications)
• DPP (Diabetes Prevention Program)
• DPPOS (The Diabetes Prevention Program Outcome Study)
• LookAHEAD (Action for Health in Diabetes)
• HEALTHY (Middle-School Based Primary Prevention Trial of Type 2 Diabetes)
• TrialNet - TN01 (Natural History Study of the Development of Type 1 Diabetes)
• TEDDY (The Environmental Determinants of Diabetes in the Young)
Kidney Studies
• AASK Trial (The African American Study of Kidney Disease and Hypertension Study)
• CRIC (Chronic Renal Insufficiency Cohort Study)
• MDRD (The Modification of Diet in Renal Disease)
Liver Disease Studies
• A2ALL (The Adult-to-Adult Living Donor Liver Transplantation Cohort Study)
• HALT-C (The Hepatitis C Antiviral Long-term Treatment against Cirrhosis)
• VIRAHEP-C (The Study of Viral Resistance to Antiviral Therapy of Chronic Hepatitis C)
Urology Studies
• MTOPS (The Medical Therapy of Prostatic Symptoms)
• SISTEr (The Stress Incontinence Surgical Treatment Efficacy Trial)
NIDDK has custodianship of all samples and data transferred to the Repositories and no IP protections are attached
The Steering Committee of each study or study group has control of the samples and data during a “proprietary period” (2 years after the end of the study or study increment)
Expensive resources have to be useful: the Repositories’ sharing policies
Remove all identifiers, except some elements of dates (“Limited data set”)
Collect all forms, MOPs, key papers and analytic datasets from those papers
Reconcile sample list with phenotypic data
Carry out Dataset Integrity Check (DSIC) process
Curating studies
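The step of reconciling a sample list with phenotypic data can be pictured as a comparison of participant IDs between the two sources. The following is a minimal sketch under stated assumptions, not the Repository's actual procedure: the function name `reconcile` and the IDs are hypothetical and chosen only for illustration.

```python
def reconcile(sample_ids, phenotype_ids):
    """Compare participant IDs on a sample list with those in the phenotypic data."""
    samples = set(sample_ids)
    phenotypes = set(phenotype_ids)
    return {
        # IDs present in both sources: samples that can be linked to a phenotype record
        "linked": sorted(samples & phenotypes),
        # Samples whose participant has no phenotype record
        "samples_without_phenotype": sorted(samples - phenotypes),
        # Phenotype records with no stored sample
        "phenotype_without_samples": sorted(phenotypes - samples),
    }


# Hypothetical IDs, for illustration only.
report = reconcile(["P001", "P002", "P004"], ["P001", "P002", "P003"])
print(report)
```

Any ID that appears in only one of the two lists would be a candidate for follow-up with the study before the collection is released.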
Curating studies – perform Dataset Integrity Check
• verify that published results from the study can be reproduced using the archived datasets
• perform a small number of analyses to duplicate published results
• intent is to provide confidence that the dataset distributed by the NIDDK repository is a true copy of the study data
• does not attempt to resolve minor or inconsequential discrepancies with published results
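The idea behind the Dataset Integrity Check can be illustrated by recomputing one published summary statistic from the archived data and comparing the two within a tolerance. This is only a sketch under stated assumptions: the function `check_reproduced`, the choice of a mean as the statistic, the tolerance, and all values are hypothetical and do not come from any NIDDK study.

```python
from statistics import mean


def check_reproduced(archived_values, published_value, tolerance):
    """Recompute a mean from archived data and compare it with the published figure.

    The tolerance stands in for the "minor or inconsequential discrepancies"
    that the check does not attempt to resolve.
    """
    recomputed = mean(archived_values)
    return recomputed, abs(recomputed - published_value) <= tolerance


# Hypothetical archived measurements and published mean, for illustration only.
archived = [7.1, 8.3, 9.0, 7.6]
recomputed, ok = check_reproduced(archived, published_value=8.0, tolerance=0.05)
print(recomputed, ok)
```

A real check would repeat this for a small number of analyses taken from the key papers, using the analytic datasets collected during curation.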
www.niddkrepository.org
• using the CaBIG Common Biorepository Model (CBM) of 30 variables
• an additional 140 variables that are domain- or study-specific
Curation for common variables
Index | Variable_Name | Description | CBM (curated for all studies) | Diabetes Domain | Kidney Domain | Liver Domain | Study_Specific* | Other_Common**
1 Ethnicity Ethnicity x x x x
2 Gender Gender x x x x
3 Race Race x x x x
4 ace_arb Use of antihypertensives (ACE inhibitors, ARBs) x x x
5 acr Albumin to creatinine ratio x
6 add_dx Prior Addison's disease x
7 aer Albumin Excretion Rate x
8 age Age x x x x
9 age_transplant Age at transplant x x
10 ageatonset Age at IDDM onset x
11 agegroup Age group x
12 aki Acute kidney injury (aka ARF) x x
13 alcohol Frequent alcohol use x x
14 assign Treatment group x
15 beckqaire Severe anxiety (BECK) x
'What's in the NIDDK CDR?'--public query tools for the NIDDK central data repository. Pan H, Ardini MA, Bakalov V, et al., 2013, Database (Oxford).
No analytic dataset – impossible to recreate results in papers
No data dictionary for variables used in analysis – impossible to recreate results in paper
Errors in calculation
Poor or incomplete linkage of sample lists to phenotypic data
Sample labeling issues, including:
Labels applied incorrectly, so they cannot be read by barcode scanner
Duplicate ids
Empty or nearly empty vials
Incorrectly preserved samples
Curation issues
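Some of the sample issues listed above, such as duplicate ids and empty or nearly empty vials, lend themselves to simple automated checks on a sample manifest. The sketch below assumes a hypothetical manifest with `sample_id` and `volume_ml` fields and an arbitrary minimum-volume threshold; none of these names or values come from the Repository.

```python
from collections import Counter


def flag_manifest_issues(manifest, min_volume_ml):
    """Return (duplicate sample IDs, IDs of vials below the minimum volume)."""
    counts = Counter(row["sample_id"] for row in manifest)
    duplicates = sorted(sid for sid, n in counts.items() if n > 1)
    low_volume = sorted(
        {row["sample_id"] for row in manifest if row["volume_ml"] < min_volume_ml}
    )
    return duplicates, low_volume


# Hypothetical manifest rows, for illustration only.
manifest = [
    {"sample_id": "S1", "volume_ml": 1.0},
    {"sample_id": "S2", "volume_ml": 0.02},
    {"sample_id": "S1", "volume_ml": 0.9},
]
duplicates, low_volume = flag_manifest_issues(manifest, min_volume_ml=0.1)
print(duplicates, low_volume)
```

Checks of this kind only catch problems visible in the records; unreadable labels or incorrectly preserved samples still require physical inspection.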
Using the Repository – requests for data and samples
Requests for Repository materials
year | requests for biosamples | requests for genetic samples | total number of unique samples | data requests
2004 | 0 | 7 | 1936 | 0
2005 | 15 | 7 | 3658 | 4
2006 | 47 | 12 | 9391 | 5
2007 | 49 | 6 | 6979 | 15
2008 | 50 | 24 | 29271 | 16
2009 | 64 | 45 | 48561 | 33
2010 | 98 | 34 | 64195 | 29
2011 | 149 | 14 | 44110 | 58
2012 | 109 | 16 | 73113 | 94
2013 | 55 | 6 | 10638 | 14
Using the data and samples
91 publications by researchers who gained access to data and samples through the NIDDK Repository, including:
Papers based on the GWAS data sets in dbGAP
A paper that re-examined the data from a study of dialysis intensity (HEMO) and suggested a re-interpretation of the major study conclusion (Argyropoulos, C et al., 2009, J. Am. Soc. Nephrol., 20, 2034-2043).
A paper that re-analyzed the IBD Genetics GWAS data to identify additional loci (Elding, H et al., 2011, Am. J. Hum. Genet., 89(6), 798-805).
Publications on novel analytic methods or markers in NIDDK Repository-supplied samples
dbGAP – the NIDDK Repository: two different curatorial approaches
NIDDK Repository | dbGAP
Manual curation | Automated curation
Elements of dates accepted | No elements of dates accepted
DSIC | No DSIC
Linkage to samples | No linkage to samples
Expensive - ~$1M/year for the Data Repository | Minimal costs – the NLM is bearing the costs of acquiring studies
Low volume | High volume
Year | dbGAP approved | dbGAP # downloaded | dbGAP pct downloaded | NIDDK Repository approved | NIDDK Repository pct downloaded
2010 | 54 | 28 | 52% | 29 | 100%
2011 | 91 | 52 | 57% | 58 | 100%
2012 | 137 | 83 | 61% | 94 | 100%
2013 | 53 | 17 | 32% | 14 | 100%
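The dbGAP percentages in the table follow from dividing the number of downloaded requests by the number approved in each year. The short sketch below recomputes them from the table's own figures; it assumes the three dbGAP columns are, in order, approved requests, number downloaded, and percent downloaded.

```python
# (approved, downloaded) per year, taken from the dbGAP columns of the table above.
dbgap = {2010: (54, 28), 2011: (91, 52), 2012: (137, 83), 2013: (53, 17)}

# Percent of approved requests that were actually downloaded, rounded to whole percent.
pct_downloaded = {
    year: round(100 * downloaded / approved)
    for year, (approved, downloaded) in dbgap.items()
}
print(pct_downloaded)  # {2010: 52, 2011: 57, 2012: 61, 2013: 32}
```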
• Curation is expensive
• More familiarity = more sophisticated use of data
• If investigators are not obliged to share their data, they can get by with poor documentation and processing/storage errors
Lessons learned and cautionary notes
NIDDK Repository Staff
Project Officers: Beena Akolkar, Paul Eggers, Bob Karp
Contracting Specialist: Rich Bailey
Repository Specialists: Sharon Kay Mobley, Kris Moen
RTI – Data Repository: Phil Cooley, PI; Helen Pan, Sylvia Tan
ThermoFisher, Biosample Repository: Heather Higgins, PI
Rutgers Univ., Genetics Repository: Jay Tischfield, PI
www.niddkrepository.org