Structuring for Data Integrity Success: Concepts and Cases Kristen Anton, MS Bioinformatics Geisel School of Medicine at Dartmouth and UNC at Chapel Hill January 13, 2017

Structuring for Data Integrity Success: Concepts and Cases

Kristen Anton, MS

Bioinformatics

Geisel School of Medicine at Dartmouth

and UNC at Chapel Hill

January 13, 2017

From Information Design, Nathan Shedroff

The problem with research data

• Don’t know what data formats will be

• Data stored on a variety of platforms

• Data needs to be available when we want it

• Data are distributed – not in one place

• Data cannot be centralized

• Data are protected

Do know: we have to be able to use some or all of the data at any time, from a single access point.

Science data challenges

• Big Data

• Emphasis on capture

• Standards

• Access

• Meaningful use

• Archive

• Grant dollars

• Patient empowerment and interactivity

Secrets to success

1. Do not design a study without contribution of computational and analytic experts (figure)

2. Plan adequate resources

3. There is no magic bullet technical solution

4. Don’t skip the engineering

5. Develop realistic timelines collaboratively

6. Remember: the goal is a high quality data set

7. Test your systems

8. Iterate

Informatics & Data Management

• Needs analysis

• Process engineering

• Functional specification

• Technical specification

• System development & implementation

• Training, documentation

• Infrastructure (hardware, network)

• Data validation

• Process refinement (feedback to developers)

• Data cleaning

• Data manipulation

• Data reporting

• Data sharing, transformation

• Data integration

• Data analysis/interface

Informatics for Complex Biomedical Research

The goal: to develop a high-quality computerized system designed to give you a high-quality data set

• Efficient, accurate, secure, validated data capture

• Safe data storage and archive (make sure you can restore from backup: practice!)

• Change control and audit (data elements, process and data)

• Code repository (especially important for longitudinal studies)

• Ability to retrieve data in such a way that all study data for an individual is attributable to that subject

• Low-cost maintainability of the system

• Speedy startup, minimal cost
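
The advice to practice restoring from backup can be sketched as a small drill. This is a minimal illustration, not the system described in the talk: the directory layout, the file name `visits.csv`, and the use of `shutil.copytree` as a stand-in for the real backup and restore jobs are all assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify_restore(original: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing or differ after a restore."""
    before = checksum_manifest(original)
    after = checksum_manifest(restored)
    return sorted(name for name in before if after.get(name) != before[name])


def restore_drill() -> list[str]:
    """Hypothetical drill: back up a toy data directory, restore it, verify."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "study_data"
        data.mkdir()
        (data / "visits.csv").write_text("subject_id,visit\nS001,baseline\n")
        backup = Path(tmp) / "backup"
        shutil.copytree(data, backup)      # stand-in for the real backup job
        restored = Path(tmp) / "restored"
        shutil.copytree(backup, restored)  # stand-in for the real restore
        return verify_restore(data, restored)


if __name__ == "__main__":
    problems = restore_drill()
    print("Restore verified" if not problems else f"Restore problems: {problems}")
```

An empty list from `verify_restore` means every original file came back byte-for-byte; any listed path is a file to investigate before trusting the backup.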

First (critical) step: Process Engineering

• Read and understand the protocol

• Describe the protocol diagrammatically (flow charts, use-case diagrams, etc.)

• Review the documentation and adjust (scientific, operations and tech staff) – iterate!

• Include documentation within study manuals

Data Entry: Lots of shapes and sizes

• Paper – centralized double entry, then post to database

• Paper – scan & verify, then post to database

• Paper with entry at source of data collection

• Electronic data collection, stand-alone systems deployed to data collection site; data transfer

• Web and mobile device-based electronic data collection (automatic post to database)

• Automated data extraction and capture (e.g. eHR, PHR)
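
As an illustration of the double-entry option above, the comparison step can be sketched as follows. The field names and values are hypothetical; the idea is only that two independent keyings of the same paper form are compared field by field, and the record is held back if they disagree.

```python
def compare_double_entry(first: dict, second: dict) -> dict:
    """Return {field: (first_value, second_value)} for every disagreement."""
    mismatches = {}
    for field in sorted(set(first) | set(second)):
        a = first.get(field)
        b = second.get(field)
        if a != b:
            mismatches[field] = (a, b)
    return mismatches


# Two independent keyings of the same (hypothetical) paper form.
entry_1 = {"subject_id": "S001", "weight_kg": "72.5", "visit": "baseline"}
entry_2 = {"subject_id": "S001", "weight_kg": "27.5", "visit": "baseline"}

discrepancies = compare_double_entry(entry_1, entry_2)
if discrepancies:
    print("Hold record for review:", discrepancies)
else:
    print("Entries agree; record can be posted to the database")
```

Here the transposed digits in `weight_kg` are caught before the record reaches the database.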

Data Entry: General Principles

• Authentication and authorization

• Design with measures to make identification clear

• Good security practice (no sharing passwords, regular changing of passwords, physical & logical security)

• Audit trail

• Date/time stamp (make sure systems’ date/time is correct; account for different time zones)
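
A minimal sketch of the audit trail and date/time stamp principles, under stated assumptions: the class and field names are hypothetical, and storing timestamps in UTC is one common way to account for sites in different time zones, not necessarily the approach used in the studies described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable line of the audit trail."""
    user: str
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    changed_at_utc: str  # ISO 8601 timestamp, always in UTC


class AuditedRecord:
    """A record whose every change is logged with who, what, and when."""

    def __init__(self, record_id: str, values: dict):
        self.record_id = record_id
        self.values = dict(values)
        self.audit_log: list[AuditEntry] = []

    def update(self, user: str, field_name: str, new_value: object) -> None:
        old_value = self.values.get(field_name)
        if old_value == new_value:
            return  # nothing changed, so nothing to audit
        self.values[field_name] = new_value
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(
            AuditEntry(user, self.record_id, field_name, old_value, new_value, stamp)
        )


record = AuditedRecord("S001", {"weight_kg": 72.5})
record.update("jdoe", "weight_kg", 73.0)
for entry in record.audit_log:
    print(entry)
```

Each user logs in individually (no shared passwords), so the `user` value in the log identifies a single person.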

Web and app-based Data Entry: Dynamic interfaces enforce protocol

• Use bar codes, pick-lists, radio buttons, check boxes and skip-patterns

• Minimal use of default values and free text

• Information can be pre-loaded to facilitate data linking (i.e., specimens to subject)

• Consistent use of standardized semantics

• Validation at entry, submit, and posting to database

• Reduce data entry burden/ capture at source

• Use standardized header to facilitate subject identification
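
The pick-list and skip-pattern ideas above can be sketched as a validation function run at entry or submit. The form, its fields (`smoking_status`, `packs_per_day`), the allowed values, and the plausible range are all hypothetical and chosen only to illustrate the pattern.

```python
SMOKING_STATUS = {"never", "former", "current"}  # hypothetical pick-list


def validate_form(form: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the form passes."""
    errors = []
    status = form.get("smoking_status")
    if status not in SMOKING_STATUS:
        errors.append("smoking_status must be one of the pick-list values")

    packs = form.get("packs_per_day")
    if status == "current":
        # Skip pattern: the follow-up question applies only to current smokers.
        if packs is None:
            errors.append("packs_per_day is required for current smokers")
        elif not 0 < packs <= 10:
            errors.append("packs_per_day is outside the plausible range")
    elif packs is not None:
        errors.append("packs_per_day must be skipped unless status is current")
    return errors


print(validate_form({"smoking_status": "current", "packs_per_day": 1.5}))
print(validate_form({"smoking_status": "never", "packs_per_day": 2}))
```

Constraining answers to a pick-list removes free-text variation, and the skip pattern keeps answers from appearing where the protocol says the question should not be asked.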

Security: Physical and Logical

• Physical

– Servers locked, locked and locked

– “Good Practice”

• Topology

– Firewall

– Separate servers

– Systematic virus protection

• Authorized access to data

– Regulation of access by login (individual signature)

– Multiple levels of access privilege

– Time-out

• Disaster recovery

– Audit function within database stores all changes to data with metadata

– Nightly back-up

– Monthly archive with off-site storage

• Design

– Data: identified, de-identified, anonymous

– Encrypted transactions (data moves between interfaces and database)
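
A minimal sketch of two of the logical controls listed above: multiple levels of access privilege and a session time-out. The level names, their ordering, and the 15-minute limit are assumptions made for illustration; real systems often use non-hierarchical roles and site-specific time-out policies.

```python
from datetime import datetime, timedelta, timezone
from enum import IntEnum


class AccessLevel(IntEnum):
    """Hypothetical privilege levels, lowest to highest."""
    READ_DEIDENTIFIED = 1
    ENTER_DATA = 2
    READ_IDENTIFIED = 3
    ADMIN = 4


SESSION_TIMEOUT = timedelta(minutes=15)  # assumed policy, not from the source


class Session:
    """Tracks one logged-in user, their privilege level, and last activity."""

    def __init__(self, user: str, level: AccessLevel, now: datetime):
        self.user = user
        self.level = level
        self.last_activity = now

    def authorize(self, required: AccessLevel, now: datetime) -> bool:
        """Allow the action only if the session is fresh and the level suffices."""
        if now - self.last_activity > SESSION_TIMEOUT:
            return False  # timed out: the user must log in again
        self.last_activity = now
        return self.level >= required


start = datetime(2017, 1, 13, 9, 0, tzinfo=timezone.utc)
session = Session("coordinator01", AccessLevel.ENTER_DATA, start)
print(session.authorize(AccessLevel.ENTER_DATA, start + timedelta(minutes=5)))
print(session.authorize(AccessLevel.READ_IDENTIFIED, start + timedelta(minutes=6)))
```

Passing `now` explicitly keeps the time-out logic easy to test without waiting for real time to elapse.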

Data Security: Challenging High Priority

Create a dataset that includes the least possible identification
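
One way to approach "least possible identification" is to drop direct identifiers and replace the subject ID with a keyed pseudonym before a data set leaves the data center. The sketch below assumes hypothetical field names and a secret key held only by the data center. Removing a handful of named fields is not by itself sufficient for regulatory de-identification, since dates, locations, and rare values can also identify people.

```python
import hashlib
import hmac

# Hypothetical names of fields that directly identify a person.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def pseudonym(subject_id: str, secret_key: bytes) -> str:
    """Derive a stable study key from the subject ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, subject_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap the subject ID for a pseudonym."""
    shared = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "subject_id"
    }
    shared["study_key"] = pseudonym(record["subject_id"], secret_key)
    return shared


key = b"example-key-kept-at-the-data-center"  # placeholder, not a real secret
row = {"subject_id": "S001", "name": "Jane Doe", "phone": "555-0100", "diagnosis": "UC"}
print(deidentify(row, key))
```

Because the pseudonym is keyed, the same subject maps to the same `study_key` across extracts, while anyone without the key cannot recompute it from a known subject ID.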

Standards & ‘Gold’ Standards

• FDA guidelines for computer systems in clinical trials http://www.fda.gov/RegulatoryInformation/Guidances/ucm126402.htm

• HIPAA guidelines regarding security of information

• IRB requirements with regard to ability to identify individuals in the data set

• Industry semantic standards and ontologies

• Metadata standards

• Transmission standards: FHIR, HL7 (support automated extraction)

Data Management for Complex Biomedical Research

The goal: to safely, efficiently and accurately collect, store, validate and retrieve study data

• Process validation

• Participate in system validation

• Data validation measures within central database

• Quality control checks at source of capture

• Ensuring ready access to data for analytical staff

• Data reporting to support operation of study as well as interim analysis

• Data sharing (with potential data transformation or mapping) - becoming increasingly important
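
The "data transformation or mapping" step in data sharing can be illustrated with a small sketch. The local codes, the shared labels, and the column names are hypothetical; a real project would map to whatever semantic standard the sharing partners have agreed on.

```python
# Hypothetical mapping from local site codes to shared vocabulary terms.
DIAGNOSIS_MAP = {
    "CD": "Crohn's disease",
    "UC": "ulcerative colitis",
    "IBDU": "IBD unclassified",
}


def transform_for_sharing(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Map local codes to shared terms; return (shareable rows, unmapped rows)."""
    shared, unmapped = [], []
    for row in rows:
        code = row["dx_code"]
        if code in DIAGNOSIS_MAP:
            shared.append({"subject": row["subject"], "diagnosis": DIAGNOSIS_MAP[code]})
        else:
            unmapped.append(row)  # never guess: send back for review instead
    return shared, unmapped


local_rows = [
    {"subject": "S001", "dx_code": "UC"},
    {"subject": "S002", "dx_code": "??"},
]
shareable, needs_review = transform_for_sharing(local_rows)
print(shareable)
print(needs_review)
```

Returning unmapped rows separately, rather than dropping or guessing them, keeps the transformation auditable.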

Challenge: BIG DATA

1. Volume

• Multiple generators: humans, machines, networks, social media

2. Variety

• Structured data, e.g. databases, spreadsheets

• Unstructured data, e.g. emails, photos, videos, pdf, path reports

3. Velocity

• Fast, sometimes continuous capture

• Real-time access, use desired

4. Veracity

• Biases, noise, outliers, abnormalities

• “Dirty” data

• How long is data valid, how long should it be stored?

• Is the data being analyzed meaningful to the problem?

5. Value
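
As one illustration of a veracity check, outliers in a numeric field can be flagged for review with a modified z-score based on the median absolute deviation. This technique is not named in the slides; it is swapped in here as a common robust screen, and the 3.5 threshold and the sample weights are assumptions.

```python
from statistics import median


def flag_outliers(values: list[float], threshold: float = 3.5) -> list[float]:
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


weights_kg = [70.1, 71.3, 69.8, 72.0, 70.5, 705.0]  # last value: likely keying error
print(flag_outliers(weights_kg))
```

Flagged values are candidates for a query back to the source of capture, not for automatic deletion.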

Preparing to handle big data

• Invest in capturing and maintaining data in well-annotated, accessible, structured data repositories

– Based on rigorous data/information architectures

• Computer Scientists, Statisticians/Data Scientists, Domain Experts (Scientists) must systematize the analysis of massive data

– Significant efficiencies may be achieved by thinking of data analysis and data access together rather than thinking of them as serial operations.

– We need new statistical methods and algorithms optimized for this type of environment

• Develop computing infrastructures for sharing and analyzing highly distributed, heterogeneous data

– Requires coordination (international, cross-agency)

– Requires a software architecture

• Sustainability in both the data and the software infrastructures is critical

The fun part: Bioinformatics in action

Internet cohort of 14,000 IBD patients

• Baseline and 6-month surveys on disease activity and treatment

• Modules collect information on a variety of patient-reported outcomes, diet, sleep, etc. – very flexible to incorporate new questions

• Pilot tested biospecimen collection

• 25 abstracts and more than a dozen papers

• Two PCORI grants based on Partners

What has this research shown?

… also Kids & Teens

• Coordinator-supported IBD registry

• Information and specimens

• Specimen analysis bringing in “big data”, e.g. genotyping

• Sub-population linked to CCFA Partners

• 7 IBD centers of excellence

• 5000 IBD patients

• Developed baseline and follow-up surveys (in person and online)

Web site home page serves as portal to Registry, biospecimen network, forms, documents, SOPs, project descriptions

Data capture:

• Updated web-based data collection tools to enroll subjects and enter clinical information from patients and charts

• Baseline and follow-up survey

Online follow-up to reduce coordinator burden:

1,266 online follow-up questionnaires completed

Sharing and analyzing data:

Annotated case report form helps users understand data set contents and formats

SHARE biospecimen registry

Dynamic biospecimen registry developed using Apache Software Foundation open source software, collaboration with NASA Jet Propulsion Laboratory

• Basic metadata describing SHARE biospecimens defined and implemented within this system

• Ongoing effort to connect all sites electronically, for dynamic data retrieval

• Data extract brings information to network until electronic connection is established

• Blood, tissue and stool data now available

• Status page continuously updated to reflect progress on development

SHARE biospecimen registry: scalable

Opportunity: Building a full data science “knowledge environment” for SHARE IBD research

• Document and search protocols, data and metadata standards, cohort descriptions, outcomes

• Reproducible analyses supported by “pipelines” e.g. RNASeq, Secretome, Mass spec

• Persistent archive of data

• Searchable environment

Data Science Knowledge Environment:

Early Detection Research Network (EDRN) is an initiative of the National Cancer Institute created to bring together dozens of research institutions to help accelerate the translation of biomarker information into clinical applications for diagnosing cancer in its earliest stages …

… supported by a virtual, national, integrated Knowledge System.

SHARE biospecimen registry

In September 2016, our systems were named one of the ten technical advances improving cancer research by TechRepublic.

Others on the list include IBM Watson, Google DeepMind and CRISPR.

Our work is the only technology that addresses enterprise architecture for managing and sharing big data for analytics and visualization.

http://www.techrepublic.com/article/10-ways-tech-is-improving-cancer-research/

Chemoprevention study: Prevention of skin cancer by antioxidants

• Subjects (9000) with clinical evidence of chronic arsenic exposure will be recruited from two ongoing cohort studies

• Two study sites in Araihazar and Matlab; Coordinating center at ICDDR,B in Dhaka and University of Chicago; Mailman School of Public Health, Columbia University; data center at Dartmouth

• Goal: to test effectiveness of vitamin E and selenium in preventing development of skin cancer

Data issues

• Design processes, forms and systems

• Collect: Family history, Risk factors, Health & outcomes, bio-specimens/related data

• Screening, recruitment, pill distribution, data collection, bio-specimen tracking, data management, data integration and analysis

• Challenges: intermittent power, low bandwidth internet access, low-tech facilities (no land-line phone, no fax, no cooling, no computers at sites), translation of information into and back from Bengali – etc.
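
One common response to intermittent power and low-bandwidth connections is store-and-forward capture: records are saved locally first and transmitted when a connection is available. The slides do not say this is how the study handled it; the sketch below is an assumption-laden illustration using SQLite as the local store, with a hypothetical `send` callable standing in for the upload to the data center.

```python
import json
import sqlite3


class OfflineQueue:
    """Local outbox: records persist on disk until they are sent successfully."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def save(self, record: dict) -> None:
        """Commit the record locally before anything else happens."""
        self.db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(record),))
        self.db.commit()

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def sync(self, send) -> int:
        """Send queued records in order; stop at the first connection failure."""
        sent = 0
        rows = self.db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
        for row_id, payload in rows:
            try:
                send(json.loads(payload))
            except ConnectionError:
                break  # still offline: keep the remaining records queued
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent


queue = OfflineQueue()
queue.save({"subject": "S001", "visit": "baseline"})
queue.save({"subject": "S002", "visit": "baseline"})
received = []
print(queue.sync(received.append), "records sent;", queue.pending(), "pending")
```

A record is deleted from the outbox only after `send` returns without error, so a power cut or dropped connection mid-sync leaves unsent data intact.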

Colon Cancer Family Registries

Chemoprevention of Arsenic Induced Skin Cancer

Visual Media Influences on Adolescence Smoking

Vit D/Calcium Polyp Prevention Study

PCORI: IBD Patient Powered Research Network

Methotrexate Response in Treatment of UC

IBD research

We Care

Improving the quality of life of patients is not just a job for us – we are committed to facilitating better disease screening, treatment, cure and the basic science that leads to knowledge.

Great Team

Department of Biomedical Data Science

Dave Aman Judy Harjes Rob Rheaume Scott Gerlach Suzie Rovell-Rixx John Gilman Susan Gallagher

Jane Hebb Ted Bush Steve Pyle Judi Forman Laurie Johnson Maureen Colbert

Center for Gastrointestinal Biology and Disease

David Seligson Ginny Sharpless Wenli Chen Van Nguyen