
Structuring for Data Integrity Success: Concepts and Cases

Kristen Anton, MS

Bioinformatics

Geisel School of Medicine at Dartmouth

and UNC at Chapel Hill

January 13, 2017

(Figure from Information Design, Nathan Shedroff)

• Don’t know what data formats will be
• Data stored on a variety of platforms
• Data needs to be available when we want it
• Data are distributed – not in one place
• Data cannot be centralized
• Data are protected

We do know we have to be able to use some or all of the data at any time, from a single access point.

The problem with research data

Science data challenges

• Big Data
• Emphasis on capture
• Standards
• Access
• Meaningful use
• Archive
• Grant dollars
• Patient empowerment and interactivity

Secrets to success

1. Do not design a study without the contribution of computational and analytic experts

2. Plan adequate resources

3. There is no magic bullet technical solution

4. Don’t skip the engineering

5. Develop realistic timelines collaboratively

6. Remember: the goal is a high quality data set

7. Test your systems

8. Iterate


Informatics & Data Management

• Needs analysis

• Process engineering

• Functional specification

• Technical specification

• System development & implementation

• Training, documentation

• Infrastructure (hardware, network)

• Data validation

• Process refinement (feedback to developers)

• Data cleaning

• Data manipulation

• Data reporting

• Data sharing, transformation

• Data integration

• Data analysis/interface

Informatics for Complex Biomedical Research

The goal: to develop a high-quality computerized system designed to give you a high-quality data set

• Efficient, accurate, secure, validated data capture

• Safe data storage and archive (make sure you can restore from backup: practice! – see the restore-check sketch after this list)

• Change control and audit (data elements, process and data)

• Code repository (especially important for longitudinal studies)

• Ability to retrieve data in such a way that all study data for an individual are attributable to that subject

• Low-cost maintainability of the system

• Speedy startup, minimal cost
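Restoring from backup deserves a routine check, not an assumption. Below is a minimal sketch in Python, assuming the nightly backup is a single SQLite file; the paths and table names are hypothetical, and a production database (Oracle, PostgreSQL, etc.) would use its own restore tooling, but the principle of restoring into a scratch copy and actually querying it is the same.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical paths and table names: adjust to your own backup layout.
BACKUP_FILE = Path("/backups/nightly/study_db.sqlite")
EXPECTED_TABLES = ["subjects", "visits", "specimens"]

def verify_backup(backup_file: Path, expected_tables: list[str]) -> None:
    """Restore the backup into a scratch copy and run basic sanity queries."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored.sqlite"
        shutil.copy(backup_file, restored)  # the "restore" step for a file-based backup
        conn = sqlite3.connect(restored)
        try:
            for table in expected_tables:
                # Raises if the table is missing or the file is corrupt.
                (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
                print(f"{table}: {count} rows restored")
        finally:
            conn.close()

if __name__ == "__main__":
    verify_backup(BACKUP_FILE, EXPECTED_TABLES)
```

Running this on a schedule, and alerting when it fails, turns "we have backups" into "we have restores."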

First (critical) step: Process Engineering

• Read and understand the protocol

• Describe the protocol diagrammatically (flow charts, use-case diagrams, etc.)

• Review the documentation and adjust (scientific, operations and tech staff) – iterate!

• Include documentation within study manuals

Data Entry: Lots of shapes and sizes

• Paper – centralized double entry, then post to database

• Paper – scan & verify, then post to database

• Paper with entry at source of data collection

• Electronic data collection, stand-alone systems deployed to data collection site; data transfer

• Web and mobile device-based electronic data collection (automatic post to database)

• Automated data extraction and capture (e.g., EHR, PHR)

Data Entry: General Principles

• Authentication and authorization

• Design with measures to make identification clear

• Good security practice (no sharing passwords, regular changing of passwords, physical & logical security)

• Audit trail

• Date/time stamp (make sure systems’ date/time is correct; account for different time zones – see the sketch below)
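A minimal sketch of the audit-trail and date/time-stamp principles above, in Python: each change is stored with who, what, and when, and timestamps are taken in UTC so entries from distributed sites in different time zones stay comparable. The field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    """One immutable record of a data change (illustrative fields)."""
    user: str            # authenticated user who made the change
    subject_id: str      # study subject the record belongs to
    field_name: str      # which data element changed
    old_value: str
    new_value: str
    # Always stamp in UTC so entries from different time zones stay comparable.
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

audit_log: list[AuditEntry] = []

def record_change(user, subject_id, field_name, old, new):
    entry = AuditEntry(user, subject_id, field_name, old, new)
    audit_log.append(entry)   # in practice: an append-only table in the database
    return entry

# Example: a coordinator corrects a mistyped weight.
record_change("coordinator_01", "SUBJ-0042", "weight_kg", "6.8", "68")
```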

Web and app-based Data Entry: Dynamic interfaces enforce the protocol

• Use bar codes, pick-lists, radio buttons, check boxes and skip-patterns

• Minimal use of default values and free text

• Information can be pre-loaded to facilitate data linking (i.e., specimens to subject)

• Consistent use of standardized semantics

• Validation at entry, submit, and posting to database (a validation sketch follows this list)

• Reduce data entry burden/ capture at source

• Use standardized header to facilitate subject identification
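A minimal sketch of the validation-at-entry bullet above, assuming form values arrive as a dictionary: coded fields are checked against a pick-list, one skip pattern is enforced, and the standardized subject identifier is required. The specific fields and rules are hypothetical examples, not the actual case report form.

```python
# Hypothetical pick-list and skip-pattern rules for one illustrative form.
SMOKING_STATUS = {"never", "former", "current"}

def validate_submission(form: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the form may post."""
    errors = []

    # Pick-list: free text is not accepted for coded fields.
    if form.get("smoking_status") not in SMOKING_STATUS:
        errors.append("smoking_status must be one of: " + ", ".join(sorted(SMOKING_STATUS)))

    # Skip pattern: packs_per_day applies only to current smokers.
    if form.get("smoking_status") == "current":
        if not form.get("packs_per_day"):
            errors.append("packs_per_day is required for current smokers")
    elif form.get("packs_per_day"):
        errors.append("packs_per_day must be blank unless smoking_status is 'current'")

    # Standardized header: subject identification must be present on every form.
    if not form.get("subject_id"):
        errors.append("subject_id is required")

    return errors

print(validate_submission({"subject_id": "SUBJ-0042", "smoking_status": "never"}))  # []
```

The same checks are repeated when the record posts to the central database, so nothing depends on the browser alone.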

Security: Physical and Logical

• Physical
– Servers locked, locked and locked
– “Good Practice”
• Topology
– Firewall
– Separate servers
– Systematic virus protection
• Authorized access to data
– Regulation of access by login (individual signature)
– Multiple levels of access privilege
– Time-out
• Disaster recovery
– Audit function within database stores all changes to data with metadata
– Nightly back-up
– Monthly archive with off-site storage
• Design
– Data: identified, de-identified, anonymous
– Encrypted transactions (data moves between interfaces and database)

Data Security: Challenging, High Priority

Create a dataset that includes the least possible identification.
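A minimal sketch of what "least possible identification" can look like in code, assuming records arrive as dictionaries: direct identifiers are dropped and the subject ID is replaced by a salted hash, so records stay linkable across forms without being identifiable. The identifier fields, and the practice of keeping the salt outside the shared dataset, are assumptions about a typical setup, not a complete HIPAA de-identification procedure.

```python
import hashlib

# Hypothetical direct identifiers to drop before sharing.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email", "mrn"}

def deidentify(record: dict, salt: str) -> dict:
    """Drop direct identifiers and replace subject_id with a salted hash."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    token = hashlib.sha256((salt + record["subject_id"]).encode()).hexdigest()[:16]
    cleaned["subject_id"] = token   # linkable across forms, not identifiable
    return cleaned

record = {"subject_id": "SUBJ-0042", "name": "Jane Doe", "weight_kg": 68, "mrn": "123456"}
print(deidentify(record, salt="keep-this-salt-outside-the-shared-dataset"))
```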

Standards & ‘Gold’ Standards

• FDA guidelines for computer systems in clinical trials: http://www.fda.gov/RegulatoryInformation/Guidances/ucm126402.htm

• HIPAA guidelines regarding security of information

• IRB requirements with regard to ability to identify individuals in the data set

• Industry semantic standards and ontologies

• Metadata standards

• Transmission standards, e.g. FHIR and HL7 (support automated extraction)
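A minimal sketch of the automated extraction these transmission standards support: FHIR exposes resources over a REST interface, so a Patient record can be read as JSON with a single request. The base URL below is a placeholder, and a real integration would need the site's own endpoint and authentication.

```python
import json
from urllib.request import Request, urlopen

# Placeholder endpoint; substitute your institution's FHIR server and credentials.
FHIR_BASE = "https://fhir.example.org/baseR4"

def fetch_patient(patient_id: str) -> dict:
    """Retrieve one Patient resource as JSON via the standard FHIR REST read."""
    req = Request(
        f"{FHIR_BASE}/Patient/{patient_id}",
        headers={"Accept": "application/fhir+json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)

# patient = fetch_patient("example")  # e.g. birthDate, gender for the study database
```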

Data Management for Complex Biomedical Research

The goal: to safely, efficiently and accurately collect, store, validate and retrieve study data

• Process validation

• Participate in system validation

• Data validation measures within central database

• Quality control checks at source of capture

• Ensuring ready access to data for analytical staff

• Data reporting to support operation of study as well as interim analysis

• Data sharing (with potential data transformation or mapping) - becoming increasingly important
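A minimal sketch of the transformation or mapping step in the data-sharing bullet above: a site's local codes are mapped to a shared vocabulary before export, and anything unmapped is flagged for review rather than guessed. The codes and field names are hypothetical.

```python
# Hypothetical map from a site's local diagnosis codes to a shared vocabulary.
LOCAL_TO_SHARED = {
    "CD": "crohns_disease",
    "UC": "ulcerative_colitis",
    "IC": "ibd_unclassified",
}

def transform_for_sharing(rows: list[dict]) -> list[dict]:
    """Map local codes to shared codes; flag anything unmapped for review."""
    shared, unmapped = [], []
    for row in rows:
        code = LOCAL_TO_SHARED.get(row["diagnosis"])
        if code is None:
            unmapped.append(row)   # never silently drop or guess a mapping
            continue
        shared.append({**row, "diagnosis": code})
    if unmapped:
        print(f"{len(unmapped)} rows need mapping review before sharing")
    return shared

print(transform_for_sharing([{"subject_id": "S1", "diagnosis": "CD"},
                             {"subject_id": "S2", "diagnosis": "XX"}]))
```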

Challenge: BIG DATA

1. Volume
• Multiple generators: humans, machines, networks, social media
2. Variety
• Structured data, e.g. databases, spreadsheets
• Unstructured data, e.g. emails, photos, videos, PDFs, pathology reports
3. Velocity
• Fast, sometimes continuous capture
• Real-time access and use desired
4. Veracity
• Biases, noise, outliers, abnormalities
• “Dirty” data
• How long is data valid, and how long should it be stored?
• Is the data being analyzed meaningful to the problem?
5. Value

Preparing to handle big data

• Invest in capturing and maintaining data in well-annotated, accessible, structured data repositories

– Based on rigorous data/information architectures

• Computer Scientists, Statisticians/Data Scientists, Domain Experts (Scientists) must systematize the analysis of massive data

– Significant efficiencies may be achieved by thinking of data analysis and data access together rather than thinking of them as serial operations.

– We need new statistical methods and algorithms optimized for this type of environment

• Develop computing infrastructures for sharing and analyzing highly distributed, heterogeneous data

– Requires coordination (international, cross-agency)

– Requires a software architecture

• Sustainability in both the data and the software infrastructures is critical

The fun part: Bioinformatics in action

• Internet cohort of 14,000 IBD patients
• Baseline and 6-month surveys on disease activity and treatment
• Modules collect information on a variety of patient-reported outcomes, diet, sleep, etc. – very flexible to incorporate new questions
• Pilot-tested biospecimen collection
• 25 abstracts and more than a dozen papers
• Two PCORI grants based on Partners

What has this research shown?

… also Kids & Teens

• Coordinator-supported IBD registry
• Information and specimens
• Specimen analysis bringing in “big data”, e.g. genotyping
• Sub-population linked to CCFA Partners

7 IBD centers of excellence; 5,000 IBD patients; developed baseline and follow-up surveys (in person and online)

Web site home page serves as portal to Registry, biospecimen network, forms, documents, SOPs, project descriptions

Data capture:

• Updated web-based data collection tools to enroll subjects and enter clinical information from patients and charts

• Baseline and follow-up survey

Online follow-up to reduce coordinator burden:

1,266 online follow-up questionnaires completed

Sharing and analyzing data:

Annotated case report form helps users understand data set contents and formats

SHARE biospecimen registry

Dynamic biospecimen registry developed using Apache Software Foundation open-source software, in collaboration with the NASA Jet Propulsion Laboratory

• Basic metadata describing SHARE biospecimens defined and implemented within this system

• Ongoing effort to connect all sites electronically, for dynamic data retrieval

• A data extract brings information into the network until the electronic connection is established

• Blood, tissue and stool data now available

• Status page continuously updated to reflect progress on development

SHARE biospecimen registry: scalable

Opportunity: Building a full data science “knowledge environment” for SHARE IBD research

• Document and search protocols, data and metadata standards, cohort descriptions, outcomes
• Reproducible analyses supported by “pipelines”, e.g. RNA-Seq, secretome, mass spec (see the sketch below)
• Persistent archive of data
• Searchable environment
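A minimal sketch of one way to support reproducible pipeline analyses: every run writes a provenance record (pipeline name and version, parameters, timestamp, input checksums) alongside its outputs, so results can be searched and repeated later. The record layout is an assumption for illustration, not the SHARE system's actual format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of an input file, so a rerun can confirm it used the same data."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_provenance(run_dir: Path, inputs: list[Path], params: dict,
                     pipeline: str, version: str) -> Path:
    """Write a provenance.json next to the pipeline outputs in run_dir."""
    record = {
        "pipeline": pipeline,          # e.g. "rnaseq-quant" (hypothetical name)
        "version": version,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
        "inputs": [{"path": str(p), "sha256": checksum(p)} for p in inputs],
    }
    out = run_dir / "provenance.json"
    out.write_text(json.dumps(record, indent=2))
    return out
```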

Data Science Knowledge Environment:

Early Detection Research Network (EDRN) is an initiative of the National Cancer Institute created to bring together dozens of research institutions to help accelerate the translation of biomarker information into clinical applications for diagnosing cancer in its earliest stages …

… supported by a virtual, national, integrated Knowledge System.

SHARE biospecimen registry

In September 2016, our systems were named one of the ten technical advances improving cancer research by TechRepublic.

Others on the list include IBM Watson, Google DeepMind and CRISPR.

Our work is the only technology that addresses enterprise architecture for managing and sharing big data for analytics and visualization.

http://www.techrepublic.com/article/10-ways-tech-is-improving-cancer-research/

Chemoprevention study: Prevention of skin cancer by antioxidants

• Subjects (9000) with clinical evidence of chronic arsenic exposure will be recruited from two ongoing cohort studies

• Two study sites in Araihazar and Matlab; coordinating center at ICDDR,B in Dhaka and the University of Chicago; Mailman School of Public Health, Columbia University; data center at Dartmouth

• Goal: to test effectiveness of vitamin E and selenium in preventing development of skin cancer

Data issues

• Design processes, forms and systems

• Collect: Family history, Risk factors, Health & outcomes, bio-specimens/related data

• Screening, recruitment, pill distribution, data collection, bio-specimen tracking, data management, data integration and analysis

• Challenges: intermittent power, low bandwidth internet access, low-tech facilities (no land-line phone, no fax, no cooling, no computers at sites), translation of information into and back from Bengali – etc.

Colon Cancer Family Registries

Chemoprevention of Arsenic Induced Skin Cancer

Visual Media Influences on Adolescent Smoking

Vit D/Calcium Polyp Prevention Study

PCORI: IBD Patient Powered Research Network

Methotrexate Response in Treatment of UC

IBD research

We Care

Improving the quality of life of patients is not just a job for us – we are committed to facilitating better disease screening, treatment, cure, and the basic science that leads to knowledge.

Great Team

Department of Biomedical Data Science

Dave Aman Judy Harjes Rob Rheaume Scott Gerlach Suzie Rovell-Rixx John Gilman Susan Gallagher

Jane Hebb Ted Bush Steve Pyle Judi Forman Laurie Johnson Maureen Colbert

Center for Gastrointestinal Biology and Disease

David Seligson Ginny Sharpless Wenli Chen Van Nguyen